LLM security advice looks solid until you check the hard cases

Plenty of people now type their security worries straight into a chatbot. A hacked account, a suspicious email, a stalker who might be tracking a phone, all of it lands in the same window someone would use to ask about dinner. A benchmark called HelpBench tests how well chatbots handle those moments, and the results give security professionals something to watch in what their users are being told.


Researchers at University College London and Google built HelpBench from 450 questions drawn from Reddit posts where people asked for help with digital privacy, safety, and security. They rewrote each question with the help of another AI model and a manual review, stripping out identifying details and making the wording sound like something a person would type to a chatbot. They then ran 18 leading models against the questions and scored the answers with rubrics, question-by-question checklists of what a good answer should include and avoid, written by experts who each have more than ten years in the field.
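To make the rubric idea concrete, here is a minimal sketch of checklist-style scoring. It is an illustration only, not the researchers' grader: the `Criterion` class, the `rubric_score` function, the equal weighting of items, and the sample rubric are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One checklist item (hypothetical structure, not taken from the paper)."""
    description: str
    should_appear: bool  # True: a good answer includes it. False: a good answer avoids it.


def rubric_score(criteria, observed):
    """Return the fraction of checklist items an answer satisfies.

    observed[i] is True when item i shows up in the answer. Equal weighting
    is an assumption of this sketch.
    """
    if not criteria or len(criteria) != len(observed):
        raise ValueError("need one observation per criterion")
    met = sum(1 for c, seen in zip(criteria, observed) if seen == c.should_appear)
    return met / len(criteria)


# Invented rubric for a spyware-removal question.
spyware_rubric = [
    Criterion("Explains how to check for and remove spyware", True),
    Criterion("Warns that sudden removal can tip off an abuser", True),
    Criterion("Points to a support service for abuse victims", True),
    Criterion("Uses alarming language", False),
]

# An answer that covers the technical steps calmly but skips the safety warnings.
observed = [True, False, False, False]
score = rubric_score(spyware_rubric, observed)
print(f"{score:.0%}")
```

In this made-up case the answer meets two of four items: it gives the technical steps and stays calm, but it misses both safety points.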

The headline number looks healthy. Across all the models, advice scored about 82 percent against the rubrics. The rubrics rewarded answers for including the right facts and for a calm, focused delivery. Scams and account compromise were the topics the models handled best. Questions about reporting and removing content gave them the most trouble.

The weak spots

The averages hide a long tail of weak answers. About one in ten responses scored below 65 percent, and some of those failures carried real danger. The worst showed up in situations involving abuse and stalking.
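A small arithmetic sketch shows how a healthy average and a weak tail can coexist. The ten scores below are invented for illustration and are not data from the benchmark; `share_below` is a hypothetical helper.

```python
from statistics import mean


def share_below(scores, threshold):
    """Fraction of scores that fall strictly below the threshold."""
    return sum(1 for s in scores if s < threshold) / len(scores)


# Invented per-answer rubric scores: nine solid answers and one failure.
scores = [0.90, 0.88, 0.92, 0.85, 0.86, 0.89, 0.91, 0.87, 0.84, 0.40]

average = mean(scores)
tail = share_below(scores, 0.65)
print(f"average: {average:.1%}, share below 65%: {tail:.0%}")
```

With these numbers the average sits a little above 83 percent while one answer in ten falls under the 65 percent line, which is why the article looks past the headline figure.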

Asked how to remove spyware from a phone, most models walked through the technical steps and left out the risk that a sudden removal can tip off an abuser and set off physical violence. One model, asked how to hide a location from an abuser, suggested wearing a disguise to a store because “abusers rarely go to stores midday.” Another treated a stalking victim’s worry about location tracking as simple intimidation, a reading that could leave someone exposed to harm.

Other failures were quieter. One question described a method for securely deleting sensitive files using an encrypted vault. The models praised the setup and missed the part where moving a file into the vault leaves the original sitting on the disk, recoverable by anyone with the right tool. The user walks away feeling safe when the data is still there.

Some answers pushed solutions most people could not carry out. A suspected malware infection drew a recommendation to throw the device away. A request for tighter privacy on payment apps brought a suggestion to open an account at a different bank. Models sometimes told calm users to become “hyper-vigilant” or described a routine situation as horrifying and traumatic, language that can raise a person’s anxiety for no reason.

Newer models, similar scores

The benchmark complicates the assumption that the next model release will fix these problems. Within most model families, scores barely moved from one version to the next, and some versions did worse than the ones that came before.

A newer Grok release lost ground on factual accuracy and repeated more of the misconceptions the experts flagged. One model, Qwen, posted a large gain that made it the standout. GPT 5.0 came out on top overall at 87 percent. The general picture is steady performance, with improvement arriving slowly and unevenly.

Reasons for caution

A few points deserve weight before anyone treats the scores as final. The paper is new and awaits peer review. The questions come from Reddit and were rewritten with AI help, so they may differ from how people phrase things on their own. The automated grader matched human experts closely on factual points and less closely on tone, where judgment gets fuzzy.

The rubrics also carry the researchers’ own standards for good advice. Models lost points for staying neutral when a user asked about getting around a platform ban, with one answer telling the user that a terms-of-service violation was “between you and the app.” Models also lost points for raising the emotional temperature of a reply. Those are defensible standards, but they reflect a particular view of what safety advice should do.
