Small language models step into the fight against phishing sites

Phishing sites keep multiplying, and security teams are searching for ways to triage suspicious pages quickly. A recent study explores whether small language models (SLMs) can scan raw HTML to catch these threats. The work reviews a range of model sizes and tests how they handle detection tasks while keeping compute demands in check.

SLMs for website phishing detection

Although LLM-based website phishing detection is still a relatively new area, it is gaining momentum. Several studies have already reported promising results using different models and analysis strategies.

Preparation phase that guided the study

The authors used a public dataset of roughly ten thousand websites, split between benign and phishing pages. From this set, they took a balanced sample of one thousand sites for the main benchmark. Each page was fed to the models in trimmed form, with only a small slice of the original HTML retained. This kept processing costs down and reflected the idea that sections such as long script blocks add little to a verdict.

The selected elements included tags linked to navigation, images and metadata, which often reveal patterns found in deceptive layouts. Two trimmed versions were created: one retained up to 5% of the HTML, the other up to 50%. The larger, one-thousand-site benchmark used the 5% version.
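
As a rough illustration (not the authors' exact procedure), trimming along these lines could look like the Python sketch below. The kept tag set and the 5% character budget are assumptions drawn from the description above.

# Minimal sketch of the trimming idea: keep navigation, image and metadata
# elements until a character budget (a share of the original page) is used up.
# Tag list and budget are assumptions, not the study's exact settings.
from bs4 import BeautifulSoup

KEPT_TAGS = ["title", "meta", "link", "nav", "a", "img", "form", "input"]  # assumed set

def trim_html(raw_html: str, budget_ratio: float = 0.05) -> str:
    """Return only selected elements, capped at budget_ratio of the original size."""
    soup = BeautifulSoup(raw_html, "html.parser")
    budget = int(len(raw_html) * budget_ratio)
    kept, used = [], 0
    for tag in soup.find_all(KEPT_TAGS):
        snippet = str(tag)
        if used + len(snippet) > budget:
            break
        kept.append(snippet)
        used += len(snippet)
    return "\n".join(kept)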

Each model received the same prompt template, which asked it to examine the structure of the page along with text and link patterns. The output included a score from zero to ten, a label marking the page as phishing or benign, and a short explanation. This format allowed the researchers to measure accuracy and internal consistency.
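
The article does not reproduce the exact prompt wording or schema, but a sketch of this kind of template and output check, with illustrative field names, might look like this:

# Hedged sketch of a prompt template and verdict parser matching the described
# output format (score 0-10, phishing/benign label, short explanation).
# Wording and JSON schema are assumptions for illustration only.
import json

PROMPT_TEMPLATE = """You are a security analyst. Examine the trimmed HTML below,
paying attention to page structure, visible text, and link patterns.
Respond with JSON only: {{"score": <0-10>, "label": "phishing" or "benign",
"explanation": "<one short sentence>"}}

HTML:
{html}
"""

def build_prompt(trimmed_html: str) -> str:
    return PROMPT_TEMPLATE.format(html=trimmed_html)

def parse_verdict(model_output: str):
    """Return the parsed verdict, or None when the model breaks the format."""
    try:
        verdict = json.loads(model_output)
        assert 0 <= verdict["score"] <= 10
        assert verdict["label"] in {"phishing", "benign"}
        return verdict
    except (json.JSONDecodeError, KeyError, AssertionError, TypeError):
        return None

Returning None on malformed output matters for evaluation: it lets format failures be counted separately from wrong answers, which is exactly the distinction the results below turn on.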

What the results showed

The tests gave a mixed picture. Some models handled phishing detection well, while others struggled with basics such as format consistency or stable decisions. Accuracy across the remaining models ranged from 56% to nearly 89%, with most landing at or above 80%. Small models can therefore sort websites with dependable output, but the spread in results shows that quality varies.

Performance also differed in ways that matter for daily use. One model caught almost every phishing page it flagged, reaching 98% precision on the cases where it returned a usable output, but it often failed to produce the required format. Because so many results came back incomplete, that model could not be used in practice. Other models caught fewer threats but returned answers in a consistent format, which made them more dependable overall.
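
To make the trade-off concrete, here is a small illustrative sketch of scoring a run where precision and accuracy are computed only over pages with a usable verdict. This is why a model with many format failures can still post a high precision figure; the record format is an assumption, not the study's evaluation code.

# records: list of (true_label, parsed_verdict_or_None) pairs,
# where parsed_verdict is the dict returned by a parser like parse_verdict above.
def score_run(records):
    usable = [(t, v) for t, v in records if v is not None]
    coverage = len(usable) / len(records) if records else 0.0
    tp = sum(1 for t, v in usable if v["label"] == "phishing" and t == "phishing")
    fp = sum(1 for t, v in usable if v["label"] == "phishing" and t == "benign")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = sum(1 for t, v in usable if v["label"] == t) / len(usable) if usable else 0.0
    return {"coverage": coverage, "precision": precision, "accuracy": accuracy}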

The study found that mid-sized models in the ten-to-twenty-billion-parameter range performed close to older large models, which points to progress among newer small models. Runtime also varied: larger models took several seconds per page, which can slow scanning work, while smaller models ran faster but often delivered weaker results. This pattern appeared across the tests.

Benefits of using small language models

Running models on internal systems keeps sensitive information where it belongs. URLs, HTML and user metadata stay inside the organization instead of passing through outside providers. This matters for teams that work under strict data protection rules or handle sensitive material. In phishing detection, a local setup also gives teams direct control over their data and reduces exposure to outside systems.

The models in this study were tested in their default form, but organizations with the right skills can tune them for phishing work. Internal datasets can be used to adjust model weights or to build retrieval-based systems that raise performance.
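
One way to picture the retrieval-based option, purely as a sketch and not the study's method, is to embed labeled pages from an internal dataset and pull the closest matches into the prompt as context. The embedding model and helper names below are assumptions.

# Illustrative retrieval step: embed trimmed pages from an internal labeled set
# and fetch the nearest examples for a new page. Not the authors' approach.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def build_index(labeled_pages):
    """labeled_pages: list of (trimmed_html, 'phishing' or 'benign') pairs."""
    texts = [html for html, _ in labeled_pages]
    return labeled_pages, encoder.encode(texts, convert_to_tensor=True)

def retrieve_examples(trimmed_html, index, top_k=3):
    labeled_pages, embeddings = index
    query = encoder.encode([trimmed_html], convert_to_tensor=True)
    hits = util.semantic_search(query, embeddings, top_k=top_k)[0]
    return [labeled_pages[h["corpus_id"]] for h in hits]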

The open source ecosystem offers a broad set of models that can be adapted for specific domains, and fine-tuned models for related tasks often appear on platforms such as Hugging Face. According to the authors, no fine-tuned models focused on phishing detection seem to be publicly available at this point.

Running models locally removes dependence on outside providers and avoids vendor lock-in. Operations do not hinge on the availability, pricing changes or internal decisions of a third party. A local setup is also insulated from cloud outages and outside network issues, which can improve reliability. Lower latency is another benefit and helps when phishing detection needs quick responses.
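
As an illustration of what a local setup can look like, the sketch below queries a locally hosted model through an OpenAI-compatible endpoint, as exposed by common local runtimes. The URL, model name, and payload details are assumptions rather than the study's configuration.

# Hedged example: send a prompt to a locally hosted SLM over an
# OpenAI-compatible chat completions API. No data leaves the local network.
import requests

LOCAL_ENDPOINT = "http://localhost:11434/v1/chat/completions"  # assumed local server

def classify_locally(prompt: str, model: str = "some-local-slm") -> str:
    response = requests.post(
        LOCAL_ENDPOINT,
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,  # deterministic output helps format consistency
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

Setting the temperature to zero is one small design choice that supports the steady, parseable output the study found so important.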

Challenges tied to small language models in practice

Small models still fall short of larger proprietary systems, and the gap shows up across every measure. The tested models performed well for their scale, and some smaller options did fine on narrow tasks, but none reached the level seen in related work with larger proprietary systems, and that gap would likely hold under similar testing. It also carries risk: lower performance leads to more false alarms or missed threats, which can disrupt phishing detection and give attackers room to move.
