At its most basic, filtering access to web content can be achieved with blunt instruments such as DNS black-holes. In the early 2000s, this was more than enough.
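The black-hole idea can be sketched in a few lines: blocked domains resolve to a sinkhole address while everything else is handed to a real resolver. The domain names and addresses below are illustrative, not from the article.

```python
# Toy DNS black-hole: blocked domains (and their subdomains) resolve to
# a sinkhole address; everything else is answered by an upstream resolver.
BLOCKLIST = {"badstuff.example", "malware.example"}  # illustrative names
SINKHOLE = "0.0.0.0"

def resolve(domain, upstream):
    """Return the sinkhole for blocked zones, else the upstream answer."""
    parts = domain.lower().rstrip(".").split(".")
    # Check the name itself and every parent zone, so
    # ads.badstuff.example is caught by the badstuff.example entry.
    for i in range(len(parts)):
        if ".".join(parts[i:]) in BLOCKLIST:
            return SINKHOLE
    return upstream(domain)
```

The appeal, and the bluntness, is visible here: the whole domain is gone, with no way to block only part of a site.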
Over time, though, the web became more popular and useful to users and attackers alike, and web filtering providers needed to up their game: simple URL filtering allowed access to example.com/news while blocking example.com/badstuff. Again, for a time, that was enough.
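A URL filter of that era amounts to per-path rules layered on a default verdict. A minimal sketch, with an illustrative rule table mirroring the example above:

```python
# Minimal URL filter: allow by default, but consult per-path rules so
# example.com/news passes while example.com/badstuff is blocked.
from urllib.parse import urlsplit

RULES = {  # (host, path prefix) -> verdict; illustrative entries
    ("example.com", "/news"): "allow",
    ("example.com", "/badstuff"): "block",
}

def check(url, default="allow"):
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Longest matching path prefix for this host wins.
    best, best_len = default, -1
    for (h, p), action in RULES.items():
        if h == host and parts.path.startswith(p) and len(p) > best_len:
            best, best_len = action, len(p)
    return best
```

The obvious weakness, which the next paragraph picks up, is that the filter only sees the URL: route the same content through an anonymous proxy and the rules no longer apply.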
“But web filtering is an arms race, and the evolution of the anonymous proxy was in full swing. The only remaining tool in the web filter provider’s arsenal was full dynamic content analysis, which can be achieved in multiple distinct and (sometimes) opposing ways,” recounts Craig Fearnsides, Operations Technical Authority at Smoothwall, a UK-based developer of firewall and web content filtering software.
“One method (which pure firewall vendors employ) is Layer 7 signature analysis, looking for patterns on the wire and blocking packets. URL and domain pattern matching is next – allowing news.*.com but blocking ads.*.com. Finally, there are regular expression-based methods which allow for content to be scored and categorised according to a customer’s requirements. This involves associating positive/negative scores with phrases, as well as more nuanced categorisation, e.g. Essex vs sex vs sextuple.”
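The regex-based scoring Fearnsides describes can be sketched as weighted phrase patterns per category, where word boundaries supply the Essex/sex/sextuple nuance. The weights and threshold below are illustrative, not Smoothwall's actual rules.

```python
# Regex-based content scoring: phrases carry positive/negative weights
# per category, and word boundaries keep "Essex" and "sextuple" from
# triggering the whole-word "sex" pattern.
import re

WEIGHTS = {  # category -> [(compiled pattern, score)]; illustrative
    "adult": [
        (re.compile(r"\bsex\b", re.I), 5),        # whole word only
        (re.compile(r"\bessex\b", re.I), -5),     # place name, not adult
        (re.compile(r"\bsextuple\b", re.I), -5),  # maths term, not adult
    ],
}

def score(text, category, threshold=5):
    """Return (total score, whether the text crosses the block threshold)."""
    total = sum(s * len(p.findall(text)) for p, s in WEIGHTS[category])
    return total, total >= threshold
```

Because `\b` requires a word boundary, “Essex” never fires the “sex” pattern at all; the negative weights illustrate how benign phrases can additionally pull a page's score down.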
Going without web filtering is no longer an option
“Web filtering has had a somewhat dogged history and has been vastly misunderstood for many years,” Fearnsides tells me.
But, thankfully, the IT/network security community now generally understands and accepts that web access is an essential part of most employees’ working day, and that controls need to be in place to keep the corporate environment safe without preventing people from doing their jobs.
“This is where transparent web filtering (which requires no explicit proxy settings on clients) and traditional web filtering can help massively, giving already busy IT/network administrators the tools to keep a business moving, without bogging them down in low-level implementation details,” he noted.
Real-time web filtering challenges
The greatest challenge for a web filtering vendor is always going to be speed, followed closely by comprehension, he says.
“There are many shortcuts that can be used to increase throughput of dynamic filtering solutions (this is usually along the lines of limiting escalation of computational effort), but they often lead to poor categorisation or false positives. The only real way to improve speed of throughput is by optimising each of the distinct layers of categorisation over many iterations,” he pointed out.
“One method involves using machine learning to understand what content has been previously identified as one or multiple categories (based on a subset of existing basic lists). We then point the tool at a heap of requests/responses to have it seek out subtle differences that would be nearly impossible for a single person. This machine learning process produces vastly improved patterns for use by a dumb regular expression engine, increasing throughput and effectiveness at the same time.”
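That workflow can be sketched as a pipeline: start from labelled samples (the “subset of existing basic lists”), learn which tokens separate the categories, and emit plain regular expressions that a fast, “dumb” matching engine can apply at line rate. The tiny corpus and frequency-based selection rule below are illustrative, not Smoothwall's actual pipeline.

```python
# Learn per-category token patterns from labelled text, then compile
# them into ordinary regexes for a simple matching engine.
import re
from collections import Counter

def learn_patterns(samples, top_n=3):
    """samples: list of (text, label) pairs; returns {label: compiled regex}."""
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, Counter()).update(
            re.findall(r"[a-z]+", text.lower()))
    patterns = {}
    for label, counts in by_label.items():
        other = Counter()
        for l, c in by_label.items():
            if l != label:
                other.update(c)
        # Keep tokens frequent in this category but absent elsewhere.
        distinctive = [w for w, n in counts.most_common() if other[w] == 0][:top_n]
        patterns[label] = re.compile(
            r"\b(?:%s)\b" % "|".join(map(re.escape, distinctive)), re.I)
    return patterns

samples = [  # toy labelled corpus, standing in for the "basic lists"
    ("live scores and transfer news", "sport"),
    ("football scores today", "sport"),
    ("buy cheap pills online", "spam"),
    ("cheap pills fast shipping", "spam"),
]
patterns = learn_patterns(samples)
```

The point of the design is the split of roles: the expensive learning step runs offline, while the runtime engine only ever executes plain regexes, which keeps throughput high.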
He says that false positives are inevitable, but can be mitigated by allowing customers to prioritise one or multiple categorisation results over another.
For example, the solution can be made to block adult content before allowing audio/video content. This has the desired effect of permitting access to audio and video content, while still blocking adult sites that may (accurately) also be categorised as audio/video content sites.
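Prioritised verdicts of this kind reduce to an ordered rule table: a page may carry several categories, and the customer's ordering decides which rule wins. The category names and table below are illustrative.

```python
# First matching rule in the customer's priority order wins; a page
# categorised as both adult and audio/video is therefore blocked.
PRIORITY = [  # evaluated top to bottom; illustrative ordering
    ("adult", "block"),
    ("audio/video", "allow"),
]

def verdict(categories, default="allow"):
    """categories: set of labels assigned to a page by the classifier."""
    for cat, action in PRIORITY:
        if cat in categories:
            return action
    return default
```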