Google analyzes millions of pages per day when searching for phishing behavior. This kind of activity is, of course, not done by people but by computers.
The computers are programmed to look for certain signs that identify a page as a phishing site. These are the same signs that users should check when evaluating whether a page is legitimate.
According to a post on Google’s official online security blog, the first step is looking at the URL: Does it contain words like “login” or “banking”, or trademarks of the phishing target? Does it use an IP address for its hostname? Does it have a large number of host components, making the address unusually long? If the answer to all of these questions is yes, the page could be a phishing page.
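The three URL questions can be sketched in a few lines of Python. This is only an illustration of the checks described in the blog post, not Google's implementation: the function name `url_features`, the word list and the `max_host_parts` threshold are all assumptions made for the example.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative word list taken from the article's examples; a real system
# would also look for trademarks of the phishing target.
SUSPICIOUS_WORDS = ("login", "banking")


def url_features(url, max_host_parts=4):
    """Answer the three URL questions for a single address.

    max_host_parts is an assumed cut-off for "a large number of host
    components"; the article does not give a number.
    """
    host = urlsplit(url).hostname or ""

    # Does the URL use an IP address instead of a host name?
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    lowered = url.lower()
    return {
        "has_suspicious_word": any(w in lowered for w in SUSPICIOUS_WORDS),
        "host_is_ip": host_is_ip,
        # Only count dot-separated labels when the host is a real name.
        "many_host_parts": (not host_is_ip)
        and len(host.split(".")) > max_host_parts,
    }


if __name__ == "__main__":
    print(url_features("http://192.0.2.10/login.html"))
    print(url_features("http://secure.login.example.bank.accounts.example.com/"))
```

Each answer is kept as a separate true/false value rather than a single verdict, since any one of them on its own is weak evidence.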
The second step consists of analyzing the page: Does it contain a password field? Do the majority of the links point to the phishing target, so that the phishing page functions as the legitimate one would? Google’s computers also check the terms most often used on the page, and telling terms like “password” raise a red flag.
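As a rough sketch of the page checks, the following Python uses the standard library's HTML parser to look for a password field and to measure how many links point at the suspected target. The names `PageScanner` and `page_features`, and the idea of passing in the target's domain as `target_domain`, are assumptions for illustration; the article does not say how Google performs these checks.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class PageScanner(HTMLParser):
    """Collects the two page-level clues while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.has_password_field = False
        self.link_hosts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.has_password_field = True
        elif tag == "a" and attrs.get("href"):
            # Relative links have no host and are ignored here.
            host = urlsplit(attrs["href"]).hostname
            if host:
                self.link_hosts.append(host.lower())


def page_features(html, target_domain):
    """Return the page clues for one document (hypothetical helper)."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()

    hosts = scanner.link_hosts
    on_target = [
        h for h in hosts
        if h == target_domain or h.endswith("." + target_domain)
    ]
    return {
        "has_password_field": scanner.has_password_field,
        # "Majority" is read as strictly more than half of absolute links.
        "mostly_links_to_target": bool(hosts) and len(on_target) > len(hosts) / 2,
    }


if __name__ == "__main__":
    sample = (
        '<form><input type="password" name="p"></form>'
        '<a href="https://www.example-bank.com/help">Help</a>'
        '<a href="https://example-bank.com/terms">Terms</a>'
        '<a href="http://203.0.113.5/next">Next</a>'
    )
    print(page_features(sample, "example-bank.com"))
```

Counting the most frequent terms on the page, which the article also mentions, could be added in the same parser by collecting text in `handle_data`.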
The third step consists of looking up the hosting information: does the institution claim to be based in one country while the webpage is hosted on servers in another country, on a local ISP’s network? If the answer is yes, chances are high that it’s not a legitimate site.
Lastly, checking whether the page is popular and checking the spam reputation of the domain that hosts it will give you another clue: phishing pages are usually hosted on domains that have a bad reputation when it comes to sending spam.
When all these clues are combined and indicate that the site was likely set up for phishing, it is added to Google’s blacklist, which browsers use to warn users that they have landed on a malicious page.
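To make the idea of combining clues concrete, here is a minimal Python sketch that turns a set of true/false clues into one score and compares it with a threshold. Everything in it is an assumption made for illustration: the clue names, the weights, the bias and the threshold are invented, and a real classifier would learn its weights from data rather than have them written by hand.

```python
import math

# Hypothetical weights, chosen by hand for illustration only.
WEIGHTS = {
    "host_is_ip": 1.5,
    "has_suspicious_word": 1.0,
    "has_password_field": 1.0,
    "mostly_links_to_target": 1.5,
    "hosted_abroad": 1.0,
    "bad_spam_reputation": 1.5,
}
BIAS = -4.0  # assumed offset so that one clue alone is not enough


def phishing_score(clues):
    """Map a dict of boolean clues to a score between 0 and 1."""
    z = BIAS + sum(w for name, w in WEIGHTS.items() if clues.get(name))
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


def should_blacklist(clues, threshold=0.5):
    """Flag the page only when the combined evidence is strong enough."""
    return phishing_score(clues) >= threshold


if __name__ == "__main__":
    one_clue = {"host_is_ip": True}
    all_clues = {name: True for name in WEIGHTS}
    print(should_blacklist(one_clue), should_blacklist(all_clues))
```

With these made-up numbers a single clue leaves the score well below the threshold, while several clues together push it over, which mirrors the article's point that the decision rests on the combination rather than on any one sign.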
“False positives” do happen, but only about once in every 10,000 checked pages, and even then the flagged site has usually been set up for some other malicious purpose. The classifier is trained on a sample of around ten million URLs analyzed over the last three months, together with current features, and this process is run once a day.
Phishers may use a number of techniques to try to bypass this system, but they can’t escape forever. The more people visit their site, the greater the likelihood that someone will recognize it for what it is and report it to Google, so it’s just a matter of time before it gets flagged.
For more in-depth information about this whole process, check out the paper that Google’s anti-phishing experts wrote on the subject.