Lookalike domains: Artificial intelligence may come to the rescue

In network security, attackers often use lookalike domains to lure users to unwanted websites, to deliver malicious software into a victim’s network, or to send data out of it. They take advantage of the fact that it’s hard to tell these domains apart from the legitimate ones they imitate. For example, in a recent card-skimming malware attack, the domain google-analyitics.org was used to receive collected payment card data (note the extra “i” in that domain).

To craft lookalike domains, attackers have used methods such as adding extra characters, replacing a single character (for example, 0 for o, or 1 for l) or a sequence of characters (for example, rn for m, or vv for w), typosquatting (e.g. yhaoo.com) and inserting top-level domains (e.g. walmart.co.com). The list keeps growing.

Microsoft recently took control of six domains that were about to be used by a Russian hacking group to spoof U.S. political organizations. The action shed light on the serious problems that lookalike domains can pose. Not only can these malicious domains have an effect at the political level, but they may also play a huge role in cyberwarfare overall.

With that background in mind, let’s take a look at how enterprises and users can defend themselves against the threats posed by lookalike domains.

Protect yourself from attacks using lookalike domains

Defending your organization against attacks involving lookalike domains is harder than it looks, but a few mechanisms are available.

If you’re concerned about the possibility of someone mounting a lookalike domain attack against one of your domains, you can simply register or buy out those lookalikes. For example, currently yah00.com is owned by Yahoo! (Oath Inc.), and goog1e.com is owned by Google. You can also ask for law enforcement’s help in seizing or taking down a domain. Obviously, these methods are usually on the pricier side and not for everyone. Even Oath Inc. does not seem to own yah0o.com at the time of writing.

If you’re concerned about your users falling victim to a lookalike domain attack, then detecting those domains is key. Over the years, the network security industry has offered a few methods of detection.

One method starts with text variations. Earlier I mentioned lookalike pairs such as (l, 1) and (rn, m). We can compare a questionable domain with a well-known one, treating the two members of each such pair as equivalent. The challenge is that we soon run into ambiguity because of the huge number of combinations. For example, ol and d look similar, and lo and b look similar, but what about lol? Does it look like bl or ld, or both? The false positive rate can be high with this approach.
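To make the ambiguity concrete, here is a minimal Python sketch of the text-variation idea. The substitution table holds only the pairs mentioned in this article, and the function names (readings, lookalike_of) are made up for illustration; a real detector would need a much larger table and smarter matching.

```python
from functools import lru_cache
from typing import FrozenSet, Iterable, List

# Illustrative table only: lookalike sequence -> what it may be read as.
CONFUSABLES = {
    "0": "o",
    "1": "l",
    "rn": "m",
    "vv": "w",
    "ol": "d",
    "lo": "b",
}


@lru_cache(maxsize=None)
def readings(text: str) -> FrozenSet[str]:
    """Return every string that `text` could be read as.

    At each position we either keep the character as written or replace a
    lookalike sequence starting there. The result can grow exponentially
    with the length of the input.
    """
    if not text:
        return frozenset({""})
    results = set()
    # Option 1: keep the first character as written.
    for tail in readings(text[1:]):
        results.add(text[0] + tail)
    # Option 2: replace a lookalike sequence that starts at this position.
    for seq, target in CONFUSABLES.items():
        if text.startswith(seq):
            for tail in readings(text[len(seq):]):
                results.add(target + tail)
    return frozenset(results)


def lookalike_of(candidate: str, known: Iterable[str]) -> List[str]:
    """Return the known labels that `candidate` could be mistaken for."""
    possible = readings(candidate)
    return sorted(k for k in known if k in possible and k != candidate)
```

With this table, readings("lol") contains lol, bl and ld, which is exactly the ambiguity described above. lookalike_of("goog1e", ["google", "yahoo"]) returns ["google"], but lookalike_of("modern", ["modem"]) also returns ["modem"], a false positive on an ordinary English word.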

A variation of this method is to compare questionable domains with pre-generated domain permutations. An open source tool called DNSTwist is available to generate them. For example, from amazon.com, the tool might generate the following: ajazon.com, amazpn.com, amaz9n.com, amazo.com, a.mazon.com (and more). Such pre-generated domains could be stored in a software system and checked against any domain seen on the network. The downside is that you have to assume that the attacker picked a domain permutation that DNSTwist could generate.
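The pre-generation approach can be sketched in a few lines of Python. This is not DNSTwist's algorithm or API; the helper names (permutations, build_index, check) and the tiny substitution table are assumptions chosen so that the amazon.com examples above come out of it. It also assumes the brand is the first label of the domain.

```python
from typing import Dict, Iterable, Optional, Set

# Tiny illustrative table: character -> plausible replacements.
SUBSTITUTES = {"o": "0p9", "l": "1", "m": "jn"}


def permutations(domain: str) -> Set[str]:
    """Generate a few simple variants of the first label of `domain`."""
    label, _, tld = domain.partition(".")
    variants = set()
    # Omission: drop one character (amazon -> amazo).
    for i in range(len(label)):
        variants.add(label[:i] + label[i + 1:])
    # Transposition: swap two adjacent characters (yahoo -> yhaoo).
    for i in range(len(label) - 1):
        variants.add(label[:i] + label[i + 1] + label[i] + label[i + 2:])
    # Substitution: replace one character (amazon -> amaz9n).
    for i, ch in enumerate(label):
        for sub in SUBSTITUTES.get(ch, ""):
            variants.add(label[:i] + sub + label[i + 1:])
    # Dot insertion: split the label into a subdomain (amazon -> a.mazon).
    for i in range(1, len(label)):
        variants.add(label[:i] + "." + label[i:])
    variants.discard(label)
    variants.discard("")
    return {f"{v}.{tld}" for v in variants}


def build_index(protected: Iterable[str]) -> Dict[str, str]:
    """Map every pre-generated variant to the domain it imitates."""
    index: Dict[str, str] = {}
    for domain in protected:
        for variant in permutations(domain):
            index.setdefault(variant, domain)
    return index


def check(domain: str, index: Dict[str, str]) -> Optional[str]:
    """Return the imitated domain, or None if `domain` is not in the index."""
    return index.get(domain.lower())
```

An index built from amazon.com, yahoo.com and google-analytics.org flags amaz9n.com and yhaoo.com, but check("google-analyitics.org", index) returns None because this sketch never inserts an extra character. That is the downside in miniature: a variant the generator did not anticipate goes undetected.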

Hybrid solution with artificial intelligence

So far it seems like there are many limitations and challenges with the text-based detection methods listed above. Is there a better way?

Let’s take a step back and think about the root of the problem. Lookalike domains are deceptive because they are visually confusing. And they are visually confusing because people have to read a long URL in a short amount of time. Our brains often trick us by automatically re-assembling scrambled letters.

Since this is about vision, should we use images instead of text for detection? Rather than checking which letters a domain is composed of, we could generate images of it, extract the statistical characteristics of the images (called “features”) and compare them against pre-extracted features of well-known domains. The algorithms can be trained by humans to learn and adapt. One report cites an improvement of 13% to 45% in area under the ROC curve (a widely used accuracy metric for machine learning algorithms) using a Siamese neural network. This method is not without its own challenges, of course. For one, the algorithms need a huge amount of data and human supervision to train. Also, the same text looks different in different fonts, which adds another layer of complexity.

Because using AI to detect lookalike domains is still a work in progress, for now users still have to rely on the existing methods mentioned above. But just as with everything else in IT, we may soon see innovative new technology on the defense side of this cat-and-mouse game called security.