Big Data analytics to the rescue

In the battle against cyber criminals, the good guys have suffered some heavy losses. We’ve all heard the horror stories about major retailers losing tens of millions of credit records or criminal organizations accumulating billions of passwords. As consumers, we can look at a handful of friends at a cocktail party and assume that most, if not all, of them have already been affected.

So how can an IT security organization ensure it is not the next target (excuse the pun)?

It turns out there are common characteristics of successful attacks. However, the evidence of intrusion is often hidden in the noise of IDS/IPS alerts; security teams have no visibility into the telltale signs of much of the discovery and capture activity; and exfiltration is cleverly designed to operate below alert thresholds, its traces hidden in huge volumes of data.

These attacks are successful because the security paradigm is based on identifying “known bad” activities, and the alert noise generated by that approach necessarily limits the amount of data that can be analyzed.

So how can Big Data analytics help? Think about the amount of operations data generated by a retailer’s IT environment. Each device generates operating data at the OS, network, and application layers. There are tens of thousands of PoS devices, network devices, back-end servers, middleware… the list goes on and on. Even a modest-sized operation generates gigabytes of data daily, and large enterprises generate well into the terabytes of operations data. Hidden in this data are the fingerprints of intrusion, discovery, capture and exfiltration activities, and many of those activities are going to be anomalous.

It turns out that finding anomalies in huge volumes of data is exactly what Big Data analytics approaches, such as unsupervised machine learning, are good at.

Finding the important amid the noise
It would be easy to assume that the IT security teams of the enterprises that have been breached were just ineffective or lazy. But that flies in the face of reason. Even a modest-sized organization can experience tens of thousands of alerts a day from its perimeter defenses. We would have to assume that number to be well into the hundreds of thousands at a large retail organization. Ten thousand of those are likely to be high-severity alerts. In fact, the vast majority of security architects and CISOs will tell you they simply can’t process the alert noise generated by their intrusion tools.

Cyber criminals are well aware of the challenge IT security teams face. They can be fairly confident that alerts generated by their attempts to penetrate a target of worth will go unnoticed in the massive volume of simultaneous notifications. Some security architects, however, have taken the clever step of running advanced analytics on the alerts themselves. It can be a relatively simple exercise to monitor IDS alerts in real time to uncover an unusual concentration of attacks on a specific target, from a specific source or of a specific type. The security team at a major digital marketing firm, a prime target for criminals because it houses hundreds of thousands of valid email addresses for its clients, did just that. Real-time analysis of the hundreds of thousands of alerts generated in a typical week resulted in 5-10 accurate notifications of activities that required special focus.
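To make the idea concrete, here is a minimal Python sketch of looking for a concentration of alerts on one target, source or signature type. It is an illustration only, not the method used by the firm mentioned above: the alert field names ("src", "dst", "sig"), the function name and both thresholds are assumptions chosen for the example, and a production system would more likely compare each window against a learned baseline than against fixed cut-offs.

```python
from collections import Counter

def concentrated_keys(alerts, field, min_count=20, min_share=0.25):
    """Return the values of `field` that dominate a window of alerts.

    alerts: iterable of dicts such as {"src": ..., "dst": ..., "sig": ...}
            (hypothetical field names, not a real IDS schema).
    A value is reported when it appears at least `min_count` times AND
    accounts for at least `min_share` of all alerts in the window.
    """
    counts = Counter(alert[field] for alert in alerts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        value: n
        for value, n in counts.items()
        if n >= min_count and n / total >= min_share
    }
```

Running the same function over the window once per field ("dst", "src", "sig") covers the three kinds of concentration described above.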

Finding the suspicious activities inside the perimeter
The “known bad” approach actually limits our security in three ways. First, it requires significant human effort to implement, manage and maintain the threat signatures and rules that trigger alerts because they’re constantly evolving. Second, it invariably generates a very high volume of alert “noise.” Third, the amount of manual effort and resultant noise weigh against analyzing other valuable sources of data.

Nowhere is this third impact more noticeable than in the inability of security teams to identify suspicious activities inside their perimeter.

Once an attacker has breached perimeter defenses, they set out to find vulnerable host systems and data stores. Almost invariably, this results in activities that are abnormal. To give a few examples: a new process will appear on a server and connect to the network; systems that usually receive network traffic will start sending; or authorized access users will generate an unusual level of failed passwords or start to access the network from a new device or at an odd time of day.

There are two impediments to successfully finding the “fingerprints” of these activities. The first is the “known bad” approach. Let’s take the simple example of scanning internal systems for unusual software processes that are connecting to the network. This is a particularly useful approach to finding compromised internal systems. The known bad approach would be to identify the specific software processes you are concerned about, such as FTP (File Transfer Protocol). Hackers will expect you to look for that, so instead they might use the little-known PUT capability of HTTP (the protocol that carries web traffic). FTP and HTTP will be normal processes on some servers, so in order not to generate false alarms, your security architect would have to know which servers these alert rules apply to. When you are talking about hundreds, thousands or tens of thousands of devices, this is simply impractical.

Machine learning algorithms, on the other hand, can easily “learn” the normal activities of hundreds of thousands of servers and tell you immediately when one of them connects to the network with a software process that is unusual for that specific device. They can do so on commodity hardware, with very little setup and none of the maintenance that rules require.
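As a rough illustration of what a per-device baseline means, the Python sketch below counts which processes each host has been seen using to connect to the network and flags a process that is rare or absent for that particular host. It is a deliberately simple frequency model standing in for a real unsupervised learning algorithm; the class name, the host and process names and the 1% rarity threshold are all assumptions made for the example.

```python
from collections import Counter, defaultdict

class HostProcessBaseline:
    """Per-host frequency baseline of processes seen connecting to the network."""

    def __init__(self, min_share=0.01):
        # A process is "unusual" for a host if it makes up less than
        # `min_share` of that host's observed connections (assumed threshold).
        self.min_share = min_share
        self._counts = defaultdict(Counter)

    def observe(self, host, process):
        """Record one network connection made by `process` on `host`."""
        self._counts[host][process] += 1

    def is_unusual(self, host, process):
        """True if `process` is rare or unseen for this specific host."""
        counts = self._counts.get(host)
        if not counts:
            return True  # no history at all for this host
        total = sum(counts.values())
        return counts[process] / total < self.min_share
```

Because the baseline is kept per host, an FTP daemon on a file server is treated as normal while the same process on a web server is flagged, and nobody has to write or maintain a rule saying which servers are which.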

Similarly, audit and access logs can be analyzed, again without rules, to immediately identify suspicious access attempts.
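The same idea can be sketched for access logs. The hypothetical Python function below scores today's failed-login count for one user against that user's own history, so no global rule such as "more than N failures" is needed. The function name, the daily granularity and the floor on the standard deviation are assumptions for illustration; real products use considerably richer models.

```python
from statistics import mean, pstdev

def failed_login_score(history, today, min_sigma=1.0):
    """Score today's failed logins against one user's own past behavior.

    history: non-empty list of past daily failed-login counts for the user.
    today:   today's failed-login count.
    Returns how many standard deviations `today` sits above the user's mean.
    `min_sigma` is a floor that avoids dividing by zero and overreacting
    for users whose history barely varies.
    """
    mu = mean(history)
    sigma = max(pstdev(history), min_sigma)
    return (today - mu) / sigma
```

A user who normally mistypes a password once a day and suddenly fails fifteen times gets a high score, while their usual day scores close to zero.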

Finding the earliest signs of data theft
The fingerprints of data exfiltration are hidden in massive sets of machine data being generated by web proxies and network flow. However, getting usable and actionable information from these data sets has significant challenges.

When the data in question comes from sources such as web proxy servers, a significant challenge is that almost all of the data within these massive data sets relates to non-malicious, standard business activity. Differentiating malicious activity from non-malicious activity is extremely difficult, as there may be only a handful of malicious activities each day hidden among the billions of interactions that take place every minute. Generating alerts on non-malicious activity only adds to the cover you are giving to advanced criminal attackers.

Traditional methods of extracting usable information from this data involve searching for known signatures of an attack. Unfortunately, advanced hackers and criminal enterprises know enough to modify the threat signature to avoid detection. In the end, however, the attack is going to generate outlier behaviors, so a complementary approach to signature and rule-based intrusion detection is analyzing internal and outgoing traffic for statistically unusual behavior.

However, the level of statistical analysis required far exceeds the capabilities of even the more advanced security architects or analysts. For instance, there are generally statistically unusual interactions happening all the time in a typical organization. Trying to scan for unusual websites visited by employees of a large enterprise can generate thousands of false alerts per day.

As organizations scale in size, more advanced analyses of interactions across multiple dimensions are required. As an example, the fact that an employee visits a new website only becomes a valid concern if the interaction also involves an unusual protocol for that user and if that user, who is usually a consumer of data, is now sending substantial volumes of data in small bursts.
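The example in the previous paragraph can be sketched in Python as three separate signals that only raise concern when they all fire together. This is a boolean simplification under stated assumptions: the UserProfile and Interaction types, their fields and the upload factor are hypothetical, and a real system would combine the dimensions probabilistically rather than with a strict "all of them" test.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """What is considered normal for one user (hypothetical fields)."""
    known_sites: set = field(default_factory=set)
    known_protocols: set = field(default_factory=set)
    typical_bytes_out: float = 0.0  # e.g. median bytes sent per interaction

@dataclass
class Interaction:
    site: str
    protocol: str
    bytes_out: int

def anomaly_signals(profile, event, upload_factor=10.0):
    """Evaluate each dimension independently for one interaction."""
    baseline = max(profile.typical_bytes_out, 1.0)  # avoid a zero baseline
    return {
        "new_site": event.site not in profile.known_sites,
        "unusual_protocol": event.protocol not in profile.known_protocols,
        "heavy_upload": event.bytes_out > upload_factor * baseline,
    }

def is_concerning(profile, event):
    """Flag the interaction only when every dimension is anomalous at once."""
    return all(anomaly_signals(profile, event).values())
```

Requiring all three signals is what keeps a visit to a new website, on its own, from turning into one of thousands of false alerts.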

Statistically modeling data for unusual patterns across multiple dimensions, and doing it accurately, is a complex task even for small data sets, let alone massive ones. Appropriate modeling techniques and computationally stable, scalable implementations are beyond the scope of simple tools and analyses. Finally, the analysis needs to be executed in real time, which places additional constraints on the system because it has to stay online while the analysis runs.

Staying ahead of the bad guys
Statistical techniques are the only approach that can identify unknown attacks, and even when applied properly they still require a certain amount of human intervention. Security teams can definitely react a lot faster if they are immediately aware of previously unknown threats, so staying ahead of the bad guys really comes down to two things: the speed of a real-time analysis solution and the reaction time of the security team. In the end, this requires that both the right technology and the right organizational processes are in place.

As more and more data and data sets become available, the challenge of gaining actionable insight becomes more and more complex. For example, in a smaller office with a couple of hundred employees, identifying a user exfiltrating data to an unusual website can be achieved by simple reporting. However, the same report within a large enterprise that employs thousands or tens of thousands of people may contain 500 unusual events per hour, which is too many to triage and analyze effectively.

As the volume of data increases, effective, accurate and scalable statistical analyses become more and more important, because simple reports and rules generate too much information to triage and act on. Since humans are unable to effectively process this volume of information, the only way we’ll be able to do it is by relying on machine learning.

While humans become less effective as data sets get bigger, machines actually become more effective, as they have more data to analyze and learn what normal behavior looks like. As a result, they’ll become even better at flagging the anomalies. There’s no doubt that machine learning will become a much larger part of an effective security strategy as the amount of data increases and becomes even more valuable to an organization.

The importance of security analytics is directly proportional to how much a breach will cost an organization, and in the current environment, these analytics are becoming essential. Amid the perpetual race between hackers looking to break through a perimeter and security professionals moving to patch the newfound vulnerabilities, with the cycle then beginning all over again, security analytics have become invaluable.