What’s polluting your data lake?

A data lake is a large store of files and unstructured data collected from many untrusted sources and made available to business services, which makes it susceptible to malware pollution. As enterprises continue to produce, collect, and store more data, the potential for costly cyber risk grows.


Every time you send an email or text, you are producing data. Every business service your organization has deployed is generating data and exchanging it with third-party partners and supply chain providers. Every new merger and acquisition (M&A) results in a large volume of data being transferred between two companies. Every IoT device or subscription is generating data that’s collected and stored in data lakes. You get the point: Mass data production and collection are unavoidable. As a result, our data lakes are becoming overwhelmingly large and a ripe target for cybercriminals.

With the digital transformations of the past couple of years (that is, cloud adoptions and data migrations), cloud data storage has increased significantly. As enterprise data lakes and cloud storage environments expand, cybersecurity will become a greater challenge.

The impacts of malware pollution

The impact of malware pollution on a data lake is best understood by looking at how real-life pollution affects physical lakes.

Water is fed into lakes from groundwater, streams and various types of precipitation run-off. Similarly, a data lake collects data from a multitude of sources such as internal applications, third-party and supply chain partners, IoT devices, etc. All this data constantly flows in and out of the data lake. It can move into a data warehouse or other cloud storage environments or be extracted for further business insights or reference. The same happens with freshwater lakes, where water is extracted for irrigation and flows out into other streams.

External “pollution” that feeds into a lake (both physical and digital) can harm the existing ecosystem. When unknown malware enters a data lake, bad actors can gain access to the data stored in the lake, manipulate it or mine it to sell on the dark web. That data can include sensitive customer records, whose exposure amounts to a breach of personally identifiable information (PII), or corporate data containing credentials to other systems and applications, which lets bad actors keep moving through the network. Remember, in both physical and digital lakes, pollution can pile up over time, exacerbating the problem even further.

What cyber threats are targeting data lakes?

Commonly, an attacker infiltrates a data lake by exploiting critical vulnerabilities and misconfigurations in the applications that integrate and communicate with the data lake, or by weaponizing the data files themselves. As a recent example, a vulnerability within Azure Synapse had a direct impact on data lakes. What is alarming is that many enterprises have no idea that a misconfiguration or vulnerability exists, giving the attacker plenty of time to carry out a slew of nefarious activities. And even when a vulnerability is disclosed, that does not mean the threat no longer exists. Bad actors find crafty ways to leverage existing vulnerabilities to compromise data lakes, as recently demonstrated by the Log4Shell vulnerability. Months after the initial incident, bad actors were still being spotted exploiting the vulnerability to infiltrate enterprise data lakes and repositories.

Since data lakes collect files in their raw format, they typically host a lot of sensitive content that has yet to be monetized and used in business services. This includes email attachments, PDFs and Word documents, to name a few. It’s simple and cost effective for a bad actor to create or obtain an innocent-looking file embedded with malicious code and inject it into the data pipeline. In fact, bad actors can purchase a malicious file object for under $100 on the dark web to use for this purpose.
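To make "innocent-looking file" concrete, here is a minimal Python sketch of one narrow check: whether a ZIP-based Office document carries an embedded VBA macro project. The function names and the placeholder documents are hypothetical and built only for illustration. A macro is neither proof of malice nor the only way to hide code in a file, so treat this as an illustration of the problem rather than a defense.

```python
import io
import zipfile


def has_vba_macros(data: bytes) -> bool:
    """Return True if a ZIP-based Office file contains a VBA project part."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False  # not an Office Open XML container at all
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # Macro-enabled Office files store their code in a part named vbaProject.bin.
        return any(name.lower().endswith("vbaproject.bin")
                   for name in archive.namelist())


def make_fake_docx(with_macro: bool) -> bytes:
    """Build a tiny placeholder archive shaped like a .docx (hypothetical test data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", "<document/>")
        if with_macro:
            archive.writestr("word/vbaProject.bin", b"placeholder")
    return buf.getvalue()


print(has_vba_macros(make_fake_docx(with_macro=False)))  # False
print(has_vba_macros(make_fake_docx(with_macro=True)))   # True
```

Both placeholder files would look like ordinary documents in a file listing. Only inspecting the contents reveals the difference.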

Strengthening unknown malware removal efforts

When it comes to data lakes, the focus has been primarily on collecting as much data as possible so that the enterprise can run analytics and create new insights for business operators. But while this activity can open new opportunities for an enterprise, a single security incident can upend everything. Cyberattacks are evolving and becoming more sophisticated. They go beyond introducing new malware to demanding ransoms, holding data hostage or even causing system outages that disrupt business operations. They can also expose sensitive data and file content, with adverse consequences for commercial enterprises and government agencies.

Bad actors have evolved their tactics and techniques in recent times. The common “spray and pray” attack methodology is no longer the norm. Instead, they craft targeted attacks, using advanced obfuscation and social engineering to weaponize file content that can bypass traditional security systems. They also create completely new malware strains, so simply scanning for known threats isn’t enough. Over 450,000 new malware programs are registered every day. If you are relying on signature-based methodologies, you will miss the completely new attack types targeting your organization daily.

At this rate, it’s impossible for detection-based solutions to keep up with the nature of today’s threats. When new malware can evade detection, security teams are forced to go into reactive mode and clean up the “pollution” after it has occurred.
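A minimal Python sketch shows why matching against known signatures misses new strains. It assumes the simplest possible signature, a SHA-256 hash of the whole file. Real detection engines use richer patterns, and the sample bytes and names below are hypothetical stand-ins, not real malware.

```python
import hashlib

# Hypothetical signature database: digests of files already identified as malicious.
known_sample = b"MALICIOUS-PAYLOAD-v1"  # stand-in bytes for illustration only
KNOWN_BAD_HASHES = {hashlib.sha256(known_sample).hexdigest()}


def is_known_malware(content: bytes) -> bool:
    """Return True only when the exact bytes match a stored signature."""
    return hashlib.sha256(content).hexdigest() in KNOWN_BAD_HASHES


# A "new strain": the same payload with a single byte appended.
new_variant = known_sample + b"\x00"

print(is_known_malware(known_sample))  # True: seen before
print(is_known_malware(new_variant))   # False: trivially modified, so it slips through
```

The variant behaves the same as the original sample, yet the lookup reports it as clean because nothing in the database matches it.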

The best way to keep a data lake clean is to stop pollution from entering in the first place by putting proactive security safeguards in place. A great place to start is building a strategy and implementing technologies that protect the data lake as a whole, not just the individual applications feeding into it. It’s important for security strategies to focus on removing all threats, both known and unknown. Much like a water treatment plant ensures only safe water flows into the lake, Content Disarm and Reconstruction ensures only safe files enter the data lake.
