Leveraging AI and automation to identify sensitive data at scale

In this interview with Help Net Security, Apoorv Agarwal, CEO at Text IQ, talks about the risk of unstructured data for organizations and the opportunity to leverage AI and automation to identify sensitive data at scale.


You say that, as organizations are trying to tackle breaches and ransomware attacks, they are overlooking sensitive information hidden in their data. What could be the main reason for this?

Ideally, organizations should have a handle on where sensitive information is sitting in their data. In general, companies end up retaining the information they collect for a long time, even when they have no real use for this information. I think the problem boils down to a broader issue of data governance.

It’s impossible to have strong data governance without some level of automation. The volume of data generated by enterprises is rising exponentially, and relying on humans to take stock of all the sensitive information that’s lying buried in their databases (undetected and, more often than not, in an unstructured format) simply does not work at scale.

Data breaches and ransomware attacks will continue to happen, but organizations have a real opportunity to leverage AI, which gives them the ability to proactively identify sensitive and personal data at scale. Once the data is identified, they can choose to redact, delete, encrypt or take whatever steps are necessary to secure it so that it never falls into the wrong hands.

How does unstructured data pose a risk and what can be done about it?

For one, up to 80% of enterprise data is unstructured, and the sheer size of that attack surface makes it vulnerable to being targeted by bad actors. Second, this unstructured data is replete with all types of sensitive information: trade secrets, personal information, health information, intellectual property, etc. For instance, no one builds a structured database containing an organization’s trade secrets; they are more likely scattered across emails, chats, Excel sheets and other forms of unstructured data.

The challenge presented by unstructured data is that it is voluminous and finding the sensitive information lying within it is like looking for the proverbial needle in the haystack. Finding those risky and sensitive needles requires machine learning techniques that are scalable.

Is automation the only way to go or does the human element still hold value?

Well, I think it’s obvious that data is growing at a faster rate than the human population. There are not enough humans, not enough time in the day for the volume and complexity of the task.

I think it’s also important to note that machines are not at a point where you can just push a button and complete these tasks autonomously. They do need some help from humans. The job cannot be done by machines or humans alone.

Could you explain how AI identifies and safeguards sensitive information?

It doesn’t safeguard sensitive information—it identifies it. Once it has identified it, organizations can then take actions to safeguard it either by deleting, redacting, encrypting or changing the access controls to it.

The challenge is in the identification itself. The status quo, when it comes to identification, is based on antiquated approaches and technologies – RegEx, search terms. Besides being slow and not very scalable, these labor-intensive approaches produce results that can be riddled with inaccuracies.

A pattern-based search might, for example, flag every nine-digit number as a Social Security number. But not every 9-digit number is an SSN. AI, on the other hand, can look at the larger context of the information to more accurately determine if a piece of information is sensitive or not. As an example, consider email. When analyzing emails for sensitive information, AI has the ability to consider contexts such as who wrote it, who received it, who was copied on it and the network of relationships between the people in the email chain in determining whether a part of the email is sensitive or not.

Now, theoretically, humans could triangulate all of these contexts, but there are not enough humans in the world to pull this off. Besides, humans aren’t good at computational tasks; they are better at abstract thinking.
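
To make the nine-digit example above concrete, here is a minimal Python sketch that contrasts a pattern-only search with one that also checks nearby text. It is an illustration only, not Text IQ's method: the function names, the fixed list of context cues and the 40-character window are all assumptions, and a keyword check is a much simpler stand-in for the contextual analysis described in the interview.

```python
import re

# Pattern-only approach: any nine digits, optionally grouped 3-2-4 with hyphens.
NINE_DIGITS = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

# Hypothetical context cues (an assumption for this sketch). A real system
# would infer context from far richer signals than a fixed keyword list.
CONTEXT_CUES = ("ssn", "social security")


def find_candidates(text):
    """Return every nine-digit match, as a pattern-only search would."""
    return [m.group() for m in NINE_DIGITS.finditer(text)]


def find_with_context(text, window=40):
    """Keep only matches preceded by a context cue within `window` characters."""
    hits = []
    for m in NINE_DIGITS.finditer(text):
        before = text[max(0, m.start() - window):m.start()].lower()
        if any(cue in before for cue in CONTEXT_CUES):
            hits.append(m.group())
    return hits


sample = (
    "Order 123456789 shipped on Monday. "
    "Please update my file, my SSN is 078-05-1120."
)

print(find_candidates(sample))    # both numbers, including the order number
print(find_with_context(sample))  # only the number introduced as an SSN
```

The pattern-only search returns the order number as a false positive, while the context check keeps only the number that the surrounding text identifies as an SSN.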

The way enterprises safeguard the data can have a great impact on their overall business and reputation. How aware do you think enterprises are of this and what would you think they should improve?

They are very aware of it. No organization believes that it’s completely invulnerable to data breaches. It is very much top-of-mind at the board level.

Where they can improve is in the following: For too long, they have been relying on data loss prevention, search terms and manual review. They really need to pivot and tap into new technologies such as AI.
