Using deep learning and natural language understanding to protect enterprise communication
In this Help Net Security podcast recorded at Black Hat USA 2019, Dhananjay Sampath, CEO at Armorblox, talks about how they use natural language understanding and deep learning to automatically create and adapt policies, continuously measure risk exposure, and prevent attacks and data loss.
Here’s a transcript of the podcast for your convenience.
Hi, I’m DJ Sampath and I’m the co-founder and CEO of Armorblox. I’m here today, I’m excited to talk about what we are building. Thank you so much for having me as part of this podcast and I’d love to tell you a little bit more about natural language understanding and how it can meaningfully move the needle for cybersecurity.
Essentially, one of the things that happened when we got started about 20 months ago was that we had a lot of conversations with CIOs and CISOs, more than 300 of them. Over those conversations we learned that people talk about a whole lot of different problems: “I’ve got a phishing problem, I’ve got an insider threat problem, I’ve got a business email compromise happening, I’ve got a data loss situation”.
When we started digging into that, we put this whole feedback that we got up on a board and we said “hey, let’s dig into this and let’s understand what’s truly causing a slew of different types of problems”. Our big moment of epiphany was when we learned that a lot of times even the state-of-the-art security solutions look at metadata, but when it comes to actually looking at the content itself, they don’t do a very good job. They don’t actually try and understand the content or the context of the data.
They end up turning it into a signature of some sort, like an md5sum or a SHA-256, and signatures are fundamentally brittle in nature. On top of that, if you think about the DLP solutions, they look for keywords. Even from a compliance perspective, people are just looking for “is this PCI, PII, does it have this keyword here?” Keyword-based searching is also really brittle. If you’re trying to detect how a spammer or scammer sends out a Nigerian prince attack, and your keywords are just looking for “Nigerian prince”, then if it turns into a Nigerian king or a Nigerian princess, keyword-based matches are not going to be very successful.
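To make the brittleness point concrete, here is a minimal Python sketch. The two sample messages, the blocklist, and the helper names are made up for illustration and are not taken from any real product. It shows that changing a single word defeats both an exact hash signature and a phrase-based keyword match.

```python
import hashlib


def sha256_signature(text: str) -> str:
    """Return the SHA-256 hex digest of the text, used here as an exact-match signature."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def keyword_match(text: str, keywords: list[str]) -> bool:
    """Return True if any blocklisted phrase appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


# Hypothetical sample messages: the variant differs by a single word.
original = "Greetings, I am a Nigerian prince and I need your help moving funds."
variant = "Greetings, I am a Nigerian king and I need your help moving funds."

known_bad_signatures = {sha256_signature(original)}
blocklist = ["nigerian prince"]

# Both techniques catch the message they were built from...
signature_hit_original = sha256_signature(original) in known_bad_signatures
keyword_hit_original = keyword_match(original, blocklist)

# ...and both miss the reworded variant.
signature_hit_variant = sha256_signature(variant) in known_bad_signatures
keyword_hit_variant = keyword_match(variant, blocklist)
```

A naive substring match would still happen to catch “Nigerian princess”, because that phrase contains “Nigerian prince”, but that is an accident of spelling and not an understanding of the content.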
That got us to think about “hey, what can we do with the content that’s meaningful?” And so, we did a lot of work and research and we discovered that natural language understanding was just coming of age in some sense. To describe a little bit more about it, if you’ve used Google Home, or Alexa, or Siri in the recent past, you know you’re using natural language understanding. When you say “Hey Siri, what’s the weather today?”, it basically uses NLP to translate it into something meaningful, hits a server, gets the data, and puts it back in a form that’s meaningful for you to be able to process.
We said “wait, what if we could take natural language understanding and bring it to cybersecurity where we can now start understanding the context of communications, context of documents that are being constructed?” and here’s where we got a little bit lucky too. One of the big trends that was transforming in the AI world was this inflection point with respect to comprehending textual data that happened about 18 months ago.
To give you more context, I’ll take you back about five to seven years, to when image recognition became commoditized. For instance, if you log into Facebook, it automatically recognizes who you are and it allows you to tag your pictures, or if a Tesla is driving around, it can easily detect the difference between a leaf blowing across the road and somebody walking across. Your iPhone these days uses face recognition to unlock itself. That was a really interesting moment, when neural networks got really smart about recognizing images, and it’s because there are tons and tons of labeled data available that they can make a distinction between all of these objects.
A similar moment of renaissance happened for textual data, where there are huge corpora of data that you can now leverage. You can use things like Wikipedia, which is the largest human annotated database that’s out there, along with things like Twitter and Stack Overflow and so on, to understand the context of a communication. Being able to break down sentences meaningfully was truly exciting, and we said “hey wait, we can now take this and evolve these techniques using deep learning, and make natural language understanding a fabric of any cybersecurity stack that you have”.
One of the things that we sort of observed when we started playing with natural language understanding was, we now have the ability to recognize different categories of documents. We were able to suddenly start applying that to say “hey, one of the documents that you’re seeing over here is a medical record versus another document that you’re seeing over there as possibly just a flyer that contains something about an illness or about medications that you could take”.
The challenge over there for a DLP solution or data loss protection or prevention solution, or any compliance solutions to be effective, is to understand the provenance. To be able to not just look ahead and look behind but recognize that this is a patient record. Being able to step back out and look at all of the entities, what is this document really talking about, and getting to that level of machine-driven comprehension is really what we’re trying to solve at Armorblox.
We built this core platform that connects into your existing infrastructure and uses APIs to do that for the most part. The whole onboarding process of Armorblox is relatively straightforward, but we took that extra effort to make the onboarding process incredibly comfortable and easy for customers.
The other big realization that we had once we started doing this was, more than 70 percent of data inside of the enterprise is largely textual data. Whether it’s in documents, or emails, or your Box files, Dropbox files, Salesforce records, they’re all textual data. Even if you slam an API, yes you get a response back, that’s the structure, but it’s textual information that describes the property or characteristic of something that you query.
We sort of figured out that we wanted to start at some place which has the largest amount of labeled data, and email is one of the first pieces that we picked up on. We said “wait, we could solve some interesting problems in emails because they are labeled data, there is high intent when somebody communicates with somebody else, and there are very specific goals that they’re trying to achieve when they do that”.
There’s a lot of structure in emails still, there’s “Hey, how’s it going?”, and then the body, and then the sign off, and phone numbers, signatures and so forth. We said “OK, let’s start with the emails and let’s see what sort of problems exist with emails that we can effectively solve”. And that’s when in the conversation that we had with all the customers, we dug back in and said “OK, business email compromise was one of the new emerging types of attacks that was truly causing a lot of disruption”.
The FBI at this point estimates more than $15 billion lost in the past 36 months. They’re expecting a growth of roughly 138 percent, which means that for every single minute that we have been talking here, North American enterprises have lost four thousand dollars. It’s just incredible, the scale of some of these attacks, it’s just nuts. We said “OK, can we meaningfully solve the problem?”, because when we looked at the scams and the attacks, they did not contain a malicious link, they didn’t contain a malicious attachment. They were purely socially engineered.
It starts off by saying: “Hey Mirko, are you at your desk?” And then you respond back by saying “who is this?”. And when you respond back, you are now giving them your DKIM headers, your DMARC headers. There’s so much recon information available in that response.
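As a rough illustration of how much a one-line reply can reveal, the following Python sketch parses a reply with the standard library `email` module and pulls out a few headers. The message itself, including the addresses, the selector, and the mailer name, is entirely invented for this example, and `recon_summary` is a hypothetical helper, not part of any product.

```python
from email import message_from_string

# Invented reply for illustration only; real headers vary by mail provider.
RAW_REPLY = """\
From: Mirko <mirko@example.com>
To: stranger@example.net
Subject: Re: Are you at your desk?
Message-ID: <abc123@mail.example.com>
Received: from mail.example.com (mail.example.com [192.0.2.10])
DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=selector1; bh=...; b=...
X-Mailer: ExampleMail 1.0

Who is this?
"""


def recon_summary(raw_message: str) -> dict:
    """Collect header details that a recipient of this reply could read."""
    msg = message_from_string(raw_message)
    dkim_header = msg.get("DKIM-Signature", "")
    # DKIM-Signature is a list of tag=value pairs separated by semicolons.
    dkim_tags = dict(
        part.strip().split("=", 1)
        for part in dkim_header.split(";")
        if "=" in part
    )
    return {
        "from": msg["From"],
        "dkim_domain": dkim_tags.get("d"),
        "dkim_selector": dkim_tags.get("s"),
        "mailer": msg.get("X-Mailer"),
        "relay_hops": msg.get_all("Received", []),
    }


summary = recon_summary(RAW_REPLY)
```

Even in this toy message, the reply exposes the signing domain, the mail relay, and the client software, which is the kind of detail an attacker can reuse to make a follow-up message look legitimate.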
What ends up happening is they take all that information and they no longer want to communicate with you. They’ve gotten what they wanted and they’re going to reach out to your assistant, saying: “Hey, here’s an email that Mirko sent me and he said that he’s going to pay this invoice at this point in time. Can you please work with him to make this payment as soon as possible?” And the bottom of that email is going to contain all of the additional context of a supposed communication that has happened between you and that vendor. In larger organizations, when they see a specific approval pattern and specific workflows for processing these invoices, people are just going to quickly move forward and take it to completion.
This is exactly the reason why Facebook and Google lost over 120 million dollars from a single scammer who, by the way, did get caught, but for the longest time was able to commit fraud for almost three years without getting caught. And look at the dollar value. The payoff was so much higher. He didn’t have to come up with doing complex lateral movement or getting into their infrastructure. It’s just simply knocking on the door and sending out an email and walking away with money. It was as simple as that.
That’s one of the primary use cases we started building towards. And we said “hey, let’s solve this problem first”. And that’s what our customers today are using Armorblox for. We are rapidly broadening our value proposition. We’re going beyond emails and have just started to support documents. We have already completed integrations with Box and Dropbox and we’re going to start announcing them. Today we’re completely integrated into the Office 365 ecosystem and the G Suite ecosystem, and their respective emails. And we see ourselves being able to plug into the rest of the offerings that both these platforms have as well, which means connecting into SharePoint and OneDrive, and then, in the Google ecosystem, connecting with Google Drive, the calendars and so on.
We’re rapidly expanding that value proposition, being able to collect intelligence and to detect when something looks suspicious, when a specific workflow looks anomalous. This is how we’re trying to redefine what DLP even means, because classically DLP has always been thought of as “hey, here’s my PII, here’s my PCI, here’s my protected keyword”, and solutions by and large look for keywords, until we said “keywords are not working”. They create a lot of false positives, so people never turn those appliances on, but from a compliance checkbox perspective, they’ll go ahead, purchase something and put it inside the infrastructure.
The way to change that status quo is to be able to say “hey, listen, keywords are not important, what’s important are topics”. Are you talking about a specific type of topic? And then you annotate those topics on specific communication patterns, communication links if you will. So, if the two of us have never talked about an invoice for a specific product that you have never purchased, there’s no other email from my domain that talks about it, and there’s no automatic email confirmation that you got from me, then if I reach out directly to your assistant, the solution should automatically recognize that this is a violation of a workflow and raise a corresponding alert.
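The workflow check described above can be sketched in a few lines of Python. This is only an illustrative toy, not how Armorblox implements it: the `Message` type, the topic labels, and the rule itself are assumptions, and the topic is presumed to come from some upstream classifier that is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender_domain: str
    recipient: str
    topic: str  # assumed to be assigned by an upstream topic classifier


def build_history(messages: list[Message]) -> set[tuple[str, str]]:
    """Record which topics each sender domain has previously discussed."""
    return {(m.sender_domain, m.topic) for m in messages}


def violates_workflow(message: Message, history: set[tuple[str, str]]) -> bool:
    """Flag a message whose sender domain has never raised this topic before."""
    return (message.sender_domain, message.topic) not in history


# Hypothetical prior traffic: the vendor has only ever discussed scheduling.
past = [
    Message("vendor.example", "mirko@corp.example", "meeting"),
    Message("partner.example", "mirko@corp.example", "invoice"),
]
history = build_history(past)

# An invoice request sent straight to the assistant, with no prior invoice thread.
suspicious = Message("vendor.example", "assistant@corp.example", "invoice")
routine = Message("partner.example", "assistant@corp.example", "invoice")

alerts = [m for m in (suspicious, routine) if violates_workflow(m, history)]
```

Keying the history on topic instead of on literal keywords is what lets the check survive rewording, as long as the upstream classifier assigns the same topic.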
That’s sort of how we see DLP evolving, into not being just about data loss but about “hey, what is the context over here?” Is the context being leaked now in some format? And if it is, then let’s go ahead and protect against that. To that end we’re rapidly expanding the core value proposition. We are also working on a mobile app that would allow end users to participate in the security function.
Today, if you think about it, it’s classically the Chief Information Security Officer and the security team that are responsible for incident response. A big pain point that we discovered in our conversations, and during our initial deployments as well, was the sheer number of alerts showing up on their single pane of glass, to the extent that they coined the term alert fatigue to describe it: there are so many alerts that just managing them became a real problem.
Then when we dug a little deeper into that, we recognized that the context of a conversation is lost on people that are managing your SOC halfway across the globe. Say, for instance, we’re talking about an invoice. The person that knows whether that invoice was a real invoice or not is most likely you, in this particular example that we’ve been using. Having somebody that’s halfway across the globe determine if this was a scam or not is that much harder. And they have to reach back out to you, so the time to remediate is non-trivial.
One of the ways we are exploring to reduce the amount of time it takes for an incident response to happen is to involve end users, the people working inside the organization who have the most context, and have them participate in it. If you were to receive an email, or if somebody else inside of your organization were to receive an email pretending to come from you, you would get an alert that said “hey, did you really send this email out?”
Or you just received an email about an invoice: “Does this look right to you?” And you have the opportunity to say yes or no. Because we realize users don’t want to have complex interactions on their phones. They just want to be able to click a simple button. So give them two buttons, and again, you can’t send those alerts at a very high frequency. You’ve got to send them no more than two to four alerts a month, because beyond that end users find it really hard. Even when you think about MFA, multi-factor authentication, you generally prompt the user to confirm and then you try to remember that for maybe a week or a couple of weeks, until you have to send out the challenge-response mechanism again.
We learned that if you keep the alerting rate reasonably low, you can truly effect a meaningful change in terms of alert fatigue for a lot of the security teams.
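The cap of two to four alerts a month can be pictured as a simple per-user sliding-window limiter. The Python sketch below is a hypothetical illustration: the `AlertThrottle` class, the 30-day window, and the default of four alerts are assumptions chosen to match the numbers mentioned above, not a description of the actual mobile app.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class AlertThrottle:
    """Allow at most `max_alerts` prompts per user within a sliding time window."""

    def __init__(self, max_alerts: int = 4, window: timedelta = timedelta(days=30)):
        self.max_alerts = max_alerts
        self.window = window
        self._sent: dict[str, deque] = defaultdict(deque)

    def try_send(self, user: str, now: datetime) -> bool:
        """Return True and record the alert if the user is under the cap."""
        sent = self._sent[user]
        # Forget alerts that have aged out of the window.
        while sent and now - sent[0] >= self.window:
            sent.popleft()
        if len(sent) >= self.max_alerts:
            return False
        sent.append(now)
        return True


throttle = AlertThrottle()
start = datetime(2019, 8, 1, 9, 0)

# Five prompts on consecutive days: the first four go out, the fifth is held back.
results = [throttle.try_send("mirko", start + timedelta(days=d)) for d in range(5)]

# Once the oldest prompt is more than 30 days old, a new one is allowed again.
later = throttle.try_send("mirko", start + timedelta(days=31))
```

Suppressed prompts would still need to go somewhere, for example back to the security team's queue, which this sketch does not model.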
That’s another initiative that we’re working on and the reason why we feel like we’re in the best position to be able to do that is again because of our core NLU engine, because we now have the context of the communications and we can recognize those workflows and alert the right person at the right time. We feel like we sort of have connected the dots across the board and we’re putting that together in a single platform that’s Armorblox.
That was a quick overview. I know there’s a lot of information but if you find this interesting and if you’d like to learn more about Armorblox, please visit our web site.