Data discovery gaps that catch enterprises off guard

In this interview with Help Net Security, Avani Desai, CEO at Schellman, talks about the gap between what organizations think they know about their data and what discovery scans turn up. She shares stories of shadow data in abandoned cloud storage, post-merger surprises where duplicated datasets slowed integration, and why synthetic data is overmarketed while confidential computing stays underappreciated.

Desai also explains why smaller companies often beat large enterprises on compliance, and the one question that gets executives to admit their data map is out of date.

Walk me through a discovery project you’ve seen go sideways. What did the company think they had versus what the scanners turned up, and which surprise hurt the most?

One discovery project that really stuck with me was a company that believed they had a mature understanding of their environment because they had invested heavily in tooling. They had dashboards, scanners, data catalogs — everything you would expect from an organization that thought it had strong visibility into its data environment.

But once we started validating what was happening, the scanners uncovered a much bigger gap between the known data and operational reality. We found shadow data sitting in old development environments, abandoned cloud storage buckets, and legacy SaaS exports. In some cases, customer data was still living in systems teams believed had been decommissioned years earlier.

The biggest issue wasn’t just the exposure itself. It was the realization that the organization had been making governance and risk decisions based on incomplete information. Once leadership understands that the data map is wrong, every downstream assumption starts getting questioned too.

Tell me about a merger or acquisition where the data inventory question became a deal issue. What got missed in due diligence that surfaced post-close?

I’ve definitely seen data inventory become a real issue post-close. One situation involved a company that represented it had strong segregation between customer environments and a relatively simple data architecture. During diligence, though, the focus was more on the areas buyers traditionally prioritize: growth metrics, contracts, and headline security certifications.

After the deal closed, the buyer discovered several datasets had been duplicated across multiple inherited systems from prior acquisitions, all without clear retention policies or defined ownership. The challenge was that the company did not understand the extent of the data sprawl across the environment.

What became painful post-close was the remediation effort and its cost, including customer notifications, operational disruption, and the work required to normalize the environment. It significantly slowed integration because you cannot modernize or consolidate systems when you cannot confidently identify where sensitive data lives, who owns it, or how it is being used.

Tokenization, format-preserving encryption, synthetic data, confidential computing. Pick the one you think is over-marketed and the one you think is under-appreciated, and explain why.

I personally think synthetic data is probably the most overmarketed right now. It has valuable use cases, especially for testing and model development. But I think organizations position it as, “Hey, this is going to be a magic replacement for governance and security hygiene.” Synthetic data does not automatically solve for weak controls, poor access management, or immature lifecycle governance. In many cases, it creates a false sense of security if the underlying governance issues are still unresolved.

On the underappreciated side, I would probably say confidential computing. A lot of people still view it as niche or overly technical. But as organizations continue moving sensitive workloads into shared and increasingly AI-driven environments, the ability to protect data in use, not just data at rest or in transit, becomes incredibly important. We are still very early in understanding how critical that capability is going to become.

One thing that surprises people is how often smaller companies outperform larger enterprises on modernization and compliance. They are typically more agile and move with greater intentionality because they have fewer legacy systems, fewer political silos, and much clearer ownership. At the end of the day, they are simply less complicated, which often makes execution easier.

Where have you seen a smaller company outperform an enterprise on fragmentation and compliance? What were they doing structurally that the bigger shop couldn’t replicate?

I have seen startups that prioritize security and compliance, and are willing to invest in them, achieve far better visibility into their environments than massive billion-dollar enterprises because they built governance into the architecture from the beginning. We talk a lot about privacy by design or security by design. This is really governance by design.

Large organizations often struggle because every acquisition, every regional deployment, every business unit adds another layer of complexity. The issue is rarely a lack of smart people, resources, or budget. More often, they lack unified ownership and operational simplicity. The bigger challenge is that organizations cannot operate in silos if they want consistent visibility, governance, and compliance across the enterprise.

When a company tells you their data map is current, what’s the question that usually makes them admit it isn’t?

When a company tells me their data map is current, the question I usually ask is: who is accountable for validating it after operational change? That’s usually the moment where there’s a pause.

A data map is a living document, and keeping it current requires continuous operational discipline. If a company has frequent product releases, is going through cloud migrations, acquisitions, or AI pilots, or is bringing on new third parties, the environment is constantly changing. If nobody owns that continuous validation process, then the reality is the data map is already outdated the moment it’s completed.
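
As a rough illustration of the accountability question above, the sketch below flags data map entries that either have no named owner or have changed since they were last validated. It is a minimal example under stated assumptions, not a method described in the interview: the `DataMapEntry` fields, the `stale_entries` helper, and the sample systems and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class DataMapEntry:
    # All fields are hypothetical; a real data map would carry far more detail.
    system: str
    owner: Optional[str]   # person accountable for validating this entry
    last_validated: date   # when the entry was last checked against reality
    last_changed: date     # most recent release, migration, acquisition, etc.


def stale_entries(entries: List[DataMapEntry]) -> List[Tuple[str, str]]:
    """Return (system, reason) pairs for entries that need revalidation."""
    findings = []
    for entry in entries:
        if entry.owner is None:
            findings.append((entry.system, "no accountable owner"))
        elif entry.last_validated < entry.last_changed:
            findings.append((entry.system, "changed since last validation"))
    return findings


# Invented sample data for illustration only.
data_map = [
    DataMapEntry("billing-db", "alice", date(2024, 5, 1), date(2024, 3, 10)),
    DataMapEntry("crm-export", None, date(2023, 1, 15), date(2023, 1, 1)),
    DataMapEntry("ml-pilot-bucket", "bob", date(2024, 2, 1), date(2024, 6, 20)),
]

for system, reason in stale_entries(data_map):
    print(f"{system}: {reason}")
```

With this sample data, the check reports `crm-export` for having no owner and `ml-pilot-bucket` for having changed after its last validation. The check is only as good as the change dates fed into it, which is why the ownership question comes first.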
