Building a modern data registry: Go beyond data classification

For organizations, understanding what data they store and analyze has become increasingly urgent because of new privacy regulations, from the General Data Protection Regulation (GDPR) to the California Consumer Privacy Act (CCPA) and Brazil’s General Data Protection Law (LGPD). But these regulations are not the only reason organizations are focused on privacy.

Security imperatives and the push to extract more value from the information they store have also pressed companies to get data privacy right. Historically, organizations have invested in a variety of technologies to inventory their physical assets, such as servers and PCs, but have lacked adequate technology to find, map and inventory data assets.

After all, the challenge of becoming a data-driven organization extends beyond the practical considerations of how to automate the data pipeline and map data assets; it also means answering whether the actions taken to access, analyze or share data are consistent with compliance, risk and privacy considerations. Balancing the drive toward becoming a data-driven organization with privacy-aware data governance has emerged as a crucial strategic concern. Yet traditional data classification and cataloging tools simply lack the capabilities needed to find, map and inventory data assets accurately and efficiently at scale in the age of GDPR and other privacy regulations.

A data registry how-to

Enter the modern data registry. By taking a fresh approach to data discovery – focused on creating an inclusive list of what data is kept where and why – organizations can better meet data privacy, protection and governance requirements.

Organizations need to start with the basics. A modern data registry cannot be a data warehouse – otherwise you’ll simply duplicate the data it maps and introduce limitations in scale. Instead, organizations should build the registry as an index-like map (a minimal sketch of such an entry follows the list below), focusing on five key functional and operational characteristics:

1. Content granularity: Privacy regulations like GDPR and CCPA require organizations to account for the data they collect – and this isn’t just a matter of knowing the types of data collected. Companies need to know what data they have and who that data belongs to. Privacy is all about people, so knowing the “people” context of data is essential to meeting privacy requirements.

2. Usage context: Knowing what and whose data you have is a critical first step, but creating a modern data registry with complete data intelligence means going further. This requires operational, technical and business knowledge: who can access the data, what applications are consuming it, what third parties have access to it, why it was collected, and whether the organization has adequate consent to collect and process it.

3. Data source coverage: A data registry that covers only unstructured files or relational databases will not provide a complete data inventory. With the growing number of data sources and applications used throughout the enterprise, organizations need to create a process that covers unstructured file shares as well as structured databases, big data, cloud, NoSQL, logs, mail, messaging, applications and more.

4. Ability to scale: Organizations gather and analyze tens, if not hundreds, of petabytes of data, and with increasing pressure to extract more value from data, that volume is only growing. A modern data registry not only needs to deliver an efficient index of data along with its associated usage, but it must do so in a way that scales for a global enterprise.

5. Dynamic, not static: Once a data registry is created, organizations must anticipate that the data it covers will change and move constantly. Consequently, the registry must be able to self-update and accommodate those changes in near real time to provide the clearest, most accurate picture of what data is kept where, when, and who it belongs to.
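To make the “index, not a copy” idea concrete, here is a minimal sketch in Python of what a single entry in such a registry might look like. The field names (entity_type, data_subject_id, source_system and so on) are illustrative assumptions rather than a prescribed schema; the point is that the registry holds pointers and context for each of the five characteristics above, never the underlying values.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class RegistryEntry:
        """One row in an index-like registry: a pointer plus context, never the data itself."""
        entity_type: str      # what kind of value was found, e.g. "ssn" or "email" (content granularity)
        data_subject_id: str  # the person the value belongs to (content granularity)
        source_system: str    # file share, relational DB, NoSQL, mail, SaaS app, logs... (source coverage)
        location: str         # table/column or file path where the value lives - a pointer, not a copy
        accessors: list = field(default_factory=list)       # who can read it (usage context)
        consuming_apps: list = field(default_factory=list)  # applications and third parties using it (usage context)
        purpose: str = ""       # why it was collected (usage context)
        consent: bool = False   # whether adequate consent is on record (usage context)
        last_scanned: datetime = field(default_factory=lambda: datetime.now(timezone.utc))  # refreshed by rescans (dynamic)

    # The registry itself is just an index keyed by person and entity type, so
    # "what do we hold on subject X, and where?" becomes a lookup, not a crawl of every store.
    registry = {}

    def register(entry: RegistryEntry) -> None:
        registry.setdefault((entry.data_subject_id, entry.entity_type), []).append(entry)

Keyed this way, the registry stays small enough to scale as an index while the data itself remains wherever it already lives.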

A new approach to building a data registry from data intelligence

Once the functional and operational foundation for a modern data registry is built, it is time to create a full accounting and inventory of your enterprise’s distributed data assets. This requires data intelligence down to the discrete entity value – something not possible with metadata alone. Obtaining this level of intelligence requires a hybrid approach to content discovery and contextualization, achieved by addressing these four key requirements:

1. Entity discovery and resolution: To obtain the level of data intelligence necessary for privacy and protection use cases, organizations need a data discovery mechanism that can extract and resolve data entities based on data values – whether the data resides in structured, unstructured or semi-structured stores. Organizations also need scanning systems that can disambiguate identical-looking data based on context. For example, your system should be able to separate a social security number from an account ID, even though both may have the same value (a rough sketch of this follows the list).

2. Entity correlation and contextualization: Privacy is about people. Period. To comply with privacy regulations, organizations need to account for their data and show the correlation or association of data to a data subject, and this must be reflected in a modern data registry. While essential for privacy, this also provides a new level of understanding of how data connects to high-value identities like transaction IDs, account IDs and patient IDs (see the second sketch after this list).

3. Entity classification by type and category: The approach to building a modern data registry must move past traditional classification tooling: entity-level granularity demands more refined, entity-level classification. If built with artificial intelligence or machine learning, the registry can expand how data is identified through heuristics and inferred categorizations.

4. Metadata capture and cataloging: Even though pure-play metadata catalogs leave much to be desired from the registry standpoint, they still provide value because they record where data categories can be found. This helps both to classify data entities correctly and to identify where to prioritize deeper entity searches. The challenge lies in relying on human tags and annotations, since human error makes them prone to inconsistencies. So, while technical metadata is important, you also need to capture operational and business context like access rights, purpose of use and consent.
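As a rough illustration of the disambiguation described in point 1 above, the sketch below separates a nine-digit US Social Security number from a nine-digit account ID using the context it was found in (here, a column name or nearby label). The hint keywords and regular expression are assumptions made for the example; a production scanner would combine far more signals, including validation rules and learned models.

    import re

    NINE_DIGITS = re.compile(r"\d{3}-?\d{2}-?\d{4}")  # both SSNs and some account IDs look like this

    # Context keywords are illustrative assumptions; real scanners use much richer signals.
    SSN_HINTS = {"ssn", "social security", "social_security"}
    ACCOUNT_HINTS = {"account", "acct"}

    def classify_nine_digit(value: str, context: str) -> str:
        """Disambiguate identical-looking values using the context they were found in."""
        if not NINE_DIGITS.fullmatch(value):
            return "not_a_nine_digit_value"
        ctx = context.lower()
        if any(hint in ctx for hint in SSN_HINTS):
            return "ssn"
        if any(hint in ctx for hint in ACCOUNT_HINTS):
            return "account_id"
        return "ambiguous"  # flag for deeper inspection rather than guessing

    print(classify_nine_digit("078-05-1120", context="customer SSN"))       # -> ssn
    print(classify_nine_digit("078051120", context="billing account_id"))   # -> account_id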
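In the same hedged spirit, point 2 can be pictured as an inverted index from known identity values to the data subject they resolve to, so that “what do we hold on this person?” becomes a lookup. The identity table below is invented for the example; in practice those identifiers might come from a CRM or master data management system.

    # Hypothetical known identifiers per data subject (invented for illustration).
    IDENTITIES = {
        "subject-001": {"jane.doe@example.com", "078-05-1120", "ACCT-991"},
        "subject-002": {"john.roe@example.com", "ACCT-445"},
    }

    # Inverted index: identifier value -> data subject, rebuilt as identities change.
    VALUE_TO_SUBJECT = {value: subject for subject, values in IDENTITIES.items() for value in values}

    def correlate(discovered_value: str):
        """Resolve a discovered entity value to the data subject it belongs to, if any."""
        return VALUE_TO_SUBJECT.get(discovered_value)

    # An SSN discovered in some file share now maps straight to a person, which is
    # exactly what a subject access or deletion request ultimately requires.
    print(correlate("078-05-1120"))   # -> subject-001
    print(correlate("555-00-0000"))   # -> None (unknown value; a candidate for deeper investigation)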

It cannot be said enough – the only way to comply with privacy regulations like GDPR and CCPA is for the organization to account for what data it holds and which individual that data belongs to.

A modern data registry looks beyond simply classifying and cataloging data to show the correlation and association of data to a data subject, providing a new understanding of the connectedness of data to high-value identities, whether they are located in the data center or the cloud.
