Forrester changed the way they think about data catalogs, and here’s what you need to know – Atlan

It’s the latest sign of a major shift in how we think about metadata.

As we predicted at the beginning of this year, metadata is hot in 2022 — and it’s only getting hotter.

But this isn’t the old-school idea of metadata we all know and hate: the IT “data inventories” that take 18 months to set up, the monolithic systems that only work when ruled by dictator-like data stewards, and the siloed data catalogs that are the last thing you want to open in the middle of working on a data dashboard or pipeline.

The data industry is in the middle of a fundamental shift in how we think about metadata. In the past year or two, we’ve seen a slew of new concepts emerge to capture it (e.g. the metrics layer, modern data catalogs, and active metadata), all backed by major analysts and companies in the data space.

Now we’ve got the latest sign of this shift. This summer, Forrester scrapped its Wave report on “Machine Learning Data Catalogs” to make way for one on “Enterprise Data Catalogs for DataOps”. Here’s everything you need to know about where this change came from, why it happened, and what it means for modern metadata.

A quick history of metadata

In the earliest days of big data, companies’ biggest challenge was simply keeping track of all the data they now had. IT teams were tasked with creating an “inventory of data” that listed a company’s stored data and its metadata. But in this Data Catalog 1.0 era, companies spent more time implementing and updating these tools than actually using them.

In the early 2010s, there was a big shift — the Data Catalog 2.0 era emerged. This brought a greater focus on data stewardship and integrating data with business context to create a single source of truth that went beyond the IT team. At least, that was the plan. These 2.0 data catalogs came with a host of problems, including rigid data governance teams, complex technology setup, lengthy implementation cycles, and low internal adoption.

Today, metadata platforms are becoming more active, data teams are becoming more diverse than ever, and metadata itself is becoming big data. These changes have brought us to Data Catalog 3.0, a new generation of data governance and metadata management tools that promise to overcome past cataloging challenges and supercharge the power of metadata for modern businesses.

Last year, Gartner scrapped their old categorization of data catalogs in favor of one that reflects this fundamental shift in how we think about metadata. Now Forrester has made its own move to define this new category on its own terms.

Forrester: Moving from Machine Learning Data Catalogs to Enterprise Data Catalogs for DataOps

One of the biggest challenges with Data Catalog 2.0 tools was adoption: no matter how they were set up, companies found that people rarely used their expensive data catalog. For a while, the data world thought that machine learning was the solution. That’s why, until recently, Forrester’s reports focused on evaluating “Machine Learning Data Catalogs”.

However, in early 2022, Forrester dropped machine learning in its Now Tech report. It explained that even as ML-based systems became ubiquitous, the problems they were meant to solve persisted. Although machine learning allowed data architects to get a clearer picture of the data within their organization, it didn’t fully address modern challenges around data management and provisioning.

The key change: “conceptual data understanding” via a data wiki alone is no longer enough. Instead, data teams need a catalog built to enable DataOps, one that gives them in-depth information about and control over their data to “build data-driven applications and address data flow and performance”.

Provisioning data is more complex under distributed cloud, edge compute, intelligent applications, automation, and self-service analytics use cases… Data engineers need a data catalog that does more than generate a wiki about data and metadata.

Forrester Now Tech: Enterprise Data Catalogs for DataOps, Q1 2022

What is an enterprise data catalog for DataOps?

So what actually is an enterprise data catalog for DataOps (EDC)?

According to Forrester, “[enterprise] data catalogs create data transparency and enable data engineers to implement DataOps activities that develop, coordinate, and orchestrate the provisioning of data policies and controls and manage the data and analytics product portfolio.”

There are three key ideas that distinguish EDCs from the earlier Machine Learning Data Catalogs.

Handles the diversity and granularity of modern data and metadata

Our data environments are chaotic, spanning cloud-native capabilities, anomaly detection, synchronous and asynchronous processing, and edge compute.

Forrester Now Tech: Enterprise Data Catalogs for DataOps, Q1 2022

Today a company’s data isn’t just made up of simple tables and charts. It includes a wide range of data products and associated assets, such as databases, pipelines, services, policies, code, and models. To make matters worse, each of these assets has its own metadata that just keeps getting more detailed.

EDCs are built for this complex portfolio of data and metadata. Rather than just storing a “wiki” of this data, EDCs act as a “system of record” to automatically capture and manage all of a company’s data through the data product lifecycle. This includes syncing context and enabling delivery across data engineers, data scientists, and application developers.

Example of this principle in action

For example, we work with a data team that ingests 1.2 TB of event data every day. Instead of trying to manage this data and create metadata manually, they use APIs to assess incoming data and automatically create its metadata.

Auto-assigning owners: They scan query log history and custom metadata to predict the best owner for each data asset.
Auto-attaching column descriptions: A bot recommends descriptions by scanning interactions with that asset, and a human verifies them.
Auto-classification: By scanning through an asset’s columns and how similar assets are classified, they can classify sensitive assets based on PII and GDPR restrictions.
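
To make the auto-classification step concrete, here is a minimal sketch in Python. It covers only the column-scanning part of the idea, using column names; it does not look at how similar assets are classified. The `Asset` class, the tag names, and the regex patterns are all hypothetical and are not the team's actual rules or any catalog's API.

```python
import re
from dataclasses import dataclass, field

# Hypothetical column-name patterns, for illustration only.
# A real setup would tune these and likely inspect column contents too.
PII_PATTERNS = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
    "NAME": re.compile(r"(first|last|full)_?name", re.IGNORECASE),
}


@dataclass
class Asset:
    """A simplified stand-in for a catalog entry such as a table."""
    name: str
    columns: list[str]
    tags: set[str] = field(default_factory=set)


def classify(asset: Asset) -> set[str]:
    """Return the PII tags suggested by the asset's column names."""
    suggested = set()
    for column in asset.columns:
        for tag, pattern in PII_PATTERNS.items():
            if pattern.search(column):
                suggested.add(tag)
    return suggested


def auto_classify(assets: list[Asset]) -> None:
    """Attach a generic 'PII' tag plus the specific tags to matching assets."""
    for asset in assets:
        suggested = classify(asset)
        if suggested:
            asset.tags |= {"PII"} | suggested
```

In a workflow like the one described above, the suggested tags would probably be queued for human review before being applied to sensitive assets.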

Provides deep transparency into data flow and delivery

Adoption of CI/CD practices by DataOps requires detailed intelligence of data movement and transformation.

Forrester Wave™: Enterprise Data Catalogs for DataOps, Q2 2022

A key idea in DataOps is CI/CD, a software engineering principle to improve collaboration, productivity, and speed through continuous integration and delivery. For data, implementing CI/CD practices relies on understanding exactly how data is moved and transformed across the company.

EDCs provide granular data visibility and governance with features like column-level lineage, impact analysis, root cause analysis, and data policy compliance. These should be programmatic, rather than manual, with automated flags, alerts, and/or suggestions to help users keep on top of complex, fast-moving data flows.
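
As a rough illustration of programmatic impact analysis, the sketch below walks a column-level lineage graph breadth-first to list everything downstream of a changed column. The `LINEAGE` mapping and its column names are invented for the example; a real catalog would expose lineage through its own API.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns in its list.
LINEAGE = {
    "raw.events.user_id": ["staging.sessions.user_id"],
    "staging.sessions.user_id": [
        "marts.daily_active_users.user_count",
        "marts.retention.cohort_size",
    ],
    "marts.daily_active_users.user_count": ["dashboards.growth.dau"],
}


def downstream(column: str, lineage: dict[str, list[str]]) -> list[str]:
    """Return every column reachable from `column`, in breadth-first order."""
    impacted = []
    visited = {column}
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in visited:  # guards against cycles and duplicates
                visited.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted
```

Root cause analysis is the same traversal run over the reversed graph, starting from the broken asset and walking upstream.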

Example of this principle in action

For example, we work with a data team that deals with hundreds of metadata change events (e.g. schema changes, like adding, deleting, and updating columns; or classification changes, like removing a PII tag), which affect over 100,000 tables daily.

To make sure that they always know the downstream effects of these changes, the company uses APIs to automatically track and trigger notifications for schema and classification changes. These metadata change events also automatically trigger a data quality testing suite to ensure that only high-quality, compliant data makes its way to production systems.
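
One way to picture this setup is a small event router that fans each metadata change event out to a notifier and a test trigger. The sketch below is an assumption about the general shape, not the company's implementation: `ChangeEvent`, `ChangeRouter`, and both handlers are hypothetical, and the handlers only record what they would do instead of calling a real messaging or testing tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ChangeEvent:
    table: str
    kind: str    # e.g. "schema" or "classification"
    detail: str  # human-readable description of the change


Handler = Callable[[ChangeEvent], None]


class ChangeRouter:
    """Dispatches each event to the handlers registered for its kind."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = {}

    def on(self, kind: str, handler: Handler) -> None:
        self._handlers.setdefault(kind, []).append(handler)

    def publish(self, event: ChangeEvent) -> int:
        """Run all handlers for the event and return how many ran."""
        handlers = self._handlers.get(event.kind, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


# In-memory stand-ins for a notification channel and a test runner.
notifications: list[str] = []
test_runs: list[str] = []


def notify_owners(event: ChangeEvent) -> None:
    notifications.append(f"{event.kind} change on {event.table}: {event.detail}")


def run_quality_suite(event: ChangeEvent) -> None:
    test_runs.append(event.table)


router = ChangeRouter()
for kind in ("schema", "classification"):
    router.on(kind, notify_owners)
    router.on(kind, run_quality_suite)
```

Publishing `ChangeEvent("sales.orders", "schema", "column discount_code dropped")` would then queue one notification and one test run for that table.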

Designed around modern DataOps and engineering best practices

Not all data catalogs are made for data engineers… [Look] beyond checkbox technical functionality and align tool capabilities to how your DataOps model functions.

Forrester Now Tech: Enterprise Data Catalogs for DataOps, Q1 2022

With data growing far beyond the IT team, data engineering tools can no longer just focus on the data warehouse and lake. DataOps merges the best practices and learnings from the data and developer worlds to help diverse data people work together better.

EDCs are a critical way to connect the “data and developer environments”. Features like bidirectional communication, collaboration, and two-way workflows lead to simpler, faster data delivery across teams and functions.

Example of this principle in action

For example, we work with a data team that uses this idea to reduce cross-team surprises and address issues proactively. They use APIs to monitor pipeline health and flag when a pipeline that feeds into a BI dashboard breaks. If this happens, their system first creates an all-team announcement (e.g. “There is an active issue with the upstream pipeline, so don’t use this dashboard!”), which is automatically published in the BI tool that data consumers use. Next, the system files a Jira ticket, tagged to the correct owner, to track and initiate work on this issue. This automated process keeps the data team from getting surprised by that awful Slack message, “Why does the number on this dashboard look wrong?”
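
The flow above could be sketched roughly as follows. Everything here is a hypothetical stand-in: `Incident` and `IncidentResponder` are invented names, and the announcement and ticket are stored in memory rather than sent to a real BI tool or to Jira, whose APIs are not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    pipeline: str
    dashboards: list[str]  # dashboards fed by the broken pipeline
    owner: str             # person responsible for the pipeline


@dataclass
class IncidentResponder:
    # dashboard name -> banner text; stands in for the BI tool
    announcements: dict[str, str] = field(default_factory=dict)
    # stands in for the ticketing system
    tickets: list[dict] = field(default_factory=list)

    def handle(self, incident: Incident) -> dict:
        """Announce the issue on affected dashboards, then file a ticket."""
        message = (
            f"There is an active issue with the upstream pipeline "
            f"'{incident.pipeline}', so don't use this dashboard!"
        )
        # Step 1: warn data consumers where they actually look at the data.
        for dashboard in incident.dashboards:
            self.announcements[dashboard] = message
        # Step 2: open a ticket assigned to the pipeline's owner.
        ticket = {
            "summary": f"Pipeline {incident.pipeline} is failing",
            "assignee": incident.owner,
            "affected_dashboards": list(incident.dashboards),
        }
        self.tickets.append(ticket)
        return ticket
```

Putting the announcement before the ticket mirrors the order described above: consumers are warned first, and the fix is tracked second.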

The role of active metadata in enterprise data catalogs

Enterprise data catalogs take an active approach to translate the library of controls and data products into services for deployments that bridge data to the application.

Forrester Now Tech: Enterprise Data Catalogs for DataOps, Q1 2022

Though it is not part of Forrester’s opening EDC definition, the report mentions an “active approach” and active metadata several times while evaluating different catalogs. This is because active metadata is a critical part of modern EDCs.

DataOps, like other modern concepts such as the data mesh and data fabric, is fundamentally based on being able to collect, store, and analyze metadata. However, in a world where metadata is approaching “big data” and its use cases are growing even faster, the standard way of storing metadata is no longer enough.

The solution is “active metadata”, which is a key component of modern data catalogs. Instead of just collecting metadata from the rest of the data stack and bringing it back into a passive data catalog, active metadata makes a two-way movement of metadata possible. It sends enriched metadata and unified context back into every tool in the data stack, and enables powerful programmatic use cases through automation.

While metadata management isn’t new, it’s incredible how much change it has gone through in recent years. We’re at an inflection point in the metadata space, a moment where we are collectively turning away from old-school data catalogs and embracing the future of metadata.

It’s fascinating to see this change in action, especially when it’s marked by major shifts like this one from Forrester. Given how far they’ve come in just the last few months, we can’t wait to see how EDCs and active metadata continue to evolve in the coming years!
