Data Chaos: Where Did It Go Wrong This Time?

I hear Data Catalog, Data Discovery, Data Observability, and Data Governance thrown around almost every day. But when these terms are used loosely, it reminds me of that iconic line from The Princess Bride: “You keep using that word. I do not think it means what you think it means.”
These concepts sit at a tricky intersection of complexity and high stakes, which makes it crucial to get them, and the data, right.

So, let’s break it down in a super simple way: data as toys (I have two kids, so I had to use toys).

  1. Data Catalog 📖: A list of all of our kids' toys so my wife knows what our kids have and where to find them.
  2. Data Discovery 🕵️: Finding our kids' toys in the house so they can play with them.
  3. Data Observability 👀: Making sure our kids' toys are working. If broken, Daddy fixes them.
  4. Data Governance 🚦: Rules for who can play with which toys and how to take care of them. Mommy sets those rules, not Daddy.
  5. Metadata 🏷️: The tag on a toy that tells you its name, brand, and when it was made. #Target

The purpose of building a data catalog is to make data discovery as efficient as possible. Users need to know where the data lives, how to access it, and how to connect to it for exploration.
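
To make that concrete, here's a toy sketch (fitting, given the theme) of what a single catalog entry might capture. Everything in it, the field names, the dataset, the connection hint, is hypothetical; real catalogs store far richer metadata.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A minimal, hypothetical data catalog record."""
    name: str          # logical dataset name users search for
    location: str      # where the data physically lives
    owner: str         # who to ask about it
    access: str        # how to get permission and connect
    tags: list[str] = field(default_factory=list)  # metadata for discovery

# With even this much recorded, an analyst can find the table, see who
# owns it, and learn how to connect, without asking around on Slack.
orders = CatalogEntry(
    name="analytics.orders",
    location="snowflake://prod/analytics/orders",
    owner="data-platform-team",
    access="Request the ANALYST_RO role, then connect to the prod warehouse",
    tags=["orders", "finance", "daily"],
)
```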

Imagine a data scientist or analyst working on a new machine learning (ML) model or dashboard. They need the right dataset to get started. But as data moves through layers of transformations, new datasets emerge, and dependencies grow. Without proper tools, a single change upstream can break pipelines, dashboards, and reports downstream.
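
To see why this gets hairy, picture lineage as a graph. The sketch below (with made-up dataset names) walks everything downstream of one changed dataset; it's exactly this transitive blast radius that manual tracking misses.

```python
from collections import deque

# Hypothetical lineage: upstream dataset -> datasets built from it.
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.daily_revenue", "ml.churn_features"],
    "mart.daily_revenue": ["dashboard.exec_kpis"],
    "ml.churn_features": [],
    "dashboard.exec_kpis": [],
}

def impacted_downstream(changed: str) -> set[str]:
    """Breadth-first walk of everything downstream of a changed dataset."""
    impacted, queue = set(), deque(lineage.get(changed, []))
    while queue:
        node = queue.popleft()
        if node not in impacted:
            impacted.add(node)
            queue.extend(lineage.get(node, []))
    return impacted

# One change to raw.orders touches the ML features AND the exec dashboard:
print(impacted_downstream("raw.orders"))
# {'staging.orders', 'mart.daily_revenue', 'ml.churn_features', 'dashboard.exec_kpis'}
```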

Better data discovery not only helps users find what they need but also makes it easier to govern and categorize data effectively.

So, how do we solve this interconnected data problem? Should we manually track all data sources, update the catalog, and send company-wide updates every time something changes? Nah – let’s skip that tedious approach and leverage tools to automate the process.

Here’s how I tackled it:

  • Identify Ownership: Map out which downstream owners depend on which upstream datasets. Good luck with this task; you will need it.
  • Automate Notifications: A dedicated Slack channel notifies owners whenever upstream schema changes occur. I implemented this with Apache Airflow as the orchestrator: any schema or transformation change made via dbt (integrated with a GitHub repository) triggered a Slack alert with a custom message about the impacted models (new/updated/deleted). A rough sketch of this pattern follows this list.
  • Control Merges: Downstream owners had to approve changes before they could be merged into the master branch (GitHub's CODEOWNERS file plus required reviews works well here), preventing breaking changes from reaching production.

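Here's a rough sketch of the notification piece, not my exact production code. It assumes Airflow 2.x, a Slack incoming-webhook URL, and that a copy of the previous dbt manifest.json is kept around to diff against; the webhook URL, file paths, and `diff_dbt_models` helper are illustrative.

```python
import json

import requests
from airflow.decorators import dag, task
from pendulum import datetime

# Hypothetical incoming-webhook URL for the dedicated Slack channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

def diff_dbt_models(old: dict, new: dict) -> dict:
    """Compare the `nodes` sections of two dbt manifest.json files."""
    old_nodes, new_nodes = old["nodes"], new["nodes"]
    return {
        "new": sorted(new_nodes.keys() - old_nodes.keys()),
        "deleted": sorted(old_nodes.keys() - new_nodes.keys()),
        "updated": sorted(
            k for k in new_nodes.keys() & old_nodes.keys()
            if new_nodes[k] != old_nodes[k]
        ),
    }

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def dbt_change_alerts():
    @task
    def notify_owners():
        # Illustrative paths: the previous manifest is stashed after each run.
        with open("target/manifest_prev.json") as f_old, \
             open("target/manifest.json") as f_new:
            changes = diff_dbt_models(json.load(f_old), json.load(f_new))
        if any(changes.values()):
            lines = [f"*{kind}*: {', '.join(models)}"
                     for kind, models in changes.items() if models]
            requests.post(
                SLACK_WEBHOOK_URL,
                json={"text": "dbt model changes detected:\n" + "\n".join(lines)},
                timeout=10,
            )

    notify_owners()

dbt_change_alerts()
```
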
This approach significantly reduced broken pipelines and dashboards. But a tool like DataHub takes it a step further. It's a data catalog that makes data discovery and governance almost fun (yes, fun!). Business and technical users alike can explore data effortlessly while uncovering downstream dependencies, and developers and data engineers can proactively engage with impacted teams, minimizing disruptions before they happen.
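
If you're curious what that looks like programmatically, below is a small sketch using DataHub's Python emitter (the acryl-datahub package). The server URL, platform, and dataset name are placeholders, and the API evolves, so treat this as a starting point and check the current SDK docs.

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Placeholder address for a locally running DataHub GMS instance.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# A URN uniquely identifies the dataset across platforms.
dataset_urn = make_dataset_urn(platform="snowflake", name="analytics.orders", env="PROD")

# Attach human-readable properties so the entry is useful, not just findable.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(
            description="Curated orders fact table; one row per order.",
            customProperties={"owner_team": "data-platform"},
        ),
    )
)
```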

The result? A more mindful, collaborative, and efficient data ecosystem, with far fewer headaches.

What strategies or tools have you used to tackle data discovery, observability, and governance? Let’s share and learn from each other!