Data Lineage Concepts

What is Data Lineage?

Data lineage is the concept of tracking and observing data flowing through a data pipeline. Data lineage can be used to understand data sources, troubleshoot job failures, manage PII, and ensure compliance with data regulations.

Lineage on Astronomer

In the Cloud UI, the Lineage tab renders the lineage metadata generated by your DAGs as a dynamic graph. For more information on using the Lineage tab, see Data Lineage.

Astro uses OpenLineage to emit lineage metadata. OpenLineage is an open source, industry-standard framework for data lineage: it standardizes the definition of data lineage, the metadata that makes up lineage data, and the approach for collecting lineage data from external systems. In other words, it defines a formalized specification for data lineage.

Core Concepts

The following terms are used frequently when discussing data lineage and OpenLineage in particular. We define them here specifically in the context of using OpenLineage with Astro.

  • Integration: A means of gathering lineage data from a source system (e.g. a scheduler or data platform). For example, the OpenLineage Airflow integration allows lineage data to be collected from Airflow DAGs. A full list of OpenLineage integrations can be found here.
  • Extractor: In the openlineage-airflow package, an extractor is a module that gathers lineage metadata from a specific hook or operator. For example, extractors exist for the PostgresOperator and SnowflakeOperator, meaning that if openlineage-airflow is installed and configured for your Airflow environment, then lineage data will be generated automatically from those operators when your DAG runs. An extractor must exist for a specific operator to get lineage data from it.
  • Job: A process which consumes or produces datasets. In the context of Airflow, an OpenLineage job corresponds to a task in your DAG (assuming the task is an instance of an operator that has an extractor built). Jobs can also represent work completed in other applications that emit lineage data, such as a Spark job or a dbt model. Jobs appear as nodes on your lineage graphs in the lineage UI.
  • Dataset: Any collection of data that your jobs interact with. For example, a dataset can correspond to a table in your database or a set of data that you run a Great Expectations check on. A dataset is typically registered as part of your lineage data when a job writing to the dataset is completed (e.g. data is inserted into a table).
  • Run: An instance of a job where lineage data is generated. In the context of the Airflow integration, an OpenLineage run will be generated with each DAG run.
  • Facet: A piece of lineage metadata about a job, dataset, or run. For example, a facet that describes a job is referred to as a “job facet”.
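
To make the relationships between these terms concrete, the following is a minimal sketch that models them as plain Python dataclasses. It is not the OpenLineage client API: the class names, field names, namespaces, and the example table and task names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import uuid


@dataclass
class Dataset:
    """A collection of data that jobs read from or write to (e.g. a table)."""
    namespace: str
    name: str
    # Facets are modeled here as free-form metadata keyed by facet name.
    facets: Dict[str, dict] = field(default_factory=dict)


@dataclass
class Job:
    """A process that consumes or produces datasets (e.g. an Airflow task)."""
    namespace: str
    name: str
    facets: Dict[str, dict] = field(default_factory=dict)


@dataclass
class Run:
    """One execution of a job, with the datasets it read and wrote."""
    job: Job
    inputs: List[Dataset]
    outputs: List[Dataset]
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    facets: Dict[str, dict] = field(default_factory=dict)


# Hypothetical example: a task reads one table and writes another.
orders = Dataset(
    namespace="postgres://warehouse",
    name="public.orders",
    facets={"schema": {"fields": ["id", "amount"]}},  # a "dataset facet"
)
daily_totals = Dataset(namespace="postgres://warehouse", name="public.daily_totals")
job = Job(namespace="example_namespace", name="orders_dag.aggregate_orders")
run = Run(job=job, inputs=[orders], outputs=[daily_totals])
```

In this sketch, a run ties a job to the datasets it touched, and facets hang off each of the three objects. This mirrors the way the terms above relate to one another rather than any particular library's types.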

OpenLineage and Airflow

Using OpenLineage with Airflow gives you more insight into complex data ecosystems and can lead to better data governance. Airflow is a natural place to integrate data lineage because it touches and moves data across many parts of an organization.

More specifically, OpenLineage with Airflow provides the following capabilities:

  • Quickly find the root cause of task failures by identifying issues in upstream datasets (e.g. if an upstream job outside of Airflow failed to populate a key dataset).
  • Easily see the blast radius of any job failures or changes to data by visualizing the relationship between jobs and datasets.
  • Identify where key data is used in jobs across an organization.
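
As an illustration of the "blast radius" idea, the sketch below walks a small lineage graph breadth-first to find every dataset and job downstream of a failed job. The function name, the two dictionary inputs, and the job and dataset names are hypothetical; a lineage tool would build an equivalent graph from collected lineage metadata rather than from hand-written dictionaries.

```python
from collections import deque


def blast_radius(failed_job, outputs_of, readers_of):
    """Return (datasets, jobs) downstream of failed_job.

    outputs_of: maps a job name to the datasets it writes.
    readers_of: maps a dataset name to the jobs that read it.
    """
    affected_datasets, affected_jobs = set(), set()
    queue = deque([failed_job])
    while queue:
        job = queue.popleft()
        for dataset in outputs_of.get(job, []):
            if dataset in affected_datasets:
                continue  # already visited via another path
            affected_datasets.add(dataset)
            for reader in readers_of.get(dataset, []):
                if reader != failed_job and reader not in affected_jobs:
                    affected_jobs.add(reader)
                    queue.append(reader)
    return affected_datasets, affected_jobs


# Hypothetical lineage graph: which job writes and reads which dataset.
outputs_of = {
    "load_orders": ["orders"],
    "load_users": ["users"],
    "aggregate_orders": ["daily_totals"],
    "build_report": ["report"],
}
readers_of = {
    "orders": ["aggregate_orders"],
    "daily_totals": ["build_report"],
    "users": ["build_report"],
}

datasets, jobs = blast_radius("load_orders", outputs_of, readers_of)
```

Here a failure in `load_orders` reaches `orders`, `daily_totals`, and `report`, along with the two jobs that read them, while `users` and `load_users` are unaffected. The same traversal run in the opposite direction (from a dataset back to its producers) corresponds to the root-cause use case in the first bullet.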

These capabilities can be used to achieve the following benefits:

  • Make recovery from complex failures quicker. The faster you can identify the problem and its blast radius, the easier it is to fix the failure and to prevent erroneous decisions from being made based on bad data.
  • Make it easier for teams to work together across an organization. Visualizing the full scope of where a dataset is used reduces “sleuthing” time.
  • Ensure compliance with data regulations by fully understanding where data is used.