
Enable Data Lineage for External Systems

Overview

This guide explains how to configure your data pipelines to emit lineage data to Astro.

To generate lineage graphs for your data pipelines, you first need to configure your data pipelines to emit lineage data. Because lineage data can be generated in all stages of your pipeline, you can configure pipeline components outside of Astro, such as dbt or Databricks, to emit lineage data whenever they run a job. Astro combines this data with the lineage data emitted from your DAGs to generate a lineage graph that provides context for your data before, during, and after it reaches your Deployment.

Lineage data is generated via OpenLineage, which is an open source standard for lineage data creation and collection. Astro receives metadata about running jobs and datasets via the OpenLineage API. Each Astro Organization has an OpenLineage API key that you can specify in your external systems. Your external systems can use this API key to send lineage data back to your Control Plane.

Diagram showing how lineage data flows to Astro

Generally, configuring a system to send lineage data requires:

  • Installing an OpenLineage backend to emit lineage data from the system.
  • Specifying your Organization's OpenLineage API endpoint to send lineage data back to the Astro Control Plane.

Tip: You can access a version of this documentation directly from the Lineage tab in the Cloud UI. The embedded documentation additionally loads your Organization's configuration values, such as your OpenLineage API key and your Astro base domain, directly into configuration steps.
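
As a rough illustration of what an external system does when it emits lineage, the following Python sketch builds a minimal OpenLineage run event and prepares an authenticated HTTP request for it. It is a sketch under stated assumptions, not Astro's or any integration's actual implementation: the `/api/v1/lineage` path and the bearer-token header follow common OpenLineage HTTP conventions and are assumed here, and the domain, API key, namespace, job name, and producer URI are hypothetical placeholders. The request is only constructed, never sent.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib.request import Request


def build_run_event(namespace, job_name, event_type="COMPLETE"):
    """Return a minimal OpenLineage-style run event as a dict."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [],
        "outputs": [],
        # Hypothetical URI identifying the system that produced the event.
        "producer": "https://example.com/my-external-system",
    }


def build_lineage_request(base_url, api_key, event):
    """Prepare (but do not send) a POST request carrying the event."""
    return Request(
        # Assumption: the lineage endpoint lives at /api/v1/lineage.
        url=base_url.rstrip("/") + "/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the API key is passed as a bearer token.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder values; substitute your Astro base domain and lineage API key.
event = build_run_event("my-namespace", "my_job")
request = build_lineage_request("https://astro.example.com", "example-key", event)
```

In practice, an integration would normally rely on an OpenLineage client library or backend rather than hand-built requests, since the library fills in the remaining fields that the specification expects.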

Retrieve Your OpenLineage API Key

To send lineage data from an external system to Astro, you must specify your Organization's OpenLineage API key in the external system's configuration. To find your Organization's API key:

  1. In the Cloud UI, open the Lineage tab.

  2. In the left-hand lineage menu, click Integrations:

    Location of the "Integrations" button in the Lineage tab of the Cloud UI

  3. In Getting Started, copy the value in Lineage API Key.

For more information about how to configure this API key in external systems, read the following integration guides.

Integration Guides

Lineage is configured automatically for all Deployments on Astro Runtime 4.2.0+. The easiest way to add lineage to an existing Deployment on Runtime <4.2.0 is to upgrade Runtime.

Note: If you don't see lineage features enabled for a Deployment on Runtime 4.2.0+, then you might need to push code to the Deployment to trigger the automatic configuration process.

To configure lineage on an existing Deployment on Runtime <4.2.0 without upgrading Runtime:

  1. In your locally hosted Astro project, update your requirements.txt file to include the following line:

    openlineage-airflow

  2. Push your changes to your Deployment.

  3. In the Cloud UI, set the following environment variables in your Deployment:

    AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
    OPENLINEAGE_NAMESPACE=<your-deployment-namespace>
    OPENLINEAGE_URL=https://<your-astro-base-domain>
    OPENLINEAGE_API_KEY=<your-lineage-api-key>
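
Before moving on, it can help to confirm that all four variables are present and that no template placeholder was left in place. The following Python sketch shows one possible check. The function name `find_lineage_config_problems` and the sample values are hypothetical; only the variable names come from the step above.

```python
REQUIRED_VARS = (
    "AIRFLOW__LINEAGE__BACKEND",
    "OPENLINEAGE_NAMESPACE",
    "OPENLINEAGE_URL",
    "OPENLINEAGE_API_KEY",
)


def find_lineage_config_problems(env):
    """Return a list of human-readable problems found in an env mapping."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif "<" in value or ">" in value:
            # Angle brackets suggest a template placeholder was not replaced.
            problems.append(f"{name} still contains a placeholder")
    url = env.get("OPENLINEAGE_URL", "")
    if url and not url.startswith("https://"):
        problems.append("OPENLINEAGE_URL should start with https://")
    return problems


# Hypothetical sample values, for illustration only.
sample_env = {
    "AIRFLOW__LINEAGE__BACKEND": "openlineage.lineage_backend.OpenLineageBackend",
    "OPENLINEAGE_NAMESPACE": "my-deployment",
    "OPENLINEAGE_URL": "https://astro.example.com",
    "OPENLINEAGE_API_KEY": "example-key",
}
problems = find_lineage_config_problems(sample_env)
```

To check a live environment, you could pass `os.environ` instead of the sample dictionary; an empty list means none of the checks failed.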

Verify

To view lineage metadata, go to your Organization's landing page and open the Lineage tab in the Organization view. Your most recent DAG run should appear as a data lineage graph on the Lineage page.

Note: Lineage information appears only for DAGs that use operators with extractors defined in the openlineage-airflow library, such as the PostgresOperator and SnowflakeOperator. For a full list of supported operators, see Data Lineage Support and Compatibility.

Note: If you don't see lineage data for a DAG even after configuring lineage in your Deployment, you might need to run the DAG at least once so that it starts emitting lineage data.