Integrate data lineage from external systems to Astro
Data lineage is the concept of tracking data from its origin to wherever it is consumed downstream as it flows through a data pipeline. This includes connections between datasets and tables in a database as well as rich metadata about the tasks that create and transform data. You can observe data lineage to:
- Trace the history of a dataset.
- Troubleshoot run failures.
- Manage personally identifiable information (PII).
- Ensure compliance with data regulations.
This guide provides information about how lineage data is automatically extracted from Apache Airflow tasks on Astro and how to integrate external systems, including Databricks and dbt, that require additional configuration. To learn about how to view data lineage on Astro, see View data lineage.
Data lineage on Astro
To view lineage data, it first needs to be extracted from an external application and then stored in a lineage backend. Astro uses the OpenLineage Airflow library (`openlineage-airflow`) to extract lineage from Airflow tasks and stores that data in the Astro control plane. The latest version of the OpenLineage Airflow library is installed on Astro Runtime by default.
There are two ways to emit lineage data to Astro:
- Run a task on Astro with a supported Airflow operator, such as the SnowflakeOperator. These operators include extractors that automatically emit lineage data and don’t require additional configuration. See Supported Airflow operators.
- Integrate OpenLineage with an external service, such as dbt or Apache Spark, to emit data lineage outside of an Airflow DAG or task using an OpenLineage API key.
The data lineage graph in the Cloud UI shows lineage data that is emitted with both methods, including jobs that are not run on the Astro data plane. This graph can provide context to your data before, during, and after it reaches your Deployment.
Extract lineage data from external systems to Astro
When you integrate an external data lineage system with Astro, or you are working with Astro locally and are not using a supported Airflow operator, you need to provide a Deployment namespace, your Organization's OpenLineage URL, and your Organization's OpenLineage API key. This information is used to send OpenLineage data to the correct place in Astro.
To locate your Deployment namespace in the Cloud UI, select a Workspace and then copy the value with the format `<text>-<text>-<four-digit-number>` next to the Deployment name. To locate your Organization's OpenLineage URL and OpenLineage API key, go to `https://cloud.<your-astro-base-domain>.io/settings` and copy the values in the Lineage API Key and OpenLineage URL fields.
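If you emit lineage from a system that isn't covered by one of the integrations below, you can also send events to this endpoint directly with the OpenLineage Python client. The following is a minimal sketch rather than an Astro-specific requirement; the job name and run ID are illustrative placeholders, and class names can vary between `openlineage-python` versions:

```python
# Minimal sketch: emit a single OpenLineage run event to the Astro lineage backend.
# The URL, API key, and namespace are the values described above; the job name
# and run ID are placeholders.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient, OpenLineageClientOptions
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(
    url="<your-openlineage-url>",
    options=OpenLineageClientOptions(api_key="<your-openlineage-api-key>"),
)

client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace="<deployment-namespace>", name="external_system.example_job"),
        producer="https://github.com/OpenLineage/OpenLineage/tree/main/client/python",
    )
)
```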
Snowflake and OpenLineage with Airflow
Lineage data emitted from Snowflake is similar to what is collected from other SQL databases, including Amazon Redshift and Google BigQuery. However, Snowflake is unique in that it emits query tags that provide additional task execution details.
When you run a task in Airflow that interacts with Snowflake, the query tag allows each task to be directly matched with the Snowflake query or queries that are run by that task. If the task fails, for example, you can look up the Snowflake query that was executed by that task and reduce the time required to troubleshoot the task failure.
To emit lineage data from Snowflake:
- Add a Snowflake connection to Airflow. See Snowflake connection.
- Run an Airflow DAG or task with the `SnowflakeOperator` or `SnowflakeOperatorAsync`. These operators are officially supported by OpenLineage and don't require additional configuration. If you don't run Airflow on Astro, see Extract lineage data from external systems to Astro.
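For example, the following DAG sketch runs a single query with the `SnowflakeOperator`. It assumes an Airflow connection named `snowflake_default` and a table that your Snowflake role can query; both are placeholders to adjust for your environment.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="snowflake_lineage_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    # Lineage for this task is extracted automatically by the SnowflakeOperator
    # extractor; no lineage-specific configuration is required on Astro.
    count_orders = SnowflakeOperator(
        task_id="count_orders",
        snowflake_conn_id="snowflake_default",
        sql="SELECT COUNT(*) FROM <your_database>.<your_schema>.<your_table>;",
    )
```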
Data collected
When you run an Airflow task with the `SnowflakeOperator`, the following data is collected:
- Task duration
- SQL queries. For a list of supported queries, see the OpenLineage tests repository.
- Query duration. This is different from the Airflow task duration.
- Input datasets
- Output datasets
- Quality metrics based on dataset and column-level checks, including successes and failures per run
To view this data in the Cloud UI, click Lineage, select a SnowflakeOperator task, and then click the dataset. See View data lineage.
Airflow tasks run with the `SnowflakeOperator` emit SQL source code that you can view in the Cloud UI. See View SQL source code.
OpenLineage and Databricks with Airflow
Use the information provided here to set up lineage collection for Spark running on a Databricks cluster.
Prerequisites
- A Databricks cluster.
- Your Astro base domain.
- Your Organization's OpenLineage API key.
Setup
1. In your Databricks File System (DBFS), create a new directory at `dbfs:/databricks/openlineage/`.
2. Download the latest OpenLineage `jar` file to the new directory. See Maven Central Repository.
3. Download the `open-lineage-init-script.sh` file to the new directory. See OpenLineage GitHub.
4. In Databricks, create a cluster-scoped init script that points to `dbfs:/databricks/openlineage/open-lineage-init-script.sh` so that the `openlineage-spark` library is installed at cluster initialization.
5. In the cluster configuration page for your Databricks cluster, specify the following Spark configuration:
```text
spark.driver.extraJavaOptions -Djava.security.properties=
spark.executor.extraJavaOptions -Djava.security.properties=
spark.openlineage.url https://<your-astro-base-domain>
spark.openlineage.apiKey <your-lineage-api-key>
spark.openlineage.namespace <NAMESPACE_NAME> // Astronomer recommends using a meaningful namespace like `spark-dev` or `spark-prod`.
```
Note: You override the JVM security properties for the Spark driver and executor with an empty string because some TLS algorithms are disabled by default. For more information, see this discussion.
After you save this configuration, lineage is enabled for all Spark jobs running on your cluster.
Verify Setup
To test that lineage was configured correctly on your Databricks cluster, run a test Spark job on Databricks. After your job runs, click Lineage in the Cloud UI and then click Runs in the left menu. If your configuration is successful, your Spark job appears in the table of most recent runs. Click a job run to see it within a lineage graph.
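If you don't have a job handy, a simple read-and-write job like the following sketch is enough to produce input and output datasets on the lineage graph. The paths are placeholders for data in your own workspace.

```python
# Minimal smoke-test job: read one dataset and write another so that the
# OpenLineageSparkListener has inputs and outputs to report.
# Replace the placeholder paths with locations in your own workspace.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("openlineage-smoke-test").getOrCreate()

orders = spark.read.option("header", "true").csv("dbfs:/<input-path>/orders.csv")
orders.write.mode("overwrite").parquet("dbfs:/<output-path>/orders_parquet/")
```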
OpenLineage and dbt Core with Airflow
Use the information provided here to set up lineage collection for dbt Core tasks. To learn how to create and productionize dbt tasks in Airflow, and how to automatically create dbt Core tasks based on a manifest, see Orchestrate dbt with Airflow.
If your organization wants to orchestrate dbt Cloud jobs with Airflow, contact Astronomer support.
Prerequisites
- A dbt project.
- The dbt CLI v0.20+.
Setup
1. Add the following line to the `requirements.txt` file of your Astro project: `openlineage-dbt`
2. Run the following command to generate the `catalog.json` file for your dbt project: `dbt docs generate`
3. In your dbt project, run the OpenLineage wrapper script in place of the `dbt run` command: `dbt-ol run`
4. Optional. Run the following command to test your setup: `dbt-ol test`
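If you orchestrate dbt from Airflow on Astro, you can run the same wrapper from a task. The following sketch uses a `BashOperator` and assumes the dbt project is stored at `include/dbt` in your Astro project; the path and DAG names are illustrative only.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_lineage_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    # `dbt-ol run` wraps `dbt run` and emits OpenLineage events for each model.
    # The project directory below is an assumed location; adjust it to match
    # where your dbt project lives in the Astro project image.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt-ol run --project-dir /usr/local/airflow/include/dbt",
    )
```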
Verify setup
To confirm that your setup is successful, run a dbt model in your project. After you run this model, click Lineage in the Cloud UI and then click Runs in the left menu. If the setup is successful, the run that you triggered appears in the table of most recent runs.
OpenLineage and Great Expectations with Airflow
Use the information provided here to set up lineage collection for a running Great Expectations suite.
Prerequisites
- A Great Expectations Data Context
- If using a Checkpoint or Checkpoint config, your Astro base domain and OpenLineage API key.
Setup
1. Make your Data Context accessible to your DAGs. For most use cases, Astronomer recommends adding the Data Context to your Astro project `include` folder. The `GreatExpectationsOperator` will access `include/great_expectations/great_expectations.yml` and use the configuration to run your Expectations. Then, add the following lines to your DAGs:

   ```python
   # Required imports for Great Expectations
   import os
   from pathlib import Path

   from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

   # Set base path for Data Context
   base_path = Path(__file__).parents[2]

   ...

   # Example task using GreatExpectationsOperator
   ge_task = GreatExpectationsOperator(
       task_id="ge_task",
       # Set directory for the Data Context
       ge_root_dir=os.path.join(base_path, "include", "great_expectations"),
       ...
   )
   ```

   If you use the `GreatExpectationsOperator` version 0.2.0 or later and don't provide a Checkpoint file or Checkpoint Config, you can skip steps 2 and 3.

2. In each of your Checkpoint files, add `OpenLineageValidationAction` to your `action_list` like in the following example:

   ```yaml
   name: my.simple.chk
   config_version: 1.0
   template_name:
   module_name: great_expectations.checkpoint
   class_name: Checkpoint
   run_name_template:
   expectation_suite_name:
   batch_request: {}
   action_list:
     - name: open_lineage
       action:
         class_name: OpenLineageValidationAction
         module_name: openlineage.common.provider.great_expectations
         openlineage_host: https://astro-<your-astro-base-domain>.datakin.com
         openlineage_apiKey: <your-openlineage-api-key>
         openlineage_namespace: <namespace-name> # Replace with your job namespace; Astronomer recommends using a meaningful namespace such as `dev` or `prod`.
         job_name: validate_task_name
   ```

3. Deploy your changes to Astro. See Deploy code.
Verify
To confirm that your setup is successful, click Lineage in the Cloud UI and then click Issues in the left menu. Recent data quality assertion issues appear in the All Issues table.
If your code hasn't produced any data quality assertion issues, use the search bar to search for a dataset and view its node on the lineage graph for a recent job run. Click Quality to view metrics and assertion pass or fail counts.
OpenLineage and Spark
Use the information provided here to set up lineage collection for Spark.
Prerequisites
- A Spark application.
- A Spark job.
- Your Astro base domain.
- Your Organization's OpenLineage API key.
Setup
In your Spark application, set the following properties to configure your lineage endpoint, install the `openlineage-spark` library, and configure an `OpenLineageSparkListener`:
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.2.+')
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
    .config('spark.openlineage.host', 'https://astro-<your-astro-base-domain>.datakin.com')
    .config('spark.openlineage.apiKey', '<your-openlineage-api-key>')
    # Replace <namespace-name> with the name of your Spark cluster.
    # Astronomer recommends using a meaningful namespace such as `spark-dev` or `spark-prod`.
    .config('spark.openlineage.namespace', '<namespace-name>')
    .getOrCreate()
)
```
Verify
To confirm that your setup is successful, run a Spark job after you save your configuration. After the job runs, click Lineage in the Cloud UI and then click Runs in the left menu. Your recent Spark job run appears in the table of most recent runs.
View SQL source code
The SQL source code view for supported Airflow operators in the Cloud UI Lineage page is off by default for all Workspace users. To enable the source code view, set the following environment variable for each Astro Deployment:
- Key: `OPENLINEAGE_AIRFLOW_DISABLE_SOURCE_CODE`
- Value: `False`
Astronomer recommends enabling this feature only for Deployments with non-sensitive code. For more information about Workspace permissions, see Workspace roles.