Skip to main content
Version: 0.27

Access the Airflow Database

Overview

On Astronomer, each Airflow deployment is equipped with a PostgreSQL database that serves as Airflow's underlying metadata database and your Airflow Scheduler's source of truth.

On Astronomer Software, a Postgres Metadata database will be created for each individual Airflow Deployment and hosted within your wider Platform Database.

This guide will cover guidelines for the following:

  • Risks associated with and use cases for Database Access
  • Access to Airflow's Database in Local Development
  • Decode Astronomer's Metadata Secret
  • Access to Airflow's Database on Astronomer
  • Example DAG that incorporates a query to the database

Note: Airflow's "Ad-Hoc Query" feature used to be a common way to test DB connections and query the Airflow Metadata Database via the Airflow UI but was deprecated in 1.10 for security reasons.

Risk Associated with Database Access

As noted above, every Airflow Deployment's Metadata Database on Astronomer is hosted within the Platform Database your team initiated during the install process for Astronomer Software. Given its importance to the Scheduler's performance, it's worth noting the risks associated with accessing Airflow's Database.

We strongly recommend users do not write to the database directly as it can compromise both the integrity of your Airflow Deployment and both of our team's ability to support a user in the case of an issue.

Use Cases

Keeping the above risk in mind, pulling from and reading the database safely can be a great way to gather metadata from your Airflow Deployment that isn't otherwise exposed on Astronomer and Airflow (yet).

A few examples of what you can query for:

  • Completed Task Count
  • Tasks started per hour, per week
  • Task duration

Read below for DB access guidelines both locally and on Astronomer.

Local Access to the Airflow Database

To successfully query from your Airflow Deployment's Database, you'll need to set up your local Postgres connection.

1. Find the airflow_db Connection Object

On the Airflow UI, navigate over to Admin > Connections.

The airflow_db connection is generated by default.

Connections Page

2. Edit the Connection

In the airflow_db connection object:

  • Change the Conn Type to Postgres

  • Add the following connection information:

    ```
    host: postgres
    schema: postgres
    login: postgres
    password: postgres
    port (optional): 5432
    ```

    Your connection should look something like:

Local Connections Details Page

The port will be set to 5342 by default but if you’ve set a custom port, you’ll need to update that here as well.

3. Connect to airflow_db in your DAG

Here's an example DAG, where postgres_conn_id is set to airflow_db:

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from datetime import datetime, timedelta


def get_dag_ids():
postgres_hook = PostgresHook(postgres_conn_id="airflow_db")
records = postgres_hook.get_records(sql="select dag_id from dag")
print(records)


with DAG(
"connect_dag",
start_date=datetime(2019, 1, 1),
max_active_runs=1,
schedule_interval=None,
# catchup=False # enable if you don't want historical dag runs to run
) as dag:

t1 = PythonOperator(
task_id="get_dag_nums",
python_callable=get_dag_ids,
)

Here, the task above grabs all dag_ids stored inside dag Table within Airflow's Metadata Database.

Connect to the DB via a PostgreSQL Client

With the connection information above, you should also be able to connect to the Airflow Database from any PostgreSQL client.

Using psql, a terminal-based front-end to PostgreSQL, run:

psql -h localhost -U postgres -p 5432 -W

When prompted, enter the password from the Connection Object above to access the database.

Alternatively, you can also connect to the DB via the Postgres container itself by exec-ing into it and executing the psql command.

To do so, run:

docker exec -it <postgres_container_id> /bin/bash

psql -U postgres

Access to the Airflow Database on Astronomer

The easiest way to pull from Airflow's Metadata Database on Astronomer is to leverage the AIRFLOW_CONN_AIRFLOW_DB Environment Variable, which we set here.

This Environment Variable, which we set by default, silently enables users to leverage the airflow_db connection. It's worth noting that the connection itself in the Airflow UI will NOT reflect the correct credentials (Conn Type, Host, Schema, Login, Password, Port).

To pull from the Airflow Database, follow the steps below. Note that you do not have to set this Environment Variable yourself and you do not have to populate the airflow_db connection in the Airflow UI.

1. Leave your airflow_db Connection as is

Your airflow_db connection by default will look like the following:

Conn Id: airflow_db
Conn Type: MySQL
Host: mysql
Schema: airflow
Login: root

While this information is incorrect in the Airflow UI, the underlying connection will still succeed, as connections set by an Environment Variable take precedence over connection details in the Software UI and in the Metadata Database itself.

For clarity, we intend to make sure this connection is properly populated in upcoming versions of Astronomer.

2. Call airflow_db in your DAG

You can use the same example DAG outlined above where postgres_conn_id is set to airflow_db.

3. Verify the Connection in your Task Logs

To verify a successful connection, you can inspect the corresponding task log -

Dependencies all met for <TaskInstance: example_dag.get_dag_nums 2020-05-07T19:02:38.022685+00:00 [queued]>

--------------------------------------------------------------------------------
Starting attempt 1 of 1

--------------------------------------------------------------------------------
Executing <Task(PythonOperator): get_dag_nums> on 2020-05-07T19:02:38.022685+00:00
Started process 429 to run task
Running %s on host %s <TaskInstance: example_dag.get_dag_nums 2020-05-07T19:02:38.022685+00:00 [running]> desolate-spectrum-3088-worker-866745d995-rn576
[2020-05-07 19:02:54,033] {base_hook.py:87} INFO - Using connection to: id: airflow_db. Host: desolate-spectrum-3088-pgbouncer.astronomer-desolate-spectrum-3088.svc.cluster.local, Port: 6543, Schema: desolate-spectrum-3088-metadata, Login: desolate_spectrum_3088_airflow, Password: XXXXXXXX, extra: XXXXXXXX
[('example_dag',)]
Done. Returned value was: None

In these logs, you can see that the connection info is properly passed:

Conn Id: airflow_db
Host: desolate-spectrum-3088-pgbouncer
Schema: desolate-spectrum-3088-metadata
Login: desolate_spectrum_3088_airflow
Password: XXXXXXXX
Port: 6543

Decode the Airflow Metadata Secret on Astronomer

On Astronomer, your deployment's Postgres credentials are also stored as a Kubernetes Secret (airflow-metadata) in your deployment's Kubernetes namespace. For users with kubectl access to the cluster who want to properly populate the airflow_db connection in the Airflow UI with the proper credentials, follow our instructions below.

Note: This is entirely optional and is not required, as the AIRFLOW_CONN_AIRFLOW_DB Environment Variable is pre-set on Astronomer and will allow users to reference the airflow_db connection in a DAG even though the connection itself isn't populated in the Airflow UI.

Pre-Requisites

  • Access to your Kubernetes Cluster with permissions to:
    • List Namespaces
    • List Pods
    • List Kubernetes Secrets
  • Kubectl
  • Kubectx (optional)

Pull Postgres Credentials via Kubectl

1. Switch into your Kubernetes Cluster

The rest of this guide will assume the use of kubectx - a command line tool that allows you to easily switch between Kubernetes Clusters and Namespaces.

To start, switch into the Kubernetes Cluster that hosts Astronomer Software.

kubectx

2. List the Namespaces in your Cluster

To list the namespaces on your Kubernetes cluster, run:

kubens

3. Confirm your Deployment's Corresponding Namespace

Find the Kubernetes Namespace that corresponds to the Airflow deployment whose database you'd like credentials to and run:

kubens <NAMESPACE>

Then, run:

kubectl get pods

You should see something like:

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
quasaric-sun-9051-flower-7bbdf98d94-zxxjd 1/1 Running 0 93d
quasaric-sun-9051-2346-pgbouncer-c997bbd9d-dgsjr 2/2 Running 0 2d
quasaric-sun-9051-2346-redis-0 1/1 Running 0 93d
quasaric-sun-9051-2346-scheduler-59f856bd5-d7gl4 1/1 Running 0 3h
quasaric-sun-9051-2346-statsd-5c7d7b6777-x7v4x 1/1 Running 0 93d
quasaric-sun-9051-2346-webserver-56fb447559-gjg8n 1/1 Running 0 3h
quasaric-sun-9051-2346-worker-0 2/2 Running 0 3h

On every Astronomer Deployment, you'll see a Kubernetes Pod for each component - the combination of which depends on the Airflow Executor running in that deployment.

The example above assumes the Celery Executor and so includes an additional component for a Celery Worker, a Redis queue and Flower dashboard.

4. Get Secret

As a next step, you'll have to pull one of multiple Kubernetes Secrets for the Kubernetes Namespace in which your Airflow Deployment lives.

To list those secrets, run:

kubectl get secret

You'll see something like the following:

$ kubectl get secret
NAME TYPE DATA AGE
default-token-fk86l kubernetes.io/service-account-token 3 93d
geocentric-instrument-2346-airflow-metadata Opaque 1 93d
geocentric-instrument-2346-airflow-result-backend Opaque 1 93d
geocentric-instrument-2346-broker-url Opaque 1 93d
geocentric-instrument-2346-env Opaque 0 92d
geocentric-instrument-2346-fernet-key Opaque 1 93d
geocentric-instrument-2346-pgbouncer-config Opaque 2 93d
geocentric-instrument-2346-pgbouncer-stats Opaque 1 93d
geocentric-instrument-2346-redis-password Opaque 1 93d
geocentric-instrument-2346-registry kubernetes.io/dockerconfigjson 1 93d
geocentric-instrument-2346-scheduler-serviceaccount-token-w29bn kubernetes.io/service-account-token 3 93d
geocentric-instrument-2346-airflow-worker-token-gqr4w kubernetes.io/service-account-token 3 93d

The secret we're looking for lives in that "airflow-metadata" pod (for Airflow's Metadata database).

Now, run:

kubectl get secret <airflow metadata pod>

Next, run:

kubectl get secret <airflow metadata pod> -o yaml

This will pull some metadata on the secret itself, including an encoded "connection" string.

apiVersion: v1
data:
connection: cG9zdGdyZXNxbDovL2dlb2NlbnRyaWNfaW5zdHJ1bWVudF8yMzQ2X2FpcmZsb3c6VTJvN3F2VnVsWnZ5cXl2V1hXbTBSSGh1UHlqdk1IT3BAZ2VvY2VudHJpYy1pbnN0cnVtZW50LTIzNDYtcGdib3VuY2VyOjY1NDMvZ2VvY2VudHJpYy1pbnN0cnVtZW50LTIzNDYtbWV0YWRhdGE=
kind: Secret
metadata:
creationTimestamp: "2019-05-02T09:06:11Z"
labels:
chart: airflow
heritage: Tiller
release: geocentric-instrument-2346
workspace: 484c967e-4ec1-4b36-8536-1525813028e1
name: geocentric-instrument-2346-airflow-metadata
namespace: astronomer-cloud-geocentric-instrument-2346
resourceVersion: "47818384"
selfLink: /api/v1/namespaces/astronomer-cloud-geocentric-instrument-2346/secrets/geocentric-instrument-2346-airflow-metadata
uid: a84029d3-0e6d-12e8-b31a-42010a96009f
type: Opaque

6. Decode the Secret

Now, let's grab the "connection" string from the top of that output and decode it:

echo "<connectionstring>" | base64 --decode

You'll get something like the following (don't worry, this is a sample deployment):

postgresql://quasaric_sun_9051_airflow:U2o7qvVulGvyqyvAXWm0RPhuPvjvlHOp@quasaric-sun-9051-pgbouncer:6543echo /quasaric-sun-9051-metadata

7. Save your Credentials

Based on the output above, you can find the connection credentials you're looking for in the following format: login:password@host:port/schema

In this example, that'd be:

- Host: quasaric-sun-9051-pgbouncer
- Schema: quasaric-sun-9051-metadata
- Login: quasaric_sun_9051_airflow
- Password: U2o7qvVulGvyqyvAXWm0RPhuPvjvlHOp
- Port: 6543

8. Update and Confirm your Connection

To finish creating and confirming your connection, enter the credentials from Step 7 into the airflow_db connection in the Airflow UI.

If you want to store this connection and access it programmatically, you're also free to store it in a Secret Backend.

What's Next

For a list of handy queries to reference, check out Useful SQL queries for Apache Airflow.