Fivetran is a popular ELT platform that automates ingesting data from a variety of sources into a database, offering pre-built integrations for many common data tools. Using Airflow with Fivetran allows you to schedule your Fivetran syncs based on events in your larger data ecosystem, as well as trigger downstream actions after a sync has finished.
In this tutorial, you'll learn how to install and use the Airflow Fivetran provider to submit and monitor Fivetran syncs.
There are multiple resources for learning about this topic. See also:
Time to complete
This tutorial takes approximately 1 hour to complete.
To get the most out of this tutorial, make sure you have an understanding of:
- The basics of Fivetran. See Fivetran's Getting started documentation.
- Airflow fundamentals, such as writing DAGs and defining tasks. See Get started with Apache Airflow.
- Airflow operators. See Operators 101.
- A Fivetran account. Fivetran offers a 14-day free trial for new customers.
- The Astro CLI.
- A GitHub account with the permissions for the following GitHub scopes:
Step 1: Configure your Astro project
An Astro project contains all of the files you need to run Airflow locally.
Create a new Astro project:
$ mkdir astro-fivetran-project && cd astro-fivetran-project
$ astro dev init
Add the Fivetran provider package to your
Run the following command to start your project in a local environment:
astro dev start
Step 2: Create a new private GitHub repository
For this tutorial you will use metadata from a GitHub repository as your data source. We recommend using a new repository to prevent unintentionally ingesting large amounts of data.
- Create a new private GitHub repository called
- Commit at least one change to the repository. The content of your commit does not matter for this tutorial.
Step 3: Connect Fivetran to a destination
Fivetran needs at least one destination to be configured in order to create syncs. A destination is a relational database where your ingested data will be loaded into.
Log in to your Fivetran account.
Click Destinations in the left menu, then click ADD DESTINATION in the upper right corner.
Choose either Connect your destination or I don't have one. For this tutorial you can either connect to an existing data warehouse by following the relevant Fivetran documentation, or use a Fivetran-managed BigQuery service. In this tutorial we will use Fivetran's managed BigQuery service by selecting I don't have one and then clicking CONTINUE SETUP.
Configure your destination connection. You can choose any configuration. Astronomer recommends using UTC as your timezone in all data tools as a best practice. Click SAVE & TEST to save your destination.
Click Continue to get back to the list of your destinations.
Step 4: Configure a Fivetran connector
Fivetran needs at least one connector to be configured in order to create syncs. A Fivetran connector reaches out to a specific data source, receives data from it and writes it to your destination. View the Fivetran website for an up-to-date list of connectors.
In Fivetran, click Connectors in the left menu and then click ADD CONNECTOR in the upper right corner.
Select the GitHub Connector and click CONTINUE SETUP.
Configure the GitHub Connector:
- Destination schema:
- Authentication mode: Either OAuth or a Personal Access Token, then click AUTHORIZE. In this tutorial, we use OAuth authentication.
- Destination schema:
Authenticate Fivetran to your Github Account.
In Sync Mode select Sync Specific Repositories, then in Repositories select the
Click SAVE & TEST. After your connection has been tested click CONTINUE.
Click Start Sync to start your initial sync. This initial synchronization will load all historic metadata from your GitHub repository and has to be completed in Fivetran for the sync to become active. Once the Fivetran sync is active, you can set the sync frequency under the Setup tab or run the sync using the Fivetran API.
Step 5: Generate a Fivetran API key
You have now created a Fivetran sync that will extract new metadata from a GitHub repository and load it into a relational database on a time-based schedule. But to run the sync every time a specific event happens in your data ecosystem, you need to use Airflow for orchestration.
To connect Airflow to Fivetran, create a Fivetran API key.
In Fivetran, open your user account and click API Key.
Click Generate API Key. Copy the API key information to a secure place for later.
Step 6: Create an Airflow connection to Fivetran
In a web browser, go to
localhost:8080to access the Airflow UI.
Click Admin -> Connections -> + to create a new connection.
Name your connection
fivetran_connand select the Fivetran connection type. Provide your Fivetran API key and Fivetran API secret. If the Fivetran connection type isn't available, try restarting your Airflow instance with
astro dev restartto ensure the contents of
requirements.txthave been installed.
Step 7: Create your Airflow DAG
For this tutorial you will create a DAG that triggers your Fivetran sync to ingest the GitHub repo metadata to your destination.
astro-fivetran-projectin a code-editor.
dagsfolder add a new Python file called
Copy and paste the following DAG code into the file:
from airflow.decorators import dag, task
from pendulum import datetime
from fivetran_provider_async.operators import FivetranOperatorAsync
FIVETRAN_CONNECTOR_ID = "<your Fivetran connector ID>"
GITHUB_REPOSITORY = "<your GitHub handle>/airflow-fivetran-tutorial"
TAG_NAME = "sync-metadata"
@dag(start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False)
run_fivetran_sync = FivetranOperatorAsync(
upstream() >> run_fivetran_sync >> downstream()
This DAG contains three tasks:
upstreamtask runs before the Fivetran sync job is run. This task could contain a sensor or deferrable operator waiting for an action to be completed in another data tool.
run_fivetran_synctask uses the FivetranOperatorAsync to trigger the Fivetran connector specified as
FIVETRAN_CONNECTOR_IDas soon as the
upstreamtask has completed successfully.
downstreamtask runs after the Fivetran sync job has finished. Commonly, tasks containing data transformations on the data loaded into your Fivetran destination will be set as downstream tasks.
FIVETRAN_CONNECTOR_IDvariable with the ID of your connector. You can find the ID of your connector in the Fivetran UI under the Setup tab:
Add your GitHub username to the
Save your DAG file with the changed variable names.
The FivetranOperatorAsync is one of many deferrable operators. Instead of taking up a worker slot, these operators will hand their task to the Airflow Triggerer component while waiting for a condition to be fulfilled. For longer running tasks, this can result in cost savings and greater scalability as more worker slots are available.
Step 8: Run your DAG
In the Airflow UI, unpause your DAG to start its last scheduled DAG run with a logical date of yesterday.
Click on the
run_fivetran_synctask in the Airflow Grid View. This task will be in a deferred state (violet square) until the Fivetran sync is completed.
After the DAG run has finished successfully, verify in your Fivetran UI that the sync shows an additional
Manual Update Triggeredentry in its
(Optional) Step 9: Visualize your commits with Tableau
As a bonus, you can implement data observability into your data pipeline. If you are using a custom Fivetran destination, you can simply view your data directly and connect your favorite BI tool to it. If you followed this tutorial using Fivetran's managed BigQuery service, you can add a BI tool like Tableau to view your data.
In the Fivetran UI go to Destinations and select your warehouse, click the BI tools tab, and click + BI tool.
Authenticate to Tableau. See Fivetran documentation. Note that you have to authorize the same Google user to Tableau as you authorized to the Fivetran managed BigQuery service.
Create a new Tableau worksheet either in Tableau desktop or in Tableau Cloud.
Connect your Tableau worksheet to the BigQuery project created by the Fivetran managed service. Make sure you are using the same Google user you authorized in Fivetran. See Tableau documentation.
You can now visualize all elements of the GitHub repository metadata that has been loaded into BigQuery. Using the commit table, drag and drop the commit(Count) item from the Tables section to the
Rowslist and the Committer Date to the
Columnslist. Change the grain of Committer Date displayed to Minute by clicking on the right side of the element in the Columns list. This creates a line graph showing how many commits were made each minute in the
airflow-fivetran-tutorialrepository. The following screenshot shows that two commits were made in separate minutes:
Add a few commits to your
my_fivetran_dagmanually by clicking on the play button in the Airflow UI.
View your updated dashboard with commits per minute for the aggregated hour.
Using Airflow with Fivetran allows you to embed your Fivetran syncs in your larger data ecosystem, making the scheduling of Fivetran syncs event- and data-driven. The
airflow-provider-fivetran-async offers asynchronous capabilities to make your architecture more scalable and efficient.