A new Python package from Fivetran and Astronomer enables connector management in Airflow.
We have previously discussed orchestration tools that can be used to extend and manage ELT pipelines. Since then, we’ve been hard at work developing Fivetran integrations for these tools so that data engineers can benefit from the open source projects while still experiencing the ease of use that brought them to Fivetran.
Now, we are happy to introduce our integration for one of these tools, the Fivetran Provider for Apache Airflow!
Astronomer helps organizations adopt Apache Airflow. As part of these efforts, they have just released the Astronomer Registry, a hub for Airflow providers designed to bridge the gap between the Airflow community and the broader data ecosystem. If you are familiar with dbt’s hub for community contributions and Fivetran's dbt packages, this registry may look similar.
Fivetran's developer relations team partnered with Astronomer on the launch, developing a provider that lets users start and monitor Fivetran data syncs from within Airflow. Once it is configured with an API key, an API secret, and your connector IDs, the provider combines automated data ingestion from Fivetran with data pipeline orchestration from Airflow. Our users are seeing three main benefits from this provider:
Airflow gives you fine-grained control over when things happen, along with awareness of the tasks that make up your pipelines, their dependencies, and their execution. This makes it possible to coordinate data transformations across incoming data sources. You can also prioritize synchronizations, which helps if you are bringing in more data than your warehouse can handle concurrently, if there is important data that you want pushed first, or if there are long jobs that you want to run first or last.
If you are orchestrating ELT with Fivetran in Airflow, you inherit all the benefits of using Fivetran, such as automatic data updates and schema migrations. We see a lot of Airflow users building out this functionality in Python and spending time fixing and rewriting extract and load tasks. The Fivetran provider will allow you to focus on building new DAGs instead of constantly fixing broken ones, while taking advantage of the scale Airflow can provide.
This synchronization of syncs enables one of the top feature requests we receive at Fivetran: the ability to trigger data transformations from Fivetran syncs. Airflow sensors monitor Fivetran metadata and resolve their tasks as soon as the sync is complete. This matters because transform jobs that run too late can cause latency problems and SLA issues, while transformations that start too early may overlap with EL jobs and cause missing data and integrity issues.
Fivetran users aren’t just moving data around. Things are happening both before Fivetran loads data and after dbt transforms it, and Airflow provides a single space to manage everything that is happening with your data. Airflow also provides a single space for various data practitioners to collaborate. A data engineer can build data models without detailed knowledge of the machine learning models a data scientist is building, and vice versa.
Airflow defines data pipelines as directed acyclic graphs, or DAGs, that are built mostly of tasks called Operators and Sensors. The Fivetran Provider enables the creation of FivetranOperators and FivetranSensors. As mentioned earlier, all that is needed to run Fivetran in Airflow are the API key, API secret, and connector ID(s) from Fivetran, and the Fivetran Provider from the Astronomer Registry. The API key and API secret are configured as an Airflow Connection, and the connector ID(s) are configured as Airflow Variables.
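To make the "directed acyclic graph" idea concrete without requiring an Airflow installation, here is a minimal sketch that uses only the Python standard library. The task names are hypothetical and this is not how Airflow itself stores a DAG; it only illustrates that each task runs after the tasks it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names. Each key maps to the set of tasks it depends on.
pipeline = {
    "start_fivetran_sync": set(),
    "wait_for_fivetran_sync": {"start_fivetran_sync"},
    "run_transformations": {"wait_for_fivetran_sync"},
}

# static_order() yields a task only after all of its upstream tasks,
# and raises graphlib.CycleError if the graph is not acyclic.
execution_order = list(TopologicalSorter(pipeline).static_order())
print(execution_order)
```

In a real Airflow DAG, the first task would be an operator, the second a sensor, and the third whatever transformation step you run downstream.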
In Airflow, an operator represents a single, ideally idempotent, task. Operators determine what actually executes when your DAG runs. The FivetranOperator validates that the connector ID supplied exists and is valid for the API Key and Secret given, changes the connector’s schedule type to run on a schedule defined by Airflow instead of Fivetran, and finally calls for the start of a Fivetran sync.
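The three steps described above can be sketched as a plan of REST calls. This is an illustration, not the provider's actual code: the function name is hypothetical, and the base URL, endpoint paths, and the `schedule_type` field are assumptions that should be checked against the Fivetran REST API reference. Nothing is sent over the network.

```python
import json
from typing import List, Optional, Tuple

API_BASE = "https://api.fivetran.com/v1"  # assumed base URL

# (HTTP method, URL, JSON body or None)
Call = Tuple[str, str, Optional[str]]


def plan_operator_calls(connector_id: str) -> List[Call]:
    """Return the calls a sync-triggering task might make, in order."""
    connector_url = f"{API_BASE}/connectors/{connector_id}"
    return [
        # 1. Validate that the connector exists for these credentials.
        ("GET", connector_url, None),
        # 2. Hand scheduling over to Airflow (assumed field and value).
        ("PATCH", connector_url, json.dumps({"schedule_type": "manual"})),
        # 3. Ask Fivetran to start a sync (assumed endpoint path).
        ("POST", f"{connector_url}/sync", None),
    ]


calls = plan_operator_calls("my_connector_id")  # hypothetical connector ID
for method, url, body in calls:
    print(method, url, body)
```

Because repeating these calls leaves the connector in the same scheduling state, the task stays close to the idempotent ideal mentioned above.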
Sensors are a special kind of operator. When they run, they check whether certain criteria are met before they complete and let their downstream tasks execute. A FivetranSensor will monitor the status of a Fivetran sync and allow the DAG to progress as soon as data has been fully loaded into a warehouse.
The FivetranHook is included in our provider, but should be accessed and modified by advanced users only. This code is where Airflow interacts with the Fivetran REST API, so that the complicated logic to start and monitor Fivetran jobs is abstracted away from the FivetranOperator and FivetranSensor in order to make them as readable and easy to use as possible.
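The polling behavior of a sensor can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the FivetranSensor implementation: the function and parameter names are hypothetical, and it assumes completion is detected by a success timestamp that differs from the one seen before the sync started.

```python
import time
from typing import Callable, Optional


def wait_for_sync(
    get_succeeded_at: Callable[[], Optional[str]],
    previous: Optional[str],
    poke_interval: float = 60.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Block until a new success timestamp appears, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        current = get_succeeded_at()
        # A timestamp different from the earlier one means a new sync finished.
        if current is not None and current != previous:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("sync did not complete before the timeout")
        sleep(poke_interval)


# Usage with fake data: the third check reports a newer completion time.
fake_responses = iter(
    ["2021-01-01T00:00:00Z", "2021-01-01T00:00:00Z", "2021-01-02T00:00:00Z"]
)
finished_at = wait_for_sync(
    lambda: next(fake_responses),
    previous="2021-01-01T00:00:00Z",
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(finished_at)
```

Downstream transformation tasks start only after this call returns, which is what keeps EL and T jobs from overlapping.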
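To give a rough sense of what a hook abstracts away, here is a hypothetical, heavily simplified stand-in written with the standard library only. The class name, method name, and base URL are assumptions for illustration, and it assumes the REST API accepts HTTP Basic authentication built from the API key and secret. It builds requests but does not send them.

```python
import base64
import json
import urllib.request
from typing import Optional


class MiniFivetranHook:
    """Hypothetical sketch; not the provider's FivetranHook."""

    def __init__(self, api_key: str, api_secret: str,
                 base_url: str = "https://api.fivetran.com/v1") -> None:
        # HTTP Basic auth: base64("key:secret") in the Authorization header.
        token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
        self._headers = {
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        }
        self._base_url = base_url

    def build_request(self, method: str, path: str,
                      payload: Optional[dict] = None) -> urllib.request.Request:
        """Prepare (but do not send) an authenticated request."""
        data = json.dumps(payload).encode() if payload is not None else None
        url = f"{self._base_url}/{path.lstrip('/')}"
        return urllib.request.Request(
            url, data=data, headers=self._headers, method=method
        )


hook = MiniFivetranHook("my_key", "my_secret")  # placeholder credentials
request = hook.build_request("GET", "connectors/my_connector_id")
print(request.get_method(), request.full_url)
```

Keeping credentials and URL handling in one place like this is what lets operator and sensor code stay short and readable.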
To get started, install the Fivetran Provider package in your Airflow environment by running
pip install airflow-provider-fivetran
and check out all the other Airflow integrations in the new Astronomer Registry!