The smart payment provider orchestrates ELT in Airflow to generate massive cost savings
Billie.io is a business built on the speed of its innovation. The fintech startup, based in Berlin, is reinventing the way that businesses handle payments: through Billie, SMEs can get instant financing for each invoice (i.e. they do not need to wait 90 days to get paid by their customers). Billie.io also outsources the collections process and coverage of default risk.
If we look under the hood at Billie’s data architecture, the picture is just as innovative. Since co-founding the company in February 2017, Igor Chtivelband, Billie’s Vice President of Data, has been quietly building an archetype for data-driven businesses to emulate. As discussed in a previous Case Study, Igor’s team at Billie uses Fivetran to ingest data to Snowflake.
“We use Fivetran to copy data from various data sources, be it Google Analytics, Salesforce, our production database, LinkedIn, Facebook, to our data warehouse, which is Snowflake. We completely delegate this part, we don't want to deal with it,” says Igor. “However, if we want to build complicated logic with some kind of preconditions, ‘If that, do this, do this first, then do that, calculate this and then calculate that,’ this is where Airflow is useful.”
Apache Airflow is a community-managed platform to programmatically author, schedule and monitor workflows, allowing complex orchestration and automation. As discussed in our blog post Scheduling vs Orchestrating, there are a number of clear benefits to bringing in Airflow to orchestrate ELT.
“That's where Airflow helps us,” says Igor, “and this is the reason why we use a combination of both. So you can see Airflow as an orchestrator for these processes. Airflow decides, ‘Now it's time to synchronize data from Google Analytics to our data warehouse.’”
Airflow gives Igor’s team at Billie fine-grained control over when things happen, and visibility into the tasks that make up each pipeline, their dependencies and their execution. This makes it possible to coordinate data transformations across incoming data sources.
Above: Billie’s Extract, Load and Transformation process is orchestrated by Apache Airflow
For Billie, many Fivetran connectors are used as-is, but Airflow is essential in the Extract, Load and Transform process that moves their production database into the data warehouse. Igor treats each of the steps as a ‘segment’, which can then be dynamically scheduled or managed through Airflow. For example, the Fivetran segment (‘Extract’ and ‘Load’) can be scheduled and run independently of the transformation layer, to avoid latency problems or SLA issues, or to prevent transformations from running before the data has arrived.
During business hours, Billie runs its most important data pipelines every five minutes. Outside of business hours that frequency gets dialed back to every two hours, significantly cutting the Snowflake compute resources used. This is possible due to Airflow's flexible and configurable data pipeline scheduling.
A simple shift from a five-minute to a two-hour sync can save the business a lot of money in the long run.
“It's a very simple trick. Really, it's a no-brainer. We're talking approximately 20% reduction of our bills. On top of that, we also feel better with ourselves, because we saved money, we saved your resources. We help the planet, because the servers are not running, so no one has to cool them down,” says Igor.
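The business-hours schedule described above can be sketched in plain Python. This is only an illustration, not Billie's actual configuration: the article gives the two frequencies (five minutes and two hours) but not the exact business-hours window, so the 08:00 to 18:00, Monday to Friday window and all function names below are assumptions.

```python
from datetime import datetime, time, timedelta

# Assumed for illustration: the source does not state the business-hours window.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)

# Frequencies taken from the article.
PEAK_INTERVAL = timedelta(minutes=5)
OFF_PEAK_INTERVAL = timedelta(hours=2)


def is_business_hours(moment: datetime) -> bool:
    """True on Monday-Friday from BUSINESS_START (inclusive) to BUSINESS_END (exclusive)."""
    return moment.weekday() < 5 and BUSINESS_START <= moment.time() < BUSINESS_END


def sync_interval(moment: datetime) -> timedelta:
    """Return how long to wait before the next pipeline run."""
    return PEAK_INTERVAL if is_business_hours(moment) else OFF_PEAK_INTERVAL


def runs_in_day(day_start: datetime) -> int:
    """Count pipeline runs in the 24 hours starting at day_start."""
    day_end = day_start + timedelta(days=1)
    moment, runs = day_start, 0
    while moment < day_end:
        runs += 1
        moment += sync_interval(moment)
    return runs
```

Under these assumed hours, a weekday has 127 runs, compared with 288 for a constant five-minute schedule. The sketch counts runs only; it does not model how Snowflake bills compute, so it says nothing by itself about the size of the saving.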
Igor and his team are now using Airflow and Fivetran’s Operators, Sensors and Hooks to create complex and valuable workflows for the business.
Operators are used to execute tasks in Airflow. For Fivetran, this means a FivetranOperator starts a Fivetran data sync. The flexibility of Airflow's scheduling allows Billie to easily and dynamically change when a FivetranOperator is called, giving them fine control of their data warehousing costs.
Sensors check whether certain criteria are met before they complete and let their downstream tasks execute. A FivetranSensor monitors the status of a Fivetran sync and allows the DAG to progress as soon as data has been fully loaded into the warehouse. “This gives us an ability to understand, ‘Is the sync done or not?’” says Igor. “And if the sync’s done, do something immediately in Airflow.”
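The check-then-proceed behaviour of a sensor can be illustrated with a small, library-free sketch. This is not Airflow's or Fivetran's implementation; `wait_until`, its parameters and the fake status sequence are hypothetical names chosen for the example.

```python
import time
from typing import Callable


def wait_until(
    criteria_met: Callable[[], bool],
    poke_interval: float = 60.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Check criteria_met repeatedly until it returns True.

    Returns the number of checks made. Raises TimeoutError if another wait
    would exceed `timeout` seconds of total waiting.
    """
    waited = 0.0
    checks = 0
    while True:
        checks += 1
        if criteria_met():
            return checks  # downstream work may start now
        if waited + poke_interval > timeout:
            raise TimeoutError(f"criteria not met after {waited:.0f}s")
        sleep(poke_interval)
        waited += poke_interval


# Simulated sync status: not done on the first two checks, done on the third.
statuses = iter([False, False, True])
checks_needed = wait_until(
    lambda: next(statuses),
    poke_interval=5,
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
```

The point of the pattern is that whatever depends on the data starts right after the first successful check, rather than at a fixed time chosen in advance.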
Together, Billie uses these tools for report scheduling – a smart workflow to ensure that data is fully synced and available before running a reporting process on top of Snowflake.
“Every morning we have time-critical operations. We have to generate reports as soon as possible, if the data is available. We cannot send our business partners this report before we have the data in Snowflake. So we use the ‘Sensor’ operator, in order to understand, ‘Can we do it yet or not?’ And that's much better than just polling every five minutes, to understand if the data is there.”
Today, thousands of companies stand to benefit from orchestrating Fivetran ingestion and transformations with Airflow. Igor has some sage advice for teams getting started:
“I would say it's one of the cases when the appetite comes with food. Once you migrate from traditional data warehouse technologies like Teradata to Snowflake, you realize how much easier it is. The next step is Fivetran and dbt. My recommendation is to try cherry picking an easy experiment, and then if it works, then it's easier to become convinced. You can also show it to your boss like, ‘We've achieved this in one week, and our data analysts are happy.’”
To get started, visit the Fivetran GitHub repo, where you’ll find documentation on getting set up across a variety of platforms, from Google Cloud to AWS and Astronomer. You can also install the Fivetran Provider package directly in your Airflow environment by running pip install airflow-provider-fivetran. Read more on Fivetran’s blog.