A recent discussion touched on everything from the history of ETL to metadata management and the future of data engineering.
In a wide-ranging interview with Jeff Meyerson, host of the Software Engineering Daily podcast, Fivetran CEO George Fraser explained the critical difference between column-store and row-store data warehouses, described the distinctive Fivetran engineering approach, and shared his belief that data engineering will evolve away from incidental complexity and toward functional simplicity. Here are the highlights, edited and condensed for clarity.
Jeff Meyerson: We’re going to be talking about data integration and data warehousing. Let’s start with the term “data warehouse.” What does it mean?
George Fraser: A data warehouse is a database that you use to store all of the data about everything that has ever happened in your company. Some people feel strongly that a data warehouse has to have a star schema, or a particular kind of technology under it, but in my opinion it’s just a database that has that role within your company.
JM: Tell me about data warehousing pre- and post-cloud.
GF: Before the cloud, data warehouses were on-prem, and mostly used by the largest companies because they were very expensive. Many were using the same technology that they used for their production databases — they’d use the same kind of database to run their website or retail inventory as they would to operate their data warehouse. Around the same time as the cloud revolution happened, another revolution happened in data warehouses — the widespread adoption of column stores, which are a fundamentally different way of building a relational database that is much faster for analytics queries. When data warehouses moved to the cloud and column-store databases got widespread adoption and became far cheaper, a lot more companies started using data warehouses.
JM: Why do people put data in a data warehouse today?
GF: Businesses use many different tools. They will typically have a finance tool, like QuickBooks or NetSuite, and payment systems like Stripe or Square. They might have software development tools like GitHub or Jira. They’ll have marketing tools — sometimes dozens or hundreds. If you want to know what is happening inside your business, the first step is usually to replicate all of that data into a single database, which is your data warehouse, so you can write a query that references any piece of data about anything that happens.
JM: Is the data in the data warehouse sitting in memory? Is it on disk? Do we know what it's doing?
GF: So the data in a data warehouse will be sitting on disk until you query it and then it will be pipelined in the memory to run your query. A lot of data warehouses today, not only do they store the data on disk, but they actually store the data in a cloud-based blob store, like S3. Snowflake and BigQuery work like this. When the data’s cold, it resides in blob storage, and when you start querying it, they load it into local disk. Then right at the moment when you run the query, they load it into memory.
JM: If it’s on disk, why not just leave it in our database until we’re ready to query it? What advantage is the data warehouse giving us?
GF: The columnar format of the data in the warehouse is very optimized for analytics. That’s where the term “column store” comes from. So if I was writing a table of people, instead of writing the person's identifier and name and email address, and then writing the next person's ID and name and email address, I would actually write all of the IDs in a column, then all the names, then all the email addresses. For very complex and fascinating reasons, this layout lends itself to writing a very fast query processor.
JM: So if you want to aggregate a total of an entire column, you would be able to iterate through every cell in that column without skipping over the other information in the rows, which is going to be much faster.
GF: Exactly right. That is the first and most obvious advantage of the columnar format. If you aren’t interested in data in a column, you can simply not read it and you don't even have to skip over it. It turns out that in order to fully take advantage of the column-store concept, you need not just a different file format but also a different query planner. You need a block-oriented query processor that, if you look deep inside the code, has the inner function operating over vectors instead of tuples. So every level of the column-store database is implemented in different and in many cases opposite ways.
JM: One aspect of modern data platforms is that we’re not just interfacing with databases. When we use APIs like Segment, Salesforce and Google Analytics, we’re accumulating data, which is in a database at these companies. Explain how these API-based systems have changed the world of the data platform.
GF: The fundamental change is that people use more tools to operate their business, and these tools all have APIs. The data is scattered in different places, and all of the APIs work differently, so it creates a lot more incidental complexity. The best way to create a centralized system to access data from all your APIs and data stores is to make a copy to replicate it into one data warehouse. The reason is speed. APIs are too slow to run arbitrary queries. If you're lucky and it's a simple query, it may return fast, but it could take hours. That doesn't work when you're trying to answer questions about your business. The same is true for the GitHub API, the NetSuite API, the Salesforce API, you name it. So you need to copy all that data into a faster database to access it at acceptable speeds. That's the problem Fivetran is designed to solve in the simplest possible way. We have prebuilt connectors to a little over 150 data sources, including all the major databases, and we replicate the data one-to-one into one database. We like to think we solve the first and hardest step of integrating your data, which is getting it all in one place.
JM: Explain how Fivetran automates the data integration process.
GF: Well, our data connectors do not work like conventional ETL tools, where the user specifies all of the details of what API endpoints to call, how to normalize the data, how to load into the destination. Fivetran connectors are automated. So when we talk to a data source like Salesforce, the first thing we do is call up the metadata API and ask Salesforce what are all of the objects and fields available in the Salesforce account. Then we write the requests against the Salesforce API dynamically, so we’re getting all of the data out of this particular customer’s Salesforce source. Then we ingest that data, again, through an automated process into the destination data warehouse.
This may seem obvious to software engineers, but believe it or not, this is a very radical approach in the land of ETL. The idea of a zero-touch ETL tool was very different than what was out there when we first set out to do this. But we have found that it is a highly successful approach for our customers. They get a perfect replica of all of their data in their data warehouse and they can transform and model that data nondestructively by writing SQL queries. SQL turns out to be a great language for managing data, cleaning it up, and reorganizing it into a format that's ready for analysts to query.
JM: So your data connectors essentially eliminate the work of the engineer who would write a script to query an API like Salesforce a bunch of times for the data, and just put it into the data warehouse in the proper format.
GF: That's exactly right. There's no magic to it. We’re just studying, for example, the Salesforce API and writing code just as you would if you were building a connector. The big difference is that we write general-purpose connectors that anyone can use. So we use the Salesforce metadata API to discover the available objects and ingest all the data in your account. If you then add a new custom field or table, we discover that and add it to your data warehouse automatically. Whereas when people do ETL internally, they typically write a script that queries all the fields and objects that existed in Salesforce when they wrote the script. That’s a lot easier and faster. Fivetran kind of does it the hard way, but the advantage is there are hundreds of customers who use this connector, so we just have to write one perfect Salesforce connector that anyone can use. So there's an economy of scale to the work our engineers do. We have to understand every little quirk of every API endpoint of all the data sources we support, but we only have to figure out all those quirks once, and then every customer can benefit from them.
JM: Let's focus for a moment on the business. This vertical of connectors between a data source and destination, that's a deceptively big business, with a lot of adjacencies to expand into. Tell me about the longer-term vision for Fivetran.
GF: It is an astonishingly wide problem. If you talk to 10 engineers, three of them will have at some point built an internal data pipeline for their company. An astonishing amount of engineering effort goes into writing connectors to NetSuite over and over and over, and they're all really the same. So this is a huge sector, and we’re very optimistic about our prospects. Each year of the company's history has been the fastest-growing year ever, despite the base getting higher and higher. We grew over three times last year. In terms of the future, we’re continuing to work on expanding the scope of our connectors and improving the quality of our connectors. There are a staggering number of corner cases, especially in databases, and we are intent on tracking down every last one so you can plug in any of our supported data sources and it just works for you. That is our mission in software engineering.
There are also a bunch of adjacent opportunities. We think we can add a lot of value around managing metadata because we build the connectors and control them very tightly. We should be able to tell you things like: “This row in your data warehouse came from a particular API call that ran seven months ago against your Salesforce data source.” That kind of information is really valuable. We also think we can help people write transformations. Because our connectors are non-configurable and standardized, the schemas we deliver for API data sources like NetSuite are always the same. Our customers are writing the same queries against the schema over and over, and we think we can write templates and recipes that, for example, transform your normalized NetSuite schema into a dimensionalized schema that's ready for you to throw Tableau at. These SQL queries are not going to be perfect, but they’ll be 80% or more of the way there and save your analytics team a lot of time.
JM: Let's talk about engineering at your company. Tell me more about how you’ve structured the company from an engineering management point of view.
GF: The engineering team spends probably about half of its time working on issues that are shared across all the data sources, whether that's the UI, or the core pipeline code, or the code that talks to the destinations. The other half goes toward writing and maintaining connectors. One of the cool things about engineering at Fivetran is that we’re always writing new connectors, so we’re always building new stuff and learning new lessons, and we’re very rigorous about taking those lessons back to the existing connectors and re-factoring them to reflect our latest best practices. The connectors are definitely on a spectrum of difficulty. So the most difficult connectors are databases, and the least difficult connectors are always startup APIs. They’re always very clean — they’re beautiful. Unfortunately, they usually don't have very many users. But if you look at an API of a company that's a few years old that has some good engineers there, they’re always great. So typically what you'll see for an engineer at Fivetran is they’ll go through a progression where they start out working on the more straightforward connectors, then on core code, working on destinations, working on more difficult connectors like databases. It's really a great place to grow your career as an engineer, because we have this great variety of work.
JM: Last question: How does the world of data engineering evolve in the next decade?
GF: I think it's going to get a lot simpler. The underlying technology has gotten so much faster and so much cheaper that it is possible to do data engineering without this proliferation of tools that predominated in the past. So I think life is actually getting better and easier for data practitioners. They can use really fast, really cheap databases, like Snowflake, BigQuery and Redshift. They can use zero-touch ETL tools, like Fivetran, and they can focus their efforts on the things that really differentiate their business, as supposed to building connectors and query engines over and over and over. So I think we should be optimistic about the future of data engineering.