New to analytics? Don’t know what an ERD is? Read this comprehensive list of important data integration terms.
If you are reading this article, there is a good chance that you are already somewhat familiar with the world of data integration. If you aren’t, this glossary will fill in many gaps, although you should also consider reading A Primer to the World of Data.
The following key terms are frequently used to discuss the subject, especially in the context of automated data integration using Fivetran.
Analytics refers to identifying meaningful patterns in data in order to inform business decision-making. Activities such as creating dashboards, reports and predictive models (i.e. machine learning) all broadly fall under the umbrella of analytics.
In casual usage, it is often interchangeable with terms such as “business intelligence” and “data science.”
At Fivetran, we use analytics templates to help analysts denormalize Fivetran schemas into a form that is suitable for common analytics use cases. Our analyst templates take the form of snippets of code compatible with SQL and dbt (data build tool – see below). They are plug-and-play quick-start blocks that can be used off-the-shelf or customized as needed by your team.
An application programming interface (API) is a protocol used by an application to exchange data and communicate with other applications. In the context of data integration, we are generally talking about API endpoints that provide business data feeds in formats like JSON or XML.
Data connectors (see below) extract data from API endpoints using small units of software called scripts.
APIs enable not only data extraction but also the automated, programmatic operation of an application. Fivetran, for instance, offers an API that allows users to programmatically manage users and connectors.
The emergence of the cloud has resulted in the proliferation of Software-as-a-service (SaaS) applications for businesses. These applications and services cover a huge range of operations such as ecommerce, marketing, payment processing, customer relationship management, HR benefits, team communication, engineering project management, customer service and so forth.
Applications produce a great deal of data about an organization’s operations and activities. A typical organization today might easily use dozens or hundreds of applications. Applications are one of the main sources of data that Fivetran handles for customers.
Big data is often described using Three Vs:
Volume – Scale is the single most obvious consideration. “Big data” is often used to describe petabyte-scale (i.e. hundreds of billions or trillions of records) and larger datasets that are difficult or impossible to manage with a conventional data warehouse. The most sensible way to handle such a scale is with large numbers of machines working in parallel, i.e. distributed processing systems such as the Hadoop ecosystem. These distributed processing systems require intensive data engineering capabilities to use.
Variety – Data can encompass a huge range of data sources and formats. A single organization may use hundreds of applications, each generating potentially invaluable data about an organization’s operations.
It’s also estimated that up to 90% of the world’s data is unstructured and therefore can’t be stored in or queried from a conventional relational database. Unstructured data includes images, sound and video clips, free text, and more.
Velocity – Some products must analyze huge amounts of data and produce actionable conclusions in real-time. Examples include mapping and navigation services, self-driving vehicles, and industrial internet of things (IoT) applications.
There is a very good chance that, strictly speaking, your company’s use case does not involve capital “B” Big Data. Your needs might be well below the petabyte scale, might not involve unstructured data or hundreds of applications, and might not require real-time processing. That does not mean, however, that your organization should shy away from anything cloud-based or labeled “big data.” It just means that you should be circumspect about the kinds of tools, technologies, and infrastructure that you commit to.
A cloud function is a small unit of software that is hosted on a cloud platform such as AWS or Google Cloud Platform. Users write cloud functions to perform any number of programmatic tasks and behaviors while leveraging the infrastructure of a cloud platform. In the context of data integration, cloud functions allow you to build custom data pipelines and integrations. Fivetran offers cloud function data connectors that allow you to link custom data integrations with the Fivetran environment.
A data connector continuously replicates data from a source to a destination on a set schedule such as every minute, hour or day, like an automated digital copy-maker. The source is typically a particular application or database. A Salesforce connector, for instance, moves data from your Salesforce account to your destination. A destination is an analytics repository of record where data is aggregated and analyzed (see below).
The full list of data connectors offered by Fivetran can be found here.
A cursor signifies where the last Fivetran sync on a particular source stopped. It is essentially a bookmark that allows your connector to begin reading where it last left off, and is integral to enabling incremental syncs (see below).
Cursors can be based on log entries, timestamps, or some other measurement. You shouldn’t have to worry about the specifics of cursors unless you are writing a custom integration with a cloud function.
In general business intelligence and analytics usage, dashboard refers to a summary overview of metrics regarding some system, initiative, or business unit. Dashboards typically contain data visualizations in the form of charts, numbers, and maps.
Source: Looker documentation
The common usage of the word “dashboard” in business intelligence and analytics should not be confused with the Fivetran dashboard, the GUI (see below) that allows users to control their Fivetran accounts.
A database is a structured, computerized collection of data. Databases are often used to make permanent records of transactions and operations performed by a software or business unit. Most often, databases are organized into tables that consist of columns and rows. These tables contain records that can be connected with each other using keys. This is called a relational structure (also see ERD). The language used to query relational databases is SQL (structured query language), of which there are many dialects.
In the context of data integration, it is important to differentiate between operational or transactional databases and analytical databases. In the context of analytics, the former type of databases are generally sources of data. The latter form of database is often a destination (see below) that contains data aggregated from numerous databases.
This also leads to a broader discussion of columns and rows. In a table, a column lists all values for a particular field, while a row lists all values for a particular entity. Analytical databases and data warehouses are typically designed to read data in a columnar fashion. This makes computations over sequences of numbers extremely easy. By contrast, operational/transactional systems are row-based in order to quickly and easily modify individual records.
Software engineers whose primary role is to build and maintain data infrastructure are referred to as data engineers. Data infrastructure includes pipelines from sources to destinations, i.e. data warehouses and data lakes, as well as the destinations themselves. Traditionally, a major component of a data engineer’s job was building ETL (see below) data pipelines.
With the proliferation of labor-saving data integration tools such as Fivetran, the data engineering profession is pivoting toward managing and optimizing the design and performance of core data infrastructure and building custom tools to support the activities of analysts and data scientists.
The precise meaning of data integration can vary by context, but it fundamentally refers to the full workflow of aggregating operational and transactional data from across an organization, and then massaging (i.e. transforming) and analyzing it to enable data-driven decisions. At Fivetran, we pride ourselves on providing “automated data integration” that requires a minimum of human effort from the user.
You can read a comprehensive beginner’s treatment of automated data integration here.
A data lake is an object store that can contain both structured data and raw, unstructured data. They are generally intended for use with large-scale machine learning applications.
Data literacy refers to the ability to navigate and interpret data sets. This ability can range in depth from a working knowledge of using a business intelligence tool and reading charts, to a deep, technical understanding of statistical inference and the ability to build machine learning algorithms in a low-level programming language.
A person’s data literacy needs are determined by the specific role they perform. Non-technical managers and business users should, at the very least, be able to read dashboards and reports from a business intelligence tool. Their job is to use data to make informed, defensible decisions rather than relying on “gut” or guesswork.
Analysts typically have a strong background in statistics, build and maintain dashboards and reports, run analytical experiments, and use SQL and high-level programming languages such as Python to build custom data models. The job of an analyst is, fundamentally, to help an organization make sense of its data and test hypotheses.
Data scientists and machine learning engineers additionally need familiarity with machine learning algorithms, quantitative research methods from the social and hard sciences, and sometimes familiarity with lower-level programming languages. Depending on the specific organization, data scientists and machine learning engineers may perform the same duties as analysts and data engineers, while also designing and building systems for prediction or automated decision-making.
A data model is an abstract representation of the real world. Every business application has an underlying data model that accounts for the business operations and transactions the application performs. These models are typically in the form of interrelated tables with columns and rows and can be represented using an entity-relationship diagram (ERD – see below).
Analysts and data scientists will also produce data models for the purposes of reporting, visualizing data, or training predictive models.
You may hear the phrase “data pipeline” used to refer to collections of connectors or specific examples of data integration tools. It’s best to think of a data pipeline as the “EL” portion of the ELT sequence. The pipeline is what delivers data to your destination. Transformations are then performed within the environment of your destination.
Data build tool, or dbt, is a SQL-based development environment by Fishtown Analytics that serves as a comprehensive analytics workflow tool. You can use dbt to transform and model data, automate data quality testing, and deploy, document, and version-control analytics code.
At Fivetran, we currently partner with dbt and offer analytics templates in order to make transformations as seamless as possible. Currently, we offer dbt packages for NetSuite, Salesforce and Mailchimp with many more on the way.
A destination is a data repository of record intended to permanently store large amounts of data for the purposes of analytics. This category includes both data lakes and data warehouses. See our documentation for a full list of destinations.
A data warehouse is a column-based, relational database used to comprehensively store data from across an organization. Analysts can access a data warehouse to perform analytics. Note that this is in contrast to row-based, relational databases that are used to record and perform individual transactions.
An entity-relationship diagram or entity-relationship model is a visual representation of a schema. ERDs specifically represent interrelations between tables and the fields (called keys) that connect them.
Fivetran ERD diagram for Asana
Every table has a primary key that denotes the basic entity that the table represents and uniquely identifies each row in that table. Foreign keys are primary keys from another table. They connect the entity represented by the primary key with another entity represented in another table.
Event tracking systems allow you to measure and monitor user activity on your web pages and mobile apps.
Notable examples of event tracking systems include Segment and Snowplow, both commonly used to track user navigation across web pages and mobile apps. One specific method of event tracking is called the webhook, which enables you to set callbacks to trigger after some behavior or event is performed on your website or app. These callbacks are generally used to invoke any kind of programmatic behavior, but can be used to stream data to you.
Extraction refers to the copying of data from a source. It can be performed by means ranging from manually downloading a file to writing a piece of software to ingest a data feed from an API endpoint.Loading refers to moving data into a destination, typically a data warehouse or a data lake.Transformation refers to altering or reorganizing data into a form that suits the needs of an analyst or end-user.
Extract-transform-load (ETL) is an approach to data integration in which a data pipeline extracts data from a source and then transforms data before loading it into its destination. Transformations turn raw data into a form that is usable by analysts and can be complex to orchestrate. The major advantage of ETL is that it preserves bandwidth, data storage capacity, and on-demand computational power. This comes at the expense of heavy labor costs to periodically reengineer the transformation layer of the pipeline.Extract-load-transform (ELT) is an approach to data integration in which transformations are not performed until after the data has been loaded. As a more modern, cloud-based approach, ELT turns the tradeoffs from ETL on their head. ETL was developed at a time when bandwidth, data storage capacity, and on-demand computational power were extremely expensive. Since bandwidth, data storage capacity, and on-demand computational power are now very cheap, the intent is, instead, to preserve engineering labor costs by placing transformations in a less sensitive part of the sequence.Fivetran performs a type of ELT, although the normalization process makes our sequence look more like EtLT, in which a modest amount of transformation is performed before the data is loaded to the destination.You can read a more detailed comparison of ELT and ETL here.
A graphical user interface or GUI (pronounced “gooey”) is a visual tool that allows you to use a mouse or touchscreen to manipulate buttons, text boxes, and other elements on a screen in order to control an application. Most computer-related tasks that aren’t performed using a command line use a GUI.
At Fivetran, we refer to our GUI as the Fivetran dashboard because it summarizes details about your account and gives you a fine degree of control over your connectors, user access, transformations, destinations, and more.
The word “dashboard” in the specific context of Fivetran should not be confused with the use of “dashboard” more generally in the context of business intelligence. In the most general sense, a dashboard is a summary display of various metrics relating to the performance of a system, i.e. a business unit or campaign.
Incremental syncs, also known as incremental updates, update only new or modified data. After the initial sync, Fivetran connectors sync most tables using incremental updates. We use a variety of mechanisms to capture the changes in the source data, depending on how the source provides change data. During incremental syncs, Fivetran maintains an internal set of progress cursors, which let us track the exact point where our last successful sync left off. Incremental syncs are efficient because they update only the changed data, instead of re-importing whole tables.
During the initial sync, also known as a historical sync, for a connector, Fivetran connects to your source and copies the entire contents of every table that you’ve selected to sync. We sometimes refer to this as the historical sync because during the initial sync, we sync all your data, including data that is old. How long the initial sync takes depends on the amount of data and the limitations of your source. For example, some sources only allow a limited number of API calls. The initial sync does not count towards your monthly active rows.
The Fivetran log connector integrates metadata from your connectors directly into your destination so that they are available for analytics. You can study this data to optimize your organization’s internal workflows.
Log services provide a JSON data feed through an API endpoint. You can use this API endpoint to automate various workflows. They are specific to a destination.
Machine learning is often associated with data science and sometimes used interchangeably with terms such as “predictive modeling” and “artificial intelligence.” Machine learning is fundamentally about automated decision-making using statistical inference.
Machine learning consists of a training set, a test set and an algorithm, which together make a model.
An active row is one that is added to or updated within a destination. Monthly active rows are counted on the basis of distinct primary keys. That is, a single, unique row that is updated multiple times in a month will only be counted once. Monthly active rows directly connect pricing with consumption. This helps Fivetran customers by only billing them for what they use.
You can read more about monthly active rows here.
Normalization is the process of reorganizing a data model to eliminate duplicate, redundant, and derived data values, and to accurately represent logical connections between entities. The purpose of normalization is to create a concise, easily-understood representation of data.
This is not to be confused with the concept of “normalization” from statistics, in which values are scaled to between 0 and 1.
OAuth is an open standard for managing access to sensitive data. As of this writing (mid-2020) it is on version 2.0.
You will usually provide OAuth credentials to allow Fivetran access to your company’s accounts and their API endpoints.
A schema is a blueprint for a database, listing all of its tables, columns, and relationships. You can also think of it as a data model for a database. A visual representation of a schema is called an entity-relationship diagram.
Good database schema design is essential to good data integration. A well-designed schema makes data tractable by making it easy to read and understand, managing the throughput required to sync it, and offering a standardized jumping-off point for analysis.
Schema changes occur when your data sources add, remove or modify columns and tables. You must accommodate these changes when you’re extracting and loading data. A tool like Fivetran performs automated schema migration so that you don’t have to do this by hand.
Webhook is a web development tool that enables you to set callbacks to trigger after some behavior or event is performed on your website or app. These callbacks are generally used to invoke any kind of programmatic behavior, but can also be used to stream data to you.