A Primer to the World of Data
The world is saturated with data. Websites, apps, devices and sensors embedded in machines, buildings and vehicles continuously collect and stream enormous volumes of information. This data is used to guide business decisions and power artificially intelligent products that we interact with daily.
How is data transformed from raw signals and scraps of information into useful knowledge and products? The process involves several stages, in roughly the following order:
- Data is gathered from sensor feeds, manual data entry or software and stored in files or databases.
- Data is extracted from files, databases, and API endpoints and centralized in data warehouses.
- Data is processed to meet the needs of various business units.
- Data is used for business intelligence or to power products.
The tools and technologies an organization uses to execute this process form its data stack. Modern data stacks are hosted in the cloud.
What is the Cloud and What Does it Have to do with Data?
The “cloud” refers to the use of internet-enabled decentralized computation and storage. Cloud technology distributes software and data across internet-enabled machines as needed, allowing organizations to easily scale their operations up and down. Traditionally, organizations hosted their code and data on hardware that they owned on-premise. At larger scales, organizations operate data centers, designing and building proprietary IT infrastructure.
There is little need today for most organizations to host their software and data on-premise. With the plummeting cost of storage, computation, and internet bandwidth, the cloud offers accessible, cheap, performant, and scalable off-the-shelf solutions to a range of IT infrastructure needs.
Many products and services are now “cloud-native,” meaning they are designed from the ground up to leverage web infrastructure. These products and services include every element of the data stack: data sources, data pipelines, data warehouses, and business intelligence tools. The providers of such products and services frequently strive to make their clients’ experiences as easy and painless as possible. Such services are called fully managed services.
Today, the third-party data centers that host these products and services are frequently provided by large tech companies such as Amazon, Google, and Microsoft.
1. Where Data Comes From
Data can originate from sensor inputs, such as scans at a checkout line, manual data entry, such as from forms collected by the Census Bureau, digital documents and content, such as social media posts, or digital activity recorded by software triggers, such as clicks on a website. The data is typically stored in cloud-based digital files and operational databases. These files and databases may be made directly accessible to the parties that need them, or streamed in the form of API endpoints.
Organizations today use a wide range of cloud applications to provide services such as customer relationship management, payment processing, enterprise resource planning and more. Data generated by these applications not only provides a high-level overview of an organization’s performance, but also invaluable insight at the level of individual accounts. A highly capable data science team using a rich data set can predict customers’ needs as well as seasonal trends with uncanny accuracy.
- Database – Software applications that store data in a structured, typically relational manner.
- Operational database – A database that is updated in real-time and meant to support day-to-day operations. For instance, an eCommerce website will likely have an operational database to record transactions and store listings and customer profiles.
- API endpoint – Application programming interfaces allow applications to communicate with each other. An endpoint is one end of such a communications channel. An endpoint streams data in a machine-readable format such as XML or JSON.
- Data science – an umbrella term for the analytical use of data
Notable data sources:
- Salesforce – Leading customer relationship management platform
- NetSuite – Popular suite of enterprise resource planning software
- Zendesk – Used for customer service ticketing
- Zuora – Used to manage subscriptions and billing
- Shopify – Popular eCommerce platform
- Square – Popular software for retail transactions
- Google AdWords – Common online advertising platform
See more data sources here .
2. Centralizing Data
An organization will typically contain multiple teams using a variety of applications to aid different parts of its workflow. In order to fully leverage its data, the organization must extract and load it in a central environment to gain a comprehensive view of its operations and track individual entities across multiple applications. The destination for this data is typically a data warehouse, which, unlike, an operational database, is meant to be a structured repository of record for the purposes of analytics and business intelligence. Some organizations opt to use data lakes, which store both structured and raw, unstructured data.
This work can be conducted on an ad-hoc basis, or a dedicated data engineering team can build custom software to ingest the various files, database tables, and API feeds. A more practical approach is to use pre-built software to outsource or automate some or all of the process, i.e. to use a fully-managed service. These tools are referred to as data connectors or data pipelines.
- Extraction – Reading data from a data source
- Loading – Writing data to a data warehouse
- Data connector/data pipeline – Software used to extract data from a source and load it into a data warehouse
- Data warehouse – A data repository that, like a database, typically has a relational structure but, unlike a database, is meant to be a central repository of record for the purposes of analytics
- Data lake – A data repository meant to permanently accommodate large amounts of raw, unstructured data
Notable data pipeline tools:
- Fivetran – (That’s us!) A data pipeline and ELT tool featuring a wide range of proprietary data connectors.
- Stitch – A cheap, accessible data pipeline and ELT tool that relies extensively on open-source data connectors.
- Informatica – A legacy ETL tool that was originally designed to work with on-premise systems.
Notable data warehouses:
- Google BigQuery – A true serverless data warehouse that activates (and deactivates) additional computation and storage resources on the fly
- Snowflake – A quasi-serverless data warehouse that scales easily, though with some manual configuration
- Amazon Redshift – Amazon's data warehouse, built from ParAccel technology
- Microsoft Azure – Microsoft's data warehouse
3. Processing the Data
Data from the aforementioned sources is not always provided in a readily usable format. The data must be transformed to comply with data models that organize the data in a way suitable for reporting, dashboards, or machine learning. Transformations include data cleaning, summarizing and pivoting tables, as well as joining records from multiple sources together.
Stages 2 and 3 are collectively called “data integration” as well as the acronyms ELT (extract-load-transform) and ETL (extract-transform-load). Traditionally, organizations used ETL because transforming data before loading it lessened the computational and storage load demanded of an on-premise data warehouse. The chief disadvantage of ETL is brittleness. Both downstream changes to business needs as well as upstream changes to the structure of the source data can necessitate a revision of the data pipeline.
Because the transformation stage is sensitively placed between the extraction and loading stages, automating the ETL process can also require the careful, rule-based coordination of different pieces of transformation software, called orchestration.
The rise of cloud technology has made the labor-intensive ETL approach obsolescent. By replicating data straight from the source and allowing transformations to be performed at the discretion of analysts, ELT does away with the brittleness of ETL. This approach leverages the cloud, outsourcing and automation to preserve labor, time and money.
An important concern that arises alongside centralizing and processing data is that of data governance. Organizations must use data in a manner that is operationally efficient and complies with legal transparency and privacy regulations. This means that organizations must dictate and document (data cataloging) how and what data is integrated, how that data is used, that the data is accurate and consistent, and that it is accessible only to the appropriate parties.
- Data model – An abstract representation of real-world entities and their relations, meant primarily to inform business intelligence. A semantic layer translates elements in a data model into a human-readable lexicon.
- Transformation – The process of altering data so that it complies with the requirements of a data model
- Data cleaning – Removing errors, inconsistencies, and irrelevant records from data
- ELT – Extract-load-transform, a data pipeline doctrine involving extraction, loading, and transformation, in that order. Enabled by the low cost of computation, storage, and internet bandwidth
- ETL – Extract-transform-load, the traditional data pipeline doctrine involving extraction, transformation and loading. Formerly necessary to preserve scarce computation, storage, and bandwidth resources.
- Orchestration – The process of scheduling and coordinating multiple pieces of software in order to perform transformations
- Data integration – A general term for the process of extracting, centralizing, and processing data. Used interchangeably with data acquisition and data ingestion.
- Data governance – Processes related to ensuring the proper usage, integrity and security of data.
- Data cataloging – Documentation about the meaning, relationships, and origin of data. Sometimes used interchangeably with data dictionary, which refers to database-specific documentation.
Notable tools with transformation, orchestration, and governance features:
- Matillion – A cloud-based ETL tool featuring GUI-based orchestration
- Airflow – An open-source orchestration tool originally developed by AirBnB. Most famous for the use of directed acyclic graphs.
- Luigi – A Python-based open-source orchestration tool originally developed by Spotify
- Alteryx – A tool that supports orchestration, transformation, governance, and analytics
4. Using the Data
Ultimately, the data furnished by a data stack is meant to guide decisions made at every level of an organization and to power artificially intelligent products. The use of data to support decisions within an organization is known as analytics or business intelligence. A relatively technical and difficult approach to conducting analytics is to use a language like R or Python to build visualizations, dashboards, and tables of summary statistics. The advantage of this approach is that a highly code-savvy analyst or data scientist can build entire custom websites and applications from scratch or even prototype applications of machine learning.
The disadvantage of the technical approach is that it is extremely labor-intensive, often doesn’t easily integrate with the rest of the data stack and few people are qualified to perform it. A more accessible approach is to use a business intelligence platform. These tools typically integrate directly with data warehouses, do not require coding competency outside of SQL, and feature a large selection of visualization and dashboard templates.
The low-level, technical approach remains necessary to achieve the pinnacle of data science: machine learning. Well-known practical examples of machine learning range from targeted ads, self-driving vehicles, and IBM Watson. Early signs indicate that machine learning will become more accessible in the future. Google BigQuery features machine learning conducted entirely using SQL and integrates with the BI tool Looker.
- Analytics – A general term that encompasses the use of data to guide decisions. Often used interchangeably in corporate settings with business intelligence.
- R – A programming language designed for statistical computing. It is a high-level scripting language.
- Python – A popular general-purpose programming language with extensive libraries for statistical computing, machine learning, web development, and other purposes. It is a high-level scripting language, but many of its packages are actually wrappers for lower-level languages and are thus highly performant.
- SQL – “Structured Query Language,” the standard language used to import, read, modify, and delete data in relational databases. It is commonly used by analysts and non-technical users alike.
- Visualization – An image meant to quickly convey numerical information
- Dashboard – An organized collection of visualizations
- Machine learning – The application of mathematics to pattern recognition and prediction. A simple and common example is linear regression.
- Supervised learning – Also known as classification. The algorithm requires a training set with known correct answers.
- Unsupervised learning – Also known as clustering. Does not require a training set with known answers.
- Reinforcement learning – Uses feedback from the environment and/or an adversary to optimize the agent’s behavior.
Notable business intelligence platforms:
- Looker – Features a proprietary language called LookML as an additional layer of abstraction over SQL
- Tableau – Capable of generating stunning visualizations
- Mode – Includes Python integration for those inclined to machine learning and predictive modeling
- Domo – All-in-one data integration and business intelligence tool
Common tools for machine learning:
- Google BigQuery – Use SQL for machine learning! As of this writing, it supports linear regression, logistic regression, and k-means clustering.
- Python packages – Including Jupyter, Scikit-learn, Tensorflow, Pandas, PyTorch
- R packages – Including ROD-BC, Gmodels, Class, Tm
Notable applications of machine learning:
- Recommendation engines
- Google Search
- Netflix recommendations
- Facebook targeted ads
- Vision recognition
- Self-driving cars
- Automated photo and video tagging
- Speech and natural language recognition
- Medical diagnostics
- Fraud detection
- Crime prediction
The Information Age Has Just Begun
The modern preeminence of data makes it essential for organizations to conduct themselves with the guidance of facts and to develop products that leverage the wealth of data their activities generate. In order to be competitive and innovative, your organization must have a solid foundation in data engineering and data integration. With the proliferation of managed services, this foundation is within your grasp.