Off-the-shelf customer data platforms have serious shortcomings. Consider data warehouses instead.
There’s been a lot of buzz around Customer Data Platforms (or CDPs) lately. Every vendor in martech is trying to sell you on buying their SaaS to build a “single view of the customer”.
If you’re not familiar with Customer Data Platforms, Wikipedia offers a good explanation:
A customer data platform (CDP) is a collection of software which creates a persistent, unified customer database that is accessible to other systems. Data is pulled from multiple sources, cleaned and combined to create a single customer profile. This structured data is then made available to other marketing systems.
Sounds useful, right? It is! If you’ve ever had to deal with customer data at a company, regardless of size or your business department (sales, marketing, support, engineering, analytics, etc.), the value proposition of a CDP is clear. Building and operationalizing a single view of the customer is hard. In fact, it’s only been getting harder.
As a (former) early employee at Segment and someone who has been straddling the line between the marketing and data communities for over five years, I find Customer Data Platforms nothing short of intriguing.
In the marketing community, an all-in-one platform to solve our countless data problems sounds like the holy grail. In the data community, on the other hand, we’ve been trying to do this all along using the data warehouse. Yet, most people don’t make the connection between the two.
Before I make the case for why the data warehouse (and not an off-the-shelf CDP) should be your Customer Data Platform, I’d like to provide some context on the CDP market and its key players.
CDPs are all-in-one marketing and data platforms. They aim to serve as a database for all your customer information with a bundled activation layer to help you leverage the data for marketing automation.
All CDPs have a few common components:
Data ingestion. Since CDPs are databases of customer data, they need a way to ingest data. Most CDPs achieve this via an API for developers to track traits about users and events that they’re taking across your applications.
Identity Resolution. CDPs build and maintain graphs of user profiles so that all user identifiers (cookie, IDFA, device ID, etc.) can be mapped back to a “single user ID”. Most CDPs implement a simple deterministic algorithm for identity resolution. These algorithms are functionally similar to the queries that your analyst team has already written to do marketing attribution in SQL, e.g. joining across “anonymous” and “known” user profiles using a handful of identifiers. Some CDPs, however, do identity resolution probabilistically, which is not straightforward to do in-house without a skilled data science team.
Audience builder. This is perhaps the most necessary component of a CDP. Without an audience builder, a CDP is just “Customer Data Infrastructure”. The audience builder is an interface for marketers to create customer segments without SQL and sync them to various marketing and advertising platforms to run targeted campaigns.
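To make the deterministic approach to identity resolution concrete, here is a minimal sketch in Python. It is not how any particular CDP implements it; it simply assumes that two identifiers observed on the same event belong to the same person, and merges them with a union-find structure. The identifier values and the `IdentityGraph` name are made up for illustration.

```python
from collections import defaultdict


class IdentityGraph:
    """Union-find over identifiers: linked identifiers share one root."""

    def __init__(self):
        self.parent = {}

    def find(self, identifier):
        # Register unseen identifiers as their own root.
        self.parent.setdefault(identifier, identifier)
        # Walk up to the root, shortening the path as we go.
        while self.parent[identifier] != identifier:
            self.parent[identifier] = self.parent[self.parent[identifier]]
            identifier = self.parent[identifier]
        return identifier

    def link(self, a, b):
        # Deterministic rule (an assumption of this sketch):
        # identifiers seen together belong to one person.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def profiles(self):
        # Group every known identifier under its root.
        groups = defaultdict(set)
        for identifier in list(self.parent):
            groups[self.find(identifier)].add(identifier)
        return list(groups.values())


# Hypothetical observations: each pair was seen on the same event.
observations = [
    ("cookie:abc", "device:ios-1"),
    ("cookie:abc", "email:ada@example.com"),
    ("cookie:xyz", "email:bob@example.com"),
    ("device:android-7", "email:ada@example.com"),
]

graph = IdentityGraph()
for left, right in observations:
    graph.link(left, right)

resolved = graph.profiles()
```

Here the anonymous cookie, both devices, and the email for one person collapse into a single profile, while the second person stays separate. The same grouping can be expressed as joins in SQL, which is why analyst teams often already have something like it.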
Outside of these core components, some CDPs have additional features for marketers, like cross-channel orchestration, predictive audiences, etc. Check out this guide that offers a complete overview of the available data integration technologies, including Customer Data Platforms.
CDPs aren’t just some new kids on the block, and they should be taken very seriously. Both the industry and interest in the term have grown consistently over the last 5 years.
Buyer demand follows this trend too. The overall combined revenue of players in the CDP industry is north of $2 billion as of 2019.
The CDP players can be broken down into a few major categories:
General purpose Customer Data Platforms
General purpose CDPs target a broad range of use cases. The leading players in the general CDP space today are Segment and mParticle. Then, there are a number of runners-up like Treasure Data, Simon Data, Lytics, Blueshift, and Redpoint Global.
Vertical Customer Data Platforms
These are CDPs that target a very specific type of company (industry, size, maturity, SaaS tools in place, etc.) and solve specific problems.
Conglomerate Customer Data Platforms
These are big companies like Adobe, Salesforce, and Microsoft that have started calling their existing CRMs or “marketing clouds” CDPs.
There are too many CDP vendors to count these days. This makes the space rather difficult to navigate as a newcomer. In the rest of this article, we’ll focus on the general-purpose CDPs like Segment Personas, mParticle, and Treasure Data. They’re what most people think of when they hear CDP, and they have far more companies using them by sheer count than the rest. As a (former) early employee at Segment, that’s the space that I have the most experience with.
There are 5 key reasons why you should prefer your data warehouse to an off-the-shelf CDP.
CDPs are not the single source of truth. The data warehouse has all your data.
CDPs do not mesh with data teams. Marketing and data teams should work together.
CDPs are not flexible. Every business has a unique data model.
CDPs own your data. You’re locked in.
CDPs do not benefit from the data ecosystem. You’re siloed.
The data warehouse has all your data. Whether you’re a D2C brand, B2B SaaS company, e-commerce marketplace, or even a massive bank like Capital One, chances are your customer data is already in a data warehouse. The #1 reason that your CDP should be the data warehouse is that your data warehouse is already your CDP.
CDPs claim to be the single source of truth, but CDPs do not replace data warehouses. There’s nothing about having separate databases of customer information for different departments that spells “single source of truth”. Some CDPs support importing data from the data warehouse, but doing so adds data latency, so “data freshness” remains an unfulfilled promise of a customer data platform.
It’s easier than ever to centralize all your data in a warehouse using SaaS platforms like Fivetran. Once your customer data is in the warehouse, data teams define core definitions, like “what is an ‘active’ user”, in SQL. The data warehouse is the source of truth for your business’ trusted definitions. It doesn’t make sense to recreate these definitions separately in each of your sales and marketing tools -- they should originate from your data warehouse.
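As an illustration of what a warehouse-side definition can look like, here is a small sketch that uses Python’s built-in `sqlite3` as a stand-in for a real warehouse. The `events` table, the 30-day window, and the pinned as-of date are all assumptions made for the example; your own definition of “active” will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event_name TEXT, occurred_at TEXT);
    INSERT INTO events VALUES
        ('u1', 'login',    '2021-03-28'),
        ('u2', 'login',    '2021-01-02'),
        ('u3', 'purchase', '2021-03-15'),
        ('u1', 'purchase', '2021-03-30');

    -- One shared definition: active means any event in the 30 days before the as-of date.
    -- The as-of date is pinned here so that the example is reproducible.
    CREATE VIEW active_users AS
    SELECT user_id, MAX(occurred_at) AS last_seen_at
    FROM events
    WHERE occurred_at > DATE('2021-03-31', '-30 days')
    GROUP BY user_id;
""")

active = [row[0] for row in conn.execute("SELECT user_id FROM active_users ORDER BY user_id")]
print(active)  # prints ['u1', 'u3']
```

Every downstream tool that reads from `active_users` then inherits the same definition, rather than each tool carrying its own slightly different version.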
Most companies do not think of their data warehouse as being a platform for more than analytics, but companies with modern data stacks have been building operational data pipelines off of their warehouse for years. SaaS solutions like Hightouch let you easily push data and definitions to business tools from your warehouse with just SQL, no scripts.
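The general shape of an operational pipeline off the warehouse can be sketched in a few lines. This is not Hightouch’s implementation; it is a generic illustration that assumes a SQL model, a snapshot of what was last sent, and a diff between the two. The `plan_sync` function, the `customers` table, and the column names are hypothetical, and `sqlite3` stands in for the warehouse.

```python
import sqlite3


def plan_sync(rows, previous_state, key="email"):
    """Compare query results with the last synced snapshot and return the changes."""
    current = {row[key]: row for row in rows}
    to_upsert = [row for k, row in current.items() if previous_state.get(k) != row]
    to_delete = [k for k in previous_state if k not in current]
    return to_upsert, to_delete, current


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
    CREATE TABLE customers (email TEXT, plan TEXT, lifetime_value REAL);
    INSERT INTO customers VALUES
        ('ada@example.com', 'pro', 1200.0),
        ('bob@example.com', 'free', 0.0);
""")

# The model is plain SQL; its columns are what a destination tool would receive.
MODEL_SQL = "SELECT email, plan, lifetime_value FROM customers"


def run_model():
    return [dict(row) for row in conn.execute(MODEL_SQL)]


# First run: nothing has been synced yet, so every row is an upsert.
upserts, deletes, state = plan_sync(run_model(), previous_state={})

# The warehouse changes; the next run only needs to send the difference.
conn.execute("UPDATE customers SET plan = 'pro' WHERE email = 'bob@example.com'")
upserts_2, deletes_2, state = plan_sync(run_model(), previous_state=state)
```

A real pipeline would hand `upserts_2` and `deletes_2` to the destination’s API; that part is left out because it depends entirely on the tool being synced to.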
CDPs target marketing teams and primarily sell to CMOs. Ultimately, marketers are not the right persona to solve the intricate data problems that CDPs address.
Self-service access and data democratization are important, but they require a cross-functional effort. Data teams should be responsible for understanding your company’s data model and building clean data models for everyone else to consume. Marketing teams should be empowered to analyze customer behavior and iterate on customer segments for campaigns without being bottlenecked by data teams.
CDPs do not recognize this and instead, they give marketers immense capabilities without the process or guard rails of data and engineering teams. There can be a happy, productive balance between marketing and data teams, but the tools and processes your company adopts must understand the role and workflow of each team and facilitate collaboration between them. This is the whole thesis behind Hightouch Activate -- it allows marketers to visually segment users based on data models that your data team assembles in SQL within your data warehouse.
Ultimately, there aren’t engineering or marketing concerns -- there are business concerns. Marketing alone often does not have the technical capacity to evaluate deeply technical concerns, but it is severely affected by them. Effective collaboration between teams is the foundation of a successful company.
CDPs are built around rigid data models. Segment Personas, as an example, offers only two core objects -- users and accounts. What’s more, a user can only belong to a single account.
In reality, data models aren’t so cookie-cutter. Users can be in multiple accounts and accounts can have sub-accounts, business units, etc. Apart from users and accounts, companies of the 21st century have their own proprietary objects and hierarchy.
B2B companies like GitHub have organizations, repositories, issues, pull requests, etc. And, that’s just in their app without considering Salesforce/CRM, Zendesk/support tools, etc.
B2C companies like Amazon have users, carts, subscriptions (Prime, Audible, etc.), sellers, orders, returns, gift cards, search history, and global product inventory. The list goes on.
The CDP ecosystem’s response to custom data is “events”. CDPs allow you to send them a stream of custom events performed by your users. This sounds great in theory, but it’s not always easy to answer the questions you need with just events. Data warehouses have what CDPs lack -- the ability to model and query arbitrary relational data.
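To show what “arbitrary relational data” buys you, here is a small sketch with `sqlite3` standing in for the warehouse. The schema is invented for the example: a join table lets a user belong to several accounts, and a `parent_account_id` column models sub-accounts (one level deep only, to keep the query short).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id TEXT PRIMARY KEY, email TEXT);
    CREATE TABLE accounts (account_id TEXT PRIMARY KEY, name TEXT, parent_account_id TEXT);
    -- A join table lets one user belong to many accounts.
    CREATE TABLE memberships (user_id TEXT, account_id TEXT, role TEXT);

    INSERT INTO users VALUES ('u1', 'ada@example.com'), ('u2', 'bob@example.com');
    INSERT INTO accounts VALUES
        ('a1', 'Acme', NULL),
        ('a2', 'Acme Labs', 'a1'),
        ('a3', 'Globex', NULL);
    INSERT INTO memberships VALUES
        ('u1', 'a1', 'admin'),
        ('u1', 'a3', 'member'),
        ('u2', 'a2', 'member');
""")

# Which users belong to more than one account?
multi_account_users = conn.execute("""
    SELECT u.email, COUNT(DISTINCT m.account_id) AS account_count
    FROM users u
    JOIN memberships m ON m.user_id = u.user_id
    GROUP BY u.email
    HAVING COUNT(DISTINCT m.account_id) > 1
""").fetchall()

# Which users sit under the Acme parent account, directly or via a sub-account?
acme_users = conn.execute("""
    SELECT DISTINCT u.email
    FROM users u
    JOIN memberships m ON m.user_id = u.user_id
    JOIN accounts a ON a.account_id = m.account_id
    WHERE a.account_id = 'a1' OR a.parent_account_id = 'a1'
    ORDER BY u.email
""").fetchall()
```

Neither question fits a model where each user has exactly one account, and neither is easy to answer from a stream of events alone.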
When it comes to the limitations of the data models in CDPs, I think back to my time as an engineer at Segment building the Personas product. We were unable to effectively “dogfood” our own product due to shortcomings in the data model, like not being able to handle users in multiple workspaces (accounts). As a result, I would frequently have to write SQL against our data warehouse to query the state of a user or account at Segment.
CDPs offer restricted access to your customer data, whereas data warehouses offer unrestricted access to your data. The best companies recognize that their ability to leverage customer data is a competitive advantage. Therefore, they should own their data.
CDPs only expose very specific actions on top of your customer data, generally purpose-built for marketing workflows. Since CDPs are all-in-one solutions, you’re locked in and subject to the whims of your CDP vendor in terms of how you can use your customer data. There’s no such thing as a smooth transition from one CDP to another. With the advent of the cloud, there’s no reason that your company’s business workflows should be tied to a vendor’s data plane.
And this is just from a functionality perspective. Regulation and concerns around data privacy (GDPR, CCPA, etc.), data residency (e.g. invalidation of Privacy Shield), and data security (SOC2, ISO, HIPAA, etc.) are on the rise, yet there is no truly on-premise CDP offering, so your customer data has to live in a vendor’s environment.
Since CDPs own your data, they own your ecosystem. Each CDP has to build their own independent “ecosystem”.
Because CDPs are built to play well with their proprietary ecosystem, every CDP has to independently address data concerns, such as cleaning up bad data, via proprietary product features. As an example, if you send a bunch of bad events to a CDP, you’re limited to the features they have available to clean your data set. The transformations you need to run often don’t exist, so you have no choice but to file a support ticket. However, if your CDP is your data warehouse, you can use SQL to transform your data in any way you wish and tools like dbt on top to systematically encode and execute these transformations.
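Here is a rough sketch of that kind of cleanup, again with `sqlite3` standing in for the warehouse. The `raw_events` table and the specific problems (duplicate deliveries, inconsistent event names, missing user IDs) are invented for the example. In dbt, a SELECT like the one inside the view would typically live in its own model file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (event_id TEXT, user_id TEXT, event_name TEXT, occurred_at TEXT);
    INSERT INTO raw_events VALUES
        ('e1', 'u1', 'Order Completed', '2021-03-01T10:00:00'),
        ('e1', 'u1', 'Order Completed', '2021-03-01T10:00:00'),  -- duplicate delivery
        ('e2', 'u2', 'order_completed', '2021-03-01T11:00:00'),  -- inconsistent naming
        ('e3', NULL, 'Order Completed', '2021-03-01T12:00:00'),  -- missing user
        ('e4', 'u3', ' Page Viewed ',   '2021-03-01T13:00:00');  -- stray whitespace

    -- Drop exact duplicates and rows without a user, and normalize event names.
    CREATE VIEW clean_events AS
    SELECT DISTINCT
        event_id,
        user_id,
        REPLACE(LOWER(TRIM(event_name)), ' ', '_') AS event_name,
        occurred_at
    FROM raw_events
    WHERE user_id IS NOT NULL;
""")

clean = conn.execute(
    "SELECT event_id, event_name FROM clean_events ORDER BY event_id"
).fetchall()
```

The raw table is left untouched, so if a rule turns out to be wrong you can change the view and nothing is lost.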
There are a number of concerns after data collection -- data QA, metadata/discovery, monitoring, observability, lineage, etc. No single vendor, not even a software giant like Salesforce or Adobe, is poised to build best-in-class software that addresses each of these concerns. In most cases, CDPs do not address all of these data concerns effectively, as they’re focused on building features that appeal to marketers. Even in a perfect world where a CDP does address all of these concerns, you would have to use a separate set of tools to solve these concerns again for the data warehouse since CDPs do not replace data warehouses.
On the other hand, the ecosystem around data warehouses is growing rapidly. Data warehouses are the standard that every vendor in SaaS is thinking about. Companies attacking these problems in a warehouse-first way are emerging left and right.
It would not be fair to CDPs if we didn’t talk about when it does make sense to choose them. Although I don’t believe CDPs are the “be all, end all” of customer data, there are cases where it does make sense to consider a CDP.
Vertical CDPs are CDPs built for a specific type of company, categorized by industry, size, purpose, etc. Contrary to general-purpose CDPs, I’m actually very bullish on vertical CDPs.
My two favorite examples of vertical CDPs are Amperity & Zaius.
Amperity focuses on hard data science problems for traditional retail companies, like making a best guess of what a household is from disparate data sources.
Zaius focuses on building off-the-shelf integrations for the mid-market ecommerce company using Shopify or Magento, supplemented by common SaaS services.
Companies using vertical CDPs still have data warehouses for analytics at a minimum. In fact, a number of large enterprises just use Amperity for the identity piece and build their own pipelines from the data warehouse to other tools for sales/marketing/support.
CDPs do give marketers new abilities. The average marketer isn’t suited to solve problems like identity resolution unless they’re well-versed in SQL. For the aforementioned reasons, we’d argue that this is okay, and that marketing and data teams just need a framework to collaborate.
That said, if you do not have access to someone with SQL skills to model your company’s data, it might make sense to settle for an off-the-shelf CDP.
If your company has a subpar data stack and you don’t intend to improve it soon but you’re severely bottlenecked on the marketing side, then it may make sense to use an off-the-shelf CDP as a stopgap. The warehouse-based approach is only as good as the data warehouse itself.
If human resources are the problem, we’d urge you to address that directly. Many companies fail to implement CDPs. Half of the challenge of adopting any software is human. It’s very difficult to build a database of all your customer information without having someone on your team that can navigate the intricacies of your company’s data. Services like Snowflake, Fivetran, and Hightouch have made building a modern data stack a breeze.
Some CDPs do have real time capabilities that are somewhere between “challenging” and “impossible” to achieve with data warehouses alone today.
In most business cases, true real time capabilities frankly aren’t necessary or helpful. That being said, there are certain use cases where executing operations in near real-time is valuable. For example, a transactional notification like “Thanks for making a purchase” when you check out at a Starbucks shouldn’t arrive an hour later. From our user research, a majority of legitimate real time use cases are for core product flows like this, where engineering is involved, rather than use cases that marketing would drive autonomously.
If empowering marketing to drive these real time use cases is crucial enough to your business to outweigh the downsides of giving up a consistent, sane data infrastructure, then it is justifiable to pursue an off-the-shelf CDP. The only thing I’d urge you to beware of is that even CDPs advertising real time capabilities cannot always achieve them.
This is because behind the scenes, most CDPs leverage off-the-shelf data warehouses like Snowflake and BigQuery as a significant part of their internal architecture. Therefore, CDPs are ultimately bottlenecked by the same technological limitations that your data team faces.
“There’s no magic in magic, it’s all in the details”
-- Walt Disney
Data warehouses are becoming increasingly fast.
JetBlue is running operational pipelines to predict flight delays with 2 minutes end-to-end latency across Snowflake & dbt.
Google BigQuery has streaming insert APIs.
Real time capabilities are on the horizon. Snowflake, BigQuery, and Redshift all have beta features implementing incrementally-computed SQL views, which is the basis of a real time stream processing system. Materialize is building real time streaming SQL data warehouses from the ground up and gaining significant traction.
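For readers wondering what “incrementally-computed” means here, the sketch below shows the general idea in Python. It is not how Snowflake, BigQuery, Redshift, or Materialize implement it. It only illustrates that a grouped count can be kept up to date by applying each change as it arrives, instead of rescanning the whole table. The class and function names are made up.

```python
from collections import Counter


class IncrementalOrderCounts:
    """Maintains the result of
    SELECT user_id, COUNT(*) FROM orders GROUP BY user_id
    by applying one change at a time, without rescanning the table."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, change, row):
        # Each insert adds one to the group; each delete removes one.
        delta = 1 if change == "insert" else -1
        user_id = row["user_id"]
        self.counts[user_id] += delta
        if self.counts[user_id] == 0:
            del self.counts[user_id]  # empty groups disappear from the view

    def result(self):
        return dict(self.counts)


def full_recompute(rows):
    """The non-incremental baseline: scan every row on every refresh."""
    return dict(Counter(row["user_id"] for row in rows))


# A hypothetical stream of changes to an orders table.
changes = [
    ("insert", {"order_id": 1, "user_id": "u1"}),
    ("insert", {"order_id": 2, "user_id": "u2"}),
    ("insert", {"order_id": 3, "user_id": "u1"}),
    ("delete", {"order_id": 2, "user_id": "u2"}),
]

view = IncrementalOrderCounts()
table = []
for change, row in changes:
    if change == "insert":
        table.append(row)
    else:
        table.remove(row)
    view.apply(change, row)

incremental = view.result()
recomputed = full_recompute(table)
```

Both approaches end with the same answer, but the incremental one does a constant amount of work per change, which is what makes low-latency results possible as tables grow.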
No one can predict the future with certainty, but the industry is pointing towards a modern data warehouse being the strongest bet as your core database for customer data.
You can use tools like Hightouch to turn your data warehouse into a customer data platform.
First, Hightouch allows you to sync any data from your data warehouse into sales, marketing, and support tools with just SQL, no scripts.
Sometimes, your marketing team needs to drill into customer data to build and distribute audiences from a centralized location. In addition to the primary SQL interface, Hightouch offers an audience builder directly on top of your data warehouse.
Hightouch’s audience builder does not make any assumptions about your company’s data model. Rather than forcing your company to mold its data model to that of a CDP, Hightouch molds itself to your company’s data model. Hightouch’s audience builder is powered by a schema modeling layer that allows you to encode your company’s relational object hierarchy & events by labeling tables and views from your data warehouse that you’d like to expose to business users.
This is an example of how specific business processes can be enabled on top of your company’s customer data without losing flexibility or control. Hightouch’s audience builder is just one of many products to come that will be built directly on top of the data warehouse.
Hightouch brings you the best of both worlds. Your data team can focus on parts of the stack unique to your business -- modeling your company’s data and answering challenging business questions. And, your marketing team can leverage customer data to run campaigns without being bottlenecked by data teams.
Curious to see Hightouch in action? Just book a demo — we’d love to show you around.
Thanks to JJ Fliegelman, Arpit Choudhury, David Beyer, Charles Wang, Mike Boyarski, Preston Johnston, and Nancy Hung for giving feedback on this article.