The modern data stack (MDS) makes businesses more responsive to changing demands, while allowing engineers to automate mind-numbing tasks and focus on higher-value work. When organizations employ a multi-cloud strategy, using different cloud services for different functions within the stack, they gain still more efficiencies. For example, they can choose where data should go according to the relative strengths of their cloud providers or, importantly, use optimizations to reduce costs.
But what common errors do organizations make when implementing an MDS, and how do you ensure that yours is future-proof? Interoperability and open source are also important, but why?
Our recent fireside chat, “The Multi-Cloud Modern Data Stack,” brought together four industry trailblazers, Martin Casado, Ali Ghodsi, Sudhir Hasbe and George Fraser, to discuss the ground rules for the multi-cloud MDS.
High-growth businesses embrace data — and multi-cloud
According to Martin Casado, when you’re a General Partner at Andreessen Horowitz, you see “an awful lot of pitch decks,” and then you meet with “about 3,000 companies a year.” Among those pitches and meetings, Casado sees a trend: Data is central to high-growth businesses.
Acting as de facto moderator, Casado said multi-cloud had become something like the default setting at successful businesses: “I think it’s endemic to the conversation.”
Ali Ghodsi noted that companies such as Fivetran and Databricks will continue to re-envision how data is handled. Over time, these and other companies will make common tasks even simpler to perform than they are today. “We’re in the first inning” with the modern data stack, Ghodsi said. This is because “the MDS isn’t out to just mimic on-prem.” Instead, the multi-cloud MDS is leading the way.
“I would love to have all of the data on Google Cloud,” Sudhir Hasbe joked — but he knows that’s not likely: “The real question is how can we enable organizations to leverage all their data across these platforms.”
Multi-cloud helps future-proof your organization
If you’re a business leader, you should ensure that you’re future-proofing your company, says Ghodsi. The best way to do that is to avoid getting locked into a single technology. Instead, go with best-of-breed tools. When you have a multi-cloud MDS, “you’re not going to pick the wrong cloud,” because you’ve already picked a variety.
Because Ghodsi believes the future will be dominated by AI and ML, he also advises that you keep both in mind when building your modern data stack. These emerging technologies will be increasingly important as time goes by.
George Fraser added some advice: Watch what you’re building on top of your data stack. Many companies migrating a legacy data warehouse to a modern system like Databricks, BigQuery or Snowflake now regret their stored procedures. These companies built a large amount of logic on top of their data stacks that was tied to a particular system, so they are, in effect, locked in.
According to Fraser, one way around this problem is to opt for the popular framework from dbt Labs. The dbt framework manages views on top of your data warehouse and largely insulates you from the differences among systems.
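To make the portability point concrete, here is a minimal Python sketch of the general idea of keeping dialect differences in one place. It is not dbt's API. The names `DIALECTS` and `render_view`, the `orders` table, and its `order_date` and `amount` columns are all hypothetical, and the per-warehouse date-truncation syntax is simplified for illustration.

```python
# Hypothetical sketch: one logical model, rendered per warehouse dialect.
# This is NOT dbt's API; it only illustrates isolating dialect differences.

# Each dialect supplies its own spelling of "truncate a date to the month".
DIALECTS = {
    "bigquery": lambda col: f"DATE_TRUNC({col}, MONTH)",
    "snowflake": lambda col: f"DATE_TRUNC('MONTH', {col})",
    "databricks": lambda col: f"date_trunc('MONTH', {col})",
}


def render_view(dialect: str, view_name: str, source_table: str) -> str:
    """Render the same monthly-revenue model as a CREATE VIEW statement."""
    if dialect not in DIALECTS:
        raise ValueError(f"unsupported dialect: {dialect}")
    # Only this expression differs between warehouses.
    month = DIALECTS[dialect]("order_date")
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT {month} AS order_month, SUM(amount) AS revenue\n"
        f"FROM {source_table}\n"
        f"GROUP BY 1"
    )


if __name__ == "__main__":
    for name in DIALECTS:
        print(f"-- {name}")
        print(render_view(name, "monthly_revenue", "orders"))
```

Because the business logic lives in one definition, moving to another warehouse means adding or changing a dialect entry rather than rewriting every view.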
For Hasbe, interoperability is critical — but interoperability within multi-cloud is just the starting point. You should also ensure that you have interoperability for all storage tiers and engines, whether you’re using the information for a data science task or any other operation.
Hasbe added that there are four personas involved in the multi-cloud MDS space: data engineers, data scientists, data analysts — and the administrators who govern and pay for it. All too often, the software is geared toward only one or two of these personas. It’s worth the effort to give each of them a seat at the table. “If you don’t, there can be infighting and politics among those personas,” said Hasbe.
Data governance across clouds is key
According to Hasbe, simplifying data ingestion is critical, but leveraging a platform that allows you to pick and choose where your data resides is vital. Additionally, you’ll need a common governance framework that lets you manage how long your data is stored — and when you get rid of it.
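As a rough illustration of what a common retention rule might look like, the Python sketch below applies one policy table to datasets regardless of which cloud holds them. Everything here is an assumption made for the example: the `Dataset` fields, the classification labels, and the retention periods in `RETENTION_DAYS` are hypothetical and do not come from any particular governance product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Dataset:
    name: str
    cloud: str            # where the data lives, e.g. "gcp" or "aws"
    classification: str   # key into RETENTION_DAYS
    created: date


# Hypothetical policy: one retention period per classification,
# applied the same way on every cloud.
RETENTION_DAYS = {"pii": 365, "logs": 90}


def expired(datasets, today):
    """Return the names of datasets that have outlived their retention period."""
    result = []
    for ds in datasets:
        limit = timedelta(days=RETENTION_DAYS[ds.classification])
        if today - ds.created > limit:
            result.append(ds.name)
    return result
```

The point of the sketch is that the policy is keyed on what the data is, not on where it is stored.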
But the most crucial task is to avoid data duplication, according to Hasbe. He said customers often “are blown away when they see how many copies were created by different users in their organization.” This creates governance and compliance issues. It’s critical to have a single data storage tier that can be regionalized but also governed through a common catalog.
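One simple way to surface such copies is to compare content fingerprints. The Python sketch below assumes small tables held in memory as lists of row tuples. The function names `fingerprint` and `find_duplicates` are hypothetical, and a real catalog would more likely compare metadata or sampled hashes than full contents.

```python
import hashlib
from collections import defaultdict


def fingerprint(rows):
    """Order-insensitive SHA-256 hash of a table given as an iterable of rows."""
    digest = hashlib.sha256()
    # Sort the encoded rows so that row order does not change the hash.
    for encoded in sorted(repr(tuple(row)).encode("utf-8") for row in rows):
        digest.update(encoded)
        digest.update(b"\n")
    return digest.hexdigest()


def find_duplicates(tables):
    """Group table names whose contents produce the same fingerprint."""
    groups = defaultdict(list)
    for name, rows in tables.items():
        groups[fingerprint(rows)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Tables that land in the same group are candidates for consolidation into a single governed copy.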
Data governance is “something of the Wild West right now,” according to Fraser. While not a data governance tool, Fivetran wants to help. “We have a lot of knowledge about where this data came from — when we got it, whose user ID was used to get that data from Salesforce — and so a lot of the stuff we’re talking about is how do we surface all that metadata to feed it downstream into data governance tools to make them work better,” Fraser said.
Optimization reduces cloud costs
According to an article co-written by Martin Casado, the average cost of goods (COGs) for cloud services is “absolutely insane.” His article examined software companies that have had an IPO in the last five years — their COGs have been 50 percent.
“Good job throwing a hand grenade,” Ghodsi joked. Cloud expense is a big issue because businesses need predictability around their COGs, and cloud vendors are making a 30 percent margin. Ghodsi believes that multi-cloud will increase competition among cloud vendors and drive costs down over time.
“I think organizations will need infrastructure that can flexibly go up and down,” said Hasbe. Companies can innovate faster with the cloud: “I think that is the value proposition of moving to the cloud, whichever cloud you move to.” Economies of scale kick in; costs will improve.
Fraser said Fivetran was able to cut its COGs in half by “hammering away at optimizations,” which is much simpler to do in the cloud: “It’s much easier to slice the salami in the cloud.” Fraser said companies should start by looking at low-hanging fruit that can be optimized.
The future stack: one storage layer, multiple use cases
Where is the modern data stack headed? “It’s about creating one storage layer that serves multiple use cases,” observed Fraser. For example, Databricks has added data-warehouse-like characteristics to data lakes, such as a fast SQL engine and an optimized file format that can be read quickly.
“Everyone’s kind of trying to solve the same problem,” he noted. Some vendors are starting with the data warehouse and adding data-lake-like characteristics to it. Others are starting with the data lake and adding data-warehouse-like characteristics to it: “No matter what, it’s going to be good for customers.”
Hasbe added that multi-cloud interoperability will allow you to store data wherever you get the best price. Ghodsi agreed, adding that multi-cloud will become a must-have. If a piece of software works only on one cloud, that will be a dealbreaker for most users.
“There are exciting times ahead” for the multi-cloud modern data stack, according to Ghodsi — the benefits greatly outweigh the challenges.