You might have small data (and that’s okay)

Just because a buzzword doesn’t apply to you doesn’t mean people are belittling your use case.
February 22, 2020

The following is adapted from Jesse Anderson’s blog. Jesse Anderson is a data engineer and director of the Big Data Institute.

There is a common beginner question for engineers starting out with Big Data. An engineer will post to social media saying “I need to know which Big Data technology to use. I have 3 billion rows in 10,000 files. The whole dataset is 100 GB. Is Big Data Technology X efficient for processing this?”

The short answer is no. The long answer is “more than likely no,” and only a qualified data engineer can tell you for sure.

The issue starts with a misunderstanding of what Big Data is and isn’t. The original poster is assuming that small data technologies can’t do something for them. After all, 3 billion rows sounds like a lot. It isn’t.

If you think about it, you can easily provision a VM with 256 GB of RAM. For a dataset of 100 GB, the entire dataset could fit in memory. There are some nuances like how much this dataset will grow and the complexity of the processing, but this probably isn’t a Big Data problem.
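To make that arithmetic concrete, here is a rough back-of-envelope sketch using only the figures quoted in the question above. It assumes decimal gigabytes (the post does not say GB versus GiB) and treats the 100 GB as the size the data would occupy once loaded. Both are assumptions, and the in-memory footprint can be larger than the on-disk size, especially if the files are compressed.

```python
# Back-of-envelope sizing with the numbers from the example question.
GB = 10**9  # assumption: decimal gigabytes

rows = 3_000_000_000       # "3 billion rows"
files = 10_000             # "10,000 files"
dataset_bytes = 100 * GB   # "The whole dataset is 100 GB"
vm_ram_bytes = 256 * GB    # the 256 GB VM mentioned above

bytes_per_row = dataset_bytes / rows          # average row width in bytes
mb_per_file = dataset_bytes / files / 10**6   # average file size in MB
ram_headroom = vm_ram_bytes / dataset_bytes   # times the raw data fits in RAM

print(f"{bytes_per_row:.1f} bytes per row")   # 33.3 bytes per row
print(f"{mb_per_file:.0f} MB per file")       # 10 MB per file
print(f"{ram_headroom:.2f}x RAM headroom")    # 2.56x RAM headroom
```

Rows averaging about 33 bytes and files averaging about 10 MB are small by any measure, which is why the row count alone is a poor signal of a Big Data problem.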

On threads with answers to these questions, there is often another person who responds that the use case doesn’t need Big Data. Sometimes, the original poster will get insulted or think that people are belittling their use case. They aren’t.

They say this because the use case would be much better served by a “small” data technology, such as a cloud data warehouse, as its data store. Using a technology with a relational structure instead of a Big Data technology like the Hadoop ecosystem has these major benefits:

  • Less conceptual complexity
  • More prevalent in the marketplace
  • More people know the technology
  • Easier operationally
  • Faster queries
  • Lower operational, technical, and staffing costs
  • Shorter development cycles
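As a small illustration of the "less conceptual complexity" point, the sketch below uses Python's built-in sqlite3 module as a stand-in for a relational store. This is an assumption for illustration only: the article recommends a cloud data warehouse, not SQLite, and the `events` table and its rows are made up.

```python
import sqlite3

# In-memory SQLite database standing in for a relational store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer TEXT, amount REAL)")  # hypothetical schema
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("acme", 10.0), ("acme", 5.0), ("globex", 7.5)],  # made-up sample rows
)

# A grouped aggregate is a single declarative SQL statement.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM events GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```

The whole job is one query against one process, with no cluster to provision or distributed job to tune.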

When someone is telling you that your use case is small data, they aren’t belittling you or your use case. They’re saving you time, money, and effort.

If you do have Big Data problems, you are specifically held back by a small data technology limitation. You are saying “can’t” because you are hitting a known technical limitation. Namely:

  • You’re a manager and you ask for a new feature or a report and the technical person says they can’t due to a technical limitation.
  • You’re a developer and you can’t add new features because the database or data warehouse will fall over and die.
  • You’re an analyst and you can’t do your report because it would take too long or process too much data.
  • You’re a data warehouse engineer and you still can’t run the most intensive queries because they take too much time and too many resources.

These problems often accompany a scale of hundreds of billions of rows or petabytes of data. For these problems, you will need highly trained data engineers.

I’ve seen companies succeed with Big Data in the following ways:

  • Allowing enough time to have a sane project plan
  • Having realistic expectations for what Big Data would do for the company
  • Spending the money on excellent training
  • Getting the team the mentoring and help they need
  • Realizing Big Data is a complex animal

And I’ve seen companies fail in the following ways:

  • Thinking Big Data is the silver bullet that will save the company from itself
  • Rushing through the process and not giving the team the time and resources to succeed
  • Thinking the team can just read some books or watch some YouTube videos to learn Big Data
  • Cheaping out on training and help for the team
  • Having a team without the right skills

Remember that even if your organization does have Big Data use cases, not every data-related use case within your organization is a Big Data one. You can simultaneously have small data and Big Data use cases coexisting within the same organization, and the two should be approached somewhat differently. Don’t hit a fly with a sledgehammer – using Big Data technologies for small data will bring a high expense with little reward.

If you’re running a business that needs help with your Big Data strategy, you can read about my mentoring service.
