When Should I Cluster and Partition My Data?
Question
Should I manually cluster or partition my BigQuery data?
Environment
Answer
We recommend that you manually partition and cluster your data into BigQuery for optimal Fivetran sync performance.
Partitioning your data in Google BigQuery will impact the cost and performance of writing into Google Cloud Platform (GCP). You can partition your data using 2 main strategies:
- use a table column (TIMESTAMP, DATE, or INTEGER column)
- use the data’s time of ingestion
Clustering will address better read query performance as it tells BigQuery to store your data by certain fields.
What should I partition by?
Partitioning by TIMESTAMP or the data’s time of ingestion can be very useful if you want to run analytics for a specific period of time. Partitioning allows BigQuery to read and only process the rows of the specific time span. This means faster and more cost-efficient queries.
INTEGER partitioning allows you to store a range of values in the same partition. You will have to identify the min and max values for the range size, but this could work for user id, geo coordinates, or even zip code. For more information, see our Partitioned Tables documentation.
What should I cluster by?
Think about what columns people will query your table and the order of the columns. Your queries commonly use filters or aggregation against multiple particular columns. For more information, see our Clustered Tables documentation.
Considerations
Clustering can be done on non-partitioned and partitioned data. There’s no limit on how many values a clustered column might have whereas there is a limitation of 4000 partitions within BigQuery.
When partitioning on TIMESTAMP field or by ingestion time both are limited with day granularity. That is why it is important to identify the correct partitioning strategy.