Snowflake Internal vs. External Staging
Question
Should I use internal or external staging for my Snowflake destination?
Environment
Location: Snowflake
Answer
You can choose between using internal or external staging. In either case, HVR manages the files in the staging area (for example, we delete the files when they are no longer needed). We recommend that you don't keep the files HVR uses to apply changes to the target in the staging area, since it leads to inconsistencies. You can configure an alternative (file) destination to keep this data in files (for example, in a data lake).
NOTE: We recommend that you use internal staging unless you have a strong reason not to.
When you're choosing between using internal and external staging, consider the following:
- Internal staging is easier to configure than external staging.
- External staging may have better performance. With internal staging, data must first be written to a local file system, then uploaded into the staging area. Data cannot be streamed directly into Snowflake's internal staging area. On the other hand, external staging allows for changes to be streamed directly into the staging area, bypassing the step to write files locally.
- Parallelism can affect the performance of external staging. With internal staging, HVR loads data using four parallel streams. With external staging, parallelism is dependent on the cloud running your Snowflake destination (AWS, Azure, or GCP). On AWS, external staging likely performs faster, especially for large volumes of change data. If low latency is critical, then you can run a benchmark on your data set to learn how you can get the best performance.
- Internal staging is accessible through your Snowflake credentials, but you must use alternative credentials to use external staging. Many organizations require regularly rotating of credentials, introducing additional management when using external staging.
- With external staging in S3, you can use KMS client side encryption, allowing for end-to-end encrypted data transfers. However, the files to be uploaded to internal staging are not encrypted at the OS level, though you can enable file system encryption. In Snowflake's internal staging, the data is encrypted, as is communication to Snowflake.
- Costs for using internal staging appear on your Snowflake invoice, while you pay your cloud provider directly for using external staging.
Context
In Snowflake's Overview of data loading documentation, they note that bulk load is the most efficient way to load large volumes of data. For the insert-only initial sync, HVR 6 therefore leverages staging. We load data from files in the staging area directly into the target tables using copy.
For updates following the initial sync, we process changes using controlled routines. With the many source systems (SaaS applications, databases, etc.) that we support, we rarely come across an insert-only workload. Almost all use cases include updates and deletes. We use staging to bulk load incremental changes into staging tables, followed by merge statements to apply the changes to the target tables. Since the data already resides in Snowflake, the merges are very efficient, allowing for high throughput (averaging a high number of changes per second).