Staging for BigQuery
During Burst Integrate and Bulk Refresh, Fivetran HVR stages data in temporary files (staging files) in a staging directory before loading it into the target BigQuery location. For more information about staging files for BigQuery, refer to the Google BigQuery documentation.
The following outlines how HVR implements staging for BigQuery:
- HVR first writes data into the staging directory on Google Cloud Storage. Since 6.2.5/1, the staging files can be stored in CSV or Parquet format.
Known issues with CSV and Parquet:
- Dates and timestamps before the year 1583 can be replicated incorrectly due to the shift from the Julian to the Gregorian calendar.
- TIMESTAMP WITH TIMEZONE values with fractional seconds in years before 1698 or after 2242 may be reported as mismatches by Compare due to a rounding error during data querying.
- Parquet timestamps are interpreted as UTC. For replication to work correctly, ensure the replication environment uses UTC by setting the environment variable TZ=UTC. You can define this environment variable using the action Environment.
- HVR then uses the BigQuery SQL command INSERT INTO to ingest data into the BigQuery target tables from the staging directory.
- To ingest data into the BigQuery target, HVR pulls the compressed data from the staging directory (located on Google Cloud Storage) into a target table. A special 'external table' must exist for each target table that HVR loads data into. HVR creates these external tables in BigQuery with names following the patterns __x or __bx. A sketch of this load pattern follows the list.
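For illustration only (HVR generates and executes these statements itself), below is a minimal sketch of this INSERT INTO ... SELECT pattern using the google-cloud-bigquery Python client. The project, dataset, and table names are hypothetical, and the SQL that HVR actually issues may differ.

```python
# Sketch of the INSERT INTO ... SELECT pattern HVR uses to move staged
# rows from an external table into the target table. All names here are
# hypothetical; HVR generates and runs the real statements itself.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# "orders__bx" stands in for the external table HVR defines over the
# staging files in GCS; "orders" is the replicated target table.
sql = """
INSERT INTO `my-project.my_dataset.orders` (order_id, amount, updated_at)
SELECT order_id, amount, updated_at
FROM `my-project.my_dataset.orders__bx`
"""

client.query(sql).result()  # block until the load job completes
```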
Configuring Staging
Setting up staging on Google Cloud Storage (GCS) requires configuring both the GCS environment and HVR.
Prerequisites
- A GCS location (bucket) to store staging files. For more information about creating and configuring a GCS bucket, refer to the GCS documentation.
- A Google Cloud user (service account) that HVR uses to access the GCS bucket.
- The permissions required on GCS and BigQuery (one way to grant them is sketched after this list). For more information, refer to the Google BigQuery documentation.
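For illustration, the following is a minimal sketch, using the google-cloud-storage Python client, of creating a staging bucket and granting a service account access to it. The bucket name, service account, and role are assumptions for the example; the permissions your deployment actually needs are listed in the Google BigQuery documentation referenced above.

```python
# Sketch: create a GCS staging bucket and grant a service account access.
# The project, bucket name, service account, and role are hypothetical.
from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.create_bucket("my-hvr-staging-bucket", location="US")

# Grant the service account that HVR connects with read/write access
# to objects in the bucket.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectAdmin",
    "members": {"serviceAccount:hvr-agent@my-project.iam.gserviceaccount.com"},
})
bucket.set_iam_policy(policy)
```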
HVR Location Configuration
To connect HVR to GCS, define the following location properties when creating a BigQuery location.
For an existing BigQuery location, you can define these properties by editing the location's source and target properties.
The Location Property equivalent to the UI field is shown in parentheses below.
In the STORAGE drop-down menu (File_Scheme), select the protocol for connecting HVR to Google Cloud Storage:
- Google Cloud Storage (HTTPS)
- Google Cloud Storage (HTTP)
In the BUCKET field (GS_Bucket), enter the name of the GCS bucket.
In the STAGING DIRECTORY field (Staging_Directory), specify the path to the directory where HVR will create temporary staging files within the GCS bucket (e.g. /my_staging_dir).
Since 6.2.5/1, in the STAGING FILE FORMAT drop-down menu (Staging_File_Format), select the format for storing the staging files:
- CSV
Known issue with CSV: TIMESTAMP WITH TIMEZONE values with fractional seconds in years before 1698 or after 2242 may be reported as mismatches by Compare due to a rounding error during data querying.
- Parquet
Known issues with Parquet:
- Dates and timestamps before the year 1583 can be replicated incorrectly due to the shift from the Julian to the Gregorian calendar.
- TIMESTAMP WITH TIMEZONE values with fractional seconds in years before 1698 or after 2242 may be reported as mismatches by Compare due to a rounding error during data querying.
- Parquet timestamps are interpreted as UTC. For replication to work correctly, ensure the replication environment uses UTC by setting the environment variable TZ=UTC. You can define this environment variable using the action Environment with parameters Name=TZ and Value=UTC.
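After an Integrate or Refresh cycle has run, one way to confirm that staging files are being written under the configured staging directory is to list that prefix in the bucket. Below is a minimal sketch with the google-cloud-storage Python client, reusing the hypothetical bucket and directory names from the earlier examples.

```python
# Sketch: list staging files under the configured staging directory.
# Bucket and directory names are the hypothetical examples used above.
from google.cloud import storage

client = storage.Client(project="my-project")
for blob in client.list_blobs("my-hvr-staging-bucket", prefix="my_staging_dir/"):
    print(blob.name, blob.size)
```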