Staging for Greenplum
While running Burst Integrate or Bulk Refresh, Fivetran HVR can put data into a temporary directory (‘staging file') before loading data into a target database.
For best performance, HVR performs Burst Integrate or Bulk Refresh into Greenplum location using staging files and the Greenplum Parallel File Distribution (gpfdist) server. HVR implements the file staging in Greenplum as follows:
- HVR first writes data into the local staging directory on the machine from where HVR connects to Greenplum.
- The local staging directory does not have to be present on the Greenplum database machine.
- The staging files (inside the Staging directory) are written in the CSV format and is compressed.
- HVR then uses the Greenplum SQL command
copy
to ingest the data into the Greenplum target tables from the staging directory.- To ingest data into Greenplum, HVR pulls the compressed data from gpfdist:// or gpfdists:// directory into a target table. This requires that a special Greenplum 'external table' exists for each target table that HVR loads data into. HVR will create these tables in Greenplum database with names having the following patterns _ _x or _ _bx.
Configuring Staging
HVR must be configured to stage data on the local directory before loading it into Greenplum. For staging the data on local directory and perform Burst Integrate and Bulk Refresh into Greenplum, the following are required:
To use the gpfdist server for bulk loading operations, ensure that gpfdist is configured on the machine from which HVR will connect to Greenplum.
When you start/create a gpfdist instance/service you must specify a listen port (option -p) and the path (option -d) to the staging directory. Optionally, you may also specify the path to log (option -l) output messages and errors to a log file. For example,
In Linux, to start a gpfdist instance:
gpfdist -p 8081 -d /my_staging_dir -l /hvr/staging_log
In Windows, to create a gpfdist service:
sc create gpfdist binpath= "path_to_gpfdist.exe -p 8081 -d \my_staging_dir -l \hvr\staging_log"
On Windows, gpfdist is a service and the values for port and path must be supplied while creating the Windows service for gpfdist. For more information about running gpfdist as a Windows Service, refer to the Pivotal Greenplum Documentation.
Define the following location properties (while creating a location or by editing the existing location's source and target properties) for the Redshift location:
The Location Property equivalent to the UI field is shown inside (bracket) below.
Staging Directory Is Local (Staging_Directory_Is_Local): Indicates that staging files are created in the local directory. This field is automatically selected by default and it cannot be unselected.
STAGING DIRECTORY (Staging_Directory): local directory path where HVR will create the temporary staging files (e.g. /my_staging_dir). This should be the -d (directory) option of the gpfdist server command.
STAGING DIRECTORY AS VISIBLE FROM THE DATABASE (Staging_Directory_Database): directory path from where Greenplum will access the temporary staging files. This should either be a local directory on the Greenplum head-node or it should be a URL (starting with gpfdist:// or gpfdists://) pointing to staging directory specified in STAGING DIRECTORY. The format for the URL is gpfdist://hostname:port where hostname is the name of the machine used to connect to Greenplum and port is the -p (http port) option of the gpfdist server command.
Sample screenshot