Temporary Files in HVR
Question
- When does HVR use temp files?
- Where do temp files go?
- How big are temp files?
Environment
HVR 5
Answer
When does HVR use temp files?
HVR tries to 'stream' data wherever possible, but in some situations it must create temporary files. The main scenarios are:
1. During log-based capture, if a transaction exceeds a limit (all limits are detailed below), the capture job spills it to disk on the capture machine.
2. Row-wise compare and refresh sometimes need to sort data outside the DBMS. If the data in a table exceeds a limit, it is spilled into multiple sort-merge files on the target machine.
3. During row-wise refresh, if the number of changes detected exceeds a limit, those changes are spilled to disk.
4. During row-wise verbose compare or refresh from the GUI, if the number of changes detected exceeds a limit, those changes are spilled to disk.
5. If a single LOB value (e.g. a long varchar or CLOB) exceeds a limit, its contents are spilled to disk.
6. Action Transform with parameter /Command directs the output of the command transform to a temp file and then reads it back.
Where do temp files go?
HVR's temporary files are created in $HVR_TMP, but if that is not defined then they are located in $HVR_CONFIG/tmp/.
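As a rough sketch of that lookup order (the helper name is hypothetical; HVR performs this resolution internally):

    import os

    def hvr_tmp_dir():
        # Temp files go to $HVR_TMP when it is defined...
        tmp = os.environ.get("HVR_TMP")
        if tmp:
            return tmp
        # ...otherwise they fall back to $HVR_CONFIG/tmp/.
        return os.path.join(os.environ["HVR_CONFIG"], "tmp")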
When bulk-loading data into Greenplum or Redshift, HVR puts the data into a 'staging' file first.
How big are temp files?
The size of these files depends on the data that HVR is processing.
The answer to question 1 above lists six scenarios. Several of them state that HVR spills to disk when data 'exceeds a limit'.
For scenarios 1, 3, 4, and 5 this limit is controlled by the environment variable $HVR_TEMP_SPILL. It controls the number of bytes available for temporary 'blobs' in HVR's memory; a blob spills to disk once its size exceeds (X - T) / N, where X = $HVR_TEMP_SPILL, T = the total size of currently cached temp objects, and N = the number of open cached objects. The default for $HVR_TEMP_SPILL is 64M (bytes). This means that, by default, a single blob spills to disk once it is bigger than 32M.
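A minimal sketch of that threshold arithmetic (hypothetical helper; the example with two open cached objects is an assumption that reproduces the 32M figure):

    def blob_spill_threshold(hvr_temp_spill=64 * 1024 * 1024,
                             cached_bytes=0, open_objects=1):
        # (X - T) / N: X = $HVR_TEMP_SPILL, T = bytes already cached,
        # N = number of open cached objects.
        return (hvr_temp_spill - cached_bytes) / open_objects

    # With the 64M default, no cached bytes, and two open objects,
    # each blob spills once it grows past 32M.
    print(blob_spill_threshold(open_objects=2))  # 33554432.0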
For case 2 (sorting), the 'spill limit' is controlled by the environment variables $HVR_SORT_BYTES_LIMIT (default 512 MB) and $HVR_SORT_ROW_LIMIT (default 10 million rows). If the data exceeds either of these amounts, HVR's sort spills to disk. It will never spill, however, if the number of rows is below $HVR_SORT_ROW_MIN (default 128).
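That decision logic can be sketched as follows (hypothetical helper; the defaults mirror the variables above):

    def sort_spills_to_disk(row_count, byte_count,
                            sort_bytes_limit=512 * 1024 * 1024,  # $HVR_SORT_BYTES_LIMIT
                            sort_row_limit=10_000_000,           # $HVR_SORT_ROW_LIMIT
                            sort_row_min=128):                   # $HVR_SORT_ROW_MIN
        # Never spill while the row count is below the minimum.
        if row_count < sort_row_min:
            return False
        # Spill once either the byte or the row limit is exceeded.
        return byte_count > sort_bytes_limit or row_count > sort_row_limit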
HVR's native uncompressed format was originally quite inflated, because a varchar(1000) containing "hello world" is held as 1000 bytes, not as 11 bytes. So, for older HVR versions, in cases 1, 2, and 3 above (log-based capture transactions, sorting, and row-wise changes), the temp file was sometimes much bigger than the original table data.
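A back-of-the-envelope illustration of that inflation, using an assumed table of one million rows:

    # Uncompressed native format holds a varchar(1000) at its declared width.
    declared_width = 1000                   # bytes per value in the temp file
    actual_value = len("hello world")       # 11 bytes of real data
    rows = 1_000_000
    print(rows * declared_width)            # ~1 GB in the temp file
    print(rows * actual_value)              # ~11 MB of actual data
    print(declared_width // actual_value)   # roughly 90x inflation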
To combat this "inflation", HVR has compressed most of its temp files since version 4.6.1. In HVR 4.6.1, temp files for cases 1, 3, 4 and 5 are compressed unless $HVR_TMP_NO_COMPRESS=1 is set; since HVR 4.6.2 they are compressed unless $HVR_TMP_COMPRESS_LEVEL=0 is set. Before HVR 4.6.2, files for case 2 (sorting during row-wise operations) are compressed only if $HVR_SORT_COMPRESS=1 is set; since HVR 4.6.2 they are compressed unless $HVR_SORT_COMPRESS_LEVEL=0 is set.
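Putting those version rules together as a sketch (not HVR code; the version tuples and the case numbering from the list above are assumptions of this illustration):

    def temp_files_compressed(version, case, env):
        # `version` is a (major, minor, patch) tuple, `case` is the scenario
        # number from the list above, `env` is a dict of environment variables.
        if case == 2:  # sort-merge files during row-wise compare/refresh
            if version < (4, 6, 2):
                return env.get("HVR_SORT_COMPRESS") == "1"      # opt-in
            return env.get("HVR_SORT_COMPRESS_LEVEL") != "0"    # opt-out
        # Cases 1, 3, 4 and 5.
        if version < (4, 6, 1):
            return False                                        # no compression yet
        if version < (4, 6, 2):
            return env.get("HVR_TMP_NO_COMPRESS") != "1"        # opt-out
        return env.get("HVR_TMP_COMPRESS_LEVEL") != "0"         # opt-out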