Fileslink
We have two options for file sync. If you keep files in cloud-based storage services, you can use Fivetran to sync these files, in formats we support, to your destination. Adding a Connector to your bucket immediately syncs files into your destination. Otherwise, you can use the Upload feature in the Dashboard to manually upload new files to your destination.
The following documentation outlines how Files work for cloud-based storage services. If you want to learn how browser-based upload works, see our Browser Upload documentation. If you want to learn how our Google Sheets file connector works, see our Google Sheets documentation.
Supported cloud-based storage serviceslink
To add a cloud-based storage service to your Fivetran account, use Add Connector in the Dashboard and search for your service.
Please contact us if you'd like to integrate with another cloud-based file storage service.
Supported file formatslink
- Separated Value Files (CSV*, TSV, etc.)
- JSON text files delimited by new lines
- JSON Arrays
- Avro
- Parquet
Our Email connector supports all the same file formats as a cloud-based storage service.
Supported file compressionlink
- TAR and ZIP archives
- BZip2 and GZip compression
Supported encodingslink
- UTF-8, UTF-16, and UTF-32, with big or little endian order
- UTF-8 encoding is assumed if no Byte-Order Mark is present at the beginning of a file
Sync overviewlink
Our integration with storage services is optimized for buckets that receive periodic, automatic data dumps. These could be internal reports generated by your own tools, or APIs that we haven't yet offered full support for.
File storage connections occur at the table level. Each different connection that is created within Fivetran (shown below in the blue connection icon circle) connects to a single storage bucket, and its underlying data is loaded into a single destination table. Any files added to that bucket under the specified folder will be pulled and loaded into the specified destination table. You can configure multiple storage connections such that they sync into a single schema. All file storage cloud connections are pull integrations -- Fivetran periodically pulls new or changed data from the source bucket.
Configuration optionslink
Using these configuration options, you can select subsets of your folders, certain types of files, and more to sync only the files you need in your destination. Setting up multiple connectors targeted at the same source, but with different options, can allow you to slice and dice a source any way you'd like.
Folder path (optional)link
This folder path is used to specify a portion of the bucket in which you'd like Fivetran to look for files. Any files under the specified folder and all of its nested subfolders will be examined for files we can upload. If no prefix is supplied, we'll look through the entire bucket for files to sync.
File pattern (optional)link
The file pattern is a regular expression that we use to decide whether or not to sync certain files. It applies to everything under the prefix. For instance, suppose under the prefix logs
you had three folders: 2017
, 2016
, and errors
. Using the pattern \d\d\d\d/.*
, you could exclude all the files in the errors
folder, because \d\d\d\d
applies to the folders, and .*
applies to the files under them. If you're not sure what regular expression to use, you can leave this field blank, and we'll sync everything under the prefix. If you're feeling particularly bold, you can learn to write your own regex here and test it out using this tool.
File typelink
The file type is used to let Fivetran know that even files without a file extension ought to be parsed as this file type. For example, if you have an automated CSV output system that saves files without a .csv extension, you can specify the CSV type and we will sync them correctly as CSVs. Selecting "infer" will let Fivetran infer from a file's extension (.csv
, .tsv
, .json
, .avro
, or .log
) what to sync. If you do choose a file type, every file we examine will be interpreted as the file type you select, so make sure everything Fivetran syncs has the same file type!
Compressionlink
The compression format is used to let Fivetran know that even files without a compression extension should be decompressed using the selected compression format. For example, if you have an automated CSV output system that GZIPs files to save space, but saves them without a .gzip extension, you can set this field to GZIP. The integration will then decompress every file that it examines using GZIP. If all of your compressed files are correctly marked with a matching compression extension (.bz2
, .gz
, .gzip
, .tar
, or .zip
), you can select "infer".
Error handlinglink
Selecting skip ignores any improperly formatted data within a file, allowing you to sync only valid data. Choosing fail enables you to prevent a file from syncing if any improperly formatted data is detected. With either option you will receive a notification on your dashboard if errors are encountered.
Modified file mergelink
When a previously synced file is modified, should the rows in the destination be replaced or should the new rows be appended to the table. upsert_file
option will replace records in destination, using the filename and line number as the primary key. append_file
option will append records.
Archive folder pattern (optional)link
If there are multiple files within archive (TAR or ZIP) folders, you can use the archive folder pattern to filter those as well. For example, the archive folder pattern .*json
will sync from an archive folder only those files that end in a .json file extension.
Null sequence (optional)link
CSVs have no native notion of a null character. However, some CSV generators have created one, using characters such as \N
to represent null. Note: text is un-escaped before the null sequence is matched, so don't use the escape character in your null sequence. Only use this field if you are sure your CSVs have a null sequence.
Delimiter (optional)link
The delimiter is a character used in CSV files to separate one field from the next. If this is left blank, Fivetran will infer the delimiter for each file, and files of many different types of delimiters can be stored in the same folder with no problems. If this is not left blank, then all CSV files in your search path will be parsed with this delimiter.
Escape character (optional)link
CSVs have a special rule for escaping quotation marks as opposed to other characters - they require two consecutive double quotes to represent an escaped double quote. However, some CSV generators do not follow this rule and use other characters like backslash for escaping. Only use this field if you are sure your CSVs have a different escape character.
Skip header lines (optional)link
Some CSV generating programs include additional header lines at the top of the file. They consist of few lines that do not match the format of the rest of the rows in the file. These header rows can cause undesired behavior in Fivetran because we attempt to parse them as if they were records in your CSV. By setting this value, you can skip fixed-length headers at the beginning of your CSV files.
Note: Only use this field if you are sure your CSVs have more than one header line. We will skip these lines before we map the header to row data for the CSVs.
Skip footer lines (optional)link
Some CSV generating programs include a footer at the bottom of the file. It consists of a few lines that do not match the format of the rest of the rows in the file. These footer rows can cause undesired behavior in Fivetran because we attempt to parse them as if they were records in your CSV. By setting this value, you can skip fixed-length footers at the end of your CSV files.
Headerless files (optional)link
Some CSV generating programs do not include column name headers for the files — they only contain data rows. By setting this value, you request Fivetran to generate generic column names following the convention of column_0
, column_1
, ... column_n
to map the rows.
Note: Only use this option if you are sure your CSVs do not have the header line, and the column orderings remain the same. Otherwise, it can cause undesired behavior in the schema. For example, if the column order changes in the source, we won't be able to identify column changes. If you have no control of the CSV generator and still want to turn on the empty file option, we recommend always keeping the error handling option "on" to catch potential errors.
Naminglink
You can assign custom schema and table names while building your connector. Column names are generated based on the uploaded files. For a CSV, there will be one column for each entry in the first line. For a JSON file, there will be one column for each top level field in the JSON object.
If you change the name of a file in your bucket, all data in that file will be re-imported, which may result in duplicate values within that table.
Primary keyslink
Fivetran adds the below three columns as composite primary keys for a table. In all modes, _file
and _line
are composite primary keys. The column _modified
is an additional primary key that is added if Modified File Merge is set to the append_file
option.
COLUMN | DATA TYPE |
---|---|
_file 🔑 | TEXT |
_line 🔑 | INT |
_modified 🔑 | TIMESTAMP |
Excluding source datalink
Inclusion and exclusion of source data in file storage integrations is done via specification of a folder path. The integration will sync all files under a certain folder path, even if they are themselves in nested subfolders of the specified folder. Excluding fields from source files is not currently supported.
Updating datalink
After the initial load of historical data, Fivetran only pulls incremental updates of any new or modified files from the source, adding an extra column to your tables called fivetran_synced (TIMESTAMP) that indicates the start time of the job that synced this row.
We do not recommend manually changing the contents of your files. Doing so will result in the following behaviours:
The integration has file-level granularity. If you modify a file, we update the table in the destination to match the contents of the file. Adding rows to a CSV that has already been synced will change the last_modified date, which will then trigger a re-sync of the contents of that file. The file name and line numbers will be used as primary keys. Whatever you change in your file will be represented in the file in the destination, but you may lose the previous version of the file that corresponded to the old record.
Frequency of updateslink
By default, all storage services sync new and modified files every 15 minutes. Depending on the size of each update or the number of tables, it may take longer. In that case, the integration will update at the next 15 minute interval, i.e.
-
Update X Start: 9:00am
-
Update X Finish: 9:18am
-
Update Y Start: 9:30am
You can change the update interval in the Dashboard. A storage service sync that encounters an error will re-try periodically after the shorter of the sync frequency or 1 hour.
Deleted datalink
We handle deleted data based on last modified dates, and we treat deleted files (and column) and deleted rows differently.
Remember, our storage service integrations were designed around automated dumps of well-structured data. If you find yourself frequently deleting rows, columns, or entire files, from your buckets, consider using our browser-based CSV Uploader.
If you delete a file in your source bucket, it no longer exists, so it will not have a last modified date. Therefore, the deletion will not be reflected in your destination - so data from the deleted file will still remain in your destination.
Deleted columns will change the last modified date, but the deletions will not appear in your destination.
Deleted rows, on the other hand, will have a last modified date, so the change will be reflected in your destination.