Amazon S3 Requirements
This section describes the requirements, access privileges, and other features of Fivetran HVR when using Amazon S3 (Simple Storage Service) for replication.
Supported Platforms
- Amazon S3 versions compatible with HVR can be found on our Platform Support Matrix page (6.1.0, 6.1.5).
Data Management
- Learn how HVR maps data types between source and target DBMSes or file systems on the Data Type Mapping page.
- HVR uses the S3 REST interface (via the cURL library) to connect to S3 and to read and write data during Capture, Continuous Integrate, Bulk Refresh, and Direct File Compare.
- If an HVR Agent runs on an Amazon EC2 node that is in the same AWS network as the S3 bucket, the communication between the HVR Hub and the AWS network uses the HVR protocol, which is more efficient than a direct S3 transfer. Alternatively, this bottleneck can be avoided by configuring the HVR Hub on an EC2 node.
Permissions
To Capture from or Integrate into an Amazon S3 location, it is recommended that the AWS user has the AmazonS3FullAccess permission policy. The permission policy AmazonS3ReadOnlyAccess is sufficient for capture locations that have the location property File_State_Directory defined.
For more information on the Amazon S3 permissions policy, refer to the AWS S3 documentation.
Alternatively, the following minimal set of permissions can be used for an integrate location:
- s3:GetBucketLocation
- s3:ListBucket
- s3:ListBucketMultipartUploads
- s3:AbortMultipartUpload
- s3:GetObject
- s3:PutObject
- s3:DeleteObject
Sample JSON with a user role permission policy for an S3 location
{ "Statement": [ { "Sid": <identifier>, "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<account_id>:<user>/<username>", }, "Action": [ "s3:GetObject", "s3:GetObjectVersion", "s3:PutObject", "s3:DeleteObject", "s3:DeleteObjectVersion", "s3:AbortMultipartUpload" ], "Resource": "arn:aws:s3:::<bucket_name>/*" }, { "Sid": <identifier>, "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<account_id>:<user>/<username>" }, "Action": [ "s3:ListBucket", "s3:GetBucketLocation", "s3:ListBucketMultipartUploads" ], "Resource": "arn:aws:s3:::<bucket_name>" } ] }
For minimal permissions, HVR (since version 6.1.0/10) also supports AWS temporary security credentials in IAM. There are two ways to request temporary credentials from the AWS Security Token Service (STS):
Using a combination of AWS STS Role ARN, AWS Access Key Id, and AWS Secret Access Key
Sample JSON
{ "Version": "2012-10-17", "Statement": [ { "Sid": "", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<account_id>:<user>/<username>" }, "Action": "sts:AssumeRole" } ] }
Using a combination of AWS STS Role ARN and AWS IAM Role (a role that has access to an EC2 machine)
Sample JSON
{ "Version": "2012-10-17", "Statement": [ { "Sid": "", "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::<account_id>:<user>/<username>", "arn:aws:iam::<account_id>:<role>/<username>" ] }, "Action": "sts:AssumeRole" } ] }
S3 Bucket Region
By default, HVR connects to us-east-1 once to determine your S3 bucket region. If a firewall restriction or a service such as AWS PrivateLink prevents HVR from determining the S3 bucket region, you can change this region (us-east-1) to the region where your S3 bucket is located by defining the following action:
Group | Table | Action | Parameter(s) |
---|---|---|---|
S3 | * | Environment | Name=HVR_S3_BOOTSTRAP_REGION, Value=s3_bucket_region |
AWS China
To enable HVR to interact with the AWS China cloud, define the environment variable HVR_AWS_CLOUD with the value CHINA on both the HVR Hub and the remote machine.
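For example, on a Linux hub or agent machine this could be set as an operating system environment variable for the user running HVR (a sketch only; the exact mechanism for defining the variable depends on your setup):

```bash
# Direct HVR to use the AWS China cloud endpoints (Linux/Unix shell example).
export HVR_AWS_CLOUD=CHINA
```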
S3 encryption with Key Management Service (KMS) is not supported in the AWS China cloud.
Hive External Tables
To Compare files that reside on the Amazon S3 location, HVR allows you to create Hive external tables on top of Amazon S3. The Hive ODBC connection can be enabled for the Amazon S3 location by selecting the Hive External Tables field while creating a location or editing the existing location's file connection properties. For more information about configuring Hive external tables, refer to the Hadoop Amazon Web Services Support and Apache Hadoop - Amazon EMR documentation.
ODBC Connection
HVR uses an ODBC connection to the Hadoop cluster, for which it requires the ODBC driver for Hive (Amazon ODBC or Hortonworks ODBC) installed on the machine (or in the same network). The Amazon and Hortonworks ODBC drivers are similar and both work with the Hive 2.x release. However, it is recommended to use the Amazon ODBC driver for Amazon Hive and the Hortonworks ODBC driver for Hortonworks Hive. For information about the supported ODBC driver versions, refer to the HVR release notes (hvr.rel) available in the HVR_HOME directory or on the download page.
On Linux, HVR additionally requires unixODBC.
By default, HVR uses the Amazon ODBC driver to connect to Hadoop. To use the (user-installed) Hortonworks ODBC driver instead, specify it in the ODBC Driver field in the HVR UI while creating a location or editing the existing location's file connection properties.
Amazon does not recommend changing the security policy of the EMR cluster. For this reason, a tunnel must be created between the machine where the ODBC driver is installed and the EMR cluster. On Linux, Unix, and macOS, the tunnel can be created with the following command:
```bash
ssh -i ~/mykeypair.pem -N -L 8157:ec2-###-##-##-###.compute-1.amazonaws.com:8088 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com
```
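In this command, -i points to the EC2 key pair file used for authentication, -N keeps the session open without running a remote command, and -L forwards local port 8157 to port 8088 on the EMR master node; the ### placeholders stand for the master node's address, and the port numbers are those from the example.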
Channel Configuration
For the CSV, JSON, and Avro file formats, the following action definitions are required to handle certain limitations of the Hive deserialization implementation during Bulk or Row-wise Compare:
For CSV
Group | Table | Action | Parameter(s) |
---|---|---|---|
S3 | * | FileFormat | NullRepresentation=\\N |
S3 | * | TableProperties | CharacterMapping="\x00>\\0;\n>\\n;\r>\\r;">\"" |
S3 | * | TableProperties | MapBinary=BASE64 |
For JSON
Group | Table | Action | Parameter(s) |
---|---|---|---|
S3 | * | TableProperties | MapBinary=BASE64 |
S3 | * | FileFormat | JsonMode=ROW_FRAGMENTS |

For Avro
Group | Table | Action | Parameter(s) |
---|---|---|---|
S3 | * | FileFormat | AvroVersion=v1_8 |

v1_8 is the default value for the parameter AvroVersion, so it is not mandatory to define this action.