Data connector for Amazon S3 Cross-Account
The data connector for S3 Cross-Account enables you to directly import a dataset from an Amazon S3 bucket using cross-account authentication.
It can be used to import delimiter-separated value files into the InfoSum Platform, such as data exported from Amazon Redshift. You can split the rows in your data into multiple files with the same structure and merge them into a single dataset. You can import files in many formats, including compressed GZIP files.
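As an illustration, here is a minimal Python sketch of splitting a large export into identically structured, GZIP-compressed parts before upload. The file names and part size are assumptions for the example, not Platform requirements:

```python
import csv
import gzip

SOURCE = "export.csv"       # hypothetical large export, e.g. a Redshift unload
ROWS_PER_PART = 1_000_000   # illustrative part size; adjust to taste

with open(SOURCE, newline="") as src:
    reader = csv.reader(src)
    header = next(reader)   # every part repeats the same header row
    part, writer, out = 0, None, None
    for i, row in enumerate(reader):
        if i % ROWS_PER_PART == 0:
            if out:
                out.close()
            part += 1
            # Each part is a GZIP-compressed file with the same structure.
            out = gzip.open(f"export_part{part}.csv.gz", "wt", newline="")
            writer = csv.writer(out)
            writer.writerow(header)
        writer.writerow(row)
    if out:
        out.close()
```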
Before starting, you will need the following information to hand:
- User ARN
- Session Name
- Bucket
- Import file name(s)
For the steps to configure your Amazon S3 cross-account access, see configuring AWS IAM for the InfoSum S3 cross-account data connector.
To configure a connection, log in to the Platform if you haven't already done so and either create a dataset or access the Bunker of an existing dataset. Once you're in the Bunker, select Import a dataset or use the Import tab, and locate the S3 Cross-Account connector.
Click Connect and enter your credentials as shown below.
The above form contains the following fields:
- Principal: An InfoSum-generated ARN. You need to allow this ARN in your role's Trust Policy so that it is recognised when assuming the User Role supplied in the User ARN field below. To do this, go to your AWS account and replace the text in the Trust Policy with the InfoSum trust relationship policy, as described in configuring AWS IAM for the InfoSum S3 cross-account data connector (see the sketch after this list).
- External ID: An InfoSum-generated ID, unique per user email domain. Use this External ID, along with the Principal, as extra validation when allowing the assumption of your User Role.
- User ARN: The ARN of a User Role you create with a Permissions Policy that allows reading from your S3 bucket. Note: This is not the User ARN shown in your AWS account. The correct ARN to use is the Role ARN shown in IAM > Roles > Summary in AWS.
- Session Name: A user-defined session name.
- Bucket Name: The bucket name without the leading s3:// identifier, i.e. "bucket-name", not "s3://bucket-name".
- Prefix: Optionally, add an extra path here, e.g. SubFolder/NextFolder/.
- GPG Encryption: The GPG public key used to encrypt the file.
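As a rough illustration of the AWS side, here is a hedged sketch (Python with boto3) of creating such a User Role: a Trust Policy that lets the Principal assume the role only with the matching External ID, plus a read-only Permissions Policy for the bucket. All names and ARNs below are placeholders; the authoritative policy text is in configuring AWS IAM for the InfoSum S3 cross-account data connector.

```python
import json
import boto3

# Illustrative placeholders; the real values come from the connector form.
PRINCIPAL_ARN = "arn:aws:iam::111111111111:user/infosum-connector"  # Principal field
EXTERNAL_ID = "example-external-id"                                 # External ID field
BUCKET = "bucket-name"                                              # Bucket Name field

# Trust policy: allow the InfoSum principal to assume the role,
# but only when it presents the matching External ID.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": PRINCIPAL_ARN},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
    }],
}

# Permissions policy: read-only access to the bucket holding the import files.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
    }],
}

iam = boto3.client("iam")
role = iam.create_role(
    RoleName="infosum-s3-import",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="infosum-s3-import",
    PolicyName="infosum-s3-read",
    PolicyDocument=json.dumps(permissions_policy),
)
# The Role ARN to paste into the User ARN field:
print(role["Role"]["Arn"])
```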
In the Bunker UI, you may need to enter the GPG key, described below.
GPG Key:
You can ignore this field if you are not uploading an encrypted file.
Your Bunker will generate a public/private key pair. You can use the GPG public key provided here to encrypt your file. The encrypted file will be decrypted using the Bunker’s private key when you upload the file.
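If you do need to encrypt, a minimal sketch of encrypting an import file with the Bunker's public key follows, assuming the gpg command-line tool is installed and the key has been saved locally. The file names and recipient ID are placeholders:

```python
import subprocess

# Placeholders: the public key saved from the Bunker UI, the file to
# upload, and the UID or fingerprint of the Bunker key once imported.
PUBLIC_KEY = "bunker_public.asc"
PLAINTEXT = "export_part1.csv.gz"
BUNKER_KEY_UID = "bunker@example.com"

# Import the Bunker's public key into the local keyring.
subprocess.run(["gpg", "--import", PUBLIC_KEY], check=True)

# Encrypt the file for the Bunker; --trust-model always avoids the
# interactive trust prompt for the freshly imported key.
subprocess.run(
    [
        "gpg", "--trust-model", "always",
        "--recipient", BUNKER_KEY_UID,
        "--output", PLAINTEXT + ".gpg",
        "--encrypt", PLAINTEXT,
    ],
    check=True,
)
```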
When you have completed all required fields, click Connect.
If you are experiencing slower-than-expected import/export speeds and you're using a VPN or firewall that can block data upload or download, please refer to Add IP addresses to an Allowlist.
The above form shows all the files available within the selected S3 bucket and contains the following fields:
- Key: File name(s) within the S3 bucket, separated by the delimiter specified in the Key Delimiter field below.
- Key Delimiter: The delimiter used to separate file names in the Key field above.
- File is GZIP compressed: Select this option if you are importing compressed GZIP (.gz) files. When you click Download, the Bunker will decompress the files. You can ignore this option if you are not importing compressed GZIP files.
- Field Delimiter: The delimiter used to separate values in the file(s).
- This file is gpg encrypted: Enable this option if you are uploading an encrypted file. When you click Download, the Bunker will decrypt the file using the Bunker's private key.
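If the form does not show the files you expect, you can reproduce the Bunker's view of the bucket outside the Platform. The hedged sketch below (Python with boto3) assumes the role the same way the connector does and lists the matching objects. Note that the Trust Policy normally allows only the InfoSum Principal to assume the role, so you would need to temporarily allow your own principal for this check; all values are placeholders taken from the form fields above.

```python
import boto3

# Placeholders matching the connector form fields.
USER_ARN = "arn:aws:iam::222222222222:role/infosum-s3-import"  # User ARN field
EXTERNAL_ID = "example-external-id"                            # External ID field
BUCKET = "bucket-name"                                         # Bucket Name field
PREFIX = "SubFolder/NextFolder/"                               # Prefix field (optional)

# Assume the role with the External ID, as the connector does.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn=USER_ARN,
    RoleSessionName="connector-check",   # Session Name field
    ExternalId=EXTERNAL_ID,
)["Credentials"]

# List the objects the Bunker would see under the given prefix.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
    print(obj["Key"])
```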
Working with multiple files
- You can specify any number of files.
- All files must have the same structure.
- Clicking a filename overwrites the Key field with that filename. For this reason, we recommend listing multiple files in a text editor and cutting and pasting them into the Key field (see the example below).
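For example, assuming a comma as the Key Delimiter, a Key value for three hypothetical part files can be built like this:

```python
# Hypothetical part file names; join them with the chosen Key Delimiter.
parts = [f"export_part{i}.csv.gz" for i in range(1, 4)]
print(",".join(parts))
# -> export_part1.csv.gz,export_part2.csv.gz,export_part3.csv.gz
```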
Copy the file names into the Key field and click Download, then Connect.
A subset of the data will then appear as a preview. You can perform some minor manipulations at this point, such as selecting which columns to import, enabling multi-value columns, renaming columns and excluding rows.
When you're happy with the preview, accept the settings and select a blank import configuration. You'll then be taken to the Import Wizard, which shows how our Platform has understood your dataset and mapped its columns to our Global Schema.
If this looks correct, accept the Wizard Settings; otherwise, untick the boxes for any incorrectly mapped columns so they can be mapped correctly during the later normalization phase.