Amazon S3 integration
Written by Kris Rogers

S3 Bucket Setup

Customer Hosted

You can use Kapiche to analyse files from your own Amazon S3 buckets. To do so, you will need to grant the Kapiche IAM User access to your S3 Bucket using a Bucket Policy (Amazon documentation is here).

More formally, you will need to grant the Kapiche IAM user arn:aws:iam::671409750083:user/Kapiche both the s3:GetObject and s3:ListBucket permissions.

To do this in the Amazon AWS Console, navigate to your bucket and select Permissions > Bucket Policy. You should see a page similar to the screenshot below.

You need to enter the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::671409750083:user/Kapiche"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<YOUR-BUCKET-NAME>/*",
                "arn:aws:s3:::<YOUR-BUCKET-NAME>"
            ]
        }
    ]
}

NOTE: Be sure to replace <YOUR-BUCKET-NAME> with the name of your bucket.
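If you prefer to apply the policy programmatically rather than through the console, a minimal sketch like the following achieves the same result. It uses boto3 and assumes your own AWS credentials have permission to modify the bucket policy; the bucket name is a placeholder.

import json

import boto3

# Placeholder -- replace with the name of your bucket.
BUCKET_NAME = "your-bucket-name"

# The same policy shown above, granting the Kapiche IAM user read and list access.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::671409750083:user/Kapiche"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET_NAME}/*",
                f"arn:aws:s3:::{BUCKET_NAME}",
            ],
        }
    ],
}

# Apply the bucket policy.
s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET_NAME, Policy=json.dumps(policy))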

Once this is done, you should now be able to use this bucket with your Kapiche account.

Kapiche Hosted

For customers wanting to use a Kapiche S3 bucket, we will create an access key/secret key pair that only has access to a designated S3 bucket. Customers do not share S3 buckets. We then limit access for this key/secret pair to only certain folders in the bucket. For Kapiche staff to access a customer bucket, they must submit an access request to have the delegated role added to their user. Kapiche also uses AWS Security Hub to make sure we aren't leaking data from our S3 buckets and remain compliant with our security policies.

Once we've set up an S3 bucket, you can use one of numerous third-party libraries, such as boto, to quickly get started uploading data.
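For example, a minimal sketch of an upload using boto3 (the current generation of boto) might look like the following; the bucket name, folder, and file name are placeholders that we would provide or that you would choose.

import boto3

# Credentials can be supplied via environment variables or a shared credentials
# file; we provide the access key/secret key pair for the bucket.
s3 = boto3.client("s3")

# Placeholder bucket, folder, and file names.
s3.upload_file(
    Filename="survey_responses.csv",
    Bucket="kapiche-customer-bucket",
    Key="surveys/survey_responses.csv",
)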

Please contact us directly to discuss setting up a Kapiche hosted S3 bucket.

Linking Kapiche with an S3 Bucket

Once you have a bucket set up, you need to link it with Kapiche. If you are using a Kapiche hosted bucket, we will perform this step for you.

Integration settings can be accessed via the dropdown menu that appears when you click your name in the upper right-hand corner of the product.

You should see S3 among the list of available integrations. You can link your bucket by clicking the 'Add' button, entering your bucket name, and hitting 'Update'.

Data Format

Kapiche has a few key expectations of the data files that it ingests via S3:

  • Files are in CSV format

  • Each new file added to the bucket should contain the same column headers; Kapiche will ignore columns that were not defined at Project creation (a short sketch of producing such files follows this list)

  • For convenience, we recommend organising data files into subfolders based on the original data source

  • General data guidelines can be found here
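As a sketch of how these expectations might be met (the column names here are purely illustrative), each export can be written with a fixed header order and placed in a subfolder named after its source:

import csv

# Illustrative column names -- use the columns defined at Project creation.
FIELDNAMES = ["response_id", "date", "nps_score", "comment"]

def write_export(rows, path):
    """Write rows to a CSV file with a consistent header order."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)

# e.g. one subfolder per data source, mirroring the S3 folder layout:
#   surveys/2024-06-responses.csv
#   support-tickets/2024-06-tickets.csv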

Automatic Ingestion

Kapiche allows you to link a Project to a specific folder in an S3 bucket to facilitate automatic ingestion of new data.

We will run a periodic task to check for new files, where a new file is defined as any file that has a last modified date more recent than the time of the last periodic check. This means that you should only create new files for data updates; updating an existing file with new data will cause Kapiche to re-import that file, which may lead to data duplication.
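For instance, rather than appending to an existing object, each update can be uploaded as a new, uniquely named file. The sketch below assumes boto3 and uses placeholder bucket and folder names.

from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# A timestamped key ensures each update arrives as a new file rather than
# modifying an existing one.
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%S")
key = f"surveys/responses-{timestamp}.csv"

s3.upload_file(Filename="latest_responses.csv",
               Bucket="your-bucket-name", Key=key)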

You can enable automatic ingestion when creating a project:

Make sure to choose the specific folder for the data source the Project pertains to.

You can also adjust these settings after a Project has been created:
