GCS Bucket Setup

Scenario 1. Customer Hosted

You can use Kapiche to analyse files stored in your own Google Cloud Storage (GCS) buckets. To do so, you need to grant the Kapiche service account access to your GCS bucket.

More formally, you need to grant the Kapiche service account botanic-api-prod@kapiche-all.iam.gserviceaccount.com the Storage Object Admin role on the bucket you will use in Kapiche.

Note: We use Workload Identity, so we don't use long-lived service account keys; instead, we rely on short-lived tokens.

To do this in the Google Cloud Console, navigate to your bucket and select Permissions. You should see a page similar to the screenshot below.

You will need to enter botanic-api-prod@kapiche-all.iam.gserviceaccount.com into the principals section and choose Storage Object Admin for the role.

Once this is done, you should now be able to use this bucket with your Kapiche account.
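If you prefer to script this step rather than use the Console, the same grant can be sketched with the google-cloud-storage Python client. This is a minimal sketch, not an official Kapiche script: it assumes the library is installed, that Application Default Credentials with permission to edit the bucket's IAM policy are available, and the function names are hypothetical.

```python
KAPICHE_SERVICE_ACCOUNT = "botanic-api-prod@kapiche-all.iam.gserviceaccount.com"
ROLE = "roles/storage.objectAdmin"  # role ID for "Storage Object Admin"


def kapiche_binding(service_account: str = KAPICHE_SERVICE_ACCOUNT) -> dict:
    """Build the IAM binding that gives the service account the role."""
    return {"role": ROLE, "members": {f"serviceAccount:{service_account}"}}


def grant_kapiche_access(bucket_name: str) -> None:
    """Append the binding to the bucket's existing IAM policy."""
    # Imported here so kapiche_binding() works without the library installed.
    from google.cloud import storage

    client = storage.Client()  # assumes Application Default Credentials
    bucket = client.bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(kapiche_binding())
    bucket.set_iam_policy(policy)
```

Calling grant_kapiche_access("my-bucket"), where "my-bucket" is a placeholder for your own bucket name, should have the same effect as the Console steps above.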

Scenario 2. Kapiche Hosted

For customers who want to use a Kapiche GCS bucket, we create a service account that has access only to the new GCS bucket. We then grant the customer's own service account, in their environment, permissions on this service account.

Customers do not share GCS buckets. For Kapiche staff to access a customer bucket, they must submit an access request to have the delegated role added to their user account. Kapiche also uses Google Cloud Security Command Center to check that we aren't leaking data from our GCS buckets and that we remain compliant with our security policies.

Once we've set up a GCS bucket, you can use any of the many available tools, such as the Google Cloud Storage client libraries or the gcloud CLI, to quickly get started uploading data.
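As an illustration, an upload with the google-cloud-storage Python client might look like the sketch below. It assumes the library is installed and that your environment is already authenticated with credentials that can write to the bucket; the function names and the per-source sub-folder layout are hypothetical choices for this example.

```python
from pathlib import PurePosixPath


def object_name(source: str, local_path: str) -> str:
    """Place the file under a sub-folder named after its data source."""
    filename = PurePosixPath(local_path.replace("\\", "/")).name
    return f"{source}/{filename}"


def upload_file(bucket_name: str, source: str, local_path: str) -> str:
    """Upload one local file and return the object name it was stored under."""
    # Imported here so object_name() works without the library installed.
    from google.cloud import storage

    client = storage.Client()  # assumes Application Default Credentials
    blob = client.bucket(bucket_name).blob(object_name(source, local_path))
    blob.upload_from_filename(local_path)
    return blob.name
```

For example, upload_file("my-bucket", "surveys", "exports/2024-01.csv") would store the file as surveys/2024-01.csv, where the bucket, source, and path are placeholders.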

Scenario 3. Storage Transfer Service

We can set up a Storage Transfer Service job to transfer files between buckets in different accounts. This is managed securely by Google Cloud.

https://cloud.google.com/storage-transfer/docs/cloud-storage-to-cloud-storage

Setup Instructions:

The customer will need to grant the Kapiche Storage Transfer Service account access to their source bucket:
- Navigate to the source bucket permissions in Google Cloud Console
- Add the service account: 545996148440@storage-transfer-service.iam.gserviceaccount.com
- Grant it the Storage Object Viewer role for read access
- Grant it the Storage Transfer Admin role to manage transfers

Transfer job configuration:
- We'll set up a daily transfer schedule based on your requirements
- We can specify exact file patterns or prefixes to transfer
- Options for retention policies and cleanup
- Full audit logging of all transfers
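
To make the configuration options above more concrete, here is a rough sketch of a daily, prefix-filtered job expressed in the shape of the Storage Transfer Service REST API's TransferJob resource. It is not the configuration Kapiche actually uses: the function name, project ID, bucket names, and prefixes are all hypothetical, and the real job is agreed with you during setup.

```python
from datetime import date


def daily_transfer_job(project_id, source_bucket, sink_bucket, prefixes, start):
    """Return a TransferJob-shaped dict for a job that repeats every 24 hours."""
    return {
        "description": f"Daily transfer {source_bucket} -> {sink_bucket}",
        "projectId": project_id,
        "status": "ENABLED",
        "schedule": {
            "scheduleStartDate": {
                "year": start.year,
                "month": start.month,
                "day": start.day,
            },
            "repeatInterval": "86400s",  # once per day
        },
        "transferSpec": {
            "gcsDataSource": {"bucketName": source_bucket},
            "gcsDataSink": {"bucketName": sink_bucket},
            # Only objects whose names start with one of these prefixes are copied.
            "objectConditions": {"includePrefixes": list(prefixes)},
            # Cleanup option: leave the source objects in place after transfer.
            "transferOptions": {"deleteObjectsFromSourceAfterTransfer": False},
        },
    }


job = daily_transfer_job(
    "example-project", "customer-source", "kapiche-dest", ["surveys/"], date(2025, 1, 2)
)
```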

Alternatively, the customer can give us their Storage Transfer Service account. We then grant it access to the destination bucket, and the customer manages the transfer job themselves.

Please contact us directly to discuss setting up a Kapiche hosted GCS bucket.

Linking Kapiche with a GCS Bucket

Once you have a bucket set up, you need to link it with Kapiche. If you are using a Kapiche-hosted bucket, we will perform this step for you.

Integration settings can be accessed via the drop-down menu that appears when you click your name in the upper right-hand corner of the product.

You should see GCS among the list of available integrations. To link your bucket, click the 'Add' button, enter your bucket name and click 'Update'.

Data Format

Kapiche has a few key expectations of the data files that it ingests via GCS:

- Files are in CSV, XLS or XLSX format
- Each new file added to the bucket should contain the same column headers; Kapiche will ignore columns that were not defined at Project creation
- For convenience, we recommend organising data files into sub-folders based on the original data source
- General data guidelines can be found here
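
A quick local check before uploading can catch format problems early. The sketch below is an illustration only, not Kapiche's own validation: the function names are hypothetical, it handles CSV text only, and it assumes that a file missing one of the Project's columns is worth flagging (extra columns are not reported, since Kapiche ignores them).

```python
import csv
import io
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = {".csv", ".xls", ".xlsx"}


def has_supported_format(filename: str) -> bool:
    """True if the file extension is one of the accepted formats."""
    return PurePosixPath(filename).suffix.lower() in ALLOWED_EXTENSIONS


def csv_headers(text: str) -> list[str]:
    """Return the first row of a CSV document, or [] if it is empty."""
    return next(csv.reader(io.StringIO(text)), [])


def missing_columns(project_columns: list[str], text: str) -> list[str]:
    """List the Project columns that do not appear in the file's header row."""
    headers = set(csv_headers(text))
    return [column for column in project_columns if column not in headers]
```

For instance, missing_columns(["id", "comment"], "id,text\n1,hi\n") returns ["comment"], which signals that the new file's headers have drifted from those defined for the Project.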

Automatic Ingestion

Kapiche allows you to link a Project to a specific folder in a GCS bucket to facilitate automatic ingestion of new data.

We will run a periodic task to check for new files, where a new file is defined as any file that has a last modified date more recent than the time of the last periodic check. This means that you should only create new files for data updates; updating an existing file with new data will cause Kapiche to re-import that file, which may lead to data duplication.
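
The rule described above can be sketched in a few lines. This is a simplified model under stated assumptions, not Kapiche's implementation: the function name and the file names are hypothetical, and timestamps are passed in directly rather than read from a bucket.

```python
from datetime import datetime, timezone


def new_files(last_modified: dict[str, datetime], last_check: datetime) -> list[str]:
    """Return names of files modified after the previous periodic check."""
    return sorted(
        name for name, modified in last_modified.items() if modified > last_check
    )


last_check = datetime(2024, 1, 2, tzinfo=timezone.utc)
files = {
    "surveys/jan.csv": datetime(2024, 1, 1, tzinfo=timezone.utc),  # already ingested
    "surveys/feb.csv": datetime(2024, 1, 3, tzinfo=timezone.utc),  # added since check
}
picked = new_files(files, last_check)

# Editing jan.csv in place bumps its last modified date, so it is picked up again.
files["surveys/jan.csv"] = datetime(2024, 1, 4, tzinfo=timezone.utc)
picked_after_edit = new_files(files, last_check)
```

In this model the in-place edit makes the whole of surveys/jan.csv look new, which is why adding a separate file for each data update avoids duplicated records.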

You can enable automatic ingestion when creating a project:

Make sure to choose the specific folder for the data source the Project pertains to.

You can also adjust these settings after a Project has been created:
