GCS Bucket Setup
You can use Kapiche to analyse files from your own Google Cloud Storage (GCS) buckets. To do so, you will need to give the Kapiche service account access to your GCS bucket.
More formally, you will need to grant the Kapiche IAM service account email@example.com the Storage Object Admin role on the bucket you will use in Kapiche.
Note: We use Workload Identity, so we don't use long-lived service account keys; instead, we opt for short-lived tokens.
To do this in the Google Cloud Console, navigate to your bucket and select Permissions. You should see a page similar to the screenshot below.
You will need to enter firstname.lastname@example.org into the principals section and choose Storage Object Admin for the role.
Once this is done, you should now be able to use this bucket with your Kapiche account.
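If you would rather script this step than use the Console, the same role grant can be sketched in Python. This is a minimal sketch, not an official Kapiche script: it assumes the google-cloud-storage client library is installed and that your own credentials are allowed to change the bucket's IAM policy. The function names and the bucket name in the usage note are placeholders chosen for illustration.

```python
ROLE = "roles/storage.objectAdmin"  # IAM role ID for "Storage Object Admin"


def make_binding(service_account_email):
    """Build the IAM binding that grants ROLE to a single service account."""
    return {"role": ROLE, "members": {f"serviceAccount:{service_account_email}"}}


def grant_access(bucket_name, service_account_email):
    """Add the binding to the bucket's IAM policy.

    Needs network access and credentials, so it is not run on import.
    """
    # Third-party dependency (assumption): pip install google-cloud-storage
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(make_binding(service_account_email))
    bucket.set_iam_policy(policy)
```

You would call grant_access("my-bucket", "<Kapiche service account address>"), substituting your own bucket name and the service account address given above.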
For customers wanting to use a Kapiche-hosted GCS bucket, we will create a service account that only has access to a designated GCS bucket. We ask customers to use Workload Identity when uploading to buckets, which avoids the security risk of long-term access keys and the need to rotate credentials. Customers do not share GCS buckets. For Kapiche staff to access a customer bucket, they must submit an access request to have the delegated role added to their user account. Kapiche also uses Google Cloud Security Command Center to check that we aren't leaking data from our GCS buckets and that we remain compliant with our security policies.
Once we've set up a GCS bucket, you can use tools such as the Google Cloud Storage client libraries or the gcloud CLI to quickly get started uploading data.
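As an illustration of an upload, here is a minimal sketch using the google-cloud-storage Python client library. It assumes the library is installed and that Application Default Credentials are available in your environment (for example via Workload Identity). The helper names, the folder layout and the bucket name are hypothetical, not values supplied by Kapiche.

```python
import os


def destination_name(folder, local_path):
    """Return the object name '<folder>/<file name>' for a local file."""
    return f"{folder.strip('/')}/{os.path.basename(local_path)}"


def upload(bucket_name, folder, local_path):
    """Upload one local file into a folder of the bucket and return its object name.

    Needs network access and credentials, so it is not run on import.
    """
    # Third-party dependency (assumption): pip install google-cloud-storage
    from google.cloud import storage

    client = storage.Client()  # picks up Application Default Credentials
    blob = client.bucket(bucket_name).blob(destination_name(folder, local_path))
    blob.upload_from_filename(local_path)
    return blob.name
```

For example, upload("my-kapiche-bucket", "surveys", "/tmp/nps_2024.csv") would create the object surveys/nps_2024.csv.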
Please contact us directly to discuss setting up a Kapiche-hosted GCS bucket.
Linking Kapiche with a GCS Bucket
Once you have a bucket set up, you need to link it with Kapiche. If you are using a Kapiche-hosted bucket, we will perform this step for you.
Integration settings can be accessed via the drop-down menu that appears when you click your name in the upper right-hand corner of the product.
You should see GCS among the list of available integrations. You can link your bucket by clicking the 'Add' button, entering your bucket name and clicking 'Update'.
Kapiche has a few key expectations of the data files that it ingests via GCS:
Files are in CSV, XLS or XLSX format
Each new file added to the bucket should contain the same column headers; Kapiche will ignore columns that were not defined at Project creation
For convenience, we recommend organising data files into sub-folders based on the original data source
General data guidelines can be found here
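The expectations above can be checked on your side before a file is uploaded. The following is a minimal sketch of such a pre-upload check for CSV files, using only the Python standard library. It is not Kapiche's ingestion logic, and the function names and sample columns are made up for illustration.

```python
import csv
import io
import os

SUPPORTED_EXTENSIONS = {".csv", ".xls", ".xlsx"}


def is_supported(filename):
    """True if the file extension is one of the accepted formats."""
    return os.path.splitext(filename)[1].lower() in SUPPORTED_EXTENSIONS


def compare_headers(csv_text, project_columns):
    """Compare a CSV header row with the columns defined for the Project.

    Returns (missing, extra): Project columns absent from the file, and
    file columns the Project does not define (these would be ignored).
    """
    header = next(csv.reader(io.StringIO(csv_text)), [])
    missing = [c for c in project_columns if c not in header]
    extra = [c for c in header if c not in project_columns]
    return missing, extra


# Hypothetical sample data for illustration.
sample = "id,comment,region\n1,Great service,AU\n"
missing, extra = compare_headers(sample, ["id", "comment", "score"])
print(missing, extra)  # ['score'] ['region']
```

A non-empty missing list suggests the file's headers have drifted from those used when the Project was created.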
Kapiche allows you to link a Project to a specific folder in a GCS bucket to facilitate automatic ingestion of new data.
We will run a periodic task to check for new files, where a new file is defined as any file that has a last modified date more recent than the time of the last periodic check. This means that you should only create new files for data updates; updating an existing file with new data will cause Kapiche to re-import that file, which may lead to data duplication.
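The rule described above can be sketched as a simple timestamp comparison. This is an illustration only, under the assumption that each file is represented by its name and last modified time; it is not Kapiche's actual implementation, and the file names are invented.

```python
from datetime import datetime, timezone


def find_new_files(files, last_check):
    """Return the names of files modified after the last periodic check.

    files maps each file name to its last modified time.
    """
    return sorted(name for name, modified in files.items() if modified > last_check)


utc = timezone.utc
files = {
    "surveys/jan.csv": datetime(2024, 1, 1, tzinfo=utc),
    "surveys/feb.csv": datetime(2024, 1, 3, tzinfo=utc),
}

# Only the file added since the last check is picked up.
print(find_new_files(files, datetime(2024, 1, 2, tzinfo=utc)))  # ['surveys/feb.csv']

# Editing an existing file bumps its last modified time, so the whole
# file is picked up again on the next check.
files["surveys/jan.csv"] = datetime(2024, 1, 4, tzinfo=utc)
print(find_new_files(files, datetime(2024, 1, 3, 12, tzinfo=utc)))  # ['surveys/jan.csv']
```

The second call shows why data updates should be delivered as new files: an edited file looks new again, and every row it contains is imported a second time.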
You can enable automatic ingestion when creating a project:
Make sure to choose the specific folder for the data source the Project pertains to.
You can also adjust these settings after a Project has been created: