How to use the Export API

Using the Export API to export data from Kapiche

Written by Caleb Hattingh
Updated over a week ago

Overview

Kapiche provides an Export API that allows you to programmatically fetch enriched data directly from Kapiche for use in your own data warehouse and backend systems.

The API is provided over HTTPS endpoints and is available via an allow list per customer.

At the current time, there is one export data schema available: document-queries.

Requirements

To make use of the API, two steps are necessary:

  1. An access token must be created.

  2. The API must be enabled for a specific analysis within the Kapiche product. Each analysis will have a different base URL (explained further below).

With those two steps completed, an HTTPS API call can now be made to the Kapiche API server to obtain export data.

Optional: Specific IP Address prefixes can be added to an allow-list to improve security even further. If your API calls will always be made from a specific IP Address, that IP address, or a common prefix of a range of IP addresses, can be configured.

NOTE: Only an Admin user can complete the steps mentioned above.

In the sections that follow, each of these requirements will be described in more detail.

Quickstart - Code Example

Let’s assume that the site name of your site is acme. This can be found by looking at your URL in Kapiche once you have logged in to the product:

https://app.kapiche.com/acme/projects/7914/details

In the above URL, the site name is acme. Secondly, your Kapiche Admin must obtain a token. For this code example, assume for now that the token is xxx. Finally, when the Export API is enabled on a particular analysis, a base URL for that analysis will be created.

Given those 3 pieces of information, an API call can be made like this, for example using curl:

$ curl \
-H "Site-Name: acme" \
-H "Content-Type: application/json" \
-H "Authorization: Site xxx" \
-X GET \
"https://api.kapiche.com/export/<uuid>/document-queries/?start_document_id=1&docs_count=1000&export_format=csv"

In the call above,

  • the base URL for the target analysis is https://api.kapiche.com/export/<uuid>/,

  • the document-queries data schema type was requested,

  • with parameters start_document_id=1, docs_count=1000 and export_format=csv

This type of export is discussed further in the section Available Export Data Options.
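The same call can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function name build_export_request is hypothetical, the UUID is the example value used later in this article, and the token xxx is the placeholder from the curl example. The request is only constructed here; sending it needs network access and a real token.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_ROOT = "https://api.kapiche.com/export"


def build_export_request(site_name, token, analysis_uuid, **params):
    """Build (but do not send) a GET request for the document-queries export."""
    url = f"{API_ROOT}/{analysis_uuid}/document-queries/"
    if params:
        # Keyword arguments become URL query parameters, in the order given.
        url = f"{url}?{urlencode(params)}"
    headers = {
        "Site-Name": site_name,
        "Content-Type": "application/json",
        "Authorization": f"Site {token}",
    }
    return Request(url, headers=headers, method="GET")


request = build_export_request(
    "acme",                               # site name from the Kapiche URL
    "xxx",                                # placeholder token
    "0bccbcba45224739b26c5e4b6ce60db9",   # example analysis UUID
    start_document_id=1,
    docs_count=1000,
    export_format="csv",
)
print(request.full_url)

# To actually send it (requires a valid token and network access):
# from urllib.request import urlopen
# with urlopen(request) as response:
#     payload = response.read()
```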

In the following sections, each of the required components of the API will be explained.

Access Tokens

The Admin user for your site can add and delete API tokens on the screen shown below, under the Site Administration section:

When adding a token, the token is presented in a temporary dialog as shown below:

This token is a secret and must be safeguarded like a password. Keep it in your password manager, and only provide it to authorized persons, such as a developer who must write code to access the export API programmatically.

When the dialog is closed, the token will be copied to your clipboard to make it easy for you to paste into your password manager.

NOTE: Kapiche does not store these tokens: only salted, hashed values are stored. Kapiche will not be able to provide a lost token.

If you lose a token, or you need to replace a token that may have been compromised, delete the affected token on this screen and create a new one. It is good practice to periodically replace tokens with new ones.
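One common way to keep the token out of source code is to read it from an environment variable at run time. The sketch below assumes a variable named KAPICHE_EXPORT_TOKEN; that name is an illustration only and is not defined by Kapiche.

```python
import os


def load_token(env_var="KAPICHE_EXPORT_TOKEN"):
    """Return the API token from the environment, or raise if it is missing."""
    # The variable name is a hypothetical choice; use whatever your team prefers.
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your API token")
    return token


def auth_header(token):
    """Format the Authorization header value as used in the curl examples."""
    return f"Site {token}"
```

Because the token is looked up on each run, replacing a token only requires updating the environment, not the code.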

Analysis base URL

When the Export API is enabled for a particular analysis, a base URL will be created that is specific to that analysis:

In the image above, the analysis 18-44 age group in the project grocery has been enabled for the Export API, and the base URL is given as:

https://api.kapiche.com/export/0bccbcba45224739b26c5e4b6ce60db9/

This URL must be used for all exports on this analysis. However, this is not the full URL for extracting data. Different kinds of exported data can be obtained by appending an extra component to the URL. For example, to export data in the form of row-based documents including all Saved Themes in that analysis, append document-queries/. The full URL to access that type of export (from this analysis) would be the following:

https://api.kapiche.com/export/0bccbcba45224739b26c5e4b6ce60db9/document-queries/

The different kinds of exports, as well as their configurable options, are described in more detail next.

Available Export Data Options

1. document-queries

This export data format contains the following fields (or columns):

  1. The document_id as used internally by Kapiche. This field is added by Kapiche and is not part of the original source data. Note that the name of this field is given with two trailing underscores, for disambiguation: document_id__.

  2. One field for every structured field in the analysis.

  3. One field for every text field that is part of the analysis.

  4. One field for every sentiment classification corresponding to each text field in the analysis.

  5. One field for every Saved Theme in the analysis. Each such field contains “1” or “0” to indicate whether that document is part of that Saved Theme.

| document_id__ | City | Age | Response | Sentiment: Response | Sales // Discount | Sales // Return |
|---|---|---|---|---|---|---|
| 1 | Brisbane | 18-25 | The sales discount was great! | positive | 1 | 0 |
| 2 | Sydney | 26-35 | I returned the faulty item. | negative | 0 | 1 |
| 3 | Melbourne | 36-45 | Lovely discount but I returned it anyway. | mixed | 1 | 1 |
| 4 | Perth | 46-60 | No complaints. | neutral | 0 | 0 |

This endpoint provides these additional optional configuration parameters:

| Parameter | Help |
|---|---|
| start_document_id | (int) The starting document_id value from where the exported data should begin. This must be used for correct pagination to export large datasets in batches. See the discussion on pagination further below for examples. Default: 1 |
| docs_count | (int) The maximum number of documents to include in the export. There is a hard upper limit of 50 000. Default: 50 000 |
| vertical-themes | (boolean) If true, the export data will be reshaped so that all theme names are vertically arranged. See the discussion further below on reshaping for examples. Default: false |
| theme-level-separator | (text) If vertical-themes is enabled, this string will be used to construct separate columns from each theme name, split by the given separator. See the discussion further below for examples. Default: '' |
| export_format | (text) What format to use when generating the export. Options: 'csv', 'parquet'. Default: 'csv' |

These are discussed in more detail below.
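As a rough illustration of how these parameters fit together, the following sketch builds a query string and checks the two documented constraints (the docs_count limit and the allowed export_format values) before any call is made. The function name build_query is hypothetical. Note that urlencode percent-encodes a separator such as // as %2F%2F; the curl examples in this article send it unencoded, so treating the two as equivalent is an assumption about standard URL decoding on the server.

```python
from urllib.parse import urlencode

MAX_DOCS_COUNT = 50_000            # documented hard upper limit
EXPORT_FORMATS = {"csv", "parquet"}  # documented options


def build_query(start_document_id=1, docs_count=MAX_DOCS_COUNT,
                vertical_themes=False, theme_level_separator="",
                export_format="csv"):
    """Return a URL query string for the document-queries endpoint."""
    if docs_count > MAX_DOCS_COUNT:
        raise ValueError(f"docs_count may not exceed {MAX_DOCS_COUNT}")
    if export_format not in EXPORT_FORMATS:
        raise ValueError(f"export_format must be one of {sorted(EXPORT_FORMATS)}")
    params = {
        "start_document_id": start_document_id,
        "docs_count": docs_count,
        "export_format": export_format,
    }
    if vertical_themes:
        params["vertical-themes"] = "true"
        # The separator only has an effect when vertical-themes is enabled.
        if theme_level_separator:
            params["theme-level-separator"] = theme_level_separator
    return urlencode(params)


print(build_query(docs_count=1000))
print(build_query(vertical_themes=True, theme_level_separator="//"))
```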

Pagination

As more data is added to a project, the number of documents will grow. Large datasets must be exported in batches. By default, the batch size is 50 000 documents per call. Each document is identified by a document_id integer key that is assigned internally by Kapiche. This key can be used to control the start point for each export API call.

For example, consider the following call:

$ curl \
-H "Site-Name: acme" \
-H "Content-Type: application/json" \
-H "Authorization: Site xxx" \
-X GET \
"https://api.kapiche.com/export/<uuid>/document-queries/?start_document_id=1"

This call will return a data export starting at document_id 1 and containing up to 50 000 documents. The HTTP response to the above call will contain a header indicating the next document id that follows the returned data.

Here is an example of the response headers:

Server: gunicorn
Date: Tue, 06 Dec 2022 23:39:01 GMT
Connection: keep-alive
Transfer-Encoding: chunked
Content-Type: application/zip
Content-Disposition: attachment; filename="c84ac7fa4e9c4d918ea5c15f3a99652b.zip"
Kapiche-Next-Document-Id: 507
Allow: GET, HEAD, OPTIONS
Cache-Control: no-cache, no-store, must-revalidate
Vary: Accept-Encoding, Cookie, Origin

Automation code can use this information to periodically fetch new data for export. The specific header to use is Kapiche-Next-Document-Id. The value given is what must be specified as the start document id for the next API call, to retrieve the next batch of data.
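The example headers above also show a Content-Type of application/zip, so the response body needs to be unpacked before use. The sketch below assumes the archive contains one or more UTF-8 encoded .csv files when export_format=csv is requested; the archive layout and encoding are not described in this article, so treat both as assumptions to verify against a real export.

```python
import csv
import io
import zipfile


def read_csv_rows_from_zip(payload):
    """Return a list of dict rows from every .csv member of a zip payload."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip anything that is not a CSV file
            with archive.open(name) as member:
                # Decode the binary member as text; UTF-8 is an assumption.
                text = io.TextIOWrapper(member, encoding="utf-8", newline="")
                rows.extend(csv.DictReader(text))
    return rows
```

Each row is returned as a dictionary keyed by column name, so the internal id is available as row["document_id__"].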

If there is no new data beyond the given start_document_id then the HTTP response status code will be 204. In an automation setup, that HTTP status code simply means the automation system must wait until new data has been added to the project. Typically an automation is set up to run weekly, or on some similar predictable cadence.
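The pagination behaviour described above can be sketched as a loop that follows the Kapiche-Next-Document-Id header until a 204 status is returned. This is an illustrative outline rather than an official client: the function names fetch_batch and export_all are hypothetical, and the stop condition for a missing header is a defensive assumption.

```python
from urllib.request import Request, urlopen

NEXT_ID_HEADER = "Kapiche-Next-Document-Id"


def fetch_batch(url, headers):
    """Perform one GET and return (status, next_document_id or None, body)."""
    with urlopen(Request(url, headers=headers)) as response:
        next_id = response.headers.get(NEXT_ID_HEADER)
        return response.status, int(next_id) if next_id else None, response.read()


def export_all(endpoint, headers, start_document_id=1, fetch=fetch_batch):
    """Yield (body, next_document_id) for each batch until the API returns 204."""
    current_id = start_document_id
    while True:
        url = f"{endpoint}?start_document_id={current_id}"
        status, next_id, body = fetch(url, headers)
        if status == 204:
            return  # no new data beyond current_id
        yield body, next_id
        if next_id is None or next_id <= current_id:
            return  # defensive: avoid looping forever if the header is absent
        current_id = next_id
```

In a scheduled automation, the last next_document_id yielded can be stored between runs and passed as start_document_id on the following run, so that only new documents are fetched.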

Reshaping

The same data can be provided in a reshaped format where the Saved Themes information is pivoted into a vertical structure. This format can be requested by adding a URL query parameter vertical-themes=true.

In this format, the data is restructured so that each document row is repeated for each Saved Theme that document appears in. In this reshaped format, it is sometimes also convenient to split up theme names into separate columns if a hierarchy has been created in the name.

For example, consider two Saved Themes called Sales // Discount and Sales // Return. If the URL query parameter theme-level-separator=// is provided, these names will be split into separate columns for each level.

Consider the following example exported data. This is obtained by default from the document-queries export format with no extra parameters:

| document_id__ | City | Age | Response | Sentiment: Response | Sales // Discount | Sales // Return |
|---|---|---|---|---|---|---|
| 1 | Brisbane | 18-25 | The sales discount was great! | positive | 1 | 0 |
| 2 | Sydney | 26-35 | I returned the faulty item. | negative | 0 | 1 |
| 3 | Melbourne | 36-45 | Lovely discount but I returned it anyway. | mixed | 1 | 1 |
| 4 | Perth | 46-60 | No complaints. | neutral | 0 | 0 |

If the parameter vertical-themes=true is provided in the Export API call, the following structure will be produced:

| document_id__ | City | Age | Response | Sentiment: Response | Theme__ |
|---|---|---|---|---|---|
| 1 | Brisbane | 18-25 | The sales discount was great! | positive | Sales // Discount |
| 2 | Sydney | 26-35 | I returned the faulty item. | negative | Sales // Return |
| 3 | Melbourne | 36-45 | Lovely discount but I returned it anyway. | mixed | Sales // Discount |
| 3 | Melbourne | 36-45 | Lovely discount but I returned it anyway. | mixed | Sales // Return |

Note:

  • Documents that appear in multiple Saved Themes will appear in multiple rows, as with document_id 3 above.

  • Documents that do not appear in a Saved Theme will be absent, as with document_id 4 above.
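To make the reshaping rule concrete, here is a small Python sketch that applies the same transformation locally to the example rows (with some columns omitted for brevity). It only illustrates the behaviour described above and is not Kapiche's implementation; the function name to_vertical is hypothetical.

```python
THEME_COLUMNS = ["Sales // Discount", "Sales // Return"]

rows = [
    {"document_id__": 1, "City": "Brisbane", "Sales // Discount": 1, "Sales // Return": 0},
    {"document_id__": 2, "City": "Sydney", "Sales // Discount": 0, "Sales // Return": 1},
    {"document_id__": 3, "City": "Melbourne", "Sales // Discount": 1, "Sales // Return": 1},
    {"document_id__": 4, "City": "Perth", "Sales // Discount": 0, "Sales // Return": 0},
]


def to_vertical(rows, theme_columns):
    """Emit one output row per (document, matching Saved Theme) pair."""
    vertical = []
    for row in rows:
        # Keep every non-theme column unchanged.
        base = {key: value for key, value in row.items() if key not in theme_columns}
        for theme in theme_columns:
            if str(row[theme]) == "1":
                vertical.append({**base, "Theme__": theme})
    return vertical


for item in to_vertical(rows, THEME_COLUMNS):
    print(item)
```

Document 3 produces two rows and document 4 produces none, matching the notes above.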

Finally, if the additional parameter theme-level-separator=// is provided, theme levels will be split:

| document_id__ | City | Age | Response | Sentiment: Response | Theme__ Level 1 | Theme__ Level 2 |
|---|---|---|---|---|---|---|
| 1 | Brisbane | 18-25 | The sales discount was great! | positive | Sales | Discount |
| 2 | Sydney | 26-35 | I returned the faulty item. | negative | Sales | Return |
| 3 | Melbourne | 36-45 | Lovely discount but I returned it anyway. | mixed | Sales | Discount |
| 3 | Melbourne | 36-45 | Lovely discount but I returned it anyway. | mixed | Sales | Return |
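The level splitting shown in the table can be approximated locally as follows. This sketch assumes that surrounding whitespace is trimmed from each level, which is how the table above appears but is not stated explicitly; the function name split_theme_levels is hypothetical.

```python
def split_theme_levels(theme_name, separator):
    """Split a theme name into 'Theme__ Level N' columns using the separator."""
    if not separator:
        # An empty separator (the default) means the name is left unsplit.
        return {"Theme__": theme_name}
    # Trimming whitespace around each level is an assumption.
    levels = [part.strip() for part in theme_name.split(separator)]
    return {f"Theme__ Level {index}": level
            for index, level in enumerate(levels, start=1)}


print(split_theme_levels("Sales // Discount", "//"))
print(split_theme_levels("Sales // Return", ""))
```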

The full API call for the above export data structure would look like this:

$ curl \
-H "Site-Name: acme" \
-H "Content-Type: application/json" \
-H "Authorization: Site xxx" \
-X GET \
"https://api.kapiche.com/export/<uuid>/document-queries/?start_document_id=1&vertical-themes=true&theme-level-separator=//"