Formatting your data to get the best results

This guide will help ensure that your data files are optimized to get the best insights possible.

Written by Ryan Stuart
Updated over a week ago

We’ve put together our suggested best practices for formatting your data so that you get the best results with Kapiche.

Let's take a look!

Supported file formats

Data should be in a spreadsheet format: csv, xls or xlsx.

Note: If you are using an xls or xlsx spreadsheet with multiple sheets, the data to be analysed by Kapiche needs to be in the first sheet.

File size limits

At the moment, each uploaded file is limited to 500K records and a maximum file size of 500MB for CSV files (100MB for xlsx files). If the data you would like to upload exceeds these limits, you will need to upload it in batches.
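If you need to upload in batches, one way to prepare them is to split a large CSV file by record count before uploading. The sketch below is only an illustration and is not a Kapiche tool: `split_csv` and `MAX_RECORDS` are hypothetical names, the file is assumed to be UTF-8 encoded, and the code checks the record limit only, not the file size limit.

```python
import csv
from pathlib import Path

MAX_RECORDS = 500_000  # per-file record limit stated above


def split_csv(source, out_dir, max_records=MAX_RECORDS):
    """Split a CSV file into batch files of at most `max_records` rows each.

    The header row is repeated at the top of every batch file.
    Returns the list of batch file paths that were written.
    """
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    out_file = None
    writer = None
    rows_in_batch = 0

    with source.open(newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty file: nothing to split
            return written
        for row in reader:
            # Start a new batch file when none is open or the current one is full.
            if writer is None or rows_in_batch >= max_records:
                if out_file is not None:
                    out_file.close()
                path = out_dir / f"{source.stem}_part{len(written) + 1}.csv"
                out_file = path.open("w", newline="", encoding="utf-8")
                writer = csv.writer(out_file)
                writer.writerow(header)
                written.append(path)
                rows_in_batch = 0
            writer.writerow(row)
            rows_in_batch += 1

    if out_file is not None:
        out_file.close()
    return written
```

For example, `split_csv("responses.csv", "batches")` would write `batches/responses_part1.csv`, `batches/responses_part2.csv`, and so on, each of which can then be uploaded separately.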

Column formatting

1) Each column should correspond to a specific field (e.g. "Age", "Location", etc) and fields should not be repeated or spread across multiple columns (e.g. for “Gender” there should be a single column for that field, rather than having unique columns for Male, Female & Other).

2) The first row in your spreadsheet needs to be column headers (every column requires a heading - the headings cannot be blank).

3) Each row (after the first) should correspond to an individual survey response, support conversation, product review, etc.

4) There should be at least one free text (unstructured/verbatim) field and at least 500 free text responses (note: this is not a hard-and-fast rule - our system can definitely support smaller quantities, particularly if they’re rich with text/are detailed responses).
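Some of the points above can be checked before you upload. The following is a minimal sketch for a UTF-8 CSV file, not part of Kapiche itself: `check_layout` and its parameters are hypothetical, and the 500-response threshold is only the suggested guideline from point 4, not a hard limit.

```python
import csv


def check_layout(path, text_column, min_text_responses=500):
    """Return a list of human-readable layout problems found in a CSV file."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return ["file is empty"]

        # Every column needs a non-blank heading in the first row.
        for position, name in enumerate(header, start=1):
            if not name.strip():
                problems.append(f"column {position} has a blank heading")

        # There should be at least one free-text column.
        if text_column not in header:
            problems.append(f"free-text column {text_column!r} not found")
            return problems

        # Count rows that actually contain some free text.
        index = header.index(text_column)
        filled = sum(
            1 for row in reader if len(row) > index and row[index].strip()
        )

    if filled < min_text_responses:
        problems.append(
            f"only {filled} free-text responses "
            f"(suggested minimum is {min_text_responses})"
        )
    return problems
```

An empty list means none of these particular checks failed. It does not guarantee that the file is otherwise well formatted.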

Things to avoid

1) Non-numeric values for scores / rating fields; e.g. “10 - Great” instead of “10” (when non-numerics are included as part of the same column our system will likely strip them as per http://help.kapiche.com/en/articles/6500637-how-does-kapiche-clean-numerical-and-nps-data).

2) Mixing numeric values for scores with non-numeric values (e.g. most of your responses are numeric values but some of them also include a text label); behaviour will be as per http://help.kapiche.com/en/articles/6500637-how-does-kapiche-clean-numerical-and-nps-data

3) Ambiguous date formats (e.g. 01/10/2020 could mean the 1st of October or the 10th of January). To prevent this, we recommend following the YYYY/MM/DD format for your dates!

4) Dates stored as text rather than in a Date format. If a column contains values such as Feb'24 and is not formatted as a Date column when you upload it to Kapiche, our software has to guess at the format and can sometimes interpret the value as the 24th of February. If the column is in Date format, we can easily tell what the format is.

5) Bad or missing dates (e.g. exports from some systems will sometimes omit the date field, or shift the remaining data across when the date field is missing). This will often result in an invalid schema, as the Kapiche system needs the date field for many of its widgets. You can choose to skip bad or missing dates at Project creation time or in Project Settings.

6) Fields that are split across multiple columns; e.g. “Gender” is split across two columns: “Gender: Male” and “Gender: Female” with values “1” and “0” instead of a single “Gender” column with values “Male” and “Female”.

7) Rows that don’t correspond to a record of customer data (e.g. titles, dates & other descriptive information about the document or data in the first rows, a “totals” row that sums up the values of numeric columns, etc.)
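Several of the issues above can be fixed in your own data before uploading. The helpers below are a rough sketch of that kind of pre-processing. They are hypothetical functions written for illustration and are not how Kapiche cleans data internally (see the linked article for that). The sketch assumes scores appear at the start of the value, that you know the date format your source system exports, and that split columns are named like "Gender: Male" with "1" marking the selected value.

```python
import re
from datetime import datetime


def leading_score(value):
    """Extract the whole number at the start of a value like '10 - Great'.

    Returns an int, or None if the value does not start with a number.
    """
    match = re.match(r"\s*(\d+)", value)
    return int(match.group(1)) if match else None


def to_ymd(value, source_format):
    """Rewrite a date string in the unambiguous YYYY/MM/DD format.

    `source_format` is the strptime pattern the source system is known to
    use, e.g. '%d/%m/%Y' for day-first dates. Returns None for bad or
    missing dates so that they can be reviewed or skipped.
    """
    try:
        return datetime.strptime(value.strip(), source_format).strftime("%Y/%m/%d")
    except ValueError:
        return None


def merge_indicator_columns(row, field):
    """Collapse columns such as 'Gender: Male' / 'Gender: Female' into one value.

    `row` maps column names to cell values. Returns the label of the single
    column marked '1', or an empty string if none or several are marked.
    """
    prefix = f"{field}: "
    chosen = [
        name[len(prefix):]
        for name, cell in row.items()
        if name.startswith(prefix) and cell.strip() == "1"
    ]
    return chosen[0] if len(chosen) == 1 else ""
```

For instance, `leading_score("10 - Great")` gives `10`, `to_ymd("01/10/2020", "%d/%m/%Y")` gives `"2020/10/01"`, and `merge_indicator_columns({"Gender: Male": "1", "Gender: Female": "0"}, "Gender")` gives `"Male"`.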

Our suggestions above are quite general in nature, as every data set is unique. If you have any questions about your specific use case, be sure to reach out to us using the chat icon towards the bottom-right of any screen so that we can offer you more personalized guidance! 👉
