How does PII Redaction work? | Kapiche Help Center

What is PII Redaction?

PII redaction detects personally identifiable information (PII) in your data, such as full names, phone numbers, and addresses, and replaces it with generic placeholders so that we are not storing sensitive information.
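To make "replaces it with generic placeholders" concrete, here is a minimal sketch of the idea. It is not Kapiche's implementation: the two regular expressions are deliberately simplistic stand-ins, and the `<ENTITY_TYPE>` placeholder format is an assumption chosen for illustration.

```python
import re

# Illustrative patterns only (assumption): real detection of names, addresses,
# etc. needs far more than regular expressions.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace every match with a generic <ENTITY_TYPE> placeholder."""
    for entity_type, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity_type}>", text)
    return text


print(redact("Call me on 0412 345 678 or email jane.doe@example.com."))
# prints: Call me on <PHONE_NUMBER> or email <EMAIL_ADDRESS>.
```

The original values never reach the output, so only the placeholder is stored.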

How does Kapiche do PII Redaction?

Kapiche leverages Microsoft Presidio. The Presidio anonymizer is a Python-based module that replaces detected PII text entities with desired values.

Which entities will be detected and anonymised?

Presidio contains predefined recognizers for PII entities. The Presidio documentation describes the different entities Presidio can detect and the method it employs to detect them. We support the following entities.

Entity Type	Description
AU_POSTCODE	4 digit code to identify where a person lives
AU_ABN	The Australian Business Number (ABN) is a unique 11 digit identifier issued to all entities registered in the Australian Business Register (ABR).
AU_MEDICARE	The Medicare number is a unique identifier issued by the Australian Government that enables the cardholder to receive rebates of medical expenses under Australia's Medicare system
AU_TFN	The tax file number (TFN) is a unique identifier issued by the Australian Taxation Office to each taxpaying entity
CREDIT_CARD	Credit card numbers typically range from 12 to 19 digits (https://en.wikipedia.org/wiki/Payment_card_number). The demo mentioned below uses a strict credit card checksum to validate and redact only legitimate credit card numbers. However, we've chosen to err on the side of caution by removing the strict validation check. This ensures that even if customers inadvertently enter incorrect credit card numbers in feedback data, we still catch and redact that sensitive information.
CRYPTO	A crypto wallet number. Currently only Bitcoin addresses are supported
DATE OF BIRTH	A specific date or time referring to someone's date of birth
EMAIL_ADDRESS	An email address identifies an email box to which email messages are delivered
IBAN_CODE	The International Bank Account Number (IBAN) is an internationally agreed system of identifying bank accounts across national borders to facilitate the communication and processing of cross border transactions with a reduced risk of transcription errors.
IP_ADDRESS	An Internet Protocol (IP) address (either IPv4 or IPv6).
LOCATION	Name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains)
MEDICAL_LICENSE	Common medical license numbers.
PERSON	A full person name, which can include first names, middle names or initials, and last names.
PHONE_NUMBER	A telephone number
URL	A URL (Uniform Resource Locator), unique identifier used to locate a resource on the Internet
US_BANK_NUMBER	A US bank account number is between 8 and 17 digits.
US_DRIVER_LICENSE	A US driver license according to https://ntsi.com/drivers-license-format/
US_ITIN	US Individual Taxpayer Identification Number (ITIN). Nine digits that start with a "9" and contain a "7" or "8" as the fourth digit.
US_PASSPORT	A US passport number with 9 digits.
US_SSN	A US Social Security Number (SSN) with 9 digits.
UK_NHS	A UK NHS number is 10 digits.
SG_NRIC_FIN	A National Registration Identification Card
IN_PAN	The Indian Permanent Account Number (PAN) is a unique 10 character alphanumeric identifier issued to all business and individual entities registered as Tax Payers.
STUDENT_NUMBER	A student number is a 5 to 8 digit number starting with an S, used by universities to identify students

These entities are added to Kapiche stopwords so they don't appear in your storyboards when analyzing or developing themes.
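The CREDIT_CARD row above contrasts strict checksum validation with a lenient approach. The sketch below illustrates that trade-off using the Luhn checksum, the standard check-digit scheme for payment card numbers. The `CARD_LIKE` pattern and the `find_cards` helper are hypothetical names written for this example; they are not Presidio's or Kapiche's actual recognizer.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) > 0 and total % 10 == 0


# Hypothetical pattern: 12 to 19 digits, optionally separated by spaces or dashes.
CARD_LIKE = re.compile(r"\b(?:\d[ -]?){11,18}\d\b")


def find_cards(text: str, strict: bool) -> list[str]:
    """Return card-like spans; in strict mode keep only Luhn-valid ones."""
    candidates = [m.group() for m in CARD_LIKE.finditer(text)]
    if strict:
        return [c for c in candidates if luhn_valid(c)]
    return candidates


# A mistyped card number (last digit wrong) fails the checksum.
feedback = "My card 4111 1111 1111 1112 was declined"
print(find_cards(feedback, strict=True))   # strict mode misses the typo
print(find_cards(feedback, strict=False))  # lenient mode still catches it
```

With strict validation the mistyped number would be left in the text, which is the case the lenient setting is meant to cover, at the cost of occasionally redacting other long digit sequences.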

How accurate is the detection?

Each recognizer, regardless of its complexity, could have false positives and false negatives. We try to balance the effect of each recognizer on the entire system. A recognizer with many false positives would affect the usability of the output, while a recognizer with many false negatives might make the system unusable due to security concerns. We try to optimize for an F2.5 or F2 score, which gives more weight to recall than to precision. If you want to test this yourself, you can use the demo space on Hugging Face. We use the spaCy model, with an acceptance threshold of 0.85.
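For readers unfamiliar with F-beta scores, the sketch below shows the standard formula and how a threshold filter might be applied. The recognizer figures and the detection tuples are made up for illustration, and treating the 0.85 threshold as inclusive is an assumption.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: recall is weighted beta times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def accept(detections, threshold=0.85):
    """Keep (entity_type, score) pairs whose score meets the threshold.

    Whether the real threshold is inclusive is an assumption.
    """
    return [d for d in detections if d[1] >= threshold]


# Two hypothetical recognizers with mirrored precision and recall.
cautious = f_beta(precision=0.9, recall=0.6, beta=2)  # misses more PII
thorough = f_beta(precision=0.6, recall=0.9, beta=2)  # flags more non-PII
print(thorough > cautious)  # prints: True

print(accept([("PERSON", 0.92), ("LOCATION", 0.40)]))
```

Under F1 the two recognizers would score the same; with beta set to 2 or 2.5 the one that misses less PII ranks higher, which matches the preference for recall described above.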

What is the end to end journey of the imported data?

File is imported via integration or upload
Structured columns that are considered PII entities (e.g. full names, phone numbers, addresses) are ignored / dropped
Unstructured text is sent to an internal Presidio microservice that analyzes and redacts PII data
The redacted file is then stored encrypted in Kapiche for further processing
Normal processing is carried out on unstructured fields
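The journey above can be sketched roughly as follows. Everything here is hypothetical: the column names, the `process_rows` helper, and the `stub_redactor` function (an email-only stand-in for the internal Presidio microservice). Encryption and storage are omitted.

```python
import re
from typing import Callable

# Hypothetical column names (assumptions for this example).
PII_COLUMNS = {"full_name", "phone", "address"}  # structured PII: dropped
TEXT_COLUMNS = {"comment"}                       # unstructured text: redacted

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def stub_redactor(text: str) -> str:
    """Stand-in for the redaction service; only handles email addresses."""
    return EMAIL.sub("<EMAIL_ADDRESS>", text)


def process_rows(rows: list[dict], redact: Callable[[str], str]) -> list[dict]:
    """Drop structured PII columns, then redact unstructured text columns."""
    cleaned = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in PII_COLUMNS}
        for col in TEXT_COLUMNS:
            if col in kept:
                kept[col] = redact(kept[col])
        cleaned.append(kept)
    return cleaned


rows = [{"full_name": "Jane Doe", "nps": 9, "comment": "Email me at jane@example.com"}]
print(process_rows(rows, stub_redactor))
# prints: [{'nps': 9, 'comment': 'Email me at <EMAIL_ADDRESS>'}]
```

Passing the redactor in as a function keeps the column handling separate from the detection logic, so the stub can be swapped for a call to a real service.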
