How does PII Redaction work?

Find out how Kapiche's PII redaction feature works

Written by Christina Petrakos
Updated over 5 months ago

What is PII Redaction?

PII redaction detects personally identifiable information (PII) in your data, such as full names, phone numbers, and addresses, and replaces it with generic placeholders so that we do not store sensitive information.

How does Kapiche do PII Redaction?

Kapiche uses Microsoft Presidio. The Presidio anonymizer is a Python-based module that replaces detected PII entities in text with desired values.
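To illustrate the idea of replacing detected entities with placeholders, here is a minimal standard-library sketch. It does not use Presidio's API: the `DetectedEntity` class, the `redact` function, and the `<ENTITY_TYPE>` placeholder format are assumptions made for this example, and the detected spans are supplied by hand rather than by a real analyzer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedEntity:
    """A detected PII span (hypothetical structure for this sketch)."""
    entity_type: str  # e.g. "PERSON" or "PHONE_NUMBER"
    start: int        # index of the first character of the span
    end: int          # index one past the last character of the span


def redact(text: str, entities: list[DetectedEntity]) -> str:
    """Replace each detected span with a generic <ENTITY_TYPE> placeholder."""
    # Work from the end of the string backwards so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e.start, reverse=True):
        placeholder = f"<{entity.entity_type}>"
        text = text[:entity.start] + placeholder + text[entity.end:]
    return text


sample = "Call Jane Doe on 0400 123 456."
# Spans are hard-coded here; in practice an analyzer would produce them.
found = [
    DetectedEntity("PERSON", 5, 13),
    DetectedEntity("PHONE_NUMBER", 17, 29),
]
print(redact(sample, found))  # Call <PERSON> on <PHONE_NUMBER>.
```

Replacing from the last span to the first keeps the character offsets of the remaining spans correct even though the placeholders differ in length from the original text.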

Which entities will be detected and anonymised?

Presidio contains predefined recognizers for PII entities. The table below lists the entities we support, with a description of each.

Entity Type

Description

AU_POSTCODE

A 4-digit Australian postcode that identifies where a person lives.

AU_ABN

The Australian Business Number (ABN) is a unique 11-digit identifier issued to all entities registered in the Australian Business Register (ABR).

AU_MEDICARE

The Medicare number is a unique identifier issued by the Australian Government that enables the cardholder to receive rebates on medical expenses under Australia's Medicare system.

AU_TFN

The tax file number (TFN) is a unique identifier issued by the Australian Taxation Office to each taxpaying entity

CREDIT_CARD

Credit card numbers typically range from 12 to 19 digits (https://en.wikipedia.org/wiki/Payment_card_number). The Presidio demo mentioned in the accuracy section below uses a strict credit card checksum to validate and redact only legitimate credit card numbers. We have chosen to err on the side of caution and removed the strict validation check, so that even if customers inadvertently enter incorrect credit card numbers in feedback data, we still catch and redact that sensitive information.

CRYPTO

A crypto wallet number. Currently only Bitcoin addresses are supported.

DATE OF BIRTH

A date or time that specifically refers to someone's date of birth.

EMAIL_ADDRESS

An email address identifies an email box to which email messages are delivered

IBAN_CODE

The International Bank Account Number (IBAN) is an internationally agreed system of identifying bank accounts across national borders to facilitate the communication and processing of cross border transactions with a reduced risk of transcription errors.

IP_ADDRESS

An Internet Protocol (IP) address (either IPv4 or IPv6).

LOCATION

Name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains).

MEDICAL_LICENSE

Common medical license numbers.

PERSON

A full person name, which can include first names, middle names or initials, and last names.

PHONE_NUMBER

A telephone number

URL

A URL (Uniform Resource Locator), a unique identifier used to locate a resource on the Internet.

US_BANK_NUMBER

A US bank account number is between 8 and 17 digits.

US_DRIVER_LICENSE

A US driver license according to https://ntsi.com/drivers-license-format/

US_ITIN

US Individual Taxpayer Identification Number (ITIN). Nine digits that start with a "9" and contain a "7" or "8" as the fourth digit.

US_PASSPORT

A US passport number with 9 digits.

US_SSN

A US Social Security Number (SSN) with 9 digits.

UK_NHS

A UK NHS number is 10 digits.

SG_NRIC_FIN

A Singapore National Registration Identification Card (NRIC) number or Foreign Identification Number (FIN).

IN_PAN

The Indian Permanent Account Number (PAN) is a unique 10-character alphanumeric identifier issued to all business and individual entities registered as taxpayers.

STUDENT_NUMBER

A student number is a 5 to 8 digit number starting with an S, used by universities to identify students.
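The CREDIT_CARD entry in the table above contrasts strict checksum validation with a more cautious approach that skips it. The sketch below shows the difference using the Luhn checksum, the standard check digit scheme for payment card numbers. The regular expression, the function names, and the `<CREDIT_CARD>` placeholder are assumptions for this illustration, not Kapiche's or Presidio's actual implementation.

```python
import re

# 12 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){11,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Double every second digit, counting from the right.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def redact_cards(text: str, strict: bool) -> str:
    """Replace card-like numbers; in strict mode only checksum-valid ones."""
    def replace(match: re.Match) -> str:
        if strict and not luhn_valid(match.group()):
            return match.group()  # fails the checksum, left untouched
        return "<CREDIT_CARD>"
    return CARD_PATTERN.sub(replace, text)


typo = "My card is 4111 1111 1111 1112"  # last digit mistyped, fails Luhn
print(redact_cards(typo, strict=True))   # number is left in the text
print(redact_cards(typo, strict=False))  # My card is <CREDIT_CARD>
```

With strict validation, a mistyped card number slips through unredacted. Dropping the check trades a few extra false positives for catching those cases.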

These entities are added to Kapiche stopwords so they don't appear in your storyboards when analyzing or developing themes.

How accurate is the detection?

Each recognizer, regardless of its complexity, can produce false positives and false negatives. We try to balance the effect of each recognizer on the entire system. A recognizer with many false positives would affect the usability of the output, while a recognizer with many false negatives might make the system unusable due to security concerns. We try to use an F2.5 or F2 score, which gives more weight to recall than to precision. If you want to test it yourself, you can use the Presidio demo space on Hugging Face. We use the spaCy model, with an acceptance threshold of 0.85.
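As a rough illustration of why an F2 score favors recall, the sketch below computes the standard F-beta score and applies a confidence threshold to a list of detections. The function names, the sample precision and recall figures, and the choice of `>=` for the threshold comparison are assumptions for this example only.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def accepted(results: list[tuple[str, float]], threshold: float = 0.85) -> list[str]:
    """Keep only detections whose confidence score meets the threshold."""
    return [entity for entity, score in results if score >= threshold]


# Two hypothetical recognizers with mirrored precision and recall.
high_recall = f_beta(precision=0.6, recall=0.9, beta=2)
high_precision = f_beta(precision=0.9, recall=0.6, beta=2)
print(high_recall > high_precision)  # True: F2 rewards the high-recall one

# Hypothetical detections with confidence scores.
print(accepted([("PERSON", 0.92), ("LOCATION", 0.40)]))  # ['PERSON']
```

Under a plain F1 score the two recognizers would tie. With beta set to 2 or 2.5, missing PII (a false negative) costs more than redacting something harmless (a false positive).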

What is the end to end journey of the imported data?

  1. File is imported via integration or upload

  2. Structured columns that are considered PII entities (e.g. full names, phone numbers, addresses) are ignored or dropped

  3. Unstructured text is sent to an internal Presidio microservice that analyzes and redacts PII data

  4. The redacted file is then stored encrypted in Kapiche for further processing

  5. Normal processing is carried out on unstructured fields
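The steps above can be sketched roughly as follows. This is a simplified illustration, not Kapiche's pipeline: the column names in `PII_COLUMNS`, the `redact_text` stand-in (a single email regex instead of a call to the redaction microservice), and the `process_rows` function are all hypothetical, and encrypted storage is omitted.

```python
import re

# Hypothetical set of structured column names treated as PII.
PII_COLUMNS = {"full_name", "phone_number", "address"}

# Stand-in detector: a simple email pattern instead of a full PII analyzer.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_text(text: str) -> str:
    """Stand-in for the call to the redaction service (step 3)."""
    return EMAIL_PATTERN.sub("<EMAIL_ADDRESS>", text)


def process_rows(rows: list[dict[str, str]], text_columns: set[str]) -> list[dict[str, str]]:
    """Drop PII columns (step 2), then redact unstructured text (step 3)."""
    cleaned = []
    for row in rows:
        # Step 2: structured PII columns are dropped entirely.
        kept = {name: value for name, value in row.items() if name not in PII_COLUMNS}
        # Step 3: free-text columns are redacted in place.
        for column in text_columns:
            if column in kept:
                kept[column] = redact_text(kept[column])
        cleaned.append(kept)
    return cleaned


# Step 1: a hypothetical imported row.
rows = [{"full_name": "Jane Doe", "nps": "9",
         "comment": "Email me at jane@example.com please"}]
print(process_rows(rows, text_columns={"comment"}))
```

Dropping structured PII columns outright avoids running detection on fields already known to be sensitive, so only free-text fields need analysis.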
