How does PII Redaction work?

Find out how Kapiche's PII redaction feature works

Written by Christina Petrakos
Updated over a week ago

What is PII Redaction?

PII redaction detects personally identifiable information (PII) in your data, such as full names, phone numbers, and addresses, and replaces it with generic placeholders so that we do not store sensitive information.

How does Kapiche do PII Redaction?

Kapiche uses Microsoft Presidio. The Presidio anonymizer is a Python-based module that replaces detected PII text entities with desired values.
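To make the detect-then-replace idea concrete, here is a minimal standard-library sketch. It is not Presidio and not Kapiche's implementation: the `Detection` type, the two regular expressions, and the `<ENTITY_TYPE>` placeholder format are all assumptions chosen for illustration, and real recognizers are far more thorough.

```python
import re
from typing import NamedTuple


class Detection(NamedTuple):
    """A detected span of PII (hypothetical type for this sketch)."""
    entity_type: str
    start: int
    end: int


# Illustrative patterns only; they will miss many real-world formats.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def detect(text):
    """Step 1: find spans that look like PII."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(Detection(entity_type, match.start(), match.end()))
    return found


def anonymize(text, detections):
    """Step 2: replace each span with a generic placeholder."""
    # Work from right to left so earlier offsets stay valid.
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text


sample = "Contact jane@example.com from 10.0.0.1"
redacted = anonymize(sample, detect(sample))
print(redacted)
```

Keeping detection and replacement as separate steps means the replacement value can change without touching the detection logic.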

Which entities will be detected and anonymised?

Presidio contains predefined recognizers for PII entities. Presidio's documentation describes the entities it can detect and the method it uses to detect each one. We support the following entities.

Entity Type



A 4-digit code that identifies where a person lives.


The Australian Business Number (ABN) is a unique 11-digit identifier issued to all entities registered in the Australian Business Register (ABR).


A Medicare number is a unique identifier issued by the Australian Government that enables the cardholder to receive rebates on medical expenses under Australia's Medicare system.


The tax file number (TFN) is a unique identifier issued by the Australian Taxation Office to each taxpaying entity.


A credit card number is between 12 and 19 digits.


A crypto wallet number. Currently only Bitcoin addresses are supported.


A date or time that refers to someone's date of birth.


An email address identifies an email box to which email messages are delivered.


The International Bank Account Number (IBAN) is an internationally agreed system of identifying bank accounts across national borders to facilitate the communication and processing of cross border transactions with a reduced risk of transcription errors.


An Internet Protocol (IP) address (either IPv4 or IPv6).


The name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains).


Common medical license numbers.


A full person name, which can include first names, middle names or initials, and last names.


A telephone number


A URL (Uniform Resource Locator), a unique identifier used to locate a resource on the Internet.


A US bank account number is between 8 and 17 digits.


A US driver license according to


US Individual Taxpayer Identification Number (ITIN). Nine digits that start with a "9" and contain a "7" or "8" as the fourth digit.


A US passport number with 9 digits.


A US Social Security Number (SSN) with 9 digits.


A UK NHS number is 10 digits.


A National Registration Identification Card


The Indian Permanent Account Number (PAN) is a unique 10-character alphanumeric identifier issued to all business and individual entities registered as taxpayers.


A student number is a 5- to 8-digit number starting with an S, used by universities to identify students.

These entities are added to Kapiche stopwords so they don't appear in your storyboards when analyzing or developing themes.
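Several entities in the list above are defined by a fixed format, which is what makes pattern-based detection possible. The sketch below turns two of those descriptions into regular expressions. The patterns, the entity labels, and the `find_entities` helper are assumptions for illustration, not the recognizers Kapiche or Presidio actually run. In particular, the student number is read here as an "S" followed by 5 to 8 digits, and optional hyphens are allowed in the ITIN.

```python
import re

# ITIN: nine digits, starts with 9, fourth digit is 7 or 8.
# Optional hyphens (9XX-7X-XXXX) are an assumption of this sketch.
ITIN = re.compile(r"\b9\d{2}-?[78]\d-?\d{4}\b")

# Student number: assumed to be "S" followed by 5 to 8 digits.
STUDENT_NUMBER = re.compile(r"\bS\d{5,8}\b")

# Labels are illustrative names for this sketch.
RECOGNIZERS = (("US_ITIN", ITIN), ("STUDENT_NUMBER", STUDENT_NUMBER))


def find_entities(text):
    """Return (label, matched_text) pairs for every pattern hit."""
    hits = []
    for label, pattern in RECOGNIZERS:
        hits.extend((label, m.group()) for m in pattern.finditer(text))
    return hits


print(find_entities("ITIN 912-70-1234, student S1234567"))
```

Format-only patterns like these tend to over-match, which is one reason detection tools usually combine them with context words or checksums before accepting a hit.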

How accurate is the detection?

Each recognizer, regardless of its complexity, can produce false positives and false negatives. We try to balance the effect of each recognizer on the entire system. A recognizer with many false positives would affect the usability of the output, while a recognizer with many false negatives might make the system unusable due to security concerns. We try to use an F2.5 or F2 score, which gives more weight to recall than to precision. If you want to test it yourself, you can use the demo space on Hugging Face. We use the spaCy model, with an acceptance threshold of 0.85.
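For readers unfamiliar with F-scores, the general F-beta formula is (1 + beta²) × precision × recall / (beta² × precision + recall), so a beta above 1 rewards recall more than precision. The sketch below computes it and shows a simple score threshold. The function names, the detection dictionaries, and the choice of "greater than or equal to" for the 0.85 cut-off are assumptions for illustration only.

```python
def f_beta(precision, recall, beta):
    """General F-beta score; beta > 1 weights recall more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def accept(detections, threshold=0.85):
    """Keep detections scoring at or above the threshold (assumed >=)."""
    return [d for d in detections if d["score"] >= threshold]


# Same two numbers, swapped: high recall scores better under F2.
high_recall = f_beta(precision=0.5, recall=1.0, beta=2)
high_precision = f_beta(precision=1.0, recall=0.5, beta=2)
print(round(high_recall, 3), round(high_precision, 3))

# Hypothetical detections with confidence scores.
candidates = [
    {"entity_type": "PERSON", "score": 0.90},
    {"entity_type": "LOCATION", "score": 0.60},
]
print(accept(candidates))
```

With F2, a recognizer that catches everything but is right half the time scores about 0.833, while one that is always right but misses half the PII scores about 0.556.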

What is the end to end journey of the imported data?

  1. File is imported via integration or upload

  2. Structured columns that are considered PII entities (e.g. full names, phone numbers, addresses) are ignored or dropped

  3. Unstructured text is sent to an internal Presidio microservice that analyzes and redacts PII data

  4. The redacted file is then stored encrypted in Kapiche for further processing

  5. Normal processing is carried out on unstructured fields
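Steps 2 and 3 of the journey above can be sketched roughly as follows. This is an illustration, not Kapiche's pipeline: the column names, the `redact_text` stand-in for the microservice call, and the single email pattern are all hypothetical, and encryption and downstream processing (steps 4 and 5) are not modeled.

```python
import re

# Hypothetical column names for this sketch.
PII_COLUMNS = {"full_name", "phone", "address"}
TEXT_COLUMNS = {"comment"}

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_text(text):
    """Stand-in for the redaction service; only handles emails here."""
    return EMAIL.sub("<EMAIL_ADDRESS>", text)


def process_rows(rows):
    """Drop structured PII columns, then redact unstructured text."""
    cleaned = []
    for row in rows:
        # Step 2: drop structured columns considered PII.
        kept = {k: v for k, v in row.items() if k not in PII_COLUMNS}
        # Step 3: redact PII inside free-text columns.
        for column in TEXT_COLUMNS:
            if column in kept:
                kept[column] = redact_text(kept[column])
        cleaned.append(kept)
    return cleaned


rows = [
    {"full_name": "Jane Doe", "score": 9,
     "comment": "Email me at jane@example.com"},
]
print(process_rows(rows))
```

Dropping structured columns before any text is redacted means names and phone numbers in dedicated fields never reach the text-analysis stage.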
