What is PII Redaction?
PII redaction detects personally identifiable information (PII) in your data, such as full names, phone numbers, and addresses, and replaces it with generic placeholders so that we do not store sensitive information.
How does Kapiche do PII Redaction?
Kapiche leverages Microsoft Presidio. The Presidio anonymizer is a Python-based module that replaces detected PII text entities with desired values.
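As a rough illustration of the analyze-then-anonymize flow, the sketch below uses the open-source `presidio-analyzer` and `presidio-anonymizer` packages. It is not Kapiche's internal code: the helper name `redact_text` and the use of Presidio's default placeholder are assumptions, and the default threshold simply mirrors the 0.85 quoted later in this article. It needs both packages and a spaCy English model installed.

```python
def redact_text(text, score_threshold=0.85):
    """Detect PII in `text` and replace each hit with a placeholder.

    Hypothetical helper, not Kapiche's implementation. Requires
    presidio-analyzer, presidio-anonymizer and a spaCy English model.
    """
    # Imported lazily so this module can be loaded without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    # In real use, build the engines once and reuse them across calls.
    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    # Keep only detections whose confidence score meets the threshold.
    results = analyzer.analyze(
        text=text, language="en", score_threshold=score_threshold
    )

    # With no operators supplied, each detected span is replaced by a
    # placeholder of the form <ENTITY_TYPE>, for example <PERSON>.
    return anonymizer.anonymize(text=text, analyzer_results=results).text
```

Passing an `entities=[...]` list to `analyze` would restrict detection to specific entity types from the table below.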
Which entities will be detected and anonymised?
Presidio contains predefined recognizers for PII entities. This page describes the different entities Presidio can detect and the method Presidio employs to detect them. We support the following entities:
Entity Type | Description |
AU_POSTCODE | A 4-digit code that identifies where a person lives |
AU_ABN | The Australian Business Number (ABN) is a unique 11 digit identifier issued to all entities registered in the Australian Business Register (ABR). |
AU_MEDICARE | The Medicare number is a unique identifier issued by the Australian Government that enables the cardholder to receive rebates on medical expenses under Australia's Medicare system |
AU_TFN | The tax file number (TFN) is a unique identifier issued by the Australian Taxation Office to each taxpaying entity |
CREDIT_CARD | Credit card numbers typically range from 12 to 19 digits (https://en.wikipedia.org/wiki/Payment_card_number). The demo referenced below uses a strict credit card checksum, so it validates and redacts only legitimate credit card numbers. We have chosen to err on the side of caution and removed the strict validation check. This ensures that even if customers inadvertently enter incorrect credit card numbers in feedback data, we still catch and redact that sensitive information. |
CRYPTO | A crypto wallet number. Currently only Bitcoin addresses are supported |
DATE OF BIRTH | A date or date-time that refers specifically to someone's date of birth |
EMAIL_ADDRESS | An email address identifies an email box to which email messages are delivered |
IBAN_CODE | The International Bank Account Number (IBAN) is an internationally agreed system of identifying bank accounts across national borders to facilitate the communication and processing of cross border transactions with a reduced risk of transcription errors. |
IP_ADDRESS | An Internet Protocol (IP) address (either IPv4 or IPv6). |
LOCATION | Name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains) |
MEDICAL_LICENSE | Common medical license numbers. |
PERSON | A full person name, which can include first names, middle names or initials, and last names. |
PHONE_NUMBER | A telephone number |
URL | A URL (Uniform Resource Locator), a unique identifier used to locate a resource on the Internet |
US_BANK_NUMBER | A US bank account number is between 8 and 17 digits. |
US_DRIVER_LICENSE | A US driver license according to https://ntsi.com/drivers-license-format/ |
US_ITIN | US Individual Taxpayer Identification Number (ITIN). Nine digits that start with a "9" and contain a "7" or "8" as the fourth digit. |
US_PASSPORT | A US passport number with 9 digits. |
US_SSN | A US Social Security Number (SSN) with 9 digits. |
UK_NHS | A UK NHS number is 10 digits. |
SG_NRIC_FIN | A National Registration Identification Card |
IN_PAN | The Indian Permanent Account Number (PAN) is a unique 10-character alphanumeric identifier issued to all business and individual entities registered as taxpayers. |
STUDENT_NUMBER | A student number is a 5 to 8 digit number starting with an S, used by universities to identify students |
These entities are added to Kapiche stopwords so they don't appear in your storyboards when analyzing or developing themes.
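To make the CREDIT_CARD trade-off in the table above concrete, here is a minimal sketch of strict versus lenient matching. The regex, the function names, and the `<CREDIT_CARD>` placeholder are illustrative assumptions, not Presidio's or Kapiche's actual recognizer; the strict mode uses the standard Luhn checksum.

```python
import re

# 12 to 19 digits, optionally separated by single spaces or hyphens.
CARD_LIKE = re.compile(r"\b(?:\d[ -]?){11,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def redact_cards(text: str, strict: bool = False) -> str:
    """Replace card-like numbers with a placeholder.

    strict=True redacts only numbers that pass the Luhn check;
    strict=False redacts every card-like number, valid or not.
    """
    def replace(match: re.Match) -> str:
        if strict and not luhn_valid(match.group()):
            return match.group()  # leave checksum failures untouched
        return "<CREDIT_CARD>"

    return CARD_LIKE.sub(replace, text)
```

For a mistyped number such as `4111 1111 1111 1112`, `redact_cards(text, strict=True)` leaves the text unchanged because the checksum fails, while the lenient default still redacts it. The cost of the lenient mode is that other long digit runs may be redacted as well.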
How accurate is the detection?
Every recognizer, regardless of its complexity, can produce false positives and false negatives. We try to balance the effect of each recognizer on the entire system: a recognizer with many false positives would hurt the usability of the output, while one with many false negatives could make the system unusable due to security concerns. We try to optimise for an F2 or F2.5 score, which gives more weight to recall than to precision. If you want to test it yourself, you can use the demo space from Hugging Face here. We use the spaCy model with an acceptance threshold of 0.85.
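For readers unfamiliar with F-beta scores, the sketch below shows the standard formula and why a beta above 1 favours recall. The two example recognizers and their precision and recall figures are made up for illustration; they are not measurements of Kapiche's system.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical recognizers with mirrored precision and recall.
high_recall = (0.6, 0.9)      # (precision, recall)
high_precision = (0.9, 0.6)

for beta in (1, 2, 2.5):
    print(
        beta,
        round(f_beta(*high_recall, beta), 3),
        round(f_beta(*high_precision, beta), 3),
    )
```

At beta = 1 the two recognizers score the same; at beta = 2 or 2.5 the high-recall recognizer scores higher, which matches the preference for missing as little PII as possible.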
What is the end to end journey of the imported data?
File is imported via integration or upload
Structured columns that are considered PII entities (e.g. full names, phone numbers, addresses) are ignored or dropped
Unstructured text is sent to an internal Presidio microservice that analyzes and redacts PII data
The redacted file is then stored encrypted in Kapiche for further processing
Normal processing is carried out on unstructured fields
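The journey above can be sketched roughly as follows. The column names, the `prepare_rows` helper, and the email-only `redact` stand-in are all hypothetical; in the real pipeline the redaction step is the Presidio microservice, and encryption and storage happen afterwards.

```python
import re

# Hypothetical names of structured columns treated as PII and dropped outright.
PII_COLUMNS = {"full_name", "phone", "address"}
# Hypothetical names of free-text columns that are sent for redaction.
TEXT_COLUMNS = {"comment"}

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    """Stand-in for the redaction service: masks email addresses only."""
    return EMAIL.sub("<EMAIL_ADDRESS>", text)


def prepare_rows(rows, redactor=redact):
    """Drop PII columns and redact free text before the data is stored."""
    cleaned = []
    for row in rows:
        out = {}
        for column, value in row.items():
            if column in PII_COLUMNS:
                continue                  # structured PII is never kept
            if column in TEXT_COLUMNS:
                value = redactor(value)   # unstructured text is redacted
            out[column] = value
        cleaned.append(out)
    # The caller would encrypt and store `cleaned`, then process it as normal.
    return cleaned
```

Passing the redactor in as an argument keeps the column handling separate from the detection logic, so the stand-in can be swapped for a call to a real redaction service.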