What is PII Redaction?
PII redaction detects personally identifiable information (PII) in your data, such as full names, phone numbers, and addresses, and replaces it with generic placeholders so that sensitive information is not stored.
How does Kapiche do PII Redaction?
Kapiche uses Microsoft Presidio. The Presidio anonymizer is a Python-based module that replaces detected PII text entities with desired values.
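As an illustration of how Presidio's analyzer and anonymizer fit together, here is a minimal sketch. It is not Kapiche's internal code: the function names `build_engines` and `redact_text` are hypothetical, and the default threshold simply reuses the 0.86 acceptance threshold quoted in this article. It assumes the `presidio-analyzer` and `presidio-anonymizer` packages and a spaCy English model are installed.

```python
def build_engines():
    """Create Presidio's analyzer and anonymizer with their default settings.

    Imports are deferred so this module loads even before the (fairly heavy)
    NLP model is needed.
    """
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    return AnalyzerEngine(), AnonymizerEngine()


def redact_text(text, analyzer, anonymizer, score_threshold=0.86):
    """Detect PII in `text` and replace each entity with a placeholder.

    `score_threshold` drops detections whose confidence is below the cutoff.
    With no custom operators, the anonymizer replaces each entity with its
    type in angle brackets, e.g. <PERSON>.
    """
    results = analyzer.analyze(
        text=text, language="en", score_threshold=score_threshold
    )
    return anonymizer.anonymize(text=text, analyzer_results=results).text
```

A typical call would be `analyzer, anonymizer = build_engines()` followed by `redact_text("Call Jane Doe on 212-555-0100", analyzer, anonymizer)`. Which entities are actually replaced depends on the model and the threshold.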
Which entities will be detected and anonymised?
Presidio contains predefined recognizers for PII entities. The Presidio documentation on supported entities describes the entities Presidio can detect and the method it uses to detect each one.
These entities are added to Kapiche stopwords so they don't appear in your storyboards when analyzing or developing themes.
How accurate is the detection?
Each recognizer, regardless of its complexity, can produce false positives and false negatives. We try to balance the effect of each recognizer on the entire system. A recognizer with many false positives would hurt the usability of the output, while a recognizer with many false negatives might make the system unusable due to security concerns. We aim to optimize for an F2.5 or F2 score, which gives more weight to recall than to precision. If you want to test it yourself, you can use the Presidio demo space on Hugging Face. We use the spaCy model with an acceptance threshold of 0.86.
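To make the recall weighting concrete, the sketch below computes the standard F-beta score, F = (1 + beta^2) * P * R / (beta^2 * P + R). The function name `f_beta` and the sample precision and recall values are illustrative only and are not Kapiche measurements.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Return the F-beta score; beta > 1 weights recall above precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was detected correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two made-up recognizers with mirrored strengths.
high_recall = f_beta(precision=0.5, recall=1.0, beta=2)     # misses nothing, over-flags
high_precision = f_beta(precision=1.0, recall=0.5, beta=2)  # never wrong, misses half

# Under F2 the high-recall recognizer scores better; F2.5 widens the gap.
print(round(high_recall, 3), round(high_precision, 3))
```

Under F1 the two recognizers would tie, which is why a beta above 1 suits redaction: a missed entity (false negative) is treated as more costly than an unnecessary placeholder.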
What is the end-to-end journey of the imported data?
The file is imported via integration or upload
Structured columns that are considered PII entities (e.g. full names, phone numbers, addresses) are ignored or dropped
Unstructured text is sent to an internal Presidio microservice that analyzes and redacts PII data
The redacted file is then stored encrypted in Kapiche for further processing
Normal processing is carried out on unstructured fields
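The column-dropping and text-redaction steps above can be sketched roughly as follows. This is a simplified illustration, not Kapiche's pipeline: `process_import` and `toy_redact` are hypothetical names, the regex-based redactor is only a stand-in for the internal Presidio microservice, and encryption and downstream processing are left out.

```python
import re
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]

# Stand-in for the Presidio microservice: masks only email-like strings.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def toy_redact(text: str) -> str:
    """Replace email-like substrings with a Presidio-style placeholder."""
    return _EMAIL.sub("<EMAIL_ADDRESS>", text)


def process_import(
    rows: Iterable[Row],
    pii_columns: Iterable[str],
    text_columns: Iterable[str],
    redact: Callable[[str], str] = toy_redact,
) -> List[Row]:
    """Drop structured PII columns, then redact the unstructured text columns."""
    pii = set(pii_columns)
    text_cols = list(text_columns)
    cleaned = []
    for row in rows:
        # Structured columns flagged as PII are dropped entirely.
        kept = {name: value for name, value in row.items() if name not in pii}
        # Unstructured text is passed through the redactor.
        for name in text_cols:
            if name in kept:
                kept[name] = redact(kept[name])
        cleaned.append(kept)
    return cleaned
```

For example, a row with a `full_name` column and a free-text `comment` column would come out with `full_name` removed and any email address in `comment` replaced by `<EMAIL_ADDRESS>`. The input rows are left unmodified.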