Redaction accuracy

Accuracy in redaction is a complex topic. No software removes the need to look at a file before sharing it, and for personal use the user is the only judge of what counts as sensitive in the first place. This piece walks through the metrics, the regulations, and what current tools publish. Total Redact is one of the products discussed.

Definition of accuracy

The standard metrics in PII detection are precision, recall, and F1, defined in the i2b2/UTHealth de-identification overview as [1]:

Precision = TP / (TP + FP). The share of flagged spans that were actually sensitive.
Recall = TP / (TP + FN). The share of truly sensitive spans the system found.
F1 = 2 · Precision · Recall / (Precision + Recall). The harmonic mean of the two.
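
As a minimal sketch, the three definitions above can be computed directly from span counts. The function name and the example counts are illustrative only and are not taken from any benchmark.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical run: 90 sensitive spans found, 10 false alarms, 30 spans missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these made-up counts precision is 0.900 and recall is 0.750, and the harmonic mean pulls F1 toward the lower of the two.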

Plain “accuracy”, calculated as (TP + TN) / total, is not used because the data is heavily imbalanced: in any document, almost every token is not sensitive, so a do-nothing classifier scores 99%+ on raw accuracy. Manning, Raghavan, and Schütze make the same point about retrieval in Introduction to Information Retrieval §8.3 [2]:

“In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category. … labeling all documents as nonrelevant is completely unsatisfying to an information retrieval system user.”
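
The imbalance argument is easy to check with arithmetic. The sketch below assumes a hypothetical 1,000-token document with 5 sensitive tokens and a detector that flags nothing; the numbers are invented for illustration.

```python
# Hypothetical document: 1,000 tokens, 5 of them sensitive.
tokens_total = 1000
sensitive = 5

# A do-nothing detector flags no tokens at all.
tp, fp = 0, 0
fn = sensitive                  # every sensitive token is missed
tn = tokens_total - sensitive   # every non-sensitive token is "correctly" ignored

accuracy = (tp + tn) / tokens_total
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

Raw accuracy comes out at 0.995 while recall is 0.000, which is why the field reports precision, recall, and F1 instead.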

What these metrics measure, and what they miss

Precision, recall, and F1 are computed against a fixed list of categories: the i2b2 medical PHI list, HIPAA’s 18 Safe Harbor identifiers, or whatever set of entities a vendor’s NER model was trained on. Mireshghallah et al., in Trust No Bot (COLM 2024), examined real chatbot conversations and made the gap explicit [3]:

“PII detection systems are limited in the kinds of information they can detect, and many other embarrassing, identifiable (specific), and harmful information can remain undetected.”

Their analysis found that on top of the ~70% of queries with detectable PII, around 15% contained non-PII sensitive content the standard detectors do not see: sexual preferences, drug use, mental-health context, politics, religion. Storytelling and roleplay prompts are a separate problem because the names in them may or may not be fictional. A high F1 on a public benchmark does not mean the tool found what the user considers sensitive.

For personal use this gives accuracy two layers: how the model performs on the standard categories, and how well the user can tell the tool about the rest. The second layer is what features like a personal watchlist exist for.

How much accuracy is enough?

None of the major regulators sets a numerical threshold for “redacted enough.” All of them frame the question in terms of reasonable measures and residual risk.

HIPAA (United States)

From the HHS Office for Civil Rights de-identification guidance [4]:

“Although the risk is very small, it is not zero, and there is a possibility that de-identified data could be linked back to the identity of the patient to which it corresponds.”

“There is no explicit numerical level of identification risk that is deemed to universally meet the ‘very small’ level indicated by the method.”

GDPR (European Union)

Recital 26 defines anonymous information by what is reasonable, not by what is perfect [5]:

“account should be taken of all the means reasonably likely to be used … to identify the natural person directly or indirectly … taking into consideration the available technology at the time of the processing and technological developments.”

The Article 29 Working Party Opinion 05/2014 on Anonymisation Techniques (WP216) is more pointed: each technique carries “residual risk of identification inherent” in it, and the Opinion recommends, “Do not rely on the ‘release and forget’ approach” [6].

CCPA / CPRA (California)

Cal. Civ. Code § 1798.140(m) defines “deidentified” information as data that “cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer,” provided the business takes “reasonable measures” to keep it that way and contractually binds anyone it shares the data with [7]. Reasonable, not perfect.

PCI DSS

PCI DSS v4.0 Requirement 3.4 treats display masking (showing only the BIN and last four digits of a card number) as one control, alongside Requirement 3.5 on rendering stored card data unreadable [8]. Masking is one layer in a defense-in-depth scheme, not a single redaction step that has to be perfect on its own.
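
A minimal sketch of display masking as described above: keep the BIN and the last four digits, hide the rest. The function name, the `*` mask character, and the default BIN length of 6 are assumptions made for the example, not text from the standard.

```python
def mask_pan(pan: str, bin_len: int = 6) -> str:
    """Mask a card number for display, keeping only the BIN and the last four digits.

    bin_len is an assumption of this sketch; pass the value your policy allows.
    """
    digits = "".join(ch for ch in pan if ch.isdigit())
    hidden = len(digits) - bin_len - 4
    if hidden < 1:
        raise ValueError("card number too short to mask")
    return digits[:bin_len] + "*" * hidden + digits[-4:]


# 4111 1111 1111 1111 is a widely used test card number, not a real account.
masked = mask_pan("4111 1111 1111 1111")
print(masked)
```

This only covers what is shown on screen; it does nothing for how the number is stored, which is the separate control the paragraph above mentions.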

All four assume non-zero residual risk, require process, and let humans audit the outcome. None ask for a number.

Personal use

Outside regulated workflows, no one is auditing a personal chatbot prompt or a shared Google Doc. The accuracy required is whatever keeps the specific items the user cares about from being exposed. That set can be narrower than what HIPAA covers, or much broader: a journal entry, a draft story, a chat that mentions a family member by name. Standard NER and regex do not reach that broader material. The user does, with a list of terms they have decided are personal.

Vendor publications

Verified, primary-source numbers and quotes:

Vendor or benchmark	Description	Publication
Microsoft Presidio	Most-cited open-source PII detector, used as a baseline in academic comparisons [9]	“there is no guarantee that Presidio will find all sensitive information”
AWS Comprehend Medical	Major cloud provider’s purpose-built medical PHI detection API [10]	“we recommend that you use additional human review or other methods to confirm the accuracy of detected PHI”
Google Cloud DLP	Major cloud provider’s hosted Sensitive Data Protection service [11]	“Built-in infoType detectors are not a perfectly accurate detection method … they can’t guarantee compliance with regulatory requirements”
Nightfall AI	Enterprise SaaS DLP product for cloud applications [12]	“95% precision”
BigID	Enterprise data discovery and privacy platform, common in governance and DSAR work [13]	“97%+ accuracy” (scoped to PI inventory)
Logikcull	eDiscovery platform widely used in legal review and DSAR workflows [14]	“99%+ accuracy” on SSNs, names, addresses, and phone numbers
i2b2 / UTHealth 2014	Standard academic PHI redaction benchmark, basis for over a decade of follow-up work [1]	F1 = 0.9360 (best system, entity-level strict)
Kocaman et al., 2023	Modern transformer-era result on the same i2b2-2014 corpus [15]	F1 = 0.978 (13 fine-grained labels)

No published number is 100%, and the highest numbers come from the vendors with the narrowest scope. Vendor marketing (95%, 97%, 99%) and academic benchmarks (F1 ~0.94 to ~0.98) are not directly comparable: the marketing numbers rarely state which entity types, which dataset, or which scoring method.
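
The scoring method alone can move the number. The sketch below scores the same hypothetical output two ways: strict entity-level matching (a predicted span counts only if start, end, and label all match) and a relaxed variant that accepts any same-label overlap. The spans and the relaxed rule are assumptions for illustration and are not the exact i2b2 scoring scripts.

```python
Span = tuple[int, int, str]  # (start offset, end offset, label)


def harmonic_f1(pred_hits: int, n_pred: int, gold_hits: int, n_gold: int) -> float:
    """F1 from the number of correct predictions and the number of gold spans found."""
    precision = pred_hits / n_pred if n_pred else 0.0
    recall = gold_hits / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def strict_f1(gold: set[Span], pred: set[Span]) -> float:
    """Entity-level strict: boundaries and label must match exactly."""
    exact = len(gold & pred)
    return harmonic_f1(exact, len(pred), exact, len(gold))


def overlap_f1(gold: set[Span], pred: set[Span]) -> float:
    """Relaxed: any character overlap with the same label counts as a hit."""
    def overlaps(a: Span, b: Span) -> bool:
        return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

    pred_hits = sum(any(overlaps(p, g) for g in gold) for p in pred)
    gold_hits = sum(any(overlaps(g, p) for p in pred) for g in gold)
    return harmonic_f1(pred_hits, len(pred), gold_hits, len(gold))


# Hypothetical case: the detector found the phone number exactly
# but caught only the first 8 characters of a 14-character name.
gold = {(0, 14, "NAME"), (20, 31, "PHONE")}
pred = {(0, 8, "NAME"), (20, 31, "PHONE")}

print(strict_f1(gold, pred), overlap_f1(gold, pred))
```

The same output scores 0.5 under strict matching and 1.0 under the relaxed rule, so a percentage quoted without its scoring method says little.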

Standard benchmarks like i2b2 are written for clean clinical English. Real documents contain multiple languages, scanned pages, embedded tables, document metadata, comments, footers, and email headers. Performance on a clean corpus is an upper bound on what any tool will do on real files. Public redaction failures, like the Manafort filing in 2019 [16] and the Epstein document release in 2024 [17], involved redaction that was technically applied but left identifiers behind in metadata or in copy-paste-recoverable text.
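
For DOCX files specifically, one way to check for leftovers is to look inside the package itself: a .docx is a zip archive of XML parts, and properties and comments live in parts separate from the visible body. The sketch below uses only the Python standard library. The sample archive is a stand-in built for the example, not a valid Word file, and the raw substring search is a simplification, since Word can split visible text across several XML runs.

```python
import io
import zipfile


def parts_containing(docx_bytes: bytes, term: str) -> list[str]:
    """Return the names of XML parts in a DOCX package whose raw text contains term."""
    needle = term.lower()
    hits = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith(".xml"):
                xml = zf.read(name).decode("utf-8", errors="ignore")
                if needle in xml.lower():
                    hits.append(name)
    return sorted(hits)


def make_sample_package() -> bytes:
    """Build a tiny stand-in archive: body redacted, name left in properties and a comment."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<w:t>[REDACTED] signed the form.</w:t>")
        zf.writestr("docProps/core.xml", "<dc:creator>Jonathan Smith</dc:creator>")
        zf.writestr("word/comments.xml", "<w:t>Ask Jonathan Smith to confirm.</w:t>")
    return buf.getvalue()


leaks = parts_containing(make_sample_package(), "Jonathan Smith")
print(leaks)
```

In the sample, the body part is clean, yet the name still turns up in the properties part and the comments part.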

Total Redact accuracy

For the documents in our test corpus, Total Redact catches every planted sensitive item. The corpus runs to 500+ automated cases across 425 sample documents covering health-form, tax-form, banking, employment, and email scenarios, including the metadata, comment, footnote, and email-header positions where redaction failures hide.

That bar does not generalize cleanly to every document in the world. Every redactor faces the same set of context problems. Different document formats (PDF, DOCX, scanned PDFs that require OCR before any text-based detector can see them) each have their own failure modes. A short token like “Jo” can be a person or a fragment of a longer word. A word like “Grant” can be a first name or a verb. A column of nine-digit numbers in a spreadsheet can be phone numbers or transaction IDs. The same person can appear in one file as “Jonathan Smith”, “Jon”, and “J. Smith”, and a detector that catches one form may miss the other two. Sensitive data routinely sits in places readers do not see: document metadata, footers, alt text, comment threads, slide speaker notes.
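
The name-variant problem can be sketched as follows: expand one full name into the forms it is likely to take, then match them longest-first with word boundaries so that “Jon” does not fire inside “Jonathan”. The helper names and the variant rules are hypothetical and deliberately small; they are not a description of how any particular product works.

```python
import re


def name_variants(full_name: str, nicknames: tuple[str, ...] = ()) -> set[str]:
    """Expand a full name into a few common written forms (illustrative rules only)."""
    parts = full_name.split()
    variants = {full_name, *nicknames}
    if len(parts) >= 2:
        first, last = parts[0], parts[-1]
        variants.add(f"{first[0]}. {last}")  # e.g. "J. Smith"
    return variants


def find_mentions(text: str, variants: set[str]) -> list[str]:
    """Find variant mentions, trying longer forms first and requiring word boundaries."""
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(v) for v in ordered) + r")(?!\w)"
    )
    return [m.group() for m in pattern.finditer(text)]


variants = name_variants("Jonathan Smith", nicknames=("Jon",))
text = "Jonathan Smith signed. Jon called later; J. Smith approved. Jonas did not."
mentions = find_mentions(text, variants)
print(mentions)
```

All three forms from the paragraph above are found, and the unrelated “Jonas” is left alone.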

Detection itself is a trade-off. A redactor that flags every token is unusable; one that flags too few is unsafe. Total Redact gives the user both controls. The watchlist forces a term to be caught every time it appears. The allowlist suppresses a term so it stops getting flagged on every document, which removes the noise of bypassing the same false positive over and over. Every detection is shown for review before anything is written, and the user accepts or rejects each one. Over time, watchlist and allowlist together turn the app into a personalized redactor that has learned what is and is not sensitive in the user’s work. The point is the workflow, not a single accuracy number.
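
The watchlist and allowlist logic described above can be sketched in a few lines. This is an illustration of the idea under assumed names (`Detection`, `apply_lists`, `detect`), not Total Redact’s implementation: allowlisted terms are dropped from the model’s output, watchlist terms are always added, and the result is a queue for the user to review.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    text: str
    start: int
    end: int
    source: str  # "model" or "watchlist"


def apply_lists(text: str, detections: list[Detection],
                watchlist: list[str], allowlist: list[str]) -> list[Detection]:
    """Drop allowlisted model detections, then force-add every watchlist match."""
    allowed = {term.lower() for term in allowlist}
    by_span = {(d.start, d.end): d for d in detections if d.text.lower() not in allowed}
    for term in watchlist:
        if not term:
            continue  # an empty term would match everywhere
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            by_span[(m.start(), m.end())] = Detection(m.group(), m.start(), m.end(), "watchlist")
    return [by_span[span] for span in sorted(by_span)]


def detect(text: str, word: str) -> Detection:
    """Stand-in for a model hit: locate the first occurrence of word in text."""
    start = text.index(word)
    return Detection(word, start, start + len(word), "model")


text = "Grant met Dr. Okafor at Maple Clinic."
model_output = [detect(text, "Grant"), detect(text, "Okafor")]
review_queue = apply_lists(text, model_output,
                           watchlist=["Maple Clinic"], allowlist=["Grant"])
print([(d.text, d.source) for d in review_queue])
```

Here “Grant” is suppressed by the allowlist, “Okafor” survives from the model, and “Maple Clinic” is added by the watchlist. Nothing is written until the user accepts or rejects each queued item.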

Total Redact has been tested thoroughly against these patterns and is refined with each release. Users are invited to run real documents through the app and report anything that was missed; that feedback shapes the next round of detection.

Sources

[1] Stubbs, Kotfila, Uzuner. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. J Biomed Inform 58, 2015. PMC4989908.
[2] Manning, Raghavan, Schütze. Introduction to Information Retrieval, §8.3. Cambridge University Press, 2008. nlp.stanford.edu.
[3] Mireshghallah, Antoniak, More, Choi, Farnadi. Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild. COLM 2024. arXiv:2407.11438.
[4] HHS Office for Civil Rights. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the HIPAA Privacy Rule. hhs.gov.
[5] Regulation (EU) 2016/679 (GDPR), Recital 26. gdpr-info.eu.
[6] Article 29 Data Protection Working Party. Opinion 05/2014 on Anonymisation Techniques, WP216, April 2014. ec.europa.eu (PDF).
[7] California Civil Code § 1798.140(m). leginfo.legislature.ca.gov.
[8] PCI Security Standards Council. Payment Card Industry Data Security Standard v4.0, Requirement 3. pcisecuritystandards.org.
[9] Microsoft. Presidio documentation. microsoft.github.io/presidio.
[10] Amazon Web Services. Detect PHI: Amazon Comprehend Medical. docs.aws.amazon.com.
[11] Google Cloud. InfoType detector reference, Sensitive Data Protection. cloud.google.com.
[12] Nightfall AI. nightfall.ai.
[13] BigID. PI / PII inventory. bigid.com.
[14] Logikcull. DSAR use case. logikcull.com.
[15] Kocaman, Ul Haq, Talby. Beyond Accuracy: Automated De-Identification of Large Real-World Clinical Text Datasets. December 2023. arXiv:2312.08495.
[16] American Bar Association. Embarrassing Redaction Failures. Judges’ Journal, Spring 2019 (Manafort filing). americanbar.org.
[17] CNN. Jeffrey Epstein documents: associates and others named in newly unsealed list. January 2024 (Giuffre v. Maxwell unsealing). cnn.com.