Top Open Source Sensitive Data Discovery Tools in 2026

Tianzhou · May 7, 2026

You can't protect data you haven't found. Before you mask it, encrypt it, or gate access to it, you first have to know which columns hold the PII, PHI, and other regulated stuff. That's sensitive data discovery, and it's the unglamorous step everyone skips until an auditor asks.

The open source options span a wide spectrum, and they don't all do the same job. On one end you have low-level NLP building blocks. In the middle, PII libraries with batteries included and CLI scanners that crawl databases and storage. On the far end, full metadata platforms where classification is one feature among dozens. The five below cover that range. One of them (PiiCatcher) got archived in January 2026 but still shows up in old guides, so I've kept it in the list with a clear warning rather than let you find out the hard way.

spaCy

spaCy is the NLP library that quietly powers half the tools on this page.

It provides named entity recognition (NER) for persons, organizations, locations, and other entities in text. PiiCatcher, OpenMetadata, and Microsoft Presidio all lean on spaCy under the hood. So spaCy isn't really a discovery tool you reach for directly. It's the engine the discovery tools are built on. Pick it when you're building a custom detection pipeline and want to start from the metal.

Microsoft Presidio

Microsoft Presidio is an actively maintained PII detection and anonymization library, and it's the one I'd point most teams at by default. It pairs spaCy NER with regex, rule-based recognizers, and checksums.

Detection is handled by 50+ predefined recognizers (credit cards, SSNs, phone numbers, names, locations, financial data, bitcoin wallets) plus whatever custom recognizers you add. It handles plain text as well as images (both DICOM and standard formats), supports multiple languages, and applies context-aware confidence scoring.
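To make the "regex plus checksum" idea concrete, here is a minimal standard-library sketch of a credit card recognizer. It is not Presidio's code or API. The pattern, the function names, and the 13 to 19 digit range are my own assumptions, chosen only to show why a checksum cuts down on false positives from plain pattern matching.

```python
import re

# Assumed pattern (not Presidio's): 13 to 19 digits, optionally separated
# by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk right to left, doubling every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit strings that match the pattern AND pass the checksum."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

The regex alone would flag any 16-digit order number. The checksum step is what lets a recognizer report a card number with more confidence than a bare pattern match.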

The catch: there's no native database scanner. You sample rows yourself, hand them to Presidio, and tag the results yourself. The upside is the separate Anonymizer module, which can mask, redact, hash, or encrypt findings in the same pipeline. So you detect and remediate without leaving the library.
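The "sample rows yourself" plumbing might look roughly like the sketch below. It uses SQLite from the standard library so it runs anywhere, and it takes the detector as a plain callable so the scanning loop stays independent of any one library. The `scan_table` name, the `Detector` type, and the sample size are hypothetical, not part of Presidio.

```python
import sqlite3
from typing import Callable

# A detector takes a string and returns the entity labels it found.
# This is a stand-in type; a real pipeline would wrap a PII library here.
Detector = Callable[[str], set[str]]


def scan_table(
    conn: sqlite3.Connection,
    table: str,
    detector: Detector,
    sample_size: int = 100,
) -> dict[str, set[str]]:
    """Sample rows from a table and report entity labels seen per column."""
    # The table name is interpolated, so it must come from trusted schema
    # metadata, never from user input.
    cursor = conn.execute(
        f'SELECT * FROM "{table}" ORDER BY RANDOM() LIMIT ?', (sample_size,)
    )
    columns = [desc[0] for desc in cursor.description]
    findings: dict[str, set[str]] = {}
    for row in cursor:
        for column, value in zip(columns, row):
            # Only text values are handed to the detector.
            if isinstance(value, str):
                labels = detector(value)
                if labels:
                    findings.setdefault(column, set()).update(labels)
    return findings
```

With Presidio installed, the detector could be a small wrapper that calls `AnalyzerEngine().analyze(text=value, language="en")` and collects each result's `entity_type`. Writing the findings back as column tags is still up to you.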

Verdict: the safe pick for teams building their own discovery pipeline, especially when you're scanning text fields, log streams, or document corpora alongside databases. You bring the plumbing, Presidio brings the brains.

Hawk-Eye

Hawk-Eye is the broad-spectrum scanner of the bunch, covering databases, cloud storage, and files, including images and videos via OCR.

It does pattern matching with configurable fingerprints in YAML, and runs OCR across 350+ file types including DOCX, PDF, images, and videos. On the data source side it reaches MySQL, PostgreSQL, MongoDB, CouchDB, Redis, S3, Google Cloud Storage, Firebase, Slack, Google Drive, and the local filesystem.

The whole point here is coverage. A database scanner stops at the database, but PII rarely stays put. It leaks into a Slack thread, a CSV in S3, a screenshot in someone's Drive. Hawk-Eye follows it there.

Verdict: reach for it when the security team needs to audit data sources well beyond the database.

PiiCatcher (archived)

PiiCatcher was archived in January 2026 and is now read-only on GitHub. No new releases since v0.21.2 (July 2023). It still runs, but bug fixes and security patches will not land. New deployments should use Microsoft Presidio (above) or one of the platform options below.

For context, PiiCatcher was a focused CLI scanner that detected PII in databases (PostgreSQL, MySQL, SQLite, Redshift, Athena, Snowflake, BigQuery) using regex on column names plus spaCy NLP on sampled values, then tagged findings directly into DataHub or Amundsen. That catalog integration was its real strength, and to be honest nothing has stepped in to fill that exact niche. Hawk-Eye covers a different database list (no warehouses), OpenMetadata wraps classification inside a full platform, and Presidio is a library, not a scanner. So if you loved PiiCatcher, none of these is a drop-in replacement. They each solve a slightly different shape of the problem.
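If you end up rebuilding part of that niche yourself, the column-name half of the approach is cheap to reproduce. The sketch below is not PiiCatcher's code. The pattern list and the `classify_column_names` function are illustrative assumptions, and a real scanner would carry a much longer list and combine it with value sampling.

```python
import re

# Hypothetical name patterns, for illustration only.
COLUMN_NAME_PATTERNS = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
    "SSN": re.compile(r"ssn|social_?security", re.IGNORECASE),
    "BIRTH_DATE": re.compile(r"dob|birth", re.IGNORECASE),
}


def classify_column_names(columns: list[str]) -> dict[str, list[str]]:
    """Map each suspicious column name to the labels whose pattern it matches."""
    result: dict[str, list[str]] = {}
    for column in columns:
        labels = [
            label
            for label, pattern in COLUMN_NAME_PATTERNS.items()
            if pattern.search(column)
        ]
        if labels:
            result[column] = labels
    return result
```

Name matching is fast because it never reads a row, but it misses PII hiding in columns called `notes` or `field_7`, which is why it gets paired with NLP on sampled values.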

OpenMetadata

OpenMetadata is a unified metadata platform that treats auto-classification as a core governance feature, not a bolt-on.

Its auto-classification workflow is powered by spaCy with configurable confidence thresholds (0 to 100). It identifies PII and either auto-applies tags or queues them for review. It runs as a separate workflow from metadata ingestion, so you tune classification independently. On reach, it ships 84+ connectors across databases, dashboards, messaging, and pipelines.
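One way to picture a confidence threshold with an apply-or-review split is the sketch below. This is not OpenMetadata's implementation or API. The `Finding` class, the `route_findings` function, and the rule that findings below the threshold go to review are all assumptions made for illustration, and the platform's exact semantics may differ.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    column: str
    tag: str
    confidence: int  # 0 to 100, matching the threshold scale


def route_findings(
    findings: list[Finding], threshold: int = 80
) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (auto_apply, needs_review) around a threshold."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    auto_apply: list[Finding] = []
    needs_review: list[Finding] = []
    for finding in findings:
        # Assumed rule: at or above the threshold is tagged automatically,
        # anything lower waits for a human.
        if finding.confidence >= threshold:
            auto_apply.append(finding)
        else:
            needs_review.append(finding)
    return auto_apply, needs_review
```

Raising the threshold trades fewer wrong tags for a longer review queue, which is the knob you end up tuning per data source.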

The reason to use OpenMetadata is what happens after classification. Tags don't just sit in an inventory; they drive data quality rules, access policies, and team workflows. And the no-code profiler puts classification within reach of people who aren't engineers.

Verdict: the right call when you want classification to actually drive downstream governance, not just produce a list nobody reads.

Alternative: DataHub is another open source metadata platform, but its auto-classification only supports Snowflake and is marked as deprecated. On DataHub, plan to wire up your own scanning, e.g. a Presidio-based job that writes tags via the DataHub API.

Comparison

| Tool | Status | Language | Primary use case | Detection method | Data source support | License |
|---|---|---|---|---|---|---|
| spaCy | Active | Python | NLP library / building block | Named entity recognition (NER), ML models | N/A (text processing only) | MIT |
| Presidio | Active | Python | PII library with built-in recognizers | NER (spaCy) + regex + rule-based + checksum | Text input (programmatic); no native DB scanner | MIT |
| Hawk-Eye | Active | Python | Multi-source scanner (DBs, cloud, files) | Pattern matching + OCR | MySQL, PostgreSQL, MongoDB, Redis, S3, GCS, Firebase, Slack | LGPL 2.1 + Commons Clause |
| PiiCatcher | Archived (2026) | Python | CLI scanner for databases | Regex + NLP (spaCy) | PostgreSQL, MySQL, SQLite, Redshift, Athena, Snowflake, BigQuery | Apache 2.0 |
| OpenMetadata | Active | Java / Python | Data platform with governance | Auto-classification workflow (spaCy), confidence thresholds | 84+ connectors | Apache 2.0 |

Picks by use case:

  • spaCy for building a custom NER pipeline from scratch.
  • Microsoft Presidio to drop in a maintained PII library with batteries included.
  • Hawk-Eye for coverage across databases, cloud storage, and files.
  • PiiCatcher: skip it for new work. The project is archived.
  • OpenMetadata for classification inside a full metadata platform.

Start lightweight. Move to the full platforms when governance actually demands it, not before.

Discovery, then protection

Discovery is only the prerequisite. Once you know which columns hold sensitive data, the next step is controlling what's actually returned at query time. Bytebase dynamic data masking is driven by classification results: scan, classify, mask. The Bytebase REST/gRPC API takes classification output and applies masking policies, so a discovered PII column gets masked without a human in between.
