Sensitive data discovery is the first step in protecting PII, PHI, and other regulated data. Before you can mask, encrypt, or gate access, you have to find it.
Open source tools for the job span a spectrum: low-level NLP building blocks, higher-level PII libraries with batteries included, CLI scanners that crawl databases and storage, and full metadata platforms where classification is one feature among many. The picks below cover that range — including one (PiiCatcher) that was archived in January 2026 but still shows up in old guides, so you should know its status.
spaCy
spaCy is the NLP library that powers many sensitive-data detection tools.

spaCy provides named entity recognition (NER) for persons, organizations, locations, and other entities in text. PiiCatcher, OpenMetadata, and Microsoft Presidio all use spaCy under the hood. For custom detection pipelines, you can build directly on spaCy.
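As a minimal sketch of building directly on spaCy, the helper below filters a pipeline's named entities down to the labels you care about. The label set, the function names, and the choice of the `en_core_web_sm` model are assumptions for illustration. spaCy is imported lazily, so the filtering logic can be exercised without the library installed.

```python
# Labels treated as potentially sensitive. This set is an assumption for
# illustration; the names match the labels emitted by spaCy's English models.
SENSITIVE_LABELS = frozenset({"PERSON", "ORG", "GPE", "LOC"})


def load_pipeline(model="en_core_web_sm"):
    """Load a spaCy pipeline.

    Requires `pip install spacy` and `python -m spacy download en_core_web_sm`.
    The import is lazy so find_entities() works with any compatible callable.
    """
    import spacy

    return spacy.load(model)


def find_entities(nlp, text, labels=SENSITIVE_LABELS):
    """Return (text, label, start_char, end_char) for each entity of interest."""
    doc = nlp(text)
    return [
        (ent.text, ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
        if ent.label_ in labels
    ]
```

With spaCy and a model installed, `find_entities(load_pipeline(), "Jane Doe moved to Berlin.")` returns whatever entities the model recognizes. The exact results depend on the model version, so treat them as candidates to review rather than ground truth.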
Microsoft Presidio
Microsoft Presidio is an actively maintained PII detection and anonymization library. It pairs spaCy NER with regex, rule-based recognizers, and checksums.
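To illustrate why pairing a regex with a checksum helps, here is a standalone sketch of the general technique, not Presidio's own code. A regex proposes candidate card numbers, and the Luhn checksum discards digit runs that merely look like cards. The pattern and the function names are assumptions made for this example.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by a single
# space or hyphen. The pattern is illustrative, not exhaustive.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text):
    """Return (start, end, digits) for regex matches that also pass Luhn."""
    hits = []
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_valid(digits):
            hits.append((m.start(), m.end(), digits))
    return hits
```

The regex alone would flag any 16-digit order number. The checksum step removes most of those false positives at almost no cost.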
Detection approach: 50+ predefined recognizers (credit cards, SSNs, phone numbers, names, locations, financial data, Bitcoin wallets) plus custom recognizers. Handles text as well as images, both standard formats and DICOM. Supports multiple languages and uses surrounding context to adjust confidence scores.
Data source support: Text input (programmatic). No native database scanner — you sample rows, hand them to Presidio, and tag the results yourself.
Key strength: Active maintenance, broad recognizer library, and a separate Anonymizer module that can mask, redact, hash, or encrypt findings in the same pipeline.
Best for: Teams building their own discovery pipeline, especially when scanning text fields, log streams, or document corpora alongside databases.
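The sample-rows-then-detect workflow described under data source support might look like the sketch below. It uses SQLite from the standard library as a stand-in database. `scan_table`, `email_detector`, and `presidio_detector` are hypothetical names. The Presidio-backed detector relies on `AnalyzerEngine.analyze`, and it is imported lazily so the rest runs without Presidio installed.

```python
import re
import sqlite3
from collections import defaultdict


def presidio_detector():
    """Build a detector backed by Presidio.

    Requires `pip install presidio-analyzer` and a spaCy model.
    """
    from presidio_analyzer import AnalyzerEngine

    analyzer = AnalyzerEngine()
    return lambda text: {
        r.entity_type for r in analyzer.analyze(text=text, language="en")
    }


def email_detector(text):
    """Regex-only stand-in detector for environments without Presidio."""
    return {"EMAIL_ADDRESS"} if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text) else set()


def scan_table(conn, table, detect, sample_size=100):
    """Sample rows from `table` and return {column: {entity_type: hit_count}}."""
    # Identifiers cannot be bound as parameters, so quote the table name.
    quoted = '"' + table.replace('"', '""') + '"'
    cur = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (sample_size,))
    columns = [d[0] for d in cur.description]
    findings = defaultdict(lambda: defaultdict(int))
    for row in cur:
        for col, value in zip(columns, row):
            if isinstance(value, str):  # only free-text values are scanned
                for entity in detect(value):
                    findings[col][entity] += 1
    return {col: dict(counts) for col, counts in findings.items()}
```

The returned mapping is the raw material for tagging. Where the tags end up (a catalog, a spreadsheet, a policy engine) is up to you. A plain `LIMIT` takes the first rows rather than a random sample, so a production job would likely want a better sampling strategy.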
Hawk-Eye
Hawk-Eye is a broad-spectrum scanner covering databases, cloud storage, and files — including images and videos via OCR.

Detection approach: Pattern matching with configurable fingerprints in YAML. OCR for images and documents — 350+ file types including DOCX, PDF, images, videos.
Data source support: MySQL, PostgreSQL, MongoDB, CouchDB, Redis, S3, Google Cloud Storage, Firebase, Slack, Google Drive, local filesystem.
Key strength: Coverage. Database scanners stop at the database; Hawk-Eye finds PII wherever it has leaked.
Best for: Security teams auditing data sources beyond the database.
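Fingerprint-based pattern matching can be sketched in a few lines. The dictionary below is a hypothetical stand-in for a fingerprint file. It mirrors the idea of named, configurable patterns and is not Hawk-Eye's actual YAML schema or pattern set.

```python
import re

# Hypothetical fingerprints: name -> regex. A real deployment would load
# these from configuration instead of hard-coding them.
FINGERPRINTS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "us_ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "aws_access_key_id": r"\bAKIA[0-9A-Z]{16}\b",
}


def compile_fingerprints(fingerprints):
    """Compile each pattern once so scanning many lines stays cheap."""
    return {name: re.compile(pattern) for name, pattern in fingerprints.items()}


def scan_lines(lines, compiled):
    """Yield (line_number, fingerprint_name, matched_text) for every hit."""
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in compiled.items():
            for m in pattern.finditer(line):
                yield lineno, name, m.group()
```

Because `scan_lines` accepts any iterable of strings, the same function works on an open file, rows exported from a database, or text produced by an OCR step.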
PiiCatcher (archived)
PiiCatcher was archived in January 2026 and is now read-only on GitHub. No new releases since v0.21.2 (July 2023). It still runs, but bug fixes and security patches will not land. New deployments should use Microsoft Presidio (above) or one of the platform options below.

For context — PiiCatcher was a focused CLI scanner that detected PII in databases (PostgreSQL, MySQL, SQLite, Redshift, Athena, Snowflake, BigQuery) using regex on column names plus spaCy NLP on sampled values, and tagged findings directly into DataHub or Amundsen. The catalog-integration model was its main strength. No actively maintained open-source tool fills that exact niche today — Hawk-Eye (above) covers a different database list (no warehouses), OpenMetadata wraps classification inside a full platform, and Presidio is a library rather than a scanner.
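The two-signal approach described above, column-name regex plus detection on sampled values, can be approximated as follows. The rules, the tag names, and the `min_ratio` cutoff are assumptions for illustration and are not taken from PiiCatcher's source.

```python
import re

# Hypothetical column-name rules: tag -> pattern matched against the name.
COLUMN_NAME_RULES = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
    "SSN": re.compile(r"ssn|social[-_]?security", re.IGNORECASE),
    "BIRTH_DATE": re.compile(r"dob|birth", re.IGNORECASE),
}


def classify_by_name(column_name):
    """Return the tags whose pattern matches the column name."""
    return {tag for tag, pat in COLUMN_NAME_RULES.items() if pat.search(column_name)}


def classify_column(column_name, sampled_values, value_detector, min_ratio=0.5):
    """Combine name-based tags with value-based tags.

    A value-based tag is kept only if `value_detector` reports it for at
    least `min_ratio` of the non-empty string samples.
    """
    tags = classify_by_name(column_name)
    values = [v for v in sampled_values if isinstance(v, str) and v.strip()]
    if values:
        counts = {}
        for value in values:
            for tag in value_detector(value):
                counts[tag] = counts.get(tag, 0) + 1
        tags |= {t for t, n in counts.items() if n / len(values) >= min_ratio}
    return tags
```

Here `value_detector` is any callable that maps a string to a set of tags, such as a wrapper around an NLP model. The name check catches columns like `user_email` even when the sample is empty. The value check catches PII hiding in generically named columns like `contact`.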
OpenMetadata
OpenMetadata is a unified metadata platform with auto-classification as a core governance feature.

Detection approach: Auto-classification workflow powered by spaCy with configurable confidence thresholds (0–100). Identifies PII and either auto-applies tags or queues them for review. Runs as a separate workflow from metadata ingestion, so you can tune classification independently.
Data source support: 84+ connectors across databases, dashboards, messaging, and pipelines.
Key strength: Classification flows into governance. Tags drive data quality rules, access policies, and team workflows. The no-code profiler makes classification reachable for non-engineers.
Best for: Teams that want classification to drive downstream governance, not just inventory.
Alternative: DataHub is another open-source metadata platform, but its auto-classification feature only supports Snowflake and is marked as deprecated. On DataHub, plan to wire up your own scanning — e.g. a Presidio-based job that writes tags via the DataHub API.
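Whichever platform you choose, the apply-or-review behavior described under detection approach comes down to routing suggestions by confidence. The sketch below is a generic illustration and not OpenMetadata's implementation. The `Suggestion` type, the two-threshold design, and the default values are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    column: str
    tag: str
    confidence: float  # on a 0-100 scale


def route_suggestions(suggestions, auto_apply_at=80.0, review_at=50.0):
    """Split suggestions into (auto_applied, review_queue).

    Suggestions scoring below `review_at` are dropped. Both thresholds are
    hypothetical defaults chosen for this example.
    """
    if not 0 <= review_at <= auto_apply_at <= 100:
        raise ValueError("expected 0 <= review_at <= auto_apply_at <= 100")
    applied, review = [], []
    for s in suggestions:
        if s.confidence >= auto_apply_at:
            applied.append(s)
        elif s.confidence >= review_at:
            review.append(s)
    return applied, review
```

Lowering `auto_apply_at` trades reviewer time for a higher risk of wrong tags. Keeping the thresholds as parameters lets you make that trade without touching the detection step.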
Comparison
| Tool | Status | Language | Primary use case | Detection method | Data source support | License |
|---|---|---|---|---|---|---|
| spaCy | Active | Python | NLP library / building block | Named entity recognition (NER), ML models | N/A (text processing only) | MIT |
| Presidio | Active | Python | PII library with built-in recognizers | NER (spaCy) + regex + rule-based + checksum | Text input (programmatic); no native DB scanner | MIT |
| Hawk-Eye | Active | Python | Multi-source scanner (DBs, cloud, files) | Pattern matching + OCR | MySQL, PostgreSQL, MongoDB, CouchDB, Redis, S3, GCS, Firebase, Slack, Google Drive, local filesystem | LGPL 2.1 + Commons Clause |
| PiiCatcher | Archived (2026) | Python | CLI scanner for databases | Regex + NLP (spaCy) | PostgreSQL, MySQL, SQLite, Redshift, Athena, Snowflake, BigQuery | Apache 2.0 |
| OpenMetadata | Active | Java / Python | Metadata platform with governance | Auto-classification workflow (spaCy), confidence thresholds | 84+ connectors | Apache 2.0 |

Picks by use case:
- spaCy — Build a custom NER pipeline from scratch.
- Microsoft Presidio — Drop in a maintained PII library with batteries included.
- Hawk-Eye — Coverage across databases, cloud storage, and files.
- PiiCatcher — Skip for new work; the project is archived.
- OpenMetadata — Classification inside a full metadata platform.
Start lightweight. Move to full platforms when governance demands it.
Discovery, then protection
Discovery is the prerequisite. Once you know which columns hold sensitive data, the next step is to control what's returned at query time. Bytebase dynamic data masking is driven by classification results — scan, classify, mask. The Bytebase REST/gRPC API takes classification output and applies masking policies, so a discovered PII column gets a masking policy without a human in between.