Top Open Source Sensitive Data Discovery Tools in 2026

Tianzhou · May 7, 2026

Sensitive data discovery is the first step in protecting PII, PHI, and other regulated data. Before you can mask, encrypt, or gate access, you have to find it.

Open source tools for the job span a spectrum: low-level NLP building blocks, higher-level PII libraries with batteries included, CLI scanners that crawl databases and storage, and full metadata platforms where classification is one feature among many. The picks below cover that range — including one (PiiCatcher) that was archived in January 2026 but still shows up in old guides, so you should know its status.

spaCy

spaCy is the NLP library that powers many sensitive-data detection tools.

spaCy provides named entity recognition (NER) for persons, organizations, locations, and other entities in text. PiiCatcher, OpenMetadata, and Microsoft Presidio all use spaCy under the hood. For custom detection pipelines, you can build directly on spaCy.

Microsoft Presidio

Microsoft Presidio is an actively maintained PII detection and anonymization library. It pairs spaCy NER with regex, rule-based recognizers, and checksums.

Detection approach: 50+ predefined recognizers (credit cards, SSNs, phone numbers, names, locations, financial data, bitcoin wallets) plus custom recognizers. Covers text and DICOM/standard images. Multi-language with context-aware confidence scoring.
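
To make the regex-plus-checksum pairing concrete, here is a minimal sketch using only the Python standard library. It is not Presidio's code or API; `CARD_PATTERN`, `luhn_valid`, and `find_card_numbers` are names made up for illustration. The idea is that a regex proposes candidate card numbers and the Luhn checksum discards digit runs that merely look like one.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
# (Illustrative pattern, not taken from Presidio.)
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return regex matches that also pass the checksum."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]
```

The checksum step is what keeps order IDs and other long digit runs from being flagged as card numbers.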

Data source support: Text input (programmatic). No native database scanner — you sample rows, hand them to Presidio, and tag the results yourself.
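
The sample, detect, and tag loop described above might look roughly like this. It is a sketch under assumptions: SQLite stands in for your database, `looks_like_email` is a placeholder for a call into a real detector such as Presidio, and the `EMAIL` tag and the 0.8 threshold are arbitrary choices.

```python
import re
import sqlite3

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def looks_like_email(value: object) -> bool:
    """Placeholder detector; a real pipeline would call a PII library here."""
    return isinstance(value, str) and EMAIL_PATTERN.fullmatch(value) is not None


def scan_table(conn: sqlite3.Connection, table: str, sample_size: int = 100) -> dict[str, float]:
    """Read up to `sample_size` rows and return the share of email-like values per column."""
    # The table name is interpolated, so it must come from a trusted catalog, not user input.
    # LIMIT returns the first rows, not a random sample.
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (sample_size,))
    columns = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
    hits = {column: 0 for column in columns}
    for row in rows:
        for column, value in zip(columns, row):
            if looks_like_email(value):
                hits[column] += 1
    total = len(rows)
    return {column: (count / total if total else 0.0) for column, count in hits.items()}


def tag_columns(ratios: dict[str, float], min_ratio: float = 0.8) -> dict[str, str]:
    """Tag a column as EMAIL when most sampled values match."""
    return {column: "EMAIL" for column, ratio in ratios.items() if ratio >= min_ratio}
```

Calling `tag_columns(scan_table(conn, "users"))` returns a column-to-tag mapping that you would then write to whatever catalog or policy store you use.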

Key strength: Active maintenance, broad recognizer library, and a separate Anonymizer module that can mask, redact, hash, or encrypt findings in the same pipeline.

Best for: Teams building their own discovery pipeline, especially when scanning text fields, log streams, or document corpora alongside databases.

Hawk-Eye

Hawk-Eye is a broad-spectrum scanner covering databases, cloud storage, and files — including images and videos via OCR.

Detection approach: Pattern matching with configurable fingerprints in YAML. OCR for images and documents — 350+ file types including DOCX, PDF, images, videos.
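
Fingerprint-style pattern matching boils down to running a named set of regexes over content. The sketch below shows the idea in plain Python. It does not use Hawk-Eye's code or its YAML schema, and the labels and patterns in `FINGERPRINTS` are hypothetical examples.

```python
import re

# Hypothetical fingerprints: label -> regular expression.
FINGERPRINTS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "us_ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}


def scan_text(text: str, fingerprints: dict[str, str]) -> list[tuple[int, str, str]]:
    """Return (line_number, label, matched_text) for every fingerprint hit."""
    compiled = {label: re.compile(pattern) for label, pattern in fingerprints.items()}
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in compiled.items():
            for match in pattern.finditer(line):
                findings.append((line_number, label, match.group()))
    return findings
```

Keeping the patterns in data rather than code is what makes this style configurable: adding a new detector means adding one entry, not changing the scanner.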

Data source support: MySQL, PostgreSQL, MongoDB, CouchDB, Redis, S3, Google Cloud Storage, Firebase, Slack, Google Drive, local filesystem.

Key strength: Coverage. Database-only scanners stop at the database; Hawk-Eye also checks the buckets, chat workspaces, and file shares where copies of PII tend to end up.

Best for: Security teams auditing data sources beyond the database.

PiiCatcher (archived)

PiiCatcher was archived in January 2026 and is now read-only on GitHub. No new releases since v0.21.2 (July 2023). It still runs, but bug fixes and security patches will not land. New deployments should use Microsoft Presidio (above) or one of the platform options below.

For context — PiiCatcher was a focused CLI scanner that detected PII in databases (PostgreSQL, MySQL, SQLite, Redshift, Athena, Snowflake, BigQuery) using regex on column names plus spaCy NLP on sampled values, and tagged findings directly into DataHub or Amundsen. The catalog-integration model was its main strength. No actively maintained open-source tool fills that exact niche today — Hawk-Eye (above) covers a different database list (no warehouses), OpenMetadata wraps classification inside a full platform, and Presidio is a library rather than a scanner.
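
The column-name half of that approach is cheap to reproduce. Here is a rough sketch, not PiiCatcher's code: the tags and patterns in `COLUMN_NAME_PATTERNS` are invented for illustration, and a real scanner would ship a larger, tuned list.

```python
import re
from typing import Optional

# Hypothetical name patterns, checked in insertion order.
COLUMN_NAME_PATTERNS = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
    "SSN": re.compile(r"ssn|social_?security", re.IGNORECASE),
    "BIRTH_DATE": re.compile(r"dob|birth", re.IGNORECASE),
}


def classify_column_name(name: str) -> Optional[str]:
    """Return the first tag whose pattern appears in the column name, else None."""
    for tag, pattern in COLUMN_NAME_PATTERNS.items():
        if pattern.search(name):
            return tag
    return None
```

Name matching is only a first pass. A column called `notes` can hold phone numbers and a column called `phone_model` holds none, which is why pairing it with checks on sampled values matters.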

OpenMetadata

OpenMetadata is a unified metadata platform with auto-classification as a core governance feature.

Detection approach: Auto-classification workflow powered by spaCy with configurable confidence thresholds (0–100). Identifies PII and either auto-applies tags or queues them for review. Runs as a separate workflow from metadata ingestion, so you can tune classification without touching ingestion.
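
The apply-or-review split is easy to picture as a threshold check. The sketch below is an illustration only, not OpenMetadata's implementation or API: the `Finding` class, the `triage` function, the two-threshold scheme, and the default cutoffs of 80 and 50 are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    column: str
    tag: str
    confidence: float  # 0 to 100, matching the threshold scale above


def triage(
    findings: list[Finding],
    auto_apply_at: float = 80.0,
    review_at: float = 50.0,
) -> tuple[list[Finding], list[Finding]]:
    """Split findings into auto-applied tags and a review queue; drop the rest."""
    applied, review = [], []
    for finding in findings:
        if finding.confidence >= auto_apply_at:
            applied.append(finding)
        elif finding.confidence >= review_at:
            review.append(finding)
    return applied, review
```

Raising `auto_apply_at` trades fewer wrong tags for a longer review queue, which is the kind of tuning a separate classification workflow lets you do on its own schedule.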

Data source support: 84+ connectors across databases, dashboards, messaging, and pipelines.

Key strength: Classification flows into governance. Tags drive data quality rules, access policies, and team workflows. The no-code profiler makes classification accessible to non-engineers.

Best for: Teams that want classification to drive downstream governance, not just inventory.

Alternative: DataHub is another open-source metadata platform, but its auto-classification feature only supports Snowflake and is marked as deprecated. On DataHub, plan to wire up your own scanning — e.g. a Presidio-based job that writes tags via the DataHub API.

Comparison

| Tool | Status | Language | Primary use case | Detection method | Data source support | License |
| --- | --- | --- | --- | --- | --- | --- |
| spaCy | Active | Python | NLP library / building block | Named entity recognition (NER), ML models | N/A (text processing only) | MIT |
| Presidio | Active | Python | PII library with built-in recognizers | NER (spaCy) + regex + rule-based + checksum | Text input (programmatic); no native DB scanner | MIT |
| Hawk-Eye | Active | Python | Multi-source scanner (DBs, cloud, files) | Pattern matching + OCR | MySQL, PostgreSQL, MongoDB, Redis, S3, GCS, Firebase, Slack | LGPL 2.1 + Commons Clause |
| PiiCatcher | Archived (2026) | Python | CLI scanner for databases | Regex + NLP (spaCy) | PostgreSQL, MySQL, SQLite, Redshift, Athena, Snowflake, BigQuery | Apache 2.0 |
| OpenMetadata | Active | Java / Python | Data platform with governance | Auto-classification workflow (spaCy), confidence thresholds | 84+ connectors | Apache 2.0 |

Picks by use case:

  • spaCy — Build a custom NER pipeline from scratch.
  • Microsoft Presidio — Drop in a maintained PII library with batteries included.
  • Hawk-Eye — Coverage across databases, cloud storage, and files.
  • PiiCatcher — Skip for new work; the project is archived.
  • OpenMetadata — Classification inside a full metadata platform.

Start lightweight. Move to full platforms when governance demands it.

Discovery, then protection

Discovery is the prerequisite. Once you know which columns hold sensitive data, the next step is to control what's returned at query time. Bytebase dynamic data masking is driven by classification results — scan, classify, mask. The Bytebase REST/gRPC API takes classification output and applies masking policies, so a discovered PII column gets a masking policy without a human in between.
