PII Detection & Redaction (Azure AI Language)
Overview
An evaluation build that detects and redacts personally identifiable information (PII) in free text using the PII recognition feature of Azure AI Language. It scans a batch of documents, returns the redacted text, and lists the entities (e.g. SSN, phone number) that were redacted.
Why It Exists
Before publishing or storing customer data, organisations must strip PII to meet privacy and compliance rules. This build evaluates a managed NLP service as a fast, low-maintenance way to identify and redact sensitive entities at scale.
What We Built
A Python script (build.py, adapted from Microsoft’s Azure Text Analytics sample) that authenticates with AzureKeyCredential, instantiates a TextAnalyticsClient, and calls recognize_pii_entities on a batch of documents modelled on a loan-payments scenario. It prints each document’s original and redacted text, lists every redacted entity with its category, and shows how to extract a specific entity type, such as SSNs, from the results.
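The flow described above can be sketched roughly as follows. This is a minimal illustration, not the contents of build.py: the helper names (redact_documents, extract_ssns), the StubClient class, the sample sentence, and the 0.6 confidence threshold are all assumptions made for the example. StubClient returns canned output so the sketch runs offline; in the real build the client would be a TextAnalyticsClient, whose recognize_pii_entities results expose is_error, redacted_text, and entities with text, category, and confidence_score.

```python
from types import SimpleNamespace

# Category string the service reports for US Social Security numbers.
SSN_CATEGORY = "USSocialSecurityNumber"


def redact_documents(client, documents):
    """Run PII recognition and pair each input with its redacted form."""
    results = client.recognize_pii_entities(documents)
    report = []
    # Results come back in the same order as the input documents.
    for original, doc in zip(documents, results):
        if doc.is_error:
            # Failed documents are skipped here; a real pipeline would log them.
            continue
        report.append({
            "original": original,
            "redacted": doc.redacted_text,
            "entities": [(e.text, e.category, e.confidence_score) for e in doc.entities],
        })
    return report


def extract_ssns(report, min_confidence=0.6):
    """Collect SSN entities at or above an (arbitrary, assumed) confidence threshold."""
    return [
        text
        for item in report
        for text, category, score in item["entities"]
        if category == SSN_CATEGORY and score >= min_confidence
    ]


class StubClient:
    """Hypothetical offline stand-in for TextAnalyticsClient, with canned output."""

    def recognize_pii_entities(self, documents):
        entity = SimpleNamespace(
            text="859-98-0987", category=SSN_CATEGORY, confidence_score=0.85
        )
        ok = SimpleNamespace(
            is_error=False,
            redacted_text="Their SSN is ***********.",
            entities=[entity],
        )
        return [ok for _ in documents]


report = redact_documents(StubClient(), ["Their SSN is 859-98-0987."])
for item in report:
    print("Original:", item["original"])
    print("Redacted:", item["redacted"])
    for text, category, _score in item["entities"]:
        print(f"  redacted '{text}' ({category})")
print("SSNs:", extract_ssns(report))
```

Passing the client in as a parameter keeps the reporting logic testable without network access; swapping StubClient for a real client should not require changes to either helper.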
Technologies & Approach
Python with the Azure SDK (the azure-ai-textanalytics package, imported as azure.ai.textanalytics), targeting Azure AI Language’s PII recognition (Text Analytics API v3.1+). The Language resource endpoint and key are read from environment variables. The managed service handles detection and categorisation, so the integration code stays thin.
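A configuration step of this kind might look like the sketch below. The variable names AZURE_LANGUAGE_ENDPOINT and AZURE_LANGUAGE_KEY are assumptions (the write-up does not name them), and load_language_config and build_client are hypothetical helpers. The SDK imports are deferred into build_client so the config check can run even where the package is not installed.

```python
import os

# Assumed variable names; adjust to whatever the deployment actually uses.
ENDPOINT_VAR = "AZURE_LANGUAGE_ENDPOINT"
KEY_VAR = "AZURE_LANGUAGE_KEY"


def load_language_config(env=None):
    """Return (endpoint, key), failing fast if either variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in (ENDPOINT_VAR, KEY_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[ENDPOINT_VAR], env[KEY_VAR]


def build_client(endpoint, key):
    """Create the SDK client from the loaded configuration."""
    # Imported lazily so load_language_config works without the SDK installed.
    from azure.ai.textanalytics import TextAnalyticsClient
    from azure.core.credentials import AzureKeyCredential

    return TextAnalyticsClient(endpoint=endpoint, credential=AzureKeyCredential(key))
```

Typical use would be `client = build_client(*load_language_config())`. Failing early on a missing variable gives a clearer error than letting the first API call fail on authentication.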
Outcome / Impact
Validated that Azure’s managed PII service detects and redacts sensitive entities, and reports their categories, with minimal code, confirming it as a viable building block for privacy and compliance pipelines. This is an evaluation build adapted from the vendor sample and is documented as such.
Capabilities Demonstrated
- PII detection and automated redaction of unstructured text
- Entity categorisation (SSN, phone, national IDs, etc.)
- Integration of managed NLP/AI Language services
- Privacy- and compliance-oriented data handling