e policies out of documents and into machine-readable formats. Use a schema that captures classification, retention, access control, and quality constraints.
Policy Schema Example:
# policies/customer_pii.yaml
apiVersion: governance.codcompass.io/v1
kind: DataPolicy
metadata:
  name: customer-pii-protection
  labels:
    domain: analytics
    sensitivity: PII
spec:
  target:
    resource_type: table
    name_pattern: "raw\\.customer_*"
  rules:
    - name: encryption_at_rest
      type: infrastructure
      enforcement: mandatory
      config:
        algorithm: AES-256
    - name: no_direct_access
      type: access_control
      enforcement: mandatory
      config:
        allowed_roles:
          - "role:pii_analyst"
          - "role:data_engineer"
        deny_public: true
    - name: retention_policy
      type: lifecycle
      enforcement: advisory
      config:
        max_age_days: 365
        action: archive
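
A machine-readable policy is only useful if malformed files are rejected early. The following is a minimal sketch of a structural check, assuming the policy has already been parsed from YAML into a plain dictionary. The function name `validate_policy`, the set of required rule keys, and the two allowed enforcement values are assumptions inferred from the example above, not part of any published schema.

```python
from typing import Any

# Assumed from the example policy: only these two enforcement levels appear.
ALLOWED_ENFORCEMENT = {"mandatory", "advisory"}
REQUIRED_RULE_KEYS = {"name", "type", "enforcement", "config"}


def validate_policy(policy: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the policy is well-formed."""
    errors: list[str] = []
    if policy.get("kind") != "DataPolicy":
        errors.append("kind must be 'DataPolicy'")
    if not policy.get("metadata", {}).get("name"):
        errors.append("metadata.name is required")
    spec = policy.get("spec", {})
    if not spec.get("target", {}).get("name_pattern"):
        errors.append("spec.target.name_pattern is required")
    rules = spec.get("rules", [])
    if not rules:
        errors.append("spec.rules must contain at least one rule")
    for i, rule in enumerate(rules):
        missing = REQUIRED_RULE_KEYS - rule.keys()
        if missing:
            errors.append(f"rules[{i}] is missing keys: {sorted(missing)}")
        if rule.get("enforcement") not in ALLOWED_ENFORCEMENT:
            errors.append(f"rules[{i}] has unknown enforcement: {rule.get('enforcement')!r}")
    return errors


# A trimmed copy of the policy above, as it would look after YAML parsing.
policy = {
    "apiVersion": "governance.codcompass.io/v1",
    "kind": "DataPolicy",
    "metadata": {"name": "customer-pii-protection"},
    "spec": {
        "target": {"resource_type": "table", "name_pattern": "raw\\.customer_*"},
        "rules": [
            {"name": "encryption_at_rest", "type": "infrastructure",
             "enforcement": "mandatory", "config": {"algorithm": "AES-256"}},
            {"name": "retention_policy", "type": "lifecycle",
             "enforcement": "advisory", "config": {"max_age_days": 365, "action": "archive"}},
        ],
    },
}

problems = validate_policy(policy)
print("policy is valid" if not problems else problems)
```

Returning a list of problems rather than raising on the first one lets a CI job report every issue in a policy file in a single run.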
2. Implement Metadata Harvesting
Governance requires context. Deploy scanners that automatically extract metadata, lineage, and data profiles from your storage and compute engines. This gives the governance engine an up-to-date view of the environment it is meant to govern.
Scanner Architecture:
- Ingestion: Use agents or API connectors to poll metadata stores (e.g., Hive Metastore, Snowflake Information Schema, Postgres catalogs).
- Enrichment: Apply regex-based classifiers to detect sensitive data patterns (emails, SSNs, credit cards).
- Storage: Push enriched metadata to a central Graph-based Catalog (e.g., DataHub, Amundsen, or OpenMetadata).
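
The enrichment step can be sketched as a small regex-based classifier that labels a column from a sample of its values. This is an illustration under stated assumptions: the patterns, the label names, and the 80% match threshold are hypothetical choices, and real classifiers typically add checksum validation and column-name hints to reduce false positives.

```python
import re

# Illustrative patterns only; they favor readability over completeness.
CLASSIFIERS = {
    "EMAIL": re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "CREDIT_CARD": re.compile(r"^(?:\d[ -]?){13,19}$"),
}


def classify_column(samples: list[str], threshold: float = 0.8) -> set[str]:
    """Return labels whose pattern matches at least `threshold` of the non-empty samples."""
    values = [s.strip() for s in samples if s and s.strip()]
    if not values:
        return set()
    labels = set()
    for label, pattern in CLASSIFIERS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            labels.add(label)
    return labels


print(classify_column(["ana@example.com", "li.wei@example.org"]))  # {'EMAIL'}
```

Classifying on a sample with a threshold, rather than on a single value, keeps one stray email address in a free-text column from tagging the whole column as PII.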
3. Enforce via CI/CD and Runtime
Enforcement must happen at two points:
- Shift-Left (CI/CD): Validate policies against infrastructure-as-code (IaC) and data pipeline definitions before deployment.
- Runtime (Data Plane): Block or quarantine data that violates quality or classification rules during ingestion.
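
For the runtime side, a minimal sketch of quarantine-on-ingest might look like the following. The two checks and the `_failed_checks` field are hypothetical; the point is that violating records are diverted with a reason attached rather than silently dropped.

```python
from typing import Any, Callable

Record = dict[str, Any]
Check = Callable[[Record], bool]

# Hypothetical quality rules; each returns True when the record is acceptable.
CHECKS: dict[str, Check] = {
    "customer_id_present": lambda r: r.get("customer_id") is not None,
    "email_has_at_sign": lambda r: "@" in str(r.get("email", "")),
}


def partition(records: list[Record]) -> tuple[list[Record], list[Record]]:
    """Split records into (accepted, quarantined); quarantined rows list the checks they failed."""
    accepted: list[Record] = []
    quarantined: list[Record] = []
    for record in records:
        failed = [name for name, check in CHECKS.items() if not check(record)]
        if failed:
            quarantined.append({**record, "_failed_checks": failed})
        else:
            accepted.append(record)
    return accepted, quarantined


accepted, quarantined = partition([
    {"customer_id": 1, "email": "ana@example.com"},
    {"customer_id": None, "email": "not-an-email"},
])
print(len(accepted), "accepted;", len(quarantined), "quarantined")
```

Keeping the failed check names on the quarantined record gives data owners enough context to fix the source or request an exception.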
CI/CD Validation Snippet:
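# .github/workflows/governance-check.yaml
name: Governance Gate
on:
  pull_request:
    paths:
      - 'infra/data-pipelines/**'
      - 'policies/**'
jobs:
  validate-governance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install OPA
        # OPA ships as a standalone binary, not a pip package
        uses: open-policy-agent/setup-opa@v2
      - name: Install Great Expectations
        # Pinned below 1.0 because the CLI used later was removed in 1.x
        run: pip install "great_expectations<1.0"
      - name: Check Policy Compliance
        run: |
          # OPA input must be JSON or YAML, so evaluate a JSON rendering of the
          # Terraform plan (e.g., from `terraform show -json`), not a raw .tf file.
          # The tfplan.json path is an assumed location; adjust to your repo.
          # --fail exits non-zero when `allow` is undefined (assumes no default value).
          opa eval --fail --data policies/ --input infra/terraform/tfplan.json \
            'data.governance.compliance.allow'
      - name: Run Data Quality Tests
        run: |
          # Execute a Great Expectations checkpoint (name is a placeholder)
          great_expectations checkpoint run main_checkpoint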
4. Automate Remediation and Auditing
When violations occur, the system should attempt auto-remediation where safe (e.g., tagging unclassified assets) and generate alerts for manual intervention. Audit trails must be immutable.
Remediation Logic:
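# remediation_engine.py
# storage_client, audit_log, and send_alert_to_security_channel are assumed to be
# provided by the surrounding application; they are not defined in this snippet.
def handle_violation(violation):
    """Auto-remediate where it is safe; otherwise alert a human. Every action is audited."""
    if violation.rule == "encryption_at_rest" and violation.severity == "high":
        # Auto-remediation: Enable encryption via API
        storage_client.update_bucket_encryption(violation.resource_id)
        audit_log.record(action="AUTO_REMEDIATE", resource=violation.resource_id)
    elif violation.rule == "access_control":
        # Manual intervention required
        send_alert_to_security_channel(violation)
        audit_log.record(action="ALERT_SENT", resource=violation.resource_id)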
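
One common way to approximate an immutable audit trail in application code is a hash chain, where each entry commits to the hash of the one before it. The sketch below is a hypothetical `AuditLog` whose `record(action=..., resource=...)` signature mirrors the calls in the remediation logic. Note that a hash chain makes tampering detectable rather than impossible; write-once storage would still be needed for true immutability.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first entry


def _digest(body: dict) -> str:
    """Stable SHA-256 over the entry body (keys sorted so the hash is reproducible)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class AuditLog:
    """Append-only log where each entry commits to the hash of the previous one."""
    entries: list[dict] = field(default_factory=list)

    def record(self, action: str, resource: str) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        body = {"action": action, "resource": resource, "prev_hash": prev_hash}
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; an edited, removed, or reordered entry breaks the chain."""
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            body = {k: entry[k] for k in ("action", "resource", "prev_hash")}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True


audit_log = AuditLog()
audit_log.record(action="AUTO_REMEDIATE", resource="bucket-1")
audit_log.record(action="ALERT_SENT", resource="table-2")
print(audit_log.verify())  # True
```

A production entry would also carry a timestamp and actor; they are omitted here to keep the sketch deterministic.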
Architecture Decisions
| Decision Area | Option A: Centralized Enforcement | Option B: Federated Enforcement | Recommendation |
|---|---|---|---|
| Control | Single policy engine; consistent rules. | Domain teams own policies; local autonomy. | Hybrid: Core security policies centralized; domain-specific quality rules federated. |
| Performance | Proxy-based enforcement adds latency. | Native enforcement (e.g., Snowflake policies) avoids the extra network hop and adds little overhead. | Native: Leverage platform-native capabilities (Row Level Security, Tags) where available; use proxy only for cross-platform consistency. |
| Metadata Store | Relational DB (Simple, limited graph queries). | Graph Database (Complex relationships, lineage traversal). | Graph: Use a graph-backed catalog for lineage and impact analysis. |
| Policy Language | Custom DSL (Low learning curve, limited expressiveness). | Rego/OPA (Standard, powerful, ecosystem support). | Rego/OPA: Industry standard for policy-as-code; integrates with Kubernetes, Terraform, and CI/CD. |
Pitfall Guide
Avoid these common implementation traps that derail governance initiatives:
- Boiling the Ocean: Attempting to govern all data assets simultaneously. Fix: Start with the "Crown Jewels": critical PII, financial data, and core customer tables. Expand scope iteratively.
- Governance as a Bottleneck: Designing gates that require manual approval for every change. Fix: Implement "Governance by Exception." Auto-approve compliant changes; flag only violations for review.
- Static Policies in Dynamic Environments: Hardcoding policies that break when schemas evolve. Fix: Use pattern matching and semantic tagging rather than rigid table names. Implement schema evolution policies that allow backward-compatible changes.
- Ignoring Data Lineage: Enforcing policies without understanding upstream/downstream impact. Fix: Integrate lineage tracking. A policy change on a source table must trigger impact analysis on downstream dashboards and models.
- Lack of Business Ownership: Engineering defines policies without business context. Fix: Establish a Data Governance Council with business representatives who define classification levels and retention requirements. Engineering implements; Business defines.
- Neglecting the "Human" Loop: Over-automation without a process for exceptions. Fix: Build a self-service portal for data owners to request policy exceptions, which are tracked, justified, and time-bound.
- Tooling Over Process: Buying an expensive governance tool before defining the workflow. Fix: Map the governance workflow first. Tools should automate the workflow, not replace it.
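
The lineage pitfall above lends itself to a short illustration. Assuming lineage is available as a mapping from each asset to the assets it feeds, impact analysis reduces to a graph traversal. The asset names and the `LINEAGE` mapping below are hypothetical; a real catalog would supply this graph through its API.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed in its value.
LINEAGE = {
    "raw.customer_events": ["staging.customer_sessions"],
    "staging.customer_sessions": ["marts.customer_360", "ml.churn_features"],
    "marts.customer_360": ["dashboards.retention"],
    "ml.churn_features": [],
    "dashboards.retention": [],
}


def downstream_assets(source: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first traversal returning every asset reachable from `source`, nearest first."""
    seen = {source}
    queue = deque([source])
    impacted: list[str] = []
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guards against cycles and diamond-shaped lineage
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


for asset in downstream_assets("raw.customer_events", LINEAGE):
    print("impacted:", asset)
```

Running this before a policy change on a source table yields the list of dashboards and models whose owners should be notified.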
Production Bundle
Action Checklist
Decision Matrix: Enforcement Strategy
| Strategy | Pros | Cons | Best Use Case |
|---|---|---|---|
| Ingestion Validation | Prevents bad data from entering the lakehouse. | Adds latency to pipelines; requires pipeline modification. | High-volume streaming data; strict quality requirements. |
| Policy-as-Code (IaC) | Catches misconfigurations before deployment. | Does not catch runtime data drift. | Infrastructure provisioning; schema definitions. |
| Runtime Proxy | Transparent to pipelines; covers all access. | Single point of failure; performance overhead. | Multi-cloud environments; legacy systems hard to modify. |
| Native Platform Policies | Minimal added latency; leverages platform optimizations. | Vendor lock-in; limited cross-platform consistency. | Single-vendor stacks (e.g., all Snowflake/Databricks). |
Configuration Template
Copy this template to bootstrap your governance repository structure.
# governance-repo/structure.yaml
# governance/
# ├── policies/
# │   ├── classification.yaml       # Defines sensitivity levels
# │   ├── retention.yaml            # Defines lifecycle rules
# │   ├── access_control.yaml       # Defines RBAC/ABAC rules
# │   └── quality_thresholds.yaml   # Defines acceptable error rates
# ├── scanners/
# │   ├── config.yaml               # Scanner targets and frequency
# │   └── classifiers.yaml          # Regex patterns for PII detection
# ├── enforcement/
# │   ├── ci_pipeline.yaml          # GitHub Actions/GitLab CI config
# │   └── opa_policies/             # Rego rules for evaluation
# └── audit/
#     └── schema.json               # Schema for audit logs
Rego Policy Example (enforcement/opa_policies/no_public_buckets.rego):
package governance.infrastructure

# Uses Rego v1 keywords (contains, if), which OPA 1.0 requires by default
import rego.v1

# Deny if bucket has public ACL
deny contains msg if {
    input.resource_type == "storage_bucket"
    input.config.public_access == true
    msg := "Policy Violation: Storage bucket cannot be public. Ensure private ACL."
}

# Warn if encryption is not explicitly enabled
warn contains msg if {
    input.resource_type == "storage_bucket"
    not input.config.encryption
    msg := "Warning: Storage bucket encryption is not explicitly configured."
}
Quick Start Guide
- Initialize Governance Repo: Create a version-controlled repository for policies. Define the policy schema and create your first three critical policies (e.g., PII Classification, Encryption at Rest, Retention for GDPR).
- Connect Metadata Source: Deploy a metadata scanner to your primary data warehouse. Configure it to harvest table schemas, access grants, and lineage. Push this metadata to your governance catalog.
- Hook CI/CD: Add a step to your data pipeline's CI workflow that runs opa eval against proposed changes. Block merges if critical policies are violated.
- Validate and Iterate: Run the pipeline with a test change that violates a policy. Verify the block occurs. Review the audit log. Adjust policy thresholds based on feedback from data engineers.
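
To make the "block merges" step concrete, a CI job can parse the evaluation result and turn violations into a failing exit code. The sketch below assumes the query targets a `deny` set and that the output has the JSON shape produced by `opa eval --format json` (a top-level `result` list whose items hold `expressions` with a `value`). The helper names are hypothetical.

```python
import json


def violations_from_opa(output: str) -> list[str]:
    """Extract violation messages from JSON evaluation output for a `deny` query."""
    doc = json.loads(output)
    messages: list[str] = []
    for result in doc.get("result", []):
        for expression in result.get("expressions", []):
            value = expression.get("value")
            if isinstance(value, list):  # a Rego set is serialized as a JSON array
                messages.extend(str(v) for v in value)
    return messages


def gate_exit_code(output: str) -> int:
    """Return 1 (block the merge) when any violation is present, else 0."""
    return 1 if violations_from_opa(output) else 0


# Sample output for a test change that makes a bucket public (assumed shape).
sample = json.dumps({
    "result": [{
        "expressions": [{
            "value": ["Policy Violation: Storage bucket cannot be public. Ensure private ACL."],
            "text": "data.governance.infrastructure.deny",
        }]
    }]
})

for message in violations_from_opa(sample):
    print(message)
print("exit code:", gate_exit_code(sample))
```

In a pipeline, the script would read the evaluation output from a file or stdin and pass the exit code to `sys.exit`, which is enough for a required status check to block the merge.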
Conclusion
Data governance is not a compliance checkbox; it is a reliability engineering discipline. By adopting a code-driven framework, organizations can decouple velocity from risk, ensuring that data assets are trustworthy, secure, and compliant by design. The transition requires upfront investment in tooling and process, but the ROI is realized through reduced audit overhead, eliminated compliance drift, and the acceleration of data product delivery. Implement governance as code, and turn your data from a liability into a governed, high-velocity asset.