NIST AI Risk Management Framework Self-Assessment
This document provides a self-assessment of DataSynth against the NIST AI Risk Management Framework (NIST AI 100-1, January 2023). The framework defines four core functions (MAP, MEASURE, MANAGE, and GOVERN), each with categories and subcategories. This assessment covers all four functions as they apply to a synthetic data generation tool.
Assessment Scope
- System: DataSynth synthetic financial data generator
- Version: 0.5.x
- Assessment Date: 2025
- Assessor: Development team (self-assessment)
- AI System Type: Data generation tool (not a decision-making AI system)
- Risk Classification: The generated synthetic data may be used as training data for AI/ML systems. DataSynth itself does not make autonomous decisions, but the quality of its output can affect downstream AI system performance.
MAP: Context and Framing
The MAP function establishes the context for AI risk management by identifying intended use cases, users, and known limitations.
MAP 1: Intended Use Cases
DataSynth is designed for the following use cases:
| Use Case | Description | Risk Level |
|---|---|---|
| ML Training Data | Generate labeled datasets for fraud detection, anomaly detection, and audit analytics models | Medium |
| Software Testing | Provide realistic test data for ERP systems, accounting platforms, and audit tools | Low |
| Privacy-Preserving Analytics | Replace real financial data with synthetic equivalents that preserve statistical properties | Medium |
| Compliance Testing | Generate SOX control test evidence, COSO framework data, and SoD violation scenarios | Low |
| Process Mining | Create OCEL 2.0 event logs for process analysis without exposing real business processes | Low |
| Education and Research | Provide realistic financial datasets for academic research and training | Low |
Not intended for: Replacement of real financial records in regulatory filings, direct use as evidence in audit engagements, or any scenario where the synthetic nature of the data is concealed.
MAP 2: Intended Users
| User Group | Typical Use | Access Level |
|---|---|---|
| Data Scientists | Training ML models for fraud/anomaly detection | API or CLI |
| QA Engineers | ERP and accounting system load/integration testing | CLI or Python wrapper |
| Auditors | Testing audit analytics tools against known-labeled data | CLI output files |
| Compliance Teams | SOX control testing, COSO framework validation | CLI or server API |
| Researchers | Academic study of financial data patterns | Python wrapper |
MAP 3: Known Limitations
DataSynth users should understand the following limitations:
- No Real PII: Generated names, identifiers, and addresses are synthetic. They do not correspond to real individuals or organizations. This is a design feature, not a limitation, but downstream systems should not treat synthetic identities as real.
- Statistical Approximation: Generated data follows configurable statistical distributions (log-normal, Benford's Law, Gaussian mixtures) that approximate real-world patterns. They are not derived from actual transaction populations unless fingerprint extraction is used.
- Industry Profile Approximations: Pre-configured industry profiles (retail, manufacturing, financial services, healthcare, technology) are based on published research and general knowledge. They may not match specific organizations within an industry.
- Temporal Pattern Simplification: Business day calendars, holiday schedules, and intraday patterns are modeled but may not capture all regional or organizational nuances.
- Anomaly Injection Boundaries: Injected fraud patterns follow configurable typologies (ACFE taxonomy) but do not represent the full diversity of real-world fraud schemes.
- Fingerprint Extraction Privacy: When extracting fingerprints from real data, differential privacy noise and k-anonymity are applied. The privacy guarantees depend on correct epsilon/delta parameter selection.
MAP 4: Deployment Context
DataSynth can be deployed as:
- A CLI tool on developer workstations
- A server (REST/gRPC/WebSocket) in cloud or on-premises environments
- A Python library embedded in data pipelines
- A desktop application (Tauri/SvelteKit)
Each deployment context has different risk profiles. Server deployments require authentication, TLS, and rate limiting. CLI usage on trusted workstations has fewer access control requirements.
MEASURE: Metrics and Evaluation
The MEASURE function establishes metrics, methods, and benchmarks for evaluating AI system trustworthiness.
MEASURE 1: Quality Gate Metrics
DataSynth includes a comprehensive evaluation framework (datasynth-eval) with configurable quality gates. Each metric has defined thresholds and automated pass/fail checking.
Statistical Quality
| Metric | Gate Name | Threshold | Comparison | Purpose |
|---|---|---|---|---|
| Benford’s Law MAD | benford_compliance | 0.015 | LTE | First-digit distribution follows Benford’s Law |
| Balance Coherence | balance_sheet_valid | 1.0 | GTE | Assets = Liabilities + Equity |
| Document Chain Integrity | doc_chain_complete | 0.95 | GTE | P2P/O2C chains are complete |
| Temporal Consistency | temporal_valid | 0.90 | GTE | Temporal patterns match configuration |
| Correlation Preservation | correlation_check | 0.80 | GTE | Cross-field correlations preserved |
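For illustration, the Benford MAD metric reduces to the mean absolute deviation between observed first-digit frequencies and the Benford distribution P(d) = log10(1 + 1/d). The following is a minimal Rust sketch under those definitions, not DataSynth's actual implementation:

```rust
/// Minimal sketch: MAD of observed first-digit frequencies vs. Benford.
fn benford_mad(amounts: &[f64]) -> f64 {
    let mut counts = [0usize; 9];
    let mut total = 0usize;
    for &a in amounts {
        let mut x = a.abs();
        if x == 0.0 || !x.is_finite() {
            continue; // first digit undefined
        }
        // Scale into [1, 10) to read off the first significant digit.
        while x >= 10.0 { x /= 10.0; }
        while x < 1.0 { x *= 10.0; }
        counts[x as usize - 1] += 1;
        total += 1;
    }
    if total == 0 { return f64::NAN; }
    (1..=9usize)
        .map(|d| {
            let expected = (1.0 + 1.0 / d as f64).log10(); // Benford P(d)
            let observed = counts[d - 1] as f64 / total as f64;
            (observed - expected).abs()
        })
        .sum::<f64>() / 9.0
}

fn main() {
    let amounts = [123.40, 19.99, 3050.00, 7.25, 88.10, 140.00, 1.05];
    // The benford_compliance gate would require this value <= 0.015.
    println!("Benford MAD = {:.4}", benford_mad(&amounts));
}
```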
Data Quality
| Metric | Gate Name | Threshold | Comparison | Purpose |
|---|---|---|---|---|
| Completion Rate | completeness | 0.95 | GTE | Required fields are populated |
| Duplicate Rate | uniqueness | 0.05 | LTE | Acceptable duplicate rate |
| Referential Integrity | ref_integrity | 0.99 | GTE | Foreign key references valid |
| IC Match Rate | ic_matching | 0.95 | GTE | Intercompany transactions match |
Gate Profiles
Quality gates are organized into profiles with configurable strictness:
```yaml
evaluation:
  quality_gates:
    profile: strict # strict, default, lenient
    fail_strategy: collect_all
    gates:
      - name: benford_compliance
        metric: benford_mad
        threshold: 0.015
        comparison: lte
      - name: balance_valid
        metric: balance_coherence
        threshold: 1.0
        comparison: gte
      - name: completeness
        metric: completion_rate
        threshold: 0.95
        comparison: gte
```
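Each gate ultimately reduces to comparing a measured value against its threshold using the configured comparison operator. A minimal sketch of that check (the type and field names here are hypothetical, not the actual datasynth-eval API):

```rust
/// Hypothetical gate representation; names do not mirror datasynth-eval.
enum Comparison { Lte, Gte }

struct Gate {
    name: &'static str,
    threshold: f64,
    comparison: Comparison,
}

fn passes(gate: &Gate, actual: f64) -> bool {
    match gate.comparison {
        Comparison::Lte => actual <= gate.threshold,
        Comparison::Gte => actual >= gate.threshold,
    }
}

fn main() {
    let gate = Gate { name: "benford_compliance", threshold: 0.015, comparison: Comparison::Lte };
    let actual = 0.011; // e.g., a measured benford_mad value
    println!("{}: {}", gate.name, if passes(&gate, actual) { "PASS" } else { "FAIL" });
}
```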
MEASURE 2: Privacy Evaluation
DataSynth evaluates privacy risk through empirical attacks on generated data.
Membership Inference Attack (MIA)
The MIA module (datasynth-eval/src/privacy/membership_inference.rs) implements a distance-based classifier that attempts to determine whether a specific record was part of the generation configuration. Key metrics:
| Metric | Threshold | Interpretation |
|---|---|---|
| AUC-ROC | <= 0.60 | Near-random classification indicates strong privacy |
| Accuracy | <= 0.55 | Low accuracy means synthetic data does not memorize patterns |
| Precision/Recall | Balanced | No systematic bias toward members or non-members |
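The AUC-ROC threshold has a direct interpretation: it is the probability that the attack scores a randomly chosen member above a randomly chosen non-member, so 0.5 is chance level. A self-contained sketch of that rank-based computation (illustrative only, not the module's code):

```rust
/// Probability that a member record outscores a non-member record
/// under the attack's score; ties count as half.
fn auc(member_scores: &[f64], non_member_scores: &[f64]) -> f64 {
    let mut wins = 0.0;
    for &m in member_scores {
        for &n in non_member_scores {
            if m > n {
                wins += 1.0;
            } else if m == n {
                wins += 0.5;
            }
        }
    }
    wins / (member_scores.len() * non_member_scores.len()) as f64
}

fn main() {
    let members = [0.9, 0.4, 0.6];
    let non_members = [0.5, 0.3, 0.7];
    // Values near 0.5 mean the attacker cannot distinguish members,
    // which is what the <= 0.60 gate enforces.
    println!("AUC = {:.3}", auc(&members, &non_members));
}
```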
Linkage Attack Assessment
The linkage module (datasynth-eval/src/privacy/linkage.rs) evaluates re-identification risk using quasi-identifier combinations:
| Metric | Threshold | Interpretation |
|---|---|---|
| Re-identification Rate | <= 0.05 | Less than 5% of synthetic records can be linked to originals |
| K-Anonymity Achieved | >= 5 | Each quasi-identifier combination appears at least 5 times |
| Unique QI Overlap | Reported | Number of overlapping quasi-identifier combinations |
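K-anonymity here is the size of the smallest equivalence class formed by the quasi-identifier combinations. A minimal sketch (the two-field QI tuple is purely illustrative):

```rust
use std::collections::HashMap;

/// Smallest equivalence-class size over quasi-identifier combinations.
fn k_anonymity(records: &[(String, String)]) -> usize {
    let mut classes: HashMap<&(String, String), usize> = HashMap::new();
    for qi in records {
        *classes.entry(qi).or_insert(0) += 1;
    }
    classes.values().copied().min().unwrap_or(0)
}

fn main() {
    let records = vec![
        ("retail".to_string(), "DE".to_string()),
        ("retail".to_string(), "DE".to_string()),
        ("manufacturing".to_string(), "US".to_string()),
    ];
    // k = 1 here: the manufacturing/US combination is unique, so a
    // >= 5 threshold would fail for this toy dataset.
    println!("k = {}", k_anonymity(&records));
}
```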
NIST SP 800-226 Alignment
The evaluation framework includes self-assessment against NIST SP 800-226 criteria for de-identification. The NistAlignmentReport evaluates:
- Data transformation adequacy
- Re-identification risk assessment
- Documentation completeness
- Privacy control effectiveness
An overall alignment score of at least 71% is required for a passing grade.
Fingerprint Module Privacy
When fingerprint extraction is used with real data input, the datasynth-fingerprint privacy engine provides:
| Mechanism | Parameter | Default (Standard Level) |
|---|---|---|
| Differential Privacy (Laplace) | Epsilon | 1.0 |
| K-Anonymity | K threshold | 5 |
| Outlier Protection | Winsorization percentile | 95th |
| Composition | Method | Naive (RDP/zCDP available) |
Privacy levels provide pre-configured parameter sets:
| Level | Epsilon | K | Use Case |
|---|---|---|---|
| Minimal | 5.0 | 3 | Low sensitivity |
| Standard | 1.0 | 5 | Balanced (default) |
| High | 0.5 | 10 | Sensitive data |
| Maximum | 0.1 | 20 | Highly sensitive data |
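To make the epsilon parameter concrete: the Laplace mechanism adds noise with scale sensitivity/epsilon, so a smaller epsilon means wider noise and stronger privacy. A minimal inverse-CDF sampling sketch (assumes the `rand` crate; DataSynth's actual noise code may differ):

```rust
use rand::Rng;

/// Laplace noise with scale = sensitivity / epsilon (inverse-CDF sampling).
fn laplace_noise<R: Rng>(rng: &mut R, sensitivity: f64, epsilon: f64) -> f64 {
    let scale = sensitivity / epsilon;
    let u: f64 = rng.gen::<f64>() - 0.5; // uniform in [-0.5, 0.5)
    -scale * u.signum() * (1.0 - 2.0 * u.abs()).ln()
}

fn main() {
    let mut rng = rand::thread_rng();
    let true_mean = 1_250.0;
    // Standard level: epsilon = 1.0; unit sensitivity assumed for brevity.
    let private_mean = true_mean + laplace_noise(&mut rng, 1.0, 1.0);
    println!("DP mean estimate: {:.2}", private_mean);
}
```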
MEASURE 3: Completeness and Uniqueness
The evaluation module tracks data completeness and uniqueness metrics:
- Completeness: Measures the percentage of non-null values across all required fields. Reported as `overall_completeness` in the evaluation output.
- Uniqueness: Measures the duplicate rate across primary key fields. Collision-free UUIDs (FNV-1a hash-based with generator-type discriminators) ensure deterministic uniqueness; a sketch of the idea follows.
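The sketch below shows how an FNV-1a-based deterministic identifier with a generator-type discriminator might be built; the key layout and naming here are hypothetical, not DataSynth's actual ID scheme:

```rust
/// 64-bit FNV-1a over raw bytes (standard offset basis and prime).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical deterministic record ID: same inputs, same ID, every run.
fn record_id(generator: &str, seed: u64, sequence: u64) -> String {
    let key = format!("{generator}:{seed}:{sequence}");
    // The generator-type prefix keeps IDs from different generators disjoint.
    format!("{generator}-{:016x}", fnv1a_64(key.as_bytes()))
}

fn main() {
    println!("{}", record_id("journal_entry", 42, 1));
}
```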
MEASURE 4: Distribution Validation
Statistical validation tests verify that generated data matches configured distributions:
| Test | Implementation | Purpose |
|---|---|---|
| Benford First Digit | Chi-squared against Benford distribution | Transaction amounts follow expected first-digit distribution |
| Distribution Fit | Anderson-Darling test | Amount distributions match configured log-normal parameters |
| Correlation Check | Pearson/Spearman correlation | Cross-field correlations preserved via copula models |
| Temporal Patterns | Autocorrelation analysis | Seasonality and period-end patterns present |
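For reference, the Pearson coefficient used by the correlation check is the covariance of two fields divided by the product of their standard deviations; a minimal sketch with made-up field values:

```rust
/// Pearson correlation coefficient of two equally sized samples.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let cov: f64 = x.iter().zip(y).map(|(a, b)| (a - mx) * (b - my)).sum();
    let vx: f64 = x.iter().map(|a| (a - mx).powi(2)).sum();
    let vy: f64 = y.iter().map(|b| (b - my).powi(2)).sum();
    cov / (vx.sqrt() * vy.sqrt())
}

fn main() {
    let amount = [100.0, 250.0, 400.0, 800.0];
    let tax = [19.0, 47.5, 76.0, 152.0]; // perfectly proportional
    // A correlation_check gate at 0.80 (GTE) would pass on r = 1.000.
    println!("r = {:.3}", pearson(&amount, &tax));
}
```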
MANAGE: Risk Mitigation
The MANAGE function addresses risk response and mitigation strategies.
MANAGE 1: Deterministic Reproducibility
DataSynth uses ChaCha8 CSPRNG with configurable seeds. Given the same configuration and seed, the output is identical across runs and platforms. This provides:
- Auditability: Any generated dataset can be exactly reproduced by preserving the configuration YAML and seed value.
- Debugging: Anomalous output can be reproduced for investigation.
- Regression Testing: Changes to generation logic can be detected by comparing output hashes.
```yaml
global:
  seed: 42 # Deterministic seed
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
```
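The seed-stability property can be demonstrated directly with a ChaCha8 stream (this sketch assumes the `rand` and `rand_chacha` crates):

```rust
use rand::{Rng, SeedableRng};
use rand_chacha::ChaCha8Rng;

fn main() {
    // Two generators with the same seed produce identical streams,
    // which is what makes seeded runs reproducible across platforms.
    let mut a = ChaCha8Rng::seed_from_u64(42);
    let mut b = ChaCha8Rng::seed_from_u64(42);
    let run1: Vec<u32> = (0..5).map(|_| a.gen()).collect();
    let run2: Vec<u32> = (0..5).map(|_| b.gen()).collect();
    assert_eq!(run1, run2);
    println!("{run1:?}");
}
```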
MANAGE 2: Audit Logging
DataSynth provides audit trails at multiple levels:
Generation Audit: The runtime emits structured JSON logs for every generation phase, including timing, record counts, and resource utilization.
Privacy Audit: The fingerprint module maintains a PrivacyAudit record of every privacy-related action (noise additions with epsilon spent, value suppressions, generalizations, winsorizations). This audit is embedded in the .dsf fingerprint file.
Server Audit: The REST/gRPC server logs authentication attempts, configuration changes, stream operations, and rate limit events with request correlation IDs (X-Request-Id).
Run Manifest: Each generation run produces a manifest documenting the configuration hash, seed, crate versions, start/end times, record counts, and quality gate results.
MANAGE 3: Data Lineage Tracking
DataSynth tracks data lineage through:
- Configuration Hashing: SHA-256 hash of the input configuration is embedded in all output metadata.
- Content Credentials: Every output file includes a `ContentCredential` linking back to the generator version, configuration hash, and seed.
- Document Reference Chains: Generated document flows maintain explicit reference chains (PO -> GR -> Invoice -> Payment) with `DocumentReference` records.
- Data Governance Reports: Automated Article 10 governance reports document all processing steps from COA generation through quality validation.
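A minimal sketch of the configuration-hashing step (assumes the `sha2` crate; any canonicalization DataSynth applies to the YAML before hashing is not shown):

```rust
use sha2::{Digest, Sha256};

/// Hex-encoded SHA-256 of the raw configuration text.
fn config_hash(config_yaml: &str) -> String {
    Sha256::digest(config_yaml.as_bytes())
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}

fn main() {
    let config = "global:\n  seed: 42\n  industry: manufacturing\n";
    // Embedded in output metadata so any dataset can be traced to its config.
    println!("sha256:{}", config_hash(config));
}
```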
MANAGE 4: Content Marking
All synthetic output is marked to prevent confusion with real data:
- CSV: Comment headers with `# SYNTHETIC DATA - Generated by DataSynth v{version}`
- JSON: `_metadata.content_credential` object with generator, timestamp, config hash, and EU AI Act article reference
- Parquet: Custom metadata key-value pairs with the full credential JSON
- Sidecar Files: Optional `.credential.json` files alongside output files
Content marking is enabled by default and can be configured:
```yaml
marking:
  enabled: true
  format: embedded # embedded, sidecar, both
```
MANAGE 5: Graceful Degradation
The resource guard system (datasynth-core) monitors memory, disk, and CPU usage, applying progressive degradation:
| Level | Memory Threshold | Response |
|---|---|---|
| Normal | < 70% | Full feature generation |
| Reduced | 70-85% | Disable optional features |
| Minimal | 85-95% | Core generation only |
| Emergency | > 95% | Graceful shutdown |
This prevents resource exhaustion from affecting other systems in shared environments.
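A hypothetical mapping from memory utilization to degradation level, mirroring the thresholds in the table above:

```rust
#[derive(Debug)]
enum Degradation { Normal, Reduced, Minimal, Emergency }

/// Map a memory-used fraction onto the degradation levels above.
fn level(memory_used: f64) -> Degradation {
    match memory_used {
        f if f < 0.70 => Degradation::Normal,
        f if f < 0.85 => Degradation::Reduced,
        f if f < 0.95 => Degradation::Minimal,
        _ => Degradation::Emergency,
    }
}

fn main() {
    for used in [0.50, 0.75, 0.90, 0.97] {
        println!("{:.0}% used -> {:?}", used * 100.0, level(used));
    }
}
```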
GOVERN: Policies and Oversight
The GOVERN function establishes organizational policies and structures for AI risk management.
GOVERN 1: Access Control
DataSynth implements layered access control for the server deployment:
API Key Authentication: Keys are hashed with Argon2id at startup. Verification uses timing-safe comparison with a short-lived cache to prevent side-channel attacks. Keys are provided via the X-API-Key header or Authorization: Bearer header.
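For reference, Argon2id hashing plus timing-safe verification in the style described can be expressed with the `argon2` crate's `PasswordHasher`/`PasswordVerifier` traits; whether the server uses this exact crate is an assumption on our part:

```rust
use argon2::{
    password_hash::{rand_core::OsRng, PasswordHash, PasswordHasher, PasswordVerifier, SaltString},
    Argon2,
};

fn main() {
    // Hash the key once (e.g., at provisioning time).
    let salt = SaltString::generate(&mut OsRng);
    let stored = Argon2::default()
        .hash_password(b"example-api-key", &salt)
        .expect("hashing failed")
        .to_string(); // PHC-format string, safe to persist

    // Verify a presented key: parse the PHC hash, then compare;
    // the comparison is constant-time inside the crate.
    let parsed = PasswordHash::new(&stored).expect("invalid stored hash");
    let accepted = Argon2::default()
        .verify_password(b"example-api-key", &parsed)
        .is_ok();
    println!("key accepted: {accepted}");
}
```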
JWT/OIDC Integration (optional jwt feature): Supports external identity providers (Keycloak, Auth0, Entra ID) with RS256 token validation. JWT claims include subject, roles, and tenant ID for multi-tenancy.
RBAC: Role-based access control via JWT claims enables differentiated access:
| Role | Permissions |
|---|---|
| operator | Start/stop/pause generation streams |
| admin | Configuration changes, API key management |
| viewer | Read-only access to status and metrics |
Exempt Paths: Health (/health), readiness (/ready), liveness (/live), and metrics (/metrics) endpoints are exempt from authentication for infrastructure integration.
GOVERN 2: Configuration Management
DataSynth configuration is managed through:
- YAML Schema Validation: All configuration is validated against a typed schema before generation begins. Invalid configurations produce descriptive error messages.
- Industry Presets: Pre-validated configuration presets for common industries (retail, manufacturing, financial services, healthcare, technology) reduce misconfiguration risk.
- Complexity Levels: Small (~100 accounts), medium (~400), and large (~2500) complexity levels provide validated scaling parameters.
- Template System: YAML/JSON templates with merge strategies enable configuration reuse while allowing overrides.
GOVERN 3: Quality Gates as Governance Controls
Quality gates serve as automated governance controls:
```yaml
evaluation:
  quality_gates:
    profile: strict
    fail_strategy: fail_fast # Stop on first failure
    gates:
      - name: benford_compliance
        metric: benford_mad
        threshold: 0.015
        comparison: lte
      - name: privacy_mia
        metric: privacy_mia_auc
        threshold: 0.60
        comparison: lte
      - name: balance_coherence
        metric: balance_coherence
        threshold: 1.0
        comparison: gte
```
Gate profiles can enforce:
- Fail-fast: Stop generation on first quality failure
- Collect-all: Run all checks and report all failures
- Custom thresholds: Organization-specific quality requirements
The GateEngine evaluates all configured gates against the ComprehensiveEvaluation and produces a GateResult with per-gate pass/fail status, actual values, and summary messages.
GOVERN 4: Audit Trail Completeness
The following audit artifacts are produced for each generation run:
| Artifact | Location | Contents |
|---|---|---|
| Run Manifest | output/_manifest.json | Config hash, seed, timestamps, record counts, gate results |
| Content Credentials | Embedded in each output file | Generator version, config hash, seed, EU AI Act reference |
| Data Governance Report | output/_governance_report.json | Article 10 data sources, processing steps, quality measures, bias assessment |
| Privacy Audit | Embedded in .dsf files | Epsilon spent, actions taken, composition method, remaining budget |
| Server Logs | Structured JSON to stdout/log aggregator | Request traces, auth events, config changes, stream operations |
| Quality Gate Results | output/_evaluation.json | Per-gate pass/fail, actual vs threshold, summary |
GOVERN 5: Incident Response
For scenarios where generated data is mistakenly used as real data:
- Detection: Content credentials in output files identify synthetic origin
- Containment: Deterministic generation means the exact dataset can be reproduced and identified
- Remediation: All output files carry machine-readable markers that downstream systems can check programmatically
- Prevention: Content marking is enabled by default and requires explicit configuration to disable
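As a concrete example of the detection and prevention controls, a downstream pipeline could refuse unmarked input with a check like the following sketch (assumes the `serde_json` crate; the field names follow the JSON marking format in MANAGE 4, and the function name is ours):

```rust
use serde_json::Value;

/// Returns true only if the payload carries the synthetic-data credential.
fn is_marked_synthetic(raw: &str) -> bool {
    serde_json::from_str::<Value>(raw)
        .ok()
        .and_then(|v| v.get("_metadata")?.get("content_credential").cloned())
        .is_some()
}

fn main() {
    let data = r#"{"_metadata":{"content_credential":{"generator":"DataSynth"}},"records":[]}"#;
    assert!(is_marked_synthetic(data));
    println!("synthetic marker present; safe to treat as test data");
}
```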
Assessment Summary
| Function | Category Count | Addressed | Notes |
|---|---|---|---|
| MAP | 4 | 4 | Use cases, users, limitations, and deployment documented |
| MEASURE | 4 | 4 | Quality gates, privacy metrics, completeness, distribution validation |
| MANAGE | 5 | 5 | Reproducibility, audit logging, lineage, content marking, degradation |
| GOVERN | 5 | 5 | Access control, config management, quality gates, audit trails, incident response |
Overall Assessment: DataSynth provides comprehensive risk management controls appropriate for a synthetic data generation tool. The primary residual risks relate to (1) parameter misconfiguration leading to unrealistic output, mitigated by quality gates and industry presets, and (2) privacy leakage during fingerprint extraction from real data, mitigated by differential privacy with configurable epsilon/delta budgets and empirical privacy evaluation.