NIST AI Risk Management Framework Self-Assessment

This document provides a self-assessment of DataSynth against the NIST AI Risk Management Framework (AI 100-1, January 2023). The framework defines four core functions – MAP, MEASURE, MANAGE, and GOVERN – each with categories and subcategories. This assessment covers all four functions as they apply to a synthetic data generation tool.

Assessment Scope

  • System: DataSynth synthetic financial data generator
  • Version: 0.5.x
  • Assessment Date: 2025
  • Assessor: Development team (self-assessment)
  • AI System Type: Data generation tool (not a decision-making AI system)
  • Risk Classification: The generated synthetic data may be used as training data for AI/ML systems. DataSynth itself does not make autonomous decisions, but the quality of its output can affect downstream AI system performance.

MAP: Context and Framing

The MAP function establishes the context for AI risk management by identifying intended use cases, users, and known limitations.

MAP 1: Intended Use Cases

DataSynth is designed for the following use cases:

| Use Case | Description | Risk Level |
|---|---|---|
| ML Training Data | Generate labeled datasets for fraud detection, anomaly detection, and audit analytics models | Medium |
| Software Testing | Provide realistic test data for ERP systems, accounting platforms, and audit tools | Low |
| Privacy-Preserving Analytics | Replace real financial data with synthetic equivalents that preserve statistical properties | Medium |
| Compliance Testing | Generate SOX control test evidence, COSO framework data, and SoD violation scenarios | Low |
| Process Mining | Create OCEL 2.0 event logs for process analysis without exposing real business processes | Low |
| Education and Research | Provide realistic financial datasets for academic research and training | Low |

Not intended for: Replacement of real financial records in regulatory filings, direct use as evidence in audit engagements, or any scenario where the synthetic nature of the data is concealed.

MAP 2: Intended Users

| User Group | Typical Use | Access Level |
|---|---|---|
| Data Scientists | Training ML models for fraud/anomaly detection | API or CLI |
| QA Engineers | ERP and accounting system load/integration testing | CLI or Python wrapper |
| Auditors | Testing audit analytics tools against known-labeled data | CLI output files |
| Compliance Teams | SOX control testing, COSO framework validation | CLI or server API |
| Researchers | Academic study of financial data patterns | Python wrapper |

MAP 3: Known Limitations

DataSynth users should understand the following limitations:

  1. No Real PII: Generated names, identifiers, and addresses are synthetic. They do not correspond to real individuals or organizations. This is a design feature, not a limitation, but downstream systems should not treat synthetic identities as real.

  2. Statistical Approximation: Generated data follows configurable statistical distributions (log-normal, Benford’s Law, Gaussian mixtures) that approximate real-world patterns. They are not derived from actual transaction populations unless fingerprint extraction is used.

  3. Industry Profile Approximations: Pre-configured industry profiles (retail, manufacturing, financial services, healthcare, technology) are based on published research and general knowledge. They may not match specific organizations within an industry.

  4. Temporal Pattern Simplification: Business day calendars, holiday schedules, and intraday patterns are modeled but may not capture all regional or organizational nuances.

  5. Anomaly Injection Boundaries: Injected fraud patterns follow configurable typologies (ACFE taxonomy) but do not represent the full diversity of real-world fraud schemes.

  6. Fingerprint Extraction Privacy: When extracting fingerprints from real data, differential privacy noise and k-anonymity are applied. The privacy guarantees depend on correct epsilon/delta parameter selection.

MAP 4: Deployment Context

DataSynth can be deployed as:

  • A CLI tool on developer workstations
  • A server (REST/gRPC/WebSocket) in cloud or on-premises environments
  • A Python library embedded in data pipelines
  • A desktop application (Tauri/SvelteKit)

Each deployment context has different risk profiles. Server deployments require authentication, TLS, and rate limiting. CLI usage on trusted workstations has fewer access control requirements.


MEASURE: Metrics and Evaluation

The MEASURE function establishes metrics, methods, and benchmarks for evaluating AI system trustworthiness.

MEASURE 1: Quality Gate Metrics

DataSynth includes a comprehensive evaluation framework (datasynth-eval) with configurable quality gates. Each metric has defined thresholds and automated pass/fail checking.

Statistical Quality

| Metric | Gate Name | Threshold | Comparison | Purpose |
|---|---|---|---|---|
| Benford’s Law MAD | benford_compliance | 0.015 | LTE | First-digit distribution follows Benford’s Law |
| Balance Coherence | balance_sheet_valid | 1.0 | GTE | Assets = Liabilities + Equity |
| Document Chain Integrity | doc_chain_complete | 0.95 | GTE | P2P/O2C chains are complete |
| Temporal Consistency | temporal_valid | 0.90 | GTE | Temporal patterns match configuration |
| Correlation Preservation | correlation_check | 0.80 | GTE | Cross-field correlations preserved |
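
The MAD behind the benford_compliance gate is simple to state: compare observed first-digit frequencies with the Benford expectation log10(1 + 1/d) and average the absolute deviations over the nine digits. A minimal sketch of that computation (a hypothetical standalone helper in Rust, not the datasynth-eval API):

```rust
/// Mean absolute deviation (MAD) of observed first-digit frequencies from
/// the Benford expectation log10(1 + 1/d). Hypothetical helper for
/// illustration, not DataSynth's implementation.
fn benford_mad(amounts: &[f64]) -> f64 {
    let mut counts = [0u64; 9];
    let mut n = 0u64;
    for &a in amounts {
        let a = a.abs();
        if a > 0.0 && a.is_finite() {
            // Scale into [1, 10) and take the integer part as the lead digit.
            let lead = (a / 10f64.powf(a.log10().floor())) as usize;
            counts[lead.clamp(1, 9) - 1] += 1;
            n += 1;
        }
    }
    if n == 0 {
        return f64::NAN;
    }
    (1..=9)
        .map(|d| {
            let expected = (1.0 + 1.0 / d as f64).log10();
            let observed = counts[d - 1] as f64 / n as f64;
            (observed - expected).abs()
        })
        .sum::<f64>()
        / 9.0
}
```

A MAD at or below 0.015 would pass the gate configured above.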

Data Quality

| Metric | Gate Name | Threshold | Comparison | Purpose |
|---|---|---|---|---|
| Completion Rate | completeness | 0.95 | GTE | Required fields are populated |
| Duplicate Rate | uniqueness | 0.05 | LTE | Duplicate rate stays within the acceptable bound |
| Referential Integrity | ref_integrity | 0.99 | GTE | Foreign key references are valid |
| IC Match Rate | ic_matching | 0.95 | GTE | Intercompany transactions match |

Gate Profiles

Quality gates are organized into profiles with configurable strictness:

```yaml
evaluation:
  quality_gates:
    profile: strict    # strict, default, lenient
    fail_strategy: collect_all
    gates:
      - name: benford_compliance
        metric: benford_mad
        threshold: 0.015
        comparison: lte
      - name: balance_valid
        metric: balance_coherence
        threshold: 1.0
        comparison: gte
      - name: completeness
        metric: completion_rate
        threshold: 0.95
        comparison: gte
```

MEASURE 2: Privacy Evaluation

DataSynth evaluates privacy risk through empirical attacks on generated data.

Membership Inference Attack (MIA)

The MIA module (datasynth-eval/src/privacy/membership_inference.rs) implements a distance-based classifier that attempts to determine whether a specific record was part of the generation configuration. Key metrics:

| Metric | Threshold | Interpretation |
|---|---|---|
| AUC-ROC | <= 0.60 | Near-random classification (AUC close to 0.5) indicates strong privacy |
| Accuracy | <= 0.55 | Low accuracy means synthetic data does not memorize patterns |
| Precision/Recall | Balanced | No systematic bias toward members or non-members |
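
For intuition, a distance-based attack of this kind can be sketched in a few lines: score each probe record by its distance to the nearest synthetic record, then compute AUC-ROC over known member/non-member labels. This is a simplified stand-in for the classifier in membership_inference.rs; all names are illustrative:

```rust
/// Score a probe record by the (negated) Euclidean distance to its nearest
/// synthetic neighbor: records closer to the synthetic data look more
/// "member-like". Illustrative only.
fn membership_score(probe: &[f64], synthetic: &[Vec<f64>]) -> f64 {
    let nearest = synthetic
        .iter()
        .map(|s| {
            probe
                .iter()
                .zip(s)
                .map(|(a, b)| (a - b) * (a - b))
                .sum::<f64>()
                .sqrt()
        })
        .fold(f64::INFINITY, f64::min);
    -nearest
}

/// AUC-ROC by pairwise comparison: the probability that a random member
/// outscores a random non-member (ties count half). Near 0.5 is random.
fn auc_roc(scored: &[(f64, bool)]) -> f64 {
    let members: Vec<f64> = scored.iter().filter(|&&(_, m)| m).map(|&(s, _)| s).collect();
    let others: Vec<f64> = scored.iter().filter(|&&(_, m)| !m).map(|&(s, _)| s).collect();
    let mut wins = 0.0;
    for &m in &members {
        for &o in &others {
            if m > o {
                wins += 1.0;
            } else if m == o {
                wins += 0.5;
            }
        }
    }
    wins / (members.len() as f64 * others.len() as f64)
}
```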

Linkage Attack Assessment

The linkage module (datasynth-eval/src/privacy/linkage.rs) evaluates re-identification risk using quasi-identifier combinations:

| Metric | Threshold | Interpretation |
|---|---|---|
| Re-identification Rate | <= 0.05 | Less than 5% of synthetic records can be linked to originals |
| K-Anonymity Achieved | >= 5 | Each quasi-identifier combination appears at least 5 times |
| Unique QI Overlap | Reported | Number of overlapping quasi-identifier combinations |
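
The k-anonymity check reduces to counting equivalence classes over the chosen quasi-identifier columns, as in this hypothetical sketch (not the linkage module's actual code):

```rust
use std::collections::HashMap;

/// Smallest equivalence-class size over the quasi-identifier columns; the
/// dataset satisfies k-anonymity when this minimum is >= k. Hypothetical
/// standalone check for illustration.
fn min_qi_class_size(records: &[Vec<String>], qi_columns: &[usize]) -> usize {
    let mut classes: HashMap<Vec<&str>, usize> = HashMap::new();
    for record in records {
        let key: Vec<&str> = qi_columns.iter().map(|&c| record[c].as_str()).collect();
        *classes.entry(key).or_insert(0) += 1;
    }
    classes.values().copied().min().unwrap_or(0)
}
```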

NIST SP 800-226 Alignment

The evaluation framework includes self-assessment against NIST SP 800-226 criteria for de-identification. The NistAlignmentReport evaluates:

  • Data transformation adequacy
  • Re-identification risk assessment
  • Documentation completeness
  • Privacy control effectiveness

The overall alignment score must be at least 71% for a passing grade.

Fingerprint Module Privacy

When fingerprint extraction is used with real data input, the datasynth-fingerprint privacy engine provides:

| Mechanism | Parameter | Default (Standard Level) |
|---|---|---|
| Differential Privacy (Laplace) | Epsilon | 1.0 |
| K-Anonymity | K threshold | 5 |
| Outlier Protection | Winsorization percentile | 95th |
| Composition | Method | Naive (RDP/zCDP available) |

Privacy levels provide pre-configured parameter sets:

| Level | Epsilon | K | Use Case |
|---|---|---|---|
| Minimal | 5.0 | 3 | Low sensitivity |
| Standard | 1.0 | 5 | Balanced (default) |
| High | 0.5 | 10 | Sensitive data |
| Maximum | 0.1 | 20 | Highly sensitive data |
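
Smaller epsilon means more noise: the Laplace mechanism draws noise with scale sensitivity/epsilon, so the Maximum level (epsilon = 0.1) adds ten times the noise of the Standard level per release. A hedged sketch of the mechanism, assuming the rand crate (0.8 API); budget composition tracking is omitted:

```rust
use rand::Rng;

/// Laplace mechanism sketch: noise with scale sensitivity/epsilon, sampled
/// by inverse CDF from a uniform in [-0.5, 0.5). Illustration only; the
/// real privacy engine also tracks composition across queries.
fn laplace_noise<R: Rng>(rng: &mut R, sensitivity: f64, epsilon: f64) -> f64 {
    let scale = sensitivity / epsilon;
    let u: f64 = rng.gen::<f64>() - 0.5;
    -scale * u.signum() * (1.0 - 2.0 * u.abs()).ln()
}
```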

MEASURE 3: Completeness and Uniqueness

The evaluation module tracks data completeness and uniqueness metrics:

  • Completeness: Measures the percentage of non-null values across all required fields. Reported as overall_completeness in the evaluation output.
  • Uniqueness: Measures the duplicate rate across primary key fields. Collision-free UUIDs (FNV-1a hash-based with generator-type discriminators) ensure deterministic uniqueness.
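
For reference, the 64-bit FNV-1a fold itself is a short loop over the input bytes; this standalone sketch shows the standard algorithm (how DataSynth prepends generator-type discriminators is not reproduced here):

```rust
/// Standard FNV-1a 64-bit hash: XOR each byte into the state, then
/// multiply by the FNV prime. Illustration of the algorithm family only.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0100_0000_01b3); // FNV prime
    }
    hash
}
```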

MEASURE 4: Distribution Validation

Statistical validation tests verify that generated data matches configured distributions:

| Test | Implementation | Purpose |
|---|---|---|
| Benford First Digit | Chi-squared against Benford distribution | Transaction amounts follow expected first-digit distribution |
| Distribution Fit | Anderson-Darling test | Amount distributions match configured log-normal parameters |
| Correlation Check | Pearson/Spearman correlation | Cross-field correlations preserved via copula models |
| Temporal Patterns | Autocorrelation analysis | Seasonality and period-end patterns present |
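
As an example of how these tests reduce to small computations, the first-digit chi-squared statistic can be sketched as follows (hypothetical helper; the result would be compared against the critical value for 8 degrees of freedom, 15.51 at p = 0.05):

```rust
/// Chi-squared statistic for observed first-digit counts against Benford
/// expectations. Counts can come from the same tallying shown in the MAD
/// sketch earlier; this is an illustration, not the eval module's code.
fn benford_chi_squared(first_digit_counts: &[u64; 9]) -> f64 {
    let n: u64 = first_digit_counts.iter().sum();
    (1..=9)
        .map(|d| {
            let expected = n as f64 * (1.0 + 1.0 / d as f64).log10();
            let diff = first_digit_counts[d - 1] as f64 - expected;
            diff * diff / expected
        })
        .sum()
}
```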

MANAGE: Risk Mitigation

The MANAGE function addresses risk response and mitigation strategies.

MANAGE 1: Deterministic Reproducibility

DataSynth uses ChaCha8 CSPRNG with configurable seeds. Given the same configuration and seed, the output is identical across runs and platforms. This provides:

  • Auditability: Any generated dataset can be exactly reproduced by preserving the configuration YAML and seed value.
  • Debugging: Anomalous output can be reproduced for investigation.
  • Regression Testing: Changes to generation logic can be detected by comparing output hashes.

For example, pinning the seed in configuration:

```yaml
global:
  seed: 42                    # Deterministic seed
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
```
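
The guarantee can be demonstrated with the rand_chacha crate, which provides the same ChaCha8 construction (a sketch of the property, not DataSynth's internal RNG plumbing):

```rust
use rand::{Rng, SeedableRng};
use rand_chacha::ChaCha8Rng;

// Two generators seeded identically produce identical streams on any
// platform; this is the property the reproducibility guarantee rests on.
fn main() {
    let mut first = ChaCha8Rng::seed_from_u64(42);
    let mut second = ChaCha8Rng::seed_from_u64(42);
    for _ in 0..1_000 {
        assert_eq!(first.gen::<u64>(), second.gen::<u64>());
    }
}
```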

MANAGE 2: Audit Logging

DataSynth provides audit trails at multiple levels:

Generation Audit: The runtime emits structured JSON logs for every generation phase, including timing, record counts, and resource utilization.

Privacy Audit: The fingerprint module maintains a PrivacyAudit record of every privacy-related action (noise additions with epsilon spent, value suppressions, generalizations, winsorizations). This audit is embedded in the .dsf fingerprint file.

Server Audit: The REST/gRPC server logs authentication attempts, configuration changes, stream operations, and rate limit events with request correlation IDs (X-Request-Id).

Run Manifest: Each generation run produces a manifest documenting the configuration hash, seed, crate versions, start/end times, record counts, and quality gate results.

MANAGE 3: Data Lineage Tracking

DataSynth tracks data lineage through:

  • Configuration Hashing: SHA-256 hash of the input configuration is embedded in all output metadata.
  • Content Credentials: Every output file includes a ContentCredential linking back to the generator version, configuration hash, and seed.
  • Document Reference Chains: Generated document flows maintain explicit reference chains (PO -> GR -> Invoice -> Payment) with DocumentReference records.
  • Data Governance Reports: Automated EU AI Act Article 10 governance reports document all processing steps from COA generation through quality validation.
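
The configuration-hashing step above amounts to hashing the raw configuration text and embedding the hex digest, roughly as in this hypothetical helper (assuming the sha2 crate; DataSynth may canonicalize the YAML before hashing):

```rust
use sha2::{Digest, Sha256};

/// Hash the configuration text so any output file can be tied back to the
/// exact configuration that produced it. Illustrative helper only.
fn config_hash(config_yaml: &str) -> String {
    Sha256::digest(config_yaml.as_bytes())
        .iter()
        .map(|byte| format!("{:02x}", byte))
        .collect()
}
```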

MANAGE 4: Content Marking

All synthetic output is marked to prevent confusion with real data:

  • CSV: Comment headers with # SYNTHETIC DATA - Generated by DataSynth v{version}
  • JSON: _metadata.content_credential object with generator, timestamp, config hash, and EU AI Act article reference
  • Parquet: Custom metadata key-value pairs with full credential JSON
  • Sidecar Files: Optional .credential.json files alongside output files

Content marking is enabled by default and can be configured:

```yaml
marking:
  enabled: true
  format: embedded    # embedded, sidecar, both
```

MANAGE 5: Graceful Degradation

The resource guard system (datasynth-core) monitors memory, disk, and CPU usage, applying progressive degradation:

| Level | Memory Threshold | Response |
|---|---|---|
| Normal | < 70% | Full feature generation |
| Reduced | 70-85% | Disable optional features |
| Minimal | 85-95% | Core generation only |
| Emergency | > 95% | Graceful shutdown |

This prevents resource exhaustion from affecting other systems in shared environments.
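
The level selection reduces to threshold comparisons on measured utilization, roughly as sketched below (illustrative; the real resource guard also weighs disk and CPU signals):

```rust
/// Degradation level as a function of memory utilization, mirroring the
/// thresholds in the table above. Names are illustrative, not the
/// datasynth-core types.
#[derive(Debug, PartialEq)]
enum DegradationLevel {
    Normal,
    Reduced,
    Minimal,
    Emergency,
}

fn degradation_level(memory_utilization: f64) -> DegradationLevel {
    match memory_utilization {
        m if m < 0.70 => DegradationLevel::Normal,
        m if m < 0.85 => DegradationLevel::Reduced,
        m if m < 0.95 => DegradationLevel::Minimal,
        _ => DegradationLevel::Emergency,
    }
}
```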


GOVERN: Policies and Oversight

The GOVERN function establishes organizational policies and structures for AI risk management.

GOVERN 1: Access Control

DataSynth implements layered access control for the server deployment:

API Key Authentication: Keys are hashed with Argon2id at startup. Verification uses timing-safe comparison with a short-lived cache to prevent side-channel attacks. Keys are provided via the X-API-Key header or Authorization: Bearer header.
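
A verification sketch using the RustCrypto argon2 crate shows the shape of this check (the short-lived cache and header extraction are omitted, and this is not the server's actual code):

```rust
use argon2::{Argon2, PasswordHash, PasswordVerifier};

/// Verify a presented API key against a stored Argon2id PHC-format hash.
/// Verification is constant-time with respect to the key material because
/// the comparison happens on the derived hash, not the raw key.
fn api_key_valid(presented_key: &str, stored_hash: &str) -> bool {
    match PasswordHash::new(stored_hash) {
        Ok(parsed) => Argon2::default()
            .verify_password(presented_key.as_bytes(), &parsed)
            .is_ok(),
        Err(_) => false,
    }
}
```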

JWT/OIDC Integration (optional jwt feature): Supports external identity providers (Keycloak, Auth0, Entra ID) with RS256 token validation. JWT claims include subject, roles, and tenant ID for multi-tenancy.

RBAC: Role-based access control via JWT claims enables differentiated access:

| Role | Permissions |
|---|---|
| operator | Start/stop/pause generation streams |
| admin | Configuration changes, API key management |
| viewer | Read-only access to status and metrics |

Exempt Paths: Health (/health), readiness (/ready), liveness (/live), and metrics (/metrics) endpoints are exempt from authentication for infrastructure integration.

GOVERN 2: Configuration Management

DataSynth configuration is managed through:

  • YAML Schema Validation: All configuration is validated against a typed schema before generation begins. Invalid configurations produce descriptive error messages.
  • Industry Presets: Pre-validated configuration presets for common industries (retail, manufacturing, financial services, healthcare, technology) reduce misconfiguration risk.
  • Complexity Levels: Small (~100 accounts), medium (~400), and large (~2500) complexity levels provide validated scaling parameters.
  • Template System: YAML/JSON templates with merge strategies enable configuration reuse while allowing overrides.
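
Schema validation of this kind is typically expressed as deserialization into typed structs, as in this sketch (assuming serde and serde_yaml; the field types here are assumptions for illustration, not DataSynth's actual schema structs):

```rust
use serde::Deserialize;

/// Typed view of the `global:` block shown earlier. deny_unknown_fields
/// rejects misspelled or unexpected keys before generation starts.
#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct GlobalConfig {
    seed: u64,
    industry: String,
    start_date: String,
    period_months: u32,
}

fn parse_global(yaml: &str) -> Result<GlobalConfig, serde_yaml::Error> {
    serde_yaml::from_str(yaml)
}
```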

GOVERN 3: Quality Gates as Governance Controls

Quality gates serve as automated governance controls:

```yaml
evaluation:
  quality_gates:
    profile: strict
    fail_strategy: fail_fast    # Stop on first failure
    gates:
      - name: benford_compliance
        metric: benford_mad
        threshold: 0.015
        comparison: lte
      - name: privacy_mia
        metric: privacy_mia_auc
        threshold: 0.60
        comparison: lte
      - name: balance_coherence
        metric: balance_coherence
        threshold: 1.0
        comparison: gte
```

Gate profiles can enforce:

  • Fail-fast: Stop generation on first quality failure
  • Collect-all: Run all checks and report all failures
  • Custom thresholds: Organization-specific quality requirements

The GateEngine evaluates all configured gates against the ComprehensiveEvaluation and produces a GateResult with per-gate pass/fail status, actual values, and summary messages.
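
A minimal sketch of what a single gate check looks like, mirroring the YAML fields above (the real GateEngine adds profiles, fail strategies, and summary reporting):

```rust
/// One quality gate, mirroring the name/threshold/comparison fields from
/// the configuration. Illustrative types, not the datasynth-eval API.
enum Comparison {
    Lte,
    Gte,
}

struct Gate {
    name: String,
    threshold: f64,
    comparison: Comparison,
}

/// A gate passes when the measured value satisfies its comparison.
fn gate_passes(gate: &Gate, actual: f64) -> bool {
    match gate.comparison {
        Comparison::Lte => actual <= gate.threshold,
        Comparison::Gte => actual >= gate.threshold,
    }
}
```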

GOVERN 4: Audit Trail Completeness

The following audit artifacts are produced for each generation run:

| Artifact | Location | Contents |
|---|---|---|
| Run Manifest | output/_manifest.json | Config hash, seed, timestamps, record counts, gate results |
| Content Credentials | Embedded in each output file | Generator version, config hash, seed, EU AI Act reference |
| Data Governance Report | output/_governance_report.json | Article 10 data sources, processing steps, quality measures, bias assessment |
| Privacy Audit | Embedded in .dsf files | Epsilon spent, actions taken, composition method, remaining budget |
| Server Logs | Structured JSON to stdout/log aggregator | Request traces, auth events, config changes, stream operations |
| Quality Gate Results | output/_evaluation.json | Per-gate pass/fail, actual vs. threshold, summary |

GOVERN 5: Incident Response

For scenarios where generated data is mistakenly used as real data:

  1. Detection: Content credentials in output files identify synthetic origin
  2. Containment: Deterministic generation means the exact dataset can be reproduced and identified
  3. Remediation: All output files carry machine-readable markers that downstream systems can check programmatically
  4. Prevention: Content marking is enabled by default and requires explicit configuration to disable
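
A downstream system's programmatic check (step 3) can be as small as probing for the credential object, as in this sketch assuming serde_json (real pipelines would also validate the credential's contents):

```rust
use serde_json::Value;

/// Downstream guard sketch: treat a JSON payload as synthetic when it
/// carries the _metadata.content_credential marker described above.
fn is_marked_synthetic(payload: &str) -> bool {
    serde_json::from_str::<Value>(payload)
        .ok()
        .and_then(|doc| {
            doc.get("_metadata")
                .and_then(|m| m.get("content_credential"))
                .cloned()
        })
        .is_some()
}
```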

Assessment Summary

| Function | Category Count | Addressed | Notes |
|---|---|---|---|
| MAP | 4 | 4 | Use cases, users, limitations, and deployment documented |
| MEASURE | 4 | 4 | Quality gates, privacy metrics, completeness, distribution validation |
| MANAGE | 5 | 5 | Reproducibility, audit logging, lineage, content marking, degradation |
| GOVERN | 5 | 5 | Access control, config management, quality gates, audit trails, incident response |

Overall Assessment: DataSynth provides comprehensive risk management controls appropriate for a synthetic data generation tool. The primary residual risks relate to (1) parameter misconfiguration leading to unrealistic output, mitigated by quality gates and industry presets, and (2) privacy leakage during fingerprint extraction from real data, mitigated by differential privacy with configurable epsilon/delta budgets and empirical privacy evaluation.

See Also