SyntheticData

High-Performance Synthetic Enterprise Financial Data Generator

Developed by Ernst & Young Ltd., Zurich, Switzerland

What is SyntheticData?

SyntheticData is a configurable synthetic data generator that produces realistic, interconnected enterprise financial data. It generates General Ledger journal entries, Chart of Accounts, SAP HANA-compatible ACDOCA event logs, document flows, subledger records, banking/KYC/AML transactions, OCEL 2.0 process mining data, audit workpapers, and ML-ready graph exports at scale.

The generator calibrates its statistical distributions to empirical research on real-world general ledger patterns, so synthetic datasets exhibit the same characteristics as production data, including Benford’s Law compliance, temporal patterns, and document flow integrity.

New in v0.5.0: LLM-augmented generation (vendor names, descriptions, anomaly explanations), diffusion model backend (statistical denoising, hybrid generation), causal & counterfactual generation (SCMs, do-calculus interventions), federated fingerprinting, synthetic data certificates, and ecosystem integrations (Airflow, dbt, MLflow, Spark).

v0.3.0: ACFE-aligned fraud taxonomy, collusion modeling, industry-specific transactions (Manufacturing, Retail, Healthcare), and ML benchmarks.

v0.2.x: Privacy-preserving fingerprinting, accounting/audit standards (US GAAP, IFRS, ISA, SOX), streaming output API.

Quick Links

Section	Description
Getting Started	Installation, quick start guide, and demo mode
User Guide	CLI reference, server API, desktop UI, Python wrapper
Configuration	Complete YAML schema and presets
Architecture	System design, data flow, resource management
Crate Reference	Detailed crate documentation (15 crates)
Advanced Topics	Anomaly injection, graph export, fingerprinting, performance
Use Cases	Fraud detection, audit, AML/KYC, compliance

Key Features

Core Data Generation

Feature	Description
Statistical Distributions	Line item counts, amounts, and patterns based on empirical GL research
Benford’s Law Compliance	First-digit distribution following Benford’s Law with configurable fraud patterns
Industry Presets	Manufacturing, Retail, Financial Services, Healthcare, Technology
Chart of Accounts	Small (~100), Medium (~400), Large (~2500) account structures
Temporal Patterns	Month-end, quarter-end, year-end volume spikes with working hour modeling
Regional Calendars	Holiday calendars for US, DE, GB, CN, JP, IN with lunar calendar support
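
To make the Benford’s Law row concrete, the following is a minimal Python sketch of how a generated amount column could be checked against the expected first-digit distribution. It is not part of SyntheticData; the function names (`first_digit`, `benford_expected`, `benford_mad`) are hypothetical and the mean absolute deviation shown here is only one common way to score conformity.

```python
import math
from collections import Counter
from decimal import Decimal


def first_digit(amount):
    """Return the leading non-zero digit of a non-zero number."""
    # Decimal stores no leading zeros, so digits[0] is the first significant digit.
    return Decimal(str(abs(amount))).as_tuple().digits[0]


def benford_expected():
    """Expected first-digit probabilities under Benford's Law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def benford_mad(amounts):
    """Mean absolute deviation between observed and expected first-digit shares."""
    digits = [first_digit(a) for a in amounts if a != 0]
    if not digits:
        raise ValueError("need at least one non-zero amount")
    counts = Counter(digits)
    expected = benford_expected()
    return sum(abs(counts[d] / len(digits) - expected[d]) for d in range(1, 10)) / 9


if __name__ == "__main__":
    # Powers of 2 are a classic sequence that follows Benford's Law closely.
    sample = [2 ** k for k in range(1, 500)]
    print(f"MAD vs. Benford: {benford_mad(sample):.4f}")
```

A low value indicates that the first digits track Benford’s Law; injected fraud patterns would be expected to push the value up.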

Enterprise Simulation

Master Data Management: Vendors, customers, materials, fixed assets, employees with temporal validity
Document Flow Engine: Complete P2P (Procure-to-Pay) and O2C (Order-to-Cash) processes with three-way matching
Intercompany Transactions: IC matching, transfer pricing, consolidation eliminations
Balance Coherence: Opening balances, running balance tracking, trial balance generation
Subledger Simulation: AR, AP, Fixed Assets, Inventory with GL reconciliation
Currency & FX: Realistic exchange rates (Ornstein-Uhlenbeck process), currency translation, CTA generation
Period Close Engine: Monthly close, depreciation runs, accruals, year-end closing
Banking/KYC/AML: Customer personas, KYC profiles, AML typologies (structuring, funnel, mule, layering, round-tripping)
Process Mining: OCEL 2.0 event logs with object-centric relationships
Audit Simulation: ISA-compliant engagements, workpapers, findings, risk assessments, professional judgments
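
The Currency & FX entry above mentions an Ornstein-Uhlenbeck process. As an illustration of what that means, here is a minimal Python sketch of a mean-reverting rate path using an Euler-Maruyama discretization. The function name and all parameter values are assumptions chosen for the example, not the generator's actual implementation or defaults.

```python
import math
import random


def simulate_ou_rates(x0, mean, theta, sigma, steps, dt=1.0, seed=42):
    """Simulate a mean-reverting path: dX = theta * (mean - X) dt + sigma dW."""
    rng = random.Random(seed)  # seeded so the path is reproducible
    rates = [x0]
    for _ in range(steps):
        x = rates[-1]
        drift = theta * (mean - x) * dt                       # pull toward the long-run mean
        shock = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)   # Gaussian noise scaled by sqrt(dt)
        rates.append(x + drift + shock)
    return rates


if __name__ == "__main__":
    # Hypothetical daily EUR/USD-like path starting above its long-run mean.
    path = simulate_ou_rates(x0=1.12, mean=1.08, theta=0.05, sigma=0.004, steps=250)
    print(f"start={path[0]:.4f} end={path[-1]:.4f}")
```

The mean-reversion term is what keeps simulated rates from drifting away indefinitely, which a plain random walk would not guarantee.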

Fraud Patterns & Industry-Specific Features

ACFE-Aligned Fraud Taxonomy: Asset Misappropriation, Corruption, Financial Statement Fraud calibrated to ACFE statistics
Collusion & Conspiracy Modeling: Multi-party fraud networks with 9 ring types and role-based conspirators
Management Override: Senior-level fraud with fraud triangle modeling (Pressure, Opportunity, Rationalization)
Red Flag Generation: 40+ probabilistic fraud indicators with Bayesian probabilities
Industry-Specific Transactions: Manufacturing (BOM, WIP), Retail (POS, shrinkage), Healthcare (ICD-10, claims)
Industry-Specific Anomalies: Authentic fraud patterns per industry (upcoding, sweethearting, yield manipulation)
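
The Red Flag Generation entry above refers to Bayesian probabilities. One way to read this is that each indicator carries a pair of conditional rates, P(flag | fraud) and P(flag | legitimate), from which a posterior fraud probability follows by Bayes' rule. The Python sketch below shows that arithmetic only; the function name and the numbers are hypothetical and are not taken from the generator.

```python
def posterior_fraud_probability(prior, p_flag_given_fraud, p_flag_given_legit):
    """Bayes' rule: P(fraud | flag) from a prior and two conditional flag rates."""
    evidence = p_flag_given_fraud * prior + p_flag_given_legit * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("the flag can never fire under these probabilities")
    return p_flag_given_fraud * prior / evidence


if __name__ == "__main__":
    # Hypothetical indicator: fires on 60% of fraudulent and 10% of legitimate entries.
    p = posterior_fraud_probability(prior=0.05, p_flag_given_fraud=0.6, p_flag_given_legit=0.1)
    # A second flag can be chained by using the first posterior as the new prior,
    # which assumes the two flags are conditionally independent.
    p2 = posterior_fraud_probability(prior=p, p_flag_given_fraud=0.4, p_flag_given_legit=0.05)
    print(f"after one flag: {p:.3f}, after two flags: {p2:.3f}")
```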

Machine Learning & Analytics

Graph Export: PyTorch Geometric, Neo4j, DGL, and RustGraph formats with train/val/test splits
Anomaly Injection: 60+ fraud types, errors, process issues with full labeling
Data Quality Variations: Missing values (MCAR, MAR, MNAR), format variations, duplicates, typos
Evaluation Framework: Auto-tuning with configuration recommendations based on metric gaps
ACFE Benchmarks: ML benchmarks calibrated to ACFE fraud statistics
Industry Benchmarks: Pre-configured benchmarks for fraud detection by industry
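
The three missing-value mechanisms named under Data Quality Variations differ in what the probability of a value being missing depends on. The Python sketch below illustrates the distinction with three small helpers; the function names, column names, and rates are hypothetical and do not reflect the generator's configuration keys.

```python
import random


def inject_mcar(values, rate, seed=0):
    """MCAR: every value is dropped with the same probability, independent of the data."""
    rng = random.Random(seed)
    return [None if rng.random() < rate else v for v in values]


def inject_mar(rows, target, driver, driver_value, rate, seed=0):
    """MAR: `target` is dropped with probability `rate` only where the observed
    `driver` column equals `driver_value`."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row[driver] == driver_value and rng.random() < rate:
            row[target] = None
        out.append(row)
    return out


def inject_mnar(values, threshold, rate, seed=0):
    """MNAR: a value is dropped with probability `rate` only when the value itself
    exceeds `threshold`, so missingness depends on the unobserved value."""
    rng = random.Random(seed)
    return [None if v > threshold and rng.random() < rate else v for v in values]


if __name__ == "__main__":
    amounts = [120.0, 980.0, 15000.0, 42000.0, 310.0]
    print(inject_mcar(amounts, rate=0.2))
    print(inject_mnar(amounts, threshold=10000.0, rate=0.8))
```

Models trained on MCAR data alone tend to look better than they are, which is one reason to test against MAR and MNAR variants as well.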

AI & ML-Powered Generation

LLM-Augmented Generation: Use LLMs to generate realistic vendor names, transaction descriptions, memo fields, and anomaly explanations via pluggable provider abstraction (Mock, OpenAI, Anthropic, Custom)
Natural Language Configuration: Generate YAML configs from natural language descriptions (init --from-description "Generate 1 year of retail data for a mid-size US company")
Diffusion Model Backend: Statistical diffusion with configurable noise schedules (linear, cosine, sigmoid) for learned distribution capture
Hybrid Generation: Blend rule-based and diffusion outputs using interpolation, selection, or ensemble strategies
Causal Generation: Define Structural Causal Models (SCMs) with interventional (“what-if”) and counterfactual generation
Built-in Causal Templates: Pre-configured fraud_detection and revenue_cycle causal graphs
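
For readers unfamiliar with noise schedules, the Python sketch below computes a linear and a cosine schedule in the form commonly used in the diffusion-model literature. Whether the Diffusion Model Backend uses these exact formulas is an assumption; the default values (`beta_start`, `beta_end`, `s`) are the conventional ones from that literature, not values read from SyntheticData, and the sigmoid schedule is omitted.

```python
import math


def linear_betas(steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances that rise linearly from beta_start to beta_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]


def cosine_betas(steps, s=0.008, max_beta=0.999):
    """Per-step noise variances derived from a squared-cosine signal curve."""
    def alpha_bar(t):
        # Share of signal variance remaining at step t, before normalization.
        return math.cos((t / steps + s) / (1 + s) * math.pi / 2) ** 2

    # Clip at max_beta because the final ratio approaches zero.
    return [min(1 - alpha_bar(t + 1) / alpha_bar(t), max_beta) for t in range(steps)]


def signal_remaining(betas):
    """Cumulative product of (1 - beta): signal variance left after each step."""
    out, acc = [], 1.0
    for b in betas:
        acc *= 1.0 - b
        out.append(acc)
    return out


if __name__ == "__main__":
    for name, betas in (("linear", linear_betas(100)), ("cosine", cosine_betas(100))):
        mid = signal_remaining(betas)[49]
        print(f"{name}: signal remaining at step 50 = {mid:.3f}")
```

The schedule controls how quickly the original distribution is destroyed during the forward process, which in turn affects how much structure the denoising steps have to recover.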

Privacy-Preserving Fingerprinting

Fingerprint Extraction: Extract statistical properties from real data into .dsf files
Differential Privacy: Laplace and Gaussian mechanisms with configurable epsilon budget
K-Anonymity: Suppression of rare categorical values below configurable threshold
Privacy Audit Trail: Complete logging of all privacy decisions and epsilon spent
Fidelity Evaluation: Validate synthetic data matches original fingerprint (KS, Wasserstein, Benford MAD)
Gaussian Copula: Preserve multivariate correlations during synthesis
Federated Fingerprinting: Extract fingerprints from distributed data sources without centralization using secure aggregation (weighted average, median, trimmed mean)
Synthetic Data Certificates: Cryptographic proof of DP guarantees with HMAC-SHA256 signing, embeddable in Parquet metadata and JSON output
Privacy-Utility Pareto Frontier: Automated exploration of optimal epsilon values for given utility targets
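
As background for the Differential Privacy entry, the Python sketch below shows the textbook Laplace mechanism: noise with scale sensitivity/epsilon is added to a statistic before it is released. This is a generic illustration, not the fingerprint extractor's code. The function name is hypothetical, and Python's `random` module is not a cryptographically secure noise source, so this is unsuitable for production use.

```python
import random


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise with scale = sensitivity / epsilon."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


if __name__ == "__main__":
    rng = random.Random(1)
    # A counting query has sensitivity 1: one record changes the count by at most 1.
    for eps in (0.1, 1.0, 10.0):
        noisy = laplace_mechanism(true_value=1250, sensitivity=1.0, epsilon=eps, rng=rng)
        print(f"epsilon={eps}: noisy count = {noisy:.1f}")
```

Smaller epsilon values give stronger privacy and noisier statistics, which is the trade-off the Privacy-Utility Pareto Frontier entry refers to.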

Production Features

REST & gRPC APIs: Streaming generation with authentication and rate limiting
Desktop UI: Cross-platform Tauri/SvelteKit application with 15+ configuration pages
Resource Guards: Memory, disk, and CPU monitoring with graceful degradation
Graceful Degradation: Progressive feature reduction under resource pressure (Normal→Reduced→Minimal→Emergency)
Deterministic Generation: Seeded RNG (ChaCha8) for reproducible output
Python Wrapper: Programmatic access with blueprints and config validation
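
The four degradation levels listed above can be pictured as a simple mapping from resource pressure to a level. The Python sketch below uses the level names from this page, but the threshold values and the choice of memory as the only input are hypothetical; the actual resource guards also monitor disk and CPU and may use different cut-offs.

```python
from enum import Enum


class DegradationLevel(Enum):
    NORMAL = "Normal"
    REDUCED = "Reduced"
    MINIMAL = "Minimal"
    EMERGENCY = "Emergency"


# Hypothetical upper bounds on the fraction of the memory budget in use.
THRESHOLDS = [
    (0.70, DegradationLevel.NORMAL),
    (0.85, DegradationLevel.REDUCED),
    (0.95, DegradationLevel.MINIMAL),
]


def level_for(memory_fraction):
    """Map the used fraction of the memory budget to a degradation level."""
    if memory_fraction < 0.0:
        raise ValueError("memory_fraction must be non-negative")
    for upper, level in THRESHOLDS:
        if memory_fraction < upper:
            return level
    return DegradationLevel.EMERGENCY


if __name__ == "__main__":
    for used in (0.40, 0.80, 0.90, 0.99):
        print(f"{used:.0%} used -> {level_for(used).value}")
```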

Performance

Metric	Performance
Single-threaded throughput	100,000+ entries/second
Parallel scaling	Linear with available cores
Memory efficiency	Streaming generation for large volumes

Use Cases

Use Case	Description
Fraud Detection ML	Train supervised models with labeled fraud patterns
Graph Neural Networks	Entity relationship graphs for anomaly detection
AML/KYC Testing	Banking transaction data with structuring, layering, mule patterns
Audit Analytics	Test audit procedures with known control exceptions
Process Mining	OCEL 2.0 event logs for process discovery
ERP Testing	Load testing with realistic transaction volumes
SOX Compliance	Test internal control monitoring systems
Data Quality ML	Train models to detect missing values, typos, duplicates
Causal Analysis	“What-if” scenarios and counterfactual generation for audit
LLM Training Data	Generate LLM-enriched training datasets with realistic metadata
Pipeline Orchestration	Integrate with Airflow, dbt, MLflow, and Spark pipelines

Quick Start

# Install from source
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release

# Run demo mode
./target/release/datasynth-data generate --demo --output ./output

# Or create a custom configuration
./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml
./target/release/datasynth-data generate --config config.yaml --output ./output

Fingerprinting (New in v0.2.0)

# Extract fingerprint from real data with privacy protection
./target/release/datasynth-data fingerprint extract \
    --input ./real_data.csv \
    --output ./fingerprint.dsf \
    --privacy-level standard

# Validate fingerprint integrity
./target/release/datasynth-data fingerprint validate ./fingerprint.dsf

# View fingerprint details
./target/release/datasynth-data fingerprint info ./fingerprint.dsf --detailed

# Evaluate synthetic data fidelity
./target/release/datasynth-data fingerprint evaluate \
    --fingerprint ./fingerprint.dsf \
    --synthetic ./synthetic_data/ \
    --threshold 0.8
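
Fidelity evaluation relies on distance metrics such as the Kolmogorov-Smirnov (KS) statistic. As an illustration of what that metric measures, here is a minimal pure-Python two-sample KS statistic. It is a sketch for intuition only: the function name is hypothetical, and how the CLI combines KS, Wasserstein, and Benford MAD into the score compared against --threshold is not specified here.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both cursors past every value <= x, so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


if __name__ == "__main__":
    real = [102.5, 98.0, 250.0, 75.3, 1200.0, 310.0]
    synthetic = [110.0, 95.0, 240.0, 80.0, 1100.0, 330.0]
    print(f"KS statistic: {ks_statistic(real, synthetic):.3f}")
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), so lower values indicate a closer match.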

LLM-Augmented Generation (New in v0.5.0)

# Generate config from natural language description
./target/release/datasynth-data init \
    --from-description "Generate 1 year of retail data for a mid-size US company with fraud patterns" \
    -o config.yaml

# Generate with LLM enrichment (uses mock provider by default)
./target/release/datasynth-data generate --config config.yaml --output ./output

Causal Generation (New in v0.5.0)

# Generate data with causal structure (fraud_detection template)
./target/release/datasynth-data causal generate \
    --template fraud_detection \
    --samples 10000 \
    --output ./causal_output

# Run intervention ("what-if" scenario)
./target/release/datasynth-data causal intervene \
    --template fraud_detection \
    --variable transaction_amount \
    --value 50000 \
    --samples 5000 \
    --output ./intervention_output
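
Conceptually, an intervention replaces the structural equation of one variable with a fixed value and then resamples everything downstream of it. The Python sketch below shows this on a toy three-variable model. The graph, the variable names other than transaction_amount, and all coefficients are invented for illustration and are not the contents of the fraud_detection template.

```python
import random


def sample_scm(n, do=None, seed=7):
    """Sample n rows from a toy SCM; `do` maps variable names to forced values."""
    do = do or {}
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Exogenous noise is drawn for every variable, intervened or not, so a
        # baseline run and an intervened run with the same seed share their noise.
        u_amount = rng.lognormvariate(8.0, 1.0)
        u_review = rng.random()
        u_fraud = rng.random()

        # Structural equations in causal order. do(X = x) replaces the equation
        # for X with the constant x and leaves the other equations untouched.
        amount = do.get("transaction_amount", u_amount)
        if "manual_review" in do:
            manual_review = do["manual_review"]
        else:
            manual_review = amount > 10_000 and u_review < 0.9
        p_fraud = 0.02 + (0.05 if amount > 10_000 else 0.0) - (0.015 if manual_review else 0.0)
        rows.append({
            "transaction_amount": amount,
            "manual_review": manual_review,
            "is_fraud": u_fraud < p_fraud,
        })
    return rows


if __name__ == "__main__":
    baseline = sample_scm(5000)
    intervened = sample_scm(5000, do={"transaction_amount": 50_000})
    for name, rows in (("baseline", baseline), ("do(amount=50000)", intervened)):
        rate = sum(r["is_fraud"] for r in rows) / len(rows)
        print(f"{name}: fraud rate = {rate:.3f}")
```

Comparing the two runs estimates the causal effect of the forced value, which differs from simply filtering the baseline for large amounts.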

Diffusion Model Training (New in v0.5.0)

# Train a diffusion model on fingerprint data
./target/release/datasynth-data diffusion train \
    --fingerprint ./fingerprint.dsf \
    --output ./model.json

# Evaluate diffusion model fit
./target/release/datasynth-data diffusion evaluate \
    --model ./model.json \
    --samples 5000

Python Wrapper

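from datasynth_py import DataSynth
from datasynth_py.config import blueprints

# Start from a built-in blueprint: a small retail setup with 4 companies and 10,000 transactions
config = blueprints.retail_small(companies=4, transactions=10000)
synth = DataSynth()
# Generate CSV output into a temporary directory, then print where it was written
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
print(result.output_dir)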

Architecture

SyntheticData is organized as a Rust workspace with 15 modular crates:

datasynth-cli          Command-line interface (binary: datasynth-data)
datasynth-server       REST/gRPC/WebSocket server with auth and rate limiting
datasynth-ui           Tauri/SvelteKit desktop application
    │
datasynth-runtime      Orchestration layer (parallel execution, resource guards)
    │
datasynth-generators   Data generators (JE, documents, subledgers, anomalies, audit)
datasynth-banking      KYC/AML banking transaction generator
datasynth-ocpm         Object-Centric Process Mining (OCEL 2.0)
datasynth-fingerprint  Privacy-preserving fingerprint extraction and synthesis
datasynth-standards    Accounting/audit standards (US GAAP, IFRS, ISA, SOX)
    │
datasynth-graph        Graph/network export (PyTorch Geometric, Neo4j, DGL)
datasynth-eval         Evaluation framework with auto-tuning
    │
datasynth-config       Configuration schema, validation, industry presets
    │
datasynth-core         Domain models, traits, distributions, resource guards
    │
datasynth-output       Output sinks (CSV, JSON, Parquet, streaming)
datasynth-test-utils   Test utilities, fixtures, mocks

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Support

Commercial support, custom development, and enterprise licensing are available upon request. Please contact the author at michael.ivertowski@ch.ey.com for inquiries.

SyntheticData is provided “as is” without warranty of any kind. It is intended for testing, development, and educational purposes. Generated data should not be used as a substitute for real financial records.
