SyntheticData
High-Performance Synthetic Enterprise Financial Data Generator
Developed by Ernst & Young Ltd., Zurich, Switzerland
What is SyntheticData?
SyntheticData is a configurable synthetic data generator that produces realistic, interconnected enterprise financial data. It generates General Ledger journal entries, Chart of Accounts, SAP HANA-compatible ACDOCA event logs, document flows, subledger records, banking/KYC/AML transactions, OCEL 2.0 process mining data, audit workpapers, and ML-ready graph exports at scale.
The generator produces statistically accurate data grounded in empirical research on real-world general ledger patterns, ensuring that synthetic datasets exhibit the same characteristics as production data, including Benford’s Law compliance, temporal patterns, and document flow integrity.
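For example, Benford’s Law predicts a first-digit frequency of log10(1 + 1/d) for digit d. A quick way to sanity-check generated amounts against it is sketched below; the file path and column name are assumptions based on the output layout described later, so adjust them to your output.
# Sketch: compare first-digit frequencies of generated amounts with Benford's Law.
# Path and column name are assumptions; adjust them to your output layout.
import csv, math
from collections import Counter

def first_digit(value: str):
    for ch in value:
        if ch in "123456789":
            return ch
    return None

counts = Counter()
with open("output/transactions/journal_entries.csv", newline="") as f:
    for row in csv.DictReader(f):
        digit = first_digit(row.get("debit_amount", ""))
        if digit:
            counts[digit] += 1

total = sum(counts.values()) or 1
for d in "123456789":
    print(f"digit {d}: observed {counts[d] / total:.3f}, Benford {math.log10(1 + 1 / int(d)):.3f}")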
New in v0.5.0: LLM-augmented generation (vendor names, descriptions, anomaly explanations), diffusion model backend (statistical denoising, hybrid generation), causal & counterfactual generation (SCMs, do-calculus interventions), federated fingerprinting, synthetic data certificates, and ecosystem integrations (Airflow, dbt, MLflow, Spark).
v0.3.0: ACFE-aligned fraud taxonomy, collusion modeling, industry-specific transactions (Manufacturing, Retail, Healthcare), and ML benchmarks.
v0.2.x: Privacy-preserving fingerprinting, accounting/audit standards (US GAAP, IFRS, ISA, SOX), streaming output API.
Quick Links
| Section | Description |
|---|---|
| Getting Started | Installation, quick start guide, and demo mode |
| User Guide | CLI reference, server API, desktop UI, Python wrapper |
| Configuration | Complete YAML schema and presets |
| Architecture | System design, data flow, resource management |
| Crate Reference | Detailed crate documentation (15 crates) |
| Advanced Topics | Anomaly injection, graph export, fingerprinting, performance |
| Use Cases | Fraud detection, audit, AML/KYC, compliance |
Key Features
Core Data Generation
| Feature | Description |
|---|---|
| Statistical Distributions | Line item counts, amounts, and patterns based on empirical GL research |
| Benford’s Law Compliance | First-digit distribution following Benford’s Law with configurable fraud patterns |
| Industry Presets | Manufacturing, Retail, Financial Services, Healthcare, Technology |
| Chart of Accounts | Small (~100), Medium (~400), Large (~2500) account structures |
| Temporal Patterns | Month-end, quarter-end, year-end volume spikes with working hour modeling |
| Regional Calendars | Holiday calendars for US, DE, GB, CN, JP, IN with lunar calendar support |
Enterprise Simulation
- Master Data Management: Vendors, customers, materials, fixed assets, employees with temporal validity
- Document Flow Engine: Complete P2P (Procure-to-Pay) and O2C (Order-to-Cash) processes with three-way matching
- Intercompany Transactions: IC matching, transfer pricing, consolidation eliminations
- Balance Coherence: Opening balances, running balance tracking, trial balance generation
- Subledger Simulation: AR, AP, Fixed Assets, Inventory with GL reconciliation
- Currency & FX: Realistic exchange rates (Ornstein-Uhlenbeck process; see the sketch after this list), currency translation, CTA generation
- Period Close Engine: Monthly close, depreciation runs, accruals, year-end closing
- Banking/KYC/AML: Customer personas, KYC profiles, AML typologies (structuring, funnel, mule, layering, round-tripping)
- Process Mining: OCEL 2.0 event logs with object-centric relationships
- Audit Simulation: ISA-compliant engagements, workpapers, findings, risk assessments, professional judgments
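As a rough illustration of the Ornstein-Uhlenbeck idea behind the FX rates, the sketch below simulates one mean-reverting rate path. The parameters are illustrative only, not SyntheticData's internal defaults.
# Illustrative Ornstein-Uhlenbeck (mean-reverting) FX rate path.
# Parameters are made up for demonstration, not the generator's defaults.
import random

def ou_path(x0=1.10, mean=1.10, reversion=0.05, volatility=0.004, steps=250, seed=42):
    rng = random.Random(seed)
    rate, path = x0, [x0]
    for _ in range(steps):
        # dX = reversion * (mean - X) * dt + volatility * dW, with dt = one step
        rate += reversion * (mean - rate) + volatility * rng.gauss(0.0, 1.0)
        path.append(rate)
    return path

rates = ou_path()
print(f"min {min(rates):.4f}, max {max(rates):.4f}, final {rates[-1]:.4f}")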
Fraud Patterns & Industry-Specific Features
- ACFE-Aligned Fraud Taxonomy: Asset Misappropriation, Corruption, Financial Statement Fraud calibrated to ACFE statistics
- Collusion & Conspiracy Modeling: Multi-party fraud networks with 9 ring types and role-based conspirators
- Management Override: Senior-level fraud with fraud triangle modeling (Pressure, Opportunity, Rationalization)
- Red Flag Generation: 40+ probabilistic fraud indicators with Bayesian probabilities
- Industry-Specific Transactions: Manufacturing (BOM, WIP), Retail (POS, shrinkage), Healthcare (ICD-10, claims)
- Industry-Specific Anomalies: Authentic fraud patterns per industry (upcoding, sweethearting, yield manipulation)
Machine Learning & Analytics
- Graph Export: PyTorch Geometric, Neo4j, DGL, and RustGraph formats with train/val/test splits
- Anomaly Injection: 60+ fraud types, errors, process issues with full labeling
- Data Quality Variations: Missing values (MCAR, MAR, MNAR), format variations, duplicates, typos
- Evaluation Framework: Auto-tuning with configuration recommendations based on metric gaps
- ACFE Benchmarks: ML benchmarks calibrated to ACFE fraud statistics
- Industry Benchmarks: Pre-configured benchmarks for fraud detection by industry
AI & ML-Powered Generation
- LLM-Augmented Generation: Use LLMs to generate realistic vendor names, transaction descriptions, memo fields, and anomaly explanations via pluggable provider abstraction (Mock, OpenAI, Anthropic, Custom)
- Natural Language Configuration: Generate YAML configs from natural language descriptions (init --from-description "Generate 1 year of retail data for a mid-size US company")
- Diffusion Model Backend: Statistical diffusion with configurable noise schedules (linear, cosine, sigmoid) for learned distribution capture
- Hybrid Generation: Blend rule-based and diffusion outputs using interpolation, selection, or ensemble strategies
- Causal Generation: Define Structural Causal Models (SCMs) with interventional (“what-if”) and counterfactual generation
- Built-in Causal Templates: Pre-configured fraud_detection and revenue_cycle causal graphs
Privacy-Preserving Fingerprinting
- Fingerprint Extraction: Extract statistical properties from real data into .dsf files
- Differential Privacy: Laplace and Gaussian mechanisms with configurable epsilon budget
- K-Anonymity: Suppression of rare categorical values below configurable threshold
- Privacy Audit Trail: Complete logging of all privacy decisions and epsilon spent
- Fidelity Evaluation: Validate synthetic data matches original fingerprint (KS, Wasserstein, Benford MAD)
- Gaussian Copula: Preserve multivariate correlations during synthesis
- Federated Fingerprinting: Extract fingerprints from distributed data sources without centralization using secure aggregation (weighted average, median, trimmed mean)
- Synthetic Data Certificates: Cryptographic proof of DP guarantees with HMAC-SHA256 signing, embeddable in Parquet metadata and JSON output
- Privacy-Utility Pareto Frontier: Automated exploration of optimal epsilon values for given utility targets
Production Features
- REST & gRPC APIs: Streaming generation with authentication and rate limiting
- Desktop UI: Cross-platform Tauri/SvelteKit application with 15+ configuration pages
- Resource Guards: Memory, disk, and CPU monitoring with graceful degradation
- Graceful Degradation: Progressive feature reduction under resource pressure (Normal→Reduced→Minimal→Emergency)
- Deterministic Generation: Seeded RNG (ChaCha8) for reproducible output
- Python Wrapper: Programmatic access with blueprints and config validation
Performance
| Metric | Performance |
|---|---|
| Single-threaded throughput | 100,000+ entries/second |
| Parallel scaling | Linear with available cores |
| Memory efficiency | Streaming generation for large volumes |
Use Cases
| Use Case | Description |
|---|---|
| Fraud Detection ML | Train supervised models with labeled fraud patterns |
| Graph Neural Networks | Entity relationship graphs for anomaly detection |
| AML/KYC Testing | Banking transaction data with structuring, layering, mule patterns |
| Audit Analytics | Test audit procedures with known control exceptions |
| Process Mining | OCEL 2.0 event logs for process discovery |
| ERP Testing | Load testing with realistic transaction volumes |
| SOX Compliance | Test internal control monitoring systems |
| Data Quality ML | Train models to detect missing values, typos, duplicates |
| Causal Analysis | “What-if” scenarios and counterfactual generation for audit |
| LLM Training Data | Generate LLM-enriched training datasets with realistic metadata |
| Pipeline Orchestration | Integrate with Airflow, dbt, MLflow, and Spark pipelines |
Quick Start
# Install from source
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release
# Run demo mode
./target/release/datasynth-data generate --demo --output ./output
# Or create a custom configuration
./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml
./target/release/datasynth-data generate --config config.yaml --output ./output
Fingerprinting (New in v0.2.0)
# Extract fingerprint from real data with privacy protection
./target/release/datasynth-data fingerprint extract \
--input ./real_data.csv \
--output ./fingerprint.dsf \
--privacy-level standard
# Validate fingerprint integrity
./target/release/datasynth-data fingerprint validate ./fingerprint.dsf
# View fingerprint details
./target/release/datasynth-data fingerprint info ./fingerprint.dsf --detailed
# Evaluate synthetic data fidelity
./target/release/datasynth-data fingerprint evaluate \
--fingerprint ./fingerprint.dsf \
--synthetic ./synthetic_data/ \
--threshold 0.8
LLM-Augmented Generation (New in v0.5.0)
# Generate config from natural language description
./target/release/datasynth-data init \
--from-description "Generate 1 year of retail data for a mid-size US company with fraud patterns" \
-o config.yaml
# Generate with LLM enrichment (uses mock provider by default)
./target/release/datasynth-data generate --config config.yaml --output ./output
Causal Generation (New in v0.5.0)
# Generate data with causal structure (fraud_detection template)
./target/release/datasynth-data causal generate \
--template fraud_detection \
--samples 10000 \
--output ./causal_output
# Run intervention ("what-if" scenario)
./target/release/datasynth-data causal intervene \
--template fraud_detection \
--variable transaction_amount \
--value 50000 \
--samples 5000 \
--output ./intervention_output
Diffusion Model Training (New in v0.5.0)
# Train a diffusion model on fingerprint data
./target/release/datasynth-data diffusion train \
--fingerprint ./fingerprint.dsf \
--output ./model.json
# Evaluate diffusion model fit
./target/release/datasynth-data diffusion evaluate \
--model ./model.json \
--samples 5000
Python Wrapper
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
config = blueprints.retail_small(companies=4, transactions=10000)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
print(result.output_dir)
Architecture
SyntheticData is organized as a Rust workspace with 15 modular crates:
datasynth-cli Command-line interface (binary: datasynth-data)
datasynth-server REST/gRPC/WebSocket server with auth and rate limiting
datasynth-ui Tauri/SvelteKit desktop application
│
datasynth-runtime Orchestration layer (parallel execution, resource guards)
│
datasynth-generators Data generators (JE, documents, subledgers, anomalies, audit)
datasynth-banking KYC/AML banking transaction generator
datasynth-ocpm Object-Centric Process Mining (OCEL 2.0)
datasynth-fingerprint Privacy-preserving fingerprint extraction and synthesis
datasynth-standards Accounting/audit standards (US GAAP, IFRS, ISA, SOX)
│
datasynth-graph Graph/network export (PyTorch Geometric, Neo4j, DGL)
datasynth-eval Evaluation framework with auto-tuning
│
datasynth-config Configuration schema, validation, industry presets
│
datasynth-core Domain models, traits, distributions, resource guards
│
datasynth-output Output sinks (CSV, JSON, Parquet, streaming)
datasynth-test-utils Test utilities, fixtures, mocks
License
Copyright 2024-2026 Michael Ivertowski, Ernst & Young Ltd., Zurich, Switzerland
Licensed under the Apache License, Version 2.0. See LICENSE for details.
Support
Commercial support, custom development, and enterprise licensing are available upon request. Please contact the author at michael.ivertowski@ch.ey.com for inquiries.
SyntheticData is provided “as is” without warranty of any kind. It is intended for testing, development, and educational purposes. Generated data should not be used as a substitute for real financial records.
Getting Started
Welcome to SyntheticData! This section will help you get up and running quickly.
What You’ll Learn
- Installation: Set up SyntheticData on your system
- Quick Start: Generate your first synthetic dataset
- Demo Mode: Explore SyntheticData with built-in demo presets
Prerequisites
Before you begin, ensure you have:
- Rust 1.88+: SyntheticData is written in Rust and requires the Rust toolchain
- Git: For cloning the repository
- (Optional) Node.js 18+: Required only for the desktop UI
Installation Overview
# Clone and build
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release
# The binary is at target/release/datasynth-data
First Steps
The fastest way to explore SyntheticData is through demo mode:
datasynth-data generate --demo --output ./demo-output
This generates a complete set of synthetic financial data using sensible defaults.
Architecture at a Glance
SyntheticData generates interconnected financial data:
┌─────────────────────────────────────────────────────────────┐
│ Configuration (YAML) │
├─────────────────────────────────────────────────────────────┤
│ Generation Pipeline │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Master │→│ Document │→│ Journal │→│ Output │ │
│ │ Data │ │ Flows │ │ Entries │ │ Files │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Output: CSV, JSON, Neo4j, PyTorch Geometric, ACDOCA │
└─────────────────────────────────────────────────────────────┘
Next Steps
- Follow the Installation Guide to set up your environment
- Work through the Quick Start Tutorial
- Explore Demo Mode for a hands-on introduction
- Review the CLI Reference for all available commands
Getting Help
- Check the User Guide for detailed usage instructions
- Review Configuration for all available options
- See Use Cases for real-world examples
Installation
This guide covers installing SyntheticData from source.
Prerequisites
Required
| Requirement | Version | Purpose |
|---|---|---|
| Rust | 1.88+ | Compilation |
| Git | Any recent | Clone repository |
| C compiler | gcc/clang | Native dependencies |
Optional
| Requirement | Version | Purpose |
|---|---|---|
| Node.js | 18+ | Desktop UI |
| npm | 9+ | Desktop UI dependencies |
Installing Rust
If you don’t have Rust installed, use rustup:
# Linux/macOS
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Windows
# Download and run rustup-init.exe from https://rustup.rs
# Verify installation
rustc --version
cargo --version
Building from Source
Clone the Repository
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
Build Release Binary
# Build optimized release binary
cargo build --release
# The binary is at target/release/datasynth-data
Verify Installation
# Check version
./target/release/datasynth-data --version
# View help
./target/release/datasynth-data --help
# Run quick validation
./target/release/datasynth-data info
Adding to PATH
To run datasynth-data from anywhere:
Linux/macOS
# Option 1: Symlink to local bin
ln -s $(pwd)/target/release/datasynth-data ~/.local/bin/datasynth-data
# Option 2: Copy to system bin (requires sudo)
sudo cp target/release/datasynth-data /usr/local/bin/
# Option 3: Add target/release to PATH in ~/.bashrc or ~/.zshrc
export PATH="$PATH:/path/to/SyntheticData/target/release"
Windows
Add the target/release directory to your system PATH environment variable.
Building the Desktop UI
The desktop UI requires additional setup:
# Navigate to UI crate
cd crates/datasynth-ui
# Install Node.js dependencies
npm install
# Run in development mode
npm run tauri dev
# Build production release
npm run tauri build
Platform-Specific Dependencies
Linux (Ubuntu/Debian):
sudo apt-get install libwebkit2gtk-4.1-dev \
libgtk-3-dev \
libayatana-appindicator3-dev \
librsvg2-dev
macOS: No additional dependencies required.
Windows: Install WebView2 runtime (usually pre-installed on Windows 10/11).
Running Tests
Verify your installation by running the test suite:
# Run all tests
cargo test
# Run tests for a specific crate
cargo test -p datasynth-core
cargo test -p datasynth-generators
# Run with output
cargo test -- --nocapture
Development Setup
For development work:
# Check code without building
cargo check
# Format code
cargo fmt
# Run lints
cargo clippy
# Build documentation
cargo doc --workspace --no-deps --open
Running Benchmarks
# Run all benchmarks
cargo bench
# Run specific benchmark
cargo bench --bench generation_throughput
Troubleshooting
Build Failures
Missing C compiler:
# Ubuntu/Debian
sudo apt-get install build-essential
# macOS
xcode-select --install
# Fedora/RHEL
sudo dnf install gcc
Out of memory during build:
# Limit parallel jobs
cargo build --release -j 2
Runtime Issues
Permission denied:
chmod +x target/release/datasynth-data
Library not found (Linux):
# Check for missing dependencies
ldd target/release/datasynth-data
Next Steps
- Follow the Quick Start Guide to generate your first dataset
- Explore Demo Mode for a hands-on introduction
- Review the CLI Reference for all commands
Quick Start
This guide walks you through generating your first synthetic financial dataset.
Overview
The typical workflow is:
- Initialize a configuration file
- Validate the configuration
- Generate synthetic data
- Review the output
Step 1: Initialize Configuration
Create a configuration file for your industry and complexity needs:
# Manufacturing company with medium complexity (~400 accounts)
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
Available Industry Presets
| Industry | Description |
|---|---|
| manufacturing | Production, inventory, cost accounting |
| retail | Sales, inventory, customer transactions |
| financial_services | Banking, investments, regulatory reporting |
| healthcare | Patient revenue, medical supplies, compliance |
| technology | R&D, SaaS revenue, deferred revenue |
Complexity Levels
| Level | Accounts | Description |
|---|---|---|
| small | ~100 | Simple chart of accounts |
| medium | ~400 | Typical mid-size company |
| large | ~2500 | Enterprise-scale structure |
Step 2: Review Configuration
Open config.yaml to review and customize:
global:
seed: 42 # For reproducible generation
industry: manufacturing
start_date: 2024-01-01
period_months: 12
group_currency: USD
companies:
- code: "1000"
name: "Headquarters"
currency: USD
country: US
volume_weight: 1.0
transactions:
target_count: 100000 # Number of journal entries
fraud:
enabled: true
fraud_rate: 0.005 # 0.5% fraud transactions
output:
format: csv
compression: none
See the Configuration Guide for all options.
Step 3: Validate Configuration
Check your configuration for errors:
datasynth-data validate --config config.yaml
The validator checks:
- Required fields are present
- Values are within valid ranges
- Distribution weights sum to 1.0
- Dates are consistent
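As an illustration of the weight-sum rule, the snippet below checks that company volume_weight values add up to 1.0 within a small tolerance. Treating volume_weight as one of the validated weights, and the 0.01 tolerance, are assumptions based on the CLI reference; it is a sketch, not the validator's implementation.
# Sketch of a weight-sum check similar to what the validator performs.
# Assumes volume_weight is one of the validated distribution weights.
import yaml  # pip install pyyaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

weights = [company["volume_weight"] for company in config.get("companies", [])]
assert abs(sum(weights) - 1.0) <= 0.01, f"volume weights sum to {sum(weights)}, expected 1.0"
print("company volume_weight sum OK")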
Step 4: Generate Data
Run the generation:
datasynth-data generate --config config.yaml --output ./output
You’ll see a progress bar:
Generating synthetic data...
[████████████████████████████████] 100000/100000 entries
Generation complete in 1.2s
Step 5: Explore Output
The output directory contains organized subdirectories:
output/
├── master_data/
│ ├── vendors.csv
│ ├── customers.csv
│ ├── materials.csv
│ └── employees.csv
├── transactions/
│ ├── journal_entries.csv
│ ├── acdoca.csv
│ ├── purchase_orders.csv
│ └── vendor_invoices.csv
├── subledgers/
│ ├── ar_open_items.csv
│ └── ap_open_items.csv
├── period_close/
│ └── trial_balances/
├── labels/
│ ├── anomaly_labels.csv
│ └── fraud_labels.csv
└── controls/
└── internal_controls.csv
Common Customizations
Generate More Data
transactions:
target_count: 1000000 # 1 million entries
Enable Graph Export
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
Add Anomaly Injection
anomaly_injection:
enabled: true
total_rate: 0.02 # 2% anomaly rate
generate_labels: true # For ML training
Multiple Companies
companies:
- code: "1000"
name: "Headquarters"
currency: USD
volume_weight: 0.6
- code: "2000"
name: "European Subsidiary"
currency: EUR
volume_weight: 0.4
Next Steps
- Explore Demo Mode for built-in presets
- Learn the CLI Reference
- Review Output Formats
- See Configuration for all options
Quick Reference
# Common commands
datasynth-data init --industry <industry> --complexity <level> -o config.yaml
datasynth-data validate --config config.yaml
datasynth-data generate --config config.yaml --output ./output
datasynth-data generate --demo --output ./demo-output
datasynth-data info # Show available presets
Demo Mode
Demo mode provides a quick way to explore SyntheticData without creating a configuration file. It uses sensible defaults to generate a complete synthetic dataset.
Running Demo Mode
datasynth-data generate --demo --output ./demo-output
What Demo Mode Generates
Demo mode creates a comprehensive dataset with:
| Category | Contents |
|---|---|
| Master Data | Vendors, customers, materials, employees |
| Transactions | ~10,000 journal entries |
| Document Flows | P2P and O2C process documents |
| Subledgers | AR and AP records |
| Period Close | Trial balances |
| Controls | Internal control mappings |
Demo Configuration
Demo mode uses these defaults:
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 3
group_currency: USD
companies:
- code: "1000"
name: "Demo Company"
currency: USD
country: US
chart_of_accounts:
complexity: medium # ~400 accounts
transactions:
target_count: 10000
fraud:
enabled: true
fraud_rate: 0.005
anomaly_injection:
enabled: true
total_rate: 0.01
generate_labels: true
output:
format: csv
Output Structure
After running demo mode, explore the output:
tree demo-output/
demo-output/
├── master_data/
│ ├── chart_of_accounts.csv # GL accounts
│ ├── vendors.csv # Vendor master
│ ├── customers.csv # Customer master
│ ├── materials.csv # Material/product master
│ └── employees.csv # Employee/user master
├── transactions/
│ ├── journal_entries.csv # Main JE file
│ ├── acdoca.csv # SAP HANA format
│ ├── purchase_orders.csv # P2P documents
│ ├── goods_receipts.csv
│ ├── vendor_invoices.csv
│ ├── payments.csv
│ ├── sales_orders.csv # O2C documents
│ ├── deliveries.csv
│ ├── customer_invoices.csv
│ └── customer_receipts.csv
├── subledgers/
│ ├── ar_open_items.csv
│ ├── ap_open_items.csv
│ └── inventory_positions.csv
├── period_close/
│ └── trial_balances/
│ ├── 2024_01.csv
│ ├── 2024_02.csv
│ └── 2024_03.csv
├── labels/
│ ├── anomaly_labels.csv # For ML training
│ └── fraud_labels.csv
└── controls/
├── internal_controls.csv
└── sod_rules.csv
Exploring the Data
Journal Entries
head -5 demo-output/transactions/journal_entries.csv
Key fields:
- document_id: Unique transaction identifier
- posting_date: When the entry was posted
- company_code: Company identifier
- account_number: GL account
- debit_amount / credit_amount: Entry amounts
- is_fraud: Fraud label (true/false)
- is_anomaly: Anomaly label
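To get a quick feel for the labels, a short script can tally the is_fraud column of the demo output (path and column name follow the layout shown above):
# Sketch: count labeled fraud entries in the demo output.
import csv
from collections import Counter

label_counts = Counter()
with open("demo-output/transactions/journal_entries.csv", newline="") as f:
    for row in csv.DictReader(f):
        label_counts[row["is_fraud"]] += 1

print(label_counts)  # roughly 0.5% of entries should be labeled true at the demo fraud rate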
Fraud Labels
# View fraud transactions
grep "true" demo-output/labels/fraud_labels.csv | head
Trial Balance
# Check balance coherence
head demo-output/period_close/trial_balances/2024_01.csv
Customizing Demo Output
You can combine demo mode with some options:
# Change output directory
datasynth-data generate --demo --output ./my-demo
# Use demo as starting point, then create config
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
# Edit config.yaml as needed
datasynth-data generate --config config.yaml --output ./output
Use Cases for Demo Mode
Quick Exploration
Test SyntheticData’s capabilities before creating a custom configuration.
Development Testing
Generate test data quickly for development purposes.
Training & Workshops
Provide sample data for training sessions without complex setup.
Benchmarking
Establish baseline performance metrics.
Moving Beyond Demo Mode
When you’re ready for more control:
- Create a configuration file:
datasynth-data init --industry <your-industry> -o config.yaml
- Customize settings:
- Adjust transaction volume
- Configure multiple companies
- Enable graph export
- Fine-tune fraud/anomaly rates
- Generate with your config:
datasynth-data generate --config config.yaml --output ./output
Next Steps
- Review Quick Start for custom configurations
- Learn the CLI Reference
- Explore Configuration Options
- See Use Cases for real-world examples
User Guide
This section covers the different ways to use SyntheticData.
Interfaces
SyntheticData offers three interfaces:
| Interface | Use Case |
|---|---|
| CLI | Command-line generation, scripting, automation |
| Server API | REST/gRPC/WebSocket for applications |
| Desktop UI | Visual configuration and monitoring |
Quick Comparison
| Feature | CLI | Server | Desktop UI |
|---|---|---|---|
| Configuration editing | YAML files | API endpoints | Visual forms |
| Batch generation | Yes | Yes | Yes |
| Streaming generation | No | Yes | Yes (view) |
| Progress monitoring | Progress bar | WebSocket | Real-time |
| Scripting/automation | Yes | Yes | No |
| Visual feedback | Minimal | None | Full |
CLI Overview
The command-line interface (datasynth-data) is ideal for:
- Batch generation
- CI/CD pipelines
- Scripting and automation
- Server environments
datasynth-data generate --config config.yaml --output ./output
Server Overview
The server (datasynth-server) provides:
- REST API for configuration and control
- gRPC for high-performance integration
- WebSocket for real-time streaming
cargo run -p datasynth-server -- --port 3000
Desktop UI Overview
The desktop application offers:
- Visual configuration editor
- Industry preset selector
- Real-time generation monitoring
- Cross-platform support (Windows, macOS, Linux)
cd crates/datasynth-ui && npm run tauri dev
Output Formats
SyntheticData produces various output formats:
- CSV: Standard tabular data
- JSON: Structured data with nested objects
- ACDOCA: SAP HANA Universal Journal format
- PyTorch Geometric: ML-ready graph tensors
- Neo4j: Graph database import format
See Output Formats for details.
Choosing an Interface
Use the CLI if you:
- Need to automate generation
- Work in headless/server environments
- Prefer command-line tools
- Want to integrate with shell scripts
Use the Server if you:
- Build applications that consume synthetic data
- Need streaming generation
- Want API-based control
- Integrate with microservices
Use the Desktop UI if you:
- Prefer visual configuration
- Want to explore options interactively
- Need real-time monitoring
- Are new to SyntheticData
Next Steps
- CLI Reference - Complete command documentation
- Server API - REST, gRPC, and WebSocket endpoints
- Desktop UI - Desktop application guide
- Output Formats - Detailed output file documentation
CLI Reference
The datasynth-data command-line tool provides commands for generating synthetic financial data and extracting fingerprints from real data.
Installation
After building the project, the binary is at target/release/datasynth-data.
cargo build --release
./target/release/datasynth-data --help
Global Options
| Option | Description |
|---|---|
| -h, --help | Show help information |
| -V, --version | Show version number |
| -v, --verbose | Enable verbose output |
| -q, --quiet | Suppress non-error output |
Commands
generate
Generate synthetic financial data.
datasynth-data generate [OPTIONS]
Options:
| Option | Type | Description |
|---|---|---|
| --config <PATH> | Path | Configuration YAML file |
| --demo | Flag | Use demo preset instead of config |
| --output <DIR> | Path | Output directory (required) |
| --format <FMT> | String | Output format: csv, json |
| --seed <NUM> | u64 | Override random seed |
Examples:
# Generate with configuration file
datasynth-data generate --config config.yaml --output ./output
# Use demo mode
datasynth-data generate --demo --output ./demo-output
# Override seed for reproducibility
datasynth-data generate --config config.yaml --output ./output --seed 12345
# JSON output format
datasynth-data generate --config config.yaml --output ./output --format json
init
Create a new configuration file from industry presets.
datasynth-data init [OPTIONS]
Options:
| Option | Type | Description |
|---|---|---|
| --industry <NAME> | String | Industry preset |
| --complexity <LEVEL> | String | small, medium, large |
| -o, --output <PATH> | Path | Output file path |
Available Industries:
- manufacturing
- retail
- financial_services
- healthcare
- technology
Examples:
# Create manufacturing config
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
# Create large retail config
datasynth-data init --industry retail --complexity large -o retail.yaml
validate
Validate a configuration file.
datasynth-data validate --config <PATH>
Options:
| Option | Type | Description |
|---|---|---|
| --config <PATH> | Path | Configuration file to validate |
Example:
datasynth-data validate --config config.yaml
Validation Checks:
- Required fields present
- Value ranges (period_months: 1-120)
- Distribution weights sum to 1.0 (±0.01 tolerance)
- Date consistency
- Company code uniqueness
- Compression level: 1-9 when enabled
- All rate/percentage fields: 0.0-1.0
- Approval thresholds: strictly ascending order
info
Display available presets and configuration options.
datasynth-data info
Output includes:
- Available industry presets
- Complexity levels
- Supported output formats
- Feature capabilities
fingerprint
Privacy-preserving fingerprint extraction and evaluation. This command has several subcommands.
datasynth-data fingerprint <SUBCOMMAND>
fingerprint extract
Extract a fingerprint from real data with privacy controls.
datasynth-data fingerprint extract [OPTIONS]
Options:
| Option | Type | Description |
|---|---|---|
| --input <PATH> | Path | Input CSV data file (required) |
| --output <PATH> | Path | Output .dsf fingerprint file (required) |
| --privacy-level <LEVEL> | String | Privacy level: minimal, standard, high, maximum |
| --epsilon <FLOAT> | f64 | Custom differential privacy epsilon (overrides level) |
| --k <INT> | usize | Custom k-anonymity threshold (overrides level) |
Privacy Levels:
| Level | Epsilon | k | Outlier % | Use Case |
|---|---|---|---|---|
| minimal | 5.0 | 3 | 99% | Low privacy, high utility |
| standard | 1.0 | 5 | 95% | Balanced (default) |
| high | 0.5 | 10 | 90% | Higher privacy |
| maximum | 0.1 | 20 | 85% | Maximum privacy |
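For intuition about the epsilon column, the sketch below adds Laplace noise to a single statistic at each privacy level. It illustrates the privacy/utility trade-off only; it is not datasynth-fingerprint's internal mechanism.
# Illustration only: smaller epsilon means a larger noise scale and stronger privacy.
import math, random

def laplace_noise(sensitivity, epsilon, rng):
    scale = sensitivity / epsilon
    u = rng.random() - 0.5
    # Inverse-CDF sampling of the Laplace distribution
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

rng = random.Random(7)
true_mean = 1000.0
for epsilon in (5.0, 1.0, 0.5, 0.1):  # the four privacy levels above
    noisy = true_mean + laplace_noise(sensitivity=1.0, epsilon=epsilon, rng=rng)
    print(f"epsilon={epsilon}: noisy mean ~ {noisy:.2f}")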
Examples:
# Extract with standard privacy
datasynth-data fingerprint extract \
--input ./real_data.csv \
--output ./fingerprint.dsf \
--privacy-level standard
# Extract with custom privacy parameters
datasynth-data fingerprint extract \
--input ./real_data.csv \
--output ./fingerprint.dsf \
--epsilon 0.75 \
--k 8
fingerprint validate
Validate a fingerprint file’s integrity and structure.
datasynth-data fingerprint validate <PATH>
Arguments:
| Argument | Type | Description |
|---|---|---|
| <PATH> | Path | Path to .dsf fingerprint file |
Validation Checks:
- DSF file structure (ZIP archive with required components)
- SHA-256 checksums for all components
- Required fields in manifest, schema, statistics
- Privacy audit completeness
Example:
datasynth-data fingerprint validate ./fingerprint.dsf
fingerprint info
Display fingerprint metadata and statistics.
datasynth-data fingerprint info <PATH> [OPTIONS]
Arguments:
| Argument | Type | Description |
|---|---|---|
| <PATH> | Path | Path to .dsf fingerprint file |
Options:
| Option | Type | Description |
|---|---|---|
| --detailed | Flag | Show detailed statistics |
| --json | Flag | Output as JSON |
Examples:
# Basic info
datasynth-data fingerprint info ./fingerprint.dsf
# Detailed statistics
datasynth-data fingerprint info ./fingerprint.dsf --detailed
# JSON output for scripting
datasynth-data fingerprint info ./fingerprint.dsf --json
fingerprint diff
Compare two fingerprints.
datasynth-data fingerprint diff <PATH1> <PATH2>
Arguments:
| Argument | Type | Description |
|---|---|---|
| <PATH1> | Path | First .dsf fingerprint file |
| <PATH2> | Path | Second .dsf fingerprint file |
Output includes:
- Schema differences (columns added/removed/changed)
- Statistical distribution changes
- Correlation matrix differences
Example:
datasynth-data fingerprint diff ./fp_v1.dsf ./fp_v2.dsf
fingerprint evaluate
Evaluate synthetic data fidelity against a fingerprint.
datasynth-data fingerprint evaluate [OPTIONS]
Options:
| Option | Type | Description |
|---|---|---|
| --fingerprint <PATH> | Path | Reference .dsf fingerprint file (required) |
| --synthetic <PATH> | Path | Directory containing synthetic data (required) |
| --threshold <FLOAT> | f64 | Minimum fidelity score (0.0-1.0, default 0.8) |
| --report <PATH> | Path | Output report file (HTML or JSON based on extension) |
Fidelity Metrics:
- Statistical: KS statistic, Wasserstein distance, Benford MAD
- Correlation: Correlation matrix RMSE
- Schema: Column type match, row count ratio
- Rules: Balance equation compliance rate
Examples:
# Basic evaluation
datasynth-data fingerprint evaluate \
--fingerprint ./fingerprint.dsf \
--synthetic ./synthetic_data/ \
--threshold 0.8
# Generate HTML report
datasynth-data fingerprint evaluate \
--fingerprint ./fingerprint.dsf \
--synthetic ./synthetic_data/ \
--threshold 0.85 \
--report ./fidelity_report.html
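For a rough sense of what the statistical metrics above measure, the sketch below computes a two-sample KS statistic and a Benford mean absolute deviation with NumPy/SciPy. These are not the exact formulas the CLI applies, and the amounts are stand-in data, not real output.
# Rough sketch of two fidelity metrics: KS statistic and Benford MAD.
import numpy as np
from scipy import stats

def first_digit(x: float) -> int:
    return int(f"{abs(x):.10e}"[0])  # scientific notation puts the leading digit first

def benford_mad(amounts) -> float:
    observed = np.bincount([first_digit(a) for a in amounts if a != 0], minlength=10)[1:]
    observed = observed / observed.sum()
    expected = np.log10(1 + 1 / np.arange(1, 10))
    return float(np.abs(observed - expected).mean())

rng = np.random.default_rng(0)
real = rng.lognormal(mean=7.0, sigma=1.2, size=10_000)       # stand-in for real amounts
synthetic = rng.lognormal(mean=7.0, sigma=1.1, size=10_000)  # stand-in for synthetic amounts

ks_statistic, _ = stats.ks_2samp(real, synthetic)
print(f"KS statistic: {ks_statistic:.4f}")
print(f"Benford MAD (synthetic): {benford_mad(synthetic):.4f}")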
diffusion (v0.5.0)
Train and evaluate diffusion models for statistical data generation.
diffusion train
Train a diffusion model from a fingerprint file.
datasynth-data diffusion train \
--fingerprint ./fingerprint.dsf \
--output ./model.json \
--n-steps 1000 \
--schedule cosine
| Option | Type | Default | Description |
|---|---|---|---|
| --fingerprint | path | (required) | Path to .dsf fingerprint file |
| --output | path | (required) | Output path for trained model |
| --n-steps | integer | 1000 | Number of diffusion steps |
| --schedule | string | linear | Noise schedule: linear, cosine, sigmoid |
diffusion evaluate
Evaluate a trained diffusion model’s fit quality.
datasynth-data diffusion evaluate \
--model ./model.json \
--samples 5000
| Option | Type | Default | Description |
|---|---|---|---|
| --model | path | (required) | Path to trained model |
| --samples | integer | 1000 | Number of evaluation samples |
causal (v0.5.0)
Generate data with causal structure, run interventions, and produce counterfactuals.
causal generate
Generate data following a causal graph structure.
datasynth-data causal generate \
--template fraud_detection \
--samples 10000 \
--seed 42 \
--output ./causal_output
| Option | Type | Default | Description |
|---|---|---|---|
| --template | string | (required) | Built-in template (fraud_detection, revenue_cycle) or path to custom YAML |
| --samples | integer | 1000 | Number of samples to generate |
| --seed | integer | (random) | Random seed for reproducibility |
| --output | path | (required) | Output directory |
causal intervene
Run do-calculus interventions (“what-if” scenarios).
datasynth-data causal intervene \
--template fraud_detection \
--variable transaction_amount \
--value 50000 \
--samples 5000 \
--output ./intervention_output
| Option | Type | Default | Description |
|---|---|---|---|
| --template | string | (required) | Causal template or YAML path |
| --variable | string | (required) | Variable to intervene on |
| --value | float | (required) | Value to set the variable to |
| --samples | integer | 1000 | Number of intervention samples |
| --output | path | (required) | Output directory |
causal validate
Validate that generated data preserves causal structure.
datasynth-data causal validate \
--data ./causal_output \
--template fraud_detection
| Option | Type | Default | Description |
|---|---|---|---|
| --data | path | (required) | Path to generated data |
| --template | string | (required) | Causal template to validate against |
fingerprint federated (v0.5.0)
Aggregate fingerprints from multiple distributed sources without centralizing raw data.
datasynth-data fingerprint federated \
--sources ./source_a.dsf ./source_b.dsf ./source_c.dsf \
--output ./aggregated.dsf \
--method weighted_average \
--max-epsilon 5.0
| Option | Type | Default | Description |
|---|---|---|---|
| --sources | paths | (required) | Two or more .dsf fingerprint files |
| --output | path | (required) | Output path for aggregated fingerprint |
| --method | string | weighted_average | Aggregation method: weighted_average, median, trimmed_mean |
| --max-epsilon | float | 5.0 | Maximum epsilon budget per source |
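The weighted_average method corresponds to the standard row-count-weighted mean of per-source statistics. The sketch below shows the idea on a single statistic; the .dsf internals are not represented, and the numbers are made up.
# Sketch: row-count-weighted average of a per-source statistic,
# the idea behind --method weighted_average.
def weighted_average(per_source):
    # per_source: list of (statistic, row_count) pairs, one per fingerprint
    total_rows = sum(rows for _, rows in per_source)
    return sum(value * rows for value, rows in per_source) / total_rows

sources = [(1032.5, 120_000), (987.1, 80_000), (1110.4, 40_000)]
print(f"aggregated mean amount: {weighted_average(sources):.2f}")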
init --from-description (v0.5.0)
Generate configuration from a natural language description using LLM.
datasynth-data init \
--from-description "Generate 1 year of retail data for a mid-size US company with fraud patterns" \
-o config.yaml
Uses the configured LLM provider (defaults to Mock) to parse the description and generate an appropriate YAML configuration.
generate --certificate (v0.5.0)
Attach a synthetic data certificate to the generated output.
datasynth-data generate \
--config config.yaml \
--output ./output \
--certificate
Produces a certificate.json in the output directory containing DP guarantees, quality metrics, and an HMAC-SHA256 signature.
Signal Handling (Unix)
On Unix systems, you can pause and resume generation:
# Start generation in background
datasynth-data generate --config config.yaml --output ./output &
# Pause generation
kill -USR1 $(pgrep datasynth-data)
# Resume generation (send SIGUSR1 again)
kill -USR1 $(pgrep datasynth-data)
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Configuration error |
| 3 | I/O error |
| 4 | Validation error |
| 5 | Fingerprint error |
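In scripts, these codes can drive error handling. A small Python wrapper is sketched below; the handling policy is illustrative.
# Sketch: run the CLI and branch on the exit codes listed above.
import subprocess, sys

result = subprocess.run(
    ["datasynth-data", "generate", "--config", "config.yaml", "--output", "./output"]
)
if result.returncode == 0:
    print("generation succeeded")
elif result.returncode == 2:
    sys.exit("configuration error: run `datasynth-data validate --config config.yaml`")
elif result.returncode == 4:
    sys.exit("validation error: check the reported fields")
else:
    sys.exit(f"generation failed with exit code {result.returncode}")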
Environment Variables
| Variable | Description |
|---|---|
| SYNTH_DATA_LOG | Log level (error, warn, info, debug, trace) |
| SYNTH_DATA_THREADS | Number of worker threads |
Example:
SYNTH_DATA_LOG=debug datasynth-data generate --config config.yaml --output ./output
Configuration File Location
The tool looks for configuration files in this order:
- Path specified with --config
- ./datasynth-data.yaml in the current directory
- ~/.config/datasynth-data/config.yaml
Output Directory Structure
Generation creates this structure:
output/
├── master_data/ Vendors, customers, materials, assets, employees
├── transactions/ Journal entries, purchase orders, invoices, payments
├── subledgers/ AR, AP, FA, inventory detail records
├── period_close/ Trial balances, accruals, closing entries
├── consolidation/ Eliminations, currency translation
├── fx/ Exchange rates, CTA adjustments
├── banking/ KYC profiles, bank transactions, AML typology labels
├── process_mining/ OCEL 2.0 event logs, process variants
├── audit/ Engagements, workpapers, findings, risk assessments
├── graphs/ PyTorch Geometric, Neo4j, DGL exports (if enabled)
├── labels/ Anomaly, fraud, and data quality labels for ML
└── controls/ Internal control mappings, SoD rules
Scripting Examples
Batch Generation
#!/bin/bash
for industry in manufacturing retail healthcare; do
datasynth-data init --industry $industry --complexity medium -o ${industry}.yaml
datasynth-data generate --config ${industry}.yaml --output ./output/${industry}
done
CI/CD Pipeline
# GitHub Actions example
- name: Generate Test Data
run: |
cargo build --release
./target/release/datasynth-data generate --demo --output ./test-data
- name: Validate Generation
run: |
# Check output files exist
test -f ./test-data/transactions/journal_entries.csv
test -f ./test-data/master_data/vendors.csv
Reproducible Generation
# Same seed produces identical output
datasynth-data generate --config config.yaml --output ./run1 --seed 42
datasynth-data generate --config config.yaml --output ./run2 --seed 42
diff -r run1 run2 # No differences
Fingerprint Pipeline
#!/bin/bash
# Extract fingerprint from real data
datasynth-data fingerprint extract \
--input ./real_data.csv \
--output ./fingerprint.dsf \
--privacy-level high
# Validate the fingerprint
datasynth-data fingerprint validate ./fingerprint.dsf
# Generate synthetic data matching the fingerprint
# (fingerprint informs config generation)
datasynth-data generate --config generated_config.yaml --output ./synthetic
# Evaluate fidelity
datasynth-data fingerprint evaluate \
--fingerprint ./fingerprint.dsf \
--synthetic ./synthetic \
--threshold 0.85 \
--report ./fidelity_report.html
Troubleshooting
Common Issues
“Configuration file not found”
# Check file path
ls -la config.yaml
# Use absolute path
datasynth-data generate --config /full/path/to/config.yaml --output ./output
“Invalid configuration”
# Validate first
datasynth-data validate --config config.yaml
“Permission denied”
# Check output directory permissions
mkdir -p ./output
chmod 755 ./output
“Out of memory”
The generator includes memory guards that prevent OOM conditions. If you still encounter issues:
- Reduce transaction count in configuration
- The system will automatically reduce batch sizes under memory pressure
- Check memory_guard settings in the configuration
“Fingerprint validation failed”
# Check DSF file integrity
datasynth-data fingerprint validate ./fingerprint.dsf
# View detailed info
datasynth-data fingerprint info ./fingerprint.dsf --detailed
“Low fidelity score”
If synthetic data fidelity is below threshold:
- Review the fidelity report for specific metrics
- Adjust configuration to better match fingerprint statistics
- Consider using the evaluation framework’s auto-tuning recommendations
See Also
Server API
SyntheticData provides a server component with REST, gRPC, and WebSocket APIs for application integration.
Starting the Server
cargo run -p datasynth-server -- --port 3000 --worker-threads 4
Options:
| Option | Default | Description |
|---|---|---|
| --port | 3000 | HTTP/WebSocket port |
| --grpc-port | 50051 | gRPC port |
| --worker-threads | CPU cores | Worker thread count |
| --api-key | None | Required API key |
| --rate-limit | 100 | Max requests per minute |
Authentication
When --api-key is set, include it in requests:
curl -H "X-API-Key: your-api-key" http://localhost:3000/api/config
REST API
Configuration Endpoints
GET /api/config
Retrieve current configuration.
curl http://localhost:3000/api/config
Response:
{
"global": {
"seed": 42,
"industry": "manufacturing",
"start_date": "2024-01-01",
"period_months": 12
},
"transactions": {
"target_count": 100000
}
}
POST /api/config
Update configuration.
curl -X POST http://localhost:3000/api/config \
-H "Content-Type: application/json" \
-d '{"transactions": {"target_count": 50000}}'
POST /api/config/validate
Validate configuration without applying.
curl -X POST http://localhost:3000/api/config/validate \
-H "Content-Type: application/json" \
-d @config.json
Stream Control Endpoints
POST /api/stream/start
Start data generation.
curl -X POST http://localhost:3000/api/stream/start
Response:
{
"status": "started",
"stream_id": "abc123"
}
POST /api/stream/stop
Stop current generation.
curl -X POST http://localhost:3000/api/stream/stop
POST /api/stream/pause
Pause generation.
curl -X POST http://localhost:3000/api/stream/pause
POST /api/stream/resume
Resume paused generation.
curl -X POST http://localhost:3000/api/stream/resume
Pattern Trigger Endpoints
POST /api/stream/trigger/<pattern>
Trigger special event patterns.
Available patterns:
- month_end - Month-end close activities
- quarter_end - Quarter-end activities
- year_end - Year-end close activities
curl -X POST http://localhost:3000/api/stream/trigger/month_end
Health Check
GET /health
curl http://localhost:3000/health
Response:
{
"status": "healthy",
"uptime_seconds": 3600
}
WebSocket API
Connect to receive real-time events during generation.
Connection
const ws = new WebSocket('ws://localhost:3000/ws/events');
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log(data);
};
Event Types
Progress Event:
{
"type": "progress",
"current": 50000,
"total": 100000,
"percent": 50.0,
"rate": 85000.5
}
Entry Event:
{
"type": "entry",
"data": {
"document_id": "abc-123",
"posting_date": "2024-03-15",
"account": "1100",
"debit": "1000.00",
"credit": "0.00"
}
}
Error Event:
{
"type": "error",
"message": "Memory limit exceeded"
}
Complete Event:
{
"type": "complete",
"total_entries": 100000,
"duration_ms": 1200
}
gRPC API
Proto Definition
syntax = "proto3";
package synth;
service SynthService {
rpc GetConfig(Empty) returns (Config);
rpc SetConfig(Config) returns (Status);
rpc StartGeneration(GenerationRequest) returns (stream Entry);
rpc StopGeneration(Empty) returns (Status);
}
message Config {
string yaml = 1;
}
message GenerationRequest {
optional int64 count = 1;
}
message Entry {
string document_id = 1;
string posting_date = 2;
string company_code = 3;
repeated LineItem lines = 4;
}
message LineItem {
string account = 1;
string debit = 2;
string credit = 3;
}
Client Example (Rust)
use synth::synth_client::SynthClient;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let mut client = SynthClient::connect("http://localhost:50051").await?;
let request = tonic::Request::new(GenerationRequest { count: Some(1000) });
let mut stream = client.start_generation(request).await?.into_inner();
while let Some(entry) = stream.message().await? {
println!("Entry: {:?}", entry.document_id);
}
Ok(())
}
Rate Limiting
The server implements sliding window rate limiting:
| Metric | Default |
|---|---|
| Window | 1 minute |
| Max requests | 100 |
Exceeding the limit returns 429 Too Many Requests:
{
"error": "rate_limit_exceeded",
"retry_after": 30
}
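Clients can honor the retry_after hint with a simple backoff loop. A sketch using the requests library follows; the retry policy itself is illustrative.
# Sketch: retry on 429 responses, honoring the retry_after hint.
import time
import requests

def post_with_retry(url, max_attempts=5, **kwargs):
    for attempt in range(max_attempts):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        wait_seconds = response.json().get("retry_after", 2 ** attempt)
        time.sleep(wait_seconds)
    raise RuntimeError("rate limit retries exhausted")

response = post_with_retry("http://localhost:3000/api/stream/start")
print(response.status_code)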
Memory Management
The server enforces memory limits:
# Set memory limit (bytes)
cargo run -p datasynth-server -- --memory-limit 1073741824 # 1GB
When memory is low:
- Generation pauses automatically
- WebSocket sends warning event
- New requests may be rejected
Error Responses
| HTTP Status | Meaning |
|---|---|
| 400 | Invalid request/configuration |
| 401 | Missing or invalid API key |
| 429 | Rate limit exceeded |
| 500 | Internal server error |
| 503 | Server overloaded |
Error Response Format:
{
"error": "error_code",
"message": "Human readable description",
"details": {}
}
Integration Examples
Python Client
import requests
import websocket
import json
BASE_URL = "http://localhost:3000"
# Set configuration
config = {
"transactions": {"target_count": 10000}
}
requests.post(f"{BASE_URL}/api/config", json=config)
# Start generation
requests.post(f"{BASE_URL}/api/stream/start")
# Monitor via WebSocket
ws = websocket.create_connection(f"ws://localhost:3000/ws/events")
while True:
event = json.loads(ws.recv())
if event["type"] == "complete":
break
print(f"Progress: {event.get('percent', 0)}%")
Node.js Client
const axios = require('axios');
const WebSocket = require('ws');
const BASE_URL = 'http://localhost:3000';
async function generate() {
// Configure
await axios.post(`${BASE_URL}/api/config`, {
transactions: { target_count: 10000 }
});
// Start
await axios.post(`${BASE_URL}/api/stream/start`);
// Monitor
const ws = new WebSocket('ws://localhost:3000/ws/events');
ws.on('message', (data) => {
const event = JSON.parse(data);
console.log(event);
});
}
See Also
Desktop UI
SyntheticData includes a cross-platform desktop application built with Tauri and SvelteKit.
Overview
The desktop UI provides:
- Visual configuration editing
- Industry preset selection
- Real-time generation monitoring
- Configuration validation feedback
Installation
Prerequisites
| Requirement | Version |
|---|---|
| Node.js | 18+ |
| npm | 9+ |
| Rust | 1.88+ |
| Platform dependencies | See below |
Platform Dependencies
Linux (Ubuntu/Debian):
sudo apt-get install libwebkit2gtk-4.1-dev \
libgtk-3-dev \
libayatana-appindicator3-dev \
librsvg2-dev
macOS: No additional dependencies required.
Windows: WebView2 runtime (usually pre-installed on Windows 10/11).
Running in Development
cd crates/datasynth-ui
npm install
npm run tauri dev
Building for Production
cd crates/datasynth-ui
npm run tauri build
Build outputs are in crates/datasynth-ui/src-tauri/target/release/bundle/.
Application Layout
Dashboard
The main dashboard provides:
- Quick stats overview
- Recent generation history
- System status
Configuration Editor
Access via the sidebar. Configuration is organized into sections:
| Section | Contents |
|---|---|
| Global | Industry, dates, seed, performance |
| Companies | Company definitions and weights |
| Transactions | Target count, line items, amounts |
| Master Data | Vendors, customers, materials |
| Document Flows | P2P, O2C configuration |
| Financial | Balance, subledger, FX, period close |
| Compliance | Fraud, controls, approval |
| Analytics | Graph export, anomaly, data quality |
| Output | Formats, compression |
Configuration Sections
Global Settings
- Industry: Select from presets (manufacturing, retail, etc.)
- Start Date: Beginning of simulation period
- Period Months: Duration (1-120 months)
- Group Currency: Base currency for consolidation
- Random Seed: For reproducible generation
Chart of Accounts
- Complexity: Small (~100), Medium (~400), Large (~2500) accounts
- Structure: Industry-specific account hierarchies
Transactions
- Target Count: Number of journal entries to generate
- Line Item Distribution: Configure line count probabilities
- Amount Distribution: Log-normal parameters, round number bias
Master Data
Configure generation parameters for:
- Vendors (count, payment terms, intercompany flags)
- Customers (count, credit terms, payment behavior)
- Materials (count, valuation methods)
- Fixed Assets (count, depreciation methods)
- Employees (count, hierarchy depth)
Document Flows
- P2P (Procure-to-Pay): PO → GR → Invoice → Payment rates
- O2C (Order-to-Cash): SO → Delivery → Invoice → Receipt rates
- Three-Way Match: Tolerance settings
Financial Settings
- Balance: Opening balance configuration
- Subledger: AR, AP, FA, Inventory settings
- FX: Currency pairs, rate volatility
- Period Close: Accrual, depreciation, closing settings
Compliance
- Fraud: Enable/disable, fraud rate, fraud types
- Controls: Internal control definitions
- Approval: Threshold configuration, SoD rules
Analytics
- Graph Export: Format selection (PyTorch Geometric, Neo4j, DGL)
- Anomaly Injection: Rate, types, labeling
- Data Quality: Missing values, format variations, duplicates
Output Settings
- Format: CSV or JSON
- Compression: None, gzip, or zstd
- File Organization: Directory structure options
Preset Selector
Quickly load industry presets:
- Click “Load Preset” in the header
- Select industry
- Choose complexity level
- Click “Apply”
Real-time Streaming
During generation, view:
- Progress bar with percentage
- Entries per second
- Memory usage
- Recent entries table
Access streaming view via “Generate” → “Stream”.
Validation
The UI validates configuration in real-time:
- Required fields are highlighted
- Invalid values show error messages
- Distribution weights are checked
- Constraints are enforced
Keyboard Shortcuts
| Shortcut | Action |
|---|---|
| Ctrl/Cmd + S | Save configuration |
| Ctrl/Cmd + G | Start generation |
| Ctrl/Cmd + , | Open settings |
| Escape | Close modal |
Configuration Files
The UI stores configurations in:
| Platform | Location |
|---|---|
| Linux | ~/.config/datasynth-data/ |
| macOS | ~/Library/Application Support/datasynth-data/ |
| Windows | %APPDATA%\datasynth-data\ |
Exporting Configuration
To use your configuration with the CLI:
- Configure in the UI
- Click “Export” → “Export YAML”
- Save the .yaml file
- Use with the CLI:
datasynth-data generate --config exported.yaml --output ./output
Development
Project Structure
crates/datasynth-ui/
├── src/ # SvelteKit frontend
│ ├── routes/ # Page routes
│ │ ├── +page.svelte # Dashboard
│ │ ├── generate/ # Generation views
│ │ └── config/ # Configuration pages
│ └── lib/
│ ├── stores/ # State management
│ └── components/ # Reusable components
├── src-tauri/ # Rust backend
│ └── src/
│ └── main.rs # Tauri commands
├── package.json
└── tauri.conf.json
Adding a Configuration Page
- Create a route at src/routes/config/<section>/+page.svelte
- Add form components
- Connect to config store
- Add navigation link
Debugging
# Enable Tauri dev tools
npm run tauri dev
# View browser console (Ctrl/Cmd + Shift + I in dev mode)
Troubleshooting
UI Doesn’t Start
# Check Node dependencies
npm install
# Rebuild native modules
npm run tauri clean
npm run tauri build
Configuration Not Saving
Check file permissions in the config directory.
WebSocket Connection Failed
Ensure the server is running if using streaming features:
cargo run -p datasynth-server -- --port 3000
See Also
Output Formats
SyntheticData generates multiple file types organized into categories.
Output Directory Structure
output/
├── master_data/ # Entity master records
├── transactions/ # Journal entries and documents
├── subledgers/ # Subsidiary ledger records
├── period_close/ # Trial balances and closing
├── consolidation/ # Elimination entries
├── fx/ # Exchange rates
├── banking/ # KYC profiles and bank transactions
├── process_mining/ # OCEL 2.0 event logs
├── audit/ # Audit engagements and workpapers
├── graphs/ # ML-ready graph exports
├── labels/ # Anomaly, fraud, and quality labels
└── controls/ # Internal control mappings
File Formats
CSV
Default format with standard conventions:
- UTF-8 encoding
- Comma-separated values
- Header row included
- Quoted strings when needed
- Decimal values serialized as strings (prevents floating-point artifacts)
Example (journal_entries.csv):
document_id,posting_date,company_code,account,description,debit,credit,is_fraud
abc-123,2024-01-15,1000,1100,Customer payment,"1000.00","0.00",false
abc-123,2024-01-15,1000,1200,Cash receipt,"0.00","1000.00",false
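Because amounts arrive as strings, parse them with Decimal rather than float when post-processing. A minimal sketch (path follows the layout above):
# Sketch: read string-encoded amounts into Decimal to avoid float rounding.
import csv
from decimal import Decimal

total_debits = Decimal("0")
with open("output/transactions/journal_entries.csv", newline="") as f:
    for row in csv.DictReader(f):
        total_debits += Decimal(row["debit"])

print(f"total debits: {total_debits}")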
JSON
Structured format with nested objects:
Example (journal_entries.json):
[
{
"header": {
"document_id": "abc-123",
"posting_date": "2024-01-15",
"company_code": "1000",
"source": "Manual",
"is_fraud": false
},
"lines": [
{
"account": "1100",
"description": "Customer payment",
"debit": "1000.00",
"credit": "0.00"
},
{
"account": "1200",
"description": "Cash receipt",
"debit": "0.00",
"credit": "1000.00"
}
]
}
]
ACDOCA (SAP HANA)
SAP Universal Journal format with simulation fields:
| Field | Description |
|---|---|
| RCLNT | Client |
| RLDNR | Ledger |
| RBUKRS | Company code |
| GJAHR | Fiscal year |
| BELNR | Document number |
| DOCLN | Line item |
| RYEAR | Year |
| POPER | Posting period |
| RACCT | Account |
| DRCRK | Debit/Credit indicator |
| HSL | Amount in local currency |
| ZSIM_* | Simulation metadata fields |
Master Data Files
chart_of_accounts.csv
| Field | Description |
|---|---|
| account_number | GL account code |
| account_name | Display name |
| account_type | Asset, Liability, Equity, Revenue, Expense |
| account_subtype | Detailed classification |
| is_control_account | Links to subledger |
| normal_balance | Debit or Credit |
vendors.csv
| Field | Description |
|---|---|
| vendor_id | Unique identifier |
| vendor_name | Company name |
| tax_id | Tax identification |
| payment_terms | Standard terms |
| currency | Transaction currency |
| is_intercompany | IC flag |
customers.csv
| Field | Description |
|---|---|
| customer_id | Unique identifier |
| customer_name | Company/person name |
| credit_limit | Maximum credit |
| credit_rating | Rating code |
| payment_behavior | Typical payment pattern |
materials.csv
| Field | Description |
|---|---|
| material_id | Unique identifier |
| description | Material name |
| material_type | Classification |
| valuation_method | FIFO, LIFO, Avg |
| standard_cost | Unit cost |
employees.csv
| Field | Description |
|---|---|
| employee_id | Unique identifier |
| name | Full name |
| department | Department code |
| manager_id | Hierarchy link |
| approval_limit | Maximum approval amount |
| transaction_codes | Authorized T-codes |
Transaction Files
journal_entries.csv
| Field | Description |
|---|---|
| document_id | Entry identifier |
| company_code | Company |
| fiscal_year | Year |
| fiscal_period | Period |
| posting_date | Date posted |
| document_date | Original date |
| source | Transaction source |
| business_process | Process category |
| is_fraud | Fraud indicator |
| is_anomaly | Anomaly indicator |
Line Items (embedded or separate)
| Field | Description |
|---|---|
| line_number | Sequence |
| account_number | GL account |
| cost_center | Cost center |
| profit_center | Profit center |
| debit_amount | Debit |
| credit_amount | Credit |
| description | Line description |
Document Flow Files
purchase_orders.csv:
- Order header with vendor, dates, status
- Line items with materials, quantities, prices
goods_receipts.csv:
- Receipt linked to PO
- Quantities received, variances
vendor_invoices.csv:
- Invoice with three-way match status
- Payment terms, due date
payments.csv:
- Payment documents
- Bank references, cleared invoices
document_references.csv:
- Links between documents (FollowOn, Payment, Reversal)
- Ensures complete document chains
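As a quick integrity check, the reference file can be joined back to the documents it links. A minimal pandas sketch; the column names source_document_id, target_document_id, and payment_id are hypothetical placeholders for the actual CSV headers in your output:
import pandas as pd

# Hypothetical column names -- substitute the real headers from your output.
refs = pd.read_csv("output/document_references.csv")
payments = pd.read_csv("output/payments.csv")

linked = set(refs["source_document_id"]) | set(refs["target_document_id"])
orphans = payments[~payments["payment_id"].isin(linked)]
print(f"{len(orphans)} payment documents are not part of any document chain")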
Subledger Files
ar_open_items.csv
| Field | Description |
|---|---|
| customer_id | Customer reference |
| invoice_number | Document number |
| invoice_date | Date issued |
| due_date | Payment due |
| original_amount | Invoice total |
| open_amount | Remaining balance |
| aging_bucket | 0-30, 31-60, 61-90, 90+ |
ap_open_items.csv
Similar structure for payables.
fa_register.csv
| Field | Description |
|---|---|
| asset_id | Asset identifier |
| description | Asset name |
| acquisition_date | Purchase date |
| acquisition_cost | Original cost |
| useful_life_years | Depreciation period |
| depreciation_method | Straight-line, etc. |
| accumulated_depreciation | Total depreciation |
| net_book_value | Current value |
inventory_positions.csv
| Field | Description |
|---|---|
| material_id | Material reference |
| warehouse | Location |
| quantity | Units on hand |
| unit_cost | Current cost |
| total_value | Extended value |
Period Close Files
trial_balances/YYYY_MM.csv
| Field | Description |
|---|---|
| account_number | GL account |
| account_name | Description |
| opening_balance | Period start |
| period_debits | Total debits |
| period_credits | Total credits |
| closing_balance | Period end |
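Per account, the closing balance should equal the opening balance plus period debits minus period credits. A minimal pandas sketch of that check, assuming a signed-balance convention and using 2024_01.csv as an illustrative file name:
import pandas as pd

tb = pd.read_csv("output/trial_balances/2024_01.csv")
# Assumed sign convention: debits increase and credits decrease the signed balance.
expected = tb["opening_balance"] + tb["period_debits"] - tb["period_credits"]
mismatches = tb[(expected - tb["closing_balance"]).abs() > 0.01]
print(f"{len(mismatches)} accounts fail the closing-balance identity")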
accruals.csv
Accrual entries with reversal dates.
depreciation.csv
Monthly depreciation entries per asset.
Banking Files
banking_customers.csv
| Field | Description |
|---|---|
| customer_id | Unique identifier |
| customer_type | retail, business, trust |
| name | Customer name |
| created_at | Account creation date |
| risk_score | Calculated risk score (0-100) |
| kyc_status | verified, pending, enhanced_due_diligence |
| pep_flag | Politically exposed person |
| sanctions_flag | Sanctions list match |
bank_accounts.csv
| Field | Description |
|---|---|
| account_id | Unique identifier |
| customer_id | Owner reference |
| account_type | checking, savings, money_market |
| currency | Account currency |
| opened_date | Opening date |
| balance | Current balance |
| status | active, dormant, closed |
bank_transactions.csv
| Field | Description |
|---|---|
| transaction_id | Unique identifier |
| account_id | Account reference |
| timestamp | Transaction time |
| amount | Transaction amount |
| currency | Transaction currency |
| direction | credit, debit |
| channel | branch, atm, online, wire, ach |
| category | Transaction category |
| counterparty_id | Counterparty reference |
kyc_profiles.csv
| Field | Description |
|---|---|
| customer_id | Customer reference |
| declared_turnover | Expected monthly volume |
| transaction_frequency | Expected transactions/month |
| source_of_funds | Declared income source |
| geographic_exposure | List of countries |
| cash_intensity | Expected cash ratio |
| beneficial_owner_complexity | Ownership layers |
aml_typology_labels.csv
| Field | Description |
|---|---|
| transaction_id | Transaction reference |
| typology | structuring, funnel, layering, mule, fraud |
| confidence | Confidence score (0-1) |
| pattern_id | Related pattern identifier |
| related_transactions | Comma-separated related IDs |
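Because the label file is keyed by transaction_id, it joins directly onto bank_transactions.csv. A minimal pandas sketch that builds a labeled AML frame (paths are illustrative):
import pandas as pd

txns = pd.read_csv("output/bank_transactions.csv")
labels = pd.read_csv("output/aml_typology_labels.csv")

# Left join: transactions without a typology label form the negative class.
frame = txns.merge(labels[["transaction_id", "typology"]], on="transaction_id", how="left")
frame["typology"] = frame["typology"].fillna("none")
print(frame["typology"].value_counts())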
entity_risk_labels.csv
| Field | Description |
|---|---|
| entity_id | Customer or account ID |
| entity_type | customer, account |
| risk_category | high, medium, low |
| risk_factors | Contributing factors |
| label_date | Label timestamp |
Process Mining Files (OCEL 2.0)
event_log.json
OCEL 2.0 format event log:
{
"ocel:global-log": {
"ocel:version": "2.0",
"ocel:ordering": "timestamp"
},
"ocel:events": {
"e1": {
"ocel:activity": "Create Purchase Order",
"ocel:timestamp": "2024-01-15T10:30:00Z",
"ocel:typedOmap": [
{"ocel:oid": "PO-001", "ocel:qualifier": "order"}
]
}
},
"ocel:objects": {
"PO-001": {
"ocel:type": "PurchaseOrder",
"ocel:attributes": {
"vendor": "VEND-001",
"amount": "10000.00"
}
}
}
}
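The event log is plain JSON, so it can be inspected without a process mining toolkit. A minimal sketch that counts events per activity, following the structure shown above:
import json
from collections import Counter

with open("output/event_log.json") as f:
    log = json.load(f)

# Events are keyed by event ID; each carries an ocel:activity field.
activities = Counter(e["ocel:activity"] for e in log["ocel:events"].values())
for activity, count in activities.most_common():
    print(activity, count)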
objects.json
Object instances with types and attributes.
events.json
Event records with object relationships.
process_variants.csv
| Field | Description |
|---|---|
| variant_id | Unique identifier |
| activity_sequence | Ordered activity list |
| frequency | Occurrence count |
| avg_duration | Average case duration |
Audit Files
audit_engagements.csv
| Field | Description |
|---|---|
| engagement_id | Unique identifier |
| client_name | Client entity |
| engagement_type | Financial, Compliance, Operational |
| fiscal_year | Audit period |
| materiality | Materiality threshold |
| status | Planning, Fieldwork, Completion |
audit_workpapers.csv
| Field | Description |
|---|---|
| workpaper_id | Unique identifier |
| engagement_id | Engagement reference |
| workpaper_type | Lead schedule, Substantive, etc. |
| prepared_by | Preparer ID |
| reviewed_by | Reviewer ID |
| status | Draft, Reviewed, Final |
audit_evidence.csv
| Field | Description |
|---|---|
| evidence_id | Unique identifier |
| workpaper_id | Workpaper reference |
| evidence_type | Document, Inquiry, Observation, etc. |
| source | Evidence source |
| reliability | High, Medium, Low |
| sufficiency | Sufficient, Insufficient |
audit_risks.csv
| Field | Description |
|---|---|
| risk_id | Unique identifier |
| engagement_id | Engagement reference |
| risk_description | Risk narrative |
| risk_level | High, Significant, Low |
| likelihood | Probable, Possible, Remote |
| response | Response strategy |
audit_findings.csv
| Field | Description |
|---|---|
| finding_id | Unique identifier |
| engagement_id | Engagement reference |
| finding_type | Deficiency, Significant, Material Weakness |
| description | Finding narrative |
| recommendation | Recommended action |
| management_response | Response text |
audit_judgments.csv
| Field | Description |
|---|---|
| judgment_id | Unique identifier |
| workpaper_id | Workpaper reference |
| judgment_area | Revenue recognition, Estimates, etc. |
| alternatives_considered | Options evaluated |
| conclusion | Selected approach |
| rationale | Reasoning documentation |
Graph Export Files
PyTorch Geometric
graphs/transaction_network/pytorch_geometric/
├── node_features.pt # [num_nodes, features]
├── edge_index.pt # [2, num_edges]
├── edge_attr.pt # [num_edges, edge_features]
├── labels.pt # Node/edge labels
├── train_mask.pt # Training split
├── val_mask.pt # Validation split
└── test_mask.pt # Test split
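A minimal loading sketch, assuming each .pt file was written with torch.save as a plain tensor (the directory layout follows the tree above):
import torch
from torch_geometric.data import Data

base = "graphs/transaction_network/pytorch_geometric"
data = Data(
    x=torch.load(f"{base}/node_features.pt"),       # [num_nodes, features]
    edge_index=torch.load(f"{base}/edge_index.pt"),  # [2, num_edges]
    edge_attr=torch.load(f"{base}/edge_attr.pt"),    # [num_edges, edge_features]
    y=torch.load(f"{base}/labels.pt"),               # node/edge labels
)
data.train_mask = torch.load(f"{base}/train_mask.pt")
data.val_mask = torch.load(f"{base}/val_mask.pt")
data.test_mask = torch.load(f"{base}/test_mask.pt")
print(data)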
Neo4j
graphs/entity_relationship/neo4j/
├── nodes_account.csv
├── nodes_entity.csv
├── nodes_user.csv
├── edges_transaction.csv
├── edges_approval.csv
└── import.cypher # Import script
DGL (Deep Graph Library)
graphs/transaction_network/dgl/
├── graph.bin # DGL binary format
├── node_features.npy # NumPy arrays
└── edge_features.npy
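A minimal loading sketch for the DGL export, assuming graph.bin was saved with dgl.save_graphs and the .npy arrays line up with the graph's node and edge ordering:
import dgl
import numpy as np
import torch

base = "graphs/transaction_network/dgl"
graphs, _ = dgl.load_graphs(f"{base}/graph.bin")  # (list of graphs, label dict)
g = graphs[0]
g.ndata["feat"] = torch.from_numpy(np.load(f"{base}/node_features.npy"))
g.edata["feat"] = torch.from_numpy(np.load(f"{base}/edge_features.npy"))
print(g)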
Label Files
anomaly_labels.csv
| Field | Description |
|---|---|
| document_id | Entry reference |
| anomaly_id | Unique anomaly ID |
| anomaly_type | Classification |
| anomaly_category | Fraud, Error, Process, Statistical, Relational |
| severity | Low, Medium, High |
| description | Human-readable explanation |
fraud_labels.csv
| Field | Description |
|---|---|
| document_id | Entry reference |
| fraud_type | Specific fraud pattern (20+ types) |
| detection_difficulty | Easy, Medium, Hard |
| description | Fraud scenario description |
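fraud_labels.csv joins onto journal_entries.csv via document_id, which makes building a supervised training frame straightforward. A minimal pandas sketch (paths are illustrative):
import pandas as pd

entries = pd.read_csv("output/journal_entries.csv")
fraud = pd.read_csv("output/fraud_labels.csv")

# Left join: entries with no fraud label are the negative class.
labeled = entries.merge(
    fraud[["document_id", "fraud_type", "detection_difficulty"]],
    on="document_id",
    how="left",
)
print(labeled["fraud_type"].value_counts(dropna=False).head())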
quality_labels.csv
| Field | Description |
|---|---|
| record_id | Record reference |
| field_name | Affected field |
| issue_type | MissingValue, Typo, FormatVariation, Duplicate |
| issue_subtype | Detailed classification |
| original_value | Value before modification |
| modified_value | Value after modification |
| severity | Severity level (1-5) |
Control Files
internal_controls.csv
| Field | Description |
|---|---|
| control_id | Unique identifier |
| control_name | Description |
| control_type | Preventive, Detective |
| frequency | Continuous, Daily, etc. |
| assertions | Completeness, Accuracy, etc. |
control_account_mappings.csv
| Field | Description |
|---|---|
| control_id | Control reference |
| account_number | GL account |
| threshold | Monetary threshold |
sod_rules.csv
Segregation of duties conflict definitions.
sod_conflict_pairs.csv
Actual SoD violations detected in generated data.
Parquet Format
Apache Parquet columnar format for large analytical datasets:
output:
format: parquet
compression: snappy # snappy, gzip, zstd
Benefits:
- Columnar storage — efficient for queries touching few columns
- Built-in compression — typically 5-10x smaller than CSV
- Schema embedding — self-describing files with full type information
- Predicate pushdown — query engines skip irrelevant row groups
Use with: Apache Spark, DuckDB, Polars, pandas, BigQuery, Snowflake, Databricks.
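A minimal reading sketch, assuming the journal entries were exported as output/journal_entries.parquet and that the column names mirror the CSV schema shown earlier:
import duckdb
import pandas as pd

# pandas loads the full table into memory.
df = pd.read_parquet("output/journal_entries.parquet")

# DuckDB scans only the row groups and columns the query needs.
con = duckdb.connect()
top = con.execute(
    "SELECT account, SUM(CAST(debit AS DOUBLE)) AS total_debit "
    "FROM 'output/journal_entries.parquet' "
    "GROUP BY account ORDER BY total_debit DESC LIMIT 10"
).df()
print(top)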
ERP-Specific Formats
SyntheticData can export in native ERP table schemas:
| Format | Target ERP | Tables |
|---|---|---|
sap | SAP S/4HANA | BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA, CSKS, CEPC |
oracle | Oracle EBS | GL_JE_HEADERS, GL_JE_LINES, GL_JE_BATCHES |
netsuite | NetSuite | Journal entries with subsidiary, multi-book, custom fields |
See ERP Output Formats for field mappings and configuration.
Compression Options
| Option | Extension | Use Case |
|---|---|---|
| none | .csv/.json | Development, small datasets |
| gzip | .csv.gz | General compression |
| zstd | .csv.zst | High performance |
| snappy | .parquet | Parquet default (fast) |
Configuration
output:
format: csv # csv, json, jsonl, parquet, sap, oracle, netsuite
compression: none # none, gzip, zstd (CSV/JSON) or snappy/gzip/zstd (Parquet)
compression_level: 6 # 1-9 (if compression enabled)
streaming: false # Enable streaming mode for large outputs
See Also
- ERP Output Formats — SAP, Oracle, NetSuite table exports
- Streaming Output — Real-time streaming sinks
- Configuration — Output settings reference
- Graph Export
- Anomaly Injection
- AML/KYC Testing
- Process Mining
ERP Output Formats
SyntheticData can export data in native ERP table formats, enabling direct load testing and integration validation against SAP S/4HANA, Oracle EBS, and NetSuite environments.
Overview
The datasynth-output crate provides three ERP-specific exporters alongside the standard CSV/JSON/Parquet sinks. Each exporter transforms the internal data model into the target ERP’s table schema with correct field names, data types, and referential integrity.
| ERP System | Exporter | Tables |
|---|---|---|
| SAP S/4HANA | SapExporter | BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA, CSKS, CEPC |
| Oracle EBS | OracleExporter | GL_JE_HEADERS, GL_JE_LINES, GL_JE_BATCHES |
| NetSuite | NetSuiteExporter | Journal entries with subsidiary, multi-book, custom fields |
SAP S/4HANA
Supported Tables
| Table | Description | Source Data |
|---|---|---|
| BKPF | Document Header | Journal entry headers |
| BSEG | Document Line Items | Journal entry line items |
| ACDOCA | Universal Journal | Full ACDOCA event records |
| LFA1 | Vendor Master | Vendor records |
| KNA1 | Customer Master | Customer records |
| MARA | Material Master | Material records |
| CSKS | Cost Center Master | Cost center assignments |
| CEPC | Profit Center Master | Profit center assignments |
BKPF Fields (Document Header)
| SAP Field | Description | Example |
|---|---|---|
MANDT | Client | 100 |
BUKRS | Company Code | 1000 |
BELNR | Document Number | 0100000001 |
GJAHR | Fiscal Year | 2024 |
BLART | Document Type | SA (G/L posting) |
BLDAT | Document Date | 2024-01-15 |
BUDAT | Posting Date | 2024-01-15 |
MONAT | Fiscal Period | 1 |
CPUDT | Entry Date | 2024-01-15 |
CPUTM | Entry Time | 10:30:00 |
USNAM | User Name | JSMITH |
BSEG Fields (Line Items)
| SAP Field | Description | Example |
|---|---|---|
MANDT | Client | 100 |
BUKRS | Company Code | 1000 |
BELNR | Document Number | 0100000001 |
GJAHR | Fiscal Year | 2024 |
BUZEI | Line Item | 001 |
BSCHL | Posting Key | 40 (debit) / 50 (credit) |
HKONT | GL Account | 1100 |
DMBTR | Amount in Local Currency | 1000.00 |
WRBTR | Amount in Doc Currency | 1000.00 |
KOSTL | Cost Center | CC100 |
PRCTR | Profit Center | PC100 |
ACDOCA Fields (Universal Journal)
The ACDOCA format includes all standard SAP Universal Journal fields plus simulation metadata:
| Field | Description |
|---|---|
RCLNT | Client |
RLDNR | Ledger |
RBUKRS | Company Code |
GJAHR | Fiscal Year |
BELNR | Document Number |
DOCLN | Line Item |
POPER | Posting Period |
RACCT | Account |
DRCRK | Debit/Credit Indicator |
HSL | Amount in Local Currency |
ZSIM_* | Simulation metadata fields |
Configuration
output:
format: sap
sap:
tables:
- bkpf
- bseg
- acdoca
- lfa1
- kna1
- mara
client: "100"
ledger: "0L"
Oracle EBS
Supported Tables
| Table | Description | Source Data |
|---|---|---|
| GL_JE_HEADERS | Journal Entry Headers | Journal entry headers |
| GL_JE_LINES | Journal Entry Lines | Journal entry line items |
| GL_JE_BATCHES | Journal Entry Batches | Batch groupings |
GL_JE_HEADERS Fields
| Oracle Field | Description | Example |
|---|---|---|
JE_HEADER_ID | Unique Header ID | 10001 |
LEDGER_ID | Ledger ID | 1 |
JE_BATCH_ID | Batch ID | 5001 |
PERIOD_NAME | Period Name | JAN-24 |
NAME | Journal Name | Manual Entry 001 |
JE_CATEGORY | Category | MANUAL, ADJUSTMENT, PAYABLES |
JE_SOURCE | Source | MANUAL, PAYABLES, RECEIVABLES |
CURRENCY_CODE | Currency | USD |
ACTUAL_FLAG | Type | A (Actual), B (Budget), E (Encumbrance) |
STATUS | Status | P (Posted), U (Unposted) |
DEFAULT_EFFECTIVE_DATE | Effective Date | 2024-01-15 |
RUNNING_TOTAL_DR | Total Debits | 10000.00 |
RUNNING_TOTAL_CR | Total Credits | 10000.00 |
PARENT_JE_HEADER_ID | Parent (for reversals) | null |
ACCRUAL_REV_FLAG | Reversal Flag | Y / N |
GL_JE_LINES Fields
| Oracle Field | Description | Example |
|---|---|---|
JE_HEADER_ID | Header Reference | 10001 |
JE_LINE_NUM | Line Number | 1 |
CODE_COMBINATION_ID | Account Combo ID | 10110 |
ENTERED_DR | Entered Debit | 1000.00 |
ENTERED_CR | Entered Credit | 0.00 |
ACCOUNTED_DR | Accounted Debit | 1000.00 |
ACCOUNTED_CR | Accounted Credit | 0.00 |
DESCRIPTION | Line Description | Customer payment |
EFFECTIVE_DATE | Effective Date | 2024-01-15 |
Configuration
output:
format: oracle
oracle:
ledger_id: 1
set_of_books_id: 1
NetSuite
Journal Entry Fields
NetSuite export includes support for subsidiaries, multi-book accounting, and custom fields:
| NetSuite Field | Description | Example |
|---|---|---|
INTERNAL_ID | Internal ID | 50001 |
EXTERNAL_ID | External ID (for import) | DS-JE-001 |
TRAN_ID | Transaction Number | JE00001 |
TRAN_DATE | Transaction Date | 2024-01-15 |
POSTING_PERIOD | Period ID | Jan 2024 |
SUBSIDIARY | Subsidiary ID | 1 |
CURRENCY | Currency Code | USD |
EXCHANGE_RATE | Exchange Rate | 1.000000 |
MEMO | Memo | Monthly accrual |
APPROVED | Approval Status | true |
REVERSAL_DATE | Reversal Date | 2024-02-01 |
DEPARTMENT | Department ID | 100 |
CLASS | Class ID | 1 |
LOCATION | Location ID | 1 |
TOTAL_DEBIT | Total Debits | 5000.00 |
TOTAL_CREDIT | Total Credits | 5000.00 |
NetSuite Line Fields
| Field | Description |
|---|---|
ACCOUNT | Account internal ID |
DEBIT | Debit amount |
CREDIT | Credit amount |
MEMO | Line memo |
DEPARTMENT | Department |
CLASS | Class segment |
LOCATION | Location segment |
ENTITY | Customer/Vendor reference |
CUSTOM_FIELDS | Additional custom field map |
Configuration
output:
format: netsuite
netsuite:
subsidiary_id: 1
include_custom_fields: true
Usage Examples
SAP Load Testing
Generate data for SAP S/4HANA load testing with full table coverage:
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 100000
output:
format: sap
sap:
tables: [bkpf, bseg, acdoca, lfa1, kna1, mara, csks, cepc]
client: "100"
Oracle EBS Migration Validation
Generate journal entries in Oracle EBS format for migration testing:
output:
format: oracle
oracle:
ledger_id: 1
NetSuite Integration Testing
Generate multi-subsidiary data with custom fields:
output:
format: netsuite
netsuite:
subsidiary_id: 1
include_custom_fields: true
Output Files
| Format | Output Files |
|---|---|
| SAP | sap_bkpf.csv, sap_bseg.csv, sap_acdoca.csv, sap_lfa1.csv, sap_kna1.csv, sap_mara.csv, sap_csks.csv, sap_cepc.csv |
| Oracle | oracle_gl_je_headers.csv, oracle_gl_je_lines.csv, oracle_gl_je_batches.csv |
| NetSuite | netsuite_journal_entries.csv, netsuite_journal_lines.csv |
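Because the exporters preserve referential integrity, every BSEG line should resolve to a BKPF header on the composite document key (BUKRS, BELNR, GJAHR). A minimal pandas check, assuming the CSV headers use the SAP field names listed above:
import pandas as pd

key = ["BUKRS", "BELNR", "GJAHR"]
bkpf = pd.read_csv("output/sap_bkpf.csv", dtype=str)
bseg = pd.read_csv("output/sap_bseg.csv", dtype=str)

# Every line item must reference an existing document header.
joined = bseg.merge(bkpf[key].drop_duplicates(), on=key, how="left", indicator=True)
orphans = joined[joined["_merge"] == "left_only"]
print(f"{len(orphans)} BSEG rows without a matching BKPF header")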
See Also
- Output Formats — Standard CSV/JSON/Parquet output
- Streaming Output — Real-time streaming sinks
- Output Settings — Configuration reference
- ERP Load Testing — ERP testing use case
- datasynth-output — Crate reference
Streaming Output
SyntheticData provides streaming output sinks for real-time data generation, enabling memory-efficient export of large datasets without loading everything into memory at once.
Overview
The streaming module in datasynth-output implements the StreamingSink trait for four output formats:
| Sink | Description | File Extension |
|---|---|---|
CsvStreamingSink | CSV with automatic headers | .csv |
JsonStreamingSink | Pretty-printed JSON arrays | .json |
NdjsonStreamingSink | Newline-delimited JSON | .jsonl / .ndjson |
ParquetStreamingSink | Apache Parquet columnar | .parquet |
All streaming sinks accept StreamEvent values through the process() method:
#![allow(unused)]
fn main() {
pub enum StreamEvent<T> {
Data(T), // A data record to write
Flush, // Force flush to disk
Close, // Close the stream
}
}
StreamingSink Trait
All streaming sinks implement:
#![allow(unused)]
fn main() {
pub trait StreamingSink<T: Serialize + Send> {
/// Process a single stream event (data, flush, or close).
fn process(&mut self, event: StreamEvent<T>) -> SynthResult<()>;
/// Close the stream and flush remaining data.
fn close(&mut self) -> SynthResult<()>;
/// Return the number of items written so far.
fn items_written(&self) -> u64;
/// Return the number of bytes written so far.
fn bytes_written(&self) -> u64;
}
}
When to Use Streaming vs Batch
| Scenario | Recommendation |
|---|---|
| < 100K records | Batch (CsvSink / JsonSink) — simpler API |
| 100K–10M records | Streaming — lower memory footprint |
| > 10M records | Streaming with Parquet — columnar compression |
| Real-time consumers | Streaming NDJSON — line-by-line parsing |
| REST/WebSocket API | Streaming — integrated with server endpoints |
CSV Streaming
#![allow(unused)]
fn main() {
use datasynth_output::streaming::CsvStreamingSink;
use datasynth_core::traits::StreamEvent;
let mut sink = CsvStreamingSink::<JournalEntry>::new("output.csv".into())?;
// Write records one at a time (memory efficient)
for entry in generate_entries() {
sink.process(StreamEvent::Data(entry))?;
}
// Periodic flush (optional — ensures data is on disk)
sink.process(StreamEvent::Flush)?;
// Close when done
sink.close()?;
println!("Wrote {} items ({} bytes)", sink.items_written(), sink.bytes_written());
}
Headers are written automatically on the first Data event.
JSON Streaming
Pretty-printed JSON Array
#![allow(unused)]
fn main() {
use datasynth_output::streaming::JsonStreamingSink;
let mut sink = JsonStreamingSink::<JournalEntry>::new("output.json".into())?;
for entry in entries {
sink.process(StreamEvent::Data(entry))?;
}
sink.close()?; // Writes closing bracket
}
Output:
[
{ "document_id": "abc-001", ... },
{ "document_id": "abc-002", ... }
]
Newline-Delimited JSON (NDJSON)
#![allow(unused)]
fn main() {
use datasynth_output::streaming::NdjsonStreamingSink;
let mut sink = NdjsonStreamingSink::<JournalEntry>::new("output.jsonl".into())?;
for entry in entries {
sink.process(StreamEvent::Data(entry))?;
}
sink.close()?;
}
Output:
{"document_id":"abc-001",...}
{"document_id":"abc-002",...}
NDJSON is ideal for streaming consumers that process records line by line (e.g., jq, Kafka, log aggregators).
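A minimal Python consumer that processes the NDJSON output one record at a time, never holding the whole file in memory:
import json

count = 0
with open("output.jsonl") as f:
    for line in f:
        record = json.loads(line)  # one self-contained JSON record per line
        count += 1
print(f"Processed {count} records")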
Parquet Streaming
Apache Parquet provides columnar compression, making it ideal for large analytical datasets:
#![allow(unused)]
fn main() {
use datasynth_output::streaming::ParquetStreamingSink;
let mut sink = ParquetStreamingSink::<JournalEntry>::new("output.parquet".into())?;
for entry in entries {
sink.process(StreamEvent::Data(entry))?;
}
sink.close()?;
}
Parquet benefits:
- Columnar storage: Efficient for analytical queries that touch few columns
- Built-in compression: Snappy, Gzip, or Zstd per column group
- Schema embedding: Self-describing files with full type information
- Predicate pushdown: Query engines can skip irrelevant row groups
Configuration
Streaming output can be enabled explicitly in configuration; it is also used automatically by the server API and when the runtime detects memory pressure:
output:
format: csv # csv, json, jsonl, parquet
streaming: true # Enable streaming mode
compression: none # none, gzip, zstd (CSV/JSON) or snappy/gzip/zstd (Parquet)
Server Streaming
The server API uses streaming sinks for the /api/stream/ endpoints:
# Start streaming generation
curl -X POST http://localhost:3000/api/stream/start \
-H "Content-Type: application/json" \
-d '{"config": {...}, "format": "ndjson"}'
# WebSocket streaming
wscat -c ws://localhost:3000/ws/events
Backpressure
Streaming sinks monitor write throughput and provide backpressure signals:
- items_written() / bytes_written(): Track progress for rate limiting
- Flush events: Force periodic disk writes to bound memory usage
- Disk space monitoring: The runtime's DiskGuard can pause generation when disk space runs low
Performance
| Format | Throughput | File Size | Use Case |
|---|---|---|---|
| CSV | ~150K rows/sec | Largest | Universal compatibility |
| NDJSON | ~120K rows/sec | Large | Streaming consumers |
| JSON | ~100K rows/sec | Large | Human-readable |
| Parquet | ~80K rows/sec | Smallest | Analytics, data lakes |
Throughput varies with record complexity and disk speed.
See Also
- Output Formats — Batch output format details
- ERP Output Formats — SAP/Oracle/NetSuite formats
- Output Settings — Configuration reference
- Server API — Streaming via REST/WebSocket
- datasynth-output — Crate reference
Python Wrapper Specification (In-Memory Configs)
This document specifies a Python wrapper that makes DataSynth usable out of the box without requiring persisted configuration files. The wrapper focuses on rich, structured configuration objects and reusable configuration blueprints so developers can generate data entirely in memory while still benefiting from the full DataSynth configuration model.
Goals
- Zero-file setup: Instantiate and run DataSynth without writing YAML/JSON to disk.
- Rich configuration: Offer a Pythonic API that maps cleanly to the full DataSynth configuration schema.
- Blueprints: Provide reusable, parameterized configuration templates for common scenarios.
- Interoperable: Allow optional export to YAML/JSON for debugging or CLI parity.
- Composable: Enable programmatic composition, overrides, and validation.
Non-goals
- Replacing the DataSynth CLI or server API.
- Hiding the underlying schema; the wrapper should expose all configuration knobs.
- Managing persistence beyond optional explicit export helpers.
Package layout
packages/
datasynth_py/
__init__.py
client.py # entrypoint wrapper
config/
__init__.py
models.py # typed config objects
blueprints.py # blueprint registry + builders
validation.py # schema validation helpers
runtime/
__init__.py
session.py # in-memory execution
Core API surface
DataSynth entrypoint
from datasynth_py import DataSynth
synth = DataSynth()
Responsibilities
- Provide a generate() method that accepts rich configuration objects.
- Provide blueprints registry access for common starting points.
- Manage execution in memory, including optional output sinks.
generate() signature
result = synth.generate(
config=Config(...),
output=OutputSpec(...),
seed=42,
)
Behavior
- Validates configuration objects.
- Converts configuration to DataSynth schema (internal model or JSON/YAML in-memory string).
- Executes the generator and returns result handles (paths, in-memory tables, or streams).
Optional output handling
OutputSpec can include:
- format (e.g., parquet, csv, jsonl)
- sink (memory, temp_dir, path)
- compression settings
When sink="memory", the wrapper returns in-memory table objects (pandas DataFrames by default).
Configuration model
Typed configuration objects
Provide typed dataclasses/Pydantic models mirroring the DataSynth YAML schema:
- GlobalSettings
- CompanySettings
- TransactionSettings
- MasterDataSettings
- ComplianceSettings
- OutputSettings
Example:
from datasynth_py.config import Config, GlobalSettings, CompanySettings
config = Config(
global_settings=GlobalSettings(
locale="en_US",
fiscal_year_start="2024-01-01",
periods=12,
),
companies=CompanySettings(count=5, industry="retail"),
)
Overrides and layering
Allow configuration layering to support incremental overrides:
config = base_config.override(
companies={"count": 10},
output={"format": "parquet"},
)
The wrapper merges overrides deeply, preserving nested settings.
Blueprints
Blueprints provide preconfigured setups with parameters. The wrapper ships with a registry:
from datasynth_py import blueprints
config = blueprints.retail_small(companies=3, transactions=5000)
Blueprint characteristics
- Parameterized: Each blueprint accepts keyword overrides for key metrics.
- Composable: A blueprint can extend or wrap another blueprint.
- Discoverable: Registry lists available blueprints and metadata.
blueprints.list()
# ["retail_small", "banking_medium", "saas_subscription", ...]
Execution model
The wrapper runs the Rust engine in-process via FFI or uses the DataSynth runtime API:
- In-memory config: converted to serialized config strings without writing to disk.
- Transient workspace: uses a temporary directory only if required by runtime internals.
- Deterministic runs: seed controls RNG.
Streaming generation
The wrapper exposes a streaming session that connects to datasynth-server over WebSockets while using REST endpoints to start, pause, resume, and stop streams.
Examples
Example 1: Minimal generation in memory
from datasynth_py import DataSynth
from datasynth_py.config import Config, GlobalSettings, CompanySettings
config = Config(
global_settings=GlobalSettings(locale="en_US", fiscal_year_start="2024-01-01"),
companies=CompanySettings(count=2),
)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "memory"})
# result.tables -> dict[str, pandas.DataFrame]
print(result.tables["transactions"].head())
Example 2: Use a blueprint with overrides
from datasynth_py import DataSynth, blueprints
synth = DataSynth()
config = blueprints.retail_small(companies=4, transactions=15000)
result = synth.generate(
config=config,
output={"format": "parquet", "sink": "temp_dir"},
seed=7,
)
print(result.output_dir)
Example 3: Layering overrides for a custom scenario
from datasynth_py import DataSynth
from datasynth_py.config import Config, GlobalSettings, TransactionSettings
base = Config(global_settings=GlobalSettings(locale="en_GB"))
custom = base.override(
transactions=TransactionSettings(
count=25000,
currency="GBP",
anomaly_rate=0.02,
)
)
synth = DataSynth()
result = synth.generate(config=custom, output={"format": "jsonl", "sink": "memory"})
Example 4: Export configuration for debugging
from datasynth_py import DataSynth
from datasynth_py.config import Config
synth = DataSynth()
config = Config(...)
print(config.to_yaml())
print(config.to_json())
Example 5: Streaming events
import asyncio
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
async def main() -> None:
synth = DataSynth(server_url="http://localhost:3000")
config = blueprints.retail_small(companies=2, transactions=5000)
session = synth.stream(config=config, events_per_second=50)
async for event in session.events():
print(event)
break
asyncio.run(main())
Decisions
- In-memory table format: pandas DataFrames are the default return type for memory sinks.
- Validation errors: configuration validation raises ConfigValidationError with structured error details.
Python Wrapper Guide
This guide explains how to use the DataSynth Python wrapper for in-memory configuration, local CLI generation, and streaming generation through the server API.
Installation
The wrapper lives in the repository under python/. Install it in development mode:
cd python
pip install -e ".[all]"
Or install just the core with specific extras:
pip install -e ".[cli]" # For CLI generation (requires PyYAML)
pip install -e ".[memory]" # For in-memory tables (requires pandas)
pip install -e ".[streaming]" # For streaming (requires websockets)
Quick start (CLI generation)
from datasynth_py import DataSynth, CompanyConfig, Config, GlobalSettings, ChartOfAccountsSettings
config = Config(
global_settings=GlobalSettings(
industry="retail",
start_date="2024-01-01",
period_months=12,
),
companies=[
CompanyConfig(code="C001", name="Retail Corp", currency="USD", country="US"),
],
chart_of_accounts=ChartOfAccountsSettings(complexity="small"),
)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
print(result.output_dir) # Path to generated files
Using blueprints
Blueprints provide preconfigured templates for common scenarios:
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
# List available blueprints
print(blueprints.list())
# ['retail_small', 'banking_medium', 'manufacturing_large',
# 'banking_aml', 'ml_training', 'with_graph_export']
# Create a retail configuration with 4 companies
config = blueprints.retail_small(companies=4, transactions=10000)
# Banking/AML focused configuration
config = blueprints.banking_aml(customers=1000, typologies=True)
# ML training optimized configuration
config = blueprints.ml_training(
industry="manufacturing",
anomaly_ratio=0.05,
)
# Add graph export to any configuration
config = blueprints.with_graph_export(
base_config=blueprints.retail_small(),
formats=["pytorch_geometric", "neo4j"],
)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "parquet", "sink": "path", "path": "./output"})
Configuration model
The configuration model matches the CLI schema:
from datasynth_py import (
ChartOfAccountsSettings,
CompanyConfig,
Config,
FraudSettings,
GlobalSettings,
)
config = Config(
global_settings=GlobalSettings(
industry="manufacturing", # Industry sector
start_date="2024-01-01", # Simulation start date
period_months=12, # Number of months to simulate
seed=42, # Random seed for reproducibility
group_currency="USD", # Base currency
),
companies=[
CompanyConfig(
code="M001",
name="Manufacturing Co",
currency="USD",
country="US",
annual_transaction_volume="ten_k", # Volume preset
),
CompanyConfig(
code="M002",
name="Manufacturing EU",
currency="EUR",
country="DE",
annual_transaction_volume="hundred_k",
),
],
chart_of_accounts=ChartOfAccountsSettings(
complexity="medium", # small, medium, or large
),
fraud=FraudSettings(
enabled=True,
rate=0.01, # 1% fraud rate
),
)
Valid industry values
- manufacturing
- retail
- financial_services
- healthcare
- technology
- professional_services
- energy
- transportation
- real_estate
- telecommunications
Transaction volume presets
- ten_k - 10,000 transactions/year
- hundred_k - 100,000 transactions/year
- one_m - 1,000,000 transactions/year
- ten_m - 10,000,000 transactions/year
- hundred_m - 100,000,000 transactions/year
Extended configuration
Additional configuration sections for specialized scenarios:
from datasynth_py.config.models import (
Config,
GlobalSettings,
BankingSettings,
ScenarioSettings,
TemporalDriftSettings,
DataQualitySettings,
GraphExportSettings,
)
config = Config(
global_settings=GlobalSettings(industry="financial_services"),
# Banking/KYC/AML configuration
banking=BankingSettings(
enabled=True,
retail_customers=1000,
business_customers=200,
typologies_enabled=True, # Structuring, layering, mule patterns
),
# ML training scenario
scenario=ScenarioSettings(
tags=["ml_training", "fraud_detection"],
ml_training=True,
target_anomaly_ratio=0.05,
),
# Temporal drift for concept drift testing
temporal=TemporalDriftSettings(
enabled=True,
amount_mean_drift=0.02,
drift_type="gradual", # gradual, sudden, recurring
),
# Data quality issues for DQ model training
data_quality=DataQualitySettings(
enabled=True,
missing_rate=0.05,
typo_rate=0.02,
),
# Graph export for GNN training
graph_export=GraphExportSettings(
enabled=True,
formats=["pytorch_geometric", "neo4j"],
),
)
Configuration layering
Override configuration values:
from datasynth_py import Config, GlobalSettings
base = Config(global_settings=GlobalSettings(industry="retail", start_date="2024-01-01"))
custom = base.override(
fraud={"enabled": True, "rate": 0.02},
)
Validation
Validation raises ConfigValidationError with structured error details:
from datasynth_py import Config, GlobalSettings
from datasynth_py.config.validation import ConfigValidationError
try:
Config(global_settings=GlobalSettings(period_months=0)).validate()
except ConfigValidationError as exc:
for error in exc.errors:
print(error.path, error.message, error.value)
Output options
Control where and how data is generated:
from datasynth_py import DataSynth, OutputSpec
synth = DataSynth()
# Write to a specific path
result = synth.generate(
config=config,
output=OutputSpec(format="csv", sink="path", path="./output"),
)
# Write to a temporary directory
result = synth.generate(
config=config,
output=OutputSpec(format="parquet", sink="temp_dir"),
)
print(result.output_dir) # Temp directory path
# Load into memory (requires pandas)
result = synth.generate(
config=config,
output=OutputSpec(format="csv", sink="memory"),
)
print(result.tables["journal_entries"].head())
Fingerprint Operations
The Python wrapper provides access to fingerprint extraction, validation, and evaluation:
from datasynth_py import DataSynth
synth = DataSynth()
# Extract fingerprint from real data
synth.fingerprint.extract(
input_path="./real_data/",
output_path="./fingerprint.dsf",
privacy_level="standard" # minimal, standard, high, maximum
)
# Validate fingerprint file
is_valid, errors = synth.fingerprint.validate("./fingerprint.dsf")
if not is_valid:
print(f"Validation errors: {errors}")
# Get fingerprint info
info = synth.fingerprint.info("./fingerprint.dsf", detailed=True)
print(f"Privacy level: {info.privacy_level}")
print(f"Epsilon spent: {info.epsilon_spent}")
print(f"Tables: {info.tables}")
# Evaluate synthetic data fidelity
report = synth.fingerprint.evaluate(
fingerprint_path="./fingerprint.dsf",
synthetic_path="./synthetic_data/",
threshold=0.8
)
print(f"Overall score: {report.overall_score}")
print(f"Statistical fidelity: {report.statistical_fidelity}")
print(f"Correlation fidelity: {report.correlation_fidelity}")
print(f"Passes threshold: {report.passes}")
FidelityReport Fields
| Field | Description |
|---|---|
overall_score | Weighted average of all fidelity metrics (0-1) |
statistical_fidelity | KS statistic, Wasserstein distance, Benford MAD |
correlation_fidelity | Correlation matrix RMSE |
schema_fidelity | Column type match, row count ratio |
passes | Whether the score meets the threshold |
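In a CI or pipeline setting, the report can gate downstream steps directly. A minimal sketch reusing the evaluate() call shown above:
report = synth.fingerprint.evaluate(
    fingerprint_path="./fingerprint.dsf",
    synthetic_path="./synthetic_data/",
    threshold=0.8,
)
if not report.passes:
    # Fail the pipeline when synthetic data drifts too far from the fingerprint.
    raise SystemExit(f"Fidelity below threshold: {report.overall_score:.2f}")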
Streaming generation
Streaming uses the DataSynth server for real-time event generation. Start the server first:
cargo run -p datasynth-server -- --port 3000
Then stream events:
import asyncio
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
async def main() -> None:
synth = DataSynth(server_url="http://localhost:3000")
config = blueprints.retail_small(companies=2)
session = synth.stream(config=config, events_per_second=100)
async for event in session.events():
print(event)
break
asyncio.run(main())
Stream controls
session.pause()
session.resume()
session.stop()
Pattern triggers
Trigger specific patterns during streaming to simulate real-world scenarios:
# Trigger temporal patterns
session.trigger_month_end() # Month-end volume spike
session.trigger_year_end() # Year-end closing entries
session.trigger_pattern("quarter_end_spike")
# Trigger anomaly patterns
session.trigger_fraud_cluster() # Clustered fraud transactions
session.trigger_pattern("dormant_account_activity")
# Available patterns:
# - period_end_spike
# - quarter_end_spike
# - year_end_spike
# - fraud_cluster
# - error_burst
# - dormant_account_activity
Synchronous event consumption
For simpler use cases without async/await:
def process_event(event):
print(f"Received: {event['document_id']}")
session.sync_events(callback=process_event, max_events=1000)
Runtime requirements
The wrapper shells out to the datasynth-data CLI for batch generation. Ensure the binary is available:
cargo build --release
export DATASYNTH_BINARY=target/release/datasynth-data
Alternatively, pass binary_path when creating the client:
synth = DataSynth(binary_path="/path/to/datasynth-data")
Troubleshooting
- MissingDependencyError: Install the required optional dependency (PyYAML, pandas, or websockets).
- CLI not found: Build the datasynth-data binary and set DATASYNTH_BINARY or pass binary_path.
- ConfigValidationError: Check the error details for invalid configuration values.
- Streaming errors: Verify the server is running and reachable at the configured URL.
Ecosystem Integrations (v0.5.0)
DataSynth includes optional integrations with popular data engineering and ML platforms. Install with:
pip install datasynth-py[integrations]
# Or install specific integrations
pip install datasynth-py[airflow,dbt,mlflow,spark]
Apache Airflow
Use the DataSynthOperator to generate data as part of Airflow DAGs:
from datasynth_py.integrations import DataSynthOperator, DataSynthSensor, DataSynthValidateOperator
# Generate data
generate = DataSynthOperator(
task_id="generate_data",
config=config,
output_path="/data/synthetic/output",
)
# Wait for completion
sensor = DataSynthSensor(
task_id="wait_for_data",
output_path="/data/synthetic/output",
)
# Validate config
validate = DataSynthValidateOperator(
task_id="validate_config",
config_path="/data/configs/config.yaml",
)
dbt Integration
Generate dbt sources and seeds from synthetic data:
from datasynth_py.integrations import DbtSourceGenerator, create_dbt_project
gen = DbtSourceGenerator()
# Generate sources.yml for dbt
sources_path = gen.generate_sources_yaml("./output", "./my_dbt_project")
# Generate seed CSVs
seeds_dir = gen.generate_seeds("./output", "./my_dbt_project")
# Create complete dbt project from synthetic output
project = create_dbt_project("./output", "my_dbt_project")
MLflow Tracking
Track generation runs as MLflow experiments:
from datasynth_py.integrations import DataSynthMlflowTracker
tracker = DataSynthMlflowTracker(experiment_name="synthetic_data_runs")
# Track a generation run
run_info = tracker.track_generation("./output", config=cfg)
# Log quality metrics
tracker.log_quality_metrics({
"completeness": 0.98,
"benford_mad": 0.008,
"correlation_preservation": 0.95,
})
# Compare recent runs
comparison = tracker.compare_runs(n=5)
Apache Spark
Read synthetic data as Spark DataFrames:
from datasynth_py.integrations import DataSynthSparkReader
reader = DataSynthSparkReader()
# Read a single table
df = reader.read_table(spark, "./output", "journal_entries")
# Read all tables
tables = reader.read_all_tables(spark, "./output")
# Create temporary views for SQL queries
views = reader.create_temp_views(spark, "./output")
spark.sql("SELECT * FROM journal_entries WHERE amount > 10000").show()
For comprehensive integration documentation, see the Ecosystem Integrations guide.
Ecosystem Integrations
New in v0.5.0
DataSynth’s Python wrapper includes optional integrations with popular data engineering and ML platforms for seamless pipeline orchestration.
Installation
# Install all integrations
pip install datasynth-py[integrations]
# Install specific integrations
pip install datasynth-py[airflow]
pip install datasynth-py[dbt]
pip install datasynth-py[mlflow]
pip install datasynth-py[spark]
Apache Airflow
The Airflow integration provides custom operators and sensors for orchestrating synthetic data generation in Airflow DAGs.
DataSynthOperator
Generates synthetic data as an Airflow task:
from datasynth_py.integrations import DataSynthOperator
generate = DataSynthOperator(
task_id="generate_synthetic_data",
config={
"global": {"industry": "retail", "start_date": "2024-01-01", "period_months": 12},
"transactions": {"target_count": 50000},
"output": {"format": "csv"},
},
output_path="/data/synthetic/{{ ds }}",
)
| Parameter | Type | Description |
|---|---|---|
task_id | str | Airflow task identifier |
config | dict | Generation configuration (inline) |
config_path | str | Path to YAML config file (alternative to config) |
output_path | str | Output directory (supports Jinja templates) |
DataSynthSensor
Waits for synthetic data generation to complete:
from datasynth_py.integrations import DataSynthSensor
wait = DataSynthSensor(
task_id="wait_for_data",
output_path="/data/synthetic/{{ ds }}",
poke_interval=30,
timeout=600,
)
DataSynthValidateOperator
Validates a configuration file before generation:
from datasynth_py.integrations import DataSynthValidateOperator
validate = DataSynthValidateOperator(
task_id="validate_config",
config_path="/configs/retail.yaml",
)
Complete DAG Example
from airflow import DAG
from airflow.utils.dates import days_ago
from datasynth_py.integrations import (
DataSynthOperator,
DataSynthSensor,
DataSynthValidateOperator,
)
with DAG(
"weekly_synthetic_data",
start_date=days_ago(1),
schedule_interval="@weekly",
catchup=False,
) as dag:
validate = DataSynthValidateOperator(
task_id="validate",
config_path="/configs/retail.yaml",
)
generate = DataSynthOperator(
task_id="generate",
config_path="/configs/retail.yaml",
output_path="/data/synthetic/{{ ds }}",
)
wait = DataSynthSensor(
task_id="wait",
output_path="/data/synthetic/{{ ds }}",
)
validate >> generate >> wait
dbt Integration
Generate dbt-compatible project structures from synthetic data output.
DbtSourceGenerator
from datasynth_py.integrations import DbtSourceGenerator
gen = DbtSourceGenerator()
Generate sources.yml
Creates a dbt sources.yml file pointing to synthetic data tables:
sources_path = gen.generate_sources_yaml(
output_dir="./synthetic_output",
dbt_project_dir="./my_dbt_project",
)
# Creates ./my_dbt_project/models/sources.yml
Generate Seeds
Copies synthetic CSV files as dbt seeds:
seeds_dir = gen.generate_seeds(
output_dir="./synthetic_output",
dbt_project_dir="./my_dbt_project",
)
# Copies CSVs to ./my_dbt_project/seeds/
create_dbt_project
Creates a complete dbt project structure from synthetic output:
from datasynth_py.integrations import create_dbt_project
project = create_dbt_project(
output_dir="./synthetic_output",
project_name="synthetic_test",
)
This creates:
synthetic_test/
├── dbt_project.yml
├── models/
│ └── sources.yml
├── seeds/
│ ├── journal_entries.csv
│ ├── vendors.csv
│ ├── customers.csv
│ └── ...
└── tests/
Usage with dbt CLI
cd synthetic_test
dbt seed # Load synthetic CSVs
dbt run # Run transformations
dbt test # Run data tests
MLflow Integration
Track synthetic data generation runs as MLflow experiments for comparison and reproducibility.
DataSynthMlflowTracker
from datasynth_py.integrations import DataSynthMlflowTracker
tracker = DataSynthMlflowTracker(experiment_name="synthetic_data_experiments")
Track a Generation Run
run_info = tracker.track_generation(
output_dir="./output",
config=config,
)
# Logs: config parameters, output file counts, generation metadata
Log Quality Metrics
tracker.log_quality_metrics({
"completeness": 0.98,
"benford_mad": 0.008,
"correlation_preservation": 0.95,
"statistical_fidelity": 0.92,
})
Compare Runs
comparison = tracker.compare_runs(n=5)
for run in comparison:
print(f"Run {run['run_id']}: {run['metrics']}")
Experiment Comparison
Use MLflow to compare different generation configurations:
import mlflow
configs = {
"baseline": baseline_config,
"with_diffusion": diffusion_config,
"high_fraud": high_fraud_config,
}
for name, cfg in configs.items():
with mlflow.start_run(run_name=name):
result = synth.generate(config=cfg, output={"format": "csv", "sink": "temp_dir"})
tracker.track_generation(result.output_dir, config=cfg)
tracker.log_quality_metrics(evaluate_quality(result.output_dir))
View results in the MLflow UI:
mlflow ui --port 5000
# Open http://localhost:5000
Apache Spark
Read synthetic data output directly as Spark DataFrames for large-scale analysis.
DataSynthSparkReader
from datasynth_py.integrations import DataSynthSparkReader
reader = DataSynthSparkReader()
Read a Single Table
df = reader.read_table(spark, "./output", "journal_entries")
df.printSchema()
df.show(5)
Read All Tables
tables = reader.read_all_tables(spark, "./output")
for name, df in tables.items():
print(f"{name}: {df.count()} rows, {len(df.columns)} columns")
Create Temporary Views
views = reader.create_temp_views(spark, "./output")
# Now use SQL
spark.sql("""
SELECT
v.vendor_id,
v.vendor_name,
COUNT(p.document_id) as payment_count,
SUM(p.amount) as total_paid
FROM vendors v
JOIN payments p ON v.vendor_id = p.vendor_id
GROUP BY v.vendor_id, v.vendor_name
ORDER BY total_paid DESC
LIMIT 10
""").show()
Spark + DataSynth Pipeline
from pyspark.sql import SparkSession
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
from datasynth_py.integrations import DataSynthSparkReader
# Generate
synth = DataSynth()
config = blueprints.retail_small(transactions=100000)
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
# Load into Spark
spark = SparkSession.builder.appName("SyntheticAnalysis").getOrCreate()
reader = DataSynthSparkReader()
reader.create_temp_views(spark, result.output_dir)
# Analyze
spark.sql("""
SELECT fiscal_period, COUNT(*) as entry_count, SUM(amount) as total_amount
FROM journal_entries
GROUP BY fiscal_period
ORDER BY fiscal_period
""").show()
Integration Dependencies
| Integration | Required Package | Version |
|---|---|---|
| Airflow | apache-airflow | >= 2.5 |
| dbt | dbt-core | >= 1.5 |
| MLflow | mlflow | >= 2.0 |
| Spark | pyspark | >= 3.3 |
All integrations are optional — install only what you need.
See Also
Configuration
SyntheticData uses YAML configuration files to control all aspects of data generation.
Quick Start
# Create configuration from preset
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
# Validate configuration
datasynth-data validate --config config.yaml
# Generate with configuration
datasynth-data generate --config config.yaml --output ./output
Configuration Sections
| Section | Description |
|---|---|
| Global Settings | Industry, dates, seed, performance |
| Companies | Company codes, currencies, volume weights |
| Transactions | Line items, amounts, sources |
| Master Data | Vendors, customers, materials, assets |
| Document Flows | P2P, O2C configuration |
| Financial Settings | Balance, subledger, FX, period close |
| Compliance | Fraud, controls, approval |
| AI & ML Features | LLM, diffusion, causal, certificates |
| Output Settings | Format, compression |
| Source-to-Pay | S2C sourcing pipeline (projects, RFx, bids, contracts, catalogs, scorecards) |
| Financial Reporting | Financial statements, bank reconciliation, management KPIs, budgets |
| HR | Payroll runs, time entries, expense reports |
| Manufacturing | Production orders, quality inspections, cycle counts |
| Sales Quotes | Quote-to-order pipeline |
| Accounting Standards | Revenue recognition (ASC 606/IFRS 15), impairment testing |
Reference
Minimal Configuration
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 10000
output:
format: csv
Full Configuration Example
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
group_currency: USD
companies:
- code: "1000"
name: "Headquarters"
currency: USD
country: US
volume_weight: 0.6
- code: "2000"
name: "European Subsidiary"
currency: EUR
country: DE
volume_weight: 0.4
chart_of_accounts:
complexity: medium
transactions:
target_count: 100000
line_items:
distribution: empirical
amounts:
min: 100
max: 1000000
master_data:
vendors:
count: 200
customers:
count: 500
materials:
count: 1000
document_flows:
p2p:
enabled: true
flow_rate: 0.3
o2c:
enabled: true
flow_rate: 0.3
fraud:
enabled: true
fraud_rate: 0.005
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
# AI & ML Features (v0.5.0)
diffusion:
enabled: true
n_steps: 1000
schedule: "cosine"
sample_size: 1000
causal:
enabled: true
template: "fraud_detection"
sample_size: 1000
validate: true
certificates:
enabled: true
issuer: "DataSynth"
include_quality_metrics: true
# Enterprise Process Chains (v0.6.0)
source_to_pay:
enabled: true
projects_per_period: 5
avg_bids_per_rfx: 4
contract_award_rate: 0.75
catalog_items_per_contract: 10
financial_reporting:
enabled: true
generate_balance_sheet: true
generate_income_statement: true
generate_cash_flow: true
generate_changes_in_equity: true
management_kpis:
enabled: true
budgets:
enabled: true
variance_threshold: 0.10
hr:
enabled: true
payroll_frequency: monthly
time_tracking: true
expense_reports: true
manufacturing:
enabled: true
production_orders_per_period: 20
quality_inspection_rate: 0.30
cycle_count_frequency: quarterly
sales_quotes:
enabled: true
quotes_per_period: 15
conversion_rate: 0.35
output:
format: csv
compression: none
Configuration Loading
Configuration can be loaded from:
- YAML file (recommended): datasynth-data generate --config config.yaml --output ./output
- JSON file: datasynth-data generate --config config.json --output ./output
- Demo preset: datasynth-data generate --demo --output ./output
Validation
The configuration is validated for:
| Rule | Description |
|---|---|
| Required fields | All mandatory fields must be present |
| Value ranges | Numbers within valid bounds |
| Distributions | Weights sum to 1.0 (±0.01 tolerance) |
| Dates | Valid date ranges |
| Uniqueness | Company codes must be unique |
| Consistency | Cross-field validations |
Run validation:
datasynth-data validate --config config.yaml
Overriding Values
Command-line options override configuration file values:
# Override seed
datasynth-data generate --config config.yaml --seed 12345 --output ./output
# Override format
datasynth-data generate --config config.yaml --format json --output ./output
Environment Variables
Some settings can be controlled via environment variables:
| Variable | Configuration Equivalent |
|---|---|
SYNTH_DATA_SEED | global.seed |
SYNTH_DATA_THREADS | global.worker_threads |
SYNTH_DATA_MEMORY_LIMIT | global.memory_limit |
See Also
YAML Schema Reference
Complete reference for all configuration options.
Schema Overview
global: # Global settings
companies: # Company definitions
chart_of_accounts: # COA structure
transactions: # Transaction settings
master_data: # Master data settings
document_flows: # P2P, O2C flows
intercompany: # IC settings
balance: # Balance settings
subledger: # Subledger settings
fx: # FX settings
period_close: # Period close settings
fraud: # Fraud injection
internal_controls: # SOX controls
anomaly_injection: # Anomaly injection
data_quality: # Data quality variations
graph_export: # Graph export settings
output: # Output settings
business_processes: # Process distribution
templates: # External templates
approval: # Approval thresholds
departments: # Department distribution
source_to_pay: # Source-to-Pay (v0.6.0)
financial_reporting: # Financial statements & KPIs (v0.6.0)
hr: # HR / payroll / expenses (v0.6.0)
manufacturing: # Production orders & costing (v0.6.0)
sales_quotes: # Quote-to-order pipeline (v0.6.0)
global
global:
seed: 42 # u64, optional - RNG seed
industry: manufacturing # string - industry preset
start_date: 2024-01-01 # date - generation start
period_months: 12 # u32, 1-120 - duration
group_currency: USD # string - base currency
worker_threads: 4 # usize, optional - parallelism
memory_limit: 2147483648 # u64, optional - bytes
Industries: manufacturing, retail, financial_services, healthcare, technology, energy, telecom, transportation, hospitality
companies
companies:
- code: "1000" # string - unique code
name: "Headquarters" # string - display name
currency: USD # string - local currency
country: US # string - ISO country code
volume_weight: 0.6 # f64, 0-1 - transaction weight
is_parent: true # bool - consolidation parent
parent_code: null # string, optional - parent ref
Constraints:
- volume_weight across all companies must sum to 1.0
- code must be unique
chart_of_accounts
chart_of_accounts:
complexity: medium # small, medium, large
industry_specific: true # bool - use industry COA
custom_accounts: [] # list - additional accounts
Complexity levels:
- small: ~100 accounts
- medium: ~400 accounts
- large: ~2500 accounts
transactions
transactions:
target_count: 100000 # u64 - total JEs to generate
line_items:
distribution: empirical # empirical, uniform, custom
min_lines: 2 # u32 - minimum line items
max_lines: 20 # u32 - maximum line items
custom_distribution: # only if distribution: custom
2: 0.6068
3: 0.0524
4: 0.1732
amounts:
min: 100 # f64 - minimum amount
max: 1000000 # f64 - maximum amount
distribution: log_normal # log_normal, uniform, custom
round_number_bias: 0.15 # f64, 0-1 - round number preference
sources: # transaction source weights
manual: 0.3
automated: 0.5
recurring: 0.15
adjustment: 0.05
benford:
enabled: true # bool - Benford's Law compliance
temporal:
month_end_spike: 2.5 # f64 - month-end volume multiplier
quarter_end_spike: 3.0 # f64 - quarter-end multiplier
year_end_spike: 4.0 # f64 - year-end multiplier
working_hours_only: true # bool - restrict to business hours
master_data
master_data:
vendors:
count: 200 # u32 - number of vendors
intercompany_ratio: 0.05 # f64, 0-1 - IC vendor ratio
customers:
count: 500 # u32 - number of customers
intercompany_ratio: 0.05 # f64, 0-1 - IC customer ratio
materials:
count: 1000 # u32 - number of materials
fixed_assets:
count: 100 # u32 - number of assets
employees:
count: 50 # u32 - number of employees
hierarchy_depth: 4 # u32 - org chart depth
document_flows
document_flows:
p2p: # Procure-to-Pay
enabled: true
flow_rate: 0.3 # f64, 0-1 - JE percentage
completion_rate: 0.95 # f64, 0-1 - full flow rate
three_way_match:
quantity_tolerance: 0.02 # f64, 0-1 - qty variance allowed
price_tolerance: 0.01 # f64, 0-1 - price variance allowed
o2c: # Order-to-Cash
enabled: true
flow_rate: 0.3 # f64, 0-1 - JE percentage
completion_rate: 0.95 # f64, 0-1 - full flow rate
intercompany
intercompany:
enabled: true
transaction_types: # weights must sum to 1.0
goods_sale: 0.4
service_provided: 0.2
loan: 0.15
dividend: 0.1
management_fee: 0.1
royalty: 0.05
transfer_pricing:
method: cost_plus # cost_plus, resale_minus, comparable
markup_range:
min: 0.03
max: 0.10
balance
balance:
opening_balance:
enabled: true
total_assets: 10000000 # f64 - opening balance sheet size
coherence_check:
enabled: true # bool - verify A = L + E
tolerance: 0.01 # f64 - allowed imbalance
subledger
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90] # list of days
ap:
enabled: true
aging_buckets: [30, 60, 90]
fixed_assets:
enabled: true
depreciation_methods:
- straight_line
- declining_balance
inventory:
enabled: true
valuation_methods:
- fifo
- weighted_average
fx
fx:
enabled: true
base_currency: USD
currency_pairs: # currencies to generate
- EUR
- GBP
- CHF
- JPY
volatility: 0.01 # f64 - daily volatility
translation:
method: current_rate # current_rate, temporal
period_close
period_close:
enabled: true
monthly:
accruals: true
depreciation: true
quarterly:
intercompany_elimination: true
annual:
closing_entries: true
retained_earnings: true
fraud
fraud:
enabled: true
fraud_rate: 0.005 # f64, 0-1 - fraud percentage
types: # weights must sum to 1.0
fictitious_transaction: 0.15
revenue_manipulation: 0.10
expense_capitalization: 0.10
split_transaction: 0.15
round_tripping: 0.05
kickback_scheme: 0.10
ghost_employee: 0.05
duplicate_payment: 0.15
unauthorized_discount: 0.10
suspense_abuse: 0.05
internal_controls
internal_controls:
enabled: true
controls:
- id: "CTL-001"
name: "Payment Approval"
type: preventive
frequency: continuous
sod_rules:
- conflict_type: create_approve
processes: [ap_invoice, ap_payment]
anomaly_injection
anomaly_injection:
enabled: true
total_rate: 0.02 # f64, 0-1 - total anomaly rate
generate_labels: true # bool - output ML labels
categories: # weights must sum to 1.0
fraud: 0.25
error: 0.40
process_issue: 0.20
statistical: 0.10
relational: 0.05
temporal_pattern:
year_end_spike: 1.5 # f64 - year-end multiplier
clustering:
enabled: true
cluster_probability: 0.2
data_quality
data_quality:
enabled: true
missing_values:
rate: 0.01 # f64, 0-1
pattern: mcar # mcar, mar, mnar, systematic
format_variations:
date_formats: true
amount_formats: true
duplicates:
rate: 0.001 # f64, 0-1
types: [exact, near, fuzzy]
typos:
rate: 0.005 # f64, 0-1
keyboard_aware: true
graph_export
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
- dgl
graphs:
- transaction_network
- approval_network
- entity_relationship
split:
train: 0.7
val: 0.15
test: 0.15
stratify: is_anomaly
features:
temporal: true
amount: true
structural: true
categorical: true
output
output:
format: csv # csv, json
compression: none # none, gzip, zstd
compression_level: 6 # u32, 1-9 (if compression enabled)
files:
journal_entries: true
acdoca: true
master_data: true
documents: true
subledgers: true
trial_balances: true
labels: true
controls: true
Validation Summary
| Field | Constraint |
|---|---|
period_months | 1-120 |
compression_level | 1-9 |
| All rates/percentages | 0.0-1.0 |
| Distributions | Sum to 1.0 (±0.01) |
| Company codes | Unique |
| Dates | Valid and consistent |
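Several of the weighted maps above (transactions.sources, intercompany.transaction_types, fraud.types, anomaly_injection.categories) must sum to 1.0 within ±0.01. A minimal pre-flight check along the following lines can catch weight mistakes before a long generation run; it is an illustrative sketch using PyYAML, not part of the datasynth-data CLI.
```python
# Illustrative pre-flight check (not part of the CLI): verify that weighted
# distributions in a config.yaml sum to 1.0 within the ±0.01 tolerance.
# Requires PyYAML.
import yaml

WEIGHT_MAPS = [
    ("transactions", "sources"),
    ("intercompany", "transaction_types"),
    ("fraud", "types"),
    ("anomaly_injection", "categories"),
]

def check_weights(config_file: str, tolerance: float = 0.01) -> list[str]:
    with open(config_file) as fh:
        cfg = yaml.safe_load(fh)
    problems = []
    for section, key in WEIGHT_MAPS:
        weights = (cfg.get(section) or {}).get(key)
        if not isinstance(weights, dict):
            continue  # section absent or not a weight map
        total = sum(weights.values())
        if abs(total - 1.0) > tolerance:
            problems.append(f"{section}.{key} sums to {total:.3f}, expected 1.0")
    return problems

if __name__ == "__main__":
    for warning in check_weights("config.yaml"):
        print("WARN:", warning)
```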
Diffusion Configuration (v0.5.0)
diffusion:
enabled: false # Enable diffusion model backend
n_steps: 1000 # Number of diffusion steps (default: 1000)
schedule: "linear" # Noise schedule: "linear", "cosine", "sigmoid"
sample_size: 1000 # Number of samples to generate (default: 1000)
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable diffusion model generation |
n_steps | integer | 1000 | Number of forward/reverse diffusion steps |
schedule | string | "linear" | Noise schedule type: linear, cosine, sigmoid |
sample_size | integer | 1000 | Number of samples to generate |
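The schedule field determines how noise is spread across the n_steps forward steps. As a purely illustrative sketch (the crate's internal constants and parameterization are not documented here), a linear schedule can be pictured as a per-step noise variance that increases linearly:
```python
# Illustrative only: one common parameterization of a linear noise schedule
# over n_steps diffusion steps. The generator's internal constants may differ.
import numpy as np

def linear_beta_schedule(n_steps: int = 1000,
                         beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> np.ndarray:
    """Per-step noise variances, increasing linearly from beta_start to beta_end."""
    return np.linspace(beta_start, beta_end, n_steps)

betas = linear_beta_schedule(1000)
signal_retained = np.cumprod(1.0 - betas)   # fraction of signal left after t steps
print(f"signal retained after all steps: {signal_retained[-1]:.4f}")
```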
Causal Configuration (v0.5.0)
causal:
enabled: false # Enable causal generation
template: "fraud_detection" # Built-in template or custom graph path
sample_size: 1000 # Number of samples to generate
validate: true # Validate causal structure in output
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable causal/counterfactual generation |
template | string | "fraud_detection" | Template name (fraud_detection, revenue_cycle) or path to custom YAML |
sample_size | integer | 1000 | Number of causal samples to generate |
validate | bool | true | Run causal structure validation on output |
Built-in Causal Templates
| Template | Variables | Description |
|---|---|---|
fraud_detection | transaction_amount, approval_level, vendor_risk, fraud_flag | Fraud detection causal graph |
revenue_cycle | order_size, credit_score, payment_delay, revenue | Revenue cycle causal graph |
Certificate Configuration (v0.5.0)
certificates:
enabled: false # Enable synthetic data certificates
issuer: "DataSynth" # Certificate issuer name
include_quality_metrics: true # Include quality metrics in certificate
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Attach certificate to generated output |
issuer | string | "DataSynth" | Issuer identity for the certificate |
include_quality_metrics | bool | true | Include Benford MAD, correlation, fidelity metrics |
Source-to-Pay Configuration (v0.6.0)
source_to_pay:
enabled: false # Enable source-to-pay generation
spend_analysis:
hhi_threshold: 2500.0 # f64 - HHI threshold for sourcing trigger
contract_coverage_target: 0.80 # f64, 0-1 - target spend under contracts
sourcing:
projects_per_year: 10 # u32 - sourcing projects per year
renewal_horizon_months: 3 # u32 - months before expiry to trigger renewal
project_duration_months: 4 # u32 - average project duration
qualification:
pass_rate: 0.75 # f64, 0-1 - qualification pass rate
validity_days: 365 # u32 - qualification validity in days
financial_weight: 0.25 # f64 - financial stability weight
quality_weight: 0.30 # f64 - quality management weight
delivery_weight: 0.25 # f64 - delivery performance weight
compliance_weight: 0.20 # f64 - compliance weight
rfx:
rfi_threshold: 100000.0 # f64 - spend above which RFI required
min_invited_vendors: 3 # u32 - minimum vendors per RFx
max_invited_vendors: 8 # u32 - maximum vendors per RFx
response_rate: 0.70 # f64, 0-1 - vendor response rate
default_price_weight: 0.40 # f64 - price weight in evaluation
default_quality_weight: 0.35 # f64 - quality weight in evaluation
default_delivery_weight: 0.25 # f64 - delivery weight in evaluation
contracts:
min_duration_months: 12 # u32 - minimum contract duration
max_duration_months: 36 # u32 - maximum contract duration
auto_renewal_rate: 0.40 # f64, 0-1 - auto-renewal rate
amendment_rate: 0.20 # f64, 0-1 - contracts with amendments
type_distribution:
fixed_price: 0.40 # f64 - fixed price contracts
blanket: 0.30 # f64 - blanket/framework agreements
time_and_materials: 0.15 # f64 - T&M contracts
service_agreement: 0.15 # f64 - service agreements
catalog:
preferred_vendor_flag_rate: 0.70 # f64, 0-1 - items marked as preferred
multi_source_rate: 0.25 # f64, 0-1 - items with multiple sources
scorecards:
frequency: "quarterly" # string - review frequency
on_time_delivery_weight: 0.30 # f64 - OTD weight in score
quality_weight: 0.30 # f64 - quality weight in score
price_weight: 0.25 # f64 - price competitiveness weight
responsiveness_weight: 0.15 # f64 - responsiveness weight
grade_a_threshold: 90.0 # f64 - grade A threshold
grade_b_threshold: 75.0 # f64 - grade B threshold
grade_c_threshold: 60.0 # f64 - grade C threshold
p2p_integration:
off_contract_rate: 0.15 # f64, 0-1 - maverick purchase rate
price_tolerance: 0.02 # f64 - contract price variance allowed
catalog_enforcement: false # bool - enforce catalog ordering
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable source-to-pay generation |
sourcing.projects_per_year | u32 | 10 | Sourcing projects per year |
qualification.pass_rate | f64 | 0.75 | Supplier qualification pass rate |
rfx.response_rate | f64 | 0.70 | Fraction of invited vendors that respond |
contracts.auto_renewal_rate | f64 | 0.40 | Auto-renewal rate |
scorecards.frequency | string | "quarterly" | Scorecard review frequency |
p2p_integration.off_contract_rate | f64 | 0.15 | Rate of off-contract (maverick) purchases |
Financial Reporting Configuration (v0.6.0)
financial_reporting:
enabled: false # Enable financial reporting generation
generate_balance_sheet: true # bool - generate balance sheet
generate_income_statement: true # bool - generate income statement
generate_cash_flow: true # bool - generate cash flow statement
generate_changes_in_equity: true # bool - generate changes in equity
comparative_periods: 1 # u32 - number of comparative periods
management_kpis:
enabled: false # bool - enable KPI generation
frequency: "monthly" # string - monthly, quarterly
budgets:
enabled: false # bool - enable budget generation
revenue_growth_rate: 0.05 # f64 - expected revenue growth rate
expense_inflation_rate: 0.03 # f64 - expected expense inflation rate
variance_noise: 0.10 # f64 - noise for budget vs actual
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable financial reporting generation |
generate_balance_sheet | bool | true | Generate balance sheet output |
generate_income_statement | bool | true | Generate income statement output |
generate_cash_flow | bool | true | Generate cash flow statement output |
generate_changes_in_equity | bool | true | Generate changes in equity statement |
comparative_periods | u32 | 1 | Number of comparative periods to include |
management_kpis.enabled | bool | false | Enable management KPI calculation |
management_kpis.frequency | string | "monthly" | KPI calculation frequency |
budgets.enabled | bool | false | Enable budget generation |
budgets.revenue_growth_rate | f64 | 0.05 | Expected revenue growth rate for budgeting |
budgets.expense_inflation_rate | f64 | 0.03 | Expected expense inflation rate |
budgets.variance_noise | f64 | 0.10 | Random noise added to budget vs actual |
HR Configuration (v0.6.0)
hr:
enabled: false # Enable HR generation
payroll:
enabled: true # bool - enable payroll generation
pay_frequency: "monthly" # string - monthly, biweekly, weekly
salary_ranges:
staff_min: 50000.0 # f64 - staff level minimum salary
staff_max: 70000.0 # f64 - staff level maximum salary
manager_min: 80000.0 # f64 - manager level minimum salary
manager_max: 120000.0 # f64 - manager level maximum salary
director_min: 120000.0 # f64 - director level minimum salary
director_max: 180000.0 # f64 - director level maximum salary
executive_min: 180000.0 # f64 - executive level minimum salary
executive_max: 350000.0 # f64 - executive level maximum salary
tax_rates:
federal_effective: 0.22 # f64 - federal effective tax rate
state_effective: 0.05 # f64 - state effective tax rate
fica: 0.0765 # f64 - FICA/social security rate
benefits_enrollment_rate: 0.60 # f64, 0-1 - benefits enrollment rate
retirement_participation_rate: 0.45 # f64, 0-1 - retirement plan participation
time_attendance:
enabled: true # bool - enable time tracking
overtime_rate: 0.10 # f64, 0-1 - employees with overtime
expenses:
enabled: true # bool - enable expense report generation
submission_rate: 0.30 # f64, 0-1 - employees submitting per month
policy_violation_rate: 0.08 # f64, 0-1 - rate of policy violations
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable HR generation |
payroll.enabled | bool | true | Enable payroll generation |
payroll.pay_frequency | string | "monthly" | Pay frequency: monthly, biweekly, weekly |
payroll.benefits_enrollment_rate | f64 | 0.60 | Benefits enrollment rate |
payroll.retirement_participation_rate | f64 | 0.45 | Retirement plan participation rate |
time_attendance.enabled | bool | true | Enable time tracking |
time_attendance.overtime_rate | f64 | 0.10 | Rate of employees with overtime |
expenses.enabled | bool | true | Enable expense report generation |
expenses.submission_rate | f64 | 0.30 | Rate of employees submitting expenses per month |
expenses.policy_violation_rate | f64 | 0.08 | Rate of policy violations |
Manufacturing Configuration (v0.6.0)
manufacturing:
enabled: false # Enable manufacturing generation
production_orders:
orders_per_month: 50 # u32 - production orders per month
avg_batch_size: 100 # u32 - average batch size
yield_rate: 0.97 # f64, 0-1 - production yield rate
make_to_order_rate: 0.20 # f64, 0-1 - MTO vs MTS ratio
rework_rate: 0.03 # f64, 0-1 - rework rate
costing:
labor_rate_per_hour: 35.0 # f64 - labor rate per hour
overhead_rate: 1.50 # f64 - overhead multiplier on direct labor
standard_cost_update_frequency: "quarterly" # string - cost update cycle
routing:
avg_operations: 4 # u32 - average operations per routing
setup_time_hours: 1.5 # f64 - average setup time in hours
run_time_variation: 0.15 # f64 - run time variation coefficient
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable manufacturing generation |
production_orders.orders_per_month | u32 | 50 | Number of production orders per month |
production_orders.avg_batch_size | u32 | 100 | Average batch size |
production_orders.yield_rate | f64 | 0.97 | Production yield rate |
production_orders.rework_rate | f64 | 0.03 | Rework rate |
costing.labor_rate_per_hour | f64 | 35.0 | Direct labor cost per hour |
costing.overhead_rate | f64 | 1.50 | Overhead application multiplier |
routing.avg_operations | u32 | 4 | Average operations per routing step |
routing.setup_time_hours | f64 | 1.5 | Average machine setup time in hours |
Sales Quotes Configuration (v0.6.0)
sales_quotes:
enabled: false # Enable sales quote generation
quotes_per_month: 30 # u32 - quotes generated per month
win_rate: 0.35 # f64, 0-1 - quote-to-order conversion
validity_days: 30 # u32 - default quote validity period
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable sales quote generation |
quotes_per_month | u32 | 30 | Number of quotes generated per month |
win_rate | f64 | 0.35 | Fraction of quotes that convert to sales orders |
validity_days | u32 | 30 | Default quote validity period in days |
Industry Presets
SyntheticData includes pre-configured settings for common industries.
Using Presets
# Create configuration from preset
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
Available Industries
| Industry | Key Characteristics |
|---|---|
| Manufacturing | Heavy P2P, inventory, fixed assets |
| Retail | High O2C volume, seasonal patterns |
| Financial Services | Complex intercompany, high controls |
| Healthcare | Regulatory focus, insurance seasonality |
| Technology | SaaS revenue, R&D capitalization |
Complexity Levels
| Level | Accounts | Vendors | Customers | Materials |
|---|---|---|---|---|
| Small | ~100 | 50 | 100 | 200 |
| Medium | ~400 | 200 | 500 | 1000 |
| Large | ~2500 | 1000 | 5000 | 10000 |
Manufacturing
Characteristics:
- High P2P activity (procurement, production)
- Significant inventory and WIP
- Fixed asset intensive
- Cost accounting emphasis
Key Settings:
global:
industry: manufacturing
transactions:
sources:
manual: 0.2
automated: 0.6
recurring: 0.15
adjustment: 0.05
document_flows:
p2p:
enabled: true
flow_rate: 0.4 # 40% of JEs from P2P
o2c:
enabled: true
flow_rate: 0.25 # 25% of JEs from O2C
master_data:
materials:
count: 1000
fixed_assets:
count: 200
subledger:
inventory:
enabled: true
valuation_methods:
- weighted_average
- fifo
Typical Account Distribution:
- 45% expense accounts (production costs)
- 25% asset accounts (inventory, equipment)
- 15% liability accounts
- 10% revenue accounts
- 5% equity accounts
Retail
Characteristics:
- High transaction volume
- Strong seasonal patterns
- High O2C activity
- Inventory turnover focus
Key Settings:
global:
industry: retail
transactions:
target_count: 500000 # High volume
temporal:
month_end_spike: 1.5
quarter_end_spike: 2.0
year_end_spike: 5.0 # Holiday season
document_flows:
p2p:
enabled: true
flow_rate: 0.25
o2c:
enabled: true
flow_rate: 0.45 # High sales activity
master_data:
customers:
count: 2000
materials:
count: 5000
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90, 120]
Seasonal Pattern:
- Q4 volume: 200-300% of Q1-Q3 average
- Black Friday/holiday spikes
- Post-holiday returns
Financial Services
Characteristics:
- Complex intercompany structures
- High regulatory requirements
- Sophisticated controls
- Mark-to-market adjustments
Key Settings:
global:
industry: financial_services
transactions:
sources:
automated: 0.7 # High automation
adjustment: 0.15 # MTM adjustments
intercompany:
enabled: true
transaction_types:
loan: 0.3
service_provided: 0.25
dividend: 0.2
management_fee: 0.15
royalty: 0.1
internal_controls:
enabled: true
controls:
- id: "SOX-001"
type: preventive
frequency: continuous
fx:
enabled: true
currency_pairs:
- EUR
- GBP
- CHF
- JPY
- CNY
volatility: 0.015
Control Requirements:
- SOX 404 compliance mandatory
- High SoD enforcement
- Continuous monitoring
Healthcare
Characteristics:
- Complex revenue recognition (insurance)
- Regulatory compliance (HIPAA)
- Seasonal patterns (flu season, open enrollment)
- High accounts receivable
Key Settings:
global:
industry: healthcare
transactions:
amounts:
min: 50
max: 500000
distribution: log_normal
document_flows:
o2c:
enabled: true
flow_rate: 0.5 # Revenue cycle focus
master_data:
customers:
count: 1000 # Patient/payer mix
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90, 120, 180] # Extended aging
fraud:
types:
fictitious_transaction: 0.2
revenue_manipulation: 0.3 # Upcoding focus
duplicate_payment: 0.2
Seasonal Pattern:
- Q1 spike (insurance deductible reset)
- Flu season (Oct-Feb)
- Open enrollment (Nov-Dec)
Technology
Characteristics:
- SaaS/subscription revenue
- R&D capitalization
- Stock compensation
- Deferred revenue management
Key Settings:
global:
industry: technology
transactions:
sources:
automated: 0.65
recurring: 0.25 # Subscription billing
manual: 0.08
adjustment: 0.02
document_flows:
o2c:
enabled: true
flow_rate: 0.35
subledger:
ar:
enabled: true
# Additional technology-specific
deferred_revenue:
enabled: true
recognition_period: monthly
capitalization:
r_and_d:
enabled: true
threshold: 50000
Revenue Pattern:
- Monthly recurring revenue (MRR)
- Annual contract billing (ACV)
- Usage-based components
Process Chain Defaults (v0.6.0)
Starting in v0.6.0, all five industry presets include default settings for the new enterprise process chains. When you generate a configuration with datasynth-data init, the preset populates sensible defaults for each new section, though they remain disabled until explicitly turned on.
| Process Chain | Manufacturing | Retail | Financial Services | Healthcare | Technology |
|---|---|---|---|---|---|
source_to_pay | High | Medium | Low | Medium | Low |
financial_reporting | Full | Full | Full | Full | Full |
hr | Full | Full | Full | Full | Full |
manufacturing | High | – | – | – | – |
sales_quotes | Medium | High | Low | Medium | High |
Manufacturing presets emphasize production orders, routing, and costing. Retail presets increase sales quote volume and quote-to-order win rates. Financial Services presets focus on financial reporting with comprehensive KPIs and budgets. Healthcare and Technology presets provide balanced defaults.
Each preset configures the following when you set enabled: true:
- source_to_pay: Sourcing projects, RFx events, contract management, catalogs, and vendor scorecards that feed into the existing P2P document flow.
- financial_reporting: Balance sheets, income statements, cash flow statements, management KPIs, and budget vs. actual variance analysis.
- hr: Payroll runs based on employee master data, time and attendance tracking, and expense report generation with policy violation injection.
- manufacturing: Production orders, WIP tracking, standard costing with labor and overhead, and routing operations.
- sales_quotes: Quote-to-order pipeline that feeds into the existing O2C document flow.
Customizing Presets
Start with a preset and customize:
# Generate preset
datasynth-data init --industry manufacturing -o config.yaml
# Edit config.yaml
# - Adjust transaction counts
# - Add companies
# - Enable additional features
# Validate and generate
datasynth-data validate --config config.yaml
datasynth-data generate --config config.yaml --output ./output
Combining Industries
For conglomerates, use multiple companies with different characteristics:
companies:
- code: "1000"
name: "Manufacturing Division"
volume_weight: 0.5
- code: "2000"
name: "Retail Division"
volume_weight: 0.3
- code: "3000"
name: "Services Division"
volume_weight: 0.2
Global Settings
Global settings control overall generation behavior.
Configuration
global:
seed: 42 # Random seed for reproducibility
industry: manufacturing # Industry preset
start_date: 2024-01-01 # Generation start date
period_months: 12 # Duration in months
group_currency: USD # Base/reporting currency
worker_threads: 4 # Parallel workers (optional)
memory_limit: 2147483648 # Memory limit in bytes (optional)
Fields
seed
Random number generator seed for reproducible output.
| Property | Value |
|---|---|
| Type | u64 |
| Required | No |
| Default | Random |
global:
seed: 42 # Same seed = same output
Use cases:
- Reproducible test datasets
- Debugging
- Consistent benchmarks
industry
Industry preset for domain-specific settings.
| Property | Value |
|---|---|
| Type | string |
| Required | Yes |
| Values | See below |
Available industries:
| Industry | Description |
|---|---|
manufacturing | Production, inventory, cost accounting |
retail | High volume sales, seasonal patterns |
financial_services | Complex IC, regulatory compliance |
healthcare | Insurance billing, compliance |
technology | SaaS revenue, R&D |
energy | Long-term assets, commodity trading |
telecom | Subscription revenue, network assets |
transportation | Fleet assets, fuel costs |
hospitality | Seasonal, revenue management |
start_date
Beginning date for generated data.
| Property | Value |
|---|---|
| Type | date (YYYY-MM-DD) |
| Required | Yes |
global:
start_date: 2024-01-01
Notes:
- First transaction will be on or after this date
- Combined with period_months to determine the date range
period_months
Duration of generation period.
| Property | Value |
|---|---|
| Type | u32 |
| Required | Yes |
| Range | 1-120 |
global:
period_months: 12 # One year
period_months: 36 # Three years
period_months: 1 # One month
Considerations:
- Longer periods = more data
- Period close features require at least 1 month
- Year-end close requires at least 12 months
group_currency
Base currency for consolidation and reporting.
| Property | Value |
|---|---|
| Type | string (ISO 4217) |
| Required | Yes |
global:
group_currency: USD
group_currency: EUR
group_currency: CHF
Used for:
- Currency translation
- Consolidation
- Intercompany eliminations
worker_threads
Number of parallel worker threads.
| Property | Value |
|---|---|
| Type | usize |
| Required | No |
| Default | Number of CPU cores |
global:
worker_threads: 4 # Use 4 threads
worker_threads: 1 # Single-threaded
Guidance:
- Default (CPU cores) is usually optimal
- Reduce for memory-constrained systems
- Increasing beyond the number of CPU cores rarely improves performance
memory_limit
Maximum memory usage in bytes.
| Property | Value |
|---|---|
| Type | u64 |
| Required | No |
| Default | None (system limit) |
global:
memory_limit: 1073741824 # 1 GB
memory_limit: 2147483648 # 2 GB
memory_limit: 4294967296 # 4 GB
Behavior:
- Soft limit: Generation slows down
- Hard limit: Generation pauses until memory is freed
- Streaming output is used to reduce memory pressure
Environment Variable Overrides
| Variable | Setting |
|---|---|
SYNTH_DATA_SEED | global.seed |
SYNTH_DATA_THREADS | global.worker_threads |
SYNTH_DATA_MEMORY_LIMIT | global.memory_limit |
SYNTH_DATA_SEED=12345 datasynth-data generate --config config.yaml --output ./output
Examples
Minimal
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 12
group_currency: USD
Full Control
global:
seed: 42
industry: financial_services
start_date: 2023-01-01
period_months: 36
group_currency: USD
worker_threads: 8
memory_limit: 8589934592 # 8 GB
Development/Testing
global:
seed: 42 # Reproducible
industry: manufacturing
start_date: 2024-01-01
period_months: 1 # Short period
group_currency: USD
worker_threads: 1 # Single thread for debugging
Validation
| Check | Rule |
|---|---|
period_months | 1 ≤ value ≤ 120 |
start_date | Valid date |
industry | Known industry preset |
group_currency | Valid ISO 4217 code |
Companies
Company configuration defines the legal entities for data generation.
Configuration
companies:
- code: "1000"
name: "Headquarters"
currency: USD
country: US
volume_weight: 0.6
is_parent: true
parent_code: null
- code: "2000"
name: "European Subsidiary"
currency: EUR
country: DE
volume_weight: 0.4
is_parent: false
parent_code: "1000"
Fields
code
Unique identifier for the company.
| Property | Value |
|---|---|
| Type | string |
| Required | Yes |
| Constraints | Unique across all companies |
companies:
- code: "1000" # Four-digit SAP-style
- code: "US01" # Region-based
- code: "HQ" # Abbreviated
name
Display name for the company.
| Property | Value |
|---|---|
| Type | string |
| Required | Yes |
companies:
- name: "Headquarters"
- name: "European Operations GmbH"
- name: "Asia Pacific Holdings"
currency
Local currency for the company.
| Property | Value |
|---|---|
| Type | string (ISO 4217) |
| Required | Yes |
companies:
- currency: USD
- currency: EUR
- currency: CHF
- currency: JPY
Used for:
- Transaction amounts
- Local reporting
- FX translation
country
Country code for the company.
| Property | Value |
|---|---|
| Type | string (ISO 3166-1 alpha-2) |
| Required | Yes |
companies:
- country: US
- country: DE
- country: CH
- country: JP
Affects:
- Holiday calendars
- Tax calculations
- Regional templates
volume_weight
Relative transaction volume for this company.
| Property | Value |
|---|---|
| Type | f64 |
| Required | Yes |
| Range | 0.0 - 1.0 |
| Constraint | Sum across all companies = 1.0 |
companies:
- code: "1000"
volume_weight: 0.5 # 50% of transactions
- code: "2000"
volume_weight: 0.3 # 30% of transactions
- code: "3000"
volume_weight: 0.2 # 20% of transactions
is_parent
Whether this company is the consolidation parent.
| Property | Value |
|---|---|
| Type | bool |
| Required | No |
| Default | false |
companies:
- code: "1000"
is_parent: true # Consolidation parent
- code: "2000"
is_parent: false # Subsidiary
Notes:
- Exactly one company should be is_parent: true for consolidation
- Parent receives elimination entries
parent_code
Reference to parent company for subsidiaries.
| Property | Value |
|---|---|
| Type | string or null |
| Required | No |
| Default | null |
companies:
- code: "1000"
is_parent: true
parent_code: null # No parent (is the parent)
- code: "2000"
is_parent: false
parent_code: "1000" # Owned by 1000
- code: "3000"
is_parent: false
parent_code: "2000" # Owned by 2000 (nested)
Examples
Single Company
companies:
- code: "1000"
name: "Demo Company"
currency: USD
country: US
volume_weight: 1.0
Multi-National
companies:
- code: "1000"
name: "Global Holdings Inc"
currency: USD
country: US
volume_weight: 0.4
is_parent: true
- code: "2000"
name: "European Operations GmbH"
currency: EUR
country: DE
volume_weight: 0.25
parent_code: "1000"
- code: "3000"
name: "UK Limited"
currency: GBP
country: GB
volume_weight: 0.15
parent_code: "2000"
- code: "4000"
name: "Asia Pacific Pte Ltd"
currency: SGD
country: SG
volume_weight: 0.2
parent_code: "1000"
Regional Structure
companies:
- code: "HQ"
name: "Headquarters"
currency: USD
country: US
volume_weight: 0.3
is_parent: true
- code: "NA01"
name: "North America Operations"
currency: USD
country: US
volume_weight: 0.3
parent_code: "HQ"
- code: "EU01"
name: "EMEA Operations"
currency: EUR
country: DE
volume_weight: 0.25
parent_code: "HQ"
- code: "AP01"
name: "APAC Operations"
currency: JPY
country: JP
volume_weight: 0.15
parent_code: "HQ"
Validation
| Check | Rule |
|---|---|
code | Must be unique |
volume_weight | Sum must equal 1.0 (±0.01) |
parent_code | Must reference existing company or be null |
is_parent | At most one true (if intercompany enabled) |
Intercompany Implications
When multiple companies exist:
- Intercompany transactions generated between companies
- FX rates generated for currency pairs
- Elimination entries created for parent
- Transfer pricing applied
See Intercompany Processing for details.
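With transfer_pricing.method: cost_plus and the configured markup_range, the intercompany price is simply the seller's cost plus a sampled markup. The sketch below shows that arithmetic; it is illustrative, not the generator's pricing code.
```python
# Sketch of cost-plus transfer pricing: intercompany price = seller cost plus
# a markup drawn from markup_range (illustrative, not the generator's code).
import random

def cost_plus_price(cost: float, markup_min: float = 0.03, markup_max: float = 0.10,
                    rng: random.Random = random.Random(42)) -> float:
    markup = rng.uniform(markup_min, markup_max)
    return round(cost * (1 + markup), 2)

print(cost_plus_price(10_000.0))   # roughly 10,300 - 11,000 depending on the draw
```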
Transactions
Transaction settings control journal entry generation.
Configuration
transactions:
target_count: 100000
line_items:
distribution: empirical
min_lines: 2
max_lines: 20
amounts:
min: 100
max: 1000000
distribution: log_normal
round_number_bias: 0.15
sources:
manual: 0.3
automated: 0.5
recurring: 0.15
adjustment: 0.05
benford:
enabled: true
temporal:
month_end_spike: 2.5
quarter_end_spike: 3.0
year_end_spike: 4.0
working_hours_only: true
Fields
target_count
Total number of journal entries to generate.
| Property | Value |
|---|---|
| Type | u64 |
| Required | Yes |
transactions:
target_count: 10000 # Small dataset
target_count: 100000 # Medium dataset
target_count: 1000000 # Large dataset
line_items
Controls the number of line items per journal entry.
distribution
| Value | Description |
|---|---|
empirical | Based on real-world GL research |
uniform | Equal probability for all counts |
custom | User-defined probabilities |
Empirical distribution (default):
- 2 lines: 60.68%
- 3 lines: 5.24%
- 4 lines: 17.32%
- Even counts: 88% preference
line_items:
distribution: empirical
Custom distribution:
line_items:
distribution: custom
custom_distribution:
2: 0.50
3: 0.10
4: 0.20
5: 0.10
6: 0.10
min_lines / max_lines
| Property | Value |
|---|---|
| Type | u32 |
| Default | 2 / 20 |
line_items:
min_lines: 2
max_lines: 10
amounts
Controls transaction amounts.
min / max
| Property | Value |
|---|---|
| Type | f64 |
| Required | Yes |
amounts:
min: 100 # Minimum amount
max: 1000000 # Maximum amount
distribution
| Value | Description |
|---|---|
log_normal | Log-normal distribution (realistic) |
uniform | Equal probability across range |
custom | User-defined |
amounts:
distribution: log_normal
round_number_bias
Preference for round numbers (100, 500, 1000, etc.).
| Property | Value |
|---|---|
| Type | f64 |
| Range | 0.0 - 1.0 |
| Default | 0.15 |
amounts:
round_number_bias: 0.15 # 15% round numbers
round_number_bias: 0.0 # No round number bias
sources
Transaction source distribution (weights must sum to 1.0).
| Source | Description |
|---|---|
manual | Manual journal entries |
automated | System-generated |
recurring | Scheduled recurring entries |
adjustment | Period-end adjustments |
sources:
manual: 0.3
automated: 0.5
recurring: 0.15
adjustment: 0.05
benford
Benford’s Law compliance for first-digit distribution.
benford:
enabled: true # Follow P(d) = log10(1 + 1/d)
enabled: false # Disable Benford compliance
Expected distribution (enabled):
| Digit | Probability |
|---|---|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |
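These probabilities follow directly from P(d) = log10(1 + 1/d); the table above can be reproduced with a few lines of Python (an illustrative check, not generator code):
```python
# Reproduce the Benford first-digit probabilities shown above: P(d) = log10(1 + 1/d)
import math

for d in range(1, 10):
    p = math.log10(1 + 1 / d)
    print(f"digit {d}: {p:.1%}")
# digit 1: 30.1%, digit 2: 17.6%, ..., digit 9: 4.6%
```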
temporal
Temporal patterns for transaction timing.
Spikes
Volume multipliers for period ends:
temporal:
month_end_spike: 2.5 # 2.5x volume at month end
quarter_end_spike: 3.0 # 3.0x at quarter end
year_end_spike: 4.0 # 4.0x at year end
Working Hours
Restrict transactions to business hours:
temporal:
working_hours_only: true # Mon-Fri, 8am-6pm
working_hours_only: false # Any time
Examples
High Volume Retail
transactions:
target_count: 500000
line_items:
distribution: empirical
min_lines: 2
max_lines: 6
amounts:
min: 10
max: 50000
distribution: log_normal
round_number_bias: 0.3
sources:
manual: 0.1
automated: 0.8
recurring: 0.08
adjustment: 0.02
temporal:
month_end_spike: 1.5
quarter_end_spike: 2.0
year_end_spike: 5.0
Low Volume Manual
transactions:
target_count: 5000
line_items:
distribution: empirical
amounts:
min: 1000
max: 10000000
sources:
manual: 0.6
automated: 0.2
recurring: 0.1
adjustment: 0.1
temporal:
month_end_spike: 3.0
quarter_end_spike: 4.0
year_end_spike: 5.0
working_hours_only: true
Testing/Development
transactions:
target_count: 1000
line_items:
distribution: uniform
min_lines: 2
max_lines: 4
amounts:
min: 100
max: 10000
distribution: uniform
round_number_bias: 0.0
sources:
manual: 1.0
benford:
enabled: false
Validation
| Check | Rule |
|---|---|
target_count | > 0 |
min_lines | ≥ 2 |
max_lines | ≥ min_lines |
amounts.min | > 0 |
amounts.max | > min |
round_number_bias | 0.0 - 1.0 |
sources | Sum = 1.0 (±0.01) |
*_spike | ≥ 1.0 |
Master Data
Master data settings control generation of business entities.
Configuration
master_data:
vendors:
count: 200
intercompany_ratio: 0.05
customers:
count: 500
intercompany_ratio: 0.05
materials:
count: 1000
fixed_assets:
count: 100
employees:
count: 50
hierarchy_depth: 4
Vendors
Supplier master data configuration.
master_data:
vendors:
count: 200 # Number of vendors
intercompany_ratio: 0.05 # IC vendor percentage
payment_terms:
- code: "NET30"
days: 30
weight: 0.5
- code: "NET60"
days: 60
weight: 0.3
- code: "NET10"
days: 10
weight: 0.2
behavior:
late_payment_rate: 0.1 # % with late payment tendency
discount_usage_rate: 0.3 # % that take early pay discounts
Generated Fields
| Field | Description |
|---|---|
vendor_id | Unique identifier |
vendor_name | Company name |
tax_id | Tax identification number |
payment_terms | Default payment terms |
currency | Transaction currency |
bank_account | Bank details |
is_intercompany | IC vendor flag |
valid_from | Temporal validity start |
Customers
Customer master data configuration.
master_data:
customers:
count: 500 # Number of customers
intercompany_ratio: 0.05 # IC customer percentage
credit_rating:
- code: "AAA"
limit_multiplier: 10.0
weight: 0.1
- code: "AA"
limit_multiplier: 5.0
weight: 0.2
- code: "A"
limit_multiplier: 2.0
weight: 0.4
- code: "B"
limit_multiplier: 1.0
weight: 0.3
payment_behavior:
on_time_rate: 0.7 # % that pay on time
early_payment_rate: 0.1 # % that pay early
late_payment_rate: 0.2 # % that pay late
Generated Fields
| Field | Description |
|---|---|
customer_id | Unique identifier |
customer_name | Company/person name |
credit_limit | Maximum credit |
credit_rating | Rating code |
payment_behavior | Payment tendency |
currency | Transaction currency |
is_intercompany | IC customer flag |
Materials
Product/material master data.
master_data:
materials:
count: 1000 # Number of materials
types:
raw_material: 0.3
work_in_progress: 0.1
finished_goods: 0.4
services: 0.2
valuation:
- method: fifo
weight: 0.3
- method: weighted_average
weight: 0.5
- method: standard_cost
weight: 0.2
Generated Fields
| Field | Description |
|---|---|
material_id | Unique identifier |
description | Material name |
material_type | Classification |
unit_of_measure | UOM |
valuation_method | Costing method |
standard_cost | Unit cost |
gl_account | Inventory account |
Fixed Assets
Capital asset master data.
master_data:
fixed_assets:
count: 100 # Number of assets
categories:
buildings: 0.1
machinery: 0.3
vehicles: 0.2
furniture: 0.2
it_equipment: 0.2
depreciation:
- method: straight_line
weight: 0.7
- method: declining_balance
weight: 0.2
- method: units_of_production
weight: 0.1
Generated Fields
| Field | Description |
|---|---|
asset_id | Unique identifier |
description | Asset name |
asset_class | Category |
acquisition_date | Purchase date |
acquisition_cost | Original cost |
useful_life | Years |
depreciation_method | Method |
salvage_value | Residual value |
Employees
User/employee master data.
master_data:
employees:
count: 50 # Number of employees
hierarchy_depth: 4 # Org chart depth
roles:
- name: "AP Clerk"
approval_limit: 5000
weight: 0.3
- name: "AP Manager"
approval_limit: 50000
weight: 0.1
- name: "AR Clerk"
approval_limit: 5000
weight: 0.3
- name: "Controller"
approval_limit: 500000
weight: 0.1
- name: "CFO"
approval_limit: 999999999
weight: 0.05
transaction_codes:
- "FB01" # Post document
- "FB50" # Enter GL
- "F-28" # Post incoming payment
- "F-53" # Post outgoing payment
Generated Fields
| Field | Description |
|---|---|
employee_id | Unique identifier |
name | Full name |
department | Department code |
role | Job role |
manager_id | Supervisor reference |
approval_limit | Max approval amount |
transaction_codes | Authorized T-codes |
HR and Payroll Integration (v0.6.0)
Employee master data serves as the foundation for the hr configuration section introduced in v0.6.0. When the HR module is enabled, each employee record drives downstream generation:
- Payroll: Salary, tax withholdings, benefits deductions, and retirement contributions are computed per employee based on their role and the salary ranges defined in hr.payroll.salary_ranges. The pay_frequency setting (monthly, biweekly, or weekly) determines how many payroll runs are generated per period.
- Time and Attendance: Time entries are generated for each employee according to working days in the period. The overtime_rate controls how many employees have overtime hours in a given period.
- Expense Reports: A subset of employees (controlled by hr.expenses.submission_rate) generate expense reports each month. Policy violations are injected at the configured policy_violation_rate.
The employees.count and employees.hierarchy_depth settings in master_data directly determine the population size for all HR outputs. Increasing the employee count will proportionally increase payroll journal entries, time records, and expense reports.
master_data:
employees:
count: 200 # Drives payroll and HR volume
hierarchy_depth: 5
hr:
enabled: true # Activates payroll, time, and expenses
payroll:
pay_frequency: "biweekly" # 26 pay periods per year
expenses:
submission_rate: 0.40 # 40% of employees submit per month
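As a rough sizing aid, the record volume implied by the example above can be estimated from the employee count, pay frequency, and submission rate (simple arithmetic, not the exact counts the generator emits):
```python
# Rough volume estimate for the HR example above (illustrative arithmetic only;
# actual record counts depend on working days, hire/termination dates, etc.).
employees = 200
pay_periods_per_year = {"monthly": 12, "biweekly": 26, "weekly": 52}["biweekly"]
expense_submission_rate = 0.40                            # 40% of employees per month

payroll_runs = pay_periods_per_year                       # 26 payroll runs per year
payroll_records = employees * pay_periods_per_year        # ~5,200 payslips per year
expense_reports = round(employees * expense_submission_rate * 12)  # ~960 per year

print(payroll_runs, payroll_records, expense_reports)     # 26 5200 960
```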
Examples
Small Company
master_data:
vendors:
count: 50
customers:
count: 100
materials:
count: 200
fixed_assets:
count: 20
employees:
count: 10
hierarchy_depth: 2
Large Enterprise
master_data:
vendors:
count: 2000
intercompany_ratio: 0.1
customers:
count: 10000
intercompany_ratio: 0.1
materials:
count: 50000
fixed_assets:
count: 5000
employees:
count: 500
hierarchy_depth: 8
Validation
| Check | Rule |
|---|---|
count | > 0 |
intercompany_ratio | 0.0 - 1.0 |
hierarchy_depth | ≥ 1 |
| Distribution weights | Sum = 1.0 |
Document Flows
Document flow settings control P2P (Procure-to-Pay) and O2C (Order-to-Cash) process generation, including document types, three-way matching, credit checks, and document chain management.
Configuration
document_flows:
p2p:
enabled: true
flow_rate: 0.3
completion_rate: 0.95
three_way_match:
quantity_tolerance: 0.02
price_tolerance: 0.01
o2c:
enabled: true
flow_rate: 0.3
completion_rate: 0.95
Procure-to-Pay (P2P)
Flow
Purchase Purchase Goods Vendor Three-Way
Requisition → Order → Receipt → Invoice → Match → Payment
│ │ │
│ ┌────┘ │
▼ ▼ ▼
AP Open Item ← Match Result AP Aging
Purchase Order Types
SyntheticData models 6 PO types, each with different downstream behavior:
| Type | Description | Requires GR? | Use Case |
|---|---|---|---|
Standard | Standard goods purchase | Yes | Most common PO type |
Service | Service procurement | No | Consulting, maintenance, etc. |
Framework | Blanket/framework agreement | Yes | Long-term supply agreements |
Consignment | Vendor-managed inventory | Yes | Consignment stock |
StockTransfer | Inter-plant transfer | Yes | Internal stock movement |
Subcontracting | External processing | Yes | Outsourced manufacturing |
Goods Receipt Movement Types
Goods receipts use SAP-style movement type codes:
| Movement Type | Code | Description |
|---|---|---|
GrForPo | 101 | Standard GR against purchase order |
ReturnToVendor | 122 | Return materials to vendor |
GrForProduction | 131 | GR from production order |
TransferPosting | 301 | Transfer between plants/locations |
InitialEntry | 561 | Initial stock entry |
Scrapping | 551 | Scrap disposal |
Consumption | 201 | Direct consumption posting |
Three-Way Match
The three-way match validator compares Purchase Order, Goods Receipt, and Vendor Invoice to detect variances before payment.
Algorithm
For each invoice line item:
1. Find matching PO line (by PO reference + line number)
2. Sum GR quantities for that PO line (supports multiple partial GRs)
3. Compare:
a. PO quantity vs GR quantity → QuantityPoGr variance
b. GR quantity vs Invoice quantity → QuantityGrInvoice variance
c. PO unit price vs Invoice unit price → PricePoInvoice variance
d. PO total vs Invoice total → TotalAmount variance
4. Apply tolerances:
- Quantity: ±quantity_tolerance (default 2%)
- Price: ±price_tolerance (default 5%)
- Absolute: ±absolute_amount_tolerance (default $0.01)
5. Check over-delivery:
- If GR qty > PO qty and allow_over_delivery=true:
allow up to max_over_delivery_pct (default 10%)
Variance Types
| Variance Type | Description | Detection |
|---|---|---|
QuantityPoGr | GR quantity differs from PO quantity | PO vs GR comparison |
QuantityGrInvoice | Invoice quantity differs from GR quantity | GR vs Invoice comparison |
PricePoInvoice | Invoice unit price differs from PO price | PO vs Invoice comparison |
TotalAmount | Total invoice amount mismatch | Overall amount check |
MissingLine | PO line not found in invoice or GR | Line matching |
ExtraLine | Invoice has lines not on PO | Line matching |
Match Outcomes
| Outcome | Meaning | Action |
|---|---|---|
passed | All within tolerance | Proceed to payment |
quantity_variance | Quantity outside tolerance | Review required |
price_variance | Price outside tolerance | Review required |
blocked | Multiple variances or critical mismatch | Manual resolution |
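The matching and tolerance rules described above can be sketched in a few lines; the function below is illustrative (field names and return values are simplified), not the matcher implemented in the generator crates.
```python
# Simplified sketch of the three-way match tolerance logic (illustrative only;
# the real matcher lives in the Rust generator crates).
from dataclasses import dataclass

@dataclass
class MatchConfig:
    quantity_tolerance: float = 0.02        # ±2% quantity variance
    price_tolerance: float = 0.05           # ±5% price variance
    allow_over_delivery: bool = True
    max_over_delivery_pct: float = 0.10     # up to 10% over-delivery

def match_line(po_qty, po_price, gr_qtys, inv_qty, inv_price, cfg=MatchConfig()):
    gr_total = sum(gr_qtys)                 # supports multiple partial GRs
    variances = []
    qty_dev = (gr_total - po_qty) / po_qty
    if abs(qty_dev) > cfg.quantity_tolerance:
        over_delivery_ok = cfg.allow_over_delivery and 0 < qty_dev <= cfg.max_over_delivery_pct
        if not over_delivery_ok:
            variances.append("quantity_po_gr")
    if abs(inv_qty - gr_total) / gr_total > cfg.quantity_tolerance:
        variances.append("quantity_gr_invoice")
    if abs(inv_price - po_price) / po_price > cfg.price_tolerance:
        variances.append("price_po_invoice")
    if not variances:
        return "passed"
    return "blocked" if len(variances) > 1 else variances[0]

# Two partial goods receipts fully covering the PO, invoice price 0.5% high: passes.
print(match_line(po_qty=100, po_price=10.00, gr_qtys=[60, 40], inv_qty=100, inv_price=10.05))
```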
Configuration
document_flows:
p2p:
three_way_match:
enabled: true
price_tolerance: 0.05 # 5% price variance allowed
quantity_tolerance: 0.02 # 2% quantity variance allowed
absolute_amount_tolerance: 0.01 # $0.01 rounding tolerance
allow_over_delivery: true
max_over_delivery_pct: 0.10 # 10% over-delivery allowed
P2P Stage Configuration
document_flows:
p2p:
enabled: true
flow_rate: 0.3 # 30% of JEs from P2P
completion_rate: 0.95 # 95% complete full flow
stages:
po_approval_rate: 0.9 # 90% of POs approved
gr_rate: 0.98 # 98% of POs get goods receipts
invoice_rate: 0.95 # 95% of GRs get invoices
payment_rate: 0.92 # 92% of invoices get paid
timing:
po_to_gr_days:
min: 1
max: 30
gr_to_invoice_days:
min: 1
max: 14
invoice_to_payment_days:
min: 10
max: 60
P2P Journal Entries
| Stage | Debit | Credit | Trigger |
|---|---|---|---|
| Goods Receipt | Inventory (1300) | GR/IR Clearing (2100) | GR posted |
| Invoice Receipt | GR/IR Clearing (2100) | Accounts Payable (2000) | Invoice verified |
| Payment | Accounts Payable (2000) | Cash (1000) | Payment executed |
| Price Variance | PPV Expense (5xxx) | GR/IR Clearing (2100) | Price mismatch |
Order-to-Cash (O2C)
Flow
Sales Credit Delivery Customer Customer
Order → Check → (Pick/ → Invoice → Receipt
│ Pack/ │ │
│ Ship) │ │
│ │ ▼ ▼
│ │ AR Open Item AR Aging
│ │ │
│ │ └→ Dunning (if overdue)
│ ▼
│ Inventory Issue
│ (COGS posting)
▼
Revenue Recognition
(ASC 606 / IFRS 15)
Sales Order Types
SyntheticData models 9 SO types:
| Type | Description | Requires Delivery? |
|---|---|---|
Standard | Standard sales order | Yes |
Rush | Priority/expedited order | Yes |
CashSale | Immediate payment at sale | Yes |
Return | Customer return order | No (creates return delivery) |
FreeOfCharge | No-charge delivery (samples, warranty) | Yes |
Consignment | Consignment fill-up/issue | Yes |
Service | Service order (no physical delivery) | No |
CreditMemoRequest | Request for credit memo | No |
DebitMemoRequest | Request for debit memo | No |
Delivery Types
6 delivery types model different fulfillment scenarios:
| Type | Description | Direction |
|---|---|---|
Outbound | Standard outbound delivery | Ship to customer |
Return | Customer return delivery | Receive from customer |
StockTransfer | Inter-plant stock transfer | Internal movement |
Replenishment | Replenishment delivery | Warehouse → store |
ConsignmentIssue | Issue from consignment stock | Consignment → customer |
ConsignmentReturn | Return to consignment stock | Customer → consignment |
Customer Invoice Types
7 invoice types with different accounting treatment:
| Type | Description | AR Impact |
|---|---|---|
Standard | Normal sales invoice | Creates receivable |
CreditMemo | Credit for returns/adjustments | Reduces receivable |
DebitMemo | Additional charge | Increases receivable |
ProForma | Pre-delivery invoice (no posting) | None |
DownPaymentRequest | Advance payment request | Creates special receivable |
FinalInvoice | Settles down payment | Clears down payment |
Intercompany | IC billing | Creates IC receivable |
Credit Check
Sales orders pass through credit verification before delivery:
document_flows:
o2c:
credit_check:
enabled: true
check_credit_limit: true # Verify customer limit
check_overdue: true # Check for past-due AR
block_threshold: 0.9 # Block if >90% of limit used
O2C Stage Configuration
document_flows:
o2c:
enabled: true
flow_rate: 0.3 # 30% of JEs from O2C
completion_rate: 0.95 # 95% complete full flow
stages:
so_approval_rate: 0.95 # 95% of SOs approved
credit_check_pass_rate: 0.9 # 90% pass credit check
delivery_rate: 0.98 # 98% of SOs get deliveries
invoice_rate: 0.95 # 95% of deliveries get invoices
collection_rate: 0.85 # 85% of invoices collected
timing:
so_to_delivery_days:
min: 1
max: 14
delivery_to_invoice_days:
min: 0
max: 3
invoice_to_payment_days:
min: 15
max: 90
O2C Journal Entries
| Stage | Debit | Credit | Trigger |
|---|---|---|---|
| Delivery | Cost of Goods Sold (5000) | Inventory (1300) | Goods issued |
| Invoice | Accounts Receivable (1100) | Revenue (4000) | Invoice posted |
| Receipt | Cash (1000) | Accounts Receivable (1100) | Payment received |
| Credit Memo | Revenue (4000) | Accounts Receivable (1100) | Credit issued |
Document Chain Manager
The document chain manager maintains referential integrity across the complete document flow by tracking references between documents.
Reference Types
| Type | Description | Example |
|---|---|---|
FollowOn | Next document in normal flow | PO → GR → Invoice → Payment |
Payment | Payment for invoice | PAY-001 → INV-001 |
Reversal | Correction or reversal document | CRED-001 → INV-001 |
Partial | Partial fulfillment | GR-001 (partial) → PO-001 |
CreditMemo | Credit against invoice | CM-001 → INV-001 |
DebitMemo | Debit against invoice | DM-001 → INV-001 |
Return | Return against delivery | RET-001 → DEL-001 |
IntercompanyMatch | IC matched pair | IC-INV-001 → IC-INV-002 |
Manual | User-defined reference | Any → Any |
Document Chain Output
PO-001 ─→ GR-001 ─→ INV-001 ─→ PAY-001
│ │ │ │
└──────────┴──────────┴──────────┘
Document Chain
The document_references.csv output file records all links:
| Field | Description |
|---|---|
source_document_id | Referencing document |
target_document_id | Referenced document |
reference_type | Type of reference |
created_date | Date reference was created |
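A quick referential-integrity check over document_references.csv can be written directly against these columns; the script below is illustrative, and the set of known document IDs must be collected from whichever document exports your run produced.
```python
# Illustrative referential-integrity check over document_references.csv,
# using the columns listed above. Collect `known_ids` from your run's
# document exports before calling this.
import csv

def check_references(ref_file: str, known_ids: set[str]) -> list[tuple[str, str]]:
    """Return (source, target) pairs whose target document is unknown."""
    dangling = []
    with open(ref_file, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["target_document_id"] not in known_ids:
                dangling.append((row["source_document_id"], row["target_document_id"]))
    return dangling

# Example usage with a pre-collected set of document IDs:
# dangling = check_references("output/document_references.csv", known_ids)
# print(f"{len(dangling)} dangling references")
```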
Complex Scenario Examples
Partial Deliveries with Split Invoice
document_flows:
p2p:
enabled: true
flow_rate: 0.4
completion_rate: 0.90 # 10% incomplete (partial deliveries)
three_way_match:
quantity_tolerance: 0.05 # 5% tolerance for partials
allow_over_delivery: true
max_over_delivery_pct: 0.10
timing:
po_to_gr_days: { min: 3, max: 45 } # Longer lead times
gr_to_invoice_days: { min: 1, max: 21 }
invoice_to_payment_days: { min: 30, max: 90 }
High-Volume Retail O2C
document_flows:
o2c:
enabled: true
flow_rate: 0.5 # 50% of JEs from O2C
completion_rate: 0.98 # High completion rate
stages:
so_approval_rate: 0.99 # Auto-approved
credit_check_pass_rate: 0.95
delivery_rate: 0.99
invoice_rate: 0.99
collection_rate: 0.92
timing:
so_to_delivery_days: { min: 0, max: 3 } # Fast fulfillment
delivery_to_invoice_days: { min: 0, max: 0 } # Immediate invoice
invoice_to_payment_days: { min: 10, max: 45 }
Combined Manufacturing P2P + O2C
document_flows:
p2p:
enabled: true
flow_rate: 0.35
completion_rate: 0.95
three_way_match:
quantity_tolerance: 0.02
price_tolerance: 0.01
timing:
po_to_gr_days: { min: 5, max: 30 }
gr_to_invoice_days: { min: 1, max: 10 }
invoice_to_payment_days: { min: 20, max: 45 }
o2c:
enabled: true
flow_rate: 0.35
completion_rate: 0.90
credit_check:
enabled: true
block_threshold: 0.85
timing:
so_to_delivery_days: { min: 3, max: 21 }
delivery_to_invoice_days: { min: 0, max: 2 }
invoice_to_payment_days: { min: 30, max: 60 }
Validation
| Check | Rule |
|---|---|
flow_rate | 0.0 - 1.0 |
completion_rate | 0.0 - 1.0 |
tolerance values | 0.0 - 1.0 |
timing.min | ≥ 0 |
timing.max | ≥ min |
| Stage rates | 0.0 - 1.0 |
See Also
- Subledgers — AR/AP records generated by document flows
- FX & Currency — Multi-currency document flows
- Master Data — Vendor and customer master records
- Process Chains — Enterprise process chain architecture
- Process Mining — OCEL 2.0 event logs from document flows
- datasynth-generators — Generator crate reference
Subledgers
SyntheticData generates subsidiary ledger records for Accounts Receivable (AR), Accounts Payable (AP), Fixed Assets (FA), and Inventory, with automatic GL reconciliation and document flow linking.
Overview
Subledger generators produce detailed records that reconcile back to GL control accounts:
| Subledger | Control Account | Record Types | Output Files |
|---|---|---|---|
| AR | 1100 (AR Control) | Open items, aging, receipts, credit memos, dunning | ar_open_items.csv, ar_aging.csv |
| AP | 2000 (AP Control) | Open items, aging, payment scheduling, debit memos | ap_open_items.csv, ap_aging.csv |
| FA | 1600+ (Asset accounts) | Register, depreciation, acquisitions, disposals | fa_register.csv, fa_depreciation.csv |
| Inventory | 1300 (Inventory) | Positions, movements (22 types), valuation | inventory_positions.csv, inventory_movements.csv |
Configuration
subledger:
enabled: true
ar:
enabled: true
aging_buckets: [30, 60, 90, 120] # Days
dunning_levels: 3
credit_memo_rate: 0.05 # 5% of invoices get credit memos
ap:
enabled: true
aging_buckets: [30, 60, 90, 120]
early_payment_discount_rate: 0.02
payment_scheduling: true
fa:
enabled: true
depreciation_methods:
- straight_line
- declining_balance
- sum_of_years_digits
disposal_rate: 0.03 # 3% of assets disposed per year
inventory:
enabled: true
valuation_method: standard_cost # standard_cost, moving_average, fifo, lifo
cycle_count_frequency: monthly
Accounts Receivable (AR)
Record Types
The AR subledger generates:
- Open Items: Outstanding customer invoices with aging classification
- Receipts: Customer payments applied to invoices (full, partial, on-account)
- Credit Memos: Credits issued for returns, disputes, or pricing adjustments
- Aging Reports: Aged balances by customer and aging bucket
- Dunning Notices: Automated collection notices at configurable levels
Open Item Fields
| Field | Description |
|---|---|
customer_id | Customer reference |
invoice_number | Document number |
invoice_date | Issue date |
due_date | Payment due date |
original_amount | Invoice total |
open_amount | Remaining balance |
currency | Invoice currency |
payment_terms | Net 30, Net 60, etc. |
aging_bucket | 0-30, 31-60, 61-90, 91-120, 120+ |
dunning_level | Current dunning level (0-3) |
last_dunning_date | Date of last dunning notice |
dispute_flag | Whether item is disputed |
Aging Buckets
Default aging buckets classify receivables by days past due:
| Bucket | Range | Typical % |
|---|---|---|
| Current | 0-30 days | 65-75% |
| 31-60 | 31-60 days | 12-18% |
| 61-90 | 61-90 days | 5-8% |
| 91-120 | 91-120 days | 2-4% |
| 120+ | Over 120 days | 1-3% |
Dunning Process
Dunning generates progressively urgent collection notices:
| Level | Days Overdue | Action |
|---|---|---|
| 0 | 0-30 | No action (within terms) |
| 1 | 31-60 | Friendly reminder |
| 2 | 61-90 | Formal notice |
| 3 | 90+ | Final demand / collections |
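The aging buckets and dunning levels above can be combined into a single classification step. The sketch below is illustrative, not the generator's implementation:
```python
# Sketch: classify an AR open item by days past due into the default aging
# bucket and dunning level shown in the tables above.
from datetime import date

def classify_open_item(due_date: date, as_of: date) -> tuple[str, int]:
    days_overdue = max(0, (as_of - due_date).days)
    if days_overdue <= 30:
        return "0-30", 0          # within terms, no dunning action
    if days_overdue <= 60:
        return "31-60", 1         # friendly reminder
    if days_overdue <= 90:
        return "61-90", 2         # formal notice
    if days_overdue <= 120:
        return "91-120", 3        # final demand / collections
    return "120+", 3

print(classify_open_item(date(2024, 3, 1), date(2024, 5, 15)))  # ('61-90', 2)
```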
Document Flow Integration
AR open items are created from O2C customer invoices:
Sales Order → Delivery → Customer Invoice → AR Open Item → Customer Receipt
│
└→ Dunning Notice (if overdue)
Accounts Payable (AP)
Record Types
The AP subledger generates:
- Open Items: Outstanding vendor invoices with aging and payment scheduling
- Payments: Vendor payment runs (check, wire, ACH)
- Debit Memos: Deductions for quality issues, returns, pricing errors
- Aging Reports: Aged payables by vendor
- Payment Scheduling: Planned payments considering cash flow and discounts
Open Item Fields
| Field | Description |
|---|---|
vendor_id | Vendor reference |
invoice_number | Vendor invoice number |
invoice_date | Invoice receipt date |
due_date | Payment due date |
baseline_date | Date for terms calculation |
original_amount | Invoice total |
open_amount | Remaining balance |
currency | Invoice currency |
payment_terms | 2/10 Net 30, etc. |
discount_date | Discount deadline |
discount_amount | Available discount |
payment_block | Block code (if blocked) |
three_way_match_status | Matched / Variance / Blocked |
Early Payment Discounts
The AP generator models cash discount optimization:
Payment Terms: 2/10 Net 30
→ Pay within 10 days: 2% discount
→ Pay by day 30: full amount
→ Past day 30: overdue
early_payment_discount_rate: 0.02 # Take 2% discount when offered
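For 2/10 Net 30 terms the trade-off is simple arithmetic; the numbers below are illustrative inputs, not defaults of the tool:
```python
# Illustrative: evaluate a 2/10 Net 30 invoice. Paying within the discount
# window saves 2%; the implied annualized benefit of the discount is high.
invoice_amount = 10_000.00
discount_rate = 0.02            # 2/10 Net 30
discount_days, net_days = 10, 30

discounted_payment = invoice_amount * (1 - discount_rate)
implied_annual_rate = (discount_rate / (1 - discount_rate)) * (365 / (net_days - discount_days))
print(f"pay {discounted_payment:,.2f} by day {discount_days}, "
      f"or {invoice_amount:,.2f} by day {net_days} "
      f"(~{implied_annual_rate:.0%} annualized benefit)")
```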
Payment Scheduling
When enabled, the AP generator creates a payment schedule that optimizes:
- Discount capture: Prioritize invoices with expiring discounts
- Cash flow: Spread payments across the period
- Vendor priority: Pay critical vendors first
Document Flow Integration
AP open items are created from P2P vendor invoices:
Purchase Order → Goods Receipt → Vendor Invoice → Three-Way Match → AP Open Item → Payment
│
└→ Debit Memo (if variance)
Fixed Assets (FA)
Record Types
The FA subledger generates:
- Asset Register: Master record for each fixed asset
- Depreciation Schedule: Monthly depreciation entries per asset
- Acquisitions: New asset additions (from PO or direct capitalization)
- Disposals: Asset retirements, sales, scrapping
- Transfers: Inter-company or inter-department transfers
- Impairment: Write-downs when fair value drops below book value
Asset Register Fields
| Field | Description |
|---|---|
asset_id | Unique identifier |
description | Asset name/description |
asset_class | Buildings, Equipment, Vehicles, IT, Furniture |
acquisition_date | Purchase/capitalization date |
acquisition_cost | Original cost |
useful_life_years | Depreciable life |
salvage_value | Residual value |
depreciation_method | Method used |
accumulated_depreciation | Total depreciation to date |
net_book_value | Current carrying value |
disposal_date | Date retired (if applicable) |
disposal_proceeds | Sale price (if sold) |
disposal_gain_loss | Gain or loss on disposal |
Depreciation Methods
| Method | Description | Use Case |
|---|---|---|
StraightLine | Equal amounts each period | Default, most common |
DecliningBalance { rate } | Fixed percentage of remaining balance | Accelerated (tax) |
SumOfYearsDigits | Decreasing fractions of depreciable base | Accelerated |
UnitsOfProduction { total_units } | Based on usage/output | Manufacturing equipment |
None | No depreciation | Land, construction in progress |
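As a quick illustration of how the first two methods differ (annual amounts with made-up inputs; the generator itself posts monthly entries):
```python
# Illustrative annual depreciation schedules for straight-line and
# declining-balance methods. Inputs are examples, not tool defaults.
def straight_line(cost, salvage, life_years):
    return [(cost - salvage) / life_years] * life_years

def declining_balance(cost, salvage, life_years, rate=0.4):
    book, schedule = cost, []
    for _ in range(life_years):
        dep = min(book * rate, max(book - salvage, 0.0))  # never depreciate below salvage
        schedule.append(dep)
        book -= dep
    return schedule

print(straight_line(cost=50_000, salvage=5_000, life_years=5))   # 9,000 per year
print([round(d) for d in declining_balance(50_000, 5_000, 5)])   # accelerated profile
```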
Depreciation Journal Entries
Each period, the FA generator creates depreciation entries:
| Debit | Credit | Amount |
|---|---|---|
| Depreciation Expense (6xxx) | Accumulated Depreciation (1650) | Period depreciation |
Disposal Accounting
When an asset is disposed:
| Scenario | Debit | Credit |
|---|---|---|
| Sale at gain | Cash, Accum Depr | Asset Cost, Gain on Disposal |
| Sale at loss | Cash, Accum Depr, Loss on Disposal | Asset Cost |
| Scrapping | Accum Depr, Loss on Disposal | Asset Cost |
Inventory
Record Types
The Inventory subledger generates:
- Positions: Current stock levels by material, plant, and storage location
- Movements: 22 movement types covering receipts, issues, transfers, and adjustments
- Valuation: Inventory value calculated using configurable valuation methods
Position Fields
| Field | Description |
|---|---|
material_id | Material reference |
plant | Plant/warehouse code |
storage_location | Storage location within plant |
quantity | Units on hand |
unit_of_measure | UOM |
unit_cost | Per-unit cost |
total_value | Extended value |
valuation_method | StandardCost, MovingAverage, FIFO, LIFO |
stock_status | Unrestricted, QualityInspection, Blocked |
last_movement_date | Date of last stock change |
Movement Types (22 types)
| Category | Movement Type | Description |
|---|---|---|
| Goods Receipt | GoodsReceiptPO | Receipt against purchase order |
| | GoodsReceiptProduction | Receipt from production order |
| | GoodsReceiptOther | Receipt without reference |
| | GoodsReceipt | Generic goods receipt |
| Returns | ReturnToVendor | Return materials to vendor |
| Goods Issue | GoodsIssueSales | Issue for sales order / delivery |
| | GoodsIssueProduction | Issue to production order |
| | GoodsIssueCostCenter | Issue to cost center (consumption) |
| | GoodsIssueScrapping | Scrap disposal |
| | GoodsIssue | Generic goods issue |
| | Scrap | Alias for scrapping |
| Transfers | TransferPlant | Between plants |
| | TransferStorageLocation | Between storage locations |
| | TransferIn | Inbound transfer |
| | TransferOut | Outbound transfer |
| | TransferToInspection | Move to quality inspection |
| | TransferFromInspection | Release from quality inspection |
| Adjustments | PhysicalInventory | Physical count difference |
| | InventoryAdjustmentIn | Positive adjustment |
| | InventoryAdjustmentOut | Negative adjustment |
| | InitialStock | Initial stock entry |
| Reversals | ReversalGoodsReceipt | Reverse a goods receipt |
| | ReversalGoodsIssue | Reverse a goods issue |
Valuation Methods
| Method | Description | Use Case |
|---|---|---|
StandardCost | Fixed cost per unit, variances posted separately | Manufacturing |
MovingAverage | Weighted average of all receipts | General purpose |
FIFO | First-in, first-out costing | Perishable goods |
LIFO | Last-in, first-out costing | Tax optimization (where permitted) |
Cycle Counting (v0.6.0)
The cycle_count_frequency setting controls how often physical inventory counts are performed. Cycle counting generates PhysicalInventory movement records that reconcile book quantities against counted quantities:
subledger:
inventory:
enabled: true
cycle_count_frequency: monthly # monthly, quarterly, annual
| Frequency | Behavior |
|---|---|
monthly | Each storage location counted once per month on a rolling basis |
quarterly | Full count once per quarter, with high-value items counted monthly |
annual | Single year-end wall-to-wall count |
Cycle count differences generate adjustment entries (InventoryAdjustmentIn or InventoryAdjustmentOut) and are flagged in the quality labels output for audit trail analysis.
Quality Inspection (v0.6.0)
Inventory positions can be placed in quality inspection status via TransferToInspection movements. This models the inspection hold process common in manufacturing and pharmaceutical industries:
Goods Receipt → Transfer to Inspection → QC Hold → Transfer from Inspection → Unrestricted Use
└→ Scrap (if rejected)
The rate of items routed through inspection depends on the material type and vendor scorecard grades (when source_to_pay is enabled). Materials from vendors with grade C or lower are routed through inspection at a higher rate.
Inventory Journal Entries
| Movement | Debit | Credit |
|---|---|---|
| Goods Receipt (PO) | Inventory | GR/IR Clearing |
| Goods Issue (Sales) | COGS | Inventory |
| Goods Issue (Production) | WIP | Inventory |
| Scrap | Scrap Expense | Inventory |
| Physical Count (surplus) | Inventory | Inventory Adjustment |
| Physical Count (shortage) | Inventory Adjustment | Inventory |
GL Reconciliation
The subledger generators ensure that subledger balances reconcile to GL control accounts:
GL Control Account Balance = Σ Subledger Open Items
AR Control (1100) = Σ AR Open Items
AP Control (2000) = Σ AP Open Items
Inventory (1300) = Σ Inventory Position Values
FA Gross (1600) = Σ FA Acquisition Costs
Accum Depr (1650) = Σ FA Accumulated Depreciation
Reconciliation is validated by the datasynth-eval coherence module and any differences are flagged as potential data quality issues.
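A minimal sketch of that rule, using hypothetical types rather than the actual datasynth-eval API:
use rust_decimal::Decimal;
/// A single open item in any subledger (AR, AP, inventory position, FA record).
struct OpenItem { amount: Decimal }
/// Reconcile a GL control account balance against the sum of its subledger items.
/// Returns the difference when it exceeds the tolerance, i.e. a potential issue.
fn reconcile(control_balance: Decimal, items: &[OpenItem], tolerance: Decimal) -> Option<Decimal> {
    let subledger_total: Decimal = items.iter().map(|i| i.amount).sum();
    let diff = (control_balance - subledger_total).abs();
    if diff > tolerance { Some(diff) } else { None }
}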
Output Files
| File | Content |
|---|---|
subledgers/ar_open_items.csv | AR outstanding invoices |
subledgers/ar_aging.csv | AR aging analysis |
subledgers/ap_open_items.csv | AP outstanding invoices |
subledgers/ap_aging.csv | AP aging analysis |
subledgers/fa_register.csv | Fixed asset master records |
subledgers/fa_depreciation.csv | Depreciation schedule entries |
subledgers/inventory_positions.csv | Current stock positions |
subledgers/inventory_movements.csv | Stock movement history |
See Also
- Document Flows — P2P and O2C document chains
- Financial Settings — Balance and period close config
- FX & Currency — Multi-currency subledger support
- datasynth-generators — Generator crate reference
FX & Currency
SyntheticData generates realistic foreign exchange rates, currency translation entries, and cumulative translation adjustments (CTA) for multi-currency enterprise simulation.
Overview
The FX module in datasynth-generators provides three generators:
| Generator | Purpose | Output |
|---|---|---|
| FX Rate Service | Daily exchange rates via Ornstein-Uhlenbeck process | fx/daily_rates.csv, fx/period_rates.csv |
| Currency Translator | Translate foreign-currency financials to reporting currency | consolidation/currency_translation.csv |
| CTA Generator | Cumulative Translation Adjustment for consolidation | consolidation/cta_entries.csv |
Configuration
fx:
enabled: true
base_currency: USD # Reporting/functional currency
currencies:
- code: EUR
initial_rate: 1.10
volatility: 0.08
mean_reversion: 0.05
- code: GBP
initial_rate: 1.27
volatility: 0.07
mean_reversion: 0.04
- code: JPY
initial_rate: 0.0067
volatility: 0.10
mean_reversion: 0.06
- code: CHF
initial_rate: 1.12
volatility: 0.06
mean_reversion: 0.03
translation:
method: current_rate # current_rate, temporal, monetary_non_monetary
equity_at_historical: true
income_at_average: true
cta:
enabled: true
equity_account: "3900" # CTA equity account
FX Rate Service
Ornstein-Uhlenbeck Process
Exchange rates are generated using a mean-reverting stochastic process (Ornstein-Uhlenbeck), which models the tendency of exchange rates to revert toward a long-term equilibrium:
dX(t) = θ(μ - X(t))dt + σdW(t)
where:
X(t) = log exchange rate at time t
θ = mean reversion speed (mean_reversion config)
μ = long-term mean (derived from initial_rate)
σ = volatility
dW(t) = Wiener process (random walk)
This produces rates that:
- Mean-revert: Rates drift back toward the initial level over time
- Have realistic volatility: Day-to-day movements match configurable volatility targets
- Are serially correlated: Today’s rate depends on yesterday’s rate (not i.i.d.)
- Are deterministic: Given the same seed, rates are exactly reproducible
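A minimal sketch of one possible discretization (Euler–Maruyama on the log rate, Box–Muller for the normal draws); parameter names mirror the config above, though the generator's internal stepping may differ:
use rand::{rngs::StdRng, Rng, SeedableRng};
/// Simulate daily spot rates from a mean-reverting (Ornstein-Uhlenbeck) process
/// on the log rate. mean_reversion = θ, volatility = σ, μ = ln(initial_rate).
fn simulate_rates(initial_rate: f64, mean_reversion: f64, volatility: f64, days: usize, seed: u64) -> Vec<f64> {
    let mut rng = StdRng::seed_from_u64(seed); // same seed -> same rate path
    let mu = initial_rate.ln();                // long-term mean of the log rate
    let mut x = mu;
    let dt = 1.0_f64;                          // one business day per step
    (0..days)
        .map(|_| {
            // Box-Muller: two uniform draws -> one standard normal draw
            let u1: f64 = 1.0 - rng.gen::<f64>(); // in (0, 1], safe for ln()
            let u2: f64 = rng.gen();
            let z = (-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos();
            x += mean_reversion * (mu - x) * dt + volatility * dt.sqrt() * z;
            x.exp()                            // back from log space to a spot rate
        })
        .collect()
}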
Rate Types
| Rate Type | Usage | Calculation |
|---|---|---|
| Daily spot | Transaction-date rates | O-U process output for each business day |
| Period average | Income statement translation | Arithmetic mean of daily rates within the period |
| Period closing | Balance sheet translation | Last business day rate in the period |
| Historical | Equity items | Rate at the date equity was contributed |
Output: daily_rates.csv
| Field | Description |
|---|---|
date | Business day |
from_currency | Source currency (e.g., EUR) |
to_currency | Target currency (e.g., USD) |
spot_rate | Daily spot rate |
inverse_rate | 1 / spot_rate |
Output: period_rates.csv
| Field | Description |
|---|---|
period | Fiscal period (YYYY-MM) |
from_currency | Source currency |
to_currency | Target currency |
average_rate | Period average |
closing_rate | Period-end closing rate |
Currency Translation
Translation Methods
SyntheticData supports three standard currency translation methods:
Current Rate Method (ASC 830 / IAS 21 — default)
The most common method, used for foreign subsidiaries whose functional currency differs from the reporting currency:
| Item | Rate Used |
|---|---|
| Assets | Closing rate |
| Liabilities | Closing rate |
| Equity (contributed capital) | Historical rate |
| Equity (retained earnings) | Rolled-forward |
| Revenue | Average rate |
| Expenses | Average rate |
| Dividends | Rate on declaration date |
| CTA | Balancing item → Equity |
Temporal Method (ASC 830)
Used when the foreign operation’s functional currency is the parent’s currency (e.g., highly inflationary economies):
| Item | Rate Used |
|---|---|
| Monetary assets/liabilities | Closing rate |
| Non-monetary assets (at cost) | Historical rate |
| Non-monetary assets (at fair value) | Rate at fair value date |
| Revenue | Average rate |
| Expenses | Average rate |
| Depreciation | Historical rate of related asset |
| Remeasurement gain/loss | Income statement |
Monetary/Non-Monetary Method
| Item | Rate Used |
|---|---|
| Monetary items | Closing rate |
| Non-monetary items | Historical rate |
Translation Configuration
fx:
translation:
method: current_rate # current_rate | temporal | monetary_non_monetary
equity_at_historical: true
income_at_average: true
CTA Generator
The Cumulative Translation Adjustment arises because assets/liabilities are translated at closing rates while equity is at historical rates. The CTA is posted to Other Comprehensive Income (OCI) in equity:
CTA = Translated Net Assets (at closing rate)
- Translated Equity (at historical rates)
- Translated Net Income (at average rate)
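Expressed as code, the balancing calculation is a simple difference (a sketch using rust_decimal, not the generator's exact signature):
use rust_decimal::Decimal;
/// CTA is the plug that keeps the translated balance sheet in balance:
/// net assets at the closing rate, less equity at historical rates,
/// less net income at the average rate.
fn cta_for_period(
    net_assets_at_closing: Decimal,
    equity_at_historical: Decimal,
    net_income_at_average: Decimal,
) -> Decimal {
    net_assets_at_closing - equity_at_historical - net_income_at_average
}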
CTA Journal Entry
| Debit | Credit | Description |
|---|---|---|
| CTA (Equity 3900) | Various BS accounts | Translation adjustment for period |
The CTA accumulates over time and is only recycled to the income statement when a foreign subsidiary is disposed of.
Configuration
fx:
cta:
enabled: true
equity_account: "3900" # OCI - CTA account
Multi-Currency Company Configuration
Multi-currency scenarios require companies with different functional currencies:
companies:
- code: C001
name: "US Parent Corp"
currency: USD
country: US
- code: C002
name: "European Subsidiary"
currency: EUR
country: DE
- code: C003
name: "UK Subsidiary"
currency: GBP
country: GB
- code: C004
name: "Japan Subsidiary"
currency: JPY
country: JP
fx:
enabled: true
base_currency: USD
currencies:
- { code: EUR, initial_rate: 1.10, volatility: 0.08, mean_reversion: 0.05 }
- { code: GBP, initial_rate: 1.27, volatility: 0.07, mean_reversion: 0.04 }
- { code: JPY, initial_rate: 0.0067, volatility: 0.10, mean_reversion: 0.06 }
intercompany:
enabled: true
# IC transactions generate FX exposure
Output Files
| File | Content |
|---|---|
fx/daily_rates.csv | Daily spot rates for all currency pairs |
fx/period_rates.csv | Period average and closing rates |
consolidation/currency_translation.csv | Translation entries per entity/period |
consolidation/cta_entries.csv | CTA adjustments (if CTA enabled) |
consolidation/consolidated_trial_balance.csv | Translated and consolidated TB |
See Also
- Financial Settings — Intercompany and consolidation config
- Intercompany Processing — IC matching and elimination
- Subledgers — Multi-currency subledger records
- Period Close Engine — Month-end FX revaluation
Financial Settings
Financial settings control balance coherence, subledger generation, foreign exchange, and period close behavior.
Balance Configuration
balance:
opening_balance:
enabled: true
total_assets: 10000000
coherence_check:
enabled: true
tolerance: 0.01
Opening Balance
Generate coherent opening balance sheet:
balance:
opening_balance:
enabled: true
total_assets: 10000000 # Total asset value
structure: # Balance sheet structure
current_assets: 0.3
fixed_assets: 0.5
other_assets: 0.2
current_liabilities: 0.2
long_term_debt: 0.3
equity: 0.5
Balance Coherence
Verify accounting equation:
balance:
coherence_check:
enabled: true # Verify Assets = L + E
tolerance: 0.01 # Allowed rounding variance
frequency: monthly # When to check
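The check itself is just the accounting equation within the configured tolerance; a minimal sketch:
use rust_decimal::Decimal;
/// Verify Assets = Liabilities + Equity within the configured rounding tolerance.
fn coherence_ok(assets: Decimal, liabilities: Decimal, equity: Decimal, tolerance: Decimal) -> bool {
    (assets - (liabilities + equity)).abs() <= tolerance
}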
Subledger Configuration
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90, 120]
ap:
enabled: true
aging_buckets: [30, 60, 90]
fixed_assets:
enabled: true
depreciation_methods:
- straight_line
- declining_balance
inventory:
enabled: true
valuation_methods:
- fifo
- weighted_average
Accounts Receivable
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90, 120] # Aging period boundaries
collection:
on_time_rate: 0.7 # % paid within terms
write_off_rate: 0.02 # % written off
reconciliation:
enabled: true # Reconcile to GL
control_account: "1100" # AR control account
Accounts Payable
subledger:
ap:
enabled: true
aging_buckets: [30, 60, 90]
payment:
discount_usage_rate: 0.3 # % taking early pay discount
late_payment_rate: 0.1 # % paid late
reconciliation:
enabled: true
control_account: "2000" # AP control account
Fixed Assets
subledger:
fixed_assets:
enabled: true
depreciation_methods:
- method: straight_line
weight: 0.7
- method: declining_balance
rate: 0.2
weight: 0.2
- method: units_of_production
weight: 0.1
disposal:
rate: 0.05 # Annual disposal rate
gain_loss_account: "8000" # Gain/loss account
reconciliation:
enabled: true
control_accounts:
asset: "1500"
depreciation: "1510"
Inventory
subledger:
inventory:
enabled: true
valuation_methods:
- method: fifo
weight: 0.3
- method: weighted_average
weight: 0.5
- method: standard_cost
weight: 0.2
movements:
receipt_weight: 0.4
issue_weight: 0.4
adjustment_weight: 0.1
transfer_weight: 0.1
reconciliation:
enabled: true
control_account: "1200"
FX Configuration
fx:
enabled: true
base_currency: USD
currency_pairs:
- EUR
- GBP
- CHF
- JPY
volatility: 0.01
translation:
method: current_rate
Exchange Rates
fx:
enabled: true
base_currency: USD # Reporting currency
currency_pairs: # Currencies to generate
- EUR
- GBP
- CHF
rate_types:
- spot # Daily spot rates
- closing # Period closing rates
- average # Period average rates
volatility: 0.01 # Daily volatility
mean_reversion: 0.1 # Ornstein-Uhlenbeck parameter
Currency Translation
fx:
translation:
method: current_rate # current_rate, temporal
rate_mapping:
assets: closing_rate
liabilities: closing_rate
equity: historical_rate
revenue: average_rate
expense: average_rate
cta_account: "3500" # CTA equity account
Period Close Configuration
period_close:
enabled: true
monthly:
accruals: true
depreciation: true
quarterly:
intercompany_elimination: true
annual:
closing_entries: true
retained_earnings: true
Monthly Close
period_close:
monthly:
accruals:
enabled: true
auto_reverse: true # Reverse in next period
categories:
- expense_accrual
- revenue_accrual
- payroll_accrual
depreciation:
enabled: true
run_date: last_day # When to run
reconciliation:
enabled: true
subledger_to_gl: true
Quarterly Close
period_close:
quarterly:
intercompany_elimination:
enabled: true
types:
- intercompany_sales
- intercompany_profit
- intercompany_dividends
currency_translation:
enabled: true
Annual Close
period_close:
annual:
closing_entries:
enabled: true
close_revenue: true
close_expense: true
retained_earnings:
enabled: true
account: "3100"
year_end_adjustments:
- bad_debt_provision
- inventory_reserve
- bonus_accrual
Combined Example
balance:
opening_balance:
enabled: true
total_assets: 50000000
coherence_check:
enabled: true
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90, 120, 180]
ap:
enabled: true
aging_buckets: [30, 60, 90]
fixed_assets:
enabled: true
inventory:
enabled: true
fx:
enabled: true
base_currency: USD
currency_pairs: [EUR, GBP, CHF, JPY, CNY]
volatility: 0.012
period_close:
enabled: true
monthly:
accruals: true
depreciation: true
quarterly:
intercompany_elimination: true
annual:
closing_entries: true
retained_earnings: true
Financial Reporting (v0.6.0)
The financial_reporting section generates structured financial statements, management KPIs, and budgets derived from the underlying journal entries, trial balances, and period close data.
Financial Statements
financial_reporting:
enabled: true
generate_balance_sheet: true # Balance sheet
generate_income_statement: true # Income statement / P&L
generate_cash_flow: true # Cash flow statement
generate_changes_in_equity: true # Statement of changes in equity
comparative_periods: 1 # Number of prior-period comparatives
When enabled, the generator produces financial statements at each period close. The comparative_periods setting controls how many prior periods are included for comparative analysis. Statements are aggregated from the trial balance and subledger data, ensuring consistency with the underlying journal entries.
Management KPIs
financial_reporting:
management_kpis:
enabled: true
frequency: "monthly" # monthly or quarterly
Management KPIs include ratios and metrics computed from the generated financial data:
| KPI Category | Examples |
|---|---|
| Liquidity | Current ratio, quick ratio, cash conversion cycle |
| Profitability | Gross margin, operating margin, ROE, ROA |
| Efficiency | Inventory turnover, receivables turnover, asset turnover |
| Leverage | Debt-to-equity, interest coverage |
Budgets
financial_reporting:
budgets:
enabled: true
revenue_growth_rate: 0.05 # 5% expected growth
expense_inflation_rate: 0.03 # 3% cost inflation
variance_noise: 0.10 # 10% random noise on actuals vs budget
Budget generation creates a budget line for each GL account based on prior-period actuals, adjusted by the configured growth and inflation rates. The variance_noise parameter controls the spread between budget and actual figures, producing realistic budget-to-actual variance reports.
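A sketch of the budget-versus-actual mechanics these settings imply (hypothetical helper; noise drawn uniformly within ±variance_noise):
use rand::{rngs::StdRng, Rng, SeedableRng};
/// Derive a budget line from prior-period actuals, then perturb the
/// current-period actual around it to create realistic variances.
fn budget_and_actual(prior_actual: f64, growth_rate: f64, variance_noise: f64, seed: u64) -> (f64, f64) {
    let mut rng = StdRng::seed_from_u64(seed);
    let budget = prior_actual * (1.0 + growth_rate);              // e.g. revenue_growth_rate = 0.05
    let shock = rng.gen_range(-variance_noise..=variance_noise);  // e.g. variance_noise = 0.10
    let actual = budget * (1.0 + shock);
    (budget, actual)
}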
See Also
Compliance
Compliance settings control fraud injection, internal controls, and approval workflows.
Fraud Configuration
fraud:
enabled: true
fraud_rate: 0.005
types:
fictitious_transaction: 0.15
revenue_manipulation: 0.10
expense_capitalization: 0.10
split_transaction: 0.15
round_tripping: 0.05
kickback_scheme: 0.10
ghost_employee: 0.05
duplicate_payment: 0.15
unauthorized_discount: 0.10
suspense_abuse: 0.05
Fraud Rate
Overall percentage of fraudulent transactions:
fraud:
enabled: true
fraud_rate: 0.005 # 0.5% fraud rate
fraud_rate: 0.01 # 1% fraud rate
fraud_rate: 0.001 # 0.1% fraud rate
Fraud Types
| Type | Description |
|---|---|
fictitious_transaction | Completely fabricated entries |
revenue_manipulation | Premature/delayed revenue recognition |
expense_capitalization | Improper capitalization of expenses |
split_transaction | Split to avoid approval thresholds |
round_tripping | Circular transactions to inflate revenue |
kickback_scheme | Vendor kickback arrangements |
ghost_employee | Payments to non-existent employees |
duplicate_payment | Same invoice paid multiple times |
unauthorized_discount | Unapproved customer discounts |
suspense_abuse | Hiding items in suspense accounts |
Fraud Patterns
fraud:
patterns:
threshold_adjacent:
enabled: true
threshold: 10000 # Approval threshold
range: 0.1 # % below threshold
time_based:
weekend_preference: 0.3 # Weekend entry rate
after_hours_preference: 0.2 # After hours rate
entity_targeting:
repeat_offender_rate: 0.4 # Rate at which the same user commits multiple frauds
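To make threshold_adjacent concrete, a sketch of how an amount just under an approval threshold could be drawn (hypothetical helper, not the injection engine's API):
use rand::{rngs::StdRng, Rng, SeedableRng};
/// Draw a fraudulent amount in [threshold * (1 - range), threshold),
/// i.e. just below the approval threshold so it escapes higher-level review.
fn threshold_adjacent_amount(threshold: f64, range: f64, seed: u64) -> f64 {
    let mut rng = StdRng::seed_from_u64(seed);
    rng.gen_range(threshold * (1.0 - range)..threshold)
}
With threshold: 10000 and range: 0.1, generated amounts fall between 9,000 and just under 10,000, staying beneath the level that would trigger additional approval.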
Internal Controls Configuration
internal_controls:
enabled: true
controls:
- id: "CTL-001"
name: "Payment Approval"
type: preventive
frequency: continuous
assertions:
- authorization
- validity
sod_rules:
- conflict_type: create_approve
processes: [ap_invoice, ap_payment]
Control Definition
internal_controls:
controls:
- id: "CTL-001"
name: "Payment Approval"
description: "Payments require manager approval"
type: preventive # preventive, detective
frequency: continuous # continuous, daily, weekly, monthly
assertions:
- authorization
- validity
- completeness
accounts: ["2000"] # Applicable accounts
threshold: 5000 # Trigger threshold
- id: "CTL-002"
name: "Journal Entry Review"
type: detective
frequency: daily
assertions:
- accuracy
- completeness
Control Types
| Type | Description |
|---|---|
preventive | Prevents errors/fraud before occurrence |
detective | Detects errors/fraud after occurrence |
Control Assertions
| Assertion | Description |
|---|---|
authorization | Proper approval obtained |
validity | Transaction is legitimate |
completeness | All transactions recorded |
accuracy | Amounts are correct |
cutoff | Recorded in correct period |
classification | Properly categorized |
Segregation of Duties
internal_controls:
sod_rules:
- conflict_type: create_approve
processes: [ap_invoice, ap_payment]
description: "Cannot create and approve payments"
- conflict_type: create_approve
processes: [ar_invoice, ar_receipt]
- conflict_type: custody_recording
processes: [cash_handling, cash_recording]
- conflict_type: authorization_custody
processes: [vendor_master, ap_payment]
SoD Conflict Types
| Type | Description |
|---|---|
create_approve | Create and approve same transaction |
custody_recording | Physical custody and recording |
authorization_custody | Authorization and physical access |
create_modify | Create and modify master data |
Approval Configuration
approval:
enabled: true
thresholds:
- level: 1
name: "Clerk"
max_amount: 5000
- level: 2
name: "Supervisor"
max_amount: 25000
- level: 3
name: "Manager"
max_amount: 100000
- level: 4
name: "Director"
max_amount: 500000
- level: 5
name: "Executive"
max_amount: null # Unlimited
Approval Thresholds
approval:
thresholds:
- level: 1
name: "Level 1 - Clerk"
max_amount: 5000
auto_approve: false
- level: 2
name: "Level 2 - Supervisor"
max_amount: 25000
auto_approve: false
- level: 3
name: "Level 3 - Manager"
max_amount: 100000
requires_dual: false # Single approver
- level: 4
name: "Level 4 - Director"
max_amount: 500000
requires_dual: true # Dual approval required
Approval Process
approval:
process:
workflow: hierarchical # hierarchical, matrix
escalation_days: 3 # Auto-escalate after N days
reminder_days: 1 # Send reminder after N days
exceptions:
recurring_exempt: true # Skip for recurring entries
system_exempt: true # Skip for system entries
Combined Example
fraud:
enabled: true
fraud_rate: 0.005
types:
fictitious_transaction: 0.15
split_transaction: 0.20
duplicate_payment: 0.15
ghost_employee: 0.10
kickback_scheme: 0.10
revenue_manipulation: 0.10
expense_capitalization: 0.10
unauthorized_discount: 0.10
internal_controls:
enabled: true
controls:
- id: "SOX-001"
name: "Payment Authorization"
type: preventive
frequency: continuous
threshold: 10000
- id: "SOX-002"
name: "JE Review"
type: detective
frequency: daily
sod_rules:
- conflict_type: create_approve
processes: [ap_invoice, ap_payment]
- conflict_type: create_approve
processes: [ar_invoice, ar_receipt]
- conflict_type: create_modify
processes: [vendor_master, ap_invoice]
approval:
enabled: true
thresholds:
- level: 1
max_amount: 5000
- level: 2
max_amount: 25000
- level: 3
max_amount: 100000
- level: 4
max_amount: null
Validation
| Check | Rule |
|---|---|
fraud_rate | 0.0 - 1.0 |
fraud.types | Sum = 1.0 |
control.id | Unique |
thresholds | Strictly ascending |
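A sketch of the checks in this table (hypothetical helpers; the actual rules live in datasynth-config validation):
/// Fraud type weights must sum to 1.0 (within floating-point tolerance).
fn fraud_types_valid(weights: &[f64]) -> bool {
    (weights.iter().sum::<f64>() - 1.0).abs() < 1e-9
}
/// Approval thresholds must be strictly ascending; `null` (unlimited)
/// may only appear as the final level.
fn thresholds_valid(max_amounts: &[Option<f64>]) -> bool {
    let mut prev = f64::MIN;
    for (i, amount) in max_amounts.iter().enumerate() {
        match amount {
            Some(a) if *a > prev => prev = *a,
            None if i == max_amounts.len() - 1 => {} // unlimited top level
            _ => return false,
        }
    }
    true
}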
Synthetic Data Certificates (v0.5.0)
Certificates provide cryptographic proof of the privacy guarantees and quality metrics of generated data.
certificates:
enabled: true
issuer: "DataSynth"
include_quality_metrics: true
When enabled, a certificate.json file is produced alongside the output containing:
- DP Guarantee: Mechanism (Laplace/Gaussian), epsilon, delta, composition method
- Quality Metrics: Benford MAD, correlation preservation, statistical fidelity, MIA AUC
- Config Hash: SHA-256 hash of the generation configuration
- Signature: HMAC-SHA256 signature for tamper detection
- Fingerprint Hash: Hash of source fingerprint (if fingerprint-based generation)
The certificate can be embedded in Parquet file metadata or included as a separate JSON file.
# Generate with certificate
datasynth-data generate --config config.yaml --output ./output --certificate
# Certificate is written to ./output/certificate.json
See Also
Output Settings
Output settings control file formats and organization.
Configuration
output:
format: csv
compression: none
compression_level: 6
files:
journal_entries: true
acdoca: true
master_data: true
documents: true
subledgers: true
trial_balances: true
labels: true
controls: true
Format
Output file format selection.
output:
format: csv # CSV format (default)
format: json # JSON format
format: jsonl # Newline-delimited JSON
format: parquet # Apache Parquet columnar
format: sap # SAP S/4HANA table format
format: oracle # Oracle EBS GL tables
format: netsuite # NetSuite journal entries
CSV Format
Standard comma-separated values:
document_id,posting_date,company_code,account,debit,credit
abc-123,2024-01-15,1000,1100,"1000.00","0.00"
abc-123,2024-01-15,1000,4000,"0.00","1000.00"
Characteristics:
- UTF-8 encoding
- Header row included
- Quoted strings when needed
- Decimals as strings
JSON Format
Structured JSON with nested objects:
[
{
"header": {
"document_id": "abc-123",
"posting_date": "2024-01-15",
"company_code": "1000"
},
"lines": [
{"account": "1100", "debit": "1000.00", "credit": "0.00"},
{"account": "4000", "debit": "0.00", "credit": "1000.00"}
]
}
]
Parquet Format
Apache Parquet columnar format for analytics:
output:
format: parquet
compression: snappy # snappy (default), gzip, zstd
Parquet files are self-describing with embedded schema and support columnar compression. Ideal for Spark, DuckDB, Polars, pandas, and cloud data warehouses.
ERP Formats
Export in native ERP table schemas for load testing and integration validation:
# SAP S/4HANA
output:
format: sap
sap:
tables: [bkpf, bseg, acdoca, lfa1, kna1, mara, csks, cepc]
client: "100"
# Oracle EBS
output:
format: oracle
oracle:
ledger_id: 1
# NetSuite
output:
format: netsuite
netsuite:
subsidiary_id: 1
include_custom_fields: true
See ERP Output Formats for full field mappings.
Streaming Mode
Enable streaming output for memory-efficient generation of large datasets:
output:
format: csv # Any format
streaming: true # Enable streaming mode
See Streaming Output for details.
Compression
File compression options.
output:
compression: none # No compression
compression: gzip # Gzip compression (.gz)
compression: zstd # Zstandard compression (.zst)
Compression Level
When compression is enabled:
output:
compression: gzip
compression_level: 6 # 1-9, higher = smaller + slower
| Level | Speed | Size | Use Case |
|---|---|---|---|
| 1 | Fastest | Largest | Quick compression |
| 6 | Balanced | Medium | General use (default) |
| 9 | Slowest | Smallest | Maximum compression |
Compression Comparison
| Compression | Extension | Speed | Ratio |
|---|---|---|---|
none | .csv | N/A | 1.0 |
gzip | .csv.gz | Medium | ~0.15 |
zstd | .csv.zst | Fast | ~0.12 |
File Selection
Control which files are generated:
output:
files:
# Core transaction data
journal_entries: true # journal_entries.csv
acdoca: true # acdoca.csv (SAP format)
# Master data
master_data: true # vendors.csv, customers.csv, etc.
# Document flow
documents: true # purchase_orders.csv, invoices.csv, etc.
# Subsidiary ledgers
subledgers: true # ar_open_items.csv, ap_open_items.csv, etc.
# Period close
trial_balances: true # trial_balances/*.csv
# ML labels
labels: true # anomaly_labels.csv, fraud_labels.csv
# Controls
controls: true # internal_controls.csv, sod_rules.csv
Output Directory Structure
With all files enabled:
output/
├── master_data/
│ ├── chart_of_accounts.csv
│ ├── vendors.csv
│ ├── customers.csv
│ ├── materials.csv
│ ├── fixed_assets.csv
│ └── employees.csv
├── transactions/
│ ├── journal_entries.csv
│ └── acdoca.csv
├── documents/
│ ├── purchase_orders.csv
│ ├── goods_receipts.csv
│ ├── vendor_invoices.csv
│ ├── payments.csv
│ ├── sales_orders.csv
│ ├── deliveries.csv
│ ├── customer_invoices.csv
│ └── customer_receipts.csv
├── subledgers/
│ ├── ar_open_items.csv
│ ├── ar_aging.csv
│ ├── ap_open_items.csv
│ ├── ap_aging.csv
│ ├── fa_register.csv
│ ├── fa_depreciation.csv
│ ├── inventory_positions.csv
│ └── inventory_movements.csv
├── period_close/
│ └── trial_balances/
│ ├── 2024_01.csv
│ ├── 2024_02.csv
│ └── ...
├── consolidation/
│ ├── eliminations.csv
│ └── currency_translation.csv
├── fx/
│ ├── daily_rates.csv
│ └── period_rates.csv
├── graphs/ # If graph_export enabled
│ ├── pytorch_geometric/
│ └── neo4j/
├── labels/
│ ├── anomaly_labels.csv
│ └── fraud_labels.csv
└── controls/
├── internal_controls.csv
├── control_mappings.csv
└── sod_rules.csv
Examples
Development (Fast)
output:
format: csv
compression: none
files:
journal_entries: true
master_data: true
labels: true
Production (Compact)
output:
format: csv
compression: zstd
compression_level: 6
files:
journal_entries: true
acdoca: true
master_data: true
documents: true
subledgers: true
trial_balances: true
labels: true
controls: true
ML Training Focus
output:
format: csv
compression: gzip
files:
journal_entries: true
labels: true # Important for supervised learning
master_data: true # For feature engineering
SAP Integration
output:
format: csv
compression: none
files:
journal_entries: false
acdoca: true # SAP ACDOCA format
master_data: true
documents: true
Validation
| Check | Rule |
|---|---|
format | One of: csv, json, jsonl, parquet, sap, oracle, netsuite |
compression | none, gzip, or zstd |
compression_level | 1-9 (only when compression enabled) |
See Also
AI & ML Features Configuration
New in v0.5.0
This page documents the configuration for DataSynth’s AI and ML-powered generation features: LLM-augmented generation, diffusion models, causal generation, and synthetic data certificates.
LLM Configuration
Configure the LLM provider for metadata enrichment and natural language configuration.
llm:
provider: mock # Provider type
model: "gpt-4o-mini" # Model identifier
api_key_env: "OPENAI_API_KEY" # Environment variable for API key
base_url: null # Custom API endpoint (for 'custom' provider)
max_retries: 3 # Retry attempts on failure
timeout_secs: 30 # Request timeout
cache_enabled: true # Enable prompt-level caching
Provider Types
| Provider | Value | Requirements | Description |
|---|---|---|---|
| Mock | mock | None | Deterministic, no network. Default for CI/CD |
| OpenAI | openai | OPENAI_API_KEY env var | OpenAI API (GPT-4o, GPT-4o-mini, etc.) |
| Anthropic | anthropic | ANTHROPIC_API_KEY env var | Anthropic API (Claude models) |
| Custom | custom | base_url + API key env var | Any OpenAI-compatible endpoint |
Field Reference
| Field | Type | Default | Description |
|---|---|---|---|
provider | string | "mock" | LLM provider type |
model | string | "gpt-4o-mini" | Model identifier passed to the API |
api_key_env | string | "" | Environment variable name containing the API key |
base_url | string | null | Custom API base URL (required for custom provider) |
max_retries | integer | 3 | Maximum retry attempts on transient failures |
timeout_secs | integer | 30 | Per-request timeout in seconds |
cache_enabled | bool | true | Cache responses to avoid duplicate API calls |
Examples
Mock provider (default, no config needed):
# LLM enrichment uses mock provider by default
# No configuration required
OpenAI:
llm:
provider: openai
model: "gpt-4o-mini"
api_key_env: "OPENAI_API_KEY"
Anthropic:
llm:
provider: anthropic
model: "claude-sonnet-4-5-20250929"
api_key_env: "ANTHROPIC_API_KEY"
Self-hosted (e.g., vLLM, Ollama):
llm:
provider: custom
model: "llama-3-8b"
api_key_env: "LOCAL_API_KEY"
base_url: "http://localhost:8000/v1"
Azure OpenAI:
llm:
provider: custom
model: "gpt-4o-mini"
api_key_env: "AZURE_OPENAI_KEY"
base_url: "https://my-resource.openai.azure.com/openai/deployments/gpt-4o-mini"
Diffusion Configuration
Configure the statistical diffusion model backend for learned distribution capture.
diffusion:
enabled: false # Enable diffusion generation
n_steps: 1000 # Number of diffusion steps
schedule: "linear" # Noise schedule type
sample_size: 1000 # Number of samples to generate
Field Reference
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable diffusion model generation |
n_steps | integer | 1000 | Number of forward/reverse diffusion steps. Higher values improve quality but increase compute time |
schedule | string | "linear" | Noise schedule: "linear", "cosine", "sigmoid" |
sample_size | integer | 1000 | Number of diffusion-generated samples |
Noise Schedules
| Schedule | Characteristics | Best For |
|---|---|---|
linear | Uniform noise addition, simple and robust | General purpose |
cosine | Slower noise addition, preserves fine details | Financial amounts with precise distributions |
sigmoid | Smooth transition between linear and cosine | Balanced quality and compute |
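For intuition, a sketch of how linear and cosine schedules are commonly parameterized in diffusion models; the exact constants used internally may differ:
/// Linear schedule: noise variance beta_t grows uniformly from beta_min to beta_max.
fn beta_linear(t: usize, n_steps: usize, beta_min: f64, beta_max: f64) -> f64 {
    beta_min + (beta_max - beta_min) * t as f64 / (n_steps - 1) as f64
}
/// Cosine schedule: cumulative signal retention alpha_bar(t) follows a squared
/// cosine, adding noise more slowly early on and preserving fine detail.
fn alpha_bar_cosine(t: usize, n_steps: usize) -> f64 {
    let s = 0.008; // small offset so the schedule is not flat at t = 0
    let f = |x: f64| (((x + s) / (1.0 + s)) * std::f64::consts::FRAC_PI_2).cos().powi(2);
    f(t as f64 / n_steps as f64) / f(0.0)
}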
Examples
Basic diffusion:
diffusion:
enabled: true
n_steps: 1000
schedule: "cosine"
sample_size: 5000
Fast diffusion (fewer steps):
diffusion:
enabled: true
n_steps: 200
schedule: "linear"
sample_size: 1000
Causal Configuration
Configure causal graph-based data generation with Structural Causal Models.
causal:
enabled: false # Enable causal generation
template: "fraud_detection" # Built-in template or custom YAML path
sample_size: 1000 # Number of samples
validate: true # Validate causal structure in output
Field Reference
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable causal/counterfactual generation |
template | string | "fraud_detection" | Template name or path to custom YAML graph |
sample_size | integer | 1000 | Number of causal samples to generate |
validate | bool | true | Run causal structure validation on output |
Built-in Templates
| Template | Variables | Use Case |
|---|---|---|
fraud_detection | transaction_amount, approval_level, vendor_risk, fraud_flag | Fraud risk modeling |
revenue_cycle | order_size, credit_score, payment_delay, revenue | Revenue and credit analysis |
Custom Causal Graph
Point template to a YAML file defining a custom causal graph:
causal:
enabled: true
template: "./graphs/custom_fraud.yaml"
sample_size: 10000
validate: true
Custom graph format:
# custom_fraud.yaml
variables:
- name: transaction_amount
type: continuous
distribution: lognormal
params:
mu: 8.0
sigma: 1.5
- name: approval_level
type: count
distribution: normal
params:
mean: 1.0
std: 0.5
- name: fraud_flag
type: binary
edges:
- from: transaction_amount
to: approval_level
mechanism:
type: linear
coefficient: 0.00005
- from: transaction_amount
to: fraud_flag
mechanism:
type: logistic
scale: 0.0001
midpoint: 50000.0
strength: 0.8
Causal Mechanism Types
| Type | Parameters | Description |
|---|---|---|
linear | coefficient | y += coefficient × parent |
threshold | cutoff | y = 1 if parent > cutoff, else 0 |
polynomial | coefficients (list) | y += Σ c[i] × parent^i |
logistic | scale, midpoint | y += 1 / (1 + e^(-scale × (parent - midpoint))) |
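A sketch of how the linear, threshold, and logistic mechanisms combine a parent value into a child contribution (hypothetical enum; polynomial omitted for brevity):
/// Structural mechanisms from the table above.
enum Mechanism {
    Linear { coefficient: f64 },
    Threshold { cutoff: f64 },
    Logistic { scale: f64, midpoint: f64 },
}
/// Contribution of a single parent value to its child variable.
fn apply(mechanism: &Mechanism, parent: f64) -> f64 {
    match mechanism {
        Mechanism::Linear { coefficient } => *coefficient * parent,
        Mechanism::Threshold { cutoff } => {
            if parent > *cutoff { 1.0 } else { 0.0 }
        }
        Mechanism::Logistic { scale, midpoint } => {
            1.0 / (1.0 + (-*scale * (parent - *midpoint)).exp())
        }
    }
}
For example, with scale: 0.0001 and midpoint: 50000.0 as in custom_fraud.yaml, a 60,000 transaction contributes 1 / (1 + e^-1) ≈ 0.73 toward fraud_flag, before any strength scaling.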
Certificate Configuration
Configure synthetic data certificates for provenance and privacy attestation.
certificates:
enabled: false # Enable certificate generation
issuer: "DataSynth" # Certificate issuer identity
include_quality_metrics: true # Include quality metrics
Field Reference
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Generate a certificate with each output |
issuer | string | "DataSynth" | Issuer identity embedded in the certificate |
include_quality_metrics | bool | true | Include Benford MAD, correlation, fidelity, MIA AUC metrics |
Certificate Contents
When enabled, a certificate.json is produced containing:
| Section | Contents |
|---|---|
| Identity | certificate_id, generation_timestamp, generator_version |
| Reproducibility | config_hash (SHA-256), seed, fingerprint_hash |
| Privacy | DP mechanism, epsilon, delta, composition method, total queries |
| Quality | Benford MAD, correlation preservation, statistical fidelity, MIA AUC |
| Integrity | HMAC-SHA256 signature |
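A sketch of the integrity pieces using the sha2 and hmac crates (hypothetical key handling; the actual signing lives in datasynth-fingerprint's certificates module):
use hmac::{Hmac, Mac};
use sha2::{Digest, Sha256};
/// SHA-256 hash of the generation configuration (hex-encoded).
fn config_hash(config_yaml: &str) -> String {
    let digest = Sha256::digest(config_yaml.as_bytes());
    digest.iter().map(|b| format!("{b:02x}")).collect()
}
/// HMAC-SHA256 signature over the serialized certificate body,
/// used later to detect tampering.
fn sign_certificate(body: &[u8], key: &[u8]) -> Vec<u8> {
    let mut mac = Hmac::<Sha256>::new_from_slice(key).expect("HMAC accepts keys of any length");
    mac.update(body);
    mac.finalize().into_bytes().to_vec()
}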
Combined Example
A complete configuration using all AI/ML features:
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
companies:
- code: "1000"
name: "Manufacturing Corp"
currency: USD
country: US
transactions:
target_count: 50000
# LLM enrichment for realistic metadata
llm:
provider: mock
# Diffusion for learned distributions
diffusion:
enabled: true
n_steps: 1000
schedule: "cosine"
sample_size: 5000
# Causal structure for fraud scenarios
causal:
enabled: true
template: "fraud_detection"
sample_size: 10000
validate: true
# Certificate for provenance
certificates:
enabled: true
issuer: "DataSynth v0.5.0"
include_quality_metrics: true
fraud:
enabled: true
fraud_rate: 0.005
anomaly_injection:
enabled: true
total_rate: 0.02
output:
format: csv
CLI Flags
Several AI/ML features can also be controlled via CLI flags:
# Generate with certificate
datasynth-data generate --config config.yaml --output ./output --certificate
# Initialize from natural language
datasynth-data init --from-description "1 year of retail data with fraud" -o config.yaml
# Train diffusion model
datasynth-data diffusion train --fingerprint ./fp.dsf --output ./model.json
# Generate causal data
datasynth-data causal generate --template fraud_detection --samples 10000 --output ./causal/
See Also
- LLM-Augmented Generation
- Diffusion Models
- Causal & Counterfactual Generation
- Synthetic Data Certificates
- YAML Schema Reference
Architecture
SyntheticData is designed as a modular, high-performance data generation system.
Overview
┌─────────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ datasynth-cli │ datasynth-server │ datasynth-ui │
├─────────────────────────────────────────────────────────────────────┤
│ Orchestration Layer │
│ datasynth-runtime │
├─────────────────────────────────────────────────────────────────────┤
│ Generation Layer │
│ datasynth-generators │ datasynth-graph │
├─────────────────────────────────────────────────────────────────────┤
│ Foundation Layer │
│ datasynth-core │ datasynth-config │ datasynth-output │
└─────────────────────────────────────────────────────────────────────┘
Key Characteristics
| Characteristic | Description |
|---|---|
| Modular | 15 independent crates with clear boundaries |
| Layered | Strict dependency hierarchy prevents cycles |
| High-Performance | Parallel execution, memory-efficient streaming |
| Deterministic | Seeded RNG for reproducible output |
| Type-Safe | Rust’s type system ensures correctness |
Architecture Sections
| Section | Description |
|---|---|
| Workspace Layout | Crate organization and dependencies |
| Domain Models | Core data structures |
| Data Flow | How data moves through the system |
| Generation Pipeline | Step-by-step generation process |
| Memory Management | Memory tracking and limits |
| Design Decisions | Key architectural choices |
Design Principles
Separation of Concerns
Each crate has a single responsibility:
- datasynth-core: Domain models and distributions
- datasynth-config: Configuration and validation
- datasynth-generators: Data generation logic
- datasynth-output: File writing
- datasynth-runtime: Orchestration
Dependency Inversion
Core components define traits, implementations provided by higher layers:
#![allow(unused)]
fn main() {
// datasynth-core defines the trait
pub trait Generator<T> {
fn generate_batch(&mut self, count: usize) -> Result<Vec<T>>;
}
// datasynth-generators implements it
impl Generator<JournalEntry> for JournalEntryGenerator {
fn generate_batch(&mut self, count: usize) -> Result<Vec<JournalEntry>> {
// Implementation
}
}
}
Configuration-Driven
All behavior controlled by configuration:
transactions:
target_count: 100000
benford:
enabled: true
Memory Safety
Rust’s ownership system prevents:
- Data races in parallel generation
- Memory leaks
- Buffer overflows
Component Interactions
┌─────────────┐
│ Config │
└──────┬──────┘
│
┌──────────────────┼──────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ JE Generator│ │ Doc Generator│ │ Master Data │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└─────────────────┼─────────────────┘
│
▼
┌──────────────┐
│ Orchestrator │
└──────┬───────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ CSV │ │ Graph │ │ JSON │
└─────────┘ └─────────┘ └─────────┘
Performance Architecture
Parallel Execution
#![allow(unused)]
fn main() {
// Thread pool distributes work
let entries: Vec<JournalEntry> = (0..num_threads)
.into_par_iter()
.flat_map(|thread_id| {
let mut gen = generator_for_thread(thread_id);
gen.generate_batch(batch_size)
})
.collect();
}
Streaming Output
#![allow(unused)]
fn main() {
// Memory-efficient streaming
for entry in generator.generate_stream() {
sink.write(&entry)?;
}
}
Memory Guards
#![allow(unused)]
fn main() {
// Memory limits enforced
let guard = MemoryGuard::new(config);
while !guard.check().exceeds_hard_limit {
generate_batch();
}
}
Extension Points
Custom Generators
Implement the Generator trait:
#![allow(unused)]
fn main() {
impl Generator<CustomType> for CustomGenerator {
fn generate_batch(&mut self, count: usize) -> Result<Vec<CustomType>> {
// Custom logic
}
}
}
Custom Output Sinks
Implement the Sink trait:
#![allow(unused)]
fn main() {
impl Sink<JournalEntry> for CustomSink {
fn write(&mut self, entry: &JournalEntry) -> Result<()> {
// Custom output logic
}
}
}
Custom Distributions
Create specialized samplers:
#![allow(unused)]
fn main() {
impl AmountSampler for CustomAmountSampler {
fn sample(&mut self) -> Decimal {
// Custom distribution
}
}
}
See Also
Workspace Layout
SyntheticData is organized as a Rust workspace with 15 crates following a layered architecture.
Crate Hierarchy
datasynth-cli → Binary entry point (commands: generate, validate, init, info, fingerprint)
datasynth-server → REST/gRPC/WebSocket server with auth, rate limiting, timeouts
datasynth-ui → Tauri/SvelteKit desktop UI
│
▼
datasynth-runtime → Orchestration layer (GenerationOrchestrator coordinates workflow)
│
├─────────────────────────────────────┐
▼ ▼
datasynth-generators datasynth-banking datasynth-ocpm datasynth-fingerprint datasynth-standards
│ │ │ │
└────────────────────────┴──────────────────┴────────────────────┘
│
┌────────────────┴────────────────┐
▼ ▼
datasynth-graph datasynth-eval
│ │
└────────────────┬────────────────┘
▼
datasynth-config
│
▼
datasynth-core → Foundation layer
│
▼
datasynth-output
datasynth-test-utils → Testing utilities
Dependency Matrix
| Crate | Depends On |
|---|---|
| datasynth-core | (none) |
| datasynth-config | datasynth-core |
| datasynth-output | datasynth-core |
| datasynth-generators | datasynth-core, datasynth-config |
| datasynth-graph | datasynth-core, datasynth-generators |
| datasynth-eval | datasynth-core |
| datasynth-banking | datasynth-core, datasynth-config |
| datasynth-ocpm | datasynth-core |
| datasynth-fingerprint | datasynth-core, datasynth-config |
| datasynth-standards | datasynth-core, datasynth-config |
| datasynth-runtime | datasynth-core, datasynth-config, datasynth-generators, datasynth-output, datasynth-graph, datasynth-banking, datasynth-ocpm, datasynth-fingerprint, datasynth-eval |
| datasynth-cli | datasynth-runtime, datasynth-fingerprint |
| datasynth-server | datasynth-runtime |
| datasynth-ui | datasynth-runtime (via Tauri) |
| datasynth-test-utils | datasynth-core |
Directory Structure
SyntheticData/
├── Cargo.toml # Workspace manifest
├── crates/
│ ├── datasynth-core/
│ │ ├── Cargo.toml
│ │ ├── README.md
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── models/ # Domain models (JournalEntry, Master data, etc.)
│ │ ├── distributions/ # Statistical samplers
│ │ ├── traits/ # Generator, Sink, PostProcessor traits
│ │ ├── templates/ # Template loading system
│ │ ├── accounts.rs # GL account constants
│ │ ├── uuid_factory.rs # Deterministic UUID generation
│ │ ├── memory_guard.rs # Memory limit enforcement
│ │ ├── disk_guard.rs # Disk space monitoring
│ │ ├── cpu_monitor.rs # CPU load tracking
│ │ ├── resource_guard.rs # Unified resource orchestration
│ │ ├── degradation.rs # Graceful degradation controller
│ │ ├── llm/ # LLM provider abstraction (Mock, HTTP, OpenAI, Anthropic)
│ │ ├── diffusion/ # Diffusion model backend (statistical, hybrid, training)
│ │ └── causal/ # Causal graphs, SCMs, interventions, counterfactuals
│ ├── datasynth-config/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── schema.rs # Configuration schema
│ │ ├── validation.rs # Config validation rules
│ │ └── presets/ # Industry preset definitions
│ ├── datasynth-generators/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── je_generator.rs
│ │ ├── coa_generator.rs
│ │ ├── control_generator.rs
│ │ ├── master_data/ # Vendor, Customer, Material, Asset, Employee
│ │ ├── document_flow/ # P2P, O2C, three-way match
│ │ ├── intercompany/ # IC generation, matching, elimination
│ │ ├── balance/ # Opening balance, balance tracker
│ │ ├── subledger/ # AR, AP, FA, Inventory
│ │ ├── fx/ # FX rates, translation, CTA
│ │ ├── period_close/ # Close engine, accruals, depreciation
│ │ ├── anomaly/ # Anomaly injection engine
│ │ ├── data_quality/ # Missing values, typos, duplicates
│ │ ├── audit/ # Engagement, workpaper, evidence, findings
│ │ ├── llm_enrichment/ # LLM-powered vendor names, descriptions, anomaly explanations
│ │ └── relationships/ # Entity graph, cross-process links, relationship strength
│ ├── datasynth-output/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── csv_sink.rs
│ │ ├── json_sink.rs
│ │ └── control_export.rs
│ ├── datasynth-graph/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── models/ # Node, edge types
│ │ ├── builders/ # Transaction, approval, entity graphs
│ │ ├── exporters/ # PyTorch Geometric, Neo4j, DGL
│ │ └── ml/ # Feature computation, train/val/test splits
│ ├── datasynth-runtime/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── orchestrator.rs # GenerationOrchestrator
│ │ └── progress.rs # Progress tracking
│ ├── datasynth-cli/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ └── main.rs # generate, validate, init, info, fingerprint commands
│ ├── datasynth-server/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── main.rs
│ │ ├── rest/ # Axum REST API
│ │ ├── grpc/ # Tonic gRPC service
│ │ └── websocket/ # WebSocket streaming
│ ├── datasynth-ui/
│ │ ├── package.json
│ │ ├── src/ # SvelteKit frontend
│ │ │ ├── routes/ # 15+ config pages
│ │ │ └── lib/ # Components, stores
│ │ └── src-tauri/ # Rust backend
│ ├── datasynth-eval/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── statistical/ # Benford, distributions, temporal
│ │ ├── coherence/ # Balance, IC, document chains
│ │ ├── quality/ # Completeness, consistency, duplicates
│ │ ├── ml/ # Feature distributions, label quality
│ │ ├── report/ # HTML/JSON report generation
│ │ └── enhancement/ # AutoTuner, RecommendationEngine
│ ├── datasynth-banking/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── models/ # Customer, Account, Transaction, KYC
│ │ ├── generators/ # Customer, account, transaction generation
│ │ ├── typologies/ # Structuring, funnel, layering, mule, fraud
│ │ ├── personas/ # Retail, business, trust behaviors
│ │ └── labels/ # Entity, relationship, transaction labels
│ ├── datasynth-ocpm/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── models/ # EventLog, Event, ObjectInstance, ObjectType
│ │ ├── generator/ # P2P, O2C event generation
│ │ └── export/ # OCEL 2.0 JSON export
│ ├── datasynth-fingerprint/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── models/ # Fingerprint, Manifest, Schema, Statistics
│ │ ├── privacy/ # Laplace, Gaussian, k-anonymity, PrivacyEngine
│ │ ├── extraction/ # Schema, stats, correlation, integrity extractors
│ │ ├── io/ # DSF file reader, writer, validator
│ │ ├── synthesis/ # ConfigSynthesizer, DistributionFitter, GaussianCopula
│ │ ├── evaluation/ # FidelityEvaluator, FidelityReport
│ │ ├── federated/ # Federated fingerprint protocol, secure aggregation
│ │ └── certificates/ # Synthetic data certificates, HMAC-SHA256 signing
│ ├── datasynth-standards/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── framework.rs # AccountingFramework, FrameworkSettings
│ │ ├── accounting/ # Revenue (ASC 606/IFRS 15), Leases, Fair Value, Impairment
│ │ ├── audit/ # ISA standards, Analytical procedures, Opinions
│ │ └── regulatory/ # SOX 302/404, DeficiencyMatrix
│ └── datasynth-test-utils/
│ ├── Cargo.toml
│ └── src/
│ └── lib.rs # Test fixtures, assertions, mocks
├── benches/ # Criterion benchmark suite
├── docs/ # This documentation (mdBook)
├── python/ # Python wrapper (datasynth_py)
├── examples/ # Example configurations and templates
└── tests/ # Integration tests
Crate Purposes
Application Layer
| Crate | Purpose |
|---|---|
| datasynth-cli | Command-line interface with generate, validate, init, info, fingerprint commands |
| datasynth-server | REST/gRPC/WebSocket API with auth, rate limiting, timeouts |
| datasynth-ui | Cross-platform desktop application (Tauri + SvelteKit) |
Processing Layer
| Crate | Purpose |
|---|---|
| datasynth-runtime | Orchestrates generation workflow with resource guards |
| datasynth-generators | Core data generation (JE, master data, documents, anomalies, audit) |
| datasynth-graph | Graph construction and export for ML |
Domain-Specific Modules
| Crate | Purpose |
|---|---|
| datasynth-banking | KYC/AML banking transactions with fraud typologies |
| datasynth-ocpm | OCEL 2.0 process mining event logs |
| datasynth-fingerprint | Privacy-preserving fingerprint extraction and synthesis |
| datasynth-standards | Accounting/audit standards (US GAAP, IFRS, ISA, SOX, PCAOB) |
Foundation Layer
| Crate | Purpose |
|---|---|
| datasynth-core | Domain models, traits, distributions, resource guards |
| datasynth-config | Configuration schema and validation |
| datasynth-output | Output sinks (CSV, JSON, Parquet) |
Supporting Crates
| Crate | Purpose |
|---|---|
| datasynth-eval | Quality evaluation with auto-tuning recommendations |
| datasynth-test-utils | Test fixtures and assertions |
Build Commands
# Build entire workspace
cargo build --release
# Build specific crate
cargo build -p datasynth-core
cargo build -p datasynth-generators
cargo build -p datasynth-fingerprint
# Run tests
cargo test
cargo test -p datasynth-core
cargo test -p datasynth-fingerprint
# Generate documentation
cargo doc --workspace --no-deps
# Run benchmarks
cargo bench
Feature Flags
Workspace-level features:
[workspace.features]
default = ["full"]
full = ["server", "ui", "graph"]
server = []
ui = []
graph = []
Crate-level features:
# datasynth-core
[features]
templates = ["serde_yaml"]
# datasynth-output
[features]
compression = ["flate2", "zstd"]
Adding a New Crate
- Create directory: crates/datasynth-newcrate/
- Add Cargo.toml:
  [package]
  name = "datasynth-newcrate"
  version = "0.2.0"
  edition = "2021"

  [dependencies]
  datasynth-core = { path = "../datasynth-core" }
- Add to workspace Cargo.toml:
  [workspace]
  members = [
      # ...
      "crates/datasynth-newcrate",
  ]
- Create src/lib.rs
- Add documentation to docs/src/crates/
See Also
Domain Models
Core data structures representing enterprise financial concepts.
Model Categories
| Category | Models |
|---|---|
| Accounting | JournalEntry, ChartOfAccounts, ACDOCA |
| Master Data | Vendor, Customer, Material, FixedAsset, Employee |
| Documents | PurchaseOrder, Invoice, Payment, etc. |
| Financial | TrialBalance, FxRate, AccountBalance |
| Financial Reporting | FinancialStatement, CashFlowItem, BankReconciliation, BankStatementLine |
| Sourcing (S2C) | SourcingProject, SupplierQualification, RfxEvent, Bid, BidEvaluation, ProcurementContract, CatalogItem, SupplierScorecard, SpendAnalysis |
| HR / Payroll | PayrollRun, PayrollLineItem, TimeEntry, ExpenseReport, ExpenseLineItem |
| Manufacturing | ProductionOrder, QualityInspection, CycleCount |
| Sales Quotes | SalesQuote, QuoteLineItem |
| Compliance | InternalControl, SoDRule, LabeledAnomaly |
Accounting
JournalEntry
The core accounting record.
#![allow(unused)]
fn main() {
pub struct JournalEntry {
pub header: JournalEntryHeader,
pub lines: Vec<JournalEntryLine>,
}
pub struct JournalEntryHeader {
pub document_id: Uuid,
pub company_code: String,
pub fiscal_year: u16,
pub fiscal_period: u8,
pub posting_date: NaiveDate,
pub document_date: NaiveDate,
pub created_at: DateTime<Utc>,
pub source: TransactionSource,
pub business_process: Option<BusinessProcess>,
// Document references
pub source_document_type: Option<DocumentType>,
pub source_document_id: Option<String>,
// Labels
pub is_fraud: bool,
pub fraud_type: Option<FraudType>,
pub is_anomaly: bool,
pub anomaly_type: Option<AnomalyType>,
// Control markers
pub control_ids: Vec<String>,
pub sox_relevant: bool,
pub sod_violation: bool,
}
pub struct JournalEntryLine {
pub line_number: u32,
pub account_number: String,
pub cost_center: Option<String>,
pub profit_center: Option<String>,
pub debit_amount: Decimal,
pub credit_amount: Decimal,
pub description: String,
pub tax_code: Option<String>,
}
}
Invariant: Sum of debits must equal sum of credits.
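A sketch of that invariant as a check over the lines defined above:
use rust_decimal::Decimal;
/// A journal entry is balanced when total debits equal total credits.
fn is_balanced(lines: &[JournalEntryLine]) -> bool {
    let debits: Decimal = lines.iter().map(|l| l.debit_amount).sum();
    let credits: Decimal = lines.iter().map(|l| l.credit_amount).sum();
    debits == credits
}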
ChartOfAccounts
GL account structure.
#![allow(unused)]
fn main() {
pub struct ChartOfAccounts {
pub accounts: Vec<Account>,
}
pub struct Account {
pub account_number: String,
pub name: String,
pub account_type: AccountType,
pub account_subtype: AccountSubType,
pub is_control_account: bool,
pub normal_balance: NormalBalance,
pub is_active: bool,
}
pub enum AccountType {
Asset,
Liability,
Equity,
Revenue,
Expense,
}
pub enum AccountSubType {
// Assets
Cash, AccountsReceivable, Inventory, FixedAsset,
// Liabilities
AccountsPayable, AccruedLiabilities, LongTermDebt,
// Equity
CommonStock, RetainedEarnings,
// Revenue
SalesRevenue, ServiceRevenue,
// Expense
CostOfGoodsSold, OperatingExpense,
// ...
}
}
ACDOCA
SAP HANA Universal Journal format.
#![allow(unused)]
fn main() {
pub struct AcdocaEntry {
pub rclnt: String, // Client
pub rldnr: String, // Ledger
pub rbukrs: String, // Company code
pub gjahr: u16, // Fiscal year
pub belnr: String, // Document number
pub docln: u32, // Line item
pub ryear: u16, // Year
pub poper: u8, // Posting period
pub racct: String, // Account
pub drcrk: DebitCreditIndicator,
pub hsl: Decimal, // Amount in local currency
pub ksl: Decimal, // Amount in group currency
// Simulation fields
pub zsim_fraud: bool,
pub zsim_anomaly: bool,
pub zsim_source: String,
}
}
Master Data
Vendor
Supplier master record.
#![allow(unused)]
fn main() {
pub struct Vendor {
pub vendor_id: String,
pub vendor_name: String,
pub tax_id: Option<String>,
pub currency: String,
pub country: String,
pub payment_terms: PaymentTerms,
pub bank_account: Option<BankAccount>,
pub is_intercompany: bool,
pub behavior: VendorBehavior,
pub valid_from: NaiveDate,
pub valid_to: Option<NaiveDate>,
}
pub struct VendorBehavior {
pub late_payment_tendency: f64,
pub discount_usage_rate: f64,
}
}
Customer
Customer master record.
#![allow(unused)]
fn main() {
pub struct Customer {
pub customer_id: String,
pub customer_name: String,
pub currency: String,
pub country: String,
pub credit_limit: Decimal,
pub credit_rating: CreditRating,
pub payment_behavior: PaymentBehavior,
pub is_intercompany: bool,
pub valid_from: NaiveDate,
}
pub struct PaymentBehavior {
pub on_time_rate: f64,
pub early_payment_rate: f64,
pub late_payment_rate: f64,
pub average_days_late: u32,
}
}
Material
Product/material master.
#![allow(unused)]
fn main() {
pub struct Material {
pub material_id: String,
pub description: String,
pub material_type: MaterialType,
pub unit_of_measure: String,
pub valuation_method: ValuationMethod,
pub standard_cost: Decimal,
pub gl_account: String,
}
pub enum MaterialType {
RawMaterial,
WorkInProgress,
FinishedGoods,
Service,
}
pub enum ValuationMethod {
Fifo,
Lifo,
WeightedAverage,
StandardCost,
}
}
FixedAsset
Capital asset record.
#![allow(unused)]
fn main() {
pub struct FixedAsset {
pub asset_id: String,
pub description: String,
pub asset_class: AssetClass,
pub acquisition_date: NaiveDate,
pub acquisition_cost: Decimal,
pub useful_life_years: u32,
pub depreciation_method: DepreciationMethod,
pub salvage_value: Decimal,
pub accumulated_depreciation: Decimal,
pub disposal_date: Option<NaiveDate>,
}
}
Employee
User/employee record.
#![allow(unused)]
fn main() {
pub struct Employee {
pub employee_id: String,
pub name: String,
pub department: String,
pub role: String,
pub manager_id: Option<String>,
pub approval_limit: Decimal,
pub transaction_codes: Vec<String>,
pub hire_date: NaiveDate,
}
}
Documents
PurchaseOrder
P2P initiating document.
#![allow(unused)]
fn main() {
pub struct PurchaseOrder {
pub po_number: String,
pub vendor_id: String,
pub company_code: String,
pub order_date: NaiveDate,
pub items: Vec<PoLineItem>,
pub total_amount: Decimal,
pub currency: String,
pub status: PoStatus,
}
pub struct PoLineItem {
pub line_number: u32,
pub material_id: String,
pub quantity: Decimal,
pub unit_price: Decimal,
pub gl_account: String,
}
}
VendorInvoice
AP invoice with three-way match.
#![allow(unused)]
fn main() {
pub struct VendorInvoice {
pub invoice_number: String,
pub vendor_id: String,
pub po_number: Option<String>,
pub gr_number: Option<String>,
pub invoice_date: NaiveDate,
pub due_date: NaiveDate,
pub total_amount: Decimal,
pub match_status: MatchStatus,
}
pub enum MatchStatus {
Matched,
QuantityVariance,
PriceVariance,
Blocked,
}
}
DocumentReference
Links documents in flows.
#![allow(unused)]
fn main() {
pub struct DocumentReference {
pub from_document_type: DocumentType,
pub from_document_id: String,
pub to_document_type: DocumentType,
pub to_document_id: String,
pub reference_type: ReferenceType,
}
pub enum ReferenceType {
FollowsFrom, // Normal flow
PaymentFor, // Payment → Invoice
ReversalOf, // Reversal/credit memo
}
}
Financial
TrialBalance
Period-end balances.
#![allow(unused)]
fn main() {
pub struct TrialBalance {
pub company_code: String,
pub fiscal_year: u16,
pub fiscal_period: u8,
pub accounts: Vec<TrialBalanceRow>,
}
pub struct TrialBalanceRow {
pub account_number: String,
pub account_name: String,
pub opening_balance: Decimal,
pub period_debits: Decimal,
pub period_credits: Decimal,
pub closing_balance: Decimal,
}
}
FxRate
Exchange rate record.
#![allow(unused)]
fn main() {
pub struct FxRate {
pub from_currency: String,
pub to_currency: String,
pub rate_date: NaiveDate,
pub rate_type: RateType,
pub rate: Decimal,
}
pub enum RateType {
Spot,
Closing,
Average,
}
}
Compliance
LabeledAnomaly
ML training label.
#![allow(unused)]
fn main() {
pub struct LabeledAnomaly {
pub document_id: Uuid,
pub anomaly_id: String,
pub anomaly_type: AnomalyType,
pub category: AnomalyCategory,
pub severity: Severity,
pub description: String,
pub detection_difficulty: DetectionDifficulty,
}
pub enum AnomalyType {
Fraud,
Error,
ProcessIssue,
Statistical,
Relational,
}
}
InternalControl
SOX control definition.
#![allow(unused)]
fn main() {
pub struct InternalControl {
pub control_id: String,
pub name: String,
pub description: String,
pub control_type: ControlType,
pub frequency: ControlFrequency,
pub assertions: Vec<Assertion>,
}
}
Financial Reporting
FinancialStatement
Period-end financial statement with line items.
#![allow(unused)]
fn main() {
pub enum StatementType {
BalanceSheet,
IncomeStatement,
CashFlowStatement,
ChangesInEquity,
}
pub struct FinancialStatementLineItem {
pub line_code: String,
pub label: String,
pub section: String,
pub sort_order: u32,
pub amount: Decimal,
pub amount_prior: Option<Decimal>,
pub indent_level: u8,
pub is_total: bool,
pub gl_accounts: Vec<String>,
}
pub struct CashFlowItem {
pub item_code: String,
pub label: String,
pub category: CashFlowCategory, // Operating, Investing, Financing
pub amount: Decimal,
}
}
BankReconciliation
Bank statement reconciliation with auto-matching.
#![allow(unused)]
fn main() {
pub struct BankStatementLine {
pub line_id: String,
pub statement_date: NaiveDate,
pub direction: Direction, // Inflow, Outflow
pub amount: Decimal,
pub description: String,
pub match_status: MatchStatus, // Unmatched, AutoMatched, ManuallyMatched, BankCharge, Interest
pub matched_payment_id: Option<String>,
}
pub struct BankReconciliation {
pub reconciliation_id: String,
pub company_code: String,
pub bank_account: String,
pub period_start: NaiveDate,
pub period_end: NaiveDate,
pub opening_balance: Decimal,
pub closing_balance: Decimal,
pub status: ReconciliationStatus, // InProgress, Completed, CompletedWithExceptions
}
}
Sourcing (S2C)
Source-to-Contract models for the procurement pipeline.
SourcingProject
Top-level sourcing initiative.
#![allow(unused)]
fn main() {
pub struct SourcingProject {
pub project_id: String,
pub title: String,
pub category: String,
pub status: SourcingProjectStatus,
pub estimated_spend: Decimal,
pub start_date: NaiveDate,
pub target_award_date: NaiveDate,
}
}
RfxEvent
Request for Information/Proposal/Quote.
#![allow(unused)]
fn main() {
pub struct RfxEvent {
pub rfx_id: String,
pub project_id: String,
pub rfx_type: RfxType, // Rfi, Rfp, Rfq
pub title: String,
pub issue_date: NaiveDate,
pub close_date: NaiveDate,
pub invited_suppliers: Vec<String>,
}
}
ProcurementContract
Awarded contract resulting from bid evaluation.
#![allow(unused)]
fn main() {
pub struct ProcurementContract {
pub contract_id: String,
pub vendor_id: String,
pub rfx_id: Option<String>,
pub contract_value: Decimal,
pub start_date: NaiveDate,
pub end_date: NaiveDate,
pub auto_renew: bool,
}
}
Additional S2C models include SpendAnalysis, SupplierQualification, Bid, BidEvaluation, CatalogItem, and SupplierScorecard.
HR / Payroll
Hire-to-Retire (H2R) process models.
PayrollRun
A complete pay cycle for a company.
#![allow(unused)]
fn main() {
pub struct PayrollRun {
pub payroll_id: String,
pub company_code: String,
pub pay_period_start: NaiveDate,
pub pay_period_end: NaiveDate,
pub run_date: NaiveDate,
pub status: PayrollRunStatus, // Draft, Calculated, Approved, Posted, Reversed
pub total_gross: Decimal,
pub total_deductions: Decimal,
pub total_net: Decimal,
pub total_employer_cost: Decimal,
pub employee_count: u32,
}
}
TimeEntry
Employee time tracking record.
#![allow(unused)]
fn main() {
pub struct TimeEntry {
pub entry_id: String,
pub employee_id: String,
pub date: NaiveDate,
pub hours_regular: f64,
pub hours_overtime: f64,
pub hours_pto: f64,
pub hours_sick: f64,
pub project_id: Option<String>,
pub cost_center: Option<String>,
pub approval_status: TimeApprovalStatus, // Pending, Approved, Rejected
}
}
ExpenseReport
Employee expense reimbursement.
#![allow(unused)]
fn main() {
pub struct ExpenseReport {
pub report_id: String,
pub employee_id: String,
pub submission_date: NaiveDate,
pub status: ExpenseStatus, // Draft, Submitted, Approved, Rejected, Paid
pub total_amount: Decimal,
pub line_items: Vec<ExpenseLineItem>,
}
pub enum ExpenseCategory {
Travel, Meals, Lodging, Transportation,
Office, Entertainment, Training, Other,
}
}
Manufacturing
Production and quality process models.
ProductionOrder
Manufacturing production order linked to materials.
#![allow(unused)]
fn main() {
pub struct ProductionOrder {
pub order_id: String,
pub material_id: String,
pub planned_quantity: Decimal,
pub actual_quantity: Decimal,
pub start_date: NaiveDate,
pub end_date: Option<NaiveDate>,
pub status: ProductionOrderStatus,
}
}
QualityInspection
Quality control inspection record.
#![allow(unused)]
fn main() {
pub struct QualityInspection {
pub inspection_id: String,
pub production_order_id: String,
pub inspection_date: NaiveDate,
pub result: InspectionResult, // Pass, Fail, Conditional
pub defect_count: u32,
}
}
CycleCount
Inventory cycle count with variance tracking.
#![allow(unused)]
fn main() {
pub struct CycleCount {
pub count_id: String,
pub material_id: String,
pub warehouse: String,
pub count_date: NaiveDate,
pub system_quantity: Decimal,
pub counted_quantity: Decimal,
pub variance: Decimal,
}
}
Sales Quotes
Quote-to-order pipeline models.
SalesQuote
Sales quotation record.
#![allow(unused)]
fn main() {
pub struct SalesQuote {
pub quote_id: String,
pub customer_id: String,
pub quote_date: NaiveDate,
pub valid_until: NaiveDate,
pub total_amount: Decimal,
pub status: QuoteStatus, // Draft, Sent, Won, Lost, Expired
pub converted_order_id: Option<String>,
}
}
Decimal Handling
All monetary amounts use rust_decimal::Decimal:
#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
let amount = dec!(1234.56);
let tax = amount * dec!(0.077);
}
Decimal values are serialized as JSON strings to avoid IEEE 754 floating-point precision issues:
{"amount": "1234.56"}
Data Flow
How data flows through the SyntheticData system.
High-Level Flow
┌─────────────┐
│ Config │
└──────┬──────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Orchestrator │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Master │ → │ Opening │ → │ Transact │ → │ Period │ │
│ │ Data │ │ Balances │ │ ions │ │ Close │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
└───────────────────────────┬─────────────────────────────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CSV Sink │ │ Graph Export│ │ Labels │
└─────────────┘ └─────────────┘ └─────────────┘
Phase 1: Configuration Loading
YAML File → Parser → Validator → Config Object
- Load: Read YAML/JSON file
- Parse: Convert to strongly-typed structures
- Validate: Check constraints and ranges
- Resolve: Apply defaults and presets
#![allow(unused)]
fn main() {
let config = Config::from_yaml_file("config.yaml")?;
ConfigValidator::new().validate(&config)?;
}
Phase 2: Master Data Generation
Config → Master Data Generators → Entity Registry
Order of generation (to satisfy dependencies):
- Chart of Accounts: GL account structure
- Employees: Users with approval limits
- Vendors: Suppliers (reference employees as approvers)
- Customers: Buyers (reference employees)
- Materials: Products (reference accounts)
- Fixed Assets: Capital assets (reference accounts)
#![allow(unused)]
fn main() {
// Entity registry maintains references
let registry = EntityRegistry::new();
registry.register_vendors(&vendors);
registry.register_customers(&customers);
}
Phase 3: Opening Balance Generation
Config + CoA → Balance Generator → Opening JEs
Generates coherent opening balance sheet:
- Calculate target balances per account type
- Distribute across accounts
- Generate opening entries
- Verify A = L + E
#![allow(unused)]
fn main() {
let opening = OpeningBalanceGenerator::new(&config);
let entries = opening.generate()?;
// Verify balance coherence
assert!(entries.iter().all(|e| e.is_balanced()));
}
Phase 4: Transaction Generation
Document Flow Path
Config → P2P/O2C Generators → Documents → JE Generator → Entries
P2P Flow:
PO Generator → Purchase Order
│
▼
GR Generator → Goods Receipt → JE (Inventory/GR-IR)
│
▼
Invoice Gen. → Vendor Invoice → JE (GR-IR/AP)
│
▼
Payment Gen. → Payment → JE (AP/Cash)
Direct JE Path
Config → JE Generator → Entries
For transactions not from document flows:
- Manual entries
- Recurring entries
- Adjustments
Phase 5: Balance Tracking
Entries → Balance Tracker → Running Balances → Trial Balance
Continuous tracking during generation:
#![allow(unused)]
fn main() {
let mut tracker = BalanceTracker::new(&coa);
for entry in &entries {
tracker.post(&entry)?;
// Verify coherence after each entry
assert!(tracker.is_balanced());
}
let trial_balance = tracker.to_trial_balance(period);
}
Phase 6: Anomaly Injection
Entries → Anomaly Injector → Modified Entries + Labels
Anomalies injected post-generation:
- Select entries based on targeting strategy
- Apply anomaly transformation
- Generate label record
#![allow(unused)]
fn main() {
let injector = AnomalyInjector::new(&config.anomaly_injection);
let (modified, labels) = injector.inject(&entries)?;
}
Phase 7: Period Close
Entries + Balances → Close Engine → Closing Entries
Monthly:
- Accruals
- Depreciation
- Subledger reconciliation
Quarterly:
- IC eliminations
- Currency translation
Annual:
- Closing entries
- Retained earnings
Phase 8: Output Generation
CSV/JSON Output
Entries + Master Data → Sinks → Files
#![allow(unused)]
fn main() {
let mut sink = CsvSink::new("output/journal_entries.csv")?;
sink.write_batch(&entries)?;
sink.flush()?;
}
Graph Output
Entries → Graph Builder → Graph → Exporter → PyG/Neo4j
#![allow(unused)]
fn main() {
let builder = TransactionGraphBuilder::new();
let graph = builder.build(&entries)?;
let exporter = PyTorchGeometricExporter::new("output/graphs");
exporter.export(&graph, split_config)?;
}
Phase 9: Enterprise Process Chains (v0.6.0)
Source-to-Contract (S2C) Flow
Spend Analysis → Sourcing Project → Supplier Qualification → RFx Event → Bids →
Bid Evaluation → Contract Award → Catalog Items → [feeds into P2P] → Supplier Scorecard
S2C data feeds into the existing P2P procurement flow. Procurement contracts and catalog items provide the upstream sourcing context for purchase orders.
HR / Payroll Flow
Employees (Master Data) → Time Entries → Payroll Run → JE (Salary Expense/Cash)
→ Expense Reports → JE (Expense/AP)
HR data depends on the employee master data from Phase 2. Payroll runs generate journal entries that post to salary expense and cash accounts.
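For illustration, a hedged sketch of the balancing logic behind such a posting; the account split and figures are assumptions, not the generator's output:
use rust_decimal_macros::dec;
fn main() {
    let gross = dec!(10000.00);      // salary expense (debit)
    let deductions = dec!(2300.00);  // withholdings payable (credit)
    let net = gross - deductions;    // cash paid out (credit)
    // The payroll entry must balance: debit = sum of credits
    assert_eq!(gross, net + deductions);
    println!("salary expense {gross} = cash {net} + withholdings {deductions}");
}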
Financial Reporting Flow
Trial Balance → Balance Sheet + Income Statement
→ Cash Flow Statement (indirect method)
→ Changes in Equity
→ Management KPIs
→ Budget Variance Analysis
Payments (P2P/O2C) → Bank Reconciliation → Matched/Unmatched Items
Financial statements are derived from the adjusted trial balance. Bank reconciliations match payments from document flows against bank statement lines.
Manufacturing Flow
Materials (Master Data) → Production Orders → Quality Inspections
→ Cycle Counts
Manufacturing data depends on materials from the master data. Production orders consume raw materials and produce finished goods.
Sales Quote Flow
Customers (Master Data) → Sales Quotes → [feeds into O2C when won]
The quote-to-order pipeline generates sales quotes that, when won, link to sales orders in the O2C flow.
Accounting Standards Flow
Customers → Customer Contracts → Performance Obligations (ASC 606/IFRS 15)
Fixed Assets → Impairment Tests → Recoverable Amount Calculations
Revenue recognition generates contracts with performance obligations. Impairment testing evaluates fixed asset carrying amounts against recoverable values.
Data Dependencies
┌─────────────┐
│ Config │
└──────┬──────┘
│
┌───────────┼───────────┐
│ │ │
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│ CoA │ │Vendors│ │Customs│
└───┬───┘ └───┬───┘ └───┬───┘
│ │ │
│ ┌─────┴─────┐ │
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────┐ ┌─────────────┐
│ P2P Docs │ │ O2C Docs │
└──────┬──────┘ └──────┬──────┘
│ │
└───────┬────────┘
│
▼
┌─────────────┐
│ Entries │
└──────┬──────┘
│
┌──────────┼──────────┐──────────┐──────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌─────────┐ ┌───────┐
│ TB │ │ Graph │ │Labels │ │Fin.Stmt │ │BankRec│
└───────┘ └───────┘ └───────┘ └─────────┘ └───────┘
Streaming vs Batch
Batch Mode
All data in memory:
#![allow(unused)]
fn main() {
let entries = generator.generate_batch(100000)?;
sink.write_batch(&entries)?;
}
Pro: fast parallel processing. Con: memory-intensive.
Streaming Mode
Process one at a time:
#![allow(unused)]
fn main() {
for entry in generator.generate_stream() {
sink.write(&entry?)?;
}
}
Pro: memory-efficient. Con: no parallelism.
Hybrid Mode
Batch with periodic flush:
#![allow(unused)]
fn main() {
for batch in generator.generate_batches(1000) {
let entries = batch?;
sink.write_batch(&entries)?;
if memory_guard.check().exceeds_soft_limit {
sink.flush()?;
}
}
}
Generation Pipeline
Step-by-step generation process orchestrated by datasynth-runtime.
Pipeline Overview
┌─────────────────────────────────────────────────────────────────────┐
│ GenerationOrchestrator │
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Init │→│Master│→│Open │→│Trans │→│Close │→│Inject│→│Export│ │
│ │ │ │Data │ │Bal │ │ │ │ │ │ │ │ │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Stage 1: Initialization
Purpose: Prepare generation environment
#![allow(unused)]
fn main() {
pub fn initialize(&mut self) -> Result<()> {
// 1. Validate configuration
ConfigValidator::new().validate(&self.config)?;
// 2. Initialize RNG with seed
self.rng = ChaCha8Rng::seed_from_u64(self.config.global.seed);
// 3. Create UUID factory
self.uuid_factory = DeterministicUuidFactory::new(self.config.global.seed);
// 4. Set up memory guard
self.memory_guard = MemoryGuard::new(self.config.memory_config());
// 5. Create output directories
fs::create_dir_all(&self.output_path)?;
Ok(())
}
}
Outputs:
- Validated configuration
- Initialized RNG
- UUID factory
- Memory guard
- Output directories
Stage 2: Master Data Generation
Purpose: Generate all entity master records
#![allow(unused)]
fn main() {
pub fn generate_master_data(&mut self) -> Result<MasterDataState> {
let mut state = MasterDataState::new();
// 1. Chart of Accounts
let coa_gen = CoaGenerator::new(&self.config, &mut self.rng);
state.chart_of_accounts = coa_gen.generate()?;
// 2. Employees (needed for approvals)
let emp_gen = EmployeeGenerator::new(&self.config, &mut self.rng);
state.employees = emp_gen.generate()?;
// 3. Vendors (reference employees)
let vendor_gen = VendorGenerator::new(&self.config, &mut self.rng);
state.vendors = vendor_gen.generate()?;
// 4. Customers
let customer_gen = CustomerGenerator::new(&self.config, &mut self.rng);
state.customers = customer_gen.generate()?;
// 5. Materials
let material_gen = MaterialGenerator::new(&self.config, &mut self.rng);
state.materials = material_gen.generate()?;
// 6. Fixed Assets
let asset_gen = AssetGenerator::new(&self.config, &mut self.rng);
state.fixed_assets = asset_gen.generate()?;
// 7. Register in entity registry
self.registry.register_all(&state);
Ok(state)
}
}
Outputs:
- Chart of Accounts
- Vendors, Customers
- Materials, Fixed Assets
- Employees
- Entity Registry
Stage 3: Opening Balance Generation
Purpose: Create coherent opening balance sheet
#![allow(unused)]
fn main() {
pub fn generate_opening_balances(&mut self) -> Result<Vec<JournalEntry>> {
let generator = OpeningBalanceGenerator::new(
&self.config,
&self.state.chart_of_accounts,
&mut self.rng,
);
let entries = generator.generate()?;
// Initialize balance tracker
self.balance_tracker = BalanceTracker::new(&self.state.chart_of_accounts);
for entry in &entries {
self.balance_tracker.post(entry)?;
}
// Verify A = L + E
assert!(self.balance_tracker.is_balanced());
Ok(entries)
}
}
Outputs:
- Opening balance entries
- Initialized balance tracker
Stage 4: Transaction Generation
Purpose: Generate main transaction volume
#![allow(unused)]
fn main() {
pub fn generate_transactions(&mut self) -> Result<Vec<JournalEntry>> {
let target = self.config.transactions.target_count;
let mut entries = Vec::with_capacity(target as usize);
// Calculate counts by source
let p2p_count = (target as f64 * self.config.document_flows.p2p.flow_rate) as u64;
let o2c_count = (target as f64 * self.config.document_flows.o2c.flow_rate) as u64;
let other_count = target - p2p_count - o2c_count;
// 1. Generate P2P flows
let p2p_entries = self.generate_p2p_flows(p2p_count)?;
entries.extend(p2p_entries);
// 2. Generate O2C flows
let o2c_entries = self.generate_o2c_flows(o2c_count)?;
entries.extend(o2c_entries);
// 3. Generate other entries (manual, recurring, etc.)
let other_entries = self.generate_other_entries(other_count)?;
entries.extend(other_entries);
// 4. Sort by posting date
entries.sort_by_key(|e| e.header.posting_date);
// 5. Update balance tracker
for entry in &entries {
self.balance_tracker.post(entry)?;
}
Ok(entries)
}
}
P2P Flow Generation
#![allow(unused)]
fn main() {
fn generate_p2p_flows(&mut self, count: u64) -> Result<Vec<JournalEntry>> {
let mut p2p_gen = P2pGenerator::new(&self.config, &self.registry, &mut self.rng);
let mut doc_gen = DocumentFlowJeGenerator::new(&self.config);
let mut entries = Vec::new();
for _ in 0..count {
// 1. Generate document flow
let flow = p2p_gen.generate_flow()?;
self.state.documents.add_p2p_flow(&flow);
// 2. Generate journal entries from flow
let flow_entries = doc_gen.generate_from_p2p(&flow)?;
entries.extend(flow_entries);
}
Ok(entries)
}
}
Outputs:
- Journal entries
- Document records
- Updated balances
Stage 5: Period Close
Purpose: Run period-end processes
#![allow(unused)]
fn main() {
pub fn run_period_close(&mut self) -> Result<()> {
let close_engine = CloseEngine::new(&self.config.period_close);
for period in self.config.periods() {
// 1. Monthly close
let monthly_entries = close_engine.run_monthly_close(
period,
&self.state,
&mut self.balance_tracker,
)?;
self.state.entries.extend(monthly_entries);
// 2. Quarterly close (if applicable)
if period.is_quarter_end() {
let quarterly_entries = close_engine.run_quarterly_close(
period,
&self.state,
)?;
self.state.entries.extend(quarterly_entries);
}
// 3. Generate trial balance
let trial_balance = self.balance_tracker.to_trial_balance(period);
self.state.trial_balances.push(trial_balance);
}
// 4. Annual close
if self.config.has_year_end() {
let annual_entries = close_engine.run_annual_close(&self.state)?;
self.state.entries.extend(annual_entries);
}
Ok(())
}
}
Outputs:
- Accrual entries
- Depreciation entries
- Closing entries
- Trial balances
Stage 6: Anomaly Injection
Purpose: Add anomalies and generate labels
#![allow(unused)]
fn main() {
pub fn inject_anomalies(&mut self) -> Result<()> {
if !self.config.anomaly_injection.enabled {
return Ok(());
}
let mut injector = AnomalyInjector::new(
&self.config.anomaly_injection,
&mut self.rng,
);
// 1. Select entries for injection
let target_count = (self.state.entries.len() as f64
* self.config.anomaly_injection.total_rate) as usize;
// 2. Inject anomalies
let (modified, labels) = injector.inject(
&mut self.state.entries,
target_count,
)?;
// 3. Store labels
self.state.anomaly_labels = labels;
// 4. Apply data quality variations
if self.config.data_quality.enabled {
let dq_injector = DataQualityInjector::new(&self.config.data_quality);
dq_injector.apply(&mut self.state)?;
}
Ok(())
}
}
Outputs:
- Modified entries with anomalies
- Anomaly labels for ML
Stage 7: Export
Purpose: Write all outputs
#![allow(unused)]
fn main() {
pub fn export(&self) -> Result<()> {
// 1. Master data
self.export_master_data()?;
// 2. Transactions
self.export_transactions()?;
// 3. Documents
self.export_documents()?;
// 4. Subledgers
self.export_subledgers()?;
// 5. Trial balances
self.export_trial_balances()?;
// 6. Labels
self.export_labels()?;
// 7. Controls
self.export_controls()?;
// 8. Graphs (if enabled)
if self.config.graph_export.enabled {
self.export_graphs()?;
}
Ok(())
}
}
Outputs:
- CSV/JSON files
- Graph exports
- Label files
Stage 8: Banking & Process Mining
Purpose: Generate banking/KYC/AML data and OCEL 2.0 event logs
If banking or OCEL generation is enabled in the config, this stage produces banking transactions with KYC profiles and/or OCEL 2.0 event logs for process mining.
Outputs:
- Banking customers, accounts, transactions
- KYC profiles and AML typology labels
- OCEL 2.0 event logs, objects, process variants
Stage 9: Audit Generation
Purpose: Generate ISA-compliant audit data
If audit generation is enabled, generates engagement records, workpapers, evidence, risks, findings, and professional judgments.
Outputs:
- Audit engagements, workpapers, evidence
- Risk assessments and findings
- Professional judgment documentation
Stage 10: Graph Export
Purpose: Build and export ML-ready graphs
If graph export is enabled, builds transaction, approval, and entity graphs and exports to configured formats.
Outputs:
- PyTorch Geometric tensors (.pt)
- Neo4j CSV + Cypher scripts
- DGL graph structures
Stage 11: LLM Enrichment (v0.5.0)
Purpose: Enrich generated data with LLM-generated metadata
When LLM enrichment is enabled, uses the configured LlmProvider (Mock, OpenAI, Anthropic, or Custom) to generate realistic:
- Vendor names appropriate for industry and spend category
- Transaction descriptions and memo fields
- Natural language explanations for injected anomalies
The Mock provider is deterministic and requires no network access, making it suitable for CI/CD pipelines.
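For illustration, a minimal sketch of an enrichment call behind a provider trait. The trait shape loosely mirrors the LlmProvider and MockLlmProvider described under datasynth-core, but the types below are assumptions, not the crate's API:
trait LlmProvider {
    fn complete(&self, prompt: &str) -> Result<String, Box<dyn std::error::Error>>;
}
struct MockProvider;
impl LlmProvider for MockProvider {
    fn complete(&self, prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
        // Deterministic output derived only from the prompt: no network calls, CI-friendly
        Ok(format!("Acme Industrial Supplies (prompt: {} chars)", prompt.len()))
    }
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let provider: Box<dyn LlmProvider> = Box::new(MockProvider);
    let vendor_name = provider.complete("Vendor name for Manufacturing / MRO spend")?;
    println!("{vendor_name}");
    Ok(())
}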
Outputs:
- Enriched vendor master data
- Enriched journal entry descriptions
- Anomaly explanation text
Stage 12: Diffusion Enhancement (v0.5.0)
Purpose: Optionally blend diffusion model outputs with rule-based data
When diffusion is enabled, uses a StatisticalDiffusionBackend to generate samples through a learned denoising process. The HybridGenerator blends diffusion outputs with rule-based data using one of three strategies:
- Interpolate: Weighted average of rule-based and diffusion values
- Select: Per-record random selection from either source
- Ensemble: Column-level blending (diffusion for amounts, rule-based for categoricals)
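As a rough illustration of the Interpolate strategy (a sketch only; the function and parameter names are assumptions, not the HybridGenerator API), the blended amount is a weighted average of the two sources:
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
fn interpolate(rule_based: Decimal, diffusion: Decimal, diffusion_weight: Decimal) -> Decimal {
    // blended = w * diffusion + (1 - w) * rule_based, with w in [0, 1]
    diffusion_weight * diffusion + (dec!(1) - diffusion_weight) * rule_based
}
fn main() {
    // 30% weight on the diffusion sample, 70% on the rule-based amount
    let blended = interpolate(dec!(1200.00), dec!(1187.42), dec!(0.3));
    println!("blended amount = {blended}");
}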
Outputs:
- Blended transaction amounts and attributes
- Diffusion fit report (mean/std errors, correlation preservation)
Stage 13: Causal Overlay (v0.5.0)
Purpose: Apply causal structure to generated data
When causal generation is enabled, constructs a StructuralCausalModel from the configured causal graph (or a built-in template like fraud_detection or revenue_cycle) and generates data that respects causal relationships. Supports:
- Observational generation: Data following the causal structure
- Interventional generation: Data under do-calculus interventions (“what-if” scenarios)
- Counterfactual generation: Counterfactual versions of existing records via abduction-action-prediction
The causal validator verifies that generated data preserves the specified causal structure.
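A toy sketch of the observational-versus-interventional distinction, using simplified, assumed structural equations rather than the crate's StructuralCausalModel API:
use rand::{Rng, SeedableRng};
use rand_chacha::ChaCha8Rng;
fn sample(rng: &mut ChaCha8Rng, do_x: Option<f64>) -> (f64, f64) {
    let u_x: f64 = rng.gen_range(-1.0..1.0); // exogenous noise for X
    let u_y: f64 = rng.gen_range(-0.5..0.5); // exogenous noise for Y
    let x = do_x.unwrap_or(10.0 + u_x);      // do(X = x0) overrides X's structural equation
    let y = 2.0 * x + u_y;                   // structural equation: Y depends causally on X
    (x, y)
}
fn main() {
    let mut rng = ChaCha8Rng::seed_from_u64(42);
    let mean_y = |samples: &[(f64, f64)]| {
        samples.iter().map(|&(_, y)| y).sum::<f64>() / samples.len() as f64
    };
    let observational: Vec<_> = (0..1000).map(|_| sample(&mut rng, None)).collect();
    let intervened: Vec<_> = (0..1000).map(|_| sample(&mut rng, Some(15.0))).collect();
    // Pushing X from its observational mean of ~10 to do(X = 15) raises E[Y] by roughly 2 * 5 = 10
    println!("E[Y] observational  ≈ {:.2}", mean_y(&observational));
    println!("E[Y] under do(X=15) ≈ {:.2}", mean_y(&intervened));
}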
Outputs:
- Causally-structured records
- Intervention results with effect estimates
- Counterfactual pairs (factual + counterfactual)
- Causal validation report
Stage 14: Source-to-Contract (v0.6.0)
Purpose: Generate the full S2C procurement pipeline
When source-to-pay is enabled, generates the complete sourcing lifecycle from spend analysis through supplier scorecards. The generation DAG follows:
Spend Analysis → Sourcing Project → Supplier Qualification → RFx Event → Bids →
Bid Evaluation → Procurement Contract → Catalog Items → [feeds into P2P] → Supplier Scorecard
Outputs:
- Spend analysis records and category hierarchies
- Sourcing projects with supplier qualification data
- RFx events (RFI/RFP/RFQ), bids, and bid evaluations
- Procurement contracts and catalog items
- Supplier scorecards with performance metrics
Stage 15: Financial Reporting (v0.6.0)
Purpose: Generate bank reconciliations and financial statements
When financial reporting is enabled, produces bank reconciliations with auto-matching and full financial statement sets derived from the adjusted trial balance.
Bank reconciliations match payments to bank statement lines with configurable auto-match, manual match, and exception rates. Financial statements include:
- Balance Sheet: Assets = Liabilities + Equity
- Income Statement: Revenue - COGS - OpEx - Tax = Net Income
- Cash Flow Statement: Indirect method with operating, investing, and financing categories
- Statement of Changes in Equity: Retained earnings, dividends, comprehensive income
Also generates management KPIs (financial ratios) and budget variance analysis when configured.
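For illustration, the income-statement derivation above can be sketched with the project's Decimal convention; the figures and the flat tax rate are assumptions:
use rust_decimal_macros::dec;
fn main() {
    let revenue = dec!(1000000.00);
    let cogs = dec!(620000.00);
    let opex = dec!(250000.00);
    let tax_rate = dec!(0.20);
    let gross_profit = revenue - cogs;
    let operating_income = gross_profit - opex;
    let tax = operating_income * tax_rate;
    let net_income = operating_income - tax;
    // Revenue - COGS - OpEx - Tax = Net Income
    println!("Net income: {net_income}");
}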
Outputs:
- Bank reconciliations with statement lines and reconciling items
- Financial statements (balance sheet, income statement, cash flow, changes in equity)
- Management KPIs and financial ratios
- Budget vs. actual variance reports
Stage 16: HR Data (v0.6.0)
Purpose: Generate Hire-to-Retire (H2R) process data
When HR generation is enabled, produces payroll runs, time entries, and expense reports linked to the employee master data generated in Stage 2.
Outputs:
- Payroll runs with employee pay line items (gross, deductions, net, employer cost)
- Time entries with regular hours, overtime, PTO, and sick leave
- Expense reports with categorized line items and approval workflows
Stage 17: Accounting Standards (v0.6.0)
Purpose: Generate ASC 606/IFRS 15 revenue recognition and impairment testing data
When accounting standards generation is enabled, produces customer contracts with performance obligations for revenue recognition and asset impairment test records.
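As a hedged sketch of one common ASC 606 step, the transaction price can be allocated across performance obligations in proportion to standalone selling prices; names and figures below are illustrative, not the generator's API:
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
fn allocate(transaction_price: Decimal, standalone_prices: &[Decimal]) -> Vec<Decimal> {
    // Each obligation receives transaction_price * (SSP / total SSP), rounded to cents
    let total: Decimal = standalone_prices.iter().copied().sum();
    standalone_prices
        .iter()
        .map(|ssp| (transaction_price * ssp / total).round_dp(2))
        .collect()
}
fn main() {
    // A 100,000 contract covering a license (SSP 80,000) and support (SSP 40,000)
    let allocated = allocate(dec!(100000.00), &[dec!(80000.00), dec!(40000.00)]);
    println!("{allocated:?}"); // ≈ [66666.67, 33333.33]
}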
Outputs:
- Customer contracts with performance obligations (ASC 606/IFRS 15)
- Revenue recognition schedules
- Asset impairment tests with recoverable amount calculations
Stage 18: Manufacturing (v0.6.0)
Purpose: Generate manufacturing process data
When manufacturing is enabled, produces production orders, quality inspections, and cycle counts linked to materials from the master data.
Outputs:
- Production orders with BOM components and routing steps
- Quality inspections with pass/fail/conditional results
- Inventory cycle counts with variance analysis
Stage 19: Sales Quotes, KPIs, and Budgets (v0.6.0)
Purpose: Generate sales pipeline and financial planning data
When enabled, produces the quote-to-order pipeline, management KPI computations, and budget variance analysis.
Outputs:
- Sales quotes with line items, conversion tracking, and win/loss rates
- Management KPIs (liquidity, profitability, efficiency, leverage ratios)
- Budget records with actual vs. planned variance analysis
Parallel Execution
Stages that support parallelism:
#![allow(unused)]
fn main() {
// Parallel transaction generation
let entries: Vec<JournalEntry> = (0..num_threads)
.into_par_iter()
.flat_map(|thread_id| {
let mut gen = JournalEntryGenerator::new(
&config,
seed + thread_id as u64,
);
gen.generate_batch(batch_size)
})
.collect();
}
Progress Tracking
#![allow(unused)]
fn main() {
pub fn run_with_progress<F>(&mut self, callback: F) -> Result<()>
where
F: Fn(Progress),
{
let tracker = ProgressTracker::new(self.config.total_items());
for stage in self.stages() {
tracker.set_phase(&stage.name);
stage.run()?;
tracker.advance(stage.items);
callback(tracker.progress());
}
Ok(())
}
}
Resource Management
How SyntheticData manages system resources during generation.
Overview
Large-scale data generation can stress system resources. SyntheticData provides:
- Memory Guard: Cross-platform memory tracking with soft/hard limits
- Disk Space Guard: Disk capacity monitoring and pre-write checks
- CPU Monitor: CPU load tracking with auto-throttling
- Resource Guard: Unified orchestration of all resource guards
- Graceful Degradation: Progressive feature reduction under resource pressure
- Streaming Output: Reduce memory pressure
Memory Guard
The MemoryGuard component tracks process memory usage:
#![allow(unused)]
fn main() {
pub struct MemoryGuard {
config: MemoryGuardConfig,
last_check: Instant,
last_usage: u64,
}
pub struct MemoryGuardConfig {
pub soft_limit: u64, // Pause/slow threshold
pub hard_limit: u64, // Stop threshold
pub check_interval_ms: u64, // How often to check
pub growth_rate_threshold: f64, // Bytes/sec warning
}
pub struct MemoryStatus {
pub current_usage: u64,
pub exceeds_soft_limit: bool,
pub exceeds_hard_limit: bool,
pub growth_rate: f64,
}
}
Platform Support
| Platform | Method |
|---|---|
| Linux | /proc/self/statm |
| macOS | ps command |
| Windows | Stubbed (returns 0) |
Linux Implementation
#![allow(unused)]
fn main() {
#[cfg(target_os = "linux")]
fn get_memory_usage() -> Option<u64> {
    // Resident set size: second field of /proc/self/statm, measured in pages
    let statm = std::fs::read_to_string("/proc/self/statm").ok()?;
    let rss_pages: u64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    let page_size = unsafe { libc::sysconf(libc::_SC_PAGESIZE) } as u64;
    Some(rss_pages * page_size)
}
}
macOS Implementation
#![allow(unused)]
fn main() {
#[cfg(target_os = "macos")]
fn get_memory_usage() -> Option<u64> {
    // Resident set size in KB reported by `ps`, converted to bytes
    let output = std::process::Command::new("ps")
        .args(["-o", "rss=", "-p", &std::process::id().to_string()])
        .output()
        .ok()?;
    let rss_kb: u64 = String::from_utf8_lossy(&output.stdout)
        .trim()
        .parse()
        .ok()?;
    Some(rss_kb * 1024)
}
}
Configuration
global:
memory_limit: 2147483648 # 2 GB hard limit
Or programmatically:
#![allow(unused)]
fn main() {
let config = MemoryGuardConfig {
soft_limit: 1024 * 1024 * 1024, // 1 GB
hard_limit: 2 * 1024 * 1024 * 1024, // 2 GB
check_interval_ms: 1000, // Check every second
growth_rate_threshold: 100_000_000.0, // 100 MB/sec
};
}
Usage in Generation
#![allow(unused)]
fn main() {
pub fn generate_with_memory_guard(&mut self) -> Result<()> {
let guard = MemoryGuard::new(self.memory_config);
loop {
// Check memory
let status = guard.check();
if status.exceeds_hard_limit {
// Stop generation
return Err(Error::MemoryExceeded);
}
if status.exceeds_soft_limit {
// Flush output and trigger GC
self.sink.flush()?;
self.state.clear_caches();
continue;
}
if status.growth_rate > guard.config.growth_rate_threshold {
// Slow down
thread::sleep(Duration::from_millis(100));
}
// Generate batch
let batch = self.generator.generate_batch(BATCH_SIZE)?;
self.process_batch(batch)?;
if self.is_complete() {
break;
}
}
Ok(())
}
}
Memory Estimation
Estimate memory requirements before generation:
#![allow(unused)]
fn main() {
pub fn estimate_memory(config: &Config) -> MemoryEstimate {
let entry_size = 512; // Approximate bytes per entry
let master_data_size = config.estimate_master_data_size();
let peak = master_data_size
+ (config.transactions.target_count as u64 * entry_size);
let streaming_peak = master_data_size
+ (BATCH_SIZE as u64 * entry_size);
MemoryEstimate {
batch_peak: peak,
streaming_peak,
recommended_limit: peak * 2,
}
}
}
Memory-Efficient Patterns
Streaming Output
Write as you generate instead of accumulating:
#![allow(unused)]
fn main() {
// Memory-efficient
for entry in generator.generate_stream() {
sink.write(&entry?)?;
}
// Memory-intensive (avoid for large volumes)
let all_entries = generator.generate_batch(1_000_000)?;
sink.write_batch(&all_entries)?;
}
Batch Processing with Flush
#![allow(unused)]
fn main() {
const BATCH_SIZE: usize = 10_000;
let mut buffer = Vec::with_capacity(BATCH_SIZE);
for entry in generator.generate_stream() {
buffer.push(entry?);
if buffer.len() >= BATCH_SIZE {
sink.write_batch(&buffer)?;
buffer.clear();
}
}
// Final flush
if !buffer.is_empty() {
sink.write_batch(&buffer)?;
}
}
Lazy Loading
Load master data on demand:
#![allow(unused)]
fn main() {
pub struct LazyRegistry {
vendors: OnceCell<Vec<Vendor>>,
vendor_loader: Box<dyn Fn() -> Vec<Vendor>>,
}
impl LazyRegistry {
pub fn vendors(&self) -> &[Vendor] {
self.vendors.get_or_init(|| (self.vendor_loader)())
}
}
}
Memory Limits by Component
Estimated memory usage:
| Component | Size (per item) | For 1M entries |
|---|---|---|
| JournalEntry | ~512 bytes | ~500 MB |
| Document | ~1 KB | ~1 GB |
| Graph Node | ~128 bytes | ~128 MB |
| Graph Edge | ~64 bytes | ~64 MB |
Monitoring
Progress with Memory
#![allow(unused)]
fn main() {
orchestrator.run_with_progress(|progress| {
let memory_mb = guard.check().current_usage / 1_000_000;
println!(
"[{:.1}%] {} entries | {} MB",
progress.percent,
progress.current,
memory_mb
);
});
}
Server Endpoint
curl http://localhost:3000/health
{
"status": "healthy",
"memory_usage_mb": 512,
"memory_limit_mb": 2048,
"memory_percent": 25.0
}
Troubleshooting
Out of Memory
Symptoms: Process killed, “out of memory” error
Solutions:
- Reduce target_count
- Enable streaming output
- Increase system memory
- Set an appropriate memory_limit
Slow Generation
Symptoms: Generation slows over time
Cause: Memory pressure triggering slowdown
Solutions:
- Increase soft limit
- Reduce batch size
- Enable more aggressive flushing
Memory Not Freed
Symptoms: Memory stays high after generation
Cause: Data retained in caches
Solution: Explicitly clear state:
#![allow(unused)]
fn main() {
orchestrator.clear_caches();
}
Disk Space Guard
Monitors disk space and prevents disk exhaustion:
#![allow(unused)]
fn main() {
pub struct DiskSpaceGuardConfig {
pub hard_limit_mb: usize, // Minimum free space required
pub soft_limit_mb: usize, // Warning threshold
pub check_interval: usize, // Check every N operations
pub reserve_buffer_mb: usize, // Buffer to maintain
}
}
Platform Support
| Platform | Method |
|---|---|
| Linux/macOS | statvfs syscall |
| Windows | GetDiskFreeSpaceExW |
Usage
#![allow(unused)]
fn main() {
let guard = DiskSpaceGuard::with_min_free(100); // 100 MB minimum
// Periodic check
guard.check()?;
// Pre-write check with size estimation
guard.check_before_write(estimated_bytes)?;
// Size estimation for planning
let size = estimate_output_size_mb(100_000, &[OutputFormat::Csv], false);
}
CPU Monitor
Tracks CPU load with optional auto-throttling:
#![allow(unused)]
fn main() {
pub struct CpuMonitorConfig {
pub enabled: bool,
pub high_load_threshold: f64, // 0.85 default
pub critical_load_threshold: f64, // 0.95 default
pub sample_interval_ms: u64,
pub auto_throttle: bool,
pub throttle_delay_ms: u64,
}
}
Platform Support
| Platform | Method |
|---|---|
| Linux | /proc/stat parsing |
| macOS | top -l 1 command |
Usage
#![allow(unused)]
fn main() {
let config = CpuMonitorConfig::with_thresholds(0.85, 0.95)
.with_auto_throttle(50);
let monitor = CpuMonitor::new(config);
// In generation loop
if let Some(load) = monitor.sample() {
if load > 0.85 {
// Consider slowing down
}
monitor.maybe_throttle(); // Applies delay if critical
}
}
Unified Resource Guard
Combines all guards into single interface:
#![allow(unused)]
fn main() {
let guard = ResourceGuard::new(ResourceGuardConfig::default())
.with_memory_limit(2 * 1024 * 1024 * 1024)
.with_output_path("./output")
.with_cpu_monitoring();
// Check all resources at once
guard.check_all()?;
let stats = guard.stats();
println!("Memory: {}%", stats.memory_usage_percent);
println!("Disk: {} MB free", stats.disk_available_mb);
println!("CPU: {}%", stats.cpu_load * 100.0);
}
Graceful Degradation
Progressive feature reduction under resource pressure:
#![allow(unused)]
fn main() {
pub enum DegradationLevel {
Normal, // All features enabled
Reduced, // 50% batch, skip data quality, 50% anomaly rate
Minimal, // 25% batch, essential only, no injections
Emergency, // Flush and terminate
}
}
Thresholds
| Level | Memory | Disk | Batch Size | Actions |
|---|---|---|---|---|
| Normal | <70% | >1GB | 100% | Full operation |
| Reduced | 70-85% | 500MB-1GB | 50% | Skip data quality |
| Minimal | 85-95% | 100-500MB | 25% | Essential data only |
| Emergency | >95% | <100MB | 0% | Graceful shutdown |
Usage
#![allow(unused)]
fn main() {
let controller = DegradationController::new(DegradationConfig::default());
// Update based on current resource status
let status = ResourceStatus::new(
Some(memory_usage),
Some(disk_available_mb),
Some(cpu_load),
);
let (level, changed) = controller.update(&status);
if changed {
let actions = DegradationActions::for_level(level);
if actions.skip_data_quality {
// Disable data quality injection
}
if actions.terminate {
// Flush and exit
}
}
}
Configuration
global:
resource_budget:
memory:
hard_limit_mb: 2048
disk:
min_free_mb: 500
reserve_buffer_mb: 100
cpu:
enabled: true
high_load_threshold: 0.85
auto_throttle: true
degradation:
enabled: true
reduced_threshold: 0.70
minimal_threshold: 0.85
Enterprise Process Chains
SyntheticData models enterprise operations as interconnected process chains — end-to-end business flows that share master data, generate journal entries, and link through common documents. This page maps the current implementation status and shows how the chains integrate.
Coverage Matrix
| Chain | Full Name | Coverage | Status | Key Modules |
|---|---|---|---|---|
| S2P | Source-to-Pay | 95% | Implemented (P2P + S2C + OCPM) | document_flow/p2p_generator, sourcing/, ocpm/s2c_generator |
| O2C | Order-to-Cash | 95% | Implemented (+ OCPM) | document_flow/o2c_generator, master_data/customer, subledger/ar |
| R2R | Record-to-Report | 85% | Implemented (+ Bank Recon OCPM) | je_generator, balance/, period_close/, ocpm/bank_recon_generator |
| A2R | Acquire-to-Retire | 70% | Partially implemented | master_data/asset, subledger/fa, period_close/depreciation |
| INV | Inventory Management | 55% | Partially implemented | subledger/inventory, document_flow/ (GR/delivery links) |
| BANK | Banking & Treasury | 85% | Implemented (+ OCPM) | datasynth-banking, ocpm/bank_generator |
| H2R | Hire-to-Retire | 85% | Implemented (+ OCPM) | hr/, master_data/employee, ocpm/h2r_generator |
| MFG | Plan-to-Produce | 85% | Implemented (+ OCPM) | manufacturing/, ocpm/mfg_generator |
| AUDIT | Audit Lifecycle | 90% | Implemented (+ OCPM) | audit/, ocpm/audit_generator |
Implemented Chains
Source-to-Pay (S2P)
The S2P chain covers procurement from purchase requisition through payment:
Source-to-Contract (S2C) — Implemented (v0.6.0)
┌──────────────────────────────────────────────┐
│ Spend Analysis → RFx → Bid Eval → Contract │
└──────────────────────────┬───────────────────┘
│
┌──────────────────────────────────────────┼──────────────────────────┐
│ Procure-to-Pay (P2P) — Implemented │
│ │ │
│ Purchase Purchase Goods Vendor Three-Way │
│ Requisition → Order → Receipt → Invoice → Match → Payment │
│ │ │ │ │ │
│ │ ┌────┘ │ │ │
│ ▼ ▼ ▼ ▼ │
│ AP Open Item ← Match Result AP Aging Bank │
└────────────────────────────────────────────────────────────────────┘
│
┌──────────────────────────┘
▼
Vendor Network (quality scores, clusters, supply chain tiers)
P2P implementation details:
| Component | Types/Variants | Key Config |
|---|---|---|
| Purchase Orders | 6 types: Standard, Service, Framework, Consignment, StockTransfer, Subcontracting | flow_rate, completion_rate |
| Goods Receipts | 7 movement types: GrForPo, ReturnToVendor, GrForProduction, TransferPosting, InitialEntry, Scrapping, Consumption | gr_rate |
| Vendor Invoices | Three-way match with tolerance | price_tolerance, quantity_tolerance |
| Payments | Configurable terms and scheduling | payment_rate, timing ranges |
| Three-Way Match | PO ↔ GR ↔ Invoice validation with 6 variance types | allow_over_delivery, max_over_delivery_pct |
Order-to-Cash (O2C)
The O2C chain covers the revenue cycle from sales order through cash collection:
┌─────────────────────────────────────────────────────────────────────┐
│ Order-to-Cash (O2C) │
│ │
│ Sales Credit Delivery Customer Customer │
│ Order → Check → (Pick/ → Invoice → Receipt │
│ │ Pack/ │ │ │
│ │ Ship) │ │ │
│ │ │ ▼ ▼ │
│ │ │ AR Open Item AR Aging │
│ │ │ │ │
│ │ │ └→ Dunning Notices │
│ │ ▼ │
│ │ Inventory Issue │
│ │ (COGS posting) │
└────┼────────────────────────────────────────────────────────────────┘
│
Revenue Recognition (ASC 606 / IFRS 15)
Customer Contracts → Performance Obligations
O2C implementation details:
| Component | Types/Variants | Key Config |
|---|---|---|
| Sales Orders | 9 types: Standard, Rush, CashSale, Return, FreeOfCharge, Consignment, Service, CreditMemoRequest, DebitMemoRequest | flow_rate, credit_check |
| Deliveries | 6 types: Outbound, Return, StockTransfer, Replenishment, ConsignmentIssue, ConsignmentReturn | delivery_rate |
| Customer Invoices | 7 types: Standard, CreditMemo, DebitMemo, ProForma, DownPaymentRequest, FinalInvoice, Intercompany | invoice_rate |
| Customer Receipts | Full, partial, on-account, corrections, NSF | collection_rate |
Record-to-Report (R2R)
The R2R chain covers financial close and reporting:
Journal Entries (from all chains)
│
▼
Balance Tracker → Trial Balance → Adjustments → Close
│ │ │
├→ Intercompany Matching ├→ Accruals ├→ Year-End Close
│ └→ IC Elimination ├→ Reclasses └→ Retained Earnings
│ └→ FX Reval
▼
Consolidation
├→ Currency Translation
├→ CTA Adjustments
└→ Consolidated Trial Balance
R2R coverage:
- Journal entry generation from all process chains
- Opening balance, running balance tracking, trial balance per period
- Intercompany matching and elimination entries
- Period close engine: accruals, depreciation, year-end closing
- Audit simulation (ISA-compliant workpapers, findings, opinions)
Gaps: Financial statement generation (balance sheet, income statement, cash flow), budget vs actual reporting.
Banking & Treasury (BANK) — 85%
Implemented: Bank customer profiles, KYC/AML, bank accounts, transactions with fraud typologies (structuring, funnel, layering, mule, round-tripping). OCPM events for customer onboarding, KYC review, account management, and transaction lifecycle.
Gaps: Cash forecasting, liquidity management.
Hire-to-Retire (H2R) — 85%
Implemented: Employee master data, payroll runs with tax/deduction calculations, time entries with overtime, expense reports with policy violations. OCPM events for payroll lifecycle, time entry approval, and expense approval chains.
Gaps: Benefits administration, workforce planning.
Plan-to-Produce (MFG) — 85%
Implemented: Production orders with BOM explosion, routing operations, WIP costing, quality inspections, cycle counting. OCPM events for production order lifecycle, quality inspection, and cycle count reconciliation.
Gaps: Material requirements planning (MRP), advanced shop floor control.
Audit Lifecycle (AUDIT) — 90%
Implemented: Engagement planning, risk assessment (ISA 315/330), workpaper creation and review (ISA 230), evidence collection (ISA 500), findings (ISA 265), professional judgment documentation (ISA 200). OCPM events for the full engagement lifecycle.
Gaps: Multi-engagement portfolio management.
Partially Implemented Chains
Acquire-to-Retire (A2R) — 70%
Implemented: Fixed asset master data, depreciation (6 methods), acquisition from PO, disposal with gain/loss accounting, impairment testing (ASC 360/IAS 36).
Gaps: Capital project/WBS integration, asset transfers between companies, construction-in-progress (CIP) tracking.
Inventory Management (INV) — 55%
Implemented: Inventory positions, 22 movement types, 4 valuation methods, stock status tracking, P2P goods receipts, O2C goods issues.
Gaps: Quality inspection integration, obsolescence management, ABC analysis.
Cross-Process Integration
Process chains share data through several integration points, now with full OCPM event coverage:
S2C ──→ S2P O2C R2R
│ │ │ │
Contract GR ──── Inventory ─────┼── Delivery │
│ │ │ │
Payment ────────┼────────────┼── Receipt ──── Bank Recon
│ │ │ │ │
AP Open Item │ AR Open Item BANK │
│ MFG─┘ │ │ │
└──H2R──┴──────────────┴──── Journal Entries ┘
│ │
AUDIT ─────────────────────────── Trial Balance
│
Consolidation
──── All chains feed OCEL 2.0 Event Log (88 activities) ────
Integration Map
| Integration Point | From Chain | To Chain | Mechanism |
|---|---|---|---|
| Inventory bridge | S2P (Goods Receipt) | O2C (Delivery) | GR increases stock, delivery decreases |
| Payment clearing | S2P / O2C | BANK | Payment status → bank reconciliation |
| Journal entries | All chains | R2R | Every document posts GL entries |
| Asset acquisition | S2P (Capital PO) | A2R | PO → GR → Fixed Asset Record |
| Revenue recognition | O2C (Invoice) | R2R | Contract → Revenue JE |
| Depreciation | A2R | R2R | Monthly depreciation → Trial Balance |
| Intercompany | S2P / O2C | R2R | IC invoices → IC matching → elimination |
Document Reference Types
Documents maintain referential integrity across chains through 9 reference types:
| Reference Type | Description | Example |
|---|---|---|
| FollowOn | Normal flow succession | PO → GR |
| Payment | Payment for invoice | PAY → VI |
| Reversal | Correction/reversal | Credit Memo → Invoice |
| Partial | Partial fulfillment | Partial GR → PO |
| CreditMemo | Credit against invoice | CM → Invoice |
| DebitMemo | Debit against invoice | DM → Invoice |
| Return | Return against delivery | Return → Delivery |
| IntercompanyMatch | IC matching pair | IC-INV → IC-INV |
| Manual | User-defined reference | Any → Any |
Roadmap
The process chain expansion follows a wave-based plan:
| Wave | Focus | Chains Affected |
|---|---|---|
| Wave 1 | S2C completion, bank reconciliation, financial statements | S2P, BANK, R2R |
| Wave 2 | Payroll/time, revenue recognition generator, impairment generator | H2R, O2C, A2R |
| Wave 3 | Production orders/WIP, cycle counting/QA, expense management | MFG, INV, H2R |
| Wave 4 | Sales quotes, cash forecasting, KPIs/budgets, obsolescence | O2C, BANK, R2R, INV |
For detailed coverage targets and implementation plans, see:
- S2P Process Chain Spec — Source-to-Contract extension
- Enterprise Process Chain Gaps — Full gap analysis across all chains
See Also
- Document Flows — P2P and O2C configuration
- Subledgers — AR, AP, FA, Inventory detail
- FX & Currency — Multi-currency and translation
- Generation Pipeline — How the orchestrator sequences generators
- Roadmap — Future development plans
Design Decisions
Key architectural choices and their rationale.
1. Deterministic RNG
Decision: Use seeded ChaCha8 RNG for all randomness.
Rationale:
- Reproducible output for testing and debugging
- Consistent results across runs
- Parallel generation with per-thread seeds
Implementation:
#![allow(unused)]
fn main() {
use rand_chacha::ChaCha8Rng;
use rand::SeedableRng;
let mut rng = ChaCha8Rng::seed_from_u64(config.global.seed);
}
Trade-off: Slightly slower than system RNG, but reproducibility is essential for financial data testing.
2. Precise Decimal Arithmetic
Decision: Use rust_decimal::Decimal for all monetary values.
Rationale:
- IEEE 754 floating-point causes rounding errors
- Financial systems require exact decimal representation
- Debits must exactly equal credits
Implementation:
#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
let amount = dec!(1234.56);
let tax = amount * dec!(0.077); // Exact
}
Serialization: Decimals serialized as strings to preserve precision:
{"amount": "1234.56"}
3. Balanced Entry Enforcement
Decision: JournalEntry enforces debits = credits at construction.
Rationale:
- Invalid accounting entries should be impossible
- Catches bugs early in generation
- Guarantees trial balance coherence
Implementation:
#![allow(unused)]
fn main() {
impl JournalEntry {
pub fn new(header: JournalEntryHeader, lines: Vec<JournalEntryLine>) -> Result<Self> {
let entry = Self { header, lines };
if !entry.is_balanced() {
return Err(Error::UnbalancedEntry);
}
Ok(entry)
}
}
}
4. Collision-Free UUIDs
Decision: Use FNV-1a hash-based UUID generation with generator-type discriminators.
Rationale:
- Document IDs must be unique across all generators
- Deterministic generation requires deterministic IDs
- Different generator types might generate same sequence
Implementation:
#![allow(unused)]
fn main() {
pub struct DeterministicUuidFactory {
counter: AtomicU64,
seed: u64,
}
pub enum GeneratorType {
JournalEntry = 0x01,
DocumentFlow = 0x02,
Vendor = 0x03,
// ...
}
impl DeterministicUuidFactory {
pub fn generate(&self, gen_type: GeneratorType) -> Uuid {
let counter = self.counter.fetch_add(1, Ordering::SeqCst);
let hash_input = (self.seed, gen_type as u8, counter);
Uuid::from_bytes(fnv1a_hash(&hash_input))
}
}
}
5. Empirical Distributions
Decision: Base statistical distributions on academic research.
Rationale:
- Synthetic data should match real-world patterns
- Benford’s Law is expected in authentic financial data
- Line item distributions affect detection algorithms
Sources:
- Line item counts: GL research showing 60.68% two-line, 88% even counts
- Amounts: Log-normal with round-number bias
- Temporal: Month/quarter/year-end spikes
Implementation:
#![allow(unused)]
fn main() {
pub struct LineItemSampler {
distribution: EmpiricalDistribution,
}
impl LineItemSampler {
pub fn new() -> Self {
Self {
distribution: EmpiricalDistribution::from_data(&[
(2, 0.6068),
(3, 0.0524),
(4, 0.1732),
// ...
]),
}
}
}
}
6. Document Chain Integrity
Decision: Maintain proper reference chains with explicit links.
Rationale:
- Real document flows have traceable references
- Process mining requires complete chains
- Audit trails need document relationships
Implementation:
#![allow(unused)]
fn main() {
pub struct DocumentReference {
pub from_type: DocumentType,
pub from_id: String,
pub to_type: DocumentType,
pub to_id: String,
pub reference_type: ReferenceType,
}
// Payment explicitly references invoices
let payment_ref = DocumentReference {
from_type: DocumentType::Payment,
from_id: payment.id.clone(),
to_type: DocumentType::Invoice,
to_id: invoice.id.clone(),
reference_type: ReferenceType::PaymentFor,
};
}
7. Three-Way Match Validation
Decision: Implement actual PO/GR/Invoice matching with tolerances.
Rationale:
- Real P2P processes include match validation
- Variances are common and should be generated
- Match status affects downstream processing
Implementation:
#![allow(unused)]
fn main() {
pub fn validate_match(po: &PurchaseOrder, gr: &GoodsReceipt, inv: &Invoice,
config: &MatchConfig) -> MatchResult {
let qty_variance = (gr.quantity - po.quantity).abs() / po.quantity;
let price_variance = (inv.unit_price - po.unit_price).abs() / po.unit_price;
if qty_variance > config.quantity_tolerance {
return MatchResult::QuantityVariance(qty_variance);
}
if price_variance > config.price_tolerance {
return MatchResult::PriceVariance(price_variance);
}
MatchResult::Matched
}
}
8. Memory Guard Architecture
Decision: Cross-platform memory tracking with soft/hard limits.
Rationale:
- Large generations can exhaust memory
- OOM kills are unrecoverable
- Graceful degradation preferred
Implementation:
#![allow(unused)]
fn main() {
pub fn check(&self) -> MemoryStatus {
    let current = self.get_memory_usage();
    // Growth rate in bytes per millisecond since the last check
    let elapsed_ms = self.last_check.elapsed().as_millis().max(1) as f64;
    let growth_rate = current.saturating_sub(self.last_usage) as f64 / elapsed_ms;
    MemoryStatus {
        current_usage: current,
        exceeds_soft_limit: current > self.config.soft_limit,
        exceeds_hard_limit: current > self.config.hard_limit,
        growth_rate,
    }
}
}
9. Layered Crate Architecture
Decision: Strict layering with no circular dependencies.
Rationale:
- Clear separation of concerns
- Independent crate compilation
- Easier testing and maintenance
Layers:
- Foundation: datasynth-core (no internal dependencies)
- Services: datasynth-config, datasynth-output
- Processing: datasynth-generators, datasynth-graph
- Orchestration: datasynth-runtime
- Application: datasynth-cli, datasynth-server, datasynth-ui
10. Configuration-Driven Behavior
Decision: All behavior controlled by external configuration.
Rationale:
- Flexibility without code changes
- Reproducible scenarios
- User-customizable presets
Scope: Configuration controls:
- Industry and complexity
- Transaction volumes and patterns
- Anomaly types and rates
- Output formats
- All feature toggles
11. Trait-Based Extensibility
Decision: Define traits in core, implement in higher layers.
Rationale:
- Dependency inversion
- Pluggable implementations
- Easy testing with mocks
Example:
#![allow(unused)]
fn main() {
// Defined in datasynth-core
pub trait Generator<T> {
fn generate_batch(&mut self, count: usize) -> Result<Vec<T>>;
}
// Implemented in datasynth-generators
impl Generator<JournalEntry> for JournalEntryGenerator {
fn generate_batch(&mut self, count: usize) -> Result<Vec<JournalEntry>> {
// Implementation
}
}
}
12. Parallel-Safe Design
Decision: Design all generators to be thread-safe.
Rationale:
- Generation can be parallelized
- Modern systems have many cores
- Linear scaling improves throughput
Implementation:
- Per-thread RNG seeds: seed + thread_id
- Atomic counters for the UUID factory
- No shared mutable state during generation
- Rayon for parallel iteration
Crate Reference
SyntheticData is organized as a Rust workspace with 15 modular crates. This section provides detailed documentation for each crate.
Workspace Structure
datasynth-cli → Binary entry point (commands: generate, validate, init, info, fingerprint)
datasynth-server → REST/gRPC/WebSocket server with auth, rate limiting, timeouts
datasynth-ui → Tauri/SvelteKit desktop UI
↓
datasynth-runtime → Orchestration layer (GenerationOrchestrator coordinates workflow)
↓
datasynth-generators → Data generators (JE, Document Flows, Subledgers, Anomalies, Audit)
datasynth-banking → KYC/AML banking transaction generator with fraud typologies
datasynth-ocpm → Object-Centric Process Mining (OCEL 2.0 event logs)
datasynth-fingerprint → Privacy-preserving fingerprint extraction and synthesis
datasynth-standards → Accounting/audit standards (US GAAP, IFRS, ISA, SOX)
↓
datasynth-graph → Graph/network export (PyTorch Geometric, Neo4j, DGL)
datasynth-eval → Evaluation framework with auto-tuning and recommendations
↓
datasynth-config → Configuration schema, validation, industry presets
↓
datasynth-core → Domain models, traits, distributions, templates, resource guards
↓
datasynth-output → Output sinks (CSV, JSON, Parquet, ControlExport)
datasynth-test-utils → Testing utilities and fixtures
Crate Categories
Application Layer
| Crate | Description |
|---|---|
| datasynth-cli | Command-line interface binary with generate, validate, init, info, fingerprint commands |
| datasynth-server | REST/gRPC/WebSocket server with authentication, rate limiting, and timeouts |
| datasynth-ui | Cross-platform desktop GUI application (Tauri + SvelteKit) |
Core Processing
| Crate | Description |
|---|---|
| datasynth-runtime | Generation orchestration with resource guards and graceful degradation |
| datasynth-generators | All data generators (JE, master data, documents, subledgers, anomalies, audit) |
| datasynth-graph | ML graph export (PyTorch Geometric, Neo4j, DGL) |
Domain-Specific Modules
| Crate | Description |
|---|---|
| datasynth-banking | KYC/AML banking transactions with fraud typologies |
| datasynth-ocpm | Object-Centric Process Mining (OCEL 2.0) |
| datasynth-fingerprint | Privacy-preserving fingerprint extraction and synthesis |
| datasynth-standards | Accounting/audit standards (US GAAP, IFRS, ISA, SOX, PCAOB) |
Foundation
| Crate | Description |
|---|---|
| datasynth-core | Domain models, distributions, traits, resource guards |
| datasynth-config | Configuration schema and validation |
| datasynth-output | Output sinks (CSV, JSON, Parquet) |
Supporting
| Crate | Description |
|---|---|
| datasynth-eval | Quality evaluation with auto-tuning recommendations |
| datasynth-test-utils | Test utilities and fixtures |
Dependencies
The crates follow a strict dependency hierarchy:
- datasynth-core: No internal dependencies (foundation)
- datasynth-config: Depends on datasynth-core
- datasynth-output: Depends on datasynth-core
- datasynth-generators: Depends on datasynth-core, datasynth-config
- datasynth-graph: Depends on datasynth-core, datasynth-generators
- datasynth-eval: Depends on datasynth-core
- datasynth-banking: Depends on datasynth-core, datasynth-config
- datasynth-ocpm: Depends on datasynth-core
- datasynth-fingerprint: Depends on datasynth-core, datasynth-config
- datasynth-runtime: Depends on datasynth-core, datasynth-config, datasynth-generators, datasynth-output, datasynth-graph, datasynth-banking, datasynth-ocpm, datasynth-fingerprint, datasynth-eval
- datasynth-cli: Depends on datasynth-runtime, datasynth-fingerprint
- datasynth-server: Depends on datasynth-runtime
- datasynth-ui: Depends on datasynth-runtime (via Tauri)
- datasynth-standards: Depends on datasynth-core, datasynth-config
- datasynth-test-utils: Depends on datasynth-core
Building Individual Crates
# Build specific crate
cargo build -p datasynth-core
cargo build -p datasynth-generators
cargo build -p datasynth-fingerprint
# Run tests for specific crate
cargo test -p datasynth-core
cargo test -p datasynth-generators
cargo test -p datasynth-fingerprint
# Generate docs for specific crate
cargo doc -p datasynth-core --open
cargo doc -p datasynth-fingerprint --open
API Documentation
For detailed Rust API documentation, generate and view rustdoc:
cargo doc --workspace --no-deps --open
After deployment, API documentation is available at /api/ in the documentation site.
datasynth-core
Core domain models, traits, and distributions for synthetic accounting data generation.
Overview
datasynth-core provides the foundational building blocks for the SyntheticData workspace:
- Domain Models: Journal entries, chart of accounts, master data, documents, anomalies
- Statistical Distributions: Line item sampling, amount generation, temporal patterns
- Core Traits: Generator and Sink interfaces for extensibility
- Template System: File-based templates for regional/sector customization
- Infrastructure: UUID factory, memory guard, GL account constants
Module Structure
Domain Models (models/)
| Module | Description |
|---|---|
| journal_entry.rs | Journal entry header and balanced line items |
| chart_of_accounts.rs | Hierarchical GL accounts with account types |
| master_data.rs | Enhanced vendors, customers with payment behavior |
| documents.rs | Purchase orders, invoices, goods receipts, payments |
| temporal.rs | Bi-temporal data model for audit trails |
| anomaly.rs | Anomaly types and labels for ML training |
| internal_control.rs | SOX 404 control definitions |
| sourcing/ | S2C models: SourcingProject, SupplierQualification, RfxEvent, Bid, BidEvaluation, ProcurementContract, CatalogItem, SupplierScorecard, SpendAnalysis |
| payroll.rs | PayrollRun, PayrollLineItem with gross/deductions/net/employer cost |
| time_entry.rs | TimeEntry with regular, overtime, PTO, and sick hours |
| expense_report.rs | ExpenseReport, ExpenseLineItem with category and approval workflow |
| financial_statements.rs | FinancialStatement, FinancialStatementLineItem, CashFlowItem, StatementType |
| bank_reconciliation.rs | BankReconciliation, BankStatementLine, ReconcilingItem with auto-matching |
Statistical Distributions (distributions/)
| Distribution | Description |
|---|---|
| LineItemSampler | Empirical distribution (60.68% two-line, 88% even counts) |
| AmountSampler | Log-normal with round-number bias, Benford compliance |
| TemporalSampler | Seasonality patterns with industry integration |
| BenfordSampler | First-digit distribution following P(d) = log10(1 + 1/d) |
| FraudAmountGenerator | Suspicious amount patterns |
| IndustrySeasonality | Industry-specific volume patterns |
| HolidayCalendar | Regional holidays for US, DE, GB, CN, JP, IN |
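For illustration, a hedged sketch of the first-digit distribution P(d) = log10(1 + 1/d) and a simple inverse-CDF sampler over it, using the workspace's seeded ChaCha8 RNG; this is not the crate's BenfordSampler API:
use rand::{Rng, SeedableRng};
use rand_chacha::ChaCha8Rng;
fn benford_probabilities() -> [f64; 9] {
    // P(d) = log10(1 + 1/d) for d = 1..=9; P(1) ≈ 0.301
    let mut p = [0.0; 9];
    for d in 1..=9 {
        p[d - 1] = (1.0 + 1.0 / d as f64).log10();
    }
    p
}
fn sample_first_digit(rng: &mut impl Rng, p: &[f64; 9]) -> u32 {
    // Inverse-CDF sampling over the nine digit probabilities
    let u: f64 = rng.gen();
    let mut cumulative = 0.0;
    for (i, prob) in p.iter().enumerate() {
        cumulative += prob;
        if u < cumulative {
            return (i + 1) as u32;
        }
    }
    9
}
fn main() {
    let p = benford_probabilities();
    let mut rng = ChaCha8Rng::seed_from_u64(42);
    let digits: Vec<u32> = (0..10).map(|_| sample_first_digit(&mut rng, &p)).collect();
    println!("P(1) = {:.3}, sampled first digits: {:?}", p[0], digits);
}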
Infrastructure
| Component | Description |
|---|---|
uuid_factory.rs | Deterministic FNV-1a hash-based UUID generation |
accounts.rs | Centralized GL control account numbers |
templates/ | YAML/JSON template loading and merging |
Resource Guards
| Component | Description |
|---|---|
memory_guard.rs | Cross-platform memory tracking with soft/hard limits |
disk_guard.rs | Disk space monitoring and pre-write capacity checks |
cpu_monitor.rs | CPU load tracking with auto-throttling |
resource_guard.rs | Unified orchestration of all resource guards |
degradation.rs | Graceful degradation system (Normal→Reduced→Minimal→Emergency) |
AI & ML Modules (v0.5.0)
| Module | Description |
|---|---|
llm/provider.rs | LlmProvider trait with complete() and complete_batch() methods |
llm/mock_provider.rs | Deterministic MockLlmProvider for testing (no network required) |
llm/http_provider.rs | HttpLlmProvider for OpenAI, Anthropic, and custom API endpoints |
llm/nl_config.rs | NlConfigGenerator — natural language to YAML configuration |
llm/cache.rs | LlmCache with FNV-1a hashing for prompt deduplication |
diffusion/backend.rs | DiffusionBackend trait with forward(), reverse(), generate() methods |
diffusion/schedule.rs | NoiseSchedule with linear, cosine, and sigmoid schedules |
diffusion/statistical.rs | StatisticalDiffusionBackend — fingerprint-guided denoising |
diffusion/hybrid.rs | HybridGenerator with Interpolate, Select, Ensemble blend strategies |
diffusion/training.rs | DiffusionTrainer and TrainedDiffusionModel with save/load |
causal/graph.rs | CausalGraph with variables, edges, and built-in templates |
causal/scm.rs | StructuralCausalModel with topological-order generation |
causal/intervention.rs | InterventionEngine with do-calculus and effect estimation |
causal/counterfactual.rs | CounterfactualGenerator with abduction-action-prediction |
causal/validation.rs | CausalValidator for causal structure validation |
Key Types
JournalEntry
#![allow(unused)]
fn main() {
pub struct JournalEntry {
pub header: JournalEntryHeader,
pub lines: Vec<JournalEntryLine>,
}
pub struct JournalEntryHeader {
pub document_id: Uuid,
pub company_code: String,
pub fiscal_year: u16,
pub fiscal_period: u8,
pub posting_date: NaiveDate,
pub document_date: NaiveDate,
pub source: TransactionSource,
pub business_process: Option<BusinessProcess>,
pub is_fraud: bool,
pub fraud_type: Option<FraudType>,
pub is_anomaly: bool,
pub anomaly_type: Option<AnomalyType>,
// ... additional fields
}
}
AccountType Hierarchy
#![allow(unused)]
fn main() {
pub enum AccountType {
Asset,
Liability,
Equity,
Revenue,
Expense,
}
pub enum AccountSubType {
// Assets
Cash,
AccountsReceivable,
Inventory,
FixedAsset,
// Liabilities
AccountsPayable,
AccruedLiabilities,
LongTermDebt,
// Equity
CommonStock,
RetainedEarnings,
// Revenue
SalesRevenue,
ServiceRevenue,
// Expense
CostOfGoodsSold,
OperatingExpense,
// ...
}
}
Anomaly Types
#![allow(unused)]
fn main() {
pub enum AnomalyType {
Fraud,
Error,
ProcessIssue,
Statistical,
Relational,
}
pub struct LabeledAnomaly {
pub document_id: Uuid,
pub anomaly_id: String,
pub anomaly_type: AnomalyType,
pub category: AnomalyCategory,
pub severity: Severity,
pub description: String,
}
}
Usage Examples
Creating a Balanced Journal Entry
#![allow(unused)]
fn main() {
use synth_core::models::{JournalEntry, JournalEntryLine, JournalEntryHeader};
use rust_decimal_macros::dec;
let header = JournalEntryHeader::new(/* ... */);
let mut entry = JournalEntry::new(header);
// Add balanced lines
entry.add_line(JournalEntryLine::debit("1100", dec!(1000.00), "AR Invoice"));
entry.add_line(JournalEntryLine::credit("4000", dec!(1000.00), "Revenue"));
// Entry enforces debits = credits
assert!(entry.is_balanced());
}
Sampling Amounts
#![allow(unused)]
fn main() {
use synth_core::distributions::AmountSampler;
let sampler = AmountSampler::new(42); // seed
// Benford-compliant amount
let amount = sampler.sample_benford_compliant(1000.0, 100000.0);
// Round-number biased
let round_amount = sampler.sample_with_round_bias(1000.0, 10000.0);
}
Using the UUID Factory
#![allow(unused)]
fn main() {
use synth_core::uuid_factory::{DeterministicUuidFactory, GeneratorType};
let factory = DeterministicUuidFactory::new(42);
// Generate collision-free UUIDs across generators
let je_id = factory.generate(GeneratorType::JournalEntry);
let doc_id = factory.generate(GeneratorType::DocumentFlow);
}
Memory Guard
#![allow(unused)]
fn main() {
use synth_core::memory_guard::{MemoryGuard, MemoryGuardConfig};
let config = MemoryGuardConfig {
soft_limit: 1024 * 1024 * 1024, // 1GB soft
hard_limit: 2 * 1024 * 1024 * 1024, // 2GB hard
check_interval_ms: 1000,
..Default::default()
};
let guard = MemoryGuard::new(config);
if guard.check().exceeds_soft_limit {
// Slow down or pause generation
}
}
Disk Space Guard
#![allow(unused)]
fn main() {
use synth_core::disk_guard::{DiskSpaceGuard, DiskSpaceGuardConfig};
let config = DiskSpaceGuardConfig {
hard_limit_mb: 100, // Require at least 100 MB free
soft_limit_mb: 500, // Warn when below 500 MB
check_interval: 500, // Check every 500 operations
reserve_buffer_mb: 50, // Keep 50 MB buffer
monitor_path: Some("./output".into()),
};
let guard = DiskSpaceGuard::new(config);
guard.check()?; // Returns error if disk full
guard.check_before_write(1024 * 1024)?; // Pre-write check
}
CPU Monitor
#![allow(unused)]
fn main() {
use synth_core::cpu_monitor::{CpuMonitor, CpuMonitorConfig};
let config = CpuMonitorConfig::with_thresholds(0.85, 0.95)
.with_auto_throttle(50); // 50ms delay when critical
let monitor = CpuMonitor::new(config);
// Sample and check in generation loop
if let Some(load) = monitor.sample() {
if monitor.is_throttling() {
monitor.maybe_throttle(); // Apply delay
}
}
}
Graceful Degradation
#![allow(unused)]
fn main() {
use synth_core::degradation::{
DegradationController, DegradationConfig, ResourceStatus, DegradationActions
};
let controller = DegradationController::new(DegradationConfig::default());
let status = ResourceStatus::new(
Some(0.80), // 80% memory usage
Some(800), // 800 MB disk free
Some(0.70), // 70% CPU load
);
let (level, changed) = controller.update(&status);
let actions = DegradationActions::for_level(level);
if actions.skip_data_quality {
// Skip data quality injection
}
if actions.terminate {
// Flush and exit gracefully
}
}
LLM Provider
#![allow(unused)]
fn main() {
use synth_core::llm::{LlmProvider, LlmRequest, MockLlmProvider};
let provider = MockLlmProvider::new(42);
let request = LlmRequest::new("Generate a realistic vendor name for a manufacturing company")
.with_seed(42)
.with_max_tokens(50);
let response = provider.complete(&request)?;
println!("Generated: {}", response.content);
}
Causal Generation
#![allow(unused)]
fn main() {
use synth_core::causal::{CausalGraph, StructuralCausalModel};
// Use built-in fraud detection template
let graph = CausalGraph::fraud_detection_template();
let scm = StructuralCausalModel::new(graph)?;
// Generate observational samples
let samples = scm.generate(1000, 42)?;
// Run intervention: what if transaction_amount is set to 50000?
let intervened = scm.intervene("transaction_amount", 50000.0)?;
let intervention_samples = intervened.generate(1000, 42)?;
}
Diffusion Model
#![allow(unused)]
fn main() {
use synth_core::diffusion::{
StatisticalDiffusionBackend, DiffusionConfig, NoiseScheduleType,
HybridGenerator, BlendStrategy,
};
let config = DiffusionConfig {
n_steps: 1000,
schedule: NoiseScheduleType::Cosine,
seed: 42,
};
let backend = StatisticalDiffusionBackend::new(
vec![100.0, 5.0], // means
vec![50.0, 2.0], // stds
config,
);
let samples = backend.generate(1000, 2, 42);
// Hybrid: blend rule-based samples with diffusion samples
// (`rule_based` is a sample matrix from the rule-based generator, produced elsewhere)
let hybrid = HybridGenerator::new(0.3); // 30% diffusion weight
let blended = hybrid.blend(&rule_based, &samples, BlendStrategy::Ensemble, 42);
}
Traits
Generator Trait
#![allow(unused)]
fn main() {
pub trait Generator {
type Output;
type Error;
fn generate_batch(&mut self, count: usize) -> Result<Vec<Self::Output>, Self::Error>;
fn generate_stream(&mut self) -> impl Iterator<Item = Result<Self::Output, Self::Error>>;
}
}
Sink Trait
#![allow(unused)]
fn main() {
pub trait Sink<T> {
type Error;
fn write(&mut self, item: &T) -> Result<(), Self::Error>;
fn write_batch(&mut self, items: &[T]) -> Result<(), Self::Error>;
fn flush(&mut self) -> Result<(), Self::Error>;
}
}
PostProcessor Trait
Interface for post-generation data transformations (e.g., data quality variations):
#![allow(unused)]
fn main() {
pub struct ProcessContext {
pub record_index: usize,
pub batch_size: usize,
pub output_format: String,
pub metadata: HashMap<String, String>,
}
pub struct ProcessorStats {
pub records_processed: usize,
pub records_modified: usize,
pub labels_generated: usize,
}
}
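The trait itself is not reproduced here. Below is a minimal sketch of how a post-processor could be shaped around these two types; the trait name matches the heading, but the method names and signatures are illustrative assumptions, not the crate's confirmed API:
#![allow(unused)]
fn main() {
// Illustrative shape only; consult the rustdoc for the actual trait definition.
pub trait PostProcessor<T> {
    type Error;
    /// Transform one record in place using the generation context.
    fn process(&mut self, record: &mut T, ctx: &ProcessContext) -> Result<(), Self::Error>;
    /// Summary of what the processor changed across the batch.
    fn stats(&self) -> ProcessorStats;
}
}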
Template System
Load external templates for customization:
#![allow(unused)]
fn main() {
use synth_core::templates::{TemplateLoader, MergeStrategy};
let loader = TemplateLoader::new("templates/");
let names = loader.load_category("vendor_names", MergeStrategy::Extend)?;
}
Template categories:
person_names, vendor_names, customer_names, material_descriptions, line_item_descriptions
Decimal Handling
All financial amounts use rust_decimal::Decimal:
#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
let amount = dec!(1234.56);
let tax = amount * dec!(0.077);
}
Decimals are serialized as strings to avoid IEEE 754 floating-point issues.
See Also
datasynth-config
Configuration schema, validation, and industry presets for synthetic data generation.
Overview
datasynth-config provides the configuration layer for SyntheticData:
- Schema Definition: Complete YAML configuration schema
- Validation: Bounds checking, constraint validation, distribution sum verification
- Industry Presets: Pre-configured settings for common industries
- Complexity Levels: Small, medium, and large organization profiles
Configuration Sections
| Section | Description |
|---|---|
global | Industry, dates, seed, performance settings |
companies | Company codes, currencies, volume weights |
chart_of_accounts | COA complexity and structure |
transactions | Line items, amounts, sources, temporal patterns |
master_data | Vendors, customers, materials, assets, employees |
document_flows | P2P, O2C configuration |
intercompany | IC transaction types and transfer pricing |
balance | Opening balances, trial balance generation |
subledger | AR, AP, FA, inventory settings |
fx | Currency and exchange rate settings |
period_close | Close tasks and schedules |
fraud | Fraud injection rates and types |
internal_controls | SOX controls and SoD rules |
anomaly_injection | Anomaly rates and labeling |
data_quality | Missing values, typos, duplicates |
graph_export | ML graph export formats |
output | Output format and compression |
Industry Presets
| Industry | Description |
|---|---|
manufacturing | Heavy P2P, inventory, fixed assets |
retail | High O2C volume, seasonal patterns |
financial_services | Complex intercompany, high controls |
healthcare | Regulatory focus, seasonal insurance |
technology | SaaS revenue patterns, R&D capitalization |
Key Types
Config
#![allow(unused)]
fn main() {
pub struct Config {
pub global: GlobalConfig,
pub companies: Vec<CompanyConfig>,
pub chart_of_accounts: CoaConfig,
pub transactions: TransactionConfig,
pub master_data: MasterDataConfig,
pub document_flows: DocumentFlowConfig,
pub intercompany: IntercompanyConfig,
pub balance: BalanceConfig,
pub subledger: SubledgerConfig,
pub fx: FxConfig,
pub period_close: PeriodCloseConfig,
pub fraud: FraudConfig,
pub internal_controls: ControlConfig,
pub anomaly_injection: AnomalyConfig,
pub data_quality: DataQualityConfig,
pub graph_export: GraphExportConfig,
pub output: OutputConfig,
}
}
GlobalConfig
#![allow(unused)]
fn main() {
pub struct GlobalConfig {
pub seed: Option<u64>,
pub industry: Industry,
pub start_date: NaiveDate,
pub period_months: u32, // 1-120
pub group_currency: String,
pub worker_threads: Option<usize>,
pub memory_limit: Option<u64>,
}
}
CompanyConfig
#![allow(unused)]
fn main() {
pub struct CompanyConfig {
pub code: String,
pub name: String,
pub currency: String,
pub country: String,
pub volume_weight: f64, // Must sum to 1.0 across companies
pub is_parent: bool,
pub parent_code: Option<String>,
}
}
Validation Rules
The ConfigValidator enforces:
| Rule | Constraint |
|---|---|
period_months | 1-120 (max 10 years) |
compression_level | 1-9 when compression enabled |
| Rate fields | 0.0-1.0 |
| Approval thresholds | Strictly ascending order |
| Distribution weights | Sum to 1.0 (±0.01 tolerance) |
| Company codes | Unique within configuration |
| Dates | start_date + period_months is valid |
Usage Examples
Loading Configuration
#![allow(unused)]
fn main() {
use synth_config::{Config, ConfigValidator};
// From YAML file
let config = Config::from_yaml_file("config.yaml")?;
// Validate
let validator = ConfigValidator::new();
validator.validate(&config)?;
}
Using Presets
#![allow(unused)]
fn main() {
use synth_config::{Config, Industry, Complexity};
// Create from preset
let mut config = Config::from_preset(Industry::Manufacturing, Complexity::Medium);
// Modify as needed
config.transactions.target_count = 50000;
}
Creating Configuration Programmatically
#![allow(unused)]
fn main() {
use synth_config::{Config, GlobalConfig, Industry, TransactionConfig};
use chrono::NaiveDate;
let config = Config {
global: GlobalConfig {
seed: Some(42),
industry: Industry::Manufacturing,
start_date: NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
period_months: 12,
group_currency: "USD".to_string(),
..Default::default()
},
transactions: TransactionConfig {
target_count: 100000,
..Default::default()
},
..Default::default()
};
}
Saving Configuration
#![allow(unused)]
fn main() {
// To YAML
config.to_yaml_file("config.yaml")?;
// To JSON
config.to_json_file("config.json")?;
// To string
let yaml = config.to_yaml_string()?;
}
Configuration Examples
Minimal Configuration
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 10000
output:
format: csv
Full Configuration
See the YAML Schema Reference for complete documentation.
Complexity Levels
| Level | Accounts | Vendors | Customers | Materials |
|---|---|---|---|---|
small | ~100 | 50 | 100 | 200 |
medium | ~400 | 200 | 500 | 1000 |
large | ~2500 | 1000 | 5000 | 10000 |
Validation Error Types
#![allow(unused)]
fn main() {
pub enum ConfigError {
MissingRequiredField(String),
InvalidValue { field: String, value: String, constraint: String },
DistributionSumError { field: String, sum: f64 },
DuplicateCode { field: String, code: String },
DateRangeError { start: NaiveDate, end: NaiveDate },
ParseError(String),
}
}
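These variants can be matched to produce actionable error messages. A sketch, assuming (as in the loading example above) that ConfigValidator::validate returns this error type:
#![allow(unused)]
fn main() {
use synth_config::{Config, ConfigValidator, ConfigError};
let config = Config::from_yaml_file("config.yaml")?;
match ConfigValidator::new().validate(&config) {
    Ok(()) => println!("configuration is valid"),
    Err(ConfigError::DistributionSumError { field, sum }) => {
        eprintln!("{field}: weights sum to {sum}, expected 1.0 (±0.01)");
    }
    Err(ConfigError::InvalidValue { field, value, constraint }) => {
        eprintln!("{field} = {value} violates constraint: {constraint}");
    }
    Err(_) => eprintln!("configuration is invalid"),
}
}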
See Also
datasynth-generators
Data generators for journal entries, master data, document flows, and anomalies.
Overview
datasynth-generators contains all data generation logic for SyntheticData:
- Core Generators: Journal entries, chart of accounts, users
- Master Data: Vendors, customers, materials, assets, employees
- Document Flows: P2P (Procure-to-Pay), O2C (Order-to-Cash)
- Financial: Intercompany, balance tracking, subledgers, FX, period close
- Quality: Anomaly injection, data quality variations
- Sourcing (S2C): Spend analysis, RFx, bids, contracts, catalogs, scorecards (v0.6.0)
- HR / Payroll: Payroll runs, time entries, expense reports (v0.6.0)
- Financial Reporting: Financial statements, bank reconciliation (v0.6.0)
- Standards: Revenue recognition, impairment testing (v0.6.0)
- Manufacturing: Production orders, quality inspections, cycle counts (v0.6.0)
Module Structure
Core Generators
| Generator | Description |
|---|---|
je_generator | Journal entry generation with statistical distributions |
coa_generator | Chart of accounts with industry-specific structures |
company_selector | Weighted company selection for transactions |
user_generator | User/persona generation with roles |
control_generator | Internal controls and SoD rules |
Master Data (master_data/)
| Generator | Description |
|---|---|
vendor_generator | Vendors with payment terms, bank accounts, behaviors |
customer_generator | Customers with credit ratings, payment patterns |
material_generator | Materials/products with BOM, valuations |
asset_generator | Fixed assets with depreciation schedules |
employee_generator | Employees with manager hierarchy |
entity_registry_manager | Central entity registry with temporal validity |
Document Flow (document_flow/)
| Generator | Description |
|---|---|
p2p_generator | PO → GR → Invoice → Payment flow |
o2c_generator | SO → Delivery → Invoice → Receipt flow |
document_chain_manager | Reference chain management |
document_flow_je_generator | Generate JEs from document flows |
three_way_match | PO/GR/Invoice matching validation |
Intercompany (intercompany/)
| Generator | Description |
|---|---|
ic_generator | Matched intercompany entry pairs |
matching_engine | IC matching and reconciliation |
elimination_generator | Consolidation elimination entries |
Balance (balance/)
| Generator | Description |
|---|---|
opening_balance_generator | Coherent opening balance sheet |
balance_tracker | Running balance validation |
trial_balance_generator | Period-end trial balance |
Subledger (subledger/)
| Generator | Description |
|---|---|
ar_generator | AR invoices, receipts, credit memos, aging |
ap_generator | AP invoices, payments, debit memos |
fa_generator | Fixed assets, depreciation, disposals |
inventory_generator | Inventory positions, movements, valuation |
reconciliation | GL-to-subledger reconciliation |
FX (fx/)
| Generator | Description |
|---|---|
fx_rate_service | FX rate generation (Ornstein-Uhlenbeck process) |
currency_translator | Trial balance translation |
cta_generator | Currency Translation Adjustment entries |
Period Close (period_close/)
| Generator | Description |
|---|---|
close_engine | Main orchestration |
accruals | Accrual entry generation |
depreciation | Monthly depreciation runs |
year_end | Year-end closing entries |
Anomaly (anomaly/)
| Generator | Description |
|---|---|
injector | Main anomaly injection engine |
types | Weighted anomaly type configurations |
strategies | Injection strategies (amount, date, duplication) |
patterns | Temporal patterns, clustering, entity targeting |
Data Quality (data_quality/)
| Generator | Description |
|---|---|
injector | Main data quality injector |
missing_values | MCAR, MAR, MNAR, Systematic patterns |
format_variations | Date, amount, identifier formats |
duplicates | Exact, near, fuzzy duplicates |
typos | Keyboard-aware typos, OCR errors |
labels | ML training labels for data quality issues |
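Data quality issues are injected after generation, analogously to anomalies. The sketch below shows the expected flow; the injector's type name and method signature are assumptions modeled on the AnomalyInjector usage shown later, not confirmed API:
#![allow(unused)]
fn main() {
// Hypothetical names, mirroring the anomaly injector's interface.
use synth_generators::data_quality::DataQualityInjector;
let mut injector = DataQualityInjector::new(config.data_quality, seed);
// Returns degraded entries plus ML labels describing each injected issue.
let (degraded_entries, quality_labels) = injector.inject(&entries)?;
}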
Audit (audit/)
ISA-compliant audit data generation.
| Generator | Description |
|---|---|
engagement_generator | Audit engagement with phases (Planning, Fieldwork, Completion) |
workpaper_generator | Audit workpapers per ISA 230 |
evidence_generator | Audit evidence per ISA 500 |
risk_generator | Risk assessment per ISA 315/330 |
finding_generator | Audit findings per ISA 265 |
judgment_generator | Professional judgment documentation per ISA 200 |
LLM Enrichment (llm_enrichment/) — v0.5.0
| Generator | Description |
|---|---|
VendorLlmEnricher | Generate realistic vendor names by industry, spend category, and country |
TransactionLlmEnricher | Generate transaction descriptions and memo fields |
AnomalyLlmExplainer | Generate natural language explanations for injected anomalies |
Sourcing (sourcing/) – v0.6.0
Source-to-Contract (S2C) procurement pipeline generators.
| Generator | Description |
|---|---|
spend_analysis_generator | Spend analysis records and category hierarchies |
sourcing_project_generator | Sourcing project lifecycle management |
qualification_generator | Supplier qualification assessments |
rfx_generator | RFx events (RFI/RFP/RFQ) with invited suppliers |
bid_generator | Supplier bids with pricing and compliance data |
bid_evaluation_generator | Bid scoring, ranking, and award recommendations |
contract_generator | Procurement contracts with terms and renewal rules |
catalog_generator | Catalog items linked to contracts |
scorecard_generator | Supplier scorecards with performance metrics |
Generation DAG: spend_analysis -> sourcing_project -> qualification -> rfx -> bid -> bid_evaluation -> contract -> catalog -> [P2P] -> scorecard
HR (hr/) – v0.6.0
Hire-to-Retire (H2R) generators for the HR process chain.
| Generator | Description |
|---|---|
payroll_generator | Payroll runs with employee pay line items (gross, deductions, net, employer cost) |
time_entry_generator | Employee time entries with regular, overtime, PTO, and sick hours |
expense_report_generator | Expense reports with categorized line items and approval workflows |
Standards (standards/) – v0.6.0
Accounting and audit standards generators.
| Generator | Description |
|---|---|
revenue_recognition_generator | ASC 606/IFRS 15 customer contracts with performance obligations |
impairment_generator | Asset impairment tests with recoverable amount calculations |
Period Close Additions – v0.6.0
| Generator | Description |
|---|---|
financial_statement_generator | Balance sheet, income statement, cash flow, and changes in equity from trial balance data |
Bank Reconciliation – v0.6.0
| Generator | Description |
|---|---|
bank_reconciliation_generator | Bank reconciliations with statement lines, auto-matching, and reconciling items |
Relationships (relationships/)
| Generator | Description |
|---|---|
entity_graph_generator | Cross-process entity relationship graphs |
relationship_strength | Weighted relationship strength calculation |
Audit Engagement Structure:
#![allow(unused)]
fn main() {
pub struct AuditEngagement {
pub engagement_id: String,
pub client_name: String,
pub fiscal_year: u16,
pub phase: AuditPhase, // Planning, Fieldwork, Completion
pub materiality: MaterialityLevels,
pub team_size: usize,
pub has_fraud_risk: bool,
pub has_significant_risk: bool,
}
pub struct MaterialityLevels {
pub primary_materiality: Decimal, // 0.3-1% of base
pub performance_materiality: Decimal, // 50-75% of primary
pub clearly_trivial: Decimal, // 3-5% of primary
}
}
Usage Examples
Journal Entry Generation
#![allow(unused)]
fn main() {
use synth_generators::je_generator::JournalEntryGenerator;
let mut generator = JournalEntryGenerator::new(config, seed);
// Generate batch
let entries = generator.generate_batch(1000)?;
// Stream generation
for entry in generator.generate_stream().take(1000) {
process(entry?);
}
}
Master Data Generation
#![allow(unused)]
fn main() {
use synth_generators::master_data::{VendorGenerator, CustomerGenerator};
let mut vendor_gen = VendorGenerator::new(seed);
let vendors = vendor_gen.generate(100);
let mut customer_gen = CustomerGenerator::new(seed);
let customers = customer_gen.generate(200);
}
Document Flow Generation
#![allow(unused)]
fn main() {
use synth_generators::document_flow::{P2pGenerator, O2cGenerator};
let mut p2p = P2pGenerator::new(config, seed);
let p2p_flows = p2p.generate_batch(500)?;
let mut o2c = O2cGenerator::new(config, seed);
let o2c_flows = o2c.generate_batch(500)?;
}
Anomaly Injection
#![allow(unused)]
fn main() {
use synth_generators::anomaly::AnomalyInjector;
let mut injector = AnomalyInjector::new(config.anomaly_injection, seed);
// Inject into existing entries
let (modified_entries, labels) = injector.inject(&entries)?;
}
LLM Enrichment
#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::{VendorLlmEnricher, TransactionLlmEnricher};
use synth_core::llm::MockLlmProvider;
use std::sync::Arc;
let provider = Arc::new(MockLlmProvider::new(42));
// Enrich vendor names
let vendor_enricher = VendorLlmEnricher::new(provider.clone());
let name = vendor_enricher.enrich_vendor_name("manufacturing", "raw_materials", "US")?;
// Enrich transaction descriptions
let tx_enricher = TransactionLlmEnricher::new(provider);
let desc = tx_enricher.enrich_description("Office Supplies", "1000-5000", "retail", 3)?;
let memo = tx_enricher.enrich_memo("VendorInvoice", "Acme Corp", "2500.00")?;
}
Three-Way Match
The P2P generator validates document matching:
#![allow(unused)]
fn main() {
use synth_generators::document_flow::ThreeWayMatch;
let match_result = ThreeWayMatch::validate(
&purchase_order,
&goods_receipt,
&vendor_invoice,
tolerance_config,
);
match match_result {
MatchResult::Passed => { /* Process normally */ }
MatchResult::QuantityVariance(var) => { /* Handle variance */ }
MatchResult::PriceVariance(var) => { /* Handle variance */ }
}
}
Balance Coherence
The balance tracker maintains the accounting equation:
#![allow(unused)]
fn main() {
use synth_generators::balance::BalanceTracker;
let mut tracker = BalanceTracker::new();
for entry in &entries {
tracker.post(&entry)?;
}
// Verify Assets = Liabilities + Equity
assert!(tracker.is_balanced());
}
FX Rate Generation
Uses Ornstein-Uhlenbeck process for realistic rate movements:
#![allow(unused)]
fn main() {
use synth_generators::fx::FxRateService;
let mut fx_service = FxRateService::new(config.fx, seed);
// Get rate for date
let rate = fx_service.get_rate("EUR", "USD", date)?;
// Generate daily rates
let rates = fx_service.generate_daily_rates(start, end)?;
}
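For intuition, here is a minimal sketch of the discretized Ornstein-Uhlenbeck update such a rate service is built on (illustrative parameters and a crude noise source; not the crate's internals):
#![allow(unused)]
fn main() {
// x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * eps
let (mu, theta, sigma, dt) = (1.10_f64, 0.05, 0.01, 1.0); // long-run rate, reversion speed, volatility, daily step
// Tiny deterministic noise source (LCG) standing in for a Normal draw.
let mut state: u64 = 42;
let mut noise = || {
    state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    (state >> 11) as f64 / (1u64 << 53) as f64 * 2.0 - 1.0
};
let mut rate = mu;
let mut rates = Vec::with_capacity(365);
for _ in 0..365 {
    let eps = noise();
    rate += theta * (mu - rate) * dt + sigma * dt.sqrt() * eps;
    rates.push(rate); // mean-reverting path around mu, like a daily EUR/USD series
}
}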
Anomaly Types
Fraud Types
- FictitiousTransaction, RevenueManipulation, ExpenseCapitalization
- SplitTransaction, RoundTripping, KickbackScheme
- GhostEmployee, DuplicatePayment, UnauthorizedDiscount
Error Types
- DuplicateEntry, ReversedAmount, WrongPeriod
- WrongAccount, MissingReference, IncorrectTaxCode
Process Issues
- LatePosting, SkippedApproval, ThresholdManipulation
- MissingDocumentation, OutOfSequence
Statistical Anomalies
- UnusualAmount, TrendBreak, BenfordViolation, OutlierValue
Relational Anomalies
- CircularTransaction, DormantAccountActivity, UnusualCounterparty
See Also
datasynth-output
Output sinks for CSV, JSON, and streaming formats.
Overview
datasynth-output provides the output layer for SyntheticData:
- CSV Sink: High-performance CSV writing with optional compression
- JSON Sink: JSON and JSONL (newline-delimited) output
- Streaming: Async streaming output for real-time generation
- Control Export: Internal control and SoD rule export
Supported Formats
Standard Formats
| Format | Description | Extension |
|---|---|---|
| CSV | Standard comma-separated values | .csv |
| JSON | Pretty-printed JSON arrays | .json |
| JSONL | Newline-delimited JSON | .jsonl |
| Parquet | Apache Parquet columnar format | .parquet |
ERP Formats
| Format | Target ERP | Tables |
|---|---|---|
| SAP S/4HANA | SapExporter | BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA, CSKS, CEPC |
| Oracle EBS | OracleExporter | GL_JE_HEADERS, GL_JE_LINES, GL_JE_BATCHES |
| NetSuite | NetSuiteExporter | Journal entries with subsidiary/multi-book support |
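The ERP exporters are used like the other sinks. The sketch below assumes a path-based constructor and a single export call; the exact constructor and method signatures are not documented here and may differ:
#![allow(unused)]
fn main() {
use synth_output::SapExporter;
// Assumed method names; check the rustdoc for the actual exporter API.
let exporter = SapExporter::new("output/sap/");
exporter.export(&journal_entries)?; // writes BKPF/BSEG/ACDOCA plus master data tables
}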
Streaming Sinks
| Sink | Description |
|---|---|
CsvStreamingSink | Streaming CSV with automatic headers |
JsonStreamingSink | Streaming JSON arrays |
NdjsonStreamingSink | Streaming newline-delimited JSON |
ParquetStreamingSink | Streaming Apache Parquet |
Features
- Configurable compression (gzip, zstd, snappy for Parquet)
- Streaming writes for memory efficiency with backpressure support
- ERP-native table schemas (SAP, Oracle, NetSuite)
- Decimal values serialized as strings (IEEE 754 safe)
- Configurable field ordering and headers
- Automatic directory creation
Key Types
OutputConfig
#![allow(unused)]
fn main() {
pub struct OutputConfig {
pub format: OutputFormat,
pub compression: CompressionType,
pub compression_level: u32,
pub include_headers: bool,
pub decimal_precision: u32,
}
pub enum OutputFormat {
Csv,
Json,
Jsonl,
}
pub enum CompressionType {
None,
Gzip,
Zstd,
}
}
CsvSink
#![allow(unused)]
fn main() {
pub struct CsvSink<T> {
writer: BufWriter<Box<dyn Write>>,
config: OutputConfig,
headers_written: bool,
_phantom: PhantomData<T>,
}
}
JsonSink
#![allow(unused)]
fn main() {
pub struct JsonSink<T> {
writer: BufWriter<Box<dyn Write>>,
format: JsonFormat,
first_written: bool,
_phantom: PhantomData<T>,
}
}
Usage Examples
CSV Output
#![allow(unused)]
fn main() {
use synth_output::{CsvSink, CompressionType, OutputConfig, OutputFormat};
// Create sink
let config = OutputConfig {
format: OutputFormat::Csv,
compression: CompressionType::None,
include_headers: true,
..Default::default()
};
let mut sink = CsvSink::new("output/journal_entries.csv", config)?;
// Write data
sink.write_batch(&entries)?;
sink.flush()?;
}
Compressed Output
#![allow(unused)]
fn main() {
use synth_output::{CsvSink, OutputConfig, CompressionType};
let config = OutputConfig {
compression: CompressionType::Gzip,
compression_level: 6,
..Default::default()
};
let mut sink = CsvSink::new("output/entries.csv.gz", config)?;
sink.write_batch(&entries)?;
}
JSON Streaming
#![allow(unused)]
fn main() {
use synth_output::{JsonSink, OutputConfig, OutputFormat};
let config = OutputConfig {
format: OutputFormat::Jsonl,
..Default::default()
};
let mut sink = JsonSink::new("output/entries.jsonl", config)?;
// Stream writes (memory efficient)
for entry in entries {
sink.write(&entry)?;
}
sink.flush()?;
}
Control Export
#![allow(unused)]
fn main() {
use synth_output::ControlExporter;
let exporter = ControlExporter::new("output/controls/");
// Export all control-related data
exporter.export_controls(&internal_controls)?;
exporter.export_sod_rules(&sod_rules)?;
exporter.export_control_mappings(&mappings)?;
}
Sink Trait Implementation
All sinks implement the Sink trait:
#![allow(unused)]
fn main() {
impl<T: Serialize> Sink<T> for CsvSink<T> {
type Error = OutputError;
fn write(&mut self, item: &T) -> Result<(), Self::Error> {
// Single item write
}
fn write_batch(&mut self, items: &[T]) -> Result<(), Self::Error> {
// Batch write for efficiency
}
fn flush(&mut self) -> Result<(), Self::Error> {
// Ensure all data written to disk
}
}
}
Decimal Serialization
Financial amounts are serialized as strings to prevent IEEE 754 floating-point issues:
#![allow(unused)]
fn main() {
// Internal: Decimal
let amount = dec!(1234.56);
// CSV output: "1234.56" (string)
// JSON output: "1234.56" (string, not number)
}
This ensures exact decimal representation across all systems.
Performance Tips
Batch Writes
Prefer batch writes over individual writes:
#![allow(unused)]
fn main() {
// Good: Single batch write
sink.write_batch(&entries)?;
// Less efficient: Multiple writes
for entry in &entries {
sink.write(entry)?;
}
}
Buffer Size
The default buffer size is 8KB. For very large outputs, consider adjusting:
#![allow(unused)]
fn main() {
let sink = CsvSink::with_buffer_size(
"output/large.csv",
config,
64 * 1024, // 64KB buffer
)?;
}
Compression Trade-offs
| Compression | Speed | Size | Use Case |
|---|---|---|---|
| None | Fastest | Largest | Development, streaming |
| Gzip | Medium | Small | General purpose |
| Zstd | Fast | Smallest | Production, archival |
Output Structure
The output module creates an organized directory structure:
output/
├── master_data/
│ ├── vendors.csv
│ └── customers.csv
├── transactions/
│ ├── journal_entries.csv
│ └── acdoca.csv
├── controls/
│ ├── internal_controls.csv
│ └── sod_rules.csv
└── labels/
└── anomaly_labels.csv
Error Handling
#![allow(unused)]
fn main() {
pub enum OutputError {
IoError(std::io::Error),
SerializationError(String),
CompressionError(String),
DirectoryCreationError(PathBuf),
}
}
See Also
- Output Formats — Standard format details
- ERP Output Formats — SAP, Oracle, NetSuite exports
- Streaming Output — StreamingSink API
- Configuration - Output Settings
- datasynth-core
datasynth-runtime
Runtime orchestration, parallel execution, and memory management.
Overview
datasynth-runtime provides the execution layer for SyntheticData:
- GenerationOrchestrator: Coordinates the complete generation workflow
- Parallel Execution: Multi-threaded generation with Rayon
- Memory Management: Integration with memory guard for OOM prevention
- Progress Tracking: Real-time progress reporting with pause/resume
Key Components
| Component | Description |
|---|---|
GenerationOrchestrator | Main workflow coordinator |
EnhancedOrchestrator | Extended orchestrator with all enterprise features |
ParallelExecutor | Thread pool management |
ProgressTracker | Progress bars and status reporting |
Generation Workflow
The orchestrator executes phases in order:
- Initialize: Load configuration, validate settings
- Master Data: Generate vendors, customers, materials, assets
- Opening Balances: Create coherent opening balance sheet
- Transactions: Generate journal entries with document flows
- Period Close: Run monthly/quarterly/annual close processes
- Anomalies: Inject configured anomalies and data quality issues
- Export: Write outputs and generate ML labels
- Banking: Generate KYC/AML data (if enabled)
- Audit: Generate ISA-compliant audit data (if enabled)
- Graphs: Build and export ML graphs (if enabled)
- LLM Enrichment: Enrich data with LLM-generated metadata (v0.5.0, if enabled)
- Diffusion Enhancement: Blend diffusion model outputs (v0.5.0, if enabled)
- Causal Overlay: Apply causal structure (v0.5.0, if enabled)
- S2C Sourcing: Generate Source-to-Contract procurement pipeline (v0.6.0, if enabled)
- Financial Reporting: Generate bank reconciliations and financial statements (v0.6.0, if enabled)
- HR Data: Generate payroll runs, time entries, and expense reports (v0.6.0, if enabled)
- Accounting Standards: Generate revenue recognition and impairment data (v0.6.0, if enabled)
- Manufacturing: Generate production orders, quality inspections, and cycle counts (v0.6.0, if enabled)
- Sales/KPIs/Budgets: Generate sales quotes, management KPIs, and budget variance data (v0.6.0, if enabled)
Key Types
GenerationOrchestrator
#![allow(unused)]
fn main() {
pub struct GenerationOrchestrator {
config: Config,
state: GenerationState,
progress: Arc<ProgressTracker>,
memory_guard: MemoryGuard,
}
pub struct GenerationState {
pub master_data: MasterDataState,
pub entries: Vec<JournalEntry>,
pub documents: DocumentState,
pub balances: BalanceState,
pub anomaly_labels: Vec<LabeledAnomaly>,
}
}
ProgressTracker
#![allow(unused)]
fn main() {
pub struct ProgressTracker {
pub current: AtomicU64,
pub total: u64,
pub phase: String,
pub paused: AtomicBool,
pub start_time: Instant,
}
pub struct Progress {
pub current: u64,
pub total: u64,
pub percent: f64,
pub phase: String,
pub entries_per_second: f64,
pub elapsed: Duration,
pub estimated_remaining: Duration,
}
}
Usage Examples
Basic Generation
#![allow(unused)]
fn main() {
use synth_runtime::GenerationOrchestrator;
let config = Config::from_yaml_file("config.yaml")?;
let orchestrator = GenerationOrchestrator::new(config)?;
// Run full generation
orchestrator.run()?;
}
With Progress Callback
#![allow(unused)]
fn main() {
orchestrator.run_with_progress(|progress| {
println!(
"[{:.1}%] {} - {}/{} ({:.0} entries/sec)",
progress.percent,
progress.phase,
progress.current,
progress.total,
progress.entries_per_second,
);
})?;
}
Parallel Execution
#![allow(unused)]
fn main() {
use synth_runtime::ParallelExecutor;
let executor = ParallelExecutor::new(4); // 4 threads
let results: Vec<JournalEntry> = executor.run(|thread_id| {
let mut generator = JournalEntryGenerator::new(config.clone(), seed + thread_id);
generator.generate_batch(batch_size)
})?;
}
Memory-Aware Generation
#![allow(unused)]
fn main() {
use synth_runtime::GenerationOrchestrator;
use synth_core::memory_guard::MemoryGuardConfig;
let memory_config = MemoryGuardConfig {
soft_limit: 1024 * 1024 * 1024, // 1GB
hard_limit: 2 * 1024 * 1024 * 1024, // 2GB
check_interval_ms: 1000,
..Default::default()
};
let orchestrator = GenerationOrchestrator::with_memory_config(config, memory_config)?;
orchestrator.run()?;
}
Pause/Resume
On Unix systems, generation can be paused and resumed:
# Start generation in background
datasynth-data generate --config config.yaml --output ./output &
# Send SIGUSR1 to toggle pause
kill -USR1 $(pgrep datasynth-data)
# Progress bar shows pause state
# [████████░░░░░░░░░░░░] 40% (PAUSED)
Programmatic Pause/Resume
#![allow(unused)]
fn main() {
// Pause
orchestrator.pause();
// Check state
if orchestrator.is_paused() {
println!("Generation paused");
}
// Resume
orchestrator.resume();
}
Enhanced Orchestrator
The EnhancedOrchestrator includes additional enterprise features:
#![allow(unused)]
fn main() {
use synth_runtime::EnhancedOrchestrator;
let orchestrator = EnhancedOrchestrator::new(config)?;
// All features enabled
orchestrator
.with_document_flows()
.with_intercompany()
.with_subledgers()
.with_fx()
.with_period_close()
.with_anomaly_injection()
.with_graph_export()
.run()?;
}
Enterprise Process Chain Phases (v0.6.0)
The EnhancedOrchestrator supports six new phases for enterprise process chains, controlled by PhaseConfig:
| Phase | Config Flag | Description |
|---|---|---|
| 14 | generate_sourcing | S2C procurement pipeline: spend analysis through supplier scorecards |
| 15 | generate_financial_statements / generate_bank_reconciliation | Financial statements and bank reconciliations |
| 16 | generate_hr | Payroll runs, time entries, expense reports |
| 17 | generate_accounting_standards | Revenue recognition (ASC 606/IFRS 15), impairment testing |
| 18 | generate_manufacturing | Production orders, quality inspections, cycle counts |
| 19 | generate_sales_kpi_budgets | Sales quotes, management KPIs, budget variance analysis |
Each phase can be enabled independently and is skipped gracefully when its dependencies (e.g., master data) are unavailable, as sketched below.
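A sketch of enabling a subset of these phases; the flag names come from the table above, while the struct's module path, Default implementation, and remaining fields are assumptions:
#![allow(unused)]
fn main() {
use synth_runtime::PhaseConfig; // path assumed
let phases = PhaseConfig {
    generate_sourcing: true,
    generate_hr: true,
    generate_manufacturing: false,
    ..Default::default()
};
}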
Output Coordination
The orchestrator coordinates output to multiple sinks:
#![allow(unused)]
fn main() {
// Orchestrator automatically:
// 1. Creates output directories
// 2. Writes master data files
// 3. Writes transaction files
// 4. Writes subledger files
// 5. Writes labels for ML
// 6. Generates graphs if enabled
}
Error Handling
#![allow(unused)]
fn main() {
pub enum RuntimeError {
ConfigurationError(ConfigError),
GenerationError(String),
MemoryExceeded { limit: u64, current: u64 },
OutputError(OutputError),
Interrupted,
}
}
Performance Considerations
Thread Count
#![allow(unused)]
fn main() {
// Auto-detect (uses all cores)
let orchestrator = GenerationOrchestrator::new(config)?;
// Manual thread count
let orchestrator = GenerationOrchestrator::with_threads(config, 4)?;
}
Memory Management
The orchestrator monitors memory and can:
- Slow down generation when soft limit approached
- Pause generation at hard limit
- Stream output to reduce memory pressure
Batch Sizes
Batch sizes are automatically tuned based on:
- Available memory
- Number of threads
- Target throughput
See Also
datasynth-graph
Graph/network export for synthetic accounting data with ML-ready formats.
Overview
datasynth-graph provides graph construction and export capabilities:
- Graph Builders: Transaction, approval, entity relationship, and multi-layer hypergraph builders
- Hypergraph: 3-layer hypergraph (Governance, Process Events, Accounting Network) spanning 8 process families with 24 entity type codes and OCPM event hyperedges
- ML Export: PyTorch Geometric, Neo4j, DGL, RustGraph, and RustGraph Hypergraph formats
- Feature Engineering: Temporal, amount, structural, and categorical features
- Data Splits: Train/validation/test split generation
Graph Types
| Graph | Nodes | Edges | Use Case |
|---|---|---|---|
| Transaction Network | Accounts, Entities | Transactions | Anomaly detection |
| Approval Network | Users | Approvals | SoD analysis |
| Entity Relationship | Legal Entities | Ownership | Consolidation analysis |
Export Formats
PyTorch Geometric
graphs/transaction_network/pytorch_geometric/
├── node_features.pt # [num_nodes, num_features]
├── edge_index.pt # [2, num_edges]
├── edge_attr.pt # [num_edges, num_edge_features]
├── labels.pt # [num_nodes] or [num_edges]
├── train_mask.pt # Boolean mask
├── val_mask.pt
└── test_mask.pt
Neo4j
graphs/entity_relationship/neo4j/
├── nodes_account.csv
├── nodes_entity.csv
├── edges_transaction.csv
├── edges_ownership.csv
└── import.cypher
DGL (Deep Graph Library)
graphs/approval_network/dgl/
├── graph.bin # DGL graph object
├── node_feats.npy # Node features
├── edge_feats.npy # Edge features
└── labels.npy # Labels
Feature Categories
| Category | Features |
|---|---|
| Temporal | weekday, period, is_month_end, is_quarter_end, is_year_end |
| Amount | log(amount), benford_probability, is_round_number |
| Structural | line_count, unique_accounts, has_intercompany |
| Categorical | business_process (one-hot), source_type (one-hot) |
Key Types
Graph Models
#![allow(unused)]
fn main() {
pub struct Graph {
pub nodes: Vec<Node>,
pub edges: Vec<Edge>,
pub node_features: Option<Array2<f32>>,
pub edge_features: Option<Array2<f32>>,
}
pub enum Node {
Account(AccountNode),
Entity(EntityNode),
User(UserNode),
Transaction(TransactionNode),
}
pub enum Edge {
Transaction(TransactionEdge),
Approval(ApprovalEdge),
Ownership(OwnershipEdge),
}
}
Split Configuration
#![allow(unused)]
fn main() {
pub struct SplitConfig {
pub train_ratio: f64, // e.g., 0.7
pub val_ratio: f64, // e.g., 0.15
pub test_ratio: f64, // e.g., 0.15
pub stratify_by: Option<String>,
pub random_seed: u64,
}
}
Usage Examples
Building Transaction Graph
#![allow(unused)]
fn main() {
use synth_graph::{TransactionGraphBuilder, GraphConfig};
let builder = TransactionGraphBuilder::new(GraphConfig::default());
let graph = builder.build(&journal_entries)?;
println!("Nodes: {}", graph.nodes.len());
println!("Edges: {}", graph.edges.len());
}
PyTorch Geometric Export
#![allow(unused)]
fn main() {
use synth_graph::{PyTorchGeometricExporter, SplitConfig};
let exporter = PyTorchGeometricExporter::new("output/graphs");
let split = SplitConfig {
train_ratio: 0.7,
val_ratio: 0.15,
test_ratio: 0.15,
stratify_by: Some("is_anomaly".to_string()),
random_seed: 42,
};
exporter.export(&graph, split)?;
}
Neo4j Export
#![allow(unused)]
fn main() {
use synth_graph::Neo4jExporter;
let exporter = Neo4jExporter::new("output/graphs/neo4j");
exporter.export(&graph)?;
// Generates import script:
// LOAD CSV WITH HEADERS FROM 'file:///nodes_account.csv' AS row
// CREATE (:Account {id: row.id, name: row.name, ...})
}
Feature Engineering
#![allow(unused)]
fn main() {
use synth_graph::features::{FeatureExtractor, FeatureConfig};
let extractor = FeatureExtractor::new(FeatureConfig {
temporal: true,
amount: true,
structural: true,
categorical: true,
});
let node_features = extractor.extract_node_features(&entries)?;
let edge_features = extractor.extract_edge_features(&entries)?;
}
Graph Construction
Transaction Network
Accounts and entities become nodes; transactions become edges.
#![allow(unused)]
fn main() {
// Nodes:
// - Each GL account is a node
// - Each vendor/customer is a node
// Edges:
// - Each journal entry line creates an edge
// - Edge connects account to entity
// - Edge features: amount, date, fraud flag
}
Approval Network
Users become nodes; approval relationships become edges.
#![allow(unused)]
fn main() {
// Nodes:
// - Each user/employee is a node
// - Node features: approval_limit, department, role
// Edges:
// - Approval actions create edges
// - Edge features: amount, threshold, escalation
}
Entity Relationship Network
Legal entities become nodes; ownership and IC relationships become edges.
#![allow(unused)]
fn main() {
// Nodes:
// - Each company/legal entity is a node
// - Node features: currency, country, parent_flag
// Edges:
// - Ownership relationships (parent → subsidiary)
// - IC transaction relationships
// - Edge features: ownership_percent, transaction_volume
}
ML Integration
Loading in PyTorch
import torch
from torch_geometric.data import Data
# Load exported data
node_features = torch.load('node_features.pt')
edge_index = torch.load('edge_index.pt')
edge_attr = torch.load('edge_attr.pt')
labels = torch.load('labels.pt')
train_mask = torch.load('train_mask.pt')
data = Data(
x=node_features,
edge_index=edge_index,
edge_attr=edge_attr,
y=labels,
train_mask=train_mask,
)
Loading in Neo4j
# Import using generated script
neo4j-admin import \
--nodes=nodes_account.csv \
--nodes=nodes_entity.csv \
--relationships=edges_transaction.csv
Configuration
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
graphs:
- transaction_network
- approval_network
- entity_relationship
split:
train: 0.7
val: 0.15
test: 0.15
stratify: is_anomaly
features:
temporal: true
amount: true
structural: true
categorical: true
Multi-Layer Hypergraph (v0.6.2)
The hypergraph builder supports all 8 enterprise process families:
| Method | Family | Node Types |
|---|---|---|
add_p2p_documents() | P2P | PurchaseOrder, GoodsReceipt, VendorInvoice, Payment |
add_o2c_documents() | O2C | SalesOrder, Delivery, CustomerInvoice |
add_s2c_documents() | S2C | SourcingProject, RfxEvent, SupplierBid, ProcurementContract |
add_h2r_documents() | H2R | PayrollRun, TimeEntry, ExpenseReport |
add_mfg_documents() | MFG | ProductionOrder, QualityInspection, CycleCount |
add_bank_documents() | BANK | BankingCustomer, BankAccount, BankTransaction |
add_audit_documents() | AUDIT | AuditEngagement, Workpaper, AuditFinding, AuditEvidence |
add_bank_recon_documents() | Bank Recon | BankReconciliation, BankStatementLine, ReconcilingItem |
add_ocpm_events() | OCPM | Events as hyperedges (entity type 400) |
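A sketch of assembling a multi-layer hypergraph from several process families. The add_* method names come from the table above; the builder's type name, constructor, argument types, and build() call are assumptions:
#![allow(unused)]
fn main() {
use synth_graph::HypergraphBuilder; // type name assumed
let mut builder = HypergraphBuilder::new();
builder.add_p2p_documents(&p2p_flows);
builder.add_o2c_documents(&o2c_flows);
builder.add_bank_documents(&banking_data);
builder.add_ocpm_events(&ocel_events); // events become hyperedges (entity type 400)
let hypergraph = builder.build()?;
}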
See Also
datasynth-cli
Command-line interface for synthetic accounting data generation.
Overview
datasynth-cli provides the datasynth-data binary for command-line usage:
- generate: Generate synthetic data from configuration
- init: Create configuration files with industry presets
- validate: Validate configuration files
- info: Display available presets and options
Installation
cargo build --release
# Binary at: target/release/datasynth-data
Commands
generate
Generate synthetic financial data.
# From configuration file
datasynth-data generate --config config.yaml --output ./output
# Demo mode with defaults
datasynth-data generate --demo --output ./demo-output
# Override seed
datasynth-data generate --config config.yaml --output ./output --seed 12345
# Verbose output
datasynth-data generate --config config.yaml --output ./output -v
init
Create a configuration file from presets.
# Industry preset with complexity
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
Available industries:
manufacturing, retail, financial_services, healthcare, technology, energy, telecom, transportation, hospitality
validate
Validate configuration files.
datasynth-data validate --config config.yaml
info
Display available options.
datasynth-data info
fingerprint
Privacy-preserving fingerprint operations.
# Extract fingerprint
datasynth-data fingerprint extract --input ./data.csv --output ./fp.dsf --privacy-level standard
# Validate fingerprint
datasynth-data fingerprint validate ./fp.dsf
# View fingerprint details
datasynth-data fingerprint info ./fp.dsf --detailed
# Evaluate fidelity
datasynth-data fingerprint evaluate --fingerprint ./fp.dsf --synthetic ./output/ --threshold 0.8
# Federated aggregation (v0.5.0)
datasynth-data fingerprint federated --sources ./a.dsf ./b.dsf --output ./combined.dsf --method weighted_average
diffusion (v0.5.0)
Diffusion model training and evaluation.
# Train diffusion model from fingerprint
datasynth-data diffusion train --fingerprint ./fp.dsf --output ./model.json
# Evaluate model fit
datasynth-data diffusion evaluate --model ./model.json --samples 5000
causal (v0.5.0)
Causal and counterfactual data generation.
# Generate from causal template
datasynth-data causal generate --template fraud_detection --samples 10000 --output ./causal/
# Run intervention
datasynth-data causal intervene --template fraud_detection --variable transaction_amount --value 50000 --samples 5000 --output ./intervention/
# Validate causal structure
datasynth-data causal validate --data ./causal/ --template fraud_detection
Key Types
CLI Arguments
#![allow(unused)]
fn main() {
#[derive(Parser)]
pub struct Cli {
#[command(subcommand)]
pub command: Command,
/// Enable verbose logging
#[arg(short, long)]
pub verbose: bool,
/// Suppress non-error output
#[arg(short, long)]
pub quiet: bool,
}
#[derive(Subcommand)]
pub enum Command {
Generate(GenerateArgs),
Init(InitArgs),
Validate(ValidateArgs),
Info,
Fingerprint(FingerprintArgs), // fingerprint subcommands
Diffusion(DiffusionArgs), // v0.5.0: diffusion model commands
Causal(CausalArgs), // v0.5.0: causal generation commands
}
}
Generate Arguments
#![allow(unused)]
fn main() {
pub struct GenerateArgs {
/// Configuration file path
#[arg(short, long)]
pub config: Option<PathBuf>,
/// Use demo preset
#[arg(long)]
pub demo: bool,
/// Output directory (required)
#[arg(short, long)]
pub output: PathBuf,
/// Override random seed
#[arg(long)]
pub seed: Option<u64>,
/// Output format
#[arg(long, default_value = "csv")]
pub format: String,
/// Attach a synthetic data certificate (v0.5.0)
#[arg(long)]
pub certificate: bool,
}
pub struct InitArgs {
// ... existing fields ...
/// Generate config from natural language description (v0.5.0)
#[arg(long)]
pub from_description: Option<String>,
}
}
Signal Handling
On Unix systems, pause/resume generation with SIGUSR1:
# Start in background
datasynth-data generate --config config.yaml --output ./output &
# Toggle pause
kill -USR1 $(pgrep datasynth-data)
Progress bar shows pause state:
[████████░░░░░░░░░░░░] 40% - 40000/100000 entries (PAUSED)
Exit Codes
| Code | Description |
|---|---|
| 0 | Success |
| 1 | Configuration error |
| 2 | Generation error |
| 3 | I/O error |
Environment Variables
| Variable | Description |
|---|---|
SYNTH_DATA_LOG | Log level (error, warn, info, debug, trace) |
SYNTH_DATA_THREADS | Worker thread count |
SYNTH_DATA_MEMORY_LIMIT | Memory limit in bytes |
SYNTH_DATA_LOG=debug datasynth-data generate --demo --output ./output
Progress Display
During generation, a progress bar shows:
Generating synthetic data...
[████████████████████] 100% - 100000/100000 entries
Phase: Transactions | 85,432 entries/sec | ETA: 0:00
Generation complete!
- Journal entries: 100,000
- Document flows: 15,000
- Output: ./output/
- Duration: 1.2s
Usage Examples
Basic Generation
datasynth-data init --industry manufacturing -o config.yaml
datasynth-data generate --config config.yaml --output ./output
Scripting
#!/bin/bash
for industry in manufacturing retail healthcare; do
datasynth-data init --industry $industry --complexity medium -o ${industry}.yaml
datasynth-data generate --config ${industry}.yaml --output ./output/${industry}
done
CI/CD
# GitHub Actions
- name: Generate Test Data
run: |
cargo build --release
./target/release/datasynth-data generate --demo --output ./test-data
Reproducible Generation
# Same seed = same output
datasynth-data generate --config config.yaml --output ./run1 --seed 42
datasynth-data generate --config config.yaml --output ./run2 --seed 42
diff -r run1 run2 # No differences
See Also
datasynth-server
REST, gRPC, and WebSocket server for synthetic data generation.
Overview
datasynth-server provides server-based access to SyntheticData:
- REST API: Configuration management and stream control
- gRPC API: High-performance streaming generation
- WebSocket: Real-time event streaming
- Production Features: Authentication, rate limiting, timeouts
Starting the Server
cargo run -p datasynth-server -- --port 3000 --worker-threads 4
Command-Line Options
| Option | Default | Description |
|---|---|---|
--port | 3000 | HTTP/WebSocket port |
--grpc-port | 50051 | gRPC port |
--worker-threads | CPU cores | Worker thread count |
--api-key | None | Required API key |
--rate-limit | 100 | Max requests per minute |
--memory-limit | None | Memory limit in bytes |
Architecture
┌─────────────────────────────────────────────────────────────┐
│ datasynth-server │
├─────────────────────────────────────────────────────────────┤
│ REST API (Axum) │ gRPC (Tonic) │ WebSocket (Axum) │
├─────────────────────────────────────────────────────────────┤
│ Middleware Layer │
│ Auth │ Rate Limit │ Timeout │ CORS │ Logging │
├─────────────────────────────────────────────────────────────┤
│ Generation Service │
│ (wraps datasynth-runtime orchestrator) │
└─────────────────────────────────────────────────────────────┘
REST API Endpoints
Configuration
# Get current configuration
curl http://localhost:3000/api/config
# Update configuration
curl -X POST http://localhost:3000/api/config \
-H "Content-Type: application/json" \
-d '{"transactions": {"target_count": 50000}}'
# Validate configuration
curl -X POST http://localhost:3000/api/config/validate \
-H "Content-Type: application/json" \
-d @config.json
Stream Control
# Start generation
curl -X POST http://localhost:3000/api/stream/start
# Pause
curl -X POST http://localhost:3000/api/stream/pause
# Resume
curl -X POST http://localhost:3000/api/stream/resume
# Stop
curl -X POST http://localhost:3000/api/stream/stop
# Trigger pattern (month_end, quarter_end, year_end)
curl -X POST http://localhost:3000/api/stream/trigger/month_end
Health Check
curl http://localhost:3000/health
WebSocket API
Connect to ws://localhost:3000/ws/events for real-time events.
Event Types
// Progress
{"type": "progress", "current": 50000, "total": 100000, "percent": 50.0}
// Entry (streamed data)
{"type": "entry", "data": {"document_id": "abc-123", ...}}
// Error
{"type": "error", "message": "Memory limit exceeded"}
// Complete
{"type": "complete", "total_entries": 100000, "duration_ms": 1200}
gRPC API
Proto Definition
syntax = "proto3";
package synth;
service SynthService {
rpc GetConfig(Empty) returns (Config);
rpc SetConfig(Config) returns (Status);
rpc StartGeneration(GenerationRequest) returns (stream Entry);
rpc StopGeneration(Empty) returns (Status);
}
Client Example
#![allow(unused)]
fn main() {
use synth::synth_client::SynthClient;
let mut client = SynthClient::connect("http://localhost:50051").await?;
let request = tonic::Request::new(GenerationRequest { count: Some(1000) });
let mut stream = client.start_generation(request).await?.into_inner();
while let Some(entry) = stream.message().await? {
println!("Entry: {:?}", entry.document_id);
}
}
Middleware
Authentication
# With API key
curl -H "X-API-Key: your-key" http://localhost:3000/api/config
Rate Limiting
Sliding window rate limiter with per-client tracking.
// 429 response when exceeded
{
"error": "rate_limit_exceeded",
"retry_after": 30
}
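Clients should honor the retry_after hint when they receive a 429. A sketch using the external reqwest (with its json feature), serde_json, and tokio crates; these dependencies are assumptions, not part of this workspace:
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    loop {
        let resp = client
            .get("http://localhost:3000/api/config")
            .header("X-API-Key", "your-key")
            .send()
            .await?;
        if resp.status() == reqwest::StatusCode::TOO_MANY_REQUESTS {
            // Back off for the server-suggested interval, then retry.
            let body: serde_json::Value = resp.json().await?;
            let wait = body["retry_after"].as_u64().unwrap_or(30);
            tokio::time::sleep(Duration::from_secs(wait)).await;
            continue;
        }
        println!("{}", resp.text().await?);
        break;
    }
    Ok(())
}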
Request Timeout
Default timeout is 30 seconds. Long-running operations use streaming.
Key Types
Server Configuration
#![allow(unused)]
fn main() {
pub struct ServerConfig {
pub port: u16,
pub grpc_port: u16,
pub worker_threads: usize,
pub api_key: Option<String>,
pub rate_limit: RateLimitConfig,
pub memory_limit: Option<u64>,
pub cors_origins: Vec<String>,
}
}
Rate Limit Configuration
#![allow(unused)]
fn main() {
pub struct RateLimitConfig {
pub max_requests: u32,
pub window_seconds: u64,
pub exempt_paths: Vec<String>,
}
}
Production Deployment
Docker
FROM rust:1.88 as builder
WORKDIR /app
COPY . .
RUN cargo build --release -p datasynth-server
FROM debian:bookworm-slim
COPY --from=builder /app/target/release/datasynth-server /usr/local/bin/
EXPOSE 3000 50051
CMD ["datasynth-server", "--port", "3000"]
Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datasynth-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: datasynth-server
  template:
    metadata:
      labels:
        app: datasynth-server
    spec:
      containers:
        - name: datasynth-server
          image: datasynth-server:latest
          ports:
            - containerPort: 3000
            - containerPort: 50051
          env:
            - name: SYNTH_API_KEY
              valueFrom:
                secretKeyRef:
                  name: synth-secrets
                  key: api-key
          resources:
            limits:
              memory: "2Gi"
Monitoring
Health Endpoint
curl http://localhost:3000/health
{
"status": "healthy",
"uptime_seconds": 3600,
"memory_usage_mb": 512,
"active_streams": 2
}
Logging
Enable structured logging:
RUST_LOG=synth_server=info cargo run -p datasynth-server
See Also
datasynth-ui
Cross-platform desktop application for synthetic data generation.
Overview
datasynth-ui provides a graphical interface for SyntheticData:
- Visual Configuration: Comprehensive UI for all configuration sections
- Real-time Streaming: Live generation viewer with WebSocket
- Preset Management: One-click industry preset application
- Validation Feedback: Real-time configuration validation
Technology Stack
| Component | Technology |
|---|---|
| Backend | Tauri 2.0 (Rust) |
| Frontend | SvelteKit + Svelte 5 |
| Styling | TailwindCSS |
| State | Svelte stores with runes |
Prerequisites
Linux (Ubuntu/Debian)
sudo apt install libgtk-3-dev libwebkit2gtk-4.1-dev \
libappindicator3-dev librsvg2-dev
Linux (Fedora)
sudo dnf install gtk3-devel webkit2gtk4.1-devel \
libappindicator-gtk3-devel librsvg2-devel
Linux (Arch)
sudo pacman -S webkit2gtk-4.1 base-devel curl wget file \
openssl appmenu-gtk-module gtk3 librsvg libvips
macOS
No additional dependencies required (uses built-in WebKit).
Windows
WebView2 runtime (usually pre-installed on Windows 10/11).
Development
cd crates/datasynth-ui
# Install dependencies
npm install
# Frontend development (no desktop features)
npm run dev
# Desktop app development
npm run tauri dev
# Production build
npm run build
npm run tauri build
Project Structure
datasynth-ui/
├── src/ # Svelte frontend
│ ├── routes/ # SvelteKit pages
│ │ ├── +page.svelte # Dashboard
│ │ ├── config/ # Configuration pages (15+ sections)
│ │ │ ├── global/
│ │ │ ├── transactions/
│ │ │ ├── master-data/
│ │ │ └── ...
│ │ └── generate/
│ │ └── stream/ # Generation streaming viewer
│ └── lib/
│ ├── components/ # Reusable UI components
│ │ ├── forms/ # Form components
│ │ └── config/ # Config-specific components
│ ├── stores/ # Svelte stores
│ └── utils/ # Utilities
├── src-tauri/ # Rust backend
│ ├── src/
│ │ ├── lib.rs # Tauri commands
│ │ └── main.rs # App entry point
│ └── Cargo.toml
├── e2e/ # Playwright E2E tests
├── package.json
└── tauri.conf.json
Configuration Sections
| Section | Description |
|---|---|
| Global | Industry, dates, seed, performance |
| Transactions | Line items, amounts, sources |
| Master Data | Vendors, customers, materials |
| Document Flows | P2P, O2C configuration |
| Financial | Balance, subledger, FX, period close |
| Compliance | Fraud, controls, approval |
| Analytics | Graph export, anomaly, data quality |
| Output | Formats, compression |
Key Components
Config Store
// src/lib/stores/config.ts
import { writable } from 'svelte/store';
export const config = writable<Config>(defaultConfig);
export const isDirty = writable(false);
export function updateConfig(section: string, value: any) {
config.update(c => ({...c, [section]: value}));
isDirty.set(true);
}
Form Components
<!-- src/lib/components/forms/InputNumber.svelte -->
<script lang="ts">
export let value: number;
export let min: number = 0;
export let max: number = Infinity;
export let label: string;
</script>
<label>
{label}
<input type="number" bind:value {min} {max} />
</label>
Tauri Commands
#![allow(unused)]
fn main() {
// src-tauri/src/lib.rs
#[tauri::command]
async fn save_config(config: Config) -> Result<(), String> {
    // Save the configuration (implementation elided)
    Ok(())
}
#[tauri::command]
async fn start_generation(config: Config) -> Result<(), String> {
    // Start generation via datasynth-runtime (implementation elided)
    Ok(())
}
}
Server Connection
The UI connects to datasynth-server for streaming:
# Start server first
cargo run -p datasynth-server
# Then run UI
npm run tauri dev
Default server URL: http://localhost:3000
Testing
# Unit tests
npm test
# E2E tests with Playwright
npx playwright test
# E2E with UI
npx playwright test --ui
Build Output
Production builds create platform-specific packages:
| Platform | Output |
|---|---|
| Windows | .msi, .exe |
| macOS | .dmg, .app |
| Linux | .deb, .AppImage, .rpm |
Located in: src-tauri/target/release/bundle/
UI Features
Dashboard
- System overview
- Quick stats
- Recent generations
Configuration Editor
- Visual form editors for all sections
- Real-time validation
- Dirty state tracking
- Export to YAML/JSON
Streaming Viewer
- Real-time progress
- Entry preview table
- Memory usage graph
- Pause/resume controls
Preset Selector
- Industry presets
- Complexity levels
- One-click application
See Also
datasynth-eval
Evaluation framework for synthetic financial data quality and coherence.
Overview
datasynth-eval provides automated quality assessment for generated data:
- Statistical Evaluation: Benford’s Law compliance, distribution analysis
- Coherence Checking: Balance verification, document chain integrity
- Intercompany Validation: IC matching and elimination verification
- Data Quality Analysis: Completeness, consistency, format validation
- ML Readiness: Feature distributions, label quality, graph structure
- Enhancement Derivation: Auto-tuning with configuration recommendations
Evaluation Categories
| Category | Description |
|---|---|
| Statistical | Benford’s Law, amount distributions, temporal patterns, line items |
| Coherence | Trial balance, subledger reconciliation, FX consistency, document chains |
| Intercompany | IC matching rates, elimination completeness |
| Quality | Completeness, consistency, duplicates, format validation, uniqueness |
| ML Readiness | Feature distributions, label quality, graph structure, train/val/test splits |
| Enhancement | Auto-tuning, configuration recommendations, root cause analysis |
Module Structure
| Module | Description |
|---|---|
statistical/ | Benford’s Law, amount distributions, temporal patterns |
coherence/ | Balance sheet, IC matching, document chains, subledger reconciliation |
quality/ | Completeness, consistency, duplicates, formats, uniqueness |
ml/ | Feature analysis, label quality, graph structure, splits |
report/ | HTML and JSON report generation with baseline comparisons |
tuning/ | Configuration optimization recommendations |
enhancement/ | Auto-tuning engine with config patch generation |
Key Types
Evaluator
#![allow(unused)]
fn main() {
pub struct Evaluator {
config: EvaluationConfig,
checkers: Vec<Box<dyn Checker>>,
}
pub struct EvaluationConfig {
pub benford_threshold: f64, // Chi-square threshold
pub balance_tolerance: Decimal, // Allowed imbalance
pub ic_match_threshold: f64, // Required match rate
pub duplicate_check: bool,
}
}
Evaluation Report
#![allow(unused)]
fn main() {
pub struct EvaluationReport {
pub overall_status: Status,
pub categories: Vec<CategoryResult>,
pub warnings: Vec<Warning>,
pub details: Vec<Finding>,
pub scores: Scores,
}
pub struct Scores {
pub benford_score: f64, // 0.0-1.0
pub balance_coherence: f64, // 0.0-1.0
pub ic_matching_rate: f64, // 0.0-1.0
pub uniqueness_score: f64, // 0.0-1.0
}
pub enum Status {
Passed,
PassedWithWarnings,
Failed,
}
}
Usage Examples
Basic Evaluation
#![allow(unused)]
fn main() {
use synth_eval::{Evaluator, EvaluationConfig};
let evaluator = Evaluator::new(EvaluationConfig::default());
let report = evaluator.evaluate(&generated_data)?;
println!("Status: {:?}", report.overall_status);
println!("Benford compliance: {:.2}%", report.scores.benford_score * 100.0);
}
Custom Configuration
#![allow(unused)]
fn main() {
let config = EvaluationConfig {
benford_threshold: 0.05, // 5% significance level
balance_tolerance: dec!(0.01), // 1 cent tolerance
ic_match_threshold: 0.99, // 99% required match
duplicate_check: true,
};
let evaluator = Evaluator::new(config);
}
Category-Specific Evaluation
#![allow(unused)]
fn main() {
use synth_eval::checkers::{BenfordChecker, BalanceChecker};
let benford = BenfordChecker::new(0.05);
let result = benford.check(&amounts)?;
let balance = BalanceChecker::new(dec!(0.01));
let result = balance.check(&trial_balance)?;
}
Evaluation Checks
Benford’s Law
Verifies first-digit distribution follows Benford’s Law:
#![allow(unused)]
fn main() {
// Expected: P(d) = log10(1 + 1/d)
// d=1: 30.1%, d=2: 17.6%, d=3: 12.5%, etc.
let benford_result = evaluator.check_benford(&amounts)?;
if benford_result.chi_square > critical_value {
println!("Warning: Amounts don't follow Benford's Law");
}
}
Balance Coherence
Verifies accounting equation:
#![allow(unused)]
fn main() {
// Assets = Liabilities + Equity
let balance_result = evaluator.check_balance(&trial_balance)?;
if !balance_result.passed {
println!("Imbalance: {:?}", balance_result.difference);
}
}
Document Chain Integrity
Verifies document references:
#![allow(unused)]
fn main() {
// PO → GR → Invoice → Payment chain
let chain_result = evaluator.check_document_chains(&documents)?;
for broken_chain in &chain_result.broken_chains {
println!("Broken chain: {} → {}", broken_chain.from, broken_chain.to);
}
}
IC Matching
Verifies intercompany transactions match:
#![allow(unused)]
fn main() {
let ic_result = evaluator.check_ic_matching(&ic_entries)?;
println!("Match rate: {:.2}%", ic_result.match_rate * 100.0);
println!("Unmatched: {}", ic_result.unmatched.len());
}
Uniqueness
Detects duplicate document IDs:
#![allow(unused)]
fn main() {
let unique_result = evaluator.check_uniqueness(&entries)?;
if !unique_result.duplicates.is_empty() {
for dup in &unique_result.duplicates {
println!("Duplicate ID: {}", dup.document_id);
}
}
}
Report Output
Console Report
#![allow(unused)]
fn main() {
evaluator.print_report(&report);
}
=== Evaluation Report ===
Status: PASSED
Scores:
Benford Compliance: 98.5%
Balance Coherence: 100.0%
IC Matching Rate: 99.8%
Uniqueness: 100.0%
Warnings:
- 3 entries with unusual amounts detected
Categories:
[✓] Statistical: PASSED
[✓] Coherence: PASSED
[✓] Intercompany: PASSED
[✓] Uniqueness: PASSED
JSON Report
#![allow(unused)]
fn main() {
let json = evaluator.to_json(&report)?;
std::fs::write("evaluation_report.json", json)?;
}
Integration with Generation
#![allow(unused)]
fn main() {
use synth_runtime::GenerationOrchestrator;
use synth_eval::Evaluator;
let orchestrator = GenerationOrchestrator::new(config)?;
let data = orchestrator.run()?;
// Evaluate generated data
let evaluator = Evaluator::new(EvaluationConfig::default());
let report = evaluator.evaluate(&data)?;
if report.overall_status == Status::Failed {
return Err("Generated data failed quality checks".into());
}
}
Enhancement Module
The enhancement module provides automatic configuration tuning based on evaluation results.
Pipeline Flow
Evaluation Results → Threshold Check → Gap Analysis → Root Cause → Config Suggestion
Auto-Tuning
#![allow(unused)]
fn main() {
use synth_eval::enhancement::{AutoTuner, AutoTuneResult};
let tuner = AutoTuner::new();
let result: AutoTuneResult = tuner.analyze(&evaluation);
for patch in result.patches_by_confidence() {
println!("{}: {} → {} (confidence: {:.0}%)",
patch.path,
patch.current_value.as_deref().unwrap_or("?"),
patch.suggested_value,
patch.confidence * 100.0
);
}
}
Key Types
#![allow(unused)]
fn main() {
pub struct ConfigPatch {
pub path: String, // e.g., "transactions.amount.benford_compliance"
pub current_value: Option<String>,
pub suggested_value: String,
pub confidence: f64, // 0.0-1.0
pub expected_impact: String,
}
pub struct AutoTuneResult {
pub patches: Vec<ConfigPatch>,
pub expected_improvement: f64,
pub addressed_metrics: Vec<String>,
pub unaddressable_metrics: Vec<String>,
pub summary: String,
}
}
Recommendation Engine
#![allow(unused)]
fn main() {
use synth_eval::enhancement::{RecommendationEngine, RecommendationPriority};
let engine = RecommendationEngine::new();
let recommendations = engine.generate(&evaluation);
for rec in recommendations.iter().filter(|r| r.priority == RecommendationPriority::Critical) {
println!("CRITICAL: {} - {}", rec.title, rec.root_cause.description);
}
}
Metric-to-Config Mappings
| Metric | Config Path | Strategy |
|---|---|---|
benford_p_value | transactions.amount.benford_compliance | Enable boolean |
round_number_ratio | transactions.amount.round_number_bias | Set to target |
temporal_correlation | transactions.temporal.seasonality_strength | Increase by gap |
anomaly_rate | anomaly_injection.base_rate | Set to target |
ic_match_rate | intercompany.match_precision | Increase by gap |
completeness_rate | data_quality.missing_values.overall_rate | Decrease by gap |
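As a rough illustration of how one of these mappings can turn an evaluation gap into a ConfigPatch, the sketch below applies the "increase by gap" strategy for the IC match rate. The helper name and heuristics are assumptions for the example; the crate's internal tuning logic may differ.
// Hypothetical helper: translate an IC match-rate shortfall into a patch.
fn ic_match_patch(measured: f64, target: f64, current_precision: f64) -> Option<ConfigPatch> {
    if measured >= target {
        return None;
    }
    let gap = target - measured;
    Some(ConfigPatch {
        path: "intercompany.match_precision".to_string(),
        current_value: Some(current_precision.to_string()),
        suggested_value: format!("{:.3}", (current_precision + gap).min(1.0)),
        confidence: 0.8,
        expected_impact: format!("Raise IC match rate by ~{:.1}%", gap * 100.0),
    })
}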
See Also
datasynth-banking
KYC/AML banking transaction generator for synthetic data.
Overview
datasynth-banking provides comprehensive banking transaction simulation for:
- Compliance testing and model training
- AML/fraud detection system evaluation
- KYC process simulation
- Regulatory reporting testing
Features
| Feature | Description |
|---|---|
| Customer Generation | Retail, business, and trust customers with realistic KYC profiles |
| Account Generation | Multiple account types with proper feature sets |
| Transaction Engine | Persona-based transaction generation with causal drivers |
| AML Typologies | Structuring, funnel accounts, layering, mule networks, and more |
| Ground Truth Labels | Multi-level labels for ML training |
| Spoofing Mode | Adversarial transaction generation for robustness testing |
Architecture
BankingOrchestrator (orchestration)
|
Generators (customer, account, transaction, counterparty)
|
Typologies (AML pattern injection)
|
Labels (ground truth generation)
|
Models (customer, account, transaction, KYC)
Module Structure
Models
| Model | Description |
|---|---|
BankingCustomer | Retail, Business, Trust customer types |
BankAccount | Account with type, features, status |
BankTransaction | Transaction with direction, channel, category |
KycProfile | Expected activity envelope for compliance |
CounterpartyPool | Transaction counterparty management |
CaseNarrative | Investigation and compliance narratives |
KYC Profile
#![allow(unused)]
fn main() {
pub struct KycProfile {
pub declared_purpose: String,
pub turnover_band: TurnoverBand,
pub transaction_frequency: TransactionFrequency,
pub expected_categories: Vec<TransactionCategory>,
pub source_of_funds: SourceOfFunds,
pub source_of_wealth: SourceOfWealth,
pub geographic_exposure: Vec<String>,
pub cash_intensity: CashIntensity,
pub beneficial_owner_complexity: OwnerComplexity,
// Ground truth fields
pub is_deceiving: bool,
pub actual_turnover_band: Option<TurnoverBand>,
}
}
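A sketch of how the ground-truth fields might be used downstream, for example to flag customers whose actual activity exceeds the declared envelope. This assumes TurnoverBand can be compared with PartialOrd, which is not guaranteed by the crate.
// Hypothetical check: declared vs. actual activity envelope.
fn declared_activity_exceeded(profile: &KycProfile) -> bool {
    match (&profile.actual_turnover_band, profile.is_deceiving) {
        // A deceiving profile whose actual band is above the declared band.
        (Some(actual), true) => *actual > profile.turnover_band,
        _ => false,
    }
}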
AML Typologies
| Typology | Description |
|---|---|
| Structuring | Transactions below reporting thresholds ($10K) |
| Funnel Accounts | Multiple small deposits, few large withdrawals |
| Layering | Complex transaction chains to obscure origin |
| Mule Networks | Money mule payment chains |
| Round-Tripping | Circular transaction patterns |
| Credit Card Fraud | Fraudulent card transactions |
| Synthetic Identity | Fake identity transactions |
| Spoofing | Adversarial patterns for model testing |
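To make the structuring typology concrete, here is a minimal sketch of splitting a cash total into deposits that each stay below a $10,000 reporting threshold. It is illustrative only; the crate's injector is configuration-driven and its internal logic may differ.
use rand::Rng;

/// Split `total` into cash deposits that each stay below the reporting threshold.
fn structure_deposits<R: Rng>(rng: &mut R, total: f64, threshold: f64) -> Vec<f64> {
    let mut remaining = total;
    let mut deposits = Vec::new();
    while remaining > 0.0 {
        // Stay 5-15% under the threshold to avoid obvious clustering at $9,999.
        let cap = threshold * rng.gen_range(0.85..0.95);
        let amount = remaining.min(cap);
        deposits.push((amount * 100.0).round() / 100.0);
        remaining -= amount;
    }
    deposits
}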
Labels
| Label Type | Description |
|---|---|
| Entity Labels | Customer-level risk classifications |
| Relationship Labels | Relationship risk indicators |
| Transaction Labels | Transaction-level classifications |
| Narrative Labels | Investigation case narratives |
Usage Examples
Basic Generation
#![allow(unused)]
fn main() {
use synth_banking::{BankingOrchestrator, BankingConfig};
let config = BankingConfig::default();
let mut orchestrator = BankingOrchestrator::new(config, 12345);
// Generate customers and accounts
let customers = orchestrator.generate_customers();
let accounts = orchestrator.generate_accounts(&customers);
// Generate transaction stream
let transactions = orchestrator.generate_transactions(&accounts);
}
With AML Typologies
#![allow(unused)]
fn main() {
use synth_banking::{BankingConfig, TypologyConfig};
let config = BankingConfig {
customer_count: 1000,
typologies: TypologyConfig {
structuring_rate: 0.02, // 2% structuring patterns
funnel_rate: 0.01, // 1% funnel accounts
mule_rate: 0.005, // 0.5% mule networks
..Default::default()
},
..Default::default()
};
}
Accessing Labels
#![allow(unused)]
fn main() {
let labels = orchestrator.generate_labels();
// Entity-level labels
for entity_label in &labels.entity_labels {
println!("Customer {} risk: {:?}",
entity_label.customer_id,
entity_label.risk_tier
);
}
// Transaction-level labels
for tx_label in &labels.transaction_labels {
if tx_label.is_suspicious {
println!("Suspicious: {} - {:?}",
tx_label.transaction_id,
tx_label.typology
);
}
}
}
Key Types
Customer Types
#![allow(unused)]
fn main() {
pub enum BankingCustomerType {
Retail, // Individual customers
Business, // Business accounts
Trust, // Trust/corporate entities
}
}
Risk Tiers
#![allow(unused)]
fn main() {
pub enum RiskTier {
Low,
Medium,
High,
Prohibited,
}
}
Transaction Categories
#![allow(unused)]
fn main() {
pub enum TransactionCategory {
SalaryWages,
BusinessPayment,
Investment,
RealEstate,
Gambling,
Cryptocurrency,
CashDeposit,
CashWithdrawal,
WireTransfer,
AtmWithdrawal,
PosPayment,
OnlinePayment,
// ... more categories
}
}
AML Typologies
#![allow(unused)]
fn main() {
pub enum AmlTypology {
Structuring,
Funnel,
Layering,
Mule,
RoundTripping,
CreditCardFraud,
SyntheticIdentity,
None,
}
}
Export Files
| File | Description |
|---|---|
banking_customers.csv | Customer profiles with KYC data |
bank_accounts.csv | Account records with features |
bank_transactions.csv | Transaction records |
kyc_profiles.csv | Expected activity envelopes |
counterparties.csv | Counterparty pool |
entity_risk_labels.csv | Entity-level risk classifications |
transaction_risk_labels.csv | Transaction-level labels |
aml_typology_labels.csv | AML typology ground truth |
See Also
- datasynth-core - Core banking models
- Fraud Detection Use Case
- Anomaly Injection
datasynth-ocpm
Object-Centric Process Mining (OCPM) models and generators.
Overview
datasynth-ocpm provides OCEL 2.0 compliant event log generation across 8 enterprise process families:
- OCEL 2.0 Models: Events, objects, relationships per IEEE standard
- 8 Process Generators: P2P, O2C, S2C, H2R, MFG, BANK, AUDIT, Bank Recon
- 88 Activity Types: Covering the full enterprise lifecycle
- 52 Object Types: With lifecycle states and inter-object relationships
- Export Formats: OCEL 2.0 JSON, XML, and SQLite
OCEL 2.0 Standard
Implements the Object-Centric Event Log standard:
| Element | Description |
|---|---|
| Events | Activities with timestamps and attributes |
| Objects | Business objects (PO, Invoice, Payment, etc.) |
| Object Types | Type definitions with attribute schemas |
| Relationships | Object-to-object relationships |
| Event-Object Links | Many-to-many event-object associations |
Key Types
OCEL Models
#![allow(unused)]
fn main() {
pub struct OcelEventLog {
pub object_types: Vec<ObjectType>,
pub event_types: Vec<EventType>,
pub objects: Vec<Object>,
pub events: Vec<Event>,
pub relationships: Vec<ObjectRelationship>,
}
pub struct Event {
pub id: String,
pub event_type: String,
pub timestamp: DateTime<Utc>,
pub attributes: HashMap<String, Value>,
pub objects: Vec<ObjectRef>,
}
pub struct Object {
pub id: String,
pub object_type: String,
pub attributes: HashMap<String, Value>,
}
}
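For orientation, a single event linked to two objects could be assembled from these types roughly as follows. The attribute values are placeholders and the ObjectRef construction is elided, since that type is not shown here.
use std::collections::HashMap;
use chrono::Utc;

let po = Object {
    id: "PO-001".to_string(),
    object_type: "PurchaseOrder".to_string(),
    attributes: HashMap::new(),
};
let vendor = Object {
    id: "V-001".to_string(),
    object_type: "Vendor".to_string(),
    attributes: HashMap::new(),
};
let event = Event {
    id: "e1".to_string(),
    event_type: "Create Purchase Order".to_string(),
    timestamp: Utc::now(),
    attributes: HashMap::new(),
    objects: vec![/* ObjectRef entries pointing at po and vendor */],
};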
Process Flow Documents
#![allow(unused)]
fn main() {
pub struct P2pDocuments {
pub po_number: String,
pub vendor_id: String,
pub company_code: String,
pub amount: Decimal,
pub currency: String,
}
pub struct O2cDocuments {
pub so_number: String,
pub customer_id: String,
pub company_code: String,
pub amount: Decimal,
pub currency: String,
}
}
Process Flows
Procure-to-Pay (P2P)
Create PO → Approve PO → Release PO → Create GR → Post GR →
Receive Invoice → Verify Invoice → Post Invoice → Execute Payment
Events generated:
- Create Purchase Order
- Approve Purchase Order
- Release Purchase Order
- Create Goods Receipt
- Post Goods Receipt
- Receive Vendor Invoice
- Verify Three-Way Match
- Post Vendor Invoice
- Execute Payment
Order-to-Cash (O2C)
Create SO → Check Credit → Release SO → Create Delivery →
Pick → Pack → Ship → Create Invoice → Post Invoice → Receive Payment
Events generated:
- Create Sales Order
- Check Credit
- Release Sales Order
- Create Delivery
- Pick Materials
- Pack Shipment
- Ship Goods
- Create Customer Invoice
- Post Customer Invoice
- Receive Customer Payment
Usage Examples
Generate P2P Case
#![allow(unused)]
fn main() {
use synth_ocpm::{OcpmGenerator, P2pDocuments};
let mut generator = OcpmGenerator::new(seed);
let documents = P2pDocuments::new(
"PO-001",
"V-001",
"1000",
dec!(5000.00),
"USD",
);
let users = vec!["user1", "user2", "user3"];
let start_time = Utc::now();
let result = generator.generate_p2p_case(&documents, start_time, &users);
}
Generate O2C Case
#![allow(unused)]
fn main() {
use synth_ocpm::{OcpmGenerator, O2cDocuments};
let documents = O2cDocuments::new(
"SO-001",
"C-001",
"1000",
dec!(10000.00),
"USD",
);
let result = generator.generate_o2c_case(&documents, start_time, &users);
}
Generate Complete Event Log
#![allow(unused)]
fn main() {
use synth_ocpm::OcpmGenerator;
let mut generator = OcpmGenerator::new(seed);
let event_log = generator.generate_event_log(
    1000,       // p2p_count
    500,        // o2c_count
    start_date,
    end_date,
)?;
}
Export Formats
OCEL 2.0 JSON
#![allow(unused)]
fn main() {
use synth_ocpm::export::{Ocel2Exporter, ExportFormat};
let exporter = Ocel2Exporter::new(ExportFormat::Json);
exporter.export(&event_log, "output/ocel2.json")?;
}
Output structure:
{
"objectTypes": [...],
"eventTypes": [...],
"objects": [...],
"events": [...],
"relations": [...]
}
OCEL 2.0 XML
#![allow(unused)]
fn main() {
let exporter = Ocel2Exporter::new(ExportFormat::Xml);
exporter.export(&event_log, "output/ocel2.xml")?;
}
SQLite Database
#![allow(unused)]
fn main() {
let exporter = Ocel2Exporter::new(ExportFormat::Sqlite);
exporter.export(&event_log, "output/ocel2.sqlite")?;
}
Tables created:
- object_types
- event_types
- objects
- events
- event_objects
- object_relationships
Process Families (v0.6.2)
| Family | Generator | Activities | Object Types | Variants |
|---|---|---|---|---|
| P2P | generate_p2p_case() | 9 | PurchaseOrder, GoodsReceipt, VendorInvoice, Payment, Material, Vendor | Happy, Exception, Error |
| O2C | generate_o2c_case() | 10 | SalesOrder, Delivery, CustomerInvoice, CustomerPayment, Material, Customer | Happy, Exception, Error |
| S2C | generate_s2c_case() | 8 | SourcingProject, SupplierQualification, RfxEvent, SupplierBid, BidEvaluation, ProcurementContract | Happy, Exception, Error |
| H2R | generate_h2r_case() | 8 | PayrollRun, PayrollLineItem, TimeEntry, ExpenseReport | Happy, Exception, Error |
| MFG | generate_mfg_case() | 10 | ProductionOrder, RoutingOperation, QualityInspection, CycleCount | Happy, Exception, Error |
| BANK | generate_bank_case() | 8 | BankingCustomer, BankAccount, BankTransaction | Happy, Exception, Error |
| AUDIT | generate_audit_case() | 10 | AuditEngagement, Workpaper, AuditFinding, AuditEvidence, RiskAssessment, ProfessionalJudgment | Happy, Exception, Error |
| Bank Recon | generate_bank_recon_case() | 8 | BankReconciliation, BankStatementLine, ReconcilingItem | Happy, Exception, Error |
Variant distribution: HappyPath (75%), ExceptionPath (20%), ErrorPath (5%).
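A minimal sketch of how the 75/20/5 variant split could be sampled; the generator's actual selection mechanism may differ.
use rand::Rng;

enum Variant {
    HappyPath,
    ExceptionPath,
    ErrorPath,
}

fn sample_variant<R: Rng>(rng: &mut R) -> Variant {
    match rng.gen_range(0.0..1.0) {
        x if x < 0.75 => Variant::HappyPath,
        x if x < 0.95 => Variant::ExceptionPath,
        _ => Variant::ErrorPath,
    }
}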
Object Types (P2P/O2C)
| Type | Description |
|---|---|
| PurchaseOrder | P2P ordering document |
| GoodsReceipt | Inventory receipt |
| VendorInvoice | AP invoice |
| Payment | Payment document |
| SalesOrder | O2C ordering document |
| Delivery | Shipment document |
| CustomerInvoice | AR invoice |
| CustomerPayment | Customer receipt |
| Material | Product/item |
| Vendor | Supplier |
| Customer | Customer/buyer |
Integration with Process Mining Tools
OCEL 2.0 exports are compatible with:
- PM4Py: Python process mining library
- Celonis: Enterprise process mining platform
- PROM: Academic process mining toolkit
- OCPA: Object-centric process analysis tool
Loading in PM4Py
import pm4py
from pm4py.objects.ocel.importer import jsonocel
ocel = jsonocel.apply("ocel2.json")
print(f"Events: {len(ocel.events)}")
print(f"Objects: {len(ocel.objects)}")
See Also
datasynth-fingerprint
Privacy-preserving fingerprint extraction from real data and synthesis of matching synthetic data.
Overview
The datasynth-fingerprint crate provides tools for extracting statistical fingerprints from real datasets while preserving privacy through differential privacy mechanisms and k-anonymity. These fingerprints can then be used to generate synthetic data that matches the statistical properties of the original data without exposing sensitive information.
Architecture
Real Data → Extract → .dsf File → Generate → Synthetic Data → Evaluate
The fingerprinting workflow consists of three main stages:
- Extraction: Analyze real data and extract statistical properties
- Synthesis: Generate configuration and synthetic data from fingerprints
- Evaluation: Validate synthetic data fidelity against fingerprints
Key Components
Models (models/)
| Model | Description |
|---|---|
| Fingerprint | Root container with manifest, schema, statistics, correlations, integrity, rules, anomalies, privacy_audit |
| Manifest | Version, format, created_at, source metadata, privacy metadata, checksums, optional signature |
| SchemaFingerprint | Tables with columns, data types, cardinalities, relationships |
| StatisticsFingerprint | Numeric stats (distribution, percentiles, Benford), categorical stats (frequencies, entropy) |
| CorrelationFingerprint | Correlation matrices with copula parameters |
| IntegrityFingerprint | Foreign key definitions, cardinality rules |
| RulesFingerprint | Balance rules, approval thresholds |
| AnomalyFingerprint | Anomaly rates, type distributions, temporal patterns |
| PrivacyAudit | Actions log, epsilon spent, k-anonymity, warnings |
Privacy Engine (privacy/)
| Component | Description |
|---|---|
| LaplaceMechanism | Differential privacy with configurable epsilon |
| GaussianMechanism | Alternative DP mechanism for (ε,δ)-privacy |
| KAnonymity | Suppression of rare categorical values below k threshold |
| PrivacyEngine | Unified interface combining DP, k-anonymity, winsorization |
| PrivacyAuditBuilder | Build privacy audit with actions and warnings |
Privacy Levels
| Level | Epsilon | k | Outlier % | Use Case |
|---|---|---|---|---|
| Minimal | 5.0 | 3 | 99% | Low privacy, high utility |
| Standard | 1.0 | 5 | 95% | Balanced (default) |
| High | 0.5 | 10 | 90% | Higher privacy |
| Maximum | 0.1 | 20 | 85% | Maximum privacy |
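To illustrate what the epsilon budget means in practice, here is a textbook Laplace perturbation of a single statistic. This is a sketch of the standard mechanism, not the crate's LaplaceMechanism API.
use rand::Rng;

/// Add Laplace(0, sensitivity/epsilon) noise to a numeric statistic.
fn laplace_noise<R: Rng>(rng: &mut R, value: f64, sensitivity: f64, epsilon: f64) -> f64 {
    let scale = sensitivity / epsilon;
    // Inverse-CDF sampling of the Laplace distribution.
    let u: f64 = rng.gen_range(-0.5..0.5);
    value - scale * u.signum() * (1.0 - 2.0 * u.abs()).ln()
}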
Extraction Engine (extraction/)
| Extractor | Description |
|---|---|
| FingerprintExtractor | Main coordinator for all extraction |
| SchemaExtractor | Infer data types, cardinalities, relationships |
| StatsExtractor | Compute distributions, percentiles, Benford analysis |
| CorrelationExtractor | Pearson correlations, copula fitting |
| IntegrityExtractor | Detect foreign key relationships |
| RulesExtractor | Detect balance rules, approval patterns |
| AnomalyExtractor | Analyze anomaly rates and patterns |
I/O (io/)
| Component | Description |
|---|---|
| FingerprintWriter | Write .dsf files (ZIP with YAML/JSON components) |
| FingerprintReader | Read .dsf files with checksum verification |
| FingerprintValidator | Validate DSF structure and integrity |
| validate_dsf() | Convenience function for CLI validation |
Synthesis (synthesis/)
| Component | Description |
|---|---|
| ConfigSynthesizer | Convert fingerprint to GeneratorConfig |
| DistributionFitter | Fit AmountSampler parameters from statistics |
| GaussianCopula | Generate correlated values preserving multivariate structure |
Evaluation (evaluation/)
| Component | Description |
|---|---|
| FidelityEvaluator | Compare synthetic data against fingerprint |
| FidelityReport | Overall score, component scores, pass/fail status |
| FidelityConfig | Thresholds and weights for evaluation |
Federated Fingerprinting (federated/) — v0.5.0
| Component | Description |
|---|---|
| FederatedFingerprintProtocol | Orchestrates multi-source fingerprint aggregation |
| PartialFingerprint | Per-source fingerprint with local DP (epsilon, means, stds, correlations) |
| AggregatedFingerprint | Combined fingerprint with total epsilon and source count |
| AggregationMethod | WeightedAverage, Median, or TrimmedMean strategies |
| FederatedConfig | min_sources, max_epsilon_per_source, aggregation_method |
Certificates (certificates/) — v0.5.0
| Component | Description |
|---|---|
| SyntheticDataCertificate | Certificate with DP guarantees, quality metrics, config hash, signature |
| CertificateBuilder | Builder pattern for constructing certificates |
| DpGuarantee | DP mechanism, epsilon, delta, composition method, total queries |
| QualityMetrics | Benford MAD, correlation preservation, statistical fidelity, MIA AUC |
| sign_certificate() | HMAC-SHA256 signing |
| verify_certificate() | Signature verification |
Privacy-Utility Frontier (privacy/pareto.rs) — v0.5.0
| Component | Description |
|---|---|
| ParetoFrontier | Explore privacy-utility tradeoff space |
| ParetoPoint | Epsilon, utility_score, benford_mad, correlation_score |
| recommend() | Recommend optimal epsilon for target utility |
DSF File Format
The DataSynth Fingerprint (.dsf) file is a ZIP archive containing:
fingerprint.dsf (ZIP)
├── manifest.json # Version, checksums, privacy config
├── schema.yaml # Tables, columns, relationships
├── statistics.yaml # Distributions, percentiles, Benford
├── correlations.yaml # Correlation matrices, copulas
├── integrity.yaml # FK relationships, cardinality
├── rules.yaml # Balance constraints, approval thresholds
├── anomalies.yaml # Anomaly rates, type distribution
└── privacy_audit.json # Privacy decisions, epsilon spent
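Because a .dsf file is a plain ZIP archive, its components can be listed with the zip crate for quick inspection. This is a sketch for exploration only; FingerprintReader (shown below) is the supported way to read fingerprints.
use std::fs::File;
use zip::ZipArchive;

fn list_dsf_entries(path: &str) -> zip::result::ZipResult<()> {
    let mut archive = ZipArchive::new(File::open(path)?)?;
    for i in 0..archive.len() {
        let entry = archive.by_index(i)?;
        println!("{} ({} bytes)", entry.name(), entry.size());
    }
    Ok(())
}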
Usage
Extracting a Fingerprint
#![allow(unused)]
fn main() {
use datasynth_fingerprint::{
extraction::FingerprintExtractor,
privacy::{PrivacyEngine, PrivacyLevel},
io::FingerprintWriter,
};
// Create privacy engine with standard level
let privacy = PrivacyEngine::new(PrivacyLevel::Standard);
// Extract fingerprint from CSV data
let extractor = FingerprintExtractor::new(privacy);
let fingerprint = extractor.extract_from_csv("data.csv")?;
// Write to DSF file
let writer = FingerprintWriter::new();
writer.write(&fingerprint, "fingerprint.dsf")?;
}
Reading a Fingerprint
#![allow(unused)]
fn main() {
use datasynth_fingerprint::io::FingerprintReader;
let reader = FingerprintReader::new();
let fingerprint = reader.read("fingerprint.dsf")?;
println!("Tables: {:?}", fingerprint.schema.tables.len());
println!("Privacy epsilon spent: {}", fingerprint.privacy_audit.epsilon_spent);
}
Validating a Fingerprint
#![allow(unused)]
fn main() {
use datasynth_fingerprint::io::validate_dsf;
match validate_dsf("fingerprint.dsf") {
Ok(report) => println!("Valid: {:?}", report),
Err(e) => eprintln!("Invalid: {}", e),
}
}
Synthesizing Configuration
#![allow(unused)]
fn main() {
use datasynth_fingerprint::synthesis::ConfigSynthesizer;
let synthesizer = ConfigSynthesizer::new();
let config = synthesizer.synthesize(&fingerprint)?;
// Use config with datasynth-generators
}
Evaluating Fidelity
#![allow(unused)]
fn main() {
use datasynth_fingerprint::evaluation::{FidelityEvaluator, FidelityConfig};
let config = FidelityConfig::default();
let evaluator = FidelityEvaluator::new(config);
let report = evaluator.evaluate(&fingerprint, "./synthetic_data/")?;
println!("Overall score: {:.2}", report.overall_score);
println!("Pass: {}", report.passed);
for (metric, score) in &report.component_scores {
println!(" {}: {:.2}", metric, score);
}
}
Federated Fingerprinting
#![allow(unused)]
fn main() {
use datasynth_fingerprint::federated::{
FederatedFingerprintProtocol, FederatedConfig, AggregationMethod,
};
let config = FederatedConfig {
min_sources: 2,
max_epsilon_per_source: 5.0,
aggregation_method: AggregationMethod::WeightedAverage,
};
let protocol = FederatedFingerprintProtocol::new(config);
// Create partial fingerprints from each data source
let partial1 = FederatedFingerprintProtocol::create_partial(
"source_a", vec!["amount".into(), "count".into()], 10000,
vec![5000.0, 3.0], vec![2000.0, 1.5], 1.0,
);
let partial2 = FederatedFingerprintProtocol::create_partial(
"source_b", vec!["amount".into(), "count".into()], 8000,
vec![4500.0, 2.8], vec![1800.0, 1.2], 1.0,
);
// Aggregate without centralizing raw data
let aggregated = protocol.aggregate(&[partial1, partial2])?;
println!("Total epsilon: {}", aggregated.total_epsilon);
}
Synthetic Data Certificates
#![allow(unused)]
fn main() {
use datasynth_fingerprint::certificates::{
CertificateBuilder, DpGuarantee, QualityMetrics,
sign_certificate, verify_certificate,
};
let mut cert = CertificateBuilder::new("DataSynth v0.5.0")
.with_dp_guarantee(DpGuarantee {
mechanism: "Laplace".into(),
epsilon: 1.0,
delta: None,
composition_method: "sequential".into(),
total_queries: 50,
})
.with_quality_metrics(QualityMetrics {
benford_mad: Some(0.008),
correlation_preservation: Some(0.95),
statistical_fidelity: Some(0.92),
mia_auc: Some(0.52),
})
.with_seed(42)
.build();
// Sign and verify
sign_certificate(&mut cert, "my-secret-key");
assert!(verify_certificate(&cert, "my-secret-key"));
}
Fidelity Metrics
| Category | Metrics |
|---|---|
| Statistical | KS statistic, Wasserstein distance, Benford MAD |
| Correlation | Correlation matrix RMSE |
| Schema | Column type match, row count ratio |
| Rules | Balance equation compliance rate |
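Benford MAD, as used in the fidelity metrics above, is the mean absolute deviation between observed and expected first-digit frequencies. A hedged sketch of that computation:
/// Mean absolute deviation of observed first-digit frequencies vs. Benford's Law.
fn benford_mad(amounts: &[f64]) -> f64 {
    let mut counts = [0usize; 9];
    for &a in amounts {
        let mut x = a.abs();
        if x == 0.0 { continue; }
        // Normalize to [1, 10) to read off the first digit.
        while x >= 10.0 { x /= 10.0; }
        while x < 1.0 { x *= 10.0; }
        counts[x as usize - 1] += 1;
    }
    let n: usize = counts.iter().sum();
    (1..=9)
        .map(|d| {
            let expected = (1.0 + 1.0 / d as f64).log10();
            let observed = counts[d - 1] as f64 / n.max(1) as f64;
            (observed - expected).abs()
        })
        .sum::<f64>()
        / 9.0
}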
Privacy Guarantees
The fingerprint extraction process provides the following privacy guarantees:
- Differential Privacy: Numeric statistics are perturbed using Laplace or Gaussian mechanisms with configurable epsilon budget
- K-Anonymity: Categorical values appearing fewer than k times are suppressed
- Winsorization: Outliers are clipped to prevent identification of extreme values
- Audit Trail: All privacy decisions are logged for compliance verification
CLI Commands
# Extract fingerprint
datasynth-data fingerprint extract \
--input ./data.csv \
--output ./fp.dsf \
--privacy-level standard
# Validate
datasynth-data fingerprint validate ./fp.dsf
# Show info
datasynth-data fingerprint info ./fp.dsf --detailed
# Compare
datasynth-data fingerprint diff ./fp1.dsf ./fp2.dsf
# Evaluate fidelity
datasynth-data fingerprint evaluate \
--fingerprint ./fp.dsf \
--synthetic ./synthetic/ \
--threshold 0.8
# Federated fingerprinting
datasynth-data fingerprint federated \
--sources ./source_a.dsf ./source_b.dsf \
--output ./aggregated.dsf \
--method weighted_average
# Generate with certificate
datasynth-data generate --config config.yaml --output ./output --certificate
Dependencies
[dependencies]
datasynth-core = { path = "../datasynth-core" }
datasynth-config = { path = "../datasynth-config" }
serde = { version = "1.0", features = ["derive"] }
serde_yaml = "0.9"
serde_json = "1.0"
zip = "0.6"
sha2 = "0.10"
rand = "0.8"
statrs = "0.16"
See Also
datasynth-standards
The datasynth-standards crate provides comprehensive support for major accounting and auditing standards frameworks including IFRS, US GAAP, ISA, SOX, and PCAOB.
Overview
This crate contains domain models and business logic for:
- Accounting Standards: Revenue recognition, lease accounting, fair value measurement, impairment testing
- Audit Standards: ISA requirements, analytical procedures, confirmations, audit opinions
- Regulatory Frameworks: SOX 302/404 compliance, PCAOB standards
Modules
framework
Core accounting framework selection and settings.
#![allow(unused)]
fn main() {
use datasynth_standards::framework::{AccountingFramework, FrameworkSettings};
// Select framework
let framework = AccountingFramework::UsGaap;
assert!(framework.allows_lifo());
assert!(!framework.allows_impairment_reversal());
// Framework-specific settings
let settings = FrameworkSettings::us_gaap();
assert!(settings.validate().is_ok());
}
accounting
Accounting standards models:
| Module | Standards | Key Types |
|---|---|---|
revenue | ASC 606 / IFRS 15 | CustomerContract, PerformanceObligation, VariableConsideration |
leases | ASC 842 / IFRS 16 | Lease, ROUAsset, LeaseLiability, LeaseAmortizationEntry |
fair_value | ASC 820 / IFRS 13 | FairValueMeasurement, FairValueHierarchyLevel |
impairment | ASC 360 / IAS 36 | ImpairmentTest, RecoverableAmountMethod |
differences | Dual Reporting | FrameworkDifferenceRecord |
audit
Audit standards models:
| Module | Standards | Key Types |
|---|---|---|
isa_reference | ISA 200-720 | IsaStandard, IsaRequirement, IsaProcedureMapping |
analytical | ISA 520 | AnalyticalProcedure, VarianceInvestigation |
confirmation | ISA 505 | ExternalConfirmation, ConfirmationResponse |
opinion | ISA 700/705/706/701 | AuditOpinion, KeyAuditMatter, OpinionModification |
audit_trail | Traceability | AuditTrail, TrailGap |
pcaob | PCAOB AS | PcaobStandard, PcaobIsaMapping |
regulatory
Regulatory compliance models:
| Module | Standards | Key Types |
|---|---|---|
sox | SOX 302/404 | Sox302Certification, Sox404Assessment, DeficiencyMatrix, MaterialWeakness |
Usage Examples
Revenue Recognition
#![allow(unused)]
fn main() {
use datasynth_standards::accounting::revenue::{
CustomerContract, PerformanceObligation, ObligationType, SatisfactionPattern,
};
use datasynth_standards::framework::AccountingFramework;
use rust_decimal_macros::dec;
// Create a customer contract under US GAAP
let mut contract = CustomerContract::new(
"C001".to_string(),
"CUST001".to_string(),
dec!(100000),
AccountingFramework::UsGaap,
);
// Add performance obligations
let po = PerformanceObligation::new(
"PO001".to_string(),
ObligationType::Good,
SatisfactionPattern::PointInTime,
dec!(60000),
);
contract.add_performance_obligation(po);
}
Lease Accounting
#![allow(unused)]
fn main() {
use datasynth_standards::accounting::leases::{Lease, LeaseAssetClass, LeaseClassification};
use datasynth_standards::framework::AccountingFramework;
use chrono::NaiveDate;
use rust_decimal_macros::dec;
// Create a lease
let lease = Lease::new(
"L001".to_string(),
LeaseAssetClass::RealEstate,
NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
60, // 5-year term
dec!(10000), // Monthly payment
0.05, // Discount rate
AccountingFramework::UsGaap,
);
// Classify under US GAAP bright-line tests
let classification = lease.classify_us_gaap(
72, // Asset useful life (months)
dec!(600000), // Fair value
dec!(550000), // Present value of payments
);
}
ISA Standards
#![allow(unused)]
fn main() {
use datasynth_standards::audit::isa_reference::{
IsaStandard, IsaRequirement, IsaRequirementType,
};
// Reference an ISA standard
let standard = IsaStandard::Isa315;
assert_eq!(standard.number(), "315");
assert!(standard.title().contains("Risk"));
// Create a requirement
let requirement = IsaRequirement::new(
IsaStandard::Isa500,
"12".to_string(),
IsaRequirementType::Requirement,
"Design and perform audit procedures".to_string(),
);
}
SOX Compliance
#![allow(unused)]
fn main() {
use datasynth_standards::regulatory::sox::{
Sox404Assessment, DeficiencyMatrix, DeficiencyLikelihood, DeficiencyMagnitude,
};
use uuid::Uuid;
// Create a SOX 404 assessment
let assessment = Sox404Assessment::new(
Uuid::new_v4(),
2024,
true, // ICFR effective
);
// Classify a deficiency
let deficiency = DeficiencyMatrix::new(
DeficiencyLikelihood::Probable,
DeficiencyMagnitude::Material,
);
assert!(deficiency.is_material_weakness());
}
Framework Validation
The crate validates framework-specific rules:
#![allow(unused)]
fn main() {
use datasynth_standards::framework::{AccountingFramework, FrameworkSettings};
// LIFO is not permitted under IFRS
let mut settings = FrameworkSettings::ifrs();
settings.use_lifo_inventory = true;
assert!(settings.validate().is_err());
// PPE revaluation is not permitted under US GAAP
let mut settings = FrameworkSettings::us_gaap();
settings.use_ppe_revaluation = true;
assert!(settings.validate().is_err());
}
Dependencies
[dependencies]
datasynth-standards = "0.2.3"
Feature Flags
Currently, no optional features are defined. All functionality is included by default.
See Also
- Accounting Standards Guide - Detailed usage guide
- Configuration Reference - YAML configuration options
- datasynth-eval - Standards compliance evaluation
datasynth-test-utils
Test utilities and helpers for the SyntheticData workspace.
Overview
datasynth-test-utils provides shared testing infrastructure:
- Test Fixtures: Pre-configured test data and scenarios
- Assertion Helpers: Domain-specific assertions for financial data
- Mock Generators: Simplified generators for unit testing
- Snapshot Testing: Helpers for snapshot-based testing
Fixtures
Journal Entry Fixtures
#![allow(unused)]
fn main() {
use synth_test_utils::fixtures;
// Balanced two-line entry
let entry = fixtures::balanced_journal_entry();
assert!(entry.is_balanced());
// Entry with specific amounts
let entry = fixtures::journal_entry_with_amount(dec!(1000.00));
// Fraudulent entry for testing detection
let entry = fixtures::fraudulent_entry(FraudType::SplitTransaction);
}
Master Data Fixtures
#![allow(unused)]
fn main() {
// Sample vendors
let vendors = fixtures::sample_vendors(10);
// Sample customers
let customers = fixtures::sample_customers(20);
// Chart of accounts
let coa = fixtures::test_chart_of_accounts();
}
Amount Fixtures
#![allow(unused)]
fn main() {
// Benford-compliant amounts
let amounts = fixtures::sample_amounts(1000);
// Round-number biased amounts
let amounts = fixtures::round_amounts(100);
// Fraud-pattern amounts
let amounts = fixtures::suspicious_amounts(50);
}
Configuration Fixtures
#![allow(unused)]
fn main() {
// Minimal valid config
let config = fixtures::test_config();
// Manufacturing preset
let config = fixtures::manufacturing_config();
// Config with specific transaction count
let config = fixtures::config_with_transactions(10000);
}
Assertions
Balance Assertions
#![allow(unused)]
fn main() {
use synth_test_utils::assertions;
#[test]
fn test_entry_is_balanced() {
let entry = create_entry();
assertions::assert_balanced(&entry);
}
#[test]
fn test_trial_balance() {
let tb = generate_trial_balance();
assertions::assert_trial_balance_balanced(&tb);
}
}
Benford’s Law Assertions
#![allow(unused)]
fn main() {
#[test]
fn test_benford_compliance() {
let amounts = generate_amounts(10000);
assertions::assert_benford_compliant(&amounts, 0.05);
}
}
Document Chain Assertions
#![allow(unused)]
fn main() {
#[test]
fn test_p2p_chain() {
let documents = generate_p2p_flow();
assertions::assert_valid_document_chain(&documents);
}
}
Uniqueness Assertions
#![allow(unused)]
fn main() {
#[test]
fn test_no_duplicate_ids() {
let entries = generate_entries(1000);
assertions::assert_unique_document_ids(&entries);
}
}
Mock Generators
Simple Journal Entry Generator
#![allow(unused)]
fn main() {
use synth_test_utils::mocks::MockJeGenerator;
let mut generator = MockJeGenerator::new(42);
// Generate entries without full config
let entries = generator.generate(100);
}
Predictable Amount Generator
#![allow(unused)]
fn main() {
use synth_test_utils::mocks::MockAmountGenerator;
let mut generator = MockAmountGenerator::new();
// Returns predictable sequence
let amount1 = generator.next(); // 100.00
let amount2 = generator.next(); // 200.00
}
Fixed Date Generator
#![allow(unused)]
fn main() {
use synth_test_utils::mocks::MockDateGenerator;
let generator = MockDateGenerator::fixed(
NaiveDate::from_ymd_opt(2024, 1, 15).unwrap()
);
}
Snapshot Testing
#![allow(unused)]
fn main() {
use synth_test_utils::snapshots;
#[test]
fn test_je_serialization() {
let entry = fixtures::balanced_journal_entry();
snapshots::assert_json_snapshot("je_balanced", &entry);
}
#[test]
fn test_csv_output() {
let entries = fixtures::sample_entries(10);
snapshots::assert_csv_snapshot("entries_sample", &entries);
}
}
Test Helpers
Temporary Directories
#![allow(unused)]
fn main() {
use synth_test_utils::temp_dir;
#[test]
fn test_output_writing() {
let dir = temp_dir::create();
// Test writes to temp directory
let path = dir.path().join("test.csv");
write_output(&path).unwrap();
assert!(path.exists());
// Directory cleaned up on drop
}
}
Seed Management
#![allow(unused)]
fn main() {
use synth_test_utils::seeds;
#[test]
fn test_deterministic_generation() {
let seed = seeds::fixed();
let result1 = generate_with_seed(seed);
let result2 = generate_with_seed(seed);
assert_eq!(result1, result2);
}
}
Time Helpers
#![allow(unused)]
fn main() {
use synth_test_utils::time;
#[test]
fn test_with_frozen_time() {
let frozen = time::freeze_at(2024, 1, 15);
let entry = generate_entry_with_current_date();
assert_eq!(entry.posting_date, frozen.date());
}
}
Usage in Other Crates
Add to Cargo.toml:
[dev-dependencies]
datasynth-test-utils = { path = "../datasynth-test-utils" }
Use in tests:
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use synth_test_utils::{fixtures, assertions};
#[test]
fn test_my_function() {
let input = fixtures::test_config();
let result = my_function(&input);
assertions::assert_balanced(&result);
}
}
}
Fixture Data Files
Test data files in fixtures/:
datasynth-test-utils/
└── fixtures/
├── chart_of_accounts.yaml
├── sample_entries.json
├── vendor_master.csv
└── test_config.yaml
See Also
Advanced Topics
Advanced features for specialized use cases.
Overview
| Topic | Description |
|---|---|
| Anomaly Injection | Fraud, errors, and process issues |
| Data Quality Variations | Missing values, typos, duplicates |
| Graph Export | ML-ready graph formats |
| Intercompany Processing | Multi-entity transactions |
| Period Close Engine | Month/quarter/year-end processes |
| Performance Tuning | Optimization strategies |
Feature Matrix
| Feature | Use Case | Output |
|---|---|---|
| Anomaly Injection | ML training | Labels (CSV) |
| Data Quality | Testing robustness | Varied data |
| Graph Export | GNN training | PyG, Neo4j |
| Intercompany | Consolidation testing | IC pairs |
| Period Close | Full cycle testing | Closing entries |
Enabling Advanced Features
In Configuration
# Anomaly injection
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
# Data quality variations
data_quality:
enabled: true
missing_values:
rate: 0.01
# Graph export
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
# Intercompany
intercompany:
enabled: true
# Period close
period_close:
enabled: true
monthly:
accruals: true
depreciation: true
Via CLI
Most advanced features are controlled through configuration. Use init to create a base config, then customize:
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
# Edit config.yaml to enable advanced features
datasynth-data generate --config config.yaml --output ./output
Prerequisites
Some advanced features have dependencies:
| Feature | Requires |
|---|---|
| Intercompany | Multiple companies defined |
| Period Close | period_months ≥ 1 |
| Graph Export | Transactions generated |
| FX | Multiple currencies |
Output Files
Advanced features produce additional outputs:
output/
├── labels/ # Anomaly injection
│ ├── anomaly_labels.csv
│ ├── fraud_labels.csv
│ └── quality_issues.csv
├── graphs/ # Graph export
│ ├── pytorch_geometric/
│ └── neo4j/
├── consolidation/ # Intercompany
│ ├── eliminations.csv
│ └── ic_pairs.csv
└── period_close/ # Period close
├── trial_balances/
├── accruals.csv
└── closing_entries.csv
Performance Impact
| Feature | Impact | Mitigation |
|---|---|---|
| Anomaly Injection | Low | Post-processing |
| Data Quality | Low | Post-processing |
| Graph Export | Medium | Separate phase |
| Intercompany | Medium | Per-transaction |
| Period Close | Low | Per-period |
See Also
Fraud Patterns & ACFE Taxonomy
SyntheticData includes comprehensive fraud pattern modeling aligned with the Association of Certified Fraud Examiners (ACFE) Report to the Nations. This enables generation of realistic fraud scenarios for training machine learning models and testing audit analytics.
ACFE Fraud Taxonomy
The ACFE occupational fraud classification divides fraud into three main categories, each with distinct characteristics:
Asset Misappropriation (86% of cases)
The most common type of fraud, involving theft of organizational assets:
fraud:
enabled: true
acfe_category: asset_misappropriation
schemes:
cash_fraud:
- skimming # Sales not recorded
- larceny # Cash stolen after recording
- shell_company # Fictitious vendors
- ghost_employee # Non-existent employees
- expense_schemes # Personal expenses as business
non_cash_fraud:
- inventory_theft
- fixed_asset_misuse
Corruption (33% of cases)
Schemes involving conflicts of interest and bribery:
fraud:
enabled: true
acfe_category: corruption
schemes:
- purchasing_conflict # Undisclosed vendor ownership
- sales_conflict # Kickbacks from customers
- invoice_kickback # Vendor payment schemes
- bid_rigging # Collusion with vendors
- economic_extortion # Demands for payment
Financial Statement Fraud (10% of cases)
The least common but most costly fraud type:
fraud:
enabled: true
acfe_category: financial_statement
schemes:
overstatement:
- premature_revenue # Revenue before earned
- fictitious_revenues # Fake sales
- concealed_liabilities # Hidden obligations
- improper_asset_values # Overstated assets
understatement:
- understated_revenues # Hidden sales
- overstated_expenses # Inflated costs
ACFE Calibration
Generated fraud data is calibrated to match ACFE statistics:
| Metric | ACFE Value | Configuration |
|---|---|---|
| Median Loss | $117,000 | acfe.median_loss |
| Median Duration | 12 months | acfe.median_duration_months |
| Tip Detection | 42% | detection_method.tip |
| Internal Audit Detection | 16% | detection_method.internal_audit |
| Management Review Detection | 12% | detection_method.management_review |
fraud:
acfe_calibration:
enabled: true
median_loss: 117000
median_duration_months: 12
detection_methods:
tip: 0.42
internal_audit: 0.16
management_review: 0.12
external_audit: 0.04
accident: 0.06
Collusion & Conspiracy Modeling
SyntheticData models multi-party fraud networks with coordinated schemes:
Collusion Ring Types
#![allow(unused)]
fn main() {
pub enum CollusionRingType {
// Internal collusion
EmployeePair, // approver + processor
DepartmentRing, // 3-5 employees
ManagementSubordinate, // manager + subordinate
// Internal-external
EmployeeVendor, // purchasing + vendor contact
EmployeeCustomer, // sales rep + customer
EmployeeContractor, // project manager + contractor
// External rings
VendorRing, // bid rigging (2-4 vendors)
CustomerRing, // return fraud
}
}
Conspirator Roles
Each conspirator in a ring has a specific role:
- Initiator: Conceives scheme, recruits others
- Executor: Performs fraudulent transactions
- Approver: Provides approvals/overrides
- Concealer: Hides evidence, manipulates records
- Lookout: Monitors for detection
- Beneficiary: External recipient of proceeds
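These roles might be represented along the lines of the following enum. This is a sketch based on the list above; the crate's actual type names may differ.
pub enum ConspiratorRole {
    Initiator,   // Conceives the scheme, recruits others
    Executor,    // Performs the fraudulent transactions
    Approver,    // Provides approvals or overrides
    Concealer,   // Hides evidence, manipulates records
    Lookout,     // Monitors for detection
    Beneficiary, // External recipient of the proceeds
}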
Configuration
fraud:
collusion:
enabled: true
ring_types:
- type: employee_vendor
probability: 0.15
min_members: 2
max_members: 4
- type: department_ring
probability: 0.08
min_members: 3
max_members: 5
defection_probability: 0.05
escalation_rate: 0.10
Management Override
Senior-level fraud with override patterns:
fraud:
management_override:
enabled: true
perpetrator_levels:
- senior_manager
- cfo
- ceo
override_types:
revenue:
- journal_entry_override
- revenue_recognition_acceleration
- reserve_manipulation
expense:
- capitalization_abuse
- expense_deferral
pressure_sources:
- financial_targets
- market_expectations
- covenant_compliance
Fraud Triangle
The fraud triangle (Pressure, Opportunity, Rationalization) is modeled:
fraud:
fraud_triangle:
pressure:
source: financial_targets
intensity: high
opportunity:
factors:
- weak_internal_controls
- management_override_capability
- lack_of_oversight
rationalization:
type: temporary_adjustment # "We'll fix it next quarter"
Red Flag Generation
Probabilistic fraud indicators with calibrated Bayesian probabilities:
Red Flag Strengths
| Strength | P(fraud \| flag) | Examples |
|---|---|---|
| Strong | > 0.5 | Matched home address vendor/employee |
| Moderate | 0.2 - 0.5 | Vendor with no physical address |
| Weak | < 0.2 | Round number invoices |
Configuration
fraud:
red_flags:
enabled: true
inject_rate: 0.15 # 15% of transactions get flags
patterns:
strong:
- name: matched_address_vendor_employee
p_flag_given_fraud: 0.90
p_flag_given_no_fraud: 0.001
- name: sequential_check_numbers
p_flag_given_fraud: 0.80
p_flag_given_no_fraud: 0.01
moderate:
- name: approval_just_under_threshold
p_flag_given_fraud: 0.70
p_flag_given_no_fraud: 0.10
weak:
- name: round_number_invoice
p_flag_given_fraud: 0.40
p_flag_given_no_fraud: 0.20
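Given the two likelihoods configured above, the implied posterior follows directly from Bayes' rule. A small sketch of that arithmetic; the fraud base rate is an assumed input for the example, not something the generator prescribes.
/// P(fraud | flag) from the flag likelihoods and a fraud base rate.
fn posterior_fraud_given_flag(
    p_flag_given_fraud: f64,
    p_flag_given_no_fraud: f64,
    base_rate: f64,
) -> f64 {
    let p_flag = p_flag_given_fraud * base_rate + p_flag_given_no_fraud * (1.0 - base_rate);
    p_flag_given_fraud * base_rate / p_flag
}

// Example: matched_address_vendor_employee with an assumed 2% fraud base rate
// ≈ 0.90 * 0.02 / (0.90 * 0.02 + 0.001 * 0.98) ≈ 0.95, i.e. a strong flag.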
Evaluation Benchmarks
ACFE-Calibrated Benchmarks
#![allow(unused)]
fn main() {
// General fraud detection
let bench = acfe_calibrated_1k();
// Collusion-focused benchmark
let bench = acfe_collusion_5k();
// Management override detection
let bench = acfe_management_override_2k();
}
Benchmark Metrics
#![allow(unused)]
fn main() {
pub struct AcfeAlignment {
/// Category distribution MAD vs ACFE
pub category_distribution_mad: f64,
/// Median loss ratio (actual / expected)
pub median_loss_ratio: f64,
/// Duration distribution KS statistic
pub duration_distribution_ks: f64,
/// Detection method chi-squared
pub detection_method_chi_sq: f64,
}
}
Output Files
| File | Description |
|---|---|
collusion_rings.json | Collusion network details with members, roles |
red_flags.csv | Red flag indicators with probabilities |
management_overrides.json | Management override schemes |
fraud_labels.csv | Enhanced fraud labels with ACFE category |
Best Practices
- Start with ACFE calibration: Use default ACFE statistics for realistic distribution
- Enable collusion gradually: Start with simple rings before complex networks
- Use red flags for training: Red flags provide weak supervision signals
- Validate against benchmarks: Use ACFE benchmarks to verify model performance
- Consider detection difficulty: Use detection_difficulty labels for curriculum learning
Industry-Specific Features
SyntheticData includes industry-specific transaction modeling with authentic terminology, master data structures, and anomaly patterns. Three industries have full generator implementations (Manufacturing, Retail, Healthcare), while three additional industries (Technology, Financial Services, Professional Services) are available as configuration presets with industry-appropriate GL structures and anomaly rates.
Overview
Each industry module provides:
- Industry-specific transactions: Authentic transaction types using correct terminology
- Master data structures: Industry-specific entities (BOM, routings, clinical codes, etc.)
- Anomaly patterns: Industry-authentic fraud and error patterns
- GL account structures: Industry-appropriate chart of accounts
- Configuration options: Fine-grained control over industry characteristics
Implementation Status
| Industry | Status | Transaction Types | Master Data | Anomaly Patterns | Benchmarks |
|---|---|---|---|---|---|
| Manufacturing | Full generator | 13 types | BOM, routings, work centers | 5 patterns | Yes |
| Retail | Full generator | 11 types | Stores, POS, loyalty | 6 patterns | Yes |
| Healthcare | Full generator | 9 types | ICD-10, CPT, DRG, payers | 6 patterns | Yes |
| Technology | Config preset | Config-only | — | 3 anomaly rates | Yes |
| Financial Services | Config preset | Config-only | — | 3 anomaly rates | Yes |
| Professional Services | Config preset | Config-only | — | 3 anomaly rates | No |
Full generator industries have dedicated Rust enum types with per-transaction generation logic, dedicated master data structures, and industry-specific anomaly injection. Config preset industries use the standard generator pipeline but apply industry-appropriate GL account structures, transaction distributions, and anomaly rates through configuration.
Manufacturing
Transaction Types
#![allow(unused)]
fn main() {
pub enum ManufacturingTransaction {
// Production
WorkOrderIssuance, // Create production order
MaterialRequisition, // Issue materials to production
LaborBooking, // Record labor hours
OverheadAbsorption, // Apply manufacturing overhead
ScrapReporting, // Report production scrap
ReworkOrder, // Create rework order
ProductionVariance, // Record variances
// Inventory
RawMaterialReceipt, // Receive raw materials
WipTransfer, // Transfer between work centers
FinishedGoodsTransfer, // Move to finished goods
CycleCountAdjustment, // Inventory adjustments
// Costing
StandardCostRevaluation, // Update standard costs
PurchasePriceVariance, // Record PPV
}
}
Master Data
manufacturing:
bom:
depth: 4 # BOM levels (3-7 typical)
yield_rate: 0.97 # Expected yield
scrap_factor: 0.02 # Scrap percentage
routings:
operations_per_product: 5 # Average operations
setup_time_minutes: 30 # Default setup time
work_centers:
count: 20
capacity_hours: 8
efficiency: 0.85
Anomaly Patterns
| Anomaly | Description | Detection Method |
|---|---|---|
| Yield Manipulation | Reported yield differs from actual | Variance analysis |
| Labor Misallocation | Labor charged to wrong order | Cross-reference |
| Phantom Production | Production orders with no output | Data analytics |
| Obsolete Inventory | Aging inventory not written down | Aging analysis |
| Standard Cost Manipulation | Inflated standard costs | Trend analysis |
Configuration
industry_specific:
enabled: true
manufacturing:
enabled: true
bom_depth: 4
just_in_time: false
production_order_types:
- standard
- rework
- prototype
quality_framework: iso_9001
supplier_tiers: 2
standard_cost_frequency: quarterly
target_yield_rate: 0.97
scrap_alert_threshold: 0.03
anomaly_rates:
yield_manipulation: 0.005
labor_misallocation: 0.008
phantom_production: 0.002
obsolete_inventory: 0.01
Retail
Transaction Types
#![allow(unused)]
fn main() {
pub enum RetailTransaction {
// Point of Sale
PosSale, // Register sale
ReturnRefund, // Customer return
VoidTransaction, // Voided sale
EmployeeDiscount, // Staff discount
LoyaltyRedemption, // Points redemption
// Inventory
InventoryReceipt, // Receive from DC
StoreTransfer, // Store-to-store
MarkdownRecording, // Price reductions
ShrinkageAdjustment, // Inventory loss
// Cash Management
CashDrop, // Safe deposit
RegisterReconciliation, // Drawer count
}
}
Store Types
retail:
stores:
types:
- flagship # High-volume, full assortment
- standard # Normal retail store
- express # Small format, convenience
- outlet # Discount/clearance
- warehouse # Bulk/club format
- pop_up # Temporary locations
- digital # E-commerce only
Anomaly Patterns
| Anomaly | Description | Detection Method |
|---|---|---|
| Sweethearting | Not scanning items | Video analytics |
| Skimming | Cash theft from register | Cash variance |
| Refund Fraud | Fake returns | Return pattern |
| Receiving Fraud | Short shipment theft | 3-way match |
| Coupon Fraud | Invalid coupon use | Coupon validation |
| Employee Discount Abuse | Unauthorized discounts | Policy review |
Configuration
industry_specific:
enabled: true
retail:
enabled: true
store_types:
- standard
- express
- outlet
shrinkage_rate: 0.015
return_rate: 0.08
markdown_frequency: weekly
loss_prevention:
camera_coverage: 0.85
eas_enabled: true
pos_anomaly_rates:
sweethearting: 0.002
skimming: 0.001
refund_fraud: 0.003
Healthcare
Transaction Types
#![allow(unused)]
fn main() {
pub enum HealthcareTransaction {
// Revenue Cycle
PatientRegistration, // Register patient
ChargeCapture, // Record charges
ClaimSubmission, // Submit to payer
PaymentPosting, // Record payment
DenialManagement, // Handle denials
// Clinical
ProcedureCoding, // CPT codes
DiagnosisCoding, // ICD-10 codes
SupplyConsumption, // Medical supplies
PharmacyDispensing, // Medications
}
}
Coding Systems
healthcare:
coding:
icd10: true # Diagnosis codes
cpt: true # Procedure codes
drg: true # Diagnosis Related Groups
hcpcs: true # Supplies/equipment
Payer Mix
healthcare:
payer_mix:
medicare: 0.40
medicaid: 0.20
commercial: 0.30
self_pay: 0.10
Compliance Frameworks
healthcare:
compliance:
hipaa: true # Privacy rules
stark_law: true # Physician referrals
anti_kickback: true # AKS compliance
false_claims_act: true
Anomaly Patterns
| Anomaly | Description | Detection Method |
|---|---|---|
| Upcoding | Higher-level code than justified | Code validation |
| Unbundling | Splitting bundled services | Bundle analysis |
| Phantom Billing | Billing for unrendered services | Audit |
| Duplicate Billing | Same service billed twice | Duplicate check |
| Kickbacks | Physician referral payments | Relationship analysis |
| HIPAA Violations | Unauthorized data access | Access logs |
Configuration
industry_specific:
enabled: true
healthcare:
enabled: true
facility_type: hospital # hospital, physician_practice, etc.
payer_mix:
medicare: 0.40
medicaid: 0.20
commercial: 0.30
self_pay: 0.10
coding_system:
icd10: true
cpt: true
drg: true
compliance:
hipaa: true
stark_law: true
anti_kickback: true
avg_daily_encounters: 200
avg_charges_per_encounter: 8
anomaly_rates:
upcoding: 0.02
unbundling: 0.015
phantom_billing: 0.005
duplicate_billing: 0.008
Technology
Transaction Types
- License revenue recognition
- Subscription billing
- Professional services
- R&D capitalization
- Deferred revenue
Configuration
industry_specific:
enabled: true
technology:
enabled: true
revenue_model: subscription # license, subscription, usage
subscription_revenue_percent: 0.70
professional_services_percent: 0.20
license_revenue_percent: 0.10
r_and_d_capitalization_rate: 0.15
deferred_revenue_months: 12
anomaly_rates:
premature_revenue: 0.008
channel_stuffing: 0.003
improper_capitalization: 0.005
Financial Services
Transaction Types
- Loan origination
- Interest accrual
- Fee income
- Trading transactions
- Customer deposits
- Wire transfers
Configuration
industry_specific:
enabled: true
financial_services:
enabled: true
institution_type: commercial_bank
regulatory_framework: us # us, eu, uk
loan_portfolio_size: 1000
avg_loan_amount: 250000
loan_loss_provision_rate: 0.02
fee_income_percent: 0.15
trading_volume_daily: 50000000
anomaly_rates:
loan_fraud: 0.003
trading_fraud: 0.001
account_takeover: 0.002
Professional Services
Transaction Types
- Time and billing
- Engagement management
- Trust account transactions
- Expense reimbursement
- Partner distributions
Configuration
industry_specific:
enabled: true
professional_services:
enabled: true
billing_model: hourly # hourly, fixed_fee, contingency
avg_hourly_rate: 350
utilization_target: 0.75
realization_rate: 0.92
trust_accounting: true
engagement_types:
- audit
- tax
- advisory
- litigation
anomaly_rates:
billing_fraud: 0.004
trust_misappropriation: 0.001
expense_fraud: 0.008
Industry Benchmarks
SyntheticData provides pre-configured ML benchmarks for each industry:
#![allow(unused)]
fn main() {
// Get industry-specific benchmark
let bench = get_industry_benchmark(IndustrySector::Healthcare);
// Available benchmarks
let manufacturing = manufacturing_fraud_5k();
let retail = retail_fraud_10k();
let healthcare = healthcare_fraud_5k();
let technology = technology_fraud_3k();
let financial = financial_services_fraud_5k();
}
Benchmark Features
Each industry benchmark includes:
- Industry-specific transaction features
- Relevant anomaly types
- Appropriate cost matrices
- Industry-specific evaluation metrics
Best Practices
- Match industry to use case: Select the industry that matches your target domain
- Use industry presets first: Start with default settings before customizing
- Enable industry-specific anomalies: These provide realistic fraud patterns
- Consider regulatory context: Enable compliance frameworks relevant to your industry
- Use industry benchmarks: Evaluate models against industry-specific baselines
Output Files
| File | Description |
|---|---|
industry_transactions.csv | Industry-specific transaction log |
industry_master_data.json | Industry-specific entities |
industry_anomalies.csv | Industry-specific anomaly labels |
industry_gl_accounts.csv | Industry-specific chart of accounts |
Anomaly Injection
Generate labeled anomalies for machine learning training.
Overview
Anomaly injection adds realistic irregularities to generated data with full labeling for supervised learning:
- 20+ fraud types
- Error patterns
- Process issues
- Statistical outliers
- Relational anomalies
Configuration
anomaly_injection:
enabled: true
total_rate: 0.02 # 2% anomaly rate
generate_labels: true # Output ML labels
categories: # Category distribution
fraud: 0.25
error: 0.40
process_issue: 0.20
statistical: 0.10
relational: 0.05
temporal_pattern:
year_end_spike: 1.5 # More anomalies at year-end
clustering:
enabled: true
cluster_probability: 0.2 # 20% appear in clusters
Anomaly Categories
Fraud Types
| Type | Description | Detection Difficulty |
|---|---|---|
fictitious_transaction | Fabricated entries | Medium |
revenue_manipulation | Premature recognition | Hard |
expense_capitalization | Improper capitalization | Medium |
split_transaction | Split to avoid threshold | Easy |
round_tripping | Circular transactions | Hard |
kickback_scheme | Vendor kickbacks | Hard |
ghost_employee | Non-existent payee | Medium |
duplicate_payment | Same invoice twice | Easy |
unauthorized_discount | Unapproved discounts | Medium |
suspense_abuse | Hide in suspense | Hard |
fraud:
types:
fictitious_transaction: 0.15
split_transaction: 0.20
duplicate_payment: 0.15
ghost_employee: 0.10
kickback_scheme: 0.10
revenue_manipulation: 0.10
expense_capitalization: 0.10
unauthorized_discount: 0.10
Error Types
| Type | Description |
|---|---|
duplicate_entry | Same entry posted twice |
reversed_amount | Debit/credit swapped |
wrong_period | Posted to wrong period |
wrong_account | Incorrect GL account |
missing_reference | Missing document reference |
incorrect_tax_code | Wrong tax calculation |
misclassification | Wrong account category |
Process Issues
| Type | Description |
|---|---|
late_posting | Posted after cutoff |
skipped_approval | Missing required approval |
threshold_manipulation | Amount just below threshold |
missing_documentation | No supporting document |
out_of_sequence | Documents out of order |
Statistical Anomalies
| Type | Description |
|---|---|
unusual_amount | Significant deviation from mean |
trend_break | Sudden pattern change |
benford_violation | Doesn’t follow Benford’s Law |
outlier_value | Extreme value |
Relational Anomalies
| Type | Description |
|---|---|
circular_transaction | A → B → A flow |
dormant_account_activity | Inactive account used |
unusual_counterparty | Unexpected entity pairing |
Injection Strategies
Amount Manipulation
anomaly_injection:
strategies:
amount:
enabled: true
threshold_adjacent: 0.3 # Just below approval limit
round_number_bias: 0.4 # Suspicious round amounts
Threshold-adjacent: Amounts like $9,999 when limit is $10,000.
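As a quick illustration of how threshold-adjacent amounts surface in analysis, the sketch below flags entries posted within 5% below a hypothetical $10,000 approval limit. The limit and the 5% band are assumptions for illustration, not generator settings; the output path and amount column follow the files documented elsewhere in this guide.
import pandas as pd

APPROVAL_LIMIT = 10_000  # hypothetical approval limit, illustration only

entries = pd.read_csv('output/transactions/journal_entries.csv')
near_threshold = entries[
    (entries['amount'] < APPROVAL_LIMIT)
    & (entries['amount'] >= APPROVAL_LIMIT * 0.95)
]
print(f"{len(near_threshold)} entries posted just below the approval limit")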
Date Manipulation
anomaly_injection:
strategies:
date:
enabled: true
weekend_bias: 0.2 # Weekend postings
after_hours_bias: 0.15 # After business hours
Duplication
anomaly_injection:
strategies:
duplication:
enabled: true
exact_duplicate: 0.5 # Exact copy
near_duplicate: 0.3 # Slight variations
delayed_duplicate: 0.2 # Same entry later
Temporal Patterns
Anomalies can follow realistic patterns:
anomaly_injection:
temporal_pattern:
month_end_spike: 1.2 # 20% more at month-end
quarter_end_spike: 1.5 # 50% more at quarter-end
year_end_spike: 2.0 # Double at year-end
seasonality: true # Follow industry patterns
Entity Targeting
Control which entities receive anomalies:
anomaly_injection:
entity_targeting:
strategy: weighted # random, repeat_offender, weighted
repeat_offender:
enabled: true
rate: 0.4 # 40% from same users
high_volume_bias: 0.3 # Target high-volume entities
Clustering
Real anomalies often cluster:
anomaly_injection:
clustering:
enabled: true
cluster_probability: 0.2 # 20% in clusters
cluster_size:
min: 3
max: 10
cluster_timespan_days: 30 # Within 30-day window
Output Labels
anomaly_labels.csv
| Field | Description |
|---|---|
document_id | Entry reference |
anomaly_id | Unique anomaly ID |
anomaly_type | Specific type |
anomaly_category | Fraud, Error, etc. |
severity | Low, Medium, High |
detection_difficulty | Easy, Medium, Hard |
description | Human-readable description |
fraud_labels.csv
Subset with fraud-specific fields:
| Field | Description |
|---|---|
document_id | Entry reference |
fraud_type | Specific fraud pattern |
perpetrator_id | Employee ID |
scheme_id | Related anomaly group |
amount_manipulated | Fraud amount |
ML Integration
Loading Labels
import pandas as pd
labels = pd.read_csv('output/labels/anomaly_labels.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Merge
data = entries.merge(labels, on='document_id', how='left')
data['is_anomaly'] = data['anomaly_id'].notna()
Feature Engineering
# Create features
features = [
'amount', 'line_count', 'is_round_number',
'is_weekend', 'is_month_end', 'hour_of_day'
]
X = data[features]
y = data['is_anomaly']
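If the raw export does not already carry these flags, they can be derived from the merged frame above. A minimal pandas sketch follows; it assumes a posting_date timestamp column, which is an assumption about your export rather than a documented schema guarantee.
import pandas as pd

# Assumes 'posting_date' exists in the merged frame; hour_of_day is only
# meaningful if the timestamp carries a time component.
data['posting_date'] = pd.to_datetime(data['posting_date'])
data['is_round_number'] = (data['amount'] % 100 == 0).astype(int)
data['is_weekend'] = (data['posting_date'].dt.dayofweek >= 5).astype(int)
data['is_month_end'] = (
    data['posting_date'].dt.days_in_month - data['posting_date'].dt.day < 3
).astype(int)
data['hour_of_day'] = data['posting_date'].dt.hour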
Train/Test Split
Labels include suggested splits:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
stratify=y, # Maintain anomaly ratio
random_state=42
)
Example Configuration
Fraud Detection Training
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
categories:
fraud: 1.0 # Only fraud for focused training
clustering:
enabled: true
cluster_probability: 0.3
fraud:
enabled: true
fraud_rate: 0.02
types:
split_transaction: 0.25
duplicate_payment: 0.25
kickback_scheme: 0.20
ghost_employee: 0.15
fictitious_transaction: 0.15
General Anomaly Detection
anomaly_injection:
enabled: true
total_rate: 0.05
generate_labels: true
categories:
fraud: 0.15
error: 0.45
process_issue: 0.25
statistical: 0.10
relational: 0.05
See Also
Data Quality Variations
Generate realistic data quality issues for testing robustness.
Overview
Real-world data has imperfections. The data quality module introduces:
- Missing values (various patterns)
- Format variations
- Duplicates
- Typos and transcription errors
- Encoding issues
Configuration
data_quality:
enabled: true
missing_values:
rate: 0.01
pattern: mcar
format_variations:
date_formats: true
amount_formats: true
duplicates:
rate: 0.001
types: [exact, near, fuzzy]
typos:
rate: 0.005
keyboard_aware: true
Missing Values
Patterns
| Pattern | Description |
|---|---|
mcar | Missing Completely At Random |
mar | Missing At Random (conditional) |
mnar | Missing Not At Random (value-dependent) |
systematic | Entire field groups missing |
data_quality:
missing_values:
rate: 0.01 # 1% missing overall
pattern: mcar
# Pattern-specific settings
mcar:
uniform: true # Equal probability all fields
mar:
conditions:
- field: vendor_name
dependent_on: is_intercompany
probability: 0.1
mnar:
conditions:
- field: amount
when_above: 100000 # Large amounts more likely missing
probability: 0.05
systematic:
groups:
- [address, city, country] # All or none
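The patterns above differ only in how the drop probability is chosen. A minimal sketch contrasting MCAR and MNAR masking on a toy frame, purely illustrative and not the generator's internal implementation:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'description': ['Office supplies'] * 1000,
    'amount': rng.lognormal(mean=8, sigma=1.5, size=1000),
})

# MCAR: every value has the same 1% chance of being dropped
df.loc[rng.random(len(df)) < 0.01, 'description'] = None

# MNAR: large amounts are more likely to lose their value
mnar = (df['amount'] > 100_000) & (rng.random(len(df)) < 0.05)
df.loc[mnar, 'amount'] = np.nan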
Field Targeting
data_quality:
missing_values:
fields:
description: 0.02 # 2% missing
cost_center: 0.05 # 5% missing
tax_code: 0.03 # 3% missing
exclude:
- document_id # Never make missing
- posting_date
- account_number
Format Variations
Date Formats
data_quality:
format_variations:
date_formats: true
date_variations:
iso: 0.6 # 2024-01-15
us: 0.2 # 01/15/2024
eu: 0.1 # 15.01.2024
long: 0.1 # January 15, 2024
Examples:
- ISO: 2024-01-15
- US: 01/15/2024, 1/15/2024
- EU: 15.01.2024, 15/01/2024
- Long: January 15, 2024
Amount Formats
data_quality:
format_variations:
amount_formats: true
amount_variations:
plain: 0.5 # 1234.56
us_comma: 0.3 # 1,234.56
eu_format: 0.1 # 1.234,56
currency_prefix: 0.05 # $1,234.56
currency_suffix: 0.05 # 1.234,56 EUR
Identifier Formats
data_quality:
format_variations:
identifier_variations:
case: 0.1 # INV-001 vs inv-001
padding: 0.1 # 001 vs 1
separator: 0.1 # INV-001 vs INV_001 vs INV001
Duplicates
Duplicate Types
| Type | Description |
|---|---|
exact | Identical records |
near | Minor field differences |
fuzzy | Multiple field variations |
data_quality:
duplicates:
rate: 0.001 # 0.1% duplicates
types:
exact: 0.4 # 40% exact duplicates
near: 0.4 # 40% near duplicates
fuzzy: 0.2 # 20% fuzzy duplicates
Near Duplicate Variations
data_quality:
duplicates:
near:
fields_to_vary: 1 # Change 1 field
variations:
- field: posting_date
offset_days: [-1, 0, 1]
- field: amount
variance: 0.001 # 0.1% difference
Fuzzy Duplicate Variations
data_quality:
duplicates:
fuzzy:
fields_to_vary: 3 # Change multiple fields
include_typos: true
Typos
Typo Types
| Type | Description |
|---|---|
| Substitution | Adjacent key pressed |
| Transposition | Characters swapped |
| Insertion | Extra character |
| Deletion | Missing character |
| OCR errors | Scan-related (0/O, 1/l) |
| Homophones | Sound-alike substitution |
data_quality:
typos:
rate: 0.005 # 0.5% of string fields
keyboard_aware: true # Use QWERTY layout
types:
substitution: 0.35 # Adjacnet → Adjacent
transposition: 0.25 # Recieve → Receive
insertion: 0.15 # Shippping → Shipping
deletion: 0.15 # Shiping → Shipping
ocr_errors: 0.05 # O → 0, l → 1
homophones: 0.05 # their → there
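Keyboard-aware substitution replaces a character with a neighbouring key rather than a random one. A minimal sketch using a tiny illustrative QWERTY subset (this adjacency map is an assumption for demonstration, not the generator's table):
import random

ADJACENT = {  # tiny illustrative QWERTY subset
    'a': 'qwsz', 'e': 'wrsd', 'i': 'uojk', 'n': 'bhjm',
    'o': 'iklp', 'r': 'edft', 's': 'awedxz', 't': 'rfgy',
}

def keyboard_substitution(word: str, rng: random.Random) -> str:
    # Replace one character with an adjacent key, if its neighbours are known
    candidates = [i for i, c in enumerate(word.lower()) if c in ADJACENT]
    if not candidates:
        return word
    i = rng.choice(candidates)
    return word[:i] + rng.choice(ADJACENT[word[i].lower()]) + word[i + 1:]

print(keyboard_substitution('Shipping', random.Random(7)))  # e.g. 'Shippjng' or 'Shippimg'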
Field Targeting
data_quality:
typos:
fields:
description: 0.02 # More likely in descriptions
vendor_name: 0.01
customer_name: 0.01
exclude:
- account_number # Never introduce typos
- document_id
Encoding Issues
data_quality:
encoding:
enabled: true
rate: 0.001
issues:
mojibake: 0.4 # UTF-8/Latin-1 confusion
missing_chars: 0.3 # Characters dropped
bom_issues: 0.2 # BOM artifacts
html_entities: 0.1 # &amp; instead of &
Examples:
- Mojibake: Müller → MÃ¼ller
- Missing: Zürich → Zrich
- HTML: R&D → R&amp;D
ML Training Labels
The data quality module generates labels for ML model training:
QualityIssueLabel
#![allow(unused)]
fn main() {
pub struct QualityIssueLabel {
pub issue_id: String,
pub issue_type: LabeledIssueType,
pub issue_subtype: Option<QualityIssueSubtype>,
pub document_id: String,
pub field_name: String,
pub original_value: Option<String>,
pub modified_value: Option<String>,
pub severity: u8, // 1-5
pub processor: String,
pub metadata: HashMap<String, String>,
}
}
Issue Types
| Type | Severity | Description |
|---|---|---|
MissingValue | 3 | Field is null/empty |
Typo | 2 | Character-level errors |
FormatVariation | 1 | Different formatting |
Duplicate | 4 | Duplicate record |
EncodingIssue | 3 | Character encoding problems |
Inconsistency | 3 | Cross-field inconsistency |
OutOfRange | 4 | Value outside expected range |
InvalidReference | 5 | Reference to non-existent entity |
Subtypes
Each issue type has detailed subtypes:
- Typo: Substitution, Transposition, Insertion, Deletion, DoubleChar, CaseError, OcrError, Homophone
- FormatVariation: DateFormat, AmountFormat, IdentifierFormat, TextFormat
- Duplicate: ExactDuplicate, NearDuplicate, FuzzyDuplicate, CrossSystemDuplicate
- EncodingIssue: Mojibake, MissingChars, Bom, ControlChars, HtmlEntities
Output
quality_issues.csv
| Field | Description |
|---|---|
document_id | Affected record |
field_name | Field with issue |
issue_type | missing, typo, duplicate, etc. |
original_value | Value before modification |
modified_value | Value after modification |
quality_labels.csv (ML Training)
| Field | Description |
|---|---|
issue_id | Unique issue identifier |
issue_type | LabeledIssueType enum |
issue_subtype | Detailed subtype |
document_id | Affected document |
field_name | Affected field |
original_value | Original value |
modified_value | Modified value |
severity | 1-5 severity score |
processor | Which processor injected |
Example Configurations
Testing Data Pipelines
data_quality:
enabled: true
missing_values:
rate: 0.02
pattern: mcar
format_variations:
date_formats: true
amount_formats: true
typos:
rate: 0.01
keyboard_aware: true
Testing Deduplication
data_quality:
enabled: true
duplicates:
rate: 0.05 # High duplicate rate
types:
exact: 0.3
near: 0.4
fuzzy: 0.3
Testing OCR Processing
data_quality:
enabled: true
typos:
rate: 0.03
types:
ocr_errors: 0.8 # Mostly OCR-style errors
substitution: 0.2
See Also
Graph Export
Export transaction data as ML-ready graphs.
Overview
Graph export transforms financial data into network representations:
- Accounting Network (GL accounts as nodes, transactions as edges) - New in v0.2.1
- Transaction networks (accounts and entities)
- Approval networks (users and approvals)
- Entity relationship graphs (ownership)
Accounting Network Graph Export
The accounting network represents money flows between GL accounts, designed for network reconstruction and anomaly detection algorithms.
Quick Start
# Generate with graph export enabled
datasynth-data generate --config config.yaml --output ./output --graph-export
Graph Structure
| Element | Description |
|---|---|
| Nodes | GL Accounts from Chart of Accounts |
| Edges | Money flows FROM credit accounts TO debit accounts |
| Direction | Directed graph (source→target) |
┌──────────────┐
│ Credit Acct │
│ (2000) │
└──────┬───────┘
│ $1,000
▼
┌──────────────┐
│ Debit Acct │
│ (1100) │
└──────────────┘
Edge Features (8 dimensions)
| Feature | Index | Description |
|---|---|---|
log_amount | F0 | log10(transaction amount) |
benford_prob | F1 | Expected first-digit probability |
weekday | F2 | Day of week (normalized 0-1) |
period | F3 | Fiscal period (normalized 0-1) |
is_month_end | F4 | Last 3 days of month |
is_year_end | F5 | Last month of year |
is_anomaly | F6 | Anomaly flag (0 or 1) |
business_process | F7 | Encoded business process |
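For reference, the first two edge features can be recomputed directly from a transaction amount. A minimal sketch of the underlying formulas, assuming positive amounts:
import math

def log_amount(amount: float) -> float:
    # F0: base-10 logarithm of the transaction amount
    return math.log10(amount)

def benford_prob(amount: float) -> float:
    # F1: expected first-digit probability under Benford's Law, log10(1 + 1/d)
    first_digit = int(f"{abs(amount):e}"[0])
    return math.log10(1 + 1 / first_digit)

print(log_amount(1000.0))     # 3.0
print(benford_prob(1234.56))  # ~0.301 (first digit 1)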
Output Files
output/graphs/accounting_network/pytorch_geometric/
├── edge_index.npy # [2, E] source→target node indices
├── node_features.npy # [N, 4] node feature vectors
├── edge_features.npy # [E, 8] edge feature vectors
├── edge_labels.npy # [E] anomaly labels (0=normal, 1=anomaly)
├── node_labels.npy # [N] node labels
├── train_mask.npy # [N] boolean training mask
├── val_mask.npy # [N] boolean validation mask
├── test_mask.npy # [N] boolean test mask
├── metadata.json # Graph statistics and configuration
└── load_graph.py # Auto-generated Python loader script
Loading in Python
import numpy as np
import json
# Load metadata
with open('metadata.json') as f:
meta = json.load(f)
print(f"Nodes: {meta['num_nodes']}, Edges: {meta['num_edges']}")
# Load arrays
edge_index = np.load('edge_index.npy') # [2, E]
node_features = np.load('node_features.npy') # [N, F]
edge_features = np.load('edge_features.npy') # [E, 8]
edge_labels = np.load('edge_labels.npy') # [E]
# For PyTorch Geometric
import torch
from torch_geometric.data import Data
data = Data(
x=torch.from_numpy(node_features).float(),
edge_index=torch.from_numpy(edge_index).long(),
edge_attr=torch.from_numpy(edge_features).float(),
y=torch.from_numpy(edge_labels).long(),
)
Configuration
graph_export:
enabled: true
formats:
- pytorch_geometric
train_ratio: 0.7
validation_ratio: 0.15
# test_ratio is automatically 1 - train - val = 0.15
Use Cases
- Anomaly Detection: Train GNNs to detect anomalous transaction patterns
- Network Reconstruction: Validate accounting network recovery algorithms
- Fraud Detection: Identify suspicious money flow patterns
- Link Prediction: Predict likely transaction relationships
Configuration
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
- dgl
graphs:
- transaction_network
- approval_network
- entity_relationship
split:
train: 0.7
val: 0.15
test: 0.15
stratify: is_anomaly
features:
temporal: true
amount: true
structural: true
categorical: true
Graph Types
Transaction Network
Accounts and entities as nodes, transactions as edges.
┌──────────┐
│ Account │
│ 1100 │
└────┬─────┘
│ $1000
▼
┌──────────┐
│ Customer │
│ C-001 │
└──────────┘
Nodes:
- GL accounts
- Vendors
- Customers
- Cost centers
Edges:
- Journal entry lines
- Payments
- Invoices
Approval Network
Users as nodes, approval relationships as edges.
┌──────────┐
│ Clerk │
│ U-001 │
└────┬─────┘
│ approved
▼
┌──────────┐
│ Manager │
│ U-002 │
└──────────┘
Nodes: Employees/users
Edges: Approval actions
Entity Relationship Network
Legal entities with ownership relationships.
┌──────────┐
│ Parent │
│ 1000 │
└────┬─────┘
│ 100%
▼
┌──────────┐
│ Sub │
│ 2000 │
└──────────┘
Nodes: Companies
Edges: Ownership, IC transactions
Export Formats
PyTorch Geometric
output/graphs/transaction_network/pytorch_geometric/
├── node_features.pt # [num_nodes, num_features]
├── edge_index.pt # [2, num_edges]
├── edge_attr.pt # [num_edges, num_edge_features]
├── labels.pt # Labels
├── train_mask.pt # Boolean training mask
├── val_mask.pt # Boolean validation mask
└── test_mask.pt # Boolean test mask
Loading in Python:
import torch
from torch_geometric.data import Data
# Load tensors
node_features = torch.load('node_features.pt')
edge_index = torch.load('edge_index.pt')
edge_attr = torch.load('edge_attr.pt')
labels = torch.load('labels.pt')
train_mask = torch.load('train_mask.pt')
# Create PyG Data object
data = Data(
x=node_features,
edge_index=edge_index,
edge_attr=edge_attr,
y=labels,
train_mask=train_mask,
)
print(f"Nodes: {data.num_nodes}")
print(f"Edges: {data.num_edges}")
Neo4j
output/graphs/transaction_network/neo4j/
├── nodes_account.csv
├── nodes_vendor.csv
├── nodes_customer.csv
├── edges_transaction.csv
├── edges_payment.csv
└── import.cypher
Import script (import.cypher):
// Load accounts
LOAD CSV WITH HEADERS FROM 'file:///nodes_account.csv' AS row
CREATE (:Account {
id: row.id,
name: row.name,
type: row.type
});
// Load transactions
LOAD CSV WITH HEADERS FROM 'file:///edges_transaction.csv' AS row
MATCH (from:Account {id: row.from_id})
MATCH (to:Account {id: row.to_id})
CREATE (from)-[:TRANSACTION {
amount: toFloat(row.amount),
date: date(row.posting_date),
is_anomaly: toBoolean(row.is_anomaly)
}]->(to);
DGL (Deep Graph Library)
output/graphs/transaction_network/dgl/
├── graph.bin # Serialized DGL graph
├── node_feats.npy # Node features
├── edge_feats.npy # Edge features
└── labels.npy # Labels
Loading in Python:
import dgl
import numpy as np
import torch
# Load graph
graph = dgl.load_graphs('graph.bin')[0][0]
# Load features
graph.ndata['feat'] = torch.tensor(np.load('node_feats.npy'))
graph.edata['feat'] = torch.tensor(np.load('edge_feats.npy'))
graph.ndata['label'] = torch.tensor(np.load('labels.npy'))
Features
Temporal Features
features:
temporal: true
| Feature | Description |
|---|---|
weekday | Day of week (0-6) |
period | Fiscal period (1-12) |
is_month_end | Last 3 days of month |
is_quarter_end | Last week of quarter |
is_year_end | Last month of year |
hour | Hour of posting |
Amount Features
features:
amount: true
| Feature | Description |
|---|---|
log_amount | log10(amount) |
benford_prob | Expected first-digit probability |
is_round_number | Ends in 00, 000, etc. |
amount_zscore | Standard deviations from mean |
Structural Features
features:
structural: true
| Feature | Description |
|---|---|
line_count | Number of JE lines |
unique_accounts | Distinct accounts used |
has_intercompany | IC transaction flag |
debit_credit_ratio | Total debits / credits |
Categorical Features
features:
categorical: true
One-hot encoded:
- business_process: Manual, P2P, O2C, etc.
- source_type: System, User, Recurring
- account_type: Asset, Liability, etc.
Train/Val/Test Splits
split:
train: 0.7 # 70% training
val: 0.15 # 15% validation
test: 0.15 # 15% test
stratify: is_anomaly # Maintain anomaly ratio
random_seed: 42 # Reproducible splits
Stratification options:
- is_anomaly: Balanced anomaly detection
- is_fraud: Balanced fraud detection
- account_type: Balanced by account type
- null: Random (no stratification)
GNN Training Example
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
class AnomalyGNN(torch.nn.Module):
def __init__(self, num_features, hidden_dim):
super().__init__()
self.conv1 = GCNConv(num_features, hidden_dim)
self.conv2 = GCNConv(hidden_dim, 2) # Binary classification
def forward(self, data):
x, edge_index = data.x, data.edge_index
x = self.conv1(x, edge_index).relu()
x = self.conv2(x, edge_index)
return x
# Train
model = AnomalyGNN(data.num_features, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(100):
model.train()
optimizer.zero_grad()
out = model(data)
loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
loss.backward()
optimizer.step()
Multi-Layer Hypergraph (v0.6.2)
The RustGraph Hypergraph exporter now supports all 8 enterprise process families with 24 entity type codes:
Entity Type Codes
| Range | Family | Types |
|---|---|---|
| 100 | Accounting | GL Accounts |
| 300-303 | P2P | PurchaseOrder, GoodsReceipt, VendorInvoice, Payment |
| 310-312 | O2C | SalesOrder, Delivery, CustomerInvoice |
| 320-325 | S2C | SourcingProject, RfxEvent, SupplierBid, BidEvaluation, ProcurementContract, SupplierQualification |
| 330-333 | H2R | PayrollRun, TimeEntry, ExpenseReport, PayrollLineItem |
| 340-343 | MFG | ProductionOrder, RoutingOperation, QualityInspection, CycleCount |
| 350-352 | BANK | BankingCustomer, BankAccount, BankTransaction |
| 360-365 | AUDIT | AuditEngagement, Workpaper, AuditFinding, AuditEvidence, RiskAssessment, ProfessionalJudgment |
| 370-372 | Bank Recon | BankReconciliation, BankStatementLine, ReconcilingItem |
| 400 | OCPM | OcpmEvent (events as hyperedges) |
OCPM Events as Hyperedges
When events_as_hyperedges: true, each OCPM event becomes a hyperedge connecting all its participating objects. This enables cross-process analysis via the hypergraph structure.
Per-Family Toggles
graph_export:
hypergraph:
enabled: true
process_layer:
include_p2p: true
include_o2c: true
include_s2c: true
include_h2r: true
include_mfg: true
include_bank: true
include_audit: true
include_r2r: true
events_as_hyperedges: true
See Also
Intercompany Processing
Generate matched intercompany transactions and elimination entries.
Overview
Intercompany features simulate multi-entity corporate structures:
- IC transaction pairs (seller/buyer)
- Transfer pricing
- IC reconciliation
- Consolidation eliminations
Prerequisites
Multiple companies must be defined:
companies:
- code: "1000"
name: "Parent Company"
is_parent: true
volume_weight: 0.5
- code: "2000"
name: "Subsidiary"
parent_code: "1000"
volume_weight: 0.5
Configuration
intercompany:
enabled: true
transaction_types:
goods_sale: 0.4
service_provided: 0.2
loan: 0.15
dividend: 0.1
management_fee: 0.1
royalty: 0.05
transfer_pricing:
method: cost_plus
markup_range:
min: 0.03
max: 0.10
elimination:
enabled: true
timing: quarterly
IC Transaction Types
Goods Sale
Internal sale of inventory between entities.
Seller (1000):
Dr Intercompany Receivable 1,100
Cr IC Revenue 1,100
Dr IC COGS 800
Cr Inventory 800
Buyer (2000):
Dr Inventory 1,100
Cr Intercompany Payable 1,100
Service Provided
Internal services (IT, HR, legal).
Provider (1000):
Dr IC Receivable 500
Cr IC Service Revenue 500
Receiver (2000):
Dr Service Expense 500
Cr IC Payable 500
Loan
Intercompany financing.
Lender (1000):
Dr IC Loan Receivable 10,000
Cr Cash 10,000
Borrower (2000):
Dr Cash 10,000
Cr IC Loan Payable 10,000
Dividend
Upstream dividend payment.
Subsidiary (2000):
Dr Retained Earnings 5,000
Cr Cash 5,000
Parent (1000):
Dr Cash 5,000
Cr Dividend Income 5,000
Management Fee
Corporate overhead allocation.
Parent (1000):
Dr IC Receivable 1,000
Cr Mgmt Fee Revenue 1,000
Subsidiary (2000):
Dr Mgmt Fee Expense 1,000
Cr IC Payable 1,000
Royalty
IP licensing fees.
Licensor (1000):
Dr IC Receivable 750
Cr Royalty Revenue 750
Licensee (2000):
Dr Royalty Expense 750
Cr IC Payable 750
Transfer Pricing
Methods
| Method | Description |
|---|---|
cost_plus | Cost + markup percentage |
resale_minus | Resale price - margin |
comparable_uncontrolled | Market price |
transfer_pricing:
method: cost_plus
markup_range:
min: 0.03 # 3% minimum markup
max: 0.10 # 10% maximum markup
# OR for resale minus
method: resale_minus
margin_range:
min: 0.15
max: 0.25
Arm’s Length Pricing
Transfer prices are generated to fall within defensible arm's-length ranges:
#![allow(unused)]
fn main() {
fn calculate_transfer_price(cost: Decimal, method: &TransferPricingMethod) -> Decimal {
match method {
TransferPricingMethod::CostPlus { markup } => {
cost * (Decimal::ONE + markup)
}
TransferPricingMethod::ResaleMinus { margin, resale_price } => {
resale_price * (Decimal::ONE - margin)
}
TransferPricingMethod::Comparable { market_price } => {
market_price
}
}
}
}
IC Matching
Matched Pair Structure
#![allow(unused)]
fn main() {
pub struct ICMatchedPair {
pub pair_id: String,
pub seller_company: String,
pub buyer_company: String,
pub seller_entry_id: Uuid,
pub buyer_entry_id: Uuid,
pub transaction_type: ICTransactionType,
pub amount: Decimal,
pub currency: String,
pub transaction_date: NaiveDate,
}
}
Match Validation
intercompany:
matching:
enabled: true
tolerance: 0.01 # 1% variance allowed
mismatch_rate: 0.02 # 2% intentional mismatches
Match statuses:
- matched: Amounts reconcile
- timing_difference: Different posting dates
- fx_difference: Currency conversion variance
- unmatched: No matching entry
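The tolerance check itself is simple. A minimal sketch classifying a pair by relative variance; it is illustrative only and ignores the timing and FX statuses listed above:
from decimal import Decimal

def ic_match_status(seller_amount: Decimal, buyer_amount: Decimal,
                    tolerance: Decimal = Decimal('0.01')) -> str:
    # Amounts reconcile when the relative difference is within tolerance (1% default)
    if seller_amount == buyer_amount:
        return 'matched'
    diff = abs(seller_amount - buyer_amount) / max(abs(seller_amount), abs(buyer_amount))
    return 'matched' if diff <= tolerance else 'unmatched'

print(ic_match_status(Decimal('1100.00'), Decimal('1100.00')))  # matched
print(ic_match_status(Decimal('1100.00'), Decimal('1050.00')))  # unmatched (~4.5% variance)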
Eliminations
Timing
intercompany:
elimination:
timing: quarterly # monthly, quarterly, annual
Elimination Types
Revenue/Expense Elimination:
Elimination entry:
Dr IC Revenue (1000) 1,100
Cr IC Expense (2000) 1,100
Unrealized Profit Elimination:
If buyer still holds inventory:
Dr IC Revenue 300
Cr Inventory 300
Receivable/Payable Elimination:
Dr IC Payable (2000) 10,000
Cr IC Receivable (1000) 10,000
Output Files
ic_pairs.csv
| Field | Description |
|---|---|
pair_id | Unique pair identifier |
seller_company | Selling entity |
buyer_company | Buying entity |
seller_entry_id | Seller’s JE document ID |
buyer_entry_id | Buyer’s JE document ID |
transaction_type | Type of IC transaction |
amount | Transaction amount |
match_status | Match result |
eliminations.csv
| Field | Description |
|---|---|
elimination_id | Unique ID |
ic_pair_id | Reference to IC pair |
elimination_type | Revenue, profit, balance |
debit_company | Company debited |
credit_company | Company credited |
amount | Elimination amount |
period | Fiscal period |
Example Configuration
Multi-National Structure
companies:
- code: "1000"
name: "US Headquarters"
currency: USD
country: US
is_parent: true
volume_weight: 0.4
- code: "2000"
name: "European Hub"
currency: EUR
country: DE
parent_code: "1000"
volume_weight: 0.3
- code: "3000"
name: "Asia Pacific"
currency: JPY
country: JP
parent_code: "1000"
volume_weight: 0.3
intercompany:
enabled: true
transaction_types:
goods_sale: 0.5
service_provided: 0.2
management_fee: 0.15
royalty: 0.15
transfer_pricing:
method: cost_plus
markup_range:
min: 0.05
max: 0.12
elimination:
enabled: true
timing: quarterly
See Also
Interconnectivity and Relationship Modeling
SyntheticData provides comprehensive relationship modeling capabilities for generating realistic enterprise networks with multi-tier vendor relationships, customer segmentation, relationship strength calculations, and cross-process linkages.
Overview
Real enterprise data exhibits complex interconnections between entities:
- Vendors form multi-tier supply chains (supplier-of-supplier)
- Customers segment by value (Enterprise vs. SMB) with different behaviors
- Relationships vary in strength based on transaction history
- Business processes connect (P2P and O2C link through inventory)
SyntheticData models all of these patterns to produce realistic, interconnected data.
Multi-Tier Vendor Networks
Supply Chain Tiers
Vendors are organized into a supply chain hierarchy:
| Tier | Description | Visibility | Typical Count |
|---|---|---|---|
| Tier 1 | Direct suppliers | Full financial visibility | 50-100 per company |
| Tier 2 | Supplier’s suppliers | Partial visibility | 4-10 per Tier 1 |
| Tier 3 | Deep supply chain | Minimal visibility | 2-5 per Tier 2 |
Vendor Clusters
Vendors are classified into behavioral clusters:
| Cluster | Share | Characteristics |
|---|---|---|
| ReliableStrategic | 20% | High delivery scores, low invoice errors, consistent quality |
| StandardOperational | 50% | Average performance, predictable patterns |
| Transactional | 25% | One-off or occasional purchases |
| Problematic | 5% | Quality issues, late deliveries, invoice discrepancies |
Vendor Lifecycle Stages
Onboarding → RampUp → SteadyState → Decline → Terminated
Each stage has associated behaviors:
- Onboarding: Initial qualification, small orders
- RampUp: Increasing order volumes
- SteadyState: Stable, predictable patterns
- Decline: Reduced orders, performance issues
- Terminated: Relationship ended
Vendor Quality Scores
| Metric | Range | Description |
|---|---|---|
delivery_on_time | 0.0-1.0 | Percentage of on-time deliveries |
quality_pass_rate | 0.0-1.0 | Quality inspection pass rate |
invoice_accuracy | 0.0-1.0 | Invoice matching accuracy |
responsiveness_score | 0.0-1.0 | Communication responsiveness |
Vendor Concentration Analysis
SyntheticData tracks vendor concentration risks:
dependencies:
max_single_vendor_concentration: 0.15 # No vendor > 15% of spend
top_5_concentration: 0.45 # Top 5 vendors < 45% of spend
single_source_percent: 0.05 # 5% of materials single-sourced
Customer Value Segmentation
Value Segments
Customers follow a Pareto-like distribution:
| Segment | Revenue Share | Customer Share | Typical Order Value |
|---|---|---|---|
| Enterprise | 40% | 5% | $50,000+ |
| MidMarket | 35% | 20% | $5,000-$50,000 |
| SMB | 20% | 50% | $500-$5,000 |
| Consumer | 5% | 25% | $50-$500 |
Customer Lifecycle
Prospect → New → Growth → Mature → AtRisk → Churned
↓
WonBack
Each stage has associated behaviors:
- Prospect: Potential customer, conversion probability
- New: First purchase within 90 days
- Growth: Increasing order frequency/value
- Mature: Stable, loyal customer
- AtRisk: Declining activity, churn signals
- Churned: No activity for extended period
- WonBack: Previously churned, now returned
Customer Engagement Metrics
| Metric | Description |
|---|---|
order_frequency | Average orders per period |
recency_days | Days since last order |
nps_score | Net Promoter Score (-100 to +100) |
engagement_score | Composite engagement metric (0.0-1.0) |
Customer Networks
- Referral Networks: Customers refer other customers (configurable rate)
- Corporate Hierarchies: Parent/child company relationships
- Industry Clusters: Customers grouped by industry vertical
Relationship Strength Modeling
Composite Strength Calculation
Relationship strength is computed from multiple factors:
| Component | Weight | Scale | Description |
|---|---|---|---|
| Transaction Volume | 30% | Logarithmic | Total monetary value |
| Transaction Count | 25% | Square root | Number of transactions |
| Duration | 20% | Linear | Relationship age in days |
| Recency | 15% | Exponential decay | Days since last transaction |
| Mutual Connections | 10% | Jaccard index | Shared network connections |
Strength Categories
| Strength | Threshold | Description |
|---|---|---|
| Strong | ≥ 0.7 | Core business relationship |
| Moderate | ≥ 0.4 | Regular, established relationship |
| Weak | ≥ 0.1 | Occasional relationship |
| Dormant | < 0.1 | Inactive relationship |
Recency Decay
The recency component uses exponential decay:
recency_score = exp(-days_since_last / half_life)
Default half-life is 90 days.
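The composite score can be reproduced from the documented weights, scales, and half-life. In the sketch below the component weights and the recency decay follow the tables above, while the per-component normalisation caps are illustrative assumptions:
import math

def relationship_strength(volume: float, count: int, duration_days: int,
                          days_since_last: int, jaccard: float,
                          half_life: float = 90.0) -> float:
    # Normalisation caps below are assumptions; weights follow the documented 30/25/20/15/10 split
    volume_score = min(math.log10(1 + volume) / 7.0, 1.0)    # logarithmic scale
    count_score = min(math.sqrt(count) / 30.0, 1.0)          # square-root scale
    duration_score = min(duration_days / 1825.0, 1.0)        # linear scale, ~5 years
    recency_score = math.exp(-days_since_last / half_life)   # documented exponential decay
    return (0.30 * volume_score + 0.25 * count_score + 0.20 * duration_score
            + 0.15 * recency_score + 0.10 * jaccard)

s = relationship_strength(volume=2_500_000, count=180, duration_days=900,
                          days_since_last=20, jaccard=0.3)
print(f"{s:.2f}")  # ~0.63 -> Moderate (>= 0.4) under the documented thresholds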
Cross-Process Linkages
Inventory Links (P2P ↔ O2C)
Inventory naturally connects Procure-to-Pay and Order-to-Cash:
P2P: Purchase Order → Goods Receipt → Vendor Invoice → Payment
↓
[Inventory]
↓
O2C: Sales Order → Delivery → Customer Invoice → Receipt
When enabled, SyntheticData generates explicit CrossProcessLink records connecting:
GoodsReceipt (P2P) to Delivery (O2C) via inventory item
Payment-Bank Reconciliation
Links payment transactions to bank statement entries for reconciliation.
Intercompany Bilateral Links
Ensures intercompany transactions are properly linked between sending and receiving entities.
Entity Graph
Graph Structure
The EntityGraph provides a unified view of all entity relationships:
| Component | Description |
|---|---|
| Nodes | Entities with type, ID, and metadata |
| Edges | Relationships with type and strength |
| Indexes | Fast lookups by entity type and ID |
Entity Types (16 types)
Company, Vendor, Customer, Employee, Department, CostCenter,
Project, Contract, Asset, BankAccount, Material, Product,
Location, Currency, Account, Entity
Relationship Types (26 types)
// Transactional
BuysFrom, SellsTo, PaysTo, ReceivesFrom, SuppliesTo, OrdersFrom
// Organizational
ReportsTo, Manages, BelongsTo, OwnedBy, PartOf, Contains
// Network
ReferredBy, PartnersWith, AffiliateOf, SubsidiaryOf
// Process
ApprovesFor, AuthorizesFor, ProcessesFor
// Financial
BillsTo, ShipsTo, CollectsFrom, RemitsTo
// Document
ReferencedBy, SupersededBy, AmendedBy, LinkedTo
Configuration
Complete Example
vendor_network:
enabled: true
depth: 3
tiers:
tier1:
count_min: 50
count_max: 100
tier2:
count_per_parent_min: 4
count_per_parent_max: 10
tier3:
count_per_parent_min: 2
count_per_parent_max: 5
clusters:
reliable_strategic: 0.20
standard_operational: 0.50
transactional: 0.25
problematic: 0.05
dependencies:
max_single_vendor_concentration: 0.15
top_5_concentration: 0.45
single_source_percent: 0.05
customer_segmentation:
enabled: true
value_segments:
enterprise:
revenue_share: 0.40
customer_share: 0.05
avg_order_min: 50000.0
mid_market:
revenue_share: 0.35
customer_share: 0.20
avg_order_min: 5000.0
avg_order_max: 50000.0
smb:
revenue_share: 0.20
customer_share: 0.50
avg_order_min: 500.0
avg_order_max: 5000.0
consumer:
revenue_share: 0.05
customer_share: 0.25
avg_order_min: 50.0
avg_order_max: 500.0
lifecycle:
prospect_rate: 0.10
new_rate: 0.15
growth_rate: 0.20
mature_rate: 0.35
at_risk_rate: 0.10
churned_rate: 0.08
won_back_rate: 0.02
networks:
referrals:
enabled: true
referral_rate: 0.15
corporate_hierarchies:
enabled: true
hierarchy_probability: 0.30
relationship_strength:
enabled: true
calculation:
transaction_volume_weight: 0.30
transaction_count_weight: 0.25
relationship_duration_weight: 0.20
recency_weight: 0.15
mutual_connections_weight: 0.10
recency_half_life_days: 90
thresholds:
strong: 0.7
moderate: 0.4
weak: 0.1
cross_process_links:
enabled: true
inventory_p2p_o2c: true
payment_bank_reconciliation: true
intercompany_bilateral: true
Network Evaluation
SyntheticData includes network metrics evaluation:
| Metric | Description | Typical Range |
|---|---|---|
| Connectivity | Largest connected component ratio | > 0.95 |
| Power Law Alpha | Degree distribution exponent | 2.0-3.0 |
| Clustering Coefficient | Local clustering | 0.10-0.50 |
| Top-1 Concentration | Largest node share | < 0.15 |
| Top-5 Concentration | Top 5 nodes share | < 0.45 |
| HHI | Herfindahl-Hirschman Index | < 0.25 |
These metrics validate that generated networks exhibit realistic properties.
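A quick way to sanity-check the concentration metrics on your own output is to compute top-k shares and the HHI from per-vendor spend totals. A minimal sketch, assuming a plain list of spend values:
def concentration_metrics(spend: list[float]) -> dict[str, float]:
    # Top-1 / top-5 concentration and Herfindahl-Hirschman Index from spend shares
    total = sum(spend)
    shares = sorted((s / total for s in spend), reverse=True)
    return {
        'top_1': shares[0],
        'top_5': sum(shares[:5]),
        'hhi': sum(s * s for s in shares),
    }

metrics = concentration_metrics([80_000, 70_000, 60_000, 50_000, 40_000] + [10_000] * 40)
print(metrics)  # top_1 < 0.15, top_5 < 0.45, hhi < 0.25 for this healthy example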
API Usage
Rust API
#![allow(unused)]
fn main() {
use datasynth_core::models::{
VendorNetwork, VendorCluster, SupplyChainTier,
SegmentedCustomerPool, CustomerValueSegment,
EntityGraph, RelationshipStrengthCalculator,
};
use datasynth_generators::relationships::EntityGraphGenerator;
// Generate vendor network
let vendor_generator = VendorGenerator::new(config);
let vendor_network = vendor_generator.generate_vendor_network("C001");
// Generate segmented customers
let customer_generator = CustomerGenerator::new(config);
let customer_pool = customer_generator.generate_segmented_pool("C001");
// Build entity graph with cross-process links
let graph_generator = EntityGraphGenerator::with_defaults();
let entity_graph = graph_generator.generate_entity_graph(
&vendor_network,
&customer_pool,
&transactions,
&document_flows,
);
}
Python API
from datasynth_py import DataSynth
from datasynth_py.config import Config, VendorNetworkConfig, CustomerSegmentationConfig
config = Config(
vendor_network=VendorNetworkConfig(
enabled=True,
depth=3,
clusters={"reliable_strategic": 0.20, "standard_operational": 0.50},
),
customer_segmentation=CustomerSegmentationConfig(
enabled=True,
value_segments={
"enterprise": {"revenue_share": 0.40, "customer_share": 0.05},
"mid_market": {"revenue_share": 0.35, "customer_share": 0.20},
},
),
)
result = DataSynth().generate(config=config, output={"format": "csv"})
See Also
- Graph Export - Exporting entity graphs to PyTorch Geometric, Neo4j, DGL
- Intercompany Processing - Multi-entity transaction matching
- Master Data Configuration - Vendor and customer settings
Period Close Engine
Generate period-end accounting processes.
Overview
The period close engine simulates:
- Monthly close (accruals, depreciation)
- Quarterly close (IC elimination, translation)
- Annual close (closing entries, retained earnings)
Configuration
period_close:
enabled: true
monthly:
accruals: true
depreciation: true
reconciliation: true
quarterly:
intercompany_elimination: true
currency_translation: true
annual:
closing_entries: true
retained_earnings: true
Monthly Close
Accruals
Generate reversing accrual entries:
period_close:
monthly:
accruals:
enabled: true
auto_reverse: true # Reverse next period
categories:
expense_accrual: 0.4
revenue_accrual: 0.2
payroll_accrual: 0.3
other: 0.1
Expense Accrual:
Period 1 (accrue):
Dr Expense 10,000
Cr Accrued Liabilities 10,000
Period 2 (reverse):
Dr Accrued Liabilities 10,000
Cr Expense 10,000
Depreciation
Calculate and post monthly depreciation:
period_close:
monthly:
depreciation:
enabled: true
run_date: last_day # When in period
methods:
straight_line: 0.7
declining_balance: 0.2
units_of_production: 0.1
Depreciation Entry:
Dr Depreciation Expense 5,000
Cr Accumulated Depreciation 5,000
Subledger Reconciliation
Verify subledger-to-GL control accounts:
period_close:
monthly:
reconciliation:
enabled: true
checks:
- subledger: ar
control_account: "1100"
- subledger: ap
control_account: "2000"
- subledger: inventory
control_account: "1200"
Reconciliation Report:
| Subledger | Control Account | Subledger Balance | GL Balance | Difference |
|---|---|---|---|---|
| AR | 1100 | 500,000 | 500,000 | 0 |
| AP | 2000 | (300,000) | (300,000) | 0 |
Quarterly Close
IC Elimination
Generate consolidation eliminations:
period_close:
quarterly:
intercompany_elimination:
enabled: true
types:
- revenue_expense # Eliminate IC sales
- unrealized_profit # Eliminate IC inventory profit
- receivable_payable # Eliminate IC balances
- dividends # Eliminate IC dividends
See Intercompany Processing for details.
Currency Translation
Translate foreign subsidiary balances:
period_close:
quarterly:
currency_translation:
enabled: true
method: current_rate # current_rate, temporal
rate_mapping:
assets: closing_rate
liabilities: closing_rate
equity: historical_rate
revenue: average_rate
expense: average_rate
cta_account: "3500" # CTA equity account
Translation Entry (CTA):
If foreign currency strengthened:
Dr Foreign Subsidiary Investment 10,000
Cr CTA (Other Comprehensive) 10,000
Annual Close
Closing Entries
Close temporary accounts to retained earnings:
period_close:
annual:
closing_entries:
enabled: true
close_revenue: true
close_expense: true
income_summary_account: "3900"
Closing Sequence:
1. Close Revenue:
Dr Revenue accounts (all) 1,000,000
Cr Income Summary 1,000,000
2. Close Expenses:
Dr Income Summary 800,000
Cr Expense accounts (all) 800,000
3. Close Income Summary:
Dr Income Summary 200,000
Cr Retained Earnings 200,000
Retained Earnings
Update retained earnings:
period_close:
annual:
retained_earnings:
enabled: true
account: "3100"
dividend_account: "3150"
Year-End Adjustments
Additional adjusting entries:
period_close:
annual:
adjustments:
- type: bad_debt_provision
rate: 0.02 # 2% of AR
- type: inventory_reserve
rate: 0.01 # 1% of inventory
- type: bonus_accrual
rate: 0.10 # 10% of salary expense
Financial Statements (v0.6.0)
The period close engine can now generate full financial statement sets from the adjusted trial balance. This is controlled by the financial_reporting configuration section.
Balance Sheet
Generates a statement of financial position with current/non-current asset and liability classifications:
Assets Liabilities & Equity
├── Current Assets ├── Current Liabilities
│ ├── Cash & Equivalents │ ├── Accounts Payable
│ ├── Accounts Receivable │ ├── Accrued Liabilities
│ └── Inventory │ └── Current Debt
├── Non-Current Assets ├── Non-Current Liabilities
│ ├── Fixed Assets (net) │ └── Long-Term Debt
│ └── Intangibles └── Equity
Total Assets = Total L + E ├── Common Stock
└── Retained Earnings
Income Statement
Generates a multi-step income statement:
Revenue
- Cost of Goods Sold
= Gross Profit
- Operating Expenses
= Operating Income
+/- Other Income/Expense
= Income Before Tax
- Income Tax
= Net Income
Cash Flow Statement
Generates an indirect-method cash flow statement with three categories:
financial_reporting:
generate_cash_flow: true
Categories:
- Operating: Net income + non-cash adjustments + working capital changes
- Investing: Capital expenditures, asset disposals
- Financing: Debt proceeds/repayments, equity transactions, dividends
Statement of Changes in Equity
Tracks equity movements across the period:
- Opening retained earnings
- Net income for the period
- Dividends declared
- Other comprehensive income (CTA, unrealized gains)
- Closing retained earnings
Management KPIs
When financial_reporting.management_kpis is enabled, the close engine computes financial ratios:
- Liquidity: Current ratio, quick ratio, cash ratio
- Profitability: Gross margin, operating margin, net margin, ROA, ROE
- Efficiency: Inventory turnover, AR turnover, AP turnover, days sales outstanding
- Leverage: Debt-to-equity, debt-to-assets, interest coverage
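For orientation, a few of these ratios can be recomputed with their textbook definitions from the generated statements. The function and input values below are illustrative assumptions, not the engine's implementation:
def liquidity_and_profitability(current_assets: float, inventory: float,
                                current_liabilities: float,
                                revenue: float, cogs: float,
                                net_income: float, total_assets: float) -> dict:
    # Textbook definitions for a subset of the documented KPIs
    return {
        'current_ratio': current_assets / current_liabilities,
        'quick_ratio': (current_assets - inventory) / current_liabilities,
        'gross_margin': (revenue - cogs) / revenue,
        'net_margin': net_income / revenue,
        'roa': net_income / total_assets,
    }

print(liquidity_and_profitability(
    current_assets=500_000, inventory=150_000, current_liabilities=250_000,
    revenue=1_000_000, cogs=800_000, net_income=200_000, total_assets=2_000_000))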
Budgets
When financial_reporting.budgets is enabled, the close engine generates budget records with variance analysis:
financial_reporting:
budgets:
enabled: true
variance_threshold: 0.10 # Flag variances > 10%
Produces budget vs. actual comparisons by account and period, with favorable/unfavorable variance flags.
Output Files
trial_balances/YYYY_MM.csv
| Field | Description |
|---|---|
account_number | GL account |
account_name | Account description |
opening_balance | Period start |
period_debits | Total debits |
period_credits | Total credits |
closing_balance | Period end |
accruals.csv
| Field | Description |
|---|---|
accrual_id | Unique ID |
accrual_type | Category |
period | Accrual period |
amount | Accrual amount |
reversal_period | When reversed |
entry_id | Related JE ID |
depreciation.csv
| Field | Description |
|---|---|
asset_id | Fixed asset |
period | Depreciation period |
method | Depreciation method |
depreciation_amount | Period expense |
accumulated_total | Running total |
net_book_value | Remaining value |
closing_entries.csv
| Field | Description |
|---|---|
entry_id | Closing entry ID |
entry_type | Revenue, expense, summary |
account | Account closed |
amount | Closing amount |
fiscal_year | Year closed |
financial_statements.csv (v0.6.0)
| Field | Description |
|---|---|
statement_id | Unique statement identifier |
statement_type | balance_sheet, income_statement, cash_flow, changes_in_equity |
company_code | Company code |
period_end | Statement date |
basis | us_gaap, ifrs, statutory |
line_code | Line item code |
label | Display label |
section | Statement section |
amount | Current period amount |
amount_prior | Prior period amount |
bank_reconciliations.csv (v0.6.0)
| Field | Description |
|---|---|
reconciliation_id | Unique reconciliation ID |
company_code | Company code |
bank_account | Bank account identifier |
period_start | Reconciliation period start |
period_end | Reconciliation period end |
opening_balance | Opening bank balance |
closing_balance | Closing bank balance |
status | in_progress, completed, completed_with_exceptions |
management_kpis.csv (v0.6.0)
| Field | Description |
|---|---|
company_code | Company code |
period | Reporting period |
kpi_name | Ratio name (e.g., current_ratio, gross_margin) |
kpi_value | Computed ratio value |
category | liquidity, profitability, efficiency, leverage |
Close Schedule
Month 1-11:
├── Accruals
├── Depreciation
└── Reconciliation
Month 3, 6, 9:
├── IC Elimination
└── Currency Translation
Month 12:
├── All monthly tasks
├── All quarterly tasks
├── Year-end adjustments
└── Closing entries
Example Configuration
Full Close Cycle
global:
start_date: 2024-01-01
period_months: 12
period_close:
enabled: true
monthly:
accruals:
enabled: true
auto_reverse: true
depreciation:
enabled: true
reconciliation:
enabled: true
quarterly:
intercompany_elimination:
enabled: true
currency_translation:
enabled: true
annual:
closing_entries:
enabled: true
retained_earnings:
enabled: true
adjustments:
- type: bad_debt_provision
rate: 0.02
See Also
Fingerprinting
Privacy-preserving fingerprint extraction enables generating synthetic data that matches the statistical properties of real data without exposing sensitive information.
Overview
Fingerprinting is a three-stage process:
- Extract: Analyze real data and capture statistical properties into a .dsf fingerprint file
- Synthesize: Generate a synthetic data configuration from the fingerprint
- Evaluate: Validate that synthetic data matches the fingerprint’s statistical properties
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Real Data │────▶│ Extract │────▶│ .dsf File │────▶│ Evaluate │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
Privacy Engine Config Synthesizer Fidelity Report
Privacy Mechanisms
Differential Privacy
The extraction process applies differential privacy to protect individual records:
- Laplace Mechanism: Adds calibrated noise to numeric statistics
- Gaussian Mechanism: Alternative for (ε,δ)-differential privacy
- Epsilon Budget: Tracks privacy budget across all operations
Privacy Guarantee: For any two datasets D and D' differing in one record,
the probability ratio of any output is bounded by e^ε
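Conceptually, the Laplace mechanism adds noise with scale sensitivity/ε to each released statistic. A minimal sketch of that idea; the privacy engine's actual noise calibration and sensitivity bounds may differ:
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float,
                      rng: np.random.Generator) -> float:
    # Release a noisy statistic satisfying epsilon-differential privacy
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
# e.g. a private mean of amounts clipped to [0, 100_000] over 10_000 rows:
# sensitivity of the mean is 100_000 / 10_000 = 10
noisy_mean = laplace_mechanism(true_value=4_217.35, sensitivity=10.0, epsilon=1.0, rng=rng)
print(noisy_mean)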
K-Anonymity
Categorical values are protected through suppression:
- Values appearing fewer than k times are replaced with <suppressed>
- Prevents identification of rare categories
- Configurable threshold per privacy level
Winsorization
Numeric outliers are clipped to prevent identification:
- Values beyond the configured percentile are capped
- Prevents extreme values from leaking individual information
- Outlier percentile varies by privacy level (85%-99%)
Privacy Levels
| Level | Epsilon | k | Outlier % | Description |
|---|---|---|---|---|
| Minimal | 5.0 | 3 | 99% | Highest utility, lower privacy |
| Standard | 1.0 | 5 | 95% | Balanced (recommended default) |
| High | 0.5 | 10 | 90% | Higher privacy for sensitive data |
| Maximum | 0.1 | 20 | 85% | Maximum privacy, some utility loss |
Choosing a Privacy Level
- Minimal: Internal testing, non-sensitive data
- Standard: General use, moderate sensitivity
- High: Personal financial data, healthcare
- Maximum: Highly sensitive data, regulatory compliance
DSF File Format
The DataSynth Fingerprint (.dsf) file is a ZIP archive containing:
fingerprint.dsf
├── manifest.json # Metadata, checksums, privacy config
├── schema.yaml # Table/column structure, relationships
├── statistics.yaml # Distributions, percentiles, Benford
├── correlations.yaml # Correlation matrices, copula params
├── integrity.yaml # Foreign keys, cardinality rules
├── rules.yaml # Balance constraints, thresholds
├── anomalies.yaml # Anomaly rates, patterns
└── privacy_audit.json # All privacy decisions logged
Manifest Structure
{
"version": "1.0",
"format": "dsf",
"created_at": "2026-01-23T10:30:00Z",
"source": {
"row_count": 100000,
"column_count": 25,
"tables": ["journal_entries", "vendors"]
},
"privacy": {
"level": "standard",
"epsilon": 1.0,
"k": 5
},
"checksums": {
"schema": "sha256:...",
"statistics": "sha256:...",
"correlations": "sha256:..."
}
}
Extraction Process
Step 1: Schema Extraction
Analyzes data structure:
- Infers column data types (numeric, categorical, date, text)
- Computes cardinalities
- Detects foreign key relationships
- Identifies primary keys
Step 2: Statistical Extraction
Computes distributions with privacy:
- Numeric columns: Mean, std, min, max, percentiles (with DP noise)
- Categorical columns: Frequencies (with k-anonymity)
- Temporal columns: Date ranges, seasonality patterns
- Benford analysis: First-digit distribution compliance
Step 3: Correlation Extraction
Captures multivariate relationships:
- Pearson correlation matrices (with DP)
- Copula parameters for joint distributions
- Cross-table relationship strengths
Step 4: Rules Extraction
Detects business rules:
- Balance equations (debits = credits)
- Approval thresholds
- Validation constraints
Step 5: Anomaly Pattern Extraction
Captures anomaly characteristics:
- Overall anomaly rate
- Type distribution
- Temporal patterns
Synthesis Process
Configuration Generation
The ConfigSynthesizer converts fingerprints to generation configuration:
#![allow(unused)]
fn main() {
// From fingerprint statistics, generate:
AmountSampler {
distribution: LogNormal,
mean: fp.statistics.amount.mean,
std: fp.statistics.amount.std,
round_number_bias: 0.15,
}
}
Copula-Based Generation
For correlated columns, the GaussianCopula preserves relationships:
- Generate independent uniform samples
- Apply correlation structure
- Transform to target marginal distributions
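A minimal numpy/scipy sketch of the copula idea above, illustrative only and not the generator's implementation; the correlation value and marginal parameters are assumptions:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Correlation structure to preserve (e.g. amount vs. line count)
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])

# Sample correlated standard normals and map them to uniforms
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=corr, size=10_000)
u = stats.norm.cdf(z)

# Transform each uniform margin to its target marginal distribution
amounts = stats.lognorm.ppf(u[:, 0], s=1.2, scale=np.exp(7.5))  # log-normal amounts
line_counts = stats.poisson.ppf(u[:, 1], mu=4) + 2              # at least 2 lines per entry

print(np.corrcoef(np.log(amounts), line_counts)[0, 1])  # close to the target 0.6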
Fidelity Evaluation
Metrics
| Metric | Description | Target |
|---|---|---|
| KS Statistic | Max CDF difference | < 0.1 |
| Wasserstein Distance | Earth mover’s distance | < 0.1 |
| Benford MAD | Mean absolute deviation from Benford | < 0.015 |
| Correlation RMSE | Correlation matrix difference | < 0.1 |
| Schema Match | Column type agreement | > 0.95 |
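Two of these metrics are easy to recompute yourself when spot-checking a dataset. A minimal sketch with scipy, assuming two arrays of transaction amounts:
import numpy as np
from scipy import stats

def ks_statistic(real: np.ndarray, synthetic: np.ndarray) -> float:
    # Maximum CDF difference between the real and synthetic samples
    return stats.ks_2samp(real, synthetic).statistic

def benford_mad(amounts: np.ndarray) -> float:
    # Mean absolute deviation of first-digit frequencies from Benford's Law
    first_digits = np.array([int(f"{a:e}"[0]) for a in np.abs(amounts) if a != 0])
    observed = np.bincount(first_digits, minlength=10)[1:10] / len(first_digits)
    expected = np.log10(1 + 1 / np.arange(1, 10))
    return float(np.mean(np.abs(observed - expected)))

rng = np.random.default_rng(2)
real = rng.lognormal(8, 1.5, 50_000)
print(ks_statistic(real, rng.lognormal(8, 1.5, 50_000)))  # small for matching distributions
print(benford_mad(real))  # log-normal amounts are close to Benford-conformant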
Fidelity Report
Fidelity Evaluation Report
==========================
Overall Score: 0.87
Status: PASSED (threshold: 0.80)
Component Scores:
Statistical: 0.89
Correlation: 0.85
Schema: 0.98
Rules: 0.76
Details:
- KS statistic (amount): 0.05
- Benford MAD: 0.008
- Correlation RMSE: 0.07
CLI Usage
Basic Workflow
# 1. Extract fingerprint from real data
datasynth-data fingerprint extract \
--input ./real_data.csv \
--output ./fingerprint.dsf \
--privacy-level standard
# 2. Validate fingerprint integrity
datasynth-data fingerprint validate ./fingerprint.dsf
# 3. View fingerprint details
datasynth-data fingerprint info ./fingerprint.dsf --detailed
# 4. Generate synthetic data (using derived config)
datasynth-data generate \
--config ./derived_config.yaml \
--output ./synthetic_data
# 5. Evaluate fidelity
datasynth-data fingerprint evaluate \
--fingerprint ./fingerprint.dsf \
--synthetic ./synthetic_data \
--threshold 0.85 \
--report ./report.html
Comparing Fingerprints
# Compare two versions
datasynth-data fingerprint diff ./fp_v1.dsf ./fp_v2.dsf
Custom Privacy Parameters
# Override privacy level with custom values
datasynth-data fingerprint extract \
--input ./sensitive_data.csv \
--output ./fingerprint.dsf \
--epsilon 0.3 \
--k 15
Best Practices
Data Preparation
- Clean data first: Remove obvious errors before extraction
- Consistent formats: Standardize date and number formats
- Document exclusions: Note any columns excluded from extraction
Privacy Selection
- Start with standard: Adjust based on fidelity evaluation
- Consider sensitivity: Use higher privacy for personal data
- Review audit log: Check privacy decisions in privacy_audit.json
Fidelity Optimization
- Check component scores: Identify weak areas
- Adjust generation config: Tune parameters for low-scoring metrics
- Iterate: Re-evaluate after adjustments
Compliance
- Preserve audit trail: Keep .dsf files for compliance review
- Document privacy choices: Record rationale for privacy level
- Version fingerprints: Track changes over time
Troubleshooting
Low Fidelity Score
Cause: Statistical differences between synthetic and fingerprint
Solutions:
- Review component scores to identify specific issues
- Adjust generation configuration parameters
- Consider using auto-tuning recommendations
Fingerprint Validation Errors
Cause: Corrupted or modified DSF file
Solutions:
- Re-extract from source data
- Check file transfer integrity
- Verify checksums match manifest
Privacy Budget Exceeded
Cause: Too many queries on sensitive data
Solutions:
- Reduce number of extracted statistics
- Use higher epsilon (lower privacy)
- Aggregate fine-grained statistics
Accounting & Audit Standards
SyntheticData includes comprehensive support for major accounting and auditing standards frameworks, enabling the generation of standards-compliant synthetic financial data suitable for audit analytics, compliance testing, and ML model training.
Overview
The datasynth-standards crate provides domain models and generation logic for:
| Category | Standards |
|---|---|
| Accounting | US GAAP (ASC), IFRS |
| Auditing | ISA (International Standards on Auditing), PCAOB |
| Regulatory | SOX (Sarbanes-Oxley Act) |
Accounting Framework Selection
Framework Options
accounting_standards:
enabled: true
framework: us_gaap # Options: us_gaap, ifrs, dual_reporting
| Framework | Description |
|---|---|
us_gaap | United States Generally Accepted Accounting Principles |
ifrs | International Financial Reporting Standards |
dual_reporting | Generate data for both frameworks with reconciliation |
Key Framework Differences
The generator automatically handles framework-specific rules:
| Area | US GAAP | IFRS |
|---|---|---|
| Inventory costing | LIFO permitted | LIFO prohibited |
| Development costs | Generally expensed | Capitalized when criteria met |
| PPE revaluation | Cost model only | Revaluation model permitted |
| Impairment reversal | Not permitted | Permitted (except goodwill) |
| Lease classification | Bright-line tests (75%/90%) | Principles-based |
Revenue Recognition (ASC 606 / IFRS 15)
Generate realistic customer contracts with performance obligations:
accounting_standards:
revenue_recognition:
enabled: true
generate_contracts: true
avg_obligations_per_contract: 2.0
variable_consideration_rate: 0.15
over_time_recognition_rate: 0.30
contract_count: 100
Generated Entities
- Customer Contracts: Transaction price, status, framework
- Performance Obligations: Goods, services, licenses with satisfaction patterns
- Variable Consideration: Discounts, rebates, incentives with constraint application
- Revenue Recognition Schedule: Period-by-period recognition
5-Step Model Compliance
The generator follows the 5-step revenue recognition model:
- Identify the contract
- Identify performance obligations
- Determine transaction price
- Allocate transaction price to obligations
- Recognize revenue when/as obligations are satisfied
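For step 4, the relative standalone selling price method can be illustrated with a short sketch. The struct and the contract figures below are hypothetical and do not mirror the generated CSV schema.
#![allow(unused)]
fn main() {
    // Step 4 sketch: allocate a contract's transaction price to performance
    // obligations in proportion to their standalone selling prices (relative SSP method).
    struct Obligation { name: &'static str, standalone_price: f64 }
    let obligations = vec![
        Obligation { name: "hardware", standalone_price: 80_000.0 },
        Obligation { name: "installation", standalone_price: 10_000.0 },
        Obligation { name: "support_12m", standalone_price: 30_000.0 },
    ];
    let transaction_price = 100_000.0; // contract price after variable consideration
    let total_ssp: f64 = obligations.iter().map(|o| o.standalone_price).sum();
    for o in &obligations {
        let allocated = transaction_price * o.standalone_price / total_ssp;
        println!("{:<14} allocated {:>10.2}", o.name, allocated);
    }
}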
Lease Accounting (ASC 842 / IFRS 16)
Generate lease portfolios with ROU assets and lease liabilities:
accounting_standards:
leases:
enabled: true
lease_count: 50
finance_lease_percent: 0.30
avg_lease_term_months: 60
generate_amortization: true
real_estate_percent: 0.40
Generated Entities
- Leases: Classification, commencement date, term, payments, discount rate
- ROU Assets: Initial measurement, accumulated depreciation, carrying amount
- Lease Liabilities: Current/non-current portions
- Amortization Schedules: Period-by-period interest and principal
Classification Logic
- US GAAP: Bright-line tests (75% term, 90% PV)
- IFRS: All leases (except short-term/low-value) recognized on balance sheet
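A minimal sketch of the US GAAP bright-line tests listed above, using a standard annuity present-value formula. The figures and the helper code are illustrative, not the crate's classification logic.
#![allow(unused)]
fn main() {
    // Finance lease if term >= 75% of useful life, or if the present value of
    // payments >= 90% of fair value (US GAAP bright-line tests).
    let lease_term_months = 60.0;
    let asset_useful_life_months = 96.0;
    let monthly_payment = 2_000.0;
    let annual_discount_rate = 0.06;
    let asset_fair_value = 110_000.0;

    let r = annual_discount_rate / 12.0;
    let pv_payments = monthly_payment * (1.0 - (1.0 + r).powf(-lease_term_months)) / r;

    let term_test = lease_term_months / asset_useful_life_months >= 0.75;
    let pv_test = pv_payments / asset_fair_value >= 0.90;
    let classification = if term_test || pv_test { "Finance" } else { "Operating" };
    println!("PV of payments = {pv_payments:.2}, classification = {classification}");
}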
Fair Value Measurement (ASC 820 / IFRS 13)
Generate fair value measurements across hierarchy levels:
accounting_standards:
fair_value:
enabled: true
measurement_count: 30
level1_percent: 0.60 # Quoted prices
level2_percent: 0.30 # Observable inputs
level3_percent: 0.10 # Unobservable inputs
include_sensitivity_analysis: true
Fair Value Hierarchy
| Level | Description | Examples |
|---|---|---|
| Level 1 | Quoted prices in active markets | Listed stocks, exchange-traded funds |
| Level 2 | Observable inputs | Corporate bonds, interest rate swaps |
| Level 3 | Unobservable inputs | Private equity, complex derivatives |
Impairment Testing (ASC 360 / IAS 36)
Generate impairment tests with framework-specific methodology:
accounting_standards:
impairment:
enabled: true
test_count: 15
impairment_rate: 0.20
generate_projections: true
include_goodwill: true
Framework Differences
- US GAAP: Two-step test (recoverability then measurement)
- IFRS: One-step test comparing to recoverable amount
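A simplified sketch of the two methodologies; the amounts are made up and the logic is a didactic approximation, not the generator's internals.
#![allow(unused)]
fn main() {
    let carrying_amount = 500_000.0;
    let undiscounted_cash_flows = 460_000.0; // US GAAP step 1 (recoverability)
    let fair_value = 430_000.0;              // US GAAP step 2 measurement basis
    let value_in_use = 445_000.0;            // IFRS: discounted cash flows

    // US GAAP (ASC 360): impair only if the carrying amount is not recoverable,
    // then write down to fair value.
    let us_gaap_loss = if carrying_amount > undiscounted_cash_flows {
        (carrying_amount - fair_value).max(0.0)
    } else { 0.0 };

    // IFRS (IAS 36): one step against the recoverable amount, i.e. the higher of
    // fair value less costs of disposal and value in use.
    let recoverable_amount = fair_value.max(value_in_use);
    let ifrs_loss = (carrying_amount - recoverable_amount).max(0.0);

    println!("US GAAP loss = {us_gaap_loss:.0}, IFRS loss = {ifrs_loss:.0}");
}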
ISA Compliance (Audit Standards)
Generate audit procedures mapped to ISA requirements:
audit_standards:
isa_compliance:
enabled: true
compliance_level: comprehensive # basic, standard, comprehensive
generate_isa_mappings: true
generate_coverage_summary: true
include_pcaob: true
framework: dual # isa, pcaob, dual
Supported ISA Standards
The crate includes 34 ISA standards from ISA 200 through ISA 720:
| Series | Focus Area |
|---|---|
| ISA 200-265 | General principles and responsibilities |
| ISA 300-450 | Risk assessment and response |
| ISA 500-580 | Audit evidence |
| ISA 600-620 | Using work of others |
| ISA 700-720 | Conclusions and reporting |
Analytical Procedures (ISA 520)
Generate analytical procedures with variance investigation:
audit_standards:
analytical_procedures:
enabled: true
procedures_per_account: 3
variance_probability: 0.20
generate_investigations: true
include_ratio_analysis: true
Procedure Types
- Trend analysis: Year-over-year comparisons
- Ratio analysis: Key financial ratios
- Reasonableness tests: Expected vs. actual comparisons
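A reasonableness test can be sketched as a simple expected-versus-actual comparison; the 10% investigation threshold and the balances below are illustrative assumptions, not the generator's defaults.
#![allow(unused)]
fn main() {
    // Flag a variance for investigation when the actual balance deviates from
    // the expectation by more than a threshold.
    let expected_balance = 1_200_000.0; // e.g., prior year adjusted for growth
    let actual_balance = 1_410_000.0;
    let threshold_pct = 0.10;
    let variance = actual_balance - expected_balance;
    let variance_pct = variance / expected_balance;
    if variance_pct.abs() > threshold_pct {
        println!("Investigate: variance of {:.0} ({:.1}% vs {:.0}% threshold)",
            variance, variance_pct * 100.0, threshold_pct * 100.0);
    } else {
        println!("Within expectation");
    }
}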
External Confirmations (ISA 505)
Generate confirmation procedures with response tracking:
audit_standards:
confirmations:
enabled: true
confirmation_count: 50
positive_response_rate: 0.85
exception_rate: 0.10
Confirmation Types
- Bank confirmations
- Accounts receivable confirmations
- Accounts payable confirmations
- Legal confirmations
Audit Opinion (ISA 700/705/706/701)
Generate audit opinions with key audit matters:
audit_standards:
opinion:
enabled: true
generate_kam: true
average_kam_count: 3
Opinion Types
- Unmodified
- Qualified
- Adverse
- Disclaimer
SOX Compliance
Generate SOX 302/404 compliance documentation:
audit_standards:
sox:
enabled: true
generate_302_certifications: true
generate_404_assessments: true
materiality_threshold: 10000.0
Section 302 Certifications
- CEO and CFO certifications
- Disclosure controls effectiveness
- Material weakness identification
Section 404 Assessments
- ICFR effectiveness assessment
- Key control testing
- Deficiency classification matrix
Deficiency Classification
The DeficiencyMatrix classifies deficiencies based on:
| Likelihood | Magnitude | Classification |
|---|---|---|
| Probable | Material | Material Weakness |
| Reasonably Possible | More Than Inconsequential | Significant Deficiency |
| Remote | Inconsequential | Control Deficiency |
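A plain-Rust sketch of this decision table; the enums and the classify helper are illustrative and do not mirror the DeficiencyMatrix API. Severity escalates as likelihood and magnitude increase.
#![allow(unused)]
fn main() {
    #[derive(PartialEq, PartialOrd)]
    enum Likelihood { Remote, ReasonablyPossible, Probable }
    #[derive(PartialEq, PartialOrd)]
    enum Magnitude { Inconsequential, MoreThanInconsequential, Material }

    fn classify(l: Likelihood, m: Magnitude) -> &'static str {
        if l >= Likelihood::Probable && m >= Magnitude::Material {
            "Material Weakness"
        } else if l >= Likelihood::ReasonablyPossible
            && m >= Magnitude::MoreThanInconsequential
        {
            "Significant Deficiency"
        } else {
            "Control Deficiency"
        }
    }

    println!("{}", classify(Likelihood::Probable, Magnitude::Material));
    println!("{}", classify(Likelihood::ReasonablyPossible, Magnitude::MoreThanInconsequential));
}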
PCAOB Standards
Generate PCAOB-specific audit elements:
audit_standards:
pcaob:
enabled: true
generate_cam: true
integrated_audit: true
PCAOB-Specific Requirements
- Critical Audit Matters (CAMs) vs. Key Audit Matters (KAMs)
- Integrated audit (ICFR + financial statements)
- AS 2201 ICFR testing requirements
Evaluation and Validation
The datasynth-eval crate includes standards compliance evaluators:
#![allow(unused)]
fn main() {
use datasynth_eval::coherence::{
StandardsComplianceEvaluation,
RevenueRecognitionEvaluator,
LeaseAccountingEvaluator,
StandardsThresholds,
};
// Evaluate revenue recognition compliance
let eval = RevenueRecognitionEvaluator::evaluate(&contracts);
assert!(eval.po_allocation_compliance >= 0.95);
// Evaluate lease classification accuracy
let eval = LeaseAccountingEvaluator::evaluate(&leases, "us_gaap");
assert!(eval.classification_accuracy >= 0.90);
}
Compliance Thresholds
| Metric | Default Threshold |
|---|---|
| PO allocation compliance | 95% |
| Revenue timing compliance | 95% |
| Lease classification accuracy | 90% |
| ROU asset accuracy | 95% |
| Fair value hierarchy compliance | 95% |
| ISA coverage | 90% |
| SOX control coverage | 95% |
| Audit trail completeness | 90% |
Output Files
When standards generation is enabled, additional files are exported:
output/
└── standards/
├── accounting/
│ ├── customer_contracts.csv
│ ├── performance_obligations.csv
│ ├── variable_consideration.csv
│ ├── revenue_recognition_schedule.csv
│ ├── leases.csv
│ ├── rou_assets.csv
│ ├── lease_liabilities.csv
│ ├── lease_amortization.csv
│ ├── fair_value_measurements.csv
│ ├── impairment_tests.csv
│ └── framework_differences.csv
├── audit/
│ ├── isa_requirement_mappings.csv
│ ├── isa_coverage_summary.csv
│ ├── analytical_procedures.csv
│ ├── variance_investigations.csv
│ ├── confirmations.csv
│ ├── confirmation_responses.csv
│ ├── audit_opinions.csv
│ ├── key_audit_matters.csv
│ ├── audit_trails.json
│ └── pcaob_mappings.csv
└── regulatory/
├── sox_302_certifications.csv
├── sox_404_assessments.csv
├── deficiency_classifications.csv
└── material_weaknesses.csv
Use Cases
Audit Analytics Training
Generate labeled data for training audit analytics models with known standards compliance levels.
Compliance Testing
Test compliance monitoring systems with synthetic data covering all major accounting and auditing standards.
IFRS to US GAAP Reconciliation
Use dual reporting mode to generate reconciliation data for multi-framework analysis.
SOX Testing
Generate internal control data with known deficiencies for testing SOX monitoring systems.
See Also
- COSO Framework - Internal control framework
- Audit Simulation - Audit analytics use cases
- SOX Compliance - SOX testing use cases
Performance Tuning
Optimize SyntheticData for your hardware and requirements.
Performance Characteristics
| Metric | Typical Performance |
|---|---|
| Single-threaded | ~100,000 entries/second |
| Parallel (8 cores) | ~600,000 entries/second |
| Memory per 1M entries | ~500 MB |
Configuration Tuning
Worker Threads
global:
worker_threads: 8 # Match CPU cores
Guidelines:
- Default: Uses all available cores
- I/O bound: May benefit from more threads than physical cores
- Memory constrained: Reduce threads
Memory Limits
global:
memory_limit: 2147483648 # 2 GB
Guidelines:
- Set to ~75% of available RAM
- Leave room for OS and other processes
- Lower limit = more streaming, less memory
Batch Sizes
The orchestrator automatically tunes batch sizes, but you can influence behavior:
transactions:
target_count: 100000
# Implicit batch sizing based on:
# - Available memory
# - Number of threads
# - Target count
Hardware Recommendations
Minimum
| Resource | Specification |
|---|---|
| CPU | 2 cores |
| RAM | 4 GB |
| Storage | 10 GB |
Suitable for: <100K entries, development
Recommended
| Resource | Specification |
|---|---|
| CPU | 8 cores |
| RAM | 16 GB |
| Storage | 50 GB SSD |
Suitable for: 1M entries, production
High Performance
| Resource | Specification |
|---|---|
| CPU | 32+ cores |
| RAM | 64+ GB |
| Storage | NVMe SSD |
Suitable for: 10M+ entries, benchmarking
Optimizing Generation
Reduce Memory Pressure
Enable streaming output:
output:
format: csv
# Writing as generated reduces memory
Disable unnecessary features:
graph_export:
enabled: false # Skip if not needed
anomaly_injection:
enabled: false # Add in post-processing
Optimize for Speed
Maximize parallelism:
global:
worker_threads: 16 # More threads
Simplify output:
output:
format: csv # Faster than JSON
compression: none # Skip compression time
Reduce complexity:
chart_of_accounts:
complexity: small # Fewer accounts
document_flows:
p2p:
enabled: false # Skip if not needed
Optimize for Size
Enable compression:
output:
compression: zstd
compression_level: 9 # Maximum compression
Minimize output files:
output:
files:
journal_entries: true
acdoca: false
master_data: false # Only what you need
Benchmarking
Built-in Benchmarks
# Run all benchmarks
cargo bench
# Specific benchmark
cargo bench --bench generation_throughput
# With baseline comparison
cargo bench -- --baseline main
Benchmark Categories
| Benchmark | Measures |
|---|---|
generation_throughput | Entries/second |
distribution_sampling | Distribution speed |
output_sink | Write performance |
scalability | Parallel scaling |
correctness | Validation overhead |
Manual Benchmarking
# Time generation
time datasynth-data generate --config config.yaml --output ./output
# Profile memory
/usr/bin/time -v datasynth-data generate --config config.yaml --output ./output
Profiling
CPU Profiling
# With perf (Linux)
perf record datasynth-data generate --config config.yaml --output ./output
perf report
# With Instruments (macOS)
xcrun xctrace record --template "Time Profiler" \
--launch datasynth-data generate --config config.yaml --output ./output
Memory Profiling
# With heaptrack (Linux)
heaptrack datasynth-data generate --config config.yaml --output ./output
heaptrack_print heaptrack.*.gz
# With Instruments (macOS)
xcrun xctrace record --template "Allocations" \
--launch datasynth-data generate --config config.yaml --output ./output
Common Bottlenecks
I/O Bound
Symptoms:
- CPU utilization < 100%
- Disk utilization high
Solutions:
- Use faster storage (SSD/NVMe)
- Enable compression (reduces write volume)
- Increase buffer sizes
Memory Bound
Symptoms:
- OOM errors
- Excessive swapping
Solutions:
- Reduce `target_count`
- Lower `memory_limit`
- Enable streaming
- Reduce parallel threads
CPU Bound
Symptoms:
- CPU at 100%
- Generation time scales linearly
Solutions:
- Add more cores
- Simplify configuration
- Disable unnecessary features
Scaling Guidelines
Entries vs Time
| Entries | ~Time (8 cores) |
|---|---|
| 10,000 | <1 second |
| 100,000 | ~2 seconds |
| 1,000,000 | ~20 seconds |
| 10,000,000 | ~3 minutes |
Entries vs Memory
| Entries | Peak Memory |
|---|---|
| 10,000 | ~50 MB |
| 100,000 | ~200 MB |
| 1,000,000 | ~1.5 GB |
| 10,000,000 | ~12 GB |
Memory estimates assume full in-memory processing; streaming output reduces peak memory by roughly 80%.
Server Performance
Rate Limiting
cargo run -p datasynth-server -- \
--port 3000 \
--rate-limit 1000 # Requests per minute
Connection Pooling
For high-concurrency scenarios, configure worker threads:
cargo run -p datasynth-server -- \
--worker-threads 16 # Handle more connections
WebSocket Optimization
// Client-side: batch messages
const BATCH_SIZE = 100; // Request 100 entries at a time
LLM-Augmented Generation
New in v0.5.0
LLM-augmented generation uses Large Language Models to enrich synthetic data with realistic metadata — vendor names, transaction descriptions, memo fields, and anomaly explanations — that would be difficult to generate with rule-based approaches alone.
Overview
Traditional synthetic data generators produce structurally correct but often generic-sounding text fields. LLM augmentation addresses this by using language models to generate contextually appropriate text based on the financial domain, industry, and transaction context.
DataSynth provides a pluggable provider abstraction that supports:
| Provider | Description | Use Case |
|---|---|---|
| Mock | Deterministic, no network required | CI/CD, testing, reproducible builds |
| OpenAI | OpenAI-compatible APIs (GPT-4o-mini, etc.) | Production enrichment |
| Anthropic | Anthropic API (Claude models) | Production enrichment |
| Custom | Any OpenAI-compatible endpoint | Self-hosted models, Azure OpenAI |
Provider Abstraction
All LLM functionality is built around the LlmProvider trait:
#![allow(unused)]
fn main() {
pub trait LlmProvider: Send + Sync {
fn name(&self) -> &str;
fn complete(&self, request: &LlmRequest) -> Result<LlmResponse, SynthError>;
fn complete_batch(&self, requests: &[LlmRequest]) -> Result<Vec<LlmResponse>, SynthError>;
}
}
LlmRequest
#![allow(unused)]
fn main() {
let request = LlmRequest::new("Generate a vendor name for a German auto parts manufacturer")
.with_system("You are a business data generator. Return only the company name.")
.with_seed(42)
.with_max_tokens(50)
.with_temperature(0.7);
}
| Field | Type | Default | Description |
|---|---|---|---|
prompt | String | (required) | The generation prompt |
system | Option<String> | None | System message for context |
max_tokens | u32 | 100 | Maximum response tokens |
temperature | f64 | 0.7 | Sampling temperature |
seed | Option<u64> | None | Seed for deterministic output |
LlmResponse
#![allow(unused)]
fn main() {
pub struct LlmResponse {
pub content: String, // Generated text
pub usage: TokenUsage, // Input/output token counts
pub cached: bool, // Whether result came from cache
}
}
Mock Provider
The MockLlmProvider generates deterministic, contextually-aware text without any network calls. It is the default provider and is ideal for:
- CI/CD pipelines where network access is restricted
- Reproducible builds with deterministic output
- Development and testing
- Environments where API costs are a concern
#![allow(unused)]
fn main() {
use synth_core::llm::MockLlmProvider;
let provider = MockLlmProvider::new(42); // seeded for reproducibility
}
The mock provider uses the seed and prompt content to generate plausible-sounding business names and descriptions deterministically.
HTTP Provider
The HttpLlmProvider connects to real LLM APIs:
#![allow(unused)]
fn main() {
use synth_core::llm::{HttpLlmProvider, LlmConfig, LlmProviderType};
let config = LlmConfig {
provider: LlmProviderType::OpenAi,
model: "gpt-4o-mini".to_string(),
api_key_env: "OPENAI_API_KEY".to_string(),
base_url: None,
max_retries: 3,
timeout_secs: 30,
cache_enabled: true,
};
let provider = HttpLlmProvider::new(config)?;
}
Configuration
# In your generation config
llm:
provider: openai # mock, openai, anthropic, custom
model: "gpt-4o-mini"
api_key_env: "OPENAI_API_KEY"
base_url: null # Override for custom endpoints
max_retries: 3
timeout_secs: 30
cache_enabled: true
| Field | Type | Default | Description |
|---|---|---|---|
provider | string | mock | Provider type: mock, openai, anthropic, custom |
model | string | gpt-4o-mini | Model identifier |
api_key_env | string | — | Environment variable containing the API key |
base_url | string | null | Custom API base URL (required for custom provider) |
max_retries | integer | 3 | Maximum retry attempts on failure |
timeout_secs | integer | 30 | Request timeout in seconds |
cache_enabled | bool | true | Enable prompt-level caching |
Enrichment Types
Vendor Name Enrichment
Generates realistic vendor names based on industry, spend category, and country:
#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::VendorLlmEnricher;
let enricher = VendorLlmEnricher::new(provider.clone());
let name = enricher.enrich_vendor_name("manufacturing", "raw_materials", "DE")?;
// e.g., "Rheinische Stahlwerke GmbH"
// Batch enrichment for efficiency
let names = enricher.enrich_batch(&[
("manufacturing".into(), "raw_materials".into(), "DE".into()),
("retail".into(), "logistics".into(), "US".into()),
], 42)?;
}
Transaction Description Enrichment
Generates contextually appropriate journal entry descriptions:
#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::TransactionLlmEnricher;
let enricher = TransactionLlmEnricher::new(provider.clone());
let desc = enricher.enrich_description(
"Office Supplies", // account name
"1000-5000", // amount range
"retail", // industry
3, // fiscal period
)?;
let memo = enricher.enrich_memo(
"VendorInvoice", // document type
"Acme Corp", // vendor name
"2500.00", // amount
)?;
}
Anomaly Explanation
Generates natural language explanations for injected anomalies:
#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::AnomalyLlmExplainer;
let explainer = AnomalyLlmExplainer::new(provider.clone());
let explanation = explainer.explain(
"DuplicatePayment", // anomaly type
3, // affected records
"Same amount, same vendor, 2 days apart", // statistical details
)?;
}
Natural Language Configuration
The NlConfigGenerator converts natural language descriptions into YAML configuration:
#![allow(unused)]
fn main() {
use synth_core::llm::NlConfigGenerator;
let yaml = NlConfigGenerator::generate(
"Generate 1 year of retail data for a mid-size US company with fraud patterns",
&provider,
)?;
}
CLI Usage
datasynth-data init \
--from-description "Generate 1 year of manufacturing data for a German mid-cap with intercompany transactions" \
-o config.yaml
The generator parses intent into structured fields:
#![allow(unused)]
fn main() {
pub struct ConfigIntent {
pub industry: Option<String>, // e.g., "manufacturing"
pub country: Option<String>, // e.g., "DE"
pub company_size: Option<String>, // e.g., "mid-cap"
pub period_months: Option<u32>, // e.g., 12
pub features: Vec<String>, // e.g., ["intercompany"]
}
}
Caching
The LlmCache deduplicates identical prompts using FNV-1a hashing:
#![allow(unused)]
fn main() {
use synth_core::llm::LlmCache;
let cache = LlmCache::new(10000); // max 10,000 entries
let key = LlmCache::cache_key("prompt text", Some("system"), Some(42));
cache.insert(key, "cached response".into());
if let Some(response) = cache.get(key) {
// Use cached response
}
}
Caching is enabled by default and significantly reduces API costs when generating similar entities.
Cost and Privacy Considerations
Cost Management
- Use the Mock provider for development and CI/CD (zero cost)
- Enable caching to avoid duplicate API calls
- Use batch enrichment (`complete_batch`) to reduce per-request overhead
- Set appropriate `max_tokens` limits to control response sizes
- Consider `gpt-4o-mini` or similar efficient models for bulk enrichment
Privacy
- LLM prompts contain only synthetic context (industry, category, amount ranges) — never real data
- No PII or sensitive information is sent to LLM providers
- The Mock provider keeps everything local with no network traffic
- For maximum privacy, use self-hosted models via the `custom` provider type
See Also
- AI & ML Configuration
- LLM Training Data Use Case
- datasynth-core LLM Module
- datasynth-generators LLM Enrichment
Diffusion Models
New in v0.5.0
DataSynth integrates a statistical diffusion model backend for learned distribution capture, offering an alternative and complement to rule-based generation.
Overview
Diffusion models generate data through a learned denoising process: starting from pure noise and iteratively removing it to produce realistic samples. DataSynth’s implementation uses a statistical backend that captures column-level distributions and inter-column correlations from fingerprint data, then generates new samples through a configurable noise schedule.
Forward Process (Training): x₀ → x₁ → x₂ → ... → xₜ (pure noise)
Reverse Process (Generation): xₜ → xₜ₋₁ → ... → x₁ → x₀ (data)
Architecture
DiffusionBackend Trait
All diffusion backends implement a common interface:
#![allow(unused)]
fn main() {
pub trait DiffusionBackend: Send + Sync {
fn name(&self) -> &str;
fn forward(&self, x: &[Vec<f64>], t: usize) -> Vec<Vec<f64>>;
fn reverse(&self, x_t: &[Vec<f64>], t: usize) -> Vec<Vec<f64>>;
fn generate(&self, n_samples: usize, n_features: usize, seed: u64) -> Vec<Vec<f64>>;
}
}
Statistical Diffusion Backend
The StatisticalDiffusionBackend uses per-column means and standard deviations (extracted from fingerprint data) to guide the denoising process:
#![allow(unused)]
fn main() {
use synth_core::diffusion::{StatisticalDiffusionBackend, DiffusionConfig, NoiseScheduleType};
let config = DiffusionConfig {
n_steps: 1000,
schedule: NoiseScheduleType::Cosine,
seed: 42,
};
let backend = StatisticalDiffusionBackend::new(
vec![5000.0, 3.5, 2.0], // column means
vec![2000.0, 1.5, 0.8], // column standard deviations
config,
);
// Optionally add correlation structure
let backend = backend.with_correlations(vec![
vec![1.0, 0.65, 0.72],
vec![0.65, 1.0, 0.55],
vec![0.72, 0.55, 1.0],
]);
let samples = backend.generate(1000, 3, 42);
}
Noise Schedules
The noise schedule controls how noise is added during the forward process and removed during the reverse process.
| Schedule | Formula | Characteristics |
|---|---|---|
| Linear | β_t = β_min + t/T × (β_max - β_min) | Uniform noise addition; simple and robust |
| Cosine | β_t = 1 - ᾱ_t/ᾱ_{t-1}, ᾱ_t = cos²(π/2 × t/T) | Slower noise addition; better for preserving fine details |
| Sigmoid | β_t = sigmoid(a + (b-a) × t/T) | Smooth transition; balanced between linear and cosine |
#![allow(unused)]
fn main() {
use synth_core::diffusion::{NoiseSchedule, NoiseScheduleType};
let schedule = NoiseSchedule::new(&NoiseScheduleType::Cosine, 1000);
// Access schedule components
println!("Steps: {}", schedule.n_steps());
println!("First beta: {}", schedule.betas[0]);
println!("Last alpha_bar: {}", schedule.alpha_bars[999]);
}
Schedule Properties
The NoiseSchedule precomputes all values needed for efficient forward/reverse steps:
| Property | Description |
|---|---|
betas | Noise variance at each step |
alphas | 1 - beta at each step |
alpha_bars | Cumulative product of alphas |
sqrt_alpha_bars | √(ᾱ_t) for forward process |
sqrt_one_minus_alpha_bars | √(1 - ᾱ_t) for noise scaling |
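The sketch below shows how these precomputed values relate for a linear schedule and how the forward step uses them. The beta_min/beta_max values are illustrative assumptions, not library defaults.
#![allow(unused)]
fn main() {
    let (n_steps, beta_min, beta_max) = (1000usize, 1e-4_f64, 0.02);
    let betas: Vec<f64> = (0..n_steps)
        .map(|t| beta_min + (t as f64 / (n_steps - 1) as f64) * (beta_max - beta_min))
        .collect();
    let alphas: Vec<f64> = betas.iter().map(|b| 1.0 - b).collect();
    let alpha_bars: Vec<f64> = alphas.iter()
        .scan(1.0, |acc, a| { *acc *= *a; Some(*acc) }) // cumulative product
        .collect();
    // Forward step: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    let (t, x0, noise) = (500usize, 1.2_f64, -0.3_f64);
    let x_t = alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise;
    println!("alpha_bar[{t}] = {:.4}, x_t = {:.4}", alpha_bars[t], x_t);
}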
Hybrid Generation
The HybridGenerator blends rule-based and diffusion-generated data to combine the structural guarantees of rule-based generation with the distributional fidelity of diffusion models.
Blend Strategies
| Strategy | Description | Best For |
|---|---|---|
| Interpolate | Weighted average: w × diffusion + (1-w) × rule_based | Smooth blending of continuous values |
| Select | Per-record random selection from either source | Maintaining distinct record characteristics |
| Ensemble | Column-level: diffusion for amounts, rule-based for categoricals | Mixed-type data with different generation needs |
#![allow(unused)]
fn main() {
use synth_core::diffusion::{HybridGenerator, BlendStrategy};
let hybrid = HybridGenerator::new(0.3); // 30% diffusion weight
println!("Weight: {}", hybrid.weight());
// Interpolation blend
let blended = hybrid.blend(
&rule_based_data,
&diffusion_data,
BlendStrategy::Interpolate,
42,
);
// Ensemble blend (specify which columns use diffusion)
let ensemble = hybrid.blend_ensemble(
&rule_based_data,
&diffusion_data,
&[0, 2], // columns 0 and 2 from diffusion
);
}
Training Pipeline
The DiffusionTrainer fits a model from column-level parameters and correlation matrices (typically extracted from a fingerprint):
Training
#![allow(unused)]
fn main() {
use synth_core::diffusion::{DiffusionTrainer, ColumnDiffusionParams, ColumnType, DiffusionConfig};
let params = vec![
ColumnDiffusionParams {
name: "amount".into(),
mean: 5000.0,
std: 2000.0,
min: 0.0,
max: 100000.0,
col_type: ColumnType::Continuous,
},
ColumnDiffusionParams {
name: "line_items".into(),
mean: 3.5,
std: 1.5,
min: 1.0,
max: 20.0,
col_type: ColumnType::Integer,
},
];
let corr_matrix = vec![
vec![1.0, 0.65],
vec![0.65, 1.0],
];
let config = DiffusionConfig { n_steps: 1000, schedule: NoiseScheduleType::Cosine, seed: 42 };
let model = DiffusionTrainer::fit(params, corr_matrix, config);
}
Generation from Trained Model
#![allow(unused)]
fn main() {
let samples = model.generate(5000, 42);
// Save/load model
model.save(Path::new("./model.json"))?;
let loaded = TrainedDiffusionModel::load(Path::new("./model.json"))?;
}
Evaluation
#![allow(unused)]
fn main() {
let report = DiffusionTrainer::evaluate(&model, 5000, 42);
println!("Overall score: {:.3}", report.overall_score);
println!("Correlation error: {:.4}", report.correlation_error);
for (i, (mean_err, std_err)) in report.mean_errors.iter().zip(&report.std_errors).enumerate() {
println!("Column {}: mean_err={:.4}, std_err={:.4}", i, mean_err, std_err);
}
}
The FitReport contains:
| Metric | Description |
|---|---|
mean_errors | Per-column mean absolute error |
std_errors | Per-column standard deviation error |
correlation_error | RMSE of correlation matrix |
overall_score | Weighted composite score (0-1, higher is better) |
CLI Usage
Train a Model
datasynth-data diffusion train \
--fingerprint ./fingerprint.dsf \
--output ./model.json \
--n-steps 1000 \
--schedule cosine
Evaluate a Model
datasynth-data diffusion evaluate \
--model ./model.json \
--samples 5000
Configuration
diffusion:
enabled: true
n_steps: 1000 # Number of diffusion steps
schedule: "cosine" # Noise schedule: linear, cosine, sigmoid
sample_size: 1000 # Samples to generate
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable diffusion generation |
n_steps | integer | 1000 | Forward/reverse diffusion steps |
schedule | string | "linear" | Noise schedule type |
sample_size | integer | 1000 | Number of samples |
Utility Functions
DataSynth provides helper functions for working with diffusion data:
#![allow(unused)]
fn main() {
use synth_core::diffusion::{
add_gaussian_noise, normalize_features, denormalize_features,
clip_values, generate_noise,
};
// Normalize data to zero mean, unit variance
let (normalized, means, stds) = normalize_features(&data);
// Add calibrated noise
let noisy = add_gaussian_noise(&normalized[0], 0.1, &mut rng);
// Denormalize back to original scale
let original_scale = denormalize_features(&generated, &means, &stds);
// Clip to valid ranges
clip_values(&mut samples, 0.0, 100000.0);
}
Causal & Counterfactual Generation
New in v0.5.0
DataSynth supports Structural Causal Models (SCMs) for generating data with explicit causal structure, running interventional “what-if” scenarios, and producing counterfactual records.
Overview
Traditional synthetic data generators capture correlations but not causation. Causal generation lets you:
- Define causal relationships between variables (e.g., “transaction amount causes approval level”)
- Generate observational data that follows the causal structure
- Run interventions to answer “what if?” questions (do-calculus)
- Produce counterfactuals — “what would have happened if X were different?”
This is particularly valuable for fraud detection, audit analytics, and regulatory “what-if” scenario testing.
Causal Graph
A causal graph defines variables and the directed edges (causal mechanisms) between them.
Variables
#![allow(unused)]
fn main() {
use synth_core::causal::{CausalVariable, CausalVarType};
let var = CausalVariable::new("transaction_amount", CausalVarType::Continuous)
.with_distribution("lognormal")
.with_param("mu", 8.0)
.with_param("sigma", 1.5);
}
| Variable Type | Description | Example |
|---|---|---|
Continuous | Real-valued | Transaction amount, revenue |
Categorical | Discrete categories | Industry, department |
Count | Non-negative integers | Line items, approvals |
Binary | Boolean (0/1) | Fraud flag, approval status |
Causal Mechanisms
Edges between variables define how a parent causally affects a child:
#![allow(unused)]
fn main() {
use synth_core::causal::{CausalEdge, CausalMechanism};
let edge = CausalEdge {
from: "transaction_amount".into(),
to: "approval_level".into(),
mechanism: CausalMechanism::Logistic { scale: 0.001, midpoint: 50000.0 },
strength: 1.0,
};
}
| Mechanism | Formula | Use Case |
|---|---|---|
Linear { coefficient } | y += coefficient × parent | Proportional effects |
Threshold { cutoff } | y = 1 if parent > cutoff, else 0 | Binary triggers |
Polynomial { coefficients } | y += Σ coefficients[i] × parent^i | Non-linear effects |
Logistic { scale, midpoint } | y += 1 / (1 + e^(-scale × (parent - midpoint))) | S-curve effects |
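The sketch below evaluates the mechanism formulas from the table as plain functions to show how a parent value contributes to a child. The helpers are illustrative only; the crate models these as CausalMechanism variants.
#![allow(unused)]
fn main() {
    fn linear(parent: f64, coefficient: f64) -> f64 { coefficient * parent }
    fn threshold(parent: f64, cutoff: f64) -> f64 { if parent > cutoff { 1.0 } else { 0.0 } }
    fn logistic(parent: f64, scale: f64, midpoint: f64) -> f64 {
        1.0 / (1.0 + (-scale * (parent - midpoint)).exp())
    }

    // e.g. the contribution of a 75,000 transaction amount to fraud_flag through
    // a logistic edge with scale 0.0001 and midpoint 50,000:
    let contribution = logistic(75_000.0, 0.0001, 50_000.0);
    println!("logistic contribution = {contribution:.3}"); // ≈ 0.924
}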
Building a Graph
#![allow(unused)]
fn main() {
use synth_core::causal::{CausalGraph, CausalVariable, CausalVarType, CausalEdge, CausalMechanism};
let mut graph = CausalGraph::new();
// Add variables
graph.add_variable(
CausalVariable::new("transaction_amount", CausalVarType::Continuous)
.with_distribution("lognormal")
.with_param("mu", 8.0)
.with_param("sigma", 1.5)
);
graph.add_variable(
CausalVariable::new("approval_level", CausalVarType::Count)
.with_distribution("normal")
.with_param("mean", 1.0)
.with_param("std", 0.5)
);
graph.add_variable(
CausalVariable::new("fraud_flag", CausalVarType::Binary)
);
// Add causal edges
graph.add_edge(CausalEdge {
from: "transaction_amount".into(),
to: "approval_level".into(),
mechanism: CausalMechanism::Linear { coefficient: 0.00005 },
strength: 1.0,
});
graph.add_edge(CausalEdge {
from: "transaction_amount".into(),
to: "fraud_flag".into(),
mechanism: CausalMechanism::Logistic { scale: 0.0001, midpoint: 50000.0 },
strength: 0.8,
});
// Validate (checks for cycles, missing variables)
graph.validate()?;
}
Built-in Templates
DataSynth includes pre-configured causal graphs for common financial scenarios:
Fraud Detection Template
#![allow(unused)]
fn main() {
let graph = CausalGraph::fraud_detection_template();
}
Variables: transaction_amount, approval_level, vendor_risk, fraud_flag
Causal structure:
- `transaction_amount` → `approval_level` (linear)
- `transaction_amount` → `fraud_flag` (logistic)
- `vendor_risk` → `fraud_flag` (linear)
Revenue Cycle Template
#![allow(unused)]
fn main() {
let graph = CausalGraph::revenue_cycle_template();
}
Variables: order_size, credit_score, payment_delay, revenue
Causal structure:
- `order_size` → `revenue` (linear)
- `credit_score` → `payment_delay` (linear, negative)
- `order_size` → `payment_delay` (linear)
Structural Causal Model (SCM)
The SCM wraps a causal graph and provides generation capabilities:
#![allow(unused)]
fn main() {
use synth_core::causal::StructuralCausalModel;
let scm = StructuralCausalModel::new(graph)?;
// Generate observational data
let samples = scm.generate(10000, 42)?;
// samples: Vec<HashMap<String, f64>>
for sample in &samples[..3] {
println!("Amount: {:.2}, Approval: {:.0}, Fraud: {:.0}",
sample["transaction_amount"],
sample["approval_level"],
sample["fraud_flag"],
);
}
}
Data is generated in topological order — root variables are sampled from their distributions first, then child variables are computed based on their parents’ values and the causal mechanisms.
Interventions (Do-Calculus)
Interventions answer “what would happen if we force variable X to value V?”, cutting all incoming causal edges to X.
Single Intervention
#![allow(unused)]
fn main() {
let intervened = scm.intervene("transaction_amount", 50000.0)?;
let samples = intervened.generate(5000, 42)?;
}
Multiple Interventions
#![allow(unused)]
fn main() {
let intervened = scm
.intervene("transaction_amount", 50000.0)?
.and_intervene("vendor_risk", 0.9);
let samples = intervened.generate(5000, 42)?;
}
Intervention Engine with Effect Estimation
#![allow(unused)]
fn main() {
use synth_core::causal::InterventionEngine;
let engine = InterventionEngine::new(scm);
let result = engine.do_intervention(
&[("transaction_amount".into(), 50000.0)],
5000, // samples
42, // seed
)?;
// Compare baseline vs intervention
println!("Baseline fraud rate: {:.4}",
result.baseline_samples.iter()
.map(|s| s["fraud_flag"])
.sum::<f64>() / result.baseline_samples.len() as f64
);
// Effect estimates with confidence intervals
for (var, effect) in &result.effect_estimates {
println!("{}: ATE={:.4}, 95% CI=({:.4}, {:.4})",
var,
effect.average_treatment_effect,
effect.confidence_interval.0,
effect.confidence_interval.1,
);
}
}
The InterventionResult contains:
| Field | Description |
|---|---|
baseline_samples | Data generated without intervention |
intervened_samples | Data generated with the intervention applied |
effect_estimates | Per-variable average treatment effects with confidence intervals |
Counterfactual Generation
Counterfactuals answer “what would have happened to this specific record if X were different?” using the abduction-action-prediction framework:
- Abduction: Infer the latent noise variables from the factual observation
- Action: Apply the intervention (change X to new value)
- Prediction: Propagate through the SCM with inferred noise
#![allow(unused)]
fn main() {
use synth_core::causal::CounterfactualGenerator;
use std::collections::HashMap;
let cf_gen = CounterfactualGenerator::new(scm);
// Factual record
let factual: HashMap<String, f64> = [
("transaction_amount".to_string(), 5000.0),
("approval_level".to_string(), 1.0),
("fraud_flag".to_string(), 0.0),
].into_iter().collect();
// What if the amount had been 100,000?
let counterfactual = cf_gen.generate_counterfactual(
&factual,
"transaction_amount",
100000.0,
42,
)?;
println!("Factual fraud_flag: {}", factual["fraud_flag"]);
println!("Counterfactual fraud_flag: {}", counterfactual["fraud_flag"]);
}
Batch Counterfactuals
#![allow(unused)]
fn main() {
let pairs = cf_gen.generate_batch_counterfactuals(
&factual_records,
"transaction_amount",
100000.0,
42,
)?;
for pair in &pairs {
println!("Changed variables: {:?}", pair.changed_variables);
}
}
Each CounterfactualPair contains:
| Field | Description |
|---|---|
factual | The original observation |
counterfactual | The counterfactual version |
changed_variables | List of variables that changed |
Causal Validation
Validate that generated data preserves the specified causal structure:
#![allow(unused)]
fn main() {
use synth_core::causal::CausalValidator;
let report = CausalValidator::validate_causal_structure(&samples, &graph);
println!("Valid: {}", report.valid);
for check in &report.checks {
println!("{}: {} — {}", check.name, if check.passed { "PASS" } else { "FAIL" }, check.details);
}
if !report.violations.is_empty() {
println!("Violations: {:?}", report.violations);
}
}
The validator checks:
- Causal edge directions are respected (parent-child correlations)
- Independence constraints hold (non-adjacent variables)
- Intervention effects are consistent with the graph structure
CLI Usage
Generate Observational Data
datasynth-data causal generate \
--template fraud_detection \
--samples 10000 \
--seed 42 \
--output ./causal_output
Run Interventions
datasynth-data causal intervene \
--template fraud_detection \
--variable transaction_amount \
--value 50000 \
--samples 5000 \
--output ./intervention_output
Validate Causal Structure
datasynth-data causal validate \
--data ./causal_output \
--template fraud_detection
Configuration
causal:
enabled: true
template: "fraud_detection" # or "revenue_cycle" or path to custom YAML
sample_size: 10000
validate: true # validate causal structure in output
Custom Causal Graph YAML
# custom_graph.yaml
variables:
- name: order_size
type: continuous
distribution: lognormal
params:
mu: 7.0
sigma: 1.2
- name: discount_rate
type: continuous
distribution: beta
params:
alpha: 2.0
beta: 8.0
- name: revenue
type: continuous
edges:
- from: order_size
to: revenue
mechanism:
type: linear
coefficient: 0.95
- from: discount_rate
to: revenue
mechanism:
type: linear
coefficient: -5000.0
Federated Fingerprinting
New in v0.5.0
Federated fingerprinting enables extracting statistical fingerprints from multiple distributed data sources and combining them without centralizing the raw data.
Overview
In many enterprise environments, data is distributed across multiple systems, departments, or legal entities that cannot share raw data due to privacy regulations or data governance policies. Federated fingerprinting addresses this by:
- Local extraction: Each data source extracts a partial fingerprint with its own differential privacy budget
- Secure aggregation: Partial fingerprints are combined using a configurable aggregation strategy
- Privacy composition: The total privacy budget is tracked across all sources
Source A → [Extract + Local DP] → Partial FP A ─┐
Source B → [Extract + Local DP] → Partial FP B ─┼→ [Aggregate] → Combined FP → [Generate]
Source C → [Extract + Local DP] → Partial FP C ─┘
Partial Fingerprints
Each data source produces a PartialFingerprint containing noised statistics:
#![allow(unused)]
fn main() {
pub struct PartialFingerprint {
pub source_id: String, // Identifier for this data source
pub local_epsilon: f64, // DP epsilon budget spent locally
pub record_count: u64, // Number of records in source
pub column_names: Vec<String>, // Column identifiers
pub means: Vec<f64>, // Per-column means (noised)
pub stds: Vec<f64>, // Per-column standard deviations (noised)
pub mins: Vec<f64>, // Per-column minimums (noised)
pub maxs: Vec<f64>, // Per-column maximums (noised)
pub correlations: Vec<f64>, // Flat row-major correlation matrix (noised)
}
}
Creating a Partial Fingerprint
#![allow(unused)]
fn main() {
use datasynth_fingerprint::federated::FederatedFingerprintProtocol;
let partial = FederatedFingerprintProtocol::create_partial(
"department_finance", // source ID
vec!["amount".into(), "line_items".into()], // columns
50000, // record count
vec![8500.0, 3.2], // means
vec![4200.0, 1.8], // standard deviations
1.0, // local epsilon budget
);
}
Aggregation Methods
| Method | Description | Properties |
|---|---|---|
| WeightedAverage | Weighted by record count | Best for balanced sources |
| Median | Median across sources | Robust to outlier sources |
| TrimmedMean | Mean after removing extremes | Balances robustness and efficiency |
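A minimal sketch of the WeightedAverage strategy for per-column means, weighting each source by its record count. The Source struct is illustrative and not the protocol's internal representation.
#![allow(unused)]
fn main() {
    struct Source { record_count: u64, means: Vec<f64> }
    let sources = vec![
        Source { record_count: 10_000, means: vec![5000.0, 3.0] },
        Source { record_count: 8_000,  means: vec![4500.0, 2.8] },
        Source { record_count: 12_000, means: vec![5500.0, 3.3] },
    ];
    let total: f64 = sources.iter().map(|s| s.record_count as f64).sum();
    let n_cols = sources[0].means.len();
    let aggregated: Vec<f64> = (0..n_cols)
        .map(|c| {
            sources.iter()
                .map(|s| s.means[c] * s.record_count as f64 / total)
                .sum()
        })
        .collect();
    println!("aggregated means = {aggregated:?}"); // ≈ [5066.7, 3.07]
}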
Protocol Usage
#![allow(unused)]
fn main() {
use datasynth_fingerprint::federated::{
FederatedFingerprintProtocol, FederatedConfig, AggregationMethod,
};
// Configure the protocol
let config = FederatedConfig {
min_sources: 2, // Minimum sources required
max_epsilon_per_source: 5.0, // Max DP budget per source
aggregation_method: AggregationMethod::WeightedAverage,
};
let protocol = FederatedFingerprintProtocol::new(config);
// Collect partial fingerprints from each source
let partial_a = FederatedFingerprintProtocol::create_partial(
"source_a", vec!["amount".into(), "count".into()],
10000, vec![5000.0, 3.0], vec![2000.0, 1.5], 1.0,
);
let partial_b = FederatedFingerprintProtocol::create_partial(
"source_b", vec!["amount".into(), "count".into()],
8000, vec![4500.0, 2.8], vec![1800.0, 1.2], 1.0,
);
let partial_c = FederatedFingerprintProtocol::create_partial(
"source_c", vec!["amount".into(), "count".into()],
12000, vec![5500.0, 3.3], vec![2200.0, 1.7], 1.0,
);
// Aggregate
let aggregated = protocol.aggregate(&[partial_a, partial_b, partial_c])?;
println!("Total records: {}", aggregated.total_record_count); // 30000
println!("Total epsilon: {}", aggregated.total_epsilon); // 3.0 (sum)
println!("Sources: {}", aggregated.source_count); // 3
}
Aggregated Fingerprint
The AggregatedFingerprint contains the combined statistics:
#![allow(unused)]
fn main() {
pub struct AggregatedFingerprint {
pub column_names: Vec<String>,
pub means: Vec<f64>, // Aggregated means
pub stds: Vec<f64>, // Aggregated standard deviations
pub mins: Vec<f64>, // Aggregated minimums
pub maxs: Vec<f64>, // Aggregated maximums
pub correlations: Vec<f64>, // Aggregated correlation matrix
pub total_record_count: u64, // Sum across all sources
pub total_epsilon: f64, // Sum of local epsilons
pub source_count: usize, // Number of contributing sources
}
}
Privacy Budget Composition
The total privacy budget is the sum of local epsilons across all sources. This follows sequential composition — each source’s local DP guarantee composes with the others.
For example, if three sources each spend ε=1.0 locally, the total privacy cost of the aggregated fingerprint is ε=3.0 under sequential composition.
To minimize total budget:
- Use the lowest `local_epsilon` that provides sufficient utility
- Prefer fewer sources with more records over many sources with few records
- Use `max_epsilon_per_source` to enforce per-source budget caps
CLI Usage
# Aggregate fingerprints from multiple sources
datasynth-data fingerprint federated \
--sources ./finance.dsf ./operations.dsf ./sales.dsf \
--output ./aggregated.dsf \
--method weighted_average \
--max-epsilon 5.0
# Then generate from the aggregated fingerprint
datasynth-data generate \
--fingerprint ./aggregated.dsf \
--output ./synthetic_output
Configuration
# Federated config is specified per-invocation via CLI flags
# The aggregation method and privacy budget are controlled at execution time
| CLI Flag | Default | Description |
|---|---|---|
--sources | (required) | Two or more .dsf fingerprint files |
--output | (required) | Output path for aggregated fingerprint |
--method | weighted_average | Aggregation strategy |
--max-epsilon | 5.0 | Maximum epsilon per source |
See Also
- Fingerprinting Guide
- Synthetic Data Certificates
- datasynth-fingerprint Crate
- Privacy & Regulatory Compliance
Synthetic Data Certificates
New in v0.5.0
Synthetic data certificates provide cryptographic proof of the privacy guarantees and quality metrics associated with generated data.
Overview
As synthetic data becomes increasingly used in regulated industries, organizations need verifiable assurance that:
- The data was generated with specific differential privacy guarantees
- Quality metrics meet documented thresholds
- The generation configuration hasn’t been tampered with
- The certificate itself is authentic (HMAC-SHA256 signed)
Certificates are produced during generation and can be embedded in output files or distributed alongside them.
Certificate Structure
#![allow(unused)]
fn main() {
pub struct SyntheticDataCertificate {
pub certificate_id: String, // Unique certificate identifier
pub generation_timestamp: String, // ISO 8601 timestamp
pub generator_version: String, // DataSynth version
pub config_hash: String, // SHA-256 hash of generation config
pub seed: Option<u64>, // RNG seed for reproducibility
pub dp_guarantee: Option<DpGuarantee>,
pub quality_metrics: Option<QualityMetrics>,
pub fingerprint_hash: Option<String>, // Source fingerprint hash
pub issuer: String, // Certificate issuer
pub signature: Option<String>, // HMAC-SHA256 signature
}
}
DP Guarantee
#![allow(unused)]
fn main() {
pub struct DpGuarantee {
pub mechanism: String, // "Laplace" or "Gaussian"
pub epsilon: f64, // Privacy budget spent
pub delta: Option<f64>, // For (ε,δ)-DP
pub composition_method: String, // "sequential", "advanced", "rdp"
pub total_queries: u32, // Number of DP queries made
}
}
Quality Metrics
#![allow(unused)]
fn main() {
pub struct QualityMetrics {
pub benford_mad: Option<f64>, // Mean Absolute Deviation from Benford's Law
pub correlation_preservation: Option<f64>, // Correlation matrix similarity (0-1)
pub statistical_fidelity: Option<f64>, // Overall statistical fidelity score (0-1)
pub mia_auc: Option<f64>, // Membership Inference Attack AUC (closer to 0.5 = better privacy)
}
}
Building Certificates
Use the CertificateBuilder for fluent construction:
#![allow(unused)]
fn main() {
use datasynth_fingerprint::certificates::{
CertificateBuilder, DpGuarantee, QualityMetrics,
};
let cert = CertificateBuilder::new("DataSynth v0.5.0")
.with_dp_guarantee(DpGuarantee {
mechanism: "Laplace".into(),
epsilon: 1.0,
delta: None,
composition_method: "sequential".into(),
total_queries: 50,
})
.with_quality_metrics(QualityMetrics {
benford_mad: Some(0.008),
correlation_preservation: Some(0.95),
statistical_fidelity: Some(0.92),
mia_auc: Some(0.52),
})
.with_config_hash("sha256:abc123...")
.with_seed(42)
.with_fingerprint_hash("sha256:def456...")
.with_generator_version("0.5.0")
.build();
}
Signing and Verification
Certificates are signed using HMAC-SHA256:
#![allow(unused)]
fn main() {
use datasynth_fingerprint::certificates::{sign_certificate, verify_certificate};
// Sign
sign_certificate(&mut cert, "my-secret-key-material");
// Verify
let valid = verify_certificate(&cert, "my-secret-key-material");
assert!(valid);
// Tampered certificate fails verification
cert.dp_guarantee.as_mut().unwrap().epsilon = 0.001; // tamper
assert!(!verify_certificate(&cert, "my-secret-key-material"));
}
Output Embedding
Certificates can be:
- Standalone JSON: Written as `certificate.json` in the output directory
- Parquet metadata: Embedded in Parquet file metadata under the `datasynth_certificate` key
- JSON metadata: Included in the generation manifest
CLI Usage
# Generate data with certificate
datasynth-data generate \
--config config.yaml \
--output ./output \
--certificate
# Certificate is written to ./output/certificate.json
Configuration
certificates:
enabled: true
issuer: "DataSynth"
include_quality_metrics: true
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable certificate generation |
issuer | string | "DataSynth" | Issuer identity |
include_quality_metrics | bool | true | Include quality metrics in certificate |
Privacy-Utility Pareto Frontier
The ParetoFrontier helps find optimal privacy-utility tradeoffs:
#![allow(unused)]
fn main() {
use datasynth_fingerprint::privacy::pareto::{ParetoFrontier, ParetoPoint};
let epsilons = vec![0.1, 0.5, 1.0, 2.0, 5.0, 10.0];
let points = ParetoFrontier::explore(&epsilons, |epsilon| {
// Evaluate utility at this epsilon level
ParetoPoint {
epsilon,
delta: None,
utility_score: compute_utility(epsilon),
benford_mad: compute_benford(epsilon),
correlation_score: compute_correlation(epsilon),
}
});
// Recommend epsilon for target utility
if let Some(recommended_epsilon) = ParetoFrontier::recommend(&points, 0.90) {
println!("For 90% utility, use epsilon = {:.2}", recommended_epsilon);
}
}
The frontier identifies non-dominated points where no other configuration achieves both better privacy and better utility.
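A minimal sketch of the non-dominance test, under the assumption that lower epsilon and higher utility are both preferred; the Point struct is illustrative, not the ParetoPoint type.
#![allow(unused)]
fn main() {
    // Keep a point only if no other point has both lower epsilon and higher utility.
    #[derive(Clone, Copy, Debug)]
    struct Point { epsilon: f64, utility: f64 }
    let points = [
        Point { epsilon: 0.5, utility: 0.70 },
        Point { epsilon: 1.0, utility: 0.85 },
        Point { epsilon: 2.0, utility: 0.84 }, // dominated by the epsilon = 1.0 point
        Point { epsilon: 5.0, utility: 0.93 },
    ];
    let frontier: Vec<Point> = points.iter().copied()
        .filter(|p| !points.iter().any(|q|
            q.epsilon < p.epsilon && q.utility > p.utility))
        .collect();
    println!("{frontier:?}"); // the epsilon = 2.0 point is dropped
}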
Deployment & Operations
This section covers everything you need to deploy, operate, and maintain DataSynth in production environments.
Deployment Options
DataSynth supports three deployment models, each suited to different operational requirements:
| Method | Best For | Scaling | Complexity |
|---|---|---|---|
| Docker / Compose | Small teams, dev/staging, single-node | Vertical | Low |
| Kubernetes / Helm | Production, multi-tenant, auto-scaling | Horizontal | Medium |
| Bare Metal / SystemD | Regulated environments, air-gapped networks | Vertical | Low |
Architecture at a Glance
DataSynth server exposes two network interfaces:
- REST API on port 3000 – configuration, bulk generation, streaming control, health probes, Prometheus metrics
- gRPC API on port 50051 – high-throughput generation for programmatic clients
Both share an in-process ServerState with atomic counters, so a single process can serve REST, gRPC, and WebSocket clients concurrently.
Operations Guides
| Guide | Description |
|---|---|
| Operational Runbook | Grafana dashboards, alert response, troubleshooting, log analysis |
| Capacity Planning | Sizing model, reference benchmarks, disk and memory estimates |
| Disaster Recovery | Backup procedures, deterministic replay, stateless restart |
Security & API
| Guide | Description |
|---|---|
| API Reference | Endpoints, authentication, rate limiting, WebSocket protocol, error formats |
| Security Hardening | Pre-deployment checklist, TLS/mTLS, secrets, container security, audit logging |
| TLS & Reverse Proxy | Nginx, Envoy, and native TLS configuration |
Quick Decision Tree
- Need auto-scaling or HA? – Use Kubernetes.
- Single server, want observability? – Use Docker Compose with the full stack (Prometheus + Grafana).
- Air-gapped or compliance-restricted? – Use Bare Metal with SystemD.
Docker Deployment
This guide walks through building, configuring, and running DataSynth as Docker containers.
Prerequisites
- Docker Engine 24+ (or Docker Desktop 4.25+)
- Docker Compose v2
- 2 GB RAM minimum (4 GB recommended)
- 10 GB disk for images and generated data
Images
DataSynth provides two container images:
| Image | Dockerfile | Purpose |
|---|---|---|
datasynth/datasynth-server | Dockerfile | Server (REST + gRPC + WebSocket) |
datasynth/datasynth-cli | Dockerfile.cli | CLI for batch generation jobs |
Multi-Stage Build Walkthrough
The server Dockerfile uses a four-stage build with cargo-chef for dependency caching:
Stage 1: chef -- installs cargo-chef on rust:1.88-bookworm
Stage 2: planner -- computes recipe.json from Cargo.lock
Stage 3: builder -- cooks dependencies (cached), then builds datasynth-server + datasynth-data
Stage 4: runtime -- copies binaries into gcr.io/distroless/cc-debian12
Build the server image:
docker build -t datasynth/datasynth-server:0.5.0 .
Build the CLI-only image:
docker build -t datasynth/datasynth-cli:0.5.0 -f Dockerfile.cli .
Build Arguments and Features
To enable optional features (TLS, Redis rate limiting, OpenTelemetry), modify the build command in the builder stage. For example, to enable Redis:
# In the builder stage, replace the cargo build line:
RUN cargo build --release -p datasynth-server -p datasynth-cli --features redis
Image Size
The distroless runtime image is approximately 40-60 MB. The build cache layer with cooked dependencies significantly speeds up rebuilds when only application code changes.
Docker Compose Stack
The repository includes a production-ready docker-compose.yml with the full observability stack:
services:
datasynth-server:
build:
context: .
dockerfile: Dockerfile
ports:
- "50051:50051" # gRPC
- "3000:3000" # REST
environment:
- RUST_LOG=info
- DATASYNTH_API_KEYS=${DATASYNTH_API_KEYS:-}
healthcheck:
test: ["CMD", "/usr/local/bin/datasynth-data", "--help"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
deploy:
resources:
limits:
memory: 2G
cpus: "2.0"
restart: unless-stopped
redis:
image: redis:7-alpine
profiles:
- redis
ports:
- "6379:6379"
command: >
redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
deploy:
resources:
limits:
memory: 256M
cpus: "0.5"
volumes:
- redis-data:/data
restart: unless-stopped
prometheus:
image: prom/prometheus:v2.51.0
ports:
- "9090:9090"
volumes:
- ./deploy/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./deploy/prometheus-alerts.yml:/etc/prometheus/alerts.yml:ro
- prometheus-data:/prometheus
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.retention.time=30d"
restart: unless-stopped
grafana:
image: grafana/grafana:10.4.0
ports:
- "3001:3000"
volumes:
- ./deploy/grafana/provisioning:/etc/grafana/provisioning:ro
- grafana-data:/var/lib/grafana
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
- GF_USERS_ALLOW_SIGN_UP=false
restart: unless-stopped
volumes:
prometheus-data:
grafana-data:
redis-data:
Starting the Stack
Basic server only:
docker compose up -d datasynth-server
Full observability stack (server + Prometheus + Grafana):
docker compose up -d
With Redis for distributed rate limiting:
docker compose --profile redis up -d
Verifying the Deployment
# Health check
curl http://localhost:3000/health
# Readiness probe
curl http://localhost:3000/ready
# Prometheus metrics
curl http://localhost:3000/metrics
# Grafana UI
open http://localhost:3001 # admin / admin (or GRAFANA_PASSWORD)
Environment Variables
| Variable | Default | Description |
|---|---|---|
RUST_LOG | info | Log level: trace, debug, info, warn, error |
DATASYNTH_API_KEYS | (none) | Comma-separated API keys for authentication |
DATASYNTH_WORKER_THREADS | 0 (auto) | Tokio worker threads; 0 = CPU count |
DATASYNTH_REDIS_URL | (none) | Redis URL for distributed rate limiting |
DATASYNTH_TLS_CERT | (none) | Path to TLS certificate (PEM) |
DATASYNTH_TLS_KEY | (none) | Path to TLS private key (PEM) |
OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OpenTelemetry collector endpoint |
OTEL_SERVICE_NAME | (none) | OpenTelemetry service name |
Resource Limits
Recommended container resource limits by workload:
| Workload | CPU | Memory | Notes |
|---|---|---|---|
| Light (dev/test) | 1 core | 1 GB | Small configs, < 10K entries |
| Medium (staging) | 2 cores | 2 GB | Medium configs, up to 100K entries |
| Heavy (production) | 4 cores | 4 GB | Large configs, streaming, multiple clients |
| Batch CLI job | 2-8 cores | 2-8 GB | Scales linearly with core count |
Running CLI Jobs in Docker
Generate data with the CLI image:
docker run --rm \
-v $(pwd)/output:/output \
datasynth/datasynth-cli:0.5.0 \
generate --demo --output /output
Generate from a custom config:
docker run --rm \
-v $(pwd)/config.yaml:/config.yaml:ro \
-v $(pwd)/output:/output \
datasynth/datasynth-cli:0.5.0 \
generate --config /config.yaml --output /output
Networking
The server binds to 0.0.0.0 by default inside the container. Port mapping:
| Container Port | Protocol | Service |
|---|---|---|
| 3000 | TCP | REST API + WebSocket + Prometheus metrics |
| 50051 | TCP | gRPC API |
For WebSocket connections through a reverse proxy, ensure the proxy supports HTTP Upgrade headers. See TLS & Reverse Proxy for Nginx and Envoy configurations.
Logging
DataSynth server outputs structured JSON logs to stdout, which integrates with Docker’s logging drivers:
# View logs
docker compose logs -f datasynth-server
# Filter by level
docker compose logs datasynth-server | jq 'select(.level == "ERROR")'
To change the log format or level, set the RUST_LOG environment variable:
# Debug logging for the server crate only
RUST_LOG=datasynth_server=debug docker compose up -d datasynth-server
Kubernetes Deployment
This guide covers deploying DataSynth on Kubernetes using the included Helm chart or raw manifests.
Prerequisites
- Kubernetes 1.27+
- Helm 3.12+ (for Helm-based deployment)
- `kubectl` configured for your cluster
- A container registry accessible from the cluster
- Metrics Server installed (for HPA)
Helm Chart
The Helm chart is located at deploy/helm/datasynth/ and manages all Kubernetes resources.
Quick Install
# From the repository root
helm install datasynth ./deploy/helm/datasynth \
--namespace datasynth \
--create-namespace
Install with Custom Values
helm install datasynth ./deploy/helm/datasynth \
--namespace datasynth \
--create-namespace \
--set image.repository=your-registry.example.com/datasynth-server \
--set image.tag=0.5.0 \
--set autoscaling.minReplicas=3 \
--set autoscaling.maxReplicas=15
Upgrade
helm upgrade datasynth ./deploy/helm/datasynth \
--namespace datasynth \
--reuse-values \
--set image.tag=0.6.0
Uninstall
helm uninstall datasynth --namespace datasynth
Chart Reference
values.yaml Key Parameters
| Parameter | Default | Description |
|---|---|---|
| `replicaCount` | 2 | Initial replicas (ignored when HPA is enabled) |
| `image.repository` | `datasynth/datasynth-server` | Container image repository |
| `image.tag` | 0.5.0 | Image tag |
| `service.type` | `ClusterIP` | Service type |
| `service.restPort` | 3000 | REST API port |
| `service.grpcPort` | 50051 | gRPC port |
| `resources.requests.cpu` | 500m | CPU request |
| `resources.requests.memory` | 512Mi | Memory request |
| `resources.limits.cpu` | 2 | CPU limit |
| `resources.limits.memory` | 2Gi | Memory limit |
| `autoscaling.enabled` | `true` | Enable HPA |
| `autoscaling.minReplicas` | 2 | Minimum replicas |
| `autoscaling.maxReplicas` | 10 | Maximum replicas |
| `autoscaling.targetCPUUtilizationPercentage` | 70 | CPU scaling target |
| `podDisruptionBudget.enabled` | `true` | Enable PDB |
| `podDisruptionBudget.minAvailable` | 1 | Minimum available pods |
| `apiKeys` | `[]` | API keys (stored in a Secret) |
| `config.enabled` | `false` | Mount DataSynth YAML config via ConfigMap |
| `redis.enabled` | `false` | Deploy Redis sidecar for distributed rate limiting |
| `serviceMonitor.enabled` | `false` | Create Prometheus ServiceMonitor |
| `ingress.enabled` | `false` | Enable Ingress resource |
Authentication
API keys are stored in a Kubernetes Secret and injected via the DATASYNTH_API_KEYS environment variable:
# values-production.yaml
apiKeys:
- "your-secure-api-key-1"
- "your-secure-api-key-2"
For external secret management, use the External Secrets Operator or mount from a Vault sidecar. See Security Hardening for details.
DataSynth Configuration via ConfigMap
To inject a DataSynth generation config into the pods:
config:
enabled: true
content: |
global:
industry: manufacturing
start_date: "2024-01-01"
period_months: 12
seed: 42
companies:
- code: "1000"
name: "Manufacturing Corp"
currency: USD
country: US
annual_transaction_volume: 100000
The config is mounted at /etc/datasynth/datasynth.yaml as a read-only volume.
Health Probes
The Helm chart configures three probes:
| Probe | Endpoint | Initial Delay | Period | Failure Threshold |
|---|---|---|---|---|
| Startup | GET /live | 5s | 5s | 30 (= 2.5 min max startup) |
| Liveness | GET /live | 15s | 20s | 3 |
| Readiness | GET /ready | 5s | 10s | 3 |
The readiness probe checks configuration validity, memory usage, and disk availability. A pod reporting not-ready is removed from Service endpoints until it recovers.
Horizontal Pod Autoscaler (HPA)
The chart creates an HPA by default targeting 70% CPU utilization:
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
# Uncomment to also scale on memory:
# targetMemoryUtilizationPercentage: 80
Custom metrics scaling (e.g., on synth_active_streams) requires the Prometheus Adapter:
# Custom metrics HPA example (requires prometheus-adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: datasynth-custom
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: datasynth
minReplicas: 2
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: synth_active_streams
target:
type: AverageValue
averageValue: "5"
Pod Disruption Budget (PDB)
The PDB ensures at least one pod remains available during voluntary disruptions (node drains, cluster upgrades):
podDisruptionBudget:
enabled: true
minAvailable: 1
For larger deployments, switch to maxUnavailable:
podDisruptionBudget:
enabled: true
maxUnavailable: 1
Ingress and TLS
Nginx Ingress with cert-manager
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
nginx.ingress.kubernetes.io/proxy-body-size: "50m"
hosts:
- host: datasynth.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: datasynth-tls
hosts:
- datasynth.example.com
WebSocket Support
For Nginx Ingress, WebSocket upgrade is handled automatically for paths starting with /ws/. If you use a path-based routing rule, ensure the annotation is set:
nginx.ingress.kubernetes.io/proxy-http-version: "1.1"
nginx.ingress.kubernetes.io/configuration-snippet: |
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
gRPC Ingress
gRPC requires a separate Ingress resource or an Ingress controller that supports gRPC (e.g., Nginx Ingress with nginx.ingress.kubernetes.io/backend-protocol: "GRPC"):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: datasynth-grpc
annotations:
nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
ingressClassName: nginx
tls:
- secretName: datasynth-grpc-tls
hosts:
- grpc.datasynth.example.com
rules:
- host: grpc.datasynth.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: datasynth
port:
name: grpc
Manual Manifests (Without Helm)
If you prefer raw manifests, here is a minimal deployment:
---
apiVersion: v1
kind: Namespace
metadata:
name: datasynth
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: datasynth
namespace: datasynth
spec:
replicas: 2
selector:
matchLabels:
app: datasynth
template:
metadata:
labels:
app: datasynth
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
containers:
- name: datasynth
image: datasynth/datasynth-server:0.5.0
ports:
- containerPort: 3000
name: http-rest
- containerPort: 50051
name: grpc
env:
- name: RUST_LOG
value: "info"
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: "2"
memory: 2Gi
livenessProbe:
httpGet:
path: /live
port: http-rest
initialDelaySeconds: 15
periodSeconds: 20
readinessProbe:
httpGet:
path: /ready
port: http-rest
initialDelaySeconds: 5
periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
name: datasynth
namespace: datasynth
spec:
type: ClusterIP
ports:
- port: 3000
targetPort: http-rest
name: http-rest
- port: 50051
targetPort: grpc
name: grpc
selector:
app: datasynth
Prometheus ServiceMonitor
If you use the Prometheus Operator, enable the ServiceMonitor:
serviceMonitor:
enabled: true
interval: 30s
scrapeTimeout: 10s
path: /metrics
labels:
release: prometheus # Must match your Prometheus Operator selector
Rolling Update Strategy
The chart uses a zero-downtime rolling update strategy:
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 0
maxSurge: 1
Combined with the PDB and readiness probes, this ensures that:
- A new pod starts and becomes ready before an old pod is terminated.
- At least `minAvailable` pods are always serving traffic.
- Config and secret changes trigger a rolling restart via checksum annotations.
Topology Spread
For multi-zone clusters, use topology spread constraints to distribute pods evenly:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app.kubernetes.io/name: datasynth
Bare Metal Deployment
This guide covers installing and running DataSynth directly on a Linux server using SystemD.
Prerequisites
- Linux x86_64 (Ubuntu 22.04+, Debian 12+, RHEL 9+, or equivalent)
- 2 GB RAM minimum (4 GB recommended)
- Root or sudo access for initial setup
Binary Installation
Option 1: Download Pre-Built Binary
# Download the latest release
curl -L https://github.com/ey-asu-rnd/SyntheticData/releases/latest/download/datasynth-server-linux-x86_64.tar.gz \
-o datasynth-server.tar.gz
# Extract
tar xzf datasynth-server.tar.gz
# Install binaries
sudo install -m 0755 datasynth-server /usr/local/bin/
sudo install -m 0755 datasynth-data /usr/local/bin/
# Verify
datasynth-server --help
datasynth-data --version
Option 2: Build from Source
# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# Install protobuf compiler (required for gRPC)
sudo apt-get install -y protobuf-compiler # Debian/Ubuntu
sudo dnf install -y protobuf-compiler # RHEL/Fedora
# Clone and build
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release -p datasynth-server -p datasynth-cli
# Install
sudo install -m 0755 target/release/datasynth-server /usr/local/bin/
sudo install -m 0755 target/release/datasynth-data /usr/local/bin/
To enable optional features during the build:
# With TLS support
cargo build --release -p datasynth-server --features tls
# With Redis distributed rate limiting
cargo build --release -p datasynth-server --features redis
# With OpenTelemetry
cargo build --release -p datasynth-server --features otel
# All features
cargo build --release -p datasynth-server --features "tls,redis,otel"
User and Permissions
Create a dedicated service account:
# Create system user (no home dir, no login shell)
sudo useradd --system --no-create-home --shell /usr/sbin/nologin datasynth
# Create data and config directories
sudo mkdir -p /var/lib/datasynth
sudo mkdir -p /etc/datasynth
sudo mkdir -p /etc/datasynth/tls
# Set ownership
sudo chown -R datasynth:datasynth /var/lib/datasynth
sudo chmod 750 /var/lib/datasynth
sudo chown -R root:datasynth /etc/datasynth
sudo chmod 750 /etc/datasynth
sudo chmod 700 /etc/datasynth/tls
Environment Configuration
Copy the example environment file:
sudo cp deploy/datasynth-server.env.example /etc/datasynth/server.env
sudo chown root:datasynth /etc/datasynth/server.env
sudo chmod 640 /etc/datasynth/server.env
Edit /etc/datasynth/server.env:
# Logging level
RUST_LOG=info
# API authentication (comma-separated keys)
DATASYNTH_API_KEYS=your-secure-key-1,your-secure-key-2
# Worker threads (0 = auto-detect from CPU count)
DATASYNTH_WORKER_THREADS=0
# TLS (requires --features tls build)
# DATASYNTH_TLS_CERT=/etc/datasynth/tls/cert.pem
# DATASYNTH_TLS_KEY=/etc/datasynth/tls/key.pem
SystemD Service
The repository includes a production-ready SystemD unit at deploy/datasynth-server.service. Install it:
sudo cp deploy/datasynth-server.service /etc/systemd/system/
sudo systemctl daemon-reload
Unit File Walkthrough
[Unit]
Description=DataSynth Synthetic Data Server
Documentation=https://github.com/ey-asu-rnd/SyntheticData
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=datasynth
Group=datasynth
EnvironmentFile=-/etc/datasynth/server.env
ExecStart=/usr/local/bin/datasynth-server \
--host 0.0.0.0 \
--port 50051 \
--rest-port 3000
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
TimeoutStartSec=30
TimeoutStopSec=30
# Resource limits
MemoryMax=4G
CPUQuota=200%
TasksMax=512
LimitNOFILE=65536
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
ReadWritePaths=/var/lib/datasynth
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=datasynth-server
[Install]
WantedBy=multi-user.target
Key security directives:
| Directive | Effect |
|---|---|
| `NoNewPrivileges=true` | Prevents privilege escalation |
| `ProtectSystem=strict` | Mounts the filesystem read-only except `ReadWritePaths` |
| `ProtectHome=true` | Hides `/home`, `/root`, `/run/user` |
| `PrivateTmp=true` | Isolates `/tmp` |
| `PrivateDevices=true` | Restricts device access |
| `ReadWritePaths=/var/lib/datasynth` | Only writable directory |
Enable and Start
sudo systemctl enable datasynth-server
sudo systemctl start datasynth-server
sudo systemctl status datasynth-server
Common Operations
# View logs
journalctl -u datasynth-server -f
# Restart
sudo systemctl restart datasynth-server
# Reload (sends HUP signal)
sudo systemctl reload datasynth-server
# Stop
sudo systemctl stop datasynth-server
Log Rotation
SystemD journal handles log rotation automatically. To configure retention:
# /etc/systemd/journald.conf.d/datasynth.conf
[Journal]
SystemMaxUse=2G
MaxRetentionSec=30d
Reload journald:
sudo systemctl restart systemd-journald
To export logs to a file for external log aggregation:
# Export today's logs as JSON
journalctl -u datasynth-server --since today -o json > /var/log/datasynth-$(date +%F).json
Firewall Configuration
Open the required ports:
# UFW (Ubuntu)
sudo ufw allow 3000/tcp comment "DataSynth REST"
sudo ufw allow 50051/tcp comment "DataSynth gRPC"
# firewalld (RHEL/CentOS)
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --permanent --add-port=50051/tcp
sudo firewall-cmd --reload
Verifying the Installation
# Health check
curl -s http://localhost:3000/health | python3 -m json.tool
# Readiness check
curl -s http://localhost:3000/ready | python3 -m json.tool
# Prometheus metrics
curl -s http://localhost:3000/metrics
# Generate test data via CLI
datasynth-data generate --demo --output /tmp/datasynth-test
ls -la /tmp/datasynth-test/
Updating
# Stop the service
sudo systemctl stop datasynth-server
# Replace the binary
sudo install -m 0755 /path/to/new/datasynth-server /usr/local/bin/
# Start the service
sudo systemctl start datasynth-server
# Verify
curl -s http://localhost:3000/health | python3 -m json.tool
Operational Runbook
This runbook provides step-by-step procedures for monitoring, alerting, troubleshooting, and maintaining DataSynth in production.
Monitoring Stack Overview
The recommended monitoring stack uses Prometheus for metrics collection and Grafana for dashboards and alerting. The docker-compose.yml in the repository root sets this up automatically.
| Component | Default URL | Purpose |
|---|---|---|
| Prometheus | http://localhost:9090 | Metrics storage and alerting rules |
| Grafana | http://localhost:3001 | Dashboards and visualization |
| DataSynth `/metrics` | http://localhost:3000/metrics | Prometheus exposition endpoint |
| DataSynth `/api/metrics` | http://localhost:3000/api/metrics | JSON metrics endpoint |
Prometheus Configuration
The repository includes a pre-configured Prometheus scrape config at deploy/prometheus.yml:
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alerts.yml"
scrape_configs:
- job_name: "datasynth"
static_configs:
- targets: ["datasynth-server:3000"]
metrics_path: "/metrics"
scrape_interval: 10s
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
For Kubernetes, use the ServiceMonitor CRD instead (see Kubernetes deployment).
Available Metrics
DataSynth exposes the following Prometheus metrics at GET /metrics:
| Metric | Type | Description |
|---|---|---|
| `synth_entries_generated_total` | Counter | Total journal entries generated since startup |
| `synth_anomalies_injected_total` | Counter | Total anomalies injected |
| `synth_uptime_seconds` | Gauge | Server uptime in seconds |
| `synth_entries_per_second` | Gauge | Current generation throughput |
| `synth_active_streams` | Gauge | Number of active WebSocket streaming connections |
| `synth_stream_events_total` | Counter | Total events sent through WebSocket streams |
| `synth_info` | Gauge | Server version info label (always 1) |
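Outside of Prometheus, these metrics can also be pulled and parsed directly for quick ad-hoc checks. A minimal sketch, assuming the third-party `requests` and `prometheus_client` packages are installed:

```python
# Ad-hoc check: fetch /metrics and print all synth_* samples.
# Assumes `requests` and `prometheus_client` are available
# (pip install requests prometheus-client).
import requests
from prometheus_client.parser import text_string_to_metric_families

resp = requests.get("http://localhost:3000/metrics", timeout=5)
resp.raise_for_status()

for family in text_string_to_metric_families(resp.text):
    for sample in family.samples:
        if sample.name.startswith("synth_"):
            print(f"{sample.name} {dict(sample.labels)} = {sample.value}")
```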
Grafana Dashboard Setup
Step 1: Add Prometheus Data Source
- Open Grafana at `http://localhost:3001`.
- Navigate to Configuration > Data Sources > Add data source.
- Select Prometheus.
- Set URL to `http://prometheus:9090` (Docker) or your Prometheus endpoint.
- Click Save & Test.
If using Docker Compose, the Prometheus data source is auto-provisioned via deploy/grafana/provisioning/datasources/prometheus.yml.
Step 2: Create the DataSynth Dashboard
Create a new dashboard with the following panels:
Panel 1: Generation Throughput
Type: Time series
Query: rate(synth_entries_generated_total[5m])
Title: Entries Generated per Second (5m rate)
Unit: ops/sec
Panel 2: Active WebSocket Streams
Type: Stat
Query: synth_active_streams
Title: Active Streams
Thresholds: 0 (green), 5 (yellow), 10 (red)
Panel 3: Total Entries (Counter)
Type: Stat
Query: synth_entries_generated_total
Title: Total Entries Generated
Format: short
Panel 4: Anomaly Injection Rate
Type: Time series
Query A: rate(synth_anomalies_injected_total[5m])
Query B: rate(synth_entries_generated_total[5m])
Title: Anomaly Rate
Transform: A / B (using math expression)
Unit: percentunit
Panel 5: Server Uptime
Type: Stat
Query: synth_uptime_seconds
Title: Server Uptime
Unit: seconds (s)
Panel 6: Stream Events Rate
Type: Time series
Query: rate(synth_stream_events_total[1m])
Title: Stream Events per Second
Unit: events/sec
Step 3: Save and Export
Save the dashboard and export as JSON for version control. Place the file in deploy/grafana/provisioning/dashboards/ for automatic provisioning.
Alert Rules
The repository includes pre-configured alert rules at deploy/prometheus-alerts.yml:
Alert: ServerDown
- alert: ServerDown
expr: up{job="datasynth"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "DataSynth server is down"
description: "DataSynth server has been unreachable for more than 1 minute."
Response procedure:
- Check the server process: `systemctl status datasynth-server` or `docker compose ps`.
- Check logs: `journalctl -u datasynth-server -n 100` or `docker compose logs --tail 100 datasynth-server`.
- Check resource exhaustion: `free -h`, `df -h`, `top`.
- If OOM killed, increase memory limits and restart.
- If disk full, clear output directory and restart.
Alert: HighErrorRate
- alert: HighErrorRate
expr: rate(synth_errors_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate on DataSynth server"
Response procedure:
- Check application logs for error patterns: `journalctl -u datasynth-server -p err`.
- Look for invalid configuration: `curl localhost:3000/ready`.
- Check if clients are sending malformed requests (rate limit headers in responses).
- If errors are generation failures, check available memory and disk.
Alert: HighMemoryUsage
- alert: HighMemoryUsage
expr: synth_memory_usage_bytes / 1024 / 1024 > 3072
for: 10m
labels:
severity: critical
annotations:
summary: "High memory usage on DataSynth server"
description: "Memory usage is {{ $value }}MB, exceeding 3GB threshold."
Response procedure:
- Check DataSynth’s internal degradation level: `curl localhost:3000/ready` – the `memory` check status will show `ok`, `degraded`, or `fail`.
- If degraded, DataSynth automatically reduces batch sizes. Wait for current work to complete.
- If in Emergency mode, stop active streams: `curl -X POST localhost:3000/api/stream/stop`.
- Consider increasing memory limits or reducing concurrent streams.
Alert: HighLatency
- alert: HighLatency
expr: histogram_quantile(0.99, rate(datasynth_api_request_duration_seconds_bucket[5m])) > 30
for: 5m
labels:
severity: warning
Response procedure:
- Check if bulk generation requests are creating large datasets. The default timeout is 300 seconds.
- Verify CPU is not throttled: `kubectl top pod` or `docker stats`.
- Consider splitting large generation requests into smaller batches.
Alert: NoEntitiesGenerated
- alert: NoEntitiesGenerated
expr: increase(synth_entries_generated_total[1h]) == 0 and synth_active_streams > 0
for: 15m
labels:
severity: warning
Response procedure:
- Streams are connected but not producing data. Check if streams are paused.
- Resume streams: `curl -X POST localhost:3000/api/stream/resume`.
- Check logs for generation failures.
- Verify the configuration is valid: `curl localhost:3000/api/config`.
Common Troubleshooting
Server Fails to Start
| Symptom | Cause | Resolution |
|---|---|---|
| `Invalid gRPC address` | Bad `--host` or `--port` value | Check arguments and env vars |
| `Failed to bind REST listener` | Port already in use | `lsof -i :3000` to find conflict |
| `protoc not found` | Missing protobuf compiler | Install protobuf-compiler package |
| Immediate exit, no logs | Panic before logger init | Run with RUST_LOG=debug |
Generation Errors
| Symptom | Cause | Resolution |
|---|---|---|
| `Failed to create orchestrator` | Invalid config | Validate with `datasynth-data validate --config config.yaml` |
| `Rate limit exceeded` | Too many API requests | Wait for `Retry-After` header, increase rate limits |
| Empty journal entries | No companies configured | Check curl localhost:3000/api/config |
| Slow generation | Large period or high volume | Add worker threads, increase CPU |
Connection Issues
| Symptom | Cause | Resolution |
|---|---|---|
| Connection refused on 3000 | Server not running or wrong port | Check process and port bindings |
| 401 Unauthorized | Missing or invalid API key | Add `X-API-Key` header or `Authorization: Bearer <key>` |
| 429 Too Many Requests | Rate limit exceeded | Respect `Retry-After` header |
| WebSocket drops immediately | Proxy not forwarding Upgrade | Configure proxy for WebSocket (see TLS doc) |
Memory Issues
DataSynth monitors its own memory usage via /proc/self/statm (Linux) and triggers automatic degradation:
| Degradation Level | Trigger | Behavior |
|---|---|---|
| Normal | < 70% of limit | Full throughput |
| Reduced | 70-85% | Smaller batch sizes |
| Minimal | 85-95% | Single-record generation |
| Emergency | > 95% | Rejects new work |
Check the current level:
curl -s localhost:3000/ready | jq '.checks[] | select(.name == "memory")'
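The same check can be scripted for automated load shedding. A sketch, assuming the `checks[]` payload shape documented in the API Reference; stopping streams on degradation is an operational choice, not built-in behavior:

```python
# Sketch: inspect the memory check on /ready and shed load if degraded.
import requests

BASE = "http://localhost:3000"

resp = requests.get(f"{BASE}/ready", timeout=5)
checks = {c["name"]: c["status"] for c in resp.json().get("checks", [])}
memory_status = checks.get("memory", "unknown")

if memory_status in ("degraded", "fail"):
    # Shed load: stop active streams until memory recovers
    requests.post(f"{BASE}/api/stream/stop", timeout=5)
    print(f"memory check is '{memory_status}' - active streams stopped")
else:
    print(f"memory check is '{memory_status}'")
```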
Log Analysis
Structured JSON Logs
DataSynth emits structured JSON logs with the following fields:
{
"timestamp": "2024-01-15T10:30:00.123Z",
"level": "INFO",
"target": "datasynth_server::rest::routes",
"message": "Configuration update requested",
"thread_id": 42
}
Common Log Queries
Filter by severity:
# SystemD
journalctl -u datasynth-server -p err --since "1 hour ago"
# Docker
docker compose logs datasynth-server | jq 'select(.level == "ERROR" or .level == "WARN")'
Find configuration changes:
journalctl -u datasynth-server | grep "Configuration update"
Track generation throughput:
journalctl -u datasynth-server | grep "entries_generated"
Find API authentication failures:
journalctl -u datasynth-server | grep -i "unauthorized\|invalid api key"
Log Level Configuration
Set per-module log levels with RUST_LOG:
# Everything at info, server REST module at debug
RUST_LOG=info,datasynth_server::rest=debug
# Generation engine at trace (very verbose)
RUST_LOG=info,datasynth_runtime=trace
# Suppress noisy modules
RUST_LOG=info,hyper=warn,tower=warn
Routine Maintenance
Health Check Script
Create a monitoring script for external health checks:
#!/bin/bash
# /usr/local/bin/datasynth-healthcheck.sh
ENDPOINT="${1:-http://localhost:3000}"
# Check health
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$ENDPOINT/health")
if [ "$HTTP_CODE" != "200" ]; then
echo "CRITICAL: Health check failed (HTTP $HTTP_CODE)"
exit 2
fi
# Check readiness
READY=$(curl -s "$ENDPOINT/ready" | jq -r '.ready')
if [ "$READY" != "true" ]; then
echo "WARNING: Server not ready"
exit 1
fi
echo "OK: DataSynth healthy and ready"
exit 0
Prometheus Rule Testing
Validate alert rules before deploying:
# Install promtool
go install github.com/prometheus/prometheus/cmd/promtool@latest
# Test rules
promtool check rules deploy/prometheus-alerts.yml
Backup Checklist
| Item | Location | Frequency |
|---|---|---|
| DataSynth config | /etc/datasynth/server.env | On change |
| Generation configs | YAML files | On change |
| Grafana dashboards | Export as JSON | Weekly |
| Prometheus data | prometheus-data volume | Per retention policy |
| API keys | Kubernetes Secret or env file | On rotation |
Incident Response Template
When a production incident occurs:
- Detect: Alert fires or user reports an issue.
- Triage: Check the `/health`, `/ready`, and `/metrics` endpoints.
- Contain: If generating bad data, stop streams: `POST /api/stream/stop`.
- Diagnose: Collect logs (`journalctl -u datasynth-server --since "1 hour ago"`).
- Resolve: Apply fix (restart, config change, scale up).
- Verify: Confirm `/ready` returns `ready: true` and metrics are flowing.
- Document: Record root cause and remediation steps.
Capacity Planning
This guide provides sizing models, reference benchmarks, and recommendations for provisioning DataSynth deployments.
Performance Characteristics
DataSynth is CPU-bound during generation and I/O-bound during output. Key characteristics:
- Throughput: 100K+ journal entries per second on a single core
- Scaling: Near-linear scaling with CPU cores for batch generation
- Memory: Proportional to active dataset size (companies, accounts, master data)
- Disk: Output size depends on format, compression, and enabled modules
- Network: REST/gRPC overhead is minimal; bulk generation is the bottleneck
Sizing Model
CPU
DataSynth uses Rayon for parallel generation and Tokio for async I/O. The relationship between CPU cores and throughput:
| Cores | Approx. Entries/sec | Use Case |
|---|---|---|
| 1 | 100K | Development, small datasets |
| 2 | 180K | Staging, medium datasets |
| 4 | 350K | Production, large datasets |
| 8 | 650K | High-throughput batch jobs |
| 16 | 1.1M | Maximum single-node throughput |
These numbers are for journal entry generation with balanced debit/credit lines. Enabling additional modules (document flows, subledgers, master data, anomaly injection) reduces throughput by 30-60% due to cross-referencing overhead.
Memory
Memory usage depends on the active generation context:
| Component | Approximate Memory |
|---|---|
| Base server process | 50-100 MB |
| Chart of accounts (small) | 5-10 MB |
| Chart of accounts (large) | 30-50 MB |
| Master data per company (small) | 20-40 MB |
| Master data per company (medium) | 80-150 MB |
| Master data per company (large) | 200-400 MB |
| Active journal entries buffer | 2-5 MB per 10K entries |
| Document flow chains | 50-100 MB per company |
| Anomaly injection engine | 20-50 MB |
Sizing formula (approximate):
Memory (MB) = 100 + (companies * master_data_per_company) + (buffer_entries * 0.5)
Recommended Memory by Config Complexity
| Complexity | Companies | Memory Minimum | Memory Recommended |
|---|---|---|---|
| Small | 1-2 | 512 MB | 1 GB |
| Medium | 3-5 | 1 GB | 2 GB |
| Large | 5-10 | 2 GB | 4 GB |
| Enterprise | 10-20 | 4 GB | 8 GB |
DataSynth includes built-in memory guards that trigger graceful degradation before OOM. See Runbook - Memory Issues for degradation levels.
Disk Sizing
Output Size by Format
The output size depends on the number of entries, enabled modules, and output format:
| Entries | CSV (uncompressed) | JSON (uncompressed) | Parquet (compressed) |
|---|---|---|---|
| 10K | 15-25 MB | 30-50 MB | 3-5 MB |
| 100K | 150-250 MB | 300-500 MB | 30-50 MB |
| 1M | 1.5-2.5 GB | 3-5 GB | 300-500 MB |
| 10M | 15-25 GB | 30-50 GB | 3-5 GB |
These estimates cover journal entries only. Enabling all modules (master data, document flows, subledgers, audit trails, etc.) can multiply total output by 5-10x.
Output Files by Module
When all modules are enabled, a typical generation produces 60+ output files:
| Category | Typical File Count | Size Relative to JE |
|---|---|---|
| Journal entries + ACDOCA | 2-3 | 1.0x (baseline) |
| Master data | 6-8 | 0.3-0.5x |
| Document flows | 8-10 | 1.5-2.0x |
| Subledgers | 8-12 | 1.0-1.5x |
| Period close + consolidation | 5-8 | 0.5-1.0x |
| Labels + controls | 6-10 | 0.1-0.3x |
| Audit trails | 6-8 | 0.3-0.5x |
Disk Provisioning Formula
Disk (GB) = entries_millions * format_multiplier * module_multiplier * safety_margin
Where:
format_multiplier: CSV=0.25, JSON=0.50, Parquet=0.05 (per million entries)
module_multiplier: JE only=1.0, all modules=5.0
safety_margin: 1.5 (for temp files, logs, etc.)
Example: 1M entries, all modules, CSV format:
Disk = 1 * 0.25 * 5.0 * 1.5 = 1.875 GB (round up to 2 GB)
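Both the memory and disk formulas above are simple enough to script for planning. A minimal sketch using the documented multipliers (treat the results as rough estimates; the helper names and illustrative inputs are chosen here, not part of DataSynth):

```python
# Sketch of the sizing formulas from this section.
FORMAT_MULTIPLIER = {"csv": 0.25, "json": 0.50, "parquet": 0.05}  # GB per million entries

def disk_gb(entries_millions: float, fmt: str, all_modules: bool,
            safety_margin: float = 1.5) -> float:
    module_multiplier = 5.0 if all_modules else 1.0
    return entries_millions * FORMAT_MULTIPLIER[fmt] * module_multiplier * safety_margin

def memory_mb(companies: int, master_data_per_company: float, buffer_entries: float) -> float:
    # Memory (MB) = 100 + (companies * master_data_per_company) + (buffer_entries * 0.5)
    return 100 + companies * master_data_per_company + buffer_entries * 0.5

# Reproduces the worked example: 1M entries, all modules, CSV -> 1.875 GB
print(disk_gb(1, "csv", all_modules=True))                         # 1.875
# Illustrative memory inputs only; see the per-component table above
print(memory_mb(companies=3, master_data_per_company=150, buffer_entries=1000))
```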
Reference Benchmarks
Benchmarks run on c5.2xlarge (8 vCPU, 16 GB RAM):
| Scenario | Config | Entries | Time | Throughput | Peak Memory |
|---|---|---|---|---|---|
| Batch (small) | 1 company, small CoA, JE only | 100K | 0.8s | 125K/s | 280 MB |
| Batch (medium) | 3 companies, medium CoA, all modules | 100K | 3.2s | 31K/s | 850 MB |
| Batch (large) | 5 companies, large CoA, all modules | 1M | 45s | 22K/s | 2.1 GB |
| Streaming | 1 company, JE only | continuous | – | 10 events/s | 350 MB |
| Concurrent API | 10 parallel bulk requests | 10K each | 4.5s | 22K/s total | 1.2 GB |
Container Resource Recommendations
Docker / Single Host
| Profile | CPU | Memory | Disk | Use Case |
|---|---|---|---|---|
| Dev | 1 core | 1 GB | 10 GB | Local testing |
| Staging | 2 cores | 2 GB | 50 GB | Integration testing |
| Production | 4 cores | 4 GB | 100 GB | Regular generation |
| Batch worker | 8 cores | 8 GB | 200 GB | Large dataset generation |
Kubernetes
| Profile | requests.cpu | requests.memory | limits.cpu | limits.memory | Replicas |
|---|---|---|---|---|---|
| Light | 250m | 256Mi | 1 | 1Gi | 2 |
| Standard | 500m | 512Mi | 2 | 2Gi | 2-5 |
| Heavy | 1000m | 1Gi | 4 | 4Gi | 3-10 |
| Burst | 2000m | 2Gi | 8 | 8Gi | 5-20 |
Scaling Guidelines
Vertical Scaling (Single Node)
Vertical scaling is effective up to 16 cores. Beyond that, returns diminish due to lock contention in the shared ServerState. Recommendations:
- Start with the “Standard” Kubernetes profile.
- Monitor `synth_entries_per_second` in Grafana.
- If throughput plateaus at high CPU, add replicas instead.
Horizontal Scaling (Multi-Replica)
DataSynth is stateless – each pod generates data independently. Horizontal scaling considerations:
- Enable Redis for shared rate limiting across replicas.
- Use deterministic seeds per replica to avoid duplicate data (seed = base_seed + replica_index); see the sketch after this list.
- Route bulk generation requests to specific replicas if output deduplication matters.
- WebSocket streams are per-connection and do not share state across replicas.
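A sketch of the per-replica seed derivation mentioned above, assuming StatefulSet-style pod names ending in an ordinal (e.g. `datasynth-2`) and that the derived value is written into the generation config's `seed` field before startup:

```python
# Sketch: derive seed = base_seed + replica_index from the pod hostname.
# Adapt the parsing to whatever naming or ordinal source your deployment uses.
import os
import re

BASE_SEED = 42  # the base seed recorded in version control

def replica_index(hostname: str) -> int:
    match = re.search(r"-(\d+)$", hostname)
    return int(match.group(1)) if match else 0

def replica_seed(base_seed: int = BASE_SEED) -> int:
    return base_seed + replica_index(os.environ.get("HOSTNAME", ""))

if __name__ == "__main__":
    # Emit a snippet that can be merged into the generation config
    print(f"seed: {replica_seed()}")
```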
Scaling Decision Tree
Is throughput below target?
|
+-- Yes: Is CPU utilization > 70%?
| |
| +-- Yes: Add more replicas (horizontal)
| +-- No: Is memory > 80%?
| |
| +-- Yes: Increase memory limit
| +-- No: Check I/O (disk throughput, network)
|
+-- No: Current sizing is adequate
Network Bandwidth
DataSynth’s network requirements are modest:
| Operation | Bandwidth | Notes |
|---|---|---|
| Health checks | < 1 KB/s | Negligible |
| Prometheus scrape | 5-10 KB per scrape | Every 10-30s |
| Bulk API response (10K entries) | 5-15 MB burst | Short-lived |
| WebSocket stream | 1-5 KB/s per connection | 10 events/s default |
| gRPC streaming | 2-10 KB/s per stream | Depends on message size |
Network is rarely the bottleneck. A 1 Gbps link supports hundreds of concurrent clients.
Disaster Recovery
DataSynth is a stateless data generation engine. It does not maintain a persistent database or durable state that requires traditional backup and recovery. Instead, recovery relies on two key properties:
- Deterministic generation – Given the same configuration and seed, DataSynth produces identical output.
- Stateless server – The server process can be restarted from scratch at any time.
What Needs to Be Backed Up
| Asset | Location | Recovery Priority |
|---|---|---|
| Generation config (YAML) | /etc/datasynth/, ConfigMap, or source control | Critical |
| Environment / secrets | /etc/datasynth/server.env, K8s Secrets | Critical |
| API keys | Environment variable or Secret | Critical |
| Generated output files | Output directory, object storage | Depends on use case |
| Grafana dashboards | deploy/grafana/provisioning/ or exported JSON | Low – can re-provision |
| Prometheus data | prometheus-data volume | Low – regenerate from metrics |
The generation config and seed are the most important assets. With them, you can reproduce any dataset exactly.
Backup Procedures
Configuration Backup
Store all DataSynth configuration in version control. This is the primary backup mechanism:
# Recommended repository structure
configs/
production/
manufacturing.yaml # Generation config
server.env.encrypted # Encrypted environment file
staging/
retail.yaml
server.env.encrypted
For Kubernetes, export the ConfigMap and Secret:
# Export current config
kubectl -n datasynth get configmap datasynth-config -o yaml > backup/configmap.yaml
# Export secrets (base64-encoded)
kubectl -n datasynth get secret datasynth-api-keys -o yaml > backup/secret.yaml
Output Data Backup
If generated data must be preserved (not just re-generated), back up the output directory:
# Local backup
tar czf datasynth-output-$(date +%F).tar.gz /var/lib/datasynth/output/
# S3 backup
aws s3 sync /var/lib/datasynth/output/ s3://your-bucket/datasynth/$(date +%F)/
Scheduled Backup Script
#!/bin/bash
# /usr/local/bin/datasynth-backup.sh
# Run via cron: 0 2 * * * /usr/local/bin/datasynth-backup.sh
BACKUP_DIR="/var/backups/datasynth"
DATE=$(date +%F)
mkdir -p "$BACKUP_DIR"
# Back up configuration
cp /etc/datasynth/server.env "$BACKUP_DIR/server.env.$DATE"
# Back up output if it exists and is non-empty
if [ -d /var/lib/datasynth/output ] && [ "$(ls -A /var/lib/datasynth/output)" ]; then
tar czf "$BACKUP_DIR/output-$DATE.tar.gz" /var/lib/datasynth/output/
fi
# Retain 30 days of backups
find "$BACKUP_DIR" -type f -mtime +30 -delete
echo "Backup completed: $DATE"
Deterministic Recovery
DataSynth uses ChaCha8 RNG with a configurable seed. When the seed is set in the configuration, every run produces byte-identical output.
Reproducing a Dataset
To reproduce a previous generation run:
- Retrieve the configuration file used for that run.
- Confirm the seed value is set (not random).
- Run the generation with the same configuration.
# Example config with deterministic seed
global:
industry: manufacturing
start_date: "2024-01-01"
period_months: 12
seed: 42 # <-- deterministic seed
# Regenerate identical data
datasynth-data generate --config config.yaml --output ./recovered-output
# Verify output is identical
diff <(sha256sum original-output/*.csv | sort) <(sha256sum recovered-output/*.csv | sort)
Important Caveats for Determinism
Deterministic output requires exact version matching:
| Factor | Must Match? | Notes |
|---|---|---|
| DataSynth version | Yes | Different versions may change generation logic |
| Configuration YAML | Yes | Any parameter change alters output |
| Seed value | Yes | Different seed = different data |
| Operating system | No | Cross-platform determinism is guaranteed |
| CPU architecture | No | ChaCha8 output is platform-independent |
| Number of threads | No | Parallelism does not affect determinism |
If you need to reproduce data from a past release, pin the DataSynth version:
# Docker: use the exact version tag
docker run --rm \
-v $(pwd)/config.yaml:/config.yaml:ro \
-v $(pwd)/output:/output \
datasynth/datasynth-server:0.5.0 \
datasynth-data generate --config /config.yaml --output /output
# Source: checkout the exact tag
git checkout v0.5.0
cargo build --release -p datasynth-cli
Stateless Restart
The DataSynth server maintains no persistent state. All in-memory state (counters, active streams, generation context) is ephemeral. A restart produces a fresh server.
Restart Procedure
Docker:
docker compose restart datasynth-server
Kubernetes:
# Rolling restart (zero downtime with PDB)
kubectl -n datasynth rollout restart deployment/datasynth
# Verify rollout
kubectl -n datasynth rollout status deployment/datasynth
SystemD:
sudo systemctl restart datasynth-server
What Is Lost on Restart
| State | Lost? | Impact |
|---|---|---|
| Prometheus metrics counters | Yes | Counters reset to 0; Prometheus handles counter resets via rate() |
| Active WebSocket streams | Yes | Clients must reconnect |
| Uptime counter | Yes | Resets to 0 |
| In-progress bulk generation | Yes | Client receives connection error; must retry |
| Configuration (if set via API) | Yes | Reverts to default; use ConfigMap or env for persistence |
| Rate limit buckets | Yes | All clients get fresh rate limit windows |
Mitigating Restart Impact
- Use config files, not the API, for persistent configuration. The `POST /api/config` endpoint only updates in-memory state.
- Set up client retry logic for bulk generation requests.
- Use Kubernetes PDB to ensure at least one pod is always running during rolling restarts.
- Monitor with Prometheus – counter resets are handled automatically by `rate()` and `increase()` functions.
Recovery Scenarios
Scenario 1: Server Process Crash
- SystemD or Kubernetes automatically restarts the process.
- Verify with `curl localhost:3000/health`.
- Check logs for crash cause: `journalctl -u datasynth-server -n 200`.
- No data loss – the server is stateless.
Scenario 2: Node Failure (Kubernetes)
- Kubernetes reschedules pods to healthy nodes.
- PDB ensures minimum availability during rescheduling.
- Clients reconnect automatically (Service endpoint updates).
- No manual intervention required.
Scenario 3: Configuration Lost
- Retrieve config from version control.
- Redeploy: `kubectl apply -f configmap.yaml` or copy to `/etc/datasynth/`.
- Restart the server to pick up the new config.
Scenario 4: Need to Reproduce Historical Data
- Identify the DataSynth version and config used.
- Pin the version (Docker tag or Git tag).
- Run generation with the same config and seed.
- Verify with checksums.
Recovery Time Objectives
| Component | RTO | RPO | Notes |
|---|---|---|---|
| Server process | < 30s | N/A (stateless) | Auto-restart via SystemD/K8s |
| Full service (K8s) | < 2 min | N/A (stateless) | Pod scheduling + startup probes |
| Data regeneration | Depends on size | 0 (deterministic) | Re-run with same config+seed |
| Config recovery | < 5 min | Last commit | From version control |
API Reference
DataSynth exposes REST, gRPC, and WebSocket interfaces. This page documents all endpoints, authentication, rate limiting, error formats, and the WebSocket protocol.
Base URLs
| Protocol | Default URL | Port |
|---|---|---|
| REST | http://localhost:3000 | 3000 |
| gRPC | grpc://localhost:50051 | 50051 |
| WebSocket | ws://localhost:3000/ws/ | 3000 |
Authentication
Authentication is optional and disabled by default. When enabled, all endpoints except health probes require a valid API key.
Enabling Authentication
Pass API keys at startup:
# CLI argument
datasynth-server --api-keys "key-1,key-2"
# Environment variable
DATASYNTH_API_KEYS="key-1,key-2" datasynth-server
Sending API Keys
The server accepts API keys via two headers (checked in order):
| Method | Header | Example |
|---|---|---|
| Bearer token | Authorization | Authorization: Bearer your-api-key |
| Custom header | X-API-Key | X-API-Key: your-api-key |
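For example, from a Python client (a sketch using the `requests` package; the key value is a placeholder):

```python
# Sketch: call an authenticated endpoint using either accepted header style.
import requests

BASE = "http://localhost:3000"
API_KEY = "your-api-key"  # placeholder

# Bearer token style
r1 = requests.get(f"{BASE}/api/config",
                  headers={"Authorization": f"Bearer {API_KEY}"}, timeout=10)

# Custom header style
r2 = requests.get(f"{BASE}/api/config",
                  headers={"X-API-Key": API_KEY}, timeout=10)

print(r1.status_code, r2.status_code)  # 200 when the key is valid, 401 otherwise
```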
Exempt Paths
These paths never require authentication, even when auth is enabled:
- `GET /health`
- `GET /ready`
- `GET /live`
- `GET /metrics`
Authentication Internals
- API keys are hashed with Argon2id at server startup.
- Verification iterates all stored hashes (no short-circuit) to prevent timing side-channel attacks.
- A 5-second LRU cache avoids repeated Argon2 verification for rapid successive requests.
Error Responses
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Bearer
API key required. Provide via 'Authorization: Bearer <key>' or 'X-API-Key' header
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Bearer
Invalid API key
Rate Limiting
Rate limiting is configurable and disabled by default. When enabled, it tracks requests per client IP using a sliding window.
Default Configuration
| Parameter | Default | Description |
|---|---|---|
| `max_requests` | 100 | Maximum requests per window |
| `window` | 60 seconds | Time window duration |
| Exempt paths | /health, /ready, /live | Not rate-limited |
Rate Limit Headers
All non-exempt responses include rate limit headers:
| Header | Description |
|---|---|
| `X-RateLimit-Limit` | Maximum requests allowed in the window |
| `X-RateLimit-Remaining` | Requests remaining in the current window |
| `Retry-After` | Seconds until the window resets (only on 429) |
Rate Limit Exceeded Response
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Retry-After: 60
Rate limit exceeded. Max 100 requests per 60 seconds.
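Clients should treat 429 as a signal to back off for the duration indicated by `Retry-After`. A minimal retry sketch (function name and defaults are illustrative):

```python
# Sketch: retry on 429, honoring the Retry-After header.
import time
import requests

def get_with_backoff(url: str, max_retries: int = 3, **kwargs) -> requests.Response:
    resp = requests.get(url, timeout=10, **kwargs)
    for _ in range(max_retries):
        if resp.status_code != 429:
            break
        wait = int(resp.headers.get("Retry-After", "60"))  # fall back to the window length
        time.sleep(wait)
        resp = requests.get(url, timeout=10, **kwargs)
    return resp

resp = get_with_backoff("http://localhost:3000/api/metrics")
print(resp.status_code, resp.headers.get("X-RateLimit-Remaining"))
```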
Client Identification
The rate limiter identifies clients by IP address, checked in order:
- `X-Forwarded-For` header (first IP)
- `X-Real-IP` header
- Fallback: `unknown` (all unidentified clients share a bucket)
Distributed Rate Limiting
For multi-replica deployments, enable Redis-backed rate limiting:
datasynth-server --redis-url redis://127.0.0.1:6379
This requires the redis feature to be enabled at build time.
Security Headers
All responses include the following security headers:
| Header | Value | Purpose |
|---|---|---|
| `X-Content-Type-Options` | `nosniff` | Prevent MIME type sniffing |
| `X-Frame-Options` | `DENY` | Prevent clickjacking |
| `X-XSS-Protection` | `0` | Disable legacy XSS filter (rely on CSP) |
| `Referrer-Policy` | `strict-origin-when-cross-origin` | Control referrer leakage |
| `Content-Security-Policy` | `default-src 'none'; frame-ancestors 'none'` | Restrict resource loading |
| `Cache-Control` | `no-store` | Prevent caching of API responses |
Request ID
Every response includes an X-Request-Id header. If the client sends an X-Request-Id header, its value is preserved. Otherwise, a UUID v4 is generated.
# Client-provided request ID
curl -H "X-Request-Id: my-trace-123" http://localhost:3000/health
# Response header: X-Request-Id: my-trace-123
# Auto-generated request ID
curl -v http://localhost:3000/health
# Response header: X-Request-Id: 550e8400-e29b-41d4-a716-446655440000
CORS Configuration
Default allowed origins:
| Origin | Purpose |
|---|---|
| `http://localhost:5173` | Vite dev server |
| `http://localhost:3000` | Local development |
| `http://127.0.0.1:5173` | Localhost variant |
| `http://127.0.0.1:3000` | Localhost variant |
| `tauri://localhost` | Tauri desktop app |
Allowed methods: GET, POST, PUT, DELETE, OPTIONS
Allowed headers: Content-Type, Authorization, Accept
REST API Endpoints
Health & Metrics
GET /health
Returns overall server health status.
Response 200 OK:
{
"healthy": true,
"version": "0.5.0",
"uptime_seconds": 3600
}
GET /ready
Kubernetes-compatible readiness probe. Performs deep checks (config, memory, disk).
Response 200 OK (when ready):
{
"ready": true,
"message": "Service is ready",
"checks": [
{ "name": "config", "status": "ok" },
{ "name": "memory", "status": "ok" },
{ "name": "disk", "status": "ok" }
]
}
Response 503 Service Unavailable (when not ready):
{
"ready": false,
"message": "Service is not ready",
"checks": [
{ "name": "config", "status": "ok" },
{ "name": "memory", "status": "fail" },
{ "name": "disk", "status": "ok" }
]
}
GET /live
Kubernetes-compatible liveness probe. Lightweight heartbeat.
Response 200 OK:
{
"alive": true,
"timestamp": "2024-01-15T10:30:00.123456789Z"
}
GET /api/metrics
Returns server metrics as JSON.
Response 200 OK:
{
"total_entries_generated": 150000,
"total_anomalies_injected": 750,
"uptime_seconds": 3600,
"session_entries": 150000,
"session_entries_per_second": 41.67,
"active_streams": 2,
"total_stream_events": 50000
}
GET /metrics
Prometheus-compatible metrics in text exposition format.
Response 200 OK (text/plain; version=0.0.4):
# HELP synth_entries_generated_total Total number of journal entries generated
# TYPE synth_entries_generated_total counter
synth_entries_generated_total 150000
# HELP synth_anomalies_injected_total Total number of anomalies injected
# TYPE synth_anomalies_injected_total counter
synth_anomalies_injected_total 750
# HELP synth_uptime_seconds Server uptime in seconds
# TYPE synth_uptime_seconds gauge
synth_uptime_seconds 3600
# HELP synth_entries_per_second Rate of entry generation
# TYPE synth_entries_per_second gauge
synth_entries_per_second 41.67
# HELP synth_active_streams Number of active streaming connections
# TYPE synth_active_streams gauge
synth_active_streams 2
# HELP synth_stream_events_total Total events sent through streams
# TYPE synth_stream_events_total counter
synth_stream_events_total 50000
# HELP synth_info Server version information
# TYPE synth_info gauge
synth_info{version="0.5.0"} 1
Configuration
GET /api/config
Returns the current generation configuration.
Response 200 OK:
{
"success": true,
"message": "Current configuration",
"config": {
"industry": "Manufacturing",
"start_date": "2024-01-01",
"period_months": 12,
"seed": 42,
"coa_complexity": "Medium",
"companies": [
{
"code": "1000",
"name": "Manufacturing Corp",
"currency": "USD",
"country": "US",
"annual_transaction_volume": 100000,
"volume_weight": 1.0
}
],
"fraud_enabled": true,
"fraud_rate": 0.02
}
}
POST /api/config
Updates the generation configuration.
Request body:
{
"industry": "retail",
"start_date": "2024-06-01",
"period_months": 6,
"seed": 12345,
"coa_complexity": "large",
"companies": [
{
"code": "1000",
"name": "Retail Corp",
"currency": "USD",
"country": "US",
"annual_transaction_volume": 200000,
"volume_weight": 1.0
}
],
"fraud_enabled": true,
"fraud_rate": 0.05
}
Valid industries: manufacturing, retail, financial_services, healthcare, technology, professional_services, energy, transportation, real_estate, telecommunications
Valid CoA complexities: small, medium, large
Response 200 OK:
{
"success": true,
"message": "Configuration updated and applied",
"config": { ... }
}
Error 400 Bad Request:
{
"success": false,
"message": "Unknown industry: 'invalid'. Valid values: manufacturing, retail, ...",
"config": null
}
Generation
POST /api/generate/bulk
Generates journal entries in a single batch. Maximum 1,000,000 entries per request.
Request body:
{
"entry_count": 10000,
"include_master_data": true,
"inject_anomalies": true
}
All fields are optional. Without entry_count, the server uses the configured volume.
Response 200 OK:
{
"success": true,
"entries_generated": 10000,
"duration_ms": 450,
"anomaly_count": 50
}
Error 400 Bad Request (entry count too large):
entry_count (2000000) exceeds maximum allowed value (1000000)
Streaming Control
POST /api/stream/start
Starts the event stream. WebSocket clients begin receiving events.
Request body:
{
"events_per_second": 10,
"max_events": 10000,
"inject_anomalies": false
}
POST /api/stream/stop
Stops all active streams.
POST /api/stream/pause
Pauses active streams. Events stop flowing but connections remain open.
POST /api/stream/resume
Resumes paused streams.
POST /api/stream/trigger/:pattern
Triggers a named generation pattern for upcoming streamed entries.
Valid patterns: year_end_spike, period_end_spike, holiday_cluster, fraud_cluster, error_cluster, uniform, custom:*
Response:
{
"success": true,
"message": "Pattern 'year_end_spike' will be applied to upcoming entries"
}
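Putting the control endpoints together from a script looks roughly like this (a sketch; payload values are illustrative, and an API key header would be added if authentication is enabled):

```python
# Sketch: drive the streaming lifecycle over the REST control endpoints.
import requests

BASE = "http://localhost:3000"

# Start streaming at 10 events/s, capped at 10,000 events
requests.post(f"{BASE}/api/stream/start",
              json={"events_per_second": 10, "max_events": 10000, "inject_anomalies": False},
              timeout=10).raise_for_status()

# Apply a named pattern to upcoming entries
requests.post(f"{BASE}/api/stream/trigger/year_end_spike", timeout=10).raise_for_status()

# Pause, resume, and finally stop all streams
requests.post(f"{BASE}/api/stream/pause", timeout=10)
requests.post(f"{BASE}/api/stream/resume", timeout=10)
requests.post(f"{BASE}/api/stream/stop", timeout=10)
```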
WebSocket Protocol
ws://localhost:3000/ws/metrics
Sends metrics updates every 1 second as JSON text frames:
{
"timestamp": "2024-01-15T10:30:00.123Z",
"total_entries": 150000,
"total_anomalies": 750,
"entries_per_second": 41.67,
"active_streams": 2,
"uptime_seconds": 3600
}
ws://localhost:3000/ws/events
Streams generated journal entry events as JSON text frames:
{
"sequence": 1234,
"timestamp": "2024-01-15T10:30:00.456Z",
"event_type": "JournalEntry",
"document_id": "JE-2024-001234",
"company_code": "1000",
"amount": "15000.00",
"is_anomaly": false
}
Connection Management
- The server responds to WebSocket `Ping` frames with `Pong`.
- Clients should send periodic pings to keep the connection alive through proxies.
- Close the connection by sending a WebSocket `Close` frame.
- The server decrements `active_streams` when a client disconnects.
Example: Connecting with wscat
# Install wscat
npm install -g wscat
# Connect to metrics stream
wscat -c ws://localhost:3000/ws/metrics
# Connect to event stream
wscat -c ws://localhost:3000/ws/events
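Example: Connecting with Python
A minimal client sketch, assuming the third-party `websockets` package (`pip install websockets`) and that authentication is disabled:

```python
# Sketch: consume the journal entry event stream over WebSocket.
import asyncio
import json
import websockets

async def consume(url: str = "ws://localhost:3000/ws/events") -> None:
    async with websockets.connect(url) as ws:
        async for frame in ws:
            event = json.loads(frame)
            flag = " [ANOMALY]" if event.get("is_anomaly") else ""
            print(f"{event['sequence']} {event['event_type']} {event['amount']}{flag}")

asyncio.run(consume())
```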
Example: Connecting with curl (WebSocket)
curl --include \
--no-buffer \
--header "Connection: Upgrade" \
--header "Upgrade: websocket" \
--header "Sec-WebSocket-Version: 13" \
--header "Sec-WebSocket-Key: $(openssl rand -base64 16)" \
http://localhost:3000/ws/events
Request Timeout
The default request timeout is 300 seconds (5 minutes), which accommodates large bulk generation requests. Requests exceeding this timeout receive a 408 Request Timeout response.
Error Format
REST API errors follow a consistent format:
Validation errors return JSON:
{
"success": false,
"message": "Descriptive error message",
"config": null
}
Server errors return plain text:
HTTP/1.1 500 Internal Server Error
Generation failed: <error description>
HTTP Status Codes
| Code | Meaning | When |
|---|---|---|
| 200 | Success | Request completed |
| 400 | Bad Request | Invalid parameters |
| 401 | Unauthorized | Missing or invalid API key |
| 408 | Request Timeout | Request exceeded 300s timeout |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Generation or server failure |
| 503 | Service Unavailable | Readiness check failed |
Security Hardening
This guide provides a pre-deployment security checklist and detailed guidance on TLS, secrets management, container security, and audit logging for DataSynth.
Pre-Deployment Checklist
Complete this checklist before exposing DataSynth to any network beyond localhost:
| # | Item | Priority | Status |
|---|---|---|---|
| 1 | Enable API key authentication | Critical | |
| 2 | Use strong, unique API keys (32+ chars) | Critical | |
| 3 | Enable TLS (direct or via reverse proxy) | Critical | |
| 4 | Set explicit CORS allowed origins | High | |
| 5 | Enable rate limiting | High | |
| 6 | Run as non-root user | High | |
| 7 | Use read-only root filesystem (container) | High | |
| 8 | Drop all Linux capabilities | High | |
| 9 | Set resource limits (memory, CPU, file descriptors) | High | |
| 10 | Restrict network exposure (firewall, security groups) | High | |
| 11 | Enable structured logging to a central log aggregator | Medium | |
| 12 | Set up Prometheus monitoring and alerts | Medium | |
| 13 | Rotate API keys periodically | Medium | |
| 14 | Review and restrict CORS origins | Medium | |
| 15 | Enable mTLS for gRPC if used in service mesh | Low | |
Authentication Hardening
Strong API Keys
Generate cryptographically strong API keys:
# Generate a 48-character random key
openssl rand -base64 36
# Example output: kZ9mR3xY7pQ2wV5nL8jH4cF6gT0aD1bE3sU9iO7
Recommendations:
- Minimum 32 characters, ideally 48+
- Use different keys per environment (dev, staging, production)
- Use different keys per client/team when possible
- Rotate keys quarterly or after any suspected compromise
Argon2id Hashing
DataSynth hashes API keys with Argon2id (the recommended password hashing algorithm). Keys are hashed at startup; the plaintext is never stored in memory after hashing.
For pre-hashed keys (avoiding plaintext in environment variables), hash the key externally and pass the PHC-format hash:
# Python example: pre-hash an API key
from argon2 import PasswordHasher
ph = PasswordHasher()
hash = ph.hash("your-api-key")
print(hash)
# $argon2id$v=19$m=65536,t=3,p=4$...
Pass the pre-hashed value to the server via the AuthConfig::with_prehashed_keys() API (for embedded use) or store in a secrets manager.
API Key Rotation
To rotate keys without downtime:
- Add the new key to `DATASYNTH_API_KEYS` alongside the old key.
- Restart the server (rolling restart in K8s).
- Update all clients to use the new key.
- Remove the old key from `DATASYNTH_API_KEYS`.
- Restart again.
TLS Configuration
Option 1: Reverse Proxy TLS (Recommended)
Terminate TLS at a reverse proxy (Nginx, Envoy, cloud load balancer) and forward plain HTTP to DataSynth. See TLS & Reverse Proxy for full configurations.
Advantages:
- Centralized certificate management
- Standard renewal workflows (cert-manager, Let’s Encrypt)
- Offloads TLS from the application
- Easier to audit and rotate certificates
Option 2: Native TLS
Build DataSynth with TLS support:
cargo build --release -p datasynth-server --features tls
Run with certificate and key:
datasynth-server \
--tls-cert /etc/datasynth/tls/cert.pem \
--tls-key /etc/datasynth/tls/key.pem
Certificate Requirements
| Requirement | Detail |
|---|---|
| Format | PEM-encoded X.509 |
| Key type | RSA 2048+ or ECDSA P-256/P-384 |
| Protocol | TLS 1.2 or 1.3 (1.0/1.1 disabled) |
| Cipher suites | HIGH:!aNULL:!MD5 (Nginx default) |
| Subject Alternative Name | Must match the hostname clients use |
mTLS for gRPC
For service-to-service communication, configure mutual TLS:
# Nginx mTLS configuration
server {
listen 50051 ssl http2;
ssl_certificate /etc/ssl/certs/server.pem;
ssl_certificate_key /etc/ssl/private/server-key.pem;
# Client certificate verification
ssl_client_certificate /etc/ssl/certs/ca.pem;
ssl_verify_client on;
location / {
grpc_pass grpc://127.0.0.1:50051;
}
}
Secret Management
Environment Variables
For simple deployments, store secrets in environment files with restricted permissions:
# Create the environment file
sudo install -m 640 -o root -g datasynth /dev/null /etc/datasynth/server.env
# Edit the file
sudo vi /etc/datasynth/server.env
Never commit plaintext secrets to version control. Use .gitignore to exclude env files.
Kubernetes Secrets
For Kubernetes, store API keys in a Secret resource:
apiVersion: v1
kind: Secret
metadata:
name: datasynth-api-keys
namespace: datasynth
type: Opaque
stringData:
api-keys: "key-1,key-2"
External Secrets Operator
For production, integrate with a secrets manager via the External Secrets Operator:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: datasynth-api-keys
namespace: datasynth
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secretsmanager
kind: ClusterSecretStore
target:
name: datasynth-api-keys
data:
- secretKey: api-keys
remoteRef:
key: datasynth/api-keys
HashiCorp Vault
Inject secrets via the Vault Agent sidecar:
# Pod annotations for Vault Agent Injector
podAnnotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "datasynth"
vault.hashicorp.com/agent-inject-secret-api-keys: "secret/data/datasynth/api-keys"
vault.hashicorp.com/agent-inject-template-api-keys: |
{{- with secret "secret/data/datasynth/api-keys" -}}
{{ .Data.data.keys }}
{{- end -}}
Container Security
Distroless Base Image
The production Dockerfile uses gcr.io/distroless/cc-debian12, which contains:
- No shell (`/bin/sh`, `/bin/bash`)
- No package manager
- No unnecessary system utilities
- Only the C runtime library and certificates
This minimizes the attack surface and prevents shell-based exploits.
Security Context (Kubernetes)
The Helm chart enforces the following security context:
podSecurityContext:
runAsNonRoot: true # Pod must run as non-root
runAsUser: 1000 # UID 1000
runAsGroup: 1000 # GID 1000
fsGroup: 1000 # Filesystem group
securityContext:
allowPrivilegeEscalation: false # No setuid/setgid
readOnlyRootFilesystem: true # Read-only root FS
capabilities:
drop:
- ALL # Drop all Linux capabilities
SystemD Sandboxing
The SystemD unit file includes comprehensive sandboxing:
NoNewPrivileges=true # Prevent privilege escalation
ProtectSystem=strict # Read-only filesystem
ProtectHome=true # Hide home directories
PrivateTmp=true # Isolated /tmp
PrivateDevices=true # No device access
ProtectKernelTunables=true # No sysctl modification
ProtectKernelModules=true # No module loading
ProtectControlGroups=true # No cgroup modification
RestrictNamespaces=true # No namespace creation
RestrictRealtime=true # No realtime scheduling
RestrictSUIDSGID=true # No SUID/SGID
Image Scanning
Scan the container image for vulnerabilities before deployment:
# Trivy
trivy image datasynth/datasynth-server:0.5.0
# Grype
grype datasynth/datasynth-server:0.5.0
# Docker Scout
docker scout cves datasynth/datasynth-server:0.5.0
The distroless base image has a minimal CVE surface. Address any findings in the Rust dependencies via cargo audit:
cargo install cargo-audit
cargo audit
Network Security
Principle of Least Exposure
Only expose the ports and endpoints that clients need:
| Deployment | Expose REST (3000) | Expose gRPC (50051) | Expose Metrics |
|---|---|---|---|
| Internal API only | Via Ingress/LB | Via Ingress/LB | Prometheus only |
| Public API | Via Ingress + WAF | No | No |
| Dev/staging | Localhost only | Localhost only | Localhost only |
Network Policies (Kubernetes)
Restrict pod-to-pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: datasynth-allow-ingress
namespace: datasynth
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: datasynth
policyTypes:
- Ingress
ingress:
# Allow from Ingress controller
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
ports:
- port: 3000
protocol: TCP
# Allow Prometheus scraping
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: monitoring
ports:
- port: 3000
protocol: TCP
CORS Lockdown
In production, override the default CORS configuration to allow only your application’s domain:
#![allow(unused)]
fn main() {
// Programmatic configuration
let cors = CorsConfig {
allowed_origins: vec![
"https://app.example.com".to_string(),
],
allow_any_origin: false,
};
}
Never enable allow_any_origin: true in production.
Audit Logging
Request Tracing
Every request receives an X-Request-Id header (auto-generated UUID v4 or client-provided). Use this to correlate logs across services.
Structured Log Fields
DataSynth emits JSON-structured logs with the following fields useful for security auditing:
{
"timestamp": "2024-01-15T10:30:00.123Z",
"level": "INFO",
"target": "datasynth_server::rest::routes",
"message": "Configuration update requested: industry=retail, period_months=6",
"thread_id": 42
}
Log Events to Monitor
| Event | Log Pattern | Severity |
|---|---|---|
| Authentication failure | Unauthorized / Invalid API key | High |
| Rate limit exceeded | Rate limit exceeded | Medium |
| Configuration change | Configuration update requested | Medium |
| Stream start/stop | Stream started / Stream stopped | Low |
| WebSocket connection | WebSocket connected / disconnected | Low |
| Server panic | Server panic: | Critical |
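A minimal sketch for scanning the JSON-structured log stream for these patterns (the log file path is an assumption; the message substrings follow the table above):
import json
PATTERNS = {
    "Unauthorized": "High",
    "Invalid API key": "High",
    "Rate limit exceeded": "Medium",
    "Configuration update requested": "Medium",
    "Server panic": "Critical",
}
# Point this at your aggregated log output (path is illustrative)
with open("datasynth-server.log", encoding="utf-8") as log_file:
    for line in log_file:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        message = record.get("message", "")
        for pattern, severity in PATTERNS.items():
            if pattern in message:
                print(f"[{severity}] {record.get('timestamp')} {message}")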
Centralized Logging
Forward structured logs to a central aggregator:
Docker:
services:
datasynth-server:
logging:
driver: "fluentd"
options:
fluentd-address: "localhost:24224"
tag: "datasynth.server"
SystemD to Loki:
# Install Promtail for journal forwarding
# /etc/promtail/config.yaml
scrape_configs:
- job_name: datasynth
journal:
matches:
- _SYSTEMD_UNIT=datasynth-server.service
labels:
job: datasynth
RBAC (Kubernetes)
The Helm chart creates a ServiceAccount by default. Bind minimal permissions:
serviceAccount:
create: true
automount: true # Only if needed by the application
annotations: {}
DataSynth does not require any Kubernetes API access. If automount is not needed, set it to false to prevent the ServiceAccount token from being mounted into the pod.
Supply Chain Security
Reproducible Builds
The Dockerfile uses pinned versions:
- rust:1.88-bookworm – pinned Rust compiler version
- gcr.io/distroless/cc-debian12 – pinned distroless image
- cargo-chef --locked – locked dependency resolution
Dependency Auditing
# Check for known vulnerabilities
cargo audit
# Check for unmaintained or yanked crates
cargo audit --deny warnings
Run cargo audit in CI on every pull request.
SBOM Generation
Generate a Software Bill of Materials for compliance:
# Using cargo-cyclonedx
cargo install cargo-cyclonedx
cargo cyclonedx --all
# Using syft for container images
syft datasynth/datasynth-server:0.5.0 -o cyclonedx-json > sbom.json
TLS & Reverse Proxy Configuration
DataSynth server supports TLS in two ways:
- Native TLS (with the tls feature flag) - direct rustls termination
- Reverse Proxy - recommended for production deployments
Native TLS
Build with TLS support:
cargo build --release -p datasynth-server --features tls
Run with certificate and key:
datasynth-server --tls-cert /path/to/cert.pem --tls-key /path/to/key.pem
Nginx Reverse Proxy
upstream datasynth_rest {
server 127.0.0.1:3000;
}
upstream datasynth_grpc {
server 127.0.0.1:50051;
}
server {
listen 443 ssl http2;
server_name datasynth.example.com;
ssl_certificate /etc/ssl/certs/datasynth.pem;
ssl_certificate_key /etc/ssl/private/datasynth-key.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
# REST API
location / {
proxy_pass http://datasynth_rest;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 300s;
}
# WebSocket
location /ws/ {
proxy_pass http://datasynth_rest;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_read_timeout 3600s;
}
# gRPC
location /synth_server. {
grpc_pass grpc://datasynth_grpc;
grpc_read_timeout 300s;
}
}
Envoy Proxy
static_resources:
listeners:
- name: listener_0
address:
socket_address:
address: 0.0.0.0
port_value: 443
filter_chains:
- transport_socket:
name: envoy.transport_sockets.tls
typed_config:
"@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
common_tls_context:
tls_certificates:
- certificate_chain:
filename: /etc/ssl/certs/datasynth.pem
private_key:
filename: /etc/ssl/private/datasynth-key.pem
filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
stat_prefix: ingress_http
route_config:
name: local_route
virtual_hosts:
- name: datasynth
domains: ["*"]
routes:
- match:
prefix: "/"
route:
cluster: datasynth_rest
timeout: 300s
http_filters:
- name: envoy.filters.http.router
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
clusters:
- name: datasynth_rest
connect_timeout: 5s
type: STRICT_DNS
load_assignment:
cluster_name: datasynth_rest
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: 127.0.0.1
port_value: 3000
Health Check Configuration
For load balancers, use these health check endpoints:
| Endpoint | Purpose | Expected Response |
|---|---|---|
| GET /health | Basic health | 200 OK |
| GET /ready | Readiness probe | 200 OK / 503 Unavailable |
| GET /live | Liveness probe | 200 OK |
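As an illustration (base URL assumed), a deployment script can poll the readiness endpoint before routing traffic:
import time
import requests
BASE_URL = "http://localhost:3000"  # assumed deployment address
def wait_until_ready(timeout_secs: int = 60) -> bool:
    """Poll GET /ready until it returns 200 or the timeout expires."""
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE_URL}/ready", timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not reachable yet
        time.sleep(2)
    return False
print("ready" if wait_until_ready() else "not ready")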
Use Cases
Real-world applications for SyntheticData.
Overview
| Use Case | Description |
|---|---|
| Fraud Detection ML | Train supervised fraud models |
| Audit Analytics | Test audit procedures |
| SOX Compliance | Test control monitoring |
| Process Mining | Generate OCEL 2.0 event logs |
| ERP Load Testing | Load and stress testing |
Use Case Summary
| Use Case | Key Features | Output Focus |
|---|---|---|
| Fraud Detection | Anomaly injection, graph export | Labels, graphs |
| Audit Analytics | Full document flows, controls | Transactions, controls |
| SOX Compliance | SoD rules, approval workflows | Controls, violations |
| Process Mining | OCEL 2.0 export | Event logs |
| ERP Testing | High volume, realistic patterns | Raw transactions |
Quick Configuration
Fraud Detection
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
graph_export:
enabled: true
formats:
- pytorch_geometric
Audit Analytics
document_flows:
p2p:
enabled: true
o2c:
enabled: true
internal_controls:
enabled: true
SOX Compliance
internal_controls:
enabled: true
sod_rules: [...]
approval:
enabled: true
Process Mining
document_flows:
p2p:
enabled: true
o2c:
enabled: true
# Use datasynth-ocpm for OCEL 2.0 export
ERP Testing
transactions:
target_count: 1000000
output:
format: csv
Selecting a Use Case
Choose Fraud Detection if:
- Training ML/AI models
- Building anomaly detection systems
- Need labeled datasets
Choose Audit Analytics if:
- Testing audit software
- Validating analytical procedures
- Need complete document trails
Choose SOX Compliance if:
- Testing control monitoring systems
- Validating SoD enforcement
- Need control test data
Choose Process Mining if:
- Using PM4Py, Celonis, or similar tools
- Need OCEL 2.0 compliant logs
- Analyzing business processes
Choose ERP Testing if:
- Load testing financial systems
- Performance benchmarking
- Need high-volume realistic data
Combining Use Cases
Use cases can be combined:
# Fraud detection + audit analytics
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
document_flows:
p2p:
enabled: true
o2c:
enabled: true
internal_controls:
enabled: true
graph_export:
enabled: true
See Also
Fraud Detection ML
Train machine learning models for financial fraud detection.
Overview
SyntheticData generates labeled datasets for supervised fraud detection:
- 20+ fraud patterns with full labels
- Graph representations for GNN models
- Realistic data distributions
- Configurable fraud rates and types
Configuration
global:
seed: 42
industry: financial_services
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 100000
fraud:
enabled: true
fraud_rate: 0.02 # 2% fraud rate
types:
split_transaction: 0.20
duplicate_payment: 0.15
fictitious_transaction: 0.15
ghost_employee: 0.10
kickback_scheme: 0.10
revenue_manipulation: 0.10
expense_capitalization: 0.10
unauthorized_discount: 0.10
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
categories:
fraud: 1.0 # Focus on fraud only
graph_export:
enabled: true
formats:
- pytorch_geometric
split:
train: 0.7
val: 0.15
test: 0.15
stratify: is_fraud
output:
format: csv
Output Files
Tabular Data
output/
├── transactions/
│ └── journal_entries.csv
├── labels/
│ ├── anomaly_labels.csv
│ └── fraud_labels.csv
└── master_data/
└── ...
Graph Data
output/graphs/transaction_network/pytorch_geometric/
├── node_features.pt
├── edge_index.pt
├── edge_attr.pt
├── labels.pt
├── train_mask.pt
├── val_mask.pt
└── test_mask.pt
ML Pipeline
1. Load Data
import pandas as pd
import torch
# Load tabular data
entries = pd.read_csv('output/transactions/journal_entries.csv')
labels = pd.read_csv('output/labels/fraud_labels.csv')
# Merge
data = entries.merge(labels, on='document_id', how='left')
data['is_fraud'] = data['fraud_type'].notna()
print(f"Total entries: {len(data)}")
print(f"Fraud entries: {data['is_fraud'].sum()}")
print(f"Fraud rate: {data['is_fraud'].mean():.2%}")
2. Feature Engineering
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Numerical features
numerical_features = [
    'debit_amount', 'credit_amount', 'line_count'
]
# Derived features
data['log_amount'] = np.log1p(data['debit_amount'] + data['credit_amount'])
data['is_round'] = (data['debit_amount'] % 100 == 0).astype(int)
data['is_weekend'] = pd.to_datetime(data['posting_date']).dt.dayofweek >= 5
data['is_month_end'] = pd.to_datetime(data['posting_date']).dt.day >= 28
derived_features = ['log_amount', 'is_round', 'is_weekend', 'is_month_end']
# Categorical features
categorical_features = ['source', 'business_process', 'company_code']
3. Train Model (Tabular)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
# Prepare features
X = data[numerical_features + derived_features]
y = data['is_fraud']
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
4. Train GNN Model
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.data import Data
# Load graph data
node_features = torch.load('output/graphs/.../node_features.pt')
edge_index = torch.load('output/graphs/.../edge_index.pt')
labels = torch.load('output/graphs/.../labels.pt')
train_mask = torch.load('output/graphs/.../train_mask.pt')
val_mask = torch.load('output/graphs/.../val_mask.pt')
test_mask = torch.load('output/graphs/.../test_mask.pt')
data = Data(
x=node_features,
edge_index=edge_index,
y=labels,
train_mask=train_mask,
val_mask=val_mask,
test_mask=test_mask,
)
# Define GNN
class FraudGNN(torch.nn.Module):
def __init__(self, num_features, hidden_channels):
super().__init__()
self.conv1 = GCNConv(num_features, hidden_channels)
self.conv2 = GCNConv(hidden_channels, hidden_channels)
self.linear = torch.nn.Linear(hidden_channels, 2)
def forward(self, x, edge_index):
x = self.conv1(x, edge_index).relu()
x = F.dropout(x, p=0.5, training=self.training)
x = self.conv2(x, edge_index).relu()
x = self.linear(x)
return x
# Train
model = FraudGNN(data.num_features, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(200):
model.train()
optimizer.zero_grad()
out = model(data.x, data.edge_index)
loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
loss.backward()
optimizer.step()
# Validation
if epoch % 10 == 0:
model.eval()
pred = out.argmax(dim=1)
val_acc = (pred[data.val_mask] == data.y[data.val_mask]).float().mean()
print(f'Epoch {epoch}: Val Acc: {val_acc:.4f}')
Fraud Types for Training
| Type | Detection Approach | Difficulty |
|---|---|---|
| Split Transaction | Amount patterns | Easy |
| Duplicate Payment | Similarity matching | Easy |
| Fictitious Transaction | Anomaly detection | Medium |
| Ghost Employee | Entity verification | Medium |
| Kickback Scheme | Relationship analysis | Hard |
| Revenue Manipulation | Trend analysis | Hard |
Best Practices
Class Imbalance
from imblearn.over_sampling import SMOTE
# Handle imbalanced classes
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
Threshold Tuning
from sklearn.metrics import precision_recall_curve
# Find optimal threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
f1_scores = 2 * precision * recall / (precision + recall)
optimal_idx = f1_scores.argmax()
optimal_threshold = thresholds[optimal_idx]
Cross-Validation
from sklearn.model_selection import StratifiedKFold, cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV ROC-AUC: {scores.mean():.4f} (+/- {scores.std():.4f})")
See Also
Audit Analytics
Test audit procedures and analytical tools with realistic data.
Overview
SyntheticData generates comprehensive datasets for audit analytics:
- Complete document trails
- Known control exceptions
- Benford’s Law compliant amounts
- Realistic temporal patterns
Configuration
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 100000
benford:
enabled: true # Realistic first-digit distribution
temporal:
month_end_spike: 2.5
quarter_end_spike: 3.0
year_end_spike: 4.0
document_flows:
p2p:
enabled: true
flow_rate: 0.35
three_way_match:
quantity_tolerance: 0.02
price_tolerance: 0.01
o2c:
enabled: true
flow_rate: 0.35
master_data:
vendors:
count: 200
customers:
count: 500
internal_controls:
enabled: true
anomaly_injection:
enabled: true
total_rate: 0.03
generate_labels: true
categories:
fraud: 0.20
error: 0.50
process_issue: 0.30
output:
format: csv
Audit Procedures
1. Benford’s Law Analysis
Test first-digit distribution of amounts:
import pandas as pd
import numpy as np
from scipy import stats
# Load data
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Extract first digits
amounts = entries['debit_amount'] + entries['credit_amount']
amounts = amounts[amounts > 0]
first_digits = amounts.apply(lambda x: int(str(x)[0]))
# Calculate observed distribution
observed = first_digits.value_counts().sort_index()
observed_freq = observed / observed.sum()
# Expected Benford distribution
benford = {d: np.log10(1 + 1/d) for d in range(1, 10)}
# Chi-square test
chi_stat, p_value = stats.chisquare(
observed.values,
[benford[d] * observed.sum() for d in range(1, 10)]
)
print(f"Chi-square: {chi_stat:.2f}, p-value: {p_value:.4f}")
2. Three-Way Match Testing
Verify PO, GR, and Invoice alignment:
# Load documents
po = pd.read_csv('output/documents/purchase_orders.csv')
gr = pd.read_csv('output/documents/goods_receipts.csv')
inv = pd.read_csv('output/documents/vendor_invoices.csv')
# Join on references
matched = po.merge(gr, left_on='po_number', right_on='po_reference')
matched = matched.merge(inv, left_on='po_number', right_on='po_reference')
# Calculate variances
matched['qty_variance'] = abs(matched['gr_quantity'] - matched['po_quantity']) / matched['po_quantity']
matched['price_variance'] = abs(matched['inv_unit_price'] - matched['po_unit_price']) / matched['po_unit_price']
# Identify exceptions
qty_exceptions = matched[matched['qty_variance'] > 0.02]
price_exceptions = matched[matched['price_variance'] > 0.01]
print(f"Quantity exceptions: {len(qty_exceptions)}")
print(f"Price exceptions: {len(price_exceptions)}")
3. Duplicate Payment Detection
Find potential duplicate payments:
# Load payments and invoices
payments = pd.read_csv('output/documents/payments.csv')
invoices = pd.read_csv('output/documents/vendor_invoices.csv')
# Group by vendor and amount
potential_dups = invoices.groupby(['vendor_id', 'total_amount']).filter(
lambda x: len(x) > 1
)
# Check payment dates
duplicates = []
for (vendor, amount), group in potential_dups.groupby(['vendor_id', 'total_amount']):
if len(group) > 1:
duplicates.append({
'vendor': vendor,
'amount': amount,
'count': len(group),
'invoices': group['invoice_number'].tolist()
})
print(f"Potential duplicate payments: {len(duplicates)}")
4. Journal Entry Testing
Analyze manual journal entries:
# Load entries
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Filter manual entries
manual = entries[entries['source'] == 'Manual']
# Analyze characteristics
print(f"Manual entries: {len(manual)}")
print(f"Weekend entries: {manual['is_weekend'].sum()}")
print(f"Month-end entries: {manual['is_month_end'].sum()}")
# Top accounts with manual entries
top_accounts = manual.groupby('account_number').size().sort_values(ascending=False).head(10)
5. Cutoff Testing
Verify transactions recorded in correct period:
# Identify late postings
entries['posting_date'] = pd.to_datetime(entries['posting_date'])
entries['document_date'] = pd.to_datetime(entries['document_date'])
entries['posting_lag'] = (entries['posting_date'] - entries['document_date']).dt.days
# Find entries posted after period end
late_postings = entries[entries['posting_lag'] > 5]
print(f"Late postings: {len(late_postings)}")
# Check year-end cutoff: documents dated in a prior year but posted in the latest year
latest_year = entries['posting_date'].dt.year.max()
cutoff_issues = entries[
    (entries['document_date'].dt.year < latest_year) &
    (entries['posting_date'].dt.year == latest_year)
]
print(f"Year-end cutoff exceptions: {len(cutoff_issues)}")
6. Segregation of Duties
Check for SoD violations:
# Load controls data
sod_rules = pd.read_csv('output/controls/sod_rules.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Find entries with SoD violations
violations = entries[entries['sod_violation'] == True]
print(f"SoD violations: {len(violations)}")
# Analyze by conflict type
violation_types = violations.groupby('sod_conflict_type').size()
Audit Analytics Dashboard
Key Metrics
| Metric | Query | Expected |
|---|---|---|
| Benford Chi-square | First-digit test | < 15.51 (p > 0.05) |
| Match exceptions | Three-way match | < 2% |
| Duplicate indicators | Amount/vendor matching | < 0.5% |
| Late postings | Document vs posting date | < 1% |
| SoD violations | Control violations | Known from labels |
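The 15.51 cutoff is the chi-square critical value for 8 degrees of freedom (nine first digits) at the 5% significance level, which can be reproduced with SciPy:
from scipy import stats
# 95th percentile of the chi-square distribution with 8 degrees of freedom
critical_value = stats.chi2.ppf(0.95, df=8)
print(f"{critical_value:.2f}")  # 15.51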
Population Statistics
# Summary statistics
print("=== Audit Population Summary ===")
print(f"Total transactions: {len(entries):,}")
print(f"Total amount: ${entries['debit_amount'].sum():,.2f}")
print(f"Unique vendors: {entries['vendor_id'].nunique()}")
print(f"Unique customers: {entries['customer_id'].nunique()}")
print(f"Date range: {entries['posting_date'].min()} to {entries['posting_date'].max()}")
7. Financial Statement Analytics (v0.6.0)
Analyze generated financial statements for consistency, trend analysis, and ratio testing:
import pandas as pd
# Load financial statements
balance_sheet = pd.read_csv('output/financial_reporting/balance_sheet.csv')
income_stmt = pd.read_csv('output/financial_reporting/income_statement.csv')
cash_flow = pd.read_csv('output/financial_reporting/cash_flow.csv')
# Verify accounting equation holds
for _, row in balance_sheet.iterrows():
assets = row['total_assets']
liabilities = row['total_liabilities']
equity = row['total_equity']
imbalance = abs(assets - (liabilities + equity))
assert imbalance < 0.01, f"A=L+E violation: {imbalance}"
# Analytical procedures: ratio analysis
ratios = pd.DataFrame({
'period': balance_sheet['period'],
'current_ratio': balance_sheet['current_assets'] / balance_sheet['current_liabilities'],
'gross_margin': income_stmt['gross_profit'] / income_stmt['revenue'],
'debt_to_equity': balance_sheet['total_liabilities'] / balance_sheet['total_equity'],
})
# Flag unusual ratio movements (> 2 std devs from mean)
for col in ['current_ratio', 'gross_margin', 'debt_to_equity']:
mean = ratios[col].mean()
std = ratios[col].std()
outliers = ratios[abs(ratios[col] - mean) > 2 * std]
if len(outliers) > 0:
print(f"Unusual {col} in periods: {outliers['period'].tolist()}")
Budget Variance Analysis
When budgets are enabled, compare budget to actual for each account:
# Load budget vs actual data
budget = pd.read_csv('output/financial_reporting/budget_vs_actual.csv')
# Calculate variance percentage
budget['variance_pct'] = (budget['actual'] - budget['budget']) / budget['budget']
# Identify material variances (> 10%)
material = budget[abs(budget['variance_pct']) > 0.10]
print(f"Material variances: {len(material)} accounts")
print(material[['account', 'budget', 'actual', 'variance_pct']].to_string())
# Favorable vs unfavorable analysis
favorable = budget[
((budget['account_type'] == 'revenue') & (budget['variance_pct'] > 0)) |
((budget['account_type'] == 'expense') & (budget['variance_pct'] < 0))
]
print(f"Favorable variances: {len(favorable)}")
Management KPI Trend Analysis
# Load KPI data
kpis = pd.read_csv('output/financial_reporting/management_kpis.csv')
# Check for declining trends
for kpi_name in kpis['kpi_name'].unique():
series = kpis[kpis['kpi_name'] == kpi_name].sort_values('period')
values = series['value'].values
# Simple trend check: are last 3 periods declining?
if len(values) >= 3 and all(values[-3+i] > values[-3+i+1] for i in range(2)):
print(f"Declining trend: {kpi_name}")
Payroll Audit Testing (v0.6.0)
When the HR module is enabled, test payroll data for anomalies:
# Load payroll data
payroll = pd.read_csv('output/hr/payroll_entries.csv')
# Ghost employee check: employees with pay but no time entries
time_entries = pd.read_csv('output/hr/time_entries.csv')
paid_employees = set(payroll['employee_id'].unique())
active_employees = set(time_entries['employee_id'].unique())
no_time = paid_employees - active_employees
print(f"Employees paid without time entries: {len(no_time)}")
# Payroll amount reasonableness
payroll_summary = payroll.groupby('employee_id')['gross_pay'].sum()
mean_pay = payroll_summary.mean()
std_pay = payroll_summary.std()
outliers = payroll_summary[payroll_summary > mean_pay + 3 * std_pay]
print(f"Unusually high total pay: {len(outliers)} employees")
# Expense policy violation detection
expenses = pd.read_csv('output/hr/expense_reports.csv')
violations = expenses[expenses['policy_violation'] == True]
print(f"Expense policy violations: {len(violations)}")
Sampling
Statistical Sampling
from scipy import stats
# Calculate sample size for attribute testing
population_size = len(entries)
confidence_level = 0.95
tolerable_error_rate = 0.05
expected_error_rate = 0.01
# Sample size formula
z_score = stats.norm.ppf(1 - (1 - confidence_level) / 2)
sample_size = int(
(z_score ** 2 * expected_error_rate * (1 - expected_error_rate)) /
(tolerable_error_rate ** 2)
)
print(f"Recommended sample size: {sample_size}")
# Random sample
sample = entries.sample(n=sample_size, random_state=42)
Stratified Sampling
# Stratify by amount
entries['amount_stratum'] = pd.qcut(
entries['debit_amount'] + entries['credit_amount'],
q=5,
labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
)
# Sample from each stratum
stratified_sample = entries.groupby('amount_stratum').apply(
lambda x: x.sample(n=min(100, len(x)), random_state=42)
)
See Also
SOX Compliance Testing
Test internal control monitoring systems.
Overview
SyntheticData generates data for SOX 404 compliance testing:
- Internal control definitions
- Control test evidence
- Segregation of Duties violations
- Approval workflow data
Configuration
global:
seed: 42
industry: financial_services
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 50000
internal_controls:
enabled: true
controls:
- id: "CTL-001"
name: "Payment Authorization"
type: preventive
frequency: continuous
threshold: 10000
assertions: [authorization, validity]
- id: "CTL-002"
name: "Journal Entry Review"
type: detective
frequency: daily
assertions: [accuracy, completeness]
- id: "CTL-003"
name: "Bank Reconciliation"
type: detective
frequency: monthly
assertions: [existence, completeness]
sod_rules:
- conflict_type: create_approve
processes: [ap_invoice, ap_payment]
description: "Cannot create and approve payments"
- conflict_type: create_approve
processes: [ar_invoice, ar_receipt]
description: "Cannot create and approve receipts"
- conflict_type: custody_recording
processes: [cash_handling, cash_recording]
description: "Cannot handle and record cash"
approval:
enabled: true
thresholds:
- level: 1
max_amount: 5000
- level: 2
max_amount: 25000
- level: 3
max_amount: 100000
- level: 4
max_amount: null
fraud:
enabled: true
fraud_rate: 0.005
types:
skipped_approval: 0.30
threshold_manipulation: 0.30
unauthorized_discount: 0.20
duplicate_payment: 0.20
output:
format: csv
Control Testing
1. Control Evidence
import pandas as pd
# Load control data
controls = pd.read_csv('output/controls/internal_controls.csv')
mappings = pd.read_csv('output/controls/control_account_mappings.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Identify entries subject to each control
for _, control in controls.iterrows():
control_id = control['control_id']
threshold = control['threshold']
# Filter entries in scope
if pd.notna(threshold):
in_scope = entries[
(entries['control_ids'].str.contains(control_id)) &
(entries['debit_amount'] >= threshold)
]
else:
in_scope = entries[entries['control_ids'].str.contains(control_id)]
print(f"{control['name']}: {len(in_scope)} entries in scope")
2. Approval Testing
# Load entries with approval data
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Test approval compliance
approval_required = entries[entries['debit_amount'] >= 5000]
approved = approval_required[approval_required['approved_by'].notna()]
not_approved = approval_required[approval_required['approved_by'].isna()]
print(f"Requiring approval: {len(approval_required)}")
print(f"Properly approved: {len(approved)}")
print(f"Missing approval: {len(not_approved)}")
# Test approval levels
def check_approval_level(row):
amount = row['debit_amount']
if amount >= 100000:
return row['approval_level'] >= 4
elif amount >= 25000:
return row['approval_level'] >= 3
elif amount >= 5000:
return row['approval_level'] >= 2
return True
entries['approval_adequate'] = entries.apply(check_approval_level, axis=1)
inadequate = entries[~entries['approval_adequate']]
print(f"Inadequate approval level: {len(inadequate)}")
3. Segregation of Duties
# Load SoD data
sod_rules = pd.read_csv('output/controls/sod_rules.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Identify violations
violations = entries[entries['sod_violation'] == True]
print(f"Total SoD violations: {len(violations)}")
# Analyze by type
violation_summary = violations.groupby('sod_conflict_type').agg({
'document_id': 'count',
'debit_amount': 'sum'
}).rename(columns={'document_id': 'count', 'debit_amount': 'total_amount'})
print("\nViolations by type:")
print(violation_summary)
# Analyze by user
user_violations = violations.groupby('created_by').size().sort_values(ascending=False)
print("\nTop violators:")
print(user_violations.head(10))
4. Threshold Manipulation
# Detect threshold-adjacent transactions
approval_threshold = 10000
entries['near_threshold'] = (
(entries['debit_amount'] >= approval_threshold * 0.9) &
(entries['debit_amount'] < approval_threshold)
)
near_threshold = entries[entries['near_threshold']]
print(f"Near-threshold entries: {len(near_threshold)}")
# Statistical analysis
expected_near = len(entries) * 0.10 # 10% would be in this range randomly
chi_stat = ((len(near_threshold) - expected_near) ** 2) / expected_near
print(f"Chi-square statistic: {chi_stat:.2f}")
Control Matrix
Generate RACM
# Risk and Control Matrix
controls = pd.read_csv('output/controls/internal_controls.csv')
mappings = pd.read_csv('output/controls/control_account_mappings.csv')
racm = controls.merge(mappings, on='control_id')
racm = racm[[
'control_id', 'name', 'control_type', 'frequency',
'account_number', 'assertions'
]]
# Add testing results
racm['population'] = racm['account_number'].apply(
lambda x: len(entries[entries['account_number'] == x])
)
racm['exceptions'] = racm['control_id'].apply(
lambda x: len(entries[
(entries['control_ids'].str.contains(x)) &
(entries['is_anomaly'] == True)
])
)
racm['exception_rate'] = racm['exceptions'] / racm['population']
print(racm)
Test Documentation
Control Test Template
def document_control_test(control_id, entries, sample_size=25):
"""Generate control test documentation."""
control = controls[controls['control_id'] == control_id].iloc[0]
# Get population
population = entries[entries['control_ids'].str.contains(control_id)]
# Sample
sample = population.sample(n=min(sample_size, len(population)), random_state=42)
# Test results
exceptions = sample[sample['is_anomaly'] == True]
return {
'control_id': control_id,
'control_name': control['name'],
'control_type': control['control_type'],
'frequency': control['frequency'],
'population_size': len(population),
'sample_size': len(sample),
'exceptions_found': len(exceptions),
'exception_rate': len(exceptions) / len(sample),
'conclusion': 'Effective' if len(exceptions) == 0 else 'Exception Noted'
}
# Test all controls
results = []
for control_id in controls['control_id']:
result = document_control_test(control_id, entries)
results.append(result)
test_results = pd.DataFrame(results)
test_results.to_csv('control_test_results.csv', index=False)
Deficiency Assessment
# Classify deficiencies
def assess_deficiency(exception_rate, amount_impact):
if exception_rate > 0.10 or amount_impact > 1000000:
return 'Material Weakness'
elif exception_rate > 0.05 or amount_impact > 100000:
return 'Significant Deficiency'
elif exception_rate > 0:
return 'Control Deficiency'
return 'No Deficiency'
test_results['amount_impact'] = test_results['control_id'].apply(
lambda x: entries[
(entries['control_ids'].str.contains(x)) &
(entries['is_anomaly'] == True)
]['debit_amount'].sum()
)
test_results['deficiency_classification'] = test_results.apply(
lambda x: assess_deficiency(x['exception_rate'], x['amount_impact']),
axis=1
)
print(test_results[['control_name', 'exception_rate', 'deficiency_classification']])
See Also
Process Mining
Generate OCEL 2.0 event logs for process mining analysis across 8 enterprise process families.
Overview
SyntheticData generates comprehensive process mining data:
- OCEL 2.0 compliant event logs with 88 activity types and 52 object types
- 8 process families: P2P, O2C, S2C, H2R, MFG, BANK, AUDIT, Bank Recon
- Object-centric relationships with lifecycle states
- Three variant types per generator: HappyPath (75%), ExceptionPath (20%), ErrorPath (5%)
- Cross-process object linking via shared document IDs
Configuration
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 6
transactions:
target_count: 50000
document_flows:
p2p:
enabled: true
flow_rate: 0.4
completion_rate: 0.95
stages:
po_approval_rate: 0.9
gr_rate: 0.98
invoice_rate: 0.95
payment_rate: 0.92
o2c:
enabled: true
flow_rate: 0.4
completion_rate: 0.90
stages:
so_approval_rate: 0.95
credit_check_pass_rate: 0.9
delivery_rate: 0.98
invoice_rate: 0.95
collection_rate: 0.85
master_data:
vendors:
count: 100
customers:
count: 200
materials:
count: 500
employees:
count: 30
output:
format: csv
OCEL 2.0 Export
Use the datasynth-ocpm crate for OCEL 2.0 export:
#![allow(unused)]
fn main() {
use datasynth_ocpm::{OcpmGenerator, Ocel2Exporter, ExportFormat};
let mut generator = OcpmGenerator::new(seed);
// Generate 5,000 P2P and 5,000 O2C cases within the given date range
let event_log = generator.generate_event_log(5000, 5000, start_date, end_date)?;
let exporter = Ocel2Exporter::new(ExportFormat::Json);
exporter.export(&event_log, "output/ocel2.json")?;
}
P2P Process
Event Sequence
Create PO → Approve PO → Release PO → Create GR → Post GR →
Receive Invoice → Verify Invoice → Post Invoice → Execute Payment
Objects
| Object Type | Attributes |
|---|---|
| PurchaseOrder | po_number, vendor_id, total_amount |
| GoodsReceipt | gr_number, po_reference, quantity |
| VendorInvoice | invoice_number, amount, due_date |
| Payment | payment_number, amount, bank_ref |
| Material | material_id, description |
| Vendor | vendor_id, name |
Object Relationships
PurchaseOrder ─┬── contains ──→ Material
└── from ──────→ Vendor
GoodsReceipt ──── for ──────→ PurchaseOrder
VendorInvoice ─── for ──────→ PurchaseOrder
└── matches ──→ GoodsReceipt
Payment ───────── pays ──────→ VendorInvoice
O2C Process
Event Sequence
Create SO → Check Credit → Release SO → Create Delivery →
Pick → Pack → Ship → Create Invoice → Post Invoice → Receive Payment
Objects
| Object Type | Attributes |
|---|---|
| SalesOrder | so_number, customer_id, total_amount |
| Delivery | delivery_number, so_reference |
| CustomerInvoice | invoice_number, amount, due_date |
| CustomerPayment | receipt_number, amount |
| Material | material_id, description |
| Customer | customer_id, name |
Analysis with PM4Py
Load Event Log
from pm4py.objects.ocel.importer import jsonocel
# Load OCEL 2.0
ocel = jsonocel.apply("output/ocel2.json")
print(f"Events: {len(ocel.events)}")
print(f"Objects: {len(ocel.objects)}")
print(f"Object types: {ocel.object_types}")
Process Discovery
from pm4py.algo.discovery.ocel import algorithm as ocel_discovery
# Discover object-centric Petri net
ocpn = ocel_discovery.apply(ocel)
# Visualize
from pm4py.visualization.ocel.ocpn import visualizer
gviz = visualizer.apply(ocpn)
visualizer.save(gviz, "ocpn.png")
Object Lifecycle Analysis
from pm4py.statistics.ocel import object_lifecycle
# Analyze PurchaseOrder lifecycle
po_lifecycle = object_lifecycle.get_lifecycle_summary(
ocel,
object_type="PurchaseOrder"
)
print("Purchase Order Lifecycle:")
print(f" Average duration: {po_lifecycle['avg_duration']}")
print(f" Completion rate: {po_lifecycle['completion_rate']:.2%}")
Conformance Checking
from pm4py.algo.conformance.ocel import algorithm as ocel_conformance
# Check conformance against expected model
results = ocel_conformance.apply(ocel, ocpn)
print(f"Conformant cases: {results['conformant']}")
print(f"Non-conformant: {results['non_conformant']}")
Process Metrics
Throughput Time
import pandas as pd
from datetime import timedelta
# Load events
events = pd.DataFrame(ocel.events)
# Calculate case durations
case_durations = events.groupby('case_id').agg({
'timestamp': ['min', 'max']
})
case_durations['duration'] = (
case_durations[('timestamp', 'max')] -
case_durations[('timestamp', 'min')]
)
print(f"Mean throughput time: {case_durations['duration'].mean()}")
print(f"Median throughput time: {case_durations['duration'].median()}")
Activity Frequency
# Count activity occurrences
activity_counts = events['activity'].value_counts()
print("Activity frequency:")
print(activity_counts)
Bottleneck Analysis
# Calculate waiting times between activities
events = events.sort_values(['case_id', 'timestamp'])
events['wait_time'] = events.groupby('case_id')['timestamp'].diff()
# Find bottlenecks
bottlenecks = events.groupby('activity')['wait_time'].mean().sort_values(ascending=False)
print("Bottleneck activities:")
print(bottlenecks.head(5))
Variant Analysis
from pm4py.algo.discovery.ocel import variants
# Get process variants
variant_stats = variants.get_variants_statistics(ocel)
print(f"Unique variants: {len(variant_stats)}")
print("\nTop variants:")
for variant, stats in sorted(variant_stats.items(), key=lambda x: -x[1]['count'])[:5]:
print(f" {variant}: {stats['count']} cases")
Integration with Tools
Celonis
# Export to Celonis format
from pm4py.objects.ocel.exporter import csv as ocel_csv_exporter
ocel_csv_exporter.apply(ocel, "output/celonis/")
# Upload CSV files to Celonis
OCPA
# Export to OCPA format
from pm4py.objects.ocel.exporter import sqlite
sqlite.apply(ocel, "output/ocel.sqlite")
# Open in OCPA tool
New Process Families (v0.6.2)
S2C — Source-to-Contract
Create Sourcing Project → Qualify Supplier → Publish RFx →
Submit Bid → Evaluate Bids → Award Contract →
Activate Contract → Complete Sourcing
H2R — Hire-to-Retire
Submit Time Entry → Approve Time Entry →
Create Payroll Run → Calculate Payroll → Approve Payroll → Post Payroll
Submit Expense → Approve Expense
MFG — Manufacturing
Create Production Order → Release → Start Operation →
Complete Operation → Quality Inspection → Confirm Production →
Close Production Order
BANK — Banking Operations
Onboard Customer → KYC Review → Open Account →
Execute Transaction → Authorize → Complete Transaction
AUDIT — Audit Engagement Lifecycle
Create Engagement → Plan → Assess Risk → Create Workpaper →
Collect Evidence → Review Workpaper → Raise Finding →
Remediate Finding → Record Judgment → Complete Engagement
Bank Recon — Bank Reconciliation
Import Bank Statement → Auto Match Items → Manual Match Item →
Create Reconciling Item → Resolve Exception →
Approve Reconciliation → Post Entries → Complete Reconciliation
S2P Process Mining
The full Source-to-Pay chain provides rich process mining opportunities beyond basic P2P:
Extended Event Sequence
Spend Analysis → Supplier Qualification → RFx Published →
Bid Received → Bid Evaluation → Contract Award →
Create PO → Approve PO → Release PO →
Create GR → Post GR →
Receive Invoice → Verify Invoice (Three-Way Match) → Post Invoice →
Schedule Payment → Execute Payment
Extended Object Types
| Object Type | Attributes |
|---|---|
| SpendCategory | category_code, total_spend, vendor_count |
| SourcingProject | project_type, target_savings, status |
| SupplierBid | vendor_id, bid_amount, technical_score |
| ProcurementContract | contract_value, validity_period, terms |
| PurchaseRequisition | requester, catalog_item, urgency |
| PurchaseOrder | po_type, vendor_id, total_amount |
| GoodsReceipt | gr_number, received_qty, movement_type |
| VendorInvoice | invoice_amount, match_status, due_date |
| Payment | payment_method, cleared_amount, bank_ref |
Cycle Time Analysis
# Analyze end-to-end procurement cycle times
po_events = events[events['object_type'] == 'PurchaseOrder']
# PO creation to payment completion
cycle_times = po_events.groupby('case_id')['timestamp'].agg(['min', 'max'])
cycle_times['cycle_time'] = cycle_times['max'] - cycle_times['min']
# Segment by PO type (join the object attributes back onto the per-case durations)
cycle_by_type = (
    po_events[['case_id', 'object_id']].drop_duplicates()
    .merge(objects[['object_id', 'po_type']], on='object_id')
    .merge(cycle_times.reset_index(), on='case_id')
    .groupby('po_type')['cycle_time'].describe()
)
Three-Way Match Conformance
# Identify invoices that failed three-way match
match_events = events[events['activity'] == 'Verify Invoice']
blocked = match_events[match_events['match_status'] == 'blocked']
print(f"Three-way match block rate: {len(blocked)/len(match_events):.1%}")
print(f"Most common variance: {blocked['variance_type'].mode()[0]}")
See Also
- Document Flows — P2P and O2C configuration
- Process Chains — Enterprise process chain architecture
- datasynth-ocpm Crate — OCEL 2.0 implementation
- Audit Analytics
AML/KYC Testing
Generate realistic banking transaction data with KYC profiles and AML typologies for compliance testing and fraud detection model development.
Overview
The datasynth-banking module generates synthetic banking data designed for:
- AML System Testing: Validate transaction monitoring rules against known patterns
- KYC Process Testing: Test customer onboarding and risk assessment workflows
- ML Model Training: Train supervised models with labeled fraud typologies
- Scenario Analysis: Test detection capabilities against specific attack patterns
KYC Profile Generation
Customer Types
| Type | Description | Typical Characteristics |
|---|---|---|
| Retail | Individual customers | Salary deposits, consumer spending |
| Business | Small to medium businesses | Payroll, supplier payments |
| Trust | Trust accounts, complex structures | Investment flows, distributions |
KYC Profile Components
Each customer has a KYC profile defining expected behavior:
kyc_profile:
declared_turnover: 50000 # Expected monthly volume
transaction_frequency: 25 # Expected transactions/month
source_of_funds: "employment" # Declared income source
geographic_exposure: ["US", "EU"]
cash_intensity: 0.05 # Expected cash ratio
beneficial_owner_complexity: 1 # Ownership layers
Risk Scoring
Customers are assigned risk scores based on:
- Geographic exposure (high-risk jurisdictions)
- Industry sector
- Transaction patterns vs. declared profile
- Beneficial ownership complexity
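As an illustration only (the weights below are assumptions, not the generator's internal scoring model), a simple additive score over the KYC profile fields might look like this:
HIGH_RISK_JURISDICTIONS = {"IR", "KP", "SY"}  # illustrative list
def score_customer(kyc_profile: dict) -> int:
    """Toy additive risk score on a 0-100 scale (illustrative weights)."""
    score = 0
    if set(kyc_profile.get("geographic_exposure", [])) & HIGH_RISK_JURISDICTIONS:
        score += 40
    if kyc_profile.get("cash_intensity", 0.0) > 0.30:
        score += 25
    score += 10 * max(kyc_profile.get("beneficial_owner_complexity", 1) - 1, 0)
    return min(score, 100)
# Low-risk retail example matching the profile shown above
print(score_customer({"geographic_exposure": ["US", "EU"],
                      "cash_intensity": 0.05,
                      "beneficial_owner_complexity": 1}))  # 0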
AML Typology Generation
Structuring
Breaking large transactions into smaller amounts to avoid reporting thresholds.
Detection Signatures:
- Multiple transactions just below $10,000 threshold
- Same-day deposits across multiple branches
- Round-number amounts (e.g., $9,900, $9,800)
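These signatures translate directly into a screening query; a sketch against the generated bank_transactions.csv (column names follow the transaction record shown later on this page):
import pandas as pd
txns = pd.read_csv("banking/bank_transactions.csv", parse_dates=["timestamp"])
# Cash credits just below the $10,000 reporting threshold
candidates = txns[
    (txns["direction"] == "credit")
    & (txns["category"] == "cash_deposit")
    & (txns["amount"].between(9000, 9999.99))
].assign(day=lambda df: df["timestamp"].dt.date)
# Flag accounts with two or more such deposits on the same day
alerts = (
    candidates.groupby(["account_id", "day"])
    .size()
    .reset_index(name="deposit_count")
    .query("deposit_count >= 2")
)
print(f"Structuring alerts: {len(alerts)}")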
Configuration:
typologies:
structuring:
enabled: true
rate: 0.001
threshold: 10000
margin: 500
Funnel Accounts
Concentrating funds from multiple sources before moving to destination.
Pattern:
Source A ─┐
Source B ─┼─▶ Funnel Account ─▶ Destination
Source C ─┘
Detection Signatures:
- Many small inbound, few large outbound
- High throughput relative to account balance
- Short holding periods
Layering
Complex chains of transactions to obscure fund origins.
Pattern:
Origin ─▶ Shell A ─▶ Shell B ─▶ Shell C ─▶ Destination
└─▶ Mixing ─┘
Detection Signatures:
- Rapid consecutive transfers
- Circular transaction patterns
- Cross-border routing through multiple jurisdictions
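One of these signatures, circular transaction patterns, can be screened for with a graph cycle search (a sketch; the network construction mirrors the Network Analysis example later on this page, and cycle enumeration can be slow on very large graphs):
import networkx as nx
import pandas as pd
txns = pd.read_csv("banking/bank_transactions.csv")
# Directed account-to-counterparty graph
G = nx.DiGraph()
for _, txn in txns.dropna(subset=["counterparty_id"]).iterrows():
    G.add_edge(txn["account_id"], txn["counterparty_id"], amount=txn["amount"])
# Short cycles (funds returning to their origin) are layering candidates
cycles = [cycle for cycle in nx.simple_cycles(G) if len(cycle) <= 4]
print(f"Candidate circular routes: {len(cycles)}")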
Money Mule Networks
Using recruited individuals to move illicit funds.
Pattern:
Fraudster ─▶ Mule 1 ─▶ Cash Withdrawal
─▶ Mule 2 ─▶ Wire Transfer
─▶ Mule 3 ─▶ Crypto Exchange
Detection Signatures:
- New accounts with sudden high volume
- Immediate outbound after inbound
- Multiple accounts with similar patterns
Round-Tripping
Moving funds in circular patterns to create apparent legitimacy.
Pattern:
Company A ─▶ Offshore ─▶ Company A (as "investment")
Detection Signatures:
- Funds return to origin within short period
- Offshore intermediaries
- Inflated invoicing
Fraud Patterns
Credit card fraud and synthetic identity patterns.
Patterns:
- Card testing (small amounts across merchants)
- Account takeover (changed behavior profile)
- Synthetic identity (blended PII attributes)
Generated Data
Output Files
banking/
├── banking_customers.csv # Customer profiles with KYC data
├── bank_accounts.csv # Account records with features
├── bank_transactions.csv # Transaction records
├── kyc_profiles.csv # Expected activity envelopes
├── counterparties.csv # Counterparty pool
├── aml_typology_labels.csv # Ground truth typology labels
├── entity_risk_labels.csv # Entity-level risk classifications
└── transaction_risk_labels.csv # Transaction-level classifications
Customer Record
customer_id,customer_type,name,created_at,risk_score,kyc_status,pep_flag,sanctions_flag
CUST001,retail,John Smith,2024-01-15,25,verified,false,false
CUST002,business,Acme Corp,2024-02-01,65,enhanced_due_diligence,false,false
Transaction Record
transaction_id,account_id,timestamp,amount,currency,direction,channel,category,counterparty_id
TXN001,ACC001,2024-03-15T10:30:00Z,9800.00,USD,credit,branch,cash_deposit,
TXN002,ACC001,2024-03-15T11:45:00Z,9750.00,USD,credit,branch,cash_deposit,
Typology Label
transaction_id,typology,confidence,pattern_id,related_transactions
TXN001,structuring,0.95,STRUCT_001,"TXN001,TXN002,TXN003"
TXN002,structuring,0.95,STRUCT_001,"TXN001,TXN002,TXN003"
Configuration
Basic Banking Setup
banking:
enabled: true
customers:
retail: 5000
business: 500
trust: 50
transactions:
target_count: 500000
date_range:
start: 2024-01-01
end: 2024-12-31
typologies:
structuring:
enabled: true
rate: 0.002
funnel:
enabled: true
rate: 0.001
layering:
enabled: true
rate: 0.0005
mule:
enabled: true
rate: 0.001
fraud:
enabled: true
rate: 0.005
labels:
generate: true
include_confidence: true
include_related: true
Adversarial Testing
Generate transactions designed to evade detection:
banking:
typologies:
spoofing:
enabled: true
strategies:
- threshold_aware # Varies amounts around thresholds
- temporal_distribution # Spreads over time windows
- channel_mixing # Uses multiple channels
Use Cases
Transaction Monitoring Rule Testing
# Generate data with known structuring patterns
datasynth-data generate --config banking_structuring.yaml --output ./test_data
# Expected results:
# - 0.2% of transactions should trigger structuring alerts
# - Labels in aml_typology_labels.csv for validation
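Before tuning rules against this data, a quick sanity check (a sketch assuming the output layout shown above) confirms the injected typology rate matches the configuration:
import pandas as pd
txns = pd.read_csv("test_data/banking/bank_transactions.csv")
labels = pd.read_csv("test_data/banking/aml_typology_labels.csv")
structuring = labels[labels["typology"] == "structuring"]
observed_rate = len(structuring) / len(txns)
print(f"Observed structuring rate: {observed_rate:.3%} (expected ~0.2%)")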
ML Model Training
import pandas as pd
from sklearn.model_selection import train_test_split
# Load transactions and labels
transactions = pd.read_csv("banking/bank_transactions.csv")
labels = pd.read_csv("banking/aml_typology_labels.csv")
# Merge and prepare features
data = transactions.merge(labels, on="transaction_id", how="left")
data["is_suspicious"] = data["typology"].notna()
features = ["amount"]  # minimal numeric feature set; extend with engineered features
# Split for training
X_train, X_test, y_train, y_test = train_test_split(
data[features],
data["is_suspicious"],
test_size=0.2,
stratify=data["is_suspicious"]
)
Network Analysis
The banking data supports graph-based analysis:
import networkx as nx
import pandas as pd
# Build transaction network
G = nx.DiGraph()
for _, txn in transactions.iterrows():
    if pd.notna(txn["counterparty_id"]):
        G.add_edge(txn["account_id"], txn["counterparty_id"],
                   weight=txn["amount"])
# Detect funnel accounts (high in-degree, low out-degree)
in_degree = dict(G.in_degree())
out_degree = dict(G.out_degree())
funnels = [n for n in G.nodes()
if in_degree.get(n, 0) > 10 and out_degree.get(n, 0) < 3]
KYC Deviation Analysis
# Compare actual behavior to KYC profile
customers = pd.read_csv("banking/banking_customers.csv")
kyc = pd.read_csv("banking/kyc_profiles.csv")
transactions = pd.read_csv("banking/bank_transactions.csv", parse_dates=["timestamp"])
# Calculate actual monthly volumes per customer
transactions["month"] = transactions["timestamp"].dt.to_period("M")
actual = (
    transactions.groupby(["customer_id", "month"])["amount"]
    .sum()
    .reset_index(name="actual")
)
# Compare to declared turnover
merged = actual.merge(kyc, on="customer_id")
merged["deviation"] = (merged["actual"] - merged["declared_turnover"]) / merged["declared_turnover"]
# Flag significant deviations
alerts = merged[merged["deviation"].abs() > 0.5]
Best Practices
Realistic Testing
- Match production volumes: Configure similar customer counts and transaction rates
- Use realistic ratios: Keep typology rates at realistic levels (0.1-1%)
- Include noise: Add legitimate edge cases that shouldn’t trigger alerts
Label Quality
- Verify ground truth: Labels reflect injected patterns, not detected ones
- Include confidence: Use confidence scores for uncertain classifications
- Track related transactions: Pattern IDs link related suspicious activity
Model Validation
- Test detection rates: Measure recall against known patterns
- Check false positives: Ensure legitimate transactions aren’t flagged
- Validate across typologies: Test each pattern type separately
See Also
ERP Load Testing
Generate high-volume data for ERP system testing.
Overview
SyntheticData generates realistic transaction volumes for:
- Load testing
- Stress testing
- Performance benchmarking
- System integration testing
Configuration
High Volume Generation
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
worker_threads: 8 # Maximize parallelism
transactions:
target_count: 1000000 # 1 million entries
line_items:
distribution: empirical
amounts:
min: 100
max: 10000000
distribution: log_normal
sources:
manual: 0.15
automated: 0.65
recurring: 0.15
adjustment: 0.05
temporal:
month_end_spike: 2.5
quarter_end_spike: 3.0
year_end_spike: 4.0
document_flows:
p2p:
enabled: true
flow_rate: 0.35
o2c:
enabled: true
flow_rate: 0.35
master_data:
vendors:
count: 2000
customers:
count: 5000
materials:
count: 10000
output:
format: csv
compression: none # Fastest for import
SAP ACDOCA Format
output:
files:
journal_entries: false
acdoca: true # SAP Universal Journal format
Volume Sizing
Transaction Volume Guidelines
| Company Size | Annual Entries | Per Day | Configuration |
|---|---|---|---|
| Small | 10,000 | ~30 | target_count: 10000 |
| Medium | 100,000 | ~300 | target_count: 100000 |
| Large | 1,000,000 | ~3,000 | target_count: 1000000 |
| Enterprise | 10,000,000 | ~30,000 | target_count: 10000000 |
Master Data Guidelines
| Size | Vendors | Customers | Materials |
|---|---|---|---|
| Small | 100 | 200 | 500 |
| Medium | 500 | 1,000 | 5,000 |
| Large | 2,000 | 10,000 | 50,000 |
| Enterprise | 10,000 | 100,000 | 500,000 |
Load Testing Scenarios
1. Steady State Load
Normal daily operation:
transactions:
target_count: 100000
temporal:
month_end_spike: 1.0 # No spikes
quarter_end_spike: 1.0
year_end_spike: 1.0
working_hours_only: true
2. Peak Period Load
Month-end closing:
global:
start_date: 2024-01-25
period_months: 1 # Focus on month-end
transactions:
target_count: 50000
temporal:
month_end_spike: 5.0 # 5x normal volume
3. Year-End Stress
Year-end closing simulation:
global:
start_date: 2024-12-01
period_months: 1
transactions:
target_count: 200000
temporal:
month_end_spike: 3.0
quarter_end_spike: 4.0
year_end_spike: 10.0 # Extreme spike
4. Batch Import
Large batch import testing:
transactions:
target_count: 500000
sources:
automated: 1.0 # All system-generated
output:
compression: none # For fastest import
Manufacturing ERP Testing (v0.6.0)
Production Order Load
Generate production orders with WIP tracking, routings, and standard costing:
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 12
worker_threads: 8
transactions:
target_count: 500000
manufacturing:
enabled: true
production_orders:
orders_per_month: 200 # High volume
avg_batch_size: 150
yield_rate: 0.96
rework_rate: 0.04
costing:
labor_rate_per_hour: 42.0
overhead_rate: 1.75
routing:
avg_operations: 6
setup_time_hours: 2.0
document_flows:
p2p:
enabled: true
flow_rate: 0.40 # Heavy procurement
subledger:
inventory:
enabled: true
valuation_methods:
- standard_cost
- weighted_average
This configuration exercises production order creation, goods issue to production, goods receipt from production, WIP valuation, and standard cost variance posting.
Three-Way Match with Source-to-Pay
Test the full procurement lifecycle from sourcing through payment:
source_to_pay:
enabled: true
sourcing:
projects_per_year: 20
rfx:
min_invited_vendors: 5
max_invited_vendors: 12
contracts:
min_duration_months: 12
max_duration_months: 24
p2p_integration:
off_contract_rate: 0.10 # 10% maverick spending
catalog_enforcement: true
document_flows:
p2p:
enabled: true
flow_rate: 0.40
three_way_match:
quantity_tolerance: 0.02
price_tolerance: 0.01
HR and Payroll Testing (v0.6.0)
Payroll Processing Load
Generate payroll runs, time entries, and expense reports:
master_data:
employees:
count: 500
hierarchy_depth: 6
hr:
enabled: true
payroll:
enabled: true
pay_frequency: "biweekly" # 26 pay periods per year
benefits_enrollment_rate: 0.75
retirement_participation_rate: 0.55
time_attendance:
enabled: true
overtime_rate: 0.15
expenses:
enabled: true
submission_rate: 0.40
policy_violation_rate: 0.05
This exercises payroll journal entry generation (salary, tax withholdings, benefits deductions), time and attendance record creation, and expense report approval workflows.
Expense Report Compliance
Test expense policy enforcement with elevated violation rates:
hr:
enabled: true
expenses:
enabled: true
submission_rate: 0.60 # 60% of employees submit
policy_violation_rate: 0.15 # Elevated violation rate for testing
anomaly_injection:
enabled: true
generate_labels: true
Procurement Testing (v0.6.0)
Vendor Scorecard and Qualification
Generate the full source-to-pay cycle for procurement system testing:
source_to_pay:
enabled: true
qualification:
pass_rate: 0.80
validity_days: 365
scorecards:
frequency: "quarterly"
grade_a_threshold: 85.0
grade_c_threshold: 55.0
catalog:
preferred_vendor_flag_rate: 0.65
multi_source_rate: 0.30
vendor_network:
enabled: true
depth: 3
Sales Quote Pipeline
Test quote-to-order conversion with the O2C flow:
sales_quotes:
enabled: true
quotes_per_month: 100
win_rate: 0.30
validity_days: 45
document_flows:
o2c:
enabled: true
flow_rate: 0.40
Won quotes automatically feed into the O2C document flow as sales orders.
Performance Monitoring
Generation Metrics
# Time generation
time datasynth-data generate --config config.yaml --output ./output
# Monitor memory
/usr/bin/time -v datasynth-data generate --config config.yaml --output ./output
# Watch progress
datasynth-data generate --config config.yaml --output ./output -v
Import Metrics
Track these during ERP import:
| Metric | Description |
|---|---|
| Import rate | Records per second |
| Memory usage | Peak RAM during import |
| CPU utilization | Processor load |
| I/O throughput | Disk read/write speed |
| Lock contention | Database lock waits |
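A minimal sketch for measuring import rate during a chunked bulk load (the connection string and staging table name are placeholders):
import time
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@localhost/erp_test")  # placeholder DSN
rows_loaded = 0
start = time.time()
for chunk in pd.read_csv("output/transactions/journal_entries.csv", chunksize=50_000):
    chunk.to_sql("journal_entries_staging", engine, if_exists="append", index=False)
    rows_loaded += len(chunk)
    elapsed = time.time() - start
    print(f"{rows_loaded:,} rows loaded | {rows_loaded / elapsed:,.0f} rows/sec")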
Data Import Strategies
SAP S/4HANA
# Generate ACDOCA format
datasynth-data generate --config config.yaml --output ./output
# Use SAP Data Services or LSMW for import
# Output: output/transactions/acdoca.csv
Oracle EBS
-- Create staging table
CREATE TABLE XX_JE_STAGING (
document_id VARCHAR2(36),
posting_date DATE,
account VARCHAR2(20),
debit NUMBER,
credit NUMBER
);
-- Load via SQL*Loader (control file sketch; date format assumed to be ISO)
LOAD DATA
INFILE 'journal_entries.csv'
INTO TABLE XX_JE_STAGING
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(document_id, posting_date DATE "YYYY-MM-DD", account, debit, credit)
Microsoft Dynamics
# Use Data Management Framework
# Import journal_entries.csv via Data Entity
Validation
Post-Import Checks
-- Verify record count
SELECT COUNT(*) FROM journal_entries;
-- Verify balance
SELECT SUM(debit) - SUM(credit) AS imbalance
FROM journal_entries;
-- Check date range
SELECT MIN(posting_date), MAX(posting_date)
FROM journal_entries;
Reconciliation
import pandas as pd
# Compare source and target
source = pd.read_csv('output/transactions/journal_entries.csv')
target = pd.read_sql('SELECT * FROM journal_entries', connection)
# Verify counts
assert len(source) == len(target), "Record count mismatch"
# Verify totals
assert abs(source['debit_amount'].sum() - target['debit'].sum()) < 0.01
Batch Processing
Chunked Generation
For very large volumes, generate in chunks:
# Generate 10 batches of 1M each
for i in {1..10}; do
datasynth-data generate \
--config config.yaml \
--output ./output/batch_$i \
--seed $((42 + i))
done
Parallel Import
# Import chunks in parallel
for batch in ./output/batch_*; do
import_job $batch &
done
wait
Performance Tips
Generation Speed
- Increase threads: worker_threads: 16
- Disable unnecessary features: turn off graph export, anomalies
- Use fast storage: NVMe SSD
- Reduce complexity: Smaller COA, fewer master records
Import Speed
- Disable triggers: During bulk import
- Drop indexes: Recreate after import
- Increase batch size: Larger commits
- Parallel loading: Multiple import streams
See Also
Causal Analysis
New in v0.5.0
Use DataSynth’s causal generation capabilities for “what-if” scenario testing and counterfactual analysis in audit and risk management.
When to Use Causal Generation
Causal generation is ideal when you need to:
- Test audit scenarios: “What would happen to fraud rates if we increased the approval threshold?”
- Risk assessment: “How would revenue change if we lost our top vendor?”
- Policy evaluation: “What is the causal effect of implementing a new control?”
- Training causal ML models: Generate data with known causal structure for model validation
Setting Up a Fraud Detection SCM
# Generate causally-structured fraud detection data
datasynth-data causal generate \
--template fraud_detection \
--samples 50000 \
--seed 42 \
--output ./fraud_causal
The fraud_detection template models:
- transaction_amount → approval_level (larger amounts require higher approval)
- transaction_amount → fraud_flag (larger amounts have higher fraud probability)
- vendor_risk → fraud_flag (risky vendors associated with more fraud)
Running Interventions
Answer “what if?” questions by forcing variables to specific values:
# What if all transactions were $50,000?
datasynth-data causal intervene \
--template fraud_detection \
--variable transaction_amount \
--value 50000 \
--samples 10000 \
--output ./intervention_50k
# What if vendor risk were always high (0.9)?
datasynth-data causal intervene \
--template fraud_detection \
--variable vendor_risk \
--value 0.9 \
--samples 10000 \
--output ./intervention_high_risk
Compare the intervention output against the baseline to estimate causal effects.
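A sketch of that comparison (the CSV file names inside the output directories are assumptions):
import pandas as pd
baseline = pd.read_csv("fraud_causal/causal_data.csv")        # assumed file name
intervened = pd.read_csv("intervention_50k/causal_data.csv")  # assumed file name
# Average effect of do(transaction_amount = 50000) on the fraud flag
effect = intervened["fraud_flag"].mean() - baseline["fraud_flag"].mean()
print(f"Baseline fraud rate:     {baseline['fraud_flag'].mean():.3%}")
print(f"Intervention fraud rate: {intervened['fraud_flag'].mean():.3%}")
print(f"Estimated causal effect: {effect:+.3%}")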
Counterfactual Analysis for Audit
For individual transaction review:
from datasynth_py import DataSynth
synth = DataSynth()
# Load a specific flagged transaction
factual = {
"transaction_amount": 5000.0,
"approval_level": 1.0,
"vendor_risk": 0.3,
"fraud_flag": 0.0,
}
# What would have happened if the amount were 10x larger?
# The counterfactual preserves the same "noise" (latent factors)
# but propagates the new amount through the causal structure
This helps auditors understand which factors most influence risk assessments.
Configuration Example
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
causal:
enabled: true
template: "fraud_detection"
sample_size: 50000
validate: true
# Combine with regular generation
transactions:
target_count: 100000
fraud:
enabled: true
fraud_rate: 0.005
Validating Causal Structure
Verify that generated data preserves the intended causal relationships:
datasynth-data causal validate \
--data ./fraud_causal \
--template fraud_detection
The validator checks:
- Parent-child correlations match expected directions
- Independence constraints hold for non-adjacent variables
- Intervention effects are consistent with the graph
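The direction checks can also be reproduced ad hoc with pandas (a sketch; the variable names follow the fraud_detection template, while the CSV file name is an assumption):
import pandas as pd
data = pd.read_csv("fraud_causal/causal_data.csv")  # assumed file name
# Parent -> child edges from the fraud_detection template, with expected correlation signs
expected_edges = {
    ("transaction_amount", "approval_level"): "+",
    ("transaction_amount", "fraud_flag"): "+",
    ("vendor_risk", "fraud_flag"): "+",
}
for (parent, child), sign in expected_edges.items():
    corr = data[parent].corr(data[child])
    consistent = corr > 0 if sign == "+" else corr < 0
    print(f"{parent} -> {child}: corr={corr:+.3f} {'OK' if consistent else 'UNEXPECTED'}")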
See Also
LLM Training Data
New in v0.5.0
Generate LLM-enriched synthetic financial data for training and fine-tuning language models on domain-specific tasks.
When to Use LLM-Enriched Data
- Fine-tuning: Train financial document understanding models on realistic data
- RAG evaluation: Test retrieval-augmented generation with known-truth synthetic documents
- Classification training: Generate labeled financial text for transaction categorization
- Anomaly explanation: Train models to explain financial anomalies in natural language
Quality vs Cost Tradeoffs
| Provider | Quality | Cost | Latency | Reproducibility |
|---|---|---|---|---|
| Mock | Good (template-based) | Free | Instant | Fully deterministic |
| gpt-4o-mini | High | ~$0.15/1M tokens | ~200ms/req | Seed-based |
| gpt-4o | Very High | ~$2.50/1M tokens | ~500ms/req | Seed-based |
| Claude (Anthropic) | Very High | Varies | ~300ms/req | Seed-based |
| Self-hosted | Varies | Infrastructure cost | Varies | Full control |
Using the Mock Provider for CI/CD
The mock provider generates deterministic, contextually aware text without any API calls:
# Default: uses mock provider (no API key needed)
datasynth-data generate --config config.yaml --output ./output
# Explicit mock configuration
llm:
provider: mock
The mock provider is suitable for:
- CI/CD pipelines
- Automated testing
- Reproducible research
- Development environments
Using Real LLM Providers
For production-quality enrichment:
llm:
provider: openai
model: "gpt-4o-mini"
api_key_env: "OPENAI_API_KEY"
cache_enabled: true # Avoid duplicate API calls
max_retries: 3
timeout_secs: 30
export OPENAI_API_KEY="sk-..."
datasynth-data generate --config config.yaml --output ./output
Batch Generation for Large Datasets
For large-scale enrichment, use batch mode to minimize API overhead:
from datasynth_py import DataSynth, Config
from datasynth_py.config import blueprints
# Generate base data first (fast, rule-based)
config = blueprints.manufacturing_large(transactions=100000)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
# Then enrich with LLM in a separate pass if needed
Example: Financial Document Understanding
Generate training data for a document understanding model:
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 50000
document_flows:
p2p:
enabled: true
flow_rate: 0.4
o2c:
enabled: true
flow_rate: 0.3
anomaly_injection:
enabled: true
total_rate: 0.03
generate_labels: true
# LLM enrichment adds realistic descriptions
llm:
provider: mock # or openai for higher quality
The generated data includes:
- Vendor names appropriate for the industry and spend category
- Transaction descriptions that read like real GL entries
- Memo fields on invoices and payments
- Natural language explanations for flagged anomalies
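To turn this output into fine-tuning examples, one option is to flatten entries into instruction-style JSONL. The sketch below assumes a journal_entries.csv export with description, is_anomaly, and anomaly_explanation columns; adjust the column names to the actual export schema.
# Convert enriched journal entries into instruction-tuning JSONL pairs.
# Column names (description, is_anomaly, anomaly_explanation) are assumptions
# about the export schema; adjust to match your output.
import json
import pandas as pd

df = pd.read_csv("./output/journal_entries.csv", comment="#")  # skip synthetic-data marker lines
with open("train.jsonl", "w", encoding="utf-8") as f:
    for _, row in df.iterrows():
        record = {
            "prompt": f"Classify this GL entry and explain any anomaly:\n{row['description']}",
            "completion": row["anomaly_explanation"] if row.get("is_anomaly") else "No anomaly detected.",
        }
        f.write(json.dumps(record) + "\n")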
See Also
Pipeline Orchestration
New in v0.5.0
Integrate DataSynth into data engineering pipelines using Apache Airflow, dbt, MLflow, and Apache Spark.
Overview
DataSynth’s Python wrapper includes optional integrations for popular data engineering platforms, enabling synthetic data generation as part of automated workflows.
pip install datasynth-py[integrations]
Apache Airflow
Generate Data in a DAG
from airflow import DAG
from airflow.utils.dates import days_ago
from datasynth_py.integrations import (
DataSynthOperator,
DataSynthSensor,
DataSynthValidateOperator,
)
config = {
"global": {"industry": "retail", "start_date": "2024-01-01", "period_months": 12},
"transactions": {"target_count": 50000},
}
with DAG("synthetic_data_pipeline", start_date=days_ago(1), schedule_interval="@weekly") as dag:
validate = DataSynthValidateOperator(
task_id="validate_config",
config_path="/configs/retail.yaml",
)
generate = DataSynthOperator(
task_id="generate_data",
config=config,
output_path="/data/synthetic/{{ ds }}",
)
wait = DataSynthSensor(
task_id="wait_for_output",
output_path="/data/synthetic/{{ ds }}",
)
validate >> generate >> wait
dbt Integration
Generate dbt Sources from Synthetic Data
from datasynth_py.integrations import DbtSourceGenerator, create_dbt_project
# Generate sources.yml pointing to synthetic CSV files
gen = DbtSourceGenerator()
gen.generate_sources_yaml("./synthetic_output", "./my_dbt_project")
# Generate seed CSVs for dbt
gen.generate_seeds("./synthetic_output", "./my_dbt_project")
# Or create a complete dbt project structure
project = create_dbt_project("./synthetic_output", "my_dbt_project")
This creates:
- `models/sources.yml` with table definitions
- `seeds/` directory with CSV files
- Standard dbt project structure
Testing dbt Models with Synthetic Data
# 1. Generate synthetic data
datasynth-data generate --config retail.yaml --output ./synthetic
# 2. Create dbt project from output
python -c "from datasynth_py.integrations import create_dbt_project; create_dbt_project('./synthetic', 'test_project')"
# 3. Run dbt
cd test_project && dbt seed && dbt run && dbt test
MLflow Tracking
Track Generation Experiments
from datasynth_py.integrations import DataSynthMlflowTracker
tracker = DataSynthMlflowTracker(experiment_name="data_generation")
# Track a generation run (logs config, metrics, artifacts)
run_info = tracker.track_generation("./output", config=config)
# Log additional quality metrics
tracker.log_quality_metrics({
"benford_mad": 0.008,
"correlation_preservation": 0.95,
"completeness": 0.99,
})
# Compare recent runs
comparison = tracker.compare_runs(n=10)
for run in comparison:
print(f"Run {run['run_id']}: quality={run['metrics'].get('statistical_fidelity', 'N/A')}")
A/B Testing Generation Configs
configs = [
("baseline", baseline_config),
("with_diffusion", diffusion_config),
("with_llm", llm_config),
]
for name, cfg in configs:
with mlflow.start_run(run_name=name):
result = synth.generate(config=cfg, output={"format": "csv", "sink": "temp_dir"})
tracker.track_generation(result.output_dir, config=cfg)
Apache Spark
Read Synthetic Data as DataFrames
from datasynth_py.integrations import DataSynthSparkReader
reader = DataSynthSparkReader()
# Read a single table
je_df = reader.read_table(spark, "./output", "journal_entries")
je_df.show(5)
# Read all tables at once
tables = reader.read_all_tables(spark, "./output")
for name, df in tables.items():
print(f"{name}: {df.count()} rows")
# Create temporary SQL views
reader.create_temp_views(spark, "./output")
spark.sql("""
SELECT posting_date, SUM(amount) as total
FROM journal_entries
WHERE fiscal_period = 12
GROUP BY posting_date
ORDER BY posting_date
""").show()
End-to-End Pipeline Example
"""
Complete pipeline: Generate → Track → Load → Transform → Test
"""
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
from datasynth_py.integrations import (
DataSynthMlflowTracker,
DataSynthSparkReader,
DbtSourceGenerator,
)
# 1. Generate
synth = DataSynth()
config = blueprints.retail_small(transactions=50000)
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
# 2. Track with MLflow
tracker = DataSynthMlflowTracker(experiment_name="pipeline_test")
tracker.track_generation(result.output_dir, config=config)
# 3. Load into Spark
reader = DataSynthSparkReader()
reader.create_temp_views(spark, result.output_dir)
# 4. Create dbt project for transformation testing
gen = DbtSourceGenerator()
gen.generate_sources_yaml(result.output_dir, "./dbt_project")
See Also
Contributing
Welcome to the SyntheticData contributor guide.
Overview
SyntheticData is an open-source project and we welcome contributions from the community. This section covers everything you need to know to contribute effectively.
Ways to Contribute
Code Contributions
- Bug fixes: Fix issues reported in the GitHub issue tracker
- New features: Implement new generators, output formats, or analysis tools
- Performance improvements: Optimize generation speed or memory usage
- Documentation: Improve or expand the documentation
Non-Code Contributions
- Bug reports: Report issues with detailed reproduction steps
- Feature requests: Suggest new features or improvements
- Documentation feedback: Point out unclear or missing documentation
- Testing: Test pre-release versions and report issues
Quick Start
# Fork and clone the repository
git clone https://github.com/YOUR_USERNAME/SyntheticData.git
cd SyntheticData
# Create a feature branch
git checkout -b feature/my-feature
# Make your changes and run tests
cargo test
# Submit a pull request
Contribution Guidelines
Before You Start
- Check existing issues: Look for related issues or discussions
- Open an issue first: For significant changes, discuss before implementing
- Follow code style: Run `cargo fmt` and `cargo clippy`
- Write tests: All new features need test coverage
- Update documentation: Keep docs in sync with code changes
Code of Conduct
We are committed to providing a welcoming and inclusive environment. Please:
- Be respectful and constructive in discussions
- Focus on the technical merits of contributions
- Help newcomers learn and contribute
- Report unacceptable behavior to the maintainers
Getting Help
- GitHub Issues: For bugs and feature requests
- GitHub Discussions: For questions and general discussion
- Pull Request Reviews: For feedback on your contributions
In This Section
| Page | Description |
|---|---|
| Development Setup | Set up your development environment |
| Code Style | Coding standards and conventions |
| Testing | Testing guidelines and practices |
| Pull Requests | PR submission and review process |
License
By contributing to SyntheticData, you agree that your contributions will be licensed under the project’s MIT License.
Development Setup
Set up your local development environment for SyntheticData.
Prerequisites
Required
- Rust: 1.88 or later (install via rustup)
- Git: For version control
- Cargo: Included with Rust
Optional
- Node.js 18+: For desktop UI development (datasynth-ui)
- Protocol Buffers: For gRPC development
- mdBook: For documentation development
Installation
1. Clone the Repository
git clone https://github.com/EY-ASU-RnD/SyntheticData.git
cd SyntheticData
2. Install Rust Toolchain
# Install rustup if not present
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install stable toolchain
rustup install stable
rustup default stable
# Add useful components
rustup component add clippy rustfmt
3. Build the Project
# Debug build (faster compilation)
cargo build
# Release build (optimized)
cargo build --release
# Check without building
cargo check
4. Run Tests
# Run all tests
cargo test
# Run tests with output
cargo test -- --nocapture
# Run specific crate tests
cargo test -p datasynth-core
cargo test -p datasynth-generators
IDE Setup
VS Code
Recommended extensions:
{
"recommendations": [
"rust-lang.rust-analyzer",
"tamasfe.even-better-toml",
"serayuzgur.crates",
"vadimcn.vscode-lldb"
]
}
Settings for the project:
{
"rust-analyzer.cargo.features": "all",
"rust-analyzer.checkOnSave.command": "clippy",
"editor.formatOnSave": true
}
JetBrains (RustRover/IntelliJ)
- Install the Rust plugin
- Open the project directory
- Configure Cargo settings under Preferences > Languages & Frameworks > Rust
Desktop UI Setup
For developing the Tauri/SvelteKit desktop UI:
# Navigate to UI crate
cd crates/datasynth-ui
# Install Node.js dependencies
npm install
# Run development server
npm run dev
# Run Tauri desktop app
npm run tauri dev
# Build production
npm run build
Documentation Setup
For working on documentation:
# Install mdBook
cargo install mdbook
# Build documentation
cd docs
mdbook build
# Serve with live reload
mdbook serve --open
# Generate Rust API docs
cargo doc --workspace --no-deps --open
Project Structure
SyntheticData/
├── crates/
│ ├── datasynth-cli/ # CLI binary
│ ├── datasynth-core/ # Core models and traits
│ ├── datasynth-config/ # Configuration schema
│ ├── datasynth-generators/ # Data generators
│ ├── datasynth-output/ # Output sinks
│ ├── datasynth-graph/ # Graph export
│ ├── datasynth-runtime/ # Orchestration
│ ├── datasynth-server/ # REST/gRPC server
│ ├── datasynth-ui/ # Desktop UI
│ └── datasynth-ocpm/ # OCEL 2.0 export
├── benches/ # Benchmarks
├── docs/ # Documentation
├── configs/ # Example configs
└── templates/ # Data templates
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `RUST_LOG` | Log level (trace, debug, info, warn, error) | `info` |
| `SYNTH_CONFIG_PATH` | Default config search path | Current directory |
| `SYNTH_TEMPLATE_PATH` | Template files location | `./templates` |
Debugging
VS Code Launch Configuration
{
"version": "0.2.0",
"configurations": [
{
"type": "lldb",
"request": "launch",
"name": "Debug CLI",
"cargo": {
"args": ["build", "--bin=datasynth-data", "--package=datasynth-cli"]
},
"args": ["generate", "--demo", "--output", "./output"],
"cwd": "${workspaceFolder}"
}
]
}
Logging
Enable debug logging:
RUST_LOG=debug cargo run --release -- generate --demo --output ./output
Module-specific logging:
RUST_LOG=synth_generators=debug,synth_core=info cargo run ...
Common Issues
Build Failures
# Clean and rebuild
cargo clean
cargo build
# Update dependencies
cargo update
Test Failures
# Run tests with backtrace
RUST_BACKTRACE=1 cargo test
# Run single test with output
cargo test test_name -- --nocapture
Memory Issues
For large generation volumes, increase system limits:
# Linux: Increase open file limit
ulimit -n 65536
# Check memory usage during generation
/usr/bin/time -v datasynth-data generate --config config.yaml --output ./output
Next Steps
- Review Code Style guidelines
- Read Testing practices
- Learn the Pull Request process
Code Style
Coding standards and conventions for SyntheticData.
Rust Style
Formatting
All code must be formatted with rustfmt:
# Format all code
cargo fmt
# Check formatting without changes
cargo fmt --check
Linting
Code must pass Clippy without warnings:
# Run clippy
cargo clippy
# Run clippy with all features
cargo clippy --all-features
# Run clippy on all targets
cargo clippy --all-targets
Configuration
The project uses these Clippy settings in Cargo.toml:
[workspace.lints.clippy]
all = "warn"
pedantic = "warn"
nursery = "warn"
Naming Conventions
General Rules
| Item | Convention | Example |
|---|---|---|
| Types | PascalCase | JournalEntry, VendorGenerator |
| Functions | snake_case | generate_batch, parse_config |
| Variables | snake_case | entry_count, total_amount |
| Constants | SCREAMING_SNAKE_CASE | MAX_LINE_ITEMS, DEFAULT_SEED |
| Modules | snake_case | je_generator, document_flow |
Domain-Specific Names
Use accounting domain terminology consistently:
#![allow(unused)]
fn main() {
// Good - uses domain terms
struct JournalEntry { ... }
struct ChartOfAccounts { ... }
fn post_to_gl() { ... }
// Avoid - generic terms
struct Entry { ... }
struct AccountList { ... }
fn save_data() { ... }
}
Code Organization
Module Structure
#![allow(unused)]
fn main() {
// 1. Module documentation
//! Brief description of the module.
//!
//! Extended description with examples.
// 2. Imports (grouped and sorted)
use std::collections::HashMap;
use chrono::{NaiveDate, Utc};
use rust_decimal::Decimal;
use serde::{Deserialize, Serialize};
use crate::models::JournalEntry;
// 3. Constants
const DEFAULT_BATCH_SIZE: usize = 1000;
// 4. Type definitions
pub struct Generator { ... }
// 5. Trait implementations
impl Generator { ... }
// 6. Unit tests
#[cfg(test)]
mod tests { ... }
}
Import Organization
Group imports in this order:
1. Standard library (`std::`)
2. External crates (alphabetically)
3. Workspace crates (`synth_*`)
4. Current crate (`crate::`)
#![allow(unused)]
fn main() {
use std::collections::HashMap;
use std::sync::Arc;
use chrono::NaiveDate;
use rust_decimal::Decimal;
use serde::{Deserialize, Serialize};
use uuid::Uuid;
use synth_core::models::JournalEntry;
use synth_core::traits::Generator;
use crate::config::GeneratorConfig;
}
Documentation
Public API Documentation
All public items must have documentation:
#![allow(unused)]
fn main() {
/// Generates journal entries with realistic financial patterns.
///
/// This generator produces balanced journal entries following
/// configurable statistical distributions for amounts, line counts,
/// and temporal patterns.
///
/// # Examples
///
/// ```
/// use synth_generators::JournalEntryGenerator;
///
/// let generator = JournalEntryGenerator::new(config, seed);
/// let entries = generator.generate_batch(1000)?;
/// ```
///
/// # Errors
///
/// Returns `GeneratorError` if:
/// - Configuration is invalid
/// - Memory limits are exceeded
pub struct JournalEntryGenerator { ... }
}
Module Documentation
Each module should have a module-level doc comment:
#![allow(unused)]
fn main() {
//! Journal Entry generation module.
//!
//! This module provides generators for creating realistic
//! journal entries with proper accounting rules enforcement.
//!
//! # Overview
//!
//! The main entry point is [`JournalEntryGenerator`], which
//! coordinates line item generation and balance verification.
}
Error Handling
Error Types
Use thiserror for error definitions:
#![allow(unused)]
fn main() {
use thiserror::Error;
#[derive(Debug, Error)]
pub enum GeneratorError {
#[error("Invalid configuration: {0}")]
InvalidConfig(String),
#[error("Memory limit exceeded: used {used} bytes, limit {limit} bytes")]
MemoryExceeded { used: usize, limit: usize },
#[error("IO error: {0}")]
Io(#[from] std::io::Error),
}
}
Result Types
Define type aliases for common result types:
#![allow(unused)]
fn main() {
pub type Result<T> = std::result::Result<T, GeneratorError>;
}
Error Propagation
Use ? for error propagation:
#![allow(unused)]
fn main() {
// Good
fn process() -> Result<Data> {
let config = load_config()?;
let data = generate_data(&config)?;
Ok(data)
}
// Avoid
fn process() -> Result<Data> {
let config = match load_config() {
Ok(c) => c,
Err(e) => return Err(e),
};
// ...
}
}
Financial Data
Decimal Precision
Always use rust_decimal::Decimal for financial amounts:
#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
// Good
let amount: Decimal = dec!(1234.56);
// Avoid - floating point
let amount: f64 = 1234.56;
}
Serialization
Serialize decimals as strings to avoid precision loss:
#![allow(unused)]
fn main() {
#[derive(Serialize, Deserialize)]
pub struct LineItem {
#[serde(serialize_with = "serialize_decimal_as_string")]
pub amount: Decimal,
}
}
Testing
Test Organization
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use super::*;
// Group related tests
mod generation {
use super::*;
#[test]
fn generates_balanced_entries() {
// Arrange
let config = test_config();
let generator = Generator::new(config, 42);
// Act
let entries = generator.generate_batch(100).unwrap();
// Assert
for entry in entries {
assert!(entry.is_balanced());
}
}
}
mod validation {
// ...
}
}
}
Test Naming
Use descriptive test names:
#![allow(unused)]
fn main() {
// Good - describes behavior
#[test]
fn rejects_unbalanced_entry() { ... }
#[test]
fn generates_benford_compliant_amounts() { ... }
// Avoid - vague names
#[test]
fn test_1() { ... }
#[test]
fn it_works() { ... }
}
Performance
Allocation
Minimize allocations in hot paths:
#![allow(unused)]
fn main() {
// Good - reuse buffer
let mut buffer = Vec::with_capacity(batch_size);
for _ in 0..batch_size {
buffer.push(generate_entry()?);
}
// Avoid - reallocations
let mut buffer = Vec::new();
for _ in 0..batch_size {
buffer.push(generate_entry()?);
}
}
Iterator Usage
Prefer iterators over explicit loops:
#![allow(unused)]
fn main() {
// Good
let total: Decimal = entries
.iter()
.map(|e| e.amount)
.sum();
// Avoid
let mut total = Decimal::ZERO;
for entry in &entries {
total += entry.amount;
}
}
See Also
- Testing - Testing guidelines
- Pull Requests - Submission process
Testing
Testing guidelines and practices for SyntheticData.
Running Tests
All Tests
# Run all tests
cargo test
# Run with output displayed
cargo test -- --nocapture
# Run tests in parallel (default)
cargo test
# Run tests sequentially
cargo test -- --test-threads=1
Specific Tests
# Run tests for a specific crate
cargo test -p datasynth-core
cargo test -p datasynth-generators
# Run a single test by name
cargo test test_balanced_entry
# Run tests matching a pattern
cargo test benford
cargo test journal_entry
Test Output
# Show stdout/stderr from tests
cargo test -- --nocapture
# Show output from passing tests as well
cargo test -- --show-output
# Run ignored tests
cargo test -- --ignored
# Run all tests including ignored
cargo test -- --include-ignored
Test Organization
Unit Tests
Place unit tests in the same file as the code:
#![allow(unused)]
fn main() {
// src/generators/je_generator.rs
pub struct JournalEntryGenerator { ... }
impl JournalEntryGenerator {
pub fn generate(&self) -> Result<JournalEntry> { ... }
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn generates_balanced_entry() {
let generator = JournalEntryGenerator::new(test_config(), 42);
let entry = generator.generate().unwrap();
assert!(entry.is_balanced());
}
}
}
Integration Tests
Place integration tests in the tests/ directory:
crates/datasynth-generators/
├── src/
│ └── ...
└── tests/
├── generation_flow.rs
└── document_chains.rs
Test Modules
Group related tests in submodules:
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
mod generation {
use super::super::*;
#[test]
fn batch_generation() { ... }
#[test]
fn streaming_generation() { ... }
}
mod validation {
use super::super::*;
#[test]
fn rejects_invalid_config() { ... }
}
}
}
Test Patterns
Arrange-Act-Assert
Use the AAA pattern for test structure:
#![allow(unused)]
fn main() {
#[test]
fn calculates_correct_total() {
// Arrange
let entries = vec![
create_entry(dec!(100.00)),
create_entry(dec!(200.00)),
create_entry(dec!(300.00)),
];
// Act
let total = calculate_total(&entries);
// Assert
assert_eq!(total, dec!(600.00));
}
}
Test Fixtures
Create helper functions for common test data:
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use super::*;
fn test_config() -> GeneratorConfig {
GeneratorConfig {
seed: 42,
batch_size: 100,
..Default::default()
}
}
fn create_test_entry() -> JournalEntry {
JournalEntryBuilder::new()
.with_company("1000")
.with_date(NaiveDate::from_ymd_opt(2024, 1, 15).unwrap())
.add_line(Account::CASH, dec!(1000.00), Decimal::ZERO)
.add_line(Account::REVENUE, Decimal::ZERO, dec!(1000.00))
.build()
.unwrap()
}
}
}
Deterministic Testing
Use fixed seeds for reproducibility:
#![allow(unused)]
fn main() {
#[test]
fn deterministic_generation() {
let seed = 42;
let gen1 = Generator::new(config.clone(), seed);
let gen2 = Generator::new(config.clone(), seed);
let result1 = gen1.generate_batch(100).unwrap();
let result2 = gen2.generate_batch(100).unwrap();
assert_eq!(result1, result2);
}
}
Property-Based Testing
Use proptest for property-based tests:
#![allow(unused)]
fn main() {
use proptest::prelude::*;
proptest! {
#[test]
fn entries_are_always_balanced(
debit in 1u64..1_000_000,
line_count in 2usize..10,
) {
let entry = generate_entry(debit, line_count);
prop_assert!(entry.is_balanced());
}
}
}
Domain-Specific Tests
Balance Verification
Test that journal entries are balanced:
#![allow(unused)]
fn main() {
#[test]
fn entry_debits_equal_credits() {
let entry = generate_test_entry();
let total_debits: Decimal = entry.lines
.iter()
.map(|l| l.debit_amount)
.sum();
let total_credits: Decimal = entry.lines
.iter()
.map(|l| l.credit_amount)
.sum();
assert_eq!(total_debits, total_credits);
}
}
Benford’s Law
Test amount distribution compliance:
#![allow(unused)]
fn main() {
#[test]
fn amounts_follow_benford() {
let entries = generate_entries(10_000);
let first_digits = extract_first_digits(&entries);
let observed = calculate_distribution(&first_digits);
let expected = benford_distribution();
let chi_square = calculate_chi_square(&observed, &expected);
assert!(chi_square < 15.51, "Distribution deviates from Benford's Law");
}
}
Document Chain Integrity
Test document reference chains:
#![allow(unused)]
fn main() {
#[test]
fn p2p_chain_is_complete() {
let documents = generate_p2p_flow();
// Verify chain: PO -> GR -> Invoice -> Payment
let po = &documents.purchase_order;
let gr = &documents.goods_receipt;
let invoice = &documents.vendor_invoice;
let payment = &documents.payment;
assert_eq!(gr.po_reference, Some(po.po_number.clone()));
assert_eq!(invoice.po_reference, Some(po.po_number.clone()));
assert_eq!(payment.invoice_reference, Some(invoice.invoice_number.clone()));
}
}
Decimal Precision
Test that decimal values maintain precision:
#![allow(unused)]
fn main() {
#[test]
fn decimal_precision_preserved() {
let original = dec!(1234.5678);
// Serialize and deserialize
let json = serde_json::to_string(&original).unwrap();
let restored: Decimal = serde_json::from_str(&json).unwrap();
assert_eq!(original, restored);
}
}
Benchmarks
Running Benchmarks
# Run all benchmarks
cargo bench
# Run specific benchmark
cargo bench --bench generation_throughput
# Run benchmark with specific filter
cargo bench -- batch_generation
Writing Benchmarks
#![allow(unused)]
fn main() {
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};
fn generation_benchmark(c: &mut Criterion) {
let config = test_config();
c.bench_function("generate_1000_entries", |b| {
b.iter(|| {
let generator = Generator::new(config.clone(), 42);
generator.generate_batch(1000).unwrap()
})
});
}
fn scaling_benchmark(c: &mut Criterion) {
let config = test_config();
let mut group = c.benchmark_group("scaling");
for size in [100, 1000, 10000].iter() {
group.bench_with_input(
BenchmarkId::from_parameter(size),
size,
|b, &size| {
b.iter(|| {
let generator = Generator::new(config.clone(), 42);
generator.generate_batch(size).unwrap()
})
},
);
}
group.finish();
}
criterion_group!(benches, generation_benchmark, scaling_benchmark);
criterion_main!(benches);
}
Test Coverage
Measuring Coverage
# Install coverage tool
cargo install cargo-tarpaulin
# Run with coverage
cargo tarpaulin --out Html
# View report
open tarpaulin-report.html
Coverage Guidelines
- Aim for 80%+ coverage on core logic
- 100% coverage on public API
- Focus on behavior, not lines
- Don’t test trivial getters/setters
Continuous Integration
Tests run automatically on:
- Pull request creation
- Push to main branch
- Nightly scheduled runs
CI Test Matrix
| Test Type | Trigger | Platform |
|---|---|---|
| Unit tests | All PRs | Linux, macOS, Windows |
| Integration tests | All PRs | Linux |
| Benchmarks | Main branch | Linux |
| Coverage | Weekly | Linux |
See Also
- Code Style - Coding standards
- Pull Requests - Submission process
Pull Requests
Guide to submitting and reviewing pull requests.
Before You Start
1. Check for Existing Work
- Search open issues for related discussions
- Check open PRs for similar changes
- Review the roadmap for planned features
2. Open an Issue First
For significant changes, open an issue to discuss:
- New features or major changes
- Breaking changes to public API
- Architectural changes
- Performance improvements
3. Create a Branch
# Sync with upstream
git checkout main
git pull origin main
# Create feature branch
git checkout -b feature/my-feature
# Or for bug fixes
git checkout -b fix/issue-123
Making Changes
1. Write Code
Follow the Code Style guidelines:
# Format code
cargo fmt
# Run clippy
cargo clippy
# Run tests
cargo test
2. Write Tests
Add tests for new functionality:
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn new_feature_works() {
// Test implementation
}
}
}
3. Update Documentation
- Update relevant docs in `docs/src/`
- Add/update rustdoc comments
- Update CHANGELOG.md if applicable
4. Commit Changes
Write clear commit messages:
# Good commit messages
git commit -m "Add Benford's Law validation to amount generator"
git commit -m "Fix off-by-one error in batch generation"
git commit -m "Improve memory efficiency in large volume generation"
# Avoid vague messages
git commit -m "Fix bug"
git commit -m "Update code"
git commit -m "WIP"
Commit Message Format
<type>: <short summary>
<optional detailed description>
<optional footer>
Types:
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation only
- `refactor`: Code change without feature/fix
- `test`: Adding/updating tests
- `perf`: Performance improvement
- `chore`: Maintenance tasks
Submitting a PR
1. Push Your Branch
git push -u origin feature/my-feature
2. Create Pull Request
Use the PR template:
## Summary
Brief description of changes.
## Changes
- Added X feature
- Fixed Y bug
- Updated Z documentation
## Testing
- [ ] Added unit tests
- [ ] Added integration tests
- [ ] Ran full test suite
- [ ] Tested manually
## Checklist
- [ ] Code follows project style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] Tests pass locally
- [ ] No new warnings from clippy
3. PR Title Format
<type>: <short description>
Examples:
- `feat: Add OCEL 2.0 export format`
- `fix: Correct decimal serialization in JSON output`
- `docs: Add process mining use case guide`
Review Process
Automated Checks
All PRs must pass:
| Check | Requirement |
|---|---|
| Build | Compiles on all platforms |
| Tests | All tests pass |
| Formatting | cargo fmt --check passes |
| Linting | cargo clippy has no warnings |
| Documentation | Builds without errors |
Code Review
Reviewers will check:
- Correctness: Does the code do what it claims?
- Tests: Are changes adequately tested?
- Style: Does code follow conventions?
- Documentation: Are changes documented?
- Performance: Any performance implications?
Responding to Feedback
- Address all comments
- Push fixes as new commits (don’t force-push during review)
- Mark resolved conversations
- Ask for clarification if needed
Merging
Requirements
Before merging:
- All CI checks pass
- At least one approving review
- No unresolved conversations
- Branch is up to date with main
Merge Strategy
We use squash and merge for most PRs:
- Combines all commits into one
- Keeps main history clean
- Preserves full history in PR
After Merge
- Delete your feature branch
- Update local main:
git checkout main
git pull origin main
git branch -d feature/my-feature
Special Cases
Breaking Changes
For breaking changes:
- Open an issue for discussion first
- Document migration path
- Update CHANGELOG with breaking change notice
- Use `BREAKING CHANGE:` in the commit footer
Large PRs
For large changes:
- Consider splitting into smaller PRs
- Create a tracking issue
- Use feature flags if needed
- Provide detailed documentation
Security Issues
For security vulnerabilities:
- Do not open a public issue
- Contact maintainers directly
- Follow responsible disclosure
PR Templates
Feature PR
## Summary
Adds [feature] to support [use case].
## Motivation
[Why is this needed?]
## Changes
- Added `NewType` struct in `datasynth-core`
- Implemented `NewGenerator` in `datasynth-generators`
- Added configuration options in `datasynth-config`
- Updated CLI to support new feature
## Testing
- Added unit tests for `NewType`
- Added integration tests for generation flow
- Manual testing with sample configs
## Documentation
- Added user guide section
- Updated configuration reference
- Added example configuration
Bug Fix PR
## Summary
Fixes #123 - [brief description]
## Root Cause
[What caused the bug?]
## Solution
[How does this fix it?]
## Testing
- Added regression test
- Verified fix with reproduction steps from issue
- Ran full test suite
## Checklist
- [ ] Regression test added
- [ ] Root cause documented
- [ ] Related issues linked
See Also
- Development Setup - Environment setup
- Code Style - Coding standards
- Testing - Testing guidelines
Compliance & Regulatory Overview
DataSynth generates synthetic financial data for testing, training, and analytics. This section documents how DataSynth aligns with key regulatory frameworks and provides self-assessment artifacts for compliance teams.
Regulatory Landscape
Synthetic data generation sits at the intersection of several regulatory domains. While pure synthetic data (generated without real-world data as input) generally faces fewer regulatory constraints than real data processing, organizations deploying DataSynth should understand the applicable frameworks.
EU AI Act
The EU AI Act (Regulation 2024/1689) introduces obligations for AI systems and their training data. DataSynth addresses two key articles:
Article 50 – Transparency for Synthetic Content: All DataSynth output includes machine-readable content credentials indicating that the data is synthetically generated. This is implemented through the ContentCredential system in datasynth-core, which embeds markers in CSV headers, JSON metadata, and Parquet file metadata. Content marking is enabled by default and can be configured via the marking section in the configuration YAML.
Article 10 – Data Governance: DataSynth generates automated DataGovernanceReport documents that describe data sources (synthetic generation, no real data used), processing steps (COA generation through quality validation), quality measures applied (Benford’s Law compliance, balance coherence, referential integrity), and bias assessments. These reports provide the documentation trail required under Article 10.
For full details, see EU AI Act Compliance.
NIST AI Risk Management Framework
The NIST AI RMF (AI 100-1) provides a voluntary framework for managing risks in AI systems. DataSynth has completed a self-assessment across all four core functions:
| Function | Focus Area | DataSynth Alignment |
|---|---|---|
| MAP | Context and use cases | Documented intended uses, users, and known limitations |
| MEASURE | Metrics and evaluation | Quality gates, privacy metrics (MIA, linkage), statistical validation |
| MANAGE | Risk mitigation | Deterministic reproducibility, audit logging, content marking |
| GOVERN | Policies and oversight | Access control (API key + JWT/RBAC), configuration management, quality gate governance |
For the complete self-assessment, see NIST AI RMF Self-Assessment.
GDPR
The General Data Protection Regulation applies differently depending on the DataSynth workflow:
Pure Synthetic Generation (no real data input): GDPR obligations are minimal because no personal data is processed. The generated output contains no data subjects. Article 30 records should still document the processing activity for audit completeness.
Fingerprint Extraction (real data as input): When DataSynth’s fingerprint module extracts statistical profiles from real datasets, GDPR applies in full. The fingerprint module includes differential privacy (Laplace mechanism with configurable epsilon/delta budgets), k-anonymity suppression of rare values, and a complete privacy audit trail. A Data Protection Impact Assessment (DPIA) template is provided for this scenario.
For templates and detailed guidance, see GDPR Compliance.
SOC 2 Readiness
DataSynth’s architecture supports SOC 2 Type II controls across the Trust Services Criteria:
| Criteria | DataSynth Controls |
|---|---|
| Security | API key authentication with Argon2id hashing, JWT/OIDC support, TLS termination, CORS lockdown |
| Availability | Graceful degradation under resource pressure, health/readiness endpoints |
| Processing Integrity | Deterministic RNG (ChaCha8), balanced journal entries enforced at construction, quality gates |
| Confidentiality | Content marking prevents synthetic data from being mistaken for real data |
| Privacy | Differential privacy in fingerprint extraction, no real PII in standard generation |
For deployment security controls, see Security Hardening.
ISO 27001 Alignment
DataSynth supports ISO 27001:2022 Annex A controls relevant to data processing tools:
| Control | Implementation |
|---|---|
| A.5.12 Classification of information | Content credentials classify all output as synthetic |
| A.8.10 Information deletion | Deterministic generation eliminates data retention concerns for pure synthetic workflows |
| A.8.11 Data masking | Fingerprint extraction applies differential privacy and k-anonymity |
| A.8.12 Data leakage prevention | Quality gates include privacy metrics (MIA AUC-ROC, linkage attack assessment) |
| A.8.25 Secure development lifecycle | Deterministic builds, dependency auditing (cargo audit), SBOM generation |
For access control configuration, see Security Hardening.
Quick Reference
| Framework | Status | Documentation |
|---|---|---|
| EU AI Act Article 50 | Implemented (content marking) | EU AI Act |
| EU AI Act Article 10 | Implemented (governance reports) | EU AI Act |
| NIST AI RMF | Self-assessment complete | NIST AI RMF |
| GDPR | Templates provided | GDPR |
| SOC 2 | Readiness documented | SOC 2 Readiness |
| ISO 27001 | Annex A alignment documented | ISO 27001 Alignment |
See Also
EU AI Act Compliance
DataSynth implements technical controls aligned with the EU Artificial Intelligence Act (Regulation 2024/1689), focusing on Article 50 (transparency for synthetic content) and Article 10 (data governance for high-risk AI systems).
Article 50 — Synthetic Content Marking
Article 50(2) requires that providers of AI systems generating synthetic content shall ensure outputs are marked in a machine-readable format and detectable as artificially generated.
How DataSynth Complies
DataSynth embeds machine-readable synthetic content credentials in all output files:
- CSV: Comment header lines with C2PA-inspired metadata
- JSON: `_synthetic_metadata` top-level object with credential fields
- Parquet: Key-value metadata pairs in the file footer
Configuration
compliance:
content_marking:
enabled: true # Default: true
format: embedded # embedded, sidecar, or both
article10_report: true # Generate Article 10 governance report
Marking Formats
| Format | Description |
|---|---|
| `embedded` | Credentials embedded directly in output files (default) |
| `sidecar` | Separate `.synthetic-credential.json` file alongside each output |
| `both` | Both embedded and sidecar credentials |
Credential Fields
Each synthetic content credential contains:
| Field | Description | Example |
|---|---|---|
| `generator` | Tool identifier | `"DataSynth"` |
| `version` | Generator version | `"0.5.0"` |
| `timestamp` | ISO 8601 generation time | `"2024-06-15T10:30:00Z"` |
| `content_type` | Output category | `"synthetic_financial_data"` |
| `method` | Generation technique | `"rule_based_statistical"` |
| `config_hash` | SHA-256 of the config used | `"a1b2c3..."` |
| `declaration` | Human-readable statement | `"This content is synthetic..."` |
Programmatic Detection
Third-party systems can detect synthetic DataSynth output by:
- CSV: Checking for `# X-Synthetic-Generator: DataSynth` header lines
- JSON: Checking for `_synthetic_metadata.generator == "DataSynth"`
- Parquet: Reading `synthetic_generator` from the file metadata
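A minimal detection routine, assuming the default embedded marking described above (the exact marker strings may vary by version):
# Detect DataSynth synthetic-content markers in CSV and JSON output files.
# Marker strings follow the conventions described above and may vary by version.
import json
from pathlib import Path

def is_synthetic(path: str) -> bool:
    p = Path(path)
    if p.suffix == ".csv":
        with p.open(encoding="utf-8") as f:
            head = [next(f, "") for _ in range(5)]
        return any("X-Synthetic-Generator" in line for line in head)
    if p.suffix == ".json":
        meta = json.loads(p.read_text(encoding="utf-8")).get("_synthetic_metadata", {})
        return meta.get("generator") == "DataSynth"
    return False

print(is_synthetic("./output/journal_entries.csv"))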
Article 10 — Data Governance
Article 10 requires appropriate data governance practices for training datasets used by high-risk AI systems. When synthetic data from DataSynth is used to train such systems, the Article 10 data governance report provides documentation.
Governance Report Contents
The automated report includes:
- Data Sources: Documentation of all inputs (configuration parameters, seed values, statistical distributions)
- Processing Steps: Complete pipeline documentation (CoA generation, master data, document flows, anomaly injection, quality validation)
- Quality Measures: Statistical validation results (Benford’s Law, balance coherence, distribution fitting)
- Bias Assessment: Known limitations, demographic representation gaps, and mitigation measures
Generating the Report
Enable in configuration:
compliance:
article10_report: true
The report is written as article10_governance_report.json in the output directory.
Report Structure
{
"report_version": "1.0",
"generator": "DataSynth",
"generated_at": "2024-06-15T10:30:00Z",
"data_sources": ["configuration_parameters", "statistical_distributions", "deterministic_rng"],
"processing_steps": [
"chart_of_accounts_generation",
"master_data_generation",
"document_flow_generation",
"journal_entry_generation",
"anomaly_injection",
"quality_validation"
],
"quality_measures": [
"benfords_law_compliance",
"balance_sheet_coherence",
"document_chain_integrity",
"referential_integrity"
],
"bias_assessment": {
"known_limitations": [
"Statistical distributions are parameterized, not learned from real data",
"Temporal patterns use simplified seasonal models"
],
"mitigation_measures": [
"Configurable distribution parameters per industry profile",
"Quality gate validation ensures statistical plausibility"
]
}
}
See Also
NIST AI Risk Management Framework Self-Assessment
This document provides a self-assessment of DataSynth against the NIST AI Risk Management Framework (AI 100-1, January 2023). The framework defines four core functions – MAP, MEASURE, MANAGE, and GOVERN – each with categories and subcategories. This assessment covers all four functions as they apply to a synthetic data generation tool.
Assessment Scope
- System: DataSynth synthetic financial data generator
- Version: 0.5.x
- Assessment Date: 2025
- Assessor: Development team (self-assessment)
- AI System Type: Data generation tool (not a decision-making AI system)
- Risk Classification: The generated synthetic data may be used as training data for AI/ML systems. DataSynth itself does not make autonomous decisions, but the quality of its output can affect downstream AI system performance.
MAP: Context and Framing
The MAP function establishes the context for AI risk management by identifying intended use cases, users, and known limitations.
MAP 1: Intended Use Cases
DataSynth is designed for the following use cases:
| Use Case | Description | Risk Level |
|---|---|---|
| ML Training Data | Generate labeled datasets for fraud detection, anomaly detection, and audit analytics models | Medium |
| Software Testing | Provide realistic test data for ERP systems, accounting platforms, and audit tools | Low |
| Privacy-Preserving Analytics | Replace real financial data with synthetic equivalents that preserve statistical properties | Medium |
| Compliance Testing | Generate SOX control test evidence, COSO framework data, and SoD violation scenarios | Low |
| Process Mining | Create OCEL 2.0 event logs for process analysis without exposing real business processes | Low |
| Education and Research | Provide realistic financial datasets for academic research and training | Low |
Not intended for: Replacement of real financial records in regulatory filings, direct use as evidence in audit engagements, or any scenario where the synthetic nature of the data is concealed.
MAP 2: Intended Users
| User Group | Typical Use | Access Level |
|---|---|---|
| Data Scientists | Training ML models for fraud/anomaly detection | API or CLI |
| QA Engineers | ERP and accounting system load/integration testing | CLI or Python wrapper |
| Auditors | Testing audit analytics tools against known-labeled data | CLI output files |
| Compliance Teams | SOX control testing, COSO framework validation | CLI or server API |
| Researchers | Academic study of financial data patterns | Python wrapper |
MAP 3: Known Limitations
DataSynth users should understand the following limitations:
- No Real PII: Generated names, identifiers, and addresses are synthetic. They do not correspond to real individuals or organizations. This is a design feature, not a limitation, but downstream systems should not treat synthetic identities as real.
- Statistical Approximation: Generated data follows configurable statistical distributions (log-normal, Benford’s Law, Gaussian mixtures) that approximate real-world patterns. They are not derived from actual transaction populations unless fingerprint extraction is used.
- Industry Profile Approximations: Pre-configured industry profiles (retail, manufacturing, financial services, healthcare, technology) are based on published research and general knowledge. They may not match specific organizations within an industry.
- Temporal Pattern Simplification: Business day calendars, holiday schedules, and intraday patterns are modeled but may not capture all regional or organizational nuances.
- Anomaly Injection Boundaries: Injected fraud patterns follow configurable typologies (ACFE taxonomy) but do not represent the full diversity of real-world fraud schemes.
- Fingerprint Extraction Privacy: When extracting fingerprints from real data, differential privacy noise and k-anonymity are applied. The privacy guarantees depend on correct epsilon/delta parameter selection.
MAP 4: Deployment Context
DataSynth can be deployed as:
- A CLI tool on developer workstations
- A server (REST/gRPC/WebSocket) in cloud or on-premises environments
- A Python library embedded in data pipelines
- A desktop application (Tauri/SvelteKit)
Each deployment context has different risk profiles. Server deployments require authentication, TLS, and rate limiting. CLI usage on trusted workstations has fewer access control requirements.
MEASURE: Metrics and Evaluation
The MEASURE function establishes metrics, methods, and benchmarks for evaluating AI system trustworthiness.
MEASURE 1: Quality Gate Metrics
DataSynth includes a comprehensive evaluation framework (datasynth-eval) with configurable quality gates. Each metric has defined thresholds and automated pass/fail checking.
Statistical Quality
| Metric | Gate Name | Threshold | Comparison | Purpose |
|---|---|---|---|---|
| Benford’s Law MAD | benford_compliance | 0.015 | LTE | First-digit distribution follows Benford’s Law |
| Balance Coherence | balance_sheet_valid | 1.0 | GTE | Assets = Liabilities + Equity |
| Document Chain Integrity | doc_chain_complete | 0.95 | GTE | P2P/O2C chains are complete |
| Temporal Consistency | temporal_valid | 0.90 | GTE | Temporal patterns match configuration |
| Correlation Preservation | correlation_check | 0.80 | GTE | Cross-field correlations preserved |
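For reference, the Benford MAD metric is the mean absolute deviation between the observed first-digit frequencies and the Benford expectation P(d) = log10(1 + 1/d). A minimal sketch, assuming the amounts are available as a pandas Series:
# Mean absolute deviation between observed first-digit frequencies and
# the Benford expectation P(d) = log10(1 + 1/d).
import numpy as np
import pandas as pd

def benford_mad(amounts: pd.Series) -> float:
    digits = amounts[amounts != 0].abs().astype(str).str.lstrip("0.").str[0].astype(int)
    observed = digits.value_counts(normalize=True).reindex(range(1, 10), fill_value=0.0)
    expected = np.log10(1 + 1 / np.arange(1, 10))
    return float(np.abs(observed.to_numpy() - expected).mean())

# A value at or below 0.015 passes the benford_compliance gate above.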
Data Quality
| Metric | Gate Name | Threshold | Comparison | Purpose |
|---|---|---|---|---|
| Completion Rate | completeness | 0.95 | GTE | Required fields are populated |
| Duplicate Rate | uniqueness | 0.05 | LTE | Acceptable duplicate rate |
| Referential Integrity | ref_integrity | 0.99 | GTE | Foreign key references valid |
| IC Match Rate | ic_matching | 0.95 | GTE | Intercompany transactions match |
Gate Profiles
Quality gates are organized into profiles with configurable strictness:
evaluation:
quality_gates:
profile: strict # strict, default, lenient
fail_strategy: collect_all
gates:
- name: benford_compliance
metric: benford_mad
threshold: 0.015
comparison: lte
- name: balance_valid
metric: balance_coherence
threshold: 1.0
comparison: gte
- name: completeness
metric: completion_rate
threshold: 0.95
comparison: gte
MEASURE 2: Privacy Evaluation
DataSynth evaluates privacy risk through empirical attacks on generated data.
Membership Inference Attack (MIA)
The MIA module (datasynth-eval/src/privacy/membership_inference.rs) implements a distance-based classifier that attempts to determine whether a specific record was part of the generation configuration. Key metrics:
| Metric | Threshold | Interpretation |
|---|---|---|
| AUC-ROC | <= 0.60 | Near-random classification indicates strong privacy |
| Accuracy | <= 0.55 | Low accuracy means synthetic data does not memorize patterns |
| Precision/Recall | Balanced | No systematic bias toward members or non-members |
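Conceptually, the distance-based classifier can be sketched in a few lines: records that lie unusually close to the synthetic sample are scored as likely "members". This is an illustrative sketch, not the datasynth-eval implementation.
# Illustrative distance-based membership inference check (not the datasynth-eval code).
# Smaller distance to the nearest synthetic record is treated as evidence of membership.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

def mia_auc(synthetic: np.ndarray, members: np.ndarray, non_members: np.ndarray) -> float:
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
    d_members, _ = nn.kneighbors(members)
    d_non_members, _ = nn.kneighbors(non_members)
    scores = -np.concatenate([d_members.ravel(), d_non_members.ravel()])
    labels = np.concatenate([np.ones(len(members)), np.zeros(len(non_members))])
    return roc_auc_score(labels, scores)  # values near 0.5 indicate strong privacy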
Linkage Attack Assessment
The linkage module (datasynth-eval/src/privacy/linkage.rs) evaluates re-identification risk using quasi-identifier combinations:
| Metric | Threshold | Interpretation |
|---|---|---|
| Re-identification Rate | <= 0.05 | Less than 5% of synthetic records can be linked to originals |
| K-Anonymity Achieved | >= 5 | Each quasi-identifier combination appears at least 5 times |
| Unique QI Overlap | Reported | Number of overlapping quasi-identifier combinations |
NIST SP 800-226 Alignment
The evaluation framework includes self-assessment against NIST SP 800-226 criteria for de-identification. The NistAlignmentReport evaluates:
- Data transformation adequacy
- Re-identification risk assessment
- Documentation completeness
- Privacy control effectiveness
The overall alignment score must be at least 71% for a passing grade.
Fingerprint Module Privacy
When fingerprint extraction is used with real data input, the datasynth-fingerprint privacy engine provides:
| Mechanism | Parameter | Default (Standard Level) |
|---|---|---|
| Differential Privacy (Laplace) | Epsilon | 1.0 |
| K-Anonymity | K threshold | 5 |
| Outlier Protection | Winsorization percentile | 95th |
| Composition | Method | Naive (RDP/zCDP available) |
Privacy levels provide pre-configured parameter sets:
| Level | Epsilon | K | Use Case |
|---|---|---|---|
| Minimal | 5.0 | 3 | Low sensitivity |
| Standard | 1.0 | 5 | Balanced (default) |
| High | 0.5 | 10 | Sensitive data |
| Maximum | 0.1 | 20 | Highly sensitive data |
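For intuition, the Laplace mechanism behind these epsilon values adds noise with scale sensitivity / epsilon to each released statistic, so a smaller epsilon yields noisier, more private fingerprints. A generic sketch of the mechanism, not the datasynth-fingerprint implementation:
# Laplace mechanism: release a statistic with noise scaled to sensitivity / epsilon.
# Generic differential-privacy sketch; not the datasynth-fingerprint implementation.
import numpy as np

def laplace_release(true_value: float, sensitivity: float, epsilon: float,
                    rng: np.random.Generator) -> float:
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(42)
# Releasing a mean with sensitivity 1.0 at the "Standard" level (epsilon = 1.0)
print(laplace_release(1234.56, sensitivity=1.0, epsilon=1.0, rng=rng))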
MEASURE 3: Completeness and Uniqueness
The evaluation module tracks data completeness and uniqueness metrics:
- Completeness: Measures the percentage of non-null values across all required fields. Reported as `overall_completeness` in the evaluation output.
- Uniqueness: Measures the duplicate rate across primary key fields. Collision-free UUIDs (FNV-1a hash-based with generator-type discriminators) ensure deterministic uniqueness.
MEASURE 4: Distribution Validation
Statistical validation tests verify that generated data matches configured distributions:
| Test | Implementation | Purpose |
|---|---|---|
| Benford First Digit | Chi-squared against Benford distribution | Transaction amounts follow expected first-digit distribution |
| Distribution Fit | Anderson-Darling test | Amount distributions match configured log-normal parameters |
| Correlation Check | Pearson/Spearman correlation | Cross-field correlations preserved via copula models |
| Temporal Patterns | Autocorrelation analysis | Seasonality and period-end patterns present |
MANAGE: Risk Mitigation
The MANAGE function addresses risk response and mitigation strategies.
MANAGE 1: Deterministic Reproducibility
DataSynth uses ChaCha8 CSPRNG with configurable seeds. Given the same configuration and seed, the output is identical across runs and platforms. This provides:
- Auditability: Any generated dataset can be exactly reproduced by preserving the configuration YAML and seed value.
- Debugging: Anomalous output can be reproduced for investigation.
- Regression Testing: Changes to generation logic can be detected by comparing output hashes.
global:
seed: 42 # Deterministic seed
industry: manufacturing
start_date: 2024-01-01
period_months: 12
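Reproducibility can be verified end-to-end by generating twice with the same configuration and comparing the records. The sketch below uses the Python wrapper entry points shown elsewhere in this guide; the journal_entries.csv file name is an assumption, and data rows are compared rather than raw bytes because content-marking headers carry a timestamp.
# Verify deterministic reproducibility: same config + seed => identical records.
# File name and comparison strategy are illustrative assumptions.
import pandas as pd
from datasynth_py import DataSynth

config = {"global": {"seed": 42, "industry": "manufacturing",
                     "start_date": "2024-01-01", "period_months": 12}}
synth = DataSynth()
r1 = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
r2 = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
df1 = pd.read_csv(f"{r1.output_dir}/journal_entries.csv", comment="#")
df2 = pd.read_csv(f"{r2.output_dir}/journal_entries.csv", comment="#")
assert df1.equals(df2), "generation is not deterministic for this config"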
MANAGE 2: Audit Logging
DataSynth provides audit trails at multiple levels:
Generation Audit: The runtime emits structured JSON logs for every generation phase, including timing, record counts, and resource utilization.
Privacy Audit: The fingerprint module maintains a PrivacyAudit record of every privacy-related action (noise additions with epsilon spent, value suppressions, generalizations, winsorizations). This audit is embedded in the .dsf fingerprint file.
Server Audit: The REST/gRPC server logs authentication attempts, configuration changes, stream operations, and rate limit events with request correlation IDs (X-Request-Id).
Run Manifest: Each generation run produces a manifest documenting the configuration hash, seed, crate versions, start/end times, record counts, and quality gate results.
MANAGE 3: Data Lineage Tracking
DataSynth tracks data lineage through:
- Configuration Hashing: SHA-256 hash of the input configuration is embedded in all output metadata.
- Content Credentials: Every output file includes a `ContentCredential` linking back to the generator version, configuration hash, and seed.
- Document Reference Chains: Generated document flows maintain explicit reference chains (PO -> GR -> Invoice -> Payment) with `DocumentReference` records.
- Data Governance Reports: Automated Article 10 governance reports document all processing steps from COA generation through quality validation.
MANAGE 4: Content Marking
All synthetic output is marked to prevent confusion with real data:
- CSV: Comment headers with `# SYNTHETIC DATA - Generated by DataSynth v{version}`
- JSON: `_metadata.content_credential` object with generator, timestamp, config hash, and EU AI Act article reference
- Parquet: Custom metadata key-value pairs with full credential JSON
- Sidecar Files: Optional `.credential.json` files alongside output files
Content marking is enabled by default and can be configured:
marking:
enabled: true
format: embedded # embedded, sidecar, both
MANAGE 5: Graceful Degradation
The resource guard system (datasynth-core) monitors memory, disk, and CPU usage, applying progressive degradation:
| Level | Memory Threshold | Response |
|---|---|---|
| Normal | < 70% | Full feature generation |
| Reduced | 70-85% | Disable optional features |
| Minimal | 85-95% | Core generation only |
| Emergency | > 95% | Graceful shutdown |
This prevents resource exhaustion from affecting other systems in shared environments.
GOVERN: Policies and Oversight
The GOVERN function establishes organizational policies and structures for AI risk management.
GOVERN 1: Access Control
DataSynth implements layered access control for the server deployment:
API Key Authentication: Keys are hashed with Argon2id at startup. Verification uses timing-safe comparison with a short-lived cache to prevent side-channel attacks. Keys are provided via the X-API-Key header or Authorization: Bearer header.
JWT/OIDC Integration (optional jwt feature): Supports external identity providers (Keycloak, Auth0, Entra ID) with RS256 token validation. JWT claims include subject, roles, and tenant ID for multi-tenancy.
RBAC: Role-based access control via JWT claims enables differentiated access:
| Role | Permissions |
|---|---|
| `operator` | Start/stop/pause generation streams |
| `admin` | Configuration changes, API key management |
| `viewer` | Read-only access to status and metrics |
Exempt Paths: Health (/health), readiness (/ready), liveness (/live), and metrics (/metrics) endpoints are exempt from authentication for infrastructure integration.
GOVERN 2: Configuration Management
DataSynth configuration is managed through:
- YAML Schema Validation: All configuration is validated against a typed schema before generation begins. Invalid configurations produce descriptive error messages.
- Industry Presets: Pre-validated configuration presets for common industries (retail, manufacturing, financial services, healthcare, technology) reduce misconfiguration risk.
- Complexity Levels: Small (~100 accounts), medium (~400), and large (~2500) complexity levels provide validated scaling parameters.
- Template System: YAML/JSON templates with merge strategies enable configuration reuse while allowing overrides.
GOVERN 3: Quality Gates as Governance Controls
Quality gates serve as automated governance controls:
evaluation:
quality_gates:
profile: strict
fail_strategy: fail_fast # Stop on first failure
gates:
- name: benford_compliance
metric: benford_mad
threshold: 0.015
comparison: lte
- name: privacy_mia
metric: privacy_mia_auc
threshold: 0.60
comparison: lte
- name: balance_coherence
metric: balance_coherence
threshold: 1.0
comparison: gte
Gate profiles can enforce:
- Fail-fast: Stop generation on first quality failure
- Collect-all: Run all checks and report all failures
- Custom thresholds: Organization-specific quality requirements
The GateEngine evaluates all configured gates against the ComprehensiveEvaluation and produces a GateResult with per-gate pass/fail status, actual values, and summary messages.
GOVERN 4: Audit Trail Completeness
The following audit artifacts are produced for each generation run:
| Artifact | Location | Contents |
|---|---|---|
| Run Manifest | output/_manifest.json | Config hash, seed, timestamps, record counts, gate results |
| Content Credentials | Embedded in each output file | Generator version, config hash, seed, EU AI Act reference |
| Data Governance Report | output/_governance_report.json | Article 10 data sources, processing steps, quality measures, bias assessment |
| Privacy Audit | Embedded in .dsf files | Epsilon spent, actions taken, composition method, remaining budget |
| Server Logs | Structured JSON to stdout/log aggregator | Request traces, auth events, config changes, stream operations |
| Quality Gate Results | output/_evaluation.json | Per-gate pass/fail, actual vs threshold, summary |
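In CI, these artifacts can be consumed programmatically, for example to fail a pipeline when a quality gate did not pass. A sketch, assuming JSON field names that mirror the descriptions above (the exact schema may differ):
# Fail the pipeline if the evaluation artifact reports quality gate failures.
# Field names (config_hash, seed, gates, passed, ...) are assumptions based on
# the artifact descriptions above; adjust to the actual schema.
import json
import sys
from pathlib import Path

output = Path("./output")
manifest = json.loads((output / "_manifest.json").read_text())
evaluation = json.loads((output / "_evaluation.json").read_text())
print("config hash:", manifest.get("config_hash"), "seed:", manifest.get("seed"))

failed = [g for g in evaluation.get("gates", []) if not g.get("passed", False)]
for gate in failed:
    print(f"FAILED {gate.get('name')}: actual={gate.get('actual')} threshold={gate.get('threshold')}")
if failed:
    sys.exit(1)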
GOVERN 5: Incident Response
For scenarios where generated data is mistakenly used as real data:
- Detection: Content credentials in output files identify synthetic origin
- Containment: Deterministic generation means the exact dataset can be reproduced and identified
- Remediation: All output files carry machine-readable markers that downstream systems can check programmatically
- Prevention: Content marking is enabled by default and requires explicit configuration to disable
Assessment Summary
| Function | Category Count | Addressed | Notes |
|---|---|---|---|
| MAP | 4 | 4 | Use cases, users, limitations, and deployment documented |
| MEASURE | 4 | 4 | Quality gates, privacy metrics, completeness, distribution validation |
| MANAGE | 5 | 5 | Reproducibility, audit logging, lineage, content marking, degradation |
| GOVERN | 5 | 5 | Access control, config management, quality gates, audit trails, incident response |
Overall Assessment: DataSynth provides comprehensive risk management controls appropriate for a synthetic data generation tool. The primary residual risks relate to (1) parameter misconfiguration leading to unrealistic output, mitigated by quality gates and industry presets, and (2) privacy leakage during fingerprint extraction from real data, mitigated by differential privacy with configurable epsilon/delta budgets and empirical privacy evaluation.
GDPR Compliance
This document provides GDPR (General Data Protection Regulation) compliance guidance for DataSynth deployments. DataSynth generates purely synthetic data by default, but certain workflows (fingerprint extraction) may process real personal data.
Synthetic Data and GDPR
Pure Synthetic Generation
When DataSynth generates data from configuration alone (no real data input):
- No personal data is processed: All names, identifiers, and transactions are algorithmically generated
- No data subjects exist: Synthetic entities have no real-world counterparts
- GDPR does not apply to the generated output, as it contains no personal data per Article 4(1)
This is the default operating mode for all datasynth-data generate workflows.
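A minimal pure-synthetic configuration looks roughly like the sketch below. The field names are illustrative; the point is that no real data or personal identifiers appear anywhere in the input.

```yaml
# Minimal pure-synthetic run (illustrative field names; no real data input).
global:
  industry: retail
  complexity: small           # ~100 accounts
  seed: 1234                  # deterministic, reproducible output
output:
  format: csv
  directory: ./output
```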
Fingerprint Extraction Workflows
When using datasynth-data fingerprint extract with real data as input:
- Real personal data may be processed during statistical fingerprint extraction
- GDPR obligations apply to the extraction phase
- Differential privacy controls limit information retained in the fingerprint
- The output fingerprint (.dsf file) contains only aggregate statistics, not individual records
Article 30 — Records of Processing Activities
Template for Pure Synthetic Generation
| Field | Value |
|---|---|
| Purpose | Generation of synthetic financial data for testing, training, and validation |
| Categories of data subjects | None (no real data subjects) |
| Categories of personal data | None (all data is synthetic) |
| Recipients | Internal development, QA, and data science teams |
| Transfers to third countries | Not applicable (no personal data) |
| Retention period | Per project requirements |
| Technical measures | Seed-based deterministic generation, content marking |
Template for Fingerprint Extraction
| Field | Value |
|---|---|
| Purpose | Statistical fingerprint extraction for privacy-preserving data synthesis |
| Legal basis | Legitimate interest (Article 6(1)(f)) or consent |
| Categories of data subjects | As per source dataset (e.g., customers, vendors, employees) |
| Categories of personal data | As per source dataset (aggregate statistics only retained) |
| Recipients | Data engineering team operating DataSynth |
| Transfers to third countries | Assess per deployment topology |
| Retention period | Fingerprint files: per project; source data: minimize retention |
| Technical measures | Differential privacy (configurable epsilon/delta), k-anonymity |
Data Protection Impact Assessment (DPIA)
A DPIA under Article 35 is recommended when fingerprint extraction processes:
- Large-scale datasets (>100,000 records)
- Special categories of data (Article 9)
- Data relating to vulnerable persons
DPIA Template for Fingerprint Extraction
1. Description of Processing
DataSynth extracts statistical fingerprints from source data. The fingerprint captures distribution parameters (means, variances, correlations) without retaining individual records. Differential privacy noise is added with configurable epsilon/delta parameters.
2. Necessity and Proportionality
- Purpose: Enable realistic synthetic data generation without accessing source data repeatedly
- Minimization: Only aggregate statistics are retained
- Privacy controls: Differential privacy with user-specified budget
3. Risks to Data Subjects
| Risk | Likelihood | Severity | Mitigation |
|---|---|---|---|
| Re-identification from fingerprint | Low | High | Differential privacy, k-anonymity enforcement |
| Membership inference | Low | Medium | MIA AUC-ROC testing in evaluation framework |
| Fingerprint file compromise | Medium | Low | Aggregate statistics only, no individual records |
4. Measures to Address Risks
- Configure `fingerprint_privacy.level: high` or `maximum` for sensitive data
- Set `fingerprint_privacy.epsilon` to the 0.1-1.0 range (lower = stronger privacy)
- Enable k-anonymity with `fingerprint_privacy.k_anonymity >= 5`
- Use the evaluation framework's MIA testing to verify privacy guarantees
Privacy Configuration
```yaml
fingerprint_privacy:
  level: high                    # minimal, standard, high, maximum, custom
  epsilon: 0.5                   # Privacy budget (lower = stronger)
  delta: 1.0e-5                  # Failure probability
  k_anonymity: 10                # Minimum group size
  composition_method: renyi_dp   # naive, advanced, renyi_dp, zcdp
```
Privacy Level Presets
| Level | Epsilon | Delta | k-Anonymity | Use Case |
|---|---|---|---|---|
| minimal | 10.0 | 1e-3 | 2 | Non-sensitive aggregates |
| standard | 1.0 | 1e-5 | 5 | General business data |
| high | 0.5 | 1e-6 | 10 | Sensitive financial data |
| maximum | 0.1 | 1e-8 | 20 | Regulated personal data |
Data Subject Rights
Pure Synthetic Mode
Articles 15-22 (access, rectification, erasure, etc.) do not apply as no real data subjects exist in synthetic output.
Fingerprint Extraction Mode
- Right of access (Art. 15): Fingerprints contain only aggregate statistics; individual records cannot be extracted
- Right to erasure (Art. 17): Delete source data and fingerprint files; regenerate synthetic data with new parameters
- Right to restriction (Art. 18): Suspend fingerprint extraction pipeline
- Right to object (Art. 21): Remove individual from source dataset before extraction
International Transfers
- Synthetic output: Generally not subject to Chapter V transfer restrictions (no personal data)
- Fingerprint files: Assess whether aggregate statistics constitute personal data in your jurisdiction
- Source data: Standard GDPR transfer rules apply during fingerprint extraction
NIST SP 800-226 Alignment
DataSynth’s evaluation framework includes NIST SP 800-226 alignment reporting for synthetic data privacy assessment. Enable via:
```yaml
privacy:
  nist_alignment_enabled: true
```
SOC 2 Type II Readiness
This document describes how DataSynth’s architecture and controls align with the AICPA Trust Services Criteria (TSC) used in SOC 2 Type II engagements. DataSynth is a synthetic data generation tool, not a cloud-hosted SaaS product, so this assessment focuses on the controls embedded in the software itself rather than organizational policies. Organizations deploying DataSynth should layer their own operational controls (change management, personnel security, vendor management) on top of the technical controls described here.
Assessment Scope
- System: DataSynth synthetic financial data generator
- Version: 0.5.x
- Deployment Models: CLI binary, REST/gRPC/WebSocket server, Python library, desktop application
- Assessment Type: Architecture readiness (pre-audit self-assessment)
Security
The Security criterion (Common Criteria) requires that the system is protected against unauthorized access, both logical and physical.
Authentication
DataSynth’s server component (datasynth-server) implements two authentication mechanisms:
API Key Authentication: API keys are hashed with Argon2id (memory-hard, side-channel resistant) at server startup. Verification iterates all stored hashes without short-circuiting to prevent timing-based enumeration. A short-lived (5-second TTL) FNV-1a hash cache avoids repeated Argon2id computation for successive requests from the same client. Keys are accepted via Authorization: Bearer <key> or X-API-Key headers.
JWT/OIDC (optional jwt feature): External identity providers (Keycloak, Auth0, Entra ID) issue RS256-signed tokens. The JwtValidator verifies issuer, audience, expiration, and signature. Claims include subject, email, roles, and tenant ID for multi-tenancy.
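A hedged sketch of what the corresponding server authentication configuration could look like is shown below. The key names are assumptions; the `${ENV_VAR}` interpolation for secrets is supported, but the exact structure may differ from the shipped schema.

```yaml
# Illustrative server authentication settings; key names are assumptions.
auth:
  api_keys:
    - ${DATASYNTH_API_KEY}    # injected via environment, hashed with Argon2id at startup
  jwt:
    enabled: true             # requires the optional `jwt` feature
    issuer: https://idp.example.com/realms/datasynth
    audience: datasynth-server
    # RS256 signature, issuer, audience, and expiration are all verified
```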
Authorization
Role-Based Access Control (RBAC) enforces least-privilege access:
| Role | GenerateData | ManageJobs | ViewJobs | ManageConfig | ViewConfig | ViewMetrics | ManageApiKeys |
|---|---|---|---|---|---|---|---|
| Admin | Y | Y | Y | Y | Y | Y | Y |
| Operator | Y | Y | Y | N | Y | Y | N |
| Viewer | N | N | Y | N | Y | Y | N |
RBAC can be disabled for development environments; when disabled, all authenticated requests are treated as Admin.
Network Security
The security headers middleware injects the following headers on all server responses:
| Header | Value | Purpose |
|---|---|---|
| X-Content-Type-Options | nosniff | Prevent MIME-type sniffing |
| X-Frame-Options | DENY | Prevent clickjacking |
| Content-Security-Policy | default-src 'none'; frame-ancestors 'none' | Restrict resource loading |
| Referrer-Policy | strict-origin-when-cross-origin | Limit referrer leakage |
| Cache-Control | no-store | Prevent caching of API responses |
| X-XSS-Protection | 0 | Defer to CSP (modern best practice) |
TLS termination is supported via reverse proxy (nginx, Caddy, Envoy) or Kubernetes ingress. CORS is configurable with allowlisted origins.
Rate Limiting
Per-client rate limiting uses a sliding-window counter with configurable thresholds (requests per second, burst size). A Redis-backed rate limiter is available for multi-instance deployments (redis feature flag).
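A hedged configuration sketch for rate limiting follows; the key names are assumptions, not the exact schema.

```yaml
# Illustrative rate-limiting settings; key names are assumptions.
rate_limit:
  requests_per_second: 50     # sliding-window threshold per client
  burst_size: 100             # short bursts tolerated above the steady rate
  backend: redis              # multi-instance limiter (requires the `redis` feature)
  redis_url: ${REDIS_URL}
```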
Availability
The Availability criterion requires that the system is available for operation and use as committed.
Graceful Degradation
The DegradationController in datasynth-core monitors memory, disk, and CPU utilization and applies progressive feature reduction:
| Level | Memory | Disk | CPU | Response |
|---|---|---|---|---|
| Normal | < 70% | > 1000 MB | < 80% | All features enabled, full batch sizes |
| Reduced | 70–85% | 500–1000 MB | 80–90% | Half batch sizes, skip data quality injection |
| Minimal | 85–95% | 100–500 MB | > 90% | Essential data only, no anomaly injection |
| Emergency | > 95% | < 100 MB | – | Flush pending writes, terminate gracefully |
Auto-recovery with hysteresis (5% improvement required) allows the system to step back up one level at a time when resource pressure subsides.
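The thresholds in the table translate into a resource-guard configuration along the lines of the sketch below; the key names are assumptions, while the values mirror the table above.

```yaml
# Illustrative resource-guard thresholds; key names are assumptions.
resource_limits:
  memory_percent:
    reduced: 70               # enter Reduced above 70% memory use
    minimal: 85               # enter Minimal above 85%
    emergency: 95             # enter Emergency above 95%
  min_free_disk_mb: 100       # Emergency below 100 MB free disk
  cpu_throttle_threshold: 0.95
  recovery_hysteresis_percent: 5   # step back up only after a 5% improvement
```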
Resource Monitoring
- Memory guard: Reads `/proc/self/statm` (Linux) or `ps` (macOS) to track resident set size against configurable limits.
- Disk guard: Uses `statvfs` (Unix) or `GetDiskFreeSpaceExW` (Windows) to monitor available disk space in the output directory.
- CPU monitor: Tracks CPU utilization with auto-throttle at the 0.95 threshold.
- Resource guard: Unified orchestration that combines all three monitors and drives the `DegradationController`.
Graceful Shutdown
The server handles SIGTERM by stopping acceptance of new requests, waiting for in-flight requests to complete (with configurable timeout), and flushing pending output. The CLI supports SIGUSR1 for pause/resume of generation runs.
Health Endpoints
The following endpoints are exempt from authentication for infrastructure integration:
| Endpoint | Purpose |
|---|---|
| /health | General health check |
| /ready | Readiness probe (Kubernetes) |
| /live | Liveness probe (Kubernetes) |
| /metrics | Prometheus-compatible metrics |
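Because these endpoints are auth-exempt, they can be wired directly into Kubernetes probes. A minimal sketch, assuming the server listens on port 3000 as exposed by the container image:

```yaml
# Kubernetes probe wiring for the auth-exempt endpoints (adjust the port to your deployment).
livenessProbe:
  httpGet:
    path: /live
    port: 3000
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
```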
Processing Integrity
The Processing Integrity criterion requires that system processing is complete, valid, accurate, timely, and authorized.
Deterministic Generation
DataSynth uses the ChaCha8 cryptographically secure pseudo-random number generator with a configurable seed. Given the same configuration YAML and seed value, output is byte-identical across runs and platforms. This provides auditability (reproduce any dataset from its configuration) and regression detection (compare output hashes after code changes).
Quality Gates
The evaluation framework (datasynth-eval) applies configurable pass/fail criteria to every generation run. Built-in quality gate profiles provide three levels of strictness:
| Metric | Strict | Default | Lenient |
|---|---|---|---|
| Benford MAD | <= 0.01 | <= 0.015 | <= 0.03 |
| Balance Coherence | >= 0.999 | >= 0.99 | >= 0.95 |
| Document Chain Integrity | >= 0.95 | >= 0.90 | >= 0.80 |
| Completion Rate | >= 0.99 | >= 0.95 | >= 0.90 |
| Duplicate Rate | <= 0.001 | <= 0.01 | <= 0.05 |
| Referential Integrity | >= 0.999 | >= 0.99 | >= 0.95 |
| IC Match Rate | >= 0.99 | >= 0.95 | >= 0.85 |
| Privacy MIA AUC | <= 0.55 | <= 0.60 | <= 0.70 |
Gate evaluation supports fail-fast (stop on first failure) and collect-all (report all failures) strategies.
Balance Validation
The JournalEntry model enforces debits = credits at construction time. An entry that does not balance cannot be created, eliminating an entire class of data integrity errors.
Content Marking
EU AI Act Article 50 synthetic content credentials are embedded in all output files (CSV headers, JSON metadata, Parquet file metadata). This prevents synthetic data from being mistaken for real financial records. Content marking is enabled by default.
Confidentiality
The Confidentiality criterion requires that information designated as confidential is protected as committed.
No Real Data Storage
In the default operating mode (pure synthetic generation), DataSynth does not process, store, or transmit real data. All names, identifiers, transactions, and addresses are algorithmically generated from configuration parameters and RNG output.
Fingerprint Privacy
When the fingerprint extraction workflow processes real data, the following privacy controls apply:
| Mechanism | Default (Standard Level) |
|---|---|
| Differential privacy (Laplace) | Epsilon = 1.0, Delta = 1e-5 |
| K-anonymity suppression | K >= 5 |
| Composition accounting | Naive (Renyi DP, zCDP available) |
The output .dsf fingerprint file contains only aggregate statistics (means, variances, correlations), not individual records.
API Key Security
API keys are never stored in plaintext. At server startup, raw keys are hashed with Argon2id (random salt, PHC format) and discarded. Verification uses Argon2id comparison that iterates all stored hashes to prevent timing-based key enumeration.
Audit Logging
The JsonAuditLogger emits structured JSON audit events via the tracing crate. Each event records timestamp, request ID, actor identity (user ID or API key hash prefix), action, resource, outcome (success/denied/error), tenant ID, source IP, and user agent. Events are suitable for SIEM ingestion.
Privacy
The Privacy criterion requires that personal information is collected, used, retained, disclosed, and disposed of in conformity with commitments.
Synthetic Data by Design
DataSynth’s default mode generates purely synthetic data. No personal information is collected or processed. Generated entities (vendors, customers, employees) have no real-world counterparts. This eliminates most privacy obligations for pure synthetic workflows.
Privacy Evaluation
The evaluation framework includes empirical privacy testing:
- Membership Inference Attack (MIA): Distance-based classifier measures AUC-ROC. A score near 0.50 indicates the synthetic data does not memorize real data patterns.
- Linkage Attack Assessment: Evaluates re-identification risk using quasi-identifier combinations. Measures achieved k-anonymity and unique QI overlap.
NIST SP 800-226 Alignment
The evaluation framework generates NIST SP 800-226 alignment reports assessing data transformation adequacy, re-identification risk, documentation completeness, and privacy control effectiveness. An overall alignment score of >= 71% is required for a passing grade.
Fingerprint Extraction Privacy Levels
| Level | Epsilon | Delta | K-Anonymity | Use Case |
|---|---|---|---|---|
| minimal | 10.0 | 1e-3 | 2 | Non-sensitive aggregates |
| standard | 1.0 | 1e-5 | 5 | General business data |
| high | 0.5 | 1e-6 | 10 | Sensitive financial data |
| maximum | 0.1 | 1e-8 | 20 | Regulated personal data |
Controls Mapping
The following table maps DataSynth features to SOC 2 Trust Services Criteria identifiers.
| TSC ID | Criterion | DataSynth Control | Implementation |
|---|---|---|---|
| CC6.1 | Logical access security | API key authentication | auth.rs: Argon2id hashing, timing-safe comparison |
| CC6.1 | Logical access security | JWT/OIDC support | auth.rs: RS256 token validation (optional jwt feature) |
| CC6.3 | Role-based access | RBAC enforcement | rbac.rs: Admin/Operator/Viewer roles with permission matrix |
| CC6.6 | System boundaries | Security headers | security_headers.rs: CSP, X-Frame-Options, HSTS support |
| CC6.6 | System boundaries | Rate limiting | rate_limit.rs: Per-client sliding window, Redis backend |
| CC6.8 | Transmission security | TLS support | Reverse proxy TLS termination, Kubernetes ingress |
| CC7.2 | Monitoring | Resource guards | resource_guard.rs: CPU, memory, disk monitoring |
| CC7.2 | Monitoring | Audit logging | audit.rs: Structured JSON events for SIEM |
| CC7.3 | Change detection | Config hashing | SHA-256 hash of configuration embedded in output |
| CC7.4 | Incident response | Content marking | Content credentials identify synthetic origin |
| CC8.1 | Processing integrity | Deterministic RNG | ChaCha8 with configurable seed |
| CC8.1 | Processing integrity | Quality gates | gates/engine.rs: Configurable pass/fail thresholds |
| CC8.1 | Processing integrity | Balance validation | JournalEntry enforces debits = credits at construction |
| CC9.1 | Availability management | Graceful degradation | degradation.rs: Normal/Reduced/Minimal/Emergency levels |
| CC9.1 | Availability management | Health endpoints | /health, /ready, /live (auth-exempt) |
| P3.1 | Privacy notice | Synthetic content marking | EU AI Act Article 50 credentials in all output |
| P4.1 | Collection limitation | No real data by default | Pure synthetic generation requires no data collection |
| P6.1 | Data quality | Quality gates | Statistical, coherence, and privacy quality metrics |
| P8.1 | Disposal | Deterministic generation | No persistent state; regenerate from config + seed |
Gap Analysis
The following areas require organizational controls that are outside DataSynth’s software scope:
| Area | Recommendation |
|---|---|
| Physical security | Deploy on infrastructure with appropriate physical access controls |
| Change management | Implement CI/CD pipelines with code review and approval workflows |
| Vendor management | Assess third-party dependencies via cargo audit and SBOM generation |
| Personnel security | Apply organizational onboarding/offboarding procedures for API key management |
| Backup and recovery | Configure backup for generation configurations and output data per retention policies |
| Incident response plan | Document procedures for scenarios where synthetic data is mistakenly treated as real |
See Also
- ISO 27001 Alignment
- Security Hardening
- NIST AI RMF Self-Assessment
- GDPR Compliance
- EU AI Act Compliance
ISO 27001:2022 Alignment
This document maps DataSynth’s technical controls to the ISO/IEC 27001:2022 Annex A controls. DataSynth is a synthetic data generation tool, not a managed service, so this alignment focuses on controls that are directly addressable by the software. Organizational controls (A.5.1 through A.5.37), people controls (A.6), and physical controls (A.7) are primarily the responsibility of the deploying organization and are noted where DataSynth provides supporting capabilities.
Assessment Scope
- System: DataSynth synthetic financial data generator
- Version: 0.5.x
- Standard: ISO/IEC 27001:2022 (Annex A controls from ISO/IEC 27002:2022)
- Assessment Type: Self-assessment of technical control alignment
A.5 Organizational Controls
A.5.1 Policies for Information Security
DataSynth supports policy-as-code through its configuration management approach:
- Configuration-as-code: All generation parameters are defined in version-controllable YAML files with typed schema validation. Invalid configurations are rejected before generation begins.
- Industry presets: Pre-validated configurations for retail, manufacturing, financial services, healthcare, and technology industries reduce misconfiguration risk.
- CLAUDE.md: The project's development guidelines are codified and version-controlled alongside the source code, establishing security-relevant coding standards (`#[deny(clippy::unwrap_used)]`, input validation requirements).
Organizations should supplement these technical controls with written information security policies governing DataSynth deployment, access, and data handling.
A.5.12 Classification of Information
DataSynth classifies all generated output as synthetic through the content marking system:
- Embedded credentials: CSV headers, JSON metadata objects, and Parquet file metadata contain machine-readable `ContentCredential` records identifying the content as synthetic.
- Human-readable declarations: Each credential includes a `declaration` field: "This content is synthetically generated and does not represent real transactions or entities."
- Configuration hash: A SHA-256 hash of the generation configuration is embedded in output, enabling traceability from any output file back to its generation parameters.
- Sidecar files: Optional `.synthetic-credential.json` sidecar files provide classification metadata alongside each output file.
A.5.23 Information Security for Use of Cloud Services
DataSynth supports cloud deployment through:
- Kubernetes support: Helm charts and deployment manifests for containerized deployment with health (`/health`), readiness (`/ready`), and liveness (`/live`) probe endpoints.
- Stateless server: The server component maintains no persistent state beyond in-memory generation jobs. Configuration and output are externalized, supporting cloud-native architectures.
- TLS termination: Integration with Kubernetes ingress controllers, nginx, Caddy, and Envoy for TLS termination.
- Secret management: API keys can be injected via environment variables or mounted secrets rather than hardcoded in configuration files.
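For the secret-management item above, a standard Kubernetes pattern is to inject the key from a Secret rather than placing it in the configuration file. The Secret and variable names below are assumptions.

```yaml
# Illustrative Kubernetes container spec fragment; Secret and variable names are assumptions.
env:
  - name: DATASYNTH_API_KEY
    valueFrom:
      secretKeyRef:
        name: datasynth-secrets
        key: api-key
```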
A.8 Technological Controls
A.8.1 User Endpoint Devices
The CLI binary (datasynth-data) is a stateless executable:
- No persistent credentials: The CLI does not store API keys, tokens, or session data on disk.
- No network access required: The CLI operates entirely offline for generation workflows. Network access is only needed when connecting to a remote DataSynth server.
- Deterministic output: Given the same configuration and seed, the CLI produces identical output, eliminating concerns about endpoint-specific state affecting results.
A.8.5 Secure Authentication
DataSynth implements multiple authentication mechanisms:
API Key Authentication:
- Keys are hashed with Argon2id (memory-hard, timing-attack resistant) at server startup.
- Raw keys are discarded after hashing; only PHC-format hashes are retained in memory.
- Verification iterates all stored hashes without short-circuiting to prevent timing-based key enumeration.
- A 5-second TTL cache using FNV-1a fast hashing reduces repeated Argon2id computation overhead.
JWT/OIDC Integration (optional jwt feature):
- RS256 token validation with issuer, audience, and expiration checks.
- Compatible with Keycloak, Auth0, and Microsoft Entra ID.
- Claims extraction provides subject, email, roles, and tenant ID for downstream RBAC and audit.
Authentication Bypass:
- Infrastructure endpoints (`/health`, `/ready`, `/live`, `/metrics`) are exempt from authentication to support load balancer and orchestrator probes.
A.8.9 Configuration Management
DataSynth enforces configuration integrity through:
- Typed schema validation: YAML configuration is deserialized into strongly-typed Rust structs. Type mismatches, missing required fields, and constraint violations (e.g., rates outside 0.0–1.0, non-ascending approval thresholds) produce descriptive error messages before generation begins.
- Complexity presets: Small (~100 accounts), medium (~400), and large (~2500) complexity levels provide pre-validated scaling parameters.
- Template system: YAML/JSON templates with merge strategies enable configuration reuse while maintaining a single source of truth for shared settings.
- Configuration hashing: SHA-256 hash of the resolved configuration is computed before generation and embedded in all output metadata, enabling drift detection.
A.8.12 Data Leakage Prevention
DataSynth’s architecture inherently prevents data leakage:
- Synthetic-only generation: The default workflow generates data from statistical distributions and configuration parameters. No real data enters the pipeline.
- Content marking: All output files carry machine-readable synthetic content credentials (EU AI Act Article 50). Third-party systems can detect and flag synthetic content programmatically.
- Fingerprint privacy: When real data is used as input for fingerprint extraction, differential privacy (Laplace mechanism, configurable epsilon/delta) and k-anonymity suppress individual-level information. The resulting `.dsf` file contains only aggregate statistics.
- Quality gate enforcement: The `PrivacyMiaAuc` quality gate validates that generated data does not memorize real data patterns (MIA AUC-ROC threshold).
A.8.16 Monitoring Activities
DataSynth provides monitoring at multiple layers:
Structured Audit Logging:
The JsonAuditLogger emits structured JSON events via the tracing crate, recording:
- Timestamp (UTC), request ID, actor identity
- Action attempted, resource accessed, outcome (success/denied/error)
- Tenant ID, source IP, user agent
Events are emitted at INFO level with a dedicated audit_event structured field for log aggregation filtering.
Resource Monitoring:
- Memory guard reads `/proc/self/statm` (Linux) or `ps` (macOS) for resident set size tracking.
- Disk guard uses `statvfs` (Unix) or `GetDiskFreeSpaceExW` (Windows) for available space monitoring.
- CPU monitor tracks utilization with auto-throttle at the 0.95 threshold.
- The `DegradationController` combines all monitors and emits level-change events when resource pressure triggers degradation.
Generation Monitoring:
- Run manifests capture configuration hash, seed, crate versions, start/end times, record counts, and quality gate results.
- The Prometheus-compatible `/metrics` endpoint exposes runtime statistics.
A.8.24 Use of Cryptography
DataSynth uses cryptographic primitives for the following purposes:
| Purpose | Algorithm | Implementation |
|---|---|---|
| Deterministic RNG | ChaCha8 (CSPRNG) | rand_chacha crate, configurable seed |
| API key hashing | Argon2id | argon2 crate, random salt, PHC format |
| Configuration integrity | SHA-256 | Config hash embedded in output metadata |
| JWT verification | RS256 (RSA + SHA-256) | jsonwebtoken crate (optional jwt feature) |
| UUID generation | FNV-1a hash | Deterministic collision-free UUIDs with generator-type discriminators |
Cryptographic operations use well-maintained Rust crate implementations. No custom cryptographic algorithms are implemented.
A.8.25 Secure Development Lifecycle
DataSynth’s development process includes:
- Static analysis: `cargo clippy` with `#[deny(clippy::unwrap_used)]` enforces safe error handling across the codebase.
- Test coverage: 2,500+ tests across 15 crates covering unit, integration, and property-based scenarios.
- Dependency auditing: `cargo audit` checks for known vulnerabilities in dependencies.
- Type safety: Rust's ownership model and type system eliminate entire classes of memory safety and concurrency bugs at compile time.
- MSRV policy: Minimum Supported Rust Version (1.88) ensures builds use a recent, well-supported compiler.
- CI/CD: Automated build, test, lint, and audit checks on every commit.
A.8.28 Secure Coding
DataSynth applies secure coding practices:
- No `unwrap()` in library code: `#[deny(clippy::unwrap_used)]` prevents panics from unchecked error handling.
- Input validation: All user-provided configuration values are validated against typed schemas with range constraints before use.
- Precise decimal arithmetic: Financial amounts use `rust_decimal` (serialized as strings) instead of IEEE 754 floating point, preventing rounding errors in financial calculations.
- No unsafe code: The codebase does not use `unsafe` blocks in application logic.
- Timing-safe comparisons: API key verification uses constant-time Argon2id comparison (iterating all hashes) to prevent side-channel attacks.
- Memory-safe concurrency: Rust's ownership model prevents data races at compile time. Shared state uses `Arc<Mutex<>>` or atomic operations.
Statement of Applicability
The following table summarizes the applicability of ISO 27001:2022 Annex A controls to DataSynth.
Implemented Controls
| Control | Title | Implementation |
|---|---|---|
| A.5.1 | Information security policies | Configuration-as-code with schema validation |
| A.5.12 | Classification of information | Synthetic content marking (EU AI Act Article 50) |
| A.5.23 | Cloud service security | Kubernetes deployment, health probes, TLS support |
| A.8.1 | User endpoint devices | Stateless CLI with no persistent credentials |
| A.8.5 | Secure authentication | Argon2id API keys, JWT/OIDC, RBAC |
| A.8.9 | Configuration management | Typed schema validation, presets, hashing |
| A.8.12 | Data leakage prevention | Synthetic-only generation, content marking, fingerprint privacy |
| A.8.16 | Monitoring activities | Structured audit logs, resource monitors, run manifests |
| A.8.24 | Use of cryptography | ChaCha8 RNG, Argon2id, SHA-256, RS256 JWT |
| A.8.25 | Secure development lifecycle | Clippy, 2,500+ tests, cargo audit, CI/CD |
| A.8.28 | Secure coding | No unwrap, input validation, precise decimals, no unsafe |
Partially Implemented Controls
| Control | Title | Status | Gap |
|---|---|---|---|
| A.5.8 | Information security in project management | Partial | Security considerations are embedded in code (schema validation, quality gates) but formal project management security procedures are organizational |
| A.5.14 | Information transfer | Partial | TLS support for server API; file-based output transfer policies are organizational |
| A.5.29 | Information security during disruption | Partial | Graceful degradation handles resource pressure; broader business continuity is organizational |
| A.8.8 | Management of technical vulnerabilities | Partial | cargo audit scans dependencies; patch management cadence is organizational |
| A.8.15 | Logging | Partial | Structured JSON audit events with correlation IDs; log retention and SIEM integration are organizational |
| A.8.26 | Application security requirements | Partial | Input validation and schema enforcement are built-in; threat modeling documentation is organizational |
Not Applicable Controls
| Control | Title | Rationale |
|---|---|---|
| A.5.19 | Information security in supplier relationships | DataSynth is open-source software; supplier controls apply to the deploying organization |
| A.5.30 | ICT readiness for business continuity | Business continuity planning is an organizational responsibility |
| A.6.1–A.6.8 | People controls | Personnel security controls are organizational |
| A.7.1–A.7.14 | Physical controls | Physical security controls depend on deployment environment |
| A.8.2 | Privileged access rights | OS-level privilege management is outside DataSynth’s scope |
| A.8.7 | Protection against malware | Endpoint protection is an infrastructure concern |
| A.8.20 | Networks security | Network segmentation and firewall rules are infrastructure concerns |
| A.8.23 | Web filtering | Web filtering is an organizational network control |
Continuous Improvement
DataSynth supports ISO 27001’s Plan-Do-Check-Act cycle through:
- Plan: Configuration-as-code with schema validation enforces security requirements at design time.
- Do: Automated quality gates and resource guards enforce controls during operation.
- Check: Evaluation framework produces quantitative metrics (Benford MAD, balance coherence, MIA AUC-ROC) that can be trended over time.
- Act: The AutoTuner in `datasynth-eval` generates configuration patches from evaluation gaps, creating a feedback loop for continuous improvement (see the illustrative patch below).
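An auto-tuner recommendation might surface as a configuration patch shaped roughly like the sketch below; both the field names and the suggested adjustment are illustrative only.

```yaml
# Illustrative shape of an AutoTuner recommendation; field names are assumptions.
recommendations:
  - metric: benford_mad
    observed: 0.021
    target: 0.015
    suggested_patch:
      anomaly_config:
        fraud_rate: 0.01      # example adjustment to bring the first-digit distribution back in line
```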
Roadmap: Enterprise Simulation & ML Ground Truth
This roadmap outlines completed features, planned enhancements, and the wave-based expansion strategy for enterprise process chain coverage.
Completed Features
v0.1.0 — Core Generation
- Statistical distributions: Benford’s Law compliance, log-normal mixtures, copulas
- Industry presets: Manufacturing, Retail, Financial Services, Healthcare, Technology
- Chart of Accounts: Small (~100), Medium (~400), Large (~2500) complexity levels
- Temporal patterns: Month-end/quarter-end volume spikes, business day calendars
- Master data: Vendors, customers, materials, fixed assets, employees
- Document flows: P2P (6 PO types, three-way match) and O2C (9 SO types, 6 delivery types, 7 invoice types)
- Intercompany: IC matching, transfer pricing, consolidation elimination entries
- Subledgers: AR (aging, dunning), AP (scheduling, discounts), FA (6 depreciation methods), Inventory (22 movement types, 4 valuation methods)
- Currency & FX: Ornstein-Uhlenbeck exchange rates, ASC 830 translation, CTA
- Period close: Monthly close engine, accruals, depreciation runs, year-end closing
- Balance coherence: Opening balances, running balance tracking, trial balance per period
- Anomaly injection: 60+ fraud types, error patterns, process issues with full labeling
- Data quality: Missing values (MCAR/MAR/MNAR), format variations, typos, duplicates
- Graph export: PyTorch Geometric, Neo4j, DGL with train/val/test splits
- Internal controls: COSO 2013 framework, SoD rules, 12 transaction + 6 entity controls
- Resource guards: Memory, disk, CPU monitoring with graceful degradation
- REST/gRPC/WebSocket server with authentication and rate limiting
- Desktop UI: Tauri/SvelteKit with 15+ configuration pages
- Python wrapper: Programmatic access with blueprints and config validation
v0.2.0 — Privacy & Standards
- Fingerprint extraction: Statistical properties from real data into `.dsf` files
- Differential privacy: Laplace and Gaussian mechanisms with configurable epsilon
- K-anonymity: Suppression of rare categorical values
- Fidelity evaluation: KS, Wasserstein, Benford MAD metric comparison
- Gaussian copula synthesis: Preserve multivariate correlations
- Accounting standards: Revenue recognition (ASC 606/IFRS 15), Leases (ASC 842/IFRS 16), Fair Value (ASC 820/IFRS 13), Impairment (ASC 360/IAS 36)
- Audit standards: ISA compliance (34 standards), analytical procedures, confirmations, opinions, PCAOB mappings
- SOX compliance: Section 302/404 assessments, deficiency matrix, material weakness classification
- Streaming output: CSV, JSON, NDJSON, Parquet streaming sinks with backpressure
- ERP output formats: SAP S/4HANA (BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA), Oracle EBS (GL_JE_HEADERS/LINES), NetSuite
v0.3.0 — Fraud & Industry
- ACFE-aligned fraud taxonomy: Asset misappropriation, corruption, financial statement fraud calibrated to ACFE statistics
- Collusion modeling: 8 ring types, 6 conspirator roles, defection/escalation dynamics
- Management override: Fraud triangle modeling (pressure, opportunity, rationalization)
- Red flag generation: 40+ probabilistic indicators with Bayesian probabilities
- Industry-specific generators: Manufacturing (BOM, WIP, production orders), Retail (POS, shrinkage, loyalty), Healthcare (ICD-10, CPT, DRG, payer mix)
- Industry benchmarks: Pre-configured ML benchmarks per industry
- Banking/KYC/AML: Customer personas, KYC profiles, fraud typologies (structuring, funnel, layering, mule, round-tripping)
- Process mining: OCEL 2.0 event logs with P2P and O2C processes
- Evaluation framework: Auto-tuning with configuration recommendations from metric gaps
- Vendor networks: Tiered supply chains, quality scores, clusters
- Customer segmentation: Value segments, lifecycle stages, network positions
- Cross-process links: Entity graph, relationship strength, cross-process integration
v0.5.0 — AI & Advanced Features
- LLM-augmented generation: Pluggable provider abstraction (Mock, OpenAI, Anthropic) for realistic vendor names, descriptions, memo fields, and anomaly explanations
- Natural language configuration: Generate YAML configs from descriptions
- Diffusion model backend: Statistical diffusion with configurable noise schedules (linear, cosine, sigmoid) for learned distribution capture
- Hybrid generation: Blend rule-based and diffusion outputs
- Causal generation: Structural Causal Models (SCMs), do-calculus interventions, counterfactual generation
- Built-in causal templates: `fraud_detection` and `revenue_cycle` causal graphs
- Federated fingerprinting: Secure aggregation (weighted average, median, trimmed mean) for distributed data sources
- Synthetic data certificates: Cryptographic proof of DP guarantees with HMAC-SHA256 signing
- Privacy-utility Pareto frontier: Automated exploration of optimal epsilon values
- Ecosystem integrations: Airflow, dbt, MLflow, Spark pipeline integration
Planned Enhancements
Wave 1 — Foundation (enables everything else)
These items close the most critical gaps and unblock downstream work.
| Item | Chain | Description | Dependencies |
|---|---|---|---|
| S2C completion | S2P | Source-to-Contract: spend analysis, RFx, bid evaluation, contract management, catalog items, supplier scorecards | Extends existing P2P |
| Bank reconciliation | BANK | Bank statement lines, auto-matching, reconciliation breaks, clearing | Validates all payment chains |
| Financial statement generator | R2R | Balance sheet, income statement, cash flow statement from trial balance | Consumes all JE data |
Impact: S2C creates a closed-loop procurement model. Bank reconciliation validates payment integrity across S2P and O2C. Financial statements provide the final reporting layer for R2R.
Wave 2 — Core Process Chains
| Item | Chain | Description | Dependencies |
|---|---|---|---|
| Payroll & time management | H2R | Payroll runs, time entries, overtime, benefits, tax withholding | Employee master data |
| Revenue recognition generator | O2C→R2R | Wire CustomerContract + PerformanceObligation models to SO/Invoice data | Existing ASC 606 models |
| Impairment generator | A2R→R2R | Wire existing ImpairmentTest model to FA generator with JE output | Existing ASC 360 models |
Impact: Payroll is the largest H2R gap and enables SoD analysis for personnel. Revenue recognition and impairment generators wire existing standards models into the generation pipeline.
Wave 3 — Operational Depth
| Item | Chain | Description | Dependencies |
|---|---|---|---|
| Production orders & WIP | MFG | Production order lifecycle, material consumption, WIP costing, variance analysis | Manufacturing industry config |
| Cycle counting & QA | INV | Cycle count programs, quality inspection, inspection lots, vendor quality feedback | Inventory subledger |
| Expense management | H2R | Expense reports, policy enforcement, receipt matching, reimbursement | Employee master data |
Impact: Manufacturing becomes a fully simulated chain. Inventory completeness enables ABC analysis and obsolescence. Expenses extend H2R with AP integration.
Wave 4 — Polish
| Item | Chain | Description | Dependencies |
|---|---|---|---|
| Sales quotes | O2C | Quote-to-order conversion tracking (fills orphan quote_id FK) | O2C generator |
| Cash forecasting | BANK | Projected cash flows from AP/AR schedules | AP/AR subledgers |
| KPIs & budget variance | R2R | Management reporting, budget vs actual analysis | Financial statements |
| Obsolescence management | INV | Slow-moving/excess stock identification and write-downs | Inventory aging |
Impact: These items round out each chain with planning and reporting capabilities.
Cross-Process Integration Vision
The wave plan steadily increases cross-process coverage:
| Integration | Current | After Wave 1 | After Wave 2 | After Wave 4 |
|---|---|---|---|---|
| S2P → Inventory | GR updates stock | Same | Same | Same |
| Inventory → O2C | Delivery reduces stock | Same | Same | Obsolescence feeds write-downs |
| S2P/O2C → BANK | Payments created | Payments reconciled | Same | Cash forecasting |
| All → R2R | JEs → Trial Balance | JEs → Financial Statements | + Revenue recog, impairment | + Budget variance |
| H2R → S2P | Employee authorizations | Same | Expense → AP | Same |
| S2P → A2R | Capital PO → FA | Same | Same | Same |
| MFG → S2P | Config only | Same | Production → PR demand | Same |
| MFG → INV | Config only | Same | WIP → FG transfers | + QA feedback |
Coverage Targets
| Chain | Current | Wave 1 | Wave 2 | Wave 3 | Wave 4 |
|---|---|---|---|---|---|
| S2P | 85% | 95% | 95% | 95% | 95% |
| O2C | 93% | 93% | 97% | 97% | 99% |
| R2R | 78% | 88% | 92% | 92% | 97% |
| A2R | 70% | 70% | 80% | 80% | 80% |
| INV | 55% | 55% | 55% | 75% | 85% |
| BANK | 65% | 85% | 85% | 85% | 90% |
| H2R | 30% | 30% | 60% | 75% | 75% |
| MFG | 20% | 20% | 20% | 60% | 60% |
Guiding Principles
- Enterprise realism: Simulate multi-entity, multi-region, multi-currency operations with coherent process flows
- ML ground truth: Capture true labels and causal factors for supervised learning, explainability, and evaluation
- Scalability: Handle large volumes with stable performance and reproducible results
- Backward compatibility: New features are additive; existing configs continue to work
Dependencies & Risks
- Schema stability: New models must not break existing serialization formats
- Performance: Each wave adds generators; resource guards ensure stable memory/CPU
- Validation complexity: Cross-chain coherence checks multiply as integration points increase
Contributing
We welcome contributions to any roadmap area. See Contributing Guidelines for details.
To propose new features:
- Open a GitHub issue with the `enhancement` label
- Describe the use case and expected behavior
- Reference relevant roadmap items if applicable
Feedback
Roadmap priorities are influenced by user feedback. Please share your use cases and requirements:
- GitHub Issues: Feature requests and bug reports
- Email: michael.ivertowski@ch.ey.com
See Also
- Process Chains — Current process chain architecture and coverage matrix
- S2P Spec — Source-to-Contract specification
- Process Chain Gaps — Detailed gap analysis
Production Readiness Roadmap
Version: 1.0 | Date: February 2026 | Status: Living Document
This roadmap addresses the infrastructure, operations, security, compliance, and ecosystem maturity required to transition DataSynth from a feature-complete beta to a production-grade enterprise platform. It complements the existing feature roadmap which covers domain-specific enhancements.
Table of Contents
- Current State Assessment
- Phase 1: Foundation (0-3 months)
- Phase 2: Hardening (3-6 months)
- Phase 3: Enterprise Grade (6-12 months)
- Phase 4: Market Leadership (12-18 months)
- Industry & Research Context
- Competitive Positioning
- Regulatory Landscape
- Risk Register
Current State Assessment
Production Readiness Scorecard (v0.5.0 — Phase 2 Complete)
| Category | Score | Status | Key Findings |
|---|---|---|---|
| Workspace Structure | 9/10 | Excellent | 15 well-organized crates, clear separation of concerns |
| Testing | 10/10 | Excellent | 2,500+ tests, property testing via proptest, fuzzing harnesses (cargo-fuzz), k6 load tests, coverage via cargo-llvm-cov + Codecov |
| CI/CD | 9/10 | Excellent | 7-job pipeline: fmt, clippy, cross-platform test (Linux/macOS/Windows), MSRV 1.88, security scanning (cargo-deny + cargo-audit), coverage, benchmark regression |
| Error Handling | 10/10 | Excellent | Idiomatic thiserror/anyhow; #![deny(clippy::unwrap_used)] enforced across all library crates; zero unwrap calls in non-test code |
| Observability | 9/10 | Excellent | Structured JSON logging, feature-gated OpenTelemetry (OTLP traces + Prometheus metrics), request ID propagation, request logging middleware, data lineage graph |
| Deployment | 10/10 | Excellent | Multi-stage Dockerfile (distroless), Docker Compose, Kubernetes Helm chart (HPA, PDB, Redis subchart), SystemD service, comprehensive deployment guides (Docker, K8s, bare-metal) |
| Security | 9/10 | Excellent | Argon2id key hashing with timing-safe comparison, security headers, request validation, TLS support (rustls), env var interpolation for secrets, cargo-deny + cargo-audit in CI, security hardening guide |
| Performance | 9/10 | Excellent | 5 Criterion benchmark suites, 100K+ entries/sec; CI benchmark regression tracking on PRs; k6 load testing framework |
| Python Bindings | 8/10 | Strong | Strict mypy, PEP 561 compliant, blueprints; classified as “Beta”, no async support |
| Server | 10/10 | Excellent | REST/gRPC/WebSocket complete; async job queue; distributed rate limiting (Redis); stateless config loading; enhanced probes; full middleware stack |
| Documentation | 10/10 | Excellent | mdBook + rustdoc + CHANGELOG + CONTRIBUTING; deployment guides (Docker, K8s, bare-metal), operational runbook, capacity planning, DR procedures, API reference, security hardening |
| Code Quality | 10/10 | Excellent | Zero TODO/FIXME comments, warnings-as-errors enforced, panic-free library crates, 6 unsafe blocks (all justified) |
| Privacy | 9/10 | Excellent | Formal DP composition (RDP, zCDP), privacy budget management, MIA/linkage evaluation, NIST SP 800-226 alignment, SynQP matrix, custom privacy levels |
| Data Lineage | 9/10 | Excellent | Per-file checksums, lineage graph, W3C PROV-JSON export, CLI verify command for manifest integrity |
Overall: 9.4/10 — Enterprise-grade with Kubernetes deployment, formal privacy guarantees, panic-free library code, comprehensive operations documentation, and data lineage tracking. Remaining gaps: RBAC/OAuth2, plugin SDK, Python async support.
Phase 1: Foundation (0-3 months)
Goal: Establish the minimum viable production infrastructure.
1.1 Containerization & Packaging
Priority: Critical | Effort: Medium
| Deliverable | Description |
|---|---|
| Multi-stage Dockerfile | Rust builder stage + distroless/alpine runtime (~20MB image) |
| Docker Compose | Local dev stack: server + Prometheus + Grafana + Redis |
| OCI image publishing | GitHub Actions workflow to push to GHCR/ECR on tagged releases |
| Binary distribution | Pre-built binaries for Linux (x86_64, aarch64), macOS (Apple Silicon), Windows |
| SystemD service file | Production daemon configuration with resource limits |
Implementation Notes:
```dockerfile
# Target image structure
FROM rust:1.88-bookworm AS builder
# ... build with --release
FROM gcr.io/distroless/cc-debian12
COPY --from=builder /app/target/release/datasynth-server /
EXPOSE 3000
ENTRYPOINT ["/datasynth-server"]
```
1.2 Security Hardening
Priority: Critical | Effort: Medium
| Deliverable | Description |
|---|---|
| API key hashing | Argon2id for stored keys; timing-safe comparison via subtle crate |
| Request validation middleware | Content-Type enforcement, configurable max body size (default 10MB) |
| TLS support | Native rustls integration or documented reverse proxy (nginx/Caddy) setup |
| Secrets management | Environment variable interpolation in config (${ENV_VAR} syntax) |
| Security headers | X-Content-Type-Options, X-Frame-Options, Strict-Transport-Security |
| Input sanitization | Validate all user-supplied config values before processing |
| Dependency auditing | cargo-audit and cargo-deny in CI pipeline |
1.3 Observability Stack
Priority: Critical | Effort: Medium
| Deliverable | Description |
|---|---|
| OpenTelemetry integration | Replace custom metrics with opentelemetry + opentelemetry-otlp crates |
| Structured logging | JSON-formatted logs with request IDs, span context, correlation traces |
| Prometheus metrics | Generation throughput, latency histograms, error rates, resource utilization |
| Distributed tracing | Trace generation pipeline phases end-to-end with span hierarchy |
| Health check enhancement | Add dependency checks (disk space, memory) to /ready endpoint |
| Alert rules | Example Prometheus alerting rules for SLO violations |
Key Metrics to Instrument:
- `datasynth_generation_entries_total` (Counter) — Total entries generated
- `datasynth_generation_duration_seconds` (Histogram) — Per-phase latency
- `datasynth_generation_errors_total` (Counter) — Errors by type
- `datasynth_memory_usage_bytes` (Gauge) — Current memory consumption
- `datasynth_active_sessions` (Gauge) — Concurrent generation sessions
- `datasynth_api_request_duration_seconds` (Histogram) — API latency by endpoint
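As a concrete example of the alert-rules deliverable, a Prometheus rule built on the metrics above could look like this; the threshold and duration are placeholders, not recommended SLOs.

```yaml
# Illustrative Prometheus alerting rule using the instrumented metrics above.
groups:
  - name: datasynth-slo
    rules:
      - alert: DataSynthHighErrorRate
        expr: rate(datasynth_generation_errors_total[5m]) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DataSynth generation error rate above threshold for 10 minutes"
```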
1.4 CI/CD Hardening
Priority: High | Effort: Low
| Deliverable | Description |
|---|---|
| Code coverage | cargo-tarpaulin or cargo-llvm-cov with Codecov integration |
| Security scanning | cargo-audit for CVEs, cargo-deny for license compliance |
| MSRV validation | CI job testing against minimum supported Rust version (1.88) |
| Cross-platform matrix | Test on Linux, macOS, Windows in CI |
| Benchmark tracking | Criterion results uploaded to GitHub Pages; regression alerts on PRs |
| Release automation | Semantic versioning with auto-changelog via git-cliff |
| Container scanning | Trivy or Grype scanning of published Docker images |
Phase 2: Hardening (3-6 months)
Goal: Enterprise-grade reliability, scalability, and compliance foundations.
2.1 Scalability & High Availability
Priority: High | Effort: High
| Deliverable | Description |
|---|---|
| Redis-backed rate limiting | Distributed rate limiting via redis-rs for multi-instance deployments |
| Horizontal scaling | Stateless server design; shared config via Redis/S3 |
| Kubernetes Helm chart | Production-ready chart with HPA, PDB, resource limits, readiness probes |
| Load testing framework | k6 or Locust scripts for API stress testing |
| Graceful rolling updates | Zero-downtime deployments with connection draining |
| Job queue | Async generation jobs with status tracking (Redis Streams or similar) |
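A hedged sketch of Helm values covering the HPA, PDB, and Redis items above; the value names follow common chart conventions and are assumptions rather than the shipped chart's schema.

```yaml
# Illustrative Helm values; names follow common chart conventions, not necessarily the shipped chart.
replicaCount: 3
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 75
podDisruptionBudget:
  enabled: true
  minAvailable: 2
redis:
  enabled: true               # backend for distributed rate limiting and the job queue
resources:
  limits:
    cpu: "2"
    memory: 4Gi
```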
2.2 Data Lineage & Provenance
Priority: High | Effort: Medium
| Deliverable | Description |
|---|---|
| Generation manifest | JSON/YAML file recording: config hash, seed, version, timestamp, checksums for all outputs |
| Data lineage graph | Track which config section produced which output file and row ranges |
| Reproducibility verification | CLI command: datasynth-data verify --manifest manifest.json --output ./output/ |
| W3C PROV compatibility | Export lineage in W3C PROV-JSON format for interoperability |
| Audit trail | Append-only log of all generation runs with user, config, and output metadata |
Rationale: Data lineage is becoming a regulatory requirement under the EU AI Act (Article 10 — data governance for training data) and is a key differentiator in the enterprise synthetic data market. NIST AI RMF 1.0 also emphasizes provenance tracking under its MAP and MEASURE functions.
2.3 Enhanced Privacy Guarantees
Priority: High | Effort: High
| Deliverable | Description |
|---|---|
| Formal DP accounting | Implement Renyi DP and zero-concentrated DP (zCDP) composition tracking |
| Privacy budget management | Global budget tracking across multiple generation runs |
| Membership inference testing | Automated MIA evaluation as post-generation quality gate |
| NIST SP 800-226 alignment | Validate DP implementation against NIST Guidelines for Evaluating DP Guarantees |
| SynQP framework integration | Implement the IEEE SynQP evaluation matrix for joint quality-privacy assessment |
| Configurable privacy levels | Presets: relaxed (ε=10), standard (ε=1), strict (ε=0.1) with utility tradeoff documentation |
Research Context: The NIST SP 800-226 (Guidelines for Evaluating Differential Privacy Guarantees) provides the authoritative framework for DP evaluation. The SynQP framework (IEEE, 2025) introduces standardized privacy-quality evaluation matrices. Benchmarking DP tabular synthesis algorithms was a key topic at TPDP 2025, and federated DP approaches (FedDPSyn) are emerging for distributed generation.
2.4 Unwrap Audit & Robustness
Priority: Medium | Effort: Medium
| Deliverable | Description |
|---|---|
| Unwrap elimination | Audit and replace ~2,300 unwrap() calls in non-test code with proper error handling |
| Panic-free guarantee | Add #![deny(clippy::unwrap_used)] lint for library crates (not test/bench) |
| Fuzzing harnesses | cargo-fuzz targets for config parsing, fingerprint loading, and API endpoints |
| Property test expansion | Increase proptest coverage for statistical invariants and balance coherence |
2.5 Documentation: Operations
Priority: Medium | Effort: Low
| Deliverable | Description |
|---|---|
| Deployment guide | Docker, K8s, bare-metal deployment with step-by-step instructions |
| Operational runbook | Monitoring dashboards, common alerts, troubleshooting procedures |
| Capacity planning guide | Memory/CPU/disk sizing for different generation scales |
| Disaster recovery | Backup/restore procedures for server state and configurations |
| API rate limits documentation | Document auth, rate limiting, and CORS behavior for integrators |
| Security hardening guide | Checklist for production security configuration |
Phase 3: Enterprise Grade (6-12 months)
Goal: Enterprise features, compliance certifications, and ecosystem maturity.
3.1 Multi-Tenancy & Access Control
Priority: High | Effort: High
| Deliverable | Description |
|---|---|
| RBAC | Role-based access control (admin, operator, viewer) with JWT/OAuth2 |
| Tenant isolation | Namespace-based isolation for multi-tenant SaaS deployment |
| Audit logging | Structured audit events for all API actions (who/what/when) |
| SSO integration | SAML 2.0 and OIDC support for enterprise identity providers |
| API versioning | URL-based API versioning (v1, v2) with deprecation lifecycle |
3.2 Advanced Evaluation & Quality Gates
Priority: High | Effort: Medium
| Deliverable | Description |
|---|---|
| Automated quality gates | Pre-configured pass/fail criteria for generation runs |
| Benchmark suite expansion | Domain-specific benchmarks: financial realism, fraud detection efficacy, audit trail coherence |
| Regression testing | Golden dataset comparison with tolerance thresholds |
| Quality dashboard | Web-based visualization of quality metrics over time |
| Third-party validation | Integration with SDMetrics and SDV evaluation utilities |
Quality Metrics to Implement:
- Statistical fidelity: Column distribution similarity (KL divergence, Wasserstein distance)
- Structural fidelity: Correlation matrix preservation, inter-table referential integrity
- Privacy: Nearest-neighbor distance ratio, attribute disclosure risk, identity disclosure risk (SynQP)
- Utility: Train-on-synthetic-test-on-real (TSTR) ML performance parity
- Temporal fidelity: Autocorrelation preservation, seasonal pattern retention
- Domain-specific: Benford compliance MAD, balance equation coherence, document chain integrity
3.3 Plugin & Extension SDK
Priority: Medium | Effort: High
| Deliverable | Description |
|---|---|
| Generator trait API | Stable, documented trait interface for custom generators |
| Plugin loading | Dynamic plugin loading via libloading or WASM runtime |
| Template marketplace | Repository of community-contributed industry templates |
| Custom output sinks | Plugin API for custom export formats (database write, S3, GCS) |
| Webhook system | Event-driven notifications (generation start/complete/error) |
3.4 Python Ecosystem Maturity
Priority: Medium | Effort: Medium
| Deliverable | Description |
|---|---|
| Async support | asyncio-compatible API using websockets for streaming |
| Conda package | Publish to conda-forge for data science workflows |
| Jupyter integration | Example notebooks for common use cases (fraud ML, audit analytics) |
| pandas/polars integration | Direct DataFrame output without intermediate CSV |
| PyPI 1.0.0 release | Promote from Beta to Production/Stable classifier |
| Type stubs | Complete .pyi stubs for IDE support |
3.5 Regulatory Compliance Framework
Priority: Medium | Effort: Medium
| Deliverable | Description |
|---|---|
| EU AI Act readiness | Synthetic content marking (Article 50), training data documentation (Article 10) |
| NIST AI RMF alignment | Self-assessment against MAP, MEASURE, MANAGE, GOVERN functions |
| SOC 2 Type II preparation | Document controls for security, availability, processing integrity |
| GDPR compliance documentation | Data processing documentation, privacy impact assessment template |
| ISO 27001 alignment | Information security management system controls mapping |
Regulatory Context: The EU AI Act’s Article 50 transparency obligations (enforceable August 2026) require AI systems generating synthetic content to mark outputs as artificially generated in a machine-readable format. Article 10 mandates training data governance including documentation of data sources. Organizations face penalties up to €35M or 7% of global turnover for non-compliance. The NIST AI RMF 1.0 (expanded significantly through 2024-2025) provides the voluntary framework becoming the “operational layer” beneath regulatory compliance globally.
Phase 4: Market Leadership (12-18 months)
Goal: Cutting-edge capabilities informed by latest research, establishing DataSynth as the reference platform for financial synthetic data.
4.1 LLM-Augmented Generation
Priority: Medium | Effort: High
| Deliverable | Description |
|---|---|
| LLM-guided metadata enrichment | Use LLMs to generate realistic vendor names, descriptions, memo fields |
| Natural language config | Generate YAML configs from natural language descriptions (“Generate 1 year of manufacturing data for a mid-size German company”) |
| Semantic constraint validation | LLM-based validation of inter-column logical relationships |
| Explanation generation | Natural language explanations for anomaly labels and findings |
Research Context: Multiple 2025 papers demonstrate LLM-augmented tabular data generation. LLM-TabFlow (March 2025) addresses preserving inter-column logical relationships. StructSynth (August 2025) focuses on structure-aware synthesis in low-data regimes. LLM-TabLogic (August 2025) uses prompt-guided latent diffusion to maintain logical constraints. The CFA Institute’s July 2025 report on “Synthetic Data in Investment Management” validates the growing importance of synthetic data in financial applications.
4.2 Diffusion Model Integration
Priority: Medium | Effort: Very High
| Deliverable | Description |
|---|---|
| TabDDPM backend | Optional diffusion-model-based generation for learned distribution capture |
| FinDiff integration | Financial-domain diffusion model for learned financial patterns |
| Hybrid generation | Combine rule-based generators with learned models for maximum fidelity |
| Model fine-tuning pipeline | Train custom diffusion models on fingerprint data |
| Imb-FinDiff for rare events | Diffusion-based class imbalance handling for fraud patterns |
Research Context: The diffusion model landscape for tabular data has matured rapidly. TabDiff (ICLR 2025) introduced joint continuous-time diffusion with feature-wise learnable schedules, achieving 22.5% improvement over prior SOTA. FinDiff and its extensions (Imb-FinDiff for class imbalance, DP-Fed-FinDiff for federated privacy-preserving generation) are specifically designed for financial tabular data. A comprehensive survey (February 2025) catalogs 15+ diffusion models for tabular data. TabGraphSyn (December 2025) combines GNNs with diffusion for graph-guided tabular synthesis.
4.3 Advanced Privacy Techniques
Priority: Medium | Effort: High
| Deliverable | Description |
|---|---|
| Federated fingerprinting | Extract fingerprints from distributed data sources without centralization |
| Synthetic data certificates | Cryptographic proof that output satisfies DP guarantees |
| Privacy-utility Pareto frontier | Automated exploration of optimal ε values for given utility targets |
| Surrogate public data | Support for surrogate public data approaches to improve DP utility |
Research Context: TPDP 2025 featured FedDPSyn for federated DP tabular synthesis and research on surrogate public data for DP (Hod et al.). The AI-generated synthetic tabular data market reached $1.36B in 2024 and is projected to reach $6.73B by 2029 (37.9% CAGR), driven by privacy regulation and AI training demand.
4.4 Ecosystem & Integration
Priority: Medium | Effort: Medium
| Deliverable | Description |
|---|---|
| Terraform provider | Infrastructure-as-code for DataSynth server deployment |
| Airflow/Dagster operators | Pipeline integration for automated generation in data workflows |
| dbt integration | Generate synthetic data as dbt sources for analytics testing |
| Spark connector | Read DataSynth output directly as Spark DataFrames |
| MLflow integration | Track generation runs as MLflow experiments with metrics |
4.5 Causal & Counterfactual Generation
Priority: Low | Effort: Very High
| Deliverable | Description |
|---|---|
| Causal graph specification | Define causal relationships between entities in config |
| Interventional generation | “What-if” scenarios: generate data under hypothetical interventions |
| Counterfactual samples | Generate counterfactual versions of existing records |
| Causal discovery validation | Validate that generated data preserves specified causal structure |
Industry & Research Context
Synthetic Data Market (2025-2026)
The synthetic data market is experiencing explosive growth:
- Gartner predicts 75% of businesses will use GenAI to create synthetic customer data by 2026, up from <5% in 2023.
- The AI-generated synthetic tabular data market reached $1.36B in 2024, projected to $6.73B by 2029 (37.9% CAGR).
- Synthetic data is predicted to account for >60% of all training data for GenAI models by 2030 (CFA Institute, July 2025).
Key Research Papers & Developments
Tabular Data Generation
- TabDiff (ICLR 2025) — Mixed-type diffusion with learnable feature-wise schedules; 22.5% improvement on correlation preservation
- LLM-TabFlow (March 2025) — Preserving inter-column logical relationships via LLM guidance
- StructSynth (August 2025) — Structure-aware LLM synthesis for low-data regimes
- LLM-TabLogic (August 2025) — Prompt-guided latent diffusion maintaining logical constraints
- TabGraphSyn (December 2025) — Graph-guided latent diffusion combining VAE+GNN with diffusion
Financial Domain
- FinDiff (ICAIF 2023) — Diffusion models for financial tabular data
- Imb-FinDiff (ICAIF 2024) — Conditional diffusion for class-imbalanced financial data
- DP-Fed-FinDiff — Federated DP diffusion for privacy-preserving financial synthesis
- CFA Institute Report (July 2025) — “Synthetic Data in Investment Management” validating FinDiff as SOTA
Privacy & Evaluation
- SynQP (IEEE, 2025) — Standardized quality-privacy evaluation framework for synthetic data
- NIST SP 800-226 — Guidelines for Evaluating Differential Privacy Guarantees
- TPDP 2025 — Benchmarking DP tabular synthesis; federated approaches; membership inference attacks
- Consensus Privacy Metrics (Pilgram et al., 2025) — Framework for standardized privacy evaluation
Surveys
- “Diffusion Models for Tabular Data” (February 2025) — Comprehensive survey cataloging 15+ models
- “Comprehensive Survey of Synthetic Tabular Data Generation” (Shi et al., 2025) — Broad overview of methods
Technology Trends Impacting DataSynth
| Trend | Impact | Timeframe |
|---|---|---|
| LLM-augmented generation | Realistic metadata, natural language config | 2026 |
| Diffusion models for tabular data | Learned distribution capture as alternative/complement to rule-based | 2026-2027 |
| Federated DP synthesis | Generate from distributed sources without centralization | 2027 |
| Causal modeling | “What-if” scenarios and interventional generation | 2027-2028 |
| OTEL standardization | Unified observability across Rust ecosystem | 2026 |
| WASM plugins | Safe, sandboxed extensibility for custom generators | 2026-2027 |
| EU AI Act enforcement | Mandatory synthetic content marking and data governance | August 2026 |
Competitive Positioning
Market Landscape (2025-2026)
| Platform | Focus | Key Differentiator | Pricing | Status |
|---|---|---|---|---|
| Gretel.ai | Developer APIs | Navigator (NL-to-data); acquired by NVIDIA (March 2025) | Usage-based | Integrated into NVIDIA NeMo |
| MOSTLY AI | Enterprise compliance | TabularARGN with built-in DP; fairness controls | Enterprise license | Independent |
| Tonic.ai | Test data management | Database-aware synthesis; acquired Fabricate (April 2025) | Per-database | Growing |
| Hazy | Financial services | Regulated-sector focus; sequential data | Enterprise license | Independent |
| SDV/DataCebo | Open source ecosystem | CTGAN, TVAEs, Gaussian copulas; Python-native | Freemium | Open source core |
| K2view | Entity-based testing | All-in-one enterprise data management | Enterprise license | Established |
DataSynth Competitive Advantages
| Advantage | Detail |
|---|---|
| Domain depth | Deepest financial/accounting domain model (IFRS, US GAAP, ISA, SOX, COSO, KYC/AML) |
| Rule-based coherence | Guaranteed balance equations, document chain integrity, three-way matching |
| Deterministic reproducibility | ChaCha8 RNG with seed control; bit-exact reproducibility across runs |
| Performance | 100K+ entries/sec (Rust native); 10-100x faster than Python-based competitors |
| Privacy-preserving fingerprinting | Unique extract-synthesize workflow with DP guarantees |
| Process mining | Native OCEL 2.0 event log generation (unique in market) |
| Graph-native | Direct PyTorch Geometric, Neo4j, DGL export for GNN workflows |
| Full-stack | CLI + REST/gRPC/WebSocket server + Desktop UI + Python bindings |
Competitive Gaps to Address
| Gap | Competitors with Feature | Priority |
|---|---|---|
| Cloud-hosted SaaS offering | Gretel, MOSTLY AI, Tonic | Phase 3 |
| No-code UI for non-technical users | MOSTLY AI, K2view | Phase 3 |
| Database-aware synthesis from production data | Tonic.ai | Phase 4 |
| LLM-powered natural language interface | Gretel Navigator | Phase 4 |
| Pre-built ML model training pipelines | Gretel | Phase 3 |
| Marketplace for community templates | SDV ecosystem | Phase 3 |
Regulatory Landscape
EU AI Act Timeline
| Date | Milestone | DataSynth Impact |
|---|---|---|
| Feb 2025 | Prohibited AI systems discontinued; AI literacy obligations | Low — DataSynth is a tool, not a prohibited system |
| Aug 2025 | GPAI transparency requirements; training data documentation | Medium — Users training AI with DataSynth output need provenance |
| Aug 2026 | Full high-risk AI compliance; Article 50 transparency | High — Synthetic content marking required; data governance mandated |
| Aug 2027 | High-risk AI in harmonized products | Low — Indirect impact |
Required Compliance Features
- Synthetic content marking (Article 50): All generated data must include machine-readable markers indicating artificial generation (see the marker sketch after this list)
- Training data documentation (Article 10): Generation manifests must document configs, sources, and processing steps
- Quality management (Annex IV): Documented quality assurance processes for generation and evaluation
- Risk assessment: Template for users to assess risks of using synthetic data in AI systems
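A minimal sketch of what a machine-readable synthetic-content marker could look like when embedded in (or emitted alongside) a generation manifest. The field names and the serde/serde_json serialization shown here are illustrative assumptions, not the shipped manifest schema.

use serde::Serialize;

/// Hypothetical machine-readable marker attached to every generated dataset.
#[derive(Serialize)]
struct SyntheticContentMarker {
    synthetic: bool,
    generator: String,
    generator_version: String,
    seed: u64,
    generated_at_utc: String,
}

fn main() -> Result<(), serde_json::Error> {
    let marker = SyntheticContentMarker {
        synthetic: true,
        generator: "DataSynth".to_string(),
        generator_version: "0.5.0".to_string(),
        seed: 42,
        generated_at_utc: "2026-01-15T10:00:00Z".to_string(),
    };
    // Emit the marker as JSON so downstream consumers can detect synthetic content.
    println!("{}", serde_json::to_string_pretty(&marker)?);
    Ok(())
}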
Other Regulatory Frameworks
| Framework | Relevance | Status |
|---|---|---|
| NIST AI RMF 1.0 | Voluntary; becoming the operational governance layer globally | Self-assessment planned (Phase 3) |
| NIST SP 800-226 | DP evaluation guidelines | Alignment planned (Phase 2) |
| GDPR | Synthetic data reduces but doesn’t eliminate privacy obligations | Documentation in Phase 3 |
| SOX | DataSynth already generates SOX-compliant test data | Feature complete |
| ISO 27001 | Information security controls for server deployment | Alignment in Phase 3 |
| SOC 2 Type II | Trust service criteria for SaaS offering | Phase 3 preparation |
Risk Register
Technical Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Performance regression with OTEL instrumentation | Medium | Medium | Benchmark-gated CI; sampling in production |
| Breaking API changes during versioning | Low | High | Semantic versioning; deprecation policy; compatibility tests |
| Memory safety issues in unsafe blocks | Low | Critical | Miri testing; minimize unsafe; regular audits |
| Dependency CVEs | Medium | High | cargo-audit in CI; Dependabot alerts |
| Plugin system security (WASM/dynamic loading) | Medium | High | WASM sandboxing; capability-based permissions |
Business Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| EU AI Act scope broader than anticipated | Medium | High | Proactive Article 50 compliance; legal review |
| Competitor acqui-hires (Gretel→NVIDIA pattern) | Medium | Medium | Build unique domain depth as defensible moat |
| Open-source competitors (SDV) closing feature gap | Medium | Medium | Focus on financial domain depth and performance |
| Enterprise customers requiring SOC 2 certification | High | Medium | Begin SOC 2 preparation in Phase 3 |
| Python ecosystem expects native (PyO3) bindings | Medium | Medium | Evaluate PyO3 migration for v2.0 |
Operational Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Production incidents without runbooks | High | Medium | Prioritize ops documentation in Phase 2 |
| Scaling issues under concurrent load | Medium | High | Load testing in Phase 2; HPA configuration |
| Secret exposure in logs or configs | Low | Critical | Structured logging with PII filtering; secret scanning |
Success Criteria
Phase 1 Exit Criteria
- Docker image published and scannable (multi-stage distroless build)
- cargo-audit and cargo-deny passing in CI
- OTEL traces available via feature-gated otel flag with OTLP export
- Prometheus metrics scraped and graphed (Docker Compose stack)
- Code coverage measured and reported via cargo-llvm-cov + Codecov
- Cross-platform CI (Linux + macOS + Windows)
Phase 2 Exit Criteria
- Helm chart deployed to staging K8s cluster
- Generation manifest produced for every run (with per-file checksums, lineage graph, W3C PROV-JSON)
- Load test: k6 scripts for health, bulk generation, WebSocket, job queue, and soak testing
- Zero unwrap() calls in library crate non-test code (#![deny(clippy::unwrap_used)] enforced)
- Formal DP composition tracking with budget management (RDP, zCDP, privacy budget manager)
- Operations runbook reviewed and validated (deployment guides, runbook, capacity planning, DR, API reference, security hardening)
Phase 3 Exit Criteria
- JWT/OAuth2 authentication with RBAC
- Automated quality gates blocking below-threshold runs
- Plugin SDK documented with 2+ community plugins
- Python 1.0.0 on PyPI with async support
- EU AI Act Article 50 compliance verified
- SOC 2 Type II readiness assessment completed
Phase 4 Exit Criteria
- LLM-augmented generation available as opt-in feature
- Diffusion model backend demonstrated on financial dataset
- 3+ ecosystem integrations (Airflow, dbt, MLflow)
- Causal generation prototype validated
Appendix A: OpenTelemetry Integration Architecture
┌─────────────────────────────────────────────────────┐
│ DataSynth Server │
│ ┌───────────┐ ┌──────────┐ ┌─────────────────┐ │
│ │ REST API │ │ gRPC │ │ WebSocket │ │
│ └─────┬─────┘ └────┬─────┘ └───────┬─────────┘ │
│ │ │ │ │
│ ┌─────┴──────────────┴────────────────┴──────────┐ │
│ │ Tower Middleware Stack │ │
│ │ [Auth] [RateLimit] [Tracing] [Metrics] │ │
│ └────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌────────────────────┴───────────────────────────┐ │
│ │ OpenTelemetry SDK │ │
│ │ ┌─────────┐ ┌──────────┐ ┌─────────────────┐ │ │
│ │ │ Traces │ │ Metrics │ │ Logs │ │ │
│ │ └────┬────┘ └────┬─────┘ └───────┬─────────┘ │ │
│ └───────┼───────────┼───────────────┼────────────┘ │
│ │ │ │ │
│ ┌───────┴───────────┴───────────────┴────────────┐ │
│ │ OTLP Exporter (gRPC/HTTP) │ │
│ └────────────────────┬───────────────────────────┘ │
└───────────────────────┼─────────────────────────────┘
│
┌─────────┴──────────┐
│ OTel Collector │
│ (Agent sidecar) │
└──┬──────┬──────┬───┘
│ │ │
┌─────┘ ┌───┘ ┌──┘
▼ ▼ ▼
┌──────┐ ┌──────┐ ┌─────┐
│Jaeger│ │Prom. │ │Loki │
│/Tempo│ │ │ │ │
└──────┘ └──────┘ └─────┘
Appendix B: Recommended Rust Crate Additions
| Category | Crate | Purpose | Phase |
|---|---|---|---|
| Observability | opentelemetry (0.27+) | Unified telemetry API | 1 |
| Observability | opentelemetry-otlp | OTLP exporter | 1 |
| Observability | tracing-opentelemetry | Bridge tracing → OTEL | 1 |
| Security | argon2 | Password/key hashing | 1 |
| Security | subtle | Constant-time comparison | 1 |
| Security | rustls | Native TLS | 1 |
| Scalability | redis | Distributed state/rate-limiting | 2 |
| Scalability | deadpool-redis | Redis connection pooling | 2 |
| Testing | cargo-tarpaulin | Code coverage | 1 |
| Testing | cargo-fuzz | Fuzz testing | 2 |
| Auth | jsonwebtoken | JWT tokens | 3 |
| Auth | oauth2 | OAuth2 client | 3 |
| Plugins | wasmtime | WASM plugin runtime | 3 |
| Build | git-cliff | Changelog generation | 1 |
Appendix C: Key References
Standards & Guidelines
- NIST AI RMF 1.0 — AI Risk Management Framework
- NIST SP 800-226 — Guidelines for Evaluating Differential Privacy Guarantees
- EU AI Act (Regulation 2024/1689) — Articles 10, 50
- ISO/IEC 25020:2019 — Systems and software Quality Requirements and Evaluation (SQuaRE)
Research Papers
- Chen et al. (2025) — “Benchmarking Differentially Private Tabular Data Synthesis Algorithms” (TPDP 2025)
- SynQP (IEEE, 2025) — “A Framework and Metrics for Evaluating the Quality and Privacy Risk of Synthetic Data”
- Xu et al. (2025) — “TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation” (ICLR 2025)
- Sattarov & Schreyer (2023) — “FinDiff: Diffusion Models for Financial Tabular Data Generation” (ICAIF 2023)
- Shi et al. (2025) — “Comprehensive Survey of Synthetic Tabular Data Generation”
- CFA Institute (July 2025) — “Synthetic Data in Investment Management”
- Pilgram et al. (2025) — “A Consensus Privacy Metrics Framework for Synthetic Data”
Industry Reports
- Gartner (2024) — “By 2026, 75% of businesses will use GenAI for synthetic customer data”
- GlobeNewsWire (January 2026) — AI-Generated Synthetic Tabular Dataset Market: $6.73B by 2029
Research: System Improvements for Enhanced Realism
This research document series analyzes the current SyntheticData system and proposes comprehensive improvements across multiple dimensions to achieve greater realism, statistical validity, and domain authenticity.
Document Index
| Document | Focus Area | Priority |
|---|---|---|
| 01-realism-names-metadata.md | Names, descriptions, metadata realism | High |
| 02-statistical-distributions.md | Numerical and statistical distributions | High |
| 03-temporal-patterns.md | Temporal correctness and distributions | High |
| 04-interconnectivity.md | Entity relationships and referential integrity | Critical |
| 05-pattern-drift.md | Process and pattern evolution over time | Medium |
| 06-anomaly-patterns.md | Anomaly detection and injection patterns | High |
| 07-fraud-patterns.md | Fraud typologies and detection scenarios | High |
| 08-domain-specific.md | Industry-specific enhancements | Medium |
Executive Summary
Current State Assessment
The SyntheticData system is a mature, well-architected synthetic data generation platform with strong foundations in:
- Deterministic generation via ChaCha8 RNG with configurable seeds
- Domain modeling with 50+ entity types across accounting, banking, and audit domains
- Statistical foundations including Benford’s Law, log-normal distributions, and temporal seasonality
- Referential integrity through document chains, three-way matching, and intercompany reconciliation
- Standards compliance with COSO 2013, ISA, SOX, IFRS, and US GAAP frameworks
Key Improvement Themes
After comprehensive analysis, we identify eight major improvement themes:
1. Realism in Names & Metadata
Current Gap: Generic placeholder names, limited cultural diversity, simplistic descriptions
Impact: Immediate visual detection of synthetic nature
Effort: Medium | Value: High
2. Statistical Distribution Enhancements
Current Gap: Single-mode distributions, limited correlation modeling, no regime changes
Impact: ML models trained on synthetic data may not generalize
Effort: High | Value: Critical
3. Temporal Pattern Sophistication
Current Gap: Static multipliers, no business day calculations, limited regional calendars
Impact: Unrealistic transaction timing patterns
Effort: Medium | Value: High
4. Interconnectivity & Relationship Modeling
Current Gap: Shallow relationship graphs, limited network effects, no behavioral clustering
Impact: Graph-based analytics yield unrealistic structures
Effort: High | Value: Critical
5. Pattern & Process Drift
Current Gap: Limited drift types, no organizational change modeling, static processes
Impact: Temporal ML models overfit to stable patterns
Effort: Medium | Value: High
6. Anomaly Pattern Enrichment
Current Gap: Limited anomaly correlation, no multi-stage anomalies, binary labeling
Impact: Anomaly detection models lack nuanced training data
Effort: Medium | Value: High
7. Fraud Pattern Sophistication
Current Gap: Isolated fraud events, limited collusion modeling, no adaptive patterns
Impact: Fraud detection systems miss complex schemes
Effort: High | Value: Critical
8. Domain-Specific Enhancements
Current Gap: Generic industry modeling, limited regulatory specificity
Impact: Industry-specific use cases require extensive customization
Effort: Medium | Value: Medium
Implementation Roadmap
Phase 1: Foundation (Q1)
- Culturally-aware name generation with regional distributions
- Enhanced amount distributions with mixture models
- Business day calculation utilities
- Relationship graph depth improvements
Phase 2: Statistical Sophistication (Q2)
- Multi-modal distribution support
- Cross-field correlation modeling
- Regime change simulation
- Network effect modeling
Phase 3: Temporal Evolution (Q3)
- Organizational change events
- Process evolution modeling
- Adaptive fraud patterns
- Multi-stage anomaly injection
Phase 4: Domain Specialization (Q4)
- Industry-specific regulatory frameworks
- Enhanced audit trail generation
- Advanced graph analytics support
- Privacy-preserving fingerprint improvements
Metrics for Success
Realism Metrics
- Human Detection Rate: % of samples correctly identified as synthetic by domain experts
- Statistical Divergence: KL divergence between synthetic and real-world distributions
- Temporal Correlation: Autocorrelation alignment with empirical baselines
ML Utility Metrics
- Transfer Learning Gap: Performance delta when models trained on synthetic data are applied to real data
- Feature Distribution Overlap: Overlap coefficient for key feature distributions
- Anomaly Detection AUC: Baseline AUC on synthetic vs. improvement after enhancements
Technical Metrics
- Generation Throughput: Records/second with enhanced features
- Memory Efficiency: Peak memory usage for equivalent dataset sizes
- Configuration Complexity: Lines of YAML required for common scenarios
Next Steps
- Review individual research documents for detailed analysis
- Prioritize improvements based on use case requirements
- Create implementation tickets for Phase 1 items
- Establish baseline metrics for tracking progress
Research conducted: January 2026
System version analyzed: 0.2.3
Research: Realism in Names, Descriptions, and Metadata
Current State Analysis
Entity Name Generation
The current system uses basic name generation across multiple entity types:
| Entity Type | Current Approach | Realism Level |
|---|---|---|
| Vendors | “Vendor_{id}” or template-based | Low |
| Customers | “Customer_{id}” or template-based | Low |
| Employees | First/Last name pools | Medium |
| Materials | “Material_{id}” with category prefix | Low |
| Cost Centers | “{dept}_{code}” pattern | Medium |
| GL Accounts | Numeric codes with descriptions | High |
| Companies | Configurable but often generic | Medium |
Description Generation
Current descriptions follow predictable patterns:
- Journal entries: “{type} for {entity}”
- Invoices: “Invoice for {goods/services}”
- Payments: “Payment for Invoice {ref}”
Metadata Patterns
- Timestamps: Well-distributed but lack system-specific quirks
- User IDs: Sequential or simple patterns
- References: Deterministic but predictable formats
Improvement Recommendations
1. Culturally-Aware Name Generation
1.1 Regional Name Pools
Implementation: Create region-specific name databases with appropriate cultural distributions.
# Proposed configuration structure
name_generation:
strategy: regional_weighted
regions:
- region: north_america
weight: 0.45
subregions:
- country: US
weight: 0.85
cultural_mix:
- origin: anglo
weight: 0.55
- origin: hispanic
weight: 0.25
- origin: asian
weight: 0.12
- origin: other
weight: 0.08
- country: CA
weight: 0.10
- country: MX
weight: 0.05
- region: europe
weight: 0.30
- region: asia_pacific
weight: 0.25
1.2 Company Name Patterns by Industry
Retail:
- Pattern: {Founder} {Product} → "Johnson's Hardware"
- Pattern: {Adjective} {Category} → "Premier Electronics"
- Pattern: {Location} {Type} → "Westside Grocers"
Manufacturing:
- Pattern: {Name} {Industry} {Suffix} → "Anderson Steel Corporation"
- Pattern: {Acronym} {Type} → "ACM Industries"
- Pattern: {Technical} {Systems} → "Precision Machining Systems"
Professional Services:
- Pattern: {Partner1}, {Partner2} & {Partner3} → "Smith, Chen & Associates"
- Pattern: {Name} {Specialty} {Type} → "Hartwell Tax Advisors"
- Pattern: {Adjective} {Service} {Suffix} → "Strategic Consulting Group"
Financial Services:
- Pattern: {Location} {Type} {Entity} → "Pacific Coast Credit Union"
- Pattern: {Founder} {Service} → "Morgan Wealth Management"
- Pattern: {Region} {Specialty} → "Midwest Commercial Lending"
1.3 Vendor Name Realism
Current: Vendor_00042 or simple templates
Proposed: Industry-appropriate vendor names based on spend category:
// Conceptual structure
pub struct VendorNameGenerator {
category_templates: HashMap<SpendCategory, Vec<NameTemplate>>,
regional_styles: HashMap<Region, NamingConvention>,
legal_suffixes: HashMap<Country, Vec<String>>,
}
impl VendorNameGenerator {
pub fn generate(&self, category: SpendCategory, region: Region) -> VendorName {
// Select template based on category
// Apply regional naming conventions
// Add appropriate legal suffix (Inc., LLC, GmbH, Ltd., S.A., etc.)
}
}
Examples by Category:
| Category | Example Names |
|---|---|
| Office Supplies | Staples, Office Depot, ULINE, Quill Corporation |
| IT Services | Accenture Technology, Cognizant Solutions, InfoSys Systems |
| Raw Materials | Alcoa Aluminum, US Steel Supply, Nucor Materials |
| Utilities | Pacific Gas & Electric, ConEdison, Duke Energy |
| Professional Services | Deloitte & Touche, KPMG Advisory, BDO Consulting |
| Logistics | FedEx Freight, UPS Supply Chain, XPO Logistics |
| Facilities | ABM Industries, CBRE Services, JLL Facilities |
2. Realistic Description Generation
2.1 Journal Entry Descriptions
Current Pattern: Generic, formulaic
Proposed: Context-aware, varied descriptions with realistic abbreviations and typos
journal_entry_descriptions:
revenue:
templates:
- "Revenue recognition - {customer} - {contract_ref}"
- "Rev rec {period} - {product_line}"
- "Sales revenue {region} Q{quarter}"
- "Earned revenue - PO# {po_number}"
abbreviations:
enabled: true
probability: 0.3
mappings:
Revenue: ["Rev", "REV"]
recognition: ["rec", "recog"]
Quarter: ["Q", "Qtr"]
variations:
case_variation: 0.1
typo_rate: 0.02
expense:
templates:
- "AP invoice - {vendor} - {invoice_ref}"
- "{expense_category} - {cost_center}"
- "Accrued {expense_type} {period}"
- "{vendor_short} inv {invoice_num}"
context_aware:
include_approver: 0.2
include_po_reference: 0.7
include_department: 0.4
2.2 Invoice Line Item Descriptions
Goods:
- "Qty {quantity} {product_name} @ ${unit_price}/ea"
- "{product_sku} - {product_description}"
- "{quantity}x {product_short_name}"
- "Lot# {lot_number} {product_name}"
Services:
- "Professional services - {date_range}"
- "Consulting fees - {project_name}"
- "Retainer - {month} {year}"
- "{hours} hrs @ ${rate}/hr - {service_type}"
2.3 Payment Descriptions
Current: “Payment for Invoice INV-00123”
Proposed variations:
- "Pmt INV-00123"
- "ACH payment - {vendor} - {invoice_ref}"
- "Wire transfer ref {bank_ref}"
- "Check #{check_number} - {vendor}"
- "EFT {date} {vendor_short}"
- "Batch payment - {batch_id}"
3. Enhanced Metadata Generation
3.1 User ID Patterns
Current: Sequential or simple random
Proposed: Realistic corporate patterns
user_id_patterns:
format: "{first_initial}{last_name}{disambiguator}"
examples:
- "jsmith"
- "jsmith2"
- "john.smith"
- "smithj"
system_accounts:
- prefix: "SVC_"
examples: ["SVC_BATCH", "SVC_INTERFACE", "SVC_RECON"]
- prefix: "SYS_"
examples: ["SYS_AUTO", "SYS_SCHEDULER"]
- prefix: "INT_"
examples: ["INT_SAP", "INT_ORACLE", "INT_SALESFORCE"]
admin_accounts:
- pattern: "admin_{system}"
- examples: ["admin_gl", "admin_ap", "admin_ar"]
3.2 Reference Number Formats
Realistic patterns by document type:
reference_formats:
purchase_order:
patterns:
- "PO-{year}{seq:06}" # PO-2024000142
- "4500{seq:06}" # SAP-style: 4500000142
- "{plant}-{year}-{seq:05}" # CHI-2024-00142
invoice:
vendor_patterns:
- "INV-{seq:08}"
- "{vendor_prefix}-{date}-{seq:04}"
- "{random_alpha:3}{seq:06}"
internal_patterns:
- "VINV-{year}{seq:06}"
- "{company_code}-AP-{seq:07}"
journal_entry:
patterns:
- "{year}{period:02}{seq:06}" # 202401000142
- "JE-{date}-{seq:04}" # JE-20240115-0142
- "{company}-{year}-{seq:07}" # C001-2024-0000142
bank_reference:
patterns:
- "{date}{random:10}" # Bank statement ref
- "TRN{seq:12}" # Transaction ID
- "{swift_code}{date}{seq:06}" # SWIFT format
3.3 Timestamp Realism
System-specific posting behaviors:
timestamp_patterns:
batch_processing:
typical_times: ["02:00", "06:00", "22:00"]
duration_minutes: 30-180
day_pattern: "business_days"
manual_posting:
peak_hours: [9, 10, 11, 14, 15, 16]
off_peak_probability: 0.15
lunch_dip: [12, 13]
lunch_probability: 0.3
interface_posting:
patterns:
- hourly: ":00", ":15", ":30", ":45"
- real_time: random within seconds
source_systems:
- name: "SAP_INTERFACE"
posting_lag_hours: 0-4
- name: "LEGACY_BATCH"
posting_time: "23:30"
posting_day: "next_business_day"
period_end_crunch:
enabled: true
days_before_close: 3
extended_hours: true
weekend_activity: 0.3
4. Address and Contact Information
4.1 Realistic Address Generation
Current Gap: Generic or missing addresses
Proposed: Region-appropriate address formats
address_generation:
us:
format: "{street_number} {street_name} {street_type}\n{city}, {state} {zip}"
components:
street_numbers:
residential: 100-9999
commercial: 1-500
distribution: "log_normal"
street_names:
sources: ["census_data", "common_names"]
include_directional: 0.3 # "N", "S", "E", "W"
street_types:
distribution:
Street: 0.25
Avenue: 0.15
Road: 0.12
Drive: 0.12
Boulevard: 0.08
Lane: 0.08
Way: 0.08
Court: 0.05
Place: 0.04
Circle: 0.03
cities:
source: "population_weighted"
major_metro_weight: 0.6
commercial_patterns:
suite_probability: 0.4
floor_probability: 0.2
building_name_probability: 0.15
de:
format: "{street_name} {street_number}\n{postal_code} {city}"
# German addresses put number after street name
jp:
format: "〒{postal_code}\n{prefecture}{city}{ward}\n{block}-{building}-{unit}"
# Japanese addressing system
4.2 Phone Number Formats
phone_generation:
formats:
us: "+1 ({area_code}) {exchange}-{subscriber}"
uk: "+44 {area_code} {local_number}"
de: "+49 {area_code} {subscriber}"
area_codes:
us:
source: "valid_area_codes"
weight_by_population: true
exclude_toll_free: true
business_toll_free_rate: 0.3
4.3 Email Patterns
email_generation:
corporate:
patterns:
- "{first}.{last}@{company_domain}"
- "{first_initial}{last}@{company_domain}"
- "{first}_{last}@{company_domain}"
domain_generation:
from_company_name: true
tld_distribution:
".com": 0.75
".net": 0.10
".io": 0.05
".co": 0.05
country_tld: 0.05
vendor_contacts:
patterns:
- "accounts.payable@{domain}"
- "ar@{domain}"
- "billing@{domain}"
- "{first}.{last}@{domain}"
generic_rate: 0.4
5. Material and Product Naming
5.1 SKU Patterns
sku_generation:
patterns:
category_prefix:
format: "{category:3}-{subcategory:3}-{sequence:06}"
example: "ELE-CPT-000142" # Electronics-Components
alphanumeric:
format: "{alpha:2}{numeric:6}{check_digit}"
example: "AB123456C"
hierarchical:
format: "{division}-{family}-{class}-{item}"
example: "01-234-567-8901"
5.2 Product Descriptions
By Category:
product_descriptions:
raw_materials:
templates:
- "{material_type}, {grade}, {specification}"
- "{chemical_formula} {purity}% pure"
- "{material} {form} - {dimensions}"
examples:
- "Steel Coil, Grade 304, 1.2mm thickness"
- "Aluminum Sheet 6061-T6, 4' x 8' x 0.125\""
- "Polyethylene Pellets, HDPE, 50lb bag"
finished_goods:
templates:
- "{brand} {product_line} {model}"
- "{product_type} - {feature1}, {feature2}"
- "{category} {variant} ({color}/{size})"
examples:
- "Acme Pro Series 5000X Widget"
- "Heavy-Duty Industrial Pump - 2HP, 120V"
- "Office Chair Ergonomic Mesh (Black/Large)"
services:
templates:
- "{service_type} - {duration} {frequency}"
- "Professional {service} Services"
- "{specialty} Consultation - {scope}"
examples:
- "HVAC Maintenance - Annual Contract"
- "Professional IT Support Services"
- "Legal Consultation - Contract Review"
6. Implementation Priority
| Enhancement | Effort | Impact | Priority |
|---|---|---|---|
| Regional name pools | Medium | High | P1 |
| Industry-specific vendor names | Medium | High | P1 |
| Varied JE descriptions | Low | Medium | P1 |
| Reference number formats | Low | High | P1 |
| User ID patterns | Low | Medium | P2 |
| Address generation | High | Medium | P2 |
| Product descriptions | Medium | Medium | P2 |
| Email patterns | Low | Low | P3 |
| Phone formatting | Low | Low | P3 |
7. Data Sources
Recommended External Data Sources:
- Name Data:
  - US Census Bureau name frequency data
  - International name databases (regional)
  - Industry-specific company name patterns
- Address Data:
  - OpenAddresses project
  - Census TIGER/Line files
  - Postal code databases by country
- Reference Patterns:
  - ERP documentation (SAP, Oracle, NetSuite)
  - Industry EDI standards
  - Banking reference formats (SWIFT, ACH)
- Product Data:
  - UNSPSC category codes
  - Industry classification systems
  - Standard material specifications
8. Configuration Example
# Enhanced name and metadata configuration
realism:
names:
strategy: culturally_aware
primary_region: north_america
diversity_index: 0.4
vendors:
naming_style: industry_appropriate
include_legal_suffix: true
regional_distribution:
domestic: 0.7
international: 0.3
descriptions:
variation_enabled: true
abbreviation_rate: 0.25
typo_injection_rate: 0.01
references:
format_style: erp_realistic
include_check_digits: true
timestamps:
system_behavior_modeling: true
batch_window_realism: true
addresses:
format: regional_appropriate
commercial_indicators: true
Next Steps
- Create name pool data files for major regions
- Implement NameGenerator trait with regional strategies
- Build description template engine with variation support
- Add reference format configurations to schema
- Integrate address generation with Faker-like libraries
See also: 02-statistical-distributions.md for numerical realism
Research: Statistical and Numerical Distributions
Current State Analysis
Existing Distribution Implementations
The system currently supports several distribution types:
| Distribution | Implementation | Usage |
|---|---|---|
| Log-Normal | AmountSampler | Transaction amounts |
| Benford’s Law | BenfordSampler | First-digit distribution |
| Uniform | Standard | ID generation, selection |
| Weighted | LineItemSampler | Line item counts |
| Poisson | TemporalSampler | Event counts |
| Normal/Gaussian | Standard | Some variations |
Current Strengths
- Benford’s Law compliance: First-digit distribution follows expected 30.1%, 17.6%, 12.5%… pattern
- Log-normal amounts: Realistic transaction size distributions
- Temporal weighting: Period-end spikes, day-of-week patterns
- Industry seasonality: 10 industry profiles with event-based multipliers
Current Gaps
- Single-mode distributions: No mixture models for multi-modal data
- Limited correlation: Cross-field dependencies not modeled
- Static parameters: No regime changes or parameter drift
- Missing distributions: Pareto, Weibull, Beta not available
- No copulas: Joint distributions not correlated realistically
Improvement Recommendations
1. Multi-Modal Distribution Support
1.1 Gaussian Mixture Models
Real-world transaction amounts often exhibit multiple modes:
/// Gaussian Mixture Model for multi-modal distributions
pub struct GaussianMixture {
components: Vec<GaussianComponent>,
}
pub struct GaussianComponent {
weight: f64, // Component weight (sum to 1.0)
mean: f64, // Component mean
std_dev: f64, // Component standard deviation
}
impl GaussianMixture {
/// Sample from the mixture distribution
pub fn sample(&self, rng: &mut impl Rng) -> f64 {
// Select component based on weights
let component = self.select_component(rng);
// Sample from selected Gaussian
component.sample(rng)
}
}
Configuration:
amount_distribution:
type: gaussian_mixture
components:
- weight: 0.60
mean: 500
std_dev: 200
label: "small_transactions"
- weight: 0.30
mean: 5000
std_dev: 1500
label: "medium_transactions"
- weight: 0.10
mean: 50000
std_dev: 15000
label: "large_transactions"
1.2 Log-Normal Mixture
For strictly positive amounts with multiple modes:
amount_distribution:
type: lognormal_mixture
components:
- weight: 0.70
mu: 5.5 # log-scale mean
sigma: 1.2 # log-scale std dev
label: "routine_expenses"
- weight: 0.25
mu: 8.5
sigma: 0.8
label: "capital_expenses"
- weight: 0.05
mu: 11.0
sigma: 0.5
label: "major_projects"
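A minimal sampler sketch for such a mixture, assuming the rand and rand_distr crates: a component is chosen by weight, then the amount is drawn from that component's log-normal. The struct and its API are illustrative, not the existing AmountSampler.

use rand::distributions::{Distribution, WeightedIndex};
use rand::rngs::StdRng;
use rand::SeedableRng;
use rand_distr::LogNormal;

struct LogNormalMixture {
    weights: WeightedIndex<f64>,
    components: Vec<LogNormal<f64>>,
}

impl LogNormalMixture {
    /// specs: (weight, mu, sigma) per component.
    fn new(specs: &[(f64, f64, f64)]) -> Self {
        let w: Vec<f64> = specs.iter().map(|s| s.0).collect();
        let components = specs
            .iter()
            .map(|&(_, mu, sigma)| LogNormal::new(mu, sigma).expect("valid sigma"))
            .collect();
        Self { weights: WeightedIndex::new(&w).expect("valid weights"), components }
    }

    /// Pick a component by weight, then sample from its log-normal distribution.
    fn sample(&self, rng: &mut StdRng) -> f64 {
        self.components[self.weights.sample(rng)].sample(rng)
    }
}

fn main() {
    // Mirrors the routine / capital / major-project components above.
    let mixture = LogNormalMixture::new(&[(0.70, 5.5, 1.2), (0.25, 8.5, 0.8), (0.05, 11.0, 0.5)]);
    let mut rng = StdRng::seed_from_u64(42);
    for _ in 0..5 {
        println!("{:.2}", mixture.sample(&mut rng));
    }
}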
1.3 Realistic Transaction Amount Profiles
By Transaction Type:
| Type | Distribution | Parameters | Notes |
|---|---|---|---|
| Petty Cash | Log-normal | μ=3.5, σ=0.8 | $10-$500 range |
| AP Invoices | Mixture(3) | See below | Multi-modal |
| Payroll | Normal | μ=4500, σ=1200 | Per employee |
| Utilities | Log-normal | μ=7.0, σ=0.4 | Monthly, stable |
| Capital | Pareto | α=1.5, xₘ=10000 | Heavy tail |
AP Invoice Mixture:
ap_invoices:
type: lognormal_mixture
components:
# Operating expenses
- weight: 0.50
mu: 6.0 # ~$400 median
sigma: 1.5
# Inventory/materials
- weight: 0.35
mu: 8.0 # ~$3000 median
sigma: 1.0
# Capital/projects
- weight: 0.15
mu: 10.5 # ~$36000 median
sigma: 0.8
2. Cross-Field Correlation Modeling
2.1 Correlation Matrix Support
Define correlations between numeric fields:
correlations:
enabled: true
fields:
- name: transaction_amount
- name: line_item_count
- name: approval_level
- name: processing_time_hours
- name: discount_percentage
matrix:
# Correlation coefficients (Pearson's r)
# Higher amounts → more line items
- [1.00, 0.65, 0.72, 0.45, -0.20]
# More items → higher amount
- [0.65, 1.00, 0.55, 0.60, -0.15]
# Higher amount → higher approval
- [0.72, 0.55, 1.00, 0.50, -0.30]
# More complex → longer processing
- [0.45, 0.60, 0.50, 1.00, -0.10]
# Higher amount → lower discount %
- [-0.20, -0.15, -0.30, -0.10, 1.00]
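A minimal sketch of how one pairwise correlation from this matrix could be realized, assuming the rand and rand_distr crates: draw two independent standard normals and blend them so the pair has the target Pearson correlation, then map each latent value onto its marginal. A full implementation would apply a Cholesky factor of the whole matrix rather than a single pair.

use rand::rngs::StdRng;
use rand::SeedableRng;
use rand_distr::{Distribution, StandardNormal};

/// y = r*x + sqrt(1 - r^2)*z gives corr(x, y) = r for standard normals x, z.
fn correlated_pair(r: f64, rng: &mut StdRng) -> (f64, f64) {
    let x: f64 = StandardNormal.sample(rng);
    let z: f64 = StandardNormal.sample(rng);
    let y = r * x + (1.0 - r * r).sqrt() * z;
    (x, y)
}

fn main() {
    let mut rng = StdRng::seed_from_u64(7);
    let r = 0.65; // transaction_amount vs. line_item_count from the matrix above
    for _ in 0..3 {
        let (x, y) = correlated_pair(r, &mut rng);
        // The latent normals would then be mapped onto marginals, e.g. a
        // log-normal amount and a count distribution; here we print them as-is.
        println!("x = {x:.3}, y = {y:.3}");
    }
}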
2.2 Copula-Based Generation
For more sophisticated dependency modeling:
/// Copula types for dependency modeling
pub enum CopulaType {
/// Gaussian copula - symmetric dependencies
Gaussian { correlation: f64 },
/// Clayton copula - lower tail dependence
Clayton { theta: f64 },
/// Gumbel copula - upper tail dependence
Gumbel { theta: f64 },
/// Frank copula - symmetric, no tail dependence
Frank { theta: f64 },
/// Student-t copula - both tail dependencies
StudentT { correlation: f64, df: f64 },
}
pub struct CopulaGenerator {
copula: CopulaType,
marginals: Vec<Box<dyn Distribution>>,
}
Use Cases:
- Amount & Days-to-Pay: Larger invoices may have longer payment terms (Clayton copula)
- Revenue & COGS: Strong positive correlation (Gaussian copula)
- Fraud Amount & Detection Delay: Upper tail dependence (Gumbel copula)
2.3 Conditional Distributions
Generate values conditional on other fields:
conditional_distributions:
# Discount percentage depends on order amount
discount:
type: conditional
given: order_amount
breakpoints:
- threshold: 1000
distribution: { type: constant, value: 0 }
- threshold: 5000
distribution: { type: uniform, min: 0, max: 0.05 }
- threshold: 25000
distribution: { type: uniform, min: 0.05, max: 0.10 }
- threshold: 100000
distribution: { type: uniform, min: 0.10, max: 0.15 }
- threshold: infinity
distribution: { type: normal, mean: 0.15, std: 0.03 }
# Payment terms depend on vendor relationship
payment_terms:
type: conditional
given: vendor_relationship_months
breakpoints:
- threshold: 6
distribution: { type: choice, values: [0, 15], weights: [0.8, 0.2] }
- threshold: 24
distribution: { type: choice, values: [15, 30], weights: [0.6, 0.4] }
- threshold: infinity
distribution: { type: choice, values: [30, 45, 60], weights: [0.5, 0.35, 0.15] }
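A minimal sketch of the breakpoint lookup, using the thresholds from the discount example above; the returned range would then be fed to the configured distribution (constant, uniform, normal, and so on). The function name and the open-ended bucket are illustrative.

/// Map an order amount to the discount range of its breakpoint bucket.
fn discount_range(order_amount: f64) -> (f64, f64) {
    match order_amount {
        a if a < 1_000.0 => (0.0, 0.0),
        a if a < 5_000.0 => (0.0, 0.05),
        a if a < 25_000.0 => (0.05, 0.10),
        a if a < 100_000.0 => (0.10, 0.15),
        _ => (0.12, 0.18), // open-ended bucket, roughly Normal(0.15, 0.03)
    }
}

fn main() {
    for amount in [500.0, 12_000.0, 250_000.0] {
        let (lo, hi) = discount_range(amount);
        println!("order {amount}: discount drawn from [{lo}, {hi}]");
    }
}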
3. Industry-Specific Amount Distributions
3.1 Retail
retail:
transaction_amounts:
pos_sales:
type: lognormal_mixture
components:
- weight: 0.65
mu: 3.0 # ~$20 median
sigma: 1.0
label: "small_basket"
- weight: 0.30
mu: 4.5 # ~$90 median
sigma: 0.8
label: "medium_basket"
- weight: 0.05
mu: 6.0 # ~$400 median
sigma: 0.6
label: "large_basket"
inventory_orders:
type: lognormal
mu: 9.0 # ~$8000 median
sigma: 1.5
seasonal_multipliers:
black_friday: 3.5
christmas_week: 2.8
back_to_school: 1.6
3.2 Manufacturing
manufacturing:
transaction_amounts:
raw_materials:
type: lognormal_mixture
components:
- weight: 0.40
mu: 8.0 # ~$3000 median
sigma: 1.0
label: "consumables"
- weight: 0.45
mu: 10.0 # ~$22000 median
sigma: 0.8
label: "production_materials"
- weight: 0.15
mu: 12.0 # ~$163000 median
sigma: 0.6
label: "bulk_orders"
maintenance:
type: pareto
alpha: 2.0
x_min: 500
label: "repair_costs"
capital_equipment:
type: lognormal
mu: 12.5 # ~$268000 median
sigma: 1.0
3.3 Financial Services
financial_services:
transaction_amounts:
wire_transfers:
type: lognormal_mixture
components:
- weight: 0.30
mu: 8.0 # ~$3000
sigma: 1.2
label: "retail_wire"
- weight: 0.40
mu: 11.0 # ~$60000
sigma: 1.0
label: "commercial_wire"
- weight: 0.20
mu: 14.0 # ~$1.2M
sigma: 0.8
label: "institutional_wire"
- weight: 0.10
mu: 17.0 # ~$24M
sigma: 1.0
label: "large_value"
ach_transactions:
type: lognormal
mu: 7.5 # ~$1800
sigma: 2.0
fee_income:
type: weibull
scale: 500
shape: 1.5
4. Regime Change Modeling
4.1 Structural Breaks
Model sudden changes in distribution parameters:
regime_changes:
enabled: true
changes:
- date: "2024-03-15"
type: acquisition
effects:
- field: transaction_volume
multiplier: 1.35
- field: average_amount
shift: 5000
- field: vendor_count
multiplier: 1.25
- date: "2024-07-01"
type: price_increase
effects:
- field: cogs_ratio
shift: 0.03
- field: avg_invoice_amount
multiplier: 1.08
- date: "2024-10-01"
type: new_product_line
effects:
- field: revenue
multiplier: 1.20
- field: inventory_turns
multiplier: 0.85
4.2 Gradual Parameter Drift
Model slow changes over time:
parameter_drift:
enabled: true
parameters:
- field: transaction_amount
type: linear
annual_drift: 0.03 # 3% annual increase (inflation)
- field: digital_payment_ratio
type: logistic
start_value: 0.40
end_value: 0.85
midpoint_months: 18
steepness: 0.15
- field: approval_threshold
type: step
steps:
- month: 6
value: 5000
- month: 18
value: 7500
- month: 30
value: 10000
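A minimal sketch of how the linear and logistic drift curves above could be evaluated at sample time, as pure functions of months elapsed since generation start. The parameter values come from the config; the function names are illustrative.

/// Compound annual drift (e.g. 3% inflation) applied continuously by month.
fn inflation_multiplier(months: f64, annual_drift: f64) -> f64 {
    (1.0 + annual_drift).powf(months / 12.0)
}

/// Logistic adoption curve: start 0.40, end 0.85, midpoint at 18 months, steepness 0.15.
fn digital_payment_ratio(months: f64) -> f64 {
    let (start, end, midpoint, k) = (0.40, 0.85, 18.0, 0.15);
    start + (end - start) / (1.0 + (-k * (months - midpoint)).exp())
}

fn main() {
    for m in [0.0, 12.0, 24.0, 36.0] {
        println!(
            "month {m:>4}: inflation x{:.3}, digital ratio {:.2}",
            inflation_multiplier(m, 0.03),
            digital_payment_ratio(m)
        );
    }
}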
4.3 Economic Cycle Modeling
economic_cycles:
enabled: true
base_cycle:
type: sinusoidal
period_months: 48 # 4-year cycle
amplitude: 0.15 # ±15% variation
recession_events:
- start: "2024-09-01"
duration_months: 8
severity: moderate # 10-20% decline
effects:
- revenue: -0.15
- discretionary_spend: -0.35
- capital_investment: -0.50
- headcount: -0.10
recovery:
type: gradual
months: 12
5. Enhanced Benford’s Law Compliance
5.1 Second and Third Digit Distributions
Extend beyond first-digit to full Benford compliance:
pub struct BenfordDistribution {
digits: BenfordDigitConfig,
}
pub struct BenfordDigitConfig {
first_digit: bool, // Standard Benford
second_digit: bool, // Second digit distribution
first_two: bool, // Joint first-two digits
summation: bool, // Summation test
}
impl BenfordDistribution {
/// Generate amount following full Benford's Law
pub fn sample_benford_compliant(&self, rng: &mut impl Rng) -> Decimal {
// Use log-uniform distribution to ensure Benford compliance
// across multiple digit positions
}
}
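The comment above hints at the standard trick: amounts whose logarithm is uniform over several orders of magnitude have first digits that follow Benford's Law, P(d) = log10(1 + 1/d). A minimal sketch assuming the rand crate; the range and helper name are illustrative, not the existing BenfordSampler.

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

/// Sample an amount whose log10 is uniform on [log10(min), log10(max)).
fn benford_amount(rng: &mut StdRng, min: f64, max: f64) -> f64 {
    let exp = rng.gen_range(min.log10()..max.log10());
    10f64.powf(exp)
}

fn main() {
    let mut rng = StdRng::seed_from_u64(1);
    let n = 100_000;
    let mut counts = [0usize; 9];
    for _ in 0..n {
        let amount = benford_amount(&mut rng, 10.0, 1_000_000.0);
        // Leading digit, clamped defensively against floating-point edge cases.
        let d = ((amount / 10f64.powf(amount.log10().floor())).floor() as usize).clamp(1, 9);
        counts[d - 1] += 1;
    }
    for (i, c) in counts.iter().enumerate() {
        let expected = (1.0 + 1.0 / (i as f64 + 1.0)).log10();
        println!("digit {}: observed {:.3}, Benford {:.3}", i + 1, *c as f64 / n as f64, expected);
    }
}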
5.2 Benford Deviation Injection
For anomaly scenarios, intentionally violate Benford:
benford_deviations:
enabled: false # Enable for fraud scenarios
deviation_types:
# Round number preference (fraud indicator)
round_number_bias:
probability: 0.15
targets: [1000, 5000, 10000, 25000]
tolerance: 0.01
# Threshold avoidance (approval bypass)
threshold_clustering:
thresholds: [5000, 10000, 25000]
cluster_below: true
distance: 50-200
# Uniform distribution (fabricated data)
uniform_injection:
probability: 0.05
range: [1000, 9999]
6. Statistical Validation Framework
6.1 Distribution Fitness Tests
pub struct DistributionValidator {
tests: Vec<StatisticalTest>,
}
pub enum StatisticalTest {
/// Kolmogorov-Smirnov test
KolmogorovSmirnov { significance: f64 },
/// Chi-squared goodness of fit
ChiSquared { bins: usize, significance: f64 },
/// Anderson-Darling test
AndersonDarling { significance: f64 },
/// Benford's Law chi-squared
BenfordChiSquared { digits: u8, significance: f64 },
/// Mean Absolute Deviation from Benford
BenfordMAD { threshold: f64 },
}
6.2 Validation Configuration
validation:
statistical_tests:
enabled: true
tests:
- type: benford_first_digit
threshold_mad: 0.015
warning_mad: 0.010
- type: distribution_fit
target: lognormal
ks_significance: 0.05
- type: correlation_check
expected_correlations:
- fields: [amount, line_items]
expected_r: 0.65
tolerance: 0.10
reporting:
generate_plots: true
output_format: html
include_raw_data: false
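A minimal sketch of the benford_first_digit check from the config above: compute the mean absolute deviation (MAD) of observed first-digit frequencies from Benford's expected frequencies and compare against the warning and failure thresholds. The observed values are illustrative.

/// MAD of observed first-digit frequencies from Benford's expected frequencies.
fn benford_mad(observed: &[f64; 9]) -> f64 {
    (0..9)
        .map(|i| {
            let expected = (1.0 + 1.0 / (i as f64 + 1.0)).log10();
            (observed[i] - expected).abs()
        })
        .sum::<f64>()
        / 9.0
}

fn main() {
    // Observed first-digit proportions from a generated batch (illustrative values).
    let observed = [0.298, 0.178, 0.127, 0.095, 0.080, 0.068, 0.058, 0.051, 0.045];
    let mad = benford_mad(&observed);
    // Thresholds follow the config above: warn at 0.010, fail at 0.015.
    let verdict = if mad > 0.015 {
        "fail"
    } else if mad > 0.010 {
        "warn"
    } else {
        "pass"
    };
    println!("Benford MAD = {mad:.4} -> {verdict}");
}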
7. New Distribution Types
7.1 Pareto Distribution
For heavy-tailed phenomena (80/20 rule):
# Top 20% of customers generate 80% of revenue
customer_revenue:
type: pareto
alpha: 1.16 # Shape parameter for 80/20
x_min: 1000 # Minimum value
truncate_max: 10000000 # Optional cap
7.2 Weibull Distribution
For time-to-event data:
# Days until payment
days_to_payment:
type: weibull
shape: 2.0 # k > 1: increasing hazard (more likely to pay over time)
scale: 30.0 # λ: characteristic life
shift: 0 # Minimum days
7.3 Beta Distribution
For proportions and percentages:
# Discount percentage
discount_rate:
type: beta
alpha: 2.0 # Shape parameter 1
beta: 8.0 # Shape parameter 2
# Mode = (alpha - 1) / (alpha + beta - 2) = 12.5% on the unit interval before scaling, right-skewed
scale:
min: 0.0
max: 0.25 # Max 25% discount
7.4 Zero-Inflated Distributions
For data with excess zeros:
# Credits/returns (many transactions have zero)
credit_amount:
type: zero_inflated
zero_probability: 0.85
positive_distribution:
type: lognormal
mu: 5.0
sigma: 1.5
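A minimal sampler sketch for the zero-inflated credit amount above, assuming the rand and rand_distr crates: return zero with the configured probability, otherwise draw from the positive log-normal component.

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use rand_distr::{Distribution, LogNormal};

/// Zero with probability p_zero; otherwise a draw from the positive component.
fn credit_amount(rng: &mut StdRng) -> f64 {
    let p_zero = 0.85;
    if rng.gen::<f64>() < p_zero {
        0.0
    } else {
        // Constructed per call for brevity; a real sampler would cache it.
        LogNormal::new(5.0, 1.5).expect("valid sigma").sample(rng)
    }
}

fn main() {
    let mut rng = StdRng::seed_from_u64(3);
    let samples: Vec<f64> = (0..10).map(|_| credit_amount(&mut rng)).collect();
    println!("{samples:?}");
}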
8. Implementation Priority
| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| Mixture models | Medium | High | P1 |
| Correlation matrices | High | Critical | P1 |
| Industry-specific profiles | Medium | High | P1 |
| Regime changes | Medium | High | P2 |
| Copula support | High | Medium | P2 |
| Additional distributions | Low | Medium | P2 |
| Validation framework | Medium | High | P1 |
| Conditional distributions | Medium | Medium | P3 |
9. Configuration Example
# Complete statistical distribution configuration
distributions:
# Global amount settings
amounts:
default:
type: lognormal_mixture
components:
- { weight: 0.6, mu: 6.0, sigma: 1.5 }
- { weight: 0.3, mu: 8.5, sigma: 1.0 }
- { weight: 0.1, mu: 11.0, sigma: 0.8 }
by_transaction_type:
payroll:
type: normal
mean: 4500
std_dev: 1500
truncate_min: 1000
utilities:
type: lognormal
mu: 7.0
sigma: 0.5
# Correlation settings
correlations:
enabled: true
model: gaussian_copula
pairs:
- fields: [amount, processing_days]
correlation: 0.45
- fields: [amount, approval_level]
correlation: 0.72
# Drift settings
drift:
enabled: true
inflation_rate: 0.03
regime_changes:
- date: "2024-06-01"
field: avg_transaction
multiplier: 1.15
# Validation
validation:
benford_compliance: true
distribution_tests: true
correlation_verification: true
Technical Implementation Notes
Performance Considerations
- Pre-computation: Calculate CDF tables for frequently-used distributions
- Vectorization: Use SIMD for batch sampling where possible
- Caching: Cache correlation matrix decompositions (Cholesky)
- Lazy evaluation: Defer complex distribution calculations until needed
Memory Efficiency
- Streaming: Generate correlated samples in batches
- Reference tables: Use compact lookup tables for standard distributions
- On-demand: Compute regime-adjusted parameters at sample time
See also: 03-temporal-patterns.md for time-based distributions
Research: Temporal Patterns and Distributions
Implementation Status: Core temporal patterns implemented in v0.3.0. See CLAUDE.md for configuration examples.
Implementation Summary (v0.3.0)
| Feature | Status | Location |
|---|---|---|
| Business day calculator | ✅ Implemented | datasynth-core/src/distributions/business_day.rs |
| Holiday calendars (11 regions) | ✅ Implemented | datasynth-core/src/distributions/holidays.rs |
| Period-end dynamics (decay curves) | ✅ Implemented | datasynth-core/src/distributions/period_end.rs |
| Processing lag modeling | ✅ Implemented | datasynth-core/src/distributions/processing_lag.rs |
| Timezone handling | ✅ Implemented | datasynth-core/src/distributions/timezone.rs |
| Fiscal calendar (custom, 4-4-5) | ✅ Implemented | Config: fiscal_calendar |
| Intraday segments | ✅ Implemented | Config: intraday |
| Settlement rules (T+N) | ✅ Implemented | business_day.rs |
| Half-day policies | ✅ Implemented | business_day.rs |
| Lunar calendars | 🔄 Planned | Approximate via fixed dates |
Current State Analysis
Existing Temporal Infrastructure
| Component | Lines | Functionality |
|---|---|---|
| TemporalSampler | 632 | Date/time sampling with seasonality |
| IndustrySeasonality | 538 | 10 industry profiles |
| HolidayCalendar | 852 | 6 regional calendars |
| DriftController | 373 | Gradual/sudden drift |
| FiscalPeriod | 849 | Period close mechanics |
| BiTemporal | 449 | Audit trail versioning |
Current Capabilities
- Period-end spikes: Month-end (2.5x), Quarter-end (4.0x), Year-end (6.0x)
- Day-of-week patterns: Monday catch-up (1.3x), Friday wind-down (0.85x)
- Holiday handling: 6 regions with ~15 holidays each
- Working hours: 8-18 business hours with peak weighting
- Industry seasonality: Black Friday, tax season, etc.
Current Gaps
- No business day calculation - T+1, T+2 settlement not supported
- No fiscal calendar alternatives - Only calendar year supported
- Limited regional coverage - Missing LATAM, more APAC
- No half-day handling - Early closes before holidays
- Static spike multipliers - No decay curves toward period-end
- No timezone awareness - All times in single timezone
- Missing lunar calendars - Chinese New Year and Diwali are only approximated via fixed dates
Improvement Recommendations
1. Business Day Calculations
1.1 Core Business Day Engine
pub struct BusinessDayCalculator {
calendar: HolidayCalendar,
weekend_days: HashSet<Weekday>,
half_day_handling: HalfDayPolicy,
}
pub enum HalfDayPolicy {
FullDay, // Count as full business day
HalfDay, // Count as 0.5 business days
NonBusinessDay, // Treat as holiday
}
impl BusinessDayCalculator {
/// Add N business days to a date
pub fn add_business_days(&self, date: NaiveDate, days: i32) -> NaiveDate;
/// Subtract N business days from a date
pub fn sub_business_days(&self, date: NaiveDate, days: i32) -> NaiveDate;
/// Count business days between two dates
pub fn business_days_between(&self, start: NaiveDate, end: NaiveDate) -> i32;
/// Get the next business day (inclusive or exclusive)
pub fn next_business_day(&self, date: NaiveDate, inclusive: bool) -> NaiveDate;
/// Get the previous business day
pub fn prev_business_day(&self, date: NaiveDate, inclusive: bool) -> NaiveDate;
/// Is this date a business day?
pub fn is_business_day(&self, date: NaiveDate) -> bool;
}
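A minimal add_business_days sketch using chrono, skipping weekends and an explicit holiday set. The production BusinessDayCalculator would additionally honor half-day policies and per-calendar weekend definitions; the holiday set and dates below are illustrative.

use chrono::{Datelike, NaiveDate, Weekday};
use std::collections::HashSet;

/// Advance by `days` business days, skipping weekends and listed holidays.
fn add_business_days(start: NaiveDate, days: u32, holidays: &HashSet<NaiveDate>) -> NaiveDate {
    let mut date = start;
    let mut remaining = days;
    while remaining > 0 {
        date = date.succ_opt().expect("date overflow");
        let weekend = matches!(date.weekday(), Weekday::Sat | Weekday::Sun);
        if !weekend && !holidays.contains(&date) {
            remaining -= 1;
        }
    }
    date
}

fn main() {
    let holidays: HashSet<NaiveDate> =
        [NaiveDate::from_ymd_opt(2026, 1, 1).unwrap()].into_iter().collect();
    // T+2 settlement from a trade on Tuesday 2025-12-30 skips New Year's Day.
    let trade = NaiveDate::from_ymd_opt(2025, 12, 30).unwrap();
    println!("settles {}", add_business_days(trade, 2, &holidays));
}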
1.2 Settlement Date Logic
settlement_rules:
enabled: true
conventions:
# Standard equity settlement
equity:
type: T_plus_N
days: 2
calendar: exchange
# Government bonds
government_bonds:
type: T_plus_N
days: 1
calendar: federal
# Corporate bonds
corporate_bonds:
type: T_plus_N
days: 2
calendar: combined
# FX spot
fx_spot:
type: T_plus_N
days: 2
calendar: both_currencies
# Wire transfers
wire_domestic:
type: same_day_or_next
cutoff_time: "14:00"
calendar: federal
# ACH
ach:
type: T_plus_N
days: 1-3
distribution: { 1: 0.6, 2: 0.3, 3: 0.1 }
1.3 Month-End Conventions
month_end_conventions:
# Modified Following
modified_following:
if_holiday: next_business_day
if_crosses_month: previous_business_day
# Preceding
preceding:
if_holiday: previous_business_day
# Following
following:
if_holiday: next_business_day
# End of Month
end_of_month:
if_start_is_eom: end_stays_eom
2. Expanded Regional Calendars
2.1 Additional Regions
Latin America:
calendars:
brazil:
holidays:
- name: "Carnival"
type: floating
rule: "easter - 47 days"
duration_days: 2
activity_multiplier: 0.05
- name: "Tiradentes Day"
type: fixed
month: 4
day: 21
- name: "Independence Day"
type: fixed
month: 9
day: 7
- name: "Republic Day"
type: fixed
month: 11
day: 15
mexico:
holidays:
- name: "Constitution Day"
type: floating
rule: "first monday of february"
- name: "Benito Juárez Birthday"
type: floating
rule: "third monday of march"
- name: "Labor Day"
type: fixed
month: 5
day: 1
- name: "Independence Day"
type: fixed
month: 9
day: 16
- name: "Revolution Day"
type: floating
rule: "third monday of november"
- name: "Day of the Dead"
type: fixed
month: 11
day: 2
activity_multiplier: 0.3
Asia-Pacific Expansion:
australia:
holidays:
- name: "Australia Day"
type: fixed
month: 1
day: 26
observance: "next_monday_if_weekend"
- name: "ANZAC Day"
type: fixed
month: 4
day: 25
- name: "Queen's Birthday"
type: floating
rule: "second monday of june"
regional_variation: true # Different dates by state
singapore:
holidays:
- name: "Chinese New Year"
type: lunar
duration_days: 2
- name: "Vesak Day"
type: lunar
- name: "Hari Raya Puasa"
type: islamic
rule: "end of ramadan"
- name: "Deepavali"
type: lunar
calendar: hindu
south_korea:
holidays:
- name: "Seollal"
type: lunar
calendar: korean
duration_days: 3
- name: "Chuseok"
type: lunar
calendar: korean
duration_days: 3
2.2 Lunar Calendar Implementation
/// Accurate lunar calendar calculations
pub struct LunarCalendar {
calendar_type: LunarCalendarType,
cache: HashMap<i32, Vec<LunarDate>>,
}
pub enum LunarCalendarType {
Chinese, // Chinese lunisolar
Islamic, // Hijri calendar
Hebrew, // Jewish calendar
Hindu, // Vikram Samvat
Korean, // Dangun calendar
}
impl LunarCalendar {
/// Convert Gregorian date to lunar date
pub fn to_lunar(&self, date: NaiveDate) -> LunarDate;
/// Convert lunar date to Gregorian
pub fn to_gregorian(&self, lunar: LunarDate) -> NaiveDate;
/// Get Chinese New Year date for a given Gregorian year
pub fn chinese_new_year(&self, year: i32) -> NaiveDate;
/// Get Ramadan start date for a given Gregorian year
pub fn ramadan_start(&self, year: i32) -> NaiveDate;
/// Get Diwali date (new moon in Kartik)
pub fn diwali(&self, year: i32) -> NaiveDate;
}
3. Period-End Dynamics
3.1 Decay Curves Instead of Static Multipliers
Replace flat multipliers with realistic acceleration curves:
period_end_dynamics:
enabled: true
month_end:
model: exponential_acceleration
parameters:
start_day: -10 # 10 days before month-end
base_multiplier: 1.0
peak_multiplier: 3.5
decay_rate: 0.3 # Exponential decay parameter
# Activity profile by days-to-close
daily_profile:
-10: 1.0
-7: 1.2
-5: 1.5
-3: 2.0
-2: 2.5
-1: 3.0 # Day before close
0: 3.5 # Close day
quarter_end:
model: stepped_exponential
inherit_from: month_end
additional_multiplier: 1.5
year_end:
model: extended_crunch
parameters:
start_day: -15
sustained_high_days: 5
peak_multiplier: 6.0
# Year-end specific activities
activities:
- type: "audit_adjustments"
days: [-3, -2, -1, 0]
multiplier: 2.0
- type: "tax_provisions"
days: [-5, -4, -3]
multiplier: 1.5
- type: "impairment_reviews"
days: [-10, -9, -8]
multiplier: 1.3
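A minimal sketch of the exponential-acceleration model: the multiplier rises from the base level ten days out to the peak on the close day itself. The formula and parameter values mirror the month_end block above and roughly reproduce its daily profile; the function name is illustrative.

/// Activity multiplier as a function of (negative) days to the period close.
fn period_end_multiplier(days_to_close: i32) -> f64 {
    let (start_day, base, peak, decay_rate) = (-10, 1.0, 3.5, 0.3_f64);
    if days_to_close < start_day {
        return base;
    }
    // At days_to_close = 0 this equals the peak; ten days out it is close to base.
    base + (peak - base) * (decay_rate * days_to_close as f64).exp()
}

fn main() {
    for d in [-10, -5, -3, -1, 0] {
        println!("{d:>3} days to close: x{:.2}", period_end_multiplier(d));
    }
}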
3.2 Intra-Day Patterns
intraday_patterns:
# Morning rush
morning_spike:
start: "08:30"
end: "10:00"
multiplier: 1.8
# Pre-lunch activity
late_morning:
start: "10:00"
end: "12:00"
multiplier: 1.2
# Lunch lull
lunch_dip:
start: "12:00"
end: "13:30"
multiplier: 0.4
# Afternoon steady
afternoon:
start: "13:30"
end: "16:00"
multiplier: 1.0
# End-of-day push
eod_rush:
start: "16:00"
end: "17:30"
multiplier: 1.5
# After hours (manual only)
after_hours:
start: "17:30"
end: "20:00"
multiplier: 0.15
type: manual_only
3.3 Time Zone Handling
timezones:
enabled: true
company_timezones:
default: "America/New_York"
by_entity:
- entity_pattern: "EU_*"
timezone: "Europe/London"
- entity_pattern: "DE_*"
timezone: "Europe/Berlin"
- entity_pattern: "APAC_*"
timezone: "Asia/Singapore"
- entity_pattern: "JP_*"
timezone: "Asia/Tokyo"
posting_behavior:
# Consolidation timing
consolidation:
coordinator_timezone: "America/New_York"
cutoff_time: "18:00"
# Intercompany coordination
intercompany:
settlement_timezone: "UTC"
matching_window_hours: 24
4. Fiscal Calendar Alternatives
4.1 Non-Calendar Year Support
fiscal_calendar:
type: custom
year_start:
month: 7
day: 1
# Fiscal year 2024 = July 1, 2024 - June 30, 2025
period_naming:
format: "FY{year}P{period:02}"
# FY2024P01 = July 2024
4.2 4-4-5 Calendar
fiscal_calendar:
type: 4-4-5
year_start:
anchor: first_sunday_of_february
# Or: last_saturday_of_january
periods:
Q1:
- weeks: 4
- weeks: 4
- weeks: 5
Q2:
- weeks: 4
- weeks: 4
- weeks: 5
Q3:
- weeks: 4
- weeks: 4
- weeks: 5
Q4:
- weeks: 4
- weeks: 4
- weeks: 5
# 53rd week handling (every 5-6 years)
leap_week:
occurrence: calculated
placement: Q4_P3 # Added to last period
4.3 13-Period Calendar
fiscal_calendar:
type: 13_period
weeks_per_period: 4
year_start:
anchor: first_monday_of_january
# 53rd week handling
extra_week_period: 13
5. Advanced Seasonality
5.1 Multi-Factor Seasonality
seasonality:
factors:
# Annual cycle
annual:
type: fourier
harmonics: 3
coefficients:
cos1: 0.15
sin1: 0.08
cos2: 0.05
sin2: 0.03
cos3: 0.02
sin3: 0.01
# Weekly cycle
weekly:
type: categorical
values:
monday: 1.25
tuesday: 1.10
wednesday: 1.00
thursday: 1.00
friday: 0.90
saturday: 0.15
sunday: 0.05
# Monthly cycle (within month)
monthly:
type: piecewise
segments:
- days: [1, 5]
multiplier: 1.3
label: "month_start"
- days: [6, 20]
multiplier: 0.9
label: "mid_month"
- days: [21, 31]
multiplier: 1.4
label: "month_end"
# Interaction effects
interactions:
- factors: [annual, weekly]
type: multiplicative
- factors: [monthly, weekly]
type: additive
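For the annual Fourier term, the factor at day-of-year t is 1 + Σ_k (cos_k·cos(2πkt/365) + sin_k·sin(2πkt/365)). A minimal sketch using the coefficients above; the weekly and monthly factors would then be combined per the interaction rules (function name is illustrative):
use std::f64::consts::TAU;
/// Illustrative sketch: annual seasonality factor from the three harmonics above.
fn annual_factor(day_of_year: u32) -> f64 {
    let coefficients = [(0.15, 0.08), (0.05, 0.03), (0.02, 0.01)]; // (cos_k, sin_k)
    let t = day_of_year as f64 / 365.0;
    1.0 + coefficients
        .iter()
        .enumerate()
        .map(|(i, (c, s))| {
            let k = (i + 1) as f64;
            c * (TAU * k * t).cos() + s * (TAU * k * t).sin()
        })
        .sum::<f64>()
}
fn main() {
    println!("Jan 15 factor: {:.3}", annual_factor(15));
    println!("Jul 15 factor: {:.3}", annual_factor(196));
}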
5.2 Weather-Driven Seasonality
For relevant industries:
weather_seasonality:
enabled: true
industries: [retail, utilities, agriculture, construction]
patterns:
temperature:
cold_threshold: 32 # Fahrenheit
hot_threshold: 85
effects:
cold:
utilities: 1.8
construction: 0.5
retail_outdoor: 0.3
hot:
utilities: 1.5
construction: 0.8
retail_outdoor: 1.3
precipitation:
effects:
rain:
construction: 0.6
retail_brick_mortar: 0.8
retail_online: 1.2
# Regional weather profiles
regional_profiles:
northeast_us:
winter_severity: high
summer_humidity: medium
southwest_us:
winter_severity: low
summer_heat: extreme
pacific_northwest:
precipitation_days: high
temperature_variance: low
6. Transaction Timing Realism
6.1 Processing Lag Modeling
processing_lags:
# Time between event and posting
event_to_posting:
distribution: lognormal
parameters:
sales_order:
mu: 0.5 # ~1.6 hours median
sigma: 0.8
goods_receipt:
mu: 1.5 # ~4.5 hours median
sigma: 0.5
invoice_receipt:
mu: 2.0 # ~7.4 hours median
sigma: 0.6
payment:
mu: 0.2 # ~1.2 hours median
sigma: 0.3
# Day-boundary crossing
cross_day_posting:
enabled: true
probability_by_hour:
"17:00": 0.7 # 70% post next day if after 5pm
"19:00": 0.9
"21:00": 0.99
# Batch processing delays
batch_delays:
enabled: true
schedules:
nightly_batch:
run_time: "02:00"
affects: [bank_transactions, interfaces]
hourly_sync:
interval_minutes: 60
affects: [inventory_movements]
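The inline comments above follow from the lognormal identities median = e^μ and mean = e^(μ + σ²/2), assuming μ and σ are expressed in log-hours; a quick sanity check (helper names are illustrative):
/// Illustrative check of the lognormal lag comments above.
fn lognormal_median_hours(mu: f64) -> f64 {
    mu.exp()
}
fn lognormal_mean_hours(mu: f64, sigma: f64) -> f64 {
    (mu + sigma * sigma / 2.0).exp()
}
fn main() {
    for (name, mu, sigma) in [
        ("sales_order", 0.5, 0.8),
        ("goods_receipt", 1.5, 0.5),
        ("invoice_receipt", 2.0, 0.6),
        ("payment", 0.2, 0.3),
    ] {
        println!(
            "{:<16} median {:.1}h  mean {:.1}h",
            name,
            lognormal_median_hours(mu),
            lognormal_mean_hours(mu, sigma)
        );
    }
}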
6.2 Human vs. System Posting Patterns
posting_patterns:
human:
# Working hours focus
primary_hours: [9, 10, 11, 14, 15, 16]
probability: 0.8
# Occasional overtime
extended_hours: [8, 17, 18, 19]
probability: 0.15
# Rare late night
late_hours: [20, 21, 22]
probability: 0.05
# Keystroke timing (for detailed simulation)
entry_duration:
simple_je:
mean_seconds: 45
std_seconds: 15
complex_je:
mean_seconds: 180
std_seconds: 60
system:
# Interface postings
interface:
typical_times: ["01:00", "05:00", "13:00"]
duration_minutes: 15-45
burst_rate: 100-500 # Records per minute
# Automated recurring
recurring:
time: "00:30"
day: first_business_day
# Real-time integrations
realtime:
latency_ms: 100-500
batch_size: 1
7. Period Close Orchestration
7.1 Close Calendar Generation
close_calendar:
enabled: true
# Standard close schedule
monthly:
soft_close:
day: 2 # 2nd business day
activities: [preliminary_review, initial_accruals]
hard_close:
day: 5 # 5th business day
activities: [final_adjustments, lock_period]
reporting:
day: 7 # 7th business day
activities: [management_reports, variance_analysis]
quarterly:
extended_close:
additional_days: 3
activities:
- quarter_end_reserves
- intercompany_reconciliation
- consolidation
annual:
extended_close:
additional_days: 10
activities:
- audit_adjustments
- tax_provisions
- impairment_testing
- goodwill_analysis
- segment_reporting
7.2 Late Posting Behavior
late_postings:
enabled: true
# Probability of late posting by days after close
probability_curve:
day_1: 0.08 # 8% of transactions post 1 day late
day_2: 0.03
day_3: 0.01
day_4: 0.005
day_5+: 0.002
# Characteristics of late postings
characteristics:
# More likely to be corrections
correction_probability: 0.4
# Higher average amount
amount_multiplier: 1.5
# Require special approval
approval_required: true
# Must reference original period
period_reference: required
8. Implementation Priority
| Enhancement | Complexity | Impact | Priority | Status |
|---|---|---|---|---|
| Business day calculator | Medium | Critical | P1 | ✅ v0.3.0 |
| Additional regional calendars | Medium | High | P1 | ✅ v0.3.0 (11 regions) |
| Decay curves for period-end | Low | High | P1 | ✅ v0.3.0 |
| Non-calendar fiscal years | Medium | Medium | P2 | ✅ v0.3.0 |
| 4-4-5 calendar support | High | Medium | P2 | ✅ v0.3.0 |
| Timezone handling | Medium | Medium | P2 | ✅ v0.3.0 |
| Lunar calendar accuracy | High | Medium | P3 | 🔄 Planned |
| Weather seasonality | Medium | Low | P3 | 🔄 Planned |
| Intra-day patterns | Low | Medium | P2 | ✅ v0.3.0 |
| Processing lag modeling | Medium | High | P1 | ✅ v0.3.0 |
9. Validation Metrics
temporal_validation:
metrics:
# Period-end spike ratios
period_end_spikes:
month_end_ratio:
expected: 2.0-3.0
tolerance: 0.5
quarter_end_ratio:
expected: 3.5-4.5
tolerance: 0.5
year_end_ratio:
expected: 5.0-7.0
tolerance: 1.0
# Day-of-week distribution
dow_distribution:
test: chi_squared
expected_weights: [1.3, 1.1, 1.0, 1.0, 0.85, 0.1, 0.05]
significance: 0.05
# Holiday compliance
holiday_activity:
max_activity_on_holiday: 0.1
allow_exceptions: ["bank_settlement"]
# Business hours
business_hours:
human_transactions:
in_hours_rate: 0.85-0.95
system_transactions:
off_hours_allowed: true
# Late posting rate
late_postings:
max_rate: 0.15
concentration_test: true # Should not cluster
See also: 04-interconnectivity.md for relationship modeling
Research: Interconnectivity and Relationship Modeling
Implementation Status: P1 features implemented in v0.3.0. See Interconnectivity Documentation for usage.
Implementation Summary (v0.3.0)
| Feature | Status | Location |
|---|---|---|
| Multi-tier vendor networks | ✅ Implemented | datasynth-core/src/models/vendor_network.rs |
| Vendor clusters & lifecycle | ✅ Implemented | datasynth-core/src/models/vendor_network.rs |
| Customer value segmentation | ✅ Implemented | datasynth-core/src/models/customer_segment.rs |
| Customer lifecycle stages | ✅ Implemented | datasynth-core/src/models/customer_segment.rs |
| Relationship strength modeling | ✅ Implemented | datasynth-core/src/models/relationship.rs |
| Entity graph (16 types, 26 relations) | ✅ Implemented | datasynth-core/src/models/relationship.rs |
| Cross-process links (P2P↔O2C) | ✅ Implemented | datasynth-generators/src/relationships/ |
| Network evaluation metrics | ✅ Implemented | datasynth-eval/src/coherence/network.rs |
| Configuration & validation | ✅ Implemented | datasynth-config/src/schema.rs, validation.rs |
| Organizational hierarchy depth | 🔄 P2 - Planned | - |
| Network effect modeling | 🔄 P2 - Planned | - |
| Community detection | 🔄 P3 - Planned | - |
Current State Analysis
Existing Relationship Infrastructure
| Relationship Type | Implementation | Depth |
|---|---|---|
| Document Chains | DocumentChainManager | Strong |
| Three-Way Match | ThreeWayMatcher | Strong |
| Intercompany | ICMatchingEngine | Strong |
| GL Balance Links | Account hierarchies | Medium |
| Vendor-Customer | Basic master data | Weak |
| Employee-Approval | Approval chains | Medium |
| Entity Registry | EntityRegistry | Medium |
Current Strengths
- Document flow integrity: PO → GR → Invoice → Payment chains maintained
- Intercompany matching: Automatic generation of offsetting entries
- Balance coherence: Trial balance validation, A=L+E enforcement
- Graph export: PyTorch Geometric, Neo4j, DGL formats supported
- COSO control mapping: Controls linked to processes and risks
Current Gaps
- Shallow vendor networks: No supplier-of-supplier modeling
- Limited customer relationships: No customer segmentation
- No organizational hierarchy depth: Flat cost center structures
- Missing behavioral clustering: Entities don’t cluster by behavior
- No network effects: Relationships don’t influence behavior
- Static relationships: No relationship lifecycle modeling
Improvement Recommendations
1. Deep Vendor Network Modeling
1.1 Multi-Tier Supply Chain
vendor_network:
enabled: true
depth: 3 # Tier-1, Tier-2, Tier-3 suppliers
tiers:
tier_1:
count: 50-100
relationship: direct_supplier
visibility: full
transaction_volume: high
tier_2:
count: 200-500
relationship: supplier_of_supplier
visibility: partial
transaction_volume: medium
# Only visible through Tier-1 transactions
tier_3:
count: 500-2000
relationship: indirect
visibility: minimal
transaction_volume: low
# Dependency modeling
dependencies:
concentration:
max_single_vendor: 0.15 # No vendor > 15% of spend
top_5_vendors: 0.45 # Top 5 < 45% of spend
critical_materials:
single_source: 0.05 # 5% of materials are single-source
dual_source: 0.15
multi_source: 0.80
substitutability:
easy: 0.60
moderate: 0.30
difficult: 0.10
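A sketch of how the concentration constraints could be validated against a generated per-vendor spend vector (the helper below is hypothetical):
/// Illustrative sketch: check the spend-concentration constraints above.
fn concentration_ok(spend: &[f64], max_single_share: f64, max_top5_share: f64) -> bool {
    let total: f64 = spend.iter().sum();
    if total <= 0.0 {
        return true;
    }
    let mut shares: Vec<f64> = spend.iter().map(|s| s / total).collect();
    // Sort descending so the largest vendors come first.
    shares.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let top1 = shares[0];
    let top5: f64 = shares.iter().take(5).sum();
    top1 <= max_single_share && top5 <= max_top5_share
}
fn main() {
    // Total spend 1,000: largest vendor 12%, top five exactly 45%.
    let spend = [120.0, 100.0, 90.0, 80.0, 60.0, 55.0, 55.0, 55.0, 55.0,
                 55.0, 55.0, 55.0, 55.0, 55.0, 55.0];
    println!("constraints satisfied: {}", concentration_ok(&spend, 0.15, 0.45));
}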
1.2 Vendor Relationship Attributes
pub struct VendorRelationship {
vendor_id: VendorId,
relationship_type: VendorRelationshipType,
start_date: NaiveDate,
end_date: Option<NaiveDate>,
// Relationship strength
strategic_importance: StrategicLevel, // Critical, Important, Standard, Transactional
spend_tier: SpendTier, // Platinum, Gold, Silver, Bronze
// Behavioral attributes
payment_history: PaymentBehavior,
dispute_frequency: DisputeLevel,
quality_score: f64,
// Contract terms
contracted_rates: Vec<ContractedRate>,
rebate_agreements: Vec<RebateAgreement>,
payment_terms: PaymentTerms,
// Network position
tier: SupplyChainTier,
parent_vendor: Option<VendorId>,
child_vendors: Vec<VendorId>,
}
pub enum VendorRelationshipType {
DirectSupplier,
ServiceProvider,
Contractor,
Distributor,
Manufacturer,
RawMaterialSupplier,
OEMPartner,
Affiliate,
}
1.3 Vendor Behavior Clustering
vendor_clusters:
enabled: true
clusters:
reliable_strategic:
size: 0.20
characteristics:
payment_terms: [30, 45, 60]
on_time_delivery: 0.95-1.0
quality_issues: rare
price_stability: high
transaction_frequency: weekly+
standard_operational:
size: 0.50
characteristics:
payment_terms: [30]
on_time_delivery: 0.85-0.95
quality_issues: occasional
price_stability: medium
transaction_frequency: monthly
transactional:
size: 0.25
characteristics:
payment_terms: [0, 15]
on_time_delivery: 0.75-0.90
quality_issues: moderate
price_stability: low
transaction_frequency: quarterly
problematic:
size: 0.05
characteristics:
payment_terms: [0] # COD only
on_time_delivery: 0.50-0.80
quality_issues: frequent
price_stability: volatile
transaction_frequency: declining
2. Customer Relationship Depth
2.1 Customer Segmentation
customer_segmentation:
enabled: true
dimensions:
value:
- segment: enterprise
revenue_share: 0.40
customer_share: 0.05
characteristics:
avg_order_value: 50000+
order_frequency: weekly
payment_behavior: terms
churn_risk: low
- segment: mid_market
revenue_share: 0.35
customer_share: 0.20
characteristics:
avg_order_value: 5000-50000
order_frequency: monthly
payment_behavior: mixed
churn_risk: medium
- segment: smb
revenue_share: 0.20
customer_share: 0.50
characteristics:
avg_order_value: 500-5000
order_frequency: quarterly
payment_behavior: prepay
churn_risk: high
- segment: consumer
revenue_share: 0.05
customer_share: 0.25
characteristics:
avg_order_value: 50-500
order_frequency: occasional
payment_behavior: immediate
churn_risk: very_high
lifecycle:
- stage: prospect
conversion_rate: 0.15
avg_duration_days: 30
- stage: new
definition: "<90 days"
behavior: exploring
support_intensity: high
- stage: growth
definition: "90-365 days"
behavior: expanding
upsell_opportunity: high
- stage: mature
definition: ">365 days"
behavior: stable
retention_focus: true
- stage: at_risk
triggers: [declining_orders, late_payments, complaints]
intervention: required
- stage: churned
definition: "no activity >180 days"
win_back_probability: 0.10
2.2 Customer Network Effects
customer_networks:
enabled: true
# Referral relationships
referrals:
enabled: true
referral_rate: 0.15
referred_customer_value_multiplier: 1.2
max_referral_chain: 3
# Parent-child relationships (corporate structures)
corporate_hierarchies:
enabled: true
probability: 0.30
hierarchy_depth: 3
billing_consolidation: true
# Industry clustering
industry_affinity:
enabled: true
same_industry_cluster_probability: 0.40
industry_trend_correlation: 0.70
3. Organizational Hierarchy Modeling
3.1 Deep Cost Center Structure
organizational_structure:
depth: 5
levels:
- level: 1
name: division
count: 3-5
examples: ["North America", "EMEA", "APAC"]
- level: 2
name: business_unit
count_per_parent: 2-4
examples: ["Commercial", "Consumer", "Industrial"]
- level: 3
name: department
count_per_parent: 3-6
examples: ["Sales", "Marketing", "Operations", "Finance"]
- level: 4
name: function
count_per_parent: 2-5
examples: ["Inside Sales", "Field Sales", "Sales Ops"]
- level: 5
name: team
count_per_parent: 2-4
examples: ["Team Alpha", "Team Beta"]
# Cross-cutting structures
matrix_relationships:
enabled: true
types:
- primary: division
secondary: function
# e.g., "EMEA Sales" reports to both EMEA Head and Global Sales VP
# Shared services
shared_services:
enabled: true
centers:
- name: "Corporate Finance"
serves: all_divisions
allocation_method: headcount
- name: "IT Infrastructure"
serves: all_divisions
allocation_method: usage
- name: "HR Services"
serves: all_divisions
allocation_method: headcount
3.2 Approval Hierarchy
approval_hierarchy:
enabled: true
# Spending authority matrix
authority_matrix:
manager:
limit: 5000
exception_rate: 0.02
senior_manager:
limit: 25000
exception_rate: 0.01
director:
limit: 100000
exception_rate: 0.005
vp:
limit: 500000
exception_rate: 0.002
c_level:
limit: unlimited
exception_rate: 0.001
# Approval chains
chain_rules:
sequential:
enabled: true
for: [capital_expenditure, contracts]
parallel:
enabled: true
for: [operational_expenses]
minimum_approvals: 2
skip_level:
enabled: true
probability: 0.05
audit_flag: true
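A sketch of routing an amount through the authority matrix to the lowest role whose limit covers it (the lookup function is illustrative; role names and limits mirror the configuration):
/// Illustrative sketch: find the required approver for an amount.
fn required_approver(amount: f64) -> &'static str {
    // (role, limit); c_level is treated as unlimited.
    const MATRIX: &[(&str, f64)] = &[
        ("manager", 5_000.0),
        ("senior_manager", 25_000.0),
        ("director", 100_000.0),
        ("vp", 500_000.0),
    ];
    MATRIX
        .iter()
        .find(|(_, limit)| amount <= *limit)
        .map(|(role, _)| *role)
        .unwrap_or("c_level")
}
fn main() {
    assert_eq!(required_approver(4_200.0), "manager");
    assert_eq!(required_approver(60_000.0), "director");
    assert_eq!(required_approver(2_000_000.0), "c_level");
}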
4. Entity Relationship Graph
4.1 Comprehensive Relationship Model
/// Unified entity relationship graph
pub struct EntityGraph {
nodes: HashMap<EntityId, EntityNode>,
edges: Vec<RelationshipEdge>,
indexes: GraphIndexes,
}
pub struct EntityNode {
id: EntityId,
entity_type: EntityType,
attributes: EntityAttributes,
created_at: DateTime<Utc>,
last_activity: DateTime<Utc>,
}
pub enum EntityType {
Company,
Vendor,
Customer,
Employee,
Department,
CostCenter,
Project,
Contract,
Asset,
BankAccount,
}
pub struct RelationshipEdge {
from_id: EntityId,
to_id: EntityId,
relationship_type: RelationshipType,
strength: f64, // 0.0 - 1.0
start_date: NaiveDate,
end_date: Option<NaiveDate>,
attributes: RelationshipAttributes,
}
pub enum RelationshipType {
// Transactional
BuysFrom,
SellsTo,
PaysTo,
ReceivesFrom,
// Organizational
ReportsTo,
Manages,
BelongsTo,
OwnedBy,
// Contractual
ContractedWith,
GuaranteedBy,
InsuredBy,
// Financial
LendsTo,
BorrowsFrom,
InvestsIn,
// Network
ReferredBy,
PartnersWith,
CompetesWith,
}
4.2 Relationship Strength Modeling
relationship_strength:
calculation:
type: composite
factors:
transaction_volume:
weight: 0.30
normalization: log_scale
transaction_count:
weight: 0.25
normalization: sqrt_scale
relationship_duration:
weight: 0.20
decay: none
recency:
weight: 0.15
decay: exponential
half_life_days: 90
mutual_connections:
weight: 0.10
normalization: jaccard_similarity
thresholds:
strong: 0.7
moderate: 0.4
weak: 0.1
dormant: 0.0
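A sketch of the composite score, assuming the volume, count, duration, and mutual-connection inputs are already normalized to [0, 1] and recency decays with the 90-day half-life above (struct and function names are hypothetical):
/// Illustrative sketch of the weighted composite relationship strength.
struct StrengthInputs {
    volume_norm: f64,     // log-scaled transaction volume, 0..1
    count_norm: f64,      // sqrt-scaled transaction count, 0..1
    duration_norm: f64,   // relationship duration, 0..1
    days_since_last: f64, // recency input, in days
    jaccard_mutual: f64,  // mutual-connection similarity, 0..1
}
fn relationship_strength(x: &StrengthInputs) -> f64 {
    let recency = 0.5_f64.powf(x.days_since_last / 90.0); // half-life decay
    0.30 * x.volume_norm
        + 0.25 * x.count_norm
        + 0.20 * x.duration_norm
        + 0.15 * recency
        + 0.10 * x.jaccard_mutual
}
fn classify(strength: f64) -> &'static str {
    match strength {
        s if s >= 0.7 => "strong",
        s if s >= 0.4 => "moderate",
        s if s >= 0.1 => "weak",
        _ => "dormant",
    }
}
fn main() {
    let x = StrengthInputs {
        volume_norm: 0.8,
        count_norm: 0.7,
        duration_norm: 0.6,
        days_since_last: 30.0,
        jaccard_mutual: 0.4,
    };
    let s = relationship_strength(&x);
    println!("strength {:.2} -> {}", s, classify(s));
}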
5. Transaction Chain Integrity
5.1 Extended Document Chains
document_chains:
# P2P extended chain
procure_to_pay:
stages:
- type: purchase_requisition
optional: true
approval_required: conditional # >$1000
- type: purchase_order
required: true
generates: commitment
- type: goods_receipt
required: conditional # For goods, not services
updates: inventory
tolerance: 0.05 # 5% over-receipt allowed
- type: vendor_invoice
required: true
matching: three_way # PO, GR, Invoice
tolerance: 0.02
- type: payment
required: true
methods: [ach, wire, check, virtual_card]
generates: bank_transaction
# Chain integrity rules
integrity:
sequence_enforcement: strict
backdating_allowed: false
amount_cascade: true # Amounts must flow through
# O2C extended chain
order_to_cash:
stages:
- type: quote
optional: true
validity_days: 30
- type: sales_order
required: true
credit_check: conditional
- type: pick_list
required: conditional
triggers: inventory_reservation
- type: delivery
required: conditional
updates: inventory
generates: shipping_document
- type: customer_invoice
required: true
triggers: revenue_recognition
- type: customer_receipt
required: true
applies_to: invoices
generates: bank_transaction
integrity:
partial_shipment: allowed
partial_payment: allowed
credit_memo: allowed
5.2 Cross-Process Linkages
cross_process_links:
enabled: true
links:
# Inventory connects P2P and O2C
- source_process: procure_to_pay
source_stage: goods_receipt
target_process: order_to_cash
target_stage: pick_list
through: inventory
# Returns create reverse flows
- source_process: order_to_cash
source_stage: delivery
target_process: returns
target_stage: return_receipt
condition: quality_issue
# Payments connect to bank reconciliation
- source_process: procure_to_pay
source_stage: payment
target_process: bank_reconciliation
target_stage: bank_statement_line
matching: automatic
# Intercompany bilateral links
- source_process: intercompany_sale
source_stage: ic_invoice
target_process: intercompany_purchase
target_stage: ic_invoice
matching: elimination_required
6. Network Effect Modeling
6.1 Behavioral Influence
network_effects:
enabled: true
influence_types:
# Transaction patterns spread through network
transaction_contagion:
enabled: true
effect: "similar vendors show similar payment patterns"
correlation: 0.40
lag_days: 30
# Risk propagation
risk_propagation:
enabled: true
effect: "vendor issues affect connected vendors"
propagation_depth: 2
decay_per_hop: 0.50
# Seasonal correlation
seasonal_sync:
enabled: true
effect: "connected entities show correlated seasonality"
correlation: 0.60
# Price correlation
price_linkage:
enabled: true
effect: "commodity price changes propagate"
propagation_speed: immediate
pass_through_rate: 0.80
6.2 Community Detection
community_detection:
enabled: true
algorithms:
- type: louvain
resolution: 1.0
output: vendor_communities
- type: label_propagation
output: customer_segments
- type: girvan_newman
output: department_clusters
use_cases:
# Fraud detection
fraud_rings:
algorithm: connected_components
edge_filter: suspicious_transactions
min_size: 3
# Vendor consolidation
vendor_overlap:
algorithm: jaccard_similarity
threshold: 0.70
output: consolidation_candidates
# Customer segmentation
behavioral_clusters:
algorithm: spectral
features: [purchase_pattern, payment_behavior, product_mix]
7. Relationship Lifecycle
7.1 Lifecycle Stages
relationship_lifecycle:
enabled: true
vendor_lifecycle:
stages:
onboarding:
duration_days: 30-90
activities: [due_diligence, contract_negotiation, system_setup]
transaction_volume: limited
ramp_up:
duration_days: 90-180
activities: [volume_increase, performance_monitoring]
transaction_volume: growing
steady_state:
duration_days: ongoing
activities: [regular_transactions, periodic_review]
transaction_volume: stable
decline:
triggers: [quality_issues, price_competitiveness, strategic_shift]
activities: [reduced_orders, alternative_sourcing]
transaction_volume: decreasing
termination:
triggers: [contract_end, performance_failure, strategic_decision]
activities: [final_settlement, transition]
transaction_volume: zero
transitions:
probability_matrix:
onboarding:
ramp_up: 0.80
termination: 0.20
ramp_up:
steady_state: 0.85
decline: 0.10
termination: 0.05
steady_state:
steady_state: 0.90
decline: 0.08
termination: 0.02
decline:
steady_state: 0.20
decline: 0.50
termination: 0.30
customer_lifecycle:
# Similar structure for customer relationships
stages:
prospect: { conversion_rate: 0.15 }
new: { retention_rate: 0.70 }
active: { retention_rate: 0.90 }
at_risk: { save_rate: 0.50 }
churned: { win_back_rate: 0.10 }
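The stage transitions amount to a Markov chain; a sketch that samples the next vendor stage from the probability matrix above, with the uniform draw supplied by the caller so the example stays RNG-free (function name and the fallback row are assumptions):
/// Illustrative sketch: sample the next vendor lifecycle stage.
fn next_stage(current: &str, u: f64) -> &'static str {
    // Rows mirror the probability_matrix above; unknown stages terminate.
    let row: &[(&'static str, f64)] = match current {
        "onboarding" => &[("ramp_up", 0.80), ("termination", 0.20)],
        "ramp_up" => &[("steady_state", 0.85), ("decline", 0.10), ("termination", 0.05)],
        "steady_state" => &[("steady_state", 0.90), ("decline", 0.08), ("termination", 0.02)],
        "decline" => &[("steady_state", 0.20), ("decline", 0.50), ("termination", 0.30)],
        _ => &[("termination", 1.0)],
    };
    let mut cumulative = 0.0;
    for &(stage, p) in row {
        cumulative += p;
        if u < cumulative {
            return stage;
        }
    }
    "termination" // guards against rounding at the tail
}
fn main() {
    assert_eq!(next_stage("onboarding", 0.50), "ramp_up");
    assert_eq!(next_stage("steady_state", 0.95), "decline");
    assert_eq!(next_stage("decline", 0.99), "termination");
}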
8. Graph Export Enhancements
8.1 Enhanced PyTorch Geometric Export
graph_export:
pytorch_geometric:
enabled: true
node_features:
# Node type encoding
type_encoding: one_hot
# Numeric features
numeric:
- field: transaction_volume
normalization: log_scale
- field: relationship_duration_days
normalization: min_max
- field: average_amount
normalization: z_score
# Categorical features
categorical:
- field: industry
encoding: label
- field: region
encoding: one_hot
- field: segment
encoding: embedding
edge_features:
- field: relationship_strength
normalization: none
- field: transaction_count
normalization: log_scale
- field: last_transaction_days_ago
normalization: min_max
# Temporal graphs
temporal:
enabled: true
snapshot_frequency: monthly
edge_weight_decay: exponential
half_life_days: 90
# Heterogeneous graph support
heterogeneous:
enabled: true
node_types: [company, vendor, customer, employee, account]
edge_types: [buys_from, sells_to, reports_to, pays_to]
8.2 Enhanced Neo4j Export
neo4j_export:
enabled: true
# Node labels
node_labels:
- label: Company
properties: [code, name, currency, country]
- label: Vendor
properties: [id, name, category, rating]
- label: Customer
properties: [id, name, segment, region]
- label: Transaction
properties: [id, amount, date, type]
# Relationship types
relationships:
- type: TRANSACTS_WITH
properties: [volume, count, first_date, last_date]
- type: BELONGS_TO
properties: [start_date, role]
- type: SUPPLIES
properties: [material_type, contract_id]
# Indexes for query optimization
indexes:
- label: Transaction
property: date
type: range
- label: Vendor
property: id
type: unique
- label: Customer
property: segment
type: lookup
# Full-text search
fulltext:
- name: entity_search
labels: [Vendor, Customer]
properties: [name, description]
9. Implementation Priority
| Enhancement | Complexity | Impact | Priority | Status |
|---|---|---|---|---|
| Vendor network depth | High | High | P1 | ✅ v0.3.0 |
| Customer segmentation | Medium | High | P1 | ✅ v0.3.0 |
| Organizational hierarchy | Medium | Medium | P2 | 🔄 Planned |
| Relationship strength modeling | Medium | High | P1 | ✅ v0.3.0 |
| Cross-process linkages | Medium | High | P1 | ✅ v0.3.0 |
| Network effect modeling | High | Medium | P2 | 🔄 Planned |
| Relationship lifecycle | Medium | Medium | P2 | ✅ v0.3.0 |
| Community detection | High | Medium | P3 | 🔄 Planned |
| Enhanced graph export | Low | High | P1 | 🔄 Partial |
10. Validation Framework
relationship_validation:
integrity_checks:
# All transactions have valid entity references
referential_integrity:
enabled: true
strict: true
# Document chains are complete
chain_completeness:
enabled: true
allow_partial: false
exception_rate: 0.02
# Intercompany entries balance
intercompany_balance:
enabled: true
tolerance: 0.01
network_metrics:
# Graph connectivity
connectivity:
check_strongly_connected: false
check_weakly_connected: true
max_isolated_nodes: 0.05
# Degree distribution
degree_distribution:
check_power_law: true
min_alpha: 1.5
max_alpha: 3.0
# Clustering coefficient
clustering:
min_coefficient: 0.1
max_coefficient: 0.5
See also: 05-pattern-drift.md for temporal evolution of patterns
Research: Pattern and Process Drift Over Time
Implementation Status: COMPLETE (v0.3.0)
This research document has been fully implemented. See the following modules:
- datasynth-core/src/models/organizational_event.rs - Organizational events
- datasynth-core/src/models/process_evolution.rs - Process evolution types
- datasynth-core/src/models/technology_transition.rs - Technology transitions
- datasynth-core/src/models/regulatory_events.rs - Regulatory changes
- datasynth-core/src/models/drift_events.rs - Ground truth labels
- datasynth-core/src/distributions/behavioral_drift.rs - Behavioral drift
- datasynth-core/src/distributions/market_drift.rs - Market/economic drift
- datasynth-core/src/distributions/event_timeline.rs - Event orchestration
- datasynth-core/src/distributions/drift_recorder.rs - Ground truth recording
- datasynth-eval/src/statistical/drift_detection.rs - Drift detection evaluation
- datasynth-config/src/schema.rs - Configuration types
Current State Analysis
Existing Drift Implementation
The current DriftController (373 lines) supports:
| Drift Type | Implementation | Realism |
|---|---|---|
| Gradual | Linear parameter drift | Medium |
| Sudden | Point-in-time shifts | Medium |
| Recurring | Seasonal patterns | Good |
| Mixed | Combination modes | Medium |
Current Capabilities
- Amount drift: Mean and variance adjustments over time
- Anomaly rate drift: Changing fraud/error rates
- Concept drift factor: Generic drift indicator
- Seasonal adjustment: Periodic recurring patterns
- Sudden drift probability: Random regime changes
Current Gaps
- No organizational events: Mergers, restructuring not modeled
- No process evolution: Static business processes
- No regulatory changes: Compliance requirements don’t evolve
- No technology transitions: System changes not simulated
- No behavioral drift: Entity behaviors remain static
- No market-driven drift: External factors not modeled
- Limited drift detection signals: Hard to validate drift presence
Improvement Recommendations
1. Organizational Event Modeling
1.1 Corporate Event Timeline
organizational_events:
enabled: true
events:
# Mergers and Acquisitions
- type: acquisition
date: "2024-06-15"
acquired_entity: "TargetCorp"
effects:
- entity_count_increase: 1.35
- vendor_count_increase: 1.25
- customer_overlap: 0.15
- integration_period_months: 12
- synergy_realization:
start_month: 6
full_realization_month: 18
cost_reduction: 0.08
# Divestiture
- type: divestiture
date: "2024-09-01"
divested_entity: "NonCoreBusiness"
effects:
- revenue_reduction: 0.12
- entity_count_reduction: 0.10
- vendor_transition_period: 6
# Reorganization
- type: reorganization
date: "2024-04-01"
reorg_type: functional_to_regional
effects:
- cost_center_restructure: true
- approval_chain_changes: true
- reporting_line_changes: true
- transition_period_months: 3
- temporary_confusion_factor: 1.15
# Leadership Change
- type: leadership_change
date: "2024-07-01"
position: CFO
effects:
- policy_changes_probability: 0.40
- approval_threshold_review: true
- vendor_review_trigger: true
- audit_focus_shift: possible
# Layoffs
- type: workforce_reduction
date: "2024-11-01"
reduction_percent: 0.10
effects:
- employee_count_reduction: 0.10
- workload_redistribution: true
- approval_delays: 1.20
- error_rate_increase: 1.15
- duration_months: 6
1.2 Integration Pattern Modeling
pub struct IntegrationSimulator {
phases: Vec<IntegrationPhase>,
current_phase: usize,
}
pub struct IntegrationPhase {
name: String,
start_month: u32,
end_month: u32,
effects: IntegrationEffects,
}
pub struct IntegrationEffects {
// Duplicate transactions during transition
duplicate_probability: f64,
// Coding errors during chart migration
miscoding_rate: f64,
// Legacy system parallel run
parallel_posting: bool,
// Vendor/customer migration errors
master_data_errors: f64,
// Timing differences
posting_delay_multiplier: f64,
}
1.3 Merger Accounting Patterns
merger_accounting:
enabled: true
day_1_entries:
- type: fair_value_adjustment
accounts: [inventory, fixed_assets, intangibles]
adjustment_range: [-0.20, 0.30]
- type: goodwill_recognition
calculation: "purchase_price - fair_value_net_assets"
- type: liability_assumption
includes: [accounts_payable, debt, contingencies]
post_merger:
# Integration costs
integration_expenses:
monthly_range: [100000, 500000]
duration_months: 12-18
categories: [consulting, severance, systems, legal]
# Synergy realization
synergies:
start_month: 6
ramp_up_months: 12
categories:
- type: headcount_reduction
target: 0.05
- type: vendor_consolidation
target: 0.10
- type: facility_optimization
target: 0.03
# Restructuring reserves
restructuring:
initial_reserve: 5000000
utilization_pattern: front_loaded
true_up_probability: 0.30
2. Process Evolution Modeling
2.1 Business Process Changes
process_evolution:
enabled: true
changes:
# New approval workflow
- type: approval_workflow_change
date: "2024-03-01"
from: sequential
to: parallel
effects:
- approval_time_reduction: 0.40
- same_day_approval_increase: 0.25
- skip_approval_detection: improved
# Automation introduction
- type: process_automation
date: "2024-05-01"
process: invoice_matching
effects:
- manual_matching_reduction: 0.70
- matching_accuracy_improvement: 0.15
- exception_visibility_increase: true
- posting_timing: more_consistent
# Policy change
- type: policy_change
date: "2024-08-01"
policy: expense_approval_limits
changes:
- manager_limit: 5000 -> 7500
- director_limit: 25000 -> 35000
effects:
- approval_escalation_reduction: 0.20
- processing_time_reduction: 0.15
# Control enhancement
- type: control_enhancement
date: "2024-10-01"
control: three_way_match
changes:
- tolerance: 0.05 -> 0.02
- mandatory_for: all_po_invoices
effects:
- exception_rate_increase: 0.15
- fraud_detection_improvement: 0.25
2.2 Technology Transition Patterns
technology_transitions:
enabled: true
transitions:
# ERP migration
- type: erp_migration
phases:
- name: parallel_run
start: "2024-06-01"
duration_months: 3
effects:
- duplicate_entries: true
- reconciliation_required: true
- posting_delays: 1.30
- name: cutover
date: "2024-09-01"
effects:
- legacy_system: read_only
- new_system: live
- catch_up_period: 5_days
- name: stabilization
start: "2024-09-01"
duration_months: 3
effects:
- error_rate_multiplier: 1.25
- support_ticket_increase: 3.0
- workaround_transactions: 0.10
# Module implementation
- type: module_implementation
module: advanced_analytics
go_live: "2024-04-15"
effects:
- new_transaction_types: [analytical_adjustment]
- automated_entries_increase: 0.20
# Integration change
- type: integration_upgrade
system: bank_interface
date: "2024-07-01"
effects:
- real_time_enabled: true
- batch_processing: deprecated
- posting_frequency: continuous
3. Regulatory and Compliance Drift
3.1 Regulatory Changes
regulatory_changes:
enabled: true
changes:
# New accounting standard
- type: accounting_standard_adoption
standard: ASC_842 # Leases
effective_date: "2024-01-01"
effects:
- new_account_codes: [rou_asset, lease_liability]
- reclassification_entries: true
- disclosure_changes: true
- audit_focus: high
# Tax law change
- type: tax_law_change
date: "2024-07-01"
jurisdiction: federal
change: corporate_tax_rate
from: 0.21
to: 0.25
effects:
- deferred_tax_revaluation: true
- provision_adjustment: true
- quarterly_estimate_revision: true
# Compliance requirement
- type: new_compliance_requirement
regulation: SOX_AI_controls
effective_date: "2024-10-01"
requirements:
- ai_model_documentation: required
- automated_control_testing: required
- data_lineage_tracking: required
effects:
- new_control_activities: 15
- testing_frequency: quarterly
- documentation_overhead: 0.10
# Industry regulation
- type: industry_regulation
industry: financial_services
regulation: enhanced_kyc
date: "2024-06-01"
effects:
- customer_onboarding_time: 1.50
- documentation_requirements: increased
- rejection_rate_increase: 0.08
3.2 Audit Focus Shifts
audit_focus_evolution:
enabled: true
shifts:
# Risk-based changes
- trigger: fraud_detection
date: "2024-03-15"
new_focus_areas:
- vendor_payments: high
- manual_journal_entries: high
- related_party_transactions: medium
effects:
- sampling_rate_increase: 0.30
- documentation_requests: increased
# Industry trend response
- trigger: industry_trend
date: "2024-06-01"
trend: cybersecurity_risks
new_focus_areas:
- it_general_controls: high
- access_management: high
- change_management: medium
effects:
- itgc_testing_expansion: true
- soc2_requirements: enhanced
# Prior year findings
- trigger: prior_year_finding
finding: revenue_recognition_timing
date: "2024-01-01"
effects:
- cutoff_testing: enhanced
- sample_sizes: increased
- management_inquiry: extensive
4. Behavioral Drift
4.1 Entity Behavior Evolution
behavioral_drift:
enabled: true
vendor_behavior:
# Payment term negotiation
payment_terms_drift:
direction: extending
rate_per_year: 2.5 # Days per year
variance_increase: true
trigger: economic_conditions
# Quality drift
quality_drift:
new_vendors:
initial_period_months: 6
quality_improvement_rate: 0.02
established_vendors:
complacency_risk: 0.05
quality_decline_rate: 0.01
# Price drift
pricing_behavior:
inflation_pass_through: 0.80
contract_renegotiation_frequency: annual
opportunistic_increase_probability: 0.10
customer_behavior:
# Payment behavior evolution
payment_drift:
economic_downturn:
days_extension: 5-15
bad_debt_rate_increase: 0.02
economic_upturn:
days_reduction: 2-5
early_payment_discount_uptake: 0.15
# Order pattern drift
order_drift:
digital_shift:
online_order_increase_per_year: 0.05
average_order_value_decrease: 0.03
order_frequency_increase: 0.10
employee_behavior:
# Approval pattern drift
approval_drift:
end_of_month_rush:
intensity_increase_per_year: 0.05
rubber_stamping_risk:
increase_with_volume: true
threshold: 50 # Approvals per day
# Error pattern drift
error_drift:
new_employee:
error_rate: 0.08
learning_curve_months: 6
target_error_rate: 0.02
experienced_employee:
fatigue_increase: 0.01_per_year
4.2 Collective Behavior Patterns
collective_drift:
enabled: true
patterns:
# Year-end behavior
year_end_intensity:
drift: increasing
rate_per_year: 0.05
explanation: "tighter close deadlines, more scrutiny"
# Automation adoption
automation_adoption:
s_curve_adoption: true
early_adopters: 0.15
mainstream: 0.60
laggards: 0.25
effects_by_phase:
early:
manual_reduction: 0.10
error_types_shift: true
mainstream:
manual_reduction: 0.50
new_error_types: automation_failures
late:
manual_reduction: 0.80
exception_handling_focus: true
# Remote work impact
remote_work_patterns:
transition_date: "2024-01-01"
remote_percentage: 0.60
effects:
- posting_time_distribution: flattened
- batch_processing_increase: true
- approval_response_time: longer
- documentation_quality: variable
5. Market-Driven Drift
5.1 Economic Cycle Effects
economic_cycles:
enabled: true
cycles:
# Business cycle
business_cycle:
type: sinusoidal
period_months: 48
amplitude: 0.15
effects:
expansion:
revenue_growth: positive
hiring: active
capital_investment: high
credit_terms: generous
contraction:
revenue_growth: negative
layoffs: possible
capital_investment: low
credit_terms: tight
# Industry cycle
industry_specific:
technology:
period_months: 36
amplitude: 0.25
manufacturing:
period_months: 60
amplitude: 0.20
retail:
period_months: 12 # Annual
amplitude: 0.35
# Recession simulation
recession:
enabled: false # Trigger explicitly
onset: gradual # or sudden
duration_months: 12-24
severity: moderate # mild, moderate, severe
effects:
revenue_decline: 0.15-0.30
ar_aging_increase: 15_days
bad_debt_increase: 0.03
vendor_consolidation: 0.10
workforce_reduction: 0.08
capex_freeze: true
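A sketch of the sinusoidal business-cycle multiplier with the configured 48-month period and 0.15 amplitude (the phase handling and function name are assumptions):
use std::f64::consts::TAU;
/// Illustrative sketch: sinusoidal business-cycle multiplier.
/// `month` counts from the start of the simulation.
fn business_cycle_multiplier(month: u32, period_months: f64, amplitude: f64) -> f64 {
    1.0 + amplitude * (TAU * month as f64 / period_months).sin()
}
fn main() {
    for m in [0u32, 12, 24, 36, 48] {
        println!("month {:>2}: x{:.3}", m, business_cycle_multiplier(m, 48.0, 0.15));
    }
}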
5.2 Commodity and Input Cost Drift
input_cost_drift:
enabled: true
commodities:
- name: steel
base_price: 800 # per ton
volatility: 0.20
correlation_with_economy: 0.60
pass_through_to_cogs: 0.15
- name: energy
base_price: 75 # per barrel equivalent
volatility: 0.35
seasonal_pattern: true
pass_through_to_overhead: 0.08
- name: labor
base_cost: 35 # per hour
annual_increase: 0.03
regional_variation: true
pass_through_to_all: true
price_shock_events:
- type: supply_disruption
probability_per_year: 0.10
duration_months: 3-9
price_increase: 0.30-1.00
affected_commodities: [specific]
- type: demand_surge
probability_per_year: 0.15
duration_months: 2-6
price_increase: 0.15-0.40
affected_commodities: [broad]
6. Concept Drift Detection Signals
6.1 Drift Indicators in Generated Data
drift_signals:
enabled: true
embedded_signals:
# Statistical shift markers
statistical:
- type: mean_shift
field: transaction_amount
visibility: detectable_by_cusum
magnitude: configurable
- type: variance_change
field: processing_time
visibility: detectable_by_levene
direction: both
- type: distribution_change
field: payment_terms
visibility: detectable_by_ks_test
gradual: true
# Categorical drift markers
categorical:
- type: category_proportion_shift
field: transaction_type
new_category_emergence: true
old_category_decline: true
- type: label_drift
field: account_code
new_codes: added_over_time
deprecated_codes: declining_usage
# Temporal drift markers
temporal:
- type: seasonality_change
field: transaction_count
pattern_evolution: true
detectability: acf_analysis
- type: trend_change
field: revenue
change_points: marked
detectability: pelt_algorithm
# Ground truth labels for drift
drift_labels:
enabled: true
output_file: drift_events.csv
columns:
- event_type
- start_date
- end_date
- affected_fields
- magnitude
- detection_difficulty
6.2 Drift Validation Metrics
drift_validation:
metrics:
# Drift presence verification
drift_detection:
methods:
- adwin # Adaptive Windowing
- ddm # Drift Detection Method
- eddm # Early Drift Detection Method
- ph # Page-Hinkley Test
threshold_calibration: true
# Drift magnitude
magnitude_metrics:
- hellinger_distance
- kl_divergence
- wasserstein_distance
- psi # Population Stability Index
# Drift timing accuracy
timing_metrics:
- detection_delay_days
- false_positive_rate
- detection_precision
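As an example of the magnitude metrics, the Population Stability Index compares baseline and current bucket proportions. A minimal sketch follows; the bucket values are illustrative, and the usual rule of thumb reads PSI above roughly 0.1 as moderate and above 0.25 as significant shift:
/// Illustrative sketch: Population Stability Index between two bucket
/// distributions. A small epsilon guards empty buckets.
fn psi(expected: &[f64], actual: &[f64]) -> f64 {
    const EPS: f64 = 1e-6;
    expected
        .iter()
        .zip(actual)
        .map(|(e, a)| {
            let e = e.max(EPS);
            let a = a.max(EPS);
            (a - e) * (a / e).ln()
        })
        .sum()
}
fn main() {
    let baseline = [0.10, 0.20, 0.40, 0.20, 0.10];
    let drifted = [0.05, 0.15, 0.35, 0.25, 0.20];
    println!("PSI = {:.3}", psi(&baseline, &drifted)); // ~0.14: moderate drift
}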
7. Implementation Framework
7.1 Drift Controller Enhancement
pub struct EnhancedDriftController {
// Existing drift
parameter_drift: ParameterDrift,
// New: Organizational events
event_timeline: EventTimeline,
// New: Process changes
process_evolution: ProcessEvolution,
// New: Regulatory changes
regulatory_calendar: RegulatoryCalendar,
// New: Behavioral models
behavioral_drift: BehavioralDriftModel,
// New: Market factors
market_model: MarketModel,
// Drift detection ground truth
drift_labels: DriftLabelRecorder,
}
impl EnhancedDriftController {
/// Get all active effects for a given date
pub fn get_effects(&self, date: NaiveDate) -> DriftEffects {
let mut effects = DriftEffects::default();
// Apply organizational events
effects.merge(self.event_timeline.effects_at(date));
// Apply process evolution
effects.merge(self.process_evolution.effects_at(date));
// Apply regulatory changes
effects.merge(self.regulatory_calendar.effects_at(date));
// Apply behavioral drift
effects.merge(self.behavioral_drift.effects_at(date));
// Apply market conditions
effects.merge(self.market_model.effects_at(date));
// Record for ground truth
self.drift_labels.record(date, &effects);
effects
}
}
7.2 Configuration Integration
# Master drift configuration
drift:
enabled: true
# Parameter drift (existing)
parameters:
amount_mean_drift: 0.02
amount_variance_drift: 0.01
# Organizational events (new)
organizational:
events_file: "organizational_events.yaml"
random_events:
reorganization_probability: 0.10
leadership_change_probability: 0.15
# Process evolution (new)
process:
automation_curve: s_curve
policy_review_frequency: quarterly
# Regulatory changes (new)
regulatory:
calendar_file: "regulatory_calendar.yaml"
jurisdictions: [us, eu]
# Behavioral drift (new)
behavioral:
vendor_learning: true
customer_churn: true
employee_turnover: 0.15
# Market factors (new)
market:
economic_cycle: true
commodity_volatility: true
inflation_rate: 0.03
# Drift labeling (new)
labels:
enabled: true
output_format: csv
include_magnitude: true
8. Implementation Priority
| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| Organizational events | Medium | High | P1 |
| Process evolution | Medium | High | P1 |
| Regulatory changes | Low | Medium | P2 |
| Behavioral drift | High | High | P1 |
| Market-driven drift | Medium | Medium | P2 |
| Drift detection signals | Low | High | P1 |
| Technology transitions | High | Medium | P3 |
| Collective behavior | Medium | Medium | P2 |
9. Use Cases
- ML Model Robustness Testing: Train models on stable data, test on drifted data
- Drift Detection Benchmarking: Evaluate drift detection algorithms on known drift
- Change Management Simulation: Test system responses to organizational changes
- Regulatory Impact Analysis: Model effects of compliance requirement changes
- Economic Scenario Planning: Generate data under different economic conditions
See also: 06-anomaly-patterns.md for anomaly injection patterns
Research: Anomaly Pattern Enhancements
Current State Analysis
Existing Anomaly Categories
| Category | Types | Implementation |
|---|---|---|
| Fraud | Fictitious, Revenue Manipulation, Split, Round-trip, Ghost Employee, Duplicate Payment | Good |
| Error | Duplicate Entry, Reversed Amount, Wrong Period, Wrong Account, Missing Reference | Good |
| Process | Late Posting, Skipped Approval, Threshold Manipulation | Medium |
| Statistical | Unusual Amount, Trend Break, Benford Violation | Medium |
| Relational | Circular Transaction, Dormant Account | Basic |
Current Strengths
- Labeled output: anomaly_labels.csv with ground truth
- Configurable injection rate: Per-anomaly-type rates
- Quality issue labeling: Separate from fraud labels
- Multiple anomaly types: 20+ distinct patterns
- COSO control mapping: Anomalies linked to control failures
Current Gaps
- Binary labeling only: No severity or confidence scores
- Independent injection: Anomalies don’t correlate with each other
- No multi-stage anomalies: Complex schemes not modeled
- Static patterns: Same anomaly signature throughout
- No near-miss generation: Only clear anomalies or clean data
- Limited context awareness: Anomalies don’t adapt to entity behavior
- No detection difficulty labeling: All anomalies treated equally
Improvement Recommendations
1. Multi-Dimensional Anomaly Labeling
1.1 Enhanced Label Schema
anomaly_labeling:
schema:
# Primary classification
anomaly_id: uuid
transaction_ids: [uuid]
anomaly_type: string
anomaly_category: [fraud, error, process, statistical, relational]
# Severity scoring
severity:
level: [low, medium, high, critical]
score: 0.0-1.0
financial_impact: decimal
materiality_threshold: exceeded | below
# Detection characteristics
detection:
difficulty: [trivial, easy, moderate, hard, expert]
recommended_methods: [rule_based, statistical, ml, graph, hybrid]
expected_false_positive_rate: 0.0-1.0
key_indicators: [string]
# Confidence and certainty
confidence:
ground_truth_certainty: [definite, probable, possible]
label_source: [injected, inferred, manual]
# Temporal characteristics
temporal:
first_occurrence: date
last_occurrence: date
frequency: [one_time, recurring, continuous]
detection_window: days
# Relationship context
context:
related_anomalies: [uuid]
affected_entities: [entity_id]
control_failures: [control_id]
root_cause: string
1.2 Materiality-Based Severity
severity_calculation:
materiality_thresholds:
trivial: 0.001 # 0.1% of relevant base
immaterial: 0.01 # 1%
material: 0.05 # 5%
highly_material: 0.10 # 10%
bases_by_type:
revenue: total_revenue
expense: total_expenses
asset: total_assets
liability: total_liabilities
severity_factors:
financial_impact:
weight: 0.40
calculation: amount / materiality_threshold
detection_difficulty:
weight: 0.25
mapping:
trivial: 0.1
easy: 0.3
moderate: 0.5
hard: 0.7
expert: 0.9
persistence:
weight: 0.20
calculation: duration_days / 365
entity_involvement:
weight: 0.15
calculation: log(affected_entity_count)
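A sketch of the weighted composite severity score; the per-factor clamping to [0, 1] and the log10 base for entity involvement are assumptions made so the result stays inside the 0.0-1.0 range defined in the schema:
/// Illustrative sketch of the composite severity score above.
fn severity_score(
    amount: f64,
    materiality_threshold: f64,
    difficulty: &str,
    duration_days: f64,
    affected_entities: u32,
) -> f64 {
    let impact = (amount / materiality_threshold).clamp(0.0, 1.0);
    let difficulty_score = match difficulty {
        "trivial" => 0.1,
        "easy" => 0.3,
        "moderate" => 0.5,
        "hard" => 0.7,
        _ => 0.9, // expert
    };
    let persistence = (duration_days / 365.0).clamp(0.0, 1.0);
    // log10 keeps large entity counts from dominating; 10+ entities saturate.
    let involvement = (affected_entities.max(1) as f64).log10().clamp(0.0, 1.0);
    0.40 * impact + 0.25 * difficulty_score + 0.20 * persistence + 0.15 * involvement
}
fn main() {
    let score = severity_score(250_000.0, 500_000.0, "hard", 180.0, 4);
    println!("severity score: {:.2}", score);
}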
2. Correlated Anomaly Injection
2.1 Anomaly Co-occurrence Patterns
anomaly_correlations:
enabled: true
patterns:
# Fraud often accompanied by concealment
fraud_concealment:
primary: fictitious_vendor
correlated:
- type: document_manipulation
probability: 0.80
lag_days: 0-30
- type: approval_bypass
probability: 0.60
lag_days: 0
- type: audit_trail_gaps
probability: 0.40
lag_days: 0-90
# Error cascades
error_cascade:
primary: wrong_account_coding
correlated:
- type: reconciliation_difference
probability: 0.90
lag_days: 30-60
- type: balance_discrepancy
probability: 0.70
lag_days: 30
- type: correcting_entry
probability: 0.85
lag_days: 1-45
# Process failures cluster
process_breakdown:
primary: skipped_approval
correlated:
- type: threshold_splitting
probability: 0.50
lag_days: [-30, 30] # may precede or follow the primary anomaly
- type: late_posting
probability: 0.40
lag_days: 0-15
- type: documentation_missing
probability: 0.60
lag_days: 0
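A sketch of co-injection for a primary anomaly, using the fraud_concealment and error_cascade probabilities above; the function, the caller-supplied uniform draws, and the omission of lag handling are all simplifications:
/// Illustrative sketch: decide which correlated anomalies to co-inject.
/// `draws` are uniform values in [0, 1), one per candidate.
fn correlated_anomalies(primary: &str, draws: &[f64]) -> Vec<&'static str> {
    let pattern: &[(&'static str, f64)] = match primary {
        "fictitious_vendor" => &[
            ("document_manipulation", 0.80),
            ("approval_bypass", 0.60),
            ("audit_trail_gaps", 0.40),
        ],
        "wrong_account_coding" => &[
            ("reconciliation_difference", 0.90),
            ("balance_discrepancy", 0.70),
            ("correcting_entry", 0.85),
        ],
        _ => &[],
    };
    pattern
        .iter()
        .zip(draws)
        .filter(|((_, p), u)| **u < *p)
        .map(|((t, _), _)| *t)
        .collect()
}
fn main() {
    let picked = correlated_anomalies("fictitious_vendor", &[0.25, 0.70, 0.95]);
    println!("co-injected: {:?}", picked); // only document_manipulation here
}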
2.2 Temporal Clustering
temporal_clustering:
enabled: true
clusters:
# Period-end error spikes
period_end_errors:
window: last_5_business_days
error_rate_multiplier: 2.5
types: [wrong_period, duplicate_entry, late_posting]
# Post-holiday cleanup
post_holiday:
window: first_3_business_days_after_holiday
types: [duplicate_entry, missing_reference]
multiplier: 1.8
# Quarter-end pressure
quarter_end:
window: last_week_of_quarter
fraud_types: [revenue_manipulation, expense_deferral]
multiplier: 1.5
# Year-end audit prep
year_end_audit:
window: december
correction_types: [reclassification, prior_period_adjustment]
multiplier: 3.0
3. Multi-Stage Anomaly Patterns
3.1 Complex Scheme Modeling
multi_stage_anomalies:
enabled: true
schemes:
# Gradual embezzlement
gradual_embezzlement:
stages:
- stage: 1
name: testing
duration_months: 2
transactions: 3-5
amount_range: [100, 500]
detection_difficulty: hard
- stage: 2
name: escalation
duration_months: 6
transactions: 10-20
amount_range: [500, 2000]
detection_difficulty: moderate
- stage: 3
name: acceleration
duration_months: 3
transactions: 20-50
amount_range: [2000, 10000]
detection_difficulty: easy
- stage: 4
name: desperation
duration_months: 1
transactions: 5-10
amount_range: [10000, 50000]
detection_difficulty: trivial
total_scheme_probability: 0.02
# Revenue manipulation over time
revenue_scheme:
stages:
- stage: 1
name: acceleration
quarter: Q4
action: early_revenue_recognition
amount_percent: 0.02
- stage: 2
name: deferral
quarter: Q1_next
action: expense_deferral
amount_percent: 0.03
- stage: 3
name: reserve_manipulation
quarter: Q2
action: reserve_release
amount_percent: 0.02
- stage: 4
name: channel_stuffing
quarter: Q4
action: forced_sales
amount_percent: 0.05
cycle_probability: 0.01
# Vendor kickback scheme
kickback_scheme:
stages:
- stage: 1
name: vendor_setup
actions: [create_vendor, build_relationship]
duration_months: 3
- stage: 2
name: price_inflation
actions: [inflated_invoices]
inflation_percent: 0.10-0.25
duration_months: 12
- stage: 3
name: kickback_payments
actions: [off_book_payments]
kickback_percent: 0.50
frequency: quarterly
- stage: 4
name: concealment
actions: [document_destruction, false_approvals]
ongoing: true
3.2 Scheme Evolution
pub struct MultiStageAnomaly {
scheme_id: Uuid,
scheme_type: SchemeType,
current_stage: u32,
start_date: NaiveDate,
perpetrators: Vec<EntityId>,
transactions: Vec<TransactionId>,
total_impact: Decimal,
detection_status: DetectionStatus,
}
impl MultiStageAnomaly {
/// Advance scheme to next stage
pub fn advance(&mut self, date: NaiveDate) -> Vec<AnomalyAction> {
// Check if conditions met for stage advancement
// Return actions for current stage
}
/// Check if scheme should be detected based on accumulated evidence
pub fn detection_probability(&self) -> f64 {
// Increases with:
// - Number of transactions
// - Total amount
// - Duration
// - Carelessness factor
}
}
4. Near-Miss and Edge Case Generation
4.1 Near-Anomaly Patterns
near_miss_generation:
enabled: true
proportion_of_anomalies: 0.30 # 30% of "anomalies" are near-misses
patterns:
# Almost duplicate (timing difference)
near_duplicate:
description: "Similar transaction, different timing"
difference:
amount: exact_match
date: 1-3_days_apart
vendor: same
label: not_anomaly
detection_challenge: high
# Threshold proximity
threshold_proximity:
description: "Transaction just below approval threshold"
distance_from_threshold: [0.90, 0.99]
label: not_anomaly
suspicion_score: high
# Unusual but explainable
unusual_legitimate:
description: "Unusual pattern with valid business reason"
types:
- year_end_bonus
- contract_prepayment
- settlement_payment
- insurance_claim
label: not_anomaly
false_positive_trigger: high
# Corrected error
corrected_error:
description: "Error that was caught and fixed"
original_error: any
correction_lag_days: 1-5
net_impact: zero
label: error_corrected
visibility: both_entries_visible
4.2 Boundary Condition Testing
boundary_conditions:
enabled: true
conditions:
# Exact threshold matches
exact_thresholds:
types: [approval_limit, materiality, tolerance]
probability: 0.01
label: boundary_case
# Round number preference (non-fraudulent)
legitimate_round_numbers:
amounts: [1000, 5000, 10000, 25000]
probability: 0.05
label: not_anomaly
context: budget_allocations
# Last-minute but legitimate
period_boundary:
timing: last_hour_before_close
legitimate_probability: 0.80
label: timing_anomaly_only
# Zero and negative amounts
edge_amounts:
zero_amount_probability: 0.001
negative_amount_probability: 0.002
labels: data_quality_issue
5. Context-Aware Anomaly Injection
5.1 Entity-Specific Patterns
entity_aware_anomalies:
enabled: true
vendor_specific:
# New vendors have higher error rates
new_vendor_errors:
definition: vendor_age < 90_days
error_rate_multiplier: 2.5
common_errors: [wrong_account, missing_po]
# Large vendors have more complex issues
strategic_vendor_issues:
definition: vendor_spend > percentile_90
anomaly_types: [contract_deviation, price_variance]
rate_multiplier: 1.5
# International vendors
international_vendor_issues:
definition: vendor_country != company_country
anomaly_types: [fx_errors, withholding_tax_errors]
rate_multiplier: 2.0
employee_specific:
# New employee learning curve
new_employee_errors:
definition: employee_tenure < 180_days
error_rate: 0.05
error_types: [coding_error, approval_violation]
decay: exponential
# High-volume processors
volume_fatigue:
definition: daily_transactions > 50
error_rate_increase: 0.02
peak_time: end_of_day
# Vacation coverage
coverage_errors:
trigger: primary_approver_absent
error_rate_multiplier: 1.8
types: [delayed_approval, wrong_approver]
account_specific:
# High-risk accounts
high_risk_accounts:
accounts: [cash, revenue, inventory]
monitoring_level: enhanced
anomaly_injection_rate_multiplier: 1.5
# Infrequently used accounts
dormant_account_activity:
definition: no_activity_90_days
any_activity_suspicious: true
label: statistical_anomaly
5.2 Behavioral Baseline Deviation
behavioral_deviation:
enabled: true
baselines:
# Establish per-entity behavioral baseline
baseline_period: 90_days
metrics:
- average_transaction_amount
- transaction_frequency
- typical_posting_time
- common_counterparties
- usual_account_codes
deviations:
# Amount deviation
amount_anomaly:
threshold: 3_standard_deviations
label: statistical_anomaly
severity: based_on_deviation
# Frequency deviation
frequency_anomaly:
threshold: 2_standard_deviations
types: [sudden_increase, sudden_decrease, irregular_pattern]
# Counterparty deviation
new_counterparty:
first_time_transaction: true
risk_score: elevated
label: relationship_anomaly
# Timing deviation
timing_anomaly:
threshold: outside_usual_hours
consideration: legitimate_reasons_exist
label: timing_anomaly
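A sketch of the 3-standard-deviation amount check against a precomputed entity baseline (the Baseline struct and the guard for a degenerate standard deviation are assumptions):
/// Illustrative sketch: flag amounts deviating from an entity's 90-day baseline.
struct Baseline {
    mean_amount: f64,
    std_amount: f64,
}
fn is_amount_anomaly(baseline: &Baseline, amount: f64, threshold_sigmas: f64) -> bool {
    if baseline.std_amount <= f64::EPSILON {
        return false; // degenerate baseline: cannot score
    }
    let z = (amount - baseline.mean_amount).abs() / baseline.std_amount;
    z > threshold_sigmas
}
fn main() {
    let b = Baseline { mean_amount: 2_500.0, std_amount: 600.0 };
    assert!(!is_amount_anomaly(&b, 3_900.0, 3.0)); // z ~ 2.3
    assert!(is_amount_anomaly(&b, 6_000.0, 3.0));  // z ~ 5.8
}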
6. Detection Difficulty Classification
6.1 Difficulty Taxonomy
detection_difficulty:
levels:
trivial:
description: "Obvious on cursory review"
examples:
- duplicate_same_day
- obviously_wrong_amount
- missing_required_field
expected_detection_rate: 0.99
detection_methods: [basic_rules]
easy:
description: "Detectable with standard controls"
examples:
- threshold_violations
- approval_gaps
- segregation_of_duties
expected_detection_rate: 0.90
detection_methods: [automated_rules, basic_analytics]
moderate:
description: "Requires analytical procedures"
examples:
- trend_deviations
- ratio_anomalies
- benford_violations
expected_detection_rate: 0.70
detection_methods: [statistical_analysis, ratio_analysis]
hard:
description: "Requires advanced techniques or domain expertise"
examples:
- complex_fraud_schemes
- collusion_patterns
- sophisticated_manipulation
expected_detection_rate: 0.40
detection_methods: [ml_models, graph_analysis, forensic_audit]
expert:
description: "Only detectable by specialized investigation"
examples:
- long_running_schemes
- management_override
- deep_concealment
expected_detection_rate: 0.15
detection_methods: [tip_or_complaint, forensic_investigation, external_audit]
6.2 Difficulty Factors
pub struct DifficultyCalculator {
factors: Vec<DifficultyFactor>,
}
pub enum DifficultyFactor {
// Concealment techniques
Concealment {
document_manipulation: bool,
approval_circumvention: bool,
timing_exploitation: bool,
splitting: bool,
},
// Blending with normal activity
Blending {
amount_within_normal_range: bool,
timing_within_normal_hours: bool,
counterparty_is_established: bool,
account_coding_correct: bool,
},
// Collusion
Collusion {
number_of_participants: u32,
includes_management: bool,
external_parties: bool,
},
// Duration and frequency
Temporal {
duration_months: u32,
transaction_frequency: Frequency,
gradual_escalation: bool,
},
// Amount characteristics
Amount {
total_amount: Decimal,
individual_amounts_small: bool,
round_numbers_avoided: bool,
},
}
7. Anomaly Generation Strategies
7.1 Strategy Configuration
anomaly_strategies:
# Random injection (current approach)
random:
enabled: true
weight: 0.40
parameters:
base_rate: 0.02
per_type_rates: {...}
# Scenario-based injection
scenario_based:
enabled: true
weight: 0.30
scenarios:
- name: "new_employee_fraud"
trigger: employee_tenure < 365
probability: 0.005
scheme: embezzlement
- name: "vendor_collusion"
trigger: vendor_concentration > 0.15
probability: 0.01
scheme: kickback
- name: "year_end_pressure"
trigger: month == 12
probability: 0.03
types: [revenue_manipulation, reserve_adjustment]
# Adversarial injection
adversarial:
enabled: true
weight: 0.20
strategy: evade_known_detectors
detectors_to_evade:
- benford_analysis
- duplicate_detection
- threshold_monitoring
techniques:
- amount_variation
- timing_spreading
- entity_rotation
# Benchmark-based injection
benchmark:
enabled: true
weight: 0.10
source: acfe_report_to_the_nations
calibration:
median_loss: 117000
duration_months: 12
detection_method_distribution: {...}
7.2 Adaptive Anomaly Injection
pub struct AdaptiveAnomalyInjector {
// Tracks what's been injected
injection_history: Vec<InjectedAnomaly>,
// Ensures variety
type_distribution: TypeDistribution,
// Ensures difficulty spread
difficulty_distribution: DifficultyDistribution,
// Ensures temporal spread
temporal_distribution: TemporalDistribution,
}
impl AdaptiveAnomalyInjector {
/// Inject anomaly with awareness of what's already been injected
pub fn inject(&mut self, context: &GenerationContext) -> Option<Anomaly> {
// Check if injection appropriate at this point
if !self.should_inject(context) {
return None;
}
// Select type based on current distribution gaps
let anomaly_type = self.select_type_for_balance();
// Select difficulty based on current distribution gaps
let difficulty = self.select_difficulty_for_balance();
// Generate anomaly
let anomaly = self.generate_anomaly(anomaly_type, difficulty, context);
// Record injection
self.record_injection(&anomaly);
Some(anomaly)
}
}
8. Output Enhancements
8.1 Enhanced Label File
output:
anomaly_labels:
format: parquet # or csv
columns:
# Identifiers
- anomaly_id
- transaction_ids # Array
- scheme_id # For multi-stage
# Classification
- anomaly_type
- category
- subcategory
# Severity
- severity_level
- severity_score
- financial_impact
- is_material
# Detection
- difficulty_level
- difficulty_score
- recommended_detection_methods # Array
- key_indicators # Array
# Temporal
- first_date
- last_date
- duration_days
- stage # For multi-stage
# Context
- affected_entities # Array
- control_failures # Array
- related_anomalies # Array
# Metadata
- injection_strategy
- generation_seed
- ground_truth_certainty
# Separate scheme file for multi-stage
schemes:
format: json
structure:
scheme_id: uuid
scheme_type: string
stages: [...]
transactions_by_stage: {...}
total_impact: decimal
perpetrators: [entity_ids]
8.2 Detection Benchmark Output
detection_benchmarks:
enabled: true
outputs:
# Performance expectations by method
expected_performance:
format: json
content:
by_method:
rule_based:
expected_recall: 0.40
expected_precision: 0.85
statistical:
expected_recall: 0.55
expected_precision: 0.70
ml_supervised:
expected_recall: 0.75
expected_precision: 0.80
graph_based:
expected_recall: 0.65
expected_precision: 0.75
# Difficulty distribution
difficulty_summary:
format: csv
columns: [difficulty_level, count, percentage, avg_amount]
# Detection challenge set
challenge_cases:
format: json
description: "Curated set of hardest-to-detect anomalies"
count: 100
selection_criteria: difficulty_score > 0.7
9. Implementation Priority
| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| Multi-dimensional labeling | Low | High | P1 |
| Correlated anomaly injection | Medium | High | P1 |
| Multi-stage schemes | High | High | P1 |
| Near-miss generation | Medium | High | P1 |
| Context-aware injection | Medium | High | P2 |
| Difficulty classification | Low | High | P1 |
| Adaptive injection | Medium | Medium | P2 |
| Detection benchmarks | Low | Medium | P2 |
See also: 07-fraud-patterns.md for fraud-specific patterns
Research: Fraud Pattern Improvements
Implementation Status: COMPLETE (v0.3.0)
This research document has been fully implemented in v0.3.0. See:
- Fraud Patterns Documentation
- CHANGELOG.md for detailed feature list
Key implementations:
- ACFE-aligned fraud taxonomy with calibration statistics
- Collusion and conspiracy modeling with 9 ring types
- Management override patterns with fraud triangle
- Red flag generation with Bayesian probabilities
- ACFE-calibrated ML benchmarks
Current State Analysis
Existing Fraud Typologies
| Category | Types Implemented | Realism |
|---|---|---|
| Asset Misappropriation | Ghost Employee, Duplicate Payment, Fictitious Vendor | Medium |
| Financial Statement Fraud | Revenue Manipulation, Round-tripping | Basic |
| Corruption | (Limited) | Weak |
| Banking/AML | Structuring, Layering, Mule, Funnel, Spoofing | Good |
Current Strengths
- Banking module: Sophisticated AML typologies with transaction networks
- Fraud labeling: Ground truth labels for ML training
- Control mapping: Fraud linked to control failures
- Amount patterns: Benford violations for fraudulent amounts
Current Gaps
- No collusion modeling: Fraud actors operate independently
- Limited concealment: Fraud isn’t actively hidden
- No behavioral adaptation: Fraudsters don’t learn
- Static schemes: Same patterns throughout
- Missing corruption types: Bribery, kickbacks underdeveloped
- No management override: All fraud at operational level
- Limited financial statement fraud: Complex schemes not modeled
Improvement Recommendations
1. Comprehensive Fraud Taxonomy
1.1 ACFE-Aligned Framework
Based on the Association of Certified Fraud Examiners (ACFE) Occupational Fraud and Abuse Classification System:
fraud_taxonomy:
# Asset Misappropriation (86% of cases, $100k median loss)
asset_misappropriation:
cash:
theft_of_cash_on_hand:
- larceny
- skimming
theft_of_cash_receipts:
- sales_skimming
- receivables_skimming
- refund_schemes
fraudulent_disbursements:
- billing_schemes:
- shell_company
- non_accomplice_vendor
- personal_purchases
- payroll_schemes:
- ghost_employee
- falsified_wages
- commission_schemes
- expense_reimbursement:
- mischaracterized_expenses
- overstated_expenses
- fictitious_expenses
- check_tampering:
- forged_maker
- forged_endorsement
- altered_payee
- authorized_maker
- register_disbursements:
- false_voids
- false_refunds
inventory_and_assets:
- misuse
- larceny
# Corruption (33% of cases, $150k median loss)
corruption:
conflicts_of_interest:
- purchasing_schemes
- sales_schemes
bribery:
- invoice_kickbacks
- bid_rigging
illegal_gratuities: true
economic_extortion: true
# Financial Statement Fraud (10% of cases, $954k median loss)
financial_statement_fraud:
overstatement:
- timing_differences:
- premature_revenue
- delayed_expenses
- fictitious_revenues
- concealed_liabilities
- improper_asset_valuations
- improper_disclosures
understatement:
- understated_revenues
- overstated_expenses
- overstated_liabilities
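In code, the three top-level branches and the calibration statistics cited above reduce to a small lookup. The enum and function names below are illustrative assumptions; the case shares overlap because a single case can span more than one branch.
#[derive(Debug, Clone, Copy)]
pub enum FraudBranch {
    AssetMisappropriation,
    Corruption,
    FinancialStatementFraud,
}

/// (share_of_cases, median_loss_usd) as cited in the taxonomy comments above.
pub fn acfe_calibration(branch: FraudBranch) -> (f64, f64) {
    match branch {
        FraudBranch::AssetMisappropriation => (0.86, 100_000.0),
        FraudBranch::Corruption => (0.33, 150_000.0),
        FraudBranch::FinancialStatementFraud => (0.10, 954_000.0),
    }
}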
1.2 Industry-Specific Fraud Patterns
industry_fraud_patterns:
manufacturing:
common_schemes:
- type: inventory_theft
frequency: high
methods: [larceny, false_shipments, scrap_manipulation]
- type: vendor_kickbacks
frequency: medium
methods: [inflated_pricing, phantom_materials]
- type: quality_fraud
frequency: low
methods: [false_certifications, spec_violations]
retail:
common_schemes:
- type: register_fraud
frequency: high
methods: [skimming, false_voids, sweethearting]
- type: return_fraud
frequency: high
methods: [fictitious_returns, receipt_fraud]
- type: inventory_shrinkage
frequency: very_high
methods: [employee_theft, vendor_collusion]
financial_services:
common_schemes:
- type: loan_fraud
frequency: medium
methods: [false_documentation, appraisal_fraud]
- type: insider_trading
frequency: low
methods: [front_running, tip_schemes]
- type: account_takeover
frequency: medium
methods: [identity_theft, credential_theft]
healthcare:
common_schemes:
- type: billing_fraud
frequency: high
methods: [upcoding, unbundling, phantom_billing]
- type: kickbacks
frequency: medium
methods: [referral_fees, drug_company_payments]
- type: identity_fraud
frequency: medium
methods: [patient_identity_theft, provider_impersonation]
professional_services:
common_schemes:
- type: billing_fraud
frequency: high
methods: [inflated_hours, phantom_work]
- type: expense_fraud
frequency: medium
methods: [personal_expenses, inflated_claims]
- type: client_fund_misappropriation
frequency: low
methods: [trust_account_theft, advance_fee_theft]
2. Collusion and Conspiracy Modeling
2.1 Collusion Network Generation
collusion_networks:
enabled: true
network_types:
# Internal collusion
internal:
- type: employee_pair
roles: [approver, processor]
scheme: approval_bypass
probability: 0.005
- type: department_ring
size: 3-5
roles: [initiator, approver, concealer]
scheme: expense_fraud
probability: 0.002
- type: management_subordinate
roles: [manager, subordinate]
scheme: ghost_employee
probability: 0.003
# Internal-external collusion
internal_external:
- type: employee_vendor
roles: [purchasing_agent, vendor_contact]
scheme: kickback
probability: 0.008
- type: employee_customer
roles: [sales_rep, customer]
scheme: false_credits
probability: 0.004
- type: employee_contractor
roles: [project_manager, contractor]
scheme: overbilling
probability: 0.006
# External rings
external:
- type: vendor_ring
size: 2-4
scheme: bid_rigging
probability: 0.002
- type: customer_ring
size: 2-3
scheme: return_fraud
probability: 0.003
network_characteristics:
trust_building:
initial_period_months: 3
test_transactions: 2-5
test_amounts: small
communication_patterns:
frequency: coded
channels: [personal_email, phone, in_person]
visibility: low
profit_sharing:
methods: [equal_split, role_based, initiator_premium]
payment_channels: [cash, personal_accounts, crypto]
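Per-period ring formation can be driven directly by the probability fields above. The sketch below takes the uniform draws as parameters so it stays independent of any RNG crate; the struct and function names are assumptions.
struct RingTemplate {
    name: &'static str,
    probability: f64, // per-period formation probability from the YAML above
}

/// `draws` holds one uniform [0, 1) sample per template for this period.
fn sample_new_rings(templates: &[RingTemplate], draws: &[f64]) -> Vec<&'static str> {
    templates
        .iter()
        .zip(draws)
        .filter(|(t, u)| **u < t.probability)
        .map(|(t, _)| t.name)
        .collect()
}

fn main() {
    let templates = [
        RingTemplate { name: "employee_vendor_kickback", probability: 0.008 },
        RingTemplate { name: "vendor_bid_rigging_ring", probability: 0.002 },
    ];
    // With these draws only the kickback pair forms this period.
    let formed = sample_new_rings(&templates, &[0.005, 0.5]);
    println!("{formed:?}"); // ["employee_vendor_kickback"]
}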
2.2 Collusion Behavior Modeling
pub struct CollusionRing {
ring_id: Uuid,
members: Vec<Conspirator>,
scheme_type: SchemeType,
formation_date: NaiveDate,
status: RingStatus,
total_stolen: Decimal,
detection_risk: f64,
}
pub struct Conspirator {
entity_id: EntityId,
role: ConspiratorRole,
join_date: NaiveDate,
loyalty: f64, // Probability of not defecting
risk_tolerance: f64, // Willingness to escalate
share: f64, // Percentage of proceeds
}
pub enum ConspiratorRole {
Initiator, // Conceives scheme
Executor, // Performs transactions
Approver, // Provides approvals
Concealer, // Hides evidence
Lookout, // Monitors for detection
Beneficiary, // External recipient
}
impl CollusionRing {
/// Simulate ring behavior for a period
pub fn simulate_period(&mut self, period: &Period) -> Vec<FraudAction> {
// Check for defection
if self.check_defection() {
return self.dissolve();
}
// Check for escalation
let escalation = self.check_escalation();
// Generate fraudulent transactions
let actions = self.generate_actions(period, escalation);
// Update detection risk
self.update_detection_risk(&actions);
actions
}
/// Check if any member might defect
fn check_defection(&self) -> bool {
// Factors: loyalty, detection_risk, personal_circumstances
// Placeholder rule (an assumption for this sketch): defection becomes likely
// once the ring's detection risk exceeds the weakest member's loyalty.
let min_loyalty = self.members.iter().map(|m| m.loyalty).fold(1.0, f64::min);
self.detection_risk > min_loyalty
}
}
3. Concealment Techniques
3.1 Document Manipulation
concealment_techniques:
document_manipulation:
# Forged documents
forgery:
types:
- invoices
- receipts
- approvals
- contracts
quality_levels:
crude: 0.20 # Easy to detect
moderate: 0.50 # Requires scrutiny
sophisticated: 0.25 # Difficult to detect
professional: 0.05 # Expert required
# Altered documents
alteration:
techniques:
- amount_change
- date_change
- payee_change
- description_change
detection_indicators:
- different_handwriting
- correction_fluid
- digital_artifacts
# Destroyed documents
destruction:
methods:
- physical_destruction
- digital_deletion
- "lost_in_transition"
recovery_probability: 0.30
audit_trail_manipulation:
techniques:
- backdating_entries
- manipulating_timestamps
- deleting_log_entries
- creating_false_trails
sophistication_levels:
basic: "obvious_gaps"
intermediate: "plausible_explanations"
advanced: "complete_alternative_narrative"
segregation_circumvention:
methods:
- shared_credentials
- delegated_authority_abuse
- emergency_access_exploitation
- system_override_use
3.2 Transaction Structuring
transaction_structuring:
# Below threshold structuring
threshold_avoidance:
thresholds:
- type: approval_limit
values: [1000, 5000, 10000, 25000]
technique: split_below
margin: 0.05-0.15
- type: reporting_threshold
values: [10000] # CTR threshold
technique: structure_below
margin: 0.10-0.20
- type: audit_sample_threshold
values: [materiality * 0.5]
technique: avoid_population
margin: variable
# Timing manipulation
timing_techniques:
- type: spread_over_periods
purpose: avoid_trending
pattern: randomized
- type: burst_before_vacation
purpose: delayed_discovery
window: 1_week
- type: holiday_timing
purpose: reduced_oversight
targets: [year_end, summer]
# Entity rotation
entity_rotation:
- type: vendor_rotation
purpose: avoid_concentration_alerts
rotation_frequency: quarterly
- type: account_rotation
purpose: avoid_pattern_detection
accounts: [expense_categories]
- type: department_rotation
purpose: spread_impact
pattern: round_robin
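The split_below technique is simple arithmetic: break one payment into pieces that each land just under an approval limit. A minimal sketch using the 5–15% margin band from the YAML above; the function name and the handling of the remainder are assumptions.
fn split_below_threshold(total: f64, approval_limit: f64, margin: f64) -> Vec<f64> {
    // Each piece targets limit * (1 - margin), e.g. 5,000 * 0.90 = 4,500.
    let piece = approval_limit * (1.0 - margin);
    let full_pieces = (total / piece).floor() as usize;
    let mut parts = vec![piece; full_pieces];
    let remainder = total - piece * full_pieces as f64;
    if remainder > 0.0 {
        parts.push(remainder);
    }
    parts
}

fn main() {
    // 23,000 split against a 5,000 approval limit with a 10% margin.
    let parts = split_below_threshold(23_000.0, 5_000.0, 0.10);
    println!("{parts:?}"); // [4500.0, 4500.0, 4500.0, 4500.0, 4500.0, 500.0]
}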
4. Management Override
4.1 Override Patterns
management_override:
enabled: true
scenarios:
# Revenue manipulation
revenue_override:
perpetrator_level: senior_management
techniques:
- journal_entry_override
- revenue_recognition_acceleration
- reserve_manipulation
- side_agreement_concealment
concealment:
- false_documentation
- intimidation_of_subordinates
- auditor_deception
# Expense manipulation
expense_override:
perpetrator_level: department_head+
techniques:
- capitalization_abuse
- expense_deferral
- cost_allocation_manipulation
pressure_sources:
- budget_targets
- bonus_thresholds
- analyst_expectations
# Asset manipulation
asset_override:
perpetrator_level: senior_management
techniques:
- impairment_avoidance
- valuation_manipulation
- classification_abuse
motivations:
- covenant_compliance
- credit_rating_maintenance
- acquisition_valuation
override_characteristics:
# Authority abuse
authority_patterns:
- override_segregation_of_duties
- suppress_exception_reports
- modify_control_parameters
- grant_inappropriate_access
# Pressure and rationalization
fraud_triangle:
pressure:
- financial_targets
- personal_financial_issues
- market_expectations
opportunity:
- weak_board_oversight
- auditor_reliance_on_management
- complex_transactions
rationalization:
- "temporary_adjustment"
- "everyone_does_it"
- "for_the_good_of_company"
4.2 Tone at the Top Effects
tone_effects:
enabled: true
# Positive tone (ethical leadership)
ethical_leadership:
effects:
- fraud_rate_reduction: 0.50
- whistleblower_increase: 2.0
- control_compliance_improvement: 0.20
# Negative tone (pressure culture)
pressure_culture:
effects:
- fraud_rate_increase: 2.5
- concealment_sophistication: increased
- collusion_probability: 1.5x
- management_override_probability: 3.0x
# Mixed signals
inconsistent_messaging:
effects:
- employee_confusion: true
- selective_compliance: true
- rationalization_easier: true
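Applied to generation, the tone settings act as multipliers on base rates. A minimal sketch, assuming an enum for the three cultures and treating inconsistent_messaging as rate-neutral because the YAML above gives it no explicit rate effect:
#[derive(Debug, Clone, Copy)]
enum ToneAtTheTop {
    EthicalLeadership,
    PressureCulture,
    InconsistentMessaging,
}

fn adjusted_fraud_rate(base_rate: f64, tone: ToneAtTheTop) -> f64 {
    let factor = match tone {
        // fraud_rate_reduction: 0.50 -> halve the base rate
        ToneAtTheTop::EthicalLeadership => 0.50,
        // fraud_rate_increase: 2.5
        ToneAtTheTop::PressureCulture => 2.5,
        // no explicit rate effect in the YAML; treated as neutral here
        ToneAtTheTop::InconsistentMessaging => 1.0,
    };
    (base_rate * factor).clamp(0.0, 1.0)
}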
5. Adaptive Fraud Behavior
5.1 Learning and Adaptation
adaptive_fraud:
enabled: true
learning_behaviors:
# Response to near-detection
near_detection_response:
behaviors:
- temporary_pause: 0.40
- technique_change: 0.30
- amount_reduction: 0.20
- scheme_abandonment: 0.10
pause_duration_days: 30-90
# Response to control changes
control_adaptation:
when: new_control_implemented
behaviors:
- find_workaround: 0.60
- wait_for_relaxation: 0.25
- abandon_scheme: 0.15
adaptation_time_days: 30-60
# Success reinforcement
success_reinforcement:
when: fraud_not_detected
behaviors:
- increase_frequency: 0.30
- increase_amount: 0.40
- recruit_accomplices: 0.15
- maintain_current: 0.15
sophistication_evolution:
stages:
novice:
characteristics: [obvious_patterns, small_amounts, nervous_behavior]
detection_difficulty: easy
intermediate:
characteristics: [some_concealment, medium_amounts, confidence]
detection_difficulty: moderate
experienced:
characteristics: [sophisticated_concealment, varied_amounts, systematic]
detection_difficulty: hard
expert:
characteristics: [professional_techniques, large_amounts, network]
detection_difficulty: expert
progression:
trigger: months_undetected > 6
probability_per_trigger: 0.30
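A minimal sketch of that progression: once a fraudster passes six undetected months, each trigger gives a 30% chance of moving up one stage. The uniform draw is passed in to keep the sketch free of any RNG crate; the enum and function names are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Sophistication { Novice, Intermediate, Experienced, Expert }

fn maybe_progress(current: Sophistication, months_undetected: u32, draw: f64) -> Sophistication {
    if months_undetected <= 6 || draw >= 0.30 {
        return current;
    }
    match current {
        Sophistication::Novice => Sophistication::Intermediate,
        Sophistication::Intermediate => Sophistication::Experienced,
        Sophistication::Experienced | Sophistication::Expert => Sophistication::Expert,
    }
}

fn main() {
    // Seven undetected months and a draw below 0.30: novice becomes intermediate.
    let next = maybe_progress(Sophistication::Novice, 7, 0.12);
    assert_eq!(next, Sophistication::Intermediate);
}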
5.2 Detection Evasion
pub struct AdaptiveFraudster {
experience_level: ExperienceLevel,
known_controls: Vec<ControlId>,
detection_events: Vec<DetectionEvent>,
technique_repertoire: Vec<FraudTechnique>,
}
impl AdaptiveFraudster {
/// Adapt technique based on environment
pub fn adapt_technique(&mut self, context: &Context) -> FraudTechnique {
// Avoid known controls
let available = self.filter_by_controls(context.active_controls);
// Avoid previously detected patterns
let safe = self.filter_by_history(&available);
// Select based on risk/reward
self.select_optimal(&safe, context.current_risk_tolerance)
}
/// Learn from near-detection
pub fn learn_from_event(&mut self, event: &DetectionEvent) {
match event.outcome {
DetectionOutcome::Detected => {
self.avoid_technique(event.technique);
self.reduce_risk_tolerance();
}
DetectionOutcome::NearMiss => {
self.modify_technique(event.technique);
self.record_warning_sign(event.indicator);
}
DetectionOutcome::Undetected => {
self.reinforce_technique(event.technique);
self.consider_escalation();
}
}
}
}
6. Financial Statement Fraud Schemes
6.1 Revenue Manipulation Schemes
revenue_schemes:
# Premature revenue recognition
premature_recognition:
techniques:
- bill_and_hold:
description: "Ship to warehouse, recognize revenue"
indicators: [unusual_shipping, customer_complaints]
journal_entries:
- dr: accounts_receivable
cr: revenue
- channel_stuffing:
description: "Force product on distributors"
indicators: [quarter_end_spike, high_returns_next_period]
side_agreements: [return_rights, extended_payment]
- percentage_of_completion_abuse:
description: "Overstate project completion"
indicators: [optimistic_estimates, margin_improvements]
documentation: [false_progress_reports]
- round_tripping:
description: "Simultaneous buy/sell with related party"
indicators: [offsetting_transactions, unusual_counterparties]
complexity: high
# Fictitious revenue
fictitious_revenue:
techniques:
- fake_invoices:
description: "Bill nonexistent customers"
concealment: [fake_customer_setup, false_confirmations]
- side_agreements:
description: "Hidden terms negate sale"
concealment: [verbal_agreements, separate_documentation]
- related_party_transactions:
description: "Transactions with undisclosed affiliates"
concealment: [complex_ownership, offshore_entities]
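The journal_entries shown for bill_and_hold (dr accounts_receivable / cr revenue) translate into a balanced two-line entry, which is exactly why the scheme survives a trial balance check. A minimal sketch with illustrative struct and field names, not the generator's entry model:
#[derive(Debug)]
struct JournalLine { account: &'static str, debit: f64, credit: f64 }

fn bill_and_hold_entry(amount: f64) -> [JournalLine; 2] {
    [
        JournalLine { account: "accounts_receivable", debit: amount, credit: 0.0 },
        JournalLine { account: "revenue", debit: 0.0, credit: amount },
    ]
}

fn main() {
    let lines = bill_and_hold_entry(250_000.0);
    // Debits equal credits, so the fabricated entry still balances.
    assert_eq!(
        lines.iter().map(|l| l.debit).sum::<f64>(),
        lines.iter().map(|l| l.credit).sum::<f64>()
    );
}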
6.2 Expense and Liability Manipulation
expense_liability_schemes:
# Expense deferral
expense_deferral:
techniques:
- improper_capitalization:
description: "Capitalize operating expenses"
accounts: [fixed_assets, intangibles]
indicators: [unusual_asset_growth, low_maintenance]
- reserve_manipulation:
description: "Cookie jar reserves"
pattern: [build_in_good_years, release_in_bad]
indicators: [volatile_provisions, earnings_smoothing]
- period_cutoff_manipulation:
description: "Push expenses to next period"
timing: [quarter_end, year_end]
techniques: [hold_invoices, delay_receipt]
# Liability concealment
liability_concealment:
techniques:
- off_balance_sheet:
description: "Structure to avoid consolidation"
vehicles: [SPEs, unconsolidated_subsidiaries]
concealment: [complex_structures, offshore]
- contingency_understatement:
description: "Understate legal/warranty liabilities"
rationalization: ["uncertain", "immaterial"]
indicators: [subsequent_large_settlements]
7. Fraud Red Flags and Indicators
7.1 Behavioral Red Flags
behavioral_red_flags:
# Employee behavior
employee_indicators:
- indicator: living_beyond_means
fraud_correlation: 0.45
detection_method: lifestyle_analysis
- indicator: financial_difficulties
fraud_correlation: 0.40
detection_method: background_check
- indicator: unusually_close_vendor_relationships
fraud_correlation: 0.35
detection_method: relationship_analysis
- indicator: control_issues_attitude
fraud_correlation: 0.30
detection_method: 360_feedback
- indicator: never_takes_vacation
fraud_correlation: 0.50
detection_method: hr_records
- indicator: excessive_overtime
fraud_correlation: 0.25
detection_method: time_records
# Transaction behavior
transaction_indicators:
- indicator: round_number_preference
fraud_correlation: 0.20
detection_method: benford_analysis
- indicator: just_below_threshold
fraud_correlation: 0.60
detection_method: threshold_analysis
- indicator: end_of_period_concentration
fraud_correlation: 0.35
detection_method: temporal_analysis
- indicator: unusual_journal_entries
fraud_correlation: 0.55
detection_method: journal_entry_testing
7.2 Red Flag Generation
red_flag_injection:
enabled: true
# Inject red flags that correlate with but don't prove fraud
correlations:
# Strong correlation - usually indicates fraud
strong:
- flag: matched_home_address_vendor_employee
fraud_probability: 0.85
inject_with_fraud: 0.90
inject_without_fraud: 0.001
- flag: sequential_check_numbers_to_same_vendor
fraud_probability: 0.70
inject_with_fraud: 0.80
inject_without_fraud: 0.01
# Moderate correlation - worth investigating
moderate:
- flag: vendor_no_physical_address
fraud_probability: 0.40
inject_with_fraud: 0.60
inject_without_fraud: 0.05
- flag: approval_just_under_threshold
fraud_probability: 0.35
inject_with_fraud: 0.70
inject_without_fraud: 0.10
# Weak correlation - often legitimate
weak:
- flag: round_number_invoice
fraud_probability: 0.15
inject_with_fraud: 0.40
inject_without_fraud: 0.20
- flag: end_of_month_timing
fraud_probability: 0.10
inject_with_fraud: 0.50
inject_without_fraud: 0.30
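Read as conditional probabilities, inject_with_fraud approximates P(flag | fraud) and inject_without_fraud approximates P(flag | no fraud), so a labeled dataset lets a detector recover the posterior with Bayes' rule. A minimal sketch in which the 1% base fraud rate is an illustrative assumption:
fn posterior_fraud_given_flag(p_flag_given_fraud: f64, p_flag_given_clean: f64, base_rate: f64) -> f64 {
    let numerator = p_flag_given_fraud * base_rate;
    let denominator = numerator + p_flag_given_clean * (1.0 - base_rate);
    numerator / denominator
}

fn main() {
    // matched_home_address_vendor_employee with a 1% base fraud rate:
    // 0.90 * 0.01 / (0.90 * 0.01 + 0.001 * 0.99) ≈ 0.90
    let p = posterior_fraud_given_flag(0.90, 0.001, 0.01);
    println!("P(fraud | flag) ≈ {p:.2}");
}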
8. Fraud Investigation Scenarios
8.1 Investigation-Ready Data
investigation_scenarios:
enabled: true
scenarios:
# Whistleblower scenario
whistleblower_tip:
allegation: "Vendor XYZ may be fictitious"
evidence_trail:
- vendor_setup_documents
- approval_chain
- payment_history
- address_verification
- phone_verification
hidden_clues:
- approver_is_also_requester
- address_is_ups_store
- phone_goes_to_employee
# Audit finding follow-up
audit_finding:
initial_finding: "Unusual vendor payment pattern"
investigation_path:
- transaction_sample
- vendor_analysis
- employee_relationship_map
- comparative_analysis
discovery_stages:
- stage_1: "Vendor has only one customer - us"
- stage_2: "All invoices approved by same person"
- stage_3: "Vendor address matches employee relative"
# Hotline report
anonymous_tip:
report: "Manager taking kickbacks from contractor"
evidence_available:
- contract_documents
- bid_history
- payment_records
- email_metadata
additional_clues:
- bids_always_awarded_to_same_contractor
- contract_amendments_increase_cost_30%
- manager_new_car_timing_correlates
8.2 Evidence Chain Generation
pub struct FraudEvidenceChain {
fraud_id: Uuid,
evidence_items: Vec<EvidenceItem>,
discovery_order: Vec<EvidenceId>,
linking_relationships: Vec<EvidenceLink>,
}
pub struct EvidenceItem {
id: EvidenceId,
item_type: EvidenceType,
content: EvidenceContent,
source_system: String,
timestamp: DateTime<Utc>,
accessibility: Accessibility, // How hard to find
probative_value: f64, // Strength as evidence
}
pub enum EvidenceType {
Transaction,
Document,
Communication,
SystemLog,
ExternalRecord,
WitnessStatement,
PhysicalEvidence,
}
impl FraudEvidenceChain {
/// Generate investigation-ready evidence trail
pub fn generate_trail(&self) -> InvestigationTrail {
// Order evidence by discoverability
// Create logical links between items
// Add red herrings (false leads that are eliminated)
// Include corroborating evidence
// Assembly intentionally left as a stub in this research sketch.
todo!("order evidence_items by accessibility, link them, and assemble an InvestigationTrail")
}
}
9. Implementation Priority
| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| ACFE-aligned taxonomy | Low | High | P1 |
| Collusion modeling | High | High | P1 |
| Concealment techniques | Medium | High | P1 |
| Management override | Medium | High | P1 |
| Adaptive behavior | High | Medium | P2 |
| Financial statement fraud | High | High | P1 |
| Red flag generation | Medium | High | P1 |
| Investigation scenarios | Medium | Medium | P2 |
| Industry-specific patterns | Medium | Medium | P2 |
10. Validation and Calibration
fraud_validation:
# Calibration against real-world statistics
calibration:
source: acfe_report_to_the_nations_2024
metrics:
median_loss: 117000
median_duration_months: 12
detection_methods:
tip: 0.42
internal_audit: 0.16
management_review: 0.12
external_audit: 0.04
accident: 0.06
perpetrator_departments:
accounting: 0.21
operations: 0.17
executive: 0.12
sales: 0.11
customer_service: 0.08
# Distribution validation
distribution_checks:
- metric: loss_distribution
expected: lognormal
parameters_from: acfe_data
- metric: duration_distribution
expected: exponential
mean_months: 12
- metric: detection_method_distribution
expected: categorical
match_acfe: true
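One concrete form of the calibration check above: compare the median loss of generated fraud cases against the ACFE target within a relative tolerance. The function names and the 10% tolerance are assumptions:
fn median(mut values: Vec<f64>) -> f64 {
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = values.len();
    if n % 2 == 1 { values[n / 2] } else { (values[n / 2 - 1] + values[n / 2]) / 2.0 }
}

fn median_loss_within_tolerance(losses: &[f64], target: f64, rel_tolerance: f64) -> bool {
    let m = median(losses.to_vec());
    (m - target).abs() / target <= rel_tolerance
}

fn main() {
    let losses = vec![45_000.0, 80_000.0, 120_000.0, 115_000.0, 300_000.0];
    // Median is 115,000, within 10% of the 117,000 target.
    println!("{}", median_loss_within_tolerance(&losses, 117_000.0, 0.10)); // true
}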
See also: 08-domain-specific.md for industry-specific enhancements
Research: Domain-Specific Enhancements
Implementation Status: COMPLETE (v0.3.0)
This research document has been fully implemented in v0.3.0. See:
- Industry-Specific Features Documentation
- CHANGELOG.md for detailed feature list
Key implementations:
- Manufacturing: BOM, routings, work centers, production variances
- Retail: POS transactions, shrinkage, loss prevention
- Healthcare: Revenue cycle, ICD-10/CPT/DRG coding, payer mix
- Technology: License/subscription revenue, R&D capitalization
- Financial Services: Loan origination, trading, regulatory frameworks
- Professional Services: Time & billing, trust accounting
- Industry-specific anomaly patterns for each sector
- Industry-specific ML benchmarks
Current State Analysis
Existing Industry Support
| Industry | Configuration | Generator Support | Realism |
|---|---|---|---|
| Manufacturing | Preset available | Good | Medium |
| Retail | Preset available | Good | Medium |
| Financial Services | Preset + Banking module | Strong | Good |
| Healthcare | Preset available | Basic | Low |
| Technology | Preset available | Basic | Low |
| Professional Services | Limited | Basic | Low |
Current Strengths
- Banking module: Comprehensive KYC/AML with fraud typologies
- Industry presets: 5 industry configurations available
- Seasonality profiles: 10 industry-specific patterns
- Standards support: IFRS, US GAAP, ISA, SOX frameworks
Current Gaps
- Shallow industry modeling: Generic patterns across industries
- Limited regulatory specificity: One-size-fits-all compliance
- Missing vertical-specific transactions: Generic document flows
- No industry-specific anomalies: Same fraud patterns everywhere
- Limited terminology: Generic naming regardless of industry
Industry-Specific Enhancement Recommendations
1. Manufacturing Industry
1.1 Enhanced Transaction Types
manufacturing:
transaction_types:
# Production-specific
production:
- work_order_issuance
- material_requisition
- labor_booking
- overhead_absorption
- scrap_reporting
- rework_order
- production_variance
# Inventory movements
inventory:
- raw_material_receipt
- wip_transfer
- finished_goods_transfer
- consignment_movement
- subcontractor_shipment
- cycle_count_adjustment
- physical_inventory_adjustment
# Cost accounting
costing:
- standard_cost_revaluation
- purchase_price_variance
- production_variance_allocation
- overhead_rate_adjustment
- interplant_transfer_pricing
# Manufacturing-specific master data
master_data:
bill_of_materials:
levels: 3-7
components_per_level: 2-15
yield_rates: 0.95-0.99
scrap_factors: 0.01-0.05
routings:
operations: 3-12
work_centers: 5-50
labor_rates: by_skill_level
machine_rates: by_equipment_type
production_orders:
types: [discrete, repetitive, process]
statuses: [planned, released, confirmed, completed]
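The bill_of_materials structure above implies multi-level quantity explosion when production orders consume materials. A minimal sketch of that explosion; the struct name and the recursion are illustrative, not the generator's implementation:
struct BomComponent {
    material: &'static str,
    quantity_per: f64,
    children: Vec<BomComponent>,
}

/// Accumulate total component quantities for one unit of the top-level item.
fn explode(component: &BomComponent, parent_qty: f64, totals: &mut std::collections::HashMap<&'static str, f64>) {
    let qty = parent_qty * component.quantity_per;
    *totals.entry(component.material).or_insert(0.0) += qty;
    for child in &component.children {
        explode(child, qty, totals);
    }
}

fn main() {
    let bike = BomComponent {
        material: "bicycle",
        quantity_per: 1.0,
        children: vec![
            BomComponent { material: "wheel", quantity_per: 2.0, children: vec![
                BomComponent { material: "spoke", quantity_per: 32.0, children: vec![] },
            ]},
            BomComponent { material: "frame", quantity_per: 1.0, children: vec![] },
        ],
    };
    let mut totals = std::collections::HashMap::new();
    explode(&bike, 1.0, &mut totals);
    // 64 spokes in total: 1 bicycle x 2 wheels x 32 spokes.
    assert_eq!(totals["spoke"], 64.0);
}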
1.2 Manufacturing Anomalies
manufacturing_anomalies:
production:
- type: yield_manipulation
description: "Inflating yield to hide scrap"
indicators: [abnormal_yield, missing_scrap_entries]
- type: labor_misallocation
description: "Charging labor to wrong orders"
indicators: [unusual_labor_distribution, overtime_patterns]
- type: phantom_production
description: "Recording production that didn't occur"
indicators: [no_material_consumption, missing_quality_records]
inventory:
- type: obsolete_inventory_concealment
description: "Failing to write down obsolete stock"
indicators: [no_movement_items, aging_without_provision]
- type: consignment_manipulation
description: "Recording consigned goods as owned"
indicators: [unusual_consignment_patterns, ownership_disputes]
costing:
- type: standard_cost_manipulation
description: "Setting unrealistic standards"
indicators: [persistent_favorable_variances, standard_changes]
- type: overhead_misallocation
description: "Allocating overhead to wrong products"
indicators: [margin_anomalies, allocation_base_changes]
2. Retail Industry
2.1 Enhanced Transaction Types
retail:
transaction_types:
# Point of Sale
pos:
- cash_sale
- credit_card_sale
- debit_sale
- gift_card_sale
- layaway_transaction
- special_order
- rain_check
# Returns and adjustments
returns:
- customer_return
- exchange
- price_adjustment
- markdown
- damage_writeoff
- vendor_return
# Inventory
inventory:
- receiving
- transfer_in
- transfer_out
- cycle_count
- shrinkage_adjustment
- donation
- disposal
# Promotions
promotions:
- coupon_redemption
- loyalty_redemption
- bundle_discount
- flash_sale
- clearance_markdown
# Retail-specific metrics
metrics:
same_store_sales: by_period
basket_size: average_and_distribution
conversion_rate: by_store_type
shrinkage_rate: by_category
markdown_percentage: by_season
inventory_turn: by_category
2.2 Retail Anomalies
retail_anomalies:
pos_fraud:
- type: sweethearting
description: "Employee gives free/discounted items to friends"
indicators: [high_void_rate, specific_cashier_patterns]
- type: skimming
description: "Not recording cash sales"
indicators: [cash_short, transaction_gaps]
- type: refund_fraud
description: "Fraudulent refunds to personal cards"
indicators: [refund_patterns, card_number_reuse]
inventory_fraud:
- type: receiving_fraud
description: "Collusion with vendors on short shipments"
indicators: [variance_patterns, vendor_concentration]
- type: transfer_fraud
description: "Fake transfers to cover theft"
indicators: [transfer_without_receipt, location_patterns]
promotional_abuse:
- type: coupon_fraud
description: "Applying coupons without customer purchase"
indicators: [high_coupon_rate, timing_patterns]
- type: employee_discount_abuse
description: "Using employee discount for non-employees"
indicators: [discount_volume, transaction_timing]
3. Healthcare Industry
3.1 Enhanced Transaction Types
healthcare:
transaction_types:
# Revenue cycle
revenue:
- patient_registration
- charge_capture
- claim_submission
- payment_posting
- denial_management
- patient_billing
- collection_activity
# Clinical operations
clinical:
- supply_consumption
- pharmacy_dispensing
- procedure_coding
- diagnosis_coding
- medical_record_documentation
# Payer transactions
payer:
- contract_payment
- capitation_payment
- risk_adjustment
- quality_bonus
- value_based_payment
# Healthcare-specific elements
elements:
coding:
icd10: diagnostic_codes
cpt: procedure_codes
drg: diagnosis_related_groups
hcpcs: healthcare_common_procedure_coding_system
payers:
types: [medicare, medicaid, commercial, self_pay]
mix_distribution: configurable
contract_terms: by_payer
compliance:
hipaa: true
stark_law: true
anti_kickback: true
false_claims_act: true
3.2 Healthcare Anomalies
healthcare_anomalies:
billing_fraud:
- type: upcoding
description: "Billing for more expensive service than provided"
indicators: [code_distribution_shift, complexity_increase]
- type: unbundling
description: "Billing separately for bundled services"
indicators: [modifier_patterns, procedure_combinations]
- type: phantom_billing
description: "Billing for services not rendered"
indicators: [impossible_combinations, deceased_patient_billing]
- type: duplicate_billing
description: "Billing multiple times for same service"
indicators: [same_day_duplicates, claim_resubmission_patterns]
kickback_schemes:
- type: physician_referral_kickback
description: "Payments for patient referrals"
indicators: [referral_concentration, payment_timing]
- type: medical_director_fraud
description: "Sham medical director agreements"
indicators: [no_services_rendered, excessive_compensation]
compliance_violations:
- type: hipaa_violation
description: "Unauthorized access to patient records"
indicators: [access_patterns, audit_log_anomalies]
- type: credential_fraud
description: "Using credentials of another provider"
indicators: [impossible_geography, timing_conflicts]
4. Technology Industry
4.1 Enhanced Transaction Types
technology:
transaction_types:
# Revenue recognition (ASC 606)
revenue:
- license_revenue
- subscription_revenue
- professional_services
- maintenance_revenue
- usage_based_revenue
- milestone_based_revenue
# Software development
development:
- r_and_d_expense
- capitalized_software
- amortization
- impairment_testing
# Cloud operations
cloud:
- hosting_costs
- bandwidth_costs
- storage_costs
- compute_costs
- third_party_services
# Sales and marketing
sales:
- commission_expense
- deferred_commission
- customer_acquisition_cost
- marketing_program_expense
# Tech-specific accounting
accounting:
revenue_recognition:
multiple_element_arrangements: true
variable_consideration: true
contract_modifications: true
software_development:
capitalization_criteria: true
useful_life_determination: true
impairment_testing: annual
stock_compensation:
option_valuation: black_scholes
rsu_accounting: true
performance_units: true
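For subscription_revenue under ASC 606, the generator's entries amount to straight-line recognition of deferred revenue over the contract term. A minimal sketch of that split for an annual contract billed up front; the function name and the twelve-month term are illustrative assumptions:
fn recognised_and_deferred(annual_contract_value: f64, months_elapsed: u32) -> (f64, f64) {
    let monthly = annual_contract_value / 12.0;
    let recognised = monthly * months_elapsed.min(12) as f64;
    (recognised, annual_contract_value - recognised)
}

fn main() {
    // 120,000 annual contract, three months in: 30,000 recognised, 90,000 deferred.
    let (rec, def) = recognised_and_deferred(120_000.0, 3);
    println!("recognised: {rec}, deferred: {def}");
}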
4.2 Technology Anomalies
technology_anomalies:
revenue_fraud:
- type: premature_license_recognition
description: "Recognizing license revenue before delivery criteria met"
indicators: [quarter_end_concentration, delivery_delays]
- type: side_letter_abuse
description: "Hidden terms that negate revenue recognition"
indicators: [unusual_contract_terms, customer_complaints]
- type: channel_stuffing
description: "Forcing product on resellers at period end"
indicators: [reseller_inventory_buildup, returns_next_quarter]
capitalization_fraud:
- type: improper_capitalization
description: "Capitalizing expenses that should be expensed"
indicators: [r_and_d_ratio_changes, asset_growth]
- type: useful_life_manipulation
description: "Extending useful life to reduce amortization"
indicators: [useful_life_changes, peer_comparison]
stock_compensation:
- type: options_backdating
description: "Selecting favorable grant dates retroactively"
indicators: [grant_date_patterns, exercise_price_analysis]
- type: vesting_manipulation
description: "Accelerating vesting to manage earnings"
indicators: [vesting_schedule_changes, departure_timing]
5. Financial Services Industry
5.1 Enhanced Transaction Types
financial_services:
transaction_types:
# Banking operations
banking:
- loan_origination
- loan_disbursement
- loan_payment
- interest_accrual
- fee_income
- deposit_transaction
- wire_transfer
- ach_transaction
# Investment operations
investments:
- trade_execution
- trade_settlement
- dividend_receipt
- interest_receipt
- mark_to_market
- realized_gain_loss
- unrealized_gain_loss
# Insurance operations
insurance:
- premium_collection
- claim_payment
- reserve_adjustment
- reinsurance_transaction
- commission_payment
- policy_acquisition_cost
# Asset management
asset_management:
- management_fee
- performance_fee
- distribution
- capital_call
- redemption
# Regulatory requirements
regulatory:
capital_requirements:
basel_iii: true
leverage_ratio: true
liquidity_coverage: true
reporting:
call_reports: true
form_10k_10q: true
form_13f: true
sar_filing: true
5.2 Financial Services Anomalies
financial_services_anomalies:
lending_fraud:
- type: loan_fraud
description: "Falsified loan applications"
indicators: [documentation_inconsistencies, verification_failures]
- type: appraisal_fraud
description: "Inflated property valuations"
indicators: [appraisal_variances, appraiser_concentration]
- type: straw_borrower
description: "Using nominee to obtain loans"
indicators: [relationship_patterns, fund_flow_analysis]
trading_fraud:
- type: wash_trading
description: "Buying and selling same security to inflate volume"
indicators: [self_trades, volume_patterns]
- type: front_running
description: "Trading ahead of customer orders"
indicators: [timing_analysis, profitability_patterns]
- type: churning
description: "Excessive trading to generate commissions"
indicators: [turnover_ratio, commission_patterns]
insurance_fraud:
- type: premium_theft
description: "Agent pocketing premiums"
indicators: [lapsed_policies, customer_complaints]
- type: claims_fraud
description: "Fraudulent or inflated claims"
indicators: [claim_patterns, adjuster_analysis]
- type: reserve_manipulation
description: "Understating claim reserves"
indicators: [reserve_development, adequacy_analysis]
6. Professional Services
6.1 Enhanced Transaction Types
professional_services:
transaction_types:
# Time and billing
billing:
- time_entry
- expense_entry
- invoice_generation
- write_off_adjustment
- realization_adjustment
- wip_adjustment
# Engagement management
engagement:
- engagement_setup
- budget_allocation
- milestone_billing
- retainer_application
- contingency_fee
# Resource management
resource:
- staff_allocation
- contractor_engagement
- subcontractor_payment
- expert_fee
# Client accounting
client:
- trust_deposit
- trust_withdrawal
- cost_advance
- client_reimbursement
# Professional-specific metrics
metrics:
utilization_rate: by_level
realization_rate: by_practice
collection_rate: by_client
leverage_ratio: staff_to_partner
revenue_per_professional: by_level
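The metrics above follow common professional-services definitions: utilization compares billable to available hours, realization compares billed (or collected) value to the standard value of the hours worked. A minimal sketch with illustrative function names:
fn utilization_rate(billable_hours: f64, available_hours: f64) -> f64 {
    billable_hours / available_hours
}

fn realization_rate(billed_value: f64, standard_value_of_hours: f64) -> f64 {
    billed_value / standard_value_of_hours
}

fn main() {
    // 1,500 billable of 2,000 available hours = 75% utilization;
    // 450,000 billed against 500,000 standard value = 90% realization.
    println!("{:.2}", utilization_rate(1_500.0, 2_000.0));
    println!("{:.2}", realization_rate(450_000.0, 500_000.0));
}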
6.2 Professional Services Anomalies
professional_services_anomalies:
billing_fraud:
- type: inflated_hours
description: "Billing for time not worked"
indicators: [impossible_hours, pattern_analysis]
- type: phantom_work
description: "Billing for work never performed"
indicators: [no_work_product, client_complaints]
- type: duplicate_billing
description: "Billing multiple clients for same time"
indicators: [time_overlap, total_hours_analysis]
expense_fraud:
- type: personal_expense_billing
description: "Charging personal expenses to clients"
indicators: [expense_patterns, vendor_analysis]
- type: markup_abuse
description: "Excessive markups on pass-through costs"
indicators: [markup_comparison, cost_analysis]
trust_account_fraud:
- type: commingling
description: "Mixing trust and operating funds"
indicators: [transfer_patterns, reconciliation_issues]
- type: misappropriation
description: "Using client funds for personal use"
indicators: [unauthorized_withdrawals, shortages]
7. Real Estate Industry
7.1 Enhanced Transaction Types
real_estate:
transaction_types:
# Property management
property:
- rent_collection
- cam_charges
- security_deposit
- lease_payment
- tenant_improvement
- property_tax
- insurance_expense
# Development
development:
- land_acquisition
- construction_draw
- development_fee
- capitalized_interest
- soft_cost
- hard_cost
# Investment
investment:
- property_acquisition
- property_disposition
- depreciation
- impairment
- fair_value_adjustment
- debt_service
# REIT-specific
reit:
- ffo_calculation
- dividend_distribution
- taxable_income
- section_1031_exchange
7.2 Real Estate Anomalies
real_estate_anomalies:
property_management:
- type: rent_skimming
description: "Not recording cash rent payments"
indicators: [occupancy_vs_revenue, cash_deposits]
- type: kickback_maintenance
description: "Receiving kickbacks from contractors"
indicators: [contractor_concentration, pricing_analysis]
development:
- type: cost_inflation
description: "Inflating development costs"
indicators: [cost_per_unit_comparison, change_order_patterns]
- type: capitalization_abuse
description: "Capitalizing operating expenses"
indicators: [capitalization_ratio, expense_classification]
valuation:
- type: appraisal_manipulation
description: "Influencing property appraisals"
indicators: [appraisal_vs_sale_price, appraiser_relationships]
- type: impairment_avoidance
description: "Failing to record impairments"
indicators: [occupancy_decline, market_comparisons]
8. Industry-Specific Configuration
8.1 Unified Industry Configuration
# Master industry configuration schema
industry_configuration:
industry: manufacturing # or retail, healthcare, etc.
# Industry-specific settings
settings:
transaction_types:
enabled: [production, inventory, costing]
weights:
production_orders: 0.30
inventory_movements: 0.40
cost_adjustments: 0.30
master_data:
bill_of_materials: true
routings: true
work_centers: true
production_resources: true
anomaly_injection:
industry_specific: true
generic: true
industry_weight: 0.60
terminology:
use_industry_terms: true
document_naming: industry_standard
account_descriptions: industry_specific
seasonality:
profile: manufacturing
custom_events:
- name: plant_shutdown
month: 7
duration_weeks: 2
activity_multiplier: 0.10
regulatory:
frameworks:
- environmental: epa
- safety: osha
- quality: iso_9001
# Cross-industry settings (inherit from base)
inherit:
- accounting_standards
- audit_standards
- control_framework
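Because the industry configuration is plain YAML, it maps onto typed config structs in the usual way. A minimal sketch, assuming the serde and serde_yaml crates and modelling only a few of the fields above; the struct names are assumptions, not the generator's config types:
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct IndustryConfiguration {
    industry: String,
    settings: Settings,
}

#[derive(Debug, Deserialize)]
struct Settings {
    anomaly_injection: AnomalyInjection,
}

#[derive(Debug, Deserialize)]
struct AnomalyInjection {
    industry_specific: bool,
    generic: bool,
    industry_weight: f64,
}

fn main() -> Result<(), serde_yaml::Error> {
    let yaml = r#"
industry: manufacturing
settings:
  anomaly_injection:
    industry_specific: true
    generic: true
    industry_weight: 0.60
"#;
    let cfg: IndustryConfiguration = serde_yaml::from_str(yaml)?;
    println!("{cfg:?}");
    Ok(())
}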
8.2 Industry Presets Enhancement
presets:
manufacturing_automotive:
base: manufacturing
customizations:
bom_depth: 7
just_in_time: true
quality_framework: iatf_16949
supplier_tiers: 3
defect_rates: very_low
retail_grocery:
base: retail
customizations:
perishable_inventory: true
high_volume_low_margin: true
shrinkage_focus: true
vendor_managed_inventory: true
healthcare_hospital:
base: healthcare
customizations:
inpatient: true
outpatient: true
emergency_services: true
ancillary_services: true
case_mix_complexity: high
technology_saas:
base: technology
customizations:
subscription_revenue: primary
professional_services: secondary
monthly_recurring_revenue: true
churn_modeling: true
financial_services_bank:
base: financial_services
customizations:
banking_charter: commercial
deposit_taking: true
lending: true
capital_markets: limited
9. Implementation Priority
| Industry | Enhancement Scope | Complexity | Priority |
|---|---|---|---|
| Manufacturing | Full enhancement | High | P1 |
| Retail | Full enhancement | Medium | P1 |
| Healthcare | Full enhancement | High | P1 |
| Technology | Revenue recognition | Medium | P2 |
| Financial Services | Extend banking module | Medium | P1 |
| Professional Services | New module | Medium | P2 |
| Real Estate | New module | Medium | P3 |
10. Terminology and Naming
industry_terminology:
manufacturing:
document_types:
purchase_order: "Production Purchase Order"
invoice: "Vendor Invoice"
receipt: "Goods Receipt / Material Document"
accounts:
wip: "Work in Process"
fg: "Finished Goods Inventory"
rm: "Raw Materials Inventory"
transactions:
production: "Manufacturing Order Settlement"
variance: "Production Variance Posting"
healthcare:
document_types:
invoice: "Claim"
payment: "Remittance Advice"
receipt: "Patient Payment"
accounts:
ar: "Patient Accounts Receivable"
revenue: "Net Patient Service Revenue"
contractual: "Contractual Allowance"
transactions:
billing: "Charge Capture"
collection: "Payment Posting"
# Similar for other industries...
Summary
This research document series provides a comprehensive analysis of improvement opportunities for the SyntheticData system. Key themes across all documents:
- Depth over breadth: Enhance existing features rather than adding new surface-level capabilities
- Correlation modeling: Move from independent generation to correlated, interconnected data
- Temporal realism: Add dynamic behavior that evolves over time
- Domain authenticity: Use real industry terminology, patterns, and regulations
- Detection-aware design: Generate data that enables meaningful ML training and evaluation
The recommended implementation approach is phased, starting with high-impact, lower-complexity enhancements and building toward more sophisticated modeling over time.
End of Research Document Series
Total documents: 8
Research conducted: January 2026
System version analyzed: 0.2.3