SyntheticData

High-Performance Synthetic Enterprise Financial Data Generator

Developed by Ernst & Young Ltd., Zurich, Switzerland

What is SyntheticData?

SyntheticData is a configurable synthetic data generator that produces realistic, interconnected enterprise financial data. It generates General Ledger journal entries, Chart of Accounts, SAP HANA-compatible ACDOCA event logs, document flows, subledger records, banking/KYC/AML transactions, OCEL 2.0 process mining data, audit workpapers, and ML-ready graph exports at scale.

The generator produces statistically accurate data based on empirical research from real-world general ledger patterns, ensuring that synthetic datasets exhibit the same characteristics as production data—including Benford’s Law compliance, temporal patterns, and document flow integrity.
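
As a quick sanity check of the Benford property, you can compare the leading-digit frequencies of generated amounts against the Benford distribution. A minimal Python sketch (the file path and the debit_amount column name are assumptions based on the output layout described later; adjust them to your output):

# benford_check.py - compare leading-digit frequencies against Benford's Law.
import csv
import math
from collections import Counter

counts = Counter()
with open("output/transactions/journal_entries.csv", newline="") as f:
    for row in csv.DictReader(f):
        value = row.get("debit_amount") or "0"
        digits = value.lstrip("-0.").replace(".", "")
        if digits and digits[0].isdigit():
            counts[digits[0]] += 1

total = sum(counts.values()) or 1
for d in "123456789":
    observed = counts[d] / total
    expected = math.log10(1 + 1 / int(d))   # Benford: P(d) = log10(1 + 1/d)
    print(f"digit {d}: observed {observed:.3f}  expected {expected:.3f}")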

New in v0.5.0: LLM-augmented generation (vendor names, descriptions, anomaly explanations), diffusion model backend (statistical denoising, hybrid generation), causal & counterfactual generation (SCMs, do-calculus interventions), federated fingerprinting, synthetic data certificates, and ecosystem integrations (Airflow, dbt, MLflow, Spark).

v0.3.0: ACFE-aligned fraud taxonomy, collusion modeling, industry-specific transactions (Manufacturing, Retail, Healthcare), and ML benchmarks.

v0.2.x: Privacy-preserving fingerprinting, accounting/audit standards (US GAAP, IFRS, ISA, SOX), streaming output API.

Section | Description
Getting Started | Installation, quick start guide, and demo mode
User Guide | CLI reference, server API, desktop UI, Python wrapper
Configuration | Complete YAML schema and presets
Architecture | System design, data flow, resource management
Crate Reference | Detailed crate documentation (15 crates)
Advanced Topics | Anomaly injection, graph export, fingerprinting, performance
Use Cases | Fraud detection, audit, AML/KYC, compliance

Key Features

Core Data Generation

Feature | Description
Statistical Distributions | Line item counts, amounts, and patterns based on empirical GL research
Benford’s Law Compliance | First-digit distribution following Benford’s Law with configurable fraud patterns
Industry Presets | Manufacturing, Retail, Financial Services, Healthcare, Technology
Chart of Accounts | Small (~100), Medium (~400), Large (~2500) account structures
Temporal Patterns | Month-end, quarter-end, year-end volume spikes with working hour modeling
Regional Calendars | Holiday calendars for US, DE, GB, CN, JP, IN with lunar calendar support

Enterprise Simulation

  • Master Data Management: Vendors, customers, materials, fixed assets, employees with temporal validity
  • Document Flow Engine: Complete P2P (Procure-to-Pay) and O2C (Order-to-Cash) processes with three-way matching
  • Intercompany Transactions: IC matching, transfer pricing, consolidation eliminations
  • Balance Coherence: Opening balances, running balance tracking, trial balance generation
  • Subledger Simulation: AR, AP, Fixed Assets, Inventory with GL reconciliation
  • Currency & FX: Realistic exchange rates (Ornstein-Uhlenbeck process), currency translation, CTA generation
  • Period Close Engine: Monthly close, depreciation runs, accruals, year-end closing
  • Banking/KYC/AML: Customer personas, KYC profiles, AML typologies (structuring, funnel, mule, layering, round-tripping)
  • Process Mining: OCEL 2.0 event logs with object-centric relationships
  • Audit Simulation: ISA-compliant engagements, workpapers, findings, risk assessments, professional judgments
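
The Currency & FX bullet above refers to an Ornstein-Uhlenbeck (mean-reverting) process for exchange rates. As a rough illustration of that idea (parameter values are invented and this is not the crate's implementation), a rate path can be simulated like this in Python:

# ou_sketch.py - illustrative Ornstein-Uhlenbeck FX rate simulation.
# Euler-Maruyama step: dX = theta * (mu - X) * dt + sigma * sqrt(dt) * N(0, 1)
import math
import random

def simulate_ou(rate0=1.10, mu=1.10, theta=0.05, sigma=0.01, days=250, seed=42):
    rng = random.Random(seed)
    dt = 1.0
    rates = [rate0]
    for _ in range(days):
        x = rates[-1]
        dx = theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        rates.append(x + dx)
    return rates

path = simulate_ou()
print(f"start={path[0]:.4f} end={path[-1]:.4f} min={min(path):.4f} max={max(path):.4f}")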

Fraud Patterns & Industry-Specific Features

  • ACFE-Aligned Fraud Taxonomy: Asset Misappropriation, Corruption, Financial Statement Fraud calibrated to ACFE statistics
  • Collusion & Conspiracy Modeling: Multi-party fraud networks with 9 ring types and role-based conspirators
  • Management Override: Senior-level fraud with fraud triangle modeling (Pressure, Opportunity, Rationalization)
  • Red Flag Generation: 40+ probabilistic fraud indicators with Bayesian probabilities
  • Industry-Specific Transactions: Manufacturing (BOM, WIP), Retail (POS, shrinkage), Healthcare (ICD-10, claims)
  • Industry-Specific Anomalies: Authentic fraud patterns per industry (upcoding, sweethearting, yield manipulation)

Machine Learning & Analytics

  • Graph Export: PyTorch Geometric, Neo4j, DGL, and RustGraph formats with train/val/test splits
  • Anomaly Injection: 60+ fraud types, errors, process issues with full labeling
  • Data Quality Variations: Missing values (MCAR, MAR, MNAR), format variations, duplicates, typos
  • Evaluation Framework: Auto-tuning with configuration recommendations based on metric gaps
  • ACFE Benchmarks: ML benchmarks calibrated to ACFE fraud statistics
  • Industry Benchmarks: Pre-configured benchmarks for fraud detection by industry

AI & ML-Powered Generation

  • LLM-Augmented Generation: Use LLMs to generate realistic vendor names, transaction descriptions, memo fields, and anomaly explanations via pluggable provider abstraction (Mock, OpenAI, Anthropic, Custom)
  • Natural Language Configuration: Generate YAML configs from natural language descriptions (init --from-description "Generate 1 year of retail data for a mid-size US company")
  • Diffusion Model Backend: Statistical diffusion with configurable noise schedules (linear, cosine, sigmoid) for learned distribution capture
  • Hybrid Generation: Blend rule-based and diffusion outputs using interpolation, selection, or ensemble strategies
  • Causal Generation: Define Structural Causal Models (SCMs) with interventional (“what-if”) and counterfactual generation
  • Built-in Causal Templates: Pre-configured fraud_detection and revenue_cycle causal graphs

Privacy-Preserving Fingerprinting

  • Fingerprint Extraction: Extract statistical properties from real data into .dsf files
  • Differential Privacy: Laplace and Gaussian mechanisms with configurable epsilon budget
  • K-Anonymity: Suppression of rare categorical values below configurable threshold
  • Privacy Audit Trail: Complete logging of all privacy decisions and epsilon spent
  • Fidelity Evaluation: Validate synthetic data matches original fingerprint (KS, Wasserstein, Benford MAD)
  • Gaussian Copula: Preserve multivariate correlations during synthesis
  • Federated Fingerprinting: Extract fingerprints from distributed data sources without centralization using secure aggregation (weighted average, median, trimmed mean)
  • Synthetic Data Certificates: Cryptographic proof of DP guarantees with HMAC-SHA256 signing, embeddable in Parquet metadata and JSON output
  • Privacy-Utility Pareto Frontier: Automated exploration of optimal epsilon values for given utility targets

Production Features

  • REST & gRPC APIs: Streaming generation with authentication and rate limiting
  • Desktop UI: Cross-platform Tauri/SvelteKit application with 15+ configuration pages
  • Resource Guards: Memory, disk, and CPU monitoring with graceful degradation
  • Graceful Degradation: Progressive feature reduction under resource pressure (Normal→Reduced→Minimal→Emergency)
  • Deterministic Generation: Seeded RNG (ChaCha8) for reproducible output
  • Python Wrapper: Programmatic access with blueprints and config validation

Performance

Metric | Performance
Single-threaded throughput | ~100,000+ entries/second
Parallel scaling | Linear with available cores
Memory efficiency | Streaming generation for large volumes

Use Cases

Use Case | Description
Fraud Detection ML | Train supervised models with labeled fraud patterns
Graph Neural Networks | Entity relationship graphs for anomaly detection
AML/KYC Testing | Banking transaction data with structuring, layering, mule patterns
Audit Analytics | Test audit procedures with known control exceptions
Process Mining | OCEL 2.0 event logs for process discovery
ERP Testing | Load testing with realistic transaction volumes
SOX Compliance | Test internal control monitoring systems
Data Quality ML | Train models to detect missing values, typos, duplicates
Causal Analysis | “What-if” scenarios and counterfactual generation for audit
LLM Training Data | Generate LLM-enriched training datasets with realistic metadata
Pipeline Orchestration | Integrate with Airflow, dbt, MLflow, and Spark pipelines

Quick Start

# Install from source
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release

# Run demo mode
./target/release/datasynth-data generate --demo --output ./output

# Or create a custom configuration
./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml
./target/release/datasynth-data generate --config config.yaml --output ./output

Fingerprinting (New in v0.2.0)

# Extract fingerprint from real data with privacy protection
./target/release/datasynth-data fingerprint extract \
    --input ./real_data.csv \
    --output ./fingerprint.dsf \
    --privacy-level standard

# Validate fingerprint integrity
./target/release/datasynth-data fingerprint validate ./fingerprint.dsf

# View fingerprint details
./target/release/datasynth-data fingerprint info ./fingerprint.dsf --detailed

# Evaluate synthetic data fidelity
./target/release/datasynth-data fingerprint evaluate \
    --fingerprint ./fingerprint.dsf \
    --synthetic ./synthetic_data/ \
    --threshold 0.8

LLM-Augmented Generation (New in v0.5.0)

# Generate config from natural language description
./target/release/datasynth-data init \
    --from-description "Generate 1 year of retail data for a mid-size US company with fraud patterns" \
    -o config.yaml

# Generate with LLM enrichment (uses mock provider by default)
./target/release/datasynth-data generate --config config.yaml --output ./output

Causal Generation (New in v0.5.0)

# Generate data with causal structure (fraud_detection template)
./target/release/datasynth-data causal generate \
    --template fraud_detection \
    --samples 10000 \
    --output ./causal_output

# Run intervention ("what-if" scenario)
./target/release/datasynth-data causal intervene \
    --template fraud_detection \
    --variable transaction_amount \
    --value 50000 \
    --samples 5000 \
    --output ./intervention_output

Diffusion Model Training (New in v0.5.0)

# Train a diffusion model on fingerprint data
./target/release/datasynth-data diffusion train \
    --fingerprint ./fingerprint.dsf \
    --output ./model.json

# Evaluate diffusion model fit
./target/release/datasynth-data diffusion evaluate \
    --model ./model.json \
    --samples 5000

Python Wrapper

from datasynth_py import DataSynth
from datasynth_py.config import blueprints

config = blueprints.retail_small(companies=4, transactions=10000)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
print(result.output_dir)

Architecture

SyntheticData is organized as a Rust workspace with 15 modular crates:

datasynth-cli          Command-line interface (binary: datasynth-data)
datasynth-server       REST/gRPC/WebSocket server with auth and rate limiting
datasynth-ui           Tauri/SvelteKit desktop application
    │
datasynth-runtime      Orchestration layer (parallel execution, resource guards)
    │
datasynth-generators   Data generators (JE, documents, subledgers, anomalies, audit)
datasynth-banking      KYC/AML banking transaction generator
datasynth-ocpm         Object-Centric Process Mining (OCEL 2.0)
datasynth-fingerprint  Privacy-preserving fingerprint extraction and synthesis
datasynth-standards    Accounting/audit standards (US GAAP, IFRS, ISA, SOX)
    │
datasynth-graph        Graph/network export (PyTorch Geometric, Neo4j, DGL)
datasynth-eval         Evaluation framework with auto-tuning
    │
datasynth-config       Configuration schema, validation, industry presets
    │
datasynth-core         Domain models, traits, distributions, resource guards
    │
datasynth-output       Output sinks (CSV, JSON, Parquet, streaming)
datasynth-test-utils   Test utilities, fixtures, mocks

License

Copyright 2024-2026 Michael Ivertowski, Ernst & Young Ltd., Zurich, Switzerland

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Support

Commercial support, custom development, and enterprise licensing are available upon request. Please contact the author at michael.ivertowski@ch.ey.com for inquiries.


SyntheticData is provided “as is” without warranty of any kind. It is intended for testing, development, and educational purposes. Generated data should not be used as a substitute for real financial records.

Getting Started

Welcome to SyntheticData! This section will help you get up and running quickly.

What You’ll Learn

  • Installation: Set up SyntheticData on your system
  • Quick Start: Generate your first synthetic dataset
  • Demo Mode: Explore SyntheticData with built-in demo presets

Prerequisites

Before you begin, ensure you have:

  • Rust 1.88+: SyntheticData is written in Rust and requires the Rust toolchain
  • Git: For cloning the repository
  • (Optional) Node.js 18+: Required only for the desktop UI

Installation Overview

# Clone and build
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release

# The binary is at target/release/datasynth-data

First Steps

The fastest way to explore SyntheticData is through demo mode:

datasynth-data generate --demo --output ./demo-output

This generates a complete set of synthetic financial data using sensible defaults.

Architecture at a Glance

SyntheticData generates interconnected financial data:

┌─────────────────────────────────────────────────────────────┐
│                    Configuration (YAML)                      │
├─────────────────────────────────────────────────────────────┤
│                    Generation Pipeline                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│  │  Master  │→│ Document │→│  Journal │→│  Output  │     │
│  │   Data   │  │  Flows   │  │ Entries  │  │  Files   │     │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘    │
├─────────────────────────────────────────────────────────────┤
│  Output: CSV, JSON, Neo4j, PyTorch Geometric, ACDOCA        │
└─────────────────────────────────────────────────────────────┘

Next Steps

  1. Follow the Installation Guide to set up your environment
  2. Work through the Quick Start Tutorial
  3. Explore Demo Mode for a hands-on introduction
  4. Review the CLI Reference for all available commands

Installation

This guide covers installing SyntheticData from source.

Prerequisites

Required

Requirement | Version | Purpose
Rust | 1.88+ | Compilation
Git | Any recent | Clone repository
C compiler | gcc/clang | Native dependencies

Optional

Requirement | Version | Purpose
Node.js | 18+ | Desktop UI
npm | 9+ | Desktop UI dependencies

Installing Rust

If you don’t have Rust installed, use rustup:

# Linux/macOS
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Windows
# Download and run rustup-init.exe from https://rustup.rs

# Verify installation
rustc --version
cargo --version

Building from Source

Clone the Repository

git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData

Build Release Binary

# Build optimized release binary
cargo build --release

# The binary is at target/release/datasynth-data

Verify Installation

# Check version
./target/release/datasynth-data --version

# View help
./target/release/datasynth-data --help

# Run quick validation
./target/release/datasynth-data info

Adding to PATH

To run datasynth-data from anywhere:

Linux/macOS

# Option 1: Symlink to local bin
ln -s $(pwd)/target/release/datasynth-data ~/.local/bin/datasynth-data

# Option 2: Copy to system bin (requires sudo)
sudo cp target/release/datasynth-data /usr/local/bin/

# Option 3: Add target/release to PATH in ~/.bashrc or ~/.zshrc
export PATH="$PATH:/path/to/SyntheticData/target/release"

Windows

Add the target/release directory to your system PATH environment variable.

Building the Desktop UI

The desktop UI requires additional setup:

# Navigate to UI crate
cd crates/datasynth-ui

# Install Node.js dependencies
npm install

# Run in development mode
npm run tauri dev

# Build production release
npm run tauri build

Platform-Specific Dependencies

Linux (Ubuntu/Debian):

sudo apt-get install libwebkit2gtk-4.1-dev \
    libgtk-3-dev \
    libayatana-appindicator3-dev \
    librsvg2-dev

macOS: No additional dependencies required.

Windows: Install WebView2 runtime (usually pre-installed on Windows 10/11).

Running Tests

Verify your installation by running the test suite:

# Run all tests
cargo test

# Run tests for a specific crate
cargo test -p datasynth-core
cargo test -p datasynth-generators

# Run with output
cargo test -- --nocapture

Development Setup

For development work:

# Check code without building
cargo check

# Format code
cargo fmt

# Run lints
cargo clippy

# Build documentation
cargo doc --workspace --no-deps --open

Running Benchmarks

# Run all benchmarks
cargo bench

# Run specific benchmark
cargo bench --bench generation_throughput

Troubleshooting

Build Failures

Missing C compiler:

# Ubuntu/Debian
sudo apt-get install build-essential

# macOS
xcode-select --install

# Fedora/RHEL
sudo dnf install gcc

Out of memory during build:

# Limit parallel jobs
cargo build --release -j 2

Runtime Issues

Permission denied:

chmod +x target/release/datasynth-data

Library not found (Linux):

# Check for missing dependencies
ldd target/release/datasynth-data

Quick Start

This guide walks you through generating your first synthetic financial dataset.

Overview

The typical workflow is:

  1. Initialize a configuration file
  2. Validate the configuration
  3. Generate synthetic data
  4. Review the output

Step 1: Initialize Configuration

Create a configuration file for your industry and complexity needs:

# Manufacturing company with medium complexity (~400 accounts)
datasynth-data init --industry manufacturing --complexity medium -o config.yaml

Available Industry Presets

Industry | Description
manufacturing | Production, inventory, cost accounting
retail | Sales, inventory, customer transactions
financial_services | Banking, investments, regulatory reporting
healthcare | Patient revenue, medical supplies, compliance
technology | R&D, SaaS revenue, deferred revenue

Complexity Levels

Level | Accounts | Description
small | ~100 | Simple chart of accounts
medium | ~400 | Typical mid-size company
large | ~2500 | Enterprise-scale structure

Step 2: Review Configuration

Open config.yaml to review and customize:

global:
  seed: 42                        # For reproducible generation
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
  group_currency: USD

companies:
  - code: "1000"
    name: "Headquarters"
    currency: USD
    country: US
    volume_weight: 1.0

transactions:
  target_count: 100000            # Number of journal entries

fraud:
  enabled: true
  fraud_rate: 0.005               # 0.5% fraud transactions

output:
  format: csv
  compression: none

See the Configuration Guide for all options.

Step 3: Validate Configuration

Check your configuration for errors:

datasynth-data validate --config config.yaml

The validator checks:

  • Required fields are present
  • Values are within valid ranges
  • Distribution weights sum to 1.0
  • Dates are consistent

Step 4: Generate Data

Run the generation:

datasynth-data generate --config config.yaml --output ./output

You’ll see a progress bar:

Generating synthetic data...
[████████████████████████████████] 100000/100000 entries
Generation complete in 1.2s

Step 5: Explore Output

The output directory contains organized subdirectories:

output/
├── master_data/
│   ├── vendors.csv
│   ├── customers.csv
│   ├── materials.csv
│   └── employees.csv
├── transactions/
│   ├── journal_entries.csv
│   ├── acdoca.csv
│   ├── purchase_orders.csv
│   └── vendor_invoices.csv
├── subledgers/
│   ├── ar_open_items.csv
│   └── ap_open_items.csv
├── period_close/
│   └── trial_balances/
├── labels/
│   ├── anomaly_labels.csv
│   └── fraud_labels.csv
└── controls/
    └── internal_controls.csv
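
To get a quick feel for the generated entries, here is a short Python sketch using pandas (the is_fraud field is documented in the CLI Reference; the label may arrive as booleans or as the strings "true"/"false", so it is normalized first):

# explore_output.py - quick look at the generated journal entries (requires pandas).
import pandas as pd

entries = pd.read_csv("output/transactions/journal_entries.csv")
print(f"{len(entries)} rows, columns: {entries.columns.tolist()}")

if "is_fraud" in entries.columns:
    is_fraud = entries["is_fraud"].astype(str).str.lower().eq("true")
    print(f"labeled fraud rate: {is_fraud.mean():.4%}")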

Common Customizations

Generate More Data

transactions:
  target_count: 1000000           # 1 million entries

Enable Graph Export

graph_export:
  enabled: true
  formats:
    - pytorch_geometric
    - neo4j

Add Anomaly Injection

anomaly_injection:
  enabled: true
  total_rate: 0.02                # 2% anomaly rate
  generate_labels: true           # For ML training

Multiple Companies

companies:
  - code: "1000"
    name: "Headquarters"
    currency: USD
    volume_weight: 0.6

  - code: "2000"
    name: "European Subsidiary"
    currency: EUR
    volume_weight: 0.4

Quick Reference

# Common commands
datasynth-data init --industry <industry> --complexity <level> -o config.yaml
datasynth-data validate --config config.yaml
datasynth-data generate --config config.yaml --output ./output
datasynth-data generate --demo --output ./demo-output
datasynth-data info                   # Show available presets

Demo Mode

Demo mode provides a quick way to explore SyntheticData without creating a configuration file. It uses sensible defaults to generate a complete synthetic dataset.

Running Demo Mode

datasynth-data generate --demo --output ./demo-output

What Demo Mode Generates

Demo mode creates a comprehensive dataset with:

Category | Contents
Master Data | Vendors, customers, materials, employees
Transactions | ~10,000 journal entries
Document Flows | P2P and O2C process documents
Subledgers | AR and AP records
Period Close | Trial balances
Controls | Internal control mappings

Demo Configuration

Demo mode uses these defaults:

global:
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 3
  group_currency: USD

companies:
  - code: "1000"
    name: "Demo Company"
    currency: USD
    country: US

chart_of_accounts:
  complexity: medium              # ~400 accounts

transactions:
  target_count: 10000

fraud:
  enabled: true
  fraud_rate: 0.005

anomaly_injection:
  enabled: true
  total_rate: 0.01
  generate_labels: true

output:
  format: csv

Output Structure

After running demo mode, explore the output:

tree demo-output/
demo-output/
├── master_data/
│   ├── chart_of_accounts.csv     # GL accounts
│   ├── vendors.csv               # Vendor master
│   ├── customers.csv             # Customer master
│   ├── materials.csv             # Material/product master
│   └── employees.csv             # Employee/user master
├── transactions/
│   ├── journal_entries.csv       # Main JE file
│   ├── acdoca.csv                # SAP HANA format
│   ├── purchase_orders.csv       # P2P documents
│   ├── goods_receipts.csv
│   ├── vendor_invoices.csv
│   ├── payments.csv
│   ├── sales_orders.csv          # O2C documents
│   ├── deliveries.csv
│   ├── customer_invoices.csv
│   └── customer_receipts.csv
├── subledgers/
│   ├── ar_open_items.csv
│   ├── ap_open_items.csv
│   └── inventory_positions.csv
├── period_close/
│   └── trial_balances/
│       ├── 2024_01.csv
│       ├── 2024_02.csv
│       └── 2024_03.csv
├── labels/
│   ├── anomaly_labels.csv        # For ML training
│   └── fraud_labels.csv
└── controls/
    ├── internal_controls.csv
    └── sod_rules.csv

Exploring the Data

Journal Entries

head -5 demo-output/transactions/journal_entries.csv

Key fields:

  • document_id: Unique transaction identifier
  • posting_date: When the entry was posted
  • company_code: Company identifier
  • account_number: GL account
  • debit_amount / credit_amount: Entry amounts
  • is_fraud: Fraud label (true/false)
  • is_anomaly: Anomaly label

Fraud Labels

# View fraud transactions
grep "true" demo-output/labels/fraud_labels.csv | head

Trial Balance

# Check balance coherence
head demo-output/period_close/trial_balances/2024_01.csv
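
Beyond eyeballing the first rows, you can confirm that total debits equal total credits for the period. A minimal sketch (column names follow the trial balance layout in Output Formats):

# check_trial_balance.py - verify that period debits equal period credits.
import csv
from decimal import Decimal

debits = credits = Decimal("0")
with open("demo-output/period_close/trial_balances/2024_01.csv", newline="") as f:
    for row in csv.DictReader(f):
        debits += Decimal(row["period_debits"])
        credits += Decimal(row["period_credits"])

print(f"total debits : {debits}")
print(f"total credits: {credits}")
print("balanced" if debits == credits else "NOT balanced")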

Customizing Demo Output

You can combine demo mode with some options:

# Change output directory
datasynth-data generate --demo --output ./my-demo

# Use demo as starting point, then create config
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
# Edit config.yaml as needed
datasynth-data generate --config config.yaml --output ./output

Use Cases for Demo Mode

Quick Exploration

Test SyntheticData’s capabilities before creating a custom configuration.

Development Testing

Generate test data quickly for development purposes.

Training & Workshops

Provide sample data for training sessions without complex setup.

Benchmarking

Establish baseline performance metrics.

Moving Beyond Demo Mode

When you’re ready for more control:

  1. Create a configuration file:

    datasynth-data init --industry <your-industry> -o config.yaml
    
  2. Customize settings:

    • Adjust transaction volume
    • Configure multiple companies
    • Enable graph export
    • Fine-tune fraud/anomaly rates
  3. Generate with your config:

    datasynth-data generate --config config.yaml --output ./output
    

User Guide

This section covers the different ways to use SyntheticData.

Interfaces

SyntheticData offers three interfaces:

Interface | Use Case
CLI | Command-line generation, scripting, automation
Server API | REST/gRPC/WebSocket for applications
Desktop UI | Visual configuration and monitoring

Quick Comparison

Feature | CLI | Server | Desktop UI
Configuration editing | YAML files | API endpoints | Visual forms
Batch generation | Yes | Yes | Yes
Streaming generation | No | Yes | Yes (view)
Progress monitoring | Progress bar | WebSocket | Real-time
Scripting/automation | Yes | Yes | No
Visual feedback | Minimal | None | Full

CLI Overview

The command-line interface (datasynth-data) is ideal for:

  • Batch generation
  • CI/CD pipelines
  • Scripting and automation
  • Server environments

datasynth-data generate --config config.yaml --output ./output

Server Overview

The server (datasynth-server) provides:

  • REST API for configuration and control
  • gRPC for high-performance integration
  • WebSocket for real-time streaming

cargo run -p datasynth-server -- --port 3000

Desktop UI Overview

The desktop application offers:

  • Visual configuration editor
  • Industry preset selector
  • Real-time generation monitoring
  • Cross-platform support (Windows, macOS, Linux)

cd crates/datasynth-ui && npm run tauri dev

Output Formats

SyntheticData produces various output formats:

  • CSV: Standard tabular data
  • JSON: Structured data with nested objects
  • ACDOCA: SAP HANA Universal Journal format
  • PyTorch Geometric: ML-ready graph tensors
  • Neo4j: Graph database import format

See Output Formats for details.

Choosing an Interface

Use the CLI if you:

  • Need to automate generation
  • Work in headless/server environments
  • Prefer command-line tools
  • Want to integrate with shell scripts

Use the Server if you:

  • Build applications that consume synthetic data
  • Need streaming generation
  • Want API-based control
  • Integrate with microservices

Use the Desktop UI if you:

  • Prefer visual configuration
  • Want to explore options interactively
  • Need real-time monitoring
  • Are new to SyntheticData

CLI Reference

The datasynth-data command-line tool provides commands for generating synthetic financial data and extracting fingerprints from real data.

Installation

After building the project, the binary is at target/release/datasynth-data.

cargo build --release
./target/release/datasynth-data --help

Global Options

Option | Description
-h, --help | Show help information
-V, --version | Show version number
-v, --verbose | Enable verbose output
-q, --quiet | Suppress non-error output

Commands

generate

Generate synthetic financial data.

datasynth-data generate [OPTIONS]

Options:

Option | Type | Description
--config <PATH> | Path | Configuration YAML file
--demo | Flag | Use demo preset instead of config
--output <DIR> | Path | Output directory (required)
--format <FMT> | String | Output format: csv, json
--seed <NUM> | u64 | Override random seed

Examples:

# Generate with configuration file
datasynth-data generate --config config.yaml --output ./output

# Use demo mode
datasynth-data generate --demo --output ./demo-output

# Override seed for reproducibility
datasynth-data generate --config config.yaml --output ./output --seed 12345

# JSON output format
datasynth-data generate --config config.yaml --output ./output --format json

init

Create a new configuration file from industry presets.

datasynth-data init [OPTIONS]

Options:

Option | Type | Description
--industry <NAME> | String | Industry preset
--complexity <LEVEL> | String | small, medium, large
-o, --output <PATH> | Path | Output file path

Available Industries:

  • manufacturing
  • retail
  • financial_services
  • healthcare
  • technology

Examples:

# Create manufacturing config
datasynth-data init --industry manufacturing --complexity medium -o config.yaml

# Create large retail config
datasynth-data init --industry retail --complexity large -o retail.yaml

validate

Validate a configuration file.

datasynth-data validate --config <PATH>

Options:

Option | Type | Description
--config <PATH> | Path | Configuration file to validate

Example:

datasynth-data validate --config config.yaml

Validation Checks:

  • Required fields present
  • Value ranges (period_months: 1-120)
  • Distribution weights sum to 1.0 (±0.01 tolerance)
  • Date consistency
  • Company code uniqueness
  • Compression level: 1-9 when enabled
  • All rate/percentage fields: 0.0-1.0
  • Approval thresholds: strictly ascending order

info

Display available presets and configuration options.

datasynth-data info

Output includes:

  • Available industry presets
  • Complexity levels
  • Supported output formats
  • Feature capabilities

fingerprint

Privacy-preserving fingerprint extraction and evaluation. This command has several subcommands.

datasynth-data fingerprint <SUBCOMMAND>

fingerprint extract

Extract a fingerprint from real data with privacy controls.

datasynth-data fingerprint extract [OPTIONS]

Options:

Option | Type | Description
--input <PATH> | Path | Input CSV data file (required)
--output <PATH> | Path | Output .dsf fingerprint file (required)
--privacy-level <LEVEL> | String | Privacy level: minimal, standard, high, maximum
--epsilon <FLOAT> | f64 | Custom differential privacy epsilon (overrides level)
--k <INT> | usize | Custom k-anonymity threshold (overrides level)

Privacy Levels:

Level | Epsilon | k | Outlier % | Use Case
minimal | 5.0 | 3 | 99% | Low privacy, high utility
standard | 1.0 | 5 | 95% | Balanced (default)
high | 0.5 | 10 | 90% | Higher privacy
maximum | 0.1 | 20 | 85% | Maximum privacy
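
For intuition on what the epsilon budget controls: the Laplace mechanism adds noise with scale sensitivity/epsilon to each released statistic, so a smaller epsilon means proportionally more noise. A toy Python illustration of the concept (not the crate's implementation):

# laplace_sketch.py - conceptual Laplace mechanism: noise scale = sensitivity / epsilon.
import numpy as np

def laplace_release(true_value, sensitivity, epsilon, rng):
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(42)
true_mean = 1234.56                   # e.g. a column mean being fingerprinted
for eps in (5.0, 1.0, 0.5, 0.1):      # the four privacy levels above
    print(f"epsilon={eps:>4}: released {laplace_release(true_mean, 1.0, eps, rng):.2f}")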

Examples:

# Extract with standard privacy
datasynth-data fingerprint extract \
    --input ./real_data.csv \
    --output ./fingerprint.dsf \
    --privacy-level standard

# Extract with custom privacy parameters
datasynth-data fingerprint extract \
    --input ./real_data.csv \
    --output ./fingerprint.dsf \
    --epsilon 0.75 \
    --k 8

fingerprint validate

Validate a fingerprint file’s integrity and structure.

datasynth-data fingerprint validate <PATH>

Arguments:

Argument | Type | Description
<PATH> | Path | Path to .dsf fingerprint file

Validation Checks:

  • DSF file structure (ZIP archive with required components)
  • SHA-256 checksums for all components
  • Required fields in manifest, schema, statistics
  • Privacy audit completeness

Example:

datasynth-data fingerprint validate ./fingerprint.dsf

fingerprint info

Display fingerprint metadata and statistics.

datasynth-data fingerprint info <PATH> [OPTIONS]

Arguments:

Argument | Type | Description
<PATH> | Path | Path to .dsf fingerprint file

Options:

Option | Type | Description
--detailed | Flag | Show detailed statistics
--json | Flag | Output as JSON

Examples:

# Basic info
datasynth-data fingerprint info ./fingerprint.dsf

# Detailed statistics
datasynth-data fingerprint info ./fingerprint.dsf --detailed

# JSON output for scripting
datasynth-data fingerprint info ./fingerprint.dsf --json

fingerprint diff

Compare two fingerprints.

datasynth-data fingerprint diff <PATH1> <PATH2>

Arguments:

Argument | Type | Description
<PATH1> | Path | First .dsf fingerprint file
<PATH2> | Path | Second .dsf fingerprint file

Output includes:

  • Schema differences (columns added/removed/changed)
  • Statistical distribution changes
  • Correlation matrix differences

Example:

datasynth-data fingerprint diff ./fp_v1.dsf ./fp_v2.dsf

fingerprint evaluate

Evaluate synthetic data fidelity against a fingerprint.

datasynth-data fingerprint evaluate [OPTIONS]

Options:

Option | Type | Description
--fingerprint <PATH> | Path | Reference .dsf fingerprint file (required)
--synthetic <PATH> | Path | Directory containing synthetic data (required)
--threshold <FLOAT> | f64 | Minimum fidelity score (0.0-1.0, default 0.8)
--report <PATH> | Path | Output report file (HTML or JSON based on extension)

Fidelity Metrics:

  • Statistical: KS statistic, Wasserstein distance, Benford MAD
  • Correlation: Correlation matrix RMSE
  • Schema: Column type match, row count ratio
  • Rules: Balance equation compliance rate
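
The statistical metrics can also be recomputed independently with SciPy if you want to cross-check a report. A rough sketch comparing an amount column between real and synthetic data (file paths and column names are placeholders):

# fidelity_sketch.py - recompute KS and Wasserstein metrics with pandas/SciPy.
import pandas as pd
from scipy import stats

real = pd.read_csv("real_data.csv")["amount"].dropna()
synthetic = pd.read_csv("synthetic_data/transactions/journal_entries.csv")["debit_amount"].dropna()

ks_stat, _ = stats.ks_2samp(real, synthetic)           # Kolmogorov-Smirnov statistic
w_dist = stats.wasserstein_distance(real, synthetic)   # earth mover's distance
print(f"KS statistic: {ks_stat:.4f}")
print(f"Wasserstein distance: {w_dist:.4f}")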

Examples:

# Basic evaluation
datasynth-data fingerprint evaluate \
    --fingerprint ./fingerprint.dsf \
    --synthetic ./synthetic_data/ \
    --threshold 0.8

# Generate HTML report
datasynth-data fingerprint evaluate \
    --fingerprint ./fingerprint.dsf \
    --synthetic ./synthetic_data/ \
    --threshold 0.85 \
    --report ./fidelity_report.html

diffusion (v0.5.0)

Train and evaluate diffusion models for statistical data generation.

diffusion train

Train a diffusion model from a fingerprint file.

datasynth-data diffusion train \
    --fingerprint ./fingerprint.dsf \
    --output ./model.json \
    --n-steps 1000 \
    --schedule cosine

Option | Type | Default | Description
--fingerprint | path | (required) | Path to .dsf fingerprint file
--output | path | (required) | Output path for trained model
--n-steps | integer | 1000 | Number of diffusion steps
--schedule | string | linear | Noise schedule: linear, cosine, sigmoid

diffusion evaluate

Evaluate a trained diffusion model’s fit quality.

datasynth-data diffusion evaluate \
    --model ./model.json \
    --samples 5000

Option | Type | Default | Description
--model | path | (required) | Path to trained model
--samples | integer | 1000 | Number of evaluation samples

causal (v0.5.0)

Generate data with causal structure, run interventions, and produce counterfactuals.

causal generate

Generate data following a causal graph structure.

datasynth-data causal generate \
    --template fraud_detection \
    --samples 10000 \
    --seed 42 \
    --output ./causal_output

Option | Type | Default | Description
--template | string | (required) | Built-in template (fraud_detection, revenue_cycle) or path to custom YAML
--samples | integer | 1000 | Number of samples to generate
--seed | integer | (random) | Random seed for reproducibility
--output | path | (required) | Output directory

causal intervene

Run do-calculus interventions (“what-if” scenarios).

datasynth-data causal intervene \
    --template fraud_detection \
    --variable transaction_amount \
    --value 50000 \
    --samples 5000 \
    --output ./intervention_output

Option | Type | Default | Description
--template | string | (required) | Causal template or YAML path
--variable | string | (required) | Variable to intervene on
--value | float | (required) | Value to set the variable to
--samples | integer | 1000 | Number of intervention samples
--output | path | (required) | Output directory

causal validate

Validate that generated data preserves causal structure.

datasynth-data causal validate \
    --data ./causal_output \
    --template fraud_detection

Option | Type | Default | Description
--data | path | (required) | Path to generated data
--template | string | (required) | Causal template to validate against

fingerprint federated (v0.5.0)

Aggregate fingerprints from multiple distributed sources without centralizing raw data.

datasynth-data fingerprint federated \
    --sources ./source_a.dsf ./source_b.dsf ./source_c.dsf \
    --output ./aggregated.dsf \
    --method weighted_average \
    --max-epsilon 5.0

Option | Type | Default | Description
--sources | paths | (required) | Two or more .dsf fingerprint files
--output | path | (required) | Output path for aggregated fingerprint
--method | string | weighted_average | Aggregation method: weighted_average, median, trimmed_mean
--max-epsilon | float | 5.0 | Maximum epsilon budget per source

init --from-description (v0.5.0)

Generate configuration from a natural language description using LLM.

datasynth-data init \
    --from-description "Generate 1 year of retail data for a mid-size US company with fraud patterns" \
    -o config.yaml

Uses the configured LLM provider (defaults to Mock) to parse the description and generate an appropriate YAML configuration.

generate --certificate (v0.5.0)

Attach a synthetic data certificate to the generated output.

datasynth-data generate \
    --config config.yaml \
    --output ./output \
    --certificate

Produces a certificate.json in the output directory containing DP guarantees, quality metrics, and an HMAC-SHA256 signature.

Signal Handling (Unix)

On Unix systems, you can pause and resume generation:

# Start generation in background
datasynth-data generate --config config.yaml --output ./output &

# Pause generation
kill -USR1 $(pgrep datasynth-data)

# Resume generation (send SIGUSR1 again)
kill -USR1 $(pgrep datasynth-data)

Exit Codes

Code | Meaning
0 | Success
1 | General error
2 | Configuration error
3 | I/O error
4 | Validation error
5 | Fingerprint error
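
When wrapping the CLI in another program, these codes can drive error handling. A minimal Python sketch using subprocess:

# run_generation.py - invoke the CLI and branch on its documented exit codes.
import subprocess

result = subprocess.run(
    ["datasynth-data", "generate", "--config", "config.yaml", "--output", "./output"]
)
if result.returncode == 0:
    print("generation succeeded")
elif result.returncode == 2:
    print("configuration error - try `datasynth-data validate --config config.yaml`")
elif result.returncode == 4:
    print("validation error")
else:
    print(f"failed with exit code {result.returncode}")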

Environment Variables

Variable | Description
SYNTH_DATA_LOG | Log level (error, warn, info, debug, trace)
SYNTH_DATA_THREADS | Number of worker threads

Example:

SYNTH_DATA_LOG=debug datasynth-data generate --config config.yaml --output ./output

Configuration File Location

The tool looks for configuration files in this order:

  1. Path specified with --config
  2. ./datasynth-data.yaml in current directory
  3. ~/.config/datasynth-data/config.yaml

Output Directory Structure

Generation creates this structure:

output/
├── master_data/          Vendors, customers, materials, assets, employees
├── transactions/         Journal entries, purchase orders, invoices, payments
├── subledgers/           AR, AP, FA, inventory detail records
├── period_close/         Trial balances, accruals, closing entries
├── consolidation/        Eliminations, currency translation
├── fx/                   Exchange rates, CTA adjustments
├── banking/              KYC profiles, bank transactions, AML typology labels
├── process_mining/       OCEL 2.0 event logs, process variants
├── audit/                Engagements, workpapers, findings, risk assessments
├── graphs/               PyTorch Geometric, Neo4j, DGL exports (if enabled)
├── labels/               Anomaly, fraud, and data quality labels for ML
└── controls/             Internal control mappings, SoD rules

Scripting Examples

Batch Generation

#!/bin/bash
for industry in manufacturing retail healthcare; do
    datasynth-data init --industry $industry --complexity medium -o ${industry}.yaml
    datasynth-data generate --config ${industry}.yaml --output ./output/${industry}
done

CI/CD Pipeline

# GitHub Actions example
- name: Generate Test Data
  run: |
    cargo build --release
    ./target/release/datasynth-data generate --demo --output ./test-data

- name: Validate Generation
  run: |
    # Check output files exist
    test -f ./test-data/transactions/journal_entries.csv
    test -f ./test-data/master_data/vendors.csv

Reproducible Generation

# Same seed produces identical output
datasynth-data generate --config config.yaml --output ./run1 --seed 42
datasynth-data generate --config config.yaml --output ./run2 --seed 42
diff -r run1 run2  # No differences

Fingerprint Pipeline

#!/bin/bash
# Extract fingerprint from real data
datasynth-data fingerprint extract \
    --input ./real_data.csv \
    --output ./fingerprint.dsf \
    --privacy-level high

# Validate the fingerprint
datasynth-data fingerprint validate ./fingerprint.dsf

# Generate synthetic data matching the fingerprint
# (fingerprint informs config generation)
datasynth-data generate --config generated_config.yaml --output ./synthetic

# Evaluate fidelity
datasynth-data fingerprint evaluate \
    --fingerprint ./fingerprint.dsf \
    --synthetic ./synthetic \
    --threshold 0.85 \
    --report ./fidelity_report.html

Troubleshooting

Common Issues

“Configuration file not found”

# Check file path
ls -la config.yaml
# Use absolute path
datasynth-data generate --config /full/path/to/config.yaml --output ./output

“Invalid configuration”

# Validate first
datasynth-data validate --config config.yaml

“Permission denied”

# Check output directory permissions
mkdir -p ./output
chmod 755 ./output

“Out of memory”

The generator includes memory guards that prevent OOM conditions. If you still encounter issues:

  • Reduce transaction count in configuration
  • The system will automatically reduce batch sizes under memory pressure
  • Check memory_guard settings in configuration

“Fingerprint validation failed”

# Check DSF file integrity
datasynth-data fingerprint validate ./fingerprint.dsf

# View detailed info
datasynth-data fingerprint info ./fingerprint.dsf --detailed

“Low fidelity score”

If synthetic data fidelity is below threshold:

  • Review the fidelity report for specific metrics
  • Adjust configuration to better match fingerprint statistics
  • Consider using the evaluation framework’s auto-tuning recommendations

Server API

SyntheticData provides a server component with REST, gRPC, and WebSocket APIs for application integration.

Starting the Server

cargo run -p datasynth-server -- --port 3000 --worker-threads 4

Options:

Option | Default | Description
--port | 3000 | HTTP/WebSocket port
--grpc-port | 50051 | gRPC port
--worker-threads | CPU cores | Worker thread count
--api-key | None | Required API key
--rate-limit | 100 | Max requests per minute

Authentication

When --api-key is set, include it in requests:

curl -H "X-API-Key: your-api-key" http://localhost:3000/api/config

REST API

Configuration Endpoints

GET /api/config

Retrieve current configuration.

curl http://localhost:3000/api/config

Response:

{
  "global": {
    "seed": 42,
    "industry": "manufacturing",
    "start_date": "2024-01-01",
    "period_months": 12
  },
  "transactions": {
    "target_count": 100000
  }
}

POST /api/config

Update configuration.

curl -X POST http://localhost:3000/api/config \
  -H "Content-Type: application/json" \
  -d '{"transactions": {"target_count": 50000}}'

POST /api/config/validate

Validate configuration without applying.

curl -X POST http://localhost:3000/api/config/validate \
  -H "Content-Type: application/json" \
  -d @config.json

Stream Control Endpoints

POST /api/stream/start

Start data generation.

curl -X POST http://localhost:3000/api/stream/start

Response:

{
  "status": "started",
  "stream_id": "abc123"
}

POST /api/stream/stop

Stop current generation.

curl -X POST http://localhost:3000/api/stream/stop

POST /api/stream/pause

Pause generation.

curl -X POST http://localhost:3000/api/stream/pause

POST /api/stream/resume

Resume paused generation.

curl -X POST http://localhost:3000/api/stream/resume

Pattern Trigger Endpoints

POST /api/stream/trigger/<pattern>

Trigger special event patterns.

Available patterns:

  • month_end - Month-end close activities
  • quarter_end - Quarter-end activities
  • year_end - Year-end close activities

curl -X POST http://localhost:3000/api/stream/trigger/month_end

Health Check

GET /health

curl http://localhost:3000/health

Response:

{
  "status": "healthy",
  "uptime_seconds": 3600
}

WebSocket API

Connect to receive real-time events during generation.

Connection

const ws = new WebSocket('ws://localhost:3000/ws/events');

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(data);
};

Event Types

Progress Event:

{
  "type": "progress",
  "current": 50000,
  "total": 100000,
  "percent": 50.0,
  "rate": 85000.5
}

Entry Event:

{
  "type": "entry",
  "data": {
    "document_id": "abc-123",
    "posting_date": "2024-03-15",
    "account": "1100",
    "debit": "1000.00",
    "credit": "0.00"
  }
}

Error Event:

{
  "type": "error",
  "message": "Memory limit exceeded"
}

Complete Event:

{
  "type": "complete",
  "total_entries": 100000,
  "duration_ms": 1200
}

gRPC API

Proto Definition

syntax = "proto3";

package synth;

service SynthService {
  rpc GetConfig(Empty) returns (Config);
  rpc SetConfig(Config) returns (Status);
  rpc StartGeneration(GenerationRequest) returns (stream Entry);
  rpc StopGeneration(Empty) returns (Status);
}

message Config {
  string yaml = 1;
}

message GenerationRequest {
  optional int64 count = 1;
}

message Entry {
  string document_id = 1;
  string posting_date = 2;
  string company_code = 3;
  repeated LineItem lines = 4;
}

message LineItem {
  string account = 1;
  string debit = 2;
  string credit = 3;
}

Client Example (Rust)

use synth::synth_client::SynthClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = SynthClient::connect("http://localhost:50051").await?;

    let request = tonic::Request::new(GenerationRequest { count: Some(1000) });
    let mut stream = client.start_generation(request).await?.into_inner();

    while let Some(entry) = stream.message().await? {
        println!("Entry: {:?}", entry.document_id);
    }

    Ok(())
}

Rate Limiting

The server implements sliding window rate limiting:

Metric | Default
Window | 1 minute
Max requests | 100

Exceeding the limit returns 429 Too Many Requests:

{
  "error": "rate_limit_exceeded",
  "retry_after": 30
}
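
Clients should back off when they receive 429. A sketch using the Python requests library that honors the retry_after hint from the error body above:

# rate_limit_retry.py - retry a POST when the server returns 429 Too Many Requests.
import time
import requests

def post_with_retry(url, max_attempts=5, **kwargs):
    for attempt in range(max_attempts):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        wait = response.json().get("retry_after", 2 ** attempt)
        time.sleep(wait)
    return response

resp = post_with_retry("http://localhost:3000/api/stream/start")
print(resp.status_code, resp.text)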

Memory Management

The server enforces memory limits:

# Set memory limit (bytes)
cargo run -p datasynth-server -- --memory-limit 1073741824  # 1GB

When memory is low:

  • Generation pauses automatically
  • WebSocket sends warning event
  • New requests may be rejected

Error Responses

HTTP Status | Meaning
400 | Invalid request/configuration
401 | Missing or invalid API key
429 | Rate limit exceeded
500 | Internal server error
503 | Server overloaded

Error Response Format:

{
  "error": "error_code",
  "message": "Human readable description",
  "details": {}
}

Integration Examples

Python Client

import requests
import websocket
import json

BASE_URL = "http://localhost:3000"

# Set configuration
config = {
    "transactions": {"target_count": 10000}
}
requests.post(f"{BASE_URL}/api/config", json=config)

# Start generation
requests.post(f"{BASE_URL}/api/stream/start")

# Monitor via WebSocket
ws = websocket.create_connection(f"ws://localhost:3000/ws/events")
while True:
    event = json.loads(ws.recv())
    if event["type"] == "complete":
        break
    print(f"Progress: {event.get('percent', 0)}%")

Node.js Client

const axios = require('axios');
const WebSocket = require('ws');

const BASE_URL = 'http://localhost:3000';

async function generate() {
    // Configure
    await axios.post(`${BASE_URL}/api/config`, {
        transactions: { target_count: 10000 }
    });

    // Start
    await axios.post(`${BASE_URL}/api/stream/start`);

    // Monitor
    const ws = new WebSocket('ws://localhost:3000/ws/events');
    ws.on('message', (data) => {
        const event = JSON.parse(data);
        console.log(event);
    });
}

Desktop UI

SyntheticData includes a cross-platform desktop application built with Tauri and SvelteKit.

Overview

The desktop UI provides:

  • Visual configuration editing
  • Industry preset selection
  • Real-time generation monitoring
  • Configuration validation feedback

Installation

Prerequisites

Requirement | Version
Node.js | 18+
npm | 9+
Rust | 1.88+
Platform dependencies | See below

Platform Dependencies

Linux (Ubuntu/Debian):

sudo apt-get install libwebkit2gtk-4.1-dev \
    libgtk-3-dev \
    libayatana-appindicator3-dev \
    librsvg2-dev

macOS: No additional dependencies required.

Windows: WebView2 runtime (usually pre-installed on Windows 10/11).

Running in Development

cd crates/datasynth-ui
npm install
npm run tauri dev

Building for Production

cd crates/datasynth-ui
npm run tauri build

Build outputs are in crates/datasynth-ui/src-tauri/target/release/bundle/.

Application Layout

Dashboard

The main dashboard provides:

  • Quick stats overview
  • Recent generation history
  • System status

Configuration Editor

Access via the sidebar. Configuration is organized into sections:

Section | Contents
Global | Industry, dates, seed, performance
Companies | Company definitions and weights
Transactions | Target count, line items, amounts
Master Data | Vendors, customers, materials
Document Flows | P2P, O2C configuration
Financial | Balance, subledger, FX, period close
Compliance | Fraud, controls, approval
Analytics | Graph export, anomaly, data quality
Output | Formats, compression

Configuration Sections

Global Settings

  • Industry: Select from presets (manufacturing, retail, etc.)
  • Start Date: Beginning of simulation period
  • Period Months: Duration (1-120 months)
  • Group Currency: Base currency for consolidation
  • Random Seed: For reproducible generation

Chart of Accounts

  • Complexity: Small (~100), Medium (~400), Large (~2500) accounts
  • Structure: Industry-specific account hierarchies

Transactions

  • Target Count: Number of journal entries to generate
  • Line Item Distribution: Configure line count probabilities
  • Amount Distribution: Log-normal parameters, round number bias

Master Data

Configure generation parameters for:

  • Vendors (count, payment terms, intercompany flags)
  • Customers (count, credit terms, payment behavior)
  • Materials (count, valuation methods)
  • Fixed Assets (count, depreciation methods)
  • Employees (count, hierarchy depth)

Document Flows

  • P2P (Procure-to-Pay): PO → GR → Invoice → Payment rates
  • O2C (Order-to-Cash): SO → Delivery → Invoice → Receipt rates
  • Three-Way Match: Tolerance settings

Financial Settings

  • Balance: Opening balance configuration
  • Subledger: AR, AP, FA, Inventory settings
  • FX: Currency pairs, rate volatility
  • Period Close: Accrual, depreciation, closing settings

Compliance

  • Fraud: Enable/disable, fraud rate, fraud types
  • Controls: Internal control definitions
  • Approval: Threshold configuration, SoD rules

Analytics

  • Graph Export: Format selection (PyTorch Geometric, Neo4j, DGL)
  • Anomaly Injection: Rate, types, labeling
  • Data Quality: Missing values, format variations, duplicates

Output Settings

  • Format: CSV or JSON
  • Compression: None, gzip, or zstd
  • File Organization: Directory structure options

Preset Selector

Quickly load industry presets:

  1. Click “Load Preset” in the header
  2. Select industry
  3. Choose complexity level
  4. Click “Apply”

Real-time Streaming

During generation, view:

  • Progress bar with percentage
  • Entries per second
  • Memory usage
  • Recent entries table

Access streaming view via “Generate” → “Stream”.

Validation

The UI validates configuration in real-time:

  • Required fields are highlighted
  • Invalid values show error messages
  • Distribution weights are checked
  • Constraints are enforced

Keyboard Shortcuts

Shortcut | Action
Ctrl/Cmd + S | Save configuration
Ctrl/Cmd + G | Start generation
Ctrl/Cmd + , | Open settings
Escape | Close modal

Configuration Files

The UI stores configurations in:

Platform | Location
Linux | ~/.config/datasynth-data/
macOS | ~/Library/Application Support/datasynth-data/
Windows | %APPDATA%\datasynth-data\

Exporting Configuration

To use your configuration with the CLI:

  1. Configure in the UI
  2. Click “Export” → “Export YAML”
  3. Save the .yaml file
  4. Use with CLI: datasynth-data generate --config exported.yaml --output ./output

Development

Project Structure

crates/datasynth-ui/
├── src/                      # SvelteKit frontend
│   ├── routes/               # Page routes
│   │   ├── +page.svelte      # Dashboard
│   │   ├── generate/         # Generation views
│   │   └── config/           # Configuration pages
│   └── lib/
│       ├── stores/           # State management
│       └── components/       # Reusable components
├── src-tauri/                # Rust backend
│   └── src/
│       └── main.rs           # Tauri commands
├── package.json
└── tauri.conf.json

Adding a Configuration Page

  1. Create route in src/routes/config/<section>/+page.svelte
  2. Add form components
  3. Connect to config store
  4. Add navigation link

Debugging

# Enable Tauri dev tools
npm run tauri dev

# View browser console (Ctrl/Cmd + Shift + I in dev mode)

Troubleshooting

UI Doesn’t Start

# Check Node dependencies
npm install

# Rebuild native modules
npm run tauri clean
npm run tauri build

Configuration Not Saving

Check file permissions in the config directory.

WebSocket Connection Failed

Ensure the server is running if using streaming features:

cargo run -p datasynth-server -- --port 3000

Output Formats

SyntheticData generates multiple file types organized into categories.

Output Directory Structure

output/
├── master_data/          # Entity master records
├── transactions/         # Journal entries and documents
├── subledgers/           # Subsidiary ledger records
├── period_close/         # Trial balances and closing
├── consolidation/        # Elimination entries
├── fx/                   # Exchange rates
├── banking/              # KYC profiles and bank transactions
├── process_mining/       # OCEL 2.0 event logs
├── audit/                # Audit engagements and workpapers
├── graphs/               # ML-ready graph exports
├── labels/               # Anomaly, fraud, and quality labels
└── controls/             # Internal control mappings

File Formats

CSV

Default format with standard conventions:

  • UTF-8 encoding
  • Comma-separated values
  • Header row included
  • Quoted strings when needed
  • Decimal values serialized as strings (prevents floating-point artifacts)

Example (journal_entries.csv):

document_id,posting_date,company_code,account,description,debit,credit,is_fraud
abc-123,2024-01-15,1000,1100,Customer payment,"1000.00","0.00",false
abc-123,2024-01-15,1000,1200,Cash receipt,"0.00","1000.00",false

JSON

Structured format with nested objects:

Example (journal_entries.json):

[
  {
    "header": {
      "document_id": "abc-123",
      "posting_date": "2024-01-15",
      "company_code": "1000",
      "source": "Manual",
      "is_fraud": false
    },
    "lines": [
      {
        "account": "1100",
        "description": "Customer payment",
        "debit": "1000.00",
        "credit": "0.00"
      },
      {
        "account": "1200",
        "description": "Cash receipt",
        "debit": "0.00",
        "credit": "1000.00"
      }
    ]
  }
]

ACDOCA (SAP HANA)

SAP Universal Journal format with simulation fields:

Field | Description
RCLNT | Client
RLDNR | Ledger
RBUKRS | Company code
GJAHR | Fiscal year
BELNR | Document number
DOCLN | Line item
RYEAR | Year
POPER | Posting period
RACCT | Account
DRCRK | Debit/Credit indicator
HSL | Amount in local currency
ZSIM_* | Simulation metadata fields

Master Data Files

chart_of_accounts.csv

Field | Description
account_number | GL account code
account_name | Display name
account_type | Asset, Liability, Equity, Revenue, Expense
account_subtype | Detailed classification
is_control_account | Links to subledger
normal_balance | Debit or Credit

vendors.csv

Field | Description
vendor_id | Unique identifier
vendor_name | Company name
tax_id | Tax identification
payment_terms | Standard terms
currency | Transaction currency
is_intercompany | IC flag

customers.csv

Field | Description
customer_id | Unique identifier
customer_name | Company/person name
credit_limit | Maximum credit
credit_rating | Rating code
payment_behavior | Typical payment pattern

materials.csv

Field | Description
material_id | Unique identifier
description | Material name
material_type | Classification
valuation_method | FIFO, LIFO, Avg
standard_cost | Unit cost

employees.csv

Field | Description
employee_id | Unique identifier
name | Full name
department | Department code
manager_id | Hierarchy link
approval_limit | Maximum approval amount
transaction_codes | Authorized T-codes

Transaction Files

journal_entries.csv

Field | Description
document_id | Entry identifier
company_code | Company
fiscal_year | Year
fiscal_period | Period
posting_date | Date posted
document_date | Original date
source | Transaction source
business_process | Process category
is_fraud | Fraud indicator
is_anomaly | Anomaly indicator

Line Items (embedded or separate)

Field | Description
line_number | Sequence
account_number | GL account
cost_center | Cost center
profit_center | Profit center
debit_amount | Debit
credit_amount | Credit
description | Line description

Document Flow Files

purchase_orders.csv:

  • Order header with vendor, dates, status
  • Line items with materials, quantities, prices

goods_receipts.csv:

  • Receipt linked to PO
  • Quantities received, variances

vendor_invoices.csv:

  • Invoice with three-way match status
  • Payment terms, due date

payments.csv:

  • Payment documents
  • Bank references, cleared invoices

document_references.csv:

  • Links between documents (FollowOn, Payment, Reversal)
  • Ensures complete document chains

Subledger Files

ar_open_items.csv

Field | Description
customer_id | Customer reference
invoice_number | Document number
invoice_date | Date issued
due_date | Payment due
original_amount | Invoice total
open_amount | Remaining balance
aging_bucket | 0-30, 31-60, 61-90, 90+

ap_open_items.csv

Similar structure for payables.

fa_register.csv

Field | Description
asset_id | Asset identifier
description | Asset name
acquisition_date | Purchase date
acquisition_cost | Original cost
useful_life_years | Depreciation period
depreciation_method | Straight-line, etc.
accumulated_depreciation | Total depreciation
net_book_value | Current value

inventory_positions.csv

Field | Description
material_id | Material reference
warehouse | Location
quantity | Units on hand
unit_cost | Current cost
total_value | Extended value

Period Close Files

trial_balances/YYYY_MM.csv

Field | Description
account_number | GL account
account_name | Description
opening_balance | Period start
period_debits | Total debits
period_credits | Total credits
closing_balance | Period end

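The roll-forward relation closing_balance = opening_balance + period_debits - period_credits can be checked directly. A minimal sketch, assuming pandas, a file for January 2024, and a debit-positive sign convention (credit-normal accounts may carry the opposite sign):

import pandas as pd

tb = pd.read_csv("trial_balances/2024_01.csv")

# Roll-forward check per account.
expected = tb["opening_balance"] + tb["period_debits"] - tb["period_credits"]
mismatches = tb[(expected - tb["closing_balance"]).abs() > 0.01]
print(f"{len(mismatches)} accounts fail the roll-forward check")
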
accruals.csv

Accrual entries with reversal dates.

depreciation.csv

Monthly depreciation entries per asset.

Banking Files

banking_customers.csv

Field | Description
customer_id | Unique identifier
customer_type | retail, business, trust
name | Customer name
created_at | Account creation date
risk_score | Calculated risk score (0-100)
kyc_status | verified, pending, enhanced_due_diligence
pep_flag | Politically exposed person
sanctions_flag | Sanctions list match

bank_accounts.csv

Field | Description
account_id | Unique identifier
customer_id | Owner reference
account_type | checking, savings, money_market
currency | Account currency
opened_date | Opening date
balance | Current balance
status | active, dormant, closed

bank_transactions.csv

Field | Description
transaction_id | Unique identifier
account_id | Account reference
timestamp | Transaction time
amount | Transaction amount
currency | Transaction currency
direction | credit, debit
channel | branch, atm, online, wire, ach
category | Transaction category
counterparty_id | Counterparty reference

kyc_profiles.csv

Field | Description
customer_id | Customer reference
declared_turnover | Expected monthly volume
transaction_frequency | Expected transactions/month
source_of_funds | Declared income source
geographic_exposure | List of countries
cash_intensity | Expected cash ratio
beneficial_owner_complexity | Ownership layers

aml_typology_labels.csv

Field | Description
transaction_id | Transaction reference
typology | structuring, funnel, layering, mule, fraud
confidence | Confidence score (0-1)
pattern_id | Related pattern identifier
related_transactions | Comma-separated related IDs

entity_risk_labels.csv

Field | Description
entity_id | Customer or account ID
entity_type | customer, account
risk_category | high, medium, low
risk_factors | Contributing factors
label_date | Label timestamp

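A common use of these files is building a labeled dataset for AML model training by joining transactions with their typology labels. A minimal sketch, assuming pandas and the column names listed above (file paths may differ in your output layout):

import pandas as pd

txns = pd.read_csv("bank_transactions.csv")
labels = pd.read_csv("aml_typology_labels.csv")

# Left join: transactions without a typology label are treated as "normal".
labeled = txns.merge(
    labels[["transaction_id", "typology", "confidence"]],
    on="transaction_id",
    how="left",
)
labeled["typology"] = labeled["typology"].fillna("normal")

print(labeled["typology"].value_counts())
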
Process Mining Files (OCEL 2.0)

event_log.json

OCEL 2.0 format event log:

{
  "ocel:global-log": {
    "ocel:version": "2.0",
    "ocel:ordering": "timestamp"
  },
  "ocel:events": {
    "e1": {
      "ocel:activity": "Create Purchase Order",
      "ocel:timestamp": "2024-01-15T10:30:00Z",
      "ocel:typedOmap": [
        {"ocel:oid": "PO-001", "ocel:qualifier": "order"}
      ]
    }
  },
  "ocel:objects": {
    "PO-001": {
      "ocel:type": "PurchaseOrder",
      "ocel:attributes": {
        "vendor": "VEND-001",
        "amount": "10000.00"
      }
    }
  }
}

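A minimal sketch of consuming the OCEL 2.0 log with the Python standard library, counting events per activity and collecting the objects each event references (keys follow the example above):

import json
from collections import Counter

with open("event_log.json") as f:
    log = json.load(f)

# Count events per activity.
activity_counts = Counter(
    event["ocel:activity"] for event in log["ocel:events"].values()
)
print(activity_counts.most_common(5))

# Collect object IDs referenced by each event via the typed object map.
for event_id, event in log["ocel:events"].items():
    oids = [rel["ocel:oid"] for rel in event.get("ocel:typedOmap", [])]
    # ... process event_id and oids as needed
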
objects.json

Object instances with types and attributes.

events.json

Event records with object relationships.

process_variants.csv

Field | Description
variant_id | Unique identifier
activity_sequence | Ordered activity list
frequency | Occurrence count
avg_duration | Average case duration

Audit Files

audit_engagements.csv

Field | Description
engagement_id | Unique identifier
client_name | Client entity
engagement_type | Financial, Compliance, Operational
fiscal_year | Audit period
materiality | Materiality threshold
status | Planning, Fieldwork, Completion

audit_workpapers.csv

Field | Description
workpaper_id | Unique identifier
engagement_id | Engagement reference
workpaper_type | Lead schedule, Substantive, etc.
prepared_by | Preparer ID
reviewed_by | Reviewer ID
status | Draft, Reviewed, Final

audit_evidence.csv

Field | Description
evidence_id | Unique identifier
workpaper_id | Workpaper reference
evidence_type | Document, Inquiry, Observation, etc.
source | Evidence source
reliability | High, Medium, Low
sufficiency | Sufficient, Insufficient

audit_risks.csv

Field | Description
risk_id | Unique identifier
engagement_id | Engagement reference
risk_description | Risk narrative
risk_level | High, Significant, Low
likelihood | Probable, Possible, Remote
response | Response strategy

audit_findings.csv

Field | Description
finding_id | Unique identifier
engagement_id | Engagement reference
finding_type | Deficiency, Significant, Material Weakness
description | Finding narrative
recommendation | Recommended action
management_response | Response text

audit_judgments.csv

Field | Description
judgment_id | Unique identifier
workpaper_id | Workpaper reference
judgment_area | Revenue recognition, Estimates, etc.
alternatives_considered | Options evaluated
conclusion | Selected approach
rationale | Reasoning documentation

Graph Export Files

PyTorch Geometric

graphs/transaction_network/pytorch_geometric/
├── node_features.pt    # [num_nodes, features]
├── edge_index.pt       # [2, num_edges]
├── edge_attr.pt        # [num_edges, edge_features]
├── labels.pt           # Node/edge labels
├── train_mask.pt       # Training split
├── val_mask.pt         # Validation split
└── test_mask.pt        # Test split

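A minimal sketch of assembling the exported tensors into a PyTorch Geometric Data object, assuming torch and torch_geometric are installed and using the file names shown above:

import torch
from torch_geometric.data import Data

base = "graphs/transaction_network/pytorch_geometric"

data = Data(
    x=torch.load(f"{base}/node_features.pt"),
    edge_index=torch.load(f"{base}/edge_index.pt"),
    edge_attr=torch.load(f"{base}/edge_attr.pt"),
    y=torch.load(f"{base}/labels.pt"),
)
data.train_mask = torch.load(f"{base}/train_mask.pt")
data.val_mask = torch.load(f"{base}/val_mask.pt")
data.test_mask = torch.load(f"{base}/test_mask.pt")

print(data)  # summary of node features, edges, labels, and masks
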
Neo4j

graphs/entity_relationship/neo4j/
├── nodes_account.csv
├── nodes_entity.csv
├── nodes_user.csv
├── edges_transaction.csv
├── edges_approval.csv
└── import.cypher        # Import script

DGL (Deep Graph Library)

graphs/transaction_network/dgl/
├── graph.bin           # DGL binary format
├── node_features.npy   # NumPy arrays
└── edge_features.npy

Label Files

anomaly_labels.csv

Field | Description
document_id | Entry reference
anomaly_id | Unique anomaly ID
anomaly_type | Classification
anomaly_category | Fraud, Error, Process, Statistical, Relational
severity | Low, Medium, High
description | Human-readable explanation

fraud_labels.csv

Field | Description
document_id | Entry reference
fraud_type | Specific fraud pattern (20+ types)
detection_difficulty | Easy, Medium, Hard
description | Fraud scenario description

quality_labels.csv

Field | Description
record_id | Record reference
field_name | Affected field
issue_type | MissingValue, Typo, FormatVariation, Duplicate
issue_subtype | Detailed classification
original_value | Value before modification
modified_value | Value after modification
severity | Severity level (1-5)

Control Files

internal_controls.csv

Field | Description
control_id | Unique identifier
control_name | Description
control_type | Preventive, Detective
frequency | Continuous, Daily, etc.
assertions | Completeness, Accuracy, etc.

control_account_mappings.csv

Field | Description
control_id | Control reference
account_number | GL account
threshold | Monetary threshold

sod_rules.csv

Segregation of duties conflict definitions.

sod_conflict_pairs.csv

Actual SoD violations detected in generated data.

Parquet Format

Apache Parquet columnar format for large analytical datasets:

output:
  format: parquet
  compression: snappy      # snappy, gzip, zstd

Benefits:

  • Columnar storage — efficient for queries touching few columns
  • Built-in compression — typically 5-10x smaller than CSV
  • Schema embedding — self-describing files with full type information
  • Predicate pushdown — query engines skip irrelevant row groups

Use with: Apache Spark, DuckDB, Polars, pandas, BigQuery, Snowflake, Databricks.

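A minimal sketch of querying the Parquet output, assuming pandas with a Parquet engine (pyarrow) is installed and that journal entries are written to a journal_entries.parquet file (the exact path depends on your output settings):

import pandas as pd

# Columnar reads: only the requested columns are deserialized.
df = pd.read_parquet(
    "output/journal_entries.parquet",
    columns=["document_id", "posting_date", "is_fraud"],
)
print(df["is_fraud"].mean())  # observed fraud label rate
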
ERP-Specific Formats

SyntheticData can export in native ERP table schemas:

Format | Target ERP | Tables
sap | SAP S/4HANA | BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA, CSKS, CEPC
oracle | Oracle EBS | GL_JE_HEADERS, GL_JE_LINES, GL_JE_BATCHES
netsuite | NetSuite | Journal entries with subsidiary, multi-book, custom fields

See ERP Output Formats for field mappings and configuration.

Compression Options

Option | Extension | Use Case
none | .csv/.json | Development, small datasets
gzip | .csv.gz | General compression
zstd | .csv.zst | High performance
snappy | .parquet | Parquet default (fast)

Configuration

output:
  format: csv              # csv, json, jsonl, parquet, sap, oracle, netsuite
  compression: none        # none, gzip, zstd (CSV/JSON) or snappy/gzip/zstd (Parquet)
  compression_level: 6     # 1-9 (if compression enabled)
  streaming: false         # Enable streaming mode for large outputs

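Compressed CSV output stays directly consumable by most readers. A small sketch, assuming a gzip-compressed run and pandas, which infers the codec from the file extension (reading .csv.zst additionally requires the zstandard package):

import pandas as pd

df = pd.read_csv("output/journal_entries.csv.gz")
print(len(df), "rows")
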
ERP Output Formats

SyntheticData can export data in native ERP table formats, enabling direct load testing and integration validation against SAP S/4HANA, Oracle EBS, and NetSuite environments.

Overview

The datasynth-output crate provides three ERP-specific exporters alongside the standard CSV/JSON/Parquet sinks. Each exporter transforms the internal data model into the target ERP’s table schema with correct field names, data types, and referential integrity.

ERP System | Exporter | Tables
SAP S/4HANA | SapExporter | BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA, CSKS, CEPC
Oracle EBS | OracleExporter | GL_JE_HEADERS, GL_JE_LINES, GL_JE_BATCHES
NetSuite | NetSuiteExporter | Journal entries with subsidiary, multi-book, custom fields

SAP S/4HANA

Supported Tables

Table | Description | Source Data
BKPF | Document Header | Journal entry headers
BSEG | Document Line Items | Journal entry line items
ACDOCA | Universal Journal | Full ACDOCA event records
LFA1 | Vendor Master | Vendor records
KNA1 | Customer Master | Customer records
MARA | Material Master | Material records
CSKS | Cost Center Master | Cost center assignments
CEPC | Profit Center Master | Profit center assignments

BKPF Fields (Document Header)

SAP Field | Description | Example
MANDT | Client | 100
BUKRS | Company Code | 1000
BELNR | Document Number | 0100000001
GJAHR | Fiscal Year | 2024
BLART | Document Type | SA (G/L posting)
BLDAT | Document Date | 2024-01-15
BUDAT | Posting Date | 2024-01-15
MONAT | Fiscal Period | 1
CPUDT | Entry Date | 2024-01-15
CPUTM | Entry Time | 10:30:00
USNAM | User Name | JSMITH

BSEG Fields (Line Items)

SAP Field | Description | Example
MANDT | Client | 100
BUKRS | Company Code | 1000
BELNR | Document Number | 0100000001
GJAHR | Fiscal Year | 2024
BUZEI | Line Item | 001
BSCHL | Posting Key | 40 (debit) / 50 (credit)
HKONT | GL Account | 1100
DMBTR | Amount in Local Currency | 1000.00
WRBTR | Amount in Doc Currency | 1000.00
KOSTL | Cost Center | CC100
PRCTR | Profit Center | PC100

ACDOCA Fields (Universal Journal)

The ACDOCA format includes all standard SAP Universal Journal fields plus simulation metadata:

Field | Description
RCLNT | Client
RLDNR | Ledger
RBUKRS | Company Code
GJAHR | Fiscal Year
BELNR | Document Number
DOCLN | Line Item
POPER | Posting Period
RACCT | Account
DRCRK | Debit/Credit Indicator
HSL | Amount in Local Currency
ZSIM_* | Simulation metadata fields

Configuration

output:
  format: sap
  sap:
    tables:
      - bkpf
      - bseg
      - acdoca
      - lfa1
      - kna1
      - mara
    client: "100"
    ledger: "0L"

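A minimal sketch of validating the exported SAP tables before loading them into a target system, assuming pandas and the sap_bkpf.csv / sap_bseg.csv file names listed under Output Files below:

import pandas as pd

bkpf = pd.read_csv("sap_bkpf.csv", dtype=str)
bseg = pd.read_csv("sap_bseg.csv", dtype=str)

# Every line item (BSEG) should reference an existing document header (BKPF).
key = ["MANDT", "BUKRS", "BELNR", "GJAHR"]
joined = bseg.merge(bkpf[key].drop_duplicates(), on=key, how="left", indicator=True)
print((joined["_merge"] == "left_only").sum(), "orphaned line items")
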
Oracle EBS

Supported Tables

Table | Description | Source Data
GL_JE_HEADERS | Journal Entry Headers | Journal entry headers
GL_JE_LINES | Journal Entry Lines | Journal entry line items
GL_JE_BATCHES | Journal Entry Batches | Batch groupings

GL_JE_HEADERS Fields

Oracle Field | Description | Example
JE_HEADER_ID | Unique Header ID | 10001
LEDGER_ID | Ledger ID | 1
JE_BATCH_ID | Batch ID | 5001
PERIOD_NAME | Period Name | JAN-24
NAME | Journal Name | Manual Entry 001
JE_CATEGORY | Category | MANUAL, ADJUSTMENT, PAYABLES
JE_SOURCE | Source | MANUAL, PAYABLES, RECEIVABLES
CURRENCY_CODE | Currency | USD
ACTUAL_FLAG | Type | A (Actual), B (Budget), E (Encumbrance)
STATUS | Status | P (Posted), U (Unposted)
DEFAULT_EFFECTIVE_DATE | Effective Date | 2024-01-15
RUNNING_TOTAL_DR | Total Debits | 10000.00
RUNNING_TOTAL_CR | Total Credits | 10000.00
PARENT_JE_HEADER_ID | Parent (for reversals) | null
ACCRUAL_REV_FLAG | Reversal Flag | Y / N

GL_JE_LINES Fields

Oracle Field | Description | Example
JE_HEADER_ID | Header Reference | 10001
JE_LINE_NUM | Line Number | 1
CODE_COMBINATION_ID | Account Combo ID | 10110
ENTERED_DR | Entered Debit | 1000.00
ENTERED_CR | Entered Credit | 0.00
ACCOUNTED_DR | Accounted Debit | 1000.00
ACCOUNTED_CR | Accounted Credit | 0.00
DESCRIPTION | Line Description | Customer payment
EFFECTIVE_DATE | Effective Date | 2024-01-15

Configuration

output:
  format: oracle
  oracle:
    ledger_id: 1
    set_of_books_id: 1

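A minimal sketch of checking that each exported journal balances, assuming pandas and the oracle_gl_je_lines.csv file name listed under Output Files below:

import pandas as pd

lines = pd.read_csv("oracle_gl_je_lines.csv")

# Per journal header, entered debits should equal entered credits.
sums = lines.groupby("JE_HEADER_ID")[["ENTERED_DR", "ENTERED_CR"]].sum()
unbalanced = sums[(sums["ENTERED_DR"] - sums["ENTERED_CR"]).abs() > 0.01]
print(len(unbalanced), "unbalanced journals")
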
NetSuite

Journal Entry Fields

NetSuite export includes support for subsidiaries, multi-book accounting, and custom fields:

NetSuite Field | Description | Example
INTERNAL_ID | Internal ID | 50001
EXTERNAL_ID | External ID (for import) | DS-JE-001
TRAN_ID | Transaction Number | JE00001
TRAN_DATE | Transaction Date | 2024-01-15
POSTING_PERIOD | Period ID | Jan 2024
SUBSIDIARY | Subsidiary ID | 1
CURRENCY | Currency Code | USD
EXCHANGE_RATE | Exchange Rate | 1.000000
MEMO | Memo | Monthly accrual
APPROVED | Approval Status | true
REVERSAL_DATE | Reversal Date | 2024-02-01
DEPARTMENT | Department ID | 100
CLASS | Class ID | 1
LOCATION | Location ID | 1
TOTAL_DEBIT | Total Debits | 5000.00
TOTAL_CREDIT | Total Credits | 5000.00

NetSuite Line Fields

Field | Description
ACCOUNT | Account internal ID
DEBIT | Debit amount
CREDIT | Credit amount
MEMO | Line memo
DEPARTMENT | Department
CLASS | Class segment
LOCATION | Location segment
ENTITY | Customer/Vendor reference
CUSTOM_FIELDS | Additional custom field map

Configuration

output:
  format: netsuite
  netsuite:
    subsidiary_id: 1
    include_custom_fields: true

Usage Examples

SAP Load Testing

Generate data for SAP S/4HANA load testing with full table coverage:

global:
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12

transactions:
  target_count: 100000

output:
  format: sap
  sap:
    tables: [bkpf, bseg, acdoca, lfa1, kna1, mara, csks, cepc]
    client: "100"

Oracle EBS Migration Validation

Generate journal entries in Oracle EBS format for migration testing:

output:
  format: oracle
  oracle:
    ledger_id: 1

NetSuite Integration Testing

Generate multi-subsidiary data with custom fields:

output:
  format: netsuite
  netsuite:
    subsidiary_id: 1
    include_custom_fields: true

Output Files

Format | Output Files
SAP | sap_bkpf.csv, sap_bseg.csv, sap_acdoca.csv, sap_lfa1.csv, sap_kna1.csv, sap_mara.csv, sap_csks.csv, sap_cepc.csv
Oracle | oracle_gl_je_headers.csv, oracle_gl_je_lines.csv, oracle_gl_je_batches.csv
NetSuite | netsuite_journal_entries.csv, netsuite_journal_lines.csv

Streaming Output

SyntheticData provides streaming output sinks for real-time data generation, so large datasets can be exported without ever holding the full dataset in memory.

Overview

The streaming module in datasynth-output implements the StreamingSink trait for four output formats:

Sink | Description | File Extension
CsvStreamingSink | CSV with automatic headers | .csv
JsonStreamingSink | Pretty-printed JSON arrays | .json
NdjsonStreamingSink | Newline-delimited JSON | .jsonl / .ndjson
ParquetStreamingSink | Apache Parquet columnar | .parquet

All streaming sinks accept StreamEvent values through the process() method:

pub enum StreamEvent<T> {
    Data(T),       // A data record to write
    Flush,         // Force flush to disk
    Close,         // Close the stream
}

StreamingSink Trait

All streaming sinks implement:

pub trait StreamingSink<T: Serialize + Send> {
    /// Process a single stream event (data, flush, or close).
    fn process(&mut self, event: StreamEvent<T>) -> SynthResult<()>;

    /// Close the stream and flush remaining data.
    fn close(&mut self) -> SynthResult<()>;

    /// Return the number of items written so far.
    fn items_written(&self) -> u64;

    /// Return the number of bytes written so far.
    fn bytes_written(&self) -> u64;
}

When to Use Streaming vs Batch

Scenario | Recommendation
< 100K records | Batch (CsvSink / JsonSink), simpler API
100K–10M records | Streaming, lower memory footprint
> 10M records | Streaming with Parquet, columnar compression
Real-time consumers | Streaming NDJSON, line-by-line parsing
REST/WebSocket API | Streaming, integrated with server endpoints

CSV Streaming

use datasynth_output::streaming::CsvStreamingSink;
use datasynth_core::traits::StreamEvent;

let mut sink = CsvStreamingSink::<JournalEntry>::new("output.csv".into())?;

// Write records one at a time (memory efficient)
for entry in generate_entries() {
    sink.process(StreamEvent::Data(entry))?;
}

// Periodic flush (optional, ensures data is on disk)
sink.process(StreamEvent::Flush)?;

// Close when done
sink.close()?;

println!("Wrote {} items ({} bytes)", sink.items_written(), sink.bytes_written());

Headers are written automatically on the first Data event.

JSON Streaming

Pretty-printed JSON Array

use datasynth_output::streaming::JsonStreamingSink;

let mut sink = JsonStreamingSink::<JournalEntry>::new("output.json".into())?;
for entry in entries {
    sink.process(StreamEvent::Data(entry))?;
}
sink.close()?;  // Writes closing bracket

Output:

[
  { "document_id": "abc-001", ... },
  { "document_id": "abc-002", ... }
]

Newline-Delimited JSON (NDJSON)

use datasynth_output::streaming::NdjsonStreamingSink;

let mut sink = NdjsonStreamingSink::<JournalEntry>::new("output.jsonl".into())?;
for entry in entries {
    sink.process(StreamEvent::Data(entry))?;
}
sink.close()?;

Output:

{"document_id":"abc-001",...}
{"document_id":"abc-002",...}

NDJSON is ideal for streaming consumers that process records line by line (e.g., jq, Kafka, log aggregators).

Parquet Streaming

Apache Parquet provides columnar compression, making it ideal for large analytical datasets:

use datasynth_output::streaming::ParquetStreamingSink;

let mut sink = ParquetStreamingSink::<JournalEntry>::new("output.parquet".into())?;
for entry in entries {
    sink.process(StreamEvent::Data(entry))?;
}
sink.close()?;

Parquet benefits:

  • Columnar storage: Efficient for analytical queries that touch few columns
  • Built-in compression: Snappy, Gzip, or Zstd per column group
  • Schema embedding: Self-describing files with full type information
  • Predicate pushdown: Query engines can skip irrelevant row groups

Configuration

Streaming output can be enabled explicitly in the configuration; it is also used automatically by the server API and when the runtime detects memory pressure:

output:
  format: csv           # csv, json, jsonl, parquet
  streaming: true       # Enable streaming mode
  compression: none     # none, gzip, zstd (CSV/JSON) or snappy/gzip/zstd (Parquet)

Server Streaming

The server API uses streaming sinks for the /api/stream/ endpoints:

# Start streaming generation
curl -X POST http://localhost:3000/api/stream/start \
  -H "Content-Type: application/json" \
  -d '{"config": {...}, "format": "ndjson"}'

# WebSocket streaming
wscat -c ws://localhost:3000/ws/events

Backpressure

Streaming sinks monitor write throughput and provide backpressure signals:

  • items_written() / bytes_written(): Track progress for rate limiting
  • Flush events: Force periodic disk writes to bound memory usage
  • Disk space monitoring: The runtime’s DiskGuard can pause generation when disk space runs low

Performance

Format | Throughput | File Size | Use Case
CSV | ~150K rows/sec | Largest | Universal compatibility
NDJSON | ~120K rows/sec | Large | Streaming consumers
JSON | ~100K rows/sec | Large | Human-readable
Parquet | ~80K rows/sec | Smallest | Analytics, data lakes

Throughput varies with record complexity and disk speed.

Python Wrapper Specification (In-Memory Configs)

This document specifies a Python wrapper that makes DataSynth usable out of the box without requiring persisted configuration files. The wrapper focuses on rich, structured configuration objects and reusable configuration blueprints so developers can generate data entirely in memory while still benefiting from the full DataSynth configuration model.

Goals

  • Zero-file setup: Instantiate and run DataSynth without writing YAML/JSON to disk.
  • Rich configuration: Offer a Pythonic API that maps cleanly to the full DataSynth configuration schema.
  • Blueprints: Provide reusable, parameterized configuration templates for common scenarios.
  • Interoperable: Allow optional export to YAML/JSON for debugging or CLI parity.
  • Composable: Enable programmatic composition, overrides, and validation.

Non-goals

  • Replacing the DataSynth CLI or server API.
  • Hiding the underlying schema; the wrapper should expose all configuration knobs.
  • Managing persistence beyond optional explicit export helpers.

Package layout

packages/
  datasynth_py/
    __init__.py
    client.py             # entrypoint wrapper
    config/
      __init__.py
      models.py           # typed config objects
      blueprints.py       # blueprint registry + builders
      validation.py       # schema validation helpers
    runtime/
      __init__.py
      session.py          # in-memory execution

Core API surface

DataSynth entrypoint

from datasynth_py import DataSynth

synth = DataSynth()

Responsibilities

  • Provide a generate() method that accepts rich configuration objects.
  • Provide blueprints registry access for common starting points.
  • Manage execution in memory, including optional output sinks.

generate() signature

result = synth.generate(
    config=Config(...),
    output=OutputSpec(...),
    seed=42,
)

Behavior

  • Validates configuration objects.
  • Converts configuration to DataSynth schema (internal model or JSON/YAML in-memory string).
  • Executes the generator and returns result handles (paths, in-memory tables, or streams).

Optional output handling

OutputSpec can include:

  • format (e.g., parquet, csv, jsonl)
  • sink (memory, temp_dir, path)
  • compression settings

When sink="memory", the wrapper returns in-memory table objects (pandas DataFrames by default).

Configuration model

Typed configuration objects

Provide typed dataclasses/Pydantic models mirroring the DataSynth YAML schema:

  • GlobalSettings
  • CompanySettings
  • TransactionSettings
  • MasterDataSettings
  • ComplianceSettings
  • OutputSettings

Example:

from datasynth_py.config import Config, GlobalSettings, CompanySettings

config = Config(
    global_settings=GlobalSettings(
        locale="en_US",
        fiscal_year_start="2024-01-01",
        periods=12,
    ),
    companies=CompanySettings(count=5, industry="retail"),
)

Overrides and layering

Allow configuration layering to support incremental overrides:

config = base_config.override(
    companies={"count": 10},
    output={"format": "parquet"},
)

The wrapper merges overrides deeply, preserving nested settings.

Blueprints

Blueprints provide preconfigured setups with parameters. The wrapper ships with a registry:

from datasynth_py import blueprints

config = blueprints.retail_small(companies=3, transactions=5000)

Blueprint characteristics

  • Parameterized: Each blueprint accepts keyword overrides for key metrics.
  • Composable: A blueprint can extend or wrap another blueprint.
  • Discoverable: Registry lists available blueprints and metadata.

blueprints.list()
# ["retail_small", "banking_medium", "saas_subscription", ...]

Execution model

The wrapper runs the Rust engine in-process via FFI or uses the DataSynth runtime API:

  • In-memory config: converted to serialized config strings without writing to disk.
  • Transient workspace: uses a temporary directory only if required by runtime internals.
  • Deterministic runs: seed controls RNG.

Streaming generation

The wrapper exposes a streaming session that connects to datasynth-server over WebSockets while using REST endpoints to start, pause, resume, and stop streams.

Examples

Example 1: Minimal generation in memory

from datasynth_py import DataSynth
from datasynth_py.config import Config, GlobalSettings, CompanySettings

config = Config(
    global_settings=GlobalSettings(locale="en_US", fiscal_year_start="2024-01-01"),
    companies=CompanySettings(count=2),
)

synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "memory"})

# result.tables -> dict[str, pandas.DataFrame]
print(result.tables["transactions"].head())

Example 2: Use a blueprint with overrides

from datasynth_py import DataSynth, blueprints

synth = DataSynth()
config = blueprints.retail_small(companies=4, transactions=15000)

result = synth.generate(
    config=config,
    output={"format": "parquet", "sink": "temp_dir"},
    seed=7,
)

print(result.output_dir)

Example 3: Layering overrides for a custom scenario

from datasynth_py import DataSynth
from datasynth_py.config import Config, GlobalSettings, TransactionSettings

base = Config(global_settings=GlobalSettings(locale="en_GB"))
custom = base.override(
    transactions=TransactionSettings(
        count=25000,
        currency="GBP",
        anomaly_rate=0.02,
    )
)

synth = DataSynth()
result = synth.generate(config=custom, output={"format": "jsonl", "sink": "memory"})

Example 4: Export configuration for debugging

from datasynth_py import DataSynth
from datasynth_py.config import Config

synth = DataSynth()
config = Config(...)

print(config.to_yaml())
print(config.to_json())

Example 5: Streaming events

import asyncio

from datasynth_py import DataSynth
from datasynth_py.config import blueprints


async def main() -> None:
    synth = DataSynth(server_url="http://localhost:3000")
    config = blueprints.retail_small(companies=2, transactions=5000)
    session = synth.stream(config=config, events_per_second=50)

    async for event in session.events():
        print(event)
        break


asyncio.run(main())

Decisions

  • In-memory table format: pandas DataFrames are the default return type for memory sinks.
  • Validation errors: configuration validation raises ConfigValidationError with structured error details.

Python Wrapper Guide

This guide explains how to use the DataSynth Python wrapper for in-memory configuration, local CLI generation, and streaming generation through the server API.

Installation

The wrapper lives in the repository under python/. Install it in development mode:

cd python
pip install -e ".[all]"

Or install just the core with specific extras:

pip install -e ".[cli]"      # For CLI generation (requires PyYAML)
pip install -e ".[memory]"   # For in-memory tables (requires pandas)
pip install -e ".[streaming]" # For streaming (requires websockets)

Quick start (CLI generation)

from datasynth_py import DataSynth, CompanyConfig, Config, GlobalSettings, ChartOfAccountsSettings

config = Config(
    global_settings=GlobalSettings(
        industry="retail",
        start_date="2024-01-01",
        period_months=12,
    ),
    companies=[
        CompanyConfig(code="C001", name="Retail Corp", currency="USD", country="US"),
    ],
    chart_of_accounts=ChartOfAccountsSettings(complexity="small"),
)

synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})

print(result.output_dir)  # Path to generated files

Using blueprints

Blueprints provide preconfigured templates for common scenarios:

from datasynth_py import DataSynth
from datasynth_py.config import blueprints

# List available blueprints
print(blueprints.list())
# ['retail_small', 'banking_medium', 'manufacturing_large',
#  'banking_aml', 'ml_training', 'with_graph_export']

# Create a retail configuration with 4 companies
config = blueprints.retail_small(companies=4, transactions=10000)

# Banking/AML focused configuration
config = blueprints.banking_aml(customers=1000, typologies=True)

# ML training optimized configuration
config = blueprints.ml_training(
    industry="manufacturing",
    anomaly_ratio=0.05,
)

# Add graph export to any configuration
config = blueprints.with_graph_export(
    base_config=blueprints.retail_small(),
    formats=["pytorch_geometric", "neo4j"],
)

synth = DataSynth()
result = synth.generate(config=config, output={"format": "parquet", "sink": "path", "path": "./output"})

Configuration model

The configuration model matches the CLI schema:

from datasynth_py import (
    ChartOfAccountsSettings,
    CompanyConfig,
    Config,
    FraudSettings,
    GlobalSettings,
)

config = Config(
    global_settings=GlobalSettings(
        industry="manufacturing",      # Industry sector
        start_date="2024-01-01",       # Simulation start date
        period_months=12,              # Number of months to simulate
        seed=42,                       # Random seed for reproducibility
        group_currency="USD",          # Base currency
    ),
    companies=[
        CompanyConfig(
            code="M001",
            name="Manufacturing Co",
            currency="USD",
            country="US",
            annual_transaction_volume="ten_k",  # Volume preset
        ),
        CompanyConfig(
            code="M002",
            name="Manufacturing EU",
            currency="EUR",
            country="DE",
            annual_transaction_volume="hundred_k",
        ),
    ],
    chart_of_accounts=ChartOfAccountsSettings(
        complexity="medium",           # small, medium, or large
    ),
    fraud=FraudSettings(
        enabled=True,
        rate=0.01,                     # 1% fraud rate
    ),
)

Valid industry values

  • manufacturing
  • retail
  • financial_services
  • healthcare
  • technology
  • professional_services
  • energy
  • transportation
  • real_estate
  • telecommunications

Transaction volume presets

  • ten_k - 10,000 transactions/year
  • hundred_k - 100,000 transactions/year
  • one_m - 1,000,000 transactions/year
  • ten_m - 10,000,000 transactions/year
  • hundred_m - 100,000,000 transactions/year

Extended configuration

Additional configuration sections for specialized scenarios:

from datasynth_py.config.models import (
    Config,
    GlobalSettings,
    BankingSettings,
    ScenarioSettings,
    TemporalDriftSettings,
    DataQualitySettings,
    GraphExportSettings,
)

config = Config(
    global_settings=GlobalSettings(industry="financial_services"),

    # Banking/KYC/AML configuration
    banking=BankingSettings(
        enabled=True,
        retail_customers=1000,
        business_customers=200,
        typologies_enabled=True,  # Structuring, layering, mule patterns
    ),

    # ML training scenario
    scenario=ScenarioSettings(
        tags=["ml_training", "fraud_detection"],
        ml_training=True,
        target_anomaly_ratio=0.05,
    ),

    # Temporal drift for concept drift testing
    temporal=TemporalDriftSettings(
        enabled=True,
        amount_mean_drift=0.02,
        drift_type="gradual",  # gradual, sudden, recurring
    ),

    # Data quality issues for DQ model training
    data_quality=DataQualitySettings(
        enabled=True,
        missing_rate=0.05,
        typo_rate=0.02,
    ),

    # Graph export for GNN training
    graph_export=GraphExportSettings(
        enabled=True,
        formats=["pytorch_geometric", "neo4j"],
    ),
)

Configuration layering

Override configuration values:

from datasynth_py import Config, GlobalSettings

base = Config(global_settings=GlobalSettings(industry="retail", start_date="2024-01-01"))
custom = base.override(
    fraud={"enabled": True, "rate": 0.02},
)

Validation

Validation raises ConfigValidationError with structured error details:

from datasynth_py import Config, GlobalSettings
from datasynth_py.config.validation import ConfigValidationError

try:
    Config(global_settings=GlobalSettings(period_months=0)).validate()
except ConfigValidationError as exc:
    for error in exc.errors:
        print(error.path, error.message, error.value)

Output options

Control where and how data is generated:

from datasynth_py import DataSynth, OutputSpec

synth = DataSynth()

# Write to a specific path
result = synth.generate(
    config=config,
    output=OutputSpec(format="csv", sink="path", path="./output"),
)

# Write to a temporary directory
result = synth.generate(
    config=config,
    output=OutputSpec(format="parquet", sink="temp_dir"),
)
print(result.output_dir)  # Temp directory path

# Load into memory (requires pandas)
result = synth.generate(
    config=config,
    output=OutputSpec(format="csv", sink="memory"),
)
print(result.tables["journal_entries"].head())

Fingerprint Operations

The Python wrapper provides access to fingerprint extraction, validation, and evaluation:

from datasynth_py import DataSynth

synth = DataSynth()

# Extract fingerprint from real data
synth.fingerprint.extract(
    input_path="./real_data/",
    output_path="./fingerprint.dsf",
    privacy_level="standard"  # minimal, standard, high, maximum
)

# Validate fingerprint file
is_valid, errors = synth.fingerprint.validate("./fingerprint.dsf")
if not is_valid:
    print(f"Validation errors: {errors}")

# Get fingerprint info
info = synth.fingerprint.info("./fingerprint.dsf", detailed=True)
print(f"Privacy level: {info.privacy_level}")
print(f"Epsilon spent: {info.epsilon_spent}")
print(f"Tables: {info.tables}")

# Evaluate synthetic data fidelity
report = synth.fingerprint.evaluate(
    fingerprint_path="./fingerprint.dsf",
    synthetic_path="./synthetic_data/",
    threshold=0.8
)
print(f"Overall score: {report.overall_score}")
print(f"Statistical fidelity: {report.statistical_fidelity}")
print(f"Correlation fidelity: {report.correlation_fidelity}")
print(f"Passes threshold: {report.passes}")

FidelityReport Fields

Field | Description
overall_score | Weighted average of all fidelity metrics (0-1)
statistical_fidelity | KS statistic, Wasserstein distance, Benford MAD
correlation_fidelity | Correlation matrix RMSE
schema_fidelity | Column type match, row count ratio
passes | Whether the score meets the threshold

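In pipeline or CI usage the report can gate whether a synthetic dataset is released downstream. A small sketch using the fields above and the evaluate() call shown earlier (the exception type is just an illustration):

report = synth.fingerprint.evaluate(
    fingerprint_path="./fingerprint.dsf",
    synthetic_path="./synthetic_data/",
    threshold=0.8,
)

if not report.passes:
    # Fail fast so low-fidelity data never reaches downstream consumers.
    raise RuntimeError(
        f"Fidelity gate failed: overall={report.overall_score:.2f} "
        f"(statistical={report.statistical_fidelity:.2f}, "
        f"correlation={report.correlation_fidelity:.2f})"
    )
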
Streaming generation

Streaming uses the DataSynth server for real-time event generation. Start the server first:

cargo run -p datasynth-server -- --port 3000

Then stream events:

import asyncio

from datasynth_py import DataSynth
from datasynth_py.config import blueprints


async def main() -> None:
    synth = DataSynth(server_url="http://localhost:3000")
    config = blueprints.retail_small(companies=2)
    session = synth.stream(config=config, events_per_second=100)

    async for event in session.events():
        print(event)
        break


asyncio.run(main())

Stream controls

session.pause()
session.resume()
session.stop()

Pattern triggers

Trigger specific patterns during streaming to simulate real-world scenarios:

# Trigger temporal patterns
session.trigger_month_end()    # Month-end volume spike
session.trigger_year_end()     # Year-end closing entries
session.trigger_pattern("quarter_end_spike")

# Trigger anomaly patterns
session.trigger_fraud_cluster()  # Clustered fraud transactions
session.trigger_pattern("dormant_account_activity")

# Available patterns:
# - period_end_spike
# - quarter_end_spike
# - year_end_spike
# - fraud_cluster
# - error_burst
# - dormant_account_activity

Synchronous event consumption

For simpler use cases without async/await:

def process_event(event):
    print(f"Received: {event['document_id']}")

session.sync_events(callback=process_event, max_events=1000)

Runtime requirements

The wrapper shells out to the datasynth-data CLI for batch generation. Ensure the binary is available:

cargo build --release
export DATASYNTH_BINARY=target/release/datasynth-data

Alternatively, pass binary_path when creating the client:

synth = DataSynth(binary_path="/path/to/datasynth-data")

Troubleshooting

  • MissingDependencyError: Install the required optional dependency (PyYAML, pandas, or websockets).
  • CLI not found: Build the datasynth-data binary and set DATASYNTH_BINARY or pass binary_path.
  • ConfigValidationError: Check the error details for invalid configuration values.
  • Streaming errors: Verify the server is running and reachable at the configured URL.

Ecosystem Integrations (v0.5.0)

DataSynth includes optional integrations with popular data engineering and ML platforms. Install with:

pip install datasynth-py[integrations]
# Or install specific integrations
pip install datasynth-py[airflow,dbt,mlflow,spark]

Apache Airflow

Use the DataSynthOperator to generate data as part of Airflow DAGs:

from datasynth_py.integrations import DataSynthOperator, DataSynthSensor, DataSynthValidateOperator

# Generate data
generate = DataSynthOperator(
    task_id="generate_data",
    config=config,
    output_path="/data/synthetic/output",
)

# Wait for completion
sensor = DataSynthSensor(
    task_id="wait_for_data",
    output_path="/data/synthetic/output",
)

# Validate config
validate = DataSynthValidateOperator(
    task_id="validate_config",
    config_path="/data/configs/config.yaml",
)

dbt Integration

Generate dbt sources and seeds from synthetic data:

from datasynth_py.integrations import DbtSourceGenerator, create_dbt_project

gen = DbtSourceGenerator()

# Generate sources.yml for dbt
sources_path = gen.generate_sources_yaml("./output", "./my_dbt_project")

# Generate seed CSVs
seeds_dir = gen.generate_seeds("./output", "./my_dbt_project")

# Create complete dbt project from synthetic output
project = create_dbt_project("./output", "my_dbt_project")

MLflow Tracking

Track generation runs as MLflow experiments:

from datasynth_py.integrations import DataSynthMlflowTracker

tracker = DataSynthMlflowTracker(experiment_name="synthetic_data_runs")

# Track a generation run
run_info = tracker.track_generation("./output", config=cfg)

# Log quality metrics
tracker.log_quality_metrics({
    "completeness": 0.98,
    "benford_mad": 0.008,
    "correlation_preservation": 0.95,
})

# Compare recent runs
comparison = tracker.compare_runs(n=5)

Apache Spark

Read synthetic data as Spark DataFrames:

from datasynth_py.integrations import DataSynthSparkReader

reader = DataSynthSparkReader()

# Read a single table
df = reader.read_table(spark, "./output", "journal_entries")

# Read all tables
tables = reader.read_all_tables(spark, "./output")

# Create temporary views for SQL queries
views = reader.create_temp_views(spark, "./output")
spark.sql("SELECT * FROM journal_entries WHERE amount > 10000").show()

For comprehensive integration documentation, see the Ecosystem Integrations guide.

Ecosystem Integrations

New in v0.5.0

DataSynth’s Python wrapper includes optional integrations with popular data engineering and ML platforms for seamless pipeline orchestration.

Installation

# Install all integrations
pip install datasynth-py[integrations]

# Install specific integrations
pip install datasynth-py[airflow]
pip install datasynth-py[dbt]
pip install datasynth-py[mlflow]
pip install datasynth-py[spark]

Apache Airflow

The Airflow integration provides custom operators and sensors for orchestrating synthetic data generation in Airflow DAGs.

DataSynthOperator

Generates synthetic data as an Airflow task:

from datasynth_py.integrations import DataSynthOperator

generate = DataSynthOperator(
    task_id="generate_synthetic_data",
    config={
        "global": {"industry": "retail", "start_date": "2024-01-01", "period_months": 12},
        "transactions": {"target_count": 50000},
        "output": {"format": "csv"},
    },
    output_path="/data/synthetic/{{ ds }}",
)

Parameter | Type | Description
task_id | str | Airflow task identifier
config | dict | Generation configuration (inline)
config_path | str | Path to YAML config file (alternative to config)
output_path | str | Output directory (supports Jinja templates)

DataSynthSensor

Waits for synthetic data generation to complete:

from datasynth_py.integrations import DataSynthSensor

wait = DataSynthSensor(
    task_id="wait_for_data",
    output_path="/data/synthetic/{{ ds }}",
    poke_interval=30,
    timeout=600,
)

DataSynthValidateOperator

Validates a configuration file before generation:

from datasynth_py.integrations import DataSynthValidateOperator

validate = DataSynthValidateOperator(
    task_id="validate_config",
    config_path="/configs/retail.yaml",
)

Complete DAG Example

from airflow import DAG
from airflow.utils.dates import days_ago
from datasynth_py.integrations import (
    DataSynthOperator,
    DataSynthSensor,
    DataSynthValidateOperator,
)

with DAG(
    "weekly_synthetic_data",
    start_date=days_ago(1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:

    validate = DataSynthValidateOperator(
        task_id="validate",
        config_path="/configs/retail.yaml",
    )

    generate = DataSynthOperator(
        task_id="generate",
        config_path="/configs/retail.yaml",
        output_path="/data/synthetic/{{ ds }}",
    )

    wait = DataSynthSensor(
        task_id="wait",
        output_path="/data/synthetic/{{ ds }}",
    )

    validate >> generate >> wait

dbt Integration

Generate dbt-compatible project structures from synthetic data output.

DbtSourceGenerator

from datasynth_py.integrations import DbtSourceGenerator

gen = DbtSourceGenerator()

Generate sources.yml

Creates a dbt sources.yml file pointing to synthetic data tables:

sources_path = gen.generate_sources_yaml(
    output_dir="./synthetic_output",
    dbt_project_dir="./my_dbt_project",
)
# Creates ./my_dbt_project/models/sources.yml

Generate Seeds

Copies synthetic CSV files as dbt seeds:

seeds_dir = gen.generate_seeds(
    output_dir="./synthetic_output",
    dbt_project_dir="./my_dbt_project",
)
# Copies CSVs to ./my_dbt_project/seeds/

create_dbt_project

Creates a complete dbt project structure from synthetic output:

from datasynth_py.integrations import create_dbt_project

project = create_dbt_project(
    output_dir="./synthetic_output",
    project_name="synthetic_test",
)

This creates:

synthetic_test/
├── dbt_project.yml
├── models/
│   └── sources.yml
├── seeds/
│   ├── journal_entries.csv
│   ├── vendors.csv
│   ├── customers.csv
│   └── ...
└── tests/

Usage with dbt CLI

cd synthetic_test
dbt seed      # Load synthetic CSVs
dbt run       # Run transformations
dbt test      # Run data tests

MLflow Integration

Track synthetic data generation runs as MLflow experiments for comparison and reproducibility.

DataSynthMlflowTracker

from datasynth_py.integrations import DataSynthMlflowTracker

tracker = DataSynthMlflowTracker(experiment_name="synthetic_data_experiments")

Track a Generation Run

run_info = tracker.track_generation(
    output_dir="./output",
    config=config,
)
# Logs: config parameters, output file counts, generation metadata

Log Quality Metrics

tracker.log_quality_metrics({
    "completeness": 0.98,
    "benford_mad": 0.008,
    "correlation_preservation": 0.95,
    "statistical_fidelity": 0.92,
})

Compare Runs

comparison = tracker.compare_runs(n=5)
for run in comparison:
    print(f"Run {run['run_id']}: {run['metrics']}")

Experiment Comparison

Use MLflow to compare different generation configurations:

import mlflow

configs = {
    "baseline": baseline_config,
    "with_diffusion": diffusion_config,
    "high_fraud": high_fraud_config,
}

for name, cfg in configs.items():
    with mlflow.start_run(run_name=name):
        result = synth.generate(config=cfg, output={"format": "csv", "sink": "temp_dir"})
        tracker.track_generation(result.output_dir, config=cfg)
        tracker.log_quality_metrics(evaluate_quality(result.output_dir))

View results in the MLflow UI:

mlflow ui --port 5000
# Open http://localhost:5000

Apache Spark

Read synthetic data output directly as Spark DataFrames for large-scale analysis.

DataSynthSparkReader

from datasynth_py.integrations import DataSynthSparkReader

reader = DataSynthSparkReader()

Read a Single Table

df = reader.read_table(spark, "./output", "journal_entries")
df.printSchema()
df.show(5)

Read All Tables

tables = reader.read_all_tables(spark, "./output")
for name, df in tables.items():
    print(f"{name}: {df.count()} rows, {len(df.columns)} columns")

Create Temporary Views

views = reader.create_temp_views(spark, "./output")

# Now use SQL
spark.sql("""
    SELECT
        v.vendor_id,
        v.name,
        COUNT(p.document_id) as payment_count,
        SUM(p.amount) as total_paid
    FROM vendors v
    JOIN payments p ON v.vendor_id = p.vendor_id
    GROUP BY v.vendor_id, v.name
    ORDER BY total_paid DESC
    LIMIT 10
""").show()

Spark + DataSynth Pipeline

from pyspark.sql import SparkSession
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
from datasynth_py.integrations import DataSynthSparkReader

# Generate
synth = DataSynth()
config = blueprints.retail_small(transactions=100000)
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})

# Load into Spark
spark = SparkSession.builder.appName("SyntheticAnalysis").getOrCreate()
reader = DataSynthSparkReader()
reader.create_temp_views(spark, result.output_dir)

# Analyze
spark.sql("""
    SELECT fiscal_period, COUNT(*) as entry_count, SUM(amount) as total_amount
    FROM journal_entries
    GROUP BY fiscal_period
    ORDER BY fiscal_period
""").show()

Integration Dependencies

Integration | Required Package | Version
Airflow | apache-airflow | >= 2.5
dbt | dbt-core | >= 1.5
MLflow | mlflow | >= 2.0
Spark | pyspark | >= 3.3

All integrations are optional — install only what you need.

Configuration

SyntheticData uses YAML configuration files to control all aspects of data generation.

Quick Start

# Create configuration from preset
datasynth-data init --industry manufacturing --complexity medium -o config.yaml

# Validate configuration
datasynth-data validate --config config.yaml

# Generate with configuration
datasynth-data generate --config config.yaml --output ./output

Configuration Sections

Section | Description
Global Settings | Industry, dates, seed, performance
Companies | Company codes, currencies, volume weights
Transactions | Line items, amounts, sources
Master Data | Vendors, customers, materials, assets
Document Flows | P2P, O2C configuration
Financial Settings | Balance, subledger, FX, period close
Compliance | Fraud, controls, approval
AI & ML Features | LLM, diffusion, causal, certificates
Output Settings | Format, compression
Source-to-Pay | S2C sourcing pipeline (projects, RFx, bids, contracts, catalogs, scorecards)
Financial Reporting | Financial statements, bank reconciliation, management KPIs, budgets
HR | Payroll runs, time entries, expense reports
Manufacturing | Production orders, quality inspections, cycle counts
Sales Quotes | Quote-to-order pipeline
Accounting Standards | Revenue recognition (ASC 606/IFRS 15), impairment testing

Minimal Configuration

global:
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12

transactions:
  target_count: 10000

output:
  format: csv

Full Configuration Example

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
  group_currency: USD

companies:
  - code: "1000"
    name: "Headquarters"
    currency: USD
    country: US
    volume_weight: 0.6
  - code: "2000"
    name: "European Subsidiary"
    currency: EUR
    country: DE
    volume_weight: 0.4

chart_of_accounts:
  complexity: medium

transactions:
  target_count: 100000
  line_items:
    distribution: empirical
  amounts:
    min: 100
    max: 1000000

master_data:
  vendors:
    count: 200
  customers:
    count: 500
  materials:
    count: 1000

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.3
  o2c:
    enabled: true
    flow_rate: 0.3

fraud:
  enabled: true
  fraud_rate: 0.005

anomaly_injection:
  enabled: true
  total_rate: 0.02
  generate_labels: true

graph_export:
  enabled: true
  formats:
    - pytorch_geometric
    - neo4j

# AI & ML Features (v0.5.0)
diffusion:
  enabled: true
  n_steps: 1000
  schedule: "cosine"
  sample_size: 1000

causal:
  enabled: true
  template: "fraud_detection"
  sample_size: 1000
  validate: true

certificates:
  enabled: true
  issuer: "DataSynth"
  include_quality_metrics: true

# Enterprise Process Chains (v0.6.0)
source_to_pay:
  enabled: true
  projects_per_period: 5
  avg_bids_per_rfx: 4
  contract_award_rate: 0.75
  catalog_items_per_contract: 10

financial_reporting:
  enabled: true
  generate_balance_sheet: true
  generate_income_statement: true
  generate_cash_flow: true
  generate_changes_in_equity: true
  management_kpis:
    enabled: true
  budgets:
    enabled: true
    variance_threshold: 0.10

hr:
  enabled: true
  payroll_frequency: monthly
  time_tracking: true
  expense_reports: true

manufacturing:
  enabled: true
  production_orders_per_period: 20
  quality_inspection_rate: 0.30
  cycle_count_frequency: quarterly

sales_quotes:
  enabled: true
  quotes_per_period: 15
  conversion_rate: 0.35

output:
  format: csv
  compression: none

Configuration Loading

Configuration can be loaded from:

  1. YAML file (recommended):

    datasynth-data generate --config config.yaml --output ./output
    
  2. JSON file:

    datasynth-data generate --config config.json --output ./output
    
  3. Demo preset:

    datasynth-data generate --demo --output ./output
    

Validation

The configuration is validated for:

Rule | Description
Required fields | All mandatory fields must be present
Value ranges | Numbers within valid bounds
Distributions | Weights sum to 1.0 (±0.01 tolerance)
Dates | Valid date ranges
Uniqueness | Company codes must be unique
Consistency | Cross-field validations

Run validation:

datasynth-data validate --config config.yaml

Overriding Values

Command-line options override configuration file values:

# Override seed
datasynth-data generate --config config.yaml --seed 12345 --output ./output

# Override format
datasynth-data generate --config config.yaml --format json --output ./output

Environment Variables

Some settings can be controlled via environment variables:

Variable | Configuration Equivalent
SYNTH_DATA_SEED | global.seed
SYNTH_DATA_THREADS | global.worker_threads
SYNTH_DATA_MEMORY_LIMIT | global.memory_limit

YAML Schema Reference

Complete reference for all configuration options.

Schema Overview

global:                    # Global settings
companies:                 # Company definitions
chart_of_accounts:         # COA structure
transactions:              # Transaction settings
master_data:               # Master data settings
document_flows:            # P2P, O2C flows
intercompany:              # IC settings
balance:                   # Balance settings
subledger:                 # Subledger settings
fx:                        # FX settings
period_close:              # Period close settings
fraud:                     # Fraud injection
internal_controls:         # SOX controls
anomaly_injection:         # Anomaly injection
data_quality:              # Data quality variations
graph_export:              # Graph export settings
output:                    # Output settings
business_processes:        # Process distribution
templates:                 # External templates
approval:                  # Approval thresholds
departments:               # Department distribution
source_to_pay:             # Source-to-Pay (v0.6.0)
financial_reporting:       # Financial statements & KPIs (v0.6.0)
hr:                        # HR / payroll / expenses (v0.6.0)
manufacturing:             # Production orders & costing (v0.6.0)
sales_quotes:              # Quote-to-order pipeline (v0.6.0)

global

global:
  seed: 42                           # u64, optional - RNG seed
  industry: manufacturing            # string - industry preset
  start_date: 2024-01-01             # date - generation start
  period_months: 12                  # u32, 1-120 - duration
  group_currency: USD                # string - base currency
  worker_threads: 4                  # usize, optional - parallelism
  memory_limit: 2147483648           # u64, optional - bytes

Industries: manufacturing, retail, financial_services, healthcare, technology, energy, telecom, transportation, hospitality

companies

companies:
  - code: "1000"                     # string - unique code
    name: "Headquarters"             # string - display name
    currency: USD                    # string - local currency
    country: US                      # string - ISO country code
    volume_weight: 0.6               # f64, 0-1 - transaction weight
    is_parent: true                  # bool - consolidation parent
    parent_code: null                # string, optional - parent ref

Constraints:

  • volume_weight across all companies must sum to 1.0
  • code must be unique

chart_of_accounts

chart_of_accounts:
  complexity: medium                 # small, medium, large
  industry_specific: true            # bool - use industry COA
  custom_accounts: []                # list - additional accounts

Complexity levels:

  • small: ~100 accounts
  • medium: ~400 accounts
  • large: ~2500 accounts

transactions

transactions:
  target_count: 100000               # u64 - total JEs to generate

  line_items:
    distribution: empirical          # empirical, uniform, custom
    min_lines: 2                     # u32 - minimum line items
    max_lines: 20                    # u32 - maximum line items
    custom_distribution:             # only if distribution: custom
      2: 0.6068
      3: 0.0524
      4: 0.1732

  amounts:
    min: 100                         # f64 - minimum amount
    max: 1000000                     # f64 - maximum amount
    distribution: log_normal         # log_normal, uniform, custom
    round_number_bias: 0.15          # f64, 0-1 - round number preference

  sources:                           # transaction source weights
    manual: 0.3
    automated: 0.5
    recurring: 0.15
    adjustment: 0.05

  benford:
    enabled: true                    # bool - Benford's Law compliance

  temporal:
    month_end_spike: 2.5             # f64 - month-end volume multiplier
    quarter_end_spike: 3.0           # f64 - quarter-end multiplier
    year_end_spike: 4.0              # f64 - year-end multiplier
    working_hours_only: true         # bool - restrict to business hours

master_data

master_data:
  vendors:
    count: 200                       # u32 - number of vendors
    intercompany_ratio: 0.05         # f64, 0-1 - IC vendor ratio

  customers:
    count: 500                       # u32 - number of customers
    intercompany_ratio: 0.05         # f64, 0-1 - IC customer ratio

  materials:
    count: 1000                      # u32 - number of materials

  fixed_assets:
    count: 100                       # u32 - number of assets

  employees:
    count: 50                        # u32 - number of employees
    hierarchy_depth: 4               # u32 - org chart depth

document_flows

document_flows:
  p2p:                               # Procure-to-Pay
    enabled: true
    flow_rate: 0.3                   # f64, 0-1 - JE percentage
    completion_rate: 0.95            # f64, 0-1 - full flow rate
    three_way_match:
      quantity_tolerance: 0.02       # f64, 0-1 - qty variance allowed
      price_tolerance: 0.01          # f64, 0-1 - price variance allowed

  o2c:                               # Order-to-Cash
    enabled: true
    flow_rate: 0.3                   # f64, 0-1 - JE percentage
    completion_rate: 0.95            # f64, 0-1 - full flow rate

intercompany

intercompany:
  enabled: true
  transaction_types:                 # weights must sum to 1.0
    goods_sale: 0.4
    service_provided: 0.2
    loan: 0.15
    dividend: 0.1
    management_fee: 0.1
    royalty: 0.05

  transfer_pricing:
    method: cost_plus                # cost_plus, resale_minus, comparable
    markup_range:
      min: 0.03
      max: 0.10

balance

balance:
  opening_balance:
    enabled: true
    total_assets: 10000000           # f64 - opening balance sheet size

  coherence_check:
    enabled: true                    # bool - verify A = L + E
    tolerance: 0.01                  # f64 - allowed imbalance

subledger

subledger:
  ar:
    enabled: true
    aging_buckets: [30, 60, 90]      # list of days

  ap:
    enabled: true
    aging_buckets: [30, 60, 90]

  fixed_assets:
    enabled: true
    depreciation_methods:
      - straight_line
      - declining_balance

  inventory:
    enabled: true
    valuation_methods:
      - fifo
      - weighted_average

fx

fx:
  enabled: true
  base_currency: USD

  currency_pairs:                    # currencies to generate
    - EUR
    - GBP
    - CHF
    - JPY

  volatility: 0.01                   # f64 - daily volatility

  translation:
    method: current_rate             # current_rate, temporal

period_close

period_close:
  enabled: true

  monthly:
    accruals: true
    depreciation: true

  quarterly:
    intercompany_elimination: true

  annual:
    closing_entries: true
    retained_earnings: true

fraud

fraud:
  enabled: true
  fraud_rate: 0.005                  # f64, 0-1 - fraud percentage

  types:                             # weights must sum to 1.0
    fictitious_transaction: 0.15
    revenue_manipulation: 0.10
    expense_capitalization: 0.10
    split_transaction: 0.15
    round_tripping: 0.05
    kickback_scheme: 0.10
    ghost_employee: 0.05
    duplicate_payment: 0.15
    unauthorized_discount: 0.10
    suspense_abuse: 0.05

internal_controls

internal_controls:
  enabled: true

  controls:
    - id: "CTL-001"
      name: "Payment Approval"
      type: preventive
      frequency: continuous

  sod_rules:
    - conflict_type: create_approve
      processes: [ap_invoice, ap_payment]

anomaly_injection

anomaly_injection:
  enabled: true
  total_rate: 0.02                   # f64, 0-1 - total anomaly rate
  generate_labels: true              # bool - output ML labels

  categories:                        # weights must sum to 1.0
    fraud: 0.25
    error: 0.40
    process_issue: 0.20
    statistical: 0.10
    relational: 0.05

  temporal_pattern:
    year_end_spike: 1.5              # f64 - year-end multiplier

  clustering:
    enabled: true
    cluster_probability: 0.2

data_quality

data_quality:
  enabled: true

  missing_values:
    rate: 0.01                       # f64, 0-1
    pattern: mcar                    # mcar, mar, mnar, systematic

  format_variations:
    date_formats: true
    amount_formats: true

  duplicates:
    rate: 0.001                      # f64, 0-1
    types: [exact, near, fuzzy]

  typos:
    rate: 0.005                      # f64, 0-1
    keyboard_aware: true

graph_export

graph_export:
  enabled: true

  formats:
    - pytorch_geometric
    - neo4j
    - dgl

  graphs:
    - transaction_network
    - approval_network
    - entity_relationship

  split:
    train: 0.7
    val: 0.15
    test: 0.15
    stratify: is_anomaly

  features:
    temporal: true
    amount: true
    structural: true
    categorical: true

output

output:
  format: csv                        # csv, json
  compression: none                  # none, gzip, zstd
  compression_level: 6               # u32, 1-9 (if compression enabled)

  files:
    journal_entries: true
    acdoca: true
    master_data: true
    documents: true
    subledgers: true
    trial_balances: true
    labels: true
    controls: true

Validation Summary

| Field | Constraint |
|---|---|
| period_months | 1-120 |
| compression_level | 1-9 |
| All rates/percentages | 0.0-1.0 |
| Distributions | Sum to 1.0 (±0.01) |
| Company codes | Unique |
| Dates | Valid and consistent |
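
As a quick illustration of the distribution constraint, a pre-flight check could verify that each weight map sums to 1.0 within the ±0.01 tolerance. This is a minimal Python sketch, not the generator's built-in validator; the function name is hypothetical.

# Hypothetical helper: check a weight map against the ±0.01 tolerance.
def weights_sum_to_one(weights, tolerance=0.01):
    return abs(sum(weights.values()) - 1.0) <= tolerance

sources = {"manual": 0.3, "automated": 0.5, "recurring": 0.15, "adjustment": 0.05}
assert weights_sum_to_one(sources)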

Diffusion Configuration (v0.5.0)

diffusion:
  enabled: false                    # Enable diffusion model backend
  n_steps: 1000                     # Number of diffusion steps (default: 1000)
  schedule: "linear"                # Noise schedule: "linear", "cosine", "sigmoid"
  sample_size: 1000                 # Number of samples to generate (default: 1000)

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable diffusion model generation |
| n_steps | integer | 1000 | Number of forward/reverse diffusion steps |
| schedule | string | "linear" | Noise schedule type: linear, cosine, sigmoid |
| sample_size | integer | 1000 | Number of samples to generate |

Causal Configuration (v0.5.0)

causal:
  enabled: false                    # Enable causal generation
  template: "fraud_detection"       # Built-in template or custom graph path
  sample_size: 1000                 # Number of samples to generate
  validate: true                    # Validate causal structure in output

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable causal/counterfactual generation |
| template | string | "fraud_detection" | Template name (fraud_detection, revenue_cycle) or path to custom YAML |
| sample_size | integer | 1000 | Number of causal samples to generate |
| validate | bool | true | Run causal structure validation on output |

Built-in Causal Templates

| Template | Variables | Description |
|---|---|---|
| fraud_detection | transaction_amount, approval_level, vendor_risk, fraud_flag | Fraud detection causal graph |
| revenue_cycle | order_size, credit_score, payment_delay, revenue | Revenue cycle causal graph |

Certificate Configuration (v0.5.0)

certificates:
  enabled: false                    # Enable synthetic data certificates
  issuer: "DataSynth"              # Certificate issuer name
  include_quality_metrics: true     # Include quality metrics in certificate

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Attach certificate to generated output |
| issuer | string | "DataSynth" | Issuer identity for the certificate |
| include_quality_metrics | bool | true | Include Benford MAD, correlation, fidelity metrics |

Source-to-Pay Configuration (v0.6.0)

source_to_pay:
  enabled: false                       # Enable source-to-pay generation

  spend_analysis:
    hhi_threshold: 2500.0              # f64 - HHI threshold for sourcing trigger
    contract_coverage_target: 0.80     # f64, 0-1 - target spend under contracts

  sourcing:
    projects_per_year: 10              # u32 - sourcing projects per year
    renewal_horizon_months: 3          # u32 - months before expiry to trigger renewal
    project_duration_months: 4         # u32 - average project duration

  qualification:
    pass_rate: 0.75                    # f64, 0-1 - qualification pass rate
    validity_days: 365                 # u32 - qualification validity in days
    financial_weight: 0.25             # f64 - financial stability weight
    quality_weight: 0.30               # f64 - quality management weight
    delivery_weight: 0.25              # f64 - delivery performance weight
    compliance_weight: 0.20            # f64 - compliance weight

  rfx:
    rfi_threshold: 100000.0            # f64 - spend above which RFI required
    min_invited_vendors: 3             # u32 - minimum vendors per RFx
    max_invited_vendors: 8             # u32 - maximum vendors per RFx
    response_rate: 0.70                # f64, 0-1 - vendor response rate
    default_price_weight: 0.40         # f64 - price weight in evaluation
    default_quality_weight: 0.35       # f64 - quality weight in evaluation
    default_delivery_weight: 0.25      # f64 - delivery weight in evaluation

  contracts:
    min_duration_months: 12            # u32 - minimum contract duration
    max_duration_months: 36            # u32 - maximum contract duration
    auto_renewal_rate: 0.40            # f64, 0-1 - auto-renewal rate
    amendment_rate: 0.20               # f64, 0-1 - contracts with amendments
    type_distribution:
      fixed_price: 0.40               # f64 - fixed price contracts
      blanket: 0.30                    # f64 - blanket/framework agreements
      time_and_materials: 0.15         # f64 - T&M contracts
      service_agreement: 0.15          # f64 - service agreements

  catalog:
    preferred_vendor_flag_rate: 0.70   # f64, 0-1 - items marked as preferred
    multi_source_rate: 0.25            # f64, 0-1 - items with multiple sources

  scorecards:
    frequency: "quarterly"             # string - review frequency
    on_time_delivery_weight: 0.30      # f64 - OTD weight in score
    quality_weight: 0.30               # f64 - quality weight in score
    price_weight: 0.25                 # f64 - price competitiveness weight
    responsiveness_weight: 0.15        # f64 - responsiveness weight
    grade_a_threshold: 90.0            # f64 - grade A threshold
    grade_b_threshold: 75.0            # f64 - grade B threshold
    grade_c_threshold: 60.0            # f64 - grade C threshold

  p2p_integration:
    off_contract_rate: 0.15            # f64, 0-1 - maverick purchase rate
    price_tolerance: 0.02              # f64 - contract price variance allowed
    catalog_enforcement: false          # bool - enforce catalog ordering

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable source-to-pay generation |
| sourcing.projects_per_year | u32 | 10 | Sourcing projects per year |
| qualification.pass_rate | f64 | 0.75 | Supplier qualification pass rate |
| rfx.response_rate | f64 | 0.70 | Fraction of invited vendors that respond |
| contracts.auto_renewal_rate | f64 | 0.40 | Auto-renewal rate |
| scorecards.frequency | string | "quarterly" | Scorecard review frequency |
| p2p_integration.off_contract_rate | f64 | 0.15 | Rate of off-contract (maverick) purchases |

Financial Reporting Configuration (v0.6.0)

financial_reporting:
  enabled: false                       # Enable financial reporting generation
  generate_balance_sheet: true         # bool - generate balance sheet
  generate_income_statement: true      # bool - generate income statement
  generate_cash_flow: true             # bool - generate cash flow statement
  generate_changes_in_equity: true     # bool - generate changes in equity
  comparative_periods: 1               # u32 - number of comparative periods

  management_kpis:
    enabled: false                     # bool - enable KPI generation
    frequency: "monthly"               # string - monthly, quarterly

  budgets:
    enabled: false                     # bool - enable budget generation
    revenue_growth_rate: 0.05          # f64 - expected revenue growth rate
    expense_inflation_rate: 0.03       # f64 - expected expense inflation rate
    variance_noise: 0.10               # f64 - noise for budget vs actual

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable financial reporting generation |
| generate_balance_sheet | bool | true | Generate balance sheet output |
| generate_income_statement | bool | true | Generate income statement output |
| generate_cash_flow | bool | true | Generate cash flow statement output |
| generate_changes_in_equity | bool | true | Generate changes in equity statement |
| comparative_periods | u32 | 1 | Number of comparative periods to include |
| management_kpis.enabled | bool | false | Enable management KPI calculation |
| management_kpis.frequency | string | "monthly" | KPI calculation frequency |
| budgets.enabled | bool | false | Enable budget generation |
| budgets.revenue_growth_rate | f64 | 0.05 | Expected revenue growth rate for budgeting |
| budgets.expense_inflation_rate | f64 | 0.03 | Expected expense inflation rate |
| budgets.variance_noise | f64 | 0.10 | Random noise added to budget vs actual |

HR Configuration (v0.6.0)

hr:
  enabled: false                       # Enable HR generation

  payroll:
    enabled: true                      # bool - enable payroll generation
    pay_frequency: "monthly"           # string - monthly, biweekly, weekly
    salary_ranges:
      staff_min: 50000.0               # f64 - staff level minimum salary
      staff_max: 70000.0               # f64 - staff level maximum salary
      manager_min: 80000.0             # f64 - manager level minimum salary
      manager_max: 120000.0            # f64 - manager level maximum salary
      director_min: 120000.0           # f64 - director level minimum salary
      director_max: 180000.0           # f64 - director level maximum salary
      executive_min: 180000.0          # f64 - executive level minimum salary
      executive_max: 350000.0          # f64 - executive level maximum salary
    tax_rates:
      federal_effective: 0.22          # f64 - federal effective tax rate
      state_effective: 0.05            # f64 - state effective tax rate
      fica: 0.0765                     # f64 - FICA/social security rate
    benefits_enrollment_rate: 0.60     # f64, 0-1 - benefits enrollment rate
    retirement_participation_rate: 0.45 # f64, 0-1 - retirement plan participation

  time_attendance:
    enabled: true                      # bool - enable time tracking
    overtime_rate: 0.10                # f64, 0-1 - employees with overtime

  expenses:
    enabled: true                      # bool - enable expense report generation
    submission_rate: 0.30              # f64, 0-1 - employees submitting per month
    policy_violation_rate: 0.08        # f64, 0-1 - rate of policy violations

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable HR generation |
| payroll.enabled | bool | true | Enable payroll generation |
| payroll.pay_frequency | string | "monthly" | Pay frequency: monthly, biweekly, weekly |
| payroll.benefits_enrollment_rate | f64 | 0.60 | Benefits enrollment rate |
| payroll.retirement_participation_rate | f64 | 0.45 | Retirement plan participation rate |
| time_attendance.enabled | bool | true | Enable time tracking |
| time_attendance.overtime_rate | f64 | 0.10 | Rate of employees with overtime |
| expenses.enabled | bool | true | Enable expense report generation |
| expenses.submission_rate | f64 | 0.30 | Rate of employees submitting expenses per month |
| expenses.policy_violation_rate | f64 | 0.08 | Rate of policy violations |

Manufacturing Configuration (v0.6.0)

manufacturing:
  enabled: false                       # Enable manufacturing generation

  production_orders:
    orders_per_month: 50               # u32 - production orders per month
    avg_batch_size: 100                # u32 - average batch size
    yield_rate: 0.97                   # f64, 0-1 - production yield rate
    make_to_order_rate: 0.20           # f64, 0-1 - MTO vs MTS ratio
    rework_rate: 0.03                  # f64, 0-1 - rework rate

  costing:
    labor_rate_per_hour: 35.0          # f64 - labor rate per hour
    overhead_rate: 1.50                # f64 - overhead multiplier on direct labor
    standard_cost_update_frequency: "quarterly"  # string - cost update cycle

  routing:
    avg_operations: 4                  # u32 - average operations per routing
    setup_time_hours: 1.5              # f64 - average setup time in hours
    run_time_variation: 0.15           # f64 - run time variation coefficient

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable manufacturing generation |
| production_orders.orders_per_month | u32 | 50 | Number of production orders per month |
| production_orders.avg_batch_size | u32 | 100 | Average batch size |
| production_orders.yield_rate | f64 | 0.97 | Production yield rate |
| production_orders.rework_rate | f64 | 0.03 | Rework rate |
| costing.labor_rate_per_hour | f64 | 35.0 | Direct labor cost per hour |
| costing.overhead_rate | f64 | 1.50 | Overhead application multiplier |
| routing.avg_operations | u32 | 4 | Average operations per routing |
| routing.setup_time_hours | f64 | 1.5 | Average machine setup time in hours |

Sales Quotes Configuration (v0.6.0)

sales_quotes:
  enabled: false                       # Enable sales quote generation
  quotes_per_month: 30                 # u32 - quotes generated per month
  win_rate: 0.35                       # f64, 0-1 - quote-to-order conversion
  validity_days: 30                    # u32 - default quote validity period

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable sales quote generation |
| quotes_per_month | u32 | 30 | Number of quotes generated per month |
| win_rate | f64 | 0.35 | Fraction of quotes that convert to sales orders |
| validity_days | u32 | 30 | Default quote validity period in days |

See Also

Industry Presets

SyntheticData includes pre-configured settings for common industries.

Using Presets

# Create configuration from preset
datasynth-data init --industry manufacturing --complexity medium -o config.yaml

Available Industries

| Industry | Key Characteristics |
|---|---|
| Manufacturing | Heavy P2P, inventory, fixed assets |
| Retail | High O2C volume, seasonal patterns |
| Financial Services | Complex intercompany, high controls |
| Healthcare | Regulatory focus, insurance seasonality |
| Technology | SaaS revenue, R&D capitalization |

Complexity Levels

| Level | Accounts | Vendors | Customers | Materials |
|---|---|---|---|---|
| Small | ~100 | 50 | 100 | 200 |
| Medium | ~400 | 200 | 500 | 1000 |
| Large | ~2500 | 1000 | 5000 | 10000 |

Manufacturing

Characteristics:

  • High P2P activity (procurement, production)
  • Significant inventory and WIP
  • Fixed asset intensive
  • Cost accounting emphasis

Key Settings:

global:
  industry: manufacturing

transactions:
  sources:
    manual: 0.2
    automated: 0.6
    recurring: 0.15
    adjustment: 0.05

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.4          # 40% of JEs from P2P
  o2c:
    enabled: true
    flow_rate: 0.25         # 25% of JEs from O2C

master_data:
  materials:
    count: 1000
  fixed_assets:
    count: 200

subledger:
  inventory:
    enabled: true
    valuation_methods:
      - weighted_average
      - fifo

Typical Account Distribution:

  • 45% expense accounts (production costs)
  • 25% asset accounts (inventory, equipment)
  • 15% liability accounts
  • 10% revenue accounts
  • 5% equity accounts

Retail

Characteristics:

  • High transaction volume
  • Strong seasonal patterns
  • High O2C activity
  • Inventory turnover focus

Key Settings:

global:
  industry: retail

transactions:
  target_count: 500000      # High volume
  temporal:
    month_end_spike: 1.5
    quarter_end_spike: 2.0
    year_end_spike: 5.0     # Holiday season

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.25
  o2c:
    enabled: true
    flow_rate: 0.45         # High sales activity

master_data:
  customers:
    count: 2000
  materials:
    count: 5000

subledger:
  ar:
    enabled: true
    aging_buckets: [30, 60, 90, 120]

Seasonal Pattern:

  • Q4 volume: 200-300% of Q1-Q3 average
  • Black Friday/holiday spikes
  • Post-holiday returns

Financial Services

Characteristics:

  • Complex intercompany structures
  • High regulatory requirements
  • Sophisticated controls
  • Mark-to-market adjustments

Key Settings:

global:
  industry: financial_services

transactions:
  sources:
    automated: 0.7          # High automation
    adjustment: 0.15        # MTM adjustments

intercompany:
  enabled: true
  transaction_types:
    loan: 0.3
    service_provided: 0.25
    dividend: 0.2
    management_fee: 0.15
    royalty: 0.1

internal_controls:
  enabled: true
  controls:
    - id: "SOX-001"
      type: preventive
      frequency: continuous

fx:
  enabled: true
  currency_pairs:
    - EUR
    - GBP
    - CHF
    - JPY
    - CNY
  volatility: 0.015

Control Requirements:

  • SOX 404 compliance mandatory
  • High SoD enforcement
  • Continuous monitoring

Healthcare

Characteristics:

  • Complex revenue recognition (insurance)
  • Regulatory compliance (HIPAA)
  • Seasonal patterns (flu season, open enrollment)
  • High accounts receivable

Key Settings:

global:
  industry: healthcare

transactions:
  amounts:
    min: 50
    max: 500000
    distribution: log_normal

document_flows:
  o2c:
    enabled: true
    flow_rate: 0.5          # Revenue cycle focus

master_data:
  customers:
    count: 1000             # Patient/payer mix

subledger:
  ar:
    enabled: true
    aging_buckets: [30, 60, 90, 120, 180]  # Extended aging

fraud:
  types:
    fictitious_transaction: 0.2
    revenue_manipulation: 0.3   # Upcoding focus
    duplicate_payment: 0.2

Seasonal Pattern:

  • Q1 spike (insurance deductible reset)
  • Flu season (Oct-Feb)
  • Open enrollment (Nov-Dec)

Technology

Characteristics:

  • SaaS/subscription revenue
  • R&D capitalization
  • Stock compensation
  • Deferred revenue management

Key Settings:

global:
  industry: technology

transactions:
  sources:
    automated: 0.65
    recurring: 0.25         # Subscription billing
    manual: 0.08
    adjustment: 0.02

document_flows:
  o2c:
    enabled: true
    flow_rate: 0.35

subledger:
  ar:
    enabled: true

# Additional technology-specific settings
deferred_revenue:
  enabled: true
  recognition_period: monthly

capitalization:
  r_and_d:
    enabled: true
    threshold: 50000

Revenue Pattern:

  • Monthly recurring revenue (MRR)
  • Annual contract billing (ACV)
  • Usage-based components

Process Chain Defaults (v0.6.0)

Starting in v0.6.0, all five industry presets include default settings for the new enterprise process chains. When you generate a configuration with datasynth-data init, the preset populates sensible defaults for each new section, though they remain disabled until explicitly turned on.

| Process Chain | Manufacturing | Retail | Financial Services | Healthcare | Technology |
|---|---|---|---|---|---|
| source_to_pay | High | Medium | Low | Medium | Low |
| financial_reporting | Full | Full | Full | Full | Full |
| hr | Full | Full | Full | Full | Full |
| manufacturing | High | | | | |
| sales_quotes | Medium | High | Low | Medium | High |

Manufacturing presets emphasize production orders, routing, and costing. Retail presets increase sales quote volume and quote-to-order win rates. Financial Services presets focus on financial reporting with comprehensive KPIs and budgets. Healthcare and Technology presets provide balanced defaults.

Each preset configures the following when you set enabled: true:

  • source_to_pay: Sourcing projects, RFx events, contract management, catalogs, and vendor scorecards that feed into the existing P2P document flow.
  • financial_reporting: Balance sheets, income statements, cash flow statements, management KPIs, and budget vs. actual variance analysis.
  • hr: Payroll runs based on employee master data, time and attendance tracking, and expense report generation with policy violation injection.
  • manufacturing: Production orders, WIP tracking, standard costing with labor and overhead, and routing operations.
  • sales_quotes: Quote-to-order pipeline that feeds into the existing O2C document flow.

Customizing Presets

Start with a preset and customize:

# Generate preset
datasynth-data init --industry manufacturing -o config.yaml

# Edit config.yaml
# - Adjust transaction counts
# - Add companies
# - Enable additional features

# Validate and generate
datasynth-data validate --config config.yaml
datasynth-data generate --config config.yaml --output ./output

Combining Industries

For conglomerates, use multiple companies with different characteristics:

companies:
  - code: "1000"
    name: "Manufacturing Division"
    volume_weight: 0.5

  - code: "2000"
    name: "Retail Division"
    volume_weight: 0.3

  - code: "3000"
    name: "Services Division"
    volume_weight: 0.2

See Also

Global Settings

Global settings control overall generation behavior.

Configuration

global:
  seed: 42                           # Random seed for reproducibility
  industry: manufacturing            # Industry preset
  start_date: 2024-01-01             # Generation start date
  period_months: 12                  # Duration in months
  group_currency: USD                # Base/reporting currency
  worker_threads: 4                  # Parallel workers (optional)
  memory_limit: 2147483648           # Memory limit in bytes (optional)

Fields

seed

Random number generator seed for reproducible output.

| Property | Value |
|---|---|
| Type | u64 |
| Required | No |
| Default | Random |

global:
  seed: 42  # Same seed = same output

Use cases:

  • Reproducible test datasets
  • Debugging
  • Consistent benchmarks

industry

Industry preset for domain-specific settings.

| Property | Value |
|---|---|
| Type | string |
| Required | Yes |
| Values | See below |

Available industries:

| Industry | Description |
|---|---|
| manufacturing | Production, inventory, cost accounting |
| retail | High volume sales, seasonal patterns |
| financial_services | Complex IC, regulatory compliance |
| healthcare | Insurance billing, compliance |
| technology | SaaS revenue, R&D |
| energy | Long-term assets, commodity trading |
| telecom | Subscription revenue, network assets |
| transportation | Fleet assets, fuel costs |
| hospitality | Seasonal, revenue management |

start_date

Beginning date for generated data.

| Property | Value |
|---|---|
| Type | date (YYYY-MM-DD) |
| Required | Yes |

global:
  start_date: 2024-01-01

Notes:

  • First transaction will be on or after this date
  • Combined with period_months to determine date range

period_months

Duration of generation period.

| Property | Value |
|---|---|
| Type | u32 |
| Required | Yes |
| Range | 1-120 |

global:
  period_months: 12    # One year
  period_months: 36    # Three years
  period_months: 1     # One month

Considerations:

  • Longer periods = more data
  • Period close features require at least 1 month
  • Year-end close requires at least 12 months

group_currency

Base currency for consolidation and reporting.

| Property | Value |
|---|---|
| Type | string (ISO 4217) |
| Required | Yes |

global:
  group_currency: USD
  group_currency: EUR
  group_currency: CHF

Used for:

  • Currency translation
  • Consolidation
  • Intercompany eliminations

worker_threads

Number of parallel worker threads.

| Property | Value |
|---|---|
| Type | usize |
| Required | No |
| Default | Number of CPU cores |

global:
  worker_threads: 4    # Use 4 threads
  worker_threads: 1    # Single-threaded

Guidance:

  • Default (CPU cores) is usually optimal
  • Reduce for memory-constrained systems
  • Increasing beyond the number of CPU cores rarely improves performance

memory_limit

Maximum memory usage in bytes.

| Property | Value |
|---|---|
| Type | u64 |
| Required | No |
| Default | None (system limit) |

global:
  memory_limit: 1073741824    # 1 GB
  memory_limit: 2147483648    # 2 GB
  memory_limit: 4294967296    # 4 GB

Behavior:

  • Soft limit: Generation slows down
  • Hard limit: Generation pauses until memory freed
  • Streaming output is used to reduce memory pressure

Environment Variable Overrides

| Variable | Setting |
|---|---|
| SYNTH_DATA_SEED | global.seed |
| SYNTH_DATA_THREADS | global.worker_threads |
| SYNTH_DATA_MEMORY_LIMIT | global.memory_limit |

SYNTH_DATA_SEED=12345 datasynth-data generate --config config.yaml --output ./output

Examples

Minimal

global:
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
  group_currency: USD

Full Control

global:
  seed: 42
  industry: financial_services
  start_date: 2023-01-01
  period_months: 36
  group_currency: USD
  worker_threads: 8
  memory_limit: 8589934592  # 8 GB

Development/Testing

global:
  seed: 42                # Reproducible
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 1        # Short period
  group_currency: USD
  worker_threads: 1       # Single thread for debugging

Validation

| Check | Rule |
|---|---|
| period_months | 1 ≤ value ≤ 120 |
| start_date | Valid date |
| industry | Known industry preset |
| group_currency | Valid ISO 4217 code |

See Also

Companies

Company configuration defines the legal entities for data generation.

Configuration

companies:
  - code: "1000"
    name: "Headquarters"
    currency: USD
    country: US
    volume_weight: 0.6
    is_parent: true
    parent_code: null

  - code: "2000"
    name: "European Subsidiary"
    currency: EUR
    country: DE
    volume_weight: 0.4
    is_parent: false
    parent_code: "1000"

Fields

code

Unique identifier for the company.

| Property | Value |
|---|---|
| Type | string |
| Required | Yes |
| Constraints | Unique across all companies |

companies:
  - code: "1000"      # Four-digit SAP-style
  - code: "US01"      # Region-based
  - code: "HQ"        # Abbreviated

name

Display name for the company.

| Property | Value |
|---|---|
| Type | string |
| Required | Yes |

companies:
  - name: "Headquarters"
  - name: "European Operations GmbH"
  - name: "Asia Pacific Holdings"

currency

Local currency for the company.

| Property | Value |
|---|---|
| Type | string (ISO 4217) |
| Required | Yes |

companies:
  - currency: USD
  - currency: EUR
  - currency: CHF
  - currency: JPY

Used for:

  • Transaction amounts
  • Local reporting
  • FX translation

country

Country code for the company.

| Property | Value |
|---|---|
| Type | string (ISO 3166-1 alpha-2) |
| Required | Yes |

companies:
  - country: US
  - country: DE
  - country: CH
  - country: JP

Affects:

  • Holiday calendars
  • Tax calculations
  • Regional templates

volume_weight

Relative transaction volume for this company.

| Property | Value |
|---|---|
| Type | f64 |
| Required | Yes |
| Range | 0.0 - 1.0 |
| Constraint | Sum across all companies = 1.0 |

companies:
  - code: "1000"
    volume_weight: 0.5    # 50% of transactions

  - code: "2000"
    volume_weight: 0.3    # 30% of transactions

  - code: "3000"
    volume_weight: 0.2    # 20% of transactions

is_parent

Whether this company is the consolidation parent.

| Property | Value |
|---|---|
| Type | bool |
| Required | No |
| Default | false |

companies:
  - code: "1000"
    is_parent: true       # Consolidation parent

  - code: "2000"
    is_parent: false      # Subsidiary

Notes:

  • Exactly one company should be is_parent: true for consolidation
  • Parent receives elimination entries

parent_code

Reference to parent company for subsidiaries.

| Property | Value |
|---|---|
| Type | string or null |
| Required | No |
| Default | null |

companies:
  - code: "1000"
    is_parent: true
    parent_code: null     # No parent (is the parent)

  - code: "2000"
    is_parent: false
    parent_code: "1000"   # Owned by 1000

  - code: "3000"
    is_parent: false
    parent_code: "2000"   # Owned by 2000 (nested)

Examples

Single Company

companies:
  - code: "1000"
    name: "Demo Company"
    currency: USD
    country: US
    volume_weight: 1.0

Multi-National

companies:
  - code: "1000"
    name: "Global Holdings Inc"
    currency: USD
    country: US
    volume_weight: 0.4
    is_parent: true

  - code: "2000"
    name: "European Operations GmbH"
    currency: EUR
    country: DE
    volume_weight: 0.25
    parent_code: "1000"

  - code: "3000"
    name: "UK Limited"
    currency: GBP
    country: GB
    volume_weight: 0.15
    parent_code: "2000"

  - code: "4000"
    name: "Asia Pacific Pte Ltd"
    currency: SGD
    country: SG
    volume_weight: 0.2
    parent_code: "1000"

Regional Structure

companies:
  - code: "HQ"
    name: "Headquarters"
    currency: USD
    country: US
    volume_weight: 0.3
    is_parent: true

  - code: "NA01"
    name: "North America Operations"
    currency: USD
    country: US
    volume_weight: 0.3
    parent_code: "HQ"

  - code: "EU01"
    name: "EMEA Operations"
    currency: EUR
    country: DE
    volume_weight: 0.25
    parent_code: "HQ"

  - code: "AP01"
    name: "APAC Operations"
    currency: JPY
    country: JP
    volume_weight: 0.15
    parent_code: "HQ"

Validation

| Check | Rule |
|---|---|
| code | Must be unique |
| volume_weight | Sum must equal 1.0 (±0.01) |
| parent_code | Must reference existing company or be null |
| is_parent | At most one true (if intercompany enabled) |

Intercompany Implications

When multiple companies exist:

  • Intercompany transactions generated between companies
  • FX rates generated for currency pairs
  • Elimination entries created for parent
  • Transfer pricing applied

See Intercompany Processing for details.

See Also

Transactions

Transaction settings control journal entry generation.

Configuration

transactions:
  target_count: 100000

  line_items:
    distribution: empirical
    min_lines: 2
    max_lines: 20

  amounts:
    min: 100
    max: 1000000
    distribution: log_normal
    round_number_bias: 0.15

  sources:
    manual: 0.3
    automated: 0.5
    recurring: 0.15
    adjustment: 0.05

  benford:
    enabled: true

  temporal:
    month_end_spike: 2.5
    quarter_end_spike: 3.0
    year_end_spike: 4.0
    working_hours_only: true

Fields

target_count

Total number of journal entries to generate.

| Property | Value |
|---|---|
| Type | u64 |
| Required | Yes |

transactions:
  target_count: 10000      # Small dataset
  target_count: 100000     # Medium dataset
  target_count: 1000000    # Large dataset

line_items

Controls the number of line items per journal entry.

distribution

| Value | Description |
|---|---|
| empirical | Based on real-world GL research |
| uniform | Equal probability for all counts |
| custom | User-defined probabilities |

Empirical distribution (default):

  • 2 lines: 60.68%
  • 3 lines: 5.24%
  • 4 lines: 17.32%
  • Even line counts: 88% preference

line_items:
  distribution: empirical

Custom distribution:

line_items:
  distribution: custom
  custom_distribution:
    2: 0.50
    3: 0.10
    4: 0.20
    5: 0.10
    6: 0.10

min_lines / max_lines

| Property | Value |
|---|---|
| Type | u32 |
| Default | 2 / 20 |

line_items:
  min_lines: 2
  max_lines: 10

amounts

Controls transaction amounts.

min / max

| Property | Value |
|---|---|
| Type | f64 |
| Required | Yes |

amounts:
  min: 100           # Minimum amount
  max: 1000000       # Maximum amount

distribution

| Value | Description |
|---|---|
| log_normal | Log-normal distribution (realistic) |
| uniform | Equal probability across range |
| custom | User-defined |

amounts:
  distribution: log_normal

round_number_bias

Preference for round numbers (100, 500, 1000, etc.).

| Property | Value |
|---|---|
| Type | f64 |
| Range | 0.0 - 1.0 |
| Default | 0.15 |

amounts:
  round_number_bias: 0.15    # 15% round numbers
  round_number_bias: 0.0     # No round number bias

sources

Transaction source distribution (weights must sum to 1.0).

| Source | Description |
|---|---|
| manual | Manual journal entries |
| automated | System-generated |
| recurring | Scheduled recurring entries |
| adjustment | Period-end adjustments |

sources:
  manual: 0.3
  automated: 0.5
  recurring: 0.15
  adjustment: 0.05

benford

Benford’s Law compliance for first-digit distribution.

benford:
  enabled: true       # Follow P(d) = log10(1 + 1/d)
  enabled: false      # Disable Benford compliance

Expected distribution (enabled):

| Digit | Probability |
|---|---|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |
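
The expected probabilities above come directly from the Benford formula P(d) = log10(1 + 1/d). A short Python check reproduces the table:

# Expected first-digit probabilities under Benford's Law.
import math

benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
for digit, p in benford.items():
    print(f"{digit}: {p:.1%}")   # 1: 30.1%, 2: 17.6%, ..., 9: 4.6%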

temporal

Temporal patterns for transaction timing.

Spikes

Volume multipliers for period ends:

temporal:
  month_end_spike: 2.5    # 2.5x volume at month end
  quarter_end_spike: 3.0  # 3.0x at quarter end
  year_end_spike: 4.0     # 4.0x at year end
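
Conceptually, each period-end day's base volume is scaled by the matching multiplier. The sketch below is an illustrative Python rendering of that rule, not the generator's scheduler; the helper name and the calendar logic are assumptions.

# Hypothetical sketch: pick the volume multiplier for a given calendar day.
import datetime as dt

def daily_multiplier(day, month_end=2.5, quarter_end=3.0, year_end=4.0):
    next_day = day + dt.timedelta(days=1)
    if next_day.month != day.month:            # last day of the month
        if day.month == 12:
            return year_end
        if day.month in (3, 6, 9):
            return quarter_end
        return month_end
    return 1.0

print(daily_multiplier(dt.date(2024, 3, 31)))   # 3.0 (quarter end)
print(daily_multiplier(dt.date(2024, 12, 31)))  # 4.0 (year end)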

Working Hours

Restrict transactions to business hours:

temporal:
  working_hours_only: true   # Mon-Fri, 8am-6pm
  working_hours_only: false  # Any time

Examples

High Volume Retail

transactions:
  target_count: 500000

  line_items:
    distribution: empirical
    min_lines: 2
    max_lines: 6

  amounts:
    min: 10
    max: 50000
    distribution: log_normal
    round_number_bias: 0.3

  sources:
    manual: 0.1
    automated: 0.8
    recurring: 0.08
    adjustment: 0.02

  temporal:
    month_end_spike: 1.5
    quarter_end_spike: 2.0
    year_end_spike: 5.0

Low Volume Manual

transactions:
  target_count: 5000

  line_items:
    distribution: empirical

  amounts:
    min: 1000
    max: 10000000

  sources:
    manual: 0.6
    automated: 0.2
    recurring: 0.1
    adjustment: 0.1

  temporal:
    month_end_spike: 3.0
    quarter_end_spike: 4.0
    year_end_spike: 5.0
    working_hours_only: true

Testing/Development

transactions:
  target_count: 1000

  line_items:
    distribution: uniform
    min_lines: 2
    max_lines: 4

  amounts:
    min: 100
    max: 10000
    distribution: uniform
    round_number_bias: 0.0

  sources:
    manual: 1.0

  benford:
    enabled: false

Validation

| Check | Rule |
|---|---|
| target_count | > 0 |
| min_lines | ≥ 2 |
| max_lines | ≥ min_lines |
| amounts.min | > 0 |
| amounts.max | > min |
| round_number_bias | 0.0 - 1.0 |
| sources | Sum = 1.0 (±0.01) |
| *_spike | ≥ 1.0 |

See Also

Master Data

Master data settings control generation of business entities.

Configuration

master_data:
  vendors:
    count: 200
    intercompany_ratio: 0.05

  customers:
    count: 500
    intercompany_ratio: 0.05

  materials:
    count: 1000

  fixed_assets:
    count: 100

  employees:
    count: 50
    hierarchy_depth: 4

Vendors

Supplier master data configuration.

master_data:
  vendors:
    count: 200                    # Number of vendors
    intercompany_ratio: 0.05      # IC vendor percentage

    payment_terms:
      - code: "NET30"
        days: 30
        weight: 0.5
      - code: "NET60"
        days: 60
        weight: 0.3
      - code: "NET10"
        days: 10
        weight: 0.2

    behavior:
      late_payment_rate: 0.1      # % with late payment tendency
      discount_usage_rate: 0.3    # % that take early pay discounts

Generated Fields

| Field | Description |
|---|---|
| vendor_id | Unique identifier |
| vendor_name | Company name |
| tax_id | Tax identification number |
| payment_terms | Default payment terms |
| currency | Transaction currency |
| bank_account | Bank details |
| is_intercompany | IC vendor flag |
| valid_from | Temporal validity start |

Customers

Customer master data configuration.

master_data:
  customers:
    count: 500                    # Number of customers
    intercompany_ratio: 0.05      # IC customer percentage

    credit_rating:
      - code: "AAA"
        limit_multiplier: 10.0
        weight: 0.1
      - code: "AA"
        limit_multiplier: 5.0
        weight: 0.2
      - code: "A"
        limit_multiplier: 2.0
        weight: 0.4
      - code: "B"
        limit_multiplier: 1.0
        weight: 0.3

    payment_behavior:
      on_time_rate: 0.7           # % that pay on time
      early_payment_rate: 0.1     # % that pay early
      late_payment_rate: 0.2      # % that pay late

Generated Fields

| Field | Description |
|---|---|
| customer_id | Unique identifier |
| customer_name | Company/person name |
| credit_limit | Maximum credit |
| credit_rating | Rating code |
| payment_behavior | Payment tendency |
| currency | Transaction currency |
| is_intercompany | IC customer flag |

Materials

Product/material master data.

master_data:
  materials:
    count: 1000                   # Number of materials

    types:
      raw_material: 0.3
      work_in_progress: 0.1
      finished_goods: 0.4
      services: 0.2

    valuation:
      - method: fifo
        weight: 0.3
      - method: weighted_average
        weight: 0.5
      - method: standard_cost
        weight: 0.2

Generated Fields

| Field | Description |
|---|---|
| material_id | Unique identifier |
| description | Material name |
| material_type | Classification |
| unit_of_measure | UOM |
| valuation_method | Costing method |
| standard_cost | Unit cost |
| gl_account | Inventory account |

Fixed Assets

Capital asset master data.

master_data:
  fixed_assets:
    count: 100                    # Number of assets

    categories:
      buildings: 0.1
      machinery: 0.3
      vehicles: 0.2
      furniture: 0.2
      it_equipment: 0.2

    depreciation:
      - method: straight_line
        weight: 0.7
      - method: declining_balance
        weight: 0.2
      - method: units_of_production
        weight: 0.1

Generated Fields

| Field | Description |
|---|---|
| asset_id | Unique identifier |
| description | Asset name |
| asset_class | Category |
| acquisition_date | Purchase date |
| acquisition_cost | Original cost |
| useful_life | Years |
| depreciation_method | Method |
| salvage_value | Residual value |

Employees

User/employee master data.

master_data:
  employees:
    count: 50                     # Number of employees
    hierarchy_depth: 4            # Org chart depth

    roles:
      - name: "AP Clerk"
        approval_limit: 5000
        weight: 0.3
      - name: "AP Manager"
        approval_limit: 50000
        weight: 0.1
      - name: "AR Clerk"
        approval_limit: 5000
        weight: 0.3
      - name: "Controller"
        approval_limit: 500000
        weight: 0.1
      - name: "CFO"
        approval_limit: 999999999
        weight: 0.05

    transaction_codes:
      - "FB01"     # Post document
      - "FB50"     # Enter GL
      - "F-28"     # Post incoming payment
      - "F-53"     # Post outgoing payment

Generated Fields

| Field | Description |
|---|---|
| employee_id | Unique identifier |
| name | Full name |
| department | Department code |
| role | Job role |
| manager_id | Supervisor reference |
| approval_limit | Max approval amount |
| transaction_codes | Authorized T-codes |

HR and Payroll Integration (v0.6.0)

Employee master data serves as the foundation for the hr configuration section introduced in v0.6.0. When the HR module is enabled, each employee record drives downstream generation:

  • Payroll: Salary, tax withholdings, benefits deductions, and retirement contributions are computed per employee based on their role and the salary ranges defined in hr.payroll.salary_ranges. The pay_frequency setting (monthly, biweekly, or weekly) determines how many payroll runs are generated per period.
  • Time and Attendance: Time entries are generated for each employee according to working days in the period. The overtime_rate controls how many employees have overtime hours in a given period.
  • Expense Reports: A subset of employees (controlled by hr.expenses.submission_rate) generate expense reports each month. Policy violations are injected at the configured policy_violation_rate.

The employees.count and employees.hierarchy_depth settings in master_data directly determine the population size for all HR outputs. Increasing the employee count will proportionally increase payroll journal entries, time records, and expense reports.

master_data:
  employees:
    count: 200                     # Drives payroll and HR volume
    hierarchy_depth: 5

hr:
  enabled: true                    # Activates payroll, time, and expenses
  payroll:
    pay_frequency: "biweekly"      # 26 pay periods per year
  expenses:
    submission_rate: 0.40          # 40% of employees submit per month
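
To make the pay_frequency relationship concrete, the sketch below estimates how many payroll runs a configuration produces over the generation period. It is an illustrative Python sketch only; the mapping values follow the comment in the example above, and the function name is hypothetical.

# Hypothetical sketch: payroll runs implied by pay_frequency and period length.
RUNS_PER_YEAR = {"monthly": 12, "biweekly": 26, "weekly": 52}

def payroll_runs(pay_frequency, period_months):
    return round(RUNS_PER_YEAR[pay_frequency] * period_months / 12)

print(payroll_runs("biweekly", period_months=12))   # 26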

Examples

Small Company

master_data:
  vendors:
    count: 50
  customers:
    count: 100
  materials:
    count: 200
  fixed_assets:
    count: 20
  employees:
    count: 10
    hierarchy_depth: 2

Large Enterprise

master_data:
  vendors:
    count: 2000
    intercompany_ratio: 0.1
  customers:
    count: 10000
    intercompany_ratio: 0.1
  materials:
    count: 50000
  fixed_assets:
    count: 5000
  employees:
    count: 500
    hierarchy_depth: 8

Validation

| Check | Rule |
|---|---|
| count | > 0 |
| intercompany_ratio | 0.0 - 1.0 |
| hierarchy_depth | ≥ 1 |
| Distribution weights | Sum = 1.0 |

See Also

Document Flows

Document flow settings control P2P (Procure-to-Pay) and O2C (Order-to-Cash) process generation, including document types, three-way matching, credit checks, and document chain management.

Configuration

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.3
    completion_rate: 0.95
    three_way_match:
      quantity_tolerance: 0.02
      price_tolerance: 0.01

  o2c:
    enabled: true
    flow_rate: 0.3
    completion_rate: 0.95

Procure-to-Pay (P2P)

Flow

Purchase     Purchase    Goods      Vendor     Three-Way
Requisition → Order   → Receipt  → Invoice  → Match    → Payment
                │                     │          │
                │                ┌────┘          │
                ▼                ▼               ▼
           AP Open Item ← Match Result      AP Aging

Purchase Order Types

SyntheticData models 6 PO types, each with different downstream behavior:

| Type | Description | Requires GR? | Use Case |
|---|---|---|---|
| Standard | Standard goods purchase | Yes | Most common PO type |
| Service | Service procurement | No | Consulting, maintenance, etc. |
| Framework | Blanket/framework agreement | Yes | Long-term supply agreements |
| Consignment | Vendor-managed inventory | Yes | Consignment stock |
| StockTransfer | Inter-plant transfer | Yes | Internal stock movement |
| Subcontracting | External processing | Yes | Outsourced manufacturing |

Goods Receipt Movement Types

Goods receipts use SAP-style movement type codes:

| Movement Type | Code | Description |
|---|---|---|
| GrForPo | 101 | Standard GR against purchase order |
| ReturnToVendor | 122 | Return materials to vendor |
| GrForProduction | 131 | GR from production order |
| TransferPosting | 301 | Transfer between plants/locations |
| InitialEntry | 561 | Initial stock entry |
| Scrapping | 551 | Scrap disposal |
| Consumption | 201 | Direct consumption posting |

Three-Way Match

The three-way match validator compares Purchase Order, Goods Receipt, and Vendor Invoice to detect variances before payment.

Algorithm

For each invoice line item:
  1. Find matching PO line (by PO reference + line number)
  2. Sum GR quantities for that PO line (supports multiple partial GRs)
  3. Compare:
     a. PO quantity vs GR quantity → QuantityPoGr variance
     b. GR quantity vs Invoice quantity → QuantityGrInvoice variance
     c. PO unit price vs Invoice unit price → PricePoInvoice variance
     d. PO total vs Invoice total → TotalAmount variance
  4. Apply tolerances:
     - Quantity: ±quantity_tolerance (default 2%)
     - Price: ±price_tolerance (default 5%)
     - Absolute: ±absolute_amount_tolerance (default $0.01)
  5. Check over-delivery:
     - If GR qty > PO qty and allow_over_delivery=true:
       allow up to max_over_delivery_pct (default 10%)
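
The core of the algorithm is the per-line tolerance comparison. The following Python sketch illustrates that comparison under the defaults listed above; it is not the generator's matcher, and the function and field names are illustrative.

# Hypothetical sketch of the tolerance checks described above.
def within_tolerance(expected, actual, rel_tol, abs_tol=0.01):
    diff = abs(actual - expected)
    return diff <= abs_tol or diff <= rel_tol * abs(expected)

def match_line(po_qty, gr_qty, inv_qty, po_price, inv_price,
               quantity_tolerance=0.02, price_tolerance=0.05):
    variances = []
    if not within_tolerance(po_qty, gr_qty, quantity_tolerance):
        variances.append("QuantityPoGr")
    if not within_tolerance(gr_qty, inv_qty, quantity_tolerance):
        variances.append("QuantityGrInvoice")
    if not within_tolerance(po_price, inv_price, price_tolerance):
        variances.append("PricePoInvoice")
    return "passed" if not variances else variances

print(match_line(po_qty=100, gr_qty=101, inv_qty=101, po_price=9.80, inv_price=10.00))  # passed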

Variance Types

| Variance Type | Description | Detection |
|---|---|---|
| QuantityPoGr | GR quantity differs from PO quantity | PO vs GR comparison |
| QuantityGrInvoice | Invoice quantity differs from GR quantity | GR vs Invoice comparison |
| PricePoInvoice | Invoice unit price differs from PO price | PO vs Invoice comparison |
| TotalAmount | Total invoice amount mismatch | Overall amount check |
| MissingLine | PO line not found in invoice or GR | Line matching |
| ExtraLine | Invoice has lines not on PO | Line matching |

Match Outcomes

| Outcome | Meaning | Action |
|---|---|---|
| passed | All within tolerance | Proceed to payment |
| quantity_variance | Quantity outside tolerance | Review required |
| price_variance | Price outside tolerance | Review required |
| blocked | Multiple variances or critical mismatch | Manual resolution |

Configuration

document_flows:
  p2p:
    three_way_match:
      enabled: true
      price_tolerance: 0.05              # 5% price variance allowed
      quantity_tolerance: 0.02            # 2% quantity variance allowed
      absolute_amount_tolerance: 0.01     # $0.01 rounding tolerance
      allow_over_delivery: true
      max_over_delivery_pct: 0.10         # 10% over-delivery allowed

P2P Stage Configuration

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.3                       # 30% of JEs from P2P
    completion_rate: 0.95                # 95% complete full flow

    stages:
      po_approval_rate: 0.9             # 90% of POs approved
      gr_rate: 0.98                     # 98% of POs get goods receipts
      invoice_rate: 0.95                # 95% of GRs get invoices
      payment_rate: 0.92                # 92% of invoices get paid

    timing:
      po_to_gr_days:
        min: 1
        max: 30
      gr_to_invoice_days:
        min: 1
        max: 14
      invoice_to_payment_days:
        min: 10
        max: 60

P2P Journal Entries

| Stage | Debit | Credit | Trigger |
|---|---|---|---|
| Goods Receipt | Inventory (1300) | GR/IR Clearing (2100) | GR posted |
| Invoice Receipt | GR/IR Clearing (2100) | Accounts Payable (2000) | Invoice verified |
| Payment | Accounts Payable (2000) | Cash (1000) | Payment executed |
| Price Variance | PPV Expense (5xxx) | GR/IR Clearing (2100) | Price mismatch |

Order-to-Cash (O2C)

Flow

Sales     Credit   Delivery    Customer    Customer
Order   → Check  → (Pick/   → Invoice   → Receipt
  │                Pack/         │           │
  │                Ship)         │           │
  │                  │           ▼           ▼
  │                  │      AR Open Item   AR Aging
  │                  │           │
  │                  │           └→ Dunning (if overdue)
  │                  ▼
  │            Inventory Issue
  │            (COGS posting)
  ▼
Revenue Recognition
(ASC 606 / IFRS 15)

Sales Order Types

SyntheticData models 9 SO types:

| Type | Description | Requires Delivery? |
|---|---|---|
| Standard | Standard sales order | Yes |
| Rush | Priority/expedited order | Yes |
| CashSale | Immediate payment at sale | Yes |
| Return | Customer return order | No (creates return delivery) |
| FreeOfCharge | No-charge delivery (samples, warranty) | Yes |
| Consignment | Consignment fill-up/issue | Yes |
| Service | Service order (no physical delivery) | No |
| CreditMemoRequest | Request for credit memo | No |
| DebitMemoRequest | Request for debit memo | No |

Delivery Types

Six delivery types model different fulfillment scenarios:

| Type | Description | Direction |
|---|---|---|
| Outbound | Standard outbound delivery | Ship to customer |
| Return | Customer return delivery | Receive from customer |
| StockTransfer | Inter-plant stock transfer | Internal movement |
| Replenishment | Replenishment delivery | Warehouse → store |
| ConsignmentIssue | Issue from consignment stock | Consignment → customer |
| ConsignmentReturn | Return to consignment stock | Customer → consignment |

Customer Invoice Types

Seven invoice types carry different accounting treatment:

| Type | Description | AR Impact |
|---|---|---|
| Standard | Normal sales invoice | Creates receivable |
| CreditMemo | Credit for returns/adjustments | Reduces receivable |
| DebitMemo | Additional charge | Increases receivable |
| ProForma | Pre-delivery invoice (no posting) | None |
| DownPaymentRequest | Advance payment request | Creates special receivable |
| FinalInvoice | Settles down payment | Clears down payment |
| Intercompany | IC billing | Creates IC receivable |

Credit Check

Sales orders pass through credit verification before delivery:

document_flows:
  o2c:
    credit_check:
      enabled: true
      check_credit_limit: true          # Verify customer limit
      check_overdue: true               # Check for past-due AR
      block_threshold: 0.9              # Block if >90% of limit used
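
The block decision can be pictured as a simple exposure check: new order value plus open receivables against the customer's credit limit. The Python sketch below is illustrative only, not the generator's implementation, and its inputs are simplified assumptions.

# Hypothetical sketch of the credit check decision.
def credit_check(order_amount, open_ar, credit_limit,
                 has_overdue_items, block_threshold=0.9):
    exposure = open_ar + order_amount
    if has_overdue_items:
        return "blocked_overdue"
    if exposure > block_threshold * credit_limit:
        return "blocked_credit_limit"
    return "passed"

print(credit_check(order_amount=20_000, open_ar=75_000,
                   credit_limit=100_000, has_overdue_items=False))  # blocked_credit_limit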

O2C Stage Configuration

document_flows:
  o2c:
    enabled: true
    flow_rate: 0.3                       # 30% of JEs from O2C
    completion_rate: 0.95                # 95% complete full flow

    stages:
      so_approval_rate: 0.95            # 95% of SOs approved
      credit_check_pass_rate: 0.9       # 90% pass credit check
      delivery_rate: 0.98               # 98% of SOs get deliveries
      invoice_rate: 0.95                # 95% of deliveries get invoices
      collection_rate: 0.85             # 85% of invoices collected

    timing:
      so_to_delivery_days:
        min: 1
        max: 14
      delivery_to_invoice_days:
        min: 0
        max: 3
      invoice_to_payment_days:
        min: 15
        max: 90

O2C Journal Entries

| Stage | Debit | Credit | Trigger |
|---|---|---|---|
| Delivery | Cost of Goods Sold (5000) | Inventory (1300) | Goods issued |
| Invoice | Accounts Receivable (1100) | Revenue (4000) | Invoice posted |
| Receipt | Cash (1000) | Accounts Receivable (1100) | Payment received |
| Credit Memo | Revenue (4000) | Accounts Receivable (1100) | Credit issued |

Document Chain Manager

The document chain manager maintains referential integrity across the complete document flow by tracking references between documents.

Reference Types

| Type | Description | Example |
|---|---|---|
| FollowOn | Next document in normal flow | PO → GR → Invoice → Payment |
| Payment | Payment for invoice | PAY-001 → INV-001 |
| Reversal | Correction or reversal document | CRED-001 → INV-001 |
| Partial | Partial fulfillment | GR-001 (partial) → PO-001 |
| CreditMemo | Credit against invoice | CM-001 → INV-001 |
| DebitMemo | Debit against invoice | DM-001 → INV-001 |
| Return | Return against delivery | RET-001 → DEL-001 |
| IntercompanyMatch | IC matched pair | IC-INV-001 → IC-INV-002 |
| Manual | User-defined reference | Any → Any |

Document Chain Output

PO-001 ─→ GR-001 ─→ INV-001 ─→ PAY-001
   │          │          │          │
   └──────────┴──────────┴──────────┘
              Document Chain

The document_references.csv output file records all links:

| Field | Description |
|---|---|
| source_document_id | Referencing document |
| target_document_id | Referenced document |
| reference_type | Type of reference |
| created_date | Date reference was created |

Complex Scenario Examples

Partial Deliveries with Split Invoice

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.4
    completion_rate: 0.90           # 10% incomplete (partial deliveries)
    three_way_match:
      quantity_tolerance: 0.05      # 5% tolerance for partials
      allow_over_delivery: true
      max_over_delivery_pct: 0.10
    timing:
      po_to_gr_days: { min: 3, max: 45 }    # Longer lead times
      gr_to_invoice_days: { min: 1, max: 21 }
      invoice_to_payment_days: { min: 30, max: 90 }

High-Volume Retail O2C

document_flows:
  o2c:
    enabled: true
    flow_rate: 0.5                  # 50% of JEs from O2C
    completion_rate: 0.98           # High completion rate
    stages:
      so_approval_rate: 0.99       # Auto-approved
      credit_check_pass_rate: 0.95
      delivery_rate: 0.99
      invoice_rate: 0.99
      collection_rate: 0.92
    timing:
      so_to_delivery_days: { min: 0, max: 3 }     # Fast fulfillment
      delivery_to_invoice_days: { min: 0, max: 0 } # Immediate invoice
      invoice_to_payment_days: { min: 10, max: 45 }

Combined Manufacturing P2P + O2C

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.35
    completion_rate: 0.95
    three_way_match:
      quantity_tolerance: 0.02
      price_tolerance: 0.01
    timing:
      po_to_gr_days: { min: 5, max: 30 }
      gr_to_invoice_days: { min: 1, max: 10 }
      invoice_to_payment_days: { min: 20, max: 45 }

  o2c:
    enabled: true
    flow_rate: 0.35
    completion_rate: 0.90
    credit_check:
      enabled: true
      block_threshold: 0.85
    timing:
      so_to_delivery_days: { min: 3, max: 21 }
      delivery_to_invoice_days: { min: 0, max: 2 }
      invoice_to_payment_days: { min: 30, max: 60 }

Validation

| Check | Rule |
|---|---|
| flow_rate | 0.0 - 1.0 |
| completion_rate | 0.0 - 1.0 |
| tolerance values | 0.0 - 1.0 |
| timing.min | ≥ 0 |
| timing.max | ≥ min |
| Stage rates | 0.0 - 1.0 |

See Also

Subledgers

SyntheticData generates subsidiary ledger records for Accounts Receivable (AR), Accounts Payable (AP), Fixed Assets (FA), and Inventory, with automatic GL reconciliation and document flow linking.

Overview

Subledger generators produce detailed records that reconcile back to GL control accounts:

| Subledger | Control Account | Record Types | Output Files |
|---|---|---|---|
| AR | 1100 (AR Control) | Open items, aging, receipts, credit memos, dunning | ar_open_items.csv, ar_aging.csv |
| AP | 2000 (AP Control) | Open items, aging, payment scheduling, debit memos | ap_open_items.csv, ap_aging.csv |
| FA | 1600+ (Asset accounts) | Register, depreciation, acquisitions, disposals | fa_register.csv, fa_depreciation.csv |
| Inventory | 1300 (Inventory) | Positions, movements (22 types), valuation | inventory_positions.csv, inventory_movements.csv |

Configuration

subledger:
  enabled: true
  ar:
    enabled: true
    aging_buckets: [30, 60, 90, 120]    # Days
    dunning_levels: 3
    credit_memo_rate: 0.05               # 5% of invoices get credit memos
  ap:
    enabled: true
    aging_buckets: [30, 60, 90, 120]
    early_payment_discount_rate: 0.02
    payment_scheduling: true
  fa:
    enabled: true
    depreciation_methods:
      - straight_line
      - declining_balance
      - sum_of_years_digits
    disposal_rate: 0.03                  # 3% of assets disposed per year
  inventory:
    enabled: true
    valuation_method: standard_cost      # standard_cost, moving_average, fifo, lifo
    cycle_count_frequency: monthly

Accounts Receivable (AR)

Record Types

The AR subledger generates:

  • Open Items: Outstanding customer invoices with aging classification
  • Receipts: Customer payments applied to invoices (full, partial, on-account)
  • Credit Memos: Credits issued for returns, disputes, or pricing adjustments
  • Aging Reports: Aged balances by customer and aging bucket
  • Dunning Notices: Automated collection notices at configurable levels

Open Item Fields

| Field | Description |
|---|---|
| customer_id | Customer reference |
| invoice_number | Document number |
| invoice_date | Issue date |
| due_date | Payment due date |
| original_amount | Invoice total |
| open_amount | Remaining balance |
| currency | Invoice currency |
| payment_terms | Net 30, Net 60, etc. |
| aging_bucket | 0-30, 31-60, 61-90, 91-120, 120+ |
| dunning_level | Current dunning level (0-3) |
| last_dunning_date | Date of last dunning notice |
| dispute_flag | Whether item is disputed |

Aging Buckets

Default aging buckets classify receivables by days past due:

| Bucket | Range | Typical % |
|---|---|---|
| Current | 0-30 days | 65-75% |
| 31-60 | 31-60 days | 12-18% |
| 61-90 | 61-90 days | 5-8% |
| 91-120 | 91-120 days | 2-4% |
| 120+ | Over 120 days | 1-3% |
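
Bucket assignment is a straightforward threshold walk over the configured aging_buckets. The Python sketch below shows one way to express it; the function name and label format are illustrative, not the generator's output format.

# Hypothetical sketch: classify an open item by days past due.
def aging_bucket(days_past_due, buckets=(30, 60, 90, 120)):
    lower = 0
    for upper in buckets:
        if days_past_due <= upper:
            return f"{lower}-{upper}" if lower else f"Current (0-{upper})"
        lower = upper + 1
    return f"{buckets[-1]}+"

print(aging_bucket(45))    # 31-60
print(aging_bucket(200))   # 120+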

Dunning Process

Dunning generates progressively urgent collection notices:

| Level | Days Overdue | Action |
|---|---|---|
| 0 | 0-30 | No action (within terms) |
| 1 | 31-60 | Friendly reminder |
| 2 | 61-90 | Formal notice |
| 3 | 90+ | Final demand / collections |

Document Flow Integration

AR open items are created from O2C customer invoices:

Sales Order → Delivery → Customer Invoice → AR Open Item → Customer Receipt
                                                 │
                                                 └→ Dunning Notice (if overdue)

Accounts Payable (AP)

Record Types

The AP subledger generates:

  • Open Items: Outstanding vendor invoices with aging and payment scheduling
  • Payments: Vendor payment runs (check, wire, ACH)
  • Debit Memos: Deductions for quality issues, returns, pricing errors
  • Aging Reports: Aged payables by vendor
  • Payment Scheduling: Planned payments considering cash flow and discounts

Open Item Fields

| Field | Description |
|---|---|
| vendor_id | Vendor reference |
| invoice_number | Vendor invoice number |
| invoice_date | Invoice receipt date |
| due_date | Payment due date |
| baseline_date | Date for terms calculation |
| original_amount | Invoice total |
| open_amount | Remaining balance |
| currency | Invoice currency |
| payment_terms | 2/10 Net 30, etc. |
| discount_date | Discount deadline |
| discount_amount | Available discount |
| payment_block | Block code (if blocked) |
| three_way_match_status | Matched / Variance / Blocked |

Early Payment Discounts

The AP generator models cash discount optimization:

Payment Terms: 2/10 Net 30
  → Pay within 10 days: 2% discount
  → Pay by day 30: full amount
  → Past day 30: overdue

early_payment_discount_rate: 0.02   # Take 2% discount when offered
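
Numerically, "2/10 Net 30" means a 2% reduction if payment is made within 10 days of the invoice date. The Python sketch below works that out for a single invoice; it is illustrative only, and the helper name and parameters are assumptions.

# Hypothetical sketch: amount to pay under 2/10 Net 30 terms.
import datetime as dt

def payment_amount(invoice_total, invoice_date, pay_date,
                   discount_pct=0.02, discount_days=10):
    if (pay_date - invoice_date).days <= discount_days:
        return round(invoice_total * (1 - discount_pct), 2)
    return invoice_total

print(payment_amount(10_000.0, dt.date(2024, 3, 1), dt.date(2024, 3, 8)))   # 9800.0
print(payment_amount(10_000.0, dt.date(2024, 3, 1), dt.date(2024, 3, 25)))  # 10000.0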

Payment Scheduling

When enabled, the AP generator creates a payment schedule that optimizes:

  • Discount capture: Prioritize invoices with expiring discounts
  • Cash flow: Spread payments across the period
  • Vendor priority: Pay critical vendors first

Document Flow Integration

AP open items are created from P2P vendor invoices:

Purchase Order → Goods Receipt → Vendor Invoice → Three-Way Match → AP Open Item → Payment
                                                                          │
                                                                          └→ Debit Memo (if variance)

Fixed Assets (FA)

Record Types

The FA subledger generates:

  • Asset Register: Master record for each fixed asset
  • Depreciation Schedule: Monthly depreciation entries per asset
  • Acquisitions: New asset additions (from PO or direct capitalization)
  • Disposals: Asset retirements, sales, scrapping
  • Transfers: Inter-company or inter-department transfers
  • Impairment: Write-downs when fair value drops below book value

Asset Register Fields

| Field | Description |
|---|---|
| asset_id | Unique identifier |
| description | Asset name/description |
| asset_class | Buildings, Equipment, Vehicles, IT, Furniture |
| acquisition_date | Purchase/capitalization date |
| acquisition_cost | Original cost |
| useful_life_years | Depreciable life |
| salvage_value | Residual value |
| depreciation_method | Method used |
| accumulated_depreciation | Total depreciation to date |
| net_book_value | Current carrying value |
| disposal_date | Date retired (if applicable) |
| disposal_proceeds | Sale price (if sold) |
| disposal_gain_loss | Gain or loss on disposal |

Depreciation Methods

| Method | Description | Use Case |
|---|---|---|
| StraightLine | Equal amounts each period | Default, most common |
| DecliningBalance { rate } | Fixed percentage of remaining balance | Accelerated (tax) |
| SumOfYearsDigits | Decreasing fractions of depreciable base | Accelerated |
| UnitsOfProduction { total_units } | Based on usage/output | Manufacturing equipment |
| None | No depreciation | Land, construction in progress |
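
For the straight-line case, the monthly charge is simply the depreciable base spread evenly over the asset's life in months. The Python sketch below illustrates the arithmetic; it is not the generator's implementation, and the function name is illustrative.

# Hypothetical sketch: monthly straight-line depreciation.
def straight_line_monthly(acquisition_cost, salvage_value, useful_life_years):
    depreciable_base = acquisition_cost - salvage_value
    return round(depreciable_base / (useful_life_years * 12), 2)

print(straight_line_monthly(120_000.0, 12_000.0, useful_life_years=5))  # 1800.0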

Depreciation Journal Entries

Each period, the FA generator creates depreciation entries:

| Debit | Credit | Amount |
|---|---|---|
| Depreciation Expense (6xxx) | Accumulated Depreciation (1650) | Period depreciation |

Disposal Accounting

When an asset is disposed:

| Scenario | Debit | Credit |
|---|---|---|
| Sale at gain | Cash, Accum Depr | Asset Cost, Gain on Disposal |
| Sale at loss | Cash, Accum Depr, Loss on Disposal | Asset Cost |
| Scrapping | Accum Depr, Loss on Disposal | Asset Cost |

Inventory

Record Types

The Inventory subledger generates:

  • Positions: Current stock levels by material, plant, and storage location
  • Movements: 22 movement types covering receipts, issues, transfers, and adjustments
  • Valuation: Inventory value calculated using configurable valuation methods

Position Fields

FieldDescription
material_idMaterial reference
plantPlant/warehouse code
storage_locationStorage location within plant
quantityUnits on hand
unit_of_measureUOM
unit_costPer-unit cost
total_valueExtended value
valuation_methodStandardCost, MovingAverage, FIFO, LIFO
stock_statusUnrestricted, QualityInspection, Blocked
last_movement_dateDate of last stock change

Movement Types (22 types)

CategoryMovement TypeDescription
Goods ReceiptGoodsReceiptPOReceipt against purchase order
GoodsReceiptProductionReceipt from production order
GoodsReceiptOtherReceipt without reference
GoodsReceiptGeneric goods receipt
ReturnsReturnToVendorReturn materials to vendor
Goods IssueGoodsIssueSalesIssue for sales order / delivery
GoodsIssueProductionIssue to production order
GoodsIssueCostCenterIssue to cost center (consumption)
GoodsIssueScrappingScrap disposal
GoodsIssueGeneric goods issue
ScrapAlias for scrapping
TransfersTransferPlantBetween plants
TransferStorageLocationBetween storage locations
TransferInInbound transfer
TransferOutOutbound transfer
TransferToInspectionMove to quality inspection
TransferFromInspectionRelease from quality inspection
AdjustmentsPhysicalInventoryPhysical count difference
InventoryAdjustmentInPositive adjustment
InventoryAdjustmentOutNegative adjustment
InitialStockInitial stock entry
ReversalsReversalGoodsReceiptReverse a goods receipt
ReversalGoodsIssueReverse a goods issue

Valuation Methods

MethodDescriptionUse Case
StandardCostFixed cost per unit, variances posted separatelyManufacturing
MovingAverageWeighted average of all receiptsGeneral purpose
FIFOFirst-in, first-out costingPerishable goods
LIFOLast-in, first-out costingTax optimization (where permitted)

Cycle Counting (v0.6.0)

The cycle_count_frequency setting controls how often physical inventory counts are performed. Cycle counting generates PhysicalInventory movement records that reconcile book quantities against counted quantities:

subledger:
  inventory:
    enabled: true
    cycle_count_frequency: monthly     # monthly, quarterly, annual

FrequencyBehavior
monthlyEach storage location counted once per month on a rolling basis
quarterlyFull count once per quarter, with high-value items counted monthly
annualSingle year-end wall-to-wall count

Cycle count differences generate adjustment entries (InventoryAdjustmentIn or InventoryAdjustmentOut) and are flagged in the quality labels output for audit trail analysis.
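
A minimal sketch of that book-vs-count reconciliation, using illustrative enum variants named after the movement types above rather than the crate's actual types:

/// Illustrative movement kinds; names mirror the documentation, not the crate's enums.
#[derive(Debug)]
enum Movement {
    PhysicalInventory { difference: i64 },
    InventoryAdjustmentIn { qty: i64 },
    InventoryAdjustmentOut { qty: i64 },
}

/// Turn a cycle-count result into the movements described above.
fn cycle_count(book_qty: i64, counted_qty: i64) -> Vec<Movement> {
    let diff = counted_qty - book_qty;
    let mut moves = vec![Movement::PhysicalInventory { difference: diff }];
    if diff > 0 {
        moves.push(Movement::InventoryAdjustmentIn { qty: diff });
    } else if diff < 0 {
        moves.push(Movement::InventoryAdjustmentOut { qty: -diff });
    }
    moves
}

fn main() {
    println!("{:?}", cycle_count(120, 117)); // shortage of 3 → InventoryAdjustmentOut
}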

Quality Inspection (v0.6.0)

Inventory positions can be placed in quality inspection status via TransferToInspection movements. This models the inspection hold process common in manufacturing and pharmaceutical industries:

Goods Receipt → Transfer to Inspection → QC Hold → Transfer from Inspection → Unrestricted Use
                                                 └→ Scrap (if rejected)

The rate of items routed through inspection depends on the material type and vendor scorecard grades (when source_to_pay is enabled). Materials from vendors with grade C or lower are routed through inspection at a higher rate.

Inventory Journal Entries

MovementDebitCredit
Goods Receipt (PO)InventoryGR/IR Clearing
Goods Issue (Sales)COGSInventory
Goods Issue (Production)WIPInventory
ScrapScrap ExpenseInventory
Physical Count (surplus)InventoryInventory Adjustment
Physical Count (shortage)Inventory AdjustmentInventory

GL Reconciliation

The subledger generators ensure that subledger balances reconcile to GL control accounts:

GL Control Account Balance = Σ Subledger Open Items

AR Control (1100) = Σ AR Open Items
AP Control (2000) = Σ AP Open Items
Inventory  (1300) = Σ Inventory Position Values
FA Gross   (1600) = Σ FA Acquisition Costs
Accum Depr (1650) = Σ FA Accumulated Depreciation

Reconciliation is validated by the datasynth-eval coherence module and any differences are flagged as potential data quality issues.
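
A minimal sketch of such a check, assuming plain f64 amounts and a fixed tolerance (the real coherence module works on Decimal values and richer result types):

/// Compare a GL control account balance to the sum of its subledger open items.
fn reconcile(control_balance: f64, open_items: &[f64], tolerance: f64) -> Result<(), f64> {
    let subledger_total: f64 = open_items.iter().sum();
    let diff = (control_balance - subledger_total).abs();
    if diff <= tolerance { Ok(()) } else { Err(diff) }
}

fn main() {
    // AR control account 1100 vs. three open AR invoices
    match reconcile(15_250.00, &[10_000.00, 4_000.00, 1_250.00], 0.01) {
        Ok(()) => println!("AR reconciles"),
        Err(diff) => println!("AR out of balance by {diff:.2}"),
    }
}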

Output Files

FileContent
subledgers/ar_open_items.csvAR outstanding invoices
subledgers/ar_aging.csvAR aging analysis
subledgers/ap_open_items.csvAP outstanding invoices
subledgers/ap_aging.csvAP aging analysis
subledgers/fa_register.csvFixed asset master records
subledgers/fa_depreciation.csvDepreciation schedule entries
subledgers/inventory_positions.csvCurrent stock positions
subledgers/inventory_movements.csvStock movement history

See Also

FX & Currency

SyntheticData generates realistic foreign exchange rates, currency translation entries, and cumulative translation adjustments (CTA) for multi-currency enterprise simulation.

Overview

The FX module in datasynth-generators provides three generators:

GeneratorPurposeOutput
FX Rate ServiceDaily exchange rates via Ornstein-Uhlenbeck processfx/daily_rates.csv, fx/period_rates.csv
Currency TranslatorTranslate foreign-currency financials to reporting currencyconsolidation/currency_translation.csv
CTA GeneratorCumulative Translation Adjustment for consolidationconsolidation/cta_entries.csv

Configuration

fx:
  enabled: true
  base_currency: USD                    # Reporting/functional currency
  currencies:
    - code: EUR
      initial_rate: 1.10
      volatility: 0.08
      mean_reversion: 0.05
    - code: GBP
      initial_rate: 1.27
      volatility: 0.07
      mean_reversion: 0.04
    - code: JPY
      initial_rate: 0.0067
      volatility: 0.10
      mean_reversion: 0.06
    - code: CHF
      initial_rate: 1.12
      volatility: 0.06
      mean_reversion: 0.03

  translation:
    method: current_rate                # current_rate, temporal, monetary_non_monetary
    equity_at_historical: true
    income_at_average: true

  cta:
    enabled: true
    equity_account: "3900"              # CTA equity account

FX Rate Service

Ornstein-Uhlenbeck Process

Exchange rates are generated using a mean-reverting stochastic process (Ornstein-Uhlenbeck), which models the tendency of exchange rates to revert toward a long-term equilibrium:

dX(t) = θ(μ - X(t))dt + σdW(t)

where:
  X(t)  = log exchange rate at time t
  θ     = mean reversion speed (mean_reversion config)
  μ     = long-term mean (derived from initial_rate)
  σ     = volatility
  dW(t) = Wiener process (random walk)

This produces rates that:

  • Mean-revert: Rates drift back toward the initial level over time
  • Have realistic volatility: Day-to-day movements match configurable volatility targets
  • Are serially correlated: Today’s rate depends on yesterday’s rate (not i.i.d.)
  • Are deterministic: Given the same seed, rates are exactly reproducible
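
A minimal sketch of this process, discretised with the Euler–Maruyama scheme on log rates; the daily step size, parameter scaling, and the tiny seeded LCG/Box–Muller generator are illustrative stand-ins so the example needs only the standard library:

use std::f64::consts::TAU;

/// Tiny deterministic LCG + Box–Muller, standing in for the seeded RNG.
struct Lcg(u64);

impl Lcg {
    fn next_uniform(&mut self) -> f64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
    fn next_gaussian(&mut self) -> f64 {
        let (u1, u2) = (self.next_uniform().max(1e-12), self.next_uniform());
        (-2.0 * u1.ln()).sqrt() * (TAU * u2).cos()
    }
}

fn main() {
    let (theta, sigma, dt) = (0.05_f64, 0.08_f64, 1.0_f64); // mean_reversion, volatility, one step
    let mu = 1.10_f64.ln(); // long-term mean derived from initial_rate
    let mut x = mu;         // log exchange rate X(t)
    let mut rng = Lcg(42);  // same seed → identical rate path
    for day in 1..=5 {
        // Euler–Maruyama step: dX = θ(μ − X)dt + σ√dt · ε, with ε ~ N(0, 1)
        x += theta * (mu - x) * dt + sigma * dt.sqrt() * rng.next_gaussian();
        println!("day {day}: rate = {:.4}", x.exp());
    }
}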

Rate Types

Rate TypeUsageCalculation
Daily spotTransaction-date ratesO-U process output for each business day
Period averageIncome statement translationArithmetic mean of daily rates within the period
Period closingBalance sheet translationLast business day rate in the period
HistoricalEquity itemsRate at the date equity was contributed

Output: daily_rates.csv

FieldDescription
dateBusiness day
from_currencySource currency (e.g., EUR)
to_currencyTarget currency (e.g., USD)
spot_rateDaily spot rate
inverse_rate1 / spot_rate

Output: period_rates.csv

FieldDescription
periodFiscal period (YYYY-MM)
from_currencySource currency
to_currencyTarget currency
average_ratePeriod average
closing_ratePeriod-end closing rate

Currency Translation

Translation Methods

SyntheticData supports three standard currency translation methods:

Current Rate Method (ASC 830 / IAS 21 — default)

The most common method for foreign subsidiaries whose functional currency differs from the reporting currency:

ItemRate Used
AssetsClosing rate
LiabilitiesClosing rate
Equity (contributed capital)Historical rate
Equity (retained earnings)Rolled-forward
RevenueAverage rate
ExpensesAverage rate
DividendsRate on declaration date
CTABalancing item → Equity

Temporal Method (ASC 830)

Used when the foreign operation’s functional currency is the parent’s currency (e.g., highly inflationary economies):

ItemRate Used
Monetary assets/liabilitiesClosing rate
Non-monetary assets (at cost)Historical rate
Non-monetary assets (at fair value)Rate at fair value date
RevenueAverage rate
ExpensesAverage rate
DepreciationHistorical rate of related asset
Remeasurement gain/lossIncome statement

Monetary/Non-Monetary Method

ItemRate Used
Monetary itemsClosing rate
Non-monetary itemsHistorical rate

Translation Configuration

fx:
  translation:
    method: current_rate      # current_rate | temporal | monetary_non_monetary
    equity_at_historical: true
    income_at_average: true

CTA Generator

The Cumulative Translation Adjustment arises because assets and liabilities are translated at closing rates while equity is translated at historical rates and net income at average rates. The CTA is posted to Other Comprehensive Income (OCI) in equity:

CTA = Translated Net Assets (at closing rate)
    - Translated Equity (at historical rates)
    - Translated Net Income (at average rate)
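
A short worked example with purely illustrative numbers (the rates and balances below are made up):

fn main() {
    // EUR subsidiary translated to USD
    let net_assets_eur = 1_000_000.0; let closing = 1.12;    // closing rate
    let equity_eur     =   800_000.0; let historical = 1.05; // rate when equity was contributed
    let net_income_eur =   200_000.0; let average = 1.09;    // period average rate

    let cta = net_assets_eur * closing
            - equity_eur * historical
            - net_income_eur * average;
    println!("CTA for the period: {cta:.2} USD"); // 62000.00
}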

CTA Journal Entry

DebitCreditDescription
CTA (Equity 3900)Various BS accountsTranslation adjustment for period

The CTA accumulates over time and is only recycled to the income statement when a foreign subsidiary is disposed of.

Configuration

fx:
  cta:
    enabled: true
    equity_account: "3900"    # OCI - CTA account

Multi-Currency Company Configuration

Multi-currency scenarios require companies with different functional currencies:

companies:
  - code: C001
    name: "US Parent Corp"
    currency: USD
    country: US

  - code: C002
    name: "European Subsidiary"
    currency: EUR
    country: DE

  - code: C003
    name: "UK Subsidiary"
    currency: GBP
    country: GB

  - code: C004
    name: "Japan Subsidiary"
    currency: JPY
    country: JP

fx:
  enabled: true
  base_currency: USD
  currencies:
    - { code: EUR, initial_rate: 1.10, volatility: 0.08, mean_reversion: 0.05 }
    - { code: GBP, initial_rate: 1.27, volatility: 0.07, mean_reversion: 0.04 }
    - { code: JPY, initial_rate: 0.0067, volatility: 0.10, mean_reversion: 0.06 }

intercompany:
  enabled: true
  # IC transactions generate FX exposure

Output Files

FileContent
fx/daily_rates.csvDaily spot rates for all currency pairs
fx/period_rates.csvPeriod average and closing rates
consolidation/currency_translation.csvTranslation entries per entity/period
consolidation/cta_entries.csvCTA adjustments (if CTA enabled)
consolidation/consolidated_trial_balance.csvTranslated and consolidated TB

See Also

Financial Settings

Financial settings control balance, subledger, FX, and period close.

Balance Configuration

balance:
  opening_balance:
    enabled: true
    total_assets: 10000000

  coherence_check:
    enabled: true
    tolerance: 0.01

Opening Balance

Generate coherent opening balance sheet:

balance:
  opening_balance:
    enabled: true
    total_assets: 10000000           # Total asset value

    structure:                        # Balance sheet structure
      current_assets: 0.3
      fixed_assets: 0.5
      other_assets: 0.2

      current_liabilities: 0.2
      long_term_debt: 0.3
      equity: 0.5

Balance Coherence

Verify accounting equation:

balance:
  coherence_check:
    enabled: true                    # Verify Assets = L + E
    tolerance: 0.01                  # Allowed rounding variance
    frequency: monthly               # When to check

Subledger Configuration

subledger:
  ar:
    enabled: true
    aging_buckets: [30, 60, 90, 120]

  ap:
    enabled: true
    aging_buckets: [30, 60, 90]

  fixed_assets:
    enabled: true
    depreciation_methods:
      - straight_line
      - declining_balance

  inventory:
    enabled: true
    valuation_methods:
      - fifo
      - weighted_average

Accounts Receivable

subledger:
  ar:
    enabled: true
    aging_buckets: [30, 60, 90, 120]  # Aging period boundaries

    collection:
      on_time_rate: 0.7               # % paid within terms
      write_off_rate: 0.02            # % written off

    reconciliation:
      enabled: true                   # Reconcile to GL
      control_account: "1100"         # AR control account
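
A minimal sketch of how days past due map onto the configured aging_buckets boundaries; the bucket labels are illustrative, not the exact output schema:

/// Assign an open item to an aging bucket given its days past due.
fn aging_bucket(days_past_due: i64, buckets: &[i64]) -> String {
    if days_past_due <= 0 {
        return "current".to_string();
    }
    let mut lower = 0;
    for &upper in buckets {
        if days_past_due <= upper {
            return format!("{}-{}", lower + 1, upper);
        }
        lower = upper;
    }
    format!("{}+", lower) // beyond the last boundary
}

fn main() {
    let buckets = [30, 60, 90, 120];
    for days in [-5, 12, 45, 200] {
        println!("{days:>4} days past due → {}", aging_bucket(days, &buckets));
    }
}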

Accounts Payable

subledger:
  ap:
    enabled: true
    aging_buckets: [30, 60, 90]

    payment:
      discount_usage_rate: 0.3        # % taking early pay discount
      late_payment_rate: 0.1          # % paid late

    reconciliation:
      enabled: true
      control_account: "2000"         # AP control account

Fixed Assets

subledger:
  fixed_assets:
    enabled: true

    depreciation_methods:
      - method: straight_line
        weight: 0.7
      - method: declining_balance
        rate: 0.2
        weight: 0.2
      - method: units_of_production
        weight: 0.1

    disposal:
      rate: 0.05                      # Annual disposal rate
      gain_loss_account: "8000"       # Gain/loss account

    reconciliation:
      enabled: true
      control_accounts:
        asset: "1500"
        depreciation: "1510"

Inventory

subledger:
  inventory:
    enabled: true

    valuation_methods:
      - method: fifo
        weight: 0.3
      - method: weighted_average
        weight: 0.5
      - method: standard_cost
        weight: 0.2

    movements:
      receipt_weight: 0.4
      issue_weight: 0.4
      adjustment_weight: 0.1
      transfer_weight: 0.1

    reconciliation:
      enabled: true
      control_account: "1200"

FX Configuration

fx:
  enabled: true
  base_currency: USD

  currency_pairs:
    - EUR
    - GBP
    - CHF
    - JPY

  volatility: 0.01

  translation:
    method: current_rate

Exchange Rates

fx:
  enabled: true
  base_currency: USD                  # Reporting currency

  currency_pairs:                     # Currencies to generate
    - EUR
    - GBP
    - CHF

  rate_types:
    - spot                            # Daily spot rates
    - closing                         # Period closing rates
    - average                         # Period average rates

  volatility: 0.01                    # Daily volatility
  mean_reversion: 0.1                 # Ornstein-Uhlenbeck parameter

Currency Translation

fx:
  translation:
    method: current_rate              # current_rate, temporal

    rate_mapping:
      assets: closing_rate
      liabilities: closing_rate
      equity: historical_rate
      revenue: average_rate
      expense: average_rate

    cta_account: "3500"               # CTA equity account

Period Close Configuration

period_close:
  enabled: true

  monthly:
    accruals: true
    depreciation: true

  quarterly:
    intercompany_elimination: true

  annual:
    closing_entries: true
    retained_earnings: true

Monthly Close

period_close:
  monthly:
    accruals:
      enabled: true
      auto_reverse: true              # Reverse in next period
      categories:
        - expense_accrual
        - revenue_accrual
        - payroll_accrual

    depreciation:
      enabled: true
      run_date: last_day              # When to run

    reconciliation:
      enabled: true
      subledger_to_gl: true

Quarterly Close

period_close:
  quarterly:
    intercompany_elimination:
      enabled: true
      types:
        - intercompany_sales
        - intercompany_profit
        - intercompany_dividends

    currency_translation:
      enabled: true

Annual Close

period_close:
  annual:
    closing_entries:
      enabled: true
      close_revenue: true
      close_expense: true

    retained_earnings:
      enabled: true
      account: "3100"

    year_end_adjustments:
      - bad_debt_provision
      - inventory_reserve
      - bonus_accrual

Combined Example

balance:
  opening_balance:
    enabled: true
    total_assets: 50000000
  coherence_check:
    enabled: true

subledger:
  ar:
    enabled: true
    aging_buckets: [30, 60, 90, 120, 180]
  ap:
    enabled: true
    aging_buckets: [30, 60, 90]
  fixed_assets:
    enabled: true
  inventory:
    enabled: true

fx:
  enabled: true
  base_currency: USD
  currency_pairs: [EUR, GBP, CHF, JPY, CNY]
  volatility: 0.012

period_close:
  enabled: true
  monthly:
    accruals: true
    depreciation: true
  quarterly:
    intercompany_elimination: true
  annual:
    closing_entries: true
    retained_earnings: true

Financial Reporting (v0.6.0)

The financial_reporting section generates structured financial statements, management KPIs, and budgets derived from the underlying journal entries, trial balances, and period close data.

Financial Statements

financial_reporting:
  enabled: true
  generate_balance_sheet: true         # Balance sheet
  generate_income_statement: true      # Income statement / P&L
  generate_cash_flow: true             # Cash flow statement
  generate_changes_in_equity: true     # Statement of changes in equity
  comparative_periods: 1               # Number of prior-period comparatives

When enabled, the generator produces financial statements at each period close. The comparative_periods setting controls how many prior periods are included for comparative analysis. Statements are aggregated from the trial balance and subledger data, ensuring consistency with the underlying journal entries.

Management KPIs

financial_reporting:
  management_kpis:
    enabled: true
    frequency: "monthly"               # monthly or quarterly

Management KPIs include ratios and metrics computed from the generated financial data:

KPI CategoryExamples
LiquidityCurrent ratio, quick ratio, cash conversion cycle
ProfitabilityGross margin, operating margin, ROE, ROA
EfficiencyInventory turnover, receivables turnover, asset turnover
LeverageDebt-to-equity, interest coverage

Budgets

financial_reporting:
  budgets:
    enabled: true
    revenue_growth_rate: 0.05          # 5% expected growth
    expense_inflation_rate: 0.03       # 3% cost inflation
    variance_noise: 0.10               # 10% random noise on actuals vs budget

Budget generation creates a budget line for each GL account based on prior-period actuals, adjusted by the configured growth and inflation rates. The variance_noise parameter controls the spread between budget and actual figures, producing realistic budget-to-actual variance reports.
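
A minimal sketch of that derivation; the assumption that revenue accounts start with '4' and the fixed noise draw are illustrative, not the generator's actual account classification or RNG:

/// Budget for one GL account derived from the prior-period actual.
fn budget_line(account: &str, prior_actual: f64, growth: f64, inflation: f64) -> f64 {
    if account.starts_with('4') {
        prior_actual * (1.0 + growth)     // revenue: expected growth
    } else {
        prior_actual * (1.0 + inflation)  // expenses: cost inflation
    }
}

fn main() {
    let budget = budget_line("4000", 1_200_000.0, 0.05, 0.03); // 1_260_000
    let noise_draw = 0.04;              // a draw within ±variance_noise (0.10)
    let actual = budget * (1.0 + noise_draw);
    println!(
        "budget {budget:.0}, simulated actual {actual:.0}, variance {:+.1}%",
        (actual / budget - 1.0) * 100.0
    );
}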


See Also

Compliance

Compliance settings control fraud injection, internal controls, and approval workflows.

Fraud Configuration

fraud:
  enabled: true
  fraud_rate: 0.005

  types:
    fictitious_transaction: 0.15
    revenue_manipulation: 0.10
    expense_capitalization: 0.10
    split_transaction: 0.15
    round_tripping: 0.05
    kickback_scheme: 0.10
    ghost_employee: 0.05
    duplicate_payment: 0.15
    unauthorized_discount: 0.10
    suspense_abuse: 0.05

Fraud Rate

Overall percentage of fraudulent transactions:

fraud:
  enabled: true
  fraud_rate: 0.005    # 0.5% fraud rate
  fraud_rate: 0.01     # 1% fraud rate
  fraud_rate: 0.001    # 0.1% fraud rate

Fraud Types

TypeDescription
fictitious_transactionCompletely fabricated entries
revenue_manipulationPremature/delayed revenue recognition
expense_capitalizationImproper capitalization of expenses
split_transactionSplit to avoid approval thresholds
round_trippingCircular transactions to inflate revenue
kickback_schemeVendor kickback arrangements
ghost_employeePayments to non-existent employees
duplicate_paymentSame invoice paid multiple times
unauthorized_discountUnapproved customer discounts
suspense_abuseHiding items in suspense accounts

Fraud Patterns

fraud:
  patterns:
    threshold_adjacent:
      enabled: true
      threshold: 10000             # Approval threshold
      range: 0.1                   # % below threshold

    time_based:
      weekend_preference: 0.3      # Weekend entry rate
      after_hours_preference: 0.2  # After hours rate

    entity_targeting:
      repeat_offender_rate: 0.4    # Same user commits multiple

Internal Controls Configuration

internal_controls:
  enabled: true

  controls:
    - id: "CTL-001"
      name: "Payment Approval"
      type: preventive
      frequency: continuous
      assertions:
        - authorization
        - validity

  sod_rules:
    - conflict_type: create_approve
      processes: [ap_invoice, ap_payment]

Control Definition

internal_controls:
  controls:
    - id: "CTL-001"
      name: "Payment Approval"
      description: "Payments require manager approval"
      type: preventive              # preventive, detective
      frequency: continuous         # continuous, daily, weekly, monthly
      assertions:
        - authorization
        - validity
        - completeness
      accounts: ["2000"]            # Applicable accounts
      threshold: 5000               # Trigger threshold

    - id: "CTL-002"
      name: "Journal Entry Review"
      type: detective
      frequency: daily
      assertions:
        - accuracy
        - completeness

Control Types

TypeDescription
preventivePrevents errors/fraud before occurrence
detectiveDetects errors/fraud after occurrence

Control Assertions

AssertionDescription
authorizationProper approval obtained
validityTransaction is legitimate
completenessAll transactions recorded
accuracyAmounts are correct
cutoffRecorded in correct period
classificationProperly categorized

Segregation of Duties

internal_controls:
  sod_rules:
    - conflict_type: create_approve
      processes: [ap_invoice, ap_payment]
      description: "Cannot create and approve payments"

    - conflict_type: create_approve
      processes: [ar_invoice, ar_receipt]

    - conflict_type: custody_recording
      processes: [cash_handling, cash_recording]

    - conflict_type: authorization_custody
      processes: [vendor_master, ap_payment]

SoD Conflict Types

TypeDescription
create_approveCreate and approve same transaction
custody_recordingPhysical custody and recording
authorization_custodyAuthorization and physical access
create_modifyCreate and modify master data
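
A minimal sketch of how such rules can be checked against a user's assigned processes (rule and role structures are illustrative, not the crate's schema):

use std::collections::HashSet;

/// A user violates a rule when their access covers both processes it names.
struct SodRule { conflict_type: &'static str, processes: [&'static str; 2] }

fn violations<'a>(user_processes: &HashSet<&str>, rules: &'a [SodRule]) -> Vec<&'a SodRule> {
    rules
        .iter()
        .filter(|r| r.processes.iter().all(|p| user_processes.contains(p)))
        .collect()
}

fn main() {
    let rules = [
        SodRule { conflict_type: "create_approve", processes: ["ap_invoice", "ap_payment"] },
        SodRule { conflict_type: "custody_recording", processes: ["cash_handling", "cash_recording"] },
    ];
    let user: HashSet<&str> = ["ap_invoice", "ap_payment", "gl_posting"].into();
    for v in violations(&user, &rules) {
        println!("SoD violation: {} over {:?}", v.conflict_type, v.processes);
    }
}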

Approval Configuration

approval:
  enabled: true

  thresholds:
    - level: 1
      name: "Clerk"
      max_amount: 5000
    - level: 2
      name: "Supervisor"
      max_amount: 25000
    - level: 3
      name: "Manager"
      max_amount: 100000
    - level: 4
      name: "Director"
      max_amount: 500000
    - level: 5
      name: "Executive"
      max_amount: null          # Unlimited

Approval Thresholds

approval:
  thresholds:
    - level: 1
      name: "Level 1 - Clerk"
      max_amount: 5000
      auto_approve: false

    - level: 2
      name: "Level 2 - Supervisor"
      max_amount: 25000
      auto_approve: false

    - level: 3
      name: "Level 3 - Manager"
      max_amount: 100000
      requires_dual: false        # Single approver

    - level: 4
      name: "Level 4 - Director"
      max_amount: 500000
      requires_dual: true         # Dual approval required
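
A minimal sketch of threshold selection: pick the lowest level whose max_amount covers the document amount, treating a null max_amount as unlimited (struct and field names are illustrative):

/// One approval level from the configuration, assumed sorted ascending by level.
struct Threshold { level: u8, name: &'static str, max_amount: Option<f64> }

fn required_level(amount: f64, thresholds: &[Threshold]) -> Option<&Threshold> {
    thresholds
        .iter()
        .find(|t| t.max_amount.map_or(true, |max| amount <= max))
}

fn main() {
    let thresholds = [
        Threshold { level: 1, name: "Clerk", max_amount: Some(5_000.0) },
        Threshold { level: 2, name: "Supervisor", max_amount: Some(25_000.0) },
        Threshold { level: 3, name: "Manager", max_amount: Some(100_000.0) },
        Threshold { level: 4, name: "Director", max_amount: Some(500_000.0) },
        Threshold { level: 5, name: "Executive", max_amount: None }, // unlimited
    ];
    let t = required_level(72_500.0, &thresholds).unwrap();
    println!("72,500 needs level {} ({})", t.level, t.name); // level 3 (Manager)
}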

Approval Process

approval:
  process:
    workflow: hierarchical        # hierarchical, matrix
    escalation_days: 3            # Auto-escalate after N days
    reminder_days: 1              # Send reminder after N days

  exceptions:
    recurring_exempt: true        # Skip for recurring entries
    system_exempt: true           # Skip for system entries

Combined Example

fraud:
  enabled: true
  fraud_rate: 0.005
  types:
    fictitious_transaction: 0.15
    split_transaction: 0.20
    duplicate_payment: 0.15
    ghost_employee: 0.10
    kickback_scheme: 0.10
    revenue_manipulation: 0.10
    expense_capitalization: 0.10
    unauthorized_discount: 0.10

internal_controls:
  enabled: true
  controls:
    - id: "SOX-001"
      name: "Payment Authorization"
      type: preventive
      frequency: continuous
      threshold: 10000

    - id: "SOX-002"
      name: "JE Review"
      type: detective
      frequency: daily

  sod_rules:
    - conflict_type: create_approve
      processes: [ap_invoice, ap_payment]
    - conflict_type: create_approve
      processes: [ar_invoice, ar_receipt]
    - conflict_type: create_modify
      processes: [vendor_master, ap_invoice]

approval:
  enabled: true
  thresholds:
    - level: 1
      max_amount: 5000
    - level: 2
      max_amount: 25000
    - level: 3
      max_amount: 100000
    - level: 4
      max_amount: null

Validation

CheckRule
fraud_rate0.0 - 1.0
fraud.typesSum = 1.0
control.idUnique
thresholdsStrictly ascending

Synthetic Data Certificates (v0.5.0)

Certificates provide a cryptographically signed attestation of the privacy guarantees and quality metrics of generated data.

certificates:
  enabled: true
  issuer: "DataSynth"
  include_quality_metrics: true

When enabled, a certificate.json file is produced alongside the output containing:

  • DP Guarantee: Mechanism (Laplace/Gaussian), epsilon, delta, composition method
  • Quality Metrics: Benford MAD, correlation preservation, statistical fidelity, MIA AUC
  • Config Hash: SHA-256 hash of the generation configuration
  • Signature: HMAC-SHA256 signature for tamper detection
  • Fingerprint Hash: Hash of source fingerprint (if fingerprint-based generation)

The certificate can be embedded in Parquet file metadata or included as a separate JSON file.

# Generate with certificate
datasynth-data generate --config config.yaml --output ./output --certificate

# Certificate is written to ./output/certificate.json
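
A consumer can recheck the HMAC-SHA256 signature independently. The sketch below uses the hmac, sha2, and hex crates; how DataSynth canonicalises the certificate payload and how the signing key is shared are not specified here, so treat this purely as an integrity-check illustration:

// Assumed dependencies: hmac = "0.12", sha2 = "0.10", hex = "0.4"
use hmac::{Hmac, Mac};
use sha2::Sha256;

/// Verify that `signature_hex` is the HMAC-SHA256 of `payload` under `key`.
fn verify(payload: &[u8], signature_hex: &str, key: &[u8]) -> bool {
    let Ok(expected) = hex::decode(signature_hex) else { return false };
    let mut mac = Hmac::<Sha256>::new_from_slice(key).expect("HMAC accepts any key length");
    mac.update(payload);
    mac.verify_slice(&expected).is_ok()
}

fn main() {
    let payload = br#"{"certificate_id":"demo","epsilon":1.0}"#; // illustrative payload
    let key = b"shared-secret";
    // Compute a tag so the round-trip demonstrably verifies.
    let mut mac = Hmac::<Sha256>::new_from_slice(key).unwrap();
    mac.update(payload);
    let tag = hex::encode(mac.finalize().into_bytes());
    println!("tamper-free: {}", verify(payload, &tag, key)); // true
}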

See Also

Output Settings

Output settings control file formats and organization.

Configuration

output:
  format: csv
  compression: none
  compression_level: 6

  files:
    journal_entries: true
    acdoca: true
    master_data: true
    documents: true
    subledgers: true
    trial_balances: true
    labels: true
    controls: true

Format

Output file format selection.

output:
  format: csv        # CSV format (default)
  format: json       # JSON format
  format: jsonl      # Newline-delimited JSON
  format: parquet    # Apache Parquet columnar
  format: sap        # SAP S/4HANA table format
  format: oracle     # Oracle EBS GL tables
  format: netsuite   # NetSuite journal entries

CSV Format

Standard comma-separated values:

document_id,posting_date,company_code,account,debit,credit
abc-123,2024-01-15,1000,1100,"1000.00","0.00"
abc-123,2024-01-15,1000,4000,"0.00","1000.00"

Characteristics:

  • UTF-8 encoding
  • Header row included
  • Quoted strings when needed
  • Decimals as strings

JSON Format

Structured JSON with nested objects:

[
  {
    "header": {
      "document_id": "abc-123",
      "posting_date": "2024-01-15",
      "company_code": "1000"
    },
    "lines": [
      {"account": "1100", "debit": "1000.00", "credit": "0.00"},
      {"account": "4000", "debit": "0.00", "credit": "1000.00"}
    ]
  }
]

Parquet Format

Apache Parquet columnar format for analytics:

output:
  format: parquet
  compression: snappy     # snappy (default), gzip, zstd

Parquet files are self-describing with embedded schema and support columnar compression. Ideal for Spark, DuckDB, Polars, pandas, and cloud data warehouses.

ERP Formats

Export in native ERP table schemas for load testing and integration validation:

# SAP S/4HANA
output:
  format: sap
  sap:
    tables: [bkpf, bseg, acdoca, lfa1, kna1, mara, csks, cepc]
    client: "100"

# Oracle EBS
output:
  format: oracle
  oracle:
    ledger_id: 1

# NetSuite
output:
  format: netsuite
  netsuite:
    subsidiary_id: 1
    include_custom_fields: true

See ERP Output Formats for full field mappings.

Streaming Mode

Enable streaming output for memory-efficient generation of large datasets:

output:
  format: csv           # Any format
  streaming: true       # Enable streaming mode

See Streaming Output for details.

Compression

File compression options.

output:
  compression: none     # No compression
  compression: gzip     # Gzip compression (.gz)
  compression: zstd     # Zstandard compression (.zst)

Compression Level

When compression is enabled:

output:
  compression: gzip
  compression_level: 6    # 1-9, higher = smaller + slower

LevelSpeedSizeUse Case
1FastestLargestQuick compression
6BalancedMediumGeneral use (default)
9SlowestSmallestMaximum compression

Compression Comparison

CompressionExtensionSpeedRatio
none.csvN/A1.0
gzip.csv.gzMedium~0.15
zstd.csv.zstFast~0.12

File Selection

Control which files are generated:

output:
  files:
    # Core transaction data
    journal_entries: true    # journal_entries.csv
    acdoca: true             # acdoca.csv (SAP format)

    # Master data
    master_data: true        # vendors.csv, customers.csv, etc.

    # Document flow
    documents: true          # purchase_orders.csv, invoices.csv, etc.

    # Subsidiary ledgers
    subledgers: true         # ar_open_items.csv, ap_open_items.csv, etc.

    # Period close
    trial_balances: true     # trial_balances/*.csv

    # ML labels
    labels: true             # anomaly_labels.csv, fraud_labels.csv

    # Controls
    controls: true           # internal_controls.csv, sod_rules.csv

Output Directory Structure

With all files enabled:

output/
├── master_data/
│   ├── chart_of_accounts.csv
│   ├── vendors.csv
│   ├── customers.csv
│   ├── materials.csv
│   ├── fixed_assets.csv
│   └── employees.csv
├── transactions/
│   ├── journal_entries.csv
│   └── acdoca.csv
├── documents/
│   ├── purchase_orders.csv
│   ├── goods_receipts.csv
│   ├── vendor_invoices.csv
│   ├── payments.csv
│   ├── sales_orders.csv
│   ├── deliveries.csv
│   ├── customer_invoices.csv
│   └── customer_receipts.csv
├── subledgers/
│   ├── ar_open_items.csv
│   ├── ar_aging.csv
│   ├── ap_open_items.csv
│   ├── ap_aging.csv
│   ├── fa_register.csv
│   ├── fa_depreciation.csv
│   ├── inventory_positions.csv
│   └── inventory_movements.csv
├── period_close/
│   └── trial_balances/
│       ├── 2024_01.csv
│       ├── 2024_02.csv
│       └── ...
├── consolidation/
│   ├── eliminations.csv
│   └── currency_translation.csv
├── fx/
│   ├── daily_rates.csv
│   └── period_rates.csv
├── graphs/                      # If graph_export enabled
│   ├── pytorch_geometric/
│   └── neo4j/
├── labels/
│   ├── anomaly_labels.csv
│   └── fraud_labels.csv
└── controls/
    ├── internal_controls.csv
    ├── control_mappings.csv
    └── sod_rules.csv

Examples

Development (Fast)

output:
  format: csv
  compression: none
  files:
    journal_entries: true
    master_data: true
    labels: true

Production (Compact)

output:
  format: csv
  compression: zstd
  compression_level: 6
  files:
    journal_entries: true
    acdoca: true
    master_data: true
    documents: true
    subledgers: true
    trial_balances: true
    labels: true
    controls: true

ML Training Focus

output:
  format: csv
  compression: gzip
  files:
    journal_entries: true
    labels: true                 # Important for supervised learning
    master_data: true            # For feature engineering

SAP Integration

output:
  format: csv
  compression: none
  files:
    journal_entries: false
    acdoca: true                 # SAP ACDOCA format
    master_data: true
    documents: true

Validation

CheckRule
formatcsv, json, jsonl, parquet, sap, oracle, or netsuite
compressionnone, gzip, or zstd
compression_level1-9 (only when compression enabled)

See Also

AI & ML Features Configuration

New in v0.5.0

This page documents the configuration for DataSynth’s AI and ML-powered generation features: LLM-augmented generation, diffusion models, causal generation, and synthetic data certificates.

LLM Configuration

Configure the LLM provider for metadata enrichment and natural language configuration.

llm:
  provider: mock              # Provider type
  model: "gpt-4o-mini"       # Model identifier
  api_key_env: "OPENAI_API_KEY"  # Environment variable for API key
  base_url: null              # Custom API endpoint (for 'custom' provider)
  max_retries: 3              # Retry attempts on failure
  timeout_secs: 30            # Request timeout
  cache_enabled: true         # Enable prompt-level caching

Provider Types

ProviderValueRequirementsDescription
MockmockNoneDeterministic, no network. Default for CI/CD
OpenAIopenaiOPENAI_API_KEY env varOpenAI API (GPT-4o, GPT-4o-mini, etc.)
AnthropicanthropicANTHROPIC_API_KEY env varAnthropic API (Claude models)
Customcustombase_url + API key env varAny OpenAI-compatible endpoint

Field Reference

FieldTypeDefaultDescription
providerstring"mock"LLM provider type
modelstring"gpt-4o-mini"Model identifier passed to the API
api_key_envstring""Environment variable name containing the API key
base_urlstringnullCustom API base URL (required for custom provider)
max_retriesinteger3Maximum retry attempts on transient failures
timeout_secsinteger30Per-request timeout in seconds
cache_enabledbooltrueCache responses to avoid duplicate API calls

Examples

Mock provider (default, no config needed):

# LLM enrichment uses mock provider by default
# No configuration required

OpenAI:

llm:
  provider: openai
  model: "gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"

Anthropic:

llm:
  provider: anthropic
  model: "claude-sonnet-4-5-20250929"
  api_key_env: "ANTHROPIC_API_KEY"

Self-hosted (e.g., vLLM, Ollama):

llm:
  provider: custom
  model: "llama-3-8b"
  api_key_env: "LOCAL_API_KEY"
  base_url: "http://localhost:8000/v1"

Azure OpenAI:

llm:
  provider: custom
  model: "gpt-4o-mini"
  api_key_env: "AZURE_OPENAI_KEY"
  base_url: "https://my-resource.openai.azure.com/openai/deployments/gpt-4o-mini"

Diffusion Configuration

Configure the statistical diffusion model backend for learned distribution capture.

diffusion:
  enabled: false              # Enable diffusion generation
  n_steps: 1000               # Number of diffusion steps
  schedule: "linear"          # Noise schedule type
  sample_size: 1000           # Number of samples to generate

Field Reference

FieldTypeDefaultDescription
enabledboolfalseEnable diffusion model generation
n_stepsinteger1000Number of forward/reverse diffusion steps. Higher values improve quality but increase compute time
schedulestring"linear"Noise schedule: "linear", "cosine", "sigmoid"
sample_sizeinteger1000Number of diffusion-generated samples

Noise Schedules

ScheduleCharacteristicsBest For
linearUniform noise addition, simple and robustGeneral purpose
cosineSlower noise addition, preserves fine detailsFinancial amounts with precise distributions
sigmoidSmooth transition between linear and cosineBalanced quality and compute
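
The schedule names above usually refer to the standard DDPM-style formulas; whether DataSynth uses these exact parameterisations is an assumption, but the sketch below shows the shape of the linear and cosine schedules:

use std::f64::consts::PI;

/// Linear schedule: noise variance beta_t grows uniformly from beta_min to beta_max.
fn linear_beta(t: usize, n_steps: usize) -> f64 {
    let (beta_min, beta_max) = (1e-4, 0.02);
    beta_min + (beta_max - beta_min) * t as f64 / (n_steps - 1) as f64
}

/// Cosine schedule: cumulative signal level alpha_bar(t), from which beta_t is derived.
fn cosine_alpha_bar(t: usize, n_steps: usize) -> f64 {
    let s = 0.008;
    let f = |x: f64| ((x / n_steps as f64 + s) / (1.0 + s) * PI / 2.0).cos().powi(2);
    f(t as f64) / f(0.0)
}

fn main() {
    let n = 1000;
    println!("linear  beta[0]   = {:.5}", linear_beta(0, n));
    println!("linear  beta[999] = {:.5}", linear_beta(n - 1, n));
    // cosine: beta_t = 1 - alpha_bar(t) / alpha_bar(t-1)
    let beta_500 = 1.0 - cosine_alpha_bar(500, n) / cosine_alpha_bar(499, n);
    println!("cosine  beta[500] = {:.5}", beta_500);
}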

Examples

Basic diffusion:

diffusion:
  enabled: true
  n_steps: 1000
  schedule: "cosine"
  sample_size: 5000

Fast diffusion (fewer steps):

diffusion:
  enabled: true
  n_steps: 200
  schedule: "linear"
  sample_size: 1000

Causal Configuration

Configure causal graph-based data generation with Structural Causal Models.

causal:
  enabled: false              # Enable causal generation
  template: "fraud_detection" # Built-in template or custom YAML path
  sample_size: 1000           # Number of samples
  validate: true              # Validate causal structure in output

Field Reference

FieldTypeDefaultDescription
enabledboolfalseEnable causal/counterfactual generation
templatestring"fraud_detection"Template name or path to custom YAML graph
sample_sizeinteger1000Number of causal samples to generate
validatebooltrueRun causal structure validation on output

Built-in Templates

TemplateVariablesUse Case
fraud_detectiontransaction_amount, approval_level, vendor_risk, fraud_flagFraud risk modeling
revenue_cycleorder_size, credit_score, payment_delay, revenueRevenue and credit analysis

Custom Causal Graph

Point the template field to a YAML file defining a custom causal graph:

causal:
  enabled: true
  template: "./graphs/custom_fraud.yaml"
  sample_size: 10000
  validate: true

Custom graph format:

# custom_fraud.yaml
variables:
  - name: transaction_amount
    type: continuous
    distribution: lognormal
    params:
      mu: 8.0
      sigma: 1.5
  - name: approval_level
    type: count
    distribution: normal
    params:
      mean: 1.0
      std: 0.5
  - name: fraud_flag
    type: binary

edges:
  - from: transaction_amount
    to: approval_level
    mechanism:
      type: linear
      coefficient: 0.00005
  - from: transaction_amount
    to: fraud_flag
    mechanism:
      type: logistic
      scale: 0.0001
      midpoint: 50000.0
    strength: 0.8

Causal Mechanism Types

TypeParametersDescription
linearcoefficienty += coefficient × parent
thresholdcutoffy = 1 if parent > cutoff, else 0
polynomialcoefficients (list)y += Σ c[i] × parent^i
logisticscale, midpointy += 1 / (1 + e^(-scale × (parent - midpoint)))
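
A minimal sketch of how these mechanism types evaluate a parent's contribution (illustrative enum, not the crate's types), using the logistic edge from the custom graph example above:

/// Illustrative encoding of the mechanism table above.
enum Mechanism {
    Linear { coefficient: f64 },
    Threshold { cutoff: f64 },
    Polynomial { coefficients: Vec<f64> },
    Logistic { scale: f64, midpoint: f64 },
}

/// Contribution of one parent value to a child variable.
fn contribution(m: &Mechanism, parent: f64) -> f64 {
    match m {
        Mechanism::Linear { coefficient } => coefficient * parent,
        Mechanism::Threshold { cutoff } => if parent > *cutoff { 1.0 } else { 0.0 },
        Mechanism::Polynomial { coefficients } => coefficients
            .iter()
            .enumerate()
            .map(|(i, c)| c * parent.powi(i as i32))
            .sum(),
        Mechanism::Logistic { scale, midpoint } =>
            1.0 / (1.0 + (-scale * (parent - midpoint)).exp()),
    }
}

fn main() {
    // transaction_amount → fraud_flag edge from the custom graph example
    let logistic = Mechanism::Logistic { scale: 0.0001, midpoint: 50_000.0 };
    println!("score at 80,000: {:.3}", contribution(&logistic, 80_000.0)); // ~0.953
    println!("score at 10,000: {:.3}", contribution(&logistic, 10_000.0)); // ~0.018
}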

Certificate Configuration

Configure synthetic data certificates for provenance and privacy attestation.

certificates:
  enabled: false              # Enable certificate generation
  issuer: "DataSynth"        # Certificate issuer identity
  include_quality_metrics: true  # Include quality metrics

Field Reference

FieldTypeDefaultDescription
enabledboolfalseGenerate a certificate with each output
issuerstring"DataSynth"Issuer identity embedded in the certificate
include_quality_metricsbooltrueInclude Benford MAD, correlation, fidelity, MIA AUC metrics

Certificate Contents

When enabled, a certificate.json is produced containing:

SectionContents
Identitycertificate_id, generation_timestamp, generator_version
Reproducibilityconfig_hash (SHA-256), seed, fingerprint_hash
PrivacyDP mechanism, epsilon, delta, composition method, total queries
QualityBenford MAD, correlation preservation, statistical fidelity, MIA AUC
IntegrityHMAC-SHA256 signature

Combined Example

A complete configuration using all AI/ML features:

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12

companies:
  - code: "1000"
    name: "Manufacturing Corp"
    currency: USD
    country: US

transactions:
  target_count: 50000

# LLM enrichment for realistic metadata
llm:
  provider: mock

# Diffusion for learned distributions
diffusion:
  enabled: true
  n_steps: 1000
  schedule: "cosine"
  sample_size: 5000

# Causal structure for fraud scenarios
causal:
  enabled: true
  template: "fraud_detection"
  sample_size: 10000
  validate: true

# Certificate for provenance
certificates:
  enabled: true
  issuer: "DataSynth v0.5.0"
  include_quality_metrics: true

fraud:
  enabled: true
  fraud_rate: 0.005

anomaly_injection:
  enabled: true
  total_rate: 0.02

output:
  format: csv

CLI Flags

Several AI/ML features can also be controlled via CLI flags:

# Generate with certificate
datasynth-data generate --config config.yaml --output ./output --certificate

# Initialize from natural language
datasynth-data init --from-description "1 year of retail data with fraud" -o config.yaml

# Train diffusion model
datasynth-data diffusion train --fingerprint ./fp.dsf --output ./model.json

# Generate causal data
datasynth-data causal generate --template fraud_detection --samples 10000 --output ./causal/

See Also

Architecture

SyntheticData is designed as a modular, high-performance data generation system.

Overview

┌──────────────────────────────────────────────────────────┐
│                    Application Layer                     │
│     datasynth-cli │ datasynth-server │ datasynth-ui      │
├──────────────────────────────────────────────────────────┤
│                   Orchestration Layer                    │
│                    datasynth-runtime                     │
├──────────────────────────────────────────────────────────┤
│                     Generation Layer                     │
│          datasynth-generators │ datasynth-graph          │
├──────────────────────────────────────────────────────────┤
│                     Foundation Layer                     │
│   datasynth-core │ datasynth-config │ datasynth-output   │
└──────────────────────────────────────────────────────────┘

Key Characteristics

CharacteristicDescription
Modular15 independent crates with clear boundaries
LayeredStrict dependency hierarchy prevents cycles
High-PerformanceParallel execution, memory-efficient streaming
DeterministicSeeded RNG for reproducible output
Type-SafeRust’s type system ensures correctness

Architecture Sections

SectionDescription
Workspace LayoutCrate organization and dependencies
Domain ModelsCore data structures
Data FlowHow data moves through the system
Generation PipelineStep-by-step generation process
Memory ManagementMemory tracking and limits
Design DecisionsKey architectural choices

Design Principles

Separation of Concerns

Each crate has a single responsibility:

  • datasynth-core: Domain models and distributions
  • datasynth-config: Configuration and validation
  • datasynth-generators: Data generation logic
  • datasynth-output: File writing
  • datasynth-runtime: Orchestration

Dependency Inversion

Core components define traits; implementations are provided by higher layers:

#![allow(unused)]
fn main() {
// datasynth-core defines the trait
pub trait Generator<T> {
    fn generate_batch(&mut self, count: usize) -> Result<Vec<T>>;
}

// datasynth-generators implements it
impl Generator<JournalEntry> for JournalEntryGenerator {
    fn generate_batch(&mut self, count: usize) -> Result<Vec<JournalEntry>> {
        // Implementation
    }
}
}

Configuration-Driven

All behavior controlled by configuration:

transactions:
  target_count: 100000
  benford:
    enabled: true

Memory Safety

Rust’s ownership system prevents:

  • Data races in parallel generation
  • Memory leaks
  • Buffer overflows

Component Interactions

                    ┌─────────────┐
                    │   Config    │
                    └──────┬──────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  JE Generator│  │ Doc Generator│  │ Master Data  │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         │
                         ▼
                ┌──────────────┐
                │ Orchestrator │
                └──────┬───────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
        ▼              ▼              ▼
   ┌─────────┐   ┌─────────┐   ┌─────────┐
   │   CSV   │   │  Graph  │   │  JSON   │
   └─────────┘   └─────────┘   └─────────┘

Performance Architecture

Parallel Execution

#![allow(unused)]
fn main() {
// Thread pool distributes work
let entries: Vec<JournalEntry> = (0..num_threads)
    .into_par_iter()
    .flat_map(|thread_id| {
        let mut gen = generator_for_thread(thread_id);
        gen.generate_batch(batch_size)
    })
    .collect();
}

Streaming Output

#![allow(unused)]
fn main() {
// Memory-efficient streaming
for entry in generator.generate_stream() {
    sink.write(&entry)?;
}
}

Memory Guards

#![allow(unused)]
fn main() {
// Memory limits enforced
let guard = MemoryGuard::new(config);
while !guard.check().exceeds_hard_limit {
    generate_batch();
}
}

Extension Points

Custom Generators

Implement the Generator trait:

#![allow(unused)]
fn main() {
impl Generator<CustomType> for CustomGenerator {
    fn generate_batch(&mut self, count: usize) -> Result<Vec<CustomType>> {
        // Custom logic
    }
}
}

Custom Output Sinks

Implement the Sink trait:

#![allow(unused)]
fn main() {
impl Sink<JournalEntry> for CustomSink {
    fn write(&mut self, entry: &JournalEntry) -> Result<()> {
        // Custom output logic
    }
}
}

Custom Distributions

Create specialized samplers:

#![allow(unused)]
fn main() {
impl AmountSampler for CustomAmountSampler {
    fn sample(&mut self) -> Decimal {
        // Custom distribution
    }
}
}

See Also

Workspace Layout

SyntheticData is organized as a Rust workspace with 15 crates following a layered architecture.

Crate Hierarchy

datasynth-cli          → Binary entry point (commands: generate, validate, init, info, fingerprint)
datasynth-server       → REST/gRPC/WebSocket server with auth, rate limiting, timeouts
datasynth-ui           → Tauri/SvelteKit desktop UI
    │
    ▼
datasynth-runtime      → Orchestration layer (GenerationOrchestrator coordinates workflow)
    │
    ├─────────────────────────────────────┐
    ▼                                     ▼
datasynth-generators   datasynth-banking  datasynth-ocpm  datasynth-fingerprint  datasynth-standards
    │                        │                  │                    │
    └────────────────────────┴──────────────────┴────────────────────┘
                                     │
                    ┌────────────────┴────────────────┐
                    ▼                                 ▼
           datasynth-graph                    datasynth-eval
                    │                                 │
                    └────────────────┬────────────────┘
                                     ▼
                            datasynth-config
                                     │
                                     ▼
                            datasynth-core         → Foundation layer
                                     │
                                     ▼
                            datasynth-output

                            datasynth-test-utils   → Testing utilities

Dependency Matrix

CrateDepends On
datasynth-core(none)
datasynth-configdatasynth-core
datasynth-outputdatasynth-core
datasynth-generatorsdatasynth-core, datasynth-config
datasynth-graphdatasynth-core, datasynth-generators
datasynth-evaldatasynth-core
datasynth-bankingdatasynth-core, datasynth-config
datasynth-ocpmdatasynth-core
datasynth-fingerprintdatasynth-core, datasynth-config
datasynth-standardsdatasynth-core, datasynth-config
datasynth-runtimedatasynth-core, datasynth-config, datasynth-generators, datasynth-output, datasynth-graph, datasynth-banking, datasynth-ocpm, datasynth-fingerprint, datasynth-eval
datasynth-clidatasynth-runtime, datasynth-fingerprint
datasynth-serverdatasynth-runtime
datasynth-uidatasynth-runtime (via Tauri)
datasynth-test-utilsdatasynth-core

Directory Structure

SyntheticData/
├── Cargo.toml              # Workspace manifest
├── crates/
│   ├── datasynth-core/
│   │   ├── Cargo.toml
│   │   ├── README.md
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── models/         # Domain models (JournalEntry, Master data, etc.)
│   │       ├── distributions/  # Statistical samplers
│   │       ├── traits/         # Generator, Sink, PostProcessor traits
│   │       ├── templates/      # Template loading system
│   │       ├── accounts.rs     # GL account constants
│   │       ├── uuid_factory.rs # Deterministic UUID generation
│   │       ├── memory_guard.rs # Memory limit enforcement
│   │       ├── disk_guard.rs   # Disk space monitoring
│   │       ├── cpu_monitor.rs  # CPU load tracking
│   │       ├── resource_guard.rs # Unified resource orchestration
│   │       ├── degradation.rs  # Graceful degradation controller
│   │       ├── llm/            # LLM provider abstraction (Mock, HTTP, OpenAI, Anthropic)
│   │       ├── diffusion/      # Diffusion model backend (statistical, hybrid, training)
│   │       └── causal/         # Causal graphs, SCMs, interventions, counterfactuals
│   ├── datasynth-config/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── schema.rs       # Configuration schema
│   │       ├── validation.rs   # Config validation rules
│   │       └── presets/        # Industry preset definitions
│   ├── datasynth-generators/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── je_generator.rs
│   │       ├── coa_generator.rs
│   │       ├── control_generator.rs
│   │       ├── master_data/    # Vendor, Customer, Material, Asset, Employee
│   │       ├── document_flow/  # P2P, O2C, three-way match
│   │       ├── intercompany/   # IC generation, matching, elimination
│   │       ├── balance/        # Opening balance, balance tracker
│   │       ├── subledger/      # AR, AP, FA, Inventory
│   │       ├── fx/             # FX rates, translation, CTA
│   │       ├── period_close/   # Close engine, accruals, depreciation
│   │       ├── anomaly/        # Anomaly injection engine
│   │       ├── data_quality/   # Missing values, typos, duplicates
│   │       ├── audit/          # Engagement, workpaper, evidence, findings
│   │       ├── llm_enrichment/ # LLM-powered vendor names, descriptions, anomaly explanations
│   │       └── relationships/  # Entity graph, cross-process links, relationship strength
│   ├── datasynth-output/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── csv_sink.rs
│   │       ├── json_sink.rs
│   │       └── control_export.rs
│   ├── datasynth-graph/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── models/         # Node, edge types
│   │       ├── builders/       # Transaction, approval, entity graphs
│   │       ├── exporters/      # PyTorch Geometric, Neo4j, DGL
│   │       └── ml/             # Feature computation, train/val/test splits
│   ├── datasynth-runtime/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── orchestrator.rs # GenerationOrchestrator
│   │       └── progress.rs     # Progress tracking
│   ├── datasynth-cli/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       └── main.rs         # generate, validate, init, info, fingerprint commands
│   ├── datasynth-server/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       ├── main.rs
│   │       ├── rest/           # Axum REST API
│   │       ├── grpc/           # Tonic gRPC service
│   │       └── websocket/      # WebSocket streaming
│   ├── datasynth-ui/
│   │   ├── package.json
│   │   ├── src/                # SvelteKit frontend
│   │   │   ├── routes/         # 15+ config pages
│   │   │   └── lib/            # Components, stores
│   │   └── src-tauri/          # Rust backend
│   ├── datasynth-eval/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── statistical/    # Benford, distributions, temporal
│   │       ├── coherence/      # Balance, IC, document chains
│   │       ├── quality/        # Completeness, consistency, duplicates
│   │       ├── ml/             # Feature distributions, label quality
│   │       ├── report/         # HTML/JSON report generation
│   │       └── enhancement/    # AutoTuner, RecommendationEngine
│   ├── datasynth-banking/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── models/         # Customer, Account, Transaction, KYC
│   │       ├── generators/     # Customer, account, transaction generation
│   │       ├── typologies/     # Structuring, funnel, layering, mule, fraud
│   │       ├── personas/       # Retail, business, trust behaviors
│   │       └── labels/         # Entity, relationship, transaction labels
│   ├── datasynth-ocpm/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── models/         # EventLog, Event, ObjectInstance, ObjectType
│   │       ├── generator/      # P2P, O2C event generation
│   │       └── export/         # OCEL 2.0 JSON export
│   ├── datasynth-fingerprint/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── models/         # Fingerprint, Manifest, Schema, Statistics
│   │       ├── privacy/        # Laplace, Gaussian, k-anonymity, PrivacyEngine
│   │       ├── extraction/     # Schema, stats, correlation, integrity extractors
│   │       ├── io/             # DSF file reader, writer, validator
│   │       ├── synthesis/      # ConfigSynthesizer, DistributionFitter, GaussianCopula
│   │       ├── evaluation/     # FidelityEvaluator, FidelityReport
│   │       ├── federated/      # Federated fingerprint protocol, secure aggregation
│   │       └── certificates/   # Synthetic data certificates, HMAC-SHA256 signing
│   ├── datasynth-standards/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── framework.rs     # AccountingFramework, FrameworkSettings
│   │       ├── accounting/      # Revenue (ASC 606/IFRS 15), Leases, Fair Value, Impairment
│   │       ├── audit/           # ISA standards, Analytical procedures, Opinions
│   │       └── regulatory/      # SOX 302/404, DeficiencyMatrix
│   └── datasynth-test-utils/
│       ├── Cargo.toml
│       └── src/
│           └── lib.rs          # Test fixtures, assertions, mocks
├── benches/                    # Criterion benchmark suite
├── docs/                       # This documentation (mdBook)
├── python/                     # Python wrapper (datasynth_py)
├── examples/                   # Example configurations and templates
└── tests/                      # Integration tests

Crate Purposes

Application Layer

CratePurpose
datasynth-cliCommand-line interface with generate, validate, init, info, fingerprint commands
datasynth-serverREST/gRPC/WebSocket API with auth, rate limiting, timeouts
datasynth-uiCross-platform desktop application (Tauri + SvelteKit)

Processing Layer

CratePurpose
datasynth-runtimeOrchestrates generation workflow with resource guards
datasynth-generatorsCore data generation (JE, master data, documents, anomalies, audit)
datasynth-graphGraph construction and export for ML

Domain-Specific Modules

CratePurpose
datasynth-bankingKYC/AML banking transactions with fraud typologies
datasynth-ocpmOCEL 2.0 process mining event logs
datasynth-fingerprintPrivacy-preserving fingerprint extraction and synthesis
datasynth-standardsAccounting/audit standards (US GAAP, IFRS, ISA, SOX, PCAOB)

Foundation Layer

CratePurpose
datasynth-coreDomain models, traits, distributions, resource guards
datasynth-configConfiguration schema and validation
datasynth-outputOutput sinks (CSV, JSON, Parquet)

Supporting Crates

CratePurpose
datasynth-evalQuality evaluation with auto-tuning recommendations
datasynth-test-utilsTest fixtures and assertions

Build Commands

# Build entire workspace
cargo build --release

# Build specific crate
cargo build -p datasynth-core
cargo build -p datasynth-generators
cargo build -p datasynth-fingerprint

# Run tests
cargo test
cargo test -p datasynth-core
cargo test -p datasynth-fingerprint

# Generate documentation
cargo doc --workspace --no-deps

# Run benchmarks
cargo bench

Feature Flags

Workspace-level features:

[workspace.features]
default = ["full"]
full = ["server", "ui", "graph"]
server = []
ui = []
graph = []

Crate-level features:

# datasynth-core
[features]
templates = ["serde_yaml"]

# datasynth-output
[features]
compression = ["flate2", "zstd"]

Adding a New Crate

  1. Create directory: crates/datasynth-newcrate/
  2. Add Cargo.toml:
    [package]
    name = "datasynth-newcrate"
    version = "0.2.0"
    edition = "2021"
    
    [dependencies]
    datasynth-core = { path = "../datasynth-core" }
    
  3. Add to workspace Cargo.toml:
    [workspace]
    members = [
        # ...
        "crates/datasynth-newcrate",
    ]
    
  4. Create src/lib.rs
  5. Add documentation to docs/src/crates/

See Also

Domain Models

Core data structures representing enterprise financial concepts.

Model Categories

CategoryModels
AccountingJournalEntry, ChartOfAccounts, ACDOCA
Master DataVendor, Customer, Material, FixedAsset, Employee
DocumentsPurchaseOrder, Invoice, Payment, etc.
FinancialTrialBalance, FxRate, AccountBalance
Financial ReportingFinancialStatement, CashFlowItem, BankReconciliation, BankStatementLine
Sourcing (S2C)SourcingProject, SupplierQualification, RfxEvent, Bid, BidEvaluation, ProcurementContract, CatalogItem, SupplierScorecard, SpendAnalysis
HR / PayrollPayrollRun, PayrollLineItem, TimeEntry, ExpenseReport, ExpenseLineItem
ManufacturingProductionOrder, QualityInspection, CycleCount
Sales QuotesSalesQuote, QuoteLineItem
ComplianceInternalControl, SoDRule, LabeledAnomaly

Accounting

JournalEntry

The core accounting record.

#![allow(unused)]
fn main() {
pub struct JournalEntry {
    pub header: JournalEntryHeader,
    pub lines: Vec<JournalEntryLine>,
}

pub struct JournalEntryHeader {
    pub document_id: Uuid,
    pub company_code: String,
    pub fiscal_year: u16,
    pub fiscal_period: u8,
    pub posting_date: NaiveDate,
    pub document_date: NaiveDate,
    pub created_at: DateTime<Utc>,
    pub source: TransactionSource,
    pub business_process: Option<BusinessProcess>,

    // Document references
    pub source_document_type: Option<DocumentType>,
    pub source_document_id: Option<String>,

    // Labels
    pub is_fraud: bool,
    pub fraud_type: Option<FraudType>,
    pub is_anomaly: bool,
    pub anomaly_type: Option<AnomalyType>,

    // Control markers
    pub control_ids: Vec<String>,
    pub sox_relevant: bool,
    pub sod_violation: bool,
}

pub struct JournalEntryLine {
    pub line_number: u32,
    pub account_number: String,
    pub cost_center: Option<String>,
    pub profit_center: Option<String>,
    pub debit_amount: Decimal,
    pub credit_amount: Decimal,
    pub description: String,
    pub tax_code: Option<String>,
}
}

Invariant: Sum of debits must equal sum of credits.

ChartOfAccounts

GL account structure.

#![allow(unused)]
fn main() {
pub struct ChartOfAccounts {
    pub accounts: Vec<Account>,
}

pub struct Account {
    pub account_number: String,
    pub name: String,
    pub account_type: AccountType,
    pub account_subtype: AccountSubType,
    pub is_control_account: bool,
    pub normal_balance: NormalBalance,
    pub is_active: bool,
}

pub enum AccountType {
    Asset,
    Liability,
    Equity,
    Revenue,
    Expense,
}

pub enum AccountSubType {
    // Assets
    Cash, AccountsReceivable, Inventory, FixedAsset,
    // Liabilities
    AccountsPayable, AccruedLiabilities, LongTermDebt,
    // Equity
    CommonStock, RetainedEarnings,
    // Revenue
    SalesRevenue, ServiceRevenue,
    // Expense
    CostOfGoodsSold, OperatingExpense,
    // ...
}
}

ACDOCA

SAP HANA Universal Journal format.

#![allow(unused)]
fn main() {
pub struct AcdocaEntry {
    pub rclnt: String,           // Client
    pub rldnr: String,           // Ledger
    pub rbukrs: String,          // Company code
    pub gjahr: u16,              // Fiscal year
    pub belnr: String,           // Document number
    pub docln: u32,              // Line item
    pub ryear: u16,              // Year
    pub poper: u8,               // Posting period
    pub racct: String,           // Account
    pub drcrk: DebitCreditIndicator,
    pub hsl: Decimal,            // Amount in local currency
    pub ksl: Decimal,            // Amount in group currency

    // Simulation fields
    pub zsim_fraud: bool,
    pub zsim_anomaly: bool,
    pub zsim_source: String,
}
}

Master Data

Vendor

Supplier master record.

#![allow(unused)]
fn main() {
pub struct Vendor {
    pub vendor_id: String,
    pub vendor_name: String,
    pub tax_id: Option<String>,
    pub currency: String,
    pub country: String,
    pub payment_terms: PaymentTerms,
    pub bank_account: Option<BankAccount>,
    pub is_intercompany: bool,
    pub behavior: VendorBehavior,
    pub valid_from: NaiveDate,
    pub valid_to: Option<NaiveDate>,
}

pub struct VendorBehavior {
    pub late_payment_tendency: f64,
    pub discount_usage_rate: f64,
}
}

Customer

Customer master record.

#![allow(unused)]
fn main() {
pub struct Customer {
    pub customer_id: String,
    pub customer_name: String,
    pub currency: String,
    pub country: String,
    pub credit_limit: Decimal,
    pub credit_rating: CreditRating,
    pub payment_behavior: PaymentBehavior,
    pub is_intercompany: bool,
    pub valid_from: NaiveDate,
}

pub struct PaymentBehavior {
    pub on_time_rate: f64,
    pub early_payment_rate: f64,
    pub late_payment_rate: f64,
    pub average_days_late: u32,
}
}

Material

Product/material master.

#![allow(unused)]
fn main() {
pub struct Material {
    pub material_id: String,
    pub description: String,
    pub material_type: MaterialType,
    pub unit_of_measure: String,
    pub valuation_method: ValuationMethod,
    pub standard_cost: Decimal,
    pub gl_account: String,
}

pub enum MaterialType {
    RawMaterial,
    WorkInProgress,
    FinishedGoods,
    Service,
}

pub enum ValuationMethod {
    Fifo,
    Lifo,
    WeightedAverage,
    StandardCost,
}
}

FixedAsset

Capital asset record.

#![allow(unused)]
fn main() {
pub struct FixedAsset {
    pub asset_id: String,
    pub description: String,
    pub asset_class: AssetClass,
    pub acquisition_date: NaiveDate,
    pub acquisition_cost: Decimal,
    pub useful_life_years: u32,
    pub depreciation_method: DepreciationMethod,
    pub salvage_value: Decimal,
    pub accumulated_depreciation: Decimal,
    pub disposal_date: Option<NaiveDate>,
}
}

Employee

User/employee record.

#![allow(unused)]
fn main() {
pub struct Employee {
    pub employee_id: String,
    pub name: String,
    pub department: String,
    pub role: String,
    pub manager_id: Option<String>,
    pub approval_limit: Decimal,
    pub transaction_codes: Vec<String>,
    pub hire_date: NaiveDate,
}
}

Documents

PurchaseOrder

P2P initiating document.

#![allow(unused)]
fn main() {
pub struct PurchaseOrder {
    pub po_number: String,
    pub vendor_id: String,
    pub company_code: String,
    pub order_date: NaiveDate,
    pub items: Vec<PoLineItem>,
    pub total_amount: Decimal,
    pub currency: String,
    pub status: PoStatus,
}

pub struct PoLineItem {
    pub line_number: u32,
    pub material_id: String,
    pub quantity: Decimal,
    pub unit_price: Decimal,
    pub gl_account: String,
}
}

VendorInvoice

AP invoice with three-way match.

#![allow(unused)]
fn main() {
pub struct VendorInvoice {
    pub invoice_number: String,
    pub vendor_id: String,
    pub po_number: Option<String>,
    pub gr_number: Option<String>,
    pub invoice_date: NaiveDate,
    pub due_date: NaiveDate,
    pub total_amount: Decimal,
    pub match_status: MatchStatus,
}

pub enum MatchStatus {
    Matched,
    QuantityVariance,
    PriceVariance,
    Blocked,
}
}

DocumentReference

Links documents in flows.

#![allow(unused)]
fn main() {
pub struct DocumentReference {
    pub from_document_type: DocumentType,
    pub from_document_id: String,
    pub to_document_type: DocumentType,
    pub to_document_id: String,
    pub reference_type: ReferenceType,
}

pub enum ReferenceType {
    FollowsFrom,     // Normal flow
    PaymentFor,      // Payment → Invoice
    ReversalOf,      // Reversal/credit memo
}
}

Financial

TrialBalance

Period-end balances.

#![allow(unused)]
fn main() {
pub struct TrialBalance {
    pub company_code: String,
    pub fiscal_year: u16,
    pub fiscal_period: u8,
    pub accounts: Vec<TrialBalanceRow>,
}

pub struct TrialBalanceRow {
    pub account_number: String,
    pub account_name: String,
    pub opening_balance: Decimal,
    pub period_debits: Decimal,
    pub period_credits: Decimal,
    pub closing_balance: Decimal,
}
}

FxRate

Exchange rate record.

#![allow(unused)]
fn main() {
pub struct FxRate {
    pub from_currency: String,
    pub to_currency: String,
    pub rate_date: NaiveDate,
    pub rate_type: RateType,
    pub rate: Decimal,
}

pub enum RateType {
    Spot,
    Closing,
    Average,
}
}

Compliance

LabeledAnomaly

ML training label.

#![allow(unused)]
fn main() {
pub struct LabeledAnomaly {
    pub document_id: Uuid,
    pub anomaly_id: String,
    pub anomaly_type: AnomalyType,
    pub category: AnomalyCategory,
    pub severity: Severity,
    pub description: String,
    pub detection_difficulty: DetectionDifficulty,
}

pub enum AnomalyType {
    Fraud,
    Error,
    ProcessIssue,
    Statistical,
    Relational,
}
}

InternalControl

SOX control definition.

#![allow(unused)]
fn main() {
pub struct InternalControl {
    pub control_id: String,
    pub name: String,
    pub description: String,
    pub control_type: ControlType,
    pub frequency: ControlFrequency,
    pub assertions: Vec<Assertion>,
}
}

Financial Reporting

FinancialStatement

Period-end financial statement with line items.

#![allow(unused)]
fn main() {
pub enum StatementType {
    BalanceSheet,
    IncomeStatement,
    CashFlowStatement,
    ChangesInEquity,
}

pub struct FinancialStatementLineItem {
    pub line_code: String,
    pub label: String,
    pub section: String,
    pub sort_order: u32,
    pub amount: Decimal,
    pub amount_prior: Option<Decimal>,
    pub indent_level: u8,
    pub is_total: bool,
    pub gl_accounts: Vec<String>,
}

pub struct CashFlowItem {
    pub item_code: String,
    pub label: String,
    pub category: CashFlowCategory,  // Operating, Investing, Financing
    pub amount: Decimal,
}
}

BankReconciliation

Bank statement reconciliation with auto-matching.

#![allow(unused)]
fn main() {
pub struct BankStatementLine {
    pub line_id: String,
    pub statement_date: NaiveDate,
    pub direction: Direction,         // Inflow, Outflow
    pub amount: Decimal,
    pub description: String,
    pub match_status: MatchStatus,    // Unmatched, AutoMatched, ManuallyMatched, BankCharge, Interest
    pub matched_payment_id: Option<String>,
}

pub struct BankReconciliation {
    pub reconciliation_id: String,
    pub company_code: String,
    pub bank_account: String,
    pub period_start: NaiveDate,
    pub period_end: NaiveDate,
    pub opening_balance: Decimal,
    pub closing_balance: Decimal,
    pub status: ReconciliationStatus, // InProgress, Completed, CompletedWithExceptions
}
}

Sourcing (S2C)

Source-to-Contract models for the procurement pipeline.

SourcingProject

Top-level sourcing initiative.

#![allow(unused)]
fn main() {
pub struct SourcingProject {
    pub project_id: String,
    pub title: String,
    pub category: String,
    pub status: SourcingProjectStatus,
    pub estimated_spend: Decimal,
    pub start_date: NaiveDate,
    pub target_award_date: NaiveDate,
}
}

RfxEvent

Request for Information/Proposal/Quote.

#![allow(unused)]
fn main() {
pub struct RfxEvent {
    pub rfx_id: String,
    pub project_id: String,
    pub rfx_type: RfxType,       // Rfi, Rfp, Rfq
    pub title: String,
    pub issue_date: NaiveDate,
    pub close_date: NaiveDate,
    pub invited_suppliers: Vec<String>,
}
}

ProcurementContract

Awarded contract resulting from bid evaluation.

#![allow(unused)]
fn main() {
pub struct ProcurementContract {
    pub contract_id: String,
    pub vendor_id: String,
    pub rfx_id: Option<String>,
    pub contract_value: Decimal,
    pub start_date: NaiveDate,
    pub end_date: NaiveDate,
    pub auto_renew: bool,
}
}

Additional S2C models include SpendAnalysis, SupplierQualification, Bid, BidEvaluation, CatalogItem, and SupplierScorecard.


HR / Payroll

Hire-to-Retire (H2R) process models.

PayrollRun

A complete pay cycle for a company.

#![allow(unused)]
fn main() {
pub struct PayrollRun {
    pub payroll_id: String,
    pub company_code: String,
    pub pay_period_start: NaiveDate,
    pub pay_period_end: NaiveDate,
    pub run_date: NaiveDate,
    pub status: PayrollRunStatus,     // Draft, Calculated, Approved, Posted, Reversed
    pub total_gross: Decimal,
    pub total_deductions: Decimal,
    pub total_net: Decimal,
    pub total_employer_cost: Decimal,
    pub employee_count: u32,
}
}

TimeEntry

Employee time tracking record.

#![allow(unused)]
fn main() {
pub struct TimeEntry {
    pub entry_id: String,
    pub employee_id: String,
    pub date: NaiveDate,
    pub hours_regular: f64,
    pub hours_overtime: f64,
    pub hours_pto: f64,
    pub hours_sick: f64,
    pub project_id: Option<String>,
    pub cost_center: Option<String>,
    pub approval_status: TimeApprovalStatus,  // Pending, Approved, Rejected
}
}

ExpenseReport

Employee expense reimbursement.

#![allow(unused)]
fn main() {
pub struct ExpenseReport {
    pub report_id: String,
    pub employee_id: String,
    pub submission_date: NaiveDate,
    pub status: ExpenseStatus,        // Draft, Submitted, Approved, Rejected, Paid
    pub total_amount: Decimal,
    pub line_items: Vec<ExpenseLineItem>,
}

pub enum ExpenseCategory {
    Travel, Meals, Lodging, Transportation,
    Office, Entertainment, Training, Other,
}
}

Manufacturing

Production and quality process models.

ProductionOrder

Manufacturing production order linked to materials.

#![allow(unused)]
fn main() {
pub struct ProductionOrder {
    pub order_id: String,
    pub material_id: String,
    pub planned_quantity: Decimal,
    pub actual_quantity: Decimal,
    pub start_date: NaiveDate,
    pub end_date: Option<NaiveDate>,
    pub status: ProductionOrderStatus,
}
}

QualityInspection

Quality control inspection record.

#![allow(unused)]
fn main() {
pub struct QualityInspection {
    pub inspection_id: String,
    pub production_order_id: String,
    pub inspection_date: NaiveDate,
    pub result: InspectionResult,     // Pass, Fail, Conditional
    pub defect_count: u32,
}
}

CycleCount

Inventory cycle count with variance tracking.

#![allow(unused)]
fn main() {
pub struct CycleCount {
    pub count_id: String,
    pub material_id: String,
    pub warehouse: String,
    pub count_date: NaiveDate,
    pub system_quantity: Decimal,
    pub counted_quantity: Decimal,
    pub variance: Decimal,
}
}

Sales Quotes

Quote-to-order pipeline models.

SalesQuote

Sales quotation record.

#![allow(unused)]
fn main() {
pub struct SalesQuote {
    pub quote_id: String,
    pub customer_id: String,
    pub quote_date: NaiveDate,
    pub valid_until: NaiveDate,
    pub total_amount: Decimal,
    pub status: QuoteStatus,          // Draft, Sent, Won, Lost, Expired
    pub converted_order_id: Option<String>,
}
}

Decimal Handling

All monetary amounts use rust_decimal::Decimal:

#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;

let amount = dec!(1234.56);
let tax = amount * dec!(0.077);
}

Amounts are serialized as strings to avoid IEEE 754 floating-point precision issues:

{"amount": "1234.56"}

See Also

Data Flow

How data flows through the SyntheticData system.

High-Level Flow

┌─────────────┐
│   Config    │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│                     Orchestrator                             │
│                                                              │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐ │
│  │  Master  │ → │  Opening │ → │ Transact │ → │  Period  │ │
│  │   Data   │   │ Balances │   │   ions   │   │  Close   │ │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘ │
│                                                              │
└───────────────────────────┬─────────────────────────────────┘
                            │
       ┌────────────────────┼────────────────────┐
       │                    │                    │
       ▼                    ▼                    ▼
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  CSV Sink   │      │ Graph Export│      │  Labels     │
└─────────────┘      └─────────────┘      └─────────────┘

Phase 1: Configuration Loading

YAML File → Parser → Validator → Config Object
  1. Load: Read YAML/JSON file
  2. Parse: Convert to strongly-typed structures
  3. Validate: Check constraints and ranges
  4. Resolve: Apply defaults and presets
#![allow(unused)]
fn main() {
let config = Config::from_yaml_file("config.yaml")?;
ConfigValidator::new().validate(&config)?;
}

Phase 2: Master Data Generation

Config → Master Data Generators → Entity Registry

Order of generation (to satisfy dependencies):

  1. Chart of Accounts: GL account structure
  2. Employees: Users with approval limits
  3. Vendors: Suppliers (reference employees as approvers)
  4. Customers: Buyers (reference employees)
  5. Materials: Products (reference accounts)
  6. Fixed Assets: Capital assets (reference accounts)
#![allow(unused)]
fn main() {
// Entity registry maintains references
let registry = EntityRegistry::new();
registry.register_vendors(&vendors);
registry.register_customers(&customers);
}

Phase 3: Opening Balance Generation

Config + CoA → Balance Generator → Opening JEs

Generates coherent opening balance sheet:

  1. Calculate target balances per account type
  2. Distribute across accounts
  3. Generate opening entries
  4. Verify A = L + E
#![allow(unused)]
fn main() {
let opening = OpeningBalanceGenerator::new(&config);
let entries = opening.generate()?;

// Verify balance coherence
assert!(entries.iter().all(|e| e.is_balanced()));
}

Phase 4: Transaction Generation

Document Flow Path

Config → P2P/O2C Generators → Documents → JE Generator → Entries

P2P Flow:

PO Generator → Purchase Order
                    │
                    ▼
GR Generator → Goods Receipt → JE (Inventory/GR-IR)
                    │
                    ▼
Invoice Gen. → Vendor Invoice → JE (GR-IR/AP)
                    │
                    ▼
Payment Gen. → Payment → JE (AP/Cash)

Direct JE Path

Config → JE Generator → Entries

For transactions not from document flows:

  • Manual entries
  • Recurring entries
  • Adjustments
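
A sketch of the direct path, assuming the JournalEntryGenerator constructor and generate_batch method shown later in the Generation Pipeline chapter:

#![allow(unused)]
fn main() {
// Direct journal entries bypass the document flow generators
// (constructor arguments assumed from the pipeline chapter).
let mut je_gen = JournalEntryGenerator::new(&config, config.global.seed);
let manual_entries = je_gen.generate_batch(1_000)?;
sink.write_batch(&manual_entries)?;
}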

Phase 5: Balance Tracking

Entries → Balance Tracker → Running Balances → Trial Balance

Continuous tracking during generation:

#![allow(unused)]
fn main() {
let mut tracker = BalanceTracker::new(&coa);

for entry in &entries {
    tracker.post(&entry)?;

    // Verify coherence after each entry
    assert!(tracker.is_balanced());
}

let trial_balance = tracker.to_trial_balance(period);
}

Phase 6: Anomaly Injection

Entries → Anomaly Injector → Modified Entries + Labels

Anomalies injected post-generation:

  1. Select entries based on targeting strategy
  2. Apply anomaly transformation
  3. Generate label record
#![allow(unused)]
fn main() {
let injector = AnomalyInjector::new(&config.anomaly_injection);
let (modified, labels) = injector.inject(&entries)?;
}

Phase 7: Period Close

Entries + Balances → Close Engine → Closing Entries

Monthly:

  • Accruals
  • Depreciation
  • Subledger reconciliation

Quarterly:

  • IC eliminations
  • Currency translation

Annual:

  • Closing entries
  • Retained earnings
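
A condensed sketch of the close loop, reusing the CloseEngine calls from the Generation Pipeline chapter (argument shapes are assumptions based on that example):

#![allow(unused)]
fn main() {
// Monthly close per period; quarterly and annual steps follow the same pattern.
let close_engine = CloseEngine::new(&config.period_close);
for period in config.periods() {
    let closing_entries = close_engine.run_monthly_close(period, &state, &mut balance_tracker)?;
    entries.extend(closing_entries);
}
}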

Phase 8: Output Generation

CSV/JSON Output

Entries + Master Data → Sinks → Files
#![allow(unused)]
fn main() {
let mut sink = CsvSink::new("output/journal_entries.csv")?;
sink.write_batch(&entries)?;
sink.flush()?;
}

Graph Output

Entries → Graph Builder → Graph → Exporter → PyG/Neo4j
#![allow(unused)]
fn main() {
let builder = TransactionGraphBuilder::new();
let graph = builder.build(&entries)?;

let exporter = PyTorchGeometricExporter::new("output/graphs");
exporter.export(&graph, split_config)?;
}

Phase 9: Enterprise Process Chains (v0.6.0)

Source-to-Contract (S2C) Flow

Spend Analysis → Sourcing Project → Supplier Qualification → RFx Event → Bids →
Bid Evaluation → Contract Award → Catalog Items → [feeds into P2P] → Supplier Scorecard

S2C data feeds into the existing P2P procurement flow. Procurement contracts and catalog items provide the upstream sourcing context for purchase orders.

HR / Payroll Flow

Employees (Master Data) → Time Entries → Payroll Run → JE (Salary Expense/Cash)
                        → Expense Reports → JE (Expense/AP)

HR data depends on the employee master data from Phase 2. Payroll runs generate journal entries that post to salary expense and cash accounts.
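
For example, a posted payroll run produces a balanced entry of roughly this shape (account names and amounts are made up for illustration; a fuller posting would also credit tax and deduction liabilities):

#![allow(unused)]
fn main() {
use rust_decimal_macros::dec;

// Illustrative payroll posting: debit salary expense, credit cash.
let payroll_lines = [
    ("6000 Salary Expense", dec!(125000.00), dec!(0.00)),
    ("1000 Cash",           dec!(0.00),      dec!(125000.00)),
];
}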

Financial Reporting Flow

Trial Balance → Balance Sheet + Income Statement
             → Cash Flow Statement (indirect method)
             → Changes in Equity
             → Management KPIs
             → Budget Variance Analysis

Payments (P2P/O2C) → Bank Reconciliation → Matched/Unmatched Items

Financial statements are derived from the adjusted trial balance. Bank reconciliations match payments from document flows against bank statement lines.

Manufacturing Flow

Materials (Master Data) → Production Orders → Quality Inspections
                                            → Cycle Counts

Manufacturing data depends on materials from the master data. Production orders consume raw materials and produce finished goods.

Sales Quote Flow

Customers (Master Data) → Sales Quotes → [feeds into O2C when won]

The quote-to-order pipeline generates sales quotes that, when won, link to sales orders in the O2C flow.

Accounting Standards Flow

Customers → Customer Contracts → Performance Obligations (ASC 606/IFRS 15)
Fixed Assets → Impairment Tests → Recoverable Amount Calculations

Revenue recognition generates contracts with performance obligations. Impairment testing evaluates fixed asset carrying amounts against recoverable values.

Data Dependencies

         ┌─────────────┐
         │    Config   │
         └──────┬──────┘
                │
    ┌───────────┼───────────┐
    │           │           │
    ▼           ▼           ▼
┌───────┐  ┌───────┐  ┌───────┐
│  CoA  │  │Vendors│  │Customs│
└───┬───┘  └───┬───┘  └───┬───┘
    │          │          │
    │    ┌─────┴─────┐    │
    │    │           │    │
    ▼    ▼           ▼    ▼
┌─────────────┐  ┌─────────────┐
│   P2P Docs  │  │   O2C Docs  │
└──────┬──────┘  └──────┬──────┘
       │                │
       └───────┬────────┘
               │
               ▼
        ┌─────────────┐
        │   Entries   │
        └──────┬──────┘
               │
    ┌──────────┼──────────┐──────────┐──────────┐
    │          │          │          │          │
    ▼          ▼          ▼          ▼          ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌─────────┐ ┌───────┐
│  TB   │ │ Graph │ │Labels │ │Fin.Stmt │ │BankRec│
└───────┘ └───────┘ └───────┘ └─────────┘ └───────┘

Streaming vs Batch

Batch Mode

All data in memory:

#![allow(unused)]
fn main() {
let entries = generator.generate_batch(100000)?;
sink.write_batch(&entries)?;
}

  • Pro: Fast parallel processing
  • Con: Memory intensive

Streaming Mode

Process one at a time:

#![allow(unused)]
fn main() {
for entry in generator.generate_stream() {
    sink.write(&entry?)?;
}
}

  • Pro: Memory efficient
  • Con: No parallelism

Hybrid Mode

Batch with periodic flush:

#![allow(unused)]
fn main() {
for batch in generator.generate_batches(1000) {
    let entries = batch?;
    sink.write_batch(&entries)?;

    if memory_guard.check().exceeds_soft_limit {
        sink.flush()?;
    }
}
}

See Also

Generation Pipeline

Step-by-step generation process orchestrated by datasynth-runtime.

Pipeline Overview

┌─────────────────────────────────────────────────────────────────────┐
│                      GenerationOrchestrator                          │
│                                                                      │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐   │
│  │Init  │→│Master│→│Open  │→│Trans │→│Close │→│Inject│→│Export│   │
│  │      │ │Data  │ │Bal   │ │      │ │      │ │      │ │      │   │
│  └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Stage 1: Initialization

Purpose: Prepare generation environment

#![allow(unused)]
fn main() {
pub fn initialize(&mut self) -> Result<()> {
    // 1. Validate configuration
    ConfigValidator::new().validate(&self.config)?;

    // 2. Initialize RNG with seed
    self.rng = ChaCha8Rng::seed_from_u64(self.config.global.seed);

    // 3. Create UUID factory
    self.uuid_factory = DeterministicUuidFactory::new(self.config.global.seed);

    // 4. Set up memory guard
    self.memory_guard = MemoryGuard::new(self.config.memory_config());

    // 5. Create output directories
    fs::create_dir_all(&self.output_path)?;

    Ok(())
}
}

Outputs:

  • Validated configuration
  • Initialized RNG
  • UUID factory
  • Memory guard
  • Output directories

Stage 2: Master Data Generation

Purpose: Generate all entity master records

#![allow(unused)]
fn main() {
pub fn generate_master_data(&mut self) -> Result<MasterDataState> {
    let mut state = MasterDataState::new();

    // 1. Chart of Accounts
    let coa_gen = CoaGenerator::new(&self.config, &mut self.rng);
    state.chart_of_accounts = coa_gen.generate()?;

    // 2. Employees (needed for approvals)
    let emp_gen = EmployeeGenerator::new(&self.config, &mut self.rng);
    state.employees = emp_gen.generate()?;

    // 3. Vendors (reference employees)
    let vendor_gen = VendorGenerator::new(&self.config, &mut self.rng);
    state.vendors = vendor_gen.generate()?;

    // 4. Customers
    let customer_gen = CustomerGenerator::new(&self.config, &mut self.rng);
    state.customers = customer_gen.generate()?;

    // 5. Materials
    let material_gen = MaterialGenerator::new(&self.config, &mut self.rng);
    state.materials = material_gen.generate()?;

    // 6. Fixed Assets
    let asset_gen = AssetGenerator::new(&self.config, &mut self.rng);
    state.fixed_assets = asset_gen.generate()?;

    // 7. Register in entity registry
    self.registry.register_all(&state);

    Ok(state)
}
}

Outputs:

  • Chart of Accounts
  • Vendors, Customers
  • Materials, Fixed Assets
  • Employees
  • Entity Registry

Stage 3: Opening Balance Generation

Purpose: Create coherent opening balance sheet

#![allow(unused)]
fn main() {
pub fn generate_opening_balances(&mut self) -> Result<Vec<JournalEntry>> {
    let generator = OpeningBalanceGenerator::new(
        &self.config,
        &self.state.chart_of_accounts,
        &mut self.rng,
    );

    let entries = generator.generate()?;

    // Initialize balance tracker
    self.balance_tracker = BalanceTracker::new(&self.state.chart_of_accounts);
    for entry in &entries {
        self.balance_tracker.post(entry)?;
    }

    // Verify A = L + E
    assert!(self.balance_tracker.is_balanced());

    Ok(entries)
}
}

Outputs:

  • Opening balance entries
  • Initialized balance tracker

Stage 4: Transaction Generation

Purpose: Generate main transaction volume

#![allow(unused)]
fn main() {
pub fn generate_transactions(&mut self) -> Result<Vec<JournalEntry>> {
    let target = self.config.transactions.target_count;
    let mut entries = Vec::with_capacity(target as usize);

    // Calculate counts by source
    let p2p_count = (target as f64 * self.config.document_flows.p2p.flow_rate) as u64;
    let o2c_count = (target as f64 * self.config.document_flows.o2c.flow_rate) as u64;
    let other_count = target - p2p_count - o2c_count;

    // 1. Generate P2P flows
    let p2p_entries = self.generate_p2p_flows(p2p_count)?;
    entries.extend(p2p_entries);

    // 2. Generate O2C flows
    let o2c_entries = self.generate_o2c_flows(o2c_count)?;
    entries.extend(o2c_entries);

    // 3. Generate other entries (manual, recurring, etc.)
    let other_entries = self.generate_other_entries(other_count)?;
    entries.extend(other_entries);

    // 4. Sort by posting date
    entries.sort_by_key(|e| e.header.posting_date);

    // 5. Update balance tracker
    for entry in &entries {
        self.balance_tracker.post(entry)?;
    }

    Ok(entries)
}
}

P2P Flow Generation

#![allow(unused)]
fn main() {
fn generate_p2p_flows(&mut self, count: u64) -> Result<Vec<JournalEntry>> {
    let mut p2p_gen = P2pGenerator::new(&self.config, &self.registry, &mut self.rng);
    let mut doc_gen = DocumentFlowJeGenerator::new(&self.config);

    let mut entries = Vec::new();

    for _ in 0..count {
        // 1. Generate document flow
        let flow = p2p_gen.generate_flow()?;
        self.state.documents.add_p2p_flow(&flow);

        // 2. Generate journal entries from flow
        let flow_entries = doc_gen.generate_from_p2p(&flow)?;
        entries.extend(flow_entries);
    }

    Ok(entries)
}
}

Outputs:

  • Journal entries
  • Document records
  • Updated balances

Stage 5: Period Close

Purpose: Run period-end processes

#![allow(unused)]
fn main() {
pub fn run_period_close(&mut self) -> Result<()> {
    let close_engine = CloseEngine::new(&self.config.period_close);

    for period in self.config.periods() {
        // 1. Monthly close
        let monthly_entries = close_engine.run_monthly_close(
            period,
            &self.state,
            &mut self.balance_tracker,
        )?;
        self.state.entries.extend(monthly_entries);

        // 2. Quarterly close (if applicable)
        if period.is_quarter_end() {
            let quarterly_entries = close_engine.run_quarterly_close(
                period,
                &self.state,
            )?;
            self.state.entries.extend(quarterly_entries);
        }

        // 3. Generate trial balance
        let trial_balance = self.balance_tracker.to_trial_balance(period);
        self.state.trial_balances.push(trial_balance);
    }

    // 4. Annual close
    if self.config.has_year_end() {
        let annual_entries = close_engine.run_annual_close(&self.state)?;
        self.state.entries.extend(annual_entries);
    }

    Ok(())
}
}

Outputs:

  • Accrual entries
  • Depreciation entries
  • Closing entries
  • Trial balances

Stage 6: Anomaly Injection

Purpose: Add anomalies and generate labels

#![allow(unused)]
fn main() {
pub fn inject_anomalies(&mut self) -> Result<()> {
    if !self.config.anomaly_injection.enabled {
        return Ok(());
    }

    let mut injector = AnomalyInjector::new(
        &self.config.anomaly_injection,
        &mut self.rng,
    );

    // 1. Select entries for injection
    let target_count = (self.state.entries.len() as f64
        * self.config.anomaly_injection.total_rate) as usize;

    // 2. Inject anomalies
    let (modified, labels) = injector.inject(
        &mut self.state.entries,
        target_count,
    )?;

    // 3. Store labels
    self.state.anomaly_labels = labels;

    // 4. Apply data quality variations
    if self.config.data_quality.enabled {
        let dq_injector = DataQualityInjector::new(&self.config.data_quality);
        dq_injector.apply(&mut self.state)?;
    }

    Ok(())
}
}

Outputs:

  • Modified entries with anomalies
  • Anomaly labels for ML

Stage 7: Export

Purpose: Write all outputs

#![allow(unused)]
fn main() {
pub fn export(&self) -> Result<()> {
    // 1. Master data
    self.export_master_data()?;

    // 2. Transactions
    self.export_transactions()?;

    // 3. Documents
    self.export_documents()?;

    // 4. Subledgers
    self.export_subledgers()?;

    // 5. Trial balances
    self.export_trial_balances()?;

    // 6. Labels
    self.export_labels()?;

    // 7. Controls
    self.export_controls()?;

    // 8. Graphs (if enabled)
    if self.config.graph_export.enabled {
        self.export_graphs()?;
    }

    Ok(())
}
}

Outputs:

  • CSV/JSON files
  • Graph exports
  • Label files

Stage 8: Banking & Process Mining

Purpose: Generate banking/KYC/AML data and OCEL 2.0 event logs

If banking or OCEL generation is enabled in the config, this stage produces banking transactions with KYC profiles and/or OCEL 2.0 event logs for process mining.

Outputs:

  • Banking customers, accounts, transactions
  • KYC profiles and AML typology labels
  • OCEL 2.0 event logs, objects, process variants

Stage 9: Audit Generation

Purpose: Generate ISA-compliant audit data

If audit generation is enabled, generates engagement records, workpapers, evidence, risks, findings, and professional judgments.

Outputs:

  • Audit engagements, workpapers, evidence
  • Risk assessments and findings
  • Professional judgment documentation

Stage 10: Graph Export

Purpose: Build and export ML-ready graphs

If graph export is enabled, builds transaction, approval, and entity graphs and exports to configured formats.

Outputs:

  • PyTorch Geometric tensors (.pt)
  • Neo4j CSV + Cypher scripts
  • DGL graph structures

Stage 11: LLM Enrichment (v0.5.0)

Purpose: Enrich generated data with LLM-generated metadata

When LLM enrichment is enabled, uses the configured LlmProvider (Mock, OpenAI, Anthropic, or Custom) to generate realistic:

  • Vendor names appropriate for industry and spend category
  • Transaction descriptions and memo fields
  • Natural language explanations for injected anomalies

The Mock provider is deterministic and requires no network access, making it suitable for CI/CD pipelines.
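
A hypothetical enrichment call might look like this; the provider variants match the list above, but the enricher type and its methods are placeholders rather than the crate's actual API:

#![allow(unused)]
fn main() {
// Hypothetical wiring (LlmEnricher and vendor_name are placeholder names).
let provider = LlmProvider::Mock;  // deterministic and offline, safe for CI
let enricher = LlmEnricher::new(provider);
let vendor_name = enricher.vendor_name("Manufacturing", "MRO spend")?;
}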

Outputs:

  • Enriched vendor master data
  • Enriched journal entry descriptions
  • Anomaly explanation text

Stage 12: Diffusion Enhancement (v0.5.0)

Purpose: Optionally blend diffusion model outputs with rule-based data

When diffusion is enabled, uses a StatisticalDiffusionBackend to generate samples through a learned denoising process. The HybridGenerator blends diffusion outputs with rule-based data using one of three strategies:

  • Interpolate: Weighted average of rule-based and diffusion values
  • Select: Per-record random selection from either source
  • Ensemble: Column-level blending (diffusion for amounts, rule-based for categoricals)
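
A sketch of how the hybrid step could be driven; the backend and generator types are named above, but the constructor and method signatures here are assumptions:

#![allow(unused)]
fn main() {
// Hypothetical hybrid blending (signatures assumed, not the crate's API).
let backend = StatisticalDiffusionBackend::fit(&rule_based_amounts)?;
let hybrid = HybridGenerator::new(backend, BlendStrategy::Interpolate { weight: 0.5 });
let blended_amounts = hybrid.blend(&rule_based_amounts)?;
}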

Outputs:

  • Blended transaction amounts and attributes
  • Diffusion fit report (mean/std errors, correlation preservation)

Stage 13: Causal Overlay (v0.5.0)

Purpose: Apply causal structure to generated data

When causal generation is enabled, constructs a StructuralCausalModel from the configured causal graph (or a built-in template like fraud_detection or revenue_cycle) and generates data that respects causal relationships. Supports:

  • Observational generation: Data following the causal structure
  • Interventional generation: Data under do-calculus interventions (“what-if” scenarios)
  • Counterfactual generation: Counterfactual versions of existing records via abduction-action-prediction

The causal validator verifies that generated data preserves the specified causal structure.
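
As a sketch, the three modes could be exercised like this (apart from StructuralCausalModel and the template name, the method and type names are assumptions):

#![allow(unused)]
fn main() {
// Hypothetical SCM usage (method names are assumptions).
let scm = StructuralCausalModel::from_template("fraud_detection")?;
let observed = scm.generate(1_000)?;                      // observational
let what_if = scm.generate_under(Intervention::set("is_fraud", true), 1_000)?;
let pairs = scm.counterfactual(&observed, "is_fraud", false)?;  // abduction-action-prediction
}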

Outputs:

  • Causally-structured records
  • Intervention results with effect estimates
  • Counterfactual pairs (factual + counterfactual)
  • Causal validation report

Stage 14: Source-to-Contract (v0.6.0)

Purpose: Generate the full S2C procurement pipeline

When source-to-pay is enabled, generates the complete sourcing lifecycle from spend analysis through supplier scorecards. The generation DAG follows:

Spend Analysis → Sourcing Project → Supplier Qualification → RFx Event → Bids →
Bid Evaluation → Procurement Contract → Catalog Items → [feeds into P2P] → Supplier Scorecard

Outputs:

  • Spend analysis records and category hierarchies
  • Sourcing projects with supplier qualification data
  • RFx events (RFI/RFP/RFQ), bids, and bid evaluations
  • Procurement contracts and catalog items
  • Supplier scorecards with performance metrics

Stage 15: Financial Reporting (v0.6.0)

Purpose: Generate bank reconciliations and financial statements

When financial reporting is enabled, produces bank reconciliations with auto-matching and full financial statement sets derived from the adjusted trial balance.

Bank reconciliations match payments to bank statement lines with configurable auto-match, manual match, and exception rates. Financial statements include:

  • Balance Sheet: Assets = Liabilities + Equity
  • Income Statement: Revenue - COGS - OpEx - Tax = Net Income
  • Cash Flow Statement: Indirect method with operating, investing, and financing categories
  • Statement of Changes in Equity: Retained earnings, dividends, comprehensive income

Also generates management KPIs (financial ratios) and budget variance analysis when configured.

Outputs:

  • Bank reconciliations with statement lines and reconciling items
  • Financial statements (balance sheet, income statement, cash flow, changes in equity)
  • Management KPIs and financial ratios
  • Budget vs. actual variance reports

Stage 16: HR Data (v0.6.0)

Purpose: Generate Hire-to-Retire (H2R) process data

When HR generation is enabled, produces payroll runs, time entries, and expense reports linked to the employee master data generated in Stage 2.

Outputs:

  • Payroll runs with employee pay line items (gross, deductions, net, employer cost)
  • Time entries with regular hours, overtime, PTO, and sick leave
  • Expense reports with categorized line items and approval workflows

Stage 17: Accounting Standards (v0.6.0)

Purpose: Generate ASC 606/IFRS 15 revenue recognition and impairment testing data

When accounting standards generation is enabled, produces customer contracts with performance obligations for revenue recognition and asset impairment test records.

Outputs:

  • Customer contracts with performance obligations (ASC 606/IFRS 15)
  • Revenue recognition schedules
  • Asset impairment tests with recoverable amount calculations

Stage 18: Manufacturing (v0.6.0)

Purpose: Generate manufacturing process data

When manufacturing is enabled, produces production orders, quality inspections, and cycle counts linked to materials from the master data.

Outputs:

  • Production orders with BOM components and routing steps
  • Quality inspections with pass/fail/conditional results
  • Inventory cycle counts with variance analysis

Stage 19: Sales Quotes, KPIs, and Budgets (v0.6.0)

Purpose: Generate sales pipeline and financial planning data

When enabled, produces the quote-to-order pipeline, management KPI computations, and budget variance analysis.
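
As a concrete illustration of one such KPI, the current ratio divides current assets by current liabilities (the balances below are made up, not generated output):

#![allow(unused)]
fn main() {
use rust_decimal_macros::dec;

// Illustrative liquidity KPI with made-up balances.
let current_assets = dec!(1250000.00);
let current_liabilities = dec!(500000.00);
let current_ratio = current_assets / current_liabilities;  // 2.5
}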

Outputs:

  • Sales quotes with line items, conversion tracking, and win/loss rates
  • Management KPIs (liquidity, profitability, efficiency, leverage ratios)
  • Budget records with actual vs. planned variance analysis

Parallel Execution

Stages that support parallelism:

#![allow(unused)]
fn main() {
// Parallel transaction generation
let entries: Vec<JournalEntry> = (0..num_threads)
    .into_par_iter()
    .flat_map(|thread_id| {
        let mut gen = JournalEntryGenerator::new(
            &config,
            seed + thread_id as u64,
        );
        gen.generate_batch(batch_size)
    })
    .collect();
}

Progress Tracking

#![allow(unused)]
fn main() {
pub fn run_with_progress<F>(&mut self, callback: F) -> Result<()>
where
    F: Fn(Progress),
{
    let tracker = ProgressTracker::new(self.config.total_items());

    for stage in self.stages() {
        tracker.set_phase(&stage.name);
        stage.run()?;
        tracker.advance(stage.items);
        callback(tracker.progress());
    }

    Ok(())
}
}

See Also

Resource Management

How SyntheticData manages system resources during generation.

Overview

Large-scale data generation can stress system resources. SyntheticData provides:

  • Memory Guard: Cross-platform memory tracking with soft/hard limits
  • Disk Space Guard: Disk capacity monitoring and pre-write checks
  • CPU Monitor: CPU load tracking with auto-throttling
  • Resource Guard: Unified orchestration of all resource guards
  • Graceful Degradation: Progressive feature reduction under resource pressure
  • Streaming Output: Reduce memory pressure

Memory Guard

The MemoryGuard component tracks process memory usage:

#![allow(unused)]
fn main() {
pub struct MemoryGuard {
    config: MemoryGuardConfig,
    last_check: Instant,
    last_usage: u64,
}

pub struct MemoryGuardConfig {
    pub soft_limit: u64,           // Pause/slow threshold
    pub hard_limit: u64,           // Stop threshold
    pub check_interval_ms: u64,    // How often to check
    pub growth_rate_threshold: f64, // Bytes/sec warning
}

pub struct MemoryStatus {
    pub current_usage: u64,
    pub exceeds_soft_limit: bool,
    pub exceeds_hard_limit: bool,
    pub growth_rate: f64,
}
}

Platform Support

| Platform | Method |
|---|---|
| Linux | /proc/self/statm |
| macOS | ps command |
| Windows | Stubbed (returns 0) |

Linux Implementation

#![allow(unused)]
fn main() {
use std::fs;

#[cfg(target_os = "linux")]
fn get_memory_usage() -> Option<u64> {
    // RSS is the second field of /proc/self/statm, measured in pages.
    let statm = fs::read_to_string("/proc/self/statm").ok()?;
    let rss_pages: u64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    let page_size = unsafe { libc::sysconf(libc::_SC_PAGESIZE) } as u64;
    Some(rss_pages * page_size)
}
}

macOS Implementation

#![allow(unused)]
fn main() {
use std::process::Command;

#[cfg(target_os = "macos")]
fn get_memory_usage() -> Option<u64> {
    // `ps` reports RSS in kilobytes; convert to bytes.
    let output = Command::new("ps")
        .args(["-o", "rss=", "-p", &std::process::id().to_string()])
        .output()
        .ok()?;
    let rss_kb: u64 = String::from_utf8_lossy(&output.stdout)
        .trim()
        .parse()
        .ok()?;
    Some(rss_kb * 1024)
}
}

Configuration

global:
  memory_limit: 2147483648    # 2 GB hard limit

Or programmatically:

#![allow(unused)]
fn main() {
let config = MemoryGuardConfig {
    soft_limit: 1024 * 1024 * 1024,      // 1 GB
    hard_limit: 2 * 1024 * 1024 * 1024,  // 2 GB
    check_interval_ms: 1000,              // Check every second
    growth_rate_threshold: 100_000_000.0, // 100 MB/sec
};
}

Usage in Generation

#![allow(unused)]
fn main() {
pub fn generate_with_memory_guard(&mut self) -> Result<()> {
    let guard = MemoryGuard::new(self.memory_config);

    loop {
        // Check memory
        let status = guard.check();

        if status.exceeds_hard_limit {
            // Stop generation
            return Err(Error::MemoryExceeded);
        }

        if status.exceeds_soft_limit {
            // Flush output and trigger GC
            self.sink.flush()?;
            self.state.clear_caches();
            continue;
        }

        if status.growth_rate > guard.config.growth_rate_threshold {
            // Slow down
            thread::sleep(Duration::from_millis(100));
        }

        // Generate batch
        let batch = self.generator.generate_batch(BATCH_SIZE)?;
        self.process_batch(batch)?;

        if self.is_complete() {
            break;
        }
    }

    Ok(())
}
}

Memory Estimation

Estimate memory requirements before generation:

#![allow(unused)]
fn main() {
pub fn estimate_memory(config: &Config) -> MemoryEstimate {
    let entry_size = 512;  // Approximate bytes per entry
    let master_data_size = config.estimate_master_data_size();

    let peak = master_data_size
        + (config.transactions.target_count as u64 * entry_size);

    let streaming_peak = master_data_size
        + (BATCH_SIZE as u64 * entry_size);

    MemoryEstimate {
        batch_peak: peak,
        streaming_peak,
        recommended_limit: peak * 2,
    }
}
}

Memory-Efficient Patterns

Streaming Output

Write as you generate instead of accumulating:

#![allow(unused)]
fn main() {
// Memory-efficient
for entry in generator.generate_stream() {
    sink.write(&entry?)?;
}

// Memory-intensive (avoid for large volumes)
let all_entries = generator.generate_batch(1_000_000)?;
sink.write_batch(&all_entries)?;
}

Batch Processing with Flush

#![allow(unused)]
fn main() {
const BATCH_SIZE: usize = 10_000;

let mut buffer = Vec::with_capacity(BATCH_SIZE);

for entry in generator.generate_stream() {
    buffer.push(entry?);

    if buffer.len() >= BATCH_SIZE {
        sink.write_batch(&buffer)?;
        buffer.clear();
    }
}

// Final flush
if !buffer.is_empty() {
    sink.write_batch(&buffer)?;
}
}

Lazy Loading

Load master data on demand:

#![allow(unused)]
fn main() {
pub struct LazyRegistry {
    vendors: OnceCell<Vec<Vendor>>,
    vendor_loader: Box<dyn Fn() -> Vec<Vendor>>,
}

impl LazyRegistry {
    pub fn vendors(&self) -> &[Vendor] {
        self.vendors.get_or_init(|| (self.vendor_loader)())
    }
}
}

Memory Limits by Component

Estimated memory usage:

| Component | Size (per item) | For 1M entries |
|---|---|---|
| JournalEntry | ~512 bytes | ~500 MB |
| Document | ~1 KB | ~1 GB |
| Graph Node | ~128 bytes | ~128 MB |
| Graph Edge | ~64 bytes | ~64 MB |

Monitoring

Progress with Memory

#![allow(unused)]
fn main() {
orchestrator.run_with_progress(|progress| {
    let memory_mb = guard.check().current_usage / 1_000_000;
    println!(
        "[{:.1}%] {} entries | {} MB",
        progress.percent,
        progress.current,
        memory_mb
    );
});
}

Server Endpoint

curl http://localhost:3000/health
{
  "status": "healthy",
  "memory_usage_mb": 512,
  "memory_limit_mb": 2048,
  "memory_percent": 25.0
}

Troubleshooting

Out of Memory

Symptoms: Process killed, “out of memory” error

Solutions:

  1. Reduce target_count
  2. Enable streaming output
  3. Increase system memory
  4. Set appropriate memory_limit

Slow Generation

Symptoms: Generation slows over time

Cause: Memory pressure triggering slowdown

Solutions:

  1. Increase soft limit
  2. Reduce batch size
  3. Enable more aggressive flushing

Memory Not Freed

Symptoms: Memory stays high after generation

Cause: Data retained in caches

Solution: Explicitly clear state:

#![allow(unused)]
fn main() {
orchestrator.clear_caches();
}

Disk Space Guard

Monitors disk space and prevents disk exhaustion:

#![allow(unused)]
fn main() {
pub struct DiskSpaceGuardConfig {
    pub hard_limit_mb: usize,       // Minimum free space required
    pub soft_limit_mb: usize,       // Warning threshold
    pub check_interval: usize,      // Check every N operations
    pub reserve_buffer_mb: usize,   // Buffer to maintain
}
}

Platform Support

| Platform | Method |
|---|---|
| Linux/macOS | statvfs syscall |
| Windows | GetDiskFreeSpaceExW |

Usage

#![allow(unused)]
fn main() {
let guard = DiskSpaceGuard::with_min_free(100);  // 100 MB minimum

// Periodic check
guard.check()?;

// Pre-write check with size estimation
guard.check_before_write(estimated_bytes)?;

// Size estimation for planning
let size = estimate_output_size_mb(100_000, &[OutputFormat::Csv], false);
}

CPU Monitor

Tracks CPU load with optional auto-throttling:

#![allow(unused)]
fn main() {
pub struct CpuMonitorConfig {
    pub enabled: bool,
    pub high_load_threshold: f64,      // 0.85 default
    pub critical_load_threshold: f64,  // 0.95 default
    pub sample_interval_ms: u64,
    pub auto_throttle: bool,
    pub throttle_delay_ms: u64,
}
}

Platform Support

| Platform | Method |
|---|---|
| Linux | /proc/stat parsing |
| macOS | top -l 1 command |

Usage

#![allow(unused)]
fn main() {
let config = CpuMonitorConfig::with_thresholds(0.85, 0.95)
    .with_auto_throttle(50);

let monitor = CpuMonitor::new(config);

// In generation loop
if let Some(load) = monitor.sample() {
    if load > 0.85 {
        // Consider slowing down
    }
    monitor.maybe_throttle();  // Applies delay if critical
}
}

Unified Resource Guard

Combines all guards into single interface:

#![allow(unused)]
fn main() {
let guard = ResourceGuard::new(ResourceGuardConfig::default())
    .with_memory_limit(2 * 1024 * 1024 * 1024)
    .with_output_path("./output")
    .with_cpu_monitoring();

// Check all resources at once
guard.check_all()?;

let stats = guard.stats();
println!("Memory: {}%", stats.memory_usage_percent);
println!("Disk: {} MB free", stats.disk_available_mb);
println!("CPU: {}%", stats.cpu_load * 100.0);
}

Graceful Degradation

Progressive feature reduction under resource pressure:

#![allow(unused)]
fn main() {
pub enum DegradationLevel {
    Normal,    // All features enabled
    Reduced,   // 50% batch, skip data quality, 50% anomaly rate
    Minimal,   // 25% batch, essential only, no injections
    Emergency, // Flush and terminate
}
}

Thresholds

| Level | Memory | Disk | Batch Size | Actions |
|---|---|---|---|---|
| Normal | <70% | >1GB | 100% | Full operation |
| Reduced | 70-85% | 500MB-1GB | 50% | Skip data quality |
| Minimal | 85-95% | 100-500MB | 25% | Essential data only |
| Emergency | >95% | <100MB | 0% | Graceful shutdown |

Usage

#![allow(unused)]
fn main() {
let controller = DegradationController::new(DegradationConfig::default());

// Update based on current resource status
let status = ResourceStatus::new(
    Some(memory_usage),
    Some(disk_available_mb),
    Some(cpu_load),
);

let (level, changed) = controller.update(&status);

if changed {
    let actions = DegradationActions::for_level(level);

    if actions.skip_data_quality {
        // Disable data quality injection
    }
    if actions.terminate {
        // Flush and exit
    }
}
}

Configuration

global:
  resource_budget:
    memory:
      hard_limit_mb: 2048
    disk:
      min_free_mb: 500
      reserve_buffer_mb: 100
    cpu:
      enabled: true
      high_load_threshold: 0.85
      auto_throttle: true
    degradation:
      enabled: true
      reduced_threshold: 0.70
      minimal_threshold: 0.85

See Also

Enterprise Process Chains

SyntheticData models enterprise operations as interconnected process chains — end-to-end business flows that share master data, generate journal entries, and link through common documents. This page maps the current implementation status and shows how the chains integrate.

Coverage Matrix

| Chain | Full Name | Coverage | Status | Key Modules |
|---|---|---|---|---|
| S2P | Source-to-Pay | 95% | Implemented (P2P + S2C + OCPM) | document_flow/p2p_generator, sourcing/, ocpm/s2c_generator |
| O2C | Order-to-Cash | 95% | Implemented (+ OCPM) | document_flow/o2c_generator, master_data/customer, subledger/ar |
| R2R | Record-to-Report | 85% | Implemented (+ Bank Recon OCPM) | je_generator, balance/, period_close/, ocpm/bank_recon_generator |
| A2R | Acquire-to-Retire | 70% | Partially implemented | master_data/asset, subledger/fa, period_close/depreciation |
| INV | Inventory Management | 55% | Partially implemented | subledger/inventory, document_flow/ (GR/delivery links) |
| BANK | Banking & Treasury | 85% | Implemented (+ OCPM) | datasynth-banking, ocpm/bank_generator |
| H2R | Hire-to-Retire | 85% | Implemented (+ OCPM) | hr/, master_data/employee, ocpm/h2r_generator |
| MFG | Plan-to-Produce | 85% | Implemented (+ OCPM) | manufacturing/, ocpm/mfg_generator |
| AUDIT | Audit Lifecycle | 90% | Implemented (+ OCPM) | audit/, ocpm/audit_generator |

Implemented Chains

Source-to-Pay (S2P)

The S2P chain covers procurement from purchase requisition through payment:

                    Source-to-Contract (S2C) — Implemented
                    ┌──────────────────────────────────────────────┐
                    │ Spend Analysis → RFx → Bid Eval → Contract  │
                    └──────────────────────────┬───────────────────┘
                                               │
    ┌──────────────────────────────────────────┼──────────────────────────┐
    │              Procure-to-Pay (P2P) — Implemented                    │
    │                                          │                         │
    │  Purchase    Purchase    Goods     Vendor    Three-Way              │
    │  Requisition → Order  → Receipt → Invoice → Match    → Payment    │
    │                  │                   │         │           │        │
    │                  │              ┌────┘         │           │        │
    │                  ▼              ▼              ▼           ▼        │
    │              AP Open Item ← Match Result   AP Aging    Bank        │
    └────────────────────────────────────────────────────────────────────┘
                                               │
                    ┌──────────────────────────┘
                    ▼
    Vendor Network (quality scores, clusters, supply chain tiers)

P2P implementation details:

| Component | Types/Variants | Key Config |
|---|---|---|
| Purchase Orders | 6 types: Standard, Service, Framework, Consignment, StockTransfer, Subcontracting | flow_rate, completion_rate |
| Goods Receipts | 7 movement types: GrForPo, ReturnToVendor, GrForProduction, TransferPosting, InitialEntry, Scrapping, Consumption | gr_rate |
| Vendor Invoices | Three-way match with tolerance | price_tolerance, quantity_tolerance |
| Payments | Configurable terms and scheduling | payment_rate, timing ranges |
| Three-Way Match | PO ↔ GR ↔ Invoice validation with 6 variance types | allow_over_delivery, max_over_delivery_pct |

Order-to-Cash (O2C)

The O2C chain covers the revenue cycle from sales order through cash collection:

    ┌─────────────────────────────────────────────────────────────────────┐
    │                    Order-to-Cash (O2C)                              │
    │                                                                     │
    │  Sales    Credit   Delivery   Customer   Customer                   │
    │  Order  → Check  → (Pick/  → Invoice  → Receipt                    │
    │    │               Pack/        │          │                        │
    │    │               Ship)        │          │                        │
    │    │                │           ▼          ▼                        │
    │    │                │      AR Open Item  AR Aging                   │
    │    │                │           │                                   │
    │    │                │           └→ Dunning Notices                  │
    │    │                ▼                                               │
    │    │          Inventory Issue                                       │
    │    │          (COGS posting)                                        │
    └────┼────────────────────────────────────────────────────────────────┘
         │
    Revenue Recognition (ASC 606 / IFRS 15)
    Customer Contracts → Performance Obligations

O2C implementation details:

| Component | Types/Variants | Key Config |
|---|---|---|
| Sales Orders | 9 types: Standard, Rush, CashSale, Return, FreeOfCharge, Consignment, Service, CreditMemoRequest, DebitMemoRequest | flow_rate, credit_check |
| Deliveries | 6 types: Outbound, Return, StockTransfer, Replenishment, ConsignmentIssue, ConsignmentReturn | delivery_rate |
| Customer Invoices | 7 types: Standard, CreditMemo, DebitMemo, ProForma, DownPaymentRequest, FinalInvoice, Intercompany | invoice_rate |
| Customer Receipts | Full, partial, on-account, corrections, NSF | collection_rate |

Record-to-Report (R2R)

The R2R chain covers financial close and reporting:

    Journal Entries (from all chains)
         │
         ▼
    Balance Tracker → Trial Balance → Adjustments → Close
         │                                │            │
         ├→ Intercompany Matching         ├→ Accruals   ├→ Year-End Close
         │     └→ IC Elimination          ├→ Reclasses  └→ Retained Earnings
         │                                └→ FX Reval
         ▼
    Consolidation
         ├→ Currency Translation
         ├→ CTA Adjustments
         └→ Consolidated Trial Balance

R2R coverage:

  • Journal entry generation from all process chains
  • Opening balance, running balance tracking, trial balance per period
  • Intercompany matching and elimination entries
  • Period close engine: accruals, depreciation, year-end closing
  • Audit simulation (ISA-compliant workpapers, findings, opinions)

Gaps: Financial statement generation (balance sheet, income statement, cash flow), budget vs actual reporting.


Banking & Treasury (BANK) — 85%

Implemented: Bank customer profiles, KYC/AML, bank accounts, transactions with fraud typologies (structuring, funnel, layering, mule, round-tripping). OCPM events for customer onboarding, KYC review, account management, and transaction lifecycle.

Gaps: Cash forecasting, liquidity management.

Hire-to-Retire (H2R) — 85%

Implemented: Employee master data, payroll runs with tax/deduction calculations, time entries with overtime, expense reports with policy violations. OCPM events for payroll lifecycle, time entry approval, and expense approval chains.

Gaps: Benefits administration, workforce planning.

Plan-to-Produce (MFG) — 85%

Implemented: Production orders with BOM explosion, routing operations, WIP costing, quality inspections, cycle counting. OCPM events for production order lifecycle, quality inspection, and cycle count reconciliation.

Gaps: Material requirements planning (MRP), advanced shop floor control.

Audit Lifecycle (AUDIT) — 90%

Implemented: Engagement planning, risk assessment (ISA 315/330), workpaper creation and review (ISA 230), evidence collection (ISA 500), findings (ISA 265), professional judgment documentation (ISA 200). OCPM events for the full engagement lifecycle.

Gaps: Multi-engagement portfolio management.

Partially Implemented Chains

Acquire-to-Retire (A2R) — 70%

Implemented: Fixed asset master data, depreciation (6 methods), acquisition from PO, disposal with gain/loss accounting, impairment testing (ASC 360/IAS 36).

Gaps: Capital project/WBS integration, asset transfers between companies, construction-in-progress (CIP) tracking.

Inventory Management (INV) — 55%

Implemented: Inventory positions, 22 movement types, 4 valuation methods, stock status tracking, P2P goods receipts, O2C goods issues.

Gaps: Quality inspection integration, obsolescence management, ABC analysis.


Cross-Process Integration

Process chains share data through several integration points, now with full OCPM event coverage:

    S2C ──→ S2P                    O2C                    R2R
    │        │                      │                      │
    Contract GR ──── Inventory ─────┼── Delivery           │
             │         │            │                      │
       Payment ────────┼────────────┼── Receipt ──── Bank Recon
             │         │            │                  │   │
       AP Open Item    │       AR Open Item         BANK  │
             │     MFG─┘            │                 │   │
             └──H2R──┴──────────────┴──── Journal Entries ┘
                  │                                   │
              AUDIT ─────────────────────────── Trial Balance
                                                      │
                                               Consolidation

    ──── All chains feed OCEL 2.0 Event Log (88 activities) ────

Integration Map

| Integration Point | From Chain | To Chain | Mechanism |
|---|---|---|---|
| Inventory bridge | S2P (Goods Receipt) | O2C (Delivery) | GR increases stock, delivery decreases |
| Payment clearing | S2P / O2C | BANK | Payment status → bank reconciliation |
| Journal entries | All chains | R2R | Every document posts GL entries |
| Asset acquisition | S2P (Capital PO) | A2R | PO → GR → Fixed Asset Record |
| Revenue recognition | O2C (Invoice) | R2R | Contract → Revenue JE |
| Depreciation | A2R | R2R | Monthly depreciation → Trial Balance |
| Intercompany | S2P / O2C | R2R | IC invoices → IC matching → elimination |

Document Reference Types

Documents maintain referential integrity across chains through 9 reference types:

| Reference Type | Description | Example |
|---|---|---|
| FollowOn | Normal flow succession | PO → GR |
| Payment | Payment for invoice | PAY → VI |
| Reversal | Correction/reversal | Credit Memo → Invoice |
| Partial | Partial fulfillment | Partial GR → PO |
| CreditMemo | Credit against invoice | CM → Invoice |
| DebitMemo | Debit against invoice | DM → Invoice |
| Return | Return against delivery | Return → Delivery |
| IntercompanyMatch | IC matching pair | IC-INV → IC-INV |
| Manual | User-defined reference | Any → Any |

Roadmap

The process chain expansion follows a wave-based plan:

| Wave | Focus | Chains Affected |
|---|---|---|
| Wave 1 | S2C completion, bank reconciliation, financial statements | S2P, BANK, R2R |
| Wave 2 | Payroll/time, revenue recognition generator, impairment generator | H2R, O2C, A2R |
| Wave 3 | Production orders/WIP, cycle counting/QA, expense management | MFG, INV, H2R |
| Wave 4 | Sales quotes, cash forecasting, KPIs/budgets, obsolescence | O2C, BANK, R2R, INV |

For detailed coverage targets and implementation plans, see the links under See Also.

See Also

Design Decisions

Key architectural choices and their rationale.

1. Deterministic RNG

Decision: Use seeded ChaCha8 RNG for all randomness.

Rationale:

  • Reproducible output for testing and debugging
  • Consistent results across runs
  • Parallel generation with per-thread seeds

Implementation:

#![allow(unused)]
fn main() {
use rand_chacha::ChaCha8Rng;
use rand::SeedableRng;

let mut rng = ChaCha8Rng::seed_from_u64(config.global.seed);
}

Trade-off: Slightly slower than system RNG, but reproducibility is essential for financial data testing.


2. Precise Decimal Arithmetic

Decision: Use rust_decimal::Decimal for all monetary values.

Rationale:

  • IEEE 754 floating-point causes rounding errors
  • Financial systems require exact decimal representation
  • Debits must exactly equal credits

Implementation:

#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;

let amount = dec!(1234.56);
let tax = amount * dec!(0.077);  // Exact
}

Serialization: Decimals serialized as strings to preserve precision:

{"amount": "1234.56"}

3. Balanced Entry Enforcement

Decision: JournalEntry enforces debits = credits at construction.

Rationale:

  • Invalid accounting entries should be impossible
  • Catches bugs early in generation
  • Guarantees trial balance coherence

Implementation:

#![allow(unused)]
fn main() {
impl JournalEntry {
    pub fn new(header: JournalEntryHeader, lines: Vec<JournalEntryLine>) -> Result<Self> {
        let entry = Self { header, lines };
        if !entry.is_balanced() {
            return Err(Error::UnbalancedEntry);
        }
        Ok(entry)
    }
}
}

4. Collision-Free UUIDs

Decision: Use FNV-1a hash-based UUID generation with generator-type discriminators.

Rationale:

  • Document IDs must be unique across all generators
  • Deterministic generation requires deterministic IDs
  • Different generator types might otherwise produce the same ID sequence

Implementation:

#![allow(unused)]
fn main() {
pub struct DeterministicUuidFactory {
    counter: AtomicU64,
    seed: u64,
}

pub enum GeneratorType {
    JournalEntry = 0x01,
    DocumentFlow = 0x02,
    Vendor = 0x03,
    // ...
}

impl DeterministicUuidFactory {
    pub fn generate(&self, gen_type: GeneratorType) -> Uuid {
        let counter = self.counter.fetch_add(1, Ordering::SeqCst);
        let hash_input = (self.seed, gen_type as u8, counter);
        Uuid::from_bytes(fnv1a_hash(&hash_input))
    }
}
}

5. Empirical Distributions

Decision: Base statistical distributions on academic research.

Rationale:

  • Synthetic data should match real-world patterns
  • Benford’s Law is expected in authentic financial data
  • Line item distributions affect detection algorithms

Sources:

  • Line item counts: GL research showing 60.68% two-line, 88% even counts
  • Amounts: Log-normal with round-number bias
  • Temporal: Month/quarter/year-end spikes

Implementation:

#![allow(unused)]
fn main() {
pub struct LineItemSampler {
    distribution: EmpiricalDistribution,
}

impl LineItemSampler {
    pub fn new() -> Self {
        Self {
            distribution: EmpiricalDistribution::from_data(&[
                (2, 0.6068),
                (3, 0.0524),
                (4, 0.1732),
                // ...
            ]),
        }
    }
}
}
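
For context, the Benford first-digit weights that the amount samplers target follow P(d) = log10(1 + 1/d); a quick stand-alone check:

#![allow(unused)]
fn main() {
// Benford first-digit probabilities: P(d) = log10(1 + 1/d) for d = 1..=9.
let benford: Vec<f64> = (1..=9)
    .map(|d| (1.0 + 1.0 / d as f64).log10())
    .collect();

// The nine probabilities sum to 1; P(1) ≈ 0.301 while P(9) ≈ 0.046.
assert!((benford.iter().sum::<f64>() - 1.0).abs() < 1e-12);
}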

6. Document Chain Integrity

Decision: Maintain proper reference chains with explicit links.

Rationale:

  • Real document flows have traceable references
  • Process mining requires complete chains
  • Audit trails need document relationships

Implementation:

#![allow(unused)]
fn main() {
pub struct DocumentReference {
    pub from_type: DocumentType,
    pub from_id: String,
    pub to_type: DocumentType,
    pub to_id: String,
    pub reference_type: ReferenceType,
}

// Payment explicitly references invoices
let payment_ref = DocumentReference {
    from_type: DocumentType::Payment,
    from_id: payment.id.clone(),
    to_type: DocumentType::Invoice,
    to_id: invoice.id.clone(),
    reference_type: ReferenceType::PaymentFor,
};
}

7. Three-Way Match Validation

Decision: Implement actual PO/GR/Invoice matching with tolerances.

Rationale:

  • Real P2P processes include match validation
  • Variances are common and should be generated
  • Match status affects downstream processing

Implementation:

#![allow(unused)]
fn main() {
pub fn validate_match(po: &PurchaseOrder, gr: &GoodsReceipt, inv: &Invoice,
                      config: &MatchConfig) -> MatchResult {
    let qty_variance = (gr.quantity - po.quantity).abs() / po.quantity;
    let price_variance = (inv.unit_price - po.unit_price).abs() / po.unit_price;

    if qty_variance > config.quantity_tolerance {
        return MatchResult::QuantityVariance(qty_variance);
    }
    if price_variance > config.price_tolerance {
        return MatchResult::PriceVariance(price_variance);
    }
    MatchResult::Matched
}
}

8. Memory Guard Architecture

Decision: Cross-platform memory tracking with soft/hard limits.

Rationale:

  • Large generations can exhaust memory
  • OOM kills are unrecoverable
  • Graceful degradation preferred

Implementation:

#![allow(unused)]
fn main() {
pub fn check(&self) -> MemoryStatus {
    let current = self.get_memory_usage();
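    // `elapsed_ms` is the time since the previous check, tracked by the guard (not shown here).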
    let growth_rate = (current - self.last_usage) as f64 / elapsed_ms;

    MemoryStatus {
        current_usage: current,
        exceeds_soft_limit: current > self.config.soft_limit,
        exceeds_hard_limit: current > self.config.hard_limit,
        growth_rate,
    }
}
}

9. Layered Crate Architecture

Decision: Strict layering with no circular dependencies.

Rationale:

  • Clear separation of concerns
  • Independent crate compilation
  • Easier testing and maintenance

Layers:

  1. Foundation: datasynth-core (no internal dependencies)
  2. Services: datasynth-config, datasynth-output
  3. Processing: datasynth-generators, datasynth-graph
  4. Orchestration: datasynth-runtime
  5. Application: datasynth-cli, datasynth-server, datasynth-ui

10. Configuration-Driven Behavior

Decision: All behavior controlled by external configuration.

Rationale:

  • Flexibility without code changes
  • Reproducible scenarios
  • User-customizable presets

Scope: Configuration controls:

  • Industry and complexity
  • Transaction volumes and patterns
  • Anomaly types and rates
  • Output formats
  • All feature toggles

11. Trait-Based Extensibility

Decision: Define traits in core, implement in higher layers.

Rationale:

  • Dependency inversion
  • Pluggable implementations
  • Easy testing with mocks

Example:

#![allow(unused)]
fn main() {
// Defined in datasynth-core
pub trait Generator<T> {
    fn generate_batch(&mut self, count: usize) -> Result<Vec<T>>;
}

// Implemented in datasynth-generators
impl Generator<JournalEntry> for JournalEntryGenerator {
    fn generate_batch(&mut self, count: usize) -> Result<Vec<JournalEntry>> {
        // Implementation
    }
}
}

12. Parallel-Safe Design

Decision: Design all generators to be thread-safe.

Rationale:

  • Generation can be parallelized
  • Modern systems have many cores
  • Linear scaling improves throughput

Implementation (see the sketch after this list):

  • Per-thread RNG seeds: seed + thread_id
  • Atomic counters for UUID factory
  • No shared mutable state during generation
  • Rayon for parallel iteration
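
A minimal sketch of the per-thread seeding scheme, using rayon and rand_chacha; the generated values here are placeholder integers rather than real journal entries:

#![allow(unused)]
fn main() {
use rand::{RngCore, SeedableRng};
use rand_chacha::ChaCha8Rng;
use rayon::prelude::*;

let seed: u64 = 42;

// Each worker derives its RNG from `seed + thread_id`, so the output is
// reproducible regardless of how the threads are scheduled.
let batches: Vec<Vec<u64>> = (0u64..4)
    .into_par_iter()
    .map(|thread_id| {
        let mut rng = ChaCha8Rng::seed_from_u64(seed + thread_id);
        (0..1_000).map(|_| rng.next_u64()).collect()
    })
    .collect();

assert_eq!(batches.len(), 4);
}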

See Also

Crate Reference

SyntheticData is organized as a Rust workspace with 15 modular crates. This section provides detailed documentation for each crate.

Workspace Structure

datasynth-cli          → Binary entry point (commands: generate, validate, init, info, fingerprint)
datasynth-server       → REST/gRPC/WebSocket server with auth, rate limiting, timeouts
datasynth-ui           → Tauri/SvelteKit desktop UI
    ↓
datasynth-runtime      → Orchestration layer (GenerationOrchestrator coordinates workflow)
    ↓
datasynth-generators   → Data generators (JE, Document Flows, Subledgers, Anomalies, Audit)
datasynth-banking      → KYC/AML banking transaction generator with fraud typologies
datasynth-ocpm         → Object-Centric Process Mining (OCEL 2.0 event logs)
datasynth-fingerprint  → Privacy-preserving fingerprint extraction and synthesis
datasynth-standards    → Accounting/audit standards (US GAAP, IFRS, ISA, SOX)
    ↓
datasynth-graph        → Graph/network export (PyTorch Geometric, Neo4j, DGL)
datasynth-eval         → Evaluation framework with auto-tuning and recommendations
    ↓
datasynth-config       → Configuration schema, validation, industry presets
    ↓
datasynth-core         → Domain models, traits, distributions, templates, resource guards
    ↓
datasynth-output       → Output sinks (CSV, JSON, Parquet, ControlExport)

datasynth-test-utils   → Testing utilities and fixtures

Crate Categories

Application Layer

CrateDescription
datasynth-cliCommand-line interface binary with generate, validate, init, info, fingerprint commands
datasynth-serverREST/gRPC/WebSocket server with authentication, rate limiting, and timeouts
datasynth-uiCross-platform desktop GUI application (Tauri + SvelteKit)

Core Processing

CrateDescription
datasynth-runtimeGeneration orchestration with resource guards and graceful degradation
datasynth-generatorsAll data generators (JE, master data, documents, subledgers, anomalies, audit)
datasynth-graphML graph export (PyTorch Geometric, Neo4j, DGL)

Domain-Specific Modules

CrateDescription
datasynth-bankingKYC/AML banking transactions with fraud typologies
datasynth-ocpmObject-Centric Process Mining (OCEL 2.0)
datasynth-fingerprintPrivacy-preserving fingerprint extraction and synthesis
datasynth-standardsAccounting/audit standards (US GAAP, IFRS, ISA, SOX, PCAOB)

Foundation

CrateDescription
datasynth-coreDomain models, distributions, traits, resource guards
datasynth-configConfiguration schema and validation
datasynth-outputOutput sinks (CSV, JSON, Parquet)

Supporting

CrateDescription
datasynth-evalQuality evaluation with auto-tuning recommendations
datasynth-test-utilsTest utilities and fixtures

Dependencies

The crates follow a strict dependency hierarchy:

  1. datasynth-core: No internal dependencies (foundation)
  2. datasynth-config: Depends on datasynth-core
  3. datasynth-output: Depends on datasynth-core
  4. datasynth-generators: Depends on datasynth-core, datasynth-config
  5. datasynth-graph: Depends on datasynth-core, datasynth-generators
  6. datasynth-eval: Depends on datasynth-core
  7. datasynth-banking: Depends on datasynth-core, datasynth-config
  8. datasynth-ocpm: Depends on datasynth-core
  9. datasynth-fingerprint: Depends on datasynth-core, datasynth-config
  10. datasynth-runtime: Depends on datasynth-core, datasynth-config, datasynth-generators, datasynth-output, datasynth-graph, datasynth-banking, datasynth-ocpm, datasynth-fingerprint, datasynth-eval
  11. datasynth-cli: Depends on datasynth-runtime, datasynth-fingerprint
  12. datasynth-server: Depends on datasynth-runtime
  13. datasynth-ui: Depends on datasynth-runtime (via Tauri)
  14. datasynth-standards: Depends on datasynth-core, datasynth-config
  15. datasynth-test-utils: Depends on datasynth-core

Building Individual Crates

# Build specific crate
cargo build -p datasynth-core
cargo build -p datasynth-generators
cargo build -p datasynth-fingerprint

# Run tests for specific crate
cargo test -p datasynth-core
cargo test -p datasynth-generators
cargo test -p datasynth-fingerprint

# Generate docs for specific crate
cargo doc -p datasynth-core --open
cargo doc -p datasynth-fingerprint --open

API Documentation

For detailed Rust API documentation, generate and view rustdoc:

cargo doc --workspace --no-deps --open

After deployment, API documentation is available at /api/ in the documentation site.

See Also

datasynth-core

Core domain models, traits, and distributions for synthetic accounting data generation.

Overview

datasynth-core provides the foundational building blocks for the SyntheticData workspace:

  • Domain Models: Journal entries, chart of accounts, master data, documents, anomalies
  • Statistical Distributions: Line item sampling, amount generation, temporal patterns
  • Core Traits: Generator and Sink interfaces for extensibility
  • Template System: File-based templates for regional/sector customization
  • Infrastructure: UUID factory, memory guard, GL account constants

Module Structure

Domain Models (models/)

ModuleDescription
journal_entry.rsJournal entry header and balanced line items
chart_of_accounts.rsHierarchical GL accounts with account types
master_data.rsEnhanced vendors, customers with payment behavior
documents.rsPurchase orders, invoices, goods receipts, payments
temporal.rsBi-temporal data model for audit trails
anomaly.rsAnomaly types and labels for ML training
internal_control.rsSOX 404 control definitions
sourcing/S2C models: SourcingProject, SupplierQualification, RfxEvent, Bid, BidEvaluation, ProcurementContract, CatalogItem, SupplierScorecard, SpendAnalysis
payroll.rsPayrollRun, PayrollLineItem with gross/deductions/net/employer cost
time_entry.rsTimeEntry with regular, overtime, PTO, and sick hours
expense_report.rsExpenseReport, ExpenseLineItem with category and approval workflow
financial_statements.rsFinancialStatement, FinancialStatementLineItem, CashFlowItem, StatementType
bank_reconciliation.rsBankReconciliation, BankStatementLine, ReconcilingItem with auto-matching

Statistical Distributions (distributions/)

DistributionDescription
LineItemSamplerEmpirical distribution (60.68% two-line, 88% even counts)
AmountSamplerLog-normal with round-number bias, Benford compliance
TemporalSamplerSeasonality patterns with industry integration
BenfordSamplerFirst-digit distribution following P(d) = log10(1 + 1/d)
FraudAmountGeneratorSuspicious amount patterns
IndustrySeasonalityIndustry-specific volume patterns
HolidayCalendarRegional holidays for US, DE, GB, CN, JP, IN

Infrastructure

ComponentDescription
uuid_factory.rsDeterministic FNV-1a hash-based UUID generation
accounts.rsCentralized GL control account numbers
templates/YAML/JSON template loading and merging

Resource Guards

ComponentDescription
memory_guard.rsCross-platform memory tracking with soft/hard limits
disk_guard.rsDisk space monitoring and pre-write capacity checks
cpu_monitor.rsCPU load tracking with auto-throttling
resource_guard.rsUnified orchestration of all resource guards
degradation.rsGraceful degradation system (Normal→Reduced→Minimal→Emergency)

AI & ML Modules (v0.5.0)

ModuleDescription
llm/provider.rsLlmProvider trait with complete() and complete_batch() methods
llm/mock_provider.rsDeterministic MockLlmProvider for testing (no network required)
llm/http_provider.rsHttpLlmProvider for OpenAI, Anthropic, and custom API endpoints
llm/nl_config.rsNlConfigGenerator — natural language to YAML configuration
llm/cache.rsLlmCache with FNV-1a hashing for prompt deduplication
diffusion/backend.rsDiffusionBackend trait with forward(), reverse(), generate() methods
diffusion/schedule.rsNoiseSchedule with linear, cosine, and sigmoid schedules
diffusion/statistical.rsStatisticalDiffusionBackend — fingerprint-guided denoising
diffusion/hybrid.rsHybridGenerator with Interpolate, Select, Ensemble blend strategies
diffusion/training.rsDiffusionTrainer and TrainedDiffusionModel with save/load
causal/graph.rsCausalGraph with variables, edges, and built-in templates
causal/scm.rsStructuralCausalModel with topological-order generation
causal/intervention.rsInterventionEngine with do-calculus and effect estimation
causal/counterfactual.rsCounterfactualGenerator with abduction-action-prediction
causal/validation.rsCausalValidator for causal structure validation

Key Types

JournalEntry

#![allow(unused)]
fn main() {
pub struct JournalEntry {
    pub header: JournalEntryHeader,
    pub lines: Vec<JournalEntryLine>,
}

pub struct JournalEntryHeader {
    pub document_id: Uuid,
    pub company_code: String,
    pub fiscal_year: u16,
    pub fiscal_period: u8,
    pub posting_date: NaiveDate,
    pub document_date: NaiveDate,
    pub source: TransactionSource,
    pub business_process: Option<BusinessProcess>,
    pub is_fraud: bool,
    pub fraud_type: Option<FraudType>,
    pub is_anomaly: bool,
    pub anomaly_type: Option<AnomalyType>,
    // ... additional fields
}
}

AccountType Hierarchy

#![allow(unused)]
fn main() {
pub enum AccountType {
    Asset,
    Liability,
    Equity,
    Revenue,
    Expense,
}

pub enum AccountSubType {
    // Assets
    Cash,
    AccountsReceivable,
    Inventory,
    FixedAsset,
    // Liabilities
    AccountsPayable,
    AccruedLiabilities,
    LongTermDebt,
    // Equity
    CommonStock,
    RetainedEarnings,
    // Revenue
    SalesRevenue,
    ServiceRevenue,
    // Expense
    CostOfGoodsSold,
    OperatingExpense,
    // ...
}
}

Anomaly Types

#![allow(unused)]
fn main() {
pub enum AnomalyType {
    Fraud,
    Error,
    ProcessIssue,
    Statistical,
    Relational,
}

pub struct LabeledAnomaly {
    pub document_id: Uuid,
    pub anomaly_id: String,
    pub anomaly_type: AnomalyType,
    pub category: AnomalyCategory,
    pub severity: Severity,
    pub description: String,
}
}

Usage Examples

Creating a Balanced Journal Entry

#![allow(unused)]
fn main() {
use synth_core::models::{JournalEntry, JournalEntryLine, JournalEntryHeader};
use rust_decimal_macros::dec;

let header = JournalEntryHeader::new(/* ... */);
let mut entry = JournalEntry::new(header);

// Add balanced lines
entry.add_line(JournalEntryLine::debit("1100", dec!(1000.00), "AR Invoice"));
entry.add_line(JournalEntryLine::credit("4000", dec!(1000.00), "Revenue"));

// Entry enforces debits = credits
assert!(entry.is_balanced());
}

Sampling Amounts

#![allow(unused)]
fn main() {
use synth_core::distributions::AmountSampler;

let sampler = AmountSampler::new(42); // seed

// Benford-compliant amount
let amount = sampler.sample_benford_compliant(1000.0, 100000.0);

// Round-number biased
let round_amount = sampler.sample_with_round_bias(1000.0, 10000.0);
}

Using the UUID Factory

#![allow(unused)]
fn main() {
use synth_core::uuid_factory::{DeterministicUuidFactory, GeneratorType};

let factory = DeterministicUuidFactory::new(42);

// Generate collision-free UUIDs across generators
let je_id = factory.generate(GeneratorType::JournalEntry);
let doc_id = factory.generate(GeneratorType::DocumentFlow);
}

Memory Guard

#![allow(unused)]
fn main() {
use synth_core::memory_guard::{MemoryGuard, MemoryGuardConfig};

let config = MemoryGuardConfig {
    soft_limit: 1024 * 1024 * 1024,  // 1GB soft
    hard_limit: 2 * 1024 * 1024 * 1024, // 2GB hard
    check_interval_ms: 1000,
    ..Default::default()
};

let guard = MemoryGuard::new(config);
if guard.check().exceeds_soft_limit {
    // Slow down or pause generation
}
}

Disk Space Guard

#![allow(unused)]
fn main() {
use synth_core::disk_guard::{DiskSpaceGuard, DiskSpaceGuardConfig};

let config = DiskSpaceGuardConfig {
    hard_limit_mb: 100,        // Require at least 100 MB free
    soft_limit_mb: 500,        // Warn when below 500 MB
    check_interval: 500,       // Check every 500 operations
    reserve_buffer_mb: 50,     // Keep 50 MB buffer
    monitor_path: Some("./output".into()),
};

let guard = DiskSpaceGuard::new(config);
guard.check()?;  // Returns error if disk full
guard.check_before_write(1024 * 1024)?;  // Pre-write check
}

CPU Monitor

#![allow(unused)]
fn main() {
use synth_core::cpu_monitor::{CpuMonitor, CpuMonitorConfig};

let config = CpuMonitorConfig::with_thresholds(0.85, 0.95)
    .with_auto_throttle(50);  // 50ms delay when critical

let monitor = CpuMonitor::new(config);

// Sample and check in generation loop
if let Some(load) = monitor.sample() {
    if monitor.is_throttling() {
        monitor.maybe_throttle();  // Apply delay
    }
}
}

Graceful Degradation

#![allow(unused)]
fn main() {
use synth_core::degradation::{
    DegradationController, DegradationConfig, ResourceStatus, DegradationActions
};

let controller = DegradationController::new(DegradationConfig::default());

let status = ResourceStatus::new(
    Some(0.80),   // 80% memory usage
    Some(800),    // 800 MB disk free
    Some(0.70),   // 70% CPU load
);

let (level, changed) = controller.update(&status);
let actions = DegradationActions::for_level(level);

if actions.skip_data_quality {
    // Skip data quality injection
}
if actions.terminate {
    // Flush and exit gracefully
}
}

LLM Provider

#![allow(unused)]
fn main() {
use synth_core::llm::{LlmProvider, LlmRequest, MockLlmProvider};

let provider = MockLlmProvider::new(42);
let request = LlmRequest::new("Generate a realistic vendor name for a manufacturing company")
    .with_seed(42)
    .with_max_tokens(50);
let response = provider.complete(&request)?;
println!("Generated: {}", response.content);
}

Causal Generation

#![allow(unused)]
fn main() {
use synth_core::causal::{CausalGraph, StructuralCausalModel};

// Use built-in fraud detection template
let graph = CausalGraph::fraud_detection_template();
let scm = StructuralCausalModel::new(graph)?;

// Generate observational samples
let samples = scm.generate(1000, 42)?;

// Run intervention: what if transaction_amount is set to 50000?
let intervened = scm.intervene("transaction_amount", 50000.0)?;
let intervention_samples = intervened.generate(1000, 42)?;
}

Diffusion Model

#![allow(unused)]
fn main() {
use synth_core::diffusion::{
    StatisticalDiffusionBackend, DiffusionConfig, NoiseScheduleType,
    HybridGenerator, BlendStrategy,
};

let config = DiffusionConfig {
    n_steps: 1000,
    schedule: NoiseScheduleType::Cosine,
    seed: 42,
};

let backend = StatisticalDiffusionBackend::new(
    vec![100.0, 5.0],  // means
    vec![50.0, 2.0],   // stds
    config,
);

let samples = backend.generate(1000, 2, 42);

// Hybrid: blend rule-based + diffusion
let hybrid = HybridGenerator::new(0.3); // 30% diffusion weight
let blended = hybrid.blend(&rule_based, &samples, BlendStrategy::Ensemble, 42);
}

Traits

Generator Trait

#![allow(unused)]
fn main() {
pub trait Generator {
    type Output;
    type Error;

    fn generate_batch(&mut self, count: usize) -> Result<Vec<Self::Output>, Self::Error>;

    fn generate_stream(&mut self) -> impl Iterator<Item = Result<Self::Output, Self::Error>>;
}
}

Sink Trait

#![allow(unused)]
fn main() {
pub trait Sink<T> {
    type Error;

    fn write(&mut self, item: &T) -> Result<(), Self::Error>;
    fn write_batch(&mut self, items: &[T]) -> Result<(), Self::Error>;
    fn flush(&mut self) -> Result<(), Self::Error>;
}
}

PostProcessor Trait

Interface for post-generation data transformations (e.g., data quality variations):

#![allow(unused)]
fn main() {
pub struct ProcessContext {
    pub record_index: usize,
    pub batch_size: usize,
    pub output_format: String,
    pub metadata: HashMap<String, String>,
}

pub struct ProcessorStats {
    pub records_processed: usize,
    pub records_modified: usize,
    pub labels_generated: usize,
}
}
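
The trait itself is not reproduced here; purely as an illustration of how such an interface could be shaped (the method names and associated type below are assumptions, not the crate's actual API):

#![allow(unused)]
fn main() {
// Hypothetical sketch only, not the crate's actual trait definition.
pub trait PostProcessor<T> {
    type Error;

    /// Transform a single record in place, using the generation context.
    fn process(&mut self, item: &mut T, ctx: &ProcessContext) -> Result<(), Self::Error>;

    /// Summary statistics after processing a batch.
    fn stats(&self) -> ProcessorStats;
}
}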

Template System

Load external templates for customization:

#![allow(unused)]
fn main() {
use synth_core::templates::{TemplateLoader, MergeStrategy};

let loader = TemplateLoader::new("templates/");
let names = loader.load_category("vendor_names", MergeStrategy::Extend)?;
}

Template categories:

  • person_names
  • vendor_names
  • customer_names
  • material_descriptions
  • line_item_descriptions

Decimal Handling

All financial amounts use rust_decimal::Decimal:

#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;

let amount = dec!(1234.56);
let tax = amount * dec!(0.077);
}

Decimals are serialized as strings to avoid IEEE 754 floating-point issues.

See Also

datasynth-config

Configuration schema, validation, and industry presets for synthetic data generation.

Overview

datasynth-config provides the configuration layer for SyntheticData:

  • Schema Definition: Complete YAML configuration schema
  • Validation: Bounds checking, constraint validation, distribution sum verification
  • Industry Presets: Pre-configured settings for common industries
  • Complexity Levels: Small, medium, and large organization profiles

Configuration Sections

SectionDescription
globalIndustry, dates, seed, performance settings
companiesCompany codes, currencies, volume weights
chart_of_accountsCOA complexity and structure
transactionsLine items, amounts, sources, temporal patterns
master_dataVendors, customers, materials, assets, employees
document_flowsP2P, O2C configuration
intercompanyIC transaction types and transfer pricing
balanceOpening balances, trial balance generation
subledgerAR, AP, FA, inventory settings
fxCurrency and exchange rate settings
period_closeClose tasks and schedules
fraudFraud injection rates and types
internal_controlsSOX controls and SoD rules
anomaly_injectionAnomaly rates and labeling
data_qualityMissing values, typos, duplicates
graph_exportML graph export formats
outputOutput format and compression

Industry Presets

IndustryDescription
manufacturingHeavy P2P, inventory, fixed assets
retailHigh O2C volume, seasonal patterns
financial_servicesComplex intercompany, high controls
healthcareRegulatory focus, seasonal insurance
technologySaaS revenue patterns, R&D capitalization

Key Types

Config

#![allow(unused)]
fn main() {
pub struct Config {
    pub global: GlobalConfig,
    pub companies: Vec<CompanyConfig>,
    pub chart_of_accounts: CoaConfig,
    pub transactions: TransactionConfig,
    pub master_data: MasterDataConfig,
    pub document_flows: DocumentFlowConfig,
    pub intercompany: IntercompanyConfig,
    pub balance: BalanceConfig,
    pub subledger: SubledgerConfig,
    pub fx: FxConfig,
    pub period_close: PeriodCloseConfig,
    pub fraud: FraudConfig,
    pub internal_controls: ControlConfig,
    pub anomaly_injection: AnomalyConfig,
    pub data_quality: DataQualityConfig,
    pub graph_export: GraphExportConfig,
    pub output: OutputConfig,
}
}

GlobalConfig

#![allow(unused)]
fn main() {
pub struct GlobalConfig {
    pub seed: Option<u64>,
    pub industry: Industry,
    pub start_date: NaiveDate,
    pub period_months: u32,      // 1-120
    pub group_currency: String,
    pub worker_threads: Option<usize>,
    pub memory_limit: Option<u64>,
}
}

CompanyConfig

#![allow(unused)]
fn main() {
pub struct CompanyConfig {
    pub code: String,
    pub name: String,
    pub currency: String,
    pub country: String,
    pub volume_weight: f64,     // Must sum to 1.0 across companies
    pub is_parent: bool,
    pub parent_code: Option<String>,
}
}

Validation Rules

The ConfigValidator enforces the following rules (the distribution-weight check is sketched after the table):

Rule | Constraint
period_months | 1-120 (max 10 years)
compression_level | 1-9 when compression enabled
Rate fields | 0.0-1.0
Approval thresholds | Strictly ascending order
Distribution weights | Sum to 1.0 (±0.01 tolerance)
Company codes | Unique within configuration
Dates | start_date + period_months is valid
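
As an illustration of the distribution-weight rule, a check within the ±0.01 tolerance could look like this (hypothetical helper, not the validator's actual code):

#![allow(unused)]
fn main() {
// Hypothetical helper illustrating the ±0.01 tolerance on weight sums.
fn weights_sum_to_one(weights: &[f64]) -> bool {
    (weights.iter().sum::<f64>() - 1.0).abs() <= 0.01
}

// Company volume weights must sum to 1.0 across companies.
assert!(weights_sum_to_one(&[0.5, 0.3, 0.2]));
assert!(!weights_sum_to_one(&[0.5, 0.3]));
}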

Usage Examples

Loading Configuration

#![allow(unused)]
fn main() {
use synth_config::{Config, ConfigValidator};

// From YAML file
let config = Config::from_yaml_file("config.yaml")?;

// Validate
let validator = ConfigValidator::new();
validator.validate(&config)?;
}

Using Presets

#![allow(unused)]
fn main() {
use synth_config::{Config, Industry, Complexity};

// Create from preset
let mut config = Config::from_preset(Industry::Manufacturing, Complexity::Medium);

// Modify as needed
config.transactions.target_count = 50000;
}

Creating Configuration Programmatically

#![allow(unused)]
fn main() {
use synth_config::{Config, GlobalConfig, TransactionConfig};

let config = Config {
    global: GlobalConfig {
        seed: Some(42),
        industry: Industry::Manufacturing,
        start_date: NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
        period_months: 12,
        group_currency: "USD".to_string(),
        ..Default::default()
    },
    transactions: TransactionConfig {
        target_count: 100000,
        ..Default::default()
    },
    ..Default::default()
};
}

Saving Configuration

#![allow(unused)]
fn main() {
// To YAML
config.to_yaml_file("config.yaml")?;

// To JSON
config.to_json_file("config.json")?;

// To string
let yaml = config.to_yaml_string()?;
}

Configuration Examples

Minimal Configuration

global:
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12

transactions:
  target_count: 10000

output:
  format: csv

Full Configuration

See the YAML Schema Reference for complete documentation.

Complexity Levels

Level | Accounts | Vendors | Customers | Materials
small | ~100 | 50 | 100 | 200
medium | ~400 | 200 | 500 | 1000
large | ~2500 | 1000 | 5000 | 10000

Validation Error Types

#![allow(unused)]
fn main() {
pub enum ConfigError {
    MissingRequiredField(String),
    InvalidValue { field: String, value: String, constraint: String },
    DistributionSumError { field: String, sum: f64 },
    DuplicateCode { field: String, code: String },
    DateRangeError { start: NaiveDate, end: NaiveDate },
    ParseError(String),
}
}

See Also

datasynth-generators

Data generators for journal entries, master data, document flows, and anomalies.

Overview

datasynth-generators contains all data generation logic for SyntheticData:

  • Core Generators: Journal entries, chart of accounts, users
  • Master Data: Vendors, customers, materials, assets, employees
  • Document Flows: P2P (Procure-to-Pay), O2C (Order-to-Cash)
  • Financial: Intercompany, balance tracking, subledgers, FX, period close
  • Quality: Anomaly injection, data quality variations
  • Sourcing (S2C): Spend analysis, RFx, bids, contracts, catalogs, scorecards (v0.6.0)
  • HR / Payroll: Payroll runs, time entries, expense reports (v0.6.0)
  • Financial Reporting: Financial statements, bank reconciliation (v0.6.0)
  • Standards: Revenue recognition, impairment testing (v0.6.0)
  • Manufacturing: Production orders, quality inspections, cycle counts (v0.6.0)

Module Structure

Core Generators

GeneratorDescription
je_generatorJournal entry generation with statistical distributions
coa_generatorChart of accounts with industry-specific structures
company_selectorWeighted company selection for transactions
user_generatorUser/persona generation with roles
control_generatorInternal controls and SoD rules

Master Data (master_data/)

GeneratorDescription
vendor_generatorVendors with payment terms, bank accounts, behaviors
customer_generatorCustomers with credit ratings, payment patterns
material_generatorMaterials/products with BOM, valuations
asset_generatorFixed assets with depreciation schedules
employee_generatorEmployees with manager hierarchy
entity_registry_managerCentral entity registry with temporal validity

Document Flow (document_flow/)

GeneratorDescription
p2p_generatorPO → GR → Invoice → Payment flow
o2c_generatorSO → Delivery → Invoice → Receipt flow
document_chain_managerReference chain management
document_flow_je_generatorGenerate JEs from document flows
three_way_matchPO/GR/Invoice matching validation

Intercompany (intercompany/)

GeneratorDescription
ic_generatorMatched intercompany entry pairs
matching_engineIC matching and reconciliation
elimination_generatorConsolidation elimination entries

Balance (balance/)

GeneratorDescription
opening_balance_generatorCoherent opening balance sheet
balance_trackerRunning balance validation
trial_balance_generatorPeriod-end trial balance

Subledger (subledger/)

GeneratorDescription
ar_generatorAR invoices, receipts, credit memos, aging
ap_generatorAP invoices, payments, debit memos
fa_generatorFixed assets, depreciation, disposals
inventory_generatorInventory positions, movements, valuation
reconciliationGL-to-subledger reconciliation

FX (fx/)

GeneratorDescription
fx_rate_serviceFX rate generation (Ornstein-Uhlenbeck process)
currency_translatorTrial balance translation
cta_generatorCurrency Translation Adjustment entries

Period Close (period_close/)

GeneratorDescription
close_engineMain orchestration
accrualsAccrual entry generation
depreciationMonthly depreciation runs
year_endYear-end closing entries

Anomaly (anomaly/)

GeneratorDescription
injectorMain anomaly injection engine
typesWeighted anomaly type configurations
strategiesInjection strategies (amount, date, duplication)
patternsTemporal patterns, clustering, entity targeting

Data Quality (data_quality/)

GeneratorDescription
injectorMain data quality injector
missing_valuesMCAR, MAR, MNAR, Systematic patterns
format_variationsDate, amount, identifier formats
duplicatesExact, near, fuzzy duplicates
typosKeyboard-aware typos, OCR errors
labelsML training labels for data quality issues

Audit (audit/)

ISA-compliant audit data generation.

GeneratorDescription
engagement_generatorAudit engagement with phases (Planning, Fieldwork, Completion)
workpaper_generatorAudit workpapers per ISA 230
evidence_generatorAudit evidence per ISA 500
risk_generatorRisk assessment per ISA 315/330
finding_generatorAudit findings per ISA 265
judgment_generatorProfessional judgment documentation per ISA 200

LLM Enrichment (llm_enrichment/) — v0.5.0

GeneratorDescription
VendorLlmEnricherGenerate realistic vendor names by industry, spend category, and country
TransactionLlmEnricherGenerate transaction descriptions and memo fields
AnomalyLlmExplainerGenerate natural language explanations for injected anomalies

Sourcing (sourcing/) – v0.6.0

Source-to-Contract (S2C) procurement pipeline generators.

GeneratorDescription
spend_analysis_generatorSpend analysis records and category hierarchies
sourcing_project_generatorSourcing project lifecycle management
qualification_generatorSupplier qualification assessments
rfx_generatorRFx events (RFI/RFP/RFQ) with invited suppliers
bid_generatorSupplier bids with pricing and compliance data
bid_evaluation_generatorBid scoring, ranking, and award recommendations
contract_generatorProcurement contracts with terms and renewal rules
catalog_generatorCatalog items linked to contracts
scorecard_generatorSupplier scorecards with performance metrics

Generation DAG: spend_analysis → sourcing_project → qualification → rfx → bid → bid_evaluation → contract → catalog → [P2P] → scorecard

HR (hr/) – v0.6.0

Hire-to-Retire (H2R) generators for the HR process chain.

GeneratorDescription
payroll_generatorPayroll runs with employee pay line items (gross, deductions, net, employer cost)
time_entry_generatorEmployee time entries with regular, overtime, PTO, and sick hours
expense_report_generatorExpense reports with categorized line items and approval workflows

Standards (standards/) – v0.6.0

Accounting and audit standards generators.

GeneratorDescription
revenue_recognition_generatorASC 606/IFRS 15 customer contracts with performance obligations
impairment_generatorAsset impairment tests with recoverable amount calculations

Period Close Additions – v0.6.0

GeneratorDescription
financial_statement_generatorBalance sheet, income statement, cash flow, and changes in equity from trial balance data

Bank Reconciliation – v0.6.0

GeneratorDescription
bank_reconciliation_generatorBank reconciliations with statement lines, auto-matching, and reconciling items

Relationships (relationships/)

GeneratorDescription
entity_graph_generatorCross-process entity relationship graphs
relationship_strengthWeighted relationship strength calculation

Audit Engagement Structure:

#![allow(unused)]
fn main() {
pub struct AuditEngagement {
    pub engagement_id: String,
    pub client_name: String,
    pub fiscal_year: u16,
    pub phase: AuditPhase,  // Planning, Fieldwork, Completion
    pub materiality: MaterialityLevels,
    pub team_size: usize,
    pub has_fraud_risk: bool,
    pub has_significant_risk: bool,
}

pub struct MaterialityLevels {
    pub primary_materiality: Decimal,        // 0.3-1% of base
    pub performance_materiality: Decimal,    // 50-75% of primary
    pub clearly_trivial: Decimal,            // 3-5% of primary
}
}
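
To make the range comments concrete, a quick worked example using midpoints of those ranges (illustrative values only):

#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;

// Illustrative only: derive the three levels from a revenue benchmark,
// using midpoints of the documented ranges.
let benchmark = dec!(10000000);              // e.g. total revenue
let primary = benchmark * dec!(0.005);       // 0.5% of base   -> 50,000
let performance = primary * dec!(0.65);      // 65% of primary -> 32,500
let clearly_trivial = primary * dec!(0.04);  // 4% of primary  ->  2,000
}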

Usage Examples

Journal Entry Generation

#![allow(unused)]
fn main() {
use synth_generators::je_generator::JournalEntryGenerator;

let mut generator = JournalEntryGenerator::new(config, seed);

// Generate batch
let entries = generator.generate_batch(1000)?;

// Stream generation
for entry in generator.generate_stream().take(1000) {
    process(entry?);
}
}

Master Data Generation

#![allow(unused)]
fn main() {
use synth_generators::master_data::{VendorGenerator, CustomerGenerator};

let mut vendor_gen = VendorGenerator::new(seed);
let vendors = vendor_gen.generate(100);

let mut customer_gen = CustomerGenerator::new(seed);
let customers = customer_gen.generate(200);
}

Document Flow Generation

#![allow(unused)]
fn main() {
use synth_generators::document_flow::{P2pGenerator, O2cGenerator};

let mut p2p = P2pGenerator::new(config, seed);
let p2p_flows = p2p.generate_batch(500)?;

let mut o2c = O2cGenerator::new(config, seed);
let o2c_flows = o2c.generate_batch(500)?;
}

Anomaly Injection

#![allow(unused)]
fn main() {
use synth_generators::anomaly::AnomalyInjector;

let mut injector = AnomalyInjector::new(config.anomaly_injection, seed);

// Inject into existing entries
let (modified_entries, labels) = injector.inject(&entries)?;
}

LLM Enrichment

#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::{VendorLlmEnricher, TransactionLlmEnricher};
use synth_core::llm::MockLlmProvider;
use std::sync::Arc;

let provider = Arc::new(MockLlmProvider::new(42));

// Enrich vendor names
let vendor_enricher = VendorLlmEnricher::new(provider.clone());
let name = vendor_enricher.enrich_vendor_name("manufacturing", "raw_materials", "US")?;

// Enrich transaction descriptions
let tx_enricher = TransactionLlmEnricher::new(provider);
let desc = tx_enricher.enrich_description("Office Supplies", "1000-5000", "retail", 3)?;
let memo = tx_enricher.enrich_memo("VendorInvoice", "Acme Corp", "2500.00")?;
}

Three-Way Match

The P2P generator validates document matching:

#![allow(unused)]
fn main() {
use synth_generators::document_flow::ThreeWayMatch;

let match_result = ThreeWayMatch::validate(
    &purchase_order,
    &goods_receipt,
    &vendor_invoice,
    tolerance_config,
);

match match_result {
    MatchResult::Passed => { /* Process normally */ }
    MatchResult::QuantityVariance(var) => { /* Handle variance */ }
    MatchResult::PriceVariance(var) => { /* Handle variance */ }
}
}

Balance Coherence

The balance tracker maintains accounting equation:

#![allow(unused)]
fn main() {
use synth_generators::balance::BalanceTracker;

let mut tracker = BalanceTracker::new();

for entry in &entries {
    tracker.post(&entry)?;
}

// Verify Assets = Liabilities + Equity
assert!(tracker.is_balanced());
}

FX Rate Generation

Uses Ornstein-Uhlenbeck process for realistic rate movements:

#![allow(unused)]
fn main() {
use synth_generators::fx::FxRateService;

let mut fx_service = FxRateService::new(config.fx, seed);

// Get rate for date
let rate = fx_service.get_rate("EUR", "USD", date)?;

// Generate daily rates
let rates = fx_service.generate_daily_rates(start, end)?;
}
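
For intuition, one discretised Ornstein-Uhlenbeck step is rate += theta * (mu - rate) * dt + sigma * sqrt(dt) * eps, with eps drawn from a standard normal. A stand-alone sketch with illustrative parameters (not the crate's internals):

#![allow(unused)]
fn main() {
use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;
use rand_distr::{Distribution, StandardNormal};

// Illustrative parameters: mean-reversion speed, long-run rate, volatility, step (days).
let (theta, mu, sigma, dt) = (0.05_f64, 1.10_f64, 0.01_f64, 1.0_f64);
let mut rng = ChaCha8Rng::seed_from_u64(42);

let mut rate = mu;
for _ in 0..30 {
    let eps: f64 = StandardNormal.sample(&mut rng);
    // Euler-Maruyama step of the Ornstein-Uhlenbeck process.
    rate += theta * (mu - rate) * dt + sigma * dt.sqrt() * eps;
}
}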

Anomaly Types

Fraud Types

  • FictitiousTransaction, RevenueManipulation, ExpenseCapitalization
  • SplitTransaction, RoundTripping, KickbackScheme
  • GhostEmployee, DuplicatePayment, UnauthorizedDiscount

Error Types

  • DuplicateEntry, ReversedAmount, WrongPeriod
  • WrongAccount, MissingReference, IncorrectTaxCode

Process Issues

  • LatePosting, SkippedApproval, ThresholdManipulation
  • MissingDocumentation, OutOfSequence

Statistical Anomalies

  • UnusualAmount, TrendBreak, BenfordViolation, OutlierValue

Relational Anomalies

  • CircularTransaction, DormantAccountActivity, UnusualCounterparty

See Also

datasynth-output

Output sinks for CSV, JSON, and streaming formats.

Overview

datasynth-output provides the output layer for SyntheticData:

  • CSV Sink: High-performance CSV writing with optional compression
  • JSON Sink: JSON and JSONL (newline-delimited) output
  • Streaming: Async streaming output for real-time generation
  • Control Export: Internal control and SoD rule export

Supported Formats

Standard Formats

Format | Description | Extension
CSV | Standard comma-separated values | .csv
JSON | Pretty-printed JSON arrays | .json
JSONL | Newline-delimited JSON | .jsonl
Parquet | Apache Parquet columnar format | .parquet

ERP Formats

FormatTarget ERPTables
SAP S/4HANASapExporterBKPF, BSEG, ACDOCA, LFA1, KNA1, MARA, CSKS, CEPC
Oracle EBSOracleExporterGL_JE_HEADERS, GL_JE_LINES, GL_JE_BATCHES
NetSuiteNetSuiteExporterJournal entries with subsidiary/multi-book support

Streaming Sinks

SinkDescription
CsvStreamingSinkStreaming CSV with automatic headers
JsonStreamingSinkStreaming JSON arrays
NdjsonStreamingSinkStreaming newline-delimited JSON
ParquetStreamingSinkStreaming Apache Parquet

Features

  • Configurable compression (gzip, zstd, snappy for Parquet)
  • Streaming writes for memory efficiency with backpressure support
  • ERP-native table schemas (SAP, Oracle, NetSuite)
  • Decimal values serialized as strings (IEEE 754 safe)
  • Configurable field ordering and headers
  • Automatic directory creation

Key Types

OutputConfig

#![allow(unused)]
fn main() {
pub struct OutputConfig {
    pub format: OutputFormat,
    pub compression: CompressionType,
    pub compression_level: u32,
    pub include_headers: bool,
    pub decimal_precision: u32,
}

pub enum OutputFormat {
    Csv,
    Json,
    Jsonl,
}

pub enum CompressionType {
    None,
    Gzip,
    Zstd,
}
}

CsvSink

#![allow(unused)]
fn main() {
pub struct CsvSink<T> {
    writer: BufWriter<Box<dyn Write>>,
    config: OutputConfig,
    headers_written: bool,
    _phantom: PhantomData<T>,
}
}

JsonSink

#![allow(unused)]
fn main() {
pub struct JsonSink<T> {
    writer: BufWriter<Box<dyn Write>>,
    format: JsonFormat,
    first_written: bool,
    _phantom: PhantomData<T>,
}
}

Usage Examples

CSV Output

#![allow(unused)]
fn main() {
use synth_output::{CsvSink, OutputConfig, OutputFormat};

// Create sink
let config = OutputConfig {
    format: OutputFormat::Csv,
    compression: CompressionType::None,
    include_headers: true,
    ..Default::default()
};

let mut sink = CsvSink::new("output/journal_entries.csv", config)?;

// Write data
sink.write_batch(&entries)?;
sink.flush()?;
}

Compressed Output

#![allow(unused)]
fn main() {
use synth_output::{CsvSink, OutputConfig, CompressionType};

let config = OutputConfig {
    compression: CompressionType::Gzip,
    compression_level: 6,
    ..Default::default()
};

let mut sink = CsvSink::new("output/entries.csv.gz", config)?;
sink.write_batch(&entries)?;
}

JSON Streaming

#![allow(unused)]
fn main() {
use synth_output::{JsonSink, OutputConfig, OutputFormat};

let config = OutputConfig {
    format: OutputFormat::Jsonl,
    ..Default::default()
};

let mut sink = JsonSink::new("output/entries.jsonl", config)?;

// Stream writes (memory efficient)
for entry in entries {
    sink.write(&entry)?;
}
sink.flush()?;
}

Control Export

#![allow(unused)]
fn main() {
use synth_output::ControlExporter;

let exporter = ControlExporter::new("output/controls/");

// Export all control-related data
exporter.export_controls(&internal_controls)?;
exporter.export_sod_rules(&sod_rules)?;
exporter.export_control_mappings(&mappings)?;
}

Sink Trait Implementation

All sinks implement the Sink trait:

#![allow(unused)]
fn main() {
impl<T: Serialize> Sink<T> for CsvSink<T> {
    type Error = OutputError;

    fn write(&mut self, item: &T) -> Result<(), Self::Error> {
        // Single item write
    }

    fn write_batch(&mut self, items: &[T]) -> Result<(), Self::Error> {
        // Batch write for efficiency
    }

    fn flush(&mut self) -> Result<(), Self::Error> {
        // Ensure all data written to disk
    }
}
}

Decimal Serialization

Financial amounts are serialized as strings to prevent IEEE 754 floating-point issues:

#![allow(unused)]
fn main() {
// Internal: Decimal
let amount = dec!(1234.56);

// CSV output: "1234.56" (string)
// JSON output: "1234.56" (string, not number)
}

This ensures exact decimal representation across all systems.
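
With serde, this is typically expressed as a field attribute; a sketch assuming rust_decimal's serde-with-str feature is enabled (the struct itself is illustrative):

#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
use serde::Serialize;

#[derive(Serialize)]
struct LineOut {
    account: String,
    // Serialized as "1234.56" (a string), never as a binary float.
    #[serde(with = "rust_decimal::serde::str")]
    amount: Decimal,
}

let line = LineOut { account: "1100".into(), amount: dec!(1234.56) };
// serde_json::to_string(&line) yields {"account":"1100","amount":"1234.56"}
}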

Performance Tips

Batch Writes

Prefer batch writes over individual writes:

#![allow(unused)]
fn main() {
// Good: Single batch write
sink.write_batch(&entries)?;

// Less efficient: Multiple writes
for entry in &entries {
    sink.write(entry)?;
}
}

Buffer Size

The default buffer size is 8KB. For very large outputs, consider adjusting:

#![allow(unused)]
fn main() {
let sink = CsvSink::with_buffer_size(
    "output/large.csv",
    config,
    64 * 1024, // 64KB buffer
)?;
}

Compression Trade-offs

Compression | Speed | Size | Use Case
None | Fastest | Largest | Development, streaming
Gzip | Medium | Small | General purpose
Zstd | Fast | Smallest | Production, archival

Output Structure

The output module creates an organized directory structure:

output/
├── master_data/
│   ├── vendors.csv
│   └── customers.csv
├── transactions/
│   ├── journal_entries.csv
│   └── acdoca.csv
├── controls/
│   ├── internal_controls.csv
│   └── sod_rules.csv
└── labels/
    └── anomaly_labels.csv

Error Handling

#![allow(unused)]
fn main() {
pub enum OutputError {
    IoError(std::io::Error),
    SerializationError(String),
    CompressionError(String),
    DirectoryCreationError(PathBuf),
}
}

See Also

datasynth-runtime

Runtime orchestration, parallel execution, and memory management.

Overview

datasynth-runtime provides the execution layer for SyntheticData:

  • GenerationOrchestrator: Coordinates the complete generation workflow
  • Parallel Execution: Multi-threaded generation with Rayon
  • Memory Management: Integration with memory guard for OOM prevention
  • Progress Tracking: Real-time progress reporting with pause/resume

Key Components

ComponentDescription
GenerationOrchestratorMain workflow coordinator
EnhancedOrchestratorExtended orchestrator with all enterprise features
ParallelExecutorThread pool management
ProgressTrackerProgress bars and status reporting

Generation Workflow

The orchestrator executes phases in order:

  1. Initialize: Load configuration, validate settings
  2. Master Data: Generate vendors, customers, materials, assets
  3. Opening Balances: Create coherent opening balance sheet
  4. Transactions: Generate journal entries with document flows
  5. Period Close: Run monthly/quarterly/annual close processes
  6. Anomalies: Inject configured anomalies and data quality issues
  7. Export: Write outputs and generate ML labels
  8. Banking: Generate KYC/AML data (if enabled)
  9. Audit: Generate ISA-compliant audit data (if enabled)
  10. Graphs: Build and export ML graphs (if enabled)
  11. LLM Enrichment: Enrich data with LLM-generated metadata (v0.5.0, if enabled)
  12. Diffusion Enhancement: Blend diffusion model outputs (v0.5.0, if enabled)
  13. Causal Overlay: Apply causal structure (v0.5.0, if enabled)
  14. S2C Sourcing: Generate Source-to-Contract procurement pipeline (v0.6.0, if enabled)
  15. Financial Reporting: Generate bank reconciliations and financial statements (v0.6.0, if enabled)
  16. HR Data: Generate payroll runs, time entries, and expense reports (v0.6.0, if enabled)
  17. Accounting Standards: Generate revenue recognition and impairment data (v0.6.0, if enabled)
  18. Manufacturing: Generate production orders, quality inspections, and cycle counts (v0.6.0, if enabled)
  19. Sales/KPIs/Budgets: Generate sales quotes, management KPIs, and budget variance data (v0.6.0, if enabled)

Key Types

GenerationOrchestrator

#![allow(unused)]
fn main() {
pub struct GenerationOrchestrator {
    config: Config,
    state: GenerationState,
    progress: Arc<ProgressTracker>,
    memory_guard: MemoryGuard,
}

pub struct GenerationState {
    pub master_data: MasterDataState,
    pub entries: Vec<JournalEntry>,
    pub documents: DocumentState,
    pub balances: BalanceState,
    pub anomaly_labels: Vec<LabeledAnomaly>,
}
}

ProgressTracker

#![allow(unused)]
fn main() {
pub struct ProgressTracker {
    pub current: AtomicU64,
    pub total: u64,
    pub phase: String,
    pub paused: AtomicBool,
    pub start_time: Instant,
}

pub struct Progress {
    pub current: u64,
    pub total: u64,
    pub percent: f64,
    pub phase: String,
    pub entries_per_second: f64,
    pub elapsed: Duration,
    pub estimated_remaining: Duration,
}
}

Usage Examples

Basic Generation

#![allow(unused)]
fn main() {
use synth_runtime::GenerationOrchestrator;

let config = Config::from_yaml_file("config.yaml")?;
let orchestrator = GenerationOrchestrator::new(config)?;

// Run full generation
orchestrator.run()?;
}

With Progress Callback

#![allow(unused)]
fn main() {
orchestrator.run_with_progress(|progress| {
    println!(
        "[{:.1}%] {} - {}/{} ({:.0} entries/sec)",
        progress.percent,
        progress.phase,
        progress.current,
        progress.total,
        progress.entries_per_second,
    );
})?;
}

Parallel Execution

#![allow(unused)]
fn main() {
use synth_runtime::ParallelExecutor;

let executor = ParallelExecutor::new(4); // 4 threads

let results: Vec<JournalEntry> = executor.run(|thread_id| {
    let mut generator = JournalEntryGenerator::new(config.clone(), seed + thread_id);
    generator.generate_batch(batch_size)
})?;
}

Memory-Aware Generation

#![allow(unused)]
fn main() {
use synth_runtime::GenerationOrchestrator;
use synth_core::memory_guard::MemoryGuardConfig;

let memory_config = MemoryGuardConfig {
    soft_limit: 1024 * 1024 * 1024,  // 1GB
    hard_limit: 2 * 1024 * 1024 * 1024,  // 2GB
    check_interval_ms: 1000,
    ..Default::default()
};

let orchestrator = GenerationOrchestrator::with_memory_config(config, memory_config)?;
orchestrator.run()?;
}

Pause/Resume

On Unix systems, generation can be paused and resumed:

# Start generation in background
datasynth-data generate --config config.yaml --output ./output &

# Send SIGUSR1 to toggle pause
kill -USR1 $(pgrep datasynth-data)

# Progress bar shows pause state
# [████████░░░░░░░░░░░░] 40% (PAUSED)

Programmatic Pause/Resume

#![allow(unused)]
fn main() {
// Pause
orchestrator.pause();

// Check state
if orchestrator.is_paused() {
    println!("Generation paused");
}

// Resume
orchestrator.resume();
}

Enhanced Orchestrator

The EnhancedOrchestrator includes additional enterprise features:

#![allow(unused)]
fn main() {
use synth_runtime::EnhancedOrchestrator;

let orchestrator = EnhancedOrchestrator::new(config)?;

// All features enabled
orchestrator
    .with_document_flows()
    .with_intercompany()
    .with_subledgers()
    .with_fx()
    .with_period_close()
    .with_anomaly_injection()
    .with_graph_export()
    .run()?;
}

Enterprise Process Chain Phases (v0.6.0)

The EnhancedOrchestrator supports six new phases for enterprise process chains, controlled by PhaseConfig:

Phase | Config Flag | Description
14 | generate_sourcing | S2C procurement pipeline: spend analysis through supplier scorecards
15 | generate_financial_statements / generate_bank_reconciliation | Financial statements and bank reconciliations
16 | generate_hr | Payroll runs, time entries, expense reports
17 | generate_accounting_standards | Revenue recognition (ASC 606/IFRS 15), impairment testing
18 | generate_manufacturing | Production orders, quality inspections, cycle counts
19 | generate_sales_kpi_budgets | Sales quotes, management KPIs, budget variance analysis

Each phase can be enabled independently and is skipped gracefully when its dependencies (e.g., master data) are unavailable.
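
As a sketch, enabling a subset of these phases might look like the following (the flag names come from the table above; the struct layout and Default impl are assumptions):

#![allow(unused)]
fn main() {
// Sketch only: field names follow the table; the exact shape of PhaseConfig
// and how it is wired into the EnhancedOrchestrator may differ.
let phases = PhaseConfig {
    generate_sourcing: true,
    generate_hr: true,
    generate_manufacturing: false,
    ..Default::default()
};
}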

Output Coordination

The orchestrator coordinates output to multiple sinks:

#![allow(unused)]
fn main() {
// Orchestrator automatically:
// 1. Creates output directories
// 2. Writes master data files
// 3. Writes transaction files
// 4. Writes subledger files
// 5. Writes labels for ML
// 6. Generates graphs if enabled
}

Error Handling

#![allow(unused)]
fn main() {
pub enum RuntimeError {
    ConfigurationError(ConfigError),
    GenerationError(String),
    MemoryExceeded { limit: u64, current: u64 },
    OutputError(OutputError),
    Interrupted,
}
}

Performance Considerations

Thread Count

#![allow(unused)]
fn main() {
// Auto-detect (uses all cores)
let orchestrator = GenerationOrchestrator::new(config)?;

// Manual thread count
let orchestrator = GenerationOrchestrator::with_threads(config, 4)?;
}

Memory Management

The orchestrator monitors memory and can:

  • Slow down generation when soft limit approached
  • Pause generation at hard limit
  • Stream output to reduce memory pressure

Batch Sizes

Batch sizes are automatically tuned based on the following factors (a hypothetical heuristic is sketched below):

  • Available memory
  • Number of threads
  • Target throughput
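
A hypothetical heuristic for this kind of tuning (illustrative only, not the crate's actual logic):

#![allow(unused)]
fn main() {
// Hypothetical heuristic: cap each thread's working set at ~10% of free memory.
fn suggest_batch_size(free_mem_bytes: u64, threads: u64, bytes_per_entry: u64) -> u64 {
    let per_thread_budget = free_mem_bytes / 10 / threads.max(1);
    (per_thread_budget / bytes_per_entry.max(1)).clamp(1_000, 100_000)
}

// e.g. 4 GB free, 8 threads, ~2 KB per entry -> batches of roughly 26k entries.
let batch = suggest_batch_size(4 * 1024 * 1024 * 1024, 8, 2048);
}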

See Also

datasynth-graph

Graph/network export for synthetic accounting data with ML-ready formats.

Overview

datasynth-graph provides graph construction and export capabilities:

  • Graph Builders: Transaction, approval, entity relationship, and multi-layer hypergraph builders
  • Hypergraph: 3-layer hypergraph (Governance, Process Events, Accounting Network) spanning 8 process families with 24 entity type codes and OCPM event hyperedges
  • ML Export: PyTorch Geometric, Neo4j, DGL, RustGraph, and RustGraph Hypergraph formats
  • Feature Engineering: Temporal, amount, structural, and categorical features
  • Data Splits: Train/validation/test split generation

Graph Types

Graph | Nodes | Edges | Use Case
Transaction Network | Accounts, Entities | Transactions | Anomaly detection
Approval Network | Users | Approvals | SoD analysis
Entity Relationship | Legal Entities | Ownership | Consolidation analysis

Export Formats

PyTorch Geometric

graphs/transaction_network/pytorch_geometric/
├── node_features.pt    # [num_nodes, num_features]
├── edge_index.pt       # [2, num_edges]
├── edge_attr.pt        # [num_edges, num_edge_features]
├── labels.pt           # [num_nodes] or [num_edges]
├── train_mask.pt       # Boolean mask
├── val_mask.pt
└── test_mask.pt

Neo4j

graphs/entity_relationship/neo4j/
├── nodes_account.csv
├── nodes_entity.csv
├── edges_transaction.csv
├── edges_ownership.csv
└── import.cypher

DGL (Deep Graph Library)

graphs/approval_network/dgl/
├── graph.bin           # DGL graph object
├── node_feats.npy      # Node features
├── edge_feats.npy      # Edge features
└── labels.npy          # Labels

Feature Categories

CategoryFeatures
Temporalweekday, period, is_month_end, is_quarter_end, is_year_end
Amountlog(amount), benford_probability, is_round_number
Structuralline_count, unique_accounts, has_intercompany
Categoricalbusiness_process (one-hot), source_type (one-hot)

Key Types

Graph Models

#![allow(unused)]
fn main() {
pub struct Graph {
    pub nodes: Vec<Node>,
    pub edges: Vec<Edge>,
    pub node_features: Option<Array2<f32>>,
    pub edge_features: Option<Array2<f32>>,
}

pub enum Node {
    Account(AccountNode),
    Entity(EntityNode),
    User(UserNode),
    Transaction(TransactionNode),
}

pub enum Edge {
    Transaction(TransactionEdge),
    Approval(ApprovalEdge),
    Ownership(OwnershipEdge),
}
}

Split Configuration

#![allow(unused)]
fn main() {
pub struct SplitConfig {
    pub train_ratio: f64,     // e.g., 0.7
    pub val_ratio: f64,       // e.g., 0.15
    pub test_ratio: f64,      // e.g., 0.15
    pub stratify_by: Option<String>,
    pub random_seed: u64,
}
}

Usage Examples

Building Transaction Graph

#![allow(unused)]
fn main() {
use synth_graph::{TransactionGraphBuilder, GraphConfig};

let builder = TransactionGraphBuilder::new(GraphConfig::default());
let graph = builder.build(&journal_entries)?;

println!("Nodes: {}", graph.nodes.len());
println!("Edges: {}", graph.edges.len());
}

PyTorch Geometric Export

#![allow(unused)]
fn main() {
use synth_graph::{PyTorchGeometricExporter, SplitConfig};

let exporter = PyTorchGeometricExporter::new("output/graphs");

let split = SplitConfig {
    train_ratio: 0.7,
    val_ratio: 0.15,
    test_ratio: 0.15,
    stratify_by: Some("is_anomaly".to_string()),
    random_seed: 42,
};

exporter.export(&graph, split)?;
}

Neo4j Export

#![allow(unused)]
fn main() {
use synth_graph::Neo4jExporter;

let exporter = Neo4jExporter::new("output/graphs/neo4j");
exporter.export(&graph)?;

// Generates import script:
// LOAD CSV WITH HEADERS FROM 'file:///nodes_account.csv' AS row
// CREATE (:Account {id: row.id, name: row.name, ...})
}

Feature Engineering

#![allow(unused)]
fn main() {
use synth_graph::features::{FeatureExtractor, FeatureConfig};

let extractor = FeatureExtractor::new(FeatureConfig {
    temporal: true,
    amount: true,
    structural: true,
    categorical: true,
});

let node_features = extractor.extract_node_features(&entries)?;
let edge_features = extractor.extract_edge_features(&entries)?;
}

Graph Construction

Transaction Network

Accounts and entities become nodes; transactions become edges.

#![allow(unused)]
fn main() {
// Nodes:
// - Each GL account is a node
// - Each vendor/customer is a node

// Edges:
// - Each journal entry line creates an edge
// - Edge connects account to entity
// - Edge features: amount, date, fraud flag
}

Approval Network

Users become nodes; approval relationships become edges.

#![allow(unused)]
fn main() {
// Nodes:
// - Each user/employee is a node
// - Node features: approval_limit, department, role

// Edges:
// - Approval actions create edges
// - Edge features: amount, threshold, escalation
}

Entity Relationship Network

Legal entities become nodes; ownership and IC relationships become edges.

#![allow(unused)]
fn main() {
// Nodes:
// - Each company/legal entity is a node
// - Node features: currency, country, parent_flag

// Edges:
// - Ownership relationships (parent → subsidiary)
// - IC transaction relationships
// - Edge features: ownership_percent, transaction_volume
}

ML Integration

Loading in PyTorch

import torch
from torch_geometric.data import Data

# Load exported data
node_features = torch.load('node_features.pt')
edge_index = torch.load('edge_index.pt')
edge_attr = torch.load('edge_attr.pt')
labels = torch.load('labels.pt')
train_mask = torch.load('train_mask.pt')

data = Data(
    x=node_features,
    edge_index=edge_index,
    edge_attr=edge_attr,
    y=labels,
    train_mask=train_mask,
)

Loading in Neo4j

# Import using generated script
neo4j-admin import \
    --nodes=nodes_account.csv \
    --nodes=nodes_entity.csv \
    --relationships=edges_transaction.csv

Configuration

graph_export:
  enabled: true
  formats:
    - pytorch_geometric
    - neo4j
  graphs:
    - transaction_network
    - approval_network
    - entity_relationship
  split:
    train: 0.7
    val: 0.15
    test: 0.15
    stratify: is_anomaly
  features:
    temporal: true
    amount: true
    structural: true
    categorical: true

Multi-Layer Hypergraph (v0.6.2)

The hypergraph builder supports all 8 enterprise process families:

Method | Family | Node Types
add_p2p_documents() | P2P | PurchaseOrder, GoodsReceipt, VendorInvoice, Payment
add_o2c_documents() | O2C | SalesOrder, Delivery, CustomerInvoice
add_s2c_documents() | S2C | SourcingProject, RfxEvent, SupplierBid, ProcurementContract
add_h2r_documents() | H2R | PayrollRun, TimeEntry, ExpenseReport
add_mfg_documents() | MFG | ProductionOrder, QualityInspection, CycleCount
add_bank_documents() | BANK | BankingCustomer, BankAccount, BankTransaction
add_audit_documents() | AUDIT | AuditEngagement, Workpaper, AuditFinding, AuditEvidence
add_bank_recon_documents() | Bank Recon | BankReconciliation, BankStatementLine, ReconcilingItem
add_ocpm_events() | OCPM | Events as hyperedges (entity type 400)
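
A sketch of how these methods compose into a single multi-layer hypergraph; the builder type name HypergraphBuilder and the method signatures are assumptions, only the add_* method names come from the table above:

#![allow(unused)]
fn main() {
// Sketch only: type and argument names other than the add_* methods are hypothetical.
let mut builder = HypergraphBuilder::new();
builder.add_p2p_documents(&p2p_docs);      // P2P layer
builder.add_o2c_documents(&o2c_docs);      // O2C layer
builder.add_bank_documents(&bank_docs);    // BANK layer
builder.add_ocpm_events(&ocpm_events);     // events attached as hyperedges
let hypergraph = builder.build();
}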

See Also

datasynth-cli

Command-line interface for synthetic accounting data generation.

Overview

datasynth-cli provides the datasynth-data binary for command-line usage:

  • generate: Generate synthetic data from configuration
  • init: Create configuration files with industry presets
  • validate: Validate configuration files
  • info: Display available presets and options

Installation

cargo build --release
# Binary at: target/release/datasynth-data

Commands

generate

Generate synthetic financial data.

# From configuration file
datasynth-data generate --config config.yaml --output ./output

# Demo mode with defaults
datasynth-data generate --demo --output ./demo-output

# Override seed
datasynth-data generate --config config.yaml --output ./output --seed 12345

# Verbose output
datasynth-data generate --config config.yaml --output ./output -v

init

Create a configuration file from presets.

# Industry preset with complexity
datasynth-data init --industry manufacturing --complexity medium -o config.yaml

Available industries:

  • manufacturing
  • retail
  • financial_services
  • healthcare
  • technology
  • energy
  • telecom
  • transportation
  • hospitality

validate

Validate configuration files.

datasynth-data validate --config config.yaml

info

Display available options.

datasynth-data info

fingerprint

Privacy-preserving fingerprint operations.

# Extract fingerprint
datasynth-data fingerprint extract --input ./data.csv --output ./fp.dsf --privacy-level standard

# Validate fingerprint
datasynth-data fingerprint validate ./fp.dsf

# View fingerprint details
datasynth-data fingerprint info ./fp.dsf --detailed

# Evaluate fidelity
datasynth-data fingerprint evaluate --fingerprint ./fp.dsf --synthetic ./output/ --threshold 0.8

# Federated aggregation (v0.5.0)
datasynth-data fingerprint federated --sources ./a.dsf ./b.dsf --output ./combined.dsf --method weighted_average

diffusion (v0.5.0)

Diffusion model training and evaluation.

# Train diffusion model from fingerprint
datasynth-data diffusion train --fingerprint ./fp.dsf --output ./model.json

# Evaluate model fit
datasynth-data diffusion evaluate --model ./model.json --samples 5000

causal (v0.5.0)

Causal and counterfactual data generation.

# Generate from causal template
datasynth-data causal generate --template fraud_detection --samples 10000 --output ./causal/

# Run intervention
datasynth-data causal intervene --template fraud_detection --variable transaction_amount --value 50000 --samples 5000 --output ./intervention/

# Validate causal structure
datasynth-data causal validate --data ./causal/ --template fraud_detection

Key Types

CLI Arguments

#![allow(unused)]
fn main() {
#[derive(Parser)]
pub struct Cli {
    #[command(subcommand)]
    pub command: Command,

    /// Enable verbose logging
    #[arg(short, long)]
    pub verbose: bool,

    /// Suppress non-error output
    #[arg(short, long)]
    pub quiet: bool,
}

#[derive(Subcommand)]
pub enum Command {
    Generate(GenerateArgs),
    Init(InitArgs),
    Validate(ValidateArgs),
    Info,
    Fingerprint(FingerprintArgs),   // fingerprint subcommands
    Diffusion(DiffusionArgs),       // v0.5.0: diffusion model commands
    Causal(CausalArgs),             // v0.5.0: causal generation commands
}
}

Generate Arguments

#![allow(unused)]
fn main() {
pub struct GenerateArgs {
    /// Configuration file path
    #[arg(short, long)]
    pub config: Option<PathBuf>,

    /// Use demo preset
    #[arg(long)]
    pub demo: bool,

    /// Output directory (required)
    #[arg(short, long)]
    pub output: PathBuf,

    /// Override random seed
    #[arg(long)]
    pub seed: Option<u64>,

    /// Output format
    #[arg(long, default_value = "csv")]
    pub format: String,

    /// Attach a synthetic data certificate (v0.5.0)
    #[arg(long)]
    pub certificate: bool,
}

pub struct InitArgs {
    // ... existing fields ...

    /// Generate config from natural language description (v0.5.0)
    #[arg(long)]
    pub from_description: Option<String>,
}
}

Signal Handling

On Unix systems, pause/resume generation with SIGUSR1:

# Start in background
datasynth-data generate --config config.yaml --output ./output &

# Toggle pause
kill -USR1 $(pgrep datasynth-data)

Progress bar shows pause state:

[████████░░░░░░░░░░░░] 40% - 40000/100000 entries (PAUSED)

Exit Codes

Code | Description
0 | Success
1 | Configuration error
2 | Generation error
3 | I/O error

Environment Variables

Variable | Description
SYNTH_DATA_LOG | Log level (error, warn, info, debug, trace)
SYNTH_DATA_THREADS | Worker thread count
SYNTH_DATA_MEMORY_LIMIT | Memory limit in bytes

SYNTH_DATA_LOG=debug datasynth-data generate --demo --output ./output

Progress Display

During generation, a progress bar shows:

Generating synthetic data...
[████████████████████] 100% - 100000/100000 entries
Phase: Transactions | 85,432 entries/sec | ETA: 0:00

Generation complete!
- Journal entries: 100,000
- Document flows: 15,000
- Output: ./output/
- Duration: 1.2s

Usage Examples

Basic Generation

datasynth-data init --industry manufacturing -o config.yaml
datasynth-data generate --config config.yaml --output ./output

Scripting

#!/bin/bash
for industry in manufacturing retail healthcare; do
    datasynth-data init --industry $industry --complexity medium -o ${industry}.yaml
    datasynth-data generate --config ${industry}.yaml --output ./output/${industry}
done

CI/CD

# GitHub Actions
- name: Generate Test Data
  run: |
    cargo build --release
    ./target/release/datasynth-data generate --demo --output ./test-data

Reproducible Generation

# Same seed = same output
datasynth-data generate --config config.yaml --output ./run1 --seed 42
datasynth-data generate --config config.yaml --output ./run2 --seed 42
diff -r run1 run2  # No differences
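
The guarantee rests on seeded pseudo-randomness; a tiny, self-contained illustration (using the rand crate, independent of the CLI) of why a fixed seed reproduces the same stream:

use rand::{Rng, SeedableRng};
use rand::rngs::StdRng;

fn main() {
    // Same seed => identical pseudo-random stream => identical generated data.
    let mut a = StdRng::seed_from_u64(42);
    let mut b = StdRng::seed_from_u64(42);
    let xs: Vec<u32> = (0..3).map(|_| a.gen()).collect();
    let ys: Vec<u32> = (0..3).map(|_| b.gen()).collect();
    assert_eq!(xs, ys);
}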

See Also

datasynth-server

REST, gRPC, and WebSocket server for synthetic data generation.

Overview

datasynth-server provides server-based access to SyntheticData:

  • REST API: Configuration management and stream control
  • gRPC API: High-performance streaming generation
  • WebSocket: Real-time event streaming
  • Production Features: Authentication, rate limiting, timeouts

Starting the Server

cargo run -p datasynth-server -- --port 3000 --worker-threads 4

Command-Line Options

Option | Default | Description
--port | 3000 | HTTP/WebSocket port
--grpc-port | 50051 | gRPC port
--worker-threads | CPU cores | Worker thread count
--api-key | None | Required API key
--rate-limit | 100 | Max requests per minute
--memory-limit | None | Memory limit in bytes

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      datasynth-server                       │
├─────────────────────────────────────────────────────────────┤
│  REST API (Axum)  │  gRPC (Tonic)  │  WebSocket (Axum)      │
├─────────────────────────────────────────────────────────────┤
│                   Middleware Layer                          │
│  Auth │ Rate Limit │ Timeout │ CORS │ Logging               │
├─────────────────────────────────────────────────────────────┤
│                 Generation Service                          │
│        (wraps datasynth-runtime orchestrator)               │
└─────────────────────────────────────────────────────────────┘

REST API Endpoints

Configuration

# Get current configuration
curl http://localhost:3000/api/config

# Update configuration
curl -X POST http://localhost:3000/api/config \
  -H "Content-Type: application/json" \
  -d '{"transactions": {"target_count": 50000}}'

# Validate configuration
curl -X POST http://localhost:3000/api/config/validate \
  -H "Content-Type: application/json" \
  -d @config.json

Stream Control

# Start generation
curl -X POST http://localhost:3000/api/stream/start

# Pause
curl -X POST http://localhost:3000/api/stream/pause

# Resume
curl -X POST http://localhost:3000/api/stream/resume

# Stop
curl -X POST http://localhost:3000/api/stream/stop

# Trigger pattern (month_end, quarter_end, year_end)
curl -X POST http://localhost:3000/api/stream/trigger/month_end

Health Check

curl http://localhost:3000/health

WebSocket API

Connect to ws://localhost:3000/ws/events for real-time events.

Event Types

// Progress
{"type": "progress", "current": 50000, "total": 100000, "percent": 50.0}

// Entry (streamed data)
{"type": "entry", "data": {"document_id": "abc-123", ...}}

// Error
{"type": "error", "message": "Memory limit exceeded"}

// Complete
{"type": "complete", "total_entries": 100000, "duration_ms": 1200}

gRPC API

Proto Definition

syntax = "proto3";
package synth;

service SynthService {
  rpc GetConfig(Empty) returns (Config);
  rpc SetConfig(Config) returns (Status);
  rpc StartGeneration(GenerationRequest) returns (stream Entry);
  rpc StopGeneration(Empty) returns (Status);
}

Client Example

#![allow(unused)]
fn main() {
use synth::synth_client::SynthClient;

let mut client = SynthClient::connect("http://localhost:50051").await?;

let request = tonic::Request::new(GenerationRequest { count: Some(1000) });
let mut stream = client.start_generation(request).await?.into_inner();

while let Some(entry) = stream.message().await? {
    println!("Entry: {:?}", entry.document_id);
}
}

Middleware

Authentication

# With API key
curl -H "X-API-Key: your-key" http://localhost:3000/api/config

Rate Limiting

Sliding window rate limiter with per-client tracking.

// 429 response when exceeded
{
  "error": "rate_limit_exceeded",
  "retry_after": 30
}
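
A minimal sketch of the sliding-window idea (illustrative only, not the server's actual middleware), using the max_requests / window semantics of RateLimitConfig:

use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Allow at most `max_requests` timestamps within the trailing `window`.
struct SlidingWindow {
    max_requests: usize,
    window: Duration,
    hits: VecDeque<Instant>,
}

impl SlidingWindow {
    fn new(max_requests: usize, window: Duration) -> Self {
        Self { max_requests, window, hits: VecDeque::new() }
    }

    fn allow(&mut self) -> bool {
        let now = Instant::now();
        // Drop timestamps that fell out of the window.
        while self.hits.front().map_or(false, |t| now.duration_since(*t) > self.window) {
            self.hits.pop_front();
        }
        if self.hits.len() < self.max_requests {
            self.hits.push_back(now);
            true
        } else {
            false
        }
    }
}

fn main() {
    // Mirrors the documented defaults: 100 requests per 60-second window.
    let mut limiter = SlidingWindow::new(100, Duration::from_secs(60));
    assert!(limiter.allow());
}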

Request Timeout

Default timeout is 30 seconds. Long-running operations use streaming.

Key Types

Server Configuration

#![allow(unused)]
fn main() {
pub struct ServerConfig {
    pub port: u16,
    pub grpc_port: u16,
    pub worker_threads: usize,
    pub api_key: Option<String>,
    pub rate_limit: RateLimitConfig,
    pub memory_limit: Option<u64>,
    pub cors_origins: Vec<String>,
}
}

Rate Limit Configuration

#![allow(unused)]
fn main() {
pub struct RateLimitConfig {
    pub max_requests: u32,
    pub window_seconds: u64,
    pub exempt_paths: Vec<String>,
}
}

Production Deployment

Docker

FROM rust:1.88 as builder
WORKDIR /app
COPY . .
RUN cargo build --release -p datasynth-server

FROM debian:bookworm-slim
COPY --from=builder /app/target/release/datasynth-server /usr/local/bin/
EXPOSE 3000 50051
CMD ["datasynth-server", "--port", "3000"]

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: datasynth-server
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: datasynth-server
        image: datasynth-server:latest
        ports:
        - containerPort: 3000
        - containerPort: 50051
        env:
        - name: SYNTH_API_KEY
          valueFrom:
            secretKeyRef:
              name: synth-secrets
              key: api-key
        resources:
          limits:
            memory: "2Gi"

Monitoring

Health Endpoint

curl http://localhost:3000/health
{
  "status": "healthy",
  "uptime_seconds": 3600,
  "memory_usage_mb": 512,
  "active_streams": 2
}

Logging

Enable structured logging:

RUST_LOG=synth_server=info cargo run -p datasynth-server

See Also

datasynth-ui

Cross-platform desktop application for synthetic data generation.

Overview

datasynth-ui provides a graphical interface for SyntheticData:

  • Visual Configuration: Comprehensive UI for all configuration sections
  • Real-time Streaming: Live generation viewer with WebSocket
  • Preset Management: One-click industry preset application
  • Validation Feedback: Real-time configuration validation

Technology Stack

Component | Technology
Backend | Tauri 2.0 (Rust)
Frontend | SvelteKit + Svelte 5
Styling | TailwindCSS
State | Svelte stores with runes

Prerequisites

Linux (Ubuntu/Debian)

sudo apt install libgtk-3-dev libwebkit2gtk-4.1-dev \
    libappindicator3-dev librsvg2-dev

Linux (Fedora)

sudo dnf install gtk3-devel webkit2gtk4.1-devel \
    libappindicator-gtk3-devel librsvg2-devel

Linux (Arch)

sudo pacman -S webkit2gtk-4.1 base-devel curl wget file \
    openssl appmenu-gtk-module gtk3 librsvg libvips

macOS

No additional dependencies required (uses built-in WebKit).

Windows

WebView2 runtime (usually pre-installed on Windows 10/11).

Development

cd crates/datasynth-ui

# Install dependencies
npm install

# Frontend development (no desktop features)
npm run dev

# Desktop app development
npm run tauri dev

# Production build
npm run build
npm run tauri build

Project Structure

datasynth-ui/
├── src/                    # Svelte frontend
│   ├── routes/             # SvelteKit pages
│   │   ├── +page.svelte    # Dashboard
│   │   ├── config/         # Configuration pages (15+ sections)
│   │   │   ├── global/
│   │   │   ├── transactions/
│   │   │   ├── master-data/
│   │   │   └── ...
│   │   └── generate/
│   │       └── stream/     # Generation streaming viewer
│   └── lib/
│       ├── components/     # Reusable UI components
│       │   ├── forms/      # Form components
│       │   └── config/     # Config-specific components
│       ├── stores/         # Svelte stores
│       └── utils/          # Utilities
├── src-tauri/              # Rust backend
│   ├── src/
│   │   ├── lib.rs          # Tauri commands
│   │   └── main.rs         # App entry point
│   └── Cargo.toml
├── e2e/                    # Playwright E2E tests
├── package.json
└── tauri.conf.json

Configuration Sections

Section | Description
Global | Industry, dates, seed, performance
Transactions | Line items, amounts, sources
Master Data | Vendors, customers, materials
Document Flows | P2P, O2C configuration
Financial | Balance, subledger, FX, period close
Compliance | Fraud, controls, approval
Analytics | Graph export, anomaly, data quality
Output | Formats, compression

Key Components

Config Store

// src/lib/stores/config.ts
import { writable } from 'svelte/store';

export const config = writable<Config>(defaultConfig);
export const isDirty = writable(false);

export function updateConfig(section: string, value: any) {
    config.update(c => ({...c, [section]: value}));
    isDirty.set(true);
}

Form Components

<!-- src/lib/components/forms/InputNumber.svelte -->
<script lang="ts">
  export let value: number;
  export let min: number = 0;
  export let max: number = Infinity;
  export let label: string;
</script>

<label>
  {label}
  <input type="number" bind:value {min} {max} />
</label>

Tauri Commands

#![allow(unused)]
fn main() {
// src-tauri/src/lib.rs
#[tauri::command]
async fn save_config(config: Config) -> Result<(), String> {
    // Save configuration
}

#[tauri::command]
async fn start_generation(config: Config) -> Result<(), String> {
    // Start generation via datasynth-runtime
}
}

Server Connection

The UI connects to datasynth-server for streaming:

# Start server first
cargo run -p datasynth-server

# Then run UI
npm run tauri dev

Default server URL: http://localhost:3000

Testing

# Unit tests
npm test

# E2E tests with Playwright
npx playwright test

# E2E with UI
npx playwright test --ui

Build Output

Production builds create platform-specific packages:

Platform | Output
Windows | .msi, .exe
macOS | .dmg, .app
Linux | .deb, .AppImage, .rpm

Located in: src-tauri/target/release/bundle/

UI Features

Dashboard

  • System overview
  • Quick stats
  • Recent generations

Configuration Editor

  • Visual form editors for all sections
  • Real-time validation
  • Dirty state tracking
  • Export to YAML/JSON

Streaming Viewer

  • Real-time progress
  • Entry preview table
  • Memory usage graph
  • Pause/resume controls

Preset Selector

  • Industry presets
  • Complexity levels
  • One-click application

See Also

datasynth-eval

Evaluation framework for synthetic financial data quality and coherence.

Overview

datasynth-eval provides automated quality assessment for generated data:

  • Statistical Evaluation: Benford’s Law compliance, distribution analysis
  • Coherence Checking: Balance verification, document chain integrity
  • Intercompany Validation: IC matching and elimination verification
  • Data Quality Analysis: Completeness, consistency, format validation
  • ML Readiness: Feature distributions, label quality, graph structure
  • Enhancement Derivation: Auto-tuning with configuration recommendations

Evaluation Categories

Category | Description
Statistical | Benford’s Law, amount distributions, temporal patterns, line items
Coherence | Trial balance, subledger reconciliation, FX consistency, document chains
Intercompany | IC matching rates, elimination completeness
Quality | Completeness, consistency, duplicates, format validation, uniqueness
ML Readiness | Feature distributions, label quality, graph structure, train/val/test splits
Enhancement | Auto-tuning, configuration recommendations, root cause analysis

Module Structure

Module | Description
statistical/ | Benford’s Law, amount distributions, temporal patterns
coherence/ | Balance sheet, IC matching, document chains, subledger reconciliation
quality/ | Completeness, consistency, duplicates, formats, uniqueness
ml/ | Feature analysis, label quality, graph structure, splits
report/ | HTML and JSON report generation with baseline comparisons
tuning/ | Configuration optimization recommendations
enhancement/ | Auto-tuning engine with config patch generation

Key Types

Evaluator

#![allow(unused)]
fn main() {
pub struct Evaluator {
    config: EvaluationConfig,
    checkers: Vec<Box<dyn Checker>>,
}

pub struct EvaluationConfig {
    pub benford_threshold: f64,      // Chi-square threshold
    pub balance_tolerance: Decimal,   // Allowed imbalance
    pub ic_match_threshold: f64,      // Required match rate
    pub duplicate_check: bool,
}
}

Evaluation Report

#![allow(unused)]
fn main() {
pub struct EvaluationReport {
    pub overall_status: Status,
    pub categories: Vec<CategoryResult>,
    pub warnings: Vec<Warning>,
    pub details: Vec<Finding>,
    pub scores: Scores,
}

pub struct Scores {
    pub benford_score: f64,           // 0.0-1.0
    pub balance_coherence: f64,       // 0.0-1.0
    pub ic_matching_rate: f64,        // 0.0-1.0
    pub uniqueness_score: f64,        // 0.0-1.0
}

pub enum Status {
    Passed,
    PassedWithWarnings,
    Failed,
}
}

Usage Examples

Basic Evaluation

#![allow(unused)]
fn main() {
use synth_eval::{Evaluator, EvaluationConfig};

let evaluator = Evaluator::new(EvaluationConfig::default());
let report = evaluator.evaluate(&generated_data)?;

println!("Status: {:?}", report.overall_status);
println!("Benford compliance: {:.2}%", report.scores.benford_score * 100.0);
}

Custom Configuration

#![allow(unused)]
fn main() {
let config = EvaluationConfig {
    benford_threshold: 0.05,          // 5% significance level
    balance_tolerance: dec!(0.01),    // 1 cent tolerance
    ic_match_threshold: 0.99,         // 99% required match
    duplicate_check: true,
};

let evaluator = Evaluator::new(config);
}

Category-Specific Evaluation

#![allow(unused)]
fn main() {
use synth_eval::checkers::{BenfordChecker, BalanceChecker};

let benford = BenfordChecker::new(0.05);
let result = benford.check(&amounts)?;

let balance = BalanceChecker::new(dec!(0.01));
let result = balance.check(&trial_balance)?;
}

Evaluation Checks

Benford’s Law

Verifies first-digit distribution follows Benford’s Law:

#![allow(unused)]
fn main() {
// Expected: P(d) = log10(1 + 1/d)
// d=1: 30.1%, d=2: 17.6%, d=3: 12.5%, etc.

let benford_result = evaluator.check_benford(&amounts)?;

if benford_result.chi_square > critical_value {
    println!("Warning: Amounts don't follow Benford's Law");
}
}

Balance Coherence

Verifies accounting equation:

#![allow(unused)]
fn main() {
// Assets = Liabilities + Equity
let balance_result = evaluator.check_balance(&trial_balance)?;

if !balance_result.passed {
    println!("Imbalance: {:?}", balance_result.difference);
}
}

Document Chain Integrity

Verifies document references:

#![allow(unused)]
fn main() {
// PO → GR → Invoice → Payment chain
let chain_result = evaluator.check_document_chains(&documents)?;

for broken_chain in &chain_result.broken_chains {
    println!("Broken chain: {} → {}", broken_chain.from, broken_chain.to);
}
}

IC Matching

Verifies intercompany transactions match:

#![allow(unused)]
fn main() {
let ic_result = evaluator.check_ic_matching(&ic_entries)?;

println!("Match rate: {:.2}%", ic_result.match_rate * 100.0);
println!("Unmatched: {}", ic_result.unmatched.len());
}

Uniqueness

Detects duplicate document IDs:

#![allow(unused)]
fn main() {
let unique_result = evaluator.check_uniqueness(&entries)?;

if !unique_result.duplicates.is_empty() {
    for dup in &unique_result.duplicates {
        println!("Duplicate ID: {}", dup.document_id);
    }
}
}

Report Output

Console Report

#![allow(unused)]
fn main() {
evaluator.print_report(&report);
}
=== Evaluation Report ===
Status: PASSED

Scores:
  Benford Compliance:    98.5%
  Balance Coherence:    100.0%
  IC Matching Rate:      99.8%
  Uniqueness:           100.0%

Warnings:
  - 3 entries with unusual amounts detected

Categories:
  [✓] Statistical:   PASSED
  [✓] Coherence:     PASSED
  [✓] Intercompany:  PASSED
  [✓] Uniqueness:    PASSED

JSON Report

#![allow(unused)]
fn main() {
let json = evaluator.to_json(&report)?;
std::fs::write("evaluation_report.json", json)?;
}

Integration with Generation

#![allow(unused)]
fn main() {
use synth_runtime::GenerationOrchestrator;
use synth_eval::Evaluator;

let orchestrator = GenerationOrchestrator::new(config)?;
let data = orchestrator.run()?;

// Evaluate generated data
let evaluator = Evaluator::new(EvaluationConfig::default());
let report = evaluator.evaluate(&data)?;

if report.overall_status == Status::Failed {
    return Err("Generated data failed quality checks".into());
}
}

Enhancement Module

The enhancement module provides automatic configuration tuning based on evaluation results.

Pipeline Flow

Evaluation Results → Threshold Check → Gap Analysis → Root Cause → Config Suggestion

Auto-Tuning

#![allow(unused)]
fn main() {
use synth_eval::enhancement::{AutoTuner, AutoTuneResult};

let tuner = AutoTuner::new();
let result: AutoTuneResult = tuner.analyze(&evaluation);

for patch in result.patches_by_confidence() {
    println!("{}: {} → {} (confidence: {:.0}%)",
        patch.path,
        patch.current_value.as_deref().unwrap_or("?"),
        patch.suggested_value,
        patch.confidence * 100.0
    );
}
}

Key Types

#![allow(unused)]
fn main() {
pub struct ConfigPatch {
    pub path: String,              // e.g., "transactions.amount.benford_compliance"
    pub current_value: Option<String>,
    pub suggested_value: String,
    pub confidence: f64,           // 0.0-1.0
    pub expected_impact: String,
}

pub struct AutoTuneResult {
    pub patches: Vec<ConfigPatch>,
    pub expected_improvement: f64,
    pub addressed_metrics: Vec<String>,
    pub unaddressable_metrics: Vec<String>,
    pub summary: String,
}
}

Recommendation Engine

#![allow(unused)]
fn main() {
use synth_eval::enhancement::{RecommendationEngine, RecommendationPriority};

let engine = RecommendationEngine::new();
let recommendations = engine.generate(&evaluation);

for rec in recommendations.iter().filter(|r| r.priority == RecommendationPriority::Critical) {
    println!("CRITICAL: {} - {}", rec.title, rec.root_cause.description);
}
}

Metric-to-Config Mappings

Metric | Config Path | Strategy
benford_p_value | transactions.amount.benford_compliance | Enable boolean
round_number_ratio | transactions.amount.round_number_bias | Set to target
temporal_correlation | transactions.temporal.seasonality_strength | Increase by gap
anomaly_rate | anomaly_injection.base_rate | Set to target
ic_match_rate | intercompany.match_precision | Increase by gap
completeness_rate | data_quality.missing_values.overall_rate | Decrease by gap

See Also

datasynth-banking

KYC/AML banking transaction generator for synthetic data.

Overview

datasynth-banking provides comprehensive banking transaction simulation for:

  • Compliance testing and model training
  • AML/fraud detection system evaluation
  • KYC process simulation
  • Regulatory reporting testing

Features

Feature | Description
Customer Generation | Retail, business, and trust customers with realistic KYC profiles
Account Generation | Multiple account types with proper feature sets
Transaction Engine | Persona-based transaction generation with causal drivers
AML Typologies | Structuring, funnel accounts, layering, mule networks, and more
Ground Truth Labels | Multi-level labels for ML training
Spoofing Mode | Adversarial transaction generation for robustness testing

Architecture

BankingOrchestrator (orchestration)
        |
Generators (customer, account, transaction, counterparty)
        |
Typologies (AML pattern injection)
        |
Labels (ground truth generation)
        |
Models (customer, account, transaction, KYC)

Module Structure

Models

Model | Description
BankingCustomer | Retail, Business, Trust customer types
BankAccount | Account with type, features, status
BankTransaction | Transaction with direction, channel, category
KycProfile | Expected activity envelope for compliance
CounterpartyPool | Transaction counterparty management
CaseNarrative | Investigation and compliance narratives

KYC Profile

#![allow(unused)]
fn main() {
pub struct KycProfile {
    pub declared_purpose: String,
    pub turnover_band: TurnoverBand,
    pub transaction_frequency: TransactionFrequency,
    pub expected_categories: Vec<TransactionCategory>,
    pub source_of_funds: SourceOfFunds,
    pub source_of_wealth: SourceOfWealth,
    pub geographic_exposure: Vec<String>,
    pub cash_intensity: CashIntensity,
    pub beneficial_owner_complexity: OwnerComplexity,
    // Ground truth fields
    pub is_deceiving: bool,
    pub actual_turnover_band: Option<TurnoverBand>,
}
}

AML Typologies

Typology | Description
Structuring | Transactions below reporting thresholds ($10K)
Funnel Accounts | Multiple small deposits, few large withdrawals
Layering | Complex transaction chains to obscure origin
Mule Networks | Money mule payment chains
Round-Tripping | Circular transaction patterns
Credit Card Fraud | Fraudulent card transactions
Synthetic Identity | Fake identity transactions
Spoofing | Adversarial patterns for model testing
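
To make the structuring typology above concrete, a simplified, self-contained sketch of splitting a large total into tranches that each stay below the reporting threshold (illustrative only, not the crate's implementation):

// Structuring keeps individual amounts under the reporting threshold (e.g. $10,000).
fn split_below_threshold(total: f64, threshold: f64) -> Vec<f64> {
    let mut remaining = total;
    let mut parts = Vec::new();
    // Fixed tranche size below the threshold; real patterns would also randomize timing and amounts.
    let chunk = threshold * 0.9; // e.g. $9,000 tranches
    while remaining > chunk {
        parts.push(chunk);
        remaining -= chunk;
    }
    if remaining > 0.0 {
        parts.push(remaining);
    }
    parts
}

fn main() {
    let deposits = split_below_threshold(25_000.0, 10_000.0);
    println!("{deposits:?}"); // [9000.0, 9000.0, 7000.0]
}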

Labels

Label Type | Description
Entity Labels | Customer-level risk classifications
Relationship Labels | Relationship risk indicators
Transaction Labels | Transaction-level classifications
Narrative Labels | Investigation case narratives

Usage Examples

Basic Generation

#![allow(unused)]
fn main() {
use synth_banking::{BankingOrchestrator, BankingConfig};

let config = BankingConfig::default();
let mut orchestrator = BankingOrchestrator::new(config, 12345);

// Generate customers and accounts
let customers = orchestrator.generate_customers();
let accounts = orchestrator.generate_accounts(&customers);

// Generate transaction stream
let transactions = orchestrator.generate_transactions(&accounts);
}

With AML Typologies

#![allow(unused)]
fn main() {
use synth_banking::{BankingConfig, TypologyConfig};

let config = BankingConfig {
    customer_count: 1000,
    typologies: TypologyConfig {
        structuring_rate: 0.02,   // 2% structuring patterns
        funnel_rate: 0.01,        // 1% funnel accounts
        mule_rate: 0.005,         // 0.5% mule networks
        ..Default::default()
    },
    ..Default::default()
};
}

Accessing Labels

#![allow(unused)]
fn main() {
let labels = orchestrator.generate_labels();

// Entity-level labels
for entity_label in &labels.entity_labels {
    println!("Customer {} risk: {:?}",
        entity_label.customer_id,
        entity_label.risk_tier
    );
}

// Transaction-level labels
for tx_label in &labels.transaction_labels {
    if tx_label.is_suspicious {
        println!("Suspicious: {} - {:?}",
            tx_label.transaction_id,
            tx_label.typology
        );
    }
}
}

Key Types

Customer Types

#![allow(unused)]
fn main() {
pub enum BankingCustomerType {
    Retail,     // Individual customers
    Business,   // Business accounts
    Trust,      // Trust/corporate entities
}
}

Risk Tiers

#![allow(unused)]
fn main() {
pub enum RiskTier {
    Low,
    Medium,
    High,
    Prohibited,
}
}

Transaction Categories

#![allow(unused)]
fn main() {
pub enum TransactionCategory {
    SalaryWages,
    BusinessPayment,
    Investment,
    RealEstate,
    Gambling,
    Cryptocurrency,
    CashDeposit,
    CashWithdrawal,
    WireTransfer,
    AtmWithdrawal,
    PosPayment,
    OnlinePayment,
    // ... more categories
}
}

AML Typologies

#![allow(unused)]
fn main() {
pub enum AmlTypology {
    Structuring,
    Funnel,
    Layering,
    Mule,
    RoundTripping,
    CreditCardFraud,
    SyntheticIdentity,
    None,
}
}

Export Files

File | Description
banking_customers.csv | Customer profiles with KYC data
bank_accounts.csv | Account records with features
bank_transactions.csv | Transaction records
kyc_profiles.csv | Expected activity envelopes
counterparties.csv | Counterparty pool
entity_risk_labels.csv | Entity-level risk classifications
transaction_risk_labels.csv | Transaction-level labels
aml_typology_labels.csv | AML typology ground truth

See Also

datasynth-ocpm

Object-Centric Process Mining (OCPM) models and generators.

Overview

datasynth-ocpm provides OCEL 2.0 compliant event log generation across 8 enterprise process families:

  • OCEL 2.0 Models: Events, objects, relationships per IEEE standard
  • 8 Process Generators: P2P, O2C, S2C, H2R, MFG, BANK, AUDIT, Bank Recon
  • 88 Activity Types: Covering the full enterprise lifecycle
  • 52 Object Types: With lifecycle states and inter-object relationships
  • Export Formats: OCEL 2.0 JSON, XML, and SQLite

OCEL 2.0 Standard

Implements the Object-Centric Event Log standard:

Element | Description
Events | Activities with timestamps and attributes
Objects | Business objects (PO, Invoice, Payment, etc.)
Object Types | Type definitions with attribute schemas
Relationships | Object-to-object relationships
Event-Object Links | Many-to-many event-object associations

Key Types

OCEL Models

#![allow(unused)]
fn main() {
pub struct OcelEventLog {
    pub object_types: Vec<ObjectType>,
    pub event_types: Vec<EventType>,
    pub objects: Vec<Object>,
    pub events: Vec<Event>,
    pub relationships: Vec<ObjectRelationship>,
}

pub struct Event {
    pub id: String,
    pub event_type: String,
    pub timestamp: DateTime<Utc>,
    pub attributes: HashMap<String, Value>,
    pub objects: Vec<ObjectRef>,
}

pub struct Object {
    pub id: String,
    pub object_type: String,
    pub attributes: HashMap<String, Value>,
}
}

Process Flow Documents

#![allow(unused)]
fn main() {
pub struct P2pDocuments {
    pub po_number: String,
    pub vendor_id: String,
    pub company_code: String,
    pub amount: Decimal,
    pub currency: String,
}

pub struct O2cDocuments {
    pub so_number: String,
    pub customer_id: String,
    pub company_code: String,
    pub amount: Decimal,
    pub currency: String,
}
}

Process Flows

Procure-to-Pay (P2P)

Create PO → Approve PO → Release PO → Create GR → Post GR →
Receive Invoice → Verify Invoice → Post Invoice → Execute Payment

Events generated:

  • Create Purchase Order
  • Approve Purchase Order
  • Release Purchase Order
  • Create Goods Receipt
  • Post Goods Receipt
  • Receive Vendor Invoice
  • Verify Three-Way Match
  • Post Vendor Invoice
  • Execute Payment

Order-to-Cash (O2C)

Create SO → Check Credit → Release SO → Create Delivery →
Pick → Pack → Ship → Create Invoice → Post Invoice → Receive Payment

Events generated:

  • Create Sales Order
  • Check Credit
  • Release Sales Order
  • Create Delivery
  • Pick Materials
  • Pack Shipment
  • Ship Goods
  • Create Customer Invoice
  • Post Customer Invoice
  • Receive Customer Payment

Usage Examples

Generate P2P Case

#![allow(unused)]
fn main() {
use synth_ocpm::{OcpmGenerator, P2pDocuments};

let mut generator = OcpmGenerator::new(seed);

let documents = P2pDocuments::new(
    "PO-001",
    "V-001",
    "1000",
    dec!(5000.00),
    "USD",
);

let users = vec!["user1", "user2", "user3"];
let start_time = Utc::now();

let result = generator.generate_p2p_case(&documents, start_time, &users);
}

Generate O2C Case

#![allow(unused)]
fn main() {
use synth_ocpm::{OcpmGenerator, O2cDocuments};

let documents = O2cDocuments::new(
    "SO-001",
    "C-001",
    "1000",
    dec!(10000.00),
    "USD",
);

let result = generator.generate_o2c_case(&documents, start_time, &users);
}

Generate Complete Event Log

#![allow(unused)]
fn main() {
use synth_ocpm::OcpmGenerator;

let mut generator = OcpmGenerator::new(seed);
let event_log = generator.generate_event_log(
    1000,   // p2p_count
    500,    // o2c_count
    start_date,
    end_date,
)?;
}

Export Formats

OCEL 2.0 JSON

#![allow(unused)]
fn main() {
use synth_ocpm::export::{Ocel2Exporter, ExportFormat};

let exporter = Ocel2Exporter::new(ExportFormat::Json);
exporter.export(&event_log, "output/ocel2.json")?;
}

Output structure:

{
  "objectTypes": [...],
  "eventTypes": [...],
  "objects": [...],
  "events": [...],
  "relations": [...]
}

OCEL 2.0 XML

#![allow(unused)]
fn main() {
let exporter = Ocel2Exporter::new(ExportFormat::Xml);
exporter.export(&event_log, "output/ocel2.xml")?;
}

SQLite Database

#![allow(unused)]
fn main() {
let exporter = Ocel2Exporter::new(ExportFormat::Sqlite);
exporter.export(&event_log, "output/ocel2.sqlite")?;
}

Tables created:

  • object_types
  • event_types
  • objects
  • events
  • event_objects
  • object_relationships

Process Families (v0.6.2)

Family | Generator | Activities | Object Types | Variants
P2P | generate_p2p_case() | 9 | PurchaseOrder, GoodsReceipt, VendorInvoice, Payment, Material, Vendor | Happy, Exception, Error
O2C | generate_o2c_case() | 10 | SalesOrder, Delivery, CustomerInvoice, CustomerPayment, Material, Customer | Happy, Exception, Error
S2C | generate_s2c_case() | 8 | SourcingProject, SupplierQualification, RfxEvent, SupplierBid, BidEvaluation, ProcurementContract | Happy, Exception, Error
H2R | generate_h2r_case() | 8 | PayrollRun, PayrollLineItem, TimeEntry, ExpenseReport | Happy, Exception, Error
MFG | generate_mfg_case() | 10 | ProductionOrder, RoutingOperation, QualityInspection, CycleCount | Happy, Exception, Error
BANK | generate_bank_case() | 8 | BankingCustomer, BankAccount, BankTransaction | Happy, Exception, Error
AUDIT | generate_audit_case() | 10 | AuditEngagement, Workpaper, AuditFinding, AuditEvidence, RiskAssessment, ProfessionalJudgment | Happy, Exception, Error
Bank Recon | generate_bank_recon_case() | 8 | BankReconciliation, BankStatementLine, ReconcilingItem | Happy, Exception, Error

Variant distribution: HappyPath (75%), ExceptionPath (20%), ErrorPath (5%).
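
One way to realize this 75/20/5 split with a single uniform draw (illustrative only; the enum and function here are not the crate's types):

#[derive(Debug)]
enum Variant { Happy, Exception, Error }

fn sample_variant(u: f64) -> Variant {
    // u is assumed to be uniform in [0, 1)
    if u < 0.75 {
        Variant::Happy
    } else if u < 0.95 {
        Variant::Exception
    } else {
        Variant::Error
    }
}

fn main() {
    println!("{:?}", sample_variant(0.80)); // Exception
}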

Object Types (P2P/O2C)

Type | Description
PurchaseOrder | P2P ordering document
GoodsReceipt | Inventory receipt
VendorInvoice | AP invoice
Payment | Payment document
SalesOrder | O2C ordering document
Delivery | Shipment document
CustomerInvoice | AR invoice
CustomerPayment | Customer receipt
Material | Product/item
Vendor | Supplier
Customer | Customer/buyer

Integration with Process Mining Tools

OCEL 2.0 exports are compatible with:

  • PM4Py: Python process mining library
  • Celonis: Enterprise process mining platform
  • PROM: Academic process mining toolkit
  • OCPA: Object-centric process analysis tool

Loading in PM4Py

import pm4py
from pm4py.objects.ocel.importer import jsonocel

ocel = jsonocel.apply("ocel2.json")
print(f"Events: {len(ocel.events)}")
print(f"Objects: {len(ocel.objects)}")

See Also

datasynth-fingerprint

Privacy-preserving fingerprint extraction from real data and synthesis of matching synthetic data.

Overview

The datasynth-fingerprint crate provides tools for extracting statistical fingerprints from real datasets while preserving privacy through differential privacy mechanisms and k-anonymity. These fingerprints can then be used to generate synthetic data that matches the statistical properties of the original data without exposing sensitive information.

Architecture

Real Data → Extract → .dsf File → Generate → Synthetic Data → Evaluate

The fingerprinting workflow consists of three main stages:

  1. Extraction: Analyze real data and extract statistical properties
  2. Synthesis: Generate configuration and synthetic data from fingerprints
  3. Evaluation: Validate synthetic data fidelity against fingerprints

Key Components

Models (models/)

Model | Description
Fingerprint | Root container with manifest, schema, statistics, correlations, integrity, rules, anomalies, privacy_audit
Manifest | Version, format, created_at, source metadata, privacy metadata, checksums, optional signature
SchemaFingerprint | Tables with columns, data types, cardinalities, relationships
StatisticsFingerprint | Numeric stats (distribution, percentiles, Benford), categorical stats (frequencies, entropy)
CorrelationFingerprint | Correlation matrices with copula parameters
IntegrityFingerprint | Foreign key definitions, cardinality rules
RulesFingerprint | Balance rules, approval thresholds
AnomalyFingerprint | Anomaly rates, type distributions, temporal patterns
PrivacyAudit | Actions log, epsilon spent, k-anonymity, warnings

Privacy Engine (privacy/)

Component | Description
LaplaceMechanism | Differential privacy with configurable epsilon
GaussianMechanism | Alternative DP mechanism for (ε,δ)-privacy
KAnonymity | Suppression of rare categorical values below the k threshold
PrivacyEngine | Unified interface combining DP, k-anonymity, winsorization
PrivacyAuditBuilder | Build privacy audit with actions and warnings

Privacy Levels

Level | Epsilon | k | Outlier % | Use Case
Minimal | 5.0 | 3 | 99% | Low privacy, high utility
Standard | 1.0 | 5 | 95% | Balanced (default)
High | 0.5 | 10 | 90% | Higher privacy
Maximum | 0.1 | 20 | 85% | Maximum privacy

Extraction Engine (extraction/)

Extractor | Description
FingerprintExtractor | Main coordinator for all extraction
SchemaExtractor | Infer data types, cardinalities, relationships
StatsExtractor | Compute distributions, percentiles, Benford analysis
CorrelationExtractor | Pearson correlations, copula fitting
IntegrityExtractor | Detect foreign key relationships
RulesExtractor | Detect balance rules, approval patterns
AnomalyExtractor | Analyze anomaly rates and patterns

I/O (io/)

Component | Description
FingerprintWriter | Write .dsf files (ZIP with YAML/JSON components)
FingerprintReader | Read .dsf files with checksum verification
FingerprintValidator | Validate DSF structure and integrity
validate_dsf() | Convenience function for CLI validation

Synthesis (synthesis/)

Component | Description
ConfigSynthesizer | Convert fingerprint to GeneratorConfig
DistributionFitter | Fit AmountSampler parameters from statistics
GaussianCopula | Generate correlated values preserving multivariate structure

Evaluation (evaluation/)

Component | Description
FidelityEvaluator | Compare synthetic data against fingerprint
FidelityReport | Overall score, component scores, pass/fail status
FidelityConfig | Thresholds and weights for evaluation

Federated Fingerprinting (federated/) — v0.5.0

Component | Description
FederatedFingerprintProtocol | Orchestrates multi-source fingerprint aggregation
PartialFingerprint | Per-source fingerprint with local DP (epsilon, means, stds, correlations)
AggregatedFingerprint | Combined fingerprint with total epsilon and source count
AggregationMethod | WeightedAverage, Median, or TrimmedMean strategies
FederatedConfig | min_sources, max_epsilon_per_source, aggregation_method

Certificates (certificates/) — v0.5.0

Component | Description
SyntheticDataCertificate | Certificate with DP guarantees, quality metrics, config hash, signature
CertificateBuilder | Builder pattern for constructing certificates
DpGuarantee | DP mechanism, epsilon, delta, composition method, total queries
QualityMetrics | Benford MAD, correlation preservation, statistical fidelity, MIA AUC
sign_certificate() | HMAC-SHA256 signing
verify_certificate() | Signature verification

Privacy-Utility Frontier (privacy/pareto.rs) — v0.5.0

Component | Description
ParetoFrontier | Explore the privacy-utility tradeoff space
ParetoPoint | Epsilon, utility_score, benford_mad, correlation_score
recommend() | Recommend optimal epsilon for a target utility

DSF File Format

The DataSynth Fingerprint (.dsf) file is a ZIP archive containing:

fingerprint.dsf (ZIP)
├── manifest.json       # Version, checksums, privacy config
├── schema.yaml         # Tables, columns, relationships
├── statistics.yaml     # Distributions, percentiles, Benford
├── correlations.yaml   # Correlation matrices, copulas
├── integrity.yaml      # FK relationships, cardinality
├── rules.yaml          # Balance constraints, approval thresholds
├── anomalies.yaml      # Anomaly rates, type distribution
└── privacy_audit.json  # Privacy decisions, epsilon spent
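
Because a .dsf file is a plain ZIP archive, its components can be inspected with any ZIP reader, for example with the zip crate (a listed dependency) rather than FingerprintReader:

use std::fs::File;
use zip::ZipArchive;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the fingerprint as an ordinary ZIP archive and list its entries.
    let file = File::open("fingerprint.dsf")?;
    let archive = ZipArchive::new(file)?;
    for name in archive.file_names() {
        println!("{name}"); // manifest.json, schema.yaml, statistics.yaml, ...
    }
    Ok(())
}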

Usage

Extracting a Fingerprint

#![allow(unused)]
fn main() {
use datasynth_fingerprint::{
    extraction::FingerprintExtractor,
    privacy::{PrivacyEngine, PrivacyLevel},
    io::FingerprintWriter,
};

// Create privacy engine with standard level
let privacy = PrivacyEngine::new(PrivacyLevel::Standard);

// Extract fingerprint from CSV data
let extractor = FingerprintExtractor::new(privacy);
let fingerprint = extractor.extract_from_csv("data.csv")?;

// Write to DSF file
let writer = FingerprintWriter::new();
writer.write(&fingerprint, "fingerprint.dsf")?;
}

Reading a Fingerprint

#![allow(unused)]
fn main() {
use datasynth_fingerprint::io::FingerprintReader;

let reader = FingerprintReader::new();
let fingerprint = reader.read("fingerprint.dsf")?;

println!("Tables: {:?}", fingerprint.schema.tables.len());
println!("Privacy epsilon spent: {}", fingerprint.privacy_audit.epsilon_spent);
}

Validating a Fingerprint

#![allow(unused)]
fn main() {
use datasynth_fingerprint::io::validate_dsf;

match validate_dsf("fingerprint.dsf") {
    Ok(report) => println!("Valid: {:?}", report),
    Err(e) => eprintln!("Invalid: {}", e),
}
}

Synthesizing Configuration

#![allow(unused)]
fn main() {
use datasynth_fingerprint::synthesis::ConfigSynthesizer;

let synthesizer = ConfigSynthesizer::new();
let config = synthesizer.synthesize(&fingerprint)?;

// Use config with datasynth-generators
}

Evaluating Fidelity

#![allow(unused)]
fn main() {
use datasynth_fingerprint::evaluation::{FidelityEvaluator, FidelityConfig};

let config = FidelityConfig::default();
let evaluator = FidelityEvaluator::new(config);

let report = evaluator.evaluate(&fingerprint, "./synthetic_data/")?;

println!("Overall score: {:.2}", report.overall_score);
println!("Pass: {}", report.passed);

for (metric, score) in &report.component_scores {
    println!("  {}: {:.2}", metric, score);
}
}

Federated Fingerprinting

#![allow(unused)]
fn main() {
use datasynth_fingerprint::federated::{
    FederatedFingerprintProtocol, FederatedConfig, AggregationMethod,
};

let config = FederatedConfig {
    min_sources: 2,
    max_epsilon_per_source: 5.0,
    aggregation_method: AggregationMethod::WeightedAverage,
};

let protocol = FederatedFingerprintProtocol::new(config);

// Create partial fingerprints from each data source
let partial1 = FederatedFingerprintProtocol::create_partial(
    "source_a", vec!["amount".into(), "count".into()], 10000,
    vec![5000.0, 3.0], vec![2000.0, 1.5], 1.0,
);
let partial2 = FederatedFingerprintProtocol::create_partial(
    "source_b", vec!["amount".into(), "count".into()], 8000,
    vec![4500.0, 2.8], vec![1800.0, 1.2], 1.0,
);

// Aggregate without centralizing raw data
let aggregated = protocol.aggregate(&[partial1, partial2])?;
println!("Total epsilon: {}", aggregated.total_epsilon);
}

Synthetic Data Certificates

#![allow(unused)]
fn main() {
use datasynth_fingerprint::certificates::{
    CertificateBuilder, DpGuarantee, QualityMetrics,
    sign_certificate, verify_certificate,
};

let mut cert = CertificateBuilder::new("DataSynth v0.5.0")
    .with_dp_guarantee(DpGuarantee {
        mechanism: "Laplace".into(),
        epsilon: 1.0,
        delta: None,
        composition_method: "sequential".into(),
        total_queries: 50,
    })
    .with_quality_metrics(QualityMetrics {
        benford_mad: Some(0.008),
        correlation_preservation: Some(0.95),
        statistical_fidelity: Some(0.92),
        mia_auc: Some(0.52),
    })
    .with_seed(42)
    .build();

// Sign and verify
sign_certificate(&mut cert, "my-secret-key");
assert!(verify_certificate(&cert, "my-secret-key"));
}

Fidelity Metrics

Category | Metrics
Statistical | KS statistic, Wasserstein distance, Benford MAD
Correlation | Correlation matrix RMSE
Schema | Column type match, row count ratio
Rules | Balance equation compliance rate
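
As an illustration of the Benford MAD metric (the general formula, not necessarily the crate's exact implementation): the mean absolute deviation between observed and expected leading-digit frequencies.

// Benford MAD: average |observed(d) - log10(1 + 1/d)| over the nine leading digits.
fn benford_mad(observed: &[f64; 9]) -> f64 {
    (1usize..=9)
        .map(|d| {
            let expected = (1.0 + 1.0 / d as f64).log10();
            (observed[d - 1] - expected).abs()
        })
        .sum::<f64>()
        / 9.0
}

fn main() {
    // Perfectly Benford-distributed data has a MAD of 0.
    let expected: [f64; 9] = std::array::from_fn(|i| (1.0 + 1.0 / (i as f64 + 1.0)).log10());
    assert!(benford_mad(&expected) < 1e-12);
}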

Privacy Guarantees

The fingerprint extraction process provides the following privacy guarantees:

  1. Differential Privacy: Numeric statistics are perturbed using Laplace or Gaussian mechanisms with a configurable epsilon budget (see the sketch after this list)
  2. K-Anonymity: Categorical values appearing fewer than k times are suppressed
  3. Winsorization: Outliers are clipped to prevent identification of extreme values
  4. Audit Trail: All privacy decisions are logged for compliance verification
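
A minimal sketch of guarantee 1, Laplace noise calibrated to sensitivity/epsilon, using the rand crate; this illustrates the mechanism, not the crate's internal code:

use rand::Rng;

/// Add Laplace noise with scale b = sensitivity / epsilon.
/// Smaller epsilon means larger noise and stronger privacy.
fn laplace_noise(value: f64, sensitivity: f64, epsilon: f64) -> f64 {
    let scale = sensitivity / epsilon;
    let u: f64 = rand::thread_rng().gen_range(-0.5..0.5);
    // Inverse-CDF sampling of the Laplace distribution centered at 0.
    let noise = -scale * u.signum() * (1.0 - 2.0 * u.abs()).ln();
    value + noise
}

fn main() {
    // e.g. perturb a mean amount with sensitivity 1.0 at the "Standard" epsilon of 1.0
    let noisy_mean = laplace_noise(5_000.0, 1.0, 1.0);
    println!("noisy mean: {noisy_mean:.2}");
}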

CLI Commands

# Extract fingerprint
datasynth-data fingerprint extract \
    --input ./data.csv \
    --output ./fp.dsf \
    --privacy-level standard

# Validate
datasynth-data fingerprint validate ./fp.dsf

# Show info
datasynth-data fingerprint info ./fp.dsf --detailed

# Compare
datasynth-data fingerprint diff ./fp1.dsf ./fp2.dsf

# Evaluate fidelity
datasynth-data fingerprint evaluate \
    --fingerprint ./fp.dsf \
    --synthetic ./synthetic/ \
    --threshold 0.8

# Federated fingerprinting
datasynth-data fingerprint federated \
    --sources ./source_a.dsf ./source_b.dsf \
    --output ./aggregated.dsf \
    --method weighted_average

# Generate with certificate
datasynth-data generate --config config.yaml --output ./output --certificate

Dependencies

[dependencies]
datasynth-core = { path = "../datasynth-core" }
datasynth-config = { path = "../datasynth-config" }
serde = { version = "1.0", features = ["derive"] }
serde_yaml = "0.9"
serde_json = "1.0"
zip = "0.6"
sha2 = "0.10"
rand = "0.8"
statrs = "0.16"

See Also

datasynth-standards

The datasynth-standards crate provides comprehensive support for major accounting and auditing standards frameworks including IFRS, US GAAP, ISA, SOX, and PCAOB.

Overview

This crate contains domain models and business logic for:

  • Accounting Standards: Revenue recognition, lease accounting, fair value measurement, impairment testing
  • Audit Standards: ISA requirements, analytical procedures, confirmations, audit opinions
  • Regulatory Frameworks: SOX 302/404 compliance, PCAOB standards

Modules

framework

Core accounting framework selection and settings.

#![allow(unused)]
fn main() {
use datasynth_standards::framework::{AccountingFramework, FrameworkSettings};

// Select framework
let framework = AccountingFramework::UsGaap;
assert!(framework.allows_lifo());
assert!(!framework.allows_impairment_reversal());

// Framework-specific settings
let settings = FrameworkSettings::us_gaap();
assert!(settings.validate().is_ok());
}

accounting

Accounting standards models:

Module | Standards | Key Types
revenue | ASC 606 / IFRS 15 | CustomerContract, PerformanceObligation, VariableConsideration
leases | ASC 842 / IFRS 16 | Lease, ROUAsset, LeaseLiability, LeaseAmortizationEntry
fair_value | ASC 820 / IFRS 13 | FairValueMeasurement, FairValueHierarchyLevel
impairment | ASC 360 / IAS 36 | ImpairmentTest, RecoverableAmountMethod
differences | Dual Reporting | FrameworkDifferenceRecord

audit

Audit standards models:

Module | Standards | Key Types
isa_reference | ISA 200-720 | IsaStandard, IsaRequirement, IsaProcedureMapping
analytical | ISA 520 | AnalyticalProcedure, VarianceInvestigation
confirmation | ISA 505 | ExternalConfirmation, ConfirmationResponse
opinion | ISA 700/705/706/701 | AuditOpinion, KeyAuditMatter, OpinionModification
audit_trail | Traceability | AuditTrail, TrailGap
pcaob | PCAOB AS | PcaobStandard, PcaobIsaMapping

regulatory

Regulatory compliance models:

Module | Standards | Key Types
sox | SOX 302/404 | Sox302Certification, Sox404Assessment, DeficiencyMatrix, MaterialWeakness

Usage Examples

Revenue Recognition

#![allow(unused)]
fn main() {
use datasynth_standards::accounting::revenue::{
    CustomerContract, PerformanceObligation, ObligationType, SatisfactionPattern,
};
use datasynth_standards::framework::AccountingFramework;
use rust_decimal_macros::dec;

// Create a customer contract under US GAAP
let mut contract = CustomerContract::new(
    "C001".to_string(),
    "CUST001".to_string(),
    dec!(100000),
    AccountingFramework::UsGaap,
);

// Add performance obligations
let po = PerformanceObligation::new(
    "PO001".to_string(),
    ObligationType::Good,
    SatisfactionPattern::PointInTime,
    dec!(60000),
);
contract.add_performance_obligation(po);
}

Lease Accounting

#![allow(unused)]
fn main() {
use datasynth_standards::accounting::leases::{Lease, LeaseAssetClass, LeaseClassification};
use datasynth_standards::framework::AccountingFramework;
use chrono::NaiveDate;
use rust_decimal_macros::dec;

// Create a lease
let lease = Lease::new(
    "L001".to_string(),
    LeaseAssetClass::RealEstate,
    NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
    60,                    // 5-year term
    dec!(10000),          // Monthly payment
    0.05,                  // Discount rate
    AccountingFramework::UsGaap,
);

// Classify under US GAAP bright-line tests
let classification = lease.classify_us_gaap(
    72,                    // Asset useful life (months)
    dec!(600000),         // Fair value
    dec!(550000),         // Present value of payments
);
}

ISA Standards

#![allow(unused)]
fn main() {
use datasynth_standards::audit::isa_reference::{
    IsaStandard, IsaRequirement, IsaRequirementType,
};

// Reference an ISA standard
let standard = IsaStandard::Isa315;
assert_eq!(standard.number(), "315");
assert!(standard.title().contains("Risk"));

// Create a requirement
let requirement = IsaRequirement::new(
    IsaStandard::Isa500,
    "12".to_string(),
    IsaRequirementType::Requirement,
    "Design and perform audit procedures".to_string(),
);
}

SOX Compliance

#![allow(unused)]
fn main() {
use datasynth_standards::regulatory::sox::{
    Sox404Assessment, DeficiencyMatrix, DeficiencyLikelihood, DeficiencyMagnitude,
};
use uuid::Uuid;

// Create a SOX 404 assessment
let assessment = Sox404Assessment::new(
    Uuid::new_v4(),
    2024,
    true, // ICFR effective
);

// Classify a deficiency
let deficiency = DeficiencyMatrix::new(
    DeficiencyLikelihood::Probable,
    DeficiencyMagnitude::Material,
);
assert!(deficiency.is_material_weakness());
}

Framework Validation

The crate validates framework-specific rules:

#![allow(unused)]
fn main() {
use datasynth_standards::framework::{AccountingFramework, FrameworkSettings};

// LIFO is not permitted under IFRS
let mut settings = FrameworkSettings::ifrs();
settings.use_lifo_inventory = true;
assert!(settings.validate().is_err());

// PPE revaluation is not permitted under US GAAP
let mut settings = FrameworkSettings::us_gaap();
settings.use_ppe_revaluation = true;
assert!(settings.validate().is_err());
}

Dependencies

[dependencies]
datasynth-standards = "0.2.3"

Feature Flags

Currently, no optional features are defined. All functionality is included by default.

See Also

datasynth-test-utils

Test utilities and helpers for the SyntheticData workspace.

Overview

datasynth-test-utils provides shared testing infrastructure:

  • Test Fixtures: Pre-configured test data and scenarios
  • Assertion Helpers: Domain-specific assertions for financial data
  • Mock Generators: Simplified generators for unit testing
  • Snapshot Testing: Helpers for snapshot-based testing

Fixtures

Journal Entry Fixtures

#![allow(unused)]
fn main() {
use synth_test_utils::fixtures;

// Balanced two-line entry
let entry = fixtures::balanced_journal_entry();
assert!(entry.is_balanced());

// Entry with specific amounts
let entry = fixtures::journal_entry_with_amount(dec!(1000.00));

// Fraudulent entry for testing detection
let entry = fixtures::fraudulent_entry(FraudType::SplitTransaction);
}

Master Data Fixtures

#![allow(unused)]
fn main() {
// Sample vendors
let vendors = fixtures::sample_vendors(10);

// Sample customers
let customers = fixtures::sample_customers(20);

// Chart of accounts
let coa = fixtures::test_chart_of_accounts();
}

Amount Fixtures

#![allow(unused)]
fn main() {
// Benford-compliant amounts
let amounts = fixtures::sample_amounts(1000);

// Round-number biased amounts
let amounts = fixtures::round_amounts(100);

// Fraud-pattern amounts
let amounts = fixtures::suspicious_amounts(50);
}

Configuration Fixtures

#![allow(unused)]
fn main() {
// Minimal valid config
let config = fixtures::test_config();

// Manufacturing preset
let config = fixtures::manufacturing_config();

// Config with specific transaction count
let config = fixtures::config_with_transactions(10000);
}

Assertions

Balance Assertions

#![allow(unused)]
fn main() {
use synth_test_utils::assertions;

#[test]
fn test_entry_is_balanced() {
    let entry = create_entry();
    assertions::assert_balanced(&entry);
}

#[test]
fn test_trial_balance() {
    let tb = generate_trial_balance();
    assertions::assert_trial_balance_balanced(&tb);
}
}

Benford’s Law Assertions

#![allow(unused)]
fn main() {
#[test]
fn test_benford_compliance() {
    let amounts = generate_amounts(10000);
    assertions::assert_benford_compliant(&amounts, 0.05);
}
}

Document Chain Assertions

#![allow(unused)]
fn main() {
#[test]
fn test_p2p_chain() {
    let documents = generate_p2p_flow();
    assertions::assert_valid_document_chain(&documents);
}
}

Uniqueness Assertions

#![allow(unused)]
fn main() {
#[test]
fn test_no_duplicate_ids() {
    let entries = generate_entries(1000);
    assertions::assert_unique_document_ids(&entries);
}
}

Mock Generators

Simple Journal Entry Generator

#![allow(unused)]
fn main() {
use synth_test_utils::mocks::MockJeGenerator;

let mut generator = MockJeGenerator::new(42);

// Generate entries without full config
let entries = generator.generate(100);
}

Predictable Amount Generator

#![allow(unused)]
fn main() {
use synth_test_utils::mocks::MockAmountGenerator;

let mut generator = MockAmountGenerator::new();

// Returns predictable sequence
let amount1 = generator.next(); // 100.00
let amount2 = generator.next(); // 200.00
}

Fixed Date Generator

#![allow(unused)]
fn main() {
use synth_test_utils::mocks::MockDateGenerator;

let generator = MockDateGenerator::fixed(
    NaiveDate::from_ymd_opt(2024, 1, 15).unwrap()
);
}

Snapshot Testing

#![allow(unused)]
fn main() {
use synth_test_utils::snapshots;

#[test]
fn test_je_serialization() {
    let entry = fixtures::balanced_journal_entry();
    snapshots::assert_json_snapshot("je_balanced", &entry);
}

#[test]
fn test_csv_output() {
    let entries = fixtures::sample_entries(10);
    snapshots::assert_csv_snapshot("entries_sample", &entries);
}
}

Test Helpers

Temporary Directories

#![allow(unused)]
fn main() {
use synth_test_utils::temp_dir;

#[test]
fn test_output_writing() {
    let dir = temp_dir::create();

    // Test writes to temp directory
    let path = dir.path().join("test.csv");
    write_output(&path).unwrap();

    assert!(path.exists());
    // Directory cleaned up on drop
}
}

Seed Management

#![allow(unused)]
fn main() {
use synth_test_utils::seeds;

#[test]
fn test_deterministic_generation() {
    let seed = seeds::fixed();

    let result1 = generate_with_seed(seed);
    let result2 = generate_with_seed(seed);

    assert_eq!(result1, result2);
}
}

Time Helpers

#![allow(unused)]
fn main() {
use synth_test_utils::time;

#[test]
fn test_with_frozen_time() {
    let frozen = time::freeze_at(2024, 1, 15);

    let entry = generate_entry_with_current_date();

    assert_eq!(entry.posting_date, frozen.date());
}
}

Usage in Other Crates

Add to Cargo.toml:

[dev-dependencies]
datasynth-test-utils = { path = "../datasynth-test-utils" }

Use in tests:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use synth_test_utils::{fixtures, assertions};

    #[test]
    fn test_my_function() {
        let input = fixtures::test_config();
        let result = my_function(&input);
        assertions::assert_balanced(&result);
    }
}
}

Fixture Data Files

Test data files in fixtures/:

datasynth-test-utils/
└── fixtures/
    ├── chart_of_accounts.yaml
    ├── sample_entries.json
    ├── vendor_master.csv
    └── test_config.yaml

See Also

Advanced Topics

Advanced features for specialized use cases.

Overview

Topic | Description
Anomaly Injection | Fraud, errors, and process issues
Data Quality Variations | Missing values, typos, duplicates
Graph Export | ML-ready graph formats
Intercompany Processing | Multi-entity transactions
Period Close Engine | Month/quarter/year-end processes
Performance Tuning | Optimization strategies

Feature Matrix

Feature | Use Case | Output
Anomaly Injection | ML training | Labels (CSV)
Data Quality | Testing robustness | Varied data
Graph Export | GNN training | PyG, Neo4j
Intercompany | Consolidation testing | IC pairs
Period Close | Full cycle testing | Closing entries

Enabling Advanced Features

In Configuration

# Anomaly injection
anomaly_injection:
  enabled: true
  total_rate: 0.02
  generate_labels: true

# Data quality variations
data_quality:
  enabled: true
  missing_values:
    rate: 0.01

# Graph export
graph_export:
  enabled: true
  formats:
    - pytorch_geometric
    - neo4j

# Intercompany
intercompany:
  enabled: true

# Period close
period_close:
  enabled: true
  monthly:
    accruals: true
    depreciation: true

Via CLI

Most advanced features are controlled through configuration. Use init to create a base config, then customize:

datasynth-data init --industry manufacturing --complexity medium -o config.yaml
# Edit config.yaml to enable advanced features
datasynth-data generate --config config.yaml --output ./output

Prerequisites

Some advanced features have dependencies:

Feature | Requires
Intercompany | Multiple companies defined
Period Close | period_months ≥ 1
Graph Export | Transactions generated
FX | Multiple currencies

Output Files

Advanced features produce additional outputs:

output/
├── labels/                      # Anomaly injection
│   ├── anomaly_labels.csv
│   ├── fraud_labels.csv
│   └── quality_issues.csv
├── graphs/                      # Graph export
│   ├── pytorch_geometric/
│   └── neo4j/
├── consolidation/               # Intercompany
│   ├── eliminations.csv
│   └── ic_pairs.csv
└── period_close/                # Period close
    ├── trial_balances/
    ├── accruals.csv
    └── closing_entries.csv

Performance Impact

Feature | Impact | Mitigation
Anomaly Injection | Low | Post-processing
Data Quality | Low | Post-processing
Graph Export | Medium | Separate phase
Intercompany | Medium | Per-transaction
Period Close | Low | Per-period

See Also

Fraud Patterns & ACFE Taxonomy

SyntheticData includes comprehensive fraud pattern modeling aligned with the Association of Certified Fraud Examiners (ACFE) Report to the Nations. This enables generation of realistic fraud scenarios for training machine learning models and testing audit analytics.

ACFE Fraud Taxonomy

The ACFE occupational fraud classification divides fraud into three main categories, each with distinct characteristics:

Asset Misappropriation (86% of cases)

The most common type of fraud, involving theft of organizational assets:

fraud:
  enabled: true
  acfe_category: asset_misappropriation
  schemes:
    cash_fraud:
      - skimming           # Sales not recorded
      - larceny            # Cash stolen after recording
      - shell_company      # Fictitious vendors
      - ghost_employee     # Non-existent employees
      - expense_schemes    # Personal expenses as business
    non_cash_fraud:
      - inventory_theft
      - fixed_asset_misuse

Corruption (33% of cases)

Schemes involving conflicts of interest and bribery:

fraud:
  enabled: true
  acfe_category: corruption
  schemes:
    - purchasing_conflict  # Undisclosed vendor ownership
    - sales_conflict       # Kickbacks from customers
    - invoice_kickback     # Vendor payment schemes
    - bid_rigging          # Collusion with vendors
    - economic_extortion   # Demands for payment

Financial Statement Fraud (10% of cases)

The least common but most costly fraud type:

fraud:
  enabled: true
  acfe_category: financial_statement
  schemes:
    overstatement:
      - premature_revenue      # Revenue before earned
      - fictitious_revenues    # Fake sales
      - concealed_liabilities  # Hidden obligations
      - improper_asset_values  # Overstated assets
    understatement:
      - understated_revenues   # Hidden sales
      - overstated_expenses    # Inflated costs

ACFE Calibration

Generated fraud data is calibrated to match ACFE statistics:

Metric | ACFE Value | Configuration
Median Loss | $117,000 | acfe.median_loss
Median Duration | 12 months | acfe.median_duration_months
Tip Detection | 42% | detection_method.tip
Internal Audit Detection | 16% | detection_method.internal_audit
Management Review Detection | 12% | detection_method.management_review

fraud:
  acfe_calibration:
    enabled: true
    median_loss: 117000
    median_duration_months: 12
    detection_methods:
      tip: 0.42
      internal_audit: 0.16
      management_review: 0.12
      external_audit: 0.04
      accident: 0.06

Collusion & Conspiracy Modeling

SyntheticData models multi-party fraud networks with coordinated schemes:

Collusion Ring Types

#![allow(unused)]
fn main() {
pub enum CollusionRingType {
    // Internal collusion
    EmployeePair,           // approver + processor
    DepartmentRing,         // 3-5 employees
    ManagementSubordinate,  // manager + subordinate

    // Internal-external
    EmployeeVendor,         // purchasing + vendor contact
    EmployeeCustomer,       // sales rep + customer
    EmployeeContractor,     // project manager + contractor

    // External rings
    VendorRing,             // bid rigging (2-4 vendors)
    CustomerRing,           // return fraud
}
}

Conspirator Roles

Each conspirator in a ring has a specific role:

  • Initiator: Conceives scheme, recruits others
  • Executor: Performs fraudulent transactions
  • Approver: Provides approvals/overrides
  • Concealer: Hides evidence, manipulates records
  • Lookout: Monitors for detection
  • Beneficiary: External recipient of proceeds

Configuration

fraud:
  collusion:
    enabled: true
    ring_types:
      - type: employee_vendor
        probability: 0.15
        min_members: 2
        max_members: 4
      - type: department_ring
        probability: 0.08
        min_members: 3
        max_members: 5
    defection_probability: 0.05
    escalation_rate: 0.10

Management Override

Senior-level fraud with override patterns:

fraud:
  management_override:
    enabled: true
    perpetrator_levels:
      - senior_manager
      - cfo
      - ceo
    override_types:
      revenue:
        - journal_entry_override
        - revenue_recognition_acceleration
        - reserve_manipulation
      expense:
        - capitalization_abuse
        - expense_deferral
    pressure_sources:
      - financial_targets
      - market_expectations
      - covenant_compliance

Fraud Triangle

The fraud triangle (Pressure, Opportunity, Rationalization) is modeled:

fraud:
  fraud_triangle:
    pressure:
      source: financial_targets
      intensity: high
    opportunity:
      factors:
        - weak_internal_controls
        - management_override_capability
        - lack_of_oversight
    rationalization:
      type: temporary_adjustment  # "We'll fix it next quarter"

Red Flag Generation

Probabilistic fraud indicators with calibrated Bayesian probabilities:

Red Flag Strengths

Strength | P(fraud|flag) | Examples
Strong | > 0.5 | Matched home address vendor/employee
Moderate | 0.2 - 0.5 | Vendor with no physical address
Weak | < 0.2 | Round number invoices

Configuration

fraud:
  red_flags:
    enabled: true
    inject_rate: 0.15  # 15% of transactions get flags
    patterns:
      strong:
        - name: matched_address_vendor_employee
          p_flag_given_fraud: 0.90
          p_flag_given_no_fraud: 0.001
        - name: sequential_check_numbers
          p_flag_given_fraud: 0.80
          p_flag_given_no_fraud: 0.01
      moderate:
        - name: approval_just_under_threshold
          p_flag_given_fraud: 0.70
          p_flag_given_no_fraud: 0.10
      weak:
        - name: round_number_invoice
          p_flag_given_fraud: 0.40
          p_flag_given_no_fraud: 0.20
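
The configured conditional probabilities relate to the P(fraud|flag) values above through Bayes' rule, given an assumed base fraud rate; the posterior therefore scales with that base rate. A minimal Python sketch (the 2% base rate and the helper name are illustrative assumptions, not part of the generator):

def p_fraud_given_flag(p_flag_given_fraud: float,
                       p_flag_given_no_fraud: float,
                       base_rate: float = 0.02) -> float:
    """Posterior probability of fraud given that a red flag fired (Bayes' rule)."""
    numerator = p_flag_given_fraud * base_rate
    denominator = numerator + p_flag_given_no_fraud * (1.0 - base_rate)
    return numerator / denominator

# Values from the configuration above
print(p_fraud_given_flag(0.90, 0.001))  # matched_address_vendor_employee -> ~0.95
print(p_fraud_given_flag(0.70, 0.10))   # approval_just_under_threshold   -> ~0.13
print(p_fraud_given_flag(0.40, 0.20))   # round_number_invoice            -> ~0.04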

Evaluation Benchmarks

ACFE-Calibrated Benchmarks

#![allow(unused)]
fn main() {
// General fraud detection
let bench = acfe_calibrated_1k();

// Collusion-focused benchmark
let bench = acfe_collusion_5k();

// Management override detection
let bench = acfe_management_override_2k();
}

Benchmark Metrics

#![allow(unused)]
fn main() {
pub struct AcfeAlignment {
    /// Category distribution MAD vs ACFE
    pub category_distribution_mad: f64,
    /// Median loss ratio (actual / expected)
    pub median_loss_ratio: f64,
    /// Duration distribution KS statistic
    pub duration_distribution_ks: f64,
    /// Detection method chi-squared
    pub detection_method_chi_sq: f64,
}
}
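
The simpler alignment metrics can be reproduced outside the crate. A hedged Python sketch of the category-distribution MAD and the median-loss ratio (the sample shares and loss values below are made-up placeholders, not reference data):

import statistics

def category_mad(generated: dict, expected: dict) -> float:
    """Mean absolute deviation between generated and expected category shares."""
    return sum(abs(generated.get(c, 0.0) - expected[c]) for c in expected) / len(expected)

def median_loss_ratio(losses: list, expected_median: float = 117_000.0) -> float:
    """Ratio of the actual median loss to the ACFE median loss ($117,000)."""
    return statistics.median(losses) / expected_median

# Hypothetical shares, for illustration only
generated = {"asset_misappropriation": 0.83, "corruption": 0.12, "financial_statement": 0.05}
expected  = {"asset_misappropriation": 0.86, "corruption": 0.09, "financial_statement": 0.05}
print(category_mad(generated, expected))              # 0.02
print(median_loss_ratio([90_000, 120_000, 250_000]))  # ~1.03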

Output Files

File | Description
collusion_rings.json | Collusion network details with members, roles
red_flags.csv | Red flag indicators with probabilities
management_overrides.json | Management override schemes
fraud_labels.csv | Enhanced fraud labels with ACFE category

Best Practices

  1. Start with ACFE calibration: Use default ACFE statistics for realistic distribution
  2. Enable collusion gradually: Start with simple rings before complex networks
  3. Use red flags for training: Red flags provide weak supervision signals
  4. Validate against benchmarks: Use ACFE benchmarks to verify model performance
  5. Consider detection difficulty: Use detection_difficulty labels for curriculum learning

Industry-Specific Features

SyntheticData includes industry-specific transaction modeling with authentic terminology, master data structures, and anomaly patterns. Three industries have full generator implementations (Manufacturing, Retail, Healthcare), while three additional industries (Technology, Financial Services, Professional Services) are available as configuration presets with industry-appropriate GL structures and anomaly rates.

Overview

Each industry module provides:

  • Industry-specific transactions: Authentic transaction types using correct terminology
  • Master data structures: Industry-specific entities (BOM, routings, clinical codes, etc.)
  • Anomaly patterns: Industry-authentic fraud and error patterns
  • GL account structures: Industry-appropriate chart of accounts
  • Configuration options: Fine-grained control over industry characteristics

Implementation Status

Industry | Status | Transaction Types | Master Data | Anomaly Patterns | Benchmarks
Manufacturing | Full generator | 13 types | BOM, routings, work centers | 5 patterns | Yes
Retail | Full generator | 11 types | Stores, POS, loyalty | 6 patterns | Yes
Healthcare | Full generator | 9 types | ICD-10, CPT, DRG, payers | 6 patterns | Yes
Technology | Config preset | Config-only | n/a | 3 anomaly rates | Yes
Financial Services | Config preset | Config-only | n/a | 3 anomaly rates | Yes
Professional Services | Config preset | Config-only | n/a | 3 anomaly rates | No

Full generator industries have dedicated Rust enum types with per-transaction generation logic, dedicated master data structures, and industry-specific anomaly injection. Config preset industries use the standard generator pipeline but apply industry-appropriate GL account structures, transaction distributions, and anomaly rates through configuration.

Manufacturing

Transaction Types

#![allow(unused)]
fn main() {
pub enum ManufacturingTransaction {
    // Production
    WorkOrderIssuance,      // Create production order
    MaterialRequisition,    // Issue materials to production
    LaborBooking,           // Record labor hours
    OverheadAbsorption,     // Apply manufacturing overhead
    ScrapReporting,         // Report production scrap
    ReworkOrder,            // Create rework order
    ProductionVariance,     // Record variances

    // Inventory
    RawMaterialReceipt,     // Receive raw materials
    WipTransfer,            // Transfer between work centers
    FinishedGoodsTransfer,  // Move to finished goods
    CycleCountAdjustment,   // Inventory adjustments

    // Costing
    StandardCostRevaluation,  // Update standard costs
    PurchasePriceVariance,    // Record PPV
}
}

Master Data

manufacturing:
  bom:
    depth: 4                    # BOM levels (3-7 typical)
    yield_rate: 0.97            # Expected yield
    scrap_factor: 0.02          # Scrap percentage
  routings:
    operations_per_product: 5   # Average operations
    setup_time_minutes: 30      # Default setup time
  work_centers:
    count: 20
    capacity_hours: 8
    efficiency: 0.85

Anomaly Patterns

Anomaly | Description | Detection Method
Yield Manipulation | Reported yield differs from actual | Variance analysis
Labor Misallocation | Labor charged to wrong order | Cross-reference
Phantom Production | Production orders with no output | Data analytics
Obsolete Inventory | Aging inventory not written down | Aging analysis
Standard Cost Manipulation | Inflated standard costs | Trend analysis

Configuration

industry_specific:
  enabled: true
  manufacturing:
    enabled: true
    bom_depth: 4
    just_in_time: false
    production_order_types:
      - standard
      - rework
      - prototype
    quality_framework: iso_9001
    supplier_tiers: 2
    standard_cost_frequency: quarterly
    target_yield_rate: 0.97
    scrap_alert_threshold: 0.03
    anomaly_rates:
      yield_manipulation: 0.005
      labor_misallocation: 0.008
      phantom_production: 0.002
      obsolete_inventory: 0.01

Retail

Transaction Types

#![allow(unused)]
fn main() {
pub enum RetailTransaction {
    // Point of Sale
    PosSale,                // Register sale
    ReturnRefund,           // Customer return
    VoidTransaction,        // Voided sale
    EmployeeDiscount,       // Staff discount
    LoyaltyRedemption,      // Points redemption

    // Inventory
    InventoryReceipt,       // Receive from DC
    StoreTransfer,          // Store-to-store
    MarkdownRecording,      // Price reductions
    ShrinkageAdjustment,    // Inventory loss

    // Cash Management
    CashDrop,               // Safe deposit
    RegisterReconciliation, // Drawer count
}
}

Store Types

retail:
  stores:
    types:
      - flagship      # High-volume, full assortment
      - standard      # Normal retail store
      - express       # Small format, convenience
      - outlet        # Discount/clearance
      - warehouse     # Bulk/club format
      - pop_up        # Temporary locations
      - digital       # E-commerce only

Anomaly Patterns

Anomaly | Description | Detection Method
Sweethearting | Not scanning items | Video analytics
Skimming | Cash theft from register | Cash variance
Refund Fraud | Fake returns | Return pattern
Receiving Fraud | Short shipment theft | 3-way match
Coupon Fraud | Invalid coupon use | Coupon validation
Employee Discount Abuse | Unauthorized discounts | Policy review

Configuration

industry_specific:
  enabled: true
  retail:
    enabled: true
    store_types:
      - standard
      - express
      - outlet
    shrinkage_rate: 0.015
    return_rate: 0.08
    markdown_frequency: weekly
    loss_prevention:
      camera_coverage: 0.85
      eas_enabled: true
    pos_anomaly_rates:
      sweethearting: 0.002
      skimming: 0.001
      refund_fraud: 0.003

Healthcare

Transaction Types

#![allow(unused)]
fn main() {
pub enum HealthcareTransaction {
    // Revenue Cycle
    PatientRegistration,    // Register patient
    ChargeCapture,          // Record charges
    ClaimSubmission,        // Submit to payer
    PaymentPosting,         // Record payment
    DenialManagement,       // Handle denials

    // Clinical
    ProcedureCoding,        // CPT codes
    DiagnosisCoding,        // ICD-10 codes
    SupplyConsumption,      // Medical supplies
    PharmacyDispensing,     // Medications
}
}

Coding Systems

healthcare:
  coding:
    icd10: true         # Diagnosis codes
    cpt: true           # Procedure codes
    drg: true           # Diagnosis Related Groups
    hcpcs: true         # Supplies/equipment

Payer Mix

healthcare:
  payer_mix:
    medicare: 0.40
    medicaid: 0.20
    commercial: 0.30
    self_pay: 0.10

Compliance Frameworks

healthcare:
  compliance:
    hipaa: true           # Privacy rules
    stark_law: true       # Physician referrals
    anti_kickback: true   # AKS compliance
    false_claims_act: true

Anomaly Patterns

Anomaly | Description | Detection Method
Upcoding | Higher-level code than justified | Code validation
Unbundling | Splitting bundled services | Bundle analysis
Phantom Billing | Billing for unrendered services | Audit
Duplicate Billing | Same service billed twice | Duplicate check
Kickbacks | Physician referral payments | Relationship analysis
HIPAA Violations | Unauthorized data access | Access logs

Configuration

industry_specific:
  enabled: true
  healthcare:
    enabled: true
    facility_type: hospital  # hospital, physician_practice, etc.
    payer_mix:
      medicare: 0.40
      medicaid: 0.20
      commercial: 0.30
      self_pay: 0.10
    coding_system:
      icd10: true
      cpt: true
      drg: true
    compliance:
      hipaa: true
      stark_law: true
      anti_kickback: true
    avg_daily_encounters: 200
    avg_charges_per_encounter: 8
    anomaly_rates:
      upcoding: 0.02
      unbundling: 0.015
      phantom_billing: 0.005
      duplicate_billing: 0.008

Technology

Transaction Types

  • License revenue recognition
  • Subscription billing
  • Professional services
  • R&D capitalization
  • Deferred revenue

Configuration

industry_specific:
  enabled: true
  technology:
    enabled: true
    revenue_model: subscription  # license, subscription, usage
    subscription_revenue_percent: 0.70
    professional_services_percent: 0.20
    license_revenue_percent: 0.10
    r_and_d_capitalization_rate: 0.15
    deferred_revenue_months: 12
    anomaly_rates:
      premature_revenue: 0.008
      channel_stuffing: 0.003
      improper_capitalization: 0.005

Financial Services

Transaction Types

  • Loan origination
  • Interest accrual
  • Fee income
  • Trading transactions
  • Customer deposits
  • Wire transfers

Configuration

industry_specific:
  enabled: true
  financial_services:
    enabled: true
    institution_type: commercial_bank
    regulatory_framework: us  # us, eu, uk
    loan_portfolio_size: 1000
    avg_loan_amount: 250000
    loan_loss_provision_rate: 0.02
    fee_income_percent: 0.15
    trading_volume_daily: 50000000
    anomaly_rates:
      loan_fraud: 0.003
      trading_fraud: 0.001
      account_takeover: 0.002

Professional Services

Transaction Types

  • Time and billing
  • Engagement management
  • Trust account transactions
  • Expense reimbursement
  • Partner distributions

Configuration

industry_specific:
  enabled: true
  professional_services:
    enabled: true
    billing_model: hourly  # hourly, fixed_fee, contingency
    avg_hourly_rate: 350
    utilization_target: 0.75
    realization_rate: 0.92
    trust_accounting: true
    engagement_types:
      - audit
      - tax
      - advisory
      - litigation
    anomaly_rates:
      billing_fraud: 0.004
      trust_misappropriation: 0.001
      expense_fraud: 0.008

Industry Benchmarks

SyntheticData provides pre-configured ML benchmarks for each industry:

#![allow(unused)]
fn main() {
// Get industry-specific benchmark
let bench = get_industry_benchmark(IndustrySector::Healthcare);

// Available benchmarks
let manufacturing = manufacturing_fraud_5k();
let retail = retail_fraud_10k();
let healthcare = healthcare_fraud_5k();
let technology = technology_fraud_3k();
let financial = financial_services_fraud_5k();
}

Benchmark Features

Each industry benchmark includes:

  • Industry-specific transaction features
  • Relevant anomaly types
  • Appropriate cost matrices
  • Industry-specific evaluation metrics

Best Practices

  1. Match industry to use case: Select the industry that matches your target domain
  2. Use industry presets first: Start with default settings before customizing
  3. Enable industry-specific anomalies: These provide realistic fraud patterns
  4. Consider regulatory context: Enable compliance frameworks relevant to your industry
  5. Use industry benchmarks: Evaluate models against industry-specific baselines

Output Files

File | Description
industry_transactions.csv | Industry-specific transaction log
industry_master_data.json | Industry-specific entities
industry_anomalies.csv | Industry-specific anomaly labels
industry_gl_accounts.csv | Industry-specific chart of accounts

Anomaly Injection

Generate labeled anomalies for machine learning training.

Overview

Anomaly injection adds realistic irregularities to generated data with full labeling for supervised learning:

  • 20+ fraud types
  • Error patterns
  • Process issues
  • Statistical outliers
  • Relational anomalies

Configuration

anomaly_injection:
  enabled: true
  total_rate: 0.02                   # 2% anomaly rate
  generate_labels: true              # Output ML labels

  categories:                        # Category distribution
    fraud: 0.25
    error: 0.40
    process_issue: 0.20
    statistical: 0.10
    relational: 0.05

  temporal_pattern:
    year_end_spike: 1.5              # More anomalies at year-end

  clustering:
    enabled: true
    cluster_probability: 0.2         # 20% appear in clusters

Anomaly Categories

Fraud Types

Type | Description | Detection Difficulty
fictitious_transaction | Fabricated entries | Medium
revenue_manipulation | Premature recognition | Hard
expense_capitalization | Improper capitalization | Medium
split_transaction | Split to avoid threshold | Easy
round_tripping | Circular transactions | Hard
kickback_scheme | Vendor kickbacks | Hard
ghost_employee | Non-existent payee | Medium
duplicate_payment | Same invoice twice | Easy
unauthorized_discount | Unapproved discounts | Medium
suspense_abuse | Hide in suspense | Hard

fraud:
  types:
    fictitious_transaction: 0.15
    split_transaction: 0.20
    duplicate_payment: 0.15
    ghost_employee: 0.10
    kickback_scheme: 0.10
    revenue_manipulation: 0.10
    expense_capitalization: 0.10
    unauthorized_discount: 0.10

Error Types

Type | Description
duplicate_entry | Same entry posted twice
reversed_amount | Debit/credit swapped
wrong_period | Posted to wrong period
wrong_account | Incorrect GL account
missing_reference | Missing document reference
incorrect_tax_code | Wrong tax calculation
misclassification | Wrong account category

Process Issues

Type | Description
late_posting | Posted after cutoff
skipped_approval | Missing required approval
threshold_manipulation | Amount just below threshold
missing_documentation | No supporting document
out_of_sequence | Documents out of order

Statistical Anomalies

Type | Description
unusual_amount | Significant deviation from mean
trend_break | Sudden pattern change
benford_violation | Doesn’t follow Benford’s Law
outlier_value | Extreme value

Relational Anomalies

Type | Description
circular_transaction | A → B → A flow
dormant_account_activity | Inactive account used
unusual_counterparty | Unexpected entity pairing
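
For example, a circular A → B → A flow can be surfaced from a flat transaction list with a simple pairwise check. A minimal Python sketch (field names are illustrative; this is not the injector's implementation):

from collections import defaultdict

def find_circular_pairs(transactions):
    """Return (a, b) entity pairs where money flows both a -> b and b -> a."""
    seen = defaultdict(set)          # source -> set of targets already observed
    circular = set()
    for tx in transactions:
        src, dst = tx["from_entity"], tx["to_entity"]
        if src in seen[dst]:         # reverse edge already observed
            circular.add(tuple(sorted((src, dst))))
        seen[src].add(dst)
    return circular

txs = [
    {"from_entity": "A", "to_entity": "B", "amount": 1000},
    {"from_entity": "B", "to_entity": "A", "amount": 1000},
    {"from_entity": "B", "to_entity": "C", "amount": 500},
]
print(find_circular_pairs(txs))  # {('A', 'B')}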

Injection Strategies

Amount Manipulation

anomaly_injection:
  strategies:
    amount:
      enabled: true
      threshold_adjacent: 0.3        # Just below approval limit
      round_number_bias: 0.4         # Suspicious round amounts

Threshold-adjacent: Amounts like $9,999 when limit is $10,000.

Date Manipulation

anomaly_injection:
  strategies:
    date:
      enabled: true
      weekend_bias: 0.2              # Weekend postings
      after_hours_bias: 0.15         # After business hours

Duplication

anomaly_injection:
  strategies:
    duplication:
      enabled: true
      exact_duplicate: 0.5           # Exact copy
      near_duplicate: 0.3            # Slight variations
      delayed_duplicate: 0.2         # Same entry later

Temporal Patterns

Anomalies can follow realistic patterns:

anomaly_injection:
  temporal_pattern:
    month_end_spike: 1.2             # 20% more at month-end
    quarter_end_spike: 1.5           # 50% more at quarter-end
    year_end_spike: 2.0              # Double at year-end
    seasonality: true                # Follow industry patterns
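
A hedged sketch of how these multipliers might translate into an effective daily anomaly rate (the base rate, the date logic, and the function name are illustrative assumptions, not the generator's internals):

import datetime as dt

def effective_anomaly_rate(date: dt.date, base_rate: float = 0.02,
                           month_end=1.2, quarter_end=1.5, year_end=2.0) -> float:
    """Scale the base anomaly rate by the strongest applicable period-end multiplier."""
    next_day = date + dt.timedelta(days=1)
    multiplier = 1.0
    if next_day.month != date.month:      # last day of the month
        multiplier = month_end
        if date.month in (3, 6, 9, 12):   # quarter-end month
            multiplier = quarter_end
        if date.month == 12:              # year-end month
            multiplier = year_end
    return base_rate * multiplier

print(effective_anomaly_rate(dt.date(2024, 1, 31)))   # 0.024 (month-end)
print(effective_anomaly_rate(dt.date(2024, 12, 31)))  # 0.04  (year-end)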

Entity Targeting

Control which entities receive anomalies:

anomaly_injection:
  entity_targeting:
    strategy: weighted               # random, repeat_offender, weighted

    repeat_offender:
      enabled: true
      rate: 0.4                      # 40% from same users

    high_volume_bias: 0.3            # Target high-volume entities

Clustering

Real anomalies often cluster:

anomaly_injection:
  clustering:
    enabled: true
    cluster_probability: 0.2         # 20% in clusters
    cluster_size:
      min: 3
      max: 10
    cluster_timespan_days: 30        # Within 30-day window

Output Labels

anomaly_labels.csv

Field | Description
document_id | Entry reference
anomaly_id | Unique anomaly ID
anomaly_type | Specific type
anomaly_category | Fraud, Error, etc.
severity | Low, Medium, High
detection_difficulty | Easy, Medium, Hard
description | Human-readable description

fraud_labels.csv

Subset with fraud-specific fields:

Field | Description
document_id | Entry reference
fraud_type | Specific fraud pattern
perpetrator_id | Employee ID
scheme_id | Related anomaly group
amount_manipulated | Fraud amount

ML Integration

Loading Labels

import pandas as pd

labels = pd.read_csv('output/labels/anomaly_labels.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')

# Merge
data = entries.merge(labels, on='document_id', how='left')
data['is_anomaly'] = data['anomaly_id'].notna()

Feature Engineering

# Create features
features = [
    'amount', 'line_count', 'is_round_number',
    'is_weekend', 'is_month_end', 'hour_of_day'
]

X = data[features]
y = data['is_anomaly']

Train/Test Split

Labels include suggested splits:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,  # Maintain anomaly ratio
    random_state=42
)

Example Configuration

Fraud Detection Training

anomaly_injection:
  enabled: true
  total_rate: 0.02
  generate_labels: true

  categories:
    fraud: 1.0                       # Only fraud for focused training

  clustering:
    enabled: true
    cluster_probability: 0.3

fraud:
  enabled: true
  fraud_rate: 0.02
  types:
    split_transaction: 0.25
    duplicate_payment: 0.25
    kickback_scheme: 0.20
    ghost_employee: 0.15
    fictitious_transaction: 0.15

General Anomaly Detection

anomaly_injection:
  enabled: true
  total_rate: 0.05
  generate_labels: true

  categories:
    fraud: 0.15
    error: 0.45
    process_issue: 0.25
    statistical: 0.10
    relational: 0.05

See Also

Data Quality Variations

Generate realistic data quality issues for testing robustness.

Overview

Real-world data has imperfections. The data quality module introduces:

  • Missing values (various patterns)
  • Format variations
  • Duplicates
  • Typos and transcription errors
  • Encoding issues

Configuration

data_quality:
  enabled: true

  missing_values:
    rate: 0.01
    pattern: mcar

  format_variations:
    date_formats: true
    amount_formats: true

  duplicates:
    rate: 0.001
    types: [exact, near, fuzzy]

  typos:
    rate: 0.005
    keyboard_aware: true

Missing Values

Patterns

Pattern | Description
mcar | Missing Completely At Random
mar | Missing At Random (conditional)
mnar | Missing Not At Random (value-dependent)
systematic | Entire field groups missing

data_quality:
  missing_values:
    rate: 0.01                       # 1% missing overall
    pattern: mcar

    # Pattern-specific settings
    mcar:
      uniform: true                  # Equal probability all fields

    mar:
      conditions:
        - field: vendor_name
          dependent_on: is_intercompany
          probability: 0.1

    mnar:
      conditions:
        - field: amount
          when_above: 100000         # Large amounts more likely missing
          probability: 0.05

    systematic:
      groups:
        - [address, city, country]   # All or none
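
The three statistical patterns differ only in what the drop probability conditions on. A minimal pandas sketch of the idea (column names and rates are placeholders, not the generator's schema):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "vendor_name": ["Acme", "Globex", "Initech", "Umbrella"],
    "is_intercompany": [False, True, False, True],
    "amount": [1_200.0, 250_000.0, 9_900.0, 75_000.0],
})

# MCAR: every record has the same drop probability
mcar_mask = rng.random(len(df)) < 0.01
df.loc[mcar_mask, "vendor_name"] = None

# MAR: probability depends on another observed field (here the intercompany flag)
mar_mask = df["is_intercompany"] & (rng.random(len(df)) < 0.10)
df.loc[mar_mask, "vendor_name"] = None

# MNAR: probability depends on the value itself (large amounts go missing more often)
mnar_mask = (df["amount"] > 100_000) & (rng.random(len(df)) < 0.05)
df.loc[mnar_mask, "amount"] = np.nan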

Field Targeting

data_quality:
  missing_values:
    fields:
      description: 0.02              # 2% missing
      cost_center: 0.05              # 5% missing
      tax_code: 0.03                 # 3% missing
    exclude:
      - document_id                  # Never make missing
      - posting_date
      - account_number

Format Variations

Date Formats

data_quality:
  format_variations:
    date_formats: true
    date_variations:
      iso: 0.6                       # 2024-01-15
      us: 0.2                        # 01/15/2024
      eu: 0.1                        # 15.01.2024
      long: 0.1                      # January 15, 2024

Examples:

  • ISO: 2024-01-15
  • US: 01/15/2024, 1/15/2024
  • EU: 15.01.2024, 15/01/2024
  • Long: January 15, 2024

Amount Formats

data_quality:
  format_variations:
    amount_formats: true
    amount_variations:
      plain: 0.5                     # 1234.56
      us_comma: 0.3                  # 1,234.56
      eu_format: 0.1                 # 1.234,56
      currency_prefix: 0.05          # $1,234.56
      currency_suffix: 0.05          # 1.234,56 EUR

Identifier Formats

data_quality:
  format_variations:
    identifier_variations:
      case: 0.1                      # INV-001 vs inv-001
      padding: 0.1                   # 001 vs 1
      separator: 0.1                 # INV-001 vs INV_001 vs INV001

Duplicates

Duplicate Types

Type | Description
exact | Identical records
near | Minor field differences
fuzzy | Multiple field variations

data_quality:
  duplicates:
    rate: 0.001                      # 0.1% duplicates
    types:
      exact: 0.4                     # 40% exact duplicates
      near: 0.4                      # 40% near duplicates
      fuzzy: 0.2                     # 20% fuzzy duplicates

Near Duplicate Variations

data_quality:
  duplicates:
    near:
      fields_to_vary: 1              # Change 1 field
      variations:
        - field: posting_date
          offset_days: [-1, 0, 1]
        - field: amount
          variance: 0.001            # 0.1% difference

Fuzzy Duplicate Variations

data_quality:
  duplicates:
    fuzzy:
      fields_to_vary: 3              # Change multiple fields
      include_typos: true

Typos

Typo Types

Type | Description
Substitution | Adjacent key pressed
Transposition | Characters swapped
Insertion | Extra character
Deletion | Missing character
OCR errors | Scan-related (0/O, 1/l)
Homophones | Sound-alike substitution

data_quality:
  typos:
    rate: 0.005                      # 0.5% of string fields
    keyboard_aware: true             # Use QWERTY layout

    types:
      substitution: 0.35             # Adjacnet → Adjacent
      transposition: 0.25            # Recieve → Receive
      insertion: 0.15                # Shippping → Shipping
      deletion: 0.15                 # Shiping → Shipping
      ocr_errors: 0.05               # O → 0, l → 1
      homophones: 0.05               # their → there
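
Keyboard-aware substitution replaces a character with one from a physically adjacent key rather than a random one. A minimal Python sketch with a tiny adjacency map (the map covers only a few keys for illustration and is not the injector's layout table):

import random

# Small excerpt of a QWERTY adjacency map (illustrative only)
ADJACENT = {
    "a": "qwsz", "s": "awedxz", "e": "wsdr",
    "i": "ujko", "n": "bhjm", "o": "iklp",
}

def keyboard_typo(text: str, rng: random.Random) -> str:
    """Replace one random character with a neighbouring key, if one is mapped."""
    candidates = [i for i, ch in enumerate(text) if ch.lower() in ADJACENT]
    if not candidates:
        return text
    i = rng.choice(candidates)
    replacement = rng.choice(ADJACENT[text[i].lower()])
    return text[:i] + replacement + text[i + 1:]

print(keyboard_typo("Shipping notice", random.Random(7)))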

Field Targeting

data_quality:
  typos:
    fields:
      description: 0.02              # More likely in descriptions
      vendor_name: 0.01
      customer_name: 0.01
    exclude:
      - account_number               # Never introduce typos
      - document_id

Encoding Issues

data_quality:
  encoding:
    enabled: true
    rate: 0.001

    issues:
      mojibake: 0.4                  # UTF-8/Latin-1 confusion
      missing_chars: 0.3             # Characters dropped
      bom_issues: 0.2                # BOM artifacts
      html_entities: 0.1             # &amp; instead of &

Examples:

  • Mojibake: Müller → MÃ¼ller (UTF-8 read as Latin-1)
  • Missing characters: Zürich → Zrich
  • HTML entities: R&D → R&amp;D

ML Training Labels

The data quality module generates labels for ML model training:

QualityIssueLabel

#![allow(unused)]
fn main() {
pub struct QualityIssueLabel {
    pub issue_id: String,
    pub issue_type: LabeledIssueType,
    pub issue_subtype: Option<QualityIssueSubtype>,
    pub document_id: String,
    pub field_name: String,
    pub original_value: Option<String>,
    pub modified_value: Option<String>,
    pub severity: u8,  // 1-5
    pub processor: String,
    pub metadata: HashMap<String, String>,
}
}

Issue Types

Type | Severity | Description
MissingValue | 3 | Field is null/empty
Typo | 2 | Character-level errors
FormatVariation | 1 | Different formatting
Duplicate | 4 | Duplicate record
EncodingIssue | 3 | Character encoding problems
Inconsistency | 3 | Cross-field inconsistency
OutOfRange | 4 | Value outside expected range
InvalidReference | 5 | Reference to non-existent entity

Subtypes

Each issue type has detailed subtypes:

  • Typo: Substitution, Transposition, Insertion, Deletion, DoubleChar, CaseError, OcrError, Homophone
  • FormatVariation: DateFormat, AmountFormat, IdentifierFormat, TextFormat
  • Duplicate: ExactDuplicate, NearDuplicate, FuzzyDuplicate, CrossSystemDuplicate
  • EncodingIssue: Mojibake, MissingChars, Bom, ControlChars, HtmlEntities

Output

quality_issues.csv

Field | Description
document_id | Affected record
field_name | Field with issue
issue_type | missing, typo, duplicate, etc.
original_value | Value before modification
modified_value | Value after modification

quality_labels.csv (ML Training)

Field | Description
issue_id | Unique issue identifier
issue_type | LabeledIssueType enum
issue_subtype | Detailed subtype
document_id | Affected document
field_name | Affected field
original_value | Original value
modified_value | Modified value
severity | 1-5 severity score
processor | Which processor injected the issue

Example Configurations

Testing Data Pipelines

data_quality:
  enabled: true

  missing_values:
    rate: 0.02
    pattern: mcar

  format_variations:
    date_formats: true
    amount_formats: true

  typos:
    rate: 0.01
    keyboard_aware: true

Testing Deduplication

data_quality:
  enabled: true

  duplicates:
    rate: 0.05                       # High duplicate rate
    types:
      exact: 0.3
      near: 0.4
      fuzzy: 0.3

Testing OCR Processing

data_quality:
  enabled: true

  typos:
    rate: 0.03
    types:
      ocr_errors: 0.8                # Mostly OCR-style errors
      substitution: 0.2

See Also

Graph Export

Export transaction data as ML-ready graphs.

Overview

Graph export transforms financial data into network representations:

  • Accounting Network (GL accounts as nodes, transactions as edges) - New in v0.2.1
  • Transaction networks (accounts and entities)
  • Approval networks (users and approvals)
  • Entity relationship graphs (ownership)

Accounting Network Graph Export

The accounting network represents money flows between GL accounts, designed for network reconstruction and anomaly detection algorithms.

Quick Start

# Generate with graph export enabled
datasynth-data generate --config config.yaml --output ./output --graph-export

Graph Structure

Element | Description
Nodes | GL Accounts from Chart of Accounts
Edges | Money flows FROM credit accounts TO debit accounts
Direction | Directed graph (source→target)

     ┌──────────────┐
     │ Credit Acct  │
     │   (2000)     │
     └──────┬───────┘
            │ $1,000
            ▼
     ┌──────────────┐
     │ Debit Acct   │
     │   (1100)     │
     └──────────────┘

Edge Features (8 dimensions)

Feature | Index | Description
log_amount | F0 | log10(transaction amount)
benford_prob | F1 | Expected first-digit probability
weekday | F2 | Day of week (normalized 0-1)
period | F3 | Fiscal period (normalized 0-1)
is_month_end | F4 | Last 3 days of month
is_year_end | F5 | Last month of year
is_anomaly | F6 | Anomaly flag (0 or 1)
business_process | F7 | Encoded business process
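
The first two edge features follow directly from the transaction amount; in particular, the Benford expectation for first digit d is log10(1 + 1/d). A small Python sketch of how such features can be recomputed (illustrative, not the exporter's code):

import math

def log_amount(amount: float) -> float:
    """F0: log10 of the transaction amount."""
    return math.log10(amount)

def benford_prob(amount: float) -> float:
    """F1: Benford's Law probability of the first significant digit, log10(1 + 1/d)."""
    first_digit = int(str(abs(amount)).lstrip("0.")[0])
    return math.log10(1 + 1 / first_digit)

print(log_amount(1_000.0))    # 3.0
print(benford_prob(1_234.5))  # ~0.301 (first digit 1)
print(benford_prob(987.0))    # ~0.046 (first digit 9)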

Output Files

output/graphs/accounting_network/pytorch_geometric/
├── edge_index.npy      # [2, E] source→target node indices
├── node_features.npy   # [N, 4] node feature vectors
├── edge_features.npy   # [E, 8] edge feature vectors
├── edge_labels.npy     # [E] anomaly labels (0=normal, 1=anomaly)
├── node_labels.npy     # [N] node labels
├── train_mask.npy      # [N] boolean training mask
├── val_mask.npy        # [N] boolean validation mask
├── test_mask.npy       # [N] boolean test mask
├── metadata.json       # Graph statistics and configuration
└── load_graph.py       # Auto-generated Python loader script

Loading in Python

import numpy as np
import json

# Load metadata
with open('metadata.json') as f:
    meta = json.load(f)
print(f"Nodes: {meta['num_nodes']}, Edges: {meta['num_edges']}")

# Load arrays
edge_index = np.load('edge_index.npy')      # [2, E]
node_features = np.load('node_features.npy') # [N, F]
edge_features = np.load('edge_features.npy') # [E, 8]
edge_labels = np.load('edge_labels.npy')     # [E]

# For PyTorch Geometric
import torch
from torch_geometric.data import Data

data = Data(
    x=torch.from_numpy(node_features).float(),
    edge_index=torch.from_numpy(edge_index).long(),
    edge_attr=torch.from_numpy(edge_features).float(),
    y=torch.from_numpy(edge_labels).long(),
)

Configuration

graph_export:
  enabled: true
  formats:
    - pytorch_geometric
  train_ratio: 0.7
  validation_ratio: 0.15
  # test_ratio is automatically 1 - train - val = 0.15
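
The exported boolean masks partition nodes according to these ratios. A hedged sketch of how equivalent masks could be rebuilt from the same ratios and a fixed seed (illustrative only, not the exporter's algorithm):

import numpy as np

def make_split_masks(num_nodes: int, train_ratio=0.7, val_ratio=0.15, seed=42):
    """Randomly partition node indices into train/val/test boolean masks."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_nodes)
    n_train = int(train_ratio * num_nodes)
    n_val = int(val_ratio * num_nodes)
    train = np.zeros(num_nodes, dtype=bool)
    val = np.zeros(num_nodes, dtype=bool)
    test = np.zeros(num_nodes, dtype=bool)
    train[order[:n_train]] = True
    val[order[n_train:n_train + n_val]] = True
    test[order[n_train + n_val:]] = True
    return train, val, test

train_mask, val_mask, test_mask = make_split_masks(100)
print(train_mask.sum(), val_mask.sum(), test_mask.sum())  # 70 15 15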

Use Cases

  1. Anomaly Detection: Train GNNs to detect anomalous transaction patterns
  2. Network Reconstruction: Validate accounting network recovery algorithms
  3. Fraud Detection: Identify suspicious money flow patterns
  4. Link Prediction: Predict likely transaction relationships

Configuration

graph_export:
  enabled: true

  formats:
    - pytorch_geometric
    - neo4j
    - dgl

  graphs:
    - transaction_network
    - approval_network
    - entity_relationship

  split:
    train: 0.7
    val: 0.15
    test: 0.15
    stratify: is_anomaly

  features:
    temporal: true
    amount: true
    structural: true
    categorical: true

Graph Types

Transaction Network

Accounts and entities as nodes, transactions as edges.

     ┌──────────┐
     │ Account  │
     │  1100    │
     └────┬─────┘
          │ $1000
          ▼
     ┌──────────┐
     │ Customer │
     │  C-001   │
     └──────────┘

Nodes:

  • GL accounts
  • Vendors
  • Customers
  • Cost centers

Edges:

  • Journal entry lines
  • Payments
  • Invoices

Approval Network

Users as nodes, approval relationships as edges.

     ┌──────────┐
     │  Clerk   │
     │  U-001   │
     └────┬─────┘
          │ approved
          ▼
     ┌──────────┐
     │ Manager  │
     │  U-002   │
     └──────────┘

Nodes: Employees/users
Edges: Approval actions

Entity Relationship Network

Legal entities with ownership relationships.

     ┌──────────┐
     │  Parent  │
     │  1000    │
     └────┬─────┘
          │ 100%
          ▼
     ┌──────────┐
     │   Sub    │
     │  2000    │
     └──────────┘

Nodes: Companies
Edges: Ownership, IC transactions

Export Formats

PyTorch Geometric

output/graphs/transaction_network/pytorch_geometric/
├── node_features.pt    # [num_nodes, num_features]
├── edge_index.pt       # [2, num_edges]
├── edge_attr.pt        # [num_edges, num_edge_features]
├── labels.pt           # Labels
├── train_mask.pt       # Boolean training mask
├── val_mask.pt         # Boolean validation mask
└── test_mask.pt        # Boolean test mask

Loading in Python:

import torch
from torch_geometric.data import Data

# Load tensors
node_features = torch.load('node_features.pt')
edge_index = torch.load('edge_index.pt')
edge_attr = torch.load('edge_attr.pt')
labels = torch.load('labels.pt')
train_mask = torch.load('train_mask.pt')

# Create PyG Data object
data = Data(
    x=node_features,
    edge_index=edge_index,
    edge_attr=edge_attr,
    y=labels,
    train_mask=train_mask,
)

print(f"Nodes: {data.num_nodes}")
print(f"Edges: {data.num_edges}")

Neo4j

output/graphs/transaction_network/neo4j/
├── nodes_account.csv
├── nodes_vendor.csv
├── nodes_customer.csv
├── edges_transaction.csv
├── edges_payment.csv
└── import.cypher

Import script (import.cypher):

// Load accounts
LOAD CSV WITH HEADERS FROM 'file:///nodes_account.csv' AS row
CREATE (:Account {
    id: row.id,
    name: row.name,
    type: row.type
});

// Load transactions
LOAD CSV WITH HEADERS FROM 'file:///edges_transaction.csv' AS row
MATCH (from:Account {id: row.from_id})
MATCH (to:Account {id: row.to_id})
CREATE (from)-[:TRANSACTION {
    amount: toFloat(row.amount),
    date: date(row.posting_date),
    is_anomaly: toBoolean(row.is_anomaly)
}]->(to);

DGL (Deep Graph Library)

output/graphs/transaction_network/dgl/
├── graph.bin           # Serialized DGL graph
├── node_feats.npy      # Node features
├── edge_feats.npy      # Edge features
└── labels.npy          # Labels

Loading in Python:

import dgl
import numpy as np

# Load graph
graph = dgl.load_graphs('graph.bin')[0][0]

# Load features
graph.ndata['feat'] = torch.tensor(np.load('node_feats.npy'))
graph.edata['feat'] = torch.tensor(np.load('edge_feats.npy'))
graph.ndata['label'] = torch.tensor(np.load('labels.npy'))

Features

Temporal Features

features:
  temporal: true

Feature | Description
weekday | Day of week (0-6)
period | Fiscal period (1-12)
is_month_end | Last 3 days of month
is_quarter_end | Last week of quarter
is_year_end | Last month of year
hour | Hour of posting

Amount Features

features:
  amount: true

Feature | Description
log_amount | log10(amount)
benford_prob | Expected first-digit probability
is_round_number | Ends in 00, 000, etc.
amount_zscore | Standard deviations from mean

Structural Features

features:
  structural: true

Feature | Description
line_count | Number of JE lines
unique_accounts | Distinct accounts used
has_intercompany | IC transaction flag
debit_credit_ratio | Total debits / credits

Categorical Features

features:
  categorical: true

One-hot encoded:

  • business_process: Manual, P2P, O2C, etc.
  • source_type: System, User, Recurring
  • account_type: Asset, Liability, etc.

Train/Val/Test Splits

split:
  train: 0.7                         # 70% training
  val: 0.15                          # 15% validation
  test: 0.15                         # 15% test
  stratify: is_anomaly               # Maintain anomaly ratio
  random_seed: 42                    # Reproducible splits

Stratification options:

  • is_anomaly: Balanced anomaly detection
  • is_fraud: Balanced fraud detection
  • account_type: Balanced by account type
  • null: Random (no stratification)

GNN Training Example

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class AnomalyGNN(torch.nn.Module):
    def __init__(self, num_features, hidden_dim):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, 2)  # Binary classification

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x

# Train
model = AnomalyGNN(data.num_features, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

Multi-Layer Hypergraph (v0.6.2)

The RustGraph Hypergraph exporter now supports all 8 enterprise process families with 24 entity type codes:

Entity Type Codes

Range | Family | Types
100 | Accounting | GL Accounts
300-303 | P2P | PurchaseOrder, GoodsReceipt, VendorInvoice, Payment
310-312 | O2C | SalesOrder, Delivery, CustomerInvoice
320-325 | S2C | SourcingProject, RfxEvent, SupplierBid, BidEvaluation, ProcurementContract, SupplierQualification
330-333 | H2R | PayrollRun, TimeEntry, ExpenseReport, PayrollLineItem
340-343 | MFG | ProductionOrder, RoutingOperation, QualityInspection, CycleCount
350-352 | BANK | BankingCustomer, BankAccount, BankTransaction
360-365 | AUDIT | AuditEngagement, Workpaper, AuditFinding, AuditEvidence, RiskAssessment, ProfessionalJudgment
370-372 | Bank Recon | BankReconciliation, BankStatementLine, ReconcilingItem
400 | OCPM | OcpmEvent (events as hyperedges)

OCPM Events as Hyperedges

When events_as_hyperedges: true, each OCPM event becomes a hyperedge connecting all its participating objects. This enables cross-process analysis via the hypergraph structure.

Per-Family Toggles

graph_export:
  hypergraph:
    enabled: true
    process_layer:
      include_p2p: true
      include_o2c: true
      include_s2c: true
      include_h2r: true
      include_mfg: true
      include_bank: true
      include_audit: true
      include_r2r: true
      events_as_hyperedges: true

See Also

Intercompany Processing

Generate matched intercompany transactions and elimination entries.

Overview

Intercompany features simulate multi-entity corporate structures:

  • IC transaction pairs (seller/buyer)
  • Transfer pricing
  • IC reconciliation
  • Consolidation eliminations

Prerequisites

Multiple companies must be defined:

companies:
  - code: "1000"
    name: "Parent Company"
    is_parent: true
    volume_weight: 0.5

  - code: "2000"
    name: "Subsidiary"
    parent_code: "1000"
    volume_weight: 0.5

Configuration

intercompany:
  enabled: true

  transaction_types:
    goods_sale: 0.4
    service_provided: 0.2
    loan: 0.15
    dividend: 0.1
    management_fee: 0.1
    royalty: 0.05

  transfer_pricing:
    method: cost_plus
    markup_range:
      min: 0.03
      max: 0.10

  elimination:
    enabled: true
    timing: quarterly

IC Transaction Types

Goods Sale

Internal sale of inventory between entities.

Seller (1000):
    Dr Intercompany Receivable   1,100
        Cr IC Revenue            1,100
    Dr IC COGS                     800
        Cr Inventory               800

Buyer (2000):
    Dr Inventory                 1,100
        Cr Intercompany Payable  1,100

Service Provided

Internal services (IT, HR, legal).

Provider (1000):
    Dr IC Receivable               500
        Cr IC Service Revenue      500

Receiver (2000):
    Dr Service Expense             500
        Cr IC Payable              500

Loan

Intercompany financing.

Lender (1000):
    Dr IC Loan Receivable       10,000
        Cr Cash                 10,000

Borrower (2000):
    Dr Cash                     10,000
        Cr IC Loan Payable      10,000

Dividend

Upstream dividend payment.

Subsidiary (2000):
    Dr Retained Earnings         5,000
        Cr Cash                  5,000

Parent (1000):
    Dr Cash                      5,000
        Cr Dividend Income       5,000

Management Fee

Corporate overhead allocation.

Parent (1000):
    Dr IC Receivable             1,000
        Cr Mgmt Fee Revenue      1,000

Subsidiary (2000):
    Dr Mgmt Fee Expense          1,000
        Cr IC Payable            1,000

Royalty

IP licensing fees.

Licensor (1000):
    Dr IC Receivable               750
        Cr Royalty Revenue         750

Licensee (2000):
    Dr Royalty Expense             750
        Cr IC Payable              750

Transfer Pricing

Methods

Method | Description
cost_plus | Cost + markup percentage
resale_minus | Resale price - margin
comparable_uncontrolled | Market price

transfer_pricing:
  method: cost_plus
  markup_range:
    min: 0.03                        # 3% minimum markup
    max: 0.10                        # 10% maximum markup

  # OR for resale minus
  method: resale_minus
  margin_range:
    min: 0.15
    max: 0.25

Arm’s Length Pricing

Prices generated to be defensible:

#![allow(unused)]
fn main() {
fn calculate_transfer_price(cost: Decimal, method: &TransferPricingMethod) -> Decimal {
    match method {
        TransferPricingMethod::CostPlus { markup } => {
            cost * (Decimal::ONE + markup)
        }
        TransferPricingMethod::ResaleMinus { margin, resale_price } => {
            resale_price * (Decimal::ONE - margin)
        }
        TransferPricingMethod::Comparable { market_price } => {
            market_price
        }
    }
}
}

IC Matching

Matched Pair Structure

#![allow(unused)]
fn main() {
pub struct ICMatchedPair {
    pub pair_id: String,
    pub seller_company: String,
    pub buyer_company: String,
    pub seller_entry_id: Uuid,
    pub buyer_entry_id: Uuid,
    pub transaction_type: ICTransactionType,
    pub amount: Decimal,
    pub currency: String,
    pub transaction_date: NaiveDate,
}
}

Match Validation

intercompany:
  matching:
    enabled: true
    tolerance: 0.01                  # 1% variance allowed
    mismatch_rate: 0.02              # 2% intentional mismatches

Match statuses:

  • matched: Amounts reconcile
  • timing_difference: Different posting dates
  • fx_difference: Currency conversion variance
  • unmatched: No matching entry
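
A minimal sketch of the tolerance check behind these statuses (field names and the date comparison are illustrative; FX differences are not modeled here):

from datetime import date

def classify_ic_match(seller_amount: float, buyer_amount: float,
                      seller_date: date, buyer_date: date,
                      tolerance: float = 0.01) -> str:
    """Classify an intercompany pair using a relative amount tolerance."""
    diff = abs(seller_amount - buyer_amount) / max(abs(seller_amount), abs(buyer_amount), 1e-9)
    if diff > tolerance:
        return "unmatched"
    return "matched" if seller_date == buyer_date else "timing_difference"

print(classify_ic_match(1_100.0, 1_100.0, date(2024, 3, 31), date(2024, 3, 31)))  # matched
print(classify_ic_match(1_100.0, 1_095.0, date(2024, 3, 31), date(2024, 4, 2)))   # timing_difference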

Eliminations

Timing

intercompany:
  elimination:
    timing: quarterly                # monthly, quarterly, annual

Elimination Types

Revenue/Expense Elimination:

Elimination entry:
    Dr IC Revenue (1000)           1,100
        Cr IC Expense (2000)       1,100

Unrealized Profit Elimination:

If buyer still holds inventory:
    Dr IC Revenue                    300
        Cr Inventory                 300

Receivable/Payable Elimination:

    Dr IC Payable (2000)          10,000
        Cr IC Receivable (1000)   10,000

Output Files

ic_pairs.csv

Field | Description
pair_id | Unique pair identifier
seller_company | Selling entity
buyer_company | Buying entity
seller_entry_id | Seller’s JE document ID
buyer_entry_id | Buyer’s JE document ID
transaction_type | Type of IC transaction
amount | Transaction amount
match_status | Match result

eliminations.csv

Field | Description
elimination_id | Unique ID
ic_pair_id | Reference to IC pair
elimination_type | Revenue, profit, balance
debit_company | Company debited
credit_company | Company credited
amount | Elimination amount
period | Fiscal period

Example Configuration

Multi-National Structure

companies:
  - code: "1000"
    name: "US Headquarters"
    currency: USD
    country: US
    is_parent: true
    volume_weight: 0.4

  - code: "2000"
    name: "European Hub"
    currency: EUR
    country: DE
    parent_code: "1000"
    volume_weight: 0.3

  - code: "3000"
    name: "Asia Pacific"
    currency: JPY
    country: JP
    parent_code: "1000"
    volume_weight: 0.3

intercompany:
  enabled: true

  transaction_types:
    goods_sale: 0.5
    service_provided: 0.2
    management_fee: 0.15
    royalty: 0.15

  transfer_pricing:
    method: cost_plus
    markup_range:
      min: 0.05
      max: 0.12

  elimination:
    enabled: true
    timing: quarterly

See Also

Interconnectivity and Relationship Modeling

SyntheticData provides comprehensive relationship modeling capabilities for generating realistic enterprise networks with multi-tier vendor relationships, customer segmentation, relationship strength calculations, and cross-process linkages.

Overview

Real enterprise data exhibits complex interconnections between entities:

  • Vendors form multi-tier supply chains (supplier-of-supplier)
  • Customers segment by value (Enterprise vs. SMB) with different behaviors
  • Relationships vary in strength based on transaction history
  • Business processes connect (P2P and O2C link through inventory)

SyntheticData models all of these patterns to produce realistic, interconnected data.


Multi-Tier Vendor Networks

Supply Chain Tiers

Vendors are organized into a supply chain hierarchy:

Tier | Description | Visibility | Typical Count
Tier 1 | Direct suppliers | Full financial visibility | 50-100 per company
Tier 2 | Supplier’s suppliers | Partial visibility | 4-10 per Tier 1
Tier 3 | Deep supply chain | Minimal visibility | 2-5 per Tier 2

Vendor Clusters

Vendors are classified into behavioral clusters:

Cluster | Share | Characteristics
Reliable Strategic | 20% | High delivery scores, low invoice errors, consistent quality
Standard Operational | 50% | Average performance, predictable patterns
Transactional | 25% | One-off or occasional purchases
Problematic | 5% | Quality issues, late deliveries, invoice discrepancies

Vendor Lifecycle Stages

Onboarding → RampUp → SteadyState → Decline → Terminated

Each stage has associated behaviors:

  • Onboarding: Initial qualification, small orders
  • RampUp: Increasing order volumes
  • SteadyState: Stable, predictable patterns
  • Decline: Reduced orders, performance issues
  • Terminated: Relationship ended

Vendor Quality Scores

Metric | Range | Description
delivery_on_time | 0.0-1.0 | Percentage of on-time deliveries
quality_pass_rate | 0.0-1.0 | Quality inspection pass rate
invoice_accuracy | 0.0-1.0 | Invoice matching accuracy
responsiveness_score | 0.0-1.0 | Communication responsiveness

Vendor Concentration Analysis

SyntheticData tracks vendor concentration risks:

dependencies:
  max_single_vendor_concentration: 0.15  # No vendor > 15% of spend
  top_5_concentration: 0.45              # Top 5 vendors < 45% of spend
  single_source_percent: 0.05            # 5% of materials single-sourced

Customer Value Segmentation

Value Segments

Customers follow a Pareto-like distribution:

Segment | Revenue Share | Customer Share | Typical Order Value
Enterprise | 40% | 5% | $50,000+
MidMarket | 35% | 20% | $5,000-$50,000
SMB | 20% | 50% | $500-$5,000
Consumer | 5% | 25% | $50-$500

Customer Lifecycle

Prospect → New → Growth → Mature → AtRisk → Churned
                                         ↓
                                      WonBack

Each stage has associated behaviors:

  • Prospect: Potential customer, conversion probability
  • New: First purchase within 90 days
  • Growth: Increasing order frequency/value
  • Mature: Stable, loyal customer
  • AtRisk: Declining activity, churn signals
  • Churned: No activity for extended period
  • WonBack: Previously churned, now returned

Customer Engagement Metrics

Metric | Description
order_frequency | Average orders per period
recency_days | Days since last order
nps_score | Net Promoter Score (-100 to +100)
engagement_score | Composite engagement metric (0.0-1.0)

Customer Networks

  • Referral Networks: Customers refer other customers (configurable rate)
  • Corporate Hierarchies: Parent/child company relationships
  • Industry Clusters: Customers grouped by industry vertical

Relationship Strength Modeling

Composite Strength Calculation

Relationship strength is computed from multiple factors:

Component | Weight | Scale | Description
Transaction Volume | 30% | Logarithmic | Total monetary value
Transaction Count | 25% | Square root | Number of transactions
Duration | 20% | Linear | Relationship age in days
Recency | 15% | Exponential decay | Days since last transaction
Mutual Connections | 10% | Jaccard index | Shared network connections

Strength Categories

Strength | Threshold | Description
Strong | ≥ 0.7 | Core business relationship
Moderate | ≥ 0.4 | Regular, established relationship
Weak | ≥ 0.1 | Occasional relationship
Dormant | < 0.1 | Inactive relationship

Recency Decay

The recency component uses exponential decay:

recency_score = exp(-days_since_last / half_life)

Default half-life is 90 days.
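
A hedged sketch of how the weighted components could combine into a single score. The saturation caps used to normalize volume, count, and duration are assumptions made for illustration; only the weights, the scale types, and the 90-day half-life come from the tables above:

import math

def relationship_strength(volume, count, duration_days, days_since_last,
                          mutual_jaccard, half_life_days=90.0):
    """Composite relationship strength in [0, 1] from the five weighted components."""
    volume_score   = min(math.log10(1 + volume) / 7.0, 1.0)      # log scale, assumed cap around $10M
    count_score    = min(math.sqrt(count) / 30.0, 1.0)           # square-root scale, assumed cap ~900 transactions
    duration_score = min(duration_days / 1825.0, 1.0)            # linear, assumed cap at ~5 years
    recency_score  = math.exp(-days_since_last / half_life_days) # exponential decay, 90-day half-life
    return (0.30 * volume_score + 0.25 * count_score + 0.20 * duration_score
            + 0.15 * recency_score + 0.10 * mutual_jaccard)

score = relationship_strength(volume=2_500_000, count=320, duration_days=900,
                              days_since_last=14, mutual_jaccard=0.2)
print(round(score, 2))  # ~0.67, i.e. a moderate-to-strong relationship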


Cross-Process Linkages

Inventory naturally connects Procure-to-Pay and Order-to-Cash:

P2P: Purchase Order → Goods Receipt → Vendor Invoice → Payment
                           ↓
                      [Inventory]
                           ↓
O2C: Sales Order → Delivery → Customer Invoice → Receipt

When enabled, SyntheticData generates explicit CrossProcessLink records connecting:

  • GoodsReceipt (P2P) to Delivery (O2C) via inventory item

Payment-Bank Reconciliation

Links payment transactions to bank statement entries for reconciliation.

Intercompany Bilateral Links

Ensures intercompany transactions are properly linked between sending and receiving entities.


Entity Graph

Graph Structure

The EntityGraph provides a unified view of all entity relationships:

Component | Description
Nodes | Entities with type, ID, and metadata
Edges | Relationships with type and strength
Indexes | Fast lookups by entity type and ID

Entity Types (16 types)

Company, Vendor, Customer, Employee, Department, CostCenter,
Project, Contract, Asset, BankAccount, Material, Product,
Location, Currency, Account, Entity

Relationship Types (27 types)

// Transactional
BuysFrom, SellsTo, PaysTo, ReceivesFrom, SuppliesTo, OrdersFrom

// Organizational
ReportsTo, Manages, BelongsTo, OwnedBy, PartOf, Contains

// Network
ReferredBy, PartnersWith, AffiliateOf, SubsidiaryOf

// Process
ApprovesFor, AuthorizesFor, ProcessesFor

// Financial
BillsTo, ShipsTo, CollectsFrom, RemitsTo

// Document
ReferencedBy, SupersededBy, AmendedBy, LinkedTo

Configuration

Complete Example

vendor_network:
  enabled: true
  depth: 3
  tiers:
    tier1:
      count_min: 50
      count_max: 100
    tier2:
      count_per_parent_min: 4
      count_per_parent_max: 10
    tier3:
      count_per_parent_min: 2
      count_per_parent_max: 5
  clusters:
    reliable_strategic: 0.20
    standard_operational: 0.50
    transactional: 0.25
    problematic: 0.05
  dependencies:
    max_single_vendor_concentration: 0.15
    top_5_concentration: 0.45
    single_source_percent: 0.05

customer_segmentation:
  enabled: true
  value_segments:
    enterprise:
      revenue_share: 0.40
      customer_share: 0.05
      avg_order_min: 50000.0
    mid_market:
      revenue_share: 0.35
      customer_share: 0.20
      avg_order_min: 5000.0
      avg_order_max: 50000.0
    smb:
      revenue_share: 0.20
      customer_share: 0.50
      avg_order_min: 500.0
      avg_order_max: 5000.0
    consumer:
      revenue_share: 0.05
      customer_share: 0.25
      avg_order_min: 50.0
      avg_order_max: 500.0
  lifecycle:
    prospect_rate: 0.10
    new_rate: 0.15
    growth_rate: 0.20
    mature_rate: 0.35
    at_risk_rate: 0.10
    churned_rate: 0.08
    won_back_rate: 0.02
  networks:
    referrals:
      enabled: true
      referral_rate: 0.15
    corporate_hierarchies:
      enabled: true
      hierarchy_probability: 0.30

relationship_strength:
  enabled: true
  calculation:
    transaction_volume_weight: 0.30
    transaction_count_weight: 0.25
    relationship_duration_weight: 0.20
    recency_weight: 0.15
    mutual_connections_weight: 0.10
    recency_half_life_days: 90
  thresholds:
    strong: 0.7
    moderate: 0.4
    weak: 0.1

cross_process_links:
  enabled: true
  inventory_p2p_o2c: true
  payment_bank_reconciliation: true
  intercompany_bilateral: true

Network Evaluation

SyntheticData includes network metrics evaluation:

| Metric | Description | Typical Range |
|---|---|---|
| Connectivity | Largest connected component ratio | > 0.95 |
| Power Law Alpha | Degree distribution exponent | 2.0-3.0 |
| Clustering Coefficient | Local clustering | 0.10-0.50 |
| Top-1 Concentration | Largest node share | < 0.15 |
| Top-5 Concentration | Top 5 nodes share | < 0.45 |
| HHI | Herfindahl-Hirschman Index | < 0.25 |

These metrics validate that generated networks exhibit realistic properties.
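
For illustration, the concentration metrics can be computed directly from per-node spend shares; a small sketch (assumed helper, not the built-in evaluator):

// Illustrative computation of top-1, top-5, and HHI concentration from
// per-vendor spend shares (assumed to sum to 1.0).
fn concentration_metrics(mut shares: Vec<f64>) -> (f64, f64, f64) {
    shares.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let top1 = shares.first().copied().unwrap_or(0.0);
    let top5: f64 = shares.iter().take(5).sum();
    let hhi: f64 = shares.iter().map(|s| s * s).sum();
    (top1, top5, hhi)
}

fn main() {
    let shares = vec![0.14, 0.12, 0.11, 0.11, 0.10, 0.09, 0.09, 0.09, 0.08, 0.07];
    let (top1, top5, hhi) = concentration_metrics(shares);
    println!("top1={top1:.2}, top5={top5:.2}, hhi={hhi:.3}");
}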


API Usage

Rust API

#![allow(unused)]
fn main() {
use datasynth_core::models::{
    VendorNetwork, VendorCluster, SupplyChainTier,
    SegmentedCustomerPool, CustomerValueSegment,
    EntityGraph, RelationshipStrengthCalculator,
};
use datasynth_generators::relationships::EntityGraphGenerator;
// VendorGenerator and CustomerGenerator are assumed to be in scope from the
// corresponding generator modules.

// Generate vendor network
let vendor_generator = VendorGenerator::new(config);
let vendor_network = vendor_generator.generate_vendor_network("C001");

// Generate segmented customers
let customer_generator = CustomerGenerator::new(config);
let customer_pool = customer_generator.generate_segmented_pool("C001");

// Build entity graph with cross-process links
let graph_generator = EntityGraphGenerator::with_defaults();
let entity_graph = graph_generator.generate_entity_graph(
    &vendor_network,
    &customer_pool,
    &transactions,
    &document_flows,
);
}

Python API

from datasynth_py import DataSynth
from datasynth_py.config import Config, VendorNetworkConfig, CustomerSegmentationConfig

config = Config(
    vendor_network=VendorNetworkConfig(
        enabled=True,
        depth=3,
        clusters={"reliable_strategic": 0.20, "standard_operational": 0.50},
    ),
    customer_segmentation=CustomerSegmentationConfig(
        enabled=True,
        value_segments={
            "enterprise": {"revenue_share": 0.40, "customer_share": 0.05},
            "mid_market": {"revenue_share": 0.35, "customer_share": 0.20},
        },
    ),
)

result = DataSynth().generate(config=config, output={"format": "csv"})

See Also

Period Close Engine

Generate period-end accounting processes.

Overview

The period close engine simulates:

  • Monthly close (accruals, depreciation)
  • Quarterly close (IC elimination, translation)
  • Annual close (closing entries, retained earnings)

Configuration

period_close:
  enabled: true

  monthly:
    accruals: true
    depreciation: true
    reconciliation: true

  quarterly:
    intercompany_elimination: true
    currency_translation: true

  annual:
    closing_entries: true
    retained_earnings: true

Monthly Close

Accruals

Generate reversing accrual entries:

period_close:
  monthly:
    accruals:
      enabled: true
      auto_reverse: true             # Reverse next period

      categories:
        expense_accrual: 0.4
        revenue_accrual: 0.2
        payroll_accrual: 0.3
        other: 0.1

Expense Accrual:

Period 1 (accrue):
    Dr Expense                     10,000
        Cr Accrued Liabilities     10,000

Period 2 (reverse):
    Dr Accrued Liabilities         10,000
        Cr Expense                 10,000

Depreciation

Calculate and post monthly depreciation:

period_close:
  monthly:
    depreciation:
      enabled: true
      run_date: last_day            # When in period

      methods:
        straight_line: 0.7
        declining_balance: 0.2
        units_of_production: 0.1

Depreciation Entry:

    Dr Depreciation Expense          5,000
        Cr Accumulated Depreciation      5,000
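
For reference, a straight-line run simply spreads depreciable cost evenly over the useful life; a minimal sketch (hypothetical helper, not the engine's API) with asset values chosen to reproduce the 5,000 entry above:

// Straight-line monthly depreciation: (cost - salvage) / useful life in months.
fn monthly_straight_line(cost: f64, salvage: f64, useful_life_months: u32) -> f64 {
    (cost - salvage) / useful_life_months as f64
}

fn main() {
    // A 300,000 asset with no salvage value over 60 months -> 5,000 per month.
    let expense = monthly_straight_line(300_000.0, 0.0, 60);
    println!("Dr Depreciation Expense {expense:.2} / Cr Accumulated Depreciation {expense:.2}");
}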

Subledger Reconciliation

Verify subledger-to-GL control accounts:

period_close:
  monthly:
    reconciliation:
      enabled: true

      checks:
        - subledger: ar
          control_account: "1100"
        - subledger: ap
          control_account: "2000"
        - subledger: inventory
          control_account: "1200"

Reconciliation Report:

| Subledger | Control Account | Subledger Balance | GL Balance | Difference |
|---|---|---|---|---|
| AR | 1100 | 500,000 | 500,000 | 0 |
| AP | 2000 | (300,000) | (300,000) | 0 |

Quarterly Close

IC Elimination

Generate consolidation eliminations:

period_close:
  quarterly:
    intercompany_elimination:
      enabled: true

      types:
        - revenue_expense            # Eliminate IC sales
        - unrealized_profit          # Eliminate IC inventory profit
        - receivable_payable         # Eliminate IC balances
        - dividends                  # Eliminate IC dividends

See Intercompany Processing for details.

Currency Translation

Translate foreign subsidiary balances:

period_close:
  quarterly:
    currency_translation:
      enabled: true
      method: current_rate           # current_rate, temporal

      rate_mapping:
        assets: closing_rate
        liabilities: closing_rate
        equity: historical_rate
        revenue: average_rate
        expense: average_rate

      cta_account: "3500"            # CTA equity account

Translation Entry (CTA):

If foreign currency strengthened:
    Dr Foreign Subsidiary Investment  10,000
        Cr CTA (Other Comprehensive)  10,000
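
Under the current-rate method the CTA is effectively the balancing figure that keeps the translated balance sheet in balance; a simplified sketch with hypothetical local-currency balances and rates:

// Simplified current-rate translation: assets and liabilities at the closing
// rate, equity at historical rates, income-statement results at the average
// rate. The residual is the CTA posted to other comprehensive income.
fn main() {
    // Local-currency balances (assets positive; liabilities, equity, and net
    // income negative so the untranslated books sum to zero).
    let assets_lc = 1_000_000.0;
    let liabilities_lc = -600_000.0;
    let equity_lc = -300_000.0;
    let net_income_lc = -100_000.0;

    let closing_rate = 1.10;
    let historical_rate = 1.00;
    let average_rate = 1.05;

    let residual = assets_lc * closing_rate
        + liabilities_lc * closing_rate
        + equity_lc * historical_rate
        + net_income_lc * average_rate;

    // A positive residual means the foreign currency strengthened: credit CTA.
    println!("CTA (credit if positive): {residual:.2}");
}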

Annual Close

Closing Entries

Close temporary accounts to retained earnings:

period_close:
  annual:
    closing_entries:
      enabled: true
      close_revenue: true
      close_expense: true
      income_summary_account: "3900"

Closing Sequence:

1. Close Revenue:
    Dr Revenue accounts (all)      1,000,000
        Cr Income Summary          1,000,000

2. Close Expenses:
    Dr Income Summary                800,000
        Cr Expense accounts (all)    800,000

3. Close Income Summary:
    Dr Income Summary                200,000
        Cr Retained Earnings         200,000

Retained Earnings

Update retained earnings:

period_close:
  annual:
    retained_earnings:
      enabled: true
      account: "3100"
      dividend_account: "3150"

Year-End Adjustments

Additional adjusting entries:

period_close:
  annual:
    adjustments:
      - type: bad_debt_provision
        rate: 0.02                   # 2% of AR

      - type: inventory_reserve
        rate: 0.01                   # 1% of inventory

      - type: bonus_accrual
        rate: 0.10                   # 10% of salary expense

Financial Statements (v0.6.0)

The period close engine can now generate full financial statement sets from the adjusted trial balance. This is controlled by the financial_reporting configuration section.

Balance Sheet

Generates a statement of financial position with current/non-current asset and liability classifications:

Assets                              Liabilities & Equity
├── Current Assets                  ├── Current Liabilities
│   ├── Cash & Equivalents          │   ├── Accounts Payable
│   ├── Accounts Receivable         │   ├── Accrued Liabilities
│   └── Inventory                   │   └── Current Debt
├── Non-Current Assets              ├── Non-Current Liabilities
│   ├── Fixed Assets (net)          │   └── Long-Term Debt
│   └── Intangibles                 └── Equity
Total Assets = Total L + E              ├── Common Stock
                                        └── Retained Earnings

Income Statement

Generates a multi-step income statement:

Revenue
- Cost of Goods Sold
= Gross Profit
- Operating Expenses
= Operating Income
+/- Other Income/Expense
= Income Before Tax
- Income Tax
= Net Income

Cash Flow Statement

Generates an indirect-method cash flow statement with three categories:

financial_reporting:
  generate_cash_flow: true

Categories:

  • Operating: Net income + non-cash adjustments + working capital changes
  • Investing: Capital expenditures, asset disposals
  • Financing: Debt proceeds/repayments, equity transactions, dividends

Statement of Changes in Equity

Tracks equity movements across the period:

  • Opening retained earnings
  • Net income for the period
  • Dividends declared
  • Other comprehensive income (CTA, unrealized gains)
  • Closing retained earnings

Management KPIs

When financial_reporting.management_kpis is enabled, the engine computes financial ratios (a small sketch follows the list):

  • Liquidity: Current ratio, quick ratio, cash ratio
  • Profitability: Gross margin, operating margin, net margin, ROA, ROE
  • Efficiency: Inventory turnover, AR turnover, AP turnover, days sales outstanding
  • Leverage: Debt-to-equity, debt-to-assets, interest coverage
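
A few of these ratios, sketched with hypothetical balance-sheet and income-statement inputs:

// Illustrative ratio calculations behind the management KPIs (inputs are
// made-up example figures, not generator output).
fn main() {
    let current_assets = 800_000.0;
    let inventory = 200_000.0;
    let current_liabilities = 500_000.0;
    let revenue = 2_000_000.0;
    let cogs = 1_200_000.0;
    let net_income = 150_000.0;
    let total_assets = 2_500_000.0;

    let current_ratio = current_assets / current_liabilities;
    let quick_ratio = (current_assets - inventory) / current_liabilities;
    let gross_margin = (revenue - cogs) / revenue;
    let net_margin = net_income / revenue;
    let roa = net_income / total_assets;

    println!("current_ratio={current_ratio:.2}, quick_ratio={quick_ratio:.2}");
    println!("gross_margin={gross_margin:.2}, net_margin={net_margin:.2}, roa={roa:.2}");
}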

Budgets

When financial_reporting.budgets is enabled, generates budget records with variance analysis:

financial_reporting:
  budgets:
    enabled: true
    variance_threshold: 0.10    # Flag variances > 10%

Produces budget vs. actual comparisons by account and period, with favorable/unfavorable variance flags.

Output Files

trial_balances/YYYY_MM.csv

| Field | Description |
|---|---|
| account_number | GL account |
| account_name | Account description |
| opening_balance | Period start |
| period_debits | Total debits |
| period_credits | Total credits |
| closing_balance | Period end |

accruals.csv

| Field | Description |
|---|---|
| accrual_id | Unique ID |
| accrual_type | Category |
| period | Accrual period |
| amount | Accrual amount |
| reversal_period | When reversed |
| entry_id | Related JE ID |

depreciation.csv

| Field | Description |
|---|---|
| asset_id | Fixed asset |
| period | Depreciation period |
| method | Depreciation method |
| depreciation_amount | Period expense |
| accumulated_total | Running total |
| net_book_value | Remaining value |

closing_entries.csv

| Field | Description |
|---|---|
| entry_id | Closing entry ID |
| entry_type | Revenue, expense, summary |
| account | Account closed |
| amount | Closing amount |
| fiscal_year | Year closed |

financial_statements.csv (v0.6.0)

| Field | Description |
|---|---|
| statement_id | Unique statement identifier |
| statement_type | balance_sheet, income_statement, cash_flow, changes_in_equity |
| company_code | Company code |
| period_end | Statement date |
| basis | us_gaap, ifrs, statutory |
| line_code | Line item code |
| label | Display label |
| section | Statement section |
| amount | Current period amount |
| amount_prior | Prior period amount |

bank_reconciliations.csv (v0.6.0)

| Field | Description |
|---|---|
| reconciliation_id | Unique reconciliation ID |
| company_code | Company code |
| bank_account | Bank account identifier |
| period_start | Reconciliation period start |
| period_end | Reconciliation period end |
| opening_balance | Opening bank balance |
| closing_balance | Closing bank balance |
| status | in_progress, completed, completed_with_exceptions |

management_kpis.csv (v0.6.0)

| Field | Description |
|---|---|
| company_code | Company code |
| period | Reporting period |
| kpi_name | Ratio name (e.g., current_ratio, gross_margin) |
| kpi_value | Computed ratio value |
| category | liquidity, profitability, efficiency, leverage |

Close Schedule

Month 1-11:
├── Accruals
├── Depreciation
└── Reconciliation

Month 3, 6, 9:
├── IC Elimination
└── Currency Translation

Month 12:
├── All monthly tasks
├── All quarterly tasks
├── Year-end adjustments
└── Closing entries

Example Configuration

Full Close Cycle

global:
  start_date: 2024-01-01
  period_months: 12

period_close:
  enabled: true

  monthly:
    accruals:
      enabled: true
      auto_reverse: true
    depreciation:
      enabled: true
    reconciliation:
      enabled: true

  quarterly:
    intercompany_elimination:
      enabled: true
    currency_translation:
      enabled: true

  annual:
    closing_entries:
      enabled: true
    retained_earnings:
      enabled: true
    adjustments:
      - type: bad_debt_provision
        rate: 0.02

See Also

Fingerprinting

Privacy-preserving fingerprint extraction enables generating synthetic data that matches the statistical properties of real data without exposing sensitive information.

Overview

Fingerprinting is a three-stage process:

  1. Extract: Analyze real data and capture statistical properties into a .dsf fingerprint file
  2. Synthesize: Generate synthetic data configuration from the fingerprint
  3. Evaluate: Validate that synthetic data matches the fingerprint’s statistical properties
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Real Data  │────▶│   Extract   │────▶│ .dsf File   │────▶│  Evaluate   │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                          │                    │                    │
                          ▼                    ▼                    ▼
                    Privacy Engine      Config Synthesizer    Fidelity Report

Privacy Mechanisms

Differential Privacy

The extraction process applies differential privacy to protect individual records:

  • Laplace Mechanism: Adds calibrated noise to numeric statistics
  • Gaussian Mechanism: Alternative for (ε,δ)-differential privacy
  • Epsilon Budget: Tracks privacy budget across all operations
Privacy Guarantee: For any two datasets D and D' differing in one record,
the probability ratio of any output is bounded by e^ε
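
A minimal sketch of the Laplace mechanism for a single statistic, assuming sensitivity 1.0 and the Standard level's ε = 1.0 (illustrative only, using the rand crate; not the crate's internal implementation):

// Sketch of the Laplace mechanism: noise scaled to sensitivity / epsilon,
// drawn via the inverse CDF of the Laplace distribution.
use rand::Rng;

fn laplace_noise(sensitivity: f64, epsilon: f64, rng: &mut impl Rng) -> f64 {
    let b = sensitivity / epsilon; // scale parameter
    let u: f64 = rng.gen_range(-0.5..0.5); // uniform on (-0.5, 0.5)
    -b * u.signum() * (1.0 - 2.0 * u.abs()).ln()
}

fn main() {
    let mut rng = rand::thread_rng();
    let true_mean = 5_000.0;
    // Assumed sensitivity of 1.0 for the statistic; epsilon = 1.0 (Standard).
    let private_mean = true_mean + laplace_noise(1.0, 1.0, &mut rng);
    println!("DP-noised mean: {private_mean:.2}");
}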

K-Anonymity

Categorical values are protected through suppression:

  • Values appearing fewer than k times are replaced with <suppressed>
  • Prevents identification of rare categories
  • Configurable threshold per privacy level

Winsorization

Numeric outliers are clipped to prevent identification:

  • Values beyond the configured percentile are capped
  • Prevents extreme values from leaking individual information
  • Outlier percentile varies by privacy level (85%-99%)
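
Both mechanisms reduce to a few lines; a sketch of k-anonymity suppression and percentile clipping (hypothetical helpers, not the extraction crate's internals):

use std::collections::HashMap;

// Replace categorical values that occur fewer than k times with "<suppressed>".
fn suppress_rare(values: &[String], k: usize) -> Vec<String> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for v in values {
        *counts.entry(v.as_str()).or_insert(0) += 1;
    }
    values
        .iter()
        .map(|v| {
            if counts[v.as_str()] < k {
                "<suppressed>".to_string()
            } else {
                v.clone()
            }
        })
        .collect()
}

// Winsorize: cap values above the configured percentile (e.g. 0.95 for Standard).
fn winsorize(values: &mut [f64], percentile: f64) {
    if values.is_empty() {
        return;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let idx = ((sorted.len() - 1) as f64 * percentile).round() as usize;
    let cap = sorted[idx];
    for v in values.iter_mut() {
        if *v > cap {
            *v = cap;
        }
    }
}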

Privacy Levels

| Level | Epsilon | k | Outlier % | Description |
|---|---|---|---|---|
| Minimal | 5.0 | 3 | 99% | Highest utility, lower privacy |
| Standard | 1.0 | 5 | 95% | Balanced (recommended default) |
| High | 0.5 | 10 | 90% | Higher privacy for sensitive data |
| Maximum | 0.1 | 20 | 85% | Maximum privacy, some utility loss |

Choosing a Privacy Level

  • Minimal: Internal testing, non-sensitive data
  • Standard: General use, moderate sensitivity
  • High: Personal financial data, healthcare
  • Maximum: Highly sensitive data, regulatory compliance

DSF File Format

The DataSynth Fingerprint (.dsf) file is a ZIP archive containing:

fingerprint.dsf
├── manifest.json       # Metadata, checksums, privacy config
├── schema.yaml         # Table/column structure, relationships
├── statistics.yaml     # Distributions, percentiles, Benford
├── correlations.yaml   # Correlation matrices, copula params
├── integrity.yaml      # Foreign keys, cardinality rules
├── rules.yaml          # Balance constraints, thresholds
├── anomalies.yaml      # Anomaly rates, patterns
└── privacy_audit.json  # All privacy decisions logged

Manifest Structure

{
  "version": "1.0",
  "format": "dsf",
  "created_at": "2026-01-23T10:30:00Z",
  "source": {
    "row_count": 100000,
    "column_count": 25,
    "tables": ["journal_entries", "vendors"]
  },
  "privacy": {
    "level": "standard",
    "epsilon": 1.0,
    "k": 5
  },
  "checksums": {
    "schema": "sha256:...",
    "statistics": "sha256:...",
    "correlations": "sha256:..."
  }
}

Extraction Process

Step 1: Schema Extraction

Analyzes data structure:

  • Infers column data types (numeric, categorical, date, text)
  • Computes cardinalities
  • Detects foreign key relationships
  • Identifies primary keys

Step 2: Statistical Extraction

Computes distributions with privacy:

  • Numeric columns: Mean, std, min, max, percentiles (with DP noise)
  • Categorical columns: Frequencies (with k-anonymity)
  • Temporal columns: Date ranges, seasonality patterns
  • Benford analysis: First-digit distribution compliance

Step 3: Correlation Extraction

Captures multivariate relationships:

  • Pearson correlation matrices (with DP)
  • Copula parameters for joint distributions
  • Cross-table relationship strengths

Step 4: Rules Extraction

Detects business rules:

  • Balance equations (debits = credits)
  • Approval thresholds
  • Validation constraints

Step 5: Anomaly Pattern Extraction

Captures anomaly characteristics:

  • Overall anomaly rate
  • Type distribution
  • Temporal patterns

Synthesis Process

Configuration Generation

The ConfigSynthesizer converts fingerprints to generation configuration:

#![allow(unused)]
fn main() {
// From fingerprint statistics, generate:
AmountSampler {
    distribution: LogNormal,
    mean: fp.statistics.amount.mean,
    std: fp.statistics.amount.std,
    round_number_bias: 0.15,
}
}

Copula-Based Generation

For correlated columns, the GaussianCopula preserves relationships (a simplified sketch follows the steps):

  1. Generate independent uniform samples
  2. Apply correlation structure
  3. Transform to target marginal distributions
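
A compact sketch of those three steps for two correlated columns (uses the rand and rand_distr crates plus a textbook normal-CDF approximation; the real synthesizer generalizes this to full correlation matrices and fitted marginals):

use rand_distr::{Distribution, Normal};

fn main() {
    let mut rng = rand::thread_rng();
    let std_normal = Normal::new(0.0, 1.0).unwrap();
    let rho: f64 = 0.6; // assumed correlation between the two columns

    for _ in 0..5 {
        // Step 1: correlated standard normals via Cholesky of [[1, rho], [rho, 1]].
        let z1: f64 = std_normal.sample(&mut rng);
        let z2: f64 = rho * z1 + (1.0 - rho * rho).sqrt() * std_normal.sample(&mut rng);

        // Step 2: normal CDF -> correlated uniforms in (0, 1).
        let (u1, u2) = (normal_cdf(z1), normal_cdf(z2));

        // Step 3: inverse CDF of the target marginals (here an exponential amount
        // and a uniform line-item count stand in for the fitted distributions).
        let amount = -5000.0 * (1.0 - u1).ln();
        let line_items = (1.0 + 9.0 * u2).round();
        println!("amount={amount:.2}, line_items={line_items}");
    }
}

// Standard normal CDF via an Abramowitz-Stegun erf approximation.
fn normal_cdf(x: f64) -> f64 {
    0.5 * (1.0 + erf(x / std::f64::consts::SQRT_2))
}

fn erf(x: f64) -> f64 {
    // Polynomial approximation with max error ~1.5e-7.
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let y = 1.0
        - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t - 0.284496736) * t
            + 0.254829592)
            * t
            * (-x * x).exp();
    sign * y
}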

Fidelity Evaluation

Metrics

| Metric | Description | Target |
|---|---|---|
| KS Statistic | Max CDF difference | < 0.1 |
| Wasserstein Distance | Earth mover’s distance | < 0.1 |
| Benford MAD | Mean absolute deviation from Benford | < 0.015 |
| Correlation RMSE | Correlation matrix difference | < 0.1 |
| Schema Match | Column type agreement | > 0.95 |
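
Two of these metrics are easy to state precisely; a sketch of the KS statistic between two samples and the Benford MAD over observed first-digit frequencies (illustrative helpers, not the evaluator API):

// Kolmogorov-Smirnov statistic: maximum distance between two empirical CDFs.
fn ks_statistic(mut a: Vec<f64>, mut b: Vec<f64>) -> f64 {
    a.sort_by(|x, y| x.partial_cmp(y).unwrap());
    b.sort_by(|x, y| x.partial_cmp(y).unwrap());
    let grid: Vec<f64> = a.iter().chain(b.iter()).copied().collect();
    grid.iter()
        .map(|&v| {
            let fa = a.iter().filter(|&&x| x <= v).count() as f64 / a.len() as f64;
            let fb = b.iter().filter(|&&x| x <= v).count() as f64 / b.len() as f64;
            (fa - fb).abs()
        })
        .fold(0.0, f64::max)
}

// Benford MAD: mean absolute deviation of observed first-digit frequencies
// from the Benford expectation log10(1 + 1/d).
fn benford_mad(first_digit_freq: &[f64; 9]) -> f64 {
    (1..=9usize)
        .map(|d| {
            let expected = (1.0 + 1.0 / d as f64).log10();
            (first_digit_freq[d - 1] - expected).abs()
        })
        .sum::<f64>()
        / 9.0
}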

Fidelity Report

Fidelity Evaluation Report
==========================
Overall Score: 0.87
Status: PASSED (threshold: 0.80)

Component Scores:
  Statistical:   0.89
  Correlation:   0.85
  Schema:        0.98
  Rules:         0.76

Details:
  - KS statistic (amount): 0.05
  - Benford MAD: 0.008
  - Correlation RMSE: 0.07

CLI Usage

Basic Workflow

# 1. Extract fingerprint from real data
datasynth-data fingerprint extract \
    --input ./real_data.csv \
    --output ./fingerprint.dsf \
    --privacy-level standard

# 2. Validate fingerprint integrity
datasynth-data fingerprint validate ./fingerprint.dsf

# 3. View fingerprint details
datasynth-data fingerprint info ./fingerprint.dsf --detailed

# 4. Generate synthetic data (using derived config)
datasynth-data generate \
    --config ./derived_config.yaml \
    --output ./synthetic_data

# 5. Evaluate fidelity
datasynth-data fingerprint evaluate \
    --fingerprint ./fingerprint.dsf \
    --synthetic ./synthetic_data \
    --threshold 0.85 \
    --report ./report.html

Comparing Fingerprints

# Compare two versions
datasynth-data fingerprint diff ./fp_v1.dsf ./fp_v2.dsf

Custom Privacy Parameters

# Override privacy level with custom values
datasynth-data fingerprint extract \
    --input ./sensitive_data.csv \
    --output ./fingerprint.dsf \
    --epsilon 0.3 \
    --k 15

Best Practices

Data Preparation

  1. Clean data first: Remove obvious errors before extraction
  2. Consistent formats: Standardize date and number formats
  3. Document exclusions: Note any columns excluded from extraction

Privacy Selection

  1. Start with standard: Adjust based on fidelity evaluation
  2. Consider sensitivity: Use higher privacy for personal data
  3. Review audit log: Check privacy decisions in privacy_audit.json

Fidelity Optimization

  1. Check component scores: Identify weak areas
  2. Adjust generation config: Tune parameters for low-scoring metrics
  3. Iterate: Re-evaluate after adjustments

Compliance

  1. Preserve audit trail: Keep .dsf files for compliance review
  2. Document privacy choices: Record rationale for privacy level
  3. Version fingerprints: Track changes over time

Troubleshooting

Low Fidelity Score

Cause: Statistical differences between synthetic and fingerprint

Solutions:

  • Review component scores to identify specific issues
  • Adjust generation configuration parameters
  • Consider using auto-tuning recommendations

Fingerprint Validation Errors

Cause: Corrupted or modified DSF file

Solutions:

  • Re-extract from source data
  • Check file transfer integrity
  • Verify checksums match manifest

Privacy Budget Exceeded

Cause: Too many queries on sensitive data

Solutions:

  • Reduce number of extracted statistics
  • Use higher epsilon (lower privacy)
  • Aggregate fine-grained statistics

See Also

Accounting & Audit Standards

SyntheticData includes comprehensive support for major accounting and auditing standards frameworks, enabling the generation of standards-compliant synthetic financial data suitable for audit analytics, compliance testing, and ML model training.

Overview

The datasynth-standards crate provides domain models and generation logic for:

| Category | Standards |
|---|---|
| Accounting | US GAAP (ASC), IFRS |
| Auditing | ISA (International Standards on Auditing), PCAOB |
| Regulatory | SOX (Sarbanes-Oxley Act) |

Accounting Framework Selection

Framework Options

accounting_standards:
  enabled: true
  framework: us_gaap  # Options: us_gaap, ifrs, dual_reporting

| Framework | Description |
|---|---|
| us_gaap | United States Generally Accepted Accounting Principles |
| ifrs | International Financial Reporting Standards |
| dual_reporting | Generate data for both frameworks with reconciliation |

Key Framework Differences

The generator automatically handles framework-specific rules:

| Area | US GAAP | IFRS |
|---|---|---|
| Inventory costing | LIFO permitted | LIFO prohibited |
| Development costs | Generally expensed | Capitalized when criteria met |
| PPE revaluation | Cost model only | Revaluation model permitted |
| Impairment reversal | Not permitted | Permitted (except goodwill) |
| Lease classification | Bright-line tests (75%/90%) | Principles-based |

Revenue Recognition (ASC 606 / IFRS 15)

Generate realistic customer contracts with performance obligations:

accounting_standards:
  revenue_recognition:
    enabled: true
    generate_contracts: true
    avg_obligations_per_contract: 2.0
    variable_consideration_rate: 0.15
    over_time_recognition_rate: 0.30
    contract_count: 100

Generated Entities

  • Customer Contracts: Transaction price, status, framework
  • Performance Obligations: Goods, services, licenses with satisfaction patterns
  • Variable Consideration: Discounts, rebates, incentives with constraint application
  • Revenue Recognition Schedule: Period-by-period recognition

5-Step Model Compliance

The generator follows the 5-step revenue recognition model:

  1. Identify the contract
  2. Identify performance obligations
  3. Determine transaction price
  4. Allocate transaction price to obligations
  5. Recognize revenue when/as obligations are satisfied
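
Step 4 allocates the transaction price across obligations in proportion to their standalone selling prices; a minimal sketch of that allocation (hypothetical helper, not the crate's API):

// Allocate a contract's transaction price to performance obligations in
// proportion to standalone selling prices (ASC 606 / IFRS 15, step 4).
fn allocate_transaction_price(transaction_price: f64, standalone_prices: &[f64]) -> Vec<f64> {
    let total: f64 = standalone_prices.iter().sum();
    standalone_prices
        .iter()
        .map(|p| transaction_price * p / total)
        .collect()
}

fn main() {
    // A 90,000 contract with a license (60,000) and support (40,000) priced
    // standalone at 100,000 in total.
    let allocated = allocate_transaction_price(90_000.0, &[60_000.0, 40_000.0]);
    println!("{allocated:?}"); // [54000.0, 36000.0]
}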

Lease Accounting (ASC 842 / IFRS 16)

Generate lease portfolios with ROU assets and lease liabilities:

accounting_standards:
  leases:
    enabled: true
    lease_count: 50
    finance_lease_percent: 0.30
    avg_lease_term_months: 60
    generate_amortization: true
    real_estate_percent: 0.40

Generated Entities

  • Leases: Classification, commencement date, term, payments, discount rate
  • ROU Assets: Initial measurement, accumulated depreciation, carrying amount
  • Lease Liabilities: Current/non-current portions
  • Amortization Schedules: Period-by-period interest and principal

Classification Logic

  • US GAAP: Bright-line tests (75% term, 90% PV)
  • IFRS: All leases (except short-term/low-value) recognized on balance sheet
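
The US GAAP bright-line version of these tests can be sketched as a simple predicate (simplified; real classification also considers ownership transfer, purchase options, and specialized-use assets):

// Simplified US GAAP bright-line lease classification: finance lease if the
// term covers >= 75% of the asset's useful life or the present value of
// payments is >= 90% of fair value.
fn is_finance_lease(
    lease_term_months: f64,
    useful_life_months: f64,
    pv_of_payments: f64,
    fair_value: f64,
) -> bool {
    lease_term_months / useful_life_months >= 0.75 || pv_of_payments / fair_value >= 0.90
}

fn main() {
    // A 60-month lease on an asset with a 72-month life: 83% of useful life.
    println!("{}", is_finance_lease(60.0, 72.0, 400_000.0, 500_000.0)); // true
}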

Fair Value Measurement (ASC 820 / IFRS 13)

Generate fair value measurements across hierarchy levels:

accounting_standards:
  fair_value:
    enabled: true
    measurement_count: 30
    level1_percent: 0.60    # Quoted prices
    level2_percent: 0.30    # Observable inputs
    level3_percent: 0.10    # Unobservable inputs
    include_sensitivity_analysis: true

Fair Value Hierarchy

| Level | Description | Examples |
|---|---|---|
| Level 1 | Quoted prices in active markets | Listed stocks, exchange-traded funds |
| Level 2 | Observable inputs | Corporate bonds, interest rate swaps |
| Level 3 | Unobservable inputs | Private equity, complex derivatives |

Impairment Testing (ASC 360 / IAS 36)

Generate impairment tests with framework-specific methodology:

accounting_standards:
  impairment:
    enabled: true
    test_count: 15
    impairment_rate: 0.20
    generate_projections: true
    include_goodwill: true

Framework Differences

  • US GAAP: Two-step test (recoverability then measurement)
  • IFRS: One-step test comparing to recoverable amount

ISA Compliance (Audit Standards)

Generate audit procedures mapped to ISA requirements:

audit_standards:
  isa_compliance:
    enabled: true
    compliance_level: comprehensive  # basic, standard, comprehensive
    generate_isa_mappings: true
    generate_coverage_summary: true
    include_pcaob: true
    framework: dual  # isa, pcaob, dual

Supported ISA Standards

The crate includes 34 ISA standards from ISA 200 through ISA 720:

| Series | Focus Area |
|---|---|
| ISA 200-265 | General principles and responsibilities |
| ISA 300-450 | Risk assessment and response |
| ISA 500-580 | Audit evidence |
| ISA 600-620 | Using work of others |
| ISA 700-720 | Conclusions and reporting |

Analytical Procedures (ISA 520)

Generate analytical procedures with variance investigation:

audit_standards:
  analytical_procedures:
    enabled: true
    procedures_per_account: 3
    variance_probability: 0.20
    generate_investigations: true
    include_ratio_analysis: true

Procedure Types

  • Trend analysis: Year-over-year comparisons
  • Ratio analysis: Key financial ratios
  • Reasonableness tests: Expected vs. actual comparisons

External Confirmations (ISA 505)

Generate confirmation procedures with response tracking:

audit_standards:
  confirmations:
    enabled: true
    confirmation_count: 50
    positive_response_rate: 0.85
    exception_rate: 0.10

Confirmation Types

  • Bank confirmations
  • Accounts receivable confirmations
  • Accounts payable confirmations
  • Legal confirmations

Audit Opinion (ISA 700/705/706/701)

Generate audit opinions with key audit matters:

audit_standards:
  opinion:
    enabled: true
    generate_kam: true
    average_kam_count: 3

Opinion Types

  • Unmodified
  • Qualified
  • Adverse
  • Disclaimer

SOX Compliance

Generate SOX 302/404 compliance documentation:

audit_standards:
  sox:
    enabled: true
    generate_302_certifications: true
    generate_404_assessments: true
    materiality_threshold: 10000.0

Section 302 Certifications

  • CEO and CFO certifications
  • Disclosure controls effectiveness
  • Material weakness identification

Section 404 Assessments

  • ICFR effectiveness assessment
  • Key control testing
  • Deficiency classification matrix

Deficiency Classification

The DeficiencyMatrix classifies deficiencies based on:

| Likelihood | Magnitude | Classification |
|---|---|---|
| Probable | Material | Material Weakness |
| Reasonably Possible | More Than Inconsequential | Significant Deficiency |
| Remote | Inconsequential | Control Deficiency |

PCAOB Standards

Generate PCAOB-specific audit elements:

audit_standards:
  pcaob:
    enabled: true
    generate_cam: true
    integrated_audit: true

PCAOB-Specific Requirements

  • Critical Audit Matters (CAMs) vs. Key Audit Matters (KAMs)
  • Integrated audit (ICFR + financial statements)
  • AS 2201 ICFR testing requirements

Evaluation and Validation

The datasynth-eval crate includes standards compliance evaluators:

#![allow(unused)]
fn main() {
use datasynth_eval::coherence::{
    StandardsComplianceEvaluation,
    RevenueRecognitionEvaluator,
    LeaseAccountingEvaluator,
    StandardsThresholds,
};

// Evaluate revenue recognition compliance
let eval = RevenueRecognitionEvaluator::evaluate(&contracts);
assert!(eval.po_allocation_compliance >= 0.95);

// Evaluate lease classification accuracy
let eval = LeaseAccountingEvaluator::evaluate(&leases, "us_gaap");
assert!(eval.classification_accuracy >= 0.90);
}

Compliance Thresholds

| Metric | Default Threshold |
|---|---|
| PO allocation compliance | 95% |
| Revenue timing compliance | 95% |
| Lease classification accuracy | 90% |
| ROU asset accuracy | 95% |
| Fair value hierarchy compliance | 95% |
| ISA coverage | 90% |
| SOX control coverage | 95% |
| Audit trail completeness | 90% |

Output Files

When standards generation is enabled, additional files are exported:

output/
└── standards/
    ├── accounting/
    │   ├── customer_contracts.csv
    │   ├── performance_obligations.csv
    │   ├── variable_consideration.csv
    │   ├── revenue_recognition_schedule.csv
    │   ├── leases.csv
    │   ├── rou_assets.csv
    │   ├── lease_liabilities.csv
    │   ├── lease_amortization.csv
    │   ├── fair_value_measurements.csv
    │   ├── impairment_tests.csv
    │   └── framework_differences.csv
    ├── audit/
    │   ├── isa_requirement_mappings.csv
    │   ├── isa_coverage_summary.csv
    │   ├── analytical_procedures.csv
    │   ├── variance_investigations.csv
    │   ├── confirmations.csv
    │   ├── confirmation_responses.csv
    │   ├── audit_opinions.csv
    │   ├── key_audit_matters.csv
    │   ├── audit_trails.json
    │   └── pcaob_mappings.csv
    └── regulatory/
        ├── sox_302_certifications.csv
        ├── sox_404_assessments.csv
        ├── deficiency_classifications.csv
        └── material_weaknesses.csv

Use Cases

Audit Analytics Training

Generate labeled data for training audit analytics models with known standards compliance levels.

Compliance Testing

Test compliance monitoring systems with synthetic data covering all major accounting and auditing standards.

IFRS to US GAAP Reconciliation

Use dual reporting mode to generate reconciliation data for multi-framework analysis.

SOX Testing

Generate internal control data with known deficiencies for testing SOX monitoring systems.

See Also

Performance Tuning

Optimize SyntheticData for your hardware and requirements.

Performance Characteristics

| Metric | Typical Performance |
|---|---|
| Single-threaded | ~100,000 entries/second |
| Parallel (8 cores) | ~600,000 entries/second |
| Memory per 1M entries | ~500 MB |

Configuration Tuning

Worker Threads

global:
  worker_threads: 8                  # Match CPU cores

Guidelines:

  • Default: Uses all available cores
  • I/O bound: May benefit from more threads than physical cores
  • Memory constrained: Reduce threads

Memory Limits

global:
  memory_limit: 2147483648           # 2 GB

Guidelines:

  • Set to ~75% of available RAM
  • Leave room for OS and other processes
  • Lower limit = more streaming, less memory

Batch Sizes

The orchestrator automatically tunes batch sizes, but you can influence behavior:

transactions:
  target_count: 100000

# Implicit batch sizing based on:
# - Available memory
# - Number of threads
# - Target count

Hardware Recommendations

Minimum

| Resource | Specification |
|---|---|
| CPU | 2 cores |
| RAM | 4 GB |
| Storage | 10 GB |

Suitable for: <100K entries, development

Recommended

| Resource | Specification |
|---|---|
| CPU | 8 cores |
| RAM | 16 GB |
| Storage | 50 GB SSD |

Suitable for: 1M entries, production

High Performance

| Resource | Specification |
|---|---|
| CPU | 32+ cores |
| RAM | 64+ GB |
| Storage | NVMe SSD |

Suitable for: 10M+ entries, benchmarking

Optimizing Generation

Reduce Memory Pressure

Enable streaming output:

output:
  format: csv
  # Writing as generated reduces memory

Disable unnecessary features:

graph_export:
  enabled: false                     # Skip if not needed

anomaly_injection:
  enabled: false                     # Add in post-processing

Optimize for Speed

Maximize parallelism:

global:
  worker_threads: 16                 # More threads

Simplify output:

output:
  format: csv                        # Faster than JSON
  compression: none                  # Skip compression time

Reduce complexity:

chart_of_accounts:
  complexity: small                  # Fewer accounts

document_flows:
  p2p:
    enabled: false                   # Skip if not needed

Optimize for Size

Enable compression:

output:
  compression: zstd
  compression_level: 9               # Maximum compression

Minimize output files:

output:
  files:
    journal_entries: true
    acdoca: false
    master_data: false               # Only what you need

Benchmarking

Built-in Benchmarks

# Run all benchmarks
cargo bench

# Specific benchmark
cargo bench --bench generation_throughput

# With baseline comparison
cargo bench -- --baseline main

Benchmark Categories

| Benchmark | Measures |
|---|---|
| generation_throughput | Entries/second |
| distribution_sampling | Distribution speed |
| output_sink | Write performance |
| scalability | Parallel scaling |
| correctness | Validation overhead |

Manual Benchmarking

# Time generation
time datasynth-data generate --config config.yaml --output ./output

# Profile memory
/usr/bin/time -v datasynth-data generate --config config.yaml --output ./output

Profiling

CPU Profiling

# With perf (Linux)
perf record datasynth-data generate --config config.yaml --output ./output
perf report

# With Instruments (macOS)
xcrun xctrace record --template "Time Profiler" \
    --launch datasynth-data generate --config config.yaml --output ./output

Memory Profiling

# With heaptrack (Linux)
heaptrack datasynth-data generate --config config.yaml --output ./output
heaptrack_print heaptrack.*.gz

# With Instruments (macOS)
xcrun xctrace record --template "Allocations" \
    --launch datasynth-data generate --config config.yaml --output ./output

Common Bottlenecks

I/O Bound

Symptoms:

  • CPU utilization < 100%
  • Disk utilization high

Solutions:

  • Use faster storage (SSD/NVMe)
  • Enable compression (reduces write volume)
  • Increase buffer sizes

Memory Bound

Symptoms:

  • OOM errors
  • Excessive swapping

Solutions:

  • Reduce target_count
  • Lower memory_limit
  • Enable streaming
  • Reduce parallel threads

CPU Bound

Symptoms:

  • CPU at 100%
  • Generation time scales linearly

Solutions:

  • Add more cores
  • Simplify configuration
  • Disable unnecessary features

Scaling Guidelines

Entries vs Time

| Entries | ~Time (8 cores) |
|---|---|
| 10,000 | <1 second |
| 100,000 | ~2 seconds |
| 1,000,000 | ~20 seconds |
| 10,000,000 | ~3 minutes |

Entries vs Memory

| Entries | Peak Memory |
|---|---|
| 10,000 | ~50 MB |
| 100,000 | ~200 MB |
| 1,000,000 | ~1.5 GB |
| 10,000,000 | ~12 GB |

Memory estimates assume full in-memory processing; streaming output reduces peak memory by roughly 80%.

Server Performance

Rate Limiting

cargo run -p datasynth-server -- \
    --port 3000 \
    --rate-limit 1000              # Requests per minute

Connection Pooling

For high-concurrency scenarios, configure worker threads:

cargo run -p datasynth-server -- \
    --worker-threads 16            # Handle more connections

WebSocket Optimization

# Client-side: batch messages
const BATCH_SIZE = 100;  // Request 100 entries at a time

See Also

LLM-Augmented Generation

New in v0.5.0

LLM-augmented generation uses Large Language Models to enrich synthetic data with realistic metadata — vendor names, transaction descriptions, memo fields, and anomaly explanations — that would be difficult to generate with rule-based approaches alone.

Overview

Traditional synthetic data generators produce structurally correct but often generic-sounding text fields. LLM augmentation addresses this by using language models to generate contextually appropriate text based on the financial domain, industry, and transaction context.

DataSynth provides a pluggable provider abstraction that supports:

| Provider | Description | Use Case |
|---|---|---|
| Mock | Deterministic, no network required | CI/CD, testing, reproducible builds |
| OpenAI | OpenAI-compatible APIs (GPT-4o-mini, etc.) | Production enrichment |
| Anthropic | Anthropic API (Claude models) | Production enrichment |
| Custom | Any OpenAI-compatible endpoint | Self-hosted models, Azure OpenAI |

Provider Abstraction

All LLM functionality is built around the LlmProvider trait:

#![allow(unused)]
fn main() {
pub trait LlmProvider: Send + Sync {
    fn name(&self) -> &str;
    fn complete(&self, request: &LlmRequest) -> Result<LlmResponse, SynthError>;
    fn complete_batch(&self, requests: &[LlmRequest]) -> Result<Vec<LlmResponse>, SynthError>;
}
}

LlmRequest

#![allow(unused)]
fn main() {
let request = LlmRequest::new("Generate a vendor name for a German auto parts manufacturer")
    .with_system("You are a business data generator. Return only the company name.")
    .with_seed(42)
    .with_max_tokens(50)
    .with_temperature(0.7);
}

| Field | Type | Default | Description |
|---|---|---|---|
| prompt | String | (required) | The generation prompt |
| system | Option<String> | None | System message for context |
| max_tokens | u32 | 100 | Maximum response tokens |
| temperature | f64 | 0.7 | Sampling temperature |
| seed | Option<u64> | None | Seed for deterministic output |

LlmResponse

#![allow(unused)]
fn main() {
pub struct LlmResponse {
    pub content: String,       // Generated text
    pub usage: TokenUsage,     // Input/output token counts
    pub cached: bool,          // Whether result came from cache
}
}

Mock Provider

The MockLlmProvider generates deterministic, contextually-aware text without any network calls. It is the default provider and is ideal for:

  • CI/CD pipelines where network access is restricted
  • Reproducible builds with deterministic output
  • Development and testing
  • Environments where API costs are a concern
#![allow(unused)]
fn main() {
use synth_core::llm::MockLlmProvider;

let provider = MockLlmProvider::new(42); // seeded for reproducibility
}

The mock provider uses the seed and prompt content to generate plausible-sounding business names and descriptions deterministically.

HTTP Provider

The HttpLlmProvider connects to real LLM APIs:

#![allow(unused)]
fn main() {
use synth_core::llm::{HttpLlmProvider, LlmConfig, LlmProviderType};

let config = LlmConfig {
    provider: LlmProviderType::OpenAi,
    model: "gpt-4o-mini".to_string(),
    api_key_env: "OPENAI_API_KEY".to_string(),
    base_url: None,
    max_retries: 3,
    timeout_secs: 30,
    cache_enabled: true,
};

let provider = HttpLlmProvider::new(config)?;
}

Configuration

# In your generation config
llm:
  provider: openai          # mock, openai, anthropic, custom
  model: "gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"
  base_url: null            # Override for custom endpoints
  max_retries: 3
  timeout_secs: 30
  cache_enabled: true

| Field | Type | Default | Description |
|---|---|---|---|
| provider | string | mock | Provider type: mock, openai, anthropic, custom |
| model | string | gpt-4o-mini | Model identifier |
| api_key_env | string | | Environment variable containing the API key |
| base_url | string | null | Custom API base URL (required for custom provider) |
| max_retries | integer | 3 | Maximum retry attempts on failure |
| timeout_secs | integer | 30 | Request timeout in seconds |
| cache_enabled | bool | true | Enable prompt-level caching |

Enrichment Types

Vendor Name Enrichment

Generates realistic vendor names based on industry, spend category, and country:

#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::VendorLlmEnricher;

let enricher = VendorLlmEnricher::new(provider.clone());
let name = enricher.enrich_vendor_name("manufacturing", "raw_materials", "DE")?;
// e.g., "Rheinische Stahlwerke GmbH"

// Batch enrichment for efficiency
let names = enricher.enrich_batch(&[
    ("manufacturing".into(), "raw_materials".into(), "DE".into()),
    ("retail".into(), "logistics".into(), "US".into()),
], 42)?;
}

Transaction Description Enrichment

Generates contextually appropriate journal entry descriptions:

#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::TransactionLlmEnricher;

let enricher = TransactionLlmEnricher::new(provider.clone());

let desc = enricher.enrich_description(
    "Office Supplies",    // account name
    "1000-5000",          // amount range
    "retail",             // industry
    3,                    // fiscal period
)?;

let memo = enricher.enrich_memo(
    "VendorInvoice",      // document type
    "Acme Corp",          // vendor name
    "2500.00",            // amount
)?;
}

Anomaly Explanation

Generates natural language explanations for injected anomalies:

#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::AnomalyLlmExplainer;

let explainer = AnomalyLlmExplainer::new(provider.clone());
let explanation = explainer.explain(
    "DuplicatePayment",           // anomaly type
    3,                             // affected records
    "Same amount, same vendor, 2 days apart",  // statistical details
)?;
}

Natural Language Configuration

The NlConfigGenerator converts natural language descriptions into YAML configuration:

#![allow(unused)]
fn main() {
use synth_core::llm::NlConfigGenerator;

let yaml = NlConfigGenerator::generate(
    "Generate 1 year of retail data for a mid-size US company with fraud patterns",
    &provider,
)?;
}

CLI Usage

datasynth-data init \
    --from-description "Generate 1 year of manufacturing data for a German mid-cap with intercompany transactions" \
    -o config.yaml

The generator parses intent into structured fields:

#![allow(unused)]
fn main() {
pub struct ConfigIntent {
    pub industry: Option<String>,     // e.g., "manufacturing"
    pub country: Option<String>,      // e.g., "DE"
    pub company_size: Option<String>, // e.g., "mid-cap"
    pub period_months: Option<u32>,   // e.g., 12
    pub features: Vec<String>,        // e.g., ["intercompany"]
}
}

Caching

The LlmCache deduplicates identical prompts using FNV-1a hashing:

#![allow(unused)]
fn main() {
use synth_core::llm::LlmCache;

let cache = LlmCache::new(10000); // max 10,000 entries
let key = LlmCache::cache_key("prompt text", Some("system"), Some(42));

cache.insert(key, "cached response".into());
if let Some(response) = cache.get(key) {
    // Use cached response
}
}

Caching is enabled by default and significantly reduces API costs when generating similar entities.

Cost and Privacy Considerations

Cost Management

  • Use the Mock provider for development and CI/CD (zero cost)
  • Enable caching to avoid duplicate API calls
  • Use batch enrichment (complete_batch) to reduce per-request overhead
  • Set appropriate max_tokens limits to control response sizes
  • Consider gpt-4o-mini or similar efficient models for bulk enrichment

Privacy

  • LLM prompts contain only synthetic context (industry, category, amount ranges) — never real data
  • No PII or sensitive information is sent to LLM providers
  • The Mock provider keeps everything local with no network traffic
  • For maximum privacy, use self-hosted models via the custom provider type

See Also

Diffusion Models

New in v0.5.0

DataSynth integrates a statistical diffusion model backend for learned distribution capture, offering an alternative and complement to rule-based generation.

Overview

Diffusion models generate data through a learned denoising process: starting from pure noise and iteratively removing it to produce realistic samples. DataSynth’s implementation uses a statistical backend that captures column-level distributions and inter-column correlations from fingerprint data, then generates new samples through a configurable noise schedule.

Forward Process (Training):     x₀ → x₁ → x₂ → ... → xₜ (pure noise)
Reverse Process (Generation):   xₜ → xₜ₋₁ → ... → x₁ → x₀ (data)

Architecture

DiffusionBackend Trait

All diffusion backends implement a common interface:

#![allow(unused)]
fn main() {
pub trait DiffusionBackend: Send + Sync {
    fn name(&self) -> &str;
    fn forward(&self, x: &[Vec<f64>], t: usize) -> Vec<Vec<f64>>;
    fn reverse(&self, x_t: &[Vec<f64>], t: usize) -> Vec<Vec<f64>>;
    fn generate(&self, n_samples: usize, n_features: usize, seed: u64) -> Vec<Vec<f64>>;
}
}

Statistical Diffusion Backend

The StatisticalDiffusionBackend uses per-column means and standard deviations (extracted from fingerprint data) to guide the denoising process:

#![allow(unused)]
fn main() {
use synth_core::diffusion::{StatisticalDiffusionBackend, DiffusionConfig, NoiseScheduleType};

let config = DiffusionConfig {
    n_steps: 1000,
    schedule: NoiseScheduleType::Cosine,
    seed: 42,
};

let backend = StatisticalDiffusionBackend::new(
    vec![5000.0, 3.5, 2.0],    // column means
    vec![2000.0, 1.5, 0.8],    // column standard deviations
    config,
);

// Optionally add correlation structure
let backend = backend.with_correlations(vec![
    vec![1.0, 0.65, 0.72],
    vec![0.65, 1.0, 0.55],
    vec![0.72, 0.55, 1.0],
]);

let samples = backend.generate(1000, 3, 42);
}

Noise Schedules

The noise schedule controls how noise is added during the forward process and removed during the reverse process.

| Schedule | Formula | Characteristics |
|---|---|---|
| Linear | β_t = β_min + t/T × (β_max - β_min) | Uniform noise addition; simple and robust |
| Cosine | β_t = 1 - ᾱ_t/ᾱ_{t-1}, ᾱ_t = cos²(π/2 × t/T) | Slower noise addition; better for preserving fine details |
| Sigmoid | β_t = sigmoid(a + (b-a) × t/T) | Smooth transition; balanced between linear and cosine |

#![allow(unused)]
fn main() {
use synth_core::diffusion::{NoiseSchedule, NoiseScheduleType};

let schedule = NoiseSchedule::new(&NoiseScheduleType::Cosine, 1000);

// Access schedule components
println!("Steps: {}", schedule.n_steps());
println!("First beta: {}", schedule.betas[0]);
println!("Last alpha_bar: {}", schedule.alpha_bars[999]);
}

Schedule Properties

The NoiseSchedule precomputes all values needed for efficient forward/reverse steps:

| Property | Description |
|---|---|
| betas | Noise variance at each step |
| alphas | 1 - beta at each step |
| alpha_bars | Cumulative product of alphas |
| sqrt_alpha_bars | √(ᾱ_t) for forward process |
| sqrt_one_minus_alpha_bars | √(1 - ᾱ_t) for noise scaling |

Hybrid Generation

The HybridGenerator blends rule-based and diffusion-generated data to combine the structural guarantees of rule-based generation with the distributional fidelity of diffusion models.

Blend Strategies

| Strategy | Description | Best For |
|---|---|---|
| Interpolate | Weighted average: w × diffusion + (1-w) × rule_based | Smooth blending of continuous values |
| Select | Per-record random selection from either source | Maintaining distinct record characteristics |
| Ensemble | Column-level: diffusion for amounts, rule-based for categoricals | Mixed-type data with different generation needs |

#![allow(unused)]
fn main() {
use synth_core::diffusion::{HybridGenerator, BlendStrategy};

let hybrid = HybridGenerator::new(0.3);  // 30% diffusion weight
println!("Weight: {}", hybrid.weight());

// Interpolation blend
let blended = hybrid.blend(
    &rule_based_data,
    &diffusion_data,
    BlendStrategy::Interpolate,
    42,
);

// Ensemble blend (specify which columns use diffusion)
let ensemble = hybrid.blend_ensemble(
    &rule_based_data,
    &diffusion_data,
    &[0, 2],  // columns 0 and 2 from diffusion
);
}

Training Pipeline

The DiffusionTrainer fits a model from column-level parameters and correlation matrices (typically extracted from a fingerprint):

Training

#![allow(unused)]
fn main() {
use synth_core::diffusion::{DiffusionTrainer, ColumnDiffusionParams, ColumnType, DiffusionConfig};

let params = vec![
    ColumnDiffusionParams {
        name: "amount".into(),
        mean: 5000.0,
        std: 2000.0,
        min: 0.0,
        max: 100000.0,
        col_type: ColumnType::Continuous,
    },
    ColumnDiffusionParams {
        name: "line_items".into(),
        mean: 3.5,
        std: 1.5,
        min: 1.0,
        max: 20.0,
        col_type: ColumnType::Integer,
    },
];

let corr_matrix = vec![
    vec![1.0, 0.65],
    vec![0.65, 1.0],
];

let config = DiffusionConfig { n_steps: 1000, schedule: NoiseScheduleType::Cosine, seed: 42 };
let model = DiffusionTrainer::fit(params, corr_matrix, config);
}

Generation from Trained Model

#![allow(unused)]
fn main() {
let samples = model.generate(5000, 42);

// Save/load model
model.save(Path::new("./model.json"))?;
let loaded = TrainedDiffusionModel::load(Path::new("./model.json"))?;
}

Evaluation

#![allow(unused)]
fn main() {
let report = DiffusionTrainer::evaluate(&model, 5000, 42);

println!("Overall score: {:.3}", report.overall_score);
println!("Correlation error: {:.4}", report.correlation_error);
for (i, (mean_err, std_err)) in report.mean_errors.iter().zip(&report.std_errors).enumerate() {
    println!("Column {}: mean_err={:.4}, std_err={:.4}", i, mean_err, std_err);
}
}

The FitReport contains:

| Metric | Description |
|---|---|
| mean_errors | Per-column mean absolute error |
| std_errors | Per-column standard deviation error |
| correlation_error | RMSE of correlation matrix |
| overall_score | Weighted composite score (0-1, higher is better) |

CLI Usage

Train a Model

datasynth-data diffusion train \
    --fingerprint ./fingerprint.dsf \
    --output ./model.json \
    --n-steps 1000 \
    --schedule cosine

Evaluate a Model

datasynth-data diffusion evaluate \
    --model ./model.json \
    --samples 5000

Configuration

diffusion:
  enabled: true
  n_steps: 1000           # Number of diffusion steps
  schedule: "cosine"       # Noise schedule: linear, cosine, sigmoid
  sample_size: 1000        # Samples to generate

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable diffusion generation |
| n_steps | integer | 1000 | Forward/reverse diffusion steps |
| schedule | string | "linear" | Noise schedule type |
| sample_size | integer | 1000 | Number of samples |

Utility Functions

DataSynth provides helper functions for working with diffusion data:

#![allow(unused)]
fn main() {
use synth_core::diffusion::{
    add_gaussian_noise, normalize_features, denormalize_features,
    clip_values, generate_noise,
};

// Normalize data to zero mean, unit variance
let (normalized, means, stds) = normalize_features(&data);

// Add calibrated noise
let noisy = add_gaussian_noise(&normalized[0], 0.1, &mut rng);

// Denormalize back to original scale
let original_scale = denormalize_features(&generated, &means, &stds);

// Clip to valid ranges
clip_values(&mut samples, 0.0, 100000.0);
}

See Also

Causal & Counterfactual Generation

New in v0.5.0

DataSynth supports Structural Causal Models (SCMs) for generating data with explicit causal structure, running interventional “what-if” scenarios, and producing counterfactual records.

Overview

Traditional synthetic data generators capture correlations but not causation. Causal generation lets you:

  1. Define causal relationships between variables (e.g., “transaction amount causes approval level”)
  2. Generate observational data that follows the causal structure
  3. Run interventions to answer “what if?” questions (do-calculus)
  4. Produce counterfactuals — “what would have happened if X were different?”

This is particularly valuable for fraud detection, audit analytics, and regulatory “what-if” scenario testing.

Causal Graph

A causal graph defines variables and the directed edges (causal mechanisms) between them.

Variables

#![allow(unused)]
fn main() {
use synth_core::causal::{CausalVariable, CausalVarType};

let var = CausalVariable::new("transaction_amount", CausalVarType::Continuous)
    .with_distribution("lognormal")
    .with_param("mu", 8.0)
    .with_param("sigma", 1.5);
}

| Variable Type | Description | Example |
|---|---|---|
| Continuous | Real-valued | Transaction amount, revenue |
| Categorical | Discrete categories | Industry, department |
| Count | Non-negative integers | Line items, approvals |
| Binary | Boolean (0/1) | Fraud flag, approval status |

Causal Mechanisms

Edges between variables define how a parent causally affects a child:

#![allow(unused)]
fn main() {
use synth_core::causal::{CausalEdge, CausalMechanism};

let edge = CausalEdge {
    from: "transaction_amount".into(),
    to: "approval_level".into(),
    mechanism: CausalMechanism::Logistic { scale: 0.001, midpoint: 50000.0 },
    strength: 1.0,
};
}

| Mechanism | Formula | Use Case |
|---|---|---|
| Linear { coefficient } | y += coefficient × parent | Proportional effects |
| Threshold { cutoff } | y = 1 if parent > cutoff, else 0 | Binary triggers |
| Polynomial { coefficients } | y += Σ coefficients[i] × parent^i | Non-linear effects |
| Logistic { scale, midpoint } | y += 1 / (1 + e^(-scale × (parent - midpoint))) | S-curve effects |

Building a Graph

#![allow(unused)]
fn main() {
use synth_core::causal::{CausalGraph, CausalVariable, CausalVarType, CausalEdge, CausalMechanism};

let mut graph = CausalGraph::new();

// Add variables
graph.add_variable(
    CausalVariable::new("transaction_amount", CausalVarType::Continuous)
        .with_distribution("lognormal")
        .with_param("mu", 8.0)
        .with_param("sigma", 1.5)
);
graph.add_variable(
    CausalVariable::new("approval_level", CausalVarType::Count)
        .with_distribution("normal")
        .with_param("mean", 1.0)
        .with_param("std", 0.5)
);
graph.add_variable(
    CausalVariable::new("fraud_flag", CausalVarType::Binary)
);

// Add causal edges
graph.add_edge(CausalEdge {
    from: "transaction_amount".into(),
    to: "approval_level".into(),
    mechanism: CausalMechanism::Linear { coefficient: 0.00005 },
    strength: 1.0,
});
graph.add_edge(CausalEdge {
    from: "transaction_amount".into(),
    to: "fraud_flag".into(),
    mechanism: CausalMechanism::Logistic { scale: 0.0001, midpoint: 50000.0 },
    strength: 0.8,
});

// Validate (checks for cycles, missing variables)
graph.validate()?;
}

Built-in Templates

DataSynth includes pre-configured causal graphs for common financial scenarios:

Fraud Detection Template

#![allow(unused)]
fn main() {
let graph = CausalGraph::fraud_detection_template();
}

Variables: transaction_amount, approval_level, vendor_risk, fraud_flag

Causal structure:

  • transaction_amountapproval_level (linear)
  • transaction_amountfraud_flag (logistic)
  • vendor_riskfraud_flag (linear)

Revenue Cycle Template

#![allow(unused)]
fn main() {
let graph = CausalGraph::revenue_cycle_template();
}

Variables: order_size, credit_score, payment_delay, revenue

Causal structure:

  • order_sizerevenue (linear)
  • credit_scorepayment_delay (linear, negative)
  • order_sizepayment_delay (linear)

Structural Causal Model (SCM)

The SCM wraps a causal graph and provides generation capabilities:

#![allow(unused)]
fn main() {
use synth_core::causal::StructuralCausalModel;

let scm = StructuralCausalModel::new(graph)?;

// Generate observational data
let samples = scm.generate(10000, 42)?;
// samples: Vec<HashMap<String, f64>>

for sample in &samples[..3] {
    println!("Amount: {:.2}, Approval: {:.0}, Fraud: {:.0}",
        sample["transaction_amount"],
        sample["approval_level"],
        sample["fraud_flag"],
    );
}
}

Data is generated in topological order — root variables are sampled from their distributions first, then child variables are computed based on their parents’ values and the causal mechanisms.

Interventions (Do-Calculus)

Interventions answer “what would happen if we force variable X to value V?” by cutting all incoming causal edges to X.

Single Intervention

#![allow(unused)]
fn main() {
let intervened = scm.intervene("transaction_amount", 50000.0)?;
let samples = intervened.generate(5000, 42)?;
}

Multiple Interventions

#![allow(unused)]
fn main() {
let intervened = scm
    .intervene("transaction_amount", 50000.0)?
    .and_intervene("vendor_risk", 0.9);
let samples = intervened.generate(5000, 42)?;
}

Intervention Engine with Effect Estimation

#![allow(unused)]
fn main() {
use synth_core::causal::InterventionEngine;

let engine = InterventionEngine::new(scm);

let result = engine.do_intervention(
    &[("transaction_amount".into(), 50000.0)],
    5000,  // samples
    42,    // seed
)?;

// Compare baseline vs intervention
println!("Baseline fraud rate: {:.4}",
    result.baseline_samples.iter()
        .map(|s| s["fraud_flag"])
        .sum::<f64>() / result.baseline_samples.len() as f64
);

// Effect estimates with confidence intervals
for (var, effect) in &result.effect_estimates {
    println!("{}: ATE={:.4}, 95% CI=({:.4}, {:.4})",
        var,
        effect.average_treatment_effect,
        effect.confidence_interval.0,
        effect.confidence_interval.1,
    );
}
}

The InterventionResult contains:

| Field | Description |
|---|---|
| baseline_samples | Data generated without intervention |
| intervened_samples | Data generated with the intervention applied |
| effect_estimates | Per-variable average treatment effects with confidence intervals |

Counterfactual Generation

Counterfactuals answer “what would have happened to this specific record if X were different?” using the abduction-action-prediction framework:

  1. Abduction: Infer the latent noise variables from the factual observation
  2. Action: Apply the intervention (change X to new value)
  3. Prediction: Propagate through the SCM with inferred noise
#![allow(unused)]
fn main() {
use synth_core::causal::CounterfactualGenerator;
use std::collections::HashMap;

let cf_gen = CounterfactualGenerator::new(scm);

// Factual record
let factual: HashMap<String, f64> = [
    ("transaction_amount".to_string(), 5000.0),
    ("approval_level".to_string(), 1.0),
    ("fraud_flag".to_string(), 0.0),
].into_iter().collect();

// What if the amount had been 100,000?
let counterfactual = cf_gen.generate_counterfactual(
    &factual,
    "transaction_amount",
    100000.0,
    42,
)?;

println!("Factual fraud_flag: {}", factual["fraud_flag"]);
println!("Counterfactual fraud_flag: {}", counterfactual["fraud_flag"]);
}

Batch Counterfactuals

#![allow(unused)]
fn main() {
let pairs = cf_gen.generate_batch_counterfactuals(
    &factual_records,
    "transaction_amount",
    100000.0,
    42,
)?;

for pair in &pairs {
    println!("Changed variables: {:?}", pair.changed_variables);
}
}

Each CounterfactualPair contains:

Field | Description
factual | The original observation
counterfactual | The counterfactual version
changed_variables | List of variables that changed

Causal Validation

Validate that generated data preserves the specified causal structure:

#![allow(unused)]
fn main() {
use synth_core::causal::CausalValidator;

let report = CausalValidator::validate_causal_structure(&samples, &graph);

println!("Valid: {}", report.valid);
for check in &report.checks {
    println!("{}: {} — {}", check.name, if check.passed { "PASS" } else { "FAIL" }, check.details);
}
if !report.violations.is_empty() {
    println!("Violations: {:?}", report.violations);
}
}

The validator checks:

  • Causal edge directions are respected (parent-child correlations)
  • Independence constraints hold (non-adjacent variables)
  • Intervention effects are consistent with the graph structure

CLI Usage

Generate Observational Data

datasynth-data causal generate \
    --template fraud_detection \
    --samples 10000 \
    --seed 42 \
    --output ./causal_output

Run Interventions

datasynth-data causal intervene \
    --template fraud_detection \
    --variable transaction_amount \
    --value 50000 \
    --samples 5000 \
    --output ./intervention_output

Validate Causal Structure

datasynth-data causal validate \
    --data ./causal_output \
    --template fraud_detection

Configuration

causal:
  enabled: true
  template: "fraud_detection"   # or "revenue_cycle" or path to custom YAML
  sample_size: 10000
  validate: true                # validate causal structure in output

Custom Causal Graph YAML

# custom_graph.yaml
variables:
  - name: order_size
    type: continuous
    distribution: lognormal
    params:
      mu: 7.0
      sigma: 1.2
  - name: discount_rate
    type: continuous
    distribution: beta
    params:
      alpha: 2.0
      beta: 8.0
  - name: revenue
    type: continuous

edges:
  - from: order_size
    to: revenue
    mechanism:
      type: linear
      coefficient: 0.95
  - from: discount_rate
    to: revenue
    mechanism:
      type: linear
      coefficient: -5000.0

See Also

Federated Fingerprinting

New in v0.5.0

Federated fingerprinting enables extracting statistical fingerprints from multiple distributed data sources and combining them without centralizing the raw data.

Overview

In many enterprise environments, data is distributed across multiple systems, departments, or legal entities that cannot share raw data due to privacy regulations or data governance policies. Federated fingerprinting addresses this by:

  1. Local extraction: Each data source extracts a partial fingerprint with its own differential privacy budget
  2. Secure aggregation: Partial fingerprints are combined using a configurable aggregation strategy
  3. Privacy composition: The total privacy budget is tracked across all sources

Source A → [Extract + Local DP] → Partial FP A ─┐
Source B → [Extract + Local DP] → Partial FP B ─┼→ [Aggregate] → Combined FP → [Generate]
Source C → [Extract + Local DP] → Partial FP C ─┘

Partial Fingerprints

Each data source produces a PartialFingerprint containing noised statistics:

#![allow(unused)]
fn main() {
pub struct PartialFingerprint {
    pub source_id: String,         // Identifier for this data source
    pub local_epsilon: f64,        // DP epsilon budget spent locally
    pub record_count: u64,         // Number of records in source
    pub column_names: Vec<String>, // Column identifiers
    pub means: Vec<f64>,           // Per-column means (noised)
    pub stds: Vec<f64>,            // Per-column standard deviations (noised)
    pub mins: Vec<f64>,            // Per-column minimums (noised)
    pub maxs: Vec<f64>,            // Per-column maximums (noised)
    pub correlations: Vec<f64>,    // Flat row-major correlation matrix (noised)
}
}

Creating a Partial Fingerprint

#![allow(unused)]
fn main() {
use datasynth_fingerprint::federated::FederatedFingerprintProtocol;

let partial = FederatedFingerprintProtocol::create_partial(
    "department_finance",                        // source ID
    vec!["amount".into(), "line_items".into()],  // columns
    50000,                                       // record count
    vec![8500.0, 3.2],                           // means
    vec![4200.0, 1.8],                           // standard deviations
    1.0,                                         // local epsilon budget
);
}

Aggregation Methods

Method | Description | Properties
WeightedAverage | Weighted by record count | Best for balanced sources
Median | Median across sources | Robust to outlier sources
TrimmedMean | Mean after removing extremes | Balances robustness and efficiency
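
As a rough sketch of the WeightedAverage strategy for a single statistic (the real aggregation happens inside FederatedFingerprintProtocol::aggregate and may differ in detail), each source's noised column mean is weighted by its record count:

#![allow(unused)]
fn main() {
// Sketch: record-count-weighted mean of one column across sources.
fn weighted_average(means: &[f64], record_counts: &[u64]) -> f64 {
    let total: u64 = record_counts.iter().sum();
    means.iter()
        .zip(record_counts.iter())
        .map(|(mean, count)| mean * (*count as f64))
        .sum::<f64>()
        / total as f64
}
}

With the three sources from the protocol example below (10,000, 8,000, and 12,000 records with amount means of 5000.0, 4500.0, and 5500.0), this yields an aggregated amount mean of roughly 5066.7.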

Protocol Usage

#![allow(unused)]
fn main() {
use datasynth_fingerprint::federated::{
    FederatedFingerprintProtocol, FederatedConfig, AggregationMethod,
};

// Configure the protocol
let config = FederatedConfig {
    min_sources: 2,                                // Minimum sources required
    max_epsilon_per_source: 5.0,                   // Max DP budget per source
    aggregation_method: AggregationMethod::WeightedAverage,
};

let protocol = FederatedFingerprintProtocol::new(config);

// Collect partial fingerprints from each source
let partial_a = FederatedFingerprintProtocol::create_partial(
    "source_a", vec!["amount".into(), "count".into()],
    10000, vec![5000.0, 3.0], vec![2000.0, 1.5], 1.0,
);
let partial_b = FederatedFingerprintProtocol::create_partial(
    "source_b", vec!["amount".into(), "count".into()],
    8000, vec![4500.0, 2.8], vec![1800.0, 1.2], 1.0,
);
let partial_c = FederatedFingerprintProtocol::create_partial(
    "source_c", vec!["amount".into(), "count".into()],
    12000, vec![5500.0, 3.3], vec![2200.0, 1.7], 1.0,
);

// Aggregate
let aggregated = protocol.aggregate(&[partial_a, partial_b, partial_c])?;

println!("Total records: {}", aggregated.total_record_count);  // 30000
println!("Total epsilon: {}", aggregated.total_epsilon);        // 3.0 (sum)
println!("Sources: {}", aggregated.source_count);               // 3
}

Aggregated Fingerprint

The AggregatedFingerprint contains the combined statistics:

#![allow(unused)]
fn main() {
pub struct AggregatedFingerprint {
    pub column_names: Vec<String>,
    pub means: Vec<f64>,            // Aggregated means
    pub stds: Vec<f64>,             // Aggregated standard deviations
    pub mins: Vec<f64>,             // Aggregated minimums
    pub maxs: Vec<f64>,             // Aggregated maximums
    pub correlations: Vec<f64>,     // Aggregated correlation matrix
    pub total_record_count: u64,    // Sum across all sources
    pub total_epsilon: f64,         // Sum of local epsilons
    pub source_count: usize,        // Number of contributing sources
}
}

Privacy Budget Composition

The total privacy budget is the sum of local epsilons across all sources. This follows sequential composition — each source’s local DP guarantee composes with the others.

For example, if three sources each spend ε=1.0 locally, the total privacy cost of the aggregated fingerprint is ε=3.0 under sequential composition.

To minimize total budget:

  • Use the lowest local_epsilon that provides sufficient utility
  • Prefer fewer sources with more records over many sources with few records
  • Use max_epsilon_per_source to enforce per-source budget caps

CLI Usage

# Aggregate fingerprints from multiple sources
datasynth-data fingerprint federated \
    --sources ./finance.dsf ./operations.dsf ./sales.dsf \
    --output ./aggregated.dsf \
    --method weighted_average \
    --max-epsilon 5.0

# Then generate from the aggregated fingerprint
datasynth-data generate \
    --fingerprint ./aggregated.dsf \
    --output ./synthetic_output

Configuration

# Federated config is specified per-invocation via CLI flags
# The aggregation method and privacy budget are controlled at execution time

CLI Flag | Default | Description
--sources | (required) | Two or more .dsf fingerprint files
--output | (required) | Output path for aggregated fingerprint
--method | weighted_average | Aggregation strategy
--max-epsilon | 5.0 | Maximum epsilon per source

See Also

Synthetic Data Certificates

New in v0.5.0

Synthetic data certificates provide cryptographic proof of the privacy guarantees and quality metrics associated with generated data.

Overview

As synthetic data is increasingly used in regulated industries, organizations need verifiable assurance that:

  1. The data was generated with specific differential privacy guarantees
  2. Quality metrics meet documented thresholds
  3. The generation configuration hasn’t been tampered with
  4. The certificate itself is authentic (HMAC-SHA256 signed)

Certificates are produced during generation and can be embedded in output files or distributed alongside them.

Certificate Structure

#![allow(unused)]
fn main() {
pub struct SyntheticDataCertificate {
    pub certificate_id: String,        // Unique certificate identifier
    pub generation_timestamp: String,  // ISO 8601 timestamp
    pub generator_version: String,     // DataSynth version
    pub config_hash: String,           // SHA-256 hash of generation config
    pub seed: Option<u64>,             // RNG seed for reproducibility
    pub dp_guarantee: Option<DpGuarantee>,
    pub quality_metrics: Option<QualityMetrics>,
    pub fingerprint_hash: Option<String>,  // Source fingerprint hash
    pub issuer: String,                // Certificate issuer
    pub signature: Option<String>,     // HMAC-SHA256 signature
}
}

DP Guarantee

#![allow(unused)]
fn main() {
pub struct DpGuarantee {
    pub mechanism: String,            // "Laplace" or "Gaussian"
    pub epsilon: f64,                 // Privacy budget spent
    pub delta: Option<f64>,           // For (ε,δ)-DP
    pub composition_method: String,   // "sequential", "advanced", "rdp"
    pub total_queries: u32,           // Number of DP queries made
}
}

Quality Metrics

#![allow(unused)]
fn main() {
pub struct QualityMetrics {
    pub benford_mad: Option<f64>,             // Mean Absolute Deviation from Benford's Law
    pub correlation_preservation: Option<f64>, // Correlation matrix similarity (0-1)
    pub statistical_fidelity: Option<f64>,    // Overall statistical fidelity score (0-1)
    pub mia_auc: Option<f64>,                 // Membership Inference Attack AUC (closer to 0.5 = better privacy)
}
}

Building Certificates

Use the CertificateBuilder for fluent construction:

#![allow(unused)]
fn main() {
use datasynth_fingerprint::certificates::{
    CertificateBuilder, DpGuarantee, QualityMetrics,
};

let cert = CertificateBuilder::new("DataSynth v0.5.0")
    .with_dp_guarantee(DpGuarantee {
        mechanism: "Laplace".into(),
        epsilon: 1.0,
        delta: None,
        composition_method: "sequential".into(),
        total_queries: 50,
    })
    .with_quality_metrics(QualityMetrics {
        benford_mad: Some(0.008),
        correlation_preservation: Some(0.95),
        statistical_fidelity: Some(0.92),
        mia_auc: Some(0.52),
    })
    .with_config_hash("sha256:abc123...")
    .with_seed(42)
    .with_fingerprint_hash("sha256:def456...")
    .with_generator_version("0.5.0")
    .build();
}

Signing and Verification

Certificates are signed using HMAC-SHA256:

#![allow(unused)]
fn main() {
use datasynth_fingerprint::certificates::{sign_certificate, verify_certificate};

// Sign
sign_certificate(&mut cert, "my-secret-key-material");

// Verify
let valid = verify_certificate(&cert, "my-secret-key-material");
assert!(valid);

// Tampered certificate fails verification
cert.dp_guarantee.as_mut().unwrap().epsilon = 0.001; // tamper
assert!(!verify_certificate(&cert, "my-secret-key-material"));
}

Output Embedding

Certificates can be:

  1. Standalone JSON: Written as certificate.json in the output directory (a verification sketch follows this list)
  2. Parquet metadata: Embedded in Parquet file metadata under the datasynth_certificate key
  3. JSON metadata: Included in the generation manifest
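
As a usage sketch, a consumer could load the standalone certificate.json and verify it before trusting the dataset; this assumes SyntheticDataCertificate can be deserialized with serde_json, which is not shown in this chapter:

#![allow(unused)]
fn main() {
use datasynth_fingerprint::certificates::{verify_certificate, SyntheticDataCertificate};

// Load the standalone certificate written next to the generated data.
let json = std::fs::read_to_string("./output/certificate.json")
    .expect("certificate.json not found");
// Assumption: the certificate type implements serde::Deserialize.
let cert: SyntheticDataCertificate = serde_json::from_str(&json)
    .expect("invalid certificate JSON");

// Reject the dataset if the signature does not match the shared key.
assert!(verify_certificate(&cert, "my-secret-key-material"));
}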

CLI Usage

# Generate data with certificate
datasynth-data generate \
    --config config.yaml \
    --output ./output \
    --certificate

# Certificate is written to ./output/certificate.json

Configuration

certificates:
  enabled: true
  issuer: "DataSynth"
  include_quality_metrics: true

Field | Type | Default | Description
enabled | bool | false | Enable certificate generation
issuer | string | "DataSynth" | Issuer identity
include_quality_metrics | bool | true | Include quality metrics in certificate

Privacy-Utility Pareto Frontier

The ParetoFrontier helps find optimal privacy-utility tradeoffs:

#![allow(unused)]
fn main() {
use datasynth_fingerprint::privacy::pareto::{ParetoFrontier, ParetoPoint};

let epsilons = vec![0.1, 0.5, 1.0, 2.0, 5.0, 10.0];

let points = ParetoFrontier::explore(&epsilons, |epsilon| {
    // Evaluate utility at this epsilon level
    ParetoPoint {
        epsilon,
        delta: None,
        utility_score: compute_utility(epsilon),
        benford_mad: compute_benford(epsilon),
        correlation_score: compute_correlation(epsilon),
    }
});

// Recommend epsilon for target utility
if let Some(recommended_epsilon) = ParetoFrontier::recommend(&points, 0.90) {
    println!("For 90% utility, use epsilon = {:.2}", recommended_epsilon);
}
}

The frontier identifies non-dominated points where no other configuration achieves both better privacy and better utility.
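
As an illustration of that dominance rule (not the library's internal algorithm), a point is dominated when some other point has privacy at least as strong (epsilon no larger) and utility at least as high, with at least one strict improvement:

#![allow(unused)]
fn main() {
use datasynth_fingerprint::privacy::pareto::ParetoPoint;

// Illustrative dominance check: lower epsilon means stronger privacy,
// higher utility_score means better utility.
fn is_dominated(p: &ParetoPoint, others: &[ParetoPoint]) -> bool {
    others.iter().any(|q| {
        q.epsilon <= p.epsilon
            && q.utility_score >= p.utility_score
            && (q.epsilon < p.epsilon || q.utility_score > p.utility_score)
    })
}
}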

See Also

Deployment & Operations

This section covers everything you need to deploy, operate, and maintain DataSynth in production environments.

Deployment Options

DataSynth supports three deployment models, each suited to different operational requirements:

Method | Best For | Scaling | Complexity
Docker / Compose | Small teams, dev/staging, single-node | Vertical | Low
Kubernetes / Helm | Production, multi-tenant, auto-scaling | Horizontal | Medium
Bare Metal / SystemD | Regulated environments, air-gapped networks | Vertical | Low

Architecture at a Glance

DataSynth server exposes two network interfaces:

  • REST API on port 3000 – configuration, bulk generation, streaming control, health probes, Prometheus metrics
  • gRPC API on port 50051 – high-throughput generation for programmatic clients

Both share an in-process ServerState with atomic counters, so a single process can serve REST, gRPC, and WebSocket clients concurrently.

Operations Guides

GuideDescription
Operational RunbookGrafana dashboards, alert response, troubleshooting, log analysis
Capacity PlanningSizing model, reference benchmarks, disk and memory estimates
Disaster RecoveryBackup procedures, deterministic replay, stateless restart

Security & API

GuideDescription
API ReferenceEndpoints, authentication, rate limiting, WebSocket protocol, error formats
Security HardeningPre-deployment checklist, TLS/mTLS, secrets, container security, audit logging
TLS & Reverse ProxyNginx, Envoy, and native TLS configuration

Quick Decision Tree

  1. Need auto-scaling or HA? – Use Kubernetes.
  2. Single server, want observability? – Use Docker Compose with the full stack (Prometheus + Grafana).
  3. Air-gapped or compliance-restricted? – Use Bare Metal with SystemD.

Docker Deployment

This guide walks through building, configuring, and running DataSynth as Docker containers.

Prerequisites

  • Docker Engine 24+ (or Docker Desktop 4.25+)
  • Docker Compose v2
  • 2 GB RAM minimum (4 GB recommended)
  • 10 GB disk for images and generated data

Images

DataSynth provides two container images:

Image | Dockerfile | Purpose
datasynth/datasynth-server | Dockerfile | Server (REST + gRPC + WebSocket)
datasynth/datasynth-cli | Dockerfile.cli | CLI for batch generation jobs

Multi-Stage Build Walkthrough

The server Dockerfile uses a four-stage build with cargo-chef for dependency caching:

Stage 1: chef       -- installs cargo-chef on rust:1.88-bookworm
Stage 2: planner    -- computes recipe.json from Cargo.lock
Stage 3: builder    -- cooks dependencies (cached), then builds datasynth-server + datasynth-data
Stage 4: runtime    -- copies binaries into gcr.io/distroless/cc-debian12

Build the server image:

docker build -t datasynth/datasynth-server:0.5.0 .

Build the CLI-only image:

docker build -t datasynth/datasynth-cli:0.5.0 -f Dockerfile.cli .

Build Arguments and Features

To enable optional features (TLS, Redis rate limiting, OpenTelemetry), modify the build command in the builder stage. For example, to enable Redis:

# In the builder stage, replace the cargo build line:
RUN cargo build --release -p datasynth-server -p datasynth-cli --features redis

Image Size

The distroless runtime image is approximately 40-60 MB. The build cache layer with cooked dependencies significantly speeds up rebuilds when only application code changes.

Docker Compose Stack

The repository includes a production-ready docker-compose.yml with the full observability stack:

services:
  datasynth-server:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "50051:50051"  # gRPC
      - "3000:3000"    # REST
    environment:
      - RUST_LOG=info
      - DATASYNTH_API_KEYS=${DATASYNTH_API_KEYS:-}
    healthcheck:
      test: ["CMD", "/usr/local/bin/datasynth-data", "--help"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "2.0"
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    profiles:
      - redis
    ports:
      - "6379:6379"
    command: >
      redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
    deploy:
      resources:
        limits:
          memory: 256M
          cpus: "0.5"
    volumes:
      - redis-data:/data
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./deploy/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./deploy/prometheus-alerts.yml:/etc/prometheus/alerts.yml:ro
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3001:3000"
    volumes:
      - ./deploy/grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
  redis-data:

Starting the Stack

Basic server only:

docker compose up -d datasynth-server

Full observability stack (server + Prometheus + Grafana):

docker compose up -d

With Redis for distributed rate limiting:

docker compose --profile redis up -d

Verifying the Deployment

# Health check
curl http://localhost:3000/health

# Readiness probe
curl http://localhost:3000/ready

# Prometheus metrics
curl http://localhost:3000/metrics

# Grafana UI
open http://localhost:3001  # admin / admin (or GRAFANA_PASSWORD)

Environment Variables

Variable | Default | Description
RUST_LOG | info | Log level: trace, debug, info, warn, error
DATASYNTH_API_KEYS | (none) | Comma-separated API keys for authentication
DATASYNTH_WORKER_THREADS | 0 (auto) | Tokio worker threads; 0 = CPU count
DATASYNTH_REDIS_URL | (none) | Redis URL for distributed rate limiting
DATASYNTH_TLS_CERT | (none) | Path to TLS certificate (PEM)
DATASYNTH_TLS_KEY | (none) | Path to TLS private key (PEM)
OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OpenTelemetry collector endpoint
OTEL_SERVICE_NAME | (none) | OpenTelemetry service name
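
For example, running the server image directly with a few of these variables set (the key values are placeholders):

docker run -d \
  --name datasynth-server \
  -p 3000:3000 -p 50051:50051 \
  -e RUST_LOG=debug \
  -e DATASYNTH_API_KEYS=key-1,key-2 \
  -e DATASYNTH_WORKER_THREADS=4 \
  datasynth/datasynth-server:0.5.0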

Resource Limits

Recommended container resource limits by workload:

Workload | CPU | Memory | Notes
Light (dev/test) | 1 core | 1 GB | Small configs, < 10K entries
Medium (staging) | 2 cores | 2 GB | Medium configs, up to 100K entries
Heavy (production) | 4 cores | 4 GB | Large configs, streaming, multiple clients
Batch CLI job | 2-8 cores | 2-8 GB | Scales linearly with core count

Running CLI Jobs in Docker

Generate data with the CLI image:

docker run --rm \
  -v $(pwd)/output:/output \
  datasynth/datasynth-cli:0.5.0 \
  generate --demo --output /output

Generate from a custom config:

docker run --rm \
  -v $(pwd)/config.yaml:/config.yaml:ro \
  -v $(pwd)/output:/output \
  datasynth/datasynth-cli:0.5.0 \
  generate --config /config.yaml --output /output

Networking

The server binds to 0.0.0.0 by default inside the container. Port mapping:

Container Port | Protocol | Service
3000 | TCP | REST API + WebSocket + Prometheus metrics
50051 | TCP | gRPC API

For WebSocket connections through a reverse proxy, ensure the proxy supports HTTP Upgrade headers. See TLS & Reverse Proxy for Nginx and Envoy configurations.

Logging

DataSynth server outputs structured JSON logs to stdout, which integrates with Docker’s logging drivers:

# View logs
docker compose logs -f datasynth-server

# Filter by level
docker compose logs datasynth-server | jq 'select(.level == "ERROR")'

To change the log format or level, set the RUST_LOG environment variable:

# Debug logging for the server crate only
RUST_LOG=datasynth_server=debug docker compose up -d datasynth-server

Kubernetes Deployment

This guide covers deploying DataSynth on Kubernetes using the included Helm chart or raw manifests.

Prerequisites

  • Kubernetes 1.27+
  • Helm 3.12+ (for Helm-based deployment)
  • kubectl configured for your cluster
  • A container registry accessible from the cluster
  • Metrics Server installed (for HPA)

Helm Chart

The Helm chart is located at deploy/helm/datasynth/ and manages all Kubernetes resources.

Quick Install

# From the repository root
helm install datasynth ./deploy/helm/datasynth \
  --namespace datasynth \
  --create-namespace

Install with Custom Values

helm install datasynth ./deploy/helm/datasynth \
  --namespace datasynth \
  --create-namespace \
  --set image.repository=your-registry.example.com/datasynth-server \
  --set image.tag=0.5.0 \
  --set autoscaling.minReplicas=3 \
  --set autoscaling.maxReplicas=15

Upgrade

helm upgrade datasynth ./deploy/helm/datasynth \
  --namespace datasynth \
  --reuse-values \
  --set image.tag=0.6.0

Uninstall

helm uninstall datasynth --namespace datasynth

Chart Reference

values.yaml Key Parameters

Parameter | Default | Description
replicaCount | 2 | Initial replicas (ignored when HPA is enabled)
image.repository | datasynth/datasynth-server | Container image repository
image.tag | 0.5.0 | Image tag
service.type | ClusterIP | Service type
service.restPort | 3000 | REST API port
service.grpcPort | 50051 | gRPC port
resources.requests.cpu | 500m | CPU request
resources.requests.memory | 512Mi | Memory request
resources.limits.cpu | 2 | CPU limit
resources.limits.memory | 2Gi | Memory limit
autoscaling.enabled | true | Enable HPA
autoscaling.minReplicas | 2 | Minimum replicas
autoscaling.maxReplicas | 10 | Maximum replicas
autoscaling.targetCPUUtilizationPercentage | 70 | CPU scaling target
podDisruptionBudget.enabled | true | Enable PDB
podDisruptionBudget.minAvailable | 1 | Minimum available pods
apiKeys | [] | API keys (stored in a Secret)
config.enabled | false | Mount DataSynth YAML config via ConfigMap
redis.enabled | false | Deploy Redis sidecar for distributed rate limiting
serviceMonitor.enabled | false | Create Prometheus ServiceMonitor
ingress.enabled | false | Enable Ingress resource

Authentication

API keys are stored in a Kubernetes Secret and injected via the DATASYNTH_API_KEYS environment variable:

# values-production.yaml
apiKeys:
  - "your-secure-api-key-1"
  - "your-secure-api-key-2"

For external secret management, use the External Secrets Operator or mount from a Vault sidecar. See Security Hardening for details.

DataSynth Configuration via ConfigMap

To inject a DataSynth generation config into the pods:

config:
  enabled: true
  content: |
    global:
      industry: manufacturing
      start_date: "2024-01-01"
      period_months: 12
      seed: 42
    companies:
      - code: "1000"
        name: "Manufacturing Corp"
        currency: USD
        country: US
        annual_transaction_volume: 100000

The config is mounted at /etc/datasynth/datasynth.yaml as a read-only volume.

Health Probes

The Helm chart configures three probes:

Probe | Endpoint | Initial Delay | Period | Failure Threshold
Startup | GET /live | 5s | 5s | 30 (= 2.5 min max startup)
Liveness | GET /live | 15s | 20s | 3
Readiness | GET /ready | 5s | 10s | 3

The readiness probe checks configuration validity, memory usage, and disk availability. A pod reporting not-ready is removed from Service endpoints until it recovers.
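
A quick way to spot-check the same endpoints the probes use, assuming the Service is named datasynth as in the manual manifests section below:

# Port-forward the service locally and hit the probe endpoints
kubectl -n datasynth port-forward svc/datasynth 3000:3000 &
curl -s http://localhost:3000/live
curl -s http://localhost:3000/ready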

Horizontal Pod Autoscaler (HPA)

The chart creates an HPA by default targeting 70% CPU utilization:

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  # Uncomment to also scale on memory:
  # targetMemoryUtilizationPercentage: 80

Custom metrics scaling (e.g., on synth_active_streams) requires the Prometheus Adapter:

# Custom metrics HPA example (requires prometheus-adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: datasynth-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: datasynth
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: synth_active_streams
        target:
          type: AverageValue
          averageValue: "5"

Pod Disruption Budget (PDB)

The PDB ensures at least one pod remains available during voluntary disruptions (node drains, cluster upgrades):

podDisruptionBudget:
  enabled: true
  minAvailable: 1

For larger deployments, switch to maxUnavailable:

podDisruptionBudget:
  enabled: true
  maxUnavailable: 1

Ingress and TLS

Nginx Ingress with cert-manager

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
  hosts:
    - host: datasynth.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: datasynth-tls
      hosts:
        - datasynth.example.com

WebSocket Support

For Nginx Ingress, WebSocket upgrade is handled automatically for paths starting with /ws/. If you use a path-based routing rule, ensure the annotation is set:

nginx.ingress.kubernetes.io/proxy-http-version: "1.1"
nginx.ingress.kubernetes.io/configuration-snippet: |
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";

gRPC Ingress

gRPC requires a separate Ingress resource or an Ingress controller that supports gRPC (e.g., Nginx Ingress with nginx.ingress.kubernetes.io/backend-protocol: "GRPC"):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: datasynth-grpc
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - secretName: datasynth-grpc-tls
      hosts:
        - grpc.datasynth.example.com
  rules:
    - host: grpc.datasynth.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: datasynth
                port:
                  name: grpc

Manual Manifests (Without Helm)

If you prefer raw manifests, here is a minimal deployment:

---
apiVersion: v1
kind: Namespace
metadata:
  name: datasynth
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datasynth
  namespace: datasynth
spec:
  replicas: 2
  selector:
    matchLabels:
      app: datasynth
  template:
    metadata:
      labels:
        app: datasynth
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: datasynth
          image: datasynth/datasynth-server:0.5.0
          ports:
            - containerPort: 3000
              name: http-rest
            - containerPort: 50051
              name: grpc
          env:
            - name: RUST_LOG
              value: "info"
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 2Gi
          livenessProbe:
            httpGet:
              path: /live
              port: http-rest
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /ready
              port: http-rest
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: datasynth
  namespace: datasynth
spec:
  type: ClusterIP
  ports:
    - port: 3000
      targetPort: http-rest
      name: http-rest
    - port: 50051
      targetPort: grpc
      name: grpc
  selector:
    app: datasynth

Prometheus ServiceMonitor

If you use the Prometheus Operator, enable the ServiceMonitor:

serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
  path: /metrics
  labels:
    release: prometheus  # Must match your Prometheus Operator selector

Rolling Update Strategy

The chart uses a zero-downtime rolling update strategy:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1

Combined with the PDB and readiness probes, this ensures that:

  1. A new pod starts and becomes ready before an old pod is terminated.
  2. At least minAvailable pods are always serving traffic.
  3. Config and secret changes trigger a rolling restart via checksum annotations (illustrated below).
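
The checksum trigger in item 3 typically takes the form of a pod-template annotation like the one below; this is shown as an illustration of the common Helm pattern, not as the chart's exact template:

# deployment.yaml (excerpt): hashing the rendered ConfigMap means any config
# change alters the pod template and therefore triggers a rolling update
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}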

Topology Spread

For multi-zone clusters, use topology spread constraints to distribute pods evenly:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: datasynth

Bare Metal Deployment

This guide covers installing and running DataSynth directly on a Linux server using SystemD.

Prerequisites

  • Linux x86_64 (Ubuntu 22.04+, Debian 12+, RHEL 9+, or equivalent)
  • 2 GB RAM minimum (4 GB recommended)
  • Root or sudo access for initial setup

Binary Installation

Option 1: Download Pre-Built Binary

# Download the latest release
curl -L https://github.com/ey-asu-rnd/SyntheticData/releases/latest/download/datasynth-server-linux-x86_64.tar.gz \
  -o datasynth-server.tar.gz

# Extract
tar xzf datasynth-server.tar.gz

# Install binaries
sudo install -m 0755 datasynth-server /usr/local/bin/
sudo install -m 0755 datasynth-data /usr/local/bin/

# Verify
datasynth-server --help
datasynth-data --version

Option 2: Build from Source

# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

# Install protobuf compiler (required for gRPC)
sudo apt-get install -y protobuf-compiler   # Debian/Ubuntu
sudo dnf install -y protobuf-compiler       # RHEL/Fedora

# Clone and build
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release -p datasynth-server -p datasynth-cli

# Install
sudo install -m 0755 target/release/datasynth-server /usr/local/bin/
sudo install -m 0755 target/release/datasynth-data /usr/local/bin/

To enable optional features during the build:

# With TLS support
cargo build --release -p datasynth-server --features tls

# With Redis distributed rate limiting
cargo build --release -p datasynth-server --features redis

# With OpenTelemetry
cargo build --release -p datasynth-server --features otel

# All features
cargo build --release -p datasynth-server --features "tls,redis,otel"

User and Permissions

Create a dedicated service account:

# Create system user (no home dir, no login shell)
sudo useradd --system --no-create-home --shell /usr/sbin/nologin datasynth

# Create data and config directories
sudo mkdir -p /var/lib/datasynth
sudo mkdir -p /etc/datasynth
sudo mkdir -p /etc/datasynth/tls

# Set ownership
sudo chown -R datasynth:datasynth /var/lib/datasynth
sudo chmod 750 /var/lib/datasynth

sudo chown -R root:datasynth /etc/datasynth
sudo chmod 750 /etc/datasynth
sudo chmod 700 /etc/datasynth/tls

Environment Configuration

Copy the example environment file:

sudo cp deploy/datasynth-server.env.example /etc/datasynth/server.env
sudo chown root:datasynth /etc/datasynth/server.env
sudo chmod 640 /etc/datasynth/server.env

Edit /etc/datasynth/server.env:

# Logging level
RUST_LOG=info

# API authentication (comma-separated keys)
DATASYNTH_API_KEYS=your-secure-key-1,your-secure-key-2

# Worker threads (0 = auto-detect from CPU count)
DATASYNTH_WORKER_THREADS=0

# TLS (requires --features tls build)
# DATASYNTH_TLS_CERT=/etc/datasynth/tls/cert.pem
# DATASYNTH_TLS_KEY=/etc/datasynth/tls/key.pem

SystemD Service

The repository includes a production-ready SystemD unit at deploy/datasynth-server.service. Install it:

sudo cp deploy/datasynth-server.service /etc/systemd/system/
sudo systemctl daemon-reload

Unit File Walkthrough

[Unit]
Description=DataSynth Synthetic Data Server
Documentation=https://github.com/ey-asu-rnd/SyntheticData
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=datasynth
Group=datasynth
EnvironmentFile=-/etc/datasynth/server.env
ExecStart=/usr/local/bin/datasynth-server \
    --host 0.0.0.0 \
    --port 50051 \
    --rest-port 3000
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
TimeoutStartSec=30
TimeoutStopSec=30

# Resource limits
MemoryMax=4G
CPUQuota=200%
TasksMax=512
LimitNOFILE=65536

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
ReadWritePaths=/var/lib/datasynth

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=datasynth-server

[Install]
WantedBy=multi-user.target

Key security directives:

Directive | Effect
NoNewPrivileges=true | Prevents privilege escalation
ProtectSystem=strict | Mounts filesystem read-only except ReadWritePaths
ProtectHome=true | Hides /home, /root, /run/user
PrivateTmp=true | Isolates /tmp
PrivateDevices=true | Restricts device access
ReadWritePaths=/var/lib/datasynth | Only writable directory

Enable and Start

sudo systemctl enable datasynth-server
sudo systemctl start datasynth-server
sudo systemctl status datasynth-server

Common Operations

# View logs
journalctl -u datasynth-server -f

# Restart
sudo systemctl restart datasynth-server

# Reload (sends HUP signal)
sudo systemctl reload datasynth-server

# Stop
sudo systemctl stop datasynth-server

Log Rotation

SystemD journal handles log rotation automatically. To configure retention:

# /etc/systemd/journald.conf.d/datasynth.conf
[Journal]
SystemMaxUse=2G
MaxRetentionSec=30d

Reload journald:

sudo systemctl restart systemd-journald

To export logs to a file for external log aggregation:

# Export today's logs as JSON
journalctl -u datasynth-server --since today -o json > /var/log/datasynth-$(date +%F).json

Firewall Configuration

Open the required ports:

# UFW (Ubuntu)
sudo ufw allow 3000/tcp comment "DataSynth REST"
sudo ufw allow 50051/tcp comment "DataSynth gRPC"

# firewalld (RHEL/CentOS)
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --permanent --add-port=50051/tcp
sudo firewall-cmd --reload

Verifying the Installation

# Health check
curl -s http://localhost:3000/health | python3 -m json.tool

# Readiness check
curl -s http://localhost:3000/ready | python3 -m json.tool

# Prometheus metrics
curl -s http://localhost:3000/metrics

# Generate test data via CLI
datasynth-data generate --demo --output /tmp/datasynth-test
ls -la /tmp/datasynth-test/

Updating

# Stop the service
sudo systemctl stop datasynth-server

# Replace the binary
sudo install -m 0755 /path/to/new/datasynth-server /usr/local/bin/

# Start the service
sudo systemctl start datasynth-server

# Verify
curl -s http://localhost:3000/health | python3 -m json.tool

Operational Runbook

This runbook provides step-by-step procedures for monitoring, alerting, troubleshooting, and maintaining DataSynth in production.

Monitoring Stack Overview

The recommended monitoring stack uses Prometheus for metrics collection and Grafana for dashboards and alerting. The docker-compose.yml in the repository root sets this up automatically.

Component | Default URL | Purpose
Prometheus | http://localhost:9090 | Metrics storage and alerting rules
Grafana | http://localhost:3001 | Dashboards and visualization
DataSynth /metrics | http://localhost:3000/metrics | Prometheus exposition endpoint
DataSynth /api/metrics | http://localhost:3000/api/metrics | JSON metrics endpoint

Prometheus Configuration

The repository includes a pre-configured Prometheus scrape config at deploy/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: "datasynth"
    static_configs:
      - targets: ["datasynth-server:3000"]
    metrics_path: "/metrics"
    scrape_interval: 10s

  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

For Kubernetes, use the ServiceMonitor CRD instead (see Kubernetes deployment).

Available Metrics

DataSynth exposes the following Prometheus metrics at GET /metrics:

Metric | Type | Description
synth_entries_generated_total | Counter | Total journal entries generated since startup
synth_anomalies_injected_total | Counter | Total anomalies injected
synth_uptime_seconds | Gauge | Server uptime in seconds
synth_entries_per_second | Gauge | Current generation throughput
synth_active_streams | Gauge | Number of active WebSocket streaming connections
synth_stream_events_total | Counter | Total events sent through WebSocket streams
synth_info | Gauge | Server version info label (always 1)

Grafana Dashboard Setup

Step 1: Add Prometheus Data Source

  1. Open Grafana at http://localhost:3001.
  2. Navigate to Configuration > Data Sources > Add data source.
  3. Select Prometheus.
  4. Set URL to http://prometheus:9090 (Docker) or your Prometheus endpoint.
  5. Click Save & Test.

If using Docker Compose, the Prometheus data source is auto-provisioned via deploy/grafana/provisioning/datasources/prometheus.yml.

Step 2: Create the DataSynth Dashboard

Create a new dashboard with the following panels:

Panel 1: Generation Throughput

Type: Time series
Query: rate(synth_entries_generated_total[5m])
Title: Entries Generated per Second (5m rate)
Unit: ops/sec

Panel 2: Active WebSocket Streams

Type: Stat
Query: synth_active_streams
Title: Active Streams
Thresholds: 0 (green), 5 (yellow), 10 (red)

Panel 3: Total Entries (Counter)

Type: Stat
Query: synth_entries_generated_total
Title: Total Entries Generated
Format: short

Panel 4: Anomaly Injection Rate

Type: Time series
Query A: rate(synth_anomalies_injected_total[5m])
Query B: rate(synth_entries_generated_total[5m])
Title: Anomaly Rate
Transform: A / B (using math expression)
Unit: percentunit

Panel 5: Server Uptime

Type: Stat
Query: synth_uptime_seconds
Title: Server Uptime
Unit: seconds (s)

Panel 6: Stream Events Rate

Type: Time series
Query: rate(synth_stream_events_total[1m])
Title: Stream Events per Second
Unit: events/sec

Step 3: Save and Export

Save the dashboard and export as JSON for version control. Place the file in deploy/grafana/provisioning/dashboards/ for automatic provisioning.

Alert Rules

The repository includes pre-configured alert rules at deploy/prometheus-alerts.yml:

Alert: ServerDown

- alert: ServerDown
  expr: up{job="datasynth"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "DataSynth server is down"
    description: "DataSynth server has been unreachable for more than 1 minute."

Response procedure:

  1. Check the server process: systemctl status datasynth-server or docker compose ps.
  2. Check logs: journalctl -u datasynth-server -n 100 or docker compose logs --tail 100 datasynth-server.
  3. Check resource exhaustion: free -h, df -h, top.
  4. If OOM killed, increase memory limits and restart.
  5. If disk full, clear output directory and restart.

Alert: HighErrorRate

- alert: HighErrorRate
  expr: rate(synth_errors_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate on DataSynth server"

Response procedure:

  1. Check application logs for error patterns: journalctl -u datasynth-server -p err.
  2. Look for invalid configuration: curl localhost:3000/ready.
  3. Check if clients are sending malformed requests (rate limit headers in responses).
  4. If errors are generation failures, check available memory and disk.

Alert: HighMemoryUsage

- alert: HighMemoryUsage
  expr: synth_memory_usage_bytes / 1024 / 1024 > 3072
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High memory usage on DataSynth server"
    description: "Memory usage is {{ $value }}MB, exceeding 3GB threshold."

Response procedure:

  1. Check DataSynth’s internal degradation level: curl localhost:3000/ready – the memory check status will show ok, degraded, or fail.
  2. If degraded, DataSynth automatically reduces batch sizes. Wait for current work to complete.
  3. If in Emergency mode, stop active streams: curl -X POST localhost:3000/api/stream/stop.
  4. Consider increasing memory limits or reducing concurrent streams.

Alert: HighLatency

- alert: HighLatency
  expr: histogram_quantile(0.99, rate(datasynth_api_request_duration_seconds_bucket[5m])) > 30
  for: 5m
  labels:
    severity: warning

Response procedure:

  1. Check if bulk generation requests are creating large datasets. The default timeout is 300 seconds.
  2. Verify CPU is not throttled: kubectl top pod or docker stats.
  3. Consider splitting large generation requests into smaller batches.

Alert: NoEntitiesGenerated

- alert: NoEntitiesGenerated
  expr: increase(synth_entries_generated_total[1h]) == 0 and synth_active_streams > 0
  for: 15m
  labels:
    severity: warning

Response procedure:

  1. Streams are connected but not producing data. Check if streams are paused.
  2. Resume streams: curl -X POST localhost:3000/api/stream/resume.
  3. Check logs for generation failures.
  4. Verify the configuration is valid: curl localhost:3000/api/config.

Common Troubleshooting

Server Fails to Start

Symptom | Cause | Resolution
Invalid gRPC address | Bad --host or --port value | Check arguments and env vars
Failed to bind REST listener | Port already in use | lsof -i :3000 to find conflict
protoc not found | Missing protobuf compiler | Install protobuf-compiler package
Immediate exit, no logs | Panic before logger init | Run with RUST_LOG=debug

Generation Errors

Symptom | Cause | Resolution
Failed to create orchestrator | Invalid config | Validate with datasynth-data validate --config config.yaml
Rate limit exceeded | Too many API requests | Wait for Retry-After header, increase rate limits
Empty journal entries | No companies configured | Check curl localhost:3000/api/config
Slow generation | Large period or high volume | Add worker threads, increase CPU

Connection Issues

Symptom | Cause | Resolution
Connection refused on 3000 | Server not running or wrong port | Check process and port bindings
401 Unauthorized | Missing or invalid API key | Add X-API-Key header or Authorization: Bearer <key>
429 Too Many Requests | Rate limit exceeded | Respect Retry-After header
WebSocket drops immediately | Proxy not forwarding Upgrade | Configure proxy for WebSocket (see TLS doc)

Memory Issues

DataSynth monitors its own memory usage via /proc/self/statm (Linux) and triggers automatic degradation:

Degradation Level | Trigger | Behavior
Normal | < 70% of limit | Full throughput
Reduced | 70-85% | Smaller batch sizes
Minimal | 85-95% | Single-record generation
Emergency | > 95% | Rejects new work

Check the current level:

curl -s localhost:3000/ready | jq '.checks[] | select(.name == "memory")'

Log Analysis

Structured JSON Logs

DataSynth emits structured JSON logs with the following fields:

{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "INFO",
  "target": "datasynth_server::rest::routes",
  "message": "Configuration update requested",
  "thread_id": 42
}

Common Log Queries

Filter by severity:

# SystemD
journalctl -u datasynth-server -p err --since "1 hour ago"

# Docker
docker compose logs datasynth-server | jq 'select(.level == "ERROR" or .level == "WARN")'

Find configuration changes:

journalctl -u datasynth-server | grep "Configuration update"

Track generation throughput:

journalctl -u datasynth-server | grep "entries_generated"

Find API authentication failures:

journalctl -u datasynth-server | grep -i "unauthorized\|invalid api key"

Log Level Configuration

Set per-module log levels with RUST_LOG:

# Everything at info, server REST module at debug
RUST_LOG=info,datasynth_server::rest=debug

# Generation engine at trace (very verbose)
RUST_LOG=info,datasynth_runtime=trace

# Suppress noisy modules
RUST_LOG=info,hyper=warn,tower=warn

Routine Maintenance

Health Check Script

Create a monitoring script for external health checks:

#!/bin/bash
# /usr/local/bin/datasynth-healthcheck.sh

ENDPOINT="${1:-http://localhost:3000}"

# Check health
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$ENDPOINT/health")
if [ "$HTTP_CODE" != "200" ]; then
  echo "CRITICAL: Health check failed (HTTP $HTTP_CODE)"
  exit 2
fi

# Check readiness
READY=$(curl -s "$ENDPOINT/ready" | jq -r '.ready')
if [ "$READY" != "true" ]; then
  echo "WARNING: Server not ready"
  exit 1
fi

echo "OK: DataSynth healthy and ready"
exit 0

Prometheus Rule Testing

Validate alert rules before deploying:

# Install promtool
go install github.com/prometheus/prometheus/cmd/promtool@latest

# Test rules
promtool check rules deploy/prometheus-alerts.yml

Backup Checklist

Item | Location | Frequency
DataSynth config | /etc/datasynth/server.env | On change
Generation configs | YAML files | On change
Grafana dashboards | Export as JSON | Weekly
Prometheus data | prometheus-data volume | Per retention policy
API keys | Kubernetes Secret or env file | On rotation

Incident Response Template

When a production incident occurs:

  1. Detect: Alert fires or user reports an issue.
  2. Triage: Check /health, /ready, and /metrics endpoints.
  3. Contain: If generating bad data, stop streams: POST /api/stream/stop.
  4. Diagnose: Collect logs (journalctl -u datasynth-server --since "1 hour ago").
  5. Resolve: Apply fix (restart, config change, scale up).
  6. Verify: Confirm /ready returns ready: true and metrics are flowing.
  7. Document: Record root cause and remediation steps.

Capacity Planning

This guide provides sizing models, reference benchmarks, and recommendations for provisioning DataSynth deployments.

Performance Characteristics

DataSynth is CPU-bound during generation and I/O-bound during output. Key characteristics:

  • Throughput: 100K+ journal entries per second on a single core
  • Scaling: Near-linear scaling with CPU cores for batch generation
  • Memory: Proportional to active dataset size (companies, accounts, master data)
  • Disk: Output size depends on format, compression, and enabled modules
  • Network: REST/gRPC overhead is minimal; bulk generation is the bottleneck

Sizing Model

CPU

DataSynth uses Rayon for parallel generation and Tokio for async I/O. The relationship between CPU cores and throughput:

Cores | Approx. Entries/sec | Use Case
1 | 100K | Development, small datasets
2 | 180K | Staging, medium datasets
4 | 350K | Production, large datasets
8 | 650K | High-throughput batch jobs
16 | 1.1M | Maximum single-node throughput

These numbers are for journal entry generation with balanced debit/credit lines. Enabling additional modules (document flows, subledgers, master data, anomaly injection) reduces throughput by 30-60% due to cross-referencing overhead.

Memory

Memory usage depends on the active generation context:

Component | Approximate Memory
Base server process | 50-100 MB
Chart of accounts (small) | 5-10 MB
Chart of accounts (large) | 30-50 MB
Master data per company (small) | 20-40 MB
Master data per company (medium) | 80-150 MB
Master data per company (large) | 200-400 MB
Active journal entries buffer | 2-5 MB per 10K entries
Document flow chains | 50-100 MB per company
Anomaly injection engine | 20-50 MB

Sizing formula (approximate):

Memory (MB) = 100 + (companies * master_data_per_company) + (buffer_entries * 0.5)

Complexity | Companies | Memory Minimum | Memory Recommended
Small | 1-2 | 512 MB | 1 GB
Medium | 3-5 | 1 GB | 2 GB
Large | 5-10 | 2 GB | 4 GB
Enterprise | 10-20 | 4 GB | 8 GB
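
A worked example, assuming buffer_entries is expressed in thousands (consistent with the 2-5 MB per 10K entries figure above): three companies with medium master data (~150 MB each) and a 100K-entry buffer need roughly

Memory (MB) = 100 + (3 * 150) + (100 * 0.5)
            = 100 + 450 + 50
            = 600 MB

which lands comfortably within the Medium row (1 GB minimum, 2 GB recommended).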

DataSynth includes built-in memory guards that trigger graceful degradation before OOM. See Runbook - Memory Issues for degradation levels.

Disk Sizing

Output Size by Format

The output size depends on the number of entries, enabled modules, and output format:

Entries | CSV (uncompressed) | JSON (uncompressed) | Parquet (compressed)
10K | 15-25 MB | 30-50 MB | 3-5 MB
100K | 150-250 MB | 300-500 MB | 30-50 MB
1M | 1.5-2.5 GB | 3-5 GB | 300-500 MB
10M | 15-25 GB | 30-50 GB | 3-5 GB

These estimates cover journal entries only. Enabling all modules (master data, document flows, subledgers, audit trails, etc.) can multiply total output by 5-10x.

Output Files by Module

When all modules are enabled, a typical generation produces 60+ output files:

Category | Typical File Count | Size Relative to JE
Journal entries + ACDOCA | 2-3 | 1.0x (baseline)
Master data | 6-8 | 0.3-0.5x
Document flows | 8-10 | 1.5-2.0x
Subledgers | 8-12 | 1.0-1.5x
Period close + consolidation | 5-8 | 0.5-1.0x
Labels + controls | 6-10 | 0.1-0.3x
Audit trails | 6-8 | 0.3-0.5x

Disk Provisioning Formula

Disk (GB) = entries_millions * format_multiplier * module_multiplier * safety_margin

Where:
  format_multiplier:  CSV=0.25, JSON=0.50, Parquet=0.05  (per million entries)
  module_multiplier:  JE only=1.0, all modules=5.0
  safety_margin:      1.5 (for temp files, logs, etc.)

Example: 1M entries, all modules, CSV format:

Disk = 1 * 0.25 * 5.0 * 1.5 = 1.875 GB (round up to 2 GB)

Reference Benchmarks

Benchmarks run on c5.2xlarge (8 vCPU, 16 GB RAM):

Scenario | Config | Entries | Time | Throughput | Peak Memory
Batch (small) | 1 company, small CoA, JE only | 100K | 0.8s | 125K/s | 280 MB
Batch (medium) | 3 companies, medium CoA, all modules | 100K | 3.2s | 31K/s | 850 MB
Batch (large) | 5 companies, large CoA, all modules | 1M | 45s | 22K/s | 2.1 GB
Streaming | 1 company, JE only | continuous | - | 10 events/s | 350 MB
Concurrent API | 10 parallel bulk requests | 10K each | 4.5s | 22K/s total | 1.2 GB

Container Resource Recommendations

Docker / Single Host

Profile | CPU | Memory | Disk | Use Case
Dev | 1 core | 1 GB | 10 GB | Local testing
Staging | 2 cores | 2 GB | 50 GB | Integration testing
Production | 4 cores | 4 GB | 100 GB | Regular generation
Batch worker | 8 cores | 8 GB | 200 GB | Large dataset generation

Kubernetes

Profile | requests.cpu | requests.memory | limits.cpu | limits.memory | Replicas
Light | 250m | 256Mi | 1 | 1Gi | 2
Standard | 500m | 512Mi | 2 | 2Gi | 2-5
Heavy | 1000m | 1Gi | 4 | 4Gi | 3-10
Burst | 2000m | 2Gi | 8 | 8Gi | 5-20

Scaling Guidelines

Vertical Scaling (Single Node)

Vertical scaling is effective up to 16 cores. Beyond that, returns diminish due to lock contention in the shared ServerState. Recommendations:

  1. Start with the “Standard” Kubernetes profile.
  2. Monitor synth_entries_per_second in Grafana.
  3. If throughput plateaus at high CPU, add replicas instead.

Horizontal Scaling (Multi-Replica)

DataSynth is stateless – each pod generates data independently. Horizontal scaling considerations:

  1. Enable Redis for shared rate limiting across replicas.
  2. Use deterministic seeds per replica to avoid duplicate data (seed = base_seed + replica_index); see the sketch after this list.
  3. Route bulk generation requests to specific replicas if output deduplication matters.
  4. WebSocket streams are per-connection and do not share state across replicas.
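
A minimal sketch of item 2 for batch CLI workers, assuming the seed is set via a single global seed key in the YAML config (as in the examples elsewhere in this guide) and rewritten per worker with sed:

# Derive one config per worker, offsetting the base seed per replica index
BASE_SEED=42
for i in 0 1 2; do
  sed "s/seed: .*/seed: $((BASE_SEED + i))/" config.yaml > config-$i.yaml
  datasynth-data generate --config config-$i.yaml --output ./output-$i &
done
wait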

Scaling Decision Tree

Is throughput below target?
  |
  +-- Yes: Is CPU utilization > 70%?
  |    |
  |    +-- Yes: Add more replicas (horizontal)
  |    +-- No:  Is memory > 80%?
  |         |
  |         +-- Yes: Increase memory limit
  |         +-- No:  Check I/O (disk throughput, network)
  |
  +-- No: Current sizing is adequate

Network Bandwidth

DataSynth’s network requirements are modest:

Operation | Bandwidth | Notes
Health checks | < 1 KB/s | Negligible
Prometheus scrape | 5-10 KB per scrape | Every 10-30s
Bulk API response (10K entries) | 5-15 MB burst | Short-lived
WebSocket stream | 1-5 KB/s per connection | 10 events/s default
gRPC streaming | 2-10 KB/s per stream | Depends on message size

Network is rarely the bottleneck. A 1 Gbps link supports hundreds of concurrent clients.

Disaster Recovery

DataSynth is a stateless data generation engine. It does not maintain a persistent database or durable state that requires traditional backup and recovery. Instead, recovery relies on two key properties:

  1. Deterministic generation – Given the same configuration and seed, DataSynth produces identical output.
  2. Stateless server – The server process can be restarted from scratch at any time.

What Needs to Be Backed Up

Asset | Location | Recovery Priority
Generation config (YAML) | /etc/datasynth/, ConfigMap, or source control | Critical
Environment / secrets | /etc/datasynth/server.env, K8s Secrets | Critical
API keys | Environment variable or Secret | Critical
Generated output files | Output directory, object storage | Depends on use case
Grafana dashboards | deploy/grafana/provisioning/ or exported JSON | Low – can re-provision
Prometheus data | prometheus-data volume | Low – regenerate from metrics

The generation config and seed are the most important assets. With them, you can reproduce any dataset exactly.

Backup Procedures

Configuration Backup

Store all DataSynth configuration in version control. This is the primary backup mechanism:

# Recommended repository structure
configs/
  production/
    manufacturing.yaml      # Generation config
    server.env.encrypted    # Encrypted environment file
  staging/
    retail.yaml
    server.env.encrypted

For Kubernetes, export the ConfigMap and Secret:

# Export current config
kubectl -n datasynth get configmap datasynth-config -o yaml > backup/configmap.yaml

# Export secrets (base64-encoded)
kubectl -n datasynth get secret datasynth-api-keys -o yaml > backup/secret.yaml

Output Data Backup

If generated data must be preserved (not just re-generated), back up the output directory:

# Local backup
tar czf datasynth-output-$(date +%F).tar.gz /var/lib/datasynth/output/

# S3 backup
aws s3 sync /var/lib/datasynth/output/ s3://your-bucket/datasynth/$(date +%F)/

Scheduled Backup Script

#!/bin/bash
# /usr/local/bin/datasynth-backup.sh
# Run via cron: 0 2 * * * /usr/local/bin/datasynth-backup.sh

BACKUP_DIR="/var/backups/datasynth"
DATE=$(date +%F)

mkdir -p "$BACKUP_DIR"

# Back up configuration
cp /etc/datasynth/server.env "$BACKUP_DIR/server.env.$DATE"

# Back up output if it exists and is non-empty
if [ -d /var/lib/datasynth/output ] && [ "$(ls -A /var/lib/datasynth/output)" ]; then
  tar czf "$BACKUP_DIR/output-$DATE.tar.gz" /var/lib/datasynth/output/
fi

# Retain 30 days of backups
find "$BACKUP_DIR" -type f -mtime +30 -delete

echo "Backup completed: $DATE"

Deterministic Recovery

DataSynth uses ChaCha8 RNG with a configurable seed. When the seed is set in the configuration, every run produces byte-identical output.

Reproducing a Dataset

To reproduce a previous generation run:

  1. Retrieve the configuration file used for that run.
  2. Confirm the seed value is set (not random).
  3. Run the generation with the same configuration.

# Example config with deterministic seed
global:
  industry: manufacturing
  start_date: "2024-01-01"
  period_months: 12
  seed: 42              # <-- deterministic seed

# Regenerate identical data
datasynth-data generate --config config.yaml --output ./recovered-output

# Verify output is identical
diff <(sha256sum original-output/*.csv | sort) <(sha256sum recovered-output/*.csv | sort)

Important Caveats for Determinism

Deterministic output requires exact version matching:

Factor | Must Match? | Notes
DataSynth version | Yes | Different versions may change generation logic
Configuration YAML | Yes | Any parameter change alters output
Seed value | Yes | Different seed = different data
Operating system | No | Cross-platform determinism is guaranteed
CPU architecture | No | ChaCha8 output is platform-independent
Number of threads | No | Parallelism does not affect determinism

If you need to reproduce data from a past release, pin the DataSynth version:

# Docker: use the exact version tag
docker run --rm \
  -v $(pwd)/config.yaml:/config.yaml:ro \
  -v $(pwd)/output:/output \
  datasynth/datasynth-server:0.5.0 \
  datasynth-data generate --config /config.yaml --output /output

# Source: checkout the exact tag
git checkout v0.5.0
cargo build --release -p datasynth-cli

Stateless Restart

The DataSynth server maintains no persistent state. All in-memory state (counters, active streams, generation context) is ephemeral. A restart produces a fresh server.

Restart Procedure

Docker:

docker compose restart datasynth-server

Kubernetes:

# Rolling restart (zero downtime with PDB)
kubectl -n datasynth rollout restart deployment/datasynth

# Verify rollout
kubectl -n datasynth rollout status deployment/datasynth

SystemD:

sudo systemctl restart datasynth-server

What Is Lost on Restart

StateLost?Impact
Prometheus metrics countersYesCounters reset to 0; Prometheus handles counter resets via rate()
Active WebSocket streamsYesClients must reconnect
Uptime counterYesResets to 0
In-progress bulk generationYesClient receives connection error; must retry
Configuration (if set via API)YesReverts to default; use ConfigMap or env for persistence
Rate limit bucketsYesAll clients get fresh rate limit windows

Mitigating Restart Impact

  1. Use config files, not the API, for persistent configuration. The POST /api/config endpoint only updates in-memory state.
  2. Set up client retry logic for bulk generation requests (see the sketch after this list).
  3. Use Kubernetes PDB to ensure at least one pod is always running during rolling restarts.
  4. Monitor with Prometheus – counter resets are handled automatically by rate() and increase() functions.
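
A minimal retry sketch for point 2, assuming the third-party Python requests package and the default REST port; the endpoint path, request fields, and response fields follow the bulk generation API documented later in the API Reference, and the helper name is illustrative:

import time
import requests

def generate_bulk_with_retry(entry_count, base_url="http://localhost:3000",
                             api_key=None, max_attempts=5):
    """POST /api/generate/bulk, retrying if the server restarts mid-request."""
    headers = {"X-API-Key": api_key} if api_key else {}
    payload = {"entry_count": entry_count, "inject_anomalies": True}

    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(f"{base_url}/api/generate/bulk",
                                 json=payload, headers=headers, timeout=300)
            resp.raise_for_status()
            return resp.json()
        except (requests.ConnectionError, requests.Timeout):
            # Server restart or rolling update: back off and retry the whole request.
            if attempt == max_attempts:
                raise
            time.sleep(min(2 ** attempt, 30))

result = generate_bulk_with_retry(10000)
print(result["entries_generated"], "entries in", result["duration_ms"], "ms")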

Recovery Scenarios

Scenario 1: Server Process Crash

  1. SystemD or Kubernetes automatically restarts the process.
  2. Verify with curl localhost:3000/health.
  3. Check logs for crash cause: journalctl -u datasynth-server -n 200.
  4. No data loss – server is stateless.

Scenario 2: Node Failure (Kubernetes)

  1. Kubernetes reschedules pods to healthy nodes.
  2. PDB ensures minimum availability during rescheduling.
  3. Clients reconnect automatically (Service endpoint updates).
  4. No manual intervention required.

Scenario 3: Configuration Lost

  1. Retrieve config from version control.
  2. Redeploy: kubectl apply -f configmap.yaml or copy to /etc/datasynth/.
  3. Restart server to pick up new config.

Scenario 4: Need to Reproduce Historical Data

  1. Identify the DataSynth version and config used.
  2. Pin the version (Docker tag or Git tag).
  3. Run generation with the same config and seed.
  4. Verify with checksums.

Recovery Time Objectives

ComponentRTORPONotes
Server process< 30sN/A (stateless)Auto-restart via SystemD/K8s
Full service (K8s)< 2 minN/A (stateless)Pod scheduling + startup probes
Data regenerationDepends on size0 (deterministic)Re-run with same config+seed
Config recovery< 5 minLast commitFrom version control

API Reference

DataSynth exposes REST, gRPC, and WebSocket interfaces. This page documents all endpoints, authentication, rate limiting, error formats, and the WebSocket protocol.

Base URLs

ProtocolDefault URLPort
RESThttp://localhost:30003000
gRPCgrpc://localhost:5005150051
WebSocketws://localhost:3000/ws/3000

Authentication

Authentication is optional and disabled by default. When enabled, all endpoints except health probes require a valid API key.

Enabling Authentication

Pass API keys at startup:

# CLI argument
datasynth-server --api-keys "key-1,key-2"

# Environment variable
DATASYNTH_API_KEYS="key-1,key-2" datasynth-server

Sending API Keys

The server accepts API keys via two headers (checked in order):

MethodHeaderExample
Bearer tokenAuthorizationAuthorization: Bearer your-api-key
Custom headerX-API-KeyX-API-Key: your-api-key
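
For example, a Python client (using the third-party requests package, assumed to be installed) can send the key with either header; the Bearer form is checked first:

import requests

BASE_URL = "http://localhost:3000"
API_KEY = "your-api-key"

# Bearer token form
resp = requests.get(f"{BASE_URL}/api/metrics",
                    headers={"Authorization": f"Bearer {API_KEY}"})
print(resp.status_code)

# Custom header form
resp = requests.get(f"{BASE_URL}/api/metrics",
                    headers={"X-API-Key": API_KEY})
print(resp.status_code)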

Exempt Paths

These paths never require authentication, even when auth is enabled:

  • GET /health
  • GET /ready
  • GET /live
  • GET /metrics

Authentication Internals

  • API keys are hashed with Argon2id at server startup.
  • Verification iterates all stored hashes (no short-circuit) to prevent timing side-channel attacks.
  • A 5-second LRU cache avoids repeated Argon2 verification for rapid successive requests.

Error Responses

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Bearer

API key required. Provide via 'Authorization: Bearer <key>' or 'X-API-Key' header
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Bearer

Invalid API key

Rate Limiting

Rate limiting is configurable and disabled by default. When enabled, it tracks requests per client IP using a sliding window.

Default Configuration

ParameterDefaultDescription
max_requests100Maximum requests per window
window60 secondsTime window duration
Exempt paths/health, /ready, /liveNot rate-limited

Rate Limit Headers

All non-exempt responses include rate limit headers:

HeaderDescription
X-RateLimit-LimitMaximum requests allowed in the window
X-RateLimit-RemainingRequests remaining in the current window
Retry-AfterSeconds until the window resets (only on 429)

Rate Limit Exceeded Response

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Retry-After: 60

Rate limit exceeded. Max 100 requests per 60 seconds.
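
Clients should honor Retry-After when they receive a 429. A minimal sketch, assuming the third-party Python requests package (the helper name is illustrative):

import time
import requests

def get_with_rate_limit(url, headers=None, max_attempts=3):
    """GET a URL, sleeping for Retry-After seconds whenever a 429 is returned."""
    for _ in range(max_attempts):
        resp = requests.get(url, headers=headers)
        if resp.status_code != 429:
            return resp
        # Retry-After is only present on 429; fall back to the 60s default window.
        wait = int(resp.headers.get("Retry-After", "60"))
        time.sleep(wait)
    return resp

resp = get_with_rate_limit("http://localhost:3000/api/metrics")
print(resp.headers.get("X-RateLimit-Remaining"))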

Client Identification

The rate limiter identifies clients by IP address, checked in order:

  1. X-Forwarded-For header (first IP)
  2. X-Real-IP header
  3. Fallback: unknown (all unidentified clients share a bucket)

Distributed Rate Limiting

For multi-replica deployments, enable Redis-backed rate limiting:

datasynth-server --redis-url redis://127.0.0.1:6379

This requires the redis feature to be enabled at build time.

Security Headers

All responses include the following security headers:

HeaderValuePurpose
X-Content-Type-OptionsnosniffPrevent MIME type sniffing
X-Frame-OptionsDENYPrevent clickjacking
X-XSS-Protection0Disable legacy XSS filter (rely on CSP)
Referrer-Policystrict-origin-when-cross-originControl referrer leakage
Content-Security-Policydefault-src 'none'; frame-ancestors 'none'Restrict resource loading
Cache-Controlno-storePrevent caching of API responses

Request ID

Every response includes an X-Request-Id header. If the client sends an X-Request-Id header, its value is preserved. Otherwise, a UUID v4 is generated.

# Client-provided request ID
curl -H "X-Request-Id: my-trace-123" http://localhost:3000/health
# Response header: X-Request-Id: my-trace-123

# Auto-generated request ID
curl -v http://localhost:3000/health
# Response header: X-Request-Id: 550e8400-e29b-41d4-a716-446655440000

CORS Configuration

Default allowed origins:

OriginPurpose
http://localhost:5173Vite dev server
http://localhost:3000Local development
http://127.0.0.1:5173Localhost variant
http://127.0.0.1:3000Localhost variant
tauri://localhostTauri desktop app

Allowed methods: GET, POST, PUT, DELETE, OPTIONS

Allowed headers: Content-Type, Authorization, Accept

REST API Endpoints

Health & Metrics

GET /health

Returns overall server health status.

Response 200 OK:

{
  "healthy": true,
  "version": "0.5.0",
  "uptime_seconds": 3600
}

GET /ready

Kubernetes-compatible readiness probe. Performs deep checks (config, memory, disk).

Response 200 OK (when ready):

{
  "ready": true,
  "message": "Service is ready",
  "checks": [
    { "name": "config", "status": "ok" },
    { "name": "memory", "status": "ok" },
    { "name": "disk", "status": "ok" }
  ]
}

Response 503 Service Unavailable (when not ready):

{
  "ready": false,
  "message": "Service is not ready",
  "checks": [
    { "name": "config", "status": "ok" },
    { "name": "memory", "status": "fail" },
    { "name": "disk", "status": "ok" }
  ]
}

GET /live

Kubernetes-compatible liveness probe. Lightweight heartbeat.

Response 200 OK:

{
  "alive": true,
  "timestamp": "2024-01-15T10:30:00.123456789Z"
}

GET /api/metrics

Returns server metrics as JSON.

Response 200 OK:

{
  "total_entries_generated": 150000,
  "total_anomalies_injected": 750,
  "uptime_seconds": 3600,
  "session_entries": 150000,
  "session_entries_per_second": 41.67,
  "active_streams": 2,
  "total_stream_events": 50000
}

GET /metrics

Prometheus-compatible metrics in text exposition format.

Response 200 OK (text/plain; version=0.0.4):

# HELP synth_entries_generated_total Total number of journal entries generated
# TYPE synth_entries_generated_total counter
synth_entries_generated_total 150000

# HELP synth_anomalies_injected_total Total number of anomalies injected
# TYPE synth_anomalies_injected_total counter
synth_anomalies_injected_total 750

# HELP synth_uptime_seconds Server uptime in seconds
# TYPE synth_uptime_seconds gauge
synth_uptime_seconds 3600

# HELP synth_entries_per_second Rate of entry generation
# TYPE synth_entries_per_second gauge
synth_entries_per_second 41.67

# HELP synth_active_streams Number of active streaming connections
# TYPE synth_active_streams gauge
synth_active_streams 2

# HELP synth_stream_events_total Total events sent through streams
# TYPE synth_stream_events_total counter
synth_stream_events_total 50000

# HELP synth_info Server version information
# TYPE synth_info gauge
synth_info{version="0.5.0"} 1

Configuration

GET /api/config

Returns the current generation configuration.

Response 200 OK:

{
  "success": true,
  "message": "Current configuration",
  "config": {
    "industry": "Manufacturing",
    "start_date": "2024-01-01",
    "period_months": 12,
    "seed": 42,
    "coa_complexity": "Medium",
    "companies": [
      {
        "code": "1000",
        "name": "Manufacturing Corp",
        "currency": "USD",
        "country": "US",
        "annual_transaction_volume": 100000,
        "volume_weight": 1.0
      }
    ],
    "fraud_enabled": true,
    "fraud_rate": 0.02
  }
}

POST /api/config

Updates the generation configuration.

Request body:

{
  "industry": "retail",
  "start_date": "2024-06-01",
  "period_months": 6,
  "seed": 12345,
  "coa_complexity": "large",
  "companies": [
    {
      "code": "1000",
      "name": "Retail Corp",
      "currency": "USD",
      "country": "US",
      "annual_transaction_volume": 200000,
      "volume_weight": 1.0
    }
  ],
  "fraud_enabled": true,
  "fraud_rate": 0.05
}

Valid industries: manufacturing, retail, financial_services, healthcare, technology, professional_services, energy, transportation, real_estate, telecommunications

Valid CoA complexities: small, medium, large

Response 200 OK:

{
  "success": true,
  "message": "Configuration updated and applied",
  "config": { ... }
}

Error 400 Bad Request:

{
  "success": false,
  "message": "Unknown industry: 'invalid'. Valid values: manufacturing, retail, ...",
  "config": null
}
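
A hedged example of updating the configuration from Python (third-party requests package assumed); the payload mirrors the request body above, and note that this only changes in-memory state until the next restart:

import requests

config = {
    "industry": "retail",
    "start_date": "2024-06-01",
    "period_months": 6,
    "seed": 12345,
    "coa_complexity": "large",
    "companies": [{
        "code": "1000", "name": "Retail Corp", "currency": "USD",
        "country": "US", "annual_transaction_volume": 200000,
        "volume_weight": 1.0
    }],
    "fraud_enabled": True,
    "fraud_rate": 0.05,
}

resp = requests.post("http://localhost:3000/api/config", json=config)
body = resp.json()
if not body["success"]:
    raise RuntimeError(body["message"])
print(body["message"])   # "Configuration updated and applied"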

Generation

POST /api/generate/bulk

Generates journal entries in a single batch. Maximum 1,000,000 entries per request.

Request body:

{
  "entry_count": 10000,
  "include_master_data": true,
  "inject_anomalies": true
}

All fields are optional. Without entry_count, the server uses the configured volume.

Response 200 OK:

{
  "success": true,
  "entries_generated": 10000,
  "duration_ms": 450,
  "anomaly_count": 50
}

Error 400 Bad Request (entry count too large):

entry_count (2000000) exceeds maximum allowed value (1000000)

Streaming Control

POST /api/stream/start

Starts the event stream. WebSocket clients begin receiving events.

Request body:

{
  "events_per_second": 10,
  "max_events": 10000,
  "inject_anomalies": false
}

POST /api/stream/stop

Stops all active streams.

POST /api/stream/pause

Pauses active streams. Events stop flowing but connections remain open.

POST /api/stream/resume

Resumes paused streams.

POST /api/stream/trigger/:pattern

Triggers a named generation pattern for upcoming streamed entries.

Valid patterns: year_end_spike, period_end_spike, holiday_cluster, fraud_cluster, error_cluster, uniform, custom:*

Response:

{
  "success": true,
  "message": "Pattern 'year_end_spike' will be applied to upcoming entries"
}
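
Putting the streaming control endpoints together, a short Python sketch (third-party requests package assumed) that starts a stream, triggers a pattern, and stops it:

import time
import requests

BASE = "http://localhost:3000"

# Start streaming at 10 events/second, capped at 10,000 events
requests.post(f"{BASE}/api/stream/start",
              json={"events_per_second": 10, "max_events": 10000,
                    "inject_anomalies": False})

# Apply a named pattern to upcoming streamed entries
requests.post(f"{BASE}/api/stream/trigger/year_end_spike")

time.sleep(60)   # let events flow to connected WebSocket clients

# Stop all active streams
requests.post(f"{BASE}/api/stream/stop")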

WebSocket Protocol

ws://localhost:3000/ws/metrics

Sends metrics updates every 1 second as JSON text frames:

{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "total_entries": 150000,
  "total_anomalies": 750,
  "entries_per_second": 41.67,
  "active_streams": 2,
  "uptime_seconds": 3600
}

ws://localhost:3000/ws/events

Streams generated journal entry events as JSON text frames:

{
  "sequence": 1234,
  "timestamp": "2024-01-15T10:30:00.456Z",
  "event_type": "JournalEntry",
  "document_id": "JE-2024-001234",
  "company_code": "1000",
  "amount": "15000.00",
  "is_anomaly": false
}

Connection Management

  • The server responds to WebSocket Ping frames with Pong.
  • Clients should send periodic pings to keep the connection alive through proxies.
  • Close the connection by sending a WebSocket Close frame.
  • The server decrements active_streams when a client disconnects.

Example: Connecting with wscat

# Install wscat
npm install -g wscat

# Connect to metrics stream
wscat -c ws://localhost:3000/ws/metrics

# Connect to event stream
wscat -c ws://localhost:3000/ws/events

Example: Connecting with curl (WebSocket)

curl --include \
  --no-buffer \
  --header "Connection: Upgrade" \
  --header "Upgrade: websocket" \
  --header "Sec-WebSocket-Version: 13" \
  --header "Sec-WebSocket-Key: $(openssl rand -base64 16)" \
  http://localhost:3000/ws/events
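
Example: Connecting with Python

A sketch using the third-party websockets asyncio package (assumed installed); the frame fields match the event frame documented above:

# Requires: pip install websockets
import asyncio
import json
import websockets

async def consume_events():
    async with websockets.connect("ws://localhost:3000/ws/events") as ws:
        # Runs until the server closes the connection or the stream is stopped.
        while True:
            frame = await ws.recv()           # JSON text frame
            event = json.loads(frame)
            if event["is_anomaly"]:
                print("anomaly:", event["document_id"], event["amount"])

asyncio.run(consume_events())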

Request Timeout

The default request timeout is 300 seconds (5 minutes), which accommodates large bulk generation requests. Requests exceeding this timeout receive a 408 Request Timeout response.

Error Format

REST API errors follow a consistent format:

Validation errors return JSON:

{
  "success": false,
  "message": "Descriptive error message",
  "config": null
}

Server errors return plain text:

HTTP/1.1 500 Internal Server Error

Generation failed: <error description>

HTTP Status Codes

CodeMeaningWhen
200SuccessRequest completed
400Bad RequestInvalid parameters
401UnauthorizedMissing or invalid API key
408Request TimeoutRequest exceeded 300s timeout
429Too Many RequestsRate limit exceeded
500Internal Server ErrorGeneration or server failure
503Service UnavailableReadiness check failed

Security Hardening

This guide provides a pre-deployment security checklist and detailed guidance on TLS, secrets management, container security, and audit logging for DataSynth.

Pre-Deployment Checklist

Complete this checklist before exposing DataSynth to any network beyond localhost:

#ItemPriorityStatus
1Enable API key authenticationCritical
2Use strong, unique API keys (32+ chars)Critical
3Enable TLS (direct or via reverse proxy)Critical
4Set explicit CORS allowed originsHigh
5Enable rate limitingHigh
6Run as non-root userHigh
7Use read-only root filesystem (container)High
8Drop all Linux capabilitiesHigh
9Set resource limits (memory, CPU, file descriptors)High
10Restrict network exposure (firewall, security groups)High
11Enable structured logging to a central log aggregatorMedium
12Set up Prometheus monitoring and alertsMedium
13Rotate API keys periodicallyMedium
14Review and restrict CORS originsMedium
15Enable mTLS for gRPC if used in service meshLow

Authentication Hardening

Strong API Keys

Generate cryptographically strong API keys:

# Generate a 48-character random key
openssl rand -base64 36

# Output: a random 48-character base64 string

Recommendations:

  • Minimum 32 characters, ideally 48+
  • Use different keys per environment (dev, staging, production)
  • Use different keys per client/team when possible
  • Rotate keys quarterly or after any suspected compromise
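
Keys of equivalent strength can also be generated from Python's standard library, which avoids depending on openssl being available:

import secrets

# 36 random bytes -> 48-character URL-safe key (no padding characters)
api_key = secrets.token_urlsafe(36)
print(api_key)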

Argon2id Hashing

DataSynth hashes API keys with Argon2id (the recommended password hashing algorithm). Keys are hashed at startup; the plaintext is never stored in memory after hashing.

For pre-hashed keys (avoiding plaintext in environment variables), hash the key externally and pass the PHC-format hash:

# Python example: pre-hash an API key
from argon2 import PasswordHasher

ph = PasswordHasher()
key_hash = ph.hash("your-api-key")  # avoid shadowing the built-in hash()
print(key_hash)
# $argon2id$v=19$m=65536,t=3,p=4$...

Pass the pre-hashed value to the server via the AuthConfig::with_prehashed_keys() API (for embedded use) or store in a secrets manager.

API Key Rotation

To rotate keys without downtime:

  1. Add the new key to DATASYNTH_API_KEYS alongside the old key.
  2. Restart the server (rolling restart in K8s).
  3. Update all clients to use the new key.
  4. Remove the old key from DATASYNTH_API_KEYS.
  5. Restart again.

TLS Configuration

Option 1: Reverse Proxy (Recommended)

Terminate TLS at a reverse proxy (Nginx, Envoy, cloud load balancer) and forward plain HTTP to DataSynth. See TLS & Reverse Proxy for full configurations.

Advantages:

  • Centralized certificate management
  • Standard renewal workflows (cert-manager, Let’s Encrypt)
  • Offloads TLS from the application
  • Easier to audit and rotate certificates

Option 2: Native TLS

Build DataSynth with TLS support:

cargo build --release -p datasynth-server --features tls

Run with certificate and key:

datasynth-server \
  --tls-cert /etc/datasynth/tls/cert.pem \
  --tls-key /etc/datasynth/tls/key.pem

Certificate Requirements

RequirementDetail
FormatPEM-encoded X.509
Key typeRSA 2048+ or ECDSA P-256/P-384
ProtocolTLS 1.2 or 1.3 (1.0/1.1 disabled)
Cipher suitesHIGH:!aNULL:!MD5 (Nginx default)
Subject Alternative NameMust match the hostname clients use

mTLS for gRPC

For service-to-service communication, configure mutual TLS:

# Nginx mTLS configuration
server {
    listen 50051 ssl http2;

    ssl_certificate /etc/ssl/certs/server.pem;
    ssl_certificate_key /etc/ssl/private/server-key.pem;

    # Client certificate verification
    ssl_client_certificate /etc/ssl/certs/ca.pem;
    ssl_verify_client on;

    location / {
        grpc_pass grpc://127.0.0.1:50051;
    }
}

Secret Management

Environment Variables

For simple deployments, store secrets in environment files with restricted permissions:

# Create the environment file
sudo install -m 640 -o root -g datasynth /dev/null /etc/datasynth/server.env

# Edit the file
sudo vi /etc/datasynth/server.env

Never commit plaintext secrets to version control. Use .gitignore to exclude env files.

Kubernetes Secrets

For Kubernetes, store API keys in a Secret resource:

apiVersion: v1
kind: Secret
metadata:
  name: datasynth-api-keys
  namespace: datasynth
type: Opaque
stringData:
  api-keys: "key-1,key-2"

External Secrets Operator

For production, integrate with a secrets manager via the External Secrets Operator:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: datasynth-api-keys
  namespace: datasynth
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: ClusterSecretStore
  target:
    name: datasynth-api-keys
  data:
    - secretKey: api-keys
      remoteRef:
        key: datasynth/api-keys

HashiCorp Vault

Inject secrets via the Vault Agent sidecar:

# Pod annotations for Vault Agent Injector
podAnnotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "datasynth"
  vault.hashicorp.com/agent-inject-secret-api-keys: "secret/data/datasynth/api-keys"
  vault.hashicorp.com/agent-inject-template-api-keys: |
    {{- with secret "secret/data/datasynth/api-keys" -}}
    {{ .Data.data.keys }}
    {{- end -}}

Container Security

Distroless Base Image

The production Dockerfile uses gcr.io/distroless/cc-debian12, which contains:

  • No shell (/bin/sh, /bin/bash)
  • No package manager
  • No unnecessary system utilities
  • Only the C runtime library and certificates

This minimizes the attack surface and prevents shell-based exploits.

Security Context (Kubernetes)

The Helm chart enforces the following security context:

podSecurityContext:
  runAsNonRoot: true        # Pod must run as non-root
  runAsUser: 1000           # UID 1000
  runAsGroup: 1000          # GID 1000
  fsGroup: 1000             # Filesystem group

securityContext:
  allowPrivilegeEscalation: false    # No setuid/setgid
  readOnlyRootFilesystem: true       # Read-only root FS
  capabilities:
    drop:
      - ALL                          # Drop all Linux capabilities

SystemD Sandboxing

The SystemD unit file includes comprehensive sandboxing:

NoNewPrivileges=true          # Prevent privilege escalation
ProtectSystem=strict          # Read-only filesystem
ProtectHome=true              # Hide home directories
PrivateTmp=true               # Isolated /tmp
PrivateDevices=true           # No device access
ProtectKernelTunables=true    # No sysctl modification
ProtectKernelModules=true     # No module loading
ProtectControlGroups=true     # No cgroup modification
RestrictNamespaces=true       # No namespace creation
RestrictRealtime=true         # No realtime scheduling
RestrictSUIDSGID=true         # No SUID/SGID

Image Scanning

Scan the container image for vulnerabilities before deployment:

# Trivy
trivy image datasynth/datasynth-server:0.5.0

# Grype
grype datasynth/datasynth-server:0.5.0

# Docker Scout
docker scout cves datasynth/datasynth-server:0.5.0

The distroless base image has a minimal CVE surface. Address any findings in the Rust dependencies via cargo audit:

cargo install cargo-audit
cargo audit

Network Security

Principle of Least Exposure

Only expose the ports and endpoints that clients need:

DeploymentExpose REST (3000)Expose gRPC (50051)Expose Metrics
Internal API onlyVia Ingress/LBVia Ingress/LBPrometheus only
Public APIVia Ingress + WAFNoNo
Dev/stagingLocalhost onlyLocalhost onlyLocalhost only

Network Policies (Kubernetes)

Restrict pod-to-pod communication:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: datasynth-allow-ingress
  namespace: datasynth
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: datasynth
  policyTypes:
    - Ingress
  ingress:
    # Allow from Ingress controller
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - port: 3000
          protocol: TCP
    # Allow Prometheus scraping
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - port: 3000
          protocol: TCP

CORS Lockdown

In production, override the default CORS configuration to allow only your application’s domain:

#![allow(unused)]
fn main() {
// Programmatic configuration
let cors = CorsConfig {
    allowed_origins: vec![
        "https://app.example.com".to_string(),
    ],
    allow_any_origin: false,
};
}

Never enable allow_any_origin: true in production.

Audit Logging

Request Tracing

Every request receives an X-Request-Id header (auto-generated UUID v4 or client-provided). Use this to correlate logs across services.

Structured Log Fields

DataSynth emits JSON-structured logs with the following fields useful for security auditing:

{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "INFO",
  "target": "datasynth_server::rest::routes",
  "message": "Configuration update requested: industry=retail, period_months=6",
  "thread_id": 42
}

Log Events to Monitor

EventLog PatternSeverity
Authentication failureUnauthorized / Invalid API keyHigh
Rate limit exceededRate limit exceededMedium
Configuration changeConfiguration update requestedMedium
Stream start/stopStream started / Stream stoppedLow
WebSocket connectionWebSocket connected / disconnectedLow
Server panicServer panic:Critical
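
A small sketch that scans JSON-structured log lines on stdin for the patterns above; the field names follow the example log entry, and the watchlist/severities simply restate the table:

import json
import sys

# Pattern -> severity, taken from the table above
WATCHLIST = {
    "Invalid API key": "High",
    "Unauthorized": "High",
    "Rate limit exceeded": "Medium",
    "Configuration update requested": "Medium",
    "Server panic": "Critical",
}

for line in sys.stdin:
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        continue
    message = record.get("message", "")
    for pattern, severity in WATCHLIST.items():
        if pattern in message:
            print(f"[{severity}] {record.get('timestamp')} {message}")
            break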

Centralized Logging

Forward structured logs to a central aggregator:

Docker:

services:
  datasynth-server:
    logging:
      driver: "fluentd"
      options:
        fluentd-address: "localhost:24224"
        tag: "datasynth.server"

SystemD to Loki:

# Install Promtail for journal forwarding
# /etc/promtail/config.yaml
scrape_configs:
  - job_name: datasynth
    journal:
      matches:
        - _SYSTEMD_UNIT=datasynth-server.service
      labels:
        job: datasynth

RBAC (Kubernetes)

The Helm chart creates a ServiceAccount by default. Bind minimal permissions:

serviceAccount:
  create: true
  automount: true   # Only if needed by the application
  annotations: {}

DataSynth does not require any Kubernetes API access. If automount is not needed, set it to false to prevent the ServiceAccount token from being mounted into the pod.

Supply Chain Security

Reproducible Builds

The Dockerfile uses pinned versions:

  • rust:1.88-bookworm – pinned Rust compiler version
  • gcr.io/distroless/cc-debian12 – pinned distroless image
  • cargo-chef --locked – locked dependency resolution

Dependency Auditing

# Check for known vulnerabilities
cargo audit

# Check for unmaintained or yanked crates
cargo audit --deny warnings

Run cargo audit in CI on every pull request.

SBOM Generation

Generate a Software Bill of Materials for compliance:

# Using cargo-cyclonedx
cargo install cargo-cyclonedx
cargo cyclonedx --all

# Using syft for container images
syft datasynth/datasynth-server:0.5.0 -o cyclonedx-json > sbom.json

TLS & Reverse Proxy Configuration

DataSynth server supports TLS in two ways:

  1. Native TLS (with tls feature flag) - direct rustls termination
  2. Reverse Proxy - recommended for production deployments

Native TLS

Build with TLS support:

cargo build --release -p datasynth-server --features tls

Run with certificate and key:

datasynth-server --tls-cert /path/to/cert.pem --tls-key /path/to/key.pem

Nginx Reverse Proxy

upstream datasynth_rest {
    server 127.0.0.1:3000;
}

upstream datasynth_grpc {
    server 127.0.0.1:50051;
}

server {
    listen 443 ssl http2;
    server_name datasynth.example.com;

    ssl_certificate /etc/ssl/certs/datasynth.pem;
    ssl_certificate_key /etc/ssl/private/datasynth-key.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;

    # REST API
    location / {
        proxy_pass http://datasynth_rest;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
    }

    # WebSocket
    location /ws/ {
        proxy_pass http://datasynth_rest;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 3600s;
    }

    # gRPC
    location /synth_server. {
        grpc_pass grpc://datasynth_grpc;
        grpc_read_timeout 300s;
    }
}

Envoy Proxy

static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 443
      filter_chains:
        - transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  - certificate_chain:
                      filename: /etc/ssl/certs/datasynth.pem
                    private_key:
                      filename: /etc/ssl/private/datasynth-key.pem
          filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: datasynth
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/"
                          route:
                            cluster: datasynth_rest
                            timeout: 300s
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
    - name: datasynth_rest
      connect_timeout: 5s
      type: STRICT_DNS
      load_assignment:
        cluster_name: datasynth_rest
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 3000

Health Check Configuration

For load balancers, use these health check endpoints:

EndpointPurposeExpected Response
GET /healthBasic health200 OK
GET /readyReadiness probe200 OK / 503 Unavailable
GET /liveLiveness probe200 OK
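
For scripted deployments, a readiness wait can poll GET /ready until it returns 200. A sketch assuming the third-party Python requests package (the helper name is illustrative):

import time
import requests

def wait_until_ready(base_url="http://localhost:3000", timeout=120):
    """Poll GET /ready until the service reports 200 OK or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/ready", timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass                      # server not accepting connections yet
        time.sleep(2)
    return False

if not wait_until_ready():
    raise SystemExit("DataSynth did not become ready in time")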

Use Cases

Real-world applications for SyntheticData.

Overview

Use CaseDescription
Fraud Detection MLTrain supervised fraud models
Audit AnalyticsTest audit procedures
SOX ComplianceTest control monitoring
Process MiningGenerate OCEL 2.0 event logs
ERP Load TestingLoad and stress testing

Use Case Summary

Use CaseKey FeaturesOutput Focus
Fraud DetectionAnomaly injection, graph exportLabels, graphs
Audit AnalyticsFull document flows, controlsTransactions, controls
SOX ComplianceSoD rules, approval workflowsControls, violations
Process MiningOCEL 2.0 exportEvent logs
ERP TestingHigh volume, realistic patternsRaw transactions

Quick Configuration

Fraud Detection

anomaly_injection:
  enabled: true
  total_rate: 0.02
  generate_labels: true

graph_export:
  enabled: true
  formats:
    - pytorch_geometric

Audit Analytics

document_flows:
  p2p:
    enabled: true
  o2c:
    enabled: true

internal_controls:
  enabled: true

SOX Compliance

internal_controls:
  enabled: true
  sod_rules: [...]

approval:
  enabled: true

Process Mining

document_flows:
  p2p:
    enabled: true
  o2c:
    enabled: true

# Use datasynth-ocpm for OCEL 2.0 export

ERP Testing

transactions:
  target_count: 1000000

output:
  format: csv

Selecting a Use Case

Choose Fraud Detection if:

  • Training ML/AI models
  • Building anomaly detection systems
  • Need labeled datasets

Choose Audit Analytics if:

  • Testing audit software
  • Validating analytical procedures
  • Need complete document trails

Choose SOX Compliance if:

  • Testing control monitoring systems
  • Validating SoD enforcement
  • Need control test data

Choose Process Mining if:

  • Using PM4Py, Celonis, or similar tools
  • Need OCEL 2.0 compliant logs
  • Analyzing business processes

Choose ERP Testing if:

  • Load testing financial systems
  • Performance benchmarking
  • Need high-volume realistic data

Combining Use Cases

Use cases can be combined:

# Fraud detection + audit analytics
anomaly_injection:
  enabled: true
  total_rate: 0.02
  generate_labels: true

document_flows:
  p2p:
    enabled: true
  o2c:
    enabled: true

internal_controls:
  enabled: true

graph_export:
  enabled: true

See Also

Fraud Detection ML

Train machine learning models for financial fraud detection.

Overview

SyntheticData generates labeled datasets for supervised fraud detection:

  • 20+ fraud patterns with full labels
  • Graph representations for GNN models
  • Realistic data distributions
  • Configurable fraud rates and types

Configuration

global:
  seed: 42
  industry: financial_services
  start_date: 2024-01-01
  period_months: 12

transactions:
  target_count: 100000

fraud:
  enabled: true
  fraud_rate: 0.02                   # 2% fraud rate

  types:
    split_transaction: 0.20
    duplicate_payment: 0.15
    fictitious_transaction: 0.15
    ghost_employee: 0.10
    kickback_scheme: 0.10
    revenue_manipulation: 0.10
    expense_capitalization: 0.10
    unauthorized_discount: 0.10

anomaly_injection:
  enabled: true
  total_rate: 0.02
  generate_labels: true

  categories:
    fraud: 1.0                       # Focus on fraud only

graph_export:
  enabled: true
  formats:
    - pytorch_geometric

  split:
    train: 0.7
    val: 0.15
    test: 0.15
    stratify: is_fraud

output:
  format: csv

Output Files

Tabular Data

output/
├── transactions/
│   └── journal_entries.csv
├── labels/
│   ├── anomaly_labels.csv
│   └── fraud_labels.csv
└── master_data/
    └── ...

Graph Data

output/graphs/transaction_network/pytorch_geometric/
├── node_features.pt
├── edge_index.pt
├── edge_attr.pt
├── labels.pt
├── train_mask.pt
├── val_mask.pt
└── test_mask.pt

ML Pipeline

1. Load Data

import pandas as pd
import torch

# Load tabular data
entries = pd.read_csv('output/transactions/journal_entries.csv')
labels = pd.read_csv('output/labels/fraud_labels.csv')

# Merge
data = entries.merge(labels, on='document_id', how='left')
data['is_fraud'] = data['fraud_type'].notna()

print(f"Total entries: {len(data)}")
print(f"Fraud entries: {data['is_fraud'].sum()}")
print(f"Fraud rate: {data['is_fraud'].mean():.2%}")

2. Feature Engineering

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numerical features
numerical_features = [
    'debit_amount', 'credit_amount', 'line_count'
]

# Derived features
data['log_amount'] = np.log1p(data['debit_amount'] + data['credit_amount'])
data['is_round'] = (data['debit_amount'] % 100 == 0).astype(int)
data['is_weekend'] = (pd.to_datetime(data['posting_date']).dt.dayofweek >= 5).astype(int)
data['is_month_end'] = (pd.to_datetime(data['posting_date']).dt.day >= 28).astype(int)

derived_features = ['log_amount', 'is_round', 'is_weekend', 'is_month_end']

# Categorical features
categorical_features = ['source', 'business_process', 'company_code']

3. Train Model (Tabular)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Prepare features
X = data[numerical_features + derived_features]
y = data['is_fraud']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")

4. Train GNN Model

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.data import Data

# Load graph data
node_features = torch.load('output/graphs/.../node_features.pt')
edge_index = torch.load('output/graphs/.../edge_index.pt')
labels = torch.load('output/graphs/.../labels.pt')
train_mask = torch.load('output/graphs/.../train_mask.pt')
val_mask = torch.load('output/graphs/.../val_mask.pt')
test_mask = torch.load('output/graphs/.../test_mask.pt')

data = Data(
    x=node_features,
    edge_index=edge_index,
    y=labels,
    train_mask=train_mask,
    val_mask=val_mask,
    test_mask=test_mask,
)

# Define GNN
class FraudGNN(torch.nn.Module):
    def __init__(self, num_features, hidden_channels):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.linear = torch.nn.Linear(hidden_channels, 2)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index).relu()
        x = self.linear(x)
        return x

# Train
model = FraudGNN(data.num_features, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    # Validation
    if epoch % 10 == 0:
        model.eval()
        pred = out.argmax(dim=1)
        val_acc = (pred[data.val_mask] == data.y[data.val_mask]).float().mean()
        print(f'Epoch {epoch}: Val Acc: {val_acc:.4f}')

Fraud Types for Training

TypeDetection ApproachDifficulty
Split TransactionAmount patternsEasy
Duplicate PaymentSimilarity matchingEasy
Fictitious TransactionAnomaly detectionMedium
Ghost EmployeeEntity verificationMedium
Kickback SchemeRelationship analysisHard
Revenue ManipulationTrend analysisHard
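
Because the labels carry the fraud type, it is worth evaluating recall per type on the held-out set. A sketch reusing the variables from steps 1 and 3 above (data, X_test, y_pred):

# Per-fraud-type recall on the held-out set
test_frame = data.loc[X_test.index].copy()
test_frame['predicted'] = y_pred

per_type_recall = (
    test_frame[test_frame['is_fraud']]
    .groupby('fraud_type')['predicted']
    .mean()                      # fraction of frauds of each type that were flagged
    .sort_values()
)
print(per_type_recall)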

Best Practices

Class Imbalance

from imblearn.over_sampling import SMOTE

# Handle imbalanced classes
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

Threshold Tuning

from sklearn.metrics import precision_recall_curve

# Find the threshold that maximizes F1 (drop the last point, which has no threshold)
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
optimal_threshold = thresholds[f1_scores.argmax()]

Cross-Validation

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV ROC-AUC: {scores.mean():.4f} (+/- {scores.std():.4f})")

See Also

Audit Analytics

Test audit procedures and analytical tools with realistic data.

Overview

SyntheticData generates comprehensive datasets for audit analytics:

  • Complete document trails
  • Known control exceptions
  • Benford’s Law compliant amounts
  • Realistic temporal patterns

Configuration

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12

transactions:
  target_count: 100000

  benford:
    enabled: true                    # Realistic first-digit distribution

  temporal:
    month_end_spike: 2.5
    quarter_end_spike: 3.0
    year_end_spike: 4.0

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.35
    three_way_match:
      quantity_tolerance: 0.02
      price_tolerance: 0.01
  o2c:
    enabled: true
    flow_rate: 0.35

master_data:
  vendors:
    count: 200
  customers:
    count: 500

internal_controls:
  enabled: true

anomaly_injection:
  enabled: true
  total_rate: 0.03
  generate_labels: true

  categories:
    fraud: 0.20
    error: 0.50
    process_issue: 0.30

output:
  format: csv

Audit Procedures

1. Benford’s Law Analysis

Test first-digit distribution of amounts:

import pandas as pd
import numpy as np
from scipy import stats

# Load data
entries = pd.read_csv('output/transactions/journal_entries.csv')

# Extract first digits
amounts = entries['debit_amount'] + entries['credit_amount']
amounts = amounts[amounts >= 1]                       # first digit must be 1-9
first_digits = amounts.apply(lambda x: int(str(int(x))[0]))

# Calculate observed distribution (all digits 1-9, in order)
observed = first_digits.value_counts().reindex(range(1, 10), fill_value=0)
observed_freq = observed / observed.sum()

# Expected Benford distribution
benford = {d: np.log10(1 + 1/d) for d in range(1, 10)}

# Chi-square test
chi_stat, p_value = stats.chisquare(
    observed.values,
    [benford[d] * observed.sum() for d in range(1, 10)]
)

print(f"Chi-square: {chi_stat:.2f}, p-value: {p_value:.4f}")

2. Three-Way Match Testing

Verify PO, GR, and Invoice alignment:

# Load documents
po = pd.read_csv('output/documents/purchase_orders.csv')
gr = pd.read_csv('output/documents/goods_receipts.csv')
inv = pd.read_csv('output/documents/vendor_invoices.csv')

# Join on references
matched = po.merge(gr, left_on='po_number', right_on='po_reference')
matched = matched.merge(inv, left_on='po_number', right_on='po_reference')

# Calculate variances
matched['qty_variance'] = abs(matched['gr_quantity'] - matched['po_quantity']) / matched['po_quantity']
matched['price_variance'] = abs(matched['inv_unit_price'] - matched['po_unit_price']) / matched['po_unit_price']

# Identify exceptions
qty_exceptions = matched[matched['qty_variance'] > 0.02]
price_exceptions = matched[matched['price_variance'] > 0.01]

print(f"Quantity exceptions: {len(qty_exceptions)}")
print(f"Price exceptions: {len(price_exceptions)}")

3. Duplicate Payment Detection

Find potential duplicate payments:

# Load payments and invoices
payments = pd.read_csv('output/documents/payments.csv')
invoices = pd.read_csv('output/documents/vendor_invoices.csv')

# Group by vendor and amount
potential_dups = invoices.groupby(['vendor_id', 'total_amount']).filter(
    lambda x: len(x) > 1
)

# Check payment dates
duplicates = []
for (vendor, amount), group in potential_dups.groupby(['vendor_id', 'total_amount']):
    if len(group) > 1:
        duplicates.append({
            'vendor': vendor,
            'amount': amount,
            'count': len(group),
            'invoices': group['invoice_number'].tolist()
        })

print(f"Potential duplicate payments: {len(duplicates)}")

4. Journal Entry Testing

Analyze manual journal entries:

# Load entries
entries = pd.read_csv('output/transactions/journal_entries.csv')

# Filter manual entries
manual = entries[entries['source'] == 'Manual']

# Analyze characteristics
print(f"Manual entries: {len(manual)}")
print(f"Weekend entries: {manual['is_weekend'].sum()}")
print(f"Month-end entries: {manual['is_month_end'].sum()}")

# Top accounts with manual entries
top_accounts = manual.groupby('account_number').size().sort_values(ascending=False).head(10)

5. Cutoff Testing

Verify transactions recorded in correct period:

# Identify late postings
entries['posting_date'] = pd.to_datetime(entries['posting_date'])
entries['document_date'] = pd.to_datetime(entries['document_date'])
entries['posting_lag'] = (entries['posting_date'] - entries['document_date']).dt.days

# Find entries posted after period end
late_postings = entries[entries['posting_lag'] > 5]
print(f"Late postings: {len(late_postings)}")

# Check year-end cutoff: documents dated in one year but posted in the next
cutoff_issues = entries[
    entries['posting_date'].dt.year > entries['document_date'].dt.year
]
print(f"Cross-year postings: {len(cutoff_issues)}")

6. Segregation of Duties

Check for SoD violations:

# Load controls data
sod_rules = pd.read_csv('output/controls/sod_rules.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')

# Find entries with SoD violations
violations = entries[entries['sod_violation'] == True]
print(f"SoD violations: {len(violations)}")

# Analyze by conflict type
violation_types = violations.groupby('sod_conflict_type').size()

Audit Analytics Dashboard

Key Metrics

MetricQueryExpected
Benford Chi-squareFirst-digit test< 15.51 (p > 0.05)
Match exceptionsThree-way match< 2%
Duplicate indicatorsAmount/vendor matching< 0.5%
Late postingsDocument vs posting date< 1%
SoD violationsControl violationsKnown from labels

Population Statistics

# Summary statistics
print("=== Audit Population Summary ===")
print(f"Total transactions: {len(entries):,}")
print(f"Total amount: ${entries['debit_amount'].sum():,.2f}")
print(f"Unique vendors: {entries['vendor_id'].nunique()}")
print(f"Unique customers: {entries['customer_id'].nunique()}")
print(f"Date range: {entries['posting_date'].min()} to {entries['posting_date'].max()}")

7. Financial Statement Analytics (v0.6.0)

Analyze generated financial statements for consistency, trend analysis, and ratio testing:

import pandas as pd

# Load financial statements
balance_sheet = pd.read_csv('output/financial_reporting/balance_sheet.csv')
income_stmt = pd.read_csv('output/financial_reporting/income_statement.csv')
cash_flow = pd.read_csv('output/financial_reporting/cash_flow.csv')

# Verify accounting equation holds
for _, row in balance_sheet.iterrows():
    assets = row['total_assets']
    liabilities = row['total_liabilities']
    equity = row['total_equity']
    imbalance = abs(assets - (liabilities + equity))
    assert imbalance < 0.01, f"A=L+E violation: {imbalance}"

# Analytical procedures: ratio analysis
ratios = pd.DataFrame({
    'period': balance_sheet['period'],
    'current_ratio': balance_sheet['current_assets'] / balance_sheet['current_liabilities'],
    'gross_margin': income_stmt['gross_profit'] / income_stmt['revenue'],
    'debt_to_equity': balance_sheet['total_liabilities'] / balance_sheet['total_equity'],
})

# Flag unusual ratio movements (> 2 std devs from mean)
for col in ['current_ratio', 'gross_margin', 'debt_to_equity']:
    mean = ratios[col].mean()
    std = ratios[col].std()
    outliers = ratios[abs(ratios[col] - mean) > 2 * std]
    if len(outliers) > 0:
        print(f"Unusual {col} in periods: {outliers['period'].tolist()}")

Budget Variance Analysis

When budgets are enabled, compare budget to actual for each account:

# Load budget vs actual data
budget = pd.read_csv('output/financial_reporting/budget_vs_actual.csv')

# Calculate variance percentage
budget['variance_pct'] = (budget['actual'] - budget['budget']) / budget['budget']

# Identify material variances (> 10%)
material = budget[abs(budget['variance_pct']) > 0.10]
print(f"Material variances: {len(material)} accounts")
print(material[['account', 'budget', 'actual', 'variance_pct']].to_string())

# Favorable vs unfavorable analysis
favorable = budget[
    ((budget['account_type'] == 'revenue') & (budget['variance_pct'] > 0)) |
    ((budget['account_type'] == 'expense') & (budget['variance_pct'] < 0))
]
print(f"Favorable variances: {len(favorable)}")

Management KPI Trend Analysis

# Load KPI data
kpis = pd.read_csv('output/financial_reporting/management_kpis.csv')

# Check for declining trends
for kpi_name in kpis['kpi_name'].unique():
    series = kpis[kpis['kpi_name'] == kpi_name].sort_values('period')
    values = series['value'].values
    # Simple trend check: are last 3 periods declining?
    if len(values) >= 3 and all(values[-3+i] > values[-3+i+1] for i in range(2)):
        print(f"Declining trend: {kpi_name}")

Payroll Audit Testing (v0.6.0)

When the HR module is enabled, test payroll data for anomalies:

# Load payroll data
payroll = pd.read_csv('output/hr/payroll_entries.csv')

# Ghost employee check: employees with pay but no time entries
time_entries = pd.read_csv('output/hr/time_entries.csv')
paid_employees = set(payroll['employee_id'].unique())
active_employees = set(time_entries['employee_id'].unique())
no_time = paid_employees - active_employees
print(f"Employees paid without time entries: {len(no_time)}")

# Payroll amount reasonableness
payroll_summary = payroll.groupby('employee_id')['gross_pay'].sum()
mean_pay = payroll_summary.mean()
std_pay = payroll_summary.std()
outliers = payroll_summary[payroll_summary > mean_pay + 3 * std_pay]
print(f"Unusually high total pay: {len(outliers)} employees")

# Expense policy violation detection
expenses = pd.read_csv('output/hr/expense_reports.csv')
violations = expenses[expenses['policy_violation'] == True]
print(f"Expense policy violations: {len(violations)}")

Sampling

Statistical Sampling

from scipy import stats

# Calculate sample size for attribute testing
population_size = len(entries)
confidence_level = 0.95
tolerable_error_rate = 0.05
expected_error_rate = 0.01

# Sample size formula
z_score = stats.norm.ppf(1 - (1 - confidence_level) / 2)
sample_size = int(
    (z_score ** 2 * expected_error_rate * (1 - expected_error_rate)) /
    (tolerable_error_rate ** 2)
)

print(f"Recommended sample size: {sample_size}")

# Random sample
sample = entries.sample(n=sample_size, random_state=42)

Stratified Sampling

# Stratify by amount
entries['amount_stratum'] = pd.qcut(
    entries['debit_amount'] + entries['credit_amount'],
    q=5,
    labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
)

# Sample from each stratum
stratified_sample = entries.groupby('amount_stratum').apply(
    lambda x: x.sample(n=min(100, len(x)), random_state=42)
)

See Also

SOX Compliance Testing

Test internal control monitoring systems.

Overview

SyntheticData generates data for SOX 404 compliance testing:

  • Internal control definitions
  • Control test evidence
  • Segregation of Duties violations
  • Approval workflow data

Configuration

global:
  seed: 42
  industry: financial_services
  start_date: 2024-01-01
  period_months: 12

transactions:
  target_count: 50000

internal_controls:
  enabled: true

  controls:
    - id: "CTL-001"
      name: "Payment Authorization"
      type: preventive
      frequency: continuous
      threshold: 10000
      assertions: [authorization, validity]

    - id: "CTL-002"
      name: "Journal Entry Review"
      type: detective
      frequency: daily
      assertions: [accuracy, completeness]

    - id: "CTL-003"
      name: "Bank Reconciliation"
      type: detective
      frequency: monthly
      assertions: [existence, completeness]

  sod_rules:
    - conflict_type: create_approve
      processes: [ap_invoice, ap_payment]
      description: "Cannot create and approve payments"

    - conflict_type: create_approve
      processes: [ar_invoice, ar_receipt]
      description: "Cannot create and approve receipts"

    - conflict_type: custody_recording
      processes: [cash_handling, cash_recording]
      description: "Cannot handle and record cash"

approval:
  enabled: true
  thresholds:
    - level: 1
      max_amount: 5000
    - level: 2
      max_amount: 25000
    - level: 3
      max_amount: 100000
    - level: 4
      max_amount: null

fraud:
  enabled: true
  fraud_rate: 0.005

  types:
    skipped_approval: 0.30
    threshold_manipulation: 0.30
    unauthorized_discount: 0.20
    duplicate_payment: 0.20

output:
  format: csv

Control Testing

1. Control Evidence

import pandas as pd

# Load control data
controls = pd.read_csv('output/controls/internal_controls.csv')
mappings = pd.read_csv('output/controls/control_account_mappings.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')

# Identify entries subject to each control
for _, control in controls.iterrows():
    control_id = control['control_id']
    threshold = control['threshold']

    # Filter entries in scope
    if pd.notna(threshold):
        in_scope = entries[
            (entries['control_ids'].str.contains(control_id)) &
            (entries['debit_amount'] >= threshold)
        ]
    else:
        in_scope = entries[entries['control_ids'].str.contains(control_id)]

    print(f"{control['name']}: {len(in_scope)} entries in scope")

2. Approval Testing

# Load entries with approval data
entries = pd.read_csv('output/transactions/journal_entries.csv')

# Test approval compliance
approval_required = entries[entries['debit_amount'] >= 5000]
approved = approval_required[approval_required['approved_by'].notna()]
not_approved = approval_required[approval_required['approved_by'].isna()]

print(f"Requiring approval: {len(approval_required)}")
print(f"Properly approved: {len(approved)}")
print(f"Missing approval: {len(not_approved)}")

# Test approval levels
def check_approval_level(row):
    amount = row['debit_amount']
    if amount >= 100000:
        return row['approval_level'] >= 4
    elif amount >= 25000:
        return row['approval_level'] >= 3
    elif amount >= 5000:
        return row['approval_level'] >= 2
    return True

entries['approval_adequate'] = entries.apply(check_approval_level, axis=1)
inadequate = entries[~entries['approval_adequate']]
print(f"Inadequate approval level: {len(inadequate)}")

3. Segregation of Duties

# Load SoD data
sod_rules = pd.read_csv('output/controls/sod_rules.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')

# Identify violations
violations = entries[entries['sod_violation'] == True]
print(f"Total SoD violations: {len(violations)}")

# Analyze by type
violation_summary = violations.groupby('sod_conflict_type').agg({
    'document_id': 'count',
    'debit_amount': 'sum'
}).rename(columns={'document_id': 'count', 'debit_amount': 'total_amount'})

print("\nViolations by type:")
print(violation_summary)

# Analyze by user
user_violations = violations.groupby('created_by').size().sort_values(ascending=False)
print("\nTop violators:")
print(user_violations.head(10))

4. Threshold Manipulation

# Detect threshold-adjacent transactions
approval_threshold = 10000

entries['near_threshold'] = (
    (entries['debit_amount'] >= approval_threshold * 0.9) &
    (entries['debit_amount'] < approval_threshold)
)

near_threshold = entries[entries['near_threshold']]
print(f"Near-threshold entries: {len(near_threshold)}")

# Naive statistical check: assume ~10% of amounts would fall in the 90-100% band by chance
expected_near = len(entries) * 0.10
chi_stat = ((len(near_threshold) - expected_near) ** 2) / expected_near
print(f"Chi-square statistic: {chi_stat:.2f}")

Control Matrix

Generate RACM

# Risk and Control Matrix
controls = pd.read_csv('output/controls/internal_controls.csv')
mappings = pd.read_csv('output/controls/control_account_mappings.csv')

racm = controls.merge(mappings, on='control_id')
racm = racm[[
    'control_id', 'name', 'control_type', 'frequency',
    'account_number', 'assertions'
]]

# Add testing results
racm['population'] = racm['account_number'].apply(
    lambda x: len(entries[entries['account_number'] == x])
)
racm['exceptions'] = racm['control_id'].apply(
    lambda x: len(entries[
        (entries['control_ids'].str.contains(x)) &
        (entries['is_anomaly'] == True)
    ])
)
racm['exception_rate'] = racm['exceptions'] / racm['population']

print(racm)

Test Documentation

Control Test Template

def document_control_test(control_id, entries, sample_size=25):
    """Generate control test documentation."""
    control = controls[controls['control_id'] == control_id].iloc[0]

    # Get population
    population = entries[entries['control_ids'].str.contains(control_id)]

    # Sample
    sample = population.sample(n=min(sample_size, len(population)), random_state=42)

    # Test results
    exceptions = sample[sample['is_anomaly'] == True]

    return {
        'control_id': control_id,
        'control_name': control['name'],
        'control_type': control['control_type'],
        'frequency': control['frequency'],
        'population_size': len(population),
        'sample_size': len(sample),
        'exceptions_found': len(exceptions),
        'exception_rate': len(exceptions) / len(sample),
        'conclusion': 'Effective' if len(exceptions) == 0 else 'Exception Noted'
    }

# Test all controls
results = []
for control_id in controls['control_id']:
    result = document_control_test(control_id, entries)
    results.append(result)

test_results = pd.DataFrame(results)
test_results.to_csv('control_test_results.csv', index=False)

Deficiency Assessment

# Classify deficiencies
def assess_deficiency(exception_rate, amount_impact):
    if exception_rate > 0.10 or amount_impact > 1000000:
        return 'Material Weakness'
    elif exception_rate > 0.05 or amount_impact > 100000:
        return 'Significant Deficiency'
    elif exception_rate > 0:
        return 'Control Deficiency'
    return 'No Deficiency'

test_results['amount_impact'] = test_results['control_id'].apply(
    lambda x: entries[
        (entries['control_ids'].str.contains(x)) &
        (entries['is_anomaly'] == True)
    ]['debit_amount'].sum()
)

test_results['deficiency_classification'] = test_results.apply(
    lambda x: assess_deficiency(x['exception_rate'], x['amount_impact']),
    axis=1
)

print(test_results[['control_name', 'exception_rate', 'deficiency_classification']])

See Also

Process Mining

Generate OCEL 2.0 event logs for process mining analysis across 8 enterprise process families.

Overview

SyntheticData generates comprehensive process mining data:

  • OCEL 2.0 compliant event logs with 88 activity types and 52 object types
  • 8 process families: P2P, O2C, S2C, H2R, MFG, BANK, AUDIT, Bank Recon
  • Object-centric relationships with lifecycle states
  • Three variant types per generator: HappyPath (75%), ExceptionPath (20%), ErrorPath (5%)
  • Cross-process object linking via shared document IDs

Configuration

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 6

transactions:
  target_count: 50000

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.4
    completion_rate: 0.95

    stages:
      po_approval_rate: 0.9
      gr_rate: 0.98
      invoice_rate: 0.95
      payment_rate: 0.92

  o2c:
    enabled: true
    flow_rate: 0.4
    completion_rate: 0.90

    stages:
      so_approval_rate: 0.95
      credit_check_pass_rate: 0.9
      delivery_rate: 0.98
      invoice_rate: 0.95
      collection_rate: 0.85

master_data:
  vendors:
    count: 100
  customers:
    count: 200
  materials:
    count: 500
  employees:
    count: 30

output:
  format: csv

OCEL 2.0 Export

Use the datasynth-ocpm crate for OCEL 2.0 export:

#![allow(unused)]
fn main() {
use synth_ocpm::{OcpmGenerator, Ocel2Exporter, ExportFormat};

let mut generator = OcpmGenerator::new(seed);
let event_log = generator.generate_event_log(
    5000,        // p2p_count
    5000,        // o2c_count
    start_date,
    end_date,
)?;

let exporter = Ocel2Exporter::new(ExportFormat::Json);
exporter.export(&event_log, "output/ocel2.json")?;
}

P2P Process

Event Sequence

Create PO → Approve PO → Release PO → Create GR → Post GR →
Receive Invoice → Verify Invoice → Post Invoice → Execute Payment

Objects

Object Type | Attributes
PurchaseOrder | po_number, vendor_id, total_amount
GoodsReceipt | gr_number, po_reference, quantity
VendorInvoice | invoice_number, amount, due_date
Payment | payment_number, amount, bank_ref
Material | material_id, description
Vendor | vendor_id, name

Object Relationships

PurchaseOrder ─┬── contains ──→ Material
               └── from ──────→ Vendor

GoodsReceipt ──── for ──────→ PurchaseOrder

VendorInvoice ─┬── for ──────→ PurchaseOrder
               └── matches ──→ GoodsReceipt

Payment ───────── pays ──────→ VendorInvoice

O2C Process

Event Sequence

Create SO → Check Credit → Release SO → Create Delivery →
Pick → Pack → Ship → Create Invoice → Post Invoice → Receive Payment

Objects

Object Type | Attributes
SalesOrder | so_number, customer_id, total_amount
Delivery | delivery_number, so_reference
CustomerInvoice | invoice_number, amount, due_date
CustomerPayment | receipt_number, amount
Material | material_id, description
Customer | customer_id, name

Analysis with PM4Py

Load Event Log

from pm4py.objects.ocel.importer import jsonocel

# Load OCEL 2.0
ocel = jsonocel.apply("output/ocel2.json")

print(f"Events: {len(ocel.events)}")
print(f"Objects: {len(ocel.objects)}")
print(f"Object types: {ocel.object_types}")

Process Discovery

from pm4py.algo.discovery.ocel import algorithm as ocel_discovery

# Discover object-centric Petri net
ocpn = ocel_discovery.apply(ocel)

# Visualize
from pm4py.visualization.ocel.ocpn import visualizer
gviz = visualizer.apply(ocpn)
visualizer.save(gviz, "ocpn.png")

Object Lifecycle Analysis

from pm4py.statistics.ocel import object_lifecycle

# Analyze PurchaseOrder lifecycle
po_lifecycle = object_lifecycle.get_lifecycle_summary(
    ocel,
    object_type="PurchaseOrder"
)

print("Purchase Order Lifecycle:")
print(f"  Average duration: {po_lifecycle['avg_duration']}")
print(f"  Completion rate: {po_lifecycle['completion_rate']:.2%}")

Conformance Checking

from pm4py.algo.conformance.ocel import algorithm as ocel_conformance

# Check conformance against expected model
results = ocel_conformance.apply(ocel, ocpn)

print(f"Conformant cases: {results['conformant']}")
print(f"Non-conformant: {results['non_conformant']}")

Process Metrics

Throughput Time

import pandas as pd
from datetime import timedelta

# Load events
events = pd.DataFrame(ocel.events)

# Calculate case durations
case_durations = events.groupby('case_id').agg({
    'timestamp': ['min', 'max']
})
case_durations['duration'] = (
    case_durations[('timestamp', 'max')] -
    case_durations[('timestamp', 'min')]
)

print(f"Mean throughput time: {case_durations['duration'].mean()}")
print(f"Median throughput time: {case_durations['duration'].median()}")

Activity Frequency

# Count activity occurrences
activity_counts = events['activity'].value_counts()
print("Activity frequency:")
print(activity_counts)

Bottleneck Analysis

# Calculate waiting times between activities
events = events.sort_values(['case_id', 'timestamp'])
events['wait_time'] = events.groupby('case_id')['timestamp'].diff()

# Find bottlenecks
bottlenecks = events.groupby('activity')['wait_time'].mean().sort_values(ascending=False)
print("Bottleneck activities:")
print(bottlenecks.head(5))

Variant Analysis

from pm4py.algo.discovery.ocel import variants

# Get process variants
variant_stats = variants.get_variants_statistics(ocel)

print(f"Unique variants: {len(variant_stats)}")
print("\nTop variants:")
for variant, stats in sorted(variant_stats.items(), key=lambda x: -x[1]['count'])[:5]:
    print(f"  {variant}: {stats['count']} cases")

Integration with Tools

Celonis

# Export to Celonis format
from pm4py.objects.ocel.exporter import csv as ocel_csv_exporter

ocel_csv_exporter.apply(ocel, "output/celonis/")
# Upload CSV files to Celonis

OCPA

# Export to OCPA format
from pm4py.objects.ocel.exporter import sqlite

sqlite.apply(ocel, "output/ocel.sqlite")
# Open in OCPA tool

New Process Families (v0.6.2)

S2C — Source-to-Contract

Create Sourcing Project → Qualify Supplier → Publish RFx →
Submit Bid → Evaluate Bids → Award Contract →
Activate Contract → Complete Sourcing

H2R — Hire-to-Retire

Submit Time Entry → Approve Time Entry →
Create Payroll Run → Calculate Payroll → Approve Payroll → Post Payroll
Submit Expense → Approve Expense

MFG — Manufacturing

Create Production Order → Release → Start Operation →
Complete Operation → Quality Inspection → Confirm Production →
Close Production Order

BANK — Banking Operations

Onboard Customer → KYC Review → Open Account →
Execute Transaction → Authorize → Complete Transaction

AUDIT — Audit Engagement Lifecycle

Create Engagement → Plan → Assess Risk → Create Workpaper →
Collect Evidence → Review Workpaper → Raise Finding →
Remediate Finding → Record Judgment → Complete Engagement

Bank Recon — Bank Reconciliation

Import Bank Statement → Auto Match Items → Manual Match Item →
Create Reconciling Item → Resolve Exception →
Approve Reconciliation → Post Entries → Complete Reconciliation

S2P Process Mining

The full Source-to-Pay chain provides rich process mining opportunities beyond basic P2P:

Extended Event Sequence

Spend Analysis → Supplier Qualification → RFx Published →
Bid Received → Bid Evaluation → Contract Award →
Create PO → Approve PO → Release PO →
Create GR → Post GR →
Receive Invoice → Verify Invoice (Three-Way Match) → Post Invoice →
Schedule Payment → Execute Payment

Extended Object Types

Object Type | Attributes
SpendCategory | category_code, total_spend, vendor_count
SourcingProject | project_type, target_savings, status
SupplierBid | vendor_id, bid_amount, technical_score
ProcurementContract | contract_value, validity_period, terms
PurchaseRequisition | requester, catalog_item, urgency
PurchaseOrder | po_type, vendor_id, total_amount
GoodsReceipt | gr_number, received_qty, movement_type
VendorInvoice | invoice_amount, match_status, due_date
Payment | payment_method, cleared_amount, bank_ref

Cycle Time Analysis

# Analyze end-to-end procurement cycle times
po_events = events[events['object_type'] == 'PurchaseOrder']

# PO creation to payment completion
cycle_times = po_events.groupby('case_id')['timestamp'].agg(['min', 'max'])
cycle_times['cycle_time'] = cycle_times['max'] - cycle_times['min']

# Segment by PO type (objects is the OCEL objects table; assumes case_id
# corresponds to the PurchaseOrder object_id)
cycle_by_type = (
    cycle_times.reset_index()
    .merge(objects[['object_id', 'po_type']], left_on='case_id', right_on='object_id')
    .groupby('po_type')['cycle_time']
    .describe()
)

Three-Way Match Conformance

# Identify invoices that failed three-way match
match_events = events[events['activity'] == 'Verify Invoice']
blocked = match_events[match_events['match_status'] == 'blocked']

print(f"Three-way match block rate: {len(blocked)/len(match_events):.1%}")
print(f"Most common variance: {blocked['variance_type'].mode()[0]}")


AML/KYC Testing

Generate realistic banking transaction data with KYC profiles and AML typologies for compliance testing and fraud detection model development.

Overview

The datasynth-banking module generates synthetic banking data designed for:

  • AML System Testing: Validate transaction monitoring rules against known patterns
  • KYC Process Testing: Test customer onboarding and risk assessment workflows
  • ML Model Training: Train supervised models with labeled fraud typologies
  • Scenario Analysis: Test detection capabilities against specific attack patterns

KYC Profile Generation

Customer Types

Type | Description | Typical Characteristics
Retail | Individual customers | Salary deposits, consumer spending
Business | Small to medium businesses | Payroll, supplier payments
Trust | Trust accounts, complex structures | Investment flows, distributions

KYC Profile Components

Each customer has a KYC profile defining expected behavior:

kyc_profile:
  declared_turnover: 50000        # Expected monthly volume
  transaction_frequency: 25       # Expected transactions/month
  source_of_funds: "employment"   # Declared income source
  geographic_exposure: ["US", "EU"]
  cash_intensity: 0.05            # Expected cash ratio
  beneficial_owner_complexity: 1  # Ownership layers

Risk Scoring

Customers are assigned risk scores based on the following factors (a toy scoring sketch follows the list):

  • Geographic exposure (high-risk jurisdictions)
  • Industry sector
  • Transaction patterns vs. declared profile
  • Beneficial ownership complexity
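A toy illustration of how such a score might be composed (the weights and factor names below are invented for illustration and are not the generator's actual scoring model):

def toy_risk_score(geo_risk, industry_risk, profile_deviation, ownership_layers):
    """Combine the factors above into a 0-100 score (illustrative weights)."""
    return round(
        35 * geo_risk              # exposure to high-risk jurisdictions, 0..1
        + 25 * industry_risk       # sector risk, 0..1
        + 25 * profile_deviation   # deviation from the declared KYC profile, 0..1
        + 15 * min(ownership_layers / 5, 1.0)
    )

print(toy_risk_score(0.2, 0.4, 0.1, 1))  # e.g. a low-risk retail customer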

AML Typology Generation

Structuring

Breaking large transactions into smaller amounts to avoid reporting thresholds.

Detection Signatures:
- Multiple transactions just below $10,000 threshold
- Same-day deposits across multiple branches
- Round-number amounts (e.g., $9,900, $9,800)

Configuration:

typologies:
  structuring:
    enabled: true
    rate: 0.001
    threshold: 10000
    margin: 500
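A minimal pandas sketch of a monitoring rule keyed to these signatures (column names follow the transaction record layout shown under Generated Data below; the same-day grouping and the hit count of three are illustrative choices):

import pandas as pd

txns = pd.read_csv("banking/bank_transactions.csv", parse_dates=["timestamp"])

# Cash deposits just below the reporting threshold (threshold and margin as configured above)
near_threshold = txns[
    (txns["category"] == "cash_deposit")
    & (txns["amount"].between(10000 - 500, 10000, inclusive="left"))
]

# Flag accounts making several such deposits on the same day
hits = (
    near_threshold
    .groupby(["account_id", near_threshold["timestamp"].dt.date])
    .size()
    .loc[lambda s: s >= 3]
)
print(hits)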

Funnel Accounts

Concentrating funds from multiple sources before moving them on to a destination account.

Pattern:
Source A ─┐
Source B ─┼─▶ Funnel Account ─▶ Destination
Source C ─┘

Detection Signatures:
- Many small inbound, few large outbound
- High throughput relative to account balance
- Short holding periods

Layering

Complex chains of transactions to obscure fund origins.

Pattern:
Origin ─▶ Shell A ─▶ Shell B ─▶ Shell C ─▶ Destination
                          └─▶ Mixing ─┘

Detection Signatures:
- Rapid consecutive transfers
- Circular transaction patterns
- Cross-border routing through multiple jurisdictions

Money Mule Networks

Using recruited individuals to move illicit funds.

Pattern:
Fraudster ─▶ Mule 1 ─▶ Cash Withdrawal
           ─▶ Mule 2 ─▶ Wire Transfer
           ─▶ Mule 3 ─▶ Crypto Exchange

Detection Signatures:
- New accounts with sudden high volume
- Immediate outbound after inbound
- Multiple accounts with similar patterns

Round-Tripping

Moving funds in circular patterns to create apparent legitimacy.

Pattern:
Company A ─▶ Offshore ─▶ Company A (as "investment")

Detection Signatures:
- Funds return to origin within short period
- Offshore intermediaries
- Inflated invoicing

Fraud Patterns

Credit card fraud and synthetic identity patterns.

Patterns:
- Card testing (small amounts across merchants)
- Account takeover (changed behavior profile)
- Synthetic identity (blended PII attributes)

Generated Data

Output Files

banking/
├── banking_customers.csv        # Customer profiles with KYC data
├── bank_accounts.csv            # Account records with features
├── bank_transactions.csv        # Transaction records
├── kyc_profiles.csv             # Expected activity envelopes
├── counterparties.csv           # Counterparty pool
├── aml_typology_labels.csv      # Ground truth typology labels
├── entity_risk_labels.csv       # Entity-level risk classifications
└── transaction_risk_labels.csv  # Transaction-level classifications

Customer Record

customer_id,customer_type,name,created_at,risk_score,kyc_status,pep_flag,sanctions_flag
CUST001,retail,John Smith,2024-01-15,25,verified,false,false
CUST002,business,Acme Corp,2024-02-01,65,enhanced_due_diligence,false,false

Transaction Record

transaction_id,account_id,timestamp,amount,currency,direction,channel,category,counterparty_id
TXN001,ACC001,2024-03-15T10:30:00Z,9800.00,USD,credit,branch,cash_deposit,
TXN002,ACC001,2024-03-15T11:45:00Z,9750.00,USD,credit,branch,cash_deposit,

Typology Label

transaction_id,typology,confidence,pattern_id,related_transactions
TXN001,structuring,0.95,STRUCT_001,"TXN001,TXN002,TXN003"
TXN002,structuring,0.95,STRUCT_001,"TXN001,TXN002,TXN003"

Configuration

Basic Banking Setup

banking:
  enabled: true
  customers:
    retail: 5000
    business: 500
    trust: 50

  transactions:
    target_count: 500000
    date_range:
      start: 2024-01-01
      end: 2024-12-31

  typologies:
    structuring:
      enabled: true
      rate: 0.002
    funnel:
      enabled: true
      rate: 0.001
    layering:
      enabled: true
      rate: 0.0005
    mule:
      enabled: true
      rate: 0.001
    fraud:
      enabled: true
      rate: 0.005

  labels:
    generate: true
    include_confidence: true
    include_related: true

Adversarial Testing

Generate transactions designed to evade detection:

banking:
  typologies:
    spoofing:
      enabled: true
      strategies:
        - threshold_aware        # Varies amounts around thresholds
        - temporal_distribution  # Spreads over time windows
        - channel_mixing         # Uses multiple channels

Use Cases

Transaction Monitoring Rule Testing

# Generate data with known structuring patterns
datasynth-data generate --config banking_structuring.yaml --output ./test_data

# Expected results:
# - 0.2% of transactions should trigger structuring alerts
# - Labels in aml_typology_labels.csv for validation
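A quick way to verify the injected rate against the ground-truth labels (a sketch; paths follow the output layout described under Generated Data above):

import pandas as pd

labels = pd.read_csv("test_data/banking/aml_typology_labels.csv")
txns = pd.read_csv("test_data/banking/bank_transactions.csv")

structuring = labels[labels["typology"] == "structuring"]
rate = len(structuring) / len(txns)
print(f"Structuring label rate: {rate:.2%}")  # expect roughly the configured 0.2%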

ML Model Training

import pandas as pd
from sklearn.model_selection import train_test_split

# Load transactions and labels
transactions = pd.read_csv("banking/bank_transactions.csv")
labels = pd.read_csv("banking/aml_typology_labels.csv")

# Merge and prepare features
data = transactions.merge(labels, on="transaction_id", how="left")
data["is_suspicious"] = data["typology"].notna()

# Select model features (illustrative; extend with engineered features)
features = ["amount"]

# Split for training
X_train, X_test, y_train, y_test = train_test_split(
    data[features],
    data["is_suspicious"],
    test_size=0.2,
    stratify=data["is_suspicious"],
    random_state=42,
)
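A minimal follow-on training step (an untuned scikit-learn random forest, purely illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))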

Network Analysis

The banking data supports graph-based analysis:

import networkx as nx
import pandas as pd

# Build transaction network (skip transactions without a counterparty)
G = nx.DiGraph()
for _, txn in transactions.iterrows():
    if pd.notna(txn["counterparty_id"]):
        G.add_edge(txn["account_id"], txn["counterparty_id"],
                   weight=txn["amount"])

# Detect funnel accounts (high in-degree, low out-degree)
in_degree = dict(G.in_degree())
out_degree = dict(G.out_degree())
funnels = [n for n in G.nodes()
           if in_degree.get(n, 0) > 10 and out_degree.get(n, 0) < 3]

KYC Deviation Analysis

# Compare actual behavior to KYC profile
customers = pd.read_csv("banking/banking_customers.csv")
kyc = pd.read_csv("banking/kyc_profiles.csv")
transactions = pd.read_csv("banking/bank_transactions.csv", parse_dates=["timestamp"])

# Calculate actual monthly volumes per customer
# (assumes transactions carry customer_id; otherwise join via bank_accounts.csv)
transactions["month"] = transactions["timestamp"].dt.to_period("M")
actual = (
    transactions.groupby(["customer_id", "month"])["amount"]
    .sum()
    .rename("actual")
    .reset_index()
)

# Compare to declared turnover
merged = actual.merge(kyc, on="customer_id")
merged["deviation"] = (merged["actual"] - merged["declared_turnover"]) / merged["declared_turnover"]

# Flag significant deviations
alerts = merged[merged["deviation"].abs() > 0.5]

Best Practices

Realistic Testing

  1. Match production volumes: Configure similar customer counts and transaction rates
  2. Use realistic ratios: Keep typology rates in a plausible range (0.1–1% of transactions)
  3. Include noise: Add legitimate edge cases that shouldn’t trigger alerts

Label Quality

  1. Verify ground truth: Labels reflect injected patterns, not detected ones
  2. Include confidence: Use confidence scores for uncertain classifications
  3. Track related transactions: Pattern IDs link related suspicious activity

Model Validation

  1. Test detection rates: Measure recall against known patterns
  2. Check false positives: Ensure legitimate transactions aren’t flagged
  3. Validate across typologies: Test each pattern type separately


ERP Load Testing

Generate high-volume data for ERP system testing.

Overview

SyntheticData generates realistic transaction volumes for:

  • Load testing
  • Stress testing
  • Performance benchmarking
  • System integration testing

Configuration

High Volume Generation

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
  worker_threads: 8                  # Maximize parallelism

transactions:
  target_count: 1000000              # 1 million entries

  line_items:
    distribution: empirical

  amounts:
    min: 100
    max: 10000000
    distribution: log_normal

  sources:
    manual: 0.15
    automated: 0.65
    recurring: 0.15
    adjustment: 0.05

  temporal:
    month_end_spike: 2.5
    quarter_end_spike: 3.0
    year_end_spike: 4.0

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.35
  o2c:
    enabled: true
    flow_rate: 0.35

master_data:
  vendors:
    count: 2000
  customers:
    count: 5000
  materials:
    count: 10000

output:
  format: csv
  compression: none                  # Fastest for import

SAP ACDOCA Format

output:
  files:
    journal_entries: false
    acdoca: true                     # SAP Universal Journal format

Volume Sizing

Transaction Volume Guidelines

Company Size | Annual Entries | Per Day | Configuration
Small | 10,000 | ~30 | target_count: 10000
Medium | 100,000 | ~300 | target_count: 100000
Large | 1,000,000 | ~3,000 | target_count: 1000000
Enterprise | 10,000,000 | ~30,000 | target_count: 10000000

Master Data Guidelines

Size | Vendors | Customers | Materials
Small | 100 | 200 | 500
Medium | 500 | 1,000 | 5,000
Large | 2,000 | 10,000 | 50,000
Enterprise | 10,000 | 100,000 | 500,000
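For example, the Large profile from the two tables above corresponds to this configuration fragment:

transactions:
  target_count: 1000000

master_data:
  vendors:
    count: 2000
  customers:
    count: 10000
  materials:
    count: 50000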

Load Testing Scenarios

1. Steady State Load

Normal daily operation:

transactions:
  target_count: 100000

  temporal:
    month_end_spike: 1.0             # No spikes
    quarter_end_spike: 1.0
    year_end_spike: 1.0
    working_hours_only: true

2. Peak Period Load

Month-end closing:

global:
  start_date: 2024-01-25
  period_months: 1                   # Focus on month-end

transactions:
  target_count: 50000

  temporal:
    month_end_spike: 5.0             # 5x normal volume

3. Year-End Stress

Year-end closing simulation:

global:
  start_date: 2024-12-01
  period_months: 1

transactions:
  target_count: 200000

  temporal:
    month_end_spike: 3.0
    quarter_end_spike: 4.0
    year_end_spike: 10.0             # Extreme spike

4. Batch Import

Large batch import testing:

transactions:
  target_count: 500000

  sources:
    automated: 1.0                   # All system-generated

output:
  compression: none                  # For fastest import

Manufacturing ERP Testing (v0.6.0)

Production Order Load

Generate production orders with WIP tracking, routings, and standard costing:

global:
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
  worker_threads: 8

transactions:
  target_count: 500000

manufacturing:
  enabled: true
  production_orders:
    orders_per_month: 200              # High volume
    avg_batch_size: 150
    yield_rate: 0.96
    rework_rate: 0.04
  costing:
    labor_rate_per_hour: 42.0
    overhead_rate: 1.75
  routing:
    avg_operations: 6
    setup_time_hours: 2.0

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.40                    # Heavy procurement

subledger:
  inventory:
    enabled: true
    valuation_methods:
      - standard_cost
      - weighted_average

This configuration exercises production order creation, goods issue to production, goods receipt from production, WIP valuation, and standard cost variance posting.

Three-Way Match with Source-to-Pay

Test the full procurement lifecycle from sourcing through payment:

source_to_pay:
  enabled: true
  sourcing:
    projects_per_year: 20
  rfx:
    min_invited_vendors: 5
    max_invited_vendors: 12
  contracts:
    min_duration_months: 12
    max_duration_months: 24
  p2p_integration:
    off_contract_rate: 0.10            # 10% maverick spending
    catalog_enforcement: true

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.40
    three_way_match:
      quantity_tolerance: 0.02
      price_tolerance: 0.01

HR and Payroll Testing (v0.6.0)

Payroll Processing Load

Generate payroll runs, time entries, and expense reports:

master_data:
  employees:
    count: 500
    hierarchy_depth: 6

hr:
  enabled: true
  payroll:
    enabled: true
    pay_frequency: "biweekly"          # 26 pay periods per year
    benefits_enrollment_rate: 0.75
    retirement_participation_rate: 0.55
  time_attendance:
    enabled: true
    overtime_rate: 0.15
  expenses:
    enabled: true
    submission_rate: 0.40
    policy_violation_rate: 0.05

This exercises payroll journal entry generation (salary, tax withholdings, benefits deductions), time and attendance record creation, and expense report approval workflows.

Expense Report Compliance

Test expense policy enforcement with elevated violation rates:

hr:
  enabled: true
  expenses:
    enabled: true
    submission_rate: 0.60              # 60% of employees submit
    policy_violation_rate: 0.15        # Elevated violation rate for testing

anomaly_injection:
  enabled: true
  generate_labels: true

Procurement Testing (v0.6.0)

Vendor Scorecard and Qualification

Generate the full source-to-pay cycle for procurement system testing:

source_to_pay:
  enabled: true
  qualification:
    pass_rate: 0.80
    validity_days: 365
  scorecards:
    frequency: "quarterly"
    grade_a_threshold: 85.0
    grade_c_threshold: 55.0
  catalog:
    preferred_vendor_flag_rate: 0.65
    multi_source_rate: 0.30

vendor_network:
  enabled: true
  depth: 3

Sales Quote Pipeline

Test quote-to-order conversion with the O2C flow:

sales_quotes:
  enabled: true
  quotes_per_month: 100
  win_rate: 0.30
  validity_days: 45

document_flows:
  o2c:
    enabled: true
    flow_rate: 0.40

Won quotes automatically feed into the O2C document flow as sales orders.


Performance Monitoring

Generation Metrics

# Time generation
time datasynth-data generate --config config.yaml --output ./output

# Monitor memory
/usr/bin/time -v datasynth-data generate --config config.yaml --output ./output

# Watch progress
datasynth-data generate --config config.yaml --output ./output -v

Import Metrics

Track these during ERP import:

Metric | Description
Import rate | Records per second
Memory usage | Peak RAM during import
CPU utilization | Processor load
I/O throughput | Disk read/write speed
Lock contention | Database lock waits

Data Import Strategies

SAP S/4HANA

# Generate ACDOCA format
datasynth-data generate --config config.yaml --output ./output

# Use SAP Data Services or LSMW for import
# Output: output/transactions/acdoca.csv

Oracle EBS

-- Create staging table
CREATE TABLE XX_JE_STAGING (
    document_id VARCHAR2(36),
    posting_date DATE,
    account VARCHAR2(20),
    debit NUMBER,
    credit NUMBER
);

-- Load via SQL*Loader control file (SKIP=1 skips the CSV header row)
OPTIONS (SKIP=1)
LOAD DATA
INFILE 'journal_entries.csv'
INTO TABLE XX_JE_STAGING
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(document_id, posting_date DATE "YYYY-MM-DD", account, debit, credit)

Microsoft Dynamics

# Use Data Management Framework
# Import journal_entries.csv via Data Entity

Validation

Post-Import Checks

-- Verify record count
SELECT COUNT(*) FROM journal_entries;

-- Verify balance
SELECT SUM(debit) - SUM(credit) AS imbalance
FROM journal_entries;

-- Check date range
SELECT MIN(posting_date), MAX(posting_date)
FROM journal_entries;

Reconciliation

import pandas as pd

# Compare source and target
source = pd.read_csv('output/transactions/journal_entries.csv')
target = pd.read_sql('SELECT * FROM journal_entries', connection)

# Verify counts
assert len(source) == len(target), "Record count mismatch"

# Verify totals
assert abs(source['debit_amount'].sum() - target['debit'].sum()) < 0.01

Batch Processing

Chunked Generation

For very large volumes, generate in chunks:

# Generate 10 batches of 1M each
for i in {1..10}; do
    datasynth-data generate \
        --config config.yaml \
        --output ./output/batch_$i \
        --seed $((42 + i))
done

Parallel Import

# Import chunks in parallel
for batch in ./output/batch_*; do
    import_job $batch &
done
wait

Performance Tips

Generation Speed

  1. Increase threads: worker_threads: 16
  2. Disable unnecessary features: Turn off graph export, anomalies
  3. Use fast storage: NVMe SSD
  4. Reduce complexity: Smaller COA, fewer master records

Import Speed

  1. Disable triggers: During bulk import
  2. Drop indexes: Recreate after import
  3. Increase batch size: Larger commits
  4. Parallel loading: Multiple import streams


Causal Analysis

New in v0.5.0

Use DataSynth’s causal generation capabilities for “what-if” scenario testing and counterfactual analysis in audit and risk management.

When to Use Causal Generation

Causal generation is ideal when you need to:

  • Test audit scenarios: “What would happen to fraud rates if we increased the approval threshold?”
  • Risk assessment: “How would revenue change if we lost our top vendor?”
  • Policy evaluation: “What is the causal effect of implementing a new control?”
  • Training causal ML models: Generate data with known causal structure for model validation

Setting Up a Fraud Detection SCM

# Generate causally-structured fraud detection data
datasynth-data causal generate \
    --template fraud_detection \
    --samples 50000 \
    --seed 42 \
    --output ./fraud_causal

The fraud_detection template models:

  • transaction_amountapproval_level (larger amounts require higher approval)
  • transaction_amountfraud_flag (larger amounts have higher fraud probability)
  • vendor_riskfraud_flag (risky vendors associated with more fraud)

Running Interventions

Answer “what if?” questions by forcing variables to specific values:

# What if all transactions were $50,000?
datasynth-data causal intervene \
    --template fraud_detection \
    --variable transaction_amount \
    --value 50000 \
    --samples 10000 \
    --output ./intervention_50k

# What if vendor risk were always high (0.9)?
datasynth-data causal intervene \
    --template fraud_detection \
    --variable vendor_risk \
    --value 0.9 \
    --samples 10000 \
    --output ./intervention_high_risk

Compare the intervention output against the baseline to estimate causal effects.
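For example, a rough effect estimate on the fraud rate (a sketch; the sample file names below are placeholders for whatever files the CLI writes into each output directory):

import pandas as pd

baseline = pd.read_csv("fraud_causal/samples.csv")        # placeholder file name
intervened = pd.read_csv("intervention_50k/samples.csv")  # placeholder file name

effect = intervened["fraud_flag"].mean() - baseline["fraud_flag"].mean()
print(f"Estimated effect of do(transaction_amount=50000) on fraud rate: {effect:+.4f}")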

Counterfactual Analysis for Audit

For individual transaction review:

from datasynth_py import DataSynth

synth = DataSynth()

# Load a specific flagged transaction
factual = {
    "transaction_amount": 5000.0,
    "approval_level": 1.0,
    "vendor_risk": 0.3,
    "fraud_flag": 0.0,
}

# What would have happened if the amount were 10x larger?
# The counterfactual preserves the same "noise" (latent factors)
# but propagates the new amount through the causal structure

This helps auditors understand which factors most influence risk assessments.

Configuration Example

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12

causal:
  enabled: true
  template: "fraud_detection"
  sample_size: 50000
  validate: true

# Combine with regular generation
transactions:
  target_count: 100000

fraud:
  enabled: true
  fraud_rate: 0.005

Validating Causal Structure

Verify that generated data preserves the intended causal relationships:

datasynth-data causal validate \
    --data ./fraud_causal \
    --template fraud_detection

The validator checks:

  • Parent-child correlations match expected directions
  • Independence constraints hold for non-adjacent variables
  • Intervention effects are consistent with the graph
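Beyond the CLI validator, a quick manual spot-check of a single edge is straightforward (a pandas sketch; the sample file name is a placeholder):

import pandas as pd

df = pd.read_csv("fraud_causal/samples.csv")  # placeholder file name

# transaction_amount -> fraud_flag should show a positive association
corr = df["transaction_amount"].corr(df["fraud_flag"])
print(f"corr(transaction_amount, fraud_flag) = {corr:.3f} (expected > 0)")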


LLM Training Data

New in v0.5.0

Generate LLM-enriched synthetic financial data for training and fine-tuning language models on domain-specific tasks.

When to Use LLM-Enriched Data

  • Fine-tuning: Train financial document understanding models on realistic data
  • RAG evaluation: Test retrieval-augmented generation with known-truth synthetic documents
  • Classification training: Generate labeled financial text for transaction categorization
  • Anomaly explanation: Train models to explain financial anomalies in natural language

Quality vs Cost Tradeoffs

Provider | Quality | Cost | Latency | Reproducibility
Mock | Good (template-based) | Free | Instant | Fully deterministic
gpt-4o-mini | High | ~$0.15/1M tokens | ~200ms/req | Seed-based
gpt-4o | Very High | ~$2.50/1M tokens | ~500ms/req | Seed-based
Claude (Anthropic) | Very High | Varies | ~300ms/req | Seed-based
Self-hosted | Varies | Infrastructure cost | Varies | Full control

Using the Mock Provider for CI/CD

The mock provider generates deterministic, contextually-aware text without any API calls:

# Default: uses mock provider (no API key needed)
datasynth-data generate --config config.yaml --output ./output

# Explicit mock configuration
llm:
  provider: mock

The mock provider is suitable for:

  • CI/CD pipelines
  • Automated testing
  • Reproducible research
  • Development environments

Using Real LLM Providers

For production-quality enrichment:

llm:
  provider: openai
  model: "gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"
  cache_enabled: true       # Avoid duplicate API calls
  max_retries: 3
  timeout_secs: 30

export OPENAI_API_KEY="sk-..."
datasynth-data generate --config config.yaml --output ./output

Batch Generation for Large Datasets

For large-scale enrichment, use batch mode to minimize API overhead:

from datasynth_py import DataSynth, Config
from datasynth_py.config import blueprints

# Generate base data first (fast, rule-based)
config = blueprints.manufacturing_large(transactions=100000)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})

# Then enrich with LLM in a separate pass if needed

Example: Financial Document Understanding

Generate training data for a document understanding model:

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12

transactions:
  target_count: 50000

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.4
  o2c:
    enabled: true
    flow_rate: 0.3

anomaly_injection:
  enabled: true
  total_rate: 0.03
  generate_labels: true

# LLM enrichment adds realistic descriptions
llm:
  provider: mock     # or openai for higher quality

The generated data includes:

  • Vendor names appropriate for the industry and spend category
  • Transaction descriptions that read like real GL entries
  • Memo fields on invoices and payments
  • Natural language explanations for flagged anomalies


Pipeline Orchestration

New in v0.5.0

Integrate DataSynth into data engineering pipelines using Apache Airflow, dbt, MLflow, and Apache Spark.

Overview

DataSynth’s Python wrapper includes optional integrations for popular data engineering platforms, enabling synthetic data generation as part of automated workflows.

pip install datasynth-py[integrations]

Apache Airflow

Generate Data in a DAG

from airflow import DAG
from airflow.utils.dates import days_ago
from datasynth_py.integrations import (
    DataSynthOperator,
    DataSynthSensor,
    DataSynthValidateOperator,
)

config = {
    "global": {"industry": "retail", "start_date": "2024-01-01", "period_months": 12},
    "transactions": {"target_count": 50000},
}

with DAG("synthetic_data_pipeline", start_date=days_ago(1), schedule_interval="@weekly") as dag:

    validate = DataSynthValidateOperator(
        task_id="validate_config",
        config_path="/configs/retail.yaml",
    )

    generate = DataSynthOperator(
        task_id="generate_data",
        config=config,
        output_path="/data/synthetic/{{ ds }}",
    )

    wait = DataSynthSensor(
        task_id="wait_for_output",
        output_path="/data/synthetic/{{ ds }}",
    )

    validate >> generate >> wait

dbt Integration

Generate dbt Sources from Synthetic Data

from datasynth_py.integrations import DbtSourceGenerator, create_dbt_project

# Generate sources.yml pointing to synthetic CSV files
gen = DbtSourceGenerator()
gen.generate_sources_yaml("./synthetic_output", "./my_dbt_project")

# Generate seed CSVs for dbt
gen.generate_seeds("./synthetic_output", "./my_dbt_project")

# Or create a complete dbt project structure
project = create_dbt_project("./synthetic_output", "my_dbt_project")

This creates:

  • models/sources.yml with table definitions
  • seeds/ directory with CSV files
  • Standard dbt project structure

Testing dbt Models with Synthetic Data

# 1. Generate synthetic data
datasynth-data generate --config retail.yaml --output ./synthetic

# 2. Create dbt project from output
python -c "from datasynth_py.integrations import create_dbt_project; create_dbt_project('./synthetic', 'test_project')"

# 3. Run dbt
cd test_project && dbt seed && dbt run && dbt test

MLflow Tracking

Track Generation Experiments

from datasynth_py.integrations import DataSynthMlflowTracker

tracker = DataSynthMlflowTracker(experiment_name="data_generation")

# Track a generation run (logs config, metrics, artifacts)
run_info = tracker.track_generation("./output", config=config)

# Log additional quality metrics
tracker.log_quality_metrics({
    "benford_mad": 0.008,
    "correlation_preservation": 0.95,
    "completeness": 0.99,
})

# Compare recent runs
comparison = tracker.compare_runs(n=10)
for run in comparison:
    print(f"Run {run['run_id']}: quality={run['metrics'].get('statistical_fidelity', 'N/A')}")

A/B Testing Generation Configs

import mlflow

# synth and tracker are defined as in the previous examples
configs = [
    ("baseline", baseline_config),
    ("with_diffusion", diffusion_config),
    ("with_llm", llm_config),
]

for name, cfg in configs:
    with mlflow.start_run(run_name=name):
        result = synth.generate(config=cfg, output={"format": "csv", "sink": "temp_dir"})
        tracker.track_generation(result.output_dir, config=cfg)

Apache Spark

Read Synthetic Data as DataFrames

from datasynth_py.integrations import DataSynthSparkReader

reader = DataSynthSparkReader()

# Read a single table
je_df = reader.read_table(spark, "./output", "journal_entries")
je_df.show(5)

# Read all tables at once
tables = reader.read_all_tables(spark, "./output")
for name, df in tables.items():
    print(f"{name}: {df.count()} rows")

# Create temporary SQL views
reader.create_temp_views(spark, "./output")
spark.sql("""
    SELECT posting_date, SUM(amount) as total
    FROM journal_entries
    WHERE fiscal_period = 12
    GROUP BY posting_date
    ORDER BY posting_date
""").show()

End-to-End Pipeline Example

"""
Complete pipeline: Generate → Track → Load → Transform → Test
"""
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
from datasynth_py.integrations import (
    DataSynthMlflowTracker,
    DataSynthSparkReader,
    DbtSourceGenerator,
)

# 1. Generate
synth = DataSynth()
config = blueprints.retail_small(transactions=50000)
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})

# 2. Track with MLflow
tracker = DataSynthMlflowTracker(experiment_name="pipeline_test")
tracker.track_generation(result.output_dir, config=config)

# 3. Load into Spark
reader = DataSynthSparkReader()
reader.create_temp_views(spark, result.output_dir)

# 4. Create dbt project for transformation testing
gen = DbtSourceGenerator()
gen.generate_sources_yaml(result.output_dir, "./dbt_project")


Contributing

Welcome to the SyntheticData contributor guide.

Overview

SyntheticData is an open-source project and we welcome contributions from the community. This section covers everything you need to know to contribute effectively.

Ways to Contribute

Code Contributions

  • Bug fixes: Fix issues reported in the GitHub issue tracker
  • New features: Implement new generators, output formats, or analysis tools
  • Performance improvements: Optimize generation speed or memory usage
  • Documentation: Improve or expand the documentation

Non-Code Contributions

  • Bug reports: Report issues with detailed reproduction steps
  • Feature requests: Suggest new features or improvements
  • Documentation feedback: Point out unclear or missing documentation
  • Testing: Test pre-release versions and report issues

Quick Start

# Fork and clone the repository
git clone https://github.com/YOUR_USERNAME/SyntheticData.git
cd SyntheticData

# Create a feature branch
git checkout -b feature/my-feature

# Make your changes and run tests
cargo test

# Submit a pull request

Contribution Guidelines

Before You Start

  1. Check existing issues: Look for related issues or discussions
  2. Open an issue first: For significant changes, discuss before implementing
  3. Follow code style: Run cargo fmt and cargo clippy
  4. Write tests: All new features need test coverage
  5. Update documentation: Keep docs in sync with code changes

Code of Conduct

We are committed to providing a welcoming and inclusive environment. Please:

  • Be respectful and constructive in discussions
  • Focus on the technical merits of contributions
  • Help newcomers learn and contribute
  • Report unacceptable behavior to the maintainers

Getting Help

  • GitHub Issues: For bugs and feature requests
  • GitHub Discussions: For questions and general discussion
  • Pull Request Reviews: For feedback on your contributions

In This Section

Page | Description
Development Setup | Set up your development environment
Code Style | Coding standards and conventions
Testing | Testing guidelines and practices
Pull Requests | PR submission and review process

License

By contributing to SyntheticData, you agree that your contributions will be licensed under the project’s MIT License.

Development Setup

Set up your local development environment for SyntheticData.

Prerequisites

Required

  • Rust: 1.88 or later (install via rustup)
  • Git: For version control
  • Cargo: Included with Rust

Optional

  • Node.js 18+: For desktop UI development (datasynth-ui)
  • Protocol Buffers: For gRPC development
  • mdBook: For documentation development

Installation

1. Clone the Repository

git clone https://github.com/EY-ASU-RnD/SyntheticData.git
cd SyntheticData

2. Install Rust Toolchain

# Install rustup if not present
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install stable toolchain
rustup install stable
rustup default stable

# Add useful components
rustup component add clippy rustfmt

3. Build the Project

# Debug build (faster compilation)
cargo build

# Release build (optimized)
cargo build --release

# Check without building
cargo check

4. Run Tests

# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific crate tests
cargo test -p datasynth-core
cargo test -p datasynth-generators

IDE Setup

VS Code

Recommended extensions:

{
  "recommendations": [
    "rust-lang.rust-analyzer",
    "tamasfe.even-better-toml",
    "serayuzgur.crates",
    "vadimcn.vscode-lldb"
  ]
}

Settings for the project:

{
  "rust-analyzer.cargo.features": "all",
  "rust-analyzer.checkOnSave.command": "clippy",
  "editor.formatOnSave": true
}

JetBrains (RustRover/IntelliJ)

  1. Install the Rust plugin
  2. Open the project directory
  3. Configure Cargo settings under Preferences > Languages & Frameworks > Rust

Desktop UI Setup

For developing the Tauri/SvelteKit desktop UI:

# Navigate to UI crate
cd crates/datasynth-ui

# Install Node.js dependencies
npm install

# Run development server
npm run dev

# Run Tauri desktop app
npm run tauri dev

# Build production
npm run build

Documentation Setup

For working on documentation:

# Install mdBook
cargo install mdbook

# Build documentation
cd docs
mdbook build

# Serve with live reload
mdbook serve --open

# Generate Rust API docs
cargo doc --workspace --no-deps --open

Project Structure

SyntheticData/
├── crates/
│   ├── datasynth-cli/          # CLI binary
│   ├── datasynth-core/         # Core models and traits
│   ├── datasynth-config/       # Configuration schema
│   ├── datasynth-generators/   # Data generators
│   ├── datasynth-output/       # Output sinks
│   ├── datasynth-graph/        # Graph export
│   ├── datasynth-runtime/      # Orchestration
│   ├── datasynth-server/       # REST/gRPC server
│   ├── datasynth-ui/           # Desktop UI
│   └── datasynth-ocpm/         # OCEL 2.0 export
├── benches/                # Benchmarks
├── docs/                   # Documentation
├── configs/                # Example configs
└── templates/              # Data templates

Environment Variables

Variable | Description | Default
RUST_LOG | Log level (trace, debug, info, warn, error) | info
SYNTH_CONFIG_PATH | Default config search path | Current directory
SYNTH_TEMPLATE_PATH | Template files location | ./templates
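For example, a one-off run with verbose logging and a custom template directory:

RUST_LOG=debug SYNTH_TEMPLATE_PATH=./my_templates \
    datasynth-data generate --demo --output ./output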

Debugging

VS Code Launch Configuration

{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "lldb",
      "request": "launch",
      "name": "Debug CLI",
      "cargo": {
        "args": ["build", "--bin=datasynth-data", "--package=datasynth-cli"]
      },
      "args": ["generate", "--demo", "--output", "./output"],
      "cwd": "${workspaceFolder}"
    }
  ]
}

Logging

Enable debug logging:

RUST_LOG=debug cargo run --release -- generate --demo --output ./output

Module-specific logging:

RUST_LOG=synth_generators=debug,synth_core=info cargo run ...

Common Issues

Build Failures

# Clean and rebuild
cargo clean
cargo build

# Update dependencies
cargo update

Test Failures

# Run tests with backtrace
RUST_BACKTRACE=1 cargo test

# Run single test with output
cargo test test_name -- --nocapture

Memory Issues

For large generation volumes, increase system limits:

# Linux: Increase open file limit
ulimit -n 65536

# Check memory usage during generation
/usr/bin/time -v datasynth-data generate --config config.yaml --output ./output


Code Style

Coding standards and conventions for SyntheticData.

Rust Style

Formatting

All code must be formatted with rustfmt:

# Format all code
cargo fmt

# Check formatting without changes
cargo fmt --check

Linting

Code must pass Clippy without warnings:

# Run clippy
cargo clippy

# Run clippy with all features
cargo clippy --all-features

# Run clippy on all targets
cargo clippy --all-targets

Configuration

The project uses these Clippy settings in Cargo.toml:

[workspace.lints.clippy]
all = "warn"
pedantic = "warn"
nursery = "warn"

Naming Conventions

General Rules

Item | Convention | Example
Types | PascalCase | JournalEntry, VendorGenerator
Functions | snake_case | generate_batch, parse_config
Variables | snake_case | entry_count, total_amount
Constants | SCREAMING_SNAKE_CASE | MAX_LINE_ITEMS, DEFAULT_SEED
Modules | snake_case | je_generator, document_flow

Domain-Specific Names

Use accounting domain terminology consistently:

#![allow(unused)]
fn main() {
// Good - uses domain terms
struct JournalEntry { ... }
struct ChartOfAccounts { ... }
fn post_to_gl() { ... }

// Avoid - generic terms
struct Entry { ... }
struct AccountList { ... }
fn save_data() { ... }
}

Code Organization

Module Structure

#![allow(unused)]
fn main() {
// 1. Module documentation
//! Brief description of the module.
//!
//! Extended description with examples.

// 2. Imports (grouped and sorted)
use std::collections::HashMap;

use chrono::{NaiveDate, Utc};
use rust_decimal::Decimal;
use serde::{Deserialize, Serialize};

use crate::models::JournalEntry;

// 3. Constants
const DEFAULT_BATCH_SIZE: usize = 1000;

// 4. Type definitions
pub struct Generator { ... }

// 5. Trait implementations
impl Generator { ... }

// 6. Unit tests
#[cfg(test)]
mod tests { ... }
}

Import Organization

Group imports in this order:

  1. Standard library (std::)
  2. External crates (alphabetically)
  3. Workspace crates (synth_*)
  4. Current crate (crate::)
#![allow(unused)]
fn main() {
use std::collections::HashMap;
use std::sync::Arc;

use chrono::NaiveDate;
use rust_decimal::Decimal;
use serde::{Deserialize, Serialize};
use uuid::Uuid;

use synth_core::models::JournalEntry;
use synth_core::traits::Generator;

use crate::config::GeneratorConfig;
}

Documentation

Public API Documentation

All public items must have documentation:

#![allow(unused)]
fn main() {
/// Generates journal entries with realistic financial patterns.
///
/// This generator produces balanced journal entries following
/// configurable statistical distributions for amounts, line counts,
/// and temporal patterns.
///
/// # Examples
///
/// ```
/// use synth_generators::JournalEntryGenerator;
///
/// let generator = JournalEntryGenerator::new(config, seed);
/// let entries = generator.generate_batch(1000)?;
/// ```
///
/// # Errors
///
/// Returns `GeneratorError` if:
/// - Configuration is invalid
/// - Memory limits are exceeded
pub struct JournalEntryGenerator { ... }
}

Module Documentation

Each module should have a module-level doc comment:

#![allow(unused)]
fn main() {
//! Journal Entry generation module.
//!
//! This module provides generators for creating realistic
//! journal entries with proper accounting rules enforcement.
//!
//! # Overview
//!
//! The main entry point is [`JournalEntryGenerator`], which
//! coordinates line item generation and balance verification.
}

Error Handling

Error Types

Use thiserror for error definitions:

#![allow(unused)]
fn main() {
use thiserror::Error;

#[derive(Debug, Error)]
pub enum GeneratorError {
    #[error("Invalid configuration: {0}")]
    InvalidConfig(String),

    #[error("Memory limit exceeded: used {used} bytes, limit {limit} bytes")]
    MemoryExceeded { used: usize, limit: usize },

    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
}
}

Result Types

Define type aliases for common result types:

#![allow(unused)]
fn main() {
pub type Result<T> = std::result::Result<T, GeneratorError>;
}

Error Propagation

Use ? for error propagation:

#![allow(unused)]
fn main() {
// Good
fn process() -> Result<Data> {
    let config = load_config()?;
    let data = generate_data(&config)?;
    Ok(data)
}

// Avoid
fn process() -> Result<Data> {
    let config = match load_config() {
        Ok(c) => c,
        Err(e) => return Err(e),
    };
    // ...
}
}

Financial Data

Decimal Precision

Always use rust_decimal::Decimal for financial amounts:

#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;

// Good
let amount: Decimal = dec!(1234.56);

// Avoid - floating point
let amount: f64 = 1234.56;
}

Serialization

Serialize decimals as strings to avoid precision loss:

#![allow(unused)]
fn main() {
#[derive(Serialize, Deserialize)]
pub struct LineItem {
    #[serde(serialize_with = "serialize_decimal_as_string")]
    pub amount: Decimal,
}
}

Testing

Test Organization

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    // Group related tests
    mod generation {
        use super::*;

        #[test]
        fn generates_balanced_entries() {
            // Arrange
            let config = test_config();
            let generator = Generator::new(config, 42);

            // Act
            let entries = generator.generate_batch(100).unwrap();

            // Assert
            for entry in entries {
                assert!(entry.is_balanced());
            }
        }
    }

    mod validation {
        // ...
    }
}
}

Test Naming

Use descriptive test names:

#![allow(unused)]
fn main() {
// Good - describes behavior
#[test]
fn rejects_unbalanced_entry() { ... }

#[test]
fn generates_benford_compliant_amounts() { ... }

// Avoid - vague names
#[test]
fn test_1() { ... }

#[test]
fn it_works() { ... }
}

Performance

Allocation

Minimize allocations in hot paths:

#![allow(unused)]
fn main() {
// Good - reuse buffer
let mut buffer = Vec::with_capacity(batch_size);
for _ in 0..batch_size {
    buffer.push(generate_entry()?);
}

// Avoid - reallocations
let mut buffer = Vec::new();
for _ in 0..batch_size {
    buffer.push(generate_entry()?);
}
}

Iterator Usage

Prefer iterators over explicit loops:

#![allow(unused)]
fn main() {
// Good
let total: Decimal = entries
    .iter()
    .map(|e| e.amount)
    .sum();

// Avoid
let mut total = Decimal::ZERO;
for entry in &entries {
    total += entry.amount;
}
}


Testing

Testing guidelines and practices for SyntheticData.

Running Tests

All Tests

# Run all tests
cargo test

# Run with output displayed
cargo test -- --nocapture

# Run tests in parallel (default)
cargo test

# Run tests sequentially
cargo test -- --test-threads=1

Specific Tests

# Run tests for a specific crate
cargo test -p datasynth-core
cargo test -p datasynth-generators

# Run a single test by name
cargo test test_balanced_entry

# Run tests matching a pattern
cargo test benford
cargo test journal_entry

Test Output

# Show stdout/stderr from tests
cargo test -- --nocapture

# Show test timing
cargo test -- --show-output

# Run ignored tests
cargo test -- --ignored

# Run all tests including ignored
cargo test -- --include-ignored

Test Organization

Unit Tests

Place unit tests in the same file as the code:

#![allow(unused)]
fn main() {
// src/generators/je_generator.rs

pub struct JournalEntryGenerator { ... }

impl JournalEntryGenerator {
    pub fn generate(&self) -> Result<JournalEntry> { ... }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn generates_balanced_entry() {
        let generator = JournalEntryGenerator::new(test_config(), 42);
        let entry = generator.generate().unwrap();
        assert!(entry.is_balanced());
    }
}
}

Integration Tests

Place integration tests in the tests/ directory:

crates/datasynth-generators/
├── src/
│   └── ...
└── tests/
    ├── generation_flow.rs
    └── document_chains.rs

Test Modules

Group related tests in submodules:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    mod generation {
        use super::super::*;

        #[test]
        fn batch_generation() { ... }

        #[test]
        fn streaming_generation() { ... }
    }

    mod validation {
        use super::super::*;

        #[test]
        fn rejects_invalid_config() { ... }
    }
}
}

Test Patterns

Arrange-Act-Assert

Use the AAA pattern for test structure:

#![allow(unused)]
fn main() {
#[test]
fn calculates_correct_total() {
    // Arrange
    let entries = vec![
        create_entry(dec!(100.00)),
        create_entry(dec!(200.00)),
        create_entry(dec!(300.00)),
    ];

    // Act
    let total = calculate_total(&entries);

    // Assert
    assert_eq!(total, dec!(600.00));
}
}

Test Fixtures

Create helper functions for common test data:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    fn test_config() -> GeneratorConfig {
        GeneratorConfig {
            seed: 42,
            batch_size: 100,
            ..Default::default()
        }
    }

    fn create_test_entry() -> JournalEntry {
        JournalEntryBuilder::new()
            .with_company("1000")
            .with_date(NaiveDate::from_ymd_opt(2024, 1, 15).unwrap())
            .add_line(Account::CASH, dec!(1000.00), Decimal::ZERO)
            .add_line(Account::REVENUE, Decimal::ZERO, dec!(1000.00))
            .build()
            .unwrap()
    }
}
}

Deterministic Testing

Use fixed seeds for reproducibility:

#![allow(unused)]
fn main() {
#[test]
fn deterministic_generation() {
    let seed = 42;

    let gen1 = Generator::new(config.clone(), seed);
    let gen2 = Generator::new(config.clone(), seed);

    let result1 = gen1.generate_batch(100).unwrap();
    let result2 = gen2.generate_batch(100).unwrap();

    assert_eq!(result1, result2);
}
}

Property-Based Testing

Use proptest for property-based tests:

#![allow(unused)]
fn main() {
use proptest::prelude::*;

proptest! {
    #[test]
    fn entries_are_always_balanced(
        debit in 1u64..1_000_000,
        line_count in 2usize..10,
    ) {
        let entry = generate_entry(debit, line_count);
        prop_assert!(entry.is_balanced());
    }
}
}

Domain-Specific Tests

Balance Verification

Test that journal entries are balanced:

#![allow(unused)]
fn main() {
#[test]
fn entry_debits_equal_credits() {
    let entry = generate_test_entry();

    let total_debits: Decimal = entry.lines
        .iter()
        .map(|l| l.debit_amount)
        .sum();

    let total_credits: Decimal = entry.lines
        .iter()
        .map(|l| l.credit_amount)
        .sum();

    assert_eq!(total_debits, total_credits);
}
}

Benford’s Law

Test amount distribution compliance:

#![allow(unused)]
fn main() {
#[test]
fn amounts_follow_benford() {
    let entries = generate_entries(10_000);
    let first_digits = extract_first_digits(&entries);

    let observed = calculate_distribution(&first_digits);
    let expected = benford_distribution();

    let chi_square = calculate_chi_square(&observed, &expected);
    assert!(chi_square < 15.51, "Distribution deviates from Benford's Law");
}
}

Document Chain Integrity

Test document reference chains:

#![allow(unused)]
fn main() {
#[test]
fn p2p_chain_is_complete() {
    let documents = generate_p2p_flow();

    // Verify chain: PO -> GR -> Invoice -> Payment
    let po = &documents.purchase_order;
    let gr = &documents.goods_receipt;
    let invoice = &documents.vendor_invoice;
    let payment = &documents.payment;

    assert_eq!(gr.po_reference, Some(po.po_number.clone()));
    assert_eq!(invoice.po_reference, Some(po.po_number.clone()));
    assert_eq!(payment.invoice_reference, Some(invoice.invoice_number.clone()));
}
}

Decimal Precision

Test that decimal values maintain precision:

#![allow(unused)]
fn main() {
#[test]
fn decimal_precision_preserved() {
    let original = dec!(1234.5678);

    // Serialize and deserialize
    let json = serde_json::to_string(&original).unwrap();
    let restored: Decimal = serde_json::from_str(&json).unwrap();

    assert_eq!(original, restored);
}
}

Benchmarks

Running Benchmarks

# Run all benchmarks
cargo bench

# Run specific benchmark
cargo bench --bench generation_throughput

# Run benchmark with specific filter
cargo bench -- batch_generation

Writing Benchmarks

#![allow(unused)]
fn main() {
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};

fn generation_benchmark(c: &mut Criterion) {
    let config = test_config();

    c.bench_function("generate_1000_entries", |b| {
        b.iter(|| {
            let generator = Generator::new(config.clone(), 42);
            generator.generate_batch(1000).unwrap()
        })
    });
}

fn scaling_benchmark(c: &mut Criterion) {
    let config = test_config();
    let mut group = c.benchmark_group("scaling");

    for size in [100, 1000, 10000].iter() {
        group.bench_with_input(
            BenchmarkId::from_parameter(size),
            size,
            |b, &size| {
                b.iter(|| {
                    let generator = Generator::new(config.clone(), 42);
                    generator.generate_batch(size).unwrap()
                })
            },
        );
    }
    group.finish();
}

criterion_group!(benches, generation_benchmark, scaling_benchmark);
criterion_main!(benches);
}

Test Coverage

Measuring Coverage

# Install coverage tool
cargo install cargo-tarpaulin

# Run with coverage
cargo tarpaulin --out Html

# View report
open tarpaulin-report.html

Coverage Guidelines

  • Aim for 80%+ coverage on core logic
  • 100% coverage on public API
  • Focus on behavior, not lines
  • Don’t test trivial getters/setters

Continuous Integration

Tests run automatically on:

  • Pull request creation
  • Push to main branch
  • Nightly scheduled runs

CI Test Matrix

Test Type | Trigger | Platform
Unit tests | All PRs | Linux, macOS, Windows
Integration tests | All PRs | Linux
Benchmarks | Main branch | Linux
Coverage | Weekly | Linux


Pull Requests

Guide to submitting and reviewing pull requests.

Before You Start

1. Check for Existing Work

  • Search open issues for related discussions
  • Check open PRs for similar changes
  • Review the roadmap for planned features

2. Open an Issue First

For significant changes, open an issue to discuss:

  • New features or major changes
  • Breaking changes to public API
  • Architectural changes
  • Performance improvements

3. Create a Branch

# Sync with upstream
git checkout main
git pull origin main

# Create feature branch
git checkout -b feature/my-feature

# Or for bug fixes
git checkout -b fix/issue-123

Making Changes

1. Write Code

Follow the Code Style guidelines:

# Format code
cargo fmt

# Run clippy
cargo clippy

# Run tests
cargo test

2. Write Tests

Add tests for new functionality:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn new_feature_works() {
        // Test implementation
    }
}
}

3. Update Documentation

  • Update relevant docs in docs/src/
  • Add/update rustdoc comments
  • Update CHANGELOG.md if applicable

4. Commit Changes

Write clear commit messages:

# Good commit messages
git commit -m "Add Benford's Law validation to amount generator"
git commit -m "Fix off-by-one error in batch generation"
git commit -m "Improve memory efficiency in large volume generation"

# Avoid vague messages
git commit -m "Fix bug"
git commit -m "Update code"
git commit -m "WIP"

Commit Message Format

<type>: <short summary>

<optional detailed description>

<optional footer>

Types:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation only
  • refactor: Code change without feature/fix
  • test: Adding/updating tests
  • perf: Performance improvement
  • chore: Maintenance tasks
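An illustrative commit message following this format:

feat: add OCEL 2.0 JSON export

Implements Ocel2Exporter in the datasynth-ocpm crate and documents
the export workflow in the process mining guide.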

Submitting a PR

1. Push Your Branch

git push -u origin feature/my-feature

2. Create Pull Request

Use the PR template:

## Summary

Brief description of changes.

## Changes

- Added X feature
- Fixed Y bug
- Updated Z documentation

## Testing

- [ ] Added unit tests
- [ ] Added integration tests
- [ ] Ran full test suite
- [ ] Tested manually

## Checklist

- [ ] Code follows project style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] Tests pass locally
- [ ] No new warnings from clippy

3. PR Title Format

<type>: <short description>

Examples:

  • feat: Add OCEL 2.0 export format
  • fix: Correct decimal serialization in JSON output
  • docs: Add process mining use case guide

Review Process

Automated Checks

All PRs must pass:

Check | Requirement
Build | Compiles on all platforms
Tests | All tests pass
Formatting | cargo fmt --check passes
Linting | cargo clippy has no warnings
Documentation | Builds without errors

Code Review

Reviewers will check:

  1. Correctness: Does the code do what it claims?
  2. Tests: Are changes adequately tested?
  3. Style: Does code follow conventions?
  4. Documentation: Are changes documented?
  5. Performance: Any performance implications?

Responding to Feedback

  • Address all comments
  • Push fixes as new commits (don’t force-push during review)
  • Mark resolved conversations
  • Ask for clarification if needed

Merging

Requirements

Before merging:

  • All CI checks pass
  • At least one approving review
  • No unresolved conversations
  • Branch is up to date with main

Merge Strategy

We use squash and merge for most PRs:

  • Combines all commits into one
  • Keeps main history clean
  • Preserves full history in PR

After Merge

  • Delete your feature branch
  • Update local main:
git checkout main
git pull origin main
git branch -d feature/my-feature

Special Cases

Breaking Changes

For breaking changes:

  1. Open an issue for discussion first
  2. Document migration path
  3. Update CHANGELOG with breaking change notice
  4. Use BREAKING CHANGE: in commit footer

Large PRs

For large changes:

  1. Consider splitting into smaller PRs
  2. Create a tracking issue
  3. Use feature flags if needed
  4. Provide detailed documentation

Security Issues

For security vulnerabilities:

  1. Do not open a public issue
  2. Contact maintainers directly
  3. Follow responsible disclosure

PR Templates

Feature PR

## Summary

Adds [feature] to support [use case].

## Motivation

[Why is this needed?]

## Changes

- Added `NewType` struct in `datasynth-core`
- Implemented `NewGenerator` in `datasynth-generators`
- Added configuration options in `datasynth-config`
- Updated CLI to support new feature

## Testing

- Added unit tests for `NewType`
- Added integration tests for generation flow
- Manual testing with sample configs

## Documentation

- Added user guide section
- Updated configuration reference
- Added example configuration

Bug Fix PR

## Summary

Fixes #123 - [brief description]

## Root Cause

[What caused the bug?]

## Solution

[How does this fix it?]

## Testing

- Added regression test
- Verified fix with reproduction steps from issue
- Ran full test suite

## Checklist

- [ ] Regression test added
- [ ] Root cause documented
- [ ] Related issues linked

See Also

Compliance & Regulatory Overview

DataSynth generates synthetic financial data for testing, training, and analytics. This section documents how DataSynth aligns with key regulatory frameworks and provides self-assessment artifacts for compliance teams.

Regulatory Landscape

Synthetic data generation sits at the intersection of several regulatory domains. While pure synthetic data (generated without real-world data as input) generally faces fewer regulatory constraints than real data processing, organizations deploying DataSynth should understand the applicable frameworks.

EU AI Act

The EU AI Act (Regulation 2024/1689) introduces obligations for AI systems and their training data. DataSynth addresses two key articles:

Article 50 – Transparency for Synthetic Content: All DataSynth output includes machine-readable content credentials indicating that the data is synthetically generated. This is implemented through the ContentCredential system in datasynth-core, which embeds markers in CSV headers, JSON metadata, and Parquet file metadata. Content marking is enabled by default and can be configured via the marking section in the configuration YAML.

Article 10 – Data Governance: DataSynth generates automated DataGovernanceReport documents that describe data sources (synthetic generation, no real data used), processing steps (COA generation through quality validation), quality measures applied (Benford’s Law compliance, balance coherence, referential integrity), and bias assessments. These reports provide the documentation trail required under Article 10.

For full details, see EU AI Act Compliance.

NIST AI Risk Management Framework

The NIST AI RMF (AI 100-1) provides a voluntary framework for managing risks in AI systems. DataSynth has completed a self-assessment across all four core functions:

Function | Focus Area | DataSynth Alignment
MAP | Context and use cases | Documented intended uses, users, and known limitations
MEASURE | Metrics and evaluation | Quality gates, privacy metrics (MIA, linkage), statistical validation
MANAGE | Risk mitigation | Deterministic reproducibility, audit logging, content marking
GOVERN | Policies and oversight | Access control (API key + JWT/RBAC), configuration management, quality gate governance

For the complete self-assessment, see NIST AI RMF Self-Assessment.

GDPR

The General Data Protection Regulation applies differently depending on the DataSynth workflow:

Pure Synthetic Generation (no real data input): GDPR obligations are minimal because no personal data is processed. The generated output contains no data subjects. Article 30 records should still document the processing activity for audit completeness.

Fingerprint Extraction (real data as input): When DataSynth’s fingerprint module extracts statistical profiles from real datasets, GDPR applies in full. The fingerprint module includes differential privacy (Laplace mechanism with configurable epsilon/delta budgets), k-anonymity suppression of rare values, and a complete privacy audit trail. A Data Protection Impact Assessment (DPIA) template is provided for this scenario.

For templates and detailed guidance, see GDPR Compliance.

SOC 2 Readiness

DataSynth’s architecture supports SOC 2 Type II controls across the Trust Services Criteria:

Criteria | DataSynth Controls
Security | API key authentication with Argon2id hashing, JWT/OIDC support, TLS termination, CORS lockdown
Availability | Graceful degradation under resource pressure, health/readiness endpoints
Processing Integrity | Deterministic RNG (ChaCha8), balanced journal entries enforced at construction, quality gates
Confidentiality | Content marking prevents synthetic data from being mistaken for real data
Privacy | Differential privacy in fingerprint extraction, no real PII in standard generation

For deployment security controls, see Security Hardening.

ISO 27001 Alignment

DataSynth supports ISO 27001:2022 Annex A controls relevant to data processing tools:

Control | Implementation
A.5.12 Classification of information | Content credentials classify all output as synthetic
A.8.10 Information deletion | Deterministic generation eliminates data retention concerns for pure synthetic workflows
A.8.11 Data masking | Fingerprint extraction applies differential privacy and k-anonymity
A.8.12 Data leakage prevention | Quality gates include privacy metrics (MIA AUC-ROC, linkage attack assessment)
A.8.25 Secure development lifecycle | Deterministic builds, dependency auditing (cargo audit), SBOM generation

For access control configuration, see Security Hardening.

Quick Reference

Framework | Status | Documentation
EU AI Act Article 50 | Implemented (content marking) | EU AI Act
EU AI Act Article 10 | Implemented (governance reports) | EU AI Act
NIST AI RMF | Self-assessment complete | NIST AI RMF
GDPR | Templates provided | GDPR
SOC 2 | Readiness documented | SOC 2 Readiness
ISO 27001 | Annex A alignment documented | ISO 27001 Alignment

See Also

EU AI Act Compliance

DataSynth implements technical controls aligned with the EU Artificial Intelligence Act (Regulation 2024/1689), focusing on Article 50 (transparency for synthetic content) and Article 10 (data governance for high-risk AI systems).

Article 50 — Synthetic Content Marking

Article 50(2) requires providers of AI systems that generate synthetic content to ensure their outputs are marked in a machine-readable format and are detectable as artificially generated.

How DataSynth Complies

DataSynth embeds machine-readable synthetic content credentials in all output files:

  • CSV: Comment header lines with C2PA-inspired metadata
  • JSON: _synthetic_metadata top-level object with credential fields
  • Parquet: Key-value metadata pairs in the file footer

Configuration

compliance:
  content_marking:
    enabled: true          # Default: true
    format: embedded       # embedded, sidecar, or both
  article10_report: true   # Generate Article 10 governance report

Marking Formats

Format | Description
embedded | Credentials embedded directly in output files (default)
sidecar | Separate .synthetic-credential.json file alongside each output
both | Both embedded and sidecar credentials

Credential Fields

Each synthetic content credential contains:

Field | Description | Example
generator | Tool identifier | "DataSynth"
version | Generator version | "0.5.0"
timestamp | ISO 8601 generation time | "2024-06-15T10:30:00Z"
content_type | Output category | "synthetic_financial_data"
method | Generation technique | "rule_based_statistical"
config_hash | SHA-256 of config used | "a1b2c3..."
declaration | Human-readable statement | "This content is synthetic..."

Programmatic Detection

Third-party systems can detect synthetic DataSynth output by:

  1. CSV: Checking for # X-Synthetic-Generator: DataSynth header lines (see the sketch after this list)
  2. JSON: Checking for _synthetic_metadata.generator == "DataSynth"
  3. Parquet: Reading synthetic_generator from file metadata
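
For the CSV case, a minimal detection sketch in plain Rust (illustrative only, not part of the DataSynth API; the function name and path argument are hypothetical):

use std::fs::File;
use std::io::{BufRead, BufReader};

/// Returns true if the CSV's comment header declares DataSynth as the generator.
fn is_synthetic_csv(path: &str) -> std::io::Result<bool> {
    let reader = BufReader::new(File::open(path)?);
    for line in reader.lines() {
        let line = line?;
        if !line.starts_with('#') {
            break; // the comment header ends at the first data row
        }
        if line.contains("X-Synthetic-Generator: DataSynth") {
            return Ok(true);
        }
    }
    Ok(false)
}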

Article 10 — Data Governance

Article 10 requires appropriate data governance practices for training datasets used by high-risk AI systems. When synthetic data from DataSynth is used to train such systems, the Article 10 data governance report provides documentation.

Governance Report Contents

The automated report includes:

  • Data Sources: Documentation of all inputs (configuration parameters, seed values, statistical distributions)
  • Processing Steps: Complete pipeline documentation (CoA generation, master data, document flows, anomaly injection, quality validation)
  • Quality Measures: Statistical validation results (Benford’s Law, balance coherence, distribution fitting)
  • Bias Assessment: Known limitations, demographic representation gaps, and mitigation measures

Generating the Report

Enable in configuration:

compliance:
  article10_report: true

The report is written as article10_governance_report.json in the output directory.

Report Structure

{
  "report_version": "1.0",
  "generator": "DataSynth",
  "generated_at": "2024-06-15T10:30:00Z",
  "data_sources": ["configuration_parameters", "statistical_distributions", "deterministic_rng"],
  "processing_steps": [
    "chart_of_accounts_generation",
    "master_data_generation",
    "document_flow_generation",
    "journal_entry_generation",
    "anomaly_injection",
    "quality_validation"
  ],
  "quality_measures": [
    "benfords_law_compliance",
    "balance_sheet_coherence",
    "document_chain_integrity",
    "referential_integrity"
  ],
  "bias_assessment": {
    "known_limitations": [
      "Statistical distributions are parameterized, not learned from real data",
      "Temporal patterns use simplified seasonal models"
    ],
    "mitigation_measures": [
      "Configurable distribution parameters per industry profile",
      "Quality gate validation ensures statistical plausibility"
    ]
  }
}

See Also

NIST AI Risk Management Framework Self-Assessment

This document provides a self-assessment of DataSynth against the NIST AI Risk Management Framework (AI 100-1, January 2023). The framework defines four core functions – MAP, MEASURE, MANAGE, and GOVERN – each with categories and subcategories. This assessment covers all four functions as they apply to a synthetic data generation tool.

Assessment Scope

  • System: DataSynth synthetic financial data generator
  • Version: 0.5.x
  • Assessment Date: 2025
  • Assessor: Development team (self-assessment)
  • AI System Type: Data generation tool (not a decision-making AI system)
  • Risk Classification: The generated synthetic data may be used as training data for AI/ML systems. DataSynth itself does not make autonomous decisions, but the quality of its output can affect downstream AI system performance.

MAP: Context and Framing

The MAP function establishes the context for AI risk management by identifying intended use cases, users, and known limitations.

MAP 1: Intended Use Cases

DataSynth is designed for the following use cases:

Use Case | Description | Risk Level
ML Training Data | Generate labeled datasets for fraud detection, anomaly detection, and audit analytics models | Medium
Software Testing | Provide realistic test data for ERP systems, accounting platforms, and audit tools | Low
Privacy-Preserving Analytics | Replace real financial data with synthetic equivalents that preserve statistical properties | Medium
Compliance Testing | Generate SOX control test evidence, COSO framework data, and SoD violation scenarios | Low
Process Mining | Create OCEL 2.0 event logs for process analysis without exposing real business processes | Low
Education and Research | Provide realistic financial datasets for academic research and training | Low

Not intended for: Replacement of real financial records in regulatory filings, direct use as evidence in audit engagements, or any scenario where the synthetic nature of the data is concealed.

MAP 2: Intended Users

User Group | Typical Use | Access Level
Data Scientists | Training ML models for fraud/anomaly detection | API or CLI
QA Engineers | ERP and accounting system load/integration testing | CLI or Python wrapper
Auditors | Testing audit analytics tools against known-labeled data | CLI output files
Compliance Teams | SOX control testing, COSO framework validation | CLI or server API
Researchers | Academic study of financial data patterns | Python wrapper

MAP 3: Known Limitations

DataSynth users should understand the following limitations:

  1. No Real PII: Generated names, identifiers, and addresses are synthetic. They do not correspond to real individuals or organizations. This is a design feature, not a limitation, but downstream systems should not treat synthetic identities as real.

  2. Statistical Approximation: Generated data follows configurable statistical distributions (log-normal, Benford’s Law, Gaussian mixtures) that approximate real-world patterns. They are not derived from actual transaction populations unless fingerprint extraction is used.

  3. Industry Profile Approximations: Pre-configured industry profiles (retail, manufacturing, financial services, healthcare, technology) are based on published research and general knowledge. They may not match specific organizations within an industry.

  4. Temporal Pattern Simplification: Business day calendars, holiday schedules, and intraday patterns are modeled but may not capture all regional or organizational nuances.

  5. Anomaly Injection Boundaries: Injected fraud patterns follow configurable typologies (ACFE taxonomy) but do not represent the full diversity of real-world fraud schemes.

  6. Fingerprint Extraction Privacy: When extracting fingerprints from real data, differential privacy noise and k-anonymity are applied. The privacy guarantees depend on correct epsilon/delta parameter selection.

MAP 4: Deployment Context

DataSynth can be deployed as:

  • A CLI tool on developer workstations
  • A server (REST/gRPC/WebSocket) in cloud or on-premises environments
  • A Python library embedded in data pipelines
  • A desktop application (Tauri/SvelteKit)

Each deployment context has a different risk profile: server deployments require authentication, TLS, and rate limiting, while CLI usage on trusted workstations has fewer access control requirements.


MEASURE: Metrics and Evaluation

The MEASURE function establishes metrics, methods, and benchmarks for evaluating AI system trustworthiness.

MEASURE 1: Quality Gate Metrics

DataSynth includes a comprehensive evaluation framework (datasynth-eval) with configurable quality gates. Each metric has defined thresholds and automated pass/fail checking.

Statistical Quality

Metric | Gate Name | Threshold | Comparison | Purpose
Benford’s Law MAD | benford_compliance | 0.015 | LTE | First-digit distribution follows Benford’s Law
Balance Coherence | balance_sheet_valid | 1.0 | GTE | Assets = Liabilities + Equity
Document Chain Integrity | doc_chain_complete | 0.95 | GTE | P2P/O2C chains are complete
Temporal Consistency | temporal_valid | 0.90 | GTE | Temporal patterns match configuration
Correlation Preservation | correlation_check | 0.80 | GTE | Cross-field correlations preserved
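
To illustrate the benford_mad metric behind the benford_compliance gate, the following simplified sketch (not the datasynth-eval implementation) computes the mean absolute deviation between observed first-digit frequencies and the Benford expectation log10(1 + 1/d):

/// Mean absolute deviation of observed first-digit frequencies vs. Benford's Law.
fn benford_mad(amounts: &[f64]) -> f64 {
    let mut counts = [0usize; 9];
    for &a in amounts {
        let mut x = a.abs();
        if x == 0.0 { continue; }
        while x < 1.0 { x *= 10.0; }   // shift small amounts up to a leading digit
        while x >= 10.0 { x /= 10.0; } // shift large amounts down to a leading digit
        counts[x as usize - 1] += 1;
    }
    let total: usize = counts.iter().sum();
    if total == 0 { return 0.0; }
    (1..=9)
        .map(|d| {
            let expected = (1.0 + 1.0 / d as f64).log10();
            let observed = counts[d - 1] as f64 / total as f64;
            (observed - expected).abs()
        })
        .sum::<f64>() / 9.0
}

A dataset would pass the gate above when this value is at or below 0.015.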

Data Quality

Metric | Gate Name | Threshold | Comparison | Purpose
Completion Rate | completeness | 0.95 | GTE | Required fields are populated
Duplicate Rate | uniqueness | 0.05 | LTE | Acceptable duplicate rate
Referential Integrity | ref_integrity | 0.99 | GTE | Foreign key references valid
IC Match Rate | ic_matching | 0.95 | GTE | Intercompany transactions match

Gate Profiles

Quality gates are organized into profiles with configurable strictness:

evaluation:
  quality_gates:
    profile: strict    # strict, default, lenient
    fail_strategy: collect_all
    gates:
      - name: benford_compliance
        metric: benford_mad
        threshold: 0.015
        comparison: lte
      - name: balance_valid
        metric: balance_coherence
        threshold: 1.0
        comparison: gte
      - name: completeness
        metric: completion_rate
        threshold: 0.95
        comparison: gte

MEASURE 2: Privacy Evaluation

DataSynth evaluates privacy risk through empirical attacks on generated data.

Membership Inference Attack (MIA)

The MIA module (datasynth-eval/src/privacy/membership_inference.rs) implements a distance-based classifier that attempts to determine whether a specific record was part of the generation configuration. Key metrics:

Metric | Threshold | Interpretation
AUC-ROC | <= 0.60 | Near-random classification indicates strong privacy
Accuracy | <= 0.55 | Low accuracy means synthetic data does not memorize patterns
Precision/Recall | Balanced | No systematic bias toward members or non-members

Linkage Attack Assessment

The linkage module (datasynth-eval/src/privacy/linkage.rs) evaluates re-identification risk using quasi-identifier combinations:

Metric | Threshold | Interpretation
Re-identification Rate | <= 0.05 | Less than 5% of synthetic records can be linked to originals
K-Anonymity Achieved | >= 5 | Each quasi-identifier combination appears at least 5 times
Unique QI Overlap | Reported | Number of overlapping quasi-identifier combinations

NIST SP 800-226 Alignment

The evaluation framework includes self-assessment against NIST SP 800-226 criteria for de-identification. The NistAlignmentReport evaluates:

  • Data transformation adequacy
  • Re-identification risk assessment
  • Documentation completeness
  • Privacy control effectiveness

The overall alignment score must be at least 71% for a passing grade.

Fingerprint Module Privacy

When fingerprint extraction is used with real data input, the datasynth-fingerprint privacy engine provides:

Mechanism | Parameter | Default (Standard Level)
Differential Privacy (Laplace) | Epsilon | 1.0
K-Anonymity | K threshold | 5
Outlier Protection | Winsorization percentile | 95th
Composition | Method | Naive (RDP/zCDP available)
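
As an illustration of the Laplace mechanism listed above, a minimal inverse-CDF sketch (not the datasynth-fingerprint API; the L1 sensitivity and the uniform draw u in (-0.5, 0.5) are assumed to be supplied by the caller):

/// Adds Laplace noise with scale `sensitivity / epsilon`.
/// `u` must be a uniform random draw in (-0.5, 0.5).
fn laplace_noise(value: f64, sensitivity: f64, epsilon: f64, u: f64) -> f64 {
    let scale = sensitivity / epsilon;
    value - scale * u.signum() * (1.0 - 2.0 * u.abs()).ln()
}

Smaller epsilon values increase the noise scale, which is why the High and Maximum presets below use epsilon 0.5 and 0.1.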

Privacy levels provide pre-configured parameter sets:

Level | Epsilon | K | Use Case
Minimal | 5.0 | 3 | Low sensitivity
Standard | 1.0 | 5 | Balanced (default)
High | 0.5 | 10 | Sensitive data
Maximum | 0.1 | 20 | Highly sensitive data

MEASURE 3: Completeness and Uniqueness

The evaluation module tracks data completeness and uniqueness metrics:

  • Completeness: Measures the percentage of non-null values across all required fields. Reported as overall_completeness in the evaluation output.
  • Uniqueness: Measures the duplicate rate across primary key fields. Collision-free UUIDs (FNV-1a hash-based with generator-type discriminators) ensure deterministic uniqueness.

MEASURE 4: Distribution Validation

Statistical validation tests verify that generated data matches configured distributions:

Test | Implementation | Purpose
Benford First Digit | Chi-squared against Benford distribution | Transaction amounts follow expected first-digit distribution
Distribution Fit | Anderson-Darling test | Amount distributions match configured log-normal parameters
Correlation Check | Pearson/Spearman correlation | Cross-field correlations preserved via copula models
Temporal Patterns | Autocorrelation analysis | Seasonality and period-end patterns present

MANAGE: Risk Mitigation

The MANAGE function addresses risk response and mitigation strategies.

MANAGE 1: Deterministic Reproducibility

DataSynth uses ChaCha8 CSPRNG with configurable seeds. Given the same configuration and seed, the output is identical across runs and platforms. This provides:

  • Auditability: Any generated dataset can be exactly reproduced by preserving the configuration YAML and seed value.
  • Debugging: Anomalous output can be reproduced for investigation.
  • Regression Testing: Changes to generation logic can be detected by comparing output hashes.
global:
  seed: 42                    # Deterministic seed
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
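
The reproducibility property can be demonstrated with a small standalone sketch using the rand_chacha crate (illustrative only, not DataSynth code):

use rand::{RngCore, SeedableRng};
use rand_chacha::ChaCha8Rng;

/// Draws a short sequence from a seeded ChaCha8 stream.
fn sample(seed: u64) -> Vec<u64> {
    let mut rng = ChaCha8Rng::seed_from_u64(seed);
    (0..5).map(|_| rng.next_u64()).collect()
}

fn main() {
    // The same seed yields the same stream on every run and platform.
    assert_eq!(sample(42), sample(42));
    assert_ne!(sample(42), sample(43));
}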

MANAGE 2: Audit Logging

DataSynth provides audit trails at multiple levels:

Generation Audit: The runtime emits structured JSON logs for every generation phase, including timing, record counts, and resource utilization.

Privacy Audit: The fingerprint module maintains a PrivacyAudit record of every privacy-related action (noise additions with epsilon spent, value suppressions, generalizations, winsorizations). This audit is embedded in the .dsf fingerprint file.

Server Audit: The REST/gRPC server logs authentication attempts, configuration changes, stream operations, and rate limit events with request correlation IDs (X-Request-Id).

Run Manifest: Each generation run produces a manifest documenting the configuration hash, seed, crate versions, start/end times, record counts, and quality gate results.

MANAGE 3: Data Lineage Tracking

DataSynth tracks data lineage through:

  • Configuration Hashing: SHA-256 hash of the input configuration is embedded in all output metadata (see the sketch after this list).
  • Content Credentials: Every output file includes a ContentCredential linking back to the generator version, configuration hash, and seed.
  • Document Reference Chains: Generated document flows maintain explicit reference chains (PO -> GR -> Invoice -> Payment) with DocumentReference records.
  • Data Governance Reports: Automated Article 10 governance reports document all processing steps from COA generation through quality validation.
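
A hedged sketch of configuration hashing with the sha2 crate (the exact hashing scope and field names inside DataSynth are simplified here):

use sha2::{Digest, Sha256};

/// Hashes the resolved configuration YAML so it can be embedded in output metadata.
fn config_hash(config_yaml: &str) -> String {
    let digest = Sha256::digest(config_yaml.as_bytes());
    digest.iter().map(|b| format!("{:02x}", b)).collect()
}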

MANAGE 4: Content Marking

All synthetic output is marked to prevent confusion with real data:

  • CSV: Comment headers with # SYNTHETIC DATA - Generated by DataSynth v{version}
  • JSON: _metadata.content_credential object with generator, timestamp, config hash, and EU AI Act article reference
  • Parquet: Custom metadata key-value pairs with full credential JSON
  • Sidecar Files: Optional .credential.json files alongside output files

Content marking is enabled by default and can be configured:

marking:
  enabled: true
  format: embedded    # embedded, sidecar, both

MANAGE 5: Graceful Degradation

The resource guard system (datasynth-core) monitors memory, disk, and CPU usage, applying progressive degradation:

Level | Memory Threshold | Response
Normal | < 70% | Full feature generation
Reduced | 70-85% | Disable optional features
Minimal | 85-95% | Core generation only
Emergency | > 95% | Graceful shutdown

This prevents resource exhaustion from affecting other systems in shared environments.
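
A simplified sketch of the memory-threshold mapping from the table above (illustrative enum and function names; the real controller also weighs disk and CPU and applies hysteresis):

#[derive(Debug, PartialEq)]
enum DegradationLevel { Normal, Reduced, Minimal, Emergency }

/// Maps memory utilization (0.0–1.0) to a degradation level per the table above.
fn level_for_memory(utilization: f64) -> DegradationLevel {
    match utilization {
        u if u < 0.70 => DegradationLevel::Normal,
        u if u < 0.85 => DegradationLevel::Reduced,
        u if u < 0.95 => DegradationLevel::Minimal,
        _ => DegradationLevel::Emergency,
    }
}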


GOVERN: Policies and Oversight

The GOVERN function establishes organizational policies and structures for AI risk management.

GOVERN 1: Access Control

DataSynth implements layered access control for the server deployment:

API Key Authentication: Keys are hashed with Argon2id at startup. Verification uses timing-safe comparison with a short-lived cache to prevent side-channel attacks. Keys are provided via the X-API-Key header or Authorization: Bearer header.

JWT/OIDC Integration (optional jwt feature): Supports external identity providers (Keycloak, Auth0, Entra ID) with RS256 token validation. JWT claims include subject, roles, and tenant ID for multi-tenancy.

RBAC: Role-based access control via JWT claims enables differentiated access:

Role | Permissions
operator | Start/stop/pause generation streams
admin | Configuration changes, API key management
viewer | Read-only access to status and metrics

Exempt Paths: Health (/health), readiness (/ready), liveness (/live), and metrics (/metrics) endpoints are exempt from authentication for infrastructure integration.

GOVERN 2: Configuration Management

DataSynth configuration is managed through:

  • YAML Schema Validation: All configuration is validated against a typed schema before generation begins. Invalid configurations produce descriptive error messages.
  • Industry Presets: Pre-validated configuration presets for common industries (retail, manufacturing, financial services, healthcare, technology) reduce misconfiguration risk.
  • Complexity Levels: Small (~100 accounts), medium (~400), and large (~2500) complexity levels provide validated scaling parameters.
  • Template System: YAML/JSON templates with merge strategies enable configuration reuse while allowing overrides.

GOVERN 3: Quality Gates as Governance Controls

Quality gates serve as automated governance controls:

evaluation:
  quality_gates:
    profile: strict
    fail_strategy: fail_fast    # Stop on first failure
    gates:
      - name: benford_compliance
        metric: benford_mad
        threshold: 0.015
        comparison: lte
      - name: privacy_mia
        metric: privacy_mia_auc
        threshold: 0.60
        comparison: lte
      - name: balance_coherence
        metric: balance_coherence
        threshold: 1.0
        comparison: gte

Gate profiles can enforce:

  • Fail-fast: Stop generation on first quality failure
  • Collect-all: Run all checks and report all failures
  • Custom thresholds: Organization-specific quality requirements

The GateEngine evaluates all configured gates against the ComprehensiveEvaluation and produces a GateResult with per-gate pass/fail status, actual values, and summary messages.
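
The per-gate logic reduces to a threshold comparison; a minimal sketch with simplified types (not the actual GateEngine API) is:

enum Comparison { Lte, Gte }

struct Gate { name: &'static str, threshold: f64, comparison: Comparison }

struct GateOutcome { name: &'static str, actual: f64, passed: bool }

/// Evaluates each gate against its measured value and records pass/fail.
fn evaluate(gates: &[Gate], actuals: impl Fn(&str) -> f64) -> Vec<GateOutcome> {
    gates.iter().map(|g| {
        let actual = actuals(g.name);
        let passed = match g.comparison {
            Comparison::Lte => actual <= g.threshold,
            Comparison::Gte => actual >= g.threshold,
        };
        GateOutcome { name: g.name, actual, passed }
    }).collect()
}

A fail-fast strategy would stop at the first outcome with passed == false; collect-all returns the full list for reporting.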

GOVERN 4: Audit Trail Completeness

The following audit artifacts are produced for each generation run:

Artifact | Location | Contents
Run Manifest | output/_manifest.json | Config hash, seed, timestamps, record counts, gate results
Content Credentials | Embedded in each output file | Generator version, config hash, seed, EU AI Act reference
Data Governance Report | output/_governance_report.json | Article 10 data sources, processing steps, quality measures, bias assessment
Privacy Audit | Embedded in .dsf files | Epsilon spent, actions taken, composition method, remaining budget
Server Logs | Structured JSON to stdout/log aggregator | Request traces, auth events, config changes, stream operations
Quality Gate Results | output/_evaluation.json | Per-gate pass/fail, actual vs threshold, summary

GOVERN 5: Incident Response

For scenarios where generated data is mistakenly used as real data:

  1. Detection: Content credentials in output files identify synthetic origin
  2. Containment: Deterministic generation means the exact dataset can be reproduced and identified
  3. Remediation: All output files carry machine-readable markers that downstream systems can check programmatically
  4. Prevention: Content marking is enabled by default and requires explicit configuration to disable

Assessment Summary

Function | Category Count | Addressed | Notes
MAP | 4 | 4 | Use cases, users, limitations, and deployment documented
MEASURE | 4 | 4 | Quality gates, privacy metrics, completeness, distribution validation
MANAGE | 5 | 5 | Reproducibility, audit logging, lineage, content marking, degradation
GOVERN | 5 | 5 | Access control, config management, quality gates, audit trails, incident response

Overall Assessment: DataSynth provides comprehensive risk management controls appropriate for a synthetic data generation tool. The primary residual risks relate to (1) parameter misconfiguration leading to unrealistic output, mitigated by quality gates and industry presets, and (2) privacy leakage during fingerprint extraction from real data, mitigated by differential privacy with configurable epsilon/delta budgets and empirical privacy evaluation.

See Also

GDPR Compliance

This document provides GDPR (General Data Protection Regulation) compliance guidance for DataSynth deployments. DataSynth generates purely synthetic data by default, but certain workflows (fingerprint extraction) may process real personal data.

Synthetic Data and GDPR

Pure Synthetic Generation

When DataSynth generates data from configuration alone (no real data input):

  • No personal data is processed: All names, identifiers, and transactions are algorithmically generated
  • No data subjects exist: Synthetic entities have no real-world counterparts
  • GDPR does not apply to the generated output, as it contains no personal data per Article 4(1)

This is the default operating mode for all datasynth-data generate workflows.

Fingerprint Extraction Workflows

When using datasynth-data fingerprint extract with real data as input:

  • Real personal data may be processed during statistical fingerprint extraction
  • GDPR obligations apply to the extraction phase
  • Differential privacy controls limit information retained in the fingerprint
  • The output fingerprint (.dsf file) contains only aggregate statistics, not individual records

Article 30 — Records of Processing Activities

Template for Pure Synthetic Generation

Field | Value
Purpose | Generation of synthetic financial data for testing, training, and validation
Categories of data subjects | None (no real data subjects)
Categories of personal data | None (all data is synthetic)
Recipients | Internal development, QA, and data science teams
Transfers to third countries | Not applicable (no personal data)
Retention period | Per project requirements
Technical measures | Seed-based deterministic generation, content marking

Template for Fingerprint Extraction

Field | Value
Purpose | Statistical fingerprint extraction for privacy-preserving data synthesis
Legal basis | Legitimate interest (Article 6(1)(f)) or consent
Categories of data subjects | As per source dataset (e.g., customers, vendors, employees)
Categories of personal data | As per source dataset (aggregate statistics only retained)
Recipients | Data engineering team operating DataSynth
Transfers to third countries | Assess per deployment topology
Retention period | Fingerprint files: per project; source data: minimize retention
Technical measures | Differential privacy (configurable epsilon/delta), k-anonymity

Data Protection Impact Assessment (DPIA)

A DPIA under Article 35 is recommended when fingerprint extraction processes:

  • Large-scale datasets (>100,000 records)
  • Special categories of data (Article 9)
  • Data relating to vulnerable persons

DPIA Template for Fingerprint Extraction

1. Description of Processing

DataSynth extracts statistical fingerprints from source data. The fingerprint captures distribution parameters (means, variances, correlations) without retaining individual records. Differential privacy noise is added with configurable epsilon/delta parameters.

2. Necessity and Proportionality

  • Purpose: Enable realistic synthetic data generation without accessing source data repeatedly
  • Minimization: Only aggregate statistics are retained
  • Privacy controls: Differential privacy with user-specified budget

3. Risks to Data Subjects

Risk | Likelihood | Severity | Mitigation
Re-identification from fingerprint | Low | High | Differential privacy, k-anonymity enforcement
Membership inference | Low | Medium | MIA AUC-ROC testing in evaluation framework
Fingerprint file compromise | Medium | Low | Aggregate statistics only, no individual records

4. Measures to Address Risks

  • Configure fingerprint_privacy.level: high or maximum for sensitive data
  • Set fingerprint_privacy.epsilon within the 0.1–1.0 range (lower = stronger privacy)
  • Enable k-anonymity with fingerprint_privacy.k_anonymity >= 5
  • Use evaluation framework MIA testing to verify privacy guarantees

Privacy Configuration

fingerprint_privacy:
  level: high             # minimal, standard, high, maximum, custom
  epsilon: 0.5            # Privacy budget (lower = stronger)
  delta: 1.0e-5           # Failure probability
  k_anonymity: 10         # Minimum group size
  composition_method: renyi_dp  # naive, advanced, renyi_dp, zcdp

Privacy Level Presets

Level | Epsilon | Delta | k-Anonymity | Use Case
minimal | 10.0 | 1e-3 | 2 | Non-sensitive aggregates
standard | 1.0 | 1e-5 | 5 | General business data
high | 0.5 | 1e-6 | 10 | Sensitive financial data
maximum | 0.1 | 1e-8 | 20 | Regulated personal data

Data Subject Rights

Pure Synthetic Mode

Articles 15-22 (access, rectification, erasure, etc.) do not apply as no real data subjects exist in synthetic output.

Fingerprint Extraction Mode

  • Right of access (Art. 15): Fingerprints contain only aggregate statistics; individual records cannot be extracted
  • Right to erasure (Art. 17): Delete source data and fingerprint files; regenerate synthetic data with new parameters
  • Right to restriction (Art. 18): Suspend fingerprint extraction pipeline
  • Right to object (Art. 21): Remove individual from source dataset before extraction

International Transfers

  • Synthetic output: Generally not subject to Chapter V transfer restrictions (no personal data)
  • Fingerprint files: Assess whether aggregate statistics constitute personal data in your jurisdiction
  • Source data: Standard GDPR transfer rules apply during fingerprint extraction

NIST SP 800-226 Alignment

DataSynth’s evaluation framework includes NIST SP 800-226 alignment reporting for synthetic data privacy assessment. Enable via:

privacy:
  nist_alignment_enabled: true

See Also

SOC 2 Type II Readiness

This document describes how DataSynth’s architecture and controls align with the AICPA Trust Services Criteria (TSC) used in SOC 2 Type II engagements. DataSynth is a synthetic data generation tool, not a cloud-hosted SaaS product, so this assessment focuses on the controls embedded in the software itself rather than organizational policies. Organizations deploying DataSynth should layer their own operational controls (change management, personnel security, vendor management) on top of the technical controls described here.

Assessment Scope

  • System: DataSynth synthetic financial data generator
  • Version: 0.5.x
  • Deployment Models: CLI binary, REST/gRPC/WebSocket server, Python library, desktop application
  • Assessment Type: Architecture readiness (pre-audit self-assessment)

CC1: Security

The Security criterion (Common Criteria) requires that the system is protected against unauthorized access, both logical and physical.

Authentication

DataSynth’s server component (datasynth-server) implements two authentication mechanisms:

API Key Authentication: API keys are hashed with Argon2id (memory-hard, side-channel resistant) at server startup. Verification iterates all stored hashes without short-circuiting to prevent timing-based enumeration. A short-lived (5-second TTL) FNV-1a hash cache avoids repeated Argon2id computation for successive requests from the same client. Keys are accepted via Authorization: Bearer <key> or X-API-Key headers.

JWT/OIDC (optional jwt feature): External identity providers (Keycloak, Auth0, Entra ID) issue RS256-signed tokens. The JwtValidator verifies issuer, audience, expiration, and signature. Claims include subject, email, roles, and tenant ID for multi-tenancy.

Authorization

Role-Based Access Control (RBAC) enforces least-privilege access:

Role | GenerateData | ManageJobs | ViewJobs | ManageConfig | ViewConfig | ViewMetrics | ManageApiKeys
Admin | Y | Y | Y | Y | Y | Y | Y
Operator | Y | Y | Y | N | Y | Y | N
Viewer | N | N | Y | N | Y | Y | N

RBAC can be disabled for development environments; when disabled, all authenticated requests are treated as Admin.

Network Security

The security headers middleware injects the following headers on all server responses:

Header | Value | Purpose
X-Content-Type-Options | nosniff | Prevent MIME-type sniffing
X-Frame-Options | DENY | Prevent clickjacking
Content-Security-Policy | default-src 'none'; frame-ancestors 'none' | Restrict resource loading
Referrer-Policy | strict-origin-when-cross-origin | Limit referrer leakage
Cache-Control | no-store | Prevent caching of API responses
X-XSS-Protection | 0 | Defer to CSP (modern best practice)

TLS termination is supported via reverse proxy (nginx, Caddy, Envoy) or Kubernetes ingress. CORS is configurable with allowlisted origins.

Rate Limiting

Per-client rate limiting uses a sliding-window counter with configurable thresholds (requests per second, burst size). A Redis-backed rate limiter is available for multi-instance deployments (redis feature flag).
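
A std-only sketch of a sliding-window counter for one client (illustrative; the production limiter and its Redis backend are more involved):

use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Allows at most `max_requests` per sliding `window` for a single client.
struct SlidingWindow {
    window: Duration,
    max_requests: usize,
    hits: VecDeque<Instant>,
}

impl SlidingWindow {
    fn new(window: Duration, max_requests: usize) -> Self {
        Self { window, max_requests, hits: VecDeque::new() }
    }

    fn allow(&mut self, now: Instant) -> bool {
        // Drop timestamps that have aged out of the window.
        while self.hits.front().map_or(false, |&t| now.duration_since(t) > self.window) {
            self.hits.pop_front();
        }
        if self.hits.len() < self.max_requests {
            self.hits.push_back(now);
            true
        } else {
            false
        }
    }
}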


CC2: Availability

The Availability criterion requires that the system is available for operation and use as committed.

Graceful Degradation

The DegradationController in datasynth-core monitors memory, disk, and CPU utilization and applies progressive feature reduction:

Level | Memory | Disk | CPU | Response
Normal | < 70% | > 1000 MB | < 80% | All features enabled, full batch sizes
Reduced | 70–85% | 500–1000 MB | 80–90% | Half batch sizes, skip data quality injection
Minimal | 85–95% | 100–500 MB | > 90% | Essential data only, no anomaly injection
Emergency | > 95% | < 100 MB | — | Flush pending writes, terminate gracefully

Auto-recovery with hysteresis (5% improvement required) allows the system to step back up one level at a time when resource pressure subsides.

Resource Monitoring

  • Memory guard: Reads /proc/self/statm (Linux) or ps (macOS) to track resident set size against configurable limits.
  • Disk guard: Uses statvfs (Unix) or GetDiskFreeSpaceExW (Windows) to monitor available disk space in the output directory.
  • CPU monitor: Tracks CPU utilization with auto-throttle at 0.95 threshold.
  • Resource guard: Unified orchestration that combines all three monitors and drives the DegradationController.

Graceful Shutdown

The server handles SIGTERM by stopping acceptance of new requests, waiting for in-flight requests to complete (with configurable timeout), and flushing pending output. The CLI supports SIGUSR1 for pause/resume of generation runs.

Health Endpoints

The following endpoints are exempt from authentication for infrastructure integration:

Endpoint | Purpose
/health | General health check
/ready | Readiness probe (Kubernetes)
/live | Liveness probe (Kubernetes)
/metrics | Prometheus-compatible metrics

CC3: Processing Integrity

The Processing Integrity criterion requires that system processing is complete, valid, accurate, timely, and authorized.

Deterministic Generation

DataSynth uses the ChaCha8 cryptographically secure pseudo-random number generator with a configurable seed. Given the same configuration YAML and seed value, output is byte-identical across runs and platforms. This provides auditability (reproduce any dataset from its configuration) and regression detection (compare output hashes after code changes).

Quality Gates

The evaluation framework (datasynth-eval) applies configurable pass/fail criteria to every generation run. Built-in quality gate profiles provide three levels of strictness:

Metric | Strict | Default | Lenient
Benford MAD | <= 0.01 | <= 0.015 | <= 0.03
Balance Coherence | >= 0.999 | >= 0.99 | >= 0.95
Document Chain Integrity | >= 0.95 | >= 0.90 | >= 0.80
Completion Rate | >= 0.99 | >= 0.95 | >= 0.90
Duplicate Rate | <= 0.001 | <= 0.01 | <= 0.05
Referential Integrity | >= 0.999 | >= 0.99 | >= 0.95
IC Match Rate | >= 0.99 | >= 0.95 | >= 0.85
Privacy MIA AUC | <= 0.55 | <= 0.60 | <= 0.70

Gate evaluation supports fail-fast (stop on first failure) and collect-all (report all failures) strategies.

Balance Validation

The JournalEntry model enforces debits = credits at construction time. An entry that does not balance cannot be created, eliminating an entire class of data integrity errors.
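
A simplified sketch of construction-time balance enforcement with rust_decimal (hypothetical LineItem/JournalEntry shapes, not the datasynth-core model):

use rust_decimal::Decimal;

struct LineItem { debit: Decimal, credit: Decimal }

struct JournalEntry { lines: Vec<LineItem> }

impl JournalEntry {
    /// Rejects any entry whose total debits do not equal total credits.
    fn new(lines: Vec<LineItem>) -> Result<Self, String> {
        let debits: Decimal = lines.iter().map(|l| l.debit).sum();
        let credits: Decimal = lines.iter().map(|l| l.credit).sum();
        if debits == credits {
            Ok(Self { lines })
        } else {
            Err(format!("unbalanced entry: debits {debits} != credits {credits}"))
        }
    }
}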

Content Marking

EU AI Act Article 50 synthetic content credentials are embedded in all output files (CSV headers, JSON metadata, Parquet file metadata). This prevents synthetic data from being mistaken for real financial records. Content marking is enabled by default.


CC4: Confidentiality

The Confidentiality criterion requires that information designated as confidential is protected as committed.

No Real Data Storage

In the default operating mode (pure synthetic generation), DataSynth does not process, store, or transmit real data. All names, identifiers, transactions, and addresses are algorithmically generated from configuration parameters and RNG output.

Fingerprint Privacy

When the fingerprint extraction workflow processes real data, the following privacy controls apply:

Mechanism | Default (Standard Level)
Differential privacy (Laplace) | Epsilon = 1.0, Delta = 1e-5
K-anonymity suppression | K >= 5
Composition accounting | Naive (Renyi DP, zCDP available)

The output .dsf fingerprint file contains only aggregate statistics (means, variances, correlations), not individual records.

API Key Security

API keys are never stored in plaintext. At server startup, raw keys are hashed with Argon2id (random salt, PHC format) and discarded. Verification uses Argon2id comparison that iterates all stored hashes to prevent timing-based key enumeration.
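
A hedged sketch of this verification approach using the argon2 crate's password-hash API (simplified; the production code additionally maintains the short-lived FNV-1a cache described elsewhere in this document):

use argon2::{Argon2, PasswordHash, PasswordVerifier};

/// Checks a presented key against every stored PHC hash without short-circuiting,
/// so timing does not reveal which (if any) hash matched.
fn verify_api_key(presented: &str, stored_phc_hashes: &[String]) -> bool {
    let argon2 = Argon2::default();
    let mut matched = false;
    for phc in stored_phc_hashes {
        if let Ok(parsed) = PasswordHash::new(phc) {
            if argon2.verify_password(presented.as_bytes(), &parsed).is_ok() {
                matched = true; // keep iterating: constant work regardless of match position
            }
        }
    }
    matched
}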

Audit Logging

The JsonAuditLogger emits structured JSON audit events via the tracing crate. Each event records timestamp, request ID, actor identity (user ID or API key hash prefix), action, resource, outcome (success/denied/error), tenant ID, source IP, and user agent. Events are suitable for SIEM ingestion.


CC5: Privacy

The Privacy criterion requires that personal information is collected, used, retained, disclosed, and disposed of in conformity with commitments.

Synthetic Data by Design

DataSynth’s default mode generates purely synthetic data. No personal information is collected or processed. Generated entities (vendors, customers, employees) have no real-world counterparts. This eliminates most privacy obligations for pure synthetic workflows.

Privacy Evaluation

The evaluation framework includes empirical privacy testing:

  • Membership Inference Attack (MIA): Distance-based classifier measures AUC-ROC. A score near 0.50 indicates the synthetic data does not memorize real data patterns.
  • Linkage Attack Assessment: Evaluates re-identification risk using quasi-identifier combinations. Measures achieved k-anonymity and unique QI overlap.

NIST SP 800-226 Alignment

The evaluation framework generates NIST SP 800-226 alignment reports assessing data transformation adequacy, re-identification risk, documentation completeness, and privacy control effectiveness. An overall alignment score of >= 71% is required for a passing grade.

Fingerprint Extraction Privacy Levels

Level | Epsilon | Delta | K-Anonymity | Use Case
minimal | 10.0 | 1e-3 | 2 | Non-sensitive aggregates
standard | 1.0 | 1e-5 | 5 | General business data
high | 0.5 | 1e-6 | 10 | Sensitive financial data
maximum | 0.1 | 1e-8 | 20 | Regulated personal data

Controls Mapping

The following table maps DataSynth features to SOC 2 Trust Services Criteria identifiers.

TSC ID | Criterion | DataSynth Control | Implementation
CC6.1 | Logical access security | API key authentication | auth.rs: Argon2id hashing, timing-safe comparison
CC6.1 | Logical access security | JWT/OIDC support | auth.rs: RS256 token validation (optional jwt feature)
CC6.3 | Role-based access | RBAC enforcement | rbac.rs: Admin/Operator/Viewer roles with permission matrix
CC6.6 | System boundaries | Security headers | security_headers.rs: CSP, X-Frame-Options, HSTS support
CC6.6 | System boundaries | Rate limiting | rate_limit.rs: Per-client sliding window, Redis backend
CC6.8 | Transmission security | TLS support | Reverse proxy TLS termination, Kubernetes ingress
CC7.2 | Monitoring | Resource guards | resource_guard.rs: CPU, memory, disk monitoring
CC7.2 | Monitoring | Audit logging | audit.rs: Structured JSON events for SIEM
CC7.3 | Change detection | Config hashing | SHA-256 hash of configuration embedded in output
CC7.4 | Incident response | Content marking | Content credentials identify synthetic origin
CC8.1 | Processing integrity | Deterministic RNG | ChaCha8 with configurable seed
CC8.1 | Processing integrity | Quality gates | gates/engine.rs: Configurable pass/fail thresholds
CC8.1 | Processing integrity | Balance validation | JournalEntry enforces debits = credits at construction
CC9.1 | Availability management | Graceful degradation | degradation.rs: Normal/Reduced/Minimal/Emergency levels
CC9.1 | Availability management | Health endpoints | /health, /ready, /live (auth-exempt)
P3.1 | Privacy notice | Synthetic content marking | EU AI Act Article 50 credentials in all output
P4.1 | Collection limitation | No real data by default | Pure synthetic generation requires no data collection
P6.1 | Data quality | Quality gates | Statistical, coherence, and privacy quality metrics
P8.1 | Disposal | Deterministic generation | No persistent state; regenerate from config + seed

Gap Analysis

The following areas require organizational controls that are outside DataSynth’s software scope:

Area | Recommendation
Physical security | Deploy on infrastructure with appropriate physical access controls
Change management | Implement CI/CD pipelines with code review and approval workflows
Vendor management | Assess third-party dependencies via cargo audit and SBOM generation
Personnel security | Apply organizational onboarding/offboarding procedures for API key management
Backup and recovery | Configure backup for generation configurations and output data per retention policies
Incident response plan | Document procedures for scenarios where synthetic data is mistakenly treated as real

See Also

ISO 27001:2022 Alignment

This document maps DataSynth’s technical controls to the ISO/IEC 27001:2022 Annex A controls. DataSynth is a synthetic data generation tool, not a managed service, so this alignment focuses on controls that are directly addressable by the software. Organizational controls (A.5.1 through A.5.37), people controls (A.6), and physical controls (A.7) are primarily the responsibility of the deploying organization and are noted where DataSynth provides supporting capabilities.

Assessment Scope

  • System: DataSynth synthetic financial data generator
  • Version: 0.5.x
  • Standard: ISO/IEC 27001:2022 (Annex A controls from ISO/IEC 27002:2022)
  • Assessment Type: Self-assessment of technical control alignment

A.5 Organizational Controls

A.5.1 Policies for Information Security

DataSynth supports policy-as-code through its configuration management approach:

  • Configuration-as-code: All generation parameters are defined in version-controllable YAML files with typed schema validation. Invalid configurations are rejected before generation begins.
  • Industry presets: Pre-validated configurations for retail, manufacturing, financial services, healthcare, and technology industries reduce misconfiguration risk.
  • CLAUDE.md: The project’s development guidelines are codified and version-controlled alongside the source code, establishing security-relevant coding standards (#[deny(clippy::unwrap_used)], input validation requirements).

Organizations should supplement these technical controls with written information security policies governing DataSynth deployment, access, and data handling.

A.5.12 Classification of Information

DataSynth classifies all generated output as synthetic through the content marking system:

  • Embedded credentials: CSV headers, JSON metadata objects, and Parquet file metadata contain machine-readable ContentCredential records identifying the content as synthetic.
  • Human-readable declarations: Each credential includes a declaration field: “This content is synthetically generated and does not represent real transactions or entities.”
  • Configuration hash: SHA-256 hash of the generation configuration is embedded in output, enabling traceability from any output file back to its generation parameters.
  • Sidecar files: Optional .synthetic-credential.json sidecar files provide classification metadata alongside each output file.

A.5.23 Information Security for Use of Cloud Services

DataSynth supports cloud deployment through:

  • Kubernetes support: Helm charts and deployment manifests for containerized deployment with health (/health), readiness (/ready), and liveness (/live) probe endpoints.
  • Stateless server: The server component maintains no persistent state beyond in-memory generation jobs. Configuration and output are externalized, supporting cloud-native architectures.
  • TLS termination: Integration with Kubernetes ingress controllers, nginx, Caddy, and Envoy for TLS termination.
  • Secret management: API keys can be injected via environment variables or mounted secrets rather than hardcoded in configuration files.

A.8 Technological Controls

A.8.1 User Endpoint Devices

The CLI binary (datasynth-data) is a stateless executable:

  • No persistent credentials: The CLI does not store API keys, tokens, or session data on disk.
  • No network access required: The CLI operates entirely offline for generation workflows. Network access is only needed when connecting to a remote DataSynth server.
  • Deterministic output: Given the same configuration and seed, the CLI produces identical output, eliminating concerns about endpoint-specific state affecting results.

A.8.5 Secure Authentication

DataSynth implements multiple authentication mechanisms:

API Key Authentication:

  • Keys are hashed with Argon2id (memory-hard, timing-attack resistant) at server startup.
  • Raw keys are discarded after hashing; only PHC-format hashes are retained in memory.
  • Verification iterates all stored hashes without short-circuiting to prevent timing-based key enumeration.
  • A 5-second TTL cache using FNV-1a fast hashing reduces repeated Argon2id computation overhead.

JWT/OIDC Integration (optional jwt feature):

  • RS256 token validation with issuer, audience, and expiration checks.
  • Compatible with Keycloak, Auth0, and Microsoft Entra ID.
  • Claims extraction provides subject, email, roles, and tenant ID for downstream RBAC and audit.

Authentication Bypass:

  • Infrastructure endpoints (/health, /ready, /live, /metrics) are exempt from authentication to support load balancer and orchestrator probes.

A.8.9 Configuration Management

DataSynth enforces configuration integrity through:

  • Typed schema validation: YAML configuration is deserialized into strongly-typed Rust structs. Type mismatches, missing required fields, and constraint violations (e.g., rates outside 0.0–1.0, non-ascending approval thresholds) produce descriptive error messages before generation begins.
  • Complexity presets: Small (~100 accounts), medium (~400), and large (~2500) complexity levels provide pre-validated scaling parameters.
  • Template system: YAML/JSON templates with merge strategies enable configuration reuse while maintaining a single source of truth for shared settings.
  • Configuration hashing: SHA-256 hash of the resolved configuration is computed before generation and embedded in all output metadata, enabling drift detection.

A.8.12 Data Leakage Prevention

DataSynth’s architecture inherently prevents data leakage:

  • Synthetic-only generation: The default workflow generates data from statistical distributions and configuration parameters. No real data enters the pipeline.
  • Content marking: All output files carry machine-readable synthetic content credentials (EU AI Act Article 50). Third-party systems can detect and flag synthetic content programmatically.
  • Fingerprint privacy: When real data is used as input for fingerprint extraction, differential privacy (Laplace mechanism, configurable epsilon/delta) and k-anonymity suppress individual-level information. The resulting .dsf file contains only aggregate statistics.
  • Quality gate enforcement: The PrivacyMiaAuc quality gate validates that generated data does not memorize real data patterns (MIA AUC-ROC threshold).

A.8.16 Monitoring Activities

DataSynth provides monitoring at multiple layers:

Structured Audit Logging: The JsonAuditLogger emits structured JSON events via the tracing crate, recording:

  • Timestamp (UTC), request ID, actor identity
  • Action attempted, resource accessed, outcome (success/denied/error)
  • Tenant ID, source IP, user agent

Events are emitted at INFO level with a dedicated audit_event structured field for log aggregation filtering.

Resource Monitoring:

  • Memory guard reads /proc/self/statm (Linux) or ps (macOS) for resident set size tracking.
  • Disk guard uses statvfs (Unix) / GetDiskFreeSpaceExW (Windows) for available space monitoring.
  • CPU monitor tracks utilization with auto-throttle at 0.95 threshold.
  • The DegradationController combines all monitors and emits level-change events when resource pressure triggers degradation.

Generation Monitoring:

  • Run manifests capture configuration hash, seed, crate versions, start/end times, record counts, and quality gate results.
  • Prometheus-compatible /metrics endpoint exposes runtime statistics.

A.8.24 Use of Cryptography

DataSynth uses cryptographic primitives for the following purposes:

Purpose | Algorithm | Implementation
Deterministic RNG | ChaCha8 (CSPRNG) | rand_chacha crate, configurable seed
API key hashing | Argon2id | argon2 crate, random salt, PHC format
Configuration integrity | SHA-256 | Config hash embedded in output metadata
JWT verification | RS256 (RSA + SHA-256) | jsonwebtoken crate (optional jwt feature)
UUID generation | FNV-1a hash | Deterministic collision-free UUIDs with generator-type discriminators

Cryptographic operations use well-maintained Rust crate implementations. No custom cryptographic algorithms are implemented.

A.8.25 Secure Development Lifecycle

DataSynth’s development process includes:

  • Static analysis: cargo clippy with #[deny(clippy::unwrap_used)] enforces safe error handling across the codebase.
  • Test coverage: 2,500+ tests across 15 crates covering unit, integration, and property-based scenarios.
  • Dependency auditing: cargo audit checks for known vulnerabilities in dependencies.
  • Type safety: Rust’s ownership model and type system eliminate entire classes of memory safety and concurrency bugs at compile time.
  • MSRV policy: Minimum Supported Rust Version (1.88) ensures builds use a recent, well-supported compiler.
  • CI/CD: Automated build, test, lint, and audit checks on every commit.

A.8.28 Secure Coding

DataSynth applies secure coding practices:

  • No unwrap() in library code: #[deny(clippy::unwrap_used)] prevents panics from unchecked error handling.
  • Input validation: All user-provided configuration values are validated against typed schemas with range constraints before use.
  • Precise decimal arithmetic: Financial amounts use rust_decimal (serialized as strings) instead of IEEE 754 floating point, preventing rounding errors in financial calculations.
  • No unsafe code: The codebase does not use unsafe blocks in application logic.
  • Timing-safe comparisons: API key verification uses constant-time Argon2id comparison (iterating all hashes) to prevent side-channel attacks.
  • Memory-safe concurrency: Rust’s ownership model prevents data races at compile time. Shared state uses Arc<Mutex<>> or atomic operations.

Statement of Applicability

The following table summarizes the applicability of ISO 27001:2022 Annex A controls to DataSynth.

Implemented Controls

Control | Title | Implementation
A.5.1 | Information security policies | Configuration-as-code with schema validation
A.5.12 | Classification of information | Synthetic content marking (EU AI Act Article 50)
A.5.23 | Cloud service security | Kubernetes deployment, health probes, TLS support
A.8.1 | User endpoint devices | Stateless CLI with no persistent credentials
A.8.5 | Secure authentication | Argon2id API keys, JWT/OIDC, RBAC
A.8.9 | Configuration management | Typed schema validation, presets, hashing
A.8.12 | Data leakage prevention | Synthetic-only generation, content marking, fingerprint privacy
A.8.16 | Monitoring activities | Structured audit logs, resource monitors, run manifests
A.8.24 | Use of cryptography | ChaCha8 RNG, Argon2id, SHA-256, RS256 JWT
A.8.25 | Secure development lifecycle | Clippy, 2,500+ tests, cargo audit, CI/CD
A.8.28 | Secure coding | No unwrap, input validation, precise decimals, no unsafe

Partially Implemented Controls

Control | Title | Status | Gap
A.5.8 | Information security in project management | Partial | Security considerations are embedded in code (schema validation, quality gates) but formal project management security procedures are organizational
A.5.14 | Information transfer | Partial | TLS support for server API; file-based output transfer policies are organizational
A.5.29 | Information security during disruption | Partial | Graceful degradation handles resource pressure; broader business continuity is organizational
A.8.8 | Management of technical vulnerabilities | Partial | cargo audit scans dependencies; patch management cadence is organizational
A.8.15 | Logging | Partial | Structured JSON audit events with correlation IDs; log retention and SIEM integration are organizational
A.8.26 | Application security requirements | Partial | Input validation and schema enforcement are built-in; threat modeling documentation is organizational

Not Applicable Controls

Control | Title | Rationale
A.5.19 | Information security in supplier relationships | DataSynth is open-source software; supplier controls apply to the deploying organization
A.5.30 | ICT readiness for business continuity | Business continuity planning is an organizational responsibility
A.6.1–A.6.8 | People controls | Personnel security controls are organizational
A.7.1–A.7.14 | Physical controls | Physical security controls depend on deployment environment
A.8.2 | Privileged access rights | OS-level privilege management is outside DataSynth’s scope
A.8.7 | Protection against malware | Endpoint protection is an infrastructure concern
A.8.20 | Networks security | Network segmentation and firewall rules are infrastructure concerns
A.8.23 | Web filtering | Web filtering is an organizational network control

Continuous Improvement

DataSynth supports ISO 27001’s Plan-Do-Check-Act cycle through:

  • Plan: Configuration-as-code with schema validation enforces security requirements at design time.
  • Do: Automated quality gates and resource guards enforce controls during operation.
  • Check: Evaluation framework produces quantitative metrics (Benford MAD, balance coherence, MIA AUC-ROC) that can be trended over time.
  • Act: The AutoTuner in datasynth-eval generates configuration patches from evaluation gaps, creating a feedback loop for continuous improvement.


Roadmap: Enterprise Simulation & ML Ground Truth

This roadmap outlines completed features, planned enhancements, and the wave-based expansion strategy for enterprise process chain coverage.


Completed Features

v0.1.0 — Core Generation

  • Statistical distributions: Benford’s Law compliance, log-normal mixtures, copulas
  • Industry presets: Manufacturing, Retail, Financial Services, Healthcare, Technology
  • Chart of Accounts: Small (~100), Medium (~400), Large (~2500) complexity levels
  • Temporal patterns: Month-end/quarter-end volume spikes, business day calendars
  • Master data: Vendors, customers, materials, fixed assets, employees
  • Document flows: P2P (6 PO types, three-way match) and O2C (9 SO types, 6 delivery types, 7 invoice types)
  • Intercompany: IC matching, transfer pricing, consolidation elimination entries
  • Subledgers: AR (aging, dunning), AP (scheduling, discounts), FA (6 depreciation methods), Inventory (22 movement types, 4 valuation methods)
  • Currency & FX: Ornstein-Uhlenbeck exchange rates, ASC 830 translation, CTA
  • Period close: Monthly close engine, accruals, depreciation runs, year-end closing
  • Balance coherence: Opening balances, running balance tracking, trial balance per period
  • Anomaly injection: 60+ fraud types, error patterns, process issues with full labeling
  • Data quality: Missing values (MCAR/MAR/MNAR), format variations, typos, duplicates
  • Graph export: PyTorch Geometric, Neo4j, DGL with train/val/test splits
  • Internal controls: COSO 2013 framework, SoD rules, 12 transaction + 6 entity controls
  • Resource guards: Memory, disk, CPU monitoring with graceful degradation
  • REST/gRPC/WebSocket server with authentication and rate limiting
  • Desktop UI: Tauri/SvelteKit with 15+ configuration pages
  • Python wrapper: Programmatic access with blueprints and config validation

v0.2.0 — Privacy & Standards

  • Fingerprint extraction: Statistical properties from real data into .dsf files
  • Differential privacy: Laplace and Gaussian mechanisms with configurable epsilon
  • K-anonymity: Suppression of rare categorical values
  • Fidelity evaluation: KS, Wasserstein, Benford MAD metric comparison
  • Gaussian copula synthesis: Preserve multivariate correlations
  • Accounting standards: Revenue recognition (ASC 606/IFRS 15), Leases (ASC 842/IFRS 16), Fair Value (ASC 820/IFRS 13), Impairment (ASC 360/IAS 36)
  • Audit standards: ISA compliance (34 standards), analytical procedures, confirmations, opinions, PCAOB mappings
  • SOX compliance: Section 302/404 assessments, deficiency matrix, material weakness classification
  • Streaming output: CSV, JSON, NDJSON, Parquet streaming sinks with backpressure
  • ERP output formats: SAP S/4HANA (BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA), Oracle EBS (GL_JE_HEADERS/LINES), NetSuite

v0.3.0 — Fraud & Industry

  • ACFE-aligned fraud taxonomy: Asset misappropriation, corruption, financial statement fraud calibrated to ACFE statistics
  • Collusion modeling: 8 ring types, 6 conspirator roles, defection/escalation dynamics
  • Management override: Fraud triangle modeling (pressure, opportunity, rationalization)
  • Red flag generation: 40+ probabilistic indicators with Bayesian probabilities
  • Industry-specific generators: Manufacturing (BOM, WIP, production orders), Retail (POS, shrinkage, loyalty), Healthcare (ICD-10, CPT, DRG, payer mix)
  • Industry benchmarks: Pre-configured ML benchmarks per industry
  • Banking/KYC/AML: Customer personas, KYC profiles, fraud typologies (structuring, funnel, layering, mule, round-tripping)
  • Process mining: OCEL 2.0 event logs with P2P and O2C processes
  • Evaluation framework: Auto-tuning with configuration recommendations from metric gaps
  • Vendor networks: Tiered supply chains, quality scores, clusters
  • Customer segmentation: Value segments, lifecycle stages, network positions
  • Cross-process links: Entity graph, relationship strength, cross-process integration

v0.5.0 — AI & Advanced Features

  • LLM-augmented generation: Pluggable provider abstraction (Mock, OpenAI, Anthropic) for realistic vendor names, descriptions, memo fields, and anomaly explanations
  • Natural language configuration: Generate YAML configs from descriptions
  • Diffusion model backend: Statistical diffusion with configurable noise schedules (linear, cosine, sigmoid) for learned distribution capture
  • Hybrid generation: Blend rule-based and diffusion outputs
  • Causal generation: Structural Causal Models (SCMs), do-calculus interventions, counterfactual generation
  • Built-in causal templates: fraud_detection and revenue_cycle causal graphs
  • Federated fingerprinting: Secure aggregation (weighted average, median, trimmed mean) for distributed data sources
  • Synthetic data certificates: Cryptographic proof of DP guarantees with HMAC-SHA256 signing
  • Privacy-utility Pareto frontier: Automated exploration of optimal epsilon values
  • Ecosystem integrations: Airflow, dbt, MLflow, Spark pipeline integration

Planned Enhancements

Wave 1 — Foundation (enables everything else)

These items close the most critical gaps and unblock downstream work.

ItemChainDescriptionDependencies
S2C completionS2PSource-to-Contract: spend analysis, RFx, bid evaluation, contract management, catalog items, supplier scorecardsExtends existing P2P
Bank reconciliationBANKBank statement lines, auto-matching, reconciliation breaks, clearingValidates all payment chains
Financial statement generatorR2RBalance sheet, income statement, cash flow statement from trial balanceConsumes all JE data

Impact: S2C creates a closed-loop procurement model. Bank reconciliation validates payment integrity across S2P and O2C. Financial statements provide the final reporting layer for R2R.

Wave 2 — Core Process Chains

ItemChainDescriptionDependencies
Payroll & time managementH2RPayroll runs, time entries, overtime, benefits, tax withholdingEmployee master data
Revenue recognition generatorO2C→R2RWire CustomerContract + PerformanceObligation models to SO/Invoice dataExisting ASC 606 models
Impairment generatorA2R→R2RWire existing ImpairmentTest model to FA generator with JE outputExisting ASC 360 models

Impact: Payroll is the largest H2R gap and enables SoD analysis for personnel. Revenue recognition and impairment generators wire existing standards models into the generation pipeline.

Wave 3 — Operational Depth

ItemChainDescriptionDependencies
Production orders & WIPMFGProduction order lifecycle, material consumption, WIP costing, variance analysisManufacturing industry config
Cycle counting & QAINVCycle count programs, quality inspection, inspection lots, vendor quality feedbackInventory subledger
Expense managementH2RExpense reports, policy enforcement, receipt matching, reimbursementEmployee master data

Impact: Manufacturing becomes a fully simulated chain. Inventory completeness enables ABC analysis and obsolescence. Expenses extend H2R with AP integration.

Wave 4 — Polish

ItemChainDescriptionDependencies
Sales quotesO2CQuote-to-order conversion tracking (fills orphan quote_id FK)O2C generator
Cash forecastingBANKProjected cash flows from AP/AR schedulesAP/AR subledgers
KPIs & budget varianceR2RManagement reporting, budget vs actual analysisFinancial statements
Obsolescence managementINVSlow-moving/excess stock identification and write-downsInventory aging

Impact: These items round out each chain with planning and reporting capabilities.


Cross-Process Integration Vision

The wave plan steadily increases cross-process coverage:

IntegrationCurrentAfter Wave 1After Wave 2After Wave 4
S2P → InventoryGR updates stockSameSameSame
Inventory → O2CDelivery reduces stockSameSameObsolescence feeds write-downs
S2P/O2C → BANKPayments createdPayments reconciledSameCash forecasting
All → R2RJEs → Trial BalanceJEs → Financial Statements+ Revenue recog, impairment+ Budget variance
H2R → S2PEmployee authorizationsSameExpense → APSame
S2P → A2RCapital PO → FASameSameSame
MFG → S2PConfig onlySameProduction → PR demandSame
MFG → INVConfig onlySameWIP → FG transfers+ QA feedback

Coverage Targets

ChainCurrentWave 1Wave 2Wave 3Wave 4
S2P85%95%95%95%95%
O2C93%93%97%97%99%
R2R78%88%92%92%97%
A2R70%70%80%80%80%
INV55%55%55%75%85%
BANK65%85%85%85%90%
H2R30%30%60%75%75%
MFG20%20%20%60%60%

Guiding Principles

  • Enterprise realism: Simulate multi-entity, multi-region, multi-currency operations with coherent process flows
  • ML ground truth: Capture true labels and causal factors for supervised learning, explainability, and evaluation
  • Scalability: Handle large volumes with stable performance and reproducible results
  • Backward compatibility: New features are additive; existing configs continue to work

Dependencies & Risks

  • Schema stability: New models must not break existing serialization formats
  • Performance: Each wave adds generators; resource guards ensure stable memory/CPU
  • Validation complexity: Cross-chain coherence checks multiply as integration points increase

Contributing

We welcome contributions to any roadmap area. See Contributing Guidelines for details.

To propose new features:

  1. Open a GitHub issue with the enhancement label
  2. Describe the use case and expected behavior
  3. Reference relevant roadmap items if applicable

Feedback

Roadmap priorities are influenced by user feedback. Please share your use cases and requirements.


Production Readiness Roadmap

Version: 1.0 | Date: February 2026 | Status: Living Document

This roadmap addresses the infrastructure, operations, security, compliance, and ecosystem maturity required to transition DataSynth from a feature-complete beta to a production-grade enterprise platform. It complements the existing feature roadmap, which covers domain-specific enhancements.




Current State Assessment

Production Readiness Scorecard (v0.5.0 — Phase 2 Complete)

CategoryScoreStatusKey Findings
Workspace Structure9/10Excellent15 well-organized crates, clear separation of concerns
Testing10/10Excellent2,500+ tests, property testing via proptest, fuzzing harnesses (cargo-fuzz), k6 load tests, coverage via cargo-llvm-cov + Codecov
CI/CD9/10Excellent7-job pipeline: fmt, clippy, cross-platform test (Linux/macOS/Windows), MSRV 1.88, security scanning (cargo-deny + cargo-audit), coverage, benchmark regression
Error Handling10/10ExcellentIdiomatic thiserror/anyhow; #![deny(clippy::unwrap_used)] enforced across all library crates; zero unwrap calls in non-test code
Observability9/10ExcellentStructured JSON logging, feature-gated OpenTelemetry (OTLP traces + Prometheus metrics), request ID propagation, request logging middleware, data lineage graph
Deployment10/10ExcellentMulti-stage Dockerfile (distroless), Docker Compose, Kubernetes Helm chart (HPA, PDB, Redis subchart), SystemD service, comprehensive deployment guides (Docker, K8s, bare-metal)
Security9/10ExcellentArgon2id key hashing with timing-safe comparison, security headers, request validation, TLS support (rustls), env var interpolation for secrets, cargo-deny + cargo-audit in CI, security hardening guide
Performance9/10Excellent5 Criterion benchmark suites, 100K+ entries/sec; CI benchmark regression tracking on PRs; k6 load testing framework
Python Bindings8/10StrongStrict mypy, PEP 561 compliant, blueprints; classified as “Beta”, no async support
Server10/10ExcellentREST/gRPC/WebSocket complete; async job queue; distributed rate limiting (Redis); stateless config loading; enhanced probes; full middleware stack
Documentation10/10ExcellentmdBook + rustdoc + CHANGELOG + CONTRIBUTING; deployment guides (Docker, K8s, bare-metal), operational runbook, capacity planning, DR procedures, API reference, security hardening
Code Quality10/10ExcellentZero TODO/FIXME comments, warnings-as-errors enforced, panic-free library crates, 6 unsafe blocks (all justified)
Privacy9/10ExcellentFormal DP composition (RDP, zCDP), privacy budget management, MIA/linkage evaluation, NIST SP 800-226 alignment, SynQP matrix, custom privacy levels
Data Lineage9/10ExcellentPer-file checksums, lineage graph, W3C PROV-JSON export, CLI verify command for manifest integrity

Overall: 9.4/10 — Enterprise-grade with Kubernetes deployment, formal privacy guarantees, panic-free library code, comprehensive operations documentation, and data lineage tracking. Remaining gaps: RBAC/OAuth2, plugin SDK, Python async support.


Phase 1: Foundation (0-3 months)

Goal: Establish the minimum viable production infrastructure.

1.1 Containerization & Packaging

Priority: Critical | Effort: Medium

DeliverableDescription
Multi-stage DockerfileRust builder stage + distroless/alpine runtime (~20MB image)
Docker ComposeLocal dev stack: server + Prometheus + Grafana + Redis
OCI image publishingGitHub Actions workflow to push to GHCR/ECR on tagged releases
Binary distributionPre-built binaries for Linux (x86_64, aarch64), macOS (Apple Silicon), Windows
SystemD service fileProduction daemon configuration with resource limits

Implementation Notes:

# Target image structure
FROM rust:1.88-bookworm AS builder
# ... build with --release
FROM gcr.io/distroless/cc-debian12
COPY --from=builder /app/target/release/datasynth-server /
EXPOSE 3000
ENTRYPOINT ["/datasynth-server"]

1.2 Security Hardening

Priority: Critical | Effort: Medium

DeliverableDescription
API key hashingArgon2id for stored keys; timing-safe comparison via subtle crate
Request validation middlewareContent-Type enforcement, configurable max body size (default 10MB)
TLS supportNative rustls integration or documented reverse proxy (nginx/Caddy) setup
Secrets managementEnvironment variable interpolation in config (${ENV_VAR} syntax)
Security headersX-Content-Type-Options, X-Frame-Options, Strict-Transport-Security
Input sanitizationValidate all user-supplied config values before processing
Dependency auditingcargo-audit and cargo-deny in CI pipeline
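
As an illustration of the ${ENV_VAR} interpolation listed above, a minimal sketch that scans a config string and substitutes environment values; DataSynth's actual loader may handle escaping, defaults, and validation differently:

use std::env;

/// Replace ${VAR} placeholders with environment values; unresolved
/// placeholders are left as-is so schema validation can flag them later.
fn interpolate_env(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut rest = raw;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        match rest[start..].find('}') {
            Some(end) => {
                let var = &rest[start + 2..start + end];
                match env::var(var) {
                    Ok(value) => out.push_str(&value),
                    Err(_) => out.push_str(&rest[start..=start + end]),
                }
                rest = &rest[start + end + 1..];
            }
            None => {
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}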

1.3 Observability Stack

Priority: Critical | Effort: Medium

DeliverableDescription
OpenTelemetry integrationReplace custom metrics with opentelemetry + opentelemetry-otlp crates
Structured loggingJSON-formatted logs with request IDs, span context, correlation traces
Prometheus metricsGeneration throughput, latency histograms, error rates, resource utilization
Distributed tracingTrace generation pipeline phases end-to-end with span hierarchy
Health check enhancementAdd dependency checks (disk space, memory) to /ready endpoint
Alert rulesExample Prometheus alerting rules for SLO violations

Key Metrics to Instrument (a registration sketch follows the list):

  • datasynth_generation_entries_total (Counter) — Total entries generated
  • datasynth_generation_duration_seconds (Histogram) — Per-phase latency
  • datasynth_generation_errors_total (Counter) — Errors by type
  • datasynth_memory_usage_bytes (Gauge) — Current memory consumption
  • datasynth_active_sessions (Gauge) — Concurrent generation sessions
  • datasynth_api_request_duration_seconds (Histogram) — API latency by endpoint
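
For illustration, the metrics above could be registered as follows with the prometheus crate; the roadmap targets the OpenTelemetry SDK, whose metrics API differs, so treat this as a sketch rather than the shipped instrumentation:

use prometheus::{Histogram, HistogramOpts, IntCounter, IntGauge, Opts, Registry};

/// Register a representative subset of the metrics listed above.
fn register_metrics(registry: &Registry) -> prometheus::Result<(IntCounter, Histogram, IntGauge)> {
    let entries_total = IntCounter::with_opts(Opts::new(
        "datasynth_generation_entries_total",
        "Total journal entries generated",
    ))?;
    let duration = Histogram::with_opts(HistogramOpts::new(
        "datasynth_generation_duration_seconds",
        "Per-phase generation latency",
    ))?;
    let memory = IntGauge::with_opts(Opts::new(
        "datasynth_memory_usage_bytes",
        "Current memory consumption",
    ))?;
    registry.register(Box::new(entries_total.clone()))?;
    registry.register(Box::new(duration.clone()))?;
    registry.register(Box::new(memory.clone()))?;
    Ok((entries_total, duration, memory))
}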

1.4 CI/CD Hardening

Priority: High | Effort: Low

DeliverableDescription
Code coveragecargo-tarpaulin or cargo-llvm-cov with Codecov integration
Security scanningcargo-audit for CVEs, cargo-deny for license compliance
MSRV validationCI job testing against minimum supported Rust version (1.88)
Cross-platform matrixTest on Linux, macOS, Windows in CI
Benchmark trackingCriterion results uploaded to GitHub Pages; regression alerts on PRs
Release automationSemantic versioning with auto-changelog via git-cliff
Container scanningTrivy or Grype scanning of published Docker images

Phase 2: Hardening (3-6 months)

Goal: Enterprise-grade reliability, scalability, and compliance foundations.

2.1 Scalability & High Availability

Priority: High | Effort: High

DeliverableDescription
Redis-backed rate limitingDistributed rate limiting via redis-rs for multi-instance deployments
Horizontal scalingStateless server design; shared config via Redis/S3
Kubernetes Helm chartProduction-ready chart with HPA, PDB, resource limits, readiness probes
Load testing frameworkk6 or Locust scripts for API stress testing
Graceful rolling updatesZero-downtime deployments with connection draining
Job queueAsync generation jobs with status tracking (Redis Streams or similar)

2.2 Data Lineage & Provenance

Priority: High | Effort: Medium

DeliverableDescription
Generation manifestJSON/YAML file recording: config hash, seed, version, timestamp, checksums for all outputs
Data lineage graphTrack which config section produced which output file and row ranges
Reproducibility verificationCLI command: datasynth-data verify --manifest manifest.json --output ./output/
W3C PROV compatibilityExport lineage in W3C PROV-JSON format for interoperability
Audit trailAppend-only log of all generation runs with user, config, and output metadata
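
A sketch of what such a manifest could look like as serde types; the field names are assumptions for illustration, not the shipped schema:

use serde::{Deserialize, Serialize};

/// Illustrative manifest shape recorded once per generation run.
#[derive(Serialize, Deserialize)]
struct GenerationManifest {
    generator_version: String,
    config_hash: String,       // SHA-256 of the resolved YAML configuration
    seed: u64,                 // ChaCha8 seed that makes the run reproducible
    created_at: String,        // RFC 3339 timestamp
    outputs: Vec<OutputRecord>,
}

#[derive(Serialize, Deserialize)]
struct OutputRecord {
    path: String,
    sha256: String,
    rows: u64,
}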

Rationale: Data lineage is becoming a regulatory requirement under the EU AI Act (Article 10 — data governance for training data) and is a key differentiator in the enterprise synthetic data market. NIST AI RMF 1.0 also emphasizes provenance tracking under its MAP and MEASURE functions.

2.3 Enhanced Privacy Guarantees

Priority: High | Effort: High

DeliverableDescription
Formal DP accountingImplement Renyi DP and zero-concentrated DP (zCDP) composition tracking
Privacy budget managementGlobal budget tracking across multiple generation runs
Membership inference testingAutomated MIA evaluation as post-generation quality gate
NIST SP 800-226 alignmentValidate DP implementation against NIST Guidelines for Evaluating DP Guarantees
SynQP framework integrationImplement the IEEE SynQP evaluation matrix for joint quality-privacy assessment
Configurable privacy levelsPresets: relaxed (ε=10), standard (ε=1), strict (ε=0.1) with utility tradeoff documentation

Research Context: NIST SP 800-226 (Guidelines for Evaluating Differential Privacy Guarantees) provides the authoritative framework for DP evaluation. The SynQP framework (IEEE, 2025) introduces standardized privacy-quality evaluation matrices. Benchmarking DP tabular synthesis algorithms was a key topic at TPDP 2025, and federated DP approaches (FedDPSyn) are emerging for distributed generation.
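
For intuition on the ε presets above, a minimal Laplace-mechanism sketch assuming the rand crate; a smaller ε yields a larger noise scale and therefore stronger privacy at lower utility:

use rand::Rng;

/// Sample Laplace noise with scale b = sensitivity / epsilon via inverse-CDF sampling.
fn laplace_noise<R: Rng>(rng: &mut R, sensitivity: f64, epsilon: f64) -> f64 {
    let scale = sensitivity / epsilon;
    let u: f64 = rng.gen_range(-0.5..0.5); // uniform on [-0.5, 0.5)
    -scale * u.signum() * (1.0 - 2.0 * u.abs()).ln()
}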

2.4 Unwrap Audit & Robustness

Priority: Medium | Effort: Medium

DeliverableDescription
Unwrap eliminationAudit and replace ~2,300 unwrap() calls in non-test code with proper error handling
Panic-free guaranteeAdd #![deny(clippy::unwrap_used)] lint for library crates (not test/bench)
Fuzzing harnessescargo-fuzz targets for config parsing, fingerprint loading, and API endpoints
Property test expansionIncrease proptest coverage for statistical invariants and balance coherence

2.5 Documentation: Operations

Priority: Medium | Effort: Low

DeliverableDescription
Deployment guideDocker, K8s, bare-metal deployment with step-by-step instructions
Operational runbookMonitoring dashboards, common alerts, troubleshooting procedures
Capacity planning guideMemory/CPU/disk sizing for different generation scales
Disaster recoveryBackup/restore procedures for server state and configurations
API rate limits documentationDocument auth, rate limiting, and CORS behavior for integrators
Security hardening guideChecklist for production security configuration

Phase 3: Enterprise Grade (6-12 months)

Goal: Enterprise features, compliance certifications, and ecosystem maturity.

3.1 Multi-Tenancy & Access Control

Priority: High | Effort: High

DeliverableDescription
RBACRole-based access control (admin, operator, viewer) with JWT/OAuth2
Tenant isolationNamespace-based isolation for multi-tenant SaaS deployment
Audit loggingStructured audit events for all API actions (who/what/when)
SSO integrationSAML 2.0 and OIDC support for enterprise identity providers
API versioningURL-based API versioning (v1, v2) with deprecation lifecycle

3.2 Advanced Evaluation & Quality Gates

Priority: High | Effort: Medium

DeliverableDescription
Automated quality gatesPre-configured pass/fail criteria for generation runs
Benchmark suite expansionDomain-specific benchmarks: financial realism, fraud detection efficacy, audit trail coherence
Regression testingGolden dataset comparison with tolerance thresholds
Quality dashboardWeb-based visualization of quality metrics over time
Third-party validationIntegration with SDMetrics and SDV evaluation utilities

Quality Metrics to Implement (a distance-metric sketch follows the list):

  • Statistical fidelity: Column distribution similarity (KL divergence, Wasserstein distance)
  • Structural fidelity: Correlation matrix preservation, inter-table referential integrity
  • Privacy: Nearest-neighbor distance ratio, attribute disclosure risk, identity disclosure risk (SynQP)
  • Utility: Train-on-synthetic-test-on-real (TSTR) ML performance parity
  • Temporal fidelity: Autocorrelation preservation, seasonal pattern retention
  • Domain-specific: Benford compliance MAD, balance equation coherence, document chain integrity
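
As an example of the statistical-fidelity metrics above, a minimal empirical 1-D Wasserstein-1 distance for equal-size samples (sort both samples, then average the absolute gaps between order statistics):

/// Empirical Wasserstein-1 distance between two equal-length 1-D samples.
fn wasserstein_1d(mut a: Vec<f64>, mut b: Vec<f64>) -> f64 {
    assert_eq!(a.len(), b.len(), "this sketch assumes equal sample sizes");
    a.sort_by(|x, y| x.total_cmp(y));
    b.sort_by(|x, y| x.total_cmp(y));
    a.iter().zip(&b).map(|(x, y)| (x - y).abs()).sum::<f64>() / a.len() as f64
}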

3.3 Plugin & Extension SDK

Priority: Medium | Effort: High

DeliverableDescription
Generator trait APIStable, documented trait interface for custom generators
Plugin loadingDynamic plugin loading via libloading or WASM runtime
Template marketplaceRepository of community-contributed industry templates
Custom output sinksPlugin API for custom export formats (database write, S3, GCS)
Webhook systemEvent-driven notifications (generation start/complete/error)
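
A hypothetical shape for the generator trait; the eventual SDK will likely differ in naming, batching, and error types, so this only illustrates the kind of surface a plugin would implement:

use serde_json::Value;

/// Hypothetical plugin-facing trait; names and signatures are illustrative only.
pub trait SyntheticGenerator: Send + Sync {
    /// Stable identifier referenced from configuration files.
    fn name(&self) -> &str;
    /// Validate the plugin-specific configuration section before generation starts.
    fn validate(&self, config: &Value) -> Result<(), String>;
    /// Produce one batch of records; the host drives batching and backpressure.
    fn generate_batch(&mut self, batch_size: usize) -> Result<Vec<Value>, String>;
}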

3.4 Python Ecosystem Maturity

Priority: Medium | Effort: Medium

DeliverableDescription
Async supportasyncio-compatible API using websockets for streaming
Conda packagePublish to conda-forge for data science workflows
Jupyter integrationExample notebooks for common use cases (fraud ML, audit analytics)
pandas/polars integrationDirect DataFrame output without intermediate CSV
PyPI 1.0.0 releasePromote from Beta to Production/Stable classifier
Type stubsComplete .pyi stubs for IDE support

3.5 Regulatory Compliance Framework

Priority: Medium | Effort: Medium

DeliverableDescription
EU AI Act readinessSynthetic content marking (Article 50), training data documentation (Article 10)
NIST AI RMF alignmentSelf-assessment against MAP, MEASURE, MANAGE, GOVERN functions
SOC 2 Type II preparationDocument controls for security, availability, processing integrity
GDPR compliance documentationData processing documentation, privacy impact assessment template
ISO 27001 alignmentInformation security management system controls mapping

Regulatory Context: The EU AI Act’s Article 50 transparency obligations (enforceable August 2026) require AI systems generating synthetic content to mark outputs as artificially generated in a machine-readable format. Article 10 mandates training data governance, including documentation of data sources. Organizations face penalties of up to €35M or 7% of global turnover for non-compliance. The NIST AI RMF 1.0 (expanded significantly through 2024-2025) provides a voluntary framework that is becoming the “operational layer” beneath regulatory compliance globally.


Phase 4: Market Leadership (12-18 months)

Goal: Cutting-edge capabilities informed by latest research, establishing DataSynth as the reference platform for financial synthetic data.

4.1 LLM-Augmented Generation

Priority: Medium | Effort: High

DeliverableDescription
LLM-guided metadata enrichmentUse LLMs to generate realistic vendor names, descriptions, memo fields
Natural language configGenerate YAML configs from natural language descriptions (“Generate 1 year of manufacturing data for a mid-size German company”)
Semantic constraint validationLLM-based validation of inter-column logical relationships
Explanation generationNatural language explanations for anomaly labels and findings
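
A sketch of the pluggable provider abstraction, assuming a simple trait plus a deterministic mock; names are illustrative and not the crate's actual interface:

/// Minimal text-completion abstraction with a mock fallback.
pub trait TextProvider {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

pub struct MockProvider;

impl TextProvider for MockProvider {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("[mock completion for: {prompt}]"))
    }
}

/// Enrich a vendor record with an LLM-generated name, falling back to the template name.
pub fn enrich_vendor_name(provider: &dyn TextProvider, category: &str, fallback: &str) -> String {
    let prompt = format!("Generate a realistic vendor name for spend category '{category}'");
    provider.complete(&prompt).unwrap_or_else(|_| fallback.to_string())
}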

Research Context: Multiple 2025 papers demonstrate LLM-augmented tabular data generation. LLM-TabFlow (March 2025) addresses preserving inter-column logical relationships. StructSynth (August 2025) focuses on structure-aware synthesis in low-data regimes. LLM-TabLogic (August 2025) uses prompt-guided latent diffusion to maintain logical constraints. The CFA Institute’s July 2025 report on “Synthetic Data in Investment Management” validates the growing importance of synthetic data in financial applications.

4.2 Diffusion Model Integration

Priority: Medium | Effort: Very High

DeliverableDescription
TabDDPM backendOptional diffusion-model-based generation for learned distribution capture
FinDiff integrationFinancial-domain diffusion model for learned financial patterns
Hybrid generationCombine rule-based generators with learned models for maximum fidelity
Model fine-tuning pipelineTrain custom diffusion models on fingerprint data
Imb-FinDiff for rare eventsDiffusion-based class imbalance handling for fraud patterns
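
For intuition, generic linear and cosine β-schedules over a fixed number of timesteps; this is the standard textbook formulation, not necessarily DataSynth's diffusion backend:

use std::f64::consts::PI;

/// Linear beta schedule (assumes steps >= 2), e.g. beta_start = 1e-4, beta_end = 0.02.
fn linear_schedule(steps: usize, beta_start: f64, beta_end: f64) -> Vec<f64> {
    (0..steps)
        .map(|t| beta_start + (beta_end - beta_start) * t as f64 / (steps - 1) as f64)
        .collect()
}

/// Cosine schedule: derive beta_t from a cosine alpha-bar curve, clipped at 0.999.
fn cosine_schedule(steps: usize) -> Vec<f64> {
    let s = 0.008;
    let alpha_bar = |t: f64| ((t + s) / (1.0 + s) * PI / 2.0).cos().powi(2);
    (1..=steps)
        .map(|t| {
            let prev = (t - 1) as f64 / steps as f64;
            let curr = t as f64 / steps as f64;
            (1.0 - alpha_bar(curr) / alpha_bar(prev)).min(0.999)
        })
        .collect()
}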

Research Context: The diffusion model landscape for tabular data has matured rapidly. TabDiff (ICLR 2025) introduced joint continuous-time diffusion with feature-wise learnable schedules, achieving 22.5% improvement over prior SOTA. FinDiff and its extensions (Imb-FinDiff for class imbalance, DP-Fed-FinDiff for federated privacy-preserving generation) are specifically designed for financial tabular data. A comprehensive survey (February 2025) catalogs 15+ diffusion models for tabular data. TabGraphSyn (December 2025) combines GNNs with diffusion for graph-guided tabular synthesis.

4.3 Advanced Privacy Techniques

Priority: Medium | Effort: High

DeliverableDescription
Federated fingerprintingExtract fingerprints from distributed data sources without centralization
Synthetic data certificatesCryptographic proof that output satisfies DP guarantees
Privacy-utility Pareto frontierAutomated exploration of optimal ε values for given utility targets
Surrogate public dataSupport for surrogate public data approaches to improve DP utility
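
A minimal certificate-signing sketch assuming the hmac and sha2 crates, with the signature computed over a serialized payload of DP parameters and output checksums:

use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

/// Sign a serialized certificate payload with HMAC-SHA256.
fn sign_certificate(secret: &[u8], payload: &[u8]) -> Result<Vec<u8>, String> {
    let mut mac = HmacSha256::new_from_slice(secret).map_err(|e| e.to_string())?;
    mac.update(payload);
    Ok(mac.finalize().into_bytes().to_vec())
}

/// Verify a previously issued signature; verify_slice compares in constant time.
fn verify_certificate(secret: &[u8], payload: &[u8], signature: &[u8]) -> Result<bool, String> {
    let mut mac = HmacSha256::new_from_slice(secret).map_err(|e| e.to_string())?;
    mac.update(payload);
    Ok(mac.verify_slice(signature).is_ok())
}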

Research Context: TPDP 2025 featured FedDPSyn for federated DP tabular synthesis and research on surrogate public data for DP (Hod et al.). The AI-generated synthetic tabular data market reached $1.36B in 2024 and is projected to reach $6.73B by 2029 (37.9% CAGR), driven by privacy regulation and AI training demand.

4.4 Ecosystem & Integration

Priority: Medium | Effort: Medium

DeliverableDescription
Terraform providerInfrastructure-as-code for DataSynth server deployment
Airflow/Dagster operatorsPipeline integration for automated generation in data workflows
dbt integrationGenerate synthetic data as dbt sources for analytics testing
Spark connectorRead DataSynth output directly as Spark DataFrames
MLflow integrationTrack generation runs as MLflow experiments with metrics

4.5 Causal & Counterfactual Generation

Priority: Low | Effort: Very High

DeliverableDescription
Causal graph specificationDefine causal relationships between entities in config
Interventional generation“What-if” scenarios: generate data under hypothetical interventions
Counterfactual samplesGenerate counterfactual versions of existing records
Causal discovery validationValidate that generated data preserves specified causal structure
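
A toy structural causal model to illustrate do-interventions; the variables and coefficients below are invented for this example and are not the built-in fraud_detection template:

/// Toy SCM: pressure -> override_risk -> fraud_probability.
struct FraudScm {
    /// Exogenous pressure variable in [0, 1].
    pressure: f64,
}

impl FraudScm {
    /// Structural equation for override risk; `do_value` replaces it under an intervention.
    fn override_risk(&self, do_value: Option<f64>) -> f64 {
        do_value.unwrap_or(0.2 + 0.6 * self.pressure)
    }

    fn fraud_probability(&self, do_value: Option<f64>) -> f64 {
        0.05 + 0.4 * self.override_risk(do_value)
    }
}

fn main() {
    let scm = FraudScm { pressure: 0.8 };
    let observed = scm.fraud_probability(None);         // observational
    let intervened = scm.fraud_probability(Some(0.0));  // under do(override_risk = 0)
    println!("observed={observed:.3}, do(override_risk=0)={intervened:.3}");
}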

Industry & Research Context

Synthetic Data Market (2025-2026)

The synthetic data market is experiencing explosive growth:

  • Gartner predicts 75% of businesses will use GenAI to create synthetic customer data by 2026, up from <5% in 2023.
  • The AI-generated synthetic tabular data market reached $1.36B in 2024, projected to $6.73B by 2029 (37.9% CAGR).
  • Synthetic data is predicted to account for >60% of all training data for GenAI models by 2030 (CFA Institute, July 2025).

Key Research Papers & Developments

Tabular Data Generation

  • TabDiff (ICLR 2025) — Mixed-type diffusion with learnable feature-wise schedules; 22.5% improvement on correlation preservation
  • LLM-TabFlow (March 2025) — Preserving inter-column logical relationships via LLM guidance
  • StructSynth (August 2025) — Structure-aware LLM synthesis for low-data regimes
  • LLM-TabLogic (August 2025) — Prompt-guided latent diffusion maintaining logical constraints
  • TabGraphSyn (December 2025) — Graph-guided latent diffusion combining VAE+GNN with diffusion

Financial Domain

  • FinDiff (ICAIF 2023) — Diffusion models for financial tabular data
  • Imb-FinDiff (ICAIF 2024) — Conditional diffusion for class-imbalanced financial data
  • DP-Fed-FinDiff — Federated DP diffusion for privacy-preserving financial synthesis
  • CFA Institute Report (July 2025) — “Synthetic Data in Investment Management” validating FinDiff as SOTA

Privacy & Evaluation

  • SynQP (IEEE, 2025) — Standardized quality-privacy evaluation framework for synthetic data
  • NIST SP 800-226 — Guidelines for Evaluating Differential Privacy Guarantees
  • TPDP 2025 — Benchmarking DP tabular synthesis; federated approaches; membership inference attacks
  • Consensus Privacy Metrics (Pilgram et al., 2025) — Framework for standardized privacy evaluation

Surveys

  • “Diffusion Models for Tabular Data” (February 2025) — Comprehensive survey cataloging 15+ models
  • “Comprehensive Survey of Synthetic Tabular Data Generation” (Shi et al., 2025) — Broad overview of methods
Key Trends

TrendImpactTimeframe
LLM-augmented generationRealistic metadata, natural language config2026
Diffusion models for tabular dataLearned distribution capture as alternative/complement to rule-based2026-2027
Federated DP synthesisGenerate from distributed sources without centralization2027
Causal modeling“What-if” scenarios and interventional generation2027-2028
OTEL standardizationUnified observability across Rust ecosystem2026
WASM pluginsSafe, sandboxed extensibility for custom generators2026-2027
EU AI Act enforcementMandatory synthetic content marking and data governanceAugust 2026

Competitive Positioning

Market Landscape (2025-2026)

PlatformFocusKey DifferentiatorPricingStatus
Gretel.aiDeveloper APIsNavigator (NL-to-data); acquired by NVIDIA (March 2025)Usage-basedIntegrated into NVIDIA NeMo
MOSTLY AIEnterprise complianceTabularARGN with built-in DP; fairness controlsEnterprise licenseIndependent
Tonic.aiTest data managementDatabase-aware synthesis; acquired Fabricate (April 2025)Per-databaseGrowing
HazyFinancial servicesRegulated-sector focus; sequential dataEnterprise licenseIndependent
SDV/DataCeboOpen source ecosystemCTGAN, TVAEs, Gaussian copulas; Python-nativeFreemiumOpen source core
K2viewEntity-based testingAll-in-one enterprise data managementEnterprise licenseEstablished

DataSynth Competitive Advantages

AdvantageDetail
Domain depthDeepest financial/accounting domain model (IFRS, US GAAP, ISA, SOX, COSO, KYC/AML)
Rule-based coherenceGuaranteed balance equations, document chain integrity, three-way matching
Deterministic reproducibilityChaCha8 RNG with seed control; bit-exact reproducibility across runs
Performance100K+ entries/sec (Rust native); 10-100x faster than Python-based competitors
Privacy-preserving fingerprintingUnique extract-synthesize workflow with DP guarantees
Process miningNative OCEL 2.0 event log generation (unique in market)
Graph-nativeDirect PyTorch Geometric, Neo4j, DGL export for GNN workflows
Full-stackCLI + REST/gRPC/WebSocket server + Desktop UI + Python bindings

Competitive Gaps to Address

GapCompetitors with FeaturePriority
Cloud-hosted SaaS offeringGretel, MOSTLY AI, TonicPhase 3
No-code UI for non-technical usersMOSTLY AI, K2viewPhase 3
Database-aware synthesis from production dataTonic.aiPhase 4
LLM-powered natural language interfaceGretel NavigatorPhase 4
Pre-built ML model training pipelinesGretelPhase 3
Marketplace for community templatesSDV ecosystemPhase 3

Regulatory Landscape

EU AI Act Timeline

DateMilestoneDataSynth Impact
Feb 2025Prohibited AI systems discontinued; AI literacy obligationsLow — DataSynth is a tool, not a prohibited system
Aug 2025GPAI transparency requirements; training data documentationMedium — Users training AI with DataSynth output need provenance
Aug 2026Full high-risk AI compliance; Article 50 transparencyHigh — Synthetic content marking required; data governance mandated
Aug 2027High-risk AI in harmonized productsLow — Indirect impact

Required Compliance Features

  1. Synthetic content marking (Article 50): All generated data must include machine-readable markers indicating artificial generation
  2. Training data documentation (Article 10): Generation manifests must document configs, sources, and processing steps
  3. Quality management (Annex IV): Documented quality assurance processes for generation and evaluation
  4. Risk assessment: Template for users to assess risks of using synthetic data in AI systems
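
One possible machine-readable marker, sketched as a serde type attached to every output file's metadata; the exact fields required under Article 50 are an assumption here:

use serde::Serialize;

/// Illustrative synthetic-content marker embedded in output metadata.
#[derive(Serialize)]
struct SyntheticContentMarker {
    synthetic: bool,
    generator: &'static str,
    generator_version: &'static str,
    manifest_sha256: String,
}

fn marker_for(manifest_sha256: String) -> SyntheticContentMarker {
    SyntheticContentMarker {
        synthetic: true,
        generator: "DataSynth",
        generator_version: env!("CARGO_PKG_VERSION"),
        manifest_sha256,
    }
}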

Other Regulatory Frameworks

FrameworkRelevanceStatus
NIST AI RMF 1.0Voluntary; becoming the operational governance layer globallySelf-assessment planned (Phase 3)
NIST SP 800-226DP evaluation guidelinesAlignment planned (Phase 2)
GDPRSynthetic data reduces but doesn’t eliminate privacy obligationsDocumentation in Phase 3
SOXDataSynth already generates SOX-compliant test dataFeature complete
ISO 27001Information security controls for server deploymentAlignment in Phase 3
SOC 2 Type IITrust service criteria for SaaS offeringPhase 3 preparation

Risk Register

Technical Risks

RiskLikelihoodImpactMitigation
Performance regression with OTEL instrumentationMediumMediumBenchmark-gated CI; sampling in production
Breaking API changes during versioningLowHighSemantic versioning; deprecation policy; compatibility tests
Memory safety issues in unsafe blocksLowCriticalMiri testing; minimize unsafe; regular audits
Dependency CVEsMediumHighcargo-audit in CI; Dependabot alerts
Plugin system security (WASM/dynamic loading)MediumHighWASM sandboxing; capability-based permissions

Business Risks

RiskLikelihoodImpactMitigation
EU AI Act scope broader than anticipatedMediumHighProactive Article 50 compliance; legal review
Competitor acqui-hires (Gretel→NVIDIA pattern)MediumMediumBuild unique domain depth as defensible moat
Open-source competitors (SDV) closing feature gapMediumMediumFocus on financial domain depth and performance
Enterprise customers requiring SOC 2 certificationHighMediumBegin SOC 2 preparation in Phase 3
Python ecosystem expects native (PyO3) bindingsMediumMediumEvaluate PyO3 migration for v2.0

Operational Risks

RiskLikelihoodImpactMitigation
Production incidents without runbooksHighMediumPrioritize ops documentation in Phase 2
Scaling issues under concurrent loadMediumHighLoad testing in Phase 2; HPA configuration
Secret exposure in logs or configsLowCriticalStructured logging with PII filtering; secret scanning

Success Criteria

Phase 1 Exit Criteria

  • Docker image published and scannable (multi-stage distroless build)
  • cargo-audit and cargo-deny passing in CI
  • OTEL traces available via feature-gated otel flag with OTLP export
  • Prometheus metrics scraped and graphed (Docker Compose stack)
  • Code coverage measured and reported via cargo-llvm-cov + Codecov
  • Cross-platform CI (Linux + macOS + Windows)

Phase 2 Exit Criteria

  • Helm chart deployed to staging K8s cluster
  • Generation manifest produced for every run (with per-file checksums, lineage graph, W3C PROV-JSON)
  • Load test: k6 scripts for health, bulk generation, WebSocket, job queue, and soak testing
  • Zero unwrap() calls in library crate non-test code (#![deny(clippy::unwrap_used)] enforced)
  • Formal DP composition tracking with budget management (RDP, zCDP, privacy budget manager)
  • Operations runbook reviewed and validated (deployment guides, runbook, capacity planning, DR, API reference, security hardening)

Phase 3 Exit Criteria

  • JWT/OAuth2 authentication with RBAC
  • Automated quality gates blocking below-threshold runs
  • Plugin SDK documented with 2+ community plugins
  • Python 1.0.0 on PyPI with async support
  • EU AI Act Article 50 compliance verified
  • SOC 2 Type II readiness assessment completed

Phase 4 Exit Criteria

  • LLM-augmented generation available as opt-in feature
  • Diffusion model backend demonstrated on financial dataset
  • 3+ ecosystem integrations (Airflow, dbt, MLflow)
  • Causal generation prototype validated

Appendix A: OpenTelemetry Integration Architecture

┌─────────────────────────────────────────────────────┐
│                   DataSynth Server                  │
│  ┌───────────┐  ┌──────────┐  ┌─────────────────┐  │
│  │  REST API  │  │   gRPC   │  │   WebSocket     │  │
│  └─────┬─────┘  └────┬─────┘  └───────┬─────────┘  │
│        │              │                │             │
│  ┌─────┴──────────────┴────────────────┴──────────┐ │
│  │          Tower Middleware Stack                 │ │
│  │  [Auth] [RateLimit] [Tracing] [Metrics]        │ │
│  └────────────────────┬───────────────────────────┘ │
│                       │                              │
│  ┌────────────────────┴───────────────────────────┐ │
│  │           OpenTelemetry SDK                    │ │
│  │  ┌─────────┐ ┌──────────┐ ┌─────────────────┐ │ │
│  │  │ Traces  │ │ Metrics  │ │     Logs        │ │ │
│  │  └────┬────┘ └────┬─────┘ └───────┬─────────┘ │ │
│  └───────┼───────────┼───────────────┼────────────┘ │
│          │           │               │               │
│  ┌───────┴───────────┴───────────────┴────────────┐ │
│  │           OTLP Exporter (gRPC/HTTP)            │ │
│  └────────────────────┬───────────────────────────┘ │
└───────────────────────┼─────────────────────────────┘
                        │
              ┌─────────┴──────────┐
              │   OTel Collector   │
              │  (Agent sidecar)   │
              └──┬──────┬──────┬───┘
                 │      │      │
           ┌─────┘  ┌───┘  ┌──┘
           ▼        ▼      ▼
       ┌──────┐ ┌──────┐ ┌─────┐
       │Jaeger│ │Prom. │ │Loki │
       │/Tempo│ │      │ │     │
       └──────┘ └──────┘ └─────┘
Appendix B: Crate Additions by Phase

CategoryCratePurposePhase
Observabilityopentelemetry (0.27+)Unified telemetry API1
Observabilityopentelemetry-otlpOTLP exporter1
Observabilitytracing-opentelemetryBridge tracing → OTEL1
Securityargon2Password/key hashing1
SecuritysubtleConstant-time comparison1
SecurityrustlsNative TLS1
ScalabilityredisDistributed state/rate-limiting2
Scalabilitydeadpool-redisRedis connection pooling2
Testingcargo-tarpaulinCode coverage1
Testingcargo-fuzzFuzz testing2
AuthjsonwebtokenJWT tokens3
Authoauth2OAuth2 client3
PluginswasmtimeWASM plugin runtime3
Buildgit-cliffChangelog generation1

Appendix C: Key References

Standards & Guidelines

  • NIST AI RMF 1.0 — AI Risk Management Framework
  • NIST SP 800-226 — Guidelines for Evaluating Differential Privacy Guarantees
  • EU AI Act (Regulation 2024/1689) — Articles 10, 50
  • ISO/IEC 25020:2019 — Systems and software Quality Requirements and Evaluation (SQuaRE)

Research Papers

  • Chen et al. (2025) — “Benchmarking Differentially Private Tabular Data Synthesis Algorithms” (TPDP 2025)
  • SynQP (IEEE, 2025) — “A Framework and Metrics for Evaluating the Quality and Privacy Risk of Synthetic Data”
  • Xu et al. (2025) — “TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation” (ICLR 2025)
  • Sattarov & Schreyer (2023) — “FinDiff: Diffusion Models for Financial Tabular Data Generation” (ICAIF 2023)
  • Shi et al. (2025) — “Comprehensive Survey of Synthetic Tabular Data Generation”
  • CFA Institute (July 2025) — “Synthetic Data in Investment Management”
  • Pilgram et al. (2025) — “A Consensus Privacy Metrics Framework for Synthetic Data”

Industry Reports

  • Gartner (2024) — “By 2026, 75% of businesses will use GenAI for synthetic customer data”
  • GlobeNewsWire (January 2026) — AI-Generated Synthetic Tabular Dataset Market: $6.73B by 2029

Research: System Improvements for Enhanced Realism

This research document series analyzes the current SyntheticData system and proposes comprehensive improvements across multiple dimensions to achieve greater realism, statistical validity, and domain authenticity.

Document Index

DocumentFocus AreaPriority
01-realism-names-metadata.mdNames, descriptions, metadata realismHigh
02-statistical-distributions.mdNumerical and statistical distributionsHigh
03-temporal-patterns.mdTemporal correctness and distributionsHigh
04-interconnectivity.mdEntity relationships and referential integrityCritical
05-pattern-drift.mdProcess and pattern evolution over timeMedium
06-anomaly-patterns.mdAnomaly detection and injection patternsHigh
07-fraud-patterns.mdFraud typologies and detection scenariosHigh
08-domain-specific.mdIndustry-specific enhancementsMedium

Executive Summary

Current State Assessment

The SyntheticData system is a mature, well-architected synthetic data generation platform with strong foundations in:

  • Deterministic generation via ChaCha8 RNG with configurable seeds
  • Domain modeling with 50+ entity types across accounting, banking, and audit domains
  • Statistical foundations including Benford’s Law, log-normal distributions, and temporal seasonality
  • Referential integrity through document chains, three-way matching, and intercompany reconciliation
  • Standards compliance with COSO 2013, ISA, SOX, IFRS, and US GAAP frameworks

Key Improvement Themes

After comprehensive analysis, we identify eight major improvement themes:

1. Realism in Names & Metadata

Current Gap: Generic placeholder names, limited cultural diversity, simplistic descriptions
Impact: Immediate visual detection of synthetic nature
Effort: Medium | Value: High

2. Statistical Distribution Enhancements

Current Gap: Single-mode distributions, limited correlation modeling, no regime changes
Impact: ML models trained on synthetic data may not generalize
Effort: High | Value: Critical

3. Temporal Pattern Sophistication

Current Gap: Static multipliers, no business day calculations, limited regional calendars
Impact: Unrealistic transaction timing patterns
Effort: Medium | Value: High

4. Interconnectivity & Relationship Modeling

Current Gap: Shallow relationship graphs, limited network effects, no behavioral clustering
Impact: Graph-based analytics yield unrealistic structures
Effort: High | Value: Critical

5. Pattern & Process Drift

Current Gap: Limited drift types, no organizational change modeling, static processes
Impact: Temporal ML models overfit to stable patterns
Effort: Medium | Value: High

6. Anomaly Pattern Enrichment

Current Gap: Limited anomaly correlation, no multi-stage anomalies, binary labeling
Impact: Anomaly detection models lack nuanced training data
Effort: Medium | Value: High

7. Fraud Pattern Sophistication

Current Gap: Isolated fraud events, limited collusion modeling, no adaptive patterns
Impact: Fraud detection systems miss complex schemes
Effort: High | Value: Critical

8. Domain-Specific Enhancements

Current Gap: Generic industry modeling, limited regulatory specificity
Impact: Industry-specific use cases require extensive customization
Effort: Medium | Value: Medium


Implementation Roadmap

Phase 1: Foundation (Q1)

  • Culturally-aware name generation with regional distributions
  • Enhanced amount distributions with mixture models
  • Business day calculation utilities
  • Relationship graph depth improvements

Phase 2: Statistical Sophistication (Q2)

  • Multi-modal distribution support
  • Cross-field correlation modeling
  • Regime change simulation
  • Network effect modeling

Phase 3: Temporal Evolution (Q3)

  • Organizational change events
  • Process evolution modeling
  • Adaptive fraud patterns
  • Multi-stage anomaly injection

Phase 4: Domain Specialization (Q4)

  • Industry-specific regulatory frameworks
  • Enhanced audit trail generation
  • Advanced graph analytics support
  • Privacy-preserving fingerprint improvements

Metrics for Success

Realism Metrics

  • Human Detection Rate: % of samples correctly identified as synthetic by domain experts
  • Statistical Divergence: KL divergence between synthetic and real-world distributions
  • Temporal Correlation: Autocorrelation alignment with empirical baselines

ML Utility Metrics

  • Transfer Learning Gap: Performance delta when models trained on synthetic data are applied to real data
  • Feature Distribution Overlap: Overlap coefficient for key feature distributions
  • Anomaly Detection AUC: Baseline AUC on synthetic vs. improvement after enhancements

Technical Metrics

  • Generation Throughput: Records/second with enhanced features
  • Memory Efficiency: Peak memory usage for equivalent dataset sizes
  • Configuration Complexity: Lines of YAML required for common scenarios

Next Steps

  1. Review individual research documents for detailed analysis
  2. Prioritize improvements based on use case requirements
  3. Create implementation tickets for Phase 1 items
  4. Establish baseline metrics for tracking progress

Research conducted: January 2026
System version analyzed: 0.2.3

Research: Realism in Names, Descriptions, and Metadata

Current State Analysis

Entity Name Generation

The current system uses basic name generation across multiple entity types:

Entity TypeCurrent ApproachRealism Level
Vendors“Vendor_{id}” or template-basedLow
Customers“Customer_{id}” or template-basedLow
EmployeesFirst/Last name poolsMedium
Materials“Material_{id}” with category prefixLow
Cost Centers“{dept}_{code}” patternMedium
GL AccountsNumeric codes with descriptionsHigh
CompaniesConfigurable but often genericMedium

Description Generation

Current descriptions follow predictable patterns:

  • Journal entries: “{type} for {entity}”
  • Invoices: “Invoice for {goods/services}”
  • Payments: “Payment for Invoice {ref}”

Metadata Patterns

  • Timestamps: Well-distributed but lack system-specific quirks
  • User IDs: Sequential or simple patterns
  • References: Deterministic but predictable formats

Improvement Recommendations

1. Culturally-Aware Name Generation

1.1 Regional Name Pools

Implementation: Create region-specific name databases with appropriate cultural distributions.

# Proposed configuration structure
name_generation:
  strategy: regional_weighted
  regions:
    - region: north_america
      weight: 0.45
      subregions:
        - country: US
          weight: 0.85
          cultural_mix:
            - origin: anglo
              weight: 0.55
            - origin: hispanic
              weight: 0.25
            - origin: asian
              weight: 0.12
            - origin: other
              weight: 0.08
        - country: CA
          weight: 0.10
        - country: MX
          weight: 0.05
    - region: europe
      weight: 0.30
    - region: asia_pacific
      weight: 0.25

1.2 Company Name Patterns by Industry

Retail:

  • Pattern: {Founder} {Product} → “Johnson’s Hardware”
  • Pattern: {Adjective} {Category} → “Premier Electronics”
  • Pattern: {Location} {Type} → “Westside Grocers”

Manufacturing:

  • Pattern: {Name} {Industry} {Suffix} → “Anderson Steel Corporation”
  • Pattern: {Acronym} {Type} → “ACM Industries”
  • Pattern: {Technical} {Systems} → “Precision Machining Systems”

Professional Services:

  • Pattern: {Partner1}, {Partner2} & {Partner3} → “Smith, Chen & Associates”
  • Pattern: {Name} {Specialty} {Type} → “Hartwell Tax Advisors”
  • Pattern: {Adjective} {Service} {Suffix} → “Strategic Consulting Group”

Financial Services:

  • Pattern: {Location} {Type} {Entity} → “Pacific Coast Credit Union”
  • Pattern: {Founder} {Service} → “Morgan Wealth Management”
  • Pattern: {Region} {Specialty} → “Midwest Commercial Lending”

1.3 Vendor Name Realism

Current: Vendor_00042 or simple templates

Proposed: Industry-appropriate vendor names based on spend category:

// Conceptual structure; SpendCategory, Region, Country, NameTemplate,
// NamingConvention, and VendorName are assumed to be defined elsewhere.
use std::collections::HashMap;

pub struct VendorNameGenerator {
    category_templates: HashMap<SpendCategory, Vec<NameTemplate>>,
    regional_styles: HashMap<Region, NamingConvention>,
    legal_suffixes: HashMap<Country, Vec<String>>,
}

impl VendorNameGenerator {
    pub fn generate(&self, category: SpendCategory, region: Region) -> VendorName {
        // 1. Select a name template based on the spend category
        // 2. Apply regional naming conventions
        // 3. Add an appropriate legal suffix (Inc., LLC, GmbH, Ltd., S.A., etc.)
        todo!("conceptual sketch")
    }
}

Examples by Category:

CategoryExample Names
Office SuppliesStaples, Office Depot, ULINE, Quill Corporation
IT ServicesAccenture Technology, Cognizant Solutions, InfoSys Systems
Raw MaterialsAlcoa Aluminum, US Steel Supply, Nucor Materials
UtilitiesPacific Gas & Electric, ConEdison, Duke Energy
Professional ServicesDeloitte & Touche, KPMG Advisory, BDO Consulting
LogisticsFedEx Freight, UPS Supply Chain, XPO Logistics
FacilitiesABM Industries, CBRE Services, JLL Facilities

2. Realistic Description Generation

2.1 Journal Entry Descriptions

Current Pattern: Generic, formulaic

Proposed: Context-aware, varied descriptions with realistic abbreviations and typos

journal_entry_descriptions:
  revenue:
    templates:
      - "Revenue recognition - {customer} - {contract_ref}"
      - "Rev rec {period} - {product_line}"
      - "Sales revenue {region} Q{quarter}"
      - "Earned revenue - PO# {po_number}"
    abbreviations:
      enabled: true
      probability: 0.3
      mappings:
        Revenue: ["Rev", "REV"]
        recognition: ["rec", "recog"]
        Quarter: ["Q", "Qtr"]
    variations:
      case_variation: 0.1
      typo_rate: 0.02

  expense:
    templates:
      - "AP invoice - {vendor} - {invoice_ref}"
      - "{expense_category} - {cost_center}"
      - "Accrued {expense_type} {period}"
      - "{vendor_short} inv {invoice_num}"
    context_aware:
      include_approver: 0.2
      include_po_reference: 0.7
      include_department: 0.4

2.2 Invoice Line Item Descriptions

Goods:

- "Qty {quantity} {product_name} @ ${unit_price}/ea"
- "{product_sku} - {product_description}"
- "{quantity}x {product_short_name}"
- "Lot# {lot_number} {product_name}"

Services:

- "Professional services - {date_range}"
- "Consulting fees - {project_name}"
- "Retainer - {month} {year}"
- "{hours} hrs @ ${rate}/hr - {service_type}"

2.3 Payment Descriptions

Current: “Payment for Invoice INV-00123”

Proposed variations:

- "Pmt INV-00123"
- "ACH payment - {vendor} - {invoice_ref}"
- "Wire transfer ref {bank_ref}"
- "Check #{check_number} - {vendor}"
- "EFT {date} {vendor_short}"
- "Batch payment - {batch_id}"

3. Enhanced Metadata Generation

3.1 User ID Patterns

Current: Sequential or simple random

Proposed: Realistic corporate patterns

user_id_patterns:
  format: "{first_initial}{last_name}{disambiguator}"
  examples:
    - "jsmith"
    - "jsmith2"
    - "john.smith"
    - "smithj"

  system_accounts:
    - prefix: "SVC_"
      examples: ["SVC_BATCH", "SVC_INTERFACE", "SVC_RECON"]
    - prefix: "SYS_"
      examples: ["SYS_AUTO", "SYS_SCHEDULER"]
    - prefix: "INT_"
      examples: ["INT_SAP", "INT_ORACLE", "INT_SALESFORCE"]

  admin_accounts:
    - pattern: "admin_{system}"
    - examples: ["admin_gl", "admin_ap", "admin_ar"]

3.2 Reference Number Formats

Realistic patterns by document type:

reference_formats:
  purchase_order:
    patterns:
      - "PO-{year}{seq:06}"        # PO-2024000142
      - "4500{seq:06}"              # SAP-style: 4500000142
      - "{plant}-{year}-{seq:05}"   # CHI-2024-00142

  invoice:
    vendor_patterns:
      - "INV-{seq:08}"
      - "{vendor_prefix}-{date}-{seq:04}"
      - "{random_alpha:3}{seq:06}"
    internal_patterns:
      - "VINV-{year}{seq:06}"
      - "{company_code}-AP-{seq:07}"

  journal_entry:
    patterns:
      - "{year}{period:02}{seq:06}"   # 202401000142
      - "JE-{date}-{seq:04}"          # JE-20240115-0142
      - "{company}-{year}-{seq:07}"   # C001-2024-0000142

  bank_reference:
    patterns:
      - "{date}{random:10}"           # Bank statement ref
      - "TRN{seq:12}"                 # Transaction ID
      - "{swift_code}{date}{seq:06}"  # SWIFT format

3.3 Timestamp Realism

System-specific posting behaviors:

timestamp_patterns:
  batch_processing:
    typical_times: ["02:00", "06:00", "22:00"]
    duration_minutes: 30-180
    day_pattern: "business_days"

  manual_posting:
    peak_hours: [9, 10, 11, 14, 15, 16]
    off_peak_probability: 0.15
    lunch_dip: [12, 13]
    lunch_probability: 0.3

  interface_posting:
    patterns:
      - hourly: [":00", ":15", ":30", ":45"]
      - real_time: "random within seconds"
    source_systems:
      - name: "SAP_INTERFACE"
        posting_lag_hours: 0-4
      - name: "LEGACY_BATCH"
        posting_time: "23:30"
        posting_day: "next_business_day"

  period_end_crunch:
    enabled: true
    days_before_close: 3
    extended_hours: true
    weekend_activity: 0.3

4. Address and Contact Information

4.1 Realistic Address Generation

Current Gap: Generic or missing addresses

Proposed: Region-appropriate address formats

address_generation:
  us:
    format: "{street_number} {street_name} {street_type}\n{city}, {state} {zip}"
    components:
      street_numbers:
        residential: 100-9999
        commercial: 1-500
        distribution: "log_normal"
      street_names:
        sources: ["census_data", "common_names"]
        include_directional: 0.3  # "N", "S", "E", "W"
      street_types:
        distribution:
          Street: 0.25
          Avenue: 0.15
          Road: 0.12
          Drive: 0.12
          Boulevard: 0.08
          Lane: 0.08
          Way: 0.08
          Court: 0.05
          Place: 0.04
          Circle: 0.03
      cities:
        source: "population_weighted"
        major_metro_weight: 0.6
    commercial_patterns:
      suite_probability: 0.4
      floor_probability: 0.2
      building_name_probability: 0.15

  de:
    format: "{street_name} {street_number}\n{postal_code} {city}"
    # German addresses put number after street name

  jp:
    format: "〒{postal_code}\n{prefecture}{city}{ward}\n{block}-{building}-{unit}"
    # Japanese addressing system

4.2 Phone Number Formats

phone_generation:
  formats:
    us: "+1 ({area_code}) {exchange}-{subscriber}"
    uk: "+44 {area_code} {local_number}"
    de: "+49 {area_code} {subscriber}"

  area_codes:
    us:
      source: "valid_area_codes"
      weight_by_population: true
      exclude_toll_free: true
      business_toll_free_rate: 0.3

4.3 Email Patterns

email_generation:
  corporate:
    patterns:
      - "{first}.{last}@{company_domain}"
      - "{first_initial}{last}@{company_domain}"
      - "{first}_{last}@{company_domain}"
    domain_generation:
      from_company_name: true
      tld_distribution:
        ".com": 0.75
        ".net": 0.10
        ".io": 0.05
        ".co": 0.05
        country_tld: 0.05

  vendor_contacts:
    patterns:
      - "accounts.payable@{domain}"
      - "ar@{domain}"
      - "billing@{domain}"
      - "{first}.{last}@{domain}"
    generic_rate: 0.4

5. Material and Product Naming

5.1 SKU Patterns

sku_generation:
  patterns:
    category_prefix:
      format: "{category:3}-{subcategory:3}-{sequence:06}"
      example: "ELE-CPT-000142"  # Electronics-Components

    alphanumeric:
      format: "{alpha:2}{numeric:6}{check_digit}"
      example: "AB123456C"

    hierarchical:
      format: "{division}-{family}-{class}-{item}"
      example: "01-234-567-8901"

5.2 Product Descriptions

By Category:

product_descriptions:
  raw_materials:
    templates:
      - "{material_type}, {grade}, {specification}"
      - "{chemical_formula} {purity}% pure"
      - "{material} {form} - {dimensions}"
    examples:
      - "Steel Coil, Grade 304, 1.2mm thickness"
      - "Aluminum Sheet 6061-T6, 4' x 8' x 0.125\""
      - "Polyethylene Pellets, HDPE, 50lb bag"

  finished_goods:
    templates:
      - "{brand} {product_line} {model}"
      - "{product_type} - {feature1}, {feature2}"
      - "{category} {variant} ({color}/{size})"
    examples:
      - "Acme Pro Series 5000X Widget"
      - "Heavy-Duty Industrial Pump - 2HP, 120V"
      - "Office Chair Ergonomic Mesh (Black/Large)"

  services:
    templates:
      - "{service_type} - {duration} {frequency}"
      - "Professional {service} Services"
      - "{specialty} Consultation - {scope}"
    examples:
      - "HVAC Maintenance - Annual Contract"
      - "Professional IT Support Services"
      - "Legal Consultation - Contract Review"

6. Implementation Priority

| Enhancement | Effort | Impact | Priority |
|---|---|---|---|
| Regional name pools | Medium | High | P1 |
| Industry-specific vendor names | Medium | High | P1 |
| Varied JE descriptions | Low | Medium | P1 |
| Reference number formats | Low | High | P1 |
| User ID patterns | Low | Medium | P2 |
| Address generation | High | Medium | P2 |
| Product descriptions | Medium | Medium | P2 |
| Email patterns | Low | Low | P3 |
| Phone formatting | Low | Low | P3 |

7. Data Sources

Recommended External Data Sources:

  1. Name Data:

    • US Census Bureau name frequency data
    • International name databases (regional)
    • Industry-specific company name patterns
  2. Address Data:

    • OpenAddresses project
    • Census TIGER/Line files
    • Postal code databases by country
  3. Reference Patterns:

    • ERP documentation (SAP, Oracle, NetSuite)
    • Industry EDI standards
    • Banking reference formats (SWIFT, ACH)
  4. Product Data:

    • UNSPSC category codes
    • Industry classification systems
    • Standard material specifications

8. Configuration Example

# Enhanced name and metadata configuration
realism:
  names:
    strategy: culturally_aware
    primary_region: north_america
    diversity_index: 0.4

  vendors:
    naming_style: industry_appropriate
    include_legal_suffix: true
    regional_distribution:
      domestic: 0.7
      international: 0.3

  descriptions:
    variation_enabled: true
    abbreviation_rate: 0.25
    typo_injection_rate: 0.01

  references:
    format_style: erp_realistic
    include_check_digits: true

  timestamps:
    system_behavior_modeling: true
    batch_window_realism: true

  addresses:
    format: regional_appropriate
    commercial_indicators: true

Next Steps

  1. Create name pool data files for major regions
  2. Implement NameGenerator trait with regional strategies
  3. Build description template engine with variation support
  4. Add reference format configurations to schema
  5. Integrate address generation with Faker-like libraries

See also: 02-statistical-distributions.md for numerical realism

Research: Statistical and Numerical Distributions

Current State Analysis

Existing Distribution Implementations

The system currently supports several distribution types:

| Distribution | Implementation | Usage |
|---|---|---|
| Log-Normal | AmountSampler | Transaction amounts |
| Benford’s Law | BenfordSampler | First-digit distribution |
| Uniform | Standard | ID generation, selection |
| Weighted | LineItemSampler | Line item counts |
| Poisson | TemporalSampler | Event counts |
| Normal/Gaussian | Standard | Some variations |

Current Strengths

  1. Benford’s Law compliance: First-digit distribution follows expected 30.1%, 17.6%, 12.5%… pattern
  2. Log-normal amounts: Realistic transaction size distributions
  3. Temporal weighting: Period-end spikes, day-of-week patterns
  4. Industry seasonality: 10 industry profiles with event-based multipliers

Current Gaps

  1. Single-mode distributions: No mixture models for multi-modal data
  2. Limited correlation: Cross-field dependencies not modeled
  3. Static parameters: No regime changes or parameter drift
  4. Missing distributions: Pareto, Weibull, Beta not available
  5. No copulas: Joint distributions not correlated realistically

Improvement Recommendations

1. Multi-Modal Distribution Support

1.1 Gaussian Mixture Models

Real-world transaction amounts often exhibit multiple modes:

#![allow(unused)]
fn main() {
/// Gaussian Mixture Model for multi-modal distributions
pub struct GaussianMixture {
    components: Vec<GaussianComponent>,
}

pub struct GaussianComponent {
    weight: f64,      // Component weight (sum to 1.0)
    mean: f64,        // Component mean
    std_dev: f64,     // Component standard deviation
}

impl GaussianMixture {
    /// Sample from the mixture distribution
    pub fn sample(&self, rng: &mut impl Rng) -> f64 {
        // Select component based on weights
        let component = self.select_component(rng);
        // Sample from selected Gaussian
        component.sample(rng)
    }
}
}
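
The select_component call above is just a weighted categorical draw over the component weights. A minimal sketch, assuming the rand crate and the GaussianComponent fields shown here:

use rand::Rng;

pub struct GaussianComponent {
    pub weight: f64,
    pub mean: f64,
    pub std_dev: f64,
}

/// Pick a component by cumulative weight (weights are assumed to sum to ~1.0);
/// the caller then samples mean + std_dev * z from the chosen component.
pub fn select_component<'a>(
    components: &'a [GaussianComponent],
    rng: &mut impl Rng,
) -> &'a GaussianComponent {
    let u: f64 = rng.gen();
    let mut cumulative = 0.0;
    for component in components {
        cumulative += component.weight;
        if u < cumulative {
            return component;
        }
    }
    // Guard against floating-point drift in the weights.
    components.last().expect("mixture needs at least one component")
}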

Configuration:

amount_distribution:
  type: gaussian_mixture
  components:
    - weight: 0.60
      mean: 500
      std_dev: 200
      label: "small_transactions"
    - weight: 0.30
      mean: 5000
      std_dev: 1500
      label: "medium_transactions"
    - weight: 0.10
      mean: 50000
      std_dev: 15000
      label: "large_transactions"

1.2 Log-Normal Mixture

For strictly positive amounts with multiple modes:

amount_distribution:
  type: lognormal_mixture
  components:
    - weight: 0.70
      mu: 5.5       # log-scale mean
      sigma: 1.2    # log-scale std dev
      label: "routine_expenses"
    - weight: 0.25
      mu: 8.5
      sigma: 0.8
      label: "capital_expenses"
    - weight: 0.05
      mu: 11.0
      sigma: 0.5
      label: "major_projects"

1.3 Realistic Transaction Amount Profiles

By Transaction Type:

| Type | Distribution | Parameters | Notes |
|---|---|---|---|
| Petty Cash | Log-normal | μ=3.5, σ=0.8 | $10-$500 range |
| AP Invoices | Mixture(3) | See below | Multi-modal |
| Payroll | Normal | μ=4500, σ=1200 | Per employee |
| Utilities | Log-normal | μ=7.0, σ=0.4 | Monthly, stable |
| Capital | Pareto | α=1.5, xₘ=10000 | Heavy tail |

AP Invoice Mixture:

ap_invoices:
  type: lognormal_mixture
  components:
    # Operating expenses
    - weight: 0.50
      mu: 6.0        # ~$400 median
      sigma: 1.5
    # Inventory/materials
    - weight: 0.35
      mu: 8.0        # ~$3000 median
      sigma: 1.0
    # Capital/projects
    - weight: 0.15
      mu: 10.5       # ~$36000 median
      sigma: 0.8

2. Cross-Field Correlation Modeling

2.1 Correlation Matrix Support

Define correlations between numeric fields:

correlations:
  enabled: true
  fields:
    - name: transaction_amount
    - name: line_item_count
    - name: approval_level
    - name: processing_time_hours
    - name: discount_percentage

  matrix:
    # Correlation coefficients (Pearson's r)
    # Higher amounts → more line items
    - [1.00, 0.65, 0.72, 0.45, -0.20]
    # More items → higher amount
    - [0.65, 1.00, 0.55, 0.60, -0.15]
    # Higher amount → higher approval
    - [0.72, 0.55, 1.00, 0.50, -0.30]
    # More complex → longer processing
    - [0.45, 0.60, 0.50, 1.00, -0.10]
    # Higher amount → lower discount %
    - [-0.20, -0.15, -0.30, -0.10, 1.00]
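
For a single pair of fields, a target Pearson correlation can be realized without a full copula by mixing two independent standard normals; the marginal transform (for example exp() for a log-normal amount) is applied afterwards. A minimal sketch, assuming the rand and rand_distr crates; the marginal parameters in main are illustrative:

use rand::Rng;
use rand_distr::StandardNormal;

/// Two standard-normal draws with Pearson correlation `r`.
fn correlated_pair(r: f64, rng: &mut impl Rng) -> (f64, f64) {
    let z1: f64 = rng.sample(StandardNormal);
    let noise: f64 = rng.sample(StandardNormal);
    let z2 = r * z1 + (1.0 - r * r).sqrt() * noise;
    (z1, z2)
}

fn main() {
    let mut rng = rand::thread_rng();
    // amount vs. line_item_count with r = 0.65, as in the matrix above
    let (z_amount, z_items) = correlated_pair(0.65, &mut rng);
    let amount = (6.0 + 1.5 * z_amount).exp();                    // log-normal marginal
    let line_items = (1.0 + 2.0 * z_items.abs()).round() as u32;  // illustrative count marginal
    println!("amount ~ {amount:.2}, line items ~ {line_items}");
}

The full five-field matrix would instead require a Cholesky factorization of the correlation matrix; the pairwise construction above is its two-dimensional special case.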

2.2 Copula-Based Generation

For more sophisticated dependency modeling:

#![allow(unused)]
fn main() {
/// Copula types for dependency modeling
pub enum CopulaType {
    /// Gaussian copula - symmetric dependencies
    Gaussian { correlation: f64 },
    /// Clayton copula - lower tail dependence
    Clayton { theta: f64 },
    /// Gumbel copula - upper tail dependence
    Gumbel { theta: f64 },
    /// Frank copula - symmetric, no tail dependence
    Frank { theta: f64 },
    /// Student-t copula - both tail dependencies
    StudentT { correlation: f64, df: f64 },
}

pub struct CopulaGenerator {
    copula: CopulaType,
    marginals: Vec<Box<dyn Distribution>>,
}
}

Use Cases:

  • Amount & Days-to-Pay: Larger invoices may have longer payment terms (Clayton copula)
  • Revenue & COGS: Strong positive correlation (Gaussian copula)
  • Fraud Amount & Detection Delay: Upper tail dependence (Gumbel copula)

2.3 Conditional Distributions

Generate values conditional on other fields:

conditional_distributions:
  # Discount percentage depends on order amount
  discount:
    type: conditional
    given: order_amount
    breakpoints:
      - threshold: 1000
        distribution: { type: constant, value: 0 }
      - threshold: 5000
        distribution: { type: uniform, min: 0, max: 0.05 }
      - threshold: 25000
        distribution: { type: uniform, min: 0.05, max: 0.10 }
      - threshold: 100000
        distribution: { type: uniform, min: 0.10, max: 0.15 }
      - threshold: infinity
        distribution: { type: normal, mean: 0.15, std: 0.03 }

  # Payment terms depend on vendor relationship
  payment_terms:
    type: conditional
    given: vendor_relationship_months
    breakpoints:
      - threshold: 6
        distribution: { type: choice, values: [0, 15], weights: [0.8, 0.2] }
      - threshold: 24
        distribution: { type: choice, values: [15, 30], weights: [0.6, 0.4] }
      - threshold: infinity
        distribution: { type: choice, values: [30, 45, 60], weights: [0.5, 0.35, 0.15] }
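
Operationally, a breakpoint table like this is a first-match lookup: find the first threshold the conditioning value falls under, then sample from that row's distribution. A minimal sketch for the discount case, with illustrative types that handle only the constant and uniform rows:

use rand::Rng;

/// One row of a conditional table: applies while the conditioning value < `threshold`.
struct Breakpoint {
    threshold: f64, // use f64::INFINITY for the final catch-all row
    min: f64,
    max: f64,       // min == max encodes a constant distribution
}

/// Sample a discount rate conditional on the order amount.
fn conditional_discount(order_amount: f64, table: &[Breakpoint], rng: &mut impl Rng) -> f64 {
    let row = table
        .iter()
        .find(|b| order_amount < b.threshold)
        .expect("table must end with an infinite threshold");
    if row.max > row.min {
        rng.gen_range(row.min..row.max) // uniform within the bracket
    } else {
        row.min // constant
    }
}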

3. Industry-Specific Amount Distributions

3.1 Retail

retail:
  transaction_amounts:
    pos_sales:
      type: lognormal_mixture
      components:
        - weight: 0.65
          mu: 3.0      # ~$20 median
          sigma: 1.0
          label: "small_basket"
        - weight: 0.30
          mu: 4.5      # ~$90 median
          sigma: 0.8
          label: "medium_basket"
        - weight: 0.05
          mu: 6.0      # ~$400 median
          sigma: 0.6
          label: "large_basket"

    inventory_orders:
      type: lognormal
      mu: 9.0          # ~$8000 median
      sigma: 1.5

    seasonal_multipliers:
      black_friday: 3.5
      christmas_week: 2.8
      back_to_school: 1.6

3.2 Manufacturing

manufacturing:
  transaction_amounts:
    raw_materials:
      type: lognormal_mixture
      components:
        - weight: 0.40
          mu: 8.0      # ~$3000 median
          sigma: 1.0
          label: "consumables"
        - weight: 0.45
          mu: 10.0     # ~$22000 median
          sigma: 0.8
          label: "production_materials"
        - weight: 0.15
          mu: 12.0     # ~$163000 median
          sigma: 0.6
          label: "bulk_orders"

    maintenance:
      type: pareto
      alpha: 2.0
      x_min: 500
      label: "repair_costs"

    capital_equipment:
      type: lognormal
      mu: 12.5         # ~$268000 median
      sigma: 1.0

3.3 Financial Services

financial_services:
  transaction_amounts:
    wire_transfers:
      type: lognormal_mixture
      components:
        - weight: 0.30
          mu: 8.0      # ~$3000
          sigma: 1.2
          label: "retail_wire"
        - weight: 0.40
          mu: 11.0     # ~$60000
          sigma: 1.0
          label: "commercial_wire"
        - weight: 0.20
          mu: 14.0     # ~$1.2M
          sigma: 0.8
          label: "institutional_wire"
        - weight: 0.10
          mu: 17.0     # ~$24M
          sigma: 1.0
          label: "large_value"

    ach_transactions:
      type: lognormal
      mu: 7.5          # ~$1800
      sigma: 2.0

    fee_income:
      type: weibull
      scale: 500
      shape: 1.5

4. Regime Change Modeling

4.1 Structural Breaks

Model sudden changes in distribution parameters:

regime_changes:
  enabled: true
  changes:
    - date: "2024-03-15"
      type: acquisition
      effects:
        - field: transaction_volume
          multiplier: 1.35
        - field: average_amount
          shift: 5000
        - field: vendor_count
          multiplier: 1.25

    - date: "2024-07-01"
      type: price_increase
      effects:
        - field: cogs_ratio
          shift: 0.03
        - field: avg_invoice_amount
          multiplier: 1.08

    - date: "2024-10-01"
      type: new_product_line
      effects:
        - field: revenue
          multiplier: 1.20
        - field: inventory_turns
          multiplier: 0.85

4.2 Gradual Parameter Drift

Model slow changes over time:

parameter_drift:
  enabled: true
  parameters:
    - field: transaction_amount
      type: linear
      annual_drift: 0.03    # 3% annual increase (inflation)

    - field: digital_payment_ratio
      type: logistic
      start_value: 0.40
      end_value: 0.85
      midpoint_months: 18
      steepness: 0.15

    - field: approval_threshold
      type: step
      steps:
        - month: 6
          value: 5000
        - month: 18
          value: 7500
        - month: 30
          value: 10000
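
The logistic entry above (digital_payment_ratio drifting from 0.40 toward 0.85 with its midpoint at month 18) can be evaluated per simulation month with a standard logistic curve; a minimal sketch:

/// Logistic drift between a start and end value, parameterized by the month
/// at which the curve crosses its midpoint and a steepness factor.
fn logistic_drift(month: f64, start: f64, end: f64, midpoint_months: f64, steepness: f64) -> f64 {
    let s = 1.0 / (1.0 + (-steepness * (month - midpoint_months)).exp());
    start + (end - start) * s
}

fn main() {
    // digital_payment_ratio from the configuration above
    for month in [0.0, 9.0, 18.0, 27.0, 36.0] {
        println!("month {month:>4}: {:.3}", logistic_drift(month, 0.40, 0.85, 18.0, 0.15));
    }
}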

4.3 Economic Cycle Modeling

economic_cycles:
  enabled: true
  base_cycle:
    type: sinusoidal
    period_months: 48      # 4-year cycle
    amplitude: 0.15        # ±15% variation

  recession_events:
    - start: "2024-09-01"
      duration_months: 8
      severity: moderate    # 10-20% decline
      effects:
        - revenue: -0.15
        - discretionary_spend: -0.35
        - capital_investment: -0.50
        - headcount: -0.10
      recovery:
        type: gradual
        months: 12

5. Enhanced Benford’s Law Compliance

5.1 Second and Third Digit Distributions

Extend beyond first-digit to full Benford compliance:

#![allow(unused)]
fn main() {
pub struct BenfordDistribution {
    digits: BenfordDigitConfig,
}

pub struct BenfordDigitConfig {
    first_digit: bool,     // Standard Benford
    second_digit: bool,    // Second digit distribution
    first_two: bool,       // Joint first-two digits
    summation: bool,       // Summation test
}

impl BenfordDistribution {
    /// Generate amount following full Benford's Law
    pub fn sample_benford_compliant(&self, rng: &mut impl Rng) -> Decimal {
        // Use a log-uniform draw to ensure Benford compliance across
        // multiple digit positions: sample the exponent uniformly, then
        // exponentiate. The range and the Decimal conversion below are
        // illustrative (assuming rust_decimal's Decimal).
        let exponent = rng.gen_range(1.0..6.0); // roughly $10 to $1,000,000
        Decimal::from_f64_retain(10f64.powf(exponent)).unwrap_or_default()
    }
}
}

5.2 Benford Deviation Injection

For anomaly scenarios, intentionally violate Benford:

benford_deviations:
  enabled: false  # Enable for fraud scenarios

  deviation_types:
    # Round number preference (fraud indicator)
    round_number_bias:
      probability: 0.15
      targets: [1000, 5000, 10000, 25000]
      tolerance: 0.01

    # Threshold avoidance (approval bypass)
    threshold_clustering:
      thresholds: [5000, 10000, 25000]
      cluster_below: true
      distance: 50-200

    # Uniform distribution (fabricated data)
    uniform_injection:
      probability: 0.05
      range: [1000, 9999]

6. Statistical Validation Framework

6.1 Distribution Fitness Tests

#![allow(unused)]
fn main() {
pub struct DistributionValidator {
    tests: Vec<StatisticalTest>,
}

pub enum StatisticalTest {
    /// Kolmogorov-Smirnov test
    KolmogorovSmirnov { significance: f64 },
    /// Chi-squared goodness of fit
    ChiSquared { bins: usize, significance: f64 },
    /// Anderson-Darling test
    AndersonDarling { significance: f64 },
    /// Benford's Law chi-squared
    BenfordChiSquared { digits: u8, significance: f64 },
    /// Mean Absolute Deviation from Benford
    BenfordMAD { threshold: f64 },
}
}
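
The BenfordMAD test compares observed first-digit frequencies against the expected log10(1 + 1/d) proportions and averages the absolute deviations; a minimal sketch (the acceptance thresholds, such as 0.015, come from the validation configuration below):

/// Mean Absolute Deviation of the observed first-digit distribution from Benford's Law.
fn benford_mad(amounts: &[f64]) -> f64 {
    let mut counts = [0usize; 9];
    for &a in amounts.iter().filter(|a| **a > 0.0) {
        // Shift the value into [1, 10) and take its integer part as the leading digit.
        let leading = (a / 10f64.powf(a.log10().floor())) as usize;
        counts[leading.clamp(1, 9) - 1] += 1;
    }
    let total: usize = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    (1..=9)
        .map(|d| {
            let expected = (1.0 + 1.0 / d as f64).log10();
            let observed = counts[d - 1] as f64 / total as f64;
            (observed - expected).abs()
        })
        .sum::<f64>()
        / 9.0
}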

6.2 Validation Configuration

validation:
  statistical_tests:
    enabled: true
    tests:
      - type: benford_first_digit
        threshold_mad: 0.015
        warning_mad: 0.010

      - type: distribution_fit
        target: lognormal
        ks_significance: 0.05

      - type: correlation_check
        expected_correlations:
          - fields: [amount, line_items]
            expected_r: 0.65
            tolerance: 0.10

  reporting:
    generate_plots: true
    output_format: html
    include_raw_data: false

7. New Distribution Types

7.1 Pareto Distribution

For heavy-tailed phenomena (80/20 rule):

# Top 20% of customers generate 80% of revenue
customer_revenue:
  type: pareto
  alpha: 1.16      # Shape parameter for 80/20
  x_min: 1000      # Minimum value
  truncate_max: 10000000  # Optional cap

7.2 Weibull Distribution

For time-to-event data:

# Days until payment
days_to_payment:
  type: weibull
  shape: 2.0       # k > 1: increasing hazard (more likely to pay over time)
  scale: 30.0      # λ: characteristic life
  shift: 0         # Minimum days
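
Both of these draws can be produced by inverting the distribution's CDF against a uniform sample, which keeps them reproducible under a seeded RNG; a minimal sketch, assuming the rand crate:

use rand::Rng;

/// Pareto via inverse CDF: x = x_min / U^(1/alpha). A truncate_max cap, if
/// configured, would be applied to the result.
fn sample_pareto(alpha: f64, x_min: f64, rng: &mut impl Rng) -> f64 {
    let u: f64 = rng.gen::<f64>().max(f64::MIN_POSITIVE); // avoid U = 0
    x_min / u.powf(1.0 / alpha)
}

/// Weibull via inverse CDF: x = scale * (-ln U)^(1/shape), plus any configured shift.
fn sample_weibull(shape: f64, scale: f64, shift: f64, rng: &mut impl Rng) -> f64 {
    let u: f64 = rng.gen::<f64>().max(f64::MIN_POSITIVE);
    shift + scale * (-u.ln()).powf(1.0 / shape)
}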

7.3 Beta Distribution

For proportions and percentages:

# Discount percentage
discount_rate:
  type: beta
  alpha: 2.0       # Shape parameter 1
  beta: 8.0        # Shape parameter 2
  # Right-skewed; unscaled Beta(2,8) has mode (alpha-1)/(alpha+beta-2) = 0.125 before the scale is applied
  scale:
    min: 0.0
    max: 0.25      # Max 25% discount

7.4 Zero-Inflated Distributions

For data with excess zeros:

# Credits/returns (many transactions have zero)
credit_amount:
  type: zero_inflated
  zero_probability: 0.85
  positive_distribution:
    type: lognormal
    mu: 5.0
    sigma: 1.5

8. Implementation Priority

| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| Mixture models | Medium | High | P1 |
| Correlation matrices | High | Critical | P1 |
| Industry-specific profiles | Medium | High | P1 |
| Regime changes | Medium | High | P2 |
| Copula support | High | Medium | P2 |
| Additional distributions | Low | Medium | P2 |
| Validation framework | Medium | High | P1 |
| Conditional distributions | Medium | Medium | P3 |

9. Configuration Example

# Complete statistical distribution configuration
distributions:
  # Global amount settings
  amounts:
    default:
      type: lognormal_mixture
      components:
        - { weight: 0.6, mu: 6.0, sigma: 1.5 }
        - { weight: 0.3, mu: 8.5, sigma: 1.0 }
        - { weight: 0.1, mu: 11.0, sigma: 0.8 }

    by_transaction_type:
      payroll:
        type: normal
        mean: 4500
        std_dev: 1500
        truncate_min: 1000

      utilities:
        type: lognormal
        mu: 7.0
        sigma: 0.5

  # Correlation settings
  correlations:
    enabled: true
    model: gaussian_copula
    pairs:
      - fields: [amount, processing_days]
        correlation: 0.45
      - fields: [amount, approval_level]
        correlation: 0.72

  # Drift settings
  drift:
    enabled: true
    inflation_rate: 0.03
    regime_changes:
      - date: "2024-06-01"
        field: avg_transaction
        multiplier: 1.15

  # Validation
  validation:
    benford_compliance: true
    distribution_tests: true
    correlation_verification: true

Technical Implementation Notes

Performance Considerations

  1. Pre-computation: Calculate CDF tables for frequently-used distributions
  2. Vectorization: Use SIMD for batch sampling where possible
  3. Caching: Cache correlation matrix decompositions (Cholesky)
  4. Lazy evaluation: Defer complex distribution calculations until needed

Memory Efficiency

  1. Streaming: Generate correlated samples in batches
  2. Reference tables: Use compact lookup tables for standard distributions
  3. On-demand: Compute regime-adjusted parameters at sample time

See also: 03-temporal-patterns.md for time-based distributions

Research: Temporal Patterns and Distributions

Implementation Status: Core temporal patterns implemented in v0.3.0. See CLAUDE.md for configuration examples.

Implementation Summary (v0.3.0)

| Feature | Status | Location |
|---|---|---|
| Business day calculator | ✅ Implemented | datasynth-core/src/distributions/business_day.rs |
| Holiday calendars (11 regions) | ✅ Implemented | datasynth-core/src/distributions/holidays.rs |
| Period-end dynamics (decay curves) | ✅ Implemented | datasynth-core/src/distributions/period_end.rs |
| Processing lag modeling | ✅ Implemented | datasynth-core/src/distributions/processing_lag.rs |
| Timezone handling | ✅ Implemented | datasynth-core/src/distributions/timezone.rs |
| Fiscal calendar (custom, 4-4-5) | ✅ Implemented | Config: fiscal_calendar |
| Intraday segments | ✅ Implemented | Config: intraday |
| Settlement rules (T+N) | ✅ Implemented | business_day.rs |
| Half-day policies | ✅ Implemented | business_day.rs |
| Lunar calendars | 🔄 Planned | Approximate via fixed dates |

Current State Analysis

Existing Temporal Infrastructure

| Component | Lines | Functionality |
|---|---|---|
| TemporalSampler | 632 | Date/time sampling with seasonality |
| IndustrySeasonality | 538 | 10 industry profiles |
| HolidayCalendar | 852 | 6 regional calendars |
| DriftController | 373 | Gradual/sudden drift |
| FiscalPeriod | 849 | Period close mechanics |
| BiTemporal | 449 | Audit trail versioning |

Current Capabilities

  1. Period-end spikes: Month-end (2.5x), Quarter-end (4.0x), Year-end (6.0x)
  2. Day-of-week patterns: Monday catch-up (1.3x), Friday wind-down (0.85x)
  3. Holiday handling: 6 regions with ~15 holidays each
  4. Working hours: 8-18 business hours with peak weighting
  5. Industry seasonality: Black Friday, tax season, etc.

Current Gaps

  1. No business day calculation - T+1, T+2 settlement not supported
  2. No fiscal calendar alternatives - Only calendar year supported
  3. Limited regional coverage - LATAM and additional APAC calendars missing
  4. No half-day handling - Early closes before holidays not modeled
  5. Static spike multipliers - No decay curves toward period-end
  6. No timezone awareness - All times in single timezone
  7. Missing lunar calendars - Chinese New Year and Diwali only approximated

Improvement Recommendations

1. Business Day Calculations

1.1 Core Business Day Engine

#![allow(unused)]
fn main() {
pub struct BusinessDayCalculator {
    calendar: HolidayCalendar,
    weekend_days: HashSet<Weekday>,
    half_day_handling: HalfDayPolicy,
}

pub enum HalfDayPolicy {
    FullDay,           // Count as full business day
    HalfDay,           // Count as 0.5 business days
    NonBusinessDay,    // Treat as holiday
}

impl BusinessDayCalculator {
    /// Add N business days to a date
    pub fn add_business_days(&self, date: NaiveDate, days: i32) -> NaiveDate;

    /// Subtract N business days from a date
    pub fn sub_business_days(&self, date: NaiveDate, days: i32) -> NaiveDate;

    /// Count business days between two dates
    pub fn business_days_between(&self, start: NaiveDate, end: NaiveDate) -> i32;

    /// Get the next business day (inclusive or exclusive)
    pub fn next_business_day(&self, date: NaiveDate, inclusive: bool) -> NaiveDate;

    /// Get the previous business day
    pub fn prev_business_day(&self, date: NaiveDate, inclusive: bool) -> NaiveDate;

    /// Is this date a business day?
    pub fn is_business_day(&self, date: NaiveDate) -> bool;
}
}
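
A minimal sketch of the date-walking logic behind is_business_day and add_business_days, assuming the chrono crate and a plain holiday set; half-day policies and the inclusive/exclusive variants are omitted:

use chrono::{Datelike, Duration, NaiveDate, Weekday};
use std::collections::HashSet;

fn is_business_day(date: NaiveDate, holidays: &HashSet<NaiveDate>) -> bool {
    !matches!(date.weekday(), Weekday::Sat | Weekday::Sun) && !holidays.contains(&date)
}

/// Walk forward (or backward for negative `days`) one calendar day at a time,
/// counting only business days.
fn add_business_days(start: NaiveDate, days: i32, holidays: &HashSet<NaiveDate>) -> NaiveDate {
    let step: i64 = if days >= 0 { 1 } else { -1 };
    let mut remaining = days.abs();
    let mut date = start;
    while remaining > 0 {
        date = date + Duration::days(step);
        if is_business_day(date, holidays) {
            remaining -= 1;
        }
    }
    date
}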

1.2 Settlement Date Logic

settlement_rules:
  enabled: true
  conventions:
    # Standard equity settlement
    equity:
      type: T_plus_N
      days: 2
      calendar: exchange

    # Government bonds
    government_bonds:
      type: T_plus_N
      days: 1
      calendar: federal

    # Corporate bonds
    corporate_bonds:
      type: T_plus_N
      days: 2
      calendar: combined

    # FX spot
    fx_spot:
      type: T_plus_N
      days: 2
      calendar: both_currencies

    # Wire transfers
    wire_domestic:
      type: same_day_or_next
      cutoff_time: "14:00"
      calendar: federal

    # ACH
    ach:
      type: T_plus_N
      days: 1-3
      distribution: { 1: 0.6, 2: 0.3, 3: 0.1 }

1.3 Month-End Conventions

month_end_conventions:
  # Modified Following
  modified_following:
    if_holiday: next_business_day
    if_crosses_month: previous_business_day

  # Preceding
  preceding:
    if_holiday: previous_business_day

  # Following
  following:
    if_holiday: next_business_day

  # End of Month
  end_of_month:
    if_start_is_eom: end_stays_eom

2. Expanded Regional Calendars

2.1 Additional Regions

Latin America:

calendars:
  brazil:
    holidays:
      - name: "Carnival"
        type: floating
        rule: "easter - 47 days"
        duration_days: 2
        activity_multiplier: 0.05

      - name: "Tiradentes Day"
        type: fixed
        month: 4
        day: 21

      - name: "Independence Day"
        type: fixed
        month: 9
        day: 7

      - name: "Republic Day"
        type: fixed
        month: 11
        day: 15

  mexico:
    holidays:
      - name: "Constitution Day"
        type: floating
        rule: "first monday of february"

      - name: "Benito Juárez Birthday"
        type: floating
        rule: "third monday of march"

      - name: "Labor Day"
        type: fixed
        month: 5
        day: 1

      - name: "Independence Day"
        type: fixed
        month: 9
        day: 16

      - name: "Revolution Day"
        type: floating
        rule: "third monday of november"

      - name: "Day of the Dead"
        type: fixed
        month: 11
        day: 2
        activity_multiplier: 0.3

Asia-Pacific Expansion:

  australia:
    holidays:
      - name: "Australia Day"
        type: fixed
        month: 1
        day: 26
        observance: "next_monday_if_weekend"

      - name: "ANZAC Day"
        type: fixed
        month: 4
        day: 25

      - name: "Queen's Birthday"
        type: floating
        rule: "second monday of june"
        regional_variation: true  # Different dates by state

  singapore:
    holidays:
      - name: "Chinese New Year"
        type: lunar
        duration_days: 2

      - name: "Vesak Day"
        type: lunar

      - name: "Hari Raya Puasa"
        type: islamic
        rule: "end of ramadan"

      - name: "Deepavali"
        type: lunar
        calendar: hindu

  south_korea:
    holidays:
      - name: "Seollal"
        type: lunar
        calendar: korean
        duration_days: 3

      - name: "Chuseok"
        type: lunar
        calendar: korean
        duration_days: 3

2.2 Lunar Calendar Implementation

#![allow(unused)]
fn main() {
/// Accurate lunar calendar calculations
pub struct LunarCalendar {
    calendar_type: LunarCalendarType,
    cache: HashMap<i32, Vec<LunarDate>>,
}

pub enum LunarCalendarType {
    Chinese,    // Chinese lunisolar
    Islamic,    // Hijri calendar
    Hebrew,     // Jewish calendar
    Hindu,      // Vikram Samvat
    Korean,     // Dangun calendar
}

impl LunarCalendar {
    /// Convert Gregorian date to lunar date
    pub fn to_lunar(&self, date: NaiveDate) -> LunarDate;

    /// Convert lunar date to Gregorian
    pub fn to_gregorian(&self, lunar: LunarDate) -> NaiveDate;

    /// Get Chinese New Year date for a given Gregorian year
    pub fn chinese_new_year(&self, year: i32) -> NaiveDate;

    /// Get Ramadan start date for a given Gregorian year
    pub fn ramadan_start(&self, year: i32) -> NaiveDate;

    /// Get Diwali date (new moon in Kartik)
    pub fn diwali(&self, year: i32) -> NaiveDate;
}
}

3. Period-End Dynamics

3.1 Decay Curves Instead of Static Multipliers

Replace flat multipliers with realistic acceleration curves:

period_end_dynamics:
  enabled: true

  month_end:
    model: exponential_acceleration
    parameters:
      start_day: -10          # 10 days before month-end
      base_multiplier: 1.0
      peak_multiplier: 3.5
      decay_rate: 0.3         # Exponential decay parameter

    # Activity profile by days-to-close
    daily_profile:
      -10: 1.0
      -7: 1.2
      -5: 1.5
      -3: 2.0
      -2: 2.5
      -1: 3.0                 # Day before close
      0: 3.5                  # Close day

  quarter_end:
    model: stepped_exponential
    inherit_from: month_end
    additional_multiplier: 1.5

  year_end:
    model: extended_crunch
    parameters:
      start_day: -15
      sustained_high_days: 5
      peak_multiplier: 6.0

    # Year-end specific activities
    activities:
      - type: "audit_adjustments"
        days: [-3, -2, -1, 0]
        multiplier: 2.0
      - type: "tax_provisions"
        days: [-5, -4, -3]
        multiplier: 1.5
      - type: "impairment_reviews"
        days: [-10, -9, -8]
        multiplier: 1.3
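
The exponential_acceleration model admits a simple closed form in which the multiplier rises from roughly the base value at start_day to the peak on the close day. A minimal sketch of one plausible parameterization; with the month-end settings above (start_day -10, peak 3.5, decay_rate 0.3) it approximately reproduces the daily_profile table:

/// Activity multiplier as a function of (negative) days to close.
fn period_end_multiplier(days_to_close: i32, start_day: i32, base: f64, peak: f64, decay_rate: f64) -> f64 {
    if days_to_close < start_day || days_to_close > 0 {
        return base; // outside the crunch window
    }
    // Exponential acceleration toward the close day (days_to_close == 0).
    base + (peak - base) * (decay_rate * days_to_close as f64).exp()
}

fn main() {
    for d in [-10, -7, -5, -3, -2, -1, 0] {
        println!("{d:>3} days to close: x{:.2}", period_end_multiplier(d, -10, 1.0, 3.5, 0.3));
    }
}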

3.2 Intra-Day Patterns

intraday_patterns:
  # Morning rush
  morning_spike:
    start: "08:30"
    end: "10:00"
    multiplier: 1.8

  # Pre-lunch activity
  late_morning:
    start: "10:00"
    end: "12:00"
    multiplier: 1.2

  # Lunch lull
  lunch_dip:
    start: "12:00"
    end: "13:30"
    multiplier: 0.4

  # Afternoon steady
  afternoon:
    start: "13:30"
    end: "16:00"
    multiplier: 1.0

  # End-of-day push
  eod_rush:
    start: "16:00"
    end: "17:30"
    multiplier: 1.5

  # After hours (manual only)
  after_hours:
    start: "17:30"
    end: "20:00"
    multiplier: 0.15
    type: manual_only

3.3 Time Zone Handling

timezones:
  enabled: true

  company_timezones:
    default: "America/New_York"
    by_entity:
      - entity_pattern: "EU_*"
        timezone: "Europe/London"
      - entity_pattern: "DE_*"
        timezone: "Europe/Berlin"
      - entity_pattern: "APAC_*"
        timezone: "Asia/Singapore"
      - entity_pattern: "JP_*"
        timezone: "Asia/Tokyo"

  posting_behavior:
    # Consolidation timing
    consolidation:
      coordinator_timezone: "America/New_York"
      cutoff_time: "18:00"

    # Intercompany coordination
    intercompany:
      settlement_timezone: "UTC"
      matching_window_hours: 24

4. Fiscal Calendar Alternatives

4.1 Non-Calendar Year Support

fiscal_calendar:
  type: custom
  year_start:
    month: 7
    day: 1
  # Fiscal year 2024 = July 1, 2024 - June 30, 2025

  period_naming:
    format: "FY{year}P{period:02}"
    # FY2024P01 = July 2024

4.2 4-4-5 Calendar

fiscal_calendar:
  type: 4-4-5
  year_start:
    anchor: first_sunday_of_february
    # Or: last_saturday_of_january

  periods:
    Q1:
      - weeks: 4
      - weeks: 4
      - weeks: 5
    Q2:
      - weeks: 4
      - weeks: 4
      - weeks: 5
    Q3:
      - weeks: 4
      - weeks: 4
      - weeks: 5
    Q4:
      - weeks: 4
      - weeks: 4
      - weeks: 5

  # 53rd week handling (every 5-6 years)
  leap_week:
    occurrence: calculated
    placement: Q4_P3  # Added to last period

4.3 13-Period Calendar

fiscal_calendar:
  type: 13_period
  weeks_per_period: 4
  year_start:
    anchor: first_monday_of_january

  # 53rd week handling
  extra_week_period: 13

5. Advanced Seasonality

5.1 Multi-Factor Seasonality

seasonality:
  factors:
    # Annual cycle
    annual:
      type: fourier
      harmonics: 3
      coefficients:
        cos1: 0.15
        sin1: 0.08
        cos2: 0.05
        sin2: 0.03
        cos3: 0.02
        sin3: 0.01

    # Weekly cycle
    weekly:
      type: categorical
      values:
        monday: 1.25
        tuesday: 1.10
        wednesday: 1.00
        thursday: 1.00
        friday: 0.90
        saturday: 0.15
        sunday: 0.05

    # Monthly cycle (within month)
    monthly:
      type: piecewise
      segments:
        - days: [1, 5]
          multiplier: 1.3
          label: "month_start"
        - days: [6, 20]
          multiplier: 0.9
          label: "mid_month"
        - days: [21, 31]
          multiplier: 1.4
          label: "month_end"

  # Interaction effects
  interactions:
    - factors: [annual, weekly]
      type: multiplicative
    - factors: [monthly, weekly]
      type: additive

5.2 Weather-Driven Seasonality

For relevant industries:

weather_seasonality:
  enabled: true
  industries: [retail, utilities, agriculture, construction]

  patterns:
    temperature:
      cold_threshold: 32  # Fahrenheit
      hot_threshold: 85
      effects:
        cold:
          utilities: 1.8
          construction: 0.5
          retail_outdoor: 0.3
        hot:
          utilities: 1.5
          construction: 0.8
          retail_outdoor: 1.3

    precipitation:
      effects:
        rain:
          construction: 0.6
          retail_brick_mortar: 0.8
          retail_online: 1.2

  # Regional weather profiles
  regional_profiles:
    northeast_us:
      winter_severity: high
      summer_humidity: medium
    southwest_us:
      winter_severity: low
      summer_heat: extreme
    pacific_northwest:
      precipitation_days: high
      temperature_variance: low

6. Transaction Timing Realism

6.1 Processing Lag Modeling

processing_lags:
  # Time between event and posting
  event_to_posting:
    distribution: lognormal
    parameters:
      sales_order:
        mu: 0.5    # ~1.6 hours median
        sigma: 0.8
      goods_receipt:
        mu: 1.5    # ~4.5 hours median
        sigma: 0.5
      invoice_receipt:
        mu: 2.0    # ~7.4 hours median
        sigma: 0.6
      payment:
        mu: 0.2    # ~1.2 hours median
        sigma: 0.3

  # Day-boundary crossing
  cross_day_posting:
    enabled: true
    probability_by_hour:
      "17:00": 0.7   # 70% post next day if after 5pm
      "19:00": 0.9
      "21:00": 0.99

  # Batch processing delays
  batch_delays:
    enabled: true
    schedules:
      nightly_batch:
        run_time: "02:00"
        affects: [bank_transactions, interfaces]
      hourly_sync:
        interval_minutes: 60
        affects: [inventory_movements]
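
Each lag above is a log-normal draw in hours added to the business event time; a minimal sketch, assuming the chrono and rand_distr crates, where mu and sigma are the log-scale parameters from the configuration:

use chrono::{Duration, NaiveDateTime};
use rand::Rng;
use rand_distr::LogNormal;

/// Posting time = event time + log-normally distributed lag (in hours).
fn posting_time(event_time: NaiveDateTime, mu: f64, sigma: f64, rng: &mut impl Rng) -> NaiveDateTime {
    let lag_hours: f64 = rng.sample(LogNormal::new(mu, sigma).expect("sigma must be positive"));
    event_time + Duration::seconds((lag_hours * 3600.0) as i64)
}

The cross-day posting probabilities can then be applied by checking whether the resulting timestamp falls after the configured cutoff hour.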

6.2 Human vs. System Posting Patterns

posting_patterns:
  human:
    # Working hours focus
    primary_hours: [9, 10, 11, 14, 15, 16]
    probability: 0.8

    # Occasional overtime
    extended_hours: [8, 17, 18, 19]
    probability: 0.15

    # Rare late night
    late_hours: [20, 21, 22]
    probability: 0.05

    # Keystroke timing (for detailed simulation)
    entry_duration:
      simple_je:
        mean_seconds: 45
        std_seconds: 15
      complex_je:
        mean_seconds: 180
        std_seconds: 60

  system:
    # Interface postings
    interface:
      typical_times: ["01:00", "05:00", "13:00"]
      duration_minutes: 15-45
      burst_rate: 100-500  # Records per minute

    # Automated recurring
    recurring:
      time: "00:30"
      day: first_business_day

    # Real-time integrations
    realtime:
      latency_ms: 100-500
      batch_size: 1

7. Period Close Orchestration

7.1 Close Calendar Generation

close_calendar:
  enabled: true

  # Standard close schedule
  monthly:
    soft_close:
      day: 2        # 2nd business day
      activities: [preliminary_review, initial_accruals]
    hard_close:
      day: 5        # 5th business day
      activities: [final_adjustments, lock_period]
    reporting:
      day: 7        # 7th business day
      activities: [management_reports, variance_analysis]

  quarterly:
    extended_close:
      additional_days: 3
    activities:
      - quarter_end_reserves
      - intercompany_reconciliation
      - consolidation

  annual:
    extended_close:
      additional_days: 10
    activities:
      - audit_adjustments
      - tax_provisions
      - impairment_testing
      - goodwill_analysis
      - segment_reporting

7.2 Late Posting Behavior

late_postings:
  enabled: true

  # Probability of late posting by days after close
  probability_curve:
    day_1: 0.08    # 8% of transactions post 1 day late
    day_2: 0.03
    day_3: 0.01
    day_4: 0.005
    day_5+: 0.002

  # Characteristics of late postings
  characteristics:
    # More likely to be corrections
    correction_probability: 0.4
    # Higher average amount
    amount_multiplier: 1.5
    # Require special approval
    approval_required: true
    # Must reference original period
    period_reference: required

8. Implementation Priority

| Enhancement | Complexity | Impact | Priority | Status |
|---|---|---|---|---|
| Business day calculator | Medium | Critical | P1 | ✅ v0.3.0 |
| Additional regional calendars | Medium | High | P1 | ✅ v0.3.0 (11 regions) |
| Decay curves for period-end | Low | High | P1 | ✅ v0.3.0 |
| Non-calendar fiscal years | Medium | Medium | P2 | ✅ v0.3.0 |
| 4-4-5 calendar support | High | Medium | P2 | ✅ v0.3.0 |
| Timezone handling | Medium | Medium | P2 | ✅ v0.3.0 |
| Lunar calendar accuracy | High | Medium | P3 | 🔄 Planned |
| Weather seasonality | Medium | Low | P3 | 🔄 Planned |
| Intra-day patterns | Low | Medium | P2 | ✅ v0.3.0 |
| Processing lag modeling | Medium | High | P1 | ✅ v0.3.0 |

9. Validation Metrics

temporal_validation:
  metrics:
    # Period-end spike ratios
    period_end_spikes:
      month_end_ratio:
        expected: 2.0-3.0
        tolerance: 0.5
      quarter_end_ratio:
        expected: 3.5-4.5
        tolerance: 0.5
      year_end_ratio:
        expected: 5.0-7.0
        tolerance: 1.0

    # Day-of-week distribution
    dow_distribution:
      test: chi_squared
      expected_weights: [1.3, 1.1, 1.0, 1.0, 0.85, 0.1, 0.05]
      significance: 0.05

    # Holiday compliance
    holiday_activity:
      max_activity_on_holiday: 0.1
      allow_exceptions: ["bank_settlement"]

    # Business hours
    business_hours:
      human_transactions:
        in_hours_rate: 0.85-0.95
      system_transactions:
        off_hours_allowed: true

    # Late posting rate
    late_postings:
      max_rate: 0.15
      concentration_test: true  # Should not cluster

See also: 04-interconnectivity.md for relationship modeling

Research: Interconnectivity and Relationship Modeling

Implementation Status: P1 features implemented in v0.3.0. See Interconnectivity Documentation for usage.

Implementation Summary (v0.3.0)

| Feature | Status | Location |
|---|---|---|
| Multi-tier vendor networks | ✅ Implemented | datasynth-core/src/models/vendor_network.rs |
| Vendor clusters & lifecycle | ✅ Implemented | datasynth-core/src/models/vendor_network.rs |
| Customer value segmentation | ✅ Implemented | datasynth-core/src/models/customer_segment.rs |
| Customer lifecycle stages | ✅ Implemented | datasynth-core/src/models/customer_segment.rs |
| Relationship strength modeling | ✅ Implemented | datasynth-core/src/models/relationship.rs |
| Entity graph (16 types, 26 relations) | ✅ Implemented | datasynth-core/src/models/relationship.rs |
| Cross-process links (P2P↔O2C) | ✅ Implemented | datasynth-generators/src/relationships/ |
| Network evaluation metrics | ✅ Implemented | datasynth-eval/src/coherence/network.rs |
| Configuration & validation | ✅ Implemented | datasynth-config/src/schema.rs, validation.rs |
| Organizational hierarchy depth | 🔄 P2 - Planned | - |
| Network effect modeling | 🔄 P2 - Planned | - |
| Community detection | 🔄 P3 - Planned | - |

Current State Analysis

Existing Relationship Infrastructure

| Relationship Type | Implementation | Depth |
|---|---|---|
| Document Chains | DocumentChainManager | Strong |
| Three-Way Match | ThreeWayMatcher | Strong |
| Intercompany | ICMatchingEngine | Strong |
| GL Balance Links | Account hierarchies | Medium |
| Vendor-Customer | Basic master data | Weak |
| Employee-Approval | Approval chains | Medium |
| Entity Registry | EntityRegistry | Medium |

Current Strengths

  1. Document flow integrity: PO → GR → Invoice → Payment chains maintained
  2. Intercompany matching: Automatic generation of offsetting entries
  3. Balance coherence: Trial balance validation, A=L+E enforcement
  4. Graph export: PyTorch Geometric, Neo4j, DGL formats supported
  5. COSO control mapping: Controls linked to processes and risks

Current Gaps

  1. Shallow vendor networks: No supplier-of-supplier modeling
  2. Limited customer relationships: No customer segmentation
  3. No organizational hierarchy depth: Flat cost center structures
  4. Missing behavioral clustering: Entities don’t cluster by behavior
  5. No network effects: Relationships don’t influence behavior
  6. Static relationships: No relationship lifecycle modeling

Improvement Recommendations

1. Deep Vendor Network Modeling

1.1 Multi-Tier Supply Chain

vendor_network:
  enabled: true
  depth: 3  # Tier-1, Tier-2, Tier-3 suppliers

  tiers:
    tier_1:
      count: 50-100
      relationship: direct_supplier
      visibility: full
      transaction_volume: high

    tier_2:
      count: 200-500
      relationship: supplier_of_supplier
      visibility: partial
      transaction_volume: medium
      # Only visible through Tier-1 transactions

    tier_3:
      count: 500-2000
      relationship: indirect
      visibility: minimal
      transaction_volume: low

  # Dependency modeling
  dependencies:
    concentration:
      max_single_vendor: 0.15  # No vendor > 15% of spend
      top_5_vendors: 0.45      # Top 5 < 45% of spend

    critical_materials:
      single_source: 0.05      # 5% of materials are single-source
      dual_source: 0.15
      multi_source: 0.80

    substitutability:
      easy: 0.60
      moderate: 0.30
      difficult: 0.10

1.2 Vendor Relationship Attributes

#![allow(unused)]
fn main() {
pub struct VendorRelationship {
    vendor_id: VendorId,
    relationship_type: VendorRelationshipType,
    start_date: NaiveDate,
    end_date: Option<NaiveDate>,

    // Relationship strength
    strategic_importance: StrategicLevel,  // Critical, Important, Standard, Transactional
    spend_tier: SpendTier,                 // Platinum, Gold, Silver, Bronze

    // Behavioral attributes
    payment_history: PaymentBehavior,
    dispute_frequency: DisputeLevel,
    quality_score: f64,

    // Contract terms
    contracted_rates: Vec<ContractedRate>,
    rebate_agreements: Vec<RebateAgreement>,
    payment_terms: PaymentTerms,

    // Network position
    tier: SupplyChainTier,
    parent_vendor: Option<VendorId>,
    child_vendors: Vec<VendorId>,
}

pub enum VendorRelationshipType {
    DirectSupplier,
    ServiceProvider,
    Contractor,
    Distributor,
    Manufacturer,
    RawMaterialSupplier,
    OEMPartner,
    Affiliate,
}
}

1.3 Vendor Behavior Clustering

vendor_clusters:
  enabled: true

  clusters:
    reliable_strategic:
      size: 0.20
      characteristics:
        payment_terms: [30, 45, 60]
        on_time_delivery: 0.95-1.0
        quality_issues: rare
        price_stability: high
        transaction_frequency: weekly+

    standard_operational:
      size: 0.50
      characteristics:
        payment_terms: [30]
        on_time_delivery: 0.85-0.95
        quality_issues: occasional
        price_stability: medium
        transaction_frequency: monthly

    transactional:
      size: 0.25
      characteristics:
        payment_terms: [0, 15]
        on_time_delivery: 0.75-0.90
        quality_issues: moderate
        price_stability: low
        transaction_frequency: quarterly

    problematic:
      size: 0.05
      characteristics:
        payment_terms: [0]  # COD only
        on_time_delivery: 0.50-0.80
        quality_issues: frequent
        price_stability: volatile
        transaction_frequency: declining

2. Customer Relationship Depth

2.1 Customer Segmentation

customer_segmentation:
  enabled: true

  dimensions:
    value:
      - segment: enterprise
        revenue_share: 0.40
        customer_share: 0.05
        characteristics:
          avg_order_value: 50000+
          order_frequency: weekly
          payment_behavior: terms
          churn_risk: low

      - segment: mid_market
        revenue_share: 0.35
        customer_share: 0.20
        characteristics:
          avg_order_value: 5000-50000
          order_frequency: monthly
          payment_behavior: mixed
          churn_risk: medium

      - segment: smb
        revenue_share: 0.20
        customer_share: 0.50
        characteristics:
          avg_order_value: 500-5000
          order_frequency: quarterly
          payment_behavior: prepay
          churn_risk: high

      - segment: consumer
        revenue_share: 0.05
        customer_share: 0.25
        characteristics:
          avg_order_value: 50-500
          order_frequency: occasional
          payment_behavior: immediate
          churn_risk: very_high

    lifecycle:
      - stage: prospect
        conversion_rate: 0.15
        avg_duration_days: 30

      - stage: new
        definition: "<90 days"
        behavior: exploring
        support_intensity: high

      - stage: growth
        definition: "90-365 days"
        behavior: expanding
        upsell_opportunity: high

      - stage: mature
        definition: ">365 days"
        behavior: stable
        retention_focus: true

      - stage: at_risk
        triggers: [declining_orders, late_payments, complaints]
        intervention: required

      - stage: churned
        definition: "no activity >180 days"
        win_back_probability: 0.10

2.2 Customer Network Effects

customer_networks:
  enabled: true

  # Referral relationships
  referrals:
    enabled: true
    referral_rate: 0.15
    referred_customer_value_multiplier: 1.2
    max_referral_chain: 3

  # Parent-child relationships (corporate structures)
  corporate_hierarchies:
    enabled: true
    probability: 0.30
    hierarchy_depth: 3
    billing_consolidation: true

  # Industry clustering
  industry_affinity:
    enabled: true
    same_industry_cluster_probability: 0.40
    industry_trend_correlation: 0.70

3. Organizational Hierarchy Modeling

3.1 Deep Cost Center Structure

organizational_structure:
  depth: 5

  levels:
    - level: 1
      name: division
      count: 3-5
      examples: ["North America", "EMEA", "APAC"]

    - level: 2
      name: business_unit
      count_per_parent: 2-4
      examples: ["Commercial", "Consumer", "Industrial"]

    - level: 3
      name: department
      count_per_parent: 3-6
      examples: ["Sales", "Marketing", "Operations", "Finance"]

    - level: 4
      name: function
      count_per_parent: 2-5
      examples: ["Inside Sales", "Field Sales", "Sales Ops"]

    - level: 5
      name: team
      count_per_parent: 2-4
      examples: ["Team Alpha", "Team Beta"]

  # Cross-cutting structures
  matrix_relationships:
    enabled: true
    types:
      - primary: division
        secondary: function
        # e.g., "EMEA Sales" reports to both EMEA Head and Global Sales VP

  # Shared services
  shared_services:
    enabled: true
    centers:
      - name: "Corporate Finance"
        serves: all_divisions
        allocation_method: headcount

      - name: "IT Infrastructure"
        serves: all_divisions
        allocation_method: usage

      - name: "HR Services"
        serves: all_divisions
        allocation_method: headcount

3.2 Approval Hierarchy

approval_hierarchy:
  enabled: true

  # Spending authority matrix
  authority_matrix:
    manager:
      limit: 5000
      exception_rate: 0.02

    senior_manager:
      limit: 25000
      exception_rate: 0.01

    director:
      limit: 100000
      exception_rate: 0.005

    vp:
      limit: 500000
      exception_rate: 0.002

    c_level:
      limit: unlimited
      exception_rate: 0.001

  # Approval chains
  chain_rules:
    sequential:
      enabled: true
      for: [capital_expenditure, contracts]

    parallel:
      enabled: true
      for: [operational_expenses]
      minimum_approvals: 2

    skip_level:
      enabled: true
      probability: 0.05
      audit_flag: true
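
The authority matrix reduces to a threshold lookup from posting amount to the minimum role allowed to approve it; a minimal sketch mirroring the limits above:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ApproverRole {
    Manager,
    SeniorManager,
    Director,
    Vp,
    CLevel,
}

/// Minimum role whose spending limit covers `amount`, per the authority matrix above.
fn required_approver(amount: f64) -> ApproverRole {
    match amount {
        a if a <= 5_000.0 => ApproverRole::Manager,
        a if a <= 25_000.0 => ApproverRole::SeniorManager,
        a if a <= 100_000.0 => ApproverRole::Director,
        a if a <= 500_000.0 => ApproverRole::Vp,
        _ => ApproverRole::CLevel,
    }
}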

4. Entity Relationship Graph

4.1 Comprehensive Relationship Model

#![allow(unused)]
fn main() {
/// Unified entity relationship graph
pub struct EntityGraph {
    nodes: HashMap<EntityId, EntityNode>,
    edges: Vec<RelationshipEdge>,
    indexes: GraphIndexes,
}

pub struct EntityNode {
    id: EntityId,
    entity_type: EntityType,
    attributes: EntityAttributes,
    created_at: DateTime<Utc>,
    last_activity: DateTime<Utc>,
}

pub enum EntityType {
    Company,
    Vendor,
    Customer,
    Employee,
    Department,
    CostCenter,
    Project,
    Contract,
    Asset,
    BankAccount,
}

pub struct RelationshipEdge {
    from_id: EntityId,
    to_id: EntityId,
    relationship_type: RelationshipType,
    strength: f64,           // 0.0 - 1.0
    start_date: NaiveDate,
    end_date: Option<NaiveDate>,
    attributes: RelationshipAttributes,
}

pub enum RelationshipType {
    // Transactional
    BuysFrom,
    SellsTo,
    PaysTo,
    ReceivesFrom,

    // Organizational
    ReportsTo,
    Manages,
    BelongsTo,
    OwnedBy,

    // Contractual
    ContractedWith,
    GuaranteedBy,
    InsuredBy,

    // Financial
    LendsTo,
    BorrowsFrom,
    InvestsIn,

    // Network
    ReferredBy,
    PartnersWith,
    CompetesWith,
}
}

4.2 Relationship Strength Modeling

relationship_strength:
  calculation:
    type: composite
    factors:
      transaction_volume:
        weight: 0.30
        normalization: log_scale

      transaction_count:
        weight: 0.25
        normalization: sqrt_scale

      relationship_duration:
        weight: 0.20
        decay: none

      recency:
        weight: 0.15
        decay: exponential
        half_life_days: 90

      mutual_connections:
        weight: 0.10
        normalization: jaccard_similarity

  thresholds:
    strong: 0.7
    moderate: 0.4
    weak: 0.1
    dormant: 0.0
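
The composite score is a weighted sum of normalized factors, with recency decayed by the 90-day half-life; a minimal sketch in which the normalization caps are illustrative stand-ins for the scales named above:

/// Composite relationship strength in [0, 1], using the factor weights above.
/// The normalization caps (volume, count, duration) are illustrative assumptions.
fn relationship_strength(
    transaction_volume: f64,          // total monetary volume
    transaction_count: f64,
    relationship_duration_days: f64,
    days_since_last_transaction: f64,
    mutual_connection_jaccard: f64,   // already in [0, 1]
) -> f64 {
    let volume_score = (transaction_volume.max(1.0).log10() / 8.0).min(1.0); // log scale
    let count_score = (transaction_count.sqrt() / 100.0).min(1.0);           // sqrt scale
    let duration_score = (relationship_duration_days / 3650.0).min(1.0);     // 10-year cap
    let recency_score = 0.5_f64.powf(days_since_last_transaction / 90.0);    // 90-day half-life

    0.30 * volume_score
        + 0.25 * count_score
        + 0.20 * duration_score
        + 0.15 * recency_score
        + 0.10 * mutual_connection_jaccard
}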

5. Transaction Chain Integrity

5.1 Extended Document Chains

document_chains:
  # P2P extended chain
  procure_to_pay:
    stages:
      - type: purchase_requisition
        optional: true
        approval_required: conditional  # >$1000

      - type: purchase_order
        required: true
        generates: commitment

      - type: goods_receipt
        required: conditional  # For goods, not services
        updates: inventory
        tolerance: 0.05  # 5% over-receipt allowed

      - type: vendor_invoice
        required: true
        matching: three_way  # PO, GR, Invoice
        tolerance: 0.02

      - type: payment
        required: true
        methods: [ach, wire, check, virtual_card]
        generates: bank_transaction

    # Chain integrity rules
    integrity:
      sequence_enforcement: strict
      backdating_allowed: false
      amount_cascade: true  # Amounts must flow through

  # O2C extended chain
  order_to_cash:
    stages:
      - type: quote
        optional: true
        validity_days: 30

      - type: sales_order
        required: true
        credit_check: conditional

      - type: pick_list
        required: conditional
        triggers: inventory_reservation

      - type: delivery
        required: conditional
        updates: inventory
        generates: shipping_document

      - type: customer_invoice
        required: true
        triggers: revenue_recognition

      - type: customer_receipt
        required: true
        applies_to: invoices
        generates: bank_transaction

    integrity:
      partial_shipment: allowed
      partial_payment: allowed
      credit_memo: allowed

5.2 Cross-Process Linkages

cross_process_links:
  enabled: true

  links:
    # Inventory connects P2P and O2C
    - source_process: procure_to_pay
      source_stage: goods_receipt
      target_process: order_to_cash
      target_stage: pick_list
      through: inventory

    # Returns create reverse flows
    - source_process: order_to_cash
      source_stage: delivery
      target_process: returns
      target_stage: return_receipt
      condition: quality_issue

    # Payments connect to bank reconciliation
    - source_process: procure_to_pay
      source_stage: payment
      target_process: bank_reconciliation
      target_stage: bank_statement_line
      matching: automatic

    # Intercompany bilateral links
    - source_process: intercompany_sale
      source_stage: ic_invoice
      target_process: intercompany_purchase
      target_stage: ic_invoice
      matching: elimination_required

6. Network Effect Modeling

6.1 Behavioral Influence

network_effects:
  enabled: true

  influence_types:
    # Transaction patterns spread through network
    transaction_contagion:
      enabled: true
      effect: "similar vendors show similar payment patterns"
      correlation: 0.40
      lag_days: 30

    # Risk propagation
    risk_propagation:
      enabled: true
      effect: "vendor issues affect connected vendors"
      propagation_depth: 2
      decay_per_hop: 0.50

    # Seasonal correlation
    seasonal_sync:
      enabled: true
      effect: "connected entities show correlated seasonality"
      correlation: 0.60

    # Price correlation
    price_linkage:
      enabled: true
      effect: "commodity price changes propagate"
      propagation_speed: immediate
      pass_through_rate: 0.80

6.2 Community Detection

community_detection:
  enabled: true
  algorithms:
    - type: louvain
      resolution: 1.0
      output: vendor_communities

    - type: label_propagation
      output: customer_segments

    - type: girvan_newman
      output: department_clusters

  use_cases:
    # Fraud detection
    fraud_rings:
      algorithm: connected_components
      edge_filter: suspicious_transactions
      min_size: 3

    # Vendor consolidation
    vendor_overlap:
      algorithm: jaccard_similarity
      threshold: 0.70
      output: consolidation_candidates

    # Customer segmentation
    behavioral_clusters:
      algorithm: spectral
      features: [purchase_pattern, payment_behavior, product_mix]

7. Relationship Lifecycle

7.1 Lifecycle Stages

relationship_lifecycle:
  enabled: true

  vendor_lifecycle:
    stages:
      onboarding:
        duration_days: 30-90
        activities: [due_diligence, contract_negotiation, system_setup]
        transaction_volume: limited

      ramp_up:
        duration_days: 90-180
        activities: [volume_increase, performance_monitoring]
        transaction_volume: growing

      steady_state:
        duration_days: ongoing
        activities: [regular_transactions, periodic_review]
        transaction_volume: stable

      decline:
        triggers: [quality_issues, price_competitiveness, strategic_shift]
        activities: [reduced_orders, alternative_sourcing]
        transaction_volume: decreasing

      termination:
        triggers: [contract_end, performance_failure, strategic_decision]
        activities: [final_settlement, transition]
        transaction_volume: zero

    transitions:
      probability_matrix:
        onboarding:
          ramp_up: 0.80
          termination: 0.20
        ramp_up:
          steady_state: 0.85
          decline: 0.10
          termination: 0.05
        steady_state:
          steady_state: 0.90
          decline: 0.08
          termination: 0.02
        decline:
          steady_state: 0.20
          decline: 0.50
          termination: 0.30

  customer_lifecycle:
    # Similar structure for customer relationships
    stages:
      prospect: { conversion_rate: 0.15 }
      new: { retention_rate: 0.70 }
      active: { retention_rate: 0.90 }
      at_risk: { save_rate: 0.50 }
      churned: { win_back_rate: 0.10 }
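
The stage-transition matrix behaves as a Markov chain: at each review period the next vendor stage is drawn from the row for the current stage. A minimal sketch, assuming the rand crate:

use rand::Rng;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VendorStage {
    Onboarding,
    RampUp,
    SteadyState,
    Decline,
    Termination,
}

/// Draw the next stage from the transition row for the current stage,
/// using the probabilities from the matrix above (Termination is absorbing).
fn next_stage(current: VendorStage, rng: &mut impl Rng) -> VendorStage {
    use VendorStage::*;
    let row: Vec<(VendorStage, f64)> = match current {
        Onboarding => vec![(RampUp, 0.80), (Termination, 0.20)],
        RampUp => vec![(SteadyState, 0.85), (Decline, 0.10), (Termination, 0.05)],
        SteadyState => vec![(SteadyState, 0.90), (Decline, 0.08), (Termination, 0.02)],
        Decline => vec![(SteadyState, 0.20), (Decline, 0.50), (Termination, 0.30)],
        Termination => vec![(Termination, 1.00)],
    };
    let u: f64 = rng.gen();
    let mut cumulative = 0.0;
    for (stage, probability) in row {
        cumulative += probability;
        if u < cumulative {
            return stage;
        }
    }
    current // fall through only on floating-point rounding
}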

8. Graph Export Enhancements

8.1 Enhanced PyTorch Geometric Export

graph_export:
  pytorch_geometric:
    enabled: true

    node_features:
      # Node type encoding
      type_encoding: one_hot

      # Numeric features
      numeric:
        - field: transaction_volume
          normalization: log_scale
        - field: relationship_duration_days
          normalization: min_max
        - field: average_amount
          normalization: z_score

      # Categorical features
      categorical:
        - field: industry
          encoding: label
        - field: region
          encoding: one_hot
        - field: segment
          encoding: embedding

    edge_features:
      - field: relationship_strength
        normalization: none
      - field: transaction_count
        normalization: log_scale
      - field: last_transaction_days_ago
        normalization: min_max

    # Temporal graphs
    temporal:
      enabled: true
      snapshot_frequency: monthly
      edge_weight_decay: exponential
      half_life_days: 90

    # Heterogeneous graph support
    heterogeneous:
      enabled: true
      node_types: [company, vendor, customer, employee, account]
      edge_types: [buys_from, sells_to, reports_to, pays_to]

8.2 Enhanced Neo4j Export

neo4j_export:
  enabled: true

  # Node labels
  node_labels:
    - label: Company
      properties: [code, name, currency, country]
    - label: Vendor
      properties: [id, name, category, rating]
    - label: Customer
      properties: [id, name, segment, region]
    - label: Transaction
      properties: [id, amount, date, type]

  # Relationship types
  relationships:
    - type: TRANSACTS_WITH
      properties: [volume, count, first_date, last_date]
    - type: BELONGS_TO
      properties: [start_date, role]
    - type: SUPPLIES
      properties: [material_type, contract_id]

  # Indexes for query optimization
  indexes:
    - label: Transaction
      property: date
      type: range
    - label: Vendor
      property: id
      type: unique
    - label: Customer
      property: segment
      type: lookup

  # Full-text search
  fulltext:
    - name: entity_search
      labels: [Vendor, Customer]
      properties: [name, description]

9. Implementation Priority

| Enhancement | Complexity | Impact | Priority | Status |
|---|---|---|---|---|
| Vendor network depth | High | High | P1 | ✅ v0.3.0 |
| Customer segmentation | Medium | High | P1 | ✅ v0.3.0 |
| Organizational hierarchy | Medium | Medium | P2 | 🔄 Planned |
| Relationship strength modeling | Medium | High | P1 | ✅ v0.3.0 |
| Cross-process linkages | Medium | High | P1 | ✅ v0.3.0 |
| Network effect modeling | High | Medium | P2 | 🔄 Planned |
| Relationship lifecycle | Medium | Medium | P2 | ✅ v0.3.0 |
| Community detection | High | Medium | P3 | 🔄 Planned |
| Enhanced graph export | Low | High | P1 | 🔄 Partial |

10. Validation Framework

relationship_validation:
  integrity_checks:
    # All transactions have valid entity references
    referential_integrity:
      enabled: true
      strict: true

    # Document chains are complete
    chain_completeness:
      enabled: true
      allow_partial: false
      exception_rate: 0.02

    # Intercompany entries balance
    intercompany_balance:
      enabled: true
      tolerance: 0.01

  network_metrics:
    # Graph connectivity
    connectivity:
      check_strongly_connected: false
      check_weakly_connected: true
      max_isolated_nodes: 0.05

    # Degree distribution
    degree_distribution:
      check_power_law: true
      min_alpha: 1.5
      max_alpha: 3.0

    # Clustering coefficient
    clustering:
      min_coefficient: 0.1
      max_coefficient: 0.5

See also: 05-pattern-drift.md for temporal evolution of patterns

Research: Pattern and Process Drift Over Time

Implementation Status: COMPLETE (v0.3.0)

This research document has been fully implemented. See the following modules:

  • datasynth-core/src/models/organizational_event.rs - Organizational events
  • datasynth-core/src/models/process_evolution.rs - Process evolution types
  • datasynth-core/src/models/technology_transition.rs - Technology transitions
  • datasynth-core/src/models/regulatory_events.rs - Regulatory changes
  • datasynth-core/src/models/drift_events.rs - Ground truth labels
  • datasynth-core/src/distributions/behavioral_drift.rs - Behavioral drift
  • datasynth-core/src/distributions/market_drift.rs - Market/economic drift
  • datasynth-core/src/distributions/event_timeline.rs - Event orchestration
  • datasynth-core/src/distributions/drift_recorder.rs - Ground truth recording
  • datasynth-eval/src/statistical/drift_detection.rs - Drift detection evaluation
  • datasynth-config/src/schema.rs - Configuration types

Current State Analysis

Existing Drift Implementation

The current DriftController (373 lines) supports:

| Drift Type | Implementation | Realism |
|---|---|---|
| Gradual | Linear parameter drift | Medium |
| Sudden | Point-in-time shifts | Medium |
| Recurring | Seasonal patterns | Good |
| Mixed | Combination modes | Medium |

Current Capabilities

  1. Amount drift: Mean and variance adjustments over time
  2. Anomaly rate drift: Changing fraud/error rates
  3. Concept drift factor: Generic drift indicator
  4. Seasonal adjustment: Periodic recurring patterns
  5. Sudden drift probability: Random regime changes

Current Gaps

  1. No organizational events: Mergers, restructuring not modeled
  2. No process evolution: Static business processes
  3. No regulatory changes: Compliance requirements don’t evolve
  4. No technology transitions: System changes not simulated
  5. No behavioral drift: Entity behaviors remain static
  6. No market-driven drift: External factors not modeled
  7. Limited drift detection signals: Hard to validate drift presence

Improvement Recommendations

1. Organizational Event Modeling

1.1 Corporate Event Timeline

organizational_events:
  enabled: true

  events:
    # Mergers and Acquisitions
    - type: acquisition
      date: "2024-06-15"
      acquired_entity: "TargetCorp"
      effects:
        - entity_count_increase: 1.35
        - vendor_count_increase: 1.25
        - customer_overlap: 0.15
        - integration_period_months: 12
        - synergy_realization:
            start_month: 6
            full_realization_month: 18
            cost_reduction: 0.08

    # Divestiture
    - type: divestiture
      date: "2024-09-01"
      divested_entity: "NonCoreBusiness"
      effects:
        - revenue_reduction: 0.12
        - entity_count_reduction: 0.10
        - vendor_transition_period: 6

    # Reorganization
    - type: reorganization
      date: "2024-04-01"
      structure_change: functional_to_regional
      effects:
        - cost_center_restructure: true
        - approval_chain_changes: true
        - reporting_line_changes: true
        - transition_period_months: 3
        - temporary_confusion_factor: 1.15

    # Leadership Change
    - type: leadership_change
      date: "2024-07-01"
      position: CFO
      effects:
        - policy_changes_probability: 0.40
        - approval_threshold_review: true
        - vendor_review_trigger: true
        - audit_focus_shift: possible

    # Layoffs
    - type: workforce_reduction
      date: "2024-11-01"
      reduction_percent: 0.10
      effects:
        - employee_count_reduction: 0.10
        - workload_redistribution: true
        - approval_delays: 1.20
        - error_rate_increase: 1.15
        - duration_months: 6

1.2 Integration Pattern Modeling

pub struct IntegrationSimulator {
    phases: Vec<IntegrationPhase>,
    current_phase: usize,
}

pub struct IntegrationPhase {
    name: String,
    start_month: u32,
    end_month: u32,
    effects: IntegrationEffects,
}

pub struct IntegrationEffects {
    // Duplicate transactions during transition
    duplicate_probability: f64,
    // Coding errors during chart migration
    miscoding_rate: f64,
    // Legacy system parallel run
    parallel_posting: bool,
    // Vendor/customer migration errors
    master_data_errors: f64,
    // Timing differences
    posting_delay_multiplier: f64,
}

1.3 Merger Accounting Patterns

merger_accounting:
  enabled: true

  day_1_entries:
    - type: fair_value_adjustment
      accounts: [inventory, fixed_assets, intangibles]
      adjustment_range: [-0.20, 0.30]

    - type: goodwill_recognition
      calculation: "purchase_price - fair_value_net_assets"

    - type: liability_assumption
      includes: [accounts_payable, debt, contingencies]

  post_merger:
    # Integration costs
    integration_expenses:
      monthly_range: [100000, 500000]
      duration_months: 12-18
      categories: [consulting, severance, systems, legal]

    # Synergy realization
    synergies:
      start_month: 6
      ramp_up_months: 12
      categories:
        - type: headcount_reduction
          target: 0.05
        - type: vendor_consolidation
          target: 0.10
        - type: facility_optimization
          target: 0.03

    # Restructuring reserves
    restructuring:
      initial_reserve: 5000000
      utilization_pattern: front_loaded
      true_up_probability: 0.30
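
The goodwill_recognition calculation above ("purchase_price - fair_value_net_assets") is a single subtraction; a minimal sketch with illustrative numbers:

/// Minimal sketch of the day-1 goodwill calculation referenced above.
/// Values are illustrative; a real implementation would use a Decimal type.
fn goodwill(purchase_price: f64, fair_value_net_assets: f64) -> f64 {
    // Negative goodwill (a bargain purchase) is possible and would be
    // recognized in earnings rather than as an asset.
    purchase_price - fair_value_net_assets
}

fn main() {
    // Example: pay 50.0m for net assets with a fair value of 42.5m.
    println!("goodwill = {:.1}m", goodwill(50.0, 42.5)); // 7.5m
}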

2. Process Evolution Modeling

2.1 Business Process Changes

process_evolution:
  enabled: true

  changes:
    # New approval workflow
    - type: approval_workflow_change
      date: "2024-03-01"
      from: sequential
      to: parallel
      effects:
        - approval_time_reduction: 0.40
        - same_day_approval_increase: 0.25
        - skip_approval_detection: improved

    # Automation introduction
    - type: process_automation
      date: "2024-05-01"
      process: invoice_matching
      effects:
        - manual_matching_reduction: 0.70
        - matching_accuracy_improvement: 0.15
        - exception_visibility_increase: true
        - posting_timing: more_consistent

    # Policy change
    - type: policy_change
      date: "2024-08-01"
      policy: expense_approval_limits
      changes:
        - manager_limit: 5000 -> 7500
        - director_limit: 25000 -> 35000
      effects:
        - approval_escalation_reduction: 0.20
        - processing_time_reduction: 0.15

    # Control enhancement
    - type: control_enhancement
      date: "2024-10-01"
      control: three_way_match
      changes:
        - tolerance: 0.05 -> 0.02
        - mandatory_for: all_po_invoices
      effects:
        - exception_rate_increase: 0.15
        - fraud_detection_improvement: 0.25

2.2 Technology Transition Patterns

technology_transitions:
  enabled: true

  transitions:
    # ERP migration
    - type: erp_migration
      phases:
        - name: parallel_run
          start: "2024-06-01"
          duration_months: 3
          effects:
            - duplicate_entries: true
            - reconciliation_required: true
            - posting_delays: 1.30

        - name: cutover
          date: "2024-09-01"
          effects:
            - legacy_system: read_only
            - new_system: live
            - catch_up_period: 5_days

        - name: stabilization
          start: "2024-09-01"
          duration_months: 3
          effects:
            - error_rate_multiplier: 1.25
            - support_ticket_increase: 3.0
            - workaround_transactions: 0.10

    # Module implementation
    - type: module_implementation
      module: advanced_analytics
      go_live: "2024-04-15"
      effects:
        - new_transaction_types: [analytical_adjustment]
        - automated_entries_increase: 0.20

    # Integration change
    - type: integration_upgrade
      system: bank_interface
      date: "2024-07-01"
      effects:
        - real_time_enabled: true
        - batch_processing: deprecated
        - posting_frequency: continuous

3. Regulatory and Compliance Drift

3.1 Regulatory Changes

regulatory_changes:
  enabled: true

  changes:
    # New accounting standard
    - type: accounting_standard_adoption
      standard: ASC_842  # Leases
      effective_date: "2024-01-01"
      effects:
        - new_account_codes: [rou_asset, lease_liability]
        - reclassification_entries: true
        - disclosure_changes: true
        - audit_focus: high

    # Tax law change
    - type: tax_law_change
      date: "2024-07-01"
      jurisdiction: federal
      change: corporate_tax_rate
      from: 0.21
      to: 0.25
      effects:
        - deferred_tax_revaluation: true
        - provision_adjustment: true
        - quarterly_estimate_revision: true

    # Compliance requirement
    - type: new_compliance_requirement
      regulation: SOX_AI_controls
      effective_date: "2024-10-01"
      requirements:
        - ai_model_documentation: required
        - automated_control_testing: required
        - data_lineage_tracking: required
      effects:
        - new_control_activities: 15
        - testing_frequency: quarterly
        - documentation_overhead: 0.10

    # Industry regulation
    - type: industry_regulation
      industry: financial_services
      regulation: enhanced_kyc
      date: "2024-06-01"
      effects:
        - customer_onboarding_time: 1.50
        - documentation_requirements: increased
        - rejection_rate_increase: 0.08

3.2 Audit Focus Shifts

audit_focus_evolution:
  enabled: true

  shifts:
    # Risk-based changes
    - trigger: fraud_detection
      date: "2024-03-15"
      new_focus_areas:
        - vendor_payments: high
        - manual_journal_entries: high
        - related_party_transactions: medium
      effects:
        - sampling_rate_increase: 0.30
        - documentation_requests: increased

    # Industry trend response
    - trigger: industry_trend
      date: "2024-06-01"
      trend: cybersecurity_risks
      new_focus_areas:
        - it_general_controls: high
        - access_management: high
        - change_management: medium
      effects:
        - itgc_testing_expansion: true
        - soc2_requirements: enhanced

    # Prior year findings
    - trigger: prior_year_finding
      finding: revenue_recognition_timing
      date: "2024-01-01"
      effects:
        - cutoff_testing: enhanced
        - sample_sizes: increased
        - management_inquiry: extensive

4. Behavioral Drift

4.1 Entity Behavior Evolution

behavioral_drift:
  enabled: true

  vendor_behavior:
    # Payment term negotiation
    payment_terms_drift:
      direction: extending
      rate_per_year: 2.5  # Days per year
      variance_increase: true
      trigger: economic_conditions

    # Quality drift
    quality_drift:
      new_vendors:
        initial_period_months: 6
        quality_improvement_rate: 0.02
      established_vendors:
        complacency_risk: 0.05
        quality_decline_rate: 0.01

    # Price drift
    pricing_behavior:
      inflation_pass_through: 0.80
      contract_renegotiation_frequency: annual
      opportunistic_increase_probability: 0.10

  customer_behavior:
    # Payment behavior evolution
    payment_drift:
      economic_downturn:
        days_extension: 5-15
        bad_debt_rate_increase: 0.02
      economic_upturn:
        days_reduction: 2-5
        early_payment_discount_uptake: 0.15

    # Order pattern drift
    order_drift:
      digital_shift:
        online_order_increase_per_year: 0.05
        average_order_value_decrease: 0.03
        order_frequency_increase: 0.10

  employee_behavior:
    # Approval pattern drift
    approval_drift:
      end_of_month_rush:
        intensity_increase_per_year: 0.05
      rubber_stamping_risk:
        increase_with_volume: true
        threshold: 50  # Approvals per day

    # Error pattern drift
    error_drift:
      new_employee:
        error_rate: 0.08
        learning_curve_months: 6
        target_error_rate: 0.02
      experienced_employee:
        fatigue_increase: 0.01_per_year
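
The new_employee error-rate settings above can be modelled as an exponential learning curve that starts at the 0.08 initial rate and approaches the 0.02 target over the six-month learning period. A sketch; the helper and its time constant are assumptions rather than the shipped implementation:

/// Sketch of the new-employee learning curve configured above: the error
/// rate decays exponentially from `initial` toward `target`.
fn employee_error_rate(initial: f64, target: f64, learning_months: f64, tenure_months: f64) -> f64 {
    // Time constant learning_months / 3, so roughly 95% of the gap is closed
    // by the end of the learning period (an assumption, not a spec).
    let tau = learning_months / 3.0;
    target + (initial - target) * (-tenure_months / tau).exp()
}

fn main() {
    for month in [0.0, 2.0, 4.0, 6.0, 12.0] {
        println!("month {:>4}: {:.3}", month, employee_error_rate(0.08, 0.02, 6.0, month));
    }
}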

4.2 Collective Behavior Patterns

collective_drift:
  enabled: true

  patterns:
    # Year-end behavior
    year_end_intensity:
      drift: increasing
      rate_per_year: 0.05
      explanation: "tighter close deadlines, more scrutiny"

    # Automation adoption
    automation_adoption:
      s_curve_adoption: true
      early_adopters: 0.15
      mainstream: 0.60
      laggards: 0.25
      effects_by_phase:
        early:
          manual_reduction: 0.10
          error_types_shift: true
        mainstream:
          manual_reduction: 0.50
          new_error_types: automation_failures
        late:
          manual_reduction: 0.80
          exception_handling_focus: true

    # Remote work impact
    remote_work_patterns:
      transition_date: "2024-01-01"
      remote_percentage: 0.60
      effects:
        - posting_time_distribution: flattened
        - batch_processing_increase: true
        - approval_response_time: longer
        - documentation_quality: variable
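
The s_curve_adoption flag above maps naturally onto a logistic curve. A sketch; the midpoint and steepness parameters are assumptions, not fields of the current configuration:

/// Logistic (S-curve) adoption fraction at month `t`, as suggested by the
/// `s_curve_adoption` flag above.
fn adoption_fraction(t: f64, midpoint: f64, steepness: f64) -> f64 {
    1.0 / (1.0 + (-steepness * (t - midpoint)).exp())
}

fn main() {
    // Adoption ramps from early adopters through the mainstream to laggards.
    for month in [0.0, 6.0, 12.0, 18.0, 24.0] {
        println!("month {:>4}: {:.2}", month, adoption_fraction(month, 12.0, 0.4));
    }
}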

5. Market-Driven Drift

5.1 Economic Cycle Effects

economic_cycles:
  enabled: true

  cycles:
    # Business cycle
    business_cycle:
      type: sinusoidal
      period_months: 48
      amplitude: 0.15
      effects:
        expansion:
          revenue_growth: positive
          hiring: active
          capital_investment: high
          credit_terms: generous
        contraction:
          revenue_growth: negative
          layoffs: possible
          capital_investment: low
          credit_terms: tight

    # Industry cycle
    industry_specific:
      technology:
        period_months: 36
        amplitude: 0.25
      manufacturing:
        period_months: 60
        amplitude: 0.20
      retail:
        period_months: 12  # Annual
        amplitude: 0.35

  # Recession simulation
  recession:
    enabled: false  # Trigger explicitly
    onset: gradual  # or sudden
    duration_months: 12-24
    severity: moderate  # mild, moderate, severe
    effects:
      revenue_decline: 0.15-0.30
      ar_aging_increase: 15_days
      bad_debt_increase: 0.03
      vendor_consolidation: 0.10
      workforce_reduction: 0.08
      capex_freeze: true
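
The sinusoidal business_cycle above (48-month period, 0.15 amplitude) translates directly into a multiplicative activity factor. A sketch, assuming month 0 sits at the cycle midpoint:

use std::f64::consts::PI;

/// Sketch of the `business_cycle` above: a sinusoidal multiplier with the
/// configured period and amplitude.
fn business_cycle_factor(month: f64, period_months: f64, amplitude: f64) -> f64 {
    1.0 + amplitude * (2.0 * PI * month / period_months).sin()
}

fn main() {
    // Peaks near month 12 (expansion), troughs near month 36 (contraction).
    for month in [0.0, 12.0, 24.0, 36.0, 48.0] {
        println!("month {:>4}: {:.3}", month, business_cycle_factor(month, 48.0, 0.15));
    }
}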

5.2 Commodity and Input Cost Drift

input_cost_drift:
  enabled: true

  commodities:
    - name: steel
      base_price: 800  # per ton
      volatility: 0.20
      correlation_with_economy: 0.60
      pass_through_to_cogs: 0.15

    - name: energy
      base_price: 75   # per barrel equivalent
      volatility: 0.35
      seasonal_pattern: true
      pass_through_to_overhead: 0.08

    - name: labor
      base_cost: 35    # per hour
      annual_increase: 0.03
      regional_variation: true
      pass_through_to_all: true

  price_shock_events:
    - type: supply_disruption
      probability_per_year: 0.10
      duration_months: 3-9
      price_increase: 0.30-1.00
      affected_commodities: [specific]

    - type: demand_surge
      probability_per_year: 0.15
      duration_months: 2-6
      price_increase: 0.15-0.40
      affected_commodities: [broad]

6. Concept Drift Detection Signals

6.1 Drift Indicators in Generated Data

drift_signals:
  enabled: true

  embedded_signals:
    # Statistical shift markers
    statistical:
      - type: mean_shift
        field: transaction_amount
        visibility: detectable_by_cusum
        magnitude: configurable

      - type: variance_change
        field: processing_time
        visibility: detectable_by_levene
        direction: both

      - type: distribution_change
        field: payment_terms
        visibility: detectable_by_ks_test
        gradual: true

    # Categorical drift markers
    categorical:
      - type: category_proportion_shift
        field: transaction_type
        new_category_emergence: true
        old_category_decline: true

      - type: label_drift
        field: account_code
        new_codes: added_over_time
        deprecated_codes: declining_usage

    # Temporal drift markers
    temporal:
      - type: seasonality_change
        field: transaction_count
        pattern_evolution: true
        detectability: acf_analysis

      - type: trend_change
        field: revenue
        change_points: marked
        detectability: pelt_algorithm

  # Ground truth labels for drift
  drift_labels:
    enabled: true
    output_file: drift_events.csv
    columns:
      - event_type
      - start_date
      - end_date
      - affected_fields
      - magnitude
      - detection_difficulty
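
Several of the statistical markers above are labelled detectable_by_cusum. A minimal one-sided CUSUM sketch shows what such a detector looks like; the function, its allowance k, and its decision threshold h are assumptions, not the evaluation crate's API:

/// Minimal one-sided CUSUM detector for an upward mean shift, as referenced
/// by the `detectable_by_cusum` markers above. `k` is the allowance (slack)
/// and `h` the decision threshold, both in the units of the series.
fn cusum_detect(series: &[f64], baseline_mean: f64, k: f64, h: f64) -> Option<usize> {
    let mut s = 0.0_f64;
    for (i, &x) in series.iter().enumerate() {
        s = (s + x - baseline_mean - k).max(0.0);
        if s > h {
            return Some(i); // index at which the shift is flagged
        }
    }
    None
}

fn main() {
    // Stable around 100, then shifts upward to ~115 at index 5.
    let amounts = [101.0, 99.0, 100.0, 102.0, 98.0, 115.0, 116.0, 114.0, 117.0];
    println!("{:?}", cusum_detect(&amounts, 100.0, 2.0, 20.0));
}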

6.2 Drift Validation Metrics

drift_validation:
  metrics:
    # Drift presence verification
    drift_detection:
      methods:
        - adwin   # Adaptive Windowing
        - ddm     # Drift Detection Method
        - eddm    # Early Drift Detection Method
        - ph      # Page-Hinkley Test
      threshold_calibration: true

    # Drift magnitude
    magnitude_metrics:
      - hellinger_distance
      - kl_divergence
      - wasserstein_distance
      - psi  # Population Stability Index

    # Drift timing accuracy
    timing_metrics:
      - detection_delay_days
      - false_positive_rate
      - detection_precision
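
Of the magnitude metrics above, the Population Stability Index is the simplest to state: PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the expected and actual bin proportions. A sketch assuming the proportions are already binned:

/// Population Stability Index over pre-binned proportions, one of the
/// magnitude metrics listed above. A small epsilon guards empty bins.
fn psi(expected: &[f64], actual: &[f64]) -> f64 {
    let eps = 1e-6;
    expected
        .iter()
        .zip(actual)
        .map(|(&e, &a)| {
            let (e, a) = (e.max(eps), a.max(eps));
            (a - e) * (a / e).ln()
        })
        .sum()
}

fn main() {
    // Proportions per amount bucket before and after a drift event.
    let before = [0.25, 0.35, 0.25, 0.15];
    let after = [0.15, 0.30, 0.30, 0.25];
    // Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate, > 0.25 major shift.
    println!("PSI = {:.3}", psi(&before, &after));
}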

7. Implementation Framework

7.1 Drift Controller Enhancement

pub struct EnhancedDriftController {
    // Existing drift
    parameter_drift: ParameterDrift,

    // New: Organizational events
    event_timeline: EventTimeline,

    // New: Process changes
    process_evolution: ProcessEvolution,

    // New: Regulatory changes
    regulatory_calendar: RegulatoryCalendar,

    // New: Behavioral models
    behavioral_drift: BehavioralDriftModel,

    // New: Market factors
    market_model: MarketModel,

    // Drift detection ground truth
    drift_labels: DriftLabelRecorder,
}

impl EnhancedDriftController {
    /// Get all active effects for a given date
    pub fn get_effects(&self, date: NaiveDate) -> DriftEffects {
        let mut effects = DriftEffects::default();

        // Apply organizational events
        effects.merge(self.event_timeline.effects_at(date));

        // Apply process evolution
        effects.merge(self.process_evolution.effects_at(date));

        // Apply regulatory changes
        effects.merge(self.regulatory_calendar.effects_at(date));

        // Apply behavioral drift
        effects.merge(self.behavioral_drift.effects_at(date));

        // Apply market conditions
        effects.merge(self.market_model.effects_at(date));

        // Record for ground truth
        self.drift_labels.record(date, &effects);

        effects
    }
}

7.2 Configuration Integration

# Master drift configuration
drift:
  enabled: true

  # Parameter drift (existing)
  parameters:
    amount_mean_drift: 0.02
    amount_variance_drift: 0.01

  # Organizational events (new)
  organizational:
    events_file: "organizational_events.yaml"
    random_events:
      reorganization_probability: 0.10
      leadership_change_probability: 0.15

  # Process evolution (new)
  process:
    automation_curve: s_curve
    policy_review_frequency: quarterly

  # Regulatory changes (new)
  regulatory:
    calendar_file: "regulatory_calendar.yaml"
    jurisdictions: [us, eu]

  # Behavioral drift (new)
  behavioral:
    vendor_learning: true
    customer_churn: true
    employee_turnover: 0.15

  # Market factors (new)
  market:
    economic_cycle: true
    commodity_volatility: true
    inflation_rate: 0.03

  # Drift labeling (new)
  labels:
    enabled: true
    output_format: csv
    include_magnitude: true

8. Implementation Priority

| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| Organizational events | Medium | High | P1 |
| Process evolution | Medium | High | P1 |
| Regulatory changes | Low | Medium | P2 |
| Behavioral drift | High | High | P1 |
| Market-driven drift | Medium | Medium | P2 |
| Drift detection signals | Low | High | P1 |
| Technology transitions | High | Medium | P3 |
| Collective behavior | Medium | Medium | P2 |

9. Use Cases

  1. ML Model Robustness Testing: Train models on stable data, test on drifted data
  2. Drift Detection Benchmarking: Evaluate drift detection algorithms on known drift
  3. Change Management Simulation: Test system responses to organizational changes
  4. Regulatory Impact Analysis: Model effects of compliance requirement changes
  5. Economic Scenario Planning: Generate data under different economic conditions

See also: 06-anomaly-patterns.md for anomaly injection patterns

Research: Anomaly Pattern Enhancements

Current State Analysis

Existing Anomaly Categories

| Category | Types | Implementation |
|---|---|---|
| Fraud | Fictitious, Revenue Manipulation, Split, Round-trip, Ghost Employee, Duplicate Payment | Good |
| Error | Duplicate Entry, Reversed Amount, Wrong Period, Wrong Account, Missing Reference | Good |
| Process | Late Posting, Skipped Approval, Threshold Manipulation | Medium |
| Statistical | Unusual Amount, Trend Break, Benford Violation | Medium |
| Relational | Circular Transaction, Dormant Account | Basic |

Current Strengths

  1. Labeled output: anomaly_labels.csv with ground truth
  2. Configurable injection rate: Per-anomaly-type rates
  3. Quality issue labeling: Separate from fraud labels
  4. Multiple anomaly types: 20+ distinct patterns
  5. COSO control mapping: Anomalies linked to control failures

Current Gaps

  1. Binary labeling only: No severity or confidence scores
  2. Independent injection: Anomalies don’t correlate with each other
  3. No multi-stage anomalies: Complex schemes not modeled
  4. Static patterns: Same anomaly signature throughout
  5. No near-miss generation: Only clear anomalies or clean data
  6. Limited context awareness: Anomalies don’t adapt to entity behavior
  7. No detection difficulty labeling: All anomalies treated equally

Improvement Recommendations

1. Multi-Dimensional Anomaly Labeling

1.1 Enhanced Label Schema

anomaly_labeling:
  schema:
    # Primary classification
    anomaly_id: uuid
    transaction_ids: [uuid]
    anomaly_type: string
    anomaly_category: [fraud, error, process, statistical, relational]

    # Severity scoring
    severity:
      level: [low, medium, high, critical]
      score: 0.0-1.0
      financial_impact: decimal
      materiality_threshold: exceeded | below

    # Detection characteristics
    detection:
      difficulty: [trivial, easy, moderate, hard, expert]
      recommended_methods: [rule_based, statistical, ml, graph, hybrid]
      expected_false_positive_rate: 0.0-1.0
      key_indicators: [string]

    # Confidence and certainty
    confidence:
      ground_truth_certainty: [definite, probable, possible]
      label_source: [injected, inferred, manual]

    # Temporal characteristics
    temporal:
      first_occurrence: date
      last_occurrence: date
      frequency: [one_time, recurring, continuous]
      detection_window: days

    # Relationship context
    context:
      related_anomalies: [uuid]
      affected_entities: [entity_id]
      control_failures: [control_id]
      root_cause: string

1.2 Materiality-Based Severity

severity_calculation:
  materiality_thresholds:
    trivial: 0.001        # 0.1% of relevant base
    immaterial: 0.01      # 1%
    material: 0.05        # 5%
    highly_material: 0.10 # 10%

  bases_by_type:
    revenue: total_revenue
    expense: total_expenses
    asset: total_assets
    liability: total_liabilities

  severity_factors:
    financial_impact:
      weight: 0.40
      calculation: amount / materiality_threshold

    detection_difficulty:
      weight: 0.25
      mapping:
        trivial: 0.1
        easy: 0.3
        moderate: 0.5
        hard: 0.7
        expert: 0.9

    persistence:
      weight: 0.20
      calculation: duration_days / 365

    entity_involvement:
      weight: 0.15
      calculation: log(affected_entity_count)
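
The factor weights above (0.40 + 0.25 + 0.20 + 0.15 = 1.0) combine into a single severity score. A sketch, assuming each factor has already been normalized to [0, 1] (for example by capping the log entity-involvement term); that normalization step is an assumption, not part of the spec:

/// Weighted severity score combining the four factors defined above.
/// Each input is assumed to be pre-normalized to [0, 1]; the weights
/// mirror the configuration (0.40 / 0.25 / 0.20 / 0.15).
fn severity_score(
    financial_impact: f64,
    detection_difficulty: f64,
    persistence: f64,
    entity_involvement: f64,
) -> f64 {
    (0.40 * financial_impact
        + 0.25 * detection_difficulty
        + 0.20 * persistence
        + 0.15 * entity_involvement)
        .clamp(0.0, 1.0)
}

fn main() {
    // A hard-to-detect, material scheme running for half a year across a few entities.
    println!("{:.2}", severity_score(0.9, 0.7, 0.5, 0.3)); // 0.68
}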

2. Correlated Anomaly Injection

2.1 Anomaly Co-occurrence Patterns

anomaly_correlations:
  enabled: true

  patterns:
    # Fraud often accompanied by concealment
    fraud_concealment:
      primary: fictitious_vendor
      correlated:
        - type: document_manipulation
          probability: 0.80
          lag_days: 0-30
        - type: approval_bypass
          probability: 0.60
          lag_days: 0
        - type: audit_trail_gaps
          probability: 0.40
          lag_days: 0-90

    # Error cascades
    error_cascade:
      primary: wrong_account_coding
      correlated:
        - type: reconciliation_difference
          probability: 0.90
          lag_days: 30-60
        - type: balance_discrepancy
          probability: 0.70
          lag_days: 30
        - type: correcting_entry
          probability: 0.85
          lag_days: 1-45

    # Process failures cluster
    process_breakdown:
      primary: skipped_approval
      correlated:
        - type: threshold_splitting
          probability: 0.50
          lag_days: -30 to 30
        - type: late_posting
          probability: 0.40
          lag_days: 0-15
        - type: documentation_missing
          probability: 0.60
          lag_days: 0
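
Operationally, each correlated pattern above is a conditional draw: when the primary anomaly is injected, every correlated secondary is emitted independently with its configured probability and a random lag. A sketch; the type and the xorshift stand-in for the generator's seeded RNG are illustrative, not the actual implementation:

/// Sketch of correlated injection: given a primary anomaly, draw each
/// correlated secondary independently with its configured probability.
struct CorrelatedAnomaly {
    anomaly_type: &'static str,
    probability: f64,
    max_lag_days: u32,
}

/// Tiny deterministic xorshift64 stand-in for a seeded RNG.
fn rand01(state: &mut u64) -> f64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state >> 11) as f64 / (1u64 << 53) as f64
}

fn inject_secondary(correlated: &[CorrelatedAnomaly], state: &mut u64) -> Vec<(&'static str, u32)> {
    let mut out = Vec::new();
    for c in correlated {
        if rand01(state) < c.probability {
            let lag = (rand01(state) * c.max_lag_days as f64) as u32;
            out.push((c.anomaly_type, lag));
        }
    }
    out
}

fn main() {
    let fraud_concealment = [
        CorrelatedAnomaly { anomaly_type: "document_manipulation", probability: 0.80, max_lag_days: 30 },
        CorrelatedAnomaly { anomaly_type: "approval_bypass", probability: 0.60, max_lag_days: 0 },
        CorrelatedAnomaly { anomaly_type: "audit_trail_gaps", probability: 0.40, max_lag_days: 90 },
    ];
    let mut seed: u64 = 42;
    println!("{:?}", inject_secondary(&fraud_concealment, &mut seed));
}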

2.2 Temporal Clustering

temporal_clustering:
  enabled: true

  clusters:
    # Period-end error spikes
    period_end_errors:
      window: last_5_business_days
      error_rate_multiplier: 2.5
      types: [wrong_period, duplicate_entry, late_posting]

    # Post-holiday cleanup
    post_holiday:
      window: first_3_business_days_after_holiday
      types: [duplicate_entry, missing_reference]
      multiplier: 1.8

    # Quarter-end pressure
    quarter_end:
      window: last_week_of_quarter
      fraud_types: [revenue_manipulation, expense_deferral]
      multiplier: 1.5

    # Year-end audit prep
    year_end_audit:
      window: december
      correction_types: [reclassification, prior_period_adjustment]
      multiplier: 3.0

3. Multi-Stage Anomaly Patterns

3.1 Complex Scheme Modeling

multi_stage_anomalies:
  enabled: true

  schemes:
    # Gradual embezzlement
    gradual_embezzlement:
      stages:
        - stage: 1
          name: testing
          duration_months: 2
          transactions: 3-5
          amount_range: [100, 500]
          detection_difficulty: hard

        - stage: 2
          name: escalation
          duration_months: 6
          transactions: 10-20
          amount_range: [500, 2000]
          detection_difficulty: moderate

        - stage: 3
          name: acceleration
          duration_months: 3
          transactions: 20-50
          amount_range: [2000, 10000]
          detection_difficulty: easy

        - stage: 4
          name: desperation
          duration_months: 1
          transactions: 5-10
          amount_range: [10000, 50000]
          detection_difficulty: trivial

      total_scheme_probability: 0.02

    # Revenue manipulation over time
    revenue_scheme:
      stages:
        - stage: 1
          name: acceleration
          quarter: Q4
          action: early_revenue_recognition
          amount_percent: 0.02

        - stage: 2
          name: deferral
          quarter: Q1_next
          action: expense_deferral
          amount_percent: 0.03

        - stage: 3
          name: reserve_manipulation
          quarter: Q2
          action: reserve_release
          amount_percent: 0.02

        - stage: 4
          name: channel_stuffing
          quarter: Q4
          action: forced_sales
          amount_percent: 0.05

      cycle_probability: 0.01

    # Vendor kickback scheme
    kickback_scheme:
      stages:
        - stage: 1
          name: vendor_setup
          actions: [create_vendor, build_relationship]
          duration_months: 3

        - stage: 2
          name: price_inflation
          actions: [inflated_invoices]
          inflation_percent: 0.10-0.25
          duration_months: 12

        - stage: 3
          name: kickback_payments
          actions: [off_book_payments]
          kickback_percent: 0.50
          frequency: quarterly

        - stage: 4
          name: concealment
          actions: [document_destruction, false_approvals]
          ongoing: true

3.2 Scheme Evolution

pub struct MultiStageAnomaly {
    scheme_id: Uuid,
    scheme_type: SchemeType,
    current_stage: u32,
    start_date: NaiveDate,
    perpetrators: Vec<EntityId>,
    transactions: Vec<TransactionId>,
    total_impact: Decimal,
    detection_status: DetectionStatus,
}

impl MultiStageAnomaly {
    /// Advance scheme to next stage
    pub fn advance(&mut self, date: NaiveDate) -> Vec<AnomalyAction> {
        // Check if conditions met for stage advancement
        // Return actions for current stage
        todo!()
    }

    /// Check if scheme should be detected based on accumulated evidence
    pub fn detection_probability(&self) -> f64 {
        // Increases with:
        // - Number of transactions
        // - Total amount
        // - Duration
        // - Carelessness factor
        todo!()
    }
}

4. Near-Miss and Edge Case Generation

4.1 Near-Anomaly Patterns

near_miss_generation:
  enabled: true
  proportion_of_anomalies: 0.30  # 30% of "anomalies" are near-misses

  patterns:
    # Almost duplicate (timing difference)
    near_duplicate:
      description: "Similar transaction, different timing"
      difference:
        amount: exact_match
        date: 1-3_days_apart
        vendor: same
      label: not_anomaly
      detection_challenge: high

    # Threshold proximity
    threshold_proximity:
      description: "Transaction just below approval threshold"
      distance_from_threshold: [0.90, 0.99]
      label: not_anomaly
      suspicion_score: high

    # Unusual but explainable
    unusual_legitimate:
      description: "Unusual pattern with valid business reason"
      types:
        - year_end_bonus
        - contract_prepayment
        - settlement_payment
        - insurance_claim
      label: not_anomaly
      false_positive_trigger: high

    # Corrected error
    corrected_error:
      description: "Error that was caught and fixed"
      original_error: any
      correction_lag_days: 1-5
      net_impact: zero
      label: error_corrected
      visibility: both_entries_visible
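
The threshold_proximity pattern above produces legitimate amounts at 90-99% of an approval threshold, which tends to trip naive threshold monitors. A sketch; the helper is illustrative and takes the uniform draw from the caller:

/// Sketch of the `threshold_proximity` near-miss above: produce a legitimate
/// amount that sits at 90-99% of an approval threshold. `u` is a uniform
/// draw in [0, 1) supplied by the caller (the generator's seeded RNG).
fn near_threshold_amount(threshold: f64, u: f64) -> f64 {
    let fraction = 0.90 + 0.09 * u; // maps [0, 1) onto [0.90, 0.99)
    (threshold * fraction * 100.0).round() / 100.0 // round to cents
}

fn main() {
    // Just below a 10,000 approval limit: suspicious-looking, but labelled not_anomaly.
    println!("{:.2}", near_threshold_amount(10_000.0, 0.37)); // 9333.00
}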

4.2 Boundary Condition Testing

boundary_conditions:
  enabled: true

  conditions:
    # Exact threshold matches
    exact_thresholds:
      types: [approval_limit, materiality, tolerance]
      probability: 0.01
      label: boundary_case

    # Round number preference (non-fraudulent)
    legitimate_round_numbers:
      amounts: [1000, 5000, 10000, 25000]
      probability: 0.05
      label: not_anomaly
      context: budget_allocations

    # Last-minute but legitimate
    period_boundary:
      timing: last_hour_before_close
      legitimate_probability: 0.80
      label: timing_anomaly_only

    # Zero and negative amounts
    edge_amounts:
      zero_amount_probability: 0.001
      negative_amount_probability: 0.002
      labels: data_quality_issue

5. Context-Aware Anomaly Injection

5.1 Entity-Specific Patterns

entity_aware_anomalies:
  enabled: true

  vendor_specific:
    # New vendors have higher error rates
    new_vendor_errors:
      definition: vendor_age < 90_days
      error_rate_multiplier: 2.5
      common_errors: [wrong_account, missing_po]

    # Large vendors have more complex issues
    strategic_vendor_issues:
      definition: vendor_spend > percentile_90
      anomaly_types: [contract_deviation, price_variance]
      rate_multiplier: 1.5

    # International vendors
    international_vendor_issues:
      definition: vendor_country != company_country
      anomaly_types: [fx_errors, withholding_tax_errors]
      rate_multiplier: 2.0

  employee_specific:
    # New employee learning curve
    new_employee_errors:
      definition: employee_tenure < 180_days
      error_rate: 0.05
      error_types: [coding_error, approval_violation]
      decay: exponential

    # High-volume processors
    volume_fatigue:
      definition: daily_transactions > 50
      error_rate_increase: 0.02
      peak_time: end_of_day

    # Vacation coverage
    coverage_errors:
      trigger: primary_approver_absent
      error_rate_multiplier: 1.8
      types: [delayed_approval, wrong_approver]

  account_specific:
    # High-risk accounts
    high_risk_accounts:
      accounts: [cash, revenue, inventory]
      monitoring_level: enhanced
      anomaly_injection_rate: 1.5x

    # Infrequently used accounts
    dormant_account_activity:
      definition: no_activity_90_days
      any_activity_suspicious: true
      label: statistical_anomaly

5.2 Behavioral Baseline Deviation

behavioral_deviation:
  enabled: true

  baselines:
    # Establish per-entity behavioral baseline
    baseline_period: 90_days
    metrics:
      - average_transaction_amount
      - transaction_frequency
      - typical_posting_time
      - common_counterparties
      - usual_account_codes

  deviations:
    # Amount deviation
    amount_anomaly:
      threshold: 3_standard_deviations
      label: statistical_anomaly
      severity: based_on_deviation

    # Frequency deviation
    frequency_anomaly:
      threshold: 2_standard_deviations
      types: [sudden_increase, sudden_decrease, irregular_pattern]

    # Counterparty deviation
    new_counterparty:
      first_time_transaction: true
      risk_score: elevated
      label: relationship_anomaly

    # Timing deviation
    timing_anomaly:
      threshold: outside_usual_hours
      consideration: legitimate_reasons_exist
      label: timing_anomaly
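
The amount deviation check above (3 standard deviations from the 90-day baseline) is a plain z-score test. A sketch with assumed helper names:

/// Sketch of the `amount_anomaly` check above: flag a transaction whose
/// amount deviates from the entity's baseline by more than
/// `threshold_sigmas` standard deviations.
fn is_amount_anomaly(amount: f64, baseline: &[f64], threshold_sigmas: f64) -> bool {
    let n = baseline.len() as f64;
    let mean = baseline.iter().sum::<f64>() / n;
    let var = baseline.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (amount - mean).abs() > threshold_sigmas * var.sqrt()
}

fn main() {
    // Baseline hovers around 1,000; a 5,000 posting is well past 3 sigma.
    let baseline = [980.0, 1_020.0, 1_050.0, 990.0, 1_010.0, 960.0, 1_030.0];
    println!("{}", is_amount_anomaly(5_000.0, &baseline, 3.0)); // true
}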

6. Detection Difficulty Classification

6.1 Difficulty Taxonomy

detection_difficulty:
  levels:
    trivial:
      description: "Obvious on cursory review"
      examples:
        - duplicate_same_day
        - obviously_wrong_amount
        - missing_required_field
      expected_detection_rate: 0.99
      detection_methods: [basic_rules]

    easy:
      description: "Detectable with standard controls"
      examples:
        - threshold_violations
        - approval_gaps
        - segregation_of_duties
      expected_detection_rate: 0.90
      detection_methods: [automated_rules, basic_analytics]

    moderate:
      description: "Requires analytical procedures"
      examples:
        - trend_deviations
        - ratio_anomalies
        - benford_violations
      expected_detection_rate: 0.70
      detection_methods: [statistical_analysis, ratio_analysis]

    hard:
      description: "Requires advanced techniques or domain expertise"
      examples:
        - complex_fraud_schemes
        - collusion_patterns
        - sophisticated_manipulation
      expected_detection_rate: 0.40
      detection_methods: [ml_models, graph_analysis, forensic_audit]

    expert:
      description: "Only detectable by specialized investigation"
      examples:
        - long_running_schemes
        - management_override
        - deep_concealment
      expected_detection_rate: 0.15
      detection_methods: [tip_or_complaint, forensic_investigation, external_audit]

6.2 Difficulty Factors

pub struct DifficultyCalculator {
    factors: Vec<DifficultyFactor>,
}

pub enum DifficultyFactor {
    // Concealment techniques
    Concealment {
        document_manipulation: bool,
        approval_circumvention: bool,
        timing_exploitation: bool,
        splitting: bool,
    },

    // Blending with normal activity
    Blending {
        amount_within_normal_range: bool,
        timing_within_normal_hours: bool,
        counterparty_is_established: bool,
        account_coding_correct: bool,
    },

    // Collusion
    Collusion {
        number_of_participants: u32,
        includes_management: bool,
        external_parties: bool,
    },

    // Duration and frequency
    Temporal {
        duration_months: u32,
        transaction_frequency: Frequency,
        gradual_escalation: bool,
    },

    // Amount characteristics
    Amount {
        total_amount: Decimal,
        individual_amounts_small: bool,
        round_numbers_avoided: bool,
    },
}

7. Anomaly Generation Strategies

7.1 Strategy Configuration

anomaly_strategies:
  # Random injection (current approach)
  random:
    enabled: true
    weight: 0.40
    parameters:
      base_rate: 0.02
      per_type_rates: {...}

  # Scenario-based injection
  scenario_based:
    enabled: true
    weight: 0.30
    scenarios:
      - name: "new_employee_fraud"
        trigger: employee_tenure < 365
        probability: 0.005
        scheme: embezzlement

      - name: "vendor_collusion"
        trigger: vendor_concentration > 0.15
        probability: 0.01
        scheme: kickback

      - name: "year_end_pressure"
        trigger: month == 12
        probability: 0.03
        types: [revenue_manipulation, reserve_adjustment]

  # Adversarial injection
  adversarial:
    enabled: true
    weight: 0.20
    strategy: evade_known_detectors
    detectors_to_evade:
      - benford_analysis
      - duplicate_detection
      - threshold_monitoring
    techniques:
      - amount_variation
      - timing_spreading
      - entity_rotation

  # Benchmark-based injection
  benchmark:
    enabled: true
    weight: 0.10
    source: acfe_report_to_the_nations
    calibration:
      median_loss: 117000
      duration_months: 12
      detection_method_distribution: {...}
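
The four strategies above carry weights that sum to 1.0, so picking one per injection opportunity is a weighted draw. A sketch; the selection helper is illustrative, while the names and weights mirror the configuration:

/// Sketch of weighted strategy selection over the four strategies above.
/// `u` is a uniform draw in [0, 1) from the generator's seeded RNG.
fn select_strategy(u: f64) -> &'static str {
    // Weights mirror the configuration: 0.40 / 0.30 / 0.20 / 0.10.
    let strategies = [
        ("random", 0.40),
        ("scenario_based", 0.30),
        ("adversarial", 0.20),
        ("benchmark", 0.10),
    ];
    let mut cumulative = 0.0;
    for (name, weight) in strategies {
        cumulative += weight;
        if u < cumulative {
            return name;
        }
    }
    "benchmark" // guards against floating-point rounding when u is close to 1.0
}

fn main() {
    for u in [0.05, 0.55, 0.75, 0.95] {
        println!("u = {:.2} -> {}", u, select_strategy(u));
    }
}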

7.2 Adaptive Anomaly Injection

pub struct AdaptiveAnomalyInjector {
    // Tracks what's been injected
    injection_history: Vec<InjectedAnomaly>,

    // Ensures variety
    type_distribution: TypeDistribution,

    // Ensures difficulty spread
    difficulty_distribution: DifficultyDistribution,

    // Ensures temporal spread
    temporal_distribution: TemporalDistribution,
}

impl AdaptiveAnomalyInjector {
    /// Inject anomaly with awareness of what's already been injected
    pub fn inject(&mut self, context: &GenerationContext) -> Option<Anomaly> {
        // Check if injection appropriate at this point
        if !self.should_inject(context) {
            return None;
        }

        // Select type based on current distribution gaps
        let anomaly_type = self.select_type_for_balance();

        // Select difficulty based on current distribution gaps
        let difficulty = self.select_difficulty_for_balance();

        // Generate anomaly
        let anomaly = self.generate_anomaly(anomaly_type, difficulty, context);

        // Record injection
        self.record_injection(&anomaly);

        Some(anomaly)
    }
}

8. Output Enhancements

8.1 Enhanced Label File

output:
  anomaly_labels:
    format: parquet  # or csv
    columns:
      # Identifiers
      - anomaly_id
      - transaction_ids  # Array
      - scheme_id        # For multi-stage

      # Classification
      - anomaly_type
      - category
      - subcategory

      # Severity
      - severity_level
      - severity_score
      - financial_impact
      - is_material

      # Detection
      - difficulty_level
      - difficulty_score
      - recommended_detection_methods  # Array
      - key_indicators  # Array

      # Temporal
      - first_date
      - last_date
      - duration_days
      - stage  # For multi-stage

      # Context
      - affected_entities  # Array
      - control_failures  # Array
      - related_anomalies  # Array

      # Metadata
      - injection_strategy
      - generation_seed
      - ground_truth_certainty

  # Separate scheme file for multi-stage
  schemes:
    format: json
    structure:
      scheme_id: uuid
      scheme_type: string
      stages: [...]
      transactions_by_stage: {...}
      total_impact: decimal
      perpetrators: [entity_ids]

8.2 Detection Benchmark Output

detection_benchmarks:
  enabled: true

  outputs:
    # Performance expectations by method
    expected_performance:
      format: json
      content:
        by_method:
          rule_based:
            expected_recall: 0.40
            expected_precision: 0.85
          statistical:
            expected_recall: 0.55
            expected_precision: 0.70
          ml_supervised:
            expected_recall: 0.75
            expected_precision: 0.80
          graph_based:
            expected_recall: 0.65
            expected_precision: 0.75

    # Difficulty distribution
    difficulty_summary:
      format: csv
      columns: [difficulty_level, count, percentage, avg_amount]

    # Detection challenge set
    challenge_cases:
      format: json
      description: "Curated set of hardest-to-detect anomalies"
      count: 100
      selection_criteria: difficulty_score > 0.7

9. Implementation Priority

| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| Multi-dimensional labeling | Low | High | P1 |
| Correlated anomaly injection | Medium | High | P1 |
| Multi-stage schemes | High | High | P1 |
| Near-miss generation | Medium | High | P1 |
| Context-aware injection | Medium | High | P2 |
| Difficulty classification | Low | High | P1 |
| Adaptive injection | Medium | Medium | P2 |
| Detection benchmarks | Low | Medium | P2 |

See also: 07-fraud-patterns.md for fraud-specific patterns

Research: Fraud Pattern Improvements

Implementation Status: COMPLETE (v0.3.0)

This research document has been fully implemented in v0.3.0.

Key implementations:

  • ACFE-aligned fraud taxonomy with calibration statistics
  • Collusion and conspiracy modeling with 9 ring types
  • Management override patterns with fraud triangle
  • Red flag generation with Bayesian probabilities
  • ACFE-calibrated ML benchmarks

Current State Analysis

Existing Fraud Typologies

| Category | Types Implemented | Realism |
|---|---|---|
| Asset Misappropriation | Ghost Employee, Duplicate Payment, Fictitious Vendor | Medium |
| Financial Statement Fraud | Revenue Manipulation, Round-tripping | Basic |
| Corruption | (Limited) | Weak |
| Banking/AML | Structuring, Layering, Mule, Funnel, Spoofing | Good |

Current Strengths

  1. Banking module: Sophisticated AML typologies with transaction networks
  2. Fraud labeling: Ground truth labels for ML training
  3. Control mapping: Fraud linked to control failures
  4. Amount patterns: Benford violations for fraudulent amounts

Current Gaps

  1. No collusion modeling: Fraud actors operate independently
  2. Limited concealment: Fraud isn’t actively hidden
  3. No behavioral adaptation: Fraudsters don’t learn
  4. Static schemes: Same patterns throughout
  5. Missing corruption types: Bribery, kickbacks underdeveloped
  6. No management override: All fraud at operational level
  7. Limited financial statement fraud: Complex schemes not modeled

Improvement Recommendations

1. Comprehensive Fraud Taxonomy

1.1 ACFE-Aligned Framework

Based on the Association of Certified Fraud Examiners' (ACFE) Occupational Fraud and Abuse Classification System:

fraud_taxonomy:
  # Asset Misappropriation (86% of cases, $100k median loss)
  asset_misappropriation:
    cash:
      theft_of_cash_on_hand:
        - larceny
        - skimming

      theft_of_cash_receipts:
        - sales_skimming
        - receivables_skimming
        - refund_schemes

      fraudulent_disbursements:
        - billing_schemes:
            - shell_company
            - non_accomplice_vendor
            - personal_purchases
        - payroll_schemes:
            - ghost_employee
            - falsified_wages
            - commission_schemes
        - expense_reimbursement:
            - mischaracterized_expenses
            - overstated_expenses
            - fictitious_expenses
        - check_tampering:
            - forged_maker
            - forged_endorsement
            - altered_payee
            - authorized_maker
        - register_disbursements:
            - false_voids
            - false_refunds

    inventory_and_assets:
      - misuse
      - larceny

  # Corruption (33% of cases, $150k median loss)
  corruption:
    conflicts_of_interest:
      - purchasing_schemes
      - sales_schemes

    bribery:
      - invoice_kickbacks
      - bid_rigging

    illegal_gratuities: true

    economic_extortion: true

  # Financial Statement Fraud (10% of cases, $954k median loss)
  financial_statement_fraud:
    overstatement:
      - timing_differences:
          - premature_revenue
          - delayed_expenses
      - fictitious_revenues
      - concealed_liabilities
      - improper_asset_valuations
      - improper_disclosures

    understatement:
      - understated_revenues
      - overstated_expenses
      - overstated_liabilities

1.2 Industry-Specific Fraud Patterns

industry_fraud_patterns:
  manufacturing:
    common_schemes:
      - type: inventory_theft
        frequency: high
        methods: [larceny, false_shipments, scrap_manipulation]
      - type: vendor_kickbacks
        frequency: medium
        methods: [inflated_pricing, phantom_materials]
      - type: quality_fraud
        frequency: low
        methods: [false_certifications, spec_violations]

  retail:
    common_schemes:
      - type: register_fraud
        frequency: high
        methods: [skimming, false_voids, sweethearting]
      - type: return_fraud
        frequency: high
        methods: [fictitious_returns, receipt_fraud]
      - type: inventory_shrinkage
        frequency: very_high
        methods: [employee_theft, vendor_collusion]

  financial_services:
    common_schemes:
      - type: loan_fraud
        frequency: medium
        methods: [false_documentation, appraisal_fraud]
      - type: insider_trading
        frequency: low
        methods: [front_running, tip_schemes]
      - type: account_takeover
        frequency: medium
        methods: [identity_theft, credential_theft]

  healthcare:
    common_schemes:
      - type: billing_fraud
        frequency: high
        methods: [upcoding, unbundling, phantom_billing]
      - type: kickbacks
        frequency: medium
        methods: [referral_fees, drug_company_payments]
      - type: identity_fraud
        frequency: medium
        methods: [patient_identity_theft, provider_impersonation]

  professional_services:
    common_schemes:
      - type: billing_fraud
        frequency: high
        methods: [inflated_hours, phantom_work]
      - type: expense_fraud
        frequency: medium
        methods: [personal_expenses, inflated_claims]
      - type: client_fund_misappropriation
        frequency: low
        methods: [trust_account_theft, advance_fee_theft]

2. Collusion and Conspiracy Modeling

2.1 Collusion Network Generation

collusion_networks:
  enabled: true

  network_types:
    # Internal collusion
    internal:
      - type: employee_pair
        roles: [approver, processor]
        scheme: approval_bypass
        probability: 0.005

      - type: department_ring
        size: 3-5
        roles: [initiator, approver, concealer]
        scheme: expense_fraud
        probability: 0.002

      - type: management_subordinate
        roles: [manager, subordinate]
        scheme: ghost_employee
        probability: 0.003

    # Internal-external collusion
    internal_external:
      - type: employee_vendor
        roles: [purchasing_agent, vendor_contact]
        scheme: kickback
        probability: 0.008

      - type: employee_customer
        roles: [sales_rep, customer]
        scheme: false_credits
        probability: 0.004

      - type: employee_contractor
        roles: [project_manager, contractor]
        scheme: overbilling
        probability: 0.006

    # External rings
    external:
      - type: vendor_ring
        size: 2-4
        scheme: bid_rigging
        probability: 0.002

      - type: customer_ring
        size: 2-3
        scheme: return_fraud
        probability: 0.003

  network_characteristics:
    trust_building:
      initial_period_months: 3
      test_transactions: 2-5
      test_amounts: small

    communication_patterns:
      frequency: coded
      channels: [personal_email, phone, in_person]
      visibility: low

    profit_sharing:
      methods: [equal_split, role_based, initiator_premium]
      payment_channels: [cash, personal_accounts, crypto]

2.2 Collusion Behavior Modeling

pub struct CollusionRing {
    ring_id: Uuid,
    members: Vec<Conspirator>,
    scheme_type: SchemeType,
    formation_date: NaiveDate,
    status: RingStatus,
    total_stolen: Decimal,
    detection_risk: f64,
}

pub struct Conspirator {
    entity_id: EntityId,
    role: ConspiratorRole,
    join_date: NaiveDate,
    loyalty: f64,           // Probability of not defecting
    risk_tolerance: f64,    // Willingness to escalate
    share: f64,             // Percentage of proceeds
}

pub enum ConspiratorRole {
    Initiator,      // Conceives scheme
    Executor,       // Performs transactions
    Approver,       // Provides approvals
    Concealer,      // Hides evidence
    Lookout,        // Monitors for detection
    Beneficiary,    // External recipient
}

impl CollusionRing {
    /// Simulate ring behavior for a period
    pub fn simulate_period(&mut self, period: &Period) -> Vec<FraudAction> {
        // Check for defection
        if self.check_defection() {
            return self.dissolve();
        }

        // Check for escalation
        let escalation = self.check_escalation();

        // Generate fraudulent transactions
        let actions = self.generate_actions(period, escalation);

        // Update detection risk
        self.update_detection_risk(&actions);

        actions
    }

    /// Check if any member might defect
    fn check_defection(&self) -> bool {
        // Factors: loyalty, detection_risk, personal_circumstances
        todo!()
    }
}

3. Concealment Techniques

3.1 Document Manipulation

concealment_techniques:
  document_manipulation:
    # Forged documents
    forgery:
      types:
        - invoices
        - receipts
        - approvals
        - contracts
      quality_levels:
        crude: 0.20      # Easy to detect
        moderate: 0.50   # Requires scrutiny
        sophisticated: 0.25  # Difficult to detect
        professional: 0.05   # Expert required

    # Altered documents
    alteration:
      techniques:
        - amount_change
        - date_change
        - payee_change
        - description_change
      detection_indicators:
        - different_handwriting
        - correction_fluid
        - digital_artifacts

    # Destroyed documents
    destruction:
      methods:
        - physical_destruction
        - digital_deletion
        - "lost_in_transition"
      recovery_probability: 0.30

  audit_trail_manipulation:
    techniques:
      - backdating_entries
      - manipulating_timestamps
      - deleting_log_entries
      - creating_false_trails

    sophistication_levels:
      basic: "obvious_gaps"
      intermediate: "plausible_explanations"
      advanced: "complete_alternative_narrative"

  segregation_circumvention:
    methods:
      - shared_credentials
      - delegated_authority_abuse
      - emergency_access_exploitation
      - system_override_use

3.2 Transaction Structuring

transaction_structuring:
  # Below threshold structuring
  threshold_avoidance:
    thresholds:
      - type: approval_limit
        values: [1000, 5000, 10000, 25000]
        technique: split_below
        margin: 0.05-0.15

      - type: reporting_threshold
        values: [10000]  # CTR threshold
        technique: structure_below
        margin: 0.10-0.20

      - type: audit_sample_threshold
        values: [materiality * 0.5]
        technique: avoid_population
        margin: variable

  # Timing manipulation
  timing_techniques:
    - type: spread_over_periods
      purpose: avoid_trending
      pattern: randomized

    - type: burst_before_vacation
      purpose: delayed_discovery
      window: 1_week

    - type: holiday_timing
      purpose: reduced_oversight
      targets: [year_end, summer]

  # Entity rotation
  entity_rotation:
    - type: vendor_rotation
      purpose: avoid_concentration_alerts
      rotation_frequency: quarterly

    - type: account_rotation
      purpose: avoid_pattern_detection
      accounts: [expense_categories]

    - type: department_rotation
      purpose: spread_impact
      pattern: round_robin
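
The split_below technique above takes a single amount over an approval limit and breaks it into pieces that each land just under the threshold (the configured margin is 5-15% below). A sketch; the splitting helper is illustrative and fixes the margin at 10% for determinism:

/// Sketch of the `split_below` structuring technique above: break a total
/// into pieces each sitting below an approval threshold. The margin is
/// fixed at 10% here; the real pattern randomizes it within 5-15%.
fn split_below_threshold(total: f64, threshold: f64) -> Vec<f64> {
    let piece = threshold * 0.90; // 10% below the limit
    let mut remaining = total;
    let mut pieces = Vec::new();
    while remaining > piece {
        pieces.push(piece);
        remaining -= piece;
    }
    if remaining > 0.0 {
        pieces.push((remaining * 100.0).round() / 100.0);
    }
    pieces
}

fn main() {
    // A 23,000 invoice split to stay under a 10,000 approval limit.
    println!("{:?}", split_below_threshold(23_000.0, 10_000.0)); // [9000.0, 9000.0, 5000.0]
}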

4. Management Override

4.1 Override Patterns

management_override:
  enabled: true

  scenarios:
    # Revenue manipulation
    revenue_override:
      perpetrator_level: senior_management
      techniques:
        - journal_entry_override
        - revenue_recognition_acceleration
        - reserve_manipulation
        - side_agreement_concealment
      concealment:
        - false_documentation
        - intimidation_of_subordinates
        - auditor_deception

    # Expense manipulation
    expense_override:
      perpetrator_level: department_head+
      techniques:
        - capitalization_abuse
        - expense_deferral
        - cost_allocation_manipulation
      pressure_sources:
        - budget_targets
        - bonus_thresholds
        - analyst_expectations

    # Asset manipulation
    asset_override:
      perpetrator_level: senior_management
      techniques:
        - impairment_avoidance
        - valuation_manipulation
        - classification_abuse
      motivations:
        - covenant_compliance
        - credit_rating_maintenance
        - acquisition_valuation

  override_characteristics:
    # Authority abuse
    authority_patterns:
      - override_segregation_of_duties
      - suppress_exception_reports
      - modify_control_parameters
      - grant_inappropriate_access

    # Pressure and rationalization
    fraud_triangle:
      pressure:
        - financial_targets
        - personal_financial_issues
        - market_expectations
      opportunity:
        - weak_board_oversight
        - auditor_reliance_on_management
        - complex_transactions
      rationalization:
        - "temporary_adjustment"
        - "everyone_does_it"
        - "for_the_good_of_company"

4.2 Tone at the Top Effects

tone_effects:
  enabled: true

  # Positive tone (ethical leadership)
  ethical_leadership:
    effects:
      - fraud_rate_reduction: 0.50
      - whistleblower_increase: 2.0
      - control_compliance_improvement: 0.20

  # Negative tone (pressure culture)
  pressure_culture:
    effects:
      - fraud_rate_increase: 2.5
      - concealment_sophistication: increased
      - collusion_probability: 1.5x
      - management_override_probability: 3.0x

  # Mixed signals
  inconsistent_messaging:
    effects:
      - employee_confusion: true
      - selective_compliance: true
      - rationalization_easier: true

5. Adaptive Fraud Behavior

5.1 Learning and Adaptation

adaptive_fraud:
  enabled: true

  learning_behaviors:
    # Response to near-detection
    near_detection_response:
      behaviors:
        - temporary_pause: 0.40
        - technique_change: 0.30
        - amount_reduction: 0.20
        - scheme_abandonment: 0.10
      pause_duration_days: 30-90

    # Response to control changes
    control_adaptation:
      when: new_control_implemented
      behaviors:
        - find_workaround: 0.60
        - wait_for_relaxation: 0.25
        - abandon_scheme: 0.15
      adaptation_time_days: 30-60

    # Success reinforcement
    success_reinforcement:
      when: fraud_not_detected
      behaviors:
        - increase_frequency: 0.30
        - increase_amount: 0.40
        - recruit_accomplices: 0.15
        - maintain_current: 0.15

  sophistication_evolution:
    stages:
      novice:
        characteristics: [obvious_patterns, small_amounts, nervous_behavior]
        detection_difficulty: easy

      intermediate:
        characteristics: [some_concealment, medium_amounts, confidence]
        detection_difficulty: moderate

      experienced:
        characteristics: [sophisticated_concealment, varied_amounts, systematic]
        detection_difficulty: hard

      expert:
        characteristics: [professional_techniques, large_amounts, network]
        detection_difficulty: expert

    progression:
      trigger: months_undetected > 6
      probability: 0.30_per_trigger

5.2 Detection Evasion

pub struct AdaptiveFraudster {
    experience_level: ExperienceLevel,
    known_controls: Vec<ControlId>,
    detection_events: Vec<DetectionEvent>,
    technique_repertoire: Vec<FraudTechnique>,
}

impl AdaptiveFraudster {
    /// Adapt technique based on environment
    pub fn adapt_technique(&mut self, context: &Context) -> FraudTechnique {
        // Avoid known controls
        let available = self.filter_by_controls(context.active_controls);

        // Avoid previously detected patterns
        let safe = self.filter_by_history(&available);

        // Select based on risk/reward
        self.select_optimal(&safe, context.current_risk_tolerance)
    }

    /// Learn from near-detection
    pub fn learn_from_event(&mut self, event: &DetectionEvent) {
        match event.outcome {
            DetectionOutcome::Detected => {
                self.avoid_technique(event.technique);
                self.reduce_risk_tolerance();
            }
            DetectionOutcome::NearMiss => {
                self.modify_technique(event.technique);
                self.record_warning_sign(event.indicator);
            }
            DetectionOutcome::Undetected => {
                self.reinforce_technique(event.technique);
                self.consider_escalation();
            }
        }
    }
}

6. Financial Statement Fraud Schemes

6.1 Revenue Manipulation Schemes

revenue_schemes:
  # Premature revenue recognition
  premature_recognition:
    techniques:
      - bill_and_hold:
          description: "Ship to warehouse, recognize revenue"
          indicators: [unusual_shipping, customer_complaints]
          journal_entries:
            - dr: accounts_receivable
              cr: revenue

      - channel_stuffing:
          description: "Force product on distributors"
          indicators: [quarter_end_spike, high_returns_next_period]
          side_agreements: [return_rights, extended_payment]

      - percentage_of_completion_abuse:
          description: "Overstate project completion"
          indicators: [optimistic_estimates, margin_improvements]
          documentation: [false_progress_reports]

      - round_tripping:
          description: "Simultaneous buy/sell with related party"
          indicators: [offsetting_transactions, unusual_counterparties]
          complexity: high

  # Fictitious revenue
  fictitious_revenue:
    techniques:
      - fake_invoices:
          description: "Bill nonexistent customers"
          concealment: [fake_customer_setup, false_confirmations]

      - side_agreements:
          description: "Hidden terms negate sale"
          concealment: [verbal_agreements, separate_documentation]

      - related_party_transactions:
          description: "Transactions with undisclosed affiliates"
          concealment: [complex_ownership, offshore_entities]
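
The `journal_entries` hint on `bill_and_hold` (debit accounts receivable, credit revenue) expands into a pair of balanced GL lines. One way that mapping could look, using hypothetical `JournalLine` and `DrCr` types rather than the generator's actual entry model:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DrCr {
    Debit,
    Credit,
}

#[derive(Debug, Clone)]
pub struct JournalLine {
    pub account: &'static str,
    pub side: DrCr,
    pub amount_cents: i64,
}

/// Expand a premature-recognition scheme into a balanced entry:
/// Dr accounts_receivable / Cr revenue for the full amount.
pub fn bill_and_hold_entry(amount_cents: i64) -> Vec<JournalLine> {
    vec![
        JournalLine { account: "accounts_receivable", side: DrCr::Debit, amount_cents },
        JournalLine { account: "revenue", side: DrCr::Credit, amount_cents },
    ]
}

fn main() {
    let entry = bill_and_hold_entry(125_000_00);
    let debits: i64 = entry.iter().filter(|l| l.side == DrCr::Debit).map(|l| l.amount_cents).sum();
    let credits: i64 = entry.iter().filter(|l| l.side == DrCr::Credit).map(|l| l.amount_cents).sum();
    assert_eq!(debits, credits); // the injected entry still balances
    println!("{:?}", entry);
}
```

The point of the sketch is that a fraudulent entry is structurally identical to a legitimate one; only the surrounding indicators (unusual shipping, later customer complaints) distinguish it.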

6.2 Expense and Liability Manipulation

expense_liability_schemes:
  # Expense deferral
  expense_deferral:
    techniques:
      - improper_capitalization:
          description: "Capitalize operating expenses"
          accounts: [fixed_assets, intangibles]
          indicators: [unusual_asset_growth, low_maintenance]

      - reserve_manipulation:
          description: "Cookie jar reserves"
          pattern: [build_in_good_years, release_in_bad]
          indicators: [volatile_provisions, earnings_smoothing]

      - period_cutoff_manipulation:
          description: "Push expenses to next period"
          timing: [quarter_end, year_end]
          techniques: [hold_invoices, delay_receipt]

  # Liability concealment
  liability_concealment:
    techniques:
      - off_balance_sheet:
          description: "Structure to avoid consolidation"
          vehicles: [SPEs, unconsolidated_subsidiaries]
          concealment: [complex_structures, offshore]

      - contingency_understatement:
          description: "Understate legal/warranty liabilities"
          rationalization: ["uncertain", "immaterial"]
          indicators: [subsequent_large_settlements]

7. Fraud Red Flags and Indicators

7.1 Behavioral Red Flags

behavioral_red_flags:
  # Employee behavior
  employee_indicators:
    - indicator: living_beyond_means
      fraud_correlation: 0.45
      detection_method: lifestyle_analysis

    - indicator: financial_difficulties
      fraud_correlation: 0.40
      detection_method: background_check

    - indicator: unusually_close_vendor_relationships
      fraud_correlation: 0.35
      detection_method: relationship_analysis

    - indicator: control_issues_attitude
      fraud_correlation: 0.30
      detection_method: 360_feedback

    - indicator: never_takes_vacation
      fraud_correlation: 0.50
      detection_method: hr_records

    - indicator: excessive_overtime
      fraud_correlation: 0.25
      detection_method: time_records

  # Transaction behavior
  transaction_indicators:
    - indicator: round_number_preference
      fraud_correlation: 0.20
      detection_method: benford_analysis

    - indicator: just_below_threshold
      fraud_correlation: 0.60
      detection_method: threshold_analysis

    - indicator: end_of_period_concentration
      fraud_correlation: 0.35
      detection_method: temporal_analysis

    - indicator: unusual_journal_entries
      fraud_correlation: 0.55
      detection_method: journal_entry_testing

7.2 Red Flag Generation

red_flag_injection:
  enabled: true

  # Inject red flags that correlate with but don't prove fraud
  correlations:
    # Strong correlation - usually indicates fraud
    strong:
      - flag: matched_home_address_vendor_employee
        fraud_probability: 0.85
        inject_with_fraud: 0.90
        inject_without_fraud: 0.001

      - flag: sequential_check_numbers_to_same_vendor
        fraud_probability: 0.70
        inject_with_fraud: 0.80
        inject_without_fraud: 0.01

    # Moderate correlation - worth investigating
    moderate:
      - flag: vendor_no_physical_address
        fraud_probability: 0.40
        inject_with_fraud: 0.60
        inject_without_fraud: 0.05

      - flag: approval_just_under_threshold
        fraud_probability: 0.35
        inject_with_fraud: 0.70
        inject_without_fraud: 0.10

    # Weak correlation - often legitimate
    weak:
      - flag: round_number_invoice
        fraud_probability: 0.15
        inject_with_fraud: 0.40
        inject_without_fraud: 0.20

      - flag: end_of_month_timing
        fraud_probability: 0.10
        inject_with_fraud: 0.50
        inject_without_fraud: 0.30
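
Each flag above carries two injection rates: the probability of attaching it to a genuinely fraudulent transaction and the probability of attaching it to a clean one, which is what makes the flag correlate with fraud without proving it. A small sketch of that conditional injection, with the random draw supplied by the caller and the type names invented for illustration:

```rust
#[derive(Debug, Clone)]
pub struct RedFlagRates {
    pub name: &'static str,
    pub inject_with_fraud: f64,    // P(flag | fraud)
    pub inject_without_fraud: f64, // P(flag | no fraud)
}

/// Decide whether to attach this red flag to a transaction.
/// `uniform_draw` is a value in [0, 1) from the caller's RNG.
pub fn should_inject(flag: &RedFlagRates, is_fraud: bool, uniform_draw: f64) -> bool {
    let p = if is_fraud { flag.inject_with_fraud } else { flag.inject_without_fraud };
    uniform_draw < p
}

fn main() {
    let flag = RedFlagRates {
        name: "approval_just_under_threshold",
        inject_with_fraud: 0.70,
        inject_without_fraud: 0.10,
    };
    // Fraudulent transaction, draw 0.42 < 0.70: the flag is injected.
    assert!(should_inject(&flag, true, 0.42));
    // Clean transaction, same draw 0.42 >= 0.10: no flag.
    assert!(!should_inject(&flag, false, 0.42));
    println!("flag '{}' behaves as a correlated, non-conclusive indicator", flag.name);
}
```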

8. Fraud Investigation Scenarios

8.1 Investigation-Ready Data

investigation_scenarios:
  enabled: true

  scenarios:
    # Whistleblower scenario
    whistleblower_tip:
      allegation: "Vendor XYZ may be fictitious"
      evidence_trail:
        - vendor_setup_documents
        - approval_chain
        - payment_history
        - address_verification
        - phone_verification
      hidden_clues:
        - approver_is_also_requester
        - address_is_ups_store
        - phone_goes_to_employee

    # Audit finding follow-up
    audit_finding:
      initial_finding: "Unusual vendor payment pattern"
      investigation_path:
        - transaction_sample
        - vendor_analysis
        - employee_relationship_map
        - comparative_analysis
      discovery_stages:
        - stage_1: "Vendor has only one customer - us"
        - stage_2: "All invoices approved by same person"
        - stage_3: "Vendor address matches employee relative"

    # Hotline report
    anonymous_tip:
      report: "Manager taking kickbacks from contractor"
      evidence_available:
        - contract_documents
        - bid_history
        - payment_records
        - email_metadata
      additional_clues:
        - bids_always_awarded_to_same_contractor
        - contract_amendments_increase_cost_30%
        - manager_new_car_timing_correlates

8.2 Evidence Chain Generation

pub struct FraudEvidenceChain {
    fraud_id: Uuid,
    evidence_items: Vec<EvidenceItem>,
    discovery_order: Vec<EvidenceId>,
    linking_relationships: Vec<EvidenceLink>,
}

pub struct EvidenceItem {
    id: EvidenceId,
    item_type: EvidenceType,
    content: EvidenceContent,
    source_system: String,
    timestamp: DateTime<Utc>,
    accessibility: Accessibility,  // How hard to find
    probative_value: f64,         // Strength as evidence
}

pub enum EvidenceType {
    Transaction,
    Document,
    Communication,
    SystemLog,
    ExternalRecord,
    WitnessStatement,
    PhysicalEvidence,
}

impl FraudEvidenceChain {
    /// Generate investigation-ready evidence trail
    pub fn generate_trail(&self) -> InvestigationTrail {
        // Order evidence by discoverability
        // Create logical links between items
        // Add red herrings (false leads that are eliminated)
        // Include corroborating evidence
        todo!("assemble the ordered, linked evidence trail")
    }
}

9. Implementation Priority

| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| ACFE-aligned taxonomy | Low | High | P1 |
| Collusion modeling | High | High | P1 |
| Concealment techniques | Medium | High | P1 |
| Management override | Medium | High | P1 |
| Adaptive behavior | High | Medium | P2 |
| Financial statement fraud | High | High | P1 |
| Red flag generation | Medium | High | P1 |
| Investigation scenarios | Medium | Medium | P2 |
| Industry-specific patterns | Medium | Medium | P2 |

10. Validation and Calibration

fraud_validation:
  # Calibration against real-world statistics
  calibration:
    source: acfe_report_to_the_nations_2024
    metrics:
      median_loss: 117000
      median_duration_months: 12
      detection_methods:
        tip: 0.42
        internal_audit: 0.16
        management_review: 0.12
        external_audit: 0.04
        accident: 0.06
      perpetrator_departments:
        accounting: 0.21
        operations: 0.17
        executive: 0.12
        sales: 0.11
        customer_service: 0.08

  # Distribution validation
  distribution_checks:
    - metric: loss_distribution
      expected: lognormal
      parameters_from: acfe_data

    - metric: duration_distribution
      expected: exponential
      mean_months: 12

    - metric: detection_method_distribution
      expected: categorical
      match_acfe: true

See also: 08-domain-specific.md for industry-specific enhancements

Research: Domain-Specific Enhancements

Implementation Status: COMPLETE (v0.3.0)

This research document has been fully implemented in v0.3.0.

Key implementations:

  • Manufacturing: BOM, routings, work centers, production variances
  • Retail: POS transactions, shrinkage, loss prevention
  • Healthcare: Revenue cycle, ICD-10/CPT/DRG coding, payer mix
  • Technology: License/subscription revenue, R&D capitalization
  • Financial Services: Loan origination, trading, regulatory frameworks
  • Professional Services: Time & billing, trust accounting
  • Industry-specific anomaly patterns for each sector
  • Industry-specific ML benchmarks

Current State Analysis

Existing Industry Support

| Industry | Configuration | Generator Support | Realism |
|---|---|---|---|
| Manufacturing | Preset available | Good | Medium |
| Retail | Preset available | Good | Medium |
| Financial Services | Preset + Banking module | Strong | Good |
| Healthcare | Preset available | Basic | Low |
| Technology | Preset available | Basic | Low |
| Professional Services | Limited | Basic | Low |

Current Strengths

  1. Banking module: Comprehensive KYC/AML with fraud typologies
  2. Industry presets: 5 industry configurations available
  3. Seasonality profiles: 10 industry-specific patterns
  4. Standards support: IFRS, US GAAP, ISA, SOX frameworks

Current Gaps

  1. Shallow industry modeling: Generic patterns across industries
  2. Limited regulatory specificity: One-size-fits-all compliance
  3. Missing vertical-specific transactions: Generic document flows
  4. No industry-specific anomalies: Same fraud patterns everywhere
  5. Limited terminology: Generic naming regardless of industry

Industry-Specific Enhancement Recommendations

1. Manufacturing Industry

1.1 Enhanced Transaction Types

manufacturing:
  transaction_types:
    # Production-specific
    production:
      - work_order_issuance
      - material_requisition
      - labor_booking
      - overhead_absorption
      - scrap_reporting
      - rework_order
      - production_variance

    # Inventory movements
    inventory:
      - raw_material_receipt
      - wip_transfer
      - finished_goods_transfer
      - consignment_movement
      - subcontractor_shipment
      - cycle_count_adjustment
      - physical_inventory_adjustment

    # Cost accounting
    costing:
      - standard_cost_revaluation
      - purchase_price_variance
      - production_variance_allocation
      - overhead_rate_adjustment
      - interplant_transfer_pricing

  # Manufacturing-specific master data
  master_data:
    bill_of_materials:
      levels: 3-7
      components_per_level: 2-15
      yield_rates: 0.95-0.99
      scrap_factors: 0.01-0.05

    routings:
      operations: 3-12
      work_centers: 5-50
      labor_rates: by_skill_level
      machine_rates: by_equipment_type

    production_orders:
      types: [discrete, repetitive, process]
      statuses: [planned, released, confirmed, completed]
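
The bill-of-materials parameters above (3-7 levels, 2-15 components per level, yield and scrap factors per component) describe a recursive tree build. A sketch of that recursion, with the fan-out supplied by a caller-provided closure so the example needs no RNG; `BomNode` and `build_bom` are illustrative names, and the sampled rates are placeholders.

```rust
#[derive(Debug)]
pub struct BomNode {
    pub material: String,
    pub yield_rate: f64,   // e.g. 0.95..=0.99 per the config above
    pub scrap_factor: f64, // e.g. 0.01..=0.05
    pub components: Vec<BomNode>,
}

/// Recursively build a BOM with `depth` levels below the root and
/// `fan_out(level)` components per node. Depth 0 nodes are raw materials.
pub fn build_bom(
    material: String,
    depth: u32,
    fan_out: &dyn Fn(u32) -> usize, // 2..=15 per the config above
) -> BomNode {
    let components = if depth == 0 {
        Vec::new()
    } else {
        (0..fan_out(depth))
            .map(|i| build_bom(format!("{material}.{i}"), depth - 1, fan_out))
            .collect()
    };
    BomNode {
        material,
        yield_rate: 0.97,  // placeholder; the generator would sample these
        scrap_factor: 0.02,
        components,
    }
}

fn main() {
    // 3 levels below the root, 2 components per level: 1 + 2 + 4 + 8 = 15 nodes.
    let bom = build_bom("FG-1000".into(), 3, &|_| 2);
    fn count(n: &BomNode) -> usize { 1 + n.components.iter().map(count).sum::<usize>() }
    assert_eq!(count(&bom), 15);
    println!("BOM for {} has {} nodes", bom.material, count(&bom));
}
```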

1.2 Manufacturing Anomalies

manufacturing_anomalies:
  production:
    - type: yield_manipulation
      description: "Inflating yield to hide scrap"
      indicators: [abnormal_yield, missing_scrap_entries]

    - type: labor_misallocation
      description: "Charging labor to wrong orders"
      indicators: [unusual_labor_distribution, overtime_patterns]

    - type: phantom_production
      description: "Recording production that didn't occur"
      indicators: [no_material_consumption, missing_quality_records]

  inventory:
    - type: obsolete_inventory_concealment
      description: "Failing to write down obsolete stock"
      indicators: [no_movement_items, aging_without_provision]

    - type: consignment_manipulation
      description: "Recording consigned goods as owned"
      indicators: [unusual_consignment_patterns, ownership_disputes]

  costing:
    - type: standard_cost_manipulation
      description: "Setting unrealistic standards"
      indicators: [persistent_favorable_variances, standard_changes]

    - type: overhead_misallocation
      description: "Allocating overhead to wrong products"
      indicators: [margin_anomalies, allocation_base_changes]

2. Retail Industry

2.1 Enhanced Transaction Types

retail:
  transaction_types:
    # Point of Sale
    pos:
      - cash_sale
      - credit_card_sale
      - debit_sale
      - gift_card_sale
      - layaway_transaction
      - special_order
      - rain_check

    # Returns and adjustments
    returns:
      - customer_return
      - exchange
      - price_adjustment
      - markdown
      - damage_writeoff
      - vendor_return

    # Inventory
    inventory:
      - receiving
      - transfer_in
      - transfer_out
      - cycle_count
      - shrinkage_adjustment
      - donation
      - disposal

    # Promotions
    promotions:
      - coupon_redemption
      - loyalty_redemption
      - bundle_discount
      - flash_sale
      - clearance_markdown

  # Retail-specific metrics
  metrics:
    same_store_sales: by_period
    basket_size: average_and_distribution
    conversion_rate: by_store_type
    shrinkage_rate: by_category
    markdown_percentage: by_season
    inventory_turn: by_category
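
These metrics reduce to simple ratios over the generated transactions: shrinkage rate is typically shrinkage value over net sales for a category, and basket size is the average sales value per POS transaction. Two of them are sketched below with illustrative field names.

```rust
/// Shrinkage rate for a category: shrinkage value / net sales value.
fn shrinkage_rate(shrinkage_value: f64, net_sales: f64) -> f64 {
    if net_sales == 0.0 { 0.0 } else { shrinkage_value / net_sales }
}

/// Average basket size: total sales value / number of POS transactions.
fn average_basket_size(total_sales: f64, transaction_count: u64) -> f64 {
    if transaction_count == 0 { 0.0 } else { total_sales / transaction_count as f64 }
}

fn main() {
    // Category with 18,000 of shrinkage against 1.2m of sales: 1.5%.
    println!("shrinkage rate: {:.3}", shrinkage_rate(18_000.0, 1_200_000.0));
    // 1.2m of sales over 30,000 transactions: 40.00 average basket.
    println!("avg basket: {:.2}", average_basket_size(1_200_000.0, 30_000));
}
```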

2.2 Retail Anomalies

retail_anomalies:
  pos_fraud:
    - type: sweethearting
      description: "Employee gives free/discounted items to friends"
      indicators: [high_void_rate, specific_cashier_patterns]

    - type: skimming
      description: "Not recording cash sales"
      indicators: [cash_short, transaction_gaps]

    - type: refund_fraud
      description: "Fraudulent refunds to personal cards"
      indicators: [refund_patterns, card_number_reuse]

  inventory_fraud:
    - type: receiving_fraud
      description: "Collusion with vendors on short shipments"
      indicators: [variance_patterns, vendor_concentration]

    - type: transfer_fraud
      description: "Fake transfers to cover theft"
      indicators: [transfer_without_receipt, location_patterns]

  promotional_abuse:
    - type: coupon_fraud
      description: "Applying coupons without customer purchase"
      indicators: [high_coupon_rate, timing_patterns]

    - type: employee_discount_abuse
      description: "Using employee discount for non-employees"
      indicators: [discount_volume, transaction_timing]

3. Healthcare Industry

3.1 Enhanced Transaction Types

healthcare:
  transaction_types:
    # Revenue cycle
    revenue:
      - patient_registration
      - charge_capture
      - claim_submission
      - payment_posting
      - denial_management
      - patient_billing
      - collection_activity

    # Clinical operations
    clinical:
      - supply_consumption
      - pharmacy_dispensing
      - procedure_coding
      - diagnosis_coding
      - medical_record_documentation

    # Payer transactions
    payer:
      - contract_payment
      - capitation_payment
      - risk_adjustment
      - quality_bonus
      - value_based_payment

  # Healthcare-specific elements
  elements:
    coding:
      icd10: diagnostic_codes
      cpt: procedure_codes
      drg: diagnosis_related_groups
      hcpcs: healthcare_common_procedure

    payers:
      types: [medicare, medicaid, commercial, self_pay]
      mix_distribution: configurable
      contract_terms: by_payer

    compliance:
      hipaa: true
      stark_law: true
      anti_kickback: true
      false_claims_act: true
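
The coding and payer elements suggest a claim-level record carrying ICD-10 diagnoses, CPT procedures, an optional DRG, and the payer type. A possible shape for such a record is shown below; the types and field names are illustrative, not the generator's schema.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PayerType {
    Medicare,
    Medicaid,
    Commercial,
    SelfPay,
}

#[derive(Debug, Clone)]
pub struct SyntheticClaim {
    pub claim_id: u64,
    pub icd10_diagnoses: Vec<String>, // e.g. "E11.9"
    pub cpt_procedures: Vec<String>,  // e.g. "99214"
    pub drg: Option<String>,          // inpatient claims only
    pub payer: PayerType,
    pub billed_amount_cents: i64,
    pub expected_reimbursement_cents: i64, // after contractual allowance
}

impl SyntheticClaim {
    /// Contractual allowance implied by the payer contract terms.
    pub fn contractual_allowance_cents(&self) -> i64 {
        self.billed_amount_cents - self.expected_reimbursement_cents
    }
}

fn main() {
    let claim = SyntheticClaim {
        claim_id: 1,
        icd10_diagnoses: vec!["E11.9".into()],
        cpt_procedures: vec!["99214".into()],
        drg: None,
        payer: PayerType::Commercial,
        billed_amount_cents: 25_000,
        expected_reimbursement_cents: 16_500,
    };
    println!("allowance: {} cents", claim.contractual_allowance_cents());
}
```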

3.2 Healthcare Anomalies

healthcare_anomalies:
  billing_fraud:
    - type: upcoding
      description: "Billing for more expensive service than provided"
      indicators: [code_distribution_shift, complexity_increase]

    - type: unbundling
      description: "Billing separately for bundled services"
      indicators: [modifier_patterns, procedure_combinations]

    - type: phantom_billing
      description: "Billing for services not rendered"
      indicators: [impossible_combinations, deceased_patient_billing]

    - type: duplicate_billing
      description: "Billing multiple times for same service"
      indicators: [same_day_duplicates, claim_resubmission_patterns]

  kickback_schemes:
    - type: physician_referral_kickback
      description: "Payments for patient referrals"
      indicators: [referral_concentration, payment_timing]

    - type: medical_director_fraud
      description: "Sham medical director agreements"
      indicators: [no_services_rendered, excessive_compensation]

  compliance_violations:
    - type: hipaa_violation
      description: "Unauthorized access to patient records"
      indicators: [access_patterns, audit_log_anomalies]

    - type: credential_fraud
      description: "Using credentials of another provider"
      indicators: [impossible_geography, timing_conflicts]

4. Technology Industry

4.1 Enhanced Transaction Types

technology:
  transaction_types:
    # Revenue recognition (ASC 606)
    revenue:
      - license_revenue
      - subscription_revenue
      - professional_services
      - maintenance_revenue
      - usage_based_revenue
      - milestone_based_revenue

    # Software development
    development:
      - r_and_d_expense
      - capitalized_software
      - amortization
      - impairment_testing

    # Cloud operations
    cloud:
      - hosting_costs
      - bandwidth_costs
      - storage_costs
      - compute_costs
      - third_party_services

    # Sales and marketing
    sales:
      - commission_expense
      - deferred_commission
      - customer_acquisition_cost
      - marketing_program_expense

  # Tech-specific accounting
  accounting:
    revenue_recognition:
      multiple_element_arrangements: true
      variable_consideration: true
      contract_modifications: true

    software_development:
      capitalization_criteria: true
      useful_life_determination: true
      impairment_testing: annual

    stock_compensation:
      option_valuation: black_scholes
      rsu_accounting: true
      performance_units: true
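
The stock-compensation block calls for Black-Scholes option valuation. The sketch below prices a European call with the closed-form formula and an Abramowitz-Stegun approximation of the normal CDF; it is illustrative only and not the generator's valuation module.

```rust
/// erf via Abramowitz & Stegun 7.1.26 (max absolute error ~1.5e-7).
fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t
        + 0.254829592)
        * t;
    sign * (1.0 - poly * (-x * x).exp())
}

/// Standard normal CDF.
fn norm_cdf(x: f64) -> f64 {
    0.5 * (1.0 + erf(x / std::f64::consts::SQRT_2))
}

/// Black-Scholes price of a European call (no dividends).
/// s: spot, k: strike, r: risk-free rate, sigma: volatility, t: years to expiry.
fn black_scholes_call(s: f64, k: f64, r: f64, sigma: f64, t: f64) -> f64 {
    let d1 = ((s / k).ln() + (r + 0.5 * sigma * sigma) * t) / (sigma * t.sqrt());
    let d2 = d1 - sigma * t.sqrt();
    s * norm_cdf(d1) - k * (-r * t).exp() * norm_cdf(d2)
}

fn main() {
    // At-the-money grant: spot 100, strike 100, 2% rate, 30% vol, 4-year term.
    let value = black_scholes_call(100.0, 100.0, 0.02, 0.30, 4.0);
    println!("grant-date fair value per option: {value:.2}");
}
```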

4.2 Technology Anomalies

technology_anomalies:
  revenue_fraud:
    - type: premature_license_recognition
      description: "Recognizing license revenue before delivery criteria met"
      indicators: [quarter_end_concentration, delivery_delays]

    - type: side_letter_abuse
      description: "Hidden terms that negate revenue recognition"
      indicators: [unusual_contract_terms, customer_complaints]

    - type: channel_stuffing
      description: "Forcing product on resellers at period end"
      indicators: [reseller_inventory_buildup, returns_next_quarter]

  capitalization_fraud:
    - type: improper_capitalization
      description: "Capitalizing expenses that should be expensed"
      indicators: [r_and_d_ratio_changes, asset_growth]

    - type: useful_life_manipulation
      description: "Extending useful life to reduce amortization"
      indicators: [useful_life_changes, peer_comparison]

  stock_compensation:
    - type: options_backdating
      description: "Selecting favorable grant dates retroactively"
      indicators: [grant_date_patterns, exercise_price_analysis]

    - type: vesting_manipulation
      description: "Accelerating vesting to manage earnings"
      indicators: [vesting_schedule_changes, departure_timing]

5. Financial Services Industry

5.1 Enhanced Transaction Types

financial_services:
  transaction_types:
    # Banking operations
    banking:
      - loan_origination
      - loan_disbursement
      - loan_payment
      - interest_accrual
      - fee_income
      - deposit_transaction
      - wire_transfer
      - ach_transaction

    # Investment operations
    investments:
      - trade_execution
      - trade_settlement
      - dividend_receipt
      - interest_receipt
      - mark_to_market
      - realized_gain_loss
      - unrealized_gain_loss

    # Insurance operations
    insurance:
      - premium_collection
      - claim_payment
      - reserve_adjustment
      - reinsurance_transaction
      - commission_payment
      - policy_acquisition_cost

    # Asset management
    asset_management:
      - management_fee
      - performance_fee
      - distribution
      - capital_call
      - redemption

  # Regulatory requirements
  regulatory:
    capital_requirements:
      basel_iii: true
      leverage_ratio: true
      liquidity_coverage: true

    reporting:
      call_reports: true
      form_10k_10q: true
      form_13f: true
      sar_filing: true

5.2 Financial Services Anomalies

financial_services_anomalies:
  lending_fraud:
    - type: loan_fraud
      description: "Falsified loan applications"
      indicators: [documentation_inconsistencies, verification_failures]

    - type: appraisal_fraud
      description: "Inflated property valuations"
      indicators: [appraisal_variances, appraiser_concentration]

    - type: straw_borrower
      description: "Using nominee to obtain loans"
      indicators: [relationship_patterns, fund_flow_analysis]

  trading_fraud:
    - type: wash_trading
      description: "Buying and selling same security to inflate volume"
      indicators: [self_trades, volume_patterns]

    - type: front_running
      description: "Trading ahead of customer orders"
      indicators: [timing_analysis, profitability_patterns]

    - type: churning
      description: "Excessive trading to generate commissions"
      indicators: [turnover_ratio, commission_patterns]

  insurance_fraud:
    - type: premium_theft
      description: "Agent pocketing premiums"
      indicators: [lapsed_policies, customer_complaints]

    - type: claims_fraud
      description: "Fraudulent or inflated claims"
      indicators: [claim_patterns, adjuster_analysis]

    - type: reserve_manipulation
      description: "Understating claim reserves"
      indicators: [reserve_development, adequacy_analysis]

6. Professional Services

6.1 Enhanced Transaction Types

professional_services:
  transaction_types:
    # Time and billing
    billing:
      - time_entry
      - expense_entry
      - invoice_generation
      - write_off_adjustment
      - realization_adjustment
      - wip_adjustment

    # Engagement management
    engagement:
      - engagement_setup
      - budget_allocation
      - milestone_billing
      - retainer_application
      - contingency_fee

    # Resource management
    resource:
      - staff_allocation
      - contractor_engagement
      - subcontractor_payment
      - expert_fee

    # Client accounting
    client:
      - trust_deposit
      - trust_withdrawal
      - cost_advance
      - client_reimbursement

  # Professional-specific metrics
  metrics:
    utilization_rate: by_level
    realization_rate: by_practice
    collection_rate: by_client
    leverage_ratio: staff_to_partner
    revenue_per_professional: by_level
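
Utilization and realization are the two central ratios here: utilization compares billable hours to available hours, and realization compares the value actually billed to the value of the recorded time at standard rates. They are sketched below with illustrative field names.

```rust
/// Utilization: billable hours as a share of available hours.
fn utilization_rate(billable_hours: f64, available_hours: f64) -> f64 {
    if available_hours == 0.0 { 0.0 } else { billable_hours / available_hours }
}

/// Realization: billed value as a share of time valued at standard rates.
fn realization_rate(billed_value: f64, standard_value_of_time: f64) -> f64 {
    if standard_value_of_time == 0.0 { 0.0 } else { billed_value / standard_value_of_time }
}

fn main() {
    // A senior with 1,450 billable hours out of 1,800 available: ~80.6% utilization.
    println!("utilization: {:.3}", utilization_rate(1_450.0, 1_800.0));
    // 92,000 billed against 100,000 of time at standard rates: 92% realization.
    println!("realization: {:.2}", realization_rate(92_000.0, 100_000.0));
}
```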

6.2 Professional Services Anomalies

professional_services_anomalies:
  billing_fraud:
    - type: inflated_hours
      description: "Billing for time not worked"
      indicators: [impossible_hours, pattern_analysis]

    - type: phantom_work
      description: "Billing for work never performed"
      indicators: [no_work_product, client_complaints]

    - type: duplicate_billing
      description: "Billing multiple clients for same time"
      indicators: [time_overlap, total_hours_analysis]

  expense_fraud:
    - type: personal_expense_billing
      description: "Charging personal expenses to clients"
      indicators: [expense_patterns, vendor_analysis]

    - type: markup_abuse
      description: "Excessive markups on pass-through costs"
      indicators: [markup_comparison, cost_analysis]

  trust_account_fraud:
    - type: commingling
      description: "Mixing trust and operating funds"
      indicators: [transfer_patterns, reconciliation_issues]

    - type: misappropriation
      description: "Using client funds for personal use"
      indicators: [unauthorized_withdrawals, shortages]

7. Real Estate Industry

7.1 Enhanced Transaction Types

real_estate:
  transaction_types:
    # Property management
    property:
      - rent_collection
      - cam_charges
      - security_deposit
      - lease_payment
      - tenant_improvement
      - property_tax
      - insurance_expense

    # Development
    development:
      - land_acquisition
      - construction_draw
      - development_fee
      - capitalized_interest
      - soft_cost
      - hard_cost

    # Investment
    investment:
      - property_acquisition
      - property_disposition
      - depreciation
      - impairment
      - fair_value_adjustment
      - debt_service

    # REIT-specific
    reit:
      - ffo_calculation
      - dividend_distribution
      - taxable_income
      - section_1031_exchange
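
For the REIT entries, `ffo_calculation` follows the standard NAREIT-style definition, simplified here: net income plus real estate depreciation and amortization, minus gains (plus losses) on property sales. A worked arithmetic sketch in cents, with invented field names:

```rust
#[derive(Debug, Clone, Copy)]
pub struct ReitPeriod {
    pub net_income_cents: i64,
    pub re_depreciation_amortization_cents: i64,
    pub gains_on_property_sales_cents: i64,
    pub losses_on_property_sales_cents: i64,
}

/// Funds From Operations (simplified NAREIT-style definition).
pub fn ffo_cents(p: &ReitPeriod) -> i64 {
    p.net_income_cents + p.re_depreciation_amortization_cents
        - p.gains_on_property_sales_cents
        + p.losses_on_property_sales_cents
}

fn main() {
    let q = ReitPeriod {
        net_income_cents: 5_000_000_00,
        re_depreciation_amortization_cents: 3_200_000_00,
        gains_on_property_sales_cents: 900_000_00,
        losses_on_property_sales_cents: 0,
    };
    // 5.0m + 3.2m - 0.9m = 7.3m
    println!("FFO: {} cents", ffo_cents(&q));
}
```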

7.2 Real Estate Anomalies

real_estate_anomalies:
  property_management:
    - type: rent_skimming
      description: "Not recording cash rent payments"
      indicators: [occupancy_vs_revenue, cash_deposits]

    - type: kickback_maintenance
      description: "Receiving kickbacks from contractors"
      indicators: [contractor_concentration, pricing_analysis]

  development:
    - type: cost_inflation
      description: "Inflating development costs"
      indicators: [cost_per_unit_comparison, change_order_patterns]

    - type: capitalization_abuse
      description: "Capitalizing operating expenses"
      indicators: [capitalization_ratio, expense_classification]

  valuation:
    - type: appraisal_manipulation
      description: "Influencing property appraisals"
      indicators: [appraisal_vs_sale_price, appraiser_relationships]

    - type: impairment_avoidance
      description: "Failing to record impairments"
      indicators: [occupancy_decline, market_comparisons]

8. Industry-Specific Configuration

8.1 Unified Industry Configuration

# Master industry configuration schema
industry_configuration:
  industry: manufacturing  # or retail, healthcare, etc.

  # Industry-specific settings
  settings:
    transaction_types:
      enabled: [production, inventory, costing]
      weights:
        production_orders: 0.30
        inventory_movements: 0.40
        cost_adjustments: 0.30

    master_data:
      bill_of_materials: true
      routings: true
      work_centers: true
      production_resources: true

    anomaly_injection:
      industry_specific: true
      generic: true
      industry_weight: 0.60

    terminology:
      use_industry_terms: true
      document_naming: industry_standard
      account_descriptions: industry_specific

    seasonality:
      profile: manufacturing
      custom_events:
        - name: plant_shutdown
          month: 7
          duration_weeks: 2
          activity_multiplier: 0.10

    regulatory:
      frameworks:
        - environmental: epa
        - safety: osha
        - quality: iso_9001

  # Cross-industry settings (inherit from base)
  inherit:
    - accounting_standards
    - audit_standards
    - control_framework
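
On the Rust side, a schema like this maps naturally onto serde-deserialized configuration structs. The sketch below covers only a few of the fields above and assumes the `serde` and `serde_yaml` crates; the generator's real configuration types may differ.

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct IndustryConfiguration {
    pub industry: String,
    pub settings: IndustrySettings,
    #[serde(default)]
    pub inherit: Vec<String>,
}

#[derive(Debug, Deserialize)]
pub struct IndustrySettings {
    pub anomaly_injection: AnomalyInjection,
    pub terminology: Terminology,
}

#[derive(Debug, Deserialize)]
pub struct AnomalyInjection {
    pub industry_specific: bool,
    pub generic: bool,
    pub industry_weight: f64,
}

#[derive(Debug, Deserialize)]
pub struct Terminology {
    pub use_industry_terms: bool,
    pub document_naming: String,
    pub account_descriptions: String,
}

fn main() -> Result<(), serde_yaml::Error> {
    let yaml = r#"
industry: manufacturing
settings:
  anomaly_injection:
    industry_specific: true
    generic: true
    industry_weight: 0.60
  terminology:
    use_industry_terms: true
    document_naming: industry_standard
    account_descriptions: industry_specific
inherit: [accounting_standards, audit_standards]
"#;
    let cfg: IndustryConfiguration = serde_yaml::from_str(yaml)?;
    println!("{} with industry weight {}", cfg.industry, cfg.settings.anomaly_injection.industry_weight);
    Ok(())
}
```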

8.2 Industry Presets Enhancement

presets:
  manufacturing_automotive:
    base: manufacturing
    customizations:
      bom_depth: 7
      just_in_time: true
      quality_framework: iatf_16949
      supplier_tiers: 3
      defect_rates: very_low

  retail_grocery:
    base: retail
    customizations:
      perishable_inventory: true
      high_volume_low_margin: true
      shrinkage_focus: true
      vendor_managed_inventory: true

  healthcare_hospital:
    base: healthcare
    customizations:
      inpatient: true
      outpatient: true
      emergency_services: true
      ancillary_services: true
      case_mix_complexity: high

  technology_saas:
    base: technology
    customizations:
      subscription_revenue: primary
      professional_services: secondary
      monthly_recurring_revenue: true
      churn_modeling: true

  financial_services_bank:
    base: financial_services
    customizations:
      banking_charter: commercial
      deposit_taking: true
      lending: true
      capital_markets: limited

9. Implementation Priority

| Industry | Enhancement Scope | Complexity | Priority |
|---|---|---|---|
| Manufacturing | Full enhancement | High | P1 |
| Retail | Full enhancement | Medium | P1 |
| Healthcare | Full enhancement | High | P1 |
| Technology | Revenue recognition | Medium | P2 |
| Financial Services | Extend banking module | Medium | P1 |
| Professional Services | New module | Medium | P2 |
| Real Estate | New module | Medium | P3 |

10. Terminology and Naming

industry_terminology:
  manufacturing:
    document_types:
      purchase_order: "Production Purchase Order"
      invoice: "Vendor Invoice"
      receipt: "Goods Receipt / Material Document"

    accounts:
      wip: "Work in Process"
      fg: "Finished Goods Inventory"
      rm: "Raw Materials Inventory"

    transactions:
      production: "Manufacturing Order Settlement"
      variance: "Production Variance Posting"

  healthcare:
    document_types:
      invoice: "Claim"
      payment: "Remittance Advice"
      receipt: "Patient Payment"

    accounts:
      ar: "Patient Accounts Receivable"
      revenue: "Net Patient Service Revenue"
      contractual: "Contractual Allowance"

    transactions:
      billing: "Charge Capture"
      collection: "Payment Posting"

  # Similar for other industries...

Summary

This research document series provides a comprehensive analysis of improvement opportunities for the SyntheticData system. Key themes across all documents:

  1. Depth over breadth: Enhance existing features rather than adding new surface-level capabilities
  2. Correlation modeling: Move from independent generation to correlated, interconnected data
  3. Temporal realism: Add dynamic behavior that evolves over time
  4. Domain authenticity: Use real industry terminology, patterns, and regulations
  5. Detection-aware design: Generate data that enables meaningful ML training and evaluation

The recommended implementation approach is phased, starting with high-impact, lower-complexity enhancements and building toward more sophisticated modeling over time.


End of Research Document Series

Total documents: 8
Research conducted: January 2026
System version analyzed: 0.2.3