SyntheticData
High-Performance Synthetic Enterprise Financial Data Generator
Developed by Ernst & Young Ltd., Zurich, Switzerland
What is SyntheticData?
SyntheticData is a configurable synthetic data generator that produces realistic, interconnected enterprise financial data. It generates General Ledger journal entries, Chart of Accounts, SAP HANA-compatible ACDOCA event logs, document flows, subledger records, banking/KYC/AML transactions, OCEL 2.0 process mining data, audit workpapers, and ML-ready graph exports at scale.
The generator produces statistically accurate data grounded in empirical research on real-world general ledger patterns, ensuring that synthetic datasets exhibit the same characteristics as production data, including Benford’s Law compliance, temporal patterns, and document flow integrity.
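For example, Benford’s Law predicts a first-digit frequency of log10(1 + 1/d) for digit d. A quick way to sanity-check generated amounts against it is sketched below; the file path and column name are assumptions based on the output layout described later, so adjust them to your output.
# Sketch: compare first-digit frequencies of generated amounts with Benford's Law.
# Path and column name are assumptions; adjust them to your output layout.
import csv, math
from collections import Counter

def first_digit(value: str):
    for ch in value:
        if ch in "123456789":
            return ch
    return None

counts = Counter()
with open("output/transactions/journal_entries.csv", newline="") as f:
    for row in csv.DictReader(f):
        digit = first_digit(row.get("debit_amount", ""))
        if digit:
            counts[digit] += 1

total = sum(counts.values()) or 1
for d in "123456789":
    print(f"digit {d}: observed {counts[d] / total:.3f}, Benford {math.log10(1 + 1 / int(d)):.3f}")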
New in v0.5.0: LLM-augmented generation (vendor names, descriptions, anomaly explanations), diffusion model backend (statistical denoising, hybrid generation), causal & counterfactual generation (SCMs, do-calculus interventions), federated fingerprinting, synthetic data certificates, and ecosystem integrations (Airflow, dbt, MLflow, Spark).
v0.3.0: ACFE-aligned fraud taxonomy, collusion modeling, industry-specific transactions (Manufacturing, Retail, Healthcare), and ML benchmarks.
v0.2.x: Privacy-preserving fingerprinting, accounting/audit standards (US GAAP, IFRS, ISA, SOX), streaming output API.
Quick Links
| Section | Description |
|---|---|
| Getting Started | Installation, quick start guide, and demo mode |
| User Guide | CLI reference, server API, desktop UI, Python wrapper |
| Configuration | Complete YAML schema and presets |
| Architecture | System design, data flow, resource management |
| Crate Reference | Detailed crate documentation (15 crates) |
| Advanced Topics | Anomaly injection, graph export, fingerprinting, performance |
| Use Cases | Fraud detection, audit, AML/KYC, compliance |
Key Features
Core Data Generation
| Feature | Description |
|---|---|
| Statistical Distributions | Line item counts, amounts, and patterns based on empirical GL research |
| Benford’s Law Compliance | First-digit distribution following Benford’s Law with configurable fraud patterns |
| Industry Presets | Manufacturing, Retail, Financial Services, Healthcare, Technology |
| Chart of Accounts | Small (~100), Medium (~400), Large (~2500) account structures |
| Temporal Patterns | Month-end, quarter-end, year-end volume spikes with working hour modeling |
| Regional Calendars | Holiday calendars for US, DE, GB, CN, JP, IN with lunar calendar support |
Enterprise Simulation
- Master Data Management: Vendors, customers, materials, fixed assets, employees with temporal validity
- Document Flow Engine: Complete P2P (Procure-to-Pay) and O2C (Order-to-Cash) processes with three-way matching
- Intercompany Transactions: IC matching, transfer pricing, consolidation eliminations
- Balance Coherence: Opening balances, running balance tracking, trial balance generation
- Subledger Simulation: AR, AP, Fixed Assets, Inventory with GL reconciliation
- Currency & FX: Realistic exchange rates (Ornstein-Uhlenbeck process; see the sketch after this list), currency translation, CTA generation
- Period Close Engine: Monthly close, depreciation runs, accruals, year-end closing
- Banking/KYC/AML: Customer personas, KYC profiles, AML typologies (structuring, funnel, mule, layering, round-tripping)
- Process Mining: OCEL 2.0 event logs with object-centric relationships
- Audit Simulation: ISA-compliant engagements, workpapers, findings, risk assessments, professional judgments
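As a rough illustration of the Ornstein-Uhlenbeck idea behind the FX rates, the sketch below simulates one mean-reverting rate path. The parameters are illustrative only, not SyntheticData's internal defaults.
# Illustrative Ornstein-Uhlenbeck (mean-reverting) FX rate path.
# Parameters are made up for demonstration, not the generator's defaults.
import random

def ou_path(x0=1.10, mean=1.10, reversion=0.05, volatility=0.004, steps=250, seed=42):
    rng = random.Random(seed)
    rate, path = x0, [x0]
    for _ in range(steps):
        # dX = reversion * (mean - X) * dt + volatility * dW, with dt = one step
        rate += reversion * (mean - rate) + volatility * rng.gauss(0.0, 1.0)
        path.append(rate)
    return path

rates = ou_path()
print(f"min {min(rates):.4f}, max {max(rates):.4f}, final {rates[-1]:.4f}")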
Fraud Patterns & Industry-Specific Features
- ACFE-Aligned Fraud Taxonomy: Asset Misappropriation, Corruption, Financial Statement Fraud calibrated to ACFE statistics
- Collusion & Conspiracy Modeling: Multi-party fraud networks with 9 ring types and role-based conspirators
- Management Override: Senior-level fraud with fraud triangle modeling (Pressure, Opportunity, Rationalization)
- Red Flag Generation: 40+ probabilistic fraud indicators with Bayesian probabilities
- Industry-Specific Transactions: Manufacturing (BOM, WIP), Retail (POS, shrinkage), Healthcare (ICD-10, claims)
- Industry-Specific Anomalies: Authentic fraud patterns per industry (upcoding, sweethearting, yield manipulation)
Machine Learning & Analytics
- Graph Export: PyTorch Geometric, Neo4j, DGL, and RustGraph formats with train/val/test splits
- Anomaly Injection: 60+ fraud types, errors, process issues with full labeling
- Data Quality Variations: Missing values (MCAR, MAR, MNAR), format variations, duplicates, typos
- Evaluation Framework: Auto-tuning with configuration recommendations based on metric gaps
- ACFE Benchmarks: ML benchmarks calibrated to ACFE fraud statistics
- Industry Benchmarks: Pre-configured benchmarks for fraud detection by industry
AI & ML-Powered Generation
- LLM-Augmented Generation: Use LLMs to generate realistic vendor names, transaction descriptions, memo fields, and anomaly explanations via pluggable provider abstraction (Mock, OpenAI, Anthropic, Custom)
- Natural Language Configuration: Generate YAML configs from natural language descriptions (init --from-description "Generate 1 year of retail data for a mid-size US company")
- Diffusion Model Backend: Statistical diffusion with configurable noise schedules (linear, cosine, sigmoid) for learned distribution capture
- Hybrid Generation: Blend rule-based and diffusion outputs using interpolation, selection, or ensemble strategies
- Causal Generation: Define Structural Causal Models (SCMs) with interventional (“what-if”) and counterfactual generation
- Built-in Causal Templates: Pre-configured fraud_detection and revenue_cycle causal graphs
Privacy-Preserving Fingerprinting
- Fingerprint Extraction: Extract statistical properties from real data into .dsf files
- Differential Privacy: Laplace and Gaussian mechanisms with configurable epsilon budget
- K-Anonymity: Suppression of rare categorical values below configurable threshold
- Privacy Audit Trail: Complete logging of all privacy decisions and epsilon spent
- Fidelity Evaluation: Validate synthetic data matches original fingerprint (KS, Wasserstein, Benford MAD)
- Gaussian Copula: Preserve multivariate correlations during synthesis
- Federated Fingerprinting: Extract fingerprints from distributed data sources without centralization using secure aggregation (weighted average, median, trimmed mean)
- Synthetic Data Certificates: Cryptographic proof of DP guarantees with HMAC-SHA256 signing, embeddable in Parquet metadata and JSON output
- Privacy-Utility Pareto Frontier: Automated exploration of optimal epsilon values for given utility targets
Production Features
- REST & gRPC APIs: Streaming generation with authentication and rate limiting
- Desktop UI: Cross-platform Tauri/SvelteKit application with 15+ configuration pages
- Resource Guards: Memory, disk, and CPU monitoring with graceful degradation
- Graceful Degradation: Progressive feature reduction under resource pressure (Normal→Reduced→Minimal→Emergency)
- Deterministic Generation: Seeded RNG (ChaCha8) for reproducible output
- Python Wrapper: Programmatic access with blueprints and config validation
Performance
| Metric | Performance |
|---|---|
| Single-threaded throughput | 100,000+ entries/second |
| Parallel scaling | Linear with available cores |
| Memory efficiency | Streaming generation for large volumes |
Use Cases
| Use Case | Description |
|---|---|
| Fraud Detection ML | Train supervised models with labeled fraud patterns |
| Graph Neural Networks | Entity relationship graphs for anomaly detection |
| AML/KYC Testing | Banking transaction data with structuring, layering, mule patterns |
| Audit Analytics | Test audit procedures with known control exceptions |
| Process Mining | OCEL 2.0 event logs for process discovery |
| ERP Testing | Load testing with realistic transaction volumes |
| SOX Compliance | Test internal control monitoring systems |
| Data Quality ML | Train models to detect missing values, typos, duplicates |
| Causal Analysis | “What-if” scenarios and counterfactual generation for audit |
| LLM Training Data | Generate LLM-enriched training datasets with realistic metadata |
| Pipeline Orchestration | Integrate with Airflow, dbt, MLflow, and Spark pipelines |
Quick Start
# Install from source
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release
# Run demo mode
./target/release/datasynth-data generate --demo --output ./output
# Or create a custom configuration
./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml
./target/release/datasynth-data generate --config config.yaml --output ./output
Fingerprinting (New in v0.2.0)
# Extract fingerprint from real data with privacy protection
./target/release/datasynth-data fingerprint extract \
--input ./real_data.csv \
--output ./fingerprint.dsf \
--privacy-level standard
# Validate fingerprint integrity
./target/release/datasynth-data fingerprint validate ./fingerprint.dsf
# View fingerprint details
./target/release/datasynth-data fingerprint info ./fingerprint.dsf --detailed
# Evaluate synthetic data fidelity
./target/release/datasynth-data fingerprint evaluate \
--fingerprint ./fingerprint.dsf \
--synthetic ./synthetic_data/ \
--threshold 0.8
LLM-Augmented Generation (New in v0.5.0)
# Generate config from natural language description
./target/release/datasynth-data init \
--from-description "Generate 1 year of retail data for a mid-size US company with fraud patterns" \
-o config.yaml
# Generate with LLM enrichment (uses mock provider by default)
./target/release/datasynth-data generate --config config.yaml --output ./output
Causal Generation (New in v0.5.0)
# Generate data with causal structure (fraud_detection template)
./target/release/datasynth-data causal generate \
--template fraud_detection \
--samples 10000 \
--output ./causal_output
# Run intervention ("what-if" scenario)
./target/release/datasynth-data causal intervene \
--template fraud_detection \
--variable transaction_amount \
--value 50000 \
--samples 5000 \
--output ./intervention_output
Diffusion Model Training (New in v0.5.0)
# Train a diffusion model on fingerprint data
./target/release/datasynth-data diffusion train \
--fingerprint ./fingerprint.dsf \
--output ./model.json
# Evaluate diffusion model fit
./target/release/datasynth-data diffusion evaluate \
--model ./model.json \
--samples 5000
Python Wrapper
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
config = blueprints.retail_small(companies=4, transactions=10000)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
print(result.output_dir)
Architecture
SyntheticData is organized as a Rust workspace with 15 modular crates:
datasynth-cli Command-line interface (binary: datasynth-data)
datasynth-server REST/gRPC/WebSocket server with auth and rate limiting
datasynth-ui Tauri/SvelteKit desktop application
│
datasynth-runtime Orchestration layer (parallel execution, resource guards)
│
datasynth-generators Data generators (JE, documents, subledgers, anomalies, audit)
datasynth-banking KYC/AML banking transaction generator
datasynth-ocpm Object-Centric Process Mining (OCEL 2.0)
datasynth-fingerprint Privacy-preserving fingerprint extraction and synthesis
datasynth-standards Accounting/audit standards (US GAAP, IFRS, ISA, SOX)
│
datasynth-graph Graph/network export (PyTorch Geometric, Neo4j, DGL)
datasynth-eval Evaluation framework with auto-tuning
│
datasynth-config Configuration schema, validation, industry presets
│
datasynth-core Domain models, traits, distributions, resource guards
│
datasynth-output Output sinks (CSV, JSON, Parquet, streaming)
datasynth-test-utils Test utilities, fixtures, mocks
License
Copyright 2024-2026 Michael Ivertowski, Ernst & Young Ltd., Zurich, Switzerland
Licensed under the Apache License, Version 2.0. See LICENSE for details.
Support
Commercial support, custom development, and enterprise licensing are available upon request. Please contact the author at michael.ivertowski@ch.ey.com for inquiries.
SyntheticData is provided “as is” without warranty of any kind. It is intended for testing, development, and educational purposes. Generated data should not be used as a substitute for real financial records.
Getting Started
Welcome to SyntheticData! This section will help you get up and running quickly.
What You’ll Learn
- Installation: Set up SyntheticData on your system
- Quick Start: Generate your first synthetic dataset
- Demo Mode: Explore SyntheticData with built-in demo presets
Prerequisites
Before you begin, ensure you have:
- Rust 1.88+: SyntheticData is written in Rust and requires the Rust toolchain
- Git: For cloning the repository
- (Optional) Node.js 18+: Required only for the desktop UI
Installation Overview
# Clone and build
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release
# The binary is at target/release/datasynth-data
First Steps
The fastest way to explore SyntheticData is through demo mode:
datasynth-data generate --demo --output ./demo-output
This generates a complete set of synthetic financial data using sensible defaults.
Architecture at a Glance
SyntheticData generates interconnected financial data:
┌─────────────────────────────────────────────────────────────┐
│ Configuration (YAML) │
├─────────────────────────────────────────────────────────────┤
│ Generation Pipeline │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Master │→│ Document │→│ Journal │→│ Output │ │
│ │ Data │ │ Flows │ │ Entries │ │ Files │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Output: CSV, JSON, Neo4j, PyTorch Geometric, ACDOCA │
└─────────────────────────────────────────────────────────────┘
Next Steps
- Follow the Installation Guide to set up your environment
- Work through the Quick Start Tutorial
- Explore Demo Mode for a hands-on introduction
- Review the CLI Reference for all available commands
Getting Help
- Check the User Guide for detailed usage instructions
- Review Configuration for all available options
- See Use Cases for real-world examples
Installation
This guide covers installing SyntheticData from source.
Prerequisites
Required
| Requirement | Version | Purpose |
|---|---|---|
| Rust | 1.88+ | Compilation |
| Git | Any recent | Clone repository |
| C compiler | gcc/clang | Native dependencies |
Optional
| Requirement | Version | Purpose |
|---|---|---|
| Node.js | 18+ | Desktop UI |
| npm | 9+ | Desktop UI dependencies |
Installing Rust
If you don’t have Rust installed, use rustup:
# Linux/macOS
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Windows
# Download and run rustup-init.exe from https://rustup.rs
# Verify installation
rustc --version
cargo --version
Building from Source
Clone the Repository
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
Build Release Binary
# Build optimized release binary
cargo build --release
# The binary is at target/release/datasynth-data
Verify Installation
# Check version
./target/release/datasynth-data --version
# View help
./target/release/datasynth-data --help
# Run quick validation
./target/release/datasynth-data info
Adding to PATH
To run datasynth-data from anywhere:
Linux/macOS
# Option 1: Symlink to local bin
ln -s $(pwd)/target/release/datasynth-data ~/.local/bin/datasynth-data
# Option 2: Copy to system bin (requires sudo)
sudo cp target/release/datasynth-data /usr/local/bin/
# Option 3: Add target/release to PATH in ~/.bashrc or ~/.zshrc
export PATH="$PATH:/path/to/SyntheticData/target/release"
Windows
Add the target/release directory to your system PATH environment variable.
Building the Desktop UI
The desktop UI requires additional setup:
# Navigate to UI crate
cd crates/datasynth-ui
# Install Node.js dependencies
npm install
# Run in development mode
npm run tauri dev
# Build production release
npm run tauri build
Platform-Specific Dependencies
Linux (Ubuntu/Debian):
sudo apt-get install libwebkit2gtk-4.1-dev \
libgtk-3-dev \
libayatana-appindicator3-dev \
librsvg2-dev
macOS: No additional dependencies required.
Windows: Install WebView2 runtime (usually pre-installed on Windows 10/11).
Running Tests
Verify your installation by running the test suite:
# Run all tests
cargo test
# Run tests for a specific crate
cargo test -p datasynth-core
cargo test -p datasynth-generators
# Run with output
cargo test -- --nocapture
Development Setup
For development work:
# Check code without building
cargo check
# Format code
cargo fmt
# Run lints
cargo clippy
# Build documentation
cargo doc --workspace --no-deps --open
Running Benchmarks
# Run all benchmarks
cargo bench
# Run specific benchmark
cargo bench --bench generation_throughput
Troubleshooting
Build Failures
Missing C compiler:
# Ubuntu/Debian
sudo apt-get install build-essential
# macOS
xcode-select --install
# Fedora/RHEL
sudo dnf install gcc
Out of memory during build:
# Limit parallel jobs
cargo build --release -j 2
Runtime Issues
Permission denied:
chmod +x target/release/datasynth-data
Library not found (Linux):
# Check for missing dependencies
ldd target/release/datasynth-data
Next Steps
- Follow the Quick Start Guide to generate your first dataset
- Explore Demo Mode for a hands-on introduction
- Review the CLI Reference for all commands
Quick Start
This guide walks you through generating your first synthetic financial dataset.
Overview
The typical workflow is:
- Initialize a configuration file
- Validate the configuration
- Generate synthetic data
- Review the output
Step 1: Initialize Configuration
Create a configuration file for your industry and complexity needs:
# Manufacturing company with medium complexity (~400 accounts)
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
Available Industry Presets
| Industry | Description |
|---|---|
| manufacturing | Production, inventory, cost accounting |
| retail | Sales, inventory, customer transactions |
| financial_services | Banking, investments, regulatory reporting |
| healthcare | Patient revenue, medical supplies, compliance |
| technology | R&D, SaaS revenue, deferred revenue |
Complexity Levels
| Level | Accounts | Description |
|---|---|---|
| small | ~100 | Simple chart of accounts |
| medium | ~400 | Typical mid-size company |
| large | ~2500 | Enterprise-scale structure |
Step 2: Review Configuration
Open config.yaml to review and customize:
global:
seed: 42 # For reproducible generation
industry: manufacturing
start_date: 2024-01-01
period_months: 12
group_currency: USD
companies:
- code: "1000"
name: "Headquarters"
currency: USD
country: US
volume_weight: 1.0
transactions:
target_count: 100000 # Number of journal entries
fraud:
enabled: true
fraud_rate: 0.005 # 0.5% fraud transactions
output:
format: csv
compression: none
See the Configuration Guide for all options.
Step 3: Validate Configuration
Check your configuration for errors:
datasynth-data validate --config config.yaml
The validator checks:
- Required fields are present
- Values are within valid ranges
- Distribution weights sum to 1.0
- Dates are consistent
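As an illustration of the weight-sum rule, the snippet below checks that company volume_weight values add up to 1.0 within a small tolerance. Treating volume_weight as one of the validated weights, and the 0.01 tolerance, are assumptions based on the CLI reference; it is a sketch, not the validator's implementation.
# Sketch of a weight-sum check similar to what the validator performs.
# Assumes volume_weight is one of the validated distribution weights.
import yaml  # pip install pyyaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

weights = [company["volume_weight"] for company in config.get("companies", [])]
assert abs(sum(weights) - 1.0) <= 0.01, f"volume weights sum to {sum(weights)}, expected 1.0"
print("company volume_weight sum OK")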
Step 4: Generate Data
Run the generation:
datasynth-data generate --config config.yaml --output ./output
You’ll see a progress bar:
Generating synthetic data...
[████████████████████████████████] 100000/100000 entries
Generation complete in 1.2s
Step 5: Explore Output
The output directory contains organized subdirectories:
output/
├── master_data/
│ ├── vendors.csv
│ ├── customers.csv
│ ├── materials.csv
│ └── employees.csv
├── transactions/
│ ├── journal_entries.csv
│ ├── acdoca.csv
│ ├── purchase_orders.csv
│ └── vendor_invoices.csv
├── subledgers/
│ ├── ar_open_items.csv
│ └── ap_open_items.csv
├── period_close/
│ └── trial_balances/
├── labels/
│ ├── anomaly_labels.csv
│ └── fraud_labels.csv
└── controls/
└── internal_controls.csv
Common Customizations
Generate More Data
transactions:
target_count: 1000000 # 1 million entries
Enable Graph Export
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
Add Anomaly Injection
anomaly_injection:
enabled: true
total_rate: 0.02 # 2% anomaly rate
generate_labels: true # For ML training
Multiple Companies
companies:
- code: "1000"
name: "Headquarters"
currency: USD
volume_weight: 0.6
- code: "2000"
name: "European Subsidiary"
currency: EUR
volume_weight: 0.4
Next Steps
- Explore Demo Mode for built-in presets
- Learn the CLI Reference
- Review Output Formats
- See Configuration for all options
Quick Reference
# Common commands
datasynth-data init --industry <industry> --complexity <level> -o config.yaml
datasynth-data validate --config config.yaml
datasynth-data generate --config config.yaml --output ./output
datasynth-data generate --demo --output ./demo-output
datasynth-data info # Show available presets
Demo Mode
Demo mode provides a quick way to explore SyntheticData without creating a configuration file. It uses sensible defaults to generate a complete synthetic dataset.
Running Demo Mode
datasynth-data generate --demo --output ./demo-output
What Demo Mode Generates
Demo mode creates a comprehensive dataset with:
| Category | Contents |
|---|---|
| Master Data | Vendors, customers, materials, employees |
| Transactions | ~10,000 journal entries |
| Document Flows | P2P and O2C process documents |
| Subledgers | AR and AP records |
| Period Close | Trial balances |
| Controls | Internal control mappings |
Demo Configuration
Demo mode uses these defaults:
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 3
group_currency: USD
companies:
- code: "1000"
name: "Demo Company"
currency: USD
country: US
chart_of_accounts:
complexity: medium # ~400 accounts
transactions:
target_count: 10000
fraud:
enabled: true
fraud_rate: 0.005
anomaly_injection:
enabled: true
total_rate: 0.01
generate_labels: true
output:
format: csv
Output Structure
After running demo mode, explore the output:
tree demo-output/
demo-output/
├── master_data/
│ ├── chart_of_accounts.csv # GL accounts
│ ├── vendors.csv # Vendor master
│ ├── customers.csv # Customer master
│ ├── materials.csv # Material/product master
│ └── employees.csv # Employee/user master
├── transactions/
│ ├── journal_entries.csv # Main JE file
│ ├── acdoca.csv # SAP HANA format
│ ├── purchase_orders.csv # P2P documents
│ ├── goods_receipts.csv
│ ├── vendor_invoices.csv
│ ├── payments.csv
│ ├── sales_orders.csv # O2C documents
│ ├── deliveries.csv
│ ├── customer_invoices.csv
│ └── customer_receipts.csv
├── subledgers/
│ ├── ar_open_items.csv
│ ├── ap_open_items.csv
│ └── inventory_positions.csv
├── period_close/
│ └── trial_balances/
│ ├── 2024_01.csv
│ ├── 2024_02.csv
│ └── 2024_03.csv
├── labels/
│ ├── anomaly_labels.csv # For ML training
│ └── fraud_labels.csv
└── controls/
├── internal_controls.csv
└── sod_rules.csv
Exploring the Data
Journal Entries
head -5 demo-output/transactions/journal_entries.csv
Key fields:
- document_id: Unique transaction identifier
- posting_date: When the entry was posted
- company_code: Company identifier
- account_number: GL account
- debit_amount / credit_amount: Entry amounts
- is_fraud: Fraud label (true/false)
- is_anomaly: Anomaly label
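To get a quick feel for the labels, a short script can tally the is_fraud column of the demo output (path and column name follow the layout shown above):
# Sketch: count labeled fraud entries in the demo output.
import csv
from collections import Counter

label_counts = Counter()
with open("demo-output/transactions/journal_entries.csv", newline="") as f:
    for row in csv.DictReader(f):
        label_counts[row["is_fraud"]] += 1

print(label_counts)  # roughly 0.5% of entries should be labeled true at the demo fraud rate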
Fraud Labels
# View fraud transactions
grep "true" demo-output/labels/fraud_labels.csv | head
Trial Balance
# Check balance coherence
head demo-output/period_close/trial_balances/2024_01.csv
Customizing Demo Output
You can combine demo mode with some options:
# Change output directory
datasynth-data generate --demo --output ./my-demo
# Use demo as starting point, then create config
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
# Edit config.yaml as needed
datasynth-data generate --config config.yaml --output ./output
Use Cases for Demo Mode
Quick Exploration
Test SyntheticData’s capabilities before creating a custom configuration.
Development Testing
Generate test data quickly for development purposes.
Training & Workshops
Provide sample data for training sessions without complex setup.
Benchmarking
Establish baseline performance metrics.
Moving Beyond Demo Mode
When you’re ready for more control:
- Create a configuration file:
datasynth-data init --industry <your-industry> -o config.yaml
- Customize settings:
- Adjust transaction volume
- Configure multiple companies
- Enable graph export
- Fine-tune fraud/anomaly rates
- Generate with your config:
datasynth-data generate --config config.yaml --output ./output
Next Steps
- Review Quick Start for custom configurations
- Learn the CLI Reference
- Explore Configuration Options
- See Use Cases for real-world examples
User Guide
This section covers the different ways to use SyntheticData.
Interfaces
SyntheticData offers three interfaces:
| Interface | Use Case |
|---|---|
| CLI | Command-line generation, scripting, automation |
| Server API | REST/gRPC/WebSocket for applications |
| Desktop UI | Visual configuration and monitoring |
Quick Comparison
| Feature | CLI | Server | Desktop UI |
|---|---|---|---|
| Configuration editing | YAML files | API endpoints | Visual forms |
| Batch generation | Yes | Yes | Yes |
| Streaming generation | No | Yes | Yes (view) |
| Progress monitoring | Progress bar | WebSocket | Real-time |
| Scripting/automation | Yes | Yes | No |
| Visual feedback | Minimal | None | Full |
CLI Overview
The command-line interface (datasynth-data) is ideal for:
- Batch generation
- CI/CD pipelines
- Scripting and automation
- Server environments
datasynth-data generate --config config.yaml --output ./output
Server Overview
The server (datasynth-server) provides:
- REST API for configuration and control
- gRPC for high-performance integration
- WebSocket for real-time streaming
cargo run -p datasynth-server -- --port 3000
Desktop UI Overview
The desktop application offers:
- Visual configuration editor
- Industry preset selector
- Real-time generation monitoring
- Cross-platform support (Windows, macOS, Linux)
cd crates/datasynth-ui && npm run tauri dev
Output Formats
SyntheticData produces various output formats:
- CSV: Standard tabular data
- JSON: Structured data with nested objects
- ACDOCA: SAP HANA Universal Journal format
- PyTorch Geometric: ML-ready graph tensors
- Neo4j: Graph database import format
See Output Formats for details.
Choosing an Interface
Use the CLI if you:
- Need to automate generation
- Work in headless/server environments
- Prefer command-line tools
- Want to integrate with shell scripts
Use the Server if you:
- Build applications that consume synthetic data
- Need streaming generation
- Want API-based control
- Integrate with microservices
Use the Desktop UI if you:
- Prefer visual configuration
- Want to explore options interactively
- Need real-time monitoring
- Are new to SyntheticData
Next Steps
- CLI Reference - Complete command documentation
- Server API - REST, gRPC, and WebSocket endpoints
- Desktop UI - Desktop application guide
- Output Formats - Detailed output file documentation
CLI Reference
The datasynth-data command-line tool provides commands for generating synthetic financial data and extracting fingerprints from real data.
Installation
After building the project, the binary is at target/release/datasynth-data.
cargo build --release
./target/release/datasynth-data --help
Global Options
| Option | Description |
|---|---|
| -h, --help | Show help information |
| -V, --version | Show version number |
| -v, --verbose | Enable verbose output |
| -q, --quiet | Suppress non-error output |
Commands
generate
Generate synthetic financial data.
datasynth-data generate [OPTIONS]
Options:
| Option | Type | Description |
|---|---|---|
| --config <PATH> | Path | Configuration YAML file |
| --demo | Flag | Use demo preset instead of config |
| --output <DIR> | Path | Output directory (required) |
| --format <FMT> | String | Output format: csv, json |
| --seed <NUM> | u64 | Override random seed |
Examples:
# Generate with configuration file
datasynth-data generate --config config.yaml --output ./output
# Use demo mode
datasynth-data generate --demo --output ./demo-output
# Override seed for reproducibility
datasynth-data generate --config config.yaml --output ./output --seed 12345
# JSON output format
datasynth-data generate --config config.yaml --output ./output --format json
init
Create a new configuration file from industry presets.
datasynth-data init [OPTIONS]
Options:
| Option | Type | Description |
|---|---|---|
| --industry <NAME> | String | Industry preset |
| --complexity <LEVEL> | String | small, medium, large |
| -o, --output <PATH> | Path | Output file path |
Available Industries:
- manufacturing
- retail
- financial_services
- healthcare
- technology
Examples:
# Create manufacturing config
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
# Create large retail config
datasynth-data init --industry retail --complexity large -o retail.yaml
validate
Validate a configuration file.
datasynth-data validate --config <PATH>
Options:
| Option | Type | Description |
|---|---|---|
| --config <PATH> | Path | Configuration file to validate |
Example:
datasynth-data validate --config config.yaml
Validation Checks:
- Required fields present
- Value ranges (period_months: 1-120)
- Distribution weights sum to 1.0 (±0.01 tolerance)
- Date consistency
- Company code uniqueness
- Compression level: 1-9 when enabled
- All rate/percentage fields: 0.0-1.0
- Approval thresholds: strictly ascending order
info
Display available presets and configuration options.
datasynth-data info
Output includes:
- Available industry presets
- Complexity levels
- Supported output formats
- Feature capabilities
fingerprint
Privacy-preserving fingerprint extraction and evaluation. This command has several subcommands.
datasynth-data fingerprint <SUBCOMMAND>
fingerprint extract
Extract a fingerprint from real data with privacy controls.
datasynth-data fingerprint extract [OPTIONS]
Options:
| Option | Type | Description |
|---|---|---|
| --input <PATH> | Path | Input CSV data file (required) |
| --output <PATH> | Path | Output .dsf fingerprint file (required) |
| --privacy-level <LEVEL> | String | Privacy level: minimal, standard, high, maximum |
| --epsilon <FLOAT> | f64 | Custom differential privacy epsilon (overrides level) |
| --k <INT> | usize | Custom k-anonymity threshold (overrides level) |
Privacy Levels:
| Level | Epsilon | k | Outlier % | Use Case |
|---|---|---|---|---|
| minimal | 5.0 | 3 | 99% | Low privacy, high utility |
| standard | 1.0 | 5 | 95% | Balanced (default) |
| high | 0.5 | 10 | 90% | Higher privacy |
| maximum | 0.1 | 20 | 85% | Maximum privacy |
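For intuition about the epsilon column, the sketch below adds Laplace noise to a single statistic at each privacy level. It illustrates the privacy/utility trade-off only; it is not datasynth-fingerprint's internal mechanism.
# Illustration only: smaller epsilon means a larger noise scale and stronger privacy.
import math, random

def laplace_noise(sensitivity, epsilon, rng):
    scale = sensitivity / epsilon
    u = rng.random() - 0.5
    # Inverse-CDF sampling of the Laplace distribution
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

rng = random.Random(7)
true_mean = 1000.0
for epsilon in (5.0, 1.0, 0.5, 0.1):  # the four privacy levels above
    noisy = true_mean + laplace_noise(sensitivity=1.0, epsilon=epsilon, rng=rng)
    print(f"epsilon={epsilon}: noisy mean ~ {noisy:.2f}")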
Examples:
# Extract with standard privacy
datasynth-data fingerprint extract \
--input ./real_data.csv \
--output ./fingerprint.dsf \
--privacy-level standard
# Extract with custom privacy parameters
datasynth-data fingerprint extract \
--input ./real_data.csv \
--output ./fingerprint.dsf \
--epsilon 0.75 \
--k 8
fingerprint validate
Validate a fingerprint file’s integrity and structure.
datasynth-data fingerprint validate <PATH>
Arguments:
| Argument | Type | Description |
|---|---|---|
| <PATH> | Path | Path to .dsf fingerprint file |
Validation Checks:
- DSF file structure (ZIP archive with required components)
- SHA-256 checksums for all components
- Required fields in manifest, schema, statistics
- Privacy audit completeness
Example:
datasynth-data fingerprint validate ./fingerprint.dsf
fingerprint info
Display fingerprint metadata and statistics.
datasynth-data fingerprint info <PATH> [OPTIONS]
Arguments:
| Argument | Type | Description |
|---|---|---|
| <PATH> | Path | Path to .dsf fingerprint file |
Options:
| Option | Type | Description |
|---|---|---|
| --detailed | Flag | Show detailed statistics |
| --json | Flag | Output as JSON |
Examples:
# Basic info
datasynth-data fingerprint info ./fingerprint.dsf
# Detailed statistics
datasynth-data fingerprint info ./fingerprint.dsf --detailed
# JSON output for scripting
datasynth-data fingerprint info ./fingerprint.dsf --json
fingerprint diff
Compare two fingerprints.
datasynth-data fingerprint diff <PATH1> <PATH2>
Arguments:
| Argument | Type | Description |
|---|---|---|
| <PATH1> | Path | First .dsf fingerprint file |
| <PATH2> | Path | Second .dsf fingerprint file |
Output includes:
- Schema differences (columns added/removed/changed)
- Statistical distribution changes
- Correlation matrix differences
Example:
datasynth-data fingerprint diff ./fp_v1.dsf ./fp_v2.dsf
fingerprint evaluate
Evaluate synthetic data fidelity against a fingerprint.
datasynth-data fingerprint evaluate [OPTIONS]
Options:
| Option | Type | Description |
|---|---|---|
| --fingerprint <PATH> | Path | Reference .dsf fingerprint file (required) |
| --synthetic <PATH> | Path | Directory containing synthetic data (required) |
| --threshold <FLOAT> | f64 | Minimum fidelity score (0.0-1.0, default 0.8) |
| --report <PATH> | Path | Output report file (HTML or JSON based on extension) |
Fidelity Metrics:
- Statistical: KS statistic, Wasserstein distance, Benford MAD
- Correlation: Correlation matrix RMSE
- Schema: Column type match, row count ratio
- Rules: Balance equation compliance rate
Examples:
# Basic evaluation
datasynth-data fingerprint evaluate \
--fingerprint ./fingerprint.dsf \
--synthetic ./synthetic_data/ \
--threshold 0.8
# Generate HTML report
datasynth-data fingerprint evaluate \
--fingerprint ./fingerprint.dsf \
--synthetic ./synthetic_data/ \
--threshold 0.85 \
--report ./fidelity_report.html
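For a rough sense of what the statistical metrics above measure, the sketch below computes a two-sample KS statistic and a Benford mean absolute deviation with NumPy/SciPy. These are not the exact formulas the CLI applies, and the amounts are stand-in data, not real output.
# Rough sketch of two fidelity metrics: KS statistic and Benford MAD.
import numpy as np
from scipy import stats

def first_digit(x: float) -> int:
    return int(f"{abs(x):.10e}"[0])  # scientific notation puts the leading digit first

def benford_mad(amounts) -> float:
    observed = np.bincount([first_digit(a) for a in amounts if a != 0], minlength=10)[1:]
    observed = observed / observed.sum()
    expected = np.log10(1 + 1 / np.arange(1, 10))
    return float(np.abs(observed - expected).mean())

rng = np.random.default_rng(0)
real = rng.lognormal(mean=7.0, sigma=1.2, size=10_000)       # stand-in for real amounts
synthetic = rng.lognormal(mean=7.0, sigma=1.1, size=10_000)  # stand-in for synthetic amounts

ks_statistic, _ = stats.ks_2samp(real, synthetic)
print(f"KS statistic: {ks_statistic:.4f}")
print(f"Benford MAD (synthetic): {benford_mad(synthetic):.4f}")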
diffusion (v0.5.0)
Train and evaluate diffusion models for statistical data generation.
diffusion train
Train a diffusion model from a fingerprint file.
datasynth-data diffusion train \
--fingerprint ./fingerprint.dsf \
--output ./model.json \
--n-steps 1000 \
--schedule cosine
| Option | Type | Default | Description |
|---|---|---|---|
| --fingerprint | path | (required) | Path to .dsf fingerprint file |
| --output | path | (required) | Output path for trained model |
| --n-steps | integer | 1000 | Number of diffusion steps |
| --schedule | string | linear | Noise schedule: linear, cosine, sigmoid |
diffusion evaluate
Evaluate a trained diffusion model’s fit quality.
datasynth-data diffusion evaluate \
--model ./model.json \
--samples 5000
| Option | Type | Default | Description |
|---|---|---|---|
| --model | path | (required) | Path to trained model |
| --samples | integer | 1000 | Number of evaluation samples |
causal (v0.5.0)
Generate data with causal structure, run interventions, and produce counterfactuals.
causal generate
Generate data following a causal graph structure.
datasynth-data causal generate \
--template fraud_detection \
--samples 10000 \
--seed 42 \
--output ./causal_output
| Option | Type | Default | Description |
|---|---|---|---|
| --template | string | (required) | Built-in template (fraud_detection, revenue_cycle) or path to custom YAML |
| --samples | integer | 1000 | Number of samples to generate |
| --seed | integer | (random) | Random seed for reproducibility |
| --output | path | (required) | Output directory |
causal intervene
Run do-calculus interventions (“what-if” scenarios).
datasynth-data causal intervene \
--template fraud_detection \
--variable transaction_amount \
--value 50000 \
--samples 5000 \
--output ./intervention_output
| Option | Type | Default | Description |
|---|---|---|---|
| --template | string | (required) | Causal template or YAML path |
| --variable | string | (required) | Variable to intervene on |
| --value | float | (required) | Value to set the variable to |
| --samples | integer | 1000 | Number of intervention samples |
| --output | path | (required) | Output directory |
causal validate
Validate that generated data preserves causal structure.
datasynth-data causal validate \
--data ./causal_output \
--template fraud_detection
| Option | Type | Default | Description |
|---|---|---|---|
| --data | path | (required) | Path to generated data |
| --template | string | (required) | Causal template to validate against |
fingerprint federated (v0.5.0)
Aggregate fingerprints from multiple distributed sources without centralizing raw data.
datasynth-data fingerprint federated \
--sources ./source_a.dsf ./source_b.dsf ./source_c.dsf \
--output ./aggregated.dsf \
--method weighted_average \
--max-epsilon 5.0
| Option | Type | Default | Description |
|---|---|---|---|
| --sources | paths | (required) | Two or more .dsf fingerprint files |
| --output | path | (required) | Output path for aggregated fingerprint |
| --method | string | weighted_average | Aggregation method: weighted_average, median, trimmed_mean |
| --max-epsilon | float | 5.0 | Maximum epsilon budget per source |
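The weighted_average method corresponds to the standard row-count-weighted mean of per-source statistics. The sketch below shows the idea on a single statistic; the .dsf internals are not represented, and the numbers are made up.
# Sketch: row-count-weighted average of a per-source statistic,
# the idea behind --method weighted_average.
def weighted_average(per_source):
    # per_source: list of (statistic, row_count) pairs, one per fingerprint
    total_rows = sum(rows for _, rows in per_source)
    return sum(value * rows for value, rows in per_source) / total_rows

sources = [(1032.5, 120_000), (987.1, 80_000), (1110.4, 40_000)]
print(f"aggregated mean amount: {weighted_average(sources):.2f}")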
init --from-description (v0.5.0)
Generate configuration from a natural language description using LLM.
datasynth-data init \
--from-description "Generate 1 year of retail data for a mid-size US company with fraud patterns" \
-o config.yaml
Uses the configured LLM provider (defaults to Mock) to parse the description and generate an appropriate YAML configuration.
generate --certificate (v0.5.0)
Attach a synthetic data certificate to the generated output.
datasynth-data generate \
--config config.yaml \
--output ./output \
--certificate
Produces a certificate.json in the output directory containing DP guarantees, quality metrics, and an HMAC-SHA256 signature.
Signal Handling (Unix)
On Unix systems, you can pause and resume generation:
# Start generation in background
datasynth-data generate --config config.yaml --output ./output &
# Pause generation
kill -USR1 $(pgrep datasynth-data)
# Resume generation (send SIGUSR1 again)
kill -USR1 $(pgrep datasynth-data)
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Configuration error |
| 3 | I/O error |
| 4 | Validation error |
| 5 | Fingerprint error |
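In scripts, these codes can drive error handling. A small Python wrapper is sketched below; the handling policy is illustrative.
# Sketch: run the CLI and branch on the exit codes listed above.
import subprocess, sys

result = subprocess.run(
    ["datasynth-data", "generate", "--config", "config.yaml", "--output", "./output"]
)
if result.returncode == 0:
    print("generation succeeded")
elif result.returncode == 2:
    sys.exit("configuration error: run `datasynth-data validate --config config.yaml`")
elif result.returncode == 4:
    sys.exit("validation error: check the reported fields")
else:
    sys.exit(f"generation failed with exit code {result.returncode}")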
Environment Variables
| Variable | Description |
|---|---|
| SYNTH_DATA_LOG | Log level (error, warn, info, debug, trace) |
| SYNTH_DATA_THREADS | Number of worker threads |
Example:
SYNTH_DATA_LOG=debug datasynth-data generate --config config.yaml --output ./output
Configuration File Location
The tool looks for configuration files in this order:
- Path specified with --config
- ./datasynth-data.yaml in the current directory
- ~/.config/datasynth-data/config.yaml
Output Directory Structure
Generation creates this structure:
output/
├── master_data/ Vendors, customers, materials, assets, employees
├── transactions/ Journal entries, purchase orders, invoices, payments
├── subledgers/ AR, AP, FA, inventory detail records
├── period_close/ Trial balances, accruals, closing entries
├── consolidation/ Eliminations, currency translation
├── fx/ Exchange rates, CTA adjustments
├── banking/ KYC profiles, bank transactions, AML typology labels
├── process_mining/ OCEL 2.0 event logs, process variants
├── audit/ Engagements, workpapers, findings, risk assessments
├── graphs/ PyTorch Geometric, Neo4j, DGL exports (if enabled)
├── labels/ Anomaly, fraud, and data quality labels for ML
└── controls/ Internal control mappings, SoD rules
Scripting Examples
Batch Generation
#!/bin/bash
for industry in manufacturing retail healthcare; do
datasynth-data init --industry $industry --complexity medium -o ${industry}.yaml
datasynth-data generate --config ${industry}.yaml --output ./output/${industry}
done
CI/CD Pipeline
# GitHub Actions example
- name: Generate Test Data
run: |
cargo build --release
./target/release/datasynth-data generate --demo --output ./test-data
- name: Validate Generation
run: |
# Check output files exist
test -f ./test-data/transactions/journal_entries.csv
test -f ./test-data/master_data/vendors.csv
Reproducible Generation
# Same seed produces identical output
datasynth-data generate --config config.yaml --output ./run1 --seed 42
datasynth-data generate --config config.yaml --output ./run2 --seed 42
diff -r run1 run2 # No differences
Fingerprint Pipeline
#!/bin/bash
# Extract fingerprint from real data
datasynth-data fingerprint extract \
--input ./real_data.csv \
--output ./fingerprint.dsf \
--privacy-level high
# Validate the fingerprint
datasynth-data fingerprint validate ./fingerprint.dsf
# Generate synthetic data matching the fingerprint
# (fingerprint informs config generation)
datasynth-data generate --config generated_config.yaml --output ./synthetic
# Evaluate fidelity
datasynth-data fingerprint evaluate \
--fingerprint ./fingerprint.dsf \
--synthetic ./synthetic \
--threshold 0.85 \
--report ./fidelity_report.html
Troubleshooting
Common Issues
“Configuration file not found”
# Check file path
ls -la config.yaml
# Use absolute path
datasynth-data generate --config /full/path/to/config.yaml --output ./output
“Invalid configuration”
# Validate first
datasynth-data validate --config config.yaml
“Permission denied”
# Check output directory permissions
mkdir -p ./output
chmod 755 ./output
“Out of memory”
The generator includes memory guards that prevent OOM conditions. If you still encounter issues:
- Reduce transaction count in configuration
- The system will automatically reduce batch sizes under memory pressure
- Check memory_guard settings in the configuration
“Fingerprint validation failed”
# Check DSF file integrity
datasynth-data fingerprint validate ./fingerprint.dsf
# View detailed info
datasynth-data fingerprint info ./fingerprint.dsf --detailed
“Low fidelity score”
If synthetic data fidelity is below threshold:
- Review the fidelity report for specific metrics
- Adjust configuration to better match fingerprint statistics
- Consider using the evaluation framework’s auto-tuning recommendations
See Also
Server API
SyntheticData provides a server component with REST, gRPC, and WebSocket APIs for application integration.
Starting the Server
cargo run -p datasynth-server -- --port 3000 --worker-threads 4
Options:
| Option | Default | Description |
|---|---|---|
| --port | 3000 | HTTP/WebSocket port |
| --grpc-port | 50051 | gRPC port |
| --worker-threads | CPU cores | Worker thread count |
| --api-key | None | Required API key |
| --rate-limit | 100 | Max requests per minute |
Authentication
When --api-key is set, include it in requests:
curl -H "X-API-Key: your-api-key" http://localhost:3000/api/config
REST API
Configuration Endpoints
GET /api/config
Retrieve current configuration.
curl http://localhost:3000/api/config
Response:
{
"global": {
"seed": 42,
"industry": "manufacturing",
"start_date": "2024-01-01",
"period_months": 12
},
"transactions": {
"target_count": 100000
}
}
POST /api/config
Update configuration.
curl -X POST http://localhost:3000/api/config \
-H "Content-Type: application/json" \
-d '{"transactions": {"target_count": 50000}}'
POST /api/config/validate
Validate configuration without applying.
curl -X POST http://localhost:3000/api/config/validate \
-H "Content-Type: application/json" \
-d @config.json
Stream Control Endpoints
POST /api/stream/start
Start data generation.
curl -X POST http://localhost:3000/api/stream/start
Response:
{
"status": "started",
"stream_id": "abc123"
}
POST /api/stream/stop
Stop current generation.
curl -X POST http://localhost:3000/api/stream/stop
POST /api/stream/pause
Pause generation.
curl -X POST http://localhost:3000/api/stream/pause
POST /api/stream/resume
Resume paused generation.
curl -X POST http://localhost:3000/api/stream/resume
Pattern Trigger Endpoints
POST /api/stream/trigger/<pattern>
Trigger special event patterns.
Available patterns:
- month_end - Month-end close activities
- quarter_end - Quarter-end activities
- year_end - Year-end close activities
curl -X POST http://localhost:3000/api/stream/trigger/month_end
Health Check
GET /health
curl http://localhost:3000/health
Response:
{
"status": "healthy",
"uptime_seconds": 3600
}
WebSocket API
Connect to receive real-time events during generation.
Connection
const ws = new WebSocket('ws://localhost:3000/ws/events');
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log(data);
};
Event Types
Progress Event:
{
"type": "progress",
"current": 50000,
"total": 100000,
"percent": 50.0,
"rate": 85000.5
}
Entry Event:
{
"type": "entry",
"data": {
"document_id": "abc-123",
"posting_date": "2024-03-15",
"account": "1100",
"debit": "1000.00",
"credit": "0.00"
}
}
Error Event:
{
"type": "error",
"message": "Memory limit exceeded"
}
Complete Event:
{
"type": "complete",
"total_entries": 100000,
"duration_ms": 1200
}
gRPC API
Proto Definition
syntax = "proto3";
package synth;
service SynthService {
rpc GetConfig(Empty) returns (Config);
rpc SetConfig(Config) returns (Status);
rpc StartGeneration(GenerationRequest) returns (stream Entry);
rpc StopGeneration(Empty) returns (Status);
}
message Config {
string yaml = 1;
}
message GenerationRequest {
optional int64 count = 1;
}
message Entry {
string document_id = 1;
string posting_date = 2;
string company_code = 3;
repeated LineItem lines = 4;
}
message LineItem {
string account = 1;
string debit = 2;
string credit = 3;
}
Client Example (Rust)
use synth::synth_client::SynthClient;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let mut client = SynthClient::connect("http://localhost:50051").await?;
let request = tonic::Request::new(GenerationRequest { count: Some(1000) });
let mut stream = client.start_generation(request).await?.into_inner();
while let Some(entry) = stream.message().await? {
println!("Entry: {:?}", entry.document_id);
}
Ok(())
}
Rate Limiting
The server implements sliding window rate limiting:
| Metric | Default |
|---|---|
| Window | 1 minute |
| Max requests | 100 |
Exceeding the limit returns 429 Too Many Requests:
{
"error": "rate_limit_exceeded",
"retry_after": 30
}
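Clients can honor the retry_after hint with a simple backoff loop. A sketch using the requests library follows; the retry policy itself is illustrative.
# Sketch: retry on 429 responses, honoring the retry_after hint.
import time
import requests

def post_with_retry(url, max_attempts=5, **kwargs):
    for attempt in range(max_attempts):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        wait_seconds = response.json().get("retry_after", 2 ** attempt)
        time.sleep(wait_seconds)
    raise RuntimeError("rate limit retries exhausted")

response = post_with_retry("http://localhost:3000/api/stream/start")
print(response.status_code)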
Memory Management
The server enforces memory limits:
# Set memory limit (bytes)
cargo run -p datasynth-server -- --memory-limit 1073741824 # 1GB
When memory is low:
- Generation pauses automatically
- WebSocket sends warning event
- New requests may be rejected
Error Responses
| HTTP Status | Meaning |
|---|---|
| 400 | Invalid request/configuration |
| 401 | Missing or invalid API key |
| 429 | Rate limit exceeded |
| 500 | Internal server error |
| 503 | Server overloaded |
Error Response Format:
{
"error": "error_code",
"message": "Human readable description",
"details": {}
}
Integration Examples
Python Client
import requests
import websocket
import json
BASE_URL = "http://localhost:3000"
# Set configuration
config = {
"transactions": {"target_count": 10000}
}
requests.post(f"{BASE_URL}/api/config", json=config)
# Start generation
requests.post(f"{BASE_URL}/api/stream/start")
# Monitor via WebSocket
ws = websocket.create_connection(f"ws://localhost:3000/ws/events")
while True:
event = json.loads(ws.recv())
if event["type"] == "complete":
break
print(f"Progress: {event.get('percent', 0)}%")
Node.js Client
const axios = require('axios');
const WebSocket = require('ws');
const BASE_URL = 'http://localhost:3000';
async function generate() {
// Configure
await axios.post(`${BASE_URL}/api/config`, {
transactions: { target_count: 10000 }
});
// Start
await axios.post(`${BASE_URL}/api/stream/start`);
// Monitor
const ws = new WebSocket('ws://localhost:3000/ws/events');
ws.on('message', (data) => {
const event = JSON.parse(data);
console.log(event);
});
}
See Also
Desktop UI
SyntheticData includes a cross-platform desktop application built with Tauri and SvelteKit.
Overview
The desktop UI provides:
- Visual configuration editing
- Industry preset selection
- Real-time generation monitoring
- Configuration validation feedback
Installation
Prerequisites
| Requirement | Version |
|---|---|
| Node.js | 18+ |
| npm | 9+ |
| Rust | 1.88+ |
| Platform dependencies | See below |
Platform Dependencies
Linux (Ubuntu/Debian):
sudo apt-get install libwebkit2gtk-4.1-dev \
libgtk-3-dev \
libayatana-appindicator3-dev \
librsvg2-dev
macOS: No additional dependencies required.
Windows: WebView2 runtime (usually pre-installed on Windows 10/11).
Running in Development
cd crates/datasynth-ui
npm install
npm run tauri dev
Building for Production
cd crates/datasynth-ui
npm run tauri build
Build outputs are in crates/datasynth-ui/src-tauri/target/release/bundle/.
Application Layout
Dashboard
The main dashboard provides:
- Quick stats overview
- Recent generation history
- System status
Configuration Editor
Access via the sidebar. Configuration is organized into sections:
| Section | Contents |
|---|---|
| Global | Industry, dates, seed, performance |
| Companies | Company definitions and weights |
| Transactions | Target count, line items, amounts |
| Master Data | Vendors, customers, materials |
| Document Flows | P2P, O2C configuration |
| Financial | Balance, subledger, FX, period close |
| Compliance | Fraud, controls, approval |
| Analytics | Graph export, anomaly, data quality |
| Output | Formats, compression |
Configuration Sections
Global Settings
- Industry: Select from presets (manufacturing, retail, etc.)
- Start Date: Beginning of simulation period
- Period Months: Duration (1-120 months)
- Group Currency: Base currency for consolidation
- Random Seed: For reproducible generation
Chart of Accounts
- Complexity: Small (~100), Medium (~400), Large (~2500) accounts
- Structure: Industry-specific account hierarchies
Transactions
- Target Count: Number of journal entries to generate
- Line Item Distribution: Configure line count probabilities
- Amount Distribution: Log-normal parameters, round number bias
Master Data
Configure generation parameters for:
- Vendors (count, payment terms, intercompany flags)
- Customers (count, credit terms, payment behavior)
- Materials (count, valuation methods)
- Fixed Assets (count, depreciation methods)
- Employees (count, hierarchy depth)
Document Flows
- P2P (Procure-to-Pay): PO → GR → Invoice → Payment rates
- O2C (Order-to-Cash): SO → Delivery → Invoice → Receipt rates
- Three-Way Match: Tolerance settings
Financial Settings
- Balance: Opening balance configuration
- Subledger: AR, AP, FA, Inventory settings
- FX: Currency pairs, rate volatility
- Period Close: Accrual, depreciation, closing settings
Compliance
- Fraud: Enable/disable, fraud rate, fraud types
- Controls: Internal control definitions
- Approval: Threshold configuration, SoD rules
Analytics
- Graph Export: Format selection (PyTorch Geometric, Neo4j, DGL)
- Anomaly Injection: Rate, types, labeling
- Data Quality: Missing values, format variations, duplicates
Output Settings
- Format: CSV or JSON
- Compression: None, gzip, or zstd
- File Organization: Directory structure options
Preset Selector
Quickly load industry presets:
- Click “Load Preset” in the header
- Select industry
- Choose complexity level
- Click “Apply”
Real-time Streaming
During generation, view:
- Progress bar with percentage
- Entries per second
- Memory usage
- Recent entries table
Access streaming view via “Generate” → “Stream”.
Validation
The UI validates configuration in real-time:
- Required fields are highlighted
- Invalid values show error messages
- Distribution weights are checked
- Constraints are enforced
Keyboard Shortcuts
| Shortcut | Action |
|---|---|
| Ctrl/Cmd + S | Save configuration |
| Ctrl/Cmd + G | Start generation |
| Ctrl/Cmd + , | Open settings |
| Escape | Close modal |
Configuration Files
The UI stores configurations in:
| Platform | Location |
|---|---|
| Linux | ~/.config/datasynth-data/ |
| macOS | ~/Library/Application Support/datasynth-data/ |
| Windows | %APPDATA%\datasynth-data\ |
Exporting Configuration
To use your configuration with the CLI:
- Configure in the UI
- Click “Export” → “Export YAML”
- Save the .yaml file
- Use with the CLI:
datasynth-data generate --config exported.yaml --output ./output
Development
Project Structure
crates/datasynth-ui/
├── src/ # SvelteKit frontend
│ ├── routes/ # Page routes
│ │ ├── +page.svelte # Dashboard
│ │ ├── generate/ # Generation views
│ │ └── config/ # Configuration pages
│ └── lib/
│ ├── stores/ # State management
│ └── components/ # Reusable components
├── src-tauri/ # Rust backend
│ └── src/
│ └── main.rs # Tauri commands
├── package.json
└── tauri.conf.json
Adding a Configuration Page
- Create a route at src/routes/config/<section>/+page.svelte
- Add form components
- Connect to config store
- Add navigation link
Debugging
# Enable Tauri dev tools
npm run tauri dev
# View browser console (Ctrl/Cmd + Shift + I in dev mode)
Troubleshooting
UI Doesn’t Start
# Check Node dependencies
npm install
# Rebuild native modules
npm run tauri clean
npm run tauri build
Configuration Not Saving
Check file permissions in the config directory.
WebSocket Connection Failed
Ensure the server is running if using streaming features:
cargo run -p datasynth-server -- --port 3000
See Also
Output Formats
SyntheticData generates multiple file types organized into categories.
Output Directory Structure
output/
├── master_data/ # Entity master records
├── transactions/ # Journal entries and documents
├── subledgers/ # Subsidiary ledger records
├── period_close/ # Trial balances and closing
├── consolidation/ # Elimination entries
├── fx/ # Exchange rates
├── banking/ # KYC profiles and bank transactions
├── process_mining/ # OCEL 2.0 event logs
├── audit/ # Audit engagements and workpapers
├── graphs/ # ML-ready graph exports
├── labels/ # Anomaly, fraud, and quality labels
└── controls/ # Internal control mappings
File Formats
CSV
Default format with standard conventions:
- UTF-8 encoding
- Comma-separated values
- Header row included
- Quoted strings when needed
- Decimal values serialized as strings (prevents floating-point artifacts)
Example (journal_entries.csv):
document_id,posting_date,company_code,account,description,debit,credit,is_fraud
abc-123,2024-01-15,1000,1100,Customer payment,"1000.00","0.00",false
abc-123,2024-01-15,1000,1200,Cash receipt,"0.00","1000.00",false
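Because amounts arrive as strings, parse them with Decimal rather than float when post-processing. A minimal sketch (path follows the layout above):
# Sketch: read string-encoded amounts into Decimal to avoid float rounding.
import csv
from decimal import Decimal

total_debits = Decimal("0")
with open("output/transactions/journal_entries.csv", newline="") as f:
    for row in csv.DictReader(f):
        total_debits += Decimal(row["debit"])

print(f"total debits: {total_debits}")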
JSON
Structured format with nested objects:
Example (journal_entries.json):
[
{
"header": {
"document_id": "abc-123",
"posting_date": "2024-01-15",
"company_code": "1000",
"source": "Manual",
"is_fraud": false
},
"lines": [
{
"account": "1100",
"description": "Customer payment",
"debit": "1000.00",
"credit": "0.00"
},
{
"account": "1200",
"description": "Cash receipt",
"debit": "0.00",
"credit": "1000.00"
}
]
}
]
ACDOCA (SAP HANA)
SAP Universal Journal format with simulation fields:
| Field | Description |
|---|---|
| RCLNT | Client |
| RLDNR | Ledger |
| RBUKRS | Company code |
| GJAHR | Fiscal year |
| BELNR | Document number |
| DOCLN | Line item |
| RYEAR | Year |
| POPER | Posting period |
| RACCT | Account |
| DRCRK | Debit/Credit indicator |
| HSL | Amount in local currency |
| ZSIM_* | Simulation metadata fields |
Master Data Files
chart_of_accounts.csv
| Field | Description |
|---|---|
| account_number | GL account code |
| account_name | Display name |
| account_type | Asset, Liability, Equity, Revenue, Expense |
| account_subtype | Detailed classification |
| is_control_account | Links to subledger |
| normal_balance | Debit or Credit |
vendors.csv
| Field | Description |
|---|---|
| vendor_id | Unique identifier |
| vendor_name | Company name |
| tax_id | Tax identification |
| payment_terms | Standard terms |
| currency | Transaction currency |
| is_intercompany | IC flag |
customers.csv
| Field | Description |
|---|---|
| customer_id | Unique identifier |
| customer_name | Company/person name |
| credit_limit | Maximum credit |
| credit_rating | Rating code |
| payment_behavior | Typical payment pattern |
materials.csv
| Field | Description |
|---|---|
| material_id | Unique identifier |
| description | Material name |
| material_type | Classification |
| valuation_method | FIFO, LIFO, Avg |
| standard_cost | Unit cost |
employees.csv
| Field | Description |
|---|---|
| employee_id | Unique identifier |
| name | Full name |
| department | Department code |
| manager_id | Hierarchy link |
| approval_limit | Maximum approval amount |
| transaction_codes | Authorized T-codes |
Transaction Files
journal_entries.csv
| Field | Description |
|---|---|
| document_id | Entry identifier |
| company_code | Company |
| fiscal_year | Year |
| fiscal_period | Period |
| posting_date | Date posted |
| document_date | Original date |
| source | Transaction source |
| business_process | Process category |
| is_fraud | Fraud indicator |
| is_anomaly | Anomaly indicator |
Line Items (embedded or separate)
| Field | Description |
|---|---|
| line_number | Sequence |
| account_number | GL account |
| cost_center | Cost center |
| profit_center | Profit center |
| debit_amount | Debit |
| credit_amount | Credit |
| description | Line description |
Document Flow Files
purchase_orders.csv:
- Order header with vendor, dates, status
- Line items with materials, quantities, prices
goods_receipts.csv:
- Receipt linked to PO
- Quantities received, variances
vendor_invoices.csv:
- Invoice with three-way match status
- Payment terms, due date
payments.csv:
- Payment documents
- Bank references, cleared invoices
document_references.csv:
- Links between documents (FollowOn, Payment, Reversal)
- Ensures complete document chains
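As a quick integrity check, the reference file can be joined back to the documents it links. A minimal pandas sketch; the column names source_document_id, target_document_id, and payment_id are hypothetical placeholders for the actual CSV headers in your output:
import pandas as pd

# Hypothetical column names -- substitute the real headers from your output.
refs = pd.read_csv("output/document_references.csv")
payments = pd.read_csv("output/payments.csv")

linked = set(refs["source_document_id"]) | set(refs["target_document_id"])
orphans = payments[~payments["payment_id"].isin(linked)]
print(f"{len(orphans)} payment documents are not part of any document chain")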
Subledger Files
ar_open_items.csv
| Field | Description |
|---|---|
| customer_id | Customer reference |
| invoice_number | Document number |
| invoice_date | Date issued |
| due_date | Payment due |
| original_amount | Invoice total |
| open_amount | Remaining balance |
| aging_bucket | 0-30, 31-60, 61-90, 90+ |
ap_open_items.csv
Similar structure for payables.
fa_register.csv
| Field | Description |
|---|---|
| asset_id | Asset identifier |
| description | Asset name |
| acquisition_date | Purchase date |
| acquisition_cost | Original cost |
| useful_life_years | Depreciation period |
| depreciation_method | Straight-line, etc. |
| accumulated_depreciation | Total depreciation |
| net_book_value | Current value |
inventory_positions.csv
| Field | Description |
|---|---|
| material_id | Material reference |
| warehouse | Location |
| quantity | Units on hand |
| unit_cost | Current cost |
| total_value | Extended value |
Period Close Files
trial_balances/YYYY_MM.csv
| Field | Description |
|---|---|
| account_number | GL account |
| account_name | Description |
| opening_balance | Period start |
| period_debits | Total debits |
| period_credits | Total credits |
| closing_balance | Period end |
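Per account, the closing balance should equal the opening balance plus period debits minus period credits. A minimal pandas sketch of that check, assuming a signed-balance convention and using 2024_01.csv as an illustrative file name:
import pandas as pd

tb = pd.read_csv("output/trial_balances/2024_01.csv")
# Assumed sign convention: debits increase and credits decrease the signed balance.
expected = tb["opening_balance"] + tb["period_debits"] - tb["period_credits"]
mismatches = tb[(expected - tb["closing_balance"]).abs() > 0.01]
print(f"{len(mismatches)} accounts fail the closing-balance identity")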
accruals.csv
Accrual entries with reversal dates.
depreciation.csv
Monthly depreciation entries per asset.
Banking Files
banking_customers.csv
| Field | Description |
|---|---|
| customer_id | Unique identifier |
| customer_type | retail, business, trust |
| name | Customer name |
| created_at | Account creation date |
| risk_score | Calculated risk score (0-100) |
| kyc_status | verified, pending, enhanced_due_diligence |
| pep_flag | Politically exposed person |
| sanctions_flag | Sanctions list match |
bank_accounts.csv
| Field | Description |
|---|---|
| account_id | Unique identifier |
| customer_id | Owner reference |
| account_type | checking, savings, money_market |
| currency | Account currency |
| opened_date | Opening date |
| balance | Current balance |
| status | active, dormant, closed |
bank_transactions.csv
| Field | Description |
|---|---|
| transaction_id | Unique identifier |
| account_id | Account reference |
| timestamp | Transaction time |
| amount | Transaction amount |
| currency | Transaction currency |
| direction | credit, debit |
| channel | branch, atm, online, wire, ach |
| category | Transaction category |
| counterparty_id | Counterparty reference |
kyc_profiles.csv
| Field | Description |
|---|---|
| customer_id | Customer reference |
| declared_turnover | Expected monthly volume |
| transaction_frequency | Expected transactions/month |
| source_of_funds | Declared income source |
| geographic_exposure | List of countries |
| cash_intensity | Expected cash ratio |
| beneficial_owner_complexity | Ownership layers |
aml_typology_labels.csv
| Field | Description |
|---|---|
| transaction_id | Transaction reference |
| typology | structuring, funnel, layering, mule, fraud |
| confidence | Confidence score (0-1) |
| pattern_id | Related pattern identifier |
| related_transactions | Comma-separated related IDs |
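Because the label file is keyed by transaction_id, it joins directly onto bank_transactions.csv. A minimal pandas sketch that builds a labeled AML frame (paths are illustrative):
import pandas as pd

txns = pd.read_csv("output/bank_transactions.csv")
labels = pd.read_csv("output/aml_typology_labels.csv")

# Left join: transactions without a typology label form the negative class.
frame = txns.merge(labels[["transaction_id", "typology"]], on="transaction_id", how="left")
frame["typology"] = frame["typology"].fillna("none")
print(frame["typology"].value_counts())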
entity_risk_labels.csv
| Field | Description |
|---|---|
| entity_id | Customer or account ID |
| entity_type | customer, account |
| risk_category | high, medium, low |
| risk_factors | Contributing factors |
| label_date | Label timestamp |
Process Mining Files (OCEL 2.0)
event_log.json
OCEL 2.0 format event log:
{
"ocel:global-log": {
"ocel:version": "2.0",
"ocel:ordering": "timestamp"
},
"ocel:events": {
"e1": {
"ocel:activity": "Create Purchase Order",
"ocel:timestamp": "2024-01-15T10:30:00Z",
"ocel:typedOmap": [
{"ocel:oid": "PO-001", "ocel:qualifier": "order"}
]
}
},
"ocel:objects": {
"PO-001": {
"ocel:type": "PurchaseOrder",
"ocel:attributes": {
"vendor": "VEND-001",
"amount": "10000.00"
}
}
}
}
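The event log is plain JSON, so it can be inspected without a process mining toolkit. A minimal sketch that counts events per activity, following the structure shown above:
import json
from collections import Counter

with open("output/event_log.json") as f:
    log = json.load(f)

# Events are keyed by event ID; each carries an ocel:activity field.
activities = Counter(e["ocel:activity"] for e in log["ocel:events"].values())
for activity, count in activities.most_common():
    print(activity, count)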
objects.json
Object instances with types and attributes.
events.json
Event records with object relationships.
process_variants.csv
| Field | Description |
|---|---|
| variant_id | Unique identifier |
| activity_sequence | Ordered activity list |
| frequency | Occurrence count |
| avg_duration | Average case duration |
Audit Files
audit_engagements.csv
| Field | Description |
|---|---|
| engagement_id | Unique identifier |
| client_name | Client entity |
| engagement_type | Financial, Compliance, Operational |
| fiscal_year | Audit period |
| materiality | Materiality threshold |
| status | Planning, Fieldwork, Completion |
audit_workpapers.csv
| Field | Description |
|---|---|
| workpaper_id | Unique identifier |
| engagement_id | Engagement reference |
| workpaper_type | Lead schedule, Substantive, etc. |
| prepared_by | Preparer ID |
| reviewed_by | Reviewer ID |
| status | Draft, Reviewed, Final |
audit_evidence.csv
| Field | Description |
|---|---|
| evidence_id | Unique identifier |
| workpaper_id | Workpaper reference |
| evidence_type | Document, Inquiry, Observation, etc. |
| source | Evidence source |
| reliability | High, Medium, Low |
| sufficiency | Sufficient, Insufficient |
audit_risks.csv
| Field | Description |
|---|---|
| risk_id | Unique identifier |
| engagement_id | Engagement reference |
| risk_description | Risk narrative |
| risk_level | High, Significant, Low |
| likelihood | Probable, Possible, Remote |
| response | Response strategy |
audit_findings.csv
| Field | Description |
|---|---|
| finding_id | Unique identifier |
| engagement_id | Engagement reference |
| finding_type | Deficiency, Significant, Material Weakness |
| description | Finding narrative |
| recommendation | Recommended action |
| management_response | Response text |
audit_judgments.csv
| Field | Description |
|---|---|
| judgment_id | Unique identifier |
| workpaper_id | Workpaper reference |
| judgment_area | Revenue recognition, Estimates, etc. |
| alternatives_considered | Options evaluated |
| conclusion | Selected approach |
| rationale | Reasoning documentation |
Graph Export Files
PyTorch Geometric
graphs/transaction_network/pytorch_geometric/
├── node_features.pt # [num_nodes, features]
├── edge_index.pt # [2, num_edges]
├── edge_attr.pt # [num_edges, edge_features]
├── labels.pt # Node/edge labels
├── train_mask.pt # Training split
├── val_mask.pt # Validation split
└── test_mask.pt # Test split
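A minimal loading sketch, assuming each .pt file was written with torch.save as a plain tensor (the directory layout follows the tree above):
import torch
from torch_geometric.data import Data

base = "graphs/transaction_network/pytorch_geometric"
data = Data(
    x=torch.load(f"{base}/node_features.pt"),       # [num_nodes, features]
    edge_index=torch.load(f"{base}/edge_index.pt"),  # [2, num_edges]
    edge_attr=torch.load(f"{base}/edge_attr.pt"),    # [num_edges, edge_features]
    y=torch.load(f"{base}/labels.pt"),               # node/edge labels
)
data.train_mask = torch.load(f"{base}/train_mask.pt")
data.val_mask = torch.load(f"{base}/val_mask.pt")
data.test_mask = torch.load(f"{base}/test_mask.pt")
print(data)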
Neo4j
graphs/entity_relationship/neo4j/
├── nodes_account.csv
├── nodes_entity.csv
├── nodes_user.csv
├── edges_transaction.csv
├── edges_approval.csv
└── import.cypher # Import script
DGL (Deep Graph Library)
graphs/transaction_network/dgl/
├── graph.bin # DGL binary format
├── node_features.npy # NumPy arrays
└── edge_features.npy
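A minimal loading sketch for the DGL export, assuming graph.bin was saved with dgl.save_graphs and the .npy arrays line up with the graph's node and edge ordering:
import dgl
import numpy as np
import torch

base = "graphs/transaction_network/dgl"
graphs, _ = dgl.load_graphs(f"{base}/graph.bin")  # (list of graphs, label dict)
g = graphs[0]
g.ndata["feat"] = torch.from_numpy(np.load(f"{base}/node_features.npy"))
g.edata["feat"] = torch.from_numpy(np.load(f"{base}/edge_features.npy"))
print(g)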
Label Files
anomaly_labels.csv
| Field | Description |
|---|---|
| document_id | Entry reference |
| anomaly_id | Unique anomaly ID |
| anomaly_type | Classification |
| anomaly_category | Fraud, Error, Process, Statistical, Relational |
| severity | Low, Medium, High |
| description | Human-readable explanation |
fraud_labels.csv
| Field | Description |
|---|---|
| document_id | Entry reference |
| fraud_type | Specific fraud pattern (20+ types) |
| detection_difficulty | Easy, Medium, Hard |
| description | Fraud scenario description |
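fraud_labels.csv joins onto journal_entries.csv via document_id, which makes building a supervised training frame straightforward. A minimal pandas sketch (paths are illustrative):
import pandas as pd

entries = pd.read_csv("output/journal_entries.csv")
fraud = pd.read_csv("output/fraud_labels.csv")

# Left join: entries with no fraud label are the negative class.
labeled = entries.merge(
    fraud[["document_id", "fraud_type", "detection_difficulty"]],
    on="document_id",
    how="left",
)
print(labeled["fraud_type"].value_counts(dropna=False).head())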
quality_labels.csv
| Field | Description |
|---|---|
| record_id | Record reference |
| field_name | Affected field |
| issue_type | MissingValue, Typo, FormatVariation, Duplicate |
| issue_subtype | Detailed classification |
| original_value | Value before modification |
| modified_value | Value after modification |
| severity | Severity level (1-5) |
Control Files
internal_controls.csv
| Field | Description |
|---|---|
| control_id | Unique identifier |
| control_name | Description |
| control_type | Preventive, Detective |
| frequency | Continuous, Daily, etc. |
| assertions | Completeness, Accuracy, etc. |
control_account_mappings.csv
| Field | Description |
|---|---|
| control_id | Control reference |
| account_number | GL account |
| threshold | Monetary threshold |
sod_rules.csv
Segregation of duties conflict definitions.
sod_conflict_pairs.csv
Actual SoD violations detected in generated data.
Parquet Format
Apache Parquet columnar format for large analytical datasets:
output:
format: parquet
compression: snappy # snappy, gzip, zstd
Benefits:
- Columnar storage — efficient for queries touching few columns
- Built-in compression — typically 5-10x smaller than CSV
- Schema embedding — self-describing files with full type information
- Predicate pushdown — query engines skip irrelevant row groups
Use with: Apache Spark, DuckDB, Polars, pandas, BigQuery, Snowflake, Databricks.
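A minimal reading sketch, assuming the journal entries were exported as output/journal_entries.parquet and that the column names mirror the CSV schema shown earlier:
import duckdb
import pandas as pd

# pandas loads the full table into memory.
df = pd.read_parquet("output/journal_entries.parquet")

# DuckDB scans only the row groups and columns the query needs.
con = duckdb.connect()
top = con.execute(
    "SELECT account, SUM(CAST(debit AS DOUBLE)) AS total_debit "
    "FROM 'output/journal_entries.parquet' "
    "GROUP BY account ORDER BY total_debit DESC LIMIT 10"
).df()
print(top)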
ERP-Specific Formats
SyntheticData can export in native ERP table schemas:
| Format | Target ERP | Tables |
|---|---|---|
sap | SAP S/4HANA | BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA, CSKS, CEPC |
oracle | Oracle EBS | GL_JE_HEADERS, GL_JE_LINES, GL_JE_BATCHES |
netsuite | NetSuite | Journal entries with subsidiary, multi-book, custom fields |
See ERP Output Formats for field mappings and configuration.
Compression Options
| Option | Extension | Use Case |
|---|---|---|
| none | .csv/.json | Development, small datasets |
| gzip | .csv.gz | General compression |
| zstd | .csv.zst | High performance |
| snappy | .parquet | Parquet default (fast) |
Configuration
output:
format: csv # csv, json, jsonl, parquet, sap, oracle, netsuite
compression: none # none, gzip, zstd (CSV/JSON) or snappy/gzip/zstd (Parquet)
compression_level: 6 # 1-9 (if compression enabled)
streaming: false # Enable streaming mode for large outputs
See Also
- ERP Output Formats — SAP, Oracle, NetSuite table exports
- Streaming Output — Real-time streaming sinks
- Configuration — Output settings reference
- Graph Export
- Anomaly Injection
- AML/KYC Testing
- Process Mining
ERP Output Formats
SyntheticData can export data in native ERP table formats, enabling direct load testing and integration validation against SAP S/4HANA, Oracle EBS, and NetSuite environments.
Overview
The datasynth-output crate provides three ERP-specific exporters alongside the standard CSV/JSON/Parquet sinks. Each exporter transforms the internal data model into the target ERP’s table schema with correct field names, data types, and referential integrity.
| ERP System | Exporter | Tables |
|---|---|---|
| SAP S/4HANA | SapExporter | BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA, CSKS, CEPC |
| Oracle EBS | OracleExporter | GL_JE_HEADERS, GL_JE_LINES, GL_JE_BATCHES |
| NetSuite | NetSuiteExporter | Journal entries with subsidiary, multi-book, custom fields |
SAP S/4HANA
Supported Tables
| Table | Description | Source Data |
|---|---|---|
| BKPF | Document Header | Journal entry headers |
| BSEG | Document Line Items | Journal entry line items |
| ACDOCA | Universal Journal | Full ACDOCA event records |
| LFA1 | Vendor Master | Vendor records |
| KNA1 | Customer Master | Customer records |
| MARA | Material Master | Material records |
| CSKS | Cost Center Master | Cost center assignments |
| CEPC | Profit Center Master | Profit center assignments |
BKPF Fields (Document Header)
| SAP Field | Description | Example |
|---|---|---|
MANDT | Client | 100 |
BUKRS | Company Code | 1000 |
BELNR | Document Number | 0100000001 |
GJAHR | Fiscal Year | 2024 |
BLART | Document Type | SA (G/L posting) |
BLDAT | Document Date | 2024-01-15 |
BUDAT | Posting Date | 2024-01-15 |
MONAT | Fiscal Period | 1 |
CPUDT | Entry Date | 2024-01-15 |
CPUTM | Entry Time | 10:30:00 |
USNAM | User Name | JSMITH |
BSEG Fields (Line Items)
| SAP Field | Description | Example |
|---|---|---|
MANDT | Client | 100 |
BUKRS | Company Code | 1000 |
BELNR | Document Number | 0100000001 |
GJAHR | Fiscal Year | 2024 |
BUZEI | Line Item | 001 |
BSCHL | Posting Key | 40 (debit) / 50 (credit) |
HKONT | GL Account | 1100 |
DMBTR | Amount in Local Currency | 1000.00 |
WRBTR | Amount in Doc Currency | 1000.00 |
KOSTL | Cost Center | CC100 |
PRCTR | Profit Center | PC100 |
ACDOCA Fields (Universal Journal)
The ACDOCA format includes all standard SAP Universal Journal fields plus simulation metadata:
| Field | Description |
|---|---|
RCLNT | Client |
RLDNR | Ledger |
RBUKRS | Company Code |
GJAHR | Fiscal Year |
BELNR | Document Number |
DOCLN | Line Item |
POPER | Posting Period |
RACCT | Account |
DRCRK | Debit/Credit Indicator |
HSL | Amount in Local Currency |
ZSIM_* | Simulation metadata fields |
Configuration
output:
format: sap
sap:
tables:
- bkpf
- bseg
- acdoca
- lfa1
- kna1
- mara
client: "100"
ledger: "0L"
Oracle EBS
Supported Tables
| Table | Description | Source Data |
|---|---|---|
| GL_JE_HEADERS | Journal Entry Headers | Journal entry headers |
| GL_JE_LINES | Journal Entry Lines | Journal entry line items |
| GL_JE_BATCHES | Journal Entry Batches | Batch groupings |
GL_JE_HEADERS Fields
| Oracle Field | Description | Example |
|---|---|---|
JE_HEADER_ID | Unique Header ID | 10001 |
LEDGER_ID | Ledger ID | 1 |
JE_BATCH_ID | Batch ID | 5001 |
PERIOD_NAME | Period Name | JAN-24 |
NAME | Journal Name | Manual Entry 001 |
JE_CATEGORY | Category | MANUAL, ADJUSTMENT, PAYABLES |
JE_SOURCE | Source | MANUAL, PAYABLES, RECEIVABLES |
CURRENCY_CODE | Currency | USD |
ACTUAL_FLAG | Type | A (Actual), B (Budget), E (Encumbrance) |
STATUS | Status | P (Posted), U (Unposted) |
DEFAULT_EFFECTIVE_DATE | Effective Date | 2024-01-15 |
RUNNING_TOTAL_DR | Total Debits | 10000.00 |
RUNNING_TOTAL_CR | Total Credits | 10000.00 |
PARENT_JE_HEADER_ID | Parent (for reversals) | null |
ACCRUAL_REV_FLAG | Reversal Flag | Y / N |
GL_JE_LINES Fields
| Oracle Field | Description | Example |
|---|---|---|
JE_HEADER_ID | Header Reference | 10001 |
JE_LINE_NUM | Line Number | 1 |
CODE_COMBINATION_ID | Account Combo ID | 10110 |
ENTERED_DR | Entered Debit | 1000.00 |
ENTERED_CR | Entered Credit | 0.00 |
ACCOUNTED_DR | Accounted Debit | 1000.00 |
ACCOUNTED_CR | Accounted Credit | 0.00 |
DESCRIPTION | Line Description | Customer payment |
EFFECTIVE_DATE | Effective Date | 2024-01-15 |
Configuration
output:
format: oracle
oracle:
ledger_id: 1
set_of_books_id: 1
NetSuite
Journal Entry Fields
NetSuite export includes support for subsidiaries, multi-book accounting, and custom fields:
| NetSuite Field | Description | Example |
|---|---|---|
INTERNAL_ID | Internal ID | 50001 |
EXTERNAL_ID | External ID (for import) | DS-JE-001 |
TRAN_ID | Transaction Number | JE00001 |
TRAN_DATE | Transaction Date | 2024-01-15 |
POSTING_PERIOD | Period ID | Jan 2024 |
SUBSIDIARY | Subsidiary ID | 1 |
CURRENCY | Currency Code | USD |
EXCHANGE_RATE | Exchange Rate | 1.000000 |
MEMO | Memo | Monthly accrual |
APPROVED | Approval Status | true |
REVERSAL_DATE | Reversal Date | 2024-02-01 |
DEPARTMENT | Department ID | 100 |
CLASS | Class ID | 1 |
LOCATION | Location ID | 1 |
TOTAL_DEBIT | Total Debits | 5000.00 |
TOTAL_CREDIT | Total Credits | 5000.00 |
NetSuite Line Fields
| Field | Description |
|---|---|
ACCOUNT | Account internal ID |
DEBIT | Debit amount |
CREDIT | Credit amount |
MEMO | Line memo |
DEPARTMENT | Department |
CLASS | Class segment |
LOCATION | Location segment |
ENTITY | Customer/Vendor reference |
CUSTOM_FIELDS | Additional custom field map |
Configuration
output:
format: netsuite
netsuite:
subsidiary_id: 1
include_custom_fields: true
Usage Examples
SAP Load Testing
Generate data for SAP S/4HANA load testing with full table coverage:
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 100000
output:
format: sap
sap:
tables: [bkpf, bseg, acdoca, lfa1, kna1, mara, csks, cepc]
client: "100"
Oracle EBS Migration Validation
Generate journal entries in Oracle EBS format for migration testing:
output:
format: oracle
oracle:
ledger_id: 1
NetSuite Integration Testing
Generate multi-subsidiary data with custom fields:
output:
format: netsuite
netsuite:
subsidiary_id: 1
include_custom_fields: true
Output Files
| Format | Output Files |
|---|---|
| SAP | sap_bkpf.csv, sap_bseg.csv, sap_acdoca.csv, sap_lfa1.csv, sap_kna1.csv, sap_mara.csv, sap_csks.csv, sap_cepc.csv |
| Oracle | oracle_gl_je_headers.csv, oracle_gl_je_lines.csv, oracle_gl_je_batches.csv |
| NetSuite | netsuite_journal_entries.csv, netsuite_journal_lines.csv |
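Because the exporters preserve referential integrity, every BSEG line should resolve to a BKPF header on the composite document key (BUKRS, BELNR, GJAHR). A minimal pandas check, assuming the CSV headers use the SAP field names listed above:
import pandas as pd

key = ["BUKRS", "BELNR", "GJAHR"]
bkpf = pd.read_csv("output/sap_bkpf.csv", dtype=str)
bseg = pd.read_csv("output/sap_bseg.csv", dtype=str)

# Every line item must reference an existing document header.
joined = bseg.merge(bkpf[key].drop_duplicates(), on=key, how="left", indicator=True)
orphans = joined[joined["_merge"] == "left_only"]
print(f"{len(orphans)} BSEG rows without a matching BKPF header")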
See Also
- Output Formats — Standard CSV/JSON/Parquet output
- Streaming Output — Real-time streaming sinks
- Output Settings — Configuration reference
- ERP Load Testing — ERP testing use case
- datasynth-output — Crate reference
Streaming Output
SyntheticData provides streaming output sinks for real-time data generation, enabling memory-efficient export of large datasets without loading everything into memory at once.
Overview
The streaming module in datasynth-output implements the StreamingSink trait for four output formats:
| Sink | Description | File Extension |
|---|---|---|
CsvStreamingSink | CSV with automatic headers | .csv |
JsonStreamingSink | Pretty-printed JSON arrays | .json |
NdjsonStreamingSink | Newline-delimited JSON | .jsonl / .ndjson |
ParquetStreamingSink | Apache Parquet columnar | .parquet |
All streaming sinks accept StreamEvent values through the process() method:
#![allow(unused)]
fn main() {
pub enum StreamEvent<T> {
Data(T), // A data record to write
Flush, // Force flush to disk
Close, // Close the stream
}
}
StreamingSink Trait
All streaming sinks implement:
#![allow(unused)]
fn main() {
pub trait StreamingSink<T: Serialize + Send> {
/// Process a single stream event (data, flush, or close).
fn process(&mut self, event: StreamEvent<T>) -> SynthResult<()>;
/// Close the stream and flush remaining data.
fn close(&mut self) -> SynthResult<()>;
/// Return the number of items written so far.
fn items_written(&self) -> u64;
/// Return the number of bytes written so far.
fn bytes_written(&self) -> u64;
}
}
When to Use Streaming vs Batch
| Scenario | Recommendation |
|---|---|
| < 100K records | Batch (CsvSink / JsonSink) — simpler API |
| 100K–10M records | Streaming — lower memory footprint |
| > 10M records | Streaming with Parquet — columnar compression |
| Real-time consumers | Streaming NDJSON — line-by-line parsing |
| REST/WebSocket API | Streaming — integrated with server endpoints |
CSV Streaming
#![allow(unused)]
fn main() {
use datasynth_output::streaming::CsvStreamingSink;
use datasynth_core::traits::StreamEvent;
let mut sink = CsvStreamingSink::<JournalEntry>::new("output.csv".into())?;
// Write records one at a time (memory efficient)
for entry in generate_entries() {
sink.process(StreamEvent::Data(entry))?;
}
// Periodic flush (optional — ensures data is on disk)
sink.process(StreamEvent::Flush)?;
// Close when done
sink.close()?;
println!("Wrote {} items ({} bytes)", sink.items_written(), sink.bytes_written());
}
Headers are written automatically on the first Data event.
JSON Streaming
Pretty-printed JSON Array
#![allow(unused)]
fn main() {
use datasynth_output::streaming::JsonStreamingSink;
let mut sink = JsonStreamingSink::<JournalEntry>::new("output.json".into())?;
for entry in entries {
sink.process(StreamEvent::Data(entry))?;
}
sink.close()?; // Writes closing bracket
}
Output:
[
{ "document_id": "abc-001", ... },
{ "document_id": "abc-002", ... }
]
Newline-Delimited JSON (NDJSON)
#![allow(unused)]
fn main() {
use datasynth_output::streaming::NdjsonStreamingSink;
let mut sink = NdjsonStreamingSink::<JournalEntry>::new("output.jsonl".into())?;
for entry in entries {
sink.process(StreamEvent::Data(entry))?;
}
sink.close()?;
}
Output:
{"document_id":"abc-001",...}
{"document_id":"abc-002",...}
NDJSON is ideal for streaming consumers that process records line by line (e.g., jq, Kafka, log aggregators).
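A minimal Python consumer that processes the NDJSON output one record at a time, never holding the whole file in memory:
import json

count = 0
with open("output.jsonl") as f:
    for line in f:
        record = json.loads(line)  # one self-contained JSON record per line
        count += 1
print(f"Processed {count} records")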
Parquet Streaming
Apache Parquet provides columnar compression, making it ideal for large analytical datasets:
#![allow(unused)]
fn main() {
use datasynth_output::streaming::ParquetStreamingSink;
let mut sink = ParquetStreamingSink::<JournalEntry>::new("output.parquet".into())?;
for entry in entries {
sink.process(StreamEvent::Data(entry))?;
}
sink.close()?;
}
Parquet benefits:
- Columnar storage: Efficient for analytical queries that touch few columns
- Built-in compression: Snappy, Gzip, or Zstd per column group
- Schema embedding: Self-describing files with full type information
- Predicate pushdown: Query engines can skip irrelevant row groups
Configuration
Streaming output can be enabled explicitly in configuration; it is also used automatically by the server API and when the runtime detects memory pressure:
output:
format: csv # csv, json, jsonl, parquet
streaming: true # Enable streaming mode
compression: none # none, gzip, zstd (CSV/JSON) or snappy/gzip/zstd (Parquet)
Server Streaming
The server API uses streaming sinks for the /api/stream/ endpoints:
# Start streaming generation
curl -X POST http://localhost:3000/api/stream/start \
-H "Content-Type: application/json" \
-d '{"config": {...}, "format": "ndjson"}'
# WebSocket streaming
wscat -c ws://localhost:3000/ws/events
Backpressure
Streaming sinks monitor write throughput and provide backpressure signals:
- items_written() / bytes_written(): Track progress for rate limiting
- Flush events: Force periodic disk writes to bound memory usage
- Disk space monitoring: The runtime's DiskGuard can pause generation when disk space runs low
Performance
| Format | Throughput | File Size | Use Case |
|---|---|---|---|
| CSV | ~150K rows/sec | Largest | Universal compatibility |
| NDJSON | ~120K rows/sec | Large | Streaming consumers |
| JSON | ~100K rows/sec | Large | Human-readable |
| Parquet | ~80K rows/sec | Smallest | Analytics, data lakes |
Throughput varies with record complexity and disk speed.
See Also
- Output Formats — Batch output format details
- ERP Output Formats — SAP/Oracle/NetSuite formats
- Output Settings — Configuration reference
- Server API — Streaming via REST/WebSocket
- datasynth-output — Crate reference
Python Wrapper Specification (In-Memory Configs)
This document specifies a Python wrapper that makes DataSynth usable out of the box without requiring persisted configuration files. The wrapper focuses on rich, structured configuration objects and reusable configuration blueprints so developers can generate data entirely in memory while still benefiting from the full DataSynth configuration model.
Goals
- Zero-file setup: Instantiate and run DataSynth without writing YAML/JSON to disk.
- Rich configuration: Offer a Pythonic API that maps cleanly to the full DataSynth configuration schema.
- Blueprints: Provide reusable, parameterized configuration templates for common scenarios.
- Interoperable: Allow optional export to YAML/JSON for debugging or CLI parity.
- Composable: Enable programmatic composition, overrides, and validation.
Non-goals
- Replacing the DataSynth CLI or server API.
- Hiding the underlying schema; the wrapper should expose all configuration knobs.
- Managing persistence beyond optional explicit export helpers.
Package layout
packages/
datasynth_py/
__init__.py
client.py # entrypoint wrapper
config/
__init__.py
models.py # typed config objects
blueprints.py # blueprint registry + builders
validation.py # schema validation helpers
runtime/
__init__.py
session.py # in-memory execution
Core API surface
DataSynth entrypoint
from datasynth_py import DataSynth
synth = DataSynth()
Responsibilities
- Provide a generate() method that accepts rich configuration objects.
- Provide blueprints registry access for common starting points.
- Manage execution in memory, including optional output sinks.
generate() signature
result = synth.generate(
config=Config(...),
output=OutputSpec(...),
seed=42,
)
Behavior
- Validates configuration objects.
- Converts configuration to DataSynth schema (internal model or JSON/YAML in-memory string).
- Executes the generator and returns result handles (paths, in-memory tables, or streams).
Optional output handling
OutputSpec can include:
- format (e.g., parquet, csv, jsonl)
- sink (memory, temp_dir, path)
- compression settings
When sink="memory", the wrapper returns in-memory table objects (pandas DataFrames by default).
Configuration model
Typed configuration objects
Provide typed dataclasses/Pydantic models mirroring the DataSynth YAML schema:
- GlobalSettings
- CompanySettings
- TransactionSettings
- MasterDataSettings
- ComplianceSettings
- OutputSettings
Example:
from datasynth_py.config import Config, GlobalSettings, CompanySettings
config = Config(
global_settings=GlobalSettings(
locale="en_US",
fiscal_year_start="2024-01-01",
periods=12,
),
companies=CompanySettings(count=5, industry="retail"),
)
Overrides and layering
Allow configuration layering to support incremental overrides:
config = base_config.override(
companies={"count": 10},
output={"format": "parquet"},
)
The wrapper merges overrides deeply, preserving nested settings.
Blueprints
Blueprints provide preconfigured setups with parameters. The wrapper ships with a registry:
from datasynth_py import blueprints
config = blueprints.retail_small(companies=3, transactions=5000)
Blueprint characteristics
- Parameterized: Each blueprint accepts keyword overrides for key metrics.
- Composable: A blueprint can extend or wrap another blueprint.
- Discoverable: Registry lists available blueprints and metadata.
blueprints.list()
# ["retail_small", "banking_medium", "saas_subscription", ...]
Execution model
The wrapper runs the Rust engine in-process via FFI or uses the DataSynth runtime API:
- In-memory config: converted to serialized config strings without writing to disk.
- Transient workspace: uses a temporary directory only if required by runtime internals.
- Deterministic runs: seed controls RNG.
Streaming generation
The wrapper exposes a streaming session that connects to datasynth-server over WebSockets while using REST endpoints to start, pause, resume, and stop streams.
Examples
Example 1: Minimal generation in memory
from datasynth_py import DataSynth
from datasynth_py.config import Config, GlobalSettings, CompanySettings
config = Config(
global_settings=GlobalSettings(locale="en_US", fiscal_year_start="2024-01-01"),
companies=CompanySettings(count=2),
)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "memory"})
# result.tables -> dict[str, pandas.DataFrame]
print(result.tables["transactions"].head())
Example 2: Use a blueprint with overrides
from datasynth_py import DataSynth, blueprints
synth = DataSynth()
config = blueprints.retail_small(companies=4, transactions=15000)
result = synth.generate(
config=config,
output={"format": "parquet", "sink": "temp_dir"},
seed=7,
)
print(result.output_dir)
Example 3: Layering overrides for a custom scenario
from datasynth_py import DataSynth
from datasynth_py.config import Config, GlobalSettings, TransactionSettings
base = Config(global_settings=GlobalSettings(locale="en_GB"))
custom = base.override(
transactions=TransactionSettings(
count=25000,
currency="GBP",
anomaly_rate=0.02,
)
)
synth = DataSynth()
result = synth.generate(config=custom, output={"format": "jsonl", "sink": "memory"})
Example 4: Export configuration for debugging
from datasynth_py import DataSynth
from datasynth_py.config import Config
synth = DataSynth()
config = Config(...)
print(config.to_yaml())
print(config.to_json())
Example 5: Streaming events
import asyncio
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
async def main() -> None:
synth = DataSynth(server_url="http://localhost:3000")
config = blueprints.retail_small(companies=2, transactions=5000)
session = synth.stream(config=config, events_per_second=50)
async for event in session.events():
print(event)
break
asyncio.run(main())
Decisions
- In-memory table format: pandas DataFrames are the default return type for memory sinks.
- Validation errors: configuration validation raises ConfigValidationError with structured error details.
Python Wrapper Guide
This guide explains how to use the DataSynth Python wrapper for in-memory configuration, local CLI generation, and streaming generation through the server API.
Installation
The wrapper lives in the repository under python/. Install it in development mode:
cd python
pip install -e ".[all]"
Or install just the core with specific extras:
pip install -e ".[cli]" # For CLI generation (requires PyYAML)
pip install -e ".[memory]" # For in-memory tables (requires pandas)
pip install -e ".[streaming]" # For streaming (requires websockets)
Quick start (CLI generation)
from datasynth_py import DataSynth, CompanyConfig, Config, GlobalSettings, ChartOfAccountsSettings
config = Config(
global_settings=GlobalSettings(
industry="retail",
start_date="2024-01-01",
period_months=12,
),
companies=[
CompanyConfig(code="C001", name="Retail Corp", currency="USD", country="US"),
],
chart_of_accounts=ChartOfAccountsSettings(complexity="small"),
)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
print(result.output_dir) # Path to generated files
Using blueprints
Blueprints provide preconfigured templates for common scenarios:
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
# List available blueprints
print(blueprints.list())
# ['retail_small', 'banking_medium', 'manufacturing_large',
# 'banking_aml', 'ml_training', 'with_graph_export']
# Create a retail configuration with 4 companies
config = blueprints.retail_small(companies=4, transactions=10000)
# Banking/AML focused configuration
config = blueprints.banking_aml(customers=1000, typologies=True)
# ML training optimized configuration
config = blueprints.ml_training(
industry="manufacturing",
anomaly_ratio=0.05,
)
# Add graph export to any configuration
config = blueprints.with_graph_export(
base_config=blueprints.retail_small(),
formats=["pytorch_geometric", "neo4j"],
)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "parquet", "sink": "path", "path": "./output"})
Configuration model
The configuration model matches the CLI schema:
from datasynth_py import (
ChartOfAccountsSettings,
CompanyConfig,
Config,
FraudSettings,
GlobalSettings,
)
config = Config(
global_settings=GlobalSettings(
industry="manufacturing", # Industry sector
start_date="2024-01-01", # Simulation start date
period_months=12, # Number of months to simulate
seed=42, # Random seed for reproducibility
group_currency="USD", # Base currency
),
companies=[
CompanyConfig(
code="M001",
name="Manufacturing Co",
currency="USD",
country="US",
annual_transaction_volume="ten_k", # Volume preset
),
CompanyConfig(
code="M002",
name="Manufacturing EU",
currency="EUR",
country="DE",
annual_transaction_volume="hundred_k",
),
],
chart_of_accounts=ChartOfAccountsSettings(
complexity="medium", # small, medium, or large
),
fraud=FraudSettings(
enabled=True,
rate=0.01, # 1% fraud rate
),
)
Valid industry values
- manufacturing
- retail
- financial_services
- healthcare
- technology
- professional_services
- energy
- transportation
- real_estate
- telecommunications
Transaction volume presets
- ten_k - 10,000 transactions/year
- hundred_k - 100,000 transactions/year
- one_m - 1,000,000 transactions/year
- ten_m - 10,000,000 transactions/year
- hundred_m - 100,000,000 transactions/year
Extended configuration
Additional configuration sections for specialized scenarios:
from datasynth_py.config.models import (
Config,
GlobalSettings,
BankingSettings,
ScenarioSettings,
TemporalDriftSettings,
DataQualitySettings,
GraphExportSettings,
)
config = Config(
global_settings=GlobalSettings(industry="financial_services"),
# Banking/KYC/AML configuration
banking=BankingSettings(
enabled=True,
retail_customers=1000,
business_customers=200,
typologies_enabled=True, # Structuring, layering, mule patterns
),
# ML training scenario
scenario=ScenarioSettings(
tags=["ml_training", "fraud_detection"],
ml_training=True,
target_anomaly_ratio=0.05,
),
# Temporal drift for concept drift testing
temporal=TemporalDriftSettings(
enabled=True,
amount_mean_drift=0.02,
drift_type="gradual", # gradual, sudden, recurring
),
# Data quality issues for DQ model training
data_quality=DataQualitySettings(
enabled=True,
missing_rate=0.05,
typo_rate=0.02,
),
# Graph export for GNN training
graph_export=GraphExportSettings(
enabled=True,
formats=["pytorch_geometric", "neo4j"],
),
)
Configuration layering
Override configuration values:
from datasynth_py import Config, GlobalSettings
base = Config(global_settings=GlobalSettings(industry="retail", start_date="2024-01-01"))
custom = base.override(
fraud={"enabled": True, "rate": 0.02},
)
Validation
Validation raises ConfigValidationError with structured error details:
from datasynth_py import Config, GlobalSettings
from datasynth_py.config.validation import ConfigValidationError
try:
Config(global_settings=GlobalSettings(period_months=0)).validate()
except ConfigValidationError as exc:
for error in exc.errors:
print(error.path, error.message, error.value)
Output options
Control where and how data is generated:
from datasynth_py import DataSynth, OutputSpec
synth = DataSynth()
# Write to a specific path
result = synth.generate(
config=config,
output=OutputSpec(format="csv", sink="path", path="./output"),
)
# Write to a temporary directory
result = synth.generate(
config=config,
output=OutputSpec(format="parquet", sink="temp_dir"),
)
print(result.output_dir) # Temp directory path
# Load into memory (requires pandas)
result = synth.generate(
config=config,
output=OutputSpec(format="csv", sink="memory"),
)
print(result.tables["journal_entries"].head())
Fingerprint Operations
The Python wrapper provides access to fingerprint extraction, validation, and evaluation:
from datasynth_py import DataSynth
synth = DataSynth()
# Extract fingerprint from real data
synth.fingerprint.extract(
input_path="./real_data/",
output_path="./fingerprint.dsf",
privacy_level="standard" # minimal, standard, high, maximum
)
# Validate fingerprint file
is_valid, errors = synth.fingerprint.validate("./fingerprint.dsf")
if not is_valid:
print(f"Validation errors: {errors}")
# Get fingerprint info
info = synth.fingerprint.info("./fingerprint.dsf", detailed=True)
print(f"Privacy level: {info.privacy_level}")
print(f"Epsilon spent: {info.epsilon_spent}")
print(f"Tables: {info.tables}")
# Evaluate synthetic data fidelity
report = synth.fingerprint.evaluate(
fingerprint_path="./fingerprint.dsf",
synthetic_path="./synthetic_data/",
threshold=0.8
)
print(f"Overall score: {report.overall_score}")
print(f"Statistical fidelity: {report.statistical_fidelity}")
print(f"Correlation fidelity: {report.correlation_fidelity}")
print(f"Passes threshold: {report.passes}")
FidelityReport Fields
| Field | Description |
|---|---|
overall_score | Weighted average of all fidelity metrics (0-1) |
statistical_fidelity | KS statistic, Wasserstein distance, Benford MAD |
correlation_fidelity | Correlation matrix RMSE |
schema_fidelity | Column type match, row count ratio |
passes | Whether the score meets the threshold |
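In a CI or pipeline setting, the report can gate downstream steps directly. A minimal sketch reusing the evaluate() call shown above:
report = synth.fingerprint.evaluate(
    fingerprint_path="./fingerprint.dsf",
    synthetic_path="./synthetic_data/",
    threshold=0.8,
)
if not report.passes:
    # Fail the pipeline when synthetic data drifts too far from the fingerprint.
    raise SystemExit(f"Fidelity below threshold: {report.overall_score:.2f}")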
Streaming generation
Streaming uses the DataSynth server for real-time event generation. Start the server first:
cargo run -p datasynth-server -- --port 3000
Then stream events:
import asyncio
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
async def main() -> None:
synth = DataSynth(server_url="http://localhost:3000")
config = blueprints.retail_small(companies=2)
session = synth.stream(config=config, events_per_second=100)
async for event in session.events():
print(event)
break
asyncio.run(main())
Stream controls
session.pause()
session.resume()
session.stop()
Pattern triggers
Trigger specific patterns during streaming to simulate real-world scenarios:
# Trigger temporal patterns
session.trigger_month_end() # Month-end volume spike
session.trigger_year_end() # Year-end closing entries
session.trigger_pattern("quarter_end_spike")
# Trigger anomaly patterns
session.trigger_fraud_cluster() # Clustered fraud transactions
session.trigger_pattern("dormant_account_activity")
# Available patterns:
# - period_end_spike
# - quarter_end_spike
# - year_end_spike
# - fraud_cluster
# - error_burst
# - dormant_account_activity
Synchronous event consumption
For simpler use cases without async/await:
def process_event(event):
print(f"Received: {event['document_id']}")
session.sync_events(callback=process_event, max_events=1000)
Runtime requirements
The wrapper shells out to the datasynth-data CLI for batch generation. Ensure the binary is available:
cargo build --release
export DATASYNTH_BINARY=target/release/datasynth-data
Alternatively, pass binary_path when creating the client:
synth = DataSynth(binary_path="/path/to/datasynth-data")
Troubleshooting
- MissingDependencyError: Install the required optional dependency (PyYAML, pandas, or websockets).
- CLI not found: Build the datasynth-data binary and set DATASYNTH_BINARY or pass binary_path.
- ConfigValidationError: Check the error details for invalid configuration values.
- Streaming errors: Verify the server is running and reachable at the configured URL.
Ecosystem Integrations (v0.5.0)
DataSynth includes optional integrations with popular data engineering and ML platforms. Install with:
pip install datasynth-py[integrations]
# Or install specific integrations
pip install datasynth-py[airflow,dbt,mlflow,spark]
Apache Airflow
Use the DataSynthOperator to generate data as part of Airflow DAGs:
from datasynth_py.integrations import DataSynthOperator, DataSynthSensor, DataSynthValidateOperator
# Generate data
generate = DataSynthOperator(
task_id="generate_data",
config=config,
output_path="/data/synthetic/output",
)
# Wait for completion
sensor = DataSynthSensor(
task_id="wait_for_data",
output_path="/data/synthetic/output",
)
# Validate config
validate = DataSynthValidateOperator(
task_id="validate_config",
config_path="/data/configs/config.yaml",
)
dbt Integration
Generate dbt sources and seeds from synthetic data:
from datasynth_py.integrations import DbtSourceGenerator, create_dbt_project
gen = DbtSourceGenerator()
# Generate sources.yml for dbt
sources_path = gen.generate_sources_yaml("./output", "./my_dbt_project")
# Generate seed CSVs
seeds_dir = gen.generate_seeds("./output", "./my_dbt_project")
# Create complete dbt project from synthetic output
project = create_dbt_project("./output", "my_dbt_project")
MLflow Tracking
Track generation runs as MLflow experiments:
from datasynth_py.integrations import DataSynthMlflowTracker
tracker = DataSynthMlflowTracker(experiment_name="synthetic_data_runs")
# Track a generation run
run_info = tracker.track_generation("./output", config=cfg)
# Log quality metrics
tracker.log_quality_metrics({
"completeness": 0.98,
"benford_mad": 0.008,
"correlation_preservation": 0.95,
})
# Compare recent runs
comparison = tracker.compare_runs(n=5)
Apache Spark
Read synthetic data as Spark DataFrames:
from datasynth_py.integrations import DataSynthSparkReader
reader = DataSynthSparkReader()
# Read a single table
df = reader.read_table(spark, "./output", "journal_entries")
# Read all tables
tables = reader.read_all_tables(spark, "./output")
# Create temporary views for SQL queries
views = reader.create_temp_views(spark, "./output")
spark.sql("SELECT * FROM journal_entries WHERE amount > 10000").show()
For comprehensive integration documentation, see the Ecosystem Integrations guide.
Ecosystem Integrations
New in v0.5.0
DataSynth’s Python wrapper includes optional integrations with popular data engineering and ML platforms for seamless pipeline orchestration.
Installation
# Install all integrations
pip install datasynth-py[integrations]
# Install specific integrations
pip install datasynth-py[airflow]
pip install datasynth-py[dbt]
pip install datasynth-py[mlflow]
pip install datasynth-py[spark]
Apache Airflow
The Airflow integration provides custom operators and sensors for orchestrating synthetic data generation in Airflow DAGs.
DataSynthOperator
Generates synthetic data as an Airflow task:
from datasynth_py.integrations import DataSynthOperator
generate = DataSynthOperator(
task_id="generate_synthetic_data",
config={
"global": {"industry": "retail", "start_date": "2024-01-01", "period_months": 12},
"transactions": {"target_count": 50000},
"output": {"format": "csv"},
},
output_path="/data/synthetic/{{ ds }}",
)
| Parameter | Type | Description |
|---|---|---|
task_id | str | Airflow task identifier |
config | dict | Generation configuration (inline) |
config_path | str | Path to YAML config file (alternative to config) |
output_path | str | Output directory (supports Jinja templates) |
DataSynthSensor
Waits for synthetic data generation to complete:
from datasynth_py.integrations import DataSynthSensor
wait = DataSynthSensor(
task_id="wait_for_data",
output_path="/data/synthetic/{{ ds }}",
poke_interval=30,
timeout=600,
)
DataSynthValidateOperator
Validates a configuration file before generation:
from datasynth_py.integrations import DataSynthValidateOperator
validate = DataSynthValidateOperator(
task_id="validate_config",
config_path="/configs/retail.yaml",
)
Complete DAG Example
from airflow import DAG
from airflow.utils.dates import days_ago
from datasynth_py.integrations import (
DataSynthOperator,
DataSynthSensor,
DataSynthValidateOperator,
)
with DAG(
"weekly_synthetic_data",
start_date=days_ago(1),
schedule_interval="@weekly",
catchup=False,
) as dag:
validate = DataSynthValidateOperator(
task_id="validate",
config_path="/configs/retail.yaml",
)
generate = DataSynthOperator(
task_id="generate",
config_path="/configs/retail.yaml",
output_path="/data/synthetic/{{ ds }}",
)
wait = DataSynthSensor(
task_id="wait",
output_path="/data/synthetic/{{ ds }}",
)
validate >> generate >> wait
dbt Integration
Generate dbt-compatible project structures from synthetic data output.
DbtSourceGenerator
from datasynth_py.integrations import DbtSourceGenerator
gen = DbtSourceGenerator()
Generate sources.yml
Creates a dbt sources.yml file pointing to synthetic data tables:
sources_path = gen.generate_sources_yaml(
output_dir="./synthetic_output",
dbt_project_dir="./my_dbt_project",
)
# Creates ./my_dbt_project/models/sources.yml
Generate Seeds
Copies synthetic CSV files as dbt seeds:
seeds_dir = gen.generate_seeds(
output_dir="./synthetic_output",
dbt_project_dir="./my_dbt_project",
)
# Copies CSVs to ./my_dbt_project/seeds/
create_dbt_project
Creates a complete dbt project structure from synthetic output:
from datasynth_py.integrations import create_dbt_project
project = create_dbt_project(
output_dir="./synthetic_output",
project_name="synthetic_test",
)
This creates:
synthetic_test/
├── dbt_project.yml
├── models/
│ └── sources.yml
├── seeds/
│ ├── journal_entries.csv
│ ├── vendors.csv
│ ├── customers.csv
│ └── ...
└── tests/
Usage with dbt CLI
cd synthetic_test
dbt seed # Load synthetic CSVs
dbt run # Run transformations
dbt test # Run data tests
MLflow Integration
Track synthetic data generation runs as MLflow experiments for comparison and reproducibility.
DataSynthMlflowTracker
from datasynth_py.integrations import DataSynthMlflowTracker
tracker = DataSynthMlflowTracker(experiment_name="synthetic_data_experiments")
Track a Generation Run
run_info = tracker.track_generation(
output_dir="./output",
config=config,
)
# Logs: config parameters, output file counts, generation metadata
Log Quality Metrics
tracker.log_quality_metrics({
"completeness": 0.98,
"benford_mad": 0.008,
"correlation_preservation": 0.95,
"statistical_fidelity": 0.92,
})
Compare Runs
comparison = tracker.compare_runs(n=5)
for run in comparison:
print(f"Run {run['run_id']}: {run['metrics']}")
Experiment Comparison
Use MLflow to compare different generation configurations:
import mlflow
configs = {
"baseline": baseline_config,
"with_diffusion": diffusion_config,
"high_fraud": high_fraud_config,
}
for name, cfg in configs.items():
with mlflow.start_run(run_name=name):
result = synth.generate(config=cfg, output={"format": "csv", "sink": "temp_dir"})
tracker.track_generation(result.output_dir, config=cfg)
tracker.log_quality_metrics(evaluate_quality(result.output_dir))
View results in the MLflow UI:
mlflow ui --port 5000
# Open http://localhost:5000
Apache Spark
Read synthetic data output directly as Spark DataFrames for large-scale analysis.
DataSynthSparkReader
from datasynth_py.integrations import DataSynthSparkReader
reader = DataSynthSparkReader()
Read a Single Table
df = reader.read_table(spark, "./output", "journal_entries")
df.printSchema()
df.show(5)
Read All Tables
tables = reader.read_all_tables(spark, "./output")
for name, df in tables.items():
print(f"{name}: {df.count()} rows, {len(df.columns)} columns")
Create Temporary Views
views = reader.create_temp_views(spark, "./output")
# Now use SQL
spark.sql("""
SELECT
v.vendor_id,
v.vendor_name,
COUNT(p.document_id) as payment_count,
SUM(p.amount) as total_paid
FROM vendors v
JOIN payments p ON v.vendor_id = p.vendor_id
GROUP BY v.vendor_id, v.vendor_name
ORDER BY total_paid DESC
LIMIT 10
""").show()
Spark + DataSynth Pipeline
from pyspark.sql import SparkSession
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
from datasynth_py.integrations import DataSynthSparkReader
# Generate
synth = DataSynth()
config = blueprints.retail_small(transactions=100000)
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
# Load into Spark
spark = SparkSession.builder.appName("SyntheticAnalysis").getOrCreate()
reader = DataSynthSparkReader()
reader.create_temp_views(spark, result.output_dir)
# Analyze
spark.sql("""
SELECT fiscal_period, COUNT(*) as entry_count, SUM(amount) as total_amount
FROM journal_entries
GROUP BY fiscal_period
ORDER BY fiscal_period
""").show()
Integration Dependencies
| Integration | Required Package | Version |
|---|---|---|
| Airflow | apache-airflow | >= 2.5 |
| dbt | dbt-core | >= 1.5 |
| MLflow | mlflow | >= 2.0 |
| Spark | pyspark | >= 3.3 |
All integrations are optional — install only what you need.
See Also
Configuration
SyntheticData uses YAML configuration files to control all aspects of data generation.
Quick Start
# Create configuration from preset
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
# Validate configuration
datasynth-data validate --config config.yaml
# Generate with configuration
datasynth-data generate --config config.yaml --output ./output
Configuration Sections
| Section | Description |
|---|---|
| Global Settings | Industry, dates, seed, performance |
| Companies | Company codes, currencies, volume weights |
| Transactions | Line items, amounts, sources |
| Master Data | Vendors, customers, materials, assets |
| Document Flows | P2P, O2C configuration |
| Financial Settings | Balance, subledger, FX, period close |
| Compliance | Fraud, controls, approval |
| AI & ML Features | LLM, diffusion, causal, certificates |
| Output Settings | Format, compression |
| Source-to-Pay | S2C sourcing pipeline (projects, RFx, bids, contracts, catalogs, scorecards) |
| Financial Reporting | Financial statements, bank reconciliation, management KPIs, budgets |
| HR | Payroll runs, time entries, expense reports |
| Manufacturing | Production orders, quality inspections, cycle counts |
| Sales Quotes | Quote-to-order pipeline |
| Accounting Standards | Revenue recognition (ASC 606/IFRS 15), impairment testing |
Reference
Minimal Configuration
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 10000
output:
format: csv
Full Configuration Example
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
group_currency: USD
companies:
- code: "1000"
name: "Headquarters"
currency: USD
country: US
volume_weight: 0.6
- code: "2000"
name: "European Subsidiary"
currency: EUR
country: DE
volume_weight: 0.4
chart_of_accounts:
complexity: medium
transactions:
target_count: 100000
line_items:
distribution: empirical
amounts:
min: 100
max: 1000000
master_data:
vendors:
count: 200
customers:
count: 500
materials:
count: 1000
document_flows:
p2p:
enabled: true
flow_rate: 0.3
o2c:
enabled: true
flow_rate: 0.3
fraud:
enabled: true
fraud_rate: 0.005
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
# AI & ML Features (v0.5.0)
diffusion:
enabled: true
n_steps: 1000
schedule: "cosine"
sample_size: 1000
causal:
enabled: true
template: "fraud_detection"
sample_size: 1000
validate: true
certificates:
enabled: true
issuer: "DataSynth"
include_quality_metrics: true
# Enterprise Process Chains (v0.6.0)
source_to_pay:
enabled: true
projects_per_period: 5
avg_bids_per_rfx: 4
contract_award_rate: 0.75
catalog_items_per_contract: 10
financial_reporting:
enabled: true
generate_balance_sheet: true
generate_income_statement: true
generate_cash_flow: true
generate_changes_in_equity: true
management_kpis:
enabled: true
budgets:
enabled: true
variance_threshold: 0.10
hr:
enabled: true
payroll_frequency: monthly
time_tracking: true
expense_reports: true
manufacturing:
enabled: true
production_orders_per_period: 20
quality_inspection_rate: 0.30
cycle_count_frequency: quarterly
sales_quotes:
enabled: true
quotes_per_period: 15
conversion_rate: 0.35
output:
format: csv
compression: none
Configuration Loading
Configuration can be loaded from:
- YAML file (recommended): datasynth-data generate --config config.yaml --output ./output
- JSON file: datasynth-data generate --config config.json --output ./output
- Demo preset: datasynth-data generate --demo --output ./output
Validation
The configuration is validated for:
| Rule | Description |
|---|---|
| Required fields | All mandatory fields must be present |
| Value ranges | Numbers within valid bounds |
| Distributions | Weights sum to 1.0 (±0.01 tolerance) |
| Dates | Valid date ranges |
| Uniqueness | Company codes must be unique |
| Consistency | Cross-field validations |
Run validation:
datasynth-data validate --config config.yaml
Overriding Values
Command-line options override configuration file values:
# Override seed
datasynth-data generate --config config.yaml --seed 12345 --output ./output
# Override format
datasynth-data generate --config config.yaml --format json --output ./output
Environment Variables
Some settings can be controlled via environment variables:
| Variable | Configuration Equivalent |
|---|---|
SYNTH_DATA_SEED | global.seed |
SYNTH_DATA_THREADS | global.worker_threads |
SYNTH_DATA_MEMORY_LIMIT | global.memory_limit |
See Also
YAML Schema Reference
Complete reference for all configuration options.
Schema Overview
global: # Global settings
companies: # Company definitions
chart_of_accounts: # COA structure
transactions: # Transaction settings
master_data: # Master data settings
document_flows: # P2P, O2C flows
intercompany: # IC settings
balance: # Balance settings
subledger: # Subledger settings
fx: # FX settings
period_close: # Period close settings
fraud: # Fraud injection
internal_controls: # SOX controls
anomaly_injection: # Anomaly injection
data_quality: # Data quality variations
graph_export: # Graph export settings
output: # Output settings
business_processes: # Process distribution
templates: # External templates
approval: # Approval thresholds
departments: # Department distribution
source_to_pay: # Source-to-Pay (v0.6.0)
financial_reporting: # Financial statements & KPIs (v0.6.0)
hr: # HR / payroll / expenses (v0.6.0)
manufacturing: # Production orders & costing (v0.6.0)
sales_quotes: # Quote-to-order pipeline (v0.6.0)
global
global:
seed: 42 # u64, optional - RNG seed
industry: manufacturing # string - industry preset
start_date: 2024-01-01 # date - generation start
period_months: 12 # u32, 1-120 - duration
group_currency: USD # string - base currency
worker_threads: 4 # usize, optional - parallelism
memory_limit: 2147483648 # u64, optional - bytes
Industries: manufacturing, retail, financial_services, healthcare, technology, energy, telecom, transportation, hospitality
companies
companies:
- code: "1000" # string - unique code
name: "Headquarters" # string - display name
currency: USD # string - local currency
country: US # string - ISO country code
volume_weight: 0.6 # f64, 0-1 - transaction weight
is_parent: true # bool - consolidation parent
parent_code: null # string, optional - parent ref
Constraints:
- volume_weight across all companies must sum to 1.0
- code must be unique
chart_of_accounts
chart_of_accounts:
complexity: medium # small, medium, large
industry_specific: true # bool - use industry COA
custom_accounts: [] # list - additional accounts
Complexity levels:
- small: ~100 accounts
- medium: ~400 accounts
- large: ~2500 accounts
transactions
transactions:
target_count: 100000 # u64 - total JEs to generate
line_items:
distribution: empirical # empirical, uniform, custom
min_lines: 2 # u32 - minimum line items
max_lines: 20 # u32 - maximum line items
custom_distribution: # only if distribution: custom
2: 0.6068
3: 0.0524
4: 0.1732
amounts:
min: 100 # f64 - minimum amount
max: 1000000 # f64 - maximum amount
distribution: log_normal # log_normal, uniform, custom
round_number_bias: 0.15 # f64, 0-1 - round number preference
sources: # transaction source weights
manual: 0.3
automated: 0.5
recurring: 0.15
adjustment: 0.05
benford:
enabled: true # bool - Benford's Law compliance
temporal:
month_end_spike: 2.5 # f64 - month-end volume multiplier
quarter_end_spike: 3.0 # f64 - quarter-end multiplier
year_end_spike: 4.0 # f64 - year-end multiplier
working_hours_only: true # bool - restrict to business hours
master_data
master_data:
vendors:
count: 200 # u32 - number of vendors
intercompany_ratio: 0.05 # f64, 0-1 - IC vendor ratio
customers:
count: 500 # u32 - number of customers
intercompany_ratio: 0.05 # f64, 0-1 - IC customer ratio
materials:
count: 1000 # u32 - number of materials
fixed_assets:
count: 100 # u32 - number of assets
employees:
count: 50 # u32 - number of employees
hierarchy_depth: 4 # u32 - org chart depth
document_flows
document_flows:
p2p: # Procure-to-Pay
enabled: true
flow_rate: 0.3 # f64, 0-1 - JE percentage
completion_rate: 0.95 # f64, 0-1 - full flow rate
three_way_match:
quantity_tolerance: 0.02 # f64, 0-1 - qty variance allowed
price_tolerance: 0.01 # f64, 0-1 - price variance allowed
o2c: # Order-to-Cash
enabled: true
flow_rate: 0.3 # f64, 0-1 - JE percentage
completion_rate: 0.95 # f64, 0-1 - full flow rate
intercompany
intercompany:
enabled: true
transaction_types: # weights must sum to 1.0
goods_sale: 0.4
service_provided: 0.2
loan: 0.15
dividend: 0.1
management_fee: 0.1
royalty: 0.05
transfer_pricing:
method: cost_plus # cost_plus, resale_minus, comparable
markup_range:
min: 0.03
max: 0.10
balance
balance:
opening_balance:
enabled: true
total_assets: 10000000 # f64 - opening balance sheet size
coherence_check:
enabled: true # bool - verify A = L + E
tolerance: 0.01 # f64 - allowed imbalance
subledger
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90] # list of days
ap:
enabled: true
aging_buckets: [30, 60, 90]
fixed_assets:
enabled: true
depreciation_methods:
- straight_line
- declining_balance
inventory:
enabled: true
valuation_methods:
- fifo
- weighted_average
fx
fx:
enabled: true
base_currency: USD
currency_pairs: # currencies to generate
- EUR
- GBP
- CHF
- JPY
volatility: 0.01 # f64 - daily volatility
translation:
method: current_rate # current_rate, temporal
period_close
period_close:
enabled: true
monthly:
accruals: true
depreciation: true
quarterly:
intercompany_elimination: true
annual:
closing_entries: true
retained_earnings: true
fraud
fraud:
enabled: true
fraud_rate: 0.005 # f64, 0-1 - fraud percentage
types: # weights must sum to 1.0
fictitious_transaction: 0.15
revenue_manipulation: 0.10
expense_capitalization: 0.10
split_transaction: 0.15
round_tripping: 0.05
kickback_scheme: 0.10
ghost_employee: 0.05
duplicate_payment: 0.15
unauthorized_discount: 0.10
suspense_abuse: 0.05
internal_controls
internal_controls:
enabled: true
controls:
- id: "CTL-001"
name: "Payment Approval"
type: preventive
frequency: continuous
sod_rules:
- conflict_type: create_approve
processes: [ap_invoice, ap_payment]
anomaly_injection
anomaly_injection:
enabled: true
total_rate: 0.02 # f64, 0-1 - total anomaly rate
generate_labels: true # bool - output ML labels
categories: # weights must sum to 1.0
fraud: 0.25
error: 0.40
process_issue: 0.20
statistical: 0.10
relational: 0.05
temporal_pattern:
year_end_spike: 1.5 # f64 - year-end multiplier
clustering:
enabled: true
cluster_probability: 0.2
data_quality
data_quality:
enabled: true
missing_values:
rate: 0.01 # f64, 0-1
pattern: mcar # mcar, mar, mnar, systematic
format_variations:
date_formats: true
amount_formats: true
duplicates:
rate: 0.001 # f64, 0-1
types: [exact, near, fuzzy]
typos:
rate: 0.005 # f64, 0-1
keyboard_aware: true
graph_export
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
- dgl
graphs:
- transaction_network
- approval_network
- entity_relationship
split:
train: 0.7
val: 0.15
test: 0.15
stratify: is_anomaly
features:
temporal: true
amount: true
structural: true
categorical: true
output
output:
format: csv # csv, json
compression: none # none, gzip, zstd
compression_level: 6 # u32, 1-9 (if compression enabled)
files:
journal_entries: true
acdoca: true
master_data: true
documents: true
subledgers: true
trial_balances: true
labels: true
controls: true
Validation Summary
| Field | Constraint |
|---|---|
period_months | 1-120 |
compression_level | 1-9 |
| All rates/percentages | 0.0-1.0 |
| Distributions | Sum to 1.0 (±0.01) |
| Company codes | Unique |
| Dates | Valid and consistent |
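Several of the weighted maps above (transactions.sources, intercompany.transaction_types, fraud.types, anomaly_injection.categories) must sum to 1.0 within ±0.01. A minimal pre-flight check along the following lines can catch weight mistakes before a long generation run; it is an illustrative sketch using PyYAML, not part of the datasynth-data CLI.
```python
# Illustrative pre-flight check (not part of the CLI): verify that weighted
# distributions in a config.yaml sum to 1.0 within the ±0.01 tolerance.
# Requires PyYAML.
import yaml

WEIGHT_MAPS = [
    ("transactions", "sources"),
    ("intercompany", "transaction_types"),
    ("fraud", "types"),
    ("anomaly_injection", "categories"),
]

def check_weights(config_file: str, tolerance: float = 0.01) -> list[str]:
    with open(config_file) as fh:
        cfg = yaml.safe_load(fh)
    problems = []
    for section, key in WEIGHT_MAPS:
        weights = (cfg.get(section) or {}).get(key)
        if not isinstance(weights, dict):
            continue  # section absent or not a weight map
        total = sum(weights.values())
        if abs(total - 1.0) > tolerance:
            problems.append(f"{section}.{key} sums to {total:.3f}, expected 1.0")
    return problems

if __name__ == "__main__":
    for warning in check_weights("config.yaml"):
        print("WARN:", warning)
```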
Diffusion Configuration (v0.5.0)
diffusion:
enabled: false # Enable diffusion model backend
n_steps: 1000 # Number of diffusion steps (default: 1000)
schedule: "linear" # Noise schedule: "linear", "cosine", "sigmoid"
sample_size: 1000 # Number of samples to generate (default: 1000)
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable diffusion model generation |
n_steps | integer | 1000 | Number of forward/reverse diffusion steps |
schedule | string | "linear" | Noise schedule type: linear, cosine, sigmoid |
sample_size | integer | 1000 | Number of samples to generate |
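The schedule field determines how noise is spread across the n_steps forward steps. As a purely illustrative sketch (the crate's internal constants and parameterization are not documented here), a linear schedule can be pictured as a per-step noise variance that increases linearly:
```python
# Illustrative only: one common parameterization of a linear noise schedule
# over n_steps diffusion steps. The generator's internal constants may differ.
import numpy as np

def linear_beta_schedule(n_steps: int = 1000,
                         beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> np.ndarray:
    """Per-step noise variances, increasing linearly from beta_start to beta_end."""
    return np.linspace(beta_start, beta_end, n_steps)

betas = linear_beta_schedule(1000)
signal_retained = np.cumprod(1.0 - betas)   # fraction of signal left after t steps
print(f"signal retained after all steps: {signal_retained[-1]:.4f}")
```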
Causal Configuration (v0.5.0)
causal:
enabled: false # Enable causal generation
template: "fraud_detection" # Built-in template or custom graph path
sample_size: 1000 # Number of samples to generate
validate: true # Validate causal structure in output
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable causal/counterfactual generation |
template | string | "fraud_detection" | Template name (fraud_detection, revenue_cycle) or path to custom YAML |
sample_size | integer | 1000 | Number of causal samples to generate |
validate | bool | true | Run causal structure validation on output |
Built-in Causal Templates
| Template | Variables | Description |
|---|---|---|
fraud_detection | transaction_amount, approval_level, vendor_risk, fraud_flag | Fraud detection causal graph |
revenue_cycle | order_size, credit_score, payment_delay, revenue | Revenue cycle causal graph |
Certificate Configuration (v0.5.0)
certificates:
enabled: false # Enable synthetic data certificates
issuer: "DataSynth" # Certificate issuer name
include_quality_metrics: true # Include quality metrics in certificate
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Attach certificate to generated output |
issuer | string | "DataSynth" | Issuer identity for the certificate |
include_quality_metrics | bool | true | Include Benford MAD, correlation, fidelity metrics |
Source-to-Pay Configuration (v0.6.0)
source_to_pay:
enabled: false # Enable source-to-pay generation
spend_analysis:
hhi_threshold: 2500.0 # f64 - HHI threshold for sourcing trigger
contract_coverage_target: 0.80 # f64, 0-1 - target spend under contracts
sourcing:
projects_per_year: 10 # u32 - sourcing projects per year
renewal_horizon_months: 3 # u32 - months before expiry to trigger renewal
project_duration_months: 4 # u32 - average project duration
qualification:
pass_rate: 0.75 # f64, 0-1 - qualification pass rate
validity_days: 365 # u32 - qualification validity in days
financial_weight: 0.25 # f64 - financial stability weight
quality_weight: 0.30 # f64 - quality management weight
delivery_weight: 0.25 # f64 - delivery performance weight
compliance_weight: 0.20 # f64 - compliance weight
rfx:
rfi_threshold: 100000.0 # f64 - spend above which RFI required
min_invited_vendors: 3 # u32 - minimum vendors per RFx
max_invited_vendors: 8 # u32 - maximum vendors per RFx
response_rate: 0.70 # f64, 0-1 - vendor response rate
default_price_weight: 0.40 # f64 - price weight in evaluation
default_quality_weight: 0.35 # f64 - quality weight in evaluation
default_delivery_weight: 0.25 # f64 - delivery weight in evaluation
contracts:
min_duration_months: 12 # u32 - minimum contract duration
max_duration_months: 36 # u32 - maximum contract duration
auto_renewal_rate: 0.40 # f64, 0-1 - auto-renewal rate
amendment_rate: 0.20 # f64, 0-1 - contracts with amendments
type_distribution:
fixed_price: 0.40 # f64 - fixed price contracts
blanket: 0.30 # f64 - blanket/framework agreements
time_and_materials: 0.15 # f64 - T&M contracts
service_agreement: 0.15 # f64 - service agreements
catalog:
preferred_vendor_flag_rate: 0.70 # f64, 0-1 - items marked as preferred
multi_source_rate: 0.25 # f64, 0-1 - items with multiple sources
scorecards:
frequency: "quarterly" # string - review frequency
on_time_delivery_weight: 0.30 # f64 - OTD weight in score
quality_weight: 0.30 # f64 - quality weight in score
price_weight: 0.25 # f64 - price competitiveness weight
responsiveness_weight: 0.15 # f64 - responsiveness weight
grade_a_threshold: 90.0 # f64 - grade A threshold
grade_b_threshold: 75.0 # f64 - grade B threshold
grade_c_threshold: 60.0 # f64 - grade C threshold
p2p_integration:
off_contract_rate: 0.15 # f64, 0-1 - maverick purchase rate
price_tolerance: 0.02 # f64 - contract price variance allowed
catalog_enforcement: false # bool - enforce catalog ordering
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable source-to-pay generation |
sourcing.projects_per_year | u32 | 10 | Sourcing projects per year |
qualification.pass_rate | f64 | 0.75 | Supplier qualification pass rate |
rfx.response_rate | f64 | 0.70 | Fraction of invited vendors that respond |
contracts.auto_renewal_rate | f64 | 0.40 | Auto-renewal rate |
scorecards.frequency | string | "quarterly" | Scorecard review frequency |
p2p_integration.off_contract_rate | f64 | 0.15 | Rate of off-contract (maverick) purchases |
Financial Reporting Configuration (v0.6.0)
financial_reporting:
enabled: false # Enable financial reporting generation
generate_balance_sheet: true # bool - generate balance sheet
generate_income_statement: true # bool - generate income statement
generate_cash_flow: true # bool - generate cash flow statement
generate_changes_in_equity: true # bool - generate changes in equity
comparative_periods: 1 # u32 - number of comparative periods
management_kpis:
enabled: false # bool - enable KPI generation
frequency: "monthly" # string - monthly, quarterly
budgets:
enabled: false # bool - enable budget generation
revenue_growth_rate: 0.05 # f64 - expected revenue growth rate
expense_inflation_rate: 0.03 # f64 - expected expense inflation rate
variance_noise: 0.10 # f64 - noise for budget vs actual
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable financial reporting generation |
generate_balance_sheet | bool | true | Generate balance sheet output |
generate_income_statement | bool | true | Generate income statement output |
generate_cash_flow | bool | true | Generate cash flow statement output |
generate_changes_in_equity | bool | true | Generate changes in equity statement |
comparative_periods | u32 | 1 | Number of comparative periods to include |
management_kpis.enabled | bool | false | Enable management KPI calculation |
management_kpis.frequency | string | "monthly" | KPI calculation frequency |
budgets.enabled | bool | false | Enable budget generation |
budgets.revenue_growth_rate | f64 | 0.05 | Expected revenue growth rate for budgeting |
budgets.expense_inflation_rate | f64 | 0.03 | Expected expense inflation rate |
budgets.variance_noise | f64 | 0.10 | Random noise added to budget vs actual |
HR Configuration (v0.6.0)
hr:
enabled: false # Enable HR generation
payroll:
enabled: true # bool - enable payroll generation
pay_frequency: "monthly" # string - monthly, biweekly, weekly
salary_ranges:
staff_min: 50000.0 # f64 - staff level minimum salary
staff_max: 70000.0 # f64 - staff level maximum salary
manager_min: 80000.0 # f64 - manager level minimum salary
manager_max: 120000.0 # f64 - manager level maximum salary
director_min: 120000.0 # f64 - director level minimum salary
director_max: 180000.0 # f64 - director level maximum salary
executive_min: 180000.0 # f64 - executive level minimum salary
executive_max: 350000.0 # f64 - executive level maximum salary
tax_rates:
federal_effective: 0.22 # f64 - federal effective tax rate
state_effective: 0.05 # f64 - state effective tax rate
fica: 0.0765 # f64 - FICA/social security rate
benefits_enrollment_rate: 0.60 # f64, 0-1 - benefits enrollment rate
retirement_participation_rate: 0.45 # f64, 0-1 - retirement plan participation
time_attendance:
enabled: true # bool - enable time tracking
overtime_rate: 0.10 # f64, 0-1 - employees with overtime
expenses:
enabled: true # bool - enable expense report generation
submission_rate: 0.30 # f64, 0-1 - employees submitting per month
policy_violation_rate: 0.08 # f64, 0-1 - rate of policy violations
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable HR generation |
payroll.enabled | bool | true | Enable payroll generation |
payroll.pay_frequency | string | "monthly" | Pay frequency: monthly, biweekly, weekly |
payroll.benefits_enrollment_rate | f64 | 0.60 | Benefits enrollment rate |
payroll.retirement_participation_rate | f64 | 0.45 | Retirement plan participation rate |
time_attendance.enabled | bool | true | Enable time tracking |
time_attendance.overtime_rate | f64 | 0.10 | Rate of employees with overtime |
expenses.enabled | bool | true | Enable expense report generation |
expenses.submission_rate | f64 | 0.30 | Rate of employees submitting expenses per month |
expenses.policy_violation_rate | f64 | 0.08 | Rate of policy violations |
Manufacturing Configuration (v0.6.0)
manufacturing:
enabled: false # Enable manufacturing generation
production_orders:
orders_per_month: 50 # u32 - production orders per month
avg_batch_size: 100 # u32 - average batch size
yield_rate: 0.97 # f64, 0-1 - production yield rate
make_to_order_rate: 0.20 # f64, 0-1 - MTO vs MTS ratio
rework_rate: 0.03 # f64, 0-1 - rework rate
costing:
labor_rate_per_hour: 35.0 # f64 - labor rate per hour
overhead_rate: 1.50 # f64 - overhead multiplier on direct labor
standard_cost_update_frequency: "quarterly" # string - cost update cycle
routing:
avg_operations: 4 # u32 - average operations per routing
setup_time_hours: 1.5 # f64 - average setup time in hours
run_time_variation: 0.15 # f64 - run time variation coefficient
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable manufacturing generation |
production_orders.orders_per_month | u32 | 50 | Number of production orders per month |
production_orders.avg_batch_size | u32 | 100 | Average batch size |
production_orders.yield_rate | f64 | 0.97 | Production yield rate |
production_orders.rework_rate | f64 | 0.03 | Rework rate |
costing.labor_rate_per_hour | f64 | 35.0 | Direct labor cost per hour |
costing.overhead_rate | f64 | 1.50 | Overhead application multiplier |
routing.avg_operations | u32 | 4 | Average operations per routing step |
routing.setup_time_hours | f64 | 1.5 | Average machine setup time in hours |
Sales Quotes Configuration (v0.6.0)
sales_quotes:
enabled: false # Enable sales quote generation
quotes_per_month: 30 # u32 - quotes generated per month
win_rate: 0.35 # f64, 0-1 - quote-to-order conversion
validity_days: 30 # u32 - default quote validity period
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable sales quote generation |
quotes_per_month | u32 | 30 | Number of quotes generated per month |
win_rate | f64 | 0.35 | Fraction of quotes that convert to sales orders |
validity_days | u32 | 30 | Default quote validity period in days |
Industry Presets
SyntheticData includes pre-configured settings for common industries.
Using Presets
# Create configuration from preset
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
Available Industries
| Industry | Key Characteristics |
|---|---|
| Manufacturing | Heavy P2P, inventory, fixed assets |
| Retail | High O2C volume, seasonal patterns |
| Financial Services | Complex intercompany, high controls |
| Healthcare | Regulatory focus, insurance seasonality |
| Technology | SaaS revenue, R&D capitalization |
Complexity Levels
| Level | Accounts | Vendors | Customers | Materials |
|---|---|---|---|---|
| Small | ~100 | 50 | 100 | 200 |
| Medium | ~400 | 200 | 500 | 1000 |
| Large | ~2500 | 1000 | 5000 | 10000 |
Manufacturing
Characteristics:
- High P2P activity (procurement, production)
- Significant inventory and WIP
- Fixed asset intensive
- Cost accounting emphasis
Key Settings:
global:
industry: manufacturing
transactions:
sources:
manual: 0.2
automated: 0.6
recurring: 0.15
adjustment: 0.05
document_flows:
p2p:
enabled: true
flow_rate: 0.4 # 40% of JEs from P2P
o2c:
enabled: true
flow_rate: 0.25 # 25% of JEs from O2C
master_data:
materials:
count: 1000
fixed_assets:
count: 200
subledger:
inventory:
enabled: true
valuation_methods:
- weighted_average
- fifo
Typical Account Distribution:
- 45% expense accounts (production costs)
- 25% asset accounts (inventory, equipment)
- 15% liability accounts
- 10% revenue accounts
- 5% equity accounts
Retail
Characteristics:
- High transaction volume
- Strong seasonal patterns
- High O2C activity
- Inventory turnover focus
Key Settings:
global:
industry: retail
transactions:
target_count: 500000 # High volume
temporal:
month_end_spike: 1.5
quarter_end_spike: 2.0
year_end_spike: 5.0 # Holiday season
document_flows:
p2p:
enabled: true
flow_rate: 0.25
o2c:
enabled: true
flow_rate: 0.45 # High sales activity
master_data:
customers:
count: 2000
materials:
count: 5000
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90, 120]
Seasonal Pattern:
- Q4 volume: 200-300% of Q1-Q3 average
- Black Friday/holiday spikes
- Post-holiday returns
Financial Services
Characteristics:
- Complex intercompany structures
- High regulatory requirements
- Sophisticated controls
- Mark-to-market adjustments
Key Settings:
global:
industry: financial_services
transactions:
sources:
automated: 0.7 # High automation
adjustment: 0.15 # MTM adjustments
intercompany:
enabled: true
transaction_types:
loan: 0.3
service_provided: 0.25
dividend: 0.2
management_fee: 0.15
royalty: 0.1
internal_controls:
enabled: true
controls:
- id: "SOX-001"
type: preventive
frequency: continuous
fx:
enabled: true
currency_pairs:
- EUR
- GBP
- CHF
- JPY
- CNY
volatility: 0.015
Control Requirements:
- SOX 404 compliance mandatory
- High SoD enforcement
- Continuous monitoring
Healthcare
Characteristics:
- Complex revenue recognition (insurance)
- Regulatory compliance (HIPAA)
- Seasonal patterns (flu season, open enrollment)
- High accounts receivable
Key Settings:
global:
industry: healthcare
transactions:
amounts:
min: 50
max: 500000
distribution: log_normal
document_flows:
o2c:
enabled: true
flow_rate: 0.5 # Revenue cycle focus
master_data:
customers:
count: 1000 # Patient/payer mix
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90, 120, 180] # Extended aging
fraud:
types:
fictitious_transaction: 0.2
revenue_manipulation: 0.3 # Upcoding focus
duplicate_payment: 0.2
Seasonal Pattern:
- Q1 spike (insurance deductible reset)
- Flu season (Oct-Feb)
- Open enrollment (Nov-Dec)
Technology
Characteristics:
- SaaS/subscription revenue
- R&D capitalization
- Stock compensation
- Deferred revenue management
Key Settings:
global:
industry: technology
transactions:
sources:
automated: 0.65
recurring: 0.25 # Subscription billing
manual: 0.08
adjustment: 0.02
document_flows:
o2c:
enabled: true
flow_rate: 0.35
subledger:
ar:
enabled: true
# Additional technology-specific
deferred_revenue:
enabled: true
recognition_period: monthly
capitalization:
r_and_d:
enabled: true
threshold: 50000
Revenue Pattern:
- Monthly recurring revenue (MRR)
- Annual contract billing (ACV)
- Usage-based components
Process Chain Defaults (v0.6.0)
Starting in v0.6.0, all five industry presets include default settings for the new enterprise process chains. When you generate a configuration with datasynth-data init, the preset populates sensible defaults for each new section, though they remain disabled until explicitly turned on.
| Process Chain | Manufacturing | Retail | Financial Services | Healthcare | Technology |
|---|---|---|---|---|---|
source_to_pay | High | Medium | Low | Medium | Low |
financial_reporting | Full | Full | Full | Full | Full |
hr | Full | Full | Full | Full | Full |
manufacturing | High | – | – | – | – |
sales_quotes | Medium | High | Low | Medium | High |
Manufacturing presets emphasize production orders, routing, and costing. Retail presets increase sales quote volume and quote-to-order win rates. Financial Services presets focus on financial reporting with comprehensive KPIs and budgets. Healthcare and Technology presets provide balanced defaults.
Each preset configures the following when you set enabled: true:
- source_to_pay: Sourcing projects, RFx events, contract management, catalogs, and vendor scorecards that feed into the existing P2P document flow.
- financial_reporting: Balance sheets, income statements, cash flow statements, management KPIs, and budget vs. actual variance analysis.
- hr: Payroll runs based on employee master data, time and attendance tracking, and expense report generation with policy violation injection.
- manufacturing: Production orders, WIP tracking, standard costing with labor and overhead, and routing operations.
- sales_quotes: Quote-to-order pipeline that feeds into the existing O2C document flow.
Customizing Presets
Start with a preset and customize:
# Generate preset
datasynth-data init --industry manufacturing -o config.yaml
# Edit config.yaml
# - Adjust transaction counts
# - Add companies
# - Enable additional features
# Validate and generate
datasynth-data validate --config config.yaml
datasynth-data generate --config config.yaml --output ./output
Combining Industries
For conglomerates, use multiple companies with different characteristics:
companies:
- code: "1000"
name: "Manufacturing Division"
volume_weight: 0.5
- code: "2000"
name: "Retail Division"
volume_weight: 0.3
- code: "3000"
name: "Services Division"
volume_weight: 0.2
Global Settings
Global settings control overall generation behavior.
Configuration
global:
seed: 42 # Random seed for reproducibility
industry: manufacturing # Industry preset
start_date: 2024-01-01 # Generation start date
period_months: 12 # Duration in months
group_currency: USD # Base/reporting currency
worker_threads: 4 # Parallel workers (optional)
memory_limit: 2147483648 # Memory limit in bytes (optional)
Fields
seed
Random number generator seed for reproducible output.
| Property | Value |
|---|---|
| Type | u64 |
| Required | No |
| Default | Random |
global:
seed: 42 # Same seed = same output
Use cases:
- Reproducible test datasets
- Debugging
- Consistent benchmarks
industry
Industry preset for domain-specific settings.
| Property | Value |
|---|---|
| Type | string |
| Required | Yes |
| Values | See below |
Available industries:
| Industry | Description |
|---|---|
manufacturing | Production, inventory, cost accounting |
retail | High volume sales, seasonal patterns |
financial_services | Complex IC, regulatory compliance |
healthcare | Insurance billing, compliance |
technology | SaaS revenue, R&D |
energy | Long-term assets, commodity trading |
telecom | Subscription revenue, network assets |
transportation | Fleet assets, fuel costs |
hospitality | Seasonal, revenue management |
start_date
Beginning date for generated data.
| Property | Value |
|---|---|
| Type | date (YYYY-MM-DD) |
| Required | Yes |
global:
start_date: 2024-01-01
Notes:
- First transaction will be on or after this date
- Combined with period_months to determine the date range
period_months
Duration of generation period.
| Property | Value |
|---|---|
| Type | u32 |
| Required | Yes |
| Range | 1-120 |
global:
period_months: 12 # One year
period_months: 36 # Three years
period_months: 1 # One month
Considerations:
- Longer periods = more data
- Period close features require at least 1 month
- Year-end close requires at least 12 months
group_currency
Base currency for consolidation and reporting.
| Property | Value |
|---|---|
| Type | string (ISO 4217) |
| Required | Yes |
global:
group_currency: USD
group_currency: EUR
group_currency: CHF
Used for:
- Currency translation
- Consolidation
- Intercompany eliminations
worker_threads
Number of parallel worker threads.
| Property | Value |
|---|---|
| Type | usize |
| Required | No |
| Default | Number of CPU cores |
global:
worker_threads: 4 # Use 4 threads
worker_threads: 1 # Single-threaded
Guidance:
- Default (CPU cores) is usually optimal
- Reduce for memory-constrained systems
- Increasing beyond the number of CPU cores rarely improves performance
memory_limit
Maximum memory usage in bytes.
| Property | Value |
|---|---|
| Type | u64 |
| Required | No |
| Default | None (system limit) |
global:
memory_limit: 1073741824 # 1 GB
memory_limit: 2147483648 # 2 GB
memory_limit: 4294967296 # 4 GB
Behavior:
- Soft limit: Generation slows down
- Hard limit: Generation pauses until memory is freed
- Streaming output is used to reduce memory pressure
Environment Variable Overrides
| Variable | Setting |
|---|---|
SYNTH_DATA_SEED | global.seed |
SYNTH_DATA_THREADS | global.worker_threads |
SYNTH_DATA_MEMORY_LIMIT | global.memory_limit |
SYNTH_DATA_SEED=12345 datasynth-data generate --config config.yaml --output ./output
Examples
Minimal
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 12
group_currency: USD
Full Control
global:
seed: 42
industry: financial_services
start_date: 2023-01-01
period_months: 36
group_currency: USD
worker_threads: 8
memory_limit: 8589934592 # 8 GB
Development/Testing
global:
seed: 42 # Reproducible
industry: manufacturing
start_date: 2024-01-01
period_months: 1 # Short period
group_currency: USD
worker_threads: 1 # Single thread for debugging
Validation
| Check | Rule |
|---|---|
period_months | 1 ≤ value ≤ 120 |
start_date | Valid date |
industry | Known industry preset |
group_currency | Valid ISO 4217 code |
Companies
Company configuration defines the legal entities for data generation.
Configuration
companies:
- code: "1000"
name: "Headquarters"
currency: USD
country: US
volume_weight: 0.6
is_parent: true
parent_code: null
- code: "2000"
name: "European Subsidiary"
currency: EUR
country: DE
volume_weight: 0.4
is_parent: false
parent_code: "1000"
Fields
code
Unique identifier for the company.
| Property | Value |
|---|---|
| Type | string |
| Required | Yes |
| Constraints | Unique across all companies |
companies:
- code: "1000" # Four-digit SAP-style
- code: "US01" # Region-based
- code: "HQ" # Abbreviated
name
Display name for the company.
| Property | Value |
|---|---|
| Type | string |
| Required | Yes |
companies:
- name: "Headquarters"
- name: "European Operations GmbH"
- name: "Asia Pacific Holdings"
currency
Local currency for the company.
| Property | Value |
|---|---|
| Type | string (ISO 4217) |
| Required | Yes |
companies:
- currency: USD
- currency: EUR
- currency: CHF
- currency: JPY
Used for:
- Transaction amounts
- Local reporting
- FX translation
country
Country code for the company.
| Property | Value |
|---|---|
| Type | string (ISO 3166-1 alpha-2) |
| Required | Yes |
companies:
- country: US
- country: DE
- country: CH
- country: JP
Affects:
- Holiday calendars
- Tax calculations
- Regional templates
volume_weight
Relative transaction volume for this company.
| Property | Value |
|---|---|
| Type | f64 |
| Required | Yes |
| Range | 0.0 - 1.0 |
| Constraint | Sum across all companies = 1.0 |
companies:
- code: "1000"
volume_weight: 0.5 # 50% of transactions
- code: "2000"
volume_weight: 0.3 # 30% of transactions
- code: "3000"
volume_weight: 0.2 # 20% of transactions
is_parent
Whether this company is the consolidation parent.
| Property | Value |
|---|---|
| Type | bool |
| Required | No |
| Default | false |
companies:
- code: "1000"
is_parent: true # Consolidation parent
- code: "2000"
is_parent: false # Subsidiary
Notes:
- Exactly one company should be is_parent: true for consolidation
- Parent receives elimination entries
parent_code
Reference to parent company for subsidiaries.
| Property | Value |
|---|---|
| Type | string or null |
| Required | No |
| Default | null |
companies:
- code: "1000"
is_parent: true
parent_code: null # No parent (is the parent)
- code: "2000"
is_parent: false
parent_code: "1000" # Owned by 1000
- code: "3000"
is_parent: false
parent_code: "2000" # Owned by 2000 (nested)
Examples
Single Company
companies:
- code: "1000"
name: "Demo Company"
currency: USD
country: US
volume_weight: 1.0
Multi-National
companies:
- code: "1000"
name: "Global Holdings Inc"
currency: USD
country: US
volume_weight: 0.4
is_parent: true
- code: "2000"
name: "European Operations GmbH"
currency: EUR
country: DE
volume_weight: 0.25
parent_code: "1000"
- code: "3000"
name: "UK Limited"
currency: GBP
country: GB
volume_weight: 0.15
parent_code: "2000"
- code: "4000"
name: "Asia Pacific Pte Ltd"
currency: SGD
country: SG
volume_weight: 0.2
parent_code: "1000"
Regional Structure
companies:
- code: "HQ"
name: "Headquarters"
currency: USD
country: US
volume_weight: 0.3
is_parent: true
- code: "NA01"
name: "North America Operations"
currency: USD
country: US
volume_weight: 0.3
parent_code: "HQ"
- code: "EU01"
name: "EMEA Operations"
currency: EUR
country: DE
volume_weight: 0.25
parent_code: "HQ"
- code: "AP01"
name: "APAC Operations"
currency: JPY
country: JP
volume_weight: 0.15
parent_code: "HQ"
Validation
| Check | Rule |
|---|---|
code | Must be unique |
volume_weight | Sum must equal 1.0 (±0.01) |
parent_code | Must reference existing company or be null |
is_parent | At most one true (if intercompany enabled) |
Intercompany Implications
When multiple companies exist:
- Intercompany transactions generated between companies
- FX rates generated for currency pairs
- Elimination entries created for parent
- Transfer pricing applied
See Intercompany Processing for details.
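With transfer_pricing.method: cost_plus and the configured markup_range, the intercompany price is simply the seller's cost plus a sampled markup. The sketch below shows that arithmetic; it is illustrative, not the generator's pricing code.
```python
# Sketch of cost-plus transfer pricing: intercompany price = seller cost plus
# a markup drawn from markup_range (illustrative, not the generator's code).
import random

def cost_plus_price(cost: float, markup_min: float = 0.03, markup_max: float = 0.10,
                    rng: random.Random = random.Random(42)) -> float:
    markup = rng.uniform(markup_min, markup_max)
    return round(cost * (1 + markup), 2)

print(cost_plus_price(10_000.0))   # roughly 10,300 - 11,000 depending on the draw
```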
Transactions
Transaction settings control journal entry generation.
Configuration
transactions:
target_count: 100000
line_items:
distribution: empirical
min_lines: 2
max_lines: 20
amounts:
min: 100
max: 1000000
distribution: log_normal
round_number_bias: 0.15
sources:
manual: 0.3
automated: 0.5
recurring: 0.15
adjustment: 0.05
benford:
enabled: true
temporal:
month_end_spike: 2.5
quarter_end_spike: 3.0
year_end_spike: 4.0
working_hours_only: true
Fields
target_count
Total number of journal entries to generate.
| Property | Value |
|---|---|
| Type | u64 |
| Required | Yes |
transactions:
target_count: 10000 # Small dataset
target_count: 100000 # Medium dataset
target_count: 1000000 # Large dataset
line_items
Controls the number of line items per journal entry.
distribution
| Value | Description |
|---|---|
empirical | Based on real-world GL research |
uniform | Equal probability for all counts |
custom | User-defined probabilities |
Empirical distribution (default):
- 2 lines: 60.68%
- 3 lines: 5.24%
- 4 lines: 17.32%
- Even counts: 88% preference
line_items:
distribution: empirical
Custom distribution:
line_items:
distribution: custom
custom_distribution:
2: 0.50
3: 0.10
4: 0.20
5: 0.10
6: 0.10
min_lines / max_lines
| Property | Value |
|---|---|
| Type | u32 |
| Default | 2 / 20 |
line_items:
min_lines: 2
max_lines: 10
amounts
Controls transaction amounts.
min / max
| Property | Value |
|---|---|
| Type | f64 |
| Required | Yes |
amounts:
min: 100 # Minimum amount
max: 1000000 # Maximum amount
distribution
| Value | Description |
|---|---|
log_normal | Log-normal distribution (realistic) |
uniform | Equal probability across range |
custom | User-defined |
amounts:
distribution: log_normal
round_number_bias
Preference for round numbers (100, 500, 1000, etc.).
| Property | Value |
|---|---|
| Type | f64 |
| Range | 0.0 - 1.0 |
| Default | 0.15 |
amounts:
round_number_bias: 0.15 # 15% round numbers
round_number_bias: 0.0 # No round number bias
sources
Transaction source distribution (weights must sum to 1.0).
| Source | Description |
|---|---|
manual | Manual journal entries |
automated | System-generated |
recurring | Scheduled recurring entries |
adjustment | Period-end adjustments |
sources:
manual: 0.3
automated: 0.5
recurring: 0.15
adjustment: 0.05
benford
Benford’s Law compliance for first-digit distribution.
benford:
enabled: true # Follow P(d) = log10(1 + 1/d)
enabled: false # Disable Benford compliance
Expected distribution (enabled):
| Digit | Probability |
|---|---|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |
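These probabilities follow directly from P(d) = log10(1 + 1/d); the table above can be reproduced with a few lines of Python (an illustrative check, not generator code):
```python
# Reproduce the Benford first-digit probabilities shown above: P(d) = log10(1 + 1/d)
import math

for d in range(1, 10):
    p = math.log10(1 + 1 / d)
    print(f"digit {d}: {p:.1%}")
# digit 1: 30.1%, digit 2: 17.6%, ..., digit 9: 4.6%
```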
temporal
Temporal patterns for transaction timing.
Spikes
Volume multipliers for period ends:
temporal:
month_end_spike: 2.5 # 2.5x volume at month end
quarter_end_spike: 3.0 # 3.0x at quarter end
year_end_spike: 4.0 # 4.0x at year end
Working Hours
Restrict transactions to business hours:
temporal:
working_hours_only: true # Mon-Fri, 8am-6pm
working_hours_only: false # Any time
Examples
High Volume Retail
transactions:
target_count: 500000
line_items:
distribution: empirical
min_lines: 2
max_lines: 6
amounts:
min: 10
max: 50000
distribution: log_normal
round_number_bias: 0.3
sources:
manual: 0.1
automated: 0.8
recurring: 0.08
adjustment: 0.02
temporal:
month_end_spike: 1.5
quarter_end_spike: 2.0
year_end_spike: 5.0
Low Volume Manual
transactions:
target_count: 5000
line_items:
distribution: empirical
amounts:
min: 1000
max: 10000000
sources:
manual: 0.6
automated: 0.2
recurring: 0.1
adjustment: 0.1
temporal:
month_end_spike: 3.0
quarter_end_spike: 4.0
year_end_spike: 5.0
working_hours_only: true
Testing/Development
transactions:
target_count: 1000
line_items:
distribution: uniform
min_lines: 2
max_lines: 4
amounts:
min: 100
max: 10000
distribution: uniform
round_number_bias: 0.0
sources:
manual: 1.0
benford:
enabled: false
Validation
| Check | Rule |
|---|---|
target_count | > 0 |
min_lines | ≥ 2 |
max_lines | ≥ min_lines |
amounts.min | > 0 |
amounts.max | > min |
round_number_bias | 0.0 - 1.0 |
sources | Sum = 1.0 (±0.01) |
*_spike | ≥ 1.0 |
Master Data
Master data settings control generation of business entities.
Configuration
master_data:
vendors:
count: 200
intercompany_ratio: 0.05
customers:
count: 500
intercompany_ratio: 0.05
materials:
count: 1000
fixed_assets:
count: 100
employees:
count: 50
hierarchy_depth: 4
Vendors
Supplier master data configuration.
master_data:
vendors:
count: 200 # Number of vendors
intercompany_ratio: 0.05 # IC vendor percentage
payment_terms:
- code: "NET30"
days: 30
weight: 0.5
- code: "NET60"
days: 60
weight: 0.3
- code: "NET10"
days: 10
weight: 0.2
behavior:
late_payment_rate: 0.1 # % with late payment tendency
discount_usage_rate: 0.3 # % that take early pay discounts
Generated Fields
| Field | Description |
|---|---|
vendor_id | Unique identifier |
vendor_name | Company name |
tax_id | Tax identification number |
payment_terms | Default payment terms |
currency | Transaction currency |
bank_account | Bank details |
is_intercompany | IC vendor flag |
valid_from | Temporal validity start |
Customers
Customer master data configuration.
master_data:
customers:
count: 500 # Number of customers
intercompany_ratio: 0.05 # IC customer percentage
credit_rating:
- code: "AAA"
limit_multiplier: 10.0
weight: 0.1
- code: "AA"
limit_multiplier: 5.0
weight: 0.2
- code: "A"
limit_multiplier: 2.0
weight: 0.4
- code: "B"
limit_multiplier: 1.0
weight: 0.3
payment_behavior:
on_time_rate: 0.7 # % that pay on time
early_payment_rate: 0.1 # % that pay early
late_payment_rate: 0.2 # % that pay late
Generated Fields
| Field | Description |
|---|---|
customer_id | Unique identifier |
customer_name | Company/person name |
credit_limit | Maximum credit |
credit_rating | Rating code |
payment_behavior | Payment tendency |
currency | Transaction currency |
is_intercompany | IC customer flag |
Materials
Product/material master data.
master_data:
materials:
count: 1000 # Number of materials
types:
raw_material: 0.3
work_in_progress: 0.1
finished_goods: 0.4
services: 0.2
valuation:
- method: fifo
weight: 0.3
- method: weighted_average
weight: 0.5
- method: standard_cost
weight: 0.2
Generated Fields
| Field | Description |
|---|---|
material_id | Unique identifier |
description | Material name |
material_type | Classification |
unit_of_measure | UOM |
valuation_method | Costing method |
standard_cost | Unit cost |
gl_account | Inventory account |
Fixed Assets
Capital asset master data.
master_data:
fixed_assets:
count: 100 # Number of assets
categories:
buildings: 0.1
machinery: 0.3
vehicles: 0.2
furniture: 0.2
it_equipment: 0.2
depreciation:
- method: straight_line
weight: 0.7
- method: declining_balance
weight: 0.2
- method: units_of_production
weight: 0.1
Generated Fields
| Field | Description |
|---|---|
asset_id | Unique identifier |
description | Asset name |
asset_class | Category |
acquisition_date | Purchase date |
acquisition_cost | Original cost |
useful_life | Years |
depreciation_method | Method |
salvage_value | Residual value |
Employees
User/employee master data.
master_data:
employees:
count: 50 # Number of employees
hierarchy_depth: 4 # Org chart depth
roles:
- name: "AP Clerk"
approval_limit: 5000
weight: 0.3
- name: "AP Manager"
approval_limit: 50000
weight: 0.1
- name: "AR Clerk"
approval_limit: 5000
weight: 0.3
- name: "Controller"
approval_limit: 500000
weight: 0.1
- name: "CFO"
approval_limit: 999999999
weight: 0.05
transaction_codes:
- "FB01" # Post document
- "FB50" # Enter GL
- "F-28" # Post incoming payment
- "F-53" # Post outgoing payment
Generated Fields
| Field | Description |
|---|---|
employee_id | Unique identifier |
name | Full name |
department | Department code |
role | Job role |
manager_id | Supervisor reference |
approval_limit | Max approval amount |
transaction_codes | Authorized T-codes |
HR and Payroll Integration (v0.6.0)
Employee master data serves as the foundation for the hr configuration section introduced in v0.6.0. When the HR module is enabled, each employee record drives downstream generation:
- Payroll: Salary, tax withholdings, benefits deductions, and retirement contributions are computed per employee based on their role and the salary ranges defined in hr.payroll.salary_ranges. The pay_frequency setting (monthly, biweekly, or weekly) determines how many payroll runs are generated per period.
- Time and Attendance: Time entries are generated for each employee according to working days in the period. The overtime_rate controls how many employees have overtime hours in a given period.
- Expense Reports: A subset of employees (controlled by hr.expenses.submission_rate) generate expense reports each month. Policy violations are injected at the configured policy_violation_rate.
The employees.count and employees.hierarchy_depth settings in master_data directly determine the population size for all HR outputs. Increasing the employee count will proportionally increase payroll journal entries, time records, and expense reports.
master_data:
employees:
count: 200 # Drives payroll and HR volume
hierarchy_depth: 5
hr:
enabled: true # Activates payroll, time, and expenses
payroll:
pay_frequency: "biweekly" # 26 pay periods per year
expenses:
submission_rate: 0.40 # 40% of employees submit per month
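As a rough sizing aid, the record volume implied by the example above can be estimated from the employee count, pay frequency, and submission rate (simple arithmetic, not the exact counts the generator emits):
```python
# Rough volume estimate for the HR example above (illustrative arithmetic only;
# actual record counts depend on working days, hire/termination dates, etc.).
employees = 200
pay_periods_per_year = {"monthly": 12, "biweekly": 26, "weekly": 52}["biweekly"]
expense_submission_rate = 0.40                            # 40% of employees per month

payroll_runs = pay_periods_per_year                       # 26 payroll runs per year
payroll_records = employees * pay_periods_per_year        # ~5,200 payslips per year
expense_reports = round(employees * expense_submission_rate * 12)  # ~960 per year

print(payroll_runs, payroll_records, expense_reports)     # 26 5200 960
```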
Examples
Small Company
master_data:
vendors:
count: 50
customers:
count: 100
materials:
count: 200
fixed_assets:
count: 20
employees:
count: 10
hierarchy_depth: 2
Large Enterprise
master_data:
vendors:
count: 2000
intercompany_ratio: 0.1
customers:
count: 10000
intercompany_ratio: 0.1
materials:
count: 50000
fixed_assets:
count: 5000
employees:
count: 500
hierarchy_depth: 8
Validation
| Check | Rule |
|---|---|
count | > 0 |
intercompany_ratio | 0.0 - 1.0 |
hierarchy_depth | ≥ 1 |
| Distribution weights | Sum = 1.0 |
Document Flows
Document flow settings control P2P (Procure-to-Pay) and O2C (Order-to-Cash) process generation, including document types, three-way matching, credit checks, and document chain management.
Configuration
document_flows:
p2p:
enabled: true
flow_rate: 0.3
completion_rate: 0.95
three_way_match:
quantity_tolerance: 0.02
price_tolerance: 0.01
o2c:
enabled: true
flow_rate: 0.3
completion_rate: 0.95
Procure-to-Pay (P2P)
Flow
Purchase Purchase Goods Vendor Three-Way
Requisition → Order → Receipt → Invoice → Match → Payment
│ │ │
│ ┌────┘ │
▼ ▼ ▼
AP Open Item ← Match Result AP Aging
Purchase Order Types
SyntheticData models 6 PO types, each with different downstream behavior:
| Type | Description | Requires GR? | Use Case |
|---|---|---|---|
Standard | Standard goods purchase | Yes | Most common PO type |
Service | Service procurement | No | Consulting, maintenance, etc. |
Framework | Blanket/framework agreement | Yes | Long-term supply agreements |
Consignment | Vendor-managed inventory | Yes | Consignment stock |
StockTransfer | Inter-plant transfer | Yes | Internal stock movement |
Subcontracting | External processing | Yes | Outsourced manufacturing |
Goods Receipt Movement Types
Goods receipts use SAP-style movement type codes:
| Movement Type | Code | Description |
|---|---|---|
GrForPo | 101 | Standard GR against purchase order |
ReturnToVendor | 122 | Return materials to vendor |
GrForProduction | 131 | GR from production order |
TransferPosting | 301 | Transfer between plants/locations |
InitialEntry | 561 | Initial stock entry |
Scrapping | 551 | Scrap disposal |
Consumption | 201 | Direct consumption posting |
Three-Way Match
The three-way match validator compares Purchase Order, Goods Receipt, and Vendor Invoice to detect variances before payment.
Algorithm
For each invoice line item:
1. Find matching PO line (by PO reference + line number)
2. Sum GR quantities for that PO line (supports multiple partial GRs)
3. Compare:
a. PO quantity vs GR quantity → QuantityPoGr variance
b. GR quantity vs Invoice quantity → QuantityGrInvoice variance
c. PO unit price vs Invoice unit price → PricePoInvoice variance
d. PO total vs Invoice total → TotalAmount variance
4. Apply tolerances:
- Quantity: ±quantity_tolerance (default 2%)
- Price: ±price_tolerance (default 5%)
- Absolute: ±absolute_amount_tolerance (default $0.01)
5. Check over-delivery:
- If GR qty > PO qty and allow_over_delivery=true:
allow up to max_over_delivery_pct (default 10%)
Variance Types
| Variance Type | Description | Detection |
|---|---|---|
QuantityPoGr | GR quantity differs from PO quantity | PO vs GR comparison |
QuantityGrInvoice | Invoice quantity differs from GR quantity | GR vs Invoice comparison |
PricePoInvoice | Invoice unit price differs from PO price | PO vs Invoice comparison |
TotalAmount | Total invoice amount mismatch | Overall amount check |
MissingLine | PO line not found in invoice or GR | Line matching |
ExtraLine | Invoice has lines not on PO | Line matching |
Match Outcomes
| Outcome | Meaning | Action |
|---|---|---|
passed | All within tolerance | Proceed to payment |
quantity_variance | Quantity outside tolerance | Review required |
price_variance | Price outside tolerance | Review required |
blocked | Multiple variances or critical mismatch | Manual resolution |
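The matching and tolerance rules described above can be sketched in a few lines; the function below is illustrative (field names and return values are simplified), not the matcher implemented in the generator crates.
```python
# Simplified sketch of the three-way match tolerance logic (illustrative only;
# the real matcher lives in the Rust generator crates).
from dataclasses import dataclass

@dataclass
class MatchConfig:
    quantity_tolerance: float = 0.02        # ±2% quantity variance
    price_tolerance: float = 0.05           # ±5% price variance
    allow_over_delivery: bool = True
    max_over_delivery_pct: float = 0.10     # up to 10% over-delivery

def match_line(po_qty, po_price, gr_qtys, inv_qty, inv_price, cfg=MatchConfig()):
    gr_total = sum(gr_qtys)                 # supports multiple partial GRs
    variances = []
    qty_dev = (gr_total - po_qty) / po_qty
    if abs(qty_dev) > cfg.quantity_tolerance:
        over_delivery_ok = cfg.allow_over_delivery and 0 < qty_dev <= cfg.max_over_delivery_pct
        if not over_delivery_ok:
            variances.append("quantity_po_gr")
    if abs(inv_qty - gr_total) / gr_total > cfg.quantity_tolerance:
        variances.append("quantity_gr_invoice")
    if abs(inv_price - po_price) / po_price > cfg.price_tolerance:
        variances.append("price_po_invoice")
    if not variances:
        return "passed"
    return "blocked" if len(variances) > 1 else variances[0]

# Two partial goods receipts fully covering the PO, invoice price 0.5% high: passes.
print(match_line(po_qty=100, po_price=10.00, gr_qtys=[60, 40], inv_qty=100, inv_price=10.05))
```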
Configuration
document_flows:
p2p:
three_way_match:
enabled: true
price_tolerance: 0.05 # 5% price variance allowed
quantity_tolerance: 0.02 # 2% quantity variance allowed
absolute_amount_tolerance: 0.01 # $0.01 rounding tolerance
allow_over_delivery: true
max_over_delivery_pct: 0.10 # 10% over-delivery allowed
P2P Stage Configuration
document_flows:
p2p:
enabled: true
flow_rate: 0.3 # 30% of JEs from P2P
completion_rate: 0.95 # 95% complete full flow
stages:
po_approval_rate: 0.9 # 90% of POs approved
gr_rate: 0.98 # 98% of POs get goods receipts
invoice_rate: 0.95 # 95% of GRs get invoices
payment_rate: 0.92 # 92% of invoices get paid
timing:
po_to_gr_days:
min: 1
max: 30
gr_to_invoice_days:
min: 1
max: 14
invoice_to_payment_days:
min: 10
max: 60
P2P Journal Entries
| Stage | Debit | Credit | Trigger |
|---|---|---|---|
| Goods Receipt | Inventory (1300) | GR/IR Clearing (2100) | GR posted |
| Invoice Receipt | GR/IR Clearing (2100) | Accounts Payable (2000) | Invoice verified |
| Payment | Accounts Payable (2000) | Cash (1000) | Payment executed |
| Price Variance | PPV Expense (5xxx) | GR/IR Clearing (2100) | Price mismatch |
Order-to-Cash (O2C)
Flow
Sales Credit Delivery Customer Customer
Order → Check → (Pick/ → Invoice → Receipt
│ Pack/ │ │
│ Ship) │ │
│ │ ▼ ▼
│ │ AR Open Item AR Aging
│ │ │
│ │ └→ Dunning (if overdue)
│ ▼
│ Inventory Issue
│ (COGS posting)
▼
Revenue Recognition
(ASC 606 / IFRS 15)
Sales Order Types
SyntheticData models 9 SO types:
| Type | Description | Requires Delivery? |
|---|---|---|
Standard | Standard sales order | Yes |
Rush | Priority/expedited order | Yes |
CashSale | Immediate payment at sale | Yes |
Return | Customer return order | No (creates return delivery) |
FreeOfCharge | No-charge delivery (samples, warranty) | Yes |
Consignment | Consignment fill-up/issue | Yes |
Service | Service order (no physical delivery) | No |
CreditMemoRequest | Request for credit memo | No |
DebitMemoRequest | Request for debit memo | No |
Delivery Types
6 delivery types model different fulfillment scenarios:
| Type | Description | Direction |
|---|---|---|
Outbound | Standard outbound delivery | Ship to customer |
Return | Customer return delivery | Receive from customer |
StockTransfer | Inter-plant stock transfer | Internal movement |
Replenishment | Replenishment delivery | Warehouse → store |
ConsignmentIssue | Issue from consignment stock | Consignment → customer |
ConsignmentReturn | Return to consignment stock | Customer → consignment |
Customer Invoice Types
7 invoice types with different accounting treatment:
| Type | Description | AR Impact |
|---|---|---|
Standard | Normal sales invoice | Creates receivable |
CreditMemo | Credit for returns/adjustments | Reduces receivable |
DebitMemo | Additional charge | Increases receivable |
ProForma | Pre-delivery invoice (no posting) | None |
DownPaymentRequest | Advance payment request | Creates special receivable |
FinalInvoice | Settles down payment | Clears down payment |
Intercompany | IC billing | Creates IC receivable |
Credit Check
Sales orders pass through credit verification before delivery:
document_flows:
o2c:
credit_check:
enabled: true
check_credit_limit: true # Verify customer limit
check_overdue: true # Check for past-due AR
block_threshold: 0.9 # Block if >90% of limit used
O2C Stage Configuration
document_flows:
o2c:
enabled: true
flow_rate: 0.3 # 30% of JEs from O2C
completion_rate: 0.95 # 95% complete full flow
stages:
so_approval_rate: 0.95 # 95% of SOs approved
credit_check_pass_rate: 0.9 # 90% pass credit check
delivery_rate: 0.98 # 98% of SOs get deliveries
invoice_rate: 0.95 # 95% of deliveries get invoices
collection_rate: 0.85 # 85% of invoices collected
timing:
so_to_delivery_days:
min: 1
max: 14
delivery_to_invoice_days:
min: 0
max: 3
invoice_to_payment_days:
min: 15
max: 90
O2C Journal Entries
| Stage | Debit | Credit | Trigger |
|---|---|---|---|
| Delivery | Cost of Goods Sold (5000) | Inventory (1300) | Goods issued |
| Invoice | Accounts Receivable (1100) | Revenue (4000) | Invoice posted |
| Receipt | Cash (1000) | Accounts Receivable (1100) | Payment received |
| Credit Memo | Revenue (4000) | Accounts Receivable (1100) | Credit issued |
Document Chain Manager
The document chain manager maintains referential integrity across the complete document flow by tracking references between documents.
Reference Types
| Type | Description | Example |
|---|---|---|
FollowOn | Next document in normal flow | PO → GR → Invoice → Payment |
Payment | Payment for invoice | PAY-001 → INV-001 |
Reversal | Correction or reversal document | CRED-001 → INV-001 |
Partial | Partial fulfillment | GR-001 (partial) → PO-001 |
CreditMemo | Credit against invoice | CM-001 → INV-001 |
DebitMemo | Debit against invoice | DM-001 → INV-001 |
Return | Return against delivery | RET-001 → DEL-001 |
IntercompanyMatch | IC matched pair | IC-INV-001 → IC-INV-002 |
Manual | User-defined reference | Any → Any |
Document Chain Output
PO-001 ─→ GR-001 ─→ INV-001 ─→ PAY-001
│ │ │ │
└──────────┴──────────┴──────────┘
Document Chain
The document_references.csv output file records all links:
| Field | Description |
|---|---|
source_document_id | Referencing document |
target_document_id | Referenced document |
reference_type | Type of reference |
created_date | Date reference was created |
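A quick referential-integrity check over document_references.csv can be written directly against these columns; the script below is illustrative, and the set of known document IDs must be collected from whichever document exports your run produced.
```python
# Illustrative referential-integrity check over document_references.csv,
# using the columns listed above. Collect `known_ids` from your run's
# document exports before calling this.
import csv

def check_references(ref_file: str, known_ids: set[str]) -> list[tuple[str, str]]:
    """Return (source, target) pairs whose target document is unknown."""
    dangling = []
    with open(ref_file, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["target_document_id"] not in known_ids:
                dangling.append((row["source_document_id"], row["target_document_id"]))
    return dangling

# Example usage with a pre-collected set of document IDs:
# dangling = check_references("output/document_references.csv", known_ids)
# print(f"{len(dangling)} dangling references")
```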
Complex Scenario Examples
Partial Deliveries with Split Invoice
document_flows:
p2p:
enabled: true
flow_rate: 0.4
completion_rate: 0.90 # 10% incomplete (partial deliveries)
three_way_match:
quantity_tolerance: 0.05 # 5% tolerance for partials
allow_over_delivery: true
max_over_delivery_pct: 0.10
timing:
po_to_gr_days: { min: 3, max: 45 } # Longer lead times
gr_to_invoice_days: { min: 1, max: 21 }
invoice_to_payment_days: { min: 30, max: 90 }
High-Volume Retail O2C
document_flows:
o2c:
enabled: true
flow_rate: 0.5 # 50% of JEs from O2C
completion_rate: 0.98 # High completion rate
stages:
so_approval_rate: 0.99 # Auto-approved
credit_check_pass_rate: 0.95
delivery_rate: 0.99
invoice_rate: 0.99
collection_rate: 0.92
timing:
so_to_delivery_days: { min: 0, max: 3 } # Fast fulfillment
delivery_to_invoice_days: { min: 0, max: 0 } # Immediate invoice
invoice_to_payment_days: { min: 10, max: 45 }
Combined Manufacturing P2P + O2C
document_flows:
p2p:
enabled: true
flow_rate: 0.35
completion_rate: 0.95
three_way_match:
quantity_tolerance: 0.02
price_tolerance: 0.01
timing:
po_to_gr_days: { min: 5, max: 30 }
gr_to_invoice_days: { min: 1, max: 10 }
invoice_to_payment_days: { min: 20, max: 45 }
o2c:
enabled: true
flow_rate: 0.35
completion_rate: 0.90
credit_check:
enabled: true
block_threshold: 0.85
timing:
so_to_delivery_days: { min: 3, max: 21 }
delivery_to_invoice_days: { min: 0, max: 2 }
invoice_to_payment_days: { min: 30, max: 60 }
Validation
| Check | Rule |
|---|---|
flow_rate | 0.0 - 1.0 |
completion_rate | 0.0 - 1.0 |
tolerance values | 0.0 - 1.0 |
timing.min | ≥ 0 |
timing.max | ≥ min |
| Stage rates | 0.0 - 1.0 |
See Also
- Subledgers — AR/AP records generated by document flows
- FX & Currency — Multi-currency document flows
- Master Data — Vendor and customer master records
- Process Chains — Enterprise process chain architecture
- Process Mining — OCEL 2.0 event logs from document flows
- datasynth-generators — Generator crate reference
Subledgers
SyntheticData generates subsidiary ledger records for Accounts Receivable (AR), Accounts Payable (AP), Fixed Assets (FA), and Inventory, with automatic GL reconciliation and document flow linking.
Overview
Subledger generators produce detailed records that reconcile back to GL control accounts:
| Subledger | Control Account | Record Types | Output Files |
|---|---|---|---|
| AR | 1100 (AR Control) | Open items, aging, receipts, credit memos, dunning | ar_open_items.csv, ar_aging.csv |
| AP | 2000 (AP Control) | Open items, aging, payment scheduling, debit memos | ap_open_items.csv, ap_aging.csv |
| FA | 1600+ (Asset accounts) | Register, depreciation, acquisitions, disposals | fa_register.csv, fa_depreciation.csv |
| Inventory | 1300 (Inventory) | Positions, movements (22 types), valuation | inventory_positions.csv, inventory_movements.csv |
Configuration
subledger:
enabled: true
ar:
enabled: true
aging_buckets: [30, 60, 90, 120] # Days
dunning_levels: 3
credit_memo_rate: 0.05 # 5% of invoices get credit memos
ap:
enabled: true
aging_buckets: [30, 60, 90, 120]
early_payment_discount_rate: 0.02
payment_scheduling: true
fa:
enabled: true
depreciation_methods:
- straight_line
- declining_balance
- sum_of_years_digits
disposal_rate: 0.03 # 3% of assets disposed per year
inventory:
enabled: true
valuation_method: standard_cost # standard_cost, moving_average, fifo, lifo
cycle_count_frequency: monthly
Accounts Receivable (AR)
Record Types
The AR subledger generates:
- Open Items: Outstanding customer invoices with aging classification
- Receipts: Customer payments applied to invoices (full, partial, on-account)
- Credit Memos: Credits issued for returns, disputes, or pricing adjustments
- Aging Reports: Aged balances by customer and aging bucket
- Dunning Notices: Automated collection notices at configurable levels
Open Item Fields
| Field | Description |
|---|---|
customer_id | Customer reference |
invoice_number | Document number |
invoice_date | Issue date |
due_date | Payment due date |
original_amount | Invoice total |
open_amount | Remaining balance |
currency | Invoice currency |
payment_terms | Net 30, Net 60, etc. |
aging_bucket | 0-30, 31-60, 61-90, 91-120, 120+ |
dunning_level | Current dunning level (0-3) |
last_dunning_date | Date of last dunning notice |
dispute_flag | Whether item is disputed |
Aging Buckets
Default aging buckets classify receivables by days past due:
| Bucket | Range | Typical % |
|---|---|---|
| Current | 0-30 days | 65-75% |
| 31-60 | 31-60 days | 12-18% |
| 61-90 | 61-90 days | 5-8% |
| 91-120 | 91-120 days | 2-4% |
| 120+ | Over 120 days | 1-3% |
Dunning Process
Dunning generates progressively urgent collection notices:
| Level | Days Overdue | Action |
|---|---|---|
| 0 | 0-30 | No action (within terms) |
| 1 | 31-60 | Friendly reminder |
| 2 | 61-90 | Formal notice |
| 3 | 90+ | Final demand / collections |
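The aging buckets and dunning levels above can be combined into a single classification step. The sketch below is illustrative, not the generator's implementation:
```python
# Sketch: classify an AR open item by days past due into the default aging
# bucket and dunning level shown in the tables above.
from datetime import date

def classify_open_item(due_date: date, as_of: date) -> tuple[str, int]:
    days_overdue = max(0, (as_of - due_date).days)
    if days_overdue <= 30:
        return "0-30", 0          # within terms, no dunning action
    if days_overdue <= 60:
        return "31-60", 1         # friendly reminder
    if days_overdue <= 90:
        return "61-90", 2         # formal notice
    if days_overdue <= 120:
        return "91-120", 3        # final demand / collections
    return "120+", 3

print(classify_open_item(date(2024, 3, 1), date(2024, 5, 15)))  # ('61-90', 2)
```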
Document Flow Integration
AR open items are created from O2C customer invoices:
Sales Order → Delivery → Customer Invoice → AR Open Item → Customer Receipt
│
└→ Dunning Notice (if overdue)
Accounts Payable (AP)
Record Types
The AP subledger generates:
- Open Items: Outstanding vendor invoices with aging and payment scheduling
- Payments: Vendor payment runs (check, wire, ACH)
- Debit Memos: Deductions for quality issues, returns, pricing errors
- Aging Reports: Aged payables by vendor
- Payment Scheduling: Planned payments considering cash flow and discounts
Open Item Fields
| Field | Description |
|---|---|
vendor_id | Vendor reference |
invoice_number | Vendor invoice number |
invoice_date | Invoice receipt date |
due_date | Payment due date |
baseline_date | Date for terms calculation |
original_amount | Invoice total |
open_amount | Remaining balance |
currency | Invoice currency |
payment_terms | 2/10 Net 30, etc. |
discount_date | Discount deadline |
discount_amount | Available discount |
payment_block | Block code (if blocked) |
three_way_match_status | Matched / Variance / Blocked |
Early Payment Discounts
The AP generator models cash discount optimization:
Payment Terms: 2/10 Net 30
→ Pay within 10 days: 2% discount
→ Pay by day 30: full amount
→ Past day 30: overdue
early_payment_discount_rate: 0.02 # Take 2% discount when offered
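For 2/10 Net 30 terms the trade-off is simple arithmetic; the numbers below are illustrative inputs, not defaults of the tool:
```python
# Illustrative: evaluate a 2/10 Net 30 invoice. Paying within the discount
# window saves 2%; the implied annualized benefit of the discount is high.
invoice_amount = 10_000.00
discount_rate = 0.02            # 2/10 Net 30
discount_days, net_days = 10, 30

discounted_payment = invoice_amount * (1 - discount_rate)
implied_annual_rate = (discount_rate / (1 - discount_rate)) * (365 / (net_days - discount_days))
print(f"pay {discounted_payment:,.2f} by day {discount_days}, "
      f"or {invoice_amount:,.2f} by day {net_days} "
      f"(~{implied_annual_rate:.0%} annualized benefit)")
```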
Payment Scheduling
When enabled, the AP generator creates a payment schedule that optimizes:
- Discount capture: Prioritize invoices with expiring discounts
- Cash flow: Spread payments across the period
- Vendor priority: Pay critical vendors first
Document Flow Integration
AP open items are created from P2P vendor invoices:
Purchase Order → Goods Receipt → Vendor Invoice → Three-Way Match → AP Open Item → Payment
│
└→ Debit Memo (if variance)
Fixed Assets (FA)
Record Types
The FA subledger generates:
- Asset Register: Master record for each fixed asset
- Depreciation Schedule: Monthly depreciation entries per asset
- Acquisitions: New asset additions (from PO or direct capitalization)
- Disposals: Asset retirements, sales, scrapping
- Transfers: Inter-company or inter-department transfers
- Impairment: Write-downs when fair value drops below book value
Asset Register Fields
| Field | Description |
|---|---|
asset_id | Unique identifier |
description | Asset name/description |
asset_class | Buildings, Equipment, Vehicles, IT, Furniture |
acquisition_date | Purchase/capitalization date |
acquisition_cost | Original cost |
useful_life_years | Depreciable life |
salvage_value | Residual value |
depreciation_method | Method used |
accumulated_depreciation | Total depreciation to date |
net_book_value | Current carrying value |
disposal_date | Date retired (if applicable) |
disposal_proceeds | Sale price (if sold) |
disposal_gain_loss | Gain or loss on disposal |
Depreciation Methods
| Method | Description | Use Case |
|---|---|---|
StraightLine | Equal amounts each period | Default, most common |
DecliningBalance { rate } | Fixed percentage of remaining balance | Accelerated (tax) |
SumOfYearsDigits | Decreasing fractions of depreciable base | Accelerated |
UnitsOfProduction { total_units } | Based on usage/output | Manufacturing equipment |
None | No depreciation | Land, construction in progress |
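As a quick illustration of how the first two methods differ (annual amounts with made-up inputs; the generator itself posts monthly entries):
```python
# Illustrative annual depreciation schedules for straight-line and
# declining-balance methods. Inputs are examples, not tool defaults.
def straight_line(cost, salvage, life_years):
    return [(cost - salvage) / life_years] * life_years

def declining_balance(cost, salvage, life_years, rate=0.4):
    book, schedule = cost, []
    for _ in range(life_years):
        dep = min(book * rate, max(book - salvage, 0.0))  # never depreciate below salvage
        schedule.append(dep)
        book -= dep
    return schedule

print(straight_line(cost=50_000, salvage=5_000, life_years=5))   # 9,000 per year
print([round(d) for d in declining_balance(50_000, 5_000, 5)])   # accelerated profile
```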
Depreciation Journal Entries
Each period, the FA generator creates depreciation entries:
| Debit | Credit | Amount |
|---|---|---|
| Depreciation Expense (6xxx) | Accumulated Depreciation (1650) | Period depreciation |
Disposal Accounting
When an asset is disposed:
| Scenario | Debit | Credit |
|---|---|---|
| Sale at gain | Cash, Accum Depr | Asset Cost, Gain on Disposal |
| Sale at loss | Cash, Accum Depr, Loss on Disposal | Asset Cost |
| Scrapping | Accum Depr, Loss on Disposal | Asset Cost |
Inventory
Record Types
The Inventory subledger generates:
- Positions: Current stock levels by material, plant, and storage location
- Movements: 22 movement types covering receipts, issues, transfers, and adjustments
- Valuation: Inventory value calculated using configurable valuation methods
Position Fields
| Field | Description |
|---|---|
material_id | Material reference |
plant | Plant/warehouse code |
storage_location | Storage location within plant |
quantity | Units on hand |
unit_of_measure | UOM |
unit_cost | Per-unit cost |
total_value | Extended value |
valuation_method | StandardCost, MovingAverage, FIFO, LIFO |
stock_status | Unrestricted, QualityInspection, Blocked |
last_movement_date | Date of last stock change |
Movement Types (22 types)
| Category | Movement Type | Description |
|---|---|---|
| Goods Receipt | GoodsReceiptPO | Receipt against purchase order |
| | GoodsReceiptProduction | Receipt from production order |
| | GoodsReceiptOther | Receipt without reference |
| | GoodsReceipt | Generic goods receipt |
| Returns | ReturnToVendor | Return materials to vendor |
| Goods Issue | GoodsIssueSales | Issue for sales order / delivery |
| | GoodsIssueProduction | Issue to production order |
| | GoodsIssueCostCenter | Issue to cost center (consumption) |
| | GoodsIssueScrapping | Scrap disposal |
| | GoodsIssue | Generic goods issue |
| | Scrap | Alias for scrapping |
| Transfers | TransferPlant | Between plants |
| | TransferStorageLocation | Between storage locations |
| | TransferIn | Inbound transfer |
| | TransferOut | Outbound transfer |
| | TransferToInspection | Move to quality inspection |
| | TransferFromInspection | Release from quality inspection |
| Adjustments | PhysicalInventory | Physical count difference |
| | InventoryAdjustmentIn | Positive adjustment |
| | InventoryAdjustmentOut | Negative adjustment |
| | InitialStock | Initial stock entry |
| Reversals | ReversalGoodsReceipt | Reverse a goods receipt |
| | ReversalGoodsIssue | Reverse a goods issue |
Valuation Methods
| Method | Description | Use Case |
|---|---|---|
StandardCost | Fixed cost per unit, variances posted separately | Manufacturing |
MovingAverage | Weighted average of all receipts | General purpose |
FIFO | First-in, first-out costing | Perishable goods |
LIFO | Last-in, first-out costing | Tax optimization (where permitted) |
Cycle Counting (v0.6.0)
The cycle_count_frequency setting controls how often physical inventory counts are performed. Cycle counting generates PhysicalInventory movement records that reconcile book quantities against counted quantities:
subledger:
inventory:
enabled: true
cycle_count_frequency: monthly # monthly, quarterly, annual
| Frequency | Behavior |
|---|---|
monthly | Each storage location counted once per month on a rolling basis |
quarterly | Full count once per quarter, with high-value items counted monthly |
annual | Single year-end wall-to-wall count |
Cycle count differences generate adjustment entries (InventoryAdjustmentIn or InventoryAdjustmentOut) and are flagged in the quality labels output for audit trail analysis.
Quality Inspection (v0.6.0)
Inventory positions can be placed in quality inspection status via TransferToInspection movements. This models the inspection hold process common in manufacturing and pharmaceutical industries:
Goods Receipt → Transfer to Inspection → QC Hold → Transfer from Inspection → Unrestricted Use
└→ Scrap (if rejected)
The rate of items routed through inspection depends on the material type and vendor scorecard grades (when source_to_pay is enabled). Materials from vendors with grade C or lower are routed through inspection at a higher rate.
Inventory Journal Entries
| Movement | Debit | Credit |
|---|---|---|
| Goods Receipt (PO) | Inventory | GR/IR Clearing |
| Goods Issue (Sales) | COGS | Inventory |
| Goods Issue (Production) | WIP | Inventory |
| Scrap | Scrap Expense | Inventory |
| Physical Count (surplus) | Inventory | Inventory Adjustment |
| Physical Count (shortage) | Inventory Adjustment | Inventory |
GL Reconciliation
The subledger generators ensure that subledger balances reconcile to GL control accounts:
GL Control Account Balance = Σ Subledger Open Items
AR Control (1100) = Σ AR Open Items
AP Control (2000) = Σ AP Open Items
Inventory (1300) = Σ Inventory Position Values
FA Gross (1600) = Σ FA Acquisition Costs
Accum Depr (1650) = Σ FA Accumulated Depreciation
Reconciliation is validated by the datasynth-eval coherence module and any differences are flagged as potential data quality issues.
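A minimal sketch of that rule, using hypothetical types rather than the actual datasynth-eval API:
use rust_decimal::Decimal;
/// A single open item in any subledger (AR, AP, inventory position, FA record).
struct OpenItem { amount: Decimal }
/// Reconcile a GL control account balance against the sum of its subledger items.
/// Returns the difference when it exceeds the tolerance, i.e. a potential issue.
fn reconcile(control_balance: Decimal, items: &[OpenItem], tolerance: Decimal) -> Option<Decimal> {
    let subledger_total: Decimal = items.iter().map(|i| i.amount).sum();
    let diff = (control_balance - subledger_total).abs();
    if diff > tolerance { Some(diff) } else { None }
}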
Output Files
| File | Content |
|---|---|
subledgers/ar_open_items.csv | AR outstanding invoices |
subledgers/ar_aging.csv | AR aging analysis |
subledgers/ap_open_items.csv | AP outstanding invoices |
subledgers/ap_aging.csv | AP aging analysis |
subledgers/fa_register.csv | Fixed asset master records |
subledgers/fa_depreciation.csv | Depreciation schedule entries |
subledgers/inventory_positions.csv | Current stock positions |
subledgers/inventory_movements.csv | Stock movement history |
See Also
- Document Flows — P2P and O2C document chains
- Financial Settings — Balance and period close config
- FX & Currency — Multi-currency subledger support
- datasynth-generators — Generator crate reference
FX & Currency
SyntheticData generates realistic foreign exchange rates, currency translation entries, and cumulative translation adjustments (CTA) for multi-currency enterprise simulation.
Overview
The FX module in datasynth-generators provides three generators:
| Generator | Purpose | Output |
|---|---|---|
| FX Rate Service | Daily exchange rates via Ornstein-Uhlenbeck process | fx/daily_rates.csv, fx/period_rates.csv |
| Currency Translator | Translate foreign-currency financials to reporting currency | consolidation/currency_translation.csv |
| CTA Generator | Cumulative Translation Adjustment for consolidation | consolidation/cta_entries.csv |
Configuration
fx:
enabled: true
base_currency: USD # Reporting/functional currency
currencies:
- code: EUR
initial_rate: 1.10
volatility: 0.08
mean_reversion: 0.05
- code: GBP
initial_rate: 1.27
volatility: 0.07
mean_reversion: 0.04
- code: JPY
initial_rate: 0.0067
volatility: 0.10
mean_reversion: 0.06
- code: CHF
initial_rate: 1.12
volatility: 0.06
mean_reversion: 0.03
translation:
method: current_rate # current_rate, temporal, monetary_non_monetary
equity_at_historical: true
income_at_average: true
cta:
enabled: true
equity_account: "3900" # CTA equity account
FX Rate Service
Ornstein-Uhlenbeck Process
Exchange rates are generated using a mean-reverting stochastic process (Ornstein-Uhlenbeck), which models the tendency of exchange rates to revert toward a long-term equilibrium:
dX(t) = θ(μ - X(t))dt + σdW(t)
where:
X(t) = log exchange rate at time t
θ = mean reversion speed (mean_reversion config)
μ = long-term mean (derived from initial_rate)
σ = volatility
dW(t) = Wiener process (random walk)
This produces rates that:
- Mean-revert: Rates drift back toward the initial level over time
- Have realistic volatility: Day-to-day movements match configurable volatility targets
- Are serially correlated: Today’s rate depends on yesterday’s rate (not i.i.d.)
- Are deterministic: Given the same seed, rates are exactly reproducible
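A minimal sketch of one possible discretization (Euler–Maruyama on the log rate, Box–Muller for the normal draws); parameter names mirror the config above, though the generator's internal stepping may differ:
use rand::{rngs::StdRng, Rng, SeedableRng};
/// Simulate daily spot rates from a mean-reverting (Ornstein-Uhlenbeck) process
/// on the log rate. mean_reversion = θ, volatility = σ, μ = ln(initial_rate).
fn simulate_rates(initial_rate: f64, mean_reversion: f64, volatility: f64, days: usize, seed: u64) -> Vec<f64> {
    let mut rng = StdRng::seed_from_u64(seed); // same seed -> same rate path
    let mu = initial_rate.ln();                // long-term mean of the log rate
    let mut x = mu;
    let dt = 1.0_f64;                          // one business day per step
    (0..days)
        .map(|_| {
            // Box-Muller: two uniform draws -> one standard normal draw
            let u1: f64 = 1.0 - rng.gen::<f64>(); // in (0, 1], safe for ln()
            let u2: f64 = rng.gen();
            let z = (-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos();
            x += mean_reversion * (mu - x) * dt + volatility * dt.sqrt() * z;
            x.exp()                            // back from log space to a spot rate
        })
        .collect()
}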
Rate Types
| Rate Type | Usage | Calculation |
|---|---|---|
| Daily spot | Transaction-date rates | O-U process output for each business day |
| Period average | Income statement translation | Arithmetic mean of daily rates within the period |
| Period closing | Balance sheet translation | Last business day rate in the period |
| Historical | Equity items | Rate at the date equity was contributed |
Output: daily_rates.csv
| Field | Description |
|---|---|
date | Business day |
from_currency | Source currency (e.g., EUR) |
to_currency | Target currency (e.g., USD) |
spot_rate | Daily spot rate |
inverse_rate | 1 / spot_rate |
Output: period_rates.csv
| Field | Description |
|---|---|
period | Fiscal period (YYYY-MM) |
from_currency | Source currency |
to_currency | Target currency |
average_rate | Period average |
closing_rate | Period-end closing rate |
Currency Translation
Translation Methods
SyntheticData supports three standard currency translation methods:
Current Rate Method (ASC 830 / IAS 21 — default)
The most common method, used for foreign subsidiaries whose functional currency differs from the reporting currency:
| Item | Rate Used |
|---|---|
| Assets | Closing rate |
| Liabilities | Closing rate |
| Equity (contributed capital) | Historical rate |
| Equity (retained earnings) | Rolled-forward |
| Revenue | Average rate |
| Expenses | Average rate |
| Dividends | Rate on declaration date |
| CTA | Balancing item → Equity |
Temporal Method (ASC 830)
Used when the foreign operation’s functional currency is the parent’s currency (e.g., highly inflationary economies):
| Item | Rate Used |
|---|---|
| Monetary assets/liabilities | Closing rate |
| Non-monetary assets (at cost) | Historical rate |
| Non-monetary assets (at fair value) | Rate at fair value date |
| Revenue | Average rate |
| Expenses | Average rate |
| Depreciation | Historical rate of related asset |
| Remeasurement gain/loss | Income statement |
Monetary/Non-Monetary Method
| Item | Rate Used |
|---|---|
| Monetary items | Closing rate |
| Non-monetary items | Historical rate |
Translation Configuration
fx:
translation:
method: current_rate # current_rate | temporal | monetary_non_monetary
equity_at_historical: true
income_at_average: true
CTA Generator
The Cumulative Translation Adjustment arises because assets/liabilities are translated at closing rates while equity is at historical rates. The CTA is posted to Other Comprehensive Income (OCI) in equity:
CTA = Translated Net Assets (at closing rate)
- Translated Equity (at historical rates)
- Translated Net Income (at average rate)
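Expressed as code, the balancing calculation is a simple difference (a sketch using rust_decimal, not the generator's exact signature):
use rust_decimal::Decimal;
/// CTA is the plug that keeps the translated balance sheet in balance:
/// net assets at the closing rate, less equity at historical rates,
/// less net income at the average rate.
fn cta_for_period(
    net_assets_at_closing: Decimal,
    equity_at_historical: Decimal,
    net_income_at_average: Decimal,
) -> Decimal {
    net_assets_at_closing - equity_at_historical - net_income_at_average
}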
CTA Journal Entry
| Debit | Credit | Description |
|---|---|---|
| CTA (Equity 3900) | Various BS accounts | Translation adjustment for period |
The CTA accumulates over time and is only recycled to the income statement when a foreign subsidiary is disposed of.
Configuration
fx:
cta:
enabled: true
equity_account: "3900" # OCI - CTA account
Multi-Currency Company Configuration
Multi-currency scenarios require companies with different functional currencies:
companies:
- code: C001
name: "US Parent Corp"
currency: USD
country: US
- code: C002
name: "European Subsidiary"
currency: EUR
country: DE
- code: C003
name: "UK Subsidiary"
currency: GBP
country: GB
- code: C004
name: "Japan Subsidiary"
currency: JPY
country: JP
fx:
enabled: true
base_currency: USD
currencies:
- { code: EUR, initial_rate: 1.10, volatility: 0.08, mean_reversion: 0.05 }
- { code: GBP, initial_rate: 1.27, volatility: 0.07, mean_reversion: 0.04 }
- { code: JPY, initial_rate: 0.0067, volatility: 0.10, mean_reversion: 0.06 }
intercompany:
enabled: true
# IC transactions generate FX exposure
Output Files
| File | Content |
|---|---|
fx/daily_rates.csv | Daily spot rates for all currency pairs |
fx/period_rates.csv | Period average and closing rates |
consolidation/currency_translation.csv | Translation entries per entity/period |
consolidation/cta_entries.csv | CTA adjustments (if CTA enabled) |
consolidation/consolidated_trial_balance.csv | Translated and consolidated TB |
See Also
- Financial Settings — Intercompany and consolidation config
- Intercompany Processing — IC matching and elimination
- Subledgers — Multi-currency subledger records
- Period Close Engine — Month-end FX revaluation
Financial Settings
Financial settings control balance coherence, subledger generation, foreign exchange, and period close behavior.
Balance Configuration
balance:
opening_balance:
enabled: true
total_assets: 10000000
coherence_check:
enabled: true
tolerance: 0.01
Opening Balance
Generate coherent opening balance sheet:
balance:
opening_balance:
enabled: true
total_assets: 10000000 # Total asset value
structure: # Balance sheet structure
current_assets: 0.3
fixed_assets: 0.5
other_assets: 0.2
current_liabilities: 0.2
long_term_debt: 0.3
equity: 0.5
Balance Coherence
Verify accounting equation:
balance:
coherence_check:
enabled: true # Verify Assets = L + E
tolerance: 0.01 # Allowed rounding variance
frequency: monthly # When to check
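The check itself is just the accounting equation within the configured tolerance; a minimal sketch:
use rust_decimal::Decimal;
/// Verify Assets = Liabilities + Equity within the configured rounding tolerance.
fn coherence_ok(assets: Decimal, liabilities: Decimal, equity: Decimal, tolerance: Decimal) -> bool {
    (assets - (liabilities + equity)).abs() <= tolerance
}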
Subledger Configuration
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90, 120]
ap:
enabled: true
aging_buckets: [30, 60, 90]
fixed_assets:
enabled: true
depreciation_methods:
- straight_line
- declining_balance
inventory:
enabled: true
valuation_methods:
- fifo
- weighted_average
Accounts Receivable
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90, 120] # Aging period boundaries
collection:
on_time_rate: 0.7 # % paid within terms
write_off_rate: 0.02 # % written off
reconciliation:
enabled: true # Reconcile to GL
control_account: "1100" # AR control account
Accounts Payable
subledger:
ap:
enabled: true
aging_buckets: [30, 60, 90]
payment:
discount_usage_rate: 0.3 # % taking early pay discount
late_payment_rate: 0.1 # % paid late
reconciliation:
enabled: true
control_account: "2000" # AP control account
Fixed Assets
subledger:
fixed_assets:
enabled: true
depreciation_methods:
- method: straight_line
weight: 0.7
- method: declining_balance
rate: 0.2
weight: 0.2
- method: units_of_production
weight: 0.1
disposal:
rate: 0.05 # Annual disposal rate
gain_loss_account: "8000" # Gain/loss account
reconciliation:
enabled: true
control_accounts:
asset: "1500"
depreciation: "1510"
Inventory
subledger:
inventory:
enabled: true
valuation_methods:
- method: fifo
weight: 0.3
- method: weighted_average
weight: 0.5
- method: standard_cost
weight: 0.2
movements:
receipt_weight: 0.4
issue_weight: 0.4
adjustment_weight: 0.1
transfer_weight: 0.1
reconciliation:
enabled: true
control_account: "1200"
FX Configuration
fx:
enabled: true
base_currency: USD
currency_pairs:
- EUR
- GBP
- CHF
- JPY
volatility: 0.01
translation:
method: current_rate
Exchange Rates
fx:
enabled: true
base_currency: USD # Reporting currency
currency_pairs: # Currencies to generate
- EUR
- GBP
- CHF
rate_types:
- spot # Daily spot rates
- closing # Period closing rates
- average # Period average rates
volatility: 0.01 # Daily volatility
mean_reversion: 0.1 # Ornstein-Uhlenbeck parameter
Currency Translation
fx:
translation:
method: current_rate # current_rate, temporal
rate_mapping:
assets: closing_rate
liabilities: closing_rate
equity: historical_rate
revenue: average_rate
expense: average_rate
cta_account: "3500" # CTA equity account
Period Close Configuration
period_close:
enabled: true
monthly:
accruals: true
depreciation: true
quarterly:
intercompany_elimination: true
annual:
closing_entries: true
retained_earnings: true
Monthly Close
period_close:
monthly:
accruals:
enabled: true
auto_reverse: true # Reverse in next period
categories:
- expense_accrual
- revenue_accrual
- payroll_accrual
depreciation:
enabled: true
run_date: last_day # When to run
reconciliation:
enabled: true
subledger_to_gl: true
Quarterly Close
period_close:
quarterly:
intercompany_elimination:
enabled: true
types:
- intercompany_sales
- intercompany_profit
- intercompany_dividends
currency_translation:
enabled: true
Annual Close
period_close:
annual:
closing_entries:
enabled: true
close_revenue: true
close_expense: true
retained_earnings:
enabled: true
account: "3100"
year_end_adjustments:
- bad_debt_provision
- inventory_reserve
- bonus_accrual
Combined Example
balance:
opening_balance:
enabled: true
total_assets: 50000000
coherence_check:
enabled: true
subledger:
ar:
enabled: true
aging_buckets: [30, 60, 90, 120, 180]
ap:
enabled: true
aging_buckets: [30, 60, 90]
fixed_assets:
enabled: true
inventory:
enabled: true
fx:
enabled: true
base_currency: USD
currency_pairs: [EUR, GBP, CHF, JPY, CNY]
volatility: 0.012
period_close:
enabled: true
monthly:
accruals: true
depreciation: true
quarterly:
intercompany_elimination: true
annual:
closing_entries: true
retained_earnings: true
Financial Reporting (v0.6.0)
The financial_reporting section generates structured financial statements, management KPIs, and budgets derived from the underlying journal entries, trial balances, and period close data.
Financial Statements
financial_reporting:
enabled: true
generate_balance_sheet: true # Balance sheet
generate_income_statement: true # Income statement / P&L
generate_cash_flow: true # Cash flow statement
generate_changes_in_equity: true # Statement of changes in equity
comparative_periods: 1 # Number of prior-period comparatives
When enabled, the generator produces financial statements at each period close. The comparative_periods setting controls how many prior periods are included for comparative analysis. Statements are aggregated from the trial balance and subledger data, ensuring consistency with the underlying journal entries.
Management KPIs
financial_reporting:
management_kpis:
enabled: true
frequency: "monthly" # monthly or quarterly
Management KPIs include ratios and metrics computed from the generated financial data:
| KPI Category | Examples |
|---|---|
| Liquidity | Current ratio, quick ratio, cash conversion cycle |
| Profitability | Gross margin, operating margin, ROE, ROA |
| Efficiency | Inventory turnover, receivables turnover, asset turnover |
| Leverage | Debt-to-equity, interest coverage |
Budgets
financial_reporting:
budgets:
enabled: true
revenue_growth_rate: 0.05 # 5% expected growth
expense_inflation_rate: 0.03 # 3% cost inflation
variance_noise: 0.10 # 10% random noise on actuals vs budget
Budget generation creates a budget line for each GL account based on prior-period actuals, adjusted by the configured growth and inflation rates. The variance_noise parameter controls the spread between budget and actual figures, producing realistic budget-to-actual variance reports.
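A sketch of the budget-versus-actual mechanics these settings imply (hypothetical helper; noise drawn uniformly within ±variance_noise):
use rand::{rngs::StdRng, Rng, SeedableRng};
/// Derive a budget line from prior-period actuals, then perturb the
/// current-period actual around it to create realistic variances.
fn budget_and_actual(prior_actual: f64, growth_rate: f64, variance_noise: f64, seed: u64) -> (f64, f64) {
    let mut rng = StdRng::seed_from_u64(seed);
    let budget = prior_actual * (1.0 + growth_rate);              // e.g. revenue_growth_rate = 0.05
    let shock = rng.gen_range(-variance_noise..=variance_noise);  // e.g. variance_noise = 0.10
    let actual = budget * (1.0 + shock);
    (budget, actual)
}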
See Also
Compliance
Compliance settings control fraud injection, internal controls, and approval workflows.
Fraud Configuration
fraud:
enabled: true
fraud_rate: 0.005
types:
fictitious_transaction: 0.15
revenue_manipulation: 0.10
expense_capitalization: 0.10
split_transaction: 0.15
round_tripping: 0.05
kickback_scheme: 0.10
ghost_employee: 0.05
duplicate_payment: 0.15
unauthorized_discount: 0.10
suspense_abuse: 0.05
Fraud Rate
Overall percentage of fraudulent transactions:
fraud:
enabled: true
fraud_rate: 0.005 # 0.5% fraud rate
fraud_rate: 0.01 # 1% fraud rate
fraud_rate: 0.001 # 0.1% fraud rate
Fraud Types
| Type | Description |
|---|---|
fictitious_transaction | Completely fabricated entries |
revenue_manipulation | Premature/delayed revenue recognition |
expense_capitalization | Improper capitalization of expenses |
split_transaction | Split to avoid approval thresholds |
round_tripping | Circular transactions to inflate revenue |
kickback_scheme | Vendor kickback arrangements |
ghost_employee | Payments to non-existent employees |
duplicate_payment | Same invoice paid multiple times |
unauthorized_discount | Unapproved customer discounts |
suspense_abuse | Hiding items in suspense accounts |
Fraud Patterns
fraud:
patterns:
threshold_adjacent:
enabled: true
threshold: 10000 # Approval threshold
range: 0.1 # % below threshold
time_based:
weekend_preference: 0.3 # Weekend entry rate
after_hours_preference: 0.2 # After hours rate
entity_targeting:
repeat_offender_rate: 0.4 # Rate at which the same user commits multiple frauds
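To make threshold_adjacent concrete, a sketch of how an amount just under an approval threshold could be drawn (hypothetical helper, not the injection engine's API):
use rand::{rngs::StdRng, Rng, SeedableRng};
/// Draw a fraudulent amount in [threshold * (1 - range), threshold),
/// i.e. just below the approval threshold so it escapes higher-level review.
fn threshold_adjacent_amount(threshold: f64, range: f64, seed: u64) -> f64 {
    let mut rng = StdRng::seed_from_u64(seed);
    rng.gen_range(threshold * (1.0 - range)..threshold)
}
With threshold: 10000 and range: 0.1, generated amounts fall between 9,000 and just under 10,000, staying beneath the level that would trigger additional approval.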
Internal Controls Configuration
internal_controls:
enabled: true
controls:
- id: "CTL-001"
name: "Payment Approval"
type: preventive
frequency: continuous
assertions:
- authorization
- validity
sod_rules:
- conflict_type: create_approve
processes: [ap_invoice, ap_payment]
Control Definition
internal_controls:
controls:
- id: "CTL-001"
name: "Payment Approval"
description: "Payments require manager approval"
type: preventive # preventive, detective
frequency: continuous # continuous, daily, weekly, monthly
assertions:
- authorization
- validity
- completeness
accounts: ["2000"] # Applicable accounts
threshold: 5000 # Trigger threshold
- id: "CTL-002"
name: "Journal Entry Review"
type: detective
frequency: daily
assertions:
- accuracy
- completeness
Control Types
| Type | Description |
|---|---|
preventive | Prevents errors/fraud before occurrence |
detective | Detects errors/fraud after occurrence |
Control Assertions
| Assertion | Description |
|---|---|
authorization | Proper approval obtained |
validity | Transaction is legitimate |
completeness | All transactions recorded |
accuracy | Amounts are correct |
cutoff | Recorded in correct period |
classification | Properly categorized |
Segregation of Duties
internal_controls:
sod_rules:
- conflict_type: create_approve
processes: [ap_invoice, ap_payment]
description: "Cannot create and approve payments"
- conflict_type: create_approve
processes: [ar_invoice, ar_receipt]
- conflict_type: custody_recording
processes: [cash_handling, cash_recording]
- conflict_type: authorization_custody
processes: [vendor_master, ap_payment]
SoD Conflict Types
| Type | Description |
|---|---|
create_approve | Create and approve same transaction |
custody_recording | Physical custody and recording |
authorization_custody | Authorization and physical access |
create_modify | Create and modify master data |
Approval Configuration
approval:
enabled: true
thresholds:
- level: 1
name: "Clerk"
max_amount: 5000
- level: 2
name: "Supervisor"
max_amount: 25000
- level: 3
name: "Manager"
max_amount: 100000
- level: 4
name: "Director"
max_amount: 500000
- level: 5
name: "Executive"
max_amount: null # Unlimited
Approval Thresholds
approval:
thresholds:
- level: 1
name: "Level 1 - Clerk"
max_amount: 5000
auto_approve: false
- level: 2
name: "Level 2 - Supervisor"
max_amount: 25000
auto_approve: false
- level: 3
name: "Level 3 - Manager"
max_amount: 100000
requires_dual: false # Single approver
- level: 4
name: "Level 4 - Director"
max_amount: 500000
requires_dual: true # Dual approval required
Approval Process
approval:
process:
workflow: hierarchical # hierarchical, matrix
escalation_days: 3 # Auto-escalate after N days
reminder_days: 1 # Send reminder after N days
exceptions:
recurring_exempt: true # Skip for recurring entries
system_exempt: true # Skip for system entries
Combined Example
fraud:
enabled: true
fraud_rate: 0.005
types:
fictitious_transaction: 0.15
split_transaction: 0.20
duplicate_payment: 0.15
ghost_employee: 0.10
kickback_scheme: 0.10
revenue_manipulation: 0.10
expense_capitalization: 0.10
unauthorized_discount: 0.10
internal_controls:
enabled: true
controls:
- id: "SOX-001"
name: "Payment Authorization"
type: preventive
frequency: continuous
threshold: 10000
- id: "SOX-002"
name: "JE Review"
type: detective
frequency: daily
sod_rules:
- conflict_type: create_approve
processes: [ap_invoice, ap_payment]
- conflict_type: create_approve
processes: [ar_invoice, ar_receipt]
- conflict_type: create_modify
processes: [vendor_master, ap_invoice]
approval:
enabled: true
thresholds:
- level: 1
max_amount: 5000
- level: 2
max_amount: 25000
- level: 3
max_amount: 100000
- level: 4
max_amount: null
Validation
| Check | Rule |
|---|---|
fraud_rate | 0.0 - 1.0 |
fraud.types | Sum = 1.0 |
control.id | Unique |
thresholds | Strictly ascending |
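A sketch of the checks in this table (hypothetical helpers; the actual rules live in datasynth-config validation):
/// Fraud type weights must sum to 1.0 (within floating-point tolerance).
fn fraud_types_valid(weights: &[f64]) -> bool {
    (weights.iter().sum::<f64>() - 1.0).abs() < 1e-9
}
/// Approval thresholds must be strictly ascending; `null` (unlimited)
/// may only appear as the final level.
fn thresholds_valid(max_amounts: &[Option<f64>]) -> bool {
    let mut prev = f64::MIN;
    for (i, amount) in max_amounts.iter().enumerate() {
        match amount {
            Some(a) if *a > prev => prev = *a,
            None if i == max_amounts.len() - 1 => {} // unlimited top level
            _ => return false,
        }
    }
    true
}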
Synthetic Data Certificates (v0.5.0)
Certificates provide cryptographic proof of the privacy guarantees and quality metrics of generated data.
certificates:
enabled: true
issuer: "DataSynth"
include_quality_metrics: true
When enabled, a certificate.json file is produced alongside the output containing:
- DP Guarantee: Mechanism (Laplace/Gaussian), epsilon, delta, composition method
- Quality Metrics: Benford MAD, correlation preservation, statistical fidelity, MIA AUC
- Config Hash: SHA-256 hash of the generation configuration
- Signature: HMAC-SHA256 signature for tamper detection
- Fingerprint Hash: Hash of source fingerprint (if fingerprint-based generation)
The certificate can be embedded in Parquet file metadata or included as a separate JSON file.
# Generate with certificate
datasynth-data generate --config config.yaml --output ./output --certificate
# Certificate is written to ./output/certificate.json
See Also
Output Settings
Output settings control file formats and organization.
Configuration
output:
format: csv
compression: none
compression_level: 6
files:
journal_entries: true
acdoca: true
master_data: true
documents: true
subledgers: true
trial_balances: true
labels: true
controls: true
Format
Output file format selection.
output:
format: csv # CSV format (default)
format: json # JSON format
format: jsonl # Newline-delimited JSON
format: parquet # Apache Parquet columnar
format: sap # SAP S/4HANA table format
format: oracle # Oracle EBS GL tables
format: netsuite # NetSuite journal entries
CSV Format
Standard comma-separated values:
document_id,posting_date,company_code,account,debit,credit
abc-123,2024-01-15,1000,1100,"1000.00","0.00"
abc-123,2024-01-15,1000,4000,"0.00","1000.00"
Characteristics:
- UTF-8 encoding
- Header row included
- Quoted strings when needed
- Decimals as strings
JSON Format
Structured JSON with nested objects:
[
{
"header": {
"document_id": "abc-123",
"posting_date": "2024-01-15",
"company_code": "1000"
},
"lines": [
{"account": "1100", "debit": "1000.00", "credit": "0.00"},
{"account": "4000", "debit": "0.00", "credit": "1000.00"}
]
}
]
Parquet Format
Apache Parquet columnar format for analytics:
output:
format: parquet
compression: snappy # snappy (default), gzip, zstd
Parquet files are self-describing with embedded schema and support columnar compression. Ideal for Spark, DuckDB, Polars, pandas, and cloud data warehouses.
ERP Formats
Export in native ERP table schemas for load testing and integration validation:
# SAP S/4HANA
output:
format: sap
sap:
tables: [bkpf, bseg, acdoca, lfa1, kna1, mara, csks, cepc]
client: "100"
# Oracle EBS
output:
format: oracle
oracle:
ledger_id: 1
# NetSuite
output:
format: netsuite
netsuite:
subsidiary_id: 1
include_custom_fields: true
See ERP Output Formats for full field mappings.
Streaming Mode
Enable streaming output for memory-efficient generation of large datasets:
output:
format: csv # Any format
streaming: true # Enable streaming mode
See Streaming Output for details.
Compression
File compression options.
output:
compression: none # No compression
compression: gzip # Gzip compression (.gz)
compression: zstd # Zstandard compression (.zst)
Compression Level
When compression is enabled:
output:
compression: gzip
compression_level: 6 # 1-9, higher = smaller + slower
| Level | Speed | Size | Use Case |
|---|---|---|---|
| 1 | Fastest | Largest | Quick compression |
| 6 | Balanced | Medium | General use (default) |
| 9 | Slowest | Smallest | Maximum compression |
Compression Comparison
| Compression | Extension | Speed | Ratio |
|---|---|---|---|
none | .csv | N/A | 1.0 |
gzip | .csv.gz | Medium | ~0.15 |
zstd | .csv.zst | Fast | ~0.12 |
File Selection
Control which files are generated:
output:
files:
# Core transaction data
journal_entries: true # journal_entries.csv
acdoca: true # acdoca.csv (SAP format)
# Master data
master_data: true # vendors.csv, customers.csv, etc.
# Document flow
documents: true # purchase_orders.csv, invoices.csv, etc.
# Subsidiary ledgers
subledgers: true # ar_open_items.csv, ap_open_items.csv, etc.
# Period close
trial_balances: true # trial_balances/*.csv
# ML labels
labels: true # anomaly_labels.csv, fraud_labels.csv
# Controls
controls: true # internal_controls.csv, sod_rules.csv
Output Directory Structure
With all files enabled:
output/
├── master_data/
│ ├── chart_of_accounts.csv
│ ├── vendors.csv
│ ├── customers.csv
│ ├── materials.csv
│ ├── fixed_assets.csv
│ └── employees.csv
├── transactions/
│ ├── journal_entries.csv
│ └── acdoca.csv
├── documents/
│ ├── purchase_orders.csv
│ ├── goods_receipts.csv
│ ├── vendor_invoices.csv
│ ├── payments.csv
│ ├── sales_orders.csv
│ ├── deliveries.csv
│ ├── customer_invoices.csv
│ └── customer_receipts.csv
├── subledgers/
│ ├── ar_open_items.csv
│ ├── ar_aging.csv
│ ├── ap_open_items.csv
│ ├── ap_aging.csv
│ ├── fa_register.csv
│ ├── fa_depreciation.csv
│ ├── inventory_positions.csv
│ └── inventory_movements.csv
├── period_close/
│ └── trial_balances/
│ ├── 2024_01.csv
│ ├── 2024_02.csv
│ └── ...
├── consolidation/
│ ├── eliminations.csv
│ └── currency_translation.csv
├── fx/
│ ├── daily_rates.csv
│ └── period_rates.csv
├── graphs/ # If graph_export enabled
│ ├── pytorch_geometric/
│ └── neo4j/
├── labels/
│ ├── anomaly_labels.csv
│ └── fraud_labels.csv
└── controls/
├── internal_controls.csv
├── control_mappings.csv
└── sod_rules.csv
Examples
Development (Fast)
output:
format: csv
compression: none
files:
journal_entries: true
master_data: true
labels: true
Production (Compact)
output:
format: csv
compression: zstd
compression_level: 6
files:
journal_entries: true
acdoca: true
master_data: true
documents: true
subledgers: true
trial_balances: true
labels: true
controls: true
ML Training Focus
output:
format: csv
compression: gzip
files:
journal_entries: true
labels: true # Important for supervised learning
master_data: true # For feature engineering
SAP Integration
output:
format: csv
compression: none
files:
journal_entries: false
acdoca: true # SAP ACDOCA format
master_data: true
documents: true
Validation
| Check | Rule |
|---|---|
format | One of: csv, json, jsonl, parquet, sap, oracle, netsuite |
compression | none, gzip, or zstd |
compression_level | 1-9 (only when compression enabled) |
See Also
AI & ML Features Configuration
New in v0.5.0
This page documents the configuration for DataSynth’s AI and ML-powered generation features: LLM-augmented generation, diffusion models, causal generation, and synthetic data certificates.
LLM Configuration
Configure the LLM provider for metadata enrichment and natural language configuration.
llm:
provider: mock # Provider type
model: "gpt-4o-mini" # Model identifier
api_key_env: "OPENAI_API_KEY" # Environment variable for API key
base_url: null # Custom API endpoint (for 'custom' provider)
max_retries: 3 # Retry attempts on failure
timeout_secs: 30 # Request timeout
cache_enabled: true # Enable prompt-level caching
Provider Types
| Provider | Value | Requirements | Description |
|---|---|---|---|
| Mock | mock | None | Deterministic, no network. Default for CI/CD |
| OpenAI | openai | OPENAI_API_KEY env var | OpenAI API (GPT-4o, GPT-4o-mini, etc.) |
| Anthropic | anthropic | ANTHROPIC_API_KEY env var | Anthropic API (Claude models) |
| Custom | custom | base_url + API key env var | Any OpenAI-compatible endpoint |
Field Reference
| Field | Type | Default | Description |
|---|---|---|---|
provider | string | "mock" | LLM provider type |
model | string | "gpt-4o-mini" | Model identifier passed to the API |
api_key_env | string | "" | Environment variable name containing the API key |
base_url | string | null | Custom API base URL (required for custom provider) |
max_retries | integer | 3 | Maximum retry attempts on transient failures |
timeout_secs | integer | 30 | Per-request timeout in seconds |
cache_enabled | bool | true | Cache responses to avoid duplicate API calls |
Examples
Mock provider (default, no config needed):
# LLM enrichment uses mock provider by default
# No configuration required
OpenAI:
llm:
provider: openai
model: "gpt-4o-mini"
api_key_env: "OPENAI_API_KEY"
Anthropic:
llm:
provider: anthropic
model: "claude-sonnet-4-5-20250929"
api_key_env: "ANTHROPIC_API_KEY"
Self-hosted (e.g., vLLM, Ollama):
llm:
provider: custom
model: "llama-3-8b"
api_key_env: "LOCAL_API_KEY"
base_url: "http://localhost:8000/v1"
Azure OpenAI:
llm:
provider: custom
model: "gpt-4o-mini"
api_key_env: "AZURE_OPENAI_KEY"
base_url: "https://my-resource.openai.azure.com/openai/deployments/gpt-4o-mini"
Diffusion Configuration
Configure the statistical diffusion model backend for learned distribution capture.
diffusion:
enabled: false # Enable diffusion generation
n_steps: 1000 # Number of diffusion steps
schedule: "linear" # Noise schedule type
sample_size: 1000 # Number of samples to generate
Field Reference
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable diffusion model generation |
n_steps | integer | 1000 | Number of forward/reverse diffusion steps. Higher values improve quality but increase compute time |
schedule | string | "linear" | Noise schedule: "linear", "cosine", "sigmoid" |
sample_size | integer | 1000 | Number of diffusion-generated samples |
Noise Schedules
| Schedule | Characteristics | Best For |
|---|---|---|
linear | Uniform noise addition, simple and robust | General purpose |
cosine | Slower noise addition, preserves fine details | Financial amounts with precise distributions |
sigmoid | Smooth transition between linear and cosine | Balanced quality and compute |
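For intuition, a sketch of how linear and cosine schedules are commonly parameterized in diffusion models; the exact constants used internally may differ:
/// Linear schedule: noise variance beta_t grows uniformly from beta_min to beta_max.
fn beta_linear(t: usize, n_steps: usize, beta_min: f64, beta_max: f64) -> f64 {
    beta_min + (beta_max - beta_min) * t as f64 / (n_steps - 1) as f64
}
/// Cosine schedule: cumulative signal retention alpha_bar(t) follows a squared
/// cosine, adding noise more slowly early on and preserving fine detail.
fn alpha_bar_cosine(t: usize, n_steps: usize) -> f64 {
    let s = 0.008; // small offset so the schedule is not flat at t = 0
    let f = |x: f64| (((x + s) / (1.0 + s)) * std::f64::consts::FRAC_PI_2).cos().powi(2);
    f(t as f64 / n_steps as f64) / f(0.0)
}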
Examples
Basic diffusion:
diffusion:
enabled: true
n_steps: 1000
schedule: "cosine"
sample_size: 5000
Fast diffusion (fewer steps):
diffusion:
enabled: true
n_steps: 200
schedule: "linear"
sample_size: 1000
Causal Configuration
Configure causal graph-based data generation with Structural Causal Models.
causal:
enabled: false # Enable causal generation
template: "fraud_detection" # Built-in template or custom YAML path
sample_size: 1000 # Number of samples
validate: true # Validate causal structure in output
Field Reference
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable causal/counterfactual generation |
template | string | "fraud_detection" | Template name or path to custom YAML graph |
sample_size | integer | 1000 | Number of causal samples to generate |
validate | bool | true | Run causal structure validation on output |
Built-in Templates
| Template | Variables | Use Case |
|---|---|---|
fraud_detection | transaction_amount, approval_level, vendor_risk, fraud_flag | Fraud risk modeling |
revenue_cycle | order_size, credit_score, payment_delay, revenue | Revenue and credit analysis |
Custom Causal Graph
Point template to a YAML file defining a custom causal graph:
causal:
enabled: true
template: "./graphs/custom_fraud.yaml"
sample_size: 10000
validate: true
Custom graph format:
# custom_fraud.yaml
variables:
- name: transaction_amount
type: continuous
distribution: lognormal
params:
mu: 8.0
sigma: 1.5
- name: approval_level
type: count
distribution: normal
params:
mean: 1.0
std: 0.5
- name: fraud_flag
type: binary
edges:
- from: transaction_amount
to: approval_level
mechanism:
type: linear
coefficient: 0.00005
- from: transaction_amount
to: fraud_flag
mechanism:
type: logistic
scale: 0.0001
midpoint: 50000.0
strength: 0.8
Causal Mechanism Types
| Type | Parameters | Description |
|---|---|---|
linear | coefficient | y += coefficient × parent |
threshold | cutoff | y = 1 if parent > cutoff, else 0 |
polynomial | coefficients (list) | y += Σ c[i] × parent^i |
logistic | scale, midpoint | y += 1 / (1 + e^(-scale × (parent - midpoint))) |
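A sketch of how the linear, threshold, and logistic mechanisms combine a parent value into a child contribution (hypothetical enum; polynomial omitted for brevity):
/// Structural mechanisms from the table above.
enum Mechanism {
    Linear { coefficient: f64 },
    Threshold { cutoff: f64 },
    Logistic { scale: f64, midpoint: f64 },
}
/// Contribution of a single parent value to its child variable.
fn apply(mechanism: &Mechanism, parent: f64) -> f64 {
    match mechanism {
        Mechanism::Linear { coefficient } => *coefficient * parent,
        Mechanism::Threshold { cutoff } => {
            if parent > *cutoff { 1.0 } else { 0.0 }
        }
        Mechanism::Logistic { scale, midpoint } => {
            1.0 / (1.0 + (-*scale * (parent - *midpoint)).exp())
        }
    }
}
For example, with scale: 0.0001 and midpoint: 50000.0 as in custom_fraud.yaml, a 60,000 transaction contributes 1 / (1 + e^-1) ≈ 0.73 toward fraud_flag, before any strength scaling.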
Certificate Configuration
Configure synthetic data certificates for provenance and privacy attestation.
certificates:
enabled: false # Enable certificate generation
issuer: "DataSynth" # Certificate issuer identity
include_quality_metrics: true # Include quality metrics
Field Reference
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Generate a certificate with each output |
issuer | string | "DataSynth" | Issuer identity embedded in the certificate |
include_quality_metrics | bool | true | Include Benford MAD, correlation, fidelity, MIA AUC metrics |
Certificate Contents
When enabled, a certificate.json is produced containing:
| Section | Contents |
|---|---|
| Identity | certificate_id, generation_timestamp, generator_version |
| Reproducibility | config_hash (SHA-256), seed, fingerprint_hash |
| Privacy | DP mechanism, epsilon, delta, composition method, total queries |
| Quality | Benford MAD, correlation preservation, statistical fidelity, MIA AUC |
| Integrity | HMAC-SHA256 signature |
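A sketch of the integrity pieces using the sha2 and hmac crates (hypothetical key handling; the actual signing lives in datasynth-fingerprint's certificates module):
use hmac::{Hmac, Mac};
use sha2::{Digest, Sha256};
/// SHA-256 hash of the generation configuration (hex-encoded).
fn config_hash(config_yaml: &str) -> String {
    let digest = Sha256::digest(config_yaml.as_bytes());
    digest.iter().map(|b| format!("{b:02x}")).collect()
}
/// HMAC-SHA256 signature over the serialized certificate body,
/// used later to detect tampering.
fn sign_certificate(body: &[u8], key: &[u8]) -> Vec<u8> {
    let mut mac = Hmac::<Sha256>::new_from_slice(key).expect("HMAC accepts keys of any length");
    mac.update(body);
    mac.finalize().into_bytes().to_vec()
}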
Combined Example
A complete configuration using all AI/ML features:
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
companies:
- code: "1000"
name: "Manufacturing Corp"
currency: USD
country: US
transactions:
target_count: 50000
# LLM enrichment for realistic metadata
llm:
provider: mock
# Diffusion for learned distributions
diffusion:
enabled: true
n_steps: 1000
schedule: "cosine"
sample_size: 5000
# Causal structure for fraud scenarios
causal:
enabled: true
template: "fraud_detection"
sample_size: 10000
validate: true
# Certificate for provenance
certificates:
enabled: true
issuer: "DataSynth v0.5.0"
include_quality_metrics: true
fraud:
enabled: true
fraud_rate: 0.005
anomaly_injection:
enabled: true
total_rate: 0.02
output:
format: csv
CLI Flags
Several AI/ML features can also be controlled via CLI flags:
# Generate with certificate
datasynth-data generate --config config.yaml --output ./output --certificate
# Initialize from natural language
datasynth-data init --from-description "1 year of retail data with fraud" -o config.yaml
# Train diffusion model
datasynth-data diffusion train --fingerprint ./fp.dsf --output ./model.json
# Generate causal data
datasynth-data causal generate --template fraud_detection --samples 10000 --output ./causal/
See Also
- LLM-Augmented Generation
- Diffusion Models
- Causal & Counterfactual Generation
- Synthetic Data Certificates
- YAML Schema Reference
Architecture
SyntheticData is designed as a modular, high-performance data generation system.
Overview
┌─────────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ datasynth-cli │ datasynth-server │ datasynth-ui │
├─────────────────────────────────────────────────────────────────────┤
│ Orchestration Layer │
│ datasynth-runtime │
├─────────────────────────────────────────────────────────────────────┤
│ Generation Layer │
│ datasynth-generators │ datasynth-graph │
├─────────────────────────────────────────────────────────────────────┤
│ Foundation Layer │
│ datasynth-core │ datasynth-config │ datasynth-output │
└─────────────────────────────────────────────────────────────────────┘
Key Characteristics
| Characteristic | Description |
|---|---|
| Modular | 15 independent crates with clear boundaries |
| Layered | Strict dependency hierarchy prevents cycles |
| High-Performance | Parallel execution, memory-efficient streaming |
| Deterministic | Seeded RNG for reproducible output |
| Type-Safe | Rust’s type system ensures correctness |
Architecture Sections
| Section | Description |
|---|---|
| Workspace Layout | Crate organization and dependencies |
| Domain Models | Core data structures |
| Data Flow | How data moves through the system |
| Generation Pipeline | Step-by-step generation process |
| Memory Management | Memory tracking and limits |
| Design Decisions | Key architectural choices |
Design Principles
Separation of Concerns
Each crate has a single responsibility:
- datasynth-core: Domain models and distributions
- datasynth-config: Configuration and validation
- datasynth-generators: Data generation logic
- datasynth-output: File writing
- datasynth-runtime: Orchestration
Dependency Inversion
Core components define traits, implementations provided by higher layers:
#![allow(unused)]
fn main() {
// datasynth-core defines the trait
pub trait Generator<T> {
fn generate_batch(&mut self, count: usize) -> Result<Vec<T>>;
}
// datasynth-generators implements it
impl Generator<JournalEntry> for JournalEntryGenerator {
fn generate_batch(&mut self, count: usize) -> Result<Vec<JournalEntry>> {
// Implementation
}
}
}
Configuration-Driven
All behavior controlled by configuration:
transactions:
target_count: 100000
benford:
enabled: true
Memory Safety
Rust’s ownership system prevents:
- Data races in parallel generation
- Memory leaks
- Buffer overflows
Component Interactions
┌─────────────┐
│ Config │
└──────┬──────┘
│
┌──────────────────┼──────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ JE Generator│ │ Doc Generator│ │ Master Data │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└─────────────────┼─────────────────┘
│
▼
┌──────────────┐
│ Orchestrator │
└──────┬───────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ CSV │ │ Graph │ │ JSON │
└─────────┘ └─────────┘ └─────────┘
Performance Architecture
Parallel Execution
#![allow(unused)]
fn main() {
// Thread pool distributes work
let entries: Vec<JournalEntry> = (0..num_threads)
.into_par_iter()
.flat_map(|thread_id| {
let mut gen = generator_for_thread(thread_id);
gen.generate_batch(batch_size)
})
.collect();
}
Streaming Output
#![allow(unused)]
fn main() {
// Memory-efficient streaming
for entry in generator.generate_stream() {
sink.write(&entry)?;
}
}
Memory Guards
#![allow(unused)]
fn main() {
// Memory limits enforced
let guard = MemoryGuard::new(config);
while !guard.check().exceeds_hard_limit {
generate_batch();
}
}
Extension Points
Custom Generators
Implement the Generator trait:
#![allow(unused)]
fn main() {
impl Generator<CustomType> for CustomGenerator {
fn generate_batch(&mut self, count: usize) -> Result<Vec<CustomType>> {
// Custom logic
}
}
}
Custom Output Sinks
Implement the Sink trait:
#![allow(unused)]
fn main() {
impl Sink<JournalEntry> for CustomSink {
fn write(&mut self, entry: &JournalEntry) -> Result<()> {
// Custom output logic
}
}
}
Custom Distributions
Create specialized samplers:
#![allow(unused)]
fn main() {
impl AmountSampler for CustomAmountSampler {
fn sample(&mut self) -> Decimal {
// Custom distribution
}
}
}
See Also
Workspace Layout
SyntheticData is organized as a Rust workspace with 15 crates following a layered architecture.
Crate Hierarchy
datasynth-cli → Binary entry point (commands: generate, validate, init, info, fingerprint)
datasynth-server → REST/gRPC/WebSocket server with auth, rate limiting, timeouts
datasynth-ui → Tauri/SvelteKit desktop UI
│
▼
datasynth-runtime → Orchestration layer (GenerationOrchestrator coordinates workflow)
│
├─────────────────────────────────────┐
▼ ▼
datasynth-generators datasynth-banking datasynth-ocpm datasynth-fingerprint datasynth-standards
│ │ │ │
└────────────────────────┴──────────────────┴────────────────────┘
│
┌────────────────┴────────────────┐
▼ ▼
datasynth-graph datasynth-eval
│ │
└────────────────┬────────────────┘
▼
datasynth-config
│
▼
datasynth-core → Foundation layer
│
▼
datasynth-output
datasynth-test-utils → Testing utilities
Dependency Matrix
| Crate | Depends On |
|---|---|
| datasynth-core | (none) |
| datasynth-config | datasynth-core |
| datasynth-output | datasynth-core |
| datasynth-generators | datasynth-core, datasynth-config |
| datasynth-graph | datasynth-core, datasynth-generators |
| datasynth-eval | datasynth-core |
| datasynth-banking | datasynth-core, datasynth-config |
| datasynth-ocpm | datasynth-core |
| datasynth-fingerprint | datasynth-core, datasynth-config |
| datasynth-standards | datasynth-core, datasynth-config |
| datasynth-runtime | datasynth-core, datasynth-config, datasynth-generators, datasynth-output, datasynth-graph, datasynth-banking, datasynth-ocpm, datasynth-fingerprint, datasynth-eval |
| datasynth-cli | datasynth-runtime, datasynth-fingerprint |
| datasynth-server | datasynth-runtime |
| datasynth-ui | datasynth-runtime (via Tauri) |
| datasynth-test-utils | datasynth-core |
Directory Structure
SyntheticData/
├── Cargo.toml # Workspace manifest
├── crates/
│ ├── datasynth-core/
│ │ ├── Cargo.toml
│ │ ├── README.md
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── models/ # Domain models (JournalEntry, Master data, etc.)
│ │ ├── distributions/ # Statistical samplers
│ │ ├── traits/ # Generator, Sink, PostProcessor traits
│ │ ├── templates/ # Template loading system
│ │ ├── accounts.rs # GL account constants
│ │ ├── uuid_factory.rs # Deterministic UUID generation
│ │ ├── memory_guard.rs # Memory limit enforcement
│ │ ├── disk_guard.rs # Disk space monitoring
│ │ ├── cpu_monitor.rs # CPU load tracking
│ │ ├── resource_guard.rs # Unified resource orchestration
│ │ ├── degradation.rs # Graceful degradation controller
│ │ ├── llm/ # LLM provider abstraction (Mock, HTTP, OpenAI, Anthropic)
│ │ ├── diffusion/ # Diffusion model backend (statistical, hybrid, training)
│ │ └── causal/ # Causal graphs, SCMs, interventions, counterfactuals
│ ├── datasynth-config/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── schema.rs # Configuration schema
│ │ ├── validation.rs # Config validation rules
│ │ └── presets/ # Industry preset definitions
│ ├── datasynth-generators/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── je_generator.rs
│ │ ├── coa_generator.rs
│ │ ├── control_generator.rs
│ │ ├── master_data/ # Vendor, Customer, Material, Asset, Employee
│ │ ├── document_flow/ # P2P, O2C, three-way match
│ │ ├── intercompany/ # IC generation, matching, elimination
│ │ ├── balance/ # Opening balance, balance tracker
│ │ ├── subledger/ # AR, AP, FA, Inventory
│ │ ├── fx/ # FX rates, translation, CTA
│ │ ├── period_close/ # Close engine, accruals, depreciation
│ │ ├── anomaly/ # Anomaly injection engine
│ │ ├── data_quality/ # Missing values, typos, duplicates
│ │ ├── audit/ # Engagement, workpaper, evidence, findings
│ │ ├── llm_enrichment/ # LLM-powered vendor names, descriptions, anomaly explanations
│ │ └── relationships/ # Entity graph, cross-process links, relationship strength
│ ├── datasynth-output/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── csv_sink.rs
│ │ ├── json_sink.rs
│ │ └── control_export.rs
│ ├── datasynth-graph/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── models/ # Node, edge types
│ │ ├── builders/ # Transaction, approval, entity graphs
│ │ ├── exporters/ # PyTorch Geometric, Neo4j, DGL
│ │ └── ml/ # Feature computation, train/val/test splits
│ ├── datasynth-runtime/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── orchestrator.rs # GenerationOrchestrator
│ │ └── progress.rs # Progress tracking
│ ├── datasynth-cli/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ └── main.rs # generate, validate, init, info, fingerprint commands
│ ├── datasynth-server/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── main.rs
│ │ ├── rest/ # Axum REST API
│ │ ├── grpc/ # Tonic gRPC service
│ │ └── websocket/ # WebSocket streaming
│ ├── datasynth-ui/
│ │ ├── package.json
│ │ ├── src/ # SvelteKit frontend
│ │ │ ├── routes/ # 15+ config pages
│ │ │ └── lib/ # Components, stores
│ │ └── src-tauri/ # Rust backend
│ ├── datasynth-eval/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── statistical/ # Benford, distributions, temporal
│ │ ├── coherence/ # Balance, IC, document chains
│ │ ├── quality/ # Completeness, consistency, duplicates
│ │ ├── ml/ # Feature distributions, label quality
│ │ ├── report/ # HTML/JSON report generation
│ │ └── enhancement/ # AutoTuner, RecommendationEngine
│ ├── datasynth-banking/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── models/ # Customer, Account, Transaction, KYC
│ │ ├── generators/ # Customer, account, transaction generation
│ │ ├── typologies/ # Structuring, funnel, layering, mule, fraud
│ │ ├── personas/ # Retail, business, trust behaviors
│ │ └── labels/ # Entity, relationship, transaction labels
│ ├── datasynth-ocpm/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── models/ # EventLog, Event, ObjectInstance, ObjectType
│ │ ├── generator/ # P2P, O2C event generation
│ │ └── export/ # OCEL 2.0 JSON export
│ ├── datasynth-fingerprint/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── models/ # Fingerprint, Manifest, Schema, Statistics
│ │ ├── privacy/ # Laplace, Gaussian, k-anonymity, PrivacyEngine
│ │ ├── extraction/ # Schema, stats, correlation, integrity extractors
│ │ ├── io/ # DSF file reader, writer, validator
│ │ ├── synthesis/ # ConfigSynthesizer, DistributionFitter, GaussianCopula
│ │ ├── evaluation/ # FidelityEvaluator, FidelityReport
│ │ ├── federated/ # Federated fingerprint protocol, secure aggregation
│ │ └── certificates/ # Synthetic data certificates, HMAC-SHA256 signing
│ ├── datasynth-standards/
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── framework.rs # AccountingFramework, FrameworkSettings
│ │ ├── accounting/ # Revenue (ASC 606/IFRS 15), Leases, Fair Value, Impairment
│ │ ├── audit/ # ISA standards, Analytical procedures, Opinions
│ │ └── regulatory/ # SOX 302/404, DeficiencyMatrix
│ └── datasynth-test-utils/
│ ├── Cargo.toml
│ └── src/
│ └── lib.rs # Test fixtures, assertions, mocks
├── benches/ # Criterion benchmark suite
├── docs/ # This documentation (mdBook)
├── python/ # Python wrapper (datasynth_py)
├── examples/ # Example configurations and templates
└── tests/ # Integration tests
Crate Purposes
Application Layer
| Crate | Purpose |
|---|---|
| datasynth-cli | Command-line interface with generate, validate, init, info, fingerprint commands |
| datasynth-server | REST/gRPC/WebSocket API with auth, rate limiting, timeouts |
| datasynth-ui | Cross-platform desktop application (Tauri + SvelteKit) |
Processing Layer
| Crate | Purpose |
|---|---|
| datasynth-runtime | Orchestrates generation workflow with resource guards |
| datasynth-generators | Core data generation (JE, master data, documents, anomalies, audit) |
| datasynth-graph | Graph construction and export for ML |
Domain-Specific Modules
| Crate | Purpose |
|---|---|
| datasynth-banking | KYC/AML banking transactions with fraud typologies |
| datasynth-ocpm | OCEL 2.0 process mining event logs |
| datasynth-fingerprint | Privacy-preserving fingerprint extraction and synthesis |
| datasynth-standards | Accounting/audit standards (US GAAP, IFRS, ISA, SOX, PCAOB) |
Foundation Layer
| Crate | Purpose |
|---|---|
| datasynth-core | Domain models, traits, distributions, resource guards |
| datasynth-config | Configuration schema and validation |
| datasynth-output | Output sinks (CSV, JSON, Parquet) |
Supporting Crates
| Crate | Purpose |
|---|---|
| datasynth-eval | Quality evaluation with auto-tuning recommendations |
| datasynth-test-utils | Test fixtures and assertions |
Build Commands
# Build entire workspace
cargo build --release
# Build specific crate
cargo build -p datasynth-core
cargo build -p datasynth-generators
cargo build -p datasynth-fingerprint
# Run tests
cargo test
cargo test -p datasynth-core
cargo test -p datasynth-fingerprint
# Generate documentation
cargo doc --workspace --no-deps
# Run benchmarks
cargo bench
Feature Flags
Workspace-level features:
[workspace.features]
default = ["full"]
full = ["server", "ui", "graph"]
server = []
ui = []
graph = []
Crate-level features:
# datasynth-core
[features]
templates = ["serde_yaml"]
# datasynth-output
[features]
compression = ["flate2", "zstd"]
Adding a New Crate
- Create directory: crates/datasynth-newcrate/
- Add Cargo.toml:
  [package]
  name = "datasynth-newcrate"
  version = "0.2.0"
  edition = "2021"

  [dependencies]
  datasynth-core = { path = "../datasynth-core" }
- Add to workspace Cargo.toml:
  [workspace]
  members = [
      # ...
      "crates/datasynth-newcrate",
  ]
- Create src/lib.rs
- Add documentation to docs/src/crates/
See Also
Domain Models
Core data structures representing enterprise financial concepts.
Model Categories
| Category | Models |
|---|---|
| Accounting | JournalEntry, ChartOfAccounts, ACDOCA |
| Master Data | Vendor, Customer, Material, FixedAsset, Employee |
| Documents | PurchaseOrder, Invoice, Payment, etc. |
| Financial | TrialBalance, FxRate, AccountBalance |
| Financial Reporting | FinancialStatement, CashFlowItem, BankReconciliation, BankStatementLine |
| Sourcing (S2C) | SourcingProject, SupplierQualification, RfxEvent, Bid, BidEvaluation, ProcurementContract, CatalogItem, SupplierScorecard, SpendAnalysis |
| HR / Payroll | PayrollRun, PayrollLineItem, TimeEntry, ExpenseReport, ExpenseLineItem |
| Manufacturing | ProductionOrder, QualityInspection, CycleCount |
| Sales Quotes | SalesQuote, QuoteLineItem |
| Compliance | InternalControl, SoDRule, LabeledAnomaly |
Accounting
JournalEntry
The core accounting record.
#![allow(unused)]
fn main() {
pub struct JournalEntry {
pub header: JournalEntryHeader,
pub lines: Vec<JournalEntryLine>,
}
pub struct JournalEntryHeader {
pub document_id: Uuid,
pub company_code: String,
pub fiscal_year: u16,
pub fiscal_period: u8,
pub posting_date: NaiveDate,
pub document_date: NaiveDate,
pub created_at: DateTime<Utc>,
pub source: TransactionSource,
pub business_process: Option<BusinessProcess>,
// Document references
pub source_document_type: Option<DocumentType>,
pub source_document_id: Option<String>,
// Labels
pub is_fraud: bool,
pub fraud_type: Option<FraudType>,
pub is_anomaly: bool,
pub anomaly_type: Option<AnomalyType>,
// Control markers
pub control_ids: Vec<String>,
pub sox_relevant: bool,
pub sod_violation: bool,
}
pub struct JournalEntryLine {
pub line_number: u32,
pub account_number: String,
pub cost_center: Option<String>,
pub profit_center: Option<String>,
pub debit_amount: Decimal,
pub credit_amount: Decimal,
pub description: String,
pub tax_code: Option<String>,
}
}
Invariant: Sum of debits must equal sum of credits.
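A sketch of that invariant as a check over the lines defined above:
use rust_decimal::Decimal;
/// A journal entry is balanced when total debits equal total credits.
fn is_balanced(lines: &[JournalEntryLine]) -> bool {
    let debits: Decimal = lines.iter().map(|l| l.debit_amount).sum();
    let credits: Decimal = lines.iter().map(|l| l.credit_amount).sum();
    debits == credits
}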
ChartOfAccounts
GL account structure.
#![allow(unused)]
fn main() {
pub struct ChartOfAccounts {
pub accounts: Vec<Account>,
}
pub struct Account {
pub account_number: String,
pub name: String,
pub account_type: AccountType,
pub account_subtype: AccountSubType,
pub is_control_account: bool,
pub normal_balance: NormalBalance,
pub is_active: bool,
}
pub enum AccountType {
Asset,
Liability,
Equity,
Revenue,
Expense,
}
pub enum AccountSubType {
// Assets
Cash, AccountsReceivable, Inventory, FixedAsset,
// Liabilities
AccountsPayable, AccruedLiabilities, LongTermDebt,
// Equity
CommonStock, RetainedEarnings,
// Revenue
SalesRevenue, ServiceRevenue,
// Expense
CostOfGoodsSold, OperatingExpense,
// ...
}
}
ACDOCA
SAP HANA Universal Journal format.
#![allow(unused)]
fn main() {
pub struct AcdocaEntry {
pub rclnt: String, // Client
pub rldnr: String, // Ledger
pub rbukrs: String, // Company code
pub gjahr: u16, // Fiscal year
pub belnr: String, // Document number
pub docln: u32, // Line item
pub ryear: u16, // Year
pub poper: u8, // Posting period
pub racct: String, // Account
pub drcrk: DebitCreditIndicator,
pub hsl: Decimal, // Amount in local currency
pub ksl: Decimal, // Amount in group currency
// Simulation fields
pub zsim_fraud: bool,
pub zsim_anomaly: bool,
pub zsim_source: String,
}
}
Master Data
Vendor
Supplier master record.
#![allow(unused)]
fn main() {
pub struct Vendor {
pub vendor_id: String,
pub vendor_name: String,
pub tax_id: Option<String>,
pub currency: String,
pub country: String,
pub payment_terms: PaymentTerms,
pub bank_account: Option<BankAccount>,
pub is_intercompany: bool,
pub behavior: VendorBehavior,
pub valid_from: NaiveDate,
pub valid_to: Option<NaiveDate>,
}
pub struct VendorBehavior {
pub late_payment_tendency: f64,
pub discount_usage_rate: f64,
}
}
Customer
Customer master record.
#![allow(unused)]
fn main() {
pub struct Customer {
pub customer_id: String,
pub customer_name: String,
pub currency: String,
pub country: String,
pub credit_limit: Decimal,
pub credit_rating: CreditRating,
pub payment_behavior: PaymentBehavior,
pub is_intercompany: bool,
pub valid_from: NaiveDate,
}
pub struct PaymentBehavior {
pub on_time_rate: f64,
pub early_payment_rate: f64,
pub late_payment_rate: f64,
pub average_days_late: u32,
}
}
Material
Product/material master.
#![allow(unused)]
fn main() {
pub struct Material {
pub material_id: String,
pub description: String,
pub material_type: MaterialType,
pub unit_of_measure: String,
pub valuation_method: ValuationMethod,
pub standard_cost: Decimal,
pub gl_account: String,
}
pub enum MaterialType {
RawMaterial,
WorkInProgress,
FinishedGoods,
Service,
}
pub enum ValuationMethod {
Fifo,
Lifo,
WeightedAverage,
StandardCost,
}
}
FixedAsset
Capital asset record.
#![allow(unused)]
fn main() {
pub struct FixedAsset {
pub asset_id: String,
pub description: String,
pub asset_class: AssetClass,
pub acquisition_date: NaiveDate,
pub acquisition_cost: Decimal,
pub useful_life_years: u32,
pub depreciation_method: DepreciationMethod,
pub salvage_value: Decimal,
pub accumulated_depreciation: Decimal,
pub disposal_date: Option<NaiveDate>,
}
}
Employee
User/employee record.
#![allow(unused)]
fn main() {
pub struct Employee {
pub employee_id: String,
pub name: String,
pub department: String,
pub role: String,
pub manager_id: Option<String>,
pub approval_limit: Decimal,
pub transaction_codes: Vec<String>,
pub hire_date: NaiveDate,
}
}
Documents
PurchaseOrder
P2P initiating document.
#![allow(unused)]
fn main() {
pub struct PurchaseOrder {
pub po_number: String,
pub vendor_id: String,
pub company_code: String,
pub order_date: NaiveDate,
pub items: Vec<PoLineItem>,
pub total_amount: Decimal,
pub currency: String,
pub status: PoStatus,
}
pub struct PoLineItem {
pub line_number: u32,
pub material_id: String,
pub quantity: Decimal,
pub unit_price: Decimal,
pub gl_account: String,
}
}
VendorInvoice
AP invoice with three-way match.
#![allow(unused)]
fn main() {
pub struct VendorInvoice {
pub invoice_number: String,
pub vendor_id: String,
pub po_number: Option<String>,
pub gr_number: Option<String>,
pub invoice_date: NaiveDate,
pub due_date: NaiveDate,
pub total_amount: Decimal,
pub match_status: MatchStatus,
}
pub enum MatchStatus {
Matched,
QuantityVariance,
PriceVariance,
Blocked,
}
}
DocumentReference
Links documents in flows.
#![allow(unused)]
fn main() {
pub struct DocumentReference {
pub from_document_type: DocumentType,
pub from_document_id: String,
pub to_document_type: DocumentType,
pub to_document_id: String,
pub reference_type: ReferenceType,
}
pub enum ReferenceType {
FollowsFrom, // Normal flow
PaymentFor, // Payment → Invoice
ReversalOf, // Reversal/credit memo
}
}
Financial
TrialBalance
Period-end balances.
#![allow(unused)]
fn main() {
pub struct TrialBalance {
pub company_code: String,
pub fiscal_year: u16,
pub fiscal_period: u8,
pub accounts: Vec<TrialBalanceRow>,
}
pub struct TrialBalanceRow {
pub account_number: String,
pub account_name: String,
pub opening_balance: Decimal,
pub period_debits: Decimal,
pub period_credits: Decimal,
pub closing_balance: Decimal,
}
}
FxRate
Exchange rate record.
#![allow(unused)]
fn main() {
pub struct FxRate {
pub from_currency: String,
pub to_currency: String,
pub rate_date: NaiveDate,
pub rate_type: RateType,
pub rate: Decimal,
}
pub enum RateType {
Spot,
Closing,
Average,
}
}
Compliance
LabeledAnomaly
ML training label.
#![allow(unused)]
fn main() {
pub struct LabeledAnomaly {
pub document_id: Uuid,
pub anomaly_id: String,
pub anomaly_type: AnomalyType,
pub category: AnomalyCategory,
pub severity: Severity,
pub description: String,
pub detection_difficulty: DetectionDifficulty,
}
pub enum AnomalyType {
Fraud,
Error,
ProcessIssue,
Statistical,
Relational,
}
}
InternalControl
SOX control definition.
#![allow(unused)]
fn main() {
pub struct InternalControl {
pub control_id: String,
pub name: String,
pub description: String,
pub control_type: ControlType,
pub frequency: ControlFrequency,
pub assertions: Vec<Assertion>,
}
}
Financial Reporting
FinancialStatement
Period-end financial statement with line items.
#![allow(unused)]
fn main() {
pub enum StatementType {
BalanceSheet,
IncomeStatement,
CashFlowStatement,
ChangesInEquity,
}
pub struct FinancialStatementLineItem {
pub line_code: String,
pub label: String,
pub section: String,
pub sort_order: u32,
pub amount: Decimal,
pub amount_prior: Option<Decimal>,
pub indent_level: u8,
pub is_total: bool,
pub gl_accounts: Vec<String>,
}
pub struct CashFlowItem {
pub item_code: String,
pub label: String,
pub category: CashFlowCategory, // Operating, Investing, Financing
pub amount: Decimal,
}
}
BankReconciliation
Bank statement reconciliation with auto-matching.
#![allow(unused)]
fn main() {
pub struct BankStatementLine {
pub line_id: String,
pub statement_date: NaiveDate,
pub direction: Direction, // Inflow, Outflow
pub amount: Decimal,
pub description: String,
pub match_status: MatchStatus, // Unmatched, AutoMatched, ManuallyMatched, BankCharge, Interest
pub matched_payment_id: Option<String>,
}
pub struct BankReconciliation {
pub reconciliation_id: String,
pub company_code: String,
pub bank_account: String,
pub period_start: NaiveDate,
pub period_end: NaiveDate,
pub opening_balance: Decimal,
pub closing_balance: Decimal,
pub status: ReconciliationStatus, // InProgress, Completed, CompletedWithExceptions
}
}
Sourcing (S2C)
Source-to-Contract models for the procurement pipeline.
SourcingProject
Top-level sourcing initiative.
#![allow(unused)]
fn main() {
pub struct SourcingProject {
pub project_id: String,
pub title: String,
pub category: String,
pub status: SourcingProjectStatus,
pub estimated_spend: Decimal,
pub start_date: NaiveDate,
pub target_award_date: NaiveDate,
}
}
RfxEvent
Request for Information/Proposal/Quote.
#![allow(unused)]
fn main() {
pub struct RfxEvent {
pub rfx_id: String,
pub project_id: String,
pub rfx_type: RfxType, // Rfi, Rfp, Rfq
pub title: String,
pub issue_date: NaiveDate,
pub close_date: NaiveDate,
pub invited_suppliers: Vec<String>,
}
}
ProcurementContract
Awarded contract resulting from bid evaluation.
#![allow(unused)]
fn main() {
pub struct ProcurementContract {
pub contract_id: String,
pub vendor_id: String,
pub rfx_id: Option<String>,
pub contract_value: Decimal,
pub start_date: NaiveDate,
pub end_date: NaiveDate,
pub auto_renew: bool,
}
}
Additional S2C models include SpendAnalysis, SupplierQualification, Bid, BidEvaluation, CatalogItem, and SupplierScorecard.
HR / Payroll
Hire-to-Retire (H2R) process models.
PayrollRun
A complete pay cycle for a company.
#![allow(unused)]
fn main() {
pub struct PayrollRun {
pub payroll_id: String,
pub company_code: String,
pub pay_period_start: NaiveDate,
pub pay_period_end: NaiveDate,
pub run_date: NaiveDate,
pub status: PayrollRunStatus, // Draft, Calculated, Approved, Posted, Reversed
pub total_gross: Decimal,
pub total_deductions: Decimal,
pub total_net: Decimal,
pub total_employer_cost: Decimal,
pub employee_count: u32,
}
}
TimeEntry
Employee time tracking record.
#![allow(unused)]
fn main() {
pub struct TimeEntry {
pub entry_id: String,
pub employee_id: String,
pub date: NaiveDate,
pub hours_regular: f64,
pub hours_overtime: f64,
pub hours_pto: f64,
pub hours_sick: f64,
pub project_id: Option<String>,
pub cost_center: Option<String>,
pub approval_status: TimeApprovalStatus, // Pending, Approved, Rejected
}
}
ExpenseReport
Employee expense reimbursement.
#![allow(unused)]
fn main() {
pub struct ExpenseReport {
pub report_id: String,
pub employee_id: String,
pub submission_date: NaiveDate,
pub status: ExpenseStatus, // Draft, Submitted, Approved, Rejected, Paid
pub total_amount: Decimal,
pub line_items: Vec<ExpenseLineItem>,
}
pub enum ExpenseCategory {
Travel, Meals, Lodging, Transportation,
Office, Entertainment, Training, Other,
}
}
Manufacturing
Production and quality process models.
ProductionOrder
Manufacturing production order linked to materials.
#![allow(unused)]
fn main() {
pub struct ProductionOrder {
pub order_id: String,
pub material_id: String,
pub planned_quantity: Decimal,
pub actual_quantity: Decimal,
pub start_date: NaiveDate,
pub end_date: Option<NaiveDate>,
pub status: ProductionOrderStatus,
}
}
QualityInspection
Quality control inspection record.
#![allow(unused)]
fn main() {
pub struct QualityInspection {
pub inspection_id: String,
pub production_order_id: String,
pub inspection_date: NaiveDate,
pub result: InspectionResult, // Pass, Fail, Conditional
pub defect_count: u32,
}
}
CycleCount
Inventory cycle count with variance tracking.
#![allow(unused)]
fn main() {
pub struct CycleCount {
pub count_id: String,
pub material_id: String,
pub warehouse: String,
pub count_date: NaiveDate,
pub system_quantity: Decimal,
pub counted_quantity: Decimal,
pub variance: Decimal,
}
}
Sales Quotes
Quote-to-order pipeline models.
SalesQuote
Sales quotation record.
#![allow(unused)]
fn main() {
pub struct SalesQuote {
pub quote_id: String,
pub customer_id: String,
pub quote_date: NaiveDate,
pub valid_until: NaiveDate,
pub total_amount: Decimal,
pub status: QuoteStatus, // Draft, Sent, Won, Lost, Expired
pub converted_order_id: Option<String>,
}
}
Decimal Handling
All monetary amounts use rust_decimal::Decimal:
#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
let amount = dec!(1234.56);
let tax = amount * dec!(0.077);
}
Decimal values are serialized as JSON strings to avoid IEEE 754 floating-point precision issues:
{"amount": "1234.56"}
Data Flow
How data flows through the SyntheticData system.
High-Level Flow
┌─────────────┐
│ Config │
└──────┬──────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Orchestrator │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Master │ → │ Opening │ → │ Transact │ → │ Period │ │
│ │ Data │ │ Balances │ │ ions │ │ Close │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
└───────────────────────────┬─────────────────────────────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CSV Sink │ │ Graph Export│ │ Labels │
└─────────────┘ └─────────────┘ └─────────────┘
Phase 1: Configuration Loading
YAML File → Parser → Validator → Config Object
- Load: Read YAML/JSON file
- Parse: Convert to strongly-typed structures
- Validate: Check constraints and ranges
- Resolve: Apply defaults and presets
#![allow(unused)]
fn main() {
let config = Config::from_yaml_file("config.yaml")?;
ConfigValidator::new().validate(&config)?;
}
Phase 2: Master Data Generation
Config → Master Data Generators → Entity Registry
Order of generation (to satisfy dependencies):
- Chart of Accounts: GL account structure
- Employees: Users with approval limits
- Vendors: Suppliers (reference employees as approvers)
- Customers: Buyers (reference employees)
- Materials: Products (reference accounts)
- Fixed Assets: Capital assets (reference accounts)
#![allow(unused)]
fn main() {
// Entity registry maintains references
let registry = EntityRegistry::new();
registry.register_vendors(&vendors);
registry.register_customers(&customers);
}
Phase 3: Opening Balance Generation
Config + CoA → Balance Generator → Opening JEs
Generates coherent opening balance sheet:
- Calculate target balances per account type
- Distribute across accounts
- Generate opening entries
- Verify A = L + E
#![allow(unused)]
fn main() {
let opening = OpeningBalanceGenerator::new(&config);
let entries = opening.generate()?;
// Verify balance coherence
assert!(entries.iter().all(|e| e.is_balanced()));
}
Phase 4: Transaction Generation
Document Flow Path
Config → P2P/O2C Generators → Documents → JE Generator → Entries
P2P Flow:
PO Generator → Purchase Order
│
▼
GR Generator → Goods Receipt → JE (Inventory/GR-IR)
│
▼
Invoice Gen. → Vendor Invoice → JE (GR-IR/AP)
│
▼
Payment Gen. → Payment → JE (AP/Cash)
Direct JE Path
Config → JE Generator → Entries
For transactions not from document flows:
- Manual entries
- Recurring entries
- Adjustments
Phase 5: Balance Tracking
Entries → Balance Tracker → Running Balances → Trial Balance
Continuous tracking during generation:
#![allow(unused)]
fn main() {
let mut tracker = BalanceTracker::new(&coa);
for entry in &entries {
tracker.post(&entry)?;
// Verify coherence after each entry
assert!(tracker.is_balanced());
}
let trial_balance = tracker.to_trial_balance(period);
}
Phase 6: Anomaly Injection
Entries → Anomaly Injector → Modified Entries + Labels
Anomalies injected post-generation:
- Select entries based on targeting strategy
- Apply anomaly transformation
- Generate label record
#![allow(unused)]
fn main() {
let injector = AnomalyInjector::new(&config.anomaly_injection);
let (modified, labels) = injector.inject(&entries)?;
}
Phase 7: Period Close
Entries + Balances → Close Engine → Closing Entries
Monthly:
- Accruals
- Depreciation
- Subledger reconciliation
Quarterly:
- IC eliminations
- Currency translation
Annual:
- Closing entries
- Retained earnings
Phase 8: Output Generation
CSV/JSON Output
Entries + Master Data → Sinks → Files
#![allow(unused)]
fn main() {
let mut sink = CsvSink::new("output/journal_entries.csv")?;
sink.write_batch(&entries)?;
sink.flush()?;
}
Graph Output
Entries → Graph Builder → Graph → Exporter → PyG/Neo4j
#![allow(unused)]
fn main() {
let builder = TransactionGraphBuilder::new();
let graph = builder.build(&entries)?;
let exporter = PyTorchGeometricExporter::new("output/graphs");
exporter.export(&graph, split_config)?;
}
Phase 9: Enterprise Process Chains (v0.6.0)
Source-to-Contract (S2C) Flow
Spend Analysis → Sourcing Project → Supplier Qualification → RFx Event → Bids →
Bid Evaluation → Contract Award → Catalog Items → [feeds into P2P] → Supplier Scorecard
S2C data feeds into the existing P2P procurement flow. Procurement contracts and catalog items provide the upstream sourcing context for purchase orders.
HR / Payroll Flow
Employees (Master Data) → Time Entries → Payroll Run → JE (Salary Expense/Cash)
→ Expense Reports → JE (Expense/AP)
HR data depends on the employee master data from Phase 2. Payroll runs generate journal entries that post to salary expense and cash accounts.
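For illustration, a hedged sketch of the balancing logic behind such a posting; the account split and figures are assumptions, not the generator's output:
use rust_decimal_macros::dec;
fn main() {
    let gross = dec!(10000.00);      // salary expense (debit)
    let deductions = dec!(2300.00);  // withholdings payable (credit)
    let net = gross - deductions;    // cash paid out (credit)
    // The payroll entry must balance: debit = sum of credits
    assert_eq!(gross, net + deductions);
    println!("salary expense {gross} = cash {net} + withholdings {deductions}");
}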
Financial Reporting Flow
Trial Balance → Balance Sheet + Income Statement
→ Cash Flow Statement (indirect method)
→ Changes in Equity
→ Management KPIs
→ Budget Variance Analysis
Payments (P2P/O2C) → Bank Reconciliation → Matched/Unmatched Items
Financial statements are derived from the adjusted trial balance. Bank reconciliations match payments from document flows against bank statement lines.
Manufacturing Flow
Materials (Master Data) → Production Orders → Quality Inspections
→ Cycle Counts
Manufacturing data depends on materials from the master data. Production orders consume raw materials and produce finished goods.
Sales Quote Flow
Customers (Master Data) → Sales Quotes → [feeds into O2C when won]
The quote-to-order pipeline generates sales quotes that, when won, link to sales orders in the O2C flow.
Accounting Standards Flow
Customers → Customer Contracts → Performance Obligations (ASC 606/IFRS 15)
Fixed Assets → Impairment Tests → Recoverable Amount Calculations
Revenue recognition generates contracts with performance obligations. Impairment testing evaluates fixed asset carrying amounts against recoverable values.
Data Dependencies
┌─────────────┐
│ Config │
└──────┬──────┘
│
┌───────────┼───────────┐
│ │ │
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│ CoA │ │Vendors│ │Customs│
└───┬───┘ └───┬───┘ └───┬───┘
│ │ │
│ ┌─────┴─────┐ │
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────┐ ┌─────────────┐
│ P2P Docs │ │ O2C Docs │
└──────┬──────┘ └──────┬──────┘
│ │
└───────┬────────┘
│
▼
┌─────────────┐
│ Entries │
└──────┬──────┘
│
┌──────────┼──────────┐──────────┐──────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌─────────┐ ┌───────┐
│ TB │ │ Graph │ │Labels │ │Fin.Stmt │ │BankRec│
└───────┘ └───────┘ └───────┘ └─────────┘ └───────┘
Streaming vs Batch
Batch Mode
All data in memory:
#![allow(unused)]
fn main() {
let entries = generator.generate_batch(100000)?;
sink.write_batch(&entries)?;
}
Pro: fast parallel processing. Con: memory-intensive.
Streaming Mode
Process one at a time:
#![allow(unused)]
fn main() {
for entry in generator.generate_stream() {
sink.write(&entry?)?;
}
}
Pro: memory-efficient. Con: no parallelism.
Hybrid Mode
Batch with periodic flush:
#![allow(unused)]
fn main() {
for batch in generator.generate_batches(1000) {
let entries = batch?;
sink.write_batch(&entries)?;
if memory_guard.check().exceeds_soft_limit {
sink.flush()?;
}
}
}
Generation Pipeline
Step-by-step generation process orchestrated by datasynth-runtime.
Pipeline Overview
┌─────────────────────────────────────────────────────────────────────┐
│ GenerationOrchestrator │
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Init │→│Master│→│Open │→│Trans │→│Close │→│Inject│→│Export│ │
│ │ │ │Data │ │Bal │ │ │ │ │ │ │ │ │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Stage 1: Initialization
Purpose: Prepare generation environment
#![allow(unused)]
fn main() {
pub fn initialize(&mut self) -> Result<()> {
// 1. Validate configuration
ConfigValidator::new().validate(&self.config)?;
// 2. Initialize RNG with seed
self.rng = ChaCha8Rng::seed_from_u64(self.config.global.seed);
// 3. Create UUID factory
self.uuid_factory = DeterministicUuidFactory::new(self.config.global.seed);
// 4. Set up memory guard
self.memory_guard = MemoryGuard::new(self.config.memory_config());
// 5. Create output directories
fs::create_dir_all(&self.output_path)?;
Ok(())
}
}
Outputs:
- Validated configuration
- Initialized RNG
- UUID factory
- Memory guard
- Output directories
Stage 2: Master Data Generation
Purpose: Generate all entity master records
#![allow(unused)]
fn main() {
pub fn generate_master_data(&mut self) -> Result<MasterDataState> {
let mut state = MasterDataState::new();
// 1. Chart of Accounts
let coa_gen = CoaGenerator::new(&self.config, &mut self.rng);
state.chart_of_accounts = coa_gen.generate()?;
// 2. Employees (needed for approvals)
let emp_gen = EmployeeGenerator::new(&self.config, &mut self.rng);
state.employees = emp_gen.generate()?;
// 3. Vendors (reference employees)
let vendor_gen = VendorGenerator::new(&self.config, &mut self.rng);
state.vendors = vendor_gen.generate()?;
// 4. Customers
let customer_gen = CustomerGenerator::new(&self.config, &mut self.rng);
state.customers = customer_gen.generate()?;
// 5. Materials
let material_gen = MaterialGenerator::new(&self.config, &mut self.rng);
state.materials = material_gen.generate()?;
// 6. Fixed Assets
let asset_gen = AssetGenerator::new(&self.config, &mut self.rng);
state.fixed_assets = asset_gen.generate()?;
// 7. Register in entity registry
self.registry.register_all(&state);
Ok(state)
}
}
Outputs:
- Chart of Accounts
- Vendors, Customers
- Materials, Fixed Assets
- Employees
- Entity Registry
Stage 3: Opening Balance Generation
Purpose: Create coherent opening balance sheet
#![allow(unused)]
fn main() {
pub fn generate_opening_balances(&mut self) -> Result<Vec<JournalEntry>> {
let generator = OpeningBalanceGenerator::new(
&self.config,
&self.state.chart_of_accounts,
&mut self.rng,
);
let entries = generator.generate()?;
// Initialize balance tracker
self.balance_tracker = BalanceTracker::new(&self.state.chart_of_accounts);
for entry in &entries {
self.balance_tracker.post(entry)?;
}
// Verify A = L + E
assert!(self.balance_tracker.is_balanced());
Ok(entries)
}
}
Outputs:
- Opening balance entries
- Initialized balance tracker
Stage 4: Transaction Generation
Purpose: Generate main transaction volume
#![allow(unused)]
fn main() {
pub fn generate_transactions(&mut self) -> Result<Vec<JournalEntry>> {
let target = self.config.transactions.target_count;
let mut entries = Vec::with_capacity(target as usize);
// Calculate counts by source
let p2p_count = (target as f64 * self.config.document_flows.p2p.flow_rate) as u64;
let o2c_count = (target as f64 * self.config.document_flows.o2c.flow_rate) as u64;
let other_count = target - p2p_count - o2c_count;
// 1. Generate P2P flows
let p2p_entries = self.generate_p2p_flows(p2p_count)?;
entries.extend(p2p_entries);
// 2. Generate O2C flows
let o2c_entries = self.generate_o2c_flows(o2c_count)?;
entries.extend(o2c_entries);
// 3. Generate other entries (manual, recurring, etc.)
let other_entries = self.generate_other_entries(other_count)?;
entries.extend(other_entries);
// 4. Sort by posting date
entries.sort_by_key(|e| e.header.posting_date);
// 5. Update balance tracker
for entry in &entries {
self.balance_tracker.post(entry)?;
}
Ok(entries)
}
}
P2P Flow Generation
#![allow(unused)]
fn main() {
fn generate_p2p_flows(&mut self, count: u64) -> Result<Vec<JournalEntry>> {
let mut p2p_gen = P2pGenerator::new(&self.config, &self.registry, &mut self.rng);
let mut doc_gen = DocumentFlowJeGenerator::new(&self.config);
let mut entries = Vec::new();
for _ in 0..count {
// 1. Generate document flow
let flow = p2p_gen.generate_flow()?;
self.state.documents.add_p2p_flow(&flow);
// 2. Generate journal entries from flow
let flow_entries = doc_gen.generate_from_p2p(&flow)?;
entries.extend(flow_entries);
}
Ok(entries)
}
}
Outputs:
- Journal entries
- Document records
- Updated balances
Stage 5: Period Close
Purpose: Run period-end processes
#![allow(unused)]
fn main() {
pub fn run_period_close(&mut self) -> Result<()> {
let close_engine = CloseEngine::new(&self.config.period_close);
for period in self.config.periods() {
// 1. Monthly close
let monthly_entries = close_engine.run_monthly_close(
period,
&self.state,
&mut self.balance_tracker,
)?;
self.state.entries.extend(monthly_entries);
// 2. Quarterly close (if applicable)
if period.is_quarter_end() {
let quarterly_entries = close_engine.run_quarterly_close(
period,
&self.state,
)?;
self.state.entries.extend(quarterly_entries);
}
// 3. Generate trial balance
let trial_balance = self.balance_tracker.to_trial_balance(period);
self.state.trial_balances.push(trial_balance);
}
// 4. Annual close
if self.config.has_year_end() {
let annual_entries = close_engine.run_annual_close(&self.state)?;
self.state.entries.extend(annual_entries);
}
Ok(())
}
}
Outputs:
- Accrual entries
- Depreciation entries
- Closing entries
- Trial balances
Stage 6: Anomaly Injection
Purpose: Add anomalies and generate labels
#![allow(unused)]
fn main() {
pub fn inject_anomalies(&mut self) -> Result<()> {
if !self.config.anomaly_injection.enabled {
return Ok(());
}
let mut injector = AnomalyInjector::new(
&self.config.anomaly_injection,
&mut self.rng,
);
// 1. Select entries for injection
let target_count = (self.state.entries.len() as f64
* self.config.anomaly_injection.total_rate) as usize;
// 2. Inject anomalies
let (modified, labels) = injector.inject(
&mut self.state.entries,
target_count,
)?;
// 3. Store labels
self.state.anomaly_labels = labels;
// 4. Apply data quality variations
if self.config.data_quality.enabled {
let dq_injector = DataQualityInjector::new(&self.config.data_quality);
dq_injector.apply(&mut self.state)?;
}
Ok(())
}
}
Outputs:
- Modified entries with anomalies
- Anomaly labels for ML
Stage 7: Export
Purpose: Write all outputs
#![allow(unused)]
fn main() {
pub fn export(&self) -> Result<()> {
// 1. Master data
self.export_master_data()?;
// 2. Transactions
self.export_transactions()?;
// 3. Documents
self.export_documents()?;
// 4. Subledgers
self.export_subledgers()?;
// 5. Trial balances
self.export_trial_balances()?;
// 6. Labels
self.export_labels()?;
// 7. Controls
self.export_controls()?;
// 8. Graphs (if enabled)
if self.config.graph_export.enabled {
self.export_graphs()?;
}
Ok(())
}
}
Outputs:
- CSV/JSON files
- Graph exports
- Label files
Stage 8: Banking & Process Mining
Purpose: Generate banking/KYC/AML data and OCEL 2.0 event logs
If banking or OCEL generation is enabled in the config, this stage produces banking transactions with KYC profiles and/or OCEL 2.0 event logs for process mining.
Outputs:
- Banking customers, accounts, transactions
- KYC profiles and AML typology labels
- OCEL 2.0 event logs, objects, process variants
Stage 9: Audit Generation
Purpose: Generate ISA-compliant audit data
If audit generation is enabled, generates engagement records, workpapers, evidence, risks, findings, and professional judgments.
Outputs:
- Audit engagements, workpapers, evidence
- Risk assessments and findings
- Professional judgment documentation
Stage 10: Graph Export
Purpose: Build and export ML-ready graphs
If graph export is enabled, builds transaction, approval, and entity graphs and exports to configured formats.
Outputs:
- PyTorch Geometric tensors (.pt)
- Neo4j CSV + Cypher scripts
- DGL graph structures
Stage 11: LLM Enrichment (v0.5.0)
Purpose: Enrich generated data with LLM-generated metadata
When LLM enrichment is enabled, uses the configured LlmProvider (Mock, OpenAI, Anthropic, or Custom) to generate realistic:
- Vendor names appropriate for industry and spend category
- Transaction descriptions and memo fields
- Natural language explanations for injected anomalies
The Mock provider is deterministic and requires no network access, making it suitable for CI/CD pipelines.
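For illustration, a minimal sketch of an enrichment call behind a provider trait. The trait shape loosely mirrors the LlmProvider and MockLlmProvider described under datasynth-core, but the types below are assumptions, not the crate's API:
trait LlmProvider {
    fn complete(&self, prompt: &str) -> Result<String, Box<dyn std::error::Error>>;
}
struct MockProvider;
impl LlmProvider for MockProvider {
    fn complete(&self, prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
        // Deterministic output derived only from the prompt: no network calls, CI-friendly
        Ok(format!("Acme Industrial Supplies (prompt: {} chars)", prompt.len()))
    }
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let provider: Box<dyn LlmProvider> = Box::new(MockProvider);
    let vendor_name = provider.complete("Vendor name for Manufacturing / MRO spend")?;
    println!("{vendor_name}");
    Ok(())
}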
Outputs:
- Enriched vendor master data
- Enriched journal entry descriptions
- Anomaly explanation text
Stage 12: Diffusion Enhancement (v0.5.0)
Purpose: Optionally blend diffusion model outputs with rule-based data
When diffusion is enabled, uses a StatisticalDiffusionBackend to generate samples through a learned denoising process. The HybridGenerator blends diffusion outputs with rule-based data using one of three strategies:
- Interpolate: Weighted average of rule-based and diffusion values
- Select: Per-record random selection from either source
- Ensemble: Column-level blending (diffusion for amounts, rule-based for categoricals)
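As a rough illustration of the Interpolate strategy (a sketch only; the function and parameter names are assumptions, not the HybridGenerator API), the blended amount is a weighted average of the two sources:
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
fn interpolate(rule_based: Decimal, diffusion: Decimal, diffusion_weight: Decimal) -> Decimal {
    // blended = w * diffusion + (1 - w) * rule_based, with w in [0, 1]
    diffusion_weight * diffusion + (dec!(1) - diffusion_weight) * rule_based
}
fn main() {
    // 30% weight on the diffusion sample, 70% on the rule-based amount
    let blended = interpolate(dec!(1200.00), dec!(1187.42), dec!(0.3));
    println!("blended amount = {blended}");
}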
Outputs:
- Blended transaction amounts and attributes
- Diffusion fit report (mean/std errors, correlation preservation)
Stage 13: Causal Overlay (v0.5.0)
Purpose: Apply causal structure to generated data
When causal generation is enabled, constructs a StructuralCausalModel from the configured causal graph (or a built-in template like fraud_detection or revenue_cycle) and generates data that respects causal relationships. Supports:
- Observational generation: Data following the causal structure
- Interventional generation: Data under do-calculus interventions (“what-if” scenarios)
- Counterfactual generation: Counterfactual versions of existing records via abduction-action-prediction
The causal validator verifies that generated data preserves the specified causal structure.
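A toy sketch of the observational-versus-interventional distinction, using simplified, assumed structural equations rather than the crate's StructuralCausalModel API:
use rand::{Rng, SeedableRng};
use rand_chacha::ChaCha8Rng;
fn sample(rng: &mut ChaCha8Rng, do_x: Option<f64>) -> (f64, f64) {
    let u_x: f64 = rng.gen_range(-1.0..1.0); // exogenous noise for X
    let u_y: f64 = rng.gen_range(-0.5..0.5); // exogenous noise for Y
    let x = do_x.unwrap_or(10.0 + u_x);      // do(X = x0) overrides X's structural equation
    let y = 2.0 * x + u_y;                   // structural equation: Y depends causally on X
    (x, y)
}
fn main() {
    let mut rng = ChaCha8Rng::seed_from_u64(42);
    let mean_y = |samples: &[(f64, f64)]| {
        samples.iter().map(|&(_, y)| y).sum::<f64>() / samples.len() as f64
    };
    let observational: Vec<_> = (0..1000).map(|_| sample(&mut rng, None)).collect();
    let intervened: Vec<_> = (0..1000).map(|_| sample(&mut rng, Some(15.0))).collect();
    // Pushing X from its observational mean of ~10 to do(X = 15) raises E[Y] by roughly 2 * 5 = 10
    println!("E[Y] observational  ≈ {:.2}", mean_y(&observational));
    println!("E[Y] under do(X=15) ≈ {:.2}", mean_y(&intervened));
}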
Outputs:
- Causally-structured records
- Intervention results with effect estimates
- Counterfactual pairs (factual + counterfactual)
- Causal validation report
Stage 14: Source-to-Contract (v0.6.0)
Purpose: Generate the full S2C procurement pipeline
When source-to-pay is enabled, generates the complete sourcing lifecycle from spend analysis through supplier scorecards. The generation DAG follows:
Spend Analysis → Sourcing Project → Supplier Qualification → RFx Event → Bids →
Bid Evaluation → Procurement Contract → Catalog Items → [feeds into P2P] → Supplier Scorecard
Outputs:
- Spend analysis records and category hierarchies
- Sourcing projects with supplier qualification data
- RFx events (RFI/RFP/RFQ), bids, and bid evaluations
- Procurement contracts and catalog items
- Supplier scorecards with performance metrics
Stage 15: Financial Reporting (v0.6.0)
Purpose: Generate bank reconciliations and financial statements
When financial reporting is enabled, produces bank reconciliations with auto-matching and full financial statement sets derived from the adjusted trial balance.
Bank reconciliations match payments to bank statement lines with configurable auto-match, manual match, and exception rates. Financial statements include:
- Balance Sheet: Assets = Liabilities + Equity
- Income Statement: Revenue - COGS - OpEx - Tax = Net Income
- Cash Flow Statement: Indirect method with operating, investing, and financing categories
- Statement of Changes in Equity: Retained earnings, dividends, comprehensive income
Also generates management KPIs (financial ratios) and budget variance analysis when configured.
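For illustration, the income-statement derivation above can be sketched with the project's Decimal convention; the figures and the flat tax rate are assumptions:
use rust_decimal_macros::dec;
fn main() {
    let revenue = dec!(1000000.00);
    let cogs = dec!(620000.00);
    let opex = dec!(250000.00);
    let tax_rate = dec!(0.20);
    let gross_profit = revenue - cogs;
    let operating_income = gross_profit - opex;
    let tax = operating_income * tax_rate;
    let net_income = operating_income - tax;
    // Revenue - COGS - OpEx - Tax = Net Income
    println!("Net income: {net_income}");
}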
Outputs:
- Bank reconciliations with statement lines and reconciling items
- Financial statements (balance sheet, income statement, cash flow, changes in equity)
- Management KPIs and financial ratios
- Budget vs. actual variance reports
Stage 16: HR Data (v0.6.0)
Purpose: Generate Hire-to-Retire (H2R) process data
When HR generation is enabled, produces payroll runs, time entries, and expense reports linked to the employee master data generated in Stage 2.
Outputs:
- Payroll runs with employee pay line items (gross, deductions, net, employer cost)
- Time entries with regular hours, overtime, PTO, and sick leave
- Expense reports with categorized line items and approval workflows
Stage 17: Accounting Standards (v0.6.0)
Purpose: Generate ASC 606/IFRS 15 revenue recognition and impairment testing data
When accounting standards generation is enabled, produces customer contracts with performance obligations for revenue recognition and asset impairment test records.
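As a hedged sketch of one common ASC 606 step, the transaction price can be allocated across performance obligations in proportion to standalone selling prices; names and figures below are illustrative, not the generator's API:
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
fn allocate(transaction_price: Decimal, standalone_prices: &[Decimal]) -> Vec<Decimal> {
    // Each obligation receives transaction_price * (SSP / total SSP), rounded to cents
    let total: Decimal = standalone_prices.iter().copied().sum();
    standalone_prices
        .iter()
        .map(|ssp| (transaction_price * ssp / total).round_dp(2))
        .collect()
}
fn main() {
    // A 100,000 contract covering a license (SSP 80,000) and support (SSP 40,000)
    let allocated = allocate(dec!(100000.00), &[dec!(80000.00), dec!(40000.00)]);
    println!("{allocated:?}"); // ≈ [66666.67, 33333.33]
}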
Outputs:
- Customer contracts with performance obligations (ASC 606/IFRS 15)
- Revenue recognition schedules
- Asset impairment tests with recoverable amount calculations
Stage 18: Manufacturing (v0.6.0)
Purpose: Generate manufacturing process data
When manufacturing is enabled, produces production orders, quality inspections, and cycle counts linked to materials from the master data.
Outputs:
- Production orders with BOM components and routing steps
- Quality inspections with pass/fail/conditional results
- Inventory cycle counts with variance analysis
Stage 19: Sales Quotes, KPIs, and Budgets (v0.6.0)
Purpose: Generate sales pipeline and financial planning data
When enabled, produces the quote-to-order pipeline, management KPI computations, and budget variance analysis.
Outputs:
- Sales quotes with line items, conversion tracking, and win/loss rates
- Management KPIs (liquidity, profitability, efficiency, leverage ratios)
- Budget records with actual vs. planned variance analysis
Parallel Execution
Stages that support parallelism:
#![allow(unused)]
fn main() {
// Parallel transaction generation
let entries: Vec<JournalEntry> = (0..num_threads)
.into_par_iter()
.flat_map(|thread_id| {
let mut gen = JournalEntryGenerator::new(
&config,
seed + thread_id as u64,
);
gen.generate_batch(batch_size)
})
.collect();
}
Progress Tracking
#![allow(unused)]
fn main() {
pub fn run_with_progress<F>(&mut self, callback: F) -> Result<()>
where
F: Fn(Progress),
{
let tracker = ProgressTracker::new(self.config.total_items());
for stage in self.stages() {
tracker.set_phase(&stage.name);
stage.run()?;
tracker.advance(stage.items);
callback(tracker.progress());
}
Ok(())
}
}
Resource Management
How SyntheticData manages system resources during generation.
Overview
Large-scale data generation can stress system resources. SyntheticData provides:
- Memory Guard: Cross-platform memory tracking with soft/hard limits
- Disk Space Guard: Disk capacity monitoring and pre-write checks
- CPU Monitor: CPU load tracking with auto-throttling
- Resource Guard: Unified orchestration of all resource guards
- Graceful Degradation: Progressive feature reduction under resource pressure
- Streaming Output: Reduce memory pressure
Memory Guard
The MemoryGuard component tracks process memory usage:
#![allow(unused)]
fn main() {
pub struct MemoryGuard {
config: MemoryGuardConfig,
last_check: Instant,
last_usage: u64,
}
pub struct MemoryGuardConfig {
pub soft_limit: u64, // Pause/slow threshold
pub hard_limit: u64, // Stop threshold
pub check_interval_ms: u64, // How often to check
pub growth_rate_threshold: f64, // Bytes/sec warning
}
pub struct MemoryStatus {
pub current_usage: u64,
pub exceeds_soft_limit: bool,
pub exceeds_hard_limit: bool,
pub growth_rate: f64,
}
}
Platform Support
| Platform | Method |
|---|---|
| Linux | /proc/self/statm |
| macOS | ps command |
| Windows | Stubbed (returns 0) |
Linux Implementation
#![allow(unused)]
fn main() {
#[cfg(target_os = "linux")]
fn get_memory_usage() -> Option<u64> {
    // Resident set size: second field of /proc/self/statm, measured in pages
    let statm = std::fs::read_to_string("/proc/self/statm").ok()?;
    let rss_pages: u64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    let page_size = unsafe { libc::sysconf(libc::_SC_PAGESIZE) } as u64;
    Some(rss_pages * page_size)
}
}
macOS Implementation
#![allow(unused)]
fn main() {
#[cfg(target_os = "macos")]
fn get_memory_usage() -> Option<u64> {
    // Resident set size in KB reported by `ps`, converted to bytes
    let output = std::process::Command::new("ps")
        .args(["-o", "rss=", "-p", &std::process::id().to_string()])
        .output()
        .ok()?;
    let rss_kb: u64 = String::from_utf8_lossy(&output.stdout)
        .trim()
        .parse()
        .ok()?;
    Some(rss_kb * 1024)
}
}
Configuration
global:
memory_limit: 2147483648 # 2 GB hard limit
Or programmatically:
#![allow(unused)]
fn main() {
let config = MemoryGuardConfig {
soft_limit: 1024 * 1024 * 1024, // 1 GB
hard_limit: 2 * 1024 * 1024 * 1024, // 2 GB
check_interval_ms: 1000, // Check every second
growth_rate_threshold: 100_000_000.0, // 100 MB/sec
};
}
Usage in Generation
#![allow(unused)]
fn main() {
pub fn generate_with_memory_guard(&mut self) -> Result<()> {
let guard = MemoryGuard::new(self.memory_config);
loop {
// Check memory
let status = guard.check();
if status.exceeds_hard_limit {
// Stop generation
return Err(Error::MemoryExceeded);
}
if status.exceeds_soft_limit {
// Flush output and trigger GC
self.sink.flush()?;
self.state.clear_caches();
continue;
}
if status.growth_rate > guard.config.growth_rate_threshold {
// Slow down
thread::sleep(Duration::from_millis(100));
}
// Generate batch
let batch = self.generator.generate_batch(BATCH_SIZE)?;
self.process_batch(batch)?;
if self.is_complete() {
break;
}
}
Ok(())
}
}
Memory Estimation
Estimate memory requirements before generation:
#![allow(unused)]
fn main() {
pub fn estimate_memory(config: &Config) -> MemoryEstimate {
let entry_size = 512; // Approximate bytes per entry
let master_data_size = config.estimate_master_data_size();
let peak = master_data_size
+ (config.transactions.target_count as u64 * entry_size);
let streaming_peak = master_data_size
+ (BATCH_SIZE as u64 * entry_size);
MemoryEstimate {
batch_peak: peak,
streaming_peak,
recommended_limit: peak * 2,
}
}
}
Memory-Efficient Patterns
Streaming Output
Write as you generate instead of accumulating:
#![allow(unused)]
fn main() {
// Memory-efficient
for entry in generator.generate_stream() {
sink.write(&entry?)?;
}
// Memory-intensive (avoid for large volumes)
let all_entries = generator.generate_batch(1_000_000)?;
sink.write_batch(&all_entries)?;
}
Batch Processing with Flush
#![allow(unused)]
fn main() {
const BATCH_SIZE: usize = 10_000;
let mut buffer = Vec::with_capacity(BATCH_SIZE);
for entry in generator.generate_stream() {
buffer.push(entry?);
if buffer.len() >= BATCH_SIZE {
sink.write_batch(&buffer)?;
buffer.clear();
}
}
// Final flush
if !buffer.is_empty() {
sink.write_batch(&buffer)?;
}
}
Lazy Loading
Load master data on demand:
#![allow(unused)]
fn main() {
pub struct LazyRegistry {
vendors: OnceCell<Vec<Vendor>>,
vendor_loader: Box<dyn Fn() -> Vec<Vendor>>,
}
impl LazyRegistry {
pub fn vendors(&self) -> &[Vendor] {
self.vendors.get_or_init(|| (self.vendor_loader)())
}
}
}
Memory Limits by Component
Estimated memory usage:
| Component | Size (per item) | For 1M entries |
|---|---|---|
| JournalEntry | ~512 bytes | ~500 MB |
| Document | ~1 KB | ~1 GB |
| Graph Node | ~128 bytes | ~128 MB |
| Graph Edge | ~64 bytes | ~64 MB |
Monitoring
Progress with Memory
#![allow(unused)]
fn main() {
orchestrator.run_with_progress(|progress| {
let memory_mb = guard.check().current_usage / 1_000_000;
println!(
"[{:.1}%] {} entries | {} MB",
progress.percent,
progress.current,
memory_mb
);
});
}
Server Endpoint
curl http://localhost:3000/health
{
"status": "healthy",
"memory_usage_mb": 512,
"memory_limit_mb": 2048,
"memory_percent": 25.0
}
Troubleshooting
Out of Memory
Symptoms: Process killed, “out of memory” error
Solutions:
- Reduce target_count
- Enable streaming output
- Increase system memory
- Set an appropriate memory_limit
Slow Generation
Symptoms: Generation slows over time
Cause: Memory pressure triggering slowdown
Solutions:
- Increase soft limit
- Reduce batch size
- Enable more aggressive flushing
Memory Not Freed
Symptoms: Memory stays high after generation
Cause: Data retained in caches
Solution: Explicitly clear state:
#![allow(unused)]
fn main() {
orchestrator.clear_caches();
}
Disk Space Guard
Monitors disk space and prevents disk exhaustion:
#![allow(unused)]
fn main() {
pub struct DiskSpaceGuardConfig {
pub hard_limit_mb: usize, // Minimum free space required
pub soft_limit_mb: usize, // Warning threshold
pub check_interval: usize, // Check every N operations
pub reserve_buffer_mb: usize, // Buffer to maintain
}
}
Platform Support
| Platform | Method |
|---|---|
| Linux/macOS | statvfs syscall |
| Windows | GetDiskFreeSpaceExW |
Usage
#![allow(unused)]
fn main() {
let guard = DiskSpaceGuard::with_min_free(100); // 100 MB minimum
// Periodic check
guard.check()?;
// Pre-write check with size estimation
guard.check_before_write(estimated_bytes)?;
// Size estimation for planning
let size = estimate_output_size_mb(100_000, &[OutputFormat::Csv], false);
}
CPU Monitor
Tracks CPU load with optional auto-throttling:
#![allow(unused)]
fn main() {
pub struct CpuMonitorConfig {
pub enabled: bool,
pub high_load_threshold: f64, // 0.85 default
pub critical_load_threshold: f64, // 0.95 default
pub sample_interval_ms: u64,
pub auto_throttle: bool,
pub throttle_delay_ms: u64,
}
}
Platform Support
| Platform | Method |
|---|---|
| Linux | /proc/stat parsing |
| macOS | top -l 1 command |
Usage
#![allow(unused)]
fn main() {
let config = CpuMonitorConfig::with_thresholds(0.85, 0.95)
.with_auto_throttle(50);
let monitor = CpuMonitor::new(config);
// In generation loop
if let Some(load) = monitor.sample() {
if load > 0.85 {
// Consider slowing down
}
monitor.maybe_throttle(); // Applies delay if critical
}
}
Unified Resource Guard
Combines all guards into single interface:
#![allow(unused)]
fn main() {
let guard = ResourceGuard::new(ResourceGuardConfig::default())
.with_memory_limit(2 * 1024 * 1024 * 1024)
.with_output_path("./output")
.with_cpu_monitoring();
// Check all resources at once
guard.check_all()?;
let stats = guard.stats();
println!("Memory: {}%", stats.memory_usage_percent);
println!("Disk: {} MB free", stats.disk_available_mb);
println!("CPU: {}%", stats.cpu_load * 100.0);
}
Graceful Degradation
Progressive feature reduction under resource pressure:
#![allow(unused)]
fn main() {
pub enum DegradationLevel {
Normal, // All features enabled
Reduced, // 50% batch, skip data quality, 50% anomaly rate
Minimal, // 25% batch, essential only, no injections
Emergency, // Flush and terminate
}
}
Thresholds
| Level | Memory | Disk | Batch Size | Actions |
|---|---|---|---|---|
| Normal | <70% | >1GB | 100% | Full operation |
| Reduced | 70-85% | 500MB-1GB | 50% | Skip data quality |
| Minimal | 85-95% | 100-500MB | 25% | Essential data only |
| Emergency | >95% | <100MB | 0% | Graceful shutdown |
Usage
#![allow(unused)]
fn main() {
let controller = DegradationController::new(DegradationConfig::default());
// Update based on current resource status
let status = ResourceStatus::new(
Some(memory_usage),
Some(disk_available_mb),
Some(cpu_load),
);
let (level, changed) = controller.update(&status);
if changed {
let actions = DegradationActions::for_level(level);
if actions.skip_data_quality {
// Disable data quality injection
}
if actions.terminate {
// Flush and exit
}
}
}
Configuration
global:
resource_budget:
memory:
hard_limit_mb: 2048
disk:
min_free_mb: 500
reserve_buffer_mb: 100
cpu:
enabled: true
high_load_threshold: 0.85
auto_throttle: true
degradation:
enabled: true
reduced_threshold: 0.70
minimal_threshold: 0.85
Enterprise Process Chains
SyntheticData models enterprise operations as interconnected process chains — end-to-end business flows that share master data, generate journal entries, and link through common documents. This page maps the current implementation status and shows how the chains integrate.
Coverage Matrix
| Chain | Full Name | Coverage | Status | Key Modules |
|---|---|---|---|---|
| S2P | Source-to-Pay | 95% | Implemented (P2P + S2C + OCPM) | document_flow/p2p_generator, sourcing/, ocpm/s2c_generator |
| O2C | Order-to-Cash | 95% | Implemented (+ OCPM) | document_flow/o2c_generator, master_data/customer, subledger/ar |
| R2R | Record-to-Report | 85% | Implemented (+ Bank Recon OCPM) | je_generator, balance/, period_close/, ocpm/bank_recon_generator |
| A2R | Acquire-to-Retire | 70% | Partially implemented | master_data/asset, subledger/fa, period_close/depreciation |
| INV | Inventory Management | 55% | Partially implemented | subledger/inventory, document_flow/ (GR/delivery links) |
| BANK | Banking & Treasury | 85% | Implemented (+ OCPM) | datasynth-banking, ocpm/bank_generator |
| H2R | Hire-to-Retire | 85% | Implemented (+ OCPM) | hr/, master_data/employee, ocpm/h2r_generator |
| MFG | Plan-to-Produce | 85% | Implemented (+ OCPM) | manufacturing/, ocpm/mfg_generator |
| AUDIT | Audit Lifecycle | 90% | Implemented (+ OCPM) | audit/, ocpm/audit_generator |
Implemented Chains
Source-to-Pay (S2P)
The S2P chain covers procurement from purchase requisition through payment:
Source-to-Contract (S2C) — Implemented (v0.6.0)
┌──────────────────────────────────────────────┐
│ Spend Analysis → RFx → Bid Eval → Contract │
└──────────────────────────┬───────────────────┘
│
┌──────────────────────────────────────────┼──────────────────────────┐
│ Procure-to-Pay (P2P) — Implemented │
│ │ │
│ Purchase Purchase Goods Vendor Three-Way │
│ Requisition → Order → Receipt → Invoice → Match → Payment │
│ │ │ │ │ │
│ │ ┌────┘ │ │ │
│ ▼ ▼ ▼ ▼ │
│ AP Open Item ← Match Result AP Aging Bank │
└────────────────────────────────────────────────────────────────────┘
│
┌──────────────────────────┘
▼
Vendor Network (quality scores, clusters, supply chain tiers)
P2P implementation details:
| Component | Types/Variants | Key Config |
|---|---|---|
| Purchase Orders | 6 types: Standard, Service, Framework, Consignment, StockTransfer, Subcontracting | flow_rate, completion_rate |
| Goods Receipts | 7 movement types: GrForPo, ReturnToVendor, GrForProduction, TransferPosting, InitialEntry, Scrapping, Consumption | gr_rate |
| Vendor Invoices | Three-way match with tolerance | price_tolerance, quantity_tolerance |
| Payments | Configurable terms and scheduling | payment_rate, timing ranges |
| Three-Way Match | PO ↔ GR ↔ Invoice validation with 6 variance types | allow_over_delivery, max_over_delivery_pct |
Order-to-Cash (O2C)
The O2C chain covers the revenue cycle from sales order through cash collection:
┌─────────────────────────────────────────────────────────────────────┐
│ Order-to-Cash (O2C) │
│ │
│ Sales Credit Delivery Customer Customer │
│ Order → Check → (Pick/ → Invoice → Receipt │
│ │ Pack/ │ │ │
│ │ Ship) │ │ │
│ │ │ ▼ ▼ │
│ │ │ AR Open Item AR Aging │
│ │ │ │ │
│ │ │ └→ Dunning Notices │
│ │ ▼ │
│ │ Inventory Issue │
│ │ (COGS posting) │
└────┼────────────────────────────────────────────────────────────────┘
│
Revenue Recognition (ASC 606 / IFRS 15)
Customer Contracts → Performance Obligations
O2C implementation details:
| Component | Types/Variants | Key Config |
|---|---|---|
| Sales Orders | 9 types: Standard, Rush, CashSale, Return, FreeOfCharge, Consignment, Service, CreditMemoRequest, DebitMemoRequest | flow_rate, credit_check |
| Deliveries | 6 types: Outbound, Return, StockTransfer, Replenishment, ConsignmentIssue, ConsignmentReturn | delivery_rate |
| Customer Invoices | 7 types: Standard, CreditMemo, DebitMemo, ProForma, DownPaymentRequest, FinalInvoice, Intercompany | invoice_rate |
| Customer Receipts | Full, partial, on-account, corrections, NSF | collection_rate |
Record-to-Report (R2R)
The R2R chain covers financial close and reporting:
Journal Entries (from all chains)
│
▼
Balance Tracker → Trial Balance → Adjustments → Close
│ │ │
├→ Intercompany Matching ├→ Accruals ├→ Year-End Close
│ └→ IC Elimination ├→ Reclasses └→ Retained Earnings
│ └→ FX Reval
▼
Consolidation
├→ Currency Translation
├→ CTA Adjustments
└→ Consolidated Trial Balance
R2R coverage:
- Journal entry generation from all process chains
- Opening balance, running balance tracking, trial balance per period
- Intercompany matching and elimination entries
- Period close engine: accruals, depreciation, year-end closing
- Audit simulation (ISA-compliant workpapers, findings, opinions)
Gaps: Financial statement generation (balance sheet, income statement, cash flow), budget vs actual reporting.
Banking & Treasury (BANK) — 85%
Implemented: Bank customer profiles, KYC/AML, bank accounts, transactions with fraud typologies (structuring, funnel, layering, mule, round-tripping). OCPM events for customer onboarding, KYC review, account management, and transaction lifecycle.
Gaps: Cash forecasting, liquidity management.
Hire-to-Retire (H2R) — 85%
Implemented: Employee master data, payroll runs with tax/deduction calculations, time entries with overtime, expense reports with policy violations. OCPM events for payroll lifecycle, time entry approval, and expense approval chains.
Gaps: Benefits administration, workforce planning.
Plan-to-Produce (MFG) — 85%
Implemented: Production orders with BOM explosion, routing operations, WIP costing, quality inspections, cycle counting. OCPM events for production order lifecycle, quality inspection, and cycle count reconciliation.
Gaps: Material requirements planning (MRP), advanced shop floor control.
Audit Lifecycle (AUDIT) — 90%
Implemented: Engagement planning, risk assessment (ISA 315/330), workpaper creation and review (ISA 230), evidence collection (ISA 500), findings (ISA 265), professional judgment documentation (ISA 200). OCPM events for the full engagement lifecycle.
Gaps: Multi-engagement portfolio management.
Partially Implemented Chains
Acquire-to-Retire (A2R) — 70%
Implemented: Fixed asset master data, depreciation (6 methods), acquisition from PO, disposal with gain/loss accounting, impairment testing (ASC 360/IAS 36).
Gaps: Capital project/WBS integration, asset transfers between companies, construction-in-progress (CIP) tracking.
Inventory Management (INV) — 55%
Implemented: Inventory positions, 22 movement types, 4 valuation methods, stock status tracking, P2P goods receipts, O2C goods issues.
Gaps: Quality inspection integration, obsolescence management, ABC analysis.
Cross-Process Integration
Process chains share data through several integration points, now with full OCPM event coverage:
S2C ──→ S2P O2C R2R
│ │ │ │
Contract GR ──── Inventory ─────┼── Delivery │
│ │ │ │
Payment ────────┼────────────┼── Receipt ──── Bank Recon
│ │ │ │ │
AP Open Item │ AR Open Item BANK │
│ MFG─┘ │ │ │
└──H2R──┴──────────────┴──── Journal Entries ┘
│ │
AUDIT ─────────────────────────── Trial Balance
│
Consolidation
──── All chains feed OCEL 2.0 Event Log (88 activities) ────
Integration Map
| Integration Point | From Chain | To Chain | Mechanism |
|---|---|---|---|
| Inventory bridge | S2P (Goods Receipt) | O2C (Delivery) | GR increases stock, delivery decreases |
| Payment clearing | S2P / O2C | BANK | Payment status → bank reconciliation |
| Journal entries | All chains | R2R | Every document posts GL entries |
| Asset acquisition | S2P (Capital PO) | A2R | PO → GR → Fixed Asset Record |
| Revenue recognition | O2C (Invoice) | R2R | Contract → Revenue JE |
| Depreciation | A2R | R2R | Monthly depreciation → Trial Balance |
| Intercompany | S2P / O2C | R2R | IC invoices → IC matching → elimination |
Document Reference Types
Documents maintain referential integrity across chains through 9 reference types:
| Reference Type | Description | Example |
|---|---|---|
| FollowOn | Normal flow succession | PO → GR |
| Payment | Payment for invoice | PAY → VI |
| Reversal | Correction/reversal | Credit Memo → Invoice |
| Partial | Partial fulfillment | Partial GR → PO |
| CreditMemo | Credit against invoice | CM → Invoice |
| DebitMemo | Debit against invoice | DM → Invoice |
| Return | Return against delivery | Return → Delivery |
| IntercompanyMatch | IC matching pair | IC-INV → IC-INV |
| Manual | User-defined reference | Any → Any |
Roadmap
The process chain expansion follows a wave-based plan:
| Wave | Focus | Chains Affected |
|---|---|---|
| Wave 1 | S2C completion, bank reconciliation, financial statements | S2P, BANK, R2R |
| Wave 2 | Payroll/time, revenue recognition generator, impairment generator | H2R, O2C, A2R |
| Wave 3 | Production orders/WIP, cycle counting/QA, expense management | MFG, INV, H2R |
| Wave 4 | Sales quotes, cash forecasting, KPIs/budgets, obsolescence | O2C, BANK, R2R, INV |
For detailed coverage targets and implementation plans, see:
- S2P Process Chain Spec — Source-to-Contract extension
- Enterprise Process Chain Gaps — Full gap analysis across all chains
See Also
- Document Flows — P2P and O2C configuration
- Subledgers — AR, AP, FA, Inventory detail
- FX & Currency — Multi-currency and translation
- Generation Pipeline — How the orchestrator sequences generators
- Roadmap — Future development plans
Design Decisions
Key architectural choices and their rationale.
1. Deterministic RNG
Decision: Use seeded ChaCha8 RNG for all randomness.
Rationale:
- Reproducible output for testing and debugging
- Consistent results across runs
- Parallel generation with per-thread seeds
Implementation:
#![allow(unused)]
fn main() {
use rand_chacha::ChaCha8Rng;
use rand::SeedableRng;
let mut rng = ChaCha8Rng::seed_from_u64(config.global.seed);
}
Trade-off: Slightly slower than system RNG, but reproducibility is essential for financial data testing.
2. Precise Decimal Arithmetic
Decision: Use rust_decimal::Decimal for all monetary values.
Rationale:
- IEEE 754 floating-point causes rounding errors
- Financial systems require exact decimal representation
- Debits must exactly equal credits
Implementation:
#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
let amount = dec!(1234.56);
let tax = amount * dec!(0.077); // Exact
}
Serialization: Decimals serialized as strings to preserve precision:
{"amount": "1234.56"}
3. Balanced Entry Enforcement
Decision: JournalEntry enforces debits = credits at construction.
Rationale:
- Invalid accounting entries should be impossible
- Catches bugs early in generation
- Guarantees trial balance coherence
Implementation:
#![allow(unused)]
fn main() {
impl JournalEntry {
pub fn new(header: JournalEntryHeader, lines: Vec<JournalEntryLine>) -> Result<Self> {
let entry = Self { header, lines };
if !entry.is_balanced() {
return Err(Error::UnbalancedEntry);
}
Ok(entry)
}
}
}
4. Collision-Free UUIDs
Decision: Use FNV-1a hash-based UUID generation with generator-type discriminators.
Rationale:
- Document IDs must be unique across all generators
- Deterministic generation requires deterministic IDs
- Different generator types might generate same sequence
Implementation:
#![allow(unused)]
fn main() {
pub struct DeterministicUuidFactory {
counter: AtomicU64,
seed: u64,
}
pub enum GeneratorType {
JournalEntry = 0x01,
DocumentFlow = 0x02,
Vendor = 0x03,
// ...
}
impl DeterministicUuidFactory {
pub fn generate(&self, gen_type: GeneratorType) -> Uuid {
let counter = self.counter.fetch_add(1, Ordering::SeqCst);
let hash_input = (self.seed, gen_type as u8, counter);
Uuid::from_bytes(fnv1a_hash(&hash_input))
}
}
}
5. Empirical Distributions
Decision: Base statistical distributions on academic research.
Rationale:
- Synthetic data should match real-world patterns
- Benford’s Law is expected in authentic financial data
- Line item distributions affect detection algorithms
Sources:
- Line item counts: GL research showing 60.68% two-line, 88% even counts
- Amounts: Log-normal with round-number bias
- Temporal: Month/quarter/year-end spikes
Implementation:
#![allow(unused)]
fn main() {
pub struct LineItemSampler {
distribution: EmpiricalDistribution,
}
impl LineItemSampler {
pub fn new() -> Self {
Self {
distribution: EmpiricalDistribution::from_data(&[
(2, 0.6068),
(3, 0.0524),
(4, 0.1732),
// ...
]),
}
}
}
}
6. Document Chain Integrity
Decision: Maintain proper reference chains with explicit links.
Rationale:
- Real document flows have traceable references
- Process mining requires complete chains
- Audit trails need document relationships
Implementation:
#![allow(unused)]
fn main() {
pub struct DocumentReference {
pub from_type: DocumentType,
pub from_id: String,
pub to_type: DocumentType,
pub to_id: String,
pub reference_type: ReferenceType,
}
// Payment explicitly references invoices
let payment_ref = DocumentReference {
from_type: DocumentType::Payment,
from_id: payment.id.clone(),
to_type: DocumentType::Invoice,
to_id: invoice.id.clone(),
reference_type: ReferenceType::PaymentFor,
};
}
7. Three-Way Match Validation
Decision: Implement actual PO/GR/Invoice matching with tolerances.
Rationale:
- Real P2P processes include match validation
- Variances are common and should be generated
- Match status affects downstream processing
Implementation:
#![allow(unused)]
fn main() {
pub fn validate_match(po: &PurchaseOrder, gr: &GoodsReceipt, inv: &Invoice,
config: &MatchConfig) -> MatchResult {
let qty_variance = (gr.quantity - po.quantity).abs() / po.quantity;
let price_variance = (inv.unit_price - po.unit_price).abs() / po.unit_price;
if qty_variance > config.quantity_tolerance {
return MatchResult::QuantityVariance(qty_variance);
}
if price_variance > config.price_tolerance {
return MatchResult::PriceVariance(price_variance);
}
MatchResult::Matched
}
}
8. Memory Guard Architecture
Decision: Cross-platform memory tracking with soft/hard limits.
Rationale:
- Large generations can exhaust memory
- OOM kills are unrecoverable
- Graceful degradation preferred
Implementation:
#![allow(unused)]
fn main() {
pub fn check(&self) -> MemoryStatus {
    let current = self.get_memory_usage();
    // Growth rate in bytes per millisecond since the last check
    let elapsed_ms = self.last_check.elapsed().as_millis().max(1) as f64;
    let growth_rate = current.saturating_sub(self.last_usage) as f64 / elapsed_ms;
    MemoryStatus {
        current_usage: current,
        exceeds_soft_limit: current > self.config.soft_limit,
        exceeds_hard_limit: current > self.config.hard_limit,
        growth_rate,
    }
}
}
9. Layered Crate Architecture
Decision: Strict layering with no circular dependencies.
Rationale:
- Clear separation of concerns
- Independent crate compilation
- Easier testing and maintenance
Layers:
- Foundation: datasynth-core (no internal dependencies)
- Services: datasynth-config, datasynth-output
- Processing: datasynth-generators, datasynth-graph
- Orchestration: datasynth-runtime
- Application: datasynth-cli, datasynth-server, datasynth-ui
10. Configuration-Driven Behavior
Decision: All behavior controlled by external configuration.
Rationale:
- Flexibility without code changes
- Reproducible scenarios
- User-customizable presets
Scope: Configuration controls:
- Industry and complexity
- Transaction volumes and patterns
- Anomaly types and rates
- Output formats
- All feature toggles
11. Trait-Based Extensibility
Decision: Define traits in core, implement in higher layers.
Rationale:
- Dependency inversion
- Pluggable implementations
- Easy testing with mocks
Example:
#![allow(unused)]
fn main() {
// Defined in datasynth-core
pub trait Generator<T> {
fn generate_batch(&mut self, count: usize) -> Result<Vec<T>>;
}
// Implemented in datasynth-generators
impl Generator<JournalEntry> for JournalEntryGenerator {
fn generate_batch(&mut self, count: usize) -> Result<Vec<JournalEntry>> {
// Implementation
}
}
}
12. Parallel-Safe Design
Decision: Design all generators to be thread-safe.
Rationale:
- Generation can be parallelized
- Modern systems have many cores
- Linear scaling improves throughput
Implementation:
- Per-thread RNG seeds: seed + thread_id
- Atomic counters for the UUID factory
- No shared mutable state during generation
- Rayon for parallel iteration
Crate Reference
SyntheticData is organized as a Rust workspace with 15 modular crates. This section provides detailed documentation for each crate.
Workspace Structure
datasynth-cli → Binary entry point (commands: generate, validate, init, info, fingerprint)
datasynth-server → REST/gRPC/WebSocket server with auth, rate limiting, timeouts
datasynth-ui → Tauri/SvelteKit desktop UI
↓
datasynth-runtime → Orchestration layer (GenerationOrchestrator coordinates workflow)
↓
datasynth-generators → Data generators (JE, Document Flows, Subledgers, Anomalies, Audit)
datasynth-banking → KYC/AML banking transaction generator with fraud typologies
datasynth-ocpm → Object-Centric Process Mining (OCEL 2.0 event logs)
datasynth-fingerprint → Privacy-preserving fingerprint extraction and synthesis
datasynth-standards → Accounting/audit standards (US GAAP, IFRS, ISA, SOX)
↓
datasynth-graph → Graph/network export (PyTorch Geometric, Neo4j, DGL)
datasynth-eval → Evaluation framework with auto-tuning and recommendations
↓
datasynth-config → Configuration schema, validation, industry presets
↓
datasynth-core → Domain models, traits, distributions, templates, resource guards
↓
datasynth-output → Output sinks (CSV, JSON, Parquet, ControlExport)
datasynth-test-utils → Testing utilities and fixtures
Crate Categories
Application Layer
| Crate | Description |
|---|---|
| datasynth-cli | Command-line interface binary with generate, validate, init, info, fingerprint commands |
| datasynth-server | REST/gRPC/WebSocket server with authentication, rate limiting, and timeouts |
| datasynth-ui | Cross-platform desktop GUI application (Tauri + SvelteKit) |
Core Processing
| Crate | Description |
|---|---|
| datasynth-runtime | Generation orchestration with resource guards and graceful degradation |
| datasynth-generators | All data generators (JE, master data, documents, subledgers, anomalies, audit) |
| datasynth-graph | ML graph export (PyTorch Geometric, Neo4j, DGL) |
Domain-Specific Modules
| Crate | Description |
|---|---|
| datasynth-banking | KYC/AML banking transactions with fraud typologies |
| datasynth-ocpm | Object-Centric Process Mining (OCEL 2.0) |
| datasynth-fingerprint | Privacy-preserving fingerprint extraction and synthesis |
| datasynth-standards | Accounting/audit standards (US GAAP, IFRS, ISA, SOX, PCAOB) |
Foundation
| Crate | Description |
|---|---|
| datasynth-core | Domain models, distributions, traits, resource guards |
| datasynth-config | Configuration schema and validation |
| datasynth-output | Output sinks (CSV, JSON, Parquet) |
Supporting
| Crate | Description |
|---|---|
| datasynth-eval | Quality evaluation with auto-tuning recommendations |
| datasynth-test-utils | Test utilities and fixtures |
Dependencies
The crates follow a strict dependency hierarchy:
- datasynth-core: No internal dependencies (foundation)
- datasynth-config: Depends on datasynth-core
- datasynth-output: Depends on datasynth-core
- datasynth-generators: Depends on datasynth-core, datasynth-config
- datasynth-graph: Depends on datasynth-core, datasynth-generators
- datasynth-eval: Depends on datasynth-core
- datasynth-banking: Depends on datasynth-core, datasynth-config
- datasynth-ocpm: Depends on datasynth-core
- datasynth-fingerprint: Depends on datasynth-core, datasynth-config
- datasynth-runtime: Depends on datasynth-core, datasynth-config, datasynth-generators, datasynth-output, datasynth-graph, datasynth-banking, datasynth-ocpm, datasynth-fingerprint, datasynth-eval
- datasynth-cli: Depends on datasynth-runtime, datasynth-fingerprint
- datasynth-server: Depends on datasynth-runtime
- datasynth-ui: Depends on datasynth-runtime (via Tauri)
- datasynth-standards: Depends on datasynth-core, datasynth-config
- datasynth-test-utils: Depends on datasynth-core
Building Individual Crates
# Build specific crate
cargo build -p datasynth-core
cargo build -p datasynth-generators
cargo build -p datasynth-fingerprint
# Run tests for specific crate
cargo test -p datasynth-core
cargo test -p datasynth-generators
cargo test -p datasynth-fingerprint
# Generate docs for specific crate
cargo doc -p datasynth-core --open
cargo doc -p datasynth-fingerprint --open
API Documentation
For detailed Rust API documentation, generate and view rustdoc:
cargo doc --workspace --no-deps --open
After deployment, API documentation is available at /api/ in the documentation site.
datasynth-core
Core domain models, traits, and distributions for synthetic accounting data generation.
Overview
datasynth-core provides the foundational building blocks for the SyntheticData workspace:
- Domain Models: Journal entries, chart of accounts, master data, documents, anomalies
- Statistical Distributions: Line item sampling, amount generation, temporal patterns
- Core Traits: Generator and Sink interfaces for extensibility
- Template System: File-based templates for regional/sector customization
- Infrastructure: UUID factory, memory guard, GL account constants
Module Structure
Domain Models (models/)
| Module | Description |
|---|---|
| journal_entry.rs | Journal entry header and balanced line items |
| chart_of_accounts.rs | Hierarchical GL accounts with account types |
| master_data.rs | Enhanced vendors, customers with payment behavior |
| documents.rs | Purchase orders, invoices, goods receipts, payments |
| temporal.rs | Bi-temporal data model for audit trails |
| anomaly.rs | Anomaly types and labels for ML training |
| internal_control.rs | SOX 404 control definitions |
| sourcing/ | S2C models: SourcingProject, SupplierQualification, RfxEvent, Bid, BidEvaluation, ProcurementContract, CatalogItem, SupplierScorecard, SpendAnalysis |
| payroll.rs | PayrollRun, PayrollLineItem with gross/deductions/net/employer cost |
| time_entry.rs | TimeEntry with regular, overtime, PTO, and sick hours |
| expense_report.rs | ExpenseReport, ExpenseLineItem with category and approval workflow |
| financial_statements.rs | FinancialStatement, FinancialStatementLineItem, CashFlowItem, StatementType |
| bank_reconciliation.rs | BankReconciliation, BankStatementLine, ReconcilingItem with auto-matching |
Statistical Distributions (distributions/)
| Distribution | Description |
|---|---|
| LineItemSampler | Empirical distribution (60.68% two-line, 88% even counts) |
| AmountSampler | Log-normal with round-number bias, Benford compliance |
| TemporalSampler | Seasonality patterns with industry integration |
| BenfordSampler | First-digit distribution following P(d) = log10(1 + 1/d) |
| FraudAmountGenerator | Suspicious amount patterns |
| IndustrySeasonality | Industry-specific volume patterns |
| HolidayCalendar | Regional holidays for US, DE, GB, CN, JP, IN |
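For illustration, a hedged sketch of the first-digit distribution P(d) = log10(1 + 1/d) and a simple inverse-CDF sampler over it, using the workspace's seeded ChaCha8 RNG; this is not the crate's BenfordSampler API:
use rand::{Rng, SeedableRng};
use rand_chacha::ChaCha8Rng;
fn benford_probabilities() -> [f64; 9] {
    // P(d) = log10(1 + 1/d) for d = 1..=9; P(1) ≈ 0.301
    let mut p = [0.0; 9];
    for d in 1..=9 {
        p[d - 1] = (1.0 + 1.0 / d as f64).log10();
    }
    p
}
fn sample_first_digit(rng: &mut impl Rng, p: &[f64; 9]) -> u32 {
    // Inverse-CDF sampling over the nine digit probabilities
    let u: f64 = rng.gen();
    let mut cumulative = 0.0;
    for (i, prob) in p.iter().enumerate() {
        cumulative += prob;
        if u < cumulative {
            return (i + 1) as u32;
        }
    }
    9
}
fn main() {
    let p = benford_probabilities();
    let mut rng = ChaCha8Rng::seed_from_u64(42);
    let digits: Vec<u32> = (0..10).map(|_| sample_first_digit(&mut rng, &p)).collect();
    println!("P(1) = {:.3}, sampled first digits: {:?}", p[0], digits);
}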
Infrastructure
| Component | Description |
|---|---|
uuid_factory.rs | Deterministic FNV-1a hash-based UUID generation |
accounts.rs | Centralized GL control account numbers |
templates/ | YAML/JSON template loading and merging |
Resource Guards
| Component | Description |
|---|---|
memory_guard.rs | Cross-platform memory tracking with soft/hard limits |
disk_guard.rs | Disk space monitoring and pre-write capacity checks |
cpu_monitor.rs | CPU load tracking with auto-throttling |
resource_guard.rs | Unified orchestration of all resource guards |
degradation.rs | Graceful degradation system (Normal→Reduced→Minimal→Emergency) |
AI & ML Modules (v0.5.0)
| Module | Description |
|---|---|
llm/provider.rs | LlmProvider trait with complete() and complete_batch() methods |
llm/mock_provider.rs | Deterministic MockLlmProvider for testing (no network required) |
llm/http_provider.rs | HttpLlmProvider for OpenAI, Anthropic, and custom API endpoints |
llm/nl_config.rs | NlConfigGenerator — natural language to YAML configuration |
llm/cache.rs | LlmCache with FNV-1a hashing for prompt deduplication |
diffusion/backend.rs | DiffusionBackend trait with forward(), reverse(), generate() methods |
diffusion/schedule.rs | NoiseSchedule with linear, cosine, and sigmoid schedules |
diffusion/statistical.rs | StatisticalDiffusionBackend — fingerprint-guided denoising |
diffusion/hybrid.rs | HybridGenerator with Interpolate, Select, Ensemble blend strategies |
diffusion/training.rs | DiffusionTrainer and TrainedDiffusionModel with save/load |
causal/graph.rs | CausalGraph with variables, edges, and built-in templates |
causal/scm.rs | StructuralCausalModel with topological-order generation |
causal/intervention.rs | InterventionEngine with do-calculus and effect estimation |
causal/counterfactual.rs | CounterfactualGenerator with abduction-action-prediction |
causal/validation.rs | CausalValidator for causal structure validation |
Key Types
JournalEntry
#![allow(unused)]
fn main() {
pub struct JournalEntry {
pub header: JournalEntryHeader,
pub lines: Vec<JournalEntryLine>,
}
pub struct JournalEntryHeader {
pub document_id: Uuid,
pub company_code: String,
pub fiscal_year: u16,
pub fiscal_period: u8,
pub posting_date: NaiveDate,
pub document_date: NaiveDate,
pub source: TransactionSource,
pub business_process: Option<BusinessProcess>,
pub is_fraud: bool,
pub fraud_type: Option<FraudType>,
pub is_anomaly: bool,
pub anomaly_type: Option<AnomalyType>,
// ... additional fields
}
}
AccountType Hierarchy
#![allow(unused)]
fn main() {
pub enum AccountType {
Asset,
Liability,
Equity,
Revenue,
Expense,
}
pub enum AccountSubType {
// Assets
Cash,
AccountsReceivable,
Inventory,
FixedAsset,
// Liabilities
AccountsPayable,
AccruedLiabilities,
LongTermDebt,
// Equity
CommonStock,
RetainedEarnings,
// Revenue
SalesRevenue,
ServiceRevenue,
// Expense
CostOfGoodsSold,
OperatingExpense,
// ...
}
}
Anomaly Types
#![allow(unused)]
fn main() {
pub enum AnomalyType {
Fraud,
Error,
ProcessIssue,
Statistical,
Relational,
}
pub struct LabeledAnomaly {
pub document_id: Uuid,
pub anomaly_id: String,
pub anomaly_type: AnomalyType,
pub category: AnomalyCategory,
pub severity: Severity,
pub description: String,
}
}
Usage Examples
Creating a Balanced Journal Entry
#![allow(unused)]
fn main() {
use synth_core::models::{JournalEntry, JournalEntryLine, JournalEntryHeader};
use rust_decimal_macros::dec;
let header = JournalEntryHeader::new(/* ... */);
let mut entry = JournalEntry::new(header);
// Add balanced lines
entry.add_line(JournalEntryLine::debit("1100", dec!(1000.00), "AR Invoice"));
entry.add_line(JournalEntryLine::credit("4000", dec!(1000.00), "Revenue"));
// Entry enforces debits = credits
assert!(entry.is_balanced());
}
Sampling Amounts
#![allow(unused)]
fn main() {
use synth_core::distributions::AmountSampler;
let sampler = AmountSampler::new(42); // seed
// Benford-compliant amount
let amount = sampler.sample_benford_compliant(1000.0, 100000.0);
// Round-number biased
let round_amount = sampler.sample_with_round_bias(1000.0, 10000.0);
}
Using the UUID Factory
#![allow(unused)]
fn main() {
use synth_core::uuid_factory::{DeterministicUuidFactory, GeneratorType};
let factory = DeterministicUuidFactory::new(42);
// Generate collision-free UUIDs across generators
let je_id = factory.generate(GeneratorType::JournalEntry);
let doc_id = factory.generate(GeneratorType::DocumentFlow);
}
Memory Guard
#![allow(unused)]
fn main() {
use synth_core::memory_guard::{MemoryGuard, MemoryGuardConfig};
let config = MemoryGuardConfig {
soft_limit: 1024 * 1024 * 1024, // 1GB soft
hard_limit: 2 * 1024 * 1024 * 1024, // 2GB hard
check_interval_ms: 1000,
..Default::default()
};
let guard = MemoryGuard::new(config);
if guard.check().exceeds_soft_limit {
// Slow down or pause generation
}
}
Disk Space Guard
#![allow(unused)]
fn main() {
use synth_core::disk_guard::{DiskSpaceGuard, DiskSpaceGuardConfig};
let config = DiskSpaceGuardConfig {
hard_limit_mb: 100, // Require at least 100 MB free
soft_limit_mb: 500, // Warn when below 500 MB
check_interval: 500, // Check every 500 operations
reserve_buffer_mb: 50, // Keep 50 MB buffer
monitor_path: Some("./output".into()),
};
let guard = DiskSpaceGuard::new(config);
guard.check()?; // Returns error if disk full
guard.check_before_write(1024 * 1024)?; // Pre-write check
}
CPU Monitor
#![allow(unused)]
fn main() {
use synth_core::cpu_monitor::{CpuMonitor, CpuMonitorConfig};
let config = CpuMonitorConfig::with_thresholds(0.85, 0.95)
.with_auto_throttle(50); // 50ms delay when critical
let monitor = CpuMonitor::new(config);
// Sample and check in generation loop
if let Some(load) = monitor.sample() {
if monitor.is_throttling() {
monitor.maybe_throttle(); // Apply delay
}
}
}
Graceful Degradation
#![allow(unused)]
fn main() {
use synth_core::degradation::{
DegradationController, DegradationConfig, ResourceStatus, DegradationActions
};
let controller = DegradationController::new(DegradationConfig::default());
let status = ResourceStatus::new(
Some(0.80), // 80% memory usage
Some(800), // 800 MB disk free
Some(0.70), // 70% CPU load
);
let (level, changed) = controller.update(&status);
let actions = DegradationActions::for_level(level);
if actions.skip_data_quality {
// Skip data quality injection
}
if actions.terminate {
// Flush and exit gracefully
}
}
LLM Provider
#![allow(unused)]
fn main() {
use synth_core::llm::{LlmProvider, LlmRequest, MockLlmProvider};
let provider = MockLlmProvider::new(42);
let request = LlmRequest::new("Generate a realistic vendor name for a manufacturing company")
.with_seed(42)
.with_max_tokens(50);
let response = provider.complete(&request)?;
println!("Generated: {}", response.content);
}
Causal Generation
#![allow(unused)]
fn main() {
use synth_core::causal::{CausalGraph, StructuralCausalModel};
// Use built-in fraud detection template
let graph = CausalGraph::fraud_detection_template();
let scm = StructuralCausalModel::new(graph)?;
// Generate observational samples
let samples = scm.generate(1000, 42)?;
// Run intervention: what if transaction_amount is set to 50000?
let intervened = scm.intervene("transaction_amount", 50000.0)?;
let intervention_samples = intervened.generate(1000, 42)?;
}
Diffusion Model
#![allow(unused)]
fn main() {
use synth_core::diffusion::{
StatisticalDiffusionBackend, DiffusionConfig, NoiseScheduleType,
HybridGenerator, BlendStrategy,
};
let config = DiffusionConfig {
n_steps: 1000,
schedule: NoiseScheduleType::Cosine,
seed: 42,
};
let backend = StatisticalDiffusionBackend::new(
vec![100.0, 5.0], // means
vec![50.0, 2.0], // stds
config,
);
let samples = backend.generate(1000, 2, 42);
// Hybrid: blend rule-based samples with diffusion samples
// (`rule_based` is a sample matrix from the rule-based generator, produced elsewhere)
let hybrid = HybridGenerator::new(0.3); // 30% diffusion weight
let blended = hybrid.blend(&rule_based, &samples, BlendStrategy::Ensemble, 42);
}
Traits
Generator Trait
#![allow(unused)]
fn main() {
pub trait Generator {
type Output;
type Error;
fn generate_batch(&mut self, count: usize) -> Result<Vec<Self::Output>, Self::Error>;
fn generate_stream(&mut self) -> impl Iterator<Item = Result<Self::Output, Self::Error>>;
}
}
Sink Trait
#![allow(unused)]
fn main() {
pub trait Sink<T> {
type Error;
fn write(&mut self, item: &T) -> Result<(), Self::Error>;
fn write_batch(&mut self, items: &[T]) -> Result<(), Self::Error>;
fn flush(&mut self) -> Result<(), Self::Error>;
}
}
PostProcessor Trait
Interface for post-generation data transformations (e.g., data quality variations):
#![allow(unused)]
fn main() {
pub struct ProcessContext {
pub record_index: usize,
pub batch_size: usize,
pub output_format: String,
pub metadata: HashMap<String, String>,
}
pub struct ProcessorStats {
pub records_processed: usize,
pub records_modified: usize,
pub labels_generated: usize,
}
}
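The trait itself is not reproduced here. Below is a minimal sketch of how a post-processor could be shaped around these two types; the trait name matches the heading, but the method names and signatures are illustrative assumptions, not the crate's confirmed API:
#![allow(unused)]
fn main() {
// Illustrative shape only; consult the rustdoc for the actual trait definition.
pub trait PostProcessor<T> {
    type Error;
    /// Transform one record in place using the generation context.
    fn process(&mut self, record: &mut T, ctx: &ProcessContext) -> Result<(), Self::Error>;
    /// Summary of what the processor changed across the batch.
    fn stats(&self) -> ProcessorStats;
}
}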
Template System
Load external templates for customization:
#![allow(unused)]
fn main() {
use synth_core::templates::{TemplateLoader, MergeStrategy};
let loader = TemplateLoader::new("templates/");
let names = loader.load_category("vendor_names", MergeStrategy::Extend)?;
}
Template categories:
person_names, vendor_names, customer_names, material_descriptions, line_item_descriptions
Decimal Handling
All financial amounts use rust_decimal::Decimal:
#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
let amount = dec!(1234.56);
let tax = amount * dec!(0.077);
}
Decimals are serialized as strings to avoid IEEE 754 floating-point issues.
See Also
datasynth-config
Configuration schema, validation, and industry presets for synthetic data generation.
Overview
datasynth-config provides the configuration layer for SyntheticData:
- Schema Definition: Complete YAML configuration schema
- Validation: Bounds checking, constraint validation, distribution sum verification
- Industry Presets: Pre-configured settings for common industries
- Complexity Levels: Small, medium, and large organization profiles
Configuration Sections
| Section | Description |
|---|---|
global | Industry, dates, seed, performance settings |
companies | Company codes, currencies, volume weights |
chart_of_accounts | COA complexity and structure |
transactions | Line items, amounts, sources, temporal patterns |
master_data | Vendors, customers, materials, assets, employees |
document_flows | P2P, O2C configuration |
intercompany | IC transaction types and transfer pricing |
balance | Opening balances, trial balance generation |
subledger | AR, AP, FA, inventory settings |
fx | Currency and exchange rate settings |
period_close | Close tasks and schedules |
fraud | Fraud injection rates and types |
internal_controls | SOX controls and SoD rules |
anomaly_injection | Anomaly rates and labeling |
data_quality | Missing values, typos, duplicates |
graph_export | ML graph export formats |
output | Output format and compression |
Industry Presets
| Industry | Description |
|---|---|
manufacturing | Heavy P2P, inventory, fixed assets |
retail | High O2C volume, seasonal patterns |
financial_services | Complex intercompany, high controls |
healthcare | Regulatory focus, seasonal insurance |
technology | SaaS revenue patterns, R&D capitalization |
Key Types
Config
#![allow(unused)]
fn main() {
pub struct Config {
pub global: GlobalConfig,
pub companies: Vec<CompanyConfig>,
pub chart_of_accounts: CoaConfig,
pub transactions: TransactionConfig,
pub master_data: MasterDataConfig,
pub document_flows: DocumentFlowConfig,
pub intercompany: IntercompanyConfig,
pub balance: BalanceConfig,
pub subledger: SubledgerConfig,
pub fx: FxConfig,
pub period_close: PeriodCloseConfig,
pub fraud: FraudConfig,
pub internal_controls: ControlConfig,
pub anomaly_injection: AnomalyConfig,
pub data_quality: DataQualityConfig,
pub graph_export: GraphExportConfig,
pub output: OutputConfig,
}
}
GlobalConfig
#![allow(unused)]
fn main() {
pub struct GlobalConfig {
pub seed: Option<u64>,
pub industry: Industry,
pub start_date: NaiveDate,
pub period_months: u32, // 1-120
pub group_currency: String,
pub worker_threads: Option<usize>,
pub memory_limit: Option<u64>,
}
}
CompanyConfig
#![allow(unused)]
fn main() {
pub struct CompanyConfig {
pub code: String,
pub name: String,
pub currency: String,
pub country: String,
pub volume_weight: f64, // Must sum to 1.0 across companies
pub is_parent: bool,
pub parent_code: Option<String>,
}
}
Validation Rules
The ConfigValidator enforces:
| Rule | Constraint |
|---|---|
period_months | 1-120 (max 10 years) |
compression_level | 1-9 when compression enabled |
| Rate fields | 0.0-1.0 |
| Approval thresholds | Strictly ascending order |
| Distribution weights | Sum to 1.0 (±0.01 tolerance) |
| Company codes | Unique within configuration |
| Dates | start_date + period_months is valid |
Usage Examples
Loading Configuration
#![allow(unused)]
fn main() {
use synth_config::{Config, ConfigValidator};
// From YAML file
let config = Config::from_yaml_file("config.yaml")?;
// Validate
let validator = ConfigValidator::new();
validator.validate(&config)?;
}
Using Presets
#![allow(unused)]
fn main() {
use synth_config::{Config, Industry, Complexity};
// Create from preset
let mut config = Config::from_preset(Industry::Manufacturing, Complexity::Medium);
// Modify as needed
config.transactions.target_count = 50000;
}
Creating Configuration Programmatically
#![allow(unused)]
fn main() {
use synth_config::{Config, GlobalConfig, Industry, TransactionConfig};
use chrono::NaiveDate;
let config = Config {
global: GlobalConfig {
seed: Some(42),
industry: Industry::Manufacturing,
start_date: NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
period_months: 12,
group_currency: "USD".to_string(),
..Default::default()
},
transactions: TransactionConfig {
target_count: 100000,
..Default::default()
},
..Default::default()
};
}
Saving Configuration
#![allow(unused)]
fn main() {
// To YAML
config.to_yaml_file("config.yaml")?;
// To JSON
config.to_json_file("config.json")?;
// To string
let yaml = config.to_yaml_string()?;
}
Configuration Examples
Minimal Configuration
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 10000
output:
format: csv
Full Configuration
See the YAML Schema Reference for complete documentation.
Complexity Levels
| Level | Accounts | Vendors | Customers | Materials |
|---|---|---|---|---|
small | ~100 | 50 | 100 | 200 |
medium | ~400 | 200 | 500 | 1000 |
large | ~2500 | 1000 | 5000 | 10000 |
Validation Error Types
#![allow(unused)]
fn main() {
pub enum ConfigError {
MissingRequiredField(String),
InvalidValue { field: String, value: String, constraint: String },
DistributionSumError { field: String, sum: f64 },
DuplicateCode { field: String, code: String },
DateRangeError { start: NaiveDate, end: NaiveDate },
ParseError(String),
}
}
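These variants can be matched to produce actionable error messages. A sketch, assuming (as in the loading example above) that ConfigValidator::validate returns this error type:
#![allow(unused)]
fn main() {
use synth_config::{Config, ConfigValidator, ConfigError};
let config = Config::from_yaml_file("config.yaml")?;
match ConfigValidator::new().validate(&config) {
    Ok(()) => println!("configuration is valid"),
    Err(ConfigError::DistributionSumError { field, sum }) => {
        eprintln!("{field}: weights sum to {sum}, expected 1.0 (±0.01)");
    }
    Err(ConfigError::InvalidValue { field, value, constraint }) => {
        eprintln!("{field} = {value} violates constraint: {constraint}");
    }
    Err(_) => eprintln!("configuration is invalid"),
}
}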
See Also
datasynth-generators
Data generators for journal entries, master data, document flows, and anomalies.
Overview
datasynth-generators contains all data generation logic for SyntheticData:
- Core Generators: Journal entries, chart of accounts, users
- Master Data: Vendors, customers, materials, assets, employees
- Document Flows: P2P (Procure-to-Pay), O2C (Order-to-Cash)
- Financial: Intercompany, balance tracking, subledgers, FX, period close
- Quality: Anomaly injection, data quality variations
- Sourcing (S2C): Spend analysis, RFx, bids, contracts, catalogs, scorecards (v0.6.0)
- HR / Payroll: Payroll runs, time entries, expense reports (v0.6.0)
- Financial Reporting: Financial statements, bank reconciliation (v0.6.0)
- Standards: Revenue recognition, impairment testing (v0.6.0)
- Manufacturing: Production orders, quality inspections, cycle counts (v0.6.0)
Module Structure
Core Generators
| Generator | Description |
|---|---|
je_generator | Journal entry generation with statistical distributions |
coa_generator | Chart of accounts with industry-specific structures |
company_selector | Weighted company selection for transactions |
user_generator | User/persona generation with roles |
control_generator | Internal controls and SoD rules |
Master Data (master_data/)
| Generator | Description |
|---|---|
vendor_generator | Vendors with payment terms, bank accounts, behaviors |
customer_generator | Customers with credit ratings, payment patterns |
material_generator | Materials/products with BOM, valuations |
asset_generator | Fixed assets with depreciation schedules |
employee_generator | Employees with manager hierarchy |
entity_registry_manager | Central entity registry with temporal validity |
Document Flow (document_flow/)
| Generator | Description |
|---|---|
p2p_generator | PO → GR → Invoice → Payment flow |
o2c_generator | SO → Delivery → Invoice → Receipt flow |
document_chain_manager | Reference chain management |
document_flow_je_generator | Generate JEs from document flows |
three_way_match | PO/GR/Invoice matching validation |
Intercompany (intercompany/)
| Generator | Description |
|---|---|
ic_generator | Matched intercompany entry pairs |
matching_engine | IC matching and reconciliation |
elimination_generator | Consolidation elimination entries |
Balance (balance/)
| Generator | Description |
|---|---|
opening_balance_generator | Coherent opening balance sheet |
balance_tracker | Running balance validation |
trial_balance_generator | Period-end trial balance |
Subledger (subledger/)
| Generator | Description |
|---|---|
ar_generator | AR invoices, receipts, credit memos, aging |
ap_generator | AP invoices, payments, debit memos |
fa_generator | Fixed assets, depreciation, disposals |
inventory_generator | Inventory positions, movements, valuation |
reconciliation | GL-to-subledger reconciliation |
FX (fx/)
| Generator | Description |
|---|---|
fx_rate_service | FX rate generation (Ornstein-Uhlenbeck process) |
currency_translator | Trial balance translation |
cta_generator | Currency Translation Adjustment entries |
Period Close (period_close/)
| Generator | Description |
|---|---|
close_engine | Main orchestration |
accruals | Accrual entry generation |
depreciation | Monthly depreciation runs |
year_end | Year-end closing entries |
Anomaly (anomaly/)
| Generator | Description |
|---|---|
injector | Main anomaly injection engine |
types | Weighted anomaly type configurations |
strategies | Injection strategies (amount, date, duplication) |
patterns | Temporal patterns, clustering, entity targeting |
Data Quality (data_quality/)
| Generator | Description |
|---|---|
injector | Main data quality injector |
missing_values | MCAR, MAR, MNAR, Systematic patterns |
format_variations | Date, amount, identifier formats |
duplicates | Exact, near, fuzzy duplicates |
typos | Keyboard-aware typos, OCR errors |
labels | ML training labels for data quality issues |
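Data quality issues are injected after generation, analogously to anomalies. The sketch below shows the expected flow; the injector's type name and method signature are assumptions modeled on the AnomalyInjector usage shown later, not confirmed API:
#![allow(unused)]
fn main() {
// Hypothetical names, mirroring the anomaly injector's interface.
use synth_generators::data_quality::DataQualityInjector;
let mut injector = DataQualityInjector::new(config.data_quality, seed);
// Returns degraded entries plus ML labels describing each injected issue.
let (degraded_entries, quality_labels) = injector.inject(&entries)?;
}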
Audit (audit/)
ISA-compliant audit data generation.
| Generator | Description |
|---|---|
engagement_generator | Audit engagement with phases (Planning, Fieldwork, Completion) |
workpaper_generator | Audit workpapers per ISA 230 |
evidence_generator | Audit evidence per ISA 500 |
risk_generator | Risk assessment per ISA 315/330 |
finding_generator | Audit findings per ISA 265 |
judgment_generator | Professional judgment documentation per ISA 200 |
LLM Enrichment (llm_enrichment/) — v0.5.0
| Generator | Description |
|---|---|
VendorLlmEnricher | Generate realistic vendor names by industry, spend category, and country |
TransactionLlmEnricher | Generate transaction descriptions and memo fields |
AnomalyLlmExplainer | Generate natural language explanations for injected anomalies |
Sourcing (sourcing/) – v0.6.0
Source-to-Contract (S2C) procurement pipeline generators.
| Generator | Description |
|---|---|
spend_analysis_generator | Spend analysis records and category hierarchies |
sourcing_project_generator | Sourcing project lifecycle management |
qualification_generator | Supplier qualification assessments |
rfx_generator | RFx events (RFI/RFP/RFQ) with invited suppliers |
bid_generator | Supplier bids with pricing and compliance data |
bid_evaluation_generator | Bid scoring, ranking, and award recommendations |
contract_generator | Procurement contracts with terms and renewal rules |
catalog_generator | Catalog items linked to contracts |
scorecard_generator | Supplier scorecards with performance metrics |
Generation DAG: spend_analysis -> sourcing_project -> qualification -> rfx -> bid -> bid_evaluation -> contract -> catalog -> [P2P] -> scorecard
HR (hr/) – v0.6.0
Hire-to-Retire (H2R) generators for the HR process chain.
| Generator | Description |
|---|---|
payroll_generator | Payroll runs with employee pay line items (gross, deductions, net, employer cost) |
time_entry_generator | Employee time entries with regular, overtime, PTO, and sick hours |
expense_report_generator | Expense reports with categorized line items and approval workflows |
Standards (standards/) – v0.6.0
Accounting and audit standards generators.
| Generator | Description |
|---|---|
revenue_recognition_generator | ASC 606/IFRS 15 customer contracts with performance obligations |
impairment_generator | Asset impairment tests with recoverable amount calculations |
Period Close Additions – v0.6.0
| Generator | Description |
|---|---|
financial_statement_generator | Balance sheet, income statement, cash flow, and changes in equity from trial balance data |
Bank Reconciliation – v0.6.0
| Generator | Description |
|---|---|
bank_reconciliation_generator | Bank reconciliations with statement lines, auto-matching, and reconciling items |
Relationships (relationships/)
| Generator | Description |
|---|---|
entity_graph_generator | Cross-process entity relationship graphs |
relationship_strength | Weighted relationship strength calculation |
Audit Engagement Structure:
#![allow(unused)]
fn main() {
pub struct AuditEngagement {
pub engagement_id: String,
pub client_name: String,
pub fiscal_year: u16,
pub phase: AuditPhase, // Planning, Fieldwork, Completion
pub materiality: MaterialityLevels,
pub team_size: usize,
pub has_fraud_risk: bool,
pub has_significant_risk: bool,
}
pub struct MaterialityLevels {
pub primary_materiality: Decimal, // 0.3-1% of base
pub performance_materiality: Decimal, // 50-75% of primary
pub clearly_trivial: Decimal, // 3-5% of primary
}
}
Usage Examples
Journal Entry Generation
#![allow(unused)]
fn main() {
use synth_generators::je_generator::JournalEntryGenerator;
let mut generator = JournalEntryGenerator::new(config, seed);
// Generate batch
let entries = generator.generate_batch(1000)?;
// Stream generation
for entry in generator.generate_stream().take(1000) {
process(entry?);
}
}
Master Data Generation
#![allow(unused)]
fn main() {
use synth_generators::master_data::{VendorGenerator, CustomerGenerator};
let mut vendor_gen = VendorGenerator::new(seed);
let vendors = vendor_gen.generate(100);
let mut customer_gen = CustomerGenerator::new(seed);
let customers = customer_gen.generate(200);
}
Document Flow Generation
#![allow(unused)]
fn main() {
use synth_generators::document_flow::{P2pGenerator, O2cGenerator};
let mut p2p = P2pGenerator::new(config, seed);
let p2p_flows = p2p.generate_batch(500)?;
let mut o2c = O2cGenerator::new(config, seed);
let o2c_flows = o2c.generate_batch(500)?;
}
Anomaly Injection
#![allow(unused)]
fn main() {
use synth_generators::anomaly::AnomalyInjector;
let mut injector = AnomalyInjector::new(config.anomaly_injection, seed);
// Inject into existing entries
let (modified_entries, labels) = injector.inject(&entries)?;
}
LLM Enrichment
#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::{VendorLlmEnricher, TransactionLlmEnricher};
use synth_core::llm::MockLlmProvider;
use std::sync::Arc;
let provider = Arc::new(MockLlmProvider::new(42));
// Enrich vendor names
let vendor_enricher = VendorLlmEnricher::new(provider.clone());
let name = vendor_enricher.enrich_vendor_name("manufacturing", "raw_materials", "US")?;
// Enrich transaction descriptions
let tx_enricher = TransactionLlmEnricher::new(provider);
let desc = tx_enricher.enrich_description("Office Supplies", "1000-5000", "retail", 3)?;
let memo = tx_enricher.enrich_memo("VendorInvoice", "Acme Corp", "2500.00")?;
}
Three-Way Match
The P2P generator validates document matching:
#![allow(unused)]
fn main() {
use synth_generators::document_flow::ThreeWayMatch;
let match_result = ThreeWayMatch::validate(
&purchase_order,
&goods_receipt,
&vendor_invoice,
tolerance_config,
);
match match_result {
MatchResult::Passed => { /* Process normally */ }
MatchResult::QuantityVariance(var) => { /* Handle variance */ }
MatchResult::PriceVariance(var) => { /* Handle variance */ }
}
}
Balance Coherence
The balance tracker maintains the accounting equation:
#![allow(unused)]
fn main() {
use synth_generators::balance::BalanceTracker;
let mut tracker = BalanceTracker::new();
for entry in &entries {
tracker.post(&entry)?;
}
// Verify Assets = Liabilities + Equity
assert!(tracker.is_balanced());
}
FX Rate Generation
Uses Ornstein-Uhlenbeck process for realistic rate movements:
#![allow(unused)]
fn main() {
use synth_generators::fx::FxRateService;
let mut fx_service = FxRateService::new(config.fx, seed);
// Get rate for date
let rate = fx_service.get_rate("EUR", "USD", date)?;
// Generate daily rates
let rates = fx_service.generate_daily_rates(start, end)?;
}
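For intuition, here is a minimal sketch of the discretized Ornstein-Uhlenbeck update such a rate service is built on (illustrative parameters and a crude noise source; not the crate's internals):
#![allow(unused)]
fn main() {
// x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * eps
let (mu, theta, sigma, dt) = (1.10_f64, 0.05, 0.01, 1.0); // long-run rate, reversion speed, volatility, daily step
// Tiny deterministic noise source (LCG) standing in for a Normal draw.
let mut state: u64 = 42;
let mut noise = || {
    state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    (state >> 11) as f64 / (1u64 << 53) as f64 * 2.0 - 1.0
};
let mut rate = mu;
let mut rates = Vec::with_capacity(365);
for _ in 0..365 {
    let eps = noise();
    rate += theta * (mu - rate) * dt + sigma * dt.sqrt() * eps;
    rates.push(rate); // mean-reverting path around mu, like a daily EUR/USD series
}
}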
Anomaly Types
Fraud Types
- FictitiousTransaction, RevenueManipulation, ExpenseCapitalization
- SplitTransaction, RoundTripping, KickbackScheme
- GhostEmployee, DuplicatePayment, UnauthorizedDiscount
Error Types
- DuplicateEntry, ReversedAmount, WrongPeriod
- WrongAccount, MissingReference, IncorrectTaxCode
Process Issues
- LatePosting, SkippedApproval, ThresholdManipulation
- MissingDocumentation, OutOfSequence
Statistical Anomalies
- UnusualAmount, TrendBreak, BenfordViolation, OutlierValue
Relational Anomalies
- CircularTransaction, DormantAccountActivity, UnusualCounterparty
See Also
datasynth-output
Output sinks for CSV, JSON, and streaming formats.
Overview
datasynth-output provides the output layer for SyntheticData:
- CSV Sink: High-performance CSV writing with optional compression
- JSON Sink: JSON and JSONL (newline-delimited) output
- Streaming: Async streaming output for real-time generation
- Control Export: Internal control and SoD rule export
Supported Formats
Standard Formats
| Format | Description | Extension |
|---|---|---|
| CSV | Standard comma-separated values | .csv |
| JSON | Pretty-printed JSON arrays | .json |
| JSONL | Newline-delimited JSON | .jsonl |
| Parquet | Apache Parquet columnar format | .parquet |
ERP Formats
| Format | Target ERP | Tables |
|---|---|---|
| SAP S/4HANA | SapExporter | BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA, CSKS, CEPC |
| Oracle EBS | OracleExporter | GL_JE_HEADERS, GL_JE_LINES, GL_JE_BATCHES |
| NetSuite | NetSuiteExporter | Journal entries with subsidiary/multi-book support |
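The ERP exporters are used like the other sinks. The sketch below assumes a path-based constructor and a single export call; the exact constructor and method signatures are not documented here and may differ:
#![allow(unused)]
fn main() {
use synth_output::SapExporter;
// Assumed method names; check the rustdoc for the actual exporter API.
let exporter = SapExporter::new("output/sap/");
exporter.export(&journal_entries)?; // writes BKPF/BSEG/ACDOCA plus master data tables
}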
Streaming Sinks
| Sink | Description |
|---|---|
CsvStreamingSink | Streaming CSV with automatic headers |
JsonStreamingSink | Streaming JSON arrays |
NdjsonStreamingSink | Streaming newline-delimited JSON |
ParquetStreamingSink | Streaming Apache Parquet |
Features
- Configurable compression (gzip, zstd, snappy for Parquet)
- Streaming writes for memory efficiency with backpressure support
- ERP-native table schemas (SAP, Oracle, NetSuite)
- Decimal values serialized as strings (IEEE 754 safe)
- Configurable field ordering and headers
- Automatic directory creation
Key Types
OutputConfig
#![allow(unused)]
fn main() {
pub struct OutputConfig {
pub format: OutputFormat,
pub compression: CompressionType,
pub compression_level: u32,
pub include_headers: bool,
pub decimal_precision: u32,
}
pub enum OutputFormat {
Csv,
Json,
Jsonl,
}
pub enum CompressionType {
None,
Gzip,
Zstd,
}
}
CsvSink
#![allow(unused)]
fn main() {
pub struct CsvSink<T> {
writer: BufWriter<Box<dyn Write>>,
config: OutputConfig,
headers_written: bool,
_phantom: PhantomData<T>,
}
}
JsonSink
#![allow(unused)]
fn main() {
pub struct JsonSink<T> {
writer: BufWriter<Box<dyn Write>>,
format: JsonFormat,
first_written: bool,
_phantom: PhantomData<T>,
}
}
Usage Examples
CSV Output
#![allow(unused)]
fn main() {
use synth_output::{CsvSink, CompressionType, OutputConfig, OutputFormat};
// Create sink
let config = OutputConfig {
format: OutputFormat::Csv,
compression: CompressionType::None,
include_headers: true,
..Default::default()
};
let mut sink = CsvSink::new("output/journal_entries.csv", config)?;
// Write data
sink.write_batch(&entries)?;
sink.flush()?;
}
Compressed Output
#![allow(unused)]
fn main() {
use synth_output::{CsvSink, OutputConfig, CompressionType};
let config = OutputConfig {
compression: CompressionType::Gzip,
compression_level: 6,
..Default::default()
};
let mut sink = CsvSink::new("output/entries.csv.gz", config)?;
sink.write_batch(&entries)?;
}
JSON Streaming
#![allow(unused)]
fn main() {
use synth_output::{JsonSink, OutputConfig, OutputFormat};
let config = OutputConfig {
format: OutputFormat::Jsonl,
..Default::default()
};
let mut sink = JsonSink::new("output/entries.jsonl", config)?;
// Stream writes (memory efficient)
for entry in entries {
sink.write(&entry)?;
}
sink.flush()?;
}
Control Export
#![allow(unused)]
fn main() {
use synth_output::ControlExporter;
let exporter = ControlExporter::new("output/controls/");
// Export all control-related data
exporter.export_controls(&internal_controls)?;
exporter.export_sod_rules(&sod_rules)?;
exporter.export_control_mappings(&mappings)?;
}
Sink Trait Implementation
All sinks implement the Sink trait:
#![allow(unused)]
fn main() {
impl<T: Serialize> Sink<T> for CsvSink<T> {
type Error = OutputError;
fn write(&mut self, item: &T) -> Result<(), Self::Error> {
// Single item write
}
fn write_batch(&mut self, items: &[T]) -> Result<(), Self::Error> {
// Batch write for efficiency
}
fn flush(&mut self) -> Result<(), Self::Error> {
// Ensure all data written to disk
}
}
}
Decimal Serialization
Financial amounts are serialized as strings to prevent IEEE 754 floating-point issues:
#![allow(unused)]
fn main() {
// Internal: Decimal
let amount = dec!(1234.56);
// CSV output: "1234.56" (string)
// JSON output: "1234.56" (string, not number)
}
This ensures exact decimal representation across all systems.
Performance Tips
Batch Writes
Prefer batch writes over individual writes:
#![allow(unused)]
fn main() {
// Good: Single batch write
sink.write_batch(&entries)?;
// Less efficient: Multiple writes
for entry in &entries {
sink.write(entry)?;
}
}
Buffer Size
The default buffer size is 8KB. For very large outputs, consider adjusting:
#![allow(unused)]
fn main() {
let sink = CsvSink::with_buffer_size(
"output/large.csv",
config,
64 * 1024, // 64KB buffer
)?;
}
Compression Trade-offs
| Compression | Speed | Size | Use Case |
|---|---|---|---|
| None | Fastest | Largest | Development, streaming |
| Gzip | Medium | Small | General purpose |
| Zstd | Fast | Smallest | Production, archival |
Output Structure
The output module creates an organized directory structure:
output/
├── master_data/
│ ├── vendors.csv
│ └── customers.csv
├── transactions/
│ ├── journal_entries.csv
│ └── acdoca.csv
├── controls/
│ ├── internal_controls.csv
│ └── sod_rules.csv
└── labels/
└── anomaly_labels.csv
Error Handling
#![allow(unused)]
fn main() {
pub enum OutputError {
IoError(std::io::Error),
SerializationError(String),
CompressionError(String),
DirectoryCreationError(PathBuf),
}
}
See Also
- Output Formats — Standard format details
- ERP Output Formats — SAP, Oracle, NetSuite exports
- Streaming Output — StreamingSink API
- Configuration - Output Settings
- datasynth-core
datasynth-runtime
Runtime orchestration, parallel execution, and memory management.
Overview
datasynth-runtime provides the execution layer for SyntheticData:
- GenerationOrchestrator: Coordinates the complete generation workflow
- Parallel Execution: Multi-threaded generation with Rayon
- Memory Management: Integration with memory guard for OOM prevention
- Progress Tracking: Real-time progress reporting with pause/resume
Key Components
| Component | Description |
|---|---|
GenerationOrchestrator | Main workflow coordinator |
EnhancedOrchestrator | Extended orchestrator with all enterprise features |
ParallelExecutor | Thread pool management |
ProgressTracker | Progress bars and status reporting |
Generation Workflow
The orchestrator executes phases in order:
- Initialize: Load configuration, validate settings
- Master Data: Generate vendors, customers, materials, assets
- Opening Balances: Create coherent opening balance sheet
- Transactions: Generate journal entries with document flows
- Period Close: Run monthly/quarterly/annual close processes
- Anomalies: Inject configured anomalies and data quality issues
- Export: Write outputs and generate ML labels
- Banking: Generate KYC/AML data (if enabled)
- Audit: Generate ISA-compliant audit data (if enabled)
- Graphs: Build and export ML graphs (if enabled)
- LLM Enrichment: Enrich data with LLM-generated metadata (v0.5.0, if enabled)
- Diffusion Enhancement: Blend diffusion model outputs (v0.5.0, if enabled)
- Causal Overlay: Apply causal structure (v0.5.0, if enabled)
- S2C Sourcing: Generate Source-to-Contract procurement pipeline (v0.6.0, if enabled)
- Financial Reporting: Generate bank reconciliations and financial statements (v0.6.0, if enabled)
- HR Data: Generate payroll runs, time entries, and expense reports (v0.6.0, if enabled)
- Accounting Standards: Generate revenue recognition and impairment data (v0.6.0, if enabled)
- Manufacturing: Generate production orders, quality inspections, and cycle counts (v0.6.0, if enabled)
- Sales/KPIs/Budgets: Generate sales quotes, management KPIs, and budget variance data (v0.6.0, if enabled)
Key Types
GenerationOrchestrator
#![allow(unused)]
fn main() {
pub struct GenerationOrchestrator {
config: Config,
state: GenerationState,
progress: Arc<ProgressTracker>,
memory_guard: MemoryGuard,
}
pub struct GenerationState {
pub master_data: MasterDataState,
pub entries: Vec<JournalEntry>,
pub documents: DocumentState,
pub balances: BalanceState,
pub anomaly_labels: Vec<LabeledAnomaly>,
}
}
ProgressTracker
#![allow(unused)]
fn main() {
pub struct ProgressTracker {
pub current: AtomicU64,
pub total: u64,
pub phase: String,
pub paused: AtomicBool,
pub start_time: Instant,
}
pub struct Progress {
pub current: u64,
pub total: u64,
pub percent: f64,
pub phase: String,
pub entries_per_second: f64,
pub elapsed: Duration,
pub estimated_remaining: Duration,
}
}
Usage Examples
Basic Generation
#![allow(unused)]
fn main() {
use synth_runtime::GenerationOrchestrator;
let config = Config::from_yaml_file("config.yaml")?;
let orchestrator = GenerationOrchestrator::new(config)?;
// Run full generation
orchestrator.run()?;
}
With Progress Callback
#![allow(unused)]
fn main() {
orchestrator.run_with_progress(|progress| {
println!(
"[{:.1}%] {} - {}/{} ({:.0} entries/sec)",
progress.percent,
progress.phase,
progress.current,
progress.total,
progress.entries_per_second,
);
})?;
}
Parallel Execution
#![allow(unused)]
fn main() {
use synth_runtime::ParallelExecutor;
let executor = ParallelExecutor::new(4); // 4 threads
let results: Vec<JournalEntry> = executor.run(|thread_id| {
let mut generator = JournalEntryGenerator::new(config.clone(), seed + thread_id);
generator.generate_batch(batch_size)
})?;
}
Memory-Aware Generation
#![allow(unused)]
fn main() {
use synth_runtime::GenerationOrchestrator;
use synth_core::memory_guard::MemoryGuardConfig;
let memory_config = MemoryGuardConfig {
soft_limit: 1024 * 1024 * 1024, // 1GB
hard_limit: 2 * 1024 * 1024 * 1024, // 2GB
check_interval_ms: 1000,
..Default::default()
};
let orchestrator = GenerationOrchestrator::with_memory_config(config, memory_config)?;
orchestrator.run()?;
}
Pause/Resume
On Unix systems, generation can be paused and resumed:
# Start generation in background
datasynth-data generate --config config.yaml --output ./output &
# Send SIGUSR1 to toggle pause
kill -USR1 $(pgrep datasynth-data)
# Progress bar shows pause state
# [████████░░░░░░░░░░░░] 40% (PAUSED)
Programmatic Pause/Resume
#![allow(unused)]
fn main() {
// Pause
orchestrator.pause();
// Check state
if orchestrator.is_paused() {
println!("Generation paused");
}
// Resume
orchestrator.resume();
}
Enhanced Orchestrator
The EnhancedOrchestrator includes additional enterprise features:
#![allow(unused)]
fn main() {
use synth_runtime::EnhancedOrchestrator;
let orchestrator = EnhancedOrchestrator::new(config)?;
// All features enabled
orchestrator
.with_document_flows()
.with_intercompany()
.with_subledgers()
.with_fx()
.with_period_close()
.with_anomaly_injection()
.with_graph_export()
.run()?;
}
Enterprise Process Chain Phases (v0.6.0)
The EnhancedOrchestrator supports six new phases for enterprise process chains, controlled by PhaseConfig:
| Phase | Config Flag | Description |
|---|---|---|
| 14 | generate_sourcing | S2C procurement pipeline: spend analysis through supplier scorecards |
| 15 | generate_financial_statements / generate_bank_reconciliation | Financial statements and bank reconciliations |
| 16 | generate_hr | Payroll runs, time entries, expense reports |
| 17 | generate_accounting_standards | Revenue recognition (ASC 606/IFRS 15), impairment testing |
| 18 | generate_manufacturing | Production orders, quality inspections, cycle counts |
| 19 | generate_sales_kpi_budgets | Sales quotes, management KPIs, budget variance analysis |
Each phase can be enabled independently and is skipped gracefully when its dependencies (e.g., master data) are unavailable, as sketched below.
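A sketch of enabling a subset of these phases; the flag names come from the table above, while the struct's module path, Default implementation, and remaining fields are assumptions:
#![allow(unused)]
fn main() {
use synth_runtime::PhaseConfig; // path assumed
let phases = PhaseConfig {
    generate_sourcing: true,
    generate_hr: true,
    generate_manufacturing: false,
    ..Default::default()
};
}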
Output Coordination
The orchestrator coordinates output to multiple sinks:
#![allow(unused)]
fn main() {
// Orchestrator automatically:
// 1. Creates output directories
// 2. Writes master data files
// 3. Writes transaction files
// 4. Writes subledger files
// 5. Writes labels for ML
// 6. Generates graphs if enabled
}
Error Handling
#![allow(unused)]
fn main() {
pub enum RuntimeError {
ConfigurationError(ConfigError),
GenerationError(String),
MemoryExceeded { limit: u64, current: u64 },
OutputError(OutputError),
Interrupted,
}
}
Performance Considerations
Thread Count
#![allow(unused)]
fn main() {
// Auto-detect (uses all cores)
let orchestrator = GenerationOrchestrator::new(config)?;
// Manual thread count
let orchestrator = GenerationOrchestrator::with_threads(config, 4)?;
}
Memory Management
The orchestrator monitors memory and can:
- Slow down generation when soft limit approached
- Pause generation at hard limit
- Stream output to reduce memory pressure
Batch Sizes
Batch sizes are automatically tuned based on:
- Available memory
- Number of threads
- Target throughput
See Also
datasynth-graph
Graph/network export for synthetic accounting data with ML-ready formats.
Overview
datasynth-graph provides graph construction and export capabilities:
- Graph Builders: Transaction, approval, entity relationship, and multi-layer hypergraph builders
- Hypergraph: 3-layer hypergraph (Governance, Process Events, Accounting Network) spanning 8 process families with 24 entity type codes and OCPM event hyperedges
- ML Export: PyTorch Geometric, Neo4j, DGL, RustGraph, and RustGraph Hypergraph formats
- Feature Engineering: Temporal, amount, structural, and categorical features
- Data Splits: Train/validation/test split generation
Graph Types
| Graph | Nodes | Edges | Use Case |
|---|---|---|---|
| Transaction Network | Accounts, Entities | Transactions | Anomaly detection |
| Approval Network | Users | Approvals | SoD analysis |
| Entity Relationship | Legal Entities | Ownership | Consolidation analysis |
Export Formats
PyTorch Geometric
graphs/transaction_network/pytorch_geometric/
├── node_features.pt # [num_nodes, num_features]
├── edge_index.pt # [2, num_edges]
├── edge_attr.pt # [num_edges, num_edge_features]
├── labels.pt # [num_nodes] or [num_edges]
├── train_mask.pt # Boolean mask
├── val_mask.pt
└── test_mask.pt
Neo4j
graphs/entity_relationship/neo4j/
├── nodes_account.csv
├── nodes_entity.csv
├── edges_transaction.csv
├── edges_ownership.csv
└── import.cypher
DGL (Deep Graph Library)
graphs/approval_network/dgl/
├── graph.bin # DGL graph object
├── node_feats.npy # Node features
├── edge_feats.npy # Edge features
└── labels.npy # Labels
Feature Categories
| Category | Features |
|---|---|
| Temporal | weekday, period, is_month_end, is_quarter_end, is_year_end |
| Amount | log(amount), benford_probability, is_round_number |
| Structural | line_count, unique_accounts, has_intercompany |
| Categorical | business_process (one-hot), source_type (one-hot) |
Key Types
Graph Models
#![allow(unused)]
fn main() {
pub struct Graph {
pub nodes: Vec<Node>,
pub edges: Vec<Edge>,
pub node_features: Option<Array2<f32>>,
pub edge_features: Option<Array2<f32>>,
}
pub enum Node {
Account(AccountNode),
Entity(EntityNode),
User(UserNode),
Transaction(TransactionNode),
}
pub enum Edge {
Transaction(TransactionEdge),
Approval(ApprovalEdge),
Ownership(OwnershipEdge),
}
}
Split Configuration
#![allow(unused)]
fn main() {
pub struct SplitConfig {
pub train_ratio: f64, // e.g., 0.7
pub val_ratio: f64, // e.g., 0.15
pub test_ratio: f64, // e.g., 0.15
pub stratify_by: Option<String>,
pub random_seed: u64,
}
}
Usage Examples
Building Transaction Graph
#![allow(unused)]
fn main() {
use synth_graph::{TransactionGraphBuilder, GraphConfig};
let builder = TransactionGraphBuilder::new(GraphConfig::default());
let graph = builder.build(&journal_entries)?;
println!("Nodes: {}", graph.nodes.len());
println!("Edges: {}", graph.edges.len());
}
PyTorch Geometric Export
#![allow(unused)]
fn main() {
use synth_graph::{PyTorchGeometricExporter, SplitConfig};
let exporter = PyTorchGeometricExporter::new("output/graphs");
let split = SplitConfig {
train_ratio: 0.7,
val_ratio: 0.15,
test_ratio: 0.15,
stratify_by: Some("is_anomaly".to_string()),
random_seed: 42,
};
exporter.export(&graph, split)?;
}
Neo4j Export
#![allow(unused)]
fn main() {
use synth_graph::Neo4jExporter;
let exporter = Neo4jExporter::new("output/graphs/neo4j");
exporter.export(&graph)?;
// Generates import script:
// LOAD CSV WITH HEADERS FROM 'file:///nodes_account.csv' AS row
// CREATE (:Account {id: row.id, name: row.name, ...})
}
Feature Engineering
#![allow(unused)]
fn main() {
use synth_graph::features::{FeatureExtractor, FeatureConfig};
let extractor = FeatureExtractor::new(FeatureConfig {
temporal: true,
amount: true,
structural: true,
categorical: true,
});
let node_features = extractor.extract_node_features(&entries)?;
let edge_features = extractor.extract_edge_features(&entries)?;
}
Graph Construction
Transaction Network
Accounts and entities become nodes; transactions become edges.
#![allow(unused)]
fn main() {
// Nodes:
// - Each GL account is a node
// - Each vendor/customer is a node
// Edges:
// - Each journal entry line creates an edge
// - Edge connects account to entity
// - Edge features: amount, date, fraud flag
}
Approval Network
Users become nodes; approval relationships become edges.
#![allow(unused)]
fn main() {
// Nodes:
// - Each user/employee is a node
// - Node features: approval_limit, department, role
// Edges:
// - Approval actions create edges
// - Edge features: amount, threshold, escalation
}
Entity Relationship Network
Legal entities become nodes; ownership and IC relationships become edges.
#![allow(unused)]
fn main() {
// Nodes:
// - Each company/legal entity is a node
// - Node features: currency, country, parent_flag
// Edges:
// - Ownership relationships (parent → subsidiary)
// - IC transaction relationships
// - Edge features: ownership_percent, transaction_volume
}
ML Integration
Loading in PyTorch
import torch
from torch_geometric.data import Data
# Load exported data
node_features = torch.load('node_features.pt')
edge_index = torch.load('edge_index.pt')
edge_attr = torch.load('edge_attr.pt')
labels = torch.load('labels.pt')
train_mask = torch.load('train_mask.pt')
data = Data(
x=node_features,
edge_index=edge_index,
edge_attr=edge_attr,
y=labels,
train_mask=train_mask,
)
Loading in Neo4j
# Import using generated script
neo4j-admin import \
--nodes=nodes_account.csv \
--nodes=nodes_entity.csv \
--relationships=edges_transaction.csv
Configuration
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
graphs:
- transaction_network
- approval_network
- entity_relationship
split:
train: 0.7
val: 0.15
test: 0.15
stratify: is_anomaly
features:
temporal: true
amount: true
structural: true
categorical: true
Multi-Layer Hypergraph (v0.6.2)
The hypergraph builder supports all 8 enterprise process families:
| Method | Family | Node Types |
|---|---|---|
add_p2p_documents() | P2P | PurchaseOrder, GoodsReceipt, VendorInvoice, Payment |
add_o2c_documents() | O2C | SalesOrder, Delivery, CustomerInvoice |
add_s2c_documents() | S2C | SourcingProject, RfxEvent, SupplierBid, ProcurementContract |
add_h2r_documents() | H2R | PayrollRun, TimeEntry, ExpenseReport |
add_mfg_documents() | MFG | ProductionOrder, QualityInspection, CycleCount |
add_bank_documents() | BANK | BankingCustomer, BankAccount, BankTransaction |
add_audit_documents() | AUDIT | AuditEngagement, Workpaper, AuditFinding, AuditEvidence |
add_bank_recon_documents() | Bank Recon | BankReconciliation, BankStatementLine, ReconcilingItem |
add_ocpm_events() | OCPM | Events as hyperedges (entity type 400) |
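A sketch of assembling a multi-layer hypergraph from several process families. The add_* method names come from the table above; the builder's type name, constructor, argument types, and build() call are assumptions:
#![allow(unused)]
fn main() {
use synth_graph::HypergraphBuilder; // type name assumed
let mut builder = HypergraphBuilder::new();
builder.add_p2p_documents(&p2p_flows);
builder.add_o2c_documents(&o2c_flows);
builder.add_bank_documents(&banking_data);
builder.add_ocpm_events(&ocel_events); // events become hyperedges (entity type 400)
let hypergraph = builder.build()?;
}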
See Also
datasynth-cli
Command-line interface for synthetic accounting data generation.
Overview
datasynth-cli provides the datasynth-data binary for command-line usage:
- generate: Generate synthetic data from configuration
- init: Create configuration files with industry presets
- validate: Validate configuration files
- info: Display available presets and options
Installation
cargo build --release
# Binary at: target/release/datasynth-data
Commands
generate
Generate synthetic financial data.
# From configuration file
datasynth-data generate --config config.yaml --output ./output
# Demo mode with defaults
datasynth-data generate --demo --output ./demo-output
# Override seed
datasynth-data generate --config config.yaml --output ./output --seed 12345
# Verbose output
datasynth-data generate --config config.yaml --output ./output -v
init
Create a configuration file from presets.
# Industry preset with complexity
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
Available industries:
manufacturing, retail, financial_services, healthcare, technology, energy, telecom, transportation, hospitality
validate
Validate configuration files.
datasynth-data validate --config config.yaml
info
Display available options.
datasynth-data info
fingerprint
Privacy-preserving fingerprint operations.
# Extract fingerprint
datasynth-data fingerprint extract --input ./data.csv --output ./fp.dsf --privacy-level standard
# Validate fingerprint
datasynth-data fingerprint validate ./fp.dsf
# View fingerprint details
datasynth-data fingerprint info ./fp.dsf --detailed
# Evaluate fidelity
datasynth-data fingerprint evaluate --fingerprint ./fp.dsf --synthetic ./output/ --threshold 0.8
# Federated aggregation (v0.5.0)
datasynth-data fingerprint federated --sources ./a.dsf ./b.dsf --output ./combined.dsf --method weighted_average
diffusion (v0.5.0)
Diffusion model training and evaluation.
# Train diffusion model from fingerprint
datasynth-data diffusion train --fingerprint ./fp.dsf --output ./model.json
# Evaluate model fit
datasynth-data diffusion evaluate --model ./model.json --samples 5000
causal (v0.5.0)
Causal and counterfactual data generation.
# Generate from causal template
datasynth-data causal generate --template fraud_detection --samples 10000 --output ./causal/
# Run intervention
datasynth-data causal intervene --template fraud_detection --variable transaction_amount --value 50000 --samples 5000 --output ./intervention/
# Validate causal structure
datasynth-data causal validate --data ./causal/ --template fraud_detection
Key Types
CLI Arguments
#![allow(unused)]
fn main() {
#[derive(Parser)]
pub struct Cli {
#[command(subcommand)]
pub command: Command,
/// Enable verbose logging
#[arg(short, long)]
pub verbose: bool,
/// Suppress non-error output
#[arg(short, long)]
pub quiet: bool,
}
#[derive(Subcommand)]
pub enum Command {
Generate(GenerateArgs),
Init(InitArgs),
Validate(ValidateArgs),
Info,
Fingerprint(FingerprintArgs), // fingerprint subcommands
Diffusion(DiffusionArgs), // v0.5.0: diffusion model commands
Causal(CausalArgs), // v0.5.0: causal generation commands
}
}
Generate Arguments
#![allow(unused)]
fn main() {
pub struct GenerateArgs {
/// Configuration file path
#[arg(short, long)]
pub config: Option<PathBuf>,
/// Use demo preset
#[arg(long)]
pub demo: bool,
/// Output directory (required)
#[arg(short, long)]
pub output: PathBuf,
/// Override random seed
#[arg(long)]
pub seed: Option<u64>,
/// Output format
#[arg(long, default_value = "csv")]
pub format: String,
/// Attach a synthetic data certificate (v0.5.0)
#[arg(long)]
pub certificate: bool,
}
pub struct InitArgs {
// ... existing fields ...
/// Generate config from natural language description (v0.5.0)
#[arg(long)]
pub from_description: Option<String>,
}
}
Signal Handling
On Unix systems, pause/resume generation with SIGUSR1:
# Start in background
datasynth-data generate --config config.yaml --output ./output &
# Toggle pause
kill -USR1 $(pgrep datasynth-data)
Progress bar shows pause state:
[████████░░░░░░░░░░░░] 40% - 40000/100000 entries (PAUSED)
Exit Codes
| Code | Description |
|---|---|
| 0 | Success |
| 1 | Configuration error |
| 2 | Generation error |
| 3 | I/O error |
Environment Variables
| Variable | Description |
|---|---|
SYNTH_DATA_LOG | Log level (error, warn, info, debug, trace) |
SYNTH_DATA_THREADS | Worker thread count |
SYNTH_DATA_MEMORY_LIMIT | Memory limit in bytes |
SYNTH_DATA_LOG=debug datasynth-data generate --demo --output ./output
Progress Display
During generation, a progress bar shows:
Generating synthetic data...
[████████████████████] 100% - 100000/100000 entries
Phase: Transactions | 85,432 entries/sec | ETA: 0:00
Generation complete!
- Journal entries: 100,000
- Document flows: 15,000
- Output: ./output/
- Duration: 1.2s
Usage Examples
Basic Generation
datasynth-data init --industry manufacturing -o config.yaml
datasynth-data generate --config config.yaml --output ./output
Scripting
#!/bin/bash
for industry in manufacturing retail healthcare; do
datasynth-data init --industry $industry --complexity medium -o ${industry}.yaml
datasynth-data generate --config ${industry}.yaml --output ./output/${industry}
done
CI/CD
# GitHub Actions
- name: Generate Test Data
run: |
cargo build --release
./target/release/datasynth-data generate --demo --output ./test-data
Reproducible Generation
# Same seed = same output
datasynth-data generate --config config.yaml --output ./run1 --seed 42
datasynth-data generate --config config.yaml --output ./run2 --seed 42
diff -r run1 run2 # No differences
See Also
datasynth-server
REST, gRPC, and WebSocket server for synthetic data generation.
Overview
datasynth-server provides server-based access to SyntheticData:
- REST API: Configuration management and stream control
- gRPC API: High-performance streaming generation
- WebSocket: Real-time event streaming
- Production Features: Authentication, rate limiting, timeouts
Starting the Server
cargo run -p datasynth-server -- --port 3000 --worker-threads 4
Command-Line Options
| Option | Default | Description |
|---|---|---|
--port | 3000 | HTTP/WebSocket port |
--grpc-port | 50051 | gRPC port |
--worker-threads | CPU cores | Worker thread count |
--api-key | None | Required API key |
--rate-limit | 100 | Max requests per minute |
--memory-limit | None | Memory limit in bytes |
Architecture
┌─────────────────────────────────────────────────────────────┐
│ datasynth-server │
├─────────────────────────────────────────────────────────────┤
│ REST API (Axum) │ gRPC (Tonic) │ WebSocket (Axum) │
├─────────────────────────────────────────────────────────────┤
│ Middleware Layer │
│ Auth │ Rate Limit │ Timeout │ CORS │ Logging │
├─────────────────────────────────────────────────────────────┤
│ Generation Service │
│ (wraps datasynth-runtime orchestrator) │
└─────────────────────────────────────────────────────────────┘
REST API Endpoints
Configuration
# Get current configuration
curl http://localhost:3000/api/config
# Update configuration
curl -X POST http://localhost:3000/api/config \
-H "Content-Type: application/json" \
-d '{"transactions": {"target_count": 50000}}'
# Validate configuration
curl -X POST http://localhost:3000/api/config/validate \
-H "Content-Type: application/json" \
-d @config.json
Stream Control
# Start generation
curl -X POST http://localhost:3000/api/stream/start
# Pause
curl -X POST http://localhost:3000/api/stream/pause
# Resume
curl -X POST http://localhost:3000/api/stream/resume
# Stop
curl -X POST http://localhost:3000/api/stream/stop
# Trigger pattern (month_end, quarter_end, year_end)
curl -X POST http://localhost:3000/api/stream/trigger/month_end
Health Check
curl http://localhost:3000/health
WebSocket API
Connect to ws://localhost:3000/ws/events for real-time events.
Event Types
// Progress
{"type": "progress", "current": 50000, "total": 100000, "percent": 50.0}
// Entry (streamed data)
{"type": "entry", "data": {"document_id": "abc-123", ...}}
// Error
{"type": "error", "message": "Memory limit exceeded"}
// Complete
{"type": "complete", "total_entries": 100000, "duration_ms": 1200}
gRPC API
Proto Definition
syntax = "proto3";
package synth;
service SynthService {
rpc GetConfig(Empty) returns (Config);
rpc SetConfig(Config) returns (Status);
rpc StartGeneration(GenerationRequest) returns (stream Entry);
rpc StopGeneration(Empty) returns (Status);
}
Client Example
#![allow(unused)]
fn main() {
use synth::synth_client::SynthClient;
let mut client = SynthClient::connect("http://localhost:50051").await?;
let request = tonic::Request::new(GenerationRequest { count: Some(1000) });
let mut stream = client.start_generation(request).await?.into_inner();
while let Some(entry) = stream.message().await? {
println!("Entry: {:?}", entry.document_id);
}
}
Middleware
Authentication
# With API key
curl -H "X-API-Key: your-key" http://localhost:3000/api/config
Rate Limiting
Sliding window rate limiter with per-client tracking.
// 429 response when exceeded
{
"error": "rate_limit_exceeded",
"retry_after": 30
}
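Clients should honor the retry_after hint when they receive a 429. A sketch using the external reqwest (with its json feature), serde_json, and tokio crates; these dependencies are assumptions, not part of this workspace:
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    loop {
        let resp = client
            .get("http://localhost:3000/api/config")
            .header("X-API-Key", "your-key")
            .send()
            .await?;
        if resp.status() == reqwest::StatusCode::TOO_MANY_REQUESTS {
            // Back off for the server-suggested interval, then retry.
            let body: serde_json::Value = resp.json().await?;
            let wait = body["retry_after"].as_u64().unwrap_or(30);
            tokio::time::sleep(Duration::from_secs(wait)).await;
            continue;
        }
        println!("{}", resp.text().await?);
        break;
    }
    Ok(())
}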
Request Timeout
Default timeout is 30 seconds. Long-running operations use streaming.
Key Types
Server Configuration
#![allow(unused)]
fn main() {
pub struct ServerConfig {
pub port: u16,
pub grpc_port: u16,
pub worker_threads: usize,
pub api_key: Option<String>,
pub rate_limit: RateLimitConfig,
pub memory_limit: Option<u64>,
pub cors_origins: Vec<String>,
}
}
Rate Limit Configuration
#![allow(unused)]
fn main() {
pub struct RateLimitConfig {
pub max_requests: u32,
pub window_seconds: u64,
pub exempt_paths: Vec<String>,
}
}
Production Deployment
Docker
FROM rust:1.88 as builder
WORKDIR /app
COPY . .
RUN cargo build --release -p datasynth-server
FROM debian:bookworm-slim
COPY --from=builder /app/target/release/datasynth-server /usr/local/bin/
EXPOSE 3000 50051
CMD ["datasynth-server", "--port", "3000"]
Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datasynth-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: datasynth-server
  template:
    metadata:
      labels:
        app: datasynth-server
    spec:
      containers:
        - name: datasynth-server
          image: datasynth-server:latest
          ports:
            - containerPort: 3000
            - containerPort: 50051
          env:
            - name: SYNTH_API_KEY
              valueFrom:
                secretKeyRef:
                  name: synth-secrets
                  key: api-key
          resources:
            limits:
              memory: "2Gi"
Monitoring
Health Endpoint
curl http://localhost:3000/health
{
"status": "healthy",
"uptime_seconds": 3600,
"memory_usage_mb": 512,
"active_streams": 2
}
Logging
Enable structured logging:
RUST_LOG=synth_server=info cargo run -p datasynth-server
See Also
datasynth-ui
Cross-platform desktop application for synthetic data generation.
Overview
datasynth-ui provides a graphical interface for SyntheticData:
- Visual Configuration: Comprehensive UI for all configuration sections
- Real-time Streaming: Live generation viewer with WebSocket
- Preset Management: One-click industry preset application
- Validation Feedback: Real-time configuration validation
Technology Stack
| Component | Technology |
|---|---|
| Backend | Tauri 2.0 (Rust) |
| Frontend | SvelteKit + Svelte 5 |
| Styling | TailwindCSS |
| State | Svelte stores with runes |
Prerequisites
Linux (Ubuntu/Debian)
sudo apt install libgtk-3-dev libwebkit2gtk-4.1-dev \
libappindicator3-dev librsvg2-dev
Linux (Fedora)
sudo dnf install gtk3-devel webkit2gtk4.1-devel \
libappindicator-gtk3-devel librsvg2-devel
Linux (Arch)
sudo pacman -S webkit2gtk-4.1 base-devel curl wget file \
openssl appmenu-gtk-module gtk3 librsvg libvips
macOS
No additional dependencies required (uses built-in WebKit).
Windows
WebView2 runtime (usually pre-installed on Windows 10/11).
Development
cd crates/datasynth-ui
# Install dependencies
npm install
# Frontend development (no desktop features)
npm run dev
# Desktop app development
npm run tauri dev
# Production build
npm run build
npm run tauri build
Project Structure
datasynth-ui/
├── src/ # Svelte frontend
│ ├── routes/ # SvelteKit pages
│ │ ├── +page.svelte # Dashboard
│ │ ├── config/ # Configuration pages (15+ sections)
│ │ │ ├── global/
│ │ │ ├── transactions/
│ │ │ ├── master-data/
│ │ │ └── ...
│ │ └── generate/
│ │ └── stream/ # Generation streaming viewer
│ └── lib/
│ ├── components/ # Reusable UI components
│ │ ├── forms/ # Form components
│ │ └── config/ # Config-specific components
│ ├── stores/ # Svelte stores
│ └── utils/ # Utilities
├── src-tauri/ # Rust backend
│ ├── src/
│ │ ├── lib.rs # Tauri commands
│ │ └── main.rs # App entry point
│ └── Cargo.toml
├── e2e/ # Playwright E2E tests
├── package.json
└── tauri.conf.json
Configuration Sections
| Section | Description |
|---|---|
| Global | Industry, dates, seed, performance |
| Transactions | Line items, amounts, sources |
| Master Data | Vendors, customers, materials |
| Document Flows | P2P, O2C configuration |
| Financial | Balance, subledger, FX, period close |
| Compliance | Fraud, controls, approval |
| Analytics | Graph export, anomaly, data quality |
| Output | Formats, compression |
Key Components
Config Store
// src/lib/stores/config.ts
import { writable } from 'svelte/store';
export const config = writable<Config>(defaultConfig);
export const isDirty = writable(false);
export function updateConfig(section: string, value: any) {
config.update(c => ({...c, [section]: value}));
isDirty.set(true);
}
Form Components
<!-- src/lib/components/forms/InputNumber.svelte -->
<script lang="ts">
export let value: number;
export let min: number = 0;
export let max: number = Infinity;
export let label: string;
</script>
<label>
{label}
<input type="number" bind:value {min} {max} />
</label>
Tauri Commands
#![allow(unused)]
fn main() {
// src-tauri/src/lib.rs
#[tauri::command]
async fn save_config(config: Config) -> Result<(), String> {
    // Save the configuration (implementation elided)
    Ok(())
}
#[tauri::command]
async fn start_generation(config: Config) -> Result<(), String> {
    // Start generation via datasynth-runtime (implementation elided)
    Ok(())
}
}
Server Connection
The UI connects to datasynth-server for streaming:
# Start server first
cargo run -p datasynth-server
# Then run UI
npm run tauri dev
Default server URL: http://localhost:3000
Testing
# Unit tests
npm test
# E2E tests with Playwright
npx playwright test
# E2E with UI
npx playwright test --ui
Build Output
Production builds create platform-specific packages:
| Platform | Output |
|---|---|
| Windows | .msi, .exe |
| macOS | .dmg, .app |
| Linux | .deb, .AppImage, .rpm |
Located in: src-tauri/target/release/bundle/
UI Features
Dashboard
- System overview
- Quick stats
- Recent generations
Configuration Editor
- Visual form editors for all sections
- Real-time validation
- Dirty state tracking
- Export to YAML/JSON
Streaming Viewer
- Real-time progress
- Entry preview table
- Memory usage graph
- Pause/resume controls
Preset Selector
- Industry presets
- Complexity levels
- One-click application
See Also
datasynth-eval
Evaluation framework for synthetic financial data quality and coherence.
Overview
datasynth-eval provides automated quality assessment for generated data:
- Statistical Evaluation: Benford’s Law compliance, distribution analysis
- Coherence Checking: Balance verification, document chain integrity
- Intercompany Validation: IC matching and elimination verification
- Data Quality Analysis: Completeness, consistency, format validation
- ML Readiness: Feature distributions, label quality, graph structure
- Enhancement Derivation: Auto-tuning with configuration recommendations
Evaluation Categories
| Category | Description |
|---|---|
| Statistical | Benford’s Law, amount distributions, temporal patterns, line items |
| Coherence | Trial balance, subledger reconciliation, FX consistency, document chains |
| Intercompany | IC matching rates, elimination completeness |
| Quality | Completeness, consistency, duplicates, format validation, uniqueness |
| ML Readiness | Feature distributions, label quality, graph structure, train/val/test splits |
| Enhancement | Auto-tuning, configuration recommendations, root cause analysis |
Module Structure
| Module | Description |
|---|---|
statistical/ | Benford’s Law, amount distributions, temporal patterns |
coherence/ | Balance sheet, IC matching, document chains, subledger reconciliation |
quality/ | Completeness, consistency, duplicates, formats, uniqueness |
ml/ | Feature analysis, label quality, graph structure, splits |
report/ | HTML and JSON report generation with baseline comparisons |
tuning/ | Configuration optimization recommendations |
enhancement/ | Auto-tuning engine with config patch generation |
Key Types
Evaluator
#![allow(unused)]
fn main() {
pub struct Evaluator {
config: EvaluationConfig,
checkers: Vec<Box<dyn Checker>>,
}
pub struct EvaluationConfig {
pub benford_threshold: f64, // Chi-square threshold
pub balance_tolerance: Decimal, // Allowed imbalance
pub ic_match_threshold: f64, // Required match rate
pub duplicate_check: bool,
}
}
Evaluation Report
#![allow(unused)]
fn main() {
pub struct EvaluationReport {
pub overall_status: Status,
pub categories: Vec<CategoryResult>,
pub warnings: Vec<Warning>,
pub details: Vec<Finding>,
pub scores: Scores,
}
pub struct Scores {
pub benford_score: f64, // 0.0-1.0
pub balance_coherence: f64, // 0.0-1.0
pub ic_matching_rate: f64, // 0.0-1.0
pub uniqueness_score: f64, // 0.0-1.0
}
pub enum Status {
Passed,
PassedWithWarnings,
Failed,
}
}
Usage Examples
Basic Evaluation
#![allow(unused)]
fn main() {
use synth_eval::{Evaluator, EvaluationConfig};
let evaluator = Evaluator::new(EvaluationConfig::default());
let report = evaluator.evaluate(&generated_data)?;
println!("Status: {:?}", report.overall_status);
println!("Benford compliance: {:.2}%", report.scores.benford_score * 100.0);
}
Custom Configuration
#![allow(unused)]
fn main() {
let config = EvaluationConfig {
benford_threshold: 0.05, // 5% significance level
balance_tolerance: dec!(0.01), // 1 cent tolerance
ic_match_threshold: 0.99, // 99% required match
duplicate_check: true,
};
let evaluator = Evaluator::new(config);
}
Category-Specific Evaluation
#![allow(unused)]
fn main() {
use synth_eval::checkers::{BenfordChecker, BalanceChecker};
let benford = BenfordChecker::new(0.05);
let result = benford.check(&amounts)?;
let balance = BalanceChecker::new(dec!(0.01));
let result = balance.check(&trial_balance)?;
}
Evaluation Checks
Benford’s Law
Verifies first-digit distribution follows Benford’s Law:
#![allow(unused)]
fn main() {
// Expected: P(d) = log10(1 + 1/d)
// d=1: 30.1%, d=2: 17.6%, d=3: 12.5%, etc.
let benford_result = evaluator.check_benford(&amounts)?;
if benford_result.chi_square > critical_value {
println!("Warning: Amounts don't follow Benford's Law");
}
}
Balance Coherence
Verifies accounting equation:
#![allow(unused)]
fn main() {
// Assets = Liabilities + Equity
let balance_result = evaluator.check_balance(&trial_balance)?;
if !balance_result.passed {
println!("Imbalance: {:?}", balance_result.difference);
}
}
Document Chain Integrity
Verifies document references:
#![allow(unused)]
fn main() {
// PO → GR → Invoice → Payment chain
let chain_result = evaluator.check_document_chains(&documents)?;
for broken_chain in &chain_result.broken_chains {
println!("Broken chain: {} → {}", broken_chain.from, broken_chain.to);
}
}
IC Matching
Verifies intercompany transactions match:
#![allow(unused)]
fn main() {
let ic_result = evaluator.check_ic_matching(&ic_entries)?;
println!("Match rate: {:.2}%", ic_result.match_rate * 100.0);
println!("Unmatched: {}", ic_result.unmatched.len());
}
Uniqueness
Detects duplicate document IDs:
#![allow(unused)]
fn main() {
let unique_result = evaluator.check_uniqueness(&entries)?;
if !unique_result.duplicates.is_empty() {
for dup in &unique_result.duplicates {
println!("Duplicate ID: {}", dup.document_id);
}
}
}
Report Output
Console Report
#![allow(unused)]
fn main() {
evaluator.print_report(&report);
}
=== Evaluation Report ===
Status: PASSED
Scores:
Benford Compliance: 98.5%
Balance Coherence: 100.0%
IC Matching Rate: 99.8%
Uniqueness: 100.0%
Warnings:
- 3 entries with unusual amounts detected
Categories:
[✓] Statistical: PASSED
[✓] Coherence: PASSED
[✓] Intercompany: PASSED
[✓] Uniqueness: PASSED
JSON Report
#![allow(unused)]
fn main() {
let json = evaluator.to_json(&report)?;
std::fs::write("evaluation_report.json", json)?;
}
Integration with Generation
#![allow(unused)]
fn main() {
use synth_runtime::GenerationOrchestrator;
use synth_eval::Evaluator;
let orchestrator = GenerationOrchestrator::new(config)?;
let data = orchestrator.run()?;
// Evaluate generated data
let evaluator = Evaluator::new(EvaluationConfig::default());
let report = evaluator.evaluate(&data)?;
if report.overall_status == Status::Failed {
return Err("Generated data failed quality checks".into());
}
}
Enhancement Module
The enhancement module provides automatic configuration tuning based on evaluation results.
Pipeline Flow
Evaluation Results → Threshold Check → Gap Analysis → Root Cause → Config Suggestion
Auto-Tuning
#![allow(unused)]
fn main() {
use synth_eval::enhancement::{AutoTuner, AutoTuneResult};
let tuner = AutoTuner::new();
let result: AutoTuneResult = tuner.analyze(&evaluation);
for patch in result.patches_by_confidence() {
println!("{}: {} → {} (confidence: {:.0}%)",
patch.path,
patch.current_value.as_deref().unwrap_or("?"),
patch.suggested_value,
patch.confidence * 100.0
);
}
}
Key Types
#![allow(unused)]
fn main() {
pub struct ConfigPatch {
pub path: String, // e.g., "transactions.amount.benford_compliance"
pub current_value: Option<String>,
pub suggested_value: String,
pub confidence: f64, // 0.0-1.0
pub expected_impact: String,
}
pub struct AutoTuneResult {
pub patches: Vec<ConfigPatch>,
pub expected_improvement: f64,
pub addressed_metrics: Vec<String>,
pub unaddressable_metrics: Vec<String>,
pub summary: String,
}
}
Recommendation Engine
#![allow(unused)]
fn main() {
use synth_eval::enhancement::{RecommendationEngine, RecommendationPriority};
let engine = RecommendationEngine::new();
let recommendations = engine.generate(&evaluation);
for rec in recommendations.iter().filter(|r| r.priority == RecommendationPriority::Critical) {
println!("CRITICAL: {} - {}", rec.title, rec.root_cause.description);
}
}
Metric-to-Config Mappings
| Metric | Config Path | Strategy |
|---|---|---|
benford_p_value | transactions.amount.benford_compliance | Enable boolean |
round_number_ratio | transactions.amount.round_number_bias | Set to target |
temporal_correlation | transactions.temporal.seasonality_strength | Increase by gap |
anomaly_rate | anomaly_injection.base_rate | Set to target |
ic_match_rate | intercompany.match_precision | Increase by gap |
completeness_rate | data_quality.missing_values.overall_rate | Decrease by gap |
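As a rough illustration of how one of these mappings can turn an evaluation gap into a ConfigPatch, the sketch below applies the "increase by gap" strategy for the IC match rate. The helper name and heuristics are assumptions for the example; the crate's internal tuning logic may differ.
// Hypothetical helper: translate an IC match-rate shortfall into a patch.
fn ic_match_patch(measured: f64, target: f64, current_precision: f64) -> Option<ConfigPatch> {
    if measured >= target {
        return None;
    }
    let gap = target - measured;
    Some(ConfigPatch {
        path: "intercompany.match_precision".to_string(),
        current_value: Some(current_precision.to_string()),
        suggested_value: format!("{:.3}", (current_precision + gap).min(1.0)),
        confidence: 0.8,
        expected_impact: format!("Raise IC match rate by ~{:.1}%", gap * 100.0),
    })
}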
See Also
datasynth-banking
KYC/AML banking transaction generator for synthetic data.
Overview
datasynth-banking provides comprehensive banking transaction simulation for:
- Compliance testing and model training
- AML/fraud detection system evaluation
- KYC process simulation
- Regulatory reporting testing
Features
| Feature | Description |
|---|---|
| Customer Generation | Retail, business, and trust customers with realistic KYC profiles |
| Account Generation | Multiple account types with proper feature sets |
| Transaction Engine | Persona-based transaction generation with causal drivers |
| AML Typologies | Structuring, funnel accounts, layering, mule networks, and more |
| Ground Truth Labels | Multi-level labels for ML training |
| Spoofing Mode | Adversarial transaction generation for robustness testing |
Architecture
BankingOrchestrator (orchestration)
|
Generators (customer, account, transaction, counterparty)
|
Typologies (AML pattern injection)
|
Labels (ground truth generation)
|
Models (customer, account, transaction, KYC)
Module Structure
Models
| Model | Description |
|---|---|
BankingCustomer | Retail, Business, Trust customer types |
BankAccount | Account with type, features, status |
BankTransaction | Transaction with direction, channel, category |
KycProfile | Expected activity envelope for compliance |
CounterpartyPool | Transaction counterparty management |
CaseNarrative | Investigation and compliance narratives |
KYC Profile
#![allow(unused)]
fn main() {
pub struct KycProfile {
pub declared_purpose: String,
pub turnover_band: TurnoverBand,
pub transaction_frequency: TransactionFrequency,
pub expected_categories: Vec<TransactionCategory>,
pub source_of_funds: SourceOfFunds,
pub source_of_wealth: SourceOfWealth,
pub geographic_exposure: Vec<String>,
pub cash_intensity: CashIntensity,
pub beneficial_owner_complexity: OwnerComplexity,
// Ground truth fields
pub is_deceiving: bool,
pub actual_turnover_band: Option<TurnoverBand>,
}
}
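A sketch of how the ground-truth fields might be used downstream, for example to flag customers whose actual activity exceeds the declared envelope. This assumes TurnoverBand can be compared with PartialOrd, which is not guaranteed by the crate.
// Hypothetical check: declared vs. actual activity envelope.
fn declared_activity_exceeded(profile: &KycProfile) -> bool {
    match (&profile.actual_turnover_band, profile.is_deceiving) {
        // A deceiving profile whose actual band is above the declared band.
        (Some(actual), true) => *actual > profile.turnover_band,
        _ => false,
    }
}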
AML Typologies
| Typology | Description |
|---|---|
| Structuring | Transactions below reporting thresholds ($10K) |
| Funnel Accounts | Multiple small deposits, few large withdrawals |
| Layering | Complex transaction chains to obscure origin |
| Mule Networks | Money mule payment chains |
| Round-Tripping | Circular transaction patterns |
| Credit Card Fraud | Fraudulent card transactions |
| Synthetic Identity | Fake identity transactions |
| Spoofing | Adversarial patterns for model testing |
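To make the structuring typology concrete, here is a minimal sketch of splitting a cash total into deposits that each stay below a $10,000 reporting threshold. It is illustrative only; the crate's injector is configuration-driven and its internal logic may differ.
use rand::Rng;

/// Split `total` into cash deposits that each stay below the reporting threshold.
fn structure_deposits<R: Rng>(rng: &mut R, total: f64, threshold: f64) -> Vec<f64> {
    let mut remaining = total;
    let mut deposits = Vec::new();
    while remaining > 0.0 {
        // Stay 5-15% under the threshold to avoid obvious clustering at $9,999.
        let cap = threshold * rng.gen_range(0.85..0.95);
        let amount = remaining.min(cap);
        deposits.push((amount * 100.0).round() / 100.0);
        remaining -= amount;
    }
    deposits
}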
Labels
| Label Type | Description |
|---|---|
| Entity Labels | Customer-level risk classifications |
| Relationship Labels | Relationship risk indicators |
| Transaction Labels | Transaction-level classifications |
| Narrative Labels | Investigation case narratives |
Usage Examples
Basic Generation
#![allow(unused)]
fn main() {
use synth_banking::{BankingOrchestrator, BankingConfig};
let config = BankingConfig::default();
let mut orchestrator = BankingOrchestrator::new(config, 12345);
// Generate customers and accounts
let customers = orchestrator.generate_customers();
let accounts = orchestrator.generate_accounts(&customers);
// Generate transaction stream
let transactions = orchestrator.generate_transactions(&accounts);
}
With AML Typologies
#![allow(unused)]
fn main() {
use synth_banking::{BankingConfig, TypologyConfig};
let config = BankingConfig {
customer_count: 1000,
typologies: TypologyConfig {
structuring_rate: 0.02, // 2% structuring patterns
funnel_rate: 0.01, // 1% funnel accounts
mule_rate: 0.005, // 0.5% mule networks
..Default::default()
},
..Default::default()
};
}
Accessing Labels
#![allow(unused)]
fn main() {
let labels = orchestrator.generate_labels();
// Entity-level labels
for entity_label in &labels.entity_labels {
println!("Customer {} risk: {:?}",
entity_label.customer_id,
entity_label.risk_tier
);
}
// Transaction-level labels
for tx_label in &labels.transaction_labels {
if tx_label.is_suspicious {
println!("Suspicious: {} - {:?}",
tx_label.transaction_id,
tx_label.typology
);
}
}
}
Key Types
Customer Types
#![allow(unused)]
fn main() {
pub enum BankingCustomerType {
Retail, // Individual customers
Business, // Business accounts
Trust, // Trust/corporate entities
}
}
Risk Tiers
#![allow(unused)]
fn main() {
pub enum RiskTier {
Low,
Medium,
High,
Prohibited,
}
}
Transaction Categories
#![allow(unused)]
fn main() {
pub enum TransactionCategory {
SalaryWages,
BusinessPayment,
Investment,
RealEstate,
Gambling,
Cryptocurrency,
CashDeposit,
CashWithdrawal,
WireTransfer,
AtmWithdrawal,
PosPayment,
OnlinePayment,
// ... more categories
}
}
AML Typologies
#![allow(unused)]
fn main() {
pub enum AmlTypology {
Structuring,
Funnel,
Layering,
Mule,
RoundTripping,
CreditCardFraud,
SyntheticIdentity,
None,
}
}
Export Files
| File | Description |
|---|---|
banking_customers.csv | Customer profiles with KYC data |
bank_accounts.csv | Account records with features |
bank_transactions.csv | Transaction records |
kyc_profiles.csv | Expected activity envelopes |
counterparties.csv | Counterparty pool |
entity_risk_labels.csv | Entity-level risk classifications |
transaction_risk_labels.csv | Transaction-level labels |
aml_typology_labels.csv | AML typology ground truth |
See Also
- datasynth-core - Core banking models
- Fraud Detection Use Case
- Anomaly Injection
datasynth-ocpm
Object-Centric Process Mining (OCPM) models and generators.
Overview
datasynth-ocpm provides OCEL 2.0 compliant event log generation across 8 enterprise process families:
- OCEL 2.0 Models: Events, objects, relationships per IEEE standard
- 8 Process Generators: P2P, O2C, S2C, H2R, MFG, BANK, AUDIT, Bank Recon
- 88 Activity Types: Covering the full enterprise lifecycle
- 52 Object Types: With lifecycle states and inter-object relationships
- Export Formats: OCEL 2.0 JSON, XML, and SQLite
OCEL 2.0 Standard
Implements the Object-Centric Event Log standard:
| Element | Description |
|---|---|
| Events | Activities with timestamps and attributes |
| Objects | Business objects (PO, Invoice, Payment, etc.) |
| Object Types | Type definitions with attribute schemas |
| Relationships | Object-to-object relationships |
| Event-Object Links | Many-to-many event-object associations |
Key Types
OCEL Models
#![allow(unused)]
fn main() {
pub struct OcelEventLog {
pub object_types: Vec<ObjectType>,
pub event_types: Vec<EventType>,
pub objects: Vec<Object>,
pub events: Vec<Event>,
pub relationships: Vec<ObjectRelationship>,
}
pub struct Event {
pub id: String,
pub event_type: String,
pub timestamp: DateTime<Utc>,
pub attributes: HashMap<String, Value>,
pub objects: Vec<ObjectRef>,
}
pub struct Object {
pub id: String,
pub object_type: String,
pub attributes: HashMap<String, Value>,
}
}
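For orientation, a single event linked to two objects could be assembled from these types roughly as follows. The attribute values are placeholders and the ObjectRef construction is elided, since that type is not shown here.
use std::collections::HashMap;
use chrono::Utc;

let po = Object {
    id: "PO-001".to_string(),
    object_type: "PurchaseOrder".to_string(),
    attributes: HashMap::new(),
};
let vendor = Object {
    id: "V-001".to_string(),
    object_type: "Vendor".to_string(),
    attributes: HashMap::new(),
};
let event = Event {
    id: "e1".to_string(),
    event_type: "Create Purchase Order".to_string(),
    timestamp: Utc::now(),
    attributes: HashMap::new(),
    objects: vec![/* ObjectRef entries pointing at po and vendor */],
};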
Process Flow Documents
#![allow(unused)]
fn main() {
pub struct P2pDocuments {
pub po_number: String,
pub vendor_id: String,
pub company_code: String,
pub amount: Decimal,
pub currency: String,
}
pub struct O2cDocuments {
pub so_number: String,
pub customer_id: String,
pub company_code: String,
pub amount: Decimal,
pub currency: String,
}
}
Process Flows
Procure-to-Pay (P2P)
Create PO → Approve PO → Release PO → Create GR → Post GR →
Receive Invoice → Verify Invoice → Post Invoice → Execute Payment
Events generated:
- Create Purchase Order
- Approve Purchase Order
- Release Purchase Order
- Create Goods Receipt
- Post Goods Receipt
- Receive Vendor Invoice
- Verify Three-Way Match
- Post Vendor Invoice
- Execute Payment
Order-to-Cash (O2C)
Create SO → Check Credit → Release SO → Create Delivery →
Pick → Pack → Ship → Create Invoice → Post Invoice → Receive Payment
Events generated:
- Create Sales Order
- Check Credit
- Release Sales Order
- Create Delivery
- Pick Materials
- Pack Shipment
- Ship Goods
- Create Customer Invoice
- Post Customer Invoice
- Receive Customer Payment
Usage Examples
Generate P2P Case
#![allow(unused)]
fn main() {
use synth_ocpm::{OcpmGenerator, P2pDocuments};
let mut generator = OcpmGenerator::new(seed);
let documents = P2pDocuments::new(
"PO-001",
"V-001",
"1000",
dec!(5000.00),
"USD",
);
let users = vec!["user1", "user2", "user3"];
let start_time = Utc::now();
let result = generator.generate_p2p_case(&documents, start_time, &users);
}
Generate O2C Case
#![allow(unused)]
fn main() {
use synth_ocpm::{OcpmGenerator, O2cDocuments};
let documents = O2cDocuments::new(
"SO-001",
"C-001",
"1000",
dec!(10000.00),
"USD",
);
let result = generator.generate_o2c_case(&documents, start_time, &users);
}
Generate Complete Event Log
#![allow(unused)]
fn main() {
use synth_ocpm::OcpmGenerator;
let mut generator = OcpmGenerator::new(seed);
let event_log = generator.generate_event_log(
    1000,       // p2p_count
    500,        // o2c_count
    start_date,
    end_date,
)?;
}
Export Formats
OCEL 2.0 JSON
#![allow(unused)]
fn main() {
use synth_ocpm::export::{Ocel2Exporter, ExportFormat};
let exporter = Ocel2Exporter::new(ExportFormat::Json);
exporter.export(&event_log, "output/ocel2.json")?;
}
Output structure:
{
"objectTypes": [...],
"eventTypes": [...],
"objects": [...],
"events": [...],
"relations": [...]
}
OCEL 2.0 XML
#![allow(unused)]
fn main() {
let exporter = Ocel2Exporter::new(ExportFormat::Xml);
exporter.export(&event_log, "output/ocel2.xml")?;
}
SQLite Database
#![allow(unused)]
fn main() {
let exporter = Ocel2Exporter::new(ExportFormat::Sqlite);
exporter.export(&event_log, "output/ocel2.sqlite")?;
}
Tables created:
- object_types
- event_types
- objects
- events
- event_objects
- object_relationships
Process Families (v0.6.2)
| Family | Generator | Activities | Object Types | Variants |
|---|---|---|---|---|
| P2P | generate_p2p_case() | 9 | PurchaseOrder, GoodsReceipt, VendorInvoice, Payment, Material, Vendor | Happy, Exception, Error |
| O2C | generate_o2c_case() | 10 | SalesOrder, Delivery, CustomerInvoice, CustomerPayment, Material, Customer | Happy, Exception, Error |
| S2C | generate_s2c_case() | 8 | SourcingProject, SupplierQualification, RfxEvent, SupplierBid, BidEvaluation, ProcurementContract | Happy, Exception, Error |
| H2R | generate_h2r_case() | 8 | PayrollRun, PayrollLineItem, TimeEntry, ExpenseReport | Happy, Exception, Error |
| MFG | generate_mfg_case() | 10 | ProductionOrder, RoutingOperation, QualityInspection, CycleCount | Happy, Exception, Error |
| BANK | generate_bank_case() | 8 | BankingCustomer, BankAccount, BankTransaction | Happy, Exception, Error |
| AUDIT | generate_audit_case() | 10 | AuditEngagement, Workpaper, AuditFinding, AuditEvidence, RiskAssessment, ProfessionalJudgment | Happy, Exception, Error |
| Bank Recon | generate_bank_recon_case() | 8 | BankReconciliation, BankStatementLine, ReconcilingItem | Happy, Exception, Error |
Variant distribution: HappyPath (75%), ExceptionPath (20%), ErrorPath (5%).
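A minimal sketch of how the 75/20/5 variant split could be sampled; the generator's actual selection mechanism may differ.
use rand::Rng;

enum Variant {
    HappyPath,
    ExceptionPath,
    ErrorPath,
}

fn sample_variant<R: Rng>(rng: &mut R) -> Variant {
    match rng.gen_range(0.0..1.0) {
        x if x < 0.75 => Variant::HappyPath,
        x if x < 0.95 => Variant::ExceptionPath,
        _ => Variant::ErrorPath,
    }
}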
Object Types (P2P/O2C)
| Type | Description |
|---|---|
| PurchaseOrder | P2P ordering document |
| GoodsReceipt | Inventory receipt |
| VendorInvoice | AP invoice |
| Payment | Payment document |
| SalesOrder | O2C ordering document |
| Delivery | Shipment document |
| CustomerInvoice | AR invoice |
| CustomerPayment | Customer receipt |
| Material | Product/item |
| Vendor | Supplier |
| Customer | Customer/buyer |
Integration with Process Mining Tools
OCEL 2.0 exports are compatible with:
- PM4Py: Python process mining library
- Celonis: Enterprise process mining platform
- PROM: Academic process mining toolkit
- OCPA: Object-centric process analysis tool
Loading in PM4Py
import pm4py
from pm4py.objects.ocel.importer import jsonocel
ocel = jsonocel.apply("ocel2.json")
print(f"Events: {len(ocel.events)}")
print(f"Objects: {len(ocel.objects)}")
See Also
datasynth-fingerprint
Privacy-preserving fingerprint extraction from real data and synthesis of matching synthetic data.
Overview
The datasynth-fingerprint crate provides tools for extracting statistical fingerprints from real datasets while preserving privacy through differential privacy mechanisms and k-anonymity. These fingerprints can then be used to generate synthetic data that matches the statistical properties of the original data without exposing sensitive information.
Architecture
Real Data → Extract → .dsf File → Generate → Synthetic Data → Evaluate
The fingerprinting workflow consists of three main stages:
- Extraction: Analyze real data and extract statistical properties
- Synthesis: Generate configuration and synthetic data from fingerprints
- Evaluation: Validate synthetic data fidelity against fingerprints
Key Components
Models (models/)
| Model | Description |
|---|---|
| Fingerprint | Root container with manifest, schema, statistics, correlations, integrity, rules, anomalies, privacy_audit |
| Manifest | Version, format, created_at, source metadata, privacy metadata, checksums, optional signature |
| SchemaFingerprint | Tables with columns, data types, cardinalities, relationships |
| StatisticsFingerprint | Numeric stats (distribution, percentiles, Benford), categorical stats (frequencies, entropy) |
| CorrelationFingerprint | Correlation matrices with copula parameters |
| IntegrityFingerprint | Foreign key definitions, cardinality rules |
| RulesFingerprint | Balance rules, approval thresholds |
| AnomalyFingerprint | Anomaly rates, type distributions, temporal patterns |
| PrivacyAudit | Actions log, epsilon spent, k-anonymity, warnings |
Privacy Engine (privacy/)
| Component | Description |
|---|---|
| LaplaceMechanism | Differential privacy with configurable epsilon |
| GaussianMechanism | Alternative DP mechanism for (ε,δ)-privacy |
| KAnonymity | Suppression of rare categorical values below k threshold |
| PrivacyEngine | Unified interface combining DP, k-anonymity, winsorization |
| PrivacyAuditBuilder | Build privacy audit with actions and warnings |
Privacy Levels
| Level | Epsilon | k | Outlier % | Use Case |
|---|---|---|---|---|
| Minimal | 5.0 | 3 | 99% | Low privacy, high utility |
| Standard | 1.0 | 5 | 95% | Balanced (default) |
| High | 0.5 | 10 | 90% | Higher privacy |
| Maximum | 0.1 | 20 | 85% | Maximum privacy |
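To illustrate what the epsilon budget means in practice, here is a textbook Laplace perturbation of a single statistic. This is a sketch of the standard mechanism, not the crate's LaplaceMechanism API.
use rand::Rng;

/// Add Laplace(0, sensitivity/epsilon) noise to a numeric statistic.
fn laplace_noise<R: Rng>(rng: &mut R, value: f64, sensitivity: f64, epsilon: f64) -> f64 {
    let scale = sensitivity / epsilon;
    // Inverse-CDF sampling of the Laplace distribution.
    let u: f64 = rng.gen_range(-0.5..0.5);
    value - scale * u.signum() * (1.0 - 2.0 * u.abs()).ln()
}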
Extraction Engine (extraction/)
| Extractor | Description |
|---|---|
| FingerprintExtractor | Main coordinator for all extraction |
| SchemaExtractor | Infer data types, cardinalities, relationships |
| StatsExtractor | Compute distributions, percentiles, Benford analysis |
| CorrelationExtractor | Pearson correlations, copula fitting |
| IntegrityExtractor | Detect foreign key relationships |
| RulesExtractor | Detect balance rules, approval patterns |
| AnomalyExtractor | Analyze anomaly rates and patterns |
I/O (io/)
| Component | Description |
|---|---|
| FingerprintWriter | Write .dsf files (ZIP with YAML/JSON components) |
| FingerprintReader | Read .dsf files with checksum verification |
| FingerprintValidator | Validate DSF structure and integrity |
| validate_dsf() | Convenience function for CLI validation |
Synthesis (synthesis/)
| Component | Description |
|---|---|
| ConfigSynthesizer | Convert fingerprint to GeneratorConfig |
| DistributionFitter | Fit AmountSampler parameters from statistics |
| GaussianCopula | Generate correlated values preserving multivariate structure |
Evaluation (evaluation/)
| Component | Description |
|---|---|
| FidelityEvaluator | Compare synthetic data against fingerprint |
| FidelityReport | Overall score, component scores, pass/fail status |
| FidelityConfig | Thresholds and weights for evaluation |
Federated Fingerprinting (federated/) — v0.5.0
| Component | Description |
|---|---|
| FederatedFingerprintProtocol | Orchestrates multi-source fingerprint aggregation |
| PartialFingerprint | Per-source fingerprint with local DP (epsilon, means, stds, correlations) |
| AggregatedFingerprint | Combined fingerprint with total epsilon and source count |
| AggregationMethod | WeightedAverage, Median, or TrimmedMean strategies |
| FederatedConfig | min_sources, max_epsilon_per_source, aggregation_method |
Certificates (certificates/) — v0.5.0
| Component | Description |
|---|---|
| SyntheticDataCertificate | Certificate with DP guarantees, quality metrics, config hash, signature |
| CertificateBuilder | Builder pattern for constructing certificates |
| DpGuarantee | DP mechanism, epsilon, delta, composition method, total queries |
| QualityMetrics | Benford MAD, correlation preservation, statistical fidelity, MIA AUC |
| sign_certificate() | HMAC-SHA256 signing |
| verify_certificate() | Signature verification |
Privacy-Utility Frontier (privacy/pareto.rs) — v0.5.0
| Component | Description |
|---|---|
| ParetoFrontier | Explore privacy-utility tradeoff space |
| ParetoPoint | Epsilon, utility_score, benford_mad, correlation_score |
| recommend() | Recommend optimal epsilon for target utility |
DSF File Format
The DataSynth Fingerprint (.dsf) file is a ZIP archive containing:
fingerprint.dsf (ZIP)
├── manifest.json # Version, checksums, privacy config
├── schema.yaml # Tables, columns, relationships
├── statistics.yaml # Distributions, percentiles, Benford
├── correlations.yaml # Correlation matrices, copulas
├── integrity.yaml # FK relationships, cardinality
├── rules.yaml # Balance constraints, approval thresholds
├── anomalies.yaml # Anomaly rates, type distribution
└── privacy_audit.json # Privacy decisions, epsilon spent
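Because a .dsf file is a plain ZIP archive, its components can be listed with the zip crate for quick inspection. This is a sketch for exploration only; FingerprintReader (shown below) is the supported way to read fingerprints.
use std::fs::File;
use zip::ZipArchive;

fn list_dsf_entries(path: &str) -> zip::result::ZipResult<()> {
    let mut archive = ZipArchive::new(File::open(path)?)?;
    for i in 0..archive.len() {
        let entry = archive.by_index(i)?;
        println!("{} ({} bytes)", entry.name(), entry.size());
    }
    Ok(())
}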
Usage
Extracting a Fingerprint
#![allow(unused)]
fn main() {
use datasynth_fingerprint::{
extraction::FingerprintExtractor,
privacy::{PrivacyEngine, PrivacyLevel},
io::FingerprintWriter,
};
// Create privacy engine with standard level
let privacy = PrivacyEngine::new(PrivacyLevel::Standard);
// Extract fingerprint from CSV data
let extractor = FingerprintExtractor::new(privacy);
let fingerprint = extractor.extract_from_csv("data.csv")?;
// Write to DSF file
let writer = FingerprintWriter::new();
writer.write(&fingerprint, "fingerprint.dsf")?;
}
Reading a Fingerprint
#![allow(unused)]
fn main() {
use datasynth_fingerprint::io::FingerprintReader;
let reader = FingerprintReader::new();
let fingerprint = reader.read("fingerprint.dsf")?;
println!("Tables: {:?}", fingerprint.schema.tables.len());
println!("Privacy epsilon spent: {}", fingerprint.privacy_audit.epsilon_spent);
}
Validating a Fingerprint
#![allow(unused)]
fn main() {
use datasynth_fingerprint::io::validate_dsf;
match validate_dsf("fingerprint.dsf") {
Ok(report) => println!("Valid: {:?}", report),
Err(e) => eprintln!("Invalid: {}", e),
}
}
Synthesizing Configuration
#![allow(unused)]
fn main() {
use datasynth_fingerprint::synthesis::ConfigSynthesizer;
let synthesizer = ConfigSynthesizer::new();
let config = synthesizer.synthesize(&fingerprint)?;
// Use config with datasynth-generators
}
Evaluating Fidelity
#![allow(unused)]
fn main() {
use datasynth_fingerprint::evaluation::{FidelityEvaluator, FidelityConfig};
let config = FidelityConfig::default();
let evaluator = FidelityEvaluator::new(config);
let report = evaluator.evaluate(&fingerprint, "./synthetic_data/")?;
println!("Overall score: {:.2}", report.overall_score);
println!("Pass: {}", report.passed);
for (metric, score) in &report.component_scores {
println!(" {}: {:.2}", metric, score);
}
}
Federated Fingerprinting
#![allow(unused)]
fn main() {
use datasynth_fingerprint::federated::{
FederatedFingerprintProtocol, FederatedConfig, AggregationMethod,
};
let config = FederatedConfig {
min_sources: 2,
max_epsilon_per_source: 5.0,
aggregation_method: AggregationMethod::WeightedAverage,
};
let protocol = FederatedFingerprintProtocol::new(config);
// Create partial fingerprints from each data source
let partial1 = FederatedFingerprintProtocol::create_partial(
"source_a", vec!["amount".into(), "count".into()], 10000,
vec![5000.0, 3.0], vec![2000.0, 1.5], 1.0,
);
let partial2 = FederatedFingerprintProtocol::create_partial(
"source_b", vec!["amount".into(), "count".into()], 8000,
vec![4500.0, 2.8], vec![1800.0, 1.2], 1.0,
);
// Aggregate without centralizing raw data
let aggregated = protocol.aggregate(&[partial1, partial2])?;
println!("Total epsilon: {}", aggregated.total_epsilon);
}
Synthetic Data Certificates
#![allow(unused)]
fn main() {
use datasynth_fingerprint::certificates::{
CertificateBuilder, DpGuarantee, QualityMetrics,
sign_certificate, verify_certificate,
};
let mut cert = CertificateBuilder::new("DataSynth v0.5.0")
.with_dp_guarantee(DpGuarantee {
mechanism: "Laplace".into(),
epsilon: 1.0,
delta: None,
composition_method: "sequential".into(),
total_queries: 50,
})
.with_quality_metrics(QualityMetrics {
benford_mad: Some(0.008),
correlation_preservation: Some(0.95),
statistical_fidelity: Some(0.92),
mia_auc: Some(0.52),
})
.with_seed(42)
.build();
// Sign and verify
sign_certificate(&mut cert, "my-secret-key");
assert!(verify_certificate(&cert, "my-secret-key"));
}
Fidelity Metrics
| Category | Metrics |
|---|---|
| Statistical | KS statistic, Wasserstein distance, Benford MAD |
| Correlation | Correlation matrix RMSE |
| Schema | Column type match, row count ratio |
| Rules | Balance equation compliance rate |
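Benford MAD, as used in the fidelity metrics above, is the mean absolute deviation between observed and expected first-digit frequencies. A hedged sketch of that computation:
/// Mean absolute deviation of observed first-digit frequencies vs. Benford's Law.
fn benford_mad(amounts: &[f64]) -> f64 {
    let mut counts = [0usize; 9];
    for &a in amounts {
        let mut x = a.abs();
        if x == 0.0 { continue; }
        // Normalize to [1, 10) to read off the first digit.
        while x >= 10.0 { x /= 10.0; }
        while x < 1.0 { x *= 10.0; }
        counts[x as usize - 1] += 1;
    }
    let n: usize = counts.iter().sum();
    (1..=9)
        .map(|d| {
            let expected = (1.0 + 1.0 / d as f64).log10();
            let observed = counts[d - 1] as f64 / n.max(1) as f64;
            (observed - expected).abs()
        })
        .sum::<f64>()
        / 9.0
}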
Privacy Guarantees
The fingerprint extraction process provides the following privacy guarantees:
- Differential Privacy: Numeric statistics are perturbed using Laplace or Gaussian mechanisms with configurable epsilon budget
- K-Anonymity: Categorical values appearing fewer than k times are suppressed
- Winsorization: Outliers are clipped to prevent identification of extreme values
- Audit Trail: All privacy decisions are logged for compliance verification
CLI Commands
# Extract fingerprint
datasynth-data fingerprint extract \
--input ./data.csv \
--output ./fp.dsf \
--privacy-level standard
# Validate
datasynth-data fingerprint validate ./fp.dsf
# Show info
datasynth-data fingerprint info ./fp.dsf --detailed
# Compare
datasynth-data fingerprint diff ./fp1.dsf ./fp2.dsf
# Evaluate fidelity
datasynth-data fingerprint evaluate \
--fingerprint ./fp.dsf \
--synthetic ./synthetic/ \
--threshold 0.8
# Federated fingerprinting
datasynth-data fingerprint federated \
--sources ./source_a.dsf ./source_b.dsf \
--output ./aggregated.dsf \
--method weighted_average
# Generate with certificate
datasynth-data generate --config config.yaml --output ./output --certificate
Dependencies
[dependencies]
datasynth-core = { path = "../datasynth-core" }
datasynth-config = { path = "../datasynth-config" }
serde = { version = "1.0", features = ["derive"] }
serde_yaml = "0.9"
serde_json = "1.0"
zip = "0.6"
sha2 = "0.10"
rand = "0.8"
statrs = "0.16"
See Also
datasynth-standards
The datasynth-standards crate provides comprehensive support for major accounting and auditing standards frameworks including IFRS, US GAAP, ISA, SOX, and PCAOB.
Overview
This crate contains domain models and business logic for:
- Accounting Standards: Revenue recognition, lease accounting, fair value measurement, impairment testing
- Audit Standards: ISA requirements, analytical procedures, confirmations, audit opinions
- Regulatory Frameworks: SOX 302/404 compliance, PCAOB standards
Modules
framework
Core accounting framework selection and settings.
#![allow(unused)]
fn main() {
use datasynth_standards::framework::{AccountingFramework, FrameworkSettings};
// Select framework
let framework = AccountingFramework::UsGaap;
assert!(framework.allows_lifo());
assert!(!framework.allows_impairment_reversal());
// Framework-specific settings
let settings = FrameworkSettings::us_gaap();
assert!(settings.validate().is_ok());
}
accounting
Accounting standards models:
| Module | Standards | Key Types |
|---|---|---|
revenue | ASC 606 / IFRS 15 | CustomerContract, PerformanceObligation, VariableConsideration |
leases | ASC 842 / IFRS 16 | Lease, ROUAsset, LeaseLiability, LeaseAmortizationEntry |
fair_value | ASC 820 / IFRS 13 | FairValueMeasurement, FairValueHierarchyLevel |
impairment | ASC 360 / IAS 36 | ImpairmentTest, RecoverableAmountMethod |
differences | Dual Reporting | FrameworkDifferenceRecord |
audit
Audit standards models:
| Module | Standards | Key Types |
|---|---|---|
isa_reference | ISA 200-720 | IsaStandard, IsaRequirement, IsaProcedureMapping |
analytical | ISA 520 | AnalyticalProcedure, VarianceInvestigation |
confirmation | ISA 505 | ExternalConfirmation, ConfirmationResponse |
opinion | ISA 700/705/706/701 | AuditOpinion, KeyAuditMatter, OpinionModification |
audit_trail | Traceability | AuditTrail, TrailGap |
pcaob | PCAOB AS | PcaobStandard, PcaobIsaMapping |
regulatory
Regulatory compliance models:
| Module | Standards | Key Types |
|---|---|---|
sox | SOX 302/404 | Sox302Certification, Sox404Assessment, DeficiencyMatrix, MaterialWeakness |
Usage Examples
Revenue Recognition
#![allow(unused)]
fn main() {
use datasynth_standards::accounting::revenue::{
CustomerContract, PerformanceObligation, ObligationType, SatisfactionPattern,
};
use datasynth_standards::framework::AccountingFramework;
use rust_decimal_macros::dec;
// Create a customer contract under US GAAP
let mut contract = CustomerContract::new(
"C001".to_string(),
"CUST001".to_string(),
dec!(100000),
AccountingFramework::UsGaap,
);
// Add performance obligations
let po = PerformanceObligation::new(
"PO001".to_string(),
ObligationType::Good,
SatisfactionPattern::PointInTime,
dec!(60000),
);
contract.add_performance_obligation(po);
}
Lease Accounting
#![allow(unused)]
fn main() {
use datasynth_standards::accounting::leases::{Lease, LeaseAssetClass, LeaseClassification};
use datasynth_standards::framework::AccountingFramework;
use chrono::NaiveDate;
use rust_decimal_macros::dec;
// Create a lease
let lease = Lease::new(
"L001".to_string(),
LeaseAssetClass::RealEstate,
NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
60, // 5-year term
dec!(10000), // Monthly payment
0.05, // Discount rate
AccountingFramework::UsGaap,
);
// Classify under US GAAP bright-line tests
let classification = lease.classify_us_gaap(
72, // Asset useful life (months)
dec!(600000), // Fair value
dec!(550000), // Present value of payments
);
}
ISA Standards
#![allow(unused)]
fn main() {
use datasynth_standards::audit::isa_reference::{
IsaStandard, IsaRequirement, IsaRequirementType,
};
// Reference an ISA standard
let standard = IsaStandard::Isa315;
assert_eq!(standard.number(), "315");
assert!(standard.title().contains("Risk"));
// Create a requirement
let requirement = IsaRequirement::new(
IsaStandard::Isa500,
"12".to_string(),
IsaRequirementType::Requirement,
"Design and perform audit procedures".to_string(),
);
}
SOX Compliance
#![allow(unused)]
fn main() {
use datasynth_standards::regulatory::sox::{
Sox404Assessment, DeficiencyMatrix, DeficiencyLikelihood, DeficiencyMagnitude,
};
use uuid::Uuid;
// Create a SOX 404 assessment
let assessment = Sox404Assessment::new(
Uuid::new_v4(),
2024,
true, // ICFR effective
);
// Classify a deficiency
let deficiency = DeficiencyMatrix::new(
DeficiencyLikelihood::Probable,
DeficiencyMagnitude::Material,
);
assert!(deficiency.is_material_weakness());
}
Framework Validation
The crate validates framework-specific rules:
#![allow(unused)]
fn main() {
use datasynth_standards::framework::{AccountingFramework, FrameworkSettings};
// LIFO is not permitted under IFRS
let mut settings = FrameworkSettings::ifrs();
settings.use_lifo_inventory = true;
assert!(settings.validate().is_err());
// PPE revaluation is not permitted under US GAAP
let mut settings = FrameworkSettings::us_gaap();
settings.use_ppe_revaluation = true;
assert!(settings.validate().is_err());
}
Dependencies
[dependencies]
datasynth-standards = "0.2.3"
Feature Flags
Currently, no optional features are defined. All functionality is included by default.
See Also
- Accounting Standards Guide - Detailed usage guide
- Configuration Reference - YAML configuration options
- datasynth-eval - Standards compliance evaluation
datasynth-test-utils
Test utilities and helpers for the SyntheticData workspace.
Overview
datasynth-test-utils provides shared testing infrastructure:
- Test Fixtures: Pre-configured test data and scenarios
- Assertion Helpers: Domain-specific assertions for financial data
- Mock Generators: Simplified generators for unit testing
- Snapshot Testing: Helpers for snapshot-based testing
Fixtures
Journal Entry Fixtures
#![allow(unused)]
fn main() {
use synth_test_utils::fixtures;
// Balanced two-line entry
let entry = fixtures::balanced_journal_entry();
assert!(entry.is_balanced());
// Entry with specific amounts
let entry = fixtures::journal_entry_with_amount(dec!(1000.00));
// Fraudulent entry for testing detection
let entry = fixtures::fraudulent_entry(FraudType::SplitTransaction);
}
Master Data Fixtures
#![allow(unused)]
fn main() {
// Sample vendors
let vendors = fixtures::sample_vendors(10);
// Sample customers
let customers = fixtures::sample_customers(20);
// Chart of accounts
let coa = fixtures::test_chart_of_accounts();
}
Amount Fixtures
#![allow(unused)]
fn main() {
// Benford-compliant amounts
let amounts = fixtures::sample_amounts(1000);
// Round-number biased amounts
let amounts = fixtures::round_amounts(100);
// Fraud-pattern amounts
let amounts = fixtures::suspicious_amounts(50);
}
Configuration Fixtures
#![allow(unused)]
fn main() {
// Minimal valid config
let config = fixtures::test_config();
// Manufacturing preset
let config = fixtures::manufacturing_config();
// Config with specific transaction count
let config = fixtures::config_with_transactions(10000);
}
Assertions
Balance Assertions
#![allow(unused)]
fn main() {
use synth_test_utils::assertions;
#[test]
fn test_entry_is_balanced() {
let entry = create_entry();
assertions::assert_balanced(&entry);
}
#[test]
fn test_trial_balance() {
let tb = generate_trial_balance();
assertions::assert_trial_balance_balanced(&tb);
}
}
Benford’s Law Assertions
#![allow(unused)]
fn main() {
#[test]
fn test_benford_compliance() {
let amounts = generate_amounts(10000);
assertions::assert_benford_compliant(&amounts, 0.05);
}
}
Document Chain Assertions
#![allow(unused)]
fn main() {
#[test]
fn test_p2p_chain() {
let documents = generate_p2p_flow();
assertions::assert_valid_document_chain(&documents);
}
}
Uniqueness Assertions
#![allow(unused)]
fn main() {
#[test]
fn test_no_duplicate_ids() {
let entries = generate_entries(1000);
assertions::assert_unique_document_ids(&entries);
}
}
Mock Generators
Simple Journal Entry Generator
#![allow(unused)]
fn main() {
use synth_test_utils::mocks::MockJeGenerator;
let mut generator = MockJeGenerator::new(42);
// Generate entries without full config
let entries = generator.generate(100);
}
Predictable Amount Generator
#![allow(unused)]
fn main() {
use synth_test_utils::mocks::MockAmountGenerator;
let mut generator = MockAmountGenerator::new();
// Returns predictable sequence
let amount1 = generator.next(); // 100.00
let amount2 = generator.next(); // 200.00
}
Fixed Date Generator
#![allow(unused)]
fn main() {
use synth_test_utils::mocks::MockDateGenerator;
let generator = MockDateGenerator::fixed(
NaiveDate::from_ymd_opt(2024, 1, 15).unwrap()
);
}
Snapshot Testing
#![allow(unused)]
fn main() {
use synth_test_utils::snapshots;
#[test]
fn test_je_serialization() {
let entry = fixtures::balanced_journal_entry();
snapshots::assert_json_snapshot("je_balanced", &entry);
}
#[test]
fn test_csv_output() {
let entries = fixtures::sample_entries(10);
snapshots::assert_csv_snapshot("entries_sample", &entries);
}
}
Test Helpers
Temporary Directories
#![allow(unused)]
fn main() {
use synth_test_utils::temp_dir;
#[test]
fn test_output_writing() {
let dir = temp_dir::create();
// Test writes to temp directory
let path = dir.path().join("test.csv");
write_output(&path).unwrap();
assert!(path.exists());
// Directory cleaned up on drop
}
}
Seed Management
#![allow(unused)]
fn main() {
use synth_test_utils::seeds;
#[test]
fn test_deterministic_generation() {
let seed = seeds::fixed();
let result1 = generate_with_seed(seed);
let result2 = generate_with_seed(seed);
assert_eq!(result1, result2);
}
}
Time Helpers
#![allow(unused)]
fn main() {
use synth_test_utils::time;
#[test]
fn test_with_frozen_time() {
let frozen = time::freeze_at(2024, 1, 15);
let entry = generate_entry_with_current_date();
assert_eq!(entry.posting_date, frozen.date());
}
}
Usage in Other Crates
Add to Cargo.toml:
[dev-dependencies]
datasynth-test-utils = { path = "../datasynth-test-utils" }
Use in tests:
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use synth_test_utils::{fixtures, assertions};
#[test]
fn test_my_function() {
let input = fixtures::test_config();
let result = my_function(&input);
assertions::assert_balanced(&result);
}
}
}
Fixture Data Files
Test data files in fixtures/:
datasynth-test-utils/
└── fixtures/
├── chart_of_accounts.yaml
├── sample_entries.json
├── vendor_master.csv
└── test_config.yaml
See Also
Advanced Topics
Advanced features for specialized use cases.
Overview
| Topic | Description |
|---|---|
| Anomaly Injection | Fraud, errors, and process issues |
| Data Quality Variations | Missing values, typos, duplicates |
| Graph Export | ML-ready graph formats |
| Intercompany Processing | Multi-entity transactions |
| Period Close Engine | Month/quarter/year-end processes |
| Performance Tuning | Optimization strategies |
Feature Matrix
| Feature | Use Case | Output |
|---|---|---|
| Anomaly Injection | ML training | Labels (CSV) |
| Data Quality | Testing robustness | Varied data |
| Graph Export | GNN training | PyG, Neo4j |
| Intercompany | Consolidation testing | IC pairs |
| Period Close | Full cycle testing | Closing entries |
Enabling Advanced Features
In Configuration
# Anomaly injection
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
# Data quality variations
data_quality:
enabled: true
missing_values:
rate: 0.01
# Graph export
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
# Intercompany
intercompany:
enabled: true
# Period close
period_close:
enabled: true
monthly:
accruals: true
depreciation: true
Via CLI
Most advanced features are controlled through configuration. Use init to create a base config, then customize:
datasynth-data init --industry manufacturing --complexity medium -o config.yaml
# Edit config.yaml to enable advanced features
datasynth-data generate --config config.yaml --output ./output
Prerequisites
Some advanced features have dependencies:
| Feature | Requires |
|---|---|
| Intercompany | Multiple companies defined |
| Period Close | period_months ≥ 1 |
| Graph Export | Transactions generated |
| FX | Multiple currencies |
Output Files
Advanced features produce additional outputs:
output/
├── labels/ # Anomaly injection
│ ├── anomaly_labels.csv
│ ├── fraud_labels.csv
│ └── quality_issues.csv
├── graphs/ # Graph export
│ ├── pytorch_geometric/
│ └── neo4j/
├── consolidation/ # Intercompany
│ ├── eliminations.csv
│ └── ic_pairs.csv
└── period_close/ # Period close
├── trial_balances/
├── accruals.csv
└── closing_entries.csv
Performance Impact
| Feature | Impact | Mitigation |
|---|---|---|
| Anomaly Injection | Low | Post-processing |
| Data Quality | Low | Post-processing |
| Graph Export | Medium | Separate phase |
| Intercompany | Medium | Per-transaction |
| Period Close | Low | Per-period |
See Also
Fraud Patterns & ACFE Taxonomy
SyntheticData includes comprehensive fraud pattern modeling aligned with the Association of Certified Fraud Examiners (ACFE) Report to the Nations. This enables generation of realistic fraud scenarios for training machine learning models and testing audit analytics.
ACFE Fraud Taxonomy
The ACFE occupational fraud classification divides fraud into three main categories, each with distinct characteristics:
Asset Misappropriation (86% of cases)
The most common type of fraud, involving theft of organizational assets:
fraud:
enabled: true
acfe_category: asset_misappropriation
schemes:
cash_fraud:
- skimming # Sales not recorded
- larceny # Cash stolen after recording
- shell_company # Fictitious vendors
- ghost_employee # Non-existent employees
- expense_schemes # Personal expenses as business
non_cash_fraud:
- inventory_theft
- fixed_asset_misuse
Corruption (33% of cases)
Schemes involving conflicts of interest and bribery:
fraud:
enabled: true
acfe_category: corruption
schemes:
- purchasing_conflict # Undisclosed vendor ownership
- sales_conflict # Kickbacks from customers
- invoice_kickback # Vendor payment schemes
- bid_rigging # Collusion with vendors
- economic_extortion # Demands for payment
Financial Statement Fraud (10% of cases)
The least common but most costly fraud type:
fraud:
enabled: true
acfe_category: financial_statement
schemes:
overstatement:
- premature_revenue # Revenue before earned
- fictitious_revenues # Fake sales
- concealed_liabilities # Hidden obligations
- improper_asset_values # Overstated assets
understatement:
- understated_revenues # Hidden sales
- overstated_expenses # Inflated costs
ACFE Calibration
Generated fraud data is calibrated to match ACFE statistics:
| Metric | ACFE Value | Configuration |
|---|---|---|
| Median Loss | $117,000 | acfe.median_loss |
| Median Duration | 12 months | acfe.median_duration_months |
| Tip Detection | 42% | detection_method.tip |
| Internal Audit Detection | 16% | detection_method.internal_audit |
| Management Review Detection | 12% | detection_method.management_review |
fraud:
acfe_calibration:
enabled: true
median_loss: 117000
median_duration_months: 12
detection_methods:
tip: 0.42
internal_audit: 0.16
management_review: 0.12
external_audit: 0.04
accident: 0.06
Collusion & Conspiracy Modeling
SyntheticData models multi-party fraud networks with coordinated schemes:
Collusion Ring Types
#![allow(unused)]
fn main() {
pub enum CollusionRingType {
// Internal collusion
EmployeePair, // approver + processor
DepartmentRing, // 3-5 employees
ManagementSubordinate, // manager + subordinate
// Internal-external
EmployeeVendor, // purchasing + vendor contact
EmployeeCustomer, // sales rep + customer
EmployeeContractor, // project manager + contractor
// External rings
VendorRing, // bid rigging (2-4 vendors)
CustomerRing, // return fraud
}
}
Conspirator Roles
Each conspirator in a ring has a specific role:
- Initiator: Conceives scheme, recruits others
- Executor: Performs fraudulent transactions
- Approver: Provides approvals/overrides
- Concealer: Hides evidence, manipulates records
- Lookout: Monitors for detection
- Beneficiary: External recipient of proceeds
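These roles might be represented along the lines of the following enum. This is a sketch based on the list above; the crate's actual type names may differ.
pub enum ConspiratorRole {
    Initiator,   // Conceives the scheme, recruits others
    Executor,    // Performs the fraudulent transactions
    Approver,    // Provides approvals or overrides
    Concealer,   // Hides evidence, manipulates records
    Lookout,     // Monitors for detection
    Beneficiary, // External recipient of the proceeds
}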
Configuration
fraud:
collusion:
enabled: true
ring_types:
- type: employee_vendor
probability: 0.15
min_members: 2
max_members: 4
- type: department_ring
probability: 0.08
min_members: 3
max_members: 5
defection_probability: 0.05
escalation_rate: 0.10
Management Override
Senior-level fraud with override patterns:
fraud:
management_override:
enabled: true
perpetrator_levels:
- senior_manager
- cfo
- ceo
override_types:
revenue:
- journal_entry_override
- revenue_recognition_acceleration
- reserve_manipulation
expense:
- capitalization_abuse
- expense_deferral
pressure_sources:
- financial_targets
- market_expectations
- covenant_compliance
Fraud Triangle
The fraud triangle (Pressure, Opportunity, Rationalization) is modeled:
fraud:
fraud_triangle:
pressure:
source: financial_targets
intensity: high
opportunity:
factors:
- weak_internal_controls
- management_override_capability
- lack_of_oversight
rationalization:
type: temporary_adjustment # "We'll fix it next quarter"
Red Flag Generation
Probabilistic fraud indicators with calibrated Bayesian probabilities:
Red Flag Strengths
| Strength | P(fraud \| flag) | Examples |
|---|---|---|
| Strong | > 0.5 | Matched home address vendor/employee |
| Moderate | 0.2 - 0.5 | Vendor with no physical address |
| Weak | < 0.2 | Round number invoices |
Configuration
fraud:
red_flags:
enabled: true
inject_rate: 0.15 # 15% of transactions get flags
patterns:
strong:
- name: matched_address_vendor_employee
p_flag_given_fraud: 0.90
p_flag_given_no_fraud: 0.001
- name: sequential_check_numbers
p_flag_given_fraud: 0.80
p_flag_given_no_fraud: 0.01
moderate:
- name: approval_just_under_threshold
p_flag_given_fraud: 0.70
p_flag_given_no_fraud: 0.10
weak:
- name: round_number_invoice
p_flag_given_fraud: 0.40
p_flag_given_no_fraud: 0.20
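Given the two likelihoods configured above, the implied posterior follows directly from Bayes' rule. A small sketch of that arithmetic; the fraud base rate is an assumed input for the example, not something the generator prescribes.
/// P(fraud | flag) from the flag likelihoods and a fraud base rate.
fn posterior_fraud_given_flag(
    p_flag_given_fraud: f64,
    p_flag_given_no_fraud: f64,
    base_rate: f64,
) -> f64 {
    let p_flag = p_flag_given_fraud * base_rate + p_flag_given_no_fraud * (1.0 - base_rate);
    p_flag_given_fraud * base_rate / p_flag
}

// Example: matched_address_vendor_employee with an assumed 2% fraud base rate
// ≈ 0.90 * 0.02 / (0.90 * 0.02 + 0.001 * 0.98) ≈ 0.95, i.e. a strong flag.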
Evaluation Benchmarks
ACFE-Calibrated Benchmarks
#![allow(unused)]
fn main() {
// General fraud detection
let bench = acfe_calibrated_1k();
// Collusion-focused benchmark
let bench = acfe_collusion_5k();
// Management override detection
let bench = acfe_management_override_2k();
}
Benchmark Metrics
#![allow(unused)]
fn main() {
pub struct AcfeAlignment {
/// Category distribution MAD vs ACFE
pub category_distribution_mad: f64,
/// Median loss ratio (actual / expected)
pub median_loss_ratio: f64,
/// Duration distribution KS statistic
pub duration_distribution_ks: f64,
/// Detection method chi-squared
pub detection_method_chi_sq: f64,
}
}
Output Files
| File | Description |
|---|---|
collusion_rings.json | Collusion network details with members, roles |
red_flags.csv | Red flag indicators with probabilities |
management_overrides.json | Management override schemes |
fraud_labels.csv | Enhanced fraud labels with ACFE category |
Best Practices
- Start with ACFE calibration: Use default ACFE statistics for realistic distribution
- Enable collusion gradually: Start with simple rings before complex networks
- Use red flags for training: Red flags provide weak supervision signals
- Validate against benchmarks: Use ACFE benchmarks to verify model performance
- Consider detection difficulty: Use detection_difficulty labels for curriculum learning
Industry-Specific Features
SyntheticData includes industry-specific transaction modeling with authentic terminology, master data structures, and anomaly patterns. Three industries have full generator implementations (Manufacturing, Retail, Healthcare), while three additional industries (Technology, Financial Services, Professional Services) are available as configuration presets with industry-appropriate GL structures and anomaly rates.
Overview
Each industry module provides:
- Industry-specific transactions: Authentic transaction types using correct terminology
- Master data structures: Industry-specific entities (BOM, routings, clinical codes, etc.)
- Anomaly patterns: Industry-authentic fraud and error patterns
- GL account structures: Industry-appropriate chart of accounts
- Configuration options: Fine-grained control over industry characteristics
Implementation Status
| Industry | Status | Transaction Types | Master Data | Anomaly Patterns | Benchmarks |
|---|---|---|---|---|---|
| Manufacturing | Full generator | 13 types | BOM, routings, work centers | 5 patterns | Yes |
| Retail | Full generator | 11 types | Stores, POS, loyalty | 6 patterns | Yes |
| Healthcare | Full generator | 9 types | ICD-10, CPT, DRG, payers | 6 patterns | Yes |
| Technology | Config preset | Config-only | — | 3 anomaly rates | Yes |
| Financial Services | Config preset | Config-only | — | 3 anomaly rates | Yes |
| Professional Services | Config preset | Config-only | — | 3 anomaly rates | No |
Full generator industries have dedicated Rust enum types with per-transaction generation logic, dedicated master data structures, and industry-specific anomaly injection. Config preset industries use the standard generator pipeline but apply industry-appropriate GL account structures, transaction distributions, and anomaly rates through configuration.
Manufacturing
Transaction Types
#![allow(unused)]
fn main() {
pub enum ManufacturingTransaction {
// Production
WorkOrderIssuance, // Create production order
MaterialRequisition, // Issue materials to production
LaborBooking, // Record labor hours
OverheadAbsorption, // Apply manufacturing overhead
ScrapReporting, // Report production scrap
ReworkOrder, // Create rework order
ProductionVariance, // Record variances
// Inventory
RawMaterialReceipt, // Receive raw materials
WipTransfer, // Transfer between work centers
FinishedGoodsTransfer, // Move to finished goods
CycleCountAdjustment, // Inventory adjustments
// Costing
StandardCostRevaluation, // Update standard costs
PurchasePriceVariance, // Record PPV
}
}
Master Data
manufacturing:
bom:
depth: 4 # BOM levels (3-7 typical)
yield_rate: 0.97 # Expected yield
scrap_factor: 0.02 # Scrap percentage
routings:
operations_per_product: 5 # Average operations
setup_time_minutes: 30 # Default setup time
work_centers:
count: 20
capacity_hours: 8
efficiency: 0.85
Anomaly Patterns
| Anomaly | Description | Detection Method |
|---|---|---|
| Yield Manipulation | Reported yield differs from actual | Variance analysis |
| Labor Misallocation | Labor charged to wrong order | Cross-reference |
| Phantom Production | Production orders with no output | Data analytics |
| Obsolete Inventory | Aging inventory not written down | Aging analysis |
| Standard Cost Manipulation | Inflated standard costs | Trend analysis |
Configuration
industry_specific:
enabled: true
manufacturing:
enabled: true
bom_depth: 4
just_in_time: false
production_order_types:
- standard
- rework
- prototype
quality_framework: iso_9001
supplier_tiers: 2
standard_cost_frequency: quarterly
target_yield_rate: 0.97
scrap_alert_threshold: 0.03
anomaly_rates:
yield_manipulation: 0.005
labor_misallocation: 0.008
phantom_production: 0.002
obsolete_inventory: 0.01
Retail
Transaction Types
#![allow(unused)]
fn main() {
pub enum RetailTransaction {
// Point of Sale
PosSale, // Register sale
ReturnRefund, // Customer return
VoidTransaction, // Voided sale
EmployeeDiscount, // Staff discount
LoyaltyRedemption, // Points redemption
// Inventory
InventoryReceipt, // Receive from DC
StoreTransfer, // Store-to-store
MarkdownRecording, // Price reductions
ShrinkageAdjustment, // Inventory loss
// Cash Management
CashDrop, // Safe deposit
RegisterReconciliation, // Drawer count
}
}
Store Types
retail:
stores:
types:
- flagship # High-volume, full assortment
- standard # Normal retail store
- express # Small format, convenience
- outlet # Discount/clearance
- warehouse # Bulk/club format
- pop_up # Temporary locations
- digital # E-commerce only
Anomaly Patterns
| Anomaly | Description | Detection Method |
|---|---|---|
| Sweethearting | Not scanning items | Video analytics |
| Skimming | Cash theft from register | Cash variance |
| Refund Fraud | Fake returns | Return pattern |
| Receiving Fraud | Short shipment theft | 3-way match |
| Coupon Fraud | Invalid coupon use | Coupon validation |
| Employee Discount Abuse | Unauthorized discounts | Policy review |
Configuration
industry_specific:
enabled: true
retail:
enabled: true
store_types:
- standard
- express
- outlet
shrinkage_rate: 0.015
return_rate: 0.08
markdown_frequency: weekly
loss_prevention:
camera_coverage: 0.85
eas_enabled: true
pos_anomaly_rates:
sweethearting: 0.002
skimming: 0.001
refund_fraud: 0.003
Healthcare
Transaction Types
#![allow(unused)]
fn main() {
pub enum HealthcareTransaction {
// Revenue Cycle
PatientRegistration, // Register patient
ChargeCapture, // Record charges
ClaimSubmission, // Submit to payer
PaymentPosting, // Record payment
DenialManagement, // Handle denials
// Clinical
ProcedureCoding, // CPT codes
DiagnosisCoding, // ICD-10 codes
SupplyConsumption, // Medical supplies
PharmacyDispensing, // Medications
}
}
Coding Systems
healthcare:
coding:
icd10: true # Diagnosis codes
cpt: true # Procedure codes
drg: true # Diagnosis Related Groups
hcpcs: true # Supplies/equipment
Payer Mix
healthcare:
payer_mix:
medicare: 0.40
medicaid: 0.20
commercial: 0.30
self_pay: 0.10
Compliance Frameworks
healthcare:
compliance:
hipaa: true # Privacy rules
stark_law: true # Physician referrals
anti_kickback: true # AKS compliance
false_claims_act: true
Anomaly Patterns
| Anomaly | Description | Detection Method |
|---|---|---|
| Upcoding | Higher-level code than justified | Code validation |
| Unbundling | Splitting bundled services | Bundle analysis |
| Phantom Billing | Billing for unrendered services | Audit |
| Duplicate Billing | Same service billed twice | Duplicate check |
| Kickbacks | Physician referral payments | Relationship analysis |
| HIPAA Violations | Unauthorized data access | Access logs |
Configuration
industry_specific:
enabled: true
healthcare:
enabled: true
facility_type: hospital # hospital, physician_practice, etc.
payer_mix:
medicare: 0.40
medicaid: 0.20
commercial: 0.30
self_pay: 0.10
coding_system:
icd10: true
cpt: true
drg: true
compliance:
hipaa: true
stark_law: true
anti_kickback: true
avg_daily_encounters: 200
avg_charges_per_encounter: 8
anomaly_rates:
upcoding: 0.02
unbundling: 0.015
phantom_billing: 0.005
duplicate_billing: 0.008
Technology
Transaction Types
- License revenue recognition
- Subscription billing
- Professional services
- R&D capitalization
- Deferred revenue
Configuration
industry_specific:
enabled: true
technology:
enabled: true
revenue_model: subscription # license, subscription, usage
subscription_revenue_percent: 0.70
professional_services_percent: 0.20
license_revenue_percent: 0.10
r_and_d_capitalization_rate: 0.15
deferred_revenue_months: 12
anomaly_rates:
premature_revenue: 0.008
channel_stuffing: 0.003
improper_capitalization: 0.005
Financial Services
Transaction Types
- Loan origination
- Interest accrual
- Fee income
- Trading transactions
- Customer deposits
- Wire transfers
Configuration
industry_specific:
enabled: true
financial_services:
enabled: true
institution_type: commercial_bank
regulatory_framework: us # us, eu, uk
loan_portfolio_size: 1000
avg_loan_amount: 250000
loan_loss_provision_rate: 0.02
fee_income_percent: 0.15
trading_volume_daily: 50000000
anomaly_rates:
loan_fraud: 0.003
trading_fraud: 0.001
account_takeover: 0.002
Professional Services
Transaction Types
- Time and billing
- Engagement management
- Trust account transactions
- Expense reimbursement
- Partner distributions
Configuration
industry_specific:
enabled: true
professional_services:
enabled: true
billing_model: hourly # hourly, fixed_fee, contingency
avg_hourly_rate: 350
utilization_target: 0.75
realization_rate: 0.92
trust_accounting: true
engagement_types:
- audit
- tax
- advisory
- litigation
anomaly_rates:
billing_fraud: 0.004
trust_misappropriation: 0.001
expense_fraud: 0.008
Industry Benchmarks
SyntheticData provides pre-configured ML benchmarks for each industry:
#![allow(unused)]
fn main() {
// Get industry-specific benchmark
let bench = get_industry_benchmark(IndustrySector::Healthcare);
// Available benchmarks
let manufacturing = manufacturing_fraud_5k();
let retail = retail_fraud_10k();
let healthcare = healthcare_fraud_5k();
let technology = technology_fraud_3k();
let financial = financial_services_fraud_5k();
}
Benchmark Features
Each industry benchmark includes:
- Industry-specific transaction features
- Relevant anomaly types
- Appropriate cost matrices
- Industry-specific evaluation metrics
Best Practices
- Match industry to use case: Select the industry that matches your target domain
- Use industry presets first: Start with default settings before customizing
- Enable industry-specific anomalies: These provide realistic fraud patterns
- Consider regulatory context: Enable compliance frameworks relevant to your industry
- Use industry benchmarks: Evaluate models against industry-specific baselines
Output Files
| File | Description |
|---|---|
industry_transactions.csv | Industry-specific transaction log |
industry_master_data.json | Industry-specific entities |
industry_anomalies.csv | Industry-specific anomaly labels |
industry_gl_accounts.csv | Industry-specific chart of accounts |
Anomaly Injection
Generate labeled anomalies for machine learning training.
Overview
Anomaly injection adds realistic irregularities to generated data with full labeling for supervised learning:
- 20+ fraud types
- Error patterns
- Process issues
- Statistical outliers
- Relational anomalies
Configuration
anomaly_injection:
enabled: true
total_rate: 0.02 # 2% anomaly rate
generate_labels: true # Output ML labels
categories: # Category distribution
fraud: 0.25
error: 0.40
process_issue: 0.20
statistical: 0.10
relational: 0.05
temporal_pattern:
year_end_spike: 1.5 # More anomalies at year-end
clustering:
enabled: true
cluster_probability: 0.2 # 20% appear in clusters
Anomaly Categories
Fraud Types
| Type | Description | Detection Difficulty |
|---|---|---|
fictitious_transaction | Fabricated entries | Medium |
revenue_manipulation | Premature recognition | Hard |
expense_capitalization | Improper capitalization | Medium |
split_transaction | Split to avoid threshold | Easy |
round_tripping | Circular transactions | Hard |
kickback_scheme | Vendor kickbacks | Hard |
ghost_employee | Non-existent payee | Medium |
duplicate_payment | Same invoice twice | Easy |
unauthorized_discount | Unapproved discounts | Medium |
suspense_abuse | Hide in suspense | Hard |
fraud:
types:
fictitious_transaction: 0.15
split_transaction: 0.20
duplicate_payment: 0.15
ghost_employee: 0.10
kickback_scheme: 0.10
revenue_manipulation: 0.10
expense_capitalization: 0.10
unauthorized_discount: 0.10
Error Types
| Type | Description |
|---|---|
duplicate_entry | Same entry posted twice |
reversed_amount | Debit/credit swapped |
wrong_period | Posted to wrong period |
wrong_account | Incorrect GL account |
missing_reference | Missing document reference |
incorrect_tax_code | Wrong tax calculation |
misclassification | Wrong account category |
Process Issues
| Type | Description |
|---|---|
late_posting | Posted after cutoff |
skipped_approval | Missing required approval |
threshold_manipulation | Amount just below threshold |
missing_documentation | No supporting document |
out_of_sequence | Documents out of order |
Statistical Anomalies
| Type | Description |
|---|---|
unusual_amount | Significant deviation from mean |
trend_break | Sudden pattern change |
benford_violation | Doesn’t follow Benford’s Law |
outlier_value | Extreme value |
Relational Anomalies
| Type | Description |
|---|---|
circular_transaction | A → B → A flow |
dormant_account_activity | Inactive account used |
unusual_counterparty | Unexpected entity pairing |
Injection Strategies
Amount Manipulation
anomaly_injection:
strategies:
amount:
enabled: true
threshold_adjacent: 0.3 # Just below approval limit
round_number_bias: 0.4 # Suspicious round amounts
Threshold-adjacent: Amounts like $9,999 when limit is $10,000.
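As a quick illustration of how threshold-adjacent amounts surface in analysis, the sketch below flags entries posted within 5% below a hypothetical $10,000 approval limit. The limit and the 5% band are assumptions for illustration, not generator settings; the output path and amount column follow the files documented elsewhere in this guide.
import pandas as pd

APPROVAL_LIMIT = 10_000  # hypothetical approval limit, illustration only

entries = pd.read_csv('output/transactions/journal_entries.csv')
near_threshold = entries[
    (entries['amount'] < APPROVAL_LIMIT)
    & (entries['amount'] >= APPROVAL_LIMIT * 0.95)
]
print(f"{len(near_threshold)} entries posted just below the approval limit")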
Date Manipulation
anomaly_injection:
strategies:
date:
enabled: true
weekend_bias: 0.2 # Weekend postings
after_hours_bias: 0.15 # After business hours
Duplication
anomaly_injection:
strategies:
duplication:
enabled: true
exact_duplicate: 0.5 # Exact copy
near_duplicate: 0.3 # Slight variations
delayed_duplicate: 0.2 # Same entry later
Temporal Patterns
Anomalies can follow realistic patterns:
anomaly_injection:
temporal_pattern:
month_end_spike: 1.2 # 20% more at month-end
quarter_end_spike: 1.5 # 50% more at quarter-end
year_end_spike: 2.0 # Double at year-end
seasonality: true # Follow industry patterns
Entity Targeting
Control which entities receive anomalies:
anomaly_injection:
entity_targeting:
strategy: weighted # random, repeat_offender, weighted
repeat_offender:
enabled: true
rate: 0.4 # 40% from same users
high_volume_bias: 0.3 # Target high-volume entities
Clustering
Real anomalies often cluster:
anomaly_injection:
clustering:
enabled: true
cluster_probability: 0.2 # 20% in clusters
cluster_size:
min: 3
max: 10
cluster_timespan_days: 30 # Within 30-day window
Output Labels
anomaly_labels.csv
| Field | Description |
|---|---|
document_id | Entry reference |
anomaly_id | Unique anomaly ID |
anomaly_type | Specific type |
anomaly_category | Fraud, Error, etc. |
severity | Low, Medium, High |
detection_difficulty | Easy, Medium, Hard |
description | Human-readable description |
fraud_labels.csv
Subset with fraud-specific fields:
| Field | Description |
|---|---|
document_id | Entry reference |
fraud_type | Specific fraud pattern |
perpetrator_id | Employee ID |
scheme_id | Related anomaly group |
amount_manipulated | Fraud amount |
ML Integration
Loading Labels
import pandas as pd
labels = pd.read_csv('output/labels/anomaly_labels.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Merge
data = entries.merge(labels, on='document_id', how='left')
data['is_anomaly'] = data['anomaly_id'].notna()
Feature Engineering
# Create features
features = [
'amount', 'line_count', 'is_round_number',
'is_weekend', 'is_month_end', 'hour_of_day'
]
X = data[features]
y = data['is_anomaly']
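If the raw export does not already carry these flags, they can be derived from the merged frame above. A minimal pandas sketch follows; it assumes a posting_date timestamp column, which is an assumption about your export rather than a documented schema guarantee.
import pandas as pd

# Assumes 'posting_date' exists in the merged frame; hour_of_day is only
# meaningful if the timestamp carries a time component.
data['posting_date'] = pd.to_datetime(data['posting_date'])
data['is_round_number'] = (data['amount'] % 100 == 0).astype(int)
data['is_weekend'] = (data['posting_date'].dt.dayofweek >= 5).astype(int)
data['is_month_end'] = (
    data['posting_date'].dt.days_in_month - data['posting_date'].dt.day < 3
).astype(int)
data['hour_of_day'] = data['posting_date'].dt.hour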
Train/Test Split
Labels include suggested splits:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
stratify=y, # Maintain anomaly ratio
random_state=42
)
Example Configuration
Fraud Detection Training
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
categories:
fraud: 1.0 # Only fraud for focused training
clustering:
enabled: true
cluster_probability: 0.3
fraud:
enabled: true
fraud_rate: 0.02
types:
split_transaction: 0.25
duplicate_payment: 0.25
kickback_scheme: 0.20
ghost_employee: 0.15
fictitious_transaction: 0.15
General Anomaly Detection
anomaly_injection:
enabled: true
total_rate: 0.05
generate_labels: true
categories:
fraud: 0.15
error: 0.45
process_issue: 0.25
statistical: 0.10
relational: 0.05
See Also
Data Quality Variations
Generate realistic data quality issues for testing robustness.
Overview
Real-world data has imperfections. The data quality module introduces:
- Missing values (various patterns)
- Format variations
- Duplicates
- Typos and transcription errors
- Encoding issues
Configuration
data_quality:
enabled: true
missing_values:
rate: 0.01
pattern: mcar
format_variations:
date_formats: true
amount_formats: true
duplicates:
rate: 0.001
types: [exact, near, fuzzy]
typos:
rate: 0.005
keyboard_aware: true
Missing Values
Patterns
| Pattern | Description |
|---|---|
mcar | Missing Completely At Random |
mar | Missing At Random (conditional) |
mnar | Missing Not At Random (value-dependent) |
systematic | Entire field groups missing |
data_quality:
missing_values:
rate: 0.01 # 1% missing overall
pattern: mcar
# Pattern-specific settings
mcar:
uniform: true # Equal probability all fields
mar:
conditions:
- field: vendor_name
dependent_on: is_intercompany
probability: 0.1
mnar:
conditions:
- field: amount
when_above: 100000 # Large amounts more likely missing
probability: 0.05
systematic:
groups:
- [address, city, country] # All or none
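The patterns above differ only in how the drop probability is chosen. A minimal sketch contrasting MCAR and MNAR masking on a toy frame, purely illustrative and not the generator's internal implementation:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'description': ['Office supplies'] * 1000,
    'amount': rng.lognormal(mean=8, sigma=1.5, size=1000),
})

# MCAR: every value has the same 1% chance of being dropped
df.loc[rng.random(len(df)) < 0.01, 'description'] = None

# MNAR: large amounts are more likely to lose their value
mnar = (df['amount'] > 100_000) & (rng.random(len(df)) < 0.05)
df.loc[mnar, 'amount'] = np.nan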
Field Targeting
data_quality:
missing_values:
fields:
description: 0.02 # 2% missing
cost_center: 0.05 # 5% missing
tax_code: 0.03 # 3% missing
exclude:
- document_id # Never make missing
- posting_date
- account_number
Format Variations
Date Formats
data_quality:
format_variations:
date_formats: true
date_variations:
iso: 0.6 # 2024-01-15
us: 0.2 # 01/15/2024
eu: 0.1 # 15.01.2024
long: 0.1 # January 15, 2024
Examples:
- ISO: 2024-01-15
- US: 01/15/2024, 1/15/2024
- EU: 15.01.2024, 15/01/2024
- Long: January 15, 2024
Amount Formats
data_quality:
format_variations:
amount_formats: true
amount_variations:
plain: 0.5 # 1234.56
us_comma: 0.3 # 1,234.56
eu_format: 0.1 # 1.234,56
currency_prefix: 0.05 # $1,234.56
currency_suffix: 0.05 # 1.234,56 EUR
Identifier Formats
data_quality:
format_variations:
identifier_variations:
case: 0.1 # INV-001 vs inv-001
padding: 0.1 # 001 vs 1
separator: 0.1 # INV-001 vs INV_001 vs INV001
Duplicates
Duplicate Types
| Type | Description |
|---|---|
exact | Identical records |
near | Minor field differences |
fuzzy | Multiple field variations |
data_quality:
duplicates:
rate: 0.001 # 0.1% duplicates
types:
exact: 0.4 # 40% exact duplicates
near: 0.4 # 40% near duplicates
fuzzy: 0.2 # 20% fuzzy duplicates
Near Duplicate Variations
data_quality:
duplicates:
near:
fields_to_vary: 1 # Change 1 field
variations:
- field: posting_date
offset_days: [-1, 0, 1]
- field: amount
variance: 0.001 # 0.1% difference
Fuzzy Duplicate Variations
data_quality:
duplicates:
fuzzy:
fields_to_vary: 3 # Change multiple fields
include_typos: true
Typos
Typo Types
| Type | Description |
|---|---|
| Substitution | Adjacent key pressed |
| Transposition | Characters swapped |
| Insertion | Extra character |
| Deletion | Missing character |
| OCR errors | Scan-related (0/O, 1/l) |
| Homophones | Sound-alike substitution |
data_quality:
typos:
rate: 0.005 # 0.5% of string fields
keyboard_aware: true # Use QWERTY layout
types:
substitution: 0.35 # Adjacnet → Adjacent
transposition: 0.25 # Recieve → Receive
insertion: 0.15 # Shippping → Shipping
deletion: 0.15 # Shiping → Shipping
ocr_errors: 0.05 # O → 0, l → 1
homophones: 0.05 # their → there
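Keyboard-aware substitution replaces a character with a neighbouring key rather than a random one. A minimal sketch using a tiny illustrative QWERTY subset (this adjacency map is an assumption for demonstration, not the generator's table):
import random

ADJACENT = {  # tiny illustrative QWERTY subset
    'a': 'qwsz', 'e': 'wrsd', 'i': 'uojk', 'n': 'bhjm',
    'o': 'iklp', 'r': 'edft', 's': 'awedxz', 't': 'rfgy',
}

def keyboard_substitution(word: str, rng: random.Random) -> str:
    # Replace one character with an adjacent key, if its neighbours are known
    candidates = [i for i, c in enumerate(word.lower()) if c in ADJACENT]
    if not candidates:
        return word
    i = rng.choice(candidates)
    return word[:i] + rng.choice(ADJACENT[word[i].lower()]) + word[i + 1:]

print(keyboard_substitution('Shipping', random.Random(7)))  # e.g. 'Shippjng' or 'Shippimg'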
Field Targeting
data_quality:
typos:
fields:
description: 0.02 # More likely in descriptions
vendor_name: 0.01
customer_name: 0.01
exclude:
- account_number # Never introduce typos
- document_id
Encoding Issues
data_quality:
encoding:
enabled: true
rate: 0.001
issues:
mojibake: 0.4 # UTF-8/Latin-1 confusion
missing_chars: 0.3 # Characters dropped
bom_issues: 0.2 # BOM artifacts
html_entities: 0.1 # &amp; instead of &
Examples:
- Mojibake: Müller → MÃ¼ller
- Missing: Zürich → Zrich
- HTML: R&D → R&amp;D
ML Training Labels
The data quality module generates labels for ML model training:
QualityIssueLabel
#![allow(unused)]
fn main() {
pub struct QualityIssueLabel {
pub issue_id: String,
pub issue_type: LabeledIssueType,
pub issue_subtype: Option<QualityIssueSubtype>,
pub document_id: String,
pub field_name: String,
pub original_value: Option<String>,
pub modified_value: Option<String>,
pub severity: u8, // 1-5
pub processor: String,
pub metadata: HashMap<String, String>,
}
}
Issue Types
| Type | Severity | Description |
|---|---|---|
MissingValue | 3 | Field is null/empty |
Typo | 2 | Character-level errors |
FormatVariation | 1 | Different formatting |
Duplicate | 4 | Duplicate record |
EncodingIssue | 3 | Character encoding problems |
Inconsistency | 3 | Cross-field inconsistency |
OutOfRange | 4 | Value outside expected range |
InvalidReference | 5 | Reference to non-existent entity |
Subtypes
Each issue type has detailed subtypes:
- Typo: Substitution, Transposition, Insertion, Deletion, DoubleChar, CaseError, OcrError, Homophone
- FormatVariation: DateFormat, AmountFormat, IdentifierFormat, TextFormat
- Duplicate: ExactDuplicate, NearDuplicate, FuzzyDuplicate, CrossSystemDuplicate
- EncodingIssue: Mojibake, MissingChars, Bom, ControlChars, HtmlEntities
Output
quality_issues.csv
| Field | Description |
|---|---|
document_id | Affected record |
field_name | Field with issue |
issue_type | missing, typo, duplicate, etc. |
original_value | Value before modification |
modified_value | Value after modification |
quality_labels.csv (ML Training)
| Field | Description |
|---|---|
issue_id | Unique issue identifier |
issue_type | LabeledIssueType enum |
issue_subtype | Detailed subtype |
document_id | Affected document |
field_name | Affected field |
original_value | Original value |
modified_value | Modified value |
severity | 1-5 severity score |
processor | Which processor injected |
Example Configurations
Testing Data Pipelines
data_quality:
enabled: true
missing_values:
rate: 0.02
pattern: mcar
format_variations:
date_formats: true
amount_formats: true
typos:
rate: 0.01
keyboard_aware: true
Testing Deduplication
data_quality:
enabled: true
duplicates:
rate: 0.05 # High duplicate rate
types:
exact: 0.3
near: 0.4
fuzzy: 0.3
Testing OCR Processing
data_quality:
enabled: true
typos:
rate: 0.03
types:
ocr_errors: 0.8 # Mostly OCR-style errors
substitution: 0.2
See Also
Graph Export
Export transaction data as ML-ready graphs.
Overview
Graph export transforms financial data into network representations:
- Accounting Network (GL accounts as nodes, transactions as edges) - New in v0.2.1
- Transaction networks (accounts and entities)
- Approval networks (users and approvals)
- Entity relationship graphs (ownership)
Accounting Network Graph Export
The accounting network represents money flows between GL accounts, designed for network reconstruction and anomaly detection algorithms.
Quick Start
# Generate with graph export enabled
datasynth-data generate --config config.yaml --output ./output --graph-export
Graph Structure
| Element | Description |
|---|---|
| Nodes | GL Accounts from Chart of Accounts |
| Edges | Money flows FROM credit accounts TO debit accounts |
| Direction | Directed graph (source→target) |
┌──────────────┐
│ Credit Acct │
│ (2000) │
└──────┬───────┘
│ $1,000
▼
┌──────────────┐
│ Debit Acct │
│ (1100) │
└──────────────┘
Edge Features (8 dimensions)
| Feature | Index | Description |
|---|---|---|
log_amount | F0 | log10(transaction amount) |
benford_prob | F1 | Expected first-digit probability |
weekday | F2 | Day of week (normalized 0-1) |
period | F3 | Fiscal period (normalized 0-1) |
is_month_end | F4 | Last 3 days of month |
is_year_end | F5 | Last month of year |
is_anomaly | F6 | Anomaly flag (0 or 1) |
business_process | F7 | Encoded business process |
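For reference, the first two edge features can be recomputed directly from a transaction amount. A minimal sketch of the underlying formulas, assuming positive amounts:
import math

def log_amount(amount: float) -> float:
    # F0: base-10 logarithm of the transaction amount
    return math.log10(amount)

def benford_prob(amount: float) -> float:
    # F1: expected first-digit probability under Benford's Law, log10(1 + 1/d)
    first_digit = int(f"{abs(amount):e}"[0])
    return math.log10(1 + 1 / first_digit)

print(log_amount(1000.0))     # 3.0
print(benford_prob(1234.56))  # ~0.301 (first digit 1)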
Output Files
output/graphs/accounting_network/pytorch_geometric/
├── edge_index.npy # [2, E] source→target node indices
├── node_features.npy # [N, 4] node feature vectors
├── edge_features.npy # [E, 8] edge feature vectors
├── edge_labels.npy # [E] anomaly labels (0=normal, 1=anomaly)
├── node_labels.npy # [N] node labels
├── train_mask.npy # [N] boolean training mask
├── val_mask.npy # [N] boolean validation mask
├── test_mask.npy # [N] boolean test mask
├── metadata.json # Graph statistics and configuration
└── load_graph.py # Auto-generated Python loader script
Loading in Python
import numpy as np
import json
# Load metadata
with open('metadata.json') as f:
meta = json.load(f)
print(f"Nodes: {meta['num_nodes']}, Edges: {meta['num_edges']}")
# Load arrays
edge_index = np.load('edge_index.npy') # [2, E]
node_features = np.load('node_features.npy') # [N, F]
edge_features = np.load('edge_features.npy') # [E, 8]
edge_labels = np.load('edge_labels.npy') # [E]
# For PyTorch Geometric
import torch
from torch_geometric.data import Data
data = Data(
x=torch.from_numpy(node_features).float(),
edge_index=torch.from_numpy(edge_index).long(),
edge_attr=torch.from_numpy(edge_features).float(),
y=torch.from_numpy(edge_labels).long(),
)
Configuration
graph_export:
enabled: true
formats:
- pytorch_geometric
train_ratio: 0.7
validation_ratio: 0.15
# test_ratio is automatically 1 - train - val = 0.15
Use Cases
- Anomaly Detection: Train GNNs to detect anomalous transaction patterns
- Network Reconstruction: Validate accounting network recovery algorithms
- Fraud Detection: Identify suspicious money flow patterns
- Link Prediction: Predict likely transaction relationships
Configuration
graph_export:
enabled: true
formats:
- pytorch_geometric
- neo4j
- dgl
graphs:
- transaction_network
- approval_network
- entity_relationship
split:
train: 0.7
val: 0.15
test: 0.15
stratify: is_anomaly
features:
temporal: true
amount: true
structural: true
categorical: true
Graph Types
Transaction Network
Accounts and entities as nodes, transactions as edges.
┌──────────┐
│ Account │
│ 1100 │
└────┬─────┘
│ $1000
▼
┌──────────┐
│ Customer │
│ C-001 │
└──────────┘
Nodes:
- GL accounts
- Vendors
- Customers
- Cost centers
Edges:
- Journal entry lines
- Payments
- Invoices
Approval Network
Users as nodes, approval relationships as edges.
┌──────────┐
│ Clerk │
│ U-001 │
└────┬─────┘
│ approved
▼
┌──────────┐
│ Manager │
│ U-002 │
└──────────┘
Nodes: Employees/users
Edges: Approval actions
Entity Relationship Network
Legal entities with ownership relationships.
┌──────────┐
│ Parent │
│ 1000 │
└────┬─────┘
│ 100%
▼
┌──────────┐
│ Sub │
│ 2000 │
└──────────┘
Nodes: Companies
Edges: Ownership, IC transactions
Export Formats
PyTorch Geometric
output/graphs/transaction_network/pytorch_geometric/
├── node_features.pt # [num_nodes, num_features]
├── edge_index.pt # [2, num_edges]
├── edge_attr.pt # [num_edges, num_edge_features]
├── labels.pt # Labels
├── train_mask.pt # Boolean training mask
├── val_mask.pt # Boolean validation mask
└── test_mask.pt # Boolean test mask
Loading in Python:
import torch
from torch_geometric.data import Data
# Load tensors
node_features = torch.load('node_features.pt')
edge_index = torch.load('edge_index.pt')
edge_attr = torch.load('edge_attr.pt')
labels = torch.load('labels.pt')
train_mask = torch.load('train_mask.pt')
# Create PyG Data object
data = Data(
x=node_features,
edge_index=edge_index,
edge_attr=edge_attr,
y=labels,
train_mask=train_mask,
)
print(f"Nodes: {data.num_nodes}")
print(f"Edges: {data.num_edges}")
Neo4j
output/graphs/transaction_network/neo4j/
├── nodes_account.csv
├── nodes_vendor.csv
├── nodes_customer.csv
├── edges_transaction.csv
├── edges_payment.csv
└── import.cypher
Import script (import.cypher):
// Load accounts
LOAD CSV WITH HEADERS FROM 'file:///nodes_account.csv' AS row
CREATE (:Account {
id: row.id,
name: row.name,
type: row.type
});
// Load transactions
LOAD CSV WITH HEADERS FROM 'file:///edges_transaction.csv' AS row
MATCH (from:Account {id: row.from_id})
MATCH (to:Account {id: row.to_id})
CREATE (from)-[:TRANSACTION {
amount: toFloat(row.amount),
date: date(row.posting_date),
is_anomaly: toBoolean(row.is_anomaly)
}]->(to);
DGL (Deep Graph Library)
output/graphs/transaction_network/dgl/
├── graph.bin # Serialized DGL graph
├── node_feats.npy # Node features
├── edge_feats.npy # Edge features
└── labels.npy # Labels
Loading in Python:
import dgl
import numpy as np
import torch
# Load graph
graph = dgl.load_graphs('graph.bin')[0][0]
# Load features
graph.ndata['feat'] = torch.tensor(np.load('node_feats.npy'))
graph.edata['feat'] = torch.tensor(np.load('edge_feats.npy'))
graph.ndata['label'] = torch.tensor(np.load('labels.npy'))
Features
Temporal Features
features:
temporal: true
| Feature | Description |
|---|---|
weekday | Day of week (0-6) |
period | Fiscal period (1-12) |
is_month_end | Last 3 days of month |
is_quarter_end | Last week of quarter |
is_year_end | Last month of year |
hour | Hour of posting |
Amount Features
features:
amount: true
| Feature | Description |
|---|---|
log_amount | log10(amount) |
benford_prob | Expected first-digit probability |
is_round_number | Ends in 00, 000, etc. |
amount_zscore | Standard deviations from mean |
Structural Features
features:
structural: true
| Feature | Description |
|---|---|
line_count | Number of JE lines |
unique_accounts | Distinct accounts used |
has_intercompany | IC transaction flag |
debit_credit_ratio | Total debits / credits |
Categorical Features
features:
categorical: true
One-hot encoded:
- business_process: Manual, P2P, O2C, etc.
- source_type: System, User, Recurring
- account_type: Asset, Liability, etc.
Train/Val/Test Splits
split:
train: 0.7 # 70% training
val: 0.15 # 15% validation
test: 0.15 # 15% test
stratify: is_anomaly # Maintain anomaly ratio
random_seed: 42 # Reproducible splits
Stratification options:
- is_anomaly: Balanced anomaly detection
- is_fraud: Balanced fraud detection
- account_type: Balanced by account type
- null: Random (no stratification)
GNN Training Example
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
class AnomalyGNN(torch.nn.Module):
def __init__(self, num_features, hidden_dim):
super().__init__()
self.conv1 = GCNConv(num_features, hidden_dim)
self.conv2 = GCNConv(hidden_dim, 2) # Binary classification
def forward(self, data):
x, edge_index = data.x, data.edge_index
x = self.conv1(x, edge_index).relu()
x = self.conv2(x, edge_index)
return x
# Train
model = AnomalyGNN(data.num_features, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(100):
model.train()
optimizer.zero_grad()
out = model(data)
loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
loss.backward()
optimizer.step()
Multi-Layer Hypergraph (v0.6.2)
The RustGraph Hypergraph exporter now supports all 8 enterprise process families with 24 entity type codes:
Entity Type Codes
| Range | Family | Types |
|---|---|---|
| 100 | Accounting | GL Accounts |
| 300-303 | P2P | PurchaseOrder, GoodsReceipt, VendorInvoice, Payment |
| 310-312 | O2C | SalesOrder, Delivery, CustomerInvoice |
| 320-325 | S2C | SourcingProject, RfxEvent, SupplierBid, BidEvaluation, ProcurementContract, SupplierQualification |
| 330-333 | H2R | PayrollRun, TimeEntry, ExpenseReport, PayrollLineItem |
| 340-343 | MFG | ProductionOrder, RoutingOperation, QualityInspection, CycleCount |
| 350-352 | BANK | BankingCustomer, BankAccount, BankTransaction |
| 360-365 | AUDIT | AuditEngagement, Workpaper, AuditFinding, AuditEvidence, RiskAssessment, ProfessionalJudgment |
| 370-372 | Bank Recon | BankReconciliation, BankStatementLine, ReconcilingItem |
| 400 | OCPM | OcpmEvent (events as hyperedges) |
OCPM Events as Hyperedges
When events_as_hyperedges: true, each OCPM event becomes a hyperedge connecting all its participating objects. This enables cross-process analysis via the hypergraph structure.
Per-Family Toggles
graph_export:
hypergraph:
enabled: true
process_layer:
include_p2p: true
include_o2c: true
include_s2c: true
include_h2r: true
include_mfg: true
include_bank: true
include_audit: true
include_r2r: true
events_as_hyperedges: true
See Also
Intercompany Processing
Generate matched intercompany transactions and elimination entries.
Overview
Intercompany features simulate multi-entity corporate structures:
- IC transaction pairs (seller/buyer)
- Transfer pricing
- IC reconciliation
- Consolidation eliminations
Prerequisites
Multiple companies must be defined:
companies:
- code: "1000"
name: "Parent Company"
is_parent: true
volume_weight: 0.5
- code: "2000"
name: "Subsidiary"
parent_code: "1000"
volume_weight: 0.5
Configuration
intercompany:
enabled: true
transaction_types:
goods_sale: 0.4
service_provided: 0.2
loan: 0.15
dividend: 0.1
management_fee: 0.1
royalty: 0.05
transfer_pricing:
method: cost_plus
markup_range:
min: 0.03
max: 0.10
elimination:
enabled: true
timing: quarterly
IC Transaction Types
Goods Sale
Internal sale of inventory between entities.
Seller (1000):
Dr Intercompany Receivable 1,100
Cr IC Revenue 1,100
Dr IC COGS 800
Cr Inventory 800
Buyer (2000):
Dr Inventory 1,100
Cr Intercompany Payable 1,100
Service Provided
Internal services (IT, HR, legal).
Provider (1000):
Dr IC Receivable 500
Cr IC Service Revenue 500
Receiver (2000):
Dr Service Expense 500
Cr IC Payable 500
Loan
Intercompany financing.
Lender (1000):
Dr IC Loan Receivable 10,000
Cr Cash 10,000
Borrower (2000):
Dr Cash 10,000
Cr IC Loan Payable 10,000
Dividend
Upstream dividend payment.
Subsidiary (2000):
Dr Retained Earnings 5,000
Cr Cash 5,000
Parent (1000):
Dr Cash 5,000
Cr Dividend Income 5,000
Management Fee
Corporate overhead allocation.
Parent (1000):
Dr IC Receivable 1,000
Cr Mgmt Fee Revenue 1,000
Subsidiary (2000):
Dr Mgmt Fee Expense 1,000
Cr IC Payable 1,000
Royalty
IP licensing fees.
Licensor (1000):
Dr IC Receivable 750
Cr Royalty Revenue 750
Licensee (2000):
Dr Royalty Expense 750
Cr IC Payable 750
Transfer Pricing
Methods
| Method | Description |
|---|---|
cost_plus | Cost + markup percentage |
resale_minus | Resale price - margin |
comparable_uncontrolled | Market price |
transfer_pricing:
method: cost_plus
markup_range:
min: 0.03 # 3% minimum markup
max: 0.10 # 10% maximum markup
# OR for resale minus
method: resale_minus
margin_range:
min: 0.15
max: 0.25
Arm’s Length Pricing
Transfer prices are generated to fall within defensible arm's-length ranges:
#![allow(unused)]
fn main() {
fn calculate_transfer_price(cost: Decimal, method: &TransferPricingMethod) -> Decimal {
match method {
TransferPricingMethod::CostPlus { markup } => {
cost * (Decimal::ONE + markup)
}
TransferPricingMethod::ResaleMinus { margin, resale_price } => {
resale_price * (Decimal::ONE - margin)
}
TransferPricingMethod::Comparable { market_price } => {
market_price
}
}
}
}
IC Matching
Matched Pair Structure
#![allow(unused)]
fn main() {
pub struct ICMatchedPair {
pub pair_id: String,
pub seller_company: String,
pub buyer_company: String,
pub seller_entry_id: Uuid,
pub buyer_entry_id: Uuid,
pub transaction_type: ICTransactionType,
pub amount: Decimal,
pub currency: String,
pub transaction_date: NaiveDate,
}
}
Match Validation
intercompany:
matching:
enabled: true
tolerance: 0.01 # 1% variance allowed
mismatch_rate: 0.02 # 2% intentional mismatches
Match statuses:
- matched: Amounts reconcile
- timing_difference: Different posting dates
- fx_difference: Currency conversion variance
- unmatched: No matching entry
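The tolerance check itself is simple. A minimal sketch classifying a pair by relative variance; it is illustrative only and ignores the timing and FX statuses listed above:
from decimal import Decimal

def ic_match_status(seller_amount: Decimal, buyer_amount: Decimal,
                    tolerance: Decimal = Decimal('0.01')) -> str:
    # Amounts reconcile when the relative difference is within tolerance (1% default)
    if seller_amount == buyer_amount:
        return 'matched'
    diff = abs(seller_amount - buyer_amount) / max(abs(seller_amount), abs(buyer_amount))
    return 'matched' if diff <= tolerance else 'unmatched'

print(ic_match_status(Decimal('1100.00'), Decimal('1100.00')))  # matched
print(ic_match_status(Decimal('1100.00'), Decimal('1050.00')))  # unmatched (~4.5% variance)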
Eliminations
Timing
intercompany:
elimination:
timing: quarterly # monthly, quarterly, annual
Elimination Types
Revenue/Expense Elimination:
Elimination entry:
Dr IC Revenue (1000) 1,100
Cr IC Expense (2000) 1,100
Unrealized Profit Elimination:
If buyer still holds inventory:
Dr IC Revenue 300
Cr Inventory 300
Receivable/Payable Elimination:
Dr IC Payable (2000) 10,000
Cr IC Receivable (1000) 10,000
Output Files
ic_pairs.csv
| Field | Description |
|---|---|
pair_id | Unique pair identifier |
seller_company | Selling entity |
buyer_company | Buying entity |
seller_entry_id | Seller’s JE document ID |
buyer_entry_id | Buyer’s JE document ID |
transaction_type | Type of IC transaction |
amount | Transaction amount |
match_status | Match result |
eliminations.csv
| Field | Description |
|---|---|
elimination_id | Unique ID |
ic_pair_id | Reference to IC pair |
elimination_type | Revenue, profit, balance |
debit_company | Company debited |
credit_company | Company credited |
amount | Elimination amount |
period | Fiscal period |
Example Configuration
Multi-National Structure
companies:
- code: "1000"
name: "US Headquarters"
currency: USD
country: US
is_parent: true
volume_weight: 0.4
- code: "2000"
name: "European Hub"
currency: EUR
country: DE
parent_code: "1000"
volume_weight: 0.3
- code: "3000"
name: "Asia Pacific"
currency: JPY
country: JP
parent_code: "1000"
volume_weight: 0.3
intercompany:
enabled: true
transaction_types:
goods_sale: 0.5
service_provided: 0.2
management_fee: 0.15
royalty: 0.15
transfer_pricing:
method: cost_plus
markup_range:
min: 0.05
max: 0.12
elimination:
enabled: true
timing: quarterly
See Also
Interconnectivity and Relationship Modeling
SyntheticData provides comprehensive relationship modeling capabilities for generating realistic enterprise networks with multi-tier vendor relationships, customer segmentation, relationship strength calculations, and cross-process linkages.
Overview
Real enterprise data exhibits complex interconnections between entities:
- Vendors form multi-tier supply chains (supplier-of-supplier)
- Customers segment by value (Enterprise vs. SMB) with different behaviors
- Relationships vary in strength based on transaction history
- Business processes connect (P2P and O2C link through inventory)
SyntheticData models all of these patterns to produce realistic, interconnected data.
Multi-Tier Vendor Networks
Supply Chain Tiers
Vendors are organized into a supply chain hierarchy:
| Tier | Description | Visibility | Typical Count |
|---|---|---|---|
| Tier 1 | Direct suppliers | Full financial visibility | 50-100 per company |
| Tier 2 | Supplier’s suppliers | Partial visibility | 4-10 per Tier 1 |
| Tier 3 | Deep supply chain | Minimal visibility | 2-5 per Tier 2 |
Vendor Clusters
Vendors are classified into behavioral clusters:
| Cluster | Share | Characteristics |
|---|---|---|
| ReliableStrategic | 20% | High delivery scores, low invoice errors, consistent quality |
| StandardOperational | 50% | Average performance, predictable patterns |
| Transactional | 25% | One-off or occasional purchases |
| Problematic | 5% | Quality issues, late deliveries, invoice discrepancies |
Vendor Lifecycle Stages
Onboarding → RampUp → SteadyState → Decline → Terminated
Each stage has associated behaviors:
- Onboarding: Initial qualification, small orders
- RampUp: Increasing order volumes
- SteadyState: Stable, predictable patterns
- Decline: Reduced orders, performance issues
- Terminated: Relationship ended
Vendor Quality Scores
| Metric | Range | Description |
|---|---|---|
delivery_on_time | 0.0-1.0 | Percentage of on-time deliveries |
quality_pass_rate | 0.0-1.0 | Quality inspection pass rate |
invoice_accuracy | 0.0-1.0 | Invoice matching accuracy |
responsiveness_score | 0.0-1.0 | Communication responsiveness |
Vendor Concentration Analysis
SyntheticData tracks vendor concentration risks:
dependencies:
max_single_vendor_concentration: 0.15 # No vendor > 15% of spend
top_5_concentration: 0.45 # Top 5 vendors < 45% of spend
single_source_percent: 0.05 # 5% of materials single-sourced
Customer Value Segmentation
Value Segments
Customers follow a Pareto-like distribution:
| Segment | Revenue Share | Customer Share | Typical Order Value |
|---|---|---|---|
| Enterprise | 40% | 5% | $50,000+ |
| MidMarket | 35% | 20% | $5,000-$50,000 |
| SMB | 20% | 50% | $500-$5,000 |
| Consumer | 5% | 25% | $50-$500 |
Customer Lifecycle
Prospect → New → Growth → Mature → AtRisk → Churned
↓
WonBack
Each stage has associated behaviors:
- Prospect: Potential customer, conversion probability
- New: First purchase within 90 days
- Growth: Increasing order frequency/value
- Mature: Stable, loyal customer
- AtRisk: Declining activity, churn signals
- Churned: No activity for extended period
- WonBack: Previously churned, now returned
Customer Engagement Metrics
| Metric | Description |
|---|---|
order_frequency | Average orders per period |
recency_days | Days since last order |
nps_score | Net Promoter Score (-100 to +100) |
engagement_score | Composite engagement metric (0.0-1.0) |
Customer Networks
- Referral Networks: Customers refer other customers (configurable rate)
- Corporate Hierarchies: Parent/child company relationships
- Industry Clusters: Customers grouped by industry vertical
Relationship Strength Modeling
Composite Strength Calculation
Relationship strength is computed from multiple factors:
| Component | Weight | Scale | Description |
|---|---|---|---|
| Transaction Volume | 30% | Logarithmic | Total monetary value |
| Transaction Count | 25% | Square root | Number of transactions |
| Duration | 20% | Linear | Relationship age in days |
| Recency | 15% | Exponential decay | Days since last transaction |
| Mutual Connections | 10% | Jaccard index | Shared network connections |
Strength Categories
| Strength | Threshold | Description |
|---|---|---|
| Strong | ≥ 0.7 | Core business relationship |
| Moderate | ≥ 0.4 | Regular, established relationship |
| Weak | ≥ 0.1 | Occasional relationship |
| Dormant | < 0.1 | Inactive relationship |
Recency Decay
The recency component uses exponential decay:
recency_score = exp(-days_since_last / half_life)
Default half-life is 90 days.
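The composite score can be reproduced from the documented weights, scales, and half-life. In the sketch below the component weights and the recency decay follow the tables above, while the per-component normalisation caps are illustrative assumptions:
import math

def relationship_strength(volume: float, count: int, duration_days: int,
                          days_since_last: int, jaccard: float,
                          half_life: float = 90.0) -> float:
    # Normalisation caps below are assumptions; weights follow the documented 30/25/20/15/10 split
    volume_score = min(math.log10(1 + volume) / 7.0, 1.0)    # logarithmic scale
    count_score = min(math.sqrt(count) / 30.0, 1.0)          # square-root scale
    duration_score = min(duration_days / 1825.0, 1.0)        # linear scale, ~5 years
    recency_score = math.exp(-days_since_last / half_life)   # documented exponential decay
    return (0.30 * volume_score + 0.25 * count_score + 0.20 * duration_score
            + 0.15 * recency_score + 0.10 * jaccard)

s = relationship_strength(volume=2_500_000, count=180, duration_days=900,
                          days_since_last=20, jaccard=0.3)
print(f"{s:.2f}")  # ~0.63 -> Moderate (>= 0.4) under the documented thresholds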
Cross-Process Linkages
Inventory Links (P2P ↔ O2C)
Inventory naturally connects Procure-to-Pay and Order-to-Cash:
P2P: Purchase Order → Goods Receipt → Vendor Invoice → Payment
↓
[Inventory]
↓
O2C: Sales Order → Delivery → Customer Invoice → Receipt
When enabled, SyntheticData generates explicit CrossProcessLink records connecting:
GoodsReceipt (P2P) to Delivery (O2C) via inventory item
Payment-Bank Reconciliation
Links payment transactions to bank statement entries for reconciliation.
Intercompany Bilateral Links
Ensures intercompany transactions are properly linked between sending and receiving entities.
Entity Graph
Graph Structure
The EntityGraph provides a unified view of all entity relationships:
| Component | Description |
|---|---|
| Nodes | Entities with type, ID, and metadata |
| Edges | Relationships with type and strength |
| Indexes | Fast lookups by entity type and ID |
Entity Types (16 types)
Company, Vendor, Customer, Employee, Department, CostCenter,
Project, Contract, Asset, BankAccount, Material, Product,
Location, Currency, Account, Entity
Relationship Types (26 types)
// Transactional
BuysFrom, SellsTo, PaysTo, ReceivesFrom, SuppliesTo, OrdersFrom
// Organizational
ReportsTo, Manages, BelongsTo, OwnedBy, PartOf, Contains
// Network
ReferredBy, PartnersWith, AffiliateOf, SubsidiaryOf
// Process
ApprovesFor, AuthorizesFor, ProcessesFor
// Financial
BillsTo, ShipsTo, CollectsFrom, RemitsTo
// Document
ReferencedBy, SupersededBy, AmendedBy, LinkedTo
Configuration
Complete Example
vendor_network:
enabled: true
depth: 3
tiers:
tier1:
count_min: 50
count_max: 100
tier2:
count_per_parent_min: 4
count_per_parent_max: 10
tier3:
count_per_parent_min: 2
count_per_parent_max: 5
clusters:
reliable_strategic: 0.20
standard_operational: 0.50
transactional: 0.25
problematic: 0.05
dependencies:
max_single_vendor_concentration: 0.15
top_5_concentration: 0.45
single_source_percent: 0.05
customer_segmentation:
enabled: true
value_segments:
enterprise:
revenue_share: 0.40
customer_share: 0.05
avg_order_min: 50000.0
mid_market:
revenue_share: 0.35
customer_share: 0.20
avg_order_min: 5000.0
avg_order_max: 50000.0
smb:
revenue_share: 0.20
customer_share: 0.50
avg_order_min: 500.0
avg_order_max: 5000.0
consumer:
revenue_share: 0.05
customer_share: 0.25
avg_order_min: 50.0
avg_order_max: 500.0
lifecycle:
prospect_rate: 0.10
new_rate: 0.15
growth_rate: 0.20
mature_rate: 0.35
at_risk_rate: 0.10
churned_rate: 0.08
won_back_rate: 0.02
networks:
referrals:
enabled: true
referral_rate: 0.15
corporate_hierarchies:
enabled: true
hierarchy_probability: 0.30
relationship_strength:
enabled: true
calculation:
transaction_volume_weight: 0.30
transaction_count_weight: 0.25
relationship_duration_weight: 0.20
recency_weight: 0.15
mutual_connections_weight: 0.10
recency_half_life_days: 90
thresholds:
strong: 0.7
moderate: 0.4
weak: 0.1
cross_process_links:
enabled: true
inventory_p2p_o2c: true
payment_bank_reconciliation: true
intercompany_bilateral: true
Network Evaluation
SyntheticData includes network metrics evaluation:
| Metric | Description | Typical Range |
|---|---|---|
| Connectivity | Largest connected component ratio | > 0.95 |
| Power Law Alpha | Degree distribution exponent | 2.0-3.0 |
| Clustering Coefficient | Local clustering | 0.10-0.50 |
| Top-1 Concentration | Largest node share | < 0.15 |
| Top-5 Concentration | Top 5 nodes share | < 0.45 |
| HHI | Herfindahl-Hirschman Index | < 0.25 |
These metrics validate that generated networks exhibit realistic properties.
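A quick way to sanity-check the concentration metrics on your own output is to compute top-k shares and the HHI from per-vendor spend totals. A minimal sketch, assuming a plain list of spend values:
def concentration_metrics(spend: list[float]) -> dict[str, float]:
    # Top-1 / top-5 concentration and Herfindahl-Hirschman Index from spend shares
    total = sum(spend)
    shares = sorted((s / total for s in spend), reverse=True)
    return {
        'top_1': shares[0],
        'top_5': sum(shares[:5]),
        'hhi': sum(s * s for s in shares),
    }

metrics = concentration_metrics([80_000, 70_000, 60_000, 50_000, 40_000] + [10_000] * 40)
print(metrics)  # top_1 < 0.15, top_5 < 0.45, hhi < 0.25 for this healthy example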
API Usage
Rust API
#![allow(unused)]
fn main() {
use datasynth_core::models::{
VendorNetwork, VendorCluster, SupplyChainTier,
SegmentedCustomerPool, CustomerValueSegment,
EntityGraph, RelationshipStrengthCalculator,
};
use datasynth_generators::relationships::EntityGraphGenerator;
// Generate vendor network
let vendor_generator = VendorGenerator::new(config);
let vendor_network = vendor_generator.generate_vendor_network("C001");
// Generate segmented customers
let customer_generator = CustomerGenerator::new(config);
let customer_pool = customer_generator.generate_segmented_pool("C001");
// Build entity graph with cross-process links
let graph_generator = EntityGraphGenerator::with_defaults();
let entity_graph = graph_generator.generate_entity_graph(
&vendor_network,
&customer_pool,
&transactions,
&document_flows,
);
}
Python API
from datasynth_py import DataSynth
from datasynth_py.config import Config, VendorNetworkConfig, CustomerSegmentationConfig
config = Config(
vendor_network=VendorNetworkConfig(
enabled=True,
depth=3,
clusters={"reliable_strategic": 0.20, "standard_operational": 0.50},
),
customer_segmentation=CustomerSegmentationConfig(
enabled=True,
value_segments={
"enterprise": {"revenue_share": 0.40, "customer_share": 0.05},
"mid_market": {"revenue_share": 0.35, "customer_share": 0.20},
},
),
)
result = DataSynth().generate(config=config, output={"format": "csv"})
See Also
- Graph Export - Exporting entity graphs to PyTorch Geometric, Neo4j, DGL
- Intercompany Processing - Multi-entity transaction matching
- Master Data Configuration - Vendor and customer settings
Period Close Engine
Generate period-end accounting processes.
Overview
The period close engine simulates:
- Monthly close (accruals, depreciation)
- Quarterly close (IC elimination, translation)
- Annual close (closing entries, retained earnings)
Configuration
period_close:
enabled: true
monthly:
accruals: true
depreciation: true
reconciliation: true
quarterly:
intercompany_elimination: true
currency_translation: true
annual:
closing_entries: true
retained_earnings: true
Monthly Close
Accruals
Generate reversing accrual entries:
period_close:
monthly:
accruals:
enabled: true
auto_reverse: true # Reverse next period
categories:
expense_accrual: 0.4
revenue_accrual: 0.2
payroll_accrual: 0.3
other: 0.1
Expense Accrual:
Period 1 (accrue):
Dr Expense 10,000
Cr Accrued Liabilities 10,000
Period 2 (reverse):
Dr Accrued Liabilities 10,000
Cr Expense 10,000
Depreciation
Calculate and post monthly depreciation:
period_close:
monthly:
depreciation:
enabled: true
run_date: last_day # When in period
methods:
straight_line: 0.7
declining_balance: 0.2
units_of_production: 0.1
Depreciation Entry:
Dr Depreciation Expense 5,000
Cr Accumulated Depreciation 5,000
Subledger Reconciliation
Verify subledger-to-GL control accounts:
period_close:
monthly:
reconciliation:
enabled: true
checks:
- subledger: ar
control_account: "1100"
- subledger: ap
control_account: "2000"
- subledger: inventory
control_account: "1200"
Reconciliation Report:
| Subledger | Control Account | Subledger Balance | GL Balance | Difference |
|---|---|---|---|---|
| AR | 1100 | 500,000 | 500,000 | 0 |
| AP | 2000 | (300,000) | (300,000) | 0 |
Quarterly Close
IC Elimination
Generate consolidation eliminations:
period_close:
quarterly:
intercompany_elimination:
enabled: true
types:
- revenue_expense # Eliminate IC sales
- unrealized_profit # Eliminate IC inventory profit
- receivable_payable # Eliminate IC balances
- dividends # Eliminate IC dividends
See Intercompany Processing for details.
Currency Translation
Translate foreign subsidiary balances:
period_close:
quarterly:
currency_translation:
enabled: true
method: current_rate # current_rate, temporal
rate_mapping:
assets: closing_rate
liabilities: closing_rate
equity: historical_rate
revenue: average_rate
expense: average_rate
cta_account: "3500" # CTA equity account
Translation Entry (CTA):
If foreign currency strengthened:
Dr Foreign Subsidiary Investment 10,000
Cr CTA (Other Comprehensive) 10,000
Annual Close
Closing Entries
Close temporary accounts to retained earnings:
period_close:
annual:
closing_entries:
enabled: true
close_revenue: true
close_expense: true
income_summary_account: "3900"
Closing Sequence:
1. Close Revenue:
Dr Revenue accounts (all) 1,000,000
Cr Income Summary 1,000,000
2. Close Expenses:
Dr Income Summary 800,000
Cr Expense accounts (all) 800,000
3. Close Income Summary:
Dr Income Summary 200,000
Cr Retained Earnings 200,000
Retained Earnings
Update retained earnings:
period_close:
annual:
retained_earnings:
enabled: true
account: "3100"
dividend_account: "3150"
Year-End Adjustments
Additional adjusting entries:
period_close:
annual:
adjustments:
- type: bad_debt_provision
rate: 0.02 # 2% of AR
- type: inventory_reserve
rate: 0.01 # 1% of inventory
- type: bonus_accrual
rate: 0.10 # 10% of salary expense
Financial Statements (v0.6.0)
The period close engine can now generate full financial statement sets from the adjusted trial balance. This is controlled by the financial_reporting configuration section.
Balance Sheet
Generates a statement of financial position with current/non-current asset and liability classifications:
Assets Liabilities & Equity
├── Current Assets ├── Current Liabilities
│ ├── Cash & Equivalents │ ├── Accounts Payable
│ ├── Accounts Receivable │ ├── Accrued Liabilities
│ └── Inventory │ └── Current Debt
├── Non-Current Assets ├── Non-Current Liabilities
│ ├── Fixed Assets (net) │ └── Long-Term Debt
│ └── Intangibles └── Equity
Total Assets = Total L + E ├── Common Stock
└── Retained Earnings
Income Statement
Generates a multi-step income statement:
Revenue
- Cost of Goods Sold
= Gross Profit
- Operating Expenses
= Operating Income
+/- Other Income/Expense
= Income Before Tax
- Income Tax
= Net Income
Cash Flow Statement
Generates an indirect-method cash flow statement with three categories:
financial_reporting:
generate_cash_flow: true
Categories:
- Operating: Net income + non-cash adjustments + working capital changes
- Investing: Capital expenditures, asset disposals
- Financing: Debt proceeds/repayments, equity transactions, dividends
Statement of Changes in Equity
Tracks equity movements across the period:
- Opening retained earnings
- Net income for the period
- Dividends declared
- Other comprehensive income (CTA, unrealized gains)
- Closing retained earnings
Management KPIs
When financial_reporting.management_kpis is enabled, the close engine computes financial ratios:
- Liquidity: Current ratio, quick ratio, cash ratio
- Profitability: Gross margin, operating margin, net margin, ROA, ROE
- Efficiency: Inventory turnover, AR turnover, AP turnover, days sales outstanding
- Leverage: Debt-to-equity, debt-to-assets, interest coverage
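For orientation, a few of these ratios can be recomputed with their textbook definitions from the generated statements. The function and input values below are illustrative assumptions, not the engine's implementation:
def liquidity_and_profitability(current_assets: float, inventory: float,
                                current_liabilities: float,
                                revenue: float, cogs: float,
                                net_income: float, total_assets: float) -> dict:
    # Textbook definitions for a subset of the documented KPIs
    return {
        'current_ratio': current_assets / current_liabilities,
        'quick_ratio': (current_assets - inventory) / current_liabilities,
        'gross_margin': (revenue - cogs) / revenue,
        'net_margin': net_income / revenue,
        'roa': net_income / total_assets,
    }

print(liquidity_and_profitability(
    current_assets=500_000, inventory=150_000, current_liabilities=250_000,
    revenue=1_000_000, cogs=800_000, net_income=200_000, total_assets=2_000_000))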
Budgets
When financial_reporting.budgets is enabled, the close engine generates budget records with variance analysis:
financial_reporting:
budgets:
enabled: true
variance_threshold: 0.10 # Flag variances > 10%
Produces budget vs. actual comparisons by account and period, with favorable/unfavorable variance flags.
Output Files
trial_balances/YYYY_MM.csv
| Field | Description |
|---|---|
account_number | GL account |
account_name | Account description |
opening_balance | Period start |
period_debits | Total debits |
period_credits | Total credits |
closing_balance | Period end |
accruals.csv
| Field | Description |
|---|---|
accrual_id | Unique ID |
accrual_type | Category |
period | Accrual period |
amount | Accrual amount |
reversal_period | When reversed |
entry_id | Related JE ID |
depreciation.csv
| Field | Description |
|---|---|
asset_id | Fixed asset |
period | Depreciation period |
method | Depreciation method |
depreciation_amount | Period expense |
accumulated_total | Running total |
net_book_value | Remaining value |
closing_entries.csv
| Field | Description |
|---|---|
entry_id | Closing entry ID |
entry_type | Revenue, expense, summary |
account | Account closed |
amount | Closing amount |
fiscal_year | Year closed |
financial_statements.csv (v0.6.0)
| Field | Description |
|---|---|
statement_id | Unique statement identifier |
statement_type | balance_sheet, income_statement, cash_flow, changes_in_equity |
company_code | Company code |
period_end | Statement date |
basis | us_gaap, ifrs, statutory |
line_code | Line item code |
label | Display label |
section | Statement section |
amount | Current period amount |
amount_prior | Prior period amount |
bank_reconciliations.csv (v0.6.0)
| Field | Description |
|---|---|
reconciliation_id | Unique reconciliation ID |
company_code | Company code |
bank_account | Bank account identifier |
period_start | Reconciliation period start |
period_end | Reconciliation period end |
opening_balance | Opening bank balance |
closing_balance | Closing bank balance |
status | in_progress, completed, completed_with_exceptions |
management_kpis.csv (v0.6.0)
| Field | Description |
|---|---|
company_code | Company code |
period | Reporting period |
kpi_name | Ratio name (e.g., current_ratio, gross_margin) |
kpi_value | Computed ratio value |
category | liquidity, profitability, efficiency, leverage |
Close Schedule
Month 1-11:
├── Accruals
├── Depreciation
└── Reconciliation
Month 3, 6, 9:
├── IC Elimination
└── Currency Translation
Month 12:
├── All monthly tasks
├── All quarterly tasks
├── Year-end adjustments
└── Closing entries
Example Configuration
Full Close Cycle
global:
start_date: 2024-01-01
period_months: 12
period_close:
enabled: true
monthly:
accruals:
enabled: true
auto_reverse: true
depreciation:
enabled: true
reconciliation:
enabled: true
quarterly:
intercompany_elimination:
enabled: true
currency_translation:
enabled: true
annual:
closing_entries:
enabled: true
retained_earnings:
enabled: true
adjustments:
- type: bad_debt_provision
rate: 0.02
See Also
Fingerprinting
Privacy-preserving fingerprint extraction enables generating synthetic data that matches the statistical properties of real data without exposing sensitive information.
Overview
Fingerprinting is a three-stage process:
- Extract: Analyze real data and capture statistical properties into a .dsf fingerprint file
- Synthesize: Generate a synthetic data configuration from the fingerprint
- Evaluate: Validate that synthetic data matches the fingerprint’s statistical properties
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Real Data │────▶│ Extract │────▶│ .dsf File │────▶│ Evaluate │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
Privacy Engine Config Synthesizer Fidelity Report
Privacy Mechanisms
Differential Privacy
The extraction process applies differential privacy to protect individual records:
- Laplace Mechanism: Adds calibrated noise to numeric statistics
- Gaussian Mechanism: Alternative for (ε,δ)-differential privacy
- Epsilon Budget: Tracks privacy budget across all operations
Privacy Guarantee: For any two datasets D and D' differing in one record,
the probability ratio of any output is bounded by e^ε
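Conceptually, the Laplace mechanism adds noise with scale sensitivity/ε to each released statistic. A minimal sketch of that idea; the privacy engine's actual noise calibration and sensitivity bounds may differ:
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float,
                      rng: np.random.Generator) -> float:
    # Release a noisy statistic satisfying epsilon-differential privacy
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
# e.g. a private mean of amounts clipped to [0, 100_000] over 10_000 rows:
# sensitivity of the mean is 100_000 / 10_000 = 10
noisy_mean = laplace_mechanism(true_value=4_217.35, sensitivity=10.0, epsilon=1.0, rng=rng)
print(noisy_mean)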
K-Anonymity
Categorical values are protected through suppression:
- Values appearing fewer than k times are replaced with <suppressed>
- Prevents identification of rare categories
- Configurable threshold per privacy level
Winsorization
Numeric outliers are clipped to prevent identification:
- Values beyond the configured percentile are capped
- Prevents extreme values from leaking individual information
- Outlier percentile varies by privacy level (85%-99%)
Privacy Levels
| Level | Epsilon | k | Outlier % | Description |
|---|---|---|---|---|
| Minimal | 5.0 | 3 | 99% | Highest utility, lower privacy |
| Standard | 1.0 | 5 | 95% | Balanced (recommended default) |
| High | 0.5 | 10 | 90% | Higher privacy for sensitive data |
| Maximum | 0.1 | 20 | 85% | Maximum privacy, some utility loss |
Choosing a Privacy Level
- Minimal: Internal testing, non-sensitive data
- Standard: General use, moderate sensitivity
- High: Personal financial data, healthcare
- Maximum: Highly sensitive data, regulatory compliance
DSF File Format
The DataSynth Fingerprint (.dsf) file is a ZIP archive containing:
fingerprint.dsf
├── manifest.json # Metadata, checksums, privacy config
├── schema.yaml # Table/column structure, relationships
├── statistics.yaml # Distributions, percentiles, Benford
├── correlations.yaml # Correlation matrices, copula params
├── integrity.yaml # Foreign keys, cardinality rules
├── rules.yaml # Balance constraints, thresholds
├── anomalies.yaml # Anomaly rates, patterns
└── privacy_audit.json # All privacy decisions logged
Manifest Structure
{
"version": "1.0",
"format": "dsf",
"created_at": "2026-01-23T10:30:00Z",
"source": {
"row_count": 100000,
"column_count": 25,
"tables": ["journal_entries", "vendors"]
},
"privacy": {
"level": "standard",
"epsilon": 1.0,
"k": 5
},
"checksums": {
"schema": "sha256:...",
"statistics": "sha256:...",
"correlations": "sha256:..."
}
}
Extraction Process
Step 1: Schema Extraction
Analyzes data structure:
- Infers column data types (numeric, categorical, date, text)
- Computes cardinalities
- Detects foreign key relationships
- Identifies primary keys
Step 2: Statistical Extraction
Computes distributions with privacy:
- Numeric columns: Mean, std, min, max, percentiles (with DP noise)
- Categorical columns: Frequencies (with k-anonymity)
- Temporal columns: Date ranges, seasonality patterns
- Benford analysis: First-digit distribution compliance
Step 3: Correlation Extraction
Captures multivariate relationships:
- Pearson correlation matrices (with DP)
- Copula parameters for joint distributions
- Cross-table relationship strengths
Step 4: Rules Extraction
Detects business rules:
- Balance equations (debits = credits)
- Approval thresholds
- Validation constraints
Step 5: Anomaly Pattern Extraction
Captures anomaly characteristics:
- Overall anomaly rate
- Type distribution
- Temporal patterns
Synthesis Process
Configuration Generation
The ConfigSynthesizer converts fingerprints to generation configuration:
#![allow(unused)]
fn main() {
// From fingerprint statistics, generate:
AmountSampler {
distribution: LogNormal,
mean: fp.statistics.amount.mean,
std: fp.statistics.amount.std,
round_number_bias: 0.15,
}
}
Copula-Based Generation
For correlated columns, the GaussianCopula preserves relationships:
- Generate independent uniform samples
- Apply correlation structure
- Transform to target marginal distributions
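A minimal numpy/scipy sketch of the copula idea above, illustrative only and not the generator's implementation; the correlation value and marginal parameters are assumptions:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Correlation structure to preserve (e.g. amount vs. line count)
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])

# Sample correlated standard normals and map them to uniforms
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=corr, size=10_000)
u = stats.norm.cdf(z)

# Transform each uniform margin to its target marginal distribution
amounts = stats.lognorm.ppf(u[:, 0], s=1.2, scale=np.exp(7.5))  # log-normal amounts
line_counts = stats.poisson.ppf(u[:, 1], mu=4) + 2              # at least 2 lines per entry

print(np.corrcoef(np.log(amounts), line_counts)[0, 1])  # close to the target 0.6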
Fidelity Evaluation
Metrics
| Metric | Description | Target |
|---|---|---|
| KS Statistic | Max CDF difference | < 0.1 |
| Wasserstein Distance | Earth mover’s distance | < 0.1 |
| Benford MAD | Mean absolute deviation from Benford | < 0.015 |
| Correlation RMSE | Correlation matrix difference | < 0.1 |
| Schema Match | Column type agreement | > 0.95 |
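Two of these metrics are easy to recompute yourself when spot-checking a dataset. A minimal sketch with scipy, assuming two arrays of transaction amounts:
import numpy as np
from scipy import stats

def ks_statistic(real: np.ndarray, synthetic: np.ndarray) -> float:
    # Maximum CDF difference between the real and synthetic samples
    return stats.ks_2samp(real, synthetic).statistic

def benford_mad(amounts: np.ndarray) -> float:
    # Mean absolute deviation of first-digit frequencies from Benford's Law
    first_digits = np.array([int(f"{a:e}"[0]) for a in np.abs(amounts) if a != 0])
    observed = np.bincount(first_digits, minlength=10)[1:10] / len(first_digits)
    expected = np.log10(1 + 1 / np.arange(1, 10))
    return float(np.mean(np.abs(observed - expected)))

rng = np.random.default_rng(2)
real = rng.lognormal(8, 1.5, 50_000)
print(ks_statistic(real, rng.lognormal(8, 1.5, 50_000)))  # small for matching distributions
print(benford_mad(real))  # log-normal amounts are close to Benford-conformant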
Fidelity Report
Fidelity Evaluation Report
==========================
Overall Score: 0.87
Status: PASSED (threshold: 0.80)
Component Scores:
Statistical: 0.89
Correlation: 0.85
Schema: 0.98
Rules: 0.76
Details:
- KS statistic (amount): 0.05
- Benford MAD: 0.008
- Correlation RMSE: 0.07
CLI Usage
Basic Workflow
# 1. Extract fingerprint from real data
datasynth-data fingerprint extract \
--input ./real_data.csv \
--output ./fingerprint.dsf \
--privacy-level standard
# 2. Validate fingerprint integrity
datasynth-data fingerprint validate ./fingerprint.dsf
# 3. View fingerprint details
datasynth-data fingerprint info ./fingerprint.dsf --detailed
# 4. Generate synthetic data (using derived config)
datasynth-data generate \
--config ./derived_config.yaml \
--output ./synthetic_data
# 5. Evaluate fidelity
datasynth-data fingerprint evaluate \
--fingerprint ./fingerprint.dsf \
--synthetic ./synthetic_data \
--threshold 0.85 \
--report ./report.html
Comparing Fingerprints
# Compare two versions
datasynth-data fingerprint diff ./fp_v1.dsf ./fp_v2.dsf
Custom Privacy Parameters
# Override privacy level with custom values
datasynth-data fingerprint extract \
--input ./sensitive_data.csv \
--output ./fingerprint.dsf \
--epsilon 0.3 \
--k 15
Best Practices
Data Preparation
- Clean data first: Remove obvious errors before extraction
- Consistent formats: Standardize date and number formats
- Document exclusions: Note any columns excluded from extraction
Privacy Selection
- Start with standard: Adjust based on fidelity evaluation
- Consider sensitivity: Use higher privacy for personal data
- Review audit log: Check privacy decisions in privacy_audit.json
Fidelity Optimization
- Check component scores: Identify weak areas
- Adjust generation config: Tune parameters for low-scoring metrics
- Iterate: Re-evaluate after adjustments
Compliance
- Preserve audit trail: Keep .dsf files for compliance review
- Document privacy choices: Record rationale for privacy level
- Version fingerprints: Track changes over time
Troubleshooting
Low Fidelity Score
Cause: Statistical differences between synthetic and fingerprint
Solutions:
- Review component scores to identify specific issues
- Adjust generation configuration parameters
- Consider using auto-tuning recommendations
Fingerprint Validation Errors
Cause: Corrupted or modified DSF file
Solutions:
- Re-extract from source data
- Check file transfer integrity
- Verify checksums match manifest
Privacy Budget Exceeded
Cause: Too many queries on sensitive data
Solutions:
- Reduce number of extracted statistics
- Use higher epsilon (lower privacy)
- Aggregate fine-grained statistics
Accounting & Audit Standards
SyntheticData includes comprehensive support for major accounting and auditing standards frameworks, enabling the generation of standards-compliant synthetic financial data suitable for audit analytics, compliance testing, and ML model training.
Overview
The datasynth-standards crate provides domain models and generation logic for:
| Category | Standards |
|---|---|
| Accounting | US GAAP (ASC), IFRS |
| Auditing | ISA (International Standards on Auditing), PCAOB |
| Regulatory | SOX (Sarbanes-Oxley Act) |
Accounting Framework Selection
Framework Options
accounting_standards:
enabled: true
framework: us_gaap # Options: us_gaap, ifrs, dual_reporting
| Framework | Description |
|---|---|
us_gaap | United States Generally Accepted Accounting Principles |
ifrs | International Financial Reporting Standards |
dual_reporting | Generate data for both frameworks with reconciliation |
Key Framework Differences
The generator automatically handles framework-specific rules:
| Area | US GAAP | IFRS |
|---|---|---|
| Inventory costing | LIFO permitted | LIFO prohibited |
| Development costs | Generally expensed | Capitalized when criteria met |
| PPE revaluation | Cost model only | Revaluation model permitted |
| Impairment reversal | Not permitted | Permitted (except goodwill) |
| Lease classification | Bright-line tests (75%/90%) | Principles-based |
Revenue Recognition (ASC 606 / IFRS 15)
Generate realistic customer contracts with performance obligations:
accounting_standards:
revenue_recognition:
enabled: true
generate_contracts: true
avg_obligations_per_contract: 2.0
variable_consideration_rate: 0.15
over_time_recognition_rate: 0.30
contract_count: 100
Generated Entities
- Customer Contracts: Transaction price, status, framework
- Performance Obligations: Goods, services, licenses with satisfaction patterns
- Variable Consideration: Discounts, rebates, incentives with constraint application
- Revenue Recognition Schedule: Period-by-period recognition
5-Step Model Compliance
The generator follows the 5-step revenue recognition model:
- Identify the contract
- Identify performance obligations
- Determine transaction price
- Allocate transaction price to obligations
- Recognize revenue when/as obligations are satisfied
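For step 4, the relative standalone selling price method can be illustrated with a short sketch. The struct and the contract figures below are hypothetical and do not mirror the generated CSV schema.
#![allow(unused)]
fn main() {
    // Step 4 sketch: allocate a contract's transaction price to performance
    // obligations in proportion to their standalone selling prices (relative SSP method).
    struct Obligation { name: &'static str, standalone_price: f64 }
    let obligations = vec![
        Obligation { name: "hardware", standalone_price: 80_000.0 },
        Obligation { name: "installation", standalone_price: 10_000.0 },
        Obligation { name: "support_12m", standalone_price: 30_000.0 },
    ];
    let transaction_price = 100_000.0; // contract price after variable consideration
    let total_ssp: f64 = obligations.iter().map(|o| o.standalone_price).sum();
    for o in &obligations {
        let allocated = transaction_price * o.standalone_price / total_ssp;
        println!("{:<14} allocated {:>10.2}", o.name, allocated);
    }
}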
Lease Accounting (ASC 842 / IFRS 16)
Generate lease portfolios with ROU assets and lease liabilities:
accounting_standards:
leases:
enabled: true
lease_count: 50
finance_lease_percent: 0.30
avg_lease_term_months: 60
generate_amortization: true
real_estate_percent: 0.40
Generated Entities
- Leases: Classification, commencement date, term, payments, discount rate
- ROU Assets: Initial measurement, accumulated depreciation, carrying amount
- Lease Liabilities: Current/non-current portions
- Amortization Schedules: Period-by-period interest and principal
Classification Logic
- US GAAP: Bright-line tests (75% term, 90% PV)
- IFRS: All leases (except short-term/low-value) recognized on balance sheet
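A minimal sketch of the US GAAP bright-line tests listed above, using a standard annuity present-value formula. The figures and the helper code are illustrative, not the crate's classification logic.
#![allow(unused)]
fn main() {
    // Finance lease if term >= 75% of useful life, or if the present value of
    // payments >= 90% of fair value (US GAAP bright-line tests).
    let lease_term_months = 60.0;
    let asset_useful_life_months = 96.0;
    let monthly_payment = 2_000.0;
    let annual_discount_rate = 0.06;
    let asset_fair_value = 110_000.0;

    let r = annual_discount_rate / 12.0;
    let pv_payments = monthly_payment * (1.0 - (1.0 + r).powf(-lease_term_months)) / r;

    let term_test = lease_term_months / asset_useful_life_months >= 0.75;
    let pv_test = pv_payments / asset_fair_value >= 0.90;
    let classification = if term_test || pv_test { "Finance" } else { "Operating" };
    println!("PV of payments = {pv_payments:.2}, classification = {classification}");
}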
Fair Value Measurement (ASC 820 / IFRS 13)
Generate fair value measurements across hierarchy levels:
accounting_standards:
fair_value:
enabled: true
measurement_count: 30
level1_percent: 0.60 # Quoted prices
level2_percent: 0.30 # Observable inputs
level3_percent: 0.10 # Unobservable inputs
include_sensitivity_analysis: true
Fair Value Hierarchy
| Level | Description | Examples |
|---|---|---|
| Level 1 | Quoted prices in active markets | Listed stocks, exchange-traded funds |
| Level 2 | Observable inputs | Corporate bonds, interest rate swaps |
| Level 3 | Unobservable inputs | Private equity, complex derivatives |
Impairment Testing (ASC 360 / IAS 36)
Generate impairment tests with framework-specific methodology:
accounting_standards:
impairment:
enabled: true
test_count: 15
impairment_rate: 0.20
generate_projections: true
include_goodwill: true
Framework Differences
- US GAAP: Two-step test (recoverability then measurement)
- IFRS: One-step test comparing to recoverable amount
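A simplified sketch of the two methodologies; the amounts are made up and the logic is a didactic approximation, not the generator's internals.
#![allow(unused)]
fn main() {
    let carrying_amount = 500_000.0;
    let undiscounted_cash_flows = 460_000.0; // US GAAP step 1 (recoverability)
    let fair_value = 430_000.0;              // US GAAP step 2 measurement basis
    let value_in_use = 445_000.0;            // IFRS: discounted cash flows

    // US GAAP (ASC 360): impair only if the carrying amount is not recoverable,
    // then write down to fair value.
    let us_gaap_loss = if carrying_amount > undiscounted_cash_flows {
        (carrying_amount - fair_value).max(0.0)
    } else { 0.0 };

    // IFRS (IAS 36): one step against the recoverable amount, i.e. the higher of
    // fair value less costs of disposal and value in use.
    let recoverable_amount = fair_value.max(value_in_use);
    let ifrs_loss = (carrying_amount - recoverable_amount).max(0.0);

    println!("US GAAP loss = {us_gaap_loss:.0}, IFRS loss = {ifrs_loss:.0}");
}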
ISA Compliance (Audit Standards)
Generate audit procedures mapped to ISA requirements:
audit_standards:
isa_compliance:
enabled: true
compliance_level: comprehensive # basic, standard, comprehensive
generate_isa_mappings: true
generate_coverage_summary: true
include_pcaob: true
framework: dual # isa, pcaob, dual
Supported ISA Standards
The crate includes 34 ISA standards from ISA 200 through ISA 720:
| Series | Focus Area |
|---|---|
| ISA 200-265 | General principles and responsibilities |
| ISA 300-450 | Risk assessment and response |
| ISA 500-580 | Audit evidence |
| ISA 600-620 | Using work of others |
| ISA 700-720 | Conclusions and reporting |
Analytical Procedures (ISA 520)
Generate analytical procedures with variance investigation:
audit_standards:
analytical_procedures:
enabled: true
procedures_per_account: 3
variance_probability: 0.20
generate_investigations: true
include_ratio_analysis: true
Procedure Types
- Trend analysis: Year-over-year comparisons
- Ratio analysis: Key financial ratios
- Reasonableness tests: Expected vs. actual comparisons
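A reasonableness test can be sketched as a simple expected-versus-actual comparison; the 10% investigation threshold and the balances below are illustrative assumptions, not the generator's defaults.
#![allow(unused)]
fn main() {
    // Flag a variance for investigation when the actual balance deviates from
    // the expectation by more than a threshold.
    let expected_balance = 1_200_000.0; // e.g., prior year adjusted for growth
    let actual_balance = 1_410_000.0;
    let threshold_pct = 0.10;
    let variance = actual_balance - expected_balance;
    let variance_pct = variance / expected_balance;
    if variance_pct.abs() > threshold_pct {
        println!("Investigate: variance of {:.0} ({:.1}% vs {:.0}% threshold)",
            variance, variance_pct * 100.0, threshold_pct * 100.0);
    } else {
        println!("Within expectation");
    }
}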
External Confirmations (ISA 505)
Generate confirmation procedures with response tracking:
audit_standards:
confirmations:
enabled: true
confirmation_count: 50
positive_response_rate: 0.85
exception_rate: 0.10
Confirmation Types
- Bank confirmations
- Accounts receivable confirmations
- Accounts payable confirmations
- Legal confirmations
Audit Opinion (ISA 700/705/706/701)
Generate audit opinions with key audit matters:
audit_standards:
opinion:
enabled: true
generate_kam: true
average_kam_count: 3
Opinion Types
- Unmodified
- Qualified
- Adverse
- Disclaimer
SOX Compliance
Generate SOX 302/404 compliance documentation:
audit_standards:
sox:
enabled: true
generate_302_certifications: true
generate_404_assessments: true
materiality_threshold: 10000.0
Section 302 Certifications
- CEO and CFO certifications
- Disclosure controls effectiveness
- Material weakness identification
Section 404 Assessments
- ICFR effectiveness assessment
- Key control testing
- Deficiency classification matrix
Deficiency Classification
The DeficiencyMatrix classifies deficiencies based on:
| Likelihood | Magnitude | Classification |
|---|---|---|
| Probable | Material | Material Weakness |
| Reasonably Possible | More Than Inconsequential | Significant Deficiency |
| Remote | Inconsequential | Control Deficiency |
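A plain-Rust sketch of this decision table; the enums and the classify helper are illustrative and do not mirror the DeficiencyMatrix API. Severity escalates as likelihood and magnitude increase.
#![allow(unused)]
fn main() {
    #[derive(PartialEq, PartialOrd)]
    enum Likelihood { Remote, ReasonablyPossible, Probable }
    #[derive(PartialEq, PartialOrd)]
    enum Magnitude { Inconsequential, MoreThanInconsequential, Material }

    fn classify(l: Likelihood, m: Magnitude) -> &'static str {
        if l >= Likelihood::Probable && m >= Magnitude::Material {
            "Material Weakness"
        } else if l >= Likelihood::ReasonablyPossible
            && m >= Magnitude::MoreThanInconsequential
        {
            "Significant Deficiency"
        } else {
            "Control Deficiency"
        }
    }

    println!("{}", classify(Likelihood::Probable, Magnitude::Material));
    println!("{}", classify(Likelihood::ReasonablyPossible, Magnitude::MoreThanInconsequential));
}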
PCAOB Standards
Generate PCAOB-specific audit elements:
audit_standards:
pcaob:
enabled: true
generate_cam: true
integrated_audit: true
PCAOB-Specific Requirements
- Critical Audit Matters (CAMs) vs. Key Audit Matters (KAMs)
- Integrated audit (ICFR + financial statements)
- AS 2201 ICFR testing requirements
Evaluation and Validation
The datasynth-eval crate includes standards compliance evaluators:
#![allow(unused)]
fn main() {
use datasynth_eval::coherence::{
StandardsComplianceEvaluation,
RevenueRecognitionEvaluator,
LeaseAccountingEvaluator,
StandardsThresholds,
};
// Evaluate revenue recognition compliance
let eval = RevenueRecognitionEvaluator::evaluate(&contracts);
assert!(eval.po_allocation_compliance >= 0.95);
// Evaluate lease classification accuracy
let eval = LeaseAccountingEvaluator::evaluate(&leases, "us_gaap");
assert!(eval.classification_accuracy >= 0.90);
}
Compliance Thresholds
| Metric | Default Threshold |
|---|---|
| PO allocation compliance | 95% |
| Revenue timing compliance | 95% |
| Lease classification accuracy | 90% |
| ROU asset accuracy | 95% |
| Fair value hierarchy compliance | 95% |
| ISA coverage | 90% |
| SOX control coverage | 95% |
| Audit trail completeness | 90% |
Output Files
When standards generation is enabled, additional files are exported:
output/
└── standards/
├── accounting/
│ ├── customer_contracts.csv
│ ├── performance_obligations.csv
│ ├── variable_consideration.csv
│ ├── revenue_recognition_schedule.csv
│ ├── leases.csv
│ ├── rou_assets.csv
│ ├── lease_liabilities.csv
│ ├── lease_amortization.csv
│ ├── fair_value_measurements.csv
│ ├── impairment_tests.csv
│ └── framework_differences.csv
├── audit/
│ ├── isa_requirement_mappings.csv
│ ├── isa_coverage_summary.csv
│ ├── analytical_procedures.csv
│ ├── variance_investigations.csv
│ ├── confirmations.csv
│ ├── confirmation_responses.csv
│ ├── audit_opinions.csv
│ ├── key_audit_matters.csv
│ ├── audit_trails.json
│ └── pcaob_mappings.csv
└── regulatory/
├── sox_302_certifications.csv
├── sox_404_assessments.csv
├── deficiency_classifications.csv
└── material_weaknesses.csv
Use Cases
Audit Analytics Training
Generate labeled data for training audit analytics models with known standards compliance levels.
Compliance Testing
Test compliance monitoring systems with synthetic data covering all major accounting and auditing standards.
IFRS to US GAAP Reconciliation
Use dual reporting mode to generate reconciliation data for multi-framework analysis.
SOX Testing
Generate internal control data with known deficiencies for testing SOX monitoring systems.
See Also
- COSO Framework - Internal control framework
- Audit Simulation - Audit analytics use cases
- SOX Compliance - SOX testing use cases
Performance Tuning
Optimize SyntheticData for your hardware and requirements.
Performance Characteristics
| Metric | Typical Performance |
|---|---|
| Single-threaded | ~100,000 entries/second |
| Parallel (8 cores) | ~600,000 entries/second |
| Memory per 1M entries | ~500 MB |
Configuration Tuning
Worker Threads
global:
worker_threads: 8 # Match CPU cores
Guidelines:
- Default: Uses all available cores
- I/O bound: May benefit from more threads than physical cores
- Memory constrained: Reduce threads
Memory Limits
global:
memory_limit: 2147483648 # 2 GB
Guidelines:
- Set to ~75% of available RAM
- Leave room for OS and other processes
- Lower limit = more streaming, less memory
Batch Sizes
The orchestrator automatically tunes batch sizes, but you can influence behavior:
transactions:
target_count: 100000
# Implicit batch sizing based on:
# - Available memory
# - Number of threads
# - Target count
Hardware Recommendations
Minimum
| Resource | Specification |
|---|---|
| CPU | 2 cores |
| RAM | 4 GB |
| Storage | 10 GB |
Suitable for: <100K entries, development
Recommended
| Resource | Specification |
|---|---|
| CPU | 8 cores |
| RAM | 16 GB |
| Storage | 50 GB SSD |
Suitable for: 1M entries, production
High Performance
| Resource | Specification |
|---|---|
| CPU | 32+ cores |
| RAM | 64+ GB |
| Storage | NVMe SSD |
Suitable for: 10M+ entries, benchmarking
Optimizing Generation
Reduce Memory Pressure
Enable streaming output:
output:
format: csv
# Writing as generated reduces memory
Disable unnecessary features:
graph_export:
enabled: false # Skip if not needed
anomaly_injection:
enabled: false # Add in post-processing
Optimize for Speed
Maximize parallelism:
global:
worker_threads: 16 # More threads
Simplify output:
output:
format: csv # Faster than JSON
compression: none # Skip compression time
Reduce complexity:
chart_of_accounts:
complexity: small # Fewer accounts
document_flows:
p2p:
enabled: false # Skip if not needed
Optimize for Size
Enable compression:
output:
compression: zstd
compression_level: 9 # Maximum compression
Minimize output files:
output:
files:
journal_entries: true
acdoca: false
master_data: false # Only what you need
Benchmarking
Built-in Benchmarks
# Run all benchmarks
cargo bench
# Specific benchmark
cargo bench --bench generation_throughput
# With baseline comparison
cargo bench -- --baseline main
Benchmark Categories
| Benchmark | Measures |
|---|---|
generation_throughput | Entries/second |
distribution_sampling | Distribution speed |
output_sink | Write performance |
scalability | Parallel scaling |
correctness | Validation overhead |
Manual Benchmarking
# Time generation
time datasynth-data generate --config config.yaml --output ./output
# Profile memory
/usr/bin/time -v datasynth-data generate --config config.yaml --output ./output
Profiling
CPU Profiling
# With perf (Linux)
perf record datasynth-data generate --config config.yaml --output ./output
perf report
# With Instruments (macOS)
xcrun xctrace record --template "Time Profiler" \
--launch datasynth-data generate --config config.yaml --output ./output
Memory Profiling
# With heaptrack (Linux)
heaptrack datasynth-data generate --config config.yaml --output ./output
heaptrack_print heaptrack.*.gz
# With Instruments (macOS)
xcrun xctrace record --template "Allocations" \
--launch datasynth-data generate --config config.yaml --output ./output
Common Bottlenecks
I/O Bound
Symptoms:
- CPU utilization < 100%
- Disk utilization high
Solutions:
- Use faster storage (SSD/NVMe)
- Enable compression (reduces write volume)
- Increase buffer sizes
Memory Bound
Symptoms:
- OOM errors
- Excessive swapping
Solutions:
- Reduce `target_count`
- Lower `memory_limit`
- Enable streaming
- Reduce parallel threads
CPU Bound
Symptoms:
- CPU at 100%
- Generation time scales linearly
Solutions:
- Add more cores
- Simplify configuration
- Disable unnecessary features
Scaling Guidelines
Entries vs Time
| Entries | ~Time (8 cores) |
|---|---|
| 10,000 | <1 second |
| 100,000 | ~2 seconds |
| 1,000,000 | ~20 seconds |
| 10,000,000 | ~3 minutes |
Entries vs Memory
| Entries | Peak Memory |
|---|---|
| 10,000 | ~50 MB |
| 100,000 | ~200 MB |
| 1,000,000 | ~1.5 GB |
| 10,000,000 | ~12 GB |
Memory estimates assume full in-memory processing; streaming output reduces peak memory by roughly 80%.
Server Performance
Rate Limiting
cargo run -p datasynth-server -- \
--port 3000 \
--rate-limit 1000 # Requests per minute
Connection Pooling
For high-concurrency scenarios, configure worker threads:
cargo run -p datasynth-server -- \
--worker-threads 16 # Handle more connections
WebSocket Optimization
// Client-side: batch messages
const BATCH_SIZE = 100; // Request 100 entries at a time
LLM-Augmented Generation
New in v0.5.0
LLM-augmented generation uses Large Language Models to enrich synthetic data with realistic metadata — vendor names, transaction descriptions, memo fields, and anomaly explanations — that would be difficult to generate with rule-based approaches alone.
Overview
Traditional synthetic data generators produce structurally correct but often generic-sounding text fields. LLM augmentation addresses this by using language models to generate contextually appropriate text based on the financial domain, industry, and transaction context.
DataSynth provides a pluggable provider abstraction that supports:
| Provider | Description | Use Case |
|---|---|---|
| Mock | Deterministic, no network required | CI/CD, testing, reproducible builds |
| OpenAI | OpenAI-compatible APIs (GPT-4o-mini, etc.) | Production enrichment |
| Anthropic | Anthropic API (Claude models) | Production enrichment |
| Custom | Any OpenAI-compatible endpoint | Self-hosted models, Azure OpenAI |
Provider Abstraction
All LLM functionality is built around the LlmProvider trait:
#![allow(unused)]
fn main() {
pub trait LlmProvider: Send + Sync {
fn name(&self) -> &str;
fn complete(&self, request: &LlmRequest) -> Result<LlmResponse, SynthError>;
fn complete_batch(&self, requests: &[LlmRequest]) -> Result<Vec<LlmResponse>, SynthError>;
}
}
LlmRequest
#![allow(unused)]
fn main() {
let request = LlmRequest::new("Generate a vendor name for a German auto parts manufacturer")
.with_system("You are a business data generator. Return only the company name.")
.with_seed(42)
.with_max_tokens(50)
.with_temperature(0.7);
}
| Field | Type | Default | Description |
|---|---|---|---|
prompt | String | (required) | The generation prompt |
system | Option<String> | None | System message for context |
max_tokens | u32 | 100 | Maximum response tokens |
temperature | f64 | 0.7 | Sampling temperature |
seed | Option<u64> | None | Seed for deterministic output |
LlmResponse
#![allow(unused)]
fn main() {
pub struct LlmResponse {
pub content: String, // Generated text
pub usage: TokenUsage, // Input/output token counts
pub cached: bool, // Whether result came from cache
}
}
Mock Provider
The MockLlmProvider generates deterministic, contextually-aware text without any network calls. It is the default provider and is ideal for:
- CI/CD pipelines where network access is restricted
- Reproducible builds with deterministic output
- Development and testing
- Environments where API costs are a concern
#![allow(unused)]
fn main() {
use synth_core::llm::MockLlmProvider;
let provider = MockLlmProvider::new(42); // seeded for reproducibility
}
The mock provider uses the seed and prompt content to generate plausible-sounding business names and descriptions deterministically.
HTTP Provider
The HttpLlmProvider connects to real LLM APIs:
#![allow(unused)]
fn main() {
use synth_core::llm::{HttpLlmProvider, LlmConfig, LlmProviderType};
let config = LlmConfig {
provider: LlmProviderType::OpenAi,
model: "gpt-4o-mini".to_string(),
api_key_env: "OPENAI_API_KEY".to_string(),
base_url: None,
max_retries: 3,
timeout_secs: 30,
cache_enabled: true,
};
let provider = HttpLlmProvider::new(config)?;
}
Configuration
# In your generation config
llm:
provider: openai # mock, openai, anthropic, custom
model: "gpt-4o-mini"
api_key_env: "OPENAI_API_KEY"
base_url: null # Override for custom endpoints
max_retries: 3
timeout_secs: 30
cache_enabled: true
| Field | Type | Default | Description |
|---|---|---|---|
provider | string | mock | Provider type: mock, openai, anthropic, custom |
model | string | gpt-4o-mini | Model identifier |
api_key_env | string | — | Environment variable containing the API key |
base_url | string | null | Custom API base URL (required for custom provider) |
max_retries | integer | 3 | Maximum retry attempts on failure |
timeout_secs | integer | 30 | Request timeout in seconds |
cache_enabled | bool | true | Enable prompt-level caching |
Enrichment Types
Vendor Name Enrichment
Generates realistic vendor names based on industry, spend category, and country:
#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::VendorLlmEnricher;
let enricher = VendorLlmEnricher::new(provider.clone());
let name = enricher.enrich_vendor_name("manufacturing", "raw_materials", "DE")?;
// e.g., "Rheinische Stahlwerke GmbH"
// Batch enrichment for efficiency
let names = enricher.enrich_batch(&[
("manufacturing".into(), "raw_materials".into(), "DE".into()),
("retail".into(), "logistics".into(), "US".into()),
], 42)?;
}
Transaction Description Enrichment
Generates contextually appropriate journal entry descriptions:
#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::TransactionLlmEnricher;
let enricher = TransactionLlmEnricher::new(provider.clone());
let desc = enricher.enrich_description(
"Office Supplies", // account name
"1000-5000", // amount range
"retail", // industry
3, // fiscal period
)?;
let memo = enricher.enrich_memo(
"VendorInvoice", // document type
"Acme Corp", // vendor name
"2500.00", // amount
)?;
}
Anomaly Explanation
Generates natural language explanations for injected anomalies:
#![allow(unused)]
fn main() {
use synth_generators::llm_enrichment::AnomalyLlmExplainer;
let explainer = AnomalyLlmExplainer::new(provider.clone());
let explanation = explainer.explain(
"DuplicatePayment", // anomaly type
3, // affected records
"Same amount, same vendor, 2 days apart", // statistical details
)?;
}
Natural Language Configuration
The NlConfigGenerator converts natural language descriptions into YAML configuration:
#![allow(unused)]
fn main() {
use synth_core::llm::NlConfigGenerator;
let yaml = NlConfigGenerator::generate(
"Generate 1 year of retail data for a mid-size US company with fraud patterns",
&provider,
)?;
}
CLI Usage
datasynth-data init \
--from-description "Generate 1 year of manufacturing data for a German mid-cap with intercompany transactions" \
-o config.yaml
The generator parses intent into structured fields:
#![allow(unused)]
fn main() {
pub struct ConfigIntent {
pub industry: Option<String>, // e.g., "manufacturing"
pub country: Option<String>, // e.g., "DE"
pub company_size: Option<String>, // e.g., "mid-cap"
pub period_months: Option<u32>, // e.g., 12
pub features: Vec<String>, // e.g., ["intercompany"]
}
}
Caching
The LlmCache deduplicates identical prompts using FNV-1a hashing:
#![allow(unused)]
fn main() {
use synth_core::llm::LlmCache;
let cache = LlmCache::new(10000); // max 10,000 entries
let key = LlmCache::cache_key("prompt text", Some("system"), Some(42));
cache.insert(key, "cached response".into());
if let Some(response) = cache.get(key) {
// Use cached response
}
}
Caching is enabled by default and significantly reduces API costs when generating similar entities.
Cost and Privacy Considerations
Cost Management
- Use the Mock provider for development and CI/CD (zero cost)
- Enable caching to avoid duplicate API calls
- Use batch enrichment (`complete_batch`) to reduce per-request overhead
- Set appropriate `max_tokens` limits to control response sizes
- Consider `gpt-4o-mini` or similar efficient models for bulk enrichment
Privacy
- LLM prompts contain only synthetic context (industry, category, amount ranges) — never real data
- No PII or sensitive information is sent to LLM providers
- The Mock provider keeps everything local with no network traffic
- For maximum privacy, use self-hosted models via the `custom` provider type
See Also
- AI & ML Configuration
- LLM Training Data Use Case
- datasynth-core LLM Module
- datasynth-generators LLM Enrichment
Diffusion Models
New in v0.5.0
DataSynth integrates a statistical diffusion model backend for learned distribution capture, offering an alternative and complement to rule-based generation.
Overview
Diffusion models generate data through a learned denoising process: starting from pure noise and iteratively removing it to produce realistic samples. DataSynth’s implementation uses a statistical backend that captures column-level distributions and inter-column correlations from fingerprint data, then generates new samples through a configurable noise schedule.
Forward Process (Training): x₀ → x₁ → x₂ → ... → xₜ (pure noise)
Reverse Process (Generation): xₜ → xₜ₋₁ → ... → x₁ → x₀ (data)
Architecture
DiffusionBackend Trait
All diffusion backends implement a common interface:
#![allow(unused)]
fn main() {
pub trait DiffusionBackend: Send + Sync {
fn name(&self) -> &str;
fn forward(&self, x: &[Vec<f64>], t: usize) -> Vec<Vec<f64>>;
fn reverse(&self, x_t: &[Vec<f64>], t: usize) -> Vec<Vec<f64>>;
fn generate(&self, n_samples: usize, n_features: usize, seed: u64) -> Vec<Vec<f64>>;
}
}
Statistical Diffusion Backend
The StatisticalDiffusionBackend uses per-column means and standard deviations (extracted from fingerprint data) to guide the denoising process:
#![allow(unused)]
fn main() {
use synth_core::diffusion::{StatisticalDiffusionBackend, DiffusionConfig, NoiseScheduleType};
let config = DiffusionConfig {
n_steps: 1000,
schedule: NoiseScheduleType::Cosine,
seed: 42,
};
let backend = StatisticalDiffusionBackend::new(
vec![5000.0, 3.5, 2.0], // column means
vec![2000.0, 1.5, 0.8], // column standard deviations
config,
);
// Optionally add correlation structure
let backend = backend.with_correlations(vec![
vec![1.0, 0.65, 0.72],
vec![0.65, 1.0, 0.55],
vec![0.72, 0.55, 1.0],
]);
let samples = backend.generate(1000, 3, 42);
}
Noise Schedules
The noise schedule controls how noise is added during the forward process and removed during the reverse process.
| Schedule | Formula | Characteristics |
|---|---|---|
| Linear | β_t = β_min + t/T × (β_max - β_min) | Uniform noise addition; simple and robust |
| Cosine | β_t = 1 - ᾱ_t/ᾱ_{t-1}, ᾱ_t = cos²(π/2 × t/T) | Slower noise addition; better for preserving fine details |
| Sigmoid | β_t = sigmoid(a + (b-a) × t/T) | Smooth transition; balanced between linear and cosine |
#![allow(unused)]
fn main() {
use synth_core::diffusion::{NoiseSchedule, NoiseScheduleType};
let schedule = NoiseSchedule::new(&NoiseScheduleType::Cosine, 1000);
// Access schedule components
println!("Steps: {}", schedule.n_steps());
println!("First beta: {}", schedule.betas[0]);
println!("Last alpha_bar: {}", schedule.alpha_bars[999]);
}
Schedule Properties
The NoiseSchedule precomputes all values needed for efficient forward/reverse steps:
| Property | Description |
|---|---|
betas | Noise variance at each step |
alphas | 1 - beta at each step |
alpha_bars | Cumulative product of alphas |
sqrt_alpha_bars | √(ᾱ_t) for forward process |
sqrt_one_minus_alpha_bars | √(1 - ᾱ_t) for noise scaling |
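The sketch below shows how these precomputed values relate for a linear schedule and how the forward step uses them. The beta_min/beta_max values are illustrative assumptions, not library defaults.
#![allow(unused)]
fn main() {
    let (n_steps, beta_min, beta_max) = (1000usize, 1e-4_f64, 0.02);
    let betas: Vec<f64> = (0..n_steps)
        .map(|t| beta_min + (t as f64 / (n_steps - 1) as f64) * (beta_max - beta_min))
        .collect();
    let alphas: Vec<f64> = betas.iter().map(|b| 1.0 - b).collect();
    let alpha_bars: Vec<f64> = alphas.iter()
        .scan(1.0, |acc, a| { *acc *= *a; Some(*acc) }) // cumulative product
        .collect();
    // Forward step: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    let (t, x0, noise) = (500usize, 1.2_f64, -0.3_f64);
    let x_t = alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise;
    println!("alpha_bar[{t}] = {:.4}, x_t = {:.4}", alpha_bars[t], x_t);
}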
Hybrid Generation
The HybridGenerator blends rule-based and diffusion-generated data to combine the structural guarantees of rule-based generation with the distributional fidelity of diffusion models.
Blend Strategies
| Strategy | Description | Best For |
|---|---|---|
| Interpolate | Weighted average: w × diffusion + (1-w) × rule_based | Smooth blending of continuous values |
| Select | Per-record random selection from either source | Maintaining distinct record characteristics |
| Ensemble | Column-level: diffusion for amounts, rule-based for categoricals | Mixed-type data with different generation needs |
#![allow(unused)]
fn main() {
use synth_core::diffusion::{HybridGenerator, BlendStrategy};
let hybrid = HybridGenerator::new(0.3); // 30% diffusion weight
println!("Weight: {}", hybrid.weight());
// Interpolation blend
let blended = hybrid.blend(
&rule_based_data,
&diffusion_data,
BlendStrategy::Interpolate,
42,
);
// Ensemble blend (specify which columns use diffusion)
let ensemble = hybrid.blend_ensemble(
&rule_based_data,
&diffusion_data,
&[0, 2], // columns 0 and 2 from diffusion
);
}
Training Pipeline
The DiffusionTrainer fits a model from column-level parameters and correlation matrices (typically extracted from a fingerprint):
Training
#![allow(unused)]
fn main() {
use synth_core::diffusion::{DiffusionTrainer, ColumnDiffusionParams, ColumnType, DiffusionConfig};
let params = vec![
ColumnDiffusionParams {
name: "amount".into(),
mean: 5000.0,
std: 2000.0,
min: 0.0,
max: 100000.0,
col_type: ColumnType::Continuous,
},
ColumnDiffusionParams {
name: "line_items".into(),
mean: 3.5,
std: 1.5,
min: 1.0,
max: 20.0,
col_type: ColumnType::Integer,
},
];
let corr_matrix = vec![
vec![1.0, 0.65],
vec![0.65, 1.0],
];
let config = DiffusionConfig { n_steps: 1000, schedule: NoiseScheduleType::Cosine, seed: 42 };
let model = DiffusionTrainer::fit(params, corr_matrix, config);
}
Generation from Trained Model
#![allow(unused)]
fn main() {
let samples = model.generate(5000, 42);
// Save/load model
model.save(Path::new("./model.json"))?;
let loaded = TrainedDiffusionModel::load(Path::new("./model.json"))?;
}
Evaluation
#![allow(unused)]
fn main() {
let report = DiffusionTrainer::evaluate(&model, 5000, 42);
println!("Overall score: {:.3}", report.overall_score);
println!("Correlation error: {:.4}", report.correlation_error);
for (i, (mean_err, std_err)) in report.mean_errors.iter().zip(&report.std_errors).enumerate() {
println!("Column {}: mean_err={:.4}, std_err={:.4}", i, mean_err, std_err);
}
}
The FitReport contains:
| Metric | Description |
|---|---|
mean_errors | Per-column mean absolute error |
std_errors | Per-column standard deviation error |
correlation_error | RMSE of correlation matrix |
overall_score | Weighted composite score (0-1, higher is better) |
CLI Usage
Train a Model
datasynth-data diffusion train \
--fingerprint ./fingerprint.dsf \
--output ./model.json \
--n-steps 1000 \
--schedule cosine
Evaluate a Model
datasynth-data diffusion evaluate \
--model ./model.json \
--samples 5000
Configuration
diffusion:
enabled: true
n_steps: 1000 # Number of diffusion steps
schedule: "cosine" # Noise schedule: linear, cosine, sigmoid
sample_size: 1000 # Samples to generate
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable diffusion generation |
n_steps | integer | 1000 | Forward/reverse diffusion steps |
schedule | string | "linear" | Noise schedule type |
sample_size | integer | 1000 | Number of samples |
Utility Functions
DataSynth provides helper functions for working with diffusion data:
#![allow(unused)]
fn main() {
use synth_core::diffusion::{
add_gaussian_noise, normalize_features, denormalize_features,
clip_values, generate_noise,
};
// Normalize data to zero mean, unit variance
let (normalized, means, stds) = normalize_features(&data);
// Add calibrated noise
let noisy = add_gaussian_noise(&normalized[0], 0.1, &mut rng);
// Denormalize back to original scale
let original_scale = denormalize_features(&generated, &means, &stds);
// Clip to valid ranges
clip_values(&mut samples, 0.0, 100000.0);
}
Causal & Counterfactual Generation
New in v0.5.0
DataSynth supports Structural Causal Models (SCMs) for generating data with explicit causal structure, running interventional “what-if” scenarios, and producing counterfactual records.
Overview
Traditional synthetic data generators capture correlations but not causation. Causal generation lets you:
- Define causal relationships between variables (e.g., “transaction amount causes approval level”)
- Generate observational data that follows the causal structure
- Run interventions to answer “what if?” questions (do-calculus)
- Produce counterfactuals — “what would have happened if X were different?”
This is particularly valuable for fraud detection, audit analytics, and regulatory “what-if” scenario testing.
Causal Graph
A causal graph defines variables and the directed edges (causal mechanisms) between them.
Variables
#![allow(unused)]
fn main() {
use synth_core::causal::{CausalVariable, CausalVarType};
let var = CausalVariable::new("transaction_amount", CausalVarType::Continuous)
.with_distribution("lognormal")
.with_param("mu", 8.0)
.with_param("sigma", 1.5);
}
| Variable Type | Description | Example |
|---|---|---|
Continuous | Real-valued | Transaction amount, revenue |
Categorical | Discrete categories | Industry, department |
Count | Non-negative integers | Line items, approvals |
Binary | Boolean (0/1) | Fraud flag, approval status |
Causal Mechanisms
Edges between variables define how a parent causally affects a child:
#![allow(unused)]
fn main() {
use synth_core::causal::{CausalEdge, CausalMechanism};
let edge = CausalEdge {
from: "transaction_amount".into(),
to: "approval_level".into(),
mechanism: CausalMechanism::Logistic { scale: 0.001, midpoint: 50000.0 },
strength: 1.0,
};
}
| Mechanism | Formula | Use Case |
|---|---|---|
Linear { coefficient } | y += coefficient × parent | Proportional effects |
Threshold { cutoff } | y = 1 if parent > cutoff, else 0 | Binary triggers |
Polynomial { coefficients } | y += Σ coefficients[i] × parent^i | Non-linear effects |
Logistic { scale, midpoint } | y += 1 / (1 + e^(-scale × (parent - midpoint))) | S-curve effects |
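The sketch below evaluates the mechanism formulas from the table as plain functions to show how a parent value contributes to a child. The helpers are illustrative only; the crate models these as CausalMechanism variants.
#![allow(unused)]
fn main() {
    fn linear(parent: f64, coefficient: f64) -> f64 { coefficient * parent }
    fn threshold(parent: f64, cutoff: f64) -> f64 { if parent > cutoff { 1.0 } else { 0.0 } }
    fn logistic(parent: f64, scale: f64, midpoint: f64) -> f64 {
        1.0 / (1.0 + (-scale * (parent - midpoint)).exp())
    }

    // e.g. the contribution of a 75,000 transaction amount to fraud_flag through
    // a logistic edge with scale 0.0001 and midpoint 50,000:
    let contribution = logistic(75_000.0, 0.0001, 50_000.0);
    println!("logistic contribution = {contribution:.3}"); // ≈ 0.924
}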
Building a Graph
#![allow(unused)]
fn main() {
use synth_core::causal::{CausalGraph, CausalVariable, CausalVarType, CausalEdge, CausalMechanism};
let mut graph = CausalGraph::new();
// Add variables
graph.add_variable(
CausalVariable::new("transaction_amount", CausalVarType::Continuous)
.with_distribution("lognormal")
.with_param("mu", 8.0)
.with_param("sigma", 1.5)
);
graph.add_variable(
CausalVariable::new("approval_level", CausalVarType::Count)
.with_distribution("normal")
.with_param("mean", 1.0)
.with_param("std", 0.5)
);
graph.add_variable(
CausalVariable::new("fraud_flag", CausalVarType::Binary)
);
// Add causal edges
graph.add_edge(CausalEdge {
from: "transaction_amount".into(),
to: "approval_level".into(),
mechanism: CausalMechanism::Linear { coefficient: 0.00005 },
strength: 1.0,
});
graph.add_edge(CausalEdge {
from: "transaction_amount".into(),
to: "fraud_flag".into(),
mechanism: CausalMechanism::Logistic { scale: 0.0001, midpoint: 50000.0 },
strength: 0.8,
});
// Validate (checks for cycles, missing variables)
graph.validate()?;
}
Built-in Templates
DataSynth includes pre-configured causal graphs for common financial scenarios:
Fraud Detection Template
#![allow(unused)]
fn main() {
let graph = CausalGraph::fraud_detection_template();
}
Variables: transaction_amount, approval_level, vendor_risk, fraud_flag
Causal structure:
- `transaction_amount` → `approval_level` (linear)
- `transaction_amount` → `fraud_flag` (logistic)
- `vendor_risk` → `fraud_flag` (linear)
Revenue Cycle Template
#![allow(unused)]
fn main() {
let graph = CausalGraph::revenue_cycle_template();
}
Variables: order_size, credit_score, payment_delay, revenue
Causal structure:
- `order_size` → `revenue` (linear)
- `credit_score` → `payment_delay` (linear, negative)
- `order_size` → `payment_delay` (linear)
Structural Causal Model (SCM)
The SCM wraps a causal graph and provides generation capabilities:
#![allow(unused)]
fn main() {
use synth_core::causal::StructuralCausalModel;
let scm = StructuralCausalModel::new(graph)?;
// Generate observational data
let samples = scm.generate(10000, 42)?;
// samples: Vec<HashMap<String, f64>>
for sample in &samples[..3] {
println!("Amount: {:.2}, Approval: {:.0}, Fraud: {:.0}",
sample["transaction_amount"],
sample["approval_level"],
sample["fraud_flag"],
);
}
}
Data is generated in topological order — root variables are sampled from their distributions first, then child variables are computed based on their parents’ values and the causal mechanisms.
Interventions (Do-Calculus)
Interventions answer “what would happen if we force variable X to value V?”, cutting all incoming causal edges to X.
Single Intervention
#![allow(unused)]
fn main() {
let intervened = scm.intervene("transaction_amount", 50000.0)?;
let samples = intervened.generate(5000, 42)?;
}
Multiple Interventions
#![allow(unused)]
fn main() {
let intervened = scm
.intervene("transaction_amount", 50000.0)?
.and_intervene("vendor_risk", 0.9);
let samples = intervened.generate(5000, 42)?;
}
Intervention Engine with Effect Estimation
#![allow(unused)]
fn main() {
use synth_core::causal::InterventionEngine;
let engine = InterventionEngine::new(scm);
let result = engine.do_intervention(
&[("transaction_amount".into(), 50000.0)],
5000, // samples
42, // seed
)?;
// Compare baseline vs intervention
println!("Baseline fraud rate: {:.4}",
result.baseline_samples.iter()
.map(|s| s["fraud_flag"])
.sum::<f64>() / result.baseline_samples.len() as f64
);
// Effect estimates with confidence intervals
for (var, effect) in &result.effect_estimates {
println!("{}: ATE={:.4}, 95% CI=({:.4}, {:.4})",
var,
effect.average_treatment_effect,
effect.confidence_interval.0,
effect.confidence_interval.1,
);
}
}
The InterventionResult contains:
| Field | Description |
|---|---|
baseline_samples | Data generated without intervention |
intervened_samples | Data generated with the intervention applied |
effect_estimates | Per-variable average treatment effects with confidence intervals |
Counterfactual Generation
Counterfactuals answer “what would have happened to this specific record if X were different?” using the abduction-action-prediction framework:
- Abduction: Infer the latent noise variables from the factual observation
- Action: Apply the intervention (change X to new value)
- Prediction: Propagate through the SCM with inferred noise
#![allow(unused)]
fn main() {
use synth_core::causal::CounterfactualGenerator;
use std::collections::HashMap;
let cf_gen = CounterfactualGenerator::new(scm);
// Factual record
let factual: HashMap<String, f64> = [
("transaction_amount".to_string(), 5000.0),
("approval_level".to_string(), 1.0),
("fraud_flag".to_string(), 0.0),
].into_iter().collect();
// What if the amount had been 100,000?
let counterfactual = cf_gen.generate_counterfactual(
&factual,
"transaction_amount",
100000.0,
42,
)?;
println!("Factual fraud_flag: {}", factual["fraud_flag"]);
println!("Counterfactual fraud_flag: {}", counterfactual["fraud_flag"]);
}
Batch Counterfactuals
#![allow(unused)]
fn main() {
let pairs = cf_gen.generate_batch_counterfactuals(
&factual_records,
"transaction_amount",
100000.0,
42,
)?;
for pair in &pairs {
println!("Changed variables: {:?}", pair.changed_variables);
}
}
Each CounterfactualPair contains:
| Field | Description |
|---|---|
factual | The original observation |
counterfactual | The counterfactual version |
changed_variables | List of variables that changed |
Causal Validation
Validate that generated data preserves the specified causal structure:
#![allow(unused)]
fn main() {
use synth_core::causal::CausalValidator;
let report = CausalValidator::validate_causal_structure(&samples, &graph);
println!("Valid: {}", report.valid);
for check in &report.checks {
println!("{}: {} — {}", check.name, if check.passed { "PASS" } else { "FAIL" }, check.details);
}
if !report.violations.is_empty() {
println!("Violations: {:?}", report.violations);
}
}
The validator checks:
- Causal edge directions are respected (parent-child correlations)
- Independence constraints hold (non-adjacent variables)
- Intervention effects are consistent with the graph structure
CLI Usage
Generate Observational Data
datasynth-data causal generate \
--template fraud_detection \
--samples 10000 \
--seed 42 \
--output ./causal_output
Run Interventions
datasynth-data causal intervene \
--template fraud_detection \
--variable transaction_amount \
--value 50000 \
--samples 5000 \
--output ./intervention_output
Validate Causal Structure
datasynth-data causal validate \
--data ./causal_output \
--template fraud_detection
Configuration
causal:
enabled: true
template: "fraud_detection" # or "revenue_cycle" or path to custom YAML
sample_size: 10000
validate: true # validate causal structure in output
Custom Causal Graph YAML
# custom_graph.yaml
variables:
- name: order_size
type: continuous
distribution: lognormal
params:
mu: 7.0
sigma: 1.2
- name: discount_rate
type: continuous
distribution: beta
params:
alpha: 2.0
beta: 8.0
- name: revenue
type: continuous
edges:
- from: order_size
to: revenue
mechanism:
type: linear
coefficient: 0.95
- from: discount_rate
to: revenue
mechanism:
type: linear
coefficient: -5000.0
Federated Fingerprinting
New in v0.5.0
Federated fingerprinting enables extracting statistical fingerprints from multiple distributed data sources and combining them without centralizing the raw data.
Overview
In many enterprise environments, data is distributed across multiple systems, departments, or legal entities that cannot share raw data due to privacy regulations or data governance policies. Federated fingerprinting addresses this by:
- Local extraction: Each data source extracts a partial fingerprint with its own differential privacy budget
- Secure aggregation: Partial fingerprints are combined using a configurable aggregation strategy
- Privacy composition: The total privacy budget is tracked across all sources
Source A → [Extract + Local DP] → Partial FP A ─┐
Source B → [Extract + Local DP] → Partial FP B ─┼→ [Aggregate] → Combined FP → [Generate]
Source C → [Extract + Local DP] → Partial FP C ─┘
Partial Fingerprints
Each data source produces a PartialFingerprint containing noised statistics:
#![allow(unused)]
fn main() {
pub struct PartialFingerprint {
pub source_id: String, // Identifier for this data source
pub local_epsilon: f64, // DP epsilon budget spent locally
pub record_count: u64, // Number of records in source
pub column_names: Vec<String>, // Column identifiers
pub means: Vec<f64>, // Per-column means (noised)
pub stds: Vec<f64>, // Per-column standard deviations (noised)
pub mins: Vec<f64>, // Per-column minimums (noised)
pub maxs: Vec<f64>, // Per-column maximums (noised)
pub correlations: Vec<f64>, // Flat row-major correlation matrix (noised)
}
}
Creating a Partial Fingerprint
#![allow(unused)]
fn main() {
use datasynth_fingerprint::federated::FederatedFingerprintProtocol;
let partial = FederatedFingerprintProtocol::create_partial(
"department_finance", // source ID
vec!["amount".into(), "line_items".into()], // columns
50000, // record count
vec![8500.0, 3.2], // means
vec![4200.0, 1.8], // standard deviations
1.0, // local epsilon budget
);
}
Aggregation Methods
| Method | Description | Properties |
|---|---|---|
| WeightedAverage | Weighted by record count | Best for balanced sources |
| Median | Median across sources | Robust to outlier sources |
| TrimmedMean | Mean after removing extremes | Balances robustness and efficiency |
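A minimal sketch of the WeightedAverage strategy for per-column means, weighting each source by its record count. The Source struct is illustrative and not the protocol's internal representation.
#![allow(unused)]
fn main() {
    struct Source { record_count: u64, means: Vec<f64> }
    let sources = vec![
        Source { record_count: 10_000, means: vec![5000.0, 3.0] },
        Source { record_count: 8_000,  means: vec![4500.0, 2.8] },
        Source { record_count: 12_000, means: vec![5500.0, 3.3] },
    ];
    let total: f64 = sources.iter().map(|s| s.record_count as f64).sum();
    let n_cols = sources[0].means.len();
    let aggregated: Vec<f64> = (0..n_cols)
        .map(|c| {
            sources.iter()
                .map(|s| s.means[c] * s.record_count as f64 / total)
                .sum()
        })
        .collect();
    println!("aggregated means = {aggregated:?}"); // ≈ [5066.7, 3.07]
}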
Protocol Usage
#![allow(unused)]
fn main() {
use datasynth_fingerprint::federated::{
FederatedFingerprintProtocol, FederatedConfig, AggregationMethod,
};
// Configure the protocol
let config = FederatedConfig {
min_sources: 2, // Minimum sources required
max_epsilon_per_source: 5.0, // Max DP budget per source
aggregation_method: AggregationMethod::WeightedAverage,
};
let protocol = FederatedFingerprintProtocol::new(config);
// Collect partial fingerprints from each source
let partial_a = FederatedFingerprintProtocol::create_partial(
"source_a", vec!["amount".into(), "count".into()],
10000, vec![5000.0, 3.0], vec![2000.0, 1.5], 1.0,
);
let partial_b = FederatedFingerprintProtocol::create_partial(
"source_b", vec!["amount".into(), "count".into()],
8000, vec![4500.0, 2.8], vec![1800.0, 1.2], 1.0,
);
let partial_c = FederatedFingerprintProtocol::create_partial(
"source_c", vec!["amount".into(), "count".into()],
12000, vec![5500.0, 3.3], vec![2200.0, 1.7], 1.0,
);
// Aggregate
let aggregated = protocol.aggregate(&[partial_a, partial_b, partial_c])?;
println!("Total records: {}", aggregated.total_record_count); // 30000
println!("Total epsilon: {}", aggregated.total_epsilon); // 3.0 (sum)
println!("Sources: {}", aggregated.source_count); // 3
}
Aggregated Fingerprint
The AggregatedFingerprint contains the combined statistics:
#![allow(unused)]
fn main() {
pub struct AggregatedFingerprint {
pub column_names: Vec<String>,
pub means: Vec<f64>, // Aggregated means
pub stds: Vec<f64>, // Aggregated standard deviations
pub mins: Vec<f64>, // Aggregated minimums
pub maxs: Vec<f64>, // Aggregated maximums
pub correlations: Vec<f64>, // Aggregated correlation matrix
pub total_record_count: u64, // Sum across all sources
pub total_epsilon: f64, // Sum of local epsilons
pub source_count: usize, // Number of contributing sources
}
}
Privacy Budget Composition
The total privacy budget is the sum of local epsilons across all sources. This follows sequential composition — each source’s local DP guarantee composes with the others.
For example, if three sources each spend ε=1.0 locally, the total privacy cost of the aggregated fingerprint is ε=3.0 under sequential composition.
To minimize total budget:
- Use the lowest `local_epsilon` that provides sufficient utility
- Prefer fewer sources with more records over many sources with few records
- Use `max_epsilon_per_source` to enforce per-source budget caps
CLI Usage
# Aggregate fingerprints from multiple sources
datasynth-data fingerprint federated \
--sources ./finance.dsf ./operations.dsf ./sales.dsf \
--output ./aggregated.dsf \
--method weighted_average \
--max-epsilon 5.0
# Then generate from the aggregated fingerprint
datasynth-data generate \
--fingerprint ./aggregated.dsf \
--output ./synthetic_output
Configuration
# Federated config is specified per-invocation via CLI flags
# The aggregation method and privacy budget are controlled at execution time
| CLI Flag | Default | Description |
|---|---|---|
--sources | (required) | Two or more .dsf fingerprint files |
--output | (required) | Output path for aggregated fingerprint |
--method | weighted_average | Aggregation strategy |
--max-epsilon | 5.0 | Maximum epsilon per source |
See Also
- Fingerprinting Guide
- Synthetic Data Certificates
- datasynth-fingerprint Crate
- Privacy & Regulatory Compliance
Synthetic Data Certificates
New in v0.5.0
Synthetic data certificates provide cryptographic proof of the privacy guarantees and quality metrics associated with generated data.
Overview
As synthetic data becomes increasingly used in regulated industries, organizations need verifiable assurance that:
- The data was generated with specific differential privacy guarantees
- Quality metrics meet documented thresholds
- The generation configuration hasn’t been tampered with
- The certificate itself is authentic (HMAC-SHA256 signed)
Certificates are produced during generation and can be embedded in output files or distributed alongside them.
Certificate Structure
#![allow(unused)]
fn main() {
pub struct SyntheticDataCertificate {
pub certificate_id: String, // Unique certificate identifier
pub generation_timestamp: String, // ISO 8601 timestamp
pub generator_version: String, // DataSynth version
pub config_hash: String, // SHA-256 hash of generation config
pub seed: Option<u64>, // RNG seed for reproducibility
pub dp_guarantee: Option<DpGuarantee>,
pub quality_metrics: Option<QualityMetrics>,
pub fingerprint_hash: Option<String>, // Source fingerprint hash
pub issuer: String, // Certificate issuer
pub signature: Option<String>, // HMAC-SHA256 signature
}
}
DP Guarantee
#![allow(unused)]
fn main() {
pub struct DpGuarantee {
pub mechanism: String, // "Laplace" or "Gaussian"
pub epsilon: f64, // Privacy budget spent
pub delta: Option<f64>, // For (ε,δ)-DP
pub composition_method: String, // "sequential", "advanced", "rdp"
pub total_queries: u32, // Number of DP queries made
}
}
Quality Metrics
#![allow(unused)]
fn main() {
pub struct QualityMetrics {
pub benford_mad: Option<f64>, // Mean Absolute Deviation from Benford's Law
pub correlation_preservation: Option<f64>, // Correlation matrix similarity (0-1)
pub statistical_fidelity: Option<f64>, // Overall statistical fidelity score (0-1)
pub mia_auc: Option<f64>, // Membership Inference Attack AUC (closer to 0.5 = better privacy)
}
}
Building Certificates
Use the CertificateBuilder for fluent construction:
#![allow(unused)]
fn main() {
use datasynth_fingerprint::certificates::{
CertificateBuilder, DpGuarantee, QualityMetrics,
};
let cert = CertificateBuilder::new("DataSynth v0.5.0")
.with_dp_guarantee(DpGuarantee {
mechanism: "Laplace".into(),
epsilon: 1.0,
delta: None,
composition_method: "sequential".into(),
total_queries: 50,
})
.with_quality_metrics(QualityMetrics {
benford_mad: Some(0.008),
correlation_preservation: Some(0.95),
statistical_fidelity: Some(0.92),
mia_auc: Some(0.52),
})
.with_config_hash("sha256:abc123...")
.with_seed(42)
.with_fingerprint_hash("sha256:def456...")
.with_generator_version("0.5.0")
.build();
}
Signing and Verification
Certificates are signed using HMAC-SHA256:
#![allow(unused)]
fn main() {
use datasynth_fingerprint::certificates::{sign_certificate, verify_certificate};
// Sign
sign_certificate(&mut cert, "my-secret-key-material");
// Verify
let valid = verify_certificate(&cert, "my-secret-key-material");
assert!(valid);
// Tampered certificate fails verification
cert.dp_guarantee.as_mut().unwrap().epsilon = 0.001; // tamper
assert!(!verify_certificate(&cert, "my-secret-key-material"));
}
Output Embedding
Certificates can be:
- Standalone JSON: Written as `certificate.json` in the output directory
- Parquet metadata: Embedded in Parquet file metadata under the `datasynth_certificate` key
- JSON metadata: Included in the generation manifest
CLI Usage
# Generate data with certificate
datasynth-data generate \
--config config.yaml \
--output ./output \
--certificate
# Certificate is written to ./output/certificate.json
Configuration
certificates:
enabled: true
issuer: "DataSynth"
include_quality_metrics: true
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable certificate generation |
issuer | string | "DataSynth" | Issuer identity |
include_quality_metrics | bool | true | Include quality metrics in certificate |
Privacy-Utility Pareto Frontier
The ParetoFrontier helps find optimal privacy-utility tradeoffs:
#![allow(unused)]
fn main() {
use datasynth_fingerprint::privacy::pareto::{ParetoFrontier, ParetoPoint};
let epsilons = vec![0.1, 0.5, 1.0, 2.0, 5.0, 10.0];
let points = ParetoFrontier::explore(&epsilons, |epsilon| {
// Evaluate utility at this epsilon level
ParetoPoint {
epsilon,
delta: None,
utility_score: compute_utility(epsilon),
benford_mad: compute_benford(epsilon),
correlation_score: compute_correlation(epsilon),
}
});
// Recommend epsilon for target utility
if let Some(recommended_epsilon) = ParetoFrontier::recommend(&points, 0.90) {
println!("For 90% utility, use epsilon = {:.2}", recommended_epsilon);
}
}
The frontier identifies non-dominated points where no other configuration achieves both better privacy and better utility.
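A minimal sketch of the non-dominance test, under the assumption that lower epsilon and higher utility are both preferred; the Point struct is illustrative, not the ParetoPoint type.
#![allow(unused)]
fn main() {
    // Keep a point only if no other point has both lower epsilon and higher utility.
    #[derive(Clone, Copy, Debug)]
    struct Point { epsilon: f64, utility: f64 }
    let points = [
        Point { epsilon: 0.5, utility: 0.70 },
        Point { epsilon: 1.0, utility: 0.85 },
        Point { epsilon: 2.0, utility: 0.84 }, // dominated by the epsilon = 1.0 point
        Point { epsilon: 5.0, utility: 0.93 },
    ];
    let frontier: Vec<Point> = points.iter().copied()
        .filter(|p| !points.iter().any(|q|
            q.epsilon < p.epsilon && q.utility > p.utility))
        .collect();
    println!("{frontier:?}"); // the epsilon = 2.0 point is dropped
}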
Deployment & Operations
This section covers everything you need to deploy, operate, and maintain DataSynth in production environments.
Deployment Options
DataSynth supports three deployment models, each suited to different operational requirements:
| Method | Best For | Scaling | Complexity |
|---|---|---|---|
| Docker / Compose | Small teams, dev/staging, single-node | Vertical | Low |
| Kubernetes / Helm | Production, multi-tenant, auto-scaling | Horizontal | Medium |
| Bare Metal / SystemD | Regulated environments, air-gapped networks | Vertical | Low |
Architecture at a Glance
DataSynth server exposes two network interfaces:
- REST API on port 3000 – configuration, bulk generation, streaming control, health probes, Prometheus metrics
- gRPC API on port 50051 – high-throughput generation for programmatic clients
Both share an in-process ServerState with atomic counters, so a single process can serve REST, gRPC, and WebSocket clients concurrently.
Operations Guides
| Guide | Description |
|---|---|
| Operational Runbook | Grafana dashboards, alert response, troubleshooting, log analysis |
| Capacity Planning | Sizing model, reference benchmarks, disk and memory estimates |
| Disaster Recovery | Backup procedures, deterministic replay, stateless restart |
Security & API
| Guide | Description |
|---|---|
| API Reference | Endpoints, authentication, rate limiting, WebSocket protocol, error formats |
| Security Hardening | Pre-deployment checklist, TLS/mTLS, secrets, container security, audit logging |
| TLS & Reverse Proxy | Nginx, Envoy, and native TLS configuration |
Quick Decision Tree
- Need auto-scaling or HA? – Use Kubernetes.
- Single server, want observability? – Use Docker Compose with the full stack (Prometheus + Grafana).
- Air-gapped or compliance-restricted? – Use Bare Metal with SystemD.
Docker Deployment
This guide walks through building, configuring, and running DataSynth as Docker containers.
Prerequisites
- Docker Engine 24+ (or Docker Desktop 4.25+)
- Docker Compose v2
- 2 GB RAM minimum (4 GB recommended)
- 10 GB disk for images and generated data
Images
DataSynth provides two container images:
| Image | Dockerfile | Purpose |
|---|---|---|
datasynth/datasynth-server | Dockerfile | Server (REST + gRPC + WebSocket) |
datasynth/datasynth-cli | Dockerfile.cli | CLI for batch generation jobs |
Multi-Stage Build Walkthrough
The server Dockerfile uses a four-stage build with cargo-chef for dependency caching:
Stage 1: chef -- installs cargo-chef on rust:1.88-bookworm
Stage 2: planner -- computes recipe.json from Cargo.lock
Stage 3: builder -- cooks dependencies (cached), then builds datasynth-server + datasynth-data
Stage 4: runtime -- copies binaries into gcr.io/distroless/cc-debian12
Build the server image:
docker build -t datasynth/datasynth-server:0.5.0 .
Build the CLI-only image:
docker build -t datasynth/datasynth-cli:0.5.0 -f Dockerfile.cli .
Build Arguments and Features
To enable optional features (TLS, Redis rate limiting, OpenTelemetry), modify the build command in the builder stage. For example, to enable Redis:
# In the builder stage, replace the cargo build line:
RUN cargo build --release -p datasynth-server -p datasynth-cli --features redis
Image Size
The distroless runtime image is approximately 40-60 MB. The build cache layer with cooked dependencies significantly speeds up rebuilds when only application code changes.
Docker Compose Stack
The repository includes a production-ready docker-compose.yml with the full observability stack:
services:
datasynth-server:
build:
context: .
dockerfile: Dockerfile
ports:
- "50051:50051" # gRPC
- "3000:3000" # REST
environment:
- RUST_LOG=info
- DATASYNTH_API_KEYS=${DATASYNTH_API_KEYS:-}
healthcheck:
test: ["CMD", "/usr/local/bin/datasynth-data", "--help"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
deploy:
resources:
limits:
memory: 2G
cpus: "2.0"
restart: unless-stopped
redis:
image: redis:7-alpine
profiles:
- redis
ports:
- "6379:6379"
command: >
redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
deploy:
resources:
limits:
memory: 256M
cpus: "0.5"
volumes:
- redis-data:/data
restart: unless-stopped
prometheus:
image: prom/prometheus:v2.51.0
ports:
- "9090:9090"
volumes:
- ./deploy/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./deploy/prometheus-alerts.yml:/etc/prometheus/alerts.yml:ro
- prometheus-data:/prometheus
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.retention.time=30d"
restart: unless-stopped
grafana:
image: grafana/grafana:10.4.0
ports:
- "3001:3000"
volumes:
- ./deploy/grafana/provisioning:/etc/grafana/provisioning:ro
- grafana-data:/var/lib/grafana
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
- GF_USERS_ALLOW_SIGN_UP=false
restart: unless-stopped
volumes:
prometheus-data:
grafana-data:
redis-data:
Starting the Stack
Basic server only:
docker compose up -d datasynth-server
Full observability stack (server + Prometheus + Grafana):
docker compose up -d
With Redis for distributed rate limiting:
docker compose --profile redis up -d
Verifying the Deployment
# Health check
curl http://localhost:3000/health
# Readiness probe
curl http://localhost:3000/ready
# Prometheus metrics
curl http://localhost:3000/metrics
# Grafana UI
open http://localhost:3001 # admin / admin (or GRAFANA_PASSWORD)
Environment Variables
| Variable | Default | Description |
|---|---|---|
RUST_LOG | info | Log level: trace, debug, info, warn, error |
DATASYNTH_API_KEYS | (none) | Comma-separated API keys for authentication |
DATASYNTH_WORKER_THREADS | 0 (auto) | Tokio worker threads; 0 = CPU count |
DATASYNTH_REDIS_URL | (none) | Redis URL for distributed rate limiting |
DATASYNTH_TLS_CERT | (none) | Path to TLS certificate (PEM) |
DATASYNTH_TLS_KEY | (none) | Path to TLS private key (PEM) |
OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OpenTelemetry collector endpoint |
OTEL_SERVICE_NAME | (none) | OpenTelemetry service name |
Resource Limits
Recommended container resource limits by workload:
| Workload | CPU | Memory | Notes |
|---|---|---|---|
| Light (dev/test) | 1 core | 1 GB | Small configs, < 10K entries |
| Medium (staging) | 2 cores | 2 GB | Medium configs, up to 100K entries |
| Heavy (production) | 4 cores | 4 GB | Large configs, streaming, multiple clients |
| Batch CLI job | 2-8 cores | 2-8 GB | Scales linearly with core count |
Running CLI Jobs in Docker
Generate data with the CLI image:
docker run --rm \
-v $(pwd)/output:/output \
datasynth/datasynth-cli:0.5.0 \
generate --demo --output /output
Generate from a custom config:
docker run --rm \
-v $(pwd)/config.yaml:/config.yaml:ro \
-v $(pwd)/output:/output \
datasynth/datasynth-cli:0.5.0 \
generate --config /config.yaml --output /output
Networking
The server binds to 0.0.0.0 by default inside the container. Port mapping:
| Container Port | Protocol | Service |
|---|---|---|
| 3000 | TCP | REST API + WebSocket + Prometheus metrics |
| 50051 | TCP | gRPC API |
For WebSocket connections through a reverse proxy, ensure the proxy supports HTTP Upgrade headers. See TLS & Reverse Proxy for Nginx and Envoy configurations.
Logging
DataSynth server outputs structured JSON logs to stdout, which integrates with Docker’s logging drivers:
# View logs
docker compose logs -f datasynth-server
# Filter by level
docker compose logs datasynth-server | jq 'select(.level == "ERROR")'
To change the log format or level, set the RUST_LOG environment variable:
# Debug logging for the server crate only
RUST_LOG=datasynth_server=debug docker compose up -d datasynth-server
Kubernetes Deployment
This guide covers deploying DataSynth on Kubernetes using the included Helm chart or raw manifests.
Prerequisites
- Kubernetes 1.27+
- Helm 3.12+ (for Helm-based deployment)
- `kubectl` configured for your cluster
- A container registry accessible from the cluster
- Metrics Server installed (for HPA)
Helm Chart
The Helm chart is located at deploy/helm/datasynth/ and manages all Kubernetes resources.
Quick Install
# From the repository root
helm install datasynth ./deploy/helm/datasynth \
--namespace datasynth \
--create-namespace
Install with Custom Values
helm install datasynth ./deploy/helm/datasynth \
--namespace datasynth \
--create-namespace \
--set image.repository=your-registry.example.com/datasynth-server \
--set image.tag=0.5.0 \
--set autoscaling.minReplicas=3 \
--set autoscaling.maxReplicas=15
Upgrade
helm upgrade datasynth ./deploy/helm/datasynth \
--namespace datasynth \
--reuse-values \
--set image.tag=0.6.0
Uninstall
helm uninstall datasynth --namespace datasynth
Chart Reference
values.yaml Key Parameters
| Parameter | Default | Description |
|---|---|---|
| `replicaCount` | 2 | Initial replicas (ignored when HPA is enabled) |
| `image.repository` | `datasynth/datasynth-server` | Container image repository |
| `image.tag` | 0.5.0 | Image tag |
| `service.type` | `ClusterIP` | Service type |
| `service.restPort` | 3000 | REST API port |
| `service.grpcPort` | 50051 | gRPC port |
| `resources.requests.cpu` | 500m | CPU request |
| `resources.requests.memory` | 512Mi | Memory request |
| `resources.limits.cpu` | 2 | CPU limit |
| `resources.limits.memory` | 2Gi | Memory limit |
| `autoscaling.enabled` | `true` | Enable HPA |
| `autoscaling.minReplicas` | 2 | Minimum replicas |
| `autoscaling.maxReplicas` | 10 | Maximum replicas |
| `autoscaling.targetCPUUtilizationPercentage` | 70 | CPU scaling target |
| `podDisruptionBudget.enabled` | `true` | Enable PDB |
| `podDisruptionBudget.minAvailable` | 1 | Minimum available pods |
| `apiKeys` | `[]` | API keys (stored in a Secret) |
| `config.enabled` | `false` | Mount DataSynth YAML config via ConfigMap |
| `redis.enabled` | `false` | Deploy Redis sidecar for distributed rate limiting |
| `serviceMonitor.enabled` | `false` | Create Prometheus ServiceMonitor |
| `ingress.enabled` | `false` | Enable Ingress resource |
Authentication
API keys are stored in a Kubernetes Secret and injected via the DATASYNTH_API_KEYS environment variable:
# values-production.yaml
apiKeys:
- "your-secure-api-key-1"
- "your-secure-api-key-2"
For external secret management, use the External Secrets Operator or mount from a Vault sidecar. See Security Hardening for details.
DataSynth Configuration via ConfigMap
To inject a DataSynth generation config into the pods:
config:
enabled: true
content: |
global:
industry: manufacturing
start_date: "2024-01-01"
period_months: 12
seed: 42
companies:
- code: "1000"
name: "Manufacturing Corp"
currency: USD
country: US
annual_transaction_volume: 100000
The config is mounted at /etc/datasynth/datasynth.yaml as a read-only volume.
Health Probes
The Helm chart configures three probes:
| Probe | Endpoint | Initial Delay | Period | Failure Threshold |
|---|---|---|---|---|
| Startup | GET /live | 5s | 5s | 30 (= 2.5 min max startup) |
| Liveness | GET /live | 15s | 20s | 3 |
| Readiness | GET /ready | 5s | 10s | 3 |
The readiness probe checks configuration validity, memory usage, and disk availability. A pod reporting not-ready is removed from Service endpoints until it recovers.
Horizontal Pod Autoscaler (HPA)
The chart creates an HPA by default targeting 70% CPU utilization:
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
# Uncomment to also scale on memory:
# targetMemoryUtilizationPercentage: 80
Custom metrics scaling (e.g., on synth_active_streams) requires the Prometheus Adapter:
# Custom metrics HPA example (requires prometheus-adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: datasynth-custom
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: datasynth
minReplicas: 2
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: synth_active_streams
target:
type: AverageValue
averageValue: "5"
Pod Disruption Budget (PDB)
The PDB ensures at least one pod remains available during voluntary disruptions (node drains, cluster upgrades):
podDisruptionBudget:
enabled: true
minAvailable: 1
For larger deployments, switch to maxUnavailable:
podDisruptionBudget:
enabled: true
maxUnavailable: 1
Ingress and TLS
Nginx Ingress with cert-manager
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
nginx.ingress.kubernetes.io/proxy-body-size: "50m"
hosts:
- host: datasynth.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: datasynth-tls
hosts:
- datasynth.example.com
WebSocket Support
For Nginx Ingress, WebSocket upgrade is handled automatically for paths starting with /ws/. If you use a path-based routing rule, ensure the annotation is set:
nginx.ingress.kubernetes.io/proxy-http-version: "1.1"
nginx.ingress.kubernetes.io/configuration-snippet: |
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
gRPC Ingress
gRPC requires a separate Ingress resource or an Ingress controller that supports gRPC (e.g., Nginx Ingress with nginx.ingress.kubernetes.io/backend-protocol: "GRPC"):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: datasynth-grpc
annotations:
nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
ingressClassName: nginx
tls:
- secretName: datasynth-grpc-tls
hosts:
- grpc.datasynth.example.com
rules:
- host: grpc.datasynth.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: datasynth
port:
name: grpc
Manual Manifests (Without Helm)
If you prefer raw manifests, here is a minimal deployment:
---
apiVersion: v1
kind: Namespace
metadata:
name: datasynth
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: datasynth
namespace: datasynth
spec:
replicas: 2
selector:
matchLabels:
app: datasynth
template:
metadata:
labels:
app: datasynth
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
containers:
- name: datasynth
image: datasynth/datasynth-server:0.5.0
ports:
- containerPort: 3000
name: http-rest
- containerPort: 50051
name: grpc
env:
- name: RUST_LOG
value: "info"
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: "2"
memory: 2Gi
livenessProbe:
httpGet:
path: /live
port: http-rest
initialDelaySeconds: 15
periodSeconds: 20
readinessProbe:
httpGet:
path: /ready
port: http-rest
initialDelaySeconds: 5
periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
name: datasynth
namespace: datasynth
spec:
type: ClusterIP
ports:
- port: 3000
targetPort: http-rest
name: http-rest
- port: 50051
targetPort: grpc
name: grpc
selector:
app: datasynth
Prometheus ServiceMonitor
If you use the Prometheus Operator, enable the ServiceMonitor:
serviceMonitor:
enabled: true
interval: 30s
scrapeTimeout: 10s
path: /metrics
labels:
release: prometheus # Must match your Prometheus Operator selector
Rolling Update Strategy
The chart uses a zero-downtime rolling update strategy:
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 0
maxSurge: 1
Combined with the PDB and readiness probes, this ensures that:
- A new pod starts and becomes ready before an old pod is terminated.
- At least `minAvailable` pods are always serving traffic.
- Config and secret changes trigger a rolling restart via checksum annotations.
Topology Spread
For multi-zone clusters, use topology spread constraints to distribute pods evenly:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app.kubernetes.io/name: datasynth
Bare Metal Deployment
This guide covers installing and running DataSynth directly on a Linux server using SystemD.
Prerequisites
- Linux x86_64 (Ubuntu 22.04+, Debian 12+, RHEL 9+, or equivalent)
- 2 GB RAM minimum (4 GB recommended)
- Root or sudo access for initial setup
Binary Installation
Option 1: Download Pre-Built Binary
# Download the latest release
curl -L https://github.com/ey-asu-rnd/SyntheticData/releases/latest/download/datasynth-server-linux-x86_64.tar.gz \
-o datasynth-server.tar.gz
# Extract
tar xzf datasynth-server.tar.gz
# Install binaries
sudo install -m 0755 datasynth-server /usr/local/bin/
sudo install -m 0755 datasynth-data /usr/local/bin/
# Verify
datasynth-server --help
datasynth-data --version
Option 2: Build from Source
# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# Install protobuf compiler (required for gRPC)
sudo apt-get install -y protobuf-compiler # Debian/Ubuntu
sudo dnf install -y protobuf-compiler # RHEL/Fedora
# Clone and build
git clone https://github.com/ey-asu-rnd/SyntheticData.git
cd SyntheticData
cargo build --release -p datasynth-server -p datasynth-cli
# Install
sudo install -m 0755 target/release/datasynth-server /usr/local/bin/
sudo install -m 0755 target/release/datasynth-data /usr/local/bin/
To enable optional features during the build:
# With TLS support
cargo build --release -p datasynth-server --features tls
# With Redis distributed rate limiting
cargo build --release -p datasynth-server --features redis
# With OpenTelemetry
cargo build --release -p datasynth-server --features otel
# All features
cargo build --release -p datasynth-server --features "tls,redis,otel"
User and Permissions
Create a dedicated service account:
# Create system user (no home dir, no login shell)
sudo useradd --system --no-create-home --shell /usr/sbin/nologin datasynth
# Create data and config directories
sudo mkdir -p /var/lib/datasynth
sudo mkdir -p /etc/datasynth
sudo mkdir -p /etc/datasynth/tls
# Set ownership
sudo chown -R datasynth:datasynth /var/lib/datasynth
sudo chmod 750 /var/lib/datasynth
sudo chown -R root:datasynth /etc/datasynth
sudo chmod 750 /etc/datasynth
sudo chmod 700 /etc/datasynth/tls
Environment Configuration
Copy the example environment file:
sudo cp deploy/datasynth-server.env.example /etc/datasynth/server.env
sudo chown root:datasynth /etc/datasynth/server.env
sudo chmod 640 /etc/datasynth/server.env
Edit /etc/datasynth/server.env:
# Logging level
RUST_LOG=info
# API authentication (comma-separated keys)
DATASYNTH_API_KEYS=your-secure-key-1,your-secure-key-2
# Worker threads (0 = auto-detect from CPU count)
DATASYNTH_WORKER_THREADS=0
# TLS (requires --features tls build)
# DATASYNTH_TLS_CERT=/etc/datasynth/tls/cert.pem
# DATASYNTH_TLS_KEY=/etc/datasynth/tls/key.pem
SystemD Service
The repository includes a production-ready SystemD unit at deploy/datasynth-server.service. Install it:
sudo cp deploy/datasynth-server.service /etc/systemd/system/
sudo systemctl daemon-reload
Unit File Walkthrough
[Unit]
Description=DataSynth Synthetic Data Server
Documentation=https://github.com/ey-asu-rnd/SyntheticData
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=datasynth
Group=datasynth
EnvironmentFile=-/etc/datasynth/server.env
ExecStart=/usr/local/bin/datasynth-server \
--host 0.0.0.0 \
--port 50051 \
--rest-port 3000
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
TimeoutStartSec=30
TimeoutStopSec=30
# Resource limits
MemoryMax=4G
CPUQuota=200%
TasksMax=512
LimitNOFILE=65536
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
ReadWritePaths=/var/lib/datasynth
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=datasynth-server
[Install]
WantedBy=multi-user.target
Key security directives:
| Directive | Effect |
|---|---|
| `NoNewPrivileges=true` | Prevents privilege escalation |
| `ProtectSystem=strict` | Mounts the filesystem read-only except `ReadWritePaths` |
| `ProtectHome=true` | Hides `/home`, `/root`, `/run/user` |
| `PrivateTmp=true` | Isolates `/tmp` |
| `PrivateDevices=true` | Restricts device access |
| `ReadWritePaths=/var/lib/datasynth` | Only writable directory |
Enable and Start
sudo systemctl enable datasynth-server
sudo systemctl start datasynth-server
sudo systemctl status datasynth-server
Common Operations
# View logs
journalctl -u datasynth-server -f
# Restart
sudo systemctl restart datasynth-server
# Reload (sends HUP signal)
sudo systemctl reload datasynth-server
# Stop
sudo systemctl stop datasynth-server
Log Rotation
SystemD journal handles log rotation automatically. To configure retention:
# /etc/systemd/journald.conf.d/datasynth.conf
[Journal]
SystemMaxUse=2G
MaxRetentionSec=30d
Reload journald:
sudo systemctl restart systemd-journald
To export logs to a file for external log aggregation:
# Export today's logs as JSON
journalctl -u datasynth-server --since today -o json > /var/log/datasynth-$(date +%F).json
Firewall Configuration
Open the required ports:
# UFW (Ubuntu)
sudo ufw allow 3000/tcp comment "DataSynth REST"
sudo ufw allow 50051/tcp comment "DataSynth gRPC"
# firewalld (RHEL/CentOS)
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --permanent --add-port=50051/tcp
sudo firewall-cmd --reload
Verifying the Installation
# Health check
curl -s http://localhost:3000/health | python3 -m json.tool
# Readiness check
curl -s http://localhost:3000/ready | python3 -m json.tool
# Prometheus metrics
curl -s http://localhost:3000/metrics
# Generate test data via CLI
datasynth-data generate --demo --output /tmp/datasynth-test
ls -la /tmp/datasynth-test/
Updating
# Stop the service
sudo systemctl stop datasynth-server
# Replace the binary
sudo install -m 0755 /path/to/new/datasynth-server /usr/local/bin/
# Start the service
sudo systemctl start datasynth-server
# Verify
curl -s http://localhost:3000/health | python3 -m json.tool
Operational Runbook
This runbook provides step-by-step procedures for monitoring, alerting, troubleshooting, and maintaining DataSynth in production.
Monitoring Stack Overview
The recommended monitoring stack uses Prometheus for metrics collection and Grafana for dashboards and alerting. The docker-compose.yml in the repository root sets this up automatically.
| Component | Default URL | Purpose |
|---|---|---|
| Prometheus | http://localhost:9090 | Metrics storage and alerting rules |
| Grafana | http://localhost:3001 | Dashboards and visualization |
| DataSynth `/metrics` | http://localhost:3000/metrics | Prometheus exposition endpoint |
| DataSynth `/api/metrics` | http://localhost:3000/api/metrics | JSON metrics endpoint |
Prometheus Configuration
The repository includes a pre-configured Prometheus scrape config at deploy/prometheus.yml:
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alerts.yml"
scrape_configs:
- job_name: "datasynth"
static_configs:
- targets: ["datasynth-server:3000"]
metrics_path: "/metrics"
scrape_interval: 10s
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
For Kubernetes, use the ServiceMonitor CRD instead (see Kubernetes deployment).
Available Metrics
DataSynth exposes the following Prometheus metrics at GET /metrics:
| Metric | Type | Description |
|---|---|---|
| `synth_entries_generated_total` | Counter | Total journal entries generated since startup |
| `synth_anomalies_injected_total` | Counter | Total anomalies injected |
| `synth_uptime_seconds` | Gauge | Server uptime in seconds |
| `synth_entries_per_second` | Gauge | Current generation throughput |
| `synth_active_streams` | Gauge | Number of active WebSocket streaming connections |
| `synth_stream_events_total` | Counter | Total events sent through WebSocket streams |
| `synth_info` | Gauge | Server version info label (always 1) |
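Outside of Prometheus, these metrics can also be pulled and parsed directly for quick ad-hoc checks. A minimal sketch, assuming the third-party `requests` and `prometheus_client` packages are installed:

```python
# Ad-hoc check: fetch /metrics and print all synth_* samples.
# Assumes `requests` and `prometheus_client` are available
# (pip install requests prometheus-client).
import requests
from prometheus_client.parser import text_string_to_metric_families

resp = requests.get("http://localhost:3000/metrics", timeout=5)
resp.raise_for_status()

for family in text_string_to_metric_families(resp.text):
    for sample in family.samples:
        if sample.name.startswith("synth_"):
            print(f"{sample.name} {dict(sample.labels)} = {sample.value}")
```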
Grafana Dashboard Setup
Step 1: Add Prometheus Data Source
- Open Grafana at `http://localhost:3001`.
- Navigate to Configuration > Data Sources > Add data source.
- Select Prometheus.
- Set URL to `http://prometheus:9090` (Docker) or your Prometheus endpoint.
- Click Save & Test.
If using Docker Compose, the Prometheus data source is auto-provisioned via deploy/grafana/provisioning/datasources/prometheus.yml.
Step 2: Create the DataSynth Dashboard
Create a new dashboard with the following panels:
Panel 1: Generation Throughput
Type: Time series
Query: rate(synth_entries_generated_total[5m])
Title: Entries Generated per Second (5m rate)
Unit: ops/sec
Panel 2: Active WebSocket Streams
Type: Stat
Query: synth_active_streams
Title: Active Streams
Thresholds: 0 (green), 5 (yellow), 10 (red)
Panel 3: Total Entries (Counter)
Type: Stat
Query: synth_entries_generated_total
Title: Total Entries Generated
Format: short
Panel 4: Anomaly Injection Rate
Type: Time series
Query A: rate(synth_anomalies_injected_total[5m])
Query B: rate(synth_entries_generated_total[5m])
Title: Anomaly Rate
Transform: A / B (using math expression)
Unit: percentunit
Panel 5: Server Uptime
Type: Stat
Query: synth_uptime_seconds
Title: Server Uptime
Unit: seconds (s)
Panel 6: Stream Events Rate
Type: Time series
Query: rate(synth_stream_events_total[1m])
Title: Stream Events per Second
Unit: events/sec
Step 3: Save and Export
Save the dashboard and export as JSON for version control. Place the file in deploy/grafana/provisioning/dashboards/ for automatic provisioning.
Alert Rules
The repository includes pre-configured alert rules at deploy/prometheus-alerts.yml:
Alert: ServerDown
- alert: ServerDown
expr: up{job="datasynth"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "DataSynth server is down"
description: "DataSynth server has been unreachable for more than 1 minute."
Response procedure:
- Check the server process: `systemctl status datasynth-server` or `docker compose ps`.
- Check logs: `journalctl -u datasynth-server -n 100` or `docker compose logs --tail 100 datasynth-server`.
- Check resource exhaustion: `free -h`, `df -h`, `top`.
- If OOM killed, increase memory limits and restart.
- If disk full, clear output directory and restart.
Alert: HighErrorRate
- alert: HighErrorRate
expr: rate(synth_errors_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate on DataSynth server"
Response procedure:
- Check application logs for error patterns: `journalctl -u datasynth-server -p err`.
- Look for invalid configuration: `curl localhost:3000/ready`.
- Check if clients are sending malformed requests (rate limit headers in responses).
- If errors are generation failures, check available memory and disk.
Alert: HighMemoryUsage
- alert: HighMemoryUsage
expr: synth_memory_usage_bytes / 1024 / 1024 > 3072
for: 10m
labels:
severity: critical
annotations:
summary: "High memory usage on DataSynth server"
description: "Memory usage is {{ $value }}MB, exceeding 3GB threshold."
Response procedure:
- Check DataSynth’s internal degradation level: `curl localhost:3000/ready` – the `memory` check status will show `ok`, `degraded`, or `fail`.
- If degraded, DataSynth automatically reduces batch sizes. Wait for current work to complete.
- If in Emergency mode, stop active streams: `curl -X POST localhost:3000/api/stream/stop`.
- Consider increasing memory limits or reducing concurrent streams.
Alert: HighLatency
- alert: HighLatency
expr: histogram_quantile(0.99, rate(datasynth_api_request_duration_seconds_bucket[5m])) > 30
for: 5m
labels:
severity: warning
Response procedure:
- Check if bulk generation requests are creating large datasets. The default timeout is 300 seconds.
- Verify CPU is not throttled: `kubectl top pod` or `docker stats`.
- Consider splitting large generation requests into smaller batches.
Alert: NoEntitiesGenerated
- alert: NoEntitiesGenerated
expr: increase(synth_entries_generated_total[1h]) == 0 and synth_active_streams > 0
for: 15m
labels:
severity: warning
Response procedure:
- Streams are connected but not producing data. Check if streams are paused.
- Resume streams: `curl -X POST localhost:3000/api/stream/resume`.
- Check logs for generation failures.
- Verify the configuration is valid: `curl localhost:3000/api/config`.
Common Troubleshooting
Server Fails to Start
| Symptom | Cause | Resolution |
|---|---|---|
| `Invalid gRPC address` | Bad `--host` or `--port` value | Check arguments and env vars |
| `Failed to bind REST listener` | Port already in use | `lsof -i :3000` to find conflict |
| `protoc not found` | Missing protobuf compiler | Install protobuf-compiler package |
| Immediate exit, no logs | Panic before logger init | Run with RUST_LOG=debug |
Generation Errors
| Symptom | Cause | Resolution |
|---|---|---|
| `Failed to create orchestrator` | Invalid config | Validate with `datasynth-data validate --config config.yaml` |
| `Rate limit exceeded` | Too many API requests | Wait for `Retry-After` header, increase rate limits |
| Empty journal entries | No companies configured | Check curl localhost:3000/api/config |
| Slow generation | Large period or high volume | Add worker threads, increase CPU |
Connection Issues
| Symptom | Cause | Resolution |
|---|---|---|
| Connection refused on 3000 | Server not running or wrong port | Check process and port bindings |
| 401 Unauthorized | Missing or invalid API key | Add `X-API-Key` header or `Authorization: Bearer <key>` |
| 429 Too Many Requests | Rate limit exceeded | Respect `Retry-After` header |
| WebSocket drops immediately | Proxy not forwarding Upgrade | Configure proxy for WebSocket (see TLS doc) |
Memory Issues
DataSynth monitors its own memory usage via /proc/self/statm (Linux) and triggers automatic degradation:
| Degradation Level | Trigger | Behavior |
|---|---|---|
| Normal | < 70% of limit | Full throughput |
| Reduced | 70-85% | Smaller batch sizes |
| Minimal | 85-95% | Single-record generation |
| Emergency | > 95% | Rejects new work |
Check the current level:
curl -s localhost:3000/ready | jq '.checks[] | select(.name == "memory")'
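The same check can be scripted for automated load shedding. A sketch, assuming the `checks[]` payload shape documented in the API Reference; stopping streams on degradation is an operational choice, not built-in behavior:

```python
# Sketch: inspect the memory check on /ready and shed load if degraded.
import requests

BASE = "http://localhost:3000"

resp = requests.get(f"{BASE}/ready", timeout=5)
checks = {c["name"]: c["status"] for c in resp.json().get("checks", [])}
memory_status = checks.get("memory", "unknown")

if memory_status in ("degraded", "fail"):
    # Shed load: stop active streams until memory recovers
    requests.post(f"{BASE}/api/stream/stop", timeout=5)
    print(f"memory check is '{memory_status}' - active streams stopped")
else:
    print(f"memory check is '{memory_status}'")
```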
Log Analysis
Structured JSON Logs
DataSynth emits structured JSON logs with the following fields:
{
"timestamp": "2024-01-15T10:30:00.123Z",
"level": "INFO",
"target": "datasynth_server::rest::routes",
"message": "Configuration update requested",
"thread_id": 42
}
Common Log Queries
Filter by severity:
# SystemD
journalctl -u datasynth-server -p err --since "1 hour ago"
# Docker
docker compose logs datasynth-server | jq 'select(.level == "ERROR" or .level == "WARN")'
Find configuration changes:
journalctl -u datasynth-server | grep "Configuration update"
Track generation throughput:
journalctl -u datasynth-server | grep "entries_generated"
Find API authentication failures:
journalctl -u datasynth-server | grep -i "unauthorized\|invalid api key"
Log Level Configuration
Set per-module log levels with RUST_LOG:
# Everything at info, server REST module at debug
RUST_LOG=info,datasynth_server::rest=debug
# Generation engine at trace (very verbose)
RUST_LOG=info,datasynth_runtime=trace
# Suppress noisy modules
RUST_LOG=info,hyper=warn,tower=warn
Routine Maintenance
Health Check Script
Create a monitoring script for external health checks:
#!/bin/bash
# /usr/local/bin/datasynth-healthcheck.sh
ENDPOINT="${1:-http://localhost:3000}"
# Check health
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$ENDPOINT/health")
if [ "$HTTP_CODE" != "200" ]; then
echo "CRITICAL: Health check failed (HTTP $HTTP_CODE)"
exit 2
fi
# Check readiness
READY=$(curl -s "$ENDPOINT/ready" | jq -r '.ready')
if [ "$READY" != "true" ]; then
echo "WARNING: Server not ready"
exit 1
fi
echo "OK: DataSynth healthy and ready"
exit 0
Prometheus Rule Testing
Validate alert rules before deploying:
# Install promtool
go install github.com/prometheus/prometheus/cmd/promtool@latest
# Test rules
promtool check rules deploy/prometheus-alerts.yml
Backup Checklist
| Item | Location | Frequency |
|---|---|---|
| DataSynth config | /etc/datasynth/server.env | On change |
| Generation configs | YAML files | On change |
| Grafana dashboards | Export as JSON | Weekly |
| Prometheus data | prometheus-data volume | Per retention policy |
| API keys | Kubernetes Secret or env file | On rotation |
Incident Response Template
When a production incident occurs:
- Detect: Alert fires or user reports an issue.
- Triage: Check the `/health`, `/ready`, and `/metrics` endpoints.
- Contain: If generating bad data, stop streams: `POST /api/stream/stop`.
- Diagnose: Collect logs (`journalctl -u datasynth-server --since "1 hour ago"`).
- Resolve: Apply fix (restart, config change, scale up).
- Verify: Confirm `/ready` returns `ready: true` and metrics are flowing.
- Document: Record root cause and remediation steps.
Capacity Planning
This guide provides sizing models, reference benchmarks, and recommendations for provisioning DataSynth deployments.
Performance Characteristics
DataSynth is CPU-bound during generation and I/O-bound during output. Key characteristics:
- Throughput: 100K+ journal entries per second on a single core
- Scaling: Near-linear scaling with CPU cores for batch generation
- Memory: Proportional to active dataset size (companies, accounts, master data)
- Disk: Output size depends on format, compression, and enabled modules
- Network: REST/gRPC overhead is minimal; bulk generation is the bottleneck
Sizing Model
CPU
DataSynth uses Rayon for parallel generation and Tokio for async I/O. The relationship between CPU cores and throughput:
| Cores | Approx. Entries/sec | Use Case |
|---|---|---|
| 1 | 100K | Development, small datasets |
| 2 | 180K | Staging, medium datasets |
| 4 | 350K | Production, large datasets |
| 8 | 650K | High-throughput batch jobs |
| 16 | 1.1M | Maximum single-node throughput |
These numbers are for journal entry generation with balanced debit/credit lines. Enabling additional modules (document flows, subledgers, master data, anomaly injection) reduces throughput by 30-60% due to cross-referencing overhead.
Memory
Memory usage depends on the active generation context:
| Component | Approximate Memory |
|---|---|
| Base server process | 50-100 MB |
| Chart of accounts (small) | 5-10 MB |
| Chart of accounts (large) | 30-50 MB |
| Master data per company (small) | 20-40 MB |
| Master data per company (medium) | 80-150 MB |
| Master data per company (large) | 200-400 MB |
| Active journal entries buffer | 2-5 MB per 10K entries |
| Document flow chains | 50-100 MB per company |
| Anomaly injection engine | 20-50 MB |
Sizing formula (approximate):
Memory (MB) = 100 + (companies * master_data_per_company) + (buffer_entries * 0.5)
Recommended Memory by Config Complexity
| Complexity | Companies | Memory Minimum | Memory Recommended |
|---|---|---|---|
| Small | 1-2 | 512 MB | 1 GB |
| Medium | 3-5 | 1 GB | 2 GB |
| Large | 5-10 | 2 GB | 4 GB |
| Enterprise | 10-20 | 4 GB | 8 GB |
DataSynth includes built-in memory guards that trigger graceful degradation before OOM. See Runbook - Memory Issues for degradation levels.
Disk Sizing
Output Size by Format
The output size depends on the number of entries, enabled modules, and output format:
| Entries | CSV (uncompressed) | JSON (uncompressed) | Parquet (compressed) |
|---|---|---|---|
| 10K | 15-25 MB | 30-50 MB | 3-5 MB |
| 100K | 150-250 MB | 300-500 MB | 30-50 MB |
| 1M | 1.5-2.5 GB | 3-5 GB | 300-500 MB |
| 10M | 15-25 GB | 30-50 GB | 3-5 GB |
These estimates cover journal entries only. Enabling all modules (master data, document flows, subledgers, audit trails, etc.) can multiply total output by 5-10x.
Output Files by Module
When all modules are enabled, a typical generation produces 60+ output files:
| Category | Typical File Count | Size Relative to JE |
|---|---|---|
| Journal entries + ACDOCA | 2-3 | 1.0x (baseline) |
| Master data | 6-8 | 0.3-0.5x |
| Document flows | 8-10 | 1.5-2.0x |
| Subledgers | 8-12 | 1.0-1.5x |
| Period close + consolidation | 5-8 | 0.5-1.0x |
| Labels + controls | 6-10 | 0.1-0.3x |
| Audit trails | 6-8 | 0.3-0.5x |
Disk Provisioning Formula
Disk (GB) = entries_millions * format_multiplier * module_multiplier * safety_margin
Where:
format_multiplier: CSV=0.25, JSON=0.50, Parquet=0.05 (per million entries)
module_multiplier: JE only=1.0, all modules=5.0
safety_margin: 1.5 (for temp files, logs, etc.)
Example: 1M entries, all modules, CSV format:
Disk = 1 * 0.25 * 5.0 * 1.5 = 1.875 GB (round up to 2 GB)
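Both the memory and disk formulas above are simple enough to script for planning. A minimal sketch using the documented multipliers (treat the results as rough estimates; the helper names and illustrative inputs are chosen here, not part of DataSynth):

```python
# Sketch of the sizing formulas from this section.
FORMAT_MULTIPLIER = {"csv": 0.25, "json": 0.50, "parquet": 0.05}  # GB per million entries

def disk_gb(entries_millions: float, fmt: str, all_modules: bool,
            safety_margin: float = 1.5) -> float:
    module_multiplier = 5.0 if all_modules else 1.0
    return entries_millions * FORMAT_MULTIPLIER[fmt] * module_multiplier * safety_margin

def memory_mb(companies: int, master_data_per_company: float, buffer_entries: float) -> float:
    # Memory (MB) = 100 + (companies * master_data_per_company) + (buffer_entries * 0.5)
    return 100 + companies * master_data_per_company + buffer_entries * 0.5

# Reproduces the worked example: 1M entries, all modules, CSV -> 1.875 GB
print(disk_gb(1, "csv", all_modules=True))                         # 1.875
# Illustrative memory inputs only; see the per-component table above
print(memory_mb(companies=3, master_data_per_company=150, buffer_entries=1000))
```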
Reference Benchmarks
Benchmarks run on c5.2xlarge (8 vCPU, 16 GB RAM):
| Scenario | Config | Entries | Time | Throughput | Peak Memory |
|---|---|---|---|---|---|
| Batch (small) | 1 company, small CoA, JE only | 100K | 0.8s | 125K/s | 280 MB |
| Batch (medium) | 3 companies, medium CoA, all modules | 100K | 3.2s | 31K/s | 850 MB |
| Batch (large) | 5 companies, large CoA, all modules | 1M | 45s | 22K/s | 2.1 GB |
| Streaming | 1 company, JE only | continuous | – | 10 events/s | 350 MB |
| Concurrent API | 10 parallel bulk requests | 10K each | 4.5s | 22K/s total | 1.2 GB |
Container Resource Recommendations
Docker / Single Host
| Profile | CPU | Memory | Disk | Use Case |
|---|---|---|---|---|
| Dev | 1 core | 1 GB | 10 GB | Local testing |
| Staging | 2 cores | 2 GB | 50 GB | Integration testing |
| Production | 4 cores | 4 GB | 100 GB | Regular generation |
| Batch worker | 8 cores | 8 GB | 200 GB | Large dataset generation |
Kubernetes
| Profile | requests.cpu | requests.memory | limits.cpu | limits.memory | Replicas |
|---|---|---|---|---|---|
| Light | 250m | 256Mi | 1 | 1Gi | 2 |
| Standard | 500m | 512Mi | 2 | 2Gi | 2-5 |
| Heavy | 1000m | 1Gi | 4 | 4Gi | 3-10 |
| Burst | 2000m | 2Gi | 8 | 8Gi | 5-20 |
Scaling Guidelines
Vertical Scaling (Single Node)
Vertical scaling is effective up to 16 cores. Beyond that, returns diminish due to lock contention in the shared ServerState. Recommendations:
- Start with the “Standard” Kubernetes profile.
- Monitor `synth_entries_per_second` in Grafana.
- If throughput plateaus at high CPU, add replicas instead.
Horizontal Scaling (Multi-Replica)
DataSynth is stateless – each pod generates data independently. Horizontal scaling considerations:
- Enable Redis for shared rate limiting across replicas.
- Use deterministic seeds per replica to avoid duplicate data (seed = base_seed + replica_index); see the sketch after this list.
- Route bulk generation requests to specific replicas if output deduplication matters.
- WebSocket streams are per-connection and do not share state across replicas.
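A sketch of the per-replica seed derivation mentioned above, assuming StatefulSet-style pod names ending in an ordinal (e.g. `datasynth-2`) and that the derived value is written into the generation config's `seed` field before startup:

```python
# Sketch: derive seed = base_seed + replica_index from the pod hostname.
# Adapt the parsing to whatever naming or ordinal source your deployment uses.
import os
import re

BASE_SEED = 42  # the base seed recorded in version control

def replica_index(hostname: str) -> int:
    match = re.search(r"-(\d+)$", hostname)
    return int(match.group(1)) if match else 0

def replica_seed(base_seed: int = BASE_SEED) -> int:
    return base_seed + replica_index(os.environ.get("HOSTNAME", ""))

if __name__ == "__main__":
    # Emit a snippet that can be merged into the generation config
    print(f"seed: {replica_seed()}")
```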
Scaling Decision Tree
Is throughput below target?
|
+-- Yes: Is CPU utilization > 70%?
| |
| +-- Yes: Add more replicas (horizontal)
| +-- No: Is memory > 80%?
| |
| +-- Yes: Increase memory limit
| +-- No: Check I/O (disk throughput, network)
|
+-- No: Current sizing is adequate
Network Bandwidth
DataSynth’s network requirements are modest:
| Operation | Bandwidth | Notes |
|---|---|---|
| Health checks | < 1 KB/s | Negligible |
| Prometheus scrape | 5-10 KB per scrape | Every 10-30s |
| Bulk API response (10K entries) | 5-15 MB burst | Short-lived |
| WebSocket stream | 1-5 KB/s per connection | 10 events/s default |
| gRPC streaming | 2-10 KB/s per stream | Depends on message size |
Network is rarely the bottleneck. A 1 Gbps link supports hundreds of concurrent clients.
Disaster Recovery
DataSynth is a stateless data generation engine. It does not maintain a persistent database or durable state that requires traditional backup and recovery. Instead, recovery relies on two key properties:
- Deterministic generation – Given the same configuration and seed, DataSynth produces identical output.
- Stateless server – The server process can be restarted from scratch at any time.
What Needs to Be Backed Up
| Asset | Location | Recovery Priority |
|---|---|---|
| Generation config (YAML) | /etc/datasynth/, ConfigMap, or source control | Critical |
| Environment / secrets | /etc/datasynth/server.env, K8s Secrets | Critical |
| API keys | Environment variable or Secret | Critical |
| Generated output files | Output directory, object storage | Depends on use case |
| Grafana dashboards | deploy/grafana/provisioning/ or exported JSON | Low – can re-provision |
| Prometheus data | prometheus-data volume | Low – regenerate from metrics |
The generation config and seed are the most important assets. With them, you can reproduce any dataset exactly.
Backup Procedures
Configuration Backup
Store all DataSynth configuration in version control. This is the primary backup mechanism:
# Recommended repository structure
configs/
production/
manufacturing.yaml # Generation config
server.env.encrypted # Encrypted environment file
staging/
retail.yaml
server.env.encrypted
For Kubernetes, export the ConfigMap and Secret:
# Export current config
kubectl -n datasynth get configmap datasynth-config -o yaml > backup/configmap.yaml
# Export secrets (base64-encoded)
kubectl -n datasynth get secret datasynth-api-keys -o yaml > backup/secret.yaml
Output Data Backup
If generated data must be preserved (not just re-generated), back up the output directory:
# Local backup
tar czf datasynth-output-$(date +%F).tar.gz /var/lib/datasynth/output/
# S3 backup
aws s3 sync /var/lib/datasynth/output/ s3://your-bucket/datasynth/$(date +%F)/
Scheduled Backup Script
#!/bin/bash
# /usr/local/bin/datasynth-backup.sh
# Run via cron: 0 2 * * * /usr/local/bin/datasynth-backup.sh
BACKUP_DIR="/var/backups/datasynth"
DATE=$(date +%F)
mkdir -p "$BACKUP_DIR"
# Back up configuration
cp /etc/datasynth/server.env "$BACKUP_DIR/server.env.$DATE"
# Back up output if it exists and is non-empty
if [ -d /var/lib/datasynth/output ] && [ "$(ls -A /var/lib/datasynth/output)" ]; then
tar czf "$BACKUP_DIR/output-$DATE.tar.gz" /var/lib/datasynth/output/
fi
# Retain 30 days of backups
find "$BACKUP_DIR" -type f -mtime +30 -delete
echo "Backup completed: $DATE"
Deterministic Recovery
DataSynth uses ChaCha8 RNG with a configurable seed. When the seed is set in the configuration, every run produces byte-identical output.
Reproducing a Dataset
To reproduce a previous generation run:
- Retrieve the configuration file used for that run.
- Confirm the seed value is set (not random).
- Run the generation with the same configuration.
# Example config with deterministic seed
global:
industry: manufacturing
start_date: "2024-01-01"
period_months: 12
seed: 42 # <-- deterministic seed
# Regenerate identical data
datasynth-data generate --config config.yaml --output ./recovered-output
# Verify output is identical
diff <(sha256sum original-output/*.csv | sort) <(sha256sum recovered-output/*.csv | sort)
Important Caveats for Determinism
Deterministic output requires exact version matching:
| Factor | Must Match? | Notes |
|---|---|---|
| DataSynth version | Yes | Different versions may change generation logic |
| Configuration YAML | Yes | Any parameter change alters output |
| Seed value | Yes | Different seed = different data |
| Operating system | No | Cross-platform determinism is guaranteed |
| CPU architecture | No | ChaCha8 output is platform-independent |
| Number of threads | No | Parallelism does not affect determinism |
If you need to reproduce data from a past release, pin the DataSynth version:
# Docker: use the exact version tag
docker run --rm \
-v $(pwd)/config.yaml:/config.yaml:ro \
-v $(pwd)/output:/output \
datasynth/datasynth-server:0.5.0 \
datasynth-data generate --config /config.yaml --output /output
# Source: checkout the exact tag
git checkout v0.5.0
cargo build --release -p datasynth-cli
Stateless Restart
The DataSynth server maintains no persistent state. All in-memory state (counters, active streams, generation context) is ephemeral. A restart produces a fresh server.
Restart Procedure
Docker:
docker compose restart datasynth-server
Kubernetes:
# Rolling restart (zero downtime with PDB)
kubectl -n datasynth rollout restart deployment/datasynth
# Verify rollout
kubectl -n datasynth rollout status deployment/datasynth
SystemD:
sudo systemctl restart datasynth-server
What Is Lost on Restart
| State | Lost? | Impact |
|---|---|---|
| Prometheus metrics counters | Yes | Counters reset to 0; Prometheus handles counter resets via rate() |
| Active WebSocket streams | Yes | Clients must reconnect |
| Uptime counter | Yes | Resets to 0 |
| In-progress bulk generation | Yes | Client receives connection error; must retry |
| Configuration (if set via API) | Yes | Reverts to default; use ConfigMap or env for persistence |
| Rate limit buckets | Yes | All clients get fresh rate limit windows |
Mitigating Restart Impact
- Use config files, not the API, for persistent configuration. The `POST /api/config` endpoint only updates in-memory state.
- Set up client retry logic for bulk generation requests.
- Use Kubernetes PDB to ensure at least one pod is always running during rolling restarts.
- Monitor with Prometheus – counter resets are handled automatically by `rate()` and `increase()` functions.
Recovery Scenarios
Scenario 1: Server Process Crash
- SystemD or Kubernetes automatically restarts the process.
- Verify with `curl localhost:3000/health`.
- Check logs for crash cause: `journalctl -u datasynth-server -n 200`.
- No data loss – the server is stateless.
Scenario 2: Node Failure (Kubernetes)
- Kubernetes reschedules pods to healthy nodes.
- PDB ensures minimum availability during rescheduling.
- Clients reconnect automatically (Service endpoint updates).
- No manual intervention required.
Scenario 3: Configuration Lost
- Retrieve config from version control.
- Redeploy: `kubectl apply -f configmap.yaml` or copy to `/etc/datasynth/`.
- Restart the server to pick up the new config.
Scenario 4: Need to Reproduce Historical Data
- Identify the DataSynth version and config used.
- Pin the version (Docker tag or Git tag).
- Run generation with the same config and seed.
- Verify with checksums.
Recovery Time Objectives
| Component | RTO | RPO | Notes |
|---|---|---|---|
| Server process | < 30s | N/A (stateless) | Auto-restart via SystemD/K8s |
| Full service (K8s) | < 2 min | N/A (stateless) | Pod scheduling + startup probes |
| Data regeneration | Depends on size | 0 (deterministic) | Re-run with same config+seed |
| Config recovery | < 5 min | Last commit | From version control |
API Reference
DataSynth exposes REST, gRPC, and WebSocket interfaces. This page documents all endpoints, authentication, rate limiting, error formats, and the WebSocket protocol.
Base URLs
| Protocol | Default URL | Port |
|---|---|---|
| REST | http://localhost:3000 | 3000 |
| gRPC | grpc://localhost:50051 | 50051 |
| WebSocket | ws://localhost:3000/ws/ | 3000 |
Authentication
Authentication is optional and disabled by default. When enabled, all endpoints except health probes require a valid API key.
Enabling Authentication
Pass API keys at startup:
# CLI argument
datasynth-server --api-keys "key-1,key-2"
# Environment variable
DATASYNTH_API_KEYS="key-1,key-2" datasynth-server
Sending API Keys
The server accepts API keys via two headers (checked in order):
| Method | Header | Example |
|---|---|---|
| Bearer token | Authorization | Authorization: Bearer your-api-key |
| Custom header | X-API-Key | X-API-Key: your-api-key |
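For example, from a Python client (a sketch using the `requests` package; the key value is a placeholder):

```python
# Sketch: call an authenticated endpoint using either accepted header style.
import requests

BASE = "http://localhost:3000"
API_KEY = "your-api-key"  # placeholder

# Bearer token style
r1 = requests.get(f"{BASE}/api/config",
                  headers={"Authorization": f"Bearer {API_KEY}"}, timeout=10)

# Custom header style
r2 = requests.get(f"{BASE}/api/config",
                  headers={"X-API-Key": API_KEY}, timeout=10)

print(r1.status_code, r2.status_code)  # 200 when the key is valid, 401 otherwise
```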
Exempt Paths
These paths never require authentication, even when auth is enabled:
- `GET /health`
- `GET /ready`
- `GET /live`
- `GET /metrics`
Authentication Internals
- API keys are hashed with Argon2id at server startup.
- Verification iterates all stored hashes (no short-circuit) to prevent timing side-channel attacks.
- A 5-second LRU cache avoids repeated Argon2 verification for rapid successive requests.
Error Responses
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Bearer
API key required. Provide via 'Authorization: Bearer <key>' or 'X-API-Key' header
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Bearer
Invalid API key
Rate Limiting
Rate limiting is configurable and disabled by default. When enabled, it tracks requests per client IP using a sliding window.
Default Configuration
| Parameter | Default | Description |
|---|---|---|
| `max_requests` | 100 | Maximum requests per window |
| `window` | 60 seconds | Time window duration |
| Exempt paths | /health, /ready, /live | Not rate-limited |
Rate Limit Headers
All non-exempt responses include rate limit headers:
| Header | Description |
|---|---|
| `X-RateLimit-Limit` | Maximum requests allowed in the window |
| `X-RateLimit-Remaining` | Requests remaining in the current window |
| `Retry-After` | Seconds until the window resets (only on 429) |
Rate Limit Exceeded Response
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Retry-After: 60
Rate limit exceeded. Max 100 requests per 60 seconds.
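Clients should treat 429 as a signal to back off for the duration indicated by `Retry-After`. A minimal retry sketch (function name and defaults are illustrative):

```python
# Sketch: retry on 429, honoring the Retry-After header.
import time
import requests

def get_with_backoff(url: str, max_retries: int = 3, **kwargs) -> requests.Response:
    resp = requests.get(url, timeout=10, **kwargs)
    for _ in range(max_retries):
        if resp.status_code != 429:
            break
        wait = int(resp.headers.get("Retry-After", "60"))  # fall back to the window length
        time.sleep(wait)
        resp = requests.get(url, timeout=10, **kwargs)
    return resp

resp = get_with_backoff("http://localhost:3000/api/metrics")
print(resp.status_code, resp.headers.get("X-RateLimit-Remaining"))
```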
Client Identification
The rate limiter identifies clients by IP address, checked in order:
- `X-Forwarded-For` header (first IP)
- `X-Real-IP` header
- Fallback: `unknown` (all unidentified clients share a bucket)
Distributed Rate Limiting
For multi-replica deployments, enable Redis-backed rate limiting:
datasynth-server --redis-url redis://127.0.0.1:6379
This requires the redis feature to be enabled at build time.
Security Headers
All responses include the following security headers:
| Header | Value | Purpose |
|---|---|---|
| `X-Content-Type-Options` | `nosniff` | Prevent MIME type sniffing |
| `X-Frame-Options` | `DENY` | Prevent clickjacking |
| `X-XSS-Protection` | `0` | Disable legacy XSS filter (rely on CSP) |
| `Referrer-Policy` | `strict-origin-when-cross-origin` | Control referrer leakage |
| `Content-Security-Policy` | `default-src 'none'; frame-ancestors 'none'` | Restrict resource loading |
| `Cache-Control` | `no-store` | Prevent caching of API responses |
Request ID
Every response includes an X-Request-Id header. If the client sends an X-Request-Id header, its value is preserved. Otherwise, a UUID v4 is generated.
# Client-provided request ID
curl -H "X-Request-Id: my-trace-123" http://localhost:3000/health
# Response header: X-Request-Id: my-trace-123
# Auto-generated request ID
curl -v http://localhost:3000/health
# Response header: X-Request-Id: 550e8400-e29b-41d4-a716-446655440000
CORS Configuration
Default allowed origins:
| Origin | Purpose |
|---|---|
| `http://localhost:5173` | Vite dev server |
| `http://localhost:3000` | Local development |
| `http://127.0.0.1:5173` | Localhost variant |
| `http://127.0.0.1:3000` | Localhost variant |
| `tauri://localhost` | Tauri desktop app |
Allowed methods: GET, POST, PUT, DELETE, OPTIONS
Allowed headers: Content-Type, Authorization, Accept
REST API Endpoints
Health & Metrics
GET /health
Returns overall server health status.
Response 200 OK:
{
"healthy": true,
"version": "0.5.0",
"uptime_seconds": 3600
}
GET /ready
Kubernetes-compatible readiness probe. Performs deep checks (config, memory, disk).
Response 200 OK (when ready):
{
"ready": true,
"message": "Service is ready",
"checks": [
{ "name": "config", "status": "ok" },
{ "name": "memory", "status": "ok" },
{ "name": "disk", "status": "ok" }
]
}
Response 503 Service Unavailable (when not ready):
{
"ready": false,
"message": "Service is not ready",
"checks": [
{ "name": "config", "status": "ok" },
{ "name": "memory", "status": "fail" },
{ "name": "disk", "status": "ok" }
]
}
GET /live
Kubernetes-compatible liveness probe. Lightweight heartbeat.
Response 200 OK:
{
"alive": true,
"timestamp": "2024-01-15T10:30:00.123456789Z"
}
GET /api/metrics
Returns server metrics as JSON.
Response 200 OK:
{
"total_entries_generated": 150000,
"total_anomalies_injected": 750,
"uptime_seconds": 3600,
"session_entries": 150000,
"session_entries_per_second": 41.67,
"active_streams": 2,
"total_stream_events": 50000
}
GET /metrics
Prometheus-compatible metrics in text exposition format.
Response 200 OK (text/plain; version=0.0.4):
# HELP synth_entries_generated_total Total number of journal entries generated
# TYPE synth_entries_generated_total counter
synth_entries_generated_total 150000
# HELP synth_anomalies_injected_total Total number of anomalies injected
# TYPE synth_anomalies_injected_total counter
synth_anomalies_injected_total 750
# HELP synth_uptime_seconds Server uptime in seconds
# TYPE synth_uptime_seconds gauge
synth_uptime_seconds 3600
# HELP synth_entries_per_second Rate of entry generation
# TYPE synth_entries_per_second gauge
synth_entries_per_second 41.67
# HELP synth_active_streams Number of active streaming connections
# TYPE synth_active_streams gauge
synth_active_streams 2
# HELP synth_stream_events_total Total events sent through streams
# TYPE synth_stream_events_total counter
synth_stream_events_total 50000
# HELP synth_info Server version information
# TYPE synth_info gauge
synth_info{version="0.5.0"} 1
Configuration
GET /api/config
Returns the current generation configuration.
Response 200 OK:
{
"success": true,
"message": "Current configuration",
"config": {
"industry": "Manufacturing",
"start_date": "2024-01-01",
"period_months": 12,
"seed": 42,
"coa_complexity": "Medium",
"companies": [
{
"code": "1000",
"name": "Manufacturing Corp",
"currency": "USD",
"country": "US",
"annual_transaction_volume": 100000,
"volume_weight": 1.0
}
],
"fraud_enabled": true,
"fraud_rate": 0.02
}
}
POST /api/config
Updates the generation configuration.
Request body:
{
"industry": "retail",
"start_date": "2024-06-01",
"period_months": 6,
"seed": 12345,
"coa_complexity": "large",
"companies": [
{
"code": "1000",
"name": "Retail Corp",
"currency": "USD",
"country": "US",
"annual_transaction_volume": 200000,
"volume_weight": 1.0
}
],
"fraud_enabled": true,
"fraud_rate": 0.05
}
Valid industries: manufacturing, retail, financial_services, healthcare, technology, professional_services, energy, transportation, real_estate, telecommunications
Valid CoA complexities: small, medium, large
Response 200 OK:
{
"success": true,
"message": "Configuration updated and applied",
"config": { ... }
}
Error 400 Bad Request:
{
"success": false,
"message": "Unknown industry: 'invalid'. Valid values: manufacturing, retail, ...",
"config": null
}
Generation
POST /api/generate/bulk
Generates journal entries in a single batch. Maximum 1,000,000 entries per request.
Request body:
{
"entry_count": 10000,
"include_master_data": true,
"inject_anomalies": true
}
All fields are optional. Without entry_count, the server uses the configured volume.
Response 200 OK:
{
"success": true,
"entries_generated": 10000,
"duration_ms": 450,
"anomaly_count": 50
}
Error 400 Bad Request (entry count too large):
entry_count (2000000) exceeds maximum allowed value (1000000)
Streaming Control
POST /api/stream/start
Starts the event stream. WebSocket clients begin receiving events.
Request body:
{
"events_per_second": 10,
"max_events": 10000,
"inject_anomalies": false
}
POST /api/stream/stop
Stops all active streams.
POST /api/stream/pause
Pauses active streams. Events stop flowing but connections remain open.
POST /api/stream/resume
Resumes paused streams.
POST /api/stream/trigger/:pattern
Triggers a named generation pattern for upcoming streamed entries.
Valid patterns: year_end_spike, period_end_spike, holiday_cluster, fraud_cluster, error_cluster, uniform, custom:*
Response:
{
"success": true,
"message": "Pattern 'year_end_spike' will be applied to upcoming entries"
}
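Putting the control endpoints together from a script looks roughly like this (a sketch; payload values are illustrative, and an API key header would be added if authentication is enabled):

```python
# Sketch: drive the streaming lifecycle over the REST control endpoints.
import requests

BASE = "http://localhost:3000"

# Start streaming at 10 events/s, capped at 10,000 events
requests.post(f"{BASE}/api/stream/start",
              json={"events_per_second": 10, "max_events": 10000, "inject_anomalies": False},
              timeout=10).raise_for_status()

# Apply a named pattern to upcoming entries
requests.post(f"{BASE}/api/stream/trigger/year_end_spike", timeout=10).raise_for_status()

# Pause, resume, and finally stop all streams
requests.post(f"{BASE}/api/stream/pause", timeout=10)
requests.post(f"{BASE}/api/stream/resume", timeout=10)
requests.post(f"{BASE}/api/stream/stop", timeout=10)
```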
WebSocket Protocol
ws://localhost:3000/ws/metrics
Sends metrics updates every 1 second as JSON text frames:
{
"timestamp": "2024-01-15T10:30:00.123Z",
"total_entries": 150000,
"total_anomalies": 750,
"entries_per_second": 41.67,
"active_streams": 2,
"uptime_seconds": 3600
}
ws://localhost:3000/ws/events
Streams generated journal entry events as JSON text frames:
{
"sequence": 1234,
"timestamp": "2024-01-15T10:30:00.456Z",
"event_type": "JournalEntry",
"document_id": "JE-2024-001234",
"company_code": "1000",
"amount": "15000.00",
"is_anomaly": false
}
Connection Management
- The server responds to WebSocket `Ping` frames with `Pong`.
- Clients should send periodic pings to keep the connection alive through proxies.
- Close the connection by sending a WebSocket `Close` frame.
- The server decrements `active_streams` when a client disconnects.
Example: Connecting with wscat
# Install wscat
npm install -g wscat
# Connect to metrics stream
wscat -c ws://localhost:3000/ws/metrics
# Connect to event stream
wscat -c ws://localhost:3000/ws/events
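Example: Connecting with Python
A minimal client sketch, assuming the third-party `websockets` package (`pip install websockets`) and that authentication is disabled:

```python
# Sketch: consume the journal entry event stream over WebSocket.
import asyncio
import json
import websockets

async def consume(url: str = "ws://localhost:3000/ws/events") -> None:
    async with websockets.connect(url) as ws:
        async for frame in ws:
            event = json.loads(frame)
            flag = " [ANOMALY]" if event.get("is_anomaly") else ""
            print(f"{event['sequence']} {event['event_type']} {event['amount']}{flag}")

asyncio.run(consume())
```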
Example: Connecting with curl (WebSocket)
curl --include \
--no-buffer \
--header "Connection: Upgrade" \
--header "Upgrade: websocket" \
--header "Sec-WebSocket-Version: 13" \
--header "Sec-WebSocket-Key: $(openssl rand -base64 16)" \
http://localhost:3000/ws/events
Request Timeout
The default request timeout is 300 seconds (5 minutes), which accommodates large bulk generation requests. Requests exceeding this timeout receive a 408 Request Timeout response.
Error Format
REST API errors follow a consistent format:
Validation errors return JSON:
{
"success": false,
"message": "Descriptive error message",
"config": null
}
Server errors return plain text:
HTTP/1.1 500 Internal Server Error
Generation failed: <error description>
HTTP Status Codes
| Code | Meaning | When |
|---|---|---|
| 200 | Success | Request completed |
| 400 | Bad Request | Invalid parameters |
| 401 | Unauthorized | Missing or invalid API key |
| 408 | Request Timeout | Request exceeded 300s timeout |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Generation or server failure |
| 503 | Service Unavailable | Readiness check failed |
Security Hardening
This guide provides a pre-deployment security checklist and detailed guidance on TLS, secrets management, container security, and audit logging for DataSynth.
Pre-Deployment Checklist
Complete this checklist before exposing DataSynth to any network beyond localhost:
| # | Item | Priority | Status |
|---|---|---|---|
| 1 | Enable API key authentication | Critical | |
| 2 | Use strong, unique API keys (32+ chars) | Critical | |
| 3 | Enable TLS (direct or via reverse proxy) | Critical | |
| 4 | Set explicit CORS allowed origins | High | |
| 5 | Enable rate limiting | High | |
| 6 | Run as non-root user | High | |
| 7 | Use read-only root filesystem (container) | High | |
| 8 | Drop all Linux capabilities | High | |
| 9 | Set resource limits (memory, CPU, file descriptors) | High | |
| 10 | Restrict network exposure (firewall, security groups) | High | |
| 11 | Enable structured logging to a central log aggregator | Medium | |
| 12 | Set up Prometheus monitoring and alerts | Medium | |
| 13 | Rotate API keys periodically | Medium | |
| 14 | Review and restrict CORS origins | Medium | |
| 15 | Enable mTLS for gRPC if used in service mesh | Low | |
Authentication Hardening
Strong API Keys
Generate cryptographically strong API keys:
# Generate a 48-character random key
openssl rand -base64 36
# Example output: kZ9mR3xY7pQ2wV5nL8jH4cF6gT0aD1bE3sU9iO7
Recommendations:
- Minimum 32 characters, ideally 48+
- Use different keys per environment (dev, staging, production)
- Use different keys per client/team when possible
- Rotate keys quarterly or after any suspected compromise
Argon2id Hashing
DataSynth hashes API keys with Argon2id (the recommended password hashing algorithm). Keys are hashed at startup; the plaintext is never stored in memory after hashing.
For pre-hashed keys (avoiding plaintext in environment variables), hash the key externally and pass the PHC-format hash:
# Python example: pre-hash an API key
from argon2 import PasswordHasher
ph = PasswordHasher()
hash = ph.hash("your-api-key")
print(hash)
# $argon2id$v=19$m=65536,t=3,p=4$...
Pass the pre-hashed value to the server via the AuthConfig::with_prehashed_keys() API (for embedded use) or store in a secrets manager.
API Key Rotation
To rotate keys without downtime:
- Add the new key to `DATASYNTH_API_KEYS` alongside the old key.
- Restart the server (rolling restart in K8s).
- Update all clients to use the new key.
- Remove the old key from `DATASYNTH_API_KEYS`.
- Restart again.
TLS Configuration
Option 1: Reverse Proxy TLS (Recommended)
Terminate TLS at a reverse proxy (Nginx, Envoy, cloud load balancer) and forward plain HTTP to DataSynth. See TLS & Reverse Proxy for full configurations.
Advantages:
- Centralized certificate management
- Standard renewal workflows (cert-manager, Let’s Encrypt)
- Offloads TLS from the application
- Easier to audit and rotate certificates
Option 2: Native TLS
Build DataSynth with TLS support:
cargo build --release -p datasynth-server --features tls
Run with certificate and key:
datasynth-server \
--tls-cert /etc/datasynth/tls/cert.pem \
--tls-key /etc/datasynth/tls/key.pem
Certificate Requirements
| Requirement | Detail |
|---|---|
| Format | PEM-encoded X.509 |
| Key type | RSA 2048+ or ECDSA P-256/P-384 |
| Protocol | TLS 1.2 or 1.3 (1.0/1.1 disabled) |
| Cipher suites | HIGH:!aNULL:!MD5 (Nginx default) |
| Subject Alternative Name | Must match the hostname clients use |
mTLS for gRPC
For service-to-service communication, configure mutual TLS:
# Nginx mTLS configuration
server {
listen 50051 ssl http2;
ssl_certificate /etc/ssl/certs/server.pem;
ssl_certificate_key /etc/ssl/private/server-key.pem;
# Client certificate verification
ssl_client_certificate /etc/ssl/certs/ca.pem;
ssl_verify_client on;
location / {
grpc_pass grpc://127.0.0.1:50051;
}
}
Secret Management
Environment Variables
For simple deployments, store secrets in environment files with restricted permissions:
# Create the environment file
sudo install -m 640 -o root -g datasynth /dev/null /etc/datasynth/server.env
# Edit the file
sudo vi /etc/datasynth/server.env
Never commit plaintext secrets to version control. Use .gitignore to exclude env files.
Kubernetes Secrets
For Kubernetes, store API keys in a Secret resource:
apiVersion: v1
kind: Secret
metadata:
name: datasynth-api-keys
namespace: datasynth
type: Opaque
stringData:
api-keys: "key-1,key-2"
External Secrets Operator
For production, integrate with a secrets manager via the External Secrets Operator:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: datasynth-api-keys
namespace: datasynth
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secretsmanager
kind: ClusterSecretStore
target:
name: datasynth-api-keys
data:
- secretKey: api-keys
remoteRef:
key: datasynth/api-keys
HashiCorp Vault
Inject secrets via the Vault Agent sidecar:
# Pod annotations for Vault Agent Injector
podAnnotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "datasynth"
vault.hashicorp.com/agent-inject-secret-api-keys: "secret/data/datasynth/api-keys"
vault.hashicorp.com/agent-inject-template-api-keys: |
{{- with secret "secret/data/datasynth/api-keys" -}}
{{ .Data.data.keys }}
{{- end -}}
Container Security
Distroless Base Image
The production Dockerfile uses gcr.io/distroless/cc-debian12, which contains:
- No shell (`/bin/sh`, `/bin/bash`)
- No package manager
- No unnecessary system utilities
- Only the C runtime library and certificates
This minimizes the attack surface and prevents shell-based exploits.
Security Context (Kubernetes)
The Helm chart enforces the following security context:
podSecurityContext:
runAsNonRoot: true # Pod must run as non-root
runAsUser: 1000 # UID 1000
runAsGroup: 1000 # GID 1000
fsGroup: 1000 # Filesystem group
securityContext:
allowPrivilegeEscalation: false # No setuid/setgid
readOnlyRootFilesystem: true # Read-only root FS
capabilities:
drop:
- ALL # Drop all Linux capabilities
SystemD Sandboxing
The SystemD unit file includes comprehensive sandboxing:
NoNewPrivileges=true # Prevent privilege escalation
ProtectSystem=strict # Read-only filesystem
ProtectHome=true # Hide home directories
PrivateTmp=true # Isolated /tmp
PrivateDevices=true # No device access
ProtectKernelTunables=true # No sysctl modification
ProtectKernelModules=true # No module loading
ProtectControlGroups=true # No cgroup modification
RestrictNamespaces=true # No namespace creation
RestrictRealtime=true # No realtime scheduling
RestrictSUIDSGID=true # No SUID/SGID
Image Scanning
Scan the container image for vulnerabilities before deployment:
# Trivy
trivy image datasynth/datasynth-server:0.5.0
# Grype
grype datasynth/datasynth-server:0.5.0
# Docker Scout
docker scout cves datasynth/datasynth-server:0.5.0
The distroless base image has a minimal CVE surface. Address any findings in the Rust dependencies via cargo audit:
cargo install cargo-audit
cargo audit
Network Security
Principle of Least Exposure
Only expose the ports and endpoints that clients need:
| Deployment | Expose REST (3000) | Expose gRPC (50051) | Expose Metrics |
|---|---|---|---|
| Internal API only | Via Ingress/LB | Via Ingress/LB | Prometheus only |
| Public API | Via Ingress + WAF | No | No |
| Dev/staging | Localhost only | Localhost only | Localhost only |
Network Policies (Kubernetes)
Restrict pod-to-pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: datasynth-allow-ingress
namespace: datasynth
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: datasynth
policyTypes:
- Ingress
ingress:
# Allow from Ingress controller
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
ports:
- port: 3000
protocol: TCP
# Allow Prometheus scraping
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: monitoring
ports:
- port: 3000
protocol: TCP
CORS Lockdown
In production, override the default CORS configuration to allow only your application’s domain:
#![allow(unused)]
fn main() {
// Programmatic configuration
let cors = CorsConfig {
allowed_origins: vec![
"https://app.example.com".to_string(),
],
allow_any_origin: false,
};
}
Never enable allow_any_origin: true in production.
Audit Logging
Request Tracing
Every request receives an X-Request-Id header (auto-generated UUID v4 or client-provided). Use this to correlate logs across services.
Structured Log Fields
DataSynth emits JSON-structured logs with the following fields useful for security auditing:
{
"timestamp": "2024-01-15T10:30:00.123Z",
"level": "INFO",
"target": "datasynth_server::rest::routes",
"message": "Configuration update requested: industry=retail, period_months=6",
"thread_id": 42
}
Log Events to Monitor
| Event | Log Pattern | Severity |
|---|---|---|
| Authentication failure | Unauthorized / Invalid API key | High |
| Rate limit exceeded | Rate limit exceeded | Medium |
| Configuration change | Configuration update requested | Medium |
| Stream start/stop | Stream started / Stream stopped | Low |
| WebSocket connection | WebSocket connected / disconnected | Low |
| Server panic | Server panic: | Critical |
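A minimal sketch for scanning the JSON-structured log stream for these patterns (the log file path is an assumption; the message substrings follow the table above):
import json
PATTERNS = {
    "Unauthorized": "High",
    "Invalid API key": "High",
    "Rate limit exceeded": "Medium",
    "Configuration update requested": "Medium",
    "Server panic": "Critical",
}
# Point this at your aggregated log output (path is illustrative)
with open("datasynth-server.log", encoding="utf-8") as log_file:
    for line in log_file:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        message = record.get("message", "")
        for pattern, severity in PATTERNS.items():
            if pattern in message:
                print(f"[{severity}] {record.get('timestamp')} {message}")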
Centralized Logging
Forward structured logs to a central aggregator:
Docker:
services:
datasynth-server:
logging:
driver: "fluentd"
options:
fluentd-address: "localhost:24224"
tag: "datasynth.server"
SystemD to Loki:
# Install Promtail for journal forwarding
# /etc/promtail/config.yaml
scrape_configs:
- job_name: datasynth
journal:
matches:
- _SYSTEMD_UNIT=datasynth-server.service
labels:
job: datasynth
RBAC (Kubernetes)
The Helm chart creates a ServiceAccount by default. Bind minimal permissions:
serviceAccount:
create: true
automount: true # Only if needed by the application
annotations: {}
DataSynth does not require any Kubernetes API access. If automount is not needed, set it to false to prevent the ServiceAccount token from being mounted into the pod.
Supply Chain Security
Reproducible Builds
The Dockerfile uses pinned versions:
- rust:1.88-bookworm – pinned Rust compiler version
- gcr.io/distroless/cc-debian12 – pinned distroless image
- cargo-chef --locked – locked dependency resolution
Dependency Auditing
# Check for known vulnerabilities
cargo audit
# Check for unmaintained or yanked crates
cargo audit --deny warnings
Run cargo audit in CI on every pull request.
SBOM Generation
Generate a Software Bill of Materials for compliance:
# Using cargo-cyclonedx
cargo install cargo-cyclonedx
cargo cyclonedx --all
# Using syft for container images
syft datasynth/datasynth-server:0.5.0 -o cyclonedx-json > sbom.json
TLS & Reverse Proxy Configuration
DataSynth server supports TLS in two ways:
- Native TLS (with the tls feature flag) - direct rustls termination
- Reverse Proxy - recommended for production deployments
Native TLS
Build with TLS support:
cargo build --release -p datasynth-server --features tls
Run with certificate and key:
datasynth-server --tls-cert /path/to/cert.pem --tls-key /path/to/key.pem
Nginx Reverse Proxy
upstream datasynth_rest {
server 127.0.0.1:3000;
}
upstream datasynth_grpc {
server 127.0.0.1:50051;
}
server {
listen 443 ssl http2;
server_name datasynth.example.com;
ssl_certificate /etc/ssl/certs/datasynth.pem;
ssl_certificate_key /etc/ssl/private/datasynth-key.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
# REST API
location / {
proxy_pass http://datasynth_rest;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 300s;
}
# WebSocket
location /ws/ {
proxy_pass http://datasynth_rest;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_read_timeout 3600s;
}
# gRPC
location /synth_server. {
grpc_pass grpc://datasynth_grpc;
grpc_read_timeout 300s;
}
}
Envoy Proxy
static_resources:
listeners:
- name: listener_0
address:
socket_address:
address: 0.0.0.0
port_value: 443
filter_chains:
- transport_socket:
name: envoy.transport_sockets.tls
typed_config:
"@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
common_tls_context:
tls_certificates:
- certificate_chain:
filename: /etc/ssl/certs/datasynth.pem
private_key:
filename: /etc/ssl/private/datasynth-key.pem
filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
stat_prefix: ingress_http
route_config:
name: local_route
virtual_hosts:
- name: datasynth
domains: ["*"]
routes:
- match:
prefix: "/"
route:
cluster: datasynth_rest
timeout: 300s
http_filters:
- name: envoy.filters.http.router
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
clusters:
- name: datasynth_rest
connect_timeout: 5s
type: STRICT_DNS
load_assignment:
cluster_name: datasynth_rest
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: 127.0.0.1
port_value: 3000
Health Check Configuration
For load balancers, use these health check endpoints:
| Endpoint | Purpose | Expected Response |
|---|---|---|
| GET /health | Basic health | 200 OK |
| GET /ready | Readiness probe | 200 OK / 503 Unavailable |
| GET /live | Liveness probe | 200 OK |
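As an illustration (base URL assumed), a deployment script can poll the readiness endpoint before routing traffic:
import time
import requests
BASE_URL = "http://localhost:3000"  # assumed deployment address
def wait_until_ready(timeout_secs: int = 60) -> bool:
    """Poll GET /ready until it returns 200 or the timeout expires."""
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE_URL}/ready", timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not reachable yet
        time.sleep(2)
    return False
print("ready" if wait_until_ready() else "not ready")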
Use Cases
Real-world applications for SyntheticData.
Overview
| Use Case | Description |
|---|---|
| Fraud Detection ML | Train supervised fraud models |
| Audit Analytics | Test audit procedures |
| SOX Compliance | Test control monitoring |
| Process Mining | Generate OCEL 2.0 event logs |
| ERP Load Testing | Load and stress testing |
Use Case Summary
| Use Case | Key Features | Output Focus |
|---|---|---|
| Fraud Detection | Anomaly injection, graph export | Labels, graphs |
| Audit Analytics | Full document flows, controls | Transactions, controls |
| SOX Compliance | SoD rules, approval workflows | Controls, violations |
| Process Mining | OCEL 2.0 export | Event logs |
| ERP Testing | High volume, realistic patterns | Raw transactions |
Quick Configuration
Fraud Detection
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
graph_export:
enabled: true
formats:
- pytorch_geometric
Audit Analytics
document_flows:
p2p:
enabled: true
o2c:
enabled: true
internal_controls:
enabled: true
SOX Compliance
internal_controls:
enabled: true
sod_rules: [...]
approval:
enabled: true
Process Mining
document_flows:
p2p:
enabled: true
o2c:
enabled: true
# Use datasynth-ocpm for OCEL 2.0 export
ERP Testing
transactions:
target_count: 1000000
output:
format: csv
Selecting a Use Case
Choose Fraud Detection if:
- Training ML/AI models
- Building anomaly detection systems
- Need labeled datasets
Choose Audit Analytics if:
- Testing audit software
- Validating analytical procedures
- Need complete document trails
Choose SOX Compliance if:
- Testing control monitoring systems
- Validating SoD enforcement
- Need control test data
Choose Process Mining if:
- Using PM4Py, Celonis, or similar tools
- Need OCEL 2.0 compliant logs
- Analyzing business processes
Choose ERP Testing if:
- Load testing financial systems
- Performance benchmarking
- Need high-volume realistic data
Combining Use Cases
Use cases can be combined:
# Fraud detection + audit analytics
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
document_flows:
p2p:
enabled: true
o2c:
enabled: true
internal_controls:
enabled: true
graph_export:
enabled: true
See Also
Fraud Detection ML
Train machine learning models for financial fraud detection.
Overview
SyntheticData generates labeled datasets for supervised fraud detection:
- 20+ fraud patterns with full labels
- Graph representations for GNN models
- Realistic data distributions
- Configurable fraud rates and types
Configuration
global:
seed: 42
industry: financial_services
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 100000
fraud:
enabled: true
fraud_rate: 0.02 # 2% fraud rate
types:
split_transaction: 0.20
duplicate_payment: 0.15
fictitious_transaction: 0.15
ghost_employee: 0.10
kickback_scheme: 0.10
revenue_manipulation: 0.10
expense_capitalization: 0.10
unauthorized_discount: 0.10
anomaly_injection:
enabled: true
total_rate: 0.02
generate_labels: true
categories:
fraud: 1.0 # Focus on fraud only
graph_export:
enabled: true
formats:
- pytorch_geometric
split:
train: 0.7
val: 0.15
test: 0.15
stratify: is_fraud
output:
format: csv
Output Files
Tabular Data
output/
├── transactions/
│ └── journal_entries.csv
├── labels/
│ ├── anomaly_labels.csv
│ └── fraud_labels.csv
└── master_data/
└── ...
Graph Data
output/graphs/transaction_network/pytorch_geometric/
├── node_features.pt
├── edge_index.pt
├── edge_attr.pt
├── labels.pt
├── train_mask.pt
├── val_mask.pt
└── test_mask.pt
ML Pipeline
1. Load Data
import pandas as pd
import torch
# Load tabular data
entries = pd.read_csv('output/transactions/journal_entries.csv')
labels = pd.read_csv('output/labels/fraud_labels.csv')
# Merge
data = entries.merge(labels, on='document_id', how='left')
data['is_fraud'] = data['fraud_type'].notna()
print(f"Total entries: {len(data)}")
print(f"Fraud entries: {data['is_fraud'].sum()}")
print(f"Fraud rate: {data['is_fraud'].mean():.2%}")
2. Feature Engineering
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Numerical features
numerical_features = [
    'debit_amount', 'credit_amount', 'line_count'
]
# Derived features
data['log_amount'] = np.log1p(data['debit_amount'] + data['credit_amount'])
data['is_round'] = (data['debit_amount'] % 100 == 0).astype(int)
data['is_weekend'] = pd.to_datetime(data['posting_date']).dt.dayofweek >= 5
data['is_month_end'] = pd.to_datetime(data['posting_date']).dt.day >= 28
derived_features = ['log_amount', 'is_round', 'is_weekend', 'is_month_end']
# Categorical features
categorical_features = ['source', 'business_process', 'company_code']
3. Train Model (Tabular)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
# Prepare features
X = data[numerical_features + derived_features]
y = data['is_fraud']
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
4. Train GNN Model
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.data import Data
# Load graph data
node_features = torch.load('output/graphs/.../node_features.pt')
edge_index = torch.load('output/graphs/.../edge_index.pt')
labels = torch.load('output/graphs/.../labels.pt')
train_mask = torch.load('output/graphs/.../train_mask.pt')
val_mask = torch.load('output/graphs/.../val_mask.pt')
test_mask = torch.load('output/graphs/.../test_mask.pt')
data = Data(
x=node_features,
edge_index=edge_index,
y=labels,
train_mask=train_mask,
val_mask=val_mask,
test_mask=test_mask,
)
# Define GNN
class FraudGNN(torch.nn.Module):
def __init__(self, num_features, hidden_channels):
super().__init__()
self.conv1 = GCNConv(num_features, hidden_channels)
self.conv2 = GCNConv(hidden_channels, hidden_channels)
self.linear = torch.nn.Linear(hidden_channels, 2)
def forward(self, x, edge_index):
x = self.conv1(x, edge_index).relu()
x = F.dropout(x, p=0.5, training=self.training)
x = self.conv2(x, edge_index).relu()
x = self.linear(x)
return x
# Train
model = FraudGNN(data.num_features, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(200):
model.train()
optimizer.zero_grad()
out = model(data.x, data.edge_index)
loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
loss.backward()
optimizer.step()
# Validation
if epoch % 10 == 0:
model.eval()
pred = out.argmax(dim=1)
val_acc = (pred[data.val_mask] == data.y[data.val_mask]).float().mean()
print(f'Epoch {epoch}: Val Acc: {val_acc:.4f}')
Fraud Types for Training
| Type | Detection Approach | Difficulty |
|---|---|---|
| Split Transaction | Amount patterns | Easy |
| Duplicate Payment | Similarity matching | Easy |
| Fictitious Transaction | Anomaly detection | Medium |
| Ghost Employee | Entity verification | Medium |
| Kickback Scheme | Relationship analysis | Hard |
| Revenue Manipulation | Trend analysis | Hard |
Best Practices
Class Imbalance
from imblearn.over_sampling import SMOTE
# Handle imbalanced classes
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
Threshold Tuning
from sklearn.metrics import precision_recall_curve
# Find optimal threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
f1_scores = 2 * precision * recall / (precision + recall)
optimal_idx = f1_scores.argmax()
optimal_threshold = thresholds[optimal_idx]
Cross-Validation
from sklearn.model_selection import StratifiedKFold, cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV ROC-AUC: {scores.mean():.4f} (+/- {scores.std():.4f})")
See Also
Audit Analytics
Test audit procedures and analytical tools with realistic data.
Overview
SyntheticData generates comprehensive datasets for audit analytics:
- Complete document trails
- Known control exceptions
- Benford’s Law compliant amounts
- Realistic temporal patterns
Configuration
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 100000
benford:
enabled: true # Realistic first-digit distribution
temporal:
month_end_spike: 2.5
quarter_end_spike: 3.0
year_end_spike: 4.0
document_flows:
p2p:
enabled: true
flow_rate: 0.35
three_way_match:
quantity_tolerance: 0.02
price_tolerance: 0.01
o2c:
enabled: true
flow_rate: 0.35
master_data:
vendors:
count: 200
customers:
count: 500
internal_controls:
enabled: true
anomaly_injection:
enabled: true
total_rate: 0.03
generate_labels: true
categories:
fraud: 0.20
error: 0.50
process_issue: 0.30
output:
format: csv
Audit Procedures
1. Benford’s Law Analysis
Test first-digit distribution of amounts:
import pandas as pd
import numpy as np
from scipy import stats
# Load data
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Extract first digits
amounts = entries['debit_amount'] + entries['credit_amount']
amounts = amounts[amounts > 0]
first_digits = amounts.apply(lambda x: int(str(x)[0]))
# Calculate observed distribution
observed = first_digits.value_counts().sort_index()
observed_freq = observed / observed.sum()
# Expected Benford distribution
benford = {d: np.log10(1 + 1/d) for d in range(1, 10)}
# Chi-square test
chi_stat, p_value = stats.chisquare(
observed.values,
[benford[d] * observed.sum() for d in range(1, 10)]
)
print(f"Chi-square: {chi_stat:.2f}, p-value: {p_value:.4f}")
2. Three-Way Match Testing
Verify PO, GR, and Invoice alignment:
# Load documents
po = pd.read_csv('output/documents/purchase_orders.csv')
gr = pd.read_csv('output/documents/goods_receipts.csv')
inv = pd.read_csv('output/documents/vendor_invoices.csv')
# Join on references
matched = po.merge(gr, left_on='po_number', right_on='po_reference')
matched = matched.merge(inv, left_on='po_number', right_on='po_reference')
# Calculate variances
matched['qty_variance'] = abs(matched['gr_quantity'] - matched['po_quantity']) / matched['po_quantity']
matched['price_variance'] = abs(matched['inv_unit_price'] - matched['po_unit_price']) / matched['po_unit_price']
# Identify exceptions
qty_exceptions = matched[matched['qty_variance'] > 0.02]
price_exceptions = matched[matched['price_variance'] > 0.01]
print(f"Quantity exceptions: {len(qty_exceptions)}")
print(f"Price exceptions: {len(price_exceptions)}")
3. Duplicate Payment Detection
Find potential duplicate payments:
# Load payments and invoices
payments = pd.read_csv('output/documents/payments.csv')
invoices = pd.read_csv('output/documents/vendor_invoices.csv')
# Group by vendor and amount
potential_dups = invoices.groupby(['vendor_id', 'total_amount']).filter(
lambda x: len(x) > 1
)
# Check payment dates
duplicates = []
for (vendor, amount), group in potential_dups.groupby(['vendor_id', 'total_amount']):
if len(group) > 1:
duplicates.append({
'vendor': vendor,
'amount': amount,
'count': len(group),
'invoices': group['invoice_number'].tolist()
})
print(f"Potential duplicate payments: {len(duplicates)}")
4. Journal Entry Testing
Analyze manual journal entries:
# Load entries
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Filter manual entries
manual = entries[entries['source'] == 'Manual']
# Analyze characteristics
print(f"Manual entries: {len(manual)}")
print(f"Weekend entries: {manual['is_weekend'].sum()}")
print(f"Month-end entries: {manual['is_month_end'].sum()}")
# Top accounts with manual entries
top_accounts = manual.groupby('account_number').size().sort_values(ascending=False).head(10)
5. Cutoff Testing
Verify transactions recorded in correct period:
# Identify late postings
entries['posting_date'] = pd.to_datetime(entries['posting_date'])
entries['document_date'] = pd.to_datetime(entries['document_date'])
entries['posting_lag'] = (entries['posting_date'] - entries['document_date']).dt.days
# Find entries posted after period end
late_postings = entries[entries['posting_lag'] > 5]
print(f"Late postings: {len(late_postings)}")
# Check year-end cutoff: documents dated in a prior year but posted in the latest year
latest_year = entries['posting_date'].dt.year.max()
cutoff_issues = entries[
    (entries['document_date'].dt.year < latest_year) &
    (entries['posting_date'].dt.year == latest_year)
]
print(f"Year-end cutoff exceptions: {len(cutoff_issues)}")
6. Segregation of Duties
Check for SoD violations:
# Load controls data
sod_rules = pd.read_csv('output/controls/sod_rules.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Find entries with SoD violations
violations = entries[entries['sod_violation'] == True]
print(f"SoD violations: {len(violations)}")
# Analyze by conflict type
violation_types = violations.groupby('sod_conflict_type').size()
Audit Analytics Dashboard
Key Metrics
| Metric | Query | Expected |
|---|---|---|
| Benford Chi-square | First-digit test | < 15.51 (p > 0.05) |
| Match exceptions | Three-way match | < 2% |
| Duplicate indicators | Amount/vendor matching | < 0.5% |
| Late postings | Document vs posting date | < 1% |
| SoD violations | Control violations | Known from labels |
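The 15.51 cutoff is the chi-square critical value for 8 degrees of freedom (nine first digits) at the 5% significance level, which can be reproduced with SciPy:
from scipy import stats
# 95th percentile of the chi-square distribution with 8 degrees of freedom
critical_value = stats.chi2.ppf(0.95, df=8)
print(f"{critical_value:.2f}")  # 15.51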
Population Statistics
# Summary statistics
print("=== Audit Population Summary ===")
print(f"Total transactions: {len(entries):,}")
print(f"Total amount: ${entries['debit_amount'].sum():,.2f}")
print(f"Unique vendors: {entries['vendor_id'].nunique()}")
print(f"Unique customers: {entries['customer_id'].nunique()}")
print(f"Date range: {entries['posting_date'].min()} to {entries['posting_date'].max()}")
7. Financial Statement Analytics (v0.6.0)
Analyze generated financial statements for consistency, trend analysis, and ratio testing:
import pandas as pd
# Load financial statements
balance_sheet = pd.read_csv('output/financial_reporting/balance_sheet.csv')
income_stmt = pd.read_csv('output/financial_reporting/income_statement.csv')
cash_flow = pd.read_csv('output/financial_reporting/cash_flow.csv')
# Verify accounting equation holds
for _, row in balance_sheet.iterrows():
assets = row['total_assets']
liabilities = row['total_liabilities']
equity = row['total_equity']
imbalance = abs(assets - (liabilities + equity))
assert imbalance < 0.01, f"A=L+E violation: {imbalance}"
# Analytical procedures: ratio analysis
ratios = pd.DataFrame({
'period': balance_sheet['period'],
'current_ratio': balance_sheet['current_assets'] / balance_sheet['current_liabilities'],
'gross_margin': income_stmt['gross_profit'] / income_stmt['revenue'],
'debt_to_equity': balance_sheet['total_liabilities'] / balance_sheet['total_equity'],
})
# Flag unusual ratio movements (> 2 std devs from mean)
for col in ['current_ratio', 'gross_margin', 'debt_to_equity']:
mean = ratios[col].mean()
std = ratios[col].std()
outliers = ratios[abs(ratios[col] - mean) > 2 * std]
if len(outliers) > 0:
print(f"Unusual {col} in periods: {outliers['period'].tolist()}")
Budget Variance Analysis
When budgets are enabled, compare budget to actual for each account:
# Load budget vs actual data
budget = pd.read_csv('output/financial_reporting/budget_vs_actual.csv')
# Calculate variance percentage
budget['variance_pct'] = (budget['actual'] - budget['budget']) / budget['budget']
# Identify material variances (> 10%)
material = budget[abs(budget['variance_pct']) > 0.10]
print(f"Material variances: {len(material)} accounts")
print(material[['account', 'budget', 'actual', 'variance_pct']].to_string())
# Favorable vs unfavorable analysis
favorable = budget[
((budget['account_type'] == 'revenue') & (budget['variance_pct'] > 0)) |
((budget['account_type'] == 'expense') & (budget['variance_pct'] < 0))
]
print(f"Favorable variances: {len(favorable)}")
Management KPI Trend Analysis
# Load KPI data
kpis = pd.read_csv('output/financial_reporting/management_kpis.csv')
# Check for declining trends
for kpi_name in kpis['kpi_name'].unique():
series = kpis[kpis['kpi_name'] == kpi_name].sort_values('period')
values = series['value'].values
# Simple trend check: are last 3 periods declining?
if len(values) >= 3 and all(values[-3+i] > values[-3+i+1] for i in range(2)):
print(f"Declining trend: {kpi_name}")
Payroll Audit Testing (v0.6.0)
When the HR module is enabled, test payroll data for anomalies:
# Load payroll data
payroll = pd.read_csv('output/hr/payroll_entries.csv')
# Ghost employee check: employees with pay but no time entries
time_entries = pd.read_csv('output/hr/time_entries.csv')
paid_employees = set(payroll['employee_id'].unique())
active_employees = set(time_entries['employee_id'].unique())
no_time = paid_employees - active_employees
print(f"Employees paid without time entries: {len(no_time)}")
# Payroll amount reasonableness
payroll_summary = payroll.groupby('employee_id')['gross_pay'].sum()
mean_pay = payroll_summary.mean()
std_pay = payroll_summary.std()
outliers = payroll_summary[payroll_summary > mean_pay + 3 * std_pay]
print(f"Unusually high total pay: {len(outliers)} employees")
# Expense policy violation detection
expenses = pd.read_csv('output/hr/expense_reports.csv')
violations = expenses[expenses['policy_violation'] == True]
print(f"Expense policy violations: {len(violations)}")
Sampling
Statistical Sampling
from scipy import stats
# Calculate sample size for attribute testing
population_size = len(entries)
confidence_level = 0.95
tolerable_error_rate = 0.05
expected_error_rate = 0.01
# Sample size formula
z_score = stats.norm.ppf(1 - (1 - confidence_level) / 2)
sample_size = int(
(z_score ** 2 * expected_error_rate * (1 - expected_error_rate)) /
(tolerable_error_rate ** 2)
)
print(f"Recommended sample size: {sample_size}")
# Random sample
sample = entries.sample(n=sample_size, random_state=42)
Stratified Sampling
# Stratify by amount
entries['amount_stratum'] = pd.qcut(
entries['debit_amount'] + entries['credit_amount'],
q=5,
labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
)
# Sample from each stratum
stratified_sample = entries.groupby('amount_stratum').apply(
lambda x: x.sample(n=min(100, len(x)), random_state=42)
)
See Also
SOX Compliance Testing
Test internal control monitoring systems.
Overview
SyntheticData generates data for SOX 404 compliance testing:
- Internal control definitions
- Control test evidence
- Segregation of Duties violations
- Approval workflow data
Configuration
global:
seed: 42
industry: financial_services
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 50000
internal_controls:
enabled: true
controls:
- id: "CTL-001"
name: "Payment Authorization"
type: preventive
frequency: continuous
threshold: 10000
assertions: [authorization, validity]
- id: "CTL-002"
name: "Journal Entry Review"
type: detective
frequency: daily
assertions: [accuracy, completeness]
- id: "CTL-003"
name: "Bank Reconciliation"
type: detective
frequency: monthly
assertions: [existence, completeness]
sod_rules:
- conflict_type: create_approve
processes: [ap_invoice, ap_payment]
description: "Cannot create and approve payments"
- conflict_type: create_approve
processes: [ar_invoice, ar_receipt]
description: "Cannot create and approve receipts"
- conflict_type: custody_recording
processes: [cash_handling, cash_recording]
description: "Cannot handle and record cash"
approval:
enabled: true
thresholds:
- level: 1
max_amount: 5000
- level: 2
max_amount: 25000
- level: 3
max_amount: 100000
- level: 4
max_amount: null
fraud:
enabled: true
fraud_rate: 0.005
types:
skipped_approval: 0.30
threshold_manipulation: 0.30
unauthorized_discount: 0.20
duplicate_payment: 0.20
output:
format: csv
Control Testing
1. Control Evidence
import pandas as pd
# Load control data
controls = pd.read_csv('output/controls/internal_controls.csv')
mappings = pd.read_csv('output/controls/control_account_mappings.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Identify entries subject to each control
for _, control in controls.iterrows():
control_id = control['control_id']
threshold = control['threshold']
# Filter entries in scope
if pd.notna(threshold):
in_scope = entries[
(entries['control_ids'].str.contains(control_id)) &
(entries['debit_amount'] >= threshold)
]
else:
in_scope = entries[entries['control_ids'].str.contains(control_id)]
print(f"{control['name']}: {len(in_scope)} entries in scope")
2. Approval Testing
# Load entries with approval data
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Test approval compliance
approval_required = entries[entries['debit_amount'] >= 5000]
approved = approval_required[approval_required['approved_by'].notna()]
not_approved = approval_required[approval_required['approved_by'].isna()]
print(f"Requiring approval: {len(approval_required)}")
print(f"Properly approved: {len(approved)}")
print(f"Missing approval: {len(not_approved)}")
# Test approval levels
def check_approval_level(row):
amount = row['debit_amount']
if amount >= 100000:
return row['approval_level'] >= 4
elif amount >= 25000:
return row['approval_level'] >= 3
elif amount >= 5000:
return row['approval_level'] >= 2
return True
entries['approval_adequate'] = entries.apply(check_approval_level, axis=1)
inadequate = entries[~entries['approval_adequate']]
print(f"Inadequate approval level: {len(inadequate)}")
3. Segregation of Duties
# Load SoD data
sod_rules = pd.read_csv('output/controls/sod_rules.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')
# Identify violations
violations = entries[entries['sod_violation'] == True]
print(f"Total SoD violations: {len(violations)}")
# Analyze by type
violation_summary = violations.groupby('sod_conflict_type').agg({
'document_id': 'count',
'debit_amount': 'sum'
}).rename(columns={'document_id': 'count', 'debit_amount': 'total_amount'})
print("\nViolations by type:")
print(violation_summary)
# Analyze by user
user_violations = violations.groupby('created_by').size().sort_values(ascending=False)
print("\nTop violators:")
print(user_violations.head(10))
4. Threshold Manipulation
# Detect threshold-adjacent transactions
approval_threshold = 10000
entries['near_threshold'] = (
(entries['debit_amount'] >= approval_threshold * 0.9) &
(entries['debit_amount'] < approval_threshold)
)
near_threshold = entries[entries['near_threshold']]
print(f"Near-threshold entries: {len(near_threshold)}")
# Statistical analysis
expected_near = len(entries) * 0.10 # 10% would be in this range randomly
chi_stat = ((len(near_threshold) - expected_near) ** 2) / expected_near
print(f"Chi-square statistic: {chi_stat:.2f}")
Control Matrix
Generate RACM
# Risk and Control Matrix
controls = pd.read_csv('output/controls/internal_controls.csv')
mappings = pd.read_csv('output/controls/control_account_mappings.csv')
racm = controls.merge(mappings, on='control_id')
racm = racm[[
'control_id', 'name', 'control_type', 'frequency',
'account_number', 'assertions'
]]
# Add testing results
racm['population'] = racm['account_number'].apply(
lambda x: len(entries[entries['account_number'] == x])
)
racm['exceptions'] = racm['control_id'].apply(
lambda x: len(entries[
(entries['control_ids'].str.contains(x)) &
(entries['is_anomaly'] == True)
])
)
racm['exception_rate'] = racm['exceptions'] / racm['population']
print(racm)
Test Documentation
Control Test Template
def document_control_test(control_id, entries, sample_size=25):
"""Generate control test documentation."""
control = controls[controls['control_id'] == control_id].iloc[0]
# Get population
population = entries[entries['control_ids'].str.contains(control_id)]
# Sample
sample = population.sample(n=min(sample_size, len(population)), random_state=42)
# Test results
exceptions = sample[sample['is_anomaly'] == True]
return {
'control_id': control_id,
'control_name': control['name'],
'control_type': control['control_type'],
'frequency': control['frequency'],
'population_size': len(population),
'sample_size': len(sample),
'exceptions_found': len(exceptions),
'exception_rate': len(exceptions) / len(sample),
'conclusion': 'Effective' if len(exceptions) == 0 else 'Exception Noted'
}
# Test all controls
results = []
for control_id in controls['control_id']:
result = document_control_test(control_id, entries)
results.append(result)
test_results = pd.DataFrame(results)
test_results.to_csv('control_test_results.csv', index=False)
Deficiency Assessment
# Classify deficiencies
def assess_deficiency(exception_rate, amount_impact):
if exception_rate > 0.10 or amount_impact > 1000000:
return 'Material Weakness'
elif exception_rate > 0.05 or amount_impact > 100000:
return 'Significant Deficiency'
elif exception_rate > 0:
return 'Control Deficiency'
return 'No Deficiency'
test_results['amount_impact'] = test_results['control_id'].apply(
lambda x: entries[
(entries['control_ids'].str.contains(x)) &
(entries['is_anomaly'] == True)
]['debit_amount'].sum()
)
test_results['deficiency_classification'] = test_results.apply(
lambda x: assess_deficiency(x['exception_rate'], x['amount_impact']),
axis=1
)
print(test_results[['control_name', 'exception_rate', 'deficiency_classification']])
See Also
Process Mining
Generate OCEL 2.0 event logs for process mining analysis across 8 enterprise process families.
Overview
SyntheticData generates comprehensive process mining data:
- OCEL 2.0 compliant event logs with 88 activity types and 52 object types
- 8 process families: P2P, O2C, S2C, H2R, MFG, BANK, AUDIT, Bank Recon
- Object-centric relationships with lifecycle states
- Three variant types per generator: HappyPath (75%), ExceptionPath (20%), ErrorPath (5%)
- Cross-process object linking via shared document IDs
Configuration
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 6
transactions:
target_count: 50000
document_flows:
p2p:
enabled: true
flow_rate: 0.4
completion_rate: 0.95
stages:
po_approval_rate: 0.9
gr_rate: 0.98
invoice_rate: 0.95
payment_rate: 0.92
o2c:
enabled: true
flow_rate: 0.4
completion_rate: 0.90
stages:
so_approval_rate: 0.95
credit_check_pass_rate: 0.9
delivery_rate: 0.98
invoice_rate: 0.95
collection_rate: 0.85
master_data:
vendors:
count: 100
customers:
count: 200
materials:
count: 500
employees:
count: 30
output:
format: csv
OCEL 2.0 Export
Use the datasynth-ocpm crate for OCEL 2.0 export:
#![allow(unused)]
fn main() {
use datasynth_ocpm::{OcpmGenerator, Ocel2Exporter, ExportFormat};
let mut generator = OcpmGenerator::new(seed);
// Generate 5,000 P2P and 5,000 O2C cases within the given date range
let event_log = generator.generate_event_log(5000, 5000, start_date, end_date)?;
let exporter = Ocel2Exporter::new(ExportFormat::Json);
exporter.export(&event_log, "output/ocel2.json")?;
}
P2P Process
Event Sequence
Create PO → Approve PO → Release PO → Create GR → Post GR →
Receive Invoice → Verify Invoice → Post Invoice → Execute Payment
Objects
| Object Type | Attributes |
|---|---|
| PurchaseOrder | po_number, vendor_id, total_amount |
| GoodsReceipt | gr_number, po_reference, quantity |
| VendorInvoice | invoice_number, amount, due_date |
| Payment | payment_number, amount, bank_ref |
| Material | material_id, description |
| Vendor | vendor_id, name |
Object Relationships
PurchaseOrder ─┬── contains ──→ Material
└── from ──────→ Vendor
GoodsReceipt ──── for ──────→ PurchaseOrder
VendorInvoice ─── for ──────→ PurchaseOrder
└── matches ──→ GoodsReceipt
Payment ───────── pays ──────→ VendorInvoice
O2C Process
Event Sequence
Create SO → Check Credit → Release SO → Create Delivery →
Pick → Pack → Ship → Create Invoice → Post Invoice → Receive Payment
Objects
| Object Type | Attributes |
|---|---|
| SalesOrder | so_number, customer_id, total_amount |
| Delivery | delivery_number, so_reference |
| CustomerInvoice | invoice_number, amount, due_date |
| CustomerPayment | receipt_number, amount |
| Material | material_id, description |
| Customer | customer_id, name |
Analysis with PM4Py
Load Event Log
from pm4py.objects.ocel.importer import jsonocel
# Load OCEL 2.0
ocel = jsonocel.apply("output/ocel2.json")
print(f"Events: {len(ocel.events)}")
print(f"Objects: {len(ocel.objects)}")
print(f"Object types: {ocel.object_types}")
Process Discovery
from pm4py.algo.discovery.ocel import algorithm as ocel_discovery
# Discover object-centric Petri net
ocpn = ocel_discovery.apply(ocel)
# Visualize
from pm4py.visualization.ocel.ocpn import visualizer
gviz = visualizer.apply(ocpn)
visualizer.save(gviz, "ocpn.png")
Object Lifecycle Analysis
from pm4py.statistics.ocel import object_lifecycle
# Analyze PurchaseOrder lifecycle
po_lifecycle = object_lifecycle.get_lifecycle_summary(
ocel,
object_type="PurchaseOrder"
)
print("Purchase Order Lifecycle:")
print(f" Average duration: {po_lifecycle['avg_duration']}")
print(f" Completion rate: {po_lifecycle['completion_rate']:.2%}")
Conformance Checking
from pm4py.algo.conformance.ocel import algorithm as ocel_conformance
# Check conformance against expected model
results = ocel_conformance.apply(ocel, ocpn)
print(f"Conformant cases: {results['conformant']}")
print(f"Non-conformant: {results['non_conformant']}")
Process Metrics
Throughput Time
import pandas as pd
from datetime import timedelta
# Load events
events = pd.DataFrame(ocel.events)
# Calculate case durations
case_durations = events.groupby('case_id').agg({
'timestamp': ['min', 'max']
})
case_durations['duration'] = (
case_durations[('timestamp', 'max')] -
case_durations[('timestamp', 'min')]
)
print(f"Mean throughput time: {case_durations['duration'].mean()}")
print(f"Median throughput time: {case_durations['duration'].median()}")
Activity Frequency
# Count activity occurrences
activity_counts = events['activity'].value_counts()
print("Activity frequency:")
print(activity_counts)
Bottleneck Analysis
# Calculate waiting times between activities
events = events.sort_values(['case_id', 'timestamp'])
events['wait_time'] = events.groupby('case_id')['timestamp'].diff()
# Find bottlenecks
bottlenecks = events.groupby('activity')['wait_time'].mean().sort_values(ascending=False)
print("Bottleneck activities:")
print(bottlenecks.head(5))
Variant Analysis
from pm4py.algo.discovery.ocel import variants
# Get process variants
variant_stats = variants.get_variants_statistics(ocel)
print(f"Unique variants: {len(variant_stats)}")
print("\nTop variants:")
for variant, stats in sorted(variant_stats.items(), key=lambda x: -x[1]['count'])[:5]:
print(f" {variant}: {stats['count']} cases")
Integration with Tools
Celonis
# Export to Celonis format
from pm4py.objects.ocel.exporter import csv as ocel_csv_exporter
ocel_csv_exporter.apply(ocel, "output/celonis/")
# Upload CSV files to Celonis
OCPA
# Export to OCPA format
from pm4py.objects.ocel.exporter import sqlite
sqlite.apply(ocel, "output/ocel.sqlite")
# Open in OCPA tool
New Process Families (v0.6.2)
S2C — Source-to-Contract
Create Sourcing Project → Qualify Supplier → Publish RFx →
Submit Bid → Evaluate Bids → Award Contract →
Activate Contract → Complete Sourcing
H2R — Hire-to-Retire
Submit Time Entry → Approve Time Entry →
Create Payroll Run → Calculate Payroll → Approve Payroll → Post Payroll
Submit Expense → Approve Expense
MFG — Manufacturing
Create Production Order → Release → Start Operation →
Complete Operation → Quality Inspection → Confirm Production →
Close Production Order
BANK — Banking Operations
Onboard Customer → KYC Review → Open Account →
Execute Transaction → Authorize → Complete Transaction
AUDIT — Audit Engagement Lifecycle
Create Engagement → Plan → Assess Risk → Create Workpaper →
Collect Evidence → Review Workpaper → Raise Finding →
Remediate Finding → Record Judgment → Complete Engagement
Bank Recon — Bank Reconciliation
Import Bank Statement → Auto Match Items → Manual Match Item →
Create Reconciling Item → Resolve Exception →
Approve Reconciliation → Post Entries → Complete Reconciliation
S2P Process Mining
The full Source-to-Pay chain provides rich process mining opportunities beyond basic P2P:
Extended Event Sequence
Spend Analysis → Supplier Qualification → RFx Published →
Bid Received → Bid Evaluation → Contract Award →
Create PO → Approve PO → Release PO →
Create GR → Post GR →
Receive Invoice → Verify Invoice (Three-Way Match) → Post Invoice →
Schedule Payment → Execute Payment
Extended Object Types
| Object Type | Attributes |
|---|---|
| SpendCategory | category_code, total_spend, vendor_count |
| SourcingProject | project_type, target_savings, status |
| SupplierBid | vendor_id, bid_amount, technical_score |
| ProcurementContract | contract_value, validity_period, terms |
| PurchaseRequisition | requester, catalog_item, urgency |
| PurchaseOrder | po_type, vendor_id, total_amount |
| GoodsReceipt | gr_number, received_qty, movement_type |
| VendorInvoice | invoice_amount, match_status, due_date |
| Payment | payment_method, cleared_amount, bank_ref |
Cycle Time Analysis
# Analyze end-to-end procurement cycle times
po_events = events[events['object_type'] == 'PurchaseOrder']
# PO creation to payment completion
cycle_times = po_events.groupby('case_id')['timestamp'].agg(['min', 'max'])
cycle_times['cycle_time'] = cycle_times['max'] - cycle_times['min']
# Segment by PO type (join the object attributes back onto the per-case durations)
cycle_by_type = (
    po_events[['case_id', 'object_id']].drop_duplicates()
    .merge(objects[['object_id', 'po_type']], on='object_id')
    .merge(cycle_times.reset_index(), on='case_id')
    .groupby('po_type')['cycle_time'].describe()
)
Three-Way Match Conformance
# Identify invoices that failed three-way match
match_events = events[events['activity'] == 'Verify Invoice']
blocked = match_events[match_events['match_status'] == 'blocked']
print(f"Three-way match block rate: {len(blocked)/len(match_events):.1%}")
print(f"Most common variance: {blocked['variance_type'].mode()[0]}")
See Also
- Document Flows — P2P and O2C configuration
- Process Chains — Enterprise process chain architecture
- datasynth-ocpm Crate — OCEL 2.0 implementation
- Audit Analytics
AML/KYC Testing
Generate realistic banking transaction data with KYC profiles and AML typologies for compliance testing and fraud detection model development.
Overview
The datasynth-banking module generates synthetic banking data designed for:
- AML System Testing: Validate transaction monitoring rules against known patterns
- KYC Process Testing: Test customer onboarding and risk assessment workflows
- ML Model Training: Train supervised models with labeled fraud typologies
- Scenario Analysis: Test detection capabilities against specific attack patterns
KYC Profile Generation
Customer Types
| Type | Description | Typical Characteristics |
|---|---|---|
| Retail | Individual customers | Salary deposits, consumer spending |
| Business | Small to medium businesses | Payroll, supplier payments |
| Trust | Trust accounts, complex structures | Investment flows, distributions |
KYC Profile Components
Each customer has a KYC profile defining expected behavior:
kyc_profile:
declared_turnover: 50000 # Expected monthly volume
transaction_frequency: 25 # Expected transactions/month
source_of_funds: "employment" # Declared income source
geographic_exposure: ["US", "EU"]
cash_intensity: 0.05 # Expected cash ratio
beneficial_owner_complexity: 1 # Ownership layers
Risk Scoring
Customers are assigned risk scores based on:
- Geographic exposure (high-risk jurisdictions)
- Industry sector
- Transaction patterns vs. declared profile
- Beneficial ownership complexity
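As an illustration only (the weights below are assumptions, not the generator's internal scoring model), a simple additive score over the KYC profile fields might look like this:
HIGH_RISK_JURISDICTIONS = {"IR", "KP", "SY"}  # illustrative list
def score_customer(kyc_profile: dict) -> int:
    """Toy additive risk score on a 0-100 scale (illustrative weights)."""
    score = 0
    if set(kyc_profile.get("geographic_exposure", [])) & HIGH_RISK_JURISDICTIONS:
        score += 40
    if kyc_profile.get("cash_intensity", 0.0) > 0.30:
        score += 25
    score += 10 * max(kyc_profile.get("beneficial_owner_complexity", 1) - 1, 0)
    return min(score, 100)
# Low-risk retail example matching the profile shown above
print(score_customer({"geographic_exposure": ["US", "EU"],
                      "cash_intensity": 0.05,
                      "beneficial_owner_complexity": 1}))  # 0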
AML Typology Generation
Structuring
Breaking large transactions into smaller amounts to avoid reporting thresholds.
Detection Signatures:
- Multiple transactions just below $10,000 threshold
- Same-day deposits across multiple branches
- Round-number amounts (e.g., $9,900, $9,800)
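These signatures translate directly into a screening query; a sketch against the generated bank_transactions.csv (column names follow the transaction record shown later on this page):
import pandas as pd
txns = pd.read_csv("banking/bank_transactions.csv", parse_dates=["timestamp"])
# Cash credits just below the $10,000 reporting threshold
candidates = txns[
    (txns["direction"] == "credit")
    & (txns["category"] == "cash_deposit")
    & (txns["amount"].between(9000, 9999.99))
].assign(day=lambda df: df["timestamp"].dt.date)
# Flag accounts with two or more such deposits on the same day
alerts = (
    candidates.groupby(["account_id", "day"])
    .size()
    .reset_index(name="deposit_count")
    .query("deposit_count >= 2")
)
print(f"Structuring alerts: {len(alerts)}")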
Configuration:
typologies:
structuring:
enabled: true
rate: 0.001
threshold: 10000
margin: 500
Funnel Accounts
Concentrating funds from multiple sources before moving to destination.
Pattern:
Source A ─┐
Source B ─┼─▶ Funnel Account ─▶ Destination
Source C ─┘
Detection Signatures:
- Many small inbound, few large outbound
- High throughput relative to account balance
- Short holding periods
Layering
Complex chains of transactions to obscure fund origins.
Pattern:
Origin ─▶ Shell A ─▶ Shell B ─▶ Shell C ─▶ Destination
└─▶ Mixing ─┘
Detection Signatures:
- Rapid consecutive transfers
- Circular transaction patterns
- Cross-border routing through multiple jurisdictions
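One of these signatures, circular transaction patterns, can be screened for with a graph cycle search (a sketch; the network construction mirrors the Network Analysis example later on this page, and cycle enumeration can be slow on very large graphs):
import networkx as nx
import pandas as pd
txns = pd.read_csv("banking/bank_transactions.csv")
# Directed account-to-counterparty graph
G = nx.DiGraph()
for _, txn in txns.dropna(subset=["counterparty_id"]).iterrows():
    G.add_edge(txn["account_id"], txn["counterparty_id"], amount=txn["amount"])
# Short cycles (funds returning to their origin) are layering candidates
cycles = [cycle for cycle in nx.simple_cycles(G) if len(cycle) <= 4]
print(f"Candidate circular routes: {len(cycles)}")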
Money Mule Networks
Using recruited individuals to move illicit funds.
Pattern:
Fraudster ─▶ Mule 1 ─▶ Cash Withdrawal
─▶ Mule 2 ─▶ Wire Transfer
─▶ Mule 3 ─▶ Crypto Exchange
Detection Signatures:
- New accounts with sudden high volume
- Immediate outbound after inbound
- Multiple accounts with similar patterns
Round-Tripping
Moving funds in circular patterns to create apparent legitimacy.
Pattern:
Company A ─▶ Offshore ─▶ Company A (as "investment")
Detection Signatures:
- Funds return to origin within short period
- Offshore intermediaries
- Inflated invoicing
Fraud Patterns
Credit card fraud and synthetic identity patterns.
Patterns:
- Card testing (small amounts across merchants)
- Account takeover (changed behavior profile)
- Synthetic identity (blended PII attributes)
Generated Data
Output Files
banking/
├── banking_customers.csv # Customer profiles with KYC data
├── bank_accounts.csv # Account records with features
├── bank_transactions.csv # Transaction records
├── kyc_profiles.csv # Expected activity envelopes
├── counterparties.csv # Counterparty pool
├── aml_typology_labels.csv # Ground truth typology labels
├── entity_risk_labels.csv # Entity-level risk classifications
└── transaction_risk_labels.csv # Transaction-level classifications
Customer Record
customer_id,customer_type,name,created_at,risk_score,kyc_status,pep_flag,sanctions_flag
CUST001,retail,John Smith,2024-01-15,25,verified,false,false
CUST002,business,Acme Corp,2024-02-01,65,enhanced_due_diligence,false,false
Transaction Record
transaction_id,account_id,timestamp,amount,currency,direction,channel,category,counterparty_id
TXN001,ACC001,2024-03-15T10:30:00Z,9800.00,USD,credit,branch,cash_deposit,
TXN002,ACC001,2024-03-15T11:45:00Z,9750.00,USD,credit,branch,cash_deposit,
Typology Label
transaction_id,typology,confidence,pattern_id,related_transactions
TXN001,structuring,0.95,STRUCT_001,"TXN001,TXN002,TXN003"
TXN002,structuring,0.95,STRUCT_001,"TXN001,TXN002,TXN003"
Configuration
Basic Banking Setup
banking:
enabled: true
customers:
retail: 5000
business: 500
trust: 50
transactions:
target_count: 500000
date_range:
start: 2024-01-01
end: 2024-12-31
typologies:
structuring:
enabled: true
rate: 0.002
funnel:
enabled: true
rate: 0.001
layering:
enabled: true
rate: 0.0005
mule:
enabled: true
rate: 0.001
fraud:
enabled: true
rate: 0.005
labels:
generate: true
include_confidence: true
include_related: true
Adversarial Testing
Generate transactions designed to evade detection:
banking:
typologies:
spoofing:
enabled: true
strategies:
- threshold_aware # Varies amounts around thresholds
- temporal_distribution # Spreads over time windows
- channel_mixing # Uses multiple channels
Use Cases
Transaction Monitoring Rule Testing
# Generate data with known structuring patterns
datasynth-data generate --config banking_structuring.yaml --output ./test_data
# Expected results:
# - 0.2% of transactions should trigger structuring alerts
# - Labels in aml_typology_labels.csv for validation
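Before tuning rules against this data, a quick sanity check (a sketch assuming the output layout shown above) confirms the injected typology rate matches the configuration:
import pandas as pd
txns = pd.read_csv("test_data/banking/bank_transactions.csv")
labels = pd.read_csv("test_data/banking/aml_typology_labels.csv")
structuring = labels[labels["typology"] == "structuring"]
observed_rate = len(structuring) / len(txns)
print(f"Observed structuring rate: {observed_rate:.3%} (expected ~0.2%)")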
ML Model Training
import pandas as pd
from sklearn.model_selection import train_test_split
# Load transactions and labels
transactions = pd.read_csv("banking/bank_transactions.csv")
labels = pd.read_csv("banking/aml_typology_labels.csv")
# Merge and prepare features
data = transactions.merge(labels, on="transaction_id", how="left")
data["is_suspicious"] = data["typology"].notna()
features = ["amount"]  # minimal numeric feature set; extend with engineered features
# Split for training
X_train, X_test, y_train, y_test = train_test_split(
data[features],
data["is_suspicious"],
test_size=0.2,
stratify=data["is_suspicious"]
)
Network Analysis
The banking data supports graph-based analysis:
import networkx as nx
import pandas as pd
# Build transaction network
G = nx.DiGraph()
for _, txn in transactions.iterrows():
    if pd.notna(txn["counterparty_id"]):
        G.add_edge(txn["account_id"], txn["counterparty_id"],
                   weight=txn["amount"])
# Detect funnel accounts (high in-degree, low out-degree)
in_degree = dict(G.in_degree())
out_degree = dict(G.out_degree())
funnels = [n for n in G.nodes()
if in_degree.get(n, 0) > 10 and out_degree.get(n, 0) < 3]
KYC Deviation Analysis
# Compare actual behavior to KYC profile
customers = pd.read_csv("banking/banking_customers.csv")
kyc = pd.read_csv("banking/kyc_profiles.csv")
transactions = pd.read_csv("banking/bank_transactions.csv", parse_dates=["timestamp"])
# Calculate actual monthly volumes per customer
transactions["month"] = transactions["timestamp"].dt.to_period("M")
actual = (
    transactions.groupby(["customer_id", "month"])["amount"]
    .sum()
    .reset_index(name="actual")
)
# Compare to declared turnover
merged = actual.merge(kyc, on="customer_id")
merged["deviation"] = (merged["actual"] - merged["declared_turnover"]) / merged["declared_turnover"]
# Flag significant deviations
alerts = merged[merged["deviation"].abs() > 0.5]
Best Practices
Realistic Testing
- Match production volumes: Configure similar customer counts and transaction rates
- Use realistic ratios: Keep typology rates at realistic levels (0.1-1%)
- Include noise: Add legitimate edge cases that shouldn’t trigger alerts
Label Quality
- Verify ground truth: Labels reflect injected patterns, not detected ones
- Include confidence: Use confidence scores for uncertain classifications
- Track related transactions: Pattern IDs link related suspicious activity
Model Validation
- Test detection rates: Measure recall against known patterns
- Check false positives: Ensure legitimate transactions aren’t flagged
- Validate across typologies: Test each pattern type separately
See Also
ERP Load Testing
Generate high-volume data for ERP system testing.
Overview
SyntheticData generates realistic transaction volumes for:
- Load testing
- Stress testing
- Performance benchmarking
- System integration testing
Configuration
High Volume Generation
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
worker_threads: 8 # Maximize parallelism
transactions:
target_count: 1000000 # 1 million entries
line_items:
distribution: empirical
amounts:
min: 100
max: 10000000
distribution: log_normal
sources:
manual: 0.15
automated: 0.65
recurring: 0.15
adjustment: 0.05
temporal:
month_end_spike: 2.5
quarter_end_spike: 3.0
year_end_spike: 4.0
document_flows:
p2p:
enabled: true
flow_rate: 0.35
o2c:
enabled: true
flow_rate: 0.35
master_data:
vendors:
count: 2000
customers:
count: 5000
materials:
count: 10000
output:
format: csv
compression: none # Fastest for import
SAP ACDOCA Format
output:
files:
journal_entries: false
acdoca: true # SAP Universal Journal format
Volume Sizing
Transaction Volume Guidelines
| Company Size | Annual Entries | Per Day | Configuration |
|---|---|---|---|
| Small | 10,000 | ~30 | target_count: 10000 |
| Medium | 100,000 | ~300 | target_count: 100000 |
| Large | 1,000,000 | ~3,000 | target_count: 1000000 |
| Enterprise | 10,000,000 | ~30,000 | target_count: 10000000 |
Master Data Guidelines
| Size | Vendors | Customers | Materials |
|---|---|---|---|
| Small | 100 | 200 | 500 |
| Medium | 500 | 1,000 | 5,000 |
| Large | 2,000 | 10,000 | 50,000 |
| Enterprise | 10,000 | 100,000 | 500,000 |
Load Testing Scenarios
1. Steady State Load
Normal daily operation:
transactions:
target_count: 100000
temporal:
month_end_spike: 1.0 # No spikes
quarter_end_spike: 1.0
year_end_spike: 1.0
working_hours_only: true
2. Peak Period Load
Month-end closing:
global:
start_date: 2024-01-25
period_months: 1 # Focus on month-end
transactions:
target_count: 50000
temporal:
month_end_spike: 5.0 # 5x normal volume
3. Year-End Stress
Year-end closing simulation:
global:
start_date: 2024-12-01
period_months: 1
transactions:
target_count: 200000
temporal:
month_end_spike: 3.0
quarter_end_spike: 4.0
year_end_spike: 10.0 # Extreme spike
4. Batch Import
Large batch import testing:
transactions:
target_count: 500000
sources:
automated: 1.0 # All system-generated
output:
compression: none # For fastest import
Manufacturing ERP Testing (v0.6.0)
Production Order Load
Generate production orders with WIP tracking, routings, and standard costing:
global:
industry: manufacturing
start_date: 2024-01-01
period_months: 12
worker_threads: 8
transactions:
target_count: 500000
manufacturing:
enabled: true
production_orders:
orders_per_month: 200 # High volume
avg_batch_size: 150
yield_rate: 0.96
rework_rate: 0.04
costing:
labor_rate_per_hour: 42.0
overhead_rate: 1.75
routing:
avg_operations: 6
setup_time_hours: 2.0
document_flows:
p2p:
enabled: true
flow_rate: 0.40 # Heavy procurement
subledger:
inventory:
enabled: true
valuation_methods:
- standard_cost
- weighted_average
This configuration exercises production order creation, goods issue to production, goods receipt from production, WIP valuation, and standard cost variance posting.
Three-Way Match with Source-to-Pay
Test the full procurement lifecycle from sourcing through payment:
source_to_pay:
enabled: true
sourcing:
projects_per_year: 20
rfx:
min_invited_vendors: 5
max_invited_vendors: 12
contracts:
min_duration_months: 12
max_duration_months: 24
p2p_integration:
off_contract_rate: 0.10 # 10% maverick spending
catalog_enforcement: true
document_flows:
p2p:
enabled: true
flow_rate: 0.40
three_way_match:
quantity_tolerance: 0.02
price_tolerance: 0.01
HR and Payroll Testing (v0.6.0)
Payroll Processing Load
Generate payroll runs, time entries, and expense reports:
master_data:
employees:
count: 500
hierarchy_depth: 6
hr:
enabled: true
payroll:
enabled: true
pay_frequency: "biweekly" # 26 pay periods per year
benefits_enrollment_rate: 0.75
retirement_participation_rate: 0.55
time_attendance:
enabled: true
overtime_rate: 0.15
expenses:
enabled: true
submission_rate: 0.40
policy_violation_rate: 0.05
This exercises payroll journal entry generation (salary, tax withholdings, benefits deductions), time and attendance record creation, and expense report approval workflows.
Expense Report Compliance
Test expense policy enforcement with elevated violation rates:
hr:
enabled: true
expenses:
enabled: true
submission_rate: 0.60 # 60% of employees submit
policy_violation_rate: 0.15 # Elevated violation rate for testing
anomaly_injection:
enabled: true
generate_labels: true
Procurement Testing (v0.6.0)
Vendor Scorecard and Qualification
Generate the full source-to-pay cycle for procurement system testing:
source_to_pay:
enabled: true
qualification:
pass_rate: 0.80
validity_days: 365
scorecards:
frequency: "quarterly"
grade_a_threshold: 85.0
grade_c_threshold: 55.0
catalog:
preferred_vendor_flag_rate: 0.65
multi_source_rate: 0.30
vendor_network:
enabled: true
depth: 3
Sales Quote Pipeline
Test quote-to-order conversion with the O2C flow:
sales_quotes:
enabled: true
quotes_per_month: 100
win_rate: 0.30
validity_days: 45
document_flows:
o2c:
enabled: true
flow_rate: 0.40
Won quotes automatically feed into the O2C document flow as sales orders.
Performance Monitoring
Generation Metrics
# Time generation
time datasynth-data generate --config config.yaml --output ./output
# Monitor memory
/usr/bin/time -v datasynth-data generate --config config.yaml --output ./output
# Watch progress
datasynth-data generate --config config.yaml --output ./output -v
Import Metrics
Track these during ERP import:
| Metric | Description |
|---|---|
| Import rate | Records per second |
| Memory usage | Peak RAM during import |
| CPU utilization | Processor load |
| I/O throughput | Disk read/write speed |
| Lock contention | Database lock waits |
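A minimal sketch for measuring import rate during a chunked bulk load (the connection string and staging table name are placeholders):
import time
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@localhost/erp_test")  # placeholder DSN
rows_loaded = 0
start = time.time()
for chunk in pd.read_csv("output/transactions/journal_entries.csv", chunksize=50_000):
    chunk.to_sql("journal_entries_staging", engine, if_exists="append", index=False)
    rows_loaded += len(chunk)
    elapsed = time.time() - start
    print(f"{rows_loaded:,} rows loaded | {rows_loaded / elapsed:,.0f} rows/sec")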
Data Import Strategies
SAP S/4HANA
# Generate ACDOCA format
datasynth-data generate --config config.yaml --output ./output
# Use SAP Data Services or LSMW for import
# Output: output/transactions/acdoca.csv
Oracle EBS
-- Create staging table
CREATE TABLE XX_JE_STAGING (
document_id VARCHAR2(36),
posting_date DATE,
account VARCHAR2(20),
debit NUMBER,
credit NUMBER
);
-- Load via SQL*Loader (control file sketch; date format assumed to be ISO)
LOAD DATA
INFILE 'journal_entries.csv'
INTO TABLE XX_JE_STAGING
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(document_id, posting_date DATE "YYYY-MM-DD", account, debit, credit)
Microsoft Dynamics
# Use Data Management Framework
# Import journal_entries.csv via Data Entity
Validation
Post-Import Checks
-- Verify record count
SELECT COUNT(*) FROM journal_entries;
-- Verify balance
SELECT SUM(debit) - SUM(credit) AS imbalance
FROM journal_entries;
-- Check date range
SELECT MIN(posting_date), MAX(posting_date)
FROM journal_entries;
Reconciliation
import pandas as pd
# Compare source and target
source = pd.read_csv('output/transactions/journal_entries.csv')
target = pd.read_sql('SELECT * FROM journal_entries', connection)
# Verify counts
assert len(source) == len(target), "Record count mismatch"
# Verify totals
assert abs(source['debit_amount'].sum() - target['debit'].sum()) < 0.01
Batch Processing
Chunked Generation
For very large volumes, generate in chunks:
# Generate 10 batches of 1M each
for i in {1..10}; do
datasynth-data generate \
--config config.yaml \
--output ./output/batch_$i \
--seed $((42 + i))
done
Parallel Import
# Import chunks in parallel
for batch in ./output/batch_*; do
import_job $batch &
done
wait
Performance Tips
Generation Speed
- Increase threads: worker_threads: 16
- Disable unnecessary features: turn off graph export, anomalies
- Use fast storage: NVMe SSD
- Reduce complexity: Smaller COA, fewer master records
Import Speed
- Disable triggers: During bulk import
- Drop indexes: Recreate after import
- Increase batch size: Larger commits
- Parallel loading: Multiple import streams
See Also
Causal Analysis
New in v0.5.0
Use DataSynth’s causal generation capabilities for “what-if” scenario testing and counterfactual analysis in audit and risk management.
When to Use Causal Generation
Causal generation is ideal when you need to:
- Test audit scenarios: “What would happen to fraud rates if we increased the approval threshold?”
- Risk assessment: “How would revenue change if we lost our top vendor?”
- Policy evaluation: “What is the causal effect of implementing a new control?”
- Training causal ML models: Generate data with known causal structure for model validation
Setting Up a Fraud Detection SCM
# Generate causally-structured fraud detection data
datasynth-data causal generate \
--template fraud_detection \
--samples 50000 \
--seed 42 \
--output ./fraud_causal
The fraud_detection template models:
- transaction_amount → approval_level (larger amounts require higher approval)
- transaction_amount → fraud_flag (larger amounts have higher fraud probability)
- vendor_risk → fraud_flag (risky vendors associated with more fraud)
Running Interventions
Answer “what if?” questions by forcing variables to specific values:
# What if all transactions were $50,000?
datasynth-data causal intervene \
--template fraud_detection \
--variable transaction_amount \
--value 50000 \
--samples 10000 \
--output ./intervention_50k
# What if vendor risk were always high (0.9)?
datasynth-data causal intervene \
--template fraud_detection \
--variable vendor_risk \
--value 0.9 \
--samples 10000 \
--output ./intervention_high_risk
Compare the intervention output against the baseline to estimate causal effects.
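A sketch of that comparison (the CSV file names inside the output directories are assumptions):
import pandas as pd
baseline = pd.read_csv("fraud_causal/causal_data.csv")        # assumed file name
intervened = pd.read_csv("intervention_50k/causal_data.csv")  # assumed file name
# Average effect of do(transaction_amount = 50000) on the fraud flag
effect = intervened["fraud_flag"].mean() - baseline["fraud_flag"].mean()
print(f"Baseline fraud rate:     {baseline['fraud_flag'].mean():.3%}")
print(f"Intervention fraud rate: {intervened['fraud_flag'].mean():.3%}")
print(f"Estimated causal effect: {effect:+.3%}")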
Counterfactual Analysis for Audit
For individual transaction review:
from datasynth_py import DataSynth
synth = DataSynth()
# Load a specific flagged transaction
factual = {
"transaction_amount": 5000.0,
"approval_level": 1.0,
"vendor_risk": 0.3,
"fraud_flag": 0.0,
}
# What would have happened if the amount were 10x larger?
# The counterfactual preserves the same "noise" (latent factors)
# but propagates the new amount through the causal structure
This helps auditors understand which factors most influence risk assessments.
Configuration Example
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
causal:
enabled: true
template: "fraud_detection"
sample_size: 50000
validate: true
# Combine with regular generation
transactions:
target_count: 100000
fraud:
enabled: true
fraud_rate: 0.005
Validating Causal Structure
Verify that generated data preserves the intended causal relationships:
datasynth-data causal validate \
--data ./fraud_causal \
--template fraud_detection
The validator checks:
- Parent-child correlations match expected directions
- Independence constraints hold for non-adjacent variables
- Intervention effects are consistent with the graph
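The direction checks can also be reproduced ad hoc with pandas (a sketch; the variable names follow the fraud_detection template, while the CSV file name is an assumption):
import pandas as pd
data = pd.read_csv("fraud_causal/causal_data.csv")  # assumed file name
# Parent -> child edges from the fraud_detection template, with expected correlation signs
expected_edges = {
    ("transaction_amount", "approval_level"): "+",
    ("transaction_amount", "fraud_flag"): "+",
    ("vendor_risk", "fraud_flag"): "+",
}
for (parent, child), sign in expected_edges.items():
    corr = data[parent].corr(data[child])
    consistent = corr > 0 if sign == "+" else corr < 0
    print(f"{parent} -> {child}: corr={corr:+.3f} {'OK' if consistent else 'UNEXPECTED'}")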
See Also
LLM Training Data
New in v0.5.0
Generate LLM-enriched synthetic financial data for training and fine-tuning language models on domain-specific tasks.
When to Use LLM-Enriched Data
- Fine-tuning: Train financial document understanding models on realistic data
- RAG evaluation: Test retrieval-augmented generation with known-truth synthetic documents
- Classification training: Generate labeled financial text for transaction categorization
- Anomaly explanation: Train models to explain financial anomalies in natural language
Quality vs Cost Tradeoffs
| Provider | Quality | Cost | Latency | Reproducibility |
|---|---|---|---|---|
| Mock | Good (template-based) | Free | Instant | Fully deterministic |
| gpt-4o-mini | High | ~$0.15/1M tokens | ~200ms/req | Seed-based |
| gpt-4o | Very High | ~$2.50/1M tokens | ~500ms/req | Seed-based |
| Claude (Anthropic) | Very High | Varies | ~300ms/req | Seed-based |
| Self-hosted | Varies | Infrastructure cost | Varies | Full control |
Using the Mock Provider for CI/CD
The mock provider generates deterministic, contextually aware text without any API calls:
# Default: uses mock provider (no API key needed)
datasynth-data generate --config config.yaml --output ./output
# Explicit mock configuration
llm:
provider: mock
The mock provider is suitable for:
- CI/CD pipelines
- Automated testing
- Reproducible research
- Development environments
Using Real LLM Providers
For production-quality enrichment:
llm:
provider: openai
model: "gpt-4o-mini"
api_key_env: "OPENAI_API_KEY"
cache_enabled: true # Avoid duplicate API calls
max_retries: 3
timeout_secs: 30
export OPENAI_API_KEY="sk-..."
datasynth-data generate --config config.yaml --output ./output
Batch Generation for Large Datasets
For large-scale enrichment, use batch mode to minimize API overhead:
from datasynth_py import DataSynth, Config
from datasynth_py.config import blueprints
# Generate base data first (fast, rule-based)
config = blueprints.manufacturing_large(transactions=100000)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
# Then enrich with LLM in a separate pass if needed
Example: Financial Document Understanding
Generate training data for a document understanding model:
global:
seed: 42
industry: manufacturing
start_date: 2024-01-01
period_months: 12
transactions:
target_count: 50000
document_flows:
p2p:
enabled: true
flow_rate: 0.4
o2c:
enabled: true
flow_rate: 0.3
anomaly_injection:
enabled: true
total_rate: 0.03
generate_labels: true
# LLM enrichment adds realistic descriptions
llm:
provider: mock # or openai for higher quality
The generated data includes:
- Vendor names appropriate for the industry and spend category
- Transaction descriptions that read like real GL entries
- Memo fields on invoices and payments
- Natural language explanations for flagged anomalies
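To turn this output into fine-tuning examples, one option is to flatten entries into instruction-style JSONL. The sketch below assumes a journal_entries.csv export with description, is_anomaly, and anomaly_explanation columns; adjust the column names to the actual export schema.
# Convert enriched journal entries into instruction-tuning JSONL pairs.
# Column names (description, is_anomaly, anomaly_explanation) are assumptions
# about the export schema; adjust to match your output.
import json
import pandas as pd

df = pd.read_csv("./output/journal_entries.csv", comment="#")  # skip synthetic-data marker lines
with open("train.jsonl", "w", encoding="utf-8") as f:
    for _, row in df.iterrows():
        record = {
            "prompt": f"Classify this GL entry and explain any anomaly:\n{row['description']}",
            "completion": row["anomaly_explanation"] if row.get("is_anomaly") else "No anomaly detected.",
        }
        f.write(json.dumps(record) + "\n")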
See Also
Pipeline Orchestration
New in v0.5.0
Integrate DataSynth into data engineering pipelines using Apache Airflow, dbt, MLflow, and Apache Spark.
Overview
DataSynth’s Python wrapper includes optional integrations for popular data engineering platforms, enabling synthetic data generation as part of automated workflows.
pip install datasynth-py[integrations]
Apache Airflow
Generate Data in a DAG
from airflow import DAG
from airflow.utils.dates import days_ago
from datasynth_py.integrations import (
DataSynthOperator,
DataSynthSensor,
DataSynthValidateOperator,
)
config = {
"global": {"industry": "retail", "start_date": "2024-01-01", "period_months": 12},
"transactions": {"target_count": 50000},
}
with DAG("synthetic_data_pipeline", start_date=days_ago(1), schedule_interval="@weekly") as dag:
validate = DataSynthValidateOperator(
task_id="validate_config",
config_path="/configs/retail.yaml",
)
generate = DataSynthOperator(
task_id="generate_data",
config=config,
output_path="/data/synthetic/{{ ds }}",
)
wait = DataSynthSensor(
task_id="wait_for_output",
output_path="/data/synthetic/{{ ds }}",
)
validate >> generate >> wait
dbt Integration
Generate dbt Sources from Synthetic Data
from datasynth_py.integrations import DbtSourceGenerator, create_dbt_project
# Generate sources.yml pointing to synthetic CSV files
gen = DbtSourceGenerator()
gen.generate_sources_yaml("./synthetic_output", "./my_dbt_project")
# Generate seed CSVs for dbt
gen.generate_seeds("./synthetic_output", "./my_dbt_project")
# Or create a complete dbt project structure
project = create_dbt_project("./synthetic_output", "my_dbt_project")
This creates:
- `models/sources.yml` with table definitions
- `seeds/` directory with CSV files
- Standard dbt project structure
Testing dbt Models with Synthetic Data
# 1. Generate synthetic data
datasynth-data generate --config retail.yaml --output ./synthetic
# 2. Create dbt project from output
python -c "from datasynth_py.integrations import create_dbt_project; create_dbt_project('./synthetic', 'test_project')"
# 3. Run dbt
cd test_project && dbt seed && dbt run && dbt test
MLflow Tracking
Track Generation Experiments
from datasynth_py.integrations import DataSynthMlflowTracker
tracker = DataSynthMlflowTracker(experiment_name="data_generation")
# Track a generation run (logs config, metrics, artifacts)
run_info = tracker.track_generation("./output", config=config)
# Log additional quality metrics
tracker.log_quality_metrics({
"benford_mad": 0.008,
"correlation_preservation": 0.95,
"completeness": 0.99,
})
# Compare recent runs
comparison = tracker.compare_runs(n=10)
for run in comparison:
print(f"Run {run['run_id']}: quality={run['metrics'].get('statistical_fidelity', 'N/A')}")
A/B Testing Generation Configs
configs = [
("baseline", baseline_config),
("with_diffusion", diffusion_config),
("with_llm", llm_config),
]
for name, cfg in configs:
with mlflow.start_run(run_name=name):
result = synth.generate(config=cfg, output={"format": "csv", "sink": "temp_dir"})
tracker.track_generation(result.output_dir, config=cfg)
Apache Spark
Read Synthetic Data as DataFrames
from datasynth_py.integrations import DataSynthSparkReader
reader = DataSynthSparkReader()
# Read a single table
je_df = reader.read_table(spark, "./output", "journal_entries")
je_df.show(5)
# Read all tables at once
tables = reader.read_all_tables(spark, "./output")
for name, df in tables.items():
print(f"{name}: {df.count()} rows")
# Create temporary SQL views
reader.create_temp_views(spark, "./output")
spark.sql("""
SELECT posting_date, SUM(amount) as total
FROM journal_entries
WHERE fiscal_period = 12
GROUP BY posting_date
ORDER BY posting_date
""").show()
End-to-End Pipeline Example
"""
Complete pipeline: Generate → Track → Load → Transform → Test
"""
from datasynth_py import DataSynth
from datasynth_py.config import blueprints
from datasynth_py.integrations import (
DataSynthMlflowTracker,
DataSynthSparkReader,
DbtSourceGenerator,
)
# 1. Generate
synth = DataSynth()
config = blueprints.retail_small(transactions=50000)
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
# 2. Track with MLflow
tracker = DataSynthMlflowTracker(experiment_name="pipeline_test")
tracker.track_generation(result.output_dir, config=config)
# 3. Load into Spark
reader = DataSynthSparkReader()
reader.create_temp_views(spark, result.output_dir)
# 4. Create dbt project for transformation testing
gen = DbtSourceGenerator()
gen.generate_sources_yaml(result.output_dir, "./dbt_project")
See Also
Contributing
Welcome to the SyntheticData contributor guide.
Overview
SyntheticData is an open-source project and we welcome contributions from the community. This section covers everything you need to know to contribute effectively.
Ways to Contribute
Code Contributions
- Bug fixes: Fix issues reported in the GitHub issue tracker
- New features: Implement new generators, output formats, or analysis tools
- Performance improvements: Optimize generation speed or memory usage
- Documentation: Improve or expand the documentation
Non-Code Contributions
- Bug reports: Report issues with detailed reproduction steps
- Feature requests: Suggest new features or improvements
- Documentation feedback: Point out unclear or missing documentation
- Testing: Test pre-release versions and report issues
Quick Start
# Fork and clone the repository
git clone https://github.com/YOUR_USERNAME/SyntheticData.git
cd SyntheticData
# Create a feature branch
git checkout -b feature/my-feature
# Make your changes and run tests
cargo test
# Submit a pull request
Contribution Guidelines
Before You Start
- Check existing issues: Look for related issues or discussions
- Open an issue first: For significant changes, discuss before implementing
- Follow code style: Run `cargo fmt` and `cargo clippy`
- Write tests: All new features need test coverage
- Update documentation: Keep docs in sync with code changes
Code of Conduct
We are committed to providing a welcoming and inclusive environment. Please:
- Be respectful and constructive in discussions
- Focus on the technical merits of contributions
- Help newcomers learn and contribute
- Report unacceptable behavior to the maintainers
Getting Help
- GitHub Issues: For bugs and feature requests
- GitHub Discussions: For questions and general discussion
- Pull Request Reviews: For feedback on your contributions
In This Section
| Page | Description |
|---|---|
| Development Setup | Set up your development environment |
| Code Style | Coding standards and conventions |
| Testing | Testing guidelines and practices |
| Pull Requests | PR submission and review process |
License
By contributing to SyntheticData, you agree that your contributions will be licensed under the project’s MIT License.
Development Setup
Set up your local development environment for SyntheticData.
Prerequisites
Required
- Rust: 1.88 or later (install via rustup)
- Git: For version control
- Cargo: Included with Rust
Optional
- Node.js 18+: For desktop UI development (datasynth-ui)
- Protocol Buffers: For gRPC development
- mdBook: For documentation development
Installation
1. Clone the Repository
git clone https://github.com/EY-ASU-RnD/SyntheticData.git
cd SyntheticData
2. Install Rust Toolchain
# Install rustup if not present
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install stable toolchain
rustup install stable
rustup default stable
# Add useful components
rustup component add clippy rustfmt
3. Build the Project
# Debug build (faster compilation)
cargo build
# Release build (optimized)
cargo build --release
# Check without building
cargo check
4. Run Tests
# Run all tests
cargo test
# Run tests with output
cargo test -- --nocapture
# Run specific crate tests
cargo test -p datasynth-core
cargo test -p datasynth-generators
IDE Setup
VS Code
Recommended extensions:
{
"recommendations": [
"rust-lang.rust-analyzer",
"tamasfe.even-better-toml",
"serayuzgur.crates",
"vadimcn.vscode-lldb"
]
}
Settings for the project:
{
"rust-analyzer.cargo.features": "all",
"rust-analyzer.checkOnSave.command": "clippy",
"editor.formatOnSave": true
}
JetBrains (RustRover/IntelliJ)
- Install the Rust plugin
- Open the project directory
- Configure Cargo settings under Preferences > Languages & Frameworks > Rust
Desktop UI Setup
For developing the Tauri/SvelteKit desktop UI:
# Navigate to UI crate
cd crates/datasynth-ui
# Install Node.js dependencies
npm install
# Run development server
npm run dev
# Run Tauri desktop app
npm run tauri dev
# Build production
npm run build
Documentation Setup
For working on documentation:
# Install mdBook
cargo install mdbook
# Build documentation
cd docs
mdbook build
# Serve with live reload
mdbook serve --open
# Generate Rust API docs
cargo doc --workspace --no-deps --open
Project Structure
SyntheticData/
├── crates/
│ ├── datasynth-cli/ # CLI binary
│ ├── datasynth-core/ # Core models and traits
│ ├── datasynth-config/ # Configuration schema
│ ├── datasynth-generators/ # Data generators
│ ├── datasynth-output/ # Output sinks
│ ├── datasynth-graph/ # Graph export
│ ├── datasynth-runtime/ # Orchestration
│ ├── datasynth-server/ # REST/gRPC server
│ ├── datasynth-ui/ # Desktop UI
│ └── datasynth-ocpm/ # OCEL 2.0 export
├── benches/ # Benchmarks
├── docs/ # Documentation
├── configs/ # Example configs
└── templates/ # Data templates
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `RUST_LOG` | Log level (trace, debug, info, warn, error) | `info` |
| `SYNTH_CONFIG_PATH` | Default config search path | Current directory |
| `SYNTH_TEMPLATE_PATH` | Template files location | `./templates` |
Debugging
VS Code Launch Configuration
{
"version": "0.2.0",
"configurations": [
{
"type": "lldb",
"request": "launch",
"name": "Debug CLI",
"cargo": {
"args": ["build", "--bin=datasynth-data", "--package=datasynth-cli"]
},
"args": ["generate", "--demo", "--output", "./output"],
"cwd": "${workspaceFolder}"
}
]
}
Logging
Enable debug logging:
RUST_LOG=debug cargo run --release -- generate --demo --output ./output
Module-specific logging:
RUST_LOG=synth_generators=debug,synth_core=info cargo run ...
Common Issues
Build Failures
# Clean and rebuild
cargo clean
cargo build
# Update dependencies
cargo update
Test Failures
# Run tests with backtrace
RUST_BACKTRACE=1 cargo test
# Run single test with output
cargo test test_name -- --nocapture
Memory Issues
For large generation volumes, increase system limits:
# Linux: Increase open file limit
ulimit -n 65536
# Check memory usage during generation
/usr/bin/time -v datasynth-data generate --config config.yaml --output ./output
Next Steps
- Review Code Style guidelines
- Read Testing practices
- Learn the Pull Request process
Code Style
Coding standards and conventions for SyntheticData.
Rust Style
Formatting
All code must be formatted with rustfmt:
# Format all code
cargo fmt
# Check formatting without changes
cargo fmt --check
Linting
Code must pass Clippy without warnings:
# Run clippy
cargo clippy
# Run clippy with all features
cargo clippy --all-features
# Run clippy on all targets
cargo clippy --all-targets
Configuration
The project uses these Clippy settings in Cargo.toml:
[workspace.lints.clippy]
all = "warn"
pedantic = "warn"
nursery = "warn"
Naming Conventions
General Rules
| Item | Convention | Example |
|---|---|---|
| Types | PascalCase | JournalEntry, VendorGenerator |
| Functions | snake_case | generate_batch, parse_config |
| Variables | snake_case | entry_count, total_amount |
| Constants | SCREAMING_SNAKE_CASE | MAX_LINE_ITEMS, DEFAULT_SEED |
| Modules | snake_case | je_generator, document_flow |
Domain-Specific Names
Use accounting domain terminology consistently:
#![allow(unused)]
fn main() {
// Good - uses domain terms
struct JournalEntry { ... }
struct ChartOfAccounts { ... }
fn post_to_gl() { ... }
// Avoid - generic terms
struct Entry { ... }
struct AccountList { ... }
fn save_data() { ... }
}
Code Organization
Module Structure
#![allow(unused)]
fn main() {
// 1. Module documentation
//! Brief description of the module.
//!
//! Extended description with examples.
// 2. Imports (grouped and sorted)
use std::collections::HashMap;
use chrono::{NaiveDate, Utc};
use rust_decimal::Decimal;
use serde::{Deserialize, Serialize};
use crate::models::JournalEntry;
// 3. Constants
const DEFAULT_BATCH_SIZE: usize = 1000;
// 4. Type definitions
pub struct Generator { ... }
// 5. Trait implementations
impl Generator { ... }
// 6. Unit tests
#[cfg(test)]
mod tests { ... }
}
Import Organization
Group imports in this order:
1. Standard library (`std::`)
2. External crates (alphabetically)
3. Workspace crates (`synth_*`)
4. Current crate (`crate::`)
#![allow(unused)]
fn main() {
use std::collections::HashMap;
use std::sync::Arc;
use chrono::NaiveDate;
use rust_decimal::Decimal;
use serde::{Deserialize, Serialize};
use uuid::Uuid;
use synth_core::models::JournalEntry;
use synth_core::traits::Generator;
use crate::config::GeneratorConfig;
}
Documentation
Public API Documentation
All public items must have documentation:
#![allow(unused)]
fn main() {
/// Generates journal entries with realistic financial patterns.
///
/// This generator produces balanced journal entries following
/// configurable statistical distributions for amounts, line counts,
/// and temporal patterns.
///
/// # Examples
///
/// ```
/// use synth_generators::JournalEntryGenerator;
///
/// let generator = JournalEntryGenerator::new(config, seed);
/// let entries = generator.generate_batch(1000)?;
/// ```
///
/// # Errors
///
/// Returns `GeneratorError` if:
/// - Configuration is invalid
/// - Memory limits are exceeded
pub struct JournalEntryGenerator { ... }
}
Module Documentation
Each module should have a module-level doc comment:
#![allow(unused)]
fn main() {
//! Journal Entry generation module.
//!
//! This module provides generators for creating realistic
//! journal entries with proper accounting rules enforcement.
//!
//! # Overview
//!
//! The main entry point is [`JournalEntryGenerator`], which
//! coordinates line item generation and balance verification.
}
Error Handling
Error Types
Use thiserror for error definitions:
#![allow(unused)]
fn main() {
use thiserror::Error;
#[derive(Debug, Error)]
pub enum GeneratorError {
#[error("Invalid configuration: {0}")]
InvalidConfig(String),
#[error("Memory limit exceeded: used {used} bytes, limit {limit} bytes")]
MemoryExceeded { used: usize, limit: usize },
#[error("IO error: {0}")]
Io(#[from] std::io::Error),
}
}
Result Types
Define type aliases for common result types:
#![allow(unused)]
fn main() {
pub type Result<T> = std::result::Result<T, GeneratorError>;
}
Error Propagation
Use ? for error propagation:
#![allow(unused)]
fn main() {
// Good
fn process() -> Result<Data> {
let config = load_config()?;
let data = generate_data(&config)?;
Ok(data)
}
// Avoid
fn process() -> Result<Data> {
let config = match load_config() {
Ok(c) => c,
Err(e) => return Err(e),
};
// ...
}
}
Financial Data
Decimal Precision
Always use rust_decimal::Decimal for financial amounts:
#![allow(unused)]
fn main() {
use rust_decimal::Decimal;
use rust_decimal_macros::dec;
// Good
let amount: Decimal = dec!(1234.56);
// Avoid - floating point
let amount: f64 = 1234.56;
}
Serialization
Serialize decimals as strings to avoid precision loss:
#![allow(unused)]
fn main() {
#[derive(Serialize, Deserialize)]
pub struct LineItem {
#[serde(serialize_with = "serialize_decimal_as_string")]
pub amount: Decimal,
}
}
Testing
Test Organization
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use super::*;
// Group related tests
mod generation {
use super::*;
#[test]
fn generates_balanced_entries() {
// Arrange
let config = test_config();
let generator = Generator::new(config, 42);
// Act
let entries = generator.generate_batch(100).unwrap();
// Assert
for entry in entries {
assert!(entry.is_balanced());
}
}
}
mod validation {
// ...
}
}
}
Test Naming
Use descriptive test names:
#![allow(unused)]
fn main() {
// Good - describes behavior
#[test]
fn rejects_unbalanced_entry() { ... }
#[test]
fn generates_benford_compliant_amounts() { ... }
// Avoid - vague names
#[test]
fn test_1() { ... }
#[test]
fn it_works() { ... }
}
Performance
Allocation
Minimize allocations in hot paths:
#![allow(unused)]
fn main() {
// Good - reuse buffer
let mut buffer = Vec::with_capacity(batch_size);
for _ in 0..batch_size {
buffer.push(generate_entry()?);
}
// Avoid - reallocations
let mut buffer = Vec::new();
for _ in 0..batch_size {
buffer.push(generate_entry()?);
}
}
Iterator Usage
Prefer iterators over explicit loops:
#![allow(unused)]
fn main() {
// Good
let total: Decimal = entries
.iter()
.map(|e| e.amount)
.sum();
// Avoid
let mut total = Decimal::ZERO;
for entry in &entries {
total += entry.amount;
}
}
See Also
- Testing - Testing guidelines
- Pull Requests - Submission process
Testing
Testing guidelines and practices for SyntheticData.
Running Tests
All Tests
# Run all tests
cargo test
# Run with output displayed
cargo test -- --nocapture
# Run tests in parallel (default)
cargo test
# Run tests sequentially
cargo test -- --test-threads=1
Specific Tests
# Run tests for a specific crate
cargo test -p datasynth-core
cargo test -p datasynth-generators
# Run a single test by name
cargo test test_balanced_entry
# Run tests matching a pattern
cargo test benford
cargo test journal_entry
Test Output
# Show stdout/stderr from tests
cargo test -- --nocapture
# Show output from passing tests as well
cargo test -- --show-output
# Run ignored tests
cargo test -- --ignored
# Run all tests including ignored
cargo test -- --include-ignored
Test Organization
Unit Tests
Place unit tests in the same file as the code:
#![allow(unused)]
fn main() {
// src/generators/je_generator.rs
pub struct JournalEntryGenerator { ... }
impl JournalEntryGenerator {
pub fn generate(&self) -> Result<JournalEntry> { ... }
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn generates_balanced_entry() {
let generator = JournalEntryGenerator::new(test_config(), 42);
let entry = generator.generate().unwrap();
assert!(entry.is_balanced());
}
}
}
Integration Tests
Place integration tests in the tests/ directory:
crates/datasynth-generators/
├── src/
│ └── ...
└── tests/
├── generation_flow.rs
└── document_chains.rs
Test Modules
Group related tests in submodules:
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
mod generation {
use super::super::*;
#[test]
fn batch_generation() { ... }
#[test]
fn streaming_generation() { ... }
}
mod validation {
use super::super::*;
#[test]
fn rejects_invalid_config() { ... }
}
}
}
Test Patterns
Arrange-Act-Assert
Use the AAA pattern for test structure:
#![allow(unused)]
fn main() {
#[test]
fn calculates_correct_total() {
// Arrange
let entries = vec![
create_entry(dec!(100.00)),
create_entry(dec!(200.00)),
create_entry(dec!(300.00)),
];
// Act
let total = calculate_total(&entries);
// Assert
assert_eq!(total, dec!(600.00));
}
}
Test Fixtures
Create helper functions for common test data:
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use super::*;
fn test_config() -> GeneratorConfig {
GeneratorConfig {
seed: 42,
batch_size: 100,
..Default::default()
}
}
fn create_test_entry() -> JournalEntry {
JournalEntryBuilder::new()
.with_company("1000")
.with_date(NaiveDate::from_ymd_opt(2024, 1, 15).unwrap())
.add_line(Account::CASH, dec!(1000.00), Decimal::ZERO)
.add_line(Account::REVENUE, Decimal::ZERO, dec!(1000.00))
.build()
.unwrap()
}
}
}
Deterministic Testing
Use fixed seeds for reproducibility:
#![allow(unused)]
fn main() {
#[test]
fn deterministic_generation() {
let seed = 42;
let gen1 = Generator::new(config.clone(), seed);
let gen2 = Generator::new(config.clone(), seed);
let result1 = gen1.generate_batch(100).unwrap();
let result2 = gen2.generate_batch(100).unwrap();
assert_eq!(result1, result2);
}
}
Property-Based Testing
Use proptest for property-based tests:
#![allow(unused)]
fn main() {
use proptest::prelude::*;
proptest! {
#[test]
fn entries_are_always_balanced(
debit in 1u64..1_000_000,
line_count in 2usize..10,
) {
let entry = generate_entry(debit, line_count);
prop_assert!(entry.is_balanced());
}
}
}
Domain-Specific Tests
Balance Verification
Test that journal entries are balanced:
#![allow(unused)]
fn main() {
#[test]
fn entry_debits_equal_credits() {
let entry = generate_test_entry();
let total_debits: Decimal = entry.lines
.iter()
.map(|l| l.debit_amount)
.sum();
let total_credits: Decimal = entry.lines
.iter()
.map(|l| l.credit_amount)
.sum();
assert_eq!(total_debits, total_credits);
}
}
Benford’s Law
Test amount distribution compliance:
#![allow(unused)]
fn main() {
#[test]
fn amounts_follow_benford() {
let entries = generate_entries(10_000);
let first_digits = extract_first_digits(&entries);
let observed = calculate_distribution(&first_digits);
let expected = benford_distribution();
let chi_square = calculate_chi_square(&observed, &expected);
assert!(chi_square < 15.51, "Distribution deviates from Benford's Law");
}
}
Document Chain Integrity
Test document reference chains:
#![allow(unused)]
fn main() {
#[test]
fn p2p_chain_is_complete() {
let documents = generate_p2p_flow();
// Verify chain: PO -> GR -> Invoice -> Payment
let po = &documents.purchase_order;
let gr = &documents.goods_receipt;
let invoice = &documents.vendor_invoice;
let payment = &documents.payment;
assert_eq!(gr.po_reference, Some(po.po_number.clone()));
assert_eq!(invoice.po_reference, Some(po.po_number.clone()));
assert_eq!(payment.invoice_reference, Some(invoice.invoice_number.clone()));
}
}
Decimal Precision
Test that decimal values maintain precision:
#![allow(unused)]
fn main() {
#[test]
fn decimal_precision_preserved() {
let original = dec!(1234.5678);
// Serialize and deserialize
let json = serde_json::to_string(&original).unwrap();
let restored: Decimal = serde_json::from_str(&json).unwrap();
assert_eq!(original, restored);
}
}
Benchmarks
Running Benchmarks
# Run all benchmarks
cargo bench
# Run specific benchmark
cargo bench --bench generation_throughput
# Run benchmark with specific filter
cargo bench -- batch_generation
Writing Benchmarks
#![allow(unused)]
fn main() {
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};
fn generation_benchmark(c: &mut Criterion) {
let config = test_config();
c.bench_function("generate_1000_entries", |b| {
b.iter(|| {
let generator = Generator::new(config.clone(), 42);
generator.generate_batch(1000).unwrap()
})
});
}
fn scaling_benchmark(c: &mut Criterion) {
let config = test_config();
let mut group = c.benchmark_group("scaling");
for size in [100, 1000, 10000].iter() {
group.bench_with_input(
BenchmarkId::from_parameter(size),
size,
|b, &size| {
b.iter(|| {
let generator = Generator::new(config.clone(), 42);
generator.generate_batch(size).unwrap()
})
},
);
}
group.finish();
}
criterion_group!(benches, generation_benchmark, scaling_benchmark);
criterion_main!(benches);
}
Test Coverage
Measuring Coverage
# Install coverage tool
cargo install cargo-tarpaulin
# Run with coverage
cargo tarpaulin --out Html
# View report
open tarpaulin-report.html
Coverage Guidelines
- Aim for 80%+ coverage on core logic
- 100% coverage on public API
- Focus on behavior, not lines
- Don’t test trivial getters/setters
Continuous Integration
Tests run automatically on:
- Pull request creation
- Push to main branch
- Nightly scheduled runs
CI Test Matrix
| Test Type | Trigger | Platform |
|---|---|---|
| Unit tests | All PRs | Linux, macOS, Windows |
| Integration tests | All PRs | Linux |
| Benchmarks | Main branch | Linux |
| Coverage | Weekly | Linux |
See Also
- Code Style - Coding standards
- Pull Requests - Submission process
Pull Requests
Guide to submitting and reviewing pull requests.
Before You Start
1. Check for Existing Work
- Search open issues for related discussions
- Check open PRs for similar changes
- Review the roadmap for planned features
2. Open an Issue First
For significant changes, open an issue to discuss:
- New features or major changes
- Breaking changes to public API
- Architectural changes
- Performance improvements
3. Create a Branch
# Sync with upstream
git checkout main
git pull origin main
# Create feature branch
git checkout -b feature/my-feature
# Or for bug fixes
git checkout -b fix/issue-123
Making Changes
1. Write Code
Follow the Code Style guidelines:
# Format code
cargo fmt
# Run clippy
cargo clippy
# Run tests
cargo test
2. Write Tests
Add tests for new functionality:
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn new_feature_works() {
// Test implementation
}
}
}
3. Update Documentation
- Update relevant docs in `docs/src/`
- Add/update rustdoc comments
- Update CHANGELOG.md if applicable
4. Commit Changes
Write clear commit messages:
# Good commit messages
git commit -m "Add Benford's Law validation to amount generator"
git commit -m "Fix off-by-one error in batch generation"
git commit -m "Improve memory efficiency in large volume generation"
# Avoid vague messages
git commit -m "Fix bug"
git commit -m "Update code"
git commit -m "WIP"
Commit Message Format
<type>: <short summary>
<optional detailed description>
<optional footer>
Types:
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation only
- `refactor`: Code change without feature/fix
- `test`: Adding/updating tests
- `perf`: Performance improvement
- `chore`: Maintenance tasks
Submitting a PR
1. Push Your Branch
git push -u origin feature/my-feature
2. Create Pull Request
Use the PR template:
## Summary
Brief description of changes.
## Changes
- Added X feature
- Fixed Y bug
- Updated Z documentation
## Testing
- [ ] Added unit tests
- [ ] Added integration tests
- [ ] Ran full test suite
- [ ] Tested manually
## Checklist
- [ ] Code follows project style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] Tests pass locally
- [ ] No new warnings from clippy
3. PR Title Format
<type>: <short description>
Examples:
- `feat: Add OCEL 2.0 export format`
- `fix: Correct decimal serialization in JSON output`
- `docs: Add process mining use case guide`
Review Process
Automated Checks
All PRs must pass:
| Check | Requirement |
|---|---|
| Build | Compiles on all platforms |
| Tests | All tests pass |
| Formatting | cargo fmt --check passes |
| Linting | cargo clippy has no warnings |
| Documentation | Builds without errors |
Code Review
Reviewers will check:
- Correctness: Does the code do what it claims?
- Tests: Are changes adequately tested?
- Style: Does code follow conventions?
- Documentation: Are changes documented?
- Performance: Any performance implications?
Responding to Feedback
- Address all comments
- Push fixes as new commits (don’t force-push during review)
- Mark resolved conversations
- Ask for clarification if needed
Merging
Requirements
Before merging:
- All CI checks pass
- At least one approving review
- No unresolved conversations
- Branch is up to date with main
Merge Strategy
We use squash and merge for most PRs:
- Combines all commits into one
- Keeps main history clean
- Preserves full history in PR
After Merge
- Delete your feature branch
- Update local main:
git checkout main
git pull origin main
git branch -d feature/my-feature
Special Cases
Breaking Changes
For breaking changes:
- Open an issue for discussion first
- Document migration path
- Update CHANGELOG with breaking change notice
- Use `BREAKING CHANGE:` in the commit footer
Large PRs
For large changes:
- Consider splitting into smaller PRs
- Create a tracking issue
- Use feature flags if needed
- Provide detailed documentation
Security Issues
For security vulnerabilities:
- Do not open a public issue
- Contact maintainers directly
- Follow responsible disclosure
PR Templates
Feature PR
## Summary
Adds [feature] to support [use case].
## Motivation
[Why is this needed?]
## Changes
- Added `NewType` struct in `datasynth-core`
- Implemented `NewGenerator` in `datasynth-generators`
- Added configuration options in `datasynth-config`
- Updated CLI to support new feature
## Testing
- Added unit tests for `NewType`
- Added integration tests for generation flow
- Manual testing with sample configs
## Documentation
- Added user guide section
- Updated configuration reference
- Added example configuration
Bug Fix PR
## Summary
Fixes #123 - [brief description]
## Root Cause
[What caused the bug?]
## Solution
[How does this fix it?]
## Testing
- Added regression test
- Verified fix with reproduction steps from issue
- Ran full test suite
## Checklist
- [ ] Regression test added
- [ ] Root cause documented
- [ ] Related issues linked
See Also
- Development Setup - Environment setup
- Code Style - Coding standards
- Testing - Testing guidelines
Compliance & Regulatory Overview
DataSynth generates synthetic financial data for testing, training, and analytics. This section documents how DataSynth aligns with key regulatory frameworks and provides self-assessment artifacts for compliance teams.
Regulatory Landscape
Synthetic data generation sits at the intersection of several regulatory domains. While pure synthetic data (generated without real-world data as input) generally faces fewer regulatory constraints than real data processing, organizations deploying DataSynth should understand the applicable frameworks.
EU AI Act
The EU AI Act (Regulation 2024/1689) introduces obligations for AI systems and their training data. DataSynth addresses two key articles:
Article 50 – Transparency for Synthetic Content: All DataSynth output includes machine-readable content credentials indicating that the data is synthetically generated. This is implemented through the ContentCredential system in datasynth-core, which embeds markers in CSV headers, JSON metadata, and Parquet file metadata. Content marking is enabled by default and can be configured via the marking section in the configuration YAML.
Article 10 – Data Governance: DataSynth generates automated DataGovernanceReport documents that describe data sources (synthetic generation, no real data used), processing steps (COA generation through quality validation), quality measures applied (Benford’s Law compliance, balance coherence, referential integrity), and bias assessments. These reports provide the documentation trail required under Article 10.
For full details, see EU AI Act Compliance.
NIST AI Risk Management Framework
The NIST AI RMF (AI 100-1) provides a voluntary framework for managing risks in AI systems. DataSynth has completed a self-assessment across all four core functions:
| Function | Focus Area | DataSynth Alignment |
|---|---|---|
| MAP | Context and use cases | Documented intended uses, users, and known limitations |
| MEASURE | Metrics and evaluation | Quality gates, privacy metrics (MIA, linkage), statistical validation |
| MANAGE | Risk mitigation | Deterministic reproducibility, audit logging, content marking |
| GOVERN | Policies and oversight | Access control (API key + JWT/RBAC), configuration management, quality gate governance |
For the complete self-assessment, see NIST AI RMF Self-Assessment.
GDPR
The General Data Protection Regulation applies differently depending on the DataSynth workflow:
Pure Synthetic Generation (no real data input): GDPR obligations are minimal because no personal data is processed. The generated output contains no data subjects. Article 30 records should still document the processing activity for audit completeness.
Fingerprint Extraction (real data as input): When DataSynth’s fingerprint module extracts statistical profiles from real datasets, GDPR applies in full. The fingerprint module includes differential privacy (Laplace mechanism with configurable epsilon/delta budgets), k-anonymity suppression of rare values, and a complete privacy audit trail. A Data Protection Impact Assessment (DPIA) template is provided for this scenario.
For templates and detailed guidance, see GDPR Compliance.
SOC 2 Readiness
DataSynth’s architecture supports SOC 2 Type II controls across the Trust Services Criteria:
| Criteria | DataSynth Controls |
|---|---|
| Security | API key authentication with Argon2id hashing, JWT/OIDC support, TLS termination, CORS lockdown |
| Availability | Graceful degradation under resource pressure, health/readiness endpoints |
| Processing Integrity | Deterministic RNG (ChaCha8), balanced journal entries enforced at construction, quality gates |
| Confidentiality | Content marking prevents synthetic data from being mistaken for real data |
| Privacy | Differential privacy in fingerprint extraction, no real PII in standard generation |
For deployment security controls, see Security Hardening.
ISO 27001 Alignment
DataSynth supports ISO 27001:2022 Annex A controls relevant to data processing tools:
| Control | Implementation |
|---|---|
| A.5.12 Classification of information | Content credentials classify all output as synthetic |
| A.8.10 Information deletion | Deterministic generation eliminates data retention concerns for pure synthetic workflows |
| A.8.11 Data masking | Fingerprint extraction applies differential privacy and k-anonymity |
| A.8.12 Data leakage prevention | Quality gates include privacy metrics (MIA AUC-ROC, linkage attack assessment) |
| A.8.25 Secure development lifecycle | Deterministic builds, dependency auditing (cargo audit), SBOM generation |
For access control configuration, see Security Hardening.
Quick Reference
| Framework | Status | Documentation |
|---|---|---|
| EU AI Act Article 50 | Implemented (content marking) | EU AI Act |
| EU AI Act Article 10 | Implemented (governance reports) | EU AI Act |
| NIST AI RMF | Self-assessment complete | NIST AI RMF |
| GDPR | Templates provided | GDPR |
| SOC 2 | Readiness documented | SOC 2 Readiness |
| ISO 27001 | Annex A alignment documented | ISO 27001 Alignment |
See Also
EU AI Act Compliance
DataSynth implements technical controls aligned with the EU Artificial Intelligence Act (Regulation 2024/1689), focusing on Article 50 (transparency for synthetic content) and Article 10 (data governance for high-risk AI systems).
Article 50 — Synthetic Content Marking
Article 50(2) requires that providers of AI systems generating synthetic content shall ensure outputs are marked in a machine-readable format and detectable as artificially generated.
How DataSynth Complies
DataSynth embeds machine-readable synthetic content credentials in all output files:
- CSV: Comment header lines with C2PA-inspired metadata
- JSON: `_synthetic_metadata` top-level object with credential fields
- Parquet: Key-value metadata pairs in the file footer
Configuration
compliance:
content_marking:
enabled: true # Default: true
format: embedded # embedded, sidecar, or both
article10_report: true # Generate Article 10 governance report
Marking Formats
| Format | Description |
|---|---|
| `embedded` | Credentials embedded directly in output files (default) |
| `sidecar` | Separate `.synthetic-credential.json` file alongside each output |
| `both` | Both embedded and sidecar credentials |
Credential Fields
Each synthetic content credential contains:
| Field | Description | Example |
|---|---|---|
| `generator` | Tool identifier | `"DataSynth"` |
| `version` | Generator version | `"0.5.0"` |
| `timestamp` | ISO 8601 generation time | `"2024-06-15T10:30:00Z"` |
| `content_type` | Output category | `"synthetic_financial_data"` |
| `method` | Generation technique | `"rule_based_statistical"` |
| `config_hash` | SHA-256 of the config used | `"a1b2c3..."` |
| `declaration` | Human-readable statement | `"This content is synthetic..."` |
Programmatic Detection
Third-party systems can detect synthetic DataSynth output by:
- CSV: Checking for `# X-Synthetic-Generator: DataSynth` header lines
- JSON: Checking for `_synthetic_metadata.generator == "DataSynth"`
- Parquet: Reading `synthetic_generator` from the file metadata
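A minimal detection routine, assuming the default embedded marking described above (the exact marker strings may vary by version):
# Detect DataSynth synthetic-content markers in CSV and JSON output files.
# Marker strings follow the conventions described above and may vary by version.
import json
from pathlib import Path

def is_synthetic(path: str) -> bool:
    p = Path(path)
    if p.suffix == ".csv":
        with p.open(encoding="utf-8") as f:
            head = [next(f, "") for _ in range(5)]
        return any("X-Synthetic-Generator" in line for line in head)
    if p.suffix == ".json":
        meta = json.loads(p.read_text(encoding="utf-8")).get("_synthetic_metadata", {})
        return meta.get("generator") == "DataSynth"
    return False

print(is_synthetic("./output/journal_entries.csv"))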
Article 10 — Data Governance
Article 10 requires appropriate data governance practices for training datasets used by high-risk AI systems. When synthetic data from DataSynth is used to train such systems, the Article 10 data governance report provides documentation.
Governance Report Contents
The automated report includes:
- Data Sources: Documentation of all inputs (configuration parameters, seed values, statistical distributions)
- Processing Steps: Complete pipeline documentation (CoA generation, master data, document flows, anomaly injection, quality validation)
- Quality Measures: Statistical validation results (Benford’s Law, balance coherence, distribution fitting)
- Bias Assessment: Known limitations, demographic representation gaps, and mitigation measures
Generating the Report
Enable in configuration:
compliance:
article10_report: true
The report is written as article10_governance_report.json in the output directory.
Report Structure
{
"report_version": "1.0",
"generator": "DataSynth",
"generated_at": "2024-06-15T10:30:00Z",
"data_sources": ["configuration_parameters", "statistical_distributions", "deterministic_rng"],
"processing_steps": [
"chart_of_accounts_generation",
"master_data_generation",
"document_flow_generation",
"journal_entry_generation",
"anomaly_injection",
"quality_validation"
],
"quality_measures": [
"benfords_law_compliance",
"balance_sheet_coherence",
"document_chain_integrity",
"referential_integrity"
],
"bias_assessment": {
"known_limitations": [
"Statistical distributions are parameterized, not learned from real data",
"Temporal patterns use simplified seasonal models"
],
"mitigation_measures": [
"Configurable distribution parameters per industry profile",
"Quality gate validation ensures statistical plausibility"
]
}
}
See Also
NIST AI Risk Management Framework Self-Assessment
This document provides a self-assessment of DataSynth against the NIST AI Risk Management Framework (AI 100-1, January 2023). The framework defines four core functions – MAP, MEASURE, MANAGE, and GOVERN – each with categories and subcategories. This assessment covers all four functions as they apply to a synthetic data generation tool.
Assessment Scope
- System: DataSynth synthetic financial data generator
- Version: 0.5.x
- Assessment Date: 2025
- Assessor: Development team (self-assessment)
- AI System Type: Data generation tool (not a decision-making AI system)
- Risk Classification: The generated synthetic data may be used as training data for AI/ML systems. DataSynth itself does not make autonomous decisions, but the quality of its output can affect downstream AI system performance.
MAP: Context and Framing
The MAP function establishes the context for AI risk management by identifying intended use cases, users, and known limitations.
MAP 1: Intended Use Cases
DataSynth is designed for the following use cases:
| Use Case | Description | Risk Level |
|---|---|---|
| ML Training Data | Generate labeled datasets for fraud detection, anomaly detection, and audit analytics models | Medium |
| Software Testing | Provide realistic test data for ERP systems, accounting platforms, and audit tools | Low |
| Privacy-Preserving Analytics | Replace real financial data with synthetic equivalents that preserve statistical properties | Medium |
| Compliance Testing | Generate SOX control test evidence, COSO framework data, and SoD violation scenarios | Low |
| Process Mining | Create OCEL 2.0 event logs for process analysis without exposing real business processes | Low |
| Education and Research | Provide realistic financial datasets for academic research and training | Low |
Not intended for: Replacement of real financial records in regulatory filings, direct use as evidence in audit engagements, or any scenario where the synthetic nature of the data is concealed.
MAP 2: Intended Users
| User Group | Typical Use | Access Level |
|---|---|---|
| Data Scientists | Training ML models for fraud/anomaly detection | API or CLI |
| QA Engineers | ERP and accounting system load/integration testing | CLI or Python wrapper |
| Auditors | Testing audit analytics tools against known-labeled data | CLI output files |
| Compliance Teams | SOX control testing, COSO framework validation | CLI or server API |
| Researchers | Academic study of financial data patterns | Python wrapper |
MAP 3: Known Limitations
DataSynth users should understand the following limitations:
- No Real PII: Generated names, identifiers, and addresses are synthetic. They do not correspond to real individuals or organizations. This is a design feature, not a limitation, but downstream systems should not treat synthetic identities as real.
- Statistical Approximation: Generated data follows configurable statistical distributions (log-normal, Benford’s Law, Gaussian mixtures) that approximate real-world patterns. They are not derived from actual transaction populations unless fingerprint extraction is used.
- Industry Profile Approximations: Pre-configured industry profiles (retail, manufacturing, financial services, healthcare, technology) are based on published research and general knowledge. They may not match specific organizations within an industry.
- Temporal Pattern Simplification: Business day calendars, holiday schedules, and intraday patterns are modeled but may not capture all regional or organizational nuances.
- Anomaly Injection Boundaries: Injected fraud patterns follow configurable typologies (ACFE taxonomy) but do not represent the full diversity of real-world fraud schemes.
- Fingerprint Extraction Privacy: When extracting fingerprints from real data, differential privacy noise and k-anonymity are applied. The privacy guarantees depend on correct epsilon/delta parameter selection.
MAP 4: Deployment Context
DataSynth can be deployed as:
- A CLI tool on developer workstations
- A server (REST/gRPC/WebSocket) in cloud or on-premises environments
- A Python library embedded in data pipelines
- A desktop application (Tauri/SvelteKit)
Each deployment context has different risk profiles. Server deployments require authentication, TLS, and rate limiting. CLI usage on trusted workstations has fewer access control requirements.
MEASURE: Metrics and Evaluation
The MEASURE function establishes metrics, methods, and benchmarks for evaluating AI system trustworthiness.
MEASURE 1: Quality Gate Metrics
DataSynth includes a comprehensive evaluation framework (datasynth-eval) with configurable quality gates. Each metric has defined thresholds and automated pass/fail checking.
Statistical Quality
| Metric | Gate Name | Threshold | Comparison | Purpose |
|---|---|---|---|---|
| Benford’s Law MAD | benford_compliance | 0.015 | LTE | First-digit distribution follows Benford’s Law |
| Balance Coherence | balance_sheet_valid | 1.0 | GTE | Assets = Liabilities + Equity |
| Document Chain Integrity | doc_chain_complete | 0.95 | GTE | P2P/O2C chains are complete |
| Temporal Consistency | temporal_valid | 0.90 | GTE | Temporal patterns match configuration |
| Correlation Preservation | correlation_check | 0.80 | GTE | Cross-field correlations preserved |
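For reference, the Benford MAD metric is the mean absolute deviation between the observed first-digit frequencies and the Benford expectation P(d) = log10(1 + 1/d). A minimal sketch, assuming the amounts are available as a pandas Series:
# Mean absolute deviation between observed first-digit frequencies and
# the Benford expectation P(d) = log10(1 + 1/d).
import numpy as np
import pandas as pd

def benford_mad(amounts: pd.Series) -> float:
    digits = amounts[amounts != 0].abs().astype(str).str.lstrip("0.").str[0].astype(int)
    observed = digits.value_counts(normalize=True).reindex(range(1, 10), fill_value=0.0)
    expected = np.log10(1 + 1 / np.arange(1, 10))
    return float(np.abs(observed.to_numpy() - expected).mean())

# A value at or below 0.015 passes the benford_compliance gate above.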
Data Quality
| Metric | Gate Name | Threshold | Comparison | Purpose |
|---|---|---|---|---|
| Completion Rate | completeness | 0.95 | GTE | Required fields are populated |
| Duplicate Rate | uniqueness | 0.05 | LTE | Acceptable duplicate rate |
| Referential Integrity | ref_integrity | 0.99 | GTE | Foreign key references valid |
| IC Match Rate | ic_matching | 0.95 | GTE | Intercompany transactions match |
Gate Profiles
Quality gates are organized into profiles with configurable strictness:
evaluation:
quality_gates:
profile: strict # strict, default, lenient
fail_strategy: collect_all
gates:
- name: benford_compliance
metric: benford_mad
threshold: 0.015
comparison: lte
- name: balance_valid
metric: balance_coherence
threshold: 1.0
comparison: gte
- name: completeness
metric: completion_rate
threshold: 0.95
comparison: gte
MEASURE 2: Privacy Evaluation
DataSynth evaluates privacy risk through empirical attacks on generated data.
Membership Inference Attack (MIA)
The MIA module (datasynth-eval/src/privacy/membership_inference.rs) implements a distance-based classifier that attempts to determine whether a specific record was part of the generation configuration. Key metrics:
| Metric | Threshold | Interpretation |
|---|---|---|
| AUC-ROC | <= 0.60 | Near-random classification indicates strong privacy |
| Accuracy | <= 0.55 | Low accuracy means synthetic data does not memorize patterns |
| Precision/Recall | Balanced | No systematic bias toward members or non-members |
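Conceptually, the distance-based classifier can be sketched in a few lines: records that lie unusually close to the synthetic sample are scored as likely "members". This is an illustrative sketch, not the datasynth-eval implementation.
# Illustrative distance-based membership inference check (not the datasynth-eval code).
# Smaller distance to the nearest synthetic record is treated as evidence of membership.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

def mia_auc(synthetic: np.ndarray, members: np.ndarray, non_members: np.ndarray) -> float:
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
    d_members, _ = nn.kneighbors(members)
    d_non_members, _ = nn.kneighbors(non_members)
    scores = -np.concatenate([d_members.ravel(), d_non_members.ravel()])
    labels = np.concatenate([np.ones(len(members)), np.zeros(len(non_members))])
    return roc_auc_score(labels, scores)  # values near 0.5 indicate strong privacy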
Linkage Attack Assessment
The linkage module (datasynth-eval/src/privacy/linkage.rs) evaluates re-identification risk using quasi-identifier combinations:
| Metric | Threshold | Interpretation |
|---|---|---|
| Re-identification Rate | <= 0.05 | Less than 5% of synthetic records can be linked to originals |
| K-Anonymity Achieved | >= 5 | Each quasi-identifier combination appears at least 5 times |
| Unique QI Overlap | Reported | Number of overlapping quasi-identifier combinations |
NIST SP 800-226 Alignment
The evaluation framework includes self-assessment against NIST SP 800-226 criteria for de-identification. The NistAlignmentReport evaluates:
- Data transformation adequacy
- Re-identification risk assessment
- Documentation completeness
- Privacy control effectiveness
The overall alignment score must be at least 71% for a passing grade.
Fingerprint Module Privacy
When fingerprint extraction is used with real data input, the datasynth-fingerprint privacy engine provides:
| Mechanism | Parameter | Default (Standard Level) |
|---|---|---|
| Differential Privacy (Laplace) | Epsilon | 1.0 |
| K-Anonymity | K threshold | 5 |
| Outlier Protection | Winsorization percentile | 95th |
| Composition | Method | Naive (RDP/zCDP available) |
Privacy levels provide pre-configured parameter sets:
| Level | Epsilon | K | Use Case |
|---|---|---|---|
| Minimal | 5.0 | 3 | Low sensitivity |
| Standard | 1.0 | 5 | Balanced (default) |
| High | 0.5 | 10 | Sensitive data |
| Maximum | 0.1 | 20 | Highly sensitive data |
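For intuition, the Laplace mechanism behind these epsilon values adds noise with scale sensitivity / epsilon to each released statistic, so a smaller epsilon yields noisier, more private fingerprints. A generic sketch of the mechanism, not the datasynth-fingerprint implementation:
# Laplace mechanism: release a statistic with noise scaled to sensitivity / epsilon.
# Generic differential-privacy sketch; not the datasynth-fingerprint implementation.
import numpy as np

def laplace_release(true_value: float, sensitivity: float, epsilon: float,
                    rng: np.random.Generator) -> float:
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(42)
# Releasing a mean with sensitivity 1.0 at the "Standard" level (epsilon = 1.0)
print(laplace_release(1234.56, sensitivity=1.0, epsilon=1.0, rng=rng))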
MEASURE 3: Completeness and Uniqueness
The evaluation module tracks data completeness and uniqueness metrics:
- Completeness: Measures the percentage of non-null values across all required fields. Reported as `overall_completeness` in the evaluation output.
- Uniqueness: Measures the duplicate rate across primary key fields. Collision-free UUIDs (FNV-1a hash-based with generator-type discriminators) ensure deterministic uniqueness.
MEASURE 4: Distribution Validation
Statistical validation tests verify that generated data matches configured distributions:
| Test | Implementation | Purpose |
|---|---|---|
| Benford First Digit | Chi-squared against Benford distribution | Transaction amounts follow expected first-digit distribution |
| Distribution Fit | Anderson-Darling test | Amount distributions match configured log-normal parameters |
| Correlation Check | Pearson/Spearman correlation | Cross-field correlations preserved via copula models |
| Temporal Patterns | Autocorrelation analysis | Seasonality and period-end patterns present |
MANAGE: Risk Mitigation
The MANAGE function addresses risk response and mitigation strategies.
MANAGE 1: Deterministic Reproducibility
DataSynth uses ChaCha8 CSPRNG with configurable seeds. Given the same configuration and seed, the output is identical across runs and platforms. This provides:
- Auditability: Any generated dataset can be exactly reproduced by preserving the configuration YAML and seed value.
- Debugging: Anomalous output can be reproduced for investigation.
- Regression Testing: Changes to generation logic can be detected by comparing output hashes.
global:
seed: 42 # Deterministic seed
industry: manufacturing
start_date: 2024-01-01
period_months: 12
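Reproducibility can be verified end-to-end by generating twice with the same configuration and comparing the records. The sketch below uses the Python wrapper entry points shown elsewhere in this guide; the journal_entries.csv file name is an assumption, and data rows are compared rather than raw bytes because content-marking headers carry a timestamp.
# Verify deterministic reproducibility: same config + seed => identical records.
# File name and comparison strategy are illustrative assumptions.
import pandas as pd
from datasynth_py import DataSynth

config = {"global": {"seed": 42, "industry": "manufacturing",
                     "start_date": "2024-01-01", "period_months": 12}}
synth = DataSynth()
r1 = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
r2 = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})
df1 = pd.read_csv(f"{r1.output_dir}/journal_entries.csv", comment="#")
df2 = pd.read_csv(f"{r2.output_dir}/journal_entries.csv", comment="#")
assert df1.equals(df2), "generation is not deterministic for this config"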
MANAGE 2: Audit Logging
DataSynth provides audit trails at multiple levels:
Generation Audit: The runtime emits structured JSON logs for every generation phase, including timing, record counts, and resource utilization.
Privacy Audit: The fingerprint module maintains a PrivacyAudit record of every privacy-related action (noise additions with epsilon spent, value suppressions, generalizations, winsorizations). This audit is embedded in the .dsf fingerprint file.
Server Audit: The REST/gRPC server logs authentication attempts, configuration changes, stream operations, and rate limit events with request correlation IDs (X-Request-Id).
Run Manifest: Each generation run produces a manifest documenting the configuration hash, seed, crate versions, start/end times, record counts, and quality gate results.
MANAGE 3: Data Lineage Tracking
DataSynth tracks data lineage through:
- Configuration Hashing: SHA-256 hash of the input configuration is embedded in all output metadata.
- Content Credentials: Every output file includes a `ContentCredential` linking back to the generator version, configuration hash, and seed.
- Document Reference Chains: Generated document flows maintain explicit reference chains (PO -> GR -> Invoice -> Payment) with `DocumentReference` records.
- Data Governance Reports: Automated Article 10 governance reports document all processing steps from COA generation through quality validation.
MANAGE 4: Content Marking
All synthetic output is marked to prevent confusion with real data:
- CSV: Comment headers with `# SYNTHETIC DATA - Generated by DataSynth v{version}`
- JSON: `_metadata.content_credential` object with generator, timestamp, config hash, and EU AI Act article reference
- Parquet: Custom metadata key-value pairs with full credential JSON
- Sidecar Files: Optional `.credential.json` files alongside output files
Content marking is enabled by default and can be configured:
marking:
enabled: true
format: embedded # embedded, sidecar, both
MANAGE 5: Graceful Degradation
The resource guard system (datasynth-core) monitors memory, disk, and CPU usage, applying progressive degradation:
| Level | Memory Threshold | Response |
|---|---|---|
| Normal | < 70% | Full feature generation |
| Reduced | 70-85% | Disable optional features |
| Minimal | 85-95% | Core generation only |
| Emergency | > 95% | Graceful shutdown |
This prevents resource exhaustion from affecting other systems in shared environments.
GOVERN: Policies and Oversight
The GOVERN function establishes organizational policies and structures for AI risk management.
GOVERN 1: Access Control
DataSynth implements layered access control for the server deployment:
API Key Authentication: Keys are hashed with Argon2id at startup. Verification uses timing-safe comparison with a short-lived cache to prevent side-channel attacks. Keys are provided via the X-API-Key header or Authorization: Bearer header.
JWT/OIDC Integration (optional jwt feature): Supports external identity providers (Keycloak, Auth0, Entra ID) with RS256 token validation. JWT claims include subject, roles, and tenant ID for multi-tenancy.
RBAC: Role-based access control via JWT claims enables differentiated access:
| Role | Permissions |
|---|---|
| `operator` | Start/stop/pause generation streams |
| `admin` | Configuration changes, API key management |
| `viewer` | Read-only access to status and metrics |
Exempt Paths: Health (/health), readiness (/ready), liveness (/live), and metrics (/metrics) endpoints are exempt from authentication for infrastructure integration.
GOVERN 2: Configuration Management
DataSynth configuration is managed through:
- YAML Schema Validation: All configuration is validated against a typed schema before generation begins. Invalid configurations produce descriptive error messages.
- Industry Presets: Pre-validated configuration presets for common industries (retail, manufacturing, financial services, healthcare, technology) reduce misconfiguration risk.
- Complexity Levels: Small (~100 accounts), medium (~400), and large (~2500) complexity levels provide validated scaling parameters.
- Template System: YAML/JSON templates with merge strategies enable configuration reuse while allowing overrides.
GOVERN 3: Quality Gates as Governance Controls
Quality gates serve as automated governance controls:
evaluation:
quality_gates:
profile: strict
fail_strategy: fail_fast # Stop on first failure
gates:
- name: benford_compliance
metric: benford_mad
threshold: 0.015
comparison: lte
- name: privacy_mia
metric: privacy_mia_auc
threshold: 0.60
comparison: lte
- name: balance_coherence
metric: balance_coherence
threshold: 1.0
comparison: gte
Gate profiles can enforce:
- Fail-fast: Stop generation on first quality failure
- Collect-all: Run all checks and report all failures
- Custom thresholds: Organization-specific quality requirements
The GateEngine evaluates all configured gates against the ComprehensiveEvaluation and produces a GateResult with per-gate pass/fail status, actual values, and summary messages.
GOVERN 4: Audit Trail Completeness
The following audit artifacts are produced for each generation run:
| Artifact | Location | Contents |
|---|---|---|
| Run Manifest | output/_manifest.json | Config hash, seed, timestamps, record counts, gate results |
| Content Credentials | Embedded in each output file | Generator version, config hash, seed, EU AI Act reference |
| Data Governance Report | output/_governance_report.json | Article 10 data sources, processing steps, quality measures, bias assessment |
| Privacy Audit | Embedded in .dsf files | Epsilon spent, actions taken, composition method, remaining budget |
| Server Logs | Structured JSON to stdout/log aggregator | Request traces, auth events, config changes, stream operations |
| Quality Gate Results | output/_evaluation.json | Per-gate pass/fail, actual vs threshold, summary |
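In CI, these artifacts can be consumed programmatically, for example to fail a pipeline when a quality gate did not pass. A sketch, assuming JSON field names that mirror the descriptions above (the exact schema may differ):
# Fail the pipeline if the evaluation artifact reports quality gate failures.
# Field names (config_hash, seed, gates, passed, ...) are assumptions based on
# the artifact descriptions above; adjust to the actual schema.
import json
import sys
from pathlib import Path

output = Path("./output")
manifest = json.loads((output / "_manifest.json").read_text())
evaluation = json.loads((output / "_evaluation.json").read_text())
print("config hash:", manifest.get("config_hash"), "seed:", manifest.get("seed"))

failed = [g for g in evaluation.get("gates", []) if not g.get("passed", False)]
for gate in failed:
    print(f"FAILED {gate.get('name')}: actual={gate.get('actual')} threshold={gate.get('threshold')}")
if failed:
    sys.exit(1)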
GOVERN 5: Incident Response
For scenarios where generated data is mistakenly used as real data:
- Detection: Content credentials in output files identify synthetic origin
- Containment: Deterministic generation means the exact dataset can be reproduced and identified
- Remediation: All output files carry machine-readable markers that downstream systems can check programmatically
- Prevention: Content marking is enabled by default and requires explicit configuration to disable
Assessment Summary
| Function | Category Count | Addressed | Notes |
|---|---|---|---|
| MAP | 4 | 4 | Use cases, users, limitations, and deployment documented |
| MEASURE | 4 | 4 | Quality gates, privacy metrics, completeness, distribution validation |
| MANAGE | 5 | 5 | Reproducibility, audit logging, lineage, content marking, degradation |
| GOVERN | 5 | 5 | Access control, config management, quality gates, audit trails, incident response |
Overall Assessment: DataSynth provides comprehensive risk management controls appropriate for a synthetic data generation tool. The primary residual risks relate to (1) parameter misconfiguration leading to unrealistic output, mitigated by quality gates and industry presets, and (2) privacy leakage during fingerprint extraction from real data, mitigated by differential privacy with configurable epsilon/delta budgets and empirical privacy evaluation.
GDPR Compliance
This document provides GDPR (General Data Protection Regulation) compliance guidance for DataSynth deployments. DataSynth generates purely synthetic data by default, but certain workflows (fingerprint extraction) may process real personal data.
Synthetic Data and GDPR
Pure Synthetic Generation
When DataSynth generates data from configuration alone (no real data input):
- No personal data is processed: All names, identifiers, and transactions are algorithmically generated
- No data subjects exist: Synthetic entities have no real-world counterparts
- GDPR does not apply to the generated output, as it contains no personal data per Article 4(1)
This is the default operating mode for all datasynth-data generate workflows.
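A minimal pure-synthetic configuration looks roughly like the sketch below. The field names are illustrative; the point is that no real data or personal identifiers appear anywhere in the input.

```yaml
# Minimal pure-synthetic run (illustrative field names; no real data input).
global:
  industry: retail
  complexity: small           # ~100 accounts
  seed: 1234                  # deterministic, reproducible output
output:
  format: csv
  directory: ./output
```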
Fingerprint Extraction Workflows
When using datasynth-data fingerprint extract with real data as input:
- Real personal data may be processed during statistical fingerprint extraction
- GDPR obligations apply to the extraction phase
- Differential privacy controls limit information retained in the fingerprint
- The output fingerprint (.dsf file) contains only aggregate statistics, not individual records
Article 30 — Records of Processing Activities
Template for Pure Synthetic Generation
| Field | Value |
|---|---|
| Purpose | Generation of synthetic financial data for testing, training, and validation |
| Categories of data subjects | None (no real data subjects) |
| Categories of personal data | None (all data is synthetic) |
| Recipients | Internal development, QA, and data science teams |
| Transfers to third countries | Not applicable (no personal data) |
| Retention period | Per project requirements |
| Technical measures | Seed-based deterministic generation, content marking |
Template for Fingerprint Extraction
| Field | Value |
|---|---|
| Purpose | Statistical fingerprint extraction for privacy-preserving data synthesis |
| Legal basis | Legitimate interest (Article 6(1)(f)) or consent |
| Categories of data subjects | As per source dataset (e.g., customers, vendors, employees) |
| Categories of personal data | As per source dataset (aggregate statistics only retained) |
| Recipients | Data engineering team operating DataSynth |
| Transfers to third countries | Assess per deployment topology |
| Retention period | Fingerprint files: per project; source data: minimize retention |
| Technical measures | Differential privacy (configurable epsilon/delta), k-anonymity |
Data Protection Impact Assessment (DPIA)
A DPIA under Article 35 is recommended when fingerprint extraction processes:
- Large-scale datasets (>100,000 records)
- Special categories of data (Article 9)
- Data relating to vulnerable persons
DPIA Template for Fingerprint Extraction
1. Description of Processing
DataSynth extracts statistical fingerprints from source data. The fingerprint captures distribution parameters (means, variances, correlations) without retaining individual records. Differential privacy noise is added with configurable epsilon/delta parameters.
2. Necessity and Proportionality
- Purpose: Enable realistic synthetic data generation without accessing source data repeatedly
- Minimization: Only aggregate statistics are retained
- Privacy controls: Differential privacy with user-specified budget
3. Risks to Data Subjects
| Risk | Likelihood | Severity | Mitigation |
|---|---|---|---|
| Re-identification from fingerprint | Low | High | Differential privacy, k-anonymity enforcement |
| Membership inference | Low | Medium | MIA AUC-ROC testing in evaluation framework |
| Fingerprint file compromise | Medium | Low | Aggregate statistics only, no individual records |
4. Measures to Address Risks
- Configure `fingerprint_privacy.level: high` or `maximum` for sensitive data
- Set `fingerprint_privacy.epsilon` to the 0.1-1.0 range (lower = stronger privacy)
- Enable k-anonymity with `fingerprint_privacy.k_anonymity >= 5`
- Use the evaluation framework's MIA testing to verify privacy guarantees
Privacy Configuration
```yaml
fingerprint_privacy:
  level: high                    # minimal, standard, high, maximum, custom
  epsilon: 0.5                   # Privacy budget (lower = stronger)
  delta: 1.0e-5                  # Failure probability
  k_anonymity: 10                # Minimum group size
  composition_method: renyi_dp   # naive, advanced, renyi_dp, zcdp
```
Privacy Level Presets
| Level | Epsilon | Delta | k-Anonymity | Use Case |
|---|---|---|---|---|
| minimal | 10.0 | 1e-3 | 2 | Non-sensitive aggregates |
| standard | 1.0 | 1e-5 | 5 | General business data |
| high | 0.5 | 1e-6 | 10 | Sensitive financial data |
| maximum | 0.1 | 1e-8 | 20 | Regulated personal data |
Data Subject Rights
Pure Synthetic Mode
Articles 15-22 (access, rectification, erasure, etc.) do not apply as no real data subjects exist in synthetic output.
Fingerprint Extraction Mode
- Right of access (Art. 15): Fingerprints contain only aggregate statistics; individual records cannot be extracted
- Right to erasure (Art. 17): Delete source data and fingerprint files; regenerate synthetic data with new parameters
- Right to restriction (Art. 18): Suspend fingerprint extraction pipeline
- Right to object (Art. 21): Remove individual from source dataset before extraction
International Transfers
- Synthetic output: Generally not subject to Chapter V transfer restrictions (no personal data)
- Fingerprint files: Assess whether aggregate statistics constitute personal data in your jurisdiction
- Source data: Standard GDPR transfer rules apply during fingerprint extraction
NIST SP 800-226 Alignment
DataSynth’s evaluation framework includes NIST SP 800-226 alignment reporting for synthetic data privacy assessment. Enable via:
```yaml
privacy:
  nist_alignment_enabled: true
```
SOC 2 Type II Readiness
This document describes how DataSynth’s architecture and controls align with the AICPA Trust Services Criteria (TSC) used in SOC 2 Type II engagements. DataSynth is a synthetic data generation tool, not a cloud-hosted SaaS product, so this assessment focuses on the controls embedded in the software itself rather than organizational policies. Organizations deploying DataSynth should layer their own operational controls (change management, personnel security, vendor management) on top of the technical controls described here.
Assessment Scope
- System: DataSynth synthetic financial data generator
- Version: 0.5.x
- Deployment Models: CLI binary, REST/gRPC/WebSocket server, Python library, desktop application
- Assessment Type: Architecture readiness (pre-audit self-assessment)
Security
The Security criterion (Common Criteria) requires that the system is protected against unauthorized access, both logical and physical.
Authentication
DataSynth’s server component (datasynth-server) implements two authentication mechanisms:
API Key Authentication: API keys are hashed with Argon2id (memory-hard, side-channel resistant) at server startup. Verification iterates all stored hashes without short-circuiting to prevent timing-based enumeration. A short-lived (5-second TTL) FNV-1a hash cache avoids repeated Argon2id computation for successive requests from the same client. Keys are accepted via Authorization: Bearer <key> or X-API-Key headers.
JWT/OIDC (optional jwt feature): External identity providers (Keycloak, Auth0, Entra ID) issue RS256-signed tokens. The JwtValidator verifies issuer, audience, expiration, and signature. Claims include subject, email, roles, and tenant ID for multi-tenancy.
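A hedged sketch of what the corresponding server authentication configuration could look like is shown below. The key names are assumptions; the `${ENV_VAR}` interpolation for secrets is supported, but the exact structure may differ from the shipped schema.

```yaml
# Illustrative server authentication settings; key names are assumptions.
auth:
  api_keys:
    - ${DATASYNTH_API_KEY}    # injected via environment, hashed with Argon2id at startup
  jwt:
    enabled: true             # requires the optional `jwt` feature
    issuer: https://idp.example.com/realms/datasynth
    audience: datasynth-server
    # RS256 signature, issuer, audience, and expiration are all verified
```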
Authorization
Role-Based Access Control (RBAC) enforces least-privilege access:
| Role | GenerateData | ManageJobs | ViewJobs | ManageConfig | ViewConfig | ViewMetrics | ManageApiKeys |
|---|---|---|---|---|---|---|---|
| Admin | Y | Y | Y | Y | Y | Y | Y |
| Operator | Y | Y | Y | N | Y | Y | N |
| Viewer | N | N | Y | N | Y | Y | N |
RBAC can be disabled for development environments; when disabled, all authenticated requests are treated as Admin.
Network Security
The security headers middleware injects the following headers on all server responses:
| Header | Value | Purpose |
|---|---|---|
| X-Content-Type-Options | nosniff | Prevent MIME-type sniffing |
| X-Frame-Options | DENY | Prevent clickjacking |
| Content-Security-Policy | default-src 'none'; frame-ancestors 'none' | Restrict resource loading |
| Referrer-Policy | strict-origin-when-cross-origin | Limit referrer leakage |
| Cache-Control | no-store | Prevent caching of API responses |
| X-XSS-Protection | 0 | Defer to CSP (modern best practice) |
TLS termination is supported via reverse proxy (nginx, Caddy, Envoy) or Kubernetes ingress. CORS is configurable with allowlisted origins.
Rate Limiting
Per-client rate limiting uses a sliding-window counter with configurable thresholds (requests per second, burst size). A Redis-backed rate limiter is available for multi-instance deployments (redis feature flag).
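A hedged configuration sketch for rate limiting follows; the key names are assumptions, not the exact schema.

```yaml
# Illustrative rate-limiting settings; key names are assumptions.
rate_limit:
  requests_per_second: 50     # sliding-window threshold per client
  burst_size: 100             # short bursts tolerated above the steady rate
  backend: redis              # multi-instance limiter (requires the `redis` feature)
  redis_url: ${REDIS_URL}
```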
Availability
The Availability criterion requires that the system is available for operation and use as committed.
Graceful Degradation
The DegradationController in datasynth-core monitors memory, disk, and CPU utilization and applies progressive feature reduction:
| Level | Memory | Disk | CPU | Response |
|---|---|---|---|---|
| Normal | < 70% | > 1000 MB | < 80% | All features enabled, full batch sizes |
| Reduced | 70–85% | 500–1000 MB | 80–90% | Half batch sizes, skip data quality injection |
| Minimal | 85–95% | 100–500 MB | > 90% | Essential data only, no anomaly injection |
| Emergency | > 95% | < 100 MB | – | Flush pending writes, terminate gracefully |
Auto-recovery with hysteresis (5% improvement required) allows the system to step back up one level at a time when resource pressure subsides.
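The thresholds in the table translate into a resource-guard configuration along the lines of the sketch below; the key names are assumptions, while the values mirror the table above.

```yaml
# Illustrative resource-guard thresholds; key names are assumptions.
resource_limits:
  memory_percent:
    reduced: 70               # enter Reduced above 70% memory use
    minimal: 85               # enter Minimal above 85%
    emergency: 95             # enter Emergency above 95%
  min_free_disk_mb: 100       # Emergency below 100 MB free disk
  cpu_throttle_threshold: 0.95
  recovery_hysteresis_percent: 5   # step back up only after a 5% improvement
```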
Resource Monitoring
- Memory guard: Reads `/proc/self/statm` (Linux) or `ps` (macOS) to track resident set size against configurable limits.
- Disk guard: Uses `statvfs` (Unix) or `GetDiskFreeSpaceExW` (Windows) to monitor available disk space in the output directory.
- CPU monitor: Tracks CPU utilization with auto-throttle at the 0.95 threshold.
- Resource guard: Unified orchestration that combines all three monitors and drives the `DegradationController`.
Graceful Shutdown
The server handles SIGTERM by stopping acceptance of new requests, waiting for in-flight requests to complete (with configurable timeout), and flushing pending output. The CLI supports SIGUSR1 for pause/resume of generation runs.
Health Endpoints
The following endpoints are exempt from authentication for infrastructure integration:
| Endpoint | Purpose |
|---|---|
| /health | General health check |
| /ready | Readiness probe (Kubernetes) |
| /live | Liveness probe (Kubernetes) |
| /metrics | Prometheus-compatible metrics |
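Because these endpoints are auth-exempt, they can be wired directly into Kubernetes probes. A minimal sketch, assuming the server listens on port 3000 as exposed by the container image:

```yaml
# Kubernetes probe wiring for the auth-exempt endpoints (adjust the port to your deployment).
livenessProbe:
  httpGet:
    path: /live
    port: 3000
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
```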
Processing Integrity
The Processing Integrity criterion requires that system processing is complete, valid, accurate, timely, and authorized.
Deterministic Generation
DataSynth uses the ChaCha8 cryptographically secure pseudo-random number generator with a configurable seed. Given the same configuration YAML and seed value, output is byte-identical across runs and platforms. This provides auditability (reproduce any dataset from its configuration) and regression detection (compare output hashes after code changes).
Quality Gates
The evaluation framework (datasynth-eval) applies configurable pass/fail criteria to every generation run. Built-in quality gate profiles provide three levels of strictness:
| Metric | Strict | Default | Lenient |
|---|---|---|---|
| Benford MAD | <= 0.01 | <= 0.015 | <= 0.03 |
| Balance Coherence | >= 0.999 | >= 0.99 | >= 0.95 |
| Document Chain Integrity | >= 0.95 | >= 0.90 | >= 0.80 |
| Completion Rate | >= 0.99 | >= 0.95 | >= 0.90 |
| Duplicate Rate | <= 0.001 | <= 0.01 | <= 0.05 |
| Referential Integrity | >= 0.999 | >= 0.99 | >= 0.95 |
| IC Match Rate | >= 0.99 | >= 0.95 | >= 0.85 |
| Privacy MIA AUC | <= 0.55 | <= 0.60 | <= 0.70 |
Gate evaluation supports fail-fast (stop on first failure) and collect-all (report all failures) strategies.
Balance Validation
The JournalEntry model enforces debits = credits at construction time. An entry that does not balance cannot be created, eliminating an entire class of data integrity errors.
Content Marking
EU AI Act Article 50 synthetic content credentials are embedded in all output files (CSV headers, JSON metadata, Parquet file metadata). This prevents synthetic data from being mistaken for real financial records. Content marking is enabled by default.
Confidentiality
The Confidentiality criterion requires that information designated as confidential is protected as committed.
No Real Data Storage
In the default operating mode (pure synthetic generation), DataSynth does not process, store, or transmit real data. All names, identifiers, transactions, and addresses are algorithmically generated from configuration parameters and RNG output.
Fingerprint Privacy
When the fingerprint extraction workflow processes real data, the following privacy controls apply:
| Mechanism | Default (Standard Level) |
|---|---|
| Differential privacy (Laplace) | Epsilon = 1.0, Delta = 1e-5 |
| K-anonymity suppression | K >= 5 |
| Composition accounting | Naive (Renyi DP, zCDP available) |
The output .dsf fingerprint file contains only aggregate statistics (means, variances, correlations), not individual records.
API Key Security
API keys are never stored in plaintext. At server startup, raw keys are hashed with Argon2id (random salt, PHC format) and discarded. Verification uses Argon2id comparison that iterates all stored hashes to prevent timing-based key enumeration.
Audit Logging
The JsonAuditLogger emits structured JSON audit events via the tracing crate. Each event records timestamp, request ID, actor identity (user ID or API key hash prefix), action, resource, outcome (success/denied/error), tenant ID, source IP, and user agent. Events are suitable for SIEM ingestion.
Privacy
The Privacy criterion requires that personal information is collected, used, retained, disclosed, and disposed of in conformity with commitments.
Synthetic Data by Design
DataSynth’s default mode generates purely synthetic data. No personal information is collected or processed. Generated entities (vendors, customers, employees) have no real-world counterparts. This eliminates most privacy obligations for pure synthetic workflows.
Privacy Evaluation
The evaluation framework includes empirical privacy testing:
- Membership Inference Attack (MIA): Distance-based classifier measures AUC-ROC. A score near 0.50 indicates the synthetic data does not memorize real data patterns.
- Linkage Attack Assessment: Evaluates re-identification risk using quasi-identifier combinations. Measures achieved k-anonymity and unique QI overlap.
NIST SP 800-226 Alignment
The evaluation framework generates NIST SP 800-226 alignment reports assessing data transformation adequacy, re-identification risk, documentation completeness, and privacy control effectiveness. An overall alignment score of >= 71% is required for a passing grade.
Fingerprint Extraction Privacy Levels
| Level | Epsilon | Delta | K-Anonymity | Use Case |
|---|---|---|---|---|
| minimal | 10.0 | 1e-3 | 2 | Non-sensitive aggregates |
| standard | 1.0 | 1e-5 | 5 | General business data |
| high | 0.5 | 1e-6 | 10 | Sensitive financial data |
| maximum | 0.1 | 1e-8 | 20 | Regulated personal data |
Controls Mapping
The following table maps DataSynth features to SOC 2 Trust Services Criteria identifiers.
| TSC ID | Criterion | DataSynth Control | Implementation |
|---|---|---|---|
| CC6.1 | Logical access security | API key authentication | auth.rs: Argon2id hashing, timing-safe comparison |
| CC6.1 | Logical access security | JWT/OIDC support | auth.rs: RS256 token validation (optional jwt feature) |
| CC6.3 | Role-based access | RBAC enforcement | rbac.rs: Admin/Operator/Viewer roles with permission matrix |
| CC6.6 | System boundaries | Security headers | security_headers.rs: CSP, X-Frame-Options, HSTS support |
| CC6.6 | System boundaries | Rate limiting | rate_limit.rs: Per-client sliding window, Redis backend |
| CC6.8 | Transmission security | TLS support | Reverse proxy TLS termination, Kubernetes ingress |
| CC7.2 | Monitoring | Resource guards | resource_guard.rs: CPU, memory, disk monitoring |
| CC7.2 | Monitoring | Audit logging | audit.rs: Structured JSON events for SIEM |
| CC7.3 | Change detection | Config hashing | SHA-256 hash of configuration embedded in output |
| CC7.4 | Incident response | Content marking | Content credentials identify synthetic origin |
| CC8.1 | Processing integrity | Deterministic RNG | ChaCha8 with configurable seed |
| CC8.1 | Processing integrity | Quality gates | gates/engine.rs: Configurable pass/fail thresholds |
| CC8.1 | Processing integrity | Balance validation | JournalEntry enforces debits = credits at construction |
| CC9.1 | Availability management | Graceful degradation | degradation.rs: Normal/Reduced/Minimal/Emergency levels |
| CC9.1 | Availability management | Health endpoints | /health, /ready, /live (auth-exempt) |
| P3.1 | Privacy notice | Synthetic content marking | EU AI Act Article 50 credentials in all output |
| P4.1 | Collection limitation | No real data by default | Pure synthetic generation requires no data collection |
| P6.1 | Data quality | Quality gates | Statistical, coherence, and privacy quality metrics |
| P8.1 | Disposal | Deterministic generation | No persistent state; regenerate from config + seed |
Gap Analysis
The following areas require organizational controls that are outside DataSynth’s software scope:
| Area | Recommendation |
|---|---|
| Physical security | Deploy on infrastructure with appropriate physical access controls |
| Change management | Implement CI/CD pipelines with code review and approval workflows |
| Vendor management | Assess third-party dependencies via cargo audit and SBOM generation |
| Personnel security | Apply organizational onboarding/offboarding procedures for API key management |
| Backup and recovery | Configure backup for generation configurations and output data per retention policies |
| Incident response plan | Document procedures for scenarios where synthetic data is mistakenly treated as real |
See Also
- ISO 27001 Alignment
- Security Hardening
- NIST AI RMF Self-Assessment
- GDPR Compliance
- EU AI Act Compliance
ISO 27001:2022 Alignment
This document maps DataSynth’s technical controls to the ISO/IEC 27001:2022 Annex A controls. DataSynth is a synthetic data generation tool, not a managed service, so this alignment focuses on controls that are directly addressable by the software. Organizational controls (A.5.1 through A.5.37), people controls (A.6), and physical controls (A.7) are primarily the responsibility of the deploying organization and are noted where DataSynth provides supporting capabilities.
Assessment Scope
- System: DataSynth synthetic financial data generator
- Version: 0.5.x
- Standard: ISO/IEC 27001:2022 (Annex A controls from ISO/IEC 27002:2022)
- Assessment Type: Self-assessment of technical control alignment
A.5 Organizational Controls
A.5.1 Policies for Information Security
DataSynth supports policy-as-code through its configuration management approach:
- Configuration-as-code: All generation parameters are defined in version-controllable YAML files with typed schema validation. Invalid configurations are rejected before generation begins.
- Industry presets: Pre-validated configurations for retail, manufacturing, financial services, healthcare, and technology industries reduce misconfiguration risk.
- CLAUDE.md: The project's development guidelines are codified and version-controlled alongside the source code, establishing security-relevant coding standards (`#[deny(clippy::unwrap_used)]`, input validation requirements).
Organizations should supplement these technical controls with written information security policies governing DataSynth deployment, access, and data handling.
A.5.12 Classification of Information
DataSynth classifies all generated output as synthetic through the content marking system:
- Embedded credentials: CSV headers, JSON metadata objects, and Parquet file metadata contain machine-readable `ContentCredential` records identifying the content as synthetic.
- Human-readable declarations: Each credential includes a `declaration` field: "This content is synthetically generated and does not represent real transactions or entities."
- Configuration hash: A SHA-256 hash of the generation configuration is embedded in output, enabling traceability from any output file back to its generation parameters.
- Sidecar files: Optional `.synthetic-credential.json` sidecar files provide classification metadata alongside each output file.
A.5.23 Information Security for Use of Cloud Services
DataSynth supports cloud deployment through:
- Kubernetes support: Helm charts and deployment manifests for containerized deployment with health (`/health`), readiness (`/ready`), and liveness (`/live`) probe endpoints.
- Stateless server: The server component maintains no persistent state beyond in-memory generation jobs. Configuration and output are externalized, supporting cloud-native architectures.
- TLS termination: Integration with Kubernetes ingress controllers, nginx, Caddy, and Envoy for TLS termination.
- Secret management: API keys can be injected via environment variables or mounted secrets rather than hardcoded in configuration files.
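For the secret-management item above, a standard Kubernetes pattern is to inject the key from a Secret rather than placing it in the configuration file. The Secret and variable names below are assumptions.

```yaml
# Illustrative Kubernetes container spec fragment; Secret and variable names are assumptions.
env:
  - name: DATASYNTH_API_KEY
    valueFrom:
      secretKeyRef:
        name: datasynth-secrets
        key: api-key
```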
A.8 Technological Controls
A.8.1 User Endpoint Devices
The CLI binary (datasynth-data) is a stateless executable:
- No persistent credentials: The CLI does not store API keys, tokens, or session data on disk.
- No network access required: The CLI operates entirely offline for generation workflows. Network access is only needed when connecting to a remote DataSynth server.
- Deterministic output: Given the same configuration and seed, the CLI produces identical output, eliminating concerns about endpoint-specific state affecting results.
A.8.5 Secure Authentication
DataSynth implements multiple authentication mechanisms:
API Key Authentication:
- Keys are hashed with Argon2id (memory-hard, timing-attack resistant) at server startup.
- Raw keys are discarded after hashing; only PHC-format hashes are retained in memory.
- Verification iterates all stored hashes without short-circuiting to prevent timing-based key enumeration.
- A 5-second TTL cache using FNV-1a fast hashing reduces repeated Argon2id computation overhead.
JWT/OIDC Integration (optional jwt feature):
- RS256 token validation with issuer, audience, and expiration checks.
- Compatible with Keycloak, Auth0, and Microsoft Entra ID.
- Claims extraction provides subject, email, roles, and tenant ID for downstream RBAC and audit.
Authentication Bypass:
- Infrastructure endpoints (`/health`, `/ready`, `/live`, `/metrics`) are exempt from authentication to support load balancer and orchestrator probes.
A.8.9 Configuration Management
DataSynth enforces configuration integrity through:
- Typed schema validation: YAML configuration is deserialized into strongly-typed Rust structs. Type mismatches, missing required fields, and constraint violations (e.g., rates outside 0.0–1.0, non-ascending approval thresholds) produce descriptive error messages before generation begins.
- Complexity presets: Small (~100 accounts), medium (~400), and large (~2500) complexity levels provide pre-validated scaling parameters.
- Template system: YAML/JSON templates with merge strategies enable configuration reuse while maintaining a single source of truth for shared settings.
- Configuration hashing: SHA-256 hash of the resolved configuration is computed before generation and embedded in all output metadata, enabling drift detection.
A.8.12 Data Leakage Prevention
DataSynth’s architecture inherently prevents data leakage:
- Synthetic-only generation: The default workflow generates data from statistical distributions and configuration parameters. No real data enters the pipeline.
- Content marking: All output files carry machine-readable synthetic content credentials (EU AI Act Article 50). Third-party systems can detect and flag synthetic content programmatically.
- Fingerprint privacy: When real data is used as input for fingerprint extraction, differential privacy (Laplace mechanism, configurable epsilon/delta) and k-anonymity suppress individual-level information. The resulting `.dsf` file contains only aggregate statistics.
- Quality gate enforcement: The `PrivacyMiaAuc` quality gate validates that generated data does not memorize real data patterns (MIA AUC-ROC threshold).
A.8.16 Monitoring Activities
DataSynth provides monitoring at multiple layers:
Structured Audit Logging:
The JsonAuditLogger emits structured JSON events via the tracing crate, recording:
- Timestamp (UTC), request ID, actor identity
- Action attempted, resource accessed, outcome (success/denied/error)
- Tenant ID, source IP, user agent
Events are emitted at INFO level with a dedicated audit_event structured field for log aggregation filtering.
Resource Monitoring:
- Memory guard reads `/proc/self/statm` (Linux) or `ps` (macOS) for resident set size tracking.
- Disk guard uses `statvfs` (Unix) or `GetDiskFreeSpaceExW` (Windows) for available space monitoring.
- CPU monitor tracks utilization with auto-throttle at the 0.95 threshold.
- The `DegradationController` combines all monitors and emits level-change events when resource pressure triggers degradation.
Generation Monitoring:
- Run manifests capture configuration hash, seed, crate versions, start/end times, record counts, and quality gate results.
- The Prometheus-compatible `/metrics` endpoint exposes runtime statistics.
A.8.24 Use of Cryptography
DataSynth uses cryptographic primitives for the following purposes:
| Purpose | Algorithm | Implementation |
|---|---|---|
| Deterministic RNG | ChaCha8 (CSPRNG) | rand_chacha crate, configurable seed |
| API key hashing | Argon2id | argon2 crate, random salt, PHC format |
| Configuration integrity | SHA-256 | Config hash embedded in output metadata |
| JWT verification | RS256 (RSA + SHA-256) | jsonwebtoken crate (optional jwt feature) |
| UUID generation | FNV-1a hash | Deterministic collision-free UUIDs with generator-type discriminators |
Cryptographic operations use well-maintained Rust crate implementations. No custom cryptographic algorithms are implemented.
A.8.25 Secure Development Lifecycle
DataSynth’s development process includes:
- Static analysis: `cargo clippy` with `#[deny(clippy::unwrap_used)]` enforces safe error handling across the codebase.
- Test coverage: 2,500+ tests across 15 crates covering unit, integration, and property-based scenarios.
- Dependency auditing: `cargo audit` checks for known vulnerabilities in dependencies.
- Type safety: Rust's ownership model and type system eliminate entire classes of memory safety and concurrency bugs at compile time.
- MSRV policy: Minimum Supported Rust Version (1.88) ensures builds use a recent, well-supported compiler.
- CI/CD: Automated build, test, lint, and audit checks on every commit.
A.8.28 Secure Coding
DataSynth applies secure coding practices:
- No `unwrap()` in library code: `#[deny(clippy::unwrap_used)]` prevents panics from unchecked error handling.
- Input validation: All user-provided configuration values are validated against typed schemas with range constraints before use.
- Precise decimal arithmetic: Financial amounts use `rust_decimal` (serialized as strings) instead of IEEE 754 floating point, preventing rounding errors in financial calculations.
- No unsafe code: The codebase does not use `unsafe` blocks in application logic.
- Timing-safe comparisons: API key verification uses constant-time Argon2id comparison (iterating all hashes) to prevent side-channel attacks.
- Memory-safe concurrency: Rust's ownership model prevents data races at compile time. Shared state uses `Arc<Mutex<>>` or atomic operations.
Statement of Applicability
The following table summarizes the applicability of ISO 27001:2022 Annex A controls to DataSynth.
Implemented Controls
| Control | Title | Implementation |
|---|---|---|
| A.5.1 | Information security policies | Configuration-as-code with schema validation |
| A.5.12 | Classification of information | Synthetic content marking (EU AI Act Article 50) |
| A.5.23 | Cloud service security | Kubernetes deployment, health probes, TLS support |
| A.8.1 | User endpoint devices | Stateless CLI with no persistent credentials |
| A.8.5 | Secure authentication | Argon2id API keys, JWT/OIDC, RBAC |
| A.8.9 | Configuration management | Typed schema validation, presets, hashing |
| A.8.12 | Data leakage prevention | Synthetic-only generation, content marking, fingerprint privacy |
| A.8.16 | Monitoring activities | Structured audit logs, resource monitors, run manifests |
| A.8.24 | Use of cryptography | ChaCha8 RNG, Argon2id, SHA-256, RS256 JWT |
| A.8.25 | Secure development lifecycle | Clippy, 2,500+ tests, cargo audit, CI/CD |
| A.8.28 | Secure coding | No unwrap, input validation, precise decimals, no unsafe |
Partially Implemented Controls
| Control | Title | Status | Gap |
|---|---|---|---|
| A.5.8 | Information security in project management | Partial | Security considerations are embedded in code (schema validation, quality gates) but formal project management security procedures are organizational |
| A.5.14 | Information transfer | Partial | TLS support for server API; file-based output transfer policies are organizational |
| A.5.29 | Information security during disruption | Partial | Graceful degradation handles resource pressure; broader business continuity is organizational |
| A.8.8 | Management of technical vulnerabilities | Partial | cargo audit scans dependencies; patch management cadence is organizational |
| A.8.15 | Logging | Partial | Structured JSON audit events with correlation IDs; log retention and SIEM integration are organizational |
| A.8.26 | Application security requirements | Partial | Input validation and schema enforcement are built-in; threat modeling documentation is organizational |
Not Applicable Controls
| Control | Title | Rationale |
|---|---|---|
| A.5.19 | Information security in supplier relationships | DataSynth is open-source software; supplier controls apply to the deploying organization |
| A.5.30 | ICT readiness for business continuity | Business continuity planning is an organizational responsibility |
| A.6.1–A.6.8 | People controls | Personnel security controls are organizational |
| A.7.1–A.7.14 | Physical controls | Physical security controls depend on deployment environment |
| A.8.2 | Privileged access rights | OS-level privilege management is outside DataSynth’s scope |
| A.8.7 | Protection against malware | Endpoint protection is an infrastructure concern |
| A.8.20 | Networks security | Network segmentation and firewall rules are infrastructure concerns |
| A.8.23 | Web filtering | Web filtering is an organizational network control |
Continuous Improvement
DataSynth supports ISO 27001’s Plan-Do-Check-Act cycle through:
- Plan: Configuration-as-code with schema validation enforces security requirements at design time.
- Do: Automated quality gates and resource guards enforce controls during operation.
- Check: Evaluation framework produces quantitative metrics (Benford MAD, balance coherence, MIA AUC-ROC) that can be trended over time.
- Act: The AutoTuner in `datasynth-eval` generates configuration patches from evaluation gaps, creating a feedback loop for continuous improvement (see the illustrative patch below).
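An auto-tuner recommendation might surface as a configuration patch shaped roughly like the sketch below; both the field names and the suggested adjustment are illustrative only.

```yaml
# Illustrative shape of an AutoTuner recommendation; field names are assumptions.
recommendations:
  - metric: benford_mad
    observed: 0.021
    target: 0.015
    suggested_patch:
      anomaly_config:
        fraud_rate: 0.01      # example adjustment to bring the first-digit distribution back in line
```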
Roadmap: Enterprise Simulation & ML Ground Truth
This roadmap outlines completed features, planned enhancements, and the wave-based expansion strategy for enterprise process chain coverage.
Completed Features
v0.1.0 — Core Generation
- Statistical distributions: Benford’s Law compliance, log-normal mixtures, copulas
- Industry presets: Manufacturing, Retail, Financial Services, Healthcare, Technology
- Chart of Accounts: Small (~100), Medium (~400), Large (~2500) complexity levels
- Temporal patterns: Month-end/quarter-end volume spikes, business day calendars
- Master data: Vendors, customers, materials, fixed assets, employees
- Document flows: P2P (6 PO types, three-way match) and O2C (9 SO types, 6 delivery types, 7 invoice types)
- Intercompany: IC matching, transfer pricing, consolidation elimination entries
- Subledgers: AR (aging, dunning), AP (scheduling, discounts), FA (6 depreciation methods), Inventory (22 movement types, 4 valuation methods)
- Currency & FX: Ornstein-Uhlenbeck exchange rates, ASC 830 translation, CTA
- Period close: Monthly close engine, accruals, depreciation runs, year-end closing
- Balance coherence: Opening balances, running balance tracking, trial balance per period
- Anomaly injection: 60+ fraud types, error patterns, process issues with full labeling
- Data quality: Missing values (MCAR/MAR/MNAR), format variations, typos, duplicates
- Graph export: PyTorch Geometric, Neo4j, DGL with train/val/test splits
- Internal controls: COSO 2013 framework, SoD rules, 12 transaction + 6 entity controls
- Resource guards: Memory, disk, CPU monitoring with graceful degradation
- REST/gRPC/WebSocket server with authentication and rate limiting
- Desktop UI: Tauri/SvelteKit with 15+ configuration pages
- Python wrapper: Programmatic access with blueprints and config validation
v0.2.0 — Privacy & Standards
- Fingerprint extraction: Statistical properties from real data into `.dsf` files
- Differential privacy: Laplace and Gaussian mechanisms with configurable epsilon
- K-anonymity: Suppression of rare categorical values
- Fidelity evaluation: KS, Wasserstein, Benford MAD metric comparison
- Gaussian copula synthesis: Preserve multivariate correlations
- Accounting standards: Revenue recognition (ASC 606/IFRS 15), Leases (ASC 842/IFRS 16), Fair Value (ASC 820/IFRS 13), Impairment (ASC 360/IAS 36)
- Audit standards: ISA compliance (34 standards), analytical procedures, confirmations, opinions, PCAOB mappings
- SOX compliance: Section 302/404 assessments, deficiency matrix, material weakness classification
- Streaming output: CSV, JSON, NDJSON, Parquet streaming sinks with backpressure
- ERP output formats: SAP S/4HANA (BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA), Oracle EBS (GL_JE_HEADERS/LINES), NetSuite
v0.3.0 — Fraud & Industry
- ACFE-aligned fraud taxonomy: Asset misappropriation, corruption, financial statement fraud calibrated to ACFE statistics
- Collusion modeling: 8 ring types, 6 conspirator roles, defection/escalation dynamics
- Management override: Fraud triangle modeling (pressure, opportunity, rationalization)
- Red flag generation: 40+ probabilistic indicators with Bayesian probabilities
- Industry-specific generators: Manufacturing (BOM, WIP, production orders), Retail (POS, shrinkage, loyalty), Healthcare (ICD-10, CPT, DRG, payer mix)
- Industry benchmarks: Pre-configured ML benchmarks per industry
- Banking/KYC/AML: Customer personas, KYC profiles, fraud typologies (structuring, funnel, layering, mule, round-tripping)
- Process mining: OCEL 2.0 event logs with P2P and O2C processes
- Evaluation framework: Auto-tuning with configuration recommendations from metric gaps
- Vendor networks: Tiered supply chains, quality scores, clusters
- Customer segmentation: Value segments, lifecycle stages, network positions
- Cross-process links: Entity graph, relationship strength, cross-process integration
v0.5.0 — AI & Advanced Features
- LLM-augmented generation: Pluggable provider abstraction (Mock, OpenAI, Anthropic) for realistic vendor names, descriptions, memo fields, and anomaly explanations
- Natural language configuration: Generate YAML configs from descriptions
- Diffusion model backend: Statistical diffusion with configurable noise schedules (linear, cosine, sigmoid) for learned distribution capture
- Hybrid generation: Blend rule-based and diffusion outputs
- Causal generation: Structural Causal Models (SCMs), do-calculus interventions, counterfactual generation
- Built-in causal templates: `fraud_detection` and `revenue_cycle` causal graphs
- Federated fingerprinting: Secure aggregation (weighted average, median, trimmed mean) for distributed data sources
- Synthetic data certificates: Cryptographic proof of DP guarantees with HMAC-SHA256 signing
- Privacy-utility Pareto frontier: Automated exploration of optimal epsilon values
- Ecosystem integrations: Airflow, dbt, MLflow, Spark pipeline integration
Planned Enhancements
Wave 1 — Foundation (enables everything else)
These items close the most critical gaps and unblock downstream work.
| Item | Chain | Description | Dependencies |
|---|---|---|---|
| S2C completion | S2P | Source-to-Contract: spend analysis, RFx, bid evaluation, contract management, catalog items, supplier scorecards | Extends existing P2P |
| Bank reconciliation | BANK | Bank statement lines, auto-matching, reconciliation breaks, clearing | Validates all payment chains |
| Financial statement generator | R2R | Balance sheet, income statement, cash flow statement from trial balance | Consumes all JE data |
Impact: S2C creates a closed-loop procurement model. Bank reconciliation validates payment integrity across S2P and O2C. Financial statements provide the final reporting layer for R2R.
Wave 2 — Core Process Chains
| Item | Chain | Description | Dependencies |
|---|---|---|---|
| Payroll & time management | H2R | Payroll runs, time entries, overtime, benefits, tax withholding | Employee master data |
| Revenue recognition generator | O2C→R2R | Wire CustomerContract + PerformanceObligation models to SO/Invoice data | Existing ASC 606 models |
| Impairment generator | A2R→R2R | Wire existing ImpairmentTest model to FA generator with JE output | Existing ASC 360 models |
Impact: Payroll is the largest H2R gap and enables SoD analysis for personnel. Revenue recognition and impairment generators wire existing standards models into the generation pipeline.
Wave 3 — Operational Depth
| Item | Chain | Description | Dependencies |
|---|---|---|---|
| Production orders & WIP | MFG | Production order lifecycle, material consumption, WIP costing, variance analysis | Manufacturing industry config |
| Cycle counting & QA | INV | Cycle count programs, quality inspection, inspection lots, vendor quality feedback | Inventory subledger |
| Expense management | H2R | Expense reports, policy enforcement, receipt matching, reimbursement | Employee master data |
Impact: Manufacturing becomes a fully simulated chain. Inventory completeness enables ABC analysis and obsolescence. Expenses extend H2R with AP integration.
Wave 4 — Polish
| Item | Chain | Description | Dependencies |
|---|---|---|---|
| Sales quotes | O2C | Quote-to-order conversion tracking (fills orphan quote_id FK) | O2C generator |
| Cash forecasting | BANK | Projected cash flows from AP/AR schedules | AP/AR subledgers |
| KPIs & budget variance | R2R | Management reporting, budget vs actual analysis | Financial statements |
| Obsolescence management | INV | Slow-moving/excess stock identification and write-downs | Inventory aging |
Impact: These items round out each chain with planning and reporting capabilities.
Cross-Process Integration Vision
The wave plan steadily increases cross-process coverage:
| Integration | Current | After Wave 1 | After Wave 2 | After Wave 4 |
|---|---|---|---|---|
| S2P → Inventory | GR updates stock | Same | Same | Same |
| Inventory → O2C | Delivery reduces stock | Same | Same | Obsolescence feeds write-downs |
| S2P/O2C → BANK | Payments created | Payments reconciled | Same | Cash forecasting |
| All → R2R | JEs → Trial Balance | JEs → Financial Statements | + Revenue recog, impairment | + Budget variance |
| H2R → S2P | Employee authorizations | Same | Expense → AP | Same |
| S2P → A2R | Capital PO → FA | Same | Same | Same |
| MFG → S2P | Config only | Same | Production → PR demand | Same |
| MFG → INV | Config only | Same | WIP → FG transfers | + QA feedback |
Coverage Targets
| Chain | Current | Wave 1 | Wave 2 | Wave 3 | Wave 4 |
|---|---|---|---|---|---|
| S2P | 85% | 95% | 95% | 95% | 95% |
| O2C | 93% | 93% | 97% | 97% | 99% |
| R2R | 78% | 88% | 92% | 92% | 97% |
| A2R | 70% | 70% | 80% | 80% | 80% |
| INV | 55% | 55% | 55% | 75% | 85% |
| BANK | 65% | 85% | 85% | 85% | 90% |
| H2R | 30% | 30% | 60% | 75% | 75% |
| MFG | 20% | 20% | 20% | 60% | 60% |
Guiding Principles
- Enterprise realism: Simulate multi-entity, multi-region, multi-currency operations with coherent process flows
- ML ground truth: Capture true labels and causal factors for supervised learning, explainability, and evaluation
- Scalability: Handle large volumes with stable performance and reproducible results
- Backward compatibility: New features are additive; existing configs continue to work
Dependencies & Risks
- Schema stability: New models must not break existing serialization formats
- Performance: Each wave adds generators; resource guards ensure stable memory/CPU
- Validation complexity: Cross-chain coherence checks multiply as integration points increase
Contributing
We welcome contributions to any roadmap area. See Contributing Guidelines for details.
To propose new features:
- Open a GitHub issue with the `enhancement` label
- Describe the use case and expected behavior
- Reference relevant roadmap items if applicable
Feedback
Roadmap priorities are influenced by user feedback. Please share your use cases and requirements:
- GitHub Issues: Feature requests and bug reports
- Email: michael.ivertowski@ch.ey.com
See Also
- Process Chains — Current process chain architecture and coverage matrix
- S2P Spec — Source-to-Contract specification
- Process Chain Gaps — Detailed gap analysis
Production Readiness Roadmap
Version: 1.0 | Date: February 2026 | Status: Living Document
This roadmap addresses the infrastructure, operations, security, compliance, and ecosystem maturity required to transition DataSynth from a feature-complete beta to a production-grade enterprise platform. It complements the existing feature roadmap which covers domain-specific enhancements.
Table of Contents
- Current State Assessment
- Phase 1: Foundation (0-3 months)
- Phase 2: Hardening (3-6 months)
- Phase 3: Enterprise Grade (6-12 months)
- Phase 4: Market Leadership (12-18 months)
- Industry & Research Context
- Competitive Positioning
- Regulatory Landscape
- Risk Register
Current State Assessment
Production Readiness Scorecard (v0.5.0 — Phase 2 Complete)
| Category | Score | Status | Key Findings |
|---|---|---|---|
| Workspace Structure | 9/10 | Excellent | 15 well-organized crates, clear separation of concerns |
| Testing | 10/10 | Excellent | 2,500+ tests, property testing via proptest, fuzzing harnesses (cargo-fuzz), k6 load tests, coverage via cargo-llvm-cov + Codecov |
| CI/CD | 9/10 | Excellent | 7-job pipeline: fmt, clippy, cross-platform test (Linux/macOS/Windows), MSRV 1.88, security scanning (cargo-deny + cargo-audit), coverage, benchmark regression |
| Error Handling | 10/10 | Excellent | Idiomatic thiserror/anyhow; #![deny(clippy::unwrap_used)] enforced across all library crates; zero unwrap calls in non-test code |
| Observability | 9/10 | Excellent | Structured JSON logging, feature-gated OpenTelemetry (OTLP traces + Prometheus metrics), request ID propagation, request logging middleware, data lineage graph |
| Deployment | 10/10 | Excellent | Multi-stage Dockerfile (distroless), Docker Compose, Kubernetes Helm chart (HPA, PDB, Redis subchart), SystemD service, comprehensive deployment guides (Docker, K8s, bare-metal) |
| Security | 9/10 | Excellent | Argon2id key hashing with timing-safe comparison, security headers, request validation, TLS support (rustls), env var interpolation for secrets, cargo-deny + cargo-audit in CI, security hardening guide |
| Performance | 9/10 | Excellent | 5 Criterion benchmark suites, 100K+ entries/sec; CI benchmark regression tracking on PRs; k6 load testing framework |
| Python Bindings | 8/10 | Strong | Strict mypy, PEP 561 compliant, blueprints; classified as “Beta”, no async support |
| Server | 10/10 | Excellent | REST/gRPC/WebSocket complete; async job queue; distributed rate limiting (Redis); stateless config loading; enhanced probes; full middleware stack |
| Documentation | 10/10 | Excellent | mdBook + rustdoc + CHANGELOG + CONTRIBUTING; deployment guides (Docker, K8s, bare-metal), operational runbook, capacity planning, DR procedures, API reference, security hardening |
| Code Quality | 10/10 | Excellent | Zero TODO/FIXME comments, warnings-as-errors enforced, panic-free library crates, 6 unsafe blocks (all justified) |
| Privacy | 9/10 | Excellent | Formal DP composition (RDP, zCDP), privacy budget management, MIA/linkage evaluation, NIST SP 800-226 alignment, SynQP matrix, custom privacy levels |
| Data Lineage | 9/10 | Excellent | Per-file checksums, lineage graph, W3C PROV-JSON export, CLI verify command for manifest integrity |
Overall: 9.4/10 — Enterprise-grade with Kubernetes deployment, formal privacy guarantees, panic-free library code, comprehensive operations documentation, and data lineage tracking. Remaining gaps: RBAC/OAuth2, plugin SDK, Python async support.
Phase 1: Foundation (0-3 months)
Goal: Establish the minimum viable production infrastructure.
1.1 Containerization & Packaging
Priority: Critical | Effort: Medium
| Deliverable | Description |
|---|---|
| Multi-stage Dockerfile | Rust builder stage + distroless/alpine runtime (~20MB image) |
| Docker Compose | Local dev stack: server + Prometheus + Grafana + Redis |
| OCI image publishing | GitHub Actions workflow to push to GHCR/ECR on tagged releases |
| Binary distribution | Pre-built binaries for Linux (x86_64, aarch64), macOS (Apple Silicon), Windows |
| SystemD service file | Production daemon configuration with resource limits |
Implementation Notes:
```dockerfile
# Target image structure
FROM rust:1.88-bookworm AS builder
# ... build with --release
FROM gcr.io/distroless/cc-debian12
COPY --from=builder /app/target/release/datasynth-server /
EXPOSE 3000
ENTRYPOINT ["/datasynth-server"]
```
1.2 Security Hardening
Priority: Critical | Effort: Medium
| Deliverable | Description |
|---|---|
| API key hashing | Argon2id for stored keys; timing-safe comparison via subtle crate |
| Request validation middleware | Content-Type enforcement, configurable max body size (default 10MB) |
| TLS support | Native rustls integration or documented reverse proxy (nginx/Caddy) setup |
| Secrets management | Environment variable interpolation in config (${ENV_VAR} syntax) |
| Security headers | X-Content-Type-Options, X-Frame-Options, Strict-Transport-Security |
| Input sanitization | Validate all user-supplied config values before processing |
| Dependency auditing | cargo-audit and cargo-deny in CI pipeline |
1.3 Observability Stack
Priority: Critical | Effort: Medium
| Deliverable | Description |
|---|---|
| OpenTelemetry integration | Replace custom metrics with opentelemetry + opentelemetry-otlp crates |
| Structured logging | JSON-formatted logs with request IDs, span context, correlation traces |
| Prometheus metrics | Generation throughput, latency histograms, error rates, resource utilization |
| Distributed tracing | Trace generation pipeline phases end-to-end with span hierarchy |
| Health check enhancement | Add dependency checks (disk space, memory) to /ready endpoint |
| Alert rules | Example Prometheus alerting rules for SLO violations |
Key Metrics to Instrument:
- `datasynth_generation_entries_total` (Counter) — Total entries generated
- `datasynth_generation_duration_seconds` (Histogram) — Per-phase latency
- `datasynth_generation_errors_total` (Counter) — Errors by type
- `datasynth_memory_usage_bytes` (Gauge) — Current memory consumption
- `datasynth_active_sessions` (Gauge) — Concurrent generation sessions
- `datasynth_api_request_duration_seconds` (Histogram) — API latency by endpoint
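As a concrete example of the alert-rules deliverable, a Prometheus rule built on the metrics above could look like this; the threshold and duration are placeholders, not recommended SLOs.

```yaml
# Illustrative Prometheus alerting rule using the instrumented metrics above.
groups:
  - name: datasynth-slo
    rules:
      - alert: DataSynthHighErrorRate
        expr: rate(datasynth_generation_errors_total[5m]) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DataSynth generation error rate above threshold for 10 minutes"
```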
1.4 CI/CD Hardening
Priority: High | Effort: Low
| Deliverable | Description |
|---|---|
| Code coverage | cargo-tarpaulin or cargo-llvm-cov with Codecov integration |
| Security scanning | cargo-audit for CVEs, cargo-deny for license compliance |
| MSRV validation | CI job testing against minimum supported Rust version (1.88) |
| Cross-platform matrix | Test on Linux, macOS, Windows in CI |
| Benchmark tracking | Criterion results uploaded to GitHub Pages; regression alerts on PRs |
| Release automation | Semantic versioning with auto-changelog via git-cliff |
| Container scanning | Trivy or Grype scanning of published Docker images |
Phase 2: Hardening (3-6 months)
Goal: Enterprise-grade reliability, scalability, and compliance foundations.
2.1 Scalability & High Availability
Priority: High | Effort: High
| Deliverable | Description |
|---|---|
| Redis-backed rate limiting | Distributed rate limiting via redis-rs for multi-instance deployments |
| Horizontal scaling | Stateless server design; shared config via Redis/S3 |
| Kubernetes Helm chart | Production-ready chart with HPA, PDB, resource limits, readiness probes |
| Load testing framework | k6 or Locust scripts for API stress testing |
| Graceful rolling updates | Zero-downtime deployments with connection draining |
| Job queue | Async generation jobs with status tracking (Redis Streams or similar) |
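A hedged sketch of Helm values covering the HPA, PDB, and Redis items above; the value names follow common chart conventions and are assumptions rather than the shipped chart's schema.

```yaml
# Illustrative Helm values; names follow common chart conventions, not necessarily the shipped chart.
replicaCount: 3
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 75
podDisruptionBudget:
  enabled: true
  minAvailable: 2
redis:
  enabled: true               # backend for distributed rate limiting and the job queue
resources:
  limits:
    cpu: "2"
    memory: 4Gi
```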
2.2 Data Lineage & Provenance
Priority: High | Effort: Medium
| Deliverable | Description |
|---|---|
| Generation manifest | JSON/YAML file recording: config hash, seed, version, timestamp, checksums for all outputs |
| Data lineage graph | Track which config section produced which output file and row ranges |
| Reproducibility verification | CLI command: datasynth-data verify --manifest manifest.json --output ./output/ |
| W3C PROV compatibility | Export lineage in W3C PROV-JSON format for interoperability |
| Audit trail | Append-only log of all generation runs with user, config, and output metadata |
Rationale: Data lineage is becoming a regulatory requirement under the EU AI Act (Article 10 — data governance for training data) and is a key differentiator in the enterprise synthetic data market. NIST AI RMF 1.0 also emphasizes provenance tracking under its MAP and MEASURE functions.
2.3 Enhanced Privacy Guarantees
Priority: High | Effort: High
| Deliverable | Description |
|---|---|
| Formal DP accounting | Implement Renyi DP and zero-concentrated DP (zCDP) composition tracking |
| Privacy budget management | Global budget tracking across multiple generation runs |
| Membership inference testing | Automated MIA evaluation as post-generation quality gate |
| NIST SP 800-226 alignment | Validate DP implementation against NIST Guidelines for Evaluating DP Guarantees |
| SynQP framework integration | Implement the IEEE SynQP evaluation matrix for joint quality-privacy assessment |
| Configurable privacy levels | Presets: relaxed (ε=10), standard (ε=1), strict (ε=0.1) with utility tradeoff documentation |
Research Context: The NIST SP 800-226 (Guidelines for Evaluating Differential Privacy Guarantees) provides the authoritative framework for DP evaluation. The SynQP framework (IEEE, 2025) introduces standardized privacy-quality evaluation matrices. Benchmarking DP tabular synthesis algorithms was a key topic at TPDP 2025, and federated DP approaches (FedDPSyn) are emerging for distributed generation.
2.4 Unwrap Audit & Robustness
Priority: Medium | Effort: Medium
| Deliverable | Description |
|---|---|
| Unwrap elimination | Audit and replace ~2,300 unwrap() calls in non-test code with proper error handling |
| Panic-free guarantee | Add #![deny(clippy::unwrap_used)] lint for library crates (not test/bench) |
| Fuzzing harnesses | cargo-fuzz targets for config parsing, fingerprint loading, and API endpoints |
| Property test expansion | Increase proptest coverage for statistical invariants and balance coherence |
2.5 Documentation: Operations
Priority: Medium | Effort: Low
| Deliverable | Description |
|---|---|
| Deployment guide | Docker, K8s, bare-metal deployment with step-by-step instructions |
| Operational runbook | Monitoring dashboards, common alerts, troubleshooting procedures |
| Capacity planning guide | Memory/CPU/disk sizing for different generation scales |
| Disaster recovery | Backup/restore procedures for server state and configurations |
| API rate limits documentation | Document auth, rate limiting, and CORS behavior for integrators |
| Security hardening guide | Checklist for production security configuration |
Phase 3: Enterprise Grade (6-12 months)
Goal: Enterprise features, compliance certifications, and ecosystem maturity.
3.1 Multi-Tenancy & Access Control
Priority: High | Effort: High
| Deliverable | Description |
|---|---|
| RBAC | Role-based access control (admin, operator, viewer) with JWT/OAuth2 |
| Tenant isolation | Namespace-based isolation for multi-tenant SaaS deployment |
| Audit logging | Structured audit events for all API actions (who/what/when) |
| SSO integration | SAML 2.0 and OIDC support for enterprise identity providers |
| API versioning | URL-based API versioning (v1, v2) with deprecation lifecycle |
3.2 Advanced Evaluation & Quality Gates
Priority: High | Effort: Medium
| Deliverable | Description |
|---|---|
| Automated quality gates | Pre-configured pass/fail criteria for generation runs |
| Benchmark suite expansion | Domain-specific benchmarks: financial realism, fraud detection efficacy, audit trail coherence |
| Regression testing | Golden dataset comparison with tolerance thresholds |
| Quality dashboard | Web-based visualization of quality metrics over time |
| Third-party validation | Integration with SDMetrics and SDV evaluation utilities |
Quality Metrics to Implement:
- Statistical fidelity: Column distribution similarity (KL divergence, Wasserstein distance)
- Structural fidelity: Correlation matrix preservation, inter-table referential integrity
- Privacy: Nearest-neighbor distance ratio, attribute disclosure risk, identity disclosure risk (SynQP)
- Utility: Train-on-synthetic-test-on-real (TSTR) ML performance parity
- Temporal fidelity: Autocorrelation preservation, seasonal pattern retention
- Domain-specific: Benford compliance MAD, balance equation coherence, document chain integrity
3.3 Plugin & Extension SDK
Priority: Medium | Effort: High
| Deliverable | Description |
|---|---|
| Generator trait API | Stable, documented trait interface for custom generators |
| Plugin loading | Dynamic plugin loading via libloading or WASM runtime |
| Template marketplace | Repository of community-contributed industry templates |
| Custom output sinks | Plugin API for custom export formats (database write, S3, GCS) |
| Webhook system | Event-driven notifications (generation start/complete/error) |
3.4 Python Ecosystem Maturity
Priority: Medium | Effort: Medium
| Deliverable | Description |
|---|---|
| Async support | asyncio-compatible API using websockets for streaming |
| Conda package | Publish to conda-forge for data science workflows |
| Jupyter integration | Example notebooks for common use cases (fraud ML, audit analytics) |
| pandas/polars integration | Direct DataFrame output without intermediate CSV |
| PyPI 1.0.0 release | Promote from Beta to Production/Stable classifier |
| Type stubs | Complete .pyi stubs for IDE support |
3.5 Regulatory Compliance Framework
Priority: Medium | Effort: Medium
| Deliverable | Description |
|---|---|
| EU AI Act readiness | Synthetic content marking (Article 50), training data documentation (Article 10) |
| NIST AI RMF alignment | Self-assessment against MAP, MEASURE, MANAGE, GOVERN functions |
| SOC 2 Type II preparation | Document controls for security, availability, processing integrity |
| GDPR compliance documentation | Data processing documentation, privacy impact assessment template |
| ISO 27001 alignment | Information security management system controls mapping |
Regulatory Context: The EU AI Act’s Article 50 transparency obligations (enforceable August 2026) require AI systems generating synthetic content to mark outputs as artificially generated in a machine-readable format. Article 10 mandates training data governance including documentation of data sources. Organizations face penalties up to €35M or 7% of global turnover for non-compliance. The NIST AI RMF 1.0 (expanded significantly through 2024-2025) provides the voluntary framework becoming the “operational layer” beneath regulatory compliance globally.
Phase 4: Market Leadership (12-18 months)
Goal: Cutting-edge capabilities informed by latest research, establishing DataSynth as the reference platform for financial synthetic data.
4.1 LLM-Augmented Generation
Priority: Medium | Effort: High
| Deliverable | Description |
|---|---|
| LLM-guided metadata enrichment | Use LLMs to generate realistic vendor names, descriptions, memo fields |
| Natural language config | Generate YAML configs from natural language descriptions (“Generate 1 year of manufacturing data for a mid-size German company”) |
| Semantic constraint validation | LLM-based validation of inter-column logical relationships |
| Explanation generation | Natural language explanations for anomaly labels and findings |
Research Context: Multiple 2025 papers demonstrate LLM-augmented tabular data generation. LLM-TabFlow (March 2025) addresses preserving inter-column logical relationships. StructSynth (August 2025) focuses on structure-aware synthesis in low-data regimes. LLM-TabLogic (August 2025) uses prompt-guided latent diffusion to maintain logical constraints. The CFA Institute’s July 2025 report on “Synthetic Data in Investment Management” validates the growing importance of synthetic data in financial applications.
4.2 Diffusion Model Integration
Priority: Medium | Effort: Very High
| Deliverable | Description |
|---|---|
| TabDDPM backend | Optional diffusion-model-based generation for learned distribution capture |
| FinDiff integration | Financial-domain diffusion model for learned financial patterns |
| Hybrid generation | Combine rule-based generators with learned models for maximum fidelity |
| Model fine-tuning pipeline | Train custom diffusion models on fingerprint data |
| Imb-FinDiff for rare events | Diffusion-based class imbalance handling for fraud patterns |
Research Context: The diffusion model landscape for tabular data has matured rapidly. TabDiff (ICLR 2025) introduced joint continuous-time diffusion with feature-wise learnable schedules, achieving 22.5% improvement over prior SOTA. FinDiff and its extensions (Imb-FinDiff for class imbalance, DP-Fed-FinDiff for federated privacy-preserving generation) are specifically designed for financial tabular data. A comprehensive survey (February 2025) catalogs 15+ diffusion models for tabular data. TabGraphSyn (December 2025) combines GNNs with diffusion for graph-guided tabular synthesis.
4.3 Advanced Privacy Techniques
Priority: Medium | Effort: High
| Deliverable | Description |
|---|---|
| Federated fingerprinting | Extract fingerprints from distributed data sources without centralization |
| Synthetic data certificates | Cryptographic proof that output satisfies DP guarantees |
| Privacy-utility Pareto frontier | Automated exploration of optimal ε values for given utility targets |
| Surrogate public data | Support for surrogate public data approaches to improve DP utility |
Research Context: TPDP 2025 featured FedDPSyn for federated DP tabular synthesis and research on surrogate public data for DP (Hod et al.). The AI-generated synthetic tabular data market reached $1.36B in 2024 and is projected to reach $6.73B by 2029 (37.9% CAGR), driven by privacy regulation and AI training demand.
4.4 Ecosystem & Integration
Priority: Medium | Effort: Medium
| Deliverable | Description |
|---|---|
| Terraform provider | Infrastructure-as-code for DataSynth server deployment |
| Airflow/Dagster operators | Pipeline integration for automated generation in data workflows |
| dbt integration | Generate synthetic data as dbt sources for analytics testing |
| Spark connector | Read DataSynth output directly as Spark DataFrames |
| MLflow integration | Track generation runs as MLflow experiments with metrics |
4.5 Causal & Counterfactual Generation
Priority: Low | Effort: Very High
| Deliverable | Description |
|---|---|
| Causal graph specification | Define causal relationships between entities in config |
| Interventional generation | “What-if” scenarios: generate data under hypothetical interventions |
| Counterfactual samples | Generate counterfactual versions of existing records |
| Causal discovery validation | Validate that generated data preserves specified causal structure |
Industry & Research Context
Synthetic Data Market (2025-2026)
The synthetic data market is experiencing explosive growth:
- Gartner predicts 75% of businesses will use GenAI to create synthetic customer data by 2026, up from <5% in 2023.
- The AI-generated synthetic tabular data market reached $1.36B in 2024, projected to $6.73B by 2029 (37.9% CAGR).
- Synthetic data is predicted to account for >60% of all training data for GenAI models by 2030 (CFA Institute, July 2025).
Key Research Papers & Developments
Tabular Data Generation
- TabDiff (ICLR 2025) — Mixed-type diffusion with learnable feature-wise schedules; 22.5% improvement on correlation preservation
- LLM-TabFlow (March 2025) — Preserving inter-column logical relationships via LLM guidance
- StructSynth (August 2025) — Structure-aware LLM synthesis for low-data regimes
- LLM-TabLogic (August 2025) — Prompt-guided latent diffusion maintaining logical constraints
- TabGraphSyn (December 2025) — Graph-guided latent diffusion combining VAE+GNN with diffusion
Financial Domain
- FinDiff (ICAIF 2023) — Diffusion models for financial tabular data
- Imb-FinDiff (ICAIF 2024) — Conditional diffusion for class-imbalanced financial data
- DP-Fed-FinDiff — Federated DP diffusion for privacy-preserving financial synthesis
- CFA Institute Report (July 2025) — “Synthetic Data in Investment Management” validating FinDiff as SOTA
Privacy & Evaluation
- SynQP (IEEE, 2025) — Standardized quality-privacy evaluation framework for synthetic data
- NIST SP 800-226 — Guidelines for Evaluating Differential Privacy Guarantees
- TPDP 2025 — Benchmarking DP tabular synthesis; federated approaches; membership inference attacks
- Consensus Privacy Metrics (Pilgram et al., 2025) — Framework for standardized privacy evaluation
Surveys
- “Diffusion Models for Tabular Data” (February 2025) — Comprehensive survey cataloging 15+ models
- “Comprehensive Survey of Synthetic Tabular Data Generation” (Shi et al., 2025) — Broad overview of methods
Technology Trends Impacting DataSynth
| Trend | Impact | Timeframe |
|---|---|---|
| LLM-augmented generation | Realistic metadata, natural language config | 2026 |
| Diffusion models for tabular data | Learned distribution capture as alternative/complement to rule-based | 2026-2027 |
| Federated DP synthesis | Generate from distributed sources without centralization | 2027 |
| Causal modeling | “What-if” scenarios and interventional generation | 2027-2028 |
| OTEL standardization | Unified observability across Rust ecosystem | 2026 |
| WASM plugins | Safe, sandboxed extensibility for custom generators | 2026-2027 |
| EU AI Act enforcement | Mandatory synthetic content marking and data governance | August 2026 |
Competitive Positioning
Market Landscape (2025-2026)
| Platform | Focus | Key Differentiator | Pricing | Status |
|---|---|---|---|---|
| Gretel.ai | Developer APIs | Navigator (NL-to-data); acquired by NVIDIA (March 2025) | Usage-based | Integrated into NVIDIA NeMo |
| MOSTLY AI | Enterprise compliance | TabularARGN with built-in DP; fairness controls | Enterprise license | Independent |
| Tonic.ai | Test data management | Database-aware synthesis; acquired Fabricate (April 2025) | Per-database | Growing |
| Hazy | Financial services | Regulated-sector focus; sequential data | Enterprise license | Independent |
| SDV/DataCebo | Open source ecosystem | CTGAN, TVAEs, Gaussian copulas; Python-native | Freemium | Open source core |
| K2view | Entity-based testing | All-in-one enterprise data management | Enterprise license | Established |
DataSynth Competitive Advantages
| Advantage | Detail |
|---|---|
| Domain depth | Deepest financial/accounting domain model (IFRS, US GAAP, ISA, SOX, COSO, KYC/AML) |
| Rule-based coherence | Guaranteed balance equations, document chain integrity, three-way matching |
| Deterministic reproducibility | ChaCha8 RNG with seed control; bit-exact reproducibility across runs |
| Performance | 100K+ entries/sec (Rust native); 10-100x faster than Python-based competitors |
| Privacy-preserving fingerprinting | Unique extract-synthesize workflow with DP guarantees |
| Process mining | Native OCEL 2.0 event log generation (unique in market) |
| Graph-native | Direct PyTorch Geometric, Neo4j, DGL export for GNN workflows |
| Full-stack | CLI + REST/gRPC/WebSocket server + Desktop UI + Python bindings |
Competitive Gaps to Address
| Gap | Competitors with Feature | Priority |
|---|---|---|
| Cloud-hosted SaaS offering | Gretel, MOSTLY AI, Tonic | Phase 3 |
| No-code UI for non-technical users | MOSTLY AI, K2view | Phase 3 |
| Database-aware synthesis from production data | Tonic.ai | Phase 4 |
| LLM-powered natural language interface | Gretel Navigator | Phase 4 |
| Pre-built ML model training pipelines | Gretel | Phase 3 |
| Marketplace for community templates | SDV ecosystem | Phase 3 |
Regulatory Landscape
EU AI Act Timeline
| Date | Milestone | DataSynth Impact |
|---|---|---|
| Feb 2025 | Prohibited AI systems discontinued; AI literacy obligations | Low — DataSynth is a tool, not a prohibited system |
| Aug 2025 | GPAI transparency requirements; training data documentation | Medium — Users training AI with DataSynth output need provenance |
| Aug 2026 | Full high-risk AI compliance; Article 50 transparency | High — Synthetic content marking required; data governance mandated |
| Aug 2027 | High-risk AI in harmonized products | Low — Indirect impact |
Required Compliance Features
- Synthetic content marking (Article 50): All generated data must include machine-readable markers indicating artificial generation (see the marker sketch after this list)
- Training data documentation (Article 10): Generation manifests must document configs, sources, and processing steps
- Quality management (Annex IV): Documented quality assurance processes for generation and evaluation
- Risk assessment: Template for users to assess risks of using synthetic data in AI systems
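A minimal sketch of what a machine-readable synthetic-content marker could look like when embedded in (or emitted alongside) a generation manifest. The field names and the serde/serde_json serialization shown here are illustrative assumptions, not the shipped manifest schema.

use serde::Serialize;

/// Hypothetical machine-readable marker attached to every generated dataset.
#[derive(Serialize)]
struct SyntheticContentMarker {
    synthetic: bool,
    generator: String,
    generator_version: String,
    seed: u64,
    generated_at_utc: String,
}

fn main() -> Result<(), serde_json::Error> {
    let marker = SyntheticContentMarker {
        synthetic: true,
        generator: "DataSynth".to_string(),
        generator_version: "0.5.0".to_string(),
        seed: 42,
        generated_at_utc: "2026-01-15T10:00:00Z".to_string(),
    };
    // Emit the marker as JSON so downstream consumers can detect synthetic content.
    println!("{}", serde_json::to_string_pretty(&marker)?);
    Ok(())
}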
Other Regulatory Frameworks
| Framework | Relevance | Status |
|---|---|---|
| NIST AI RMF 1.0 | Voluntary; becoming the operational governance layer globally | Self-assessment planned (Phase 3) |
| NIST SP 800-226 | DP evaluation guidelines | Alignment planned (Phase 2) |
| GDPR | Synthetic data reduces but doesn’t eliminate privacy obligations | Documentation in Phase 3 |
| SOX | DataSynth already generates SOX-compliant test data | Feature complete |
| ISO 27001 | Information security controls for server deployment | Alignment in Phase 3 |
| SOC 2 Type II | Trust service criteria for SaaS offering | Phase 3 preparation |
Risk Register
Technical Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Performance regression with OTEL instrumentation | Medium | Medium | Benchmark-gated CI; sampling in production |
| Breaking API changes during versioning | Low | High | Semantic versioning; deprecation policy; compatibility tests |
| Memory safety issues in unsafe blocks | Low | Critical | Miri testing; minimize unsafe; regular audits |
| Dependency CVEs | Medium | High | cargo-audit in CI; Dependabot alerts |
| Plugin system security (WASM/dynamic loading) | Medium | High | WASM sandboxing; capability-based permissions |
Business Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| EU AI Act scope broader than anticipated | Medium | High | Proactive Article 50 compliance; legal review |
| Competitor acqui-hires (Gretel→NVIDIA pattern) | Medium | Medium | Build unique domain depth as defensible moat |
| Open-source competitors (SDV) closing feature gap | Medium | Medium | Focus on financial domain depth and performance |
| Enterprise customers requiring SOC 2 certification | High | Medium | Begin SOC 2 preparation in Phase 3 |
| Python ecosystem expects native (PyO3) bindings | Medium | Medium | Evaluate PyO3 migration for v2.0 |
Operational Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Production incidents without runbooks | High | Medium | Prioritize ops documentation in Phase 2 |
| Scaling issues under concurrent load | Medium | High | Load testing in Phase 2; HPA configuration |
| Secret exposure in logs or configs | Low | Critical | Structured logging with PII filtering; secret scanning |
Success Criteria
Phase 1 Exit Criteria
- Docker image published and scannable (multi-stage distroless build)
- cargo-audit and cargo-deny passing in CI
- OTEL traces available via feature-gated otel flag with OTLP export
- Prometheus metrics scraped and graphed (Docker Compose stack)
- Code coverage measured and reported via cargo-llvm-cov + Codecov
- Cross-platform CI (Linux + macOS + Windows)
Phase 2 Exit Criteria
- Helm chart deployed to staging K8s cluster
- Generation manifest produced for every run (with per-file checksums, lineage graph, W3C PROV-JSON)
- Load test: k6 scripts for health, bulk generation, WebSocket, job queue, and soak testing
- Zero unwrap() calls in library crate non-test code (#![deny(clippy::unwrap_used)] enforced)
- Formal DP composition tracking with budget management (RDP, zCDP, privacy budget manager)
- Operations runbook reviewed and validated (deployment guides, runbook, capacity planning, DR, API reference, security hardening)
Phase 3 Exit Criteria
- JWT/OAuth2 authentication with RBAC
- Automated quality gates blocking below-threshold runs
- Plugin SDK documented with 2+ community plugins
- Python 1.0.0 on PyPI with async support
- EU AI Act Article 50 compliance verified
- SOC 2 Type II readiness assessment completed
Phase 4 Exit Criteria
- LLM-augmented generation available as opt-in feature
- Diffusion model backend demonstrated on financial dataset
- 3+ ecosystem integrations (Airflow, dbt, MLflow)
- Causal generation prototype validated
Appendix A: OpenTelemetry Integration Architecture
┌─────────────────────────────────────────────────────┐
│ DataSynth Server │
│ ┌───────────┐ ┌──────────┐ ┌─────────────────┐ │
│ │ REST API │ │ gRPC │ │ WebSocket │ │
│ └─────┬─────┘ └────┬─────┘ └───────┬─────────┘ │
│ │ │ │ │
│ ┌─────┴──────────────┴────────────────┴──────────┐ │
│ │ Tower Middleware Stack │ │
│ │ [Auth] [RateLimit] [Tracing] [Metrics] │ │
│ └────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌────────────────────┴───────────────────────────┐ │
│ │ OpenTelemetry SDK │ │
│ │ ┌─────────┐ ┌──────────┐ ┌─────────────────┐ │ │
│ │ │ Traces │ │ Metrics │ │ Logs │ │ │
│ │ └────┬────┘ └────┬─────┘ └───────┬─────────┘ │ │
│ └───────┼───────────┼───────────────┼────────────┘ │
│ │ │ │ │
│ ┌───────┴───────────┴───────────────┴────────────┐ │
│ │ OTLP Exporter (gRPC/HTTP) │ │
│ └────────────────────┬───────────────────────────┘ │
└───────────────────────┼─────────────────────────────┘
│
┌─────────┴──────────┐
│ OTel Collector │
│ (Agent sidecar) │
└──┬──────┬──────┬───┘
│ │ │
┌─────┘ ┌───┘ ┌──┘
▼ ▼ ▼
┌──────┐ ┌──────┐ ┌─────┐
│Jaeger│ │Prom. │ │Loki │
│/Tempo│ │ │ │ │
└──────┘ └──────┘ └─────┘
Appendix B: Recommended Rust Crate Additions
| Category | Crate | Purpose | Phase |
|---|---|---|---|
| Observability | opentelemetry (0.27+) | Unified telemetry API | 1 |
| Observability | opentelemetry-otlp | OTLP exporter | 1 |
| Observability | tracing-opentelemetry | Bridge tracing → OTEL | 1 |
| Security | argon2 | Password/key hashing | 1 |
| Security | subtle | Constant-time comparison | 1 |
| Security | rustls | Native TLS | 1 |
| Scalability | redis | Distributed state/rate-limiting | 2 |
| Scalability | deadpool-redis | Redis connection pooling | 2 |
| Testing | cargo-tarpaulin | Code coverage | 1 |
| Testing | cargo-fuzz | Fuzz testing | 2 |
| Auth | jsonwebtoken | JWT tokens | 3 |
| Auth | oauth2 | OAuth2 client | 3 |
| Plugins | wasmtime | WASM plugin runtime | 3 |
| Build | git-cliff | Changelog generation | 1 |
Appendix C: Key References
Standards & Guidelines
- NIST AI RMF 1.0 — AI Risk Management Framework
- NIST SP 800-226 — Guidelines for Evaluating Differential Privacy Guarantees
- EU AI Act (Regulation 2024/1689) — Articles 10, 50
- ISO/IEC 25020:2019 — Systems and software Quality Requirements and Evaluation (SQuaRE)
Research Papers
- Chen et al. (2025) — “Benchmarking Differentially Private Tabular Data Synthesis Algorithms” (TPDP 2025)
- SynQP (IEEE, 2025) — “A Framework and Metrics for Evaluating the Quality and Privacy Risk of Synthetic Data”
- Xu et al. (2025) — “TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation” (ICLR 2025)
- Sattarov & Schreyer (2023) — “FinDiff: Diffusion Models for Financial Tabular Data Generation” (ICAIF 2023)
- Shi et al. (2025) — “Comprehensive Survey of Synthetic Tabular Data Generation”
- CFA Institute (July 2025) — “Synthetic Data in Investment Management”
- Pilgram et al. (2025) — “A Consensus Privacy Metrics Framework for Synthetic Data”
Industry Reports
- Gartner (2024) — “By 2026, 75% of businesses will use GenAI for synthetic customer data”
- GlobeNewsWire (January 2026) — AI-Generated Synthetic Tabular Dataset Market: $6.73B by 2029
Research: System Improvements for Enhanced Realism
This research document series analyzes the current SyntheticData system and proposes comprehensive improvements across multiple dimensions to achieve greater realism, statistical validity, and domain authenticity.
Document Index
| Document | Focus Area | Priority |
|---|---|---|
| 01-realism-names-metadata.md | Names, descriptions, metadata realism | High |
| 02-statistical-distributions.md | Numerical and statistical distributions | High |
| 03-temporal-patterns.md | Temporal correctness and distributions | High |
| 04-interconnectivity.md | Entity relationships and referential integrity | Critical |
| 05-pattern-drift.md | Process and pattern evolution over time | Medium |
| 06-anomaly-patterns.md | Anomaly detection and injection patterns | High |
| 07-fraud-patterns.md | Fraud typologies and detection scenarios | High |
| 08-domain-specific.md | Industry-specific enhancements | Medium |
Executive Summary
Current State Assessment
The SyntheticData system is a mature, well-architected synthetic data generation platform with strong foundations in:
- Deterministic generation via ChaCha8 RNG with configurable seeds
- Domain modeling with 50+ entity types across accounting, banking, and audit domains
- Statistical foundations including Benford’s Law, log-normal distributions, and temporal seasonality
- Referential integrity through document chains, three-way matching, and intercompany reconciliation
- Standards compliance with COSO 2013, ISA, SOX, IFRS, and US GAAP frameworks
Key Improvement Themes
After comprehensive analysis, we identify eight major improvement themes:
1. Realism in Names & Metadata
Current Gap: Generic placeholder names, limited cultural diversity, simplistic descriptions
Impact: Immediate visual detection of synthetic nature
Effort: Medium | Value: High
2. Statistical Distribution Enhancements
Current Gap: Single-mode distributions, limited correlation modeling, no regime changes
Impact: ML models trained on synthetic data may not generalize
Effort: High | Value: Critical
3. Temporal Pattern Sophistication
Current Gap: Static multipliers, no business day calculations, limited regional calendars
Impact: Unrealistic transaction timing patterns
Effort: Medium | Value: High
4. Interconnectivity & Relationship Modeling
Current Gap: Shallow relationship graphs, limited network effects, no behavioral clustering
Impact: Graph-based analytics yield unrealistic structures
Effort: High | Value: Critical
5. Pattern & Process Drift
Current Gap: Limited drift types, no organizational change modeling, static processes
Impact: Temporal ML models overfit to stable patterns
Effort: Medium | Value: High
6. Anomaly Pattern Enrichment
Current Gap: Limited anomaly correlation, no multi-stage anomalies, binary labeling
Impact: Anomaly detection models lack nuanced training data
Effort: Medium | Value: High
7. Fraud Pattern Sophistication
Current Gap: Isolated fraud events, limited collusion modeling, no adaptive patterns
Impact: Fraud detection systems miss complex schemes
Effort: High | Value: Critical
8. Domain-Specific Enhancements
Current Gap: Generic industry modeling, limited regulatory specificity
Impact: Industry-specific use cases require extensive customization
Effort: Medium | Value: Medium
Implementation Roadmap
Phase 1: Foundation (Q1)
- Culturally-aware name generation with regional distributions
- Enhanced amount distributions with mixture models
- Business day calculation utilities
- Relationship graph depth improvements
Phase 2: Statistical Sophistication (Q2)
- Multi-modal distribution support
- Cross-field correlation modeling
- Regime change simulation
- Network effect modeling
Phase 3: Temporal Evolution (Q3)
- Organizational change events
- Process evolution modeling
- Adaptive fraud patterns
- Multi-stage anomaly injection
Phase 4: Domain Specialization (Q4)
- Industry-specific regulatory frameworks
- Enhanced audit trail generation
- Advanced graph analytics support
- Privacy-preserving fingerprint improvements
Metrics for Success
Realism Metrics
- Human Detection Rate: % of samples correctly identified as synthetic by domain experts
- Statistical Divergence: KL divergence between synthetic and real-world distributions
- Temporal Correlation: Autocorrelation alignment with empirical baselines
ML Utility Metrics
- Transfer Learning Gap: Performance delta when models trained on synthetic data are applied to real data
- Feature Distribution Overlap: Overlap coefficient for key feature distributions
- Anomaly Detection AUC: Baseline AUC on synthetic vs. improvement after enhancements
Technical Metrics
- Generation Throughput: Records/second with enhanced features
- Memory Efficiency: Peak memory usage for equivalent dataset sizes
- Configuration Complexity: Lines of YAML required for common scenarios
Next Steps
- Review individual research documents for detailed analysis
- Prioritize improvements based on use case requirements
- Create implementation tickets for Phase 1 items
- Establish baseline metrics for tracking progress
Research conducted: January 2026
System version analyzed: 0.2.3
Research: Realism in Names, Descriptions, and Metadata
Current State Analysis
Entity Name Generation
The current system uses basic name generation across multiple entity types:
| Entity Type | Current Approach | Realism Level |
|---|---|---|
| Vendors | “Vendor_{id}” or template-based | Low |
| Customers | “Customer_{id}” or template-based | Low |
| Employees | First/Last name pools | Medium |
| Materials | “Material_{id}” with category prefix | Low |
| Cost Centers | “{dept}_{code}” pattern | Medium |
| GL Accounts | Numeric codes with descriptions | High |
| Companies | Configurable but often generic | Medium |
Description Generation
Current descriptions follow predictable patterns:
- Journal entries: “{type} for {entity}”
- Invoices: “Invoice for {goods/services}”
- Payments: “Payment for Invoice {ref}”
Metadata Patterns
- Timestamps: Well-distributed but lack system-specific quirks
- User IDs: Sequential or simple patterns
- References: Deterministic but predictable formats
Improvement Recommendations
1. Culturally-Aware Name Generation
1.1 Regional Name Pools
Implementation: Create region-specific name databases with appropriate cultural distributions.
# Proposed configuration structure
name_generation:
strategy: regional_weighted
regions:
- region: north_america
weight: 0.45
subregions:
- country: US
weight: 0.85
cultural_mix:
- origin: anglo
weight: 0.55
- origin: hispanic
weight: 0.25
- origin: asian
weight: 0.12
- origin: other
weight: 0.08
- country: CA
weight: 0.10
- country: MX
weight: 0.05
- region: europe
weight: 0.30
- region: asia_pacific
weight: 0.25
1.2 Company Name Patterns by Industry
Retail:
- Pattern: {Founder} {Product} → "Johnson's Hardware"
- Pattern: {Adjective} {Category} → "Premier Electronics"
- Pattern: {Location} {Type} → "Westside Grocers"
Manufacturing:
- Pattern: {Name} {Industry} {Suffix} → "Anderson Steel Corporation"
- Pattern: {Acronym} {Type} → "ACM Industries"
- Pattern: {Technical} {Systems} → "Precision Machining Systems"
Professional Services:
- Pattern: {Partner1}, {Partner2} & {Partner3} → "Smith, Chen & Associates"
- Pattern: {Name} {Specialty} {Type} → "Hartwell Tax Advisors"
- Pattern: {Adjective} {Service} {Suffix} → "Strategic Consulting Group"
Financial Services:
- Pattern: {Location} {Type} {Entity} → "Pacific Coast Credit Union"
- Pattern: {Founder} {Service} → "Morgan Wealth Management"
- Pattern: {Region} {Specialty} → "Midwest Commercial Lending"
1.3 Vendor Name Realism
Current: Vendor_00042 or simple templates
Proposed: Industry-appropriate vendor names based on spend category:
// Conceptual structure
pub struct VendorNameGenerator {
category_templates: HashMap<SpendCategory, Vec<NameTemplate>>,
regional_styles: HashMap<Region, NamingConvention>,
legal_suffixes: HashMap<Country, Vec<String>>,
}
impl VendorNameGenerator {
pub fn generate(&self, category: SpendCategory, region: Region) -> VendorName {
// Select template based on category
// Apply regional naming conventions
// Add appropriate legal suffix (Inc., LLC, GmbH, Ltd., S.A., etc.)
}
}
Examples by Category:
| Category | Example Names |
|---|---|
| Office Supplies | Staples, Office Depot, ULINE, Quill Corporation |
| IT Services | Accenture Technology, Cognizant Solutions, InfoSys Systems |
| Raw Materials | Alcoa Aluminum, US Steel Supply, Nucor Materials |
| Utilities | Pacific Gas & Electric, ConEdison, Duke Energy |
| Professional Services | Deloitte & Touche, KPMG Advisory, BDO Consulting |
| Logistics | FedEx Freight, UPS Supply Chain, XPO Logistics |
| Facilities | ABM Industries, CBRE Services, JLL Facilities |
2. Realistic Description Generation
2.1 Journal Entry Descriptions
Current Pattern: Generic, formulaic
Proposed: Context-aware, varied descriptions with realistic abbreviations and typos
journal_entry_descriptions:
revenue:
templates:
- "Revenue recognition - {customer} - {contract_ref}"
- "Rev rec {period} - {product_line}"
- "Sales revenue {region} Q{quarter}"
- "Earned revenue - PO# {po_number}"
abbreviations:
enabled: true
probability: 0.3
mappings:
Revenue: ["Rev", "REV"]
recognition: ["rec", "recog"]
Quarter: ["Q", "Qtr"]
variations:
case_variation: 0.1
typo_rate: 0.02
expense:
templates:
- "AP invoice - {vendor} - {invoice_ref}"
- "{expense_category} - {cost_center}"
- "Accrued {expense_type} {period}"
- "{vendor_short} inv {invoice_num}"
context_aware:
include_approver: 0.2
include_po_reference: 0.7
include_department: 0.4
2.2 Invoice Line Item Descriptions
Goods:
- "Qty {quantity} {product_name} @ ${unit_price}/ea"
- "{product_sku} - {product_description}"
- "{quantity}x {product_short_name}"
- "Lot# {lot_number} {product_name}"
Services:
- "Professional services - {date_range}"
- "Consulting fees - {project_name}"
- "Retainer - {month} {year}"
- "{hours} hrs @ ${rate}/hr - {service_type}"
2.3 Payment Descriptions
Current: “Payment for Invoice INV-00123”
Proposed variations:
- "Pmt INV-00123"
- "ACH payment - {vendor} - {invoice_ref}"
- "Wire transfer ref {bank_ref}"
- "Check #{check_number} - {vendor}"
- "EFT {date} {vendor_short}"
- "Batch payment - {batch_id}"
3. Enhanced Metadata Generation
3.1 User ID Patterns
Current: Sequential or simple random
Proposed: Realistic corporate patterns
user_id_patterns:
format: "{first_initial}{last_name}{disambiguator}"
examples:
- "jsmith"
- "jsmith2"
- "john.smith"
- "smithj"
system_accounts:
- prefix: "SVC_"
examples: ["SVC_BATCH", "SVC_INTERFACE", "SVC_RECON"]
- prefix: "SYS_"
examples: ["SYS_AUTO", "SYS_SCHEDULER"]
- prefix: "INT_"
examples: ["INT_SAP", "INT_ORACLE", "INT_SALESFORCE"]
admin_accounts:
- pattern: "admin_{system}"
- examples: ["admin_gl", "admin_ap", "admin_ar"]
3.2 Reference Number Formats
Realistic patterns by document type:
reference_formats:
purchase_order:
patterns:
- "PO-{year}{seq:06}" # PO-2024000142
- "4500{seq:06}" # SAP-style: 4500000142
- "{plant}-{year}-{seq:05}" # CHI-2024-00142
invoice:
vendor_patterns:
- "INV-{seq:08}"
- "{vendor_prefix}-{date}-{seq:04}"
- "{random_alpha:3}{seq:06}"
internal_patterns:
- "VINV-{year}{seq:06}"
- "{company_code}-AP-{seq:07}"
journal_entry:
patterns:
- "{year}{period:02}{seq:06}" # 202401000142
- "JE-{date}-{seq:04}" # JE-20240115-0142
- "{company}-{year}-{seq:07}" # C001-2024-0000142
bank_reference:
patterns:
- "{date}{random:10}" # Bank statement ref
- "TRN{seq:12}" # Transaction ID
- "{swift_code}{date}{seq:06}" # SWIFT format
3.3 Timestamp Realism
System-specific posting behaviors:
timestamp_patterns:
batch_processing:
typical_times: ["02:00", "06:00", "22:00"]
duration_minutes: 30-180
day_pattern: "business_days"
manual_posting:
peak_hours: [9, 10, 11, 14, 15, 16]
off_peak_probability: 0.15
lunch_dip: [12, 13]
lunch_probability: 0.3
interface_posting:
patterns:
- hourly: ":00", ":15", ":30", ":45"
- real_time: random within seconds
source_systems:
- name: "SAP_INTERFACE"
posting_lag_hours: 0-4
- name: "LEGACY_BATCH"
posting_time: "23:30"
posting_day: "next_business_day"
period_end_crunch:
enabled: true
days_before_close: 3
extended_hours: true
weekend_activity: 0.3
4. Address and Contact Information
4.1 Realistic Address Generation
Current Gap: Generic or missing addresses
Proposed: Region-appropriate address formats
address_generation:
us:
format: "{street_number} {street_name} {street_type}\n{city}, {state} {zip}"
components:
street_numbers:
residential: 100-9999
commercial: 1-500
distribution: "log_normal"
street_names:
sources: ["census_data", "common_names"]
include_directional: 0.3 # "N", "S", "E", "W"
street_types:
distribution:
Street: 0.25
Avenue: 0.15
Road: 0.12
Drive: 0.12
Boulevard: 0.08
Lane: 0.08
Way: 0.08
Court: 0.05
Place: 0.04
Circle: 0.03
cities:
source: "population_weighted"
major_metro_weight: 0.6
commercial_patterns:
suite_probability: 0.4
floor_probability: 0.2
building_name_probability: 0.15
de:
format: "{street_name} {street_number}\n{postal_code} {city}"
# German addresses put number after street name
jp:
format: "〒{postal_code}\n{prefecture}{city}{ward}\n{block}-{building}-{unit}"
# Japanese addressing system
4.2 Phone Number Formats
phone_generation:
formats:
us: "+1 ({area_code}) {exchange}-{subscriber}"
uk: "+44 {area_code} {local_number}"
de: "+49 {area_code} {subscriber}"
area_codes:
us:
source: "valid_area_codes"
weight_by_population: true
exclude_toll_free: true
business_toll_free_rate: 0.3
4.3 Email Patterns
email_generation:
corporate:
patterns:
- "{first}.{last}@{company_domain}"
- "{first_initial}{last}@{company_domain}"
- "{first}_{last}@{company_domain}"
domain_generation:
from_company_name: true
tld_distribution:
".com": 0.75
".net": 0.10
".io": 0.05
".co": 0.05
country_tld: 0.05
vendor_contacts:
patterns:
- "accounts.payable@{domain}"
- "ar@{domain}"
- "billing@{domain}"
- "{first}.{last}@{domain}"
generic_rate: 0.4
5. Material and Product Naming
5.1 SKU Patterns
sku_generation:
patterns:
category_prefix:
format: "{category:3}-{subcategory:3}-{sequence:06}"
example: "ELE-CPT-000142" # Electronics-Components
alphanumeric:
format: "{alpha:2}{numeric:6}{check_digit}"
example: "AB123456C"
hierarchical:
format: "{division}-{family}-{class}-{item}"
example: "01-234-567-8901"
5.2 Product Descriptions
By Category:
product_descriptions:
raw_materials:
templates:
- "{material_type}, {grade}, {specification}"
- "{chemical_formula} {purity}% pure"
- "{material} {form} - {dimensions}"
examples:
- "Steel Coil, Grade 304, 1.2mm thickness"
- "Aluminum Sheet 6061-T6, 4' x 8' x 0.125\""
- "Polyethylene Pellets, HDPE, 50lb bag"
finished_goods:
templates:
- "{brand} {product_line} {model}"
- "{product_type} - {feature1}, {feature2}"
- "{category} {variant} ({color}/{size})"
examples:
- "Acme Pro Series 5000X Widget"
- "Heavy-Duty Industrial Pump - 2HP, 120V"
- "Office Chair Ergonomic Mesh (Black/Large)"
services:
templates:
- "{service_type} - {duration} {frequency}"
- "Professional {service} Services"
- "{specialty} Consultation - {scope}"
examples:
- "HVAC Maintenance - Annual Contract"
- "Professional IT Support Services"
- "Legal Consultation - Contract Review"
6. Implementation Priority
| Enhancement | Effort | Impact | Priority |
|---|---|---|---|
| Regional name pools | Medium | High | P1 |
| Industry-specific vendor names | Medium | High | P1 |
| Varied JE descriptions | Low | Medium | P1 |
| Reference number formats | Low | High | P1 |
| User ID patterns | Low | Medium | P2 |
| Address generation | High | Medium | P2 |
| Product descriptions | Medium | Medium | P2 |
| Email patterns | Low | Low | P3 |
| Phone formatting | Low | Low | P3 |
7. Data Sources
Recommended External Data Sources:
- Name Data:
  - US Census Bureau name frequency data
  - International name databases (regional)
  - Industry-specific company name patterns
- Address Data:
  - OpenAddresses project
  - Census TIGER/Line files
  - Postal code databases by country
- Reference Patterns:
  - ERP documentation (SAP, Oracle, NetSuite)
  - Industry EDI standards
  - Banking reference formats (SWIFT, ACH)
- Product Data:
  - UNSPSC category codes
  - Industry classification systems
  - Standard material specifications
8. Configuration Example
# Enhanced name and metadata configuration
realism:
names:
strategy: culturally_aware
primary_region: north_america
diversity_index: 0.4
vendors:
naming_style: industry_appropriate
include_legal_suffix: true
regional_distribution:
domestic: 0.7
international: 0.3
descriptions:
variation_enabled: true
abbreviation_rate: 0.25
typo_injection_rate: 0.01
references:
format_style: erp_realistic
include_check_digits: true
timestamps:
system_behavior_modeling: true
batch_window_realism: true
addresses:
format: regional_appropriate
commercial_indicators: true
Next Steps
- Create name pool data files for major regions
- Implement NameGenerator trait with regional strategies
- Build description template engine with variation support
- Add reference format configurations to schema
- Integrate address generation with Faker-like libraries
See also: 02-statistical-distributions.md for numerical realism
Research: Statistical and Numerical Distributions
Current State Analysis
Existing Distribution Implementations
The system currently supports several distribution types:
| Distribution | Implementation | Usage |
|---|---|---|
| Log-Normal | AmountSampler | Transaction amounts |
| Benford’s Law | BenfordSampler | First-digit distribution |
| Uniform | Standard | ID generation, selection |
| Weighted | LineItemSampler | Line item counts |
| Poisson | TemporalSampler | Event counts |
| Normal/Gaussian | Standard | Some variations |
Current Strengths
- Benford’s Law compliance: First-digit distribution follows expected 30.1%, 17.6%, 12.5%… pattern
- Log-normal amounts: Realistic transaction size distributions
- Temporal weighting: Period-end spikes, day-of-week patterns
- Industry seasonality: 10 industry profiles with event-based multipliers
Current Gaps
- Single-mode distributions: No mixture models for multi-modal data
- Limited correlation: Cross-field dependencies not modeled
- Static parameters: No regime changes or parameter drift
- Missing distributions: Pareto, Weibull, Beta not available
- No copulas: Joint distributions not correlated realistically
Improvement Recommendations
1. Multi-Modal Distribution Support
1.1 Gaussian Mixture Models
Real-world transaction amounts often exhibit multiple modes:
/// Gaussian Mixture Model for multi-modal distributions
pub struct GaussianMixture {
components: Vec<GaussianComponent>,
}
pub struct GaussianComponent {
weight: f64, // Component weight (sum to 1.0)
mean: f64, // Component mean
std_dev: f64, // Component standard deviation
}
impl GaussianMixture {
/// Sample from the mixture distribution
pub fn sample(&self, rng: &mut impl Rng) -> f64 {
// Select component based on weights
let component = self.select_component(rng);
// Sample from selected Gaussian
component.sample(rng)
}
}
Configuration:
amount_distribution:
type: gaussian_mixture
components:
- weight: 0.60
mean: 500
std_dev: 200
label: "small_transactions"
- weight: 0.30
mean: 5000
std_dev: 1500
label: "medium_transactions"
- weight: 0.10
mean: 50000
std_dev: 15000
label: "large_transactions"
1.2 Log-Normal Mixture
For strictly positive amounts with multiple modes:
amount_distribution:
type: lognormal_mixture
components:
- weight: 0.70
mu: 5.5 # log-scale mean
sigma: 1.2 # log-scale std dev
label: "routine_expenses"
- weight: 0.25
mu: 8.5
sigma: 0.8
label: "capital_expenses"
- weight: 0.05
mu: 11.0
sigma: 0.5
label: "major_projects"
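A minimal sampler sketch for such a mixture, assuming the rand and rand_distr crates: a component is chosen by weight, then the amount is drawn from that component's log-normal. The struct and its API are illustrative, not the existing AmountSampler.

use rand::distributions::{Distribution, WeightedIndex};
use rand::rngs::StdRng;
use rand::SeedableRng;
use rand_distr::LogNormal;

struct LogNormalMixture {
    weights: WeightedIndex<f64>,
    components: Vec<LogNormal<f64>>,
}

impl LogNormalMixture {
    /// specs: (weight, mu, sigma) per component.
    fn new(specs: &[(f64, f64, f64)]) -> Self {
        let w: Vec<f64> = specs.iter().map(|s| s.0).collect();
        let components = specs
            .iter()
            .map(|&(_, mu, sigma)| LogNormal::new(mu, sigma).expect("valid sigma"))
            .collect();
        Self { weights: WeightedIndex::new(&w).expect("valid weights"), components }
    }

    /// Pick a component by weight, then sample from its log-normal distribution.
    fn sample(&self, rng: &mut StdRng) -> f64 {
        self.components[self.weights.sample(rng)].sample(rng)
    }
}

fn main() {
    // Mirrors the routine / capital / major-project components above.
    let mixture = LogNormalMixture::new(&[(0.70, 5.5, 1.2), (0.25, 8.5, 0.8), (0.05, 11.0, 0.5)]);
    let mut rng = StdRng::seed_from_u64(42);
    for _ in 0..5 {
        println!("{:.2}", mixture.sample(&mut rng));
    }
}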
1.3 Realistic Transaction Amount Profiles
By Transaction Type:
| Type | Distribution | Parameters | Notes |
|---|---|---|---|
| Petty Cash | Log-normal | μ=3.5, σ=0.8 | $10-$500 range |
| AP Invoices | Mixture(3) | See below | Multi-modal |
| Payroll | Normal | μ=4500, σ=1200 | Per employee |
| Utilities | Log-normal | μ=7.0, σ=0.4 | Monthly, stable |
| Capital | Pareto | α=1.5, xₘ=10000 | Heavy tail |
AP Invoice Mixture:
ap_invoices:
type: lognormal_mixture
components:
# Operating expenses
- weight: 0.50
mu: 6.0 # ~$400 median
sigma: 1.5
# Inventory/materials
- weight: 0.35
mu: 8.0 # ~$3000 median
sigma: 1.0
# Capital/projects
- weight: 0.15
mu: 10.5 # ~$36000 median
sigma: 0.8
2. Cross-Field Correlation Modeling
2.1 Correlation Matrix Support
Define correlations between numeric fields:
correlations:
enabled: true
fields:
- name: transaction_amount
- name: line_item_count
- name: approval_level
- name: processing_time_hours
- name: discount_percentage
matrix:
# Correlation coefficients (Pearson's r)
# Higher amounts → more line items
- [1.00, 0.65, 0.72, 0.45, -0.20]
# More items → higher amount
- [0.65, 1.00, 0.55, 0.60, -0.15]
# Higher amount → higher approval
- [0.72, 0.55, 1.00, 0.50, -0.30]
# More complex → longer processing
- [0.45, 0.60, 0.50, 1.00, -0.10]
# Higher amount → lower discount %
- [-0.20, -0.15, -0.30, -0.10, 1.00]
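A minimal sketch of how one pairwise correlation from this matrix could be realized, assuming the rand and rand_distr crates: draw two independent standard normals and blend them so the pair has the target Pearson correlation, then map each latent value onto its marginal. A full implementation would apply a Cholesky factor of the whole matrix rather than a single pair.

use rand::rngs::StdRng;
use rand::SeedableRng;
use rand_distr::{Distribution, StandardNormal};

/// y = r*x + sqrt(1 - r^2)*z gives corr(x, y) = r for standard normals x, z.
fn correlated_pair(r: f64, rng: &mut StdRng) -> (f64, f64) {
    let x: f64 = StandardNormal.sample(rng);
    let z: f64 = StandardNormal.sample(rng);
    let y = r * x + (1.0 - r * r).sqrt() * z;
    (x, y)
}

fn main() {
    let mut rng = StdRng::seed_from_u64(7);
    let r = 0.65; // transaction_amount vs. line_item_count from the matrix above
    for _ in 0..3 {
        let (x, y) = correlated_pair(r, &mut rng);
        // The latent normals would then be mapped onto marginals, e.g. a
        // log-normal amount and a count distribution; here we print them as-is.
        println!("x = {x:.3}, y = {y:.3}");
    }
}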
2.2 Copula-Based Generation
For more sophisticated dependency modeling:
/// Copula types for dependency modeling
pub enum CopulaType {
/// Gaussian copula - symmetric dependencies
Gaussian { correlation: f64 },
/// Clayton copula - lower tail dependence
Clayton { theta: f64 },
/// Gumbel copula - upper tail dependence
Gumbel { theta: f64 },
/// Frank copula - symmetric, no tail dependence
Frank { theta: f64 },
/// Student-t copula - both tail dependencies
StudentT { correlation: f64, df: f64 },
}
pub struct CopulaGenerator {
copula: CopulaType,
marginals: Vec<Box<dyn Distribution>>,
}
Use Cases:
- Amount & Days-to-Pay: Larger invoices may have longer payment terms (Clayton copula)
- Revenue & COGS: Strong positive correlation (Gaussian copula)
- Fraud Amount & Detection Delay: Upper tail dependence (Gumbel copula)
2.3 Conditional Distributions
Generate values conditional on other fields:
conditional_distributions:
# Discount percentage depends on order amount
discount:
type: conditional
given: order_amount
breakpoints:
- threshold: 1000
distribution: { type: constant, value: 0 }
- threshold: 5000
distribution: { type: uniform, min: 0, max: 0.05 }
- threshold: 25000
distribution: { type: uniform, min: 0.05, max: 0.10 }
- threshold: 100000
distribution: { type: uniform, min: 0.10, max: 0.15 }
- threshold: infinity
distribution: { type: normal, mean: 0.15, std: 0.03 }
# Payment terms depend on vendor relationship
payment_terms:
type: conditional
given: vendor_relationship_months
breakpoints:
- threshold: 6
distribution: { type: choice, values: [0, 15], weights: [0.8, 0.2] }
- threshold: 24
distribution: { type: choice, values: [15, 30], weights: [0.6, 0.4] }
- threshold: infinity
distribution: { type: choice, values: [30, 45, 60], weights: [0.5, 0.35, 0.15] }
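A minimal sketch of the breakpoint lookup, using the thresholds from the discount example above; the returned range would then be fed to the configured distribution (constant, uniform, normal, and so on). The function name and the open-ended bucket are illustrative.

/// Map an order amount to the discount range of its breakpoint bucket.
fn discount_range(order_amount: f64) -> (f64, f64) {
    match order_amount {
        a if a < 1_000.0 => (0.0, 0.0),
        a if a < 5_000.0 => (0.0, 0.05),
        a if a < 25_000.0 => (0.05, 0.10),
        a if a < 100_000.0 => (0.10, 0.15),
        _ => (0.12, 0.18), // open-ended bucket, roughly Normal(0.15, 0.03)
    }
}

fn main() {
    for amount in [500.0, 12_000.0, 250_000.0] {
        let (lo, hi) = discount_range(amount);
        println!("order {amount}: discount drawn from [{lo}, {hi}]");
    }
}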
3. Industry-Specific Amount Distributions
3.1 Retail
retail:
transaction_amounts:
pos_sales:
type: lognormal_mixture
components:
- weight: 0.65
mu: 3.0 # ~$20 median
sigma: 1.0
label: "small_basket"
- weight: 0.30
mu: 4.5 # ~$90 median
sigma: 0.8
label: "medium_basket"
- weight: 0.05
mu: 6.0 # ~$400 median
sigma: 0.6
label: "large_basket"
inventory_orders:
type: lognormal
mu: 9.0 # ~$8000 median
sigma: 1.5
seasonal_multipliers:
black_friday: 3.5
christmas_week: 2.8
back_to_school: 1.6
3.2 Manufacturing
manufacturing:
transaction_amounts:
raw_materials:
type: lognormal_mixture
components:
- weight: 0.40
mu: 8.0 # ~$3000 median
sigma: 1.0
label: "consumables"
- weight: 0.45
mu: 10.0 # ~$22000 median
sigma: 0.8
label: "production_materials"
- weight: 0.15
mu: 12.0 # ~$163000 median
sigma: 0.6
label: "bulk_orders"
maintenance:
type: pareto
alpha: 2.0
x_min: 500
label: "repair_costs"
capital_equipment:
type: lognormal
mu: 12.5 # ~$268000 median
sigma: 1.0
3.3 Financial Services
financial_services:
transaction_amounts:
wire_transfers:
type: lognormal_mixture
components:
- weight: 0.30
mu: 8.0 # ~$3000
sigma: 1.2
label: "retail_wire"
- weight: 0.40
mu: 11.0 # ~$60000
sigma: 1.0
label: "commercial_wire"
- weight: 0.20
mu: 14.0 # ~$1.2M
sigma: 0.8
label: "institutional_wire"
- weight: 0.10
mu: 17.0 # ~$24M
sigma: 1.0
label: "large_value"
ach_transactions:
type: lognormal
mu: 7.5 # ~$1800
sigma: 2.0
fee_income:
type: weibull
scale: 500
shape: 1.5
4. Regime Change Modeling
4.1 Structural Breaks
Model sudden changes in distribution parameters:
regime_changes:
enabled: true
changes:
- date: "2024-03-15"
type: acquisition
effects:
- field: transaction_volume
multiplier: 1.35
- field: average_amount
shift: 5000
- field: vendor_count
multiplier: 1.25
- date: "2024-07-01"
type: price_increase
effects:
- field: cogs_ratio
shift: 0.03
- field: avg_invoice_amount
multiplier: 1.08
- date: "2024-10-01"
type: new_product_line
effects:
- field: revenue
multiplier: 1.20
- field: inventory_turns
multiplier: 0.85
4.2 Gradual Parameter Drift
Model slow changes over time:
parameter_drift:
enabled: true
parameters:
- field: transaction_amount
type: linear
annual_drift: 0.03 # 3% annual increase (inflation)
- field: digital_payment_ratio
type: logistic
start_value: 0.40
end_value: 0.85
midpoint_months: 18
steepness: 0.15
- field: approval_threshold
type: step
steps:
- month: 6
value: 5000
- month: 18
value: 7500
- month: 30
value: 10000
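A minimal sketch of how the linear and logistic drift curves above could be evaluated at sample time, as pure functions of months elapsed since generation start. The parameter values come from the config; the function names are illustrative.

/// Compound annual drift (e.g. 3% inflation) applied continuously by month.
fn inflation_multiplier(months: f64, annual_drift: f64) -> f64 {
    (1.0 + annual_drift).powf(months / 12.0)
}

/// Logistic adoption curve: start 0.40, end 0.85, midpoint at 18 months, steepness 0.15.
fn digital_payment_ratio(months: f64) -> f64 {
    let (start, end, midpoint, k) = (0.40, 0.85, 18.0, 0.15);
    start + (end - start) / (1.0 + (-k * (months - midpoint)).exp())
}

fn main() {
    for m in [0.0, 12.0, 24.0, 36.0] {
        println!(
            "month {m:>4}: inflation x{:.3}, digital ratio {:.2}",
            inflation_multiplier(m, 0.03),
            digital_payment_ratio(m)
        );
    }
}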
4.3 Economic Cycle Modeling
economic_cycles:
enabled: true
base_cycle:
type: sinusoidal
period_months: 48 # 4-year cycle
amplitude: 0.15 # ±15% variation
recession_events:
- start: "2024-09-01"
duration_months: 8
severity: moderate # 10-20% decline
effects:
- revenue: -0.15
- discretionary_spend: -0.35
- capital_investment: -0.50
- headcount: -0.10
recovery:
type: gradual
months: 12
5. Enhanced Benford’s Law Compliance
5.1 Second and Third Digit Distributions
Extend beyond first-digit to full Benford compliance:
pub struct BenfordDistribution {
digits: BenfordDigitConfig,
}
pub struct BenfordDigitConfig {
first_digit: bool, // Standard Benford
second_digit: bool, // Second digit distribution
first_two: bool, // Joint first-two digits
summation: bool, // Summation test
}
impl BenfordDistribution {
/// Generate amount following full Benford's Law
pub fn sample_benford_compliant(&self, rng: &mut impl Rng) -> Decimal {
// Use log-uniform distribution to ensure Benford compliance
// across multiple digit positions
}
}
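The comment above hints at the standard trick: amounts whose logarithm is uniform over several orders of magnitude have first digits that follow Benford's Law, P(d) = log10(1 + 1/d). A minimal sketch assuming the rand crate; the range and helper name are illustrative, not the existing BenfordSampler.

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

/// Sample an amount whose log10 is uniform on [log10(min), log10(max)).
fn benford_amount(rng: &mut StdRng, min: f64, max: f64) -> f64 {
    let exp = rng.gen_range(min.log10()..max.log10());
    10f64.powf(exp)
}

fn main() {
    let mut rng = StdRng::seed_from_u64(1);
    let n = 100_000;
    let mut counts = [0usize; 9];
    for _ in 0..n {
        let amount = benford_amount(&mut rng, 10.0, 1_000_000.0);
        // Leading digit, clamped defensively against floating-point edge cases.
        let d = ((amount / 10f64.powf(amount.log10().floor())).floor() as usize).clamp(1, 9);
        counts[d - 1] += 1;
    }
    for (i, c) in counts.iter().enumerate() {
        let expected = (1.0 + 1.0 / (i as f64 + 1.0)).log10();
        println!("digit {}: observed {:.3}, Benford {:.3}", i + 1, *c as f64 / n as f64, expected);
    }
}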
5.2 Benford Deviation Injection
For anomaly scenarios, intentionally violate Benford:
benford_deviations:
enabled: false # Enable for fraud scenarios
deviation_types:
# Round number preference (fraud indicator)
round_number_bias:
probability: 0.15
targets: [1000, 5000, 10000, 25000]
tolerance: 0.01
# Threshold avoidance (approval bypass)
threshold_clustering:
thresholds: [5000, 10000, 25000]
cluster_below: true
distance: 50-200
# Uniform distribution (fabricated data)
uniform_injection:
probability: 0.05
range: [1000, 9999]
6. Statistical Validation Framework
6.1 Distribution Fitness Tests
pub struct DistributionValidator {
tests: Vec<StatisticalTest>,
}
pub enum StatisticalTest {
/// Kolmogorov-Smirnov test
KolmogorovSmirnov { significance: f64 },
/// Chi-squared goodness of fit
ChiSquared { bins: usize, significance: f64 },
/// Anderson-Darling test
AndersonDarling { significance: f64 },
/// Benford's Law chi-squared
BenfordChiSquared { digits: u8, significance: f64 },
/// Mean Absolute Deviation from Benford
BenfordMAD { threshold: f64 },
}
6.2 Validation Configuration
validation:
statistical_tests:
enabled: true
tests:
- type: benford_first_digit
threshold_mad: 0.015
warning_mad: 0.010
- type: distribution_fit
target: lognormal
ks_significance: 0.05
- type: correlation_check
expected_correlations:
- fields: [amount, line_items]
expected_r: 0.65
tolerance: 0.10
reporting:
generate_plots: true
output_format: html
include_raw_data: false
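A minimal sketch of the benford_first_digit check from the config above: compute the mean absolute deviation (MAD) of observed first-digit frequencies from Benford's expected frequencies and compare against the warning and failure thresholds. The observed values are illustrative.

/// MAD of observed first-digit frequencies from Benford's expected frequencies.
fn benford_mad(observed: &[f64; 9]) -> f64 {
    (0..9)
        .map(|i| {
            let expected = (1.0 + 1.0 / (i as f64 + 1.0)).log10();
            (observed[i] - expected).abs()
        })
        .sum::<f64>()
        / 9.0
}

fn main() {
    // Observed first-digit proportions from a generated batch (illustrative values).
    let observed = [0.298, 0.178, 0.127, 0.095, 0.080, 0.068, 0.058, 0.051, 0.045];
    let mad = benford_mad(&observed);
    // Thresholds follow the config above: warn at 0.010, fail at 0.015.
    let verdict = if mad > 0.015 {
        "fail"
    } else if mad > 0.010 {
        "warn"
    } else {
        "pass"
    };
    println!("Benford MAD = {mad:.4} -> {verdict}");
}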
7. New Distribution Types
7.1 Pareto Distribution
For heavy-tailed phenomena (80/20 rule):
# Top 20% of customers generate 80% of revenue
customer_revenue:
type: pareto
alpha: 1.16 # Shape parameter for 80/20
x_min: 1000 # Minimum value
truncate_max: 10000000 # Optional cap
7.2 Weibull Distribution
For time-to-event data:
# Days until payment
days_to_payment:
type: weibull
shape: 2.0 # k > 1: increasing hazard (more likely to pay over time)
scale: 30.0 # λ: characteristic life
shift: 0 # Minimum days
7.3 Beta Distribution
For proportions and percentages:
# Discount percentage
discount_rate:
type: beta
alpha: 2.0 # Shape parameter 1
beta: 8.0 # Shape parameter 2
# Mode = (alpha - 1) / (alpha + beta - 2) = 12.5% on the unit interval before scaling, right-skewed
scale:
min: 0.0
max: 0.25 # Max 25% discount
7.4 Zero-Inflated Distributions
For data with excess zeros:
# Credits/returns (many transactions have zero)
credit_amount:
type: zero_inflated
zero_probability: 0.85
positive_distribution:
type: lognormal
mu: 5.0
sigma: 1.5
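A minimal sampler sketch for the zero-inflated credit amount above, assuming the rand and rand_distr crates: return zero with the configured probability, otherwise draw from the positive log-normal component.

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use rand_distr::{Distribution, LogNormal};

/// Zero with probability p_zero; otherwise a draw from the positive component.
fn credit_amount(rng: &mut StdRng) -> f64 {
    let p_zero = 0.85;
    if rng.gen::<f64>() < p_zero {
        0.0
    } else {
        // Constructed per call for brevity; a real sampler would cache it.
        LogNormal::new(5.0, 1.5).expect("valid sigma").sample(rng)
    }
}

fn main() {
    let mut rng = StdRng::seed_from_u64(3);
    let samples: Vec<f64> = (0..10).map(|_| credit_amount(&mut rng)).collect();
    println!("{samples:?}");
}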
8. Implementation Priority
| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| Mixture models | Medium | High | P1 |
| Correlation matrices | High | Critical | P1 |
| Industry-specific profiles | Medium | High | P1 |
| Regime changes | Medium | High | P2 |
| Copula support | High | Medium | P2 |
| Additional distributions | Low | Medium | P2 |
| Validation framework | Medium | High | P1 |
| Conditional distributions | Medium | Medium | P3 |
9. Configuration Example
# Complete statistical distribution configuration
distributions:
# Global amount settings
amounts:
default:
type: lognormal_mixture
components:
- { weight: 0.6, mu: 6.0, sigma: 1.5 }
- { weight: 0.3, mu: 8.5, sigma: 1.0 }
- { weight: 0.1, mu: 11.0, sigma: 0.8 }
by_transaction_type:
payroll:
type: normal
mean: 4500
std_dev: 1500
truncate_min: 1000
utilities:
type: lognormal
mu: 7.0
sigma: 0.5
# Correlation settings
correlations:
enabled: true
model: gaussian_copula
pairs:
- fields: [amount, processing_days]
correlation: 0.45
- fields: [amount, approval_level]
correlation: 0.72
# Drift settings
drift:
enabled: true
inflation_rate: 0.03
regime_changes:
- date: "2024-06-01"
field: avg_transaction
multiplier: 1.15
# Validation
validation:
benford_compliance: true
distribution_tests: true
correlation_verification: true
Technical Implementation Notes
Performance Considerations
- Pre-computation: Calculate CDF tables for frequently-used distributions
- Vectorization: Use SIMD for batch sampling where possible
- Caching: Cache correlation matrix decompositions (Cholesky)
- Lazy evaluation: Defer complex distribution calculations until needed
Memory Efficiency
- Streaming: Generate correlated samples in batches
- Reference tables: Use compact lookup tables for standard distributions
- On-demand: Compute regime-adjusted parameters at sample time
See also: 03-temporal-patterns.md for time-based distributions
Research: Temporal Patterns and Distributions
Implementation Status: Core temporal patterns implemented in v0.3.0. See CLAUDE.md for configuration examples.
Implementation Summary (v0.3.0)
| Feature | Status | Location |
|---|---|---|
| Business day calculator | ✅ Implemented | datasynth-core/src/distributions/business_day.rs |
| Holiday calendars (11 regions) | ✅ Implemented | datasynth-core/src/distributions/holidays.rs |
| Period-end dynamics (decay curves) | ✅ Implemented | datasynth-core/src/distributions/period_end.rs |
| Processing lag modeling | ✅ Implemented | datasynth-core/src/distributions/processing_lag.rs |
| Timezone handling | ✅ Implemented | datasynth-core/src/distributions/timezone.rs |
| Fiscal calendar (custom, 4-4-5) | ✅ Implemented | Config: fiscal_calendar |
| Intraday segments | ✅ Implemented | Config: intraday |
| Settlement rules (T+N) | ✅ Implemented | business_day.rs |
| Half-day policies | ✅ Implemented | business_day.rs |
| Lunar calendars | 🔄 Planned | Approximate via fixed dates |
Current State Analysis
Existing Temporal Infrastructure
| Component | Lines | Functionality |
|---|---|---|
| TemporalSampler | 632 | Date/time sampling with seasonality |
| IndustrySeasonality | 538 | 10 industry profiles |
| HolidayCalendar | 852 | 6 regional calendars |
| DriftController | 373 | Gradual/sudden drift |
| FiscalPeriod | 849 | Period close mechanics |
| BiTemporal | 449 | Audit trail versioning |
Current Capabilities
- Period-end spikes: Month-end (2.5x), Quarter-end (4.0x), Year-end (6.0x)
- Day-of-week patterns: Monday catch-up (1.3x), Friday wind-down (0.85x)
- Holiday handling: 6 regions with ~15 holidays each
- Working hours: 8-18 business hours with peak weighting
- Industry seasonality: Black Friday, tax season, etc.
Current Gaps
- No business day calculation - T+1, T+2 settlement not supported
- No fiscal calendar alternatives - Only calendar year supported
- Limited regional coverage - Missing LATAM, more APAC
- No half-day handling - Early closes before holidays
- Static spike multipliers - No decay curves toward period-end
- No timezone awareness - All times in single timezone
- Missing lunar calendars - Chinese New Year and Diwali are only approximated via fixed dates
Improvement Recommendations
1. Business Day Calculations
1.1 Core Business Day Engine
pub struct BusinessDayCalculator {
calendar: HolidayCalendar,
weekend_days: HashSet<Weekday>,
half_day_handling: HalfDayPolicy,
}
pub enum HalfDayPolicy {
FullDay, // Count as full business day
HalfDay, // Count as 0.5 business days
NonBusinessDay, // Treat as holiday
}
impl BusinessDayCalculator {
/// Add N business days to a date
pub fn add_business_days(&self, date: NaiveDate, days: i32) -> NaiveDate;
/// Subtract N business days from a date
pub fn sub_business_days(&self, date: NaiveDate, days: i32) -> NaiveDate;
/// Count business days between two dates
pub fn business_days_between(&self, start: NaiveDate, end: NaiveDate) -> i32;
/// Get the next business day (inclusive or exclusive)
pub fn next_business_day(&self, date: NaiveDate, inclusive: bool) -> NaiveDate;
/// Get the previous business day
pub fn prev_business_day(&self, date: NaiveDate, inclusive: bool) -> NaiveDate;
/// Is this date a business day?
pub fn is_business_day(&self, date: NaiveDate) -> bool;
}
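A minimal add_business_days sketch using chrono, skipping weekends and an explicit holiday set. The production BusinessDayCalculator would additionally honor half-day policies and per-calendar weekend definitions; the holiday set and dates below are illustrative.

use chrono::{Datelike, NaiveDate, Weekday};
use std::collections::HashSet;

/// Advance by `days` business days, skipping weekends and listed holidays.
fn add_business_days(start: NaiveDate, days: u32, holidays: &HashSet<NaiveDate>) -> NaiveDate {
    let mut date = start;
    let mut remaining = days;
    while remaining > 0 {
        date = date.succ_opt().expect("date overflow");
        let weekend = matches!(date.weekday(), Weekday::Sat | Weekday::Sun);
        if !weekend && !holidays.contains(&date) {
            remaining -= 1;
        }
    }
    date
}

fn main() {
    let holidays: HashSet<NaiveDate> =
        [NaiveDate::from_ymd_opt(2026, 1, 1).unwrap()].into_iter().collect();
    // T+2 settlement from a trade on Tuesday 2025-12-30 skips New Year's Day.
    let trade = NaiveDate::from_ymd_opt(2025, 12, 30).unwrap();
    println!("settles {}", add_business_days(trade, 2, &holidays));
}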
1.2 Settlement Date Logic
settlement_rules:
enabled: true
conventions:
# Standard equity settlement
equity:
type: T_plus_N
days: 2
calendar: exchange
# Government bonds
government_bonds:
type: T_plus_N
days: 1
calendar: federal
# Corporate bonds
corporate_bonds:
type: T_plus_N
days: 2
calendar: combined
# FX spot
fx_spot:
type: T_plus_N
days: 2
calendar: both_currencies
# Wire transfers
wire_domestic:
type: same_day_or_next
cutoff_time: "14:00"
calendar: federal
# ACH
ach:
type: T_plus_N
days: 1-3
distribution: { 1: 0.6, 2: 0.3, 3: 0.1 }
1.3 Month-End Conventions
month_end_conventions:
# Modified Following
modified_following:
if_holiday: next_business_day
if_crosses_month: previous_business_day
# Preceding
preceding:
if_holiday: previous_business_day
# Following
following:
if_holiday: next_business_day
# End of Month
end_of_month:
if_start_is_eom: end_stays_eom
2. Expanded Regional Calendars
2.1 Additional Regions
Latin America:
calendars:
brazil:
holidays:
- name: "Carnival"
type: floating
rule: "easter - 47 days"
duration_days: 2
activity_multiplier: 0.05
- name: "Tiradentes Day"
type: fixed
month: 4
day: 21
- name: "Independence Day"
type: fixed
month: 9
day: 7
- name: "Republic Day"
type: fixed
month: 11
day: 15
mexico:
holidays:
- name: "Constitution Day"
type: floating
rule: "first monday of february"
- name: "Benito Juárez Birthday"
type: floating
rule: "third monday of march"
- name: "Labor Day"
type: fixed
month: 5
day: 1
- name: "Independence Day"
type: fixed
month: 9
day: 16
- name: "Revolution Day"
type: floating
rule: "third monday of november"
- name: "Day of the Dead"
type: fixed
month: 11
day: 2
activity_multiplier: 0.3
Asia-Pacific Expansion:
australia:
holidays:
- name: "Australia Day"
type: fixed
month: 1
day: 26
observance: "next_monday_if_weekend"
- name: "ANZAC Day"
type: fixed
month: 4
day: 25
- name: "Queen's Birthday"
type: floating
rule: "second monday of june"
regional_variation: true # Different dates by state
singapore:
holidays:
- name: "Chinese New Year"
type: lunar
duration_days: 2
- name: "Vesak Day"
type: lunar
- name: "Hari Raya Puasa"
type: islamic
rule: "end of ramadan"
- name: "Deepavali"
type: lunar
calendar: hindu
south_korea:
holidays:
- name: "Seollal"
type: lunar
calendar: korean
duration_days: 3
- name: "Chuseok"
type: lunar
calendar: korean
duration_days: 3
2.2 Lunar Calendar Implementation
/// Accurate lunar calendar calculations
pub struct LunarCalendar {
calendar_type: LunarCalendarType,
cache: HashMap<i32, Vec<LunarDate>>,
}
pub enum LunarCalendarType {
Chinese, // Chinese lunisolar
Islamic, // Hijri calendar
Hebrew, // Jewish calendar
Hindu, // Vikram Samvat
Korean, // Dangun calendar
}
impl LunarCalendar {
/// Convert Gregorian date to lunar date
pub fn to_lunar(&self, date: NaiveDate) -> LunarDate;
/// Convert lunar date to Gregorian
pub fn to_gregorian(&self, lunar: LunarDate) -> NaiveDate;
/// Get Chinese New Year date for a given Gregorian year
pub fn chinese_new_year(&self, year: i32) -> NaiveDate;
/// Get Ramadan start date for a given Gregorian year
pub fn ramadan_start(&self, year: i32) -> NaiveDate;
/// Get Diwali date (new moon in Kartik)
pub fn diwali(&self, year: i32) -> NaiveDate;
}
3. Period-End Dynamics
3.1 Decay Curves Instead of Static Multipliers
Replace flat multipliers with realistic acceleration curves:
period_end_dynamics:
enabled: true
month_end:
model: exponential_acceleration
parameters:
start_day: -10 # 10 days before month-end
base_multiplier: 1.0
peak_multiplier: 3.5
decay_rate: 0.3 # Exponential decay parameter
# Activity profile by days-to-close
daily_profile:
-10: 1.0
-7: 1.2
-5: 1.5
-3: 2.0
-2: 2.5
-1: 3.0 # Day before close
0: 3.5 # Close day
quarter_end:
model: stepped_exponential
inherit_from: month_end
additional_multiplier: 1.5
year_end:
model: extended_crunch
parameters:
start_day: -15
sustained_high_days: 5
peak_multiplier: 6.0
# Year-end specific activities
activities:
- type: "audit_adjustments"
days: [-3, -2, -1, 0]
multiplier: 2.0
- type: "tax_provisions"
days: [-5, -4, -3]
multiplier: 1.5
- type: "impairment_reviews"
days: [-10, -9, -8]
multiplier: 1.3
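A minimal sketch of the exponential-acceleration model: the multiplier rises from the base level ten days out to the peak on the close day itself. The formula and parameter values mirror the month_end block above and roughly reproduce its daily profile; the function name is illustrative.

/// Activity multiplier as a function of (negative) days to the period close.
fn period_end_multiplier(days_to_close: i32) -> f64 {
    let (start_day, base, peak, decay_rate) = (-10, 1.0, 3.5, 0.3_f64);
    if days_to_close < start_day {
        return base;
    }
    // At days_to_close = 0 this equals the peak; ten days out it is close to base.
    base + (peak - base) * (decay_rate * days_to_close as f64).exp()
}

fn main() {
    for d in [-10, -5, -3, -1, 0] {
        println!("{d:>3} days to close: x{:.2}", period_end_multiplier(d));
    }
}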
3.2 Intra-Day Patterns
intraday_patterns:
# Morning rush
morning_spike:
start: "08:30"
end: "10:00"
multiplier: 1.8
# Pre-lunch activity
late_morning:
start: "10:00"
end: "12:00"
multiplier: 1.2
# Lunch lull
lunch_dip:
start: "12:00"
end: "13:30"
multiplier: 0.4
# Afternoon steady
afternoon:
start: "13:30"
end: "16:00"
multiplier: 1.0
# End-of-day push
eod_rush:
start: "16:00"
end: "17:30"
multiplier: 1.5
# After hours (manual only)
after_hours:
start: "17:30"
end: "20:00"
multiplier: 0.15
type: manual_only
3.3 Time Zone Handling
timezones:
enabled: true
company_timezones:
default: "America/New_York"
by_entity:
- entity_pattern: "EU_*"
timezone: "Europe/London"
- entity_pattern: "DE_*"
timezone: "Europe/Berlin"
- entity_pattern: "APAC_*"
timezone: "Asia/Singapore"
- entity_pattern: "JP_*"
timezone: "Asia/Tokyo"
posting_behavior:
# Consolidation timing
consolidation:
coordinator_timezone: "America/New_York"
cutoff_time: "18:00"
# Intercompany coordination
intercompany:
settlement_timezone: "UTC"
matching_window_hours: 24
4. Fiscal Calendar Alternatives
4.1 Non-Calendar Year Support
fiscal_calendar:
type: custom
year_start:
month: 7
day: 1
# Fiscal year 2024 = July 1, 2024 - June 30, 2025
period_naming:
format: "FY{year}P{period:02}"
# FY2024P01 = July 2024
4.2 4-4-5 Calendar
fiscal_calendar:
type: 4-4-5
year_start:
anchor: first_sunday_of_february
# Or: last_saturday_of_january
periods:
Q1:
- weeks: 4
- weeks: 4
- weeks: 5
Q2:
- weeks: 4
- weeks: 4
- weeks: 5
Q3:
- weeks: 4
- weeks: 4
- weeks: 5
Q4:
- weeks: 4
- weeks: 4
- weeks: 5
# 53rd week handling (every 5-6 years)
leap_week:
occurrence: calculated
placement: Q4_P3 # Added to last period
4.3 13-Period Calendar
fiscal_calendar:
type: 13_period
weeks_per_period: 4
year_start:
anchor: first_monday_of_january
# 53rd week handling
extra_week_period: 13
5. Advanced Seasonality
5.1 Multi-Factor Seasonality
seasonality:
factors:
# Annual cycle
annual:
type: fourier
harmonics: 3
coefficients:
cos1: 0.15
sin1: 0.08
cos2: 0.05
sin2: 0.03
cos3: 0.02
sin3: 0.01
# Weekly cycle
weekly:
type: categorical
values:
monday: 1.25
tuesday: 1.10
wednesday: 1.00
thursday: 1.00
friday: 0.90
saturday: 0.15
sunday: 0.05
# Monthly cycle (within month)
monthly:
type: piecewise
segments:
- days: [1, 5]
multiplier: 1.3
label: "month_start"
- days: [6, 20]
multiplier: 0.9
label: "mid_month"
- days: [21, 31]
multiplier: 1.4
label: "month_end"
# Interaction effects
interactions:
- factors: [annual, weekly]
type: multiplicative
- factors: [monthly, weekly]
type: additive
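For the annual Fourier term, the factor at day-of-year t is 1 + Σ_k (cos_k·cos(2πkt/365) + sin_k·sin(2πkt/365)). A minimal sketch using the coefficients above; the weekly and monthly factors would then be combined per the interaction rules (function name is illustrative):
use std::f64::consts::TAU;
/// Illustrative sketch: annual seasonality factor from the three harmonics above.
fn annual_factor(day_of_year: u32) -> f64 {
    let coefficients = [(0.15, 0.08), (0.05, 0.03), (0.02, 0.01)]; // (cos_k, sin_k)
    let t = day_of_year as f64 / 365.0;
    1.0 + coefficients
        .iter()
        .enumerate()
        .map(|(i, (c, s))| {
            let k = (i + 1) as f64;
            c * (TAU * k * t).cos() + s * (TAU * k * t).sin()
        })
        .sum::<f64>()
}
fn main() {
    println!("Jan 15 factor: {:.3}", annual_factor(15));
    println!("Jul 15 factor: {:.3}", annual_factor(196));
}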
5.2 Weather-Driven Seasonality
For relevant industries:
weather_seasonality:
enabled: true
industries: [retail, utilities, agriculture, construction]
patterns:
temperature:
cold_threshold: 32 # Fahrenheit
hot_threshold: 85
effects:
cold:
utilities: 1.8
construction: 0.5
retail_outdoor: 0.3
hot:
utilities: 1.5
construction: 0.8
retail_outdoor: 1.3
precipitation:
effects:
rain:
construction: 0.6
retail_brick_mortar: 0.8
retail_online: 1.2
# Regional weather profiles
regional_profiles:
northeast_us:
winter_severity: high
summer_humidity: medium
southwest_us:
winter_severity: low
summer_heat: extreme
pacific_northwest:
precipitation_days: high
temperature_variance: low
6. Transaction Timing Realism
6.1 Processing Lag Modeling
processing_lags:
# Time between event and posting
event_to_posting:
distribution: lognormal
parameters:
sales_order:
mu: 0.5 # ~1.6 hours median
sigma: 0.8
goods_receipt:
mu: 1.5 # ~4.5 hours median
sigma: 0.5
invoice_receipt:
mu: 2.0 # ~7.4 hours median
sigma: 0.6
payment:
mu: 0.2 # ~1.2 hours median
sigma: 0.3
# Day-boundary crossing
cross_day_posting:
enabled: true
probability_by_hour:
"17:00": 0.7 # 70% post next day if after 5pm
"19:00": 0.9
"21:00": 0.99
# Batch processing delays
batch_delays:
enabled: true
schedules:
nightly_batch:
run_time: "02:00"
affects: [bank_transactions, interfaces]
hourly_sync:
interval_minutes: 60
affects: [inventory_movements]
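The inline comments above follow from the lognormal identities median = e^μ and mean = e^(μ + σ²/2), assuming μ and σ are expressed in log-hours; a quick sanity check (helper names are illustrative):
/// Illustrative check of the lognormal lag comments above.
fn lognormal_median_hours(mu: f64) -> f64 {
    mu.exp()
}
fn lognormal_mean_hours(mu: f64, sigma: f64) -> f64 {
    (mu + sigma * sigma / 2.0).exp()
}
fn main() {
    for (name, mu, sigma) in [
        ("sales_order", 0.5, 0.8),
        ("goods_receipt", 1.5, 0.5),
        ("invoice_receipt", 2.0, 0.6),
        ("payment", 0.2, 0.3),
    ] {
        println!(
            "{:<16} median {:.1}h  mean {:.1}h",
            name,
            lognormal_median_hours(mu),
            lognormal_mean_hours(mu, sigma)
        );
    }
}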
6.2 Human vs. System Posting Patterns
posting_patterns:
human:
# Working hours focus
primary_hours: [9, 10, 11, 14, 15, 16]
probability: 0.8
# Occasional overtime
extended_hours: [8, 17, 18, 19]
probability: 0.15
# Rare late night
late_hours: [20, 21, 22]
probability: 0.05
# Keystroke timing (for detailed simulation)
entry_duration:
simple_je:
mean_seconds: 45
std_seconds: 15
complex_je:
mean_seconds: 180
std_seconds: 60
system:
# Interface postings
interface:
typical_times: ["01:00", "05:00", "13:00"]
duration_minutes: 15-45
burst_rate: 100-500 # Records per minute
# Automated recurring
recurring:
time: "00:30"
day: first_business_day
# Real-time integrations
realtime:
latency_ms: 100-500
batch_size: 1
7. Period Close Orchestration
7.1 Close Calendar Generation
close_calendar:
enabled: true
# Standard close schedule
monthly:
soft_close:
day: 2 # 2nd business day
activities: [preliminary_review, initial_accruals]
hard_close:
day: 5 # 5th business day
activities: [final_adjustments, lock_period]
reporting:
day: 7 # 7th business day
activities: [management_reports, variance_analysis]
quarterly:
extended_close:
additional_days: 3
activities:
- quarter_end_reserves
- intercompany_reconciliation
- consolidation
annual:
extended_close:
additional_days: 10
activities:
- audit_adjustments
- tax_provisions
- impairment_testing
- goodwill_analysis
- segment_reporting
7.2 Late Posting Behavior
late_postings:
enabled: true
# Probability of late posting by days after close
probability_curve:
day_1: 0.08 # 8% of transactions post 1 day late
day_2: 0.03
day_3: 0.01
day_4: 0.005
day_5+: 0.002
# Characteristics of late postings
characteristics:
# More likely to be corrections
correction_probability: 0.4
# Higher average amount
amount_multiplier: 1.5
# Require special approval
approval_required: true
# Must reference original period
period_reference: required
8. Implementation Priority
| Enhancement | Complexity | Impact | Priority | Status |
|---|---|---|---|---|
| Business day calculator | Medium | Critical | P1 | ✅ v0.3.0 |
| Additional regional calendars | Medium | High | P1 | ✅ v0.3.0 (11 regions) |
| Decay curves for period-end | Low | High | P1 | ✅ v0.3.0 |
| Non-calendar fiscal years | Medium | Medium | P2 | ✅ v0.3.0 |
| 4-4-5 calendar support | High | Medium | P2 | ✅ v0.3.0 |
| Timezone handling | Medium | Medium | P2 | ✅ v0.3.0 |
| Lunar calendar accuracy | High | Medium | P3 | 🔄 Planned |
| Weather seasonality | Medium | Low | P3 | 🔄 Planned |
| Intra-day patterns | Low | Medium | P2 | ✅ v0.3.0 |
| Processing lag modeling | Medium | High | P1 | ✅ v0.3.0 |
9. Validation Metrics
temporal_validation:
metrics:
# Period-end spike ratios
period_end_spikes:
month_end_ratio:
expected: 2.0-3.0
tolerance: 0.5
quarter_end_ratio:
expected: 3.5-4.5
tolerance: 0.5
year_end_ratio:
expected: 5.0-7.0
tolerance: 1.0
# Day-of-week distribution
dow_distribution:
test: chi_squared
expected_weights: [1.3, 1.1, 1.0, 1.0, 0.85, 0.1, 0.05]
significance: 0.05
# Holiday compliance
holiday_activity:
max_activity_on_holiday: 0.1
allow_exceptions: ["bank_settlement"]
# Business hours
business_hours:
human_transactions:
in_hours_rate: 0.85-0.95
system_transactions:
off_hours_allowed: true
# Late posting rate
late_postings:
max_rate: 0.15
concentration_test: true # Should not cluster
See also: 04-interconnectivity.md for relationship modeling
Research: Interconnectivity and Relationship Modeling
Implementation Status: P1 features implemented in v0.3.0. See Interconnectivity Documentation for usage.
Implementation Summary (v0.3.0)
| Feature | Status | Location |
|---|---|---|
| Multi-tier vendor networks | ✅ Implemented | datasynth-core/src/models/vendor_network.rs |
| Vendor clusters & lifecycle | ✅ Implemented | datasynth-core/src/models/vendor_network.rs |
| Customer value segmentation | ✅ Implemented | datasynth-core/src/models/customer_segment.rs |
| Customer lifecycle stages | ✅ Implemented | datasynth-core/src/models/customer_segment.rs |
| Relationship strength modeling | ✅ Implemented | datasynth-core/src/models/relationship.rs |
| Entity graph (16 types, 26 relations) | ✅ Implemented | datasynth-core/src/models/relationship.rs |
| Cross-process links (P2P↔O2C) | ✅ Implemented | datasynth-generators/src/relationships/ |
| Network evaluation metrics | ✅ Implemented | datasynth-eval/src/coherence/network.rs |
| Configuration & validation | ✅ Implemented | datasynth-config/src/schema.rs, validation.rs |
| Organizational hierarchy depth | 🔄 P2 - Planned | - |
| Network effect modeling | 🔄 P2 - Planned | - |
| Community detection | 🔄 P3 - Planned | - |
Current State Analysis
Existing Relationship Infrastructure
| Relationship Type | Implementation | Depth |
|---|---|---|
| Document Chains | DocumentChainManager | Strong |
| Three-Way Match | ThreeWayMatcher | Strong |
| Intercompany | ICMatchingEngine | Strong |
| GL Balance Links | Account hierarchies | Medium |
| Vendor-Customer | Basic master data | Weak |
| Employee-Approval | Approval chains | Medium |
| Entity Registry | EntityRegistry | Medium |
Current Strengths
- Document flow integrity: PO → GR → Invoice → Payment chains maintained
- Intercompany matching: Automatic generation of offsetting entries
- Balance coherence: Trial balance validation, A=L+E enforcement
- Graph export: PyTorch Geometric, Neo4j, DGL formats supported
- COSO control mapping: Controls linked to processes and risks
Current Gaps
- Shallow vendor networks: No supplier-of-supplier modeling
- Limited customer relationships: No customer segmentation
- No organizational hierarchy depth: Flat cost center structures
- Missing behavioral clustering: Entities don’t cluster by behavior
- No network effects: Relationships don’t influence behavior
- Static relationships: No relationship lifecycle modeling
Improvement Recommendations
1. Deep Vendor Network Modeling
1.1 Multi-Tier Supply Chain
vendor_network:
enabled: true
depth: 3 # Tier-1, Tier-2, Tier-3 suppliers
tiers:
tier_1:
count: 50-100
relationship: direct_supplier
visibility: full
transaction_volume: high
tier_2:
count: 200-500
relationship: supplier_of_supplier
visibility: partial
transaction_volume: medium
# Only visible through Tier-1 transactions
tier_3:
count: 500-2000
relationship: indirect
visibility: minimal
transaction_volume: low
# Dependency modeling
dependencies:
concentration:
max_single_vendor: 0.15 # No vendor > 15% of spend
top_5_vendors: 0.45 # Top 5 < 45% of spend
critical_materials:
single_source: 0.05 # 5% of materials are single-source
dual_source: 0.15
multi_source: 0.80
substitutability:
easy: 0.60
moderate: 0.30
difficult: 0.10
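A sketch of how the concentration constraints could be validated against a generated per-vendor spend vector (the helper below is hypothetical):
/// Illustrative sketch: check the spend-concentration constraints above.
fn concentration_ok(spend: &[f64], max_single_share: f64, max_top5_share: f64) -> bool {
    let total: f64 = spend.iter().sum();
    if total <= 0.0 {
        return true;
    }
    let mut shares: Vec<f64> = spend.iter().map(|s| s / total).collect();
    // Sort descending so the largest vendors come first.
    shares.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let top1 = shares[0];
    let top5: f64 = shares.iter().take(5).sum();
    top1 <= max_single_share && top5 <= max_top5_share
}
fn main() {
    // Total spend 1,000: largest vendor 12%, top five exactly 45%.
    let spend = [120.0, 100.0, 90.0, 80.0, 60.0, 55.0, 55.0, 55.0, 55.0,
                 55.0, 55.0, 55.0, 55.0, 55.0, 55.0];
    println!("constraints satisfied: {}", concentration_ok(&spend, 0.15, 0.45));
}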
1.2 Vendor Relationship Attributes
pub struct VendorRelationship {
vendor_id: VendorId,
relationship_type: VendorRelationshipType,
start_date: NaiveDate,
end_date: Option<NaiveDate>,
// Relationship strength
strategic_importance: StrategicLevel, // Critical, Important, Standard, Transactional
spend_tier: SpendTier, // Platinum, Gold, Silver, Bronze
// Behavioral attributes
payment_history: PaymentBehavior,
dispute_frequency: DisputeLevel,
quality_score: f64,
// Contract terms
contracted_rates: Vec<ContractedRate>,
rebate_agreements: Vec<RebateAgreement>,
payment_terms: PaymentTerms,
// Network position
tier: SupplyChainTier,
parent_vendor: Option<VendorId>,
child_vendors: Vec<VendorId>,
}
pub enum VendorRelationshipType {
DirectSupplier,
ServiceProvider,
Contractor,
Distributor,
Manufacturer,
RawMaterialSupplier,
OEMPartner,
Affiliate,
}
1.3 Vendor Behavior Clustering
vendor_clusters:
enabled: true
clusters:
reliable_strategic:
size: 0.20
characteristics:
payment_terms: [30, 45, 60]
on_time_delivery: 0.95-1.0
quality_issues: rare
price_stability: high
transaction_frequency: weekly+
standard_operational:
size: 0.50
characteristics:
payment_terms: [30]
on_time_delivery: 0.85-0.95
quality_issues: occasional
price_stability: medium
transaction_frequency: monthly
transactional:
size: 0.25
characteristics:
payment_terms: [0, 15]
on_time_delivery: 0.75-0.90
quality_issues: moderate
price_stability: low
transaction_frequency: quarterly
problematic:
size: 0.05
characteristics:
payment_terms: [0] # COD only
on_time_delivery: 0.50-0.80
quality_issues: frequent
price_stability: volatile
transaction_frequency: declining
2. Customer Relationship Depth
2.1 Customer Segmentation
customer_segmentation:
enabled: true
dimensions:
value:
- segment: enterprise
revenue_share: 0.40
customer_share: 0.05
characteristics:
avg_order_value: 50000+
order_frequency: weekly
payment_behavior: terms
churn_risk: low
- segment: mid_market
revenue_share: 0.35
customer_share: 0.20
characteristics:
avg_order_value: 5000-50000
order_frequency: monthly
payment_behavior: mixed
churn_risk: medium
- segment: smb
revenue_share: 0.20
customer_share: 0.50
characteristics:
avg_order_value: 500-5000
order_frequency: quarterly
payment_behavior: prepay
churn_risk: high
- segment: consumer
revenue_share: 0.05
customer_share: 0.25
characteristics:
avg_order_value: 50-500
order_frequency: occasional
payment_behavior: immediate
churn_risk: very_high
lifecycle:
- stage: prospect
conversion_rate: 0.15
avg_duration_days: 30
- stage: new
definition: "<90 days"
behavior: exploring
support_intensity: high
- stage: growth
definition: "90-365 days"
behavior: expanding
upsell_opportunity: high
- stage: mature
definition: ">365 days"
behavior: stable
retention_focus: true
- stage: at_risk
triggers: [declining_orders, late_payments, complaints]
intervention: required
- stage: churned
definition: "no activity >180 days"
win_back_probability: 0.10
2.2 Customer Network Effects
customer_networks:
enabled: true
# Referral relationships
referrals:
enabled: true
referral_rate: 0.15
referred_customer_value_multiplier: 1.2
max_referral_chain: 3
# Parent-child relationships (corporate structures)
corporate_hierarchies:
enabled: true
probability: 0.30
hierarchy_depth: 3
billing_consolidation: true
# Industry clustering
industry_affinity:
enabled: true
same_industry_cluster_probability: 0.40
industry_trend_correlation: 0.70
3. Organizational Hierarchy Modeling
3.1 Deep Cost Center Structure
organizational_structure:
depth: 5
levels:
- level: 1
name: division
count: 3-5
examples: ["North America", "EMEA", "APAC"]
- level: 2
name: business_unit
count_per_parent: 2-4
examples: ["Commercial", "Consumer", "Industrial"]
- level: 3
name: department
count_per_parent: 3-6
examples: ["Sales", "Marketing", "Operations", "Finance"]
- level: 4
name: function
count_per_parent: 2-5
examples: ["Inside Sales", "Field Sales", "Sales Ops"]
- level: 5
name: team
count_per_parent: 2-4
examples: ["Team Alpha", "Team Beta"]
# Cross-cutting structures
matrix_relationships:
enabled: true
types:
- primary: division
secondary: function
# e.g., "EMEA Sales" reports to both EMEA Head and Global Sales VP
# Shared services
shared_services:
enabled: true
centers:
- name: "Corporate Finance"
serves: all_divisions
allocation_method: headcount
- name: "IT Infrastructure"
serves: all_divisions
allocation_method: usage
- name: "HR Services"
serves: all_divisions
allocation_method: headcount
3.2 Approval Hierarchy
approval_hierarchy:
enabled: true
# Spending authority matrix
authority_matrix:
manager:
limit: 5000
exception_rate: 0.02
senior_manager:
limit: 25000
exception_rate: 0.01
director:
limit: 100000
exception_rate: 0.005
vp:
limit: 500000
exception_rate: 0.002
c_level:
limit: unlimited
exception_rate: 0.001
# Approval chains
chain_rules:
sequential:
enabled: true
for: [capital_expenditure, contracts]
parallel:
enabled: true
for: [operational_expenses]
minimum_approvals: 2
skip_level:
enabled: true
probability: 0.05
audit_flag: true
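A sketch of routing an amount through the authority matrix to the lowest role whose limit covers it (the lookup function is illustrative; role names and limits mirror the configuration):
/// Illustrative sketch: find the required approver for an amount.
fn required_approver(amount: f64) -> &'static str {
    // (role, limit); c_level is treated as unlimited.
    const MATRIX: &[(&str, f64)] = &[
        ("manager", 5_000.0),
        ("senior_manager", 25_000.0),
        ("director", 100_000.0),
        ("vp", 500_000.0),
    ];
    MATRIX
        .iter()
        .find(|(_, limit)| amount <= *limit)
        .map(|(role, _)| *role)
        .unwrap_or("c_level")
}
fn main() {
    assert_eq!(required_approver(4_200.0), "manager");
    assert_eq!(required_approver(60_000.0), "director");
    assert_eq!(required_approver(2_000_000.0), "c_level");
}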
4. Entity Relationship Graph
4.1 Comprehensive Relationship Model
/// Unified entity relationship graph
pub struct EntityGraph {
nodes: HashMap<EntityId, EntityNode>,
edges: Vec<RelationshipEdge>,
indexes: GraphIndexes,
}
pub struct EntityNode {
id: EntityId,
entity_type: EntityType,
attributes: EntityAttributes,
created_at: DateTime<Utc>,
last_activity: DateTime<Utc>,
}
pub enum EntityType {
Company,
Vendor,
Customer,
Employee,
Department,
CostCenter,
Project,
Contract,
Asset,
BankAccount,
}
pub struct RelationshipEdge {
from_id: EntityId,
to_id: EntityId,
relationship_type: RelationshipType,
strength: f64, // 0.0 - 1.0
start_date: NaiveDate,
end_date: Option<NaiveDate>,
attributes: RelationshipAttributes,
}
pub enum RelationshipType {
// Transactional
BuysFrom,
SellsTo,
PaysTo,
ReceivesFrom,
// Organizational
ReportsTo,
Manages,
BelongsTo,
OwnedBy,
// Contractual
ContractedWith,
GuaranteedBy,
InsuredBy,
// Financial
LendsTo,
BorrowsFrom,
InvestsIn,
// Network
ReferredBy,
PartnersWith,
CompetesWith,
}
4.2 Relationship Strength Modeling
relationship_strength:
calculation:
type: composite
factors:
transaction_volume:
weight: 0.30
normalization: log_scale
transaction_count:
weight: 0.25
normalization: sqrt_scale
relationship_duration:
weight: 0.20
decay: none
recency:
weight: 0.15
decay: exponential
half_life_days: 90
mutual_connections:
weight: 0.10
normalization: jaccard_similarity
thresholds:
strong: 0.7
moderate: 0.4
weak: 0.1
dormant: 0.0
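A sketch of the composite score, assuming the volume, count, duration, and mutual-connection inputs are already normalized to [0, 1] and recency decays with the 90-day half-life above (struct and function names are hypothetical):
/// Illustrative sketch of the weighted composite relationship strength.
struct StrengthInputs {
    volume_norm: f64,     // log-scaled transaction volume, 0..1
    count_norm: f64,      // sqrt-scaled transaction count, 0..1
    duration_norm: f64,   // relationship duration, 0..1
    days_since_last: f64, // recency input, in days
    jaccard_mutual: f64,  // mutual-connection similarity, 0..1
}
fn relationship_strength(x: &StrengthInputs) -> f64 {
    let recency = 0.5_f64.powf(x.days_since_last / 90.0); // half-life decay
    0.30 * x.volume_norm
        + 0.25 * x.count_norm
        + 0.20 * x.duration_norm
        + 0.15 * recency
        + 0.10 * x.jaccard_mutual
}
fn classify(strength: f64) -> &'static str {
    match strength {
        s if s >= 0.7 => "strong",
        s if s >= 0.4 => "moderate",
        s if s >= 0.1 => "weak",
        _ => "dormant",
    }
}
fn main() {
    let x = StrengthInputs {
        volume_norm: 0.8,
        count_norm: 0.7,
        duration_norm: 0.6,
        days_since_last: 30.0,
        jaccard_mutual: 0.4,
    };
    let s = relationship_strength(&x);
    println!("strength {:.2} -> {}", s, classify(s));
}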
5. Transaction Chain Integrity
5.1 Extended Document Chains
document_chains:
# P2P extended chain
procure_to_pay:
stages:
- type: purchase_requisition
optional: true
approval_required: conditional # >$1000
- type: purchase_order
required: true
generates: commitment
- type: goods_receipt
required: conditional # For goods, not services
updates: inventory
tolerance: 0.05 # 5% over-receipt allowed
- type: vendor_invoice
required: true
matching: three_way # PO, GR, Invoice
tolerance: 0.02
- type: payment
required: true
methods: [ach, wire, check, virtual_card]
generates: bank_transaction
# Chain integrity rules
integrity:
sequence_enforcement: strict
backdating_allowed: false
amount_cascade: true # Amounts must flow through
# O2C extended chain
order_to_cash:
stages:
- type: quote
optional: true
validity_days: 30
- type: sales_order
required: true
credit_check: conditional
- type: pick_list
required: conditional
triggers: inventory_reservation
- type: delivery
required: conditional
updates: inventory
generates: shipping_document
- type: customer_invoice
required: true
triggers: revenue_recognition
- type: customer_receipt
required: true
applies_to: invoices
generates: bank_transaction
integrity:
partial_shipment: allowed
partial_payment: allowed
credit_memo: allowed
5.2 Cross-Process Linkages
cross_process_links:
enabled: true
links:
# Inventory connects P2P and O2C
- source_process: procure_to_pay
source_stage: goods_receipt
target_process: order_to_cash
target_stage: pick_list
through: inventory
# Returns create reverse flows
- source_process: order_to_cash
source_stage: delivery
target_process: returns
target_stage: return_receipt
condition: quality_issue
# Payments connect to bank reconciliation
- source_process: procure_to_pay
source_stage: payment
target_process: bank_reconciliation
target_stage: bank_statement_line
matching: automatic
# Intercompany bilateral links
- source_process: intercompany_sale
source_stage: ic_invoice
target_process: intercompany_purchase
target_stage: ic_invoice
matching: elimination_required
6. Network Effect Modeling
6.1 Behavioral Influence
network_effects:
enabled: true
influence_types:
# Transaction patterns spread through network
transaction_contagion:
enabled: true
effect: "similar vendors show similar payment patterns"
correlation: 0.40
lag_days: 30
# Risk propagation
risk_propagation:
enabled: true
effect: "vendor issues affect connected vendors"
propagation_depth: 2
decay_per_hop: 0.50
# Seasonal correlation
seasonal_sync:
enabled: true
effect: "connected entities show correlated seasonality"
correlation: 0.60
# Price correlation
price_linkage:
enabled: true
effect: "commodity price changes propagate"
propagation_speed: immediate
pass_through_rate: 0.80
6.2 Community Detection
community_detection:
enabled: true
algorithms:
- type: louvain
resolution: 1.0
output: vendor_communities
- type: label_propagation
output: customer_segments
- type: girvan_newman
output: department_clusters
use_cases:
# Fraud detection
fraud_rings:
algorithm: connected_components
edge_filter: suspicious_transactions
min_size: 3
# Vendor consolidation
vendor_overlap:
algorithm: jaccard_similarity
threshold: 0.70
output: consolidation_candidates
# Customer segmentation
behavioral_clusters:
algorithm: spectral
features: [purchase_pattern, payment_behavior, product_mix]
7. Relationship Lifecycle
7.1 Lifecycle Stages
relationship_lifecycle:
enabled: true
vendor_lifecycle:
stages:
onboarding:
duration_days: 30-90
activities: [due_diligence, contract_negotiation, system_setup]
transaction_volume: limited
ramp_up:
duration_days: 90-180
activities: [volume_increase, performance_monitoring]
transaction_volume: growing
steady_state:
duration_days: ongoing
activities: [regular_transactions, periodic_review]
transaction_volume: stable
decline:
triggers: [quality_issues, price_competitiveness, strategic_shift]
activities: [reduced_orders, alternative_sourcing]
transaction_volume: decreasing
termination:
triggers: [contract_end, performance_failure, strategic_decision]
activities: [final_settlement, transition]
transaction_volume: zero
transitions:
probability_matrix:
onboarding:
ramp_up: 0.80
termination: 0.20
ramp_up:
steady_state: 0.85
decline: 0.10
termination: 0.05
steady_state:
steady_state: 0.90
decline: 0.08
termination: 0.02
decline:
steady_state: 0.20
decline: 0.50
termination: 0.30
customer_lifecycle:
# Similar structure for customer relationships
stages:
prospect: { conversion_rate: 0.15 }
new: { retention_rate: 0.70 }
active: { retention_rate: 0.90 }
at_risk: { save_rate: 0.50 }
churned: { win_back_rate: 0.10 }
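The stage transitions amount to a Markov chain; a sketch that samples the next vendor stage from the probability matrix above, with the uniform draw supplied by the caller so the example stays RNG-free (function name and the fallback row are assumptions):
/// Illustrative sketch: sample the next vendor lifecycle stage.
fn next_stage(current: &str, u: f64) -> &'static str {
    // Rows mirror the probability_matrix above; unknown stages terminate.
    let row: &[(&'static str, f64)] = match current {
        "onboarding" => &[("ramp_up", 0.80), ("termination", 0.20)],
        "ramp_up" => &[("steady_state", 0.85), ("decline", 0.10), ("termination", 0.05)],
        "steady_state" => &[("steady_state", 0.90), ("decline", 0.08), ("termination", 0.02)],
        "decline" => &[("steady_state", 0.20), ("decline", 0.50), ("termination", 0.30)],
        _ => &[("termination", 1.0)],
    };
    let mut cumulative = 0.0;
    for &(stage, p) in row {
        cumulative += p;
        if u < cumulative {
            return stage;
        }
    }
    "termination" // guards against rounding at the tail
}
fn main() {
    assert_eq!(next_stage("onboarding", 0.50), "ramp_up");
    assert_eq!(next_stage("steady_state", 0.95), "decline");
    assert_eq!(next_stage("decline", 0.99), "termination");
}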
8. Graph Export Enhancements
8.1 Enhanced PyTorch Geometric Export
graph_export:
pytorch_geometric:
enabled: true
node_features:
# Node type encoding
type_encoding: one_hot
# Numeric features
numeric:
- field: transaction_volume
normalization: log_scale
- field: relationship_duration_days
normalization: min_max
- field: average_amount
normalization: z_score
# Categorical features
categorical:
- field: industry
encoding: label
- field: region
encoding: one_hot
- field: segment
encoding: embedding
edge_features:
- field: relationship_strength
normalization: none
- field: transaction_count
normalization: log_scale
- field: last_transaction_days_ago
normalization: min_max
# Temporal graphs
temporal:
enabled: true
snapshot_frequency: monthly
edge_weight_decay: exponential
half_life_days: 90
# Heterogeneous graph support
heterogeneous:
enabled: true
node_types: [company, vendor, customer, employee, account]
edge_types: [buys_from, sells_to, reports_to, pays_to]
8.2 Enhanced Neo4j Export
neo4j_export:
enabled: true
# Node labels
node_labels:
- label: Company
properties: [code, name, currency, country]
- label: Vendor
properties: [id, name, category, rating]
- label: Customer
properties: [id, name, segment, region]
- label: Transaction
properties: [id, amount, date, type]
# Relationship types
relationships:
- type: TRANSACTS_WITH
properties: [volume, count, first_date, last_date]
- type: BELONGS_TO
properties: [start_date, role]
- type: SUPPLIES
properties: [material_type, contract_id]
# Indexes for query optimization
indexes:
- label: Transaction
property: date
type: range
- label: Vendor
property: id
type: unique
- label: Customer
property: segment
type: lookup
# Full-text search
fulltext:
- name: entity_search
labels: [Vendor, Customer]
properties: [name, description]
9. Implementation Priority
| Enhancement | Complexity | Impact | Priority | Status |
|---|---|---|---|---|
| Vendor network depth | High | High | P1 | ✅ v0.3.0 |
| Customer segmentation | Medium | High | P1 | ✅ v0.3.0 |
| Organizational hierarchy | Medium | Medium | P2 | 🔄 Planned |
| Relationship strength modeling | Medium | High | P1 | ✅ v0.3.0 |
| Cross-process linkages | Medium | High | P1 | ✅ v0.3.0 |
| Network effect modeling | High | Medium | P2 | 🔄 Planned |
| Relationship lifecycle | Medium | Medium | P2 | ✅ v0.3.0 |
| Community detection | High | Medium | P3 | 🔄 Planned |
| Enhanced graph export | Low | High | P1 | 🔄 Partial |
10. Validation Framework
relationship_validation:
integrity_checks:
# All transactions have valid entity references
referential_integrity:
enabled: true
strict: true
# Document chains are complete
chain_completeness:
enabled: true
allow_partial: false
exception_rate: 0.02
# Intercompany entries balance
intercompany_balance:
enabled: true
tolerance: 0.01
network_metrics:
# Graph connectivity
connectivity:
check_strongly_connected: false
check_weakly_connected: true
max_isolated_nodes: 0.05
# Degree distribution
degree_distribution:
check_power_law: true
min_alpha: 1.5
max_alpha: 3.0
# Clustering coefficient
clustering:
min_coefficient: 0.1
max_coefficient: 0.5
See also: 05-pattern-drift.md for temporal evolution of patterns
Research: Pattern and Process Drift Over Time
Implementation Status: COMPLETE (v0.3.0)
This research document has been fully implemented. See the following modules:
- datasynth-core/src/models/organizational_event.rs - Organizational events
- datasynth-core/src/models/process_evolution.rs - Process evolution types
- datasynth-core/src/models/technology_transition.rs - Technology transitions
- datasynth-core/src/models/regulatory_events.rs - Regulatory changes
- datasynth-core/src/models/drift_events.rs - Ground truth labels
- datasynth-core/src/distributions/behavioral_drift.rs - Behavioral drift
- datasynth-core/src/distributions/market_drift.rs - Market/economic drift
- datasynth-core/src/distributions/event_timeline.rs - Event orchestration
- datasynth-core/src/distributions/drift_recorder.rs - Ground truth recording
- datasynth-eval/src/statistical/drift_detection.rs - Drift detection evaluation
- datasynth-config/src/schema.rs - Configuration types
Current State Analysis
Existing Drift Implementation
The current DriftController (373 lines) supports:
| Drift Type | Implementation | Realism |
|---|---|---|
| Gradual | Linear parameter drift | Medium |
| Sudden | Point-in-time shifts | Medium |
| Recurring | Seasonal patterns | Good |
| Mixed | Combination modes | Medium |
Current Capabilities
- Amount drift: Mean and variance adjustments over time
- Anomaly rate drift: Changing fraud/error rates
- Concept drift factor: Generic drift indicator
- Seasonal adjustment: Periodic recurring patterns
- Sudden drift probability: Random regime changes
Current Gaps
- No organizational events: Mergers, restructuring not modeled
- No process evolution: Static business processes
- No regulatory changes: Compliance requirements don’t evolve
- No technology transitions: System changes not simulated
- No behavioral drift: Entity behaviors remain static
- No market-driven drift: External factors not modeled
- Limited drift detection signals: Hard to validate drift presence
Improvement Recommendations
1. Organizational Event Modeling
1.1 Corporate Event Timeline
organizational_events:
enabled: true
events:
# Mergers and Acquisitions
- type: acquisition
date: "2024-06-15"
acquired_entity: "TargetCorp"
effects:
- entity_count_increase: 1.35
- vendor_count_increase: 1.25
- customer_overlap: 0.15
- integration_period_months: 12
- synergy_realization:
start_month: 6
full_realization_month: 18
cost_reduction: 0.08
# Divestiture
- type: divestiture
date: "2024-09-01"
divested_entity: "NonCoreBusiness"
effects:
- revenue_reduction: 0.12
- entity_count_reduction: 0.10
- vendor_transition_period: 6
# Reorganization
- type: reorganization
date: "2024-04-01"
reorg_type: functional_to_regional
effects:
- cost_center_restructure: true
- approval_chain_changes: true
- reporting_line_changes: true
- transition_period_months: 3
- temporary_confusion_factor: 1.15
# Leadership Change
- type: leadership_change
date: "2024-07-01"
position: CFO
effects:
- policy_changes_probability: 0.40
- approval_threshold_review: true
- vendor_review_trigger: true
- audit_focus_shift: possible
# Layoffs
- type: workforce_reduction
date: "2024-11-01"
reduction_percent: 0.10
effects:
- employee_count_reduction: 0.10
- workload_redistribution: true
- approval_delays: 1.20
- error_rate_increase: 1.15
- duration_months: 6
1.2 Integration Pattern Modeling
pub struct IntegrationSimulator {
phases: Vec<IntegrationPhase>,
current_phase: usize,
}
pub struct IntegrationPhase {
name: String,
start_month: u32,
end_month: u32,
effects: IntegrationEffects,
}
pub struct IntegrationEffects {
// Duplicate transactions during transition
duplicate_probability: f64,
// Coding errors during chart migration
miscoding_rate: f64,
// Legacy system parallel run
parallel_posting: bool,
// Vendor/customer migration errors
master_data_errors: f64,
// Timing differences
posting_delay_multiplier: f64,
}
1.3 Merger Accounting Patterns
merger_accounting:
enabled: true
day_1_entries:
- type: fair_value_adjustment
accounts: [inventory, fixed_assets, intangibles]
adjustment_range: [-0.20, 0.30]
- type: goodwill_recognition
calculation: "purchase_price - fair_value_net_assets"
- type: liability_assumption
includes: [accounts_payable, debt, contingencies]
post_merger:
# Integration costs
integration_expenses:
monthly_range: [100000, 500000]
duration_months: 12-18
categories: [consulting, severance, systems, legal]
# Synergy realization
synergies:
start_month: 6
ramp_up_months: 12
categories:
- type: headcount_reduction
target: 0.05
- type: vendor_consolidation
target: 0.10
- type: facility_optimization
target: 0.03
# Restructuring reserves
restructuring:
initial_reserve: 5000000
utilization_pattern: front_loaded
true_up_probability: 0.30
2. Process Evolution Modeling
2.1 Business Process Changes
process_evolution:
enabled: true
changes:
# New approval workflow
- type: approval_workflow_change
date: "2024-03-01"
from: sequential
to: parallel
effects:
- approval_time_reduction: 0.40
- same_day_approval_increase: 0.25
- skip_approval_detection: improved
# Automation introduction
- type: process_automation
date: "2024-05-01"
process: invoice_matching
effects:
- manual_matching_reduction: 0.70
- matching_accuracy_improvement: 0.15
- exception_visibility_increase: true
- posting_timing: more_consistent
# Policy change
- type: policy_change
date: "2024-08-01"
policy: expense_approval_limits
changes:
- manager_limit: 5000 -> 7500
- director_limit: 25000 -> 35000
effects:
- approval_escalation_reduction: 0.20
- processing_time_reduction: 0.15
# Control enhancement
- type: control_enhancement
date: "2024-10-01"
control: three_way_match
changes:
- tolerance: 0.05 -> 0.02
- mandatory_for: all_po_invoices
effects:
- exception_rate_increase: 0.15
- fraud_detection_improvement: 0.25
2.2 Technology Transition Patterns
technology_transitions:
enabled: true
transitions:
# ERP migration
- type: erp_migration
phases:
- name: parallel_run
start: "2024-06-01"
duration_months: 3
effects:
- duplicate_entries: true
- reconciliation_required: true
- posting_delays: 1.30
- name: cutover
date: "2024-09-01"
effects:
- legacy_system: read_only
- new_system: live
- catch_up_period: 5_days
- name: stabilization
start: "2024-09-01"
duration_months: 3
effects:
- error_rate_multiplier: 1.25
- support_ticket_increase: 3.0
- workaround_transactions: 0.10
# Module implementation
- type: module_implementation
module: advanced_analytics
go_live: "2024-04-15"
effects:
- new_transaction_types: [analytical_adjustment]
- automated_entries_increase: 0.20
# Integration change
- type: integration_upgrade
system: bank_interface
date: "2024-07-01"
effects:
- real_time_enabled: true
- batch_processing: deprecated
- posting_frequency: continuous
3. Regulatory and Compliance Drift
3.1 Regulatory Changes
regulatory_changes:
enabled: true
changes:
# New accounting standard
- type: accounting_standard_adoption
standard: ASC_842 # Leases
effective_date: "2024-01-01"
effects:
- new_account_codes: [rou_asset, lease_liability]
- reclassification_entries: true
- disclosure_changes: true
- audit_focus: high
# Tax law change
- type: tax_law_change
date: "2024-07-01"
jurisdiction: federal
change: corporate_tax_rate
from: 0.21
to: 0.25
effects:
- deferred_tax_revaluation: true
- provision_adjustment: true
- quarterly_estimate_revision: true
# Compliance requirement
- type: new_compliance_requirement
regulation: SOX_AI_controls
effective_date: "2024-10-01"
requirements:
- ai_model_documentation: required
- automated_control_testing: required
- data_lineage_tracking: required
effects:
- new_control_activities: 15
- testing_frequency: quarterly
- documentation_overhead: 0.10
# Industry regulation
- type: industry_regulation
industry: financial_services
regulation: enhanced_kyc
date: "2024-06-01"
effects:
- customer_onboarding_time: 1.50
- documentation_requirements: increased
- rejection_rate_increase: 0.08
3.2 Audit Focus Shifts
audit_focus_evolution:
enabled: true
shifts:
# Risk-based changes
- trigger: fraud_detection
date: "2024-03-15"
new_focus_areas:
- vendor_payments: high
- manual_journal_entries: high
- related_party_transactions: medium
effects:
- sampling_rate_increase: 0.30
- documentation_requests: increased
# Industry trend response
- trigger: industry_trend
date: "2024-06-01"
trend: cybersecurity_risks
new_focus_areas:
- it_general_controls: high
- access_management: high
- change_management: medium
effects:
- itgc_testing_expansion: true
- soc2_requirements: enhanced
# Prior year findings
- trigger: prior_year_finding
finding: revenue_recognition_timing
date: "2024-01-01"
effects:
- cutoff_testing: enhanced
- sample_sizes: increased
- management_inquiry: extensive
4. Behavioral Drift
4.1 Entity Behavior Evolution
behavioral_drift:
enabled: true
vendor_behavior:
# Payment term negotiation
payment_terms_drift:
direction: extending
rate_per_year: 2.5 # Days per year
variance_increase: true
trigger: economic_conditions
# Quality drift
quality_drift:
new_vendors:
initial_period_months: 6
quality_improvement_rate: 0.02
established_vendors:
complacency_risk: 0.05
quality_decline_rate: 0.01
# Price drift
pricing_behavior:
inflation_pass_through: 0.80
contract_renegotiation_frequency: annual
opportunistic_increase_probability: 0.10
customer_behavior:
# Payment behavior evolution
payment_drift:
economic_downturn:
days_extension: 5-15
bad_debt_rate_increase: 0.02
economic_upturn:
days_reduction: 2-5
early_payment_discount_uptake: 0.15
# Order pattern drift
order_drift:
digital_shift:
online_order_increase_per_year: 0.05
average_order_value_decrease: 0.03
order_frequency_increase: 0.10
employee_behavior:
# Approval pattern drift
approval_drift:
end_of_month_rush:
intensity_increase_per_year: 0.05
rubber_stamping_risk:
increase_with_volume: true
threshold: 50 # Approvals per day
# Error pattern drift
error_drift:
new_employee:
error_rate: 0.08
learning_curve_months: 6
target_error_rate: 0.02
experienced_employee:
fatigue_increase: 0.01_per_year
4.2 Collective Behavior Patterns
collective_drift:
enabled: true
patterns:
# Year-end behavior
year_end_intensity:
drift: increasing
rate_per_year: 0.05
explanation: "tighter close deadlines, more scrutiny"
# Automation adoption
automation_adoption:
s_curve_adoption: true
early_adopters: 0.15
mainstream: 0.60
laggards: 0.25
effects_by_phase:
early:
manual_reduction: 0.10
error_types_shift: true
mainstream:
manual_reduction: 0.50
new_error_types: automation_failures
late:
manual_reduction: 0.80
exception_handling_focus: true
# Remote work impact
remote_work_patterns:
transition_date: "2024-01-01"
remote_percentage: 0.60
effects:
- posting_time_distribution: flattened
- batch_processing_increase: true
- approval_response_time: longer
- documentation_quality: variable
5. Market-Driven Drift
5.1 Economic Cycle Effects
economic_cycles:
enabled: true
cycles:
# Business cycle
business_cycle:
type: sinusoidal
period_months: 48
amplitude: 0.15
effects:
expansion:
revenue_growth: positive
hiring: active
capital_investment: high
credit_terms: generous
contraction:
revenue_growth: negative
layoffs: possible
capital_investment: low
credit_terms: tight
# Industry cycle
industry_specific:
technology:
period_months: 36
amplitude: 0.25
manufacturing:
period_months: 60
amplitude: 0.20
retail:
period_months: 12 # Annual
amplitude: 0.35
# Recession simulation
recession:
enabled: false # Trigger explicitly
onset: gradual # or sudden
duration_months: 12-24
severity: moderate # mild, moderate, severe
effects:
revenue_decline: 0.15-0.30
ar_aging_increase: 15_days
bad_debt_increase: 0.03
vendor_consolidation: 0.10
workforce_reduction: 0.08
capex_freeze: true
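A sketch of the sinusoidal business-cycle multiplier with the configured 48-month period and 0.15 amplitude (the phase handling and function name are assumptions):
use std::f64::consts::TAU;
/// Illustrative sketch: sinusoidal business-cycle multiplier.
/// `month` counts from the start of the simulation.
fn business_cycle_multiplier(month: u32, period_months: f64, amplitude: f64) -> f64 {
    1.0 + amplitude * (TAU * month as f64 / period_months).sin()
}
fn main() {
    for m in [0u32, 12, 24, 36, 48] {
        println!("month {:>2}: x{:.3}", m, business_cycle_multiplier(m, 48.0, 0.15));
    }
}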
5.2 Commodity and Input Cost Drift
input_cost_drift:
enabled: true
commodities:
- name: steel
base_price: 800 # per ton
volatility: 0.20
correlation_with_economy: 0.60
pass_through_to_cogs: 0.15
- name: energy
base_price: 75 # per barrel equivalent
volatility: 0.35
seasonal_pattern: true
pass_through_to_overhead: 0.08
- name: labor
base_cost: 35 # per hour
annual_increase: 0.03
regional_variation: true
pass_through_to_all: true
price_shock_events:
- type: supply_disruption
probability_per_year: 0.10
duration_months: 3-9
price_increase: 0.30-1.00
affected_commodities: [specific]
- type: demand_surge
probability_per_year: 0.15
duration_months: 2-6
price_increase: 0.15-0.40
affected_commodities: [broad]
6. Concept Drift Detection Signals
6.1 Drift Indicators in Generated Data
drift_signals:
enabled: true
embedded_signals:
# Statistical shift markers
statistical:
- type: mean_shift
field: transaction_amount
visibility: detectable_by_cusum
magnitude: configurable
- type: variance_change
field: processing_time
visibility: detectable_by_levene
direction: both
- type: distribution_change
field: payment_terms
visibility: detectable_by_ks_test
gradual: true
# Categorical drift markers
categorical:
- type: category_proportion_shift
field: transaction_type
new_category_emergence: true
old_category_decline: true
- type: label_drift
field: account_code
new_codes: added_over_time
deprecated_codes: declining_usage
# Temporal drift markers
temporal:
- type: seasonality_change
field: transaction_count
pattern_evolution: true
detectability: acf_analysis
- type: trend_change
field: revenue
change_points: marked
detectability: pelt_algorithm
# Ground truth labels for drift
drift_labels:
enabled: true
output_file: drift_events.csv
columns:
- event_type
- start_date
- end_date
- affected_fields
- magnitude
- detection_difficulty
6.2 Drift Validation Metrics
drift_validation:
metrics:
# Drift presence verification
drift_detection:
methods:
- adwin # Adaptive Windowing
- ddm # Drift Detection Method
- eddm # Early Drift Detection Method
- ph # Page-Hinkley Test
threshold_calibration: true
# Drift magnitude
magnitude_metrics:
- hellinger_distance
- kl_divergence
- wasserstein_distance
- psi # Population Stability Index
# Drift timing accuracy
timing_metrics:
- detection_delay_days
- false_positive_rate
- detection_precision
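As an example of the magnitude metrics, the Population Stability Index compares baseline and current bucket proportions. A minimal sketch follows; the bucket values are illustrative, and the usual rule of thumb reads PSI above roughly 0.1 as moderate and above 0.25 as significant shift:
/// Illustrative sketch: Population Stability Index between two bucket
/// distributions. A small epsilon guards empty buckets.
fn psi(expected: &[f64], actual: &[f64]) -> f64 {
    const EPS: f64 = 1e-6;
    expected
        .iter()
        .zip(actual)
        .map(|(e, a)| {
            let e = e.max(EPS);
            let a = a.max(EPS);
            (a - e) * (a / e).ln()
        })
        .sum()
}
fn main() {
    let baseline = [0.10, 0.20, 0.40, 0.20, 0.10];
    let drifted = [0.05, 0.15, 0.35, 0.25, 0.20];
    println!("PSI = {:.3}", psi(&baseline, &drifted)); // ~0.14: moderate drift
}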
7. Implementation Framework
7.1 Drift Controller Enhancement
pub struct EnhancedDriftController {
// Existing drift
parameter_drift: ParameterDrift,
// New: Organizational events
event_timeline: EventTimeline,
// New: Process changes
process_evolution: ProcessEvolution,
// New: Regulatory changes
regulatory_calendar: RegulatoryCalendar,
// New: Behavioral models
behavioral_drift: BehavioralDriftModel,
// New: Market factors
market_model: MarketModel,
// Drift detection ground truth
drift_labels: DriftLabelRecorder,
}
impl EnhancedDriftController {
/// Get all active effects for a given date
pub fn get_effects(&self, date: NaiveDate) -> DriftEffects {
let mut effects = DriftEffects::default();
// Apply organizational events
effects.merge(self.event_timeline.effects_at(date));
// Apply process evolution
effects.merge(self.process_evolution.effects_at(date));
// Apply regulatory changes
effects.merge(self.regulatory_calendar.effects_at(date));
// Apply behavioral drift
effects.merge(self.behavioral_drift.effects_at(date));
// Apply market conditions
effects.merge(self.market_model.effects_at(date));
// Record for ground truth
self.drift_labels.record(date, &effects);
effects
}
}
7.2 Configuration Integration
# Master drift configuration
drift:
enabled: true
# Parameter drift (existing)
parameters:
amount_mean_drift: 0.02
amount_variance_drift: 0.01
# Organizational events (new)
organizational:
events_file: "organizational_events.yaml"
random_events:
reorganization_probability: 0.10
leadership_change_probability: 0.15
# Process evolution (new)
process:
automation_curve: s_curve
policy_review_frequency: quarterly
# Regulatory changes (new)
regulatory:
calendar_file: "regulatory_calendar.yaml"
jurisdictions: [us, eu]
# Behavioral drift (new)
behavioral:
vendor_learning: true
customer_churn: true
employee_turnover: 0.15
# Market factors (new)
market:
economic_cycle: true
commodity_volatility: true
inflation_rate: 0.03
# Drift labeling (new)
labels:
enabled: true
output_format: csv
include_magnitude: true
8. Implementation Priority
| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| Organizational events | Medium | High | P1 |
| Process evolution | Medium | High | P1 |
| Regulatory changes | Low | Medium | P2 |
| Behavioral drift | High | High | P1 |
| Market-driven drift | Medium | Medium | P2 |
| Drift detection signals | Low | High | P1 |
| Technology transitions | High | Medium | P3 |
| Collective behavior | Medium | Medium | P2 |
9. Use Cases
- ML Model Robustness Testing: Train models on stable data, test on drifted data
- Drift Detection Benchmarking: Evaluate drift detection algorithms on known drift
- Change Management Simulation: Test system responses to organizational changes
- Regulatory Impact Analysis: Model effects of compliance requirement changes
- Economic Scenario Planning: Generate data under different economic conditions
See also: 06-anomaly-patterns.md for anomaly injection patterns
Research: Anomaly Pattern Enhancements
Current State Analysis
Existing Anomaly Categories
| Category | Types | Implementation |
|---|---|---|
| Fraud | Fictitious, Revenue Manipulation, Split, Round-trip, Ghost Employee, Duplicate Payment | Good |
| Error | Duplicate Entry, Reversed Amount, Wrong Period, Wrong Account, Missing Reference | Good |
| Process | Late Posting, Skipped Approval, Threshold Manipulation | Medium |
| Statistical | Unusual Amount, Trend Break, Benford Violation | Medium |
| Relational | Circular Transaction, Dormant Account | Basic |
Current Strengths
- Labeled output: anomaly_labels.csv with ground truth
- Configurable injection rate: Per-anomaly-type rates
- Quality issue labeling: Separate from fraud labels
- Multiple anomaly types: 20+ distinct patterns
- COSO control mapping: Anomalies linked to control failures
Current Gaps
- Binary labeling only: No severity or confidence scores
- Independent injection: Anomalies don’t correlate with each other
- No multi-stage anomalies: Complex schemes not modeled
- Static patterns: Same anomaly signature throughout
- No near-miss generation: Only clear anomalies or clean data
- Limited context awareness: Anomalies don’t adapt to entity behavior
- No detection difficulty labeling: All anomalies treated equally
Improvement Recommendations
1. Multi-Dimensional Anomaly Labeling
1.1 Enhanced Label Schema
anomaly_labeling:
schema:
# Primary classification
anomaly_id: uuid
transaction_ids: [uuid]
anomaly_type: string
anomaly_category: [fraud, error, process, statistical, relational]
# Severity scoring
severity:
level: [low, medium, high, critical]
score: 0.0-1.0
financial_impact: decimal
materiality_threshold: exceeded | below
# Detection characteristics
detection:
difficulty: [trivial, easy, moderate, hard, expert]
recommended_methods: [rule_based, statistical, ml, graph, hybrid]
expected_false_positive_rate: 0.0-1.0
key_indicators: [string]
# Confidence and certainty
confidence:
ground_truth_certainty: [definite, probable, possible]
label_source: [injected, inferred, manual]
# Temporal characteristics
temporal:
first_occurrence: date
last_occurrence: date
frequency: [one_time, recurring, continuous]
detection_window: days
# Relationship context
context:
related_anomalies: [uuid]
affected_entities: [entity_id]
control_failures: [control_id]
root_cause: string
1.2 Materiality-Based Severity
severity_calculation:
materiality_thresholds:
trivial: 0.001 # 0.1% of relevant base
immaterial: 0.01 # 1%
material: 0.05 # 5%
highly_material: 0.10 # 10%
bases_by_type:
revenue: total_revenue
expense: total_expenses
asset: total_assets
liability: total_liabilities
severity_factors:
financial_impact:
weight: 0.40
calculation: amount / materiality_threshold
detection_difficulty:
weight: 0.25
mapping:
trivial: 0.1
easy: 0.3
moderate: 0.5
hard: 0.7
expert: 0.9
persistence:
weight: 0.20
calculation: duration_days / 365
entity_involvement:
weight: 0.15
calculation: log(affected_entity_count)
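A sketch of the weighted composite severity score; the per-factor clamping to [0, 1] and the log10 base for entity involvement are assumptions made so the result stays inside the 0.0-1.0 range defined in the schema:
/// Illustrative sketch of the composite severity score above.
fn severity_score(
    amount: f64,
    materiality_threshold: f64,
    difficulty: &str,
    duration_days: f64,
    affected_entities: u32,
) -> f64 {
    let impact = (amount / materiality_threshold).clamp(0.0, 1.0);
    let difficulty_score = match difficulty {
        "trivial" => 0.1,
        "easy" => 0.3,
        "moderate" => 0.5,
        "hard" => 0.7,
        _ => 0.9, // expert
    };
    let persistence = (duration_days / 365.0).clamp(0.0, 1.0);
    // log10 keeps large entity counts from dominating; 10+ entities saturate.
    let involvement = (affected_entities.max(1) as f64).log10().clamp(0.0, 1.0);
    0.40 * impact + 0.25 * difficulty_score + 0.20 * persistence + 0.15 * involvement
}
fn main() {
    let score = severity_score(250_000.0, 500_000.0, "hard", 180.0, 4);
    println!("severity score: {:.2}", score);
}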
2. Correlated Anomaly Injection
2.1 Anomaly Co-occurrence Patterns
anomaly_correlations:
enabled: true
patterns:
# Fraud often accompanied by concealment
fraud_concealment:
primary: fictitious_vendor
correlated:
- type: document_manipulation
probability: 0.80
lag_days: 0-30
- type: approval_bypass
probability: 0.60
lag_days: 0
- type: audit_trail_gaps
probability: 0.40
lag_days: 0-90
# Error cascades
error_cascade:
primary: wrong_account_coding
correlated:
- type: reconciliation_difference
probability: 0.90
lag_days: 30-60
- type: balance_discrepancy
probability: 0.70
lag_days: 30
- type: correcting_entry
probability: 0.85
lag_days: 1-45
# Process failures cluster
process_breakdown:
primary: skipped_approval
correlated:
- type: threshold_splitting
probability: 0.50
lag_days: [-30, 30] # may precede or follow the primary anomaly
- type: late_posting
probability: 0.40
lag_days: 0-15
- type: documentation_missing
probability: 0.60
lag_days: 0
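A sketch of co-injection for a primary anomaly, using the fraud_concealment and error_cascade probabilities above; the function, the caller-supplied uniform draws, and the omission of lag handling are all simplifications:
/// Illustrative sketch: decide which correlated anomalies to co-inject.
/// `draws` are uniform values in [0, 1), one per candidate.
fn correlated_anomalies(primary: &str, draws: &[f64]) -> Vec<&'static str> {
    let pattern: &[(&'static str, f64)] = match primary {
        "fictitious_vendor" => &[
            ("document_manipulation", 0.80),
            ("approval_bypass", 0.60),
            ("audit_trail_gaps", 0.40),
        ],
        "wrong_account_coding" => &[
            ("reconciliation_difference", 0.90),
            ("balance_discrepancy", 0.70),
            ("correcting_entry", 0.85),
        ],
        _ => &[],
    };
    pattern
        .iter()
        .zip(draws)
        .filter(|((_, p), u)| **u < *p)
        .map(|((t, _), _)| *t)
        .collect()
}
fn main() {
    let picked = correlated_anomalies("fictitious_vendor", &[0.25, 0.70, 0.95]);
    println!("co-injected: {:?}", picked); // only document_manipulation here
}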
2.2 Temporal Clustering
temporal_clustering:
enabled: true
clusters:
# Period-end error spikes
period_end_errors:
window: last_5_business_days
error_rate_multiplier: 2.5
types: [wrong_period, duplicate_entry, late_posting]
# Post-holiday cleanup
post_holiday:
window: first_3_business_days_after_holiday
types: [duplicate_entry, missing_reference]
multiplier: 1.8
# Quarter-end pressure
quarter_end:
window: last_week_of_quarter
fraud_types: [revenue_manipulation, expense_deferral]
multiplier: 1.5
# Year-end audit prep
year_end_audit:
window: december
correction_types: [reclassification, prior_period_adjustment]
multiplier: 3.0
3. Multi-Stage Anomaly Patterns
3.1 Complex Scheme Modeling
multi_stage_anomalies:
enabled: true
schemes:
# Gradual embezzlement
gradual_embezzlement:
stages:
- stage: 1
name: testing
duration_months: 2
transactions: 3-5
amount_range: [100, 500]
detection_difficulty: hard
- stage: 2
name: escalation
duration_months: 6
transactions: 10-20
amount_range: [500, 2000]
detection_difficulty: moderate
- stage: 3
name: acceleration
duration_months: 3
transactions: 20-50
amount_range: [2000, 10000]
detection_difficulty: easy
- stage: 4
name: desperation
duration_months: 1
transactions: 5-10
amount_range: [10000, 50000]
detection_difficulty: trivial
total_scheme_probability: 0.02
# Revenue manipulation over time
revenue_scheme:
stages:
- stage: 1
name: acceleration
quarter: Q4
action: early_revenue_recognition
amount_percent: 0.02
- stage: 2
name: deferral
quarter: Q1_next
action: expense_deferral
amount_percent: 0.03
- stage: 3
name: reserve_manipulation
quarter: Q2
action: reserve_release
amount_percent: 0.02
- stage: 4
name: channel_stuffing
quarter: Q4
action: forced_sales
amount_percent: 0.05
cycle_probability: 0.01
# Vendor kickback scheme
kickback_scheme:
stages:
- stage: 1
name: vendor_setup
actions: [create_vendor, build_relationship]
duration_months: 3
- stage: 2
name: price_inflation
actions: [inflated_invoices]
inflation_percent: 0.10-0.25
duration_months: 12
- stage: 3
name: kickback_payments
actions: [off_book_payments]
kickback_percent: 0.50
frequency: quarterly
- stage: 4
name: concealment
actions: [document_destruction, false_approvals]
ongoing: true
3.2 Scheme Evolution
pub struct MultiStageAnomaly {
scheme_id: Uuid,
scheme_type: SchemeType,
current_stage: u32,
start_date: NaiveDate,
perpetrators: Vec<EntityId>,
transactions: Vec<TransactionId>,
total_impact: Decimal,
detection_status: DetectionStatus,
}
impl MultiStageAnomaly {
/// Advance scheme to next stage
pub fn advance(&mut self, date: NaiveDate) -> Vec<AnomalyAction> {
// Check if conditions met for stage advancement
// Return actions for current stage
}
/// Check if scheme should be detected based on accumulated evidence
pub fn detection_probability(&self) -> f64 {
// Increases with:
// - Number of transactions
// - Total amount
// - Duration
// - Carelessness factor
}
}
4. Near-Miss and Edge Case Generation
4.1 Near-Anomaly Patterns
near_miss_generation:
enabled: true
proportion_of_anomalies: 0.30 # 30% of "anomalies" are near-misses
patterns:
# Almost duplicate (timing difference)
near_duplicate:
description: "Similar transaction, different timing"
difference:
amount: exact_match
date: 1-3_days_apart
vendor: same
label: not_anomaly
detection_challenge: high
# Threshold proximity
threshold_proximity:
description: "Transaction just below approval threshold"
distance_from_threshold: [0.90, 0.99]
label: not_anomaly
suspicion_score: high
# Unusual but explainable
unusual_legitimate:
description: "Unusual pattern with valid business reason"
types:
- year_end_bonus
- contract_prepayment
- settlement_payment
- insurance_claim
label: not_anomaly
false_positive_trigger: high
# Corrected error
corrected_error:
description: "Error that was caught and fixed"
original_error: any
correction_lag_days: 1-5
net_impact: zero
label: error_corrected
visibility: both_entries_visible
4.2 Boundary Condition Testing
boundary_conditions:
enabled: true
conditions:
# Exact threshold matches
exact_thresholds:
types: [approval_limit, materiality, tolerance]
probability: 0.01
label: boundary_case
# Round number preference (non-fraudulent)
legitimate_round_numbers:
amounts: [1000, 5000, 10000, 25000]
probability: 0.05
label: not_anomaly
context: budget_allocations
# Last-minute but legitimate
period_boundary:
timing: last_hour_before_close
legitimate_probability: 0.80
label: timing_anomaly_only
# Zero and negative amounts
edge_amounts:
zero_amount_probability: 0.001
negative_amount_probability: 0.002
labels: data_quality_issue
5. Context-Aware Anomaly Injection
5.1 Entity-Specific Patterns
entity_aware_anomalies:
enabled: true
vendor_specific:
# New vendors have higher error rates
new_vendor_errors:
definition: vendor_age < 90_days
error_rate_multiplier: 2.5
common_errors: [wrong_account, missing_po]
# Large vendors have more complex issues
strategic_vendor_issues:
definition: vendor_spend > percentile_90
anomaly_types: [contract_deviation, price_variance]
rate_multiplier: 1.5
# International vendors
international_vendor_issues:
definition: vendor_country != company_country
anomaly_types: [fx_errors, withholding_tax_errors]
rate_multiplier: 2.0
employee_specific:
# New employee learning curve
new_employee_errors:
definition: employee_tenure < 180_days
error_rate: 0.05
error_types: [coding_error, approval_violation]
decay: exponential
# High-volume processors
volume_fatigue:
definition: daily_transactions > 50
error_rate_increase: 0.02
peak_time: end_of_day
# Vacation coverage
coverage_errors:
trigger: primary_approver_absent
error_rate_multiplier: 1.8
types: [delayed_approval, wrong_approver]
account_specific:
# High-risk accounts
high_risk_accounts:
accounts: [cash, revenue, inventory]
monitoring_level: enhanced
anomaly_injection_rate_multiplier: 1.5
# Infrequently used accounts
dormant_account_activity:
definition: no_activity_90_days
any_activity_suspicious: true
label: statistical_anomaly
5.2 Behavioral Baseline Deviation
behavioral_deviation:
enabled: true
baselines:
# Establish per-entity behavioral baseline
baseline_period: 90_days
metrics:
- average_transaction_amount
- transaction_frequency
- typical_posting_time
- common_counterparties
- usual_account_codes
deviations:
# Amount deviation
amount_anomaly:
threshold: 3_standard_deviations
label: statistical_anomaly
severity: based_on_deviation
# Frequency deviation
frequency_anomaly:
threshold: 2_standard_deviations
types: [sudden_increase, sudden_decrease, irregular_pattern]
# Counterparty deviation
new_counterparty:
first_time_transaction: true
risk_score: elevated
label: relationship_anomaly
# Timing deviation
timing_anomaly:
threshold: outside_usual_hours
consideration: legitimate_reasons_exist
label: timing_anomaly
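A sketch of the 3-standard-deviation amount check against a precomputed entity baseline (the Baseline struct and the guard for a degenerate standard deviation are assumptions):
/// Illustrative sketch: flag amounts deviating from an entity's 90-day baseline.
struct Baseline {
    mean_amount: f64,
    std_amount: f64,
}
fn is_amount_anomaly(baseline: &Baseline, amount: f64, threshold_sigmas: f64) -> bool {
    if baseline.std_amount <= f64::EPSILON {
        return false; // degenerate baseline: cannot score
    }
    let z = (amount - baseline.mean_amount).abs() / baseline.std_amount;
    z > threshold_sigmas
}
fn main() {
    let b = Baseline { mean_amount: 2_500.0, std_amount: 600.0 };
    assert!(!is_amount_anomaly(&b, 3_900.0, 3.0)); // z ~ 2.3
    assert!(is_amount_anomaly(&b, 6_000.0, 3.0));  // z ~ 5.8
}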
6. Detection Difficulty Classification
6.1 Difficulty Taxonomy
detection_difficulty:
levels:
trivial:
description: "Obvious on cursory review"
examples:
- duplicate_same_day
- obviously_wrong_amount
- missing_required_field
expected_detection_rate: 0.99
detection_methods: [basic_rules]
easy:
description: "Detectable with standard controls"
examples:
- threshold_violations
- approval_gaps
- segregation_of_duties
expected_detection_rate: 0.90
detection_methods: [automated_rules, basic_analytics]
moderate:
description: "Requires analytical procedures"
examples:
- trend_deviations
- ratio_anomalies
- benford_violations
expected_detection_rate: 0.70
detection_methods: [statistical_analysis, ratio_analysis]
hard:
description: "Requires advanced techniques or domain expertise"
examples:
- complex_fraud_schemes
- collusion_patterns
- sophisticated_manipulation
expected_detection_rate: 0.40
detection_methods: [ml_models, graph_analysis, forensic_audit]
expert:
description: "Only detectable by specialized investigation"
examples:
- long_running_schemes
- management_override
- deep_concealment
expected_detection_rate: 0.15
detection_methods: [tip_or_complaint, forensic_investigation, external_audit]
6.2 Difficulty Factors
pub struct DifficultyCalculator {
factors: Vec<DifficultyFactor>,
}
pub enum DifficultyFactor {
// Concealment techniques
Concealment {
document_manipulation: bool,
approval_circumvention: bool,
timing_exploitation: bool,
splitting: bool,
},
// Blending with normal activity
Blending {
amount_within_normal_range: bool,
timing_within_normal_hours: bool,
counterparty_is_established: bool,
account_coding_correct: bool,
},
// Collusion
Collusion {
number_of_participants: u32,
includes_management: bool,
external_parties: bool,
},
// Duration and frequency
Temporal {
duration_months: u32,
transaction_frequency: Frequency,
gradual_escalation: bool,
},
// Amount characteristics
Amount {
total_amount: Decimal,
individual_amounts_small: bool,
round_numbers_avoided: bool,
},
}
7. Anomaly Generation Strategies
7.1 Strategy Configuration
anomaly_strategies:
# Random injection (current approach)
random:
enabled: true
weight: 0.40
parameters:
base_rate: 0.02
per_type_rates: {...}
# Scenario-based injection
scenario_based:
enabled: true
weight: 0.30
scenarios:
- name: "new_employee_fraud"
trigger: employee_tenure < 365
probability: 0.005
scheme: embezzlement
- name: "vendor_collusion"
trigger: vendor_concentration > 0.15
probability: 0.01
scheme: kickback
- name: "year_end_pressure"
trigger: month == 12
probability: 0.03
types: [revenue_manipulation, reserve_adjustment]
# Adversarial injection
adversarial:
enabled: true
weight: 0.20
strategy: evade_known_detectors
detectors_to_evade:
- benford_analysis
- duplicate_detection
- threshold_monitoring
techniques:
- amount_variation
- timing_spreading
- entity_rotation
# Benchmark-based injection
benchmark:
enabled: true
weight: 0.10
source: acfe_report_to_the_nations
calibration:
median_loss: 117000
duration_months: 12
detection_method_distribution: {...}
7.2 Adaptive Anomaly Injection
pub struct AdaptiveAnomalyInjector {
// Tracks what's been injected
injection_history: Vec<InjectedAnomaly>,
// Ensures variety
type_distribution: TypeDistribution,
// Ensures difficulty spread
difficulty_distribution: DifficultyDistribution,
// Ensures temporal spread
temporal_distribution: TemporalDistribution,
}
impl AdaptiveAnomalyInjector {
/// Inject anomaly with awareness of what's already been injected
pub fn inject(&mut self, context: &GenerationContext) -> Option<Anomaly> {
// Check if injection appropriate at this point
if !self.should_inject(context) {
return None;
}
// Select type based on current distribution gaps
let anomaly_type = self.select_type_for_balance();
// Select difficulty based on current distribution gaps
let difficulty = self.select_difficulty_for_balance();
// Generate anomaly
let anomaly = self.generate_anomaly(anomaly_type, difficulty, context);
// Record injection
self.record_injection(&anomaly);
Some(anomaly)
}
}
8. Output Enhancements
8.1 Enhanced Label File
output:
anomaly_labels:
format: parquet # or csv
columns:
# Identifiers
- anomaly_id
- transaction_ids # Array
- scheme_id # For multi-stage
# Classification
- anomaly_type
- category
- subcategory
# Severity
- severity_level
- severity_score
- financial_impact
- is_material
# Detection
- difficulty_level
- difficulty_score
- recommended_detection_methods # Array
- key_indicators # Array
# Temporal
- first_date
- last_date
- duration_days
- stage # For multi-stage
# Context
- affected_entities # Array
- control_failures # Array
- related_anomalies # Array
# Metadata
- injection_strategy
- generation_seed
- ground_truth_certainty
# Separate scheme file for multi-stage
schemes:
format: json
structure:
scheme_id: uuid
scheme_type: string
stages: [...]
transactions_by_stage: {...}
total_impact: decimal
perpetrators: [entity_ids]
8.2 Detection Benchmark Output
detection_benchmarks:
enabled: true
outputs:
# Performance expectations by method
expected_performance:
format: json
content:
by_method:
rule_based:
expected_recall: 0.40
expected_precision: 0.85
statistical:
expected_recall: 0.55
expected_precision: 0.70
ml_supervised:
expected_recall: 0.75
expected_precision: 0.80
graph_based:
expected_recall: 0.65
expected_precision: 0.75
# Difficulty distribution
difficulty_summary:
format: csv
columns: [difficulty_level, count, percentage, avg_amount]
# Detection challenge set
challenge_cases:
format: json
description: "Curated set of hardest-to-detect anomalies"
count: 100
selection_criteria: difficulty_score > 0.7
9. Implementation Priority
| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| Multi-dimensional labeling | Low | High | P1 |
| Correlated anomaly injection | Medium | High | P1 |
| Multi-stage schemes | High | High | P1 |
| Near-miss generation | Medium | High | P1 |
| Context-aware injection | Medium | High | P2 |
| Difficulty classification | Low | High | P1 |
| Adaptive injection | Medium | Medium | P2 |
| Detection benchmarks | Low | Medium | P2 |
See also: 07-fraud-patterns.md for fraud-specific patterns
Research: Fraud Pattern Improvements
Implementation Status: COMPLETE (v0.3.0)
This research document has been fully implemented in v0.3.0. See:
- Fraud Patterns Documentation
- CHANGELOG.md for detailed feature list
Key implementations:
- ACFE-aligned fraud taxonomy with calibration statistics
- Collusion and conspiracy modeling with 9 ring types
- Management override patterns with fraud triangle
- Red flag generation with Bayesian probabilities
- ACFE-calibrated ML benchmarks
Current State Analysis
Existing Fraud Typologies
| Category | Types Implemented | Realism |
|---|---|---|
| Asset Misappropriation | Ghost Employee, Duplicate Payment, Fictitious Vendor | Medium |
| Financial Statement Fraud | Revenue Manipulation, Round-tripping | Basic |
| Corruption | (Limited) | Weak |
| Banking/AML | Structuring, Layering, Mule, Funnel, Spoofing | Good |
Current Strengths
- Banking module: Sophisticated AML typologies with transaction networks
- Fraud labeling: Ground truth labels for ML training
- Control mapping: Fraud linked to control failures
- Amount patterns: Benford violations for fraudulent amounts
Current Gaps
- No collusion modeling: Fraud actors operate independently
- Limited concealment: Fraud isn’t actively hidden
- No behavioral adaptation: Fraudsters don’t learn
- Static schemes: Same patterns throughout
- Missing corruption types: Bribery, kickbacks underdeveloped
- No management override: All fraud at operational level
- Limited financial statement fraud: Complex schemes not modeled
Improvement Recommendations
1. Comprehensive Fraud Taxonomy
1.1 ACFE-Aligned Framework
Based on the Association of Certified Fraud Examiners (ACFE) Occupational Fraud and Abuse Classification System:
fraud_taxonomy:
# Asset Misappropriation (86% of cases, $100k median loss)
asset_misappropriation:
cash:
theft_of_cash_on_hand:
- larceny
- skimming
theft_of_cash_receipts:
- sales_skimming
- receivables_skimming
- refund_schemes
fraudulent_disbursements:
- billing_schemes:
- shell_company
- non_accomplice_vendor
- personal_purchases
- payroll_schemes:
- ghost_employee
- falsified_wages
- commission_schemes
- expense_reimbursement:
- mischaracterized_expenses
- overstated_expenses
- fictitious_expenses
- check_tampering:
- forged_maker
- forged_endorsement
- altered_payee
- authorized_maker
- register_disbursements:
- false_voids
- false_refunds
inventory_and_assets:
- misuse
- larceny
# Corruption (33% of cases, $150k median loss)
corruption:
conflicts_of_interest:
- purchasing_schemes
- sales_schemes
bribery:
- invoice_kickbacks
- bid_rigging
illegal_gratuities: true
economic_extortion: true
# Financial Statement Fraud (10% of cases, $954k median loss)
financial_statement_fraud:
overstatement:
- timing_differences:
- premature_revenue
- delayed_expenses
- fictitious_revenues
- concealed_liabilities
- improper_asset_valuations
- improper_disclosures
understatement:
- understated_revenues
- overstated_expenses
- overstated_liabilities
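In code, the three top-level branches and the calibration statistics cited above reduce to a small lookup. The enum and function names below are illustrative assumptions; the case shares overlap because a single case can span more than one branch.
#[derive(Debug, Clone, Copy)]
pub enum FraudBranch {
    AssetMisappropriation,
    Corruption,
    FinancialStatementFraud,
}

/// (share_of_cases, median_loss_usd) as cited in the taxonomy comments above.
pub fn acfe_calibration(branch: FraudBranch) -> (f64, f64) {
    match branch {
        FraudBranch::AssetMisappropriation => (0.86, 100_000.0),
        FraudBranch::Corruption => (0.33, 150_000.0),
        FraudBranch::FinancialStatementFraud => (0.10, 954_000.0),
    }
}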
1.2 Industry-Specific Fraud Patterns
industry_fraud_patterns:
manufacturing:
common_schemes:
- type: inventory_theft
frequency: high
methods: [larceny, false_shipments, scrap_manipulation]
- type: vendor_kickbacks
frequency: medium
methods: [inflated_pricing, phantom_materials]
- type: quality_fraud
frequency: low
methods: [false_certifications, spec_violations]
retail:
common_schemes:
- type: register_fraud
frequency: high
methods: [skimming, false_voids, sweethearting]
- type: return_fraud
frequency: high
methods: [fictitious_returns, receipt_fraud]
- type: inventory_shrinkage
frequency: very_high
methods: [employee_theft, vendor_collusion]
financial_services:
common_schemes:
- type: loan_fraud
frequency: medium
methods: [false_documentation, appraisal_fraud]
- type: insider_trading
frequency: low
methods: [front_running, tip_schemes]
- type: account_takeover
frequency: medium
methods: [identity_theft, credential_theft]
healthcare:
common_schemes:
- type: billing_fraud
frequency: high
methods: [upcoding, unbundling, phantom_billing]
- type: kickbacks
frequency: medium
methods: [referral_fees, drug_company_payments]
- type: identity_fraud
frequency: medium
methods: [patient_identity_theft, provider_impersonation]
professional_services:
common_schemes:
- type: billing_fraud
frequency: high
methods: [inflated_hours, phantom_work]
- type: expense_fraud
frequency: medium
methods: [personal_expenses, inflated_claims]
- type: client_fund_misappropriation
frequency: low
methods: [trust_account_theft, advance_fee_theft]
2. Collusion and Conspiracy Modeling
2.1 Collusion Network Generation
collusion_networks:
enabled: true
network_types:
# Internal collusion
internal:
- type: employee_pair
roles: [approver, processor]
scheme: approval_bypass
probability: 0.005
- type: department_ring
size: 3-5
roles: [initiator, approver, concealer]
scheme: expense_fraud
probability: 0.002
- type: management_subordinate
roles: [manager, subordinate]
scheme: ghost_employee
probability: 0.003
# Internal-external collusion
internal_external:
- type: employee_vendor
roles: [purchasing_agent, vendor_contact]
scheme: kickback
probability: 0.008
- type: employee_customer
roles: [sales_rep, customer]
scheme: false_credits
probability: 0.004
- type: employee_contractor
roles: [project_manager, contractor]
scheme: overbilling
probability: 0.006
# External rings
external:
- type: vendor_ring
size: 2-4
scheme: bid_rigging
probability: 0.002
- type: customer_ring
size: 2-3
scheme: return_fraud
probability: 0.003
network_characteristics:
trust_building:
initial_period_months: 3
test_transactions: 2-5
test_amounts: small
communication_patterns:
frequency: coded
channels: [personal_email, phone, in_person]
visibility: low
profit_sharing:
methods: [equal_split, role_based, initiator_premium]
payment_channels: [cash, personal_accounts, crypto]
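Per-period ring formation can be driven directly by the probability fields above. The sketch below takes the uniform draws as parameters so it stays independent of any RNG crate; the struct and function names are assumptions.
struct RingTemplate {
    name: &'static str,
    probability: f64, // per-period formation probability from the YAML above
}

/// `draws` holds one uniform [0, 1) sample per template for this period.
fn sample_new_rings(templates: &[RingTemplate], draws: &[f64]) -> Vec<&'static str> {
    templates
        .iter()
        .zip(draws)
        .filter(|(t, u)| **u < t.probability)
        .map(|(t, _)| t.name)
        .collect()
}

fn main() {
    let templates = [
        RingTemplate { name: "employee_vendor_kickback", probability: 0.008 },
        RingTemplate { name: "vendor_bid_rigging_ring", probability: 0.002 },
    ];
    // With these draws only the kickback pair forms this period.
    let formed = sample_new_rings(&templates, &[0.005, 0.5]);
    println!("{formed:?}"); // ["employee_vendor_kickback"]
}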
2.2 Collusion Behavior Modeling
pub struct CollusionRing {
ring_id: Uuid,
members: Vec<Conspirator>,
scheme_type: SchemeType,
formation_date: NaiveDate,
status: RingStatus,
total_stolen: Decimal,
detection_risk: f64,
}
pub struct Conspirator {
entity_id: EntityId,
role: ConspiratorRole,
join_date: NaiveDate,
loyalty: f64, // Probability of not defecting
risk_tolerance: f64, // Willingness to escalate
share: f64, // Percentage of proceeds
}
pub enum ConspiratorRole {
Initiator, // Conceives scheme
Executor, // Performs transactions
Approver, // Provides approvals
Concealer, // Hides evidence
Lookout, // Monitors for detection
Beneficiary, // External recipient
}
impl CollusionRing {
/// Simulate ring behavior for a period
pub fn simulate_period(&mut self, period: &Period) -> Vec<FraudAction> {
// Check for defection
if self.check_defection() {
return self.dissolve();
}
// Check for escalation
let escalation = self.check_escalation();
// Generate fraudulent transactions
let actions = self.generate_actions(period, escalation);
// Update detection risk
self.update_detection_risk(&actions);
actions
}
/// Check if any member might defect
fn check_defection(&self) -> bool {
// Factors: loyalty, detection_risk, personal_circumstances
// Placeholder rule (an assumption for this sketch): defection becomes likely
// once the ring's detection risk exceeds the weakest member's loyalty.
let min_loyalty = self.members.iter().map(|m| m.loyalty).fold(1.0, f64::min);
self.detection_risk > min_loyalty
}
}
3. Concealment Techniques
3.1 Document Manipulation
concealment_techniques:
document_manipulation:
# Forged documents
forgery:
types:
- invoices
- receipts
- approvals
- contracts
quality_levels:
crude: 0.20 # Easy to detect
moderate: 0.50 # Requires scrutiny
sophisticated: 0.25 # Difficult to detect
professional: 0.05 # Expert required
# Altered documents
alteration:
techniques:
- amount_change
- date_change
- payee_change
- description_change
detection_indicators:
- different_handwriting
- correction_fluid
- digital_artifacts
# Destroyed documents
destruction:
methods:
- physical_destruction
- digital_deletion
- "lost_in_transition"
recovery_probability: 0.30
audit_trail_manipulation:
techniques:
- backdating_entries
- manipulating_timestamps
- deleting_log_entries
- creating_false_trails
sophistication_levels:
basic: "obvious_gaps"
intermediate: "plausible_explanations"
advanced: "complete_alternative_narrative"
segregation_circumvention:
methods:
- shared_credentials
- delegated_authority_abuse
- emergency_access_exploitation
- system_override_use
3.2 Transaction Structuring
transaction_structuring:
# Below threshold structuring
threshold_avoidance:
thresholds:
- type: approval_limit
values: [1000, 5000, 10000, 25000]
technique: split_below
margin: 0.05-0.15
- type: reporting_threshold
values: [10000] # CTR threshold
technique: structure_below
margin: 0.10-0.20
- type: audit_sample_threshold
values: [materiality * 0.5]
technique: avoid_population
margin: variable
# Timing manipulation
timing_techniques:
- type: spread_over_periods
purpose: avoid_trending
pattern: randomized
- type: burst_before_vacation
purpose: delayed_discovery
window: 1_week
- type: holiday_timing
purpose: reduced_oversight
targets: [year_end, summer]
# Entity rotation
entity_rotation:
- type: vendor_rotation
purpose: avoid_concentration_alerts
rotation_frequency: quarterly
- type: account_rotation
purpose: avoid_pattern_detection
accounts: [expense_categories]
- type: department_rotation
purpose: spread_impact
pattern: round_robin
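The split_below technique is simple arithmetic: break one payment into pieces that each land just under an approval limit. A minimal sketch using the 5–15% margin band from the YAML above; the function name and the handling of the remainder are assumptions.
fn split_below_threshold(total: f64, approval_limit: f64, margin: f64) -> Vec<f64> {
    // Each piece targets limit * (1 - margin), e.g. 5,000 * 0.90 = 4,500.
    let piece = approval_limit * (1.0 - margin);
    let full_pieces = (total / piece).floor() as usize;
    let mut parts = vec![piece; full_pieces];
    let remainder = total - piece * full_pieces as f64;
    if remainder > 0.0 {
        parts.push(remainder);
    }
    parts
}

fn main() {
    // 23,000 split against a 5,000 approval limit with a 10% margin.
    let parts = split_below_threshold(23_000.0, 5_000.0, 0.10);
    println!("{parts:?}"); // [4500.0, 4500.0, 4500.0, 4500.0, 4500.0, 500.0]
}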
4. Management Override
4.1 Override Patterns
management_override:
enabled: true
scenarios:
# Revenue manipulation
revenue_override:
perpetrator_level: senior_management
techniques:
- journal_entry_override
- revenue_recognition_acceleration
- reserve_manipulation
- side_agreement_concealment
concealment:
- false_documentation
- intimidation_of_subordinates
- auditor_deception
# Expense manipulation
expense_override:
perpetrator_level: department_head+
techniques:
- capitalization_abuse
- expense_deferral
- cost_allocation_manipulation
pressure_sources:
- budget_targets
- bonus_thresholds
- analyst_expectations
# Asset manipulation
asset_override:
perpetrator_level: senior_management
techniques:
- impairment_avoidance
- valuation_manipulation
- classification_abuse
motivations:
- covenant_compliance
- credit_rating_maintenance
- acquisition_valuation
override_characteristics:
# Authority abuse
authority_patterns:
- override_segregation_of_duties
- suppress_exception_reports
- modify_control_parameters
- grant_inappropriate_access
# Pressure and rationalization
fraud_triangle:
pressure:
- financial_targets
- personal_financial_issues
- market_expectations
opportunity:
- weak_board_oversight
- auditor_reliance_on_management
- complex_transactions
rationalization:
- "temporary_adjustment"
- "everyone_does_it"
- "for_the_good_of_company"
4.2 Tone at the Top Effects
tone_effects:
enabled: true
# Positive tone (ethical leadership)
ethical_leadership:
effects:
- fraud_rate_reduction: 0.50
- whistleblower_increase: 2.0
- control_compliance_improvement: 0.20
# Negative tone (pressure culture)
pressure_culture:
effects:
- fraud_rate_increase: 2.5
- concealment_sophistication: increased
- collusion_probability: 1.5x
- management_override_probability: 3.0x
# Mixed signals
inconsistent_messaging:
effects:
- employee_confusion: true
- selective_compliance: true
- rationalization_easier: true
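Applied to generation, the tone settings act as multipliers on base rates. A minimal sketch, assuming an enum for the three cultures and treating inconsistent_messaging as rate-neutral because the YAML above gives it no explicit rate effect:
#[derive(Debug, Clone, Copy)]
enum ToneAtTheTop {
    EthicalLeadership,
    PressureCulture,
    InconsistentMessaging,
}

fn adjusted_fraud_rate(base_rate: f64, tone: ToneAtTheTop) -> f64 {
    let factor = match tone {
        // fraud_rate_reduction: 0.50 -> halve the base rate
        ToneAtTheTop::EthicalLeadership => 0.50,
        // fraud_rate_increase: 2.5
        ToneAtTheTop::PressureCulture => 2.5,
        // no explicit rate effect in the YAML; treated as neutral here
        ToneAtTheTop::InconsistentMessaging => 1.0,
    };
    (base_rate * factor).clamp(0.0, 1.0)
}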
5. Adaptive Fraud Behavior
5.1 Learning and Adaptation
adaptive_fraud:
enabled: true
learning_behaviors:
# Response to near-detection
near_detection_response:
behaviors:
- temporary_pause: 0.40
- technique_change: 0.30
- amount_reduction: 0.20
- scheme_abandonment: 0.10
pause_duration_days: 30-90
# Response to control changes
control_adaptation:
when: new_control_implemented
behaviors:
- find_workaround: 0.60
- wait_for_relaxation: 0.25
- abandon_scheme: 0.15
adaptation_time_days: 30-60
# Success reinforcement
success_reinforcement:
when: fraud_not_detected
behaviors:
- increase_frequency: 0.30
- increase_amount: 0.40
- recruit_accomplices: 0.15
- maintain_current: 0.15
sophistication_evolution:
stages:
novice:
characteristics: [obvious_patterns, small_amounts, nervous_behavior]
detection_difficulty: easy
intermediate:
characteristics: [some_concealment, medium_amounts, confidence]
detection_difficulty: moderate
experienced:
characteristics: [sophisticated_concealment, varied_amounts, systematic]
detection_difficulty: hard
expert:
characteristics: [professional_techniques, large_amounts, network]
detection_difficulty: expert
progression:
trigger: months_undetected > 6
probability_per_trigger: 0.30
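A minimal sketch of that progression: once a fraudster passes six undetected months, each trigger gives a 30% chance of moving up one stage. The uniform draw is passed in to keep the sketch free of any RNG crate; the enum and function names are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Sophistication { Novice, Intermediate, Experienced, Expert }

fn maybe_progress(current: Sophistication, months_undetected: u32, draw: f64) -> Sophistication {
    if months_undetected <= 6 || draw >= 0.30 {
        return current;
    }
    match current {
        Sophistication::Novice => Sophistication::Intermediate,
        Sophistication::Intermediate => Sophistication::Experienced,
        Sophistication::Experienced | Sophistication::Expert => Sophistication::Expert,
    }
}

fn main() {
    // Seven undetected months and a draw below 0.30: novice becomes intermediate.
    let next = maybe_progress(Sophistication::Novice, 7, 0.12);
    assert_eq!(next, Sophistication::Intermediate);
}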
5.2 Detection Evasion
pub struct AdaptiveFraudster {
experience_level: ExperienceLevel,
known_controls: Vec<ControlId>,
detection_events: Vec<DetectionEvent>,
technique_repertoire: Vec<FraudTechnique>,
}
impl AdaptiveFraudster {
/// Adapt technique based on environment
pub fn adapt_technique(&mut self, context: &Context) -> FraudTechnique {
// Avoid known controls
let available = self.filter_by_controls(context.active_controls);
// Avoid previously detected patterns
let safe = self.filter_by_history(&available);
// Select based on risk/reward
self.select_optimal(&safe, context.current_risk_tolerance)
}
/// Learn from near-detection
pub fn learn_from_event(&mut self, event: &DetectionEvent) {
match event.outcome {
DetectionOutcome::Detected => {
self.avoid_technique(event.technique);
self.reduce_risk_tolerance();
}
DetectionOutcome::NearMiss => {
self.modify_technique(event.technique);
self.record_warning_sign(event.indicator);
}
DetectionOutcome::Undetected => {
self.reinforce_technique(event.technique);
self.consider_escalation();
}
}
}
}
6. Financial Statement Fraud Schemes
6.1 Revenue Manipulation Schemes
revenue_schemes:
# Premature revenue recognition
premature_recognition:
techniques:
- bill_and_hold:
description: "Ship to warehouse, recognize revenue"
indicators: [unusual_shipping, customer_complaints]
journal_entries:
- dr: accounts_receivable
cr: revenue
- channel_stuffing:
description: "Force product on distributors"
indicators: [quarter_end_spike, high_returns_next_period]
side_agreements: [return_rights, extended_payment]
- percentage_of_completion_abuse:
description: "Overstate project completion"
indicators: [optimistic_estimates, margin_improvements]
documentation: [false_progress_reports]
- round_tripping:
description: "Simultaneous buy/sell with related party"
indicators: [offsetting_transactions, unusual_counterparties]
complexity: high
# Fictitious revenue
fictitious_revenue:
techniques:
- fake_invoices:
description: "Bill nonexistent customers"
concealment: [fake_customer_setup, false_confirmations]
- side_agreements:
description: "Hidden terms negate sale"
concealment: [verbal_agreements, separate_documentation]
- related_party_transactions:
description: "Transactions with undisclosed affiliates"
concealment: [complex_ownership, offshore_entities]
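The journal_entries shown for bill_and_hold (dr accounts_receivable / cr revenue) translate into a balanced two-line entry, which is exactly why the scheme survives a trial balance check. A minimal sketch with illustrative struct and field names, not the generator's entry model:
#[derive(Debug)]
struct JournalLine { account: &'static str, debit: f64, credit: f64 }

fn bill_and_hold_entry(amount: f64) -> [JournalLine; 2] {
    [
        JournalLine { account: "accounts_receivable", debit: amount, credit: 0.0 },
        JournalLine { account: "revenue", debit: 0.0, credit: amount },
    ]
}

fn main() {
    let lines = bill_and_hold_entry(250_000.0);
    // Debits equal credits, so the fabricated entry still balances.
    assert_eq!(
        lines.iter().map(|l| l.debit).sum::<f64>(),
        lines.iter().map(|l| l.credit).sum::<f64>()
    );
}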
6.2 Expense and Liability Manipulation
expense_liability_schemes:
# Expense deferral
expense_deferral:
techniques:
- improper_capitalization:
description: "Capitalize operating expenses"
accounts: [fixed_assets, intangibles]
indicators: [unusual_asset_growth, low_maintenance]
- reserve_manipulation:
description: "Cookie jar reserves"
pattern: [build_in_good_years, release_in_bad]
indicators: [volatile_provisions, earnings_smoothing]
- period_cutoff_manipulation:
description: "Push expenses to next period"
timing: [quarter_end, year_end]
techniques: [hold_invoices, delay_receipt]
# Liability concealment
liability_concealment:
techniques:
- off_balance_sheet:
description: "Structure to avoid consolidation"
vehicles: [SPEs, unconsolidated_subsidiaries]
concealment: [complex_structures, offshore]
- contingency_understatement:
description: "Understate legal/warranty liabilities"
rationalization: ["uncertain", "immaterial"]
indicators: [subsequent_large_settlements]
7. Fraud Red Flags and Indicators
7.1 Behavioral Red Flags
behavioral_red_flags:
# Employee behavior
employee_indicators:
- indicator: living_beyond_means
fraud_correlation: 0.45
detection_method: lifestyle_analysis
- indicator: financial_difficulties
fraud_correlation: 0.40
detection_method: background_check
- indicator: unusually_close_vendor_relationships
fraud_correlation: 0.35
detection_method: relationship_analysis
- indicator: control_issues_attitude
fraud_correlation: 0.30
detection_method: 360_feedback
- indicator: never_takes_vacation
fraud_correlation: 0.50
detection_method: hr_records
- indicator: excessive_overtime
fraud_correlation: 0.25
detection_method: time_records
# Transaction behavior
transaction_indicators:
- indicator: round_number_preference
fraud_correlation: 0.20
detection_method: benford_analysis
- indicator: just_below_threshold
fraud_correlation: 0.60
detection_method: threshold_analysis
- indicator: end_of_period_concentration
fraud_correlation: 0.35
detection_method: temporal_analysis
- indicator: unusual_journal_entries
fraud_correlation: 0.55
detection_method: journal_entry_testing
7.2 Red Flag Generation
red_flag_injection:
enabled: true
# Inject red flags that correlate with but don't prove fraud
correlations:
# Strong correlation - usually indicates fraud
strong:
- flag: matched_home_address_vendor_employee
fraud_probability: 0.85
inject_with_fraud: 0.90
inject_without_fraud: 0.001
- flag: sequential_check_numbers_to_same_vendor
fraud_probability: 0.70
inject_with_fraud: 0.80
inject_without_fraud: 0.01
# Moderate correlation - worth investigating
moderate:
- flag: vendor_no_physical_address
fraud_probability: 0.40
inject_with_fraud: 0.60
inject_without_fraud: 0.05
- flag: approval_just_under_threshold
fraud_probability: 0.35
inject_with_fraud: 0.70
inject_without_fraud: 0.10
# Weak correlation - often legitimate
weak:
- flag: round_number_invoice
fraud_probability: 0.15
inject_with_fraud: 0.40
inject_without_fraud: 0.20
- flag: end_of_month_timing
fraud_probability: 0.10
inject_with_fraud: 0.50
inject_without_fraud: 0.30
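Read as conditional probabilities, inject_with_fraud approximates P(flag | fraud) and inject_without_fraud approximates P(flag | no fraud), so a labeled dataset lets a detector recover the posterior with Bayes' rule. A minimal sketch in which the 1% base fraud rate is an illustrative assumption:
fn posterior_fraud_given_flag(p_flag_given_fraud: f64, p_flag_given_clean: f64, base_rate: f64) -> f64 {
    let numerator = p_flag_given_fraud * base_rate;
    let denominator = numerator + p_flag_given_clean * (1.0 - base_rate);
    numerator / denominator
}

fn main() {
    // matched_home_address_vendor_employee with a 1% base fraud rate:
    // 0.90 * 0.01 / (0.90 * 0.01 + 0.001 * 0.99) ≈ 0.90
    let p = posterior_fraud_given_flag(0.90, 0.001, 0.01);
    println!("P(fraud | flag) ≈ {p:.2}");
}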
8. Fraud Investigation Scenarios
8.1 Investigation-Ready Data
investigation_scenarios:
enabled: true
scenarios:
# Whistleblower scenario
whistleblower_tip:
allegation: "Vendor XYZ may be fictitious"
evidence_trail:
- vendor_setup_documents
- approval_chain
- payment_history
- address_verification
- phone_verification
hidden_clues:
- approver_is_also_requester
- address_is_ups_store
- phone_goes_to_employee
# Audit finding follow-up
audit_finding:
initial_finding: "Unusual vendor payment pattern"
investigation_path:
- transaction_sample
- vendor_analysis
- employee_relationship_map
- comparative_analysis
discovery_stages:
- stage_1: "Vendor has only one customer - us"
- stage_2: "All invoices approved by same person"
- stage_3: "Vendor address matches employee relative"
# Hotline report
anonymous_tip:
report: "Manager taking kickbacks from contractor"
evidence_available:
- contract_documents
- bid_history
- payment_records
- email_metadata
additional_clues:
- bids_always_awarded_to_same_contractor
- contract_amendments_increase_cost_30%
- manager_new_car_timing_correlates
8.2 Evidence Chain Generation
pub struct FraudEvidenceChain {
fraud_id: Uuid,
evidence_items: Vec<EvidenceItem>,
discovery_order: Vec<EvidenceId>,
linking_relationships: Vec<EvidenceLink>,
}
pub struct EvidenceItem {
id: EvidenceId,
item_type: EvidenceType,
content: EvidenceContent,
source_system: String,
timestamp: DateTime<Utc>,
accessibility: Accessibility, // How hard to find
probative_value: f64, // Strength as evidence
}
pub enum EvidenceType {
Transaction,
Document,
Communication,
SystemLog,
ExternalRecord,
WitnessStatement,
PhysicalEvidence,
}
impl FraudEvidenceChain {
/// Generate investigation-ready evidence trail
pub fn generate_trail(&self) -> InvestigationTrail {
// Order evidence by discoverability
// Create logical links between items
// Add red herrings (false leads that are eliminated)
// Include corroborating evidence
// Assembly intentionally left as a stub in this research sketch.
todo!("order evidence_items by accessibility, link them, and assemble an InvestigationTrail")
}
}
9. Implementation Priority
| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| ACFE-aligned taxonomy | Low | High | P1 |
| Collusion modeling | High | High | P1 |
| Concealment techniques | Medium | High | P1 |
| Management override | Medium | High | P1 |
| Adaptive behavior | High | Medium | P2 |
| Financial statement fraud | High | High | P1 |
| Red flag generation | Medium | High | P1 |
| Investigation scenarios | Medium | Medium | P2 |
| Industry-specific patterns | Medium | Medium | P2 |
10. Validation and Calibration
fraud_validation:
# Calibration against real-world statistics
calibration:
source: acfe_report_to_the_nations_2024
metrics:
median_loss: 117000
median_duration_months: 12
detection_methods:
tip: 0.42
internal_audit: 0.16
management_review: 0.12
external_audit: 0.04
accident: 0.06
perpetrator_departments:
accounting: 0.21
operations: 0.17
executive: 0.12
sales: 0.11
customer_service: 0.08
# Distribution validation
distribution_checks:
- metric: loss_distribution
expected: lognormal
parameters_from: acfe_data
- metric: duration_distribution
expected: exponential
mean_months: 12
- metric: detection_method_distribution
expected: categorical
match_acfe: true
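One concrete form of the calibration check above: compare the median loss of generated fraud cases against the ACFE target within a relative tolerance. The function names and the 10% tolerance are assumptions:
fn median(mut values: Vec<f64>) -> f64 {
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = values.len();
    if n % 2 == 1 { values[n / 2] } else { (values[n / 2 - 1] + values[n / 2]) / 2.0 }
}

fn median_loss_within_tolerance(losses: &[f64], target: f64, rel_tolerance: f64) -> bool {
    let m = median(losses.to_vec());
    (m - target).abs() / target <= rel_tolerance
}

fn main() {
    let losses = vec![45_000.0, 80_000.0, 120_000.0, 115_000.0, 300_000.0];
    // Median is 115,000, within 10% of the 117,000 target.
    println!("{}", median_loss_within_tolerance(&losses, 117_000.0, 0.10)); // true
}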
See also: 08-domain-specific.md for industry-specific enhancements
Research: Domain-Specific Enhancements
Implementation Status: COMPLETE (v0.3.0)
This research document has been fully implemented in v0.3.0. See:
- Industry-Specific Features Documentation
- CHANGELOG.md for detailed feature list
Key implementations:
- Manufacturing: BOM, routings, work centers, production variances
- Retail: POS transactions, shrinkage, loss prevention
- Healthcare: Revenue cycle, ICD-10/CPT/DRG coding, payer mix
- Technology: License/subscription revenue, R&D capitalization
- Financial Services: Loan origination, trading, regulatory frameworks
- Professional Services: Time & billing, trust accounting
- Industry-specific anomaly patterns for each sector
- Industry-specific ML benchmarks
Current State Analysis
Existing Industry Support
| Industry | Configuration | Generator Support | Realism |
|---|---|---|---|
| Manufacturing | Preset available | Good | Medium |
| Retail | Preset available | Good | Medium |
| Financial Services | Preset + Banking module | Strong | Good |
| Healthcare | Preset available | Basic | Low |
| Technology | Preset available | Basic | Low |
| Professional Services | Limited | Basic | Low |
Current Strengths
- Banking module: Comprehensive KYC/AML with fraud typologies
- Industry presets: 5 industry configurations available
- Seasonality profiles: 10 industry-specific patterns
- Standards support: IFRS, US GAAP, ISA, SOX frameworks
Current Gaps
- Shallow industry modeling: Generic patterns across industries
- Limited regulatory specificity: One-size-fits-all compliance
- Missing vertical-specific transactions: Generic document flows
- No industry-specific anomalies: Same fraud patterns everywhere
- Limited terminology: Generic naming regardless of industry
Industry-Specific Enhancement Recommendations
1. Manufacturing Industry
1.1 Enhanced Transaction Types
manufacturing:
transaction_types:
# Production-specific
production:
- work_order_issuance
- material_requisition
- labor_booking
- overhead_absorption
- scrap_reporting
- rework_order
- production_variance
# Inventory movements
inventory:
- raw_material_receipt
- wip_transfer
- finished_goods_transfer
- consignment_movement
- subcontractor_shipment
- cycle_count_adjustment
- physical_inventory_adjustment
# Cost accounting
costing:
- standard_cost_revaluation
- purchase_price_variance
- production_variance_allocation
- overhead_rate_adjustment
- interplant_transfer_pricing
# Manufacturing-specific master data
master_data:
bill_of_materials:
levels: 3-7
components_per_level: 2-15
yield_rates: 0.95-0.99
scrap_factors: 0.01-0.05
routings:
operations: 3-12
work_centers: 5-50
labor_rates: by_skill_level
machine_rates: by_equipment_type
production_orders:
types: [discrete, repetitive, process]
statuses: [planned, released, confirmed, completed]
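The bill_of_materials structure above implies multi-level quantity explosion when production orders consume materials. A minimal sketch of that explosion; the struct name and the recursion are illustrative, not the generator's implementation:
struct BomComponent {
    material: &'static str,
    quantity_per: f64,
    children: Vec<BomComponent>,
}

/// Accumulate total component quantities for one unit of the top-level item.
fn explode(component: &BomComponent, parent_qty: f64, totals: &mut std::collections::HashMap<&'static str, f64>) {
    let qty = parent_qty * component.quantity_per;
    *totals.entry(component.material).or_insert(0.0) += qty;
    for child in &component.children {
        explode(child, qty, totals);
    }
}

fn main() {
    let bike = BomComponent {
        material: "bicycle",
        quantity_per: 1.0,
        children: vec![
            BomComponent { material: "wheel", quantity_per: 2.0, children: vec![
                BomComponent { material: "spoke", quantity_per: 32.0, children: vec![] },
            ]},
            BomComponent { material: "frame", quantity_per: 1.0, children: vec![] },
        ],
    };
    let mut totals = std::collections::HashMap::new();
    explode(&bike, 1.0, &mut totals);
    // 64 spokes in total: 1 bicycle x 2 wheels x 32 spokes.
    assert_eq!(totals["spoke"], 64.0);
}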
1.2 Manufacturing Anomalies
manufacturing_anomalies:
production:
- type: yield_manipulation
description: "Inflating yield to hide scrap"
indicators: [abnormal_yield, missing_scrap_entries]
- type: labor_misallocation
description: "Charging labor to wrong orders"
indicators: [unusual_labor_distribution, overtime_patterns]
- type: phantom_production
description: "Recording production that didn't occur"
indicators: [no_material_consumption, missing_quality_records]
inventory:
- type: obsolete_inventory_concealment
description: "Failing to write down obsolete stock"
indicators: [no_movement_items, aging_without_provision]
- type: consignment_manipulation
description: "Recording consigned goods as owned"
indicators: [unusual_consignment_patterns, ownership_disputes]
costing:
- type: standard_cost_manipulation
description: "Setting unrealistic standards"
indicators: [persistent_favorable_variances, standard_changes]
- type: overhead_misallocation
description: "Allocating overhead to wrong products"
indicators: [margin_anomalies, allocation_base_changes]
2. Retail Industry
2.1 Enhanced Transaction Types
retail:
transaction_types:
# Point of Sale
pos:
- cash_sale
- credit_card_sale
- debit_sale
- gift_card_sale
- layaway_transaction
- special_order
- rain_check
# Returns and adjustments
returns:
- customer_return
- exchange
- price_adjustment
- markdown
- damage_writeoff
- vendor_return
# Inventory
inventory:
- receiving
- transfer_in
- transfer_out
- cycle_count
- shrinkage_adjustment
- donation
- disposal
# Promotions
promotions:
- coupon_redemption
- loyalty_redemption
- bundle_discount
- flash_sale
- clearance_markdown
# Retail-specific metrics
metrics:
same_store_sales: by_period
basket_size: average_and_distribution
conversion_rate: by_store_type
shrinkage_rate: by_category
markdown_percentage: by_season
inventory_turn: by_category
2.2 Retail Anomalies
retail_anomalies:
pos_fraud:
- type: sweethearting
description: "Employee gives free/discounted items to friends"
indicators: [high_void_rate, specific_cashier_patterns]
- type: skimming
description: "Not recording cash sales"
indicators: [cash_short, transaction_gaps]
- type: refund_fraud
description: "Fraudulent refunds to personal cards"
indicators: [refund_patterns, card_number_reuse]
inventory_fraud:
- type: receiving_fraud
description: "Collusion with vendors on short shipments"
indicators: [variance_patterns, vendor_concentration]
- type: transfer_fraud
description: "Fake transfers to cover theft"
indicators: [transfer_without_receipt, location_patterns]
promotional_abuse:
- type: coupon_fraud
description: "Applying coupons without customer purchase"
indicators: [high_coupon_rate, timing_patterns]
- type: employee_discount_abuse
description: "Using employee discount for non-employees"
indicators: [discount_volume, transaction_timing]
3. Healthcare Industry
3.1 Enhanced Transaction Types
healthcare:
transaction_types:
# Revenue cycle
revenue:
- patient_registration
- charge_capture
- claim_submission
- payment_posting
- denial_management
- patient_billing
- collection_activity
# Clinical operations
clinical:
- supply_consumption
- pharmacy_dispensing
- procedure_coding
- diagnosis_coding
- medical_record_documentation
# Payer transactions
payer:
- contract_payment
- capitation_payment
- risk_adjustment
- quality_bonus
- value_based_payment
# Healthcare-specific elements
elements:
coding:
icd10: diagnostic_codes
cpt: procedure_codes
drg: diagnosis_related_groups
hcpcs: healthcare_common_procedure_coding_system
payers:
types: [medicare, medicaid, commercial, self_pay]
mix_distribution: configurable
contract_terms: by_payer
compliance:
hipaa: true
stark_law: true
anti_kickback: true
false_claims_act: true
3.2 Healthcare Anomalies
healthcare_anomalies:
billing_fraud:
- type: upcoding
description: "Billing for more expensive service than provided"
indicators: [code_distribution_shift, complexity_increase]
- type: unbundling
description: "Billing separately for bundled services"
indicators: [modifier_patterns, procedure_combinations]
- type: phantom_billing
description: "Billing for services not rendered"
indicators: [impossible_combinations, deceased_patient_billing]
- type: duplicate_billing
description: "Billing multiple times for same service"
indicators: [same_day_duplicates, claim_resubmission_patterns]
kickback_schemes:
- type: physician_referral_kickback
description: "Payments for patient referrals"
indicators: [referral_concentration, payment_timing]
- type: medical_director_fraud
description: "Sham medical director agreements"
indicators: [no_services_rendered, excessive_compensation]
compliance_violations:
- type: hipaa_violation
description: "Unauthorized access to patient records"
indicators: [access_patterns, audit_log_anomalies]
- type: credential_fraud
description: "Using credentials of another provider"
indicators: [impossible_geography, timing_conflicts]
4. Technology Industry
4.1 Enhanced Transaction Types
technology:
transaction_types:
# Revenue recognition (ASC 606)
revenue:
- license_revenue
- subscription_revenue
- professional_services
- maintenance_revenue
- usage_based_revenue
- milestone_based_revenue
# Software development
development:
- r_and_d_expense
- capitalized_software
- amortization
- impairment_testing
# Cloud operations
cloud:
- hosting_costs
- bandwidth_costs
- storage_costs
- compute_costs
- third_party_services
# Sales and marketing
sales:
- commission_expense
- deferred_commission
- customer_acquisition_cost
- marketing_program_expense
# Tech-specific accounting
accounting:
revenue_recognition:
multiple_element_arrangements: true
variable_consideration: true
contract_modifications: true
software_development:
capitalization_criteria: true
useful_life_determination: true
impairment_testing: annual
stock_compensation:
option_valuation: black_scholes
rsu_accounting: true
performance_units: true
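For subscription_revenue under ASC 606, the generator's entries amount to straight-line recognition of deferred revenue over the contract term. A minimal sketch of that split for an annual contract billed up front; the function name and the twelve-month term are illustrative assumptions:
fn recognised_and_deferred(annual_contract_value: f64, months_elapsed: u32) -> (f64, f64) {
    let monthly = annual_contract_value / 12.0;
    let recognised = monthly * months_elapsed.min(12) as f64;
    (recognised, annual_contract_value - recognised)
}

fn main() {
    // 120,000 annual contract, three months in: 30,000 recognised, 90,000 deferred.
    let (rec, def) = recognised_and_deferred(120_000.0, 3);
    println!("recognised: {rec}, deferred: {def}");
}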
4.2 Technology Anomalies
technology_anomalies:
revenue_fraud:
- type: premature_license_recognition
description: "Recognizing license revenue before delivery criteria met"
indicators: [quarter_end_concentration, delivery_delays]
- type: side_letter_abuse
description: "Hidden terms that negate revenue recognition"
indicators: [unusual_contract_terms, customer_complaints]
- type: channel_stuffing
description: "Forcing product on resellers at period end"
indicators: [reseller_inventory_buildup, returns_next_quarter]
capitalization_fraud:
- type: improper_capitalization
description: "Capitalizing expenses that should be expensed"
indicators: [r_and_d_ratio_changes, asset_growth]
- type: useful_life_manipulation
description: "Extending useful life to reduce amortization"
indicators: [useful_life_changes, peer_comparison]
stock_compensation:
- type: options_backdating
description: "Selecting favorable grant dates retroactively"
indicators: [grant_date_patterns, exercise_price_analysis]
- type: vesting_manipulation
description: "Accelerating vesting to manage earnings"
indicators: [vesting_schedule_changes, departure_timing]
5. Financial Services Industry
5.1 Enhanced Transaction Types
financial_services:
transaction_types:
# Banking operations
banking:
- loan_origination
- loan_disbursement
- loan_payment
- interest_accrual
- fee_income
- deposit_transaction
- wire_transfer
- ach_transaction
# Investment operations
investments:
- trade_execution
- trade_settlement
- dividend_receipt
- interest_receipt
- mark_to_market
- realized_gain_loss
- unrealized_gain_loss
# Insurance operations
insurance:
- premium_collection
- claim_payment
- reserve_adjustment
- reinsurance_transaction
- commission_payment
- policy_acquisition_cost
# Asset management
asset_management:
- management_fee
- performance_fee
- distribution
- capital_call
- redemption
# Regulatory requirements
regulatory:
capital_requirements:
basel_iii: true
leverage_ratio: true
liquidity_coverage: true
reporting:
call_reports: true
form_10k_10q: true
form_13f: true
sar_filing: true
5.2 Financial Services Anomalies
financial_services_anomalies:
lending_fraud:
- type: loan_fraud
description: "Falsified loan applications"
indicators: [documentation_inconsistencies, verification_failures]
- type: appraisal_fraud
description: "Inflated property valuations"
indicators: [appraisal_variances, appraiser_concentration]
- type: straw_borrower
description: "Using nominee to obtain loans"
indicators: [relationship_patterns, fund_flow_analysis]
trading_fraud:
- type: wash_trading
description: "Buying and selling same security to inflate volume"
indicators: [self_trades, volume_patterns]
- type: front_running
description: "Trading ahead of customer orders"
indicators: [timing_analysis, profitability_patterns]
- type: churning
description: "Excessive trading to generate commissions"
indicators: [turnover_ratio, commission_patterns]
insurance_fraud:
- type: premium_theft
description: "Agent pocketing premiums"
indicators: [lapsed_policies, customer_complaints]
- type: claims_fraud
description: "Fraudulent or inflated claims"
indicators: [claim_patterns, adjuster_analysis]
- type: reserve_manipulation
description: "Understating claim reserves"
indicators: [reserve_development, adequacy_analysis]
6. Professional Services
6.1 Enhanced Transaction Types
professional_services:
transaction_types:
# Time and billing
billing:
- time_entry
- expense_entry
- invoice_generation
- write_off_adjustment
- realization_adjustment
- wip_adjustment
# Engagement management
engagement:
- engagement_setup
- budget_allocation
- milestone_billing
- retainer_application
- contingency_fee
# Resource management
resource:
- staff_allocation
- contractor_engagement
- subcontractor_payment
- expert_fee
# Client accounting
client:
- trust_deposit
- trust_withdrawal
- cost_advance
- client_reimbursement
# Professional-specific metrics
metrics:
utilization_rate: by_level
realization_rate: by_practice
collection_rate: by_client
leverage_ratio: staff_to_partner
revenue_per_professional: by_level
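The metrics above follow common professional-services definitions: utilization compares billable to available hours, realization compares billed (or collected) value to the standard value of the hours worked. A minimal sketch with illustrative function names:
fn utilization_rate(billable_hours: f64, available_hours: f64) -> f64 {
    billable_hours / available_hours
}

fn realization_rate(billed_value: f64, standard_value_of_hours: f64) -> f64 {
    billed_value / standard_value_of_hours
}

fn main() {
    // 1,500 billable of 2,000 available hours = 75% utilization;
    // 450,000 billed against 500,000 standard value = 90% realization.
    println!("{:.2}", utilization_rate(1_500.0, 2_000.0));
    println!("{:.2}", realization_rate(450_000.0, 500_000.0));
}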
6.2 Professional Services Anomalies
professional_services_anomalies:
billing_fraud:
- type: inflated_hours
description: "Billing for time not worked"
indicators: [impossible_hours, pattern_analysis]
- type: phantom_work
description: "Billing for work never performed"
indicators: [no_work_product, client_complaints]
- type: duplicate_billing
description: "Billing multiple clients for same time"
indicators: [time_overlap, total_hours_analysis]
expense_fraud:
- type: personal_expense_billing
description: "Charging personal expenses to clients"
indicators: [expense_patterns, vendor_analysis]
- type: markup_abuse
description: "Excessive markups on pass-through costs"
indicators: [markup_comparison, cost_analysis]
trust_account_fraud:
- type: commingling
description: "Mixing trust and operating funds"
indicators: [transfer_patterns, reconciliation_issues]
- type: misappropriation
description: "Using client funds for personal use"
indicators: [unauthorized_withdrawals, shortages]
7. Real Estate Industry
7.1 Enhanced Transaction Types
real_estate:
transaction_types:
# Property management
property:
- rent_collection
- cam_charges
- security_deposit
- lease_payment
- tenant_improvement
- property_tax
- insurance_expense
# Development
development:
- land_acquisition
- construction_draw
- development_fee
- capitalized_interest
- soft_cost
- hard_cost
# Investment
investment:
- property_acquisition
- property_disposition
- depreciation
- impairment
- fair_value_adjustment
- debt_service
# REIT-specific
reit:
- ffo_calculation
- dividend_distribution
- taxable_income
- section_1031_exchange
7.2 Real Estate Anomalies
real_estate_anomalies:
property_management:
- type: rent_skimming
description: "Not recording cash rent payments"
indicators: [occupancy_vs_revenue, cash_deposits]
- type: kickback_maintenance
description: "Receiving kickbacks from contractors"
indicators: [contractor_concentration, pricing_analysis]
development:
- type: cost_inflation
description: "Inflating development costs"
indicators: [cost_per_unit_comparison, change_order_patterns]
- type: capitalization_abuse
description: "Capitalizing operating expenses"
indicators: [capitalization_ratio, expense_classification]
valuation:
- type: appraisal_manipulation
description: "Influencing property appraisals"
indicators: [appraisal_vs_sale_price, appraiser_relationships]
- type: impairment_avoidance
description: "Failing to record impairments"
indicators: [occupancy_decline, market_comparisons]
8. Industry-Specific Configuration
8.1 Unified Industry Configuration
# Master industry configuration schema
industry_configuration:
industry: manufacturing # or retail, healthcare, etc.
# Industry-specific settings
settings:
transaction_types:
enabled: [production, inventory, costing]
weights:
production_orders: 0.30
inventory_movements: 0.40
cost_adjustments: 0.30
master_data:
bill_of_materials: true
routings: true
work_centers: true
production_resources: true
anomaly_injection:
industry_specific: true
generic: true
industry_weight: 0.60
terminology:
use_industry_terms: true
document_naming: industry_standard
account_descriptions: industry_specific
seasonality:
profile: manufacturing
custom_events:
- name: plant_shutdown
month: 7
duration_weeks: 2
activity_multiplier: 0.10
regulatory:
frameworks:
- environmental: epa
- safety: osha
- quality: iso_9001
# Cross-industry settings (inherit from base)
inherit:
- accounting_standards
- audit_standards
- control_framework
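Because the industry configuration is plain YAML, it maps onto typed config structs in the usual way. A minimal sketch, assuming the serde and serde_yaml crates and modelling only a few of the fields above; the struct names are assumptions, not the generator's config types:
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct IndustryConfiguration {
    industry: String,
    settings: Settings,
}

#[derive(Debug, Deserialize)]
struct Settings {
    anomaly_injection: AnomalyInjection,
}

#[derive(Debug, Deserialize)]
struct AnomalyInjection {
    industry_specific: bool,
    generic: bool,
    industry_weight: f64,
}

fn main() -> Result<(), serde_yaml::Error> {
    let yaml = r#"
industry: manufacturing
settings:
  anomaly_injection:
    industry_specific: true
    generic: true
    industry_weight: 0.60
"#;
    let cfg: IndustryConfiguration = serde_yaml::from_str(yaml)?;
    println!("{cfg:?}");
    Ok(())
}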
8.2 Industry Presets Enhancement
presets:
manufacturing_automotive:
base: manufacturing
customizations:
bom_depth: 7
just_in_time: true
quality_framework: iatf_16949
supplier_tiers: 3
defect_rates: very_low
retail_grocery:
base: retail
customizations:
perishable_inventory: true
high_volume_low_margin: true
shrinkage_focus: true
vendor_managed_inventory: true
healthcare_hospital:
base: healthcare
customizations:
inpatient: true
outpatient: true
emergency_services: true
ancillary_services: true
case_mix_complexity: high
technology_saas:
base: technology
customizations:
subscription_revenue: primary
professional_services: secondary
monthly_recurring_revenue: true
churn_modeling: true
financial_services_bank:
base: financial_services
customizations:
banking_charter: commercial
deposit_taking: true
lending: true
capital_markets: limited
9. Implementation Priority
| Industry | Enhancement Scope | Complexity | Priority |
|---|---|---|---|
| Manufacturing | Full enhancement | High | P1 |
| Retail | Full enhancement | Medium | P1 |
| Healthcare | Full enhancement | High | P1 |
| Technology | Revenue recognition | Medium | P2 |
| Financial Services | Extend banking module | Medium | P1 |
| Professional Services | New module | Medium | P2 |
| Real Estate | New module | Medium | P3 |
10. Terminology and Naming
industry_terminology:
manufacturing:
document_types:
purchase_order: "Production Purchase Order"
invoice: "Vendor Invoice"
receipt: "Goods Receipt / Material Document"
accounts:
wip: "Work in Process"
fg: "Finished Goods Inventory"
rm: "Raw Materials Inventory"
transactions:
production: "Manufacturing Order Settlement"
variance: "Production Variance Posting"
healthcare:
document_types:
invoice: "Claim"
payment: "Remittance Advice"
receipt: "Patient Payment"
accounts:
ar: "Patient Accounts Receivable"
revenue: "Net Patient Service Revenue"
contractual: "Contractual Allowance"
transactions:
billing: "Charge Capture"
collection: "Payment Posting"
# Similar for other industries...
Summary
This research document series provides a comprehensive analysis of improvement opportunities for the SyntheticData system. Key themes across all documents:
- Depth over breadth: Enhance existing features rather than adding new surface-level capabilities
- Correlation modeling: Move from independent generation to correlated, interconnected data
- Temporal realism: Add dynamic behavior that evolves over time
- Domain authenticity: Use real industry terminology, patterns, and regulations
- Detection-aware design: Generate data that enables meaningful ML training and evaluation
The recommended implementation approach is phased, starting with high-impact, lower-complexity enhancements and building toward more sophisticated modeling over time.
End of Research Document Series
Total documents: 8
Research conducted: January 2026
System version analyzed: 0.2.3