CLI Reference

The datasynth-data command-line tool provides commands for generating synthetic financial data and extracting fingerprints from real data.

Installation

After building the project, the binary is at target/release/datasynth-data.

cargo build --release
./target/release/datasynth-data --help

Global Options

| Option | Description |
|---|---|
| -h, --help | Show help information |
| -V, --version | Show version number |
| -v, --verbose | Enable verbose output |
| -q, --quiet | Suppress non-error output |

Commands

generate

Generate synthetic financial data.

datasynth-data generate [OPTIONS]

Options:

| Option | Type | Description |
|---|---|---|
| --config <PATH> | Path | Configuration YAML file |
| --demo | Flag | Use demo preset instead of config |
| --output <DIR> | Path | Output directory (required) |
| --format <FMT> | String | Output format: csv, json |
| --seed <NUM> | u64 | Override random seed |

Examples:

# Generate with configuration file
datasynth-data generate --config config.yaml --output ./output

# Use demo mode
datasynth-data generate --demo --output ./demo-output

# Override seed for reproducibility
datasynth-data generate --config config.yaml --output ./output --seed 12345

# JSON output format
datasynth-data generate --config config.yaml --output ./output --format json

init

Create a new configuration file from industry presets.

datasynth-data init [OPTIONS]

Options:

| Option | Type | Description |
|---|---|---|
| --industry <NAME> | String | Industry preset |
| --complexity <LEVEL> | String | small, medium, large |
| -o, --output <PATH> | Path | Output file path |

Available Industries:

  • manufacturing
  • retail
  • financial_services
  • healthcare
  • technology

Examples:

# Create manufacturing config
datasynth-data init --industry manufacturing --complexity medium -o config.yaml

# Create large retail config
datasynth-data init --industry retail --complexity large -o retail.yaml

validate

Validate a configuration file.

datasynth-data validate --config <PATH>

Options:

| Option | Type | Description |
|---|---|---|
| --config <PATH> | Path | Configuration file to validate |

Example:

datasynth-data validate --config config.yaml

Validation Checks:

  • Required fields present
  • Value ranges (period_months: 1-120)
  • Distribution weights sum to 1.0 (±0.01 tolerance)
  • Date consistency
  • Company code uniqueness
  • Compression level: 1-9 when enabled
  • All rate/percentage fields: 0.0-1.0
  • Approval thresholds: strictly ascending order
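
The distribution-weight rule can also be checked by hand before running validate. A minimal sketch (hypothetical helper, not a CLI feature), assuming the documented ±0.01 tolerance:

```shell
# Check that distribution weights sum to 1.0 within a 0.01 tolerance
# (hypothetical helper, not part of datasynth-data).
check_weights() {
  echo "$@" | awk '{
    s = 0
    for (i = 1; i <= NF; i++) s += $i
    if (s >= 0.99 && s <= 1.01) print "ok"
    else printf "weights sum to %.2f, expected 1.00\n", s
  }'
}

check_weights 0.5 0.3 0.2    # within tolerance
check_weights 0.5 0.3 0.1    # off by 0.10
```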

info

Display available presets and configuration options.

datasynth-data info

Output includes:

  • Available industry presets
  • Complexity levels
  • Supported output formats
  • Feature capabilities

fingerprint

Privacy-preserving fingerprint extraction and evaluation. This command has several subcommands.

datasynth-data fingerprint <SUBCOMMAND>

fingerprint extract

Extract a fingerprint from real data with privacy controls.

datasynth-data fingerprint extract [OPTIONS]

Options:

| Option | Type | Description |
|---|---|---|
| --input <PATH> | Path | Input CSV data file (required) |
| --output <PATH> | Path | Output .dsf fingerprint file (required) |
| --privacy-level <LEVEL> | String | Privacy level: minimal, standard, high, maximum |
| --epsilon <FLOAT> | f64 | Custom differential privacy epsilon (overrides level) |
| --k <INT> | usize | Custom k-anonymity threshold (overrides level) |

Privacy Levels:

| Level | Epsilon | k | Outlier % | Use Case |
|---|---|---|---|---|
| minimal | 5.0 | 3 | 99% | Low privacy, high utility |
| standard | 1.0 | 5 | 95% | Balanced (default) |
| high | 0.5 | 10 | 90% | Higher privacy |
| maximum | 0.1 | 20 | 85% | Maximum privacy |

Examples:

# Extract with standard privacy
datasynth-data fingerprint extract \
    --input ./real_data.csv \
    --output ./fingerprint.dsf \
    --privacy-level standard

# Extract with custom privacy parameters
datasynth-data fingerprint extract \
    --input ./real_data.csv \
    --output ./fingerprint.dsf \
    --epsilon 0.75 \
    --k 8

fingerprint validate

Validate a fingerprint file’s integrity and structure.

datasynth-data fingerprint validate <PATH>

Arguments:

| Argument | Type | Description |
|---|---|---|
| <PATH> | Path | Path to .dsf fingerprint file |

Validation Checks:

  • DSF file structure (ZIP archive with required components)
  • SHA-256 checksums for all components
  • Required fields in manifest, schema, statistics
  • Privacy audit completeness

Example:

datasynth-data fingerprint validate ./fingerprint.dsf
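
The checksum check works like ordinary manifest verification. A minimal sketch of the idea using sha256sum (illustrative only; the real components live inside the .dsf ZIP archive, and the manifest layout shown here is made up):

```shell
# Illustrate per-component checksum verification, as DSF validation does
# for each component (sketch only; real components sit inside the ZIP).
mkdir -p /tmp/dsf-demo && cd /tmp/dsf-demo
echo '{"version": "1.0"}' > manifest.json
sha256sum manifest.json > checksums.sha256
sha256sum -c checksums.sha256
```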

fingerprint info

Display fingerprint metadata and statistics.

datasynth-data fingerprint info <PATH> [OPTIONS]

Arguments:

| Argument | Type | Description |
|---|---|---|
| <PATH> | Path | Path to .dsf fingerprint file |

Options:

| Option | Type | Description |
|---|---|---|
| --detailed | Flag | Show detailed statistics |
| --json | Flag | Output as JSON |

Examples:

# Basic info
datasynth-data fingerprint info ./fingerprint.dsf

# Detailed statistics
datasynth-data fingerprint info ./fingerprint.dsf --detailed

# JSON output for scripting
datasynth-data fingerprint info ./fingerprint.dsf --json

fingerprint diff

Compare two fingerprints.

datasynth-data fingerprint diff <PATH1> <PATH2>

Arguments:

| Argument | Type | Description |
|---|---|---|
| <PATH1> | Path | First .dsf fingerprint file |
| <PATH2> | Path | Second .dsf fingerprint file |

Output includes:

  • Schema differences (columns added/removed/changed)
  • Statistical distribution changes
  • Correlation matrix differences

Example:

datasynth-data fingerprint diff ./fp_v1.dsf ./fp_v2.dsf

fingerprint evaluate

Evaluate synthetic data fidelity against a fingerprint.

datasynth-data fingerprint evaluate [OPTIONS]

Options:

| Option | Type | Description |
|---|---|---|
| --fingerprint <PATH> | Path | Reference .dsf fingerprint file (required) |
| --synthetic <PATH> | Path | Directory containing synthetic data (required) |
| --threshold <FLOAT> | f64 | Minimum fidelity score (0.0-1.0, default 0.8) |
| --report <PATH> | Path | Output report file (HTML or JSON based on extension) |

Fidelity Metrics:

  • Statistical: KS statistic, Wasserstein distance, Benford MAD
  • Correlation: Correlation matrix RMSE
  • Schema: Column type match, row count ratio
  • Rules: Balance equation compliance rate

Examples:

# Basic evaluation
datasynth-data fingerprint evaluate \
    --fingerprint ./fingerprint.dsf \
    --synthetic ./synthetic_data/ \
    --threshold 0.8

# Generate HTML report
datasynth-data fingerprint evaluate \
    --fingerprint ./fingerprint.dsf \
    --synthetic ./synthetic_data/ \
    --threshold 0.85 \
    --report ./fidelity_report.html
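
Of the statistical metrics above, Benford MAD is straightforward to sketch by hand: compare observed first-digit frequencies against Benford's law, log10(1 + 1/d), and average the absolute deviations. This is an illustration of the general technique; the evaluator's exact formula and sample handling may differ:

```shell
# Mean absolute deviation of first-digit frequencies from Benford's law
# (illustrative sketch, not the evaluator's implementation).
benford_mad() {
  awk '
    { d = substr($0, 1, 1); count[d]++; n++ }
    END {
      mad = 0
      for (d = 1; d <= 9; d++) {
        expected = log(1 + 1 / d) / log(10)   # Benford probability of digit d
        observed = count[d] / n
        diff = observed - expected
        mad += (diff < 0 ? -diff : diff)
      }
      printf "MAD = %.4f\n", mad / 9
    }'
}

printf '123\n234\n150\n987\n111\n205\n' | benford_mad
```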

diffusion (v0.5.0)

Train and evaluate diffusion models for statistical data generation.

diffusion train

Train a diffusion model from a fingerprint file.

datasynth-data diffusion train \
    --fingerprint ./fingerprint.dsf \
    --output ./model.json \
    --n-steps 1000 \
    --schedule cosine

| Option | Type | Default | Description |
|---|---|---|---|
| --fingerprint | path | (required) | Path to .dsf fingerprint file |
| --output | path | (required) | Output path for trained model |
| --n-steps | integer | 1000 | Number of diffusion steps |
| --schedule | string | linear | Noise schedule: linear, cosine, sigmoid |
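
The schedule names refer to the shape of the noise schedule. As an illustration only, here is a cosine schedule in the common Nichol & Dhariwal form; the trainer's actual formula is not documented here and may differ:

```shell
# Cumulative signal level alpha_bar(t) under a cosine noise schedule
# (illustrative; assumes the standard cosine form with offset s = 0.008).
cosine_schedule() {
  awk 'BEGIN {
    pi = atan2(0, -1); T = 10; s = 0.008
    for (t = 0; t <= T; t++) {
      f = cos(((t / T + s) / (1 + s)) * pi / 2)
      printf "t=%02d alpha_bar=%.4f\n", t, f * f
    }
  }'
}

cosine_schedule
```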

diffusion evaluate

Evaluate a trained diffusion model’s fit quality.

datasynth-data diffusion evaluate \
    --model ./model.json \
    --samples 5000

| Option | Type | Default | Description |
|---|---|---|---|
| --model | path | (required) | Path to trained model |
| --samples | integer | 1000 | Number of evaluation samples |

causal (v0.5.0)

Generate data with causal structure, run interventions, and produce counterfactuals.

causal generate

Generate data following a causal graph structure.

datasynth-data causal generate \
    --template fraud_detection \
    --samples 10000 \
    --seed 42 \
    --output ./causal_output

| Option | Type | Default | Description |
|---|---|---|---|
| --template | string | (required) | Built-in template (fraud_detection, revenue_cycle) or path to custom YAML |
| --samples | integer | 1000 | Number of samples to generate |
| --seed | integer | (random) | Random seed for reproducibility |
| --output | path | (required) | Output directory |

causal intervene

Run do-calculus interventions (“what-if” scenarios).

datasynth-data causal intervene \
    --template fraud_detection \
    --variable transaction_amount \
    --value 50000 \
    --samples 5000 \
    --output ./intervention_output

| Option | Type | Default | Description |
|---|---|---|---|
| --template | string | (required) | Causal template or YAML path |
| --variable | string | (required) | Variable to intervene on |
| --value | float | (required) | Value to set the variable to |
| --samples | integer | 1000 | Number of intervention samples |
| --output | path | (required) | Output directory |

causal validate

Validate that generated data preserves causal structure.

datasynth-data causal validate \
    --data ./causal_output \
    --template fraud_detection

| Option | Type | Default | Description |
|---|---|---|---|
| --data | path | (required) | Path to generated data |
| --template | string | (required) | Causal template to validate against |

fingerprint federated (v0.5.0)

Aggregate fingerprints from multiple distributed sources without centralizing raw data.

datasynth-data fingerprint federated \
    --sources ./source_a.dsf ./source_b.dsf ./source_c.dsf \
    --output ./aggregated.dsf \
    --method weighted_average \
    --max-epsilon 5.0

| Option | Type | Default | Description |
|---|---|---|---|
| --sources | paths | (required) | Two or more .dsf fingerprint files |
| --output | path | (required) | Output path for aggregated fingerprint |
| --method | string | weighted_average | Aggregation method: weighted_average, median, trimmed_mean |
| --max-epsilon | float | 5.0 | Maximum epsilon budget per source |
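
The aggregation methods differ in outlier robustness; trimmed_mean, for example, discards extremes before averaging. A sketch under the assumption that exactly one value is trimmed from each end (the tool's actual trim fraction is not documented here):

```shell
# Trimmed mean: sort, drop the lowest and highest value, average the rest
# (one common definition; illustrative only).
trimmed_mean() {
  tr ' ' '\n' | sort -n | awk '
    { v[NR] = $1 }
    END { s = 0; for (i = 2; i < NR; i++) s += v[i]; printf "%.2f\n", s / (NR - 2) }'
}

echo "1.0 1.2 1.1 1.3 9.0" | trimmed_mean   # the 9.0 outlier is dropped
```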

init --from-description (v0.5.0)

Generate configuration from a natural language description using LLM.

datasynth-data init \
    --from-description "Generate 1 year of retail data for a mid-size US company with fraud patterns" \
    -o config.yaml

Uses the configured LLM provider (defaults to Mock) to parse the description and generate an appropriate YAML configuration.

generate --certificate (v0.5.0)

Attach a synthetic data certificate to the generated output.

datasynth-data generate \
    --config config.yaml \
    --output ./output \
    --certificate

Produces a certificate.json in the output directory containing DP guarantees, quality metrics, and an HMAC-SHA256 signature.
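
An HMAC-SHA256 signature of this kind can be reproduced with standard tooling. In this sketch the key name and payload fields are made up; consult the actual certificate schema for the real signed content and key management:

```shell
# Compute an HMAC-SHA256 over a certificate-like payload (illustrative only;
# "demo-key" and the payload fields are hypothetical).
payload='{"dp_epsilon":1.0,"quality_score":0.92}'
printf '%s' "$payload" | openssl dgst -sha256 -hmac "demo-key" -r | cut -d' ' -f1
```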

Signal Handling (Unix)

On Unix systems, you can pause and resume generation:

# Start generation in background
datasynth-data generate --config config.yaml --output ./output &

# Pause generation
kill -USR1 $(pgrep datasynth-data)

# Resume generation (send SIGUSR1 again)
kill -USR1 $(pgrep datasynth-data)
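
Internally this behaves like a signal-driven toggle. A minimal standalone sketch of the pattern (not the tool's actual implementation):

```shell
# Toggle a paused flag each time SIGUSR1 arrives (pattern sketch only).
paused=0
on_usr1() { if [ "$paused" -eq 0 ]; then paused=1; else paused=0; fi; }
trap on_usr1 USR1

kill -USR1 $$          # first signal: pause
echo "paused=$paused"
kill -USR1 $$          # second signal: resume
echo "paused=$paused"
```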

Exit Codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Configuration error |
| 3 | I/O error |
| 4 | Validation error |
| 5 | Fingerprint error |
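
In scripts, these codes can be mapped to readable messages. A hypothetical helper (not provided by the tool) built from the table above:

```shell
# Map datasynth-data exit codes to messages (codes from the table above;
# helper function is hypothetical, not part of the CLI).
describe_exit() {
  case "$1" in
    0) echo "Success" ;;
    1) echo "General error" ;;
    2) echo "Configuration error" ;;
    3) echo "I/O error" ;;
    4) echo "Validation error" ;;
    5) echo "Fingerprint error" ;;
    *) echo "Unknown exit code: $1" ;;
  esac
}

describe_exit 2
```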

Environment Variables

| Variable | Description |
|---|---|
| SYNTH_DATA_LOG | Log level (error, warn, info, debug, trace) |
| SYNTH_DATA_THREADS | Number of worker threads |

Example:

SYNTH_DATA_LOG=debug datasynth-data generate --config config.yaml --output ./output

Configuration File Location

The tool looks for configuration files in this order:

  1. Path specified with --config
  2. ./datasynth-data.yaml in current directory
  3. ~/.config/datasynth-data/config.yaml
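
The same precedence can be mirrored in wrapper scripts. A sketch (hypothetical helper, not provided by the tool) that checks each location in the documented order:

```shell
# Resolve a config path using the documented lookup order
# (hypothetical helper; the CLI performs this resolution internally).
find_config() {
  if [ -n "$1" ] && [ -f "$1" ]; then echo "$1"; return 0; fi
  if [ -f "./datasynth-data.yaml" ]; then echo "./datasynth-data.yaml"; return 0; fi
  if [ -f "$HOME/.config/datasynth-data/config.yaml" ]; then
    echo "$HOME/.config/datasynth-data/config.yaml"; return 0
  fi
  echo "no configuration file found" >&2; return 1
}
```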

Output Directory Structure

Generation creates this structure:

output/
├── master_data/          Vendors, customers, materials, assets, employees
├── transactions/         Journal entries, purchase orders, invoices, payments
├── subledgers/           AR, AP, FA, inventory detail records
├── period_close/         Trial balances, accruals, closing entries
├── consolidation/        Eliminations, currency translation
├── fx/                   Exchange rates, CTA adjustments
├── banking/              KYC profiles, bank transactions, AML typology labels
├── process_mining/       OCEL 2.0 event logs, process variants
├── audit/                Engagements, workpapers, findings, risk assessments
├── graphs/               PyTorch Geometric, Neo4j, DGL exports (if enabled)
├── labels/               Anomaly, fraud, and data quality labels for ML
└── controls/             Internal control mappings, SoD rules

Scripting Examples

Batch Generation

#!/bin/bash
for industry in manufacturing retail healthcare; do
    datasynth-data init --industry $industry --complexity medium -o ${industry}.yaml
    datasynth-data generate --config ${industry}.yaml --output ./output/${industry}
done

CI/CD Pipeline

# GitHub Actions example
- name: Generate Test Data
  run: |
    cargo build --release
    ./target/release/datasynth-data generate --demo --output ./test-data

- name: Validate Generation
  run: |
    # Check output files exist
    test -f ./test-data/transactions/journal_entries.csv
    test -f ./test-data/master_data/vendors.csv

Reproducible Generation

# Same seed produces identical output
datasynth-data generate --config config.yaml --output ./run1 --seed 42
datasynth-data generate --config config.yaml --output ./run2 --seed 42
diff -r run1 run2  # No differences

Fingerprint Pipeline

#!/bin/bash
# Extract fingerprint from real data
datasynth-data fingerprint extract \
    --input ./real_data.csv \
    --output ./fingerprint.dsf \
    --privacy-level high

# Validate the fingerprint
datasynth-data fingerprint validate ./fingerprint.dsf

# Generate synthetic data matching the fingerprint
# (fingerprint informs config generation)
datasynth-data generate --config generated_config.yaml --output ./synthetic

# Evaluate fidelity
datasynth-data fingerprint evaluate \
    --fingerprint ./fingerprint.dsf \
    --synthetic ./synthetic \
    --threshold 0.85 \
    --report ./fidelity_report.html

Troubleshooting

Common Issues

“Configuration file not found”

# Check file path
ls -la config.yaml
# Use absolute path
datasynth-data generate --config /full/path/to/config.yaml --output ./output

“Invalid configuration”

# Validate first
datasynth-data validate --config config.yaml

“Permission denied”

# Check output directory permissions
mkdir -p ./output
chmod 755 ./output

“Out of memory”

The generator includes memory guards designed to prevent out-of-memory conditions. If you still encounter issues:

  • Reduce the transaction count in the configuration
  • Check the memory_guard settings in the configuration
  • Note that batch sizes are reduced automatically under memory pressure, so long runs may slow down rather than fail

“Fingerprint validation failed”

# Check DSF file integrity
datasynth-data fingerprint validate ./fingerprint.dsf

# View detailed info
datasynth-data fingerprint info ./fingerprint.dsf --detailed

“Low fidelity score”

If synthetic data fidelity is below threshold:

  • Review the fidelity report for specific metrics
  • Adjust configuration to better match fingerprint statistics
  • Consider using the evaluation framework’s auto-tuning recommendations

See Also