Quick Start

This guide walks you through generating your first synthetic financial dataset.

Overview

The typical workflow is:

1. Initialize a configuration file
2. Review and customize the configuration
3. Validate the configuration
4. Generate synthetic data
5. Explore the output
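
The workflow maps onto the three CLI invocations used in the steps of this guide. As a minimal sketch, the Python helper below builds those commands as argument lists so the workflow can be scripted. The function name `workflow_commands` and its default paths are illustrative and not part of the tool. It prints the commands instead of running them, so it works without `datasynth-data` installed.

```python
import shlex


def workflow_commands(industry, complexity,
                      config_path="config.yaml", output_dir="./output"):
    """Return the init, validate, and generate invocations as argument lists."""
    return [
        ["datasynth-data", "init", "--industry", industry,
         "--complexity", complexity, "-o", config_path],
        ["datasynth-data", "validate", "--config", config_path],
        ["datasynth-data", "generate", "--config", config_path,
         "--output", output_dir],
    ]


# Print each command as a shell-safe string; nothing is executed here.
for cmd in workflow_commands("manufacturing", "medium"):
    print(shlex.join(cmd))
```

To actually run the workflow, each list could be passed to `subprocess.run(cmd, check=True)`, which stops at the first command that exits with an error.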

Step 1: Initialize Configuration

Create a configuration file from an industry preset and a complexity level:

# Manufacturing company with medium complexity (~400 accounts)
datasynth-data init --industry manufacturing --complexity medium -o config.yaml

Available Industry Presets

Industry	Description
`manufacturing`	Production, inventory, cost accounting
`retail`	Sales, inventory, customer transactions
`financial_services`	Banking, investments, regulatory reporting
`healthcare`	Patient revenue, medical supplies, compliance
`technology`	R&D, SaaS revenue, deferred revenue

Complexity Levels

Level	Accounts	Description
`small`	~100	Simple chart of accounts
`medium`	~400	Typical mid-size company
`large`	~2500	Enterprise-scale structure

Step 2: Review Configuration

Open config.yaml to review and customize the generated settings:

global:
  seed: 42                        # For reproducible generation
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
  group_currency: USD

companies:
  - code: "1000"
    name: "Headquarters"
    currency: USD
    country: US
    volume_weight: 1.0

transactions:
  target_count: 100000            # Number of journal entries

fraud:
  enabled: true
  fraud_rate: 0.005               # 0.5% of transactions are fraudulent

output:
  format: csv
  compression: none

See the Configuration Guide for all options.
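
To get a feel for what `fraud_rate` means at this volume, the arithmetic can be sketched in a few lines of Python. This assumes the rate is applied as a fraction of `target_count`; the generator may sample fraud stochastically, so the actual count in a run could differ. The helper name `expected_fraud_entries` is hypothetical.

```python
def expected_fraud_entries(target_count, fraud_rate):
    """Expected number of fraudulent entries, assuming fraud_rate is a
    fraction of target_count (an assumption, not confirmed tool behavior)."""
    return round(target_count * fraud_rate)


# Values taken from the sample config.yaml above.
print(expected_fraud_entries(100_000, 0.005))
```

With the sample values, 0.5% of 100,000 entries works out to about 500 fraudulent entries.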

Step 3: Validate Configuration

Check your configuration for errors:

datasynth-data validate --config config.yaml

The validator checks:

Required fields are present
Values are within valid ranges
Distribution weights sum to 1.0
Dates are consistent
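
The checks listed above can be illustrated with a small Python sketch that operates on a configuration already loaded into a dictionary. It is not the validator's implementation: the choice of required sections, the use of `volume_weight` as the weight that must sum to 1.0, and the `period_months` rule are all assumptions made for illustration.

```python
import math


def check_config(cfg):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Required fields are present (hypothetical choice of required sections).
    for section in ("global", "companies", "transactions"):
        if section not in cfg:
            problems.append(f"missing section: {section}")

    # Values are within valid ranges.
    rate = cfg.get("fraud", {}).get("fraud_rate", 0.0)
    if not 0.0 <= rate <= 1.0:
        problems.append(f"fraud_rate out of range: {rate}")

    # Distribution weights sum to 1.0 (volume_weight used as the example).
    weights = [c.get("volume_weight", 0.0) for c in cfg.get("companies", [])]
    if weights and not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        problems.append(f"volume_weight values sum to {sum(weights)}, expected 1.0")

    # Dates are consistent (here reduced to: period_months must be positive).
    if cfg.get("global", {}).get("period_months", 1) <= 0:
        problems.append("period_months must be positive")

    return problems
```

Collecting every problem in a list, instead of stopping at the first one, lets a single pass report all issues in a configuration.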

Step 4: Generate Data

Run the generation:

datasynth-data generate --config config.yaml --output ./output

You’ll see a progress bar similar to this (the elapsed time depends on your hardware and configuration):

Generating synthetic data...
[████████████████████████████████] 100000/100000 entries
Generation complete in 1.2s

Step 5: Explore Output

The output directory is organized into subdirectories by data category:

output/
├── master_data/
│   ├── vendors.csv
│   ├── customers.csv
│   ├── materials.csv
│   └── employees.csv
├── transactions/
│   ├── journal_entries.csv
│   ├── acdoca.csv
│   ├── purchase_orders.csv
│   └── vendor_invoices.csv
├── subledgers/
│   ├── ar_open_items.csv
│   └── ap_open_items.csv
├── period_close/
│   └── trial_balances/
├── labels/
│   ├── anomaly_labels.csv
│   └── fraud_labels.csv
└── controls/
    └── internal_controls.csv
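
A quick way to review a finished run is to count the rows in each generated file. The sketch below walks an output directory and reports data rows per CSV using only the Python standard library. It assumes every CSV starts with a header row, which is not confirmed here, and the function name `summarize_output` is illustrative.

```python
import csv
from pathlib import Path


def summarize_output(output_dir):
    """Map each CSV path (relative to output_dir) to its number of data rows."""
    root = Path(output_dir)
    summary = {}
    for path in sorted(root.rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the assumed header row
            summary[path.relative_to(root).as_posix()] = sum(1 for _ in reader)
    return summary
```

Calling `summarize_output("./output")` after Step 4 would return entries keyed by paths such as `transactions/journal_entries.csv`, which makes it easy to compare the journal entry count against `target_count`.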

Common Customizations

Generate More Data

transactions:
  target_count: 1000000           # 1 million entries

Enable Graph Export

graph_export:
  enabled: true
  formats:
    - pytorch_geometric
    - neo4j

Add Anomaly Injection

anomaly_injection:
  enabled: true
  total_rate: 0.02                # 2% anomaly rate
  generate_labels: true           # For ML training

Multiple Companies

companies:
  - code: "1000"
    name: "Headquarters"
    currency: USD
    volume_weight: 0.6

  - code: "2000"
    name: "European Subsidiary"
    currency: EUR
    volume_weight: 0.4
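
If `volume_weight` is read as each company's proportional share of transaction volume (an assumption based on the field name, not confirmed behavior), the split for the two companies above can be estimated as follows. The helper `split_by_weight` is hypothetical.

```python
def split_by_weight(target_count, weights):
    """Split target_count across company codes in proportion to their weights."""
    total = sum(weights.values())
    return {code: round(target_count * w / total) for code, w in weights.items()}


# Weights from the two-company example; target_count from the sample config.
shares = split_by_weight(100_000, {"1000": 0.6, "2000": 0.4})
print(shares)
```

Under that assumption, a `target_count` of 100,000 would give roughly 60,000 entries to company 1000 and 40,000 to company 2000.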

Next Steps

Explore Demo Mode for built-in presets
Consult the CLI Reference
Review Output Formats
See Configuration for all options

Quick Reference

# Common commands
datasynth-data init --industry <industry> --complexity <level> -o config.yaml
datasynth-data validate --config config.yaml
datasynth-data generate --config config.yaml --output ./output
datasynth-data generate --demo --output ./demo-output
datasynth-data info                   # Show available presets
