Configuration

SyntheticData uses YAML configuration files (JSON is also accepted) to control all aspects of data generation.

Quick Start

# Create configuration from preset
datasynth-data init --industry manufacturing --complexity medium -o config.yaml

# Validate configuration
datasynth-data validate --config config.yaml

# Generate with configuration
datasynth-data generate --config config.yaml --output ./output

Configuration Sections

| Section | Description |
| --- | --- |
| Global Settings | Industry, dates, seed, performance |
| Companies | Company codes, currencies, volume weights |
| Transactions | Line items, amounts, sources |
| Master Data | Vendors, customers, materials, assets |
| Document Flows | P2P, O2C configuration |
| Financial Settings | Balance, subledger, FX, period close |
| Compliance | Fraud, controls, approval |
| AI & ML Features | LLM, diffusion, causal, certificates |
| Output Settings | Format, compression |
| Source-to-Pay | S2C sourcing pipeline (projects, RFx, bids, contracts, catalogs, scorecards) |
| Financial Reporting | Financial statements, bank reconciliation, management KPIs, budgets |
| HR | Payroll runs, time entries, expense reports |
| Manufacturing | Production orders, quality inspections, cycle counts |
| Sales Quotes | Quote-to-order pipeline |
| Accounting Standards | Revenue recognition (ASC 606/IFRS 15), impairment testing |

Reference

Minimal Configuration

global:
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12

transactions:
  target_count: 10000

output:
  format: csv

Full Configuration Example

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
  group_currency: USD

companies:
  - code: "1000"
    name: "Headquarters"
    currency: USD
    country: US
    volume_weight: 0.6
  - code: "2000"
    name: "European Subsidiary"
    currency: EUR
    country: DE
    volume_weight: 0.4

chart_of_accounts:
  complexity: medium

transactions:
  target_count: 100000
  line_items:
    distribution: empirical
  amounts:
    min: 100
    max: 1000000

master_data:
  vendors:
    count: 200
  customers:
    count: 500
  materials:
    count: 1000

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.3
  o2c:
    enabled: true
    flow_rate: 0.3

fraud:
  enabled: true
  fraud_rate: 0.005

anomaly_injection:
  enabled: true
  total_rate: 0.02
  generate_labels: true

graph_export:
  enabled: true
  formats:
    - pytorch_geometric
    - neo4j

# AI & ML Features (v0.5.0)
diffusion:
  enabled: true
  n_steps: 1000
  schedule: "cosine"
  sample_size: 1000

causal:
  enabled: true
  template: "fraud_detection"
  sample_size: 1000
  validate: true

certificates:
  enabled: true
  issuer: "DataSynth"
  include_quality_metrics: true

# Enterprise Process Chains (v0.6.0)
source_to_pay:
  enabled: true
  projects_per_period: 5
  avg_bids_per_rfx: 4
  contract_award_rate: 0.75
  catalog_items_per_contract: 10

financial_reporting:
  enabled: true
  generate_balance_sheet: true
  generate_income_statement: true
  generate_cash_flow: true
  generate_changes_in_equity: true
  management_kpis:
    enabled: true
  budgets:
    enabled: true
    variance_threshold: 0.10

hr:
  enabled: true
  payroll_frequency: monthly
  time_tracking: true
  expense_reports: true

manufacturing:
  enabled: true
  production_orders_per_period: 20
  quality_inspection_rate: 0.30
  cycle_count_frequency: quarterly

sales_quotes:
  enabled: true
  quotes_per_period: 15
  conversion_rate: 0.35

output:
  format: csv
  compression: none

Configuration Loading

Configuration can be loaded from:

  1. YAML file (recommended):

    datasynth-data generate --config config.yaml --output ./output
    
  2. JSON file:

    datasynth-data generate --config config.json --output ./output
    
  3. Demo preset:

    datasynth-data generate --demo --output ./output
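
For the JSON option, the following is a minimal sketch that writes the minimal configuration from this page as JSON text. It assumes JSON files use the same keys and nesting as the YAML form and that dates are written as strings; this page does not spell out either point. The helper name `to_json_config` is hypothetical.

```python
import json

# The minimal configuration from this page, expressed as a Python dict.
# Assumption: the JSON form mirrors the YAML keys one-to-one.
minimal_config = {
    "global": {
        "industry": "manufacturing",
        "start_date": "2024-01-01",  # assumption: dates are plain strings in JSON
        "period_months": 12,
    },
    "transactions": {"target_count": 10000},
    "output": {"format": "csv"},
}


def to_json_config(config: dict) -> str:
    """Serialize a configuration dict as indented JSON text."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Redirect this output to config.json, then check it with the validate command.
    print(to_json_config(minimal_config))
```

Running `datasynth-data validate --config config.json` on the result would be a sensible check before generating data.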
    

Validation

The configuration is validated for:

| Rule | Description |
| --- | --- |
| Required fields | All mandatory fields must be present |
| Value ranges | Numbers within valid bounds |
| Distributions | Weights sum to 1.0 (±0.01 tolerance) |
| Dates | Valid date ranges |
| Uniqueness | Company codes must be unique |
| Consistency | Cross-field validations |

Run validation:

datasynth-data validate --config config.yaml
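
To make two of these rules concrete, here is a rough Python sketch of the uniqueness check and the weight-sum check. It is an illustration only, not the tool's actual validator. The function name `check_config` is hypothetical, and treating `companies[].volume_weight` as one of the weight sets covered by the distribution rule is an assumption.

```python
from typing import Any


def check_config(config: dict[str, Any], tolerance: float = 0.01) -> list[str]:
    """Return a list of problems; an empty list means these two checks passed."""
    problems: list[str] = []
    companies = config.get("companies", [])

    # Uniqueness: company codes must not repeat.
    codes = [company["code"] for company in companies]
    duplicates = sorted({code for code in codes if codes.count(code) > 1})
    if duplicates:
        problems.append(f"duplicate company codes: {duplicates}")

    # Distributions: weights must sum to 1.0 within the tolerance.
    # Assumption: volume_weight is one of the weight sets this rule covers.
    total = sum(company.get("volume_weight", 0.0) for company in companies)
    if companies and abs(total - 1.0) > tolerance:
        problems.append(
            f"volume_weight values sum to {total:.3f}, expected 1.0 +/- {tolerance}"
        )

    return problems
```

The two companies in the full example above (weights 0.6 and 0.4, codes "1000" and "2000") pass both checks.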

Overriding Values

Command-line options override configuration file values:

# Override seed
datasynth-data generate --config config.yaml --seed 12345 --output ./output

# Override format
datasynth-data generate --config config.yaml --format json --output ./output
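
The precedence described above can be pictured as a merge in which any value given on the command line replaces the one from the file. This is a minimal sketch under stated assumptions: `apply_cli_overrides` is a hypothetical helper, and mapping `--seed` to `global.seed` and `--format` to `output.format` is inferred from the example configuration rather than confirmed by this page.

```python
import copy
from typing import Any, Optional


def apply_cli_overrides(
    config: dict[str, Any],
    seed: Optional[int] = None,
    output_format: Optional[str] = None,
) -> dict[str, Any]:
    """Return a copy of config with any supplied command-line values applied on top."""
    merged = copy.deepcopy(config)  # leave the file-based config untouched
    if seed is not None:
        # Assumption: --seed corresponds to global.seed.
        merged.setdefault("global", {})["seed"] = seed
    if output_format is not None:
        # Assumption: --format corresponds to output.format.
        merged.setdefault("output", {})["format"] = output_format
    return merged
```

Options that are not passed stay at `None`, so the file values survive unchanged.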

Environment Variables

Some settings can be controlled via environment variables:

| Variable | Configuration Equivalent |
| --- | --- |
| SYNTH_DATA_SEED | global.seed |
| SYNTH_DATA_THREADS | global.worker_threads |
| SYNTH_DATA_MEMORY_LIMIT | global.memory_limit |
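
The mapping in this table can be expressed as a small lookup. The sketch below only collects the variables that are set and groups them by configuration section. The function name `env_overrides` is hypothetical. Values are kept as strings because this page does not state their types or units, nor how environment variables rank against file values and command-line options.

```python
import os
from typing import Mapping

# Variable names and config paths as listed in the table above.
ENV_TO_CONFIG = {
    "SYNTH_DATA_SEED": ("global", "seed"),
    "SYNTH_DATA_THREADS": ("global", "worker_threads"),
    "SYNTH_DATA_MEMORY_LIMIT": ("global", "memory_limit"),
}


def env_overrides(environ: Mapping[str, str] = os.environ) -> dict[str, dict[str, str]]:
    """Collect the recognized variables that are set, grouped by config section."""
    overrides: dict[str, dict[str, str]] = {}
    for name, (section, key) in ENV_TO_CONFIG.items():
        if name in environ:
            # Values stay as raw strings; parsing is left to the caller.
            overrides.setdefault(section, {})[key] = environ[name]
    return overrides
```

Passing a plain dict instead of `os.environ` makes the function easy to try without touching the real environment.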
