LLM Training Data

New in v0.5.0

Generate LLM-enriched synthetic financial data for training and fine-tuning language models on domain-specific tasks.

When to Use LLM-Enriched Data

Fine-tuning: Train financial document understanding models on realistic data
RAG evaluation: Test retrieval-augmented generation with known-truth synthetic documents
Classification training: Generate labeled financial text for transaction categorization
Anomaly explanation: Train models to explain financial anomalies in natural language

Quality vs Cost Tradeoffs

Provider	Quality	Cost	Latency	Reproducibility
Mock	Good (template-based)	Free	Instant	Fully deterministic
gpt-4o-mini	High	~$0.15/1M tokens	~200ms/req	Seed-based
gpt-4o	Very High	~$2.50/1M tokens	~500ms/req	Seed-based
Claude (Anthropic)	Very High	Varies	~300ms/req	Seed-based
Self-hosted	Varies	Infrastructure cost	Varies	Full control

Using the Mock Provider for CI/CD

The mock provider generates deterministic, contextually-aware text without any API calls:

# Default: uses mock provider (no API key needed)
datasynth-data generate --config config.yaml --output ./output

# Explicit mock configuration
llm:
  provider: mock

The mock provider is suitable for:

CI/CD pipelines
Automated testing
Reproducible research
Development environments

Using Real LLM Providers

For production-quality enrichment:

llm:
  provider: openai
  model: "gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"
  cache_enabled: true       # Avoid duplicate API calls
  max_retries: 3
  timeout_secs: 30

export OPENAI_API_KEY="sk-..."
datasynth-data generate --config config.yaml --output ./output

Batch Generation for Large Datasets

For large-scale enrichment, use batch mode to minimize API overhead:

from datasynth_py import DataSynth, Config
from datasynth_py.config import blueprints

# Generate base data first (fast, rule-based)
config = blueprints.manufacturing_large(transactions=100000)
synth = DataSynth()
result = synth.generate(config=config, output={"format": "csv", "sink": "temp_dir"})

# Then enrich with LLM in a separate pass if needed

Example: Financial Document Understanding

Generate training data for a document understanding model:

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12

transactions:
  target_count: 50000

document_flows:
  p2p:
    enabled: true
    flow_rate: 0.4
  o2c:
    enabled: true
    flow_rate: 0.3

anomaly_injection:
  enabled: true
  total_rate: 0.03
  generate_labels: true

# LLM enrichment adds realistic descriptions
llm:
  provider: mock     # or openai for higher quality

The generated data includes:

Vendor names appropriate for the industry and spend category
Transaction descriptions that read like real GL entries
Memo fields on invoices and payments
Natural language explanations for flagged anomalies

SyntheticData Documentation