AI & ML Features Configuration

New in v0.5.0

This page documents the configuration for DataSynth’s AI and ML-powered generation features: LLM-augmented generation, diffusion models, causal generation, and synthetic data certificates.

LLM Configuration

Configure the LLM provider for metadata enrichment and natural language configuration.

llm:
  provider: mock              # Provider type
  model: "gpt-4o-mini"       # Model identifier
  api_key_env: "OPENAI_API_KEY"  # Environment variable for API key
  base_url: null              # Custom API endpoint (for 'custom' provider)
  max_retries: 3              # Retry attempts on failure
  timeout_secs: 30            # Request timeout
  cache_enabled: true         # Enable prompt-level caching

Provider Types

| Provider | Value | Requirements | Description |
| --- | --- | --- | --- |
| Mock | mock | None | Deterministic, no network. Default for CI/CD |
| OpenAI | openai | OPENAI_API_KEY env var | OpenAI API (GPT-4o, GPT-4o-mini, etc.) |
| Anthropic | anthropic | ANTHROPIC_API_KEY env var | Anthropic API (Claude models) |
| Custom | custom | base_url + API key env var | Any OpenAI-compatible endpoint |
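
The requirements in the provider table can be checked before a run starts. The following is a minimal sketch in Python, not DataSynth's own validation: `check_llm_config` is a hypothetical helper, and the assumption that an unset `provider` falls back to `mock` is taken from the defaults documented on this page.

```python
import os

# Provider values as listed in the provider table.
PROVIDERS = {"mock", "openai", "anthropic", "custom"}


def check_llm_config(cfg, env=None):
    """Return a list of problems found in an `llm:` mapping (empty if none).

    Hypothetical helper for illustration; not part of DataSynth.
    """
    env = os.environ if env is None else env
    problems = []
    provider = cfg.get("provider", "mock")
    if provider not in PROVIDERS:
        problems.append(f"unknown provider: {provider!r}")
        return problems
    if provider == "mock":
        return problems  # deterministic, no network, no key required
    key_var = cfg.get("api_key_env", "")
    if not key_var:
        problems.append("api_key_env is not set")
    elif not env.get(key_var):
        problems.append(f"environment variable {key_var} is empty or missing")
    if provider == "custom" and not cfg.get("base_url"):
        problems.append("base_url is required for the custom provider")
    return problems
```

Passing `env` explicitly keeps the check testable without touching the real process environment.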

Field Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| provider | string | "mock" | LLM provider type |
| model | string | "gpt-4o-mini" | Model identifier passed to the API |
| api_key_env | string | "" | Environment variable name containing the API key |
| base_url | string | null | Custom API base URL (required for custom provider) |
| max_retries | integer | 3 | Maximum retry attempts on transient failures |
| timeout_secs | integer | 30 | Per-request timeout in seconds |
| cache_enabled | bool | true | Cache responses to avoid duplicate API calls |
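
To illustrate how `max_retries` and `cache_enabled` might interact, here is a rough Python sketch. It is not DataSynth's implementation. The `CachedClient` class, the `send` callable, the `backoff_secs` parameter, the choice of `ConnectionError` as the transient failure, and the reading of `max_retries` as "retries after the first attempt" are all assumptions made for the example.

```python
import hashlib
import time


class CachedClient:
    """Wraps a callable `send(prompt) -> str` with retries and a prompt cache.

    Hypothetical class for illustration; not part of DataSynth.
    """

    def __init__(self, send, max_retries=3, cache_enabled=True, backoff_secs=0.0):
        self.send = send
        self.max_retries = max_retries
        self.cache_enabled = cache_enabled
        self.backoff_secs = backoff_secs
        self.cache = {}

    def complete(self, model, prompt):
        # Cache key covers both model and prompt, so identical requests reuse a reply.
        key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
        if self.cache_enabled and key in self.cache:
            return self.cache[key]
        last_error = None
        # Assumption: one initial attempt plus up to max_retries retries.
        for attempt in range(self.max_retries + 1):
            try:
                result = self.send(prompt)
                break
            except ConnectionError as exc:  # stand-in for a transient failure
                last_error = exc
                if attempt < self.max_retries:
                    time.sleep(self.backoff_secs * (2 ** attempt))
        else:
            # Loop ended without a successful call.
            raise last_error
        if self.cache_enabled:
            self.cache[key] = result
        return result
```

With caching on, a second identical request never reaches `send`, which is the "avoid duplicate API calls" behavior described in the table.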

Examples

Mock provider (default, no config needed):

# LLM enrichment uses mock provider by default
# No configuration required

OpenAI:

llm:
  provider: openai
  model: "gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"

Anthropic:

llm:
  provider: anthropic
  model: "claude-sonnet-4-5-20250929"
  api_key_env: "ANTHROPIC_API_KEY"

Self-hosted (e.g., vLLM, Ollama):

llm:
  provider: custom
  model: "llama-3-8b"
  api_key_env: "LOCAL_API_KEY"
  base_url: "http://localhost:8000/v1"

Azure OpenAI:

llm:
  provider: custom
  model: "gpt-4o-mini"
  api_key_env: "AZURE_OPENAI_KEY"
  base_url: "https://my-resource.openai.azure.com/openai/deployments/gpt-4o-mini"

Diffusion Configuration

Configure the statistical diffusion model backend for learned distribution capture.

diffusion:
  enabled: false              # Enable diffusion generation
  n_steps: 1000               # Number of diffusion steps
  schedule: "linear"          # Noise schedule type
  sample_size: 1000           # Number of samples to generate

Field Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable diffusion model generation |
| n_steps | integer | 1000 | Number of forward/reverse diffusion steps. Higher values improve quality but increase compute time |
| schedule | string | "linear" | Noise schedule: "linear", "cosine", "sigmoid" |
| sample_size | integer | 1000 | Number of diffusion-generated samples |

Noise Schedules

| Schedule | Characteristics | Best For |
| --- | --- | --- |
| linear | Uniform noise addition, simple and robust | General purpose |
| cosine | Slower noise addition, preserves fine details | Financial amounts with precise distributions |
| sigmoid | Smooth transition between linear and cosine | Balanced quality and compute |
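
The difference between the schedules is easier to see numerically. The sketch below uses the formulations common in the diffusion literature (a linear beta ramp, a squared-cosine cumulative signal curve, and a logistic beta ramp). DataSynth's exact formulas and constants are not documented here, so the function names and the `beta_start`, `beta_end`, `s`, and `span` values are assumptions.

```python
import math


def linear_betas(n_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise rises in equal increments from beta_start to beta_end."""
    if n_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + i * step for i in range(n_steps)]


def cosine_betas(n_steps, s=0.008, max_beta=0.999):
    """Per-step noise derived from a squared-cosine cumulative signal level."""
    def alpha_bar(t):
        return math.cos((t / n_steps + s) / (1 + s) * math.pi / 2) ** 2
    # Clip so the final steps never reach beta = 1.
    return [min(1 - alpha_bar(i + 1) / alpha_bar(i), max_beta) for i in range(n_steps)]


def sigmoid_betas(n_steps, beta_start=1e-4, beta_end=0.02, span=6.0):
    """Per-step noise follows a logistic curve between beta_start and beta_end."""
    betas = []
    for i in range(n_steps):
        x = -span + 2 * span * i / max(n_steps - 1, 1)
        betas.append(beta_start + (beta_end - beta_start) / (1 + math.exp(-x)))
    return betas


def signal_remaining(betas):
    """Cumulative product of (1 - beta): fraction of signal left after each step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1 - b
        out.append(prod)
    return out
```

Under these assumed constants and `n_steps: 1000`, the cosine schedule still retains roughly half of the original signal at the midpoint, while the linear schedule retains under a tenth. That is one way to read "slower noise addition" in the table.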

Examples

Basic diffusion:

diffusion:
  enabled: true
  n_steps: 1000
  schedule: "cosine"
  sample_size: 5000

Fast diffusion (fewer steps):

diffusion:
  enabled: true
  n_steps: 200
  schedule: "linear"
  sample_size: 1000

Causal Configuration

Configure causal graph-based data generation with Structural Causal Models.

causal:
  enabled: false              # Enable causal generation
  template: "fraud_detection" # Built-in template or custom YAML path
  sample_size: 1000           # Number of samples
  validate: true              # Validate causal structure in output

Field Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable causal/counterfactual generation |
| template | string | "fraud_detection" | Template name or path to custom YAML graph |
| sample_size | integer | 1000 | Number of causal samples to generate |
| validate | bool | true | Run causal structure validation on output |

Built-in Templates

| Template | Variables | Use Case |
| --- | --- | --- |
| fraud_detection | transaction_amount, approval_level, vendor_risk, fraud_flag | Fraud risk modeling |
| revenue_cycle | order_size, credit_score, payment_delay, revenue | Revenue and credit analysis |

Custom Causal Graph

Point template to a YAML file defining a custom causal graph:

causal:
  enabled: true
  template: "./graphs/custom_fraud.yaml"
  sample_size: 10000
  validate: true

Custom graph format:

# custom_fraud.yaml
variables:
  - name: transaction_amount
    type: continuous
    distribution: lognormal
    params:
      mu: 8.0
      sigma: 1.5
  - name: approval_level
    type: count
    distribution: normal
    params:
      mean: 1.0
      std: 0.5
  - name: fraud_flag
    type: binary

edges:
  - from: transaction_amount
    to: approval_level
    mechanism:
      type: linear
      coefficient: 0.00005
  - from: transaction_amount
    to: fraud_flag
    mechanism:
      type: logistic
      scale: 0.0001
      midpoint: 50000.0
    strength: 0.8
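
A Structural Causal Model needs an acyclic graph, so a custom graph's edges should allow an order in which every parent is generated before its children. The sketch below checks that property with Python's standard `graphlib` module (Python 3.9+). It is an illustration only: `generation_order` is a hypothetical helper, and it makes no claim about what DataSynth's `validate` option actually checks.

```python
from graphlib import CycleError, TopologicalSorter


def generation_order(edges):
    """Return variables ordered so that every parent precedes its children.

    `edges` is a list of {"from": ..., "to": ...} mappings, mirroring the
    `edges:` section of the custom graph file. Raises ValueError on a cycle.
    Hypothetical helper for illustration; not part of DataSynth.
    """
    sorter = TopologicalSorter()
    for edge in edges:
        # add(node, *predecessors): the child depends on its parent.
        sorter.add(edge["to"], edge["from"])
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        # args[1] holds the nodes that form the detected cycle.
        raise ValueError(f"causal graph has a cycle: {exc.args[1]}") from exc
```

For the example graph, `transaction_amount` has no parents and therefore comes first, followed by `approval_level` and `fraud_flag` in either order.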

Causal Mechanism Types

| Type | Parameters | Description |
| --- | --- | --- |
| linear | coefficient | y += coefficient × parent |
| threshold | cutoff | y = 1 if parent > cutoff, else 0 |
| polynomial | coefficients (list) | y += Σ c[i] × parent^i |
| logistic | scale, midpoint | y += 1 / (1 + e^(-scale × (parent - midpoint))) |
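
The four formulas translate directly into code. The following Python sketch returns each mechanism's contribution for a single parent value. The function names are illustrative, and details such as how `strength` or noise terms are applied are not covered because this page does not specify them.

```python
import math


def linear(parent, coefficient):
    """Contribution added to y: coefficient × parent."""
    return coefficient * parent


def threshold(parent, cutoff):
    """Value assigned to y: 1 if parent exceeds cutoff, else 0."""
    return 1.0 if parent > cutoff else 0.0


def polynomial(parent, coefficients):
    """Contribution added to y: sum of c[i] × parent^i (index 0 is the constant)."""
    return sum(c * parent ** i for i, c in enumerate(coefficients))


def logistic(parent, scale, midpoint):
    """Contribution added to y: 1 / (1 + e^(-scale × (parent - midpoint)))."""
    z = scale * (parent - midpoint)
    # Evaluate in a form that cannot overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

With the example edge parameters (`scale: 0.0001`, `midpoint: 50000.0`), a transaction of exactly 50,000 sits at the midpoint of the logistic curve, and larger amounts push the contribution toward 1.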

Certificate Configuration

Configure synthetic data certificates for provenance and privacy attestation.

certificates:
  enabled: false              # Enable certificate generation
  issuer: "DataSynth"        # Certificate issuer identity
  include_quality_metrics: true  # Include quality metrics

Field Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Generate a certificate with each output |
| issuer | string | "DataSynth" | Issuer identity embedded in the certificate |
| include_quality_metrics | bool | true | Include Benford MAD, correlation, fidelity, MIA AUC metrics |
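
Of the quality metrics listed, Benford MAD is the simplest to reproduce by hand. The sketch below computes the usual first-digit mean absolute deviation: the average gap between observed leading-digit frequencies and the Benford expectation log10(1 + 1/d). Whether DataSynth uses exactly this definition is an assumption, and `first_digit` and `benford_mad` are illustrative names.

```python
import math

# Expected frequency of leading digits 1..9 under Benford's law.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(value):
    """Return the first non-zero digit of a number, or None if there is none."""
    for ch in repr(abs(value)):
        if ch in "123456789":
            return int(ch)
    return None  # zero, inf, or nan


def benford_mad(values):
    """Mean absolute deviation between observed and expected first-digit frequencies."""
    counts = [0] * 9
    for v in values:
        d = first_digit(v)
        if d is not None:
            counts[d - 1] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values with a leading digit")
    return sum(abs(c / total - p) for c, p in zip(counts, BENFORD)) / 9
```

Lower values mean the amounts follow Benford's law more closely. Powers of two, a classic Benford-conforming sequence, score close to zero, while amounts with evenly spread leading digits score much higher.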

Certificate Contents

When enabled, a certificate.json is produced containing:

| Section | Contents |
| --- | --- |
| Identity | certificate_id, generation_timestamp, generator_version |
| Reproducibility | config_hash (SHA-256), seed, fingerprint_hash |
| Privacy | DP mechanism, epsilon, delta, composition method, total queries |
| Quality | Benford MAD, correlation preservation, statistical fidelity, MIA AUC |
| Integrity | HMAC-SHA256 signature |
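
An HMAC-SHA256 signature lets anyone holding the shared key confirm that a certificate has not been altered. The sketch below shows the general technique with Python's standard library. The canonical JSON serialization, the payload layout, and the way the key is supplied are assumptions, since this page does not describe how DataSynth serializes or keys its certificates.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload):
    """Serialize a mapping deterministically so equal content signs identically."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(payload, key):
    """Return the hex HMAC-SHA256 of the canonical payload."""
    return hmac.new(key, canonical_bytes(payload), hashlib.sha256).hexdigest()


def verify(payload, signature, key):
    """Constant-time comparison of the expected and supplied signatures."""
    return hmac.compare_digest(sign(payload, key), signature)
```

Sorting keys before signing means field order in the file does not affect the result, and `hmac.compare_digest` avoids leaking information through comparison timing.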

Combined Example

A complete configuration using all AI/ML features:

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12

companies:
  - code: "1000"
    name: "Manufacturing Corp"
    currency: USD
    country: US

transactions:
  target_count: 50000

# LLM enrichment for realistic metadata
llm:
  provider: mock

# Diffusion for learned distributions
diffusion:
  enabled: true
  n_steps: 1000
  schedule: "cosine"
  sample_size: 5000

# Causal structure for fraud scenarios
causal:
  enabled: true
  template: "fraud_detection"
  sample_size: 10000
  validate: true

# Certificate for provenance
certificates:
  enabled: true
  issuer: "DataSynth v0.5.0"
  include_quality_metrics: true

fraud:
  enabled: true
  fraud_rate: 0.005

anomaly_injection:
  enabled: true
  total_rate: 0.02

output:
  format: csv

CLI Flags

Several AI/ML features can also be controlled from the command line, through flags and dedicated subcommands:

# Generate with certificate
datasynth-data generate --config config.yaml --output ./output --certificate

# Initialize from natural language
datasynth-data init --from-description "1 year of retail data with fraud" -o config.yaml

# Train diffusion model
datasynth-data diffusion train --fingerprint ./fp.dsf --output ./model.json

# Generate causal data
datasynth-data causal generate --template fraud_detection --samples 10000 --output ./causal/

See Also