AI & ML Features Configuration

New in v0.5.0

This page documents the configuration for DataSynth’s AI and ML-powered generation features: LLM-augmented generation, diffusion models, causal generation, and synthetic data certificates.

LLM Configuration

Configure the LLM provider used for metadata enrichment and for generating configurations from natural-language descriptions.

llm:
  provider: mock              # Provider type
  model: "gpt-4o-mini"       # Model identifier
  api_key_env: "OPENAI_API_KEY"  # Environment variable for API key
  base_url: null              # Custom API endpoint (for 'custom' provider)
  max_retries: 3              # Retry attempts on failure
  timeout_secs: 30            # Request timeout
  cache_enabled: true         # Enable prompt-level caching

Provider Types

Provider	Value	Requirements	Description
Mock	`mock`	None	Deterministic, no network calls. The default provider; suited to CI/CD
OpenAI	`openai`	`OPENAI_API_KEY` env var	OpenAI API (GPT-4o, GPT-4o-mini, etc.)
Anthropic	`anthropic`	`ANTHROPIC_API_KEY` env var	Anthropic API (Claude models)
Custom	`custom`	`base_url` + API key env var	Any OpenAI-compatible endpoint
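
The requirements column above can be checked before a run. The following is a minimal sketch, not DataSynth's own validation: the function name `check_llm_config` and the error messages are hypothetical, and it assumes the `llm` section has already been parsed into a dictionary.

```python
import os

KNOWN_PROVIDERS = ("mock", "openai", "anthropic", "custom")

def check_llm_config(llm, env=None):
    """Return a list of problems; an empty list means the config looks usable.

    Hypothetical helper that mirrors the Requirements column of the table.
    """
    env = os.environ if env is None else env
    provider = llm.get("provider", "mock")
    if provider not in KNOWN_PROVIDERS:
        return [f"unknown provider: {provider!r}"]
    if provider == "mock":
        return []  # no API key and no network access needed

    problems = []
    key_env = llm.get("api_key_env", "")
    if not key_env:
        problems.append("api_key_env must name an environment variable")
    elif not env.get(key_env):
        problems.append(f"environment variable {key_env} is not set")
    if provider == "custom" and not llm.get("base_url"):
        problems.append("base_url is required for the custom provider")
    return problems
```

Passing `env` explicitly keeps the check testable without touching the real process environment.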

Field Reference

Field	Type	Default	Description
`provider`	string	`"mock"`	LLM provider type
`model`	string	`"gpt-4o-mini"`	Model identifier passed to the API
`api_key_env`	string	`""`	Environment variable name containing the API key
`base_url`	string	`null`	Custom API base URL (required for `custom` provider)
`max_retries`	integer	`3`	Maximum retry attempts on transient failures
`timeout_secs`	integer	`30`	Per-request timeout in seconds
`cache_enabled`	bool	`true`	Cache responses to avoid duplicate API calls
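
To illustrate how `max_retries` and `cache_enabled` interact, here is a rough sketch of a cached call with retries. It is not DataSynth's implementation: `cached_call`, `TransientError`, and `backoff_secs` are hypothetical names, the cache is keyed on a SHA-256 of the prompt, and `max_retries` is assumed to count retries after the first attempt. The per-request timeout is left to the `send` callable.

```python
import hashlib
import time

class TransientError(Exception):
    """Stand-in for a retryable failure such as a timeout or an HTTP 5xx."""

def cached_call(prompt, send, cache, max_retries=3, backoff_secs=0.0):
    """Return send(prompt), reusing a cached response for identical prompts."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in cache:
        return cache[key]  # duplicate prompt: no API call is made

    for attempt in range(max_retries + 1):  # first attempt plus retries
        try:
            result = send(prompt)
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the failure
            time.sleep(backoff_secs * (2 ** attempt))  # exponential backoff
        else:
            cache[key] = result
            return result
```

Only successful responses are cached, so a failed request never poisons later lookups.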

Examples

Mock provider (default, no config needed):

# LLM enrichment uses mock provider by default
# No configuration required

OpenAI:

llm:
  provider: openai
  model: "gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"

Anthropic:

llm:
  provider: anthropic
  model: "claude-sonnet-4-5-20250929"
  api_key_env: "ANTHROPIC_API_KEY"

Self-hosted (e.g., vLLM, Ollama):

llm:
  provider: custom
  model: "llama-3-8b"
  api_key_env: "LOCAL_API_KEY"
  base_url: "http://localhost:8000/v1"

Azure OpenAI:

llm:
  provider: custom
  model: "gpt-4o-mini"
  api_key_env: "AZURE_OPENAI_KEY"
  base_url: "https://my-resource.openai.azure.com/openai/deployments/gpt-4o-mini"

Diffusion Configuration

Configure the statistical diffusion model backend, which generates samples from learned distributions.

diffusion:
  enabled: false              # Enable diffusion generation
  n_steps: 1000               # Number of diffusion steps
  schedule: "linear"          # Noise schedule type
  sample_size: 1000           # Number of samples to generate

Field Reference

Field	Type	Default	Description
`enabled`	bool	`false`	Enable diffusion model generation
`n_steps`	integer	`1000`	Number of forward/reverse diffusion steps. Higher values improve quality but increase compute time
`schedule`	string	`"linear"`	Noise schedule: `"linear"`, `"cosine"`, `"sigmoid"`
`sample_size`	integer	`1000`	Number of diffusion-generated samples

Noise Schedules

Schedule	Characteristics	Best For
`linear`	Uniform noise addition, simple and robust	General purpose
`cosine`	Slower noise addition, preserves fine details	Financial amounts with precise distributions
`sigmoid`	Smooth transition between linear and cosine	Balanced quality and compute
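
The schedules can be compared numerically. The sketch below uses formulations that are common in the diffusion literature; the function names, the beta range of 1e-4 to 0.02, the cosine offset `s=0.008`, and the sigmoid `span` are assumptions for illustration, and DataSynth's internal parameterization may differ.

```python
import math

def linear_betas(n_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variance rising in equal increments."""
    step = (beta_end - beta_start) / max(n_steps - 1, 1)
    return [beta_start + i * step for i in range(n_steps)]

def cosine_betas(n_steps, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine curve for the remaining signal."""
    def alpha_bar(t):
        return math.cos((t / n_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1 - alpha_bar(i + 1) / alpha_bar(i), max_beta)
            for i in range(n_steps)]

def sigmoid_betas(n_steps, beta_start=1e-4, beta_end=0.02, span=6.0):
    """Betas following an S-curve between beta_start and beta_end."""
    betas = []
    for i in range(n_steps):
        x = -span + 2 * span * i / max(n_steps - 1, 1)
        betas.append(beta_start + (beta_end - beta_start) / (1 + math.exp(-x)))
    return betas

def signal_remaining(betas):
    """Cumulative product of (1 - beta): fraction of signal left per step."""
    remaining, acc = [], 1.0
    for beta in betas:
        acc *= 1 - beta
        remaining.append(acc)
    return remaining

# Compare how much signal survives halfway through a 1000-step schedule.
midpoint = {
    name: signal_remaining(fn(1000))[499]
    for name, fn in [("linear", linear_betas),
                     ("cosine", cosine_betas),
                     ("sigmoid", sigmoid_betas)]
}
```

Under these assumptions the cosine schedule keeps roughly half of the signal at the midpoint, while the linear schedule keeps less than 10%, which matches the "slower noise addition" characterization in the table.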

Examples

Basic diffusion:

diffusion:
  enabled: true
  n_steps: 1000
  schedule: "cosine"
  sample_size: 5000

Fast diffusion (fewer steps):

diffusion:
  enabled: true
  n_steps: 200
  schedule: "linear"
  sample_size: 1000

Causal Configuration

Configure causal graph-based data generation with Structural Causal Models.

causal:
  enabled: false              # Enable causal generation
  template: "fraud_detection" # Built-in template or custom YAML path
  sample_size: 1000           # Number of samples
  validate: true              # Validate causal structure in output

Field Reference

Field	Type	Default	Description
`enabled`	bool	`false`	Enable causal/counterfactual generation
`template`	string	`"fraud_detection"`	Template name or path to custom YAML graph
`sample_size`	integer	`1000`	Number of causal samples to generate
`validate`	bool	`true`	Run causal structure validation on output

Built-in Templates

Template	Variables	Use Case
`fraud_detection`	transaction_amount, approval_level, vendor_risk, fraud_flag	Fraud risk modeling
`revenue_cycle`	order_size, credit_score, payment_delay, revenue	Revenue and credit analysis

Custom Causal Graph

Point `template` to a YAML file defining a custom causal graph:

causal:
  enabled: true
  template: "./graphs/custom_fraud.yaml"
  sample_size: 10000
  validate: true

Custom graph format:

# custom_fraud.yaml
variables:
  - name: transaction_amount
    type: continuous
    distribution: lognormal
    params:
      mu: 8.0
      sigma: 1.5
  - name: approval_level
    type: count
    distribution: normal
    params:
      mean: 1.0
      std: 0.5
  - name: fraud_flag
    type: binary

edges:
  - from: transaction_amount
    to: approval_level
    mechanism:
      type: linear
      coefficient: 0.00005
  - from: transaction_amount
    to: fraud_flag
    mechanism:
      type: logistic
      scale: 0.0001
      midpoint: 50000.0
    strength: 0.8
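
Before generating from a custom graph, it can help to confirm that every edge references a declared variable and that the graph is acyclic. The following sketch shows one way to do that; `generation_order` is a hypothetical helper, the dictionary is a trimmed Python transcription of the YAML above, and DataSynth's own loader and validation may work differently.

```python
from graphlib import CycleError, TopologicalSorter

def generation_order(graph):
    """Return variable names ordered so that parents precede children.

    Raises ValueError for undeclared variables or cyclic graphs.
    """
    parents = {var["name"]: set() for var in graph["variables"]}
    for edge in graph["edges"]:
        for endpoint in (edge["from"], edge["to"]):
            if endpoint not in parents:
                raise ValueError(f"edge references undeclared variable {endpoint!r}")
        parents[edge["to"]].add(edge["from"])
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as err:
        raise ValueError(f"causal graph contains a cycle: {err.args[1]}") from err

custom_fraud = {
    "variables": [
        {"name": "transaction_amount"},
        {"name": "approval_level"},
        {"name": "fraud_flag"},
    ],
    "edges": [
        {"from": "transaction_amount", "to": "approval_level"},
        {"from": "transaction_amount", "to": "fraud_flag"},
    ],
}
order = generation_order(custom_fraud)  # transaction_amount comes first
```

A topological order is also the order in which an SCM samples variables, since each variable depends only on parents that were already generated.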

Causal Mechanism Types

Type	Parameters	Description
`linear`	`coefficient`	y += coefficient × parent
`threshold`	`cutoff`	y = 1 if parent > cutoff, else 0
`polynomial`	`coefficients` (list)	y += Σ c[i] × parent^i
`logistic`	`scale`, `midpoint`	y += 1 / (1 + e^(-scale × (parent - midpoint)))
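
The formulas in the table translate directly into code. The sketch below computes each mechanism's contribution for a single parent value; the function names are illustrative, and the polynomial index is assumed to start at 0 (so `coefficients[0]` is the constant term), which the table does not state explicitly.

```python
import math

def linear(parent, coefficient):
    """Contribution of a linear mechanism: coefficient * parent."""
    return coefficient * parent

def threshold(parent, cutoff):
    """1 if the parent exceeds the cutoff, otherwise 0."""
    return 1.0 if parent > cutoff else 0.0

def polynomial(parent, coefficients):
    """Sum of coefficients[i] * parent**i, with i assumed to start at 0."""
    return sum(c * parent ** i for i, c in enumerate(coefficients))

def logistic(parent, scale, midpoint):
    """S-shaped contribution that equals 0.5 when parent == midpoint."""
    return 1.0 / (1.0 + math.exp(-scale * (parent - midpoint)))

# Using the edge parameters from the custom graph above:
approval_shift = linear(20_000.0, coefficient=0.00005)
fraud_signal = logistic(50_000.0, scale=0.0001, midpoint=50_000.0)
```

With the custom graph's parameters, a transaction of 20,000 shifts `approval_level` by about 1.0, and a transaction exactly at the logistic midpoint of 50,000 contributes 0.5.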

Certificate Configuration

Configure synthetic data certificates for provenance and privacy attestation.

certificates:
  enabled: false              # Enable certificate generation
  issuer: "DataSynth"        # Certificate issuer identity
  include_quality_metrics: true  # Include quality metrics

Field Reference

Field	Type	Default	Description
`enabled`	bool	`false`	Generate a certificate with each output
`issuer`	string	`"DataSynth"`	Issuer identity embedded in the certificate
`include_quality_metrics`	bool	`true`	Include quality metrics: Benford MAD (mean absolute deviation), correlation preservation, statistical fidelity, and MIA (membership inference attack) AUC

Certificate Contents

When enabled, generation produces a `certificate.json` file containing:

Section	Contents
Identity	certificate_id, generation_timestamp, generator_version
Reproducibility	config_hash (SHA-256), seed, fingerprint_hash
Privacy	Differential privacy (DP) mechanism, epsilon, delta, composition method, total queries
Quality	Benford MAD, correlation preservation, statistical fidelity, membership inference attack (MIA) AUC
Integrity	HMAC-SHA256 signature
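
The Reproducibility and Integrity rows rely on two standard primitives, SHA-256 hashing and HMAC-SHA256 signing. The sketch below shows how they are typically applied. It is an illustration only: the canonical JSON encoding, the function names, and the key handling are assumptions, and the exact bytes DataSynth hashes and signs are not specified on this page.

```python
import hashlib
import hmac
import json

def _canonical(obj):
    """Serialize with sorted keys so equal content always yields equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")

def config_hash(config):
    """SHA-256 hex digest of the canonicalized configuration."""
    return hashlib.sha256(_canonical(config)).hexdigest()

def sign(payload, key):
    """HMAC-SHA256 hex signature over the canonicalized payload."""
    return hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()

def verify(payload, signature, key):
    """Constant-time comparison against a freshly computed signature."""
    return hmac.compare_digest(sign(payload, key), signature)

# Hypothetical certificate payload and key, for illustration only.
key = b"example-signing-key"
payload = {
    "issuer": "DataSynth",
    "seed": 42,
    "config_hash": config_hash({"global": {"seed": 42}}),
}
signature = sign(payload, key)
```

Any change to the payload after signing, such as a different seed, causes `verify` to return `False`, which is what makes the signature useful as an integrity check.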

Combined Example

A complete configuration using all AI/ML features:

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12

companies:
  - code: "1000"
    name: "Manufacturing Corp"
    currency: USD
    country: US

transactions:
  target_count: 50000

# LLM enrichment for realistic metadata
llm:
  provider: mock

# Diffusion for learned distributions
diffusion:
  enabled: true
  n_steps: 1000
  schedule: "cosine"
  sample_size: 5000

# Causal structure for fraud scenarios
causal:
  enabled: true
  template: "fraud_detection"
  sample_size: 10000
  validate: true

# Certificate for provenance
certificates:
  enabled: true
  issuer: "DataSynth v0.5.0"
  include_quality_metrics: true

fraud:
  enabled: true
  fraud_rate: 0.005

anomaly_injection:
  enabled: true
  total_rate: 0.02

output:
  format: csv
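
As a rough sanity check on the rates in this example, the expected volumes can be worked out by hand. This assumes `fraud_rate` and `total_rate` are fractions of `transactions.target_count`, which this page does not state; the actual base and any overlap between fraud and anomalies depend on the generator.

```python
# Values copied from the combined example above.
target_count = 50_000
fraud_rate = 0.005
total_rate = 0.02

# Assumption: both rates apply to the transaction target count.
expected_fraud = round(target_count * fraud_rate)
expected_anomalies = round(target_count * total_rate)
```

Under that assumption the run would contain about 250 fraudulent transactions and about 1,000 injected anomalies.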

CLI Flags

Several AI/ML features can also be controlled via CLI flags:

# Generate with certificate
datasynth-data generate --config config.yaml --output ./output --certificate

# Initialize from natural language
datasynth-data init --from-description "1 year of retail data with fraud" -o config.yaml

# Train diffusion model
datasynth-data diffusion train --fingerprint ./fp.dsf --output ./model.json

# Generate causal data
datasynth-data causal generate --template fraud_detection --samples 10000 --output ./causal/
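
When these commands are driven from a script or test harness, building the argument list programmatically avoids shell-quoting mistakes. The helper names below are hypothetical; they use only the subcommands and flags shown above.

```python
import shlex

def generate_cmd(config, output, certificate=False):
    """Argument list for `datasynth-data generate`."""
    cmd = ["datasynth-data", "generate", "--config", config, "--output", output]
    if certificate:
        cmd.append("--certificate")
    return cmd

def causal_cmd(template, samples, output):
    """Argument list for `datasynth-data causal generate`."""
    return ["datasynth-data", "causal", "generate",
            "--template", template,
            "--samples", str(samples),
            "--output", output]

# shlex.join renders the list as a copy-pasteable shell command.
rendered = shlex.join(generate_cmd("config.yaml", "./output", certificate=True))
```

The lists can be passed directly to `subprocess.run(cmd, check=True)`, which raises an exception if the command exits with a non-zero status.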

SyntheticData Documentation
