LLM-Augmented Generation

New in v0.5.0

LLM-augmented generation uses Large Language Models to enrich synthetic data with realistic metadata — vendor names, transaction descriptions, memo fields, and anomaly explanations — that would be difficult to generate with rule-based approaches alone.

Overview

Traditional synthetic data generators produce structurally correct but often generic-sounding text fields. LLM augmentation addresses this by using language models to generate contextually appropriate text based on the financial domain, industry, and transaction context.

DataSynth provides a pluggable provider abstraction that supports:

| Provider | Description | Use Case |
|---|---|---|
| Mock | Deterministic, no network required | CI/CD, testing, reproducible builds |
| OpenAI | OpenAI-compatible APIs (GPT-4o-mini, etc.) | Production enrichment |
| Anthropic | Anthropic API (Claude models) | Production enrichment |
| Custom | Any OpenAI-compatible endpoint | Self-hosted models, Azure OpenAI |

Provider Abstraction

All LLM functionality is built around the LlmProvider trait:

```rust
pub trait LlmProvider: Send + Sync {
    fn name(&self) -> &str;
    fn complete(&self, request: &LlmRequest) -> Result<LlmResponse, SynthError>;
    fn complete_batch(&self, requests: &[LlmRequest]) -> Result<Vec<LlmResponse>, SynthError>;
}
```
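
Because the trait is object-safe and only requires Send + Sync, any backend can be plugged in. A minimal sketch of a custom implementation, assuming a hypothetical EchoProvider and that the LlmRequest/LlmResponse fields match the definitions shown on this page:

```rust
// Hypothetical illustration only: EchoProvider is not part of DataSynth.
// Field names on LlmRequest/LlmResponse and TokenUsage::default() are
// assumed from the structs documented on this page.
struct EchoProvider;

impl LlmProvider for EchoProvider {
    fn name(&self) -> &str {
        "echo"
    }

    fn complete(&self, request: &LlmRequest) -> Result<LlmResponse, SynthError> {
        // Echo the prompt back instead of calling a model.
        Ok(LlmResponse {
            content: request.prompt.clone(),
            usage: TokenUsage::default(),
            cached: false,
        })
    }

    fn complete_batch(&self, requests: &[LlmRequest]) -> Result<Vec<LlmResponse>, SynthError> {
        // Naive batch: complete each request in turn.
        requests.iter().map(|r| self.complete(r)).collect()
    }
}
```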

LlmRequest

```rust
let request = LlmRequest::new("Generate a vendor name for a German auto parts manufacturer")
    .with_system("You are a business data generator. Return only the company name.")
    .with_seed(42)
    .with_max_tokens(50)
    .with_temperature(0.7);
```

| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | `String` | (required) | The generation prompt |
| `system` | `Option<String>` | `None` | System message for context |
| `max_tokens` | `u32` | `100` | Maximum response tokens |
| `temperature` | `f64` | `0.7` | Sampling temperature |
| `seed` | `Option<u64>` | `None` | Seed for deterministic output |

LlmResponse

```rust
pub struct LlmResponse {
    pub content: String,       // Generated text
    pub usage: TokenUsage,     // Input/output token counts
    pub cached: bool,          // Whether result came from cache
}
```
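
A response can be consumed directly. A short sketch, assuming `provider` and `request` are built as shown elsewhere on this page:

```rust
let response = provider.complete(&request)?;
println!("generated: {}", response.content);
if response.cached {
    // Served from the prompt cache: no API call was made for this request.
}
```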

Mock Provider

The MockLlmProvider generates deterministic, contextually aware text without any network calls. It is the default provider and is ideal for:

  • CI/CD pipelines where network access is restricted
  • Reproducible builds with deterministic output
  • Development and testing
  • Environments where API costs are a concern

```rust
use synth_core::llm::MockLlmProvider;

let provider = MockLlmProvider::new(42); // seeded for reproducibility
```

The mock provider uses the seed and prompt content to generate plausible-sounding business names and descriptions deterministically.
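
Because output depends only on the seed and the prompt, two identically seeded providers agree. A quick check (a sketch, not a test from the DataSynth suite):

```rust
let a = MockLlmProvider::new(42).complete(&LlmRequest::new("vendor name"))?;
let b = MockLlmProvider::new(42).complete(&LlmRequest::new("vendor name"))?;
assert_eq!(a.content, b.content); // same seed + same prompt => same text
```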

HTTP Provider

The HttpLlmProvider connects to real LLM APIs:

```rust
use synth_core::llm::{HttpLlmProvider, LlmConfig, LlmProviderType};

let config = LlmConfig {
    provider: LlmProviderType::OpenAi,
    model: "gpt-4o-mini".to_string(),
    api_key_env: "OPENAI_API_KEY".to_string(),
    base_url: None,
    max_retries: 3,
    timeout_secs: 30,
    cache_enabled: true,
};

let provider = HttpLlmProvider::new(config)?;
```
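
The same config shape should cover the custom row from the provider table. A hedged sketch, assuming the enum variant is named `LlmProviderType::Custom` and that the endpoint speaks the OpenAI wire format; the model id and environment variable below are hypothetical:

```rust
let config = LlmConfig {
    provider: LlmProviderType::Custom,                       // assumed variant name
    model: "llama-3.1-8b-instruct".to_string(),              // hypothetical model id
    api_key_env: "LOCAL_LLM_KEY".to_string(),                // hypothetical env var
    base_url: Some("http://localhost:8000/v1".to_string()),  // self-hosted endpoint
    max_retries: 3,
    timeout_secs: 30,
    cache_enabled: true,
};

let provider = HttpLlmProvider::new(config)?;
```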

Configuration

```yaml
# In your generation config
llm:
  provider: openai          # mock, openai, anthropic, custom
  model: "gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"
  base_url: null            # Override for custom endpoints
  max_retries: 3
  timeout_secs: 30
  cache_enabled: true
```

| Field | Type | Default | Description |
|---|---|---|---|
| `provider` | string | `mock` | Provider type: `mock`, `openai`, `anthropic`, `custom` |
| `model` | string | `gpt-4o-mini` | Model identifier |
| `api_key_env` | string | | Environment variable containing the API key |
| `base_url` | string | `null` | Custom API base URL (required for `custom` provider) |
| `max_retries` | integer | `3` | Maximum retry attempts on failure |
| `timeout_secs` | integer | `30` | Request timeout in seconds |
| `cache_enabled` | bool | `true` | Enable prompt-level caching |
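
Since `mock` is the default provider and needs no key or network, a minimal offline configuration for CI runs might look like this (a sketch relying on the defaults above, not a shipped sample):

```yaml
llm:
  provider: mock        # deterministic, no API key, no network
  cache_enabled: true
```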

Enrichment Types

Vendor Name Enrichment

Generates realistic vendor names based on industry, spend category, and country:

```rust
use synth_generators::llm_enrichment::VendorLlmEnricher;

let enricher = VendorLlmEnricher::new(provider.clone());
let name = enricher.enrich_vendor_name("manufacturing", "raw_materials", "DE")?;
// e.g., "Rheinische Stahlwerke GmbH"

// Batch enrichment for efficiency
let names = enricher.enrich_batch(&[
    ("manufacturing".into(), "raw_materials".into(), "DE".into()),
    ("retail".into(), "logistics".into(), "US".into()),
], 42)?;
```

Transaction Description Enrichment

Generates contextually appropriate journal entry descriptions:

```rust
use synth_generators::llm_enrichment::TransactionLlmEnricher;

let enricher = TransactionLlmEnricher::new(provider.clone());

let desc = enricher.enrich_description(
    "Office Supplies",    // account name
    "1000-5000",          // amount range
    "retail",             // industry
    3,                    // fiscal period
)?;

let memo = enricher.enrich_memo(
    "VendorInvoice",      // document type
    "Acme Corp",          // vendor name
    "2500.00",            // amount
)?;
```

Anomaly Explanation

Generates natural language explanations for injected anomalies:

```rust
use synth_generators::llm_enrichment::AnomalyLlmExplainer;

let explainer = AnomalyLlmExplainer::new(provider.clone());
let explanation = explainer.explain(
    "DuplicatePayment",                        // anomaly type
    3,                                         // affected records
    "Same amount, same vendor, 2 days apart",  // statistical details
)?;
```

Natural Language Configuration

The NlConfigGenerator converts natural language descriptions into YAML configuration:

```rust
use synth_core::llm::NlConfigGenerator;

let yaml = NlConfigGenerator::generate(
    "Generate 1 year of retail data for a mid-size US company with fraud patterns",
    &provider,
)?;
```

CLI Usage

```bash
datasynth-data init \
    --from-description "Generate 1 year of manufacturing data for a German mid-cap with intercompany transactions" \
    -o config.yaml
```

The generator parses the free-text description into structured intent fields:

```rust
pub struct ConfigIntent {
    pub industry: Option<String>,     // e.g., "manufacturing"
    pub country: Option<String>,      // e.g., "DE"
    pub company_size: Option<String>, // e.g., "mid-cap"
    pub period_months: Option<u32>,   // e.g., 12
    pub features: Vec<String>,        // e.g., ["intercompany"]
}
```
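
For the CLI invocation above, the parsed intent would plausibly carry these values (illustrative only, inferred from the field comments):

```rust
let intent = ConfigIntent {
    industry: Some("manufacturing".to_string()),
    country: Some("DE".to_string()),
    company_size: Some("mid-cap".to_string()),
    period_months: Some(12),                    // "1 year"
    features: vec!["intercompany".to_string()],
};
```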

Caching

The LlmCache deduplicates identical prompts using FNV-1a hashing:

```rust
use synth_core::llm::LlmCache;

let cache = LlmCache::new(10000); // max 10,000 entries
let key = LlmCache::cache_key("prompt text", Some("system"), Some(42));

cache.insert(key, "cached response".into());
if let Some(response) = cache.get(key) {
    // Use the cached response instead of calling the provider
}
```

Caching is enabled by default and significantly reduces API costs when generating similar entities.
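
For reference, FNV-1a itself is only a few lines. This illustrates the hash family named above, using the 64-bit offset basis and prime from the FNV specification; it is not necessarily DataSynth's exact key derivation:

```rust
// 64-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    hash
}
```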

Cost and Privacy Considerations

Cost Management

  • Use the Mock provider for development and CI/CD (zero cost)
  • Enable caching to avoid duplicate API calls
  • Use batch enrichment (complete_batch) to reduce per-request overhead
  • Set appropriate max_tokens limits to control response sizes
  • Consider gpt-4o-mini or similar efficient models for bulk enrichment

Privacy

  • LLM prompts contain only synthetic context (industry, category, amount ranges) — never real data
  • No PII or sensitive information is sent to LLM providers
  • The Mock provider keeps everything local with no network traffic
  • For maximum privacy, use self-hosted models via the custom provider type

See Also