Performance Tuning

Optimize SyntheticData for your hardware and requirements.

Performance Characteristics

Metric                   Typical Performance
Single-threaded          ~100,000 entries/second
Parallel (8 cores)       ~600,000 entries/second
Memory per 1M entries    ~500 MB

Configuration Tuning

Worker Threads

global:
  worker_threads: 8                  # Match CPU cores

Guidelines:

  • Default: Uses all available cores
  • I/O bound: May benefit from more threads than cores
  • Memory constrained: Reduce threads
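
The guidelines above can be turned into a starting value for worker_threads. The following is a minimal sketch, not part of SyntheticData: the function name suggest_worker_threads is hypothetical, and the 2x and 1/2 multipliers are assumptions chosen only for illustration. Measure before settling on a value.

```python
import os


def suggest_worker_threads(cores=None, io_bound=False, memory_constrained=False):
    """Return a starting value for global.worker_threads (heuristic sketch)."""
    if cores is None:
        cores = os.cpu_count() or 1  # os.cpu_count() may return None
    if memory_constrained:
        # Assumption: halving the thread count is a reasonable first step
        # when memory is tight. Never go below one thread.
        return max(1, cores // 2)
    if io_bound:
        # Assumption: twice the core count lets some threads wait on disk
        # while others keep the cores busy.
        return cores * 2
    # Default guideline from the text: match the number of CPU cores.
    return cores


if __name__ == "__main__":
    print(suggest_worker_threads(8))                           # 8
    print(suggest_worker_threads(8, io_bound=True))            # 16
    print(suggest_worker_threads(8, memory_constrained=True))  # 4
```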

Memory Limits

global:
  memory_limit: 2147483648           # 2 GB

Guidelines:

  • Set to ~75% of available RAM
  • Leave room for OS and other processes
  • Lower limit = more streaming, less memory
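
Because memory_limit is given in bytes, it helps to compute the value rather than type it by hand. A minimal sketch of the 75% guideline follows; the helper name memory_limit_bytes is hypothetical, and the example assumes a machine with 16 GiB of RAM.

```python
GIB = 1024 ** 3  # the 2 GB example above, 2147483648, is 2 * GIB


def memory_limit_bytes(total_ram_bytes, fraction=0.75):
    """Return a memory_limit value (bytes) as a fraction of total RAM."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_ram_bytes * fraction)


if __name__ == "__main__":
    # Assumed machine size: 16 GiB of RAM
    print(memory_limit_bytes(16 * GIB))  # 12884901888
```

The printed number can be pasted directly into global.memory_limit.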

Batch Sizes

The orchestrator automatically tunes batch sizes, but you can influence behavior:

transactions:
  target_count: 100000

# Implicit batch sizing based on:
# - Available memory
# - Number of threads
# - Target count

Hardware Recommendations

Minimum

Resource    Specification
CPU         2 cores
RAM         4 GB
Storage     10 GB

Suitable for: <100K entries, development

Recommended

Resource    Specification
CPU         8 cores
RAM         16 GB
Storage     50 GB SSD

Suitable for: 1M entries, production

High Performance

Resource    Specification
CPU         32+ cores
RAM         64+ GB
Storage     NVMe SSD

Suitable for: 10M+ entries, benchmarking

Optimizing Generation

Reduce Memory Pressure

Enable streaming output:

output:
  format: csv
  # Entries are written as they are generated, which reduces memory use

Disable unnecessary features:

graph_export:
  enabled: false                     # Skip if not needed

anomaly_injection:
  enabled: false                     # Inject anomalies in post-processing instead

Optimize for Speed

Maximize parallelism:

global:
  worker_threads: 16                 # More threads

Simplify output:

output:
  format: csv                        # Faster than JSON
  compression: none                  # Skip compression time

Reduce complexity:

chart_of_accounts:
  complexity: small                  # Fewer accounts

document_flows:
  p2p:
    enabled: false                   # Skip if not needed

Optimize for Size

Enable compression:

output:
  compression: zstd
  compression_level: 9               # Higher levels give smaller files but slower writes

Minimize output files:

output:
  files:
    journal_entries: true
    acdoca: false
    master_data: false               # Only what you need

Benchmarking

Built-in Benchmarks

# Run all benchmarks
cargo bench

# Specific benchmark
cargo bench --bench generation_throughput

# Compare against a previously saved baseline named "main"
cargo bench -- --baseline main

Benchmark Categories

Benchmark                Measures
generation_throughput    Entries/second
distribution_sampling    Distribution speed
output_sink              Write performance
scalability              Parallel scaling
correctness              Validation overhead

Manual Benchmarking

# Time generation
time datasynth-data generate --config config.yaml --output ./output

# Profile memory (GNU time on Linux; check "Maximum resident set size")
/usr/bin/time -v datasynth-data generate --config config.yaml --output ./output
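
For repeated measurements it can be convenient to capture wall time and peak memory in one script. The following is a rough sketch using only the Python standard library on Unix-like systems; the helper name measure and the script name in the usage note are hypothetical. The units of ru_maxrss differ by platform (kilobytes on Linux, bytes on macOS).

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return (exit_code, wall_seconds, max_rss)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    elapsed = time.perf_counter() - start
    # Largest resident set size among child processes that have been waited
    # for so far: kilobytes on Linux, bytes on macOS.
    max_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, elapsed, max_rss


if __name__ == "__main__":
    # With no arguments, fall back to a trivial command so the script
    # still runs on its own.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    code, seconds, rss = measure(cmd)
    print(f"exit={code} wall={seconds:.2f}s max_rss={rss}")
```

Saved as measure.py, it could be invoked as: python measure.py datasynth-data generate --config config.yaml --output ./output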

Profiling

CPU Profiling

# With perf (Linux)
perf record datasynth-data generate --config config.yaml --output ./output
perf report

# With Instruments (macOS)
xcrun xctrace record --template "Time Profiler" \
    --launch -- datasynth-data generate --config config.yaml --output ./output

Memory Profiling

# With heaptrack (Linux)
heaptrack datasynth-data generate --config config.yaml --output ./output
heaptrack_print heaptrack.*.gz

# With Instruments (macOS)
xcrun xctrace record --template "Allocations" \
    --launch -- datasynth-data generate --config config.yaml --output ./output

Common Bottlenecks

I/O Bound

Symptoms:

  • CPU utilization < 100%
  • Disk utilization high

Solutions:

  • Use faster storage (SSD/NVMe)
  • Enable compression (reduces write volume)
  • Increase buffer sizes

Memory Bound

Symptoms:

  • OOM errors
  • Excessive swapping

Solutions:

  • Reduce target_count
  • Lower memory_limit
  • Enable streaming
  • Reduce parallel threads

CPU Bound

Symptoms:

  • CPU at 100%
  • Generation time scales linearly with entry count

Solutions:

  • Add more cores
  • Simplify configuration
  • Disable unnecessary features
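
The symptom lists above can be read as a simple decision procedure. The sketch below is illustrative only: the function classify_bottleneck, its parameters, and the 95% and 80% thresholds are assumptions, not values taken from SyntheticData. The inputs would come from whatever system monitor you already use.

```python
# Solutions copied from the lists above, keyed by bottleneck type
SOLUTIONS = {
    "io": ["Use faster storage (SSD/NVMe)", "Enable compression",
           "Increase buffer sizes"],
    "memory": ["Reduce target_count", "Lower memory_limit",
               "Enable streaming", "Reduce parallel threads"],
    "cpu": ["Add more cores", "Simplify configuration",
            "Disable unnecessary features"],
}


def classify_bottleneck(cpu_percent, disk_busy_percent, swapping=False, oom=False):
    """Map observed symptoms to 'memory', 'cpu', 'io', or 'unclear'.

    cpu_percent is utilization averaged over all cores (0-100).
    Thresholds are illustrative assumptions.
    """
    if oom or swapping:
        return "memory"
    if cpu_percent >= 95:
        return "cpu"
    if disk_busy_percent >= 80:
        return "io"
    return "unclear"


if __name__ == "__main__":
    kind = classify_bottleneck(cpu_percent=40, disk_busy_percent=95)
    print(kind, SOLUTIONS.get(kind, []))
```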

Scaling Guidelines

Entries vs Time

Entries       Approx. time (8 cores)
10,000        <1 second
100,000       ~2 seconds
1,000,000     ~20 seconds
10,000,000    ~3 minutes

Entries vs Memory

Entries       Peak Memory
10,000        ~50 MB
100,000       ~200 MB
1,000,000     ~1.5 GB
10,000,000    ~12 GB

These estimates assume full in-memory processing. Streaming output reduces peak memory by roughly 80%.
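
For entry counts between the rows of the Entries vs Memory table, a rough estimate can be interpolated. The sketch below assumes log-log interpolation between adjacent rows and applies the ~80% streaming reduction as a flat factor of 0.2. Both choices, and the function name estimate_peak_memory_mb, are assumptions for illustration; actual usage depends on configuration.

```python
import math

# (entries, peak memory in MB) taken from the Entries vs Memory table
PEAK_MEMORY_MB = [
    (10_000, 50),
    (100_000, 200),
    (1_000_000, 1_500),
    (10_000_000, 12_000),
]


def estimate_peak_memory_mb(entries, streaming=False):
    """Estimate peak memory in MB by log-log interpolation of the table."""
    if entries <= 0:
        raise ValueError("entries must be positive")
    # Pick the pair of rows that brackets `entries`. Outside the table,
    # the first or last segment is used, which extrapolates.
    lo, hi = PEAK_MEMORY_MB[0], PEAK_MEMORY_MB[1]
    for a, b in zip(PEAK_MEMORY_MB, PEAK_MEMORY_MB[1:]):
        lo, hi = a, b
        if entries <= b[0]:
            break
    slope = math.log(hi[1] / lo[1]) / math.log(hi[0] / lo[0])
    estimate = lo[1] * (entries / lo[0]) ** slope
    if streaming:
        estimate *= 0.2  # "reduces by ~80%" applied as a flat factor
    return estimate


if __name__ == "__main__":
    for n in (10_000, 500_000, 5_000_000):
        print(n, round(estimate_peak_memory_mb(n)), "MB")
```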

Server Performance

Rate Limiting

cargo run -p datasynth-server -- \
    --port 3000 \
    --rate-limit 1000              # Requests per minute

Connection Pooling

For high-concurrency scenarios, configure worker threads:

cargo run -p datasynth-server -- \
    --worker-threads 16            # Handle more connections

WebSocket Optimization

// Client-side: batch messages
const BATCH_SIZE = 100;  // Request 100 entries at a time
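
To illustrate the batching idea, the sketch below splits a total entry count into per-request batch sizes. The server's message format is not described here, so the code computes only the sizes. The function name batch_sizes is hypothetical.

```python
def batch_sizes(total_entries, batch_size=100):
    """Split total_entries into request sizes of at most batch_size."""
    if total_entries < 0 or batch_size <= 0:
        raise ValueError("total_entries must be >= 0 and batch_size > 0")
    full, remainder = divmod(total_entries, batch_size)
    # Full batches first, then one smaller batch for any leftover entries.
    return [batch_size] * full + ([remainder] if remainder else [])


if __name__ == "__main__":
    print(batch_sizes(250))  # [100, 100, 50]
```

With a batch size of 100, requesting 250 entries takes three messages instead of 250 single-entry requests.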

See Also