Capacity Planning

This guide provides sizing models, reference benchmarks, and recommendations for provisioning DataSynth deployments.

Performance Characteristics

DataSynth is CPU-bound during generation and I/O-bound during output. Key characteristics:

  • Throughput: 100K+ journal entries per second on a single core
  • Scaling: Near-linear scaling with CPU cores for batch generation
  • Memory: Proportional to active dataset size (companies, accounts, master data)
  • Disk: Output size depends on format, compression, and enabled modules
  • Network: REST/gRPC overhead is minimal; bulk generation is the bottleneck

Sizing Model

CPU

DataSynth uses Rayon for parallel generation and Tokio for async I/O. The relationship between CPU cores and throughput:

| Cores | Approx. Entries/sec | Use Case |
|---|---|---|
| 1 | 100K | Development, small datasets |
| 2 | 180K | Staging, medium datasets |
| 4 | 350K | Production, large datasets |
| 8 | 650K | High-throughput batch jobs |
| 16 | 1.1M | Maximum single-node throughput |

These numbers are for journal entry generation with balanced debit/credit lines. Enabling additional modules (document flows, subledgers, master data, anomaly injection) reduces throughput by 30-60% due to cross-referencing overhead.
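To reproduce the per-core figures in your own environment, it can help to cap the Rayon pool at a fixed core count and time a batch. The sketch below is illustrative only: capping the global pool is a general Rayon technique, not a documented DataSynth setting, and `generate_entry` is a hypothetical stand-in for the real generator.

```rust
// Illustrative benchmark harness: pins the global Rayon pool to a fixed
// core count and measures throughput. `generate_entry` is a placeholder.
use rayon::prelude::*;

fn generate_entry(i: u64) -> String {
    // Stand-in for real journal-entry generation work.
    format!("JE-{i:08}")
}

fn main() {
    let cores = 4; // match the sizing table row you are validating
    rayon::ThreadPoolBuilder::new()
        .num_threads(cores)
        .build_global()
        .expect("Rayon global pool already initialized");

    let start = std::time::Instant::now();
    let entries: Vec<String> = (0..1_000_000u64)
        .into_par_iter()
        .map(generate_entry)
        .collect();
    let secs = start.elapsed().as_secs_f64();
    println!(
        "{} entries in {:.2}s ({:.0}/s)",
        entries.len(),
        secs,
        entries.len() as f64 / secs
    );
}
```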

Memory

Memory usage depends on the active generation context:

| Component | Approximate Memory |
|---|---|
| Base server process | 50-100 MB |
| Chart of accounts (small) | 5-10 MB |
| Chart of accounts (large) | 30-50 MB |
| Master data per company (small) | 20-40 MB |
| Master data per company (medium) | 80-150 MB |
| Master data per company (large) | 200-400 MB |
| Active journal entries buffer | 2-5 MB per 10K entries |
| Document flow chains | 50-100 MB per company |
| Anomaly injection engine | 20-50 MB |

Sizing formula (approximate):

Memory (MB) = 100 + (companies * master_data_per_company) + (buffer_entries * 0.5)

| Complexity | Companies | Minimum Memory | Recommended Memory |
|---|---|---|---|
| Small | 1-2 | 512 MB | 1 GB |
| Medium | 3-5 | 1 GB | 2 GB |
| Large | 5-10 | 2 GB | 4 GB |
| Enterprise | 10-20 | 4 GB | 8 GB |
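The formula above can be turned into a quick estimator. A minimal sketch, assuming master-data size is expressed in MB per company (taken from the component table) and the buffer term is in thousands of entries, so 0.5 MB per 1K lines up with the 2-5 MB per 10K figure:

```rust
/// Rough memory estimate in MB, mirroring the sizing formula above.
/// Assumptions: `master_data_mb_per_company` comes from the component table
/// (e.g. 80-150 MB for a medium company); `buffer_entries_thousands` is the
/// active entry buffer in thousands of entries (0.5 MB per 1K).
fn estimate_memory_mb(
    companies: u32,
    master_data_mb_per_company: f64,
    buffer_entries_thousands: f64,
) -> f64 {
    100.0 + (companies as f64 * master_data_mb_per_company) + (buffer_entries_thousands * 0.5)
}

fn main() {
    // Medium deployment: 4 companies at ~120 MB each, 100K entries buffered.
    let mb = estimate_memory_mb(4, 120.0, 100.0);
    println!("~{mb:.0} MB, which fits the 1-2 GB 'Medium' row"); // ~630 MB
}
```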

DataSynth includes built-in memory guards that trigger graceful degradation before OOM. See Runbook - Memory Issues for degradation levels.

Disk Sizing

Output Size by Format

The output size depends on the number of entries, enabled modules, and output format:

| Entries | CSV (uncompressed) | JSON (uncompressed) | Parquet (compressed) |
|---|---|---|---|
| 10K | 15-25 MB | 30-50 MB | 3-5 MB |
| 100K | 150-250 MB | 300-500 MB | 30-50 MB |
| 1M | 1.5-2.5 GB | 3-5 GB | 300-500 MB |
| 10M | 15-25 GB | 30-50 GB | 3-5 GB |

These estimates cover journal entries only. Enabling all modules (master data, document flows, subledgers, audit trails, etc.) can multiply total output by 5-10x.

Output Files by Module

When all modules are enabled, a typical generation produces 60+ output files:

| Category | Typical File Count | Size Relative to JE |
|---|---|---|
| Journal entries + ACDOCA | 2-3 | 1.0x (baseline) |
| Master data | 6-8 | 0.3-0.5x |
| Document flows | 8-10 | 1.5-2.0x |
| Subledgers | 8-12 | 1.0-1.5x |
| Period close + consolidation | 5-8 | 0.5-1.0x |
| Labels + controls | 6-10 | 0.1-0.3x |
| Audit trails | 6-8 | 0.3-0.5x |

Disk Provisioning Formula

Disk (GB) = entries_millions * format_multiplier * module_multiplier * safety_margin

Where:
  format_multiplier:  CSV=0.25, JSON=0.50, Parquet=0.05  (per million entries)
  module_multiplier:  JE only=1.0, all modules=5.0
  safety_margin:      1.5 (for temp files, logs, etc.)

Example: 1M entries, all modules, CSV format:

Disk = 1 * 0.25 * 5.0 * 1.5 = 1.875 GB (round up to 2 GB)
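The same formula as a small helper, using the multipliers listed above (illustrative; round the result up to the nearest whole GB when provisioning volumes):

```rust
/// Disk estimate in GB following the provisioning formula above.
/// `format_multiplier`: CSV = 0.25, JSON = 0.50, Parquet = 0.05 (per million entries).
/// `module_multiplier`: JE only = 1.0, all modules = 5.0.
fn estimate_disk_gb(entries_millions: f64, format_multiplier: f64, module_multiplier: f64) -> f64 {
    const SAFETY_MARGIN: f64 = 1.5; // temp files, logs, etc.
    entries_millions * format_multiplier * module_multiplier * SAFETY_MARGIN
}

fn main() {
    // Worked example from above: 1M entries, all modules, CSV format.
    let gb = estimate_disk_gb(1.0, 0.25, 5.0);
    println!("{gb:.3} GB, so provision 2 GB"); // 1.875 GB
}
```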

Reference Benchmarks

Benchmarks run on c5.2xlarge (8 vCPU, 16 GB RAM):

| Scenario | Config | Entries | Time | Throughput | Peak Memory |
|---|---|---|---|---|---|
| Batch (small) | 1 company, small CoA, JE only | 100K | 0.8s | 125K/s | 280 MB |
| Batch (medium) | 3 companies, medium CoA, all modules | 100K | 3.2s | 31K/s | 850 MB |
| Batch (large) | 5 companies, large CoA, all modules | 1M | 45s | 22K/s | 2.1 GB |
| Streaming | 1 company, JE only | continuous | n/a | 10 events/s | 350 MB |
| Concurrent API | 10 parallel bulk requests | 10K each | 4.5s | 22K/s total | 1.2 GB |

Container Resource Recommendations

Docker / Single Host

| Profile | CPU | Memory | Disk | Use Case |
|---|---|---|---|---|
| Dev | 1 core | 1 GB | 10 GB | Local testing |
| Staging | 2 cores | 2 GB | 50 GB | Integration testing |
| Production | 4 cores | 4 GB | 100 GB | Regular generation |
| Batch worker | 8 cores | 8 GB | 200 GB | Large dataset generation |

Kubernetes

| Profile | requests.cpu | requests.memory | limits.cpu | limits.memory | Replicas |
|---|---|---|---|---|---|
| Light | 250m | 256Mi | 1 | 1Gi | 2 |
| Standard | 500m | 512Mi | 2 | 2Gi | 2-5 |
| Heavy | 1000m | 1Gi | 4 | 4Gi | 3-10 |
| Burst | 2000m | 2Gi | 8 | 8Gi | 5-20 |

Scaling Guidelines

Vertical Scaling (Single Node)

Vertical scaling is effective up to 16 cores. Beyond that, returns diminish due to lock contention in the shared ServerState. Recommendations:

  1. Start with the “Standard” Kubernetes profile.
  2. Monitor synth_entries_per_second in Grafana.
  3. If throughput plateaus at high CPU, add replicas instead.

Horizontal Scaling (Multi-Replica)

DataSynth is stateless – each pod generates data independently. Horizontal scaling considerations:

  1. Enable Redis for shared rate limiting across replicas.
  2. Use deterministic seeds per replica to avoid duplicate data (seed = base_seed + replica_index); see the sketch after this list.
  3. Route bulk generation requests to specific replicas if output deduplication matters.
  4. WebSocket streams are per-connection and do not share state across replicas.
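For point 2, the per-replica seed can be derived from an index injected into each pod. A minimal sketch, assuming a `REPLICA_INDEX` environment variable is provided by the deployment (for example from a StatefulSet ordinal); the variable name and offset scheme are illustrative, not a built-in DataSynth convention:

```rust
use std::env;

/// Derives a deterministic per-replica seed as base_seed + replica_index.
/// Assumes the deployment injects REPLICA_INDEX; both the variable name and
/// the scheme are illustrative.
fn replica_seed(base_seed: u64) -> u64 {
    let replica_index: u64 = env::var("REPLICA_INDEX")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0); // fall back to replica 0 for single-node runs
    base_seed + replica_index
}

fn main() {
    let seed = replica_seed(42);
    println!("using seed {seed}");
}
```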

Scaling Decision Tree

Is throughput below target?
  |
  +-- Yes: Is CPU utilization > 70%?
  |    |
  |    +-- Yes: Add more replicas (horizontal)
  |    +-- No:  Is memory > 80%?
  |         |
  |         +-- Yes: Increase memory limit
  |         +-- No:  Check I/O (disk throughput, network)
  |
  +-- No: Current sizing is adequate
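The same logic expressed as a small function, for use in automation or runbooks (purely illustrative; the thresholds are the ones from the tree above):

```rust
/// Scaling action derived from the decision tree above (illustrative only).
enum ScalingAction {
    AddReplicas,
    IncreaseMemoryLimit,
    InvestigateIo,
    NoChange,
}

fn recommend(throughput_below_target: bool, cpu_util: f64, mem_util: f64) -> ScalingAction {
    if !throughput_below_target {
        return ScalingAction::NoChange;
    }
    if cpu_util > 0.70 {
        ScalingAction::AddReplicas
    } else if mem_util > 0.80 {
        ScalingAction::IncreaseMemoryLimit
    } else {
        ScalingAction::InvestigateIo
    }
}

fn main() {
    // Example: throughput below target, CPU at 85% => scale horizontally.
    match recommend(true, 0.85, 0.60) {
        ScalingAction::AddReplicas => println!("add replicas"),
        ScalingAction::IncreaseMemoryLimit => println!("raise memory limit"),
        ScalingAction::InvestigateIo => println!("check disk/network I/O"),
        ScalingAction::NoChange => println!("sizing is adequate"),
    }
}
```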

Network Bandwidth

DataSynth’s network requirements are modest:

| Operation | Bandwidth | Notes |
|---|---|---|
| Health checks | < 1 KB/s | Negligible |
| Prometheus scrape | 5-10 KB per scrape | Every 10-30s |
| Bulk API response (10K entries) | 5-15 MB burst | Short-lived |
| WebSocket stream | 1-5 KB/s per connection | 10 events/s default |
| gRPC streaming | 2-10 KB/s per stream | Depends on message size |

Network is rarely the bottleneck. A 1 Gbps link supports hundreds of concurrent clients.