Capacity Planning
This guide provides sizing models, reference benchmarks, and recommendations for provisioning DataSynth deployments.
Performance Characteristics
DataSynth is CPU-bound during generation and I/O-bound during output. Key characteristics:
- Throughput: 100K+ journal entries per second on a single core
- Scaling: Near-linear scaling with CPU cores for batch generation
- Memory: Proportional to active dataset size (companies, accounts, master data)
- Disk: Output size depends on format, compression, and enabled modules
- Network: REST/gRPC overhead is minimal; bulk generation is the bottleneck
Sizing Model
CPU
DataSynth uses Rayon for parallel generation and Tokio for async I/O. The relationship between CPU cores and throughput:
| Cores | Approx. Entries/sec | Use Case |
|---|---|---|
| 1 | 100K | Development, small datasets |
| 2 | 180K | Staging, medium datasets |
| 4 | 350K | Production, large datasets |
| 8 | 650K | High-throughput batch jobs |
| 16 | 1.1M | Maximum single-node throughput |
These numbers are for journal entry generation with balanced debit/credit lines. Enabling additional modules (document flows, subledgers, master data, anomaly injection) reduces throughput by 30-60% due to cross-referencing overhead.
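As a rough planning aid, the table can be folded into a small helper. The sketch below is illustrative only: the lookup values come from the table above, and the flat 45% module penalty is simply the midpoint of the documented 30-60% range; `estimate_throughput` is not a DataSynth API.

```rust
/// Rough throughput estimate (entries/sec) based on the core-count table above.
/// Illustrative sketch: the lookup values come from the table, and the 45%
/// module penalty is an assumed midpoint of the documented 30-60% range.
fn estimate_throughput(cores: u32, all_modules: bool) -> u64 {
    // (cores, approximate entries/sec) pairs from the sizing table.
    let table: &[(u32, u64)] = &[
        (1, 100_000),
        (2, 180_000),
        (4, 350_000),
        (8, 650_000),
        (16, 1_100_000),
    ];
    // Pick the largest benchmarked core count not exceeding the request.
    let base = table
        .iter()
        .rev()
        .find(|(c, _)| *c <= cores)
        .map(|(_, t)| *t)
        .unwrap_or(100_000);
    if all_modules {
        // Assumed midpoint of the 30-60% overhead from extra modules.
        (base as f64 * 0.55) as u64
    } else {
        base
    }
}

fn main() {
    // 4 cores with all modules enabled: ~192K entries/sec.
    println!("{} entries/sec", estimate_throughput(4, true));
}
```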
Memory
Memory usage depends on the active generation context:
| Component | Approximate Memory |
|---|---|
| Base server process | 50-100 MB |
| Chart of accounts (small) | 5-10 MB |
| Chart of accounts (large) | 30-50 MB |
| Master data per company (small) | 20-40 MB |
| Master data per company (medium) | 80-150 MB |
| Master data per company (large) | 200-400 MB |
| Active journal entries buffer | 2-5 MB per 10K entries |
| Document flow chains | 50-100 MB per company |
| Anomaly injection engine | 20-50 MB |
Sizing formula (approximate):
Memory (MB) = 100 + (companies * master_data_per_company) + (buffer_entries * 0.5)
where master_data_per_company is in MB (see the table above) and buffer_entries is measured in thousands, using the upper-bound buffer cost of 0.5 MB per 1K entries.
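The formula translates directly into code. A minimal sketch, assuming per-company master data is taken from the table above; `estimate_memory_mb` is an illustrative helper, not part of DataSynth:

```rust
/// Approximate memory sizing from the formula above. Illustrative sketch:
/// `master_data_per_company_mb` comes from the master-data table, and
/// `buffer_entries` is the size of the active journal-entry buffer.
fn estimate_memory_mb(companies: u32, master_data_per_company_mb: f64, buffer_entries: u64) -> f64 {
    let base = 100.0; // base server process (upper bound of the 50-100 MB range)
    let master_data = companies as f64 * master_data_per_company_mb;
    let buffer = (buffer_entries as f64 / 1_000.0) * 0.5; // 0.5 MB per 1K entries
    base + master_data + buffer
}

fn main() {
    // 3 medium companies (~150 MB each) with a 100K-entry buffer:
    // 100 + 450 + 50 = 600 MB, comfortably inside the "Medium" tier.
    println!("{} MB", estimate_memory_mb(3, 150.0, 100_000));
}
```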
Recommended Memory by Config Complexity
| Complexity | Companies | Memory Minimum | Memory Recommended |
|---|---|---|---|
| Small | 1-2 | 512 MB | 1 GB |
| Medium | 3-5 | 1 GB | 2 GB |
| Large | 5-10 | 2 GB | 4 GB |
| Enterprise | 10-20 | 4 GB | 8 GB |
DataSynth includes built-in memory guards that trigger graceful degradation before OOM. See Runbook - Memory Issues for degradation levels.
Disk Sizing
Output Size by Format
The output size depends on the number of entries, enabled modules, and output format:
| Entries | CSV (uncompressed) | JSON (uncompressed) | Parquet (compressed) |
|---|---|---|---|
| 10K | 15-25 MB | 30-50 MB | 3-5 MB |
| 100K | 150-250 MB | 300-500 MB | 30-50 MB |
| 1M | 1.5-2.5 GB | 3-5 GB | 300-500 MB |
| 10M | 15-25 GB | 30-50 GB | 3-5 GB |
These estimates cover journal entries only. Enabling all modules (master data, document flows, subledgers, audit trails, etc.) can multiply total output by 5-10x.
Output Files by Module
When all modules are enabled, a typical generation produces 60+ output files:
| Category | Typical File Count | Size Relative to JE |
|---|---|---|
| Journal entries + ACDOCA | 2-3 | 1.0x (baseline) |
| Master data | 6-8 | 0.3-0.5x |
| Document flows | 8-10 | 1.5-2.0x |
| Subledgers | 8-12 | 1.0-1.5x |
| Period close + consolidation | 5-8 | 0.5-1.0x |
| Labels + controls | 6-10 | 0.1-0.3x |
| Audit trails | 6-8 | 0.3-0.5x |
Disk Provisioning Formula
Disk (GB) = entries_millions * format_multiplier * module_multiplier * safety_margin
Where:
format_multiplier (GB per million entries, taken from the midpoints of the output-size table above): CSV=2.0, JSON=4.0, Parquet=0.4
module_multiplier: JE only=1.0, all modules=5.0
safety_margin: 1.5 (for temp files, logs, etc.)
Example: 1M entries, all modules, CSV format:
Disk = 1 * 2.0 * 5.0 * 1.5 = 15 GB
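The same formula as a helper function. A minimal sketch: the format multipliers are the per-million-entry midpoints of the output-size table, and `estimate_disk_gb` is illustrative, not a DataSynth API.

```rust
/// Disk provisioning from the formula above. The format multipliers are the
/// per-million-entry midpoints of the output-size table; this is an
/// illustrative sketch, not an exact predictor.
fn estimate_disk_gb(entries_millions: f64, format: &str, all_modules: bool) -> f64 {
    // GB per million entries, from the midpoints of the size table.
    let format_multiplier = match format {
        "csv" => 2.0,
        "json" => 4.0,
        "parquet" => 0.4,
        _ => 2.0, // default to CSV-like sizing for unknown formats
    };
    let module_multiplier = if all_modules { 5.0 } else { 1.0 };
    let safety_margin = 1.5; // headroom for temp files, logs, etc.
    entries_millions * format_multiplier * module_multiplier * safety_margin
}

fn main() {
    // The worked example above: 1M entries, all modules, CSV -> 15 GB.
    println!("{} GB", estimate_disk_gb(1.0, "csv", true));
}
```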
Reference Benchmarks
Benchmarks run on c5.2xlarge (8 vCPU, 16 GB RAM):
| Scenario | Config | Entries | Time | Throughput | Peak Memory |
|---|---|---|---|---|---|
| Batch (small) | 1 company, small CoA, JE only | 100K | 0.8s | 125K/s | 280 MB |
| Batch (medium) | 3 companies, medium CoA, all modules | 100K | 3.2s | 31K/s | 850 MB |
| Batch (large) | 5 companies, large CoA, all modules | 1M | 45s | 22K/s | 2.1 GB |
| Streaming | 1 company, JE only | continuous | – | 10 events/s | 350 MB |
| Concurrent API | 10 parallel bulk requests | 10K each | 4.5s | 22K/s total | 1.2 GB |
Container Resource Recommendations
Docker / Single Host
| Profile | CPU | Memory | Disk | Use Case |
|---|---|---|---|---|
| Dev | 1 core | 1 GB | 10 GB | Local testing |
| Staging | 2 cores | 2 GB | 50 GB | Integration testing |
| Production | 4 cores | 4 GB | 100 GB | Regular generation |
| Batch worker | 8 cores | 8 GB | 200 GB | Large dataset generation |
Kubernetes
| Profile | requests.cpu | requests.memory | limits.cpu | limits.memory | Replicas |
|---|---|---|---|---|---|
| Light | 250m | 256Mi | 1 | 1Gi | 2 |
| Standard | 500m | 512Mi | 2 | 2Gi | 2-5 |
| Heavy | 1000m | 1Gi | 4 | 4Gi | 3-10 |
| Burst | 2000m | 2Gi | 8 | 8Gi | 5-20 |
Scaling Guidelines
Vertical Scaling (Single Node)
Vertical scaling is effective up to 16 cores. Beyond that, returns diminish due to lock contention in the shared ServerState. Recommendations:
- Start with the “Standard” Kubernetes profile.
- Monitor synth_entries_per_second in Grafana.
- If throughput plateaus at high CPU, add replicas instead.
Horizontal Scaling (Multi-Replica)
DataSynth is stateless – each pod generates data independently. Horizontal scaling considerations:
- Enable Redis for shared rate limiting across replicas.
- Use deterministic seeds per replica to avoid duplicate data (seed = base_seed + replica_index; see the sketch after this list).
- Route bulk generation requests to specific replicas if output deduplication matters.
- WebSocket streams are per-connection and do not share state across replicas.
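A minimal sketch of the per-replica seeding convention above. The REPLICA_INDEX environment variable, and however the deployment populates it (e.g., from a StatefulSet pod ordinal), is a deployment assumption, not something DataSynth defines:

```rust
use std::env;

/// Derive a deterministic per-replica seed as described above
/// (seed = base_seed + replica_index). Assumes the deployment injects a
/// REPLICA_INDEX environment variable (e.g., parsed from a StatefulSet pod
/// ordinal); both the variable name and this helper are illustrative.
fn replica_seed(base_seed: u64) -> u64 {
    let replica_index: u64 = env::var("REPLICA_INDEX")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0); // single-replica deployments fall back to the base seed
    base_seed + replica_index
}

fn main() {
    // With REPLICA_INDEX=2 and base seed 42, this replica generates with seed 44.
    println!("seed = {}", replica_seed(42));
}
```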
Scaling Decision Tree
Is throughput below target?
|
+-- Yes: Is CPU utilization > 70%?
|   |
|   +-- Yes: Add more replicas (horizontal)
|   +-- No: Is memory > 80%?
|       |
|       +-- Yes: Increase memory limit
|       +-- No: Check I/O (disk throughput, network)
|
+-- No: Current sizing is adequate
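The same logic as a small function, convenient for autoscaling scripts. The 70% CPU and 80% memory thresholds come from the tree above; the types and function itself are illustrative:

```rust
/// Scaling actions from the decision tree above.
#[derive(Debug)]
enum ScalingAction {
    AddReplicas,
    IncreaseMemoryLimit,
    InvestigateIo,
    NoChange,
}

/// Mirrors the decision tree: the 70% CPU and 80% memory thresholds come
/// from the tree above; the function is an illustrative sketch.
fn scaling_decision(below_target: bool, cpu_util: f64, mem_util: f64) -> ScalingAction {
    if !below_target {
        ScalingAction::NoChange
    } else if cpu_util > 0.70 {
        ScalingAction::AddReplicas
    } else if mem_util > 0.80 {
        ScalingAction::IncreaseMemoryLimit
    } else {
        ScalingAction::InvestigateIo
    }
}

fn main() {
    // Throughput below target with CPU at 85%: scale out horizontally.
    println!("{:?}", scaling_decision(true, 0.85, 0.60));
}
```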
Network Bandwidth
DataSynth’s network requirements are modest:
| Operation | Bandwidth | Notes |
|---|---|---|
| Health checks | < 1 KB/s | Negligible |
| Prometheus scrape | 5-10 KB per scrape | Every 10-30s |
| Bulk API response (10K entries) | 5-15 MB burst | Short-lived |
| WebSocket stream | 1-5 KB/s per connection | 10 events/s default |
| gRPC streaming | 2-10 KB/s per stream | Depends on message size |
Network is rarely the bottleneck. A 1 Gbps link supports hundreds of concurrent clients.
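As a quick sanity check on that claim, the arithmetic below uses the worst-case bulk-response size from the table; the assumed one-request-per-10-seconds client cadence is illustrative:

```rust
/// Back-of-the-envelope check of the "hundreds of concurrent clients" claim.
/// Assumes each client issues one 10K-entry bulk request (15 MB worst-case
/// burst, per the table above) every 10 seconds; the cadence is an assumption.
fn main() {
    let link_mb_per_sec = 125.0; // 1 Gbps ~= 125 MB/s
    let burst_mb = 15.0; // worst-case bulk response for 10K entries
    let seconds_between_requests = 10.0;
    let avg_mb_per_client = burst_mb / seconds_between_requests; // 1.5 MB/s average
    // ~83 clients at worst-case response sizes; ~250 at the 5 MB low end.
    println!("~{} concurrent bulk clients", (link_mb_per_sec / avg_mb_per_client) as u64);
}
```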