Disaster Recovery

DataSynth is a stateless data generation engine. It does not maintain a persistent database or durable state that requires traditional backup and recovery. Instead, recovery relies on two key properties:

Deterministic generation – Given the same configuration and seed, DataSynth produces identical output.
Stateless server – The server process can be restarted from scratch at any time.

What Needs to Be Backed Up

Asset	Location	Recovery Priority
Generation config (YAML)	`/etc/datasynth/`, ConfigMap, or source control	Critical
Environment / secrets	`/etc/datasynth/server.env`, K8s Secrets	Critical
API keys	Environment variable or Secret	Critical
Generated output files	Output directory, object storage	Depends on use case
Grafana dashboards	`deploy/grafana/provisioning/` or exported JSON	Low – can re-provision
Prometheus data	`prometheus-data` volume	Low – regenerate from metrics

The generation config and seed are the most important assets. With them, you can reproduce any dataset exactly.

Backup Procedures

Configuration Backup

Store all DataSynth configuration in version control. This is the primary backup mechanism:

# Recommended repository structure
configs/
  production/
    manufacturing.yaml      # Generation config
    server.env.encrypted    # Encrypted environment file
  staging/
    retail.yaml
    server.env.encrypted

For Kubernetes, export the ConfigMap and Secret:

# Export current config
kubectl -n datasynth get configmap datasynth-config -o yaml > backup/configmap.yaml

# Export secrets (base64-encoded)
kubectl -n datasynth get secret datasynth-api-keys -o yaml > backup/secret.yaml

Output Data Backup

If generated data must be preserved (not just re-generated), back up the output directory:

# Local backup
tar czf datasynth-output-$(date +%F).tar.gz /var/lib/datasynth/output/

# S3 backup
aws s3 sync /var/lib/datasynth/output/ s3://your-bucket/datasynth/$(date +%F)/

Scheduled Backup Script

#!/bin/bash
# /usr/local/bin/datasynth-backup.sh
# Run via cron: 0 2 * * * /usr/local/bin/datasynth-backup.sh

BACKUP_DIR="/var/backups/datasynth"
DATE=$(date +%F)

mkdir -p "$BACKUP_DIR"

# Back up configuration
cp /etc/datasynth/server.env "$BACKUP_DIR/server.env.$DATE"

# Back up output if it exists and is non-empty
if [ -d /var/lib/datasynth/output ] && [ "$(ls -A /var/lib/datasynth/output)" ]; then
  tar czf "$BACKUP_DIR/output-$DATE.tar.gz" /var/lib/datasynth/output/
fi

# Retain 30 days of backups
find "$BACKUP_DIR" -type f -mtime +30 -delete

echo "Backup completed: $DATE"

Deterministic Recovery

DataSynth uses ChaCha8 RNG with a configurable seed. When the seed is set in the configuration, every run produces byte-identical output.

Reproducing a Dataset

To reproduce a previous generation run:

Retrieve the configuration file used for that run.
Confirm the seed value is set (not random).
Run the generation with the same configuration.

# Example config with deterministic seed
global:
  industry: manufacturing
  start_date: "2024-01-01"
  period_months: 12
  seed: 42              # <-- deterministic seed

# Regenerate identical data
datasynth-data generate --config config.yaml --output ./recovered-output

# Verify output is identical
diff <(sha256sum original-output/*.csv | sort) <(sha256sum recovered-output/*.csv | sort)

Important Caveats for Determinism

Deterministic output requires exact version matching:

Factor	Must Match?	Notes
DataSynth version	Yes	Different versions may change generation logic
Configuration YAML	Yes	Any parameter change alters output
Seed value	Yes	Different seed = different data
Operating system	No	Cross-platform determinism is guaranteed
CPU architecture	No	ChaCha8 output is platform-independent
Number of threads	No	Parallelism does not affect determinism

If you need to reproduce data from a past release, pin the DataSynth version:

# Docker: use the exact version tag
docker run --rm \
  -v $(pwd)/config.yaml:/config.yaml:ro \
  -v $(pwd)/output:/output \
  datasynth/datasynth-server:0.5.0 \
  datasynth-data generate --config /config.yaml --output /output

# Source: checkout the exact tag
git checkout v0.5.0
cargo build --release -p datasynth-cli

Stateless Restart

The DataSynth server maintains no persistent state. All in-memory state (counters, active streams, generation context) is ephemeral. A restart produces a fresh server.

Restart Procedure

Docker:

docker compose restart datasynth-server

Kubernetes:

# Rolling restart (zero downtime with PDB)
kubectl -n datasynth rollout restart deployment/datasynth

# Verify rollout
kubectl -n datasynth rollout status deployment/datasynth

SystemD:

sudo systemctl restart datasynth-server

What Is Lost on Restart

State	Lost?	Impact
Prometheus metrics counters	Yes	Counters reset to 0; Prometheus handles counter resets via `rate()`
Active WebSocket streams	Yes	Clients must reconnect
Uptime counter	Yes	Resets to 0
In-progress bulk generation	Yes	Client receives connection error; must retry
Configuration (if set via API)	Yes	Reverts to default; use ConfigMap or env for persistence
Rate limit buckets	Yes	All clients get fresh rate limit windows

Mitigating Restart Impact

Use config files, not the API, for persistent configuration. The POST /api/config endpoint only updates in-memory state.
Set up client retry logic for bulk generation requests.
Use Kubernetes PDB to ensure at least one pod is always running during rolling restarts.
Monitor with Prometheus – counter resets are handled automatically by rate() and increase() functions.

Recovery Scenarios

Scenario 1: Server Process Crash

SystemD or Kubernetes automatically restarts the process.
Verify with curl localhost:3000/health.
Check logs for crash cause: journalctl -u datasynth-server -n 200.
No data loss – server is stateless.

Scenario 2: Node Failure (Kubernetes)

Kubernetes reschedules pods to healthy nodes.
PDB ensures minimum availability during rescheduling.
Clients reconnect automatically (Service endpoint updates).
No manual intervention required.

Scenario 3: Configuration Lost

Retrieve config from version control.
Redeploy: kubectl apply -f configmap.yaml or copy to /etc/datasynth/.
Restart server to pick up new config.

Scenario 4: Need to Reproduce Historical Data

Identify the DataSynth version and config used.
Pin the version (Docker tag or Git tag).
Run generation with the same config and seed.
Verify with checksums.

Recovery Time Objectives

Component	RTO	RPO	Notes
Server process	< 30s	N/A (stateless)	Auto-restart via SystemD/K8s
Full service (K8s)	< 2 min	N/A (stateless)	Pod scheduling + startup probes
Data regeneration	Depends on size	0 (deterministic)	Re-run with same config+seed
Config recovery	< 5 min	Last commit	From version control

Keyboard shortcuts

SyntheticData Documentation