SOC 2 Type II Readiness
This document describes how DataSynth’s architecture and controls align with the AICPA Trust Services Criteria (TSC) used in SOC 2 Type II engagements. DataSynth is a synthetic data generation tool, not a cloud-hosted SaaS product, so this assessment focuses on the controls embedded in the software itself rather than organizational policies. Organizations deploying DataSynth should layer their own operational controls (change management, personnel security, vendor management) on top of the technical controls described here.
Assessment Scope
- System: DataSynth synthetic financial data generator
- Version: 0.5.x
- Deployment Models: CLI binary, REST/gRPC/WebSocket server, Python library, desktop application
- Assessment Type: Architecture readiness (pre-audit self-assessment)
CC1: Security
The Security criterion (Common Criteria) requires that the system is protected against unauthorized access, both logical and physical.
Authentication
DataSynth’s server component (datasynth-server) implements two authentication mechanisms:
API Key Authentication: API keys are hashed with Argon2id (memory-hard, side-channel resistant) at server startup. Verification iterates all stored hashes without short-circuiting to prevent timing-based enumeration. A short-lived (5-second TTL) FNV-1a hash cache avoids repeated Argon2id computation for successive requests from the same client. Keys are accepted via Authorization: Bearer <key> or X-API-Key headers.
JWT/OIDC (optional jwt feature): External identity providers (Keycloak, Auth0, Entra ID) issue RS256-signed tokens. The JwtValidator verifies issuer, audience, expiration, and signature. Claims include subject, email, roles, and tenant ID for multi-tenancy.
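The API key verification loop and the cache hash described above can be sketched as follows. This is a minimal illustration, not the code in datasynth-server: `matches_any` and its `verify` closure are hypothetical names (the closure stands in for an Argon2id verification call), and the 64-bit FNV-1a variant is an assumption, since the width used by the cache is not stated here.

```rust
/// 64-bit FNV-1a, a fast non-cryptographic hash (width is an assumption).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        // FNV-1a XORs the byte in first, then multiplies by the prime.
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Checks a candidate key against every stored hash without returning early.
/// `verify` is a hypothetical stand-in for an Argon2id verification call.
fn matches_any<F>(candidate: &str, stored_hashes: &[String], verify: F) -> bool
where
    F: Fn(&str, &str) -> bool,
{
    let mut found = false;
    for stored in stored_hashes {
        // `|=` calls `verify` on every iteration, unlike `||` or `Iterator::any`,
        // so the number of verifications does not depend on which key matched.
        found |= verify(candidate, stored);
    }
    found
}
```

Because every stored hash is always checked, the work done per request is the same whether the first key, the last key, or no key matches.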
Authorization
Role-Based Access Control (RBAC) enforces least-privilege access:
| Role | GenerateData | ManageJobs | ViewJobs | ManageConfig | ViewConfig | ViewMetrics | ManageApiKeys |
|---|---|---|---|---|---|---|---|
| Admin | Y | Y | Y | Y | Y | Y | Y |
| Operator | Y | Y | Y | N | Y | Y | N |
| Viewer | N | N | Y | N | Y | Y | N |
RBAC can be disabled for development environments; when disabled, all authenticated requests are treated as Admin.
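The permission matrix above can be expressed as a single lookup function. The sketch below is illustrative only: the `Role` and `Permission` enums and `is_allowed` are hypothetical names that mirror the table, not the types defined in rbac.rs.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Role {
    Admin,
    Operator,
    Viewer,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Permission {
    GenerateData,
    ManageJobs,
    ViewJobs,
    ManageConfig,
    ViewConfig,
    ViewMetrics,
    ManageApiKeys,
}

/// Returns whether `role` holds `permission` under the matrix above.
fn is_allowed(role: Role, permission: Permission, rbac_enabled: bool) -> bool {
    use Permission::*;
    // With RBAC disabled, every authenticated request is treated as Admin.
    let effective = if rbac_enabled { role } else { Role::Admin };
    match effective {
        Role::Admin => true,
        // Operator has everything except configuration and API key management.
        Role::Operator => !matches!(permission, ManageConfig | ManageApiKeys),
        // Viewer is limited to the read-only permissions.
        Role::Viewer => matches!(permission, ViewJobs | ViewConfig | ViewMetrics),
    }
}
```

For example, `is_allowed(Role::Viewer, Permission::GenerateData, true)` is false, while the same call with `rbac_enabled` set to false is true.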
Network Security
The security headers middleware injects the following headers on all server responses:
| Header | Value | Purpose |
|---|---|---|
| X-Content-Type-Options | nosniff | Prevent MIME-type sniffing |
| X-Frame-Options | DENY | Prevent clickjacking |
| Content-Security-Policy | default-src 'none'; frame-ancestors 'none' | Restrict resource loading |
| Referrer-Policy | strict-origin-when-cross-origin | Limit referrer leakage |
| Cache-Control | no-store | Prevent caching of API responses |
| X-XSS-Protection | 0 | Defer to CSP (modern best practice) |
TLS termination is supported via reverse proxy (nginx, Caddy, Envoy) or Kubernetes ingress. CORS is configurable with allowlisted origins.
Rate Limiting
Per-client rate limiting uses a sliding-window counter with configurable thresholds (requests per second, burst size). A Redis-backed rate limiter is available for multi-instance deployments (redis feature flag).
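To illustrate per-client sliding-window limiting, the sketch below keeps a log of accepted request timestamps per client. It is a simplified variant (a sliding-window log rather than a counter, with no separate burst parameter) and is not the code in rate_limit.rs. `SlidingWindowLimiter` and its fields are hypothetical names, and the caller supplies the clock value so the behavior is deterministic.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical in-memory limiter: at most `max_requests` per `window_ms` per client.
struct SlidingWindowLimiter {
    window_ms: u64,
    max_requests: usize,
    // Timestamps (in milliseconds) of accepted requests, keyed by client.
    seen: HashMap<String, VecDeque<u64>>,
}

impl SlidingWindowLimiter {
    fn new(window_ms: u64, max_requests: usize) -> Self {
        Self { window_ms, max_requests, seen: HashMap::new() }
    }

    /// Returns true if a request from `client` at time `now_ms` is allowed.
    fn allow(&mut self, client: &str, now_ms: u64) -> bool {
        let log = self.seen.entry(client.to_string()).or_default();
        // Drop timestamps that have slid out of the window.
        while let Some(&oldest) = log.front() {
            if now_ms.saturating_sub(oldest) >= self.window_ms {
                log.pop_front();
            } else {
                break;
            }
        }
        if log.len() < self.max_requests {
            log.push_back(now_ms);
            true
        } else {
            false
        }
    }
}
```

A multi-instance deployment would need shared state instead of the in-process map, which is the role the Redis-backed limiter plays.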
CC2: Availability
The Availability criterion requires that the system is available for operation and use as committed.
Graceful Degradation
The DegradationController in datasynth-core monitors memory, disk, and CPU utilization and applies progressive feature reduction:
| Level | Memory | Disk | CPU | Response |
|---|---|---|---|---|
| Normal | < 70% | > 1000 MB | < 80% | All features enabled, full batch sizes |
| Reduced | 70–85% | 500–1000 MB | 80–90% | Half batch sizes, skip data quality injection |
| Minimal | 85–95% | 100–500 MB | > 90% | Essential data only, no anomaly injection |
| Emergency | > 95% | < 100 MB | – | Flush pending writes, terminate gracefully |
Auto-recovery with hysteresis (5% improvement required) allows the system to step back up one level at a time when resource pressure subsides.
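The interaction between the thresholds and the recovery hysteresis can be sketched for a single resource. The example below considers memory only and makes two assumptions that are not confirmed by this document: the 5% margin is read as five percentage points below the threshold at which the current level was entered, and degradation happens immediately while recovery moves one level per evaluation. `Level`, `level_for_memory`, and `next_level` are hypothetical names.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Normal,
    Reduced,
    Minimal,
    Emergency,
}

/// Level implied by memory utilization alone, using the thresholds in the table.
fn level_for_memory(pct: f64) -> Level {
    if pct > 95.0 {
        Level::Emergency
    } else if pct >= 85.0 {
        Level::Minimal
    } else if pct >= 70.0 {
        Level::Reduced
    } else {
        Level::Normal
    }
}

/// Memory percentage at which `level` is entered from the level below it.
fn entry_threshold(level: Level) -> f64 {
    match level {
        Level::Normal => 0.0,
        Level::Reduced => 70.0,
        Level::Minimal => 85.0,
        Level::Emergency => 95.0,
    }
}

/// Computes the next level from the current level and a new memory reading.
fn next_level(current: Level, memory_pct: f64) -> Level {
    const HYSTERESIS: f64 = 5.0; // assumed to mean percentage points
    let target = level_for_memory(memory_pct);
    if target >= current {
        // Pressure is the same or worse: move straight to the implied level.
        return target;
    }
    // Pressure has eased: step up one level only once usage is clearly
    // below the threshold that triggered the current level.
    if memory_pct < entry_threshold(current) - HYSTERESIS {
        match current {
            Level::Emergency => Level::Minimal,
            Level::Minimal => Level::Reduced,
            _ => Level::Normal,
        }
    } else {
        current
    }
}
```

Under these assumptions a system in the Reduced level stays there at 68% memory and returns to Normal only below 65%, which prevents oscillation around the 70% boundary.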
Resource Monitoring
- Memory guard: Reads /proc/self/statm (Linux) or ps (macOS) to track resident set size against configurable limits.
- Disk guard: Uses statvfs (Unix) or GetDiskFreeSpaceExW (Windows) to monitor available disk space in the output directory.
- CPU monitor: Tracks CPU utilization with auto-throttle at 0.95 threshold.
- Resource guard: Unified orchestration that combines all three monitors and drives the DegradationController.
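As an illustration of the Linux memory guard, the sketch below derives the resident set size from /proc/self/statm, whose second whitespace-separated field is the number of resident pages. The function names are hypothetical, and the page size is passed in by the caller (4096 bytes is common but is an assumption; a real implementation would query it from the OS).

```rust
use std::fs;

/// Parses the contents of /proc/self/statm and returns the resident set
/// size in bytes. Returns None if the field is missing or not a number.
fn rss_bytes_from_statm(contents: &str, page_size: u64) -> Option<u64> {
    // Fields are counted in pages: size, resident, shared, text, lib, data, dt.
    let resident_pages: u64 = contents.split_whitespace().nth(1)?.parse().ok()?;
    resident_pages.checked_mul(page_size)
}

/// Reads the current process RSS on Linux; returns None on other platforms
/// or if the file cannot be read.
fn current_rss_bytes(page_size: u64) -> Option<u64> {
    let contents = fs::read_to_string("/proc/self/statm").ok()?;
    rss_bytes_from_statm(&contents, page_size)
}

/// True when the measured RSS is above the configured limit.
fn exceeds_limit(rss_bytes: u64, limit_bytes: u64) -> bool {
    rss_bytes > limit_bytes
}
```

Separating the parser from the file read keeps the parsing logic testable on any platform.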
Graceful Shutdown
The server handles SIGTERM by stopping acceptance of new requests, waiting for in-flight requests to complete (with configurable timeout), and flushing pending output. The CLI supports SIGUSR1 for pause/resume of generation runs.
Health Endpoints
The following endpoints are exempt from authentication for infrastructure integration:
| Endpoint | Purpose |
|---|---|
| /health | General health check |
| /ready | Readiness probe (Kubernetes) |
| /live | Liveness probe (Kubernetes) |
| /metrics | Prometheus-compatible metrics |
CC3: Processing Integrity
The Processing Integrity criterion requires that system processing is complete, valid, accurate, timely, and authorized.
Deterministic Generation
DataSynth uses the ChaCha8 cryptographically secure pseudo-random number generator with a configurable seed. Given the same configuration YAML and seed value, output is byte-identical across runs and platforms. This provides auditability (reproduce any dataset from its configuration) and regression detection (compare output hashes after code changes).
Quality Gates
The evaluation framework (datasynth-eval) applies configurable pass/fail criteria to every generation run. Built-in quality gate profiles provide three levels of strictness:
| Metric | Strict | Default | Lenient |
|---|---|---|---|
| Benford MAD | <= 0.01 | <= 0.015 | <= 0.03 |
| Balance Coherence | >= 0.999 | >= 0.99 | >= 0.95 |
| Document Chain Integrity | >= 0.95 | >= 0.90 | >= 0.80 |
| Completion Rate | >= 0.99 | >= 0.95 | >= 0.90 |
| Duplicate Rate | <= 0.001 | <= 0.01 | <= 0.05 |
| Referential Integrity | >= 0.999 | >= 0.99 | >= 0.95 |
| IC Match Rate | >= 0.99 | >= 0.95 | >= 0.85 |
| Privacy MIA AUC | <= 0.55 | <= 0.60 | <= 0.70 |
Gate evaluation supports fail-fast (stop on first failure) and collect-all (report all failures) strategies.
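The difference between the two strategies can be sketched as follows. The types and metric names here (`Gate`, `Bound`, `Strategy`, `benford_mad`, and so on) are hypothetical and do not reflect the API of datasynth-eval. The thresholds are three of the Default-profile values from the table, and treating a missing metric as a failure is an assumption of this sketch.

```rust
#[derive(Clone, Copy)]
enum Bound {
    AtMost(f64),
    AtLeast(f64),
}

struct Gate {
    name: &'static str,
    bound: Bound,
}

#[derive(Clone, Copy, PartialEq)]
enum Strategy {
    FailFast,
    CollectAll,
}

impl Gate {
    fn passes(&self, value: f64) -> bool {
        match self.bound {
            Bound::AtMost(limit) => value <= limit,
            Bound::AtLeast(limit) => value >= limit,
        }
    }
}

/// A subset of the Default profile from the table above.
fn default_profile() -> Vec<Gate> {
    vec![
        Gate { name: "benford_mad", bound: Bound::AtMost(0.015) },
        Gate { name: "balance_coherence", bound: Bound::AtLeast(0.99) },
        Gate { name: "duplicate_rate", bound: Bound::AtMost(0.01) },
    ]
}

/// Returns the names of failed gates; an empty vector means the run passed.
fn evaluate(gates: &[Gate], metrics: &[(&str, f64)], strategy: Strategy) -> Vec<&'static str> {
    let mut failures = Vec::new();
    for gate in gates {
        // A metric missing from the run counts as a failure in this sketch.
        let observed = metrics.iter().find(|(n, _)| *n == gate.name).map(|(_, v)| *v);
        let ok = observed.map_or(false, |v| gate.passes(v));
        if !ok {
            failures.push(gate.name);
            if strategy == Strategy::FailFast {
                break; // stop at the first failing gate
            }
        }
    }
    failures
}
```

Fail-fast suits CI pipelines where any failure rejects the run, while collect-all is more useful when tuning a configuration because it reports every violated threshold at once.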
Balance Validation
The JournalEntry model enforces debits = credits at construction time. An entry that does not balance cannot be created, eliminating an entire class of data integrity errors.
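Enforcing an invariant at construction time typically means the only way to obtain a value is a constructor that validates it. The sketch below shows that pattern with hypothetical types (`BalancedEntry`, `Line`, `Unbalanced`); it is not the actual JournalEntry definition, and representing amounts as integer minor units is an assumption made to avoid floating-point rounding.

```rust
/// One posting line; amounts are in minor units (for example, cents).
#[derive(Debug)]
enum Line {
    Debit(i64),
    Credit(i64),
}

/// Error returned when total debits and total credits differ.
#[derive(Debug, PartialEq)]
struct Unbalanced {
    debits: i64,
    credits: i64,
}

/// An entry whose lines are private, so it can only be built through `new`.
#[derive(Debug)]
struct BalancedEntry {
    lines: Vec<Line>,
}

impl BalancedEntry {
    /// Builds an entry only if total debits equal total credits.
    fn new(lines: Vec<Line>) -> Result<Self, Unbalanced> {
        let (mut debits, mut credits) = (0i64, 0i64);
        for line in &lines {
            match line {
                Line::Debit(amount) => debits += amount,
                Line::Credit(amount) => credits += amount,
            }
        }
        if debits == credits {
            Ok(Self { lines })
        } else {
            Err(Unbalanced { debits, credits })
        }
    }

    /// Read-only access; there is no way to mutate lines after construction.
    fn lines(&self) -> &[Line] {
        &self.lines
    }
}
```

Because downstream code can only ever hold a value that passed the check, it never needs to re-validate balance.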
Content Marking
EU AI Act Article 50 synthetic content credentials are embedded in all output files (CSV headers, JSON metadata, Parquet file metadata). This prevents synthetic data from being mistaken for real financial records. Content marking is enabled by default.
CC4: Confidentiality
The Confidentiality criterion requires that information designated as confidential is protected as committed.
No Real Data Storage
In the default operating mode (pure synthetic generation), DataSynth does not process, store, or transmit real data. All names, identifiers, transactions, and addresses are algorithmically generated from configuration parameters and RNG output.
Fingerprint Privacy
When the fingerprint extraction workflow processes real data, the following privacy controls apply:
| Mechanism | Default (Standard Level) |
|---|---|
| Differential privacy (Laplace) | Epsilon = 1.0, Delta = 1e-5 |
| K-anonymity suppression | K >= 5 |
| Composition accounting | Naive (Rényi DP and zCDP also available) |
The output .dsf fingerprint file contains only aggregate statistics (means, variances, correlations), not individual records.
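The two mechanisms in the table can be sketched in a few lines. For the Laplace mechanism, noise is drawn with scale b = sensitivity / epsilon, so a smaller epsilon yields more noise. K-anonymity suppression drops any group with fewer than K members. The function names and the `sensitivity` parameter are hypothetical, the uniform draw is supplied by the caller rather than taken from an RNG, and the sketch does not cover how DataSynth accounts for delta or composition.

```rust
/// Laplace scale parameter b = sensitivity / epsilon.
fn laplace_scale(sensitivity: f64, epsilon: f64) -> f64 {
    sensitivity / epsilon
}

/// Inverse-CDF sampling: maps a uniform draw `u` in (-0.5, 0.5) to a
/// Laplace(0, scale) noise value.
fn laplace_from_uniform(u: f64, scale: f64) -> f64 {
    -scale * u.signum() * (1.0 - 2.0 * u.abs()).ln()
}

/// Keeps only groups with at least `k` members; smaller groups are suppressed.
fn suppress_small_groups(groups: &[(&str, u64)], k: u64) -> Vec<(String, u64)> {
    groups
        .iter()
        .filter(|(_, count)| *count >= k)
        .map(|(name, count)| (name.to_string(), *count))
        .collect()
}
```

With the standard level (epsilon = 1.0) and a sensitivity of 1, the scale is 1.0; halving epsilon to 0.5 doubles it to 2.0.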
API Key Security
API keys are never stored in plaintext. At server startup, raw keys are hashed with Argon2id (random salt, PHC format) and discarded. Verification uses Argon2id comparison that iterates all stored hashes to prevent timing-based key enumeration.
Audit Logging
The JsonAuditLogger emits structured JSON audit events via the tracing crate. Each event records timestamp, request ID, actor identity (user ID or API key hash prefix), action, resource, outcome (success/denied/error), tenant ID, source IP, and user agent. Events are suitable for SIEM ingestion.
CC5: Privacy
The Privacy criterion requires that personal information is collected, used, retained, disclosed, and disposed of in conformity with commitments.
Synthetic Data by Design
DataSynth’s default mode generates purely synthetic data. No personal information is collected or processed. Generated entities (vendors, customers, employees) have no real-world counterparts. This eliminates most privacy obligations for pure synthetic workflows.
Privacy Evaluation
The evaluation framework includes empirical privacy testing:
- Membership Inference Attack (MIA): A distance-based classifier is scored by AUC-ROC. A score near 0.50 means the attacker does no better than chance, which indicates the synthetic data does not memorize real data patterns.
- Linkage Attack Assessment: Evaluates re-identification risk using quasi-identifier combinations. Measures achieved k-anonymity and unique QI overlap.
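The AUC-ROC figure used for the MIA metric can be computed directly from attack scores: it equals the probability that a randomly chosen member record receives a higher score than a randomly chosen non-member, with ties counted as one half. The sketch below uses that pairwise definition; `auc_roc` is a hypothetical helper, not a function from datasynth-eval, and how the attack scores are produced is outside its scope.

```rust
/// AUC-ROC as the fraction of (member, non-member) pairs in which the
/// member scores higher, counting ties as half. Returns None if either
/// group is empty.
fn auc_roc(member_scores: &[f64], non_member_scores: &[f64]) -> Option<f64> {
    if member_scores.is_empty() || non_member_scores.is_empty() {
        return None;
    }
    let mut wins = 0.0;
    for &m in member_scores {
        for &n in non_member_scores {
            if m > n {
                wins += 1.0;
            } else if m == n {
                wins += 0.5;
            }
        }
    }
    let pairs = (member_scores.len() * non_member_scores.len()) as f64;
    Some(wins / pairs)
}
```

An attacker that separates members perfectly yields 1.0, while scores that carry no membership signal yield about 0.5. The result can then be compared against a gate threshold such as the Default profile's 0.60.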
NIST SP 800-226 Alignment
The evaluation framework generates NIST SP 800-226 alignment reports assessing data transformation adequacy, re-identification risk, documentation completeness, and privacy control effectiveness. An overall alignment score of >= 71% is required for a passing grade.
Fingerprint Extraction Privacy Levels
| Level | Epsilon | Delta | K-Anonymity | Use Case |
|---|---|---|---|---|
| minimal | 10.0 | 1e-3 | 2 | Non-sensitive aggregates |
| standard | 1.0 | 1e-5 | 5 | General business data |
| high | 0.5 | 1e-6 | 10 | Sensitive financial data |
| maximum | 0.1 | 1e-8 | 20 | Regulated personal data |
Controls Mapping
The following table maps DataSynth features to SOC 2 Trust Services Criteria identifiers.
| TSC ID | Criterion | DataSynth Control | Implementation |
|---|---|---|---|
| CC6.1 | Logical access security | API key authentication | auth.rs: Argon2id hashing, timing-safe comparison |
| CC6.1 | Logical access security | JWT/OIDC support | auth.rs: RS256 token validation (optional jwt feature) |
| CC6.3 | Role-based access | RBAC enforcement | rbac.rs: Admin/Operator/Viewer roles with permission matrix |
| CC6.6 | System boundaries | Security headers | security_headers.rs: CSP, X-Frame-Options, HSTS support |
| CC6.6 | System boundaries | Rate limiting | rate_limit.rs: Per-client sliding window, Redis backend |
| CC6.8 | Transmission security | TLS support | Reverse proxy TLS termination, Kubernetes ingress |
| CC7.2 | Monitoring | Resource guards | resource_guard.rs: CPU, memory, disk monitoring |
| CC7.2 | Monitoring | Audit logging | audit.rs: Structured JSON events for SIEM |
| CC7.3 | Change detection | Config hashing | SHA-256 hash of configuration embedded in output |
| CC7.4 | Incident response | Content marking | Content credentials identify synthetic origin |
| CC8.1 | Processing integrity | Deterministic RNG | ChaCha8 with configurable seed |
| CC8.1 | Processing integrity | Quality gates | gates/engine.rs: Configurable pass/fail thresholds |
| CC8.1 | Processing integrity | Balance validation | JournalEntry enforces debits = credits at construction |
| CC9.1 | Availability management | Graceful degradation | degradation.rs: Normal/Reduced/Minimal/Emergency levels |
| CC9.1 | Availability management | Health endpoints | /health, /ready, /live (auth-exempt) |
| P3.1 | Privacy notice | Synthetic content marking | EU AI Act Article 50 credentials in all output |
| P4.1 | Collection limitation | No real data by default | Pure synthetic generation requires no data collection |
| P6.1 | Data quality | Quality gates | Statistical, coherence, and privacy quality metrics |
| P8.1 | Disposal | Deterministic generation | No persistent state; regenerate from config + seed |
Gap Analysis
The following areas require organizational controls that are outside DataSynth’s software scope:
| Area | Recommendation |
|---|---|
| Physical security | Deploy on infrastructure with appropriate physical access controls |
| Change management | Implement CI/CD pipelines with code review and approval workflows |
| Vendor management | Assess third-party dependencies via cargo audit and SBOM generation |
| Personnel security | Apply organizational onboarding/offboarding procedures for API key management |
| Backup and recovery | Configure backup for generation configurations and output data per retention policies |
| Incident response plan | Document procedures for scenarios where synthetic data is mistakenly treated as real |