Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

SOC 2 Type II Readiness

This document describes how DataSynth’s architecture and controls align with the AICPA Trust Services Criteria (TSC) used in SOC 2 Type II engagements. DataSynth is a synthetic data generation tool, not a cloud-hosted SaaS product, so this assessment focuses on the controls embedded in the software itself rather than organizational policies. Organizations deploying DataSynth should layer their own operational controls (change management, personnel security, vendor management) on top of the technical controls described here.

Assessment Scope

  • System: DataSynth synthetic financial data generator
  • Version: 0.5.x
  • Deployment Models: CLI binary, REST/gRPC/WebSocket server, Python library, desktop application
  • Assessment Type: Architecture readiness (pre-audit self-assessment)

CC1: Security

The Security criterion (Common Criteria) requires that the system is protected against unauthorized access, both logical and physical.

Authentication

DataSynth’s server component (datasynth-server) implements two authentication mechanisms:

API Key Authentication: API keys are hashed with Argon2id (memory-hard, side-channel resistant) at server startup. Verification iterates all stored hashes without short-circuiting to prevent timing-based enumeration. A short-lived (5-second TTL) FNV-1a hash cache avoids repeated Argon2id computation for successive requests from the same client. Keys are accepted via Authorization: Bearer <key> or X-API-Key headers.

JWT/OIDC (optional jwt feature): External identity providers (Keycloak, Auth0, Entra ID) issue RS256-signed tokens. The JwtValidator verifies issuer, audience, expiration, and signature. Claims include subject, email, roles, and tenant ID for multi-tenancy.

Authorization

Role-Based Access Control (RBAC) enforces least-privilege access:

RoleGenerateDataManageJobsViewJobsManageConfigViewConfigViewMetricsManageApiKeys
AdminYYYYYYY
OperatorYYYNYYN
ViewerNNYNYYN

RBAC can be disabled for development environments; when disabled, all authenticated requests are treated as Admin.

Network Security

The security headers middleware injects the following headers on all server responses:

HeaderValuePurpose
X-Content-Type-OptionsnosniffPrevent MIME-type sniffing
X-Frame-OptionsDENYPrevent clickjacking
Content-Security-Policydefault-src 'none'; frame-ancestors 'none'Restrict resource loading
Referrer-Policystrict-origin-when-cross-originLimit referrer leakage
Cache-Controlno-storePrevent caching of API responses
X-XSS-Protection0Defer to CSP (modern best practice)

TLS termination is supported via reverse proxy (nginx, Caddy, Envoy) or Kubernetes ingress. CORS is configurable with allowlisted origins.

Rate Limiting

Per-client rate limiting uses a sliding-window counter with configurable thresholds (requests per second, burst size). A Redis-backed rate limiter is available for multi-instance deployments (redis feature flag).


CC2: Availability

The Availability criterion requires that the system is available for operation and use as committed.

Graceful Degradation

The DegradationController in datasynth-core monitors memory, disk, and CPU utilization and applies progressive feature reduction:

LevelMemoryDiskCPUResponse
Normal< 70%> 1000 MB< 80%All features enabled, full batch sizes
Reduced70–85%500–1000 MB80–90%Half batch sizes, skip data quality injection
Minimal85–95%100–500 MB> 90%Essential data only, no anomaly injection
Emergency> 95%< 100 MBFlush pending writes, terminate gracefully

Auto-recovery with hysteresis (5% improvement required) allows the system to step back up one level at a time when resource pressure subsides.

Resource Monitoring

  • Memory guard: Reads /proc/self/statm (Linux) or ps (macOS) to track resident set size against configurable limits.
  • Disk guard: Uses statvfs (Unix) or GetDiskFreeSpaceExW (Windows) to monitor available disk space in the output directory.
  • CPU monitor: Tracks CPU utilization with auto-throttle at 0.95 threshold.
  • Resource guard: Unified orchestration that combines all three monitors and drives the DegradationController.

Graceful Shutdown

The server handles SIGTERM by stopping acceptance of new requests, waiting for in-flight requests to complete (with configurable timeout), and flushing pending output. The CLI supports SIGUSR1 for pause/resume of generation runs.

Health Endpoints

The following endpoints are exempt from authentication for infrastructure integration:

EndpointPurpose
/healthGeneral health check
/readyReadiness probe (Kubernetes)
/liveLiveness probe (Kubernetes)
/metricsPrometheus-compatible metrics

CC3: Processing Integrity

The Processing Integrity criterion requires that system processing is complete, valid, accurate, timely, and authorized.

Deterministic Generation

DataSynth uses the ChaCha8 cryptographically secure pseudo-random number generator with a configurable seed. Given the same configuration YAML and seed value, output is byte-identical across runs and platforms. This provides auditability (reproduce any dataset from its configuration) and regression detection (compare output hashes after code changes).

Quality Gates

The evaluation framework (datasynth-eval) applies configurable pass/fail criteria to every generation run. Built-in quality gate profiles provide three levels of strictness:

MetricStrictDefaultLenient
Benford MAD<= 0.01<= 0.015<= 0.03
Balance Coherence>= 0.999>= 0.99>= 0.95
Document Chain Integrity>= 0.95>= 0.90>= 0.80
Completion Rate>= 0.99>= 0.95>= 0.90
Duplicate Rate<= 0.001<= 0.01<= 0.05
Referential Integrity>= 0.999>= 0.99>= 0.95
IC Match Rate>= 0.99>= 0.95>= 0.85
Privacy MIA AUC<= 0.55<= 0.60<= 0.70

Gate evaluation supports fail-fast (stop on first failure) and collect-all (report all failures) strategies.

Balance Validation

The JournalEntry model enforces debits = credits at construction time. An entry that does not balance cannot be created, eliminating an entire class of data integrity errors.

Content Marking

EU AI Act Article 50 synthetic content credentials are embedded in all output files (CSV headers, JSON metadata, Parquet file metadata). This prevents synthetic data from being mistaken for real financial records. Content marking is enabled by default.


CC4: Confidentiality

The Confidentiality criterion requires that information designated as confidential is protected as committed.

No Real Data Storage

In the default operating mode (pure synthetic generation), DataSynth does not process, store, or transmit real data. All names, identifiers, transactions, and addresses are algorithmically generated from configuration parameters and RNG output.

Fingerprint Privacy

When the fingerprint extraction workflow processes real data, the following privacy controls apply:

MechanismDefault (Standard Level)
Differential privacy (Laplace)Epsilon = 1.0, Delta = 1e-5
K-anonymity suppressionK >= 5
Composition accountingNaive (Renyi DP, zCDP available)

The output .dsf fingerprint file contains only aggregate statistics (means, variances, correlations), not individual records.

API Key Security

API keys are never stored in plaintext. At server startup, raw keys are hashed with Argon2id (random salt, PHC format) and discarded. Verification uses Argon2id comparison that iterates all stored hashes to prevent timing-based key enumeration.

Audit Logging

The JsonAuditLogger emits structured JSON audit events via the tracing crate. Each event records timestamp, request ID, actor identity (user ID or API key hash prefix), action, resource, outcome (success/denied/error), tenant ID, source IP, and user agent. Events are suitable for SIEM ingestion.


CC5: Privacy

The Privacy criterion requires that personal information is collected, used, retained, disclosed, and disposed of in conformity with commitments.

Synthetic Data by Design

DataSynth’s default mode generates purely synthetic data. No personal information is collected or processed. Generated entities (vendors, customers, employees) have no real-world counterparts. This eliminates most privacy obligations for pure synthetic workflows.

Privacy Evaluation

The evaluation framework includes empirical privacy testing:

  • Membership Inference Attack (MIA): Distance-based classifier measures AUC-ROC. A score near 0.50 indicates the synthetic data does not memorize real data patterns.
  • Linkage Attack Assessment: Evaluates re-identification risk using quasi-identifier combinations. Measures achieved k-anonymity and unique QI overlap.

NIST SP 800-226 Alignment

The evaluation framework generates NIST SP 800-226 alignment reports assessing data transformation adequacy, re-identification risk, documentation completeness, and privacy control effectiveness. An overall alignment score of >= 71% is required for a passing grade.

Fingerprint Extraction Privacy Levels

LevelEpsilonDeltaK-AnonymityUse Case
minimal10.01e-32Non-sensitive aggregates
standard1.01e-55General business data
high0.51e-610Sensitive financial data
maximum0.11e-820Regulated personal data

Controls Mapping

The following table maps DataSynth features to SOC 2 Trust Services Criteria identifiers.

TSC IDCriterionDataSynth ControlImplementation
CC6.1Logical access securityAPI key authenticationauth.rs: Argon2id hashing, timing-safe comparison
CC6.1Logical access securityJWT/OIDC supportauth.rs: RS256 token validation (optional jwt feature)
CC6.3Role-based accessRBAC enforcementrbac.rs: Admin/Operator/Viewer roles with permission matrix
CC6.6System boundariesSecurity headerssecurity_headers.rs: CSP, X-Frame-Options, HSTS support
CC6.6System boundariesRate limitingrate_limit.rs: Per-client sliding window, Redis backend
CC6.8Transmission securityTLS supportReverse proxy TLS termination, Kubernetes ingress
CC7.2MonitoringResource guardsresource_guard.rs: CPU, memory, disk monitoring
CC7.2MonitoringAudit loggingaudit.rs: Structured JSON events for SIEM
CC7.3Change detectionConfig hashingSHA-256 hash of configuration embedded in output
CC7.4Incident responseContent markingContent credentials identify synthetic origin
CC8.1Processing integrityDeterministic RNGChaCha8 with configurable seed
CC8.1Processing integrityQuality gatesgates/engine.rs: Configurable pass/fail thresholds
CC8.1Processing integrityBalance validationJournalEntry enforces debits = credits at construction
CC9.1Availability managementGraceful degradationdegradation.rs: Normal/Reduced/Minimal/Emergency levels
CC9.1Availability managementHealth endpoints/health, /ready, /live (auth-exempt)
P3.1Privacy noticeSynthetic content markingEU AI Act Article 50 credentials in all output
P4.1Collection limitationNo real data by defaultPure synthetic generation requires no data collection
P6.1Data qualityQuality gatesStatistical, coherence, and privacy quality metrics
P8.1DisposalDeterministic generationNo persistent state; regenerate from config + seed

Gap Analysis

The following areas require organizational controls that are outside DataSynth’s software scope:

AreaRecommendation
Physical securityDeploy on infrastructure with appropriate physical access controls
Change managementImplement CI/CD pipelines with code review and approval workflows
Vendor managementAssess third-party dependencies via cargo audit and SBOM generation
Personnel securityApply organizational onboarding/offboarding procedures for API key management
Backup and recoveryConfigure backup for generation configurations and output data per retention policies
Incident response planDocument procedures for scenarios where synthetic data is mistakenly treated as real

See Also