SyntheticData generates multiple file types, organized into the following directory structure:
```
output/
├── master_data/        # Entity master records
├── transactions/       # Journal entries and documents
├── subledgers/         # Subsidiary ledger records
├── period_close/       # Trial balances and closing
├── consolidation/      # Elimination entries
├── fx/                 # Exchange rates
├── banking/            # KYC profiles and bank transactions
├── process_mining/     # OCEL 2.0 event logs
├── audit/              # Audit engagements and workpapers
├── graphs/             # ML-ready graph exports
├── labels/             # Anomaly, fraud, and quality labels
└── controls/           # Internal control mappings
```
CSV is the default output format, using standard conventions:
- UTF-8 encoding
- Comma-separated values
- Header row included
- Quoted strings when needed
- Decimal values serialized as strings (prevents floating-point artifacts)
Example (journal_entries.csv):
```csv
document_id,posting_date,company_code,account,description,debit,credit,is_fraud
abc-123,2024-01-15,1000,1100,Customer payment,"1000.00","0.00",false
abc-123,2024-01-15,1000,1200,Cash receipt,"0.00","1000.00",false
```
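
Because amounts are serialized as strings, load them as `Decimal` rather than float to avoid re-introducing floating-point artifacts. A minimal sketch with pandas; the file path is illustrative and assumes `journal_entries.csv` lives under `output/transactions/`:

```python
from decimal import Decimal

import pandas as pd

# Keep amount columns as strings on read, then convert to exact decimals.
je = pd.read_csv(
    "output/transactions/journal_entries.csv",  # adjust to your output directory
    dtype={"debit": str, "credit": str, "account": str},
)
je["debit"] = je["debit"].map(Decimal)    # assumes no missing amounts
je["credit"] = je["credit"].map(Decimal)

# Sanity check: total debits equal total credits across the file.
assert je["debit"].sum() == je["credit"].sum()
```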
JSON output is a structured format with nested objects:
Example (journal_entries.json):
```json
[
  {
    "header": {
      "document_id": "abc-123",
      "posting_date": "2024-01-15",
      "company_code": "1000",
      "source": "Manual",
      "is_fraud": false
    },
    "lines": [
      {
        "account": "1100",
        "description": "Customer payment",
        "debit": "1000.00",
        "credit": "0.00"
      },
      {
        "account": "1200",
        "description": "Cash receipt",
        "debit": "0.00",
        "credit": "1000.00"
      }
    ]
  }
]
```
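
The nested layout makes per-document checks straightforward. A short sketch using only the standard library; the file path is illustrative:

```python
import json
from decimal import Decimal

with open("output/transactions/journal_entries.json") as f:  # illustrative path
    documents = json.load(f)

for doc in documents:
    debits = sum(Decimal(line["debit"]) for line in doc["lines"])
    credits = sum(Decimal(line["credit"]) for line in doc["lines"])
    # Every journal entry should balance.
    assert debits == credits, doc["header"]["document_id"]
```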
The SAP output mirrors the Universal Journal (ACDOCA) layout and adds simulation metadata fields:

| Field | Description |
| --- | --- |
| RCLNT | Client |
| RLDNR | Ledger |
| RBUKRS | Company code |
| GJAHR | Fiscal year |
| BELNR | Document number |
| DOCLN | Line item |
| RYEAR | Year |
| POPER | Posting period |
| RACCT | Account |
| DRCRK | Debit/Credit indicator |
| HSL | Amount in local currency |
| ZSIM_* | Simulation metadata fields |

GL account master:

| Field | Description |
| --- | --- |
| account_number | GL account code |
| account_name | Display name |
| account_type | Asset, Liability, Equity, Revenue, Expense |
| account_subtype | Detailed classification |
| is_control_account | Links to subledger |
| normal_balance | Debit or Credit |

Vendor master:

| Field | Description |
| --- | --- |
| vendor_id | Unique identifier |
| vendor_name | Company name |
| tax_id | Tax identification |
| payment_terms | Standard terms |
| currency | Transaction currency |
| is_intercompany | IC flag |

Customer master:

| Field | Description |
| --- | --- |
| customer_id | Unique identifier |
| customer_name | Company/person name |
| credit_limit | Maximum credit |
| credit_rating | Rating code |
| payment_behavior | Typical payment pattern |

Material master:

| Field | Description |
| --- | --- |
| material_id | Unique identifier |
| description | Material name |
| material_type | Classification |
| valuation_method | FIFO, LIFO, Avg |
| standard_cost | Unit cost |

Employee and user master:

| Field | Description |
| --- | --- |
| employee_id | Unique identifier |
| name | Full name |
| department | Department code |
| manager_id | Hierarchy link |
| approval_limit | Maximum approval amount |
| transaction_codes | Authorized T-codes |

Journal entry headers:

| Field | Description |
| --- | --- |
| document_id | Entry identifier |
| company_code | Company |
| fiscal_year | Year |
| fiscal_period | Period |
| posting_date | Date posted |
| document_date | Original date |
| source | Transaction source |
| business_process | Process category |
| is_fraud | Fraud indicator |
| is_anomaly | Anomaly indicator |

Journal entry lines:

| Field | Description |
| --- | --- |
| line_number | Sequence |
| account_number | GL account |
| cost_center | Cost center |
| profit_center | Profit center |
| debit_amount | Debit |
| credit_amount | Credit |
| description | Line description |
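
For analysis it is often convenient to flatten headers and lines into one table. A sketch with pandas; the file names, and the presence of `document_id` on the line file, are assumptions:

```python
import pandas as pd

# File names are assumptions; the generator may use different ones.
headers = pd.read_csv("output/transactions/journal_entry_headers.csv")
lines = pd.read_csv("output/transactions/journal_entry_lines.csv")

# One row per line, with header attributes (including the fraud label) attached.
flat = lines.merge(
    headers[["document_id", "posting_date", "company_code", "is_fraud"]],
    on="document_id",
    how="left",
)
print(flat.groupby("is_fraud").size())
```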
Transaction document files:

purchase_orders.csv:
- Order header with vendor, dates, status
- Line items with materials, quantities, prices
goods_receipts.csv:
- Receipt linked to PO
- Quantities received, variances
vendor_invoices.csv:
- Invoice with three-way match status
- Payment terms, due date
payments.csv:
- Payment documents
- Bank references, cleared invoices
document_references.csv:
- Links between documents (FollowOn, Payment, Reversal)
- Ensures complete document chains
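
document_references.csv can be used to walk complete document chains (purchase order to goods receipt to invoice to payment). A small sketch; the column names used here (`source_document_id`, `target_document_id`) are assumptions, so adjust them to the actual header row:

```python
import csv
from collections import defaultdict

links = defaultdict(list)
with open("output/transactions/document_references.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Column names are assumptions.
        links[row["source_document_id"]].append(row["target_document_id"])

def chain(doc_id, seen=None):
    """Return every document reachable from doc_id via follow-on references."""
    seen = seen if seen is not None else set()
    for nxt in links.get(doc_id, []):
        if nxt not in seen:
            seen.add(nxt)
            chain(nxt, seen)
    return seen
```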

Accounts receivable open items:

| Field | Description |
| --- | --- |
| customer_id | Customer reference |
| invoice_number | Document number |
| invoice_date | Date issued |
| due_date | Payment due |
| original_amount | Invoice total |
| open_amount | Remaining balance |
| aging_bucket | 0-30, 31-60, 61-90, 90+ |

A similar structure is generated for accounts payable.
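
The aging_bucket value can be recomputed from the due date, which is useful when validating the file or re-aging as of a different date. A sketch of the bucketing implied by the field values above; the exact boundary conventions (for example, how not-yet-due items are treated) are assumptions:

```python
from datetime import date

def aging_bucket(due_date: date, as_of: date) -> str:
    """Map days past due onto the buckets used in the open-item files."""
    days = (as_of - due_date).days
    if days <= 30:
        return "0-30"
    if days <= 60:
        return "31-60"
    if days <= 90:
        return "61-90"
    return "90+"

print(aging_bucket(date(2024, 1, 15), date(2024, 3, 1)))  # "31-60"
```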

Fixed asset register:

| Field | Description |
| --- | --- |
| asset_id | Asset identifier |
| description | Asset name |
| acquisition_date | Purchase date |
| acquisition_cost | Original cost |
| useful_life_years | Depreciation period |
| depreciation_method | Straight-line, etc. |
| accumulated_depreciation | Total depreciation |
| net_book_value | Current value |
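
The register should be internally consistent: net book value equals acquisition cost less accumulated depreciation. A quick check; the file path is an assumption, and amounts are read as strings per the CSV conventions above:

```python
import csv
from decimal import Decimal

with open("output/subledgers/fixed_assets.csv", newline="") as f:  # assumed path
    for row in csv.DictReader(f):
        cost = Decimal(row["acquisition_cost"])
        accum = Decimal(row["accumulated_depreciation"])
        nbv = Decimal(row["net_book_value"])
        assert nbv == cost - accum, row["asset_id"]
```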

Inventory balances:

| Field | Description |
| --- | --- |
| material_id | Material reference |
| warehouse | Location |
| quantity | Units on hand |
| unit_cost | Current cost |
| total_value | Extended value |

Trial balance:

| Field | Description |
| --- | --- |
| account_number | GL account |
| account_name | Description |
| opening_balance | Period start |
| period_debits | Total debits |
| period_credits | Total credits |
| closing_balance | Period end |
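
Each trial balance row should roll forward: closing_balance = opening_balance + period_debits - period_credits, assuming balances are stored debit-positive. A sketch of the check (file path assumed):

```python
from decimal import Decimal

import pandas as pd

tb = pd.read_csv("output/period_close/trial_balance.csv", dtype=str)  # assumed path
for col in ("opening_balance", "period_debits", "period_credits", "closing_balance"):
    tb[col] = tb[col].map(Decimal)

rolled = tb["opening_balance"] + tb["period_debits"] - tb["period_credits"]
mismatches = tb[rolled != tb["closing_balance"]]
print(f"{len(mismatches)} accounts fail the roll-forward check")
```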
The period-close output also includes accrual entries with reversal dates and monthly depreciation entries per asset.

Banking customer master:

| Field | Description |
| --- | --- |
| customer_id | Unique identifier |
| customer_type | retail, business, trust |
| name | Customer name |
| created_at | Account creation date |
| risk_score | Calculated risk score (0-100) |
| kyc_status | verified, pending, enhanced_due_diligence |
| pep_flag | Politically exposed person |
| sanctions_flag | Sanctions list match |

Bank accounts:

| Field | Description |
| --- | --- |
| account_id | Unique identifier |
| customer_id | Owner reference |
| account_type | checking, savings, money_market |
| currency | Account currency |
| opened_date | Opening date |
| balance | Current balance |
| status | active, dormant, closed |

Bank transactions:

| Field | Description |
| --- | --- |
| transaction_id | Unique identifier |
| account_id | Account reference |
| timestamp | Transaction time |
| amount | Transaction amount |
| currency | Transaction currency |
| direction | credit, debit |
| channel | branch, atm, online, wire, ach |
| category | Transaction category |
| counterparty_id | Counterparty reference |

KYC expected-activity profiles:

| Field | Description |
| --- | --- |
| customer_id | Customer reference |
| declared_turnover | Expected monthly volume |
| transaction_frequency | Expected transactions/month |
| source_of_funds | Declared income source |
| geographic_exposure | List of countries |
| cash_intensity | Expected cash ratio |
| beneficial_owner_complexity | Ownership layers |

AML typology labels:

| Field | Description |
| --- | --- |
| transaction_id | Transaction reference |
| typology | structuring, funnel, layering, mule, fraud |
| confidence | Confidence score (0-1) |
| pattern_id | Related pattern identifier |
| related_transactions | Comma-separated related IDs |

Entity risk labels:

| Field | Description |
| --- | --- |
| entity_id | Customer or account ID |
| entity_type | customer, account |
| risk_category | high, medium, low |
| risk_factors | Contributing factors |
| label_date | Label timestamp |
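
A common use of the banking output is building a labeled AML training set by attaching the typology labels to the raw transactions. A sketch; the file names are assumptions, while the join key and fields match the tables above:

```python
import pandas as pd

# File locations and names are assumptions.
txns = pd.read_csv("output/banking/transactions.csv", parse_dates=["timestamp"])
labels = pd.read_csv("output/labels/aml_typologies.csv")

dataset = txns.merge(
    labels[["transaction_id", "typology", "confidence"]],
    on="transaction_id",
    how="left",
)
# Transactions without a typology label are treated as normal activity.
dataset["typology"] = dataset["typology"].fillna("none")
print(dataset["typology"].value_counts())
```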
OCEL 2.0 format event log:
```json
{
  "ocel:global-log": {
    "ocel:version": "2.0",
    "ocel:ordering": "timestamp"
  },
  "ocel:events": {
    "e1": {
      "ocel:activity": "Create Purchase Order",
      "ocel:timestamp": "2024-01-15T10:30:00Z",
      "ocel:typedOmap": [
        {"ocel:oid": "PO-001", "ocel:qualifier": "order"}
      ]
    }
  },
  "ocel:objects": {
    "PO-001": {
      "ocel:type": "PurchaseOrder",
      "ocel:attributes": {
        "vendor": "VEND-001",
        "amount": "10000.00"
      }
    }
  }
}
```
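
Because the log is plain JSON, it can be inspected without a process-mining library. A standard-library sketch that lists, for each event, its activity and the objects it touches; the file name is an assumption, while the keys follow the example above:

```python
import json

with open("output/process_mining/event_log.json") as f:  # assumed file name
    ocel = json.load(f)

objects = ocel["ocel:objects"]
for event_id, event in ocel["ocel:events"].items():
    related = [rel["ocel:oid"] for rel in event.get("ocel:typedOmap", [])]
    types = [objects[oid]["ocel:type"] for oid in related if oid in objects]
    print(event_id, event["ocel:activity"], list(zip(related, types)))
```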
Companion files list the object instances with their types and attributes, and the event records with their object relationships.

Process variants:

| Field | Description |
| --- | --- |
| variant_id | Unique identifier |
| activity_sequence | Ordered activity list |
| frequency | Occurrence count |
| avg_duration | Average case duration |

Audit engagements:

| Field | Description |
| --- | --- |
| engagement_id | Unique identifier |
| client_name | Client entity |
| engagement_type | Financial, Compliance, Operational |
| fiscal_year | Audit period |
| materiality | Materiality threshold |
| status | Planning, Fieldwork, Completion |

Workpapers:

| Field | Description |
| --- | --- |
| workpaper_id | Unique identifier |
| engagement_id | Engagement reference |
| workpaper_type | Lead schedule, Substantive, etc. |
| prepared_by | Preparer ID |
| reviewed_by | Reviewer ID |
| status | Draft, Reviewed, Final |

Audit evidence:

| Field | Description |
| --- | --- |
| evidence_id | Unique identifier |
| workpaper_id | Workpaper reference |
| evidence_type | Document, Inquiry, Observation, etc. |
| source | Evidence source |
| reliability | High, Medium, Low |
| sufficiency | Sufficient, Insufficient |

Risk assessments:

| Field | Description |
| --- | --- |
| risk_id | Unique identifier |
| engagement_id | Engagement reference |
| risk_description | Risk narrative |
| risk_level | High, Significant, Low |
| likelihood | Probable, Possible, Remote |
| response | Response strategy |

Audit findings:

| Field | Description |
| --- | --- |
| finding_id | Unique identifier |
| engagement_id | Engagement reference |
| finding_type | Deficiency, Significant, Material Weakness |
| description | Finding narrative |
| recommendation | Recommended action |
| management_response | Response text |

Professional judgments:

| Field | Description |
| --- | --- |
| judgment_id | Unique identifier |
| workpaper_id | Workpaper reference |
| judgment_area | Revenue recognition, Estimates, etc. |
| alternatives_considered | Options evaluated |
| conclusion | Selected approach |
| rationale | Reasoning documentation |
```
graphs/transaction_network/pytorch_geometric/
├── node_features.pt   # [num_nodes, features]
├── edge_index.pt      # [2, num_edges]
├── edge_attr.pt       # [num_edges, edge_features]
├── labels.pt          # Node/edge labels
├── train_mask.pt      # Training split
├── val_mask.pt        # Validation split
└── test_mask.pt       # Test split
```
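
The tensors can be assembled into a torch_geometric `Data` object directly. A minimal sketch, assuming the `.pt` files were written with `torch.save`:

```python
from pathlib import Path

import torch
from torch_geometric.data import Data

root = Path("graphs/transaction_network/pytorch_geometric")

data = Data(
    x=torch.load(root / "node_features.pt"),        # [num_nodes, features]
    edge_index=torch.load(root / "edge_index.pt"),   # [2, num_edges]
    edge_attr=torch.load(root / "edge_attr.pt"),     # [num_edges, edge_features]
    y=torch.load(root / "labels.pt"),
)
data.train_mask = torch.load(root / "train_mask.pt")
data.val_mask = torch.load(root / "val_mask.pt")
data.test_mask = torch.load(root / "test_mask.pt")
print(data)
```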
```
graphs/entity_relationship/neo4j/
├── nodes_account.csv
├── nodes_entity.csv
├── nodes_user.csv
├── edges_transaction.csv
├── edges_approval.csv
└── import.cypher      # Import script
```
```
graphs/transaction_network/dgl/
├── graph.bin          # DGL binary format
├── node_features.npy  # NumPy arrays
└── edge_features.npy
```
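
The DGL export can be loaded with `dgl.load_graphs`, with the NumPy feature arrays read alongside. A short sketch, assuming graph.bin was written with `dgl.save_graphs`:

```python
import dgl
import numpy as np

# dgl.load_graphs returns (list_of_graphs, label_dict)
graphs, _ = dgl.load_graphs("graphs/transaction_network/dgl/graph.bin")
g = graphs[0]

node_features = np.load("graphs/transaction_network/dgl/node_features.npy")
edge_features = np.load("graphs/transaction_network/dgl/edge_features.npy")
print(g, node_features.shape, edge_features.shape)
```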

Anomaly labels:

| Field | Description |
| --- | --- |
| document_id | Entry reference |
| anomaly_id | Unique anomaly ID |
| anomaly_type | Classification |
| anomaly_category | Fraud, Error, Process, Statistical, Relational |
| severity | Low, Medium, High |
| description | Human-readable explanation |

Fraud labels:

| Field | Description |
| --- | --- |
| document_id | Entry reference |
| fraud_type | Specific fraud pattern (20+ types) |
| detection_difficulty | Easy, Medium, Hard |
| description | Fraud scenario description |

Data quality labels:

| Field | Description |
| --- | --- |
| record_id | Record reference |
| field_name | Affected field |
| issue_type | MissingValue, Typo, FormatVariation, Duplicate |
| issue_subtype | Detailed classification |
| original_value | Value before modification |
| modified_value | Value after modification |
| severity | Severity level (1-5) |
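
Because each quality issue records both the original and the modified value, the labels can be used to score cleaning pipelines or to reconstruct a clean copy of an affected record. A sketch of the reconstruction direction; the label file name is an assumption:

```python
import pandas as pd

dq = pd.read_csv("output/labels/data_quality_labels.csv")  # assumed file name

def clean_record(record: dict, record_id: str) -> dict:
    """Revert every labelled modification for one record back to its original value."""
    fixes = dq[dq["record_id"] == record_id]
    cleaned = dict(record)
    for _, fix in fixes.iterrows():
        cleaned[fix["field_name"]] = fix["original_value"]
    return cleaned
```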

Control definitions:

| Field | Description |
| --- | --- |
| control_id | Unique identifier |
| control_name | Description |
| control_type | Preventive, Detective |
| frequency | Continuous, Daily, etc. |
| assertions | Completeness, Accuracy, etc. |

Control-to-account mappings:

| Field | Description |
| --- | --- |
| control_id | Control reference |
| account_number | GL account |
| threshold | Monetary threshold |

The controls output also includes segregation-of-duties conflict definitions and the actual SoD violations detected in the generated data.
Parquet output uses the Apache Parquet columnar format, suited to large analytical datasets:
```yaml
output:
  format: parquet
  compression: snappy   # snappy, gzip, zstd
```
Benefits:
- Columnar storage — efficient for queries touching few columns
- Built-in compression — typically 5-10x smaller than CSV
- Schema embedding — self-describing files with full type information
- Predicate pushdown — query engines skip irrelevant row groups
Use with: Apache Spark, DuckDB, Polars, pandas, BigQuery, Snowflake, Databricks.
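
Parquet files can be read directly by any of those engines, for example with pandas or DuckDB (paths are illustrative):

```python
import duckdb
import pandas as pd

# pandas reads a single file (requires pyarrow or fastparquet).
je = pd.read_parquet("output/transactions/journal_entries.parquet")

# DuckDB can query a whole directory with predicate pushdown, reading only
# the columns and row groups the query needs. Column typing follows the
# embedded Parquet schema (is_fraud assumed boolean here).
flagged = duckdb.sql(
    "SELECT company_code, count(*) AS n "
    "FROM 'output/transactions/*.parquet' "
    "WHERE is_fraud GROUP BY company_code"
).df()
```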
SyntheticData can export in native ERP table schemas:

| Format | Target ERP | Tables |
| --- | --- | --- |
| sap | SAP S/4HANA | BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA, CSKS, CEPC |
| oracle | Oracle EBS | GL_JE_HEADERS, GL_JE_LINES, GL_JE_BATCHES |
| netsuite | NetSuite | Journal entries with subsidiary, multi-book, custom fields |
See ERP Output Formats for field mappings and configuration.

| Option | Extension | Use Case |
| --- | --- | --- |
| none | .csv/.json | Development, small datasets |
| gzip | .csv.gz | General compression |
| zstd | .csv.zst | High performance |
| snappy | .parquet | Parquet default (fast) |
```yaml
output:
  format: csv            # csv, json, jsonl, parquet, sap, oracle, netsuite
  compression: none      # none, gzip, zstd (CSV/JSON) or snappy/gzip/zstd (Parquet)
  compression_level: 6   # 1-9 (if compression enabled)
  streaming: false       # Enable streaming mode for large outputs
```
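
Compressed CSV output can still be read directly by most tools. pandas, for instance, infers gzip from the .csv.gz extension; zstd additionally requires the zstandard package. Paths below are illustrative:

```python
import pandas as pd

# Compression is inferred from the file extension.
gz = pd.read_csv("output/transactions/journal_entries.csv.gz")
zst = pd.read_csv("output/transactions/journal_entries.csv.zst")  # needs `pip install zstandard`
```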