Output Formats

SyntheticData generates multiple file types, organized into the directory categories shown below.

Output Directory Structure

output/
├── master_data/          # Entity master records
├── transactions/         # Journal entries and documents
├── subledgers/           # Subsidiary ledger records
├── period_close/         # Trial balances and closing
├── consolidation/        # Elimination entries
├── fx/                   # Exchange rates
├── banking/              # KYC profiles and bank transactions
├── process_mining/       # OCEL 2.0 event logs
├── audit/                # Audit engagements and workpapers
├── graphs/               # ML-ready graph exports
├── labels/               # Anomaly, fraud, and quality labels
└── controls/             # Internal control mappings

File Formats

CSV

Default format with standard conventions:

  • UTF-8 encoding
  • Comma-separated values
  • Header row included
  • Quoted strings when needed
  • Decimal values serialized as strings (prevents floating-point artifacts)

Example (journal_entries.csv):

document_id,posting_date,company_code,account,description,debit,credit,is_fraud
abc-123,2024-01-15,1000,1100,Customer payment,"1000.00","0.00",false
abc-123,2024-01-15,1000,1200,Cash receipt,"0.00","1000.00",false
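
Because monetary amounts are serialized as strings, reading them back with decimal.Decimal keeps the values exact. A minimal pandas sketch — the output path is an assumption, and the column names follow the example above:

import pandas as pd
from decimal import Decimal

# Read journal entries, keeping monetary columns as exact decimals rather than floats.
# Path is assumed from the output directory layout.
df = pd.read_csv(
    "output/transactions/journal_entries.csv",
    dtype={"company_code": str, "account": str},      # preserve leading zeros in codes
    converters={"debit": Decimal, "credit": Decimal}, # parse amounts as Decimal
)

# Exact arithmetic on the string-serialized amounts: total debits equal total credits.
assert df["debit"].sum() == df["credit"].sum()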

JSON

Structured format with nested objects:

Example (journal_entries.json):

[
  {
    "header": {
      "document_id": "abc-123",
      "posting_date": "2024-01-15",
      "company_code": "1000",
      "source": "Manual",
      "is_fraud": false
    },
    "lines": [
      {
        "account": "1100",
        "description": "Customer payment",
        "debit": "1000.00",
        "credit": "0.00"
      },
      {
        "account": "1200",
        "description": "Cash receipt",
        "debit": "0.00",
        "credit": "1000.00"
      }
    ]
  }
]
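
The nested layout makes it easy to validate each document as a unit. A short sketch that loads the file and checks that every document balances — the path is assumed; keys follow the example above:

import json
from decimal import Decimal

# Load journal entries and verify debits equal credits per document.
with open("output/transactions/journal_entries.json") as f:
    documents = json.load(f)

for doc in documents:
    debits = sum(Decimal(line["debit"]) for line in doc["lines"])
    credits = sum(Decimal(line["credit"]) for line in doc["lines"])
    assert debits == credits, doc["header"]["document_id"]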

ACDOCA (SAP S/4HANA)

SAP Universal Journal format with simulation fields:

Field     Description
RCLNT     Client
RLDNR     Ledger
RBUKRS    Company code
GJAHR     Fiscal year
BELNR     Document number
DOCLN     Line item
RYEAR     Year
POPER     Posting period
RACCT     Account
DRCRK     Debit/Credit indicator
HSL       Amount in local currency
ZSIM_*    Simulation metadata fields

Master Data Files

chart_of_accounts.csv

Field                 Description
account_number        GL account code
account_name          Display name
account_type          Asset, Liability, Equity, Revenue, Expense
account_subtype       Detailed classification
is_control_account    Links to subledger
normal_balance        Debit or Credit

vendors.csv

Field              Description
vendor_id          Unique identifier
vendor_name        Company name
tax_id             Tax identification
payment_terms      Standard terms
currency           Transaction currency
is_intercompany    IC flag

customers.csv

Field               Description
customer_id         Unique identifier
customer_name       Company/person name
credit_limit        Maximum credit
credit_rating       Rating code
payment_behavior    Typical payment pattern

materials.csv

Field               Description
material_id         Unique identifier
description         Material name
material_type       Classification
valuation_method    FIFO, LIFO, Avg
standard_cost       Unit cost

employees.csv

Field                Description
employee_id          Unique identifier
name                 Full name
department           Department code
manager_id           Hierarchy link
approval_limit       Maximum approval amount
transaction_codes    Authorized T-codes

Transaction Files

journal_entries.csv

Field               Description
document_id         Entry identifier
company_code        Company
fiscal_year         Year
fiscal_period       Period
posting_date        Date posted
document_date       Original date
source              Transaction source
business_process    Process category
is_fraud            Fraud indicator
is_anomaly          Anomaly indicator

Line Items (embedded or separate)

Field             Description
line_number       Sequence
account_number    GL account
cost_center       Cost center
profit_center     Profit center
debit_amount      Debit
credit_amount     Credit
description       Line description

Document Flow Files

purchase_orders.csv:

  • Order header with vendor, dates, status
  • Line items with materials, quantities, prices

goods_receipts.csv:

  • Receipt linked to PO
  • Quantities received, variances

vendor_invoices.csv:

  • Invoice with three-way match status
  • Payment terms, due date

payments.csv:

  • Payment documents
  • Bank references, cleared invoices

document_references.csv:

  • Links between documents (FollowOn, Payment, Reversal)
  • Ensures complete document chains
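
These references make it possible to walk a procure-to-pay chain end to end. A sketch that follows a chain from a purchase order via document_references.csv — the column names (source_document_id, target_document_id) are illustrative, so check the generated file header for the actual names:

import pandas as pd

# Walk a document chain using the reference links (FollowOn, Payment, Reversal).
refs = pd.read_csv("output/transactions/document_references.csv")

def follow_chain(doc_id, refs):
    """Return all documents reachable from doc_id via reference links."""
    chain, frontier = [doc_id], [doc_id]
    while frontier:
        nxt = refs.loc[refs["source_document_id"].isin(frontier), "target_document_id"]
        frontier = [d for d in nxt.unique() if d not in chain]
        chain.extend(frontier)
    return chain

print(follow_chain("PO-001", refs))  # e.g. PO -> goods receipt -> invoice -> payment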

Subledger Files

ar_open_items.csv

Field              Description
customer_id        Customer reference
invoice_number     Document number
invoice_date       Date issued
due_date           Payment due
original_amount    Invoice total
open_amount        Remaining balance
aging_bucket       0-30, 31-60, 61-90, 90+

ap_open_items.csv

Same structure as ar_open_items.csv, applied to accounts payable.

fa_register.csv

Field                       Description
asset_id                    Asset identifier
description                 Asset name
acquisition_date            Purchase date
acquisition_cost            Original cost
useful_life_years           Depreciation period
depreciation_method         Straight-line, etc.
accumulated_depreciation    Total depreciation
net_book_value              Current value

inventory_positions.csv

Field          Description
material_id    Material reference
warehouse      Location
quantity       Units on hand
unit_cost      Current cost
total_value    Extended value

Period Close Files

trial_balances/YYYY_MM.csv

Field              Description
account_number     GL account
account_name       Description
opening_balance    Period start
period_debits      Total debits
period_credits     Total credits
closing_balance    Period end

accruals.csv

Accrual entries with reversal dates.

depreciation.csv

Monthly depreciation entries per asset.

Banking Files

banking_customers.csv

Field             Description
customer_id       Unique identifier
customer_type     retail, business, trust
name              Customer name
created_at        Account creation date
risk_score        Calculated risk score (0-100)
kyc_status        verified, pending, enhanced_due_diligence
pep_flag          Politically exposed person
sanctions_flag    Sanctions list match

bank_accounts.csv

Field           Description
account_id      Unique identifier
customer_id     Owner reference
account_type    checking, savings, money_market
currency        Account currency
opened_date     Opening date
balance         Current balance
status          active, dormant, closed

bank_transactions.csv

Field              Description
transaction_id     Unique identifier
account_id         Account reference
timestamp          Transaction time
amount             Transaction amount
currency           Transaction currency
direction          credit, debit
channel            branch, atm, online, wire, ach
category           Transaction category
counterparty_id    Counterparty reference

kyc_profiles.csv

Field                          Description
customer_id                    Customer reference
declared_turnover              Expected monthly volume
transaction_frequency          Expected transactions/month
source_of_funds                Declared income source
geographic_exposure            List of countries
cash_intensity                 Expected cash ratio
beneficial_owner_complexity    Ownership layers

aml_typology_labels.csv

Field                   Description
transaction_id          Transaction reference
typology                structuring, funnel, layering, mule, fraud
confidence              Confidence score (0-1)
pattern_id              Related pattern identifier
related_transactions    Comma-separated related IDs
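
The transaction_id reference makes it straightforward to build a labeled training set by joining the typology labels onto the raw transactions. A pandas sketch — file paths are assumptions based on the directory layout above:

import pandas as pd

# Join AML typology labels onto bank transactions; unlabeled rows become "normal".
txns = pd.read_csv("output/banking/bank_transactions.csv")
labels = pd.read_csv("output/banking/aml_typology_labels.csv")

labeled = txns.merge(
    labels[["transaction_id", "typology", "confidence"]],
    on="transaction_id",
    how="left",
)
labeled["typology"] = labeled["typology"].fillna("normal")
print(labeled["typology"].value_counts())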

entity_risk_labels.csv

Field            Description
entity_id        Customer or account ID
entity_type      customer, account
risk_category    high, medium, low
risk_factors     Contributing factors
label_date       Label timestamp

Process Mining Files (OCEL 2.0)

event_log.json

OCEL 2.0 format event log:

{
  "ocel:global-log": {
    "ocel:version": "2.0",
    "ocel:ordering": "timestamp"
  },
  "ocel:events": {
    "e1": {
      "ocel:activity": "Create Purchase Order",
      "ocel:timestamp": "2024-01-15T10:30:00Z",
      "ocel:typedOmap": [
        {"ocel:oid": "PO-001", "ocel:qualifier": "order"}
      ]
    }
  },
  "ocel:objects": {
    "PO-001": {
      "ocel:type": "PurchaseOrder",
      "ocel:attributes": {
        "vendor": "VEND-001",
        "amount": "10000.00"
      }
    }
  }
}
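
A minimal reader for this structure needs only the standard json module; the loop below uses the keys exactly as they appear in the sample, and the path is assumed from the directory layout:

import json

# Parse the OCEL 2.0 event log and list each event with its related objects.
with open("output/process_mining/event_log.json") as f:
    log = json.load(f)

events = log["ocel:events"]
objects = log["ocel:objects"]

for event_id, event in events.items():
    related = [rel["ocel:oid"] for rel in event.get("ocel:typedOmap", [])]
    print(event_id, event["ocel:activity"], event["ocel:timestamp"], related)

print(f"{len(events)} events over {len(objects)} objects")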

objects.json

Object instances with types and attributes.

events.json

Event records with object relationships.

process_variants.csv

Field                Description
variant_id           Unique identifier
activity_sequence    Ordered activity list
frequency            Occurrence count
avg_duration         Average case duration

Audit Files

audit_engagements.csv

Field              Description
engagement_id      Unique identifier
client_name        Client entity
engagement_type    Financial, Compliance, Operational
fiscal_year        Audit period
materiality        Materiality threshold
status             Planning, Fieldwork, Completion

audit_workpapers.csv

Field             Description
workpaper_id      Unique identifier
engagement_id     Engagement reference
workpaper_type    Lead schedule, Substantive, etc.
prepared_by       Preparer ID
reviewed_by       Reviewer ID
status            Draft, Reviewed, Final

audit_evidence.csv

Field            Description
evidence_id      Unique identifier
workpaper_id     Workpaper reference
evidence_type    Document, Inquiry, Observation, etc.
source           Evidence source
reliability      High, Medium, Low
sufficiency      Sufficient, Insufficient

audit_risks.csv

Field               Description
risk_id             Unique identifier
engagement_id       Engagement reference
risk_description    Risk narrative
risk_level          High, Significant, Low
likelihood          Probable, Possible, Remote
response            Response strategy

audit_findings.csv

Field                  Description
finding_id             Unique identifier
engagement_id          Engagement reference
finding_type           Deficiency, Significant, Material Weakness
description            Finding narrative
recommendation         Recommended action
management_response    Response text

audit_judgments.csv

Field                      Description
judgment_id                Unique identifier
workpaper_id               Workpaper reference
judgment_area              Revenue recognition, Estimates, etc.
alternatives_considered    Options evaluated
conclusion                 Selected approach
rationale                  Reasoning documentation

Graph Export Files

PyTorch Geometric

graphs/transaction_network/pytorch_geometric/
├── node_features.pt    # [num_nodes, features]
├── edge_index.pt       # [2, num_edges]
├── edge_attr.pt        # [num_edges, edge_features]
├── labels.pt           # Node/edge labels
├── train_mask.pt       # Training split
├── val_mask.pt         # Validation split
└── test_mask.pt        # Test split
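
These tensors drop straight into a torch_geometric.data.Data object. A sketch, assuming node-level labels and the file names in the listing above:

import torch
from torch_geometric.data import Data

# Assemble a PyG Data object from the exported tensors.
base = "output/graphs/transaction_network/pytorch_geometric"

data = Data(
    x=torch.load(f"{base}/node_features.pt"),       # [num_nodes, features]
    edge_index=torch.load(f"{base}/edge_index.pt"), # [2, num_edges]
    edge_attr=torch.load(f"{base}/edge_attr.pt"),   # [num_edges, edge_features]
    y=torch.load(f"{base}/labels.pt"),              # node/edge labels
)
data.train_mask = torch.load(f"{base}/train_mask.pt")
data.val_mask = torch.load(f"{base}/val_mask.pt")
data.test_mask = torch.load(f"{base}/test_mask.pt")

print(data)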

Neo4j

graphs/entity_relationship/neo4j/
├── nodes_account.csv
├── nodes_entity.csv
├── nodes_user.csv
├── edges_transaction.csv
├── edges_approval.csv
└── import.cypher        # Import script

DGL (Deep Graph Library)

graphs/transaction_network/dgl/
├── graph.bin           # DGL binary format
├── node_features.npy   # NumPy arrays
└── edge_features.npy
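
dgl.load_graphs reads the binary file back into graph objects, and the features ship as plain NumPy arrays. A sketch using the paths from the listing above:

import dgl
import numpy as np

# Load the DGL binary graph plus its feature arrays.
base = "output/graphs/transaction_network/dgl"
graphs, _ = dgl.load_graphs(f"{base}/graph.bin")
g = graphs[0]

node_feats = np.load(f"{base}/node_features.npy")
edge_feats = np.load(f"{base}/edge_features.npy")
print(g.num_nodes(), g.num_edges(), node_feats.shape, edge_feats.shape)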

Label Files

anomaly_labels.csv

Field               Description
document_id         Entry reference
anomaly_id          Unique anomaly ID
anomaly_type        Classification
anomaly_category    Fraud, Error, Process, Statistical, Relational
severity            Low, Medium, High
description         Human-readable explanation

fraud_labels.csv

Field                   Description
document_id             Entry reference
fraud_type              Specific fraud pattern (20+ types)
detection_difficulty    Easy, Medium, Hard
description             Fraud scenario description

quality_labels.csv

Field             Description
record_id         Record reference
field_name        Affected field
issue_type        MissingValue, Typo, FormatVariation, Duplicate
issue_subtype     Detailed classification
original_value    Value before modification
modified_value    Value after modification
severity          Severity level (1-5)

Control Files

internal_controls.csv

Field           Description
control_id      Unique identifier
control_name    Description
control_type    Preventive, Detective
frequency       Continuous, Daily, etc.
assertions      Completeness, Accuracy, etc.

control_account_mappings.csv

Field             Description
control_id        Control reference
account_number    GL account
threshold         Monetary threshold

sod_rules.csv

Segregation of duties conflict definitions.

sod_conflict_pairs.csv

Actual SoD violations detected in generated data.

Parquet Format

Apache Parquet columnar format for large analytical datasets:

output:
  format: parquet
  compression: snappy      # snappy, gzip, zstd

Benefits:

  • Columnar storage — efficient for queries touching few columns
  • Built-in compression — typically 5-10x smaller than CSV
  • Schema embedding — self-describing files with full type information
  • Predicate pushdown — query engines skip irrelevant row groups

Use with: Apache Spark, DuckDB, Polars, pandas, BigQuery, Snowflake, Databricks.
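
For example, DuckDB can query the Parquet output in place, reading only the referenced columns and skipping row groups excluded by the filter. A sketch — the path and column names are illustrative and mirror the journal-entry examples above:

import duckdb

# Aggregate debits per account directly from the Parquet files.
result = duckdb.sql("""
    SELECT account, SUM(CAST(debit AS DECIMAL(18, 2))) AS total_debit
    FROM 'output/transactions/*.parquet'
    WHERE is_fraud = false
    GROUP BY account
    ORDER BY total_debit DESC
    LIMIT 10
""").df()
print(result)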

ERP-Specific Formats

SyntheticData can export in native ERP table schemas:

Format      Target ERP     Tables
sap         SAP S/4HANA    BKPF, BSEG, ACDOCA, LFA1, KNA1, MARA, CSKS, CEPC
oracle      Oracle EBS     GL_JE_HEADERS, GL_JE_LINES, GL_JE_BATCHES
netsuite    NetSuite       Journal entries with subsidiary, multi-book, custom fields

See ERP Output Formats for field mappings and configuration.

Compression Options

Option    Extension     Use Case
none      .csv/.json    Development, small datasets
gzip      .csv.gz       General compression
zstd      .csv.zst      High performance
snappy    .parquet      Parquet default (fast)

Configuration

output:
  format: csv              # csv, json, jsonl, parquet, sap, oracle, netsuite
  compression: none        # none, gzip, zstd (CSV/JSON) or snappy/gzip/zstd (Parquet)
  compression_level: 6     # 1-9 (if compression enabled)
  streaming: false         # Enable streaming mode for large outputs

See Also