Graph Export

Export transaction data as ML-ready graphs.

Overview

Graph export transforms financial data into network representations:

Accounting Network (GL accounts as nodes, transactions as edges) - New in v0.2.1
Transaction networks (accounts and entities)
Approval networks (users and approvals)
Entity relationship graphs (ownership)

Accounting Network Graph Export

The accounting network represents money flows between GL accounts, designed for network reconstruction and anomaly detection algorithms.

Quick Start

# Generate with graph export enabled
datasynth-data generate --config config.yaml --output ./output --graph-export

Graph Structure

Element	Description
Nodes	GL Accounts from Chart of Accounts
Edges	Money flows FROM credit accounts TO debit accounts
Direction	Directed graph (source→target)

     ┌──────────────┐
     │ Credit Acct  │
     │   (2000)     │
     └──────┬───────┘
            │ $1,000
            ▼
     ┌──────────────┐
     │ Debit Acct   │
     │   (1100)     │
     └──────────────┘

Edge Features (8 dimensions)

Feature	Index	Description
`log_amount`	F0	log10(transaction amount)
`benford_prob`	F1	Expected first-digit probability
`weekday`	F2	Day of week (normalized 0-1)
`period`	F3	Fiscal period (normalized 0-1)
`is_month_end`	F4	Last 3 days of month
`is_year_end`	F5	Last month of year
`is_anomaly`	F6	Anomaly flag (0 or 1)
`business_process`	F7	Encoded business process

Output Files

output/graphs/accounting_network/pytorch_geometric/
├── edge_index.npy      # [2, E] source→target node indices
├── node_features.npy   # [N, 4] node feature vectors
├── edge_features.npy   # [E, 8] edge feature vectors
├── edge_labels.npy     # [E] anomaly labels (0=normal, 1=anomaly)
├── node_labels.npy     # [N] node labels
├── train_mask.npy      # [N] boolean training mask
├── val_mask.npy        # [N] boolean validation mask
├── test_mask.npy       # [N] boolean test mask
├── metadata.json       # Graph statistics and configuration
└── load_graph.py       # Auto-generated Python loader script

Loading in Python

import numpy as np
import json

# Load metadata
with open('metadata.json') as f:
    meta = json.load(f)
print(f"Nodes: {meta['num_nodes']}, Edges: {meta['num_edges']}")

# Load arrays
edge_index = np.load('edge_index.npy')      # [2, E]
node_features = np.load('node_features.npy') # [N, F]
edge_features = np.load('edge_features.npy') # [E, 8]
edge_labels = np.load('edge_labels.npy')     # [E]

# For PyTorch Geometric
import torch
from torch_geometric.data import Data

data = Data(
    x=torch.from_numpy(node_features).float(),
    edge_index=torch.from_numpy(edge_index).long(),
    edge_attr=torch.from_numpy(edge_features).float(),
    y=torch.from_numpy(edge_labels).long(),
)

Configuration

graph_export:
  enabled: true
  formats:
    - pytorch_geometric
  train_ratio: 0.7
  validation_ratio: 0.15
  # test_ratio is automatically 1 - train - val = 0.15

Use Cases

Anomaly Detection: Train GNNs to detect anomalous transaction patterns
Network Reconstruction: Validate accounting network recovery algorithms
Fraud Detection: Identify suspicious money flow patterns
Link Prediction: Predict likely transaction relationships

Configuration

graph_export:
  enabled: true

  formats:
    - pytorch_geometric
    - neo4j
    - dgl

  graphs:
    - transaction_network
    - approval_network
    - entity_relationship

  split:
    train: 0.7
    val: 0.15
    test: 0.15
    stratify: is_anomaly

  features:
    temporal: true
    amount: true
    structural: true
    categorical: true

Graph Types

Transaction Network

Accounts and entities as nodes, transactions as edges.

     ┌──────────┐
     │ Account  │
     │  1100    │
     └────┬─────┘
          │ $1000
          ▼
     ┌──────────┐
     │ Customer │
     │  C-001   │
     └──────────┘

Nodes:

GL accounts
Vendors
Customers
Cost centers

Edges:

Journal entry lines
Payments
Invoices

Approval Network

Users as nodes, approval relationships as edges.

     ┌──────────┐
     │  Clerk   │
     │  U-001   │
     └────┬─────┘
          │ approved
          ▼
     ┌──────────┐
     │ Manager  │
     │  U-002   │
     └──────────┘

Nodes: Employees/users Edges: Approval actions

Entity Relationship Network

Legal entities with ownership relationships.

     ┌──────────┐
     │  Parent  │
     │  1000    │
     └────┬─────┘
          │ 100%
          ▼
     ┌──────────┐
     │   Sub    │
     │  2000    │
     └──────────┘

Nodes: Companies Edges: Ownership, IC transactions

Export Formats

PyTorch Geometric

output/graphs/transaction_network/pytorch_geometric/
├── node_features.pt    # [num_nodes, num_features]
├── edge_index.pt       # [2, num_edges]
├── edge_attr.pt        # [num_edges, num_edge_features]
├── labels.pt           # Labels
├── train_mask.pt       # Boolean training mask
├── val_mask.pt         # Boolean validation mask
└── test_mask.pt        # Boolean test mask

Loading in Python:

import torch
from torch_geometric.data import Data

# Load tensors
node_features = torch.load('node_features.pt')
edge_index = torch.load('edge_index.pt')
edge_attr = torch.load('edge_attr.pt')
labels = torch.load('labels.pt')
train_mask = torch.load('train_mask.pt')

# Create PyG Data object
data = Data(
    x=node_features,
    edge_index=edge_index,
    edge_attr=edge_attr,
    y=labels,
    train_mask=train_mask,
)

print(f"Nodes: {data.num_nodes}")
print(f"Edges: {data.num_edges}")

Neo4j

output/graphs/transaction_network/neo4j/
├── nodes_account.csv
├── nodes_vendor.csv
├── nodes_customer.csv
├── edges_transaction.csv
├── edges_payment.csv
└── import.cypher

Import script (import.cypher):

// Load accounts
LOAD CSV WITH HEADERS FROM 'file:///nodes_account.csv' AS row
CREATE (:Account {
    id: row.id,
    name: row.name,
    type: row.type
});

// Load transactions
LOAD CSV WITH HEADERS FROM 'file:///edges_transaction.csv' AS row
MATCH (from:Account {id: row.from_id})
MATCH (to:Account {id: row.to_id})
CREATE (from)-[:TRANSACTION {
    amount: toFloat(row.amount),
    date: date(row.posting_date),
    is_anomaly: toBoolean(row.is_anomaly)
}]->(to);

DGL (Deep Graph Library)

output/graphs/transaction_network/dgl/
├── graph.bin           # Serialized DGL graph
├── node_feats.npy      # Node features
├── edge_feats.npy      # Edge features
└── labels.npy          # Labels

Loading in Python:

import dgl
import numpy as np

# Load graph
graph = dgl.load_graphs('graph.bin')[0][0]

# Load features
graph.ndata['feat'] = torch.tensor(np.load('node_feats.npy'))
graph.edata['feat'] = torch.tensor(np.load('edge_feats.npy'))
graph.ndata['label'] = torch.tensor(np.load('labels.npy'))

Features

Temporal Features

features:
  temporal: true

Feature	Description
`weekday`	Day of week (0-6)
`period`	Fiscal period (1-12)
`is_month_end`	Last 3 days of month
`is_quarter_end`	Last week of quarter
`is_year_end`	Last month of year
`hour`	Hour of posting

Amount Features

features:
  amount: true

Feature	Description
`log_amount`	log10(amount)
`benford_prob`	Expected first-digit probability
`is_round_number`	Ends in 00, 000, etc.
`amount_zscore`	Standard deviations from mean

Structural Features

features:
  structural: true

Feature	Description
`line_count`	Number of JE lines
`unique_accounts`	Distinct accounts used
`has_intercompany`	IC transaction flag
`debit_credit_ratio`	Total debits / credits

Categorical Features

features:
  categorical: true

One-hot encoded:

business_process: Manual, P2P, O2C, etc.
source_type: System, User, Recurring
account_type: Asset, Liability, etc.

Train/Val/Test Splits

split:
  train: 0.7                         # 70% training
  val: 0.15                          # 15% validation
  test: 0.15                         # 15% test
  stratify: is_anomaly               # Maintain anomaly ratio
  random_seed: 42                    # Reproducible splits

Stratification options:

is_anomaly: Balanced anomaly detection
is_fraud: Balanced fraud detection
account_type: Balanced by account type
null: Random (no stratification)

GNN Training Example

import torch
from torch_geometric.nn import GCNConv

class AnomalyGNN(torch.nn.Module):
    def __init__(self, num_features, hidden_dim):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, 2)  # Binary classification

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x

# Train
model = AnomalyGNN(data.num_features, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

Multi-Layer Hypergraph (v0.6.2)

The RustGraph Hypergraph exporter now supports all 8 enterprise process families with 24 entity type codes:

Entity Type Codes

Range	Family	Types
100	Accounting	GL Accounts
300-303	P2P	PurchaseOrder, GoodsReceipt, VendorInvoice, Payment
310-312	O2C	SalesOrder, Delivery, CustomerInvoice
320-325	S2C	SourcingProject, RfxEvent, SupplierBid, BidEvaluation, ProcurementContract, SupplierQualification
330-333	H2R	PayrollRun, TimeEntry, ExpenseReport, PayrollLineItem
340-343	MFG	ProductionOrder, RoutingOperation, QualityInspection, CycleCount
350-352	BANK	BankingCustomer, BankAccount, BankTransaction
360-365	AUDIT	AuditEngagement, Workpaper, AuditFinding, AuditEvidence, RiskAssessment, ProfessionalJudgment
370-372	Bank Recon	BankReconciliation, BankStatementLine, ReconcilingItem
400	OCPM	OcpmEvent (events as hyperedges)

OCPM Events as Hyperedges

When events_as_hyperedges: true, each OCPM event becomes a hyperedge connecting all its participating objects. This enables cross-process analysis via the hypergraph structure.

Per-Family Toggles

graph_export:
  hypergraph:
    enabled: true
    process_layer:
      include_p2p: true
      include_o2c: true
      include_s2c: true
      include_h2r: true
      include_mfg: true
      include_bank: true
      include_audit: true
      include_r2r: true
      events_as_hyperedges: true

SyntheticData Documentation

Graph Export

Overview

Accounting Network Graph Export

Quick Start

Graph Structure

Edge Features (8 dimensions)

Output Files

Loading in Python

Configuration

Use Cases

Configuration

Graph Types

Transaction Network

Approval Network

Entity Relationship Network

Export Formats

PyTorch Geometric

Neo4j

DGL (Deep Graph Library)

Features

Temporal Features

Amount Features

Structural Features

Categorical Features

Train/Val/Test Splits

GNN Training Example

Multi-Layer Hypergraph (v0.6.2)

Entity Type Codes

OCPM Events as Hyperedges

Per-Family Toggles

See Also

Keyboard shortcuts

SyntheticData Documentation