Graph Export
Export transaction data as ML-ready graphs.
Overview
Graph export transforms financial data into network representations:
- Accounting Network (GL accounts as nodes, transactions as edges) - New in v0.2.1
- Transaction networks (accounts and entities)
- Approval networks (users and approvals)
- Entity relationship graphs (ownership)
Accounting Network Graph Export
The accounting network represents money flows between GL accounts. It is designed for network reconstruction and anomaly detection algorithms.
Quick Start
# Generate with graph export enabled
datasynth-data generate --config config.yaml --output ./output --graph-export
Graph Structure
| Element | Description |
|---|---|
| Nodes | GL Accounts from Chart of Accounts |
| Edges | Money flows FROM credit accounts TO debit accounts |
| Direction | Directed graph (source→target) |
┌──────────────┐
│ Credit Acct  │
│    (2000)    │
└──────┬───────┘
       │ $1,000
       ▼
┌──────────────┐
│  Debit Acct  │
│    (1100)    │
└──────────────┘
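As a rough illustration of the credit-to-debit convention, the sketch below turns one balanced journal entry into directed edges. The `JournalLine` type, the `entry_to_edges` function, and the proportional allocation used when an entry has several debit and credit lines are all assumptions made for this example; the exporter's actual allocation rule is not described here.

```python
from dataclasses import dataclass


@dataclass
class JournalLine:
    """One journal entry line (hypothetical structure for illustration)."""
    account: str
    debit: float = 0.0
    credit: float = 0.0


def entry_to_edges(lines):
    """Return (source, target, amount) edges for one balanced journal entry.

    Edges run FROM credit accounts TO debit accounts. When both sides have
    several lines, each credit is spread across the debit accounts in
    proportion to their share of total debits (an assumed convention).
    """
    debits = [(l.account, l.debit) for l in lines if l.debit > 0]
    credits = [(l.account, l.credit) for l in lines if l.credit > 0]
    total_debit = sum(amount for _, amount in debits)
    if total_debit == 0:
        return []
    edges = []
    for credit_account, credit_amount in credits:
        for debit_account, debit_amount in debits:
            share = debit_amount / total_debit
            edges.append((credit_account, debit_account, credit_amount * share))
    return edges


# The entry drawn in the diagram: credit 2000, debit 1100, amount 1,000.
entry = [JournalLine("1100", debit=1000.0), JournalLine("2000", credit=1000.0)]
edges = entry_to_edges(entry)
print(edges)
```

For a simple two-line entry the result is a single edge from the credit account to the debit account; only multi-line entries depend on the assumed allocation rule.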
Edge Features (8 dimensions)
| Feature | Index | Description |
|---|---|---|
| log_amount | F0 | log10(transaction amount) |
| benford_prob | F1 | Expected first-digit probability |
| weekday | F2 | Day of week (normalized 0-1) |
| period | F3 | Fiscal period (normalized 0-1) |
| is_month_end | F4 | Last 3 days of month |
| is_year_end | F5 | Last month of year |
| is_anomaly | F6 | Anomaly flag (0 or 1) |
| business_process | F7 | Encoded business process |
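To make the eight dimensions concrete, here is a minimal sketch of how one edge vector could be assembled. The Benford probability uses the standard formula log10(1 + 1/d) for leading digit d. Everything else is an assumption for illustration: the function names, the normalizations (weekday divided by 6, period mapped from 1-12 onto 0-1), the use of fiscal period 12 as "last month of year", and the integer codes in `PROCESS_CODES`. The exporter's real encoding may differ.

```python
import calendar
import math
from datetime import date

# Hypothetical integer codes; the exporter's real encoding is not documented here.
PROCESS_CODES = {"Manual": 0, "P2P": 1, "O2C": 2}


def benford_prob(amount):
    """Benford's law probability log10(1 + 1/d) of the leading digit d."""
    leading = int(f"{abs(amount):.15e}"[0])  # scientific notation starts with d
    return math.log10(1 + 1 / leading) if leading > 0 else 0.0


def edge_features(amount, posting_date, fiscal_period, is_anomaly, process):
    """Assemble the 8-dimensional vector F0..F7 for one edge (amount > 0)."""
    days_in_month = calendar.monthrange(posting_date.year, posting_date.month)[1]
    return [
        math.log10(amount),                            # F0 log_amount
        benford_prob(amount),                          # F1 benford_prob
        posting_date.weekday() / 6,                    # F2 weekday, Mon=0.0 .. Sun=1.0
        (fiscal_period - 1) / 11,                      # F3 period, assumes 12 periods
        float(posting_date.day > days_in_month - 3),   # F4 is_month_end (last 3 days)
        float(fiscal_period == 12),                    # F5 is_year_end (last month)
        float(is_anomaly),                             # F6 is_anomaly
        float(PROCESS_CODES[process]),                 # F7 business_process code
    ]


vec = edge_features(1000.0, date(2024, 12, 30), 12, False, "P2P")
print(vec)
```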
Output Files
output/graphs/accounting_network/pytorch_geometric/
├── edge_index.npy # [2, E] source→target node indices
├── node_features.npy # [N, 4] node feature vectors
├── edge_features.npy # [E, 8] edge feature vectors
├── edge_labels.npy # [E] anomaly labels (0=normal, 1=anomaly)
├── node_labels.npy # [N] node labels
├── train_mask.npy # [N] boolean training mask
├── val_mask.npy # [N] boolean validation mask
├── test_mask.npy # [N] boolean test mask
├── metadata.json # Graph statistics and configuration
└── load_graph.py # Auto-generated Python loader script
Loading in Python
import numpy as np
import json
# Load metadata
with open('metadata.json') as f:
    meta = json.load(f)
print(f"Nodes: {meta['num_nodes']}, Edges: {meta['num_edges']}")
# Load arrays
edge_index = np.load('edge_index.npy') # [2, E]
node_features = np.load('node_features.npy') # [N, F]
edge_features = np.load('edge_features.npy') # [E, 8]
edge_labels = np.load('edge_labels.npy') # [E]
# For PyTorch Geometric
import torch
from torch_geometric.data import Data
data = Data(
    x=torch.from_numpy(node_features).float(),
    edge_index=torch.from_numpy(edge_index).long(),
    edge_attr=torch.from_numpy(edge_features).float(),
    y=torch.from_numpy(edge_labels).long(),  # edge-level anomaly labels, shape [E]
)
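Before training, it can help to confirm that the loaded arrays agree with the shapes listed under Output Files. The `check_shapes` helper below is a hypothetical name and is not part of the generated `load_graph.py`. It takes plain shape tuples so the sketch runs without NumPy installed.

```python
def check_shapes(shapes):
    """Validate array shapes against the documented layout.

    `shapes` maps a file stem (e.g. "edge_index") to its shape tuple.
    Returns (num_nodes, num_edges) or raises ValueError.
    """
    two, num_edges = shapes["edge_index"]
    if two != 2:
        raise ValueError("edge_index must have shape [2, E]")
    num_nodes = shapes["node_features"][0]
    if shapes["edge_features"] != (num_edges, 8):
        raise ValueError("edge_features must have shape [E, 8]")
    if shapes["edge_labels"] != (num_edges,):
        raise ValueError("edge_labels must have shape [E]")
    # Node-level arrays are optional here but must have length N when present.
    for name in ("node_labels", "train_mask", "val_mask", "test_mask"):
        if name in shapes and shapes[name] != (num_nodes,):
            raise ValueError(f"{name} must have shape [N]")
    return num_nodes, num_edges


example = {
    "edge_index": (2, 5),
    "node_features": (3, 4),
    "edge_features": (5, 8),
    "edge_labels": (5,),
    "train_mask": (3,),
}
result = check_shapes(example)
print(result)
```

With the arrays loaded as above, the input can be built as `{"edge_index": edge_index.shape, "node_features": node_features.shape, ...}`.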
Configuration
graph_export:
  enabled: true
  formats:
    - pytorch_geometric
  train_ratio: 0.7
  validation_ratio: 0.15
  # test_ratio is automatically 1 - train - val = 0.15
Use Cases
- Anomaly Detection: Train GNNs to detect anomalous transaction patterns
- Network Reconstruction: Validate accounting network recovery algorithms
- Fraud Detection: Identify suspicious money flow patterns
- Link Prediction: Predict likely transaction relationships
Configuration
graph_export:
  enabled: true
  formats:
    - pytorch_geometric
    - neo4j
    - dgl
  graphs:
    - transaction_network
    - approval_network
    - entity_relationship
  split:
    train: 0.7
    val: 0.15
    test: 0.15
    stratify: is_anomaly
  features:
    temporal: true
    amount: true
    structural: true
    categorical: true
Graph Types
Transaction Network
Accounts and entities as nodes, transactions as edges.
┌──────────┐
│ Account  │
│   1100   │
└────┬─────┘
     │ $1000
     ▼
┌──────────┐
│ Customer │
│  C-001   │
└──────────┘
Nodes:
- GL accounts
- Vendors
- Customers
- Cost centers
Edges:
- Journal entry lines
- Payments
- Invoices
Approval Network
Users as nodes, approval relationships as edges.
┌──────────┐
│  Clerk   │
│  U-001   │
└────┬─────┘
     │ approved
     ▼
┌──────────┐
│ Manager  │
│  U-002   │
└──────────┘
Nodes: Employees/users
Edges: Approval actions
Entity Relationship Network
Legal entities with ownership relationships.
┌──────────┐
│  Parent  │
│   1000   │
└────┬─────┘
     │ 100%
     ▼
┌──────────┐
│   Sub    │
│   2000   │
└──────────┘
Nodes: Companies
Edges: Ownership, IC transactions
Export Formats
PyTorch Geometric
output/graphs/transaction_network/pytorch_geometric/
├── node_features.pt # [num_nodes, num_features]
├── edge_index.pt # [2, num_edges]
├── edge_attr.pt # [num_edges, num_edge_features]
├── labels.pt # Labels
├── train_mask.pt # Boolean training mask
├── val_mask.pt # Boolean validation mask
└── test_mask.pt # Boolean test mask
Loading in Python:
import torch
from torch_geometric.data import Data
# Load tensors
node_features = torch.load('node_features.pt')
edge_index = torch.load('edge_index.pt')
edge_attr = torch.load('edge_attr.pt')
labels = torch.load('labels.pt')
train_mask = torch.load('train_mask.pt')
# Create PyG Data object
data = Data(
    x=node_features,
    edge_index=edge_index,
    edge_attr=edge_attr,
    y=labels,
    train_mask=train_mask,
)
print(f"Nodes: {data.num_nodes}")
print(f"Edges: {data.num_edges}")
Neo4j
output/graphs/transaction_network/neo4j/
├── nodes_account.csv
├── nodes_vendor.csv
├── nodes_customer.csv
├── edges_transaction.csv
├── edges_payment.csv
└── import.cypher
Import script (import.cypher):
// Load accounts
LOAD CSV WITH HEADERS FROM 'file:///nodes_account.csv' AS row
CREATE (:Account {
id: row.id,
name: row.name,
type: row.type
});
// Load transactions
LOAD CSV WITH HEADERS FROM 'file:///edges_transaction.csv' AS row
MATCH (from:Account {id: row.from_id})
MATCH (to:Account {id: row.to_id})
CREATE (from)-[:TRANSACTION {
amount: toFloat(row.amount),
date: date(row.posting_date),
is_anomaly: toBoolean(row.is_anomaly)
}]->(to);
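The same CSV files can be inspected without a running Neo4j instance. The sketch below assumes `edges_transaction.csv` has the `from_id`, `to_id`, `amount` and `is_anomaly` columns referenced by `import.cypher`, and that `is_anomaly` is serialized as `true`/`false`. The helper name `anomalous_flow_totals` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def anomalous_flow_totals(csv_file):
    """Sum anomalous transaction amounts per (from_id, to_id) pair."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        if row["is_anomaly"].strip().lower() == "true":
            totals[(row["from_id"], row["to_id"])] += float(row["amount"])
    return dict(totals)


# In practice: open("edges_transaction.csv", newline="") instead of StringIO.
sample = io.StringIO(
    "from_id,to_id,amount,posting_date,is_anomaly\n"
    "2000,1100,1000.00,2024-01-15,false\n"
    "2000,1100,250.50,2024-01-31,true\n"
    "2000,1100,49.50,2024-02-01,true\n"
)
totals = anomalous_flow_totals(sample)
print(totals)
```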
DGL (Deep Graph Library)
output/graphs/transaction_network/dgl/
├── graph.bin # Serialized DGL graph
├── node_feats.npy # Node features
├── edge_feats.npy # Edge features
└── labels.npy # Labels
Loading in Python:
import dgl
import numpy as np
import torch  # needed for torch.tensor below
# Load graph
graph = dgl.load_graphs('graph.bin')[0][0]
# Load features
graph.ndata['feat'] = torch.tensor(np.load('node_feats.npy'))
graph.edata['feat'] = torch.tensor(np.load('edge_feats.npy'))
graph.ndata['label'] = torch.tensor(np.load('labels.npy'))
Features
Temporal Features
features:
  temporal: true
| Feature | Description |
|---|---|
| weekday | Day of week (0-6) |
| period | Fiscal period (1-12) |
| is_month_end | Last 3 days of month |
| is_quarter_end | Last week of quarter |
| is_year_end | Last month of year |
| hour | Hour of posting |
Amount Features
features:
  amount: true
| Feature | Description |
|---|---|
| log_amount | log10(amount) |
| benford_prob | Expected first-digit probability |
| is_round_number | Ends in 00, 000, etc. |
| amount_zscore | Standard deviations from mean |
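The last two rows of the table leave some room for interpretation, so the following is only one possible reading. It treats a "round" amount as a non-zero whole-currency value divisible by 100, and computes z-scores over a single list of amounts using the population standard deviation. Both choices, and the function names, are assumptions; the exporter may group amounts differently (for example per account).

```python
import statistics


def is_round_number(amount):
    """True when the amount is a whole number ending in at least two zeros."""
    cents = round(amount * 100)
    return cents != 0 and cents % 10000 == 0


def amount_zscores(amounts):
    """Standard deviations from the mean, using the population std deviation."""
    mean = statistics.fmean(amounts)
    std = statistics.pstdev(amounts)
    if std == 0:
        return [0.0 for _ in amounts]
    return [(a - mean) / std for a in amounts]


flags = [is_round_number(a) for a in (1500.0, 1550.0, 1500.25)]
scores = amount_zscores([100.0, 200.0, 300.0])
print(flags, scores)
```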
Structural Features
features:
  structural: true
| Feature | Description |
|---|---|
| line_count | Number of JE lines |
| unique_accounts | Distinct accounts used |
| has_intercompany | IC transaction flag |
| debit_credit_ratio | Total debits / credits |
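The structural features in the table are all derived from the lines of a single journal entry. A minimal sketch follows, assuming each line is a dict with hypothetical keys `account`, `debit`, `credit` and `company`, and assuming an entry counts as intercompany when its lines span more than one company code. Neither the record layout nor that rule is confirmed by this page.

```python
def structural_features(lines):
    """Compute per-journal-entry structural features from its lines."""
    total_debit = sum(line["debit"] for line in lines)
    total_credit = sum(line["credit"] for line in lines)
    return {
        "line_count": len(lines),
        "unique_accounts": len({line["account"] for line in lines}),
        # Assumed rule: more than one company code means intercompany.
        "has_intercompany": len({line["company"] for line in lines}) > 1,
        # Guard against division by zero for entries with no credits.
        "debit_credit_ratio": total_debit / total_credit if total_credit else 0.0,
    }


entry_lines = [
    {"account": "1100", "debit": 600.0, "credit": 0.0, "company": "1000"},
    {"account": "1200", "debit": 400.0, "credit": 0.0, "company": "1000"},
    {"account": "2000", "debit": 0.0, "credit": 1000.0, "company": "1000"},
]
feats = structural_features(entry_lines)
print(feats)
```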
Categorical Features
features:
  categorical: true
One-hot encoded:
- business_process: Manual, P2P, O2C, etc.
- source_type: System, User, Recurring
- account_type: Asset, Liability, etc.
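One-hot encoding maps each category to a vector with a single 1 at the category's position. The sketch below shows the idea. The `one_hot` helper is a hypothetical name, and the category lists are truncated to the values named on this page, so the real vectors are likely longer.

```python
def one_hot(value, categories):
    """Return a one-hot list for `value` over an ordered category list."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


# Illustrative category lists; the document abbreviates the first with "etc.".
BUSINESS_PROCESSES = ["Manual", "P2P", "O2C"]
SOURCE_TYPES = ["System", "User", "Recurring"]

# Concatenate the per-field encodings into one feature vector.
features = one_hot("P2P", BUSINESS_PROCESSES) + one_hot("User", SOURCE_TYPES)
print(features)  # [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
```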
Train/Val/Test Splits
split:
  train: 0.7            # 70% training
  val: 0.15             # 15% validation
  test: 0.15            # 15% test
  stratify: is_anomaly  # Maintain anomaly ratio
  random_seed: 42       # Reproducible splits
Stratification options:
- is_anomaly: Balanced anomaly detection
- is_fraud: Balanced fraud detection
- account_type: Balanced by account type
- null: Random (no stratification)
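Stratification means the split is drawn separately within each label group, so every partition keeps roughly the same label ratio. The following is a plain-Python sketch of that idea, not the exporter's implementation. The `stratified_split` function and its rounding behaviour are assumptions, while the default ratios and seed mirror the configuration above.

```python
import random
from collections import defaultdict


def stratified_split(labels, train=0.7, val=0.15, seed=42):
    """Assign each index to "train", "val" or "test", stratified by label.

    Indices are shuffled within each label group; whatever remains after
    the train and val shares goes to test.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    assignment = [None] * len(labels)
    for indices in groups.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train)
        n_val = round(len(indices) * val)
        for position, index in enumerate(indices):
            if position < n_train:
                assignment[index] = "train"
            elif position < n_train + n_val:
                assignment[index] = "val"
            else:
                assignment[index] = "test"
    return assignment


labels = [0] * 80 + [1] * 20  # 20% anomalies
split = stratified_split(labels)
train_mask = [s == "train" for s in split]
train_anomalies = sum(1 for s, y in zip(split, labels) if s == "train" and y == 1)
print(sum(train_mask), train_anomalies)
```

The boolean list `train_mask` plays the same role as the exported `train_mask` file; passing `[None] * n` as labels reduces the function to a plain random split.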
GNN Training Example
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class AnomalyGNN(torch.nn.Module):
    def __init__(self, num_features, hidden_dim):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, 2)  # Binary classification

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x

# Train on the nodes selected by train_mask; `data` is the PyG Data object loaded earlier
model = AnomalyGNN(data.num_features, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
Multi-Layer Hypergraph (v0.6.2)
The RustGraph Hypergraph exporter now supports all 8 enterprise process families with 24 entity type codes:
Entity Type Codes
| Range | Family | Types |
|---|---|---|
| 100 | Accounting | GL Accounts |
| 300-303 | P2P | PurchaseOrder, GoodsReceipt, VendorInvoice, Payment |
| 310-312 | O2C | SalesOrder, Delivery, CustomerInvoice |
| 320-325 | S2C | SourcingProject, RfxEvent, SupplierBid, BidEvaluation, ProcurementContract, SupplierQualification |
| 330-333 | H2R | PayrollRun, TimeEntry, ExpenseReport, PayrollLineItem |
| 340-343 | MFG | ProductionOrder, RoutingOperation, QualityInspection, CycleCount |
| 350-352 | BANK | BankingCustomer, BankAccount, BankTransaction |
| 360-365 | AUDIT | AuditEngagement, Workpaper, AuditFinding, AuditEvidence, RiskAssessment, ProfessionalJudgment |
| 370-372 | Bank Recon | BankReconciliation, BankStatementLine, ReconcilingItem |
| 400 | OCPM | OcpmEvent (events as hyperedges) |
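When working with exported node type codes, a small lookup from code to family can be convenient. The ranges in this sketch are copied from the table above; the `CODE_RANGES` constant and `family_for_code` function are hypothetical names, not part of the exporter.

```python
# Code ranges copied from the table: (first, last, family).
CODE_RANGES = [
    (100, 100, "Accounting"),
    (300, 303, "P2P"),
    (310, 312, "O2C"),
    (320, 325, "S2C"),
    (330, 333, "H2R"),
    (340, 343, "MFG"),
    (350, 352, "BANK"),
    (360, 365, "AUDIT"),
    (370, 372, "Bank Recon"),
    (400, 400, "OCPM"),
]


def family_for_code(code):
    """Return the process family for an entity type code, or None if unknown."""
    for first, last, family in CODE_RANGES:
        if first <= code <= last:
            return family
    return None


print(family_for_code(311))  # O2C
```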
OCPM Events as Hyperedges
When events_as_hyperedges: true, each OCPM event becomes a hyperedge connecting all its participating objects. This enables cross-process analysis via the hypergraph structure.
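A hyperedge differs from an ordinary edge in that it can connect any number of nodes at once. The sketch below models each event as a set of object ids and then finds objects shared by more than one event, which is the kind of cross-process link the hypergraph exposes. The event records, their field names (`event_id`, `object_ids`) and both helper functions are made up for illustration and do not reflect the exporter's file format.

```python
from collections import defaultdict


def events_to_hyperedges(events):
    """Map each event id to the set of object ids it connects."""
    return {e["event_id"]: frozenset(e["object_ids"]) for e in events}


def shared_objects(hyperedges):
    """Index objects by the events (hyperedges) they participate in."""
    index = defaultdict(set)
    for event_id, members in hyperedges.items():
        for object_id in members:
            index[object_id].add(event_id)
    return dict(index)


# Hypothetical events: a payment shared by a P2P event and a banking event.
events = [
    {"event_id": "E1", "object_ids": ["PO-1", "INV-1", "PAY-1"]},
    {"event_id": "E2", "object_ids": ["PAY-1", "BTX-1"]},
]
hyperedges = events_to_hyperedges(events)
index = shared_objects(hyperedges)
cross_process = {obj for obj, evs in index.items() if len(evs) > 1}
print(cross_process)  # {'PAY-1'}
```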
Per-Family Toggles
graph_export:
  hypergraph:
    enabled: true
    process_layer:
      include_p2p: true
      include_o2c: true
      include_s2c: true
      include_h2r: true
      include_mfg: true
      include_bank: true
      include_audit: true
      include_r2r: true
      events_as_hyperedges: true