Graph Export

Export transaction data as ML-ready graphs.

Overview

Graph export transforms financial data into network representations:

  • Accounting Network (GL accounts as nodes, transactions as edges), new in v0.2.1
  • Transaction networks (accounts and entities)
  • Approval networks (users and approvals)
  • Entity relationship graphs (ownership)

Accounting Network Graph Export

The accounting network represents money flows between GL accounts. It is designed as input for network reconstruction and anomaly detection algorithms.

Quick Start

# Generate with graph export enabled
datasynth-data generate --config config.yaml --output ./output --graph-export

Graph Structure

Element     Description
Nodes       GL accounts from the Chart of Accounts
Edges       Money flows FROM credit accounts TO debit accounts
Direction   Directed graph (source → target)

     ┌──────────────┐
     │ Credit Acct  │
     │   (2000)     │
     └──────┬───────┘
            │ $1,000
            ▼
     ┌──────────────┐
     │ Debit Acct   │
     │   (1100)     │
     └──────────────┘
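
The credit-to-debit convention above can be illustrated with a small sketch. It is not the exporter's actual code: `entry_to_edges` is a hypothetical helper, and it assumes entries with either a single credit line or a single debit line (many-to-many entries would need an allocation rule that this page does not describe).

```python
def entry_to_edges(lines):
    """Turn one journal entry into directed (credit -> debit) edges.

    `lines` is a list of (account, debit_amount, credit_amount) tuples.
    Hypothetical helper for illustration only.
    """
    debits = [(acct, dr) for acct, dr, cr in lines if dr > 0]
    credits = [(acct, cr) for acct, dr, cr in lines if cr > 0]
    if len(credits) == 1:
        # One credit account funds every debit line.
        source = credits[0][0]
        return [(source, acct, amount) for acct, amount in debits]
    if len(debits) == 1:
        # Several credit accounts flow into one debit account.
        target = debits[0][0]
        return [(acct, target, amount) for acct, amount in credits]
    raise ValueError("many-to-many entries need an allocation rule")


# The entry from the diagram: credit 2000, debit 1100, amount 1,000.
entry = [("1100", 1000.0, 0.0), ("2000", 0.0, 1000.0)]
edges = entry_to_edges(entry)

# Map account numbers to node indices and build a [2, E] edge index.
accounts = sorted({a for s, t, _ in edges for a in (s, t)})
index = {acct: i for i, acct in enumerate(accounts)}
edge_index = [
    [index[s] for s, _, _ in edges],  # source row (credit accounts)
    [index[t] for _, t, _ in edges],  # target row (debit accounts)
]
print(edges)
print(edge_index)
```

The source row holds the credit account and the target row the debit account, matching the source→target direction in the table.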

Edge Features (8 dimensions)

Feature            Index   Description
log_amount         F0      log10(transaction amount)
benford_prob       F1      Expected first-digit probability
weekday            F2      Day of week (normalized 0-1)
period             F3      Fiscal period (normalized 0-1)
is_month_end       F4      Last 3 days of month
is_year_end        F5      Last month of year
is_anomaly         F6      Anomaly flag (0 or 1)
business_process   F7      Encoded business process
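
As a rough illustration of the first two features, the sketch below computes `log_amount` and `benford_prob` for a single amount. It assumes the exporter uses the standard Benford first-digit probability, log10(1 + 1/d); the function names are illustrative and not part of the tool.

```python
import math


def first_digit(amount):
    """Return the leading non-zero digit of a non-zero amount."""
    # str() puts the mantissa first even in exponent notation (e.g. '1e-07'),
    # so the first character in 1-9 is the leading significant digit.
    for ch in str(abs(amount)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("amount must be non-zero")


def benford_prob(amount):
    """Expected probability of the amount's first digit under Benford's law."""
    d = first_digit(amount)
    return math.log10(1 + 1 / d)


def log_amount(amount):
    """Base-10 logarithm of a positive transaction amount."""
    return math.log10(amount)


amount = 1000.0
print(log_amount(amount))    # 3.0
print(benford_prob(amount))  # log10(2), about 0.301
```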

Output Files

output/graphs/accounting_network/pytorch_geometric/
├── edge_index.npy      # [2, E] source→target node indices
├── node_features.npy   # [N, 4] node feature vectors
├── edge_features.npy   # [E, 8] edge feature vectors
├── edge_labels.npy     # [E] anomaly labels (0=normal, 1=anomaly)
├── node_labels.npy     # [N] node labels
├── train_mask.npy      # [N] boolean training mask
├── val_mask.npy        # [N] boolean validation mask
├── test_mask.npy       # [N] boolean test mask
├── metadata.json       # Graph statistics and configuration
└── load_graph.py       # Auto-generated Python loader script

Loading in Python

import numpy as np
import json

# Load metadata
with open('metadata.json') as f:
    meta = json.load(f)
print(f"Nodes: {meta['num_nodes']}, Edges: {meta['num_edges']}")

# Load arrays
edge_index = np.load('edge_index.npy')      # [2, E]
node_features = np.load('node_features.npy') # [N, F]
edge_features = np.load('edge_features.npy') # [E, 8]
edge_labels = np.load('edge_labels.npy')     # [E]

# For PyTorch Geometric
import torch
from torch_geometric.data import Data

data = Data(
    x=torch.from_numpy(node_features).float(),
    edge_index=torch.from_numpy(edge_index).long(),
    edge_attr=torch.from_numpy(edge_features).float(),
    y=torch.from_numpy(edge_labels).long(),  # edge-level anomaly labels, shape [E]
)
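
Before building the `Data` object it can help to confirm that the arrays agree with the shapes listed under Output Files. The check below is a sketch: `check_graph_arrays` is a hypothetical helper, and the demo arrays are synthetic stand-ins for the exported files.

```python
import numpy as np


def check_graph_arrays(edge_index, node_features, edge_features, edge_labels):
    """Return a list of shape or range problems (an empty list means consistent)."""
    problems = []
    if edge_index.ndim != 2 or edge_index.shape[0] != 2:
        return ["edge_index must have shape [2, E]"]
    num_edges = edge_index.shape[1]
    num_nodes = node_features.shape[0]
    if edge_features.shape != (num_edges, 8):
        problems.append("edge_features must have shape [E, 8]")
    if edge_labels.shape != (num_edges,):
        problems.append("edge_labels must have shape [E]")
    if num_edges and (edge_index.min() < 0 or edge_index.max() >= num_nodes):
        problems.append("edge_index refers to a node outside [0, N)")
    if not np.isin(edge_labels, (0, 1)).all():
        problems.append("edge_labels must contain only 0 and 1")
    return problems


# Synthetic example: 3 nodes, 3 edges.
edge_index = np.array([[0, 1, 2], [1, 2, 0]])
node_features = np.zeros((3, 4))
edge_features = np.zeros((3, 8))
edge_labels = np.array([0, 0, 1])
print(check_graph_arrays(edge_index, node_features, edge_features, edge_labels))
```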

Configuration

graph_export:
  enabled: true
  formats:
    - pytorch_geometric
  train_ratio: 0.7
  validation_ratio: 0.15
  # test_ratio is automatically 1 - train - val = 0.15

Use Cases

  1. Anomaly Detection: Train GNNs to detect anomalous transaction patterns
  2. Network Reconstruction: Validate accounting network recovery algorithms
  3. Fraud Detection: Identify suspicious money flow patterns
  4. Link Prediction: Predict likely transaction relationships

Configuration

graph_export:
  enabled: true

  formats:
    - pytorch_geometric
    - neo4j
    - dgl

  graphs:
    - transaction_network
    - approval_network
    - entity_relationship

  split:
    train: 0.7
    val: 0.15
    test: 0.15
    stratify: is_anomaly

  features:
    temporal: true
    amount: true
    structural: true
    categorical: true

Graph Types

Transaction Network

Accounts and entities as nodes, transactions as edges.

     ┌──────────┐
     │ Account  │
     │  1100    │
     └────┬─────┘
          │ $1000
          ▼
     ┌──────────┐
     │ Customer │
     │  C-001   │
     └──────────┘

Nodes:

  • GL accounts
  • Vendors
  • Customers
  • Cost centers

Edges:

  • Journal entry lines
  • Payments
  • Invoices

Approval Network

Users as nodes, approval relationships as edges.

     ┌──────────┐
     │  Clerk   │
     │  U-001   │
     └────┬─────┘
          │ approved
          ▼
     ┌──────────┐
     │ Manager  │
     │  U-002   │
     └──────────┘

Nodes: Employees/users

Edges: Approval actions

Entity Relationship Network

Legal entities with ownership relationships.

     ┌──────────┐
     │  Parent  │
     │  1000    │
     └────┬─────┘
          │ 100%
          ▼
     ┌──────────┐
     │   Sub    │
     │  2000    │
     └──────────┘

Nodes: Companies

Edges: Ownership, IC transactions

Export Formats

PyTorch Geometric

output/graphs/transaction_network/pytorch_geometric/
├── node_features.pt    # [num_nodes, num_features]
├── edge_index.pt       # [2, num_edges]
├── edge_attr.pt        # [num_edges, num_edge_features]
├── labels.pt           # Labels
├── train_mask.pt       # Boolean training mask
├── val_mask.pt         # Boolean validation mask
└── test_mask.pt        # Boolean test mask

Loading in Python:

import torch
from torch_geometric.data import Data

# Load tensors
node_features = torch.load('node_features.pt')
edge_index = torch.load('edge_index.pt')
edge_attr = torch.load('edge_attr.pt')
labels = torch.load('labels.pt')
train_mask = torch.load('train_mask.pt')

# Create PyG Data object
data = Data(
    x=node_features,
    edge_index=edge_index,
    edge_attr=edge_attr,
    y=labels,
    train_mask=train_mask,
)

print(f"Nodes: {data.num_nodes}")
print(f"Edges: {data.num_edges}")

Neo4j

output/graphs/transaction_network/neo4j/
├── nodes_account.csv
├── nodes_vendor.csv
├── nodes_customer.csv
├── edges_transaction.csv
├── edges_payment.csv
└── import.cypher

Import script (import.cypher):

// Load accounts
LOAD CSV WITH HEADERS FROM 'file:///nodes_account.csv' AS row
CREATE (:Account {
    id: row.id,
    name: row.name,
    type: row.type
});

// Load transactions
LOAD CSV WITH HEADERS FROM 'file:///edges_transaction.csv' AS row
MATCH (from:Account {id: row.from_id})
MATCH (to:Account {id: row.to_id})
CREATE (from)-[:TRANSACTION {
    amount: toFloat(row.amount),
    date: date(row.posting_date),
    is_anomaly: toBoolean(row.is_anomaly)
}]->(to);
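
The import script reads five columns from edges_transaction.csv: `from_id`, `to_id`, `amount`, `posting_date`, and `is_anomaly`. A quick pre-import check in Python might look like the sketch below; `missing_columns` is a hypothetical helper, and the sample row is made up for illustration.

```python
import csv
import io

# Columns referenced by the TRANSACTION section of import.cypher.
REQUIRED_COLUMNS = {"from_id", "to_id", "amount", "posting_date", "is_anomaly"}


def missing_columns(csv_file):
    """Return the required columns absent from the CSV header, sorted by name."""
    reader = csv.DictReader(csv_file)
    return sorted(REQUIRED_COLUMNS - set(reader.fieldnames or []))


# Made-up sample standing in for edges_transaction.csv.
sample = io.StringIO(
    "from_id,to_id,amount,posting_date,is_anomaly\n"
    "2000,1100,1000.00,2024-01-31,false\n"
)
missing = missing_columns(sample)
print(missing)  # [] when every column is present
```

To check a real export, pass an open file handle instead of the `io.StringIO` sample.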

DGL (Deep Graph Library)

output/graphs/transaction_network/dgl/
├── graph.bin           # Serialized DGL graph
├── node_feats.npy      # Node features
├── edge_feats.npy      # Edge features
└── labels.npy          # Labels

Loading in Python:

import dgl
import numpy as np
import torch  # needed for torch.tensor below

# Load graph
graph = dgl.load_graphs('graph.bin')[0][0]

# Load features
graph.ndata['feat'] = torch.tensor(np.load('node_feats.npy'))
graph.edata['feat'] = torch.tensor(np.load('edge_feats.npy'))
graph.ndata['label'] = torch.tensor(np.load('labels.npy'))

Features

Temporal Features

features:
  temporal: true

Feature          Description
weekday          Day of week (0-6)
period           Fiscal period (1-12)
is_month_end     Last 3 days of month
is_quarter_end   Last week of quarter
is_year_end      Last month of year
hour             Hour of posting
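
The date-based flags can be derived from a posting date roughly as follows. This is a sketch under stated assumptions: fiscal periods are taken to equal calendar months, "last week of quarter" is read as the final 7 days of the quarter, and weekday 0 is Monday. The exporter may define any of these differently. `hour` is omitted because it needs a timestamp rather than a date.

```python
import calendar
from datetime import date


def temporal_features(d):
    """Compute date-based features for a posting date (illustrative only)."""
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    # Last month of the quarter containing d: 3, 6, 9, or 12.
    quarter_end_month = ((d.month - 1) // 3 + 1) * 3
    quarter_end = date(
        d.year,
        quarter_end_month,
        calendar.monthrange(d.year, quarter_end_month)[1],
    )
    return {
        "weekday": d.weekday(),                         # 0 = Monday ... 6 = Sunday
        "period": d.month,                              # assumes calendar-year fiscal periods
        "is_month_end": d.day > days_in_month - 3,      # last 3 days of month
        "is_quarter_end": (quarter_end - d).days < 7,   # last 7 days of quarter
        "is_year_end": d.month == 12,                   # last month of year
    }


features = temporal_features(date(2024, 3, 29))
print(features)
```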

Amount Features

features:
  amount: true

Feature           Description
log_amount        log10(amount)
benford_prob      Expected first-digit probability
is_round_number   Ends in 00, 000, etc.
amount_zscore     Standard deviations from mean
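
For the last two amount features, one plausible reading is sketched below. It assumes "round" means a non-zero amount that is a whole multiple of 100 currency units, and that the z-score uses the population standard deviation of raw amounts; neither detail is confirmed by this page.

```python
import statistics


def is_round_number(amount):
    """True when a non-zero amount is a whole multiple of 100 (assumed rule)."""
    cents = round(amount * 100)  # work in integer cents to avoid float remainders
    return cents != 0 and cents % 10000 == 0


def amount_zscores(amounts):
    """Standard deviations from the mean for each amount in the list."""
    mean = statistics.fmean(amounts)
    std = statistics.pstdev(amounts)
    if std == 0:
        return [0.0 for _ in amounts]  # all amounts identical
    return [(a - mean) / std for a in amounts]


print(is_round_number(1000.0))   # True
print(is_round_number(1234.56))  # False
zscores = amount_zscores([100.0, 200.0, 300.0])
print(zscores)
```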

Structural Features

features:
  structural: true

Feature              Description
line_count           Number of JE lines
unique_accounts      Distinct accounts used
has_intercompany     IC transaction flag
debit_credit_ratio   Total debits / credits
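
These four features are simple aggregates over the lines of one journal entry. The sketch below uses a hypothetical line layout (a dict with `account`, `debit`, `credit`, and `intercompany` keys) purely for illustration; the real record schema may differ.

```python
def structural_features(lines):
    """Aggregate per-entry structural features from journal entry lines."""
    total_debit = sum(line["debit"] for line in lines)
    total_credit = sum(line["credit"] for line in lines)
    return {
        "line_count": len(lines),
        "unique_accounts": len({line["account"] for line in lines}),
        "has_intercompany": any(line["intercompany"] for line in lines),
        # NaN when there are no credits, to avoid dividing by zero.
        "debit_credit_ratio": (
            total_debit / total_credit if total_credit else float("nan")
        ),
    }


# Made-up entry: two debit lines funded by one credit line.
entry = [
    {"account": "1100", "debit": 600.0, "credit": 0.0, "intercompany": False},
    {"account": "1200", "debit": 400.0, "credit": 0.0, "intercompany": False},
    {"account": "2000", "debit": 0.0, "credit": 1000.0, "intercompany": False},
]
result = structural_features(entry)
print(result)
```

A balanced entry yields a `debit_credit_ratio` of 1.0, so values far from 1.0 point to an unbalanced or partial entry.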

Categorical Features

features:
  categorical: true

One-hot encoded:

  • business_process: Manual, P2P, O2C, etc.
  • source_type: System, User, Recurring
  • account_type: Asset, Liability, etc.
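
One-hot encoding maps each category to a vector with a single 1. A minimal sketch using the three `source_type` values listed above; the ordering of the vocabulary and the all-zero vector for unknown values are assumptions, not documented behavior.

```python
SOURCE_TYPES = ["System", "User", "Recurring"]  # order is an assumption


def one_hot(value, vocabulary):
    """Return a one-hot vector for `value`; unknown values map to all zeros."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


print(one_hot("User", SOURCE_TYPES))     # [0.0, 1.0, 0.0]
print(one_hot("Unknown", SOURCE_TYPES))  # [0.0, 0.0, 0.0]
```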

Train/Val/Test Splits

split:
  train: 0.7                         # 70% training
  val: 0.15                          # 15% validation
  test: 0.15                         # 15% test
  stratify: is_anomaly               # Maintain anomaly ratio
  random_seed: 42                    # Reproducible splits

Stratification options:

  • is_anomaly: Balanced anomaly detection
  • is_fraud: Balanced fraud detection
  • account_type: Balanced by account type
  • null: Random (no stratification)
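
Stratifying by a label means each class is split separately, so the anomaly ratio stays about the same in every partition. The sketch below shows the idea with NumPy; `stratified_masks` is a hypothetical function and is not necessarily how the exporter implements its split.

```python
import numpy as np


def stratified_masks(labels, train=0.7, val=0.15, seed=42):
    """Build boolean train/val/test masks that preserve per-class ratios."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    n = labels.shape[0]
    train_mask = np.zeros(n, dtype=bool)
    val_mask = np.zeros(n, dtype=bool)
    test_mask = np.zeros(n, dtype=bool)
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut them by ratio.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(train * idx.size))
        n_val = int(round(val * idx.size))
        train_mask[idx[:n_train]] = True
        val_mask[idx[n_train:n_train + n_val]] = True
        test_mask[idx[n_train + n_val:]] = True  # remainder goes to test
    return train_mask, val_mask, test_mask


# Synthetic labels: 180 normal items, 20 anomalies (10% anomaly rate).
labels = np.array([0] * 180 + [1] * 20)
train_mask, val_mask, test_mask = stratified_masks(labels)
print(train_mask.sum(), val_mask.sum(), test_mask.sum())  # 140 30 30
print(labels[train_mask].mean())                          # 0.1, same anomaly rate
```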

GNN Training Example

import torch
import torch.nn.functional as F  # provides F.cross_entropy used in the loop below
from torch_geometric.nn import GCNConv

class AnomalyGNN(torch.nn.Module):
    def __init__(self, num_features, hidden_dim):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, 2)  # Binary classification

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x

# Train
model = AnomalyGNN(data.num_features, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

Multi-Layer Hypergraph (v0.6.2)

The RustGraph Hypergraph exporter now supports all 8 enterprise process families with 24 entity type codes:

Entity Type Codes

Range     Family       Types
100       Accounting   GL Accounts
300-303   P2P          PurchaseOrder, GoodsReceipt, VendorInvoice, Payment
310-312   O2C          SalesOrder, Delivery, CustomerInvoice
320-325   S2C          SourcingProject, RfxEvent, SupplierBid, BidEvaluation, ProcurementContract, SupplierQualification
330-333   H2R          PayrollRun, TimeEntry, ExpenseReport, PayrollLineItem
340-343   MFG          ProductionOrder, RoutingOperation, QualityInspection, CycleCount
350-352   BANK         BankingCustomer, BankAccount, BankTransaction
360-365   AUDIT        AuditEngagement, Workpaper, AuditFinding, AuditEvidence, RiskAssessment, ProfessionalJudgment
370-372   Bank Recon   BankReconciliation, BankStatementLine, ReconcilingItem
400       OCPM         OcpmEvent (events as hyperedges)
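
When post-processing an exported hypergraph, the code ranges above can be turned into a small lookup. The sketch below copies the ranges from the table; the function name and the `None` result for unlisted codes are illustrative choices.

```python
# (low, high, family) ranges copied from the Entity Type Codes table.
ENTITY_CODE_RANGES = [
    (100, 100, "Accounting"),
    (300, 303, "P2P"),
    (310, 312, "O2C"),
    (320, 325, "S2C"),
    (330, 333, "H2R"),
    (340, 343, "MFG"),
    (350, 352, "BANK"),
    (360, 365, "AUDIT"),
    (370, 372, "Bank Recon"),
    (400, 400, "OCPM"),
]


def family_for_code(code):
    """Return the process family for an entity type code, or None if unlisted."""
    for low, high, family in ENTITY_CODE_RANGES:
        if low <= code <= high:
            return family
    return None


print(family_for_code(301))  # P2P
print(family_for_code(372))  # Bank Recon
print(family_for_code(200))  # None
```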

OCPM Events as Hyperedges

When events_as_hyperedges is set to true, each OCPM event becomes a hyperedge that connects all of its participating objects. Because one event can touch objects from several process families, the hypergraph structure enables cross-process analysis.
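
To make the hyperedge idea concrete, the sketch below models each event as a set of object IDs and builds the reverse index from objects to events. The event records, field names, and IDs are invented for illustration and do not reflect the exporter's output schema.

```python
from collections import defaultdict

# Made-up events; each one references the objects it touches.
events = [
    {"event_id": "E1", "activity": "Create Purchase Order", "objects": ["PO-1"]},
    {"event_id": "E2", "activity": "Post Vendor Invoice", "objects": ["PO-1", "INV-7"]},
    {"event_id": "E3", "activity": "Execute Payment", "objects": ["INV-7", "PAY-3"]},
]


def build_incidence(events):
    """Return (hyperedges, membership): event -> objects and object -> events."""
    hyperedges = {e["event_id"]: set(e["objects"]) for e in events}
    membership = defaultdict(set)
    for event_id, objects in hyperedges.items():
        for obj in objects:
            membership[obj].add(event_id)
    return hyperedges, dict(membership)


hyperedges, membership = build_incidence(events)

# Objects that share at least one hyperedge (event) with INV-7.
linked = set().union(*(hyperedges[e] for e in membership["INV-7"])) - {"INV-7"}
print(sorted(linked))  # ['PAY-3', 'PO-1']
```

Unlike an ordinary edge, a single hyperedge can join any number of objects, which is why one invoice event links a purchase order and an invoice in a single step.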

Per-Family Toggles

graph_export:
  hypergraph:
    enabled: true
    process_layer:
      include_p2p: true
      include_o2c: true
      include_s2c: true
      include_h2r: true
      include_mfg: true
      include_bank: true
      include_audit: true
      include_r2r: true
      events_as_hyperedges: true

See Also