AML/KYC Testing

Generate realistic banking transaction data with KYC profiles and AML typologies for compliance testing and fraud detection model development.

Overview

The datasynth-banking module generates synthetic banking data designed for:

AML System Testing: Validate transaction monitoring rules against known patterns
KYC Process Testing: Test customer onboarding and risk assessment workflows
ML Model Training: Train supervised models with labeled fraud typologies
Scenario Analysis: Test detection capabilities against specific attack patterns

KYC Profile Generation

Customer Types

Type	Description	Typical Characteristics
Retail	Individual customers	Salary deposits, consumer spending
Business	Small to medium businesses	Payroll, supplier payments
Trust	Trust accounts, complex structures	Investment flows, distributions

KYC Profile Components

Each customer has a KYC profile defining expected behavior:

kyc_profile:
  declared_turnover: 50000        # Expected monthly volume
  transaction_frequency: 25       # Expected transactions/month
  source_of_funds: "employment"   # Declared income source
  geographic_exposure: ["US", "EU"]
  cash_intensity: 0.05            # Expected cash ratio
  beneficial_owner_complexity: 1  # Ownership layers

Risk Scoring

Customers are assigned risk scores based on:

Geographic exposure (high-risk jurisdictions)
Industry sector
Transaction patterns vs. declared profile
Beneficial ownership complexity
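
The four factors above could be combined into a single score along these lines. This is a minimal sketch only: the weights, the placeholder jurisdiction and sector sets, and the `risk_score` function are illustrative assumptions, not the scoring logic used by datasynth-banking.

```python
# Hypothetical weights and lookup sets; not taken from datasynth-banking.
HIGH_RISK_JURISDICTIONS = {"XX", "YY"}          # placeholder country codes
HIGH_RISK_SECTORS = {"money_services", "gambling"}


def risk_score(geographic_exposure, sector, turnover_deviation, ownership_layers):
    """Combine the four risk factors into a score between 0 and 100."""
    score = 0.0
    # Geographic exposure: any high-risk jurisdiction adds a fixed penalty.
    if HIGH_RISK_JURISDICTIONS & set(geographic_exposure):
        score += 30
    # Industry sector.
    if sector in HIGH_RISK_SECTORS:
        score += 20
    # Relative deviation of actual from declared turnover, capped at 100%.
    score += 30 * min(abs(turnover_deviation), 1.0)
    # Each ownership layer beyond the first adds 5 points, capped at 20.
    score += min(5 * max(ownership_layers - 1, 0), 20)
    return round(score)
```

A customer with no high-risk exposure, a 50% turnover deviation, and three ownership layers would score 25 under these assumed weights.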

AML Typology Generation

Structuring

Breaking large transactions into smaller amounts to avoid reporting thresholds.

Detection Signatures:
- Multiple transactions just below the $10,000 reporting threshold
- Same-day deposits across multiple branches
- Near-threshold amounts in round increments (e.g., $9,900, $9,800)

Configuration:

typologies:
  structuring:
    enabled: true
    rate: 0.001
    threshold: 10000
    margin: 500
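
A monitoring rule for this typology might look like the sketch below. It assumes that `margin` defines a band just below `threshold` (here 9,500 up to but excluding 10,000); that reading of the setting, the `min_count` parameter, and the `flag_structuring` helper are assumptions made for illustration. Column names follow the transaction record format shown under Generated Data.

```python
import pandas as pd


def flag_structuring(transactions, threshold=10000, margin=500, min_count=2):
    """Return account/day pairs with repeated credits just below the threshold."""
    df = transactions.copy()
    df["date"] = pd.to_datetime(df["timestamp"]).dt.date
    # Keep inbound amounts inside the band [threshold - margin, threshold).
    near = df[(df["direction"] == "credit")
              & (df["amount"] >= threshold - margin)
              & (df["amount"] < threshold)]
    counts = near.groupby(["account_id", "date"]).size()
    return counts[counts >= min_count].reset_index(name="near_threshold_count")


# Toy data modelled on the sample transaction records.
sample = pd.DataFrame({
    "account_id": ["ACC001", "ACC001", "ACC002"],
    "timestamp": ["2024-03-15T10:30:00Z", "2024-03-15T11:45:00Z",
                  "2024-03-15T09:00:00Z"],
    "amount": [9800.00, 9750.00, 1200.00],
    "direction": ["credit", "credit", "credit"],
})
flagged = flag_structuring(sample)
```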

Funnel Accounts

Concentrating funds from multiple sources in one account before moving them on to a destination.

Pattern:
Source A ─┐
Source B ─┼─▶ Funnel Account ─▶ Destination
Source C ─┘

Detection Signatures:
- Many small inbound, few large outbound
- High throughput relative to account balance
- Short holding periods

Layering

Complex chains of transactions to obscure fund origins.

Pattern:
Origin ─▶ Shell A ─▶ Shell B ─▶ Shell C ─▶ Destination
                          └─▶ Mixing ─┘

Detection Signatures:
- Rapid consecutive transfers
- Circular transaction patterns
- Cross-border routing through multiple jurisdictions
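
Circular transaction patterns can be surfaced as cycles in a directed transfer graph. The sketch below uses `networkx.simple_cycles` on a toy edge list; the node names echo the diagram above, and the return edge from Shell C to Origin is added purely for illustration.

```python
import networkx as nx

# Toy transfer chain with one return edge that closes a loop.
transfers = [
    ("Origin", "ShellA"),
    ("ShellA", "ShellB"),
    ("ShellB", "ShellC"),
    ("ShellC", "Origin"),        # illustrative return leg
    ("ShellC", "Destination"),
]

G = nx.DiGraph()
G.add_edges_from(transfers)

# Keep cycles that pass through at least three parties.
cycles = [c for c in nx.simple_cycles(G) if len(c) >= 3]
```

On real output, enumerating every cycle can be expensive, so it may be worth restricting the graph to a time window or to flagged accounts first.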

Money Mule Networks

Using recruited individuals to move illicit funds.

Pattern:
Fraudster ─▶ Mule 1 ─▶ Cash Withdrawal
           ─▶ Mule 2 ─▶ Wire Transfer
           ─▶ Mule 3 ─▶ Crypto Exchange

Detection Signatures:
- New accounts with sudden high volume
- Immediate outbound after inbound
- Multiple accounts with similar patterns
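
The "immediate outbound after inbound" signature can be approximated by pairing each debit with the most recent credit on the same account inside a short window. A minimal sketch using `pandas.merge_asof`; the one-hour window and the `rapid_pass_through` helper are assumptions, and the column names follow the transaction record format shown under Generated Data.

```python
import pandas as pd


def rapid_pass_through(transactions, max_delay="1h"):
    """Pair each debit with the latest prior credit on the same account."""
    df = transactions.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df.sort_values("timestamp")  # merge_asof needs sorted keys
    cols = ["account_id", "timestamp", "transaction_id"]
    credits = df.loc[df["direction"] == "credit", cols].rename(
        columns={"timestamp": "inbound_time", "transaction_id": "inbound_id"})
    debits = df.loc[df["direction"] == "debit", cols]
    paired = pd.merge_asof(
        debits, credits,
        left_on="timestamp", right_on="inbound_time",
        by="account_id", direction="backward",
        tolerance=pd.Timedelta(max_delay),
    )
    # Debits with no credit inside the window have no inbound_id.
    return paired.dropna(subset=["inbound_id"])


sample = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T3", "T4"],
    "account_id": ["ACC010", "ACC010", "ACC011", "ACC011"],
    "timestamp": ["2024-03-15T10:00:00Z", "2024-03-15T10:20:00Z",
                  "2024-03-15T09:00:00Z", "2024-03-15T15:00:00Z"],
    "direction": ["credit", "debit", "credit", "debit"],
})
suspects = rapid_pass_through(sample)
```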

Round-Tripping

Moving funds in circular patterns to create apparent legitimacy.

Pattern:
Company A ─▶ Offshore ─▶ Company A (as "investment")

Detection Signatures:
- Funds return to origin within short period
- Offshore intermediaries
- Inflated invoicing

Fraud Patterns

Credit card fraud and synthetic identity patterns.

Patterns:
- Card testing (small amounts across merchants)
- Account takeover (changed behavior profile)
- Synthetic identity (blended PII attributes)

Generated Data

Output Files

banking/
├── banking_customers.csv        # Customer profiles with KYC data
├── bank_accounts.csv            # Account records with features
├── bank_transactions.csv        # Transaction records
├── kyc_profiles.csv             # Expected activity envelopes
├── counterparties.csv           # Counterparty pool
├── aml_typology_labels.csv      # Ground truth typology labels
├── entity_risk_labels.csv       # Entity-level risk classifications
└── transaction_risk_labels.csv  # Transaction-level classifications

Customer Record

customer_id,customer_type,name,created_at,risk_score,kyc_status,pep_flag,sanctions_flag
CUST001,retail,John Smith,2024-01-15,25,verified,false,false
CUST002,business,Acme Corp,2024-02-01,65,enhanced_due_diligence,false,false

Transaction Record

transaction_id,account_id,timestamp,amount,currency,direction,channel,category,counterparty_id
TXN001,ACC001,2024-03-15T10:30:00Z,9800.00,USD,credit,branch,cash_deposit,
TXN002,ACC001,2024-03-15T11:45:00Z,9750.00,USD,credit,branch,cash_deposit,

Typology Label

transaction_id,typology,confidence,pattern_id,related_transactions
TXN001,structuring,0.95,STRUCT_001,"TXN001,TXN002,TXN003"
TXN002,structuring,0.95,STRUCT_001,"TXN001,TXN002,TXN003"
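
Because `related_transactions` is a quoted, comma-separated field, it should be read with a CSV parser rather than split on commas by hand. The sketch below groups transaction IDs by `pattern_id` using only the standard library; the inline sample mirrors the rows above.

```python
import csv
import io
from collections import defaultdict

sample = '''transaction_id,typology,confidence,pattern_id,related_transactions
TXN001,structuring,0.95,STRUCT_001,"TXN001,TXN002,TXN003"
TXN002,structuring,0.95,STRUCT_001,"TXN001,TXN002,TXN003"
'''

# Map each pattern to the full set of transactions that belong to it.
patterns = defaultdict(set)
for row in csv.DictReader(io.StringIO(sample)):
    patterns[row["pattern_id"]].update(row["related_transactions"].split(","))
```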

Configuration

Basic Banking Setup

banking:
  enabled: true
  customers:
    retail: 5000
    business: 500
    trust: 50

  transactions:
    target_count: 500000
    date_range:
      start: 2024-01-01
      end: 2024-12-31

  typologies:
    structuring:
      enabled: true
      rate: 0.002
    funnel:
      enabled: true
      rate: 0.001
    layering:
      enabled: true
      rate: 0.0005
    mule:
      enabled: true
      rate: 0.001
    fraud:
      enabled: true
      rate: 0.005

  labels:
    generate: true
    include_confidence: true
    include_related: true

Adversarial Testing

Generate transactions designed to evade detection:

banking:
  typologies:
    spoofing:
      enabled: true
      strategies:
        - threshold_aware        # Varies amounts around thresholds
        - temporal_distribution  # Spreads over time windows
        - channel_mixing         # Uses multiple channels

Use Cases

Transaction Monitoring Rule Testing

# Generate data with known structuring patterns
datasynth-data generate --config banking_structuring.yaml --output ./test_data

# Expected results (assuming the config sets the structuring rate to 0.002):
# - About 0.2% of transactions should trigger structuring alerts
# - Labels in aml_typology_labels.csv for validation
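
One way to check the generated data against the configured rate is to compare the share of labeled transactions with the expected value. A minimal sketch, assuming `rate` is a per-transaction fraction; the helper names and the 25% relative tolerance are illustrative choices.

```python
def observed_rate(labeled_ids, all_ids):
    """Fraction of generated transactions that carry a typology label."""
    all_ids = set(all_ids)
    return len(set(labeled_ids) & all_ids) / len(all_ids)


def within_tolerance(observed, configured, rel_tol=0.25):
    """True when the observed rate is within rel_tol of the configured rate."""
    return abs(observed - configured) <= rel_tol * configured
```

With 500,000 transactions and a rate of 0.002, roughly 1,000 labeled transactions would be expected; some sampling variation is normal, so an exact match should not be required.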

ML Model Training

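import pandas as pd
from sklearn.model_selection import train_test_split

# Load transactions and labels
transactions = pd.read_csv("banking/bank_transactions.csv")
labels = pd.read_csv("banking/aml_typology_labels.csv")

# Merge and build the target; unlabeled transactions are treated as legitimate
data = transactions.merge(labels, on="transaction_id", how="left")
data["is_suspicious"] = data["typology"].notna()

# Example feature set (an illustrative choice): amount plus one-hot categoricals
features = pd.get_dummies(
    data[["amount", "direction", "channel", "category"]],
    columns=["direction", "channel", "category"],
)

# Stratified split keeps the rare suspicious class represented in both sets
X_train, X_test, y_train, y_test = train_test_split(
    features,
    data["is_suspicious"],
    test_size=0.2,
    stratify=data["is_suspicious"],
    random_state=42,
)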

Network Analysis

The banking data supports graph-based analysis:

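import networkx as nx
import pandas as pd

transactions = pd.read_csv("banking/bank_transactions.csv")

# Build transaction network; rows without a counterparty load as NaN, so drop them
G = nx.DiGraph()
for txn in transactions.dropna(subset=["counterparty_id"]).itertuples(index=False):
    # Credits flow counterparty -> account; debits flow account -> counterparty
    if txn.direction == "credit":
        src, dst = txn.counterparty_id, txn.account_id
    else:
        src, dst = txn.account_id, txn.counterparty_id
    # Accumulate amounts for repeated pairs instead of overwriting the edge weight
    prev = G.get_edge_data(src, dst, default={"weight": 0.0})["weight"]
    G.add_edge(src, dst, weight=prev + txn.amount)

# Detect funnel accounts (many distinct senders, few distinct recipients)
in_degree = dict(G.in_degree())
out_degree = dict(G.out_degree())
funnels = [n for n in G.nodes()
           if in_degree.get(n, 0) > 10 and out_degree.get(n, 0) < 3]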

KYC Deviation Analysis

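import pandas as pd

# Compare actual behavior to KYC profile
accounts = pd.read_csv("banking/bank_accounts.csv")
kyc = pd.read_csv("banking/kyc_profiles.csv")
transactions = pd.read_csv("banking/bank_transactions.csv", parse_dates=["timestamp"])

# Transactions carry account_id only; map them to customers via the accounts file
# (assumes bank_accounts.csv has account_id and customer_id columns)
transactions = transactions.merge(accounts[["account_id", "customer_id"]], on="account_id")
transactions["month"] = transactions["timestamp"].dt.strftime("%Y-%m")

# Calculate actual monthly volumes
actual = (transactions.groupby(["customer_id", "month"])["amount"]
          .sum().rename("actual").reset_index())

# Compare to declared turnover
# (assumes kyc_profiles.csv has customer_id and declared_turnover columns)
merged = actual.merge(kyc, on="customer_id")
merged["deviation"] = (merged["actual"] - merged["declared_turnover"]) / merged["declared_turnover"]

# Flag months that deviate more than 50% from the declared monthly volume
alerts = merged[merged["deviation"].abs() > 0.5]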

Best Practices

Realistic Testing

Match production volumes: Configure similar customer counts and transaction rates
Use realistic ratios: Keep typology rates at realistic levels (0.1-1%)
Include noise: Add legitimate edge cases that shouldn’t trigger alerts

Label Quality

Verify ground truth: Labels reflect injected patterns, not detected ones
Include confidence: Use confidence scores for uncertain classifications
Track related transactions: Pattern IDs link related suspicious activity

Model Validation

Test detection rates: Measure recall against known patterns
Check false positives: Ensure legitimate transactions aren’t flagged
Validate across typologies: Test each pattern type separately
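
The three checks above can be computed directly from the ground-truth labels and the set of transactions a detection system alerted on. A minimal sketch; the function names and input shapes (a dict mapping transaction ID to typology, and a set of alerted IDs) are assumptions for illustration.

```python
from collections import defaultdict


def per_typology_recall(labels, alerted_ids):
    """labels maps transaction_id -> typology; returns recall per typology."""
    hits, totals = defaultdict(int), defaultdict(int)
    for txn_id, typology in labels.items():
        totals[typology] += 1
        if txn_id in alerted_ids:
            hits[typology] += 1
    return {t: hits[t] / totals[t] for t in totals}


def false_positive_rate(all_ids, labels, alerted_ids):
    """Share of unlabeled (legitimate) transactions that were alerted on."""
    legit = set(all_ids) - set(labels)
    return len(legit & set(alerted_ids)) / len(legit) if legit else 0.0
```

Reporting recall per typology rather than as a single aggregate makes it harder for a strong result on a common pattern to hide a weak result on a rare one.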
