Anomaly Injection

Generate labeled anomalies for machine learning training.

Overview

Anomaly injection adds realistic irregularities to the generated data and labels each one for supervised learning. Supported categories include:

  • 20+ fraud types
  • Error patterns
  • Process issues
  • Statistical outliers
  • Relational anomalies

Configuration

anomaly_injection:
  enabled: true
  total_rate: 0.02                   # 2% anomaly rate
  generate_labels: true              # Output ML labels

  categories:                        # Category distribution
    fraud: 0.25
    error: 0.40
    process_issue: 0.20
    statistical: 0.10
    relational: 0.05

  temporal_pattern:
    year_end_spike: 1.5              # More anomalies at year-end

  clustering:
    enabled: true
    cluster_probability: 0.2         # 20% appear in clusters
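
To make the interaction between total_rate and categories concrete, here is a minimal sketch. It assumes the category weights are shares of all injected anomalies and must sum to 1.0; the helper name expected_counts and the 100,000-entry dataset size are illustrative and not part of the tool.

```python
# Hypothetical helper: expected number of anomalies per category.
# Assumes category weights are shares of the total anomaly count.
def expected_counts(total_entries, total_rate, categories):
    weight_sum = sum(categories.values())
    if abs(weight_sum - 1.0) > 1e-9:
        raise ValueError(f"category weights sum to {weight_sum}, expected 1.0")
    total_anomalies = total_entries * total_rate
    return {name: round(total_anomalies * w) for name, w in categories.items()}


categories = {
    "fraud": 0.25,
    "error": 0.40,
    "process_issue": 0.20,
    "statistical": 0.10,
    "relational": 0.05,
}

# 100,000 entries at a 2% rate gives 2,000 anomalies in total
counts = expected_counts(100_000, 0.02, categories)
print(counts)
```

Under these assumptions, a 2% rate on 100,000 entries yields 2,000 anomalies, of which 500 are fraud and 800 are errors.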

Anomaly Categories

Fraud Types

| Type | Description | Detection Difficulty |
|---|---|---|
| fictitious_transaction | Fabricated entries | Medium |
| revenue_manipulation | Premature recognition | Hard |
| expense_capitalization | Improper capitalization | Medium |
| split_transaction | Split to avoid threshold | Easy |
| round_tripping | Circular transactions | Hard |
| kickback_scheme | Vendor kickbacks | Hard |
| ghost_employee | Non-existent payee | Medium |
| duplicate_payment | Same invoice twice | Easy |
| unauthorized_discount | Unapproved discounts | Medium |
| suspense_abuse | Hide in suspense | Hard |

fraud:
  types:
    fictitious_transaction: 0.15
    split_transaction: 0.20
    duplicate_payment: 0.15
    ghost_employee: 0.10
    kickback_scheme: 0.10
    revenue_manipulation: 0.10
    expense_capitalization: 0.10
    unauthorized_discount: 0.10
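
The type weights above can be read as a sampling distribution over fraud types. The sketch below checks that they sum to 1.0 and draws types in proportion to them using the standard library. The function name sample_fraud_types and the fixed seed are illustrative; the generator's actual sampling method is not documented here.

```python
import random

fraud_types = {
    "fictitious_transaction": 0.15,
    "split_transaction": 0.20,
    "duplicate_payment": 0.15,
    "ghost_employee": 0.10,
    "kickback_scheme": 0.10,
    "revenue_manipulation": 0.10,
    "expense_capitalization": 0.10,
    "unauthorized_discount": 0.10,
}


def sample_fraud_types(weights, n, seed=0):
    """Draw n fraud types in proportion to their configured weights."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


total = sum(fraud_types.values())
sample = sample_fraud_types(fraud_types, 1000)
print(f"weights sum to {total:.2f}; drew {len(sample)} types")
```

Note that round_tripping and suspense_abuse appear in the table but not in this weight list. Whether the generator treats unlisted types as weight zero is not stated.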

Error Types

| Type | Description |
|---|---|
| duplicate_entry | Same entry posted twice |
| reversed_amount | Debit/credit swapped |
| wrong_period | Posted to wrong period |
| wrong_account | Incorrect GL account |
| missing_reference | Missing document reference |
| incorrect_tax_code | Wrong tax calculation |
| misclassification | Wrong account category |

Process Issues

| Type | Description |
|---|---|
| late_posting | Posted after cutoff |
| skipped_approval | Missing required approval |
| threshold_manipulation | Amount just below threshold |
| missing_documentation | No supporting document |
| out_of_sequence | Documents out of order |

Statistical Anomalies

| Type | Description |
|---|---|
| unusual_amount | Significant deviation from mean |
| trend_break | Sudden pattern change |
| benford_violation | Doesn’t follow Benford’s Law |
| outlier_value | Extreme value |
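
For reference, Benford's Law predicts that leading digit d occurs with probability log10(1 + 1/d). The sketch below compares observed first-digit frequencies against that expectation. The function names and the mean-absolute-deviation score are illustrative choices, not the generator's own check.

```python
import math
from collections import Counter


def benford_expected(d):
    """Expected share of values whose leading digit is d (1..9)."""
    return math.log10(1 + 1 / d)


def first_digit(amount):
    """Leading nonzero digit of a number, or None for zero."""
    digits = str(abs(amount)).lstrip("0.")
    return int(digits[0]) if digits else None


def benford_deviation(amounts):
    """Mean absolute gap between observed and expected digit shares."""
    digits = [d for d in map(first_digit, amounts) if d]
    if not digits:
        raise ValueError("no nonzero amounts")
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts[d] / n - benford_expected(d)) for d in range(1, 10)) / 9


conforming = [2 ** k for k in range(1, 200)]   # powers of 2 track Benford closely
violating = [900 + i for i in range(99)]       # every value starts with 9

print(benford_deviation(conforming), benford_deviation(violating))
```

A larger score means the digit distribution is further from Benford's expectation. Any cutoff for flagging a violation would need to be tuned on your own data.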

Relational Anomalies

| Type | Description |
|---|---|
| circular_transaction | A → B → A flow |
| dormant_account_activity | Inactive account used |
| unusual_counterparty | Unexpected entity pairing |

Injection Strategies

Amount Manipulation

anomaly_injection:
  strategies:
    amount:
      enabled: true
      threshold_adjacent: 0.3        # Just below approval limit
      round_number_bias: 0.4         # Suspicious round amounts

Threshold-adjacent: amounts such as $9,999 when the approval limit is $10,000.
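
On the detection side, the two amount strategies map to simple checks. This is a minimal sketch: the 1% margin below the limit and the 1,000 base for round numbers are assumptions chosen for illustration, and both function names are hypothetical.

```python
def is_threshold_adjacent(amount, limit, margin=0.01):
    """True if amount sits within `margin` (a fraction of limit) below the limit."""
    return limit * (1 - margin) <= amount < limit


def is_round_number(amount, base=1000):
    """True if amount is an exact multiple of `base`."""
    return amount % base == 0


print(is_threshold_adjacent(9_999, 10_000))   # just below the approval limit
print(is_threshold_adjacent(10_000, 10_000))  # at the limit, so not "below"
print(is_round_number(25_000))
```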

Date Manipulation

anomaly_injection:
  strategies:
    date:
      enabled: true
      weekend_bias: 0.2              # Weekend postings
      after_hours_bias: 0.15         # After business hours

Duplication

anomaly_injection:
  strategies:
    duplication:
      enabled: true
      exact_duplicate: 0.5           # Exact copy
      near_duplicate: 0.3            # Slight variations
      delayed_duplicate: 0.2         # Same entry later
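
The three duplicate kinds can be told apart with a pairwise comparison. The sketch below is one reading of the config comments: an exact duplicate matches on vendor, amount, and date; a delayed duplicate matches on vendor and amount but not date; a near duplicate has the same vendor and an amount within a small relative tolerance. The field names (vendor, amount, date) and the 1% tolerance are hypothetical.

```python
from datetime import date


def classify_duplicate(a, b, amount_tolerance=0.01):
    """Classify entry b relative to entry a; return None if unrelated."""
    if a["vendor"] != b["vendor"]:
        return None
    same_amount = a["amount"] == b["amount"]
    same_date = a["date"] == b["date"]
    if same_amount and same_date:
        return "exact_duplicate"
    if same_amount:
        return "delayed_duplicate"
    # Amounts differ here, so the larger absolute value is nonzero
    diff = abs(a["amount"] - b["amount"]) / max(abs(a["amount"]), abs(b["amount"]))
    if diff <= amount_tolerance:
        return "near_duplicate"
    return None


original = {"vendor": "V100", "amount": 1250.00, "date": date(2024, 5, 6)}
exact = dict(original)
near = {**original, "amount": 1250.50}
delayed = {**original, "date": date(2024, 6, 3)}
other = {**original, "amount": 900.00}

for candidate in (exact, near, delayed, other):
    print(classify_duplicate(original, candidate))
```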

Temporal Patterns

Anomaly frequency can follow realistic calendar patterns, with multipliers that raise the rate at period ends:

anomaly_injection:
  temporal_pattern:
    month_end_spike: 1.2             # 20% more at month-end
    quarter_end_spike: 1.5           # 50% more at quarter-end
    year_end_spike: 2.0              # Double at year-end
    seasonality: true                # Follow industry patterns
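
One way to read these multipliers is as factors on the base anomaly rate. The sketch below assumes a calendar fiscal year, applies a spike only on the last day of a period, and uses the most specific multiplier when several apply (for example, December 31 counts as year-end only). How the generator actually combines them is not documented here, and spike_multiplier is a hypothetical name.

```python
from datetime import date, timedelta

pattern = {
    "month_end_spike": 1.2,
    "quarter_end_spike": 1.5,
    "year_end_spike": 2.0,
}


def is_month_end(d):
    """True if d is the last day of its month."""
    return (d + timedelta(days=1)).month != d.month


def spike_multiplier(d, pattern):
    """Multiplier for date d, using the most specific period end that applies."""
    if not is_month_end(d):
        return 1.0
    if d.month == 12:
        return pattern["year_end_spike"]
    if d.month in (3, 6, 9):
        return pattern["quarter_end_spike"]
    return pattern["month_end_spike"]


base_rate = 0.02
for d in (date(2024, 2, 28), date(2024, 2, 29), date(2024, 3, 31), date(2024, 12, 31)):
    print(d, base_rate * spike_multiplier(d, pattern))
```

Under this reading, a 2% base rate becomes an effective 4% on December 31.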

Entity Targeting

Control which entities receive anomalies:

anomaly_injection:
  entity_targeting:
    strategy: weighted               # random, repeat_offender, weighted

    repeat_offender:
      enabled: true
      rate: 0.4                      # 40% from same users

    high_volume_bias: 0.3            # Target high-volume entities

Clustering

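Real anomalies often occur in clusters rather than in isolation, so a share of injected anomalies can be grouped within a time window: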

anomaly_injection:
  clustering:
    enabled: true
    cluster_probability: 0.2         # 20% in clusters
    cluster_size:
      min: 3
      max: 10
    cluster_timespan_days: 30        # Within 30-day window
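
To illustrate how these settings might divide a batch of anomalies, the sketch below treats cluster_probability as the share of anomalies placed in clusters and draws each cluster size uniformly between min and max. Both choices are assumptions, and plan_clusters is a hypothetical helper rather than part of the tool.

```python
import random


def plan_clusters(n_anomalies, cluster_probability=0.2, min_size=3, max_size=10, seed=0):
    """Split n_anomalies into cluster sizes plus a count of standalone anomalies."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    remaining = round(n_anomalies * cluster_probability)
    clusters = []
    while remaining >= min_size:
        size = rng.randint(min_size, min(max_size, remaining))
        clusters.append(size)
        remaining -= size
    # Anything too small to form a cluster stays standalone
    standalone = n_anomalies - sum(clusters)
    return clusters, standalone


clusters, standalone = plan_clusters(2000)
print(f"{len(clusters)} clusters, {sum(clusters)} clustered, {standalone} standalone")
```

Each planned cluster would then be placed inside a window of cluster_timespan_days, which this sketch does not model.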

Output Labels

anomaly_labels.csv

| Field | Description |
|---|---|
| document_id | Entry reference |
| anomaly_id | Unique anomaly ID |
| anomaly_type | Specific type |
| anomaly_category | Fraud, Error, etc. |
| severity | Low, Medium, High |
| detection_difficulty | Easy, Medium, Hard |
| description | Human-readable description |

fraud_labels.csv

Subset of the anomaly labels covering fraud only, with fraud-specific fields:

| Field | Description |
|---|---|
| document_id | Entry reference |
| fraud_type | Specific fraud pattern |
| perpetrator_id | Employee ID |
| scheme_id | Related anomaly group |
| amount_manipulated | Fraud amount |
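
Because scheme_id groups related entries, fraud labels can be summarized per scheme. The sketch below uses a small in-memory table with the documented columns. The IDs and amounts are made-up illustrative values, not generator output.

```python
import pandas as pd

# Illustrative rows only; real files come from the generator's label output
fraud = pd.DataFrame({
    "document_id": ["JE-001", "JE-002", "JE-003", "JE-004"],
    "fraud_type": ["split_transaction"] * 3 + ["ghost_employee"],
    "perpetrator_id": ["E17", "E17", "E17", "E42"],
    "scheme_id": ["S1", "S1", "S1", "S2"],
    "amount_manipulated": [4900.0, 4800.0, 4950.0, 3200.0],
})

# One row per scheme: type, perpetrator, document count, and total amount
schemes = fraud.groupby("scheme_id").agg(
    fraud_type=("fraud_type", "first"),
    perpetrator_id=("perpetrator_id", "first"),
    documents=("document_id", "count"),
    total_amount=("amount_manipulated", "sum"),
)
print(schemes)
```

Taking the first fraud_type and perpetrator_id per group assumes a scheme has a single type and perpetrator, which may not hold for every scheme.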

ML Integration

Loading Labels

import pandas as pd

labels = pd.read_csv('output/labels/anomaly_labels.csv')
entries = pd.read_csv('output/transactions/journal_entries.csv')

# Left-join labels onto entries; entries without a matching label are normal
data = entries.merge(labels, on='document_id', how='left')
data['is_anomaly'] = data['anomaly_id'].notna()

Feature Engineering

# Select feature columns (assumed to already exist in the merged data)
features = [
    'amount', 'line_count', 'is_round_number',
    'is_weekend', 'is_month_end', 'hour_of_day'
]

X = data[features]
y = data['is_anomaly']
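
If some of these columns are missing from your data, several can be derived from an amount and a timestamp. The sketch below assumes a timestamp column named posting_date and treats multiples of 1,000 as round numbers. Both are assumptions, so adjust them to your actual schema. line_count is not derived here because it needs line-level data.

```python
import pandas as pd


def add_features(df, date_col="posting_date"):
    """Return a copy of df with derived boolean and hour features."""
    out = df.copy()
    ts = pd.to_datetime(out[date_col])
    out["is_round_number"] = out["amount"] % 1000 == 0
    out["is_weekend"] = ts.dt.dayofweek >= 5      # Saturday=5, Sunday=6
    out["is_month_end"] = ts.dt.is_month_end
    out["hour_of_day"] = ts.dt.hour
    return out


# Illustrative rows: a round amount late on a Sunday month-end, and a normal entry
sample = pd.DataFrame({
    "amount": [5000.0, 1234.56],
    "posting_date": ["2024-03-31 22:15:00", "2024-04-02 10:00:00"],
})
featured = add_features(sample)
print(featured)
```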

Train/Test Split

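Labels include suggested splits. To create your own split, stratify on the label so that the training and test sets keep the same anomaly ratio: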

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,  # Maintain anomaly ratio
    random_state=42
)

Example Configuration

Fraud Detection Training

anomaly_injection:
  enabled: true
  total_rate: 0.02
  generate_labels: true

  categories:
    fraud: 1.0                       # Only fraud for focused training

  clustering:
    enabled: true
    cluster_probability: 0.3

fraud:
  enabled: true
  fraud_rate: 0.02
  types:
    split_transaction: 0.25
    duplicate_payment: 0.25
    kickback_scheme: 0.20
    ghost_employee: 0.15
    fictitious_transaction: 0.15

General Anomaly Detection

anomaly_injection:
  enabled: true
  total_rate: 0.05
  generate_labels: true

  categories:
    fraud: 0.15
    error: 0.45
    process_issue: 0.25
    statistical: 0.10
    relational: 0.05
