Research: Statistical and Numerical Distributions
Current State Analysis
Existing Distribution Implementations
The system currently supports several distribution types:
| Distribution | Implementation | Usage |
|---|---|---|
| Log-Normal | AmountSampler | Transaction amounts |
| Benford’s Law | BenfordSampler | First-digit distribution |
| Uniform | Standard | ID generation, selection |
| Weighted | LineItemSampler | Line item counts |
| Poisson | TemporalSampler | Event counts |
| Normal/Gaussian | Standard | Some variations |
Current Strengths
- Benford’s Law compliance: First-digit distribution follows the expected 30.1%, 17.6%, 12.5%, … pattern
- Log-normal amounts: Realistic transaction size distributions
- Temporal weighting: Period-end spikes, day-of-week patterns
- Industry seasonality: 10 industry profiles with event-based multipliers
Current Gaps
- Single-mode distributions: No mixture models for multi-modal data
- Limited correlation: Cross-field dependencies not modeled
- Static parameters: No regime changes or parameter drift
- Missing distributions: Pareto, Weibull, Beta not available
- No copulas: Joint distributions cannot capture realistic dependence structure (for example, tail dependence)
Improvement Recommendations
1. Multi-Modal Distribution Support
1.1 Gaussian Mixture Models
Real-world transaction amounts often exhibit multiple modes:
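use rand::Rng;
use rand_distr::{Distribution, Normal}; // assumed dependency for the Gaussian draw

/// Gaussian Mixture Model for multi-modal distributions
pub struct GaussianMixture {
    components: Vec<GaussianComponent>,
}

pub struct GaussianComponent {
    weight: f64,  // Component weight (weights sum to 1.0)
    mean: f64,    // Component mean
    std_dev: f64, // Component standard deviation
}

impl GaussianComponent {
    /// Sample from this component's Gaussian
    fn sample(&self, rng: &mut impl Rng) -> f64 {
        let normal = Normal::new(self.mean, self.std_dev).expect("invalid std_dev");
        normal.sample(rng)
    }
}

impl GaussianMixture {
    /// Walk the cumulative weights until the uniform draw is covered
    fn select_component(&self, rng: &mut impl Rng) -> &GaussianComponent {
        let mut u: f64 = rng.gen_range(0.0..1.0);
        for component in &self.components {
            if u < component.weight { return component; }
            u -= component.weight;
        }
        // Rounding can leave a sliver of probability; fall back to the last component
        self.components.last().expect("mixture has no components")
    }

    /// Sample from the mixture distribution
    pub fn sample(&self, rng: &mut impl Rng) -> f64 {
        // Select component based on weights, then sample from the selected Gaussian
        self.select_component(rng).sample(rng)
    }
}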
Configuration:
amount_distribution:
  type: gaussian_mixture
  components:
    - weight: 0.60
      mean: 500
      std_dev: 200
      label: "small_transactions"
    - weight: 0.30
      mean: 5000
      std_dev: 1500
      label: "medium_transactions"
    - weight: 0.10
      mean: 50000
      std_dev: 15000
      label: "large_transactions"
1.2 Log-Normal Mixture
For strictly positive amounts with multiple modes:
amount_distribution:
  type: lognormal_mixture
  components:
    - weight: 0.70
      mu: 5.5 # log-scale mean
      sigma: 1.2 # log-scale std dev
      label: "routine_expenses"
    - weight: 0.25
      mu: 8.5
      sigma: 0.8
      label: "capital_expenses"
    - weight: 0.05
      mu: 11.0
      sigma: 0.5
      label: "major_projects"
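A minimal sketch of how such a mixture could be sampled: pick a component by cumulative weight, draw a standard normal, and exponentiate. The names `SplitMix64`, `LogNormalComponent`, and `sample_mixture` are illustrative assumptions, not existing APIs, and the tiny generator only stands in for whatever `Rng` the real sampler uses.

```rust
/// Tiny SplitMix64 generator so the sketch needs no external crates
/// (a stand-in for the project's real RNG).
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform draw in the open interval (0, 1).
    pub fn next_open01(&mut self) -> f64 {
        ((self.next_u64() >> 12) as f64 + 0.5) / (1u64 << 52) as f64
    }

    /// Standard normal draw via the Box-Muller transform.
    pub fn next_normal(&mut self) -> f64 {
        let (u1, u2) = (self.next_open01(), self.next_open01());
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

pub struct LogNormalComponent {
    pub weight: f64, // mixture weight; weights are assumed to sum to 1.0
    pub mu: f64,     // log-scale mean
    pub sigma: f64,  // log-scale standard deviation
}

/// Draw one amount: choose a component by cumulative weight, then exp(mu + sigma * z).
pub fn sample_mixture(components: &[LogNormalComponent], rng: &mut SplitMix64) -> f64 {
    let mut u = rng.next_open01();
    // Fall back to the last component if rounding leaves u uncovered.
    let mut chosen = components.last().expect("mixture needs at least one component");
    for c in components {
        if u < c.weight {
            chosen = c;
            break;
        }
        u -= c.weight;
    }
    (chosen.mu + chosen.sigma * rng.next_normal()).exp()
}
```

Because every draw is an exponential of a real number, the output is strictly positive, which is the property that motivates this variant over the Gaussian mixture.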
1.3 Realistic Transaction Amount Profiles
By Transaction Type:
| Type | Distribution | Parameters | Notes |
|---|---|---|---|
| Petty Cash | Log-normal | μ=3.5, σ=0.8 | $10-$500 range |
| AP Invoices | Mixture(3) | See below | Multi-modal |
| Payroll | Normal | μ=4500, σ=1200 | Per employee |
| Utilities | Log-normal | μ=7.0, σ=0.4 | Monthly, stable |
| Capital | Pareto | α=1.5, xₘ=10000 | Heavy tail |
AP Invoice Mixture:
ap_invoices:
  type: lognormal_mixture
  components:
    # Operating expenses
    - weight: 0.50
      mu: 6.0 # ~$400 median
      sigma: 1.5
    # Inventory/materials
    - weight: 0.35
      mu: 8.0 # ~$3000 median
      sigma: 1.0
    # Capital/projects
    - weight: 0.15
      mu: 10.5 # ~$36000 median
      sigma: 0.8
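The median figures in the comments follow directly from the log-scale mean. As a quick reference, for a log-normal variable $X$ with parameters $\mu$ and $\sigma$ (worked here for the operating-expenses component; the mean is well above the median because of the right tail):

```latex
\begin{aligned}
\operatorname{median}(X) &= e^{\mu}, &
\mathbb{E}[X] &= e^{\mu + \sigma^{2}/2} \\
\mu = 6.0,\ \sigma = 1.5:\quad
\operatorname{median}(X) &= e^{6.0} \approx 403, &
\mathbb{E}[X] &= e^{7.125} \approx 1243
\end{aligned}
```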
2. Cross-Field Correlation Modeling
2.1 Correlation Matrix Support
Define correlations between numeric fields:
correlations:
  enabled: true
  fields:
    - name: transaction_amount
    - name: line_item_count
    - name: approval_level
    - name: processing_time_hours
    - name: discount_percentage
  matrix:
    # Correlation coefficients (Pearson's r)
    # Higher amounts → more line items
    - [1.00, 0.65, 0.72, 0.45, -0.20]
    # More items → higher amount
    - [0.65, 1.00, 0.55, 0.60, -0.15]
    # Higher amount → higher approval
    - [0.72, 0.55, 1.00, 0.50, -0.30]
    # More complex → longer processing
    - [0.45, 0.60, 0.50, 1.00, -0.10]
    # Higher amount → lower discount %
    - [-0.20, -0.15, -0.30, -0.10, 1.00]
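One common way to honor such a matrix is to factor it with a Cholesky decomposition and multiply independent standard normals by the factor. The sketch below assumes that approach; `cholesky` and `correlate` are hypothetical helper names, and the resulting correlations hold exactly only in normal space (they shift somewhat once non-normal marginals are applied).

```rust
/// Lower-triangular Cholesky factor L with L * L^T = a.
/// Returns None if `a` is not positive definite.
pub fn cholesky(a: &[Vec<f64>]) -> Option<Vec<Vec<f64>>> {
    let n = a.len();
    let mut l = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..=i {
            // Dot product of the already-computed parts of rows i and j.
            let dot: f64 = (0..j).map(|k| l[i][k] * l[j][k]).sum();
            if i == j {
                let d = a[i][i] - dot;
                if d <= 0.0 {
                    return None;
                }
                l[i][j] = d.sqrt();
            } else {
                l[i][j] = (a[i][j] - dot) / l[j][j];
            }
        }
    }
    Some(l)
}

/// Turn independent standard normals `z` into correlated normals x = L z.
pub fn correlate(l: &[Vec<f64>], z: &[f64]) -> Vec<f64> {
    l.iter()
        .map(|row| row.iter().zip(z).map(|(a, b)| a * b).sum::<f64>())
        .collect()
}
```

The factorization also doubles as a configuration check: a matrix that is not positive definite cannot be a valid correlation matrix, and `cholesky` returns `None` for it.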
2.2 Copula-Based Generation
For more sophisticated dependency modeling:
/// Copula types for dependency modeling
pub enum CopulaType {
    /// Gaussian copula - symmetric dependencies
    Gaussian { correlation: f64 },
    /// Clayton copula - lower tail dependence
    Clayton { theta: f64 },
    /// Gumbel copula - upper tail dependence
    Gumbel { theta: f64 },
    /// Frank copula - symmetric, no tail dependence
    Frank { theta: f64 },
    /// Student-t copula - both tail dependencies
    StudentT { correlation: f64, df: f64 },
}

pub struct CopulaGenerator {
    copula: CopulaType,
    marginals: Vec<Box<dyn Distribution>>,
}
Use Cases:
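- Amount & Days-to-Pay: Larger invoices may have longer payment terms (Clayton copula; note that Clayton concentrates dependence in the lower tail, i.e. small invoices paired with short payment times)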
- Revenue & COGS: Strong positive correlation (Gaussian copula)
- Fraud Amount & Detection Delay: Upper tail dependence (Gumbel copula)
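As an illustration of how a copula produces dependent uniforms, here is a minimal sketch of Clayton sampling by conditional inversion. `SplitMix64`, `clayton_pair`, and `kendall_tau` are hypothetical names introduced for this example; in a full generator each uniform would then be passed through the inverse CDF of its marginal (for example, amount and days-to-pay).

```rust
/// Tiny SplitMix64 generator so the sketch needs no external crates
/// (a stand-in for the project's real RNG).
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform draw in the open interval (0, 1).
    pub fn next_open01(&mut self) -> f64 {
        ((self.next_u64() >> 12) as f64 + 0.5) / (1u64 << 52) as f64
    }
}

/// One (u1, u2) pair from a Clayton copula with theta > 0.
/// u1 is uniform; u2 solves dC/du1 = w for a second uniform w.
pub fn clayton_pair(theta: f64, rng: &mut SplitMix64) -> (f64, f64) {
    let u1 = rng.next_open01();
    let w = rng.next_open01();
    let inner = u1.powf(-theta) * (w.powf(-theta / (1.0 + theta)) - 1.0) + 1.0;
    (u1, inner.powf(-1.0 / theta))
}

/// Kendall's tau by direct pair counting, O(n^2).
/// For a Clayton copula it should approach theta / (theta + 2).
pub fn kendall_tau(pairs: &[(f64, f64)]) -> f64 {
    let mut s = 0i64;
    for i in 0..pairs.len() {
        for j in (i + 1)..pairs.len() {
            let sign = (pairs[i].0 - pairs[j].0) * (pairs[i].1 - pairs[j].1);
            if sign > 0.0 {
                s += 1;
            } else if sign < 0.0 {
                s -= 1;
            }
        }
    }
    let n = pairs.len() as f64;
    s as f64 / (n * (n - 1.0) / 2.0)
}
```

Rank-based measures such as Kendall's tau are unchanged by the monotone marginal transforms, so they are a convenient way to validate the dependence independently of the marginals.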
2.3 Conditional Distributions
Generate values conditional on other fields:
conditional_distributions:
  # Discount percentage depends on order amount
  discount:
    type: conditional
    given: order_amount
    breakpoints:
      - threshold: 1000
        distribution: { type: constant, value: 0 }
      - threshold: 5000
        distribution: { type: uniform, min: 0, max: 0.05 }
      - threshold: 25000
        distribution: { type: uniform, min: 0.05, max: 0.10 }
      - threshold: 100000
        distribution: { type: uniform, min: 0.10, max: 0.15 }
      - threshold: infinity
        distribution: { type: normal, mean: 0.15, std: 0.03 }
  # Payment terms depend on vendor relationship
  payment_terms:
    type: conditional
    given: vendor_relationship_months
    breakpoints:
      - threshold: 6
        distribution: { type: choice, values: [0, 15], weights: [0.8, 0.2] }
      - threshold: 24
        distribution: { type: choice, values: [15, 30], weights: [0.6, 0.4] }
      - threshold: infinity
        distribution: { type: choice, values: [30, 45, 60], weights: [0.5, 0.35, 0.15] }
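The breakpoint lookup could work roughly as sketched below for the discount table. This assumes each threshold is an exclusive upper bound (a value uses the first entry whose threshold exceeds it), which the configuration does not state; `Dist`, `discount_breakpoints`, and `sample_conditional` are illustrative names only.

```rust
/// Tiny SplitMix64 generator so the sketch needs no external crates
/// (a stand-in for the project's real RNG).
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform draw in the open interval (0, 1).
    pub fn next_open01(&mut self) -> f64 {
        ((self.next_u64() >> 12) as f64 + 0.5) / (1u64 << 52) as f64
    }

    /// Standard normal draw via the Box-Muller transform.
    pub fn next_normal(&mut self) -> f64 {
        let (u1, u2) = (self.next_open01(), self.next_open01());
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

pub enum Dist {
    Constant(f64),
    Uniform { min: f64, max: f64 },
    Normal { mean: f64, std: f64 },
}

impl Dist {
    pub fn sample(&self, rng: &mut SplitMix64) -> f64 {
        match *self {
            Dist::Constant(v) => v,
            Dist::Uniform { min, max } => min + (max - min) * rng.next_open01(),
            Dist::Normal { mean, std } => mean + std * rng.next_normal(),
        }
    }
}

/// (upper threshold, distribution) pairs in ascending threshold order.
pub type Breakpoints = Vec<(f64, Dist)>;

/// The discount table from the configuration above.
pub fn discount_breakpoints() -> Breakpoints {
    vec![
        (1_000.0, Dist::Constant(0.0)),
        (5_000.0, Dist::Uniform { min: 0.0, max: 0.05 }),
        (25_000.0, Dist::Uniform { min: 0.05, max: 0.10 }),
        (100_000.0, Dist::Uniform { min: 0.10, max: 0.15 }),
        (f64::INFINITY, Dist::Normal { mean: 0.15, std: 0.03 }),
    ]
}

/// Sample from the first distribution whose threshold exceeds `given`.
pub fn sample_conditional(given: f64, table: &Breakpoints, rng: &mut SplitMix64) -> Option<f64> {
    table.iter().find(|(t, _)| given < *t).map(|(_, d)| d.sample(rng))
}
```

Under this reading, a $500 order always gets a zero discount, while a $50,000 order draws uniformly between 10% and 15%.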
3. Industry-Specific Amount Distributions
3.1 Retail
retail:
  transaction_amounts:
    pos_sales:
      type: lognormal_mixture
      components:
        - weight: 0.65
          mu: 3.0 # ~$20 median
          sigma: 1.0
          label: "small_basket"
        - weight: 0.30
          mu: 4.5 # ~$90 median
          sigma: 0.8
          label: "medium_basket"
        - weight: 0.05
          mu: 6.0 # ~$400 median
          sigma: 0.6
          label: "large_basket"
    inventory_orders:
      type: lognormal
      mu: 9.0 # ~$8000 median
      sigma: 1.5
  seasonal_multipliers:
    black_friday: 3.5
    christmas_week: 2.8
    back_to_school: 1.6
3.2 Manufacturing
manufacturing:
  transaction_amounts:
    raw_materials:
      type: lognormal_mixture
      components:
        - weight: 0.40
          mu: 8.0 # ~$3000 median
          sigma: 1.0
          label: "consumables"
        - weight: 0.45
          mu: 10.0 # ~$22000 median
          sigma: 0.8
          label: "production_materials"
        - weight: 0.15
          mu: 12.0 # ~$163000 median
          sigma: 0.6
          label: "bulk_orders"
    maintenance:
      type: pareto
      alpha: 2.0
      x_min: 500
      label: "repair_costs"
    capital_equipment:
      type: lognormal
      mu: 12.5 # ~$268000 median
      sigma: 1.0
3.3 Financial Services
financial_services:
  transaction_amounts:
    wire_transfers:
      type: lognormal_mixture
      components:
        - weight: 0.30
          mu: 8.0 # ~$3000
          sigma: 1.2
          label: "retail_wire"
        - weight: 0.40
          mu: 11.0 # ~$60000
          sigma: 1.0
          label: "commercial_wire"
        - weight: 0.20
          mu: 14.0 # ~$1.2M
          sigma: 0.8
          label: "institutional_wire"
        - weight: 0.10
          mu: 17.0 # ~$24M
          sigma: 1.0
          label: "large_value"
    ach_transactions:
      type: lognormal
      mu: 7.5 # ~$1800
      sigma: 2.0
    fee_income:
      type: weibull
      scale: 500
      shape: 1.5
4. Regime Change Modeling
4.1 Structural Breaks
Model sudden changes in distribution parameters:
regime_changes:
  enabled: true
  changes:
    - date: "2024-03-15"
      type: acquisition
      effects:
        - field: transaction_volume
          multiplier: 1.35
        - field: average_amount
          shift: 5000
        - field: vendor_count
          multiplier: 1.25
    - date: "2024-07-01"
      type: price_increase
      effects:
        - field: cogs_ratio
          shift: 0.03
        - field: avg_invoice_amount
          multiplier: 1.08
    - date: "2024-10-01"
      type: new_product_line
      effects:
        - field: revenue
          multiplier: 1.20
        - field: inventory_turns
          multiplier: 0.85
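A sketch of how these effects might be applied at sample time, assuming a change takes effect on its date (inclusive) and that effects compose in list order; neither rule is stated in the configuration. `Effect`, `RegimeChange`, and `adjusted` are hypothetical names, and the nested `effects` list is flattened to one entry per field for brevity.

```rust
pub enum Effect {
    Multiplier(f64),
    Shift(f64),
}

pub struct RegimeChange {
    pub date: &'static str,  // ISO-8601 date; such strings sort chronologically
    pub field: &'static str, // parameter the effect applies to
    pub effect: Effect,
}

/// Apply every change for `field` that took effect on or before `on`, in list order.
pub fn adjusted(base: f64, field: &str, on: &str, changes: &[RegimeChange]) -> f64 {
    changes
        .iter()
        .filter(|c| c.field == field && c.date <= on)
        .fold(base, |v, c| match c.effect {
            Effect::Multiplier(m) => v * m,
            Effect::Shift(s) => v + s,
        })
}
```

Comparing ISO-8601 date strings lexicographically avoids a date library in the sketch; a real implementation would more likely use typed dates.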
4.2 Gradual Parameter Drift
Model slow changes over time:
parameter_drift:
  enabled: true
  parameters:
    - field: transaction_amount
      type: linear
      annual_drift: 0.03 # 3% annual increase (inflation)
    - field: digital_payment_ratio
      type: logistic
      start_value: 0.40
      end_value: 0.85
      midpoint_months: 18
      steepness: 0.15
    - field: approval_threshold
      type: step
      steps:
        - month: 6
          value: 5000
        - month: 18
          value: 7500
        - month: 30
          value: 10000
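The three drift types could be evaluated as in the sketch below. The function names are hypothetical, the linear form assumes no compounding, the logistic form treats `start_value` and `end_value` as asymptotes, and the `initial` value before the first step is an assumption because the configuration does not give one.

```rust
/// Linear drift: multiplier grows by `annual_drift` per 12 months (no compounding).
pub fn linear_drift(annual_drift: f64, month: f64) -> f64 {
    1.0 + annual_drift * month / 12.0
}

/// Logistic (S-curve) transition between the asymptotes `start` and `end`.
pub fn logistic_drift(start: f64, end: f64, midpoint_months: f64, steepness: f64, month: f64) -> f64 {
    start + (end - start) / (1.0 + (-steepness * (month - midpoint_months)).exp())
}

/// Step drift: value of the latest step at or before `month`,
/// or `initial` before the first step. `steps` must be sorted by month.
pub fn step_drift(initial: f64, steps: &[(u32, f64)], month: u32) -> f64 {
    steps
        .iter()
        .take_while(|(m, _)| *m <= month)
        .last()
        .map_or(initial, |(_, v)| *v)
}
```

With the logistic parameters above, the ratio is exactly halfway (0.625) at month 18 and about 0.43 at month 0, so the curve only approaches `start_value` rather than starting on it.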
4.3 Economic Cycle Modeling
economic_cycles:
  enabled: true
  base_cycle:
    type: sinusoidal
    period_months: 48 # 4-year cycle
    amplitude: 0.15 # ±15% variation
  recession_events:
    - start: "2024-09-01"
      duration_months: 8
      severity: moderate # 10-20% decline
      effects:
        - revenue: -0.15
        - discretionary_spend: -0.35
        - capital_investment: -0.50
        - headcount: -0.10
      recovery:
        type: gradual
        months: 12
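One plausible reading of the sinusoidal base cycle is a multiplier applied to baseline activity. Assuming zero phase at month $t = 0$ (the configuration does not specify a phase), with amplitude $A$ and period $P$ in months:

```latex
\begin{aligned}
m(t) &= 1 + A \sin\!\left(\frac{2\pi t}{P}\right) \\
A = 0.15,\ P = 48:\quad m(12) &= 1.15 \ \text{(peak)}, \qquad m(36) = 0.85 \ \text{(trough)}
\end{aligned}
```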
5. Enhanced Benford’s Law Compliance
5.1 Second and Third Digit Distributions
Extend beyond the first digit to full Benford compliance:
use rand::Rng;
use rust_decimal::Decimal; // assumed source of the Decimal type

pub struct BenfordDistribution {
    digits: BenfordDigitConfig,
}

pub struct BenfordDigitConfig {
    first_digit: bool,  // Standard Benford
    second_digit: bool, // Second digit distribution
    first_two: bool,    // Joint first-two digits
    summation: bool,    // Summation test
}

impl BenfordDistribution {
    /// Generate amount following full Benford's Law
    pub fn sample_benford_compliant(&self, rng: &mut impl Rng) -> Decimal {
        // Use log-uniform distribution to ensure Benford compliance
        // across multiple digit positions
        todo!("draw 10^U over whole decades and convert to Decimal")
    }
}
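The log-uniform idea mentioned in the comment can be sketched with plain `f64` values: if the base-10 logarithm of an amount is uniform over a whole number of decades, its leading digits follow Benford's Law. `SplitMix64`, `sample_log_uniform`, `first_digit`, and `benford_p` are illustrative names, and the conversion to `Decimal` is left out.

```rust
/// Tiny SplitMix64 generator so the sketch needs no external crates
/// (a stand-in for the project's real RNG).
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform draw in the open interval (0, 1).
    pub fn next_open01(&mut self) -> f64 {
        ((self.next_u64() >> 12) as f64 + 0.5) / (1u64 << 52) as f64
    }
}

/// Log-uniform draw between 10^lo_exp and 10^hi_exp.
/// Whole-decade ranges (integer exponents) give Benford-distributed digits.
pub fn sample_log_uniform(lo_exp: f64, hi_exp: f64, rng: &mut SplitMix64) -> f64 {
    10f64.powf(lo_exp + (hi_exp - lo_exp) * rng.next_open01())
}

/// Leading significant digit (1..=9) of a positive, finite amount.
pub fn first_digit(mut x: f64) -> usize {
    while x >= 10.0 {
        x /= 10.0;
    }
    while x < 1.0 {
        x *= 10.0;
    }
    x as usize
}

/// Benford's expected probability for first digit `d`: log10(1 + 1/d).
pub fn benford_p(d: usize) -> f64 {
    (1.0 + 1.0 / d as f64).log10()
}
```

A range that does not span whole decades (say $10 to $5,000) would distort the digit frequencies, so the exponent bounds matter as much as the sampling method.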
5.2 Benford Deviation Injection
For anomaly scenarios, intentionally violate Benford’s Law:
benford_deviations:
  enabled: false # Enable for fraud scenarios
  deviation_types:
    # Round number preference (fraud indicator)
    round_number_bias:
      probability: 0.15
      targets: [1000, 5000, 10000, 25000]
      tolerance: 0.01
    # Threshold avoidance (approval bypass)
    threshold_clustering:
      thresholds: [5000, 10000, 25000]
      cluster_below: true
      distance: 50-200
    # Uniform distribution (fabricated data)
    uniform_injection:
      probability: 0.05
      range: [1000, 9999]
6. Statistical Validation Framework
6.1 Distribution Fitness Tests
pub struct DistributionValidator {
    tests: Vec<StatisticalTest>,
}

pub enum StatisticalTest {
    /// Kolmogorov-Smirnov test
    KolmogorovSmirnov { significance: f64 },
    /// Chi-squared goodness of fit
    ChiSquared { bins: usize, significance: f64 },
    /// Anderson-Darling test
    AndersonDarling { significance: f64 },
    /// Benford's Law chi-squared
    BenfordChiSquared { digits: u8, significance: f64 },
    /// Mean Absolute Deviation from Benford
    BenfordMAD { threshold: f64 },
}
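As an example of what one of these tests computes, the sketch below shows a first-digit Mean Absolute Deviation check of the kind `BenfordMAD` suggests. `first_digit_freqs` and `benford_mad` are hypothetical helpers, not part of the validator described above.

```rust
/// Observed first-digit frequencies; index 0 holds digit 1.
/// Non-positive and non-finite values are skipped.
pub fn first_digit_freqs(amounts: &[f64]) -> [f64; 9] {
    let mut counts = [0usize; 9];
    let mut n = 0usize;
    for &a in amounts {
        if a > 0.0 && a.is_finite() {
            // Scale into [1, 10) so the integer part is the leading digit.
            let mut x = a;
            while x >= 10.0 {
                x /= 10.0;
            }
            while x < 1.0 {
                x *= 10.0;
            }
            counts[x as usize - 1] += 1;
            n += 1;
        }
    }
    let mut freqs = [0.0; 9];
    for d in 0..9 {
        freqs[d] = counts[d] as f64 / n.max(1) as f64;
    }
    freqs
}

/// Mean absolute deviation between observed and Benford first-digit frequencies.
pub fn benford_mad(freqs: &[f64; 9]) -> f64 {
    (1..=9)
        .map(|d| (freqs[d - 1] - (1.0 + 1.0 / d as f64).log10()).abs())
        .sum::<f64>()
        / 9.0
}
```

For scale, perfectly uniform first digits give a MAD of roughly 0.06, far above a threshold such as 0.015.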
6.2 Validation Configuration
validation:
  statistical_tests:
    enabled: true
    tests:
      - type: benford_first_digit
        threshold_mad: 0.015
        warning_mad: 0.010
      - type: distribution_fit
        target: lognormal
        ks_significance: 0.05
      - type: correlation_check
        expected_correlations:
          - fields: [amount, line_items]
            expected_r: 0.65
            tolerance: 0.10
  reporting:
    generate_plots: true
    output_format: html
    include_raw_data: false
7. New Distribution Types
7.1 Pareto Distribution
For heavy-tailed phenomena (80/20 rule):
# Top 20% of customers generate 80% of revenue
customer_revenue:
  type: pareto
  alpha: 1.16 # Shape parameter for 80/20
  x_min: 1000 # Minimum value
  truncate_max: 10000000 # Optional cap
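The shape value can be checked against the 80/20 claim. For an untruncated Pareto distribution with $\alpha > 1$, inverse-CDF sampling and the share of the total held by the top fraction $p$ are given below; the optional cap will pull the realized share somewhat below this figure.

```latex
\begin{aligned}
F^{-1}(u) &= x_m \,(1 - u)^{-1/\alpha}, \qquad 0 \le u < 1 \\
\text{share of top fraction } p &= p^{\,1 - 1/\alpha} \\
p = 0.2,\ \alpha = 1.16:\quad 0.2^{\,1 - 1/1.16} &= 0.2^{0.138} \approx 0.80
\end{aligned}
```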
7.2 Weibull Distribution
For time-to-event data:
# Days until payment
days_to_payment:
  type: weibull
  shape: 2.0 # k > 1: increasing hazard (more likely to pay over time)
  scale: 30.0 # λ: characteristic life
  shift: 0 # Minimum days
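Weibull draws are straightforward to generate by inverting the CDF, and the hazard function shows why `shape > 1` means payment becomes more likely as time passes. A small sketch, with `weibull_quantile` and `weibull_hazard` as hypothetical helper names:

```rust
/// Weibull inverse CDF: the value below which a fraction `u` of draws fall.
/// Feeding it uniform draws in [0, 1) yields Weibull samples.
pub fn weibull_quantile(u: f64, shape: f64, scale: f64, shift: f64) -> f64 {
    shift + scale * (-(1.0 - u).ln()).powf(1.0 / shape)
}

/// Hazard rate h(t) = (k / lambda) * (t / lambda)^(k - 1), with t measured from the shift.
pub fn weibull_hazard(t: f64, shape: f64, scale: f64) -> f64 {
    (shape / scale) * (t / scale).powf(shape - 1.0)
}
```

With the parameters above, the median time to payment is about 25 days, and roughly 63% of invoices are paid within the 30-day scale value, which is what "characteristic life" refers to.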
7.3 Beta Distribution
For proportions and percentages:
# Discount percentage
discount_rate:
  type: beta
  alpha: 2.0 # Shape parameter 1
  beta: 8.0 # Shape parameter 2
  # Right-skewed: on [0, 1] the mode is 12.5% and the mean is 20%;
  # after scaling to [0, 0.25] the mode is ~3.1% and the mean is 5%
  scale:
    min: 0.0
    max: 0.25 # Max 25% discount
7.4 Zero-Inflated Distributions
For data with excess zeros:
# Credits/returns (many transactions have zero)
credit_amount:
  type: zero_inflated
  zero_probability: 0.85
  positive_distribution:
    type: lognormal
    mu: 5.0
    sigma: 1.5
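Sampling a zero-inflated distribution is a two-stage draw: first decide whether the value is zero, then sample the positive part. A minimal sketch for the log-normal case, where `SplitMix64` and `sample_zero_inflated` are illustrative names and the generator is only a stand-in for the real RNG:

```rust
/// Tiny SplitMix64 generator so the sketch needs no external crates
/// (a stand-in for the project's real RNG).
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform draw in the open interval (0, 1).
    pub fn next_open01(&mut self) -> f64 {
        ((self.next_u64() >> 12) as f64 + 0.5) / (1u64 << 52) as f64
    }

    /// Standard normal draw via the Box-Muller transform.
    pub fn next_normal(&mut self) -> f64 {
        let (u1, u2) = (self.next_open01(), self.next_open01());
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// Zero with probability `zero_probability`, otherwise a log-normal draw exp(mu + sigma * z).
pub fn sample_zero_inflated(zero_probability: f64, mu: f64, sigma: f64, rng: &mut SplitMix64) -> f64 {
    if rng.next_open01() < zero_probability {
        0.0
    } else {
        (mu + sigma * rng.next_normal()).exp()
    }
}
```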
8. Implementation Priority
| Enhancement | Complexity | Impact | Priority |
|---|---|---|---|
| Mixture models | Medium | High | P1 |
| Correlation matrices | High | Critical | P1 |
| Industry-specific profiles | Medium | High | P1 |
| Regime changes | Medium | High | P2 |
| Copula support | High | Medium | P2 |
| Additional distributions | Low | Medium | P2 |
| Validation framework | Medium | High | P1 |
| Conditional distributions | Medium | Medium | P3 |
9. Configuration Example
# Complete statistical distribution configuration
distributions:
  # Global amount settings
  amounts:
    default:
      type: lognormal_mixture
      components:
        - { weight: 0.6, mu: 6.0, sigma: 1.5 }
        - { weight: 0.3, mu: 8.5, sigma: 1.0 }
        - { weight: 0.1, mu: 11.0, sigma: 0.8 }
    by_transaction_type:
      payroll:
        type: normal
        mean: 4500
        std_dev: 1500
        truncate_min: 1000
      utilities:
        type: lognormal
        mu: 7.0
        sigma: 0.5
  # Correlation settings
  correlations:
    enabled: true
    model: gaussian_copula
    pairs:
      - fields: [amount, processing_days]
        correlation: 0.45
      - fields: [amount, approval_level]
        correlation: 0.72
  # Drift settings
  drift:
    enabled: true
    inflation_rate: 0.03
    regime_changes:
      - date: "2024-06-01"
        field: avg_transaction
        multiplier: 1.15
  # Validation
  validation:
    benford_compliance: true
    distribution_tests: true
    correlation_verification: true
Technical Implementation Notes
Performance Considerations
- Pre-computation: Calculate CDF tables for frequently-used distributions
- Vectorization: Use SIMD for batch sampling where possible
- Caching: Cache correlation matrix decompositions (Cholesky)
- Lazy evaluation: Defer complex distribution calculations until needed
Memory Efficiency
- Streaming: Generate correlated samples in batches
- Reference tables: Use compact lookup tables for standard distributions
- On-demand: Compute regime-adjusted parameters at sample time
See also: 03-temporal-patterns.md for time-based distributions