Causal & Counterfactual Generation
New in v0.5.0
DataSynth supports Structural Causal Models (SCMs) for generating data with explicit causal structure, running interventional “what-if” scenarios, and producing counterfactual records.
Overview
Traditional synthetic data generators capture correlations but not causation. Causal generation lets you:
- Define causal relationships between variables (e.g., “transaction amount causes approval level”)
- Generate observational data that follows the causal structure
- Run interventions to answer “what if?” questions (do-calculus)
- Produce counterfactuals — “what would have happened if X were different?”
This is particularly valuable for fraud detection, audit analytics, and regulatory “what-if” scenario testing.
Causal Graph
A causal graph defines variables and the directed edges (causal mechanisms) between them.
Variables
use synth_core::causal::{CausalVariable, CausalVarType};
let var = CausalVariable::new("transaction_amount", CausalVarType::Continuous)
    .with_distribution("lognormal")
    .with_param("mu", 8.0)
    .with_param("sigma", 1.5);
| Variable Type | Description | Example |
|---|---|---|
| Continuous | Real-valued | Transaction amount, revenue |
| Categorical | Discrete categories | Industry, department |
| Count | Non-negative integers | Line items, approvals |
| Binary | Boolean (0/1) | Fraud flag, approval status |
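To get a feel for what the lognormal parameters in the variable example imply, the sketch below computes the median and mean of a lognormal distribution. It assumes mu and sigma are the mean and standard deviation of the underlying normal distribution (the usual parameterization; the source does not state this explicitly). The function names are hypothetical and not part of synth_core.

```rust
/// Median of a lognormal distribution: exp(mu).
/// Assumes `mu` is the mean of the underlying normal distribution.
fn lognormal_median(mu: f64) -> f64 {
    mu.exp()
}

/// Mean of a lognormal distribution: exp(mu + sigma^2 / 2).
/// Assumes `sigma` is the standard deviation of the underlying normal.
fn lognormal_mean(mu: f64, sigma: f64) -> f64 {
    (mu + sigma * sigma / 2.0).exp()
}
```

Under that assumption, mu = 8.0 and sigma = 1.5 give a median transaction amount of roughly 2,981 and a mean of roughly 9,182, so the distribution is strongly right-skewed.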
Causal Mechanisms
Edges between variables define how a parent causally affects a child:
use synth_core::causal::{CausalEdge, CausalMechanism};
let edge = CausalEdge {
    from: "transaction_amount".into(),
    to: "approval_level".into(),
    mechanism: CausalMechanism::Logistic { scale: 0.001, midpoint: 50000.0 },
    strength: 1.0,
};
| Mechanism | Formula | Use Case |
|---|---|---|
| Linear { coefficient } | y += coefficient × parent | Proportional effects |
| Threshold { cutoff } | y = 1 if parent > cutoff, else 0 | Binary triggers |
| Polynomial { coefficients } | y += Σ coefficients[i] × parent^i | Non-linear effects |
| Logistic { scale, midpoint } | y += 1 / (1 + e^(-scale × (parent - midpoint))) | S-curve effects |
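The formulas in the table can be written out as a small standalone sketch. The Mechanism enum and its apply method below are hypothetical stand-ins that mirror the table; they are not the synth_core types. The sketch also leaves out the edge's strength field, because the source does not say how strength enters the computation.

```rust
/// Hypothetical stand-in for the mechanisms in the table above;
/// this is not the synth_core `CausalMechanism` type.
#[derive(Debug, Clone)]
enum Mechanism {
    Linear { coefficient: f64 },
    Threshold { cutoff: f64 },
    Polynomial { coefficients: Vec<f64> },
    Logistic { scale: f64, midpoint: f64 },
}

impl Mechanism {
    /// Value the mechanism produces for a given parent value.
    fn apply(&self, parent: f64) -> f64 {
        match self {
            // coefficient × parent
            Mechanism::Linear { coefficient } => *coefficient * parent,
            // 1 if parent > cutoff, else 0
            Mechanism::Threshold { cutoff } => {
                if parent > *cutoff { 1.0 } else { 0.0 }
            }
            // Σ coefficients[i] × parent^i, where index 0 is the constant term
            Mechanism::Polynomial { coefficients } => coefficients
                .iter()
                .enumerate()
                .map(|(i, c)| *c * parent.powi(i as i32))
                .sum::<f64>(),
            // 1 / (1 + e^(-scale × (parent - midpoint)))
            Mechanism::Logistic { scale, midpoint } => {
                1.0 / (1.0 + (-*scale * (parent - *midpoint)).exp())
            }
        }
    }
}
```

With the Logistic parameters from the edge example (scale 0.001, midpoint 50000.0), a parent value exactly at the midpoint yields 0.5, and values far above or below it saturate toward 1 and 0.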
Building a Graph
use synth_core::causal::{CausalGraph, CausalVariable, CausalVarType, CausalEdge, CausalMechanism};
let mut graph = CausalGraph::new();
// Add variables
graph.add_variable(
    CausalVariable::new("transaction_amount", CausalVarType::Continuous)
        .with_distribution("lognormal")
        .with_param("mu", 8.0)
        .with_param("sigma", 1.5)
);
graph.add_variable(
    CausalVariable::new("approval_level", CausalVarType::Count)
        .with_distribution("normal")
        .with_param("mean", 1.0)
        .with_param("std", 0.5)
);
graph.add_variable(
    CausalVariable::new("fraud_flag", CausalVarType::Binary)
);
// Add causal edges
graph.add_edge(CausalEdge {
    from: "transaction_amount".into(),
    to: "approval_level".into(),
    mechanism: CausalMechanism::Linear { coefficient: 0.00005 },
    strength: 1.0,
});
graph.add_edge(CausalEdge {
    from: "transaction_amount".into(),
    to: "fraud_flag".into(),
    mechanism: CausalMechanism::Logistic { scale: 0.0001, midpoint: 50000.0 },
    strength: 0.8,
});
// Validate (checks for cycles, missing variables)
graph.validate()?;
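The two checks named in the validate comment (cycles and missing variables) can be sketched with Kahn's algorithm. This is an illustration under stated assumptions, not the library's implementation: validate_graph is a hypothetical helper, and variables and edges are passed as plain string slices rather than synth_core types.

```rust
use std::collections::{HashMap, VecDeque};

/// Checks that every edge endpoint is a declared variable and that the
/// graph has no cycle. On success, returns the variables in a topological
/// order (parents before children).
fn validate_graph(variables: &[&str], edges: &[(&str, &str)]) -> Result<Vec<String>, String> {
    // Map each declared variable name to its position.
    let index: HashMap<String, usize> = variables
        .iter()
        .enumerate()
        .map(|(i, name)| (name.to_string(), i))
        .collect();

    let mut in_degree = vec![0usize; variables.len()];
    let mut children: Vec<Vec<usize>> = vec![Vec::new(); variables.len()];

    for (from, to) in edges {
        // Missing-variable check: both endpoints must be declared.
        let f = *index
            .get(*from)
            .ok_or_else(|| format!("unknown variable in edge: {from}"))?;
        let t = *index
            .get(*to)
            .ok_or_else(|| format!("unknown variable in edge: {to}"))?;
        children[f].push(t);
        in_degree[t] += 1;
    }

    // Kahn's algorithm: repeatedly remove nodes with no remaining parents.
    let mut queue: VecDeque<usize> = (0..variables.len())
        .filter(|&i| in_degree[i] == 0)
        .collect();
    let mut order = Vec::with_capacity(variables.len());
    while let Some(node) = queue.pop_front() {
        order.push(variables[node].to_string());
        for &child in &children[node] {
            in_degree[child] -= 1;
            if in_degree[child] == 0 {
                queue.push_back(child);
            }
        }
    }

    // Any node left unprocessed sits on a cycle.
    if order.len() == variables.len() {
        Ok(order)
    } else {
        Err("causal graph contains a cycle".to_string())
    }
}
```

For the three-variable graph built above, this sketch would return an order that starts with transaction_amount, while an edge pair such as a → b and b → a would be rejected as a cycle.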
Built-in Templates
DataSynth includes pre-configured causal graphs for common financial scenarios:
Fraud Detection Template
use synth_core::causal::CausalGraph;
let graph = CausalGraph::fraud_detection_template();
Variables: transaction_amount, approval_level, vendor_risk, fraud_flag
Causal structure:
- transaction_amount → approval_level (linear)
- transaction_amount → fraud_flag (logistic)
- vendor_risk → fraud_flag (linear)
Revenue Cycle Template
use synth_core::causal::CausalGraph;
let graph = CausalGraph::revenue_cycle_template();
Variables: order_size, credit_score, payment_delay, revenue
Causal structure:
- order_size → revenue (linear)
- credit_score → payment_delay (linear, negative)
- order_size → payment_delay (linear)
Structural Causal Model (SCM)
The SCM wraps a causal graph and provides generation capabilities:
use synth_core::causal::StructuralCausalModel;
let scm = StructuralCausalModel::new(graph)?;
// Generate observational data
let samples = scm.generate(10000, 42)?;
// samples: Vec<HashMap<String, f64>>
for sample in &samples[..3] {
    println!("Amount: {:.2}, Approval: {:.0}, Fraud: {:.0}",
        sample["transaction_amount"],
        sample["approval_level"],
        sample["fraud_flag"],
    );
}
Data is generated in topological order — root variables are sampled from their distributions first, then child variables are computed based on their parents’ values and the causal mechanisms.
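The ordering described above can be illustrated with a toy additive model in which every variable is its own noise term plus a weighted sum of its parents. The Node struct and evaluate function are hypothetical; the noise values are assumed to have been sampled already from each variable's distribution, and only linear mechanisms are modeled.

```rust
/// One variable in a toy additive SCM. `parents` holds
/// (parent index, coefficient) pairs. Every parent index must be smaller
/// than the node's own index, i.e. the nodes are listed in topological order.
struct Node {
    parents: Vec<(usize, f64)>,
}

/// Evaluates the structural equations in order:
/// value[i] = noise[i] + Σ coefficient × value[parent].
/// Root nodes have no parents, so their value is just their noise term.
fn evaluate(nodes: &[Node], noise: &[f64]) -> Vec<f64> {
    let mut values: Vec<f64> = Vec::with_capacity(nodes.len());
    for (i, node) in nodes.iter().enumerate() {
        // Parents were evaluated in earlier iterations, so their values exist.
        let from_parents: f64 = node
            .parents
            .iter()
            .map(|&(parent, coefficient)| coefficient * values[parent])
            .sum();
        values.push(noise[i] + from_parents);
    }
    values
}
```

Because children only ever read values that were computed in earlier iterations, a single pass over the ordered nodes is enough to produce one record.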
Interventions (Do-Calculus)
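Interventions answer “what would happen if we forced variable X to value V?” Forcing X cuts all incoming causal edges to X, so X no longer depends on its parents, while its descendants still respond to the new value.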
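A minimal sketch of this edge-cutting behavior, using a hypothetical three-variable chain (risk → amount → score) with made-up coefficients. The function evaluate_with_do is an illustration only and is not part of synth_core.

```rust
use std::collections::HashMap;

/// Toy SCM with three variables in topological order:
///   risk   = u_risk
///   amount = u_amount + 2.0 × risk
///   score  = u_score  + 0.5 × amount
/// `do_values` forces the named variables to fixed values.
/// Returns [risk, amount, score].
fn evaluate_with_do(noise: [f64; 3], do_values: &HashMap<&str, f64>) -> [f64; 3] {
    // A forced value replaces the whole structural equation, so the
    // variable's parents and its own noise term no longer matter.
    let risk = do_values.get("risk").copied().unwrap_or(noise[0]);
    let amount = do_values
        .get("amount")
        .copied()
        .unwrap_or(noise[1] + 2.0 * risk);
    // Downstream variables still use whatever value `amount` ended up with.
    let score = do_values
        .get("score")
        .copied()
        .unwrap_or(noise[2] + 0.5 * amount);
    [risk, amount, score]
}
```

With noise [1.0, 10.0, 0.0], the observational record is [1.0, 12.0, 6.0]. Forcing amount to 100.0 yields [1.0, 100.0, 50.0]: the parent risk is untouched, and the descendant score follows the forced value.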
Single Intervention
let intervened = scm.intervene("transaction_amount", 50000.0)?;
let samples = intervened.generate(5000, 42)?;
Multiple Interventions
let intervened = scm
    .intervene("transaction_amount", 50000.0)?
    .and_intervene("vendor_risk", 0.9);
let samples = intervened.generate(5000, 42)?;
Intervention Engine with Effect Estimation
use synth_core::causal::InterventionEngine;
let engine = InterventionEngine::new(scm);
let result = engine.do_intervention(
    &[("transaction_amount".into(), 50000.0)],
    5000, // samples
    42,   // seed
)?;
// Compare baseline vs intervention
println!("Baseline fraud rate: {:.4}",
    result.baseline_samples.iter()
        .map(|s| s["fraud_flag"])
        .sum::<f64>() / result.baseline_samples.len() as f64
);
println!("Intervened fraud rate: {:.4}",
    result.intervened_samples.iter()
        .map(|s| s["fraud_flag"])
        .sum::<f64>() / result.intervened_samples.len() as f64
);
// Effect estimates with confidence intervals
for (var, effect) in &result.effect_estimates {
    println!("{}: ATE={:.4}, 95% CI=({:.4}, {:.4})",
        var,
        effect.average_treatment_effect,
        effect.confidence_interval.0,
        effect.confidence_interval.1,
    );
}
The InterventionResult contains:
| Field | Description |
|---|---|
| baseline_samples | Data generated without intervention |
| intervened_samples | Data generated with the intervention applied |
| effect_estimates | Per-variable average treatment effects with confidence intervals |
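The source does not say how the effect estimates are computed. One common approach, sketched here as an assumption, is a difference in sample means between the intervened and baseline values of a variable, with a normal-approximation 95% confidence interval (z = 1.96) built from the two sample variances. The helper names are hypothetical.

```rust
/// Sample mean and unbiased sample variance. Requires at least two values.
fn mean_and_variance(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Difference-in-means estimate of the average treatment effect (ATE)
/// with a normal-approximation 95% confidence interval.
/// Returns (ate, (lower, upper)).
fn estimate_ate(baseline: &[f64], intervened: &[f64]) -> (f64, (f64, f64)) {
    let (mean_b, var_b) = mean_and_variance(baseline);
    let (mean_i, var_i) = mean_and_variance(intervened);
    let ate = mean_i - mean_b;
    // Standard error of a difference of two independent sample means.
    let se = (var_b / baseline.len() as f64 + var_i / intervened.len() as f64).sqrt();
    let half_width = 1.96 * se;
    (ate, (ate - half_width, ate + half_width))
}
```

Applied to one variable at a time (for example, the fraud_flag column of the baseline and intervened samples), this yields one ATE and interval per variable, which matches the shape of the effect_estimates field but not necessarily its exact estimator.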
Counterfactual Generation
Counterfactuals answer “what would have happened to this specific record if X were different?” using the abduction-action-prediction framework:
- Abduction: Infer the latent noise variables from the factual observation
- Action: Apply the intervention (change X to new value)
- Prediction: Propagate through the SCM with inferred noise
use synth_core::causal::CounterfactualGenerator;
use std::collections::HashMap;
let cf_gen = CounterfactualGenerator::new(scm);
// Factual record
let factual: HashMap<String, f64> = [
    ("transaction_amount".to_string(), 5000.0),
    ("approval_level".to_string(), 1.0),
    ("fraud_flag".to_string(), 0.0),
].into_iter().collect();
// What if the amount had been 100,000?
let counterfactual = cf_gen.generate_counterfactual(
    &factual,
    "transaction_amount",
    100000.0,
    42,
)?;
println!("Factual fraud_flag: {}", factual["fraud_flag"]);
println!("Counterfactual fraud_flag: {}", counterfactual["fraud_flag"]);
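For a single linear edge with additive noise, y = coefficient × x + u, the abduction-action-prediction steps reduce to a few lines. The sketch below is a worked illustration under that assumption; counterfactual_y is a hypothetical helper, and mechanisms that cannot be inverted for the noise term (for example, thresholds) would need a different treatment that this sketch does not cover.

```rust
/// Counterfactual value of y under the model y = coefficient × x + u,
/// given one factual observation (x_factual, y_factual) and a new x.
fn counterfactual_y(coefficient: f64, x_factual: f64, y_factual: f64, x_new: f64) -> f64 {
    // Abduction: recover the noise term that explains the factual record.
    let u = y_factual - coefficient * x_factual;
    // Action: replace x with the counterfactual value `x_new`.
    // Prediction: re-evaluate the equation with the same noise term.
    coefficient * x_new + u
}
```

For example, with coefficient 0.5 and a factual record (x = 4.0, y = 3.0), the inferred noise is 1.0, so setting x to 10.0 gives a counterfactual y of 6.0. Leaving x unchanged reproduces the factual y, which is a quick sanity check on any counterfactual routine.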
Batch Counterfactuals
// `factual_records` is not defined in this snippet; it is assumed to be a
// collection of factual observations shaped like `factual` above.
let pairs = cf_gen.generate_batch_counterfactuals(
    &factual_records,
    "transaction_amount",
    100000.0,
    42,
)?;
for pair in &pairs {
    println!("Changed variables: {:?}", pair.changed_variables);
}
Each CounterfactualPair contains:
| Field | Description |
|---|---|
| factual | The original observation |
| counterfactual | The counterfactual version |
| changed_variables | List of variables that changed |
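How a list like changed_variables might be derived is not described in the source. A plausible sketch, assuming records are HashMap<String, f64> as in the examples above, compares the two records key by key with a small tolerance. The function name and the tolerance parameter are hypothetical.

```rust
use std::collections::HashMap;

/// Names of variables whose value differs between the factual and
/// counterfactual record by more than `tolerance`, sorted alphabetically.
/// A variable missing from the counterfactual also counts as changed.
fn changed_variables(
    factual: &HashMap<String, f64>,
    counterfactual: &HashMap<String, f64>,
    tolerance: f64,
) -> Vec<String> {
    let mut changed: Vec<String> = factual
        .iter()
        .filter(|(name, value)| match counterfactual.get(*name) {
            Some(other) => (*value - other).abs() > tolerance,
            None => true,
        })
        .map(|(name, _)| name.clone())
        .collect();
    // HashMap iteration order is unspecified, so sort for stable output.
    changed.sort();
    changed
}
```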
Causal Validation
Validate that generated data preserves the specified causal structure:
use synth_core::causal::CausalValidator;
let report = CausalValidator::validate_causal_structure(&samples, &graph);
println!("Valid: {}", report.valid);
for check in &report.checks {
    println!("{}: {} — {}", check.name, if check.passed { "PASS" } else { "FAIL" }, check.details);
}
if !report.violations.is_empty() {
    println!("Violations: {:?}", report.violations);
}
The validator checks:
- Causal edge directions are respected (parent-child correlations)
- Independence constraints hold (non-adjacent variables)
- Intervention effects are consistent with the graph structure
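The source does not specify which statistics the validator uses. As one possible building block for the parent-child and independence checks, the sketch below computes a Pearson correlation between two columns of generated samples. The pearson function is hypothetical, and a real validator would also need thresholds and significance tests that are not shown here.

```rust
/// Pearson correlation coefficient between two equally long columns.
/// Returns None if the lengths differ, fewer than two values are given,
/// or either column is constant (the correlation is then undefined).
fn pearson(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;

    let mut cov = 0.0_f64;
    let mut var_x = 0.0_f64;
    let mut var_y = 0.0_f64;
    for (x, y) in xs.iter().zip(ys) {
        cov += (x - mean_x) * (y - mean_y);
        var_x += (x - mean_x).powi(2);
        var_y += (y - mean_y).powi(2);
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    Some(cov / (var_x * var_y).sqrt())
}
```

A parent-child pair connected by a strong linear edge would be expected to show a correlation clearly away from zero, while a pair with no connecting path would be expected to stay near zero. Correlation alone cannot establish the direction of an edge, so it can only support, not prove, the specified structure.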
CLI Usage
Generate Observational Data
datasynth-data causal generate \
  --template fraud_detection \
  --samples 10000 \
  --seed 42 \
  --output ./causal_output
Run Interventions
datasynth-data causal intervene \
  --template fraud_detection \
  --variable transaction_amount \
  --value 50000 \
  --samples 5000 \
  --output ./intervention_output
Validate Causal Structure
datasynth-data causal validate \
  --data ./causal_output \
  --template fraud_detection
Configuration
causal:
  enabled: true
  template: "fraud_detection" # or "revenue_cycle" or path to custom YAML
  sample_size: 10000
  validate: true # validate causal structure in output
Custom Causal Graph YAML
# custom_graph.yaml
variables:
  - name: order_size
    type: continuous
    distribution: lognormal
    params:
      mu: 7.0
      sigma: 1.2
  - name: discount_rate
    type: continuous
    distribution: beta
    params:
      alpha: 2.0
      beta: 8.0
  - name: revenue
    type: continuous
edges:
  - from: order_size
    to: revenue
    mechanism:
      type: linear
      coefficient: 0.95
  - from: discount_rate
    to: revenue
    mechanism:
      type: linear
      coefficient: -5000.0