Diffusion Models
New in v0.5.0
DataSynth integrates a statistical diffusion model backend for learned distribution capture, offering an alternative and complement to rule-based generation.
Overview
Diffusion models generate data through a learned denoising process: starting from pure noise and iteratively removing it to produce realistic samples. DataSynth’s implementation uses a statistical backend that captures column-level distributions and inter-column correlations from fingerprint data, then generates new samples through a configurable noise schedule.
Forward Process (Training): x₀ → x₁ → x₂ → ... → xₜ (pure noise)
Reverse Process (Generation): xₜ → xₜ₋₁ → ... → x₁ → x₀ (data)
Architecture
DiffusionBackend Trait
All diffusion backends implement a common interface:
#![allow(unused)]
fn main() {
pub trait DiffusionBackend: Send + Sync {
fn name(&self) -> &str;
fn forward(&self, x: &[Vec<f64>], t: usize) -> Vec<Vec<f64>>;
fn reverse(&self, x_t: &[Vec<f64>], t: usize) -> Vec<Vec<f64>>;
fn generate(&self, n_samples: usize, n_features: usize, seed: u64) -> Vec<Vec<f64>>;
}
}
Statistical Diffusion Backend
The StatisticalDiffusionBackend uses per-column means and standard deviations (extracted from fingerprint data) to guide the denoising process:
#![allow(unused)]
fn main() {
use synth_core::diffusion::{StatisticalDiffusionBackend, DiffusionConfig, NoiseScheduleType};
let config = DiffusionConfig {
n_steps: 1000,
schedule: NoiseScheduleType::Cosine,
seed: 42,
};
let backend = StatisticalDiffusionBackend::new(
vec![5000.0, 3.5, 2.0], // column means
vec![2000.0, 1.5, 0.8], // column standard deviations
config,
);
// Optionally add correlation structure
let backend = backend.with_correlations(vec![
vec![1.0, 0.65, 0.72],
vec![0.65, 1.0, 0.55],
vec![0.72, 0.55, 1.0],
]);
let samples = backend.generate(1000, 3, 42);
}
Noise Schedules
The noise schedule controls how noise is added during the forward process and removed during the reverse process.
| Schedule | Formula | Characteristics |
|---|---|---|
| Linear | β_t = β_min + t/T × (β_max - β_min) | Uniform noise addition; simple and robust |
| Cosine | β_t = 1 - ᾱ_t/ᾱ_{t-1}, ᾱ_t = cos²(π/2 × t/T) | Slower noise addition; better for preserving fine details |
| Sigmoid | β_t = sigmoid(a + (b-a) × t/T) | Smooth transition; balanced between linear and cosine |
#![allow(unused)]
fn main() {
use synth_core::diffusion::{NoiseSchedule, NoiseScheduleType};
let schedule = NoiseSchedule::new(&NoiseScheduleType::Cosine, 1000);
// Access schedule components
println!("Steps: {}", schedule.n_steps());
println!("First beta: {}", schedule.betas[0]);
println!("Last alpha_bar: {}", schedule.alpha_bars[999]);
}
Schedule Properties
The NoiseSchedule precomputes all values needed for efficient forward/reverse steps:
| Property | Description |
|---|---|
betas | Noise variance at each step |
alphas | 1 - beta at each step |
alpha_bars | Cumulative product of alphas |
sqrt_alpha_bars | √(ᾱ_t) for forward process |
sqrt_one_minus_alpha_bars | √(1 - ᾱ_t) for noise scaling |
Hybrid Generation
The HybridGenerator blends rule-based and diffusion-generated data to combine the structural guarantees of rule-based generation with the distributional fidelity of diffusion models.
Blend Strategies
| Strategy | Description | Best For |
|---|---|---|
| Interpolate | Weighted average: w × diffusion + (1-w) × rule_based | Smooth blending of continuous values |
| Select | Per-record random selection from either source | Maintaining distinct record characteristics |
| Ensemble | Column-level: diffusion for amounts, rule-based for categoricals | Mixed-type data with different generation needs |
#![allow(unused)]
fn main() {
use synth_core::diffusion::{HybridGenerator, BlendStrategy};
let hybrid = HybridGenerator::new(0.3); // 30% diffusion weight
println!("Weight: {}", hybrid.weight());
// Interpolation blend
let blended = hybrid.blend(
&rule_based_data,
&diffusion_data,
BlendStrategy::Interpolate,
42,
);
// Ensemble blend (specify which columns use diffusion)
let ensemble = hybrid.blend_ensemble(
&rule_based_data,
&diffusion_data,
&[0, 2], // columns 0 and 2 from diffusion
);
}
Training Pipeline
The DiffusionTrainer fits a model from column-level parameters and correlation matrices (typically extracted from a fingerprint):
Training
#![allow(unused)]
fn main() {
use synth_core::diffusion::{DiffusionTrainer, ColumnDiffusionParams, ColumnType, DiffusionConfig};
let params = vec![
ColumnDiffusionParams {
name: "amount".into(),
mean: 5000.0,
std: 2000.0,
min: 0.0,
max: 100000.0,
col_type: ColumnType::Continuous,
},
ColumnDiffusionParams {
name: "line_items".into(),
mean: 3.5,
std: 1.5,
min: 1.0,
max: 20.0,
col_type: ColumnType::Integer,
},
];
let corr_matrix = vec![
vec![1.0, 0.65],
vec![0.65, 1.0],
];
let config = DiffusionConfig { n_steps: 1000, schedule: NoiseScheduleType::Cosine, seed: 42 };
let model = DiffusionTrainer::fit(params, corr_matrix, config);
}
Generation from Trained Model
#![allow(unused)]
fn main() {
let samples = model.generate(5000, 42);
// Save/load model
model.save(Path::new("./model.json"))?;
let loaded = TrainedDiffusionModel::load(Path::new("./model.json"))?;
}
Evaluation
#![allow(unused)]
fn main() {
let report = DiffusionTrainer::evaluate(&model, 5000, 42);
println!("Overall score: {:.3}", report.overall_score);
println!("Correlation error: {:.4}", report.correlation_error);
for (i, (mean_err, std_err)) in report.mean_errors.iter().zip(&report.std_errors).enumerate() {
println!("Column {}: mean_err={:.4}, std_err={:.4}", i, mean_err, std_err);
}
}
The FitReport contains:
| Metric | Description |
|---|---|
mean_errors | Per-column mean absolute error |
std_errors | Per-column standard deviation error |
correlation_error | RMSE of correlation matrix |
overall_score | Weighted composite score (0-1, higher is better) |
CLI Usage
Train a Model
datasynth-data diffusion train \
--fingerprint ./fingerprint.dsf \
--output ./model.json \
--n-steps 1000 \
--schedule cosine
Evaluate a Model
datasynth-data diffusion evaluate \
--model ./model.json \
--samples 5000
Configuration
diffusion:
enabled: true
n_steps: 1000 # Number of diffusion steps
schedule: "cosine" # Noise schedule: linear, cosine, sigmoid
sample_size: 1000 # Samples to generate
| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable diffusion generation |
n_steps | integer | 1000 | Forward/reverse diffusion steps |
schedule | string | "linear" | Noise schedule type |
sample_size | integer | 1000 | Number of samples |
Utility Functions
DataSynth provides helper functions for working with diffusion data:
#![allow(unused)]
fn main() {
use synth_core::diffusion::{
add_gaussian_noise, normalize_features, denormalize_features,
clip_values, generate_noise,
};
// Normalize data to zero mean, unit variance
let (normalized, means, stds) = normalize_features(&data);
// Add calibrated noise
let noisy = add_gaussian_noise(&normalized[0], 0.1, &mut rng);
// Denormalize back to original scale
let original_scale = denormalize_features(&generated, &means, &stds);
// Clip to valid ranges
clip_values(&mut samples, 0.0, 100000.0);
}