Data Quality Variations

Generate realistic data quality issues for testing robustness.

Overview

Real-world data has imperfections. The data quality module introduces:

  • Missing values (various patterns)
  • Format variations
  • Duplicates
  • Typos and transcription errors
  • Encoding issues

Configuration

data_quality:
  enabled: true

  missing_values:
    rate: 0.01
    pattern: mcar

  format_variations:
    date_formats: true
    amount_formats: true

  duplicates:
    rate: 0.001
    types: [exact, near, fuzzy]

  typos:
    rate: 0.005
    keyboard_aware: true

Missing Values

Patterns

| Pattern | Description |
|---|---|
| mcar | Missing Completely At Random |
| mar | Missing At Random (conditional) |
| mnar | Missing Not At Random (value-dependent) |
| systematic | Entire field groups missing |

data_quality:
  missing_values:
    rate: 0.01                       # 1% missing overall
    pattern: mcar

    # Pattern-specific settings
    mcar:
      uniform: true                  # Equal probability all fields

    mar:
      conditions:
        - field: vendor_name
          dependent_on: is_intercompany
          probability: 0.1

    mnar:
      conditions:
        - field: amount
          when_above: 100000         # Large amounts more likely missing
          probability: 0.05

    systematic:
      groups:
        - [address, city, country]   # All or none
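
The three random patterns differ only in what the missingness probability depends on. The following is a minimal sketch of that distinction, not the module's implementation: the `Record` struct and all function names are hypothetical, falling back to the base rate when a condition does not hold is an assumption, and the uniform draw `u` is passed in so the example stays deterministic and free of dependencies.

```rust
/// Hypothetical record holding only the fields used by the conditions above.
pub struct Record {
    pub amount: f64,
    pub is_intercompany: bool,
}

/// MCAR: the probability is the same for every record.
pub fn mcar_probability(rate: f64) -> f64 {
    rate
}

/// MAR: the probability depends on another, observed field.
/// Assumption: the base rate applies when the condition is not met.
pub fn mar_probability(rec: &Record, base_rate: f64, conditional: f64) -> f64 {
    if rec.is_intercompany { conditional } else { base_rate }
}

/// MNAR: the probability depends on the very value that may go missing.
pub fn mnar_probability(rec: &Record, base_rate: f64, when_above: f64, conditional: f64) -> f64 {
    if rec.amount > when_above { conditional } else { base_rate }
}

/// Blank the value when the uniform draw `u` in [0, 1) falls below `p`.
pub fn apply_missing(value: Option<String>, p: f64, u: f64) -> Option<String> {
    if u < p { None } else { value }
}
```

With the `mnar` settings shown above, a record with an amount of 250,000 would get probability 0.05, while one with an amount of 500 would keep the base rate.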

Field Targeting

data_quality:
  missing_values:
    fields:
      description: 0.02              # 2% missing
      cost_center: 0.05              # 5% missing
      tax_code: 0.03                 # 3% missing
    exclude:
      - document_id                  # Never make missing
      - posting_date
      - account_number
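
One plausible way to combine the global `rate`, the per-field overrides and the `exclude` list is sketched below. The function name and the precedence order (exclusion first, then override, then global rate) are assumptions made for illustration; the source does not state how the module resolves them.

```rust
use std::collections::{HashMap, HashSet};

/// Resolve the missing-value rate for one field.
/// Assumed precedence: `exclude` wins, then a per-field override, then the global rate.
pub fn effective_rate(
    field: &str,
    global_rate: f64,
    overrides: &HashMap<&str, f64>,
    exclude: &HashSet<&str>,
) -> f64 {
    if exclude.contains(field) {
        return 0.0; // excluded fields are never made missing
    }
    overrides.get(field).copied().unwrap_or(global_rate)
}
```

Under these assumptions `description` resolves to 0.02, `document_id` to 0.0, and any field not listed falls back to the global rate.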

Format Variations

Date Formats

data_quality:
  format_variations:
    date_formats: true
    date_variations:
      iso: 0.6                       # 2024-01-15
      us: 0.2                        # 01/15/2024
      eu: 0.1                        # 15.01.2024
      long: 0.1                      # January 15, 2024

Examples:

  • ISO: 2024-01-15
  • US: 01/15/2024, 1/15/2024
  • EU: 15.01.2024, 15/01/2024
  • Long: January 15, 2024
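
To make the four styles concrete, here is a small sketch that renders one date in each of them. `DateStyle` and `format_date` are hypothetical names; only the zero-padded US and dotted EU forms are covered, not the `1/15/2024` or `15/01/2024` alternatives.

```rust
const MONTHS: [&str; 12] = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
];

#[derive(Clone, Copy)]
pub enum DateStyle {
    Iso,
    Us,
    Eu,
    Long,
}

/// Render a calendar date in one of the four styles listed above.
/// No validation is performed; `month` is expected to be in 1..=12.
pub fn format_date(year: i32, month: u32, day: u32, style: DateStyle) -> String {
    match style {
        DateStyle::Iso => format!("{:04}-{:02}-{:02}", year, month, day),
        DateStyle::Us => format!("{:02}/{:02}/{:04}", month, day, year),
        DateStyle::Eu => format!("{:02}.{:02}.{:04}", day, month, year),
        DateStyle::Long => format!("{} {}, {}", MONTHS[(month - 1) as usize], day, year),
    }
}
```

A parser under test should map all four renderings of `(2024, 1, 15)` back to the same date.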

Amount Formats

data_quality:
  format_variations:
    amount_formats: true
    amount_variations:
      plain: 0.5                     # 1234.56
      us_comma: 0.3                  # 1,234.56
      eu_format: 0.1                 # 1.234,56
      currency_prefix: 0.05          # $1,234.56
      currency_suffix: 0.05          # 1.234,56 EUR
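
The amount styles differ in grouping separator, decimal mark and currency marker. The sketch below reproduces the five examples from the comments; the names are hypothetical, the amount is held in integer cents to avoid floating-point rounding, and the `$` and `EUR` markers are hard-coded purely for illustration.

```rust
/// Insert `sep` between groups of three digits, e.g. "1234567" -> "1,234,567".
fn group_thousands(digits: &str, sep: char) -> String {
    let len = digits.len();
    let mut out = String::with_capacity(len + len / 3);
    for (i, ch) in digits.chars().enumerate() {
        if i > 0 && (len - i) % 3 == 0 {
            out.push(sep);
        }
        out.push(ch);
    }
    out
}

pub enum AmountStyle {
    Plain,
    UsComma,
    EuFormat,
    CurrencyPrefix,
    CurrencySuffix,
}

/// Format a non-negative amount given in cents.
pub fn format_amount(cents: u64, style: AmountStyle) -> String {
    let whole = (cents / 100).to_string();
    let frac = cents % 100;
    match style {
        AmountStyle::Plain => format!("{}.{:02}", whole, frac),
        AmountStyle::UsComma => format!("{}.{:02}", group_thousands(&whole, ','), frac),
        AmountStyle::EuFormat => format!("{},{:02}", group_thousands(&whole, '.'), frac),
        AmountStyle::CurrencyPrefix => format!("${}.{:02}", group_thousands(&whole, ','), frac),
        AmountStyle::CurrencySuffix => format!("{},{:02} EUR", group_thousands(&whole, '.'), frac),
    }
}
```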

Identifier Formats

data_quality:
  format_variations:
    identifier_variations:
      case: 0.1                      # INV-001 vs inv-001
      padding: 0.1                   # 001 vs 1
      separator: 0.1                 # INV-001 vs INV_001 vs INV001
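
The three identifier variations can be sketched as simple string transforms. The function names are hypothetical, and the sketch assumes identifiers of the form `PREFIX-NUMBER` with `-` as the canonical separator.

```rust
/// Case variation: "INV-001" -> "inv-001".
pub fn vary_case(id: &str) -> String {
    id.to_lowercase()
}

/// Separator variation: replace '-' with another character,
/// or pass `None` to drop it ("INV-001" -> "INV001").
pub fn vary_separator(id: &str, replacement: Option<char>) -> String {
    id.chars()
        .filter_map(|c| if c == '-' { replacement } else { Some(c) })
        .collect()
}

/// Padding variation: strip leading zeros after the last '-'.
pub fn strip_padding(id: &str) -> String {
    match id.rfind('-') {
        Some(pos) => {
            let (prefix, number) = id.split_at(pos + 1);
            let trimmed = number.trim_start_matches('0');
            // Keep a single zero when the suffix consisted only of zeros.
            let suffix = if trimmed.is_empty() && !number.is_empty() { "0" } else { trimmed };
            format!("{}{}", prefix, suffix)
        }
        None => id.to_string(),
    }
}
```

A matching routine that is robust to these variations should treat `INV-001`, `inv-001`, `INV_001`, `INV001` and `INV-1` as the same identifier.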

Duplicates

Duplicate Types

| Type | Description |
|---|---|
| exact | Identical records |
| near | Minor field differences |
| fuzzy | Multiple field variations |

data_quality:
  duplicates:
    rate: 0.001                      # 0.1% duplicates
    types:
      exact: 0.4                     # 40% exact duplicates
      near: 0.4                      # 40% near duplicates
      fuzzy: 0.2                     # 20% fuzzy duplicates
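
The `rate` sets how many duplicates to expect, and the `types` weights decide which kind each one is. A minimal sketch of both calculations follows; `pick_weighted` and `expected_duplicates` are hypothetical helpers, and the uniform draw `u` is supplied by the caller instead of a random number generator.

```rust
/// Pick an entry from `(name, weight)` pairs using a uniform draw `u` in [0, 1).
/// Weights are normalised by their sum, so they need not add up to exactly 1.
pub fn pick_weighted<'a>(weights: &[(&'a str, f64)], u: f64) -> Option<&'a str> {
    let total: f64 = weights.iter().map(|(_, w)| *w).sum();
    if total <= 0.0 {
        return None;
    }
    let mut cumulative = 0.0;
    for (name, w) in weights {
        cumulative += *w / total;
        if u < cumulative {
            return Some(*name);
        }
    }
    // Guard against rounding leaving `cumulative` slightly below 1.
    weights.last().map(|(name, _)| *name)
}

/// Expected number of injected duplicates for a data set of `records` rows.
pub fn expected_duplicates(records: u64, rate: f64) -> f64 {
    records as f64 * rate
}
```

For one million records at `rate: 0.001` this gives about 1,000 duplicates, of which roughly 400 would be exact, 400 near and 200 fuzzy under the weights above.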

Near Duplicate Variations

data_quality:
  duplicates:
    near:
      fields_to_vary: 1              # Change 1 field
      variations:
        - field: posting_date
          offset_days: [-1, 0, 1]
        - field: amount
          variance: 0.001            # 0.1% difference
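
A near duplicate is a copy of a record with a single field nudged. The sketch below shows the two variations configured above. `Posting`, `NearVariation` and `near_duplicate` are hypothetical, the date is reduced to a day counter, and treating `variance` as a multiplicative factor is an assumption.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Posting {
    pub document_id: String,
    /// Days since an arbitrary epoch; a stand-in for a real date type.
    pub posting_day: i64,
    pub amount: f64,
}

pub enum NearVariation {
    /// Shift the posting date by this many days (e.g. -1, 0 or 1).
    OffsetDays(i64),
    /// Scale the amount by `1 + variance` (assumed interpretation).
    AmountVariance(f64),
}

/// Copy `original` and change exactly one field, mirroring `fields_to_vary: 1`.
pub fn near_duplicate(original: &Posting, variation: NearVariation) -> Posting {
    let mut copy = original.clone();
    match variation {
        NearVariation::OffsetDays(days) => copy.posting_day += days,
        NearVariation::AmountVariance(v) => copy.amount *= 1.0 + v,
    }
    copy
}
```

A deduplication routine that compares amounts with a small tolerance and dates within a one-day window should flag both variants as matches of the original.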

Fuzzy Duplicate Variations

data_quality:
  duplicates:
    fuzzy:
      fields_to_vary: 3              # Change multiple fields
      include_typos: true

Typos

Typo Types

| Type | Description |
|---|---|
| Substitution | Adjacent key pressed |
| Transposition | Characters swapped |
| Insertion | Extra character |
| Deletion | Missing character |
| OCR errors | Scan-related (0/O, 1/l) |
| Homophones | Sound-alike substitution |

data_quality:
  typos:
    rate: 0.005                      # 0.5% of string fields
    keyboard_aware: true             # Use QWERTY layout

    types:
      substitution: 0.35             # Adjacent → Adjavent
      transposition: 0.25            # Receive → Recieve
      insertion: 0.15                # Shipping → Shippping
      deletion: 0.15                 # Shipping → Shiping
      ocr_errors: 0.05               # O → 0, l → 1
      homophones: 0.05               # their → there
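
The four character-level typo types can be illustrated with a short sketch. Everything here is hypothetical: the neighbour table covers only a handful of keys, substitution always takes the first neighbour and ignores case, and the position is chosen by the caller, whereas a generator would pick both at random.

```rust
/// Neighbours on a QWERTY keyboard for a few keys only.
fn qwerty_neighbours(c: char) -> &'static str {
    match c {
        'a' => "qwsz",
        'c' => "vxdf",
        'e' => "wsdr",
        'i' => "ujko",
        'n' => "bhjm",
        'o' => "iklp",
        's' => "awedxz",
        't' => "rfgy",
        _ => "",
    }
}

pub enum Typo {
    Substitution,  // replace the character with an adjacent key
    Transposition, // swap the character with the next one
    Insertion,     // duplicate the character
    Deletion,      // drop the character
}

/// Apply one typo at character index `pos`. The input is returned unchanged
/// when the edit is not possible at that position.
pub fn apply_typo(word: &str, pos: usize, kind: Typo) -> String {
    let mut chars: Vec<char> = word.chars().collect();
    if pos >= chars.len() {
        return word.to_string();
    }
    match kind {
        Typo::Substitution => {
            let key = chars[pos].to_ascii_lowercase();
            if let Some(n) = qwerty_neighbours(key).chars().next() {
                chars[pos] = n;
            }
        }
        Typo::Transposition => {
            if pos + 1 < chars.len() {
                chars.swap(pos, pos + 1);
            }
        }
        Typo::Insertion => {
            let c = chars[pos];
            chars.insert(pos, c);
        }
        Typo::Deletion => {
            chars.remove(pos);
        }
    }
    chars.into_iter().collect()
}
```

For example, a transposition at index 3 turns `Receive` into `Recieve`, and a deletion at index 4 turns `Shipping` into `Shiping`.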

Field Targeting

data_quality:
  typos:
    fields:
      description: 0.02              # More likely in descriptions
      vendor_name: 0.01
      customer_name: 0.01
    exclude:
      - account_number               # Never introduce typos
      - document_id

Encoding Issues

data_quality:
  encoding:
    enabled: true
    rate: 0.001

    issues:
      mojibake: 0.4                  # UTF-8/Latin-1 confusion
      missing_chars: 0.3             # Characters dropped
      bom_issues: 0.2                # BOM artifacts
      html_entities: 0.1             # &amp; instead of &

Examples:

  • Mojibake: Müller → MÃ¼ller
  • Missing: Zürich → Zrich
  • HTML: R&D → R&amp;D
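
Each of the four issue kinds corresponds to a simple transformation. The sketch below uses hypothetical function names and makes two assumptions: mojibake is modelled as UTF-8 bytes misread as Latin-1, and missing characters are modelled as dropped non-ASCII characters.

```rust
/// Mojibake: UTF-8 bytes misread as Latin-1, one character per byte.
pub fn mojibake(s: &str) -> String {
    s.bytes().map(|b| b as char).collect()
}

/// Missing characters: non-ASCII characters are dropped (assumed model).
pub fn drop_non_ascii(s: &str) -> String {
    s.chars().filter(|c| c.is_ascii()).collect()
}

/// BOM artifact: a stray U+FEFF byte-order mark at the start of the value.
pub fn prepend_bom(s: &str) -> String {
    format!("\u{FEFF}{}", s)
}

/// HTML entities: `&` is escaped as `&amp;`.
pub fn escape_ampersand(s: &str) -> String {
    s.replace('&', "&amp;")
}
```

In `mojibake`, the two UTF-8 bytes of `ü` (0xC3, 0xBC) become the two characters `Ã` and `¼`, which is why a six-character name comes out seven characters long.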

ML Training Labels

The data quality module generates labels for ML model training:

QualityIssueLabel

use std::collections::HashMap;

pub struct QualityIssueLabel {
    pub issue_id: String,
    pub issue_type: LabeledIssueType,
    pub issue_subtype: Option<QualityIssueSubtype>,
    pub document_id: String,
    pub field_name: String,
    pub original_value: Option<String>,
    pub modified_value: Option<String>,
    pub severity: u8,  // 1-5
    pub processor: String,
    pub metadata: HashMap<String, String>,
}

Issue Types

| Type | Severity | Description |
|---|---|---|
| MissingValue | 3 | Field is null/empty |
| Typo | 2 | Character-level errors |
| FormatVariation | 1 | Different formatting |
| Duplicate | 4 | Duplicate record |
| EncodingIssue | 3 | Character encoding problems |
| Inconsistency | 3 | Cross-field inconsistency |
| OutOfRange | 4 | Value outside expected range |
| InvalidReference | 5 | Reference to non-existent entity |
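
The severity column can be expressed as a lookup on the issue type. The enum below is a stand-in that only mirrors the variant names in the table; the derive list, the `default_severity` method and the `at_least` helper are hypothetical and may not match the crate's actual definitions.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LabeledIssueType {
    MissingValue,
    Typo,
    FormatVariation,
    Duplicate,
    EncodingIssue,
    Inconsistency,
    OutOfRange,
    InvalidReference,
}

impl LabeledIssueType {
    /// Severity on the 1-5 scale, as listed in the table above.
    pub fn default_severity(self) -> u8 {
        match self {
            LabeledIssueType::FormatVariation => 1,
            LabeledIssueType::Typo => 2,
            LabeledIssueType::MissingValue
            | LabeledIssueType::EncodingIssue
            | LabeledIssueType::Inconsistency => 3,
            LabeledIssueType::Duplicate | LabeledIssueType::OutOfRange => 4,
            LabeledIssueType::InvalidReference => 5,
        }
    }
}

/// Keep only the issue types at or above a severity threshold.
pub fn at_least(types: &[LabeledIssueType], min: u8) -> Vec<LabeledIssueType> {
    types
        .iter()
        .copied()
        .filter(|t| t.default_severity() >= min)
        .collect()
}
```

A filter of this kind is useful when a model should be trained only on the more damaging issues, for example severity 4 and above.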

Subtypes

Each issue type has detailed subtypes:

  • Typo: Substitution, Transposition, Insertion, Deletion, DoubleChar, CaseError, OcrError, Homophone
  • FormatVariation: DateFormat, AmountFormat, IdentifierFormat, TextFormat
  • Duplicate: ExactDuplicate, NearDuplicate, FuzzyDuplicate, CrossSystemDuplicate
  • EncodingIssue: Mojibake, MissingChars, Bom, ControlChars, HtmlEntities

Output

quality_issues.csv

| Field | Description |
|---|---|
| document_id | Affected record |
| field_name | Field with issue |
| issue_type | missing, typo, duplicate, etc. |
| original_value | Value before modification |
| modified_value | Value after modification |

quality_labels.csv (ML Training)

| Field | Description |
|---|---|
| issue_id | Unique issue identifier |
| issue_type | LabeledIssueType enum |
| issue_subtype | Detailed subtype |
| document_id | Affected document |
| field_name | Affected field |
| original_value | Original value |
| modified_value | Modified value |
| severity | 1-5 severity score |
| processor | Which processor injected |
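
A quick sanity check on a generated labels file is to tally rows per `issue_type`. The sketch below is a hypothetical helper that works on the file's text. It assumes a header row containing an `issue_type` column and uses naive comma splitting, so it does not handle quoted fields; a real consumer would use a CSV parser.

```rust
use std::collections::BTreeMap;

/// Count rows per `issue_type` in the text of a labels file.
/// Naive parsing: splits on commas and ignores CSV quoting rules.
pub fn count_by_issue_type(csv: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    let mut lines = csv.lines();
    let header = match lines.next() {
        Some(h) => h,
        None => return counts,
    };
    // Locate the column by name so the column order does not matter.
    let col = match header.split(',').position(|name| name.trim() == "issue_type") {
        Some(c) => c,
        None => return counts,
    };
    for line in lines.filter(|l| !l.trim().is_empty()) {
        if let Some(value) = line.split(',').nth(col) {
            *counts.entry(value.trim().to_string()).or_insert(0) += 1;
        }
    }
    counts
}
```

Comparing these counts against the configured rates gives a rough check that the injected mix matches the configuration.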

Example Configurations

Testing Data Pipelines

data_quality:
  enabled: true

  missing_values:
    rate: 0.02
    pattern: mcar

  format_variations:
    date_formats: true
    amount_formats: true

  typos:
    rate: 0.01
    keyboard_aware: true

Testing Deduplication

data_quality:
  enabled: true

  duplicates:
    rate: 0.05                       # High duplicate rate
    types:
      exact: 0.3
      near: 0.4
      fuzzy: 0.3

Testing OCR Processing

data_quality:
  enabled: true

  typos:
    rate: 0.03
    types:
      ocr_errors: 0.8                # Mostly OCR-style errors
      substitution: 0.2
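
OCR-style errors swap visually similar characters such as `0`/`O` and `1`/`l`. The following sketch injects one such confusion into a string. The function names are hypothetical, the confusion table is limited to the two pairs named in this document, and the first confusable character is always chosen so the result is deterministic.

```rust
/// Visually confusable counterpart of `c`, if any (0/O and 1/l only).
pub fn ocr_confuse(c: char) -> Option<char> {
    match c {
        'O' => Some('0'),
        '0' => Some('O'),
        'l' => Some('1'),
        '1' => Some('l'),
        _ => None,
    }
}

/// Replace the first confusable character in `text`; other text is unchanged.
pub fn inject_ocr_error(text: &str) -> String {
    let mut done = false;
    text.chars()
        .map(|c| {
            if !done {
                if let Some(swap) = ocr_confuse(c) {
                    done = true;
                    return swap;
                }
            }
            c
        })
        .collect()
}
```

An OCR post-processing step under test would be expected to recover `INVOICE 10` from the corrupted `INV0ICE 10`.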
