Privacy-Preserving ML

Privacy-Preserving Techniques

| Technique | Privacy Level | Performance Overhead | Maturity | Use Case |
|---|---|---|---|---|
| Differential Privacy | Strong | Medium | Production-ready | Model training with privacy guarantees |
| Federated Learning | Medium | Medium | Production-ready | Decentralized training (mobile, medical) |
| Homomorphic Encryption | Very Strong | Very High | Research/Early | Computation on encrypted data |
| Secure Multi-Party Computation | Strong | High | Emerging | Collaborative ML without data sharing |
| Trusted Execution Environments | Hardware-dependent | Low | Production-ready | Secure enclaves for sensitive data |
| Data Anonymization | Variable | Low | Mature | De-identification, k-anonymity |

Differential Privacy

Core Concepts

Definition: Mathematical guarantee that the presence or absence of any single individual's data does not significantly affect the output

Privacy Budget (ε): Lower ε = stronger privacy (typical: 0.1-10)

Noise Addition: Add calibrated noise to gradients or outputs

Composition: Privacy budget decreases with repeated queries

Trade-off: Privacy vs model accuracy
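The concepts above can be sketched with the classic Laplace mechanism for a counting query. This is a minimal illustration, not a production implementation; the function names and the example budget split are this sketch's own.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value with Laplace noise calibrated to sensitivity/epsilon."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon  # lower epsilon -> larger noise -> stronger privacy
    return true_value + rng.laplace(0.0, scale)

# Counting query, e.g. "how many patients have condition X?". Sensitivity is 1:
# adding or removing one person changes the count by at most 1.
true_count = 42
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)

# Basic composition: k queries at epsilon each spend k * epsilon of the budget,
# so a total budget of 1.0 split over four queries allows epsilon = 0.25 each.
total_budget = 1.0
per_query_eps = total_budget / 4
```

Averaging many noisy releases recovers the true count, which is exactly why the composition rule matters: each extra query spends budget.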

DP-SGD (Differentially Private SGD)

Mechanism: Add noise to gradients during training

Gradient Clipping: Bound sensitivity of gradients

Noise Addition: Gaussian or Laplacian noise

Privacy Accounting: Track cumulative privacy loss

Tools: Opacus (PyTorch), TensorFlow Privacy
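The DP-SGD mechanism above (clip per-example gradients, average, add Gaussian noise) can be sketched in plain NumPy. This is a hand-rolled illustration with made-up parameter defaults; in practice Opacus or TensorFlow Privacy handle clipping, noise, and privacy accounting for you.

```python
import numpy as np

def dp_sgd_step(weights, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.1, rng=None):
    """One DP-SGD step: clip each example's gradient, average, add Gaussian noise."""
    rng = rng or np.random.default_rng()
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Clipping bounds each example's influence (the sensitivity).
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # Noise scale is tied to the clipping bound and the batch size.
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                       size=mean_grad.shape)
    return weights - lr * (mean_grad + noise)
```

A privacy accountant (e.g. the moments accountant in Opacus) would track the cumulative ε spent across steps; that bookkeeping is omitted here.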

Local vs Global DP

Local DP: Noise added by individual users before data leaves device

Pros: Maximum privacy, no trusted aggregator needed

Cons: Higher noise, lower utility

Global DP (central DP): Noise added by a central server to aggregated data

Pros: Better utility, less noise

Cons: Requires trusted server
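Local DP can be illustrated with randomized response, the textbook mechanism for a sensitive yes/no question: each user flips their answer with a calibrated probability before it leaves the device, and the server corrects the aggregate statistically. A minimal sketch (function names are this sketch's own):

```python
import math
import random

def randomized_response(true_bit, epsilon, rng=random):
    """Local DP for a yes/no answer: report the truth with probability
    e^eps / (e^eps + 1), otherwise report the flipped bit."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return true_bit if rng.random() < p_truth else 1 - true_bit

def estimate_rate(noisy_bits, epsilon):
    """Unbiased estimate of the true 'yes' rate from the noisy responses."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(noisy_bits) / len(noisy_bits)
    return (observed - (1 - p)) / (2 * p - 1)
```

Note the local-DP trade-off in action: no individual response can be trusted, so accurate estimates require many users (the "higher noise, lower utility" con above).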

Applications

  • Census Data: US Census Bureau used DP for the 2020 Census
  • Apple: Keyboard predictions, usage analytics
  • Google: Chrome usage statistics
  • Microsoft: Windows telemetry
  • Medical Research: Patient data analysis
  • Finance: Transaction pattern analysis

Federated Learning

How It Works

  1. Server broadcasts model to participating devices
  2. Local training on each device's private data
  3. Send updates (gradients or model weights) to server
  4. Server aggregates updates (e.g., FedAvg)
  5. Repeat until convergence

Key Benefit: Raw data never leaves devices
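The five steps above can be sketched as a single FedAvg round. This is a toy illustration, assuming a hypothetical `local_update` callback that stands in for on-device training; real frameworks (listed below) add sampling, secure aggregation, and failure handling.

```python
import numpy as np

def fedavg_round(global_weights, client_datasets, local_update):
    """One round of FedAvg: broadcast, train locally, aggregate weighted by data size."""
    updates, sizes = [], []
    for data in client_datasets:
        # Steps 1-3: each client trains on its own data and returns new weights.
        local_w = local_update(global_weights.copy(), data)
        updates.append(local_w)
        sizes.append(len(data))
    total = sum(sizes)
    # Step 4: weight each client's model by its share of the total data.
    return sum((n / total) * w for n, w in zip(sizes, updates))
```

Step 5 is just calling `fedavg_round` in a loop, feeding each round's output back in as the next round's global weights. Only weights travel; the raw `client_datasets` never leave the loop body that represents each device.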

Challenges

  • Non-IID Data: Device data distributions differ
  • Communication Costs: Frequent model updates expensive
  • Device Heterogeneity: Varying compute/network capabilities
  • Dropout: Devices go offline mid-training
  • Privacy Leakage: Gradients can leak information
  • Byzantine Attacks: Malicious participants

Federated Learning Variants

Horizontal FL: Same features, different samples (e.g., mobile keyboards)

Vertical FL: Different features, same samples (e.g., banks + retailers)

Federated Transfer Learning: Different features and samples

Cross-Silo FL: Few organizations (hospitals, banks)

Cross-Device FL: Many devices (smartphones, IoT)

FL Frameworks & Tools

  • TensorFlow Federated: Google's FL framework
  • PySyft: OpenMined's privacy-preserving ML
  • Flower (flwr): Framework-agnostic FL
  • FATE: WeBank's federated AI ecosystem
  • FedML: Research and production FL
  • NVIDIA FLARE: Federated learning platform

Advanced Privacy Techniques

Homomorphic Encryption

Definition: Perform computations on encrypted data without decrypting

Types:

  • Partially HE: One operation (addition or multiplication)
  • Somewhat HE: Limited operations
  • Fully HE: Arbitrary computations (slow)

Challenge: Typically a 1,000-10,000x slowdown versus plaintext computation

Tools: Microsoft SEAL, HElib, PALISADE, Concrete

Use Case: Encrypted cloud inference
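Partial HE can be demonstrated with a toy Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts. The primes below are tiny and hard-coded purely for illustration; this is NOT secure and real systems use the libraries above.

```python
import math
import random

# Toy Paillier keypair from tiny primes (insecure, for illustration only).
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # decryption constant

def encrypt(m, rng=random):
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:  # r must be coprime with n
        r = rng.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Additive homomorphism: ciphertext multiplication adds the plaintexts.
c_sum = (encrypt(17) * encrypt(25)) % n2
# decrypt(c_sum) == 42, yet whoever multiplied never saw 17 or 25.
```

Fully homomorphic schemes extend this idea to arbitrary circuits (both addition and multiplication of plaintexts), which is where the large slowdowns come from.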

Secure Multi-Party Computation (SMPC)

Definition: Multiple parties jointly compute function without revealing inputs

Techniques:

  • Secret Sharing: Split data into shares
  • Garbled Circuits: Encrypt computation graph
  • Oblivious Transfer: Selective information retrieval

Tools: MP-SPDZ, CrypTen, PySyft

Use Case: Collaborative training without data sharing
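Secret sharing, the first technique listed, can be sketched with additive shares over a public prime modulus: each input is split into random-looking shares, parties compute on shares locally, and only the combined result is ever reconstructed. A minimal illustration (the hospital scenario is invented for the example):

```python
import random

P = 2**61 - 1  # public prime modulus; all arithmetic is mod P

def share(secret, n_parties, rng=random):
    """Split a secret into n additive shares that sum to it mod P."""
    shares = [rng.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Two hospitals share their patient counts among three compute parties.
# Each party adds the shares it holds; the joint total is revealed
# without any party learning either hospital's input.
shares_a = share(120, 3)
shares_b = share(80, 3)
joint = [(x + y) % P for x, y in zip(shares_a, shares_b)]
# reconstruct(joint) == 200
```

Any subset of fewer than all shares is uniformly random and reveals nothing about the secret, which is the property collaborative training protocols build on.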

Trusted Execution Environments (TEE)

Definition: Hardware-based secure enclaves for sensitive computation

Examples:

  • Intel SGX: Software Guard Extensions
  • AMD SEV: Secure Encrypted Virtualization
  • ARM TrustZone: Secure world isolation
  • AWS Nitro Enclaves: Cloud TEE

Benefit: Low overhead, hardware guarantees

Limitation: Hardware dependency, side-channel attacks

Synthetic Data Generation

Definition: Generate artificial data that preserves statistical properties

Techniques:

  • GANs: Generative Adversarial Networks
  • VAEs: Variational Autoencoders
  • CTGAN: Conditional tabular GAN
  • Bayesian Networks: Probabilistic models

Tools: Gretel, Mostly AI, Synthetic Data Vault

Use Case: Training/testing without real data

Data Anonymization Methods

| Technique | Method | Privacy Level | Data Utility | Reversible |
|---|---|---|---|---|
| Masking/Redaction | Replace sensitive data with X's or blanks | High | Low | No |
| Pseudonymization | Replace identifiers with pseudonyms | Medium | High | Yes (with key) |
| Tokenization | Replace with random tokens, store mapping | Medium | High | Yes (with vault) |
| Generalization | Replace specific values with ranges (age 25 → 20-30) | Medium | Medium | No |
| k-Anonymity | Ensure each record indistinguishable from k-1 others | Medium | Medium | No |
| l-Diversity | At least l distinct sensitive values per group | High | Medium | No |
| t-Closeness | Distribution of sensitive attribute close to overall | High | Low | No |
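Generalization and k-anonymity compose naturally: generalizing quasi-identifiers (age, ZIP) merges records into larger equivalence classes until each class has at least k members. A minimal sketch with invented records (function names are this sketch's own):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    groups = Counter(tuple(r[c] for c in quasi_identifiers) for r in records)
    return min(groups.values())

def generalize_age(age, width=10):
    """Replace an exact age with a bucket, e.g. 25 -> '20-29'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

records = [
    {"age": 25, "zip": "941xx", "diagnosis": "flu"},
    {"age": 27, "zip": "941xx", "diagnosis": "asthma"},
    {"age": 34, "zip": "941xx", "diagnosis": "flu"},
    {"age": 36, "zip": "941xx", "diagnosis": "flu"},
]
# Raw ages are unique, so the table is only 1-anonymous; after bucketing,
# each (age, zip) class holds 2 records, i.e. the table is 2-anonymous.
generalized = [{**r, "age": generalize_age(r["age"])} for r in records]
k = k_anonymity(generalized, ["age", "zip"])
```

Note that the 30-39 class here contains only "flu", which is why l-diversity and t-closeness exist: k-anonymity alone does not protect the sensitive attribute when a class is homogeneous.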