Privacy-Preserving ML
Privacy-Preserving Techniques
| Technique | Privacy Level | Performance Overhead | Maturity | Use Case |
|---|---|---|---|---|
| Differential Privacy | Strong | Medium | Production-ready | Model training with privacy guarantees |
| Federated Learning | Medium | Medium | Production-ready | Decentralized training (mobile, medical) |
| Homomorphic Encryption | Very Strong | Very High | Research/Early | Computation on encrypted data |
| Secure Multi-Party Computation | Strong | High | Emerging | Collaborative ML without data sharing |
| Trusted Execution Environments | Hardware-dependent | Low | Production-ready | Secure enclaves for sensitive data |
| Data Anonymization | Variable | Low | Mature | De-identification, k-anonymity |
Differential Privacy
Core Concepts
Definition: Mathematical guarantee that the presence or absence of any single individual's data does not significantly change the output
Privacy Budget (ε): Lower ε = stronger privacy (typical: 0.1-10)
Noise Addition: Add calibrated noise to gradients or outputs
Composition: Privacy loss accumulates across repeated queries, consuming the budget
Trade-off: Privacy vs model accuracy
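The budget/noise trade-off above can be sketched with the classic Laplace mechanism on a counting query. This is a minimal illustration, not a library API; `private_count` is a hypothetical helper, and it assumes a counting query, whose sensitivity is 1 (adding or removing one person changes the count by at most 1):

```python
import numpy as np

def private_count(data, predicate, epsilon):
    """Return a noisy count satisfying epsilon-DP.

    A counting query has sensitivity 1, so Laplace noise with
    scale 1/epsilon suffices: lower epsilon -> larger noise ->
    stronger privacy, lower accuracy.
    """
    true_count = sum(1 for x in data if predicate(x))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [23, 35, 41, 29, 62, 58, 33]
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0)
```

Running the same query twice doubles the consumed budget (basic composition), which is why the result of a low-ε query is typically cached rather than recomputed.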
DP-SGD (Differentially Private SGD)
Mechanism: Add noise to gradients during training
Gradient Clipping: Bound sensitivity of gradients
Noise Addition: Gaussian or Laplacian noise
Privacy Accounting: Track cumulative privacy loss
Tools: Opacus (PyTorch), TensorFlow Privacy
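The mechanics above (clip, then add noise) can be sketched from scratch in NumPy; a real implementation would use Opacus or TensorFlow Privacy, which also handle the privacy accounting omitted here. `dp_sgd_step` and its parameter names are illustrative assumptions, not any library's API:

```python
import numpy as np

def dp_sgd_step(weights, per_example_grads, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD update: clip each example's gradient to bound
    sensitivity, sum, add Gaussian noise scaled to the clip norm,
    then average and take a gradient step."""
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    noise = np.random.normal(0.0, noise_multiplier * clip_norm,
                             size=weights.shape)
    noisy_mean = (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
    return weights - lr * noisy_mean
```

Clipping is what makes the noise scale meaningful: without a bound on each example's gradient norm, no finite amount of noise yields a DP guarantee.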
Local vs Global DP
Local DP: Noise added by individual users before data leaves device
Pros: Maximum privacy, no trusted aggregator needed
Cons: Higher noise, lower utility
Global DP: Noise added by central server to aggregated data
Pros: Better utility, less noise
Cons: Requires trusted server
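Randomized response is the textbook example of local DP: each user perturbs their own answer before it leaves the device, and the server debiases only the aggregate. The function names below are illustrative:

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true answer with probability p, the opposite
    otherwise. Each response satisfies local DP with
    epsilon = ln(p / (1 - p))."""
    return truth if random.random() < p else not truth

def estimate_true_rate(responses, p: float = 0.75) -> float:
    """Debias the aggregate: E[observed] = p*q + (1-p)*(1-q),
    so q = (observed - (1-p)) / (2p - 1)."""
    observed = sum(responses) / len(responses)
    return (observed - (1 - p)) / (2 * p - 1)
```

This shows the local-DP cost directly: the debiasing factor 1/(2p − 1) inflates sampling noise, so local DP needs far more users than global DP for the same accuracy.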
Applications
- Census Data: US Census Bureau applied DP to the 2020 Census
- Apple: Keyboard predictions, usage analytics
- Google: Chrome usage statistics
- Microsoft: Windows telemetry
- Medical Research: Patient data analysis
- Finance: Transaction pattern analysis
Federated Learning
How It Works
- Server broadcasts model to participating devices
- Local training on each device's private data
- Send updates (gradients or model weights) to server
- Server aggregates updates (e.g., FedAvg)
- Repeat until convergence
Key Benefit: Raw data never leaves devices
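The loop above can be sketched as follows, assuming a simple linear model so the local step fits in a few lines; `local_train` and `fed_avg` are hypothetical names, and the weighted average is the FedAvg rule:

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=5):
    """One client's local SGD on a linear model (squared loss)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fed_avg(global_w, client_data, rounds=10):
    """Server loop: broadcast weights, let each client train
    locally, then average updates weighted by client data size."""
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in client_data:      # raw (X, y) never leaves a client
            updates.append(local_train(global_w, X, y))
            sizes.append(len(y))
        total = sum(sizes)
        global_w = sum(n / total * w for n, w in zip(sizes, updates))
    return global_w
```

Weighting by dataset size keeps clients with more data from being drowned out, though under strongly non-IID data plain FedAvg can still converge slowly.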
Challenges
- Non-IID Data: Device data distributions differ
- Communication Costs: Frequent model updates expensive
- Device Heterogeneity: Varying compute/network capabilities
- Dropout: Devices go offline mid-training
- Privacy Leakage: Gradients can leak information
- Byzantine Attacks: Malicious participants
Federated Learning Variants
Horizontal FL: Same features, different samples (e.g., mobile keyboards)
Vertical FL: Different features, same samples (e.g., banks + retailers)
Federated Transfer Learning: Different features and samples
Cross-Silo FL: Few organizations (hospitals, banks)
Cross-Device FL: Many devices (smartphones, IoT)
FL Frameworks & Tools
- TensorFlow Federated: Google's FL framework
- PySyft: OpenMined's privacy-preserving ML
- Flower (flwr): Framework-agnostic FL
- FATE: WeBank's federated AI ecosystem
- FedML: Research and production FL
- NVIDIA FLARE: Federated learning platform
Advanced Privacy Techniques
Homomorphic Encryption
Definition: Perform computations on encrypted data without decrypting
Types:
- Partially HE: One operation (addition or multiplication)
- Somewhat HE: Limited operations
- Fully HE: Arbitrary computations (slow)
Challenge: 1000-10,000x slowdown
Tools: Microsoft SEAL, HElib, PALISADE (succeeded by OpenFHE), Concrete
Use Case: Encrypted cloud inference
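Additive homomorphism can be demonstrated with a toy Paillier cryptosystem: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. This sketch uses tiny fixed primes and is NOT secure; production systems use the libraries above:

```python
import math, random

def paillier_keygen(p=1789, q=1867):
    """Toy Paillier keygen with fixed small primes (insecure demo)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)            # valid because we fix g = n + 1
    return (n,), (lam, mu, n)

def encrypt(pub, m):
    (n,) = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    # c = (1 + n)^m * r^n mod n^2, with (1+n)^m = 1 + m*n mod n^2
    return ((1 + m * n) % n2) * pow(r, n, n2) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n

pub, priv = paillier_keygen()
# Homomorphic addition: multiply ciphertexts, decrypt the sum.
c_sum = encrypt(pub, 12) * encrypt(pub, 30) % (pub[0] ** 2)
```

Paillier is only partially homomorphic (addition); fully homomorphic schemes support multiplication too, at the cost of the slowdown noted above.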
Secure Multi-Party Computation (SMPC)
Definition: Multiple parties jointly compute function without revealing inputs
Techniques:
- Secret Sharing: Split data into shares
- Garbled Circuits: Encrypt computation graph
- Oblivious Transfer: Selective information retrieval
Tools: MP-SPDZ, CrypTen, PySyft
Use Case: Collaborative training without data sharing
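Additive secret sharing, the first technique listed, can be sketched in a few lines: each party holds one random-looking share, any single share reveals nothing, and sums can be computed share-wise without reconstructing the inputs. The modulus and helper names are illustrative:

```python
import random

PRIME = 2**61 - 1   # field modulus; shares live in Z_PRIME

def share(secret, n_parties=3):
    """Split a secret into n additive shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Each party adds the shares it holds locally; only the final
# sum is ever reconstructed, never the individual inputs.
a_shares, b_shares = share(100), share(23)
sum_shares = [(x + y) % PRIME for x, y in zip(a_shares, b_shares)]
```

Secure multiplication needs extra machinery (e.g., Beaver triples), which is what frameworks like MP-SPDZ and CrypTen provide.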
Trusted Execution Environments (TEE)
Definition: Hardware-based secure enclaves for sensitive computation
Examples:
- Intel SGX: Software Guard Extensions
- AMD SEV: Secure Encrypted Virtualization
- ARM TrustZone: Secure world isolation
- AWS Nitro Enclaves: Cloud TEE
Benefit: Low overhead, hardware guarantees
Limitation: Hardware dependency, side-channel attacks
Synthetic Data Generation
Definition: Generate artificial data that preserves statistical properties
Techniques:
- GANs: Generative Adversarial Networks
- VAEs: Variational Autoencoders
- CTGAN: Conditional tabular GAN
- Bayesian Networks: Probabilistic models
Tools: Gretel, Mostly AI, Synthetic Data Vault
Use Case: Training/testing without real data
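As a minimal sketch of the idea (far simpler than the GAN/VAE approaches above), one can fit a multivariate normal to a numeric table and sample from it; `synth_gaussian` is a hypothetical helper, and this preserves only means and covariances, not higher-order structure:

```python
import numpy as np

def synth_gaussian(real: np.ndarray, n_samples: int) -> np.ndarray:
    """Fit a multivariate normal to the real table and sample
    synthetic rows that match its means and covariance."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return np.random.multivariate_normal(mean, cov, size=n_samples)
```

Note that matching statistical properties is not by itself a privacy guarantee; outliers can still leak, which is why tools like those above often combine generation with DP training.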
Data Anonymization Methods
| Technique | Method | Privacy Level | Data Utility | Reversible |
|---|---|---|---|---|
| Masking/Redaction | Replace sensitive data with X's or blanks | High | Low | No |
| Pseudonymization | Replace identifiers with pseudonyms | Medium | High | Yes (with key) |
| Tokenization | Replace with random tokens, store mapping | Medium | High | Yes (with vault) |
| Generalization | Replace specific values with ranges (age 25 → 20-30) | Medium | Medium | No |
| k-Anonymity | Ensure each record indistinguishable from k-1 others | Medium | Medium | No |
| l-Diversity | At least l distinct sensitive values per group | High | Medium | No |
| t-Closeness | Distribution of sensitive attribute close to overall | High | Low | No |
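Generalization and k-anonymity from the table can be combined in a short sketch: coarsen a quasi-identifier, then measure the smallest equivalence class. The helper names and records are illustrative:

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalization: replace a specific age with a range bucket."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def k_anonymity(records, quasi_ids):
    """Smallest equivalence-class size over the quasi-identifiers;
    the dataset is k-anonymous for this (and any smaller) k."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(groups.values())

people = [
    {"age": 23, "zip": "021**", "diagnosis": "flu"},
    {"age": 27, "zip": "021**", "diagnosis": "asthma"},
    {"age": 24, "zip": "021**", "diagnosis": "flu"},
]
generalized = [{**r, "age": generalize_age(r["age"])} for r in people]
k = k_anonymity(generalized, ["age", "zip"])   # all ages fall in "20-29"
```

With raw ages each record is unique (k = 1); after generalization all three share one equivalence class (k = 3). l-diversity then adds the further requirement that the `diagnosis` values within each class be varied.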
