Privacy-Preserving ML
Privacy-Preserving Techniques
| Technique | Privacy Level | Performance Overhead | Maturity | Use Case |
|---|---|---|---|---|
| Differential Privacy | Strong | Medium | Production-ready | Model training with privacy guarantees |
| Federated Learning | Medium | Medium | Production-ready | Decentralized training (mobile, medical) |
| Homomorphic Encryption | Very Strong | Very High | Research/Early | Computation on encrypted data |
| Secure Multi-Party Computation | Strong | High | Emerging | Collaborative ML without data sharing |
| Trusted Execution Environments | Hardware-dependent | Low | Production-ready | Secure enclaves for sensitive data |
| Data Anonymization | Variable | Low | Mature | De-identification, k-anonymity |
Differential Privacy
Core Concepts
Definition: Mathematical guarantee that the presence or absence of any single individual's data does not significantly change the output
Privacy Budget (ε): Lower ε = stronger privacy (typical: 0.1-10)
Noise Addition: Add calibrated noise to gradients or outputs
Composition: Privacy loss accumulates across repeated queries, consuming the budget
Trade-off: Privacy vs model accuracy
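The budget/noise trade-off above can be sketched with the classic Laplace mechanism on a counting query. This is a minimal illustration, not a library API; `private_count` is a hypothetical helper, and it assumes a counting query, whose sensitivity is 1 (adding or removing one person changes the count by at most 1):

```python
import numpy as np

def private_count(data, predicate, epsilon):
    """Return a noisy count satisfying epsilon-DP.

    A counting query has sensitivity 1, so Laplace noise with
    scale 1/epsilon suffices: lower epsilon -> larger noise ->
    stronger privacy, lower accuracy.
    """
    true_count = sum(1 for x in data if predicate(x))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [23, 35, 41, 29, 62, 58, 33]
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0)
```

Running the same query twice doubles the consumed budget (basic composition), which is why the result of a low-ε query is typically cached rather than recomputed.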
DP-SGD (Differentially Private SGD)
Mechanism: Add noise to gradients during training
Gradient Clipping: Bound sensitivity of gradients
Noise Addition: Gaussian or Laplacian noise
Privacy Accounting: Track cumulative privacy loss
Tools: Opacus (PyTorch), TensorFlow Privacy
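The mechanics above (clip, then add noise) can be sketched from scratch in NumPy; a real implementation would use Opacus or TensorFlow Privacy, which also handle the privacy accounting omitted here. `dp_sgd_step` and its parameter names are illustrative assumptions, not any library's API:

```python
import numpy as np

def dp_sgd_step(weights, per_example_grads, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD update: clip each example's gradient to bound
    sensitivity, sum, add Gaussian noise scaled to the clip norm,
    then average and take a gradient step."""
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    noise = np.random.normal(0.0, noise_multiplier * clip_norm,
                             size=weights.shape)
    noisy_mean = (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
    return weights - lr * noisy_mean
```

Clipping is what makes the noise scale meaningful: without a bound on each example's gradient norm, no finite amount of noise yields a DP guarantee.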
Local vs Global DP
Local DP: Noise added by individual users before data leaves device
Pros: Maximum privacy, no trusted aggregator needed
Cons: Higher noise, lower utility
Global DP: Noise added by central server to aggregated data
Pros: Better utility, less noise
Cons: Requires trusted server
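Randomized response is the textbook example of local DP: each user perturbs their own answer before it leaves the device, and the server debiases only the aggregate. The function names below are illustrative:

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true answer with probability p, the opposite
    otherwise. Each response satisfies local DP with
    epsilon = ln(p / (1 - p))."""
    return truth if random.random() < p else not truth

def estimate_true_rate(responses, p: float = 0.75) -> float:
    """Debias the aggregate: E[observed] = p*q + (1-p)*(1-q),
    so q = (observed - (1-p)) / (2p - 1)."""
    observed = sum(responses) / len(responses)
    return (observed - (1 - p)) / (2 * p - 1)
```

This shows the local-DP cost directly: the debiasing factor 1/(2p − 1) inflates sampling noise, so local DP needs far more users than global DP for the same accuracy.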
Applications
- Census Data: US Census Bureau applied DP to the 2020 Census
- Apple: Keyboard predictions, usage analytics
- Google: Chrome usage statistics
- Microsoft: Windows telemetry
- Medical Research: Patient data analysis
- Finance: Transaction pattern analysis
Federated Learning
How It Works
- Server broadcasts model to participating devices
- Local training on each device's private data
- Send updates (gradients or model weights) to server
- Server aggregates updates (e.g., FedAvg)
- Repeat until convergence
Key Benefit: Raw data never leaves devices
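The loop above can be sketched as follows, assuming a simple linear model so the local step fits in a few lines; `local_train` and `fed_avg` are hypothetical names, and the weighted average is the FedAvg rule:

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=5):
    """One client's local SGD on a linear model (squared loss)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fed_avg(global_w, client_data, rounds=10):
    """Server loop: broadcast weights, let each client train
    locally, then average updates weighted by client data size."""
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in client_data:      # raw (X, y) never leaves a client
            updates.append(local_train(global_w, X, y))
            sizes.append(len(y))
        total = sum(sizes)
        global_w = sum(n / total * w for n, w in zip(sizes, updates))
    return global_w
```

Weighting by dataset size keeps clients with more data from being drowned out, though under strongly non-IID data plain FedAvg can still converge slowly.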
Challenges
- Non-IID Data: Device data distributions differ
- Communication Costs: Frequent model updates expensive
- Device Heterogeneity: Varying compute/network capabilities
- Dropout: Devices go offline mid-training
- Privacy Leakage: Gradients can leak information
- Byzantine Attacks: Malicious participants
Federated Learning Variants
Horizontal FL: Same features, different samples (e.g., mobile keyboards)
Vertical FL: Different features, same samples (e.g., banks + retailers)
Federated Transfer Learning: Different features and samples
Cross-Silo FL: Few organizations (hospitals, banks)
Cross-Device FL: Many devices (smartphones, IoT)
FL Frameworks & Tools
- TensorFlow Federated: Google's FL framework
- PySyft: OpenMined's privacy-preserving ML
- Flower (flwr): Framework-agnostic FL
- FATE: WeBank's federated AI ecosystem
- FedML: Research and production FL
- NVIDIA FLARE: Federated learning platform
Advanced Privacy Techniques
Homomorphic Encryption
Definition: Perform computations on encrypted data without decrypting
Types:
- Partially HE: One operation (addition or multiplication)
- Somewhat HE: Limited operations
- Fully HE: Arbitrary computations (slow)
Challenge: 1000-10,000x slowdown
Tools: Microsoft SEAL, HElib, PALISADE (succeeded by OpenFHE), Concrete
Use Case: Encrypted cloud inference
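Additive homomorphism can be demonstrated with a toy Paillier cryptosystem: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. This sketch uses tiny fixed primes and is NOT secure; production systems use the libraries above:

```python
import math, random

def paillier_keygen(p=1789, q=1867):
    """Toy Paillier keygen with fixed small primes (insecure demo)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)            # valid because we fix g = n + 1
    return (n,), (lam, mu, n)

def encrypt(pub, m):
    (n,) = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    # c = (1 + n)^m * r^n mod n^2, with (1+n)^m = 1 + m*n mod n^2
    return ((1 + m * n) % n2) * pow(r, n, n2) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n

pub, priv = paillier_keygen()
# Homomorphic addition: multiply ciphertexts, decrypt the sum.
c_sum = encrypt(pub, 12) * encrypt(pub, 30) % (pub[0] ** 2)
```

Paillier is only partially homomorphic (addition); fully homomorphic schemes support multiplication too, at the cost of the slowdown noted above.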
Secure Multi-Party Computation (SMPC)
Definition: Multiple parties jointly compute function without revealing inputs
Techniques:
- Secret Sharing: Split data into shares
- Garbled Circuits: Encrypt computation graph
- Oblivious Transfer: Selective information retrieval
Tools: MP-SPDZ, CrypTen, PySyft
Use Case: Collaborative training without data sharing
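Additive secret sharing, the first technique listed, can be sketched in a few lines: each party holds one random-looking share, any single share reveals nothing, and sums can be computed share-wise without reconstructing the inputs. The modulus and helper names are illustrative:

```python
import random

PRIME = 2**61 - 1   # field modulus; shares live in Z_PRIME

def share(secret, n_parties=3):
    """Split a secret into n additive shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Each party adds the shares it holds locally; only the final
# sum is ever reconstructed, never the individual inputs.
a_shares, b_shares = share(100), share(23)
sum_shares = [(x + y) % PRIME for x, y in zip(a_shares, b_shares)]
```

Secure multiplication needs extra machinery (e.g., Beaver triples), which is what frameworks like MP-SPDZ and CrypTen provide.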
Trusted Execution Environments (TEE)
Definition: Hardware-based secure enclaves for sensitive computation
Examples:
- Intel SGX: Software Guard Extensions
- AMD SEV: Secure Encrypted Virtualization
- ARM TrustZone: Secure world isolation
- AWS Nitro Enclaves: Cloud TEE
Benefit: Low overhead, hardware guarantees
Limitation: Hardware dependency, side-channel attacks
Synthetic Data Generation
Definition: Generate artificial data that preserves statistical properties
Techniques:
- GANs: Generative Adversarial Networks
- VAEs: Variational Autoencoders
- CTGAN: Conditional tabular GAN
- Bayesian Networks: Probabilistic models
Tools: Gretel, Mostly AI, Synthetic Data Vault
Use Case: Training/testing without real data
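As a minimal sketch of the idea (far simpler than the GAN/VAE approaches above), one can fit a multivariate normal to a numeric table and sample from it; `synth_gaussian` is a hypothetical helper, and this preserves only means and covariances, not higher-order structure:

```python
import numpy as np

def synth_gaussian(real: np.ndarray, n_samples: int) -> np.ndarray:
    """Fit a multivariate normal to the real table and sample
    synthetic rows that match its means and covariance."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return np.random.multivariate_normal(mean, cov, size=n_samples)
```

Note that matching statistical properties is not by itself a privacy guarantee; outliers can still leak, which is why tools like those above often combine generation with DP training.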
Data Anonymization Methods
| Technique | Method | Privacy Level | Data Utility | Reversible |
|---|---|---|---|---|
| Masking/Redaction | Replace sensitive data with X's or blanks | High | Low | No |
| Pseudonymization | Replace identifiers with pseudonyms | Medium | High | Yes (with key) |
| Tokenization | Replace with random tokens, store mapping | Medium | High | Yes (with vault) |
| Generalization | Replace specific values with ranges (age 25 → 20-30) | Medium | Medium | No |
| k-Anonymity | Ensure each record indistinguishable from k-1 others | Medium | Medium | No |
| l-Diversity | At least l distinct sensitive values per group | High | Medium | No |
| t-Closeness | Distribution of sensitive attribute close to overall | High | Low | No |
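Generalization and k-anonymity from the table can be combined in a short sketch: coarsen a quasi-identifier, then measure the smallest equivalence class. The helper names and records are illustrative:

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalization: replace a specific age with a range bucket."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def k_anonymity(records, quasi_ids):
    """Smallest equivalence-class size over the quasi-identifiers;
    the dataset is k-anonymous for this (and any smaller) k."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(groups.values())

people = [
    {"age": 23, "zip": "021**", "diagnosis": "flu"},
    {"age": 27, "zip": "021**", "diagnosis": "asthma"},
    {"age": 24, "zip": "021**", "diagnosis": "flu"},
]
generalized = [{**r, "age": generalize_age(r["age"])} for r in people]
k = k_anonymity(generalized, ["age", "zip"])   # all ages fall in "20-29"
```

With raw ages each record is unique (k = 1); after generalization all three share one equivalence class (k = 3). l-diversity then adds the further requirement that the `diagnosis` values within each class be varied.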
