Data Architecture
Data Storage Patterns
Data Lake: Centralized repository storing structured, semi-structured, and unstructured data at scale. Schema-on-read approach allows flexible analysis. Raw data stored in native format. Supports big data analytics, ML training, and exploration. Typically built on object storage (S3, ADLS, GCS). Requires data cataloging and governance to avoid becoming a data swamp. Apache Hadoop HDFS, AWS Lake Formation, Azure Data Lake.
Data Warehouse: Optimized for analytical queries on structured data. Schema-on-write with predefined structure. OLAP workloads with star/snowflake schemas. Historical data storage for BI and reporting. ETL pipelines load processed data. Columnar storage for query performance. Amazon Redshift, Azure Synapse, Google BigQuery, Snowflake. Higher data quality than data lakes.
Data Mart: Subset of a data warehouse focused on a specific business function or department. Subject-oriented (sales, finance, marketing). Faster queries due to smaller data scope. Departmental autonomy and performance. Can be dependent (sourced from the warehouse) or independent (own ETL). Dimensional modeling for specific analytics. Reduces warehouse query load.
Data Lakehouse: Combines data lake flexibility with data warehouse performance. ACID transactions on data lake storage. Schema enforcement and evolution capabilities. Direct SQL queries on raw data. Unified batch and streaming processing. Delta Lake, Apache Iceberg, Apache Hudi formats. Reduces data duplication between lake and warehouse. Single platform for all analytics.
Operational Data Store (ODS): Current or near-real-time operational data for tactical decisions. Integrates data from multiple source systems. Subject-oriented and integrated but not historical. Supports operational reporting and real-time analytics. Bridge between OLTP and OLAP. Normalized or lightly denormalized. Refreshed frequently (hourly or near-real-time). Used for operational dashboards.
Polyglot Persistence: Using different database technologies for different data storage needs. Relational for transactions, NoSQL for documents, graph for relationships, time-series for metrics. Each workload uses the best-fit technology. Microservices often employ a polyglot approach. Adds complexity in data consistency and integration. Choose databases based on read/write patterns, query types, and scaling needs.
Data Processing Architectures
Lambda Architecture: Hybrid batch and stream processing architecture. Batch layer processes complete datasets for accuracy. Speed layer handles real-time data with approximate results. Serving layer merges both views. Addresses the latency-vs-completeness tradeoff. Batch corrects speed-layer errors. Complexity from maintaining two processing paths. Used for high-volume analytics requiring both speed and accuracy.
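The serving-layer merge can be sketched in a few lines. This is an illustrative toy, not a real framework: the batch view holds accurate counts up to the last batch run, the speed view holds counts for events seen since then, and the next batch run would overwrite (correct) the speed-layer numbers.

```python
# Lambda-style serving layer sketch (names are illustrative, not a real API).
def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Combine per-key counts from the batch layer and the speed layer."""
    merged = dict(batch_view)               # accurate, complete history
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count  # recent, approximate delta
    return merged

batch_view = {"page_a": 1000, "page_b": 500}   # from last nightly batch run
speed_view = {"page_a": 3, "page_c": 7}        # events since that run
result = merge_views(batch_view, speed_view)
```

The key property is that the speed view is disposable: once the batch layer catches up, its counts are discarded and recomputed exactly.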
Kappa Architecture: Stream processing only, eliminating the batch layer. All data treated as streams, including historical reprocessing. Single technology stack and codebase. Simpler than Lambda with one processing path. Requires a replayable log (Kafka). Stream processor must handle state and windowing. Good for event-driven systems. LinkedIn and Uber use Kappa for real-time analytics.
Batch Processing: Processes large volumes of data in scheduled batches. High throughput but high latency (hours to days). Complete dataset available for processing. Ideal for historical analysis, ETL, and data science. MapReduce, Spark batch, Hadoop. Cost-effective for large-scale transformations. Fault tolerance through replay. Nightly/weekly batch jobs common in enterprises.
Stream Processing: Continuous processing of data in motion. Low latency (milliseconds to seconds). Event-driven with windowing and stateful operations. Real-time analytics, alerting, monitoring. Apache Kafka Streams, Apache Flink, Spark Streaming. Exactly-once or at-least-once semantics. Complex event processing (CEP). Handles infinite, unbounded data streams.
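The windowing idea above can be shown on a bounded sample. Real engines such as Flink or Kafka Streams do this over unbounded streams with managed state and watermarks; this sketch only demonstrates tumbling (fixed, non-overlapping) windows.

```python
# Tumbling-window aggregation sketch over (timestamp_seconds, event_key) pairs.
from collections import defaultdict

def tumbling_window_counts(events, window_size_s):
    """Assign each event to a fixed window and count keys per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_size_s) * window_size_s  # window boundary
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in windows.items()}

events = [(0, "click"), (3, "click"), (5, "view"), (12, "click")]
counts = tumbling_window_counts(events, window_size_s=10)
```

Sliding and session windows follow the same pattern but assign one event to multiple or variable-length windows.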
Micro-Batch Processing: Small batches processed at frequent intervals. Compromise between batch and streaming. Spark Structured Streaming uses micro-batches. Simpler programming model than pure streaming. Near-real-time latency (seconds to minutes). Easier fault tolerance and reprocessing. Less complex than true streaming. Good for semi-real-time analytics.
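The micro-batch idea reduces to grouping an incoming sequence into small fixed-size chunks; engines like Spark trigger these chunks on a timer instead of by count. This sketch uses no Spark API, only the chunking logic:

```python
# Micro-batching sketch: split a record sequence into small batches.
def micro_batches(records, batch_size):
    """Yield fixed-size batches, flushing a final partial batch if any."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                 # flush remaining records as a short final batch
        yield batch

batches = list(micro_batches(range(7), batch_size=3))
```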
Hybrid Processing: Combines multiple processing patterns for flexibility. Batch for historical analysis, streams for real-time. Change data capture (CDC) for incremental updates. Unified APIs like Spark or Flink supporting both. Hot path (real-time) and cold path (batch) patterns. Azure Stream Analytics + Synapse. Flexibility at the cost of complexity. Best-of-both-worlds approach.
Data Governance
Data Catalog: Centralized inventory of data assets with metadata. Searchable repository of tables, columns, schemas. Business glossary with definitions. Data discovery and lineage tracking. Technical metadata (schema, size) and business metadata (ownership, purpose). AWS Glue Data Catalog, Azure Purview, Google Data Catalog, Collibra, Alation. Prevents shadow IT and duplicate datasets.
Data Lineage: Tracking data flow from source to consumption. Upstream dependencies and downstream impacts. Impact analysis for changes. Compliance and audit trails. Root-cause analysis for data quality issues. Column-level lineage for detailed tracking. Automated lineage via query parsing, or manual documentation. Essential for GDPR and regulatory compliance.
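Impact analysis over a lineage graph is just reachability on upstream-to-downstream edges. In this sketch the edges and dataset names are hand-written for illustration; real tools derive them by parsing query logs:

```python
# Lineage impact-analysis sketch: find everything downstream of a changed dataset.
def downstream_impact(edges, changed):
    """Return all datasets reachable downstream of `changed` (depth-first)."""
    impacted, stack = set(), [changed]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in impacted:
                impacted.add(dst)
                stack.append(dst)
    return impacted

edges = [("raw.orders", "staging.orders"),
         ("staging.orders", "mart.sales"),
         ("mart.sales", "dashboard.revenue")]
impact = downstream_impact(edges, "raw.orders")
```

Running the same traversal on reversed edges gives upstream dependencies, i.e. root-cause analysis.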
Data Quality: Ensuring accuracy, completeness, consistency, and timeliness of data. Validation rules and quality checks. Profiling to understand data characteristics. Anomaly detection for outliers. Data quality dimensions: accuracy, completeness, consistency, validity, uniqueness, timeliness. Great Expectations, Deequ, Talend Data Quality. Quality gates in pipelines. Data quality scores and dashboards.
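A minimal quality gate can check three of the dimensions listed above: completeness (no nulls), uniqueness (no duplicate keys), and validity (values in an allowed range). Tools like Great Expectations express the same rules declaratively; the rule names and row shape here are invented for illustration:

```python
# Data quality gate sketch over a list of row dicts.
def run_quality_checks(rows):
    """Return a pass/fail flag per quality dimension."""
    ids = [r["id"] for r in rows]
    return {
        "completeness": all(r["amount"] is not None for r in rows),
        "uniqueness": len(ids) == len(set(ids)),
        "validity": all(r["amount"] is None or r["amount"] >= 0 for r in rows),
    }

rows = [{"id": 1, "amount": 10.0},
        {"id": 2, "amount": None},   # completeness failure
        {"id": 2, "amount": 5.0}]    # uniqueness failure (duplicate id)
report = run_quality_checks(rows)
```

In a pipeline, a failing report would block promotion of the batch rather than silently loading bad data.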
Master Data Management (MDM): Single source of truth for critical business entities. Customer, product, supplier master data. Golden records built from multiple sources. Data stewardship and ownership. Reference data management. Centralized hub or federated approach. Informatica MDM, SAP Master Data Governance, Oracle MDM. Eliminates duplicate and conflicting data.
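Golden-record construction can be sketched with one simple survivorship rule: the most recently updated non-empty value wins. Real MDM tools add record matching, trust scores, and stewardship workflows; the field names below are illustrative:

```python
# Golden-record sketch: merge source records by "latest non-empty value wins".
def golden_record(records):
    """records: dicts with an 'updated_at' ordering field plus attributes."""
    golden = {}
    for rec in sorted(records, key=lambda r: r["updated_at"]):
        for field, value in rec.items():
            if field != "updated_at" and value:   # skip empty values
                golden[field] = value             # later records overwrite
    return golden

records = [
    {"updated_at": 1, "name": "J. Smith", "email": "j@old.com", "phone": ""},
    {"updated_at": 2, "name": "Jane Smith", "email": "", "phone": "555-0100"},
]
master = golden_record(records)
```

Note how the older record still "survives" for fields the newer one left empty.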
Metadata Management: Managing data about data. Technical metadata (schema, types, constraints), business metadata (definitions, ownership), operational metadata (lineage, quality). Active metadata drives automation. Metadata-driven architectures for flexibility. Metadata as code in version control. Metadata APIs for programmatic access. Foundation for data catalog and governance.
Data Security and Privacy: Protecting sensitive data from unauthorized access. Classification (public, internal, confidential, restricted). Encryption at rest and in transit. Access controls and least privilege. Data masking and tokenization for non-production environments. GDPR, CCPA, HIPAA compliance. Right to be forgotten, data minimization. DLP tools, encryption key management, audit logging.
Data Modeling
Star Schema: Dimensional model with a central fact table and surrounding dimension tables. Denormalized for query performance. Simple queries with single joins to facts. Intuitive for business users. Most common data warehouse schema. Fact table contains metrics and foreign keys. Dimension tables hold descriptive attributes. Easy to understand and maintain. Optimized for read-heavy OLAP queries.
Snowflake Schema: Normalized version of the star schema with dimension hierarchies. Dimensions split into sub-dimensions. Saves storage but requires more joins. More complex queries than a star schema. Useful when dimensions have many attributes. Reduces data redundancy. Trade-off between storage and query performance. Named for the snowflake-like appearance of the ERD.
Data Vault: Agile and scalable modeling approach for data warehouses. Hub (business keys), Link (relationships), and Satellite (attributes) tables. Insert-only, with an audit trail of all changes. Parallel loading and historization. Separates business logic from storage. Highly normalized but optimized for loading. Good for complex source systems. Dan Linstedt methodology.
Dimensional Modeling: Kimball methodology for data warehouse design. Business process matrices and bus architecture. Conformed dimensions shared across facts. Slowly changing dimensions (SCD) types 0-7. Grain declaration for fact tables. Additive, semi-additive, and non-additive measures. Factless fact tables for events. Lifecycle approach: business requirements, dimensional model, physical design, ETL, BI.
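Of the SCD types, Type 2 (keep full history) is the most common and the easiest to misimplement. This sketch shows the core move: when a tracked attribute changes, close the current dimension row and append a new version. The flag and column names are illustrative:

```python
# SCD Type 2 sketch: close the current row on change, append a new version.
def apply_scd2(dim_rows, key, new_attrs, today):
    """Update the dimension in place, preserving history for `key`."""
    current = next((r for r in dim_rows
                    if r["key"] == key and r["is_current"]), None)
    if current and all(current[k] == v for k, v in new_attrs.items()):
        return dim_rows                       # nothing changed, no new version
    if current:
        current["is_current"] = False         # expire the old version
        current["end_date"] = today
    dim_rows.append({"key": key, **new_attrs,
                     "start_date": today, "end_date": None, "is_current": True})
    return dim_rows

dim = [{"key": "C1", "city": "Oslo",
        "start_date": "2023-01-01", "end_date": None, "is_current": True}]
dim = apply_scd2(dim, "C1", {"city": "Bergen"}, today="2024-06-01")
```

Facts recorded before the change keep pointing at the expired row, so historical reports still show Oslo.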
Normalized Modeling (3NF): Third normal form for transactional systems. Eliminates redundancy and update anomalies. Entity-relationship diagrams (ERD). Primary keys, foreign keys, constraints. ACID transactions and referential integrity. Good for write-heavy OLTP. Requires joins for reporting. Inmon approach for data warehousing. Flexible for changing requirements.
One Big Table (OBT): Single wide table with all attributes denormalized. Optimized for a specific analytics use case. Pre-joined dimensions and facts. Eliminates query-time joins for performance. Common in columnar databases and BI tools. Trades storage for query speed. Easier for non-technical users. Tableau extracts and Power BI imports often use wide tables. Simplifies data access.
Data Lifecycle Management
Data Retention: Defined periods for keeping data before deletion or archival. Legal, regulatory, and business requirements. Different retention for different data types. PII often has strict limits. Financial data typically kept 7 years. Retention schedules in the data governance policy. Automated enforcement via lifecycle rules. Compliance with GDPR and HIPAA retention requirements.
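Automated enforcement reduces to a per-type retention table plus an age check. The periods below are placeholders echoing the examples above, not legal guidance:

```python
# Retention-check sketch: is a record past its retention period?
from datetime import date, timedelta

RETENTION_DAYS = {          # illustrative schedule, not legal advice
    "financial": 7 * 365,   # e.g. seven-year financial retention
    "pii": 2 * 365,         # stricter limit for personal data
}

def is_expired(record_type, created, today):
    """True once the record's age exceeds its retention period."""
    return today - created > timedelta(days=RETENTION_DAYS[record_type])

expired = is_expired("pii", created=date(2020, 1, 1), today=date(2024, 1, 1))
kept = is_expired("financial", created=date(2020, 1, 1), today=date(2024, 1, 1))
```

A lifecycle job would run this check on a schedule and route expired records to deletion or archival.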
Data Archival: Moving infrequently accessed data to lower-cost storage. Maintains accessibility while reducing primary storage costs. Compressed and stored in archival formats. Glacier, Archive Tier, tape. Retrieval-time trade-off (minutes to hours). Compliance archives for regulatory requirements. Automated tiering based on access patterns. Important for historical data and auditability.
Data Deletion: Permanent removal of data per retention policy or user request. GDPR Article 17 right to erasure. Soft delete (mark as deleted) vs hard delete (physical removal). Delete from all systems, including backups. Data deletion logs for audit. Complex in distributed systems and replicas. Anonymization as an alternative to deletion. Automated deletion workflows.
Storage Tiering: Storage tiers based on access frequency and performance needs. Hot: frequent access, low latency, high cost (SSD, premium). Warm: occasional access, moderate cost (standard storage). Cold: rare access, high latency, low cost (archive tier, Glacier). Automated tiering based on access patterns. S3 Intelligent-Tiering, Azure Blob lifecycle. Optimizes cost while maintaining accessibility.
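The hot/warm/cold split is ultimately a threshold function on access recency. The day thresholds below are illustrative; managed features like S3 Intelligent-Tiering make this decision automatically per object:

```python
# Tier-selection sketch based on days since an object was last accessed.
def choose_tier(days_since_access):
    """Map access recency to a storage tier (thresholds are illustrative)."""
    if days_since_access <= 30:
        return "hot"        # frequent access, low latency, high cost
    if days_since_access <= 180:
        return "warm"       # occasional access, moderate cost
    return "cold"           # rare access, archival pricing, slow retrieval

tiers = [choose_tier(d) for d in (1, 90, 400)]
```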
Lifecycle Automation: Automated policies for data movement and deletion. Object lifecycle rules (S3, GCS, Azure). Metadata-driven lifecycle based on classification. Event-driven automation (Lambda, Functions). ILM (Information Lifecycle Management) tools. Transitions between storage tiers automatically. Scheduled deletion based on retention. Reduces manual effort and ensures compliance.
Backup and Recovery: Regular backups for disaster recovery and data protection. RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements. Full, incremental, and differential backup strategies. 3-2-1 rule: 3 copies, 2 media types, 1 offsite. Point-in-time recovery for databases. Backup testing and restore drills. AWS Backup, Azure Backup, Veeam. Immutable backups against ransomware.
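Two of the checks above are mechanical enough to sketch: whether the newest backup satisfies the RPO (a failure right now would lose at most that much data), and whether a set of copies satisfies the 3-2-1 rule. Field names are illustrative:

```python
# Backup-policy sketch: RPO compliance and the 3-2-1 rule.
from datetime import datetime, timedelta

def rpo_met(last_backup, now, rpo):
    """True if data loss on failure right now stays within the RPO."""
    return now - last_backup <= rpo

def three_two_one(copies):
    """copies: list of {'media': str, 'offsite': bool}."""
    return (len(copies) >= 3                              # 3 copies
            and len({c["media"] for c in copies}) >= 2    # 2 media types
            and any(c["offsite"] for c in copies))        # 1 offsite

copies = [{"media": "disk", "offsite": False},
          {"media": "disk", "offsite": False},
          {"media": "tape", "offsite": True}]
ok_321 = three_two_one(copies)
ok_rpo = rpo_met(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 3, 0),
                 rpo=timedelta(hours=4))
```

RTO, by contrast, is measured empirically: only restore drills tell you how long recovery actually takes.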
