Data Architecture

Data Storage Patterns

Data Lake

Centralized repository storing structured, semi-structured, and unstructured data at scale. Schema-on-read approach allows flexible analysis. Raw data stored in native format. Supports big data analytics, ML training, exploration. Typically built on object storage (S3, ADLS, GCS). Requires data cataloging and governance to prevent a data swamp. Apache Hadoop HDFS, AWS Lake Formation, Azure Data Lake.

Similar Technologies
Data Warehouse, Data Hub, Data Reservoir, Object Storage, Distributed File System
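
Schema-on-read can be sketched in a few lines: raw records land in the lake exactly as produced, and structure is imposed only when they are read. The records and field names below are hypothetical.

```python
import json

# Raw events stored as-is in the lake (schema-on-read: nothing is
# validated or restructured at write time).
raw_records = [
    '{"user": "a1", "amount": 19.99, "ts": "2024-01-01"}',
    '{"user": "b2", "amount": "12.50"}',               # amount arrived as a string
    '{"user": "c3", "ts": "2024-01-02", "extra": 1}',  # amount missing entirely
]

def read_with_schema(lines):
    """Apply a schema at read time: coerce types, default missing fields."""
    rows = []
    for line in lines:
        rec = json.loads(line)
        rows.append({
            "user": str(rec.get("user", "")),
            "amount": float(rec.get("amount", 0.0)),
            "ts": rec.get("ts"),  # left nullable
        })
    return rows

rows = read_with_schema(raw_records)
```

The same raw files can later be read with a different schema for a different analysis, which is exactly the flexibility (and the governance risk) the entry above describes.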
Data Warehouse

Optimized for analytical queries on structured data. Schema-on-write with predefined structure. OLAP workloads with star/snowflake schemas. Historical data storage for BI and reporting. ETL pipelines load processed data. Columnar storage for query performance. Amazon Redshift, Azure Synapse, Google BigQuery, Snowflake. Higher data quality than data lakes.

Similar Technologies
Data Mart, Data Lake, OLAP Cube, Analytical Database, Data Lakehouse
Data Mart

Subset of data warehouse focused on specific business function or department. Subject-oriented (sales, finance, marketing). Faster queries due to smaller data scope. Departmental autonomy and performance. Can be dependent (sourced from warehouse) or independent (own ETL). Dimensional modeling for specific analytics. Reduces warehouse query load.

Similar Technologies
Data Warehouse, Department Database, Analytical Database, Subject Area, Data Subset
Data Lakehouse

Combines data lake flexibility with data warehouse performance. ACID transactions on data lake storage. Schema enforcement and evolution capabilities. Direct SQL queries on raw data. Unified batch and streaming processing. Delta Lake, Apache Iceberg, Apache Hudi formats. Reduces data duplication between lake and warehouse. Single platform for all analytics.

Similar Technologies
Data Lake + Warehouse, Modern Data Platform, Unified Analytics Platform, Lake + Warehouse Hybrid, Delta Architecture
Operational Data Store (ODS)

Current or near-real-time operational data for tactical decisions. Integrates data from multiple source systems. Subject-oriented and integrated but not historical. Supports operational reporting and real-time analytics. Bridge between OLTP and OLAP. Normalized or lightly denormalized. Refreshed frequently (hourly/real-time). Used for operational dashboards.

Similar Technologies
Data Warehouse, Real-time Database, Operational Analytics, Current State Database, Integration Layer
Polyglot Persistence

Using different database technologies for different data storage needs. Relational for transactions, NoSQL for documents, graph for relationships, time-series for metrics. Each workload uses best-fit technology. Microservices often employ polyglot approach. Complexity in data consistency and integration. Choose database based on read/write patterns, query types, scaling needs.

Similar Technologies
Single Database, Unified Data Platform, Multi-model Database, Heterogeneous Storage, Database per Service

Data Processing Architectures

Lambda Architecture

Hybrid batch and stream processing architecture. Batch layer processes complete datasets for accuracy. Speed layer handles real-time data with approximate results. Serving layer merges both views. Addresses latency vs completeness tradeoff. Batch corrects speed layer errors. Complexity from maintaining two processing paths. Used for high-volume analytics requiring both speed and accuracy.

Similar Technologies
Kappa Architecture, Batch Processing, Stream Processing, Unified Processing, Delta Architecture
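
The serving-layer merge is the heart of Lambda: a minimal sketch, assuming per-key counts where the batch view covers everything up to a cutoff and the speed view covers only events after it. The keys and numbers are hypothetical.

```python
# Batch view: accurate, recomputed from the complete dataset (e.g. nightly).
batch_view = {"page_a": 1000, "page_b": 250}
# Speed view: approximate, covers only events since the batch cutoff.
speed_view = {"page_a": 12, "page_c": 3}

def serving_layer(key):
    """Merge both views: authoritative batch result plus real-time delta.
    The next batch run absorbs the speed layer's events and corrects
    any approximation errors."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)
```

The operational cost the entry mentions is visible even here: the same counting logic effectively exists twice, once per layer.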
Kappa Architecture

Stream processing only, eliminating batch layer. All data treated as streams, including historical reprocessing. Single technology stack and codebase. Simpler than Lambda with one processing path. Requires replayable log (Kafka). Stream processor must handle state and windowing. Good for event-driven systems. LinkedIn, Uber use Kappa for real-time analytics.

Similar Technologies
Lambda Architecture, Pure Streaming, Event-driven Architecture, Stream-first, Unified Streaming
Batch Processing

Process large volumes of data in scheduled batches. High throughput but high latency (hours to days). Complete dataset available for processing. Ideal for historical analysis, ETL, data science. MapReduce, Spark batch, Hadoop. Cost-effective for large-scale transformations. Fault tolerance through replay. Nightly/weekly batch jobs common in enterprises.

Similar Technologies
Stream Processing, Real-time Processing, Micro-batch, Scheduled Jobs, ETL Pipeline
Stream Processing

Continuous processing of data in motion. Low latency (milliseconds to seconds). Event-driven with windowing and stateful operations. Real-time analytics, alerting, monitoring. Apache Kafka Streams, Apache Flink, Spark Streaming. Exactly-once or at-least-once semantics. Complex event processing (CEP). Handles infinite unbounded data streams.

Similar Technologies
Batch Processing, Micro-batch, Real-time Processing, Event Processing, Continuous Processing
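
Windowing is the core stream-processing idea above; here is a minimal sketch of a tumbling (fixed, non-overlapping) window count over timestamped events. The event list is hypothetical and timestamps are plain seconds.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Assign each (timestamp, key) event to a fixed window and count
    occurrences per (window start, key). Real engines (Flink, Kafka
    Streams) add state stores, watermarks, and late-event handling."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "click"), (4, "click"), (11, "click"), (13, "view")]
windowed = tumbling_window_counts(events, window_seconds=10)
```

Events at t=1 and t=4 fall into the [0, 10) window; t=11 and t=13 fall into [10, 20).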
Micro-batch Processing

Small batches processed at frequent intervals. Compromise between batch and streaming. Spark Structured Streaming uses micro-batches. Simpler programming model than pure streaming. Near-real-time latency (seconds to minutes). Easier fault tolerance and reprocessing. Less complex than true streaming. Good for semi-real-time analytics.

Similar Technologies
Pure Streaming, Batch Processing, Hybrid Processing, Mini-batch, Tumbling Windows
Hybrid Processing

Combines multiple processing patterns for flexibility. Batch for historical analysis, stream for real-time. Change data capture (CDC) for incremental updates. Unified APIs like Spark or Flink supporting both. Hot path (real-time) and cold path (batch) patterns. Azure Stream Analytics + Synapse. Flexibility at cost of complexity. Best-of-both-worlds approach.

Similar Technologies
Lambda Architecture, Unified Processing, Multi-modal Processing, Batch + Stream, Flexible Processing

Data Governance

Data Catalog

Centralized inventory of data assets with metadata. Searchable repository of tables, columns, schemas. Business glossary with definitions. Data discovery and lineage tracking. Technical metadata (schema, size) and business metadata (ownership, purpose). AWS Glue Data Catalog, Azure Purview, Google Data Catalog, Collibra, Alation. Prevents shadow IT and duplicate datasets.

Similar Technologies
Metadata Repository, Data Registry, Asset Inventory, Data Dictionary, Metadata Management
Data Lineage

Tracking data flow from source to consumption. Upstream dependencies and downstream impacts. Impact analysis for changes. Compliance and audit trails. Root cause analysis for data quality issues. Column-level lineage for detailed tracking. Automated lineage via query parsing or manual documentation. Essential for GDPR, regulatory compliance.

Similar Technologies
Data Flow Tracking, Impact Analysis, Data Provenance, Dependency Tracking, Data Journey Mapping
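
Impact analysis over a lineage graph reduces to a transitive walk downstream from the changed dataset. A minimal sketch, with a hypothetical graph mapping each dataset to its direct upstream sources:

```python
# Hypothetical lineage: dataset -> list of its direct upstream sources.
lineage = {
    "orders_clean": ["orders_raw"],
    "report_daily": ["orders_clean"],
    "customers_dim": ["crm_export"],
}

def downstream_impact(changed, lineage):
    """Return every dataset that transitively reads from `changed`,
    i.e. everything a schema change to `changed` could break."""
    impacted, frontier = set(), {changed}
    while frontier:
        nxt = {ds for ds, sources in lineage.items()
               if any(s in frontier for s in sources)}
        nxt -= impacted
        impacted |= nxt
        frontier = nxt
    return impacted
```

Production tools build this graph automatically, typically by parsing query logs, rather than from a hand-maintained mapping.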
Data Quality

Ensuring accuracy, completeness, consistency, timeliness of data. Validation rules and quality checks. Profiling to understand data characteristics. Anomaly detection for outliers. Data quality dimensions: accuracy, completeness, consistency, validity, uniqueness, timeliness. Great Expectations, Deequ, Talend Data Quality. Quality gates in pipelines. Data quality scores and dashboards.

Similar Technologies
Data Validation, Data Cleansing, Data Profiling, Quality Assurance, Data Accuracy
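
Three of the quality dimensions above (completeness, uniqueness, validity) can be sketched as plain row-level checks; the rows and rules here are illustrative, and tools like Great Expectations express the same idea declaratively.

```python
def quality_checks(rows):
    """Run illustrative checks and return (row index, issue) pairs."""
    issues = []
    seen_ids = set()
    for i, row in enumerate(rows):
        if row.get("id") is None:                     # completeness
            issues.append((i, "completeness: missing id"))
        elif row["id"] in seen_ids:                   # uniqueness
            issues.append((i, "uniqueness: duplicate id"))
        else:
            seen_ids.add(row["id"])
        if not (0 <= row.get("age", -1) <= 120):      # validity
            issues.append((i, "validity: age out of range"))
    return issues

rows = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 40},     # duplicate id
    {"id": 2, "age": 999},    # age out of range
    {"id": None, "age": 28},  # missing id
]
issues = quality_checks(rows)
```

A quality gate in a pipeline is then just "fail the run if `issues` is non-empty" or "fail if the issue rate exceeds a threshold".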
Master Data Management

Single source of truth for critical business entities. Customer, product, supplier master data. Golden records from multiple sources. Data stewardship and ownership. Reference data management. Centralized hub or federated approach. Informatica MDM, SAP Master Data Governance, Oracle MDM. Eliminates duplicate and conflicting data.

Similar Technologies
Reference Data Management, Golden Record, Single Source of Truth, Entity Management, Data Hub
Metadata Management

Managing data about data. Technical metadata (schema, types, constraints), business metadata (definitions, ownership), operational metadata (lineage, quality). Active metadata drives automation. Metadata-driven architectures for flexibility. Metadata as code in version control. Metadata APIs for programmatic access. Foundation for data catalog and governance.

Similar Technologies
Data Dictionary, Schema Registry, Metadata Repository, Data Documentation, Catalog Management
Data Privacy & Security

Protecting sensitive data from unauthorized access. Classification (public, internal, confidential, restricted). Encryption at rest and in transit. Access controls and least privilege. Data masking and tokenization for non-production. GDPR, CCPA, HIPAA compliance. Right to be forgotten, data minimization. DLP tools, encryption key management, audit logging.

Similar Technologies
Data Protection, Access Control, Data Encryption, Compliance Management, Privacy Controls
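
Masking and tokenization for non-production data can be sketched with the standard library. This is illustrative only: the key below is a placeholder (a real key lives in a secrets manager), and real tokenization services add key rotation and token vaults.

```python
import hashlib
import hmac

SECRET = b"placeholder-key"  # hypothetical; never hard-code in practice

def tokenize(value):
    """Deterministic keyed token: the same input always yields the same
    token, so joins still work in test data, but the raw value is gone."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask_email(email):
    """Partial masking: keep the domain (useful for analytics), hide
    the local part."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain
```

Tokenization preserves referential integrity across datasets; masking preserves format and partial utility. Which to use depends on whether downstream code needs to join on the field.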

Data Modeling

Star Schema

Dimensional model with central fact table and dimension tables. Denormalized for query performance. Simple queries with single joins to facts. Intuitive for business users. Most common data warehouse schema. Fact table contains metrics and foreign keys. Dimension tables have descriptive attributes. Easy to understand and maintain. Optimized for read-heavy OLAP queries.

Similar Technologies
Snowflake Schema, Galaxy Schema, Data Vault, Third Normal Form, Flat Table
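
A tiny star schema can be built end-to-end with SQLite: a fact table holding metrics and foreign keys, one descriptive dimension table per axis of analysis, and a query that joins each dimension once. Table names and data are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables carry descriptive attributes.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT);
    -- The fact table carries metrics plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );
""")
con.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(1, "Widget", "Tools"), (2, "Gadget", "Toys")])
con.executemany("INSERT INTO dim_date VALUES (?,?)",
                [(10, "2024-01-01"), (11, "2024-01-02")])
con.executemany("INSERT INTO fact_sales VALUES (?,?,?)",
                [(1, 10, 9.5), (1, 11, 20.0), (2, 10, 5.0)])

# A typical star query: single join per dimension, aggregate over the fact.
totals = con.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.category ORDER BY p.category
""").fetchall()
```

Note the denormalization: `category` lives directly on `dim_product`, so the query never needs a second hop — the defining contrast with the snowflake schema described next.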
Snowflake Schema

Normalized version of star schema with dimension hierarchies. Dimensions split into sub-dimensions. Saves storage but requires more joins. More complex queries than star schema. Useful when dimensions have many attributes. Reduces data redundancy. Trade-off between storage and query performance. Named for snowflake-like appearance of ERD.

Similar Technologies
Star Schema, Normalized Schema, Third Normal Form, Dimensional Model, Data Vault
Data Vault 2.0

Agile and scalable modeling approach for data warehouses. Hub (business keys), Link (relationships), Satellite (attributes) tables. Insert-only, audit trail of all changes. Parallel loading and historization. Separates business logic from storage. Highly normalized but optimized for loading. Good for complex source systems. Dan Linstedt methodology.

Similar Technologies
Star Schema, Third Normal Form, Anchor Modeling, Kimball Dimensional, Ensemble Modeling
Dimensional Modeling

Kimball methodology for data warehouse design. Business process matrices and bus architecture. Conformed dimensions shared across facts. Slowly changing dimensions (SCD) types 0-7. Grain declaration for fact tables. Additive, semi-additive, non-additive measures. Factless fact tables for events. Lifecycle approach: business requirements, dimensional model, physical design, ETL, BI.

Similar Technologies
Normalized Modeling, Data Vault, Activity Schema, Wide Tables, OBT (One Big Table)
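
Of the SCD types listed, Type 2 is the most common and the easiest to misimplement, so here is a minimal sketch: when an attribute changes, close out the current row and append a new version, keeping full history. The dimension rows and keys are hypothetical.

```python
from datetime import date

def scd2_apply(history, key, new_attrs, today):
    """SCD Type 2 (sketch): one 'current' row per key (end is None);
    on change, the current row is end-dated and a new row is appended."""
    for row in history:
        if row["key"] == key and row["end"] is None:
            if row["attrs"] == new_attrs:
                return history            # nothing changed, nothing to do
            row["end"] = today            # close out the old version
            break
    history.append({"key": key, "attrs": new_attrs,
                    "start": today, "end": None})
    return history

hist = []
scd2_apply(hist, "cust-1", {"city": "Oslo"}, date(2024, 1, 1))
scd2_apply(hist, "cust-1", {"city": "Bergen"}, date(2024, 6, 1))
```

A fact row then joins to the dimension version whose [start, end) range contains the fact's date, so historical reports reflect the attributes as they were at the time.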
Normalized Data Modeling

Third normal form (3NF) for transactional systems. Eliminates redundancy and update anomalies. Entity-relationship diagrams (ERD). Primary keys, foreign keys, constraints. ACID transactions and referential integrity. Good for write-heavy OLTP. Requires joins for reporting. Inmon approach for data warehousing. Flexibility for changing requirements.

Similar Technologies
Denormalized, Dimensional Modeling, Star Schema, Wide Tables, Document Model
Wide Table / OBT

One Big Table with all attributes denormalized. Optimized for specific analytics use case. Pre-joined dimensions and facts. Eliminates query-time joins for performance. Common in columnar databases and BI tools. Trade storage for query speed. Easier for non-technical users. Tableau extracts, Power BI imports often use wide tables. Simplifies data access.

Similar Technologies
Star Schema, Normalized Model, Materialized View, Aggregate Table, OLAP Cube

Data Lifecycle Management

Data Retention Policies

Defined periods for keeping data before deletion or archival. Legal, regulatory, and business requirements. Different retention for different data types. PII often has strict limits. Financial data typically 7 years. Retention schedules in data governance policy. Automated enforcement via lifecycle rules. Compliance with GDPR, HIPAA retention requirements.

Similar Technologies
Data Deletion, Archival Strategy, Data Expiration, Lifecycle Policy, Retention Schedule
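
Enforcing a retention schedule reduces to comparing a record's age against the limit for its classification. A minimal sketch; the classifications and periods below are illustrative, not legal guidance.

```python
from datetime import date, timedelta

# Hypothetical retention schedule, in days, by data classification.
RETENTION_DAYS = {"pii": 365, "financial": 7 * 365, "logs": 90}

def is_expired(classification, created, today):
    """True once a record has outlived its retention period and is
    eligible for deletion or archival."""
    limit = timedelta(days=RETENTION_DAYS[classification])
    return today - created > limit
```

An automated enforcement job would run this over each dataset's metadata and feed expired records into a deletion or archival workflow.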
Data Archival

Moving infrequently accessed data to lower-cost storage. Maintain accessibility but reduce primary storage costs. Compressed and stored in archival formats. Glacier, Archive Tier, Tape. Retrieval time trade-off (minutes to hours). Compliance archives for regulatory requirements. Automated tiering based on access patterns. Important for historical data and auditability.

Similar Technologies
Cold Storage, Backup, Long-term Storage, Data Tiering, Offline Storage
Data Deletion & Right to be Forgotten

Permanent removal of data per retention policy or user request. GDPR Article 17 right to erasure. Soft delete (mark as deleted) vs hard delete (physical removal). Delete from all systems including backups. Data deletion logs for audit. Complex in distributed systems and replicas. Anonymization as alternative to deletion. Automated deletion workflows.

Similar Technologies
Data Anonymization, Data Masking, Soft Delete, Data Expiration, Data Purging
Hot/Warm/Cold Data Tiers

Storage tiers based on access frequency and performance needs. Hot: frequent access, low latency, high cost (SSD, premium). Warm: occasional access, moderate cost (standard storage). Cold: rare access, high latency, low cost (archive tier, glacier). Automated tiering based on access patterns. S3 Intelligent-Tiering, Azure Blob lifecycle. Optimize cost while maintaining accessibility.

Similar Technologies
Single Tier Storage, Manual Tiering, Storage Classes, Access-based Tiering, Performance Tiers
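
Access-based tiering is just a threshold function over recency. A minimal sketch; the 30- and 180-day cutoffs are illustrative, since real lifecycle rules are configured per bucket or container.

```python
def choose_tier(days_since_last_access):
    """Map access recency to a storage tier (illustrative thresholds)."""
    if days_since_last_access <= 30:
        return "hot"       # frequent access: SSD / premium storage
    if days_since_last_access <= 180:
        return "warm"      # occasional access: standard storage
    return "cold"          # rare access: archive tier
```

Managed services apply the same logic automatically from observed access patterns instead of a hand-set threshold.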
Data Lifecycle Automation

Automated policies for data movement and deletion. Object lifecycle rules (S3, GCS, Azure). Metadata-driven lifecycle based on classification. Event-driven automation (Lambda, Functions). ILM (Information Lifecycle Management) tools. Transition between storage tiers automatically. Scheduled deletion based on retention. Reduces manual effort and ensures compliance.

Similar Technologies
Manual Lifecycle, Scheduled Jobs, Event-driven Lifecycle, Policy-based Automation, ILM Tools
Data Backup & Recovery

Regular backups for disaster recovery and data protection. RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements. Full, incremental, differential backup strategies. 3-2-1 rule: 3 copies, 2 media types, 1 offsite. Point-in-time recovery for databases. Backup testing and restore drills. AWS Backup, Azure Backup, Veeam. Immutable backups against ransomware.

Similar Technologies
Replication, Snapshots, Continuous Data Protection, Disaster Recovery, Data Redundancy
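
The difference between the backup strategies above comes down to which files each run selects. An incremental run takes only what changed since the last backup; a full run takes everything. A minimal sketch with hypothetical file names and integer modification times:

```python
def incremental_backup(files, last_backup_ts):
    """Select files modified since the last backup. A full backup would
    simply return every file; a differential backup would compare
    against the last *full* backup instead of the last backup of any kind."""
    return [name for name, mtime in files.items() if mtime > last_backup_ts]

# Hypothetical catalog of files and their last-modified timestamps.
files = {"a.db": 100, "b.db": 250, "c.db": 300}
changed = incremental_backup(files, last_backup_ts=200)
```

Incremental backups minimize backup time and storage but lengthen restores (every increment since the last full must be replayed), which is why RTO requirements often dictate the mix of full and incremental runs.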