Apache Stack
Messaging/Streaming
High-throughput event streaming platform; durably stores and processes real-time data streams with pub-sub and queue patterns
Multi-tenant messaging with built-in geo-replication; separates storage from compute, which suits elastic cloud deployments better than Kafka's coupled brokers
Stateful stream processing with exactly-once semantics; handles complex event processing, windowing, and real-time analytics
Early real-time processing framework; largely superseded by Flink but still used for simple streaming topologies
Stream processing tightly integrated with Kafka; good for stateful transformations on Kafka topics
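To make the "windowing" mentioned above concrete, here is a minimal sketch of a tumbling (fixed, non-overlapping) window count in plain Python. It is not the API of any engine listed here; the function name, the `(timestamp, key)` event shape, and the 10-second window are assumptions chosen for illustration.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align each event to the start of the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical event stream: (seconds since start, event type).
events = [(0, "click"), (4, "click"), (9, "view"), (10, "click"), (19, "view")]
result = tumbling_window_counts(events, window_seconds=10)
```

A real stream processor would additionally handle late or out-of-order events and keep window state fault-tolerant, which this sketch leaves out.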
Data Storage
Masterless wide-column store optimized for writes; near-linear scalability across datacenters, tunable (typically eventual) consistency
Column-family database on Hadoop; random read/write access to billions of rows, strong consistency
Document database with multi-master replication; HTTP/JSON API, offline-first architecture
Time-series OLAP database with sub-second query latency; excellent for event data and real-time dashboards
MPP SQL database for interactive analytics; combines fast queries with high concurrency, alternative to ClickHouse
Real-time OLAP store designed for user-facing analytics; ultra-low latency on fresh data, used by LinkedIn/Uber
Table format enabling ACID transactions on data lakes; schema evolution, time travel, partition management on S3/HDFS
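The "time travel" idea in the table-format entry can be sketched as a list of immutable snapshots, where each commit publishes a new snapshot and old ones stay readable. This is a toy model under stated assumptions, not the metadata layout of any real table format; `SnapshotTable`, its methods, and the file names are hypothetical.

```python
class SnapshotTable:
    """Toy table whose every commit produces a new immutable snapshot."""

    def __init__(self):
        # Snapshot 0 is the empty table; each entry is a tuple of data-file names.
        self._snapshots = [()]

    def commit_append(self, *data_files):
        """Publish a new snapshot that adds `data_files`; return its id."""
        new_snapshot = self._snapshots[-1] + tuple(data_files)
        self._snapshots.append(new_snapshot)
        return len(self._snapshots) - 1

    def files(self, snapshot_id=None):
        """List data files as of `snapshot_id` (latest when omitted)."""
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return list(self._snapshots[snapshot_id])

table = SnapshotTable()
first = table.commit_append("part-0001.parquet")
second = table.commit_append("part-0002.parquet", "part-0003.parquet")
```

Reading with `table.files(first)` returns the table as it looked after the first commit, while `table.files()` returns the latest state.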
Big Data Processing
Distributed filesystem (HDFS) + MapReduce processing; foundation for big data ecosystem, mostly legacy now
In-memory batch/streaming engine with unified API; up to 100x faster than MapReduce on in-memory workloads, supports SQL/ML/graph processing
SQL query engine over Hadoop/S3; translates SQL to MapReduce/Tez/Spark jobs, metastore doubles as metadata catalog for data lakes
Dataflow scripting language for Hadoop; procedural alternative to SQL, largely replaced by Spark
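Several entries above refer to the MapReduce model. The sketch below runs its three phases (map, shuffle, reduce) in a single Python process for a word count. It only illustrates the programming model; the function names are assumptions, and a real cluster would distribute each phase across machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data big ideas", "data lakes"])
```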
Search/Indexing
Enterprise search with full-text indexing, faceting, geo-search; arguably more feature-rich than Elasticsearch out of the box
Core search library providing indexing/search algorithms; powers both Solr and Elasticsearch
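Full-text indexing rests on an inverted index: a map from each term to the documents that contain it. The following is a minimal sketch assuming whitespace tokenization and lowercase normalization; production search libraries add analyzers, scoring, and compressed posting lists. The function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all_terms(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

docs = {
    1: "distributed search engine",
    2: "distributed file system",
    3: "search library",
}
index = build_inverted_index(docs)
```

For example, `search_all_terms(index, "distributed search")` intersects the posting sets for both terms and matches only document 1.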
Workflow/Orchestration
Python-based DAG scheduler for ETL pipelines; dynamic workflows, extensive integrations, monitoring/alerting
Visual dataflow automation with back-pressure handling; drag-and-drop ETL for moving/transforming data between systems
XML-based Hadoop workflow coordinator; legacy tool for scheduling MapReduce/Hive jobs
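A DAG scheduler has to run each task only after its upstream tasks finish, which amounts to a topological sort. Here is a minimal sketch using Python's standard `graphlib` module (Python 3.9+); the task names and the `plan_run_order` helper are assumptions for illustration and not part of any scheduler's API.

```python
from graphlib import CycleError, TopologicalSorter

def plan_run_order(dependencies):
    """Return tasks in an order that respects `dependencies`.

    `dependencies` maps each task to the set of tasks it depends on.
    Raises ValueError if the graph contains a cycle (i.e. is not a DAG).
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception args.
        raise ValueError(f"workflow is not a DAG: {exc.args[1]}") from exc

# Hypothetical ETL pipeline: each task lists its upstream tasks.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}
run_order = plan_run_order(dependencies)
```

Real orchestrators layer scheduling intervals, retries, and parallel execution of independent tasks on top of this ordering.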
Computation
In-memory columnar format for zero-copy data sharing between processes; avoids serialization overhead, with claimed speedups of 10-100x over serialize-and-copy approaches
SQL parser/optimizer framework used by many databases; provides cost-based query optimization
Write-once pipelines that run on Flink/Spark/Dataflow; abstracts execution engine for portable data processing
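The "zero-copy" idea in the columnar-format entry can be illustrated in-process with Python's standard `array` and `memoryview`: a view exposes the same contiguous buffer instead of copying it. This sketch shows only the principle inside one process, not cross-process sharing and not any columnar library's API; the variable names are assumptions.

```python
from array import array

# One column of float64 values stored contiguously, as a columnar layout would.
prices = array("d", [10.0, 20.0, 30.0, 40.0])

# A memoryview exposes the same buffer without copying it.
view = memoryview(prices)
window = view[1:3]        # zero-copy slice over elements 1 and 2

prices[1] = 99.0          # write through the original array
shared_value = window[0]  # the slice sees the change: same memory

copied = prices[1:3]      # slicing the array itself copies the data
prices[2] = -1.0          # visible through `window`, not through `copied`
```

The contrast between `window` and `copied` is the point: consumers holding a view pay no copy cost, while a copy is isolated from later writes.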
Other Notable
BI tool for exploring/visualizing data; connects to 40+ databases, SQL IDE, shareable dashboards
Columnar file format for analytics; efficient compression/encoding, predicate pushdown, industry standard
Row-based serialization with schema evolution; compact binary format for streaming/RPC
Distributed coordination for leader election, config management; increasingly replaced by Raft-based alternatives (e.g. KRaft in Kafka)
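"Schema evolution" in the row-based serialization entry means a reader with a newer schema can still decode records written with an older one. A minimal sketch of that resolution rule, assuming a simplified schema that maps field names to default values: fields unknown to the writer take the reader's default, and fields unknown to the reader are dropped. The `resolve_record` helper and the sample schema are hypothetical, and real formats also handle type promotion and binary encoding.

```python
def resolve_record(written, reader_schema):
    """Project a record written with an older schema onto the reader's schema.

    `reader_schema` maps field name -> default value (simplified assumption).
    Missing fields get the default; fields the reader does not know are dropped.
    """
    return {
        name: written.get(name, default)
        for name, default in reader_schema.items()
    }

# Reader schema v2 added "email" and no longer knows "legacy_flag".
reader_schema_v2 = {"id": None, "name": "", "email": "unknown"}
old_record = {"id": 7, "name": "Ada", "legacy_flag": True}
resolved = resolve_record(old_record, reader_schema_v2)
```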
