Lambda Architecture

Lambda Architecture, proposed by Nathan Marz for big data systems, addresses the issues inherent in traditional data processing systems and the need to deliver information in real time to downstream applications, insights, and closed-loop systems. Some of the key issues in building scalable data processing systems are:

- Human fault tolerance: bugs in code and unexpected data corruption caused by human intervention.
- Ever-growing schema: ever-increasing data volume and a steady stream of new schema enhancements.
- Query complexity: the complexity of the queries needed to extract meaningful information from the data.
- CAP theorem: the ability to address the above with the utmost accuracy and consistency while retaining partition tolerance.

Lambda architecture tries to solve these problems by building the following fundamental design features into the distributed computing system.

Immutable data: Data is stored in a distributed file system at the lowest level of granularity, in append-only form. It can be normalized, depending on the complexity of the data. Because nothing is ever overwritten, all historical data can be recovered after bugs, data corruption, schema changes, and so on.
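As a minimal sketch of this idea in pure Python (the file name and record fields are hypothetical; in practice the log would live on HDFS), an immutable store only ever appends new granular facts, and a change is recorded as a new fact rather than an in-place update:

```python
import json
import time

MASTER_DATASET = "master_events.log"  # stands in for an HDFS path

def append_event(event: dict) -> None:
    """Append one raw, granular fact to the master dataset.

    The file is opened in append-only mode; existing records are
    never modified or deleted, so any corrupted downstream view
    can always be rebuilt from this log.
    """
    record = {"timestamp": time.time(), **event}
    with open(MASTER_DATASET, "a") as f:
        f.write(json.dumps(record) + "\n")

# A "change" becomes a new fact, not an update in place:
append_event({"user": "alice", "field": "city", "value": "Berlin"})
append_event({"user": "alice", "field": "city", "value": "Madrid"})
```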

Batch processing: The batch layer reads data from the immutable store at regular intervals, aggregates it or applies use-case-specific algorithms, and builds the batch views.
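A minimal sketch of a batch pass over that log (pure Python; in a real system this would be a MapReduce or Spark job) recomputes the batch view from scratch on every run, which is exactly what makes the layer tolerant of human error: fix the code, rerun, and the view heals.

```python
import json
from collections import defaultdict

def build_batch_view(master_path: str) -> dict:
    """Recompute a batch view (event count per user) from the full log."""
    view = defaultdict(int)
    with open(master_path) as f:
        for line in f:
            event = json.loads(line)
            view[event["user"]] += 1
    return dict(view)

batch_view = build_batch_view("master_events.log")
```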

Speed layer: The stream processing layer receives data from the source in real time, aggregates it or applies use-case-specific algorithms, and builds the real-time views.
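Continuing the sketch, the speed layer keeps a small in-memory view and folds each arriving event into it incrementally (a hypothetical handler; in practice this logic would run inside a Storm bolt or a streaming job):

```python
from collections import defaultdict

# Covers only events that arrived after the last batch run,
# so the view stays small and cheap to update.
realtime_view = defaultdict(int)

def on_stream_event(event: dict) -> None:
    """Incrementally update the real-time view for one streamed event."""
    realtime_view[event["user"]] += 1
```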

Serving layer: The serving layer exposes a view to downstream applications that merges data from both the batch and real-time views.
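Continuing the same sketch, a serving-layer query simply merges the two views: the batch view covers everything up to the last batch run, and the real-time view covers the gap since then.

```python
def query(user: str) -> int:
    """Answer a query from both views: the precomputed batch result
    plus whatever has streamed in since the last batch run."""
    return batch_view.get(user, 0) + realtime_view.get(user, 0)
```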

Technology Stack

Building a data processing system on the Lambda architecture requires several technologies with different characteristics:

- HADOOP DISTRIBUTED FILE SYSTEM (HDFS): Since the data is stored in immutable form, HDFS is the most viable option for storing all historical data; it is scalable and fault tolerant, delivers high throughput, and is well suited to batch processing of both structured and unstructured data.
- MAPREDUCE: A batch processing framework on top of HDFS. It is scalable and supports a wide set of languages and frameworks, such as Pig and Hive, as well as streaming. The MapReduce jobs rebuild all of the historical data and update the precomputed batch views.
- ELEPHANTDB / VOLDEMORT / IMPALA / DRILL: Because the whole of the historical data is rebuilt on every batch run, the MapReduce jobs need to write into data stores optimized for fast random reads and no random writes.
- NoSQL DATABASES (HBase / MongoDB): The speed layer computes only a small fraction of the whole dataset into its views, so the target is usually a fast NoSQL database.
- STORM: Apache Storm is a real-time computing framework, similar in spirit to MapReduce but optimized for stream processing.
- SPARK: Apache Spark is an in-memory cluster computing framework built on the concept of RDDs (Resilient Distributed Datasets), again optimized for real-time computing and ad hoc querying.
- KAFKA: Apache Kafka is a fast, scalable, clustered message broker capable of feeding huge datasets into the stream processing system.
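As one concrete example from this stack, the batch view sketched earlier can be expressed as an RDD computation in PySpark (a minimal sketch assuming a local Spark installation and the same hypothetical log file):

```python
import json
from pyspark import SparkContext

sc = SparkContext("local", "batch-view")

# Rebuild the per-user batch view over the whole master dataset;
# Spark recomputes lost RDD partitions automatically on failure.
batch_view_rdd = (
    sc.textFile("master_events.log")
      .map(json.loads)
      .map(lambda event: (event["user"], 1))
      .reduceByKey(lambda a, b: a + b)
)
print(batch_view_rdd.collect())
```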
