What is Common between Mumbai Dabbawalas and Apache Hadoop?
While the premise of picking up lunchboxes and delivering them at workplaces sounds simple, a highly sophisticated, more than century-old process quietly works behind the scenes. The efficiency of the process has earned the Dabbawalas a Six Sigma rating from Forbes magazine. Six Sigma, the quality methodology introduced by Motorola in 1986, is used to judge the quality standards of an organization. More than 5,000 Dabbawalas deliver over 300,000 lunchboxes every day, covering every nook and corner of Mumbai. Braving the extreme weather that is common during Mumbai's monsoon season, the Dabbawalas manage to deliver the boxes on time every working day.

The local Dabbawalas at the receiving and sending ends are known to the customers personally, so there is no question of a lack of trust. They are also well accustomed to the local areas they cater to, which allows them to reach any destination with ease. They rely on bicycles, carriages, and the local trains to transport the lunchboxes during the round trip. On average, every lunchbox changes hands four times and travels 60-70 kilometres on its journey to its eventual destination. Each box is differentiated and sorted along the route on the basis of markings on the lid, which indicate the source as well as the destination address.

Here is a quick summary of the steps that take place from the time a lunchbox is picked up at home until it is returned by evening:
- The first Dabbawala collects the lunchbox from the household and marks it with a unique code.
- The Dabbawalas meet at a designated place where the boxes are sorted and grouped into carriages.
- The second Dabbawala marks each carriage uniquely to represent the destination and loads it onto a local train. The markings include the local rail station where the boxes are to be unloaded and the building address where each box has to be finally delivered.
- The third one travels along with the dabbas in the local train and hands over the carriages at each station.
- The fourth Dabbawala picks up the dabbas from the train, decodes the final destination, and delivers each box.

The process is simply reversed in the evening to return the empty lunchboxes.
If you are familiar with MapReduce, this should already ring a bell. More than a century before Google published the GFS and MapReduce papers, the Mumbai Dabbawalas had mastered the MapReduce algorithm for their own efficient, distributed processing!

For the uninitiated, Apache Hadoop is a framework for processing large amounts of data in a highly parallelized and distributed environment. It solves the problem of processing petabytes of data by slicing the dataset into individual chunks that can be processed independently by inexpensive machines in a cluster. Apache Hadoop has two components: 1) a file system called HDFS that is designed to manage distributed data in a highly reliable way, and 2) the MapReduce engine that processes each slice of the data by applying the algorithm.

For example, suppose the Indian Meteorological Department has recorded the temperature of each city on a daily basis for the last 100 years. This dataset could easily run into a few terabytes! Imagine the computing power required to query this dataset to find the city with the highest temperature in the last 100 years. This is exactly where Hadoop can play a role. Once the terabyte-sized dataset is submitted to HDFS, it is sliced into equal chunks, and each chunk is distributed to a machine running within the cluster. Then, the developer writes the code in two parts: 1) the code that finds the maximum temperature within each slice of the dataset on each machine (the Mapper), and 2) the code that collects and aggregates the output of the previous step to find the city with the maximum temperature (the Reducer). MapReduce is precisely the model that helps developers perform these two steps efficiently. If the developer writes the MapReduce code to find the city with the maximum temperature on a tiny dataset with just a few records, that same code will seamlessly work against petabytes of data!
Effectively, Apache Hadoop makes it easy to process large datasets by letting developers focus on the core logic rather than worrying about the complexity and size of the data. In between the Map and Reduce phases, there are sub-processes that shuffle and sort the data to make it easy for the Reducers to aggregate the results. Below is an illustration of the MapReduce process.
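To make the Mapper, Shuffle, and Reducer roles concrete, here is a minimal sketch of the temperature example in plain Python. This is not actual Hadoop code, and the city readings are invented purely for illustration:

```python
from collections import defaultdict

# Hypothetical sample of (city, temperature) readings; the real IMD
# dataset would span decades and run into terabytes.
records = [
    ("Mumbai", 34.0), ("Delhi", 45.2), ("Chennai", 41.5),
    ("Delhi", 47.8), ("Mumbai", 36.1), ("Chennai", 43.0),
]

def mapper(chunk):
    """Map phase: emit (city, temperature) pairs for one slice of data."""
    for city, temp in chunk:
        yield city, temp

def shuffle(mapped_pairs):
    """Shuffle and Sort phase: group values by key, as Hadoop does
    between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups.items()

def reducer(city, temps):
    """Reduce phase: aggregate each city's readings to its maximum."""
    return city, max(temps)

# Simulate two machines, each processing one slice of the dataset.
chunks = [records[:3], records[3:]]
mapped = [pair for chunk in chunks for pair in mapper(chunk)]
reduced = dict(reducer(k, v) for k, v in shuffle(mapped))
hottest = max(reduced, key=reduced.get)
print(hottest, reduced[hottest])  # Delhi 47.8
```

The key point is that the Mapper and Reducer never see the whole dataset; swapping in more chunks (or more machines) changes nothing in their logic.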
Now that we have explored both the models, let’s compare and contrast the Mumbai Dabbawala methodology with Apache Hadoop.
- Just like HDFS slices the data and distributes the chunks to individual nodes, each household submits its lunchbox to a Dabbawala.
- All the lunchboxes are collected at a common place, where they are tagged and put into carriages with unique codes. This is the job of the Mapper!
- Based on the codes, carriages bound for a common destination are sorted and loaded onto the respective trains. This is the Shuffle and Sort phase of MapReduce.
- At each railway station, a Dabbawala picks up the carriage and delivers each box in it to the respective customer. This is the Reduce phase.
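The mapping above can be sketched in a few lines of Python as well. The homes, stations, and customers below are entirely made up for illustration:

```python
from collections import defaultdict

# Hypothetical lunchboxes: (home, destination_station, customer).
lunchboxes = [
    ("Andheri home 1", "Churchgate", "Office A"),
    ("Andheri home 2", "Dadar", "Office B"),
    ("Bandra home 1", "Churchgate", "Office C"),
]

def mapper(box):
    """First and second Dabbawalas: tag each box with its destination
    station (the key used for grouping)."""
    home, station, customer = box
    return station, (home, customer)

def shuffle(tagged):
    """Sorting point: group boxes bound for the same station into one
    carriage, just like the Shuffle and Sort phase."""
    carriages = defaultdict(list)
    for station, box in tagged:
        carriages[station].append(box)
    return carriages

def reducer(station, boxes):
    """Fourth Dabbawala: deliver every box in the carriage unloaded at
    this station to its customer."""
    return [f"{customer} via {station}" for _, customer in boxes]

carriages = shuffle(mapper(b) for b in lunchboxes)
deliveries = {s: reducer(s, boxes) for s, boxes in carriages.items()}
```

Each function knows nothing about the others' workloads, which is exactly the property that lets both the Dabbawalas and a Hadoop cluster scale.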
Just as each node in the cluster does its job without knowledge of the other processes, each Dabbawala participates in the workflow by focusing only on his own task. This is evidence of how a parallelized environment can scale better.
It is fascinating to see how the century-old Dabbawala system anticipated an algorithm that is now powering the Big Data revolution!
PS – I have intentionally simplified the description of Apache Hadoop and used the Indian Meteorological scenario for illustration purposes only.
- Janakiram MSV, Chief Editor, CloudStory.in