Hadoop: Past, Present and Future with Mike Cafarella

Podcast Wednesday, March 9 2016

“HDFS is going to be a cockroach – I don’t think it’s ever going away.”

Hadoop was created in 2003. In the early years, Hadoop provided large-scale data processing with MapReduce and distributed, fault-tolerant storage with the Hadoop Distributed File System. Over the last decade, Hadoop has evolved rapidly with the support of a big open-source community.

Today’s guest is Mike Cafarella, co-creator of Hadoop. Mike takes us on a journey from past to present. Hadoop was based on the Google File System and MapReduce papers, so Mike and I talk about what it was like to work on a distributed file system in 2004, and the challenges of implementing real software systems based on white papers. We also discuss YARN and the wave of innovation it enabled within the Hadoop ecosystem. Mike will also be presenting at Strata + Hadoop World in San Jose. We’re partnering with O’Reilly to support this conference. If you want to go to Strata, you can save 20% on a ticket with our code PCSED.

Questions

What were the challenges of building a web crawler in the early 2000s?
What were the breakthrough concepts of the Google File System?
How were people thinking about distributed systems 10-12 years ago?
When the MapReduce paper came out of Google, did you immediately realize it was a good fit with NDFS?
What were the consequences of so many big companies converging around Hadoop?
Why was it a problem that the MapReduce component of Hadoop had too many responsibilities?
What is Deep Dive and what led you to work on it?