Big Data is a buzzword in the current market, and Hadoop is the term that comes up most often when Big Data solutions are discussed. I was asked to review the book “Optimizing Hadoop for MapReduce”, written by Khaled Tannir and published by Packt Publishing, and I spent some time reading it. There are not many books on Hadoop beyond the well-known volumes “Hadoop: The Definitive Guide” by Tom White and “Hadoop in Action” by Chuck Lam. This book is a reference for production-scale settings. If you are using Hadoop at production scale, “Optimizing Hadoop for MapReduce” is a reference you should have on hand when tuning your Hadoop cluster.
The book is well organized into 7 chapters. It is best suited to experienced MapReduce users or developers working in a production-scale Hadoop environment, and it is also a helpful guide for Hadoop enthusiasts and MapReduce beginners who want a setup that approaches production scale.
The first chapter gives a fair introduction to the MapReduce programming model and then explains the internals of MapReduce: how jobs are executed by this distributed, fault-tolerant, parallel processing framework. Understanding these internals is a foundation for optimizing your MR jobs.
The second chapter introduces optimizing MapReduce through the appropriate parameters in the configuration files. Anyone who has written a “hello world!” MapReduce program will know these configuration files, such as core-site.xml, hdfs-site.xml and mapred-site.xml.
As Hadoop runs in a cluster environment, it is always necessary to monitor the jobs running on the different nodes of the cluster. Even though Hadoop comes with a UI for monitoring jobs with a number of metrics, you need richer monitoring in a production environment to diagnose problems. This chapter describes the open source monitoring tools Chukwa, Ganglia, Nagios and Ambari well, and shows how to use them effectively when optimizing your MapReduce jobs.
The third chapter introduces the performance tuning process cycle for optimizing your Hadoop cluster for MR jobs using Hadoop counters. It also covers the effective use of Linux OS utilities such as dstat, top, htop, iotop, vmstat, iostat, sar and netstat for capturing system-level performance metrics, which can then be used to optimize your MR jobs.
It also introduces the TeraGen tool (which I found quite interesting) for creating a performance baseline for a Hadoop cluster. How to benchmark your cluster with this tool is nicely explained in the book.
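To make the counter-driven part of that tuning cycle concrete, here is a minimal sketch of one check that can be derived from Hadoop's built-in task counters. It is an illustration only and is not taken from the book. `MAP_OUTPUT_RECORDS` and `SPILLED_RECORDS` are real Hadoop counter names, but the `spill_ratio` helper and all the counter values are hypothetical. The rule of thumb it assumes is that a ratio well above 1 for a map task suggests records were written to disk more than once.

```python
def spill_ratio(counters):
    """Return spilled records divided by map output records for one map task."""
    map_output = counters["MAP_OUTPUT_RECORDS"]
    if map_output == 0:
        # A task that emitted nothing cannot have a meaningful ratio.
        return 0.0
    return counters["SPILLED_RECORDS"] / map_output


# Hypothetical counter values for two map tasks (made up for illustration).
healthy_task = {"MAP_OUTPUT_RECORDS": 1_000_000, "SPILLED_RECORDS": 1_000_000}
spilling_task = {"MAP_OUTPUT_RECORDS": 1_000_000, "SPILLED_RECORDS": 3_000_000}

for name, task in [("healthy", healthy_task), ("spilling", spilling_task)]:
    print(name, spill_ratio(task))
```

If a task looks like the second one, the map-side sort buffer (io.sort.mb in Hadoop 1.x) is a reasonable first setting to look at. Whether changing it helps is something to confirm against your own baseline.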
The fourth chapter introduces scenarios for identifying a cluster’s weaknesses. In a Hadoop cluster of several hundred nodes, node failures are a common problem for the Hadoop administrator.
How do you check a node’s health? How do you identify massive I/O traffic? The book explains well how to solve these problems.
The fifth chapter concentrates on profiling MapReduce jobs. Hadoop is written in Java, and we usually write job code in Java as well; the objects that code creates live on the heap, so memory-related problems are to be expected. Not sizing memory appropriately for MR jobs is a performance bottleneck. This chapter explains the memory settings very well, along with how to profile MR jobs and how to write MR job code for performance.
The sixth chapter covers how to use combiners and compression to optimize MapReduce tasks (the outputs of map() and reduce()). How combiners help improve overall job execution time is explained quite lucidly. In particular, with large volumes of data, using compression codecs such as LZO (Lempel-Ziv-Oberhumer), LZ4, Snappy, bzip2 or gzip can improve MR job performance. The chapter also covers choosing the correct Writable types and custom implementations of WritableComparator and RawComparator.
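To illustrate why combiners and compression help, here is a small sketch in plain Python that simulates a word-count mapper's output. It is not Hadoop code and is not taken from the book. The function names and the input are hypothetical, and `zlib` and `bz2` from the Python standard library stand in for Hadoop's gzip and bzip2 codecs (LZO, LZ4 and Snappy are not in the standard library, so they are left out).

```python
import bz2
import zlib
from collections import Counter


def map_phase(lines):
    """Emit one (word, 1) pair per word, as a word-count mapper would."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Sum counts per key locally, mimicking a combiner on one mapper's output."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def serialize(pairs):
    """Encode pairs as tab-separated lines (a stand-in for real map output)."""
    return "".join(f"{word}\t{count}\n" for word, count in pairs).encode("utf-8")


# Hypothetical input split: repetitive text, which favours both techniques.
lines = ["big data hadoop map reduce hadoop data"] * 1000

raw_pairs = map_phase(lines)
combined_pairs = combine(raw_pairs)
raw_bytes = serialize(raw_pairs)

print("records without combiner:", len(raw_pairs))
print("records with combiner:   ", len(combined_pairs))
print("uncompressed bytes:      ", len(raw_bytes))
print("zlib (gzip family) bytes:", len(zlib.compress(raw_bytes)))
print("bz2 bytes:               ", len(bz2.compress(raw_bytes)))
```

The combiner cuts the number of records that would have to be shuffled to the reducers, and compression cuts the bytes written and transferred for whatever remains. Real gains depend on how repetitive your keys and data are, and on the CPU cost of the codec you choose.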
The seventh chapter introduces best practices and recommendations for optimizing MapReduce jobs. It includes the common Hadoop cluster checklist, the BIOS and operating system recommendations, and Hadoop best practices and recommendations.
One good thing you will find here is a MapReduce code template for optimization. Since MapReduce jobs are all written in a similar style, using this template means you inherit the best practices of Hadoop users.
Possible topics for the next edition
The book does not discuss the configuration (yarn-site.xml) of MapReduce 2.0 YARN (Apache Hadoop NextGen MapReduce), which is said to be a framework with more features and better performance. It also does not contrast the old API with the new one. Though the configurations are similar, with some enhancements, they are worth including in the next edition for completeness.
Even though MR jobs are common, higher-level tools such as Pig and Hive are used far more heavily than hand-written MR jobs. How these configurations affect MR optimization under such tools could be a further topic for the next edition.
On the whole, this book is a reference for MR job optimization. For those working at production scale, it is worth reading and keeping a copy. For organizations running Hadoop at elephant scale, this is a book to grab for your library to improve MR job optimization.
This book is published by Packt Publishing, and you can buy it from them. It is available in both print and eBook formats.