Distributed Log Collection for Hadoop using Apache Flume: Book Review

Today I’ll be sharing a review of another book on the Hadoop ecosystem, titled “Apache Flume: Distributed Log Collection for Hadoop”, written by Steve Hoffman and published by Packt Publishing. You can find the book at https://www.packtpub.com/application-development/apache-flume-distributed-log-collection-hadoop-second-edition.

It may sound like I’m selling this book, but I’m not. I’m recommending it because it will help developers who are into Big Data and who build cutting-edge solutions for elephant-scale data. Hadoop has been evolving rapidly over recent years; its ecosystem is no longer just Hadoop alone, as many tools have been added to it over time. Don’t know about the Hadoop ecosystem? Read about it here.

This book contains 178 pages, nicely divided into 9 chapters.

What’s good about this book?

  1. Starts with a fair bit of introduction to what Flume is and why it was introduced by Cloudera engineers.
  2. Discusses the architecture of Flume with proper diagrams, and covers both the Flume 0.9 and 1.x versions.
  3. Gives in-depth knowledge of Flume sources, channels, and sinks (a minimal agent configuration tying these together is sketched after this list).
  4. Details the configuration of three channels, namely the memory channel, the file channel, and the spillable memory channel, and discusses the pros and cons of each so you know which channel to use when.
  5. Talks about different sinks and their configurations, such as the popular Hadoop HDFS sink, the Morphline Solr sink, and the Elasticsearch sink.
  6. Introduces the different sources available, such as the exec source, the spooling directory source, syslog sources, and even JMS messaging as a source. Though only ActiveMQ (one JMS implementation) is tested, it should work with other JMS implementations as well.
  7. Talks in depth about interceptors and ETL (Extract, Transform, Load) while moving data from sources through channels as events and writing it to sinks.
  8. A good discussion of how to monitor Flume using Monit and Nagios, and how to collect performance metrics using Ganglia.
  9. The most interesting part of this book is the chapter that walks through a real-time use case: collecting logs from an Nginx web server installed on Amazon EC2 servers. Web server logs act as sources, flow through channels as events, and are written to Elasticsearch sinks. These logs are then viewed and searched through a web UI using Kibana.
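
To give a flavour of what the source, channel, and sink configurations covered in those chapters look like, here is a minimal sketch of a Flume 1.x agent definition. It is not taken from the book; the agent name, directory paths, and HDFS URL below are illustrative assumptions. It wires a spooling directory source through a file channel to an HDFS sink, with a timestamp interceptor added on the source.

```properties
# Illustrative Flume 1.x agent config (agent name, paths, and HDFS URL are assumptions, not from the book)
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = snk1

# Spooling directory source: picks up files dropped into a local directory
agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /var/log/flume-spool
agent1.sources.src1.channels = ch1

# Timestamp interceptor: stamps each event so the HDFS path can be time-bucketed
agent1.sources.src1.interceptors         = ts
agent1.sources.src1.interceptors.ts.type = timestamp

# File channel: durable buffering of events on local disk
agent1.channels.ch1.type          = file
agent1.channels.ch1.checkpointDir = /var/lib/flume/checkpoint
agent1.channels.ch1.dataDirs      = /var/lib/flume/data

# HDFS sink: writes events into date-partitioned directories as plain text
agent1.sinks.snk1.type              = hdfs
agent1.sinks.snk1.channel           = ch1
agent1.sinks.snk1.hdfs.path         = hdfs://namenode:8020/flume/events/%Y/%m/%d
agent1.sinks.snk1.hdfs.fileType     = DataStream
agent1.sinks.snk1.hdfs.rollInterval = 300
```

An agent defined this way would typically be started with something like `flume-ng agent --conf conf --conf-file agent1.conf --name agent1`. The Elasticsearch/Kibana use case from the last chapter follows the same source-channel-sink pattern, just with a different sink type and its own settings.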

What’s bad?

I believe the glaring problem with this book is that it doesn’t describe a whole project in any of its chapters. The chapters mostly focus on individual aspects of Apache Flume. They are detailed, but it’s hard to see the overview or general outlook of each chapter. Maybe that’s because I’m used to expecting a complete example for each programming topic, and I prefer to have a full project laid out and then described section by section. There are samples in the book with some property configurations, but it’s up to the reader to piece together the whole project. Overall, though, this book is a great reference for Hadoop developers on how to use Flume to feed streaming data into Hadoop or any of its other sinks.

Happy Learning!

Interested in learning Hadoop? Read here.
