Treasure Data

    • Service
      • How it Works
      • Pricing
    • Business
      • White Papers
      • Success Stories
    • Developers
      • Documentation
      • Support
    • Company
      • About Us
      • Partners
      • Blog
      • Contact Us
      • Careers
    • Newsroom
      • Press Releases
      • In The News
  • Sign Up!

Fluentd + Hadoop: Instant Big Data Collection

  • Hadoop
  • Fluentd
  • data collection

Tweet

About

This post describes how to use Fluentd’s newly released WebHDFS plugin to aggregate semi-structured logs into Hadoop HDFS.

Background

Fluentd is a JSON-based, open-source log collector originally written at Treasure Data. Fluentd is specifically designed for solving big data collection problem.

Many companies choose Hadoop Distributed Filesystem (HDFS) for big data storage. [1] Until recently, however, the only API interface was Java. This changed with the new WebHDFS interface, which allows users to interact with HDFS via HTTP. [2]

This post shows you how to set up Fluentd to receive data over HTTP and upload it to HDFS via WebHDFS.

Mechanism

The figure below shows the high-level architecture.

Install

For simplicity, this post shows the one-node configuration. You should have the following software installed on the same node.

  • Fluentd with WebHDFS Plugin
  • HDFS

Fluentd’s most recent version of deb/rpm package (v1.1.10 or later) includes the WebHDFS plugin. If you want to use Ruby Gems to install the plugin, gem install fluent-plugin-webhdfs does the job.

  • Debian Package
  • RPM Package
  • For CDH, please refer to the downloads page (CDH3u5 and CDH4 or later)

Fluentd Configuration

Let’s configure Fluentd. If you use deb/rpm, the Fluentd’s config file is located at /etc/td-agent/td-agent.conf. Otherwise, it is located at /etc/fluentd/fluentd.conf.

HTTP Input

For input, let’s set up Fluentd to accept data from HTTP. This is what the Fluentd configuration looks like.

  type http
  port 8080

WebHDFS Output

The output configuration should look like this:

  type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /log/%Y%m%d_%H/access.log.${hostname}
  flush_interval 10s

The match section specifies the regexp to match the tags. If the tag is matched, then the config inside it is used.

flush_internal indicates how often data is written to HDFS. Append operation is used to append the incoming data to the file specified by the path parameter.

For the value of path, you can use the placeholders for time and hostname (notice how %Y%m%d_%H and ${hostname} are used above). This prevents multiple Fluentd instances to append the data into the same file, which must be avoided for append operation.

The other two options, host and port, specify HDFS’s NameNode host and port respectively.

HDFS Configuration

Append is disabled by default. Please put these configurations into your hdfs-site.xml and restart the whole cluster.

  dfs.webhdfs.enabled
  true



  dfs.support.append
  true



  dfs.support.broken.append
  true

Also, please make sure that path specified in Fluentd’s WebHDFS output is configured to be writable by hdfs user.

Test

To test the setup, just post a JSON to Fluentd. This example users curl command to do so.

$ curl -X POST -d 'json={"action":"login","user":2}' \
  http://localhost:8080/hdfs.access.test

Then, let’s access HDFS and see the stored data.

$ sudo -u hdfs hadoop fs -lsr /log/
drwxr-xr-x   - 1 supergroup          0 2012-10-22 09:40 /log/20121022_14/access.log.dev

Success!

Conclusion

Fluentd + WebHDFS make real-time log collection easy, robust and scalable! @tagomoris has been using this plugin to collect 100,000 msgs/sec for a couple of months to help NHN Japan analyze big data.

And we’re hiring!

At Treasure Data, we are building a massively scalable cloud data warehouse that brings the power of big data analytics at your fingertips in days, not months. All of your time should go into getting more value from your data, not managing and troubleshooting your data pipeline.

Have you designed and built your own distribute systems before? Does the idea of managing 150+ billion data items (growing at 1 billion per day) excite you? Or, are you an open-source enthusiast with an entrepreneurial bent? If so, please drop us a line at info@treasure-data.com!

Further Readings

  • Fluentd + WebHDFS Tutorial
  • Fluentd WebHDFS Plugin
  • Fluentd Documentation
  • Fluentd Plugins List
  • Fluentd Source Code
  • Tutorial: Store Apache Logs into Amazon S3
  • Tutorial: Store Apache Logs into MongoDB

Acknowledgement

Satoshi Tagomori contributed the WebHDFS plugin and battle-tested it in a super large-scale production environment. Thanks Satoshi!

Kaz


  1. Some of Fluentd users have been using fluent-plugin-mongo with MongoDB quite successfully.
  2. WebHDFS is supported for Apache 1.0.0 (and later), CDH3u5 (and later) and CDH4 (and later).
  • 6 months ago
  • 1
  • Comments
  • Permalink
  • Share
    Tweet

1 Notes/ Hide

  1. ykhroki likes this
  2. treasure-data posted this

Recent comments

Blog comments powered by Disqus
← Previous • Next →

Treasure Data Blog

About

Hi, we're Treasure Data, Inc. We provide a cloud-based platform to help you store and analyze data.

Treasure Data Elsewhere

  • @TreasureData on Twitter
  • Facebook Profile
  • Linkedin Profile
  • treasure-data on github

See more →
  • RSS
  • Random
  • Archive
  • Mobile

Treasure Data Blog uses Treasure Data, Inc. All Rights Reserved. Effector Theme by Carlo Franco.

Powered by Tumblr