Log Everything as JSON. Make Your Life Easier.
The story of an engineer.
Here is an anecdote. I am sure some of you have had a similar experience.
Alex, an engineer, logs all kinds of events. Since he is the primary consumer of the log, the format is optimized for human-readability. One canonical example is Apache logs:
10.0.1.22 - - [15/Oct/2010:11:46:46 -0700] "GET /favicon.ico HTTP/1.1" 404…
10.0.1.22 - - [15/Oct/2010:11:46:58 -0700] "GET / HTTP/1.1" 200…
This looks great. Time stamp, URL, HTTP status code…each line gives Alex a lot of information to work with if the service is having issues.
Bob, a business analyst, asks Alex for the number of daily unique users. Alex writes a parser for the Apache log and crontabs the script. He also builds a little Web interface so that his colleague can query the parsed data on his own. Bob finds the interface super useful.
Bob comes back a few weeks later and complains that the web interface is broken. Alex scratches his head and takes a look at the logs, only to realize that someone added an extra field to each line, breaking his custom parser. He pushes a fix and tells Bob that everything is okay again. Instead of writing a new feature, Alex has to go back and backfill the missing data.
Every three weeks or so, the previous step repeats.
What’s wrong with this?
The takeaway lesson of the above story is twofold: (1) logs are not just for humans to read and (2) logs change.
(1) Logs are not just for humans. As Paul Querna points out, the primary consumers of logs are shifting from humans to computers. This means log formats should have a well-defined structure that can be parsed easily and robustly.
(2) Logs change. If the logs never changed, writing a custom parser might not be too terrible. The engineer would write it once and be done. But in reality, logs change. Every time you add a feature, you start logging more data, and as you add more data, the printf-style format inevitably changes. This implies that the custom parser has to be updated constantly, consuming valuable development time.
Enter JSON!
Here is a suggestion: Start logging your data as JSON.
JSON has a couple of advantages over other “structures”.
Widely adopted. Most engineers know what JSON is, and there is a JSON library for every language imaginable. This means there is little overhead to parse logs.
Readable. Readability counts because engineers have to jump in and read the logs if there is a problem. JSON is text-based (as opposed to binary) and its format is a subset of JavaScript object literal syntax (which most engineers are familiar with). In fact, properly formatted JSON is easier to read than logs formatted in ad hoc ways.
Example of JSON-based Logging: Fluentd
We’ve already talked about Fluentd on this blog, so I won’t bore you with the details. It’s a logging daemon that can talk to a variety of services (e.g., MongoDB and Scribe).
One of the key features of Fluentd is that everything is logged as JSON. Here is a little code snippet that logs data to Fluentd from Ruby.
require 'fluent-logger'
# some code in between
log = Fluent::Logger::FluentLogger.new(nil, :host => 'localhost', :port => 24224)
log.post('myapp.access', {"user-agent" => user_agent})
Now, suppose you wanted to start logging the referrer URL in addition to user agent. You just need to update the Ruby hash that corresponds to JSON.
require 'fluent-logger'
# some code in between
log = Fluent::Logger::FluentLogger.new(nil, :host => 'localhost', :port => 24224)
log.post('myapp.access', {"user-agent" => user_agent, "referrer" => referrer_url}) # Added a field!
That’s the only change you need to make. All the existing scripts work as before, since all we did was add a new field to the existing JSON.
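To see why existing scripts keep working, here is a minimal sketch of a downstream consumer. It assumes each log entry ends up as one JSON object per line; the sample lines and the count_user_agents helper are made up for illustration and are not part of Fluentd.

```ruby
require 'json'

# Hypothetical consumer: counts requests per user agent from JSON log lines.
# It looks fields up by name, so fields it does not know about are ignored.
def count_user_agents(lines)
  counts = Hash.new(0)
  lines.each do |line|
    record = JSON.parse(line)
    counts[record["user-agent"]] += 1
  end
  counts
end

# Made-up sample data: the last two entries carry the new "referrer" field.
SAMPLE_LINES = [
  '{"user-agent": "Mozilla/5.0"}',
  '{"user-agent": "Mozilla/5.0", "referrer": "http://example.com/"}',
  '{"user-agent": "curl/7.21", "referrer": "http://example.com/about"}'
]

puts count_user_agents(SAMPLE_LINES).inspect
```

The script never mentions the referrer, so old and new entries are counted the same way.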
In contrast, imagine you were logging the same data in a printf-inspired format. Your code initially looks like this:
log = CustomLogger.new
# some code in between
log.post("web.access", "user-agent: #{user_agent} blah blah")
When you decide to log the referrer URL, you update it to:
log = CustomLogger.new
# some code in between
log.post("web.access", "user-agent: #{user_agent} blah blah referrer: #{referrer_url}")
Now, most likely your old parser is broken, and you have to go and update your regex and whatnot.
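Here is a rough sketch of how that breakage happens. The regex and the parse_user_agent helper are hypothetical stand-ins for the kind of hand-written parser described in the story; the log lines are made up.

```ruby
# Hypothetical parser for the printf-style line. The pattern is anchored to
# the exact original layout, as hand-written log parsers often are.
OLD_FORMAT = /\Auser-agent: (?<user_agent>.+) blah blah\z/

def parse_user_agent(line)
  match = OLD_FORMAT.match(line)
  match && match[:user_agent]
end

old_line = "user-agent: curl/7.21 blah blah"
new_line = "user-agent: curl/7.21 blah blah referrer: http://example.com/"

p parse_user_agent(old_line)  # the old layout still parses
p parse_user_agent(new_line)  # the new layout no longer matches, so this is nil
```

Nothing crashes; the parser just quietly returns nothing for every new line, which is exactly how Bob’s dashboard ends up with missing data.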
We are biased towards Fluentd because we wrote it ourselves. But regardless of what software/framework you choose for logging, you should start logging everything as JSON right away.
Enabling Facebook’s Log Infrastructure with Fluentd
About
This post shows how you can replace Scribe with Fluentd.
What is Scribe?

Facebook uses Scribe as its core log aggregation service. Its description on GitHub reads, “Scribe is a server for aggregating log data streamed in real time from a large number of servers.”
A network of Scribe servers forms a directed graph. Each server is a node and directed edges represent lines of communication. Usually, Scribe is installed on every node, and logs are collected to one giant “aggregator” node. The collected logs are written into HDFS (Hadoop Distributed File System) and later analyzed by Hadoop MapReduce or Hive.

Scribe is quite popular. In addition to Facebook, Twitter and Zynga use Scribe in production.
Why Fluentd?
Scribe is solid. It has been effectively deployed at several web powerhouses with serious scalability challenges. So, why would you switch to Fluentd? The answer is threefold: 1) Ease of management, 2) Flexibility, and 3) Compatibility.
1) Ease of Management
Scribe is insanely difficult to install correctly. Not only do you need to build Boost, Thrift, and libhdfs from source, you must also pick the correct versions of the software or the build will fail. In contrast, installing and deploying Fluentd is a breeze. It comes with rpm/deb packages maintained by Treasure Data, Inc. (That’s us!) If you use Chef (a systems integration framework), you can use the cookbook we have authored, too.
2) Flexibility
Scribe is fast because it’s written in C++. But C++’s hairiness makes Scribe difficult to modify or extend. On the other hand, Fluentd is written in ~3,000 lines of Ruby, and you can easily customize and extend its behavior. In terms of performance, Scribe definitely beats Fluentd, but Fluentd is quite competent: it supports a multi-process mode and can handle up to 20,000 messages per second on a single host. If that’s not good enough, go ahead and choose Scribe. I hope you don’t get stuck in versioning hell ;-)
3) Compatibility
Thanks to its extendable design, Fluentd already has a Scribe plug-in that supports log aggregation via Thrift. This plug-in is 100% compatible with Scribe and can replace an existing instance of Scribe out of the box.
Just to show off Fluentd’s versatility…Fluentd also has a plug-in that can output to Hoop, a REST HTTP gateway with full support for HDFS operations. For the list of all the officially supported plug-ins, please check out the Fluent Github repo.
Installation
These plug-ins are assumed to be installed with Fluentd.
deb/rpm packages are by far the easiest way to install all three. Here are the relevant links:
Configuration
This section walks you through how to replace a Scribe-based system with a Fluentd-based system. Don’t worry, it really is a drop-in replacement.

Configuring Fluentd on Front-end Nodes
For front-end nodes, the Scribe Input and Output plug-ins are used (see below). If you have multiple aggregator nodes, you can use the Copy plug-in (http://fluentd.org/doc/plugin.html#copy).
# Scribe Input
<source>
  type scribe
  port 1463
  add_prefix scribe
</source>
# Scribe Output
<match scribe.*>
  type scribe
  host log-aggregator-host
  port 1463
  field_ref message
</match>
Configuring Fluentd on Log-Aggregator Nodes
The aggregator nodes receive requests through the Scribe Input plug-in and write to HDFS with the Hoop plug-in. The received logs are buffered and periodically appended to the existing log files on HDFS.
<source>
  type scribe
  port 1463
  add_prefix scribe
</source>
<match scribe.*>
  type hoop
  hoop_server hoop-server:14000
  path /hoop/%Y%m%d/scribe-%Y%m%d-%H.log
  username username
  time_slice_wait 30s
  flush_interval 5s
  output_include_time false
  output_include_tag true
  output_data_type attr:message
  add_newline false
  remove_prefix scribe
  default_tag unknown
</match>
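The path setting above contains strftime-style placeholders. Assuming the plug-in expands them the way Ruby’s Time#strftime does (a guess based on the syntax, not checked against the plug-in source), the hourly file for a given event time could be worked out as in this small sketch. The hdfs_path_for helper and the sample time are made up.

```ruby
# Sketch only: expands the time placeholders from the path setting with
# Ruby's standard Time#strftime. The event time below is a made-up example.
PATH_TEMPLATE = "/hoop/%Y%m%d/scribe-%Y%m%d-%H.log"

def hdfs_path_for(time)
  time.strftime(PATH_TEMPLATE)
end

puts hdfs_path_for(Time.utc(2011, 11, 27, 7, 56, 27))
# => /hoop/20111127/scribe-20111127-07.log
```

In other words, logs land in one directory per day and one file per hour.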
Conclusion
Fluentd brings Facebook-like log aggregation infrastructure to your servers. The only difference is your system is a lot more flexible and does not require an army of engineers to maintain :)
And we’re hiring!
At Treasure Data, we are writing powerful software that makes Big Data accessible. All of your time should go into data analysis, not data management. We are here to help you do that.
We have a number of technical challenges ahead of us. We are small (a team of six) and actively looking for hackers and product managers who want to transform how people analyze Big Data. If you think you are a fit, please let us know. We’d love to talk to you!
Further Readings
- Fluentd Scribe Plugin
- Fluentd Hoop Plugin
- Fluentd Documentation
- Fluentd Plugins List
- Fluentd Source Code
Acknowledgement
Satoshi Tagomori contributed the Hoop and Scribe plug-ins for Fluentd. He has also run comprehensive Fluentd benchmarks (in Japanese). Thanks Satoshi!
Real-Time Log Collection with Fluentd and MongoDB
About
This post shows how to use the Fluentd MongoDB plugin to aggregate semi-structured logs in real time.
Background
Fluentd is an advanced open-source log collector developed at Treasure Data, Inc (see previous post). Because Fluentd handles logs as semi-structured data streams, the ideal database should have strong support for semi-structured data. There are several databases that meet this criterion, but we believe MongoDB is the market leader.
For those of you who do not know what MongoDB is, it is an open-source, document-oriented database developed at 10gen, Inc. It is schema-free and uses a JSON-like format to manage semi-structured data.
This post shows how to import Apache logs into MongoDB with Fluentd, using very little configuration.
Mechanism
The figure below shows how things work.

Fluentd does 3 things:
- It continuously “tails” the access log.
- It parses the incoming log entries into meaningful fields (such as ip, path, etc.) and buffers them.
- It writes the buffered data to MongoDB periodically.
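To make the parsing step concrete, here is a rough sketch of turning one access log line into named fields. The regex is a hand-written approximation of the Combined Log Format, not the parser that ships with Fluentd, and the sample line and the parse_apache_line helper are made up.

```ruby
require 'time'

# Illustrative pattern for the Apache Combined Log Format (an approximation
# written for this post, not Fluentd's built-in parser).
APACHE_COMBINED = /\A(?<host>\S+) \S+ (?<user>\S+) \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+) [^"]*" (?<code>\d{3}) (?<size>\S+)(?: "(?<referer>[^"]*)" "(?<agent>[^"]*)")?\z/

# Returns a Hash of fields, or nil when the line does not match.
def parse_apache_line(line)
  m = APACHE_COMBINED.match(line.chomp)
  return nil unless m
  record = m.names.zip(m.captures).to_h
  record["time"] = Time.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
  record
end

line = '127.0.0.1 - - [27/Nov/2011:07:56:27 +0000] "GET / HTTP/1.1" 200 44 "-" "ApacheBench/2.3"'
p parse_apache_line(line)
```

Each resulting hash is the kind of record that would then be buffered and written to MongoDB.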
Install
For simplicity, this post shows the one-node configuration. You should have the following software installed on the same node.
- Fluentd with MongoDB Plugin
- MongoDB
- Apache (with the Combined Log Format)
The most recent Fluentd deb/rpm package includes the MongoDB plugin. If you want to use RubyGems to install the plugin, gem install fluent-plugin-mongo does the job.
For MongoDB, please refer to the downloads page.
Configuration
Let’s start the actual configuration. If you use deb/rpm, Fluentd’s config file is located at /etc/td-agent/td-agent.conf. Otherwise, it is located at /etc/fluentd/fluentd.conf.
Tail Input
For input, let’s set up Fluentd to track the recent Apache logs (usually at /var/log/apache2/access_log). This is what the Fluentd configuration looks like.
<source>
  type tail
  format apache
  path /var/log/apache2/access_log
  tag mongo.apache
</source>
Let’s go through the configuration line by line.
- type tail: The tail plugin continuously tracks the log file. This handy plugin is part of Fluentd’s core plugins.
- format apache: Use Fluentd’s built-in Apache log parser, which splits each log entry into meaningful fields.
- path /var/log/apache2/access_log: The location of the Apache log, assumed here to be /var/log/apache2/access_log.
- tag mongo.apache: Tags each parsed entry with mongo.apache so that Fluentd can route it to the matching output.
That’s it. You should be able to output a JSON-formatted data stream for MongoDB to consume.
MongoDB Output
The output configuration should look like this:
<match mongo.**>
  # plugin type
  type mongo
  # mongodb db + collection
  database apache
  collection access
  # mongodb host + port
  host localhost
  port 27017
  # interval
  flush_interval 10s
</match>
The match section specifies the pattern used to match tags. If a tag matches, the config inside <match>...</match> is used. In this example, the mongo.apache tag (generated by tail) always matches.
The ** in mongo.** matches zero or more period-delimited tag elements (e.g. mongo, mongo.a, and mongo.a.b). flush_interval indicates how often the data is written to the database (MongoDB in this case). The other options specify MongoDB’s host, port, database, and collection.
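The ** rule is easy to get wrong, so here is a small sketch that mimics it. The tag_match? helper is hypothetical and is not Fluentd’s own matching code; it also handles a single *, which in Fluentd’s match syntax stands for exactly one tag element.

```ruby
# Hypothetical re-implementation of the tag matching rules, for illustration:
#   *  matches exactly one period-delimited tag element
#   ** matches zero or more tag elements
def tag_match?(pattern, tag)
  match_parts?(pattern.split("."), tag.split("."))
end

def match_parts?(pattern_parts, tag_parts)
  return tag_parts.empty? if pattern_parts.empty?
  head, *rest = pattern_parts
  case head
  when "**"
    # Try letting ** consume 0, 1, 2, ... of the remaining tag elements.
    (0..tag_parts.length).any? { |n| match_parts?(rest, tag_parts.drop(n)) }
  when "*"
    !tag_parts.empty? && match_parts?(rest, tag_parts.drop(1))
  else
    tag_parts.first == head && match_parts?(rest, tag_parts.drop(1))
  end
end

%w[mongo mongo.apache mongo.apache.access other.apache].each do |tag|
  puts "#{tag}: #{tag_match?('mongo.**', tag)}"
end
```

Only other.apache fails to match here, because its first element is not mongo.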
Test
To test the configuration, just ping the Apache server however you want. This example uses the ab (Apache Bench) program.
$ ab -n 100 -c 10 http://localhost/
Then, let’s access MongoDB and see the stored data.
$ mongo
> use apache
> db.access.find()
{ "_id" : ObjectId("4ed1ed3a340765ce73000001"), "host" : "127.0.0.1", "user" : "-", "method" : "GET", "path" : "/", "code" : "200", "size" : "44", "time" : ISODate("2011-11-27T07:56:27Z") }
{ "_id" : ObjectId("4ed1ed3a340765ce73000002"), "host" : "127.0.0.1", "user" : "-", "method" : "GET", "path" : "/", "code" : "200", "size" : "44", "time" : ISODate("2011-11-27T07:56:34Z") }
{ "_id" : ObjectId("4ed1ed3a340765ce73000003"), "host" : "127.0.0.1", "user" : "-", "method" : "GET", "path" : "/", "code" : "200", "size" : "44", "time" : ISODate("2011-11-27T07:56:34Z") }
Conclusion
Fluentd + MongoDB make real-time log collection simple, easy and robust.
Acknowledgement
Masahiro Nakagawa contributed the MongoDB plugin for Fluentd. Thanks Masahiro!
Fluentd: the missing log collector
About
This post introduces Fluentd, an open-source log collector developed at Treasure Data, Inc.
The Problems
The fundamental problem with logs is that they are usually stored in files although they are best represented as streams (an observation by Adam Wiggins, CTO at Heroku). Traditionally, they have been dumped into text-based files and collected by rsync on an hourly or daily basis. With today’s web/mobile applications, this creates two problems.
Problem 1: Need Ad-Hoc Parsing
Text-based logs each have their own format, and analytics engineers need to write a dedicated parser for each one. But that’s probably not the best use of your time. You should be analyzing data to make better business decisions instead of writing one parser after another.
Problem 2: Lacks Freshness
The logs lag. Real-time analysis of user behavior makes feature iterations a lot faster, and nimbler A/B testing will help you differentiate your service from competitors.
This is where Fluentd comes in. We believe Fluentd solves all issues of scalable log collection by getting rid of files and turning logs into true semi-structured data streams.
What’s Fluentd?
The best way to describe Fluentd is ‘syslogd that understands JSON’. The notable features are:
- Easy installation by rpm/deb/gem
- Small footprint with 3000 lines of Ruby
- Semi-Structured data logging
- Easy start with small configuration
- Fully pluggable architecture, and plugin distribution by Ruby gems
Other similar systems are Facebook’s Scribe and Cloudera’s Flume. Here is a table summarizing the differences among Scribe, Flume, and Fluentd. (Note: I don’t know much about the next-generation Flume NG branch, but big changes are happening in Flume!)

Of course, there are pros and cons here. Fluentd favors extensibility and flexibility by building on Ruby’s ecosystem, while Scribe favors performance (although Fluentd is pretty fast too: it can handle 18,000 messages per second per core). Flume is powered by Java and therefore integrates natively with many enterprise systems.
The following sections describe the basic concepts of Fluentd in more detail.
LogEntry = time + tag + record
Unlike traditional raw-text logs, a Fluentd log entry consists of three entities: time, tag, and record.

- The time is the UNIX timestamp when the logs are posted.
- The tag is used to route the message in log-forwarding, which is described later.
- The record is represented as JSON, not raw text.
The record is intentionally represented as JSON. Fluentd is designed to collect semi-structured data, not unstructured data. This means no parsing is required later in the analysis pipeline, which makes the data easier to handle and faster to process than with ad hoc regexps. The trade-off is that the application needs to use a logging library for Fluentd.
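As a rough mental model, a log entry can be pictured as a three-field structure. The LogEntry struct and the build_entry helper below are made up for this sketch and are not Fluentd’s internal classes.

```ruby
require 'json'

# Illustrative model of a log entry: time + tag + record.
# The names and sample values here are invented for this sketch.
LogEntry = Struct.new(:time, :tag, :record)

def build_entry(tag, record, time = Time.now)
  LogEntry.new(time.to_i, tag, record)  # time is kept as a UNIX timestamp
end

entry = build_entry("myapp.access", { "user-agent" => "curl/7.21" }, Time.utc(2011, 11, 27, 7, 56, 27))
puts entry.time                   # seconds since the epoch
puts entry.tag                    # used for routing
puts JSON.generate(entry.record)  # the record is plain JSON, no parsing rules needed
```

Because the record is already structured, a consumer reads fields by name instead of guessing at a text layout.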
Internal Architecture: Input -> Buffer -> Output
Fluentd consists of three basic components: Input, Buffer, and Output. The basic behavior is to 1) feed logs in through Input, 2) buffer them, and 3) forward them to Output.

Input
Input is where logs come in. The user can extend it to feed in events from various sources. The officially supported Inputs include HTTP+JSON, tailing files (an Apache log parser is supported), and syslog. Of course, you can add your own Input by writing a Ruby plugin.
Buffer
Buffer exists for reliability. When the Output fails, the events are kept in the Buffer and automatically retried. Memory and File Buffers are currently supported.
Output
Buffer creates chunks of logs and passes them to the Output, which stores or forwards them. The Buffer waits anywhere from several seconds to one minute to build each chunk. This is really efficient when writing to storage that supports batch-style importing.
Many Input/Output plugins are under heavy development in the community: MongoDB, Redis, CouchDB, Amazon S3, Amazon SQS, Scribe, 0MQ, AMQP, Delayed, Growl, etc.
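To illustrate the Input -> Buffer -> Output flow and the retry behavior, here is a toy model. The ToyBuffer class is invented for this post; Fluentd’s real buffer plugins are more elaborate (time-based flushing, file-backed storage, and so on), and this sketch chunks by event count only to stay short.

```ruby
# Toy model of the Input -> Buffer -> Output flow. All names are made up.
class ToyBuffer
  def initialize(chunk_size, &output)
    @chunk_size = chunk_size
    @output = output   # block that stores or forwards one chunk
    @pending = []      # events not yet grouped into a chunk
    @chunks = []       # chunks waiting to be written
  end

  # Called by the input side for every incoming event.
  def emit(event)
    @pending << event
    return if @pending.size < @chunk_size
    @chunks << @pending
    @pending = []
  end

  # Try to write queued chunks; a failed chunk stays queued for the next flush.
  def flush
    until @chunks.empty?
      begin
        @output.call(@chunks.first)
      rescue StandardError
        return false
      end
      @chunks.shift
    end
    true
  end

  def queued_chunks
    @chunks.size
  end
end

written = []
attempts = 0
buffer = ToyBuffer.new(2) do |chunk|
  attempts += 1
  raise IOError, "storage unavailable" if attempts == 1  # simulate one failure
  written << chunk
end

%w[a b c d].each { |event| buffer.emit(event) }
buffer.flush  # first attempt fails, but nothing is lost
buffer.flush  # the retry succeeds
p written     # => [["a", "b"], ["c", "d"]]
```

The output only ever sees whole chunks, which is what makes batch-style storage a good fit.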
Log Forwarding
Fluentd works well on a single node, but it also supports multi-node configurations.

Each application server runs a local Fluentd, which forwards the local logs to another Fluentd that aggregates all the logs in one place. The tag is used to determine the destination Fluentd (statically configured in config files).
Conclusion
Fluentd makes real-time log collection dead simple. Out of the possible solutions, we believe Fluentd is the easiest to install, configure, and extend, and it performs well.
Of course, it’s an early-stage product compared to Scribe and Flume, but we already have some users aggregating tens of millions of logs daily using Fluentd. The number of committers and plugins is increasing every day.
We’re hiring!
At Treasure Data, we strive to eliminate obstacles for analyzing Big Data. We believe that all of your time should go into data analysis. We are here to build powerful tools to help you do that.
We have a number of technical challenges ahead of us. We are small (a team of five so far) and actively looking for hackers and product managers who want to transform how people analyze Big Data. If you think you are a fit, please let us know. We’d love to talk to you!
MessagePack: the missing serializer
MessagePack in a nutshell
Greetings! We are kicking off the Treasure Data blog with MessagePack, the efficient, blazing fast serializer at the core of our technology.
The best way to describe MessagePack is “JSON on steroids”. It supports almost the same set of data types as JSON (Nil, Boolean, Integer, Float, String, Array, and Associative Array) but runs much faster and requires a fraction of the space.
The gory details
MessagePack is fast and space-efficient for a couple of reasons.
- Stream deserializer.
MessagePack’s protocol is designed so that one can start deserializing the buffered data before all the data is received. The user simply appends new data to the buffer and starts deserializing it right away. The real benefit of a stream deserializer is pipelining: by overlapping deserialization and data reception, one can cut down the total time drastically.
- “Zero-copy” serializer/deserializer.

MessagePack’s dramatic speedup comes from “zero-copy” serialization (currently implemented only in the C++ and D libraries). As the name suggests, “zero-copy” serialization copies no data. Well, almost.
Instead of copying the entire data, the library keeps track of just enough metadata to recover the object for read operations. “Zero-copy” deserialization works similarly but the other way around. The absence of copy operations speeds up serialization/deserialization, especially for large data.
- Being smart about serialization schema.
Like many other efficient messaging protocols, MessagePack is a binary protocol. Furthermore, it is optimized to store common data types compactly. Here is a quick comparison with JSON.
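To get a feel for the compactness, here is a hand-rolled encoder for a tiny subset of the format. It is only a sketch written for this post: it covers nil, booleans, integers from 0 to 127, strings shorter than 32 bytes, and arrays/maps with fewer than 16 entries, using the single-byte headers the format reserves for such small values. The tiny_msgpack name is made up, and real applications should use an actual MessagePack library.

```ruby
require 'json'

# Minimal encoder for a small subset of MessagePack, for illustration only.
def tiny_msgpack(obj)
  case obj
  when nil   then [0xC0].pack("C")
  when false then [0xC2].pack("C")
  when true  then [0xC3].pack("C")
  when Integer
    raise ArgumentError, "only 0..127 supported" unless (0..127).cover?(obj)
    [obj].pack("C")                            # small integers are one byte
  when String
    bytes = obj.b
    raise ArgumentError, "string too long" if bytes.bytesize > 31
    [0xA0 | bytes.bytesize].pack("C") + bytes  # 1-byte header + raw bytes
  when Array
    raise ArgumentError, "array too long" if obj.size > 15
    [0x90 | obj.size].pack("C") + obj.map { |v| tiny_msgpack(v) }.join
  when Hash
    raise ArgumentError, "map too large" if obj.size > 15
    [0x80 | obj.size].pack("C") + obj.map { |k, v| tiny_msgpack(k) + tiny_msgpack(v) }.join
  else
    raise ArgumentError, "unsupported type: #{obj.class}"
  end
end

data = { "compact" => true, "schema" => 0 }
puts JSON.generate(data).bytesize    # 27 bytes as JSON
puts tiny_msgpack(data).bytesize     # 18 bytes in this encoding
puts tiny_msgpack(data).unpack("H*").first
```

The savings come from dropping quotes, colons, and commas, and from folding the type and length of small values into a single header byte.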

- Community, Community, Community.
Since the inception of the MessagePack project, we have had the fortune of having experts implement the library for each programming language. Instead of asking them to write a simple wrapper around the core C implementation, we encouraged them to go as low-level and hardcore as possible to squeeze in as many implementation-specific optimizations as possible.
For example, the Ruby library has “zero-copy” deserialization implemented. This blog post shows how the Python implementation of MessagePack runs circles around every other serialization library. The community is active and growing, and the performance of each library continues to improve.
And this is only the beginning
Treasure Data eliminates obstacles for analyzing Big Data. All of your time should go into data analysis, not management. We are here to build powerful tools to help you do that.
We have a number of technical challenges ahead of us. We are small (a team of five so far) and actively looking for hackers and product managers who want to transform how people analyze Big Data. If you think you are a fit, please let us know. We’d love to talk to you!