This month, Parviz Deyham from Amazon Web Services promoted Fluentd as the best data collection tool for Amazon Elastic MapReduce (EMR), a hosted Hadoop framework running on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
In the best practices whitepaper, Parviz, an Enterprise Solution Architect at AWS, notes that, "Fluentd is easier to install and maintain and has better documentation and support than Flume and Scribe." Collecting data in a scalable and reliable manner has an important place in big data architecture. Many big data analytics solutions fail to provide robust tools for data collection, or they require developers to write custom data collectors from the origin to the final collection point.
While writing custom data collectors is a valid approach, users can instead leverage open source frameworks that already provide scalable and efficient distributed data collection. Open source software is part of our DNA at Treasure Data, and we are thrilled that Parviz sees the value in a versatile and lightweight data collection tool like Fluentd to stream data efficiently to the cloud.
We would like to thank the entire Fluentd community for their dedication and contributions to making Fluentd a first class data collection tool. This recommendation from Amazon validates the philosophy and benefits of Fluentd’s architecture, maintainability, and simplicity.
Treasure Data has been developed by Hadoop experts. We get Hadoop, and, in many ways, it’s part of our core. As we built out the platform, we noticed that the storage layer needed to be multi-tenant, elastic, and easy to manage while retaining scalability and efficiency. This led us to create Plazma, our own distributed columnar storage system, in place of HDFS. We wanted to leverage the “store everything now, analyze later” model of our schema-less architecture and provide better performance in terms of storage and query processing.
By separating Hadoop’s MapReduce processing engine from the storage layer, we were able to optimize the elasticity, efficiency, and reliability of the system. Making our system more modular also allowed us to use columnar storage for our data, so queries read only the relevant columns instead of scanning the whole dataset. Plazma lets us process queries faster, manage databases more simply, and make better use of our schema-less database architecture.
We achieved our technical goals by architecting Plazma in the following ways:
- JSON processing: automatically converts row-based JSON objects into a columnar format
- Columnar storage: uses a columnar file storage format which significantly reduces disk IO for analytical queries
- IO optimizations: implements various IO optimizations such as parallel pre-fetch and background decompression
- Scalability and ease of management: Plazma is built on top of object-based storage, which is easier to scale and maintain
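To make the first two points concrete, here is a minimal sketch in Python of pivoting row-based JSON objects into a columnar layout. This illustrates the general technique only, not Plazma’s actual implementation; the function name and sample records are hypothetical.

```python
import json

def to_columnar(records):
    """Pivot a list of row-based JSON objects into a column-oriented dict.

    Columns are keyed by field name; rows missing a field get None, which
    preserves the schema-less "store everything now" model.
    """
    columns = {}
    for i, record in enumerate(records):
        for key, value in record.items():
            # Backfill None for rows seen before this column first appeared.
            columns.setdefault(key, [None] * i).append(value)
        # Pad columns absent from this record so all columns stay aligned.
        for column in columns.values():
            if len(column) < i + 1:
                column.append(None)
    return columns

rows = [json.loads(line) for line in (
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob", "clicks": 5, "country": "US"}',
)]
print(to_columnar(rows))
# {'user': ['alice', 'bob'], 'clicks': [3, 5], 'country': [None, 'US']}
```

Once data sits in this shape, an analytical query that touches only `clicks` can skip the other columns entirely, which is where the disk IO savings come from.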
These are some of the key innovations we made with Plazma to optimize query processing and storage. Some companies argue that leveraging HDFS lets a business take advantage of open source innovation, which is preferable to proprietary alternatives. For our purposes, however, Plazma is much more efficient in terms of query processing, and separating the processing and storage layers improves both performance and manageability.
While this technology is currently proprietary to Treasure Data, we have discussed open sourcing it to provide developers with the tools they need for efficient distributed storage systems meant for big data analytics processing.
What do you think? Would you find this kind of technology useful and would you be interested in using it? Leave your thoughts in the comments.
The entire Treasure Data team has been thinking a lot about our open source log collector tool Fluentd. Based on feedback from developers in the community, the maintainers (a couple of the core maintainers are Treasure Data engineers) have been focusing on a few key components.
Here is a sample of the key features in development on our roadmap:
- Fluentd for Windows: Due to strong demand, Fluentd will support Windows shortly.
- Fluentd “Error Stream” Feature: Route broken or invalid events into a separate data stream for later analysis and keep your primary data sink error-free.
- Fluentd Monitoring Tool: Monitor your deployments of Fluentd and the health of your Fluentd instances with a monitoring tool and UI that will help you optimize Fluentd’s reliability and performance.
As the original author and core members of the Fluentd project, we are committed to expanding the community, in the US and Japan as well as across the globe. Over the next few months, Kiyoto Tamura, our employee and a Fluentd maintainer, will spearhead this effort. As the first step, he will be kicking off Fluentd meetups in San Francisco next week. Food and drinks will be on us, and our goal is to foster a genuine community around Fluentd. We promise that there won’t be any corporate advertisement. For the very first meetup, we are thinking of using Heavybit as the venue (an incubator started by Heroku co-founder James Lindenbaum that focuses on startups with serious technologies).
If you are in the Bay Area, we would love to have you sign up for our Meetup group, where we will be announcing all future events:
If you are not in the Bay Area, please watch us on GitHub as we are working hard on Fluentd over the next few months:
Treasure Data has recently recruited Keith Goldstein as VP, Business Development. Keith is a software industry veteran with deep knowledge and extensive contacts in the Big Data, Data Warehouse, ETL, and BI community.
Most recently, Keith held a similar role at Talend, the leading open source data integration company. At Talend, he established the Global SI Partnership programs, doubled OEM revenue year over year, and built the company’s Big Data ecosystem, including Hortonworks’ distribution of Talend for data integration with HDP.
Not only does Keith bring thirty years of experience to the company, but he also brings a pragmatic and enthusiastic perspective that is ideal for a young company like Treasure Data.
Welcome on board, Keith!
Today, we are thrilled to announce that Jake Becker has joined our engineering team.
Jake likes emus. Some of his friends think he looks like one, too.
Jake is Treasure Data’s second Stanford grad (who is the first!? Oh, wait, it’s me). He majored in Computer Science and was President of Stanford ACM, the largest student group in computer science at Stanford. He’s previously interned at Apple and Facebook, and Treasure Data is his first job out of school.
It’s been just two weeks since Jake joined us, but he’s already working on a major project to improve our API server and web interface. Although Jake is by far the youngest team member, we are quickly realizing he is excellent at web development =) We’re always on the lookout for good web devs (Ruby on Rails especially!), so send us your resume if you’d like to join our team (and check out our careers page!)
Welcome aboard, Jake!
One fateful evening in July 2012, in our small, one-room office in Los Altos, our CTO Kaz made what turned out to be the best marketing/sales/product development decision thus far.
To be honest, we didn’t know what to expect at first. It was surely an improvement over the arbitrary chains of email we exchanged with our customers, but we weren’t really sure how much additional value it would give us. After all, it’s just this little chat box. Would it work at all?
But it worked. It worked shockingly well.
We were pleasantly surprised that people actually asked us questions via Olark, and they were equally surprised that we answered their questions promptly and knowledgeably.
As we answered more and more questions on Olark, we began to realize that Olark is way more than a customer support tool: it’s a product feedback collector, a lead converter, and an inbound marketer.
Olark the Product Feedback Collector
Say you have a question about a software service. Probably the first place you go is their documentation page. Now, let’s say you are still lost after reading the documentation. How likely are you to email support@<company_name>.com to detail your problem when you aren’t even sure someone will respond to your email in a timely manner?
The answer is ‘very unlikely’, and this is where Olark is extremely powerful.
Olark dramatically lowers the hurdle to ask questions online because all you have to do is type your question into a textbox without leaving the page. We experienced this first-hand: the moment we turned on our Olark chatbox, visitors started asking us all sorts of questions. Some of them pointed out bugs while others gave us insights and ideas to improve usability. All of them gave us valuable feedback to make Treasure Data a better platform.
Olark the Lead Converter
Treasure Data is fortunate to have a number of paying customers. As we asked them why they decided to use our service, we noticed a pattern: they all spoke glowingly of our customer support.
And the more I think about it, the more I feel that Olark is central to our customer support setup. Olark didn’t just lower the hurdle for our customers to seek support: it lowered the hurdle for us to help our customers.
I often ask myself, “How many of our paying customers converted because we provided interactive, immediate support through Olark?” Of course, it’s impossible to know, and we wouldn’t dare to A/B test this. But considering how highly our customers speak of our customer service, Olark might be the single most effective lead conversion tool we have.
Olark the Inbound Marketer
From day one, customer support has been a mission shared across the team at Treasure Data. Our informal policy is “whoever is most qualified should provide customer support as promptly as possible”. Both our CEO and CTO have done plenty of customer support, and engineers answer all the technical questions.
Apparently, this is not the norm.
Time and again, our customers have been surprised that we provide quality support via Olark. Here is a screenshot from a real Olark session:
The mere fact that we do not outsource technical support made us “impressive” in one customer’s eyes! To be sure, this customer was particularly impressionable, but it’s true that our customers absolutely love our customer support. There is no better marketing than endorsements from your customers!
Run an online business? Start using Olark NOW!
If you sell anything online, start using Olark today. In theory, you could build your own in-browser live chat system. But why build your own when Olark “just works”? And Olark is very affordable: our ROI on Olark has been at least 5000%.
So, go start talking to your customers. Now.
P.S. If you want to be part of our awesome team, drop us a line: We are hiring =)
This Thursday from 18:30 PDT, Treasure Data is teaming up with Slideshare to host the first Fluentd Meetup in the San Francisco Bay Area!
You can sign up for the event HERE. It will be held at Slideshare-LinkedIn’s brand new office in downtown San Francisco and starts at 18:30 on Thursday, March 7th.
In case you’ve never heard of it, Fluentd is a versatile log collector that makes logging more robust, maintainable, and fun (yes, logging can be fun!). Originally, Sada, one of our co-founders, wrote Fluentd to make it easier for our customers to import data into our service. After using Fluentd (packaged as td-agent) for a couple of weeks, one of our customers suggested that we open-source it.
We heeded their advice and open-sourced Fluentd in October 2011. Since then, the project has grown to have close to 1,000 stars on GitHub with 117 plugins that enable Fluentd to talk to pretty much any other service.
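To give a taste of that plugin model, here is what a minimal Fluentd configuration might look like: one input plugin accepting events over the network and one output plugin routing matched events, in this case simply to standard output. The tag pattern below is illustrative; `forward` and `stdout` are standard built-in plugins, and 24224 is the default forward port.

```
<source>
  type forward
  port 24224
</source>

<match app.**>
  type stdout
</match>
```

Swapping `stdout` for any of the 117 community plugins is how the same event stream gets routed to S3, MongoDB, or another service without touching application code.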
In its relatively short life as an open source project, Fluentd has seen a pretty wide adoption: its users include PPTV (a popular video sharing website in China), Backplane (a Lady Gaga-backed social media start-up) and Slideshare (the slide sharing web service and this meetup’s co-host!) among others.
See You There!
There will be two talks, one by Sada, the original author of Fluentd, going over the overall architecture and another by Sylvain, a Slideshare operations engineer, on how Slideshare uses Fluentd.
If you are an infrastructure engineer who’s done spending hours and days debugging your logging infrastructure, a developer who wants a robust logging solution without reinventing the wheel, or just curious about Slideshare’s new office, sign up and come hang out with us this Thursday (refreshments provided courtesy of Slideshare)!
This week is a big week for Treasure Data. Not only will we be a Premier Exhibitor at Strata from Tuesday through Thursday, we’ll be sponsoring Heroku Waza 2013 alongside GitHub, New Relic and six other great companies (hint: scroll down!)
Team Treasure Data will be manning the coat check that will also recharge your laptops for you (no need to huddle around a handful of outlets and sit on the floor!), so please drop by and say hi =) Waza will be a great opportunity for us to get to know the Heroku community better and learn how we can improve our service.
Looking forward to seeing you all on 2/28!
We are excited to announce that we are one of the Premier Exhibitors at the upcoming Strata 2013 Santa Clara!
Strata, hosted by O’Reilly Media, is a widely respected conference where the best and brightest in big data and data science gather for three days and share their thoughts and insights.
As an exhibitor, we will be at Booth #908 to tell our story: what Treasure Data’s Big Data as-a-Service platform is, and why using Treasure Data instead of building your own Hadoop cluster or running sharded MySQL makes sense (hint: faster time-to-answer and near-zero maintenance).
If you are going to Strata 2013 Santa Clara next week, remember to drop by Booth #908!
According to Ivar, Product Manager at Cloud9, the data collected and analyzed on Treasure Data led them to new insights about their product. In his words:
>For example, we recently revisited how our customers interact with Cloud9 IDE’s workspaces. After we interviewed several customers, we came up with a number of questions about workspace usage. Because we were already logging anonymized user activity data on Treasure Data, the answers to our questions were only a few queries away.
>The query results were illuminating. Our customers were using workspaces in ways we never expected. In fact, the usage was so unexpected that our findings convinced us to shift our focus to a different customer segment.
When I read those paragraphs, I couldn’t help grinning. As a service platform, our customers’ success is our success, and learning that our service empowered a customer in a fundamental way is truly inspiring.
That’s enough from me. Go and read what Cloud9 IDE has to say about us =)