The web continues to produce a deluge of data and signals about users and their activities online. At Klout, we are doing our best to translate these signals into a reliable measure of user influence and reach. Like many other startups and a growing number of large enterprises, Klout uses Hadoop to crunch large volumes of data using clusters of commodity servers. We capture and process over 3 billion signals a day, and Hadoop is an excellent, horizontally scalable platform for doing just that.
What Ever Happened to Business Intelligence?
Besides ingesting and processing big data, we also need to deliver deep data analytics for our internal and external customers. We need to uncover traffic and engagement trends to improve our consumer experiences at Klout.com while driving ROI for our advertisers and business partners. Hadoop, by its nature, is a batch processing system and is not yet suitable for delivering interactive, “anything by anything” queries. But there is hope for those who need to support business intelligence workloads on big data sets using open source software and inexpensive hardware. Hive, which runs on Hadoop, exposes a SQL interface and presents a relational “database” view of your data on top of your raw Hadoop files. But Hive can’t support real-time, highly interactive business intelligence queries because it still behaves like a batch processing system, generating MapReduce code to do its work.
A Business Intelligence “Index” for Hive
At Klout, we developed a unique solution to this problem that leverages Hadoop’s scale and cost effectiveness while delivering “speed of thought” ad hoc queries. The key was to create a multi-dimensional query “index”, or cube, that sits in front of Hive and serves real-time, ad hoc, “anything by anything” queries using the upcoming version of Microsoft SQL Server Analysis Services (code name “Denali”). Yes, you read that right. We use a Windows-based multi-dimensional (MOLAP) product from Microsoft to load 350 million rows of Hive data per day and achieve an average query response time of under 10 seconds on 35 billion rows of data. All queries are served by a single $7,000 server with internal RAID storage.
An Unholy Alliance?
With all of the specialized business intelligence appliances on the market (Vertica, Greenplum, Aster Data, Netezza, Teradata), why did we pick Microsoft SQL Server Analysis Services (SSAS)? Because it’s a full-featured business intelligence engine, it’s in active development, it’s inexpensive, it has widespread query tool support and great documentation, and it scales. How much does it scale? At Yahoo!, I deployed SQL Server Analysis Services to support our display advertising business, which ingested 3.5 billion rows a day and delivered average query times of less than 7 seconds on 500 billion rows of data. Just as important, SSAS provides a true business view of data to end users in the form of a cube with measures and dimensions, hiding the complexities of SQL and delivering a rich semantic layer on top of your raw, unstructured Hadoop data. Hive can do what it does best by providing a cloud-based, inexpensive, centralized data warehouse, while SSAS manages all the data aggregates to support real-time ad hoc queries. In fact, delivering equivalent, performant cube functionality with a traditional SQL database would require generating thousands of data aggregates, creating complexity and making system changes onerous.
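To see why the aggregate count grows so quickly, here is a rough Python sketch of the arithmetic. It counts every distinct combination of dimensions that an “anything by anything” query could group by, each of which would need its own aggregate table in a hand-built SQL solution. The dimension names are hypothetical and are not Klout’s actual cube design.

```python
from itertools import combinations

def count_groupings(dimensions):
    """Count every distinct subset of dimensions a query could group by,
    including the empty subset (the grand total)."""
    return sum(
        1
        for size in range(len(dimensions) + 1)
        for _ in combinations(dimensions, size)
    )

# Hypothetical dimensions, chosen only for illustration.
dimensions = [
    "date", "country", "network", "topic", "age_band",
    "gender", "device", "campaign", "advertiser", "score_band",
]

total = count_groupings(dimensions)

# Each dimension is either in or out of a grouping, so the count is 2^n.
assert total == 2 ** len(dimensions)
print(f"{len(dimensions)} dimensions -> {total} possible aggregates")
```

Every additional dimension doubles the count, which is why maintaining these aggregates by hand becomes impractical long before the data volume itself becomes the bottleneck.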
What’s Wrong With This Picture?
There’s only one problem with this solution. No direct connection yet exists between Hive and SSAS for loading a cube. So, we use Sqoop to load each day’s data into MySQL, essentially using MySQL as a staging area for loading data into the cube. This solution works and scales pretty well, but it introduces unnecessary data latency.
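As a rough illustration of the staging step, the Python sketch below assembles a `sqoop export` command for one day of Hive data. The host, database, table, user name, and the `dt=YYYY-MM-DD` partition layout are all assumptions made for the example; they are not our production settings. The function only builds the argument list and does not run anything.

```python
from datetime import date

def build_sqoop_export(day, jdbc_url, table, warehouse_dir, username):
    """Build the argument list for exporting one day's Hive partition
    into a MySQL staging table. Credentials are deliberately left out."""
    # Assumed layout: one HDFS directory per day under the Hive table path.
    export_dir = f"{warehouse_dir}/dt={day.isoformat()}"
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--export-dir", export_dir,
        # Hive's default field delimiter is Ctrl-A (octal 001).
        "--input-fields-terminated-by", "\\001",
    ]

# Hypothetical values, for illustration only.
cmd = build_sqoop_export(
    day=date(2011, 11, 1),
    jdbc_url="jdbc:mysql://staging.example.com/analytics",
    table="daily_signals",
    warehouse_dir="/user/hive/warehouse/signals",
    username="etl",
)
print(" ".join(cmd))
```

A scheduler could pass the resulting list to `subprocess.run` once the day’s Hive job finishes, with cube processing started only after the export succeeds.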
What’s Next
We’re working with Microsoft to develop direct connectivity between Hive and our cube using their just-announced support for Hive and Hadoop. By eliminating the staging server, we will reduce our data latency dramatically and remove a key dependency. The good news is that you can do what we did here at Klout right now, at no cost: CTP3 of SQL Server Code Named “Denali” is a free download, available here or through your MSDN account. Stay tuned for further updates.
Do you want to take on Big Data challenges like this? We’re looking for great engineers who can think outside the box. Check us out.

