Last November, I wrote a post called Big Data, Bigger Brains. In that post, I wrote about how we were able to make business intelligence (BI) work in a Big Data environment at Klout. That was a big step forward for Klout, but our work wasn’t yet done. Recently we launched a whole new website and a new Klout Score that substantially upped the stakes.
Really Big Data
At Klout, we need to perform quick, deep analysis on vast amounts of user data. We need to set up complex alerts and monitors to ensure that our data collection and processing is accurate and timely. With our new Score, we increased the amount of signals we collected by four times. This means that we now collect and normalize more than 12 billion signals a day into a Hive data warehouse of more than 1 trillion rows. In addition, we have hundreds of millions of user profiles that translate into a massive “customer” dimension, rich with attributes. Our existing configuration of connecting Hive to SQL Server Analysis Services (SSAS) by using MySql as a staging area was no longer feasible.
So, how did we eliminate MySql from the equation? Simple. We leveraged Microsoft’s Hive ODBC driver and SQL Server’s OpenQuery interface to connect the SSAS directly to Hive. Microsoft’s Kay Unkroth and Denny Lee and myself wrote a whitepaper detailing the specifics here. Now, we process 12 billion rows a day by leveraging the power of Hadoop and HiveQL right from within SSAS. For our largest cube, it takes about an hour to update a day’s worth of data – yes, 12 billions rows worth. By combining a great OLAP engine like SSAS with Hive, we get the best of both worlds: 1 trillion rows of granular data exposed through a interactive query interface compatible with existing business intelligence tools.
What I really want for Christmas…
So, what’s wrong with this story? What I really want is the OLAP engine itself to reside alongside of Hive/Hadoop, rather than live alone in a non-clustered environment. If the multi-dimensional engine resided inside HDFS, we could eliminate the double-write (write to HDFS, write to the cube) and leverage the aggregate memory and disk available across the Hadoop cluster for virtually unlimited scale out. As an added benefit, a single write would eliminate latency and vastly simplify the operational environment. I can dream, can’t I?
Do you want to take on Big Data challenges like this? We’re looking for great engineers who can think out of the box. Check us out.