Our goal is to be the standard for influence. The advent of social media has created a huge number of measurable relationships. On Facebook, people have an average of 130 friends. On Twitter, the average number of followers ranges from 300+ to 1,000+. With each relationship comes a different source of data. This has created a lot of noise and an attention economy. Influence has the power to drive this attention.
When a company, brand, or person creates content, our goal is to measure the actions on that content. We want to measure every view, click, like, share, comment, retweet, mention, vote, check-in, recommendation, and so on. We want to know how influential the person who *acted* on that content is. We want to know the actual meaning of that content. And we want to know all of this, over time.
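The signals above can be thought of as timestamped action events. As a purely illustrative sketch (the field names are our own, not an actual schema):

```python
from dataclasses import dataclass
import time

@dataclass
class Action:
    """One measurable action taken on a piece of content."""
    kind: str        # "like", "retweet", "mention", "check-in", ...
    actor_id: str    # who acted -- their own influence weights the signal
    content_id: str  # what was acted on
    timestamp: float # when -- influence is tracked over time

stream = [
    Action("retweet", "user42", "tweet9", time.time()),
    Action("like", "user7", "tweet9", time.time()),
]

# Count actions per content item -- the rawest of influence signals
counts = {}
for a in stream:
    counts[a.content_id] = counts.get(a.content_id, 0) + 1
# counts == {"tweet9": 2}
```

In practice each action type carries its own weight and meaning; this sketch only shows the shape of the raw event data.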
Measuring influence is a bit like trying to measure an emotion like hate or jealousy. It’s really hard and takes a boatload of data.
A huge part of what we do is develop machine learning models that make sense of this data. On top of that, the data is endless, so we need a platform to ingest, prepare, and analyze it.
The two biggest platforms are Facebook and Twitter, but it hardly ends there when it comes to social media. There’s LinkedIn, Foursquare, Path, YouTube, Quora, and many others. This presents the challenge of creating models for each platform and building data analysis platforms that can handle unstructured data.
To handle this at Klout, we’ve turned to open source technologies. We rely on Cloudera’s CDH3 Hadoop distribution for analysis and many of our data services. Another exciting open source technology we’ve recently embraced is Node.js, which provides incredibly fast, asynchronous event processing at massive scale. This is important to us as our products scale and become more and more real-time.
Twitter was the natural choice for our first network to analyze, due to the open nature of the data as well as the simple nature of the actions you can take on Twitter, such as a mention or a retweet.
As our models matured, so did the scale of Twitter’s data. As of this post, our Twitter cluster has the following stats:
- 75 million people scored daily
- 4 billion graph edges scored daily
- 48 million people are influenced by or influence an average of 27 people
- We derive hundreds of thousands of different topics in which 14 million users are influential
- On average, 5 topics per user, derived using NLP and semantic analysis
- For topics, 3 months of mentions and retweets are analyzed, currently over 6 billion
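As a toy illustration of the topic step, imagine counting known topic terms across a user’s mentions and retweets and keeping the most frequent topics. The term-to-topic mapping below is a hand-written stand-in; the real pipeline uses NLP and semantic analysis, not a dictionary:

```python
from collections import Counter

def top_topics(texts, topic_terms, k=5):
    """Count occurrences of known topic terms across a user's
    mentions/retweets and keep the k most frequent topics."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            if word in topic_terms:
                counts[topic_terms[word]] += 1
    return [topic for topic, _ in counts.most_common(k)]

# Hypothetical term-to-topic mapping, for illustration only
terms = {"hadoop": "Big Data", "hbase": "Big Data", "espresso": "Coffee"}
tweets = ["loving hadoop and hbase", "hadoop tuning tips", "best espresso ever"]
print(top_topics(tweets, terms))  # ['Big Data', 'Coffee']
```

The real system does this over 3 months of mentions and retweets per user, which is where the 6-billion-action figure comes from.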
Twitter Analytics Overview
From the Twitter firehose, data is written to disk in buffered chunks. A MapReduce job handles the task of preparing the firehose data into the different buckets needed for each of the workflows. These workflows serve different products, from bot and spam detection to scoring to topic extraction.
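In Hadoop-streaming terms, the bucketing step might look like the sketch below. This is our own simplification (the production jobs are Java MapReduce, and the bucket names here are invented):

```python
import json

# Route each firehose record to the workflow that consumes it.
def bucket_for(record):
    if record.get("retweeted_status"):
        return "retweets"      # e.g. feeds scoring and topic extraction
    if record.get("entities", {}).get("user_mentions"):
        return "mentions"
    return "other"

def mapper(lines):
    """Emit (bucket, tweet id) pairs; a partitioner can then split
    the stream so each downstream workflow sees only its bucket."""
    for line in lines:
        record = json.loads(line)
        yield bucket_for(record), record["id"]
```

Each downstream workflow then runs only over the bucket it needs rather than re-scanning the whole firehose.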
Many of our MapReduce jobs are written in Java, but we also rely on Pig Latin for tasks such as performing simple joins or computing population aggregates and statistics.
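The kind of population aggregate we hand to Pig is essentially group-by work. In plain Python, a Pig `GROUP ... GENERATE AVG(...)` amounts to roughly this (hypothetical column names):

```python
from collections import defaultdict

def average_by(rows, key, value):
    """Group rows by a key column and compute the per-group mean,
    like a Pig GROUP BY followed by AVG."""
    sums = defaultdict(lambda: [0.0, 0])
    for row in rows:
        s = sums[row[key]]
        s[0] += row[value]
        s[1] += 1
    return {k: total / n for k, (total, n) in sums.items()}

rows = [
    {"user": "a", "retweets": 10},
    {"user": "a", "retweets": 20},
    {"user": "b", "retweets": 5},
]
print(average_by(rows, "user", "retweets"))  # {'a': 15.0, 'b': 5.0}
```

Pig lets us express this in a few declarative lines and leaves the parallel execution to Hadoop.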
Oozie is used to coordinate the different workflow components. To serve data both internally and externally, we dump raw CSV files or load the data into HBase, which interfaces with load-balanced API servers.
Twitter Scoring Workflow
We use a machine-learning and statistics-based approach to perform our scoring. The current model has over 35 features. The scoring workflow consists of different Oozie jobs, many of which perform feature extraction. In the final jobs of this workflow, all of the features are fed into the scoring model, which produces scores.
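As a cartoon of that final step, here is a weighted scoring function squashed to a bounded range. The feature names and weights are invented for illustration; the production model learns its parameters and uses 35+ features:

```python
import math

# Hypothetical feature weights -- illustration only
WEIGHTS = {
    "retweets_per_tweet": 2.0,
    "unique_mentioners": 1.5,
    "follower_ratio": 0.5,
}

def score(features):
    """Combine extracted features into a single bounded 0-100 score
    via a weighted sum pushed through a logistic squash."""
    z = sum(WEIGHTS[name] * value for name, value in features.items())
    return 100.0 / (1.0 + math.exp(-z / 10.0))

print(score({"retweets_per_tweet": 1.2,
             "unique_mentioners": 3.0,
             "follower_ratio": 2.0}))
```

The upstream Oozie jobs each produce one or more of these feature values; the last job simply gathers them and applies the model.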
We’ve experimented with Mahout in the past and we will be using more of it in the future.
Our mission to measure influence is nowhere near complete. Luckily, here at Klout we believe in taking on the biggest challenges, and that is just what we are doing.
Do you have any questions or comments? Let us know! Also, if you want to take on these challenges with us, consider joining.
This post is by our CTO, Binh Tran, and adapted from a post for Cloudera’s blog.