The deluge of data we are subjected to daily can be a good thing sometimes. It gives us a wide-open window to look at people’s behaviour, say, through their use of social media. The obvious question then is: can we use this data to predict what people will do online?
That is the question some colleagues and I tried to answer in a new study (a preprint is available on arXiv). The trouble with this problem is that each user on social media needs their own tailored model, because people are as unique online as they are in real life. This makes it difficult to scale up, that is, to find an algorithm that suits many users.
The solution for scaling up is to build the model on behaviour that can be easily measured and summarised. On social networks such as Twitter, this behaviour could be the delay in passing on information, or any changes made to the content, through retweets or modified tweets. This delay or modification can be seen as an indicator of that person’s behaviour, and can then be used to predict future reactions.
In many ways, this kind of behaviour can be compared to a person’s genotype, the genetic makeup that defines our physical traits. Just as your genes determine those traits, your tweets could be used to create a social media genotype that predicts your reactions to tweets on different topics.
For example, sharing or liking some types of information can be used as a metric. A collection of such metrics defines the user’s response to a particular category of information in the same way that a genotype defines the traits of an individual. Within our genotype model, each network user gains a unique behavioural signature for the topic-specific content they share.
We used Twitter to test our approach of building social media genotypes for individual users. Apart from tweets and retweets, we included factors such as who follows the user, who the user follows, and the delay between receiving a tweet and retweeting it.
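As a rough illustration of the idea (not the code used in the study), a genotype can be sketched as a per-user, per-topic summary of retweet behaviour. The record layout, the user names, the delay values and the `build_genotype` function below are all assumptions made for this example, and it tracks only two metrics: the mean retweet delay and the number of retweets.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical retweet records: (user, topic, delay in seconds between
# receiving a tweet and retweeting it). All values are invented.
retweets = [
    ("alice", "sports", 120),
    ("alice", "sports", 180),
    ("alice", "politics", 3600),
    ("bob", "sports", 30),
]


def build_genotype(records):
    """Summarise each user's retweet behaviour per topic.

    Returns {user: {topic: {"mean_delay": ..., "retweets": ...}}}.
    """
    delays = defaultdict(lambda: defaultdict(list))
    for user, topic, delay in records:
        delays[user][topic].append(delay)

    return {
        user: {
            topic: {"mean_delay": mean(values), "retweets": len(values)}
            for topic, values in topics.items()
        }
        for user, topics in delays.items()
    }


genotype = build_genotype(retweets)
```

Here `genotype["alice"]["sports"]` summarises how quickly and how often the hypothetical user "alice" passes on sports-related tweets. A fuller signature would add further metrics, such as follower relationships or content modifications.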
Some Twitter users have thousands or even millions of followers, so several questions arise. How active were the followers in retweeting a message on a specific topic? Were the reactions to topics consistent for each individual user? What can we conclude about the user and their activities on Twitter from these measurements?
To group tweets into categories, we used the hashtags that some tweets contain to mark them as relevant to a specific issue. We then grouped those hashtags into broad topics, on which we conducted our experiments.
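A minimal sketch of this kind of grouping might look like the following. The hashtag-to-topic table, the regular expression and the `topics_for_tweet` function are hypothetical choices for illustration. The mapping used in the study is not reproduced here.

```python
import re

# Hypothetical mapping from hashtags (lower-case, without "#") to broad topics.
HASHTAG_TOPICS = {
    "worldcup": "sports",
    "nba": "sports",
    "election": "politics",
}

# Matches a "#" followed by letters, digits or underscores.
HASHTAG_RE = re.compile(r"#(\w+)")


def topics_for_tweet(text):
    """Return the set of broad topics suggested by a tweet's hashtags."""
    tags = (tag.lower() for tag in HASHTAG_RE.findall(text))
    return {HASHTAG_TOPICS[tag] for tag in tags if tag in HASHTAG_TOPICS}
```

Tweets without a recognised hashtag get an empty set, which mirrors the fact that only some tweets carry hashtags and can be assigned to a topic this way.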
We used two different Twitter datasets. The first was a large dataset comprising a 20% sample of all tweets collected over a six-month period in 2009, which gave a network-wide view for that period. The second, smaller dataset, from 2012, contained all the messages of selected users, which gave us a chance to compare the reactions of a smaller subnetwork of users across all their tweets. Both were used to check our hypothesis.
Our approach captured consistent topic-wise behaviour. Some individual users did show inconsistencies for hashtags within a topic. However, barring minor exceptions, a group of users' social media genotypes remained very consistent, which is to say that they responded in similar ways and spread the message with similar delays or modifications.
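One simple way to put a number on this kind of consistency is to compare a user's typical retweet delay across the hashtags within one topic. The sketch below uses the coefficient of variation for that purpose. This is an assumed, illustrative measure rather than the one used in the study, and the function name and input layout are hypothetical.

```python
from statistics import mean, pstdev


def delay_consistency(delays_by_hashtag):
    """Coefficient of variation of a user's mean retweet delay across
    the hashtags of one topic. Lower values mean more consistent behaviour.

    Returns None when fewer than two hashtags have any observations.
    """
    per_tag = [mean(delays) for delays in delays_by_hashtag.values() if delays]
    if len(per_tag) < 2:
        return None
    average = mean(per_tag)
    if average == 0:
        # Every delay is zero, so the behaviour is perfectly consistent.
        return 0.0
    return pstdev(per_tag) / average
```

A value near zero would indicate that the user reacts with similar delays to every hashtag in the topic, while a large value would flag the kind of within-topic inconsistency some individual users showed.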
So prediction did seem possible. Indeed, our comparative analysis of the 2009 and 2012 datasets revealed that we could consistently and quantitatively predict how tweets spread.
It should be noted that our experiment was designed to test our approach. Different metrics of user behaviour can be used, and different methods of grouping messages into categories can be applied to other social networks. But the success of our initial experiments shows that there may be a robust way of predicting what you will tweet next.