Doing a Twitter Analysis with R

Recently I took part in Coding Durer, a five-day international and interdisciplinary hackathon for art history and information science. The goal of this hackathon is to bring art historians and information scientists together to work on data. It is kind of an extension of the cultural hackathon CodingDaVinci, where I participated in the past (there is also a blog post about CDV). I will write another blog post about the results of Coding Durer another day, but this article is going to be a Twitter analysis of the hashtag #codingdurer. This article was a very good starting point for my analysis.


First we want to get the tweets, and we are going to use the awesome twitteR package. If you want to know how to get the API key and related credentials, I recommend visiting this page. Once you have everything set up, we are good to go. The code below does the authentication with Twitter and loads our packages. I assume you know how to install an R package or can at least find a solution on the web.

# get packages
require(twitteR)
library(dplyr)
library(ggplot2)
library(tidytext)
library(gridExtra) # needed for grid.arrange at the end

# do auth
consumer_key <- "my_key"
consumer_secret <- "my_secret"
access_token <- "my_token"
access_secret <- "my_access_secret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

We are now going to search for all the tweets containing the hashtag #codingdurer using the searchTwitter function from the twitteR package. After converting the result to an easy-to-work-with data frame, we remove all the retweets from our results because we do not want any duplicated tweets. I also removed the links from the tweet text as we do not need them.

# get tweets
cd_twitter <- searchTwitter("#CodingDurer", n = 2000)
cd_twitter_df <- twListToDF(cd_twitter)

# remove retweets
cd_twitter_unique <- cd_twitter_df %>% filter(!isRetweet)

# remove link
cd_twitter_nolink <- cd_twitter_unique %>% mutate(text = gsub("https?://[\\w\\./]+", "", text, perl = TRUE))

With the code below we are going to extract the twenty most active Twitter accounts during Coding Durer. I used some simple ggplot graphics and saved the plot to a variable called people.

# who is tweeting
people = cd_twitter_nolink %>%
  count(screenName, sort = TRUE) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(screenName, n, function(n) -n), y = n)) +
  ylab("Number of Tweets") +
  xlab("") +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Most active twitter users")

Now we want to know the twenty most used words from the tweets. This is going to be a bit trickier. First we extract all the words being said. Then we remove all the stop words (and some special words like codingdurer, https …) as they are uninteresting for us. We also remove any Twitter account name from the tweets. Now we are almost good to go. We just do some singularization (using the singularize function from the pluralize package) and then we can save the top twenty words as a ggplot graphic in a variable called word.

# what is being said
tweet_words <- cd_twitter_nolink %>% select(id, text) %>% unnest_tokens(word, text)

# remove stop words
my_stop_words <- stop_words %>% select(-lexicon) %>% bind_rows(data.frame(word = c("codingdurer","https", "t.co", "amp")))
tweet_words_interesting <- tweet_words %>% anti_join(my_stop_words)

# remove name of tweeters
cd_twitter_df$screenName = tolower(cd_twitter_df$screenName)
tweet_words_interesting = filter(tweet_words_interesting, !(word %in% unique(cd_twitter_df$screenName)))

# singularize words (singularize() comes from the pluralize package)
require(pluralize)
tweet_words_interesting$word2 = singularize(tweet_words_interesting$word)

# fix a few words the singularizer gets wrong
tweet_words_interesting$word2[tweet_words_interesting$word2 == "datum"] = "data"
tweet_words_interesting$word2[tweet_words_interesting$word == "people"] = "people"

word = tweet_words_interesting %>%
  count(word2, sort = TRUE) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(word2, n, function(n) -n), y = n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("Word Occurrence") +
  xlab("") +
  ggtitle("Most used words in tweets")

# plot all together
grid.arrange(people, word, nrow=2, top = "Twitter Analysis of #codingdurer")

The grid.arrange function lets us plot both of our graphics at once. Now we can see who the most active Twitter users were and what the most used words were. It is good to see words like art, data and project at the top.


Make sure you check out my Github for other data driven projects.

Build simple but nifty cohorts in R


Cohorts are always a great way to split a group into segments and get a deeper view of whatever you are looking at. Imagine you have an online shop and would like to know how your user retention has developed over the last few weeks. I will explain cohorts below after we have created some data to build one.

# get packages
library(ggplot2)
library(reshape2)
require(viridis)

# simulate cohort data
mydata = replicate(15, sort(runif(15, 1, 100), decreasing = TRUE))
mydata[lower.tri(mydata)] = NA

# convert to df and add cohort label
mydata = t(mydata)
mydata = as.data.frame(mydata)
mydata$cohort = as.factor(c(15:1))

# reshape and reorder
mydata = na.omit(melt(mydata, id.vars = "cohort"))
mydata$variable = as.numeric(gsub("V","",mydata$variable))
mydata$cohort = factor(mydata$cohort, levels=rev(levels(mydata$cohort)))

# plot cohort
ggplot(mydata, aes(variable, cohort)) +
 theme_minimal() +
 xlab('Week') +
 ylab('Cohort') +
 geom_tile(aes(fill = value), color='white') +
 scale_fill_viridis(direction = -1) +
 scale_x_continuous(breaks = round(seq(min(mydata$variable), max(mydata$variable), by = 1)))

With the code above you can simulate fifteen cohorts over a maximum period of fifteen weeks (or whatever the period might be). After creating some data you can easily use ggplot to build your cohort diagram. I have used a minimal theme and a neat viridis color palette.


The diagram above basically shows the retention rate of fifteen different groups. For example, about 25 percent of the people from cohort one came back to visit our online shop 15 weeks after their first visit. Cohort fifteen visited the online shop for the first time this week, which is why we only have data for one week. With this principle in mind you can analyze your retention rates over time.
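The data above is simulated, so here is a minimal sketch of how you could derive the same kind of table from real visit data. It assumes a hypothetical data frame called visits with one row per shop visit and the columns user_id and week (the calendar week of the visit); the name and the columns are just placeholders for whatever your tracking produces.

# a minimal sketch: real cohort data from raw visits instead of simulated values
# 'visits' is a hypothetical data frame with one row per visit and columns user_id, week
library(dplyr)

cohorts = visits %>%
  group_by(user_id) %>%
  mutate(cohort = min(week),          # week of the first visit defines the cohort
         week_since = week - cohort) %>%
  ungroup() %>%
  distinct(user_id, cohort, week_since) %>%
  count(cohort, week_since) %>%
  group_by(cohort) %>%
  mutate(retention = n / n[week_since == 0]) %>%  # share of the cohort that is still coming back
  ungroup()

The resulting table can go straight into the geom_tile call from above, with week_since on the x-axis, cohort on the y-axis and retention as the fill.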

And of course this little plot can be used for all kinds of different tasks. Make sure you check out the code on my Github along with other projects. I also recommend analyzecore.com for really good R-related marketing content.

Visualizing clustering results in R

Recently I thought about how to visualize the results of a cluster analysis. I do not mean the visualization of the clusters themselves but the results in terms of content and variable description – something you could give to someone who does not understand the mechanics of cluster algorithms and just wants to see a description of the resulting clusters. I came up with a fairly easy ggplot solution, but let’s get some data before we go into that.

# load packages
require(reshape2)
require(ggplot2)
require(viridis)
require(dplyr)

# get the data
url = 'http://www.biz.uiowa.edu/faculty/jledolter/DataMining/protein.csv'
food = read.csv(url)

# filter on specific countries
food = subset(food, food$Country %in% c("Albania","Belgium","Denmark","France","Romania","USSR","W Germany","Finland","UK"))

With the code above we get some example data on 25 European countries and their protein consumption (in percent) from nine major food sources. We then reduce the data set by filtering on nine countries. With the code below you transform the data into a long table format, which is required for plotting.

# melt data
DT1 = melt(food,id.vars = "Country")

# plot data
ggplot(DT1, aes(Country, value)) +
geom_bar(aes(fill = Country), position = "dodge", stat="identity") +
facet_wrap(~variable, scales = "free") +
xlab("") + ylab("protein intake (in %)") +
theme(axis.text.x=element_blank()) +
scale_fill_viridis(discrete=TRUE)

From here on it’s just a few classic ggplot commands to get the diagram we want. I set up a grouped barplot with a facet wrap and some neat coloring from the viridis palette.


I think this plot is perfect to see the differences between the countries (clusters) in just one diagram. Find the full code on my Github along with other projects.

Marketing attribution with Markov chains in R

In the world of e-commerce a customer has often seen more than just one marketing channel before they buy a product. We call this a customer journey. The goal of marketing attribution is to find out the importance of each channel across all customers. This information can then be used to optimize your marketing strategy and allocate your budget perfectly, but it also gives you valuable insights into your customers.


There are a lot of different models for allocating your conversions (or sales) to the different marketing channels. Most of the widely known models (e.g. last click) work in a heuristic manner and are fairly simple to implement, but they come with huge restrictions. I am not going to explain these models in this blog post as you can find tons of articles on the web about this topic.

Today we want to focus on a more sophisticated algorithmic approach to marketing attribution which works on the basis of Markov chains. In this model each customer journey is represented in a directed graph where each vertex is a channel and the edges represent the probabilities of transition between the channels. As we are going to focus on how to use this model in R, I totally recommend checking out the research by Eva Anderl and her colleagues. There is another research paper by Olli Rentola which gives a great overview of different algorithmic models for marketing attribution.
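To get an intuition for what that means before handing things over to a package, here is a minimal, hand-rolled sketch with three made-up journeys (the journeys and channel names are pure placeholders). It only builds the transition probabilities, i.e. the edges of the graph; the attribution itself is then derived from the so-called removal effect – the drop in overall conversion probability when a channel is taken out of the graph – which is exactly what the package below computes for us.

# a hand-rolled toy example (not the package): three made-up journeys
journeys = list(
  c("start", "channel_1", "channel_2", "conversion"),
  c("start", "channel_1", "null"),
  c("start", "channel_2", "conversion")
)

# count transitions between consecutive touchpoints
pairs = do.call(rbind, lapply(journeys, function(j)
  data.frame(from = head(j, -1), to = tail(j, -1), stringsAsFactors = FALSE)))

# row-normalize the counts to get transition probabilities (the edges of the graph)
trans_prob = prop.table(table(pairs$from, pairs$to), margin = 1)
round(trans_prob, 2)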

There is a great package in R called ChannelAttribution by Davide Altomare which provides you with the right functions to build a Markov-based attribution model. But let’s start with creating some data. With the code below we are going to create customer journeys of different lengths, each consisting of a user ID and touchpoints with a channel on a specific date.

# load packages
require(dplyr)
require(reshape2)
require(ggplot2)
require(ChannelAttribution)
require(viridis)

# simulate some customer journeys
mydata = data.frame(userid = sample(c(1:1000), 5000, replace = TRUE),
                    date = sample(c(1:32), 5000, replace = TRUE),
                    channel = sample(c(0:9), 5000, replace = TRUE,
                              prob = c(0.1, 0.15, 0.05, 0.07, 0.11, 0.07, 0.13, 0.1, 0.06, 0.16)))
mydata$date = as.Date(mydata$date, origin = "2017-01-01")
mydata$channel = paste0('channel_', mydata$channel)

To feed our model with data we need to transform our table from long format into sequences with the code below. I used some simple dplyr commands to get this done and cleaned up the data with the gsub function.

# create sequence per user
seq = mydata %>%
 group_by(userid) %>%
 summarise(path = as.character(list(channel)))

# group identical paths and add up conversions
seq = seq %>%
 group_by(path) %>%
 summarise(total_conversions = n())

# clean paths
seq$path = gsub("c\\(|)|\"|([\n])","", seq$path)
seq$path = gsub(",","\\1 \\2>", seq$path)

Now we are good to go and can run our models. The cool thing about the ChannelAttribution package is that it not only lets us fit the Markov chain but also has a function to compute some basic heuristic models (e.g. last touch, first touch, linear touch). There are a lot more parameters to specify your model, but for our example this is going to be it. Use the help function from the console to check out your possibilities.

# run models
basic_model = heuristic_models(seq, "path", "total_conversions")
dynamic_model = markov_model(seq, "path", "total_conversions")

# build barplot
result = merge(basic_model,dynamic_model, by = "channel_name")
names(result) = c("channel","first","last","linear","markov")

result = melt(result, id.vars="channel")

ggplot(result, aes(channel, value)) +
 geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
 scale_fill_viridis(discrete=TRUE) +
 xlab("") + ylab("Conversions") +
 guides(fill = guide_legend(title = "Model"))

# build another barplot to see deviations
result = merge(basic_model,dynamic_model, by = "channel_name")
names(result) = c("channel","first","last","linear","markov")

result$first = ((result$first - result$markov)/result$markov)
result$last = ((result$last - result$markov)/result$markov)
result$linear = ((result$linear- result$markov)/result$markov)

result = melt(result[1:4], id.vars="channel")

ggplot(result, aes(channel, value)) +
 geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
 scale_fill_viridis(discrete=TRUE) +
 xlab("") + ylab("Deviation from markov") +
 guides(fill = guide_legend(title = "Model"))

Now we would like to display the results in a simple barplot (see code above) to see which channels are generating the most conversions and which need to catch up. I am using ggplot for this with the awesome viridis package for neat coloring.


We can go even further and use another barplot to see how the basic heuristic models perform compared to our fancy Markov model. Now we can clearly see some real differences between all these models. If you are making serious decisions about which channels get your marketing budget, you should definitely compare different models to get the full picture.
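If you want to see where these differences come from, markov_model can also hand back its internals. In current versions of ChannelAttribution the out_more argument should return the estimated transition matrix and the removal effects alongside the attributed conversions – treat the sketch below as an assumption and check ?markov_model for your installed version.

# inspect the markov model itself
dynamic_model_detail = markov_model(seq, "path", "total_conversions", out_more = TRUE)

dynamic_model_detail$transition_matrix # estimated transition probabilities between channels
dynamic_model_detail$removal_effects   # drop in conversions when a channel is removed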


You can get the whole code on my Github along with other data driven projects.

Automated Facebook reporting with R and Google Spreadsheets

Imagine you want to set up automated reporting on the usage of a Facebook page (or multiple pages) and want the results to be displayed in a Google Spreadsheet. You can use two wonderful R packages to reach this goal easily with just a few lines of code and automate the whole process.


First of all let us get some data from a public Facebook page with the help of the awesome Rfacebook package. This package provides a series of functions that allow R users to access Facebook’s API to get information about users and posts, and collect public status updates that mention specific keywords. Before requesting data you have to go to the Facebook developer website, register as a developer and create a new app (which will then give you an ID and secret to use the API). See the reference manual of the package for detailed information about the authentication process.

# get packages
require(Rfacebook)

# set parameters
my_id <- "myAppID"
my_secret <- "myAppSecret"

# create fb dev account and do auth
my_oauth <- fbOAuth(app_id=my_id,app_secret=my_secret)

# get data from the facebook page with the ID 111492028881193
getpagedata <- getPage(111492028881193, token = my_oauth, n = 10) 

The getPage function requests information from a public Facebook page. In our case we are requesting the last ten posts of the page with the ID 111492028881193. The result also includes information on the date each post was created, the content of the post, and metrics like likes_count and shares_count. To find the ID of a Facebook page you can use this helpful website. See the reference manual of the package to find a lot more functions for getting data via the API.
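Just to show what came back, here is a quick look at the fields mentioned above (the column names are the ones Rfacebook’s getPage documents; adjust them if your version differs), for example to see which of the ten posts got the most likes.

# which of the ten posts got the most likes?
getpagedata[order(-getpagedata$likes_count), c("created_time", "message", "likes_count", "shares_count")]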

Now that we have this data in a neat little data frame in R, we want to write it automatically to a Google Spreadsheet. Here we can use the googlesheets package, which allows you to access and manage your Google spreadsheets directly from R. In our example we are just going to create a new spreadsheet named “facebook_test” and load up our data from the Facebook API with just one line of code. Now you have automated reporting from Facebook to Google Spreadsheets with a little help from R. Make sure you also have a look at the reference manual of the googlesheets package, as it provides a lot more possibilities to automate your reporting; below I also sketch how you could append fresh data to the existing sheet on later runs. The cool thing is that it is designed for use with the %>% pipe operator and, to a lesser extent, the data-wrangling mentality of dplyr.

# get package
require(googlesheets)

# create a spreadsheet and fill in the data
facebook_test <- gs_new("facebook_test", ws_title = "Data From Facebook API", input = getpagedata, trim = TRUE)
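For a recurring report you would not want to create a new spreadsheet on every run. A minimal sketch of the follow-up runs could look like the code below: it registers the existing sheet by its title with gs_title and appends the freshly pulled rows with gs_add_row. Treat the exact arguments as an assumption and double-check them in the googlesheets reference manual.

# on later (e.g. scheduled) runs: register the existing sheet and append the new data
facebook_test <- gs_title("facebook_test")
facebook_test %>% gs_add_row(ws = "Data From Facebook API", input = getpagedata)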

Go to my Github to see the code along with some other projects.