Doing a Twitter Analysis with R

Recently I took part in Coding Durer, a five-day international and interdisciplinary hackathon for art history and information science. The goal of this hackathon is to bring art historians and information scientists together to work on data. It is kind of an extension of the cultural hackathon CodingDaVinci, where I participated in the past. There is also a blog post about CDV. I will write another blog post about the results of Coding Durer another day, but this article is going to be a Twitter analysis of the hashtag #codingdurer. This article was a very good starting point for me for doing the analysis.


First we want to get the tweets, and we are going to use the awesome twitteR package. If you want to know how to get the API key and related credentials, I recommend visiting this page here. Once you have everything set up, we are good to go. The code below does the authentication with Twitter and loads our packages. I assume you know how to install an R package or can at least find a solution on the web.

# get package
require(twitteR)
library(dplyr)
library(ggplot2)
library(tidytext)

# do auth
consumer_key <- "my_key"
consumer_secret <- "my_secret"
access_token <- "my_token"
access_secret <- "my_access_secret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

We are now going to search for all tweets containing the hashtag #codingdurer using the searchTwitter function from the twitteR package. After converting the result to an easy-to-work-with data frame, we remove all retweets from the results because we do not want any duplicated tweets. I also removed the links from the tweet text, as we do not need them.

# get tweets
cd_twitter <- searchTwitter("#CodingDurer", n = 2000)
cd_twitter_df <- twListToDF(cd_twitter)

# remove retweets
cd_twitter_unique <- cd_twitter_df %>% filter(!isRetweet)

# remove link
cd_twitter_nolink <- cd_twitter_unique %>% mutate(text = gsub("https?://[\\w\\./]+", "", text, perl = TRUE))

With the code below we extract the twenty most active Twitter accounts during Coding Durer. I used a simple ggplot bar chart for the graphic and saved it to a variable called people.

# who is tweeting
people <- cd_twitter_nolink %>%
  count(screenName, sort = TRUE) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(screenName, n, function(n) -n), y = n)) +
  ylab("Number of Tweets") +
  xlab("") +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Most active twitter users")

Now we want to know the twenty most used words in the tweets. This is going to be a bit trickier. First we extract all the words being said. Then we remove all the stop words (and some special words like codingdurer, https, …) as they are uninteresting for us. We also remove any Twitter account name from the tweets. Now we are almost good to go. We just do some singularization, and then we can save the top twenty words as a ggplot graphic in a variable called word.

# what is being said
tweet_words <- cd_twitter_nolink %>% select(id, text) %>% unnest_tokens(word, text)

# remove stop words
my_stop_words <- stop_words %>% select(-lexicon) %>% bind_rows(data.frame(word = c("codingdurer","https", "t.co", "amp")))
tweet_words_interesting <- tweet_words %>% anti_join(my_stop_words)

# remove name of tweeters
cd_twitter_df$screenName = tolower(cd_twitter_df$screenName)
tweet_words_interesting = filter(tweet_words_interesting, !(word %in% unique(cd_twitter_df$screenName)))

# singularize words (singularize() and tokenize() are not part of the packages
# loaded above; singularize() comes e.g. from the 'pluralize' package)
tweet_words_interesting$word2 = singularize(unlist(tokenize(tweet_words_interesting$word)))
tweet_words_interesting$word2[tweet_words_interesting$word2 == "datum"] = "data"
tweet_words_interesting$word2[tweet_words_interesting$word == "people"] = "people"

word <- tweet_words_interesting %>%
  count(word2, sort = TRUE) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(word2, n, function(n) -n), y = n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("Word Occurrence") +
  xlab("") +
  ggtitle("Most used words in tweets")

# plot all together (grid.arrange() comes from the gridExtra package)
library(gridExtra)
grid.arrange(people, word, nrow = 2, top = "Twitter Analysis of #codingdurer")

The grid.arrange function lets us plot both graphics at once. Now we can see who the most active Twitter users were and what the most used words were. It is good to see words like art, data and project at the top.
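If you want to keep the combined figure as a file, you can wrap the grid.arrange call in a graphics device. This is just a minimal sketch; the file name and dimensions are arbitrary choices.

# optionally save the combined figure to a file
png("codingdurer_twitter.png", width = 800, height = 1000)
grid.arrange(people, word, nrow = 2, top = "Twitter Analysis of #codingdurer")
dev.off()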


Make sure you check out my GitHub for other data-driven projects.

Design your own Leaflet maps with Mapbox Studio


Recently I wanted to create a Leaflet map with a specific map style, but I could not find an appropriate design on the web. It turns out you can use Mapbox Studio to easily design your own maps and use them from within the R leaflet package. With the code below we get an interactive map of Hamburg with our own little design.

# packages
require(leaflet)

# set tiles parameters
tcu_map = "YourLeafletURL"
map_attr = "© <a href='https://www.mapbox.com/map-feedback/'>Mapbox</a> Basemap © <a href='https://insidedatablog.wordpress.com/'>Inside Data Design</a>"

# plot
leaflet() %>%
 setView(lng = 9.993682, lat = 53.551085, zoom = 11) %>%
 addTiles(urlTemplate = tcu_map, attribution = map_attr)

First we have to visit the Mapbox website, sign up for an account and create our own map via Mapbox Studio. After creating your own style (the easiest way is to start from a default style and adapt it to your needs), Mapbox will offer you a URL which can be used to display your style in Leaflet. You will find the URL under Styles in the dropdown menu of your created style (next to the edit button). If you haven't created any style yet, go to "New style" to create your first own map design.
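For orientation, the value you put into tcu_map is a tile URL template. Treat the pattern below as an assumption and simply copy the exact URL that Mapbox Studio shows you; username, style_id and the access token are placeholders from your account, while {z}, {x} and {y} are filled in by Leaflet.

# illustrative shape of a Mapbox style tile URL (copy the exact URL from Mapbox Studio)
tcu_map = "https://api.mapbox.com/styles/v1/username/style_id/tiles/256/{z}/{x}/{y}?access_token=YourAccessToken"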

Make sure you check out the code on my GitHub along with other projects.

Automated Facebook reporting with R and Google Spreadsheets

Imagine you want to do automated reporting on the usage of a Facebook page (or multiple pages) and want the results to be displayed in a Google Spreadsheet. You can use two wonderful APIs from within R to reach this goal easily, with just a few lines of code, and automate the whole process.


First of all let us get some data from a public Facebook page with the help of the awesome Rfacebook package. This package provides a series of functions that allow R users to access Facebook’s API to get information about users and posts, and collect public status updates that mention specific keywords. Before requesting data you have to go to the Facebook developer website, register as a developer and create a new app (which will then give you an ID and secret to use the API). See the reference manual of the package for detailed information about the authentication process.

# get packages
require(Rfacebook)

# set parameters
my_id <- "myAppID"
my_secret <- "myAppSecret"

# create fb dev account and do auth
my_oauth <- fbOAuth(app_id = my_id, app_secret = my_secret)

# get data from the facebook page with the ID 111492028881193
getpagedata <- getPage(111492028881193, token = my_oauth, n = 10) 

The getPage function requests information from a public Facebook page. In our case we are requesting the last ten posts of a page with the ID 111492028881193. The result also includes information on the date the posts were created, the content of the posts and metrics like likes_count and shares_count. To find the ID of a Facebook page you can use this helpful website. See the reference manual of the package to find many more functions for getting data via the API.
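Since the result is already a regular data frame, you can inspect the engagement per post right away. Here is a small sketch, assuming the column names created_time, likes_count and shares_count that getPage returned in my case (check names(getpagedata) if your version of the package differs):

# quick look at the most liked of the requested posts
library(dplyr)
getpagedata %>%
  select(created_time, likes_count, shares_count) %>%
  arrange(desc(likes_count)) %>%
  head(5)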

Now that we have this data in a neat little data frame in R, we want to write it automatically to a Google Spreadsheet. Here we can use the googlesheets package, which allows you to access and manage your Google spreadsheets directly from R. In our example we are just going to create a new spreadsheet named "facebook_test" and upload our data from the Facebook API with just one line of code. Now you have an automated reporting from Facebook to Google Spreadsheets with a little help from R. Make sure you also have a look at the reference manual of the googlesheets package, as it provides a lot more possibilities to automate your reporting. The cool thing is that it is designed for use with the %>% pipe operator and, to a lesser extent, the data-wrangling mentality of dplyr.

# get package
require(googlesheets)

# create a spreadsheet and fill in the data
facebook_test <- gs_new("facebook_test", ws_title = "Data From Facebook API", input = getpagedata, trim = TRUE)
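For recurring reports you probably do not want to create a new spreadsheet on every run. The lines below are only a sketch of how later runs could look, using gs_title() to register the existing sheet and gs_add_row() to append the newest post; check the googlesheets reference manual for the exact behaviour of these functions.

# later runs: register the existing sheet and append the newest post as a new row
facebook_test <- gs_title("facebook_test")
gs_add_row(facebook_test, ws = "Data From Facebook API", input = getpagedata[1, ])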

Go to my GitHub to see the code along with some other projects.

Using the quintly API from within R

quintly is an online social media analytics tool that helps you track, benchmark and optimize your social media performance. You need a quintly business account in order to access the API, but you can get a demo account via their webpage. For authentication they use Basic Auth via HTTPS: as the username you send your quintly client ID and as the password your API secret (included in the demo account, but you will need to ask the support for it).


The API lets you access metrics for your own or a public social media account on Facebook, Instagram and other platforms. There are two ways of fetching data: either by asking for predefined metrics or by specifying a completely customized query using QQL (Quintly Query Language). For this blog post we will use a predefined metric to get started.

# get packages
library(httr)
require(rjson)

# set parameters (change to your ID and PW)
clientid <- "YourClientId"
apisecret <- "YourAPISecret"

# do authentication
req <- GET("https://api.quintly.com/v0.9/list-profiles", authenticate(clientid, apisecret, type = "basic"))
stop_for_status(req)
content(req)

# get the data (change profile ID, this can be found in your quintly account)
req <- GET("https://api.quintly.com/v0.9/qql?metric=fanCount&startTime=2016-09-04&endTime=2016-09-04&interval=daily&profileIds=12345", authenticate(clientid, apisecret, type = "basic"))

# convert the data from json to a data frame
json <- content(req)
data <- as.data.frame(json$data)

# some small processing steps
colnames(data) <- c("account","timestamp","fancount")
data$account[data$account == "12345"] <- "YourAccountName"

I used the httr package to retrieve data from the quintly API and the rjson package to handle the incoming data, which is in JSON format. As you can see from the GET request, we were asking for the metric fanCount. You can find the whole list of predefined metrics on their API documentation website. All other parameters (startTime, endTime, interval, profileIds) are mandatory for every request. After getting the data via the API we can transform it from JSON to a data frame for further work.
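With a longer date range (just adjust startTime and endTime in the request) the same processing gives you one row per day, which you can plot directly. This is a minimal sketch, assuming the processed data frame from above and the ggplot2 package.

# plot the fan count over time (assumes a multi-day date range in the request)
library(ggplot2)

data$timestamp <- as.Date(data$timestamp)
data$fancount <- as.numeric(as.character(data$fancount))  # make sure the fan count is numeric

ggplot(data, aes(x = timestamp, y = fancount)) +
  geom_line() +
  xlab("") +
  ylab("Fan Count") +
  ggtitle("Facebook fan count over time")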

You can find the code above along with other projects on my GitHub.