Top 16% Solution to Kaggle’s Product Classification Challenge

Kaggle is a platform for predictive modelling and analytics competitions on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. As of May 2016, Kaggle had over 536,000 registered users, or Kagglers. The community spans 194 countries. It is the largest and most diverse data community in the world (Wikipedia).

One of my first Kaggle competitions was the OTTO product classification challenge. OTTO is one of the world’s biggest e-commerce companies. For this competition, OTTO provided a dataset with 93 features (all of them obfuscated) for more than 200,000 products. The objective was to build a predictive model that can distinguish between the company’s nine main product categories.
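
As a rough illustration of how such a multiclass model can be set up in R, here is a simplified sketch (not my actual competition script); it assumes the Kaggle training file with its feat_1 to feat_93 columns and a target column, and all parameter values are just placeholders.

# simplified sketch of a multiclass model for the OTTO data
# (file name and parameters are illustrative)
library(xgboost)

train <- read.csv("train.csv")
y <- as.integer(gsub("Class_", "", train$target)) - 1   # labels 0..8
x <- as.matrix(train[, grep("^feat_", names(train))])   # the 93 feature columns

dtrain <- xgb.DMatrix(data = x, label = y)

model <- xgb.train(params = list(objective = "multi:softprob",
                                 num_class = 9,
                                 eval_metric = "mlogloss",
                                 eta = 0.1,
                                 max_depth = 6),
                   data = dtrain, nrounds = 200)

# class probabilities, one column per product category
pred <- matrix(predict(model, dtrain), ncol = 9, byrow = TRUE)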

Kaggle also allows users to publicly share their code on each competition page. It helped me a lot to check out other people’s code before getting started. You can find my R script for the OTTO product classification challenge on my GitHub.

Automated Facebook reporting with R and Google Spreadsheets

Imagine you want to do an automated reporting of the usage of a Facebook page (or multiple pages) and want the results to be displayed in a Google Spreadsheet. You can use two wonderful APIs in R to reach your goal easily with just a few lines of code and automate the whole process.

First of all, let us get some data from a public Facebook page with the help of the awesome Rfacebook package. This package provides a series of functions that allow R users to access Facebook’s API to get information about users and posts, and to collect public status updates that mention specific keywords. Before requesting data you have to go to the Facebook developer website, register as a developer and create a new app (which will give you an app ID and secret to use with the API). See the reference manual of the package for detailed information about the authentication process.

# get packages
require(Rfacebook)

# set parameters
my_id <- "myAppID"
my_secret <- "myAppSecret"

# create fb dev account and do auth
my_oauth <- fbOAuth(app_id = my_id, app_secret = my_secret)

# get data from the facebook page with the ID 111492028881193
getpagedata <- getPage("111492028881193", token = my_oauth, n = 10)

The getPage function requests information from a public Facebook page. In our case we are requesting the last ten posts of the page with the ID 111492028881193. The result also includes the date each post was created, the content of the post and metrics such as likes_count and shares_count. To find the ID of a Facebook page you can use this helpful website. See the reference manual of the package for many more functions to get data via the API.
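
For example, a quick way to inspect the result is to look at a few of the returned columns and rank the posts by likes (the column names below are the ones returned by the Rfacebook package):

# peek at the returned data frame
head(getpagedata[, c("created_time", "message", "likes_count", "shares_count")])

# rank the ten posts by likes
getpagedata[order(getpagedata$likes_count, decreasing = TRUE),
            c("created_time", "likes_count", "shares_count")]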

Now that we have this data in a neat little data frame in R, we want to write it automatically to a Google Spreadsheet. Here we can use the googlesheets package, which allows you to access and manage your Google spreadsheets directly from R. In our example we are just going to create a new spreadsheet named “facebook_test” and upload our data from the Facebook API with just one line of code. Now you have automated reporting from Facebook to Google Spreadsheets with a little help from R. Make sure you also have a look at the reference manual of the googlesheets package, as it provides many more possibilities to automate your reporting. The cool thing is that it is designed for use with the %>% pipe operator and, to a lesser extent, the data-wrangling mentality of dplyr.

# get package
require(googlesheets)

# create a spreadsheet and fill in the data
facebook_test <- gs_new("facebook_test", ws_title = "Data From Facebook API", input = getpagedata, trim = TRUE)
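
On a later run you could register the existing sheet by its title and append fresh rows using the pipe operator. The snippet below is only a sketch: gs_title and gs_add_row are functions of the googlesheets package, but the example row is made up and would in practice come from a new getPage call and match the sheet’s columns.

# register the existing sheet and append one new row (illustrative data)
library(googlesheets)
library(magrittr)

new_row <- data.frame(created_time = "2016-09-04T10:00:00+0000",
                      message = "new post",
                      likes_count = 42,
                      shares_count = 3)

gs_title("facebook_test") %>%
  gs_add_row(ws = "Data From Facebook API", input = new_row)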

Go to my GitHub to see the code along with some other projects.

Using the quintly API from within R

quintly is an online social media analytics tool that helps you track, benchmark and optimize your social media performance. You need a quintly business account in order to access the API, but you can get a demo account via their webpage. For authentication they use Basic Auth via HTTPS: as the username you send your quintly client ID and as the password your API secret (it is included in the demo account, but you will need to ask their support for it).

The API lets you access metrics for your own or any public social media account on Facebook, Instagram and other platforms. There are two ways of fetching data: either by asking for predefined metrics, or by specifying a completely customized query using QQL (Quintly Query Language). For this blog post we will use a predefined metric to get started.

# get packages
library(httr)
require(rjson)

# set parameters (change to your ID and PW)
clientid <- "YourClientId"
apisecret <- "YourAPISecret"

# do authentication
req <- GET("https://api.quintly.com/v0.9/list-profiles", authenticate(clientid, apisecret, type = "basic"))
stop_for_status(req)
content(req)

# get the data (change profile ID, this can be found in your quintly account)
req <- GET("https://api.quintly.com/v0.9/qql?metric=fanCount&startTime=2016-09-04&endTime=2016-09-04&interval=daily&profileIds=12345", authenticate(clientid, apisecret, type = "basic"))

# convert the data from json to a data frame
json <- content(req)
data <- as.data.frame(json$data)

# some small processing steps
colnames(data) <- c("account","timestamp","fancount")
data$account[data$account == "12345"] <- "YourAccountName"

I used the httr package to retrieve data from the quintly API and the rjson package to handle the incoming data, which arrives in JSON format. As you can see from the GET command, we asked for the metric fanCount. You can find the whole list of predefined metrics in their API documentation. All other parameters (startTime, endTime, interval, profileIds) are mandatory for every request. After getting the data via the API we can transform it from JSON into a data frame for further work.
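
If the request spans more than a single day, json$data contains one entry per day. One generic way to collapse such a response into a single data frame is sketched below (assuming the same response structure as in the one-day example above):

# bind a multi-day response into one data frame
json <- content(req)
data <- as.data.frame(do.call(rbind, lapply(json$data, unlist)),
                      stringsAsFactors = FALSE)
colnames(data) <- c("account", "timestamp", "fancount")
data$fancount <- as.numeric(data$fancount)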

You can find the code above along with other projects on my GitHub.

What’s Cooking on Kaggle? Top 20% Solution

Kaggle is a platform for predictive modelling and analytics competitions on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. As of May 2016, Kaggle had over 536,000 registered users, or Kagglers. The community spans 194 countries. It is the largest and most diverse data community in the world (Wikipedia).

One of the most interesting data sets I found on Kaggle was within the What’s Cooking challenge. The competition was hosted by Yummly, a mobile app and website that provides recipe recommendations. The Yummly app was named “Best of 2014” in Apple’s App Store. The competition asks you to predict the category of a dish’s cuisine given a list of its ingredients. The training data included a recipe id, the type of cuisine, and a list of ingredients of each recipe. There were 20 types of cuisine in the data set.

I was able to get a prediction score of about 80 percent with a fairly simple solution. First of all I removed all rare ingredients from the data set. I did not do much feature engineering, except for creating one simple variable that counts the total number of ingredients per recipe. I also tried some text mining in the form of word stemming, which reduces a recipe’s ingredients to their root words (e.g. tomatoes becomes tomato). That approach did not help much in the end, so I removed it from my script. I saved my training data in a sparse matrix and trained a multiclass classification model using softmax with the xgboost package.
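
The following sketch illustrates that general approach (it is not my actual script; the file name and all parameter values are placeholders): a sparse recipe-by-ingredient matrix, the ingredient-count feature, and a softmax xgboost model over the 20 cuisines.

# sketch of a sparse ingredient matrix plus a multiclass xgboost model
library(jsonlite)
library(Matrix)
library(xgboost)

train <- fromJSON("train.json")     # id, cuisine, ingredients (list column)

# sparse 0/1 matrix: one row per recipe, one column per ingredient
ingredients <- unique(unlist(train$ingredients))
rows <- rep(seq_along(train$ingredients), lengths(train$ingredients))
cols <- match(unlist(train$ingredients), ingredients)
x <- sparseMatrix(i = rows, j = cols, x = 1,
                  dims = c(nrow(train), length(ingredients)))

# simple extra feature: number of ingredients per recipe
x <- cbind(x, lengths(train$ingredients))

# multiclass softmax model over the 20 cuisines (labels must be 0-indexed)
y <- as.integer(factor(train$cuisine)) - 1
model <- xgboost(data = x, label = y, nrounds = 200,
                 objective = "multi:softmax", num_class = 20,
                 eta = 0.1, max_depth = 8)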

Kaggle also allows users to publicly share their code on each competition page. It helped me a lot to check out other people’s code before getting started. You can find my R script for the What’s Cooking challenge on my GitHub.

Bike sharing usage with Leaflet and Shiny

My interactive map shows the bike sharing usage of StadtRAD, the bike sharing system in Hamburg, Germany. The data is available on the open data platform of Deutsche Bahn, the public railway company in Germany. The last new StadtRAD station was put into operation in May 2016, which is why I have chosen to display the usage of June 2016. The brighter the lines, the more bikes were ridden along that street.

From data processing and spatial analysis to visualization, the whole project was done in R. I used the leaflet and shiny packages to display the data interactively. The bikes themselves don’t have GPS, so the routes are estimated on a shortest-route basis using the awesome cyclestreets API. The biggest challenge was the aggregation of overlapping routes; the overline function from the stplanr package proved very helpful here, as it takes a series of overlapping lines and aggregates their values for the overlapping segments. The raw data file from Deutsche Bahn is quite large, so I struggled to import it into R for processing. In the end the read.csv.sql function from the sqldf package did the job.
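
The two building blocks mentioned above might look roughly like this. The file name and the column names (date_from, count) are assumptions, and routes stands for a SpatialLinesDataFrame with one estimated route per rental.

# filter the large raw CSV down to June 2016 without loading it all into memory
library(sqldf)
rentals <- read.csv.sql("stadtrad_rentals.csv",
                        sql = "select * from file where date_from like '2016-06%'")

# aggregate overlapping route segments: overline() merges overlaying lines
# and aggregates the values of the chosen attribute for each segment
library(stplanr)
route_usage <- overline(routes, attrib = "count")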

You can find the whole code, from the processing steps to the shiny functions, on my GitHub. The code could easily be used to map other spatial data, for example the car sharing data from car2go, which is available via their API. This might be a future project.

Sorting tables with Google’s weighted sort algorithm

Imagine you have a data set that contains every page of your website and two corresponding key performance indicators (KPIs). The first KPI is page impressions, defined as the number of times a specific page has been accessed. The second KPI is the bounce rate, the percentage of visitors who enter your website on a specific page and then leave (“bounce”) directly rather than continuing on to view other pages within your website. The bounce rate is a measure of the effectiveness of a website.

Page Page_Impressions Bounce_Rate
Page_483 883302 0.16
Page_186 459298 0.1
Page_307 366320 0.0
Page_388 312929 0.01
Page_308 261719 0.0
Page_277 226548 0.03
Page_31 212580 0.02
Page_129 189652 0.12
Page_337 176998 0.02

So far so good, but if you simply sort your table by bounce rate you have not achieved much, as most of the pages with a high bounce rate will have only a small number of page impressions. The underlying question is: how can I find pages with a high bounce rate that also have a high number of page impressions?

Page Page_Impressions Bounce_Rate
Page_108 1 1.00
Page_121 1 1.00
Page_83 3 1.00
Page_45 3 0.67
Page_112 4 0.67
Page_11 12 0.57
Page_140 388 0.56
Page_89 2 0.5
Page_133 2 0.5

Google developed a weighted sort algorithm that will help you. It is based on calculating the Estimated True Value (ETV) and using the ETV as a score on which to sort. The ETV computation is performed for each page as follows:

ETV(page x) = (PI / maxPI) * BR + (1 - PI / maxPI) * avgBR

BR = Bounce rate of page x

PI = Page impressions of page x

maxPI = The highest number of page impressions of any page in the table

avgBR = The average bounce rate across the whole table (weighted by page impressions, as in the code below)

The ETV calculation essentially says: "I need to figure out whether the row of data (page x) that I'm looking at is a large or a small percentage of the data in the whole table. If it's a large percentage, then I need to take this row seriously, in the sense that the data is probably meaningful. If it's a small percentage, then I shouldn't take this row seriously, and the reality is that I'd probably be more 'correct' if I just leaned towards using the average bounce rate for the whole table." For example, Page_483 has the highest number of page impressions, so PI / maxPI = 1 and its ETV is simply its own bounce rate of 0.16. Now you can sort your table by ETV and you will have the most interesting pages at the top of the table.

Page Page_Impressions Bounce_Rate ETV
Page_483 883302 0.16 0.16
Page_186 459298 0.1 0.07557256
Page_129 189652 0.12 0.06433205
Page_35 58882 0.17 0.05717034
Page_37 81772 0.13 0.05660003
Page_179 126322 0.09 0.05495925
Page_274 19304 0.3 0.05459478
Page_377 174237 0.07 0.05323211
Page_275 58731 0.1 0.05249535

Find the R code below to compute the ETV. It is also on my GitHub.

# proportion of page impressions relative to the busiest page
max_visit <- max(dataset$Page_Impressions)
dataset$prop_visits <- dataset$Page_Impressions / max_visit

# impression-weighted average bounce rate of the whole table
weighted_avg_br <- sum(dataset$Page_Impressions * dataset$Bounce_Rate) / sum(dataset$Page_Impressions)

# Estimated True Value (ETV) for every page
dataset$ETV <- (dataset$prop_visits * dataset$Bounce_Rate) + ((1 - dataset$prop_visits) * weighted_avg_br)
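
Sorting the table by the new ETV column then produces the view shown above:

# sort by ETV, highest first
dataset <- dataset[order(dataset$ETV, decreasing = TRUE), ]
head(dataset)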

And of course the weighted sort algorithm can be applied to any kind of problem, not only in the field of web analytics.