RMarkdown for automated Marketing Reporting

In a past article I presented the awesome ChannelAttribution package, which helps you run algorithmic marketing attribution models. This time I am going to use it to build a full RMarkdown report with some analysis and visualization around the topic of marketing attribution.


I am starting with some web analytics data that has seven columns: userid identifies a unique user; sessionid identifies a unique visit to our website; orderid stays the same for a user until he or she buys something on our website; timestamp records when the action happened; event_type identifies the action someone takes on our page (page impression vs. product view vs. order); medium is the channel the visitor came from; and order_number identifies unique orders. See an example of the data below:

userid sessionid orderid timestamp event_type medium order_number
1 1 1 10.05.2017 15:36 page_impression SEA <NULL>
2 2 2 12.01.2017 16:36 page_impression SEO <NULL>
3 3 3 28.04.2017 16:06 page_impression SEA <NULL>
3 4 3 28.04.2017 18:53 page_impression SEA <NULL>
3 5 3 28.04.2017 19:01 page_impression Link <NULL>
4 6 4 16.02.2017 18:09 page_impression SEO <NULL>
4 7 4 16.02.2017 19:56 page_impression SEO <NULL>
4 8 4 17.02.2017 16:16 page_impression SEO <NULL>

To run our attribution models with the ChannelAttribution package we need to format our data as sequences. With the code below we can do that and also save some data frames for the output tables in the markdown document.

# packages used for the preprocessing (derived from the functions used below)
library(dplyr)
library(reshape2)          # melt()
library(data.table)
library(splitstackshape)   # cSplit()
library(TraMineR)          # seqdef(), seqstatd()
library(ChannelAttribution)

# create sequences on medium
mydata$medium = as.character(mydata$medium)
seq = mydata %>%
 group_by(orderid) %>%
 summarise(path = as.character(list(medium)))
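
# hypothetical helper (not shown in the original post): order_num maps each
# orderid to its order number and the order timestamp; it is merged into the
# output table further down
order_num = mydata %>%
 filter(event_type == "order") %>%
 group_by(orderid) %>%
 summarise(order_number = first(order_number), timestamp = first(timestamp))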

# save the same data for markdown output table
seq_output = mydata %>%
 group_by(orderid) %>%
 summarise(path = as.character(list(medium)))

seq_output$path = gsub("c\\(|)|\"|([\n])","", seq_output$path) # strip list syntax
seq_output$path = gsub(",", " >", seq_output$path)             # commas become " >" separators
seq_output = merge(order_num, seq_output, by = "orderid")
seq_output$orderid = NULL
colnames(seq_output) = c("Bestellnummer","Zeitstempel","Customer Journey")
write.table(seq_output, "Qi2-Customer_Journey.csv", row.names = F, sep = ";")

# group identical paths and add up conversions
seq = seq %>%
group_by(path) %>%
summarise(total_conversions = n())

# clean paths
seq$path = gsub("c\\(|)|\"|([\n])","", seq$path) # strip list syntax
seq$path = gsub(",", " >", seq$path)             # commas become " >" separators

# save for later use
seqdata = seq

# save the same data for markdown output table
seq_output_agg = seq
colnames(seq_output_agg) = c("Customer Journey","Conversions")
seq_output_agg = seq_output_agg[order(seq_output_agg$Conversions, decreasing = T),]

Before we run our models we are going to do some analysis on our sequences with the TraMineR package, which is perfect for any sequence-based analysis. The code below prepares our data for the markdown document.

# split path into single columns
seq.table = cSplit(as.data.table(seqdata), "path", ">")
# create sequence object
seq.seq = seqdef(seq.table, var = 2:length(seq.table))
# distribution table (perc)
dist_perc = as.data.frame(seqstatd(seq.seq)[[1]])
dist_perc = as.data.frame(t(dist_perc))
dist_perc$path = rownames(dist_perc)

# prepare plot
dist_perc = melt(dist_perc, id.var="path")
dist_perc$path = gsub("path_","",dist_perc$path)
colnames(dist_perc) = c("Session","Medium","Anteil")

# distribution table (full)
dist_full = as.data.frame(seqstatd(seq.seq)[[2]])
dist_full = as.data.frame(t(dist_full))
dist_full$path = rownames(dist_full)

# prepare plot
dist_full = melt(dist_full, id.var="path")
colnames(dist_full) = c("Path","Session","Besucher")
dist_full$Path = NULL
dist_full$Session = gsub("path_","",dist_full$Session)

Next we are going to run our attribution models and save the results for display in the markdown document. Finally we need to save our data frames to an RData file.

# run models
basic_model = heuristic_models(seq, "path", "total_conversions")
dynamic_model = markov_model(seq, "path", "total_conversions")

# build data frame for plot
result = merge(basic_model,dynamic_model, by = "channel_name")
names(result) = c("channel","first","last","linear","algorithmic")

# melt the data frame for plotting
result = melt(result, id.vars="channel")
colnames(result) = c("Kanal","Modell","Conversions")
result$Conversions = round(result$Conversions, 0)

# build data frame for deviation plot
result1 = merge(basic_model,dynamic_model, by = "channel_name")
names(result1) = c("channel","first","last","linear","algorithmic")
result1$first = ((result1$first - result1$algorithmic)/result1$algorithmic)
result1$last = ((result1$last - result1$algorithmic)/result1$algorithmic)
result1$linear = ((result1$linear- result1$algorithmic)/result1$algorithmic)
result1$algorithmic = NULL

# melt the data frame for plotting
result1 = melt(result1, id.vars="channel")
colnames(result1) = c("Kanal","Modell","Conversions")
result1$Conversions = round(result1$Conversions, 5)

# colorpalette for plotting
mypal = colorRampPalette(c("#FFD296", "#C77100"))
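
The save step itself is not shown above. A minimal sketch of it could look like this; the file name matches the load() call in the setup chunk of the markdown document further down:

# save every processed object for the markdown document (sketch only)
save.image("full_save_anonym.RData")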

Now we are able to use the processed data from above to create a nifty markdown document. I have used the plotly package to turn some ggplot visualizations into interactive charts. I also used a markdown theme called united and my own CSS for styling the document, as you can see in the YAML header.
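
The full RMarkdown source follows below; once it is saved to a file (the name here is just a placeholder of mine, not from the original post), it can be rendered with a one-liner like this:

# knit the RMarkdown file to HTML (file name is hypothetical)
rmarkdown::render("report.Rmd", output_file = "attribution_report.html")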

---
# YAML
title: "Case Study: Marketing Attribution"
author: created by Alexander Kruse, etracker Data Lab
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    includes:
      in_header: extLogo.html
    css: "mycss.css"
    theme: united
    highlight: tango
---

```{r include=FALSE}

# rm
#options(warn=-1)

# packages
require(ggplot2)
require(viridis)
library(DT)
library(plotly)
library(magrittr)
require(RColorBrewer)

# allow data.table syntax to work when the document is knitted via rmarkdown
assignInNamespace("cedta.override", c(data.table:::cedta.override, "rmarkdown"), "data.table")

# load pre-processed data
setwd("K:/Consulting/13_Alex_Data_Analyst/Datenanalyse_Projekte/Attribution/anonym")
load("full_save_anonym.RData")
rm(seq.table, dynamic_model, basic_model, mydata, seq.seq, seq, seqdata, ids, order_num)

```

<style type="text/css">

h1.title {
color: #FF5F01;
}

h2 {
color: #FF5F01;
}

</style>




## Introduction
The etracker Analytics interface does not yet let you display the customer journeys of your buyers at the level of individual order numbers. This report therefore shows which channel contacts a visitor had before buying something on your website. The report is complemented by further analyses and visualizations on the topic of marketing attribution. Detailed information and application examples can be found in the etracker whitepaper [Attribution: Mit der richtigen Strategie die Marketing Performance optimieren](https://www.etracker.com/wp-content/uploads/2017/05/etracker_WP_Attributionsmodell.pdf). For a deeper explanation of algorithmic marketing attribution, see the following video: [Multi-touch Attribution: How It Works & Why It Will Disrupt Media Buying](http://www.onebyaol.com/blog/teg-talks-episode-3-multi-touch-attribution-how-it-works-why-it-will-disrupt-media-buying).

This document was produced by the etracker Data Lab. Through the Data Lab, etracker offers website operators various analysis services that go beyond the scope of the etracker web analytics solution. A team of data analysts answers even highly complex and individual questions based on your raw web analytics data, statistical models and specialized visualizations. Importantly, the Data Lab uses the same data for its analyses that you also see in the etracker interface.


![](etracker_data_lab.PNG)


## Data Basis
This report was built from the etracker tracking data of an anonymized account and can also be produced from your own data. The following tables first show the customer journeys per order number. A customer journey consists of all channel contacts of a visitor tracked within the period up to his or her purchase. Channel contacts outside the data window are not taken into account, which has a substantial influence on the analysis: for conversions at the beginning of the analysis period, individual customer journeys may be missing channel contacts. The table below is interactive and offers sorting and search functions.




```{r, echo=FALSE}

datatable(seq_output, options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#FF5F01', 'color': '#fff'});",
"}")
))

```




The following table lists all customer journeys that led to a purchase. An additional column shows how often each combination of contacts resulted in a conversion; identical customer journeys have therefore been aggregated. Looking at the table, a large share of the customer journeys consists of only one channel contact (e.g. type-in) or of identical (so-called distinct) channel contacts (e.g. type-in > type-in). All of these customers were therefore in contact with only one channel.




```{r, echo=FALSE}

datatable(seq_output_agg, options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#FF5F01', 'color': '#fff'});",
"}")
))

```




## Customer Journey Analysis
The following charts analyze and visualize the data above from additional angles. The first chart shows how many buyers had a customer journey with one, two or more sessions or channel contacts before a purchase. Most buyers had only one channel contact. The curve flattens quickly, and only about 1% of all buyers had more than ten sessions before a conversion. The following charts are also interactive; you can zoom or hover to display individual values.




```{r, echo=FALSE, fig.width=9.5, fig.height=4}
ggplotly(ggplot(dist_full, aes(x = Session, y = Besucher, group = 1)) +
theme_minimal() +
geom_point(stat='summary', fun.y=sum) +
stat_summary(fun.y=sum, geom="line") +
xlab("Session") + ylab("Besucher"))

```




The following chart takes this idea one step further. For each step of the customer journey (session 01, session 02, ...) you can see which media were used and how often. It shows, for example, that type-ins understandably become more frequent as the customer journey gets longer, since visitors have saved the website in their browser history or the like. It also becomes apparent that the Affiliate and Link channels lose importance as the customer journey gets longer. This is a first indication that a last-click attribution model would undervalue these channels, while a first-click model might overvalue them. Since, as described above, only few visitors have a customer journey with more than ten sessions, it makes sense to look at the channel distribution only up to this journey length.




```{r, echo=FALSE, cache=FALSE, message = FALSE, warnings = FALSE, fig.width=10.9, fig.height=5}
mypal <- suppressWarnings(colorRampPalette(brewer.pal(7,"Oranges")))
ggplotly(ggplot(dist_perc, aes(x = Session, y = Anteil, fill = Medium)) +
theme_minimal() +
geom_bar(stat = "identity") +
xlab("Session") + ylab("Anteil an Besuchern") +
scale_fill_manual( values = mypal(7)) +
guides(fill = guide_legend(title = "Kanal:")))

```




## Marketing Attribution
Website visitors often have far more than just one advertising contact, and a lot of time can pass before a first contact with the desired product turns into a purchase. The question of which advertising medium contributes what share to the success of a marketing strategy therefore moves more and more into focus.

The analysis is therefore rounded off by attributing the individual conversions to the media used. The chart clearly shows that the SEO channel can be credited with the most conversions, regardless of which attribution model we choose. What stands out, however, is how differently the conversions are distributed for the Affiliate and Link channels: a last-click model assigns relatively few conversions to these channels.




```{r, echo=FALSE, fig.width=10.9, fig.height=4}
# plot everything
ggplotly(ggplot(result, aes(Kanal, Conversions)) +
theme_minimal() +
geom_bar(aes(fill = Modell), position = "dodge", stat="identity") +
scale_fill_manual( values = mypal(4)) +
xlab("") + ylab("Conversions") +
guides(fill = guide_legend(title = "Modell:")))

```




The attribution models differ in how they weight the individual media when assigning conversions. In general, two types of models can be distinguished: heuristic models (e.g. last click) and algorithmic attribution models. Algorithmic attribution models calculate the value of each channel from the entire historical visitor data at a fine-grained level, without loss of information or rigid assignment rules, and are the most precise approach to distributing the return of an advertising success across advertising channels. In addition to the classic models, the charts therefore also include an algorithmic attribution model designed by etracker. Technical information on our model can be found in the following paper: [Mapping the Customer Journey: A Graph-Based Framework for Online Attribution Modeling](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2343077).




```{r, echo=FALSE, fig.width=10.9, fig.height=4}
# plot everything
ggplotly(ggplot(result1, aes(Kanal, Conversions)) +
geom_bar(aes(fill = Modell), position = "dodge", stat="identity") +
theme_minimal() +
scale_fill_manual( values = mypal(4)) +
xlab("") + ylab("Abweichung vom algo. Modell (in %)") +
guides(fill = guide_legend(title = "Modell:")))

```




The chart above shows how strongly the heuristic models deviate from the algorithmic attribution model. The linear model shows the smallest deviations and is therefore the closest to etracker's algorithmic model. If you use the attribution report in the etracker interface, it is advisable to focus on the linear model (or, where applicable, the bathtub model).




## Conclusion
This analysis clearly shows that choosing a suitable attribution model is an important decision. A last-click model clearly undervalues the Affiliate and Link channels, while classic channels such as SEO and SEA are overvalued. It can be assumed that etracker's algorithmic attribution model calculates the most accurate channel valuation and that, among the heuristic models, the linear attribution model comes closest to it.






<center>

If you have any questions, please feel free to contact us.






Alexander Kruse

etracker Data Lab

Tel: +49 40 555 659 667

E-Mail: kruse@etracker.de

</center>

The final document can be seen here. Make sure you check out my GitHub for other data-driven projects.

Doing a Twitter Analysis with R

Recently I took part in Coding Durer, a five-day international and interdisciplinary hackathon for art history and information science. The goal of the hackathon is to bring art historians and information scientists together to work on data. It is kind of an extension of the cultural hackathon CodingDaVinci, in which I participated in the past; there is also a blog post about CDV. I will write another blog post about the results of Coding Durer another day, but this article is going to be a Twitter analysis of the hashtag #codingdurer. This article was a very good starting point for my analysis.


First we want to get the tweets, and we are going to use the awesome twitteR package. If you want to know how to get the API key and related credentials, I recommend visiting this page here. If you have everything set up, we are good to go. The code below does the authentication with Twitter and loads our packages. I assume you know how to install an R package or can at least find a solution on the web.

# get package
require(twitteR)
library(dplyr)
library(ggplot2)
library(tidytext)

# do auth
consumer_key <- "my_key"
consumer_secret <- "my_secret"
access_token <- "my_token"
access_secret <- "my_access_secret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

We are now going to search for all the tweets containing the hashtag #codingdurer using the searchTwitter function from the twitteR package. After converting the result to an easy-to-work-with data frame, we remove all the retweets from our results because we do not want any duplicated tweets. I also removed the links from the tweet text as we do not need them.

# get tweets
cd_twitter <- searchTwitter("#CodingDurer", n = 2000)
cd_twitter_df <- twListToDF(cd_twitter)

# remove retweets
cd_twitter_unique <- cd_twitter_df %>% filter(!isRetweet)

# remove link
cd_twitter_nolink <- cd_twitter_unique %>% mutate(text = gsub("https?://[\\w\\./]+", "", text, perl = TRUE))

With the code below we extract the twenty most active Twitter accounts during Coding Durer. I used some simple ggplot graphics and saved the plot to a variable called people.

# who is tweeting
people = cd_twitter_nolink %>%
count(screenName, sort = TRUE) %>% slice(1:20) %>%
ggplot(aes(x = reorder(screenName, n, function(n) -n), y = n)) +
ylab("Number of Tweets") +
xlab("") +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Most active twitter users")

Now we want to know the twenty most used words from the tweets. This is going to be a bit trickier. First we extract all the words being said. Then we remove all the stop words (and some special words like codingdurer, https, ...) as they are uninteresting for us. We also remove any Twitter account name from the tweets. Now we are almost good to go. We just do some singularization and then we can save the top twenty words as a ggplot graphic in a variable called word.

# what is being said
tweet_words <- cd_twitter_nolink %>% select(id, text) %>% unnest_tokens(word, text)

# remove stop words
my_stop_words <- stop_words %>% select(-lexicon) %>% bind_rows(data.frame(word = c("codingdurer","https", "t.co", "amp")))
tweet_words_interesting <- tweet_words %>% anti_join(my_stop_words)

# remove name of tweeters
cd_twitter_df$screenName = tolower(cd_twitter_df$screenName)
tweet_words_interesting = filter(tweet_words_interesting, !(word %in% unique(cd_twitter_df$screenName)))

# singularize words
# note: singularize() and tokenize() come from additional packages that are not
# loaded above (e.g. the pluralize package provides singularize())
tweet_words_interesting$word2 = singularize(unlist(tokenize(tweet_words_interesting$word)))
tweet_words_interesting$word2[tweet_words_interesting$word2 == "datum"] = "data"
tweet_words_interesting$word2[tweet_words_interesting$word == "people"] = "people"

word = tweet_words_interesting %>%
count(word2, sort = TRUE) %>%
slice(1:20) %>%
ggplot(aes(x = reorder(word2, n, function(n) -n), y = n)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("Word Occurrence") +
xlab("") +
ggtitle("Most used words in tweets")

# plot all together (grid.arrange comes from the gridExtra package)
library(gridExtra)
grid.arrange(people, word, nrow = 2, top = "Twitter Analysis of #codingdurer")

The grid.arrange function lets us plot both graphics at once. Now we can see who the most active Twitter users were and which words were used most. It is good to see words like art, data and project at the top.


Make sure you check out my GitHub for other data-driven projects.

Build simple but nifty cohorts in R


Cohorts are always a great way to split a group into segments and get a deeper view of whatever you are looking at. Imagine you have an online shop and would like to know how your user retention has developed over the last few weeks. I will explain cohorts below, after we have created some data to build one.

# get packages
library(ggplot2)
library(reshape2)
require(viridis)

# simulate cohort data
mydata = replicate(15, sort(runif(15, 1, 100), T))
mydata[lower.tri(mydata)] = NA

# convert to df and add cohort label
mydata = t(mydata)
mydata = as.data.frame(mydata)
mydata$cohort = as.factor(c(15:1))

# reshape and reorder
mydata = na.omit(melt(mydata, id.vars = "cohort"))
mydata$variable = as.numeric(gsub("V","",mydata$variable))
mydata$cohort = factor(mydata$cohort, levels=rev(levels(mydata$cohort)))

# plot cohort
ggplot(mydata, aes(variable, cohort)) +
 theme_minimal() +
 xlab('Week') +
 ylab('Cohort') +
 geom_tile(aes(fill = value), color='white') +
 scale_fill_viridis(direction = -1) +
 scale_x_continuous(breaks = round(seq(min(mydata$variable), max(mydata$variable), by = 1)))

With the code above you can simulate fifteen cohorts over a maximum period of fifteen weeks (or whatever the period might be). After creating some data you can easily use ggplot to build your cohort diagram. I have used a minimal theme and a neat viridis color palette.


The diagram above basically shows the retention rate of fifteen different groups. For example, about 25 percent of the people from cohort one came back to visit our online shop 15 weeks after their first visit. Cohort fifteen visited the online shop for the first time this week, which is why we only have data for one week. With this principle in mind you can analyze your retention rates over time.

And of course this little plot can be used for all kinds of different tasks. Make sure you check out the code on my GitHub along with other projects. I also recommend analyzecore.com for really good R-related marketing content.

Visualizing clustering results in R

Recently I thought about how to visualize the result of a cluster analysis. I do not mean the visualization of the clusters themselves, but of the results in terms of content and variable description: something you could hand to someone who does not understand the mechanics of cluster algorithms and just wants a description of the resulting clusters. I came up with a fairly easy ggplot solution, but let's get some data before we go into that.

# load packages
require(reshape2)
require(ggplot2)
require(viridis)
require(dplyr)

# get the data
url = 'http://www.biz.uiowa.edu/faculty/jledolter/DataMining/protein.csv'
food = read.csv(url)

# filter on specific countries
food = subset(food, food$Country %in% c("Albania","Belgium","Denmark","France","Romania","USSR","W Germany","Finland","UK"))

With the code above we get some example data on 25 European countries and their protein consumption (in percent) from nine major food sources, and we reduce the data set by filtering on nine countries. With the code below we transform the data to long format, which is required for plotting.

# melt data
DT1 = melt(food,id.vars = "Country")

# plot data
ggplot(DT1, aes(Country, value)) +
geom_bar(aes(fill = Country), position = "dodge", stat="identity") +
facet_wrap(~variable, scales = "free") +
xlab("") + ylab("protein intake (in %)") +
theme(axis.text.x=element_blank()) +
scale_fill_viridis(discrete=TRUE)

From here on it's just a few classic ggplot commands to get the diagram we want. I set up a grouped barplot with a facet wrap and some neat coloring from the viridis palette.


I think this plot is perfect for seeing the differences between the countries (clusters) in just one diagram. Find the full code on my GitHub along with other projects. Since this post uses countries as stand-ins for clusters, a small sketch of deriving actual cluster labels follows below.
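
The following is only a hypothetical sketch and not part of the original post: it derives cluster labels with kmeans on the food data from above and plots per-cluster mean profiles instead of per-country values; the number of clusters is an arbitrary choice.

# minimal sketch: cluster the countries and plot per-cluster mean profiles
set.seed(42)
km = kmeans(scale(food[, -1]), centers = 3, nstart = 25)
food$cluster = factor(km$cluster)

cluster_profile = melt(food, id.vars = c("Country", "cluster")) %>%
 group_by(cluster, variable) %>%
 summarise(value = mean(value))

ggplot(cluster_profile, aes(cluster, value)) +
 geom_bar(aes(fill = cluster), position = "dodge", stat = "identity") +
 facet_wrap(~variable, scales = "free") +
 xlab("") + ylab("mean protein intake (in %)") +
 scale_fill_viridis(discrete = TRUE)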

Marketing attribution with Markov chains in R

In the world of e-commerce a customer has often seen more than just one marketing channel before they buy a product. We call this a customer journey. Marketing attribution aims to quantify the importance of each channel across all customers. This information can then be used to optimize your marketing strategy and allocate your budget sensibly, but it also gives you valuable insights into your customers.


There are a lot of different models for allocating your conversions (or sales) to the different marketing channels. Most of the widely known models (e.g. last click) work in a heuristic manner and are fairly simple to implement, but they come with big restrictions. I am not going to explain these models in this blog post, as you can find tons of articles on the web about this topic.

Today we want to focus on a more sophisticated algorithmic approach to marketing attribution that works on the basis of Markov chains. In this model each customer journey is represented as a directed graph, where each vertex is a channel and the edges carry the probabilities of transitioning between channels (a toy transition matrix is sketched below). As we are going to focus on how to use this model in R, I totally recommend checking out the research by Eva Anderl and her colleagues. There is another research paper by Olli Rentola which gives a great overview of different algorithmic models for marketing attribution.
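
To make the idea concrete, here is a toy illustration (my own sketch, not part of the package workflow): count channel-to-channel transitions for three hypothetical journeys and turn them into probabilities.

# toy illustration: transition counts and probabilities for three made-up journeys
journeys = list(c("start", "SEO", "SEA", "conversion"),
                c("start", "SEO", "conversion"),
                c("start", "SEA", "SEO", "conversion"))

# collect all consecutive pairs (from -> to)
pairs = do.call(rbind, lapply(journeys, function(p) {
  data.frame(from = head(p, -1), to = tail(p, -1))
}))

# transition counts and row-wise transition probabilities
trans_counts = table(pairs$from, pairs$to)
trans_probs = prop.table(trans_counts, margin = 1)
round(trans_probs, 2)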

There is a great R package called ChannelAttribution by Davide Altomare which provides the right functions to build a Markov-based attribution model. But let's start by creating some data. With the code below we create customer journeys of different lengths, each with a userid and touchpoints with a channel on a specific date.

# load packages
require(dplyr)
require(reshape2)
require(ggplot2)
require(ChannelAttribution)
require(viridis)

# simulate some customer journeys
mydata = data.frame(userid = sample(c(1:1000), 5000, replace = TRUE),
                    date = sample(c(1:32), 5000, replace = TRUE),
                    channel = sample(c(0:9), 5000, replace = TRUE,
                              prob = c(0.1, 0.15, 0.05, 0.07, 0.11, 0.07, 0.13, 0.1, 0.06, 0.16)))
mydata$date = as.Date(mydata$date, origin = "2017-01-01")
mydata$channel = paste0('channel_', mydata$channel)

To feed our model with data we need to transform our table from long format into sequences with the code below. I used some simple dplyr commands to get this done and cleaned up the data with the gsub function.

# create sequence per user
seq = mydata %>%
 group_by(userid) %>%
 summarise(path = as.character(list(channel)))

# group identical paths and add up conversions
seq = seq %>%
 group_by(path) %>%
 summarise(total_conversions = n())

# clean paths
seq$path = gsub("c\\(|)|\"|([\n])","", seq$path) # strip list syntax
seq$path = gsub(",", " >", seq$path)             # commas become " >" separators

Now we are good to go and can run our models. The cool thing about the ChannelAttribution package is that it not only lets us fit the Markov chain model but also has a function to compute some basic heuristic models (e.g. last touch, first touch, linear touch). There are a lot more parameters to specify your model, but for our example this is going to be it; use the help function from the console to check out the possibilities (one option is sketched right below).
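
As one example of such a parameter (a sketch based on the package documentation; the walkthrough below sticks with the defaults), the order of the Markov chain can be increased:

# sketch: fit a second-order Markov chain instead of the default first-order model
dynamic_model_2 = markov_model(seq, "path", "total_conversions", order = 2)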

# run models
basic_model = heuristic_models(seq, "path", "total_conversions")
dynamic_model = markov_model(seq, "path", "total_conversions")

# build barplot
result = merge(basic_model,dynamic_model, by = "channel_name")
names(result) = c("channel","first","last","linear","markov")

result = melt(result, id.vars="channel")

ggplot(result, aes(channel, value)) +
 geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
 scale_fill_viridis(discrete=TRUE) +
 xlab("") + ylab("Conversions") +
 guides(fill = guide_legend(title = "Model"))

# build another barplot to see deviations
result = merge(basic_model,dynamic_model, by = "channel_name")
names(result) = c("channel","first","last","linear","markov")

result$first = ((result$first - result$markov)/result$markov)
result$last = ((result$last - result$markov)/result$markov)
result$linear = ((result$linear- result$markov)/result$markov)

result = melt(result[1:4], id.vars="channel")

ggplot(result, aes(channel, value)) +
 geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
 scale_fill_viridis(discrete=TRUE) +
 xlab("") + ylab("Deviation from markov") +
 guides(fill = guide_legend(title = "Model"))

Now we would like to display the results in a simple barplot (see the code above) to see which channels generate the most conversions and which need to catch up. I am using ggplot for this, together with the awesome viridis package for neat coloring.


We can go even further and use another barplot to see how the basic heuristic models perform compared to the fancy Markov model. Now we can clearly see some real differences between these models. If you are making serious decisions about which channels to spend your marketing budget on, you should definitely compare different models to get the full picture.


You can get the whole code on my GitHub along with other data-driven projects.

Analyzing visitor flows with Google’s chart tool in R

Let's say you have a website or an app and you would like to know how your visitors navigate through it. I came across the googleVis package to solve this task. It provides an interface to Google's chart tools and lets you create interactive charts based on data frames. In this package you will find a function to create sankey diagrams, which are a specific type of flow diagram; usually the weight of an arrow is shown proportionally to the flow quantity. A minimal toy example is sketched below, and then we put this into practice with some real data.
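
Here is a minimal, self-contained sketch with made-up flow counts; the data frame and its numbers are hypothetical, only the gvisSankey call mirrors what we use later:

# toy sankey example with hypothetical flow data
library(googleVis)

flows = data.frame(from = c("Main Screen", "Main Screen", "My Prizes Screen"),
                   to   = c("My Prizes Screen", "Tutorial Screen", "Prizes Screen"),
                   n    = c(120, 80, 45))

# opens the interactive chart in your browser
plot(gvisSankey(flows, from = "from", to = "to", weight = "n"))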

First we need some data. Imagine you have a data set in which all the page accesses of your visitors are stored in a simple data frame.

UserID Timestamp Screen_name
1947849340340 01.02.2017 12:55:02 Main Screen
1947849340340 01.02.2017 12:55:05 My Prizes Screen
1947849340340 01.02.2017 12:55:10 Tutorial Screen
1947849340340 01.02.2017 12:55:20 Reminder Screen
1947849340340 01.02.2017 12:55:22 Terms Screen
1947849340340 01.02.2017 12:55:42 Main Screen
1453754950034 01.02.2017 21:14:22 Main Screen
1453754950034 01.02.2017 21:14:23 My Prizes Screen
1453754950034 01.02.2017 21:14:29 Prizes Screen
1453754950034 01.02.2017 21:14:44 Prizes Screen

To build a sankey diagram we need to transform our table from long format into visitor paths. As you can see from the code below, I used a mix of simple dplyr code and the seqdef function from the TraMineR package, which lets you create a sequence object. I totally recommend checking out TraMineR if you are working with any kind of sequence data, as it provides a lot of different functions for mining, describing and visualizing sequences.

# packages (derived from the functions used below)
library(dplyr)
library(data.table)
library(splitstackshape) # cSplit()
library(TraMineR)        # seqdef()
library(googleVis)       # gvisSankey()

# create user paths from data frame
seq = mydata %>%
 group_by(user.id) %>%
 summarise(Path = paste(list(screen_name), sep = ", "))

# remove ugly stuff from paths
seq$Path = gsub("c\\(|)|\"|([\n])|-","", seq$Path)
seq$Path = gsub("\\","", seq$Path, fixed = T)

# split path column into single columns and create sequence object
seq_table = cSplit(as.data.table(seq), "Path", ",")
seq_table = seqdef(seq_table)

# create empty df for later
orders.plot = data.frame()

# save sequence object as df
orders = as.data.frame(seq_table[2:length(seq_table)])

# convert ugly % to END
orders = as.data.frame(lapply(orders, function(y) gsub("%", "END", y)))
orders[length(orders)+1] = "END"

# transform data to long table format for ploting
for (i in 2:ncol(orders)) {

  ord.cache = orders %>%
    group_by(orders[ , i-1], orders[ , i]) %>%
    summarise(n = n())

  colnames(ord.cache)[1:2] = c('from', 'to')

  ord.cache$from = paste(ord.cache$from, '(', i-1, ')', sep='')
  ord.cache$to = paste(ord.cache$to, '(', i, ')', sep='')

  orders.plot = rbind(orders.plot, as.data.frame(ord.cache))

}

# plot sankey
plot(gvisSankey(orders.plot, from='from', to='to', weight='n',
options=list(height=900, width=1800,
sankey="{link:{colorMode: 'source',
color:{fill:'source'}}}")))

For plotting purposes I needed to transform the data back into long table format. I also renamed the states called % to END, to make clear that a customer's journey has ended at that point. After calling the gvisSankey function your browser will open and you will have your neat visitor flow diagram.

And of course you can use sankey diagrams to visualize any type of sequence data. Make sure you check out my GitHub for the full code along with other projects.

Using association rules to perform a market basket analysis

Imagine you have an online shop and you would like to know which products are often bought together. This task is known as market basket analysis, in which retailers seek to understand the purchase behavior of their customers. This information can then be used for purposes of cross-selling and up-selling (Wikipedia).


Let us assume we have a data set which contains a list of customers of an online shop and the products they have bought (or viewed) in the past. We can see that one customer can have bought multiple products.

UserID ProductID
10039052252084471969 Product_587
10039052252084471969 Product_40
10039052252084471969 Product_154
10046183258816255929 Product_256
10046183258816255929 Product_44
10047293680636077566 Product_1184
10055849645924040293 Product_334
10060944748730254910 Product_306
10060944748730254910 Product_154
10060944748730254910 Product_78

We will use a rule-based machine learning algorithm called Apriori to perform our market basket analysis. It is intended to identify strong rules/relations in a data set. The easiest way to understand association rule mining is to look at the results of such an analysis. To do that, we first read in our data set from above as transactions in single format. I saved my data as a csv file with two columns, called mydata.csv. After this we use the apriori algorithm from the arules package to identify strong rules in the data set.

# packages
library(arules)
library(dplyr)   # for filter() below

# read in data
trans = read.transactions("mydata.csv", format = "single", sep = ";", cols = c("transactionID", "productID"), encoding = "UTF-8")

# run apriori algorithm
rules = apriori(trans, parameter = list(supp = 0.005, conf = 0.001, minlen = 2))

# sort rules by lift
rules = sort(rules, by = "lift", decreasing = T)

# print out rules to console
inspect(rules)

# remove redundant rules (newer versions of arules also offer is.redundant())
subset.matrix = is.subset(rules, rules)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] = NA
redundant = colSums(subset.matrix, na.rm = TRUE) >= 1
rules = rules[!redundant]

# write to df
ruledf = data.frame(
 lhs = labels(lhs(rules)),
 rhs = labels(rhs(rules)),
 rules@quality)

# filter rules on specific product
ruledf = filter(ruledf, lhs == "{Product_183}")

Running the apriori algorithm with the code above will give us a list of association rules based on our input data. Let us have a look at the output of the model to see what these rules look like. You can use the inspect command from the arules package to print out rules to the console.

lhs rhs support confidence lift
{Product_125} {Product_306} 0.006 0.387 4.040
{Product_306} {Product_125} 0.006 0.072 4.040
{Product_63} {Product_385} 0.005 0.400 19.472
{Product_385} {Product_63} 0.005 0.285 19.472
{Product_264} {Product_92} 0.005 0.378 27.143
{Product_92} {Product_264} 0.005 0.360 27.143
{Product_523} {Product_306} 0.005 0.369 3.859
{Product_306} {Product_523} 0.005 0.061 3.858
{Product_102} {Product_120} 0.005 0.506 8.460

An example rule for our data set could be {product_125} ⇒ {product_306}, meaning that if product_125 is bought, customers also buy product_306. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.

The support is defined as the proportion of transactions in the data set which contain the specific product(s). In the table above, the rule {product_125} ⇒ {product_306} has a support of 0.006, meaning that the two products have been bought together in 0.6% of all transactions. The confidence is another important measure of interest. The rule {product_125} ⇒ {product_306} has a confidence of 0.387, which means that the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS is 38.7%.

If you want to execute the Apriori algorithm you need to define both a minimum support and a minimum confidence constraint at the same time; this helps you filter out interesting rules. We also defined a minimum length of two because we want each rule to cover at least two products.

Another popular measure of interest is the lift of an association rule. The lift is defined as lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) supp(Y)) and can be interpreted as the deviation of the support of the whole rule from the support expected under independence, given the supports of the LHS and the RHS. Greater lift values indicate stronger associations; a small worked example follows below. There is a lot more to discover about association rule mining with the arules package if you look at its reference manual.
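
As a quick sanity check (a sketch using only the definitions above and the numbers reported for the first rule in the table), the lift can be reconstructed from support and confidence:

# numbers taken from the rule {Product_125} => {Product_306} above
supp_xy = 0.006          # support of the whole rule, supp(X u Y)
conf_xy = 0.387          # confidence of X => Y
lift_xy = 4.040          # reported lift

supp_x = supp_xy / conf_xy   # support of the LHS alone, ~0.0155
supp_y = conf_xy / lift_xy   # support of the RHS alone, ~0.0958
supp_xy / (supp_x * supp_y)  # reproduces the reported lift of ~4.04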

lhs rhs support confidence lift
{Product_92} {Product_264} 0.005 0.360 27.143
{Product_374} {Product_378} 0.006 0.398 21.923
{Product_98} {Product_929} 0.012 0.556 20.165
{Product_375} {Product_376} 0.007 0.365 20.139
{Product_257} {Product_880} 0.006 0.378 19.847
{Product_63} {Product_385} 0.005 0.400 19.472
{Product_908} {Product_98} 0.007 0.412 18.702
{Product_376} {Product_378} 0.006 0.331 18.338
{Product_378} {Product_375} 0.006 0.384 17.824
{Product_54} {Product_719} 0.005 0.256 17.415

In the table above we sorted our rules by lift, and now we can see the ten most interesting associations in our data set. This information can now be used for purposes of cross-selling and up-selling. We also removed the redundant rules from this table. As you can see from the code above, you can also easily use filters to find rules for a specific product. Find the whole code along with other projects on my GitHub.