Design your own Leaflet maps with Mapbox Studio

[Figure: custom map style displayed as an interactive Leaflet map]

Recently I wanted to create a Leaflet map with a specific type of map style, but I could not find an appropriate design on the web. It turns out you can use Mapbox Studio to easily design your own maps and use them from within the R package for Leaflet. With the code down below we will get an interactive map of Hamburg with our own little design.

# packages
require(leaflet)

# set tiles parameters
tcu_map = "YourLeafletURL"
map_attr = "© <a href='https://www.mapbox.com/map-feedback/'>Mapbox</a> Basemap © <a href='https://insidedatablog.wordpress.com/'>Inside Data Design</a>"

# plot
leaflet() %>%
 setView(lng = 9.993682, lat = 53.551085, zoom = 11) %>%
 addTiles(urlTemplate = tcu_map, attribution = map_attr)

First we have to visit the Mapbox website, sign up for an account and create our own map via Mapbox Studio. After creating your own style (it is best to start from a default style and adapt it to your needs), Mapbox will offer you a URL which can be used to display your style in Leaflet. You will find the URL under Styles, in the dropdown menu of your created style (next to the edit button). If you haven't created any style yet, go to “New style” to create your first own map design.
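In case you are wondering what to put into tcu_map: at the time of writing, the URL you copy from Mapbox Studio roughly follows the pattern below. User name, style id and access token are placeholders for your own values, so treat this as a sketch rather than a guaranteed format.

# hypothetical example of a Mapbox Studio style URL --
# replace user name, style id and token with your own values
tcu_map = "https://api.mapbox.com/styles/v1/yourusername/yourstyleid/tiles/256/{z}/{x}/{y}?access_token=yourtoken"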

Make sure you check out the code on my Github along with other projects.

Build simple but nifty cohorts in R

[Figure: cohort analysis]

Cohorts are always a great way to split a group into segments and get a deeper view of whatever you are looking at. Imagine you have an online shop and would like to know how your user retention has developed over the last few weeks. I will explain the cohort diagram below, after we have created some data to build it.

# get packages
library(ggplot2)
library(reshape2)
require(viridis)

# simulate cohort data
mydata = replicate(15, sort(runif(15, 1, 100), decreasing = TRUE))
mydata[lower.tri(mydata)] = NA

# convert to df and add cohort label
mydata = t(mydata)
mydata = as.data.frame(mydata)
mydata$cohort = as.factor(c(15:1))

# reshape and reorder
mydata = na.omit(melt(mydata, id.vars = "cohort"))
mydata$variable = as.numeric(gsub("V","",mydata$variable))
mydata$cohort = factor(mydata$cohort, levels=rev(levels(mydata$cohort)))

# plot cohort
ggplot(mydata, aes(variable, cohort)) +
 theme_minimal() +
 xlab('Week') +
 ylab('Cohort') +
 geom_tile(aes(fill = value), color='white') +
 scale_fill_viridis(direction = -1) +
 scale_x_continuous(breaks = round(seq(min(mydata$variable), max(mydata$variable), by = 1)))

With the code above you can simulate fifteen cohorts over a maximum period of fifteen weeks (or whatever the period might be). After creating some data you can easily use ggplot to build your cohort diagram. I have used a minimal theme and a neat viridis color palette.

[Plot: cohort heatmap, fifteen cohorts by week]

The diagram above basically shows the retention rate of fifteen different groups. For example, about 25 percent of the people from cohort one came back to visit our online shop fifteen weeks after their first visit. Cohort fifteen visited the online shop for the first time this week, which is why we only have one week of data for it. With this principle in mind you can analyze your retention rates over time.
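If you want to build the same diagram from real data instead of simulated values, the sketch below shows one way to do it. It assumes a hypothetical visits data frame with one row per shop visit (userid and date are made-up column names); the resulting cohort/week/value columns can be plotted exactly like mydata above.

# packages
require(dplyr)

# hypothetical raw data: one row per shop visit
visits = data.frame(userid = sample(1:500, 3000, replace = TRUE),
                    date = as.Date("2017-01-01") + sample(0:104, 3000, replace = TRUE))

# week number relative to the start of the observation window
visits$week = as.numeric(visits$date - min(visits$date)) %/% 7 + 1

# assign each user to the week of their first visit (the cohort),
# then compute the share of each cohort that is active per week
retention = visits %>%
 group_by(userid) %>%
 mutate(cohort = min(week)) %>%
 group_by(cohort, week) %>%
 summarise(value = n_distinct(userid)) %>%
 mutate(value = 100 * value / first(value)) # first week = 100 percent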

And of course this little plot can be used for all kinds of different tasks. Make sure you check out the code on my Github along with other projects. I also recommend analyzecore.com for really good R-related marketing content.

Visualizing clustering results in R

Recently I thought about how to visualize the results of a cluster analysis. I do not mean the visualization of the clusters itself but the results in terms of content and variable description – something you could give away to someone who does not understand the mechanics of cluster algorithms and just wants to see a description of the resulting clusters. I came up with a fairly easy ggplot solution, but let's get some data before we go into that.

# load packages
require(reshape2)
require(ggplot2)
require(viridis)
require(dplyr)

# get the data
url = 'http://www.biz.uiowa.edu/faculty/jledolter/DataMining/protein.csv'
food = read.csv(url)

# filter on specific countries
food = subset(food, Country %in% c("Albania","Belgium","Denmark","France","Romania",
                                   "USSR","W Germany","Finland","UK"))

With the code above we get example data on 25 European countries and their protein consumption (in percent) from nine major food sources, and reduce the data set by filtering on nine countries. The code below transforms the data into the long table format required for plotting.

# melt data
DT1 = melt(food,id.vars = "Country")

# plot data
ggplot(DT1, aes(Country, value)) +
 geom_bar(aes(fill = Country), position = "dodge", stat = "identity") +
 facet_wrap(~variable, scales = "free") +
 xlab("") + ylab("protein intake (in %)") +
 theme(axis.text.x = element_blank()) +
 scale_fill_viridis(discrete = TRUE)

From here on it's just a bit of classic ggplot code to get the diagram we want. I set up a grouped barplot with a facet wrap and some neat coloring with the viridis palette.

[Plot: protein intake per country, faceted by food source]

I think this plot is perfect for seeing the differences between the countries (clusters) in just one diagram. Find the full code on my Github along with other projects.
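If you want the bars to describe actual clusters rather than single countries, a minimal sketch could look like the following: it runs kmeans on the scaled protein variables (three centers are an arbitrary choice here) and plots the mean intake per cluster with the same ggplot recipe.

# sketch: aggregate by kmeans clusters instead of countries
set.seed(1)
food$cluster = as.factor(kmeans(scale(food[ , -1]), centers = 3)$cluster)

DT2 = melt(food[ , -1], id.vars = "cluster") %>%
 group_by(cluster, variable) %>%
 summarise(value = mean(value))

ggplot(DT2, aes(cluster, value)) +
 geom_bar(aes(fill = cluster), position = "dodge", stat = "identity") +
 facet_wrap(~variable, scales = "free") +
 xlab("") + ylab("protein intake (in %)") +
 scale_fill_viridis(discrete = TRUE)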

Marketing attribution with markov chains in R

In the world of e-commerce a customer has often seen more than just one marketing channel before they buy a product. We call this a customer journey. The goal of marketing attribution is to find out the importance of each channel across all customers. This information can be used to optimize your marketing strategy and allocate your budget, and it also gives you valuable insights into your customers.


There are a lot of different models for allocating your conversions (or sales) to the different marketing channels. Most of the widely known models (e.g. last click) work in a heuristic manner and are fairly simple to implement, but come with big restrictions. I am not going to explain these models in this blog post, as you can find tons of articles on the web about this topic.

Today we want to focus on a more sophisticated algorithmic approach to marketing attribution which works on the basis of markov chains. In this model each customer journey is represented in a directed graph where each vertex is a channel and the edges represent the probabilities of transition between the channels. As we are going to focus on how to use this model in R, I totally recommend checking out the research by Eva Anderl and her colleagues. There is another research paper by Olli Rentola which gives a great overview of different algorithmic models for marketing attribution.
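To make the graph idea concrete, here is a tiny hand-made sketch (independent of any package, with made-up journeys): it counts all single-step transitions and normalizes them per row, which gives exactly the edge weights of the graph described above.

# toy example: estimate transition probabilities between channels
journeys = list(c("start", "seo", "sea", "conversion"),
                c("start", "sea", "conversion"),
                c("start", "seo", "seo", "null"))

# collect all single-step transitions from the journeys
trans = do.call(rbind, lapply(journeys, function(j)
 data.frame(from = head(j, -1), to = tail(j, -1))))

# relative frequency of each transition = edge weight in the graph
round(prop.table(table(trans$from, trans$to), margin = 1), 2)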

There is a great R package called ChannelAttribution by Davide Altomare which provides you with the right functions to build a markov-based attribution model. But let's start with creating some data. With the code below we are going to create customer journeys of different lengths, with a user id and their channel touchpoints on specific dates.

# load packages
require(dplyr)
require(reshape2)
require(ggplot2)
require(ChannelAttribution)
require(viridis)

# simulate some customer journeys
mydata = data.frame(userid = sample(c(1:1000), 5000, replace = TRUE),
                    date = sample(c(1:32), 5000, replace = TRUE),
                    channel = sample(c(0:9), 5000, replace = TRUE,
                              prob = c(0.1, 0.15, 0.05, 0.07, 0.11, 0.07, 0.13, 0.1, 0.06, 0.16)))
mydata$date = as.Date(mydata$date, origin = "2017-01-01")
mydata$channel = paste0('channel_', mydata$channel)

To feed our model with data we need to transform our table from long format into sequences with the code below. I used some simple dplyr commands to get this done (note the arrange call, which makes sure each path is in chronological order) and cleaned up the data with gsub.

# create sequence per user, ordered by date
seq = mydata %>%
 arrange(userid, date) %>%
 group_by(userid) %>%
 summarise(path = as.character(list(channel)))

# group identical paths and add up conversions
seq = seq %>%
 group_by(path) %>%
 summarise(total_conversions = n())

# clean paths
seq$path = gsub("c\\(|\\)|\"|\n", "", seq$path)
seq$path = gsub(",", " >", seq$path)

Now we are good to go and run our models. The cool thing about the ChannelAttribution package is that it not only allows us to fit the markov chain but also has a function to compute some basic heuristic models (e.g. last touch, first touch, linear touch). There are a lot more parameters to specify your model, but for our example this is going to be it. Use the help function from the console to check out your possibilities.
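To give an idea of those extra parameters, the sketch below fits a second-order chain and requests additional output; order and out_more are arguments documented in the package, but check ?markov_model to verify names and defaults in your version.

# hedged sketch: a higher-order markov model with extra output
detailed_model = markov_model(seq, "path", "total_conversions",
                              order = 2, out_more = TRUE)
detailed_model$result            # attributed conversions per channel
detailed_model$transition_matrix # estimated transition probabilities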

# run models
basic_model = heuristic_models(seq, "path", "total_conversions")
dynamic_model = markov_model(seq, "path", "total_conversions")

# build barplot
result = merge(basic_model,dynamic_model, by = "channel_name")
names(result) = c("channel","first","last","linear","markov")

result = melt(result, id.vars="channel")

ggplot(result, aes(channel, value)) +
 geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
 scale_fill_viridis(discrete=TRUE) +
 xlab("") + ylab("Conversions") +
 guides(fill = guide_legend(title = "Model"))

# build another barplot to see deviations
result = merge(basic_model,dynamic_model, by = "channel_name")
names(result) = c("channel","first","last","linear","markov")

result$first = ((result$first - result$markov)/result$markov)
result$last = ((result$last - result$markov)/result$markov)
result$linear = ((result$linear- result$markov)/result$markov)

result = melt(result[1:4], id.vars="channel")

ggplot(result, aes(channel, value)) +
 geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
 scale_fill_viridis(discrete=TRUE) +
 xlab("") + ylab("Deviation from markov") +
 guides(fill = guide_legend(title = "Model"))

Now we would like to display the results in a simple barplot (see code above) to see which channels are generating the most conversions and which need to catch up. I am using ggplot for this with the awesome viridis package for a neat coloring.

[Plot: conversions per channel for each attribution model]

We can go even further and use another barplot to see how the basic heuristic models perform compared to our fancy markov model. Now we can clearly see some real differences between all these models. If you are making serious decisions about which channels to spend your marketing budget on, you should definitely compare different models to get the full picture.

[Plot: deviation of the heuristic models from the markov model, per channel]

You can get the whole code on my Github along with other data driven projects.

Cultural data hackathon with Shiny and Leaflet

CodingDaVinci is the first German open cultural data hackathon, started in Berlin in 2014. It brings together cultural heritage institutions and the hacker & designer community to develop ideas and prototypes for the cultural sector and the public. In 2016 the hackathon took place in Hamburg and ran for a total of 10 weeks. Open cultural data is usually held by cultural heritage institutions such as galleries, libraries, archives and museums (GLAMs). A list of projects from 2016 can be seen on their website.

[Figure: CodingDaVinci Nord 2016]

As I was playing around with Leaflet recently, I wanted to take part in the hackathon and develop something around spatial data. Leaflet is one of the most popular open-source JavaScript libraries for interactive maps, and there is a wonderful R package that makes it easy to integrate and control Leaflet maps from R. You can find a neat little GitHub page that will let you get started with Leaflet easily.

In the end I took visitor data from the museums in Hamburg. The data captures group bookings from private persons, schools, universities and other educational institutions. I came up with the idea of a map which shows the number of museum visits by schools per neighborhood. On another layer I also mapped the number of students per neighborhood and plotted every school as a marker on the map. This way the user can find out in which areas of Hamburg the schools often visit the museums and which neighborhoods need to catch up.
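This is roughly how such a layered map can be stacked in R. The inputs are hypothetical (hoods as a spatial polygons data frame with visits and students columns, schools as a data frame with lng, lat and name), so treat it as a sketch of the approach rather than the actual app code.

# packages
require(leaflet)

# sketch: two choropleth layers plus school markers
leaflet(hoods) %>%
 addTiles() %>%
 addPolygons(fillColor = ~colorQuantile("YlOrRd", visits)(visits),
             weight = 1, fillOpacity = 0.7, group = "Museum visits") %>%
 addPolygons(fillColor = ~colorQuantile("Blues", students)(students),
             weight = 1, fillOpacity = 0.7, group = "Students") %>%
 addMarkers(data = schools, lng = ~lng, lat = ~lat, popup = ~name) %>%
 addLayersControl(baseGroups = c("Museum visits", "Students"))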

[Figure: interactive map of school museum visits per neighborhood in Hamburg]

I think this map is relevant because cultural education is an indispensable part of our education system, as it is essential to the dignity of a person and the free development of their personality (Article 22 of the UN Universal Declaration of Human Rights). An important part of this is the school trip to the museum, in order to attract students to art and culture as well as to strengthen their personality formation and social competence by means of cultural education.

Make sure you try out the map. If you are interested in the code you can check out my Github.

Deploy your Shiny apps easily with shinyapps.io


In the past I created several Shiny apps to display data interactively. Sometimes you might want to share your application with other people but have neither your own server nor the skills to set one up, and you do not want to pay for one either. If you just want to make your Shiny app public without any big effort or money, you should definitely check out shinyapps.io. They describe themselves as a platform as a service (PaaS) for hosting Shiny web apps (applications).

# get package
require(rsconnect)

# change bundle size for bigger apps
options(rsconnect.max.bundle.size = 30000000000)

# set your parameters
mytoken = "MyTokenID"
mysecret = "MySecretID"
myname = "MyName" # you could use multiple free accounts 🙂

# connect your account
rsconnect::setAccountInfo(name = myname,
                          token = mytoken,
                          secret = mysecret)

# deploy your app
rsconnect::deployApp("Path to the folder of your APP")

To get started you will need the latest version of their rsconnect package. You also need to sign up for an account on their website to get your token and secret (both essential for using the service). Once that is done, you can configure your account using the setAccountInfo function and deploy your app with deployApp. There are two things I struggled with that you might want to know about: put all the scripts and contents for your app in one folder, and remove all setwd() calls from your scripts. If you have a big app, use the rsconnect.max.bundle.size option to solve your size issues.
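For reference, a minimal deployable app folder can consist of a single app.R like the sketch below: everything lives in one folder, there is no setwd(), and any files would be referenced relative to that folder.

# app.R -- minimal self-contained shiny app
library(shiny)

ui = fluidPage(
 sliderInput("n", "Number of points", min = 10, max = 100, value = 50),
 plotOutput("plot")
)

server = function(input, output) {
 output$plot = renderPlot(plot(runif(input$n), runif(input$n)))
}

shinyApp(ui = ui, server = server)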

Make sure you check out my Github for other data driven projects.

Analyzing visitor flows with Google’s chart tool in R

Let's say you have a website or an app and you would like to know how your visitors navigate through it. I came across the googleVis package to solve this task. It provides an interface to Google's chart tools and lets you create interactive charts based on data frames. In this package you will find a function to create sankey diagrams, a specific type of flow diagram in which the width of an arrow is shown proportionally to the flow quantity. Let's put this into practice.

First we need some data. Imagine you have a data set where all the page accesses of your visitors are stored in a simple data frame.

UserID         Timestamp            Screen_name
1947849340340  01.02.2017 12:55:02  Main Screen
1947849340340  01.02.2017 12:55:05  My Prizes Screen
1947849340340  01.02.2017 12:55:10  Tutorial Screen
1947849340340  01.02.2017 12:55:20  Reminder Screen
1947849340340  01.02.2017 12:55:22  Terms Screen
1947849340340  01.02.2017 12:55:42  Main Screen
1453754950034  01.02.2017 21:14:22  Main Screen
1453754950034  01.02.2017 21:14:23  My Prizes Screen
1453754950034  01.02.2017 21:14:29  Prizes Screen
1453754950034  01.02.2017 21:14:44  Prizes Screen
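If you want to follow along without real tracking data, you can rebuild the sample table as a data frame. The column names (user.id, timestamp, screen_name) are chosen to match the code further below.

# rebuild the sample table as a data frame
mydata = data.frame(
 user.id = rep(c("1947849340340", "1453754950034"), c(6, 4)),
 timestamp = as.POSIXct(c("2017-02-01 12:55:02", "2017-02-01 12:55:05",
                          "2017-02-01 12:55:10", "2017-02-01 12:55:20",
                          "2017-02-01 12:55:22", "2017-02-01 12:55:42",
                          "2017-02-01 21:14:22", "2017-02-01 21:14:23",
                          "2017-02-01 21:14:29", "2017-02-01 21:14:44")),
 screen_name = c("Main Screen", "My Prizes Screen", "Tutorial Screen",
                 "Reminder Screen", "Terms Screen", "Main Screen",
                 "Main Screen", "My Prizes Screen", "Prizes Screen",
                 "Prizes Screen"))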

To build a sankey diagram we need to transform our table from long format into visitor paths. As you can see from the code below, I used a mix of simple dplyr code and the seqdef function from the TraMineR package, which lets you create a sequence object. I totally recommend checking out TraMineR if you are working with any kind of sequence data, as it provides a lot of different functions for mining, describing and visualizing sequence data.

# load packages (cSplit comes from splitstackshape)
require(dplyr)
require(data.table)
require(splitstackshape)
require(TraMineR)
require(googleVis)

# create user paths from data frame, ordered by timestamp
seq = mydata %>%
 arrange(user.id, timestamp) %>%
 group_by(user.id) %>%
 summarise(Path = paste(list(screen_name), sep = ", "))

# remove ugly stuff from paths
seq$Path = gsub("c\\(|\\)|\"|\n|-", "", seq$Path)
seq$Path = gsub("\\", "", seq$Path, fixed = T)

# split path column into single columns and create sequence object
seq_table = cSplit(as.data.table(seq), "Path", ",")
seq_table = seqdef(as.data.frame(seq_table)[ , -1]) # drop user.id before building the sequence object

# create empty df for later
orders.plot = data.frame()

# save sequence object as df
orders = as.data.frame(seq_table)

# convert the % padding state from seqdef to END
orders = as.data.frame(lapply(orders, function(y) gsub("%", "END", y)))
orders[length(orders)+1] = "END"

# transform data to long table format for ploting
for (i in 2:ncol(orders)) {

 # count transitions between step i-1 and step i
 ord.cache = orders %>%
  group_by(orders[ , i-1], orders[ , i]) %>%
  summarise(n = n())

 colnames(ord.cache)[1:2] = c('from', 'to')

 # append the step number so each state is unique per column
 ord.cache$from = paste(ord.cache$from, '(', i-1, ')', sep = '')
 ord.cache$to = paste(ord.cache$to, '(', i, ')', sep = '')

 orders.plot = rbind(orders.plot, as.data.frame(ord.cache))

}

# plot sankey
plot(gvisSankey(orders.plot, from = 'from', to = 'to', weight = 'n',
                options = list(height = 900, width = 1800,
                               sankey = "{link:{colorMode: 'source', color:{fill:'source'}}}")))

For plotting purposes I needed to transform the data back to long table format. I also renamed the % states (the padding TraMineR uses for sequences of different lengths) to END, to make clear that a customer's journey has ended at this point. After calling the gvisSankey function your browser will open and you will have your neat visitor flow diagram.

And of course you can use sankey diagrams to visualize any type of sequence data. Make sure you check out my Github for the full code along with other projects.