Analyzing visitor flows with Google’s chart tool in R

Let’s say you have a website or an app and you would like to know how your visitors navigate through it. I came across the googleVis package to solve this task. It provides an interface to Google’s chart tools and lets you create interactive charts based on data frames. The package includes a function to create Sankey diagrams, a specific type of flow diagram in which the width of each arrow is proportional to the flow quantity. Let’s put this into practice.
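
To make the input format concrete, here is a minimal sketch of the kind of data frame gvisSankey takes: one row per link, with a source, a target and a weight. The screen names and numbers are made up for illustration, and the chart is only drawn if googleVis is installed and R runs interactively.

```r
# Hypothetical toy flows: three links between made-up screens.
flows <- data.frame(
  from   = c("Main", "Main", "Prizes"),
  to     = c("Prizes", "Tutorial", "Tutorial"),
  weight = c(5, 3, 2),
  stringsAsFactors = FALSE
)

# Draw the chart only when googleVis is available and a browser can be opened.
if (interactive() && requireNamespace("googleVis", quietly = TRUE)) {
  chart <- googleVis::gvisSankey(flows, from = "from", to = "to",
                                 weight = "weight")
  plot(chart)
}
```

The link from Main to Prizes carries the largest weight, so it would be drawn as the widest band.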

First we need some data. Imagine you have a data set where all the page accesses from your visitors are stored in a simple data frame.

user.id        timestamp            screen_name
1947849340340  01.02.2017 12:55:02  Main Screen
1947849340340  01.02.2017 12:55:05  My Prizes Screen
1947849340340  01.02.2017 12:55:10  Tutorial Screen
1947849340340  01.02.2017 12:55:20  Reminder Screen
1947849340340  01.02.2017 12:55:22  Terms Screen
1947849340340  01.02.2017 12:55:42  Main Screen
1453754950034  01.02.2017 21:14:22  Main Screen
1453754950034  01.02.2017 21:14:23  My Prizes Screen
1453754950034  01.02.2017 21:14:29  Prizes Screen
1453754950034  01.02.2017 21:14:44  Prizes Screen
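
The order of the rows matters, because a path is just the screens of one visitor in chronological order. Below is a small base-R sketch of how such a table could be parsed and sorted. It assumes the columns are called user.id and screen_name, as in the code in this post, that the time column is called timestamp, and that the times are in UTC; the column name timestamp and the time zone are my assumptions.

```r
# Toy copy of a few rows, deliberately shuffled.
# IDs are kept as character so they are not printed in scientific notation.
mydata <- data.frame(
  user.id     = c("1453754950034", "1947849340340",
                  "1947849340340", "1453754950034"),
  timestamp   = c("01.02.2017 21:14:23", "01.02.2017 12:55:05",
                  "01.02.2017 12:55:02", "01.02.2017 21:14:22"),
  screen_name = c("My Prizes Screen", "My Prizes Screen",
                  "Main Screen", "Main Screen"),
  stringsAsFactors = FALSE
)

# Parse day.month.year timestamps (time zone is an assumption).
mydata$timestamp <- as.POSIXct(mydata$timestamp,
                               format = "%d.%m.%Y %H:%M:%S", tz = "UTC")

# Sort by user, then chronologically within each user.
mydata <- mydata[order(mydata$user.id, mydata$timestamp), ]
rownames(mydata) <- NULL
```

Sorting on the parsed time rather than on the raw string is the safer choice here, since strings in day.month.year format do not sort chronologically across days.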

To build a Sankey diagram we need to transform our table from long format into visitor paths, with one row per visitor. As you can see from the code below, I used a mix of simple dplyr code, cSplit from the splitstackshape package and the seqdef function from the TraMineR package, which lets you create a sequence object. I totally recommend checking out TraMineR if you are working with any kind of sequence data, as it provides a lot of different functions for mining, describing and visualizing sequence data.
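
Before the full pipeline, here is a minimal base-R sketch of just the reshaping idea, on made-up data. It is not the dplyr/TraMineR code referred to above (that follows right after); it only shows what "long format into visitor paths" means. Paths shorter than the longest one are padded with NA here, which plays the role of the % padding that seqdef produces.

```r
# Toy long-format data, assumed to be in chronological order per user.
visits <- data.frame(
  user.id     = c("a", "a", "a", "b", "b"),
  screen_name = c("Main", "Prizes", "Terms", "Main", "Tutorial"),
  stringsAsFactors = FALSE
)

# One character vector of screens per user.
paths <- split(visits$screen_name, visits$user.id)

# Pad every path with NA up to the length of the longest one,
# then arrange them as one row per user and one column per step.
max_len <- max(lengths(paths))
wide <- t(vapply(
  paths,
  function(p) c(p, rep(NA_character_, max_len - length(p))),
  character(max_len)
))
colnames(wide) <- paste0("step_", seq_len(max_len))
```

The result is a character matrix with one row per visitor; visitor b has only two screens, so the third step is NA.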

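# packages used below
library(dplyr)           # group_by, summarise, %>%
library(data.table)      # as.data.table
library(splitstackshape) # cSplit
library(TraMineR)        # seqdef
library(googleVis)       # gvisSankey

# create user paths from data frame
# (assumes the rows of mydata are already in chronological order per user)
seq = mydata %>%
  group_by(user.id) %>%
  summarise(Path = paste(screen_name, collapse = ","))

# remove hyphens and line breaks from the screen names
seq$Path = gsub("-|\n", "", seq$Path)

# split path column into single columns and create sequence object
# (the user.id column is carried along here and dropped further down)
seq_table = cSplit(as.data.table(seq), "Path", ",")
seq_table = seqdef(seq_table)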

# create empty df for later
orders.plot = data.frame()

# save sequence object as df
orders = as.data.frame(seq_table[2:length(seq_table)])

# convert ugly % to END
orders = as.data.frame(lapply(orders, function(y) gsub("%", "END", y)))
orders[length(orders)+1] = "END"

# transform data to long table format for plotting
for (i in 2:ncol(orders)) {

  # count how many visitors move from step i-1 to step i
  ord.cache = orders %>%
    group_by(orders[ , i-1], orders[ , i]) %>%
    summarise(n=n())

  colnames(ord.cache)[1:2] = c('from', 'to')

  # append the step number so that each node is unique per step
  ord.cache$from = paste(ord.cache$from, '(', i-1, ')', sep='')
  ord.cache$to = paste(ord.cache$to, '(', i, ')', sep='')

  orders.plot = rbind(orders.plot, as.data.frame(ord.cache))

}

# plot sankey
plot(gvisSankey(orders.plot, from='from', to='to', weight='n',
                options=list(height=900, width=1800,
                             sankey="{link:{colorMode: 'source', color:{fill:'source'}}}")))

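For plotting purposes I needed to transform the data back into a long table, this time with one row per transition and the columns from, to and n. Each screen name also gets its step number appended, so the same screen at different steps becomes a separate node; this keeps the data free of cycles, which Google’s Sankey chart cannot draw. I also renamed the states called % to END. TraMineR uses % to pad sequences that are shorter than the longest one, so END makes it clear that a visitor’s journey has ended at this point. After calling plot on the gvisSankey object, your browser will open and you will have your neat visitor flow diagram.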
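
As a rough illustration of that last reshaping step, here is a base-R sketch that counts the transitions between neighbouring steps. It uses table() instead of the dplyr loop above, and the wide table with the columns s1 to s3 is made up for the example.

```r
# Toy wide table: one row per visitor, one column per step, padded with "END".
orders <- data.frame(
  s1 = c("Main", "Main", "Main"),
  s2 = c("Prizes", "Prizes", "Tutorial"),
  s3 = c("END", "Terms", "END"),
  stringsAsFactors = FALSE
)

links <- do.call(rbind, lapply(2:ncol(orders), function(i) {
  # Count each (from, to) pair between step i-1 and step i.
  counts <- as.data.frame(table(from = orders[[i - 1]], to = orders[[i]]),
                          stringsAsFactors = FALSE)
  # Drop pairs that never occur.
  counts <- counts[counts$Freq > 0, ]
  # Suffix the step number so a screen at different steps is a distinct node.
  data.frame(from = paste0(counts$from, "(", i - 1, ")"),
             to   = paste0(counts$to, "(", i, ")"),
             n    = counts$Freq,
             stringsAsFactors = FALSE)
}))
```

The resulting links data frame has the same from, to and n columns as orders.plot, so it could be passed to gvisSankey in the same way.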

And of course you can use Sankey diagrams to visualize any type of sequence data. Make sure you check out my GitHub for the full code along with other projects.


Author: inside data blog

data analysis & visualization blog
