Sorting tables with Google’s weighted sort algorithm

Imagine you have a data set which contains every page of your website and two corresponding key performance indicators (KPI). The first KPI are your page impressions which are defined as number of times a specific page has been accessed. The second KPI is the bounce rate and represents the percentage of visitors who enter your website on a specific page and then leave (“bounce”) directly rather than continuing on to view other pages within your website. The bounce rate is a measure of the effectiveness of a website.

Page Page_Impressions Bounce_Rate
Page_483 883302 0.16
Page_186 459298 0.1
Page_307 366320 0.0
Page_388 312929 0.01
Page_308 261719 0.0
Page_277 226548 0.03
Page_31 212580 0.02
Page_129 189652 0.12
Page_337 176998 0.02

So far so good, but if you sort your table via bounce rate you have not achieved much, as most of the pages with a high bounce rate will have a small amount of page impressions. The underlying question is the following: How can I get pages with a high bounce rate but also a high amount of page impressions?

Page Page_Impressions Bounce_Rate
Page_108 1 1.00
Page_121 1 1.00
Page_83 3 1.00
Page_45 3 0.67
Page_112 4 0.67
Page_11 12 0.57
Page_140 388 0.56
Page_89 2 0.5
Page_133 2 0.5

Google developed a weighted sort algorithm that will help you. It is based on calculating the Estimated True Value (ETV) and using the ETV as a score on which to sort. The ETV computation is performed for each page as follows:

ETV(page x) = 
(PI / maxPI * BR) + ((1 – PI / maxPI)) * avgBR)

BR = Bounce rate for the page x

PI = Page impressions for the page x

maxPI = The highest amount of page impressions found for any page

avgBR = The Bounce rate for the whole table

The ETV calculation essentially says, “I need to figure out whether the row of data (page x) that I’m looking at is a large or a small percentage of the data in the whole table. If it’s a large percentage, then I need to take this row seriously in the sense that the data is probably meaningful. If it’s a small percentage, then I shouldn’t take this row seriously, and the reality is that I’d probably be more ‘correct’ if I just leaned towards using the average Bounce Rate for the whole table.“ Now you can sort your table via ETV and you will have the most interesting pages on top of the table.

Page Page_Impression Bounce_Rate ETV
Page_483 883302 0.16 0.16
Page_186 459298 0.1 0.07557256
Page_129 189652 0.12 0.06433205
Page_35 58882 0.17 0.05717034
Page_37 81772 0.13 0.05660003
Page_179 126322 0.09 0.05495925
Page_274 19304 0.3 0.05459478
Page_377 174237 0.07 0.05323211
Page_275 58731 0.1 0.05249535

Find the R code below to compute the ETV. It is also on my Github.

# calculate max visits
max_visit = max(dataset$Page_Impressions)
dataset$prop_visits = dataset$Page_Impressions/max_visit

# calculate average weighted bounce rate
weighted_avg_br = sum(dataset$Page_Impressions * dataset$Bounce_Rate)/sum(dataset$Page_Impressions)

# calculate Estimated True Value (ETV)
dataset$ETV = (dataset$prop_visits*dataset$Bounce_Rate)+((1-(dataset$prop_visits))*weighted_avg_br)

And of course the weighted sort algorithm can be implemented for any kind of problem, not only in the field of web analytics.


Author: inside data blog

data analysis & visualization blog

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s