RMarkdown for automated Marketing Reporting

In a past article I presented the awesome ChannelAttribution package, which helps you run algorithmic marketing attribution models. This time I am going to use this package to create a full RMarkdown report with some analysis and visualization around the topic of marketing attribution.


I am starting with some web analytics data that has seven columns: the userid identifies a unique user, the sessionid identifies a unique visit to our website, the orderid does not change for a user until they buy something on our website, a timestamp, an event_type that identifies the action someone is taking on our page (page impression vs. product view vs. order), the medium the visitor is coming from, and an order_number to identify unique orders. See an example of the data below:

userid sessionid orderid timestamp event_type medium order_number
1 1 1 10.05.2017 15:36 page_impression SEA <NULL>
2 2 2 12.01.2017 16:36 page_impression SEO <NULL>
3 3 3 28.04.2017 16:06 page_impression SEA <NULL>
3 4 3 28.04.2017 18:53 page_impression SEA <NULL>
3 5 3 28.04.2017 19:01 page_impression Link <NULL>
4 6 4 16.02.2017 18:09 page_impression SEO <NULL>
4 7 4 16.02.2017 19:56 page_impression SEO <NULL>
4 8 4 17.02.2017 16:16 page_impression SEO <NULL>
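
For reference, a minimal way to load such an export into R could look like this (the file name and separator are assumptions; the timestamp format matches the sample above):

# read the export; file name and separator are assumptions
library(data.table)
mydata = as.data.frame(fread("webanalytics_export.csv", sep = ";"))

# parse the timestamp (format as shown in the sample data)
mydata$timestamp = as.POSIXct(mydata$timestamp, format = "%d.%m.%Y %H:%M")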

To run our attribution models with the ChannelAttribution package we need to format our data into sequences. With the code below we can do that and also save some data frames for the output tables in the markdown document.

# packages needed for the data preparation
library(dplyr)

# create sequences on medium
mydata$medium = as.character(mydata$medium) 
seq = mydata %>%
 group_by(orderid) %>%
 summarise(path = as.character(list(medium)))

# save the same data for markdown output table
seq_output = mydata %>%
 group_by(orderid) %>%
 summarise(path = as.character(list(medium)))

seq_output$path = gsub("c\\(|)|\"|([\n])","", seq_output$path)
seq_output$path = gsub(",","\\1 \\2>", seq_output$path)
seq_output = merge(order_num, seq_output, by = "orderid")
seq_output$orderid = NULL
colnames(seq_output) = c("Bestellnummer","Zeitstempel","Customer Journey")
write.table(seq_output, "Qi2-Customer_Journey.csv", row.names = F, sep = ";")

# group identical paths and add up conversions
seq = seq %>%
group_by(path) %>%
summarise(total_conversions = n())

# clean paths
seq$path = gsub("c\\(|)|\"|([\n])","", seq$path)
seq$path = gsub(",","\\1 \\2>", seq$path)

# save for later use
seqdata = seq

# save the same data for markdown output table
seq_output_agg = seq
colnames(seq_output_agg) = c("Customer Journey","Conversions")
seq_output_agg = seq_output_agg[order(seq_output_agg$Conversions, decreasing = T),]

Before we run our models we are going to do some analysis on our sequences with the TraMineR package, which is perfect for any sequence-based analysis. The code below prepares our data for the markdown document.

# packages for the sequence analysis
library(data.table)       # as.data.table()
library(splitstackshape)  # cSplit()
library(TraMineR)         # seqdef(), seqstatd()
library(reshape2)         # melt()

# split path into single columns
seq.table = cSplit(as.data.table(seqdata), "path", ">")
# create sequence object
seq.seq = seqdef(seq.table, var = 2:length(seq.table))
# distribution table (perc)
dist_perc = as.data.frame(seqstatd(seq.seq)[[1]])
dist_perc = as.data.frame(t(dist_perc))
dist_perc$path = rownames(dist_perc)

# prepare plot
dist_perc = melt(dist_perc, id.var="path")
dist_perc$path = gsub("path_","",dist_perc$path)
colnames(dist_perc) = c("Session","Medium","Anteil")

# distribution table (full)
dist_full = as.data.frame(seqstatd(seq.seq)[[2]])
dist_full = as.data.frame(t(dist_full))
dist_full$path = rownames(dist_full)

# prepare plot
dist_full = melt(dist_full, id.var="path")
colnames(dist_full) = c("Path","Session","Besucher")
dist_full$Path = NULL
dist_full$Session = gsub("path_","",dist_full$Session)

Next we are going to run our attribution models and save the results for display in the markdown document. Finally we need to save our data frames as an RData file.

# attribution models come from the ChannelAttribution package
library(ChannelAttribution)

# run models
basic_model = heuristic_models(seq, "path", "total_conversions")
dynamic_model = markov_model(seq, "path", "total_conversions")

# build data frame for plot
result = merge(basic_model,dynamic_model, by = "channel_name")
names(result) = c("channel","first","last","linear","algorithmic")

# melt the data frame for plotting
result = melt(result, id.vars="channel")
colnames(result) = c("Kanal","Modell","Conversions")
result$Conversions = round(result$Conversions, 0)

# build data frame for deviation plot
result1 = merge(basic_model,dynamic_model, by = "channel_name")
names(result1) = c("channel","first","last","linear","algorithmic")
result1$first = ((result1$first - result1$algorithmic)/result1$algorithmic)
result1$last = ((result1$last - result1$algorithmic)/result1$algorithmic)
result1$linear = ((result1$linear- result1$algorithmic)/result1$algorithmic)
result1$algorithmic = NULL

# melt the data frame for plotting
result1 = melt(result1, id.vars="channel")
colnames(result1) = c("Kanal","Modell","Conversions")
result1$Conversions = round(result1$Conversions, 5)

# colorpalette for plotting
mypal = colorRampPalette(c("#FFD296", "#C77100"))
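
The save step itself is not shown above. One simple option, consistent with the load() and rm() calls in the report's setup chunk further down, is to save the whole workspace; a minimal sketch:

# save everything for the markdown document
# (file name matches the load() call in the setup chunk of the report)
save.image("full_save_anonym.RData")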

Now we are able to use the processed data from above to create a nifty markdown document. I have used the plotly package to create some interactive ggplot visualizations. I also used the markdown theme united and my own CSS for styling the document, as you can see in the YAML header.

---
# YAML
title: "Case Study: Marketing Attribution"
author: "created by Alexander Kruse, etracker Data Lab"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    includes:
      in_header: extLogo.html
    css: "mycss.css"
    theme: united
    highlight: tango
---

```{r include=FALSE}

# rm
#options(warn=-1)

# packages
require(ggplot2)
require(viridis)
library(DT)
library(plotly)
library(magrittr)
require(RColorBrewer)

cedta.override = c("gWidgetsWWW","statET","FastRWeb","slidify","rmarkdown")
assignInNamespace("cedta.override",c(data.table:::cedta.override,"rmarkdown"),"data.table")

# load pre-processed data
setwd("K:/Consulting/13_Alex_Data_Analyst/Datenanalyse_Projekte/Attribution/anonym")
load("full_save_anonym.RData")
rm(seq.table, dynamic_model, basic_model, mydata, seq.seq, seq, seqdata, ids, order_num)

```

<style type="text/css">

h1.title {
color: #FF5F01;
}

h2 {
color: #FF5F01;
}

</style>




## Introduction
The etracker Analytics interface does not yet allow you to display the customer journeys of your buyers at the level of individual order numbers. This report therefore shows you which channel contacts a visitor had before buying something on your website. The report is complemented by further analyses and visualizations on the topic of marketing attribution. Detailed information and application examples can be found in the etracker whitepaper [Attribution: Mit der richtigen Strategie die Marketing Performance optimieren](https://www.etracker.com/wp-content/uploads/2017/05/etracker_WP_Attributionsmodell.pdf). More in-depth explanations of algorithmic marketing attribution can be found in the following video: [Multi-touch Attribution: How It Works & Why It Will Disrupt Media Buying](http://www.onebyaol.com/blog/teg-talks-episode-3-multi-touch-attribution-how-it-works-why-it-will-disrupt-media-buying).

This document was created by the etracker Data Lab. Through the Data Lab, etracker offers website operators various analysis services that go beyond the scope of the etracker web analytics solution. A team of data analysts answers even highly complex and individual questions based on your raw web analytics data, statistical models and special visualization options. Importantly, the Data Lab uses the same data for its analyses that you also see in the etracker interface.


![](etracker_data_lab.PNG)


## Data Basis
This report was created with the etracker tracking data of an anonymized account and can also be produced on the basis of your own data. The following tables first show the customer journeys per order number. A customer journey consists of all channel contacts of a visitor tracked within the period leading up to the purchase. Channel contacts outside this data extract are not taken into account, which has a substantial influence on the analysis: for conversions at the beginning of the analysis period, individual customer journeys may be missing channel contacts. The table below is interactive and offers sorting and search functions.




```{r, echo=FALSE}

datatable(seq_output, options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#FF5F01', 'color': '#fff'});",
"}")
))

```




The following table lists all customer journeys that led to a purchase. An additional column shows how often each combination of contacts resulted in a conversion; identical customer journeys have been aggregated accordingly. Looking at the table, it becomes apparent that the majority of customer journeys consist of only one channel contact (e.g. Type-In) or of repeated contacts with the same channel (e.g. Type-In > Type-In). All of these customers were therefore in contact with only a single channel.




```{r, echo=FALSE}

datatable(seq_output_agg, options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#FF5F01', 'color': '#fff'});",
"}")
))

```




## Customer Journey Analysis
The following charts analyze and visualize the data above from additional perspectives. The first chart shows how many buyers had a customer journey with one, two or more sessions or channel contacts before a purchase. Most buyers had only a single channel contact. The curve flattens quickly, and only about 1% of all buyers had more than ten sessions before a conversion. The following charts are also interactive; you can zoom in or hover over them to display values.




```{r, echo=FALSE, fig.width=9.5, fig.height=4}
ggplotly(ggplot(dist_full, aes(x = Session, y = Besucher, group = 1)) +
theme_minimal() +
geom_point(stat='summary', fun.y=sum) +
stat_summary(fun.y=sum, geom="line") +
xlab("Session") + ylab("Besucher"))

```




The next chart takes this idea one step further. For each customer journey step (Session 01, Session 02, ...) you can see which media were used and how often. It shows, for example, that Type-Ins understandably become more frequent as the customer journey grows longer, since visitors have, for instance, saved the website in their browser history. It also shows that the Affiliate and Link channels lose importance as the customer journey grows longer. This is a first indication that a last-click attribution model would undervalue these channels, while a first-click model might overvalue them. Since, as described above, only few visitors have a customer journey with more than ten sessions, it makes sense to look at the channel distribution only up to this customer journey length.




```{r, echo=FALSE, cache=FALSE, message = FALSE, warning = FALSE, fig.width=10.9, fig.height=5}
mypal <- suppressWarnings(colorRampPalette(brewer.pal(7,"Oranges")))
ggplotly(ggplot(dist_perc, aes(x = Session, y = Anteil, fill = Medium)) +
theme_minimal() +
geom_bar(stat = "identity") +
xlab("Session") + ylab("Anteil an Besuchern") +
scale_fill_manual( values = mypal(7)) +
guides(fill = guide_legend(title = "Kanal:")))

```




## Marketing Attribution
Website visitors often have considerably more than one advertising contact, and a lot of time can pass before a first contact with the desired product turns into a purchase. The question of which advertising medium contributes which share to the success of a marketing strategy therefore moves ever more into focus.

The analysis is therefore rounded off by attributing the individual conversions to the media used. The chart clearly shows that the SEO channel can be credited with most conversions, regardless of which attribution model we choose. What is striking, however, is how differently the conversions are distributed for the Affiliate and Link channels. A last-click model assigns comparatively few conversions to these channels.




```{r, echo=FALSE, fig.width=10.9, fig.height=4}
# plot everything
ggplotly(ggplot(result, aes(Kanal, Conversions)) +
theme_minimal() +
geom_bar(aes(fill = Modell), position = "dodge", stat="identity") +
scale_fill_manual( values = mypal(4)) +
xlab("") + ylab("Conversions") +
guides(fill = guide_legend(title = "Modell:")))

```




The various attribution models differ in how they weight the individual media when assigning conversions. In general, two types of model can be distinguished: heuristic models (e.g. last-click) and algorithmic attribution models. Algorithmic attribution models calculate the value of each channel from the complete historical visitor data at a fine-grained level, without loss of information or rigid assignment rules, and are the most precise approach for distributing the return of an advertising success across advertising channels. In the charts you therefore see, alongside the classic models, an algorithmic attribution model designed by etracker. Technical information on our model can be found in the following paper: [Mapping the Customer Journey: A Graph-Based Framework for Online Attribution Modeling](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2343077).




```{r, echo=FALSE, fig.width=10.9, fig.height=4}
# plot everything
ggplotly(ggplot(result1, aes(Kanal, Conversions)) +
geom_bar(aes(fill = Modell), position = "dodge", stat="identity") +
theme_minimal() +
scale_fill_manual( values = mypal(4)) +
xlab("") + ylab("Abweichung vom algo. Modell (in %)") +
guides(fill = guide_legend(title = "Modell:")))

```




The chart above shows how strongly the heuristic models deviate from the algorithmic attribution model. The linear model shows the fewest strong deviations and is thus most similar to etracker's algorithmic model. If you use the attribution report in the etracker interface, it is therefore advisable to focus on the linear model (or, where applicable, the bathtub model).




## Conclusion
This analysis clearly shows that choosing a suitable attribution model is an important decision. A last-click model significantly undervalues the Affiliate and Link channels, while classic channels such as SEO and SEA are overvalued. It can be assumed that etracker's algorithmic attribution model computes the most accurate channel valuation and that, of the heuristic models, the linear attribution model comes closest to it.






<center>

If you have any questions, please feel free to get in touch.






Alexander Kruse

etracker Data Lab

Tel: +49 40 555 659 667

E-Mail: kruse@etracker.de

</center>
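
To automate the reporting, the finished document can be knitted from a plain R script, for example on a schedule. A minimal sketch, assuming the markdown file above is saved as attribution_report.Rmd (the file name is my own choice):

# knit the report to HTML; the .Rmd file name is an assumption
library(rmarkdown)
render("attribution_report.Rmd", output_file = "attribution_report.html")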

The final document can be seen here. Make sure you check out my GitHub for other data-driven projects.

Creating abstract city maps for Leaflet usage

Leaflet is a great way to display spatial information interactively. If you want to display differences between neighborhoods, you would usually get the proper shapefiles on the web and connect your data to them. But sometimes you do not need detailed shapefiles and want more abstraction to get your information across. I came up with the idea of drawing my own little simplified polygons to get an abstract map of Hamburg.


There are some great and free tools on the web to create your own polygons. I was using click2shp: you just draw your polygons on a Google map and afterwards export them as a shapefile to use from within R. Below you find a little R script to display your polygons in a Shiny app.
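
Before building the app, it is worth checking that the exported shapefile actually loads. A minimal check (the layer name is the click2shp default used in the app below; adjust it to your export):

# quick sanity check: read the exported polygons and plot them
require(rgdal)
hhshape <- readOGR(dsn = ".", layer = "click2shp_out_poly")
plot(hhshape)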

#############################################################################################################################################
# PACKAGES
#############################################################################################################################################

require(leaflet)
require(shinythemes)
require(rgdal)
require(maptools)
require(rmapshaper)
require(shiny)
require(leaflet.extras)

#############################################################################################################################################
# UI
#############################################################################################################################################

shinyUI(
  bootstrapPage(theme = shinytheme("united"),
    navbarPage(title = "Where to live in Hamburg?",
      tabPanel("Karte",
        div(class = "outer",

          tags$style(type = "text/css", ".outer {position: fixed; top: 50px; left: 0; right: 0; bottom: 0; overflow: hidden; padding: 0}"),

          leafletOutput("mymap", width = "100%", height = "100%")
        )))))

#############################################################################################################################################
# SERVER
#############################################################################################################################################

shinyServer(
function(input, output, session) {

# setwd
setwd("YourPath")

# load your own shapes
hhshape <- readOGR(dsn = ".", layer = "click2shp_out_poly")

# load some data (could be anything)
data <- read.csv("anwohner.csv", sep = ";", header = T)
rownames(data) <- data$ID
hhshape <- SpatialPolygonsDataFrame(hhshape, data)

# remove rivers from sp file
hhshape <- hhshape[!(hhshape$Stadtteil %in% c("Alster","Elbe","Nix")), ]

# create a continuous palette function
pal <- colorNumeric(
 palette = "Blues",
 domain = hhshape@data$Anwohner
)

# plot map
output$mymap <- renderLeaflet({ leaflet(options = leafletOptions(zoomControl = FALSE, minZoom = 11, maxZoom = 11, dragging = FALSE)) %>%
 setView(lng = 9.992924, lat = 53.55100, zoom = 11) %>%
 addPolygons(data = hhshape,
  fillColor = ~pal(hhshape@data$Anwohner), fillOpacity = 1, stroke = T, color = "white", opacity = 1, weight = 1.2, layerId = hhshape@data$ID,
  highlightOptions = highlightOptions(color= "grey", opacity = 1, fillColor = "grey", stroke = T, weight = 12, bringToFront = T, sendToBack = TRUE),
  label=~stringr::str_c(Stadtteil,' ',"Anwohner:",formatC(Anwohner, big.mark = ',', format='d')),
  labelOptions= labelOptions(direction = 'auto'))
})
})

This little R code will give you the following result.

[Resulting abstract map of Hamburg]

Make sure you check out my GitHub for other data-driven projects.

Shiny App for cultural hackathon

Recently I took part in Coding Durer, a five-day international and interdisciplinary hackathon for art history and information science. The goal of this hackathon is to bring art historians and information scientists together to work on data. It is kind of an extension of the cultural hackathon CodingDaVinci, where I participated in the past. I also wrote an article about CDV on this blog.


At Coding Durer we developed a Shiny app to explore the genre of church interior paintings that emerged in the Netherlands in the middle of the 17th century. There are hundreds of church interior paintings scattered across collections around the world. Research on this subject to date has focused mainly on particular artists or churches rather than the overall genre and its network of artists and places. This project, born during Coding Durer 2017, addresses this issue by providing a platform for further research on the paintings and, for the first time, an insight into the bigger picture of the genre. This visualization of over 200 paintings of 26 different churches by 16 different artists was created with the following research questions in mind:

  • In what places were the artists active, and in what places did they depict church interiors?
  • Did the artists have ‘favourite’ church interiors?
  • In what places and when would the artists possibly meet?
  • What church interiors were depicted the most?
  • What church interiors were depicted by most artists?

[Screenshot of the Shiny app]

The starting point of the project was a spreadsheet listing the paintings, artists, collections, etc. that had been created for research purposes two years earlier. This re-purposed data needed cleaning and additional information, e.g. IDs (artists, churches, paintings), locations (longitude, latitude), and stable URLs for images. You can see an image of the Shiny app above and try it out yourself here.

You can get the whole code on my GitHub along with other data-driven projects.

Doing a Twitter Analysis with R

Recently I took part in Coding Durer, a five-day international and interdisciplinary hackathon for art history and information science. The goal of this hackathon is to bring art historians and information scientists together to work on data. It is kind of an extension of the cultural hackathon CodingDaVinci, where I participated in the past. There is also a blog post about CDV. I will write another blog post about the result of Coding Durer another day, but this article is going to be a Twitter analysis of the hashtag #codingdurer. This article was a very good starting point for my analysis.


First we want to get the tweets, and we are going to use the awesome twitteR package. If you want to know how to get the API key and related credentials, I recommend visiting this page here. If you have everything set up, we are good to go. The code down below does the authentication with Twitter and loads our packages. I assume you know how to install an R package or at least can find a solution on the web.

# get packages
require(twitteR)
library(dplyr)
library(ggplot2)
library(tidytext)
library(gridExtra)  # for grid.arrange() later on

# do auth
consumer_key <- "my_key"
consumer_secret <- "my_secret"
access_token <- "my_token"
access_secret <- "my_access_secret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

We are now going to search for all the tweets containing the hashtag #codingdurer using the searchTwitter function from the twitteR package. After converting the result to an easy-to-work-with data frame, we remove all the retweets from our results because we do not want any duplicated tweets. I also removed the links from the tweet text as we do not need them.

# get tweets
cd_twitter <- searchTwitter("#CodingDurer", n = 2000)
cd_twitter_df <- twListToDF(cd_twitter)

# remove retweets
cd_twitter_unique <- cd_twitter_df %>% filter(!isRetweet)

# remove link
cd_twitter_nolink <- cd_twitter_unique %>% mutate(text = gsub("https?://[\\w\\./]+", "", text, perl = TRUE))

With the code down below we extract the twenty most active Twitter accounts during Coding Durer. I used a simple ggplot graphic and saved it to a variable called people.

# who is tweeting
people = cd_twitter_nolink %>%
count(screenName, sort = TRUE) %>% slice(1:20) %>%
ggplot(aes(x = reorder(screenName, n, function(n) -n), y = n)) +
ylab("Number of Tweets") +
xlab("") +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Most active twitter users")

Now we want to know the twenty most used words from the tweets. This is going to be a bit trickier. First we extract all the words being said. Then we remove all the stop words (and some special words like codingdurer, https, ...) as they are uninteresting for us. We also remove any Twitter account name from the tweets. Now we are almost good to go. We just do some singularization and then save the top twenty words as a ggplot graphic in a variable called word.

# what is being said
tweet_words <- cd_twitter_nolink %>% select(id, text) %>% unnest_tokens(word, text)

# remove stop words
my_stop_words <- stop_words %>% select(-lexicon) %>% bind_rows(data.frame(word = c("codingdurer","https", "t.co", "amp")))
tweet_words_interesting <- tweet_words %>% anti_join(my_stop_words)

# remove name of tweeters
cd_twitter_df$screenName = tolower(cd_twitter_df$screenName)
tweet_words_interesting = filter(tweet_words_interesting, !(word %in% unique(cd_twitter_df$screenName)))

# singularize words (singularize() and tokenize() come from additional
# packages, e.g. the pluralize package for singularize())
tweet_words_interesting$word2 = singularize(unlist(tokenize(tweet_words_interesting$word)))
tweet_words_interesting$word2[tweet_words_interesting$word2 == "datum"] = "data"
tweet_words_interesting$word2[tweet_words_interesting$word == "people"] = "people"

word = tweet_words_interesting %>%
count(word2, sort = TRUE) %>%
slice(1:20) %>%
ggplot(aes(x = reorder(word2, n, function(n) -n), y = n)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("Word Occurrence") +
xlab("") +
ggtitle("Most used words in tweets")

# plot all together
grid.arrange(people, word, nrow=2, top = "Twitter Analysis of #codingdurer")

The grid.arrange function (from the gridExtra package) lets us plot both of our graphics at once. Now we can see who the most active Twitter users were and what the most used words were. It is good to see words like art, data and project at the top.

[Resulting plot: most active Twitter users and most used words]

Make sure you check out my GitHub for other data-driven projects.

How to import multiple data files (the fast way)


Sometimes you have your data stored in multiple CSV files and want to load them into a single data frame in R. There are several answers to this question on the web, and I recently found a fast solution to this problem.

# packages
require(data.table)

# set wd
setwd("PathToYourFolder")

# import files
files = list.files(pattern="*.csv")
dataset = do.call(rbind, lapply(files, fread))
rm(files)

# transform data to df
dataset <- as.data.frame(unclass(dataset))

The code above uses both lapply and the cool fread function from the data.table package to load your data in a quite fast manner. I recommend trying out this approach if you are dealing with long import times.
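
If you want to stay entirely within data.table, rbindlist is a natural drop-in for the do.call(rbind, ...) step and can also record which file each row came from. A small variant of the code above (the column name "source" is my own choice):

# bind all files into one table and keep the source file name per row
require(data.table)
files = list.files(pattern = "*.csv")
dataset = rbindlist(setNames(lapply(files, fread), files), idcol = "source")
dataset = as.data.frame(unclass(dataset))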

Make sure you check out my GitHub for other data-driven projects.

How to import a subset of a (too huge) csv file


Sometimes people create CSV files that are just too huge to load into your R session, while most of the time you only need a subset of the data. Recently I ran into this problem: first I tried to import the whole file with functions like fread or the classic read.csv, but this did not help much as the file was just too big and my computer failed to import it. With the awesome read.csv.sql function from the sqldf package I found a good way to solve my problem. This function enables you to use SQL statements within the import function, which makes it possible to select only a subset of the file and reduce the import size.

library(sqldf)
mydata = read.csv.sql("mydata.csv", sql = "select * from file where City = '\"Hamburg\"' ", sep = ";")

The code above loads only those lines of the file where the city is Hamburg. I still had trouble with the encoding, which is why I used this ugly string with backslashes in the SQL statement. I will leave it like this as you might run into the same problem.
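
The same trick also works for cutting down columns, not just rows. A small variation of the call above (the column names userid and City are just examples; use the ones from your file):

# read only two columns for the Hamburg rows; column names are examples
library(sqldf)
mydata_small = read.csv.sql("mydata.csv", sql = "select userid, City from file where City = '\"Hamburg\"' ", sep = ";")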

Make sure you check out my GitHub for other data-driven projects.

Using association rule mining in forensic accounting

A few years ago I started to use the R programming language more intensively while writing my master thesis. I used the wonderful arules package by Michael Hahsler and others for mining association rules and frequent itemsets, and applied it in the field of forensic accounting. Forensic data analysis is a branch of digital forensics: it examines structured data with regard to incidents of financial crime, with the aim of discovering and analyzing patterns of fraudulent activities (Wikipedia). Below you find an excerpt from my thesis.


A study by Schneider and John from the year 2013 shows that 37% of the surveyed companies in Germany report having already become victims of economic crime in the last twelve months. The literature review of this master thesis showed that a large part of the forensic analysis methods are applied to aggregated data (e.g. balance sheet positions) to uncover economic crimes. Based on various scientific studies, it can also be shown that there are currently only a few publications that use analytical methods to investigate unaggregated financial accounting transactions directly for economic crimes. A study by Debreceny and Gray from 2013 reveals that the analysis of a company's internal financial accounting data has great potential for detecting fraud. For these reasons, this master thesis applies the data mining methodology of association analysis directly to financial accounting data for the purposes of forensic accounting, in order to investigate economic crimes.
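
To give an idea of what such an analysis looks like in code, here is a minimal, purely illustrative sketch with the arules package; the posting data, column names and thresholds are made up, and the actual thesis code lives on my GitHub:

# toy example: treat each accounting document as one "transaction" whose
# items are categorical attributes; the data below is purely illustrative
library(arules)

postings = data.frame(
  user        = c("A", "A", "B", "B", "C"),
  account     = c("4400", "4400", "1200", "4400", "1200"),
  posting_key = c("50", "50", "40", "50", "40"),
  stringsAsFactors = TRUE
)

# coerce the data frame of factors into a transactions object
trans = as(postings, "transactions")

# mine association rules with minimum support and confidence thresholds
rules = apriori(trans, parameter = list(supp = 0.2, conf = 0.8))
inspect(rules)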

Make sure you check out the code on my GitHub along with other projects.