Creating abstract city maps for Leaflet usage

Leaflet is a great way to display spatial information interactively. If you want to show differences between neighborhoods, you would usually get the proper shapefiles from the web and connect your data to them. But sometimes you do not need detailed shapefiles and want more abstraction to get your information across. So I came up with the idea of drawing my own simplified polygons to create an abstract map of Hamburg.


There are some great free tools on the web to create your own polygons. I used click2shp. You simply draw your polygons on a Google map and afterwards export them as a shapefile to use from within R. Down below you will find a little R script to display your polygons in a Shiny app.
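Before building the Shiny app, it is worth loading the exported shapefile once to make sure the export worked. Here is a minimal sketch, assuming the click2shp export produced the layer click2shp_out_poly (the same layer name used in the server code below) in your working directory:

# quick sanity check of the exported shapefile
require(rgdal)

hhshape <- readOGR(dsn = ".", layer = "click2shp_out_poly")
summary(hhshape)   # inspect projection and attribute columns
plot(hhshape)      # simple base plot of the drawn polygons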

#############################################################################################################################################
# PACKAGES
#############################################################################################################################################

require(leaflet)
require(shinythemes)
require(rgdal)
require(maptools)
require(rmapshaper)
require(shiny)
require(leaflet.extras)

#############################################################################################################################################
# UI
#############################################################################################################################################

shinyUI(
  bootstrapPage(theme = shinytheme("united"),
    navbarPage(title = "Where to live in Hamburg?",
      tabPanel("Karte",
        div(class = "outer",

          tags$style(type = "text/css", ".outer {position: fixed; top: 50px; left: 0; right: 0; bottom: 0; overflow: hidden; padding: 0}"),

          leafletOutput("mymap", width = "100%", height = "100%")
        )))))

#############################################################################################################################################
# SERVER
#############################################################################################################################################

shinyServer(
function(input, output, session) {

# setwd
setwd("YourPath")

# load your own shapes
hhshape <- readOGR(dsn = ".", layer = "click2shp_out_poly")

# load some data (could be anything)
data <- read.csv("anwohner.csv", sep = ";", header = T)
rownames(data) <- data$ID
hhshape <- SpatialPolygonsDataFrame(hhshape, data)

# remove rivers from sp file
hhshape <- hhshape[!(hhshape$Stadtteil %in% c("Alster","Elbe","Nix")), ]

# create a continuous palette function
pal <- colorNumeric(
 palette = "Blues",
 domain = hhshape@data$Anwohner
)

# plot map
output$mymap <- renderLeaflet({ leaflet(options = leafletOptions(zoomControl = FALSE, minZoom = 11, maxZoom = 11, dragging = FALSE)) %>%
 setView(lng = 9.992924, lat = 53.55100, zoom = 11) %>%
 addPolygons(data = hhshape,
  fillColor = ~pal(hhshape@data$Anwohner), fillOpacity = 1, stroke = T, color = "white", opacity = 1, weight = 1.2, layerId = hhshape@data$ID,
  highlightOptions = highlightOptions(color= "grey", opacity = 1, fillColor = "grey", stroke = T, weight = 12, bringToFront = T, sendToBack = TRUE),
  label = ~stringr::str_c(Stadtteil, ' ', "Anwohner: ", formatC(Anwohner, big.mark = ',', format = 'd')),
  labelOptions= labelOptions(direction = 'auto'))
})
})

This little piece of R code will give you the following result.

[Screenshot: the resulting abstract map of Hamburg in the Shiny app]

Make sure you check out my Github for other data driven projects.

Doing a Twitter Analysis with R

Recently I took part in Coding Durer, a five-day international and interdisciplinary hackathon for art history and information science. The goal of this hackathon is to bring art historians and information scientists together to work on data. It is kind of an extension of the cultural hackathon CodingDaVinci, where I participated in the past. There is also a blog post about CDV. I will write another blog post about the results of Coding Durer another day, but this article is going to be a Twitter analysis of the hashtag #codingdurer. This article was a very good starting point for my analysis.


First we want to get the tweets, and we are going to use the awesome twitteR package. If you want to know how to get the API key and related credentials, I recommend visiting this page here. Once you have everything set up, we are good to go. The code down below does the authentication with Twitter and loads our packages. I assume you know how to install an R package or can at least find a solution on the web.

# get package
require(twitteR)
library(dplyr)
library(ggplot2)
library(tidytext)

# do auth
consumer_key <- "my_key"
consumer_secret <- "my_secret"
access_token <- "my_token"
access_secret <- "my_access_secret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

We are now going to search for all the tweets containing the hashtag #codingdurer using the searchTwitter function from the twitteR package. After converting the result to an easy-to-work-with data frame, we remove all the retweets from our results because we do not want any duplicated tweets. I also removed the links from the tweet text, as we do not need them.

# get tweets
cd_twitter <- searchTwitter("#CodingDurer", n = 2000)
cd_twitter_df <- twListToDF(cd_twitter)

# remove retweets
cd_twitter_unique <- cd_twitter_df %>% filter(!isRetweet)

# remove link
cd_twitter_nolink <- cd_twitter_unique %>% mutate(text = gsub("https?://[\\w\\./]+", "", text, perl = TRUE))

With the code down below we extract the twenty most active Twitter accounts during Coding Durer. I used a simple ggplot graphic and saved it to a variable called people.

# who is tweeting
people = cd_twitter_nolink %>%
count(screenName, sort = TRUE) %>% slice(1:20) %>%
ggplot(aes(x = reorder(screenName, n, function(n) -n), y = n)) +
ylab("Number of Tweets") +
xlab("") +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Most active twitter users")

Now we want to know the twenty most used words from the tweets. This is going to be a bit trickier. First we extract all the words being said. Then we remove all the stop words (and some special words like codingdurer, https …), as they are not interesting for us. We also remove any Twitter account name from the tweets. Now we are almost good to go. We just do some singularization and then save the top twenty words as a ggplot graphic in a variable called word.

# what is being said
tweet_words <- cd_twitter_nolink %>% select(id, text) %>% unnest_tokens(word, text)

# remove stop words
my_stop_words <- stop_words %>% select(-lexicon) %>% bind_rows(data.frame(word = c("codingdurer","https", "t.co", "amp")))
tweet_words_interesting <- tweet_words %>% anti_join(my_stop_words)

# remove name of tweeters
cd_twitter_df$screenName = tolower(cd_twitter_df$screenName)
tweet_words_interesting = filter(tweet_words_interesting, !(word %in% unique(cd_twitter_df$screenName)))

# singularize words
# note: singularize() and tokenize() are not part of the packages loaded above;
# they come from additional text-processing packages (singularize(), for example,
# is provided by the pluralize package)
tweet_words_interesting$word2 = singularize(unlist(tokenize(tweet_words_interesting$word)))
tweet_words_interesting$word2[tweet_words_interesting$word2 == "datum"] = "data"
tweet_words_interesting$word2[tweet_words_interesting$word == "people"] = "people"

word = tweet_words_interesting %>%
count(word2, sort = TRUE) %>%
slice(1:20) %>%
ggplot(aes(x = reorder(word2, n, function(n) -n), y = n)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("Word Occurrence") +
xlab("") +
ggtitle("Most used words in tweets")

# plot all together (grid.arrange comes from the gridExtra package)
require(gridExtra)
grid.arrange(people, word, nrow = 2, top = "Twitter Analysis of #codingdurer")

The grid.arrange function lets us plot both of our graphics at once. Now we can see who the most active Twitter users were and which words were used most. It is good to see words like art, data and project at the top.

[Figure: most active Twitter users and most used words for #codingdurer]

Make sure you check out my Github for other data driven projects.

Cultural data hackathon with Shiny and Leaflet

CodingDaVinci is the first German open cultural data hackathon; it started in Berlin in 2014. It brings together cultural heritage institutions and the hacker & designer community to develop ideas and prototypes for the cultural sector and the public. In 2016 the hackathon took place in Hamburg and ran for a total of 10 weeks. Open cultural data is usually held by cultural heritage institutions such as galleries, libraries, archives and museums (GLAMs). A list of projects from 2016 can be seen on their website.


As I was playing around with Leaflet recently, I wanted to take part in the hackathon and develop something around spatial data. Leaflet is one of the most popular open-source JavaScript libraries for interactive maps, and there is a wonderful R package that makes it easy to integrate and control Leaflet maps from R. You can find a neat little GitHub page that will let you get started with Leaflet easily.
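To give an idea of how little code a first map takes, here is a minimal sketch with the leaflet R package (the coordinates are just an example and point roughly to Hamburg):

require(leaflet)

# a first interactive map: background tiles plus a single marker
leaflet() %>%
 addTiles() %>%   # default OpenStreetMap tiles
 addMarkers(lng = 9.9937, lat = 53.5511, popup = "Hamburg")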

In the end I took visitor data from the museums in Hamburg. The data captures group bookings from private persons, schools, universities and other educational institutions. I came up with the idea of a map which shows the number of museum visits by schools per neighborhood. On another layer I also mapped the number of students per neighborhood and plotted every school as a marker on the map. This way the user can find out in which areas of Hamburg the schools often visit the museums and which neighborhoods need to catch up.
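The layer switching can be done with Leaflet's group mechanism. The sketch below is not the original app code; the object and column names (a neighborhoods polygon object with visits and students columns, and a schools data frame with lon, lat and name) are assumptions to illustrate the idea:

require(leaflet)

# two colour scales, one per choropleth layer (column names are assumptions)
pal_visits   <- colorNumeric("Blues",  domain = neighborhoods$visits)
pal_students <- colorNumeric("Greens", domain = neighborhoods$students)

leaflet(neighborhoods) %>%
 addTiles() %>%
 addPolygons(fillColor = ~pal_visits(visits), fillOpacity = 0.8,
  color = "white", weight = 1, group = "Museum visits") %>%
 addPolygons(fillColor = ~pal_students(students), fillOpacity = 0.8,
  color = "white", weight = 1, group = "Students") %>%
 addMarkers(data = schools, lng = ~lon, lat = ~lat,
  popup = ~name, group = "Schools") %>%
 addLayersControl(baseGroups = c("Museum visits", "Students"),
  overlayGroups = "Schools")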

[Screenshot: interactive map of school museum visits per neighborhood]

I think this map is relevant because cultural education is an indispensable part of our education system, as it is essential to human dignity and the free development of one's personality (Article 22, UN Human Rights Charter). An important part of this is the school trip to the museum, in order to attract students to art and culture as well as to strengthen their personality development and social competence through cultural education.

Make sure you try out the map. If you are interested in the code you can check out my Github.

What’s Cooking on Kaggle? Top 20% Solution

Kaggle is a platform for predictive modelling and analytics competitions on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. As of May 2016, Kaggle had over 536,000 registered users, or Kagglers. The community spans 194 countries. It is the largest and most diverse data community in the world (Wikipedia).


One of the most interesting data sets I found on Kaggle came with the What’s Cooking challenge. The competition was hosted by Yummly, a mobile app and website that provides recipe recommendations; the Yummly app was named “Best of 2014” in Apple’s App Store. The competition asked you to predict the category of a dish’s cuisine given a list of its ingredients. The training data included a recipe id, the type of cuisine, and the list of ingredients of each recipe. There were 20 types of cuisine in the data set.

I was able to get a prediction score of about 80 percent with a fairly simple solution. First of all I removed all rare ingredients from the data set. I did not do much feature engineering, except for creating one simple variable that counts the total number of ingredients per recipe. I also tried some text mining in the form of word stemming, which reduces a recipe’s ingredients to their root words (e.g. tomatoes becomes tomato). That approach did not help much in the end, so I removed it from my script. I saved my training data in a sparse matrix and trained a multiclass classification model using softmax with the xgboost package.
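The sketch below shows the general approach rather than my exact competition script: the ingredients are treated like words, turned into a sparse recipe-by-ingredient matrix and fed to xgboost with the multi:softmax objective. The file name and the rarity threshold are assumptions.

require(jsonlite)   # the Kaggle data comes as JSON
require(Matrix)
require(xgboost)

train <- fromJSON("train.json")   # columns: id, cuisine, ingredients (list column)

# long format: one row per recipe/ingredient pair
long <- data.frame(id = rep(train$id, lengths(train$ingredients)),
 ingredient = unlist(train$ingredients),
 stringsAsFactors = FALSE)

# remove rare ingredients (threshold is an assumption)
keep <- names(which(table(long$ingredient) >= 5))
long <- long[long$ingredient %in% keep, ]

# sparse recipe x ingredient matrix plus the ingredient count feature
X <- sparseMatrix(i = match(long$id, train$id),
 j = match(long$ingredient, keep),
 x = 1,
 dims = c(nrow(train), length(keep)))
X <- cbind(X, lengths(train$ingredients))   # total number of ingredients per recipe

# xgboost expects 0-based integer class labels
y <- as.integer(factor(train$cuisine)) - 1

model <- xgboost(data = X, label = y, nrounds = 200,
 objective = "multi:softmax", num_class = length(unique(train$cuisine)),
 eta = 0.2, max_depth = 6, verbose = 0)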

Kaggle also allows users to publicly share their code on each competition page. It helped me a lot to check out some other people’s code before getting started. You can find my R script for the What’s Cooking challenge on my Github.

Bike sharing usage with Leaflet and Shiny

My interactive map shows the bike sharing usage of StadtRAD, the bike sharing system in Hamburg, Germany. The data is available on the open data platform of Deutsche Bahn, the public railway company in Germany. The last new StadtRAD station was put into operation in May 2016, which is why I chose to display the usage of June 2016. The brighter the lines, the more bikes have been cycled along that street.
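The brightness effect can be achieved by mapping the trip count of each route segment to a colour palette on a dark basemap. This is only a sketch, assuming a SpatialLinesDataFrame called segments with a trips column:

require(leaflet)

# colour scale: the more trips, the brighter the line
pal <- colorNumeric("YlOrRd", domain = segments$trips)

leaflet(segments) %>%
 addProviderTiles(providers$CartoDB.DarkMatter) %>%   # dark background so busy routes stand out
 addPolylines(color = ~pal(trips), weight = 2, opacity = 0.8)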

[Screenshot: StadtRAD bike sharing usage map of Hamburg]

From data processing and spatial analysis to visualization, the whole project was done in R. I used the leaflet and shiny packages to display the data interactively. The bikes themselves don’t have GPS, so the routes are estimated on a shortest-route basis using the awesome cyclestreets API. The biggest challenge was the aggregation of overlapping routes. I found the overline function from the stplanr package very helpful: it takes a series of overlapping lines and aggregates their values for overlapping segments. The raw data file from Deutsche Bahn is quite huge, so I struggled to import the data into R for processing. In the end the read.csv.sql function from the sqldf package did the job.
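Here is a rough sketch of how those two helpers are used; the file name, the column names and the routes object are assumptions, not the original data. read.csv.sql filters the big raw file while reading it, and overline aggregates trip counts over overlapping route segments:

require(sqldf)
require(stplanr)

# read only the June 2016 rentals instead of loading the whole file into memory
rentals <- read.csv.sql("stadtrad_rentals.csv",
 sql = "select * from file where DATE_FROM like '2016-06%'",
 sep = ";")

# 'routes' is assumed to be a SpatialLinesDataFrame of estimated routes with a
# 'trips' column holding the number of rentals along each route
routes_aggregated <- overline(routes, attrib = "trips")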

You can find the whole code, from processing to the shiny functions, on my GitHub. The code could easily be used to map other spatial data, for example the car sharing data from car2go, which is available via their API. This might be a future project.