Bhaskar Karambelkar's Blog

Visualizing India v/s Pakistan One Day International Results

 

Tags: Cricket streamgraph rvest dplyr lubridate rstats


This is my small effort to pick up the streamgraph support in R developed by Bob Rudis (described here).

What you see is a per-year aggregation of the results of all India v/s Pakistan One Day Internationals. I pulled the records from Wikipedia, using rvest by Hadley Wickham to extract the results. After that, a little data munging with dplyr and lubridate, and voilà.
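For anyone who hasn't used rvest before, here is a minimal sketch of the table-extraction idea on an inline HTML snippet instead of the live Wikipedia page. The table contents and the `wikitable` markup below are made up for illustration, and I use `read_html()`, which is what current rvest versions provide.

```r
library(rvest)

# Made-up HTML standing in for a Wikipedia results table (illustration only)
page <- read_html('
  <table class="wikitable">
    <tr><th>Date</th><th>Winner</th></tr>
    <tr><td>2001-03-04</td><td>India</td></tr>
    <tr><td>2002-05-06</td><td>Pakistan</td></tr>
  </table>')

# Pick the first node matching the CSS selector and convert it to a data frame
tbl <- page %>% html_node(".wikitable") %>% html_table()
tbl
```

The header row becomes the column names and every other row becomes one record, which is the shape the rest of the munging works on.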

Blues are India and greens are Pakistan, in accordance with their team colors. India had an abysmal record against Pakistan right up until the mid-90s, but it has picked up quite a bit since then. And of course India has won all 6 of its Cricket World Cup matches against Pakistan.

As of today the tally stands at: India 51 wins and Pakistan 72 wins. Below is a detailed breakdown.

Running a chi-square test for dependency between the result and the venue didn’t find any statistically significant association between the two, which in layman’s terms means there is no evidence that the results depend on the venue.
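If the chi-square test is new to you, here is a rough sketch of what it looks like on a small winner-by-venue table. The counts and the venue labels are hypothetical, chosen only to show the mechanics; they are not the real India v/s Pakistan figures.

```r
# Hypothetical counts for illustration only; NOT the real match figures.
# Rows are the winning team, columns are where the match was played.
toy <- matrix(c(10, 12, 20,
                14, 11, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(Winner = c("India", "Pakistan"),
                              Venue  = c("In India", "In Pakistan", "Neutral")))

# Pearson's chi-squared test of independence on the 2 x 3 table
test <- chisq.test(toy)

# Degrees of freedom are (rows - 1) * (columns - 1)
test$parameter

# A p-value above the chosen threshold (commonly 0.05) means the data give
# no evidence against independence of winner and venue
test$p.value
```

Note that a large p-value is not proof that venue has no effect; it only says the table is consistent with there being none.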

For the nerds (oh sorry, Data Scientists), the code is shown below.


setwd("~/code/R/workspaces/cricket")
library(stringr)
library(rvest)
library(lubridate)
library(dplyr)
library(streamgraph)

# Wikipedia is our best go-to source
# (read_html() replaces html(), which is deprecated in newer rvest versions)
indvspak <- read_html('https://en.wikipedia.org/wiki/List_of_ODI_cricket_matches_played_between_India_and_Pakistan')
# Summary table
results.summary <- indvspak %>% html_node('.wikitable') %>% html_table()

# Any dependency between venue and result?
chisq.test(results.summary[2:3,3:5])

# The XPATH expression below was obtained using Chrome's Element Inspector.
results <-  indvspak %>%
  html_node(xpath='//*[@id="mw-content-text"]/table[4]') %>% html_table()

# Sensible headers
colnames(results) <- c('MatchNum','Date','Winner','WonBy','Venue','MoM')

# Fix Date
results$Date <- ymd(str_replace(results$Date,'^0([0-9]{4}-[0-9]{2}-[0-9]{2}).*$','\\1'))
# Extract just the year in a new field
results$year <- year(results$Date)

# So that we get our colors as per team colors
results$Winner <- factor(results$Winner,levels=c('India','Pakistan','No result'),ordered=TRUE)

results %>% select(year,Winner) %>%
  group_by(year,Winner) %>% tally() %>%
  streamgraph("Winner", "n", "year", offset="zero", interpolate="linear") %>%
  sg_legend(show=TRUE,
            label="Ind v/s Pak One Day International Results : Over the years") %>%
  sg_axis_x(1, "year", "%Y") %>%
  sg_colors("GnBu")
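
To see what the date fix and the per-year tally do without hitting Wikipedia, here is a small sketch on a toy data frame. The rows are made up for illustration, and the leading "0" plus trailing text only mimic the kind of raw date cell the regex above is written for.

```r
library(stringr)
library(lubridate)
library(dplyr)

# Made-up rows for illustration; not real match records
toy <- data.frame(
  Date   = c("02001-03-04 4 March 2001",
             "02001-07-08 8 July 2001",
             "02002-05-06 6 May 2002"),
  Winner = c("India", "India", "Pakistan"),
  stringsAsFactors = FALSE
)

# Keep only the YYYY-MM-DD part, then parse it as a date
toy$Date <- ymd(str_replace(toy$Date, "^0([0-9]{4}-[0-9]{2}-[0-9]{2}).*$", "\\1"))
toy$year <- year(toy$Date)

# One row per (year, Winner) pair, with the number of matches in column n
counts <- toy %>% select(year, Winner) %>%
  group_by(year, Winner) %>% tally()
counts
```

The resulting year, Winner and n columns are what get passed to streamgraph() as the date, key and value arguments.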