Statisfaction

Daily casualties in Syria

Posted in Dataset, R by Julyan Arbel on 9 February 2012

Every new day brings its statistics of new deaths in Syria… Here is an attempt to learn about the Syrian uprising through the figures. Data vary among sources: the Syrian opposition provides the number of casualties by day (here on Dropbox), updated on 8 February 2012, with a total exceeding 8,000.

We note first that the attacks have accelerated, as the cumulative graph is mostly convex (click to enlarge):

Plotting the numbers by day shows the bloody toll of Fridays, a gathering day in the Muslim calendar. This was especially true at the beginning of the uprising, but lately any other day can be equally deadly:

On average, there are almost twice as many deaths on Fridays as on any other day:
Here are boxplots of the logarithm of daily casualties by day of the week:

and their density estimates, coloured first by day of the week, then by Friday vs. the rest of the week:

Here is the code (the data-frame wrangling for ggplot is a bit clumsy, so do not hesitate to comment on it):

library(ggplot2)
# daily casualty counts, tab-separated, one row per day
input = read.csv("http://dl.dropbox.com/u/1391912/Blog%20statisfaction/data/syria.txt",
                 sep = "\t", header = TRUE, stringsAsFactors = FALSE)
# flag Fridays for the comparison with the rest of the week
input$LogicalFriday = factor(input$WeekDay == "Friday", levels = c(FALSE, TRUE),
                             labels = c("Not Friday", "Friday"))
input$Date = as.Date(input$History, "%d/%m/%Y")
input$WeekDays = factor(input$WeekDay,
                        levels = unique(as.character(input$WeekDay[7:13]))) # trick to sort the legend (rows 7:13 span one full week)
# cumulative casualties over time
qplot(x = Date, y = cumsum(Number), data = input, geom = "line", color = I("red"), xlab = "", ylab = "", lwd = I(1))
# daily counts with Fridays highlighted; stat = "identity" keeps the raw daily heights
qplot(x = as.factor(Date), y = Number, data = input, geom = "bar", stat = "identity", fill = LogicalFriday, xlab = "", ylab = "")
# density estimates of log counts, Friday vs. the rest, then by day of the week
qplot(log(Number + 1), data = input, geom = "density", fill = LogicalFriday, xlab = "", ylab = "", alpha = I(.2))
qplot(log(Number + 1), data = input, geom = "density", fill = WeekDay, xlab = "", ylab = "", alpha = I(.2))
# boxplots of log counts by day of the week
qplot(WeekDays, log(Number + 1), data = input, geom = "boxplot", xlab = "", ylab = "", colour = WeekDays)
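
As a quick numerical check of the "twice as many deaths on Fridays" claim above, one can aggregate the same data frame (a one-line sketch reusing the columns built above):

# average daily casualties, Fridays vs. the rest of the week
aggregate(Number ~ LogicalFriday, data = input, FUN = mean)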



Triathlon data with ggplot2

Posted in Dataset, Sport by Julyan Arbel on 11 October 2011

Since Jérôme and I enjoy playing with triathlon data so much, it is a pleasure to see that we are not alone. Christophe Ladroue, from the University of Bristol, wrote this post yesterday: An exercise in plyr and ggplot2 using triathlon results, followed by a part II, both way better than ours (here and here). For example, the time distributions by age, "faceted" by discipline (swim, cycle, run and total), look like this:

As the number of participants in the Stratford triathlon (400 or so) is a bit small for this many age categories, it would be nice to compare with the Paris triathlon results (about 4,000 participants).

Here are the ranks for the three disciplines and for the total time, "coloured" by final quartile (see the full part II post for colours by quartile in each of the three disciplines):

We see that the rank at the swim leg is not very informative about the final performance, all four colours being pretty mixed at that stage, and that the field is sorted out by the cycle leg: it is the longest one, and as such the most predictive of the final performance. It is nice to see that some of the poor swimmers eventually reach the first quartile (in orange). Check the ones with sawtooth patterns: first quartile at the swim, last at the cycle, first at the run, and last at the finish!
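
To give an idea of how such a plot is built, here is a minimal sketch on simulated split times; the distributions and column names are my own assumptions, not Christophe's code:

library(ggplot2)
set.seed(1)
n <- 400  # roughly the size of the Stratford field

# simulated split times in minutes (made-up distributions)
times <- data.frame(id    = 1:n,
                    swim  = rnorm(n, 30, 5),
                    cycle = rnorm(n, 70, 8),
                    run   = rnorm(n, 45, 6))
times$total <- times$swim + times$cycle + times$run
quartile <- cut(rank(times$total), 4, labels = paste0("Q", 1:4))  # final quartile

# cumulative rank after each leg, in long format for ggplot
ranks <- data.frame(id       = rep(times$id, 3),
                    stage    = factor(rep(c("swim", "cycle", "run"), each = n),
                                      levels = c("swim", "cycle", "run")),
                    rank     = c(rank(times$swim),
                                 rank(times$swim + times$cycle),
                                 rank(times$total)),
                    quartile = rep(quartile, 3))

# one line per athlete, coloured by final quartile
ggplot(ranks, aes(stage, rank, group = id, colour = quartile)) + geom_line(alpha = 0.4)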

An interesting thing to do with this kind of sports database would be to build panel data. As most race websites provide historical results with the participants' names and ages, identification across years is possible. This is the case for Ipitos, or for the Paris 20 km race, with data from 2004 to 2010 (and soon 2011). It remains to check whether enough people compete in all the races in a row; my guess is that the answer is yes. The next steps would be to study the impact of age on progression, and on the way one manages the effort from the beginning to the end of a race (thanks to intermediate times in running races, or discipline times in triathlons). Well, maybe in a later post.
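
As a sketch of the identification step (with invented names, years and times purely for illustration), matching two editions of a race on name and birth year is a simple merge:

# hypothetical results tables for two editions of the same race (all values invented)
res2009 <- data.frame(name = c("A. Martin", "B. Dupont"), born = c(1975, 1982),
                      time2009 = c(95, 104))
res2010 <- data.frame(name = c("A. Martin", "C. Leroy"), born = c(1975, 1988),
                      time2010 = c(93, 110))

# inner join on name and birth year keeps only runners present in both editions
panel <- merge(res2009, res2010, by = c("name", "born"))
panel$progress <- panel$time2009 - panel$time2010  # positive means improvement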


Artist's view of crimes in London

Posted in Art, R by Julyan Arbel on 10 October 2011

At first sight, one could take this picture for a scale model of some narrow mountains, like Bryce Canyon… It actually represents crimes in East London: a cardboard artwork by the London-based artist Abigail Reynolds, called Mount Fear. Here is what can be read on the artist's webpage:

The terrain of Mount Fear is generated by data sets relating to the frequency and position of urban crimes. Precise statistics are provided by the police. Each individual incident adds to the height of the model, forming a mountainous terrain.

All Mount Fear models are built on the same principles. The imaginative fantasy space seemingly proposed by the sculpture is subverted by the hard facts and logic of the criteria that shape it. The object does not describe an ideal other-worldly space separated from lived reality, but conversely describes in relentless detail the actuality of life on the city streets.

No mention of the statistical method used (kernel smoothing? Dirichlet process density estimation?). Some crime data can be found on UNdata, for example, or here for an interactive map. It reminds me of a great piece of work by David Kahle about crime in Houston, combining ggplot2 and Google Maps; he won a ggplot2 case-study competition with it. His code is available here. I particularly like the contour plot, with its cool rainbow colours, where both the crime level and the map background are clearly visible.
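
For readers who want to experiment with the idea, here is a minimal sketch of such a density map in ggplot2, with simulated hotspot coordinates standing in for the real crime data (and without the Google Maps background that Kahle's entry uses):

library(ggplot2)
set.seed(1)

# simulated crime locations around two hotspots (invented coordinates, not real data)
crimes <- data.frame(lon = c(rnorm(500, -0.05, 0.010), rnorm(300, 0.00, 0.015)),
                     lat = c(rnorm(500, 51.52, 0.010), rnorm(300, 51.50, 0.015)))

# two-dimensional kernel density estimate drawn as filled contours
ggplot(crimes, aes(lon, lat)) +
  stat_density2d(aes(fill = ..level..), geom = "polygon", alpha = 0.5) +
  scale_fill_gradientn(colours = rainbow(7))  # the rainbow palette mentioned above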


Googling Bayes’ pictures

Posted in LaTeX, R by Julyan Arbel on 29 September 2011

I am writing way too many posts in a row on Google tools. I promise I will think about something else soon.

I find it amusing that you can launch a search in Google Images just by dragging a picture into the search box, instead of typing text. I remember Pierre telling me about it a long time ago, but this is the first time I have played with it.

For example (an easy one), it recognizes the (supposed) picture of our celebrated English mathematician (note that I used a neutral file name, name3.jpg, so the search is at least unbiased):

More involved is trying a picture of typeset characters. Launching a picture of Bayes' theorem suggests as "Best guess for this image" bayes theorem (itself!), and as first entry its Wikipedia page. So it gets it all right, just as with Bayes' picture:

Actually, there is a trick here: the image is taken from a web page which links to Wikipedia, so I guess Google relies more on that piece of information than on any character recognition of the formula. When the picture cannot be found on the Internet, e.g. with a freshly LaTeX-typeset formula, the result is not as clear-cut. So it does not really beat the (not very convincing) LaTeX search offered by Springer, which looks for academic papers containing a given piece of LaTeX code.

You might also try to recover an R package from CRAN using a plot it produces (who knows?). For example, it will tell you that this kernel density plot of the diamonds dataset can be found on the ggplot2 documentation page for geom_histogram:
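
For the record, that kind of plot takes a single line to reproduce with the diamonds dataset shipped with ggplot2 (my guess at the call, not necessarily the exact one behind the picture):

library(ggplot2)
# kernel density estimate of diamond prices, in the blog's usual qplot style
qplot(price, data = diamonds, geom = "density")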
