In Bayesian nonparametrics, many models address the problem of density regression, including covariate dependent processes. These were settled by the pioneering works by [current ISBA president] MacEachern (1999) who introduced the general class of dependent Dirichlet processes. The literature on dependent processes was developed in numerous models, such as nonparametric regression, time series data, meta-analysis, to cite but a few, and applied to a wealth of fields such as, e.g., epidemiology, bioassay problems, genomics, finance. For references, see for instance the chapter by David Dunson in the Bayesian nonparametrics textbook (edited in 2010 by Nils Lid Hjort, Chris Holmes, Peter Müller and Stephen G. Walker). With Kerrie Mengersen and Judith Rousseau, we have proposed a dependent model in the same vein for modeling the influence of fuel spills on species diversity (arxiv).
In our ecological example, the model provides a series of densities on the Y axis (in our case, posterior density of species diversity), indexed by some covariate X (a pollutant). See file density_plot.txt. The following Plotly R code
library(plotly) mydata = read.csv("density_plot.txt") df = as.data.frame(mydata) plot_ly(df, x = Y, y = X, z = Z, group = X, type = "scatter3d", mode = "lines")
provides a graph as below. For the interactive version, see the RPubs page here.
Every new day brings its statistics of new deaths in Syria… Here is an attempt to learn about the Syrian uprising by the figures. Data vary among sources: the Syrian opposition provides the number of casualties by day (here on Dropbox), updated on 8 February 2012, with a total exceeding 8 000.
Plotting the numbers by day shows the bloody situation of Fridays, a gathering day in the Muslin calendar. This point was especially true at the beginning of the uprising, but lately any other day can be equally deadly:
and their density estimates, first coloured by day of the week, then by Friday vs rest of the week:
Here is the code (with clumsy parts for fitting the data frames for ggplot, do not hesitate to comment on it)
library(ggplot2) input=read.csv("http://dl.dropbox.com/u/1391912/Blog%20statisfaction/data/syria.txt", sep="\t",header=TRUE,stringsAsFactors=FALSE) input$LogicalFriday=factor(input$WeekDay =="Friday",levels = c(FALSE, TRUE), labels = c("Not Friday", "Friday")) input$Date=as.Date(input$History,"%d/%m/%Y") input$WeekDays=factor(input$WeekDay, levels=unique(as.character(input$WeekDay[7:13]))) # trick to sort the legend qplot(x=Date,y=cumsum(Number), data=input, geom="line",color=I("red"),xlab="",ylab="",lwd=I(1)) qplot(x=as.factor(Date),y=Number, data=input, geom="bar",fill=LogicalFriday,xlab="",ylab="") qplot(log(Number+1), data=input, geom="density",fill=LogicalFriday,xlab="",ylab="",alpha=I(.2)) qplot(log(Number+1), data=input, geom="density",fill=WeekDay,xlab="",ylab="",alpha=I(.2)) qplot(WeekDays,log(Number+1),data=input,geom="boxplot",xlab="",ylab="",colour=WeekDays)
As Jérôme and I like so much to play with triathlon data, it is a pleasure to see that we are not alone. Christophe Ladroue, from the university of Bristol, wrote this post yesterday: An exercise in plyr and ggplot2 using triathlon results, followed by part II, way better than ours, here and here. For example, the time distributions by age, “faceted” by discipline (swim, cycle, run and total), look like this
As the number of participants to the Stratford triathlon (400 or so) is a bit small for this number of age categories, it would be nice to compare with the Paris triathlon results (about 4000).
Here is the rank for the 3 disciplines and for the total time, “colored” by the final quartile (check the full part II post for colors by quartile in the 3 disciplines):
We see that the rank at the swim lag is not much informative for the final performance, all 4 colors being pretty mixed at that stage, and that it is tidied by the cycle lag. It is the longer one, and as such, the more predictive for the final perf. It is nice to see that some of the poor swimmers finally reach the first quartile (in orange). Check those ones whit sawtooth patterns: first quartile at swimming, last cycling, first running, and last at the end!
An interesting thing to do with that kind of sports databases would be to build panel data. As most race websites provide historical data with the participants names and age, identification is possible. It is the case for Ipitos, or for Paris 20 km race, with data from 2004 to 2010 (and soon 2011). Remains to check if enough people compete in all the races in a row, my guess is that the answer is yes. The next steps would be to study the impact of the age on the progress, and on the way ones manages the effort from the beginning to the end of the race (thanks to intermediate times in running races, or discipline times in triathlon). Well, maybe in a later post.
At first sight, one could think this picture is a scale model of some narrow moutains, like Bryce Canyon… Actually it represents crimes in East London, an cardboard artwork by the Londoner artist Abigail Reynolds, called Mount Fear. Here is what can be read on the artist’s webpage:
The terrain of Mount Fear is generated by data sets relating to the frequency and position of urban crimes. Precise statistics are provided by the police. Each individual incident adds to the height of the model, forming a mountainous terrain.
All Mount Fear models are built on the same principals. The imaginative fantasy space seemingly proposed by the scupture is subverted by the hard facts and logic of the criteria that shape it. The object does not describe an ideal other-worldly space separated from lived reality, but conversely describes in relentless detail the actuality of life on the city streets.
No mention of the statistical method used (kernel, Dirichlet process density estimation?). Some crime data can be found on UNdata for example, or here for an interactive map. It reminds a great work by David Kahle about crime in Houston, combining ggplot2 and GoogleMap. He won a ggplot2 case study competition for this. His code is available here. I like in particular the contour plot, with cool rainbow colors, where both the crime level and the map background are clearly visible.
I find amusing the possibility to launch a search in Google images by just dragging a picture into the search box, instead of typing text. I remember that Pierre told me about it a long time ago, but this is the first time I play with it.
For example (easy) it will recognize the (supposedly) picture of our celebrated English mathematician (note that I used an objective picture name name3.jpg, at least unbiased):
More involved is to try with a character picture. Launching a picture of Bayes’ Theorem suggests as “Best guest for this image” bayes theorem (itself!), and as first entry its Wikipedia page. So it gets it all right as with Bayes picture:
Actually, there is a trick here because the image is taken from the Internet on a page which links to Wikipedia. So I guess it uses more this piece of information than any character recognition of the formula. When the picture can not be found on the Internet, e.g. with a LaTeX typed formula, the result is not that clear. So it does not really beat the (not convincing) LaTeX search offered by Springer which looks for academic papers with some given LaTeX code.
You might also try to recover an R package from CRAN by using a plot that it produces (who knows?). For example, it will tell you that this kernel density plot produced with the diamonds dataset can be found on the ggplot2 page about geom_histogram: