In Bayesian nonparametrics, many models address the problem of density regression, including covariate dependent processes. These were settled by the pioneering works by [current ISBA president] MacEachern (1999) who introduced the general class of dependent Dirichlet processes. The literature on dependent processes was developed in numerous models, such as nonparametric regression, time series data, meta-analysis, to cite but a few, and applied to a wealth of fields such as, e.g., epidemiology, bioassay problems, genomics, finance. For references, see for instance the chapter by David Dunson in the Bayesian nonparametrics textbook (edited in 2010 by Nils Lid Hjort, Chris Holmes, Peter Müller and Stephen G. Walker). With Kerrie Mengersen and Judith Rousseau, we have proposed a dependent model in the same vein for modeling the influence of fuel spills on species diversity (arxiv).
In our ecological example, the model provides a series of densities on the Y axis (in our case, posterior density of species diversity), indexed by some covariate X (a pollutant). See file density_plot.txt. The following Plotly R code
library(plotly) mydata = read.csv("density_plot.txt") df = as.data.frame(mydata) plot_ly(df, x = Y, y = X, z = Z, group = X, type = "scatter3d", mode = "lines")
provides a graph as below. For the interactive version, see the RPubs page here.
I presented an arxived paper of my postdoc at the big success Young Bayesian Conference in Vienna. The big picture of the talk is simple: there are situations in Bayesian nonparametrics where you don’t know how to sample from the posterior distribution, but you can only compute posterior expectations (so-called marginal methods). So e.g. you cannot provide credible intervals. But sometimes all the moments of the posterior distribution are available as posterior expectations. So morally, you should be able to say more about the posterior distribution than just reporting the posterior mean. To be more specific, we consider a hazard (h) mixture model
where is a kernel, and the mixing distribution is random and discrete (Bayesian nonparametric approach).
We consider the survival function which is recovered from the hazard rate by the transform
and some possibly censored survival data having survival . Then it turns out that all the posterior moments of the survival curve evaluated at any time can be computed.
The nice trick of the paper is to use the representation of a distribution in a [Jacobi polynomial] basis where the coefficients are linear combinations of the moments. So one can sample from [an approximation of] the posterior, and with a posterior sample we can do everything! Including credible intervals.
I’ve wrapped up the few lines of code in an R package called momentify (not on CRAN). With a sequence of moments of a random variable supported on [0,1] as an input, the package does two things:
- evaluates the approximate density
- samples from it
A package example for a mixture of beta and 2 to 7 moments gives that result:
For a quick recap, Pierre and I supervised a team project at Ensae last year, on a statistical critique of the abstract painting 1024 Colours by painter Gerhard Richter. The four students, Clémence Bonniot, Anne Degrave, Guillaume Roussellet and Astrid Tricaud, did an outstanding job. Here is a selection of graphs and results they produced.
1. As a preliminary descriptive study, the following scatter plots come and complete the triangle plot.The R function , from the package of the same name, displays the pixels with their coordinates in the RGB cube. It shows that, as a joint law, the triplets are somehow concentrated along the black-white diagonal of the cube.
The same occurs when the points are projected on the sides of the cube. Here is a comparison with uniform simulations.
2. It is interesting to see what happens in other color representations. HSL and HSV are two cylindrical models, succintly described by this Wikimedia picture:
The points parameterized in these model were projected on the sides as well; here, the sides of the cylinder are to be seen as the circular top (or bottom), the lateral side, and the section of the cylinder by a half-plane from its axis. Its shows that some colors in the hue coordinate (rainbow-like colors) are missing, for instance green or purple.
For the HSL model,
and the HSV model.
The histograms complete this first analysis. For HSL,
3. The students built a few ad-hoc tests for uniformity, either following our perspective or on their own. They used a Kolmogorov-Smirnov test, a test, and some entropy based tests.
4. We eventually turned to testing for spatial autocorrelation. In other words, is the color of one cell related to the color of its neighbors (in which case you can predict the neighbors’ colors given one cell), or is it “non informative”? A graphical way to check this is to plot average level of a color coordinate of the neighbors of a pixel with respect to its own coordinate. Then to fit a simple regression on this cloud: if the slope of the regression line is non zero, then there is some correlation, of the sign of the slope. We tried various combinations of coordinates, and different radii for the neighborhood’s definition, with no significant correlation. A (so-called Moran) test quantified this assessment. Here is for example the plot of the average level of red of the eight closer neighbors of each pixel with respect to its level of red.
Following a few entries on sports here and there, I was wondering what kind of law follow the running records with respect to the distance. The data are available on Wikipedia, or here for a tidied version. It collects 18 distances, from 100 meters to 100 kilometers. A log-log scale is in order:
It is nice to find a clear power law: the relation between the logarithms of time T and of distance D is linear. Its slope (given by the lm function) defines the power in the following relation:
Another type of race consists in running backwards (or retrorunning). The linear link is similar
with a slightly larger power
So it gets harder to run longer distances backwards than forwards…
It would be interesting to compare the powers for other sports like swimming and cycling.
As an appetizer for Paris triathlon, Jérôme and I ran as a team last week-end an adventure racing in Champagne region (it mainly consists in running, cycling, canoeing, with a flavor of orienteering, and Champagne is kept for the end). It was organized by Ecole Polytechnique students who, for the first time, divided Saturday’s legs in two parts: in order to reduce the traffic jam in each leg, the odd number teams were to perform part 1, then part 2, and even number teams in the reverse order.
As the results popped out, we wondered whether the order of performance had favored one of the groups or not. A very much crucial question for us as we were the only odd number team in the top five. Using ggplot and a dataframe donnees including time and Group variables, the code for the (non normalized) histograms of the two groups (even: 0, odd: 1) looks like this
There are roughly the same number of teams in each group (36 and 38). Time is in hours; the effective racing time is around 12 to 15 hours, but to this is added or substracted penalties or bonus, which explains total times between 5 and 30 hours. The whole impression is that the even group as a flatter histogram, and it might be that it is slightly more to the left that the odd one. To test this last hypothesis, I proceeded with a non-parametric test, the Wilcoxon test (or Mann-Whitney test): the null hypothesis is that the distributions of the two groups (say timeO and time1) do not differ by a location shift, and the alternative is that they differ by some non zero location shift.
> wilcox.test(time0,time1, paired=FALSE) Wilcoxon rank sum test data: time0 and time1 W = 640, p-value = 0.64 alternative hypothesis: true location shift is not equal to 0
The p-value is greater than 0.05, so we conclude that there is no significant location shift in time. This test is certainly not the most appropriate one since the two distributions do not have the same shape.
This thread has been widely discussed on a lot of forums. To make a long story short, running natively R on an iDevice (meaning iPhone/iPad) is disabled by its OS, unless it is jailbroken. The steps for the installation through Cydia are described in this R wiki, or this post. But there are some limitations, including bugs in package management. Running R directly seems not to be the more sensible way to proceed: smartphones lack so much power for that, you would do it just for fun. Typing R code on a small keyboard is quite tedious. But is manageable on a larger one (iPad, Galaxy Tab).
There could be another short-term solution to use R on an iDevice (or any other smartphone). As explained on this introduction to RStudio, the novel IDE can be launched directly on the Web, while running on a server. This would be an easier and safer solution than jailbreaking the mobile, which would have a number of benefits, including: the ability to access and share your files from anywhere (with an Internet connection), to benefit from more powerful computing resources (of your server), and to focus on a single configuration of R (for packages).
But there is a but, because the RStudio development team is still working on it. It is close to working (it is possible to click on tabs and menus) but not yet possible to get the keyboard to initialize and to type in the console. Yet executing the code can be done by clicking on Run line(s)+Enter. It is not a lot for now, but is a good start.
I was recently introduced to RStudio, a new integrated development environment for R, it is just amazing!
It is free, and open, compatible with PC/Mac/Linux OSs. You can also choose to run it in the cloud, and access it from your favorite web browser.
As you can see, the window divides into four in a customizable way. For the source code, multiple files are organized by tabs. The editor is great, providing code completion (for commands and existing variables), indentation, and syntax highlighting. Execution from source can be done with the Run icons (as well as saving, printing), or by a shortcut (Command or Ctrl+Enter). For more shortcuts you can check this page.
Another quadrant is for Workspace and History. The first lists as in Matlab all your current variables. You can have a quick look to them by clicking (equivalent to the View command), and edit your functions (equivalent to the fix command). It makes easy to import datasets from text files or from the Web. The bottom left quadrant allows to browse your files. A very interesting feat is the Plots tab, which stores all the past plots. You can export at the end the one you want, in pdf or png. It includes as well a package installation tool, and the R documentation.
For enabling access from any Web browser, there exists the RStudio Server version, to be installed on a Linux server. It’s great for team projects, as you can enable access to anyone by giving the server url. It allows to install exotic packages on a single server, which is good when you have restrictions on other computers. And you can check from a remote position the results of your computations you are drawing for the week-end. It looks like that:
For people at ENSAE, admin registration is not required for installing RStudio, so don’t hesitate to give it a try!
This is the first post of Statisfaction on R-bloggers. As an introduction: we are PhD students and postdocs at CREST, a research centre on economics and statistics located in Paris, France. We jointly share tips and tricks useful in our everyday jobs, links to various pages, articles, conferences, seminars, including a PhD student seminar at CREST. Since we happen to write stuff about R, we are joining the community through the R-bloggers aggregator.
In case you missed them, here are a few entries related to R:
This short post is to share an R tip Pierre recently gave me. When you need to store values sequentially (typically inside a loop), it’s more far efficient to create the whole vector (or matrix) and to fill it, rather than to concatenate the values to your current vector (or matrix). In terms of allocation time, it’s up to 60 times faster
which makes a big difference. In a Gibbs sampling loop, where a lot of other computations are done, I still have gain of 25%! For other recommendations (don’t use parentheses where you can avoid them), you can have a look on this post.