## Quantitative arguments as hypermedia

Hey readers,

I’m Joseph Dureau. I have been an avid reader of this blog for a while now, and I’m very glad Pierre invited me to share a few things. Until a few months ago, I worked on Bayesian inference methods for stochastic processes, with applications to epidemiology. Along with colleagues from this past life, I have now taken the startup path, founding Standard Analytics. We’re looking into how web technologies can be used to enhance the browsability, transparency and impact of scientific publications. Here’s a first look at what we’ve been up to so far.

Let me just make it clear that everything I’m presenting is fully open source, and available here. I hope you’ll find it interesting, and we’re very excited to hear from you! Here it goes.

To date, the Web has developed most rapidly as a medium of documents for people rather than for data and information that can be processed automatically.

Berners-Lee et al, 2001

Since this sentence was written, twelve years ago, ambitious collective initiatives have been undertaken to revolutionize what machines can do for us on the web. When I make a purchase online, my email service is able to understand it from the purchase confirmation email, communicate with the online store’s service, authenticate, obtain information on the delivery, and provide me with a real-time representation of where the item is located. Machines now have the means to process data in a smarter way, and to communicate about it!

However, when it comes to exchanging quantitative arguments, be it in a blog post or in a scientific article, web technology does not bring us much further than what can be done with pen and paper.

## Random Colours (part 3)

Thanks to Pierre, we now have a new playground for spatial stats, see this post. Before that, let’s see what basic things we can learn without spatial information.

The data consist of three 32×32 tables, R, G and B, of numbers between 0 and 255. Of course, the tables should be considered together as a single 32×32 table of (r,g,b) vectors. Still, the first basic thing to do is to plot three separate histograms for R, G and B:

compared to uniformly simulated data

We see that the painter has a bias for darker colors, and rather misses light green and light blue ones.

Then, what can we do for representing (r,g,b) vectors? I guess that a good visualization is the color triangle

A few words to explain where it comes from. Say the (r,g,b) data is normalized to the unit cube. Then the corners of the color triangle correspond to (plain) red, green and blue, at the bottom left, bottom right and top, respectively. It is a section of the cube lying between two opposite vertices, black (0,0,0) and white (1,1,1), each of which is equidistant from the three corners. This triangle is a simplex: the coordinates of any of its points sum to 1. Now the data in the triangle is obtained as (r,g,b) points, divided by (r+g+b). It took me a while to compute the coordinates (x,y) of those points in a basis of the triangle (I did that stuff more easily back in high school!). It should give something like this:

$$x = \frac{1 - r + g}{2}, \qquad y = \frac{1}{\sqrt{3}}\left(\frac{1 - r - g}{2} + b\right)$$
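For readers who want to try it, here is a minimal Python sketch of that projection. The function name `triangle_coords` is mine, and the choice to send pure black (where r+g+b = 0 and the normalization is undefined) to the center of the triangle is an arbitrary convention, not something dictated by the formula.

```python
import math


def triangle_coords(r, g, b):
    """Map an (r, g, b) colour to (x, y) coordinates in the colour triangle.

    The colour is first divided by r + g + b so that it lies on the simplex,
    then the formula above is applied.
    """
    total = r + g + b
    if total == 0:
        # Assumption: black has no position on the simplex, so it is sent
        # to the centre of the triangle by convention.
        return (0.5, math.sqrt(3) / 6)
    r, g, b = r / total, g / total, b / total
    x = (1 - r + g) / 2
    y = ((1 - r - g) / 2 + b) / math.sqrt(3)
    return (x, y)


# Plain red, green and blue land on the three corners of the triangle.
for name, colour in [("red", (255, 0, 0)), ("green", (0, 255, 0)), ("blue", (0, 0, 255))]:
    print(name, triangle_coords(*colour))
```

Red goes to (0, 0), green to (1, 0) and blue to (0.5, √3/2), so the triangle is equilateral with side 1.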

What do we see? The colors do not look uniformly distributed, because 1) points are much more concentrated in the center, and 2) the painter favoured red colors over green ones (very few in the bottom right corner) and, to a lesser extent, blue ones. One could argue against point 1) that projecting (r,g,b) points onto the simplex naturally implies a higher density in the center. That is right, but it would not be that dense, as we see with uniformly simulated data:
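The claim that the projection alone pushes points towards the center can be checked by simulation. The sketch below is only an illustration: it compares points drawn uniformly in the RGB cube and divided by (r+g+b) with points drawn uniformly on the simplex itself, and counts how many fall close to the center (1/3, 1/3, 1/3). The radius of 0.25, the sample size and the seed are arbitrary choices of mine, and nothing here uses the painting data.

```python
import math
import random


def centre_fraction(points, radius=0.25):
    """Fraction of simplex points within `radius` of the centre (1/3, 1/3, 1/3)."""
    centre = (1 / 3, 1 / 3, 1 / 3)
    close = sum(1 for p in points if math.dist(p, centre) < radius)
    return close / len(points)


def projected_cube_point(rng):
    """Uniform point in the unit cube, divided by r + g + b."""
    while True:
        r, g, b = rng.random(), rng.random(), rng.random()
        total = r + g + b
        if total > 0:  # skip the (practically impossible) exact black
            return (r / total, g / total, b / total)


def uniform_simplex_point(rng):
    """Uniform point on the simplex, obtained by normalising exponentials."""
    e = [rng.expovariate(1.0) for _ in range(3)]
    total = sum(e)
    return tuple(v / total for v in e)


rng = random.Random(0)
n = 5000
frac_projected = centre_fraction([projected_cube_point(rng) for _ in range(n)])
frac_flat = centre_fraction([uniform_simplex_point(rng) for _ in range(n)])
print("projected cube:", frac_projected)
print("uniform simplex:", frac_flat)
```

The projected cube points should come out noticeably more concentrated around the center than the flat ones, which is the baseline the painting has to be compared with.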

So colors are not uniform in the RGB model. There should be a cognitive interpretation out there, no? It is not obvious that human eyes perceive colors on the same scale as the RGB model does. If not, there is no reason for human sight to perceive uniformity in the same way as a computer. As Pierre pointed out, what we find in the RGB model might be different in the HSV model. We’ll look at this model later.

Next step, spatial autocorrelation tests?

## Dirichlet process and related priors

My contribution this year to the MCB seminar at CREST is about nonparametric Bayes (today at 2 pm, room 14). I shall start with 1) a few words of history, then introduce 2) the Dirichlet Process through several of its numerous defining properties. I will next introduce an extension of the Dirichlet Process, namely 3) DP mixtures, useful for applications like 4) density estimation. Last, I will show 5) posterior MCMC simulations for a density model and give some 6) reference textbooks.

**1) History**

Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209-230.

Doksum, K. (1974). Tailfree and neutral random probabilities and their posterior distributions. The Annals of Probability, 2, 183-201.

Blackwell, D. (1973). Discreteness of Ferguson selections. The Annals of Statistics, 1, 356-358.

Blackwell, D. and MacQueen, J. (1973). Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1, 353-355.

**2) Dirichlet Process defining properties**

Mainly based on Peter Müller’s slides of class 2 at Santa Cruz this summer.

Dirichlet distribution on finite partitions

Stick-breaking / Sethuraman representation

Polya urn analogy for the predictive probability function

Normalization of a Gamma Process
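To make one of these properties concrete, here is a small sketch of the stick-breaking representation, truncated to a finite number of atoms. It is only an illustration under assumptions of mine: the standard normal base measure, the concentration `alpha = 2.0`, the truncation at 50 atoms and the function name `stick_breaking` are not taken from the slides.

```python
import random


def stick_breaking(alpha, base_sampler, n_atoms, rng):
    """Truncated stick-breaking draw from a DP(alpha, base measure).

    Returns (weights, atoms, leftover), where `leftover` is the stick mass
    not yet allocated after `n_atoms` breaks.
    """
    weights, atoms = [], []
    remaining = 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)   # proportion broken off the remaining stick
        weights.append(remaining * v)     # weight of this atom
        atoms.append(base_sampler(rng))   # atom location drawn from the base measure
        remaining *= 1.0 - v
    return weights, atoms, remaining


rng = random.Random(1)
weights, atoms, leftover = stick_breaking(
    alpha=2.0,
    base_sampler=lambda r: r.gauss(0.0, 1.0),  # assumed base measure: N(0, 1)
    n_atoms=50,
    rng=rng,
)
print("largest weights:", sorted(weights, reverse=True)[:5])
print("mass left out by the truncation:", leftover)
```

The weights and the leftover mass sum to 1, and the result is a discrete random probability measure, which is exactly the discreteness discussed by Blackwell.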

**3) Dirichlet Process Mixtures (DPM)**

Convolution of a DP with a continuous kernel to circumvent its discreteness.
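A rough way to see what this buys us is to simulate from a DP mixture. The sketch below draws the latent means through the Polya urn scheme and then adds Gaussian noise around each of them. The normal base measure, the normal kernel, their standard deviations and the name `sample_dpm` are all assumptions made for the illustration.

```python
import random


def sample_dpm(n, alpha, rng, base_sd=3.0, kernel_sd=0.5):
    """Draw n observations from a DP mixture of normals via the Polya urn.

    Latent means: with probability alpha / (alpha + i) the i-th mean is a
    fresh draw from the base measure N(0, base_sd^2); otherwise it copies
    one of the previous means chosen uniformly at random.
    Observations: N(mean, kernel_sd^2) around the latent mean.
    """
    means, data = [], []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            mu = rng.gauss(0.0, base_sd)   # new cluster
        else:
            mu = rng.choice(means)         # join an existing cluster
        means.append(mu)
        data.append(rng.gauss(mu, kernel_sd))
    return data, means


rng = random.Random(2)
data, means = sample_dpm(n=200, alpha=1.0, rng=rng)
print("distinct latent means:", len(set(means)))
print("distinct observations:", len(set(data)))
```

The latent means show many ties, as expected from a discrete DP draw, while the observations themselves are all distinct thanks to the continuous kernel.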

**4) Density estimation with DPMs**

**5) Posterior MCMC simulation**

Based on Peter Müller’s slides of class 6.

**6) Textbooks**

Bayesian nonparametrics, 2003, J. K. Ghosh, R. V. Ramamoorthi.

Bayesian nonparametrics, 2010, Nils Lid Hjort, Chris Holmes, Peter Müller, Stephen G. Walker, and contributions by Subhashis Ghosal, Antonio Lijoi, Igor Prünster, Yee Whye Teh, Michael I. Jordan, Jim Griffin, David B. Dunson, Fernando Quintana.

## 256 (random ?) colors

This painting by Gerhard Richter is called *256 colors*. The painter is fully committed to this kind of work, as you can see here. When visiting the San Francisco Museum of Modern Art (SFMOMA) (I’m getting literate…), the guide asked the following question:

Do you think the colors are positioned randomly or not?

Not a trivial question, is it? And you, would you say it is random? This work dates back to 1974, when computer screens mainly displayed green letters on a black background. So it seems the artist did not benefit from computer assistance.

There are many ways to translate this plain English statement into statistical terms. For example, are the colors, with no ordering, uniformly distributed? (OK, this is not at all the same as (true) randomness, but it is a question…) It would be nice to have the 256 colors in RGB. In this color model, (0,0,0) is black, and (255,255,255) is white. I think that there are rather more dark colors than light ones, i.e. more data points near the (0,0,0) vertex than near the opposite one, in the RGB cube. So a test of uniformity would probably reject.
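One simple check along these lines would be a Monte Carlo test on the average brightness: simulate many sets of 256 colors uniformly in the RGB cube and see how often they come out as dark as the painting. The sketch below is hypothetical on two counts. I do not have the actual colors, so `placeholder_colors` is made-up data, deliberately skewed towards dark values. And the mean of r, g and b is just one of many possible test statistics.

```python
import random


def mean_brightness(colors):
    """Average of all r, g and b values, between 0 and 255."""
    return sum(r + g + b for r, g, b in colors) / (3 * len(colors))


def darkness_p_value(colors, n_sim=500, seed=0):
    """One-sided Monte Carlo p-value for 'darker than uniform in the RGB cube'."""
    rng = random.Random(seed)
    observed = mean_brightness(colors)
    n = len(colors)
    as_dark = 0
    for _ in range(n_sim):
        simulated = [tuple(rng.randint(0, 255) for _ in range(3)) for _ in range(n)]
        if mean_brightness(simulated) <= observed:
            as_dark += 1
    # The +1 terms count the observed data set among the simulations.
    return (as_dark + 1) / (n_sim + 1)


# Made-up stand-in for the 256 colors of the painting (NOT the real data):
# squaring a uniform number pushes the values towards 0, i.e. towards black.
placeholder_rng = random.Random(42)
placeholder_colors = [
    tuple(int(255 * placeholder_rng.random() ** 2) for _ in range(3))
    for _ in range(256)
]
p_value = darkness_p_value(placeholder_colors)
print("Monte Carlo p-value:", p_value)
```

A small p-value means that uniformly drawn colors are rarely as dark as the data at hand. With the real 256 colors in place of the placeholder, this would give a first answer to the question above.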

A more subtle way to interpret uniformity in the painting would be to take into account the position of the colors… Any idea how to check that? I have no clue.

Here is a larger one, 1024 colors…
