Kudos to Rasmus for this very practical approach, potentially very impactful. Maybe someday people will have to specify if they want a frequentist approach and not the other way around! (I had a dream, etc).
With Robin Ryder we wrote a paper titled The Wang-Landau Algorithm Reaches the Flat Histogram in Finite Time and it has been accepted in Annals of Applied Probability (arXiv preprint here). I’m especially happy about it since it was the last remaining unpublished chapter of my PhD thesis. In this post I’ll try to explain what we proved here on a simple example.
For a quick recap, Pierre and I supervised a team project at Ensae last year, on a statistical critique of the abstract painting 1024 Colours by painter Gerhard Richter. The four students, Clémence Bonniot, Anne Degrave, Guillaume Roussellet and Astrid Tricaud, did an outstanding job. Here is a selection of graphs and results they produced.
1. As a preliminary descriptive study, the following scatter plots complement the triangle plot. The R function used here, from the package of the same name, displays the pixels with their coordinates in the RGB cube. It shows that, as a joint distribution, the triplets are somewhat concentrated along the black-white diagonal of the cube.
The same occurs when the points are projected on the sides of the cube. Here is a comparison with uniform simulations.
2. It is interesting to see what happens in other color representations. HSL and HSV are two cylindrical models, succinctly described by this Wikimedia picture:
The points parameterized in these models were projected on the sides as well; here, the sides of the cylinder are to be seen as the circular top (or bottom), the lateral side, and the section of the cylinder by a half-plane from its axis. It shows that some colors in the hue coordinate (rainbow-like colors) are missing, for instance green or purple.
For the HSL model,
and the HSV model.
The histograms complete this first analysis. For HSL,
3. The students built a few ad-hoc tests for uniformity, either following our perspective or on their own. They used a Kolmogorov-Smirnov test, a test, and some entropy based tests.
4. We eventually turned to testing for spatial autocorrelation. In other words, is the color of one cell related to the color of its neighbors (in which case you can predict the neighbors’ colors given one cell), or is it “non informative”? A graphical way to check this is to plot the average level of a color coordinate over the neighbors of a pixel against the pixel’s own coordinate, and then to fit a simple regression on this cloud: if the slope of the regression line is non-zero, then there is some correlation, with the sign of the slope. We tried various combinations of coordinates, and different radii for the definition of the neighborhood, and found no significant correlation. A (so-called Moran) test quantified this assessment. Here, for example, is the plot of the average level of red of the eight closest neighbors of each pixel against its own level of red.
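To make the neighbor regression and the Moran statistic concrete, here is a minimal sketch in Python. It is not the students' code: the function names are hypothetical, the weights are assumed to be binary (1 for each of the eight surrounding cells, fewer on the border), and the grid is simulated rather than taken from the painting.

```python
import random

def neighbours(i, j, nrow, ncol):
    """Yield the (up to eight) cells adjacent to (i, j), diagonals included."""
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if (di, dj) != (0, 0) and 0 <= i + di < nrow and 0 <= j + dj < ncol:
                yield i + di, j + dj

def neighbour_slope(grid):
    """Slope of the regression of the neighbours' mean level on the pixel's own level."""
    nrow, ncol = len(grid), len(grid[0])
    own, nb_mean = [], []
    for i in range(nrow):
        for j in range(ncol):
            nb = [grid[a][b] for a, b in neighbours(i, j, nrow, ncol)]
            own.append(grid[i][j])
            nb_mean.append(sum(nb) / len(nb))
    mx = sum(own) / len(own)
    my = sum(nb_mean) / len(nb_mean)
    sxy = sum((x - mx) * (y - my) for x, y in zip(own, nb_mean))
    sxx = sum((x - mx) ** 2 for x in own)
    return sxy / sxx

def morans_i(grid):
    """Moran's I with binary weights: w_ij = 1 if j is one of i's eight neighbours."""
    nrow, ncol = len(grid), len(grid[0])
    n = nrow * ncol
    mean = sum(map(sum, grid)) / n
    cross, total_weight = 0.0, 0
    for i in range(nrow):
        for j in range(ncol):
            for a, b in neighbours(i, j, nrow, ncol):
                cross += (grid[i][j] - mean) * (grid[a][b] - mean)
                total_weight += 1
    var_sum = sum((v - mean) ** 2 for row in grid for v in row)
    return (n / total_weight) * cross / var_sum

# Simulated 32x32 grid of independent levels (NOT the painting's data).
random.seed(0)
toy = [[random.randint(0, 255) for _ in range(32)] for _ in range(32)]
print(neighbour_slope(toy), morans_i(toy))
```

With independent levels, both quantities should be close to zero (the expected value of Moran's I under independence is -1/(n-1) for n cells). Turning the statistic into a test requires a reference distribution, for instance by recomputing it on random permutations of the cells; that step is left out of this sketch.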
As we all know, there is more and more available data and more and more efficient ways to analyze it to get useful answers, about pretty much everything. So, statisticians and “data scientists” in general are usually busy people. However, if some of them ever get bored, there seem to be some nice associations out there seeking volunteers to put their big minds to good use!
Here are some I came across recently:
- Statistics Without Borders is a fairly recent association “under the auspices of the American Statistical Association”, with already 200 statisticians involved. They are currently involved in a project in Haiti to help understand the impact of the massive 2010 earthquake.
- A new initiative called “Data Without Borders” seems promising. The stated goal is to bridge the gap between the people with data and the people who know what to do with it. Certainly a good idea! You can subscribe to the mailing list here.
- More for the French readers among you: the ENSAE grad school has a charity program here. The goal is to provide powerful statistical tools to non-profit associations. Previous studies include one for Emmaus and one for Mobil’douche (douche means shower in French!!), an association providing free mobile showers to homeless people.
With powerful free software like R making it possible to carry out a complete statistical analysis, I bet there are going to be many more non-profit data science associations around!
EDIT: the great kaggle.com hosts forecasting competitions including some for non-profit associations, like a recent one from the Wikimedia Foundation. Moreover, anyone can register and host a competition on this site, so it might be the easiest way for a non-profit association to have data scientists working for them.
ENSAE is the grande école where we are teaching assistants and tutor some applied statistics projects. It is offering a position starting in September (yes, September 2011) for a full professor or a teaching assistant in insurance/actuarial science/finance. The application deadline is the end of June. The ENSAE website provides the job description, in French. I can testify that this is a great place to work: students are motivated and good, and the presence of CREST nearby makes it easy to interact with a lot of people (here are members from the stat and the finance labs).
Since this is my first post here, a quick introduction is in order: I am a postdoc at CREST and CEREMADE (Paris Dauphine). I have a personal blog, but will be cross-posting here as well.
As Julyan already mentioned, Master’s level students at ENSAE are required to work on an Applied Statistics project. Céline Duval, Pierre Jacob and I are proposing a topic on the correlation between toponymy (place names) and the geographic location of French towns and cities (“communes”, of which there are 36,000). The idea is to detect which characteristics of a place name are informative of where in France that place is.
For example, some endings are very informative: on this map where every cyan dot represents one commune, place names ending in “-y” are mostly found in the North-East and place names ending in “-ac” in the South-West.
The following map shows that a “w” in a place name indicates that it lies in the North-East, and that a “k” indicates that it lies in the North-East or Brittany.
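As a rough illustration of how one might start quantifying what the maps show, the sketch below groups place names by ending and compares the mean location of each group. The function name is hypothetical, and the six communes are a toy sample with coordinates rounded to one decimal, used only to make the example runnable; a real analysis would use the full list of communes.

```python
from collections import defaultdict

def mean_location_by_suffix(communes, suffixes):
    """Return {suffix: (count, mean_lat, mean_lon)} for names ending in each suffix."""
    groups = defaultdict(list)
    for name, lat, lon in communes:
        for suffix in suffixes:
            if name.lower().endswith(suffix):
                groups[suffix].append((lat, lon))
    summary = {}
    for suffix, pts in groups.items():
        n = len(pts)
        summary[suffix] = (n, sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)
    return summary

# Toy sample: (name, latitude, longitude), approximate and rounded values.
toy_communes = [
    ("Nancy", 48.7, 6.2),
    ("Chantilly", 49.2, 2.5),
    ("Joigny", 48.0, 3.4),
    ("Bergerac", 44.9, 0.5),
    ("Aurillac", 44.9, 2.4),
    ("Cognac", 45.7, -0.3),
]

for suffix, (n, lat, lon) in mean_location_by_suffix(toy_communes, ["y", "ac"]).items():
    print(f"-{suffix}: {n} names, mean latitude {lat:.1f}, mean longitude {lon:.1f}")
```

A mean location is only a crude summary; the same grouping could feed a classifier or a spatial density estimate for each ending.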
Just for fun, here is a map of place names which include one of the major rivers: Loire, Seine, Rhône, Saône, Garonne and Marne. Unsurprisingly, they align almost perfectly with the rivers.
Students interested in the project can contact Céline, Pierre or me. People interested in the topic might enjoy Keith Briggs’s pages on English place names.
Second year at ENSAE includes a year-long team project in applied statistics. The aim is to work on data, to build a model based on the statistics, econometrics or time series lectures, and to test a few hypotheses on it. As an example, last year’s projects I supervised with Marie Chanchole, Andrew Gelman and Nicolas Chopin were
Fractional polynomials for modeling nonlinear regression functions, with a case study on same-sex marriage state law in the US, and
Study of a speed dating experiment at Columbia University.
Statistical critique of an abstract painting
Supervised by Julyan Arbel (E04), Pierre Jacob (F14)
Statistics laboratory, CREST
We propose to study certain works of modern or contemporary art, in particular abstract paintings, from a statistical angle. The starting point is a work by the German painter Gerhard Richter, entitled 1024 colours. We will consider other paintings by the same artist, and possibly extend the study to other painters, for example Piet Mondrian.
In 1024 colours, each pixel is treated as a random variable. Its value is most commonly expressed in the RGB model, that is, as a triplet giving the levels of red, green and blue. There are other colour representation models (for example the HSV model), against which the RGB model will be compared in terms of statistical results.
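For readers who want to move between colour models, here is a small sketch using `colorsys` from the Python standard library. The helper names `to_hsv` and `to_hsl` are hypothetical, and the sketch assumes the levels are integers on a 0 to 255 scale; adjust `scale` if the data use another range.

```python
import colorsys

def to_hsv(r, g, b, scale=255):
    """Convert RGB levels to (hue, saturation, value), each in [0, 1]."""
    return colorsys.rgb_to_hsv(r / scale, g / scale, b / scale)

def to_hsl(r, g, b, scale=255):
    """Convert RGB levels to (hue, saturation, lightness), each in [0, 1]."""
    # colorsys returns the components in HLS order, so reorder them to HSL.
    h, l, s = colorsys.rgb_to_hls(r / scale, g / scale, b / scale)
    return h, s, l

print(to_hsv(255, 0, 0))      # pure red
print(to_hsl(255, 0, 0))
print(to_hsv(128, 128, 128))  # a grey on the black-white diagonal: zero saturation
```

The hue is an angle mapped to [0, 1], so a histogram of hues should be read on a circle (0 and 1 are the same colour).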
The descriptive study of the work will consist in looking for predominant colours, determining the principal axes of the colours in RGB space by PCA, etc.
The main statistical motivation is the study of the uniformity of the colours in the work. Indeed, the artist says he chose the colours at random; the task will therefore be to define this notion in the context of images and to test this hypothesis. We propose to test it in two stages. First, ignoring the spatial arrangement of the pixels: we want to test the uniformity of the 1024 colours in RGB space. Then, taking position into account, we will seek to test for the presence of spatial autocorrelation. This last test will require a literature search.
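As one possible starting point for the first stage, the sketch below computes a one-sample Kolmogorov-Smirnov statistic for a single colour coordinate against a uniform distribution, with an asymptotic p-value. It is only an illustration under stated assumptions: the function names are hypothetical, the levels are simulated rather than read from the painting, the coordinates are tested one at a time rather than jointly in the RGB cube, and the p-value ignores the fact that the levels are discrete.

```python
import math
import random

def ks_uniform(sample, low=0.0, high=1.0):
    """Kolmogorov-Smirnov distance between the sample and Uniform(low, high)."""
    xs = sorted((x - low) / (high - low) for x in sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = min(max(x, 0.0), 1.0)  # uniform CDF on the rescaled value
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def ks_pvalue(d, n, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (large-n approximation)."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the p-value is numerically 1 here and the series converges slowly
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(max(2 * s, 0.0), 1.0)

# Simulated red levels for 1024 cells (NOT the painting's data).
random.seed(1)
levels = [random.randint(0, 255) for _ in range(1024)]
d = ks_uniform(levels, low=0, high=256)
print(d, ks_pvalue(d, len(levels)))
```

A small p-value would suggest that the coordinate is not uniformly distributed; uniformity of each coordinate separately does not imply uniformity of the triplets in the cube.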
The topic deliberately remains open to other lines of study at the group’s initiative.
The data set consists of three matrices, R, G, B, of size 32×32, available here. The values of the three colours are expressed on a scale from 0 to 256.
The data were extracted by a script written in Python. Knowledge of this language is not a prerequisite for the project, but it will be useful for extracting other data sets with the same script. The statistical analysis should preferably be programmed in R.