Statisfaction

Rasmus Bååth’s Bayesian first aid

Posted in Project, R, Statistics by Pierre Jacob on 23 January 2014

Besides having coded a pretty cool MCMC app in JavaScript, this guy Rasmus Bååth has started the Bayesian first aid project. The idea is that if there’s an R function called blabla.test performing test “blabla”, there should be a function bayes.blabla.test performing a similar test in a Bayesian framework, and showing the output in a similar way so that the user can easily compare both approaches. This post explains it all. JAGS and BEST seem to be the two main workhorses under the hood.

Kudos to Rasmus for this very practical approach, potentially very impactful. Maybe someday people will have to specify if they want a frequentist approach and not the other way around! (I had a dream, etc).
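To make the "same data, two outputs" idea concrete, here is a rough Python sketch for the binomial case. It is not the Bayesian first aid code (which is in R and relies on JAGS): the function names, the uniform Beta(1, 1) prior and the Monte Carlo summary are all assumptions made for illustration.

```python
import math
import random

def binom_test_two_sided(k, n, p0=0.5):
    """Exact two-sided binomial p-value: total mass of outcomes no more likely than k."""
    pmf = [math.comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # small relative tolerance for ties
    return min(1.0, sum(p for p in pmf if p <= threshold))

def bayes_binom_summary(k, n, p0=0.5, draws=20000, seed=1):
    """Posterior summary for the success probability, by Monte Carlo.

    Assumption: uniform Beta(1, 1) prior, so the posterior is Beta(1 + k, 1 + n - k).
    """
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(1 + k, 1 + n - k) for _ in range(draws))
    return {
        "posterior_mean": sum(samples) / draws,
        "interval_95": (samples[int(0.025 * draws)], samples[int(0.975 * draws) - 1]),
        "prob_above_p0": sum(s > p0 for s in samples) / draws,
    }

# Hypothetical data: 7 successes out of 10 trials
p_value = binom_test_two_sided(7, 10)
posterior = bayes_binom_summary(7, 10)
print("frequentist p-value:", p_value)
print("Bayesian summary:", posterior)
```

The frequentist function returns a single p-value, while the Bayesian one returns a posterior mean, a credible interval and the posterior probability that the parameter exceeds the reference value, which is the kind of side-by-side comparison the project aims for.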

Final post on the Wang-Landau and the Flat Histogram criterion

Posted in Project, Statistics by Pierre Jacob on 15 December 2012
Gaussian density biased such that 75% of the mass is in the negative values and 25% in the positive values


Hey,

With Robin Ryder we wrote a paper titled The Wang-Landau Algorithm Reaches the Flat Histogram in Finite Time, and it has been accepted by the Annals of Applied Probability (arXiv preprint here). I’m especially happy about it since it was the last remaining unpublished chapter of my PhD thesis. In this post I’ll try to explain what we proved, using a simple example.

(more…)

Last and final on Richter’s painting

Posted in Art, Project, R by Julyan Arbel on 22 August 2011

For a quick recap, Pierre and I supervised a team project at ENSAE last year, on a statistical critique of the abstract painting 1024 Colours by Gerhard Richter. The four students, Clémence Bonniot, Anne Degrave, Guillaume Roussellet and Astrid Tricaud, did an outstanding job. Here is a selection of graphs and results they produced.

1. As a preliminary descriptive study, the following scatter plots complement the triangle plot. The R function scatterplot3d, from the package of the same name, displays the pixels with their coordinates in the RGB cube. It shows that, as a joint distribution, the triplets are somewhat concentrated along the black-white diagonal of the cube.

The same occurs when the points are projected on the sides of the cube. Here is a comparison with uniform simulations.

2. It is interesting to see what happens in other color representations. HSL and HSV are two cylindrical models, succinctly described by this Wikimedia picture:

The points parameterized in these models were projected on the sides as well; here, the sides of the cylinder are to be seen as the circular top (or bottom), the lateral side, and the section of the cylinder by a half-plane from its axis. It shows that some colors in the hue coordinate (rainbow-like colors) are missing, for instance green or purple.

For the HSL model,

and the HSV model.

The histograms complete this first analysis. For HSL,

and HSV.
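For readers who want to try the change of color model themselves, here is a minimal Python sketch based on the standard-library colorsys module. The function name rgb255_to_hsv_hsl is hypothetical, and the sketch assumes levels coded from 0 to 255; it is not the students’ code.

```python
import colorsys

def rgb255_to_hsv_hsl(r, g, b):
    """Convert an RGB triplet (assumed 0-255) to (HSV, HSL) tuples with components in [0, 1]."""
    rf, gf, bf = r / 255.0, g / 255.0, b / 255.0
    hsv = colorsys.rgb_to_hsv(rf, gf, bf)
    # colorsys returns hue, lightness, saturation (HLS order); reorder to HSL
    h, l, s = colorsys.rgb_to_hls(rf, gf, bf)
    return hsv, (h, s, l)

pure_red = rgb255_to_hsv_hsl(255, 0, 0)
mid_grey = rgb255_to_hsv_hsl(128, 128, 128)
print("red  (HSV, HSL):", pure_red)
print("grey (HSV, HSL):", mid_grey)
```

Greys sit on the axis of both cylinders (zero saturation), so the black-white diagonal of the RGB cube maps to that axis, and hue is the angular coordinate in which missing colors would show up.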

3. The students built a few ad hoc tests for uniformity, either following our suggestions or on their own. They used a Kolmogorov-Smirnov test, a chi-squared test, and some entropy-based tests.
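As an illustration of what a chi-squared uniformity check on a single color channel can look like, here is a small Python sketch. The binning (8 equal bins), the function name chi_square_uniform and the simulated data are assumptions for the example; the students’ actual tests may have been set up differently.

```python
import random

def chi_square_uniform(values, n_bins=8, upper=256):
    """Pearson chi-squared statistic for H0: values are uniform on [0, upper)."""
    counts = [0] * n_bins
    width = upper / n_bins
    for v in values:
        # Clamp so that a value equal to `upper` falls in the last bin
        idx = min(int(v / width), n_bins - 1)
        counts[idx] += 1
    expected = len(values) / n_bins
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, counts

# Synthetic stand-in for one channel of a 32x32 image (NOT the painting's data)
rng = random.Random(0)
simulated = [rng.randrange(256) for _ in range(1024)]

stat, counts = chi_square_uniform(simulated)
CRITICAL_5PCT_DF7 = 14.067  # 95% quantile of the chi-squared law with 8 - 1 = 7 degrees of freedom
print("statistic:", stat, "reject uniformity at 5%:", stat > CRITICAL_5PCT_DF7)
```

The statistic is compared with a tabulated critical value to avoid any dependency; with a statistics library one would report a p-value instead.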

4. We eventually turned to testing for spatial autocorrelation. In other words, is the color of one cell related to the color of its neighbors (in which case you can predict the neighbors’ colors given one cell), or is it “non informative”? A graphical way to check this is to plot the average level of a color coordinate over the neighbors of a pixel against the pixel’s own coordinate, and then to fit a simple regression on this cloud: if the slope of the regression line is nonzero, there is some correlation, with the same sign as the slope. We tried various combinations of coordinates, and different radii for the definition of the neighborhood, and found no significant correlation. A test (known as Moran’s test) quantified this assessment. Here, for example, is the plot of the average level of red of the eight closest neighbors of each pixel against its own level of red.
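The two quantities described above can be sketched in a few lines of Python. This is only an illustration on simulated grids: the function names, the choice of the eight surrounding cells with equal (binary) weights, and the absence of any edge correction are assumptions, not a description of the students’ implementation.

```python
import random

def neighbors(i, j, n_rows, n_cols):
    """Yield the coordinates of the (up to) eight cells surrounding cell (i, j)."""
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if (di, dj) != (0, 0) and 0 <= i + di < n_rows and 0 <= j + dj < n_cols:
                yield i + di, j + dj

def neighbor_mean_slope(grid):
    """Least-squares slope of the neighbors' average level on the cell's own level."""
    n_rows, n_cols = len(grid), len(grid[0])
    own, avg = [], []
    for i in range(n_rows):
        for j in range(n_cols):
            nb = [grid[a][b] for a, b in neighbors(i, j, n_rows, n_cols)]
            own.append(grid[i][j])
            avg.append(sum(nb) / len(nb))
    mx, my = sum(own) / len(own), sum(avg) / len(avg)
    sxx = sum((x - mx) ** 2 for x in own)
    sxy = sum((x - mx) * (y - my) for x, y in zip(own, avg))
    return sxy / sxx

def morans_i(grid):
    """Moran's I with binary weights on the eight-cell neighborhood."""
    n_rows, n_cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    cross, w_total = 0.0, 0
    for i in range(n_rows):
        for j in range(n_cols):
            for a, b in neighbors(i, j, n_rows, n_cols):
                cross += (grid[i][j] - mean) * (grid[a][b] - mean)
                w_total += 1
    denom = sum((v - mean) ** 2 for v in values)
    return (len(values) / w_total) * cross / denom

# Simulated 32x32 grids (NOT the painting): independent noise vs a smooth gradient
rng = random.Random(42)
noise = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]
gradient = [[i for _ in range(32)] for i in range(32)]

print("noise:    slope", neighbor_mean_slope(noise), "Moran's I", morans_i(noise))
print("gradient: slope", neighbor_mean_slope(gradient), "Moran's I", morans_i(gradient))
```

On independent noise both numbers should be close to zero, whereas the gradient grid gives values close to one. A proper test would compare the observed statistic with its distribution under the null, for instance by permuting the cells.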

Thanks again to the students for their enthusiasm!

Non-profit data science associations

Posted in Project, Statistics by Pierre Jacob on 28 June 2011

Hey there,

As we all know, there is more and more available data and more and more efficient ways to analyze it to get useful answers, about pretty much everything. So, statisticians and “data scientists” in general are usually busy people. However, if some of them ever get bored, there seem to be nice associations out there seeking volunteers to put their big minds to good use!

Here are some I came across recently:

With powerful free software like R making it possible to carry out a complete statistical analysis, I bet there are going to be many more non-profit data science associations around!

EDIT: the great kaggle.com hosts forecasting competitions including some for non-profit associations, like a recent one from the Wikimedia Foundation. Moreover anyone can register and host a competition on this site, so it might be the easiest way for a non-profit association to have data scientists working for them.

Job offer at ENSAE

Posted in Project by Julyan Arbel on 7 June 2011

ENSAE is the grande école where we are teaching assistants and tutor some applied statistics projects. It is offering a position starting in September (yes, September 2011) for a full professor or a teaching assistant in insurance/actuarial science/finance. The application deadline is the end of June. The ENSAE website provides the job description, in French. I can testify that this is a great place to work: students are motivated and good, and the presence of CREST nearby makes it easy to interact with a lot of people (here are members from the stat and the finance labs).


Toponymy and localisation

Posted in Project by Robin Ryder on 21 October 2010

Cross-posted from my personal blog.

Since this is my first post here, a quick introduction is in order: I am a postdoc at CREST and CEREMADE (Paris Dauphine). I have a personal blog, but will be cross-posting here as well.

As Julyan already mentioned, Master’s level students at ENSAE are required to work on an Applied Statistics project. Céline Duval, Pierre Jacob and I are proposing a topic on the correlation between toponymy (place names) and the geographic location of French towns and cities (“communes”, of which there are 36,000). The idea is to detect which characteristics of a place name are informative of where in France that place is.

For example, some endings are very informative: on this map where every cyan dot represents one commune, place names ending in “-y” are mostly found in the North-East and place names ending in “-ac” in the South-West.

The following map shows that a “w” in a place name indicates that it lies in the North-East, and that a “k” indicates that it lies in the North-East or Brittany.

Just for fun, here is a map of place names which include one of the major rivers: Loire, Seine, Rhône, Saône, Garonne and Marne. Unsurprisingly, they align almost perfectly with the rivers.

Students interested in the project can contact Céline, Pierre or me. People interested in the topic might enjoy Keith Briggs’s pages on English place names.

Team project at ENSAE, stats on paintings

Posted in Project by Julyan Arbel on 6 October 2010

The second year at ENSAE includes a year-long team project in applied statistics. The aim is to work on data, to build a model based on the statistics, econometrics or time series lectures, and to test a few hypotheses on it. As an example, last year’s projects, which I supervised with Marie Chanchole, Andrew Gelman and Nicolas Chopin, were

Fractional polynomials for modeling nonlinear regression functions, with a case study on same sex marriage state law in the US, and

Study of a speed dating experiment at Columbia University.

Any academic in stats who might be interested can contact me. Bloggers on Statisfaction are offering four projects, one of which is the following (echoing a series of posts here, here and here):

Statistical critique of an abstract painting

Supervised by Julyan Arbel (E04), Pierre Jacob (F14)
julyan.arbel@ensae.fr, pierre.jacob@ensae.fr
Statistics laboratory, CREST

PROBLEM STATEMENT
We propose to study certain works of modern or contemporary art, abstract paintings in particular, from a statistical angle. The starting point is a work by the German painter Gerhard Richter called 1024 colours. Other paintings by the same artist will be considered, and the study may be extended to other painters, for example Piet Mondrian.

In 1024 colours, each pixel is treated as a random variable. Its value is most commonly expressed in the RGB model, that is, as a triplet giving the levels of red, green and blue. Other color representation models exist (for example the HSV model), and the RGB model will be compared with them in terms of statistical results.

The descriptive study of the work will consist of looking for predominant colors, determining the principal axes of the colors in RGB space with a PCA, etc.
The main statistical motivation is to study the uniformity of the colors in the work. The artist says he chose the colors at random; the task is therefore to define this notion in the context of images and to test this hypothesis. We propose to test it in two stages. First, ignoring the spatial arrangement of the pixels: we want to test the uniformity of the 1024 colors in RGB space. Then, taking position into account, we will test for the presence of spatial autocorrelation. This last test will require a literature search.
The topic is deliberately left open to other lines of study at the group’s initiative.

DATA
The dataset consists of three 32×32 matrices, R, G and B, available here. The values of the three colors are expressed on a scale from 0 to 256.
The data were extracted with a script written in Python. Knowledge of this language is not a prerequisite for the topic, but it will be useful for extracting other datasets with the same script. The statistical analysis should preferably be programmed in R.

INDICATIVE BIBLIOGRAPHY
http://www.couleur.org
Spatial autocorrelation tests in R
