Statisfaction

How to run a half-marathon or a 20 km race?

Posted in French, Sport by Jérôme Lê on 28 October 2011

On Sunday 9 October 2011, I ran the 20km de Paris. On the start line, as we waited in the cold and the rain, a young and dynamic “personal coach” gave us a few tips to follow during the race. The main one was to start slowly and speed up towards the end. For people running their first race, the advice is probably justified, to avoid any trouble or dropping out. But is it really the optimal strategy?


PAWL package on CRAN

Posted in Geek, R by Pierre Jacob on 26 October 2011

The PAWL package (which I talked about there, and which implements the parallel adaptive Wang-Landau algorithm and adaptive Metropolis-Hastings for comparison) is now on CRAN!

http://cran.r-project.org/web/packages/PAWL/index.html

which means that within R you can easily install it by typing

install.packages("PAWL")

Isn’t that amazing? It’s just amazing. Kudos to the CRAN team for their quickness and their help.

World Statistics Day

Posted in General by Julyan Arbel on 21 October 2011

As recommended by the United Nations Statistics Commission, a World Statistics Day is observed every 5 years. The first one occurred last year, so there is no stats day to celebrate this year. Nevertheless, cheers to all of you readers: recent activity (a post by Robin about his article with Pierre, adverts by Xian, and a reference in a statistics blogs list) made it the busiest day ever for the 1+ year old Statisfaction:

The Wang-Landau algorithm reaches the flat histogram in finite time.

Posted in Statistics by Robin Ryder on 20 October 2011

Cross-posted from my personal blog.

MCMC practitioners may be familiar with the Wang-Landau algorithm, which is widely used in Physics. This algorithm divides the sample space into “boxes”. Given a target distribution, the algorithm then samples proportionally to the target in each box, while aiming at spending a pre-defined proportion of the sample in each box. (Usually these predefined proportions are uniform.)

This strategy can help the sampler move faster between the modes of a distribution, by forcing it to visit the space between modes often.

The most sophisticated versions of this algorithm combine a decreasing stochastic schedule and the so-called flat histogram criterion: whenever the proportions of the sample in each box are close enough to the desired frequencies, the stochastic schedule decreases. A decreasing schedule is necessary for diminishing adaptation to hold.

Until now, it was unknown whether the flat histogram is necessarily reached in finite time, and hence whether the schedule ever starts decreasing.

Pierre and I just submitted and arXived a proof that the flat histogram is reached in finite time under some conditions, and may never be reached in other cases.
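To make the mechanics described above concrete, here is a minimal sketch of a Wang-Landau sampler with the flat histogram criterion. It is not the PAWL implementation: the bimodal target, the three boxes, the tolerance `tol` and the schedule `gamma = 1 / k` are all assumptions chosen for illustration.

```r
# Toy bimodal target: equal mixture of N(-4, 1) and N(4, 1) (assumed example).
log_target <- function(x) log(0.5 * dnorm(x, -4, 1) + 0.5 * dnorm(x, 4, 1))

# Boxes: a partition of the real line into 3 intervals (assumed example).
breaks <- c(-Inf, -2, 2, Inf)
box_of <- function(x) findInterval(x, breaks)   # box index: 1, 2 or 3

wang_landau <- function(n_iter, tol = 0.2, prop_sd = 1) {
  d <- length(breaks) - 1
  phi <- rep(1 / d, d)        # desired proportion of the sample in each box
  log_theta <- rep(0, d)      # log penalties, one per box
  counts <- rep(0, d)         # histogram since the last flat histogram
  k <- 1                      # index of the stochastic schedule
  n_flat <- 0                 # number of times the criterion was met
  x <- -4
  chain <- numeric(n_iter)
  for (t in seq_len(n_iter)) {
    # Metropolis-Hastings move targeting pi(x) / theta[box of x]
    y <- x + rnorm(1, 0, prop_sd)
    log_ratio <- (log_target(y) - log_theta[box_of(y)]) -
                 (log_target(x) - log_theta[box_of(x)])
    if (log(runif(1)) < log_ratio) x <- y
    j <- box_of(x)
    counts[j] <- counts[j] + 1
    # Penalise the current box, with step size gamma = 1 / k
    log_theta <- log_theta + (1 / k) * ((seq_len(d) == j) - phi)
    # Flat histogram: every box proportion is close to its desired frequency
    if (max(abs(counts / sum(counts) - phi)) < tol / d) {
      k <- k + 1              # the schedule decreases only at this point
      n_flat <- n_flat + 1
      counts <- rep(0, d)
    }
    chain[t] <- x
  }
  list(chain = chain, log_theta = log_theta, n_flat = n_flat)
}

set.seed(1)
out <- wang_landau(20000)
print(table(box_of(out$chain)) / length(out$chain))  # share of the sample per box
print(out$n_flat)                                    # times the criterion was met
```

If `n_flat` stayed at zero, the step size would never decrease, which is exactly the situation the finite-time question is about.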

Random art on the web

Posted in Art, R by Julyan Arbel on 15 October 2011

Since we explored some statistics of an abstract painting with Pierre (we even have an article in the last issue of Variances!), I have become more sensitive to art linked to randomness. Here are some pointers to related websites I have dug up.

Random.org, mentioned here by Pierre, is, as it reads, a true random number service that generates randomness via atmospheric noise. The American writer Eric Hoffer is quoted about creativity:

Creativity is the ability to introduce order into the randomness of nature.

There you will find pages contributed by users of the service on various art forms, like pages which generate Samuel Beckett-like prose, or Jazz Scales. In visual arts, you can find for example the Bryce girl 1, a fractal landscape of Bryce Canyon by Fuller Thompson (with an extra sexy girl by the way); and nice pastel Richter-like pictures by Dave Nelson (to be compared with an excerpt of Richter’s 1024 colors):

Random-art.org is a program which randomly generates a picture with a given character seed. You can visit the gallery, or make your own. Sadly, “Bayes” generates an ugly flashy pic:

Day to day data gathers together artists who collect, list, database and absurdly analyse the data of everyday life. There you can find links to artists like Abigail Reynolds and her Mount Fear of crimes in London, and many others.

R users produced great outputs too. Interestingly, the two following graphs feel like 3D, although only made up of lines and curves. Paul Butler’s visualization of Facebook connections (with a bit of post processing):

and Christophe Ladroue’s representation of individual rankings across legs in a triathlon:

Check the R Graph Gallery if you want to enhance your data visualization in R!

Triathlon data with ggplot2

Posted in Dataset, Sport by Julyan Arbel on 11 October 2011

As Jérôme and I like playing with triathlon data so much, it is a pleasure to see that we are not alone. Christophe Ladroue, from the University of Bristol, wrote this post yesterday: An exercise in plyr and ggplot2 using triathlon results, followed by part II, way better than ours, here and here. For example, the time distributions by age, “faceted” by discipline (swim, cycle, run and total), look like this

As the number of participants in the Stratford triathlon (400 or so) is a bit small for this number of age categories, it would be nice to compare with the Paris triathlon results (about 4000).

Here is the rank for the 3 disciplines and for the total time, “colored” by the final quartile (check the full part II post for colors by quartile in the 3 disciplines):

We see that the rank at the swim leg is not very informative about the final performance, all 4 colors being pretty mixed at that stage, and that things are tidied up by the cycle leg. It is the longest one and, as such, the most predictive of the final performance. It is nice to see that some of the poor swimmers finally reach the first quartile (in orange). Check out the ones with sawtooth patterns: first quartile at swimming, last at cycling, first at running, and last at the end!
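For readers who want to try this kind of plot, here is a rough sketch of how ranks per leg and final quartiles can be computed. It uses simulated times rather than the Stratford results, and base graphics instead of ggplot2 to avoid dependencies; the means and standard deviations of the leg times are assumptions.

```r
set.seed(42)
n <- 400                                  # about the size of the Stratford field
# Simulated leg times in minutes (NOT real race data).
times <- data.frame(swim  = rnorm(n, 10, 2),
                    cycle = rnorm(n, 45, 6),
                    run   = rnorm(n, 25, 4))
times$total <- times$swim + times$cycle + times$run

# Rank within each discipline and on the total time (1 = fastest).
ranks <- sapply(times, rank, ties.method = "first")

# Final quartile, from the rank on the total time.
quartile <- cut(ranks[, "total"], breaks = c(0, n / 4, n / 2, 3 * n / 4, n),
                labels = FALSE)

# One line per athlete across the four stages, colored by final quartile.
matplot(t(ranks), type = "l", lty = 1, col = quartile,
        xaxt = "n", xlab = "", ylab = "rank")
axis(1, at = 1:4, labels = colnames(ranks))

# How predictive is each leg of the final rank?
rank_cor <- cor(ranks[, c("swim", "cycle", "run")], ranks[, "total"],
                method = "spearman")
print(rank_cor)
```

In this simulation the cycle leg is the most predictive simply because it was given the largest spread of times, which mirrors the argument about the longest leg.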

An interesting thing to do with this kind of sports database would be to build panel data. As most race websites provide historical data with the participants’ names and ages, identification is possible. This is the case for Ipitos, or for the Paris 20 km race, with data from 2004 to 2010 (and soon 2011). It remains to check whether enough people compete in all the races in a row; my guess is that the answer is yes. The next steps would be to study the impact of age on progress, and on the way one manages the effort from the beginning to the end of the race (thanks to intermediate times in running races, or discipline times in triathlon). Well, maybe in a later post.
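As a rough idea of what building such a panel could look like, here is a small sketch that matches runners across two editions by name and birth year. The rows and the column names (`name`, `birth_year`, `time_min`) are made up for illustration; real race files would need cleaning first (spelling variants, homonyms).

```r
# Made-up results for two editions of the same race (illustration only).
res_2010 <- data.frame(name = c("runner_a", "runner_b", "runner_c"),
                       birth_year = c(1975, 1982, 1990),
                       time_min = c(95.2, 101.5, 88.0))
res_2011 <- data.frame(name = c("runner_a", "runner_c", "runner_d"),
                       birth_year = c(1975, 1990, 1968),
                       time_min = c(93.8, 89.1, 110.3))

# Identify runners by (name, birth year) and keep those present in both years.
panel <- merge(res_2010, res_2011, by = c("name", "birth_year"),
               suffixes = c("_2010", "_2011"))

# Progress from one year to the next (positive = faster in 2011).
panel$progress_min <- panel$time_min_2010 - panel$time_min_2011
print(panel)
```

With more editions, the same merge can be repeated year by year, or the files can be stacked in long format with a `year` column.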


Artist view of crimes in London

Posted in Art, R by Julyan Arbel on 10 October 2011

At first sight, one could think this picture is a scale model of some narrow mountains, like Bryce Canyon… Actually it represents crimes in East London: it is a cardboard artwork by the London artist Abigail Reynolds, called Mount Fear. Here is what can be read on the artist’s webpage:

The terrain of Mount Fear is generated by data sets relating to the frequency and position of urban crimes. Precise statistics are provided by the police. Each individual incident adds to the height of the model, forming a mountainous terrain.

All Mount Fear models are built on the same principals. The imaginative fantasy space seemingly proposed by the sculpture is subverted by the hard facts and logic of the criteria that shape it. The object does not describe an ideal other-worldly space separated from lived reality, but conversely describes in relentless detail the actuality of life on the city streets.

There is no mention of the statistical method used (kernel, Dirichlet process density estimation?). Some crime data can be found on UNdata for example, or here for an interactive map. It reminds me of a great piece of work by David Kahle about crime in Houston, combining ggplot2 and Google Maps. He won a ggplot2 case study competition for this. His code is available here. I particularly like the contour plot, with cool rainbow colors, where both the crime level and the map background are clearly visible.
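Since the artist does not say which method was used, here is only one possibility: a two-dimensional kernel density estimate turned into a “mountain” surface. The incident coordinates below are simulated, and the use of `kde2d` from the MASS package is an assumption for illustration, not the artist’s or David Kahle’s method.

```r
library(MASS)   # provides kde2d, a 2D kernel density estimator

set.seed(7)
# Simulated incident locations around two hotspots (NOT real crime data).
x <- c(rnorm(300, 0, 1), rnorm(200, 3, 0.5))
y <- c(rnorm(300, 0, 1), rnorm(200, 2, 0.5))

# Kernel density estimate on a 50 x 50 grid: each incident adds to the height.
dens <- kde2d(x, y, n = 50)

# Perspective view of the resulting terrain, and the matching contour plot.
persp(dens$x, dens$y, dens$z, theta = 30, phi = 30,
      xlab = "x", ylab = "y", zlab = "density")
contour(dens$x, dens$y, dens$z)
```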


Calling Google Maps API from R

Posted in Geek, R by Pierre Jacob on 5 October 2011

Hi,

Related to Julyan’s previous post, I want to share an easy way to access the Google Maps API through R. And then we’ll stop talking about Google, otherwise it’ll look like we’re just looking for jobs.

My problem was the following: I have a database (from priceofweed.com), with locations written as “city, region, country”. What I wanted was the precise location (latitude, longitude) for each city. After some browsing it’s possible to grab a list of cities for each country from some local geographical institute and merge that with the database. The problem is that for each country the database is often in a different format, and full of unnecessary information for the problem at hand (and hence unnecessarily large). For example the information for the US is there somewhere (and it’s amazingly detailed by the way), whereas for other countries it’s there.

So instead, a “lazy” method consists in calling Google Maps to find the location of each city. Since Google Maps has pretty good worldwide coverage of geographic names, it should work! The R function is described there, and I copy-paste it here:

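library(XML)

# Extract the text of every node matching an XPath expression.
getDocNodeVal=function(doc, path)
{
   sapply(getNodeSet(doc, path), function(el) xmlValue(el))
}

# Geocode an address string with the Google Maps API (endpoint as of 2011).
gGeoCode=function(str)
{
  # Encode spaces before building the URL.
  str=gsub(' ','%20',str)
  # paste0 avoids the space that paste() would insert before the address.
  u=paste0('http://maps.google.com/maps/api/geocode/xml?sensor=false&address=',str)
  doc = xmlTreeParse(u, useInternalNodes=TRUE)
  lat=getDocNodeVal(doc, "/GeocodeResponse/result/geometry/location/lat")
  lng=getDocNodeVal(doc, "/GeocodeResponse/result/geometry/location/lng")
  list(lat = lat, lng = lng)
}

gGeoCode("Malakoff, France")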


There are limitations though: it’s free up to 2,500 requests per day and then you’re kicked out for 24 hours. Otherwise… you have to pay! See the terms here. Pretty convenient though!
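One way to stay under the quota mentioned above is to count requests and to cache answers, since a database of prices typically repeats the same cities many times. The wrapper below is a sketch under assumptions: `make_limited_geocoder`, `max_per_day` and `pause` are names introduced here, the counter only lives for the current R session, and a stand-in geocoder is used so that the example runs offline.

```r
# Wrap any geocoding function so that it caches results and stops at a quota.
make_limited_geocoder <- function(geocode_fun, max_per_day = 2500, pause = 0.1) {
  n_calls <- 0       # requests actually sent during this session
  cache <- list()    # address -> result, to avoid repeated requests
  function(address) {
    if (!is.null(cache[[address]])) return(cache[[address]])  # no request needed
    if (n_calls >= max_per_day) stop("daily quota reached; try again tomorrow")
    n_calls <<- n_calls + 1
    Sys.sleep(pause)                     # be gentle with the server
    result <- geocode_fun(address)
    cache[[address]] <<- result
    result
  }
}

# Stand-in geocoder so the example runs offline; swap in a real one for lookups.
fake_geocode <- function(address) list(lat = NA, lng = NA)

lookup <- make_limited_geocoder(fake_geocode, max_per_day = 2500, pause = 0)
cities <- c("Malakoff, France", "Paris, France", "Malakoff, France")
coords <- lapply(cities, lookup)   # the repeated city is served from the cache
```

Geocoding only the unique values of the location column, for instance with `unique()`, reduces the number of requests in the same way.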

EDIT: a more detailed post about Google GeoCoding, and the use of it on Missouri Sex Offender Registry:
http://www.franklincenterhq.org/2541/geocoding-addresses-from-missouri-sex-offender-registry/
