# Statisfaction

## From SVG to probability distributions [with R package]

Posted in R, Statistics by Pierre Jacob on 25 August 2013

Hey,

To illustrate complex probability density functions on continuous spaces, researchers tend to reuse the same few examples, for instance mixtures of Gaussian distributions or a banana-shaped distribution defined on $\mathbb{R}^2$ with density function:

$f(x,y) = \exp\left(-\frac{x^2}{200} - \frac{1}{2}(y+Bx^2-100B)^2\right)$

If we draw a sample from this distribution using MCMC we obtain a scatterplot like this one:

Fig. 1: a sample from the very lame banana shaped distribution
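For readers who want to reproduce this kind of figure, here is a minimal random-walk Metropolis-Hastings sketch targeting the density above (not the code behind Fig. 1; the value B = 0.1, the proposal scale and the chain length are all assumptions):

```r
# log-density of the banana-shaped target defined above (B = 0.1 assumed)
logf <- function(z, B = 0.1) -z[1]^2 / 200 - 0.5 * (z[2] + B * z[1]^2 - 100 * B)^2

set.seed(1)
niter <- 50000
chain <- matrix(NA, niter, 2)
z <- c(0, 0)
for (i in 1:niter) {
  prop <- z + rnorm(2, sd = 5)                          # random-walk proposal
  if (log(runif(1)) < logf(prop) - logf(z)) z <- prop   # Metropolis accept/reject
  chain[i, ] <- z
}
plot(chain, pch = 20, col = "gold", xlab = "x", ylab = "y")
```

A proposal standard deviation of 5 is a rough guess for a target whose x-marginal has standard deviation around 10; it is not tuned.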

Clearly it doesn’t really look like a banana, even if you colour the dots yellow as here. Actually it looks more like a boomerang, if anything. This worried me for a while, until I came up with a more realistic banana-shaped distribution:

Fig. 2: a sample from the realistic banana shaped distribution

See how well defined the shape is compared to the first figure? And there’s even the little tail, which proves so convenient when we want to peel the fruit. More generally, we might want to create target density functions based on arbitrary shapes. For this you can now try RShapeTarget, which you can install directly from R using devtools:

```r
library(devtools)
install_github(repo="RShapeTarget", username="pierrejacob")
```


The package parses SVG files representing shapes and creates target densities from them. More precisely, an SVG file contains “paths”, which are sequences of points (for instance the above banana is a single closed path). The associated log-density at any point $x$ is defined as $-\frac{1}{2\lambda} d(x, P)$, where $P$ is the path of the shape closest to $x$ and $d(x,P)$ is the distance between the point and the path. The parameter $\lambda$ specifies the rate at which the density decays as the point moves away from the shape. With this you can define the maple leaf distribution, as a tribute to JSM 2013:

Fig. 3: a sample from the “O Canada” probability distribution.
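To make the construction concrete, here is a small self-contained sketch of that distance-based log-density for a single path stored as a matrix of vertices. The function names are hypothetical illustrations, not the actual RShapeTarget internals:

```r
# distance from point x to the segment [a, b]
dist_point_segment <- function(x, a, b) {
  ab <- b - a
  t <- sum((x - a) * ab) / sum(ab * ab)
  t <- min(1, max(0, t))                 # clamp the projection onto the segment
  sqrt(sum((a + t * ab - x)^2))
}

# log-density -d(x, P) / (2 * lambda) for a polygonal path P
logd_shape <- function(x, path, lambda = 1) {
  d <- min(sapply(1:(nrow(path) - 1),
                  function(i) dist_point_segment(x, path[i, ], path[i + 1, ])))
  -d / (2 * lambda)
}

# toy closed path: a unit square
square <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))
logd_shape(c(0.5, 2), square)   # -0.5: distance 1 from the square, lambda = 1
```

Larger values of lambda flatten the decay, spreading the mass further from the shape.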

In the package you can get a distribution from an SVG file using the following code:

```r
library(RShapeTarget)
# create a target from an SVG file
my_shape_target <- create_target_from_shape(my_svg_file_name, lambda = 1)
# test the log-density function on 25 randomly generated points
my_shape_target$logd(matrix(rnorm(50), ncol = 2), my_shape_target$algo_parameters)
```


Since characters are just a bunch of paths, you can also define distributions based on words, for instance:

Hodor: Hodor.

which is done as follows (warning: only a-z and A-Z are allowed for now, no digits, spaces or punctuation):

```r
library(RShapeTarget)
word_target <- create_target_from_word("Hodor")
```


For words I defined the target density as before, except that it is constant on the letters: if a point is outside a letter, its density is computed from the distance to the nearest path; if it is inside a letter, its density is just constant, so that the letters are “filled” with some constant density. I thought it would look better.
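That “filled letters” rule can be sketched in a few lines, again with hypothetical helper names rather than the package’s actual internals: a ray-casting test decides whether the point is inside the closed path, in which case the log-density is a constant (0 here); otherwise it decays with the distance to the path:

```r
# even-odd ray-casting test; poly is a matrix of vertices, first row == last row
point_in_polygon <- function(x, poly) {
  n <- nrow(poly) - 1
  inside <- FALSE
  for (i in 1:n) {
    a <- poly[i, ]; b <- poly[i + 1, ]
    crosses <- (a[2] > x[2]) != (b[2] > x[2])
    if (crosses && x[1] < a[1] + (x[2] - a[2]) / (b[2] - a[2]) * (b[1] - a[1]))
      inside <- !inside
  }
  inside
}

logd_word <- function(x, poly, lambda = 1) {
  if (point_in_polygon(x, poly)) return(0)   # constant density inside the letter
  d <- min(sapply(1:(nrow(poly) - 1), function(i) {
    a <- poly[i, ]; b <- poly[i + 1, ]; ab <- b - a
    t <- min(1, max(0, sum((x - a) * ab) / sum(ab * ab)))
    sqrt(sum((a + t * ab - x)^2))            # distance to segment [a, b]
  }))
  -d / (2 * lambda)
}

sq <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))
logd_word(c(0.5, 0.5), sq)   # 0: inside the "letter"
logd_word(c(0.5, 2), sq)     # -0.5: outside, decays with distance
```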

Now I’m no longer worried about the banana-shaped distribution, but rather by the fact that the only word I could think of was “Hodor” (with whom you can chat over there).

## Using R in LaTeX with knitr and RStudio

Posted in Geek, LaTeX, R by Julyan Arbel on 28 February 2013

Hi,

I presented today at the INSEE R user group (FL$\tau$R) how to use knitr (the evolution of Sweave) to write $\LaTeX$ documents which are self-contained with respect to the source code: your data changed? No big deal, just compile your .Rnw file again [Ctrl+Shift+I] and you are done with an updated version of your paper. Some benefits with respect to keeping two separate .R and .tex files: everything is integrated in a single piece of software (RStudio), and you can refer to variables in your text with the \Sexpr{} command. Slow compilation is no longer a real issue, as one can set “cache=TRUE” in the chunk options so that unchanged chunks are not re-evaluated, which speeds things up.
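As a minimal illustration (the chunk label and variable names here are made up), an .Rnw file mixes $\LaTeX$ and R chunks like this:

```latex
\documentclass{article}
\begin{document}

% a cached R chunk: re-evaluated only when its code changes
<<summary-stats, cache=TRUE, echo=FALSE>>=
x <- rnorm(100)
m <- mean(x)
@

The sample mean of our data is \Sexpr{round(m, 2)}.

\end{document}
```

Compiling this file with knitr runs the chunk and substitutes the value of \Sexpr{round(m, 2)} directly into the text.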

I share the (brief) slides below. They won’t be of much help to those who already use knitr, but they give the first steps for those who would like to give it a try.

## Next R meeting in Paris INSEE: ggplot2 and parallel computing

Posted in R by Julyan Arbel on 12 June 2012

Hi,
Our group of R users from INSEE, aka FL$\tau$R, meets monthly in Paris. The next meeting is on Wed 13 (tomorrow), 1-2 pm, room 539 (an ID is needed to come in, map to access INSEE $\tau$R), about ggplot2 and parallel computing. Since the first meeting in February, presentations have covered hot topics like web scraping, C in R, RStudio, SQLite databases and cartography (most of them in French). See you there!

## Daily casualties in Syria

Posted in Dataset, R by Julyan Arbel on 9 February 2012

Every new day brings its statistics of new deaths in Syria… Here is an attempt to learn about the Syrian uprising from the figures. Data vary among sources: the Syrian opposition provides the number of casualties by day (here on Dropbox), updated on 8 February 2012, with a total exceeding 8,000.

We note first that the attacks are accelerating, as the cumulative graph is mostly convex (click to enlarge):

Plotting the numbers by day shows the bloody toll of Fridays, a gathering day in the Muslim calendar. This was especially true at the beginning of the uprising, but lately any other day can be equally deadly:

On average there are almost twice as many deaths on Fridays as on any other day. Here are boxplots of the logarithm of daily casualties by day of the week:

and their density estimates, first coloured by day of the week, then by Friday vs rest of the week:

Here is the code (with some clumsy parts in shaping the data frames for ggplot; do not hesitate to comment on it):

```r
library(ggplot2)
input=read.csv("http://dl.dropbox.com/u/1391912/Blog%20statisfaction/data/syria.txt",
               sep="\t",header=TRUE,stringsAsFactors=FALSE)
input$LogicalFriday=factor(input$WeekDay=="Friday",levels=c(FALSE,TRUE),
               labels=c("Not Friday","Friday"))
input$Date=as.Date(input$History,"%d/%m/%Y")
input$WeekDays=factor(input$WeekDay,
               levels=unique(as.character(input$WeekDay[7:13]))) # trick to sort the legend
qplot(x=Date,y=cumsum(Number),data=input,geom="line",color=I("red"),xlab="",ylab="",lwd=I(1))
qplot(x=as.factor(Date),y=Number,data=input,geom="bar",fill=LogicalFriday,xlab="",ylab="")
qplot(log(Number+1),data=input,geom="density",fill=LogicalFriday,xlab="",ylab="",alpha=I(.2))
qplot(log(Number+1),data=input,geom="density",fill=WeekDay,xlab="",ylab="",alpha=I(.2))
qplot(WeekDays,log(Number+1),data=input,geom="boxplot",xlab="",ylab="",colour=WeekDays)
```

## Coming R meetings in Paris

Posted in R, Seminar/Conference by Julyan Arbel on 4 February 2012

If you live in Paris and are interested in R, there will be two meetings for you this week. First a Semin-R session, organized at the Muséum National d’Histoire Naturelle on Tuesday 7 Feb (too bad, the Museum is closed on Tuesdays). Presentations will be about colors, phylogenies and maps, while I will speak about (my beloved) RStudio. The slides of previous sessions can be found here (most of them are in French).

The following day, 8 Feb, a group of R users from INSEE will have its first meeting (13-14h, INSEE, room R12), about SQLite data in R, maps, and $\LaTeX$ in R. I guess anyone can join!

UPDATE: Here is a colorful map to access INSEE $\tau$R. Come with an ID, and say you are visiting the meeting organizer Matthieu Cornec. Room R12 is on the ground floor (left).

## Psycho dice and Monte Carlo

Posted in R, Statistics by Julyan Arbel on 16 December 2011

Following Pierre’s post on psycho dice, I want here to see by what average margin repeated plays would have to deviate before being called influenced by mind will. The rules are the following (excerpt from the novel Midnight in the Garden of Good and Evil, by John Berendt):

You take four dice and call out four numbers between one and six, for example a four, a three, and two sixes.
Then you throw the dice, and if any of your numbers come up, you leave those dice standing on the board. You continue to roll the remaining dice until all the dice are sitting on the board, showing your set of numbers. You’re eliminated if you roll three times in succession without getting any of the numbers you need. The object is to get all four numbers in the fewest rolls.

Simplify the game by forgetting the elimination step. Suppose first one plays with a single fair die with $1/p$ faces. The probability that it shows the right face is $p$ (for somebody with no psy power). Denote by $X$ the time to first success with one die, which, the rolls being independent, follows a geometric distribution Geom($p$) (with the starting-at-1 convention). $X$ has the following probability mass and cumulative distribution functions, with $q=1-p$:

$f_X(k)=pq^{k-1},\quad F_X(k)=1-q^k.$

Now denote by $Y$ the time to success in the game with $n$ dice. Rolling $n$ dice simultaneously is the same as playing $n$ independent games with one die and taking $Y$ as the maximum of the $n$ times to success. So $Y$’s cdf is

$F_Y(k)=F_X(k)^n=(1-q^k)^n.$

Its pmf can be obtained either exactly by differencing, or up to a normalizing constant $C$ by differentiation:

$f_Y(k)=Cq^k(1-q^k)^{n-1}.$

As this is not too far from the Geom($p$) pmf, one can use the latter as the proposal in a Monte Carlo estimate. If the $X_i$ are $N$ independent Geom($p$) variables, then

$E(Y) \approx \frac{\sum_i X_i(1-q^{X_i})^{n-1}}{\sum_i (1-q^{X_i})^{n-1}} \quad\text{and}\quad E(Y^2) \approx \frac{\sum_i X_i^2(1-q^{X_i})^{n-1}}{\sum_i (1-q^{X_i})^{n-1}}.$

The following R lines produce the estimates $\mu_Y=E(Y) = 11.4$ and $\sigma_Y=\mathrm{sd}(Y) = 6.5$.
```r
p=1/6
q=1-p
n=4
rgeom1=function(n,p){rgeom(n,p)+1}   # geometric distribution starting at 1
h=function(x){(1-q^x)^(n-1)}         # importance weight
N=10^6
X=rgeom1(N,p)
(C=1/mean(h(X)))
(m1_Y=C*mean(X*h(X)))
(m2_Y=C*mean(X^2*h(X)))
(sd_Y=sqrt(m2_Y-m1_Y^2))
```

Now it is possible to use a test (from classical test theory) to estimate the average margin by which repeated games should deviate in order to give statistical evidence of psy power. We are interested in testing $H_0\,:\,E(Y)=\mu_Y$ against $H_1\,:\,E(Y)<\mu_Y$ for repeated plays. If the game is played $k$ times, then one rejects $H_0$ if the sample mean $\bar{Y}$ is less than $\mu_Y -\frac{\sigma_Y}{\sqrt{k}}q_{.95}$, where $q_{.95}$ is the 95% standard normal quantile. To indicate the presence of a psy power, someone playing $k=20$ times should thus finish about 2 rolls below the predicted value $\mu_Y= 11.4$ on average (about 1 roll below if playing $k=80$ times). I can’t wait, I’m going to grab some dice!

## Create maps with maptools R package

Posted in Dataset, R by Julyan Arbel on 13 December 2011

Baptiste Coulmont explains on his blog how to use the R package maptools. It is based on shapefiles, for example the ones offered by the French geography agency IGN (at the départements and communes levels). Some additional material like roads and railways is provided by the OpenStreetMap project, here. For the above map, you need to download and unzip the files departements.shp.zip and ile-de-france.shp.zip. The red dots correspond to points of interest (longitude/latitude), here churches, stored in a vector eglises (use e.g. this to geolocate places of interest).
Then run this code from Baptiste’s tutorial:

```r
library(maptools)
france <- readShapeSpatial("departements.shp", proj4string=CRS("+proj=longlat"))
routesidf <- readShapeLines("ile-de-france.shp/roads.shp", proj4string=CRS("+proj=longlat"))
trainsidf <- readShapeLines("ile-de-france.shp/railways.shp", proj4string=CRS("+proj=longlat"))
plot(france,xlim=c(2.2,2.4),ylim=c(48.75,48.95),lwd=2)
plot(routesidf[routesidf$type=="secondary",],add=TRUE,lwd=2,col="lightgray")
plot(routesidf[routesidf$type=="primary",],add=TRUE,lwd=2,col="lightgray")
plot(trainsidf[trainsidf$type=="rail",],add=TRUE,lwd=1,col="burlywood3")
points(eglises$lon,eglises$lat,pch=20,col="red")
```


## Power-laws: choose your x and y variables carefully

Posted in R, Sport by Julyan Arbel on 16 November 2011

This is a follow-up to the post Power of running world records.

As suggested by Andrew, plotting running world records could benefit from a change of variables. More exactly, the use of different variables sheds light on a [now] well-known [to me] sports result provided in a 2000 Nature paper by Sandra Savaglio and Vincenzo Carbone (thanks Ken): the dependence between time and distance in log-log scale is not linear over the whole range of races, but piecewise linear. There is one break-point around time 2’33’’ (or equivalently distance around 1100 m). As mentioned in the article, this threshold corresponds to a critical physiological change in the athlete’s energy expenditure: in short races (less than 1000 m) the effort follows an anaerobic metabolism, whereas it switches to an aerobic metabolism for middle and long distances (or longer…). Interestingly, energy is consumed more efficiently in the second regime than in the first: the decay in speed slows down for endurance races.

The reason for this graphical difference is simple. Denote distance, time and speed by D, T and S. I had plotted the log T ~ log D relation, which gave $T\propto D^{\alpha}$ with $\alpha=1.11$. When using the speed S as one of the variables, the relations are $S\propto D^{\gamma}$ and $S\propto T^{\beta}$ with $\gamma=1-\alpha$ and $\beta=\frac{1}{\alpha}-1\approx 1-\alpha$ to first order, because $\alpha$ is close to 1. With the Nature paper findings (with the opposite sign convention), the two $\beta$s are $\beta_{\text{an}}=-0.165$ (anaerobic) and $\beta_{\text{ac}}=-0.072$ (aerobic), i.e. $\alpha_{\text{an}}=1.20$ and $\alpha_{\text{ac}}=1.08$. My improper $\alpha=1.11$ is indeed in between. The ratio of the two slopes is much larger on a plot involving speed (larger than 2, clearly showing the two regimes) than on my original plot (a few tens of percent), which is why the latter appears almost linear (although in hindsight, and with good goggles, two lines might have been detected).
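These exponent conversions are one-liners to check numerically:

```r
alpha <- 1.11                      # slope from my log T ~ log D fit
c(gamma = 1 - alpha, beta = 1/alpha - 1)     # both close to -0.1

# recovering alpha from the Nature paper's two slopes (opposite sign convention)
beta_an <- -0.165                  # anaerobic regime
beta_ac <- -0.072                  # aerobic regime
c(alpha_an = 1/(1 + beta_an), alpha_ac = 1/(1 + beta_ac))   # about 1.20 and 1.08
```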

Below is the S ~ log D relation (click to enlarge), on which it appears clearly that the 100 m and 100 km races are two outliers. It takes time to erase the loss of time due to the start of the race (100 m and 200 m are run at the same speed…), whereas the 100 km suffers from a lack of interest among athletes. Achim Zeileis also provides an extended world records table and R code in his comment.

As an aside, Andrew and Cosma Shalizi also comment and resolve an ambiguity of mine: one usually speaks about power-laws without much precision of context, but there are mainly two separate sets of power-law models. Either power-law regressions, where you plot y ~ x for two different variables (this is the case here); or power-law distributions, i.e. the probability distribution of a single variable x is $p(x)\propto x^{-a}$, or extensions thereof (with lots of natural examples, ranging from the size of cities to the number of deaths in attacks in wars).

## PAWL package on CRAN

Posted in Geek, R by Pierre Jacob on 26 October 2011

The PAWL package (which I talked about there, and which implements the parallel adaptive Wang-Landau algorithm and adaptive Metropolis-Hastings for comparison) is now on CRAN!

http://cran.r-project.org/web/packages/PAWL/index.html

which means that within R you can easily install it by typing

```r
install.packages("PAWL")
```

Isn’t that amazing? It’s just amazing. Kudos to the CRAN team for their quickness and their help.

## Random art on the web

Posted in Art, R by Julyan Arbel on 15 October 2011

Since we explored some statistics of an abstract painting with Pierre (we even have an article in the last issue of Variances!), I have become more sensitive to art linked to randomness. Here are some pointers to related websites I have dug out.

Random.org, mentioned here by Pierre, is, as it reads, a true random number service that generates randomness via atmospheric noise. The American writer Eric Hoffer is quoted there about creativity:

Creativity is the ability to introduce order into the randomness of nature.

There you will find contributed pages from users of the service about varied forms of art, like pages which generate Samuel Beckett-like prose, or jazz scales. In visual arts, you can find for example the Bryce girl 1, a fractal landscape of Bryce Canyon by Fuller Thompson (with an extra sexy girl by the way); and nice pastel Richter-like pictures by Dave Nelson (to be compared with an excerpt of Richter’s 1024 colors):

Random-art.org is a program which randomly generates a picture from a given character seed. You can visit the gallery, or make your own. Sadly, “Bayes” generates an ugly flashy pic:

Day to day data gathers together artists who collect, list, database and absurdly analyse the data of everyday life. You can find there links to artists like Abigail Reynolds and her Mount Fear map of crimes in London, and many others.

R users have produced great outputs too. Interestingly, the two following graphs feel like 3D, although they are only made up of lines and curves. Paul Butler’s visualization of Facebook connections (with a bit of post-processing):

and Christophe Ladroue’s representation of individual rankings across lags in a triathlon:

Check the R Graph Gallery if you want to enhance your data visualization in R!