Statisfaction

Using R in LaTeX with knitr and RStudio

Posted in Geek, LaTeX, R by Julyan Arbel on 28 February 2013

Hi,

I presented today at INSEE R user group (FL\tauR) how to use knitr (Sweave evolution) for writing \LaTeX documents which are self contained with respect to the source code: your data changed? No big deal, just compile your .Rnw file again and you are done with an updated version of your paper![Ctrl+Shift+I] is easy. Some benefits with respect to having two separate .R and .tex files: it is integrated in a single software (RStudio), you can call variables in your text with the \Sexpr{} command. The slow speed at compilation is no more a real matter as one can put “cache=TRUE” in code chunk options not to reevaluate unchanged chunks, which fastens things.

I share the (brief) slides below. They won’t help much those who already use knitr, but they give the first steps for those who would like to give it a try.

Dropbox Space Race

Posted in Geek, General by Julyan Arbel on 1 December 2012

https://www.dropbox.com/static/images/spacerace2012/rocket-splash.jpg

Hi,

additionally to the referral program (you refer a new user, you win an extra .5 Go), the Dropbox Space Race will give you 3 Go extra space (for 2 years) if you register with your email from a competing university. The best schools will get more space. Here are the 100 top schools. Com’ on, there is no french school in the 100 top !

Thanks Nicolas for the info.

Tagged with: , ,

Next R meeting in Paris INSEE: ggplot2 and parallel computing

Posted in R by Julyan Arbel on 12 June 2012

Hi,
our group of R users from INSEE, aka FL\tauR, meets monthly in Paris. Next meeting is on Wed 13 (tomorrow), 1-2 pm, room 539 (an ID is needed to come in,  map to access INSEE \tauR), about ggplot2 and parallel computing. Since the first meeting in February, presentations have included hot topics like webscrapping, C in R, RStudio, SQLite databases or cartography (most of them in French). See you there!

Priors on probability measures

Posted in Seminar/Conference, Statistics by Julyan Arbel on 24 April 2012

Hi,

for the next GTB meeting at Crest, 3rd May, I will present Peter Orbanz‘ work on Projective limit random probabilities on Polish spaces. It will follow my previous presentation about Bayesian nonparametrics on the Dirichlet process.

The article provides a means of constructing any arbitrary prior distribution on the set of probability measures by working on its finite-dimensional marginals. The vanilla example is the Dirichlet process, which is characterized by its Dirichlet distribution marginals on any finite partition of the space (other examples are the Normalized Inverse Gaussian Process and the Pòlya Tree). The figure above illustrates the projective property of the marginals.

Peter will speak at ISBA 2012 Kyoto session : On the uses of random probabilities in Bayesian inference, along with Ramses Mena and Antonio Lijoi. I’ll write more about that later on!

A world without referees

Posted in General by Julyan Arbel on 10 April 2012

 

In an invited contribution to the last ISBA Bulletin, Larry Wasserman discusses the  “almost 350 years old” peer review system (paper). Have a look on it, it’s quite thought provoking!

We should think about our field like a marketplace of ideas. Everyone should be free to put their ideas out there. There is no need for referees. Good ideas will get recognized, used and cited. Bad ideas will be ignored. This process will be imperfect. But is it really better to have two or three people decide the fate of your work?

A world where you put your work on arXiv or on your web page, where you save so much time, isn’t it tempting?

Rochebrune Workshop 2012

Posted in General, Seminar/Conference by Julyan Arbel on 9 April 2012

Hey,
Last week I attended Rochebrune workshop for the second time. The genius organizers’ idea (Liliane Bel and Eric Parent from AgroParisTech, Jean-Jacques Borreux from Liège University) is to mix ski, stats and spirits (mostly Genepi and Chartreuse) around a remote alpine chalet on top of Megève ski resort.

Most of the attendees are (young) Bayesians working in applied fields, ranging from biology, ecology and epidemiology, to meteorology and climatology. We had great talks about fishes, trees, birds (Joël’s busard cendré), drugs and avalanches. More methodological talks dealt with extremes, Bayesian model averaging, and simulations: variational approximations, INLA, ABC, and MCMC in general. We had a tutorial about JAGS and WinBugs/OpenBugs as well (and how to interface them with R using rjags and R2WinBUGS). I presented my work about Multidimensional covariate dependent Dirichlet processes (all presentations here).

In addition to the 10 talks per day, a sacrosanct 5-hour skiing slot was reserved in the afternoon, with lessons from crazy Mégevan instructors. They must be really good: Pierre, don’t be afraid, I jumped and felt significantly less than two years ago. Have a Chartreuse, cheers!

Valentine Day and lonely people in France

Posted in Dataset, General by Julyan Arbel on 14 February 2012

Insee published recently a paper (in French), well in line with the Valentine Day, which characterizes people living alone or in couple by socio-professional category, along with the data.

Between 1990 and 2008 (two population surveys), the proportion of people living alone mostly increased for people under 60. After 60, 38% of women live alone, for only 17% of men,  because women are married to older men, and live longer than them, in average. See that proportion by age:

Spatially, there is a kind of South/North opposition. During the working life, lonely people live in the South (left), while lonely retired people live in the North (right), with an exception for Île-de-France (Paris) with a high proportion whatever the age:

Daily casualties in Syria

Posted in Dataset, R by Julyan Arbel on 9 February 2012

Every new day brings its statistics of new deaths in Syria… Here is an attempt to learn about the Syrian uprising by the figures. Data vary among sources: the Syrian opposition provides the number of casualties by day (here on Dropbox), updated on 8 February 2012, with a total exceeding 8 000.

We note first that the attacks accelerate, as the cumulated graph is mostly convex (click to enlarge):

Plotting the numbers by day shows the bloody situation of  Fridays, a gathering day in the Muslin calendar. This point was especially true at the beginning of the uprising, but lately any other day can be equally deadly:

There are almost twice as much deaths on Fridays as any other day in average:
Here are boxplots for the logarithm of daily casualties by day of the week:

and their density estimates, first coloured by day of the week, then by Friday vs rest of the week:

Here is the code (with clumsy parts for fitting the data frames for ggplot, do not hesitate to comment on it)

library(ggplot2)
input=read.csv("http://dl.dropbox.com/u/1391912/Blog%20statisfaction/data/syria.txt",
               sep="\t",header=TRUE,stringsAsFactors=FALSE)
input$LogicalFriday=factor(input$WeekDay =="Friday",levels = c(FALSE, TRUE),
                           labels = c("Not Friday", "Friday"))
input$Date=as.Date(input$History,"%d/%m/%Y")
input$WeekDays=factor(input$WeekDay,
                      levels=unique(as.character(input$WeekDay[7:13]))) # trick to sort the legend
qplot(x=Date,y=cumsum(Number), data=input, geom="line",color=I("red"),xlab="",ylab="",lwd=I(1))
qplot(x=as.factor(Date),y=Number, data=input, geom="bar",fill=LogicalFriday,xlab="",ylab="")
qplot(log(Number+1), data=input, geom="density",fill=LogicalFriday,xlab="",ylab="",alpha=I(.2))
qplot(log(Number+1), data=input, geom="density",fill=WeekDay,xlab="",ylab="",alpha=I(.2))
qplot(WeekDays,log(Number+1),data=input,geom="boxplot",xlab="",ylab="",colour=WeekDays)

Created by Pretty R at inside-R.org

Tagged with: ,

Coming R meetings in Paris

Posted in R, Seminar/Conference by Julyan Arbel on 4 February 2012

If you live in Paris and are interested in R, there will be two meetings for you this week.

First a Semin-R session, organized at the Muséum National d’Histoire Naturelle on Tuesday 7 Feb (too bad, the Museum is closed on Tuesdays). Presentations will be about colors, phylogenies and maps, while I will speak about (my beloved) RStudio. The slides of previous sessions can be found here (most of them are in French).

The following day, 8 Feb, a group of R users from INSEE will have its first meeting (13-14h, INSEE, room R12), about SQLite data in R, maps, and \LaTeX in R.

I guess anyone can join!

UPDATE: Here is a colorful map to access INSEE \tau R. Come with an ID, and say you are visiting the meeting organizer Matthieu Cornec. Room R12 is on the ground floor (left).

Psycho dice and Monte Carlo

Posted in R, Statistics by Julyan Arbel on 16 December 2011

Following Pierre’s post on psycho dice, I want here to see by which average margin repeated plays might be called influenced by mind will. The rules are the following (exerpt from the novel Midnight in the Garden of Good and Evil, by John Berendt):

You take four dice and call out four numbers between one and six–for example, a four, a three, and two sixes. Then you throw the dice, and if any of your numbers come up, you leave those dice standing on the board. You continue to roll the remaining dice until all the dice are sitting on the board, showing your set of numbers. You’re eliminated if you roll three times in succession without getting any of the numbers you need. The object is to get all four numbers in the fewest rolls.

Simplify the game by forgetting the elimination step. Suppose first one plays with an even dice of 1/p faces. The probability of it to show the right face is p (for somebody with no psy power). Denote X the time to first success with one dice, which follows, by independence, a geometric distribution Geom(p) (with the starting-to-1 convention). X has the following probability mass and cumulative distribution functions, with q=1-p:

f_X(k)=pq^{k-1},\quad F_X(k)=1-q^k.

Now denote Y the time to success in the game with n dice. This simultaneous case is the same as playing n times independently with 1 dice, and then taking Y as the sample maximum of the different times to success. So Y‘s cdf is

F_Y(k)=F_X(k)^n=(1-q^k)^n.

Its pmf can be obtained either exactly by difference, or up to a normalizing constant C by differentiation:

f_Y(k)=Cq^k(1-q^k)^{n-1}.

As it is not too far from the Geom(p) pmf, one can use the latter as the proposal in a Monte Carlo estimate. If X_i‘s are N independent Geom(p) variables, then

E(Y) \approx \frac{\sum_i X_i(1-q^{X_i})^{n-1}}{\sum_i (1-q^{X_i})^{n-1}} and E(Y^2) \approx \frac{\sum_i X_i^2(1-q^{X_i})^{n-1}}{\sum_i (1-q^{X_i})^{n-1}}.

The following R lines produce the estimates \mu_Y=E(Y) = 11.4 and \sigma_Y=sd(Y) = 6.5.

p=1/6
q=1-p
n=4
rgeom1=function(n,p){rgeom(n,p)+1}
h=function(x){(1-q^x)^(n-1)}

N=10^6
X=rgeom1(N,p)
(C=1/mean(h(X)))
(m1_Y=C*mean(X*h(X)))
(m2_Y=C*mean(X^2*h(X)))
(sd_Y=sqrt(m2_Y-m1_Y^2))

Created by Pretty R at inside-R.org

Now it is possible to use a test (from classical test theory) to estimate the average margin with which repeated games should deviate in order to detect statistical evidence of psy power. We are interested in testing H_0\,:\,E(Y)=\mu_Y against H_1\,:\,E(Y)<\mu_Y, for repeated plays.

If the game is played k times, then one rejects H_0 if the sampled mean \bar{Y} is less than \mu_Y -\frac{\sigma_Y}{\sqrt{k}}q_{.95}, where q_{.95} is the 95% standard normal quantile. To indicate the presence of a psy power, someone playing k=20 times should perform in 2 rolls less than the predicted value \mu_Y= 11.4 (in 1 roll less if playing k=80 times). I can’t wait, I’m going to grab a dice!

Follow

Get every new post delivered to your Inbox.

Join 50 other followers

%d bloggers like this: