Hey hey,

With Alexandre Thiéry we’ve been working on non-negative unbiased estimators for a while now. Since I’ve been talking about it at conferences and since we’ve just arXived the second version of the article, it’s time for a blog post. This post is kind of a follow-up to a previous post from July, where I was commenting on Playing Russian Roulette with Intractable Likelihoods by Mark Girolami, Anne-Marie Lyne, Heiko Strathmann, Daniel Simpson and Yves Atchade.

The setting is the combination of two components.

**1°)** There are techniques to “debias” consistent estimators. Consider a sequence $(S_n)_{n \geq 0}$ converging to $\lambda$ in the sense $\lim_{n \to \infty} \mathbb{E}[S_n] = \lambda$. Introduce an integer-valued random variable $N$ and the survival probabilities $\mathbb{P}(N \geq n)$. Then the random variable $Y = \sum_{n=0}^{N} (S_n - S_{n-1}) / \mathbb{P}(N \geq n)$, with the convention $S_{-1} = 0$, is an unbiased estimator of $\lambda$, i.e. its expectation is $\lambda$. Under additional assumptions it has a finite variance and a finite expected computational time… wow. We’ve just removed the bias from a sequence of biased estimators. We’ve reached the limit, we’ve reached infinity, we’re beyond heaven. That random truncation trick has been invented and reinvented (starting with Von Neumann and Ulam!) over the years, but the most thorough and general study is found in Rhee & Glynn (2013). See for instance Rychlik (1990) for an early example of the same trick.
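For the curious, here is a minimal R sketch of the random truncation trick. Everything specific in it is a toy assumption of mine, not something taken from the article: the "biased estimators" are the deterministic numbers S_n = 1 - 2^-(n+1), which converge to 1 from below, and the truncation level N is geometric with survival probabilities P(N >= n) = 0.6^n. Each increment S_n - S_{n-1} is reweighted by the inverse survival probability and the sum is stopped at N.

```r
# Sketch of a randomly truncated ("debiased") estimator.
# Toy assumptions: S_n = 1 - 2^-(n+1) (limit 1), and P(N >= n) = q^n.
set.seed(1)
q <- 0.6

# Biased sequence; note that S(-1) = 0, which matches the convention S_{-1} = 0.
S <- function(n) 1 - 2^(-(n + 1))

debiased_draw <- function() {
  N <- rgeom(1, prob = 1 - q)      # support 0, 1, 2, ... with P(N >= n) = q^n
  n <- 0:N
  increments <- S(n) - S(n - 1)    # S_n - S_{n-1}
  sum(increments / q^n)            # reweight each increment by 1 / P(N >= n)
}

draws <- replicate(1e5, debiased_draw())
mean(draws)   # should be close to the limit 1, although every S_n is below 1
```

The choice q = 0.6 is not innocent: the increments decay like 2^-n, so dividing by 0.6^n still gives a summable series and a finite variance. With a truncation variable whose tail is too light relative to the increments, the estimator stays unbiased in principle but its variance can blow up.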

**2°)** Now, since there’s one way to debias estimators, there might be others. In particular there might be some way to remove the bias *and* to guarantee some positivity constraint. That is, assume now that $\lambda$ is in $\mathbb{R}^+$. We might want to have an unbiased estimator of $\lambda$ that takes almost surely non-negative values. A motivating example is precisely the Russian Roulette paper mentioned above, and in general the pseudo-marginal methods. With those methods we can perform “exact inference” on a posterior distribution, as long as we have access to non-negative unbiased estimators of the point-wise evaluations of its probability density function.
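To see where non-negativity enters, here is a toy pseudo-marginal Metropolis-Hastings sketch in R. The setup is an assumption made up for illustration: the target is a standard normal density, and instead of evaluating it exactly we only get the exact value multiplied by a fresh Gamma(4, 4) noise variable, which is positive and has mean one (so the estimate is non-negative and unbiased). The key point of the method is that the estimate attached to the current state is stored and reused, not recomputed.

```r
# Toy pseudo-marginal Metropolis-Hastings sampler.
# Assumption: the density estimate is dnorm(x) times mean-one Gamma noise.
set.seed(5)
noisy_density <- function(x) dnorm(x) * rgamma(1, shape = 4, rate = 4)

n_iter <- 20000
x <- 0
dens_x <- noisy_density(x)        # estimate attached to the current state
chain <- numeric(n_iter)

for (iter in seq_len(n_iter)) {
  y <- x + rnorm(1, sd = 2)       # symmetric random walk proposal
  dens_y <- noisy_density(y)      # fresh estimate at the proposed state only
  # accept with probability min(1, dens_y / dens_x)
  if (runif(1) * dens_x < dens_y) {
    x <- y
    dens_x <- dens_y              # keep the estimate that got accepted
  }
  chain[iter] <- x
}

c(mean(chain), var(chain))        # should be close to 0 and 1
```

The acceptance ratio is a ratio of estimates, which only makes sense as a probability if the estimates are non-negative. A real-valued unbiased estimator that sometimes goes negative cannot simply be plugged in here.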

Our results identify cases where non-negative unbiased estimators can, or cannot, be obtained, in the following sense. For instance, assume that we have access to a real-valued unbiased estimator of $\lambda \in \mathbb{R}^+$, from which we can draw independent copies. We show that there is no algorithm taking those estimators as input and producing almost surely non-negative unbiased estimators of that $\lambda$. So it’s impossible to “positivate” an unbiased estimator just like that. To prove such a result we rely on a precise definition of algorithm, which we believe is not restrictive.

More generally we show that if we have unbiased estimators of $\lambda$ and want to obtain non-negative unbiased estimators of $f(\lambda)$ for some function $f$, well, that’s impossible in general. We are sorry.

However, if you have an unbiased estimator of $\lambda$ taking values in an interval $[a, b]$, then it can be possible to have a non-negative unbiased estimator of $f(\lambda)$, depending on the function $f$ considered, and in this case the problem is very much related to the Bernoulli Factory problem of Von Neumann (again! Damn you v.N.). In other words, if you have more knowledge on your unbiased estimator used as input (in this case lower and upper bounds), the problem might have a solution. In practice this type of knowledge would be model specific.
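The simplest Bernoulli Factory I know of is Von Neumann's fair coin: from a coin with unknown heads probability p, produce a coin with heads probability exactly 1/2, i.e. the constant function f(p) = 1/2. The R sketch below is only an illustration of that classical trick, not of the constructions in our article, and the value p = 0.2 is an arbitrary choice that the algorithm itself never looks at.

```r
# Von Neumann's trick: turn a p-coin into a fair coin without knowing p.
set.seed(2)
p <- 0.2                               # hypothetical bias, unknown to fair_flip()
flip <- function() rbinom(1, 1, p)     # one flip of the biased coin

fair_flip <- function() {
  repeat {
    a <- flip()
    b <- flip()
    # (1, 0) and (0, 1) both have probability p * (1 - p), so given a != b
    # the first flip is 1 with probability exactly 1/2
    if (a != b) return(a)
  }
}

fair <- replicate(2e4, fair_flip())
mean(fair)   # should be close to 0.5
```

The output takes values in {0, 1}, so it is a non-negative unbiased estimator of f(p) built only from draws of the input coin. Note that the input here lives in the bounded set {0, 1}, which is exactly the kind of extra knowledge mentioned above.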

When no non-negative unbiased estimator is available, pseudo-marginal methods cannot be directly applied. Since those methods have proven very successful in some important areas such as hidden Markov models, we believe it’s interesting to characterize the other settings in which they might be applied. In the paper we discuss exact simulation of diffusions, inference for big data, doubly intractable distributions and inference based on reference priors. In those fields (at least the first three) people have tried to come up with general non-negative unbiased estimators, so we hope to save them some time!


Hey there,

It’s been a while since I last wrote about parallelization and GPUs. With colleagues Lawrence Murray and Anthony Lee we have just arXived a new version of Parallel resampling in the particle filter. The setting is that, on modern computing architectures such as GPUs, thousands of operations can be performed in parallel (i.e. simultaneously) and therefore the calculations that cannot be parallelized quickly become the bottleneck. In the case of the particle filter (or any sequential Monte Carlo method such as SMC samplers), that bottleneck is the resampling step. The article investigates this issue and numerically compares different resampling schemes.

In the resampling step, given a vector of “weights” $w_1, \ldots, w_N$ (non-negative real numbers), a vector of integers called “offspring counts”, $o_1, \ldots, o_N$, is drawn such that for all $i$, $\mathbb{E}[o_i] = N w_i / \sum_{j=1}^{N} w_j$. That is, on average a particle has a number of offspring proportional to its normalized weight. Most implementations of the resampling step require a collective operation, such as computing the sum of the weights to normalize them. On top of being a collective operation, computing the sum of the weights is not a numerically stable operation when the weight vector is very long. Numerical results in the article show that in single precision floating point format (as preferred for fast execution on the GPU) and for vectors of size half a million or more, a typical implementation of the resampling step (multinomial, residual, systematic…) exhibits a non-negligible bias due to numerical instability.
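As a small illustration of what "offspring counts" means, here is a sketch of plain multinomial resampling in R. The weights are arbitrary uniform draws chosen for the example, and R works in double precision, so this sketch does not reproduce the single precision issue discussed in the article. It only shows the normalization step (the collective operation) and the fact that the counts average out to N times the normalized weights.

```r
# Multinomial resampling: offspring counts from unnormalized weights.
set.seed(3)
N <- 8
w <- runif(N)                 # unnormalized weights (illustrative values)
W <- w / sum(w)               # normalization: requires the sum over all weights

# one draw of the offspring counts; they always sum to N
o <- as.vector(rmultinom(1, size = N, prob = W))
sum(o) == N

# averaging many draws: the mean count of particle i approaches N * W[i]
avg <- rowMeans(rmultinom(5000, size = N, prob = W))
round(cbind(expected = N * W, average = avg), 2)
```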

Two resampling strategies come to the rescue: Metropolis and Rejection resampling. These methods, described in detail in the article, rely only on pair-wise weight comparisons and thus 1) are numerically stable and 2) bypass collective operations. Interestingly enough, the Metropolis resampler is theoretically biased but, when numerical stability is taken into account in single precision, proves “less biased” than the traditional resampling strategies (which are theoretically unbiased!), again when using half a million particles or more. It’s not too crazy to imagine that particle filters will soon be commonly run with millions of particles, hence the interest in studying the behaviour of resampling schemes in that regime.
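Here is how I would sketch the Metropolis resampler as a plain sequential R loop. On a GPU each particle would run its own short chain in a separate thread; the function name `metropolis_resample`, the number of iterations B = 30 and the two-level weight vector are all choices made for this example. Each chain starts at its own index, proposes another index uniformly and accepts it with probability min(1, w_j / w_k), so only ratios of two weights are ever needed and the sum of the weights never appears.

```r
# Metropolis resampling: one short Markov chain per particle,
# using only pair-wise weight comparisons (no normalization).
metropolis_resample <- function(w, B) {
  N <- length(w)
  ancestors <- integer(N)
  for (i in seq_len(N)) {
    k <- i                                  # each chain starts at its own index
    for (b in seq_len(B)) {
      j <- sample.int(N, 1)                 # uniform proposal over particles
      # accept j with probability min(1, w[j] / w[k]), written without division
      if (runif(1) * w[k] <= w[j]) k <- j
    }
    ancestors[i] <- k
  }
  ancestors
}

set.seed(6)
w <- rep(c(1, 3), each = 500)               # half the particles weigh 3 times more
a <- metropolis_resample(w, B = 30)
mean(a > 500)   # fraction of ancestors among heavy particles, should be near 0.75
```

The bias mentioned above comes from the finite B: after B steps the chain has not exactly reached its stationary distribution, which is the one proportional to the weights. In this two-level example the chain forgets its starting point very quickly, so B = 30 is more than enough; with a few dominant weights among many negligible ones, a much larger B would be needed.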

Other practical aspects of resampling implementations are discussed in the article, such as whether the resampling step should be done on the CPU or on the GPU, taking into account the cost of copying the vectors into memory. Decision matrices are given (figure above), indicating which strategy is best, i.e. whether to perform resampling on the CPU or the GPU and which resampling scheme to use.

All the numerical results of the article can be reproduced using the Resampling package for Libbi.


library(ggplot2)      # plotting functions used below
library(wesanderson)  # on CRAN
library(RShapeTarget) # available on https://github.com/pierrejacob/RShapeTarget/
library(PAWL)         # on CRAN

Let’s invoke the *moustarget* distribution.

shape <- create_target_from_shape(
  file_name = system.file(package = "RShapeTarget", "extdata/moustache.svg"),
  lambda = 5)
rinit <- function(size) matrix(rnorm(2 * size), ncol = 2)
moustarget <- target(name = "moustache", dimension = 2, rinit = rinit,
                     logdensity = shape$logd, parameters = shape$algo_parameters)

This defines a target distribution represented by an SVG file using RShapeTarget. The target probability density function is defined on $\mathbb{R}^2$ and is proportional to $1$ on the segments described in the SVG file, and decreases exponentially fast to $0$ away from the segments. The density function of the *moustarget* is plotted below, a picture being worth a thousand words.

# evaluate the log-density on a 300 x 300 grid covering the bounding box
ranges <- apply(shape$bounding_box, 2, range)
gridx <- seq(from = ranges[1,1], to = ranges[2,1], length.out = 300)
gridy <- seq(from = ranges[1,2], to = ranges[2,2], length.out = 300)
grid.df <- expand.grid(gridx, gridy)
grid.df$logdensity <- moustarget@logdensity(cbind(grid.df$Var1, grid.df$Var2),
                                            moustarget@parameters)
names(grid.df) <- c("x", "y", "z")
# raster plot of the density with a Wes Anderson palette
g2d <- ggplot(grid.df) + geom_raster(aes(x = x, y = y, fill = exp(z))) +
  xlab("X") + ylab("Y")
g2d <- g2d + xlim(ranges[,1]) + ylim(ranges[,2])
pal <- wes.palette(name = "GrandBudapest", type = "continuous")
g2d <- g2d + scale_fill_gradientn(name = "density", colours = pal(50))
g2d <- g2d + theme(legend.position = "bottom", legend.text = element_text(size = 10))
g2d

There are various interesting aspects to note about this distribution. First it is very multi-modal and strongly non-Gaussian, thus providing an interesting toy problem for testing MCMC algorithms. Furthermore, sampling from the *moustarget* can be made arbitrarily difficult by pulling the moustache down, thus separating the moustache mode from the remaining probability mass around the eyes, ears and hat. Finally, note that the colours chosen to represent the density above approximately match the principal colours used in Grand Budapest Hotel by Wes Anderson. This is thanks to the awesome wesanderson package on CRAN. Obviously it is now very tempting to launch the Wang-Landau algorithm on this target with a spatial binning strategy, in order to try out the various palettes provided in wesanderson.

mhparameters <- tuningparameters(nchains = 10, niterations = 10000, storeall = TRUE)
# bin the state space along the second (vertical) coordinate
getPos <- function(points, logdensity) points[,2]
explore_range <- c(-700, 0)
ncuts <- 20
positionbinning <- binning(position = getPos, name = "position",
                           binrange = explore_range, ncuts = ncuts,
                           useLearningRate = TRUE, autobinning = FALSE)
pawlresults <- pawl(target = moustarget, binning = positionbinning,
                    AP = mhparameters, verbose = TRUE)
pawlchains <- ConvertResults(pawlresults, verbose = FALSE)
locations <- positionbinning@getLocations(pawlresults$finalbins, pawlchains$X2)
pawlchains$locations <- factor(locations)
# scatter plot of the chains, coloured by bin, with the bin limits as horizontal lines
g <- ggplot(subset(pawlchains), aes(x = X1, y = X2, alpha = exp(logdens),
                                    size = exp(logdens), colour = locations)) +
  geom_point() + theme(legend.position = "none") + xlab("X") + ylab("Y")
g <- g + geom_hline(yintercept = pawlresults$finalbins)
pal <- wes.palette(name = "GrandBudapest", type = "continuous")
print(g + scale_color_manual(values = pal(21)) + labs(title = "Moustarget in Grand Budapest colours"))
pal <- wes.palette(name = "Darjeeling", type = "continuous")
print(g + scale_color_manual(values = pal(21)) + labs(title = "Moustarget in Darjeeling colours"))
pal <- wes.palette(name = "Zissou", type = "continuous")
print(g + scale_color_manual(values = pal(21)) + labs(title = "Moustarget in Zissou colours"))


Hey,

There’s a nice exhibition open until May 26th at the British Library in London, entitled Beautiful Science: Picturing Data, Inspiring Insight. Various examples of data visualizations are shown, either historical or very modern, or even made especially for the exhibition. Definitely worth a detour if you happen to be in the area; you can see everything in 15 minutes.

In particular there are nice visualisations of historical climate data, gathered from the logbooks of the English East India Company, whose ships were crossing every possible sea at the beginning of the 19th century. The logbooks contain locations and daily weather reports, handwritten by the captains themselves. Turns out the logbooks are kept at the British Library itself and some of them are on display at the exhibition. More info on that project here: oldweather.org.


Besides having coded a pretty cool MCMC app in Javascript, this guy Rasmus Bååth has started the Bayesian first aid project. The idea is that if there’s an R function called **blabla.test** performing test “blabla”, there should be a function **bayes.blabla.test** performing a similar test in a Bayesian framework, and showing the output in a similar way so that the user can easily compare both approaches. This post explains it all. JAGS and BEST seem to be the two main workhorses under the hood.

Kudos to Rasmus for this very practical approach, potentially very impactful. Maybe someday people will have to specify if they want a frequentist approach and not the other way around! (I had a dream, etc).


I’m Joseph Dureau, I have been an avid reader of this blog for a while now, and I’m very glad Pierre invited me to share a few things. Until a few months ago, I used to work on Bayesian inference methods for stochastic processes, with applications to epidemiology. Along with fellow colleagues from this past life, I have now taken the startup path, founding Standard Analytics. We’re looking into how web technologies can be used to enhance the browsability, transparency and impact of scientific publications. Here’s a start on what we’ve been up to so far.

Let me just make it clear that everything I’m presenting is fully open source, and available here. I hope you’ll find it interesting, and we’re very excited to hear from you! Here it goes.

To date, the Web has developed most rapidly as a medium of documents for people rather than for data and information that can be processed automatically.

Berners-Lee et al., 2001

Since this sentence was written, twelve years ago, ambitious and collective initiatives have been undertaken to revolutionize what machines can do for us on the web. When I make a purchase online, my email service is able to understand it from the purchase confirmation email, communicate to the online store service, authenticate, obtain information on the delivery, and provide me with a real-time representation of where the item is located. Machines now have the means to process data in a smarter way, and to communicate over it!

However, when it comes to exchanging quantitative arguments, be it in a blog post or in a scientific article, web technology does not bring us much further than what can be done with pen and paper. Let’s take the following HTML snippet, inspired by a New York Times blog post:

<p>

The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).

</p>

It would be rendered as:

The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).

While this argument can be understood by humans, machines only see it as a plain piece of text. There is no way for them to know that this is a correlation analysis, no easy way for them to provide me with the raw data that lies behind that number, and no way to efficiently bring me some perspective with results from other studies on the same subject. But wouldn’t that be helpful?

Inspiration to move forward can be found in the schema.org initiative, a collaboration between the main search engines, including Bing, Google, Yahoo! and Yandex, which started in 2011 to define a collection of schemas, or html tags, that webmasters can use to mark up their pages. Shared vocabularies of this type have brought machines to a first level of understanding: they can know when they are referring to the same notion, which sets the basis for smarter communication. The schemas also define an ontology, i.e. sets of relations between these common notions. For example, they state that the description of an order usually contains a customer (that can either be a person or an organisation), as well as a billing address, a confirmation number, etc. Using this information, machines know which properties to expect when encountering an object of a given Class, and can be told what to do with their content.

Structured data has contributed to the explosion of services on the web, and not only from the *search* companies. For example, the BBC has built a stunning Wildlife finder that aggregates and organizes data from all over the web for every single living species on the planet! Another exciting example is the Veterans Job Bank initiative of the White House, that simply required employers willing to hire veterans to mark up their online job listing: it started with over 500 000 job opportunities for veterans in the US!

In our context, let’s just mark up our original example with RDFa Lite attributes:

<p vocab="http://schema.org" prefix="stats: http://standardanalytics.io/stats/"
   resource="#obesity" typeof="Comment stats:Correlation">
  The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries
  (cor = <span property="stats:estimate">-0.45</span>,
  p = <span property="stats:pValue">0.06</span>).
</p>

That would still be rendered as:

The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).

By referring to an RDF statistical vocabulary we have been working on, we make it explicit that the argument is based on a correlation estimate, which is provided along with a p-value expressing how strong the evidence is. In this way, metadata can be attached directly to each argument, bringing the browsability of scientific results to a whole new level! Based on these tags, machines now have a sense of the significance of a statistical test, making it virtually possible for a search engine to automatically provide me with perspective on any quantitative argument, based on alternative studies on the same question. In this example, I could figure out in a breeze if other studies corroborate such a relation between obesity and average time spent eating, ranked by their strength of evidence, or know about any related analysis that could bring me some deeper insights on the matter!
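To give a feel for what "machines now have a sense" means in practice, here is a deliberately crude R sketch that pulls the tagged numbers out of the marked-up sentence. It uses a regular expression on a shortened copy of the snippet, which is an assumption made to keep the example within base R; a real consumer would use a proper RDFa parser, not a regex.

```r
# Crude extraction of property/value pairs from the marked-up sentence.
# Assumption: a shortened copy of the snippet, parsed with a regex for brevity.
html <- paste0(
  '<p vocab="http://schema.org" prefix="stats: http://standardanalytics.io/stats/" ',
  'resource="#obesity" typeof="Comment stats:Correlation">',
  '(cor = <span property="stats:estimate">-0.45</span>, ',
  'p = <span property="stats:pValue"> 0.06</span>).</p>'
)

pattern <- '<span property="([^"]+)"[^>]*>([^<]*)</span>'
spans <- regmatches(html, gregexpr(pattern, html))[[1]]   # one string per span

properties <- sub(pattern, "\\1", spans)                   # e.g. "stats:estimate"
values <- as.numeric(trimws(sub(pattern, "\\2", spans)))   # e.g. -0.45

stats <- setNames(values, properties)
stats
```

Without the markup, the same program would have to guess which of the numbers in the sentence is the correlation and which is the p-value. With it, the property names say so explicitly.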

But more can still be done. The statement, even marked up, remains somewhat arbitrary. It leaves it to the readers to faithfully believe and interpret the numbers, or to manually make their way to the supplementary materials (in best case scenarios of scientific publishing) or write to the author for more details. This point has understandably fed vivid discussions online and in newspapers on verifiability and reusability of results, casting skepticism over scientific findings in general. Again, schemas and linked data provide a simple solution here. What if I could attach to my argument an unambiguous description of how I got to the stated numbers, in the standard format that is used to exchange linked data on the web (a.k.a. JSON-LD), and referring to the same standardised vocabulary?

It would simply look like this:

<p vocab="http://schema.org" prefix="stats: http://standardanalytics.io/stats/"
   resource="#obesity" typeof="Comment stats:Correlation">
  The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries
  (<a property="isBasedOnUrl" href="http://r.standardanalytics.io/obesity/0.0.0">
  cor = <span property="stats:estimate">-0.45</span>,
  p = <span property="stats:pValue">0.06</span></a>).
</p>

Rendered as:

The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).

The isBasedOnUrl link takes me to a 5-star description of the analysis:

{
  "@context": "https://r.standardanalytics.io/contexts/datapackage.jsonld",
  "name": "obesity-analysis",
  "version": "0.0.0",
  "description": "Correlation analysis based on 'Obesity and the Fastness of Food' blog post of the New York Times",
  "citation": "http://economix.blogs.nytimes.com/2009/05/05/obesity-and-the-fastness-of-food/",
  "license": "CC0-1.0",
  "repository": [
    {
      "codeRepository": "https://github.com/standard-analytics/blog.git",
      "path": "data/obesity-analysis"
    }
  ],
  "keywords": [ "Obesity", "Fast", "Food", "OECD" ],
  "author": {
    "name": "Joseph Dureau",
    "email": "joseph@standardanalytics.io"
  },
  "isBasedOnUrl": [ "https://r.standardanalytics.io/obesity/0.0.0" ],
  "analytics": [
    {
      "name": "correlationTest",
      "description": "Exploring links between obesity rates and average time spend eating.",
      "programmingLanguage": { "name": "R" },
      "runtime": "R",
      "targetProduct": { "operatingSystem": "Unix" },
      "sampleType": "scripts/corTest.R",
      "input": [ "obesity-analysis/0.0.0/dataset/OECD" ],
      "output": [ "obesity-analysis/0.0.0/dataset/obesityFoodFastness" ]
    }
  ],
  "dataset": [
    {
      "name": "obesityFoodFastness",
      "description": "Is the obesity rate in a country (percentage of national population with a body mass index higher than 30) correlated with the average number of minutes people spend eating each day?",
      "isBasedOnUrl": [ "obesity-analysis/0.0.0/analytics/correlationTest#3" ],
      "distribution": {
        "@context": { "@vocab": "http://standardanalytics.io/stats" },
        "@type": "Correlation",
        "covariate1": "obesity$ObesityRate",
        "covariate2": "obesity$MinutesSpentEating",
        "estimate": -0.45035,
        "statTest": {
          "@type": "TTest",
          "testStatistic": -2.0176,
          "df": 16,
          "pValue": 0.06073
        }
      }
    }
  ]
}

The raw data these results depend on can be retrieved by following once again the isBasedOnUrl. You are starting to see the logic. The quantitative argument is now based on fully transparent and reproducible calculations. Every moving piece is here for readers (and reviewers) to verify the argument and push the exploration further. More soon on these aspects with a first packaged analysis from the Reproducibility project in psychology!
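As a tiny taste of what a reader or reviewer can do with such a description, the numbers in it can be cross-checked against each other without even touching the raw data. The sketch below is my own illustration, using the standard relation between a Pearson correlation and its t statistic, t = r * sqrt(df / (1 - r^2)); the variable names are made up for the example.

```r
# Cross-checking the reported correlation test using only the published numbers.
r  <- -0.45035    # "estimate" in the description above
df <- 16          # "df"

t_stat  <- r * sqrt(df / (1 - r^2))     # t statistic implied by r and df
p_value <- 2 * pt(-abs(t_stat), df)     # two-sided p-value

round(t_stat, 4)    # about -2.0176, matching "testStatistic"
round(p_value, 4)   # should land close to the reported "pValue" of 0.06073
```

If one of the three numbers had been mistyped, the check would fail, which is exactly the kind of verification that is painful to do from a sentence in plain text.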

What I am introducing here is no science-fiction. Pre-alpha versions of the registry and its client used to host your quantitative arguments are already available, and most importantly the linked data technology has proven to scale: since the launch of the project in 2011, over 5 million sites have been marked up with schema.org vocabulary! For you to start sharing 5-star quantitative arguments today, we have published a JSON-LD packaging tool for R users. Simply install the RJSONLD package, available on CRAN, and add a single line to your script for every analytic you wish to export in JSON-LD format:

obesity <- read.csv('obesity.csv')
result <- cor.test(obesity$ObesityRate, obesity$MinutesSpentEating)
RJSONLD.export(result, 'ObesityFoodFastness.jsonld')
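If you want to see what is inside such a test result before exporting it, the sketch below runs `cor.test` on synthetic data and collects the pieces that correspond to the field names used in the description above. The data are simulated stand-ins for obesity.csv (18 made-up countries), and the `fields` list is only my illustration of the correspondence; I am not claiming this is what the export function does internally.

```r
# What cor.test returns, on synthetic data standing in for obesity.csv.
set.seed(4)
obesity <- data.frame(ObesityRate = rnorm(18, mean = 15, sd = 5))
obesity$MinutesSpentEating <- 110 - 1.5 * obesity$ObesityRate + rnorm(18, sd = 10)

result <- cor.test(obesity$ObesityRate, obesity$MinutesSpentEating)

# components of the "htest" object, named after the vocabulary fields
fields <- list(
  estimate      = unname(result$estimate),    # sample correlation
  testStatistic = unname(result$statistic),   # t statistic
  df            = unname(result$parameter),   # degrees of freedom, n - 2
  pValue        = result$p.value
)
str(fields)
```

With 18 observations the test has 16 degrees of freedom, the same as in the OECD example.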

The current version of the statistical markup vocabulary I have been mentioning in this post covers the most common notions of statistical analysis, and we will be expanding it. Yet, it is still at a draft stage and its construction should not remain an individual initiative: we warmly encourage anyone to debate and enrich it! If you can’t find a Class for the statistical test you have been working with, if you’re more into Bayesian statistics, or if there is anything you’d like to suggest, let’s build up this vocabulary together!

**How can you contribute?** Shoot me an email (joseph@standardanalytics.io), or simply create a Github account if you do not already have one, go to the issue tracker to provide feedback, propose extensions, and spread the word!

We have shown you here how to tag your quantitative arguments to make them browsable, and how to export shareable JSON-LD versions of your statistical results in R. To learn about automatically marking up and packaging your analysis to publish five-star science in three simple steps, follow our blog!


A few days after the MCMSki conference, I am starting to see the main lessons gathered there.

- I should really read the full program before attending the next MCMSki. The three parallel sessions looked consistently interesting, and I really regret having missed some talks (in particular Dawn Woodard’s and Natesh Pillai’s) and some posters as well (admittedly, due to exhaustion on my part).
- Compared to the previous instance three years ago (in Utah), the main themes have significantly changed. Scalability, approximate methods, non-asymptotic results, 1/n methods … these keywords are now on everyone’s lips. Can’t wait to see if MCQMC’14 will feel that different from MCQMC’12.
- The community is rightfully concerned about scaling Monte Carlo methods to big data, with some people pointing out that models should also be rethought in this new context.
- The place of software developers in the conference, or simply references to software packages in the talks, is much greater than it used to be. It’s a very good sign towards reproducible research in our field. There’s still a lot of work to do, in particular in terms of making parallel computing easier to access (time to advertise LibBi a little bit). On a related note, many people now point out whether their proposed algorithms are parallel-friendly or not.
- Going from the Rockies to the Alps, the food drastically changed from cheeseburgers to just melted cheese. Bread could be found but ground beef and Budweiser were reported missing.
- It’s fun to have an international conference in your home country, but switching from French to English all the time was confusing.

Back in flooded Oxford now!


Happy new year to everyone, and perhaps see you at MCMski 4 in Chamonix next week, which I expect to be a very friendly and exciting event, even if I’m not much into skiing. :-)

I will talk for the first time about SQMC, a QMC (Quasi Monte Carlo) variant of particle filtering (PF) that Mathieu Gerber and I developed in recent months. We are quite excited about it for a variety of reasons, but I will give more details shortly on this blog.

I thought that my talk would clash with a session on PMCMC, which was quite unfortunate as I suspect that session would target perhaps the same audience, but looking at the program, I see it’s no longer the case. Thank the powers that be!

I also organise a session on “Bayesian computation in Neurosciences” at MCMski 4. Feel free to come if you are interested in the subject. Myself, I think it’s a particularly cool area of application, about which I know very very little… which is why I organise a session to learn more about it! :-) I also co-organise (with Simon Barthelmé and Adam Johansen) a workshop at Warwick on the same subject, more details soon.


Many reactions seem to focus on Academia.edu, which is a private company, so perhaps that case is not so black and white. However, I found the story (also mentioned by the WP paper) of our colleague Daniel Povey much more infuriating: Daniel put a legit copy of one of his papers on his web site, some robot wrongly detected this copy as the version owned by Elsevier, sent a DMCA takedown notice to Google, and boom, Google automatically shut down Daniel’s Google web page entirely. Welcome to the brave new world of robots enacting the Law.

I was talking with an economist the other day. He told me that big corporations very rarely innovate, because they have invested so much in a particular, currently lucrative, business model, even if that model is doomed in the medium term. He gave me the example of Kodak: they developed the first digital camera before anyone else, yet they never managed to turn around their business model to make the transition to digital photography. They filed for bankruptcy last year. I think the same applies to Elsevier: even if it does not make sense for them in the long run, this company is going to fight ugly to defend its current business model (the “treasure chest behind a pay wall”, the treasure being our papers) rather than trying to transition to a new business model compatible with open access. So I guess it falls on us to consider sending our papers to new players in academic publishing.

In other news, I have heard that many French Universities are going to lose all access to Elsevier journals as of 1st Jan 2014, because of failed negotiations between Elsevier and these Universities, but I found little detail on the interweb on this particular story.



I’ve just heard this sad piece of news. Definitely one of the greatest statisticians of the last 50 years. I wish I had met him in person.

Originally posted on Xi'an's Og:

**D**ennis Lindley most sadly passed away yesterday at the hospital near his home in Somerset. He was one of the founding fathers of our field (of Bayesian statistics), who contributed to formalise Bayesian statistics in a coherent theory. And to make it one with rational decision-making, a perspective missing in Jeffreys’ vision. (His papers figured prominently in the tutorials we gave yesterday for the opening of O’Bayes 250.) At the age of 90, his interest in the topic had not waned away: as his interview with Tony O’Hagan last Spring showed, his passionate arguing for the rationale of the Bayesian approach was still there and alive! The review he wrote of *The Black Swan* a few years ago also demonstrated he had preserved his ability to see through bogus arguments. (See his scathing “*One hardly advances the respect with which statisticians are held in society by making…*

