*Hello all,*

*This is an article intended for the ISBA bulletin, jointly written by us all at Statisfaction, Rasmus Bååth from Publishable Stuff, Boris Hejblum from Research side effects, Thiago G. Martins from tgmstat@wordpress, Ewan Cameron from Another Astrostatistics Blog and Gregory Gandenberger from gandenberger.org.*

Inspired by established blogs, such as the popular Statistical Modeling, Causal Inference, and Social Science or Xi’an’s Og, each of us began blogging as a way to diarize our learning adventures, to share bits of R code or LaTeX tips, and to advertise our own papers and projects. Along the way we’ve come to a new appreciation of the world of academic blogging: a never-ending international seminar, attended by renowned scientists and anonymous users alike. Here we share our experiences by weighing the pros and cons of blogging from the point of view of young researchers.

At least at face value blogging has some notable advantages over traditional academic communication: publication is instantaneous and thus proves efficient in sparking discussions and debates; it allows all sorts of technological sorcery (hyperlinks, animations, applications), while many journals are still adapting to grayscale plots; and it allows for humorous and colourful writing styles, freeing the writer from the constraints of the impersonal academic prose. Last but not least, it is acceptable to blog about almost any topic, from office politics to funding bodies, from complaints about the absurdity of p-values to debates on the net profits of publishing companies, not to mention quarrels about the term “data science”.

For young researchers, some aspects are particularly appealing. By putting academics directly in touch with one another through comments and replies, young researchers are given the opportunity to “talk” directly on technical subjects to some of the most renowned names in their fields—and indeed a surprising number of senior researchers are avid blog readers! This often proves much more efficient than trying to awkwardly stalk the same professors at conferences. Through such interactions, young academics can show off their many interests and skills, which can do much to fill out the picture painted by their academic CV.

Beyond those low and careerist considerations, we see blogging as a good tool to learn and to share scientific ideas. According to popular belief, only a third of all started research projects end up in a publication; but all of them can at least end up on a blog. So if you indulge in a bit of off-topic study or burn a few hours playing around with a new methodology it need not fuel your performance anxiety: a blog post explaining it will still feel like a delivered product. And you will very likely get some interesting feedback—though rarely to the depth given in journal reviews.

Finally, using blogs to advertise articles and packages seems particularly useful at the early stage of a career, where you might not be invited to that many conferences, or might only be given some dark corner of a giant poster session to talk about your work.

Some cautionary notes now: blogging can be risky! As the adage goes, “better to keep your mouth shut and appear a fool than to open it and remove all doubt”. Beyond the quality of the content being shared, blogs are also sometimes dismissed by academics as a frivolous medium; there is a risk that your colleagues will see your blogging hobby as a pure waste of time.

A second risk is disclosing too much information about promising research leads. There should be some balance between ideas shared and ideas kept secret, so that blogging does not jeopardize publication. Other platforms that formally establish precedence (such as arXiv) might be better suited for the initial presentation of new and exciting work. For this reason it seems wisest to blog a posteriori, though blogs written this way are less interesting than they could be as real-time research diaries.

A third risk is genuine time-wasting. For those who have never tried, it can be surprising to discover how many hours are needed to write each post. It can be frustrating in the beginning when reader statistics indicate an audience of just one or two spam-bots and some curious relatives. On the other hand, there is still only a limited number of academic blogs on statistics, so the market is far from saturated: any new blog can quickly garner a decent amount of attention. Of course it can be hard to keep a regular posting schedule, which is necessary to maintain a stable readership.

To conclude, blogging can be a clever way to bypass the hierarchical structure of academia. It gives everyone direct and fast access to everyone else. In some respects it helps to alleviate key problems affecting young researchers, such as the lengthy reviewing process of top journals and the lack of communication space.

]]>

but in the RSS version, it reads

.

Well, that’s a bummer. For now, I recommend that everyone read the arXiv version instead (updated on Monday).

]]>

Almost 10 months since my latest post? I guess bloggin’ ain’t my thing… In my defense, Mathieu Gerber and I were quite busy revising our SQMC paper. I am happy to announce that it has just been accepted as a read paper in JRSSB. If all goes as planned, we should present the paper at the RSS ordinary meeting on Dec 10. Everybody is welcome to attend, and submit an oral or written discussion (or both). More details soon, when the event is officially announced on the RSS website.

What is SQMC? It is a QMC (Quasi-Monte Carlo) version of particle filtering. For the same CPU cost, it typically generates much more accurate estimators. Interested? Consider reading the paper here (more recent version coming soon), checking this video where I present SQMC, or, even better, attending our talk in London!

]]>

where is a kernel, and the mixing distribution is random and discrete (Bayesian nonparametric approach).

We consider the survival function $S$, which is recovered from the hazard rate $h$ by the transform

$$S(t) = \exp\left(-\int_0^t h(s)\,\mathrm{d}s\right)$$

and some possibly censored survival data having survival function $S$. Then it turns out that all the posterior moments of the survival curve $S(t)$ evaluated at any time $t$ can be computed.

The nice trick of the paper is to use the representation of a distribution in a [Jacobi polynomial] basis where the coefficients are linear combinations of the moments. So one can sample from [an approximation of] the posterior, and with a posterior sample we can do everything! Including credible intervals.

I’ve wrapped up the few lines of code in an R package called momentify (not on CRAN). With a sequence of moments of a random variable supported on [0,1] as an input, the package does two things:

- evaluates the approximate density
- samples from it
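To give a feel for the first of these two steps, here is a minimal sketch in base R. It is not the momentify code or its API: the function names are made up for illustration, and it uses the simplest member of the Jacobi family (shifted Legendre polynomials on [0,1]), which may differ from the basis used in the package. The point is only that the expansion coefficients are linear combinations of the moments.

```r
# Coefficients of the shifted Legendre polynomial of degree n on [0, 1]:
# P_n(x) = sum over k of cf[k + 1] * x^k
legendre_coef <- function(n) {
  k <- 0:n
  (-1)^(n + k) * choose(n, k) * choose(n + k, k)
}

# Hypothetical helper: approximate density from raw moments m_1, ..., m_K
# of a random variable supported on [0, 1]
density_from_moments <- function(moments) {
  m <- c(1, moments)                      # prepend m_0 = 1
  K <- length(moments)
  function(x) {
    out <- numeric(length(x))
    for (n in 0:K) {
      cf <- legendre_coef(n)
      a_n <- sum(cf * m[1:(n + 1)])       # E[P_n(X)], linear in the moments
      p_n <- sapply(x, function(xi) sum(cf * xi^(0:n)))
      out <- out + (2 * n + 1) * a_n * p_n  # 1 / (2n + 1) is the squared norm of P_n
    }
    out
  }
}

# Example: Beta(2, 3), whose moments are E[X^k] = prod_{j < k} (2 + j) / (5 + j)
beta_moments <- sapply(1:7, function(k) prod((2 + 0:(k - 1)) / (5 + 0:(k - 1))))
f_hat <- density_from_moments(beta_moments)
x <- seq(0, 1, length.out = 5)
cbind(x, approx = f_hat(x), exact = dbeta(x, 2, 3))
```

The Beta(2, 3) density is itself a polynomial of degree 3, so the approximation should match it up to rounding error; for a general density the expansion is only approximate and can even dip below zero. Sampling could then be done, for instance, by numerically inverting the corresponding distribution function, which I leave out here.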

A package example for a mixture of beta and 2 to 7 moments gives that result:

]]>

Hey hey,

With Alexandre Thiéry we’ve been working on non-negative unbiased estimators for a while now. Since I’ve been talking about it at conferences and since we’ve just arXived the second version of the article, it’s time for a blog post. This post is kind of a follow-up to a previous post from July, where I was commenting on Playing Russian Roulette with Intractable Likelihoods by Mark Girolami, Anne-Marie Lyne, Heiko Strathmann, Daniel Simpson and Yves Atchade.

The setting is the combination of two components.

**1°)** There are techniques to “debias” consistent estimators. Consider a sequence $(S_n)_{n \geq 0}$ converging to $\lambda$ in the sense $\mathbb{E}[S_n] \to \lambda$. Introduce an integer-valued random variable $N$ and the survival probabilities $p_n = \mathbb{P}(N \geq n)$. Then the random variable $Y = \sum_{n=0}^{N} (S_n - S_{n-1}) / p_n$ (with $S_{-1} = 0$) is an unbiased estimator of $\lambda$, i.e. its expectation is $\lambda$. Under additional assumptions it has a finite variance and a finite expected computational time… wow. We’ve just removed the bias off a sequence of biased estimators. We’ve reached the limit, we’ve reached infinity, we’re beyond heaven. That random truncation trick has been invented and reinvented (from Von Neumann and Ulam!) over the years but the most thorough and general study is found in Rhee & Glynn (2013). See for instance Rychlik (1990) for an early example of the same trick.
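To make the random truncation trick concrete, here is a toy sketch in R. It is an illustration under my own assumptions, not an example from the article: the sequence is the deterministic partial sums of the exponential series, so the limit is exp(x), the truncation variable is geometric, and the function name is made up.

```r
# Randomly truncated series estimator of exp(x).
# The n-th partial sum is S_n = sum_{k <= n} x^k / k!, so S_n - S_{n-1} = x^n / n!.
debiased_exp <- function(x, p = 0.5) {
  N <- rgeom(1, prob = p)             # support 0, 1, 2, ... with P(N >= n) = (1 - p)^n
  n <- 0:N
  increments <- x^n / factorial(n)    # S_n - S_{n-1}, with S_{-1} = 0
  survival <- (1 - p)^n               # survival probabilities P(N >= n)
  sum(increments / survival)          # a finite sum, yet unbiased for the limit
}

set.seed(1)
draws <- replicate(1e5, debiased_exp(1))
c(estimate = mean(draws), target = exp(1))
```

Each draw only evaluates finitely many terms, and yet the average of the draws targets the limit of the sequence. Here the terms decay much faster than the survival probabilities, which is what keeps the variance finite; with a poorly chosen truncation distribution that would no longer hold.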

**2°)** Now, since there’s one way to debias estimators, there might be others. In particular there might be some way to remove the bias *and* to guarantee some positivity constraint. That is, assume now that $\lambda$ is in $\mathbb{R}^+$. We might want to have an unbiased estimator of $\lambda$ that takes almost surely non-negative values. A motivating example is precisely the Russian Roulette paper mentioned above, and in general the pseudo-marginal methods. With those methods we can perform “exact inference” on a posterior distribution, as long as we have access to non-negative unbiased estimators of its probability density function point-wise evaluations.

Our results identify cases where non-negative unbiased estimators can be obtained, in the following sense. For instance, assume that we have access to a real-valued unbiased estimator of $\lambda$, from which we can draw independent copies. We show that there is no algorithm taking those estimators as input and producing almost surely non-negative unbiased estimators of that $\lambda$. So it’s impossible to “positivate” an unbiased estimator just like that. To prove such a result we rely on a precise definition of algorithm, which we believe is not restrictive.

More generally we show that if we have unbiased estimators of $\lambda$ and want to obtain non-negative unbiased estimators of $f(\lambda)$ for some function $f$, well that’s impossible in general. We are sorry.

However if you have an unbiased estimator of $\lambda$ taking values in an interval $[a, b]$, then it can be possible to have a non-negative unbiased estimator of $f(\lambda)$, depending on the function $f$ considered, and in this case the problem is very much related to the Bernoulli Factory problem of Von Neumann (again! Damn you v.N.). In other words, if you have more knowledge on your unbiased estimator used as input (in this case lower and upper bounds), the problem might have a solution. In practice this type of knowledge would be model specific.

When no non-negative unbiased estimator is available, pseudo-marginal methods cannot be directly applied. Since those methods have proven very successful in some important areas such as hidden Markov models, we believe it’s interesting to characterize the other settings in which they might be applied. In the paper we discuss exact simulation of diffusions, inference for big data, doubly intractable distributions and inference based on reference priors. In those fields (at least the first three) people have tried to come up with general non-negative unbiased estimators, so we hope to save them some time!

]]>

Hey there,

It’s been a while since I last wrote about parallelization and GPUs. With colleagues Lawrence Murray and Anthony Lee we have just arXived a new version of Parallel resampling in the particle filter. The setting is that, on modern computing architectures such as GPUs, thousands of operations can be performed in parallel (i.e. simultaneously) and therefore the remaining calculations that cannot be parallelized quickly become the bottleneck. In the case of the particle filter (or any sequential Monte Carlo method such as SMC samplers), that bottleneck is the resampling step. The article investigates this issue and numerically compares different resampling schemes.

In the resampling step, given a vector of “weights” $w = (w_1, \ldots, w_N)$ (non-negative real numbers), a vector of integers called “offspring counts”, $o = (o_1, \ldots, o_N)$, is drawn such that for all $i$, $\mathbb{E}[o_i] = N w_i / \sum_{j=1}^{N} w_j$. That is, on average a particle has a number of offspring proportional to its normalized weight. Most implementations of the resampling step require a collective operation, such as computing the sum of the weights to normalize them. On top of being a collective operation, computing the sum of the weights is not a numerically stable operation if the weight vector is very large. Numerical results in the article show that in single precision floating point format (as preferred for fast execution on the GPU) and for vectors of size half a million or more, a typical implementation of the resampling step (multinomial, residual, systematic…) exhibits a non-negligible bias due to numerical instability.
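As a point of reference, here is what the plain multinomial scheme looks like in a few lines of R. This is just a sketch to fix ideas (the function name is mine, and it runs in double precision on the CPU, so it does not reproduce the single-precision issues studied in the article); note the explicit sum of the weights, which is the collective operation mentioned above.

```r
# Multinomial resampling: draw offspring counts with mean N * w_i / sum(w)
multinomial_offspring <- function(w) {
  N <- length(w)
  normalized <- w / sum(w)            # collective operation: needs all the weights
  as.vector(rmultinom(1, size = N, prob = normalized))
}

set.seed(1)
w <- 7 * c(0.1, 0.2, 0.3, 0.4)        # unnormalized weights
o <- multinomial_offspring(w)
sum(o)                                # the counts always add up to length(w)

# Averaging many draws should recover N * w_i / sum(w)
avg <- rowMeans(replicate(1e4, multinomial_offspring(w)))
rbind(average = avg, expected = length(w) * w / sum(w))
```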

Two resampling strategies come to the rescue: Metropolis and Rejection resampling. These methods, described in detail in the article, rely only on pair-wise weight comparisons and thus 1) are numerically stable and 2) bypass collective operations. Interestingly enough, the Metropolis resampler is theoretically biased but, when numerical stability is taken into account in single precision, proves “less biased” than the traditional resampling strategies (which are theoretically unbiased!), again when using half a million particles or more. It’s not too crazy to imagine that particle filters will soon be commonly run with millions of particles, hence the interest of studying the behaviour of resampling schemes in that regime.
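Here is a rough R sketch of the Metropolis resampler as I would summarize it: each particle runs a short Metropolis chain over the indices, proposing uniformly and accepting according to a ratio of two weights. The function name and the vectorized layout are my own choices for illustration, and the number of iterations B is a tuning parameter (the bias mentioned above decreases as B grows); see the article for the actual algorithm and how to choose B.

```r
# Metropolis resampling sketch: returns one ancestor index per particle.
# Only ratios of two weights are used; the sum of the weights is never computed.
metropolis_resample <- function(w, B) {
  N <- length(w)
  k <- seq_len(N)                           # chain i starts at index i
  for (b in seq_len(B)) {
    u <- runif(N)
    j <- sample.int(N, N, replace = TRUE)   # uniform proposals
    accept <- u * w[k] <= w[j]              # same as u <= w[j] / w[k], without dividing
    k[accept] <- j[accept]
  }
  k
}

set.seed(1)
w <- c(0.1, 0.2, 0.3, 0.4)
ancestors <- metropolis_resample(w, B = 50)
table(factor(ancestors, levels = seq_along(w)))
```

Every chain is independent of the others, which is why this maps so naturally onto one thread per particle on a GPU.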

Other practical aspects of resampling implementations are discussed in the article, such as whether the resampling step should be done on the CPU or on the GPU, taking into account the cost of copying the vectors into memory. Decision matrices are given (figure above), giving some indication on which is the best strategy in terms of performing resampling on CPU or GPU, and which resampling scheme to use.

All the numerical results of the article can be reproduced using the Resampling package for Libbi.

]]>

    library(ggplot2)      # used for the plots below
    library(wesanderson)  # on CRAN
    library(RShapeTarget) # available on https://github.com/pierrejacob/RShapeTarget/
    library(PAWL)         # on CRAN

Let’s invoke the *moustarget* distribution.

    # Build the target from the moustache SVG shipped with RShapeTarget
    shape <- create_target_from_shape(
      file_name = system.file(package = "RShapeTarget", "extdata/moustache.svg"),
      lambda = 5)
    # Initial chain positions: standard normal draws in the plane
    rinit <- function(size) matrix(rnorm(2*size), ncol = 2)
    moustarget <- target(name = "moustache", dimension = 2,
                         rinit = rinit, logdensity = shape$logd,
                         parameters = shape$algo_parameters)

This defines a target distribution represented by an SVG file using RShapeTarget. The target probability density function is defined on $\mathbb{R}^2$ and is proportional to $1$ on the segments described in the SVG files, and decreases exponentially fast to $0$ away from the segments. The density function of the *moustarget* is plotted below, a picture being worth a thousand words.

    # Evaluate the log density on a 300 x 300 grid covering the shape
    ranges <- apply(shape$bounding_box, 2, range)
    gridx <- seq(from=ranges[1,1], to=ranges[2,1], length.out=300)
    gridy <- seq(from=ranges[1,2], to=ranges[2,2], length.out=300)
    grid.df <- expand.grid(gridx, gridy)
    grid.df$logdensity <- moustarget@logdensity(cbind(grid.df$Var1, grid.df$Var2), moustarget@parameters)
    names(grid.df) <- c("x", "y", "z")
    # Plot the density with a Wes Anderson palette
    g2d <- ggplot(grid.df) + geom_raster(aes(x=x, y=y, fill=exp(z))) + xlab("X") + ylab("Y")
    g2d <- g2d + xlim(ranges[,1]) + ylim(ranges[,2])
    pal <- wes.palette(name = "GrandBudapest", type = "continuous")
    g2d <- g2d + scale_fill_gradientn(name = "density", colours = pal(50))
    g2d <- g2d + theme(legend.position = "bottom", legend.text = element_text(size = 10))
    g2d
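For readers who want to see what a shape-based log density can look like without installing anything, here is a toy version in base R. This is a sketch of the idea only and not the RShapeTarget implementation: the function names are hypothetical, the shape is two hand-written segments instead of an SVG file, and I am assuming that lambda plays the role of the exponential decay rate with respect to the distance to the nearest segment.

```r
# Euclidean distance from point p to the segment [a, b] (assumes a != b)
dist_to_segment <- function(p, a, b) {
  ab <- b - a
  t <- sum((p - a) * ab) / sum(ab * ab)   # position of the projection along the segment
  t <- min(max(t, 0), 1)                  # clamp to the segment's end points
  sqrt(sum((p - (a + t * ab))^2))
}

# Toy log density: 0 on the segments, decreasing linearly with the distance to them
shape_logdensity <- function(p, segments, lambda = 5) {
  d <- min(apply(segments, 1, function(s) dist_to_segment(p, s[1:2], s[3:4])))
  -lambda * d
}

# Two segments forming an "L", one per row: (x1, y1, x2, y2)
segments <- rbind(c(0, 0, 1, 0),
                  c(1, 0, 1, 1))
shape_logdensity(c(0.5, 0), segments)     # a point lying on the first segment
shape_logdensity(c(0.5, 0.2), segments)   # a point 0.2 away from the shape
```

With such a definition, a larger lambda gives a sharper picture and narrower modes, which is one way to make the sampling problem harder.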

There are various interesting aspects to note about this distribution. First it is very multi-modal and strongly non-Gaussian, thus providing an interesting toy problem for testing MCMC algorithms. Furthermore, sampling from the *moustarget* can be made arbitrarily difficult by pulling the moustache down, thus separating the moustache mode from the remaining probability mass around the eyes, ears and hat. Finally, note that the colours chosen to represent the density above approximately match the principal colours used in Grand Budapest Hotel by Wes Anderson. This is thanks to the awesome wesanderson package on CRAN. Obviously it is now very tempting to launch the Wang-Landau algorithm on this target with a spatial binning strategy, in order to try out the various palettes provided in wesanderson.

    # Wang-Landau with 10 parallel chains, binning along the vertical coordinate
    mhparameters <- tuningparameters(nchains = 10, niterations = 10000, storeall = TRUE)
    getPos <- function(points, logdensity) points[,2]
    explore_range <- c(-700,0)
    ncuts <- 20
    positionbinning <- binning(position = getPos, name = "position",
                               binrange = explore_range, ncuts = ncuts,
                               useLearningRate = TRUE, autobinning = FALSE)
    pawlresults <- pawl(target = moustarget, binning = positionbinning, AP = mhparameters, verbose = TRUE)
    pawlchains <- ConvertResults(pawlresults, verbose = FALSE)
    # Colour each point according to the bin it falls in
    locations <- positionbinning@getLocations(pawlresults$finalbins, pawlchains$X2)
    pawlchains$locations <- factor(locations)
    g <- ggplot(subset(pawlchains), aes(x=X1, y = X2, alpha = exp(logdens), size = exp(logdens), colour = locations)) +
      geom_point() + theme(legend.position="none") + xlab("X") + ylab("Y")
    g <- g + geom_hline(yintercept = pawlresults$finalbins)
    # Same plot with three different palettes
    pal <- wes.palette(name = "GrandBudapest", type = "continuous")
    print(g + scale_color_manual(values = pal(21)) + labs(title = "Moustarget in Grand Budapest colours"))
    pal <- wes.palette(name = "Darjeeling", type = "continuous")
    print(g + scale_color_manual(values = pal(21)) + labs(title = "Moustarget in Darjeeling colours"))
    pal <- wes.palette(name = "Zissou", type = "continuous")
    print(g + scale_color_manual(values = pal(21)) + labs(title = "Moustarget in Zissou colours"))

]]>

Hey,

There’s a nice exhibition open until May 26th at the British Library in London, entitled Beautiful Science: Picturing Data, Inspiring Insight. Various examples of data visualizations are shown, either historical or very modern, or even made especially for the exhibition. Definitely worth a detour if you happen to be in the area: you can see everything in 15 minutes.

In particular there are nice visualisations of historical climate data, gathered from the logbooks of the English East India Company, whose ships were crossing every possible sea at the beginning of the 19th century. The logbooks contain locations and daily weather reports, handwritten by the captains themselves. Turns out the logbooks are kept at the British Library itself and some of them are on display at the exhibition. More info on that project here: oldweather.org.

]]>

Besides having coded a pretty cool MCMC app in JavaScript, this guy Rasmus Bååth has started the Bayesian first aid project. The idea is that if there’s an R function called **blabla.test** performing test “blabla”, there should be a function **bayes.blabla.test** performing a similar test in a Bayesian framework, and showing the output in a similar way so that the user can easily compare both approaches. This post explains it all. JAGS and BEST seem to be the two main workhorses under the hood.

Kudos to Rasmus for this very practical approach, potentially very impactful. Maybe someday people will have to specify if they want a frequentist approach and not the other way around! (I had a dream, etc).

]]>

I’m Joseph Dureau, I have been an avid reader of this blog for a while now, and I’m very glad Pierre invited me to share a few things. Until a few months ago, I used to work on Bayesian inference methods for stochastic processes, with applications to epidemiology. Along with fellow colleagues from this past life, I have now taken the startup path, founding Standard Analytics. We’re looking into how web technologies can be used to enhance browsability, transparency and impact of scientific publications. Here’s a start on what we’ve been up to so far.

Let me just make it clear that everything I’m presenting is fully open source, and available here. I hope you’ll find it interesting, and we’re very excited to hear from you! Here it goes.

To date, the Web has developed most rapidly as a medium of documents for people rather than for data and information that can be processed automatically.

Berners-Lee et al, 2001

Since this sentence was written, twelve years ago, ambitious and collective initiatives have been undertaken to revolutionize what machines can do for us on the web. When I make a purchase online, my email service is able to understand it from the purchase confirmation email, communicate with the online store service, authenticate, obtain information on the delivery, and provide me with a real-time representation of where the item is located. Machines now have the means to process data in a smarter way, and to communicate over it!

However, when it comes to exchanging quantitative arguments, be it in a blog post or in a scientific article, web technology does not bring us much further than what can be done with pen and paper. Let’s take the following HTML snippet, inspired by a New York Times blog post:

    <p>
      The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).
    </p>

It would be rendered as:

The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).

While this argument can be understood by humans, machines only see it as a plain piece of text. There is no way for them to know that this is a correlation analysis, no easy way for them to provide me with the raw data that lies behind that number, and no way to efficiently bring me some perspective with results from other studies on the same subject. But wouldn’t that be helpful?

Inspiration to move forward can be found in the schema.org initiative, a collaboration between the main search engines, including Bing, Google, Yahoo! and Yandex, which started in 2011 to define a collection of schemas, or HTML tags, that webmasters can use to mark up their pages. Shared vocabularies of this type have brought machines to a first level of understanding: they can know when they are referring to the same notion, which sets the basis for smarter communication. The schemas also define an ontology, that is, sets of relations between these common notions. For example, they state that the description of an order usually contains a customer (which can be either a person or an organisation), as well as a billing address, a confirmation number, etc. Using this information, machines know which properties to expect when encountering an object of a given Class, and can be told what to do with their content.

Structured data has contributed to the explosion of services on the web, and not only from the *search* companies. For example, the BBC has built a stunning Wildlife finder that aggregates and organizes data from all over the web for every single living species on the planet! Another exciting example is the Veterans Job Bank initiative of the White House, that simply required employers willing to hire veterans to mark up their online job listing: it started with over 500 000 job opportunities for veterans in the US!

In our context, let’s just mark up our original example with RDFa Lite attributes:

    <p vocab="http://schema.org" prefix="stats: http://standardanalytics.io/stats/"
       resource="#obesity" typeof="Comment stats:Correlation">
      The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries
      (cor = <span property="stats:estimate">-0.45</span>,
      p = <span property="stats:pValue">0.06</span>).
    </p>

That would still be rendered as:

The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).

By referring to an RDF statistical vocabulary we have been working on, we make it explicit that the argument is based on a correlation estimate, which is provided along with a p-value expressing how strong the evidence is. In this way, metadata can be attached directly to each argument, bringing the browsability of scientific results to a whole new level! Based on these tags, machines now have a sense of the significance of a statistical test, making it virtually possible for a search engine to automatically provide me with perspective on any quantitative argument, based on alternative studies on the same question. In this example, I could figure out in a breeze if other studies corroborate such a relation between obesity and average time spent eating, ranked by their strength of evidence, or know about any related analysis that could bring me some deeper insights on the matter!

But more can still be done. The statement, even marked up, remains somewhat arbitrary. It leaves it to the readers to faithfully believe and interpret the numbers, or to manually make their way to the supplementary materials (in best case scenarios of scientific publishing) or write to the author for more details. This point has understandably fed lively discussions online and in newspapers on verifiability and reusability of results, casting skepticism over scientific findings in general. Again, schemas and linked data provide a simple solution here. What if I could attach to my argument an unambiguous description of how I got to the stated numbers, in the standard format that is used to exchange linked data on the web (a.k.a. JSON-LD), and referring to the same standardised vocabulary?

It would simply look like that:

    <p vocab="http://schema.org" prefix="stats: http://standardanalytics.io/stats/"
       resource="#obesity" typeof="Comment stats:Correlation">
      The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries
      (<a property="isBasedOnUrl" href="http://r.standardanalytics.io/obesity/0.0.0">
      cor = <span property="stats:estimate">-0.45</span>,
      p = <span property="stats:pValue">0.06</span></a>).
    </p>

Rendered as:

The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).

The isBasedOnUrl link takes me to a 5-star description of the analysis:

    {
      "@context": "https://r.standardanalytics.io/contexts/datapackage.jsonld",
      "name": "obesity-analysis",
      "version": "0.0.0",
      "description": "Correlation analysis based on 'Obesity and the Fastness of Food' blog post of the New York Times",
      "citation": "http://economix.blogs.nytimes.com/2009/05/05/obesity-and-the-fastness-of-food/",
      "license": "CC0-1.0",
      "repository": [
        {
          "codeRepository": "https://github.com/standard-analytics/blog.git",
          "path": "data/obesity-analysis"
        }
      ],
      "keywords": [ "Obesity", "Fast", "Food", "OECD" ],
      "author": {
        "name": "Joseph Dureau",
        "email": "joseph@standardanalytics.io"
      },
      "isBasedOnUrl": [ "https://r.standardanalytics.io/obesity/0.0.0" ],
      "analytics": [
        {
          "name": "correlationTest",
          "description": "Exploring links between obesity rates and average time spend eating.",
          "programmingLanguage": { "name": "R" },
          "runtime": "R",
          "targetProduct": { "operatingSystem": "Unix" },
          "sampleType": "scripts/corTest.R",
          "input": [ "obesity-analysis/0.0.0/dataset/OECD" ],
          "output": [ "obesity-analysis/0.0.0/dataset/obesityFoodFastness" ]
        }
      ],
      "dataset": [
        {
          "name": "obesityFoodFastness",
          "description": "Is the obesity rate in a country (percentage of national population with a body mass index higher than 30) correlated with the average number of minutes people spend eating each day?",
          "isBasedOnUrl": [ "obesity-analysis/0.0.0/analytics/correlationTest#3" ],
          "distribution": {
            "@context": { "@vocab": "http://standardanalytics.io/stats" },
            "@type": "Correlation",
            "covariate1": "obesity$ObesityRate",
            "covariate2": "obesity$MinutesSpentEating",
            "estimate": -0.45035,
            "statTest": {
              "@type": "TTest",
              "testStatistic": -2.0176,
              "df": 16,
              "pValue": 0.06073
            }
          }
        }
      ]
    }
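As a small illustration of what such a description already lets a reader verify, the numbers in the distribution block are tied together by the usual t-test for a Pearson correlation. The sketch below only uses the estimate and the degrees of freedom quoted above (the variable names are mine), and assumes a two-sided test.

```r
r  <- -0.45035   # "estimate" in the JSON-LD description
df <- 16         # "df" in the JSON-LD description

# t statistic of the Pearson correlation test, and its two-sided p-value
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

round(c(testStatistic = t_stat, pValue = p_value), 4)
```

The output should land very close to the testStatistic and pValue reported in the description, up to the rounding of the estimate.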

The raw data these results depend on can be retrieved by following once again the isBasedOnUrl. You are starting to see the logic. The quantitative argument is now based on fully transparent and reproducible calculations. Every moving piece is here for readers (and reviewers) to verify the argument and push the exploration further. More soon on these aspects with a first packaged analysis from the Reproducibility project in psychology!

What I am introducing here is no science-fiction. Pre-alpha versions of the registry and its client used to host your quantitative arguments are already available, and most importantly the linked data technology has proven to scale: since the launch of the project in 2011, over 5 million sites have been marked up with schema.org vocabulary! For you to start sharing 5-star quantitative arguments today, we have published a JSON-LD packaging tool for R users. Simply install the RJSONLD package, available on CRAN, and add a single line to your script for every analytic you wish to export in JSON-LD format:

    library(RJSONLD)  # provides RJSONLD.export
    obesity <- read.csv('obesity.csv')
    result <- cor.test(obesity$ObesityRate, obesity$MinutesSpentEating)
    # Export the test result as a JSON-LD file
    RJSONLD.export(result, 'ObesityFoodFastness.jsonld')

The current version of the statistical markup vocabulary I have been mentioning in this post covers the most common notions of statistical analysis, and we will be expanding it. Yet, it is still at a draft stage and its construction should not remain an individual initiative: we warmly encourage anyone to debate and enrich it! If you can’t find a Class for the statistical test you have been working with, if you’re more into Bayesian statistics, or if there is anything you’d like to suggest, let’s build up this vocabulary together!

**How can you contribute?** Shoot me an email (joseph@standardanalytics.io), or simply sign up for a Github account if you do not already have one, go to the issue tracker to provide feedback, propose extensions, and spread the word!

We have shown you here how to tag your quantitative arguments to make them browsable, and how to export shareable JSON-LD versions of your statistical results in R. To learn about automatically marking up and packaging your analysis to publish five-star science in three simple steps, follow our blog!

]]>