Statisfaction

Quantitative arguments as hypermedia

Posted in General by jodureau on 16 January 2014

Hey readers,

I’m Joseph Dureau. I have been an avid reader of this blog for a while now, and I’m very glad Pierre invited me to share a few things. Until a few months ago, I worked on Bayesian inference methods for stochastic processes, with applications to epidemiology. Along with colleagues from that past life, I have now taken the startup path, founding Standard Analytics. We’re looking into how web technologies can be used to enhance the browsability, transparency and impact of scientific publications. Here’s a start on what we’ve been up to so far.
Let me just make it clear that everything I’m presenting is fully open source, and available here. I hope you’ll find it interesting, and we’re very excited to hear from you! Here it goes…

To date, the Web has developed most rapidly as a medium of documents for people rather than for data and information that can be processed automatically.
Berners-Lee et al., 2001

Since this sentence was written, twelve years ago, ambitious and collective initiatives have been undertaken to revolutionize what machines can do for us on the web. When I make a purchase online, my email service can recognize it from the purchase confirmation email, communicate with the online store’s service, authenticate, obtain information on the delivery, and provide me with a real-time representation of where the item is located. Machines now have the means to process data in a smarter way, and to communicate about it!

However, when it comes to exchanging quantitative arguments, be it in a blog post or in a scientific article, web technology does not take us much further than pen and paper. Let’s take the following HTML snippet, inspired by a New York Times blog post:

<p>
The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).
</p>

It would be rendered as:

The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).

While this argument can be understood by humans, machines only see it as a plain piece of text. There is no way for them to know that this is a correlation analysis, no easy way for them to provide me with the raw data that lies behind that number, and no way to efficiently bring me perspective from other studies on the same subject. But wouldn’t that be helpful?

Inspiration to move forward can be found in the schema.org initiative, a collaboration between the main search engines, including Bing, Google, Yahoo! and Yandex, which started in 2011 to define a collection of schemas, or HTML tags, that webmasters can use to mark up their pages. Shared vocabularies of this kind have brought machines to a first level of understanding: they can know when they are referring to the same notion, which sets the basis for smarter communication. The schemas also define an ontology, a set of relations between these common notions. For example, they state that the description of an order usually contains a customer (who can be either a person or an organisation), as well as a billing address, a confirmation number, etc. Using this information, machines know which properties to expect when encountering an object of a given class, and can be told what to do with their content.
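To make this concrete, here is a minimal sketch of what such markup looks like for an order, using schema.org’s Order type and RDFa Lite attributes (the confirmation number, names and address are invented for illustration):

<div vocab="http://schema.org/" typeof="Order">
  Confirmation number:
  <span property="confirmationNumber">ABC-12345</span>,
  ordered by
  <span property="customer" typeof="Person">
    <span property="name">Jane Doe</span>
  </span>,
  billed to
  <span property="billingAddress" typeof="PostalAddress">
    <span property="streetAddress">1 Example Street</span>,
    <span property="addressLocality">Paris</span>
  </span>.
</div>

A crawler that understands schema.org can read the confirmation number, customer and billing address straight out of the page, while human readers just see ordinary prose.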

Structured data has contributed to the explosion of services on the web, and not only from the search companies. For example, the BBC has built a stunning Wildlife finder that aggregates and organizes data from all over the web for every single living species on the planet! Another exciting example is the Veterans Job Bank initiative of the White House, which simply required employers willing to hire veterans to mark up their online job listings: it started with over 500,000 job opportunities for veterans in the US!

In our context, let’s just mark up our original example with RDFa Lite attributes:

<p vocab="http://schema.org" prefix="stats: http://standardanalytics.io/stats/"
   resource="#obesity" typeof="Comment stats:Correlation">
The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries
(cor = <span property="stats:estimate">-0.45</span>,
p = <span property="stats:pValue">0.06</span>).
</p>

That would still be rendered as:

The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).

By referring to an RDF statistical vocabulary we have been working on, we make it explicit that the argument is based on a correlation estimate, provided along with a p-value expressing how strong the evidence is. In this way, metadata can be attached directly to each argument, bringing the browsability of scientific results to a whole new level! Based on these tags, machines now have a sense of the significance of a statistical test, making it possible for a search engine to automatically provide perspective on any quantitative argument, based on alternative studies of the same question. In this example, I could see at a glance whether other studies corroborate such a relation between obesity and average time spent eating, ranked by their strength of evidence, or learn about any related analysis that could give me deeper insights on the matter!
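Concretely, here is roughly the structured data an RDFa parser would extract from the snippet above, shown as JSON-LD (a sketch; the exact serialization depends on the parser):

{
  "@context": {
    "@vocab": "http://schema.org/",
    "stats": "http://standardanalytics.io/stats/"
  },
  "@id": "#obesity",
  "@type": [ "Comment", "stats:Correlation" ],
  "stats:estimate": "-0.45",
  "stats:pValue": "0.06"
}

This is the machine-readable counterpart of the sentence: a typed resource with two named statistical properties, rather than a string of characters.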

But more can still be done. The statement, even marked up, remains somewhat arbitrary. It leaves readers to take the numbers on faith, to manually make their way to the supplementary materials (in the best-case scenarios of scientific publishing), or to write to the author for more details. This point has understandably fed vivid discussions online and in newspapers about the verifiability and reusability of results, casting skepticism over scientific findings in general. Again, schemas and linked data provide a simple solution here. What if I could attach to my argument an unambiguous description of how I got to the stated numbers, in the standard format used to exchange linked data on the web (a.k.a. JSON-LD), referring to the same standardised vocabulary?

It would simply look like this:

<p vocab="http://schema.org" prefix="stats: http://standardanalytics.io/stats/"
   resource="#obesity" typeof="Comment stats:Correlation">
The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries
(<a property="isBasedOnUrl" href="http://r.standardanalytics.io/obesity/0.0.0">
cor = <span property="stats:estimate">-0.45</span>,
p = <span property="stats:pValue">0.06</span></a>).
</p>

Rendered as:

The French spend the most time per day eating, but have one of the lowest obesity rates among developed nations. Coincidence? Maybe not, there does seem to be some correlation among OECD countries (cor = -0.45, p = 0.06).

The isBasedOnUrl link takes me to a 5-star description of the analysis:

{
  "@context": "https://r.standardanalytics.io/contexts/datapackage.jsonld",
  "name": "obesity-analysis",
  "version": "0.0.0",
  "description": "Correlation analysis based on 'Obesity and the Fastness of Food' blog post of the New York Times",
  "citation": "http://economix.blogs.nytimes.com/2009/05/05/obesity-and-the-fastness-of-food/",
  "license": "CC0-1.0",
  "repository": [
    {
      "codeRepository": "https://github.com/standard-analytics/blog.git",
      "path": "data/obesity-analysis"
    }
  ],
  "keywords": [ "Obesity", "Fast", "Food", "OECD" ],
  "author": {
    "name": "Joseph Dureau",
    "email": "joseph@standardanalytics.io"
  },
  "isBasedOnUrl": [ "https://r.standardanalytics.io/obesity/0.0.0" ],
  "analytics": [
    {
      "name": "correlationTest",
      "description": "Exploring links between obesity rates and average time spent eating.",
      "programmingLanguage": { "name": "R" },
      "runtime": "R",
      "targetProduct": { "operatingSystem": "Unix" },
      "sampleType": "scripts/corTest.R",
      "input": [ "obesity-analysis/0.0.0/dataset/OECD" ],
      "output": [ "obesity-analysis/0.0.0/dataset/obesityFoodFastness" ]
    }
  ],
  "dataset": [
    {
      "name": "obesityFoodFastness",
      "description": "Is the obesity rate in a country (percentage of national population with a body mass index higher than 30) correlated with the average number of minutes people spend eating each day?",
      "isBasedOnUrl": [ "obesity-analysis/0.0.0/analytics/correlationTest#3" ],
      "distribution": {
        "@context": { "@vocab": "http://standardanalytics.io/stats" },
        "@type": "Correlation",
        "covariate1": "obesity$ObesityRate",
        "covariate2": "obesity$MinutesSpentEating",
        "estimate": -0.45035,
        "statTest": {
          "@type": "TTest",
          "testStatistic": -2.0176,
          "df": 16,
          "pValue": 0.06073
        }
      }
    }
  ]
}

The raw data these results depend on can be retrieved by once again following the isBasedOnUrl. You are starting to see the logic: the quantitative argument is now based on fully transparent and reproducible calculations. Every moving piece is there for readers (and reviewers) to verify the argument and push the exploration further. More soon on these aspects, with a first packaged analysis from the Reproducibility Project in psychology!

What I am introducing here is no science fiction. A pre-alpha version of the registry and its client, used to host your quantitative arguments, is already available, and most importantly the linked data technology has proven to scale: since the launch of the project in 2011, over 5 million sites have been marked up with the schema.org vocabulary! For you to start sharing 5-star quantitative arguments today, we have published a JSON-LD packaging tool for R users. Simply install the RJSONLD package, available on CRAN, and add a single line to your script for every analytic you wish to export in JSON-LD format:

obesity <- read.csv('obesity.csv')
result <- cor.test(obesity$ObesityRate, obesity$MinutesSpentEating)
RJSONLD.export(result, 'ObesityFoodFastness.jsonld')
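The same one-liner should carry over to other analyses. As a purely illustrative sketch, assuming RJSONLD handles other standard R test objects the same way, and with an invented HighIncome grouping column:

# Hypothetical example: exporting another standard R test result.
# 'HighIncome' is an invented two-level column, used only for illustration.
ttest <- t.test(obesity$MinutesSpentEating ~ obesity$HighIncome)
RJSONLD.export(ttest, 'EatingTimeByIncome.jsonld')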

The current version of the statistical markup vocabulary I have been mentioning in this post covers the most common notions of statistical analysis, and we will be expanding it. Yet, it is still at a draft stage and its construction should not remain an individual initiative: we warmly encourage anyone to debate and enrich it! If you can’t find a class for the statistical test you have been working with, if you’re more into Bayesian statistics, or if there is anything you’d like to suggest, let’s build up this vocabulary together!

How can you contribute? Shoot me an email (joseph@standardanalytics.io), or simply sign up for a GitHub account if you do not already have one, go to the issue tracker to provide feedback and propose extensions, and spread the word!

We have shown you here how to tag your quantitative arguments to make them browsable, and how to export shareable JSON-LD versions of your statistical results in R. To learn how to automatically mark up and package your analysis to publish 5-star science in three simple steps, follow our blog!
