Tuesday, 31 March 2009

BioStandards: Big Data require Big Standards

The term Big Data is increasingly being used.  In late 2008 it was used as the title of a special issue of Nature.

Here is a great quote from an Editorial in the issue entitled "Community cleverness required":

"The lack of standards, for instance, confounds many a researcher seeking to harness the diversity of knowledge now available on any chosen topic. All credit, then, to those in the vanguard of interoperability."

Coping with Big Data requires Big Standards - 'big' here meaning widely adopted rather than extensive.  In fact, a common approach is to define a 'minimum' standard.

BioStandards: The three pillars of reporting standards are scope, semantics and syntax.

A large part of ensuring that data can be shared is defining common ways to do so.  Reporting standards are composed of three aspects - scope, semantics and syntax.  In other words: checklists, controlled vocabularies (CVs) and ontologies, and formats.  In yet more words, checklists are text documents that describe the scope of a particular area of knowledge to be captured in a standardized way, CVs and ontologies are shared vocabularies that help express that knowledge semantically, and file formats define a syntax that can be used to physically capture and exchange that information.
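
As a rough illustration of how the three pillars fit together, here is a minimal Python sketch. The field names, the vocabulary terms and the function names are all hypothetical, invented for this example; they are not taken from any MIBBI checklist, OBO ontology or existing format.

```python
# Scope: a hypothetical "minimum" checklist of required fields.
REQUIRED_FIELDS = {"organism", "sample_type", "collection_date"}

# Semantics: a hypothetical controlled vocabulary for one field.
SAMPLE_TYPE_CV = {"soil", "water", "tissue"}


def validate_record(record):
    """Return a list of problems found in one metadata record."""
    problems = []
    # Checklist check: every required field must be present and non-empty.
    for field in sorted(REQUIRED_FIELDS):
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    # Vocabulary check: sample_type must use a term from the shared CV.
    sample_type = record.get("sample_type")
    if sample_type and sample_type not in SAMPLE_TYPE_CV:
        problems.append(f"sample_type not in controlled vocabulary: {sample_type}")
    return problems


def to_tab_delimited(record):
    """Syntax: serialize a record as a header line plus a value line."""
    fields = sorted(REQUIRED_FIELDS)
    header = "\t".join(fields)
    values = "\t".join(str(record.get(f, "")) for f in fields)
    return header + "\n" + values
```

A record that satisfies the checklist and uses a vocabulary term yields an empty list of problems from validate_record, and can then be written out with to_tab_delimited for exchange.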

Reporting Standards allow sharing of data and promote interoperability

There is an increasing number of "minimum information" checklist projects in biology.  These projects are defining essential - or minimal - descriptors for reporting different types of data.  They have self-organized into a community called MIBBI, which stands for "Minimum Information about a Biological or Biomedical Investigation".

A similar and more mature community exists for ontology development.  The OBO Foundry now contains over 60 community-driven ontology development projects.

Likewise, there is an increasing number of format projects, but they have yet to come together into a joined-up community.  We have therefore simply started to collect projects here on this page that we, for now, call the BioFormat community.

Data-intensive science - a new paradigm

Doug Kell maintains an excellent blog and has recently posted on the emergence of a fourth paradigm of generating new knowledge through scientific research.

His post entitled The fourth paradigm of scientific knowledge generation – data-intensive science describes a recent paper by Bell, Hey and Szalay entitled "Computer science. Beyond the data deluge".

The first two paradigms have long been recognized - theoretical and experimental.  More recently, an increasing number of computer simulation studies have formed a third.

Now with the growth of the internet, increased computing power, and improved aggregation of data - along with new high-throughput technologies - it is possible to do science by grabbing the vast quantities of data that are already out there.

YouTube launched YouTube.edu

To follow up briefly on a past post called "Videos promote the sharing of scientific data", YouTube has now launched YouTube.edu.  This specialized version of YouTube is full of videos from the best universities in the world - MIT, Stanford, Harvard, UC Berkeley, etc. - and the choice of subjects continues to expand.  Some universities now offer whole courses online in this fashion.

Monday, 30 March 2009

Data Formats

This is a list of 'format' projects that is referred to in a subsequent post entitled "Big Data requires Big Standards".

For now there is no formal umbrella community for format projects in the same sense as the MIBBI umbrella project for checklists.

We would like to suggest that such a community could form, for example under a "BioFormats" banner, and we are currently discussing this with the leads of the projects below.

UML, XML or tabular representations

UML is an excellent generic data modelling language, XML is ideal for data exchange, and tabular formats ready for import and processing in spreadsheet packages are ideal for biologists who want to easily access and analyze data. Formats taking advantage of the strengths of these different approaches have therefore been developed.  There are now increasing efforts to support both XML (for bioinformaticians) and tabular (for biologists) representations through the use of converters (e.g. XSLT).
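
To give a feel for what such a converter does, here is a minimal sketch that flattens a small XML document into a tab-delimited table. It uses Python's standard library rather than XSLT, and the XML layout (element names, attributes and sample values) is invented purely for illustration; it is not FuGE, ISA-TAB or any other real format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML layout, invented for this example only.
EXAMPLE_XML = """<samples>
  <sample id="S1"><organism>Escherichia coli</organism><assay>microarray</assay></sample>
  <sample id="S2"><organism>Homo sapiens</organism><assay>proteomics</assay></sample>
</samples>"""


def xml_to_tsv(xml_text):
    """Flatten each <sample> element into one tab-delimited row."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Header row, so the result opens cleanly in a spreadsheet package.
    writer.writerow(["id", "organism", "assay"])
    for sample in root.findall("sample"):
        writer.writerow([
            sample.get("id"),
            sample.findtext("organism", default=""),
            sample.findtext("assay", default=""),
        ])
    return out.getvalue()
```

Calling xml_to_tsv(EXAMPLE_XML) returns a header line followed by one row per sample, which can be saved with a .tsv or .txt extension and opened directly in a spreadsheet.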

Universal and Domain specific formats

There is increasing interest in harmonizing multi-omic studies. Doing so ideally requires the ability to represent multiple checklists in a single format.  From work on domain-specific formats, two interoperable formats are emerging: FuGE and ISA-TAB.  After these two, domain-specific formats are listed.

Universal Formats

Functional Genomics Experiment (FuGE) (UML/XML): http://fuge.sourceforge.net/

Investigation / Study / Assay (ISA) tab-delimited (TAB) format (Tabular):  http://isatab.sourceforge.net/

Both projects are being developed through extensions to cover an increasing range of biological investigations.

Domain-specific formats


Genomics and Metagenomics (16S studies)

HUPO-PSI XMLs: http://www.psidev.info

Systems Biology
SBML: http://sbml.org

UHTS Applications
Short Read Format (SRF): http://srf.sourceforge.net/

Monday, 9 March 2009

ESRC Sharing Research Data Workshop

The Economic and Social Data Service will be holding a workshop entitled "Sharing Research Data: Pioneers, Policies and Protocols".

Friday 13 March 2009

Park Crescent Conference Centre, London, W1W 5PN

ESRC workshop

Slides: All of the presentations from the workshop are found at the bottom of this page.

Monday, 2 March 2009

NEBC Data Policy: A Case Study

Continuing on the theme of the need for a repository of links to current data policies, here is the first link to an existing policy along with a brief description.  Hopefully more will follow.

The NERC Environmental Bioinformatics Centre (NEBC) has a data policy to cover 'omics data. NEBC's primary means of supporting its policy is its EnvBase data catalogue.

Here is a breakdown of the structure of the policy.

The first section sets the frame of reference. This policy inherits most of its top level principles from the NERC Data Policy, its parent document.  The document is generally brief and most of the detail is given in the four appendices.

The appendices cover:

1. Timeline for submission of data
2. Submission to public databases and compliance to recognized standards
3. Guidelines on submitting specific file formats
4. Terms of reference for the policy

The timeline in Appendix 1 includes the recommendation that researchers contact NEBC early in a grant to establish a baseline description of the data they expect to generate.

Appendix 2 lists all routine data types generated along with the appropriate public repositories and standards when available.

Appendix 3 covers accepted (recommended) file formats for submitting data directly to EnvBase when no public repository is available for them.

Appendix 4 describes the specific services available to NERC researchers funded under NERC Science Programmes that have 'bought into' the services provided by NEBC.

A Repository of Data Policies is Needed

In developing a new data policy, it is most effective and wise to look around for existing policies that can be adopted in part or whole or at least modified.  Good policies should already promulgate consensus on best practice in a particular community or for a specific data type.

Finding data policies on the web is not simple.  They are dispersed - when they exist - and finding them is time-consuming.

In a previous post, we centralized links to the top level policies of a range of funding bodies.  This is an obvious first step, but data policies can also be generated by Institutions, Data Centres and Projects.

To help promote awareness of these policies and support the development of new policies based upon them, it would be useful to have a centralized listing.

We are currently researching existing policies, no matter how small, and aim to post a collection here.   In the meantime, please feel free to email me links (dfield 'at' ceh.ac.uk).

Data Policies of Major Funding Agencies

Note: This is an updated version of Table 1 in the publication "Omics Data Sharing" (2009).

A range of funding agencies have published data policies. These documents are dispersed across the internet, normally being posted within the website of each funding agency. Here is a collection of data policies for funders in the US and the UK that support 'omics research.

Economic and Social Research Council Data Policy

Natural Environment Research Council (NERC) Data Management Policy

National Institutes of Health Data Sharing Policy

Gordon and Betty Moore Foundation Data Sharing Policy and Implementation Guidance (2005), Data Sharing Philosophy and Plan (2008)

Genome Canada Data Release and Resource Sharing Policy

Medical Research Council Data Sharing and Preservation Policy

Biotechnology and Biological Sciences Research Council (BBSRC) Data Sharing Policy

Wellcome Trust Policy on Data Management and Sharing

Genomics: GTL Program (Office of Biological and Environmental Research) Information and Data Sharing Policy