Thursday, 14 June 2012

DataCite 2012 Summer Meeting - A short report


The title of this year's DataCite Summer Meeting, held in Copenhagen, is "DIGITAL RESEARCH DATA IN PRACTICE: solutions for improving discovery, access and use". Here are some key messages from the event (for more, see also the tweets at #datacite, #biosharing, #gigascience).

Chair: Lee-Ann Coleman, British Library
Keynote presentation: Jonathan Grant, President of RAND Europe,  
The science of science.
  • The new science paradigm is based on the 4 As: Advocacy (make the case for research funding); Accountability (to taxpayers and donors); Analysis (what works in research), and Allocation (what to fund: institution, domains, people);
  • We must move from advocacy to accountability, and we need practical evidence for science policy;
  • We must analyse what works, also to speed things up: e.g., the time lag between spending on research and health gain is 17 years!
  • A survey has been devised that works to capture the impacts that have arisen from research grants.
Session 1: Discovery: It’s all about the metadata? Or is it? 
Chair: Jan Brase, TIB

Vishwas Chavan, GBIF
Towards next generation (data inclusive) publishing.
  • The global, federated infrastructure for sharing biodiversity datasets now holds over 327 million records;
  • This community has embraced the concept of 'data papers': over 70 are in the pipeline and will be published in 6 different (Pensoft) journals;
  • Beyond datasets, persistent identifiers also need to be created for specimens, sequences, taxon names, etc.;
  • A data usage index is needed at the publisher, dataset, thematic and country levels;
  • More in "Data publishing framework for primary biodiversity data", a thematic review in BMC Bioinformatics (2011).
Andrew Treloar, ANDS, 
Seeking Serendipity: repurposing DataCite metadata to augment ANDS discovery.
  • Data as a first-class object: from unstructured, disconnected, invisible, single-use data to managed, connected, findable and reusable data;
  • Research Data Australia supports creation or search of research datasets, collections, projects and organizations - gradually adding functionality.
Eefke Smit, STM Association,
Data and Publications; and how they belong together.
  • Deposition of datasets in archives continues to grow, surpassing journal articles in biomedical sciences;
  • The data publication pyramid: 75% of research data is never made openly available, and too many disciplines still lack a community-endorsed archive!
  • STM and DataCite have just launched a new statement: data must be deposited in trustworthy repositories; databases must also link back to the publication(s); support for the creation of best practices for the citation of datasets; and an invitation to sign the statement (link to the statement soon; it was just signed live at this meeting by Eefke and Adam Farquhar, President, DataCite!)
Session 2: Access: understanding technical, legal or ethical barriers to access
Chair: Brigitte Hausstein, GESIS

Matthew Woollard, UKDA
Persistent identifiers in practice. The UK Data Archive's approach.
  • The importance of recording changes: approx. 15% of the (social science) data collected is altered within the first year.
Michael Wilson, STFC,
Meeting a scientific facility provider's duty to maximise the value of data.
  • Defending patents on innovations derived from science: this may require producing datasets from over 20 years earlier!
  • Automatically capturing the facility lifecycle (via the ICAT Metadata Catalogue): from submission of the proposals to the publication of the results;
  • Currently DOIs are assigned at the higher entity level, not at the data file (individual record) level, though this may be needed soon;
  • Even if only <1% of the data is commercialized, the issue of what data to publish and how long the embargo should be remains unsolved;
  • The FP7 ENSURE project works to extend the state of the art in digital preservation.
Sunje Dallmeier-Tiessen, CERN,
DataCite & INSPIRE: facilitating data preservation and reuse in High-Energy Physics. 
  • In High-Energy Physics (HEP) projects, the discussion is about the levels of data description, and where it should be preserved, when associated with data publications;
  • INSPIRE has 50k users and 1 million records, in collaboration with the Durham HepData Project in the UK.
Session 3: Different flavours of use
Chair: Herbert Gruttemeier, INIST

Scott Edmunds, GigaScience, BGI Shenzhen,
Adventures in Data Citation.
  • Tackling the long tail of curation - democratization of big data; challenges with compliance to community standards; lack of interoperability across standards;
  • GigaScience, a joint venture between BMC and BGI, with an associated data hosting platform: GigaDB;
  • GigaScience issue 1 is due in July 2012 - dataset descriptions formatted in ISA-Tab;
  • E. coli #crowdsourcing: the first tweetome!
  • Data citation is still failing (e.g., Google Scholar does not take 'data publications' into account) and should be improved;
  • Minor quibbles: exports to citation managers; rules for versioning and for setting granularity (e.g., citing papers vs micropublications).
Jean-François Perrin, ILL,
DOI usage: a large neutron facility.
  • ILL raw data has been available online since 1973, and is released immediately after the experiment ends;
  • But it is necessary to add experimental metadata that helps in the interpretation of the raw data files, collecting it automatically where possible, and to link everything to the publication;
  • Pandata works to federate data infrastructure for synchrotron and neutron sources;
  • Standardizing the format is the first step.
Susanna Sansone, University of Oxford,
The ISA Commons - experiences from the field; link to my presentation.
  • Shared data have little or no value if they are not interpretable and, consequently, reusable; see the example provided by the work of Ioannidis et al., Nature Genetics, 2009;
  • Say no to 'data blobs', yes to verifiable, complete and structured information!
  • Importance of capturing all salient features of the experimental workflow, making the annotation explicit and discoverable: not too much, not too little, just 'right';
  • Many community norms and standards: lack of coordination, fragmentation and uneven coverage - see list at BioSharing;
  • ISA-Tab is a general-purpose experimental metadata tracking framework used by a growing number of communities in several bioscience domains; more in Sansone et al., Nature Genetics, 2012.
Closing remarks – Adam Farquhar, President, DataCite.
