BoF Session on Data Citation

Postby rauber » Tue Feb 12, 2013 12:40 pm

Proposed BoF Session at the RDA Gothenburg Meeting:

Data Citation

Reliably and efficiently citing entire datasets, or subsets of data within large and dynamically growing or changing datasets, constitutes a significant challenge for a range of research domains. Several approaches for assigning PIDs to support data citation at different levels in the process have been proposed. These range from individual PIDs assigned to individual data elements, via metadata-based approaches, to PIDs assigned to queries executed on time-stamped and versioned databases.
This BoF session aims to bring together a small group of experts to discuss the issues, requirements, advantages and shortcomings of existing approaches for efficiently citing entire or arbitrary subsets of large and dynamically changing datasets. We will be looking at different types of data and database management systems, ranging from structured data and SQL-based DBMS to semi-structured and graph-based databases. The goal is to ensure that subsections of data can be uniquely identified in the face of data being added, deleted or otherwise modified in a database, across longer periods of time, even when data is being migrated from one DBMS to another. We want to discuss and evaluate different existing approaches to this challenge, evaluate their advantages and shortcomings, and identify obstacles to their deployment in different settings, as well as potential recommendations for approaches under certain conditions. Amongst others, these should subsequently form a solid basis for citing data and linking to it from publications in an actionable manner. Yet, the focus of the discussion will NOT be on the definition of suitable PID systems and protocols or publications-to-data linking data structures, but on the best way to uniquely and persistently identify subsets of data in a dynamically changing setting.
Following an evaluation of the field, the barriers identified and potential solutions, one potential output of the session would be the prospect of establishing a Working Group within RDA to tackle and resolve this issue.

Chairs:
Andreas Rauber, Vienna University of Technology & SBA
Reagan Moore, UNC Chapel Hill
Dieter van Uytvanck, MPI

Interested experts:
Hans Pfeiffenberger, Alfred Wegener Institute for Polar and Marine Research
Daan Broeder, MPI
Peter Wittenburg, Max Planck Institute
Martina Stockhause, World Data Centre for Climate (WDCC)
Jan Brase, Technische Informationsbibliothek (TIB), German National Library of Science and Technology
Natalia Manola, National & Kapodistrian University of Athens
Jo McEntyre, EMBL - European Bioinformatics Institute
Stefan Proell, Secure Business Austria
Paul Uhlir, The National Academies


A number of other people have expressed interest in addressing this subject. As soon as the BoF Session has been approved, they will be contacted and asked to submit 1-2 page position papers on core challenges and/or potential solutions.

Note: A summary of the position statements to this BoF is available at
download/file.php?id=90
Last edited by rauber on Mon Mar 18, 2013 1:29 pm, edited 3 times in total.

Re: BoF Session on Data Citation

Postby rauber » Tue Feb 12, 2013 3:15 pm

If you are interested in joining this BoF session on Data Citation, want to contribute a position statement, or have any questions, please contact me at rauber@ifs.tuwien.ac.at or leave a reply to this thread.
We'll proceed with more detailed planning as soon as the session slots are finalized.

best regards, Andi

Re: BoF Session on Data Citation

Postby rduerr » Tue Feb 26, 2013 10:27 pm

I am very interested in this issue. You may be aware that the Federation of Earth Science Information Partners has made a start in this area (see http://commons.esipfed.org/node/308), but I would be interested in contributing to or at least evaluating the results of this group...

Re: BoF Session on Data Citation

Postby rauber » Wed Feb 27, 2013 8:10 am

Thanks for the pointer to the ESIP document. The BoF session will very likely focus on a subset of the issues of data citation identified, specifically "Subset Used", which is identified in the document as "This may be the most challenging aspect of data citation."

Details concerning the position statements will follow soon.

best regards, Andi

BoF Session on Data Citation: Call for Position Papers

Postby rauber » Wed Feb 27, 2013 5:27 pm

Dear Colleagues,

In order to ensure a successful and focused BoF session on Data Citation, we would ask everybody planning to attend this session to
*** prepare a short, informal position paper *** identifying the key challenge that you would like to address as part of the planned Working Group, focusing specifically on your goals for the 12- and 18-month periods. As stated above, these position statements should be rather short (2-3 pages) and informal, presenting your goals/challenges/proposed solutions.

Note that this BoF/WG activity is very focused, addressing specifically technical solutions on how to cite data that is potentially dynamically changing/growing, very large, complex, etc.

The aim, as envisaged so far, will NOT include issues such as what the most suitable PID solution would be (there is a separate WG addressing PIDs and interoperability), terminology issues, or aspects of economic/social impact of data citation, as these are covered by other initiatives. Of course we may decide to extend the scope of the WG depending on how the discussion evolves during the BoF session in Gothenburg, but please be aware that the initial aim of RDA WGs really is on providing actual solutions to problems on a very practical level, "picking low-hanging fruits", as well as establishing tighter cooperation with other initiatives in this area.

Please upload/post your position statement as a reply to this forum post, stating your name/affiliation, or - if you prefer - send a copy of it by email to my colleague Stefan Proell (sproell@sba-research.org), who will then compile the set and distribute it as pre-BoF reading material. Please try to have your material in by Mon., March 11.

We will NOT have the time during the BoF session to do formal presentations, so we would ask you to go through them in advance, using the time at the meeting for actual discussion and decisions on the goals for the WG, preparing the WG Case statement.

If you have any questions, feel free to contact us, or - preferably - use the forum for any up-front discussions on topics to be addressed:
viewtopic.php?f=5&t=58
The forum thread also contains a description of the BoF session.

Looking forward to seeing lots of position papers by Monday next week - and all of you in Gothenburg the week after.

best regards,
Andreas

Position Statement - Citing Dynamic Data Subsets

Postby sproell » Thu Mar 07, 2013 9:59 am

Dear Colleagues,
please find our position statement on citation of data subsets in dynamic environments below. As Andi Rauber will be offline next week, please use this forum for questions or email me (Stefan Pröll) at sproell@sba-research.org. I will circulate an email asking you to provide a short, informal position paper identifying the key challenge that you would like to address as part of the planned Working Group. I will then merge the position statements into a single document, which will serve as a reader for the session at the end of next week. Thanks in advance!

---
Position Statement - Citing Dynamic Data Subsets

Currently, research data sets are mostly treated as one entity, i.e. indivisible, static and referenceable as one unit. Approaches using persistent identifiers allow the identification of entire data sets, with subsets being addressed at best via textual descriptions. Assigning persistent identifiers (PIDs) to data portions of finer granularity, i.e. database rows or even cells, would require enormous numbers of unique identifiers and yield infeasible citations.

PID approaches are suited very well for static data, which should only serve as reference point once it has been created. Using the identification and additional metadata is sufficient to search, identify, and retrieve data again. However, many settings require us to go beyond these limitations and introduce scalable and machine-actionable methods that can be used in dynamically changing, very large databases. Also many data sets continue to grow and are updated as the data sets are used productively in experiments. In order to enable data citation in dynamic environments versioning support is required. Furthermore different stakeholders may be interested in diverse portions of the data. Hence, clearly defined subsets of the data need to be identifiable and citable as well.

We propose a model that uses queries for citing subsets of databases. By introducing versioning of the data and time stamping of queries, identical results can be obtained on retrieving and rerunning the query at later points in time. Identity of result tuple sequence is ensured by enforcing predefined sorting criteria for queries not explicitly defining unique sorting. Thus result sets are stable across time and consistent, even in case of lacking or ambiguous result set sortings in the initial query or in the case of changes in the internals of the database management system. Assigning PIDs to the query allows data citation of arbitrary subsets of very large data sets. Our model is agnostic of the actual PID-system used because only plain text queries will be referenced. The query can then be used to retrieve the corresponding subset of data in the correct version. The time stamp annotated to the query allows determining which version to choose. Additionally our approach is machine-actionable, hence it is transparent for the users.
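To make the model above more concrete, here is a minimal sketch in Python using the standard `sqlite3` module. It is only an illustration under stated assumptions, not the implementation proposed in this statement: the table layout (`valid_from`/`valid_to` columns), the integer logical timestamps, the `query_store` table, the example PID string and all function names are hypothetical.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: rows are never overwritten; each version has a validity interval.
conn.execute(
    "CREATE TABLE measurements ("
    " id INTEGER, value REAL,"
    " valid_from INTEGER NOT NULL,"  # logical timestamp at which this version was written
    " valid_to INTEGER)"             # NULL while this version is the current one
)
# Assumed query store: PID -> query text, execution timestamp, hash of the result.
conn.execute(
    "CREATE TABLE query_store ("
    " pid TEXT PRIMARY KEY, query_text TEXT, ts INTEGER, result_hash TEXT)"
)


def insert(row_id, value, ts):
    conn.execute("INSERT INTO measurements VALUES (?, ?, ?, NULL)", (row_id, value, ts))


def delete(row_id, ts):
    # Close the current version instead of physically deleting it.
    conn.execute(
        "UPDATE measurements SET valid_to = ? WHERE id = ? AND valid_to IS NULL",
        (ts, row_id),
    )


def update(row_id, value, ts):
    delete(row_id, ts)
    insert(row_id, value, ts)


def run_as_of(where_clause, ts):
    # Only versions valid at ts are visible; ORDER BY id enforces a unique sort.
    # Splicing where_clause into the SQL is acceptable for a sketch only.
    sql = (
        "SELECT id, value FROM measurements"
        f" WHERE ({where_clause})"
        " AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)"
        " ORDER BY id"
    )
    return conn.execute(sql, (ts, ts)).fetchall()


def result_hash(rows):
    return hashlib.sha256(repr(rows).encode()).hexdigest()


def cite(pid, where_clause, ts):
    # Execute the query, then store its text, timestamp and result hash under the PID.
    rows = run_as_of(where_clause, ts)
    conn.execute(
        "INSERT INTO query_store VALUES (?, ?, ?, ?)",
        (pid, where_clause, ts, result_hash(rows)),
    )
    return rows


def resolve_citation(pid):
    # Re-run the stored query against the data as it was at the stored timestamp.
    where_clause, ts, expected = conn.execute(
        "SELECT query_text, ts, result_hash FROM query_store WHERE pid = ?", (pid,)
    ).fetchone()
    rows = run_as_of(where_clause, ts)
    if result_hash(rows) != expected:
        raise ValueError("re-executed result differs from the cited result")
    return rows


insert(1, 10.0, ts=1)
insert(2, 20.0, ts=1)
insert(3, 5.0, ts=1)
cited = cite("pid:example/1", "value >= 10", ts=2)

# The database keeps changing after the citation was made.
update(1, 99.0, ts=3)
delete(2, ts=4)
insert(4, 50.0, ts=5)

resolved = resolve_citation("pid:example/1")
current = run_as_of("value >= 10", ts=6)
```

In this sketch `resolved` equals `cited` even though the current state of the table has moved on, and the stored hash gives a simple check that the re-executed result is the one originally cited.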

The planned steps that will follow are discussions with the different stakeholders, gathering their concerns and their requirements. This phase will be followed by a prototype implementation with a relational database management system (12 months). This system will be evaluated and assessed. After successful demonstration the model will be extended to other database concepts (NoSQL) and advanced features (e.g. stored procedures) (18 months).

Re: BoF Session on Data Citation

Postby rwmoore » Thu Mar 07, 2013 1:28 pm

There are multiple types of data sets for which a persistent identifier is desired for a subset of the data. Each data subset is generated by a specific operation. As proposed by Andreas Rauber, the operation combined with the PID for the data set can be a persistent identifier for the data subset. Consider a NetCDF or HDF5 file. The data subset is then represented by the data parsing operation library call and the file PID.

A more general approach is needed to represent data subsets for other data types. The Data Format Description Language (DFDL) effort from the Open Grid Forum provides a way to describe the structure of a file using an associated XML file. The operations on the file structure can then be described as operations on the XML description. The PID for a data subset then becomes the PID of the file and the operation characterization that is applied to the XML description of the structure of the file.

An even more general approach is to define the PID for the data subset as the combination of the PID of the file and a unique identifier for the workflow (set of procedures) that will be applied to the file. This requires assigning unique IDs to the workflows. The result is a two-part identifier, the PID of the file and the PID of the workflow. Note that the workflow could be a query on a database for a specified time interval. The workflow enables description of complex operations that require application of multiple processing steps to generate the desired data subset.
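As a rough illustration of such a two-part identifier, the following Python sketch pairs a file PID with a workflow PID and resolves the pair by applying the registered workflow to the file. Everything here is an assumption made for the example: the `#` separator, the in-memory registries, the example PID strings and the names `SubsetPID` and `resolve_subset` are hypothetical and not part of any existing PID system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SubsetPID:
    """Two-part identifier: PID of the file plus PID of the workflow."""
    file_pid: str
    workflow_pid: str

    def __str__(self) -> str:
        # The '#' separator is an arbitrary choice for this sketch.
        return f"{self.file_pid}#{self.workflow_pid}"

    @classmethod
    def parse(cls, text: str) -> "SubsetPID":
        file_pid, workflow_pid = text.rsplit("#", 1)
        return cls(file_pid, workflow_pid)


# Hypothetical registries standing in for a file store and a workflow registry.
FILES: Dict[str, List[float]] = {
    "hdl:example/file-1": [1.0, 2.0, 3.0, 4.0],
}
WORKFLOWS: Dict[str, Callable[[List[float]], List[float]]] = {
    "hdl:example/wf-first-two": lambda data: data[:2],
    "hdl:example/wf-above-two": lambda data: [x for x in data if x > 2.0],
}


def resolve_subset(pid: SubsetPID) -> List[float]:
    # Look up both parts and apply the workflow to the file content.
    workflow = WORKFLOWS[pid.workflow_pid]
    return workflow(FILES[pid.file_pid])


subset_pid = SubsetPID("hdl:example/file-1", "hdl:example/wf-above-two")
round_tripped = SubsetPID.parse(str(subset_pid))
subset = resolve_subset(round_tripped)
```

The same file PID combined with a different workflow PID identifies a different subset, so no additional identifiers have to be minted for the data itself.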

For registration of workflows to be feasible, workflow virtualization is needed. The workflow needs to be executable across a variety of operating systems, driven by evolution of computing technology. Workflow virtualization is supported by data grid middleware, which maps the I/O operations of procedures within the workflow to the I/O protocol of the storage system. The choice of virtualization mechanisms for the I/O mapping is an essential component of any approach that attempts to identify subsets of data sets.

Re: BoF Session on Data Citation - Position Statement WDCC

Postby Martina » Mon Mar 11, 2013 12:53 pm

I. Short- to Medium-term Goals (for static data)

WDCC's (World Data Center for Climate) main current interest is to introduce PIDs for the data management of our LTA. The idea is to assign PIDs at a level of very fine granularity, causing a vast increase in the number of PIDs. Operational infrastructures must be able to cope with this, which reflects a particular view that PIDs – of this particular kind at least – are supposed to be computationally cheap. Along with the introduction of PIDs for the LTA, WDCC’s existing DataCite DOI service is to be extended by PID services in two ways:

1. Enable the citation of subsets of a DOI data collection:
Subsets are typically collections selected in respect of e.g. variable, temporal coverage, spatial coverage, frequency, ensemble member and/or other parameters.

2. Enable the citation of subsets including parameters of multiple DOIs:
WDCC plans to provide additional PIDs for selected parameters, which are often analyzed and cited together, e.g. like single parameters for all model runs for a single CMIP5 experiment.

This can be achieved by creating a pre-defined selection of collections. As a consequence subsets of data may appear in multiple collections.
Both ways could be implemented through a framework for generic PID collections, which is a task the PID Information Types WG will be working towards. Such PID collections, at the conceptual and the technical level, are intentionally kept free of strong semantics. They are in this sense comparable to the basic Abstract Data Types of an object-oriented programming language or standard library, designed to hold things together without mandating particular use or meaning. One particular aspect is the ability to assign an individual element PID to multiple collections.
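A semantics-free PID collection of this kind could look roughly like the Python sketch below. It is not the framework the PID Information Types WG is working on; the class name `PIDCollections`, its methods and the example PID strings are hypothetical, and the only property it is meant to show is that one element PID can belong to several collections.

```python
from collections import defaultdict
from typing import Dict, List, Set


class PIDCollections:
    """Generic containers of PIDs that mandate no particular use or meaning."""

    def __init__(self) -> None:
        self._members: Dict[str, List[str]] = defaultdict(list)     # collection PID -> element PIDs
        self._memberships: Dict[str, Set[str]] = defaultdict(set)   # element PID -> collection PIDs

    def add(self, collection_pid: str, element_pid: str) -> None:
        # An element may be added to any number of collections, but only once to each.
        if collection_pid not in self._memberships[element_pid]:
            self._members[collection_pid].append(element_pid)
            self._memberships[element_pid].add(collection_pid)

    def members(self, collection_pid: str) -> List[str]:
        return list(self._members.get(collection_pid, []))

    def collections_of(self, element_pid: str) -> List[str]:
        return sorted(self._memberships.get(element_pid, set()))


registry = PIDCollections()
# Hypothetical PIDs: one collection per model run, one cross-cutting parameter collection.
registry.add("pid:example/run-1", "pid:example/run-1/tas")
registry.add("pid:example/run-1", "pid:example/run-1/pr")
registry.add("pid:example/run-2", "pid:example/run-2/tas")
registry.add("pid:example/all-tas", "pid:example/run-1/tas")
registry.add("pid:example/all-tas", "pid:example/run-2/tas")
```

Here the element `pid:example/run-1/tas` is a member of both its run collection and the pre-defined parameter collection, which is the multiple-membership aspect mentioned above.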

II. Long-term Goals

1. Enable the citation of user-defined (customized) data collections:
Users can create their own collections in order to cite precisely the used files of one or multiple DOI data collections or to make their data products citable. Thus a reference list would include the original data DOI and more specific PIDs.
For this, WDCC could set up an end-user web GUI where any user can stitch together custom data collections, which will receive their own PIDs.

2. Citation of data during the project phase (dynamic data):
DKRZ aims to support its users during the data creation / project collaboration phase as well. That requires version control and tracking of data update/changes. Version control involves the problem of distinguishing technical data versions (checksum changes) needed for data identification from scientific data versions (data content changes) for use in data citations. For the citation of data only relevant scientific data versions are appropriate, e.g. major data versions.
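One possible way to keep the two kinds of version apart is sketched below in Python. This is only an assumption about how such tracking might work, not a description of DKRZ's or WDCC's system: the class names, the `content_changed` flag supplied by the data producer, and the rule that only the first technical version of each scientific version is citable are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Version:
    technical: int   # increases whenever the checksum changes
    scientific: int  # increases only when the data content changes
    checksum: str


@dataclass
class VersionHistory:
    versions: List[Version] = field(default_factory=list)

    def record(self, payload: bytes, content_changed: bool) -> Version:
        checksum = hashlib.sha256(payload).hexdigest()
        if self.versions and self.versions[-1].checksum == checksum:
            return self.versions[-1]  # identical bytes: no new version at all
        if not self.versions:
            new = Version(1, 1, checksum)
        else:
            last = self.versions[-1]
            bump = 1 if content_changed else 0
            new = Version(last.technical + 1, last.scientific + bump, checksum)
        self.versions.append(new)
        return new

    def citable(self) -> List[Version]:
        # Assumed rule: cite the first technical version of each scientific version.
        seen = set()
        result = []
        for version in self.versions:
            if version.scientific not in seen:
                seen.add(version.scientific)
                result.append(version)
        return result


history = VersionHistory()
v1 = history.record(b"temperature=1.0", content_changed=True)
v2 = history.record(b"temperature=1.0\n", content_changed=False)  # reformatted only
v3 = history.record(b"temperature=1.5\n", content_changed=True)   # corrected value
```

With these three updates there are three technical versions but only two scientific ones, and only the latter would appear in data citations.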

Re: BoF Session on Data Citation

Postby sproell » Tue Mar 12, 2013 8:34 am

*Position Statement from Natalia Manola*

The scientific community, as well as society in general, needs to take specific strategic actions to safeguard its scientific production, bring together “all” data and publications in the context of appropriate interoperable global data infrastructures, and make them accessible for further use. In short, within the current data infrastructures landscape there is a need for a holistic approach to a “Global Scientific Information Infrastructure” that covers any type of research output. The data landscape is very heterogeneous, and output can include large or small datasets, literature, digitized collections, and other digital artifacts that accompany the communication among scientists and scholars. This clearly poses complex interoperability challenges. An approach from the publication side that provides flexible and open linking mechanisms to data sets offers the best chance of achieving interoperability of data infrastructures through the publication infrastructure.

This working group will provide a forum for scientific communities to explore and align their methodology on publication-data linking, across disciplines, including underlying data infrastructure mechanisms and researchers’ workflows/behaviours.

Working group specific topics


  • Types of relationships of linked data. A publication points to a data item because the item may have been used to produce the research described in the paper, or because the publication references it for some other reason. Is there a vocabulary that could be created and used?
  • Depending on the granularity of the data citation, researchers often need to cite a large number of data items (in the hundreds or thousands) within a paper. Can these data be described in the form of a collection (aggregation)? What is the best way to describe this collection in the case of a database and/or in the case of data files? How can these collections be represented in the system (i.e., metadata, PIDs to describe them) and made interoperable among infrastructures? Are these collections related to the scientific workflows, so that the publication-data linking process can be embedded in them (research objects, as in myExperiment)?
  • Versioning and capturing data citation changes made over time. How are data changes captured in this sense in the case of files or in the case of dbs (snapshots)?

Re: BoF Session on Data Citation

Postby mcentyre » Wed Mar 13, 2013 8:30 am

Position Statement from Jo McEntyre

The European Bioinformatics Institute maintains the world’s most comprehensive range of freely available and up-to-date molecular databases. Developed through international collaboration, our services let you share data, perform complex queries and analyse the results in different ways. The EBI also maintains Europe PubMed Central - the database of scientific literature. We have an ongoing interest in literature-data integration, and are currently exploring data citation networks through (a) identifying data accession numbers (PIDs) cited in full-text articles and (b) following references to the literature cited in database records. This information is shared through the public website and web services of Europe PubMed Central. We are interested in improving the infrastructural/technical aspects of data citation, further analysing the context of citations in text, and building tools for sharing this information widely with interested parties.
