Case Statement: Data Citation - Making Research Data Citable

Where we discuss case statements that are "final" (i.e. ready for review).

Moderators: Leif.Laaksonen, SaraPittonetGaiarin

Case Statement: Data Citation - Making Research Data Citable

Postby rauber » Wed May 08, 2013 4:41 pm

Dear Colleagues,

Following the BoF Session in Gothenburg (see discussion thread at http://forum.rd-alliance.org/viewtopic.php?f=5&t=58) and several individual follow-up discussions we have now created a first draft of the Case Statement for the Working Group on Data Citation - Making Research Data Citable, attached below.

We look forward to receiving your comments and questions, as well as expressions of interest in potential pilots for the solutions to be developed and tested, either from individual institutions or from pairs of content owners and solution implementers.

best regards,
Andreas Rauber
rauber
 
Posts: 12
Joined: Mon Jan 14, 2013 2:11 pm

Re: Case Statement: Data Citation - Making Research Data Cit

Postby tobiasweigel » Mon May 13, 2013 3:18 pm

Thanks for uploading this. I mostly agree with the overall charter and the list of issues described on the first pages.
However, I feel that there is some sort of topical break when it comes to the actual work plan. This might just be the result of cutting down the topic towards a subset that is actually achievable within the limited timeframe (something many WGs seem to be struggling with). I agree with the general strategy, which sounds achievable and pragmatic to me (roughly: develop - test - iterate). But as far as the content goes, I might be missing or misunderstanding something important, so here's a detailed question:

I am not sure what exactly is intended with the reference model in WP2. As you mention in the beginning, data may come in various formats, from databases to individual netCDF/HDF5 files. Does the reference model provide the means to describe which data is a subset of which other data? And how do PIDs relate to this, as they may be assigned at differing levels of granularity? Perhaps the unclear point for me is whether it's a data model of some kind or something else, such as a set of policies or a process model.
You also mention data collections to which new elements can be added in a time-stamped manner. How is this reflected in the work plan/work packages?

Best, Tobias
Tobias Weigel, DKRZ
tobiasweigel
 
Posts: 26
Joined: Mon Oct 29, 2012 9:01 am
Location: DKRZ, Hamburg

Re: Case Statement: Data Citation - Making Research Data Cit

Postby rauber » Tue May 14, 2013 2:01 pm

Hi Tobias,

Thanks for raising an important issue - maybe we should clarify this a bit better in the wording of the actual work.
While not wanting to preclude any other solution, there was a feeling that approaches assigning PIDs at different levels of granularity would not scale, as they would either require enormous numbers of PIDs being cited (e.g. when assigning PIDs on a data item level) or not support citation at a sufficient level of granularity (e.g. when assigning a PID to an entire data set).

The approach proposed, which most partners seem to want to test initially, would thus assign the PID to the query (or whichever other means of identifying the respective subset of data in any given data file/set) and ensure that this selection statement can be re-executed with identical results at a later point in time. In the case of SQL-style DBMS, this requires versioned and time-stamped databases and time-stamped queries, potentially ensuring unique sorts if the query itself is not sufficient and the order of result tuples is essential. In addition, hash keys could be computed over the result set (IDs) to allow verification of the identity of the result sets obtained. This should allow identification of arbitrary subsets of data in a transparent manner (i.e. not requiring specific actions on behalf of the researcher), while being scalable to even very large data sets. The goal is to implement this (and analogous solutions for other types of data sets/systems) in initial pilots, evaluate them, and see whether this principle can be recommended as a generic approach, or which specific advantages/disadvantages etc. have to be observed.
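To make the SQL-style case a bit more concrete, here is a minimal sketch in Python with the standard sqlite3 module. It is only an illustration under assumptions, not a proposed implementation: the table layout (valid_from/valid_to columns), the integer logical time stamps, the helper names and the PID string "example-pid-1" are all hypothetical. It shows a time-stamped query with a unique sort being re-executed after the data has changed, and a hash over the result set used to verify that the result is identical.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE measurements (
           id INTEGER,
           value REAL,
           valid_from INTEGER NOT NULL,  -- logical time stamp: version became current
           valid_to INTEGER              -- NULL while the version is still current
       )"""
)


def insert(rec_id, value, ts):
    """Add a new record version that is current from time stamp ts."""
    conn.execute(
        "INSERT INTO measurements VALUES (?, ?, ?, NULL)", (rec_id, value, ts)
    )


def update(rec_id, value, ts):
    """Close the current version instead of overwriting it, then add the new one."""
    conn.execute(
        "UPDATE measurements SET valid_to = ? WHERE id = ? AND valid_to IS NULL",
        (ts, rec_id),
    )
    insert(rec_id, value, ts)


def run_cited_query(min_value, as_of):
    """Execute the selection as of a time stamp; return the rows and their hash."""
    rows = conn.execute(
        "SELECT id, value FROM measurements "
        "WHERE value >= ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY id",  # unique sort so the order of result tuples is stable
        (min_value, as_of, as_of),
    ).fetchall()
    digest = hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()
    return rows, digest


insert(1, 10.0, ts=1)
insert(2, 20.0, ts=1)

# The researcher's selection is executed at logical time 2.
rows_then, hash_then = run_cited_query(5.0, as_of=2)

# Hypothetical citation record: the PID points at the query, not at the data items.
citation = {"pid": "example-pid-1", "min_value": 5.0, "as_of": 2, "hash": hash_then}

# The database keeps changing after the citation was made.
update(2, 1.0, ts=3)
insert(3, 30.0, ts=3)

# Resolving the citation later re-executes the same time-stamped query.
rows_later, hash_later = run_cited_query(citation["min_value"], citation["as_of"])
```

In this sketch the re-executed query returns the same rows and the same hash as at citation time, although the current state of the table has changed in the meantime. A real system would of course have to decide what exactly is hashed and how time stamps are assigned; those choices are left open here.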

Which PID is to be used, as well as which metadata needs to be added to the query identified by the PID in order to allow attribution as well as human-readable interpretation of a citation, is to be dealt with in a separate WG, all under the umbrella of the Data Publication IG.

Does this help and make sense? We should probably add this more verbatim in the WG description.

Andi
rauber
 
Posts: 12
Joined: Mon Jan 14, 2013 2:11 pm

Re: Case Statement: Data Citation - Making Research Data Cit

Postby rauber » Tue May 14, 2013 2:10 pm

A quick comment also on other comments: thanks to all of you who have contacted me via email with comments on the draft Case Statement. I have collected these and will incorporate them. In a nutshell, they concern specifically the extension of the stakeholder communities to include

- Libraries, who provide data citation related services, and who link between the various communities, i.e. researchers, publishers, data providers, etc.
- Funding Agencies and Evaluators of Research, who want to know what they get for their money
- Entrepreneurs, who may see value added / businesses enabled, such as currently observed with software companies providing services based on open data.

I hope I've been able to answer most of the other questions submitted by email - and I'd like to encourage everybody to use this public channel for questions, answers and suggestions to share the information and views right away, so that we can shape this together, and see which pilots would be most interesting to launch initially.

best, andi
rauber
 
Posts: 12
Joined: Mon Jan 14, 2013 2:11 pm

Re: Case Statement: Data Citation - Making Research Data Cit

Postby pcruse » Thu May 23, 2013 12:16 pm

Thanks to the group for putting the case statement together. My comment concerns the 4 stakeholder communities (data providers, solution providers, researchers, community) that have so far been identified. I would suggest that the case statement also includes libraries/information providers in the mix of stakeholders. Libraries are a neutral services entity and work with researchers, publishers, and the broader community to provide enduring access to information, including data. This is part of our core mission. Libraries have the ability to work equally with publishers and societies on many of the items included in the case statement outlines. Libraries can also work with researchers to create incentives to take up many of the actions that you have identified. Finally, DataCite, an important component of data citation, is a group of libraries that have come together to push forward data citation.

Here are a handful of specific actions that libraries can take:
- encourage data citation with outreach and services, where appropriate and possible
- educate researchers on importance and benefits of data citation
- work with publishers to encourage the inclusion of cited data in traditional scholarly literature
- foster an environment of attribution for all scholarly works by encouraging scholars to cite their data and to ask for data citation in the journals where they publish.

Trisha Cruse
UC Curation Center (UC3)
California Digital Library
pcruse
 
Posts: 1
Joined: Thu May 23, 2013 12:01 pm

Re: Case Statement: Data Citation - Making Research Data Cit

Postby tobiasweigel » Tue May 28, 2013 7:59 am

rauber wrote: The approach proposed, which most partners seem to want to test initially, would thus assign the PID to the query (or whichever other means of identifying the respective subset of data in any given data file/set) and ensure that this selection statement can be re-executed with identical results at a later point in time. In the case of SQL-style DBMS, this requires versioned and time-stamped databases and time-stamped queries, potentially ensuring unique sorts if the query itself is not sufficient and the order of result tuples is essential. In addition, hash keys could be computed over the result set (IDs) to allow verification of the identity of the result sets obtained. This should allow identification of arbitrary subsets of data in a transparent manner (i.e. not requiring specific actions on behalf of the researcher), while being scalable to even very large data sets. The goal is to implement this (and analogous solutions for other types of data sets/systems) in initial pilots, evaluate them, and see whether this principle can be recommended as a generic approach, or which specific advantages/disadvantages etc. have to be observed.


Hi Andi,

thanks for the explanation. You are right, this should be added more verbatim and more prominently in the description. It seems you already have a detailed solution in mind when you talk about re-executable queries, which is helpful given the limited timeframe; I'd guess not everyone who cares about the larger citation issue will agree with this solution, but then again, RDA WGs do not work in a "one size fits all" fashion and in the end, actually working solutions count. I think this particular view should be clarified early on so you can control potential scope creep. Later WGs/the IG should pick up differing aspects.

Best, Tobias
Tobias Weigel, DKRZ
tobiasweigel
 
Posts: 26
Joined: Mon Oct 29, 2012 9:01 am
Location: DKRZ, Hamburg

Re: Case Statement: Data Citation - Making Research Data Cit

Postby rauber » Tue May 28, 2013 11:14 am

pcruse wrote:Thanks to the group for putting the case statement together. My comment concerns the 4 stakeholder communities (data providers, solution providers, researchers, community) that have so far been identified. I would suggest that the case statement also includes libraries/information providers in the mix of stakeholders. ...


Hi Trisha,

Thanks for pointing this out - libraries will for sure be included as a stakeholder community in the next version of the draft case statement. We also need to ensure a close collaboration with the other WGs working on PIDs and metadata to be associated with data citation to support attribution etc. Maybe some libraries who also actually hold data would be interested in testing the feasibility of the approaches elaborated in this WG.

Ciao, andi
rauber
 
Posts: 12
Joined: Mon Jan 14, 2013 2:11 pm

Re: Case Statement: Data Citation - Making Research Data Cit

Postby rauber » Tue May 28, 2013 11:25 am

tobiasweigel wrote:
Hi Andi,

thanks for the explanation. You are right, this should be added more verbatim and more prominently in the description. It seems you already have a detailed solution in mind when you talk about re-executable queries, which is helpful given the limited timeframe; I'd guess not everyone who cares about the larger citation issue will agree with this solution, but then again, RDA WGs do not work in a "one size fits all" fashion and in the end, actually working solutions count. I think this particular view should be clarified early on so you can control potential scope creep. Later WGs/the IG should pick up differing aspects.

Best, Tobias


Hi Tobias,

Ok, we'll revise the description to make this more specific!

But, just to make sure:
We definitely do not want to limit the WG to one specific implementation/approach, and definitely not to just the one example for SQL-style DBMS! This was just meant as a specific example of how to implement it for that type of data. The basic notion is to have some form of data representation and some form of accessing subsets of that data - which is the situation giving rise to the need for citing subsets of data when data is being changed, updated, growing, etc. (if it's only static, indivisible blocks of data, there is little need for entirely new models).
Now, as soon as you have some data and some means of access, the principle should work to assign the PID to the specific "selection", i.e. the operation that identifies the subset of the data - and to ensure that this process is deterministically repeatable at later points in time. In the SQL example, that would be a versioned and time-stamped database, potentially with unique sorting. In the WG we definitely want to explore several options, particularly for non-SQL style data.
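As a rough illustration of the same principle for non-SQL data, the following Python sketch treats a collection as an append-only list of time-stamped items (for instance file names) and stores the "selection" as a small descriptor. Everything here is an assumption made for the example: the Item class, the prefix-based selection, the integer time stamps and the helper names resolve and fingerprint are hypothetical and not part of any agreed model.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str      # e.g. a file name within the collection
    added_at: int  # logical time stamp at which the item was added


def resolve(collection, selection):
    """Re-execute a stored selection against an append-only collection."""
    names = [
        item.name
        for item in collection
        if item.added_at <= selection["as_of"]
        and item.name.startswith(selection["prefix"])
    ]
    return sorted(names)  # fixed order keeps the result deterministic


def fingerprint(names):
    """Hash over the selected names, usable to verify a later re-execution."""
    return hashlib.sha256(json.dumps(names).encode("utf-8")).hexdigest()


collection = [
    Item("run1/a.nc", 1),
    Item("run1/b.nc", 2),
    Item("run2/a.nc", 2),
]

# This descriptor, not the individual items, is what would receive the PID.
selection = {"prefix": "run1/", "as_of": 2}
cited = resolve(collection, selection)
cited_hash = fingerprint(cited)

# The collection grows after the citation was made.
collection.append(Item("run1/c.nc", 5))

# Resolving the same selection later ignores items added after its time stamp.
resolved_later = resolve(collection, selection)
```

Because items are only ever appended with a time stamp, the stored selection yields the same subset before and after the collection grows. Handling deletions or in-place changes to items would need additional versioning, which this sketch does not cover.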

I'd be happy to see members stepping forward to propose a solution or run a pilot for other types of data - as well as, of course, discussing deficiencies of the current concept! We have had a few rounds of sanity checks on the principles outlined above, and they seem to be fine so far, in terms of stability across different systems, scalability in some scenarios, etc. - but we definitely want to keep discussing this, identifying potential downsides - and actually testing it in a few pilots.

Ciao, andi
rauber
 
Posts: 12
Joined: Mon Jan 14, 2013 2:11 pm

Re: Case Statement: Data Citation - Making Research Data Cit

Postby dietuyt » Fri Jun 07, 2013 1:39 pm

Dear all,

I would like to mention the CLARIN concept of the virtual collection registry, which seems to fit very well into this discussion:

http://www.clarin.eu/sites/default/file ... tGuide.pdf

http://clarin.ids-mannheim.de/vcr/app/public

One example of such a virtual collection could be the recordings mentioned in one specific book:

http://www.mpi.nl/trobriand

best regards,
Dieter
dietuyt
 
Posts: 2
Joined: Fri Nov 09, 2012 1:10 pm

