Data Foundation and Terminology

Where we discuss anything related to the RDA (catchall)

Moderators: Leif.Laaksonen, SaraPittonetGaiarin

Re: Data Foundation and Terminology

Postby pwittenburg » Sun Mar 03, 2013 4:43 pm


I read through your version. It suggests some radical changes that I hesitate to integrate without another broad discussion, and it reopens some points on which we had so far come to different conclusions. Some of your suggestions are better formulations of what we want to express. Where the current document says "describe" you suggest "develop and document", which I like more and which is, I guess, what we actually wanted to express.
But you also suggest including a number of terms in the charter. We tried to avoid this, since the terms are the result of the process. Instead we speak about "data organization", which is a vague term, but we also refer to some papers that describe the scope we have in mind. We should not change that.
You also suggest including a formal representation such as SKOS or OWL. It was the explicit wish of some colleagues not to start with a formal representation in the first instance. In the end I agree that it would be nice to have a machine-interpretable representation, but we need to respect these colleagues, which is why we used the term "ISO-like", referring to models such as ISO 11175 etc. It could also be a wiki in the first instance, since as soon as a formalism is introduced we will immediately have a discussion about the formalism, and this would distract us from doing the real work.

So thanks a lot for your comments. We will certainly discuss them in Gothenburg and subsequent virtual meetings. And I hope that you will be present in Gothenburg.


Re: Data Foundation and Terminology

Postby pwittenburg » Thu Mar 14, 2013 9:03 am

Dear all,

Here is a new version of the concept note, which some people have been writing. This version has not been discussed broadly; the text that has been added is marked in red.
Let us prepare the session in Gothenburg:
- Please read the case statement which we submitted to the council.
- Please read the concept note, which sets out the convictions of some of us.
- Please look at some of the papers mentioned. We added one paper referring to the Named Data Networking initiative.

For the session I suggest the following agenda:
1. Introduction to the goals of this CWG (this needs to be done since there will be new people showing up)
2. Discussion of the Case Statement
3. Discussion of the Concept Note and relevant Papers
4. Next Steps - How to move on


Re: Data Foundation and Terminology

Postby rwmoore » Thu Mar 14, 2013 4:13 pm

The concept note is a reasonable description of data management topics.

I am quite interested in the evolution of data management infrastructure from
a focus on data, to include a focus on information, and now to include a focus
on knowledge. A simple way to characterize this is:

Data management: This was supported by file systems.

Information management: This is the association of a context with each data
file, and has been supported by digital libraries.

Knowledge management: This is the active evaluation of relationships on the
context associated with each file, and is typically implemented as policy-
controlled procedures.

We can apply policies/procedures to control the creation of data files, implying
that instead of registering a static data file, we can register the procedure that
creates the file.
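
A minimal Python sketch of what "registering the procedure instead of the file" could look like. This is only an illustration: `Registry`, `register_file`, `register_procedure`, `resolve` and `make_summary` are hypothetical names chosen for the example and are not part of iRODS or any other system mentioned in this thread.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class Registry:
    """Toy registry: a logical name maps to static bytes or to a procedure."""
    entries: Dict[str, Union[bytes, Callable[[], bytes]]] = field(default_factory=dict)

    def register_file(self, name: str, content: bytes) -> None:
        # Classic case: the static content itself is registered.
        self.entries[name] = content

    def register_procedure(self, name: str, procedure: Callable[[], bytes]) -> None:
        # Alternative: register the procedure that creates the content.
        self.entries[name] = procedure

    def resolve(self, name: str) -> bytes:
        # A registered procedure is run on access; static bytes are returned as-is.
        entry = self.entries[name]
        return entry() if callable(entry) else entry


def make_summary() -> bytes:
    """Hypothetical procedure that derives a summary from some readings."""
    readings = [3, 5, 8]
    return f"mean={sum(readings) / len(readings):.2f}".encode()


registry = Registry()
registry.register_file("raw/readings.txt", b"3,5,8")
registry.register_procedure("derived/summary.txt", make_summary)
```

Both names are resolved the same way by a caller; only the registry knows that one of them is produced on demand.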

We can also apply policies/procedures to manage properties associated with
the environment that manages the data. It turns out that the information
needed to manage the environment is much more extensive than the context
that defines the data. We currently support 338 attributes for state
information in iRODS, of which about 10% are associated with the data file. I
expect a definition for each of these attributes will be needed for a complete
description of data management concepts.


Re: Data Foundation and Terminology

Postby pwittenburg » Mon Mar 25, 2013 1:38 pm

Dear DFT Session participants,

Here is the list of people who participated (some only in part) in the DFT session in Gothenburg. Please check your names, emails etc.
Let us know if something is missing or wrong.


Notes on Data Foundation and Terminology WG meeting

Postby Gary » Sat Mar 30, 2013 3:55 pm

Thanks to all participants at the DFT WG meeting in Goteborg. Attached are the edited notes from the meeting; thanks to Stan Ahalt for taking them.

The notes include some action items and next steps, which are summarized below.

A broad outline of our next steps (Phase 2, March 2013 - July 2013) is given in our case statement:
Broadly discuss and extend the basic conceptualization note via the OIF and through video-conference interactions. Instead of using the email list, more interaction will take place via the forum, to show others what is happening in the group. However, people can still access the email-list exchange to inform themselves.

Based on the WG meeting in Goteborg, some refined action items were discussed and agreed to.
Between now and the Washington meeting (16-18 September), we will:

• Compactly define, through the use of a Use Case template, at a minimum our 5 use cases
• Use these use cases to generate requirements that can be used to test the basic model
• Hold virtual working meetings approximately every 2 weeks
  o Hold a discussion with other WG leaders in 2 weeks
  o Finalize the official DFT WG Case Statement
  o Send explicit invitations to other WGs
• Assign the use cases within a short space of time
• Test the proposed model(s) against the Use Cases
• Accept new concept notes; these should be short and to the point
• Define our interactions with the other WGs as much as possible
  o A session with the other WGs will probably be held within approx. 3 weeks

I also include as an attachment my briefing slides on the Use Case presented at the meeting.

Re: Data Foundation and Terminology

Postby pfeiff » Wed Apr 10, 2013 8:07 am


You said: "We currently support 338 attributes for state
information in iRODS, of which about 10% are associated with the data file. I
expect a definition for each of these attributes will be needed for a complete
description of data management concepts."

Whoever ends up using these 338 attributes, it cannot be the average scientist, or even a data scientist.
I even suspect that information specialists at highly competent data centres will not be able to use them
consistently and interoperably around the world.
So I guess these attributes and their associated concepts would need to be local to data centres and opaque to the "rest of us".

I derive this line of thinking from experience with scientific authors, reviewers and editors, who struggle with the few additional concepts of a data journal with a public discussion stage that go beyond a simple submit, accept, reject, revise, ... workflow.


Is DCAT a useful vocabulary for us?

Postby Gary » Thu Apr 11, 2013 2:50 pm

Is anyone familiar with the Data Catalog Vocabulary (DCAT) - an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web?

This includes a graphical model of the concepts.

It has some useful concepts with attributes and relations that we might discuss.
Their organizing concepts seem to be:

(Data) Catalogs
Datasets in a catalog and
Distribution which represents an accessible form of a dataset

Here is an example of a distinction they make between a Catalog Record and a catalog:

Catalog Record exists for catalogs where a distinction is made between metadata about a dataset and metadata about the dataset's entry in the catalog. For example, the publication date property of the dataset reflects the date when the information was originally made available by the publishing agency, while the publication date of the catalog record is the date when the dataset was added to the catalog. In cases where both dates differ, or where only the latter is known, the publication date should only be specified for the catalog record.
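
To make the distinction concrete, here is a rough Python sketch of the four concepts. It uses plain dataclasses rather than RDF; the class names follow the DCAT concept names, but the field names, the `add` helper and the example values (titles, dates, the example.org URL) are assumptions made up for illustration, not the DCAT serialization.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Distribution:
    """An accessible form of a dataset (cf. the DCAT Distribution concept)."""
    access_url: str
    media_type: str


@dataclass
class Dataset:
    """`issued` is when the publishing agency first made the data available."""
    title: str
    issued: Optional[date] = None
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class CatalogRecord:
    """Metadata about the dataset's entry in the catalog, not about the dataset."""
    dataset: Dataset
    issued: date  # when the dataset was added to the catalog


@dataclass
class Catalog:
    title: str
    records: List[CatalogRecord] = field(default_factory=list)

    def add(self, dataset: Dataset, added_on: date) -> CatalogRecord:
        record = CatalogRecord(dataset=dataset, issued=added_on)
        self.records.append(record)
        return record


catalog = Catalog(title="Example catalog")

# Both dates known and different: each is stored at its own level.
rainfall = Dataset(
    title="Rainfall 2011",
    issued=date(2011, 6, 1),
    distributions=[Distribution("https://example.org/rainfall.csv", "text/csv")],
)
record = catalog.add(rainfall, added_on=date(2013, 4, 11))

# Only the catalog date known: the dataset's own publication date stays empty.
legacy = Dataset(title="Legacy survey")
legacy_record = catalog.add(legacy, added_on=date(2013, 4, 11))
```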

A technical memo for the Data Foundation and Terminology WG

Postby Gary » Wed Apr 17, 2013 8:27 pm

David Dubin, Karen M. Wickett, Simone Sacchi and Allen H. Renear of the University of Illinois Center for Informatics Research in Science and Scholarship (CIRSS) have provided a technical memo to the working group, titled:

Motivating Questions for Understanding Data

The text of the memo is copied below and a copy of the memo is attached.

Technical Memo for the RDA Initiative Data Foundation and Terminology Working Group

Contributors: David Dubin, Karen M. Wickett, Simone Sacchi, Allen H. Renear
University of Illinois Center for Informatics Research in Science and Scholarship

As we’ve stated in our papers, the CIRSS Data Concept Group has developed two models relating to scientific data. Our aim is a precise, shared understanding of concepts such as dataset, format, encoding, version, file, and collection. The SAM and BRM models are motivated by issues we feel are relevant across a broad range of disciplines, applications, and degrees of data complexity. We feel the issues will arise no matter what terminology and definitions are adopted by this working group.

Since our specific proposals are presented in papers that have already been shared with the group, we summarize three of the most basic motivating questions below.

1. Which of the key attributes of data are essential, and which are contextual? Intrinsic properties of data (e.g., structural properties) are important for data processing, establishing data identity, expressive limitations, etc. But the most salient and problematic properties of data are relational and contingent: those linking symbols to content and one encoding level to another, for example. Data acquire these relational properties through their participation in events that may have either humans or instruments as agents. While intrinsic properties of data can be discovered or rediscovered through analysis, relational properties have to be documented—it may be possible to reconstruct them, but documented evidence often plays a role in those discoveries.

2. How do we account for change in data management? What is really going on when we say data is “generated” or “modified”? One can understand data as social objects capable of coming into existence, undergoing genuine change, and being destroyed. But much of our familiar discourse about data, if taken literally, would identify them with ideal abstract objects, such as graphs, sequences, relations, or sets. On that view, data do not literally come into existence or undergo literal change, but enter into relationships of discovery or individuation. Both of these views present problems. On the first view, we cannot identify data with any particular digital object—different bit sequences, for example, may constitute the “same” data at different times. The second view requires that we reframe our usual notions of digital objects undergoing change: we know what a particular unit of data is, but must adopt a more complicated account of how it came to be what it is.

3. How is individual data identity related to data as a class or category? Identity requires a basis, and what makes something data may turn on different criteria than what makes something some particular data. If (as we suggest in our SAM model) data are abstract particulars, then their individual identity conditions are supplied by intrinsic properties, but their status as data emerges through their provenance. On a social objects view of data, individual and class identity are likely to be more closely coupled (both connected to provenance events).

We hope that issues such as these will be useful for framing options and decisions, and look forward to contributing to working group discussions.
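
As a small illustration of questions 1 and 2 in the memo, the Python sketch below encodes one and the same table in two ways. It is not taken from the memo or from the SAM or BRM models; the table, the helper functions and the `provenance` values are all made up for the example. The point is only that the bit sequences differ while the decoded content is identical, and that intrinsic properties (such as a checksum) can be recomputed from the bytes whereas relational ones have to be written down.

```python
import csv
import hashlib
import io
import json

# The "same" data as abstract content: a small table of rows.
rows = [("station", "temp_c"), ("A", "12.5"), ("B", "9.0")]


def encode_csv(table):
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(table)
    return buffer.getvalue().encode("utf-8")


def encode_json(table):
    return json.dumps([list(row) for row in table]).encode("utf-8")


def decode_csv(blob):
    text = io.StringIO(blob.decode("utf-8"), newline="")
    return [tuple(row) for row in csv.reader(text)]


def decode_json(blob):
    return [tuple(row) for row in json.loads(blob.decode("utf-8"))]


csv_bytes = encode_csv(rows)
json_bytes = encode_json(rows)

# Intrinsic properties: recoverable at any time by analysing the bytes.
csv_digest = hashlib.sha256(csv_bytes).hexdigest()
json_digest = hashlib.sha256(json_bytes).hexdigest()

# Relational properties: cannot be computed from the bytes and must be documented.
provenance = {
    "measured_by": "hypothetical sensor 7",
    "encoded_by": "hypothetical export script",
}
```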

What is a (Data) Catalog? What is Metadata? What is a Datase

Postby Gary » Thu Apr 18, 2013 3:20 pm

The Ground European Network for Earth Science-DEC (GENESI – Digital Earth Community) has some definitions in its FAQ section.

One is for catalog, which they define and discuss as follows:

A catalog is a collection of entries, each of which describes and points to a feature collection. Catalogs include indexed listings of feature collections, their contents, their coverages, and other metadata. Registers the existence, location, and description of feature collections held by an Information Community. Catalogs provide the capability to add and delete entries. At a minimum Catalog will include the name for the feature collection and the locational handle that specifies where this data may be found. The means by which an Information Community advertises its holdings to members of the Information Community and to the rest of the world. Each catalog is unique to its Information Community.
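
Read as a data structure, the definition above could be sketched roughly as follows in Python. The class and method names (`CatalogEntry`, `FeatureCatalog`, `add`, `delete`, `locate`) and the handle strings are hypothetical and chosen only to mirror the wording of the definition: each entry has at least a name and a locational handle, and the catalog can add and delete entries.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class CatalogEntry:
    """Describes and points to one feature collection."""
    name: str              # minimum requirement: name of the feature collection
    handle: str            # minimum requirement: where the data may be found
    description: str = ""  # optional further metadata


@dataclass
class FeatureCatalog:
    """A catalog belonging to one Information Community."""
    community: str
    entries: Dict[str, CatalogEntry] = field(default_factory=dict)

    def add(self, entry: CatalogEntry) -> None:
        if entry.name in self.entries:
            raise ValueError(f"entry {entry.name!r} already registered")
        self.entries[entry.name] = entry

    def delete(self, name: str) -> None:
        del self.entries[name]

    def locate(self, name: str) -> str:
        # Return the locational handle for a named feature collection.
        return self.entries[name].handle


catalog = FeatureCatalog(community="Hypothetical Earth observation community")
catalog.add(CatalogEntry(name="landcover-2012", handle="hdl:0000/landcover-2012"))
catalog.add(CatalogEntry(name="coastlines", handle="hdl:0000/coastlines"))
catalog.delete("coastlines")
```

In this sketch the entry name doubles as the key of the catalog, which is one way of making the implied ID aspect explicit.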

While this discussion is aimed somewhat at geo-features (e.g., a coverage), one might generalize it to domain features of any type. Any comments on this and/or the other aspects of the definition? They don't mention the ID aspect of a collection, but it is implied. This vagueness is something that DFT may help to overcome by providing a more complete description and definition of something like a catalog.

BTW, they also define Metadata and Data Set as below:

Metadata is data about data or a service, or the documentation of data. In human-readable form, it has primarily been used as information to enable the manager or user to understand, compare and interchange the content of the described data set. In the Web Services context, XML-encoded (machine-readable and human-readable) metadata stored in catalogues and registries enables services to use those catalogues and registries to find data and services. A Metadata dataset (after ISO 19101) is the set of metadata describing a specific dataset.

For services, the Service metadata (after OGC) is the most basic operation all (OGC) services must provide and defines the ability to describe themselves. This "Get Capabilities" operation, yielding a capabilities document, is common to all services. An XML vocabulary comprised of several parts for describing different aspects of a service. The first unit describes the service interface in sufficient detail so that an automated process can read the description and invoke an operation that the service advertises. A second unit describes the data content of the service (or the data it operates on) in a way that enables clients to dynamically compose requests for service. According to ISO TC-211, the Service Metadata are the operations and (geographic) information available at a server.
A dataset is a logically meaningful grouping of similar or related data. Data having mostly similar characteristics (source or class of source, processing level and algorithms, etc.). This may include data from many sources and in many formats. For the Earth observation products, collections are generally associated to mission sensors and operation mode, e.g. a collection of all the TerraSAR-X spotlight mode data. The collection metadata describes these sets by providing the identifier, the description, the possible geographical and temporal extent, the common attributes, etc.

A dataset series (according to ISO 19115, ISO 19113 and ISO 19114) is a collection of datasets sharing the same product specification.

The idea of Research Objects with IDs

Postby Gary » Thu May 16, 2013 3:22 pm

Is anyone in the community interested in the notion of Research Objects,
which is broader and richer than that of a Digital Object?

The core idea, expressed in Bechhofer, Sean, et al., "Why linked data is not enough for scientists", 2010 IEEE Sixth International Conference on e-Science, IEEE, 2010,
is that of a semantically rich aggregation of resources that provides the
“units of knowledge” which supply structure for the delivery of
information as Linked Data.

"A Research Object (RO) provides
a container for a principled aggregation of resources, produced
and consumed by common services and shareable within and
across organizational boundaries. An RO bundles together essential
information relating to experiments and investigations.
This includes not only the data used, and methods employed
to produce and analyze that data, but also the people involved
in the investigation."

They describe the following motivating scenario of a scientific workflow, in which the implied ROs need to be identified
and related to one another along the way.

"We use a scenario to motivate our approach and to illustrate
aspects of the following discussion.

Alice runs an (in-silico) experiment that involves the execution
of a scientific workflow over some data sets. The output
of the workflow includes results of the analysis along with
provenance information detailing the services used, intermediate
results, logs and final results. She collects together and
publishes this information as a Research Object so that others
can 1/ validate that the results that Alice has obtained are
fair; and 2/ reuse the data, results and experimental method
that Alice has described. Alice also includes within the RO
links/mappings from data and resources used in her RO to
public resources such as the ConceptWiki or LarKC, providing
additional context. Alice embeds the RO in a blog post.
Bob wants to reuse Alice’s research results and thus needs
sufficient information to be able to understand and interpret
the RO that Alice has provided. Ideally, this should require
little (if any) use of backchannels, direct or out-of-band communication
with Alice. Bob can then deconstruct Alice’s RO,
construct a new experiment by, for example, replacing some
data but keeping the same workflow, and then republishes on
his blog, including in the new RO a link to Alice’s original.
In order to support this interaction, common structure for
describing the resources and their relationships are needed."
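
The Alice and Bob scenario can be sketched in a few lines of Python to show where identifiers come in. This is my own illustration and not the model from the paper: the `ResearchObject` fields, the `derive` helper, the file names and the use of UUID-based identifiers are all assumptions.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class ResearchObject:
    """Toy aggregation of the resources that belong to one investigation."""
    creator: str
    workflow: str
    datasets: Tuple[str, ...]
    results: Tuple[str, ...] = ()
    derived_from: Optional[str] = None  # identifier of the RO this one reuses
    identifier: str = field(default_factory=lambda: f"urn:uuid:{uuid.uuid4()}")


def derive(original: ResearchObject, creator: str, datasets) -> ResearchObject:
    """Reuse the workflow of `original` with other data and link back to it."""
    return ResearchObject(
        creator=creator,
        workflow=original.workflow,
        datasets=tuple(datasets),
        derived_from=original.identifier,
    )


# Alice publishes her experiment as an RO.
alice_ro = ResearchObject(
    creator="Alice",
    workflow="workflows/analysis-v1",
    datasets=("data/sample-a.csv",),
    results=("results/summary.txt",),
)

# Bob replaces the data, keeps the workflow, and links to Alice's original.
bob_ro = derive(alice_ro, creator="Bob", datasets=["data/sample-b.csv"])
```

Each RO gets its own identifier, and the link from Bob's RO to Alice's is itself just a reference to an identifier.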

A copy of the article is at ... cc63c9.pdf

