Common Naming report 2007

This report attempts to make a case for agreement on the choice of names, with some suggestions on which particular names one might attempt to agree on.

RDF and the semantic web

RDF is sometimes promoted as a good general way for a resource (by which I mean table, database, data set, data record, etc.) to provide a structured representation for the information it carries. While this may be true, such a characterization does not distinguish it from its many predecessors, such as XML and S-expressions. When we speak of the "power" of the semantic web we're not referring to RDF but rather to the ability to combine (or mash up) resources in order to answer questions that can't be answered by any of the individual resources. This ability depends not just on RDF but on the establishment of a culture of naming that leads to meaningful resource combination.

In order for resources to combine, they need to share names for things that they both talk about. That is, when resource A talks about thing T, and resource B talks about thing T, the fact that they're talking about the same thing has to be recognizable somehow. The mechanism of choice in RDF for such linking is the URI. Common choice and consistent use of URIs therefore form the backbone of the semantic web.

For example, if two resources record, in RDF, information about a molecular species (say, a drug or metabolite), they need to share some URI or other key related to the molecule if a combination of the two resources is to connect the information that each resource has relating to that molecule -- say, molecular properties such as melting point from one resource with functional properties such as disease implications from another.
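
To make the mechanism concrete, here is a minimal sketch in Turtle, with invented URIs and property names, of two resources that can be combined because they share a URI for the molecule:

    # Resource A: physical properties (all URIs and property names invented for illustration)
    @prefix chem: <http://example.org/chem/> .
    @prefix p:    <http://example.org/props/> .

    chem:aspirin  p:meltingPointCelsius  "135" .

    # Resource B: functional properties, using the same URI for the molecule
    @prefix dis:  <http://example.org/disease/> .

    chem:aspirin  p:implicatedIn  dis:ReyeSyndrome .

Because both graphs use chem:aspirin for the molecule, a plain RDF merge of the two connects the melting point to the disease implication with no further alignment work.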

This is not generally the case now. For genes (or gene records), some applications use [Life sciences identifiers:LSID]'s, while others use NCBI's eutils CGI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=4553961, while others invent their own names (e.g. http://purl.org/swan/ncbi/pubmed/4553961). (See http://bio2rdf.org/JSPWiki/Wiki.jsp?page=Creeps for an indication of the spectrum of possibilities.) Even an application designer who wants to do the right thing and share URIs is at a loss, as there is no common practice.

Even when a URI is shared, it may not have a clear meaning, so there is no criterion of truth for assertions made using the URI. For example, one application might use an Entrez query URI to mean a gene record, while another might use the same URI to mean a gene. These things - the [:InformationResource:information resource], and the thing it describes - need to have distinct URIs, and the URIs need clearly articulated meanings (see "different names for different things").
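
A minimal Turtle sketch of the distinction, with invented URIs and an invented "describes" property:

    # The information resource (the record) and the gene it describes get
    # distinct URIs; both URIs and the property are placeholders.
    <http://example.org/record/geneid/4553961>
        <http://example.org/describes>  <http://example.org/gene/4553961> .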

Existing URIs may be underspecified, or may designate things that differ in meaning or use from what applications need. For example, the owner of a URI for a database record may identify the URI as referring to the record without specifying whether the thing referred to is an HTML page, XML file, ASN.1, RDF/XML, Turtle, RDFa, etc., or some class subsuming a set of equivalent records; and without committing to consistency of meaning over time. Precise specification of URI meaning is much more critical for semantic web applications than for human consumption because programs are very inflexible, certainly compared to humans.

Why not use NCBI's query URIs?

E.g. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=4553961

  • They appear to be unstable (NCBI may change the interface at any time)
  • They are error prone (permutations of the CGI arguments will deliver the same web resource but won't be recognized by semweb applications as denoting the same thing)
  • They don't play nicely with the use of Qnames in RDF/XML and N3 (e.g. no question marks allowed in qnames)
  • There's no consistent determination as to the referent of the URI (record vs. gene vs. ...)

Why not use NCBI's 'view' URIs?

These aren't bad. See http://view.ncbi.nlm.nih.gov/

  • I didn't discover these until after I had written this memo initially
  • They don't cover non-Entrez sources, such as MeSH
  • No designated denotations for the URIs
  • No resolution (303 or otherwise) to RDF
  • Level of commitment not clear

Why not use an LSID?

After all, this is what [Life sciences identifier:LSID]'s were designed for, yes?

This needs to be investigated.

  • In many cases it is unclear exactly what thing an LSID refers to. For example, does an LSID for Entrez Gene refer to a gene or to a gene record?
  • There is no independent URI for the thing's metadata; you have to do a SOAP call to get to it.
  • I have found no evidence that [:LifeScienceIdentifiers:LSID]'s are already shared among independently developed applications.
  • Do they resolve to documents (can you GET them)? (AWWW: URI owners should provide representations of their information resources.) Certainly we would have to use the URL form of the [:LifeScienceIdentifiers:LSID] http://biopathways.org/resolver/ etc. in order to obtain resolution in web browsers, but does even this work? (e.g. accessing a pubmed LSID seems to get only the abstract, not the metadata or the article.)
  • For things that need names but don't have them, how do we make claims to a part of the LSID namespace, and where is the procedure documented?
  • Level of support is not clear. I3C is gone. http://lsid.biopathways.org/ is clearly no longer supported. As TDWG has adopted LSIDs, activity has increased on the LSID SourceForge site, but there has been no move to establish legitimacy of the LSID URN namespace or to update the spec.

Let's not fight the [Life sciences identifier:LSID] wars again, but rather learn what we can from the experience.

Why not use the info: URI scheme?

Superficially the problem we face -- providing well-behaved URIs for "orphan" things that have no well-supported URI -- is similar to the one that the info: scheme is supposed to solve. It is possible to register new 'domains' within the info: URI space, and we could consider doing so. (For information on the info: scheme, see http://www.loc.gov/standards/uri/info.html; the main web site is down as of 2007-03-24.)

The main problem with info:, besides its obscurity, is that there is no well defined way to get information about things named by info: URIs - there are only general rules in the info: registry, and browsers can't or don't access descriptions or representations of particular things having info: URIs. Given a choice, http: would be preferred, according to AWWW.

Why not use concordances?

A concordance in this context is a set of assertions that relate one set of names to another set of names for the same or related things. For example, if resource 1 uses the identifier http://project1.org/123 to identify thing 123 and resource 2 uses http://project2.net/thing123 to identify the same thing, then a concordance connecting the two resources would contain the assertion {<http://project1.org/123> owl:sameAs <http://project2.net/thing123>.} (among many others, one assertion per thing).

The relation connecting the names might be an equivalence such as owl:sameAs or owl:equivalentClass, or if the things are dissimilar the relation might be one that sets up a correspondence, such as the "describes" relationship that connects an information resource (database record, etc.) with the entity it describes (molecular species, gene, etc.).
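
Spelled out in Turtle, such a concordance fragment might look like this (the "describes" property and the record URI are invented for illustration, not agreed-upon terms):

    @prefix owl: <http://www.w3.org/2002/07/owl#> .

    # Same thing under two names: assert equivalence.
    <http://project1.org/123>  owl:sameAs  <http://project2.net/thing123> .

    # Record vs. the thing it describes: use a correspondence relation instead.
    <http://project1.org/record/123>  <http://example.org/describes>  <http://project2.net/thing123> .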

This is a cumbersome and fragile solution if used on a large scale. It moves some of the burden of name alignment among resources off of the client, which is good, but creates a new opportunity for failure. Making concordances available and keeping them up to date takes care and effort. For some identifier spaces such as PubMed id's a concordance could be enormous. It would be much better to place the burden of seeking and maintaining common names on publishers, not on clients or concordance maintainers.

Why not leave this job up to the data providers (NLM, etc.)?

NLM is not (yet) committed to the semantic web. It is unlikely that they, or others, will take action soon, or will provide adequate definitions and descriptive metadata for the URIs they create. It is not necessarily in a publisher's interest to get this right. We - the semantic web users community - need well-defined, well-supported URIs that we can use today.

What about bio2rdf.org?

See http://bio2rdf.org/ . I need to study this more closely, but it is not clear that this system is designed with resource sharing or a global namespace in mind.

Straw man

The semantic web for life sciences is fairly new, and no one has really stepped forward to provide good names. This is merely a sketch of how one might go about setting up a naming system that is adequate for use in the semantic web, and politically and administratively neutral.

I have tried to make the scheme both lightweight and immediately implementable. It has been partially implemented, and is in use in the Science Commons knowledge base (which among other sources draws on work of the W3C Semantic Web Health Care and Life Sciences Interest Group).

We create a PURL "domain," say http://purl.org/commons/, and define mappings of URIs in this domain to particular resources.

We borrow database names from a list of standard resource abbreviations that is under development by another group (related to LSRN?).

Other resources might include PFAM records, rat genome database records, etc.

One thing that may distinguish this proposal from others is an emphasis on contracts: What can an agent (application, person, ...) expect to be true of these things? If a URI identifies an RDF document of some kind, what sort of RDF (what schema, etc) can be expected by an agent that accesses the document, assuming it's accessible? When things other than information resources are named, what meaning is assigned to the URI - how can we decide what's true about the thing and what's not?
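
Purely as illustration -- the PURL path and the use of Dublin Core here are assumptions, not settled choices -- a contract for a record PURL might promise that dereferencing it yields RDF shaped roughly like this:

    @prefix dc: <http://purl.org/dc/elements/1.1/> .

    # Hypothetical content promised for a record PURL such as
    # http://purl.org/commons/record/ncbi_gene/1234 (path invented for illustration)
    <http://purl.org/commons/record/ncbi_gene/1234>
        dc:title   "Entrez Gene record 1234" ;
        dc:source  <http://www.ncbi.nlm.nih.gov/> .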

I don't pretend to have a complete specification here and I'm not attached to these particular URIs. I'm just trying to give a flavor of what a spec might look like, so I can get a sense from you, the reader, whether this is a profitable direction to go. The exact form and meaning of the URIs - syntax and contracts - should be the subject of another conversation.

Alan R has registered disagreement with the idea of giving PURLs to poorly defined biological entities such as whatever it is that Entrez Gene records might be talking about. RDF sources that talk about genes in the sense of Entrez Gene should do so by referring to the record. If necessary, the application could define or refer to the "primary topic" of the record (although what that is is not necessarily clear), or relate an entity to the record using some other relationship. The goal is to encourage people to be clear about what they mean, and the databases often aren't.

Alan R thinks Qnames are an unnecessary luxury, and prefers html/ncbi_gene/1234, which can be directly forwarded by purl.org, unlike html/ncbi_gene/EG1234, which would require an additional forwarding/redirect hop hosted by the AO.


Who maintains these URIs?

There will need to be an administering organization (AO), and Science Commons volunteers to be that AO. The AO makes sure that the PURLs are documented properly and reserves new definitions as requested by the community.

Science Commons is in a good position to do this, compared to other organizations involved in HCLS, since its charter is to build links between scientific resources and demonstrate the value of shared data and knowledge. Science Commons will take this on as an institutional responsibility, so that maintenance will outlast the tenure of any individual employee of Science Commons. If the community demands, or if Science Commons decides it can't or won't continue to be responsible, a different responsible entity can take on the role of AO.

Note that even if a URI isn't maintained (in the sense of resolving) it still has value as an identifier in semweb applications. But it's best if it does resolve.

Who delivers the documents that back these URIs?

First of all, these URIs are primarily intended to be used for joins in queries -- that is, to link data sources. Use of the URIs with HTTP GET is just a "nice to have" that might assist some users in understanding what the URIs mean and how they are supposed to be used. Any serious use of information about the thing named by the URI should be accomplished by accessing the primary resource (such as NCBI) or a local cache of the same.

It may seem that the correct response to an HTTP GET of these PURLs would be to either deliver a document (a "representation" of the thing, if the thing is an information resource, or a description of the thing, otherwise) or to redirect to another server that can deliver the relevant document. However, it would be much better if purl.org and/or the AO were not bottlenecks in such resolution operations. To discourage high-volume GETs of these documents, the AO could arrange to deliver just a small amount of RDF (perhaps as a result of a 303 redirect) carrying information sufficient to specify how the client can in the future resolve this PURL to a document, without having to follow redirects. In fact, a rewrite rule could be provided in this stub response that would allow resolution of other similar PURLs as well, staving off a chain of other requests.

While such resolution information can clearly be represented in RDF, the details remain to be worked out.
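
One possible shape for such a stub, using an invented resolution vocabulary and template convention purely to show the idea:

    @prefix res: <http://example.org/resolution#> .   # invented vocabulary

    # Stub (perhaps returned via 303) telling the client how to resolve any PURL
    # in this family directly against NCBI, without further trips to purl.org.
    <http://purl.org/commons/html/ncbi_gene/>
        res:uriTemplate
          "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids={geneid}" .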

The apparatus set up by the AO may suggest new documents that someone ought to serve, such as RDF conversions of XML documents. In this case the semantic web application would be directed to a server administered by an organization (perhaps but not necessarily the AO) willing to serve the new documents.

If the URI names a thing (such as a person or a gene) that is not an information resource, then a 303 See Other redirect will be served (compare the httpRange-14 resolution, http://www.w3.org/2001/tag/issues.html#httpRange-14), with a predictable reference to a descriptive document.
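
For example (hypothetical URIs throughout), a GET on a gene PURL might 303-redirect to a descriptive document containing roughly:

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # Content of the descriptive document reached via the 303 redirect;
    # the gene PURL below is invented for illustration.
    <http://purl.org/commons/gene/ncbi_gene/1234>
        rdfs:label    "NCBI gene 1234" ;
        rdfs:seeAlso  <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=1234> .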

Won't the server be swamped with requests?

Requests for redirection should be fairly easy to handle even with a high load. If the AO's server is handling XSLT or similar requests then that's different. Projects that are doing a serious amount of search should do their own conversions (using scripts that the AO provides, perhaps) and cache the information they need - there is no reason to do a redirect through the AO for every gene record when all the gene records can be downloaded at once.

If the load does become too high, that means we will have succeeded beyond all expectations. It will then be time to migrate scripts to an organization better able to handle the load, such as NCBI.

Why should the community rely on these URIs?

Everyone needs to do something in order to follow the principle that one thing should have only one URI (globally) and to make sure that their application plays nicely with others. This is the best solution proposed so far.

Why purl.org?

purl.org is fairly well known and has a decent management interface. It does not levy a maintenance fee - no domain name registration to worry about. xmlns.com is another possibility, but it appears to be less well supported.

Why 'commons'?

Many or all of the things in this namespace may be considered part of the global information 'commons' because they were created for the greater good and are either public domain or available under liberal terms. The namespace might be used for endeavors other than science, thus the omission of the word 'science' from the name. What would you call it?

How likely is it that PURLs (and purl.org) will continue to resolve in the long run?

purl.org is maintained by the library community, which has good values and a good track record. PURLs are also used by that same community, which is probably a stronger reason to think that they will continue to work for a while.

How will community process be realized?

Someone would post a proposal to bind a URI to a specified thing (gene, gene record, etc.). If consensus is reached, or if the AO likes the proposal and no one seems to care, then the proposal will be implemented by the AO. The community will attempt to identify applications and data sources that ought to be using the 'commons' URIs.

I envision the community as coinciding with the public-semweb-lifesci mailing list, at least initially. I assume that this list will outlive the W3C HCLS interest group, which expires in April 2008. If some other definition is needed - either more or less inclusive - then we can discuss it later.

The possibility of doing this through OBO should be considered, as they have good process and it could be argued that the namespace of Entrez Gene records (for example) forms a (rather large and unstructured) vocabulary and therefore falls under the scope of OBO.

What happens if the administering organization drops the ball?

Authority to modify the purl.org URI mappings will be given to one or more responsible individuals who are known by the community but not directly connected to the administering organization. These individuals will therefore be empowered to maintain the namespace should the AO prove unresponsive.

What about things that already have URIs?

If something already has a URI, the one-thing-one-URI rule would obligate us to use it, even if the URI is not resolvable. For example, many journal articles have doi: URIs, and pubmed records ostensibly have URIs of the form info:pubmed/15548600. If we determine that no one uses the info: URIs, or that their defined meaning is useless, then we might want to unilaterally deprecate them in favor of http://purl.org/commons/ URIs. Alternatively, if we agree on a relation connecting an article to its pubmed record (for which we have a resolvable and well-defined URI), then the article can be identified for semweb purposes, using a blank node, as the article that is so related to the pubmed record.
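
In Turtle the idea would be something like the following, where both the relation and the PURL form for the pubmed record are invented for illustration:

    # The article is a blank node, picked out only by its relation to the record.
    _:article  <http://example.org/describedBy>  <http://purl.org/commons/record/pmid/15548600> .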

Ontologies

While it is useful to give types to things and place them into ontologies so that they can more easily participate in relationships, we ought to be able to begin our work by agreeing on URIs, and develop metadata such as typings separately, even experimentally. I believe that getting agreement on URIs might be easier than getting agreement on ontologies and that the former shouldn't be held hostage to the latter.

OBO is making great progress, however, so I may be wrong.

Versioning

A thing that is an information resource that might change is different from a thing that is a single snapshot (representation, version) of the changing information resource. So when versions are available and an application needs to be aware of them, the versions need to be named separately from the changing thing. Ideally we have agreed-upon relationships that connect these various things.

However, I am not aware of accessible version sequences for any of the data sources currently at issue. So we will need to articulate carefully what it means to make any statement about the content or properties of a database record that will change over time. Is the statement meant to be a hypothesis that will hold of all versions, or of current and future versions (monotonically); or is it true of some particular specified version being discussed, but not others; or is it meant to be a scientific statement about the subject of the versioned thing, e.g. a gene? In the latter case versioning may be moot if all versions have the same subject.
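
A sketch of the kinds of relationships that would be needed, with invented URIs and property names throughout:

    @prefix v: <http://example.org/versioning#> .   # invented vocabulary

    # The changing record, two snapshots of it, and the unchanging subject.
    <http://example.org/record/geneid/1234>
        v:hasVersion  <http://example.org/record/geneid/1234/2007-03-01> ,
                      <http://example.org/record/geneid/1234/2008-05-01> ;
        v:subject     <http://example.org/gene/1234> .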

What requirements did the [:LifeScienceIdentifiers:LSID] proposal identify for versions? How did it meet them?

Documentation

Most PURLs and PURL domains are poorly documented. We would make an effort to make documentation of the process and content of the /commons/ domain available, e.g. as an index.html file accessible at the URL http://purl.org/commons/ and as a technical memo published in other places.

Should we bother?

Olivier Bodenreider: "I am doubtful of the shelf life of any of the applications we develop at this point."

This proposal arose out of an immediate need in the Neurocommons project to choose URIs for various entities. If these names stick and get used, then we may have succeeded. If they don't, then at least we have not expended much effort, and we have something to use until better URIs come along.

Jonathan Rees
initially drafted March 2007
last revised May 2008