HttpRange-14 FAQ

This FAQ is likely to contain personal bias.

What is httpRange-14?

"httpRange-14" was the name given to an issue that was considered by the W3C Technical Architecture Group (TAG) related to the use of certain URIs on the Semantic Web. The issue was raised by Tim Berners-Lee on 19 March 2002, and after three years of deliberation the TAG resolved to give some particular advice to the community on the matter. The issue was subsequently closed.

The advice might be summarized as follows: When using a plain http: URI to refer (as one does in RDF), if HTTP requests specifying that URI lead to 200 responses, then this suggests that the referent is a web resource, and the TAG requests that the URI be used (in RDF) to refer to that web resource, not something else.

The intent of this advice is to obtain synergy between the "document web" and the "web of data" (semantic web) - encouraging use of the web to support consistent referring, and use of declarative languages (those that refer) to talk about the web.

(I am using the term "web resource" as a placeholder for want of the right term; see the terminology section at the end.)

What problem does this solve?

Plain http: URIs (those not having fragment identifiers, i.e. a "hash" mark #) are used in HTML and HTTP to navigate from one "web resource" to another (via href=), and in RDF to refer to an entity in the domain of discourse. For example, the URI http://www.yale.edu/publicart/lipstick.html (I'll call it 'L' for brevity) is used in HTML to direct a browser to a web resource that describes a sculpture, while it might be used in RDF to refer either to that web resource, to the sculpture itself, or to something else entirely. The presence of a web resource may lead someone writing RDF to use the URI to refer to either the web resource or the sculpture, and if two different people refer to different entities with the same URI, one may get an inconsistency. For example, in one case, one might say L dc:creator :Oldenburg. (the creator of the web resource's subject), while in the other one might say L dc:creator :Yale_University. (the creator of the web resource itself).

The httpRange-14 resolution is just one way to prevent such clashes. It does so by attempting to direct the community to consistently take L to refer to the web resource.

Why does it matter?

The issue makes no difference to most Web servers and clients (browsers) as they are only concerned with what a URI leads to. It is only when reference is an issue (e.g. in RDF) that this matters. Putting philosophical concerns aside, the argument is that different agents reading or writing RDF can disagree over what a URI refers to - the sculpture vs. a web resource about the sculpture - as a result of its HTTP experience. Sometimes you even need to talk about the two entities separately. Therefore steps must be taken to prevent this confusion.

What alternatives were considered?

The solution space can be summarized as follows:

  1. The URI can be used for anything at all - the web resource, what the web resource is about, or something else entirely. It is up to the author of the RDF. (Jonathan Borden)
  2. It's better for the web if the URI is only used to refer to what its web resource is about.
  3. It's better for the web if the URI is only used to refer to the web resource.
  4. It's better to use the URI to refer only to what the "URI owner" says it should refer; if they say nothing then don't use it to refer.
  5. It's better if plain http: URIs are not used to refer at all.
  6. It's better if http: URIs are not used to refer at all.

Option 1 says there is no problem and we should forget the whole thing. This is addressed above - there actually is a problem, at least in principle.

Options 2 and 3 are "drive on the left (or right)" rules; they prevent clashes by asking people to choose consistently one way or the other.

Option 2 is attractive because it automatically creates an RDF-friendly URI to refer to anything that is described by a web resource - namely the URI for that web resource.

Option 3 is captured in the httpRange-14 rule. It's attractive because it automatically creates an RDF-friendly URI to refer to any web resource - namely the web resource's URI.

Option 4 can be summarized "read the signs". It's attractive because it forces a conscious decision about reference in each case.

Options 5 and 6 are similar to "use different roads". If plain http: URIs are never used to refer, there is no risk of a conflict.

Why use http: URIs to refer at all - why not URNs or something?

Why not #6, a non-http URI scheme such as urn:?

Short answer: Don't divide the web by creating new naming systems.

Any naming system will require some kind of lookup mechanism. The lookup mechanism for http: is already universally deployed; for a new system to participate fully in the Web, lookup competence would have to become as widely deployed as HTTP, with basically the same characteristics as http: URI lookup. Even after 15 years of deployment, urn: URIs are currently of no use to most people who encounter them. Therefore, in the interest of easy lookup and a united web, it is desirable to use http: URIs to refer.

Why use plain http: URIs to refer - why not hash URIs?

Why not #5, fragment-possessing (# or "hash") http: URIs? Using hash URIs in RDF is a quite common practice, and it seems to work.

The use of hash http: URIs to refer to arbitrary things is well established. If they work, why the pressure to use plain http: URIs?

The arguments against hash URIs may be summarized as follows:

  1. Principle. What makes hash URIs privileged in this regard - why treat them any differently from other http: URIs? (Roy Fielding)
  2. Dubious in regard to the normative specifications. The practice works when the only representation (via conneg) is RDF/XML, thanks to RFC 3870, the application/rdf+xml media type registration, but if any other representations (such as HTML) are possible, things are less clear.
  3. Aesthetics. The fragment seems ugly and redundant...

Two patterns of use apply to hash URIs, which I'll call "grouped" and "single". A "group" of hash URIs is a set sharing the same base URI preceding the hash. Problems with grouped URIs:

  1. Lack of server control. Because the same document is retrieved for every URI in the group, the server cannot differentially respond to different URIs in the same group.
  2. Scaling. If a hash group gets very large, then the documents fetched get very large.
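
The server-control point can be made concrete with a small sketch. It assumes hypothetical example.com URIs and uses Python's standard urldefrag to show which URI a client actually requests once the fragment is stripped; the function name is mine, for illustration only.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def group_by_retrieved_document(uris):
    """Group URIs by the fragment-free URI that an HTTP client actually requests."""
    groups = defaultdict(list)
    for uri in uris:
        # The fragment is handled on the client side and never sent to the server.
        base, fragment = urldefrag(uri)
        groups[base].append(fragment)
    return dict(groups)


# Hypothetical URIs, chosen only for illustration.
grouped = [
    "http://example.com/myontology#dog",
    "http://example.com/myontology#cat",
]
single = [
    "http://example.com/myontology/dog#",
    "http://example.com/myontology/cat#",
]

# Grouped hash URIs collapse to one request target; single ones do not.
print(group_by_retrieved_document(grouped))
print(group_by_retrieved_document(single))
```

With the grouped URIs the server sees one and the same request for every member of the group, which is why it cannot respond differently per URI.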

Single hash URIs (a#, b#, ... or a#_, b#_, ... or a#it, b#it, ...) do not interact nicely with notations such as Turtle and SPARQL which are more concise when several URIs can share a common prefix. Compare the following Turtle examples (SPARQL is similar):

Grouped:

@prefix m: <http://example.com/myontology#>.
m:dog m:barks_at m:cat.

Single:

@prefix dog: <http://example.com/myontology/dog#>.
@prefix barks_at: <http://example.com/myontology/barks_at#>.
@prefix cat: <http://example.com/myontology/cat#>.
dog: barks_at: cat: .

Non-hash with common prefix:

@prefix m: <http://example.com/myontology/>.
m:dog m:barks_at m:cat.

The problem is that in Turtle the name following the common prefix (the :) can't have a # character in it.

(Subject to further thinking) I would advise that hash URIs be used to refer only when (a) used singly, as with the dog# convention used above, and (b) when the only representation provided for the resource named by the hashless URI (dog) is RDF/XML, to ensure that there is no confusion induced by inconsistent media type specifications.

Why not divorce reference from identification?

(Argument against #4 - what URI owner says goes?)

Forces creation of lots of new URIs (one for each web resource needing third-party metadata), threatens standardization.

(Actually a pretty good alternative...)

(Variant on this: Look for a pronouncement, and only if there isn't one, use the URI per option 2 or 3.)

Why not use the URI to refer to the web resource's subject?

(Why not #2?)

  • It is inconvenient to have to create a new hash URI when a perfectly good URI already exists (e.g. Aaron Swartz, Xiaoshu Wang).
  • It's not clear from looking at the web resource what the subject is; e.g. there may be many subjects.
  • This deprives us of a URI to use for the web resource/description itself (e.g. to say who its author is)

I'm not sure anyone promotes #2; more sensible is a combination of #4 with #2 or #3, i.e. look for guidance from the URI owner, and if it's not found, apply rule #2 or rule #3.

What was the TAG's advice?

In 2005 the various camps struck the following compromise: one might be excused for taking a 2xx HTTP response to imply that the URI refers to a web resource that has the retrieved "representation". A 3xx response, on the other hand, would be a way for the URI to lead to useful information without making the implication that the URI should refer to a web resource. In particular, a 303 might redirect to a web resource that is a description of the referent.

The exact resolution was the following:

That we provide advice to the community that they may mint
"http" URIs for any resource provided that they follow this
simple rule for the sake of removing ambiguity:
  a) If an "http" resource responds to a GET request with a
     2xx response, then the resource identified by that URI
     is an information resource;
  b) If an "http" resource responds to a GET request with a
     303 (See Other) response, then the resource identified
     by that URI could be any resource;
  c) If an "http" resource responds to a GET request with a
     4xx (error) response, then the nature of the resource
     is unknown.

The phrase 'an "http" resource responds' is a bit of a muddle in several ways: (1) what's meant is a resource identified by an http: URI, (2) it is the URI's server, not necessarily the resource itself, that does the responding, (3) the consequents are about referring, not identification, in the parlance of this FAQ. But no matter... I would interpret this to mean:

  1. URI owners' servers should not issue 200 responses when they mean for a plain http: URI to refer to something that's not the web resource that their server identifies by that URI.
  2. Readers and writers of RDF should not use a plain http: URI to refer to anything other than the web resource that would be identified by the web server by that URI.
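
Read from the client's side, the three clauses of the resolution amount to a small lookup on the status code. The following is only a sketch of that reading; the function name and the return strings are my own, and status codes the resolution does not mention are reported as such rather than guessed at.

```python
def nature_of_referent(status):
    """Map the status of a GET on a plain http: URI to what the resolution lets one conclude."""
    if 200 <= status <= 299:
        return "information resource"  # clause (a): 2xx response
    if status == 303:
        return "any resource"  # clause (b): 303 See Other
    if 400 <= status <= 499:
        return "unknown"  # clause (c): 4xx error
    # Other codes (e.g. 301, 302, 307, 5xx) are not covered by the resolution.
    return "not addressed by the resolution"


for code in (200, 303, 404, 301):
    print(code, "->", nature_of_referent(code))
```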

Since this resolution, most purveyors of URIs who want them to refer to things that aren't web resources have been using 303 responses to redirect clients to web resources that explain to what the URI is supposed to refer.

Why would a 200 response imply web resource, and a 303 not?

(the following is too technical. is appeal to spec the only way to answer this? how about bookmarking?)

The HTTP specification (RFC 2616) says that a 200 response to a GET request returns information "identified by the request URI". Recognizing the distinction between the information you get and the web resource itself, most people reading the spec take it to mean that the response carries information "corresponding to" or "representing" the web resource. This is too vague to mean much, but the TAG felt that the protocol, which provides headers such as Last-modified: that seem to describe the resource (as opposed to the response), implies that the URI would have to "identify" what the TAG called an "information resource" in order for a 200 response to a GET to make sense.

For 303, on the other hand, the spec just says that "the response to the request can be found under a different URI" suggesting that the response would "represent" the redirect target and not the subject of the original request. This frees up the original URI to refer to something not possessing representations (i.e. not a web resource). (This is a bit muddled because if the response that was found is a 200 response, and is a response to the first request, then the first request led to a 200 response. But the spec was not read as saying that.)

The effect of a 200 in a browser is pretty much the same as that of a 303, with the exception that a 303 will result in the URI in the browser's URI bar being the forwarding target, not the original URI. So the status code can only affect software that chooses to interpret it this way. Tim B-L's Tabulator data browser takes a 200 to imply that the URI refers to a web resource, so the use of 303 keeps Tabulator from making incorrect inferences such as that a person is a web resource.

What about other redirects?

Another 200-avoidance technique is to use a 301/302/307 redirect whose target is given by a URI that is not a plain http: URI - for example, an http: URI possessing a fragment id. See dc:creator for an example. In principle other kinds of URI could be used as targets, but it is unlikely that an HTTP client would understand them.

Is the TAG's advice normative?

As the appropriate IETF RFCs (RFC 4395, RFC 2616, and RFC 3986) have jurisdiction, the TAG's advice can't be considered in reverse as "normatively" saying that a 200 response can be taken as an assertion that the resource is a web resource, because doing so would impute new meaning to messages that already have meaning, thus attributing to the sender something that wasn't meant. In addition, the advice has not passed any systematic review process. So the advice must be taken as an exhortation and/or a "good practice note", not as a protocol specification.

What should I do in order to follow this advice?

The easiest way to follow the advice is to altogether avoid the use of plain http: URIs to refer, as Tim B-L wanted people to do in the first place. One might use hash URIs following the foo#-refers-to-entity / foo-identifies-RDF pattern as described above. The next line of defense would be to use 303 responses whenever a URI is to be used to refer to something, as this is always correct with respect to the TAG's advice. Setting up 303 responses is a fairly easy configuration option in Apache (Redirect 303 or RedirectMatch 303). The target of the redirect should be a web resource that explains to what the URI should refer, preferably in the form of RDF.
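
For those not running Apache, the same 303 behavior can be sketched with Python's standard http.server. Everything about the URI layout here is an assumption made for illustration: /id/<name> is taken to name a thing, and /doc/<name> is the web resource describing it. The Location header is given as a relative reference, which RFC 7231 permits.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical URI layout: /id/<name> names a thing, /doc/<name> describes it.
THING_PREFIX = "/id/"
DOC_PREFIX = "/doc/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


class SeeOtherHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith(THING_PREFIX):
            # Never answer 200 for the thing itself: redirect to its description.
            name = self.path[len(THING_PREFIX):]
            self.send_response(303)
            self.send_header("Location", DOC_PREFIX + name)
            self.end_headers()
        elif self.path.startswith(DOC_PREFIX):
            # The description is an ordinary web resource, so a 200 is fine here.
            name = self.path[len(DOC_PREFIX):]
            turtle = f'<{THING_PREFIX}{name}> <{RDFS_LABEL}> "{name}" .\n'
            body = turtle.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/turtle; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Create (but do not start) a server on localhost; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), SeeOtherHandler)
```

Calling make_server(8000).serve_forever() would then answer a GET of /id/lipstick with a 303 pointing at /doc/lipstick, and a GET of /doc/lipstick with a 200 carrying a one-line Turtle description.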

In case a URI is to refer to something that is "obviously" a web resource, the 303 could be skipped in favor of the usual 200 response, but then one would want to put metadata in some other place (see RDFa, <link>, and LRDD, below) to explain what the URI is supposed to refer to, since the resource's variability (versioning, language, media type) cannot be reliably inferred from any particular representation.

(Quoth the Pat: "But in any case, this is ridiculous. RDF is just XML text, for goodness sake. I need to insert lines of code into a server file, [...] in order to publish some RDF or HTML? That is insane.")

What does one say, in RDF, about web resources?

Web resources are generally confused with the information ("representations") that their URIs lead to, so that one might say that the web resource's author is Charles Darwin, meaning that its representations have Darwin as their author. (Whether this confusion is a good thing is the subject of debate.) One applies statements about information to web resources, meaning that the statement applies either to some representation or all representations. Care must be taken in the use of such URIs to guard against confusions resulting from representation variability - for example, authorship or license terms may change over time, and metadata that's true of a representation in one language or format may not hold of a different representation.

Some relationships (properties) in which a web resource might participate:

  • A web resource might have a "primary topic" (foaf:primaryTopic) that is an arbitrary resource.
  • A person may have a home page or publications list that is a web resource.
  • A web resource may be declared, using rdf:type, to be consistent in its representations in various ways (see the memo Generic Resources).
  • A web resource may have a creator or other bibliographic information.
  • A web resource may be said to be published under some particular copyright license (e.g. see ccREL).

How does one find RDF (metadata) about a web resource given its URI?

[I think I will delete this section as it goes far afield and can't be covered adequately in a short piece of writing.]

Find it however you can - maybe via a search engine or SPARQL endpoint. If it's named by a plain http: URI and you get a 200 response to a GET, you can try looking for a <link> element to follow in an HTML response, or in RDFa. Occasionally you will find information about the web resource in an RDF or RDFa representation. For some media types one might find XMP metadata. (need link)

As of this writing a protocol called LRDD is being designed that can lead you to other sources.

What is a web resource, anyways?

The terminology is a continuing problem, and many stories have been spun to explain the response restriction implied by this rule (no 200s for GETs of URIs referring to things that are not web resources). Here are some of them:

  1. A web resource is a web page.
  2. A web resource is something that is on the Web.
  3. A web resource is a node in the global hypertext network that we know as the Web.
  4. A web resource is part of the Web information space. (Tim B-L)
  5. A web resource is what a URI that is the target of an ordinary href= attribute would refer to, if it did, and no one said anything explicitly about it (in RDF).
  6. A web resource is something you can access on the web.
  7. A web resource is a "generic resource" - a document-like entity that may vary in time and may be underspecified as to language, format, or other details.
  8. A web resource is an "information resource" - something all of whose essential characteristics can be conveyed in a message; see AWWW.

The "essential characteristics" definition is the one referenced in the TAG's advice, but it is infuriatingly vague; one wonders how to distinguish an essential characteristic from an inessential one, or what this has to do with the web. It is likely that this definition is a committee compromise that no one liked but everyone could live with.

While there are innumerable boundary cases, and the ontological status of web resources is unclear, it seems clear that web resources are not physical things (such as people or chairs) and are information-like in some way.

My report regarding a conversation with some IETF folks

From http://lists.w3.org/Archives/Public/www-tag/2009Jan/0114.html (answers in parentheses):

  1. Anyone doing HTTP GETs is going to have to deal with 2xx for things that are not web resources, since this happens in so many cases, e.g. XML namespaces. To attempt to put this kind of semantics into 2xx is a lost cause. (A: This is about damage control and prevention, not reading anything into 2xx.)
  2. Why can't a single string refer to a relation for some purposes and a document for others? (A: This makes automated inference really difficult.)
  3. Who says a relation (link type) can't be a web resource? What you get back from the 200 is a representation of the relation, right? (A: Again, the idea is to force different names for different things, and having one thing that's two things just doesn't work in the presence of inference.)
  4. What practical effect could this possibly have? What benefit would there be to anybody besides HTTP/URI architects? (A: It helps avoid inconsistencies.)
  5. It's not the business of HTTP to convey semantics such as information about what a URI refers to. HTTP is only about access. (A: Again, we're talking about good practice, not reading something in that wasn't there before.)
  6. Isn't 303 used for a lot of things other than avoiding 200s? How does 303 help if it's so unspecific? (A: Not as far as I know. It was originally designed only for use with POST; using it with GET is new.)
  7. Not interested in the debate. (A: Good for you!)

Where are the archives of this issue?

What are some other primary writings on this topic?

Dedication

With due respect to Larry Masinter and Xiaoshu Wang.