Architecture

What are we trying to do?

The general problem we're trying to address is reuse of scientific information artifacts - getting the most societal value out of the publications, data sets, databases, and knowledge bases generated by the scientific community. We see the processes of accessing, parsing, and analyzing these artifacts, and of combining them for meta-analysis and queries that join across sources, as much harder than they need to be.

While solving these problems in the wider community will require community consensus around technical standards and curation processes, we feel there's a lot to be learned by exploring these issues in the context of a prototype. We have therefore created a framework for the creation and distribution of information artifacts as a way to see what the future might be like. For development and validation we are also employing these artifacts in pilot projects designed to illuminate the problems and opportunities presented by such frameworks.

Open integration framework

"Framework" is vague so let me explain. We take the software community's practices for software interoperability and packaging as an inspiration. For example, a GNU/Linux distribution, such as Debian, collect together software from thousands of distinct sources, "port" these software artifacts to their platform by normalizing the way they're packaged, and bring artifacts together into distributions (which in turn are replicated to multiple repositories, i.e. ftp sites). Then a user configures a custom system by requesting exactly those components that are needed. Brought together in a common platform, the components interoperate, usually in the sense that an application can be launched in a compatible user interface, but often in other ways, such as server deployment. Because thousands of components are involved, the interfaces between them and they way they're packaged and installed must be absolutely standard, since user intervention in the installation process does not scale.

To apply this idea to scientific information artifacts, one creates a set of conventions for syntactic and semantic compatibility among components and a standard packaging mechanism to make selecting and installing components easy. One starts with the primary sources (databases, knowledge bases, etc.), applies a script to do the normalization, with a packaged component as the result. The resulting "binary" may or may not be collected with others to make a distribution. Someone creating a local installation optimized for local query obtains needed components from one or more distributions and installs those into their own environment.
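
As a concrete illustration of that pipeline, the retrieval, normalization, and packaging steps for a single component might look roughly like the following Python sketch using the rdflib library. The source URL, column names, namespace, and file names are placeholders for illustration only, not our actual conventions or scripts.

    # Sketch of retrieve -> normalize -> package for one hypothetical component.
    # The source URL, column names, and namespace are illustrative placeholders.
    import csv
    import urllib.request

    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/record/")          # placeholder namespace
    SOURCE_URL = "http://example.org/primary-source.tsv"  # hypothetical primary source

    # 1. Retrieve the primary source.
    local_copy, _ = urllib.request.urlretrieve(SOURCE_URL, "primary-source.tsv")

    # 2. Normalize: map each record onto agreed-upon URIs and predicates.
    g = Graph()
    with open(local_copy, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            record = EX[row["id"]]
            g.add((record, RDF.type, EX.Record))
            g.add((record, RDFS.label, Literal(row["name"])))

    # 3. Package: serialize the normalized graph as the "binary" RDF artifact.
    g.serialize(destination="component.ttl", format="turtle")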

Our particular technology choices include:

  • RDF and OWL as the "binary" representation
  • Semantic normalization based on a class-intensive modeling methodology (cf. OBO Foundry)
  • a set of preferred URIs so that components will link to one another (from standard ontologies such as Dublin Core, from OBO, and from a predecessor of the forthcoming Shared Names project)
  • a set of Unix 'make' conventions for coordinating the transformation from primary sources to RDF, such that every component is 'made' in the same way
  • some simple scripts that support these 'make' scripts in retrieving primary sources and packaging the RDF
  • a binary package installer called RDFherd that, like the GNU/Linux package managers, takes care of configuring a triple store and installing RDF into it
  • the Virtuoso triple store as our current back end, though the system is adaptable to others
  • scripts to enable simple inference (e.g. of subClassOf transitivity) via forward chaining (a sketch of this appears after the list)
  • SPARQL queries on a triple store as the way to interact with the selected components (an example query appears below).
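
The forward-chaining step mentioned above can be as simple as computing the transitive closure of subClassOf and asserting the inferred triples back into the graph. Here is a minimal in-memory sketch with Python and rdflib; in practice this would run against the triple store itself, and the file names carry over from the hypothetical example earlier.

    # Sketch: forward-chain rdfs:subClassOf transitivity over a loaded graph.
    # The input file name is a placeholder for one of the packaged components.
    from rdflib import Graph, RDFS

    g = Graph()
    g.parse("component.ttl", format="turtle")

    # Keep adding C1 subClassOf C3 whenever C1 subClassOf C2 and C2 subClassOf C3,
    # until a pass adds nothing new (i.e. the closure has been reached).
    changed = True
    while changed:
        changed = False
        for c1, _, c2 in list(g.triples((None, RDFS.subClassOf, None))):
            for _, _, c3 in list(g.triples((c2, RDFS.subClassOf, None))):
                if (c1, RDFS.subClassOf, c3) not in g:
                    g.add((c1, RDFS.subClassOf, c3))
                    changed = True

    g.serialize(destination="component-inferred.ttl", format="turtle")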

(We are often asked why we don't use federated query. There is no reason in principle not to; the technology is simply not yet mature enough for the kinds of queries we want to run.)
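
Once the selected components are installed in a local Virtuoso instance, interaction is just a SPARQL query against its endpoint. The following is a minimal sketch using Python and the SPARQLWrapper library; the endpoint URL is Virtuoso's default, and the query terms are generic rather than drawn from our distribution.

    # Sketch: querying a locally installed triple store over its SPARQL endpoint.
    # http://localhost:8890/sparql is Virtuoso's default endpoint; the query
    # itself is generic and purely illustrative.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:8890/sparql")
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?class ?parent ?label
        WHERE {
            ?class rdfs:subClassOf ?parent ;
                   rdfs:label ?label .
        }
        LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["class"]["value"], row["label"]["value"])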

Distribution

We have created a set of about 25 components following this method and collected them into a distribution. The components are independent and the architecture is open, so anyone may pick and choose the ones they like without having to take all 25. One may create new components and either add them to our distribution (subject to quality control), create a new distribution, or just use them privately.

Currently our distribution is only a "binary" distribution, realized either as a set of RDF files or as a database dump. The "source distribution" would consist of the 'make'-based marshaling scripts that we have; while we can run these on our own system, and they are available from our svn repository, they are not developed to the point where we would be confident that they would work outside our own development environment.

What about the Semantic Web?

The term "Semantic Web" is an ambitious dream of deploying interlinked information in RDF throughout the Web. It encompasses a wide variety of philosophies, goals, and technologies. We have found many of the technologies to be useful in our work, especially RDF, SPARQL, and OWL. The "Semantic Web" per se does not provide any particular set of standard entity names (URIs) or any particular approach to semantics, leaving these to particular application layers, so any practical system for data integration must add these. Standard packaging conventions and the notion of a distribution are also out of scope in the Semantic Web story, but necessary if we are to achieve the ideal of practical, scalable, reusable data integrations.