Package Makefile conventions

This is general documentation for the Neurocommons project's packaging system. The general pattern is that a script called a 'packager' (sometimes called a 'package') loads an external data source and converts it to an RDF 'bundle'. Packagers are roughly analogous to Bio2RDF 'rdfizers', but some are not really conversions, either because they do not consume external sources or because what they produce is not an RDF bundle.

This documentation is for reference and might be used by either users of packagers or authors of packagers.

Familiarity with information sources commonly used in bioinformatics may be needed to understand some of the examples.

Much of the complexity here has to do with the size of the bundles and their sources. E.g. we cache Medline because it takes so long to download it from NLM. We want to regenerate frequently, but we don't want to redo expensive operations if we can help it.

Sample packagers using this regime can be found here: http://svn.neurocommons.org/svn/trunk/packages/

Each packager lives in its own directory. That directory contains a Makefile possessing a standard set of targets.

GNU 'make' is assumed throughout.

Configuration

The configuration process is documented in the file default.mk.doc in the packages/ directory (http://svn.neurocommons.org/svn/trunk/packages/default.mk.doc). That file is mostly redundant with what follows and may be more up to date.

Before starting, a site- or packager-specific configuration file needs to be set up. It should define:

  1. CACHE - where the primary sources are cached.
  2. WORK - the workspace for bundle creation - where intermediate files, if any, are stored.
  3. BUNDLE - the directory that will become the bundle, once the build process populates it with RDF files and a Config.pl.
  4. COMMON - directory where shared tools are kept.
  5. AUTHORITY_URI - version authority URI; see below.
  6. EXPORT_ROOT - where 'make export' will put bundles.

In defining these variables, config.mk may assume that $(PACKAGE) is the package name, e.g. 'addgene'.

Pathnames should be absolute since Makefiles can occur anywhere in the directory tree.

A site-specific configuration file may be placed either in 'default.mk' in the directory that contains the packager directories, or in each package. For a default.mk, one might have something like the following:

COMMON?=/home/darwin/checkout/trunk/packages/common
AUTHORITY_URI?=tag:darwinzzz.org,2009:foo
EXPORT_ROOT=/raid/export/development
BUILD_ROOT?=/raid/darwin
CACHE?=$(BUILD_ROOT)/cache/$(PACKAGE)
WORK?=$(BUILD_ROOT)/work/$(PACKAGE)
BUNDLE?=$(BUILD_ROOT)/bundles/$(PACKAGE)

(BUILD_ROOT is not used in any of the Makefiles; it is internal to this file.)

Certain Neurocommons packages require packager-specific parameters. These can be put either in default.mk or in the config.mk for the relevant packager:

MEDLINE_FTP_1=ftp://ftp.nlm.nih.gov/secretplace/gz
MEDLINE_FTP_2=ftp://ftp.nlm.nih.gov/othersecretplace/gz
MESH_FTP=ftp://nlmpubs.nlm.nih.gov/online/mesh/secretplace
MEDLINE_CACHE=$(BUILD_ROOT)/cache/medline
MESH_DIGEST=$(BUILD_ROOT)/work/mesh/mesh-digest.lisp
MESH_CACHE=$(BUILD_ROOT)/cache/mesh
MESH_YEAR=$(shell date +%Y)

(To obtain file locations for Medline, please consult documentation you received with your Medline license.)

If you want to use an existing cache for some particular packager such as pdb, you should override CACHE in its config.mk.
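
As a minimal sketch, a config.mk in the pdb packager's directory might contain a single line. The path is hypothetical. The Makefile boilerplate includes config.mk before default.mk, and the sample default.mk assigns CACHE with ?=, so a value set here takes precedence.

```make
# pdb/config.mk (sketch; the path below is a made-up example)
# Reuse an existing local PDB mirror instead of the default cache location.
CACHE=/raid/mirrors/pdb
```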

(We may have to generalize the bundle namespace at some point, perhaps so that packages/bundles/graphs are named by arbitrary URIs, but that will require coordination with rdfherd. For now I assume the bundle namespace is managed centrally, similarly to the Debian GNU/Linux package namespace.)

Makefile boilerplate

Each Makefile should contain at the top:

PACKAGE={packagename}
-include config.mk
-include ../default.mk

where {packagename} is the chosen name for your packager, and at the bottom:

include $(COMMON)/common.mk

If your packager's directory is not a subdirectory of the directory that contains default.mk, use ../../default.mk or whatever works for you.

The common.mk file defines some standard targets, such as "bundle", and some implicit rules that may come in handy.

"prepare" target (defined in each Makefile)

Implicit 'make' rules require that the directories in question exist. The "prepare" target ensures that the CACHE, WORK, and BUNDLE directories exist. Unfortunately this is necessary for some packagers, so it has to be done before making their targets.

"cache" target (defined in each Makefile)

"make cache" will make sure that all files needed to create the bundle exist locally. It does not ensure that the cache is up to date. For that, do "make validate" (see below).

Each bundle Makefile should define this target with all the files to be cached as prerequisites. Each prerequisite should then have a rule for obtaining it and storing it somewhere under the cache directory, $(CACHE).
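
A "cache" target of this shape might look like the following sketch. The origin URL and file names are hypothetical, and curl is an assumption (any fetch tool would do). Because the rule has no prerequisites, a file is fetched only when it is missing, which matches the "exists locally, not necessarily up to date" contract.

```make
# Sketch only: ORIGIN and the file names are made-up examples.
ORIGIN = http://example.org/downloads
CACHED_FILES = $(CACHE)/genes.xml $(CACHE)/proteins.xml

.PHONY: cache
cache: $(CACHED_FILES)

# Fetch to a temporary name first, so that an interrupted download
# never leaves a partial file under the final name.
$(CACHE)/%.xml:
	curl -fsS -o $@.tmp $(ORIGIN)/$*.xml
	mv $@.tmp $@
```

This assumes $(CACHE) already exists, i.e. that "make prepare" has been run.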

A rule may go to some effort to reuse an existing .old file, first checking that the .old file is in fact current. The check may involve checksums, file write dates, etc.

"make cache" is also responsible for keeping the cache tidy by deleting old versions of files that are no longer needed.

"validate" target (defined in common.mk)

Ensures that the contents of the cache (of the primary sources) are fresh. That is, it checks the origin site (if any), and if the files there seem newer than what's in the cache, or if the cache is empty or incomplete, the new files are fetched and replace the existing ones.

"bundle" target (defined in common.mk)

Common rule. Creates the bundle based on prebundle and Config.pl prerequisites.

"prebundle" target (defined in each Makefile)

Assuming a "cache" prerequisite, this needs to create all of the RDF/OWL/whatever files needed for the bundle.

We rely on "make" to determine whether regeneration is necessary. For this reason, file write dates should be made as old as possible. That is, if we have a file X, we may save it as X.old and then generate a new X; if X.old and X end up being identical, we replace X with X.old so that "make" sees the older file write date (since "make" is driven by file write dates).
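
The following rule is a sketch of one way to get this effect. It is a variant of the X.old swap: the new output is written to a temporary name and replaces the existing file only when the contents differ, so an unchanged file keeps its old write date. The converter scripts/convert and the file names are hypothetical.

```make
# Sketch only: scripts/convert and the file names are made-up examples.
$(BUNDLE)/p.rdf: $(CACHE)/data.xml
# Write the fresh output next to the existing file.
	scripts/convert $< > $@.new
# Identical output: discard the new copy, keeping the old write date.
# Different (or first) output: move the new copy into place.
	if [ -f $@ ] && cmp -s $@ $@.new; then rm -f $@.new; else mv $@.new $@; fi
```

One consequence of this design is that p.rdf can stay older than its source, so "make" will run this recipe again on later invocations. The intended payoff is for anything that depends on p.rdf, which sees an unchanged date.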

Config.pl file (created by 'make bundle')

The bundle's Config.pl file is created by filling in file Config-template.pl (which must be provided) with correct version number and authority URI. The version number is incremented if and only if any md5 of any bundle file (other than Config.pl), or of Config-template.pl, changes.
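
The "if and only if any md5 changes" test can be illustrated with a sketch like the one below. This is not the code in common.mk, which may work differently. The target name check-version and the files reference/md5s and reference/version are hypothetical, and md5sum is assumed to be available.

```make
# Sketch only. check-version, reference/md5s and reference/version are
# hypothetical names; the actual bookkeeping is done by common.mk.
MD5S_NEW = $(WORK)/md5s.new

.PHONY: check-version
check-version:
# md5 of every bundle file except Config.pl, in a stable order.
	(cd $(BUNDLE) && find . -type f ! -name Config.pl | sort | xargs md5sum) > $(MD5S_NEW)
# The template counts too.
	md5sum Config-template.pl >> $(MD5S_NEW)
	@if cmp -s $(MD5S_NEW) reference/md5s; then \
	  echo "no md5 changed: keep version $$(cat reference/version)"; \
	else \
	  echo "md5s changed: next version is $$(( $$(cat reference/version) + 1 ))"; \
	fi
```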

When going to a new version, the "authority" in Config.pl is set to the URI found in the AUTHORITY file, which must be created locally and uniquely at each installation. This is just a way to distinguish a version 17 written by one development group from a forked version 17 written by another group.

The "authority URI"

To avoid conflicts over the designations of version numbers, each site where bundles are being developed must have its own "authority URI". Do not reuse anyone else's authority URI. To ensure global uniqueness, use a URI that you "own" (in the web architecture sense). "tag:" URIs work just fine; see RFC 4151. Example (but don't use this one! make up your own):

tag:darwinzzz.org,2009-05-02:jar

Unfortunately, due to a bug in the way the URI is passed from a shell script to a perl script, the URI must not contain any characters special to perl, sed, or the shell, and should also not contain a semicolon character. In particular, avoid using @, in spite of its being natural in tag: URIs.
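
A site could guard against the two characters named above with a check such as the following sketch, placed after the configuration files are included. The variable name BAD_URI_CHARS is hypothetical, and the check does not cover the other characters special to perl, sed, or the shell.

```make
# Sketch: stop early if AUTHORITY_URI contains '@' or ';'.
# BAD_URI_CHARS is a made-up helper variable; it is empty when the
# URI contains neither character.
BAD_URI_CHARS := $(findstring @,$(AUTHORITY_URI))$(findstring ;,$(AUTHORITY_URI))
ifneq ($(BAD_URI_CHARS),)
$(error AUTHORITY_URI must not contain $(BAD_URI_CHARS))
endif
```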

This URI is by design not stored in the svn repository. Instead it is found in the site-specific default.mk or config.mk.

"snapshot" target (defined in common.mk)

Before a bundle is released in any way - loaded into a triple store or copied to another location - it's important to ensure that its version number is not recycled for use in some different, future version of the bundle. That is, different bundle versions need to have different version numbers.

To ensure this, do "make snapshot". This takes a snapshot of the current list of md5s and Config.pl (which includes the version number) and saves them. Any future build will make sure either that the bundle has been identically rebuilt or that the version number is incremented (and the authority URI set).

A saved snapshot should be distributed with each packager, so that if it turns out you build the same bundle version that the bundle author did, that fact will be detected and the same version information reused.

How do you know when you need to do this manual step? Basically any time the bundle could potentially be copied to a new location (in the file system or elsewhere) where independent development on the port might begin - that is, just before any kind of bundle release, whether internal or external.

(I may get rid of this target, and just advance the version number each time a 'make bundle' creates something different from the current reference version / md5s. This will cause a large number of new versions during development, but that's OK. The main risk here is that a change might happen accidentally, and you might want to revert in order to re-generate what you had before, exactly this time. There's no way to deal with this in general, other than accumulating a database of md5s to version+authority mappings.)

"test" target

Self-test; to be done. I anticipate a stand-alone test of the bundle using maybe Jena. More serious integration tests using Virtuoso would happen separately. For 'check' vs. 'test' see http://bugs.python.org/issue3758 .

"clean" target

Flush the WORK area, preserving the cached sources and any bundle that may have been created.

How to create a package

To author a new package that follows the above conventions, one might apply the following general plan:

  1. Set up your local site by creating a default.mk (see above)
  2. Invent a name for your package. In the following suppose it's 'p'. (rdfherd currently assumes the package namespace is global, so coordinate with other package authors to avoid conflicts. Maybe this will get fixed.)
  3. Create a directory named p to hold scripts and related artifacts
  4. Create p/Makefile. Boilerplate consists of the template described above.
  5. Create p/Config-template.pl (copy one from somewhere else, then change occurrences of package name)
  6. Define a 'prepare' target that creates necessary directories, e.g. mkdir -p $(CACHE) $(BUNDLE) -- also include $(WORK) if necessary, and subdirectories as needed
  7. Define a 'cache' target that automates fetching of primary sources to $(CACHE). If there's lots of stuff to transfer, it may pay to optimize cache validation (long story) by reusing the '.old' file holding the previous version. E.g. see common/bin/ftp-validate
  8. Debug 'make cache' (to make sure files are fetched) and 'make validate' (to make sure that new versions are fetched when needed, and only when needed)
  9. Define a 'prebundle' target that uses scripts to create RDF files under $(BUNDLE)
  10. Debug 'make bundle'
  11. Debug the bundle, perhaps in conjunction with other bundles
  12. 'make snapshot'
  13. Put Makefile, reference/*, and scripts under source control (but not default.mk or any config.mk)
  14. Ready for distribution. Package up p (the "source" distribution) and/or the rdf bundle you've created (the "object" distribution).
  15. Incorporate into an rdfherd environment
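
Steps 4 through 9 might come together in a p/Makefile along the lines of the sketch below. The source URL, the file names, the converter scripts/convert, and the use of curl are all assumptions made for illustration; only the boilerplate lines at the top and bottom come from the conventions above.

```make
# p/Makefile: sketch for a hypothetical package 'p'.
PACKAGE=p
-include config.mk
-include ../default.mk

# Made-up source location and cached file name.
SOURCE_URL = http://example.org/p/data.xml
SOURCE = $(CACHE)/data.xml

.PHONY: prepare cache prebundle

# Step 6: create the directories the other rules write into.
prepare:
	mkdir -p $(CACHE) $(WORK) $(BUNDLE)

# Step 7: fetch the primary source if it is not cached yet.
cache: $(SOURCE)

$(SOURCE):
	curl -fsS -o $@ $(SOURCE_URL)

# Step 9: turn the cached source into RDF under $(BUNDLE).
prebundle: cache $(BUNDLE)/p.rdf

# scripts/convert is a hypothetical converter writing RDF to stdout.
$(BUNDLE)/p.rdf: $(SOURCE)
	scripts/convert $< > $@

include $(COMMON)/common.mk
```

With this in place, the sequence "make prepare", "make cache", "make bundle" corresponds to steps 6 through 10.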