ImmPort/PDB

The Protein Data Bank (PDB) is one of the data sources we're packaging for use in ImmPort, following the Package Makefile conventions. In summary: we download and cache raw data from PDB, then build RDF from it for use in SPARQL query services and the like.

Bundle documentation (how to use it): Bundles/pdb

Project report: HLA Structure Variation


Selecting structures

The Protein Data Bank (PDB) is pretty big, and it makes sense to pay attention only to those structures that have something to do with epitope binding; otherwise there are too many bytes to download and process (70G?).

The structures we want are those derived from proteins in the HLA system. There are six (?) relevant genes, HLA-{A, B, C, DP, DQ, DR}. (Some of these either form or consist of complexes, so a bit of research is warranted.) (Question: should we also pull out MHC complexes from model organisms such as mouse? Or would they just get in the way?)

There are two ways to track down candidate structures, and it is possible that neither is adequate: (1) keyword search (unprincipled), and (2) search by function, e.g. by GO category (principled).

Keyword Search

Keyword search is implemented in keyfetch.py; it finds around 270 structures. Some, such as 1FFK, are false positives that are not relevant to our investigation; a list of false positives to filter out is maintained in the code. The following query (in SPARQL extended with a count() feature) shows 292 structures as of 8 Sep:

How many structures?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sc: <http://purl.org/science/owl/sciencecommons/>

select count(?name) where
{
 ?record rdfs:label ?name;
   <http://purl.obofoundry.org/obo/IAO_0000136> ?struct.
 ?struct sc:has_grain ?g
}


We'll assume for now that structures will be selected on the basis of keyword search results (e.g. "HLA"). Ignore the following more principled approach for now:

Search by function (a road not taken)

If things worked out nicely there would be a GO category, already in the Neurocommons store, that had the various HLA proteins as members. This would get us (via SPARQL) to Entrez Gene ids, which could then be converted to Uniprot ids (see Gene / Uniprot below). However, the number of proteins is small enough (less than 20?) that this is probably not necessary.

A keyword search for HLA-B turns up 76 structures, including 1K5N. On the other hand, "advanced search" using the Uniprot id for HLA-B = P03989 has only 14 hits. It might be worthwhile to spend a little time figuring out why this is -- do some of the keyword hits lack Uniprot altogether, or are they false positives, or do they contain some other Uniprot id? If the latter, is it a Uniprot id that maps to the Entrez Gene id for the right gene?

Processing the PDB XML files

Having selected some number of structures (probably a few hundred), one gets their PDB ids (four alphanumeric characters) and downloads the compressed XML, e.g. 1k5n.xml.gz. keyfetch.py implements downloading a .xml.gz file for each search hit.
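
For concreteness, a minimal sketch of the download-and-cache step (this is not the actual keyfetch.py code; the base URL and the fetch_structure helper are assumptions for illustration):

import os
import urllib

BASE = "http://www.rcsb.org/pdb/files/"  # assumed URL pattern, not verified

def fetch_structure(pdbid, cachedir="cache"):
    """Download <pdbid>.xml.gz into cachedir unless it is already cached."""
    name = pdbid.lower() + ".xml.gz"
    dest = os.path.join(cachedir, name)
    if not os.path.exists(dest):  # cache raw data, as described above
        if not os.path.isdir(cachedir):
            os.makedirs(cachedir)
        urllib.urlretrieve(BASE + name, dest)
    return dest

for hit in ["1K5N", "1CF5", "3FQX"]:
    print fetch_structure(hit)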

The PDB mmCIF Exchange Dictionary supporting PDB Format V3.1 is the basis of the XML schema for these files, pdbx.xsd. Note that at least once, the namespace/schema URI in all the .xml.gz files changed; perhaps the namespace name should be determined at runtime rather than hardwired into the code.
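
A sketch of determining the namespace at runtime, assuming only the standard-library gzip and ElementTree modules (pdbx_namespace is a hypothetical helper, not part of the package):

import gzip
from xml.etree import ElementTree as ET

def pdbx_namespace(path):
    """Return the namespace URI of the root element of a .xml.gz file."""
    f = gzip.open(path)
    try:
        for event, elem in ET.iterparse(f, events=("start",)):
            # ElementTree spells qualified names as "{uri}localname".
            if elem.tag.startswith("{"):
                return elem.tag[1:].split("}", 1)[0]
            return None
    finally:
        f.close()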

Parts of the XML data (uniprot, pubmed, alleles, etc.) are extracted and made available via the ImmPort/PDB template interface.

Natural protein complexes are roughly equated with PDB structures; for example, 1CF5 or 3FQX. Each structure has parts, of which most but not all are polypeptide chains (some parts may be small non-chain molecules).

Molecular description for 1K5N

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sc: <http://purl.org/science/owl/sciencecommons/>
prefix IAO:  <FIXME://example/IAO_terms#>
prefix ro:   <http://www.ifomis.org/bfo/1.1/ro#>
select distinct ?Molecule ?Type ?Length where
{
 ?record rdfs:label "1K5N"; <http://purl.obofoundry.org/obo/IAO_0000136>  ?struct.
 ?struct sc:has_grain [ ro:has_part ?chain ].
 ?chain rdfs:label ?Molecule;
  sc:sequence_length ?Length;
  rdf:type [ rdfs:label ?Type]
}
order by ?chain


The goal here is a reasonable (Foundry-like) model for the information PDB carries. To get there, we use the template-based approach that Dan and Alan have worked out, so that Alan can write templates against information Dan extracts from the PDB files.

A polypeptide chain consists of a sequence of amino acid residues. ToDo: a query about residues, their types, etc.

The following query verifies that a uniprot id is found for every chain, by listing any chains that lack one:

Any chains missing a uniprot connection?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sc: <http://purl.org/science/owl/sciencecommons/>
prefix IAO:  <FIXME://example/IAO_terms#>
prefix ro:   <http://www.ifomis.org/bfo/1.1/ro#>
select distinct ?name ?chain where
{
 ?record rdfs:label ?name; <http://purl.obofoundry.org/obo/IAO_0000136>  ?struct.
 ?struct sc:has_grain [ ro:has_part ?chain ].
 ?chain ro:has_part ?contact.
 optional { ?chain sc:uniprot ?u }
 filter (!bound(?u))
}


There are some exceptions.

An initial approach to finding pdb alleles was based on text extraction with a regular expression, but it seems insufficient. See ImmPort/PDB alleles.
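
For illustration, the regular expression approach might look like the following (a hypothetical pattern, not the one in the code; as noted, it misses cases):

import re

# Matches allele-ish names such as "HLA-A2*0201" or "HLA-B*2709"
# in free-text description fields.
ALLELE_PAT = re.compile(r"HLA-(?:A|B|C|DP|DQ|DR)\w*\*\d+")

def find_alleles(text):
    return ALLELE_PAT.findall(text)

print find_alleles("HLA-A2*0201 complexed with peptide")  # ['HLA-A2*0201']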

Names, entities: actually needed?

We also want:

<PDBx:pdbx_entity_nameCategory>

The ones with name_type="SWS-KEYWORD" are not useful and can be dropped. Others such as name_type="RCSB_NAME" are more interesting.

There are none of these in the data, as far as I (Dan_Connolly) can tell.

Looks like the data format changed since JAR wrote up his notes.

It will be useful to keep separate information for each "entity" (entity = some part of the structure, probably not covalently bonded). Maybe just the description

<PDBx:pdbx_description>HLA-A2*0201</PDBx:pdbx_description>

but leave the option open to keep other information as well. Do we actually need the list of entities? Isn't the list of polymers enough?

URIs

The ImmPort/PDB template interface allows URIs to be chosen in a template rather than buried in the XML processing code.

For now we're using the URIs created for the Common Naming prototype (which see). If new URIs are needed, e.g. for structures (or structure records) in PDB, just make them up. When the Shared Names project is out and about we'll switch to those.

I'm using: PREFIX u: <FIXME://example/misc_terms#>

Geometry and Contact Points

Contact points (from amino acid on protein to amino acid on epitope) are important to capture.

Jmol has a scripting language that supports computing which residues are close to other residues; FirstGlance in Jmol is a particularly convenient interface for exploring contact points. We use the definition of contact points from FirstGlance and adapted its scripts for batch use; see pdbcontacts.py.
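
To illustrate the underlying idea in plain Python (this is not pdbcontacts.py, which drives Jmol; the 4.0 Å cutoff is an assumed placeholder, not the FirstGlance definition):

import math

CUTOFF = 4.0  # Angstroms; placeholder value, not the FirstGlance definition

def distance(a, b):
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(3)))

def contacts(chain_residues, peptide_residues):
    """Yield (chain_label, peptide_label) pairs of residues in contact.

    Residues are (label, [(x, y, z), ...]) pairs; two residues are in
    contact if any pair of their atoms is within CUTOFF of each other.
    """
    for clabel, catoms in chain_residues:
        for plabel, patoms in peptide_residues:
            if any(distance(a, b) <= CUTOFF
                   for a in catoms for b in patoms):
                yield clabel, plabel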

The result of the contact point calculation is a list of the residues of each chain where the chain contacts the peptide. See the example query in ImmPort/Demo_sketch.

Hmm... struggling to reconcile the results of this calculation w.r.t. img3d. Some test data:


1JWS (peptide chain C; for each chain C residue, the contacting residues on chains A and B)

PRO(P) 306
  A: PHE(F), ALA(A), SER(S)
  B: VAL(V)

LYS(K) 307
  A: SER(S), PHE(F), GLU(E)
  B: HIS(H), ASN(N), VAL(V)

TYR(Y) 308
  A: PHE(F), ILE(I), PHE(F), TRP(W), ALA(A), SER(S), PHE(F), GLU(E)
  B: ASN(N), VAL(V), GLY(G), PHE(F), THR(T)

VAL(V) 309
  A: GLN(Q), PHE(F), PHE(F)
  B: THR(T), TYR(Y), HIS(H), ASN(N)

LYS(K) 310
  A: GLN(Q), PHE(F), PHE(F), PHE(F), GLY(G), ALA(A), ASN(N)
  B: TYR(Y)

GLN(Q) 311
  A: GLN(Q), GLU(E), ASN(N)
  B: PHE(F), LEU(L), GLU(E), GLN(Q), ARG(R), ALA(A), TYR(Y)

ASN(N) 312
  A: ASN(N)
  B: PHE(F), ARG(R)

THR(T) 313
  A: GLU(E), ALA(A), ASN(N), ILE(I), VAL(V), ASP(D), GLU(E)
  B: PHE(F), ARG(R)

LEU(L) 314
  A: VAL(V), ASN(N)
  B: GLU(E), TYR(Y), TRP(W), LEU(L), ARG(R)

LYS(K) 315
  A: VAL(V), ASN(N)
  B: TYR(Y), TRP(W)

LEU(L) 316
  A: ASN(N), ILE(I), MET(M), ARG(R)
  B: TRP(W), ASP(D), TYR(Y), TRP(W)

ALA(A) 317
  A: ILE(I), ARG(R)
  B: PRO(P), ASP(D), TYR(Y)

THR(T) 318
  A: ILE(I)

Beyond PDB: SCOP and CATH

The SCOP and CATH information in the user interface is useful, but I don't think it occurs in the XML files. These come from third-party web sites, I assume. We should look into obtaining the information from its origin.

Gene / Uniprot

PDB and some of our other data sources are indexed by Uniprot id, while the Neurocommons store only knows about Entrez Gene ids, so the first order of business is to create a bundle consisting of a Uniprot/Gene correspondence.

See Bundles/ipi.

Making Use of the PDB data

Once it's in a triple store (see Bundle development), we can point some RDF browser (e.g. Virtuoso's, or Sesame) at it, and presto, we have a portal. Well... maybe. We'll probably want to write our own query front end for easier search and access. See also ImmPort/Demo_sketch.

Appendix: Implementation Notes and dependencies

The code is in svn: (check: packages/pdb). It follows Package Makefile conventions.

The code is written in Python, following Python norms; initially it ran on the traditional C implementation of Python, but in order to integrate with Jmol, the per-structure processing code was ported to Jython. I looked for a way to compute contact points using a promising-looking Python library, mmLib, but didn't see very direct support.

The code depends on Jython 2.5 features (yield). There's a stanza in the Makefile to download jython_installer-2.5.0.jar. The JDK on norbert (1.5) crashed when running this code; I reported the problem (jython issue 1416). Peers suggested I grab a 1.6 JDK, so I grabbed Sun's JDK 1.6 (it's installed in ~danc/jdk1.6.0_14 on norbert; I don't recall the exact path I used to get it from Sun, but the file was called jdk-6u14-linux-x64.bin and it has md5sum 39f51249580fc042d23a4c76cac2b209).

The straightforward use of the XPath and DOM libraries that come with the JDK resulted in OutOfMemory errors, so we migrated to the JDOM SAXBuilder class, which supports integration of SAX filters. This involved using jaxen for XPath support. There are Makefile stanzas deps/jdom.jar and deps/jaxen-1.1.1.jar that download and unzip the necessary files.
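
A minimal Jython sketch of that arrangement, assuming jdom.jar and jaxen-1.1.1.jar are on the classpath; the namespace URI here is illustrative (see the note above about it changing):

from org.jdom import Namespace
from org.jdom.input import SAXBuilder
from org.jdom.xpath import XPath

# Build the document with SAXBuilder rather than the JDK DOM loader.
doc = SAXBuilder().build("1k5n.xml")

# jaxen supplies the XPath engine behind org.jdom.xpath.XPath.
xp = XPath.newInstance("//PDBx:pdbx_description")
xp.addNamespace(Namespace.getNamespace(
    "PDBx", "http://pdbml.pdb.org/schema/pdbx-v32.xsd"))  # illustrative URI
for node in xp.selectNodes(doc):
    print node.getText()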

ToDo:

  • move deps from $(HOME) and deps to OPT etc. as discussed with jar 4 Sep
  • consider subsuming 'make prereqs' under 'make prepare'

The dependency on mechanize is gone as of r1258.

Appendix: Task List

Note: see ImmPort/Demo_sketch for tasks more generally related to the demo than this one package.

  • table of (pdbid, chain) -> allele relationships
    • this is implemented to some extent, but I haven't finished checking that it's complete.
    • original implementation was a shell/perl one-liner on the raw ntriples data; doing it as a SPARQL query is starting to work. See ImmPort/PDB alleles.
    • Alan suggests using sequence comparison to find a closest allele when text extraction doesn't work. Many roads lead to the Needleman-Wunsch algorithm for computing a score; see the sketch after this list.
    • curation of the HLA-Chain relationships, e.g. 3HCV HLA-B*2709 - Chain A
      • trivial to do a 40% first pass at this; finishing it depends on an understanding of the PDB pages that I might or might not have
  • refine template to get rid of FIXME: namespaces
  • handle multiple chains in chain ID field (e.g. 1IM9/complex/Chain_B,F)
    • my 2nd approach at this seems to work
  • prune false positives from output of keyfetch.py (e.g. 1FFK)
    • trivial to do an 80% first pass at this; straightforward to iterate as new data become available
  • tune performance
    • as of 2009-08-12, it takes 108m16.006s real, 145m54.455s user to build the bundle
    • I split contact computation from the rest. It might be slower this 1st time (due to 2x jvm startup time) but it should be faster next time the template is revised.
      • 21 Aug contacts build with make -j 4: real 38m1.949s user 146m38.374s sys 1m51.327s
      • for non-contact data: real 26m12.549s user 101m1.275s sys 1m46.411s
    • single jython run to save jython start-up time? this makes parallelism harder, I think (lower pri)
    • build on AWS
    • separate contact computation from the .xml.gz handling
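
As referenced in the allele task above, here is a minimal Needleman-Wunsch sketch (illustrative scoring parameters; closest_allele is a hypothetical helper, not part of the package):

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global-alignment score of sequences a and b (score only,
    no traceback). The scoring parameters are illustrative."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def closest_allele(chain_seq, allele_seqs):
    """Pick the allele (from a {name: sequence} dict) whose reference
    sequence aligns best to the chain sequence."""
    return max(allele_seqs, key=lambda name: nw_score(chain_seq, allele_seqs[name]))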