ImmPort/PDB alleles

Many of the HLA structures from PDB have an allele associated with one of their chains. Should they all have one? Are all the ones without alleles false positive matches for HLA?


Unmatched chains

Which structures have no chains with an associated allele in the bundle?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix util: <FIXME://example/misc_terms#>
prefix ro:   <http://www.ifomis.org/bfo/1.1/ro#>
prefix bfo:  <http://www.ifomis.org/bfo/1.1#>

select distinct ?crystal
where {
 ?crystal rdf:type bfo:MaterialEntity;
   util:has_grain ?complex.
 optional { ?complex ro:has_part ?chain. ?chain util:allele ?allele.
 }
 filter (!bound(?chain))
}
order by ?crystal


As of 2009-08-27, this query returns 200 results (see File:,chains.txt); the first few are:

http://purl.obolibrary.org/pdb/1A6A/crystal
http://purl.obolibrary.org/pdb/1AO7/crystal
http://purl.obolibrary.org/pdb/1BD2/crystal
http://purl.obolibrary.org/pdb/1BXT/crystal
http://purl.obolibrary.org/pdb/1CG9/crystal
http://purl.obolibrary.org/pdb/1D5M/crystal
http://purl.obolibrary.org/pdb/1D5X/crystal
http://purl.obolibrary.org/pdb/1D5Z/crystal
http://purl.obolibrary.org/pdb/1D6E/crystal
...

String Searching

One approach is to look for substring(pdbseq, alleleseq) where pdbseq is the sequence of a PDB chain and alleleseq is the sequence of an HLA allele.

I cleaned up the code to do this a little bit and started a new bundle (packages/pdbsc/matchph.py):

 pdbsc$ python matchph.py
 HLA alleles: 3412
 PDB chains: 271
 unmatched chains 45
 1A1M A no match
 1A1N A no match
 1A1O A no match
 ...
 2 digit matches: 73
 1A6A A matches allele group: DRA*01
 1A9B A matches allele group: B*35
 1A9B D matches allele group: B*35
 ...


 4 digit matches: 153
 1AGB A matches allele: B*0801
 1AGC A matches allele: B*0801
 1AGD A matches allele: B*0801
 ...

The full output is File:Allele-search.txt.
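
To make the matching idea concrete, here is a minimal sketch of a substring search followed by the 2-digit/4-digit grouping seen in the output above. It is not the code in matchph.py; the function name classify_chain, the dict-of-sequences input, and the choice to accept a substring match in either direction are all assumptions made for illustration.

```python
import re

# Old-style HLA names such as "B*0801" or "HLA-Cw*0703":
# optional "HLA-" prefix, locus, '*', then pairs of digits.
ALLELE_NAME = re.compile(r"^(?:HLA-)?(?P<locus>[A-Za-z0-9]+)\*(?P<digits>\d{2,})")


def classify_chain(chain_seq, alleles):
    """Return (kind, label) where kind is "allele", "group", "ambiguous" or "none".

    alleles maps an allele name to its protein sequence (hypothetical shape).
    """
    # Assumption: a match is a substring relation in either direction.
    hits = [name for name, seq in alleles.items()
            if seq and (seq in chain_seq or chain_seq in seq)]
    if not hits:
        return ("none", None)
    parsed = [ALLELE_NAME.match(name) for name in hits]
    if not all(parsed):
        return ("ambiguous", sorted(hits))
    # All hits agree on the first four digits: report a single allele.
    four = {m.group("locus") + "*" + m.group("digits")[:4] for m in parsed}
    if len(four) == 1:
        return ("allele", four.pop())
    # Otherwise, all hits agree on the first two digits: report the group.
    two = {m.group("locus") + "*" + m.group("digits")[:2] for m in parsed}
    if len(two) == 1:
        return ("group", two.pop())
    return ("ambiguous", sorted(hits))
```

A chain whose hits differ only beyond the fourth digit is reported as one allele; hits that agree only on the first two digits are reported as an allele group.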


There seem to be 91 structures where finding the HLA chain via UniProt and Entrez does not work:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix util: <FIXME://example/misc_terms#>
prefix ro:   <http://www.ifomis.org/bfo/1.1/ro#>
prefix bfo:  <http://www.ifomis.org/bfo/1.1#>
prefix IAO:  <FIXME://example/IAO_terms#>
prefix bio2rdf: <http://bio2rdf.org/ns/bio2rdf#>
prefix sc: <http://purl.org/science/owl/sciencecommons/>
select ?pdbid
where {
 ?record rdfs:label ?pdbid; IAO:is_about ?struct.
 ?struct util:has_grain ?complex.
optional {
 ?complex ro:has_part ?chain.
 ?chain util:uniprot ?prot.
 ?gene bio2rdf:xPath ?prot.
 ?gene sc:ggp_has_primary_symbol ?genename.
 filter ( regex(?genename, "^HLA-") )
 }
filter (!bound(?chain))
}


Scoring

A search for "python multiple sequence alignment" turned up a student project implementing the Needleman-Wunsch algorithm; further searching found macpy, which has cleaner code but uses a trivial substitution matrix and should probably be enhanced to use BLOSUM or the like. It's pretty slow (it takes several minutes to compare one chain against the relevant alleles), and I'm not at all confident about interpreting the scores.
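
For reference, a minimal sketch of Needleman-Wunsch scoring with a trivial substitution scheme (fixed match, mismatch, and gap values). This is not the macpy code; nw_score and its default parameters are assumptions chosen for illustration, and only the score is computed, not the alignment itself.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (Needleman-Wunsch, linear gap penalty)."""
    # Row 0: aligning the empty prefix of a against each prefix of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]
```

The work is proportional to len(a) * len(b), which is consistent with a pure-Python comparison of one chain against thousands of alleles being slow. Swapping the match/mismatch constants for a BLOSUM lookup would change only the line that computes sub.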

Blast

Summary:

  • structures with allele/group matches: 221
    • chains with allele matches: 209
    • chains with allele group matches: 73
    • chains with ungrouped allele matches: 1
      • 2CII A alleles ['HLA-Cw*0825', 'HLA-Cw*0741', 'HLA-Cw*0703']
    • These total more than 221 because there are some structures with more than one HLA chain; e.g. 1HDM A and B.
  • structures with no chains matched to alleles: 52
    • These are candidates to add to the false positive list in the HLA keyword search. The first few are 1GZP, 1GZQ; see ImmPort/Blast Report for the full list.

Detail:

I made a blast database of all 3622 HLA alleles in hla.dat (ImmPort/Blast Report shows how the packages/pdbsc/Makefile does this).

Then I made 2418 fasta files, one for each chain from the pdb bundle as found by this query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix util: <FIXME://example/misc_terms#>
prefix ro:   <http://www.ifomis.org/bfo/1.1/ro#>
prefix IAO:  <FIXME://example/IAO_terms#>

 select distinct ?pdbid ?chain ?seq
 where {
  ?record rdfs:label ?pdbid; IAO:is_about ?struct.
  ?struct util:has_grain ?complex.
  ?complex ro:has_part ?chain.
  ?chain util:seq ?seq
 }
order by ?chain
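
A minimal sketch of turning the rows of such a query into one FASTA file per chain. The row shape (pdbid, chain label, sequence), the file naming scheme, and the function names are assumptions; the actual Makefile steps are in ImmPort/Blast Report.

```python
import os


def fasta_record(pdbid, chain, seq, width=60):
    """Format one chain as a FASTA record with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">%s_%s\n%s\n" % (pdbid, chain, "\n".join(lines))


def write_chain_fastas(rows, outdir):
    """Write one FASTA file per (pdbid, chain, seq) row; return the paths."""
    paths = []
    for pdbid, chain, seq in rows:
        # Hypothetical naming scheme: <pdbid>_<chain>.fasta
        path = os.path.join(outdir, "%s_%s.fasta" % (pdbid, chain))
        with open(path, "w") as out:
            out.write(fasta_record(pdbid, chain, seq))
        paths.append(path)
    return paths
```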

Then I ran blast on each of the 2418 chains against the alleles.

86 of them failed because they seem to have RNA sequence data (U...) rather than amino acid sequence data. Looking at 1JJ2.xml.gz showed that chain 9 has type polyribonucleotide rather than polypeptide. Oddly, chain 0 is also a polyribonucleotide, but in that case blast gave no hits rather than failing with a warning. Hence it's not surprising that this query shows 97 results, i.e. more than 86:

How many polyribonucleotide chains?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix util: <FIXME://example/misc_terms#>
prefix ro:   <http://www.ifomis.org/bfo/1.1/ro#>
prefix IAO:  <FIXME://example/IAO_terms#>
select count(?chain)
where {
 ?record rdfs:label ?pdbid; IAO:is_about ?struct.
 ?struct util:has_grain ?complex.
 ?complex ro:has_part ?chain.
 ?chain util:seq ?seq.
 ?chain rdfs:comment "polyribonucleotide".
}
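
One way to avoid the failures would be to screen out RNA-looking sequences before running blast. The sketch below is a rough heuristic only: looks_like_rna, RNA_LETTERS, and min_length are hypothetical names, and the chain type recorded in rdfs:comment is likely a more reliable signal than the letters of the sequence.

```python
# Letters expected in an RNA sequence (N for an unknown base).
RNA_LETTERS = frozenset("ACGUN")


def looks_like_rna(seq, min_length=10):
    """Heuristic: long enough, contains U, and uses only RNA letters."""
    seq = seq.upper()
    return len(seq) >= min_length and "U" in seq and set(seq) <= RNA_LETTERS
```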


Then I eliminated various cases:

  • blast results had no hits
  • less than 50% identities (i.e. matching residues)
  • blast bit scores below 100 (eliminates the case of 9/9 identities, i.e. short peptides)

Hmm... I didn't count those cases to check that they add up to 2418.

In the case of an actual HLA chain, this would still leave dozens of matches, the first few with 100% or 99% identities. I eliminated any below the top number of identities. This led to either a single allele (modulo nucleotide-only variation) or, except for one case, an allele group (XX*NN), as summarized above.
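
The elimination steps can be sketched as a small filter over parsed blast hits. The Hit tuple and the best_hits function are hypothetical; only the thresholds (50% identities, bit score 100, keep the top identity count) come from the description above, and parsing the blast output is not shown.

```python
from collections import namedtuple

# Hypothetical record for one parsed blast hit.
Hit = namedtuple("Hit", "allele identities align_len bits")


def best_hits(hits, min_identity=0.5, min_bits=100):
    """Drop weak hits, then keep only those with the top identity count."""
    kept = [h for h in hits
            if h.align_len > 0
            and h.identities / float(h.align_len) >= min_identity
            and h.bits >= min_bits]
    if not kept:
        return []  # no usable hits for this chain
    top = max(h.identities for h in kept)
    return [h for h in kept if h.identities == top]
```

A chain with an empty result is a candidate for the "no chains matched" list; a chain whose surviving hits differ only in nucleotide-level digits counts as a single allele.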

For detailed enumeration, see ImmPort/Blast Report.