ImmPort/Demo sketch

Work in progress - not ready for prime time

Originally based on email from Alan to JAR, 2009-05-12; subsequently edited by JAR; and then updated by Dan as development progresses.

See also: ImmPort/Demo queries

The demo has two parts: associating perturbed sequence features with alleles, and associating diseases with alleles.

This explanation assumes familiarity with ImmPort/PDB.

@@extra credit For a deliverable we try to get [Richard and/or Mark] to have a triple store installation of our content with as much of this as we can accomplish (it's possible we can do all of this) and have them running whatever portions of queries leading to this as we have available.

Contents

Contact Points from PDB

done except for alignment bug re PTM

The most important kind of feature is the contact point, where an HLA chain contacts the target peptide. For example, the First Glance in Jmol view of 1AGB: shows contact residues are TYR:7 A, ASP:9 A, ASN 63:A, etc.:

Image:1AGB-contacts.png

We model each contac from residue ?hla_residue on an HLA chain to ?peptide_residue on a peptide a fact { ?hla_residue sc:chain_contact ?peptide_residue }, as the following query demonstrates:

What are the contact points of 1AGB?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sc: <http://purl.org/science/owl/sciencecommons/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix ro:   <http://www.ifomis.org/bfo/1.1/ro#>

select distinct ?peptide_res ?hla_res where {
 ?record rdfs:label "1AGB";
   <http://purl.obofoundry.org/obo/IAO_0000136> [
     sc:has_grain [ ro:has_part ?chain ]].

 ?chain ro:has_part ?hla_res.
 ?hla_res sc:chain_contact ?peptide_res.
}
order by ?peptide_res ?hla_res


First few results:

peptide_res hla_res
http://purl.org/science/pdb/residue/1AGB/C/0001 http://purl.org/science/pdb/residue/1AGB/A/0007
http://purl.org/science/pdb/residue/1AGB/C/0001 http://purl.org/science/pdb/residue/1AGB/A/0159
http://purl.org/science/pdb/residue/1AGB/C/0001 http://purl.org/science/pdb/residue/1AGB/A/0171
http://purl.org/science/pdb/residue/1AGB/C/0002 http://purl.org/science/pdb/residue/1AGB/A/0007
http://purl.org/science/pdb/residue/1AGB/C/0002 http://purl.org/science/pdb/residue/1AGB/A/0063
http://purl.org/science/pdb/residue/1AGB/C/0003 http://purl.org/science/pdb/residue/1AGB/A/0070
http://purl.org/science/pdb/residue/1AGB/C/0003 http://purl.org/science/pdb/residue/1AGB/A/0156

ToDo: make this an automated test case for Bundles/pdb.

separate bundle for contacts? @@@

Sequence features

code prototyped but no data bundled built yet.

For present purposes a sequence feature is a structural or functional element connected with a position or span within an amino acid sequence. That is, any given feature is fixed at some particular place along a sequence. The kinds of features we'll look at are:

  • Contact point - derived from PDB - where an HLA molecule contacts a short polypeptide chain
  • Protein family - derived from PFAM or SCOP (JAR's guess is that these will not be very informative)
  • Structural - derived from CATH

@@regions from http://www.uniprot.org/uniprot/P04229 ?

Features are, e.g. "alpha helix", or "domains".

PFAM has domain info, and PDB links to PFAM; e.g. PFAM info for 1CF5 is PF00161, whence Interpro IPR001574 tells us e.g. of IPR016138 Ribosome-inactivating protein, subdomain 1. That page gives us examples of the coordinates of the domain in particular proteins. (I'm having a hard time figuring out how the domain is defined, but that's a different story.) Presumably among all of the listed accessions (sequences) is the appropriate chain from PDB 1CF5.

Sanger's PFAM site has very PFAM/Uniprot/PDB correlations in easy-to-scrape form; e.g. 1K5N. pfam_scrape.py is a prototype implementation that could be integrated with the PDB template mechanism.

EBI seems to have even better structural information; e.g. 1K5N.

A table of sequence feature variance types (SFVTs) is supplied by Richard's group (and we'll find more) (JAR: how?) - these are features specific to one allele (really? which one?). This is work AlanR does.

Alan and Dan need to work out what their interface is around SFVTs as this is likely to be a blocking point in the process.

(Need to figure out the coordinate systems used and relate to PFAM and SVFT.) (I wonder if this information is available in a machine-friendly form? Probably not.)

See also ImmPort/Sequence features, some notes started by jar.

Perturbation and Sequence alignment

done except for bug with primary_allele on DRB and PMT bug

Alleles differ in their sequences; often two related alleles will be almost identical, differing only at a single amino acid. The mutation points are here called perturbations and are of great interest because they may reflect functional changes related to inherited susceptibility to disease.

Perturbations also occur at particular positions, and are thus are like sequence feature. The goal is to correlate features with perturbations: In a particular allele (or set of alleles related to some disease), which features are perturbed (i.e. which features contain positions that are perturbed in that allele and generally not in others)?

For example, consider HLA-B*080101. In order to find corresponding positions on other alleles, we use as a reference the reference sequence of the HLA-B locus assigned by The IMGT in B_prot.txt, B*070201:

HLADB-2.26.0-Jul 2009
HLA-B Protein Sequence Alignments
Sequences Aligned: 17 July 2009
Steven G. E. Marsh, Anthony Nolan Research Institute.

Prot. Pos.        -30        -20        -10                10         20
B*070201                MLVM APRTVLLLLS AALALTETWA GSHSMRYFYT SVSRPGRGEP
...
B*080101                ---- ---------- ---------- --------D- AM--------
                        ^                                ^
                        1st/-24th                        31st/7th

Some alleles have insertions, deletions, or an initial unknown segment, but not in this case, so the fully aligned offset is 31. Why does B_prot.txt from IMGT start from -24? where does that number come from? Seems to be this bit of IMGT/HLA help: "For amino acid-based systems, the start codon of the mature protein is labeled codon 1." ToDo add PMT offset, rebuild pdb, hla, and pdbsc bundles.

The alleles with perturbations at this offset are B*1404 and B*15010102N:

HLA-B Protein Sequence Alignments

Prot. Pos.        -30        -20        -10                10         20
B*070201                MLVM APRTVLLLLS AALALTETWA GSHSMRYFYT SVSRPGRGEP
...
B*080101                ---- ---------- ---------- --------D- AM--------
                        ^                                ^
                        1st                              31st
...
B*1404                  **** ********** ********** *-----H--- A---------
...
B*15010102N             -R-T ---------- G--------- -ECGVGREMA --G-SEGTAG

See Bundles/hla for a SPARQL query for "perturbations in B*1404".

Richard's group has asked to compute whether sequence features are in areas of variation in specific alleles even if we don't have direct information. I (Alan) imagine that JAR would do this portion of the work in consultation with [them?], which would include doing the parsing [of what? IMGT?], translation, and query development demonstrating the capability. Example of such a query: By comparing structures of allele X binding to peptides of length 9, which contact points in allele Y (assuming the IMGT alignment used to map to X) have different amino acids than X?


Finding perturbed contact points

Example query: given a PDB structure... relative to contact positions, what alleles have those contact positions deleted... which have a different amino acid?

Each PDB chain was compared against all HLA alleles using blast (using sequence data from PDB and IMGT respectively).

Continuing with the 1AGB example: the blast results show that positions 1-276 of chain A of 1AGB (i.e. all of it) match positions 25 to 300 @@FIX of the HLA-B*080101 allele. So the 7th position on 1AGB A corresponds to the 31st position of B*080101.

As we discovered above, the alleles with perturbations at this offset are B*1404 and B*15010102N.

The corresponding SPARQL query is:


Show alleles perturbed at contacts of 1AGB

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sc: <http://purl.org/science/owl/sciencecommons/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix MaHCO: <http://purl.org/stemnet/HLA#>
prefix HLA: <http://purl.org/stemnet/HLA#>
prefix ro:   <http://www.ifomis.org/bfo/1.1/ro#>

select distinct ?pos ?refname ?fromname ?allelen ?toname where
{
 ?record rdfs:label "1AGB"; <http://purl.obofoundry.org/obo/IAO_0000136>  ?struct.
 ?struct sc:has_grain [ ro:has_part ?chain ].

 ?chain ro:has_part ?contact.
 ?contact sc:chain_contact ?peptide_res.
 ?chain sc:blast_match ?allele_for_chain.
 ?allele rdfs:subClassOf ?locus; sc:perturbation [ sc:reference_sequence_item ?item; rdf:type ?totype].
 optional { ?totype rdfs:label ?toname }.
 ?item rdf:value ?pos.
 optional { ?item rdf:type [ rdfs:label ?fromname] }

 ?locus sc:primary_allele ?base.
 ?base rdfs:label ?refname.
 ?allele_for_chain rdfs:subClassOf ?locus.
 ?allele rdfs:label ?allelen.
 ?contact sc:reference_sequence_item ?item.
}
order by ?pos ?allele


The first few results are as expected:

pos refname fromname allelen toname
7 B*070201 tyrosine residue B*1404 histidine residue
7 B*070201 tyrosine residue B*15010102N arginine residue
9 B*070201 tyrosine residue B*0729 aspartic acid residue
9 B*070201 tyrosine residue B*0770 cysteine residue
9 B*070201 tyrosine residue B*080101 aspartic acid residue
9 B*070201 tyrosine residue B*080102 aspartic acid residue
9 B*070201 tyrosine residue B*080103 aspartic acid residue
9 B*070201 tyrosine residue B*080104 aspartic acid residue
9 B*070201 tyrosine residue B*080105 aspartic acid residue
9 B*070201 tyrosine residue B*080106 aspartic acid residue
9 B*070201 tyrosine residue B*0802 aspartic acid residue
9 B*070201 tyrosine residue B*0803 aspartic acid residue
9 B*070201 tyrosine residue B*0804 aspartic acid residue
9 B*070201 tyrosine residue B*0805 aspartic acid residue
9 B*070201 tyrosine residue B*0806 aspartic acid residue
9 B*070201 tyrosine residue B*0807 aspartic acid residue
9 B*070201 tyrosine residue B*0808N aspartic acid residue
9 B*070201 tyrosine residue B*0809 aspartic acid residue
9 B*070201 tyrosine residue B*0810 aspartic acid residue
9 B*070201 tyrosine residue B*0811 aspartic acid residue
9 B*070201 tyrosine residue B*081201 aspartic acid residue
9 B*070201 tyrosine residue B*0813 aspartic acid residue
9 B*070201 tyrosine residue B*0814 aspartic acid residue
9 B*070201 tyrosine residue B*0815 aspartic acid residue
9 B*070201 tyrosine residue B*0816 aspartic acid residue
9 B*070201 tyrosine residue B*0817 aspartic acid residue
9 B*070201 tyrosine residue B*0818 aspartic acid residue
9 B*070201 tyrosine residue B*0819N aspartic acid residue
9 B*070201 tyrosine residue B*0820 aspartic acid residue
9 B*070201 tyrosine residue B*0821 aspartic acid residue
9 B*070201 tyrosine residue B*0822 aspartic acid residue
9 B*070201 tyrosine residue B*0823 aspartic acid residue
9 B*070201 tyrosine residue B*0824 aspartic acid residue
9 B*070201 tyrosine residue B*0825 aspartic acid residue
9 B*070201 tyrosine residue B*0827 aspartic acid residue
9 B*070201 tyrosine residue B*0828 aspartic acid residue
9 B*070201 tyrosine residue B*0829 aspartic acid residue
9 B*070201 tyrosine residue B*0830N aspartic acid residue
9 B*070201 tyrosine residue B*0831 aspartic acid residue

Notes:

  • sc:blast_match blast_match data should comes from the pdbsc bundle (@@docs)
    • ToDo: make separate relations for exact match and off-by-one match (done? ToDo: test it)
    • ToDo: add these to sciencecommons.owl (via the .lisp source), including comments and, if feasible, domains, ranges, etc.
  • sc:primary_allele is new; it relates a locus/gene (e.g. HLA-A) to the primary/base allele chosen by IMGT; this data has been added to the hla bundle. Its use in this query is just to restrict the offsets to the same coordinate system, i.e. the coordinate system of the base allele for the locus.
  • sc:perturbation relates an allele to a perturbation (either a residue or a deletion); it's somewhat like sc:chain_contact in that it (typically) relates a polypeptide to one of its amino acids
  • sc:chain_contact has replaced util:contact from the pdb contact point calculation
  • sc:aligned_offset relates a residue (or a deletion) to an offset aligned to the base allele for a locus. (@@give an example)


@@jar thinks this would be interesting: 2 alleles of the same locus are contacting the same peptide at different contact points

I wrote some code to find interesting cases, and it found:

  • 1S9V chain A from 1 to 193 matches DQA1*050101 from 24 to 216 (1-based)
    • DQA1*050101 diff has a deletion (.) at 78 (0-based)

yup... that found a bug... residue 1 should be offset by 24 or so. ok... fixed that one... and good... the deletion worked:

 <http://purl.obolibrary.org/pdb/1S9V/complex/Chain_A> <http://purl.org/science/owl/sciencecommons/blast_match>
  <http://purl.org/stemnet/HLA#DQA1_050101> .
 <http://purl.obolibrary.org/pdb/1S9V/A_GLU_1> <http://purl.org/science/owl/sciencecommons/aligned_offset> 24 .
 ...
 <http://purl.obolibrary.org/pdb/1S9V/A_ARG_78> <http://purl.org/science/owl/sciencecommons/aligned_offset> 101 .
 <http://purl.obolibrary.org/pdb/1S9V/A_SER_79> <http://purl.org/science/owl/sciencecommons/aligned_offset> 103 .


This one has multiple matching alleles (DQA1_0503, DQA1_0505, 7, 8, 9). hmm... when there are multiple matching alleles, can we be sure the offsets all match? Maybe it's easy to check?

Darn:

 ValueError: alignment mismatch: ['1D5M', 'A', 'DRA*0101', 'DRA*010201', [26, 27, 28, ...], [29, 30, 31, ...]]

That seems to be 1 of 46 cases:

 danc@norbert:~/nc/packages/pdbsc> grep ValueError make.out |wc
    46   17217   81344

DanC went over this with Jar and found that the alignment mismatch check was buggy... gotta take another look.

ppoints.py prototype

This was originall prototyped in pdbsc/ppoints.py; I think it's working, including blast offsets:

pdbsc$ PYTHONPATH=../pdb:../hla python ppoints.py 1AGB A 7 9 63 70
contact: 1AGB A 7
@@pos 7 @@blast 1AGB A matches B*080101 which is offset by 0 from its base, B*070201
@@blast coords (1, 276, 25, 300) give offset 24 @@for a total of:  31  
where the main allele,  B*070201  has res: Y

perturbation: B*1404 31 H
perturbation: B*15010102N 31 R

Reference sequences, aka primary alleles

Move this to Bundles/hla?

Show all primary alleles; check that each has an rdfs:label from MaHCO:

Show alleles perturbed at contacts of 1AGB

PREFIX sc: <http://purl.org/science/owl/sciencecommons/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?l ?ltxt ?a ?atxt where {
  ?l sc:primary_allele ?a.
  optional { ?l rdfs:label ?ltxt }
  optional { ?a rdfs:label ?atxt }
}
order by ?l


l ltxt a atxt
http://purl.org/stemnet/HLA#HLA-A HLA-A http://purl.org/stemnet/HLA#A_01010101 A*01010101
http://purl.org/stemnet/HLA#HLA-B HLA-B http://purl.org/stemnet/HLA#B_070201 B*070201
http://purl.org/stemnet/HLA#HLA-C HLA-C http://purl.org/stemnet/HLA#Cw_010201 Cw*010201
http://purl.org/stemnet/HLA#HLA-DMA HLA-DMA http://purl.org/stemnet/HLA#DMA_0101 DMA*0101
http://purl.org/stemnet/HLA#HLA-DMB HLA-DMB http://purl.org/stemnet/HLA#DMB_0101 DMB*0101
http://purl.org/stemnet/HLA#HLA-DOA HLA-DOA http://purl.org/stemnet/HLA#DOA_010101 DOA*010101
http://purl.org/stemnet/HLA#HLA-DOB HLA-DOB http://purl.org/stemnet/HLA#DOB_01010101 DOB*01010101
http://purl.org/stemnet/HLA#HLA-DPA1 HLA-DPA1 http://purl.org/stemnet/HLA#DPA1_010301 DPA1*010301
http://purl.org/stemnet/HLA#HLA-DPB1 HLA-DPB1 http://purl.org/stemnet/HLA#DPB1_010101 DPB1*010101
http://purl.org/stemnet/HLA#HLA-DQA1 HLA-DQA1 http://purl.org/stemnet/HLA#DQA1_010101 DQA1*010101
http://purl.org/stemnet/HLA#HLA-DQB1 HLA-DQB1 http://purl.org/stemnet/HLA#DQB1_050101 DQB1*050101
http://purl.org/stemnet/HLA#HLA-DRA HLA-DRA http://purl.org/stemnet/HLA#DRA_0101 DRA*0101
http://purl.org/stemnet/HLA#HLA-DRB1 HLA-DRB1 http://purl.org/stemnet/HLA#DRB1_010101 DRB1*010101
http://purl.org/stemnet/HLA#HLA-DRB2 HLA-DRB2 http://purl.org/stemnet/HLA#DRB2_0101 DRB2*0101
http://purl.org/stemnet/HLA#HLA-DRB3 HLA-DRB3 http://purl.org/stemnet/HLA#DRB3_010101
http://purl.org/stemnet/HLA#HLA-DRB4 HLA-DRB4 http://purl.org/stemnet/HLA#DRB4_010101 DRB4*010101
http://purl.org/stemnet/HLA#HLA-DRB5 HLA-DRB5 http://purl.org/stemnet/HLA#DRB5_010101 DRB5*010101
http://purl.org/stemnet/HLA#HLA-DRB6 HLA-DRB6 http://purl.org/stemnet/HLA#DRB6_0101 DRB6*0101
http://purl.org/stemnet/HLA#HLA-DRB7 HLA-DRB7 http://purl.org/stemnet/HLA#DRB7_010101 DRB7*010101
http://purl.org/stemnet/HLA#HLA-DRB8 HLA-DRB8 http://purl.org/stemnet/HLA#DRB8_0101 DRB8*0101
http://purl.org/stemnet/HLA#HLA-DRB9 HLA-DRB9 http://purl.org/stemnet/HLA#DRB9_0101 DRB9*0101
http://purl.org/stemnet/HLA#HLA-E HLA-E http://purl.org/stemnet/HLA#E_01010101 E*01010101
http://purl.org/stemnet/HLA#HLA-F HLA-F http://purl.org/stemnet/HLA#F_01010101 F*01010101
http://purl.org/stemnet/HLA#HLA-G HLA-G http://purl.org/stemnet/HLA#G_01010101 G*01010101
http://purl.org/stemnet/HLA#HLA-H HLA-H http://purl.org/stemnet/HLA#H_01010101 H*01010101
http://purl.org/stemnet/HLA#HLA-J HLA-J http://purl.org/stemnet/HLA#J_01010101 J*01010101
http://purl.org/stemnet/HLA#HLA-K HLA-K http://purl.org/stemnet/HLA#K_01010101 K*01010101
http://purl.org/stemnet/HLA#HLA-L HLA-L http://purl.org/stemnet/HLA#L_01010101 L*01010101
http://purl.org/stemnet/HLA#HLA-P HLA-P http://purl.org/stemnet/HLA#P_01010101 P*01010101
http://purl.org/stemnet/HLA#HLA-V HLA-V http://purl.org/stemnet/HLA#V_01010101 V*01010101
http://purl.org/stemnet/HLA#MICA MICA http://purl.org/stemnet/HLA#MICA_001 MICA*001
http://purl.org/stemnet/HLA#MICB MICB http://purl.org/stemnet/HLA#MICB_001 MICB*001
http://purl.org/stemnet/HLA#TAP1 TAP1 http://purl.org/stemnet/HLA#TAP1_0101 TAP1*0101
http://purl.org/stemnet/HLA#TAP2 TAP2 http://purl.org/stemnet/HLA#TAP2_0101 TAP2*0101

A study of IMGT alignment data

Note Entrez distinguishes HLA-DQA1 and HLA-DQA2, while the IMGT .zip file lumps them together into DQ_prot.txt. This resulted in assigning a sc:primary_allele to DRB, but not to DRB1, but MaHCO doesn't have a term for DRB, so of course no alleles are linked to it. Issue. See also: Bundles/hla.

Perhaps we should be using DRB1_prot.msf rather than DRB_prot.txt. The .txt format seems to be undocumented, and though it seems to have alignment info, the IMGT download docs say, "The MSF file format is the only format provided that includes the alignment information."

The MSF format seems to be a proprietary invention of Accelrys, but the IMGT download docs go on to say, "All PIR and MSF files have been generated using "ReadSeq", a freely available sequence format conversion program written by D. Gilbert." After a few broken links (or were they server glitches?) I found the readseq source. But I still don't see how deletions are represented. Perhaps better to just hard-code the list of reference sequences (aka sc:primary_allele) from IMGT/HLA help.

verify: Each allele has a delta

exercise: parse alignment data; check that there's a delta between each allele.

Done in alignment_parser.py r1172; see ImmPort/Check Alignment Deltas

Mapping of disease to allele

We need some mapping of disease to allele. There are implicit mappings already via MeSH -> paper (medline/subject-headings bundle), IMGT(allele) -> paper (hla bundle), our text processing (allele) -> paper (medline/alleles bundle). We can also use the alleles JAR has found in OMIM (omim bundle), the OMIM phenotype annotations from The Human Phenotype Ontology - Downloads (relates OMIM to HP = human phenotype ontology), or an existing similar source if there is one (where to look?). Finally we can also look at the GO annotations (obo bundle) to see whether there are mappings to the Uniprot entries for different alleles to see whether there is biological material there (essentially: what are the Uniprot ids for the HLA alleles. How many annotations are there. Rerun our GO annotations script to get the Uniprot associations too - bundle ncbi/goa is now keyed to Entrez Gene, but could easily be additionally keyed to Uniprot).

(See Bundles/ipi for a Uniprot to Gene mapping.)

Query

With the above we can do a demo query: Start with disease (or anything else we can), follow to allele. For that allele find perturbed contact positions or SFVTs [?]. If more than one allele is associated with the disease, ask which peturbations are common to both.

For example, one might start broadly with Immune System Disease (MeSH D020274), or narrowly with Systemic Scleroderma.


What papers relate Systemic Scleroderma to which HLA alleles?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix sc: <http://purl.org/science/owl/sciencecommons/>
prefix mesh: <http://purl.org/commons/record/mesh/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ssc: <http://purl.org/commons/record/mesh/D012595>
prefix mentions: <http://purl.obofoundry.org/obo/IAO_0000142>
select ?article ?m ?title ?year ?allele where 
{
        { ?rec sc:has-as-major-mesh ssc:. }
  union { ?rec sc:has-as-minor-mesh ssc:. }
  union { ?rec sc:has-as-major-mesh ?m. ?m skos:broader ssc:. }
  union { ?rec sc:has-as-minor-mesh ?m. ?m skos:broader ssc:. }

  ?rec foaf:primaryTopic ?article.
  ?article dc:title ?title; dc:date ?year.
  ?article mentions: ?allele.
}


To follow the trail from there to contact points:

What are the perturbed contact points of alleles related to Systemic Scleroderma?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix sc: <http://purl.org/science/owl/sciencecommons/>
prefix mesh: <http://purl.org/commons/record/mesh/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ro:   <http://www.ifomis.org/bfo/1.1/ro#>
select distinct ?locus ?allele ?pos where 
{
        { ?rec sc:has-as-major-mesh mesh:D012595. }
  union { ?rec sc:has-as-minor-mesh mesh:D012595. }
  union { ?rec sc:has-as-major-mesh ?m. ?m skos:broader mesh:D012595. }
  union { ?rec sc:has-as-minor-mesh ?m. ?m skos:broader mesh:D012595. }

  ?rec foaf:primaryTopic ?article.
  ?article dc:title ?title; dc:date ?year.
  ?article <http://purl.obofoundry.org/obo/IAO_0000142> ?allele_.

  ?allele_ rdfs:label ?allele.
  ?allele_ sc:perturbation ?p.
  ?p sc:reference_sequence_item ?item.
  ?chain ro:has_part ?contact.
  ?contact sc:chain_contact ?peptide_res.
  ?contact sc:reference_sequence_item ?item.
  ?item rdf:value ?pos.
  ?allele_ rdfs:subClassOf ?locus_.
  ?locus_ sc:primary_allele [].
  ?locus_ rdfs:label ?locus.
}
order by ?locus ?pos


locus allele pos
HLA-DQB1 DQB1*0402 9
HLA-DQB1 DQB1*0402 30
HLA-DQB1 DQB1*0402 57
HLA-DQB1 DQB1*0402 71
HLA-DQB1 DQB1*0402 77
HLA-DQB1 DQB1*0402 85
HLA-DRB1 DRB1*0404 9
HLA-DRB1 DRB1*0404 11
HLA-DRB1 DRB1*0404 13
HLA-DRB1 DRB1*0404 26
HLA-DRB1 DRB1*0404 28
HLA-DRB1 DRB1*0404 30
HLA-DRB1 DRB1*0404 33
HLA-DRB1 DRB1*0404 86
HLA-DRB5 DRB5*0102 11
HLA-DRB5 DRB5*0102 13
HLA-DRB5 DRB5*0102 30

This query for Autoimmune Diseases returns some interesting results. The hla bundle seems to have some non-HLA alleles captured; note we filter those out with a regex.

PREFIX dc: <http://purl.org/dc/elements/1.1/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix sc: <http://purl.org/science/owl/sciencecommons/>
prefix mesh: <http://purl.org/commons/record/mesh/>

select ?allele ?article ?title where 
 {
   { ?rec sc:has-as-major-mesh mesh:D007154. }
   union { ?rec sc:has-as-minor-mesh mesh:D007154. }
   union { ?rec sc:has-as-major-mesh [ skos:broader mesh:D007154]. }
   union { ?rec sc:has-as-minor-mesh [ skos:broader mesh:D007154]. }

   ?rec foaf:primaryTopic ?article.
   [] sc:article ?article; sc:name ?allele.
   ?article dc:title ?title; dc:date ?year
   filter(regex(?allele, "^HLA"))
 }

This gets a few dozen alleles from papers about type 1 Diabetes:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix sc: <http://purl.org/science/owl/sciencecommons/>
prefix mesh: <http://purl.org/commons/record/mesh/>

select distinct  ?allele where 
 {
   { ?rec sc:has-as-major-mesh mesh:D003922. }
   union { ?rec sc:has-as-minor-mesh mesh:D003922. }
   union { ?rec sc:has-as-major-mesh [ skos:broader mesh:D003922]. }
   union { ?rec sc:has-as-minor-mesh [ skos:broader mesh:D003922]. }

   ?rec foaf:primaryTopic ?article.
   [] sc:article ?article; sc:name ?allele.
 }
order by ?allele

URIs for Alleles

This is largely done; the query to Bjorn is in progress.

I think it would be a good idea to create URIs for alleles; currently they only exist as strings. Just modify the scripts for the medline/alleles, hla, and omim bundles appropriately (make prepare, make cache, make, make export, rdfherd bundle_update - see Bundle development - do not be shy about updating the development instance. A candidate set of URIs is the MHC ontology

 http://www.bioinformatics.org/mahco/wiki/

but be sure to ask Bjorn Peters before using it.

This work is nearing completion. See Bundles/mahco.

example query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix MaHCO: <http://purl.org/stemnet/HLA#>
prefix HLA: <http://purl.org/stemnet/HLA#>
select ?group ?allele where 
{ 
  ?group rdfs:subClassOf HLA:HLA-A.
  ?allele rdfs:subClassOf ?group
}

oops... it seems there are 2 hierarchies; we're using the one for dna when we should be using the one for proteins

also curious: MaHCO is saying that A*68 is a subclass of A*28 This is right; see http://www3.interscience.wiley.com/journal/120148521/abstract

Finding the Entrez gene of a PDB chain via Uniprot

Each PDB chain is correlated to a Uniprot accession, and in the case of HLA chains, these Uniprot accessions are related to an Entrez HLA-XX gene, in Bundles/ipi.

The following query shows that the HLA gene names in Entrez are HLA-A, HLA-B, HLA-C, HLA-DMA, HLA-DMB, HLA-DQA1, HLA-DQA2, HLA-DQB1, HLA-DRA, HLA-E, HLA-G, HLA-H.

Show Entrez HLA gene names

prefix util: <FIXME://example/misc_terms#>
prefix sc: <http://purl.org/science/owl/sciencecommons/>
prefix bio2rdf: <http://bio2rdf.org/ns/bio2rdf#>
select distinct ?sym  where 
{ 
  ?gene bio2rdf:xPath ?prot.
  ?gene sc:ggp_has_primary_symbol ?sym.
  filter ( regex(?sym, "^HLA-") )
}
order by ?sym


Extra credit

browsing with Jmol

see ImmPort/JmolViz

short peptide sequence binding

There is other work in progress to continue, e.g. the databases of peptides that bind to particular HLA variants (IEDB?).

tools for showing off triple store content at a low level

JAR thinks that a rudimentary RDF browser front end would be a good contribution to the demo, just as a way to get them oriented to the nature and diversity of the content.

Appendix: Task List

  • alignment stuff (in progress)
  • get SCOP/CATH/PFAM from SIFTS (in progress)
    • earlier approach: I prototyped a scraping implementation: pfam_scrape.py. Integration with the template stuff should be straightforward. The scraping and fetching should be separated so that the fetched data can be cached for offline scraping.
  • mediawiki query facility
  • show query results with Jmol
    • I studied the Jmol/FGiJ javascript enough to think this is pretty straightforward (assuming query results are residues... hmm... maybe I need more information about what queries we're talking about), but javascript is still something of a black art to me, so there may be dragons.