Survey of text processing resources

This page lists several tools and projects that we have looked at in relation to the Neurocommons's text processing goals. Comments below reflect suitability to our particular projects.

The list was started in January 2008 by User:Tony Loeser. User:Jonathan Rees started a similar list in early 2006, but it can no longer be found. Several similar lists are scattered around the Internet.

Some late arrivals (Alan 3/2010):


Self-contained projects

These projects appear to use their tools internally. They typically analyze large sets of data, and then provide a public interface, such as search, that leverages the results of their text mining. These would be candidates for a partnership, where Neurocommons would ask for access to the text mining results themselves, translate those results into an RDF format, and add them to the integrated KB.

  • Geneways
    • "GeneWays is a system for automatically extracting, analyzing, visualizing and integrating molecular pathway data from the research literature."
    • The GeneWays 6.0 dataset can be accessed publicly through the JournalMine page, although the link there appears broken for now.
    • The data set includes "1.5 million unique statements about protein interactions from 150,000 full text articles in 78 journals" (from here).
  • GoPubMed
    • This is a MEDLINE search engine that uses GO to organize results. The text mining engine marks up abstracts, which are then used to build the search indices.
    • Includes a REST-style interface that returns search results in RDF format. (example)
    • System is built by Transinsight, see separate entry on this page.
  • iHOP: Information Hyperlinked over Proteins
    • Basically, they have converted the (>12M) PubMed abstracts into a network by linking the genes and proteins. "By using genes and proteins as hyperlinks between sentences and abstracts, we convert the information in PubMed into one navigable resource and bring all the advantages of the internet to scientific literature investigation."
    • "The network presented in iHOP currently contains 5 million sentences and 40 000 genes from Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Danio rerio, Arabidopsis thaliana, Saccharomyces cerevisiae and Escherichia coli."
    • The web site mentions that bulk download of the network is possible, so this is something to explore.
  • LitLinker
    • Version 1 does not appear to be ready yet. Stay tuned.
  • PubViz
    • "An Interactive Medline Search Engine Utilizing External Knowledge"
    • It is not clear from the website exactly what this is; we would have to contact the authors to explore further.
    • Search tool that looks for connections between concepts, based on co-occurrence in Medline articles. In other words, concepts are nodes, and each co-occurrence of two concepts in a Medline article results in an edge between their nodes.
    • MeSH (Medical Subject Heading) and HUGO are used for concepts and background knowledge.
  • Tsujii lab software
  • The Bio-NLP tools listing
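
The co-occurrence network described in the PubViz entry above can be illustrated with a minimal sketch. This is not PubViz's implementation; the function name `cooccurrence_graph` and the toy input are assumptions made purely for illustration. Each input record is the collection of concepts (for example, MeSH headings) attached to one article, and every pair of concepts appearing in the same article adds one to the weight of the edge between them.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(articles):
    """Return a Counter mapping (concept_a, concept_b) edges to co-occurrence counts.

    `articles` is an iterable of concept collections, one per article.
    Edge keys are sorted tuples, so (a, b) and (b, a) are the same edge.
    """
    edges = Counter()
    for concepts in articles:
        # Deduplicate within an article so each pair counts once per article.
        for a, b in combinations(sorted(set(concepts)), 2):
            edges[(a, b)] += 1
    return edges


# Toy input (hypothetical annotations, not real Medline data).
articles = [
    ["Neurons", "Dopamine", "Parkinson Disease"],
    ["Dopamine", "Parkinson Disease"],
    ["Neurons", "Synapses"],
]
graph = cooccurrence_graph(articles)
```

Edges stored this way would be straightforward to export as triples, with the count available as an edge weight for ranking.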

Academic web services

    • XML-RPC service identifies names of genes and proteins in text strings
    • Useful for short text: the discussion and examples only use short strings. The results format identifies each word but does not locate it in the input string.
  • Whatizit
    • Text processor tags text by running various "pipelines". Most pipelines scan for terms from a defined vocabulary. There are existing pipelines for Swissprot, GO, NCBI taxonomy, etc. Docs say that it can handle vocabs up to 500k terms.
    • Interfaces include a web service, using WSDL and SOAP, and also a streamed servlet for large jobs. One is, of course, limited to the pipelines already defined on the server.
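
As a rough illustration of vocabulary-based tagging in general, the sketch below scans text for terms from a small dictionary and, unlike a results format that only names the matched word, also reports character offsets into the input. It is not the Whatizit API or pipeline code; `tag_terms`, the vocabulary, and the labels are all hypothetical.

```python
import re


def tag_terms(text, vocabulary):
    """Find vocabulary terms in `text`.

    `vocabulary` maps a term to a label. Returns a list of
    (start, end, matched_text, label) tuples in order of appearance.
    """
    if not vocabulary:
        return []
    # Try longer terms first so a multi-word term wins over its prefix.
    terms = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {t.lower(): label for t, label in vocabulary.items()}
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Hypothetical two-entry vocabulary; labels are placeholders, not real identifiers.
vocab = {"p53": "PROTEIN:example-1", "apoptosis": "PROCESS:example-2"}
hits = tag_terms("Loss of p53 impairs apoptosis.", vocab)
```

A single compiled alternation like this is fine for a toy dictionary; vocabularies in the hundreds of thousands of terms would more plausibly need a trie or similar structure.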


These are software programs that one could potentially download and run independently. Given our limited resources, it would be attractive to find a package that is already trained or tuned for typical biomedical text mining.

Commercial products

  • I2E
    • Linguamatics markets its "I2E" text mining product to the life sciences market.
    • Architecture is Java client and C server. Inputs HTML, XML, text; outputs tabular format. Uses modular ontologies.
  • LingPipe
    • Company: Alias-I; "LingPipe is a suite of Java libraries for the linguistic analysis of human language."
    • Licensing provides for downloading a free version, provided all results and related software continue to be freely available. (Yep, that's us.)
    • Wide range of capabilities, including tutorials on topic classification, named entity recognition, working with MEDLINE, SVD, word sense disambiguation, etc.
    • Has various biomed pre-trained models including GeneTag and GENIA named entity recognition.
  • Transinsight
    • This is the engine behind GoPubMed
    • Company markets GoPubMed PRO: "With GoPubMed PRO we have achieved one stop information acquisition encompassing PubMed, the Internet, your local intranet and local desktops. Under one roof you have knowledge-based access to the knowledge you need to make decisions."
  • Temis IDE
    • The engine used in the 2006-07 Neurocommons text processing pilot
    • Its gene/protein recognition module is the highly regarded ProMiner from SCAI (Germany)
    • Does some amount of parsing to obtain relationships
    • Supports biomedical entity recognition: genes/proteins, MeSH
    • See Temis sample output

Downloadable academic/government projects

  • ABNER (A Biomedical Named Entity Recognizer)
    • Downloadable Java program. Input: text files. Output: various, incl SGML.
    • Machine learning system using conditional random fields, implemented by MALLET
    • Trained on NLPBA and BioCreative corpora
    • Claims state-of-the-art performance: for example, in one set of results the overall recall was 72% and precision 69%
  • KEX
    • "A simple Knowledge EXtraction tool, KEX is a protein name annotation tool based on PROPER (PROtein Proper-noun Extraction Rules)."
    • Downloadable C/Perl program, tested on Solaris.
    • Input is plain text, output appears to have protein names decorated with
  • MALLET
    • "MALLET is an integrated collection of Java code useful for statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text."
    • Many other, apparently customizable, related features such as text file tokenizing, general text processing pipelines, string<->integer mapping, etc.
    • Not specifically set up for biomed, this is a general text processing project
  • MetaMap Transfer
    • Downloadable version of the NIH MetaMap program.
    • "MetaMap maps arbitrary text to concepts in the UMLS Metathesaurus; or, equivalently, it discovers Metathesaurus concepts in text." Metathesaurus is a union of ~150 biomed source vocabularies. Includes GO.
    • Input is delimited plain text. Output is a (documented) HTML format.
  • Textpresso
    • Downloadable program that indexes journal articles and then provides an HTTP-based search interface. The system is run publicly with results from the C. elegans literature, though the software does not appear to be species-specific.
    • Implementation in Perl/CGI. Input is plain text; they suggest pdftotext for conversion from PDF. Output is normally the index database that is in turn used by the search software.
    • System does entity extraction from articles based on lexica that appear modular and extensible.
  • Rainbow
    • Rainbow is a program that performs statistical text classification. It is based on the bow toolkit, which provides facilities for (in addition to basic manipulations):
    • Tokenizing a text file, according to several different methods.
    • Including N-grams among the tokens.
    • Mapping strings to integers and back again, very efficiently.
    • Building a sparse matrix of document/token counts.
    • Pruning vocabulary by word counts or by information gain.
    • Building and manipulating word vectors.
    • Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
    • Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.
    • Scoring queries for retrieval or classification.
    • Performing test/train splits, and automatic classification tests.
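
To make a few of the bow facilities listed above concrete, here is a small pure-Python sketch of tokenizing with optional n-grams, mapping strings to integers and back, building sparse per-class token counts, and classifying with Laplace-smoothed Naive Bayes. It does not use Rainbow or the bow library and does not reflect their interfaces; every name here (`tokenize`, `Vocabulary`, `train`, `classify`) and the training data are assumptions for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text, n=1):
    """Lowercase word tokens, followed by word n-grams up to length `n`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tokens = list(words)
    for size in range(2, n + 1):
        tokens += [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]
    return tokens


class Vocabulary:
    """Two-way mapping between token strings and integer ids."""

    def __init__(self):
        self.ids = {}     # token -> id
        self.tokens = []  # id -> token

    def id_for(self, token):
        if token not in self.ids:
            self.ids[token] = len(self.tokens)
            self.tokens.append(token)
        return self.ids[token]


def train(labelled_docs, vocab):
    """Build sparse counts {label: Counter(token_id -> count)} and documents per label."""
    counts = defaultdict(Counter)
    docs_per_class = Counter()
    for label, text in labelled_docs:
        docs_per_class[label] += 1
        counts[label].update(vocab.id_for(t) for t in tokenize(text))
    return counts, docs_per_class


def classify(text, counts, docs_per_class, vocab):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing; returns the best label."""
    total_docs = sum(docs_per_class.values())
    size = len(vocab.tokens)
    # Tokens never seen in training are ignored.
    ids = [vocab.ids[t] for t in tokenize(text) if t in vocab.ids]
    best, best_score = None, -math.inf
    for label, token_counts in counts.items():
        denom = sum(token_counts.values()) + size
        score = math.log(docs_per_class[label] / total_docs)
        score += sum(math.log((token_counts[i] + 1) / denom) for i in ids)
        if score > best_score:
            best, best_score = label, score
    return best


# Toy training data (made up for illustration).
vocab = Vocabulary()
training = [
    ("gene", "the gene encodes a kinase protein"),
    ("gene", "protein expression of the mutant gene"),
    ("clinical", "patients received the drug daily"),
    ("clinical", "the trial enrolled sixty patients"),
]
counts, docs_per_class = train(training, vocab)
prediction = classify("kinase gene expression", counts, docs_per_class, vocab)
```

Add-one smoothing is only the simplest of the smoothing methods named in the list; the others would replace the `(count + 1) / denom` estimate.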