Text processing
It is very useful to be able to query the scientific literature jointly with other information sources. The most familiar way to query the literature is full-text search, such as that provided by Google, but more precise queries are possible when structured annotations are available.
Annotations can be generated manually, as are the MeSH headings and many of the GO annotations (need links), or automatically, using software that processes the text of the article.
Kinds of automatic text processing:
- Entity recognition - find entities mentioned in an article
- Information extraction - find particular relationships between entities
- Parsing - loose or strict
- Text mining - data mining techniques applied to text, for discovering novel patterns in large corpora
(In the past we often described all kinds of text processing activities as "text mining", but we are now trying to reserve the term for the more limited sense given above.)
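To make the first item in the list concrete, here is a minimal sketch of dictionary-based entity recognition in Python. It is only an illustration of the idea, not how Temis or any other system works: the lexicon, its entity types, and the function name find_entities are all made up for this example, and real systems handle synonyms, ambiguity, and morphology far more carefully.

```python
import re

# Hypothetical mini-lexicon mapping surface forms to entity types.
LEXICON = {
    "dopamine": "Chemical",
    "hippocampus": "BrainRegion",
    "BDNF": "Gene",
}

def find_entities(text, lexicon=LEXICON):
    """Return (start, end, surface form, type) for each lexicon term in text."""
    # Try longer terms first so a long name wins over a shorter one it contains.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    # Map lower-cased matches back to the lexicon's own spelling.
    canonical = {t.lower(): t for t in lexicon}
    hits = []
    for m in pattern.finditer(text):
        term = canonical[m.group(1).lower()]
        hits.append((m.start(), m.end(), m.group(1), lexicon[term]))
    return hits

if __name__ == "__main__":
    sentence = "BDNF expression in the hippocampus is modulated by dopamine."
    for hit in find_entities(sentence):
        print(hit)
```

Keeping the character offsets with each hit makes it possible to point an annotation back at the exact span of text it came from.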
We are in the early stages of developing a common schema (for RDF) for expressing the information that comes out of text processing engines.
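Since the schema is not settled, the following is only a sketch of what one entity mention might look like when written out as RDF in N-Triples syntax. Everything under the example.org namespaces, the property names (inDocument, start, end, surfaceForm, entityType), and the document identifier are placeholders invented for this example; they are not the schema under development.

```python
# Placeholder namespaces; NOT the schema under development.
EX = "http://example.org/textproc#"
DOC = "http://example.org/pubmed/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

def literal(value):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'

def mention_to_ntriples(mention_id, doc_id, start, end, surface, entity_type):
    """Return N-Triples lines describing one entity mention in one document."""
    subj = f"<{EX}mention/{mention_id}>"
    return [
        f"{subj} <{RDF_TYPE}> <{EX}Mention> .",
        f"{subj} <{EX}inDocument> <{DOC}{doc_id}> .",
        f'{subj} <{EX}start> "{start}"^^<{XSD_INTEGER}> .',
        f'{subj} <{EX}end> "{end}"^^<{XSD_INTEGER}> .',
        f"{subj} <{EX}surfaceForm> {literal(surface)} .",
        f"{subj} <{EX}entityType> <{EX}{entity_type}> .",
    ]

if __name__ == "__main__":
    # "doc1" is a made-up document identifier.
    for line in mention_to_ntriples("m1", "doc1", 0, 4, "BDNF", "Gene"):
        print(line)
```

Giving each mention its own URI, rather than linking the document directly to the entity, leaves room to record offsets and the producing engine alongside it.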
A pilot project to apply Temis IDE, a commercial named entity recognition and information extraction system, to neuroscience-related PubMed abstracts and convert the output to RDF was completed in early 2007.
A hurdle in text processing is extracting the text from PDF files. We use simple tools such as XPDF but would love to have something that could figure out correct document flow (left column before right column), elide page headers and footers, and handle Greek characters effectively.
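Of these three wishes, the header and footer problem is the easiest to approximate after the fact. The sketch below is one possible heuristic, not something we currently use: it assumes the extracted text separates pages with a form feed character (as pdftotext output normally does), treats the first and last non-blank line of each page as candidates, and drops a candidate when the same line, with digits masked so that page numbers match, recurs on most pages. The function name and the 0.6 threshold are arbitrary choices for illustration. It does nothing for column order or Greek characters.

```python
import re
from collections import Counter

def _key(line):
    """Normalise a line so 'Page 1' and 'Page 2' compare equal."""
    return re.sub(r"\d+", "#", line.strip())

def strip_running_lines(text, min_fraction=0.6):
    """Drop first/last page lines that repeat on at least min_fraction of pages."""
    pages = [p.splitlines() for p in text.split("\f") if p.strip()]
    if len(pages) < 2:
        return text  # nothing to compare against

    def edge_indexes(lines):
        nonblank = [i for i, l in enumerate(lines) if l.strip()]
        return {nonblank[0], nonblank[-1]} if nonblank else set()

    # Count on how many pages each normalised edge line appears.
    counts = Counter()
    for lines in pages:
        counts.update({_key(lines[i]) for i in edge_indexes(lines)})
    threshold = min_fraction * len(pages)
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        drop = {i for i in edge_indexes(lines) if _key(lines[i]) in repeated}
        cleaned.append("\n".join(l for i, l in enumerate(lines) if i not in drop))
    return "\f".join(cleaned)

if __name__ == "__main__":
    sample = (
        "Some Journal 2007\nFirst page body.\nPage 1\f"
        "Some Journal 2007\nSecond page body.\nPage 2\f"
        "Some Journal 2007\nThird page body.\nPage 3"
    )
    print(strip_running_lines(sample).replace("\f", "\n"))
```

Only the outermost line at each end of a page is examined, so multi-line headers would need the candidate window widened.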
