Text processing

Text processing here means any kind of analysis of natural language text (for example, the text of HTML or PDF files) that makes the text more useful, e.g. by improving the way it is indexed or the way it is linked to other resources.

Text processing can be manual, as when a librarian assigns a subject heading or keyword to an article or book, or automated.

Kinds of automated text processing:

  • Tokenization (word finding) - the obvious thing, just find words in the text (think Google)
  • Named entity recognition (NER) - find known entities (such as drug or protein names) mentioned in text
  • Information extraction - find particular relationships between entities (e.g. where a person lives, or a protein/drug interaction)
  • Text mining (or text analytics) - data mining techniques applied to text, for discovering patterns in large corpora

Of course there may be others.
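As a rough illustration of the simplest item in the list, tokenization, here is a minimal sketch in Python. The function name and the rule for what counts as a word (letters and digits, with internal hyphens and apostrophes kept) are assumptions made for this example; real tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Return lowercase word tokens found in text.

    Assumption for this sketch: a word is a run of letters or digits,
    optionally joined by internal hyphens or apostrophes.
    """
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())

if __name__ == "__main__":
    print(tokenize("Protein-drug interactions, found!"))
```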

The term "text mining" has a technical meaning as given above, but colloquially it is often used to cover just about any kind of text processing.

Entity recognition finds entities occurring in lists (thesauri, ontologies) determined ahead of time, and usually involves synonym matching and stemming (trimming off plurals, "ing", and so on).
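The following is a minimal sketch of list-based entity recognition with synonym matching and stemming. The thesaurus entries, the entity identifiers, and the function names are all hypothetical, and the stemmer is deliberately crude (it strips only "ing" and a plural "s"); a real system would use a proper thesaurus and a stemmer such as Porter's.

```python
import re

# Hypothetical thesaurus: surface form (including synonyms) -> entity identifier.
THESAURUS = {
    "aspirin": "drug:aspirin",
    "acetylsalicylic acid": "drug:aspirin",
    "kinase": "protein:kinase",
}

def stem(word):
    """Crude stemmer for illustration: strip "ing" or a plural "s"."""
    for suffix in ("ing", "s"):
        if (word.endswith(suffix) and not word.endswith("ss")
                and len(word) - len(suffix) >= 3):
            return word[: -len(suffix)]
    return word

def normalize(phrase):
    """Lowercase, split into words, and stem each word."""
    return tuple(stem(w) for w in re.findall(r"[a-z0-9]+", phrase.lower()))

# Thesaurus terms are normalized the same way as the text, so that
# "kinases" in the text matches "kinase" in the list.
INDEX = {normalize(term): entity for term, entity in THESAURUS.items()}
MAX_LEN = max(len(key) for key in INDEX)

def recognize(text):
    """Return entity identifiers in order of occurrence, longest match first."""
    tokens = normalize(text)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            entity = INDEX.get(tokens[i:i + n])
            if entity is not None:
                found.append(entity)
                i += n
                break
        else:
            i += 1  # no thesaurus term starts at this token
    return found
```

Because synonyms map to the same identifier, "acetylsalicylic acid" and "aspirin" are reported as the same entity.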


Timing of text processing

Text processing might be done either in advance, or on demand. When Google indexes web pages it is processing them in advance so that they can be indexed; when someone uses the index (does a search) the page had better have been processed already, or it won't be found. This pattern of advance processing is common because search is such an important application and because the results of processing (index entries in this case) can be used over and over again by multiple searches.
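The advance-processing pattern can be sketched as a small inverted index: documents are tokenized once, and the resulting index is reused by every later search. This is only an illustration; the document identifiers, sample texts, and function names are invented for the example, and it says nothing about how Google's index actually works.

```python
import re

def words(text):
    """Lowercase word tokens (letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Advance step: map each word to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in words(text):
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, query):
    """On-demand step: return ids of documents containing every query word."""
    terms = words(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

if __name__ == "__main__":
    docs = {
        "a": "Aspirin inhibits platelet aggregation",
        "b": "Kinase inhibitors and aspirin",
        "c": "Protein folding",
    }
    index = build_index(docs)            # done once, in advance
    print(search(index, "aspirin"))      # reused by many searches
    print(search(index, "aspirin kinase"))
```

A document that was never passed to build_index cannot be returned by search, which is the point made above about pages needing to be processed before they can be found.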

Text processing might instead happen on demand, via a service that operates on an article that has already been found. An example is the Whatizit service from EBI. The outcome of on-demand processing might be links to other text (articles, web pages), or to databases such as Swiss-Prot, that were not explicit in the original text.

Performance of text processing

Simple tasks such as word finding would be expected to take a small fraction of a second, while sophisticated tasks requiring parsing, such as information extraction, could be expected to take longer.

In our 2006 Temis experiment, abstracts were processed at the rate of one every ten seconds, which would mean several minutes for a complete article. But this is an extreme case, applying one of the most sophisticated and complex algorithms available.

By contrast, a typical paper found through a Google search reports a processing speed of 2466 documents per second for an entity recognition task.

Using text processing results

It is useful to be able to query the scientific literature jointly with other information sources. The most familiar way to query the literature is with full text search such as that provided by Google Scholar, but more precise queries are possible if structured annotations are available to the query engine.

Annotations (vaguely, the fruit of certain text processing tasks) can be generated manually, as are the MeSH headings and the Gene Ontology annotations (need links), or automatically using text processing software.

Science Commons investigated developing a common schema for expressing, in RDF, the information that comes out of text processing engines. This would allow experimentation with a variety of engines within a single application such as the Science Collaboration Framework. We have not been able to pursue this project very far and we hope someone else takes it up.
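To give a feel for what expressing text processing output in RDF might look like, here is a minimal sketch that serializes one entity annotation as N-Triples. It is not the schema that was investigated: the vocabulary namespace, the property names, the class and function names, and the example IRIs are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical vocabulary namespace, not an actual published schema.
EX = "http://example.org/textproc#"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

@dataclass
class Annotation:
    document: str      # IRI of the annotated document
    entity: str        # IRI of the recognized entity
    matched_text: str  # the text span that was matched
    start: int         # character offset where the match begins
    end: int           # character offset just past the match

def literal(value):
    """Format a plain string literal with N-Triples escaping."""
    escaped = (str(value).replace("\\", "\\\\")
                         .replace('"', '\\"')
                         .replace("\n", "\\n"))
    return f'"{escaped}"'

def to_ntriples(ann, node_id):
    """Serialize one annotation as N-Triples, using a blank node as subject."""
    s = f"_:{node_id}"
    return "\n".join([
        f"{s} <{EX}document> <{ann.document}> .",
        f"{s} <{EX}entity> <{ann.entity}> .",
        f"{s} <{EX}matchedText> {literal(ann.matched_text)} .",
        f'{s} <{EX}start> "{ann.start}"^^<{XSD_INTEGER}> .',
        f'{s} <{EX}end> "{ann.end}"^^<{XSD_INTEGER}> .',
    ])

if __name__ == "__main__":
    ann = Annotation(
        document="http://example.org/doc/1",       # hypothetical IRI
        entity="http://example.org/entity/aspirin",  # hypothetical IRI
        matched_text="aspirin",
        start=42,
        end=49,
    )
    print(to_ntriples(ann, "ann1"))
```

If every engine's output were mapped to one such vocabulary, an application could load annotations from several engines into the same RDF store and query them together.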

A pilot project to apply Temis IDE, a commercial named entity recognition and information extraction system, to neuroscience-related PubMed abstracts and convert the output to RDF, was completed in early 2007.


One hurdle in text processing is extracting the text from PDF files. The PDF format loses important details such as text flow (which word comes after which other word) and identification of Greek characters, and introduces noise such as page headers and footers that must be ignored. We have used simple tools such as XPDF to get text from PDF, but would love to have a tool available that has higher fidelity to the intended text.
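One part of this cleanup, dropping page headers and footers, can be sketched as follows. The sketch assumes the extracted text separates pages with form feed characters (as pdftotext from XPDF does by default) and treats the first or last non-blank line of a page as a running header or footer when the same line, ignoring digits such as page numbers, recurs on most pages. The function names and the 0.6 threshold are assumptions for this example; it does nothing about text flow or Greek characters.

```python
import re
from collections import Counter

def _key(line):
    """Normalize a line so "Page 3" and "Page 4" compare equal."""
    return re.sub(r"\d+", "#", line.strip())

def strip_running_lines(text, min_fraction=0.6):
    """Remove first/last page lines that repeat on most pages.

    Assumes pages are separated by form feed characters. Blank lines
    are dropped as a side effect.
    """
    pages = [[ln for ln in page.splitlines() if ln.strip()]
             for page in text.split("\f")]
    pages = [p for p in pages if p]
    if len(pages) < 2:
        # Too few pages to tell running lines from body text.
        return "\n".join(ln for p in pages for ln in p)

    # Count on how many pages each candidate edge line appears.
    counts = Counter()
    for lines in pages:
        counts.update({_key(lines[0]), _key(lines[-1])})
    threshold = min_fraction * len(pages)
    running = {k for k, c in counts.items() if c >= threshold}

    kept = []
    for lines in pages:
        if _key(lines[0]) in running:
            lines = lines[1:]
        if lines and _key(lines[-1]) in running:
            lines = lines[:-1]
        kept.extend(lines)
    return "\n".join(kept)
```

This heuristic can misfire, for instance on a short document whose pages happen to begin with the same sentence, so its output is worth spot-checking.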

Survey of text processing resources

Conversion from PDF

2011-08-08 Check this out: "NLP Interchange Format (NIF) is an RDF/OWL-based format that allows to combine and chain several NLP tools in a flexible, light-weight way."