Conversion from PDF

(Tony Loeser, February 2008)

Many of the files that we need to process (for example, the HD corpus) are currently in PDF format, so we need a way to convert them out of PDF and into something that a typical text mining tool can read.

For the purpose of experimenting, here are some sample articles to try:

  • attachment:Sample_Article_1.pdf : Burns et al., an article used for testing the USC tool described below
  • attachment:Sample_Article_2.pdf : Abou-Sleymane et al., an article with relatively straightforward text
  • attachment:Sample_Article_3.pdf : Carpenedo et al., a good example of an article with non-ASCII characters, e.g. Greek letters
  • attachment:Sample_Article_4.pdf : Emerich et al., a good example of an article with ligatures, e.g. the "fl" in "flasks".

TBD: The attachments aren't there - provide URLs instead if possible

Custom tools

Gully Burns at USC is developing a tool together with Tommy Ingulfsen. The basic idea is to convert PDF files into blocks of text, and then use a rule mechanism to classify those blocks. It is in the early stages, but a promising sign is that it is the only tool thus far that puts the text blocks in the correct order. Non-ASCII characters seem to be handled in a manner similar to pdftotext.

As an exercise, I configured a ruleset for Sample 1 and ran the program. ([attachment:Sample_Article_1-USC-rules.csv Sample 1 rules], [attachment:Sample_Article_1-USC.txt Sample 1 results]). The results were excellent when compared to the other tools on this page, with only a few glitches. Not only does the text come out right, but it is further organized via XML into sections and sentences. The first obvious questions are how reusable the rulesets are, and how long it takes to tweak them for the various journal formats. Perhaps the authors have some experience with this?
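
To make the "blocks plus rules" idea concrete, here is a minimal sketch in Python. It is not the USC tool's rule format (that lives in the CSV attachment above); the `Block` fields, the rule thresholds, and the labels are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """One extracted text block (hypothetical fields, not the USC tool's model)."""
    text: str
    font_size: float  # in points
    page: int


# Hypothetical ruleset: (label, predicate) pairs, checked in order; first match wins.
RULES = [
    ("heading", lambda b: b.font_size >= 12 and len(b.text) < 80),
    ("caption", lambda b: re.match(r"(Fig\.|Figure|Table)\s+\d+", b.text) is not None),
    ("footer", lambda b: b.font_size <= 7),
]


def classify(block):
    """Return the label of the first matching rule, or "body" if none match."""
    for label, predicate in RULES:
        if predicate(block):
            return label
    return "body"


if __name__ == "__main__":
    blocks = [
        Block("1. Introduction", 14.0, 1),
        Block("Figure 2. Expression in the retina.", 8.0, 3),
        Block("Cells were grown in flasks for two days.", 10.0, 2),
    ]
    for b in blocks:
        print(classify(b), "|", b.text)
```

A ruleset like this would presumably have to be adjusted per journal layout (font sizes, caption wording), which is exactly the reusability question raised above.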

Free tools

  • XPDF is a commonly referenced free tool that includes the utility 'pdftotext'
    • It is a command line unix tool; no evidence of format-specific settings that might tune it to different journal formats
    • Conversion goes better than for the commercial tool that I've tried. Paragraph blocks stick together well: dehyphenation works, and headers, footers, and captions for the most part don't run into the body text. Two-column text still results in paragraph jumbling in both example documents. (In sample 1, sections 1 and 1.1 are switched. In sample 2, search for "differentiation of rod photoreceptors".)
    • Conversion results: [attachment:Sample_Article_1-pdftotext.txt sample 1] and [attachment:Sample_Article_2-pdftotext.txt sample 2].
    • Non-ascii characters are translated into ascii with more or less success, e.g. "mu"-->"I", "alpha"-->"a"
  • Multivalent is a sourceforge project with an Extract Text utility
    • It is part of a suite of Java-based tools.
    • So far we have used it to test extraction of text with ligatures. Here are the results for [attachment:Sample_Article_4-multivalent.txt sample 4]. No luck with the ligatures. As an exercise, we did some manual cleaning in order to get a feel for how much work it is; see [attachment:Sample_Article_4-multivalent-manually-cleaned.txt sample 4 cleaned]. It doesn't look promising, since the only clue as to what the ligature was is the rest of the word. The space of possible word completions is huge, and the result may still be ambiguous.
  • Ghostscript is a well-known PostScript viewer and suite of tools
    • Mentioned here only because it is an easy way to convert from PDF to PS ([ sample 4 PS]), which opens up the possibility of using any PS-->TXT tools that we might find. Also, PS is somewhat more readable in its raw form.

Commercial tools

There are a few different commercial tools out there that claim to convert from PDF to text, HTML, and other formats. The quality of the output appears to vary. Of the two that I've tried, the ABBYY one works very well -- when it is adjusted on a per-document basis.

  • Desk UNPDF: Converts PDF to text, HTML, .doc, .xml, and so on.
    • Text conversion goes moderately well for the first example article; see [attachment:Sample_Article_1-UNPDF-p1.txt sample 1 page 1] and [attachment:Sample_Article_1-UNPDF-p2.txt sample 1 page 2], although the two-column text confuses the tool and the order of paragraphs ends up jumbled.
    • The second example article is quite problematic; see [attachment:Sample_Article_2-UNPDF-p1.txt sample 2 page 1] and especially [attachment:Sample_Article_2-UNPDF-p2.txt sample 2 page 2].
    • The [attachment:Sample_Article_2-UNPDF-HTML-p1.html HTML conversion] looks pretty from a formatting point of view (if not character set), but a glance at the page source shows that we haven't actually made any progress towards analyzing the text.
    • Unfortunately there are no sophisticated settings that would allow one to tell the program about footnotes, the two-column layout, or other things that it really has to know in order to do a better job.
    • Non-ascii characters appear to come out just as in pdftotext.
    • Further progress with a tool like this would involve figuring out some cleanup strategy for the text output.
  • ABBYY PDF Transformer: Converts to the usual formats
    • Automatic conversion is pretty good. It appears to get [attachment:Sample_Article_1-ABBYY-auto.txt sample 1] pretty much right. On [attachment:Sample_Article_2-ABBYY-auto.txt sample 2], it misses badly on the initial 2-column split, but gets the rest right (including the "differentiation of rod photoreceptors").
    • Has manual mode for specifying text, table, and image regions of the document, as well as the order in which to process them. With this feature, it can get the text blocks and order right. (Better than any other tool currently on this page.) Here are the results for [attachment:Sample_Article_2-ABBYY-manual.txt sample 2, manually adjusted], with figure captions and references removed.
    • Manual mode works great, but requires a per-document effort of 5 minutes or so for best results.
  • Nuance PDF Converter 4: Recommended as similar to ABBYY, but doesn't have a trial version for us to test.
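
As a starting point for the cleanup strategy mentioned under Desk UNPDF, here is a minimal Python sketch that rejoins words hyphenated across line breaks and unwraps lines within a paragraph. It assumes paragraphs are separated by blank lines in the tool's text output, which may not hold for every converter, and the function name is made up for this page.

```python
import re


def clean_extracted_text(text):
    """Dehyphenate and unwrap plain text produced by a PDF converter."""
    # Rejoin words split by a hyphen at a line end: "differen-\ntiation".
    # Caveat: this also joins genuine hyphens that happen to fall at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines into spaces; blank lines (paragraph breaks) are kept.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs left over from column alignment.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    raw = "The differen-\ntiation of rod\nphotoreceptors.\n\nNext  paragraph."
    print(clean_extracted_text(raw))
```

This does nothing about the harder problem of jumbled paragraph order in two-column text; it only tidies text whose blocks are already in the right sequence.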

This is the tool I've spent some time with (as opposed to just reading on the web), and it seems like a start anyway. Perhaps there is another that doesn't have as much trouble with spacing, columns, etc. We'll have to keep our eyes open for a commercial tool that appears to have more semantic sophistication.