How can I extract text from a table in a PDF file?

I am trying to implement an algorithm described in an academic paper, which I have in PDF format. The algorithm includes a table of 256 entries that I want to copy to my implementation. However, I can't seem to copy the table as text that I can manipulate. I can only copy it as an image.

Asked by: Guest | Views: 370
Total answers/comments: 3
Guest [Entry]

PDF2Table

I think this outputs XML. Its description reads:

"If we surf the web we can find PDF files in heaps. Once technical details of an amazing five mega pixel digital camera, once a statistic about the last two years incomes of an enterprise, and once a brilliant crime novel of Sir Arthur Conan Doyle is saved in a PDF file. The widespread use of this file format takes the focus on the question of how to reuse the data in such a file. Many things are already done in this area. For example, there are several tools that convert PDF-files to other formats.

My work focuses only on the extraction of table information from PDF-files. I searched for tools that extract basic information from PDF-files. I found a tool named pdf2html which also returns data in XML format. To access this XML output I used the JDOM archive.

I developed several heuristics for table detection and decomposition. These heuristics work pretty good on lucid tables (without spanning columns or rows) and fairly good on complex tables (with spanning rows or columns)."

SourceForge link
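
For anyone curious what a table heuristic over positioned text can look like, here is a minimal sketch in Python. It is not PDF2Table's actual method, which the description above does not spell out. It assumes the XML output has already been parsed into hypothetical (x, y, text) tuples, and the `y_tolerance` threshold is an assumption you would tune per document.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Cluster (x, y, text) fragments into table rows by vertical position.

    fragments:   iterable of (x, y, text) tuples (a hypothetical input shape)
    y_tolerance: how far apart two y values may be and still count as one row
    """
    rows = []  # each entry is [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            # Close enough to the current row's reference y: same row.
            rows[-1][1].append((x, text))
        else:
            # Otherwise start a new row anchored at this fragment's y.
            rows.append([y, [(x, text)]])
    # Within each row, order the cells left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]
```

This only handles simple tables. Spanning rows or columns would need extra logic, which matches the caveat in the description.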
Guest [Entry]

Your problem might be that the original author pasted the table into the PDF as an image. You can check by seeing whether other text in the document copies as text. If the table is an image, your only options are probably to retype it by hand (hope you can touch type) or to run it through OCR software, such as the kind that comes with scanners.
Guest [Entry]

One option seems to be to save the document (or maybe just the page with the table you want) as an XML file. I just did this in Adobe Acrobat Pro by saving as "XML Spreadsheet 2003." This retained the tabular format in the resulting XML file (viewable in Excel). The only "imperfection" is that it treats each physical line of the table as a separate row, so any cell text that wraps onto a second line (e.g., a long name) shows up as two rows in Excel. For a small table, that's pretty minor cleanup.

Other than that, it seems like this process could be automated.
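
As a rough idea of how that automation could go, here is a sketch in Python using only the standard library. It reads the rows out of an "XML Spreadsheet 2003" file and folds wrapped lines back into the row above. The merge rule (a row whose first cell is empty is a continuation of the previous row) is my assumption, not something Acrobat guarantees, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the "XML Spreadsheet 2003" format.
SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}


def read_rows(xml_text):
    """Return every worksheet row as a list of cell strings."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.iterfind(".//ss:Table/ss:Row", NS):
        cells = []
        for cell in row.iterfind("ss:Cell", NS):
            # ss:Index (1-based) means empty cells were skipped; pad to keep columns aligned.
            index = cell.get(f"{{{SS}}}Index")
            if index is not None:
                cells.extend([""] * (int(index) - 1 - len(cells)))
            data = cell.find("ss:Data", NS)
            cells.append("".join(data.itertext()).strip() if data is not None else "")
        rows.append(cells)
    return rows


def merge_wrapped_rows(rows):
    """Fold rows with an empty first cell into the row above (assumed heuristic)."""
    merged = []
    for row in rows:
        if merged and row and row[0] == "":
            prev = merged[-1]
            # Pad the previous row so every column of the continuation has a slot.
            prev.extend([""] * (len(row) - len(prev)))
            for i, text in enumerate(row):
                if text:
                    prev[i] = (prev[i] + " " + text).strip()
        else:
            merged.append(list(row))
    return merged
```

You would call it along the lines of `merge_wrapped_rows(read_rows(open("table.xml", encoding="utf-8").read()))`, where `table.xml` stands in for whatever file you saved. For a 256-entry table it is still worth spot-checking the result against the PDF.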