Practical OCR solution for converting a large book to a digital format?

I was over at my grandparents' place this past weekend. My grandmother pulled out this giant (~1400 page) book of her family history going back to 1630 or so. Giant nerd that I am, I thought it would be slick to have all the information stored in a database and available from the web. I can handle all the web programming and regular expressions and whatnot, but what I don't know is the best way to get the text from the book onto the computer.

Asked by: Guest | Views: 287
Total answers/comments: 5
Guest [Entry]

"I came across this on Lifehacker quite some time back, and it has been one of my top DIY projects ever since.

Replace the iPhone with any camera or imaging device, and you get a stack of nice high-res JPEGs ready for you to OCR with any software, even (urks!) MS Office... ;)

Cheap. Effective. DIY. You can't beat an idea like this.

EDIT: Comments raised some points about shadows, page curling, etc. These are quite easily resolved by anyone who has literally photocopied library texts.

Add multiple light sources to illuminate the book and eliminate the shadows.

Slant the book at 90 degrees so the pages don't curl towards the binding in the middle. This also preserves the binding.

I'll see if I can give an example and set one up myself.

EDIT 2: Uploaded a sample of how you should hold the book; also notice the light source from the left."
Guest [Entry]

"You will need to capture the images somehow. Various services exist to do this for you. You will also need someone who is familiar with the content of the text to proofread, as OCR is not perfect yet, especially with anything handwritten.

Others are discussing your question here:
http://ask.metafilter.com/92506/scan-my-books

Some companies will do this for you:
http://www.scandexsystems.com/BookScanning2.html
http://www.kirtas.com/index.php?option=com_content&view=article&id=13&Itemid=48
http://www.ristech.ca/product.html

Some Free Software:
http://download.cnet.com/Image-To-PDF-OCR-Converter-PDF-E-Book-Maker/3000-6675_4-10392924.html"
Guest [Entry]

"I would recommend a flatbed scanner rigged for book scanning or a whole book scanner as mentioned by Chris.

If you can, get your images compiled into TIFF format, as that is an industry standard when it comes to document management systems.

For doing OCR, I would recommend Tesseract OCR, as it is the engine Google expanded upon for their books project."
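For anyone going the Tesseract route, here is a minimal sketch of batch-running it over a folder of TIFF scans. It assumes the `tesseract` command-line tool is installed and on the PATH, and that each page is its own `.tif`/`.tiff` file; the function names and folder names are made up for illustration and are not from the answer above.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for one Tesseract CLI run.

    Tesseract appends ".txt" to output_base for its plain-text result.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_directory(scan_dir, text_dir, lang="eng"):
    """Run Tesseract on every .tif/.tiff file in scan_dir, in name order.

    Returns the list of text files Tesseract is expected to have written.
    """
    scan_dir, text_dir = Path(scan_dir), Path(text_dir)
    text_dir.mkdir(parents=True, exist_ok=True)
    pages = sorted(
        p for p in scan_dir.iterdir() if p.suffix.lower() in {".tif", ".tiff"}
    )
    for page in pages:
        cmd = build_tesseract_command(page, text_dir / page.stem, lang)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return [text_dir / (p.stem + ".txt") for p in pages]
```

Calling `ocr_directory("scans", "ocr_text")` (hypothetical folder names) would leave one text file per page, which keeps the later page-by-page proofreading simple. Naming the scans with zero-padded page numbers, such as `page_0001.tif`, keeps them in reading order when sorted.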
Guest [Entry]

While it sounds tempting to automate the process, you may want to invest rather more time and work, since this particular book is a personal matter. OCR will do the bulk, but you'll have to proofread page by page and compare with the original. Keep in mind that the author's mistakes are part of the deal; do not correct them (create footnotes if you feel so inclined). Take your time and don't put yourself under pressure. Book scanning is donkey work, but thoroughness pays, and you'll end up with a fine digital copy of your family's chronicle. Good luck with your endeavour :)
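As a small aid to that page-by-page proofreading, here is a rough sketch that flags tokens mixing letters and digits (such as "l630" for "1630"), a common kind of OCR slip. The heuristic and the function name are assumptions for illustration only. It just points at places to check against the original; it changes nothing, so the author's own mistakes stay untouched, and it is no substitute for reading every page.

```python
import re


def suspect_tokens(page_text):
    """Return (line_number, token) pairs that mix letters and digits.

    This is only a heuristic: it will miss most OCR errors and will also
    flag legitimate tokens such as "1630s".
    """
    hits = []
    for lineno, line in enumerate(page_text.splitlines(), start=1):
        for token in re.findall(r"\w+", line):
            has_letter = any(c.isalpha() for c in token)
            has_digit = any(c.isdigit() for c in token)
            if has_letter and has_digit:
                hits.append((lineno, token))
    return hits


sample = "Born in l630\nWi11iam Smith, 1702"
for lineno, token in suspect_tokens(sample):
    print(f"line {lineno}: check {token!r} against the book")
```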
Guest [Entry]

"At work we use a Plustek OpticBook 3600 book scanner, which costs about $250.
It's basically a standard flatbed scanner, but with the glass plate going right to the edge of the scanner so that the book page can be placed flat on the plate. This eliminates the spine shadow and avoids damaging books."