DIY Open Book Scanning

For the last three weeks, I’ve been leading the beginnings of a book-scanning group here in Cambridge. It all started with a cycle ride through the snow to a glazier’s on the outskirts of the city, where I picked up a few sheets of glass before heading back to our meeting at the English Faculty Library, carefully avoiding every piece of icy ground as I went, as you really do not want to fall off a bike when you have glass strapped to your back.

Our method, at least so far, has been very simple. It was inspired by the great DIY Book Scanner website, and needs only a desk lamp, a digital camera, a tripod and some book supports. As we were meeting in a library, the staff kindly lent us some foam book supports and enough extension leads to plug in the lamp. We then propped up a rare copy of a burlesque Hamlet from 1801 in our makeshift cradle and began: we laid the glass sheet over the open book and photographed each of the right-hand pages, before doing the same with the left.

We then put our collection of 80 or so JPEGs, suitably renamed and ordered, into ScanTailor, which polished our efforts into something fairly respectable. All that was left was to OCR the images, stitch them all into a single PDF and upload it to the Internet Archive.
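
Because we photographed all the right-hand pages first and all the left-hand pages afterwards, the files have to be interleaved before they are in reading order. Here is a minimal Python sketch of that renaming step. The function names, the `page_0001.jpg` naming pattern and the `left_first` switch are my own assumptions for illustration, not something ScanTailor requires; whether a left or a right page comes first depends on where in the book you started shooting.

```python
from itertools import zip_longest


def reading_order(right_pages, left_pages, left_first=True):
    """Interleave two lists of filenames into reading order.

    right_pages and left_pages are each assumed to be sorted in the
    order they were photographed. left_first=True means the first
    opening photographed had a left-hand page that precedes the first
    right-hand page (hypothetical switch, set it to suit your book).
    """
    first, second = (left_pages, right_pages) if left_first else (right_pages, left_pages)
    ordered = []
    # zip_longest copes with one side having an extra page at the end
    for a, b in zip_longest(first, second):
        ordered.extend(p for p in (a, b) if p is not None)
    return ordered


def new_names(ordered, prefix="page", ext=".jpg"):
    """Pair each original filename with a zero-padded sequential name."""
    return [(old, f"{prefix}_{i:04d}{ext}") for i, old in enumerate(ordered, start=1)]
```

Each `(old, new)` pair can then be handed to `os.rename` once you have checked the list by eye. Zero-padding the numbers keeps the files in the right order when a file browser sorts them by name.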

Here, though, we began to hit problems, and any suggestions for a solution would be very welcome indeed:

  • Our images were far from perfect, often distorted due to the slight curvature of the page or the misalignment of the camera on its tripod.
    • Current solution: short of building a rig, we are trying to take photos from directly above the book, which at least makes it easier to keep the camera parallel to the page.
  • Our file sizes were enormous, and this made conversion really time-consuming.
    • Current solution: use the university’s copy of Adobe Acrobat to compress the images into B&W PDFs, although it pains me that there seems to be no open-source alternative. Does anyone know of one?
  • Big file sizes and slightly skewed images do not a good OCR make: we couldn’t get Tesseract to run on Windows, so resorted to using a web-based version, with all its limitations.
    • Current solution: again, Adobe to the rescue; but are there any open-source projects out there for this?
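
To give a feel for what "compressing to B&W" involves, here is a rough pure-Python sketch of binarisation: choosing a single grey level and turning every pixel into either ink or paper. It uses Otsu's method, a standard textbook way of picking that level. This is not what Acrobat does internally (I have no idea what it does), and the function names and the flat list of 0 to 255 grey values are assumptions made purely for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level that best separates dark from light pixels.

    pixels is a flat sequence of ints in 0..255 (assumed input format).
    The chosen level maximises the between-class variance, which is
    the criterion used by Otsu's method.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate level
    sum_bg = 0    # sum of their grey values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # nothing dark yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # nothing light left
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarise(pixels, threshold):
    """Map each pixel to 0 (ink) or 1 (paper) using the threshold."""
    return [1 if p > threshold else 0 for p in pixels]
```

A binarised page needs only one bit per pixel rather than eight or more, which is a large part of why B&W files come out so much smaller.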

And with that list of problems and solutions, you now have a fairly good idea of where we are. If you’re in the Cambridge area, do get in touch, as we’re always eager for new volunteers. If you want to start your own open DIY book-scanning project, get in touch on the OpenGLAM mailing list.

Future plans include: surveying the English Faculty Library for other books that are out of copyright and not yet digitised (not as numerous as you might think), proposing a collaboration with the Engineering Department for help constructing a standalone book scanner, and investigating what there is to be scanned in the College libraries of the city.