From: Jason White
To: speakup@braille.uwo.ca
Subject: Re: Anyone able to OCR a PDF file?
Date: Thu, 5 Jan 2012 09:38:45 +0000 (UTC)

wrote:
>Willem van der Walt wrote:
>> I find that the best of the open-source engines is cuneiform.
>
>Aha, interesting.  I've always used tesseract.  cuneiform is
>in debian wheezy (testing) but not yet in debian stable...

Cuneiform is now officially unmaintained upstream. If you like it and
you know someone familiar with OCR algorithms who has time to spare, or
someone who might know such a person, now is the time to establish the
right connections. I occasionally monitor the mailing lists for
Cuneiform and OCRopus.

>
>Depending on how the PDF was produced, it's possible that
>	ps2txt filename.pdf
>(a.k.a. ps2ascii) might help; I think it comes with ghostscript.

pdftotext and pdftohtml (as well as similar tools) will work, but only
if there is text in the PDF file. If it contains only images of text
rather than characters, you have to apply OCR. The size of the PDF file
usually gives a strong indication of whether it contains rasterized
images or not, and of course you can use the tools in poppler-utils to
find out.
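
For what it's worth, the "is there any text in it?" check can be
scripted. What follows is only a rough sketch, not a tested recipe: it
assumes pdftotext from poppler-utils is on the PATH, and the function
names and the 20-character threshold are made up for illustration. It
extracts whatever text layer exists and treats a nearly empty result as
a hint that the pages are images and OCR is needed.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return what it writes to standard output.

    Passing "-" as the output file asks pdftotext to write to stdout.
    Raises CalledProcessError if pdftotext exits with an error.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def probably_needs_ocr(extracted_text, min_chars=20):
    """Heuristic only: almost no visible characters suggests image-only pages.

    min_chars is an arbitrary cut-off chosen for this sketch.
    """
    # Drop all whitespace (including the form feeds between pages)
    # before counting what is left.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars
```

To use it, call probably_needs_ocr(extract_text("filename.pdf")). A
True result only means little or no text came out; a PDF with a
mostly graphical title page and real text elsewhere would still return
False, which is the right answer in that case.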