From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <glks-speakup@m.gmane.org>
Received: from lo.gmane.org (lo.gmane.org [80.91.229.12])
	by speech.braille.uwo.ca (Postfix) with ESMTP id 5127EC1A0FC
	for <speakup@braille.uwo.ca>; Thu,  5 Jan 2012 04:39:03 -0500 (EST)
Received: from list by lo.gmane.org with local (Exim 4.69)
	(envelope-from <glks-speakup@m.gmane.org>) id 1Rijmo-0001nR-TG
	for speakup@braille.uwo.ca; Thu, 05 Jan 2012 10:39:02 +0100
Received: from ppp198-218.static.internode.on.net ([59.167.198.218])
	by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <speakup@braille.uwo.ca>; Thu, 05 Jan 2012 10:39:02 +0100
Received: from jason by ppp198-218.static.internode.on.net with local (Gmexim
	0.1 (Debian)) id 1AlnuQ-0007hv-00
	for <speakup@braille.uwo.ca>; Thu, 05 Jan 2012 10:39:02 +0100
X-Injected-Via-Gmane: http://gmane.org/
To: speakup@braille.uwo.ca
From: Jason White <jason@jasonjgw.net>
Subject: Re: Anyone able to OCR a PDF file?
Date: Thu, 5 Jan 2012 09:38:45 +0000 (UTC)
Message-ID: <je3r35$i0d$1@dough.gmane.org>
References: <483a79$e3rqg0@ipmail06.adl6.internode.on.net>
X-Complaints-To: usenet@dough.gmane.org
X-Gmane-NNTP-Posting-Host: ppp198-218.static.internode.on.net
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: jason@jdc.jasonjgw.net (Jason White)
X-BeenThere: speakup@braille.uwo.ca
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: "Speakup is a screen review system for Linux."
	<speakup@braille.uwo.ca>
List-Id: "Speakup is a screen review system for Linux."
	<speakup.braille.uwo.ca>
List-Unsubscribe: <http://speech.braille.uwo.ca/mailman/options/speakup>,
	<mailto:speakup-request@braille.uwo.ca?subject=unsubscribe>
List-Archive: <http://speech.braille.uwo.ca/pipermail/speakup>
List-Post: <mailto:speakup@braille.uwo.ca>
List-Help: <mailto:speakup-request@braille.uwo.ca?subject=help>
List-Subscribe: <http://speech.braille.uwo.ca/mailman/listinfo/speakup>,
	<mailto:speakup-request@braille.uwo.ca?subject=subscribe>
X-List-Received-Date: Thu, 05 Jan 2012 09:39:03 -0000

 <speakup@braille.uwo.ca> wrote:
>Willem van der Walt wrote:

>> I find that the best of the open-source engines is cuneiform.
>
>Aha, interesting.  I've always used tesseract.  cuneiform is
>in debian wheezy (testing) but not yet in debian stable... 

Cuneiform is now officially unmaintained upstream. If you like it and you know
someone familiar with OCR algorithms who has time to spare, or someone who
might know such a person, now is the time to make those connections.

I occasionally monitor the mailing lists for Cuneiform and OCRopus.
>
>Depending on how the PDF was produced, it's possible that
>  ps2txt filename.pdf
>(a.k.a. ps2ascii) might help; I think it comes with ghostscript.

pdftotext and pdftohtml (as well as similar tools) will work, but only if
the PDF file actually contains text.  If it holds only images of text rather
than characters, you have to apply OCR. The size of the PDF file usually gives
a strong indication of whether it contains rasterized images, and of course
you can use the tools in poppler-utils (pdffonts and pdfimages, for instance)
to find out.
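
For anyone who wants to automate that check, here is a rough sketch in Python.
It assumes pdftotext from poppler-utils is on the PATH. The function names and
the 20-character threshold are my own inventions for illustration, not part of
any tool, so adjust to taste.

```python
import shutil
import subprocess


def looks_like_scanned(extracted_text: str, min_chars: int = 20) -> bool:
    """Guess whether a PDF needs OCR from the text pdftotext extracted.

    min_chars is an arbitrary illustrative threshold, not a standard value.
    """
    # Count only non-whitespace characters: image-only pages typically
    # yield nothing but form feeds and newlines.
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return whatever text layer it finds."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def needs_ocr(pdf_path: str) -> bool:
    """True if the file appears to have no usable text layer."""
    return looks_like_scanned(extract_text(pdf_path))
```

Calling needs_ocr("filename.pdf") returning True suggests the file is a scan
and should go through tesseract or cuneiform. It is only a heuristic: a
scanned document with a short text stamp on each page could fool it.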