From: pj@pjb.com.au
To: speakup@braille.uwo.ca
Date: Wed, 4 Jan 2012 20:24:26 +1000
Subject: Re: Anyone able to OCR a PDF file?

Willem van der Walt wrote:
> The different OCR engines require different image formats.
> Some of them are really dumb.

They probably derive from old code written without a
format-independent graphics library.

> I find that the best of the open-source engines is cuneiform.

Aha, interesting. I've always used tesseract. cuneiform is in
Debian wheezy (testing) but not yet in Debian stable...

Depending on how the PDF was produced, it's possible that
  ps2txt filename.pdf
(a.k.a. ps2ascii) might help; I think it comes with ghostscript.

Regards, Peter Billam

http://www.pjb.com.au    pj@pjb.com.au    (03) 6278 9410
"Was der Meister nicht kann, vermöcht es der Knabe,
 hätt er ihm immer gehorcht?"   Siegfried to Mime, from Act 1 Scene 2
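
P.S. A minimal sketch of the ps2ascii suggestion above, in case it helps to script it. It assumes a `ps2ascii` executable is on the PATH (it has shipped with ghostscript, but may be missing on some installs) and that it writes the extracted text to standard output when given only an input file. The helper names `build_command`, `extract_text` and `looks_like_scan`, and the 20-character threshold, are hypothetical choices for illustration only.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(pdf_path: str, tool: str = "ps2ascii") -> List[str]:
    """Return the argv list for a text-extraction run; nothing is executed here."""
    return [tool, pdf_path]


def extract_text(pdf_path: str, tool: str = "ps2ascii") -> Optional[str]:
    """Run the tool and return its stdout, or None if the tool is not installed."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run(
        build_command(pdf_path, tool),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=False,          # a failed run simply yields little or no text
    )
    return result.stdout


def looks_like_scan(text: str, min_chars: int = 20) -> bool:
    """Rough heuristic (an assumption): almost no extracted text suggests
    the PDF holds page images only, so an OCR engine is needed instead."""
    return len(text.strip()) < min_chars


if __name__ == "__main__":
    text = extract_text("filename.pdf")
    if text is None:
        print("ps2ascii not found; is ghostscript installed?")
    elif looks_like_scan(text):
        print("Little or no text found; the PDF probably needs OCR.")
    else:
        print(text)
```

If the PDF was produced from real text, this should print it directly; if it was produced by scanning, the output will most likely be empty and an OCR engine such as tesseract or cuneiform is the next thing to try.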