public inbox for speakup@linux-speakup.org
* Anyone able to OCR a PDF file?
From: Janina Sajka
  To: speakup

Has anyone figured out how to get one of the Linux OCR engines (like
tesseract) to accept a graphical file (other than .tiff) as input? In
particular I'm going to be swamped with graphical PDF files this year.
Printing these just to scan them seems both wasteful and inefficient.

I know people do this on other OSes. Does anyone have suggestions on how
to do this in Linux?

All suggestions greatly appreciated.

Janina

-- 

Janina Sajka,	Phone:	+1.443.300.2200
		sip:janina@asterisk.rednote.net

Chair, Open Accessibility	janina@a11y.org	
Linux Foundation		http://a11y.org

Chair, Protocols & Formats
Web Accessibility Initiative	http://www.w3.org/wai/pf
World Wide Web Consortium (W3C)


* Re: Anyone able to OCR a PDF file?
From: pj
  To: speakup

Willem van der Walt wrote:
> The different OCR engines require different image formats.
> Some of them are really dumb.

They probably derive from old code written without a
format-independent graphics library.

> I find that the best of the open-source engines is cuneiform.

Aha, interesting.  I've always used tesseract.  cuneiform is
in Debian wheezy (testing) but not yet in Debian stable...

Depending on how the PDF was produced, it's possible that
  ps2txt filename.pdf
(a.k.a. ps2ascii) might help; I think it comes with ghostscript.
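
If the PDF turns out to contain only page images, text extraction will return little or nothing, and the pages have to be rendered to images and passed to an OCR engine. A minimal sketch of that route in Python follows, assuming Ghostscript (`gs`) and `tesseract` are installed and on the PATH. The function names, the `page-%03d.tif` file pattern, the 300 dpi setting and the choice of the `tiffgray` device are illustrative assumptions, not something anyone in this thread tested; which TIFF variants tesseract accepts depends on the version installed.

```python
import subprocess
from pathlib import Path


def gs_tiff_command(pdf_path, out_pattern, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a TIFF file."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",            # grayscale TIFF output device
        f"-r{dpi}",                     # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",  # %03d is replaced by the page number
        str(pdf_path),
    ]


def tesseract_command(image_path, out_base):
    """Build a tesseract command; tesseract writes its text to out_base + '.txt'."""
    return ["tesseract", str(image_path), str(out_base)]


def ocr_pdf(pdf_path, work_dir):
    """Render every page of pdf_path into work_dir, OCR each page, return the text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)

    # Step 1: PDF -> one TIFF per page (hypothetical naming: page-001.tif, ...).
    subprocess.run(gs_tiff_command(pdf_path, str(work / "page-%03d.tif")), check=True)

    # Step 2: OCR each page image in page order and collect the results.
    pages = []
    for tif in sorted(work.glob("page-*.tif")):
        base = tif.with_suffix("")  # e.g. work/page-001
        subprocess.run(tesseract_command(tif, base), check=True)
        pages.append(Path(f"{base}.txt").read_text(encoding="utf-8", errors="replace"))
    return "\n".join(pages)
```

Calling `ocr_pdf("filename.pdf", "ocr-work")` would leave the page images and per-page text files in `ocr-work` and return the combined text. Building the command lines in separate functions makes it easy to swap in another engine such as cuneiform, whose own command-line options would need to be checked.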

Regards,  Peter Billam

http://www.pjb.com.au       pj@pjb.com.au      (03) 6278 9410
"Was der Meister nicht kann,   vermöcht es der Knabe, hätt er
 ihm immer gehorcht?"   Siegfried to Mime, from Act 1 Scene 2

