* Anyone able to OCR a PDF file?
@ Janina Sajka
` Samuel Thibault
` Michael Whapples
0 siblings, 2 replies; 7+ messages in thread
From: Janina Sajka @ UTC (permalink / raw)
To: speakup
Has anyone figured out how to get one of the Linux OCR engines (like
tesseract) to accept a graphical file (other than .tiff) as input? In
particular I'm going to be swamped with graphical PDF files this year.
Printing these just to scan them seems both wasteful and inefficient.
I know people do this on other OS's. Has anyone suggestions on how to do
this in Linux?
All suggestions greatly appreciated.
Janina
--
Janina Sajka, Phone: +1.443.300.2200
sip:janina@asterisk.rednote.net
Chair, Open Accessibility janina@a11y.org
Linux Foundation http://a11y.org
Chair, Protocols & Formats
Web Accessibility Initiative http://www.w3.org/wai/pf
World Wide Web Consortium (W3C)
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Anyone able to OCR a PDF file?
Anyone able to OCR a PDF file? Janina Sajka
@ ` Samuel Thibault
` Janina Sajka
` Michael Whapples
1 sibling, 1 reply; 7+ messages in thread
From: Samuel Thibault @ UTC (permalink / raw)
To: Speakup is a screen review system for Linux.
Janina Sajka, le Tue 03 Jan 2012 11:40:45 -0500, a écrit :
> Has anyone figured out how to get one of the Linux OCR engines (like
> tesseract) to accept a graphical file (other than .tiff) as input?
You can use imagemagick's convert tool to convert from .pdf to .tiff:
convert test.pdf test.tiff
Samuel
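A minimal sketch of how that conversion could feed tesseract, assuming ImageMagick's convert and tesseract are both installed. The function names, the 300 dpi setting and the per-page file naming are illustrative choices, not something from the thread.

```shell
#!/bin/sh
# Sketch only: rasterise each PDF page to its own TIFF, then OCR page by page.
# Assumes ImageMagick (convert) and tesseract are on PATH.

# Derive the output base name from a PDF path: "dir/report.pdf" -> "report"
pdf_base() {
    basename "$1" .pdf
}

# ocr_pdf FILE.pdf  -> writes FILE.txt in the current directory
ocr_pdf() {
    pdf=$1
    base=$(pdf_base "$pdf")
    workdir=$(mktemp -d) || return 1

    # -density 300 renders at 300 dpi instead of ImageMagick's 72 dpi default.
    # +adjoin plus %03d writes one TIFF per page rather than a multi-page file.
    convert -density 300 "$pdf" -depth 8 +adjoin "$workdir/page-%03d.tiff" || return 1

    : > "$base.txt"
    for page in "$workdir"/page-*.tiff; do
        # tesseract appends ".txt" to the output base name it is given
        tesseract "$page" "${page%.tiff}" >/dev/null 2>&1 || return 1
        cat "${page%.tiff}.txt" >> "$base.txt"
    done
    rm -rf "$workdir"
}
```

Calling `ocr_pdf test.pdf` would then leave the recognised text in test.txt. Writing one file per page keeps each tesseract run small, and the pages are processed in order because the zero-padded names sort correctly.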
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Anyone able to OCR a PDF file?
` Samuel Thibault
@ ` Janina Sajka
` Willem van der Walt
0 siblings, 1 reply; 7+ messages in thread
From: Janina Sajka @ UTC (permalink / raw)
To: Speakup is a screen review system for Linux.
Thanks, Samuel! I think that will work nicely in a little script I can
quickly concoct.
Janina
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Anyone able to OCR a PDF file?
` Janina Sajka
@ ` Willem van der Walt
0 siblings, 0 replies; 7+ messages in thread
From: Willem van der Walt @ UTC (permalink / raw)
To: Speakup is a screen review system for Linux.
Hi,
Janina, I have written that script, sort of.
In the kies package I released a while ago, there is a set of scripts
to scan and do OCR on image files.
It is not rocket science, but it works well enough.
The different OCR engines require different image formats. Some of them
are really dumb.
My set of scripts tries to handle all that in the background, hiding it
from the user.
One can use different OCR engines like cuneiform, tesseract and now even
the commercial ABBYY FineReader engine, which is available for 149 euro.
I recently had to do a lot of OCR, and now have a license at work for
that engine.
I find that the best of the open-source engines is cuneiform.
The main script for scan/OCR stuff is called kies_p2t, for paper to text.
The kies tarball can be found at:
ftp://ftp.csir.co.za/NI/National_Accessibility_Portal/wvdwalt/kies-latest.tar.bz2
Regards, Willem
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Anyone able to OCR a PDF file?
Anyone able to OCR a PDF file? Janina Sajka
` Samuel Thibault
@ ` Michael Whapples
1 sibling, 0 replies; 7+ messages in thread
From: Michael Whapples @ UTC (permalink / raw)
To: Speakup is a screen review system for Linux.
I have personally used mostly Cuneiform for Linux. I cannot remember whether
it can handle PDF files natively (possibly; it can certainly read more than
TIFF). However, you could use a conversion tool (memory seems to say
pdf2tiff).
Michael Whapples
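The tool name recalled above is uncertain, so as one possible stand-in, here is a hedged sketch that uses Ghostscript for the PDF to TIFF step. The function names are made up for illustration, and tiffg4 (black-and-white, Group 4 compression) is just one of Ghostscript's TIFF output devices.

```shell
#!/bin/sh
# Sketch only: convert a PDF to a multi-page TIFF with Ghostscript.
# Assumes gs is on PATH; tiff_name and pdf_to_tiff_gs are illustrative names.

# Build the output name: "scan.pdf" -> "scan.tiff"
tiff_name() {
    printf '%s.tiff\n' "${1%.pdf}"
}

pdf_to_tiff_gs() {
    # -sDEVICE=tiffg4  bilevel TIFF with Group 4 compression
    # -r300            render at 300 dpi
    # -o FILE          set the output file and run non-interactively
    gs -q -sDEVICE=tiffg4 -r300 -o "$(tiff_name "$1")" "$1"
}
```

Running `pdf_to_tiff_gs scan.pdf` would write scan.tiff next to the input. A bilevel device suits plain printed text; for grey or colour scans a different TIFF device may give the OCR engine more to work with.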
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Anyone able to OCR a PDF file?
@ pj
` Jason White
0 siblings, 1 reply; 7+ messages in thread
From: pj @ UTC (permalink / raw)
To: speakup
Willem van der Walt wrote:
> The different OCR engines require different image formats.
> Some of them are really dumb.
They probably derive from old code written without a
format-independent graphics library.
> I find that the best of the open-source engines is cuneiform.
Aha, interesting. I've always used tesseract. cuneiform is
in Debian wheezy (testing) but not yet in Debian stable...
Depending on how the PDF was produced, it's possible that
ps2txt filename.pdf
(a.k.a. ps2ascii) might help; I think it comes with ghostscript.
Regards, Peter Billam
http://www.pjb.com.au pj@pjb.com.au (03) 6278 9410
"Was der Meister nicht kann, vermöcht es der Knabe, hätt er
ihm immer gehorcht?" Siegfried to Mime, from Act 1 Scene 2
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Anyone able to OCR a PDF file?
pj
@ ` Jason White
0 siblings, 0 replies; 7+ messages in thread
From: Jason White @ UTC (permalink / raw)
To: speakup
<speakup@braille.uwo.ca> wrote:
>Willem van der Walt wrote:
>> I find that the best of the open-source engines is cuneiform.
>
>Aha, interesting. I've always used tesseract. cuneiform is
>in debian wheezy (testing) but not yet in debian stable...
It is now officially unmaintained upstream. If you like it and you know
someone familiar with OCR algorithms who has time to spare, or someone who
might know such a person, it's time to establish the right connections.
I occasionally monitor the lists for Cuneiform and OCR Opus.
>
>Depending on how the PDF was produced, it's possible that
> ps2txt filename.pdf
>(a.k.a. ps2ascii) might help; I think it comes with ghostscript.
pdftotext and pdftohtml (as well as similar tools) will work, but only if
there is text in the PDF files. If there are only images of text rather than
characters, you have to apply OCR. The size of the PDF file usually gives a
strong indication of whether it contains rasterized images or not, and of
course you can use the tools in poppler-utils to find out.
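
A rough way to automate that check, sketched with pdftotext from poppler-utils: extract the text and see whether anything substantial comes out. The function names and the 20-byte threshold are assumptions for illustration; a scanned PDF that carries a stray header or watermark as real text could still fool it.

```shell
#!/bin/sh
# Sketch only: decide whether a PDF has a usable text layer or needs OCR.
# Assumes pdftotext (poppler-utils) is on PATH.

# Count the bytes on stdin that are not whitespace
count_nonspace() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# has_text_layer FILE.pdf [MIN_BYTES]
# Succeeds if pdftotext extracts at least MIN_BYTES non-whitespace bytes
# (default 20, an arbitrary threshold chosen for this sketch).
has_text_layer() {
    min_bytes=${2:-20}
    # "-" as the output name makes pdftotext write to standard output
    n=$(pdftotext "$1" - 2>/dev/null | count_nonspace)
    [ "${n:-0}" -ge "$min_bytes" ]
}
```

For example, `if has_text_layer doc.pdf; then pdftotext doc.pdf doc.txt; else echo "doc.pdf needs OCR"; fi`. Running pdffonts on the file is another quick hint, since a purely scanned document typically lists no fonts.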
^ permalink raw reply [flat|nested] 7+ messages in thread