public inbox for speakup@linux-speakup.org
* Anyone able to OCR a PDF file?
From: Janina Sajka
  To: speakup

Has anyone figured out how to get one of the Linux OCR engines (like
tesseract) to accept a graphical file (other than .tiff) as input? In
particular I'm going to be swamped with graphical PDF files this year.
Printing these just to scan them seems both wasteful and inefficient.

I know people do this on other OS's. Has anyone suggestions on how to do
this in Linux?

All suggestions greatly appreciated.

Janina

-- 

Janina Sajka,	Phone:	+1.443.300.2200
		sip:janina@asterisk.rednote.net

Chair, Open Accessibility	janina@a11y.org	
Linux Foundation		http://a11y.org

Chair, Protocols & Formats
Web Accessibility Initiative	http://www.w3.org/wai/pf
World Wide Web Consortium (W3C)



* Re: Anyone able to OCR a PDF file?
From: Samuel Thibault
  To: Speakup is a screen review system for Linux.

Janina Sajka, on Tue 03 Jan 2012 11:40:45 -0500, wrote:
> Has anyone figured out how to get one of the Linux OCR engines (like
> tesseract) to accept a graphical file (other than .tiff) as input?

You can use ImageMagick's convert tool to convert from .pdf to .tiff:

convert test.pdf test.tiff
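
A minimal sketch of how that conversion could feed tesseract. The
function name pdf_ocr, the DRY_RUN switch and the 300 dpi density are
assumptions for illustration, not something stated in this thread.
ImageMagick rasterizes PDFs at 72 dpi by default, which is usually too
coarse for OCR, so the sketch raises it:

```shell
#!/bin/sh
# Sketch only: rasterize a PDF to TIFF with ImageMagick, then OCR the TIFF.
# Assumed, not from the thread: pdf_ocr, DRY_RUN, and the 300 dpi density.

pdf_ocr() {
    pdf=$1
    # Output base name: second argument, or the PDF name without ".pdf".
    base=${2:-${pdf%.pdf}}
    # With DRY_RUN set, print the commands instead of running them.
    run=${DRY_RUN:+echo}

    # -density must come before the input file to affect rasterization.
    $run convert -density 300 "$pdf" "$base.tiff" || return
    # tesseract appends ".txt" to the output base name itself.
    $run tesseract "$base.tiff" "$base"
}

# Run only when the script is given arguments, e.g. ./pdf_ocr.sh test.pdf
if [ "$#" -gt 0 ]; then
    pdf_ocr "$@"
fi
```

Running it with DRY_RUN=1 set shows the two commands without needing
either tool installed. On installations that ship ImageMagick 7 the
command may be called magick rather than convert.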

Samuel


* Re: Anyone able to OCR a PDF file?
From: Janina Sajka
  To: Speakup is a screen review system for Linux.

Thanks, Samuel! I think that will work nicely in a little script I can
quickly concoct.

Janina


* Re: Anyone able to OCR a PDF file?
From: Michael Whapples
  To: Speakup is a screen review system for Linux.

I have personally used cuneiform for Linux mostly. I cannot remember
whether it can handle PDF files natively (possibly; it can certainly
read more than TIFF), but you could use a conversion tool (memory seems
to say pdf2tiff).

Michael Whapples


* Re: Anyone able to OCR a PDF file?
From: Willem van der Walt
  To: Speakup is a screen review system for Linux.

Hi,
Janina, I have done that script, sort of.
In the kies package I released a while ago, there is a set of scripts
to scan and do OCR on image files.
It is not rocket science, but it works well enough.
The different OCR engines require different image formats.  Some of them
are really dumb.
My set of scripts tries to handle all that in the background, hiding it
from the user.
One can use different OCR engines like cuneiform, tesseract and now even
the commercial ABBYY FineReader engine, which is available for 149 euro.
I have recently had to do a lot of OCR, and now have a license at work for
that engine.
I find that the best of the open-source engines is cuneiform.
The main script for scan/OCR stuff is called kies_p2t, for paper to text.
The kies tarball can be found at:
ftp://ftp.csir.co.za/NI/National_Accessibility_Portal/wvdwalt/kies-latest.tar.bz2
Regards, Willem



* Re: Anyone able to OCR a PDF file?
From: Jason White
  To: speakup

pj wrote:
>Willem van der Walt wrote:

>> I find that the best of the open-source engines is cuneiform.
>
>Aha, interesting.  I've always used tesseract.  cuneiform is
>in Debian wheezy (testing) but not yet in Debian stable...

Cuneiform is now officially unmaintained upstream. If you like it and you
know someone familiar with OCR algorithms who has time to spare, or someone
who might know such a person, it's time to establish the right connections.

I occasionally monitor the lists for Cuneiform and OCRopus.
>
>Depending on how the PDF was produced, it's possible that
>  ps2txt filename.pdf
>(a.k.a. ps2ascii) might help; I think it comes with ghostscript.

pdftotext and pdftohtml (as well as similar tools) will work, but only if
there is text in the PDF files.  If there are only images of text rather than
characters, you have to apply OCR. The size of the PDF file usually gives a
strong indication of whether it contains rasterized images or not, and of
course you can use the tools in poppler-utils to find out.
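
One way to make that check is to see whether pdftotext extracts anything
at all. The following is a minimal sketch, assuming pdftotext from
poppler-utils is installed; the function names has_text and needs_ocr
and the 20-character threshold are illustrative assumptions:

```shell
#!/bin/sh
# Sketch only: guess whether a PDF has a text layer or needs OCR.
# Assumed, not from the thread: has_text, needs_ocr, MIN_CHARS, threshold 20.

# has_text reads extracted text on stdin and succeeds when it contains
# more than MIN_CHARS non-whitespace characters.
has_text() {
    count=$(($(tr -d '[:space:]' | wc -c)))
    [ "$count" -gt "${MIN_CHARS:-20}" ]
}

# needs_ocr succeeds when pdftotext finds (almost) no text in the file.
# "-" as the output name makes pdftotext write to standard output.
needs_ocr() {
    ! pdftotext "$1" - 2>/dev/null | has_text
}

# Run only when the script is given file names as arguments.
if [ "$#" -gt 0 ]; then
    for f in "$@"; do
        if needs_ocr "$f"; then
            echo "$f: no usable text found, needs OCR"
        else
            echo "$f: contains text"
        fi
    done
fi
```

Note that the sketch also reports "needs OCR" when pdftotext is missing
or fails, so it is only a first guess. pdffonts, from the same package,
gives another hint: a file that lists no fonts is unlikely to contain
real characters.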




* Re: Anyone able to OCR a PDF file?
From: pj
  To: speakup

Willem van der Walt wrote:
> The different OCR engines require different image formats.
> Some of them are really dumb.

They probably derive from old code written without a
format-independent graphics library.

> I find that the best of the open-source engines is cuneiform.

Aha, interesting.  I've always used tesseract.  cuneiform is
in Debian wheezy (testing) but not yet in Debian stable...

Depending on how the PDF was produced, it's possible that
  ps2txt filename.pdf
(a.k.a. ps2ascii) might help; I think it comes with ghostscript.

Regards,  Peter Billam

http://www.pjb.com.au       pj@pjb.com.au      (03) 6278 9410
"Was der Meister nicht kann,   vermöcht es der Knabe, hätt er
 ihm immer gehorcht?"   Siegfried to Mime, from Act 1 Scene 2



Thread overview: 7+ messages
 Anyone able to OCR a PDF file? Janina Sajka
 ` Samuel Thibault
   ` Janina Sajka
     ` Willem van der Walt
 ` Michael Whapples
 pj
 ` Jason White
