public inbox for blinux-list@redhat.com
* extracting text from png files
From: Linux for blind general discussion
  To: blinux-list

Does Linux have a tool for us command line users that can extract text from
PNG files?  A couple were sent to me, and if possible I'd like to read them.




* Re: extracting text from png files
From: Linux for blind general discussion
  To: blinux-list

What you're looking for is Ocular Character Recognition or OCR for short.

I've never managed to figure out its command line syntax, but I
believe tesseract is considered the current standard option for Linux.
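
For reference, here is a minimal sketch of the basic tesseract invocation,
wrapped in Python so the command can be inspected before it is run. The
file name scan.png and the helper names are hypothetical. The command
shape "tesseract IMAGE stdout -l LANG" follows tesseract's documented
usage, where giving "stdout" as the output base prints the recognized
text to the terminal instead of writing a .txt file.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the argument list for: tesseract IMAGE stdout -l LANG."""
    # "stdout" as the output base makes tesseract print the text
    # rather than write it to a file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name; substitute the PNG you want to read.
    print(" ".join(tesseract_command("scan.png")))
```

Typing the printed command directly at a shell prompt does the same
thing, so the Python wrapper is only a convenience for scripting.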

There's also Cuneiform, which I have actually used with some success
in the past, but I believe it's either contrib or non-free under
Debian, so you might need to enable extra repositories depending on
how strict your distro is about sticking to FOSS principles.

I will warn you, in my experience, OCR is as likely to produce
gibberish as legible text. A scan of a page of prose typeset in a
standard font will probably OCR well, but the more text is mixed with
graphics, the fancier the font, and the more complicated the page
layout, the more likely errors are. I've tried OCR'ing scanlated
manga (Japanese comics) in the past and have gotten results that
included unpredictable patterns of letters and numbers misidentified
as others (S and 5, P and D, I and 1, LI and U, B and g were just some
of the common substitutions I encountered trying to fix the OCR'd
text), characters my screenreader couldn't identify or identified as
characters I'm unfamiliar with, and even when the text was clear,
paragraphs out of order weren't uncommon.
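
One rough way to find the letter/digit substitutions described above
(S and 5, I and 1) is to flag words that contain both letters and
digits. This is only a sketch under that assumption: the function name
is hypothetical, it will miss letter-for-letter swaps such as P and D,
and it will also flag legitimate tokens such as "mp3".

```python
import re


def suspicious_words(text):
    """Return words that mix letters and digits, a common sign of OCR errors."""
    flagged = []
    for word in re.findall(r"\w+", text):
        has_letter = re.search(r"[A-Za-z]", word) is not None
        has_digit = re.search(r"\d", word) is not None
        if has_letter and has_digit:
            flagged.append(word)
    return flagged
```

The flagged words still have to be checked by hand; the list only
narrows down where to look in a long OCR'd text.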

-- 
Sincerely,

Jeffery Wright
Bachelor of Computer Science
President Emeritus, Nu Nu Chapter, Phi Theta Kappa.


* Re: extracting text from png files
From: Linux for blind general discussion
  To: Linux for blind general discussion

Thanks much, I've got both packages on my system, so now it's time to
read some manuals and search for YouTube tutorials to fill in the
missing pieces and learn these two packages.  I didn't know these could
handle PNG files.

On Mon, 17 Dec 2018, Linux for blind general discussion wrote:

> Date: Mon, 17 Dec 2018 10:03:23
> From: Linux for blind general discussion <blinux-list@redhat.com>
> To: blinux-list@redhat.com
> Subject: Re: extracting text from png files
>
> What you're looking for is Ocular Character Recognition or OCR for short.
>
> I've never managed to figure out its command line syntax, but I
> believe tesseract is considered the current standard option for Linux.
>
> There's also Cuneiform, which I have actually used with some success
> in the past, but I believe it's either contrib or non-free under
> Debian, so you might need to enable extra repositories depending on
> how strict your distro is about sticking to FOSS principles.
>
> I will warn you, in my experience, OCR is as likely to produce
> gibberish as legible text. A scan of a page of prose typeset in a
> standard font will probably OCR well, but the more text is mixed with
> graphics, the fancier the font, and the more complicated the page
> layout, the more likely errors are. I've tried OCR'ing scanlated
> manga (Japanese comics) in the past and have gotten results that
> included unpredictable patterns of letters and numbers misidentified
> as others (S and 5, P and D, I and 1, LI and U, B and g were just some
> of the common substitutions I encountered trying to fix the OCR'd
> text), characters my screenreader couldn't identify or identified as
> characters I'm unfamiliar with, and even when the text was clear,
> paragraphs out of order weren't uncommon.
>
>


* Re: extracting text from png files
From: Linux for blind general discussion
  To: blinux-list

Disclaimer: I don't know which image formats either program supports
directly, nor do I know of a good way to convert between image
formats, though I'm pretty sure cuneiform supports at least .jpg and
.png files directly.

I also remember at least one OCR tutorial recommending some
preprocessing to make images easier for the OCR program to work with,
and I believe they used the convert command provided by imagemagick to
do so, but I forget the details.
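
As a rough illustration of both points, the sketch below builds two
ImageMagick commands: one that converts between image formats and one
that preprocesses an image before OCR. The specific settings (grayscale
and a 200% upscale) are assumptions chosen as commonly suggested
starting points, not the steps from the tutorial mentioned above, and
the helper names and file names are hypothetical. On ImageMagick 7 the
command is usually "magick" rather than "convert".

```python
def format_conversion_command(src, dst):
    """Build: convert SRC DST."""
    # ImageMagick chooses the output format from the destination extension.
    return ["convert", src, dst]


def preprocess_command(src, dst, scale_percent=200):
    """Build: convert SRC -colorspace Gray -resize N% DST."""
    return [
        "convert", src,
        "-colorspace", "Gray",            # drop color information
        "-resize", f"{scale_percent}%",   # enlarge small text
        dst,
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed or pasted into a shell.
    print(" ".join(format_conversion_command("photo.jpg", "photo.png")))
    print(" ".join(preprocess_command("photo.png", "photo-clean.png")))
```

Whether these settings help depends on the image, so it is worth
comparing the OCR output with and without preprocessing.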

Also, it's been a while since I've attempted any OCR'ing myself (how
often I had to manually clean up the output kind of put me off), so
there might be others on this list who can provide better and more
specific advice on this subject.

Still, I hope I've at least got you started on the right track.



-- 
Sincerely,

Jeffery Wright
Bachelor of Computer Science
President Emeritus, Nu Nu Chapter, Phi Theta Kappa.


* Re: extracting text from png files
From: Linux for blind general discussion
  To: blinux-list

Howdy,

I use tesseract for this.
I noticed that with version 4.0, which was just released, the results improved
a lot here (for German and English use cases).
Some official numbers can be found here:
https://github.com/tesseract-ocr/docs/raw/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf
The languages improve by between 10 and 80 percent, depending on the language
and its previous support level.
It seems it got a new OCR engine based on a neural network.

cheers chrys
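
A minimal sketch of how the neural network engine in tesseract 4.0 can
be selected explicitly, assuming the German and English language data
are installed. In tesseract 4, "--oem 1" selects the LSTM engine only,
and several languages can be joined with "+" after "-l". The helper
name and the file name are hypothetical.

```python
def tesseract4_command(image_path, langs=("deu", "eng"), oem=1):
    """Build: tesseract IMAGE stdout -l deu+eng --oem 1."""
    return [
        "tesseract", image_path, "stdout",
        "-l", "+".join(langs),   # e.g. "deu+eng" for mixed German/English text
        "--oem", str(oem),       # 1 = neural network (LSTM) engine only
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed or pasted into a shell.
    print(" ".join(tesseract4_command("letter.png")))
```

Leaving out "--oem" lets tesseract choose its default engine based on
what is available.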

On 17.12.18 at 16:57, Linux for blind general discussion wrote:
> Disclaimer: I don't know which image formats either program supports
> directly, nor do I know of a good way to convert between image
> formats, though I'm pretty sure cuneiform supports at least .jpg and
> .png files directly.
>
> I also remember at least one OCR tutorial recommending some
> preprocessing to make images easier for the OCR program to work with,
> and I believe they used the convert command provided by imagemagick to
> do so, but I forget the details.
>
> Also, it's been a while since I've attempted any OCR'ing myself (how
> often I had to manually clean up the output kind of put me off), so
> there might be others on this list who can provide better and more
> specific advice on this subject.
>
> Still, I hope I've at least got you started on the right track.
>
>
>


* Re: extracting text from png files
From: Linux for blind general discussion
  To: Linux for blind general discussion

OK, this is a nit, but the O in OCR stands for "Optical," not "Ocular."

It's about the process based on vision, not on the organ that is
sensitive to light. Machines don't have eyes, biological beings have
eyes.


Linux for blind general discussion writes:
> What you're looking for is Ocular Character Recognition or OCR for short.
> 
> I've never managed to figure out its command line syntax, but I
> believe tesseract is considered the current standard option for Linux.
> 
> There's also Cuneiform, which I have actually used with some success
> in the past, but I believe it's either contrib or non-free under
> Debian, so you might need to enable extra repositories depending on
> how strict your distro is about sticking to FOSS principles.
> 
> I will warn you, in my experience, OCR is as likely to produce
> gibberish as legible text. A scan of a page of prose typeset in a
> standard font will probably OCR well, but the more text is mixed with
> graphics, the fancier the font, and the more complicated the page
> layout, the more likely errors are. I've tried OCR'ing scanlated
> manga (Japanese comics) in the past and have gotten results that
> included unpredictable patterns of letters and numbers misidentified
> as others (S and 5, P and D, I and 1, LI and U, B and g were just some
> of the common substitutions I encountered trying to fix the OCR'd
> text), characters my screenreader couldn't identify or identified as
> characters I'm unfamiliar with, and even when the text was clear,
> paragraphs out of order weren't uncommon.
> 
> -- 
> Sincerely,
> 
> Jeffery Wright
> Bachelor of Computer Science
> President Emeritus, Nu Nu Chapter, Phi Theta Kappa.
> 
> _______________________________________________
> Blinux-list mailing list
> Blinux-list@redhat.com
> https://www.redhat.com/mailman/listinfo/blinux-list

-- 

Janina Sajka

Linux Foundation Fellow
Executive Chair, Accessibility Workgroup:	http://a11y.org

The World Wide Web Consortium (W3C), Web Accessibility Initiative (WAI)
Chair, Accessible Platform Architectures	http://www.w3.org/wai/apa


* Re: extracting text from png files
From: Linux for blind general discussion
  To: blinux-list

And this is one of the reasons I hate acronyms: it's so easy to mix up
words with related meanings and the same first letter. It probably
doesn't help that I think it's KDE's GUI application for OCR that's
called Okular or something similar, and switching optical to ocular in a
Google search reduces the number of hits from 26 million to only 9
million, so both wordings have fairly widespread usage.

-- 
Sincerely,

Jeffery Wright
Bachelor of Computer Science
President Emeritus, Nu Nu Chapter, Phi Theta Kappa.


* Re: extracting text from png files
From: Linux for blind general discussion
  To: blinux-list

Okular is KDE's viewer for PDFs (and ebooks, documents, and similar).

> On 18.12.2018 at 20:10, Linux for blind general discussion <blinux-list@redhat.com> wrote:
> 
> And this is one of the reasons I hate acronyms: it's so easy to mix up
> words with related meanings and the same first letter. It probably
> doesn't help that I think it's KDE's GUI application for OCR that's
> called Okular or something similar, and switching optical to ocular in a
> Google search reduces the number of hits from 26 million to only 9
> million, so both wordings have fairly widespread usage.
> 
> -- 
> Sincerely,
> 
> Jeffery Wright
> Bachelor of Computer Science
> President Emeritus, Nu Nu Chapter, Phi Theta Kappa.

