Date: Mon, 17 Dec 2018 10:34:54 -0500
To: Linux for blind general discussion
Subject: Re: extracting text from png files
From: Linux for blind general discussion

Thanks much. I've got both packages on my system, so now it's time to
read some manuals and search for YouTube tutorials to fill in the
missing pieces and learn these two packages. I didn't know these could
handle PNG files.

On Mon, 17 Dec 2018, Linux for blind general discussion wrote:

> Date: Mon, 17 Dec 2018 10:03:23
> From: Linux for blind general discussion
> To: blinux-list@redhat.com
> Subject: Re: extracting text from png files
>
> What you're looking for is Optical Character Recognition, or OCR for
> short.
>
> I've never managed to figure out its command line syntax, but I
> believe tesseract is considered the current standard option for Linux.
>
> There's also Cuneiform, which I have actually used with some success
> in the past, but I believe it's either contrib or non-free under
> Debian, so you might need to enable extra repositories depending on
> how strict your distro is about sticking to FOSS principles.
>
> I will warn you: in my experience, OCR is as likely to produce
> gibberish as legible text. A scan of a page of prose typeset in a
> standard font will probably OCR well, but the more the text is mixed
> with graphics, the fancier the font, and the more complicated the page
> layout, the more likely errors are.
> I've tried OCR'ing scanlated
> manga (Japanese comics) in the past and have gotten results that
> included unpredictable patterns of letters and numbers misidentified
> as others (S and 5, P and D, I and 1, LI and U, B and g were just some
> of the common substitutions I encountered trying to fix the OCR'd
> text), characters my screen reader couldn't identify or identified as
> characters I'm unfamiliar with, and even when the text was clear,
> out-of-order paragraphs weren't uncommon.
>
> --
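
For anyone else in the thread stuck on the command line syntax mentioned above: as far as I know, the basic form is "tesseract IMAGE OUTPUTBASE", and giving "stdout" as the output base prints the recognized text instead of writing OUTPUTBASE.txt. Below is a minimal Python sketch that wraps that call. The function names build_tesseract_command and extract_text are my own placeholders, not part of tesseract, and the "-l eng" default assumes the English language data is installed.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Return the argument list for OCR'ing one image to standard output.

    Hypothetical helper for illustration; only the resulting command line
    (tesseract IMAGE stdout -l LANG) comes from tesseract itself.
    """
    # Using "stdout" as the output base makes tesseract print the text
    # rather than write a .txt file next to the image.
    return ["tesseract", image_path, "stdout", "-l", lang]


def extract_text(image_path, lang="eng"):
    """Run tesseract on image_path and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,  # collect the OCR text instead of printing it
        text=True,            # decode output as str rather than bytes
        check=True,           # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

Calling extract_text("page.png"), where page.png stands in for whatever image you have, should hand back the recognized text as one string that can be piped to a screen reader or saved to a file. Given the warning in the quoted message, expect the result to need proofreading on anything but clean scans.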