From: "Lloyd G. Rasmussen" <lras@loc.gov>
To: blinux-list@redhat.com
Subject: Re: OCR software (was Re: Concerning BLinux project (fwd))
Date: Mon, 7 Dec 98 11:41:14 EST
Message-ID: <54971.lras@loc.gov>

What you ask for is not likely to be available until artificial
intelligence goes forward much further. You are asking a computer
program which knows the *presentation* of a document to correctly
infer the *structure* of that document, or at least attempt to do so.

I recently bought OmniPage 9 for Win95 from Caere Corporation. Among
all its export formats, it includes an HTML export format. From what
I've seen so far, the objective is to make a GUI web browser display
the page, with fonts, italics, and centering intact. The HTML is a
series of <p> and <br> tags with Font, I, and Align markup. No structure.
It even claims to conform to the HTML 3.0 DTD, and tells you that the
generator is Adobe Word for Word. I know that HTML is not SGML. But
I'm not too hopeful that, when OCR programs begin exporting XML,
they will do much better than this.

I know that Duxbury attempts to create styles in a file which it has
imported from ASCII, but this is usually just a beginning toward
correctly marking up a document. I agree that you're aiming for the
right objective, but I don't know how we're going to get there.

On Mon, 7 Dec 1998 09:28:03 +1100 (AEDT),
Jason White <jasonw@ariel.ucs.unimelb.EDU.AU> wrote:
>On the subject of freely available OCR software, currently under
>development, see http://www.socr.org/
>
>What is most needed as output is not straightforward ASCII text, but
>rather a document which has been marked up in SGML, XML or a related
>language, that preserves its structure and maintains the distinctions
>necessary for the generation of high quality braille and audio output.
>
-- Lloyd Rasmussen
Senior Staff Engineer, Engineering Section
National Library Service for the Blind and Physically Handicapped
Library of Congress 202-707-0535
(work) lras@loc.gov http://www.loc.gov/nls/
(home) lras@sprynet.com http://home.sprynet.com/sprynet/lras/