From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (qmail 25372 invoked from network); 7 Dec 1998 16:47:15 -0000
Received: from mail.redhat.com (199.183.24.239) by lists.redhat.com with SMTP; 7 Dec 1998 16:47:15 -0000
Received: from rs8.loc.gov (rs8.loc.gov [140.147.248.8]) by mail.redhat.com (8.8.7/8.8.7) with ESMTP id LAA28016 for ; Mon, 7 Dec 1998 11:41:53 -0500
Received: from [140.147.40.26] (lras.loc.gov [140.147.40.26]) by rs8.loc.gov (8.8.4/8.8.4) with SMTP id LAA17424 for ; Mon, 7 Dec 1998 11:41:47 -0500
Date: Mon, 7 Dec 98 11:41:14 EST
From: "Lloyd G. Rasmussen"
Message-Id: <54971.lras@loc.gov>
X-Minuet-Version: Minuet1.0_Beta_18A
Reply-To:
X-POPMail-Charset: English
To: blinux-list@redhat.com
Subject: Re: OCR software (was Re: Concerning BLinux project (fwd))
List-Id:

What you ask for is not likely to be available until artificial intelligence goes forward much further. You are asking a computer program which knows the *presentation* of a document to correctly infer the *structure* of that document, or at least attempt to do so.

I recently bought Omnipage 9 for Win95 from Caere Corporation. Among all its export formats, it includes an HTML export format. From what I've seen so far, the objective is to make a GUI web browser display the page, with fonts, italics, centering, intact. The HTML is a series of

and
with Font, I, Align attributes. No structure. It even claims to conform to the HTML 3.0 DTD, and tells you that the generator is Adobe Word for Word.

I know that HTML is not SGML. But I'm not too hopeful that when OCR programs begin exporting XML, they will do much better than this. I know that Duxbury attempts to create styles in a file which it has imported from ASCII, but this is usually just a beginning toward correctly marking up a document.

I agree that you're aiming for the right objective, but I don't know how we're going to get there.

On Mon, 7 Dec 1998 09:28:03 +1100 (AEDT), Jason White wrote:
>On the subject of freely available OCR software, currently under
>development, see http://www.socr.org/
>
>What is most needed as output is not straightforward ASCII text, but
>rather a document which has been marked up in SGML, XML or a related
>language, that preserves its structure and maintains the distinctions
>necessary for the generation of high quality braille and audio output.
>
>
>---
>Send your message for blinux-list to blinux-list@redhat.com
>Blinux software archive at ftp://leb.net/pub/blinux
>Blinux web page at http://leb.net/blinux
>To unsubscribe send mail to blinux-list-request@redhat.com
>with subject line: unsubscribe
>

-- 
Lloyd Rasmussen
Senior Staff Engineer, Engineering Section
National Library Service for the Blind and Physically Handicapped
Library of Congress 202-707-0535
(work) lras@loc.gov http://www.loc.gov/nls/
(home) lras@sprynet.com http://home.sprynet.com/sprynet/lras/