From: Martin McCormick
To: Linux for blind general discussion
Subject: Extracting ASCII text from a PDF Document
Date: Thu, 12 Aug 2010 07:26:56 -0500
Message-Id: <201008121226.o7CCQuBf017566@dc.cis.okstate.edu>

I have a PDF document that does have embedded ASCII text in it. It plays fine on a Macintosh that has no OCR software on it but uses VoiceOver. VoiceOver just reads the text, so the ASCII must be there. I need to use the file on a Debian system, so I hope I am just using a2ps and pstotext wrong.
If one runs pstotext on this document, it immediately errors out. If I use a2ps and give it -o outfilename.ps, a2ps runs, but I may be producing an image file, as there is no text from the document. Talk about sound and fury signifying nothing. If one runs pstotext on that output file, one gets a single form feed for each page and nothing else. The PDF document is not protected. Any suggestions as to how to extract the text are welcome.

Thanks.

Martin McCormick
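
For anyone who wants to reproduce the symptoms, the sequence described above can be sketched as a small script. This is only an illustration under assumptions: the input name document.pdf is hypothetical (the message does not name the file), the helper names build_commands and run_all are made up for this sketch, and the exact options used in the original runs are not known beyond -o outfilename.ps. The script does not fix anything; it just runs the three steps and reports how much text and how many form feeds each one produced.

```python
import shutil
import subprocess

PDF = "document.pdf"        # hypothetical input name, not given in the message
PS_OUT = "outfilename.ps"   # output name mentioned in the message


def build_commands(pdf=PDF, ps_out=PS_OUT):
    """Return the three command lines described in the message, in order."""
    return [
        ["pstotext", pdf],            # step 1: pstotext directly on the PDF
        ["a2ps", "-o", ps_out, pdf],  # step 2: a2ps writing a PostScript file
        ["pstotext", ps_out],         # step 3: pstotext on the a2ps output
    ]


def run_all(commands):
    """Run each command whose program is installed.

    Returns a list of (command, return code, stdout) tuples. The return
    code is None and stdout is empty when the program is not on PATH.
    """
    results = []
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            results.append((cmd, None, ""))
            continue
        proc = subprocess.run(
            cmd, capture_output=True, text=True, errors="replace"
        )
        results.append((cmd, proc.returncode, proc.stdout))
    return results


if __name__ == "__main__":
    for cmd, code, out in run_all(build_commands()):
        if code is None:
            print(" ".join(cmd), "-> program not installed")
            continue
        form_feeds = out.count("\f")
        visible = sum(1 for ch in out if not ch.isspace())
        print(" ".join(cmd), "-> exit", code,
              "| form feeds:", form_feeds,
              "| non-blank characters:", visible)
```

If the last step reports one form feed per page and zero non-blank characters, that matches the behaviour described in the message.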