From: Chris Brannon
To: Linux for blind general discussion
Subject: Re: Extracting ASCII text from a PDF Document
Message-ID: <4c63eb53.10c98e0a.5c23.5926@mx.google.com>
In-reply-to:
 <201008121226.o7CCQuBf017566@dc.cis.okstate.edu>
References: <201008121226.o7CCQuBf017566@dc.cis.okstate.edu>
Comments: In-reply-to Martin McCormick message dated "Thu, 12 Aug 2010 07:26:56 -0500."
Date: Thu, 12 Aug 2010 07:40:05 -0500

Martin McCormick wrote:
> I have a PDF document that does have embedded ASCII text in it.
> I need to use the file on a Debian system so I hope I am
> just using a2ps and pstotext wrong.

Don't do that! Use pdftotext instead. On my distribution, ArchLinux,
pdftotext is provided by the "poppler" package. I don't know which
package you need for Debian. Perhaps it's in xpdf.

One thing you'll notice when converting PDF to plain text is that
certain letter combinations come out as single UTF-8-encoded Unicode
ligature characters instead of separate letters. Only the gods know
why. Common examples are fi, fl, and ff. Of course, most screenreaders
won't render those correctly.

-- 
Chris
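
P.S. One way to deal with the ligature characters is to map them back to plain letters after the conversion. What follows is only a minimal sketch, not something pdftotext does for you. The function name expand_ligatures and the choice of Python are my own assumptions, and the table covers only the five Latin ligatures U+FB00 through U+FB04; a given PDF may contain others.

```python
# Map the Latin ligature code points (U+FB00..U+FB04) back to plain ASCII
# letters. Illustrative table only; extend it if your documents use
# other ligatures.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts a dict whose keys are single characters and whose
# values are replacement strings of any length.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with each known ligature replaced by its letters."""
    return text.translate(_TABLE)


# Example: a word as it might appear in converted text, with the "fi"
# ligature as a single character.
sample = "\ufb01le"
print(expand_ligatures(sample))
```

You would run the text produced by pdftotext through expand_ligatures before handing it to the screenreader. Everything that is not in the table passes through unchanged.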