* Extracting ASCII text from a PDF Document
From: Martin McCormick
To: Linux for blind general discussion
I have a PDF document that does have embedded ASCII text in it.
It plays fine on a Macintosh that has no OCR software on it but
uses Voiceover. Voiceover just runs on ASCII so the ASCII is
there.
I need to use the file on a Debian system so I hope I am
just using a2ps and pstotext wrong.
If one uses pstotext on this document, it immediately
errors out. If I use a2ps and give it -o outfilename.ps, a2ps
runs but I may be producing an image file as there is no text
from the document, talk about sound and fury signifying nothing.
If one runs pstotext on that output file, one gets a
single form feed for each page and nothing else.
The PDF document is not protected.
Any suggestions as to how to extract the text are
welcome. Thanks.
Martin McCormick
* Re: Extracting ASCII text from a PDF Document
From: Kirk Reiser
To: Linux for blind general discussion
What happens when you run pdftotext on the file?
On Thu, 12 Aug 2010, Martin McCormick wrote:
> I have a PDF document that does have embedded ASCII text in it.
> It plays fine on a Macintosh that has no OCR software on it but
> uses Voiceover. Voiceover just runs on ASCII so the ASCII is
> there.
>
> I need to use the file on a Debian system so I hope I am
> just using a2ps and pstotext wrong.
>
> If one uses pstotext on this document, it immediately
> errors out. If I use a2ps and give it -o outfilename.ps, a2ps
> runs but I may be producing an image file as there is no text
> from the document, talk about sound and fury signifying nothing.
>
> If one runs pstotext on that output file, one gets a
> single form feed for each page and nothing else.
>
> The PDF document is not protected.
>
> Any suggestions as to how to extract the text are
> welcome. Thanks.
>
> Martin McCormick
>
> _______________________________________________
> Blinux-list mailing list
> Blinux-list@redhat.com
> https://www.redhat.com/mailman/listinfo/blinux-list
>
--
Kirk Reiser The Computer Braille Facility
e-mail: kirk@braille.uwo.ca University of Western Ontario
phone: (519) 661-3061
* Re: Extracting ASCII text from a PDF Document
From: Chris Brannon
To: Linux for blind general discussion
Martin McCormick wrote:
> I have a PDF document that does have embedded ASCII text in it.
> I need to use the file on a Debian system so I hope I am
> just using a2ps and pstotext wrong.
Don't do that! Use pdftotext instead.
On my distribution, ArchLinux, pdftotext is provided by the "poppler"
package. I don't know which package you need for Debian.
Perhaps it's in xpdf.
One thing you'll notice when converting PDF to plain text is that certain
two-letter combinations are replaced with UTF-8-encoded Unicode characters.
Only the gods know why.
Common examples are fi, fl, and ff.
Of course, most screenreaders won't render those correctly.
-- Chris
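One possible workaround for the ligature problem described above, offered only as a minimal sketch and not as anything pdftotext itself provides: post-process the extracted text and expand the ligature characters back into plain letters. The function name expand_ligatures and the mapping table are illustrative assumptions; the code points are those of the Unicode Alphabetic Presentation Forms block (U+FB00 to U+FB06).

```python
# Map single-character Latin ligatures to their plain ASCII spellings.
# Keys are Unicode code points, as required by str.translate().
LIGATURES = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
    0xFB05: "st",  # ligature of long s and t
    0xFB06: "st",
}


def expand_ligatures(text: str) -> str:
    """Return text with each ligature character replaced by its letters."""
    return text.translate(LIGATURES)
```

For example, expand_ligatures("\ufb01le") gives "file", which a screen reader should have no trouble with.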
* Re: Extracting ASCII text from a PDF Document
From: Geoff Shang
To: Linux for blind general discussion
On Thu, 12 Aug 2010, Martin McCormick wrote:
> I have a PDF document that does have embedded ASCII text in it.
I use pdftotext from the poppler-utils package.
Note that this expects an output filename rather than sending to stdout.
Geoff.
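To make the output-file behaviour concrete, here is a small sketch of driving pdftotext from Python. It assumes pdftotext (from xpdf or poppler-utils) is on the PATH; the helper names pdftotext_command and extract_text are made up for illustration. According to the pdftotext man page, leaving out the output name writes to a file named after the PDF with a .txt extension, and giving "-" as the output name sends the text to standard output.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, output=None):
    """Build the argument list; output="-" means standard output."""
    cmd = ["pdftotext", pdf_path]
    if output is not None:
        cmd.append(output)
    return cmd


def extract_text(pdf_path):
    """Return the PDF's text as a string (assumes pdftotext is installed)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, "-"),
        capture_output=True,
        check=True,
    )
    # Decoding as UTF-8 is an assumption about the configured text encoding.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling pdftotext_command("manual.pdf") with no output name corresponds to the default case, where the text lands in manual.txt next to the PDF.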
* Re: Extracting ASCII text from a PDF Document
From: Martin McCormick
To: Linux for blind general discussion
Kirk Reiser writes:
> What happens when you run pdftotext on the file?
$ pstotext BCD996XT_v1.04.00_Protocol.pdf
< BCD996XT Operation Specification >
200
GPL Ghostscript GPL Ghostscript 8.628.62: : Unrecoverable error, exit code 1
Unrecoverable error, exit code 1
7.13. REMOTE
I think that 7.13.remote is some stray text that got outputted
from the file before pstotext exploded.
The output would have gone to standard output had it
worked.
Martin
* Re: Extracting ASCII text from a PDF Document
From: Kirk Reiser
To: Linux for blind general discussion
pdftotext is a different program, mine with the -v argument returns:
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
It also outputs to a file with the basename but containing a .txt
extension. I believe it is part of the xpdf utilities.
On Thu, 12 Aug 2010, Martin McCormick wrote:
> Kirk Reiser writes:
>> What happens when you run pdftotext on the file?
>
> $ pstotext BCD996XT_v1.04.00_Protocol.pdf
>
> < BCD996XT Operation Specification >
> 200
> GPL Ghostscript GPL Ghostscript 8.628.62: : Unrecoverable error, exit code 1
> Unrecoverable error, exit code 1
> 7.13. REMOTE
>
> I think that 7.13.remote is some stray text that got outputted
> from the file before pstotext exploded.
>
> The output would have gone to standard output had it
> worked.
>
> Martin
--
Kirk Reiser The Computer Braille Facility
e-mail: kirk@braille.uwo.ca University of Western Ontario
phone: (519) 661-3061
* Re: Extracting ASCII text from a PDF Document
From: Martin McCormick
To: Linux for blind general discussion
Kirk Reiser writes:
> pdftotext is a different program, mine with the -v argument returns:
>
> pdftotext version 3.02
> Copyright 1996-2007 Glyph & Cog, LLC
>
>
> It also outputs to a file with the basename but containing a .txt
> extension. I believe it is part of the xpdf utilities.
Thank you very much. I do have pdftotext and I probably need to
upgrade it as mine is 3.00 but it read the document just fine.
I got confused and thought pstotext was what I needed as
the man page says it will convert a PostScript or PDF document
to ASCII text.
Anyway, it looks like the problem is solved by calling
the right application.
Martin McCormick
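Since the mix-up in this thread was between a PDF tool and a PostScript tool, one way to guard against it in a script is to look at the first bytes of the file before choosing a converter. This is only a sketch: the names sniff_format and sniff_file are hypothetical, and it relies on the convention that PDF files start with "%PDF-" and PostScript files usually start with "%!PS".

```python
def sniff_format(header: bytes) -> str:
    """Classify leading bytes as 'pdf', 'postscript', or 'unknown'."""
    if header.startswith(b"%PDF-"):
        return "pdf"
    if header.startswith(b"%!PS"):
        return "postscript"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first few bytes of the file at path and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(8))
```

A result of "pdf" suggests reaching for pdftotext, while "postscript" suggests pstotext.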