* Extracting ASCII text from a PDF Document
From: Martin McCormick
To: Linux for blind general discussion
I have a PDF document that does have embedded ASCII text in it.
It plays fine on a Macintosh that has no OCR software on it but
uses Voiceover. Voiceover just runs on ASCII so the ASCII is
there.
I need to use the file on a Debian system so I hope I am
just using a2ps and pstotext wrong.
If one uses pstotext on this document, it immediately
errors out. If I use a2ps and give it -o outfilename.ps, a2ps
runs but I may be producing an image file as there is no text
from the document, talk about sound and fury signifying nothing.
If one runs pstotext on that output file, one gets a
single form feed for each page and nothing else.
The PDF document is not protected.
Any suggestions as to how to extract the text are
welcome. Thanks.
Martin McCormick
* Re: Extracting ASCII text from a PDF Document
From: Kirk Reiser
To: Linux for blind general discussion
What happens when you run pdftotext on the file?
On Thu, 12 Aug 2010, Martin McCormick wrote:
> I have a PDF document that does have embedded ASCII text in it.
> It plays fine on a Macintosh that has no OCR software on it but
> uses Voiceover. Voiceover just runs on ASCII so the ASCII is
> there.
>
> I need to use the file on a Debian system so I hope I am
> just using a2ps and pstotext wrong.
>
> if one uses pstotext on this document, it immediately
> errors out. If I use a2ps and give it -o outfilename.ps, a2ps
> runs but I may be producing an image file as there is no text
> from the document, talk about sound and fury signifying nothing.
>
> If one runs pstotext on that output file, one gets a
> single form feed for each page and nothing else.
>
> The PDF document is not protected.
>
> Any suggestions as to how to extract the text are
> welcome. Thanks.
>
> Martin McCormick
--
Kirk Reiser The Computer Braille Facility
e-mail: kirk@braille.uwo.ca University of Western Ontario
phone: (519) 661-3061
* Re: Extracting ASCII text from a PDF Document
From: Chris Brannon
To: Linux for blind general discussion
Martin McCormick wrote:
> I have a PDF document that does have embedded ASCII text in it.
> I need to use the file on a Debian system so I hope I am
> just using a2ps and pstotext wrong.
Don't do that! Use pdftotext instead.
On my distribution, ArchLinux, pdftotext is provided by the "poppler"
package. I don't know which package you need for Debian.
Perhaps it's in xpdf.
One thing you'll notice when converting PDF to plain text is that certain
two-letter combinations are replaced with UTF-8-encoded Unicode characters.
Only the gods know why.
Common examples are fi, fl, and ff.
Of course, most screenreaders won't render those correctly.
-- Chris
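
The two-letter replacements described above are most likely the Unicode ligature code points (U+FB00 through U+FB04) that the PDF's font uses for ff, fi, fl, ffi, and ffl. Here is a minimal Python sketch for expanding them back to plain letters after running pdftotext. The function name expand_ligatures and the mapping table are illustrative assumptions, not part of any tool mentioned in this thread.

```python
# Latin ligature code points that PDF text extraction commonly emits,
# mapped to the plain letters a screen reader can speak.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts a dict of single characters to replacement strings.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with typographic ligatures replaced by plain letters."""
    return text.translate(_LIGATURE_TABLE)
```

Python's unicodedata.normalize("NFKC", text) would also expand these characters, but it rewrites many other characters as well, so the explicit table is the more conservative choice.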
* Re: Extracting ASCII text from a PDF Document
From: Geoff Shang
To: Linux for blind general discussion
On Thu, 12 Aug 2010, Martin McCormick wrote:
> I have a PDF document that does have embedded ASCII text in it.
I use pdftotext from the poppler-utils package.
Note that by default this writes to a file rather than sending the text to stdout.
Geoff.
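
Both the xpdf and poppler builds of pdftotext accept - as the output file name, which sends the text to standard output. With no output name at all, the text goes to a file named after the PDF with a .txt extension. The sketch below wraps that in Python, assuming pdftotext is installed and on the PATH. The helper names pdftotext_command and extract_text are hypothetical.

```python
import subprocess


def pdftotext_command(pdf_path, to_stdout=True):
    """Build the pdftotext argument list.

    A trailing "-" asks pdftotext to write to standard output;
    without it, pdftotext writes <basename>.txt next to the PDF.
    """
    cmd = ["pdftotext", pdf_path]
    if to_stdout:
        cmd.append("-")
    return cmd


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text as a string.

    Raises subprocess.CalledProcessError if pdftotext exits with an error.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        encoding="utf-8",  # pdftotext emits UTF-8 by default
        check=True,
    )
    return result.stdout
```

Capturing the text this way makes it easy to pipe into other processing before handing it to a speech synthesizer.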
* Re: Extracting ASCII text from a PDF Document
From: Lee Maschmeyer
To: Linux for blind general discussion
Speaking of smoothness, the -layout switch is often helpful. The table of
contents actually comes out looking like a table of contents!
I also ran pdfinfo for the first time just now. It's pretty interesting.
Next time I have a file I can't read I wonder if it'll help.
--
Lee Maschmeyer
Wayne State University Computing Center
5925 Woodward, #281
Detroit MI 48202
USA
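
As a rough illustration of the two tools mentioned above: -layout asks pdftotext to keep the physical layout of the page, and pdfinfo prints one "Key: value" line per property (page count, whether the file is encrypted, and so on). The Python sketch below assumes both programs are on the PATH. The helper names are hypothetical.

```python
import subprocess


def layout_command(pdf_path, txt_path):
    """Argument list for pdftotext with -layout (keeps column alignment)."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def parse_pdfinfo(output):
    """Turn pdfinfo's 'Key:   value' lines into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split at the first colon only, so dates containing colons survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdf_info(pdf_path):
    """Run pdfinfo on a file and return its properties as a dict."""
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return parse_pdfinfo(result.stdout)
```

For a file that will not read, the "Encrypted" and "Pages" entries are probably the first things worth checking.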
* Re: Extracting ASCII text from a PDF Document
From: Martin McCormick
To: Linux for blind general discussion
Kirk Reiser writes:
> pdftotext is a different program, mine with the -v argument returns:
>
> pdftotext version 3.02
> Copyright 1996-2007 Glyph & Cog, LLC
>
>
> It also outputs to a file with the basename but containing a .txt
> extension. I believe it is part of the xpdf utilities.
Thank you very much. I do have pdftotext and I probably need to
upgrade it as mine is 3.00 but it read the document just fine.
I got confused and thought pstotext was what I needed, as
the man page says it will convert a PostScript or PDF document
to ASCII text.
Anyway, it looks like the problem is solved by calling
the right application.
Martin McCormick
* Re: Extracting ASCII text from a PDF Document
From: Hart Larry
To: Linux for blind general discussion
Well, a majority of PDF files certainly never read well. I seem to get better
results with pdftohtml; if I have that name wrong, try pdf2html. Since I am not
at home, I cannot check. Anyway, the results are somewhat smoother.
Hart
* Re: Extracting ASCII text from a PDF Document
From: Kirk Reiser
To: Linux for blind general discussion
pdftotext is a different program, mine with the -v argument returns:
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
It also outputs to a file with the basename but containing a .txt
extension. I believe it is part of the xpdf utilities.
On Thu, 12 Aug 2010, Martin McCormick wrote:
> Kirk Reiser writes:
>> What happens when you run pdftotext on the file?
>
> $ pstotext BCD996XT_v1.04.00_Protocol.pdf
>
> < BCD996XT Operation Specification >
> 200
> GPL Ghostscript GPL Ghostscript 8.628.62: : Unrecoverable error, exit code 1
> Unrecoverable error, exit code 1
> 7.13. REMOTE
>
> I think that 7.13.remote is some stray text that got outputted
> from the file before pstotext exploded.
>
> The output would have gone to standard output had it
> worked.
>
> Martin
--
Kirk Reiser The Computer Braille Facility
e-mail: kirk@braille.uwo.ca University of Western Ontario
phone: (519) 661-3061
* Re: Extracting ASCII text from a PDF Document
From: Martin McCormick
To: Linux for blind general discussion
Kirk Reiser writes:
> What happens when you run pdftotext on the file?
$ pstotext BCD996XT_v1.04.00_Protocol.pdf
< BCD996XT Operation Specification >
200
GPL Ghostscript GPL Ghostscript 8.628.62: : Unrecoverable error, exit code 1
Unrecoverable error, exit code 1
7.13. REMOTE
I think that "7.13. REMOTE" is some stray text that got output
from the file before pstotext exploded.
The output would have gone to standard output had it
worked.
Martin