public inbox for blinux-list@redhat.com
* Extracting ASCII text from a PDF Document
From: Martin McCormick
  To: Linux for blind general discussion

I have a PDF document that does have embedded ASCII text in it.
It reads fine on a Macintosh that has no OCR software on it and
uses only VoiceOver. VoiceOver works from the text itself, so the
text must be there.

	I need to use the file on a Debian system, so I hope I am
just using a2ps and pstotext wrong.

	If I run pstotext on this document, it immediately
errors out. If I run a2ps with -o outfilename.ps, a2ps
runs, but I may be producing an image file, as the output contains
no text from the document. Talk about sound and fury signifying
nothing.

	If one runs pstotext on that output file, one gets a
single form feed for each page and nothing else.

	The PDF document is not protected.

	Any suggestions as to how to extract the text are
welcome. Thanks.

Martin McCormick
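
The "single form feed for each page and nothing else" symptom described above can be tested for mechanically. The following is only a minimal sketch: has_text is a hypothetical helper name, not part of pstotext or a2ps, and it assumes the converter's output arrives on standard input.

```shell
#!/bin/sh
# has_text: hypothetical helper, not part of any tool named in this thread.
# Succeeds if standard input holds at least one visible character.
# Form feeds, newlines, tabs and spaces alone do not count as text.
has_text() {
    LC_ALL=C grep -q '[[:graph:]]'
}
```

Assuming the converter writes to standard output, it might be used as `pstotext outfilename.ps | has_text || echo "no text came through"`.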

* Re: Extracting ASCII text from a PDF Document
From: Martin McCormick
  To: Linux for blind general discussion

Kirk Reiser writes:
> What happens when you run pdftotext on the file?

$ pstotext  BCD996XT_v1.04.00_Protocol.pdf

< BCD996XT Operation Specification >
200
GPL Ghostscript GPL Ghostscript 8.628.62: : Unrecoverable error, exit code 1
Unrecoverable error, exit code 1
7.13. REMOTE

I think that "7.13. REMOTE" is some stray text that was output
from the file before pstotext exploded.

	The output would have gone to standard output had it
worked.

Martin

* Re: Extracting ASCII text from a PDF Document
From: Martin McCormick
  To: Linux for blind general discussion

Kirk Reiser writes:
> pdftotext is a different program, mine with the -v argument returns:
> 
> pdftotext version 3.02
> Copyright 1996-2007 Glyph & Cog, LLC
> 
> 
> It also outputs to a file with the basename but containing a .txt
> extension.  I believe it is part of the xpdf utilities.

Thank you very much. I do have pdftotext, and I probably need to
upgrade it as mine is 3.00, but it read the document just fine.

	I got confused and thought pstotext was what I needed, as
the man page says it will convert a PostScript or PDF document
to ASCII text.

	Anyway, it looks like the problem is solved by calling
the right application.

Martin McCormick
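
For anyone landing on this thread later, the fix can be sketched roughly as below. This is a hedged illustration rather than anything posted in the thread: default_txt_name and pdf_to_stdout are hypothetical helper names, the .txt naming follows Kirk's description, and passing "-" as the output file to send text to standard output is assumed to work in the installed pdftotext version.

```shell
#!/bin/sh
# default_txt_name: hypothetical helper. Prints the file name pdftotext is
# expected to write when no output file is given: the input name with a
# trailing .pdf replaced by .txt (or .txt appended if there is no .pdf).
default_txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# pdf_to_stdout: hypothetical wrapper. Sends the extracted text to standard
# output instead of a .txt file. Assumes the installed pdftotext accepts "-"
# as the output file name.
pdf_to_stdout() {
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found (the thread suggests it ships with the xpdf utilities)" >&2
        return 127
    fi
    pdftotext "$1" -
}
```

With these defined, `pdf_to_stdout BCD996XT_v1.04.00_Protocol.pdf | less` would page through the text, while a plain `pdftotext BCD996XT_v1.04.00_Protocol.pdf` should leave it in the file named by `default_txt_name`.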

