public inbox for speakup@linux-speakup.org
* 8-bit characters in output
From: Martin McCormick
  To: Speakup is a screen review system for Linux.

On many occasions, I hear output while reading text that I think
is probably 8-bit data because certain characters are spoken that
don't even exist in the text I am reading.

	I may be reading quoted text in an email or maybe
highlighted text in instructions and I hear the one-half symbol,
which speakup pronounces as "a half", plus the umlaut from
German text.

	Occasionally when printing output that can best be described
as garbage, such as accidentally catting a binary file, speakup
starts chanting "a half umlaut" or even "one fourth" followed by
umlauts or other sounds that turn out to be not words but
characters that trigger speakup to recite symbols for 1/4, etc.

	I once examined an email message that had
a half-umlaut on about every line and found that the other person's
email client placed a circumflex in quoted lines.

	At other times, words like the contraction of "I am" as
in I apostrophe M are read as IBM like the computer
manufacturer.

	Basically, I certainly understand why this is happening
but want to know if there is anything I can do at the speakup
level to properly process text so that it doesn't sound like
corrupted data.

	One thing I did for several years was to filter the
output of text such as email or just text files through a filter
that removed bit 7 if it was set.  This got rid of the
a half-umlaut chant but replaced it with occasional corruption
when an 8-bit character with bit 7 cleared equals a printable
ASCII character.
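
	The corruption described above can be seen in a small sketch.
This is not the filter that was actually used; strip_bit7 is a
hypothetical helper and the sample word is an assumption chosen for
illustration.  Clearing bit 7 turns each 8-bit byte into whatever
7-bit ASCII character shares its low seven bits:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: clear bit 7 of every byte in a string.
sub strip_bit7 {
    my ($bytes) = @_;
    return join '', map { chr( ord($_) & 0x7F ) } split //, $bytes;
}

# "cafe" with an e-acute, as Latin-1 bytes: the accented letter is 0xE9.
my $latin1 = "caf\xE9";

# The same word as UTF-8 bytes: the accented letter is 0xC3 0xA9.
my $utf8 = "caf\xC3\xA9";

print strip_bit7($latin1), "\n";    # 0xE9 & 0x7F = 0x69, prints "cafi"
print strip_bit7($utf8),   "\n";    # 0x43 and 0x29, prints "cafC)"
```

So the chant goes away, but a stray printable character is left in
its place.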

	This is more of an annoyance than a show stopper so is
there a translation table or a filter that can be made to fix
this issue?
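
	One possible shape for such a filter, offered only as a
sketch: leave text alone when it is already valid UTF-8, and
otherwise assume the bytes are ISO-8859-1 and re-encode them.  The
ISO-8859-1 guess and the helper name to_utf8 are assumptions, not
anything speakup provides.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return $bytes unchanged when it is already valid
# UTF-8; otherwise assume ISO-8859-1 (an assumption) and re-encode it.
sub to_utf8 {
    my ($bytes) = @_;
    my $valid = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps decode() from modifying $bytes.
        Encode::decode( 'UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
        1;
    };
    return $bytes if $valid;
    return Encode::encode( 'UTF-8', Encode::decode( 'ISO-8859-1', $bytes ) );
}

# Show the resulting bytes in hex so the output is unambiguous.
for my $sample ( "caf\xE9", "caf\xC3\xA9", "plain" ) {
    print join( ' ', map { sprintf '%02X', ord($_) } split //, to_utf8($sample) ), "\n";
}
```

To use it as a filter, one could read standard input line by line
and print to_utf8 of each line.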

	Speakup is a fabulous system so I'm not griping at all.

	Thanks for either instructions as to what to do or a link
to such instructions.

Martin McCormick

* Re: 8-bit characters in output
From: Samuel Thibault
  To: Speakup is a screen review system for Linux.

Hello,

Martin McCormick, on Mon, 14 Dec 2020 12:25:08 -0600, wrote:
> On many occasions, I hear output while reading text that I think
> is probably 8-bit data because certain characters are spoken that
> don't even exist in the text I am reading.

Which speech synthesis are you using?

Samuel

* Re: 8-bit characters in output
From: Martin McCormick
  To: Speakup is a screen review system for Linux.

Samuel Thibault <samuel.thibault@ens-lyon.org> writes:
> Hello,
> Which speech synthesis are you using?

	A very good question but I am not sure of a good answer.

	I am using the speakup that is installed when Buster is
installed.  It is the software speech one hears if installing
Debian from a live CD and I must add that it is fabulous so I am
not complaining, just figuring out how best to make sure there
are no artifacts produced by characters that may be there because
of formatting information.  It has been difficult to figure out
exactly what always triggers this effect, but I may write a Perl
script to generate 8-bit output to see if I can figure out what
is causing it.  I can listen to lots of documentation and never
hear it, but many email quotes are loaded with it on each line of
the quote if the quote marker is more than the usual > symbol.

	I hope this helps in describing the issue.

Martin McCormick

* Re: 8-bit characters in output
From: Samuel Thibault
  To: Speakup is a screen review system for Linux.

Martin McCormick, on Wed, 16 Dec 2020 17:02:12 -0600, wrote:
> Samuel Thibault <samuel.thibault@ens-lyon.org> writes:
> > Which speech synthesis are you using?
> 
> It is the software speech one hears if installing debian from a live
> CD

Ok :)

> It has been difficult to figure out exactly what always triggers this
> effect but I may write a perl script to generate 8-bit output to see
> if I can figure out what is causing it

That could be useful indeed. Once we have an easy reproducer, it's
usually very easy to fix the bug :)

Samuel

* Re: 8-bit characters in output
From: Martin McCormick
  To: Speakup is a screen review system for Linux.

Samuel Thibault <samuel.thibault@ens-lyon.org> writes:
> That could be useful indeed. Once we have an easy reproducer, it's
> usually very easy to fix the bug :)
> 
> Samuel

Ask and ye shall receive.  It turned out to be far easier to
duplicate the issue than I ever dreamed.  Here is the Perl
program I just finished, which shows that all characters with bit
7 set trigger the same sounds.  You may have to run the lines
through perltidy if the mailing process mangles them.  Code
starts here and is 17 lines long.  The 1-second sleep slows
things down a bit so you can follow the output more easily.


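#!/usr/bin/perl -w
use strict;

# Print each value from 120 to 255 followed by the raw byte it encodes.
# Values 128 and above have bit 7 set and are not valid UTF-8 by themselves.
sub charmaker {
    for my $char ( 120 .. 255 ) {
        printf( "%d %c\n", $char, $char );
        sleep 1;    # pause so each line can be heard separately
    }
    return;
}

print "First the decimal value then the character itself\n";
charmaker();

exit(0);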

* Re: 8-bit characters in output
From: Samuel Thibault
  To: Speakup is a screen review system for Linux.

Martin McCormick, on Wed, 16 Dec 2020 20:54:56 -0600, wrote:
> all characters with bit 7 set trigger the same sounds.

Ok, so what happens is that this is invalid UTF-8, which the kernel
turns into U+FFFD characters, which speakup properly passes on to
espeakup, which gives it to espeak-ng, where it gets completely
misinterpreted. I have submitted

https://github.com/espeak-ng/espeak-ng/issues/859
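
The first step of that chain can be illustrated with a rough sketch.
This is not the kernel's code: Perl's Encode module stands in for the
console's UTF-8 decoder, and codepoints is a hypothetical helper. A
lone byte with bit 7 set is not a complete UTF-8 sequence, so a
lenient decoder substitutes U+FFFD for it:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode a byte string as UTF-8 and list the
# resulting code points.  With Encode's default CHECK value, malformed
# input is replaced by U+FFFD instead of raising an error.
sub codepoints {
    my ($bytes) = @_;
    my $text = Encode::decode( 'UTF-8', $bytes );
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
}

for my $byte ( 0x41, 0x80, 0xE9, 0xFF ) {
    printf "byte 0x%02X -> %s\n", $byte, codepoints( chr($byte) );
}

# A complete two-byte sequence, by contrast, decodes to a real character.
printf "bytes 0xC3 0xA9 -> %s\n", codepoints("\xC3\xA9");
```

Only 0x41 and the two-byte sequence come out as real characters
(U+0041 and U+00E9); each of the other bytes becomes U+FFFD.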

Thanks for the report,
Samuel

* Re: 8-bit characters in output
From: Martin McCormick
  To: Speakup is a screen review system for Linux.

Samuel Thibault <samuel.thibault@ens-lyon.org> writes:
> Ok, so what happens is that this is invalid UTF-8, which the kernel
> turns into U+FFFD characters, which speakup properly passes on to
> espeakup, which gives it to espeak-ng, where it gets completely
> misinterpreted. I have submitted
> 
> https://github.com/espeak-ng/espeak-ng/issues/859
> 
> Thanks for the report,
> Samuel

	You are quite welcome.  When I was taking electronics
courses in college, we had to submit lab reports on the
experiments we were assigned and one of the things we were
required to do was to write down the serial numbers and other
identifying information about the test equipment we used that day
to make our measurements.

	At the time, this seemed like extra work until the lab
instructor explained that sometimes equipment could be
malfunctioning in subtle ways that would influence our results,
such as a signal generator which was supposed to give the same
voltage output over its frequency range but didn't, etc.

	That made perfect sense.  Imagine being handed a meter
stick that was warped badly so was no longer 1 meter in length.
The list of issues could go on forever so I made sure that the
required equipment information was always there.

	In that spirit, I ran the "env" command in my Linux bash
shell, which runs in a text-based terminal such as /dev/tty0 or
tty1.  I can get you the entire output but the following
variables represent factors that might influence the output.
Here they are:

TERMCAP=SC|linux|VT 100/ANSI X3.64 virtual terminal
LANG=en_US.UTF-8
TERM=linux

	
	The LC_TIME variable probably doesn't affect anything but
the formatting of time stamps.
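
	Of the locale variables, the one that decides which character
encoding programs assume is the character-type locale.  As a sketch
(effective_ctype is a hypothetical helper, and this says nothing about
the console's own UTF-8 mode, which is set separately), the usual
POSIX precedence is LC_ALL, then LC_CTYPE, then LANG:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the locale that governs character
# handling.  LC_ALL overrides LC_CTYPE, which overrides LANG.
sub effective_ctype {
    my (%env) = @_;
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env{$name};
        return $value if defined $value && length $value;
    }
    return 'C';    # POSIX default when none of the three is set
}

my $locale = effective_ctype(%ENV);
printf "character locale: %s\n", $locale;
printf "expects UTF-8: %s\n", ( $locale =~ /utf-?8/i ? 'yes' : 'no' );
```

With only LANG=en_US.UTF-8 set, as above, this would report
en_US.UTF-8, so programs expect UTF-8 and a lone 8-bit byte is
invalid input.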

	If I look back to old email configuration files from
several years ago, I see I was trying to filter 8-bit characters,
so this is nothing new.  I am presently using Buster; however, I
have been using Debian Linux for about 20 years and what is now
speakup since about 2004, and it is truly a great screen reader.

Martin McCormick
