public inbox for blinux-list@redhat.com
* Re: Protocol for speaking daemon (comments expected)
From: David Sugar
  To: blinux-list


-----Original Message-----
From: Jan Hubicka <hubicka@atrey.karlin.mff.cuni.cz>
To: blinux-list@redhat.com <blinux-list@redhat.com>
Date: Monday, January 05, 1998 3:30 AM
Subject: Re: Protocol for speaking daemon (comments expected)


>> Say commands should also have some kind of priority indicator.  For
>> example, status messages should be given a higher priority than just
>> regular text.  E.g., while reading a document, status messages should be
>> spoken as soon as they are available (perhaps using a different
>> 'personality' -- see below).
>Yes, priority is quite significant, I think. IMO something like the
>following priorities should be defined:
>urgent
>normal
>low
>optional (say it just in case nothing else is being spoken...)

Realistically, there is not likely to be that much application demand on the
TTS.  I would argue three priorities are enough for virtually all situations:
high, base (default), and low.  Hence, you can place yourself above or below
the 'default' priority that most applications would use.

>> Similar to the query command (or perhaps part of it), you should be able
>> to not only learn the status of a message, but determine the currently
>> speaking message, and, if possible, obtain more definitive information
>> with regard to your exact position in the message.  This would make
>> things similar to indexing on the Dec-Talk possible.
>Well, determining position in a message could be quite complex on
>non-Dec-Talk devices (like software synthesizers...), so it should be made
>optional or something...

In the implementation I am doing, the TTS engine passes a message back to
the server as it parses each 'chunk' of the posted 'say'.  Hence, if the TTS
is built to speak 'words' discretely, it would send a marker back at the
end of each spoken word.  Similarly, a phrase- or sentence-based system would
report its current position for each phrase or sentence that has been spoken.
Indexing, then, is really an incremental measure of the TTS's parsing of the
current text being spoken; it is not used to absolutely index each and
every character byte discretely processed within the text by the TTS
engine.

An interrupt request is submitted to the TTS engine by the server when one
wants to 'interrupt' the currently spoken text, and the TTS will then return
the 'final' index of the last spoken 'component' of the current message.
This marker is chosen by the TTS as the place to break up the text, so one can
assume incomplete phrases, or at least incomplete spoken words, can be kept
from being split.  The current message is then returned to the queue with the
'index' pointing to where to continue, as chosen by the TTS engine.
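
A minimal sketch of the index bookkeeping described above, written in Python;
the class and function names are illustrative assumptions, not part of any
agreed protocol:

class Message:
    def __init__(self, msgid, chunks):
        self.msgid = msgid
        self.chunks = chunks   # text already split by the TTS into words/phrases
        self.index = -1        # no chunk has been reported as spoken yet

def on_marker(message, chunk_index):
    """Called each time the TTS engine reports it has finished a chunk."""
    message.index = chunk_index

def interrupt(message, queue):
    """Stop the current message and requeue whatever has not been spoken."""
    remaining = message.chunks[message.index + 1:]
    if remaining:
        queue.append(Message(message.msgid, remaining))  # continue later from here
    return message.index       # the 'final' index reported back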





* Re: Protocol for speaking daemon (comments expected)
From: Jan Hubicka
  To: blinux-list

> >Yes, priority is quite significant, I think. IMO something like the
> >following priorities should be defined:
> >urgent
> >normal
> >low
> >optional (say it just in case nothing else is being spoken...)
> 
> Realistically, there is not likely to be that much application demand on the
> TTS.  I would argue three priorities are enough for virtually all situations:
> high, base (default), and low.  Hence, you can place yourself above or below
> the 'default' priority that most applications would use.
I think the fourth one (optional) would also be useful.  It could be used,
for example, for keyboard echo, information about work in progress, etc., so
that such messages are sent to the output only when no other output is
active...
> 
> >> Similar to the query command (or perhaps part of it), you should be able
> >> to not only learn the status of a message, but determine the currently
> >> speaking message, and, if possible, obtain more definitive information
> >> with regard to your exact position in the message.  This would make
> >> things similar to indexing on the Dec-Talk possible.
> >Well, determining position in a message could be quite complex on
> >non-Dec-Talk devices (like software synthesizers...), so it should be made
> >optional or something...
> 
> In the implementation I am doing, the TTS engine passes a message back to
> the server as it parses each 'chunk' of the posted 'say'.  Hence, if the TTS
> is built to speak 'words' discretely, it would send a marker back at the
> end of each spoken word.  Similarly, a phrase- or sentence-based system would
I think this could cause problems with software synthesizers, which want
to know the whole sentence in order to produce better output (prosody etc...).
Sending output as separate words could cause problems here...
> report its current position for each phrase or sentence that has been spoken.
> Indexing, then, is really an incremental measure of the TTS's parsing of the
> current text being spoken; it is not used to absolutely index each and
> every character byte discretely processed within the text by the TTS
> engine.
> 
> An interrupt request is submitted to the TTS engine by the server when one
> wants to 'interrupt' the currently spoken text, and the TTS will then return
> the 'final' index of the last spoken 'component' of the current message.
> This marker is chosen by the TTS as the place to break up the text, so one can
> assume incomplete phrases, or at least incomplete spoken words, can be kept
> from being split.  The current message is then returned to the queue with the
> 'index' pointing to where to continue, as chosen by the TTS engine.
> 
> 
> 

-- 



* Re: Protocol for speaking daemon (comments expected)
From: David Sugar
  To: blinux-list


>> In the implementation I am doing, the TTS engine passes a message back to
>> the server as it parses each 'chunk' of the posted 'say'.  Hence, if the
>> TTS is built to speak 'words' discretely, it would send a marker back at
>> the end of each spoken word.  Similarly, a phrase- or sentence-based
>> system would
>I think this could cause problems with software synthesizers, which want
>to know the whole sentence in order to produce better output (prosody etc...).
>Sending output as separate words could cause problems here...

The SPO is word oriented, but a 'chunk' can also certainly be a whole
sentence.  It depends on what makes sense for the underlying TTS engine,
which is why I do not define 'index' as representing, or being used to set,
every absolute binary offset; only those offsets that represent each
discrete chunk of text the TTS processes as a single 'phrase'.  If the TTS
operates on whole sentences, then the index will increment to the start of
each sentence.





* Re: Protocol for speaking daemon (comments expected)
From: Jan Hubicka
  To: blinux-list

> Say commands should also have some kind of priority indicator.  For
> example, status messages should be given a higher priority than just
> regular text.  E.g., while reading a document, status messages should be
> spoken as soon as they are available (perhaps using a different
> 'personality' -- see below).
Yes, priority is quite significant, I think. IMO something like the
following priorities should be defined:
urgent
normal
low
optional (say it just in case nothing else is being spoken...)
> Similar to the query command (or perhaps part of it), you should be able to
> not only learn the status of a message, but determine the currently
> speaking message, and, if possible, obtain more definitive information with
> regard to your exact position in the message.  This would make things
> similar to indexing on the Dec-Talk possible.
Well, determining position in a message could be quite complex on
non-Dec-Talk devices (like software synthesizers...), so it should be made
optional or something...
> 
> Finally, it may be useful to put all of the synthesizer definitions in a
> 'synthcap' or similar file.  This would provide something analogous to the
> SSIL standard system under Windows.
Very interesting idea :)
Could you please send me some docs on the SSIL standard?

Honza
> 
> Bryan
> 
> 
> 
> --
> Bryan R. Smart
> Email: bsmart@pobox.com
> 
> 

-- 



* Re: Protocol for speaking daemon (comments expected)
From: Luke Davis
  To: David Sugar; +Cc: blinux-list

Yes: the interrupt stuff was overkill, a bit more complex than was
necessary.  I was, for some reason, thinking of a different queueing model
than you were; I see what you want to do now, and much of that is
unnecessary and won't work.




* Re: Protocol for speaking daemon (comments expected)
From: David Sugar
  To: blinux-list


-----Original Message-----
From: Bryan Smart <bsmart@pobox.com>

>>To synchronize speech, a 'wait' command can be used which returns a message
>>when the specified message 'id' has been 'spoken'.  Hence, a synchronized
>
>Also, the wait command should have some kind of timeout parameter which
>can notify the client if the server is unable to catch up to its message in
>the queue before the specified time passes.  Of course, the query command
>that you mentioned could check this, but it might be nice to let the say
>command return a message specifying that the command has timed out.


True, the wait should time out and have a failure-response option if that
happens.

>Say commands should also have some kind of priority indicator.  For
>example, status messages should be given a higher priority than just
>regular text.  E.g., while reading a document, status messages should be
>spoken as soon as they are available (perhaps using a different
>'personality' -- see below).


The best suggestion I have heard for this is to build multiple 'priority'
queues feeding into the TTS engine.  The say command can then have something
like "say pri=x" to select a specific queue.  Personality can be achieved by
specifying other overrides in the say command as well, such as for voice
(male/female).
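
A minimal sketch of such priority queues in Python; the 0/1/2 numbering and
the function names are assumptions for illustration only:

import heapq, itertools

_counter = itertools.count()
_queue = []                        # entries are (priority, arrival order, text)

def post_say(text, pri=1):         # 0 = high, 1 = base (default), 2 = low
    heapq.heappush(_queue, (pri, next(_counter), text))

def next_utterance():
    # Highest priority (lowest number) first; FIFO within the same priority.
    return heapq.heappop(_queue)[2] if _queue else None

post_say("reading the document body...", pri=2)
post_say("battery low", pri=0)     # a status message jumps ahead of regular text
print(next_utterance())            # -> battery low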

>>set commands can be introduced to specify global parameters for a session,
>>such as: set language = English.
>
>I think that the pitch, volume, and other similar settings should be
>controlled at this point.  Also, as all synths have differing ranges of
>values for each parameter, percentages might be useful.


Yes.  I agree.

>>Another obvious command would be query <msgid> to query the current status
>>of a message.
>
>Similar to the query command (or perhaps part of it), you should be able to
>not only learn the status of a message, but determine the currently
>speaking message, and, if possible, obtain more definitive information with
>regard to your exact position in the message.  This would make things
>similar to indexing on the Dec-Talk possible.

Assuming the TTS 'maps' index status into the control record, yes, this
could be achieved.  One should also be able to specify a 'pause' to pause the
currently spoken text, perhaps either at a specific index point or at the
next 'natural' break as determined by the TTS itself.  The latter would be
far more useful.

>>The "say" command can have options, such as to specify mode (literal,
spell,
>>or speak), pitch and speed (assuming the tts supports it), etc.
>
>These should be realative to the current settings (allowing a slightly
>lower or higher pitch than the default, using a slightly louder or softer
>volume, and so forth).


Or perhaps, either offset from the current defaults (+/- %) or absolute.
Hence, "say volume=-10" vs. "say volume=20".  But yes, relative (scaled) %
values probably makes the most sense.
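
A small sketch of how a server might interpret a relative versus an absolute
volume option; the option syntax is assumed, not specified anywhere:

def resolve_volume(option_value, session_default):
    # "volume=-10" or "volume=+10" -> offset from the session default,
    # "volume=20"                  -> absolute percentage.
    if option_value.startswith(("+", "-")):
        level = session_default + int(option_value)
    else:
        level = int(option_value)
    return max(0, min(100, level))      # clamp to 0..100 %

print(resolve_volume("-10", 70))        # 60
print(resolve_volume("20", 70))         # 20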

>I'm not sure of the best way to do this, but there should be some way that
>the 'say' command can specify different personalities when speaking text.
>These, of course, would probably be different voices on the Dec-Talk, but
>you should be able to provide pitch, speed, volume, voice, and voice
>control (several synths support different voices) which have values
>relative to those defined by the set commands.
>
>Finally, it may be useful to put all of the synthesizer definitions in a
>'synthcap' or similar file.  This would provide something analogous to the
>SSIL standard system under Windows.
>
>Bryan


Any given server serves one 'engine'; hence, a command to return the
capabilities of the given engine would probably accomplish this.  Another
thought might be to add a 'UDP' option to the server, so one could, for
example, broadcast a UDP 'request' to all TTS servers on a given subnet
looking specifically for a 'German' speaker, and then connect to the first
responding TCP-SPEAK server capable of serving spoken German.
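
A rough sketch of such a UDP discovery step in Python; the port number, the
request string, and the reply format are all invented for illustration:

import socket

DISCOVERY_PORT = 7777          # assumed port, not defined by any spec

def find_speaker(language="german", timeout=2.0):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.settimeout(timeout)
    s.sendto(("SPEAK? lang=%s" % language).encode(),
             ("<broadcast>", DISCOVERY_PORT))
    try:
        reply, (host, _) = s.recvfrom(256)     # e.g. b"SPEAK! port=7778"
    except socket.timeout:
        return None
    tcp_port = int(reply.decode().split("port=")[1])
    return socket.create_connection((host, tcp_port))   # first responder wins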




* Re: Protocol for speaking daemon (comments expected)
From: David Sugar
  To: Luke Davis; +Cc: blinux-list


-----Original Message-----
From: Luke Davis <ldavis@voicenet.com>


>Suggestion:
>
>On say: set up priorities such as:
>
>p=0
>p=1
>p=2
>p=3
>p=4
>p=5
>etc. (probably not that many)


Agreed.  I had not gone into detail on the protocol spec, because I wanted
to float the overall concept first.  When I am back from Texas next week, I
plan to write a more formal and complete 'draft' specification for
TCP/SPEAK, and will add these suggestions (priority queue selection,
interrupt options, etc.) to it.

>0 = Say it now; interrupt (clearing) next text.
>1 = Same, but speak rest of text when message done
>2 = Say after current.
>3 = Put on queue.
>
>Default would be 3.


Option '1' can be achieved simply by selecting a higher-priority queue,
and option '2' is really more of a 'put me at the front of the priority queue
I selected' kind of operation.  Hence, I would say queue=first or queue=last
effectively covers options 2 and 3.  Interrupting in the middle may require
some thought on how the 'server' interacts with the backend TTS, which I
haven't thought about yet :).  But barge-in can be interesting...

>Suggestion:
>
>Create an interrupt yes|no flag.:
>
>i=0
>i=1
>i=2
>...
>
>0 = This message cannot be interrupted by anything.
> (Could be used for shutdown warnings, but could be abused)
>1 = Can only be interrupted by messages with p=0 or p=1.
>2 = Can only be interrupted by messages with a higher priority
> than the current message.
>3 = Don't interrupt anything.


This may be more complex than needed.  Interrupt requests should only
interrupt a message of the same priority or lower.  Hence, if one wants to
make a message interruptable only by pri 1 or 0 messages, one would select
the priority 1 queue for it :).  The option can then simply be
'interruptable=yes/no', used to indicate whether the current
message is interruptable or not.
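
A tiny sketch of that interruption rule, assuming numeric priorities where a
lower number means higher priority; the names are illustrative only:

def may_interrupt(request_priority, current_priority, current_interruptable=True):
    # A request may cut in only if the current message allows interruption
    # and is not of higher priority than the request.
    return current_interruptable and request_priority <= current_priority

print(may_interrupt(0, 1))           # True: urgent request vs. normal message
print(may_interrupt(2, 1))           # False: low request cannot cut normal text
print(may_interrupt(0, 1, False))    # False: message flagged interruptable=no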

>Default would be 3.
>
>If this is done, then also do a cap I interrupt flag.:
>
>I=0
>I=1


>0 = This message can/should not interrupt anything.
>1 = This message should interrupt if it can.
> (in which case the lowercase i flag takes effect)
>
>The default will be 0.


Hmmm...

>Suggestion:
>
>A "textback" command.:
>
>Client                              Server
>
>say "this is a test"
>                                     114568272
>
>textback 114568272
>                                     This is a test
>
>Hence an application can determine something from history without having
>it spoken.

But it probably should also include information on how it was 'spoken'
(pitch, voice set, etc.), which could be returned separately by the 'query'
command.  I already presume each TTS 'message' spooled out includes a 'text'
(body) and a 'control' record, which would hold the current 'default' values
of the server at the time SAY was issued (or the values of any
single-utterance overrides specified in the say command).
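
A small sketch of what such a queued message record might look like, with the
'control' record snapshotting the session defaults plus any per-say
overrides; all field names here are assumptions:

DEFAULTS = {"voice": "male", "pitch": 50, "volume": 70, "speed": 50}

def make_message(msgid, text, session_settings, **overrides):
    control = dict(session_settings)   # snapshot of the defaults at 'say' time
    control.update(overrides)          # single-utterance overrides, if any
    return {"id": msgid, "text": text, "control": control}

msg = make_message(127, "the system will go down in 10 seconds",
                   DEFAULTS, volume=90)
print(msg["control"]["volume"])        # 90 - what 'query'/'textback' could report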

>
>Reminder:
>
>I'd advise against using a dot on a line by itself to end, due to
>conflicts when reading text.
>For example: a sentence runs over, but only the period ends up on the next
>line.
>There is more to say, but since the next phrase does not have a say
>command, the server will generate errors.
>
>So how about something like either EOF (ctrl-d), or ";\." (or something
>obscure like that).


This is a common problem in nntp also :).  The solution is either to make
sure the application never sends a bare line containing just a '.', which
nntp post programs (and older mailers) have traditionally handled, or, yes,
to choose a different end-of-text indicator.  My thought with the '.' was
simply to be consistent with other existing protocols.
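
A sketch of the usual nntp-style work-around ('dot stuffing') that a client
library could apply so a lone '.' inside the text never ends the body early;
this is an assumed client-side helper, not part of the proposal itself:

def send_say_body(sock, text):
    for line in text.splitlines():
        if line.startswith("."):
            line = "." + line           # escape a leading dot ("." -> "..")
        sock.sendall((line + "\r\n").encode())
    sock.sendall(b".\r\n")              # the real end-of-text marker

The server would reverse it: a line of just "." ends the body, and any other
line beginning with ".." has the first dot stripped before being spoken.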

>Or:
>
>say [options]
>Help me linux guru, you're my only hope . . .
>\\say
>
>Or standard say:
>
>say [options] "text"


That would be an alternate option, though it leads to potentially 'long'
lines :).

>||
>
>say [options] <<end_of_text
>'Iceberg, dead ahead!'
>and:
>'Sorry I didn't build ya a better ship, Miss Rose.'
>are the phrases you'll hear in "Titanic": the most expensive (but very, very
>much worth it) movie of 1997/98!
>end_of_text
>
>Just some thoughts...





* Re: Protocol for speaking daemon (comments expected)
From: Bryan Smart
  To: blinux-list

>To synchronize speech, a 'wait' command can be used which returns a message
>when the specified message 'id' has been 'spoken'.  Hence, a synchronized

Also, the wait command should have some kind of timeout parameter which
can notify the client if the server is unable to catch up to its message in
the queue before the specified time passes.  Of course, the query command
that you mentioned could check this, but it might be nice to let the say
command return a message specifying that the command has timed out.

Say commands should also have some kind of priority indicator.  For
example, status messages should be given a higher priority than just
regular text.  E.g., while reading a document, status messages should be
spoken as soon as they are available (perhaps using a different
'personality' -- see below).

>set commands can be introduced to specify global parameters for a session,
>such as: set language = English.

I think that the pitch, volume, and other similar settings should be
controlled at this point.  Also, as all synths have differing ranges of
values for each parameter, percentages might be useful.

>Another obvious command would be query <msgid> to query the current status
>of a message.

Similar to the query command (or perhaps part of it), you should be able to
not only learn the status of a message, but determine the currently
speaking message, and, if possible, obtain more definitive information with
regard to your exact position in the message.  This would make things
similar to indexing on the Dec-Talk possible.

>The "say" command can have options, such as to specify mode (literal, spell,
>or speak), pitch and speed (assuming the tts supports it), etc.

These should be relative to the current settings (allowing a slightly
lower or higher pitch than the default, using a slightly louder or softer
volume, and so forth).

I'm not sure of the best way to do this, but there should be some way that
the 'say' command can specify different personalities when speaking text.
These, of course, would probably be different voices on the Dec-Talk, but
you should be able to provide pitch, speed, volume, voice, and voice
control (several synths support different voices) which have values
relative to those defined by the set commands.

Finally, it may be useful to put all of the synthesizer definitions in a
'synthcap' or similar file.  This would provide something analogous to the
SSIL standard system under Windows.

Bryan



--
Bryan R. Smart
Email: bsmart@pobox.com




* Re: Protocol for speaking daemon (comments expected)
From: David Sugar
  To: blinux-list; +Cc: hubicka

Having given consideration to the different ideas for 'speak', and the
desire for a queue-based server with history capability, I have come up with
a protocol specification to use in the next release of the speak server
(speak/98? :), borrowing from nntp of all places.

As I envision the speak protocol, the client creates a tcp/ip connection, as
before, and then issues commands.  The 'say' command, like an nntp post,
would be followed by any number of lines of free-form 'text' to be spoken,
and an 'end of text' marker (a line containing just a '.', as in an nntp
'post').  The server will then reply with a server-generated 'message id'
number for the speech event.

To synchronize speech, a 'wait' command can be used which returns a message
when the specified message 'id' has been 'spoken'.  Hence, a synchronized
session might look something like this:

client                                                            server

say [options...]
the system will go down in 10 seconds
.
                                              END-00127    (message queue id)
wait 00127
                                              END-00127    (when spoken)
(sleep 5 seconds after wait syncs us with tts output)
say [options]
the system will go down in 5 seconds
.
                                              END-00128

Presumably, one can have a 'repeat <msgid>' to recall a spoken phrase from
the history log, a 'cancel <msgid>' to stop playing or remove an unplayed
message from the queue, etc.

set commands can be introduced to specify global parameters for a session,
such as: set language = English.

Another obvious command would be query <msgid> to query the current status
of a message.

This approach would seem to provide most of the functionality I have heard
requested for the server, and would be more consistent with the
implementation of other internet protocols (such as nntp, pop, ftp, http,
etc. :).

The "say" command can have options, such as to specify mode (literal, spell,
or speak), pitch and speed (assuming the tts supports it), etc.
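
A rough sketch of a synchronized client session as described above, in
Python; the command spellings and the END-<id> reply format follow the
example transcript, while the port number and exact reply details are
assumptions:

import socket, time

def _read_line(sock):
    data = b""
    while not data.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:
            break
        data += chunk
    return data.decode().strip()

def speak_and_wait(host, text, options="", port=7778):
    s = socket.create_connection((host, port))
    s.sendall(("say %s\r\n%s\r\n.\r\n" % (options, text)).encode())
    msgid = _read_line(s).split("-", 1)[1]   # "END-00127" -> "00127"
    s.sendall(("wait %s\r\n" % msgid).encode())
    _read_line(s)                            # blocks until END-<msgid> returns
    s.close()
    return msgid

speak_and_wait("localhost", "the system will go down in 10 seconds")
time.sleep(5)                                # wait already synced us with the TTS
speak_and_wait("localhost", "the system will go down in 5 seconds")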





* Re: Protocol for speaking daemon (comments expected)
From: Jan Hubicka
  To: blinux-list

> 
> Hi
> 
> Just a few suggestions for the speaking daemon. 
> 
> I think the user should be in control of what is being spoken, i.e. they
> could have more than one application open at a time and choose which one
> is active. The way I was thinking of implementing this is something like
> X. The daemon or server process would handle the speaking user interface
> including the keyboard.
We are discussing keyboard control with Mr. Hanika (he thinks that the
server should be done using pseudoterminals).  I think the server should
not control the keyboard: since it may run on a remote machine, a different
OS, etc., that would be hard to implement.  But I think each terminal should
be controlled by something like a screen reader, which will be a client of
the server (the same as other speaking apps).

I am not sure how such application switching should be implemented. I think
each application should have a way to "make it silent" that would do this.
> 
> Then client programs could connect to the server and create a window.  A
> client could be any program which had direct support for the server or a
> terminal (like XTerm) to use other programs. 
Possibly a screen reader could be something like XTerm...

The main difference between X and this situation is that X does not control
which application is visible, but which application gets input. This should be
handled in the normal way, as is usually done in UNIX (using multiple
consoles etc...).
But you are right that it should be possible to disable speaking for some app.
That would be quite a nice feature.
Do you have any idea how to implement this?
One way is for everything to be controlled by the screen-reader part,
with all applications running on that terminal communicating through it.
Another would be for the screen reader to have the ability to disable
applications from its terminal using some kind of class (each application
from that terminal would log in to one class, and the screen reader could
then say "disable all applications in my class").

> 
> This would eliminate the need for message priorities since the user would
> select the active 'window' and could also select to be informed of
> activity in other 'windows'.
You think of priorities as something necessary just for handling multiple
applications. I don't think so (BTW, my idea for the protocol doesn't have
priorities). I think this message scheduler is useful for a single application
too. It can use it to handle output in a more clever way than normal terminal
applications do. It can remove messages, which is useful when you
want to remove a question the user has already answered, when a screen reader
wants to remove messages for text that is no longer on the screen, etc.

But I think that for multiple connections you really need a way to let more
than one application speak at once.
> 
> I am willing to help with this project, but I might not have much time due
> to studying.
Great! I hope we will start coding this week or next :)
I will really appreciate your help.

Honza
> 
> 
> Regards
> 
> 
> Robbie Murray
> 

-- 



* Re: Protocol for speaking daemon (comments expected)
From: T. V. Raman
  To: blinux-list

I get the sense that you have neither read the papers
describing Emacspeak nor used the software.

Many of the comments you make regarding Emacspeak and this
hypothetical "server" simply don't make sense.

Please do not refer to Emacspeak and the associated work
except when you have actually taken the time to understand it.

-- 
Best Regards,
--raman

      Adobe Systems                 Tel: 1 (408) 536 3945   (W14-129)
      Advanced Technology Group     Fax: 1 (408) 537 4042 
      (W14 129) 345 Park Avenue     Email: raman@adobe.com 
      San Jose , CA 95110 -2704     Email:  raman@cs.cornell.edu
      http://labrador.corp.adobe.com/~raman/        (Adobe Intranet)
      http://cs.cornell.edu/home/raman/raman.html    (Cornell)
----------------------------------------------------------------------
    Disclaimer: The opinions expressed are my own and in no way should be taken
as representative of my employer, Adobe Systems Inc.
____________________________________________________________



* Re: Protocol for speaking daemon (comments expected)
From: Robbie Murray
  To: blinux-list


Hi

Just a few suggestions for the speaking daemon. 

I think the user should be in control of what is being spoken, i.e. they
could have more than one application open at a time and choose which one
is active. The way I was thinking of implementing this is something like
X. The daemon or server process would handle the speaking user interface
including the keyboard.

Then client programs could connect to the server and create a window.  A
client could be any program which had direct support for the server or a
terminal (like XTerm) to use other programs. 

This would eliminate the need for message priorities since the user would
select the active 'window' and could also select to be informed of
activity in other 'windows'.

I am willing to help with this project, but I might not have much time due
to studying.


Regards


Robbie Murray



* Re: Protocol for speaking daemon (comments expected)
From: Jan Hubicka
  To: blinux-list

> 
> >Well, again, in my message-queueing server a delay should be easy to
> >implement.  But a pause isn't. You can't say another message in the middle
> >of the first.  But possibly we should put some pause between messages. But
> >can you explain
> 
> In the 'delayed' server, certainly tags like <pause> would only be usable
> outside of <messg> entities, and would really become mode settings.  But
> pause would also probably become unnecessary.
OK, a pause between messages should be quite easy to implement...
We should start thinking about it later, when the server is usable :)
> 
> >>
> >> Again the question becomes one of 'live' speech (where the session in
> >> control actively and directly 'speaks' to the device until it releases
> >> control, such as through a <break>) vs. delayed (message queued for
> >> playback) speech.  It seems a lot of extra overhead is needed for
> >> 'delayed' speech (such as tags for examining and controlling the queue).
> >Well, I think that 'live' speech has some other limitations - it would be
> >quite hard to implement history (and I think history is very valuable), it
> >could cause problems on slow lines and with handling multiple connections
> >(it is not very clear where to interrupt a message), and I think tags like
> ><break me here> would not work very well, since this is something like
> >non-preemptive multitasking.
> 
> Yes, it is non-preemptive scheduling.
> 
> >Live speech also makes the cases more complex where a whole sentence needs
> >to be processed at once before being sent to the output (which I think is
> >required by most modern synthesizers, to give them a way to produce
> >intonation etc); at least it would be a real limitation in the future.
> 
> Even in 'live' speech you would have the <mesg> </mesg> entity tags to
> specify where a complete 'phrase' or 'sentence' is, so that the entire
> contents are buffered and sent as a single entity (or utterance) for
> processing.  One advantage of live speech is that you do not have to have
> complex "indexing" markers and forced flushing of the message queue when you
Well, message numbers would be missing. They are not required for
message handling itself, but for history - and not just for the history
itself, but for an application that wants to handle messages it has already
sent to the server.  I can imagine a situation like:
"Do you want to continue y/n"
The user presses 'y' before the message is completed (because he knows what
happens next), so the application should remove this message from the queue.
How can this be done with live speaking?
The application could have a "be silent" tag, but that would also interrupt
some other significant text...

Also, even with live speech the client application needs to call some kind of
"flush" to flush the output buffers. This "flush" would also have to do the
necessary message handling.
> want an application to control interruption of talking.  But in many ways a
> delayed server is easier to implement, as it may not even require a formal
> 'server' to collect messages; one could then create an API that simply
> builds messages in a 'spool' directory which a server comes along and
> processes (much like "at" and "atrun").
Interesting idea :) But I think handling all the messages through the disk is
unnecessary and slow :)
> 
> I guess the real question becomes how intimately involved and synchronized
> the application is with the current speech being spoken for it.  If the
> application needs to 'pace' itself to the rate its output is being spoken,
> doing so in 'live' speech is easier to achieve, but it can certainly be
Well, I am not sure whether it is easier. I can't see the difference. In my
protocol there were some things to let the application handle a bit of what
it has said (like removing a message from the queue, putting a message at the
front of the queue, etc.).
I think this would be even harder with a 'live' speech server, where the
application sends data to the server and then has no way to control that
data. If we assume that the server is connected through a socket (and it
probably is), all speaking is done asynchronously.

This reminds me: maybe we need some command like "wait until the message has
been spoken".
> done, even in 'delayed' speaking.  If pacing is required, then a server with
I think that even a 'live' server will do 'delayed' speaking :) but that
is just playing with words :)
> a live connection to the application is a better choice than "at"
> and "atrun" style stuff, since status can be returned to the app.  For
> example, the server could return the message id of a requested message when
> it's being spoken, and an end token when it's done.
That would be a good idea...
> 
> >There is also another problem. I have had long emails with Mr. Hanika. He
> >thinks that a solution using a daemon is bad, and that a much better
> >solution is one where the speaking server directly controls a pty and
> >handles keys for history.
> >It would have one advantage - you have just one kind of application -
> >not applications handled via stdin/stdout and read using a special tool
> >plus applications that connect directly to the server and speak with it.
> 
> I also saw the server's role as providing speech over a LAN as well as
> servicing local applications.
OK, I hope Mr. Hanika will explain his ideas in greater detail here.
I don't want to translate his opinions, since that would be inexact.

Honza
> 
> 
> 

-- 



* Re: Protocol for speaking daemon (comments expected)
From: David Sugar
  To: blinux-list


>Well, again, in my message-queueing server a delay should be easy to
>implement.  But a pause isn't. You can't say another message in the middle
>of the first.  But possibly we should put some pause between messages. But
>can you explain

In the 'delayed' server, certainly tags like <pause> would only be usable
outside of <messg> entities, and would really become mode settings.  But
pause would also probably become unnecessary.

>>
>> Again the question becomes one of 'live' speech (where the session in
>> control actively and directly 'speaks' to the device until it releases
>> control, such as through a <break>) vs. delayed (message queued for
>> playback) speech.  It seems a lot of extra overhead is needed for
>> 'delayed' speech (such as tags for examining and controlling the queue).
>Well, I think that 'live' speech has some other limitations - it would be
>quite hard to implement history (and I think history is very valuable), it
>could cause problems on slow lines and with handling multiple connections
>(it is not very clear where to interrupt a message), and I think tags like
><break me here> would not work very well, since this is something like
>non-preemptive multitasking.

Yes, it is non-preemptive scheduling.

>Live speech also makes the cases more complex where a whole sentence needs
>to be processed at once before being sent to the output (which I think is
>required by most modern synthesizers, to give them a way to produce
>intonation etc); at least it would be a real limitation in the future.

Even in 'live' speech you would have the <mesg> </mesg> entity tags to
specify where a complete 'phrase' or 'sentence' is, so that the entire
contents are buffered and sent as a single entity (or utterance) for
processing.  One advantage of live speech is that you do not have to have
complex "indexing" markers and forced flushing of the message queue when you
want an application to control interruption of talking.  But in many ways a
delayed server is easier to implement, as it may not even require a formal
'server' to collect messages; one could then create an API that simply
builds messages in a 'spool' directory which a server comes along and
processes (much like "at" and "atrun").
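
A small sketch of that 'spool' alternative; the directory location and the
file naming are assumptions for illustration:

import os, tempfile, time

SPOOL_DIR = "/var/spool/speak"                 # assumed location

def spool_say(text):
    os.makedirs(SPOOL_DIR, exist_ok=True)
    stamp = "%.6f" % time.time()               # ordering key for the speaker process
    fd, tmp = tempfile.mkstemp(dir=SPOOL_DIR, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(text + "\n")
    final = os.path.join(SPOOL_DIR, stamp + ".msg")
    os.rename(tmp, final)                      # atomic: the speaker never sees partial files
    return final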

I guess the real question becomes how intimately involved and synchronized
the application is with the current speech being spoken for it.  If the
application needs to 'pace' itself to the rate its output is being spoken,
doing so in 'live' speech is easier to achieve, but it can certainly be
done, even in 'delayed' speaking.  If pacing is required, then a server with
a live connection to the application is a better choice than "at"
and "atrun" style stuff, since status can be returned to the app.  For
example, the server could return the message id of a requested message when
it's being spoken, and an end token when it's done.

>There is also another problem. I have had long emails with Mr. Hanika. He
>thinks that a solution using a daemon is bad, and that a much better
>solution is one where the speaking server directly controls a pty and
>handles keys for history.
>It would have one advantage - you have just one kind of application -
>not applications handled via stdin/stdout and read using a special tool
>plus applications that connect directly to the server and speak with it.

I also saw the server's role as providing speech over a LAN as well as
servicing local applications.





* Re: Protocol for speaking daemon (comments expected)
From: Jan Hubicka
  To: blinux-list

> I admit I like using <>'s primarily for ease of 'markup' of text for
> text-to-speech specific applications.  But this need not be a major issue
> yet.
OK, I agree. We should solve this much later :)
> 
> >It is easy to write and is used by terminals for the same purpose. Commands
> >should be terminated by the next ^[ or some different character for easier
> >parsing
> 
> My thought is that the 'pty' interface should use invisible (control
> character) markup for this reason, since it's intended primarily to support
> NON-speech aware applications, while the server's native TCP session is used
> primarily by applications that are speech aware.
OK, it was just an idea. I am thinking about a program packaging output for
the server.  When you use ^[ as markup, you can simply send the string, since
^[ is useless in plain text. But in the case of < you have to scan the whole
string and change every '<' to '<<>' or something.
> 
> This is different and rather interesting.  I see your intent here is to use
> the server to collect and 'queue' messages, rather than speak the data
> stream as it's being received.  Hence, I send a message to the server
> between a start and end, and it can queue it up under a known id, which
> a separate process can then come along and actually speak.  My approach had
> always been to 'speak' the message live as data is presented to the server.
> With 'delayed' speech, I see some pluses and minuses.  Certainly a big plus
> is ease of scheduling connections from multiple sessions.  A minus is extra
> complexity when controlling (pacing) speech or timing it to some other
> activity.
OK, these are two really different solutions. We should think about the
implementation issues. I think that in most cases a speaking application will
call a function like 'say' to say something. Here the situation is clear, and
the say function will send it correctly tagged (with <messg> </messg>).

The second case is an application getting output from a terminal or so, where
it is not clear where the end of a message is. In this case I suggest making a
new message for each line and also after a timeout. I can imagine just two
such applications (the terminal emulator in Emacspeak and a screen reader).

Using your method we would move this problem to the server. This would
simplify those two applications. But I think it also has disadvantages:
1) Using my protocol, until a message is completed it can change quite
   heavily (using the mode or realtext tags). If the server has some ability
   to cut messages in the middle of the text, we can't use this protocol. So
   the server can't cut messages, just queue them, and possibly start
   processing them before they are completed.
2) This solution could cause problems on a slow line, where the server may
   incorrectly mark a message as completed.
3) It also makes message processing more complex. Before a message is spoken,
   it needs to be converted to literal text, then to phonemes, and then to a
   waveform to be sent to the output. If the next part of a message arrives in
   the middle of this process, it needs to be treated separately. The
   converters will treat it as a new sentence, which could cause problems with
   intonation, words being said in two halves, etc.
4) I think it will make the queue system harder to understand, code and
   maintain :)
Note that this is just what I think.
> 
> >  The scheduler of messages may interrupt output from the current
> >  application with another application's output between these messages - I
> >  think a more exact definition should not be necessary, but if so, we
> >  could add something like <nobreak> and <break> later.
> 
> I would also add <pause xx> and <delay xx>, the first being a pause with
> restart that allows another 'speaker' (session) to interrupt, while delay
> would be true silence for a specified interval.  But this kind of
> interaction is more difficult in my mind in a 'delayed' speech server.  Just
> something to think about.
Well, again, in my message-queueing server a delay should be easy to
implement, but a pause isn't. You can't say another message in the middle of
the first. But possibly we could put some pause between messages. Can you
explain to me what this is good for?
> 
> 
> I tend to agree, which is really why I backlogged connections for other
> applications while an 'active' speaker (application) was connected.
OK :)
> 
> I always called this <lit> for 'literal' text :).  Which brings me back to
> tags, since it then becomes obvious to use </lit> to end a literal section
> :).
Sounds better :) But note that the realtext will not be sent to the output at
all; it is there just for history purposes.
> 
> Again the question becomes one of 'live' speech (where the session in
> control actively and directly 'speaks' to the device until it releases
> control, such as through a <break>) vs. delayed (message queued for
> playback) speech.  It seems a lot of extra overhead is needed for 'delayed'
> speech (such as tags for examining and controlling the queue).
Well, I think that 'live' speech has some other limitations - it would be
quite hard to implement history (and I think history is very valuable), it
could cause problems on slow lines and with handling multiple connections (it
is not very clear where to interrupt a message), and I think tags like
<break me here> would not work very well, since this is something like
non-preemptive multitasking.
Live speech also makes the cases more complex where a whole sentence needs
to be processed at once before being sent to the output (which I think is
required by most modern synthesizers, to give them a way to produce
intonation etc); at least it would be a real limitation in the future.

There is also another problem. I have had long emails with Mr. Hanika. He
thinks that a solution using a daemon is bad, and that a much better solution
is one where the speaking server directly controls a pty and handles keys for
history.
It would have one advantage - you have just one kind of application -
not applications handled via stdin/stdout and read using a special tool
plus applications that connect directly to the server and speak with it.

With our idea of the server there really could be some problems. I think that
a tool that controls the terminal and reads it is necessary, but not for all
applications; things like Emacspeak should run without it, since they
will support our server directly. But how should this be done? Should
Emacspeak just disable the screen-reading program with some tag? Then the
screen reader would run all the time and really would (although I don't like
this solution) be part of the server. Another solution is that you
run bash without any such tool. It could possibly support this directly
(I think the line-editing mechanism would need to be changed, so this
should not be such a big problem), and for programs whose output you want to
hear, you would run some 'say' program. But this doesn't seem to be a very
good solution either.

I hope Mr. Hanika will send an email about his opinions to this list, to let
others discuss them. I think it is a really significant decision, and even
though I think a separate server on a socket is better, I might be wrong.

Honza
> 
> 

-- 



* Re: Protocol for speaking daemon (comments expected)
From: David Sugar
  To: blinux-list




>Hi
>This is my first idea for the protocol. Looking forward to your comments:
>
>I think that the escape character should be ^[ instead of <, since it is an
>invisible character and doesn't need any preprocessing of the string, or
>putting something like ;lt instead of <


I admit I like using <>'s primarily for ease of 'markup' of text for
text-to-speech specific applications.  But this need not be a major issue
yet.

>It is easy to write and is used by terminals for the same purpose. Commands
>should be terminated by the next ^[ or some different character for easier
>parsing

My thought is that the 'pty' interface should use invisible (control
character) markup for this reason, since it's intended primarily to support
NON-speech aware applications, while the server's native TCP session is used
primarily by applications that are speech aware.

>Commands:
><clonecontext num1 num2>
>  Clone context number num1 to num2. Context number 0 is the default
>  context. Why use contexts? Imagine that you have a bigger program that
>  uses some libraries which speak using the server. Each uses some
>  parameters, but with just one context it would always need to turn them on
>  and then off again.
><messg num>
>  Beginning of a message - sets the message's number, to allow the client to
>  manipulate this message later.
><endmessg>
>  End of a message - the message will be added to the queue to be said.
>  I think this is necessary, since clever speaking devices need to know the
>  whole sentence to produce correct intonation and so on. It could be
>  determined by some timeout, but I think this is a much cleaner way, since
>  it will work reliably on slow lines too.

This is different and rather interesting.  I see your intent here is to use
the server to collect and 'queue' messages, rather than speak the data
stream as it's being received.  Hence, I send a message to the server
between a start and end, and it can queue it up under a known id, which
a separate process can then come along and actually speak.  My approach had
always been to 'speak' the message live as data is presented to the server.
With 'delayed' speech, I see some pluses and minuses.  Certainly a big plus
is ease of scheduling connections from multiple sessions.  A minus is extra
complexity when controlling (pacing) speech or timing it to some other
activity.

>  The scheduler of messages may interrupt output from the current
>  application with another application's output between these messages - I
>  think a more exact definition should not be necessary, but if so, we could
>  add something like <nobreak> and <break> later.

I would also add <pause xx> and <delay xx>, the first being a pause with
restart that allows another 'speaker' (session) to interrupt, while delay
would be true silence for a specified interval.  But this kind of
interaction is, in my mind, more difficult in a 'delayed' speech server.
Just something to think about.

>  Mixing of multiple applications will be a very rare case, like messages
>  from the talk daemon or so...

I tend to agree, which is really why I backlogged connections for other
applications while an 'active' speaker (application) was connected.

><del num>
>  Delete message number num. This should be used in case a message becomes
>  obsolete. It will also interrupt saying in case the message is currently
>  being processed.
><language name>
>  Switch to language name. There should be other such commands usable at any
>  time in a message, like <speed aaa>, <playau filename>, etc. There should
>  also be an informational version of these tags; e.g. <language> will send
>  all supported languages to the client.
><sync>
>  Synchronize with the server - the server will reply with something.
><realtext>
>  The real text of the message will follow - this is helpful for history,
>  since you might specify commands like say: aa dot bb dot cz, but the real
>  text would be aa.bb.cz, so it will be usable as keyboard input. You might
>  also specify an empty realtext to exclude a message from the command
>  history, or an empty text but non-empty realtext - this is useful for
>  situations like keyboard input. You might say a message for each pressed
>  letter with an empty realtext, and when the user presses enter, the
>  resulting line is sent to the server to add it to the history.
I always called this <lit> for 'literal' text :).  Which brings me back to
tags, since it then becomes obvious to use </lit> to end a literal section
:).

>  Also, messages that inform about something which will not be usable in the
>  command history (like I-Search: or so) should be sent with an empty
>  realtext.
><nohistory>
>  Don't save the current message to history - useful for cases where you
>  don't want to let the user re-say messages (like messages informing about
>  the history).
><mode=mode>
>  mode can be: first     - put the message first in the queue
>               last      - put the message last in the queue
>               whenready - say it only when the server is ready to say
>                           something and no other messages are in process -
>                           useful for things like keyboard echo and
>                           informational messages
>               interrupt - interrupt the current message and say this one
>                           first (this should be used for very urgent
>                           messages, like "the system is going down")
><current>
>  Returns the number of the current message in the queue
><prev num>
>  Returns the number of the previous message in the queue
><next num>
>  Returns the number of the next message in the queue
><saymessage num>
>  Says message number num
><getmessage num>
>  Gets message number num from the history
><dontsay num>
>  Don't say message number num, but keep it in the history.
><context num>
>  Switch the context to num; can be used at any place inside a message
>
>I am not sure how to handle the numbers of messages. I think it would be
>better to let the client generate these numbers. This would make later
>deleting of messages etc. easier... But the number of a message needs to be
>unique across all connections to make the history usable. I think we should
>keep a counter of messages on the application side in the range 0-65535,
>where the application starts with 0 and increases it after every message (a
>similar system could be used for contexts).
>
>Each connection should also have a unique number in the range 1-65536. The
>message number would then be calculated as app<<16+mesg. If an application
>asks for a message number <65536, the number of the current application will
>be added.

Again the question becomes one of 'live' speech (where the session in
control actively and directly 'speaks' to the device until it releases
control, such as through a <break>) vs. delayed (message queued for
playback) speech.  It seems a lot of extra overhead is needed for 'delayed'
speech (such as tags for examining and controlling the queue).




* Protocol for speaking daemon (comments expected)
From: Jan Hubicka
  To: blinux-list

Hi
This is my first idea for the protocol. Looking forward to your comments:

I think that the escape character should be ^[ instead of <, since it is an
invisible character and doesn't need any preprocessing of the string, or
putting something like ;lt instead of <

It is easy to write and is used by terminals for the same purpose. Commands
should be terminated by the next ^[ or some different character for easier
parsing.

Commands:
<clonecontext num1 num2>
  Clone context number num1 to num2. Context number 0 is the default
  context. Why use contexts? Imagine that you have a bigger program that
  uses some libraries which speak using the server. Each uses some
  parameters, but with just one context it would always need to turn them on
  and then off again.
<messg num>
  Beginning of a message - sets the message's number, to allow the client to
  manipulate this message later.
<endmessg>
  End of a message - the message will be added to the queue to be said.
  I think this is necessary, since clever speaking devices need to know the
  whole sentence to produce correct intonation and so on. It could be
  determined by some timeout, but I think this is a much cleaner way, since
  it will work reliably on slow lines too.
  The scheduler of messages may interrupt output from the current application
  with another application's output between these messages - I think a more
  exact definition should not be necessary, but if so, we could add something
  like <nobreak> and <break> later.
  Mixing of multiple applications will be a very rare case, like messages
  from the talk daemon or so...
<del num>
  Delete message number num. This should be used in case a message becomes
  obsolete. It will also interrupt saying in case the message is currently
  being processed.
<language name>
  Switch to language name. There should be other such commands usable at any
  time in a message, like <speed aaa>, <playau filename>, etc. There should
  also be an informational version of these tags; e.g. <language> will send
  all supported languages to the client.
<sync>
  Synchronize with the server - the server will reply with something.
<realtext>
  The real text of the message will follow - this is helpful for history,
  since you might specify commands like say: aa dot bb dot cz, but the real
  text would be aa.bb.cz, so it will be usable as keyboard input. You might
  also specify an empty realtext to exclude a message from the command
  history, or an empty text but non-empty realtext - this is useful for
  situations like keyboard input. You might say a message for each pressed
  letter with an empty realtext, and when the user presses enter, the
  resulting line is sent to the server to add it to the history.
  Also, messages that inform about something which will not be usable in the
  command history (like I-Search: or so) should be sent with an empty
  realtext.
<nohistory>
  Don't save the current message to history - useful for cases where you
  don't want to let the user re-say messages (like messages informing about
  the history).
<mode=mode>
  mode can be: first     - put the message first in the queue
               last      - put the message last in the queue
               whenready - say it only when the server is ready to say
                           something and no other messages are in process -
                           useful for things like keyboard echo and
                           informational messages
               interrupt - interrupt the current message and say this one
                           first (this should be used for very urgent
                           messages, like "the system is going down")
<current>
  Returns the number of the current message in the queue
<prev num>
  Returns the number of the previous message in the queue
<next num>
  Returns the number of the next message in the queue
<saymessage num>
  Says message number num
<getmessage num>
  Gets message number num from the history
<dontsay num>
  Don't say message number num, but keep it in the history.
<context num>
  Switch the context to num; can be used at any place inside a message

I am not sure how to handle the numbers of messages. I think it would be
better to let the client generate these numbers. This would make later
deleting of messages etc. easier... But the number of a message needs to be
unique across all connections to make the history usable. I think we should
keep a counter of messages on the application side in the range 0-65535,
where the application starts with 0 and increases it after every message (a
similar system could be used for contexts).

Each connection should also have a unique number in the range 1-65536. The
message number would then be calculated as app<<16+mesg. If an application
asks for a message number <65536, the number of the current application will
be added.
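
A small sketch of this numbering scheme; the function names are illustrative
only:

def make_msgid(connection_id, counter):
    return (connection_id << 16) + (counter & 0xFFFF)

def split_msgid(msgid, current_connection):
    if msgid < 0x10000:                  # short form: assume the current connection
        msgid = make_msgid(current_connection, msgid)
    return msgid >> 16, msgid & 0xFFFF   # (connection id, message counter)

print(split_msgid(make_msgid(3, 42), 3))   # (3, 42)
print(split_msgid(42, 3))                  # also (3, 42)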
-- 


