Speech Synthesis Workshop (SSW): Difference between revisions

From SynSIG
(SSW)
 
(No difference)

Revision as of 10:29, 2 October 2007

At an international conference on speech processing, a speech scientist once held up a tube of toothpaste (whose brand was "Signal") and, squeezing it in front of the audience, coined the phrase "This is speech synthesis; speech recognition is the art of pushing the toothpaste back into the tube."

One could turn this very simplistic view the other way round: users are generally much more tolerant of speech recognition errors than they are willing to listen to unnatural speech. There is magic in a speech recognizer that transcribes continuous radio speech into text with a word accuracy as low as 50%; in contrast, even a perfectly intelligible speech synthesizer is only moderately tolerated by users if it delivers nothing more than "robot voices". Delivering both intelligibility and naturalness has been the holy grail of speech synthesis research for the past 30 years. More recently, expressivity has been added as a major objective of speech synthesis.

Add to this the engineering costs (computational cost, memory cost, design cost for making another synthetic voice or another language) which have to be taken into account, and you'll start to have an idea of the challenges underlying text-to-speech synthesis.

Major challenges call for major meetings: the Speech Synthesis Workshops (SSWs), which are held every three years under the auspices of ISCA's SynSIG. SSWs provide a unique occasion for people in the speech synthesis area to meet each other. They contribute to establishing a feeling that we are all participating in a joint effort towards intelligible, natural, and expressive synthetic speech.