Blizzard Challenge 2016 Rules: Difference between revisions

From SynSIG
(Created page with "==DATABASE ACCESS== * After registration and completion of the required licenses, download passwords will issued, as described on [[Blizzard_Challenge_2016| the main Blizzard ...")
 
No edit summary
Line 18: Line 18:
==MATERIALS PROVIDED==
==MATERIALS PROVIDED==


All participants will have access to the following materials (subject to signing the license):
All participants will have access to the following materials (after signing the license):


* Indian languages: About 4 hours of speech data in each of three Indian languages (Hindi, Tamil and Telugu), and about 2 hours of speech data in each of other three Indian languages (Marathi, Bengali and Malayalam) recorded by native professional speakers in high quality studio environments. Text is provided in UTF-8 format. No other information, such as segment labels, is provided.
* Around XX hours of speech from one native British English female professional speaker.
* Transcripts of all speech material
* Alignments for part of the material


These speech databases are provided by the group of institutions: IIT-Madras, IIIT-Hyderabad, SSNCE, CDAC Trivandrum, CDAC Mumbai, CDAC Kolkata.  
The material has been very kindly provided by [http://www.usborne.com Usborne Publishing]


==THE CHALLENGES ==
==THE CHALLENGE ==


This year there are two parts to the Blizzard Challenge: the Hub tasks, and Spoke tasks on Indian language data.  
Participants involved in joint projects or consortia who wish to submit multiple systems (e.g., an individual entry and a joint system) should contact the organisers in advance to agree this. We will try to accommodate all reasonable requests, provided the listening test remains manageable.


* It is not permissible for a single participant to submit multiple entries to any task, because the listening test will become unmanageable. This rule may be relaxed in the event of a small number of participants.
=== Task ===
Build a voice from the provided data, suitable for reading children's audiobooks. There is just a single task, designated as 2016-EH1.


* Participants involved in joint projects or consortia who wish to submit multiple systems (e.g., an individual entry and a joint system) should contact the organisers in advance to agree this. We will try to accommodate all reasonable requests, provided the listening test remains manageable.
=== Sharing the work of data pre-processing ===
Some, but not all, of the material has been aligned with the book text. Because the books provided by the publisher are in PDF format, and the audio for each book is typically a sequence of audio tracks, it is non-trivial to extract a clean version of the text and then to align it with the speech (e.g., at sentence level).


* It is strongly encouraged to participate in all tasks and not to "cherry pick".
Therefore, we ask all participants to collaborate and share the effort of creating clean, aligned transcripts of the speech. This will be co-ordinated by [http://www.coli.uni-saarland.de/~slemaguer/ Sebastien Le Maguer]. Please contact Sebastien before embarking on any manual cleanup of the text, or alignment with the audio, so that he can eliminate duplicate work.
 
* For all tasks, synthetic speech may be submitted at any sampling rate (but always at 16 bits per sample). Waveforms will '''not''' be downsampled for the listening test
 
===Hub task===
* Build one voice in each language from the provided speech data (wav/ directory), sampled at 16 kHz, and the corresponding text in UTF-8 format (train.done.data). Any other information that may be included in the distributions, such as segment labels, phone set and Roman transliteration available from the sample Festvox voice builds (e.g., the .lab, .sl and .slehmm files) is not officially provided or endorsed as a part of this challenge, although may be used by participants if they wish. In all cases test material will be provided as UTF-8 text only (in a similar format to train.done.data). The subtasks are numbered as follows:
** 2015-IH1.1 Bengali
** 2015-IH1.2 Hindi
** 2015-IH1.3 Malayalam
** 2015-IH1.4 Marathi
** 2015-IH1.5 Tamil
** 2015-IH1.6 Telugu
 
===Spoke task===
 
The purpose of this task is build a multilingual synthesis i.e., Indian language + English.
 
Training: Indian language (ex: Telugu) uttered by speaker A. Note that the training data provided for the Indian language may not contain any English words at all.
 
Test: Telugu with speaker A's voice and English with speaker A's voice
 
Example test sentences:
 
"యూఈఏ నిర్దేశించిన 286 పరుగుల లక్ష్యాన్ని మరో 12 బంతులు మిగులుండగానే ఛేదించింది. 48 ఓవర్లలో 6 వికెట్లు కోల్పోయి 286 పరుగులు చేసింది. http://telugu.oneindia.com"
 
We are not providing language tags in the test sentence. As the text is in UTF-8, it is easy to identify the language from the Unicode point (the idea is to simulate the way the text is available on the webpages without much information).
 
** 2015-IH2.1 Bengali
** 2015-IH2.2 Hindi
** 2015-IH2.3 Malayalam
** 2015-IH2.4 Marathi
** 2015-IH2.5 Tamil
** 2015-IH2.6 Telugu


==USE OF EXTERNAL DATA==
==USE OF EXTERNAL DATA==
Line 71: Line 43:
* Use of external data is entirely optional and is not compulsory
* Use of external data is entirely optional and is not compulsory
* You must use the provided audio files
* You must use the provided audio files
* You must not use any additional speech data from the same speakers
* You must not use any additional speech data from the same speaker
* You may exclude any parts of the provided databases if you wish.
* You may exclude any parts of the provided databases if you wish.
* Use of any provided segmentations, transcriptions or labels is optional.
* Use of any provided segmentations, transcriptions or labels is optional.
Line 77: Line 49:


==SYNTHESISING THE TEST EXAMPLES==
==SYNTHESISING THE TEST EXAMPLES==
* The exact nature of the test set will not be revealed in advance, but is likely to include both sentence, paragraph and possibly longer texts from a similar domain to the provided corpus, as well as texts from other domains. Formal listening tests will be conducted to evaluate the synthetic speech submitted.
* The exact nature of the test set will not be revealed in advance, but is likely to include sentence-length, paragraph-length and short book-length texts from a similar domain to the provided corpus, as well as texts from other domains.
 
* Synthetic speech may be submitted at any standard sampling rate (but always at 16 bits per sample). Waveforms will '''not''' be downsampled for the listening test.


==RETENTION OF SUBMITTED SYNTHETIC SPEECH SAMPLES==
==RETENTION OF SUBMITTED SYNTHETIC SPEECH SAMPLES==
Line 84: Line 58:


==LISTENING TEST==
==LISTENING TEST==
* The Blizzard organisers will conduct a listening test design which will probably include the standard elements used in previous years (naturalness, speaker similarity, intelligibility) and will be extended to include additional tests specific to the audiobook reading task, including the synthesis of multi-sentence paragraphs.
Formal listening tests will be conducted to evaluate the synthetic speech submitted. Whilst the task is to synthesise speech suitable for reading an audiobook to children, the listening test will likely also evaluate the performance of the voice in terms of naturalness and intelligibility on other types of material (i.e., as in most previous Blizzard Challenges).


==PAPER==
==PAPER==
* Each participant will be expected to submit a six-page paper describing their entry for review.
* Each participant will be expected to submit a six-page paper describing their entry for review.
* One of the authors of each accepted paper should present it at the Blizzard 2015 Workshop
* One of the authors of each accepted paper should present it at the Blizzard 2016 Workshop
* In addition, each participant will be expected to complete a form giving the general technical specification of their system, to facilitate easy cross-system comparisons (e.g. is it unit selection? does it predict prosody? etc. etc)
* In addition, each participant will be expected to complete a form giving the general technical specification of their system, to facilitate easy cross-system comparisons (e.g. is it unit selection? does it predict prosody? etc. etc)



Revision as of 14:42, 22 February 2016

DATABASE ACCESS


REGISTRATION FEE

  • A registration fee of 500 GBP is payable by all participants who wish to submit synthetic speech for evaluation, to offset the costs of running the challenge, including paying local assistants and listeners. The fee must be paid by Friday 29th April 2016. You can pay this fee using Edinburgh University's online payments system: (URL coming soon...) where you should register for the event called 'Blizzard Challenge 2016'. After doing this, please email blizzard@festvox.org to notify us that you have paid. If you are absolutely unable to use the online payments system, please contact blizzard@festvox.org for assistance with a bank transfer. However, we strongly prefer the epay system because it reduces the costs and admin work for us. If you must pay by bank transfer, please contact us in plenty of time (at least 3 weeks before the payment deadline); an additional fee of 100 GBP will be added for any payments not made using the epay system.

EXPERT LISTENERS

  • Each participant should try to recruit at least ten volunteer listeners for the evaluation tests.

NAIVE LISTENERS

  • The organisers would also appreciate assistance in advertising the Challenge as widely as possible (e.g., to your students or colleagues).
  • We are also seeking assistance with conducting a listening test using children. These can be native speakers, or learners of English as an additional language. Children with reading ages of either 4, 5, or 6 years (approximately) would be most appropriate, but slightly older children would also be useful. Please contact blizzard@festvox.org to discuss this. We may be able to offer some financial support with the costs, or reduce the entry fee if you are also submitting synthetic speech.


MATERIALS PROVIDED

All participants will have access to the following materials (after signing the license):

  • Around XX hours of speech from one native British English female professional speaker.
  • Transcripts of all speech material
  • Alignments for part of the material

The material has been very kindly provided by Usborne Publishing.

THE CHALLENGE

Participants involved in joint projects or consortia who wish to submit multiple systems (e.g., an individual entry and a joint system) should contact the organisers in advance to agree this. We will try to accommodate all reasonable requests, provided the listening test remains manageable.

Task

Build a voice from the provided data, suitable for reading children's audiobooks. There is just a single task, designated as 2016-EH1.

Sharing the work of data pre-processing

Some, but not all, of the material has been aligned with the book text. Because the books provided by the publisher are in PDF format, and the audio for each book is typically a sequence of audio tracks, it is non-trivial to extract a clean version of the text and then to align it with the speech (e.g., at sentence level).

Therefore, we ask all participants to collaborate and share the effort of creating clean, aligned transcripts of the speech. This will be co-ordinated by Sebastien Le Maguer. Please contact Sebastien before embarking on any manual cleanup of the text, or alignment with the audio, so that he can eliminate duplicate work.

USE OF EXTERNAL DATA

  • "External data" is defined as data, of any type, that is not part of the provided database.
  • You are allowed to use external data in any way you wish, subject to any exclusions given in these rules.
  • Use of external data is entirely optional and is not compulsory
  • You must use the provided audio files
  • You must not use any additional speech data from the same speaker
  • You may exclude any parts of the provided databases if you wish.
  • Use of any provided segmentations, transcriptions or labels is optional.
  • If you are in any doubt about how to apply these rules, please contact the organisers immediately.

SYNTHESISING THE TEST EXAMPLES

  • The exact nature of the test set will not be revealed in advance, but is likely to include sentence-length, paragraph-length and short book-length texts from a similar domain to the provided corpus, as well as texts from other domains.
  • Synthetic speech may be submitted at any standard sampling rate (but always at 16 bits per sample). Waveforms will not be downsampled for the listening test.
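
As an informal aid for checking the 16-bit requirement before submission, the sketch below reads the header of a file and reports its sampling rate and whether it uses 16 bits per sample. It assumes the synthetic speech is stored as PCM WAV files, which these rules do not state, and the function name check_wav is hypothetical. It is not an official validation tool, and it does not judge whether a sampling rate counts as "standard".

```python
import wave


def check_wav(path):
    """Return (sample_rate_hz, is_16_bit) for a PCM WAV file.

    Assumption: submissions are PCM WAV files; the rules above do not
    name a file format. wave.open raises wave.Error for non-PCM data.
    """
    with wave.open(path, "rb") as wav:
        sample_width_bytes = wav.getsampwidth()  # 2 bytes == 16 bits
        sample_rate_hz = wav.getframerate()
    return sample_rate_hz, sample_width_bytes == 2
```

For example, check_wav("sample.wav") on a 48 kHz, 16-bit file would return (48000, True); the file name here is only a placeholder.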

RETENTION OF SUBMITTED SYNTHETIC SPEECH SAMPLES

  • Any examples that you submit for evaluation will be retained by the Blizzard organisers for future use.
  • You must include in your submission of the test sentences a statement of whether you give the organisers permission to publicly distribute your waveforms and the corresponding listening test results in anonymised form. In the past, all participants have agreed to this and we strongly encourage you to give this consent.

LISTENING TEST

Formal listening tests will be conducted to evaluate the synthetic speech submitted. Whilst the task is to synthesise speech suitable for reading an audiobook to children, the listening test will likely also evaluate the performance of the voice in terms of naturalness and intelligibility on other types of material (i.e., as in most previous Blizzard Challenges).

PAPER

  • Each participant will be expected to submit a six-page paper describing their entry for review.
  • One of the authors of each accepted paper should present it at the Blizzard 2016 Workshop.
  • In addition, each participant will be expected to complete a form giving the general technical specification of their system, to facilitate easy cross-system comparisons (e.g., is it unit selection? does it predict prosody? etc.).

HOW ARE THESE RULES ENFORCED?

  • This is a challenge, which is designed to answer scientific questions, and not a competition. Therefore, we rely on your honesty in preparing your entry.