Blizzard Challenge 2009 Rules

DATABASE ACCESS

  • You will receive a separate message about how to download the database.

REGISTRATION FEE

  • A registration fee of 750 USD (500 GBP) is due to offset the costs of running the challenge, including paying local assistants and undergraduate listeners. The fee is fixed, regardless of how many hub or spoke tasks you participate in. The fee must be paid by Friday 10th April 2009. You can pay this fee using Edinburgh University's online payments system (https://www.epay.ed.ac.uk/events/eventdetails.asp?eventid=88) where you should register for the event called 'Blizzard Challenge 2009'. After doing this, please also email blizzard@festvox.org to notify us that you have paid. If you are unable to use the online payments system, please contact the organisers for assistance with other methods of payment.

EXPERT LISTENERS

  • Each participant is expected to provide ten speech experts as listeners for the evaluation tests. Native speakers of English or Mandarin (as appropriate) are preferred where possible. The organisers would also appreciate assistance in advertising the Challenge as widely as possible (e.g., to your students or colleagues).

BUILDING VOICES

  • Participants may enter the Challenge for one or both languages. You are encouraged to attempt both.
  • For each language in which you are participating, you must complete the hub task.
  • If you complete the hub task for a language, you may then submit entries for any number of the spoke tasks for that language.
  • It is not permissible for a single participant to submit multiple entries, because the listening test will become unmanageable.
  • Participants involved in joint projects or consortia who wish to submit multiple systems (e.g., an individual entry and a joint system) should contact the organisers in advance to agree this. We will try to accommodate all reasonable requests, provided the listening test remains manageable.

Hub task for English

You must build two voices:

  • Task EH1: build a voice from the full UK English database (about 15 hours)
  • Task EH2: build a voice from the specified 'ARCTIC' subset of the UK English database (about 1 hour)

Spoke tasks for English

  • Task ES1: build voices from the specified 'E_SMALL10', 'E_SMALL50' and 'E_SMALL100' datasets, which consist of the first 10, 50 and 100 sentences respectively of the 'ARCTIC' subset. You may use voice conversion, speaker adaptation, or any other technique you like.
  • Task ES2: build a voice from the full UK English database suitable for synthesising speech to be transmitted via a telephone channel. A telephone channel simulation tool is available to assist in system development. You may enter the same voice as task EH1 or EH2 if you wish, although specially-designed voices are strongly encouraged.
  • Task ES3: build a voice from the full UK English database suitable for synthesising the computer role in a human-computer dialogue. A set of development dialogues is provided. The test dialogues will be from the same domain. You may enter the same voice as task EH1 or EH2 if you wish, although specially-designed voices are strongly encouraged. You may not change any of the words in the sentences to be synthesised. You may add simple markup to the text, either automatically or manually, if you wish. Please only use markup that could be provided by a text-generation system (e.g. emphasis tags would be acceptable, but a handcrafted F0 contour would not); a short sketch of this kind of automatic markup follows this list. Please use the mailing list if you wish to comment on these rules.
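
For task ES3, the rules allow simple, text-level markup of the kind a text-generation system could produce. The sketch below is one hypothetical illustration of adding such markup automatically in Python; the <emphasis> tag name and the word list driving it are invented for this example and are not part of the challenge distribution, and a real system would use its own text-generation front end (manual markup is also permitted for this task).

    # Hypothetical illustration for task ES3: automatically adding simple,
    # text-level markup (an <emphasis> tag) to a sentence before synthesis.
    # The tag name and the emphasis rule are invented for this sketch; the
    # rules only require markup that a text-generation system could supply,
    # i.e. no handcrafted acoustic targets such as F0 contours.
    import re

    EMPHASIS_WORDS = {"not", "never", "must"}   # toy rule: emphasise these words

    def add_emphasis_markup(sentence: str) -> str:
        """Wrap selected words in <emphasis> tags, fully automatically."""
        def mark(match: re.Match) -> str:
            word = match.group(0)
            if word.lower() in EMPHASIS_WORDS:
                return f"<emphasis>{word}</emphasis>"
            return word
        return re.sub(r"[A-Za-z']+", mark, sentence)

    if __name__ == "__main__":
        print(add_emphasis_markup("You must not change the words to be synthesised."))
        # -> You <emphasis>must</emphasis> <emphasis>not</emphasis> change the words ...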

Hub task for Mandarin

  • Task MH: build a voice from the full Mandarin database (about 6000 utterances / 130000 Chinese characters)

Spoke tasks for Mandarin

  • Task MS1: build voices from the specified 'M_SMALL10', 'M_SMALL50' and 'M_SMALL100' datasets, which consist of the first 10, 50 and 100 sentences respectively of the full Mandarin database.
  • Task MS2: build a voice from the full Mandarin database suitable for synthesising speech to be transmitted via a telephone channel. A telephone channel simulation tool is available to assist in system development. You may enter the same voice as task MH if you wish, although specially-designed voices are strongly encouraged. A rough sketch of this kind of channel simulation follows this list.
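
Tasks ES2 and MS2 refer to a telephone channel simulation tool supplied by the organisers, which should be used where possible. As a rough, unofficial sketch of what such a simulation typically involves, the Python code below band-limits a recording to roughly 300-3400 Hz and resamples it to 8 kHz; the filter order, cut-off frequencies, file names and the use of the scipy and soundfile packages are all assumptions made for this illustration, not properties of the official tool.

    # Rough, unofficial approximation of a telephone channel for development:
    # band-limit the signal to roughly 300-3400 Hz and resample it to 8 kHz.
    # The official simulation tool provided by the organisers should be
    # preferred; cut-offs, filter order and file names here are assumptions.
    import soundfile as sf
    from scipy.signal import butter, sosfiltfilt, resample_poly

    def simulate_telephone_channel(in_wav: str, out_wav: str) -> None:
        audio, rate = sf.read(in_wav)            # assumes a mono recording
        # 6th-order Butterworth band-pass, approximately 300-3400 Hz
        sos = butter(6, [300, 3400], btype="bandpass", fs=rate, output="sos")
        narrowband = sosfiltfilt(sos, audio)
        # Resample to the 8 kHz rate typical of telephone-bandwidth speech
        telephone = resample_poly(narrowband, 8000, rate)
        sf.write(out_wav, telephone, 8000)

    if __name__ == "__main__":
        # placeholder file names, not actual database file names
        simulate_telephone_channel("utterance.wav", "utterance_telephone.wav")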

USE OF EXTERNAL DATA

  • "External data" is defined as data, of any type, that is not part of the provided database.
  • You are allowed to use external data in any way you wish.
  • For the UK English tasks, you should consider the Mandarin database to be external data.
  • For the Mandarin tasks, you should consider the UK English database to be external data.
  • For tasks EH2, ES1, and MS1 in which you build a voice from a subset of a database, you must not use the remainder of that database at all, for any purpose - you must pretend it does not exist.
  • You may exclude any parts of the provided databases if you wish.
  • If you are in any doubt about how to apply these rules, please contact the organizers immediately.

SYNTHESISING THE TEST EXAMPLES

  • No manual intervention is allowed during synthesis. This includes, but is not limited to:
    • "Prompt sculpting"
    • Altering existing entries in your lexicon (however, you are allowed to add new words)
    • Using different subsets of the database to generate different test sentences or sentence types within a single task, unless this is a fully automatic part of your system.

RETENTION OF SUBMITTED SYNTHETIC SPEECH SAMPLES

  • Any examples that you submit for evaluation will be retained by the Blizzard organisers for future use.
  • You must include in your submission of the test sentences a statement of whether you give the organisers permission to publicly distribute your waveforms and the corresponding listening test results in anonymised form. We strongly encourage you to give this consent.

LISTENING TEST

  • The listening test design is likely to be similar to that used in the 2008 Challenge. Depending on the number of entries for each task, the organisers may only be able to evaluate certain subsets of the synthesised sentences or certain system configurations. For example, if a large number of entries are received for spoke task ES1, then we may choose only to evaluate the 100-sentence condition.

PAPER

  • Each participant will be expected to submit a six-page paper describing their entry for review.
  • One of the authors of each accepted paper should present it at the Blizzard Challenge 2009 Workshop, which will be a satellite of Interspeech 2009 in the UK. The workshop will be held in Edinburgh.
  • In addition, each participant will be expected to complete a form giving the general technical specification of their system, to facilitate cross-system comparisons (e.g., is it unit selection? does it predict prosody? etc.).

HOW ARE THESE RULES ENFORCED?

  • This is a challenge designed to answer scientific questions, not a competition. We therefore rely on your honesty in preparing your entry.