Blizzard Challenge 2007 Rules

From SynSIG

Revision as of 16:42, 22 February 2007

DATABASE ACCESS

 * You will receive a separate message about how to download the database.

REGISTRATION FEE

 * A registration fee of 500 USD is due to offset the costs of running
   the challenge, including paying undergraduate listeners. This must
   be paid by the time you submit your test examples. You will
   receive separate instructions on how to pay this.

EXPERT LISTENERS

 * Each participant is expected to provide ten speech experts as
   listeners for the evaluation tests. Native speakers of English are
   preferred, where possible.

BUILDING VOICES

 * Each participant should build three synthetic voices from the
   database. It is permissible to submit fewer than three voices, but
   we strongly encourage you to complete the full challenge because
   this will be more informative.
 * It is not permissible for a single participant to submit multiple
   entries for any of the voices (because the listening test will
   become unmanageable).
 * All three voices should be built using the same method, software,
   external data, etc. For example, you are not allowed to use unit
   selection for voice A but a voice conversion method for voices B
   and C.
 * Voices to be built:
     Voice A: from the full dataset (about 8 hours)
     Voice B: from the ARCTIC subset (about 1 hour)
     Voice C: from a subset of the data chosen by you, under the
     following conditions:
     - you may only base your selection on the text (and not the
       speech, or any information such as labelling which has been
       derived with reference to the speech signal)
     - if your selection method requires phonetic, prosodic, or any
       other type of labelling, this must have been derived from the
       text only
     - you must select entire utterances
     - the total duration of the utterances you select must be no
       more than 2914 seconds (which is equal to the duration of the
       ARCTIC subset); to make this calculation, you should use the
       officially provided durations file, which will be emailed to
       you.
 * If you use the provided database to train any parts of your system
   (e.g., a prosodic model or HMM parameters), then for voices B and
   C, you must not use the whole database to train those parts, but
   only the appropriate subset. See below for rules on using external
   data.
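
 The duration condition for Voice C can be checked mechanically. The
 following is a minimal sketch only: the format of the official
 durations file is not given in these rules, so the
 "utterance_id seconds" layout assumed here, the utterance IDs, and
 all function names are hypothetical. Only the 2914-second limit comes
 from the rules.

```python
# Sketch: check a Voice C selection against the 2914-second limit.
# ASSUMPTION: each line of the durations file is "utterance_id seconds".
# The real file format is not specified in the rules.

ARCTIC_BUDGET_SECONDS = 2914.0  # limit stated in the rules


def parse_durations(lines):
    """Parse 'utterance_id seconds' lines (assumed format) into a dict."""
    durations = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        utt_id, seconds = line.split()
        durations[utt_id] = float(seconds)
    return durations


def total_duration(selected_ids, durations):
    """Sum the durations of whole utterances; unknown IDs raise KeyError."""
    return sum(durations[utt_id] for utt_id in selected_ids)


def within_budget(selected_ids, durations, budget=ARCTIC_BUDGET_SECONDS):
    """Return True if the selection is no longer than the budget."""
    return total_duration(selected_ids, durations) <= budget


# Hypothetical example data, not taken from the real database.
example = parse_durations(["utt_0001 3.25", "utt_0002 4.50"])
print(within_budget(["utt_0001", "utt_0002"], example))  # True
```

 Because only entire utterances may be selected, the check sums
 per-utterance durations and never truncates an utterance to fit.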


USE OF EXTERNAL DATA

 * "External data" is defined as data, of any type, that is not part
   of the provided database.
 * You are allowed to use external data. You must follow one of these
   two sets of rules (and the same one for all three voices):
     * Standard rules: You may use external data to construct these
         parts of your system:
           - text normalisation
           - lexicon & letter-to-sound
           - duration model
           - F0 model
           - aligner (i.e., any component used only to label the
             database, such as a set of HMMs used for forced alignment)
     * Voice conversion rules: You may use external data in any way
       you wish
 * In essence, if there is any possibility that your system could sound
   like a different speaker than the database speaker, then your system
   should be classified as a voice conversion type of system.
 * If you are in any doubt about how to apply these rules, please contact
   the organizers immediately.
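
 As a rough illustration of the two rule sets, the sketch below reports
 which set a system falls under, given the components that were built
 with external data. The component labels and the function name are
 hypothetical, and the check covers only the component list: it cannot
 judge whether a system could sound like a different speaker, which
 remains a matter for the participant and the organizers.

```python
# Sketch: which rule set applies, based on where external data is used.
# The labels mirror the list in the standard rules; the exact strings
# are an assumption made for illustration.

STANDARD_ALLOWED = frozenset({
    "text normalisation",
    "lexicon & letter-to-sound",
    "duration model",
    "F0 model",
    "aligner",
})


def rule_set(components_using_external_data):
    """Return 'standard' if every listed component is permitted under the
    standard rules, otherwise 'voice conversion'."""
    used = set(components_using_external_data)
    return "standard" if used <= STANDARD_ALLOWED else "voice conversion"


print(rule_set(["aligner", "F0 model"]))      # standard
print(rule_set(["waveform generation"]))      # voice conversion
```

 A system that uses no external data at all passes an empty list and is
 reported as standard. The same rule set must be used for all three
 voices.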


SYNTHESISING THE TEST EXAMPLES

 * No manual intervention is allowed during synthesis. This includes,
   but is not limited to:
      * "Prompt sculpting"
      * Altering existing entries in your lexicon (however, you are
        allowed to add new words)
      * Using different subsets of the database for different test
        sentences or sentence types, unless this is a fully automatic
        part of your system


LISTENING TEST

 * We are not releasing details of the listening test design at this
   time, because you should not be tailoring your voice building to
   it. It will be largely similar to previous challenges, and you will
   need to synthesise several hundred sentences from text.
 * For voice conversion-type systems, there will be an additional
   component of the test, to judge how close the system sounds to the
   database speaker. If the listening test design allows, we will
   perform this test for all standard systems too.
 * Any examples that you submit for evaluation may be retained for
   future use. We hope to be able to distribute them in anonymised
   form to all participants, or publicly.

PAPER

 * Each participant will be expected to submit a four-page paper
   describing their entry for review.
 * One of the authors of each accepted paper should present it at a
   satellite workshop of SSW6, on August 25, 2007, in Bonn, Germany.
 * In addition, each participant will be expected to complete a form
   giving the general technical specification of their system, to
   facilitate easy cross-system comparisons (e.g., is it unit
   selection? does it predict prosody? etc.)


HOW ARE THESE RULES ENFORCED?

 * This is a challenge, which is designed to answer scientific
   questions, and not a competition. Therefore, we rely on your
   honesty in preparing your entry.