Blizzard Machine Learning Challenge 2017 Rules

From SynSIG

DATABASE ACCESS

REGISTRATION FEE

  • A registration fee of 600 GBP is payable by all participants who wish to submit their output for evaluation, to offset the costs of running the challenge, including paying local assistants and listeners. The fee must be paid by Friday 21st April 2017. You can pay this fee using Edinburgh University's online payments system at http://www.epay.ed.ac.uk/browse/category.asp?compid=1&modid=2&catid=20 where you should register for the event called 'Blizzard Challenge 2017'. After doing this, please email blizzard@festvox.org to notify us that you have paid. If you are absolutely unable to use the online payments system, please contact blizzard@festvox.org for assistance with a bank transfer. However, we strongly prefer the epay system because it reduces the costs and admin work for us. If you must pay by bank transfer, please contact us in plenty of time (at least 4 weeks before the payment deadline); an additional administration fee of 150 GBP will be added for any payments not made using the epay system.

Note: if you are also taking part in the main challenge, you only have to pay a single registration fee, which then covers both of your entries.

LISTENERS

  • Each participant should try to recruit as many naive listeners (with no professional knowledge of synthetic speech) as possible, and at least 10. They do not have to be native speakers. All members of your team are also expected to perform the complete listening test: there will be a separate listening test URL for people who consider themselves experts in speech synthesis.
  • The organisers would also appreciate assistance in advertising both the Challenge and the listening test as widely as possible (e.g., to your students or colleagues).

MATERIALS PROVIDED

All participants will have access to the following material after signing the license:

  • An estimated 4 hours of speech from one native British English female professional speaker (this is a cleaned-up version of the 5 hours released in 2016 for the main Blizzard Challenge) along with linguistic features and speech features.

The speech material has been very kindly provided by Usborne Publishing and the preparation of the linguistic features benefitted greatly from the effort contributed by Innoetics.

THE CHALLENGE

You should make a single submission for either (but not both) of the following two tasks:

  • 2017-ES1
    • Predict speech features from linguistic features
    • Materials provided: Frame-level pairs of 687-dimensional linguistic features and 77-dimensional speech features
    • You must not use the speech waveforms provided for the other spoke task.
  • 2017-ES2
    • Directly predict speech waveforms from linguistic features
    • Materials provided: Pairs of 687-dimensional linguistic features and speech waveforms
    • You must not use the speech features provided for the other spoke task.
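
As an illustration of the 2017-ES1 data layout described above, the following minimal sketch checks that a set of frame-level pairs has the stated dimensions. The function name `check_frame_pairs` and the list-of-lists representation are assumptions made for this example; the actual file format of the provided data is not described here.

```python
# Dimensions stated in the task description.
LINGUISTIC_DIM = 687
SPEECH_DIM = 77


def check_frame_pairs(linguistic, speech):
    """Check frame-level alignment and dimensions; return the frame count.

    `linguistic` and `speech` are assumed to be sequences of frames,
    each frame being a sequence of numbers (a hypothetical layout).
    """
    if len(linguistic) != len(speech):
        raise ValueError("linguistic and speech features differ in frame count")
    for index, (x, y) in enumerate(zip(linguistic, speech)):
        if len(x) != LINGUISTIC_DIM:
            raise ValueError(f"frame {index}: expected {LINGUISTIC_DIM} linguistic features")
        if len(y) != SPEECH_DIM:
            raise ValueError(f"frame {index}: expected {SPEECH_DIM} speech features")
    return len(linguistic)
```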

USE OF EXTERNAL DATA and/or PROCESSING THE PROVIDED DATA

This first version of the Blizzard Machine Learning Challenge is posed as a "pure and simple" machine learning problem. The rules are intended to create a "level playing field" between entries from both speech synthesis experts and machine learning experts alike. Therefore, expert interventions into the provided data are not allowed.

  • "External data" is defined as data, of any type, that is not part of the provided database. You are NOT allowed to use external data in any way.
  • You may automatically apply transforms, such as normalisation or quantisation, to any of the provided data.
  • The linguistic features
    • The provided features are typical of statistical parametric speech synthesis.
    • The feature order has been randomised, and you must not attempt to reverse-engineer it.
    • You must not augment the linguistic features (e.g., with additional manual annotation).
    • You must not manually remove features.
  • For the waveforms in task 2017-ES2
    • You must not extract additional features (e.g., F0 or cepstrum) from the speech waveforms.
    • It is permitted to quantise the waveform samples.
    • It is permitted to downsample the waveforms.
  • If you are in any doubt about how to apply these rules, please contact the organisers immediately.
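
The automatic transforms permitted above might look like the following minimal sketch. It shows per-dimension mean and variance normalisation, with statistics computed only from the provided data, and mu-law quantisation of waveform samples to 256 levels. Both choices are assumptions for illustration: the rules do not prescribe any particular normalisation or quantisation scheme, and all function names here are hypothetical.

```python
import math


def fit_normaliser(frames):
    """Per-dimension mean and standard deviation of the provided frames."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        # A constant dimension has zero variance; use 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def normalise(frames, means, stds):
    """Apply the fitted statistics to every frame."""
    return [[(v - m) / s for v, m, s in zip(f, means, stds)] for f in frames]


def mu_law_encode(x, mu=255):
    """Map a sample in [-1, 1] to an integer level in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1.0) / 2.0 * mu + 0.5)


def mu_law_decode(q, mu=255):
    """Approximate inverse of mu_law_encode."""
    y = 2.0 * q / mu - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
```

Fitting the statistics on the training portion only, and reusing them unchanged for the test features, keeps the transform fully automatic.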

GENERATING THE UNSEEN TEST EXAMPLES

  • The exact nature of the test set will not be revealed in advance, but it is likely to include sentence-length, paragraph-length and short book-length texts from a similar domain to the provided corpus, as well as texts from other domains. Linguistic features will be provided in the same format as for the training data.
  • Synthetic speech waveforms for task 2017-ES2 may be submitted at any standard sampling rate (16 kHz, 22.05 kHz, 44.1 kHz or 48 kHz) and always at 16 bits per sample. Waveforms will not be downsampled for the listening test.
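
A minimal sketch of writing a waveform that meets the format constraints above (one of the listed sampling rates, 16 bits per sample), using only the Python standard library. The function name `write_pcm16`, the assumption of mono audio, and the assumption that samples are floats in [-1, 1] are illustrative; the rules do not state the channel count or the submission file format.

```python
import io
import struct
import wave

# Sampling rates listed in the rules, in Hz.
ALLOWED_RATES_HZ = (16000, 22050, 44100, 48000)


def write_pcm16(dest, samples, rate_hz):
    """Write float samples in [-1, 1] as 16-bit PCM to a path or file object."""
    if rate_hz not in ALLOWED_RATES_HZ:
        raise ValueError(f"unsupported sampling rate: {rate_hz} Hz")
    # Clip out-of-range values so they cannot overflow the 16-bit range.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    pcm = b"".join(struct.pack("<h", int(round(s * 32767))) for s in clipped)
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(1)   # mono is an assumption of this sketch
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(rate_hz)
        wav.writeframes(pcm)


# Usage: an in-memory buffer is used here; a file path works the same way.
buffer = io.BytesIO()
write_pcm16(buffer, [0.0, 0.25, -1.0, 2.0], 16000)
```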

RETENTION OF SUBMITTED SYNTHETIC SPEECH SAMPLES

  • Any examples that you submit for evaluation will be retained by the Blizzard organisers for future use.
  • You must include in your submission of the test sentences a statement of whether you give the organisers permission to publicly distribute your waveforms and the corresponding listening test results in anonymised form. In the past, all participants have agreed to this and we strongly encourage you to give this consent.

LISTENING TEST

Formal listening tests will be conducted to evaluate the synthetic speech submitted. Whilst the task is to synthesise speech suitable for reading an audiobook to children, the listening test will likely also evaluate the performance of the voice in terms of naturalness and intelligibility on other types of material (i.e., as in most previous Blizzard Challenges).

PAPER

  • Each team will be expected to submit a paper to ASRU and at least one member of the team should attend the workshop to present the paper.

HOW ARE THESE RULES ENFORCED?

  • This is a challenge, which is designed to answer scientific questions, and not a competition. Therefore, we rely on your honesty in preparing your entry.