Blizzard Challenge 2023 Rules
Latest revision as of 16:32, 16 June 2023
DATABASE ACCESS
- Please see the main Blizzard 2023 page.
REGISTRATION FEE
- A registration fee of 950 EUR is payable by all participants who wish to submit synthetic speech for evaluation, to offset the costs of running the challenge, including paying local assistants and listeners. The fee must be paid by 7th April 2023. Each fee covers the entry from one participating team; that entry can comprise either or both of the tasks below.
- Under no circumstances should you pay the fee until your request to participate has been accepted by the organisers. But we strongly recommend paying this fee only after you have decided to submit synthetic speech (e.g., after completing your own internal evaluation prior to submission). We cannot issue refunds.
- Payment of the fee does not guarantee that your system will be included in the evaluation. We may exclude very low quality entries, at our discretion, to prevent them skewing the listening test (and wasting listener effort). Dealing with such entries still consumes resources, and therefore the entry fee will not be refunded.
- You should pay this fee using the CNRS Azur Colloque online payment system at https://www.azur-colloque.fr/DR11/inscription/preinscription/258. This is a two-step registration process; please carefully follow the instructions that were sent to participants by email. After doing this, you will receive a confirmation email from the payment system. This email serves as your invoice and your receipt to show that you have paid that invoice. Please forward this email to blizzard-challenge-organisers@googlegroups.com to notify us that you have paid.
LISTENERS
- When the evaluation starts, each participant must try to recruit at least ten volunteer listeners. If possible, these should be people who have some professional knowledge of synthetic speech.
NAIVE LISTENERS
- When the evaluation starts, each participant should try to recruit as many naive listeners (with no professional knowledge of synthetic speech) as possible. They do not have to be native speakers.
- The organisers would also appreciate assistance in advertising both the Challenge and the listening test as widely as possible (e.g., to your students or colleagues).
MATERIALS PROVIDED
All participants will have access to the following material:
- Training data (See the main Blizzard page for download):
- Main corpus (for Hub task): an estimated 50 hours of speech data from a female native speaker of French (NEB)
- Small corpus (for Spoke task): an estimated 2 hours of speech data from a second female native speaker of French (AD)
- Aligned text transcriptions for all materials. Hand-checked aligned phonetic transcriptions are also available for a subset of NEB (30 hours) and for all AD utterances.
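Since the stated corpus sizes are estimates, you may want to verify how much audio you actually downloaded. A minimal sketch using Python's standard wave module; the directory argument is whatever folder holds the corpus WAV files on your machine:

```python
import wave
from pathlib import Path

def total_hours(corpus_dir: str) -> float:
    """Sum the duration of all WAV files under corpus_dir, in hours."""
    seconds = 0.0
    for path in Path(corpus_dir).rglob("*.wav"):
        with wave.open(str(path), "rb") as wav:
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 3600.0
```

For example, `total_hours("NEB")` should come out near the estimated 50 hours for the main corpus (the folder name is an assumption about your local layout).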
THE CHALLENGE
Participants involved in joint projects or consortia who wish to submit multiple systems (e.g., an individual entry and a joint system) should contact the organisers in advance to agree on this. We will try to accommodate all reasonable requests, provided the listening test remains manageable.
Tasks
Participants may submit an entry for either or both tasks
- Hub task 2023-FH1 - French TTS: The Hub task is to build a voice from the provided French data (NEB), using only publicly available data.
- Spoke task 2023-FS1 - Speaker adaptation: The Spoke task is to build a voice from the provided French data that is as close to AD as possible.
USE OF EXTERNAL DATA AND MODELS
Definitions
- "External data" is defined as data, of any type, that is not part of the provided database.
- "External model" is defined as a model, of any type, that has not been trained by the team (e.g., pre-trained wav2vec, HuBERT, etc.).
- We define research as fully reproducible if:
- Any external models used are publicly available off-the-shelf pre-trained models, and references are given
- Any audio data used for training models (including for fine-tuning pre-trained models) is publicly available and reported
- Source code is provided
Requirements
We ask that:
- Hub task: points 1 and 2 of reproducibility are fulfilled (publicly available external models and external audio data)
- Spoke task: no requirements
Nevertheless, full reproducibility is highly encouraged for all tasks (including providing source code), and we will report which systems are fully reproducible in our summary paper.
Additionally:
- Use of external data and external models is entirely optional
- You must use the provided audio files
- You must not use any additional speech data from the same speakers
- You may exclude any parts of the provided databases if you wish
- There is no limitation on the amount of external non-audio data you may use (e.g., text, dictionaries)
- Use of any provided transcriptions is optional.
If you are in any doubt about how to apply these rules, please contact the organizers for clarification.
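One convenient way to keep track of the three reproducibility points while you build your system is a small machine-readable manifest. This is purely a hypothetical sketch; the field names are our own suggestion, not an official Blizzard format:

```python
import json

# Hypothetical reproducibility manifest covering the three points above:
# external models (with references), external audio data, and source code.
manifest = {
    "team": "ExampleTeam",          # placeholder name
    "task": "2023-FH1",
    "external_models": [
        {"name": "wav2vec 2.0", "public": True, "reference": "Baevski et al., 2020"},
    ],
    "external_audio_data": [
        {"name": "some public French corpus", "public": True},
    ],
    "source_code_released": False,  # points 1 and 2 met, point 3 not
}

print(json.dumps(manifest, indent=2))
```

Attaching something like this to your system description makes it easy for the organisers to see which reproducibility points your entry satisfies.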
SYNTHESISING THE TEST EXAMPLES
The test corpus will be released on March 27th, and synthesised utterances should be submitted before April 3rd.
- For both 2023-FH1 and 2023-FS1, participants will be asked to synthesise a large test corpus, from which test utterances will be sampled for evaluation.
- The exact nature of the test set will not be revealed in advance. It will include paragraphs and isolated sentences, some of which include heterophonic homographs. A heterophonic homograph is "one of two or more words spelled alike but different in meaning or pronunciation", such as "mon fils" (\mɔ̃ fis\, my son) vs. "des fils" (\de fil\), the plural of "fil" (\fil\, a thread or wire). A more detailed description and examples can be found in the Homographs.pdf document.
- The test corpus will be provided as grapheme transcriptions only.
- Synthetic speech must be at 16 bits per sample, and may be submitted at 16 kHz, 22.05 kHz, 44.1 kHz or 48 kHz. Waveforms will not be downsampled for the listening test.
- Registered participants can download the test set from: https://www.synsig.org/index.php/Blizzard_Challenge_2023#Test_set
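Before submitting, it is worth checking that every generated file actually meets the format requirements above (16 bits per sample, one of the four permitted sample rates). A minimal sketch using Python's standard wave module:

```python
import wave

# Sample rates permitted for submission, per the rules above.
ALLOWED_RATES = {16000, 22050, 44100, 48000}

def check_submission(path: str) -> None:
    """Raise ValueError if a WAV file violates the required format."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:  # 2 bytes = 16 bits per sample
            raise ValueError(f"{path}: expected 16-bit samples, "
                             f"got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() not in ALLOWED_RATES:
            raise ValueError(f"{path}: sample rate {wav.getframerate()} "
                             f"is not one of {sorted(ALLOWED_RATES)}")
```

Running this over the whole submission directory before uploading catches format mistakes that could otherwise get an entry excluded.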
RETENTION OF SUBMITTED SYNTHETIC SPEECH SAMPLES
- Any examples that you submit for evaluation will be retained by the Blizzard organisers for future use.
- You must include in your submission of the test sentences a statement of whether you give the organisers permission to publicly distribute your waveforms and the corresponding listening test results in anonymised form. In the past, all participants have agreed to this and we strongly encourage you to give this consent.
LISTENING TEST
Formal listening tests will be conducted to evaluate the synthetic speech submitted. The listening test will likely evaluate the performance of the voice in terms of naturalness, intelligibility, comprehensibility, and speaker similarity on various types of material (i.e., as in most previous Blizzard Challenges).
USE OF RESULTS
The Blizzard Challenge is a scientific exercise. You may use the results only for scientific research purposes. Specifically, you may NOT use the results (e.g., your team's ranking) for any commercial purposes, including but not limited to advertising products or services.
PAPER
- Each participant MUST submit a six-page paper (conforming to the Interspeech 2023 template, available at https://interspeech2023.org/author-resources/, except with up to 5 pages of content and the 6th page used exclusively for references) describing their entry for review. Please email your paper to blizzard-challenge-organisers@googlegroups.com
- Papers should describe the system, as well as the use of:
- external data, if any (e.g., other speech or text corpora)
- existing tools, software and models (e.g., text analysers, Festival, HTS, Merlin, WaveNet, Tacotron, word2vec, ...)
- To do so, please take time to address the full checklist of reproducible research that is given in section 1.1.2 of the Interspeech 2023 template.
- One of the AUTHORS of each accepted paper MUST present it at the Blizzard 2023 Workshop
- If you are unable to comply with this requirement, do not enter the challenge.
- In addition, each participant will be expected to complete a form giving the general technical specification of their system, to facilitate easy cross-system comparisons (e.g., is it unit selection? does it predict prosody? etc.)
HOW ARE THESE RULES ENFORCED?
- This is a challenge, which is designed to answer scientific questions, and not a competition. Therefore, we rely on your honesty in preparing your entry.