Software
== Full system ==
=== Multilingual ===
==== Festival ====
[http://www.cstr.ed.ac.uk/projects/festival/ Festival] offers a general framework for building speech synthesis systems, as well as examples of various modules. As a whole, it offers full text-to-speech through a number of APIs: from the shell level, through a Scheme command interpreter, as a C++ library, from Java, and via an Emacs interface. Festival is multilingual (currently English (British and American) and Spanish), though English is the most advanced. Tools and documentation for building new voices are available through Carnegie Mellon's [http://festvox.org FestVox] project.

* Last update: 2015/01/06

* Link: http://www.cstr.ed.ac.uk/downloads/festival/2.4/

* Reference:

    @article{black2001festival,
      title        = {The festival speech synthesis system, version 1.4.2},
      author       = {Black, Alan and Taylor, Paul and Caley, Richard and Clark, Rob and Richmond, Korin and King, Simon and Strom, Volker and Zen, Heiga},
      journal      = {Unpublished document available via http://www.cstr.ed.ac.uk/projects/festival.html},
      year         = {2001}
    }
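
Festival's shell-level interface includes the text2wave program. A minimal sketch of driving it from Python, assuming festival and text2wave are installed with a default voice; file names are placeholders:

    import subprocess

    # Write a sentence to a text file, then render it to a waveform.
    with open("hello.txt", "w") as f:
        f.write("Hello world, this is Festival speaking.")

    # text2wave <options> <textfile>; -o names the output waveform.
    subprocess.run(["text2wave", "-o", "hello.wav", "hello.txt"], check=True)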


==== FreeTTS ====
[http://freetts.sourceforge.net/docs/ FreeTTS] is a speech synthesis system written entirely in the Java programming language. It is based upon Flite, a small run-time speech synthesis engine developed at Carnegie Mellon University. Flite is derived from the Festival Speech Synthesis System from the University of Edinburgh and the FestVox project from Carnegie Mellon University.

* Last update: 2009-03-09

* Link: http://freetts.sourceforge.net/docs/index.php

* Reference:

    @misc{walker2010freetts,
      title        = {Freetts 1.2: A speech synthesizer written entirely in the Java programming language},
      author       = {Walker, Willie and Lamere, Paul and Kwok, Philip},
      year         = {2010}
    }


==== MBROLA ====
The aim of the [http://tcts.fpms.ac.be/synthesis/mbrola.html MBROLA] project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons (Belgium), is to obtain a set of diphone-based speech synthesizers for as many languages as possible, and to provide them free of charge for non-commercial applications.

* Last update:

* Link: http://tcts.fpms.ac.be/synthesis/mbrola.html

* Reference:

    @inproceedings{dutoit1996mbrola,
      title        = {The MBROLA project: Towards a set of high quality speech synthesizers free of use for non commercial purposes},
      author       = {Dutoit, Thierry and Pagel, Vincent and Pierret, Nicolas and Bataille, Fran{\c{c}}ois and Van der Vrecken, Olivier},
      booktitle    = {Proceedings of the International Conference on Spoken Language Processing (ICSLP)},
      volume       = {3},
      pages        = {1393--1396},
      year         = {1996},
      organization = {IEEE}
    }
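
MBROLA is driven by .pho files: one phone per line, giving the phone symbol, its duration in milliseconds, and optional (position %, pitch Hz) pairs. A minimal sketch of calling the command-line synthesizer from Python; the phone symbols and the "us1" voice database are assumptions:

    import subprocess

    # "phone duration_ms [position_% pitch_Hz]..." -- one phone per line.
    phones = ["_ 100", "h 60 50 120", "@ 80 50 125",
              "l 70 50 130", "o 160 50 118", "_ 100"]
    with open("hello.pho", "w") as f:
        f.write("\n".join(phones) + "\n")

    # Usage: mbrola <voice database> <input .pho> <output .wav>
    subprocess.run(["mbrola", "us1", "hello.pho", "hello.wav"], check=True)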
 
==== MARY ====
MARY is a multilingual (German, English, Tibetan) and multi-platform (Windows, Linux, Mac OS X and
Solaris) speech synthesis system. It comes with an easy-to-use installer; no technical expertise
should be required for installation. It enables expressive speech synthesis, using both diphone and
unit-selection synthesis.
 
* Last update: 2017/09/26
 
* Link: http://mary.dfki.de/
 
* Reference:
 
    @article{schroder2003german,
      title        = {The German text-to-speech synthesis system MARY: A tool for research, development and teaching},
      author       = {Schr{\"o}der, Marc and Trouvain, J{\"u}rgen},
      journal      = {International Journal of Speech Technology},
      volume      = {6},
      number      = {4},
      pages        = {365--377},
      year        = {2003},
      publisher    = {Springer}
    }
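
Recent MaryTTS releases also expose an HTTP server interface (by default on port 59125). A minimal client sketch, assuming a locally running server with an en_US voice installed; the endpoint and parameter names follow the MaryTTS HTTP interface:

    import urllib.parse
    import urllib.request

    # Ask a running MaryTTS server to render a sentence as a WAV file.
    params = urllib.parse.urlencode({
        "INPUT_TEXT": "Hello from MARY.",
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE",
        "LOCALE": "en_US",
    })
    url = "http://localhost:59125/process?" + params
    with urllib.request.urlopen(url) as resp, open("mary.wav", "wb") as f:
        f.write(resp.read())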
 
==== AhoTTS ====
Text-to-speech converter for Basque, Spanish, Catalan, Galician and English.
It includes linguistic processing and built voices for all of the aforementioned languages. Its acoustic engine is based on hts_engine and it uses a high-quality vocoder called AhoCoder.
 
* Last update: 2015/07/15
 
* Link: https://sourceforge.net/projects/ahottsmultiling/
 
=== Language specific ===
==== AhoTTS (Basque & Spanish) ====
Text-to-speech converter for Basque and Spanish. It includes
linguistic processing and built voices for the aforementioned
languages. Its acoustic engine is based on hts_engine and it uses
a high-quality vocoder called AhoCoder.
 
* Last update: 2016/04/07
 
* Link: https://sourceforge.net/projects/ahotts
 
* Link2: https://sourceforge.net/projects/ahottsiparrahotsa/ (for the Lapurdian dialect of Basque)
 
* Reference:
 
    @inproceedings{hernaez2001description,
      title        = {Description of the ahotts system for the basque language},
      author      = {Hernaez, Inma and Navas, Eva and Murugarren, Juan Luis and Etxebarria, Borja},
      booktitle    = {Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis},
      year        = {2001}
    }
 
==== RHVoice (Russian) ====
RHVoice is a free and open source speech synthesizer.
 
* Last update: 2017/09/24
 
* Link: https://github.com/Olga-Yakovleva/RHVoice
 
== Front end (NLP part) ==
=== Front end inc G2P ===
==== SiRE ====
(Si)mply a (Re)search front-end for text-to-speech synthesis.
It is incomplete, inconsistent, badly coded and slow,
but it is useful for me and should slowly develop into something useful to others.
 
* Last update: 2016/10/11
 
* Link: https://github.com/RasmusD/SiRe
 
==== Phonetisaurus ====
This repository contains scripts suitable for training, evaluating and using grapheme-to-phoneme models for speech recognition using the OpenFst framework. The current build requires OpenFst version 1.6.0 or later; the repository's examples use version 1.6.2.

The repository includes C++ binaries suitable for training, compiling, and evaluating G2P models. It also includes some simple Python bindings which may be used to extract individual multigram scores and alignments, and to dump the raw lattices in .fst format for each word.
 
* Last update: 2017/09/17
 
* Link: https://github.com/AdolfVonKleist/Phonetisaurus
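
A sketch of applying a trained model from Python via the phonetisaurus-apply helper; the flag names follow the repository README, and the model and word-list paths are placeholders:

    import subprocess

    # One word per line in the input list.
    with open("words.txt", "w") as f:
        f.write("hello\nsynthesis\n")

    out = subprocess.run(
        ["phonetisaurus-apply", "--model", "model.fst",
         "--word_list", "words.txt"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout)  # lines like "hello<TAB>HH AH L OW"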
 
==== Ossian ====
Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision. Work on it started with funding from the EU FP7 project Simple4All, and this repository contains a version which is considerably more up to date than that previously available. In particular, the original version of the toolkit relied on HTS to perform acoustic modelling. Although it is still possible to use HTS, it now supports the use of neural nets trained with the Merlin toolkit as duration and acoustic models. All comments and feedback about ways to improve it are very welcome.
 
* Last update: 2017/09/15
 
* Link: https://github.com/CSTR-Edinburgh/Ossian
 
==== SALB ====
The SALB system is a software framework for speech synthesis using HMM-based voice models built by HTS (http://hts.sp.nitech.ac.jp/). See a more general description at http://m-toman.github.io/SALB/.
 
The package currently includes:
 
A C++ framework that abstracts the backend functionality and provides a SAPI5 interface, a command line interface and a C++ API.
 
Backend functionality is provided by
 
* an internal text analysis module for (Austrian) German,
 
* flite as text analysis module for English and
 
* hts_engine for parameter generation/synthesis (see COPYING for information on third-party libraries).
 
Also included is an Austrian German male voice model.
 
* Last update: 2016/11/14
 
* Link: https://github.com/m-toman/SALB
 
==== Sequence-to-Sequence G2P toolkit ====
The tool does grapheme-to-phoneme (G2P) conversion using a recurrent neural network (RNN) with long short-term memory (LSTM) units. LSTM sequence-to-sequence models have been successfully applied to various tasks, including machine translation [1] and grapheme-to-phoneme conversion [2].

This implementation is based on TensorFlow (Python), which allows efficient training on both CPU and GPU.
 
* Last update: 2017/03/28
 
* Link: https://github.com/cmusphinx/g2p-seq2seq
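
A sketch of decoding a word list from Python; the exact flag names have varied between releases of the tool, so treat them as assumptions to be checked against the README:

    import subprocess

    # Decode a word list with a trained model directory (placeholder paths).
    subprocess.run(
        ["g2p-seq2seq", "--decode", "words.txt", "--model", "g2p_model"],
        check=True,
    )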
 
=== Text normalization ===
==== Sparrowhawk ====
Sparrowhawk is an open-source implementation of Google's Kestrel text-to-speech
text normalization system.  It follows the discussion of the Kestrel system as
described in:
 
Ebden, Peter and Sproat, Richard. 2015. The Kestrel TTS text normalization
system. Natural Language Engineering, Issue 03, pp 333-353.
 
After sentence segmentation (sentence_boundary.h), the individual sentences are
first tokenized, with each token being classified, and then passed to the
normalizer. The system can output an unannotated string of words; richer
annotation with links between input tokens, their input-string positions, and
the output words is also available.
 
 
* Last update: 2017/07/25
 
* Link: https://github.com/google/sparrowhawk
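
For orientation, a toy Python sketch of the tokenize/classify/verbalize flow described above; it only illustrates the Kestrel-style pipeline and is unrelated to Sparrowhawk's actual C++ API:

    import re

    ONES = ["zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"]

    def classify(token):
        # Toy semiotic-class assignment: digit strings become CARDINAL.
        return "CARDINAL" if token.isdigit() else "PLAIN"

    def verbalize(token, cls):
        if cls == "CARDINAL":
            # Digit-by-digit reading; real systems use full number grammars.
            return " ".join(ONES[int(d)] for d in token)
        return token.lower()

    def normalize(sentence):
        tokens = re.findall(r"\w+|\S", sentence)
        return " ".join(verbalize(t, classify(t)) for t in tokens)

    print(normalize("Flight 370 departs at 9"))
    # -> flight three seven zero departs at nine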
 
==== ASRT ====
This is the README for the Automatic Speech Recognition Tools.
 
This project contains various scripts in order to facilitate the preparation of ASR related tasks.
 
Current tasks are:

# Sentence extraction from PDF files
# Sentence classification by language
# Sentence filtering and cleaning

Sentences can be extracted in single-document or batch mode.

For an example of how to extract sentences in batch mode, please have a look at the run_data_preparation_task.sh script located in the examples/bash directory.

For an example of how to extract sentences in single-document mode, please have a look at the run_data_preparation.sh script located in the examples/bash directory.

There is also an API for use in Python code; it is located in the common package and is called DataPreparationAPI.py.
 
* Last update: 2017/09/20
* Link: https://github.com/idiap/asrt
 
 
==== IRISA text normalizer ====
Text normalisation tools from IRISA lab.
 
The tools provided here are split into 3 steps:
 
# Tokenisation (adding blanks around punctuation marks, dealing with special cases like URLs, etc.; see the sketch below)
# Generic normalisation (leading to homogeneous texts where (almost) no information has been lost and where tags have been added for some entities)
# Specific normalisation (projection of the generic texts into specific forms)
 
* Last update: 2018/01/09
* Link: https://github.com/glecorve/irisa-text-normalizer
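
A toy Python sketch of the tokenisation step (spacing out punctuation while protecting URLs); it merely illustrates the idea and stands in for the actual tools in the repository:

    import re

    URL = re.compile(r"https?://[\w./-]+")

    def tokenize(text):
        # Shield URLs so their internal punctuation is not split.
        urls = []
        def stash(m):
            urls.append(m.group(0))
            return "__URL%d__" % (len(urls) - 1)
        text = URL.sub(stash, text)
        # Add blanks around punctuation marks.
        text = re.sub(r'([.,;:!?()"])', r" \1 ", text)
        # Restore the URLs.
        for i, u in enumerate(urls):
            text = text.replace("__URL%d__" % i, u)
        return " ".join(text.split())

    print(tokenize("See http://www.irisa.fr, then read the README."))
    # -> See http://www.irisa.fr , then read the README .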
 
=== Dictionary related tools ===
==== CMU Pronunciation Dictionary Tools ====
Tools for working with the CMU Pronunciation Dictionary.
 
* Last update: 2015/02/23
 
* Link: https://github.com/cmusphinx/cmudict-tools
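
The dictionary itself is a plain-text format: one entry per line, the word (with an optional "(2)"-style suffix for alternative pronunciations) followed by its ARPAbet phones with stress digits, and comment lines starting with ";;;". A small, tool-independent reading sketch; the file name is a placeholder:

    import re
    from collections import defaultdict

    def load_cmudict(path):
        lexicon = defaultdict(list)
        with open(path, encoding="latin-1") as f:
            for line in f:
                if line.startswith(";;;") or not line.strip():
                    continue
                head, *phones = line.split()
                word = re.sub(r"\(\d+\)$", "", head)  # drop variant index
                lexicon[word.lower()].append(phones)
        return lexicon

    lex = load_cmudict("cmudict-0.7b")
    print(lex["hello"])  # e.g. [['HH', 'AH0', 'L', 'OW1']]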
 
==== ISS scripts for dictionary maintenance ====
These scripts are sufficient to convert the distributed forms of dictionaries into forms useful for our tools (notably HTK and ISS). Once a dictionary is in a standard form, the generic tools in ISS can be used to manipulate it further.
 
* Last update: 2017/07/04
 
* Link: https://github.com/idiap/iss-dicts
 
== Backend (Acoustic part) ==
=== Unit selection ===
 
=== HMM based ===
==== MAGE ====
MAGE is a C/C++ software toolkit for reactive implementation of HMM-based speech and singing synthesis.
 
* Last update: 2014/07/18
 
* Link: https://github.com/numediart/mage
 
==== HMM-Based Speech Synthesis System (HTS) ====
The basic core system of HTS, available from NITECH, was implemented as a modified version of HTK
together with SPTK (see below), and is released as the HMM-Based Speech Synthesis System (HTS) in the
form of a patch to HTK.
 
* Last update:  2016/12/25
 
* Link: http://hts.sp.nitech.ac.jp/
 
==== HTS Engine ====
hts_engine is a small run-time synthesis engine (less than 1 MB including acoustic models), which
can run without the HTK library. The current version does not include any text analyzer, but the
Festival Speech Synthesis System can be used as a text analyzer.
 
* Last update: 2015/12/25
 
* Link: http://hts-engine.sourceforge.net/
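
A minimal sketch of running the engine from Python, assuming an hts_engine binary on the PATH, a trained .htsvoice model, and a full-context label file (all file names are placeholders):

    import subprocess

    # Synthesize a waveform from full-context labels with a trained voice.
    subprocess.run(
        ["hts_engine",
         "-m", "voice.htsvoice",   # acoustic model
         "-ow", "sentence.wav",    # output RIFF waveform
         "sentence.lab"],          # input full-context label file
        check=True,
    )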
 
=== DNN based ===
==== MERLIN ====
Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).
 
The system is written in Python and relies on the Theano numerical computation library.
 
Merlin comes with recipes (in the spirit of the Kaldi automatic speech recognition toolkit) to show you how to build state-of-the-art systems.
 
* Last update: 2017/09/29
 
* Link: http://www.cstr.ed.ac.uk/projects/merlin
 
* Reference:
 
    @inproceedings{wu2016merlin,
      title          = {Merlin: An open source neural network speech synthesis system},
      author        = {Wu, Zhizheng and Watts, Oliver and King, Simon},
      booktitle      = {Proceedings of the Speech Synthesis Workshop (SSW)},
      year          = {2016}
    }
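
To make the setup concrete: the networks regress from frame-level linguistic feature vectors to frame-level vocoder parameters. A toy numpy sketch of that regression on random placeholder data (purely illustrative; this is not Merlin's API, and all dimensions are made up):

    import numpy as np

    rng = np.random.default_rng(0)

    # 1000 frames: 300-dim linguistic features -> 60-dim vocoder parameters
    # (in a real system: mel-cepstrum, F0 and aperiodicity per frame).
    X = rng.standard_normal((1000, 300))
    Y = rng.standard_normal((1000, 60))

    # One hidden layer, trained with gradient descent on mean squared error.
    W1 = rng.standard_normal((300, 512)) * 0.01
    W2 = rng.standard_normal((512, 60)) * 0.01
    lr = 1e-3
    for step in range(200):
        H = np.tanh(X @ W1)                      # hidden activations
        P = H @ W2                               # predicted vocoder parameters
        G = (P - Y) / X.shape[0]                 # scaled prediction error
        gW1 = X.T @ ((G @ W2.T) * (1 - H ** 2))  # backprop through tanh
        gW2 = H.T @ G
        W1 -= lr * gW1
        W2 -= lr * gW2

    print(float(((np.tanh(X @ W1) @ W2 - Y) ** 2).mean()))  # training MSE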
 
==== IDLAK ====
Idlak is a project to build an end-to-end parametric TTS
system within Kaldi, to be distributed with the same licence.
 
It contains a robust front-end, voice building tools, speech analysis
utilities, and DNN tools suitable for parametric synthesis. It also contains
an example of using Idlak as an end-to-end TTS system, in egs/tts_dnn_arctic/s1.

Note that the Kaldi structure has been maintained and the tool building
procedure is identical.
 
* Last update: 2017/07/03
 
* Link: https://github.com/bpotard/idlak
 
* Reference:
 
    @inproceedings{potard2016idlak,
      title        = {Idlak Tangle: An Open Source Kaldi Based Parametric Speech Synthesiser Based on DNN.},
      author      = {Potard, Blaise and Aylett, Matthew P and Baude, David A and Motlicek, Petr},
      booktitle    = {Proceedings of Interspeech},
      pages        = {2293--2297},
      year        = {2016}
    }
 
==== CURRENNT scripts ====
Scripts and examples for the modified CURRENNT toolkit.
 
* Last update: 2017/08/27
 
* Link: https://github.com/TonyWangX/CURRENNT_SCRIPTS
 
=== Wavenet based ===
==== tensorflow-wavenet ====
A TensorFlow implementation of DeepMind's WaveNet paper
 
* Last update: 2017/05/23
 
* Link: https://github.com/ibab/tensorflow-wavenet
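
The core building block of WaveNet is the dilated causal convolution: each output sample depends only on past samples, at spacings that double from layer to layer. A small numpy sketch of the idea (illustrative only, unrelated to this repository's code):

    import numpy as np

    def dilated_causal_conv(x, w, dilation):
        # y[t] = sum_k w[k] * x[t - k*dilation]; implicit left zero-padding
        # keeps the operation causal.
        y = w[0] * x.copy()
        for k in range(1, len(w)):
            shift = k * dilation
            y[shift:] += w[k] * x[:-shift]
        return y

    x = np.random.randn(32)
    w = np.array([0.6, 0.4])   # width-2 filter, as in WaveNet
    for d in (1, 2, 4, 8):     # doubling dilations widen the receptive field
        x = np.tanh(dilated_causal_conv(x, w, d))
    print(x[:4])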
 
=== Other ===
 
== End-to-end (text to audio) ==
=== barronalex/Tacotron ===
Implementation of Google's Tacotron in TensorFlow
 
* Last update: 2017/08/08
 
* Link: https://github.com/barronalex/Tacotron
 
=== keithito/tacotron ===
A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model
 
* Last update: 2017/11/06
 
* Link: https://github.com/keithito/tacotron
 
=== Char2Wav: End-to-End Speech Synthesis ===
This repo has the code for our ICLR submission:
 
Jose Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio. Char2Wav: End-to-End Speech Synthesis.
 
The website is [http://www.josesotelo.com/speechsynthesis/ here].
 
* Last update: 2017/02/28
 
* Link: https://github.com/sotelo/parrot
 
* Reference:
 
    @inproceedings{sotelo2017char2wav,
      title        = {Char2Wav: End-to-end speech synthesis},
      author      = {Sotelo, Jose and Mehri, Soroush and Kumar, Kundan and Santos, Joao Felipe and Kastner, Kyle and Courville, Aaron and Bengio, Yoshua},
      year        = {2017},
      booktitle    = {Proceedings of International Conference on Learning Representations (ICLR)}
    }
 
== Signal processing ==
=== Vocoder, Glottal modelling ===
==== STRAIGHT ====
STRAIGHT is a tool for flexibly manipulating voice quality, timbre, pitch, speed and other
attributes. It is an always-evolving system for attaining sound quality close to that of the
original natural speech, by introducing advanced signal processing algorithms and findings in
computational aspects of auditory processing.

STRAIGHT decomposes sounds into source information and resonator (filter) information. This
conceptually simple decomposition makes it easy to conduct experiments on speech perception using
STRAIGHT (the initial design objective of this tool) and to interpret experimental results in terms
of a huge body of classical studies.
 
* Last update:
 
* Link: http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html
 
* Reference:
 
    @article{Kawahara1999,
      author       = {Kawahara, Hideki and Masuda-Katsuse, Ikuyo and de Cheveign{\'e}, Alain},
      year         = {1999},
      journal      = {Speech Communication},
      volume       = {27},
      pages        = {187--207},
      title        = {Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds}
    }
 
==== World ====
WORLD is free software for high-quality speech analysis, manipulation and synthesis. It can estimate the fundamental frequency (F0), aperiodicity and spectral envelope, and can also resynthesize speech that sounds like the input using only the estimated parameters.

This source code is released under the modified BSD license. None of the algorithms in WORLD is covered by a patent.
 
* Last update: 2017/08/23
 
* Link: https://github.com/mmorise/World
 
* Reference:
 
    @article{morise2016world,
      title        = {WORLD: A vocoder-based high-quality speech synthesis system for real-time applications},
      author      = {Morise, Masanori and Yokomori, Fumiya and Ozawa, Kenji},
      journal      = {IEICE TRANSACTIONS on Information and Systems},
      volume      = {99},
      number      = {7},
      pages        = {1877--1884},
      year        = {2016},
      publisher    = {The Institute of Electronics, Information and Communication Engineers}
    }
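
On the Python side, the separate pyworld wrapper exposes WORLD's analysis and synthesis functions. A minimal analysis/resynthesis sketch, assuming the pyworld and soundfile packages are installed:

    import numpy as np
    import pyworld as pw
    import soundfile as sf

    x, fs = sf.read("input.wav")           # mono waveform
    x = np.ascontiguousarray(x, dtype=np.float64)

    f0, t = pw.dio(x, fs)                  # raw F0 track
    f0 = pw.stonemask(x, f0, t, fs)        # F0 refinement
    sp = pw.cheaptrick(x, f0, t, fs)       # spectral envelope
    ap = pw.d4c(x, f0, t, fs)              # aperiodicity

    y = pw.synthesize(f0, sp, ap, fs)      # resynthesis from the three streams
    sf.write("resynth.wav", y, fs)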
 
==== Covarep - A Cooperative Voice Analysis Repository for Speech Technologies ====
Covarep is an open-source repository of advanced speech processing algorithms
and is stored as a GitHub project (https://github.com/covarep/covarep) where
researchers in speech processing can store original implementations of published
algorithms.
 
Over the past few decades a vast array of advanced speech processing algorithms
have been developed, often offering significant improvements over the existing
state-of-the-art. Such algorithms can have a reasonably high degree of
complexity and, hence, can be difficult to accurately re-implement based on
article descriptions. Another issue is the so-called 'bug magnet effect' with
re-implementations frequently having significant differences from the original
ones. The consequence of all this has been that many promising developments
have been under-exploited or discarded, with researchers tending to stick to
conventional analysis methods.
 
By developing Covarep we are hoping to address this by encouraging authors to
include original implementations of their algorithms, thus resulting in a
single de facto version for the speech community to refer to.
 
* Last update: 2016/10/16
 
* Link: https://github.com/covarep/covarep
 
* Reference:
 
    @misc{degottex2014covarep,
      title        = {COVAREP: A Cooperative Voice Analysis Repository for Speech Technologies},
      author      = {Degottex, Gilles},
      year        = {2014}
    }
 
==== MagPhase Vocoder ====
Speech analysis/synthesis system for TTS and related applications.
 
This software is based on the method described in the paper:
 
F. Espic, C. Valentini-Botinhao, and S. King, “Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis,” in Proc. Interspeech, Stockholm, Sweden, August 2017.
 
* Last update: 2017/08/30
 
* Link: https://github.com/CSTR-Edinburgh/magphase
 
* Reference:
 
    @inproceedings{espic2017direct,
      title        = {Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis},
      author      = {Espic, Felipe and Valentini-Botinhao, Cassia and King, Simon},
      booktitle    = {Proceedings of Interspeech},
      year        = {2017}
    }
 
==== WavGenSR ====
Waveform generator based on signal reshaping for statistical parametric speech synthesis.
 
* Last update: 2017/08/30
 
* Link: https://github.com/CSTR-Edinburgh/WavGenSR
 
* Reference:
 
    @inproceedings{espic2016waveform,
      title        = {Waveform Generation Based on Signal Reshaping for Statistical Parametric Speech Synthesis.},
      author      = {Espic, Felipe and Valentini-Botinhao, Cassia and Wu, Zhizheng and King, Simon},
      booktitle    = {Proceedings of Interspeech},
      pages        = {2263--2267},
      year        = {2016}
    }
 
==== Pulse model analysis and synthesis ====
It is basically the vocoder described in:
 
G. Degottex, P. Lanchantin, and M. Gales, "A Pulse Model in Log-domain for a Uniform Synthesizer," in Proc. 9th Speech Synthesis Workshop (SSW9), 2016.
 
* Last update: 2017/09/07
 
* Link: https://github.com/gillesdegottex/pulsemodel
 
* Reference:
 
    @inproceedings{degottex2016pulse,
      title        = {A pulse model in log-domain for a uniform synthesizer},
      author      = {Degottex, Gilles and Lanchantin, Pierre and Gales, Mark},
      year        = {2016},
      booktitle    = {Proceedings of the Speech Synthesis Workshop (SSW)}
    }
 
==== YANG VOCODER: Yet-ANother-Generalized VOCODER ====
Yet another vocoder that is not STRAIGHT.
 
This project is a state-of-the-art vocoder that parameterizes the speech signal
into a representation that is amenable to statistical manipulation.
 
The VOCODER was developed by Hideki Kawahara during his internship at Google.
 
* Last update: 2017/01/02
 
* Link: https://github.com/google/yang_vocoder
 
==== Ahocoder ====
Ahocoder parameterizes speech waveforms into three different streams: log-F0, a cepstral representation of the spectral envelope, and the maximum voiced frequency. It provides high accuracy during analysis and high quality during reconstruction. It is adequate for statistical parametric speech synthesis and voice conversion. Furthermore, it can be used just for basic speech manipulation and transformation (pitch level and variance, speaking rate, vocal tract length, etc.).
 
Ahocoder is reported to be a very good complement for HTS. The output files generated by Ahocoder contain float numbers without header, so they are fully compatible with the HTS demo scripts in the HTS website. You can use the same configuration as in the STRAIGHT-based demo, using the "bap" stream to handle maximum voiced frequency (set its dimension to 1 both in data/Makefile and in scripts/Config.pm).
 
* Last update: 2014
 
* Link: http://aholab.ehu.es/ahocoder/
 
* Reference:
 
    @article{erro2014harmonics,
      title        = {Harmonics plus noise model based vocoder for statistical parametric speech synthesis},
      author      = {Erro, Daniel and Sainz, Inaki and Navas, Eva and Hernaez, Inma},
      journal      = {IEEE Journal of Selected Topics in Signal Processing},
      volume      = {8},
      number      = {2},
      pages        = {184--194},
      year        = {2014},
      publisher    = {IEEE}
    }
 
==== PhonVoc: Phonetic and Phonological vocoding ====
This is a computational platform for Phonetic and Phonological
vocoding, released under the BSD licence. See file COPYING for
details. The software is based on Kaldi (v. 489a1f5) and Idiap SSP.
For training of the analysis and synthesis models, please follow
train/README.txt.
 
* Last update: 2016/11/23
 
* Link: https://github.com/idiap/phonvoc
 
==== GlottGAN ====
Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis
 
* Last update: 2017/05/30
 
* Link: https://github.com/bajibabu/GlottGAN
 
* Reference:
 
    @inproceedings{bollepalli2017generative,
      title        = {Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis},
      author      = {Bollepalli, Bajibabu and Juvela, Lauri and Alku, Paavo},
      booktitle    = {Proceedings of Interspeech},
      pages        = {3394--3398},
      year        = {2017}
    }
 
==== Postfilt gan ====
This is an implementation of "Generative adversarial network-based postfilter for statistical parametric speech synthesis"
 
Please check the run.sh file to train the system. The testing part is not yet implemented.
 
* Last update: 2017/07/06
 
* Link: https://github.com/bajibabu/postfilt_gna
 
* Reference:
 
    @inproceedings{Kaneko2017,
      author       = {T. Kaneko and H. Kameoka and N. Hojo and Y. Ijima and K. Hiramatsu and K. Kashino},
      booktitle    = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
      title        = {Generative adversarial network-based postfilter for statistical parametric speech synthesis},
      year         = {2017},
      month        = {March},
      pages        = {4910--4914},
      doi          = {10.1109/ICASSP.2017.7953090}
    }
 
=== Pitch extractor ===
==== REAPER: Robust Epoch And Pitch EstimatoR ====
This is a speech processing system. The reaper program uses the EpochTracker class to simultaneously estimate the location of voiced-speech "epochs" or glottal closure instants (GCI), voicing state (voiced or unvoiced) and fundamental frequency (F0 or "pitch"). We define the local (instantaneous) F0 as the inverse of the time between successive GCI.
 
This code was developed by David Talkin at Google. This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.
 
* Last update: 2015/03/04
 
* Link: https://github.com/google/REAPER
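
A sketch of invoking the reaper binary from Python; the flags (-i input, -f F0 output, -p pitch-mark output, -a for ASCII output) follow the repository README but should be checked against your build, and the file names are placeholders:

    import subprocess

    # Run REAPER on a 16-bit mono WAV file, writing ASCII F0 and
    # pitch-mark (GCI) tracks.
    subprocess.run(
        ["reaper", "-i", "speech.wav",
         "-f", "speech.f0", "-p", "speech.pm", "-a"],
        check=True,
    )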
 
==== SSP - Speech Signal Processing module ====
SSP is a package for doing signal processing in Python; the functionality is biased towards speech signals. Top-level programs include a feature extractor for speech recognition, and a vocoder for both coding and speech synthesis. The vocoder is based on linear prediction, but with several experimental excitation models. A continuous pitch extraction algorithm is also provided, built around standard components and a Kalman filter.

There is a "sister" package, libssp, that includes translations of some algorithms into C++. Libssp is built around libube, which makes this translation easier.
 
SSP is released under a BSD licence. See the file COPYING for details.
 
* Last update: 2017/04/16
 
* Link: https://github.com/idiap/ssp
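
As background for the linear-prediction vocoder mentioned above, a generic numpy/scipy sketch of frame-level LPC analysis via the autocorrelation method (this is the underlying technique, not SSP's API):

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc(frame, order=16):
        # Autocorrelation method: solve the Toeplitz normal equations
        # R a = r for the prediction coefficients a.
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = solve_toeplitz(r[:order], r[1:order + 1])
        return a  # predictor: x[n] ~ sum_k a[k] * x[n-1-k]

    frame = np.sin(0.3 * np.arange(256)) * np.hanning(256)
    print(lpc(frame, order=4))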
 
=== Sample modelling ===
==== SampleRNN ====
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
 
* Last update:
 
* Link: https://github.com/soroushmehr/sampleRNN_ICLR2017
 
* Reference:
 
    @article{mehri2016samplernn,
      title        = {SampleRNN: An unconditional end-to-end neural audio generation model},
      author      = {Mehri, Soroush and Kumar, Kundan and Gulrajani, Ishaan and Kumar, Rithesh and Jain, Shubham and Sotelo, Jose and Courville, Aaron and Bengio, Yoshua},
      journal      = {arXiv preprint arXiv:1612.07837},
      year        = {2016}
    }
 
=== Toolkits ===
==== SPTK - Speech Signal Processing Toolkit ====
The main feature of the Speech Signal Processing Toolkit, available from NITECH, is that it provides
not only standard speech analysis and synthesis techniques (e.g., LPC analysis, PARCOR analysis, LSP
analysis, PARCOR synthesis filters, LSP synthesis filters, and vector quantization techniques) but
also speech analysis and synthesis techniques developed by the research group.
 
* Last update: 2016/12/25
 
* Link: http://sp-tk.sourceforge.net/
 
== Singing synthesizer ==
=== Sinsy ===
Sinsy is an HMM-based singing voice synthesis system.
 
* Last update: 2015/12/25
 
* Link: http://sinsy.sourceforge.net/
 
== Ebook reader ==
=== Bard Storyteller ebook reader ===
Bard Storyteller is a text reader.  Bard not only allows a user to read books, but can also read books to the user using text-to-speech. It supports txt, epub and (x)html files.
 
* Last update: 2014/07
 
* Link: http://festvox.org/bard/
 
== Various tools ==
=== SparkNG ===
MATLAB real-time speech tools and voice production tools.
 
* Last update: 2017/06/29
 
* Link: http://www.wakayama-u.ac.jp/~kawahara/MatlabRealtimeSpeechTools/
 
== Articulatory synthesizer ==
=== KLAIR - A virtual infant for spoken language acquisition research ===
The KLAIR project aims to build and develop a computational platform to assist research into the acquisition of spoken language. The main part of KLAIR is a sensori-motor server that displays a virtual infant on screen that can see, hear and speak. Behind the scenes, the server can talk to one or more client applications. Each client can monitor the audio-visual input to the server and can send articulatory gestures to the head for it to speak through an articulatory synthesizer. Clients can also control the position of the head and the eyes as well as setting facial expressions. By encapsulating the real-time complexities of audio and video processing within a server that will run on a modern PC, we hope that KLAIR will encourage and facilitate more experimental research into spoken language acquisition through interaction.
 
* Last update:
 
* Link: http://www.phon.ucl.ac.uk/project/klair/
 
* Reference:
 
    @inproceedings{huckvale2009klair,
      title        = {KLAIR: a virtual infant for spoken language acquisition research.},
      author      = {Huckvale, Mark and Howard, Ian S and Fagel, Sascha},
      booktitle    = {Proceedings of Interspeech},
      pages        = {696--699},
      year        = {2009}
    }
 
=== Vocaltractlab ===
VocalTractLab stands for "Vocal Tract Laboratory" and is an interactive multimedia software tool to demonstrate the mechanism of speech production. It is meant to facilitate an intuitive understanding of speech production for students of phonetics and related disciplines.

The current versions of VocalTractLab are free of charge. Only a registration code, which you can request by email, is necessary to activate the software. VocalTractLab is written for Windows operating systems (XP or higher), but a port to Linux/Unix is conceivable in the future.
 
* Last update: 2016
 
* Link: http://www.vocaltractlab.de/
 
== API/Library ==
=== Speech Tools ===
The Edinburgh Speech Tools Library is a collection of C++ classes,
functions and related programs for manipulating the sorts of objects
used in speech processing. It includes support for reading and writing
waveforms and parameter files (LPC, cepstra, F0) in various formats and
for converting between them. It also includes support for linguistic-type
objects and for various label files and n-grams (with smoothing).

In addition to the library, a number of programs are included: an
intonation library which includes a pitch tracker, smoother and
labelling system (using the Tilt labelling system), and a classification
and regression tree (CART) building program called wagon. There is also
growing support for various speech recognition classes such as
decoders and HMMs.

The Edinburgh Speech Tools Library is not an end in itself but is
designed to make the construction of other speech systems easy. For
example, it provides the underlying classes in the Festival Speech
Synthesis System.

The speech tools are currently distributed in full source form, free
for unrestricted use.
 
* Last update: 2015/01/06
 
* Link: http://www.cstr.ed.ac.uk/projects/speech_tools/
 
=== ROOTS ===
Roots is an open-source toolkit dedicated to the generation, management and
processing of annotated sequential data. It is made of a core library and a
collection of utility scripts. A rich API is available in C++ and in Perl.
 
* Last update: 2015/07/01
 
* Link: http://roots-toolkit.gforge.inria.fr/
 
* Reference:
 
    @inproceedings{chevelu:hal-00974628,
      AUTHOR       = {Chevelu, Jonathan and Lecorv{\'e}, Gw{\'e}nol{\'e} and Lolive, Damien},
      TITLE        = {ROOTS: a toolkit for easy, fast and consistent processing of large sequential annotated data collections},
      BOOKTITLE    = {Proceedings of Language Resources and Evaluation Conference (LREC)},
      YEAR        = {2014},
      ADDRESS      = {Reykjavik, Iceland},
      URL          = {http://hal.inria.fr/hal-00974628}
    }
 
== Visualization & annotation tools ==
=== Praat ===
Praat is a system for doing phonetics by computer. The computer program Praat is a research,
publication, and productivity tool for phoneticians. With it, you can analyse, synthesize, and
manipulate speech, and create high-quality pictures for your articles and theses.
 
* Last update:
 
* Link: http://www.fon.hum.uva.nl/praat/
 
* Reference:
 
    @article{boersma2006praat,
      title        = {Praat: doing phonetics by computer},
      author      = {Boersma, Paul},
      journal      = {http://www.praat.org/},
      year        = {2006}
    }
 
=== KPE ===
KPE provides a graphical interface for the implementation of the Klatt 1980 formant synthesiser. The
interface allows users to display and edit Klatt parameters using a graphical display which includes
the time-amplitude waveform of both the original speech and its synthetic copy, and some signal
analysis facilities.
 
* Last update:
 
* Link: http://www.speech.cs.cmu.edu/comp.speech/Section5/Synth/klatt.kpe80.html
 
=== Wavesurfer ===
WaveSurfer is a tool for doing speech analysis. The analysis features include formant and pitch
extraction and real-time spectrograms. The WaveSurfer tool, built on top of the Snack speech
visualization module, is highly modular and extensible at several levels.
 
* Last update:
 
* Link: https://sourceforge.net/projects/wavesurfer/
 
== Resources ==
=== Dictionary ===
==== Unisyn lexicon ====
The Unisyn lexicon is a master lexicon transcribed in keysymbols, a kind of metaphoneme which allows the encoding of multiple accents of English.
 
The lexicon is accompanied by a number of perl scripts which transform the base lexicon via phonological and allophonic rules, and other symbol changes, to produce output transcriptions in different accents. The rules can be applied to the whole lexicon, to produce an accent-specific lexicon, or to running text. Output can be displayed in keysymbols, SAMPA, or IPA.
 
The system uses a geographically-based accent hierarchy, with a tree structure describing countries, regions, towns and speakers; this hierarchy is used to specify the application of rules and other pronunciation features.
 
The lexicon system is customisable, and the documentation explains how to modify output by switching rules on and off, adding new rules or editing existing ones. The user can also add new nodes in the accent hierarchy (new accents or new speakers within an accent), or add new symbols.
 
A number of UK, US, Australian and New Zealand accents are included in the release.
 
The scripts run under Unix or Windows 98 (DOS), and use Perl 5.6.0.
 
* Last update:
 
* Link: http://www.cstr.ed.ac.uk/projects/unisyn/
 
==== Combilex ====
Combilex GA is a keyword-based lexicon for the General American pronunciation.
 
Combilex contains c. 145,000 entries, including the 20,000 most frequent words, and contains a variety of linguistic information alongside detailed pronunciations, including many useful proper names.
 
Combilex GA is an ASCII text file, one entry-per-line, which is easily adaptable for use in text-to-speech synthesis (voice-building or run-time synthesis) and in speech recognition systems.
 
Full manually notated orthographic-phonemic correspondences are included, allowing derivation of accurate grapheme-to-phoneme rules.
 
* Last update:
 
* Link: https://licensing.edinburgh-innovations.ed.ac.uk/item.php?item=combilex-ga
 
* Reference:
 
    @inproceedings{richmond2009robust,
      title        = {Robust LTS rules with the Combilex speech technology lexicon},
      author      = {Richmond, Korin and Clark, Robert AJ and Fitt, Susan},
      year        = {2009},
      booktitle    = {Proceedings of Interspeech}
    }

Latest revision as of 14:52, 30 June 2020

Full system

Multilingual

Festival

Festival offers a general framework for building speech synthesis systems as well as including examples of various modules. As a whole it offers full text to speech through a number APIs: from shell level, though a Scheme command interpreter, as a C++ library, from Java, and an Emacs interface. Festival is multi-lingual (currently English (British and American), and Spanish) though English is the most advanced. Tools and documentation for build new voices are available through Carnegie Mellon's FestVox project

  • Last update: 2015/01/06
  • Reference:
   @article{black2001festival,
     title        = {The festival speech synthesis system, version 1.4.2},
     author       = {Black, Alan and Taylor, Paul and Caley, Richard and
   		  Clark, Rob and Richmond, Korin and King, Simon and
   		  Strom, Volker and Zen, Heiga},
     journal      = {Unpublished document available via http://www.cstr.ed.ac.uk/projects/festival.html},
     year         = {2001}
   }

FreeTTS

FreeTTS is a speech synthesis system written entirely in the JavaTM programming language. It is based upon Flite: a small run-time speech synthesis engine developed at Carnegie Mellon University. Flite is derived from the Festival Speech Synthesis System from the University of Edinburgh and the FestVox project from Carnegie Mellon University.

  • Last update: 2009-03-09
  • Reference:
   @misc{walker2010freetts,
     title        = {Freetts 1.2: A speech synthesizer written entirely in the Java programming language},
     author       = {Walker, Willie and Lamere, Paul and Kwok, Philip},
     year         = {2010}
   }

MBROLA

The aim of the MBROLA project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons (Belgium), is to obtain a set of diphone-based speech synthesizers for as many languages as possible, and provide them free for non-commercial applications.

  • Last update:
  • Reference:
   @inproceedings{dutoit1996mbrola,
     title        = {The MBROLA project: Towards a set of high quality speech synthesizers free of use for non commercial purposes},
     author       = {Dutoit, Thierry and Pagel, Vincent and Pierret,
   		  Nicolas and Bataille, Fran{\c{c}}ois and Van der
   		  Vrecken, Olivier},
     booktitle    = {Proceedings of the Internal Conference of Spoken Language Processing}
     volume       = {3},
     pages        = {1393--1396},
     year         = {1996},
     organization = {IEEE}
   }

MARY

MARY is a multi-lingual (German, English, Tibetan) and multi-platform (Windows, Linux, MacOs X and Solaris) speech synthesis system. It comes with an easy-to-use installer - no technical expertise should be required for installation. It enables expressive speech synthesis, using both diphone and unit-selection synthesis.

  • Last update: 2017/09/26
  • Reference:
   @article{schroder2003german,
     title        = {The German text-to-speech synthesis system MARY: A tool for research, development and teaching},
     author       = {Schr{"o}der, Marc and Trouvain, J{"u}rgen},
     journal      = {International Journal of Speech Technology},
     volume       = {6},
     number       = {4},
     pages        = {365--377},
     year         = {2003},
     publisher    = {Springer}
   }

AhoTTS

Text-to-Speech conversor for Basque, Spanish, Catalan, Galician and English. It includes linguistic processing and built voices for all the languages aforementioned. Its acoustic engine is based on htsengine and it uses a high quality vocoder called AhoCoder.

  • Last update: 2015/07/15

Language specific

AHOTTS (Basque & spanish)

Text-to-Speech conversor for Basque and Spanish. It includes linguistic processing and built voices for the languages aforementioned. Its acoustic engine is based on htsengine and it uses a high quality vocoder called AhoCoder.

  • Last update: 2016/04/07
  • Reference:
   @inproceedings{hernaez2001description,
     title        = {Description of the ahotts system for the basque language},
     author       = {Hernaez, Inma and Navas, Eva and Murugarren, Juan Luis and Etxebarria, Borja},
     booktitle    = {Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis},
     year         = {2001}
   }

RHVoice (Russian)

RHVoice is a free and open source speech synthesizer.

  • Last update: 2017/09/24

Front end (NLP part)

Front end inc G2P

SiRE

(Si)mply a (Re)search front-end for Text-To-Speech Synthesis. This is a research front-end for TTS. It is incomplete, inconsistent, badly coded and slow. But it is useful for me and should slowly develop into something useful to others.

  • Last update: 2016/10/11

Phonetisaurus

This repository contains scripts suitable for training, evaluating and using grapheme-to-phoneme models for speech recognition using the OpenFst framework. The current build requires OpenFst version 1.6.0 or later, and the examples below use version 1.6.2.

The repository includes C++ binaries suitable for training, compiling, and evaluating G2P models. It also some simple python bindings which may be used to extract individual multigram scores, alignments, and to dump the raw lattices in .fst format for each word.

  • Last update: 2017/09/17

Ossian

Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision. Work on it started with funding from the EU FP7 Project Simple4All, and this repository contains a version which is considerable more up-to-date than that previously available. In particular, the original version of the toolkit relied on HTS to perform acoustic modelling. Although it is still possible to use HTS, it now supports the use of neural nets trained with the Merlin toolkit as duration and acoustic models. All comments and feedback about ways to improve it are very welcome.

  • Last update: 2017/09/15

SALB

The SALB system is a software framework for speech synthesis using HMM based voice models built by HTS (http://hts.sp.nitech.ac.jp/). See a more generic description on http://m-toman.github.io/SALB/.

The package currently includes:

A C++ framework that abstracts the backend functionality and provides a SAPI5 interface, a command line interface and a C++ API.

Backend functionality is provided by

  • an internal text analysis module for (Austrian) German,
  • flite as text analysis module for English and
  • htsengine for parameter generation/synthesis. (see COPYING for information on 3rd party libraries)

Also included is an Austrian German male voice model.

  • Last update: 2016/11/14

Sequence-to-Sequence G2P toolkit

The tool does Grapheme-to-Phoneme (G2P) conversion using recurrent neural network (RNN) with long short-term memory units (LSTM). LSTM sequence-to-sequence models were successfully applied in various tasks, including machine translation [1] and grapheme-to-phoneme [2].

This implementation is based on python TensorFlow, which allows an efficient training on both CPU and GPU.

  • Last update: 2017/03/28

Text normalization

Sparrowhawk

Sparrowhawk is an open-source implementation of Google's Kestrel text-to-speech text normalization system. It follows the discussion of the Kestrel system as described in:

Ebden, Peter and Sproat, Richard. 2015. The Kestrel TTS text normalization system. Natural Language Engineering, Issue 03, pp 333-353.

After sentence segmentation (sentenceboundary.h), the individual sentences are first tokenized with each token being classified, and then passed to the normalizer. The system can output as an unannotated string of words, and richer annotation with links between input tokens, their input string positions, and the output words is also available.


  • Last update: 2017/07/25

ASRT

This is the README for the Automatic Speech Recognition Tools.

This project contains various scripts in order to facilitate the preparation of ASR related tasks.

Current tasks ares:

  1. Sentences extraction from pdf files
  2. Sentences classification by langues
  3. Sentences filtering and cleaning

Document sentences can be extracted into single document or batch mode.

For an example on how to extract sentences in batch mode, please have a look at the rundatapreparationtask.sh script located in examples/bash directory.

For an example on how to extract sentences in single document mode, please have a look at the rundatapreparation.sh script located in examples/bash directory.

The is also an API to be used in python code. It is located into the common package and is called DataPreparationAPI.py


IRISA text normalizer

Text normalisation tools from IRISA lab.

The tools provided here are split into 3 steps:

  1. Tokenisation (adding blanks around punctation marks, dealing with special cases like URLs, etc.)
  2. Generic normalization (leading to homogeneous texts where (almost) information have been lost and where tags have been added for some entities)
  3. Specific normalisation (projection of the generic texts into specific forms)

Dictionary related tools

CMU Pronunciation Dictionary Tools

Tools for working with the CMU Pronunciation Dictionary

  • Last update: 2015/02/23

ISS scripts for dictionary maintenance

These scripts are sufficient to convert the distributed forms of dictionaries into forms useful for our tools (notably HTK and ISS). Once a dictionary is in a standard form, the generic tools in ISS can be used to manipulate it further.

  • Last update: 2017/07/04

Backend (Acoustic part)

Unit selection

HMM based

MAGE

MAGE is a C/C++ software toolkit for reactive implementation of HMM-based speech and singing synthesis.

  • Last update: 2014/07/18

HMM-Based Speech Synthesis System (HTS)

The basic core system of HTS, available from NITECH, was implemented as a modified version of HTK together with SPTK (see below), and is released as HMM-Based Speech Synthesis System (HTS) in a form of patch code to HTK.

  • Last update: 2016/12/25

HTS Engine

htsengine is a small run-time synthesis engine (less than 1 MB including acoustic models), which can run without the HTK library. The current version does not include any text analyzer but the Festival Speech Synthesis System can be used as a text analyzer.

  • Last update: 2015/12/25

DNN based

MERLIN

Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).

The system is written in Python and relies on the Theano numerical computation library.

Merlin comes with recipes (in the spirit of the Kaldi automatic speech recognition toolkit) to show you how to build state-of-the art systems.

  • Last update: 2017/09/29
  • Reference:
   @inproceedings{wu2016merlin,
     title          = {Merlin: An open source neural network speech synthesis system},
     author         = {Wu, Zhizheng and Watts, Oliver and King, Simon},
     booktitle      = {Proceedings of the Speech Synthesis Workshop (SSW)},
     year           = {2016}
   }

IDLAK

Idlak is a project to build an end-to-end parametric TTS system within Kaldi, to be distributed with the same licence.

It contains a robust front-end, voice building tools, speech analysis utilities, and DNN tools suitable for parametric synthesis. It also contains an example of using Idlak as an end-to-end TTS system, in egs/ttsdnnarctic/s1

Note that the kaldi structure has been maintained and the tool building procedure is identical.

  • Last update: 2017/07/03
  • Reference:
   @inproceedings{potard2016idlak,
     title        = {Idlak Tangle: An Open Source Kaldi Based Parametric Speech Synthesiser Based on DNN.},
     author       = {Potard, Blaise and Aylett, Matthew P and Baude, David A and Motlicek, Petr},
     booktitle    = {Proceedings of Interspeech},
     pages        = {2293--2297},
     year         = {2016}
   }

CURRENNT scripts

The scripts and examples on the modified CURRENNT toolkit

  • Last update: 2017/08/27

Wavenet based

tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper

  • Last update: 2017/05/23

Other

End-to-end (text to audio)

barronalex/Tacotron

Implementation of Google's Tacotron in TensorFlow

  • Last update: 2017/08/08

keithito/tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model

  • Last update: 2017/11/06

Char2Wav: End-to-End Speech Synthesis

This repo has the code for our ICLR submission:

Jose Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio. Char2Wav: End-to-End Speech Synthesis.

The website is here.

  • Last update: 2017/02/28
  • Reference:
   @inproceedings{sotelo2017char2wav,
     title        = {Char2Wav: End-to-end speech synthesis},
     author       = {Sotelo, Jose and Mehri, Soroush and Kumar, Kundan and Santos, Joao Felipe and Kastner, Kyle and Courville, Aaron and Bengio, Yoshua},
     year         = {2017},
     booktitle    = {Proceedings of International Conference on Learning Representations (ICLR)}
   }

Signal processing

Vocoder, Glottal modelling

STRAIGHT

STRAIGHT is a tool for manipulating voice quality, timbre, pitch, speed and other attributes flexibly. It is an always evolving system for attaining better sound quality, that is close to the original natural speech, by introducing advanced signal processing algorithms and findings in computational aspects of auditory processing.

STRAIGHT decomposes sounds into source information and resonator (filter) information. This conceptually simple decomposition makes it easy to conduct experiments on speech perception using STRAIGHT, the initial design objective of this tool, and to interpret experimental results in terms of huge body of classical studies.

  • Last update:
  • Reference:
   @article{Kawahara1999,
     author       = {Kawahara, Hideki and Masuda-katsuse, Ikuyo and {De Cheveigné}, Alain},
     year         = {1999},
     journal      = {Speech Communication},
     pages        = {187--207},
     title        = {Restructuring speech representations using a pitch-adaptive time frequency smoothing and an instantaneous-frequency},
     volume       = {27},
   }

World

WORLD is free software for high-quality speech analysis, manipulation and synthesis. It can estimate Fundamental frequency (F0), aperiodicity and spectral envelope and also generate the speech like input speech with only estimated parameters.

This source code is released under the modified-BSD license. There is no patent in all algorithms in WORLD.

  • Last update: 2017/08/23
  • Reference:
   @article{morise2016world,
     title        = {WORLD: A vocoder-based high-quality speech synthesis system for real-time applications},
     author       = {Morise, Masanori and Yokomori, Fumiya and Ozawa, Kenji},
     journal      = {IEICE TRANSACTIONS on Information and Systems},
     volume       = {99},
     number       = {7},
     pages        = {1877--1884},
     year         = {2016},
     publisher    = {The Institute of Electronics, Information and Communication Engineers}
   }

Covarep - A Cooperative Voice Analysis Repository for Speech Technologies

Covarep is an open-source repository of advanced speech processing algorithms and is stored as a GitHub project (https://github.com/covarep/covarep) where researchers in speech processing can store original implementations of published algorithms.

Over the past few decades a vast array of advanced speech processing algorithms have been developed, often offering significant improvements over the existing state-of-the-art. Such algorithms can have a reasonably high degree of complexity and, hence, can be difficult to accurately re-implement based on article descriptions. Another issue is the so-called 'bug magnet effect' with re-implementations frequently having significant differences from the original ones. The consequence of all this has been that many promising developments have been under-exploited or discarded, with researchers tending to stick to conventional analysis methods.

By developing Covarep we hope to address this by encouraging authors to include original implementations of their algorithms, thus providing a single de facto version for the speech community to refer to.

  • Last update: 2016/10/16
  • Reference:
   @misc{degottex2014covarep,
     title        = {COVAREP: A Cooperative Voice Analysis Repository for Speech Technologies},
     author       = {Degottex, Gilles},
     year         = {2014}
   }

MagPhase Vocoder

Speech analysis/synthesis system for TTS and related applications.

This software is based on the method described in the paper:

   F. Espic, C. Valentini-Botinhao, and S. King, “Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis,” in Proc. Interspeech, Stockholm, Sweden, August 2017.

  • Last update: 2017/08/30
  • Link: https://github.com/CSTR-Edinburgh/magphase
  • Reference:
   @inproceedings{espic2017direct,
     title        = {Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis},
     author       = {Espic, Felipe and Valentini-Botinhao, Cassia and King, Simon},
     booktitle    = {Proceedings of Interspeech},
     year         = {2017}
   }

WavGenSR

Waveform generator based on signal reshaping for statistical parametric speech synthesis.

  • Last update: 2017/08/30
  • Reference:
   @inproceedings{espic2016waveform,
     title        = {Waveform Generation Based on Signal Reshaping for Statistical Parametric Speech Synthesis.},
     author       = {Espic, Felipe and Valentini-Botinhao, Cassia and Wu, Zhizheng and King, Simon},
     booktitle    = {Proceedings of Interspeech},
     pages        = {2263--2267},
     year         = {2016}
   }

Pulse model analysis and synthesis

This is essentially the vocoder described in:

   G. Degottex, P. Lanchantin, and M. Gales, "A Pulse Model in Log-domain for a Uniform Synthesizer," in Proc. 9th Speech Synthesis Workshop (SSW9), 2016.

  • Last update: 2017/09/07
  • Link: https://github.com/gillesdegottex/pulsemodel
  • Reference:
   @inproceedings{degottex2016pulse,
     title        = {A pulse model in log-domain for a uniform synthesizer},
     author       = {Degottex, Gilles and Lanchantin, Pierre and Gales, Mark},
     year         = {2016},
     booktitle    = {Proceedings of the Speech Synthesis Workshop (SSW)}
   }

YANG VOCODER: Yet-ANother-Generalized VOCODER

Yet another vocoder that is not STRAIGHT.

This project is a state-of-the-art vocoder that converts the speech signal into a parameterization amenable to statistical manipulation.

The VOCODER was developed by Hideki Kawahara during his internship at Google.

  • Last update: 2017/01/02

Ahocoder

Ahocoder parameterizes speech waveforms into three different streams: log-F0, a cepstral representation of the spectral envelope, and the maximum voiced frequency. It provides high accuracy during analysis and high quality during reconstruction. It is well suited to statistical parametric speech synthesis and voice conversion. Furthermore, it can be used for basic speech manipulation and transformation (pitch level and variance, speaking rate, vocal tract length…).

Ahocoder is reported to be a very good complement for HTS. The output files generated by Ahocoder contain raw float numbers without a header, so they are fully compatible with the HTS demo scripts on the HTS website. You can use the same configuration as in the STRAIGHT-based demo, using the "bap" stream to handle the maximum voiced frequency (set its dimension to 1 both in data/Makefile and in scripts/Config.pm).
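
That HTS compatibility comes down to the header-less float format. Below is a minimal sketch of reading and writing such streams in Python, assuming 32-bit little-endian floats; the file names and the cepstral dimension are illustrative, since the actual order depends on how Ahocoder was run:

    import numpy as np

    lf0 = np.fromfile("utt.lf0", dtype="<f4")  # log-F0, one value per frame
    mvf = np.fromfile("utt.mvf", dtype="<f4")  # maximum voiced frequency stream
    mcep = np.fromfile("utt.mcep", dtype="<f4").reshape(-1, 40)  # assumed order

    mcep *= 0.95                               # some illustrative modification
    mcep.astype("<f4").tofile("utt_mod.mcep")  # written back, still header-less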

  • Last update: 2014
  • Reference:
   @article{erro2014harmonics,
     title        = {Harmonics plus noise model based vocoder for statistical parametric speech synthesis},
     author       = {Erro, Daniel and Sainz, Inaki and Navas, Eva and Hernaez, Inma},
     journal      = {IEEE Journal of Selected Topics in Signal Processing},
     volume       = {8},
     number       = {2},
     pages        = {184--194},
     year         = {2014},
     publisher    = {IEEE}
   }

PhonVoc: Phonetic and Phonological vocoding

This is a computational platform for phonetic and phonological vocoding, released under the BSD licence (see the file COPYING for details). The software is based on Kaldi (v. 489a1f5) and Idiap SSP. To train the analysis and synthesis models, please follow train/README.txt.

  • Last update: 2016/11/23

GlottGAN

Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

  • Last update: 2017/05/30
  • Reference:
   @inproceedings{bollepalli2017generative,
     title        = {Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis},
     author       = {Bollepalli, Bajibabu and Juvela, Lauri and Alku, Paavo},
     booktitle    = {Proceedings of Interspeech},
     pages        = {3394--3398},
     year         = {2017}
   }

Postfilt gan

This is an implementation of "Generative adversarial network-based postfilter for statistical parametric speech synthesis".

Please check the run.sh file to train the system. The testing part is not yet implemented.

  • Last update: 2017/07/06
  • Reference:
   @inproceedings{kaneko2017generative,
     author       = {T. Kaneko and H. Kameoka and N. Hojo and Y. Ijima and K. Hiramatsu and K. Kashino},
     booktitle    = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
     title        = {Generative adversarial network-based postfilter for statistical parametric speech synthesis},
     pages        = {4910--4914},
     doi          = {10.1109/ICASSP.2017.7953090},
     month        = {March},
     year         = {2017}
   }

Pitch extractor

REAPER: Robust Epoch And Pitch EstimatoR

This is a speech processing system. The reaper program uses the EpochTracker class to simultaneously estimate the location of voiced-speech "epochs" or glottal closure instants (GCIs), the voicing state (voiced or unvoiced) and the fundamental frequency (F0, or "pitch"). We define the local (instantaneous) F0 as the inverse of the time between successive GCIs.
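
A small worked example of that definition; the epoch times below are made up for illustration (REAPER itself is a C++ command-line tool):

    import numpy as np

    gci = np.array([0.100, 0.108, 0.116, 0.1245, 0.133])  # epoch times, seconds
    periods = np.diff(gci)  # time between successive GCIs
    f0 = 1.0 / periods      # local (instantaneous) F0 in Hz
    print(f0)               # approximately [125. 125. 117.6 117.6]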

This code was developed by David Talkin at Google. This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.

  • Last update: 2015/03/04

SSP - Speech Signal Processing module

SSP is a package for doing signal processing in Python; the functionality is biased towards speech signals. Top-level programs include a feature extractor for speech recognition, and a vocoder for both coding and speech synthesis. The vocoder is based on linear prediction, but with several experimental excitation models. A continuous pitch extraction algorithm is also provided, built around standard components and a Kalman filter.
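
This is not SSP's own API, but a minimal sketch of the linear-prediction core such a vocoder rests on, assuming librosa and SciPy are available: an all-pole filter models the vocal tract, and inverse filtering recovers the excitation.

    import numpy as np
    import librosa
    import scipy.signal

    y, sr = librosa.load("speech.wav", sr=None)
    a = librosa.lpc(y, order=16)  # [1, a1, ..., a16], prediction filter A(z)

    excitation = scipy.signal.lfilter(a, [1.0], y)      # inverse filter: source
    y_hat = scipy.signal.lfilter([1.0], a, excitation)  # all-pole resynthesis

    print(np.allclose(y, y_hat, atol=1e-6))  # near-perfect round trip

A real vocoder does this frame by frame and swaps in a parametric excitation; the round trip above only illustrates the decomposition itself.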

There is a "sister" package, libssp, that includes C++ translations of some of the algorithms. Libssp is built around libube, which makes this translation easier.

SSP is released under a BSD licence. See the file COPYING for details.

  • Last update: 2017/04/16

Sample modelling

SampleRNN

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

  • Last update:
  • Reference:
   @article{mehri2016samplernn,
     title        = {SampleRNN: An unconditional end-to-end neural audio generation model},
     author       = {Mehri, Soroush and Kumar, Kundan and Gulrajani, Ishaan and Kumar, Rithesh and Jain, Shubham and Sotelo, Jose and Courville, Aaron and Bengio, Yoshua},
     journal      = {arXiv preprint arXiv:1612.07837},
     year         = {2016}
   }

Toolkits

SPTK - Speech Signal Processing Toolkit

The main feature of the Speech Signal Processing Toolkit, available from NITECH, is that it offers not only standard speech analysis and synthesis techniques (e.g., LPC analysis, PARCOR analysis, LSP analysis, PARCOR synthesis filters, LSP synthesis filters, and vector quantization) but also speech analysis and synthesis techniques developed by the research group, all in an easy-to-use form.
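
SPTK itself is a set of UNIX command-line filters chained through pipes. As an assumption, the sketch below uses the third-party pysptk binding to show one of those techniques, mel-cepstral analysis, from Python:

    import numpy as np
    import librosa
    import pysptk

    x, fs = librosa.load("speech.wav", sr=16000)
    frames = librosa.util.frame(x, frame_length=400, hop_length=80).T
    frames = frames.astype(np.float64) * pysptk.blackman(400)  # windowing

    # order-24 mel-cepstrum with the usual all-pass constant for 16 kHz
    mc = np.array([pysptk.mcep(f, order=24, alpha=0.42) for f in frames])
    print(mc.shape)  # (n_frames, 25)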

  • Last update: 2016/12/25

Singing synthesizer

Sinsy

Sinsy is an HMM-based singing voice synthesis system.

  • Last update: 2015/12/25

Ebook reader

Bard Storyteller ebook reader

Bard Storyteller is a text reader. Bard not only allows a user to read books, but can also read books to the user using text-to-speech. It supports txt, epub and (x)html files.

  • Last update: 2014/07

Various tools

SparkNG

MATLAB real-time speech tools and voice production tools.

  • Last update: 2017/06/29

Articulatory synthesizer

KLAIR - A virtual infant for spoken language acquisition research

The KLAIR project aims to build and develop a computational platform to assist research into the acquisition of spoken language. The main part of KLAIR is a sensorimotor server that displays a virtual infant on screen that can see, hear and speak. Behind the scenes, the server can talk to one or more client applications. Each client can monitor the audio-visual input to the server and can send articulatory gestures to the head for it to speak through an articulatory synthesizer. Clients can also control the position of the head and the eyes, as well as set facial expressions. By encapsulating the real-time complexities of audio and video processing within a server that runs on a modern PC, we hope that KLAIR will encourage and facilitate more experimental research into spoken language acquisition through interaction.

  • Last update:
  • Reference:
   @inproceedings{huckvale2009klair,
     title        = {KLAIR: a virtual infant for spoken language acquisition research.},
     author       = {Huckvale, Mark and Howard, Ian S and Fagel, Sascha},
     booktitle    = {Proceedings of Interspeech},
     pages        = {696--699},
     year         = {2009}
   }

Vocaltractlab

VocalTractLab stands for "Vocal Tract Laboratory" and is an interactive multimedia software tool to demonstrate the mechanism of speech production. It is meant to facilitate an intuitive understanding of speech production for students of phonetics and related disciplines.

The current versions of VocalTractLab are free of charge. Only a registration code, which you can request by email, is necessary to activate the software. VocalTractLab is written for Windows operating systems (XP or higher), but a port to Linux/Unix is conceivable in the future.

  • Last update: 2016

API/Library

Speech Tools

The Edinburgh Speech Tools Library is a collection of C++ classes, functions and related programs for manipulating the sorts of objects used in speech processing. It includes support for reading and writing waveforms and parameter files (LPC, cepstra, F0) in various formats and for converting between them. It also includes support for linguistic objects, and for various label files and n-grams (with smoothing).

In addition to the library, a number of programs are included: an intonation library which includes a pitch tracker, smoother and labelling system (using the Tilt labelling scheme), and a classification and regression tree (CART) building program called wagon. There is also growing support for various speech recognition classes such as decoders and HMMs.

The Edinburgh Speech Tools Library is not an end in itself but is designed to make the construction of other speech systems easy. For example, it provides the underlying classes of the Festival Speech Synthesis System.

The speech tools are currently distributed in full source form free for unrestricted use.

  • Last update: 2015/01/06

ROOTS

Roots is an open-source toolkit dedicated to the generation, management and processing of annotated sequential data. It is made of a core library and a collection of utility scripts. A rich API is available in C++ and in Perl.

  • Last update: 2015/07/01
  • Reference:
   @inproceedings{chevelu:hal-00974628,
     author       = {Chevelu, Jonathan and Lecorv{\'e}, Gw{\'e}nol{\'e} and Lolive, Damien},
     title        = {ROOTS: a toolkit for easy, fast and consistent processing of large sequential annotated data collections},
     booktitle    = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
     year         = {2014},
     address      = {Reykjavik, Iceland},
     url          = {http://hal.inria.fr/hal-00974628}
   }

Visualization & annotation tools

Praat

Praat is a system for doing phonetics by computer. The program is a research, publication, and productivity tool for phoneticians. With it, you can analyse, synthesize, and manipulate speech, and create high-quality figures for your articles and theses.
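
Praat is normally driven through its GUI or its own scripting language. As an assumption, the sketch below uses the third-party parselmouth binding to run one typical analysis from Python:

    import parselmouth

    snd = parselmouth.Sound("speech.wav")
    pitch = snd.to_pitch()                  # Praat's standard pitch analysis
    f0 = pitch.selected_array['frequency']  # Hz, 0 where unvoiced
    print(f0[f0 > 0].mean())                # mean F0 over voiced frames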

  • Last update:
  • Reference:
   @article{boersma2006praat,
     title        = {Praat: doing phonetics by computer},
     author       = {Boersma, Paul},
     journal      = {http://www.praat.org/},
     year         = {2006}
   }

KPE

KPE provides a graphical interface to an implementation of the Klatt (1980) formant synthesiser. The interface allows users to display and edit Klatt parameters using a graphical display which shows the time-amplitude waveform of both the original speech and its synthetic copy, together with some signal analysis facilities.

  • Last update:

Wavesurfer

WaveSurfer is a tool for doing speech analysis. The analysis features include formant and pitch extraction and real-time spectrograms. The WaveSurfer tool, built on top of the Snack speech visualization module, is highly modular and extensible at several levels.

  • Last update:

Resources

Dictionary

Unisyn lexicon

The Unisyn lexicon is a master lexicon transcribed in keysymbols, a kind of metaphoneme which allows the encoding of multiple accents of English.

The lexicon is accompanied by a number of Perl scripts which transform the base lexicon, via phonological and allophonic rules and other symbol changes, to produce output transcriptions in different accents. The rules can be applied to the whole lexicon, to produce an accent-specific lexicon, or to running text. Output can be displayed in keysymbols, SAMPA, or IPA.

The system uses a geographically-based accent hierarchy, with a tree structure describing countries, regions, towns and speakers; this hierarchy is used to specify the application of rules and other pronunciation features.

The lexicon system is customisable, and the documentation explains how to modify the output by switching rules on and off, adding new rules or editing existing ones. The user can also add new nodes to the accent hierarchy (new accents or new speakers within an accent), or add new symbols.

A number of UK, US, Australian and New Zealand accents are included in the release.

The scripts run under Unix or Windows 98 (DOS), and use Perl 5.6.0.

  • Last update:

Combilex

Combilex GA is a keyword-based lexicon for the General American pronunciation.

Combilex contains c. 145,000 entries, including the 20,000 most frequent words and many useful proper names, and provides a variety of linguistic information alongside detailed pronunciations.

Combilex GA is an ASCII text file, one entry-per-line, which is easily adaptable for use in text-to-speech synthesis (voice-building or run-time synthesis) and in speech recognition systems.

Full manually annotated orthographic-phonemic correspondences are included, allowing the derivation of accurate grapheme-to-phoneme rules.

  • Last update:
  • Reference:
   @inproceedings{richmond2009robust,
     title        = {Robust LTS rules with the Combilex speech technology lexicon},
     author       = {Richmond, Korin and Clark, Robert AJ and Fitt, Susan},
     year         = {2009},
     booktitle    = {Proceedings of Interspeech}
   }