Software

Revision as of 16:56, 24 February 2020

Corpora

Full system

Multilingual

Festival

Festival offers a general framework for building speech synthesis systems and includes examples of various modules. As a whole it offers full text-to-speech through a number of APIs: from shell level, through a Scheme command interpreter, as a C++ library, from Java, and through an Emacs interface. Festival is multi-lingual (currently English (British and American) and Spanish), though English is the most advanced. Tools and documentation for building new voices are available through Carnegie Mellon's FestVox project.

Last update: 2015/01/06

Link: http://www.cstr.ed.ac.uk/downloads/festival/2.4/

Reference:

   @article{black2001festival,
     title        = {The festival speech synthesis system, version 1.4.2},
     author       = {Black, Alan and Taylor, Paul and Caley, Richard and
   		  Clark, Rob and Richmond, Korin and King, Simon and
   		  Strom, Volker and Zen, Heiga},
     journal      = {Unpublished document available via http://www.cstr.ed.ac.uk/projects/festival.html},
     year         = {2001}
   }

FreeTTS

FreeTTS is a speech synthesis system written entirely in the Java™ programming language. It is based upon Flite: a small run-time speech synthesis engine developed at Carnegie Mellon University. Flite is derived from the Festival Speech Synthesis System from the University of Edinburgh and the FestVox project from Carnegie Mellon University.

Last update: 2009-03-09

Link: http://freetts.sourceforge.net/docs/index.php

Reference:

   @misc{walker2010freetts,
     title        = {Freetts 1.2: A speech synthesizer written entirely in the Java programming language},
     author       = {Walker, Willie and Lamere, Paul and Kwok, Philip},
     year         = {2010}
   }

MBROLA

The aim of the MBROLA project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons (Belgium), is to obtain a set of diphone-based speech synthesizers for as many languages as possible, and provide them free for non-commercial applications.

Last update:

Link: http://tcts.fpms.ac.be/synthesis/mbrola.html

Reference:

   @inproceedings{dutoit1996mbrola,
     title        = {The MBROLA project: Towards a set of high quality speech synthesizers free of use for non commercial purposes},
     author       = {Dutoit, Thierry and Pagel, Vincent and Pierret,
   		  Nicolas and Bataille, Fran{\c{c}}ois and Van der
   		  Vrecken, Olivier},
     booktitle    = {Proceedings of the International Conference on Spoken Language Processing},
     volume       = {3},
     pages        = {1393--1396},
     year         = {1996},
     organization = {IEEE}
   }

MARY

MARY is a multi-lingual (German, English, Tibetan) and multi-platform (Windows, Linux, Mac OS X and Solaris) speech synthesis system. It comes with an easy-to-use installer, so no technical expertise should be required for installation. It enables expressive speech synthesis, using both diphone and unit-selection synthesis.

Last update: 2017/09/26

Link: http://mary.dfki.de/

Reference:

   @article{schroder2003german,
     title        = {The German text-to-speech synthesis system MARY: A tool for research, development and teaching},
     author       = {Schr{\"o}der, Marc and Trouvain, J{\"u}rgen},
     journal      = {International Journal of Speech Technology},
     volume       = {6},
     number       = {4},
     pages        = {365--377},
     year         = {2003},
     publisher    = {Springer}
   }

AhoTTS

Text-to-speech converter for Basque, Spanish, Catalan, Galician and English. It includes linguistic processing and built voices for all of these languages. Its acoustic engine is based on hts_engine and it uses a high-quality vocoder called AhoCoder.

Last update: 2015/07/15

Link: https://sourceforge.net/projects/ahottsmultiling/

Language specific

AHOTTS (Basque & Spanish)

Text-to-speech converter for Basque and Spanish. It includes linguistic processing and built voices for both languages. Its acoustic engine is based on hts_engine and it uses a high-quality vocoder called AhoCoder.

Last update: 2016/04/07

Link: https://sourceforge.net/projects/ahotts

Link2: https://sourceforge.net/projects/ahottsiparrahotsa/ (for Lapurdian dialect of Basque.)

Reference:

   @inproceedings{hernaez2001description,
     title        = {Description of the ahotts system for the basque language},
     author       = {Hernaez, Inma and Navas, Eva and Murugarren, Juan Luis and Etxebarria, Borja},
     booktitle    = {Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis},
     year         = {2001}
   }

RHVoice (Russian)

RHVoice is a free and open source speech synthesizer.

Last update: 2017/09/24

Link: https://github.com/Olga-Yakovleva/RHVoice

Front end (NLP part)

Front end inc G2P

SiRE

(Si)mply a (Re)search front-end for Text-To-Speech Synthesis. This is a research front-end for TTS. It is incomplete, inconsistent, badly coded and slow. But it is useful for me and should slowly develop into something useful to others.

Last update: 2016/10/11

Link: https://github.com/RasmusD/SiRe

Phonetisaurus

This repository contains scripts suitable for training, evaluating and using grapheme-to-phoneme models for speech recognition using the OpenFst framework. The current build requires OpenFst version 1.6.0 or later, and the examples below use version 1.6.2.

The repository includes C++ binaries suitable for training, compiling, and evaluating G2P models. It also includes some simple Python bindings which may be used to extract individual multigram scores and alignments, and to dump the raw lattices in .fst format for each word.

Last update: 2017/09/17

Link: https://github.com/AdolfVonKleist/Phonetisaurus

Ossian

Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision. Work on it started with funding from the EU FP7 Project Simple4All, and this repository contains a version which is considerably more up-to-date than that previously available. In particular, the original version of the toolkit relied on HTS to perform acoustic modelling. Although it is still possible to use HTS, it now supports the use of neural nets trained with the Merlin toolkit as duration and acoustic models. All comments and feedback about ways to improve it are very welcome.

Last update: 2017/09/15

Link: https://github.com/CSTR-Edinburgh/Ossian

SALB

The SALB system is a software framework for speech synthesis using HMM based voice models built by HTS (http://hts.sp.nitech.ac.jp/). See a more generic description on http://m-toman.github.io/SALB/.

The package currently includes:

A C++ framework that abstracts the backend functionality and provides a SAPI5 interface, a command line interface and a C++ API.

Backend functionality is provided by

an internal text analysis module for (Austrian) German,

flite as text analysis module for English and

hts_engine for parameter generation/synthesis. (see COPYING for information on 3rd party libraries)

Also included is an Austrian German male voice model.

Last update: 2016/11/14

Link: https://github.com/m-toman/SALB

Sequence-to-Sequence G2P toolkit

The tool does Grapheme-to-Phoneme (G2P) conversion using recurrent neural network (RNN) with long short-term memory units (LSTM). LSTM sequence-to-sequence models were successfully applied in various tasks, including machine translation [1] and grapheme-to-phoneme [2].

This implementation is based on python TensorFlow, which allows an efficient training on both CPU and GPU.

Last update: 2017/03/28

Link: https://github.com/cmusphinx/g2p-seq2seq

Text normalization

Sparrowhawk

Sparrowhawk is an open-source implementation of Google's Kestrel text-to-speech text normalization system. It follows the discussion of the Kestrel system as described in:

Ebden, Peter and Sproat, Richard. 2015. The Kestrel TTS text normalization system. Natural Language Engineering, Issue 03, pp 333-353.

After sentence segmentation (sentence_boundary.h), the individual sentences are first tokenized with each token being classified, and then passed to the normalizer. The system can output as an unannotated string of words, and richer annotation with links between input tokens, their input string positions, and the output words is also available.
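The tokenize, classify and normalize flow described above can be illustrated with a toy sketch. This is not Sparrowhawk's API: the token classes, the digit-by-digit reading of numbers, and all function names are assumptions made up for this illustration.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
# Runs of digits, runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\d+|\w+|[^\w\s]")


def classify(token):
    """Assign a coarse, made-up class to a token."""
    if token.isdecimal():
        return "NUMBER"
    if token[0].isalnum() or token[0] == "_":
        return "WORD"
    return "PUNCT"


def normalise(sentence):
    """Return (start, end, token, class, words) tuples for one sentence.

    The start/end offsets link each input token to its output words.
    """
    result = []
    for match in TOKEN.finditer(sentence):
        token = match.group(0)
        kind = classify(token)
        if kind == "NUMBER":
            # Toy verbalisation: read the number digit by digit.
            words = [DIGITS[int(digit)] for digit in token]
        elif kind == "WORD":
            words = [token.lower()]
        else:
            words = []  # punctuation yields no spoken words
        result.append((match.start(), match.end(), token, kind, words))
    return result


def plain_words(annotated):
    """Flatten the annotated result into an unannotated string of words."""
    return " ".join(word for *_, words in annotated for word in words)


if __name__ == "__main__":
    annotated = normalise("Call 42 now!")
    for entry in annotated:
        print(entry)
    print(plain_words(annotated))
```

Keeping the character offsets next to the output words is what makes the richer annotation possible; dropping them gives the plain word string.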

Last update: 2017/07/25

Link: https://github.com/google/sparrowhawk

ASRT

ASRT (Automatic Speech Recognition Tools) contains various scripts that facilitate the preparation of ASR-related tasks.

Current tasks are:

Sentence extraction from PDF files
Sentence classification by language
Sentence filtering and cleaning

Document sentences can be extracted in single-document or batch mode.

For an example of how to extract sentences in batch mode, have a look at the run_data_preparation_task.sh script located in the examples/bash directory.

For an example of how to extract sentences in single-document mode, have a look at the run_data_preparation.sh script located in the examples/bash directory.

There is also an API for use in Python code. It is located in the common package and is called DataPreparationAPI.py.

Last update: 2017/09/20
Link: https://github.com/idiap/asrt

IRISA text normalizer

Text normalisation tools from IRISA lab.

The tools provided here are split into 3 steps:

Tokenisation (adding blanks around punctuation marks, dealing with special cases like URLs, etc.)
Generic normalisation (leading to homogeneous texts where (almost) no information has been lost and where tags have been added for some entities)
Specific normalisation (projection of the generic texts into specific forms)
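The first of these steps can be sketched roughly as follows. This is an illustration only, not the IRISA code: the regular expressions, the simplified URL handling, and the function name are assumptions.

```python
import re

# Assumed pattern for URLs, which should stay in one piece.
URL = re.compile(r"https?://\S+")
# Punctuation marks to separate from neighbouring words.
PUNCT = re.compile(r'([.,;:!?()"])')


def tokenise(text):
    """Add blanks around punctuation marks, leaving URLs untouched."""
    pieces = []
    position = 0
    for match in URL.finditer(text):
        # Space out punctuation in the text before the URL.
        pieces.append(PUNCT.sub(r" \1 ", text[position:match.start()]))
        # Copy the URL through unchanged.
        pieces.append(match.group(0))
        position = match.end()
    pieces.append(PUNCT.sub(r" \1 ", text[position:]))
    # Collapse the runs of blanks introduced above.
    return " ".join("".join(pieces).split())


if __name__ == "__main__":
    print(tokenise("Hello, world! See http://example.org/a.b now."))
```

Note that this simplified URL pattern also swallows punctuation glued to the end of a URL; a real tokeniser would need a more careful rule for that case.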

Last update: 2018/01/09
Link: https://github.com/glecorve/irisa-text-normalizer

Dictionary related tools

CMU Pronunciation Dictionary Tools

Tools for working with the CMU Pronunciation Dictionary

Last update: 2015/02/23

Link: https://github.com/cmusphinx/cmudict-tools

ISS scripts for dictionary maintenance

These scripts are sufficient to convert the distributed forms of dictionaries into forms useful for our tools (notably HTK and ISS). Once a dictionary is in a standard form, the generic tools in ISS can be used to manipulate it further.

Last update: 2017/07/04

Link: https://github.com/idiap/iss-dicts

Backend (Acoustic part)

Unit selection

HMM based

MAGE

MAGE is a C/C++ software toolkit for reactive implementation of HMM-based speech and singing synthesis.

Last update: 2014/07/18

Link: https://github.com/numediart/mage

HMM-Based Speech Synthesis System (HTS)

The basic core system of HTS, available from NITECH, was implemented as a modified version of HTK together with SPTK (see below), and is released as HMM-Based Speech Synthesis System (HTS) in a form of patch code to HTK.

Last update: 2016/12/25

Link: http://hts.sp.nitech.ac.jp/

HTS Engine

hts_engine is a small run-time synthesis engine (less than 1 MB including acoustic models), which can run without the HTK library. The current version does not include any text analyzer but the Festival Speech Synthesis System can be used as a text analyzer.

Last update: 2015/12/25

Link: http://hts-engine.sourceforge.net/

DNN based

MERLIN

Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).

The system is written in Python and relies on the Theano numerical computation library.

Merlin comes with recipes (in the spirit of the Kaldi automatic speech recognition toolkit) to show you how to build state-of-the-art systems.

Last update: 2017/09/29

Link: http://www.cstr.ed.ac.uk/projects/merlin

Reference:

   @inproceedings{wu2016merlin,
     title          = {Merlin: An open source neural network speech synthesis system},
     author         = {Wu, Zhizheng and Watts, Oliver and King, Simon},
     booktitle      = {Proceedings of the Speech Synthesis Workshop (SSW)},
     year           = {2016}
   }

IDLAK

Idlak is a project to build an end-to-end parametric TTS system within Kaldi, to be distributed with the same licence.

It contains a robust front-end, voice building tools, speech analysis utilities, and DNN tools suitable for parametric synthesis. It also contains an example of using Idlak as an end-to-end TTS system, in egs/tts_dnn_arctic/s1.

Note that the kaldi structure has been maintained and the tool building procedure is identical.

Last update: 2017/07/03

Link: https://github.com/bpotard/idlak

Reference:

   @inproceedings{potard2016idlak,
     title        = {Idlak Tangle: An Open Source Kaldi Based Parametric Speech Synthesiser Based on DNN.},
     author       = {Potard, Blaise and Aylett, Matthew P and Baude, David A and Motlicek, Petr},
     booktitle    = {Proceedings of Interspeech},
     pages        = {2293--2297},
     year         = {2016}
   }

CURRENNT scripts

The scripts and examples on the modified CURRENNT toolkit

Last update: 2017/08/27

Link: https://github.com/TonyWangX/CURRENNT_SCRIPTS

Wavenet based

tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper

Last update: 2017/05/23

Link: https://github.com/ibab/tensorflow-wavenet

Other

End-to-end (text to audio)

barronalex/Tacotron

Implementation of Google's Tacotron in TensorFlow

Last update: 2017/08/08

Link: https://github.com/barronalex/Tacotron

keithito/tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model

Last update: 2017/11/06

Link: https://github.com/keithito/tacotron

Char2Wav: End-to-End Speech Synthesis

This repo has the code for our ICLR submission:

Jose Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio. Char2Wav: End-to-End Speech Synthesis.

Last update: 2017/02/28

Link: https://github.com/sotelo/parrot

Reference:

   @inproceedings{sotelo2017char2wav,
     title        = {Char2Wav: End-to-end speech synthesis},
     author       = {Sotelo, Jose and Mehri, Soroush and Kumar, Kundan and Santos, Joao Felipe and Kastner, Kyle and Courville, Aaron and Bengio, Yoshua},
     year         = {2017},
     booktitle    = {Proceedings of International Conference on Learning Representations (ICLR)}
   }

Signal processing

Vocoder, Glottal modelling

STRAIGHT

STRAIGHT is a tool for manipulating voice quality, timbre, pitch, speed and other attributes flexibly. It is an always evolving system for attaining better sound quality, that is close to the original natural speech, by introducing advanced signal processing algorithms and findings in computational aspects of auditory processing.

STRAIGHT decomposes sounds into source information and resonator (filter) information. This conceptually simple decomposition makes it easy to conduct experiments on speech perception using STRAIGHT, the initial design objective of this tool, and to interpret experimental results in terms of the huge body of classical studies.

Last update:

Link: http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html

Reference:

   @article{Kawahara1999,
     author       = {Kawahara, Hideki and Masuda-katsuse, Ikuyo and {De Cheveigné}, Alain},
     year         = {1999},
     journal      = {Speech Communication},
     pages        = {187--207},
     title        = {Restructuring speech representations using a pitch-adaptive time frequency smoothing and an instantaneous-frequency},
     volume       = {27},
   }

World

WORLD is free software for high-quality speech analysis, manipulation and synthesis. It can estimate the fundamental frequency (F0), aperiodicity and spectral envelope, and can also synthesize speech resembling the input speech from the estimated parameters alone.

This source code is released under the modified-BSD license. There is no patent in all algorithms in WORLD.

Last update: 2017/08/23

Link: https://github.com/mmorise/World

Reference:

   @article{morise2016world,
     title        = {WORLD: A vocoder-based high-quality speech synthesis system for real-time applications},
     author       = {Morise, Masanori and Yokomori, Fumiya and Ozawa, Kenji},
     journal      = {IEICE TRANSACTIONS on Information and Systems},
     volume       = {99},
     number       = {7},
     pages        = {1877--1884},
     year         = {2016},
     publisher    = {The Institute of Electronics, Information and Communication Engineers}
   }

Covarep - A Cooperative Voice Analysis Repository for Speech Technologies

Covarep is an open-source repository of advanced speech processing algorithms and is stored as a GitHub project (https://github.com/covarep/covarep) where researchers in speech processing can store original implementations of published algorithms.

Over the past few decades a vast array of advanced speech processing algorithms have been developed, often offering significant improvements over the existing state-of-the-art. Such algorithms can have a reasonably high degree of complexity and, hence, can be difficult to accurately re-implement based on article descriptions. Another issue is the so-called 'bug magnet effect' with re-implementations frequently having significant differences from the original ones. The consequence of all this has been that many promising developments have been under-exploited or discarded, with researchers tending to stick to conventional analysis methods.

By developing Covarep we are hoping to address this by encouraging authors to include original implementations of their algorithms, thus resulting in a single de facto version for the speech community to refer to.

Last update: 2016/10/16

Link: https://github.com/covarep/covarep

Reference:

   @misc{degottex2014covarep,
     title        = {COVAREP: A Cooperative Voice Analysis Repository for Speech Technologies},
     author       = {Degottex, Gilles},
     year         = {2014}
   }

MagPhase Vocoder

Speech analysis/synthesis system for TTS and related applications.

This software is based on the method described in the paper:

F. Espic, C. Valentini-Botinhao, and S. King, “Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis,” in Proc. Interspeech, Stockholm, Sweden, August, 2017.

Last update: 2017/08/30

Link: https://github.com/CSTR-Edinburgh/magphase

Reference:

   @inproceedings{espic2017direct,
     title        = {Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis},
     author       = {Espic, Felipe and Valentini-Botinhao, Cassia and King, Simon},
     booktitle    = {Proceedings of Interspeech},
     year         = {2017}
   }

WavGenSR

Waveform generator based on signal reshaping for statistical parametric speech synthesis.

Last update: 2017/08/30

Link: https://github.com/CSTR-Edinburgh/WavGenSR

Reference:

   @inproceedings{espic2016waveform,
     title        = {Waveform Generation Based on Signal Reshaping for Statistical Parametric Speech Synthesis.},
     author       = {Espic, Felipe and Valentini-Botinhao, Cassia and Wu, Zhizheng and King, Simon},
     booktitle    = {Proceedings of Interspeech},
     pages        = {2263--2267},
     year         = {2016}
   }

Pulse model analysis and synthesis

It is basically the vocoder described in:

G. Degottex, P. Lanchantin, and M. Gales, "A Pulse Model in Log-domain for a Uniform Synthesizer," in Proc. 9th Speech Synthesis Workshop (SSW9), 2016.

Last update: 2017/09/07

Link: https://github.com/gillesdegottex/pulsemodel

Reference:

   @inproceedings{degottex2016pulse,
     title        = {A pulse model in log-domain for a uniform synthesizer},
     author       = {Degottex, Gilles and Lanchantin, Pierre and Gales, Mark},
     year         = {2016},
     booktitle    = {Proceedings of the Speech Synthesis Workshop (SSW)}
   }

YANG VOCODER: Yet-ANother-Generalized VOCODER

Yet another vocoder that is not STRAIGHT.

This project is a state-of-the-art vocoder that parameterizes the speech signal into a parameterization that is amenable to statistical manipulation.

The VOCODER was developed by Hideki Kawahara during his internship at Google.

Last update: 2017/01/02

Link: https://github.com/google/yang_vocoder

Ahocoder

Ahocoder parameterizes speech waveforms into three different streams: log-f0, cepstral representation of the spectral envelope, and maximum voiced frequency. It provides high accuracy during analysis and high quality during reconstruction. It is adequate for statistical parametric speech synthesis and voice conversion. Furthermore, it can be used just for basic speech manipulation and transformation (pitch level and variance, speaking rate, vocal tract length…).

Ahocoder is reported to be a very good complement for HTS. The output files generated by Ahocoder contain float numbers without header, so they are fully compatible with the HTS demo scripts in the HTS website. You can use the same configuration as in the STRAIGHT-based demo, using the "bap" stream to handle maximum voiced frequency (set its dimension to 1 both in data/Makefile and in scripts/Config.pm).
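As a minimal sketch of reading such a headerless parameter file: the code below assumes 32-bit floats in the machine's native byte order and a known number of values per frame, neither of which is stated above. The function name, the `.lf0` suffix, and the sample values are hypothetical.

```python
import array
import os
import tempfile


def read_raw_floats(path, frame_size=1):
    """Read a headerless file of floats and group the values into frames.

    Assumption: values are 32-bit floats in native byte order.
    """
    values = array.array("f")
    with open(path, "rb") as handle:
        data = handle.read()
    if len(data) % values.itemsize != 0:
        raise ValueError("file size is not a multiple of the float size")
    values.frombytes(data)
    if len(values) % frame_size != 0:
        raise ValueError("value count is not a multiple of frame_size")
    return [list(values[i:i + frame_size])
            for i in range(0, len(values), frame_size)]


if __name__ == "__main__":
    # Round-trip demo with made-up log-f0 values, one value per frame.
    with tempfile.NamedTemporaryFile(suffix=".lf0", delete=False) as tmp:
        tmp.write(array.array("f", [4.5, 4.75, 5.0]).tobytes())
    print(read_raw_floats(tmp.name))
    os.remove(tmp.name)
```

Because the files carry no header, the frame size has to come from the configuration: for example 1 for a one-dimensional stream, or the cepstral order plus one for a spectral stream.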

Last update: 2014

Link: http://aholab.ehu.es/ahocoder/

Reference:

   @article{erro2014harmonics,
     title        = {Harmonics plus noise model based vocoder for statistical parametric speech synthesis},
     author       = {Erro, Daniel and Sainz, Inaki and Navas, Eva and Hernaez, Inma},
     journal      = {IEEE Journal of Selected Topics in Signal Processing},
     volume       = {8},
     number       = {2},
     pages        = {184--194},
     year         = {2014},
     publisher    = {IEEE}
   }

PhonVoc: Phonetic and Phonological vocoding

This is a computational platform for Phonetic and Phonological vocoding, released under the BSD licence. See file COPYING for details. The software is based on Kaldi (v. 489a1f5) and Idiap SSP. For training of the analysis and synthesis models, please follow train/README.txt.

Last update: 2016/11/23

Link: https://github.com/idiap/phonvoc

GlottGAN

Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

Last update: 2017/05/30

Link: https://github.com/bajibabu/GlottGAN

Reference:

   @inproceedings{bollepalli2017generative,
     title        = {Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis},
     author       = {Bollepalli, Bajibabu and Juvela, Lauri and Alku, Paavo},
     booktitle    = {Proceedings of Interspeech},
     pages        = {3394--3398},
     year         = {2017}
   }

Postfilt gan

This is an implementation of "Generative adversarial network-based postfilter for statistical parametric speech synthesis"

Please check the run.sh file to train the system. The testing part is not yet implemented.

Last update: 2017/07/06

Link: https://github.com/bajibabu/postfilt_gna

Reference:

   @INPROCEEDINGS{Kaneko2017,
     author       = {T. Kaneko and H. Kameoka and N. Hojo and Y. Ijima and K. Hiramatsu and K. Kashino},
     booktitle    = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
     title        = {Generative adversarial network-based postfilter for statistical parametric speech synthesis},
     year         = {2017},
     pages        = {4910--4914},
     doi          = {10.1109/ICASSP.2017.7953090},
     month        = {March}
   }

Pitch extractor

REAPER: Robust Epoch And Pitch EstimatoR

This is a speech processing system. The reaper program uses the EpochTracker class to simultaneously estimate the location of voiced-speech "epochs" or glottal closure instants (GCI), voicing state (voiced or unvoiced) and fundamental frequency (F0 or "pitch"). We define the local (instantaneous) F0 as the inverse of the time between successive GCI.
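The F0 definition above can be illustrated with a short sketch. It is not REAPER's code: the function name and the GCI times are made up for illustration, and the sketch assumes GCI times in seconds within a single voiced region.

```python
def local_f0(gci_times):
    """Return (time, f0) pairs from glottal closure instants in seconds.

    Each F0 value is the inverse of the interval between two successive
    GCIs and is reported at the later GCI of the pair.
    """
    pairs = []
    for previous, current in zip(gci_times, gci_times[1:]):
        period = current - previous
        if period <= 0:
            raise ValueError("GCI times must be strictly increasing")
        pairs.append((current, 1.0 / period))
    return pairs


if __name__ == "__main__":
    # Hypothetical GCI times: 10 ms apart, then 8 ms apart.
    for time, f0 in local_f0([0.100, 0.110, 0.118]):
        print(f"{time:.3f} s  {f0:.1f} Hz")
```

In practice the intervals would only be computed between GCIs that belong to the same voiced segment, since the gap across an unvoiced stretch is not a pitch period.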

This code was developed by David Talkin at Google. This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.

Last update: 2015/03/04

Link: https://github.com/google/REAPER

SSP - Speech Signal Processing module

SSP is a package for doing signal processing in Python; the functionality is biased towards speech signals. Top-level programs include a feature extractor for speech recognition, and a vocoder for both coding and speech synthesis. The vocoder is based on linear prediction, but with several experimental excitation models. A continuous pitch extraction algorithm is also provided, built around standard components and a Kalman filter.

There is a "sister" package, libssp, that includes translations of some algorithms in C++. Libssp is built around libube that makes this translation easier.

SSP is released under a BSD licence. See the file COPYING for details.

Last update: 2017/04/16

Link: https://github.com/idiap/ssp

Sample modelling

SampleRNN

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

Last update:

Link: https://github.com/soroushmehr/sampleRNN_ICLR2017

Reference:

   @article{mehri2016samplernn,
     title        = {SampleRNN: An unconditional end-to-end neural audio generation model},
     author       = {Mehri, Soroush and Kumar, Kundan and Gulrajani, Ishaan and Kumar, Rithesh and Jain, Shubham and Sotelo, Jose and Courville, Aaron and Bengio, Yoshua},
     journal      = {arXiv preprint arXiv:1612.07837},
     year         = {2016}
   }

Toolkits

SPTK - Speech Signal Processing Toolkit

The main feature of the Speech Signal Processing Toolkit, available from NITECH, is that not only standard speech analysis and synthesis techniques (e.g., LPC analysis, PARCOR analysis, LSP analysis, PARCOR synthesis filter, LSP synthesis filter, and vector quantization techniques) but also speech analysis and synthesis techniques developed at the research group can easily be used.

Last update: 2016/12/25

Link: http://sp-tk.sourceforge.net/

Singing synthesizer

Sinsy

Sinsy is an HMM-based singing voice synthesis system.

Last update: 2015/12/25

Link: http://sinsy.sourceforge.net/

Ebook reader

Bard Storyteller ebook reader

Bard Storyteller is a text reader. Bard not only allows a user to read books, but can also read books to the user using text-to-speech. It supports txt, epub and (x)html files.

Last update: 2014/07

Link: http://festvox.org/bard/

Various tools

SparkNG

Matlab realtime speech tools and voice production tools

Last update: 2017/06/29

Link: http://www.wakayama-u.ac.jp/~kawahara/MatlabRealtimeSpeechTools/

Articulatory synthesizer

KLAIR - A virtual infant for spoken language acquisition research

The KLAIR project aims to build and develop a computational platform to assist research into the acquisition of spoken language. The main part of KLAIR is a sensori-motor server that displays a virtual infant on screen that can see, hear and speak. Behind the scenes, the server can talk to one or more client applications. Each client can monitor the audio visual input to the server and can send articulatory gestures to the head for it to speak through an articulatory synthesizer. Clients can also control the position of the head and the eyes as well as setting facial expressions. By encapsulating the real-time complexities of audio and video processing within a server that will run on a modern PC, we hope that KLAIR will encourage and facilitate more experimental research into spoken language acquisition through interaction.

Last update:

Link: http://www.phon.ucl.ac.uk/project/klair/

Reference:

   @inproceedings{huckvale2009klair,
     title        = {KLAIR: a virtual infant for spoken language acquisition research.},
     author       = {Huckvale, Mark and Howard, Ian S and Fagel, Sascha},
     booktitle    = {Proceedings of Interspeech},
     pages        = {696--699},
     year         = {2009}
   }

Vocaltractlab

VocalTractLab stands for "Vocal Tract Laboratory" and is an interactive multimedia software tool to demonstrate the mechanism of speech production. It is meant to facilitate an intuitive understanding of speech production for students of phonetics and related disciplines.

The current versions of VocalTractLab are free of charge. Only a registration code, which you can request by email, is necessary to activate the software. VocalTractLab is written for Windows operating systems (XP or higher), but a port to Linux/Unix is conceivable in the future.

Last update: 2016

Link: http://www.vocaltractlab.de/

API/Library

Speech Tools

The Edinburgh Speech Tools Library is a collection of C++ classes, functions and related programs for manipulating the sorts of objects used in speech processing. It includes support for reading and writing waveforms and parameter files (LPC, cepstra, F0) in various formats, and for converting between them. It also includes support for linguistic type objects and support for various label files and ngrams (with smoothing).

In addition to the library, a number of programs are included: an intonation library, which includes a pitch tracker, smoother and labelling system (using the Tilt labelling system), and a classification and regression tree (CART) building program called wagon. There is also growing support for various speech recognition classes such as decoders and HMMs.

The Edinburgh Speech Tools Library is not an end in itself but is designed to make the construction of other speech systems easy. For example, it provides the underlying classes in the Festival Speech Synthesis System.

The speech tools are currently distributed in full source form free for unrestricted use.

Last update: 2015/01/06

Link: http://www.cstr.ed.ac.uk/projects/speech_tools/

ROOTS

Roots is an open source toolkit dedicated to annotated sequential data generation, management and processing. It is made of a core library and of a collection of utility scripts. A rich API is available in C++ and in Perl.

Last update: 2015/07/01

Link: http://roots-toolkit.gforge.inria.fr/

Reference:

   @inproceedings{chevelu:hal-00974628,
     AUTHOR       = {Chevelu, Jonathan and Lecorv{\'e}, Gw{\'e}nol{\'e} and Lolive, Damien},
     TITLE        = {ROOTS: a toolkit for easy, fast and consistent processing of large sequential annotated data collections},
     BOOKTITLE    = {Proceedings of Language Resources and Evaluation Conference (LREC)},
     YEAR         = {2014},
     ADDRESS      = {Reykjavik, Iceland},
     URL          = {http://hal.inria.fr/hal-00974628}
   }

Visualization & annotation tools

Praat

Praat is a system for doing phonetics by computer. The computer program Praat is a research, publication, and productivity tool for phoneticians. With it, you can analyse, synthesize, and manipulate speech, and create high-quality pictures for your articles and thesis.

Last update:

Link: http://www.fon.hum.uva.nl/praat/

Reference:

   @article{boersma2006praat,
     title        = {Praat: doing phonetics by computer},
     author       = {Boersma, Paul},
     journal      = {http://www.praat.org/},
     year         = {2006}
   }

KPE

KPE provides a graphical interface for the implementation of the Klatt 1980 formant synthesiser. The interface allows users to display and edit Klatt parameters using a graphical display which includes the time-amplitude waveform of both the original speech and its synthetic copy, and some signal analysis facilities.

Last update:

Link: http://www.speech.cs.cmu.edu/comp.speech/Section5/Synth/klatt.kpe80.html

Wavesurfer

WaveSurfer is a tool for doing speech analysis. The analysis features include formant and pitch extraction and real-time spectrograms. WaveSurfer, built on top of the Snack speech visualization module, is highly modular and extensible at several levels.

Last update:

Link: https://sourceforge.net/projects/wavesurfer/

Resources

Dictionary

Unisyn lexicon

The Unisyn lexicon is a master lexicon transcribed in keysymbols, a kind of metaphoneme which allows the encoding of multiple accents of English.

The lexicon is accompanied by a number of perl scripts which transform the base lexicon via phonological and allophonic rules, and other symbol changes, to produce output transcriptions in different accents. The rules can be applied to the whole lexicon, to produce an accent-specific lexicon, or to running text. Output can be displayed in keysymbols, SAMPA, or IPA.

The system uses a geographically-based accent hierarchy, with a tree structure describing countries, regions, towns and speakers; this hierarchy is used to specify the application of rules and other pronunciation features.
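One way to picture how such a hierarchy can drive rule application is the sketch below. It is not taken from the Unisyn Perl scripts: the class, the rule names, and the idea that deeper levels override switches inherited from higher ones are assumptions made for this illustration.

```python
class AccentNode:
    """A node in an accent hierarchy (country, region, town or speaker)."""

    def __init__(self, name, parent=None, rules=None):
        self.name = name
        self.parent = parent
        # Rule switches set at this level only: rule name -> on/off.
        self.rules = dict(rules or {})

    def effective_rules(self):
        """Merge switches from the root down; deeper levels override."""
        inherited = self.parent.effective_rules() if self.parent else {}
        inherited.update(self.rules)
        return inherited


if __name__ == "__main__":
    # Hypothetical hierarchy with made-up rule names.
    country = AccentNode("country", rules={"rule_a": True, "rule_b": False})
    region = AccentNode("region", parent=country, rules={"rule_b": True})
    town = AccentNode("town", parent=region)
    speaker = AccentNode("speaker", parent=town, rules={"rule_a": False})
    print(speaker.effective_rules())
```

Under this assumption, adding a new accent or speaker means adding a node and stating only the switches that differ from its parent.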

The lexicon system is customisable, and the documentation explains how to modify output by switching rules on and off, adding new rules or editing existing ones. The user can also add new nodes in the accent hierarchy (new accents or new speakers within an accent), or add new symbols.

A number of UK, US, Australian and New Zealand accents are included in the release.

The scripts run under unix, or Windows 98 (DOS), and use perl 5.6.0.

Last update:

Link: http://www.cstr.ed.ac.uk/projects/unisyn/

Combilex

Combilex GA is a keyword-based lexicon for the General American pronunciation.

The combilex contains c.145,000 entries, including the 20,000 most frequent words and contains a variety of linguistic information alongside detailed pronunciations, including many useful proper names.

Combilex GA is an ASCII text file, one entry-per-line, which is easily adaptable for use in text-to-speech synthesis (voice-building or run-time synthesis) and in speech recognition systems.

Full manually notated orthographic-phonemic correspondences are included, allowing derivation of accurate grapheme-to-phoneme rules.

Last update:

Link: https://licensing.edinburgh-innovations.ed.ac.uk/item.php?item=combilex-ga

Reference:

   @inproceedings{richmond2009robust,
     title        = {Robust LTS rules with the Combilex speech technology lexicon},
     author       = {Richmond, Korin and Clark, Robert AJ and Fitt, Susan},
     year         = {2009},
     booktitle    = {Proceedings of Interspeech}
   }
