Task description

Train the best possible speech recognition system for Vietnamese using only free resources.

Task schedule

  • May 2016 - Training, Development and Evaluation data released on SFTP. Please register for the task (Zero Cost Speech) and sign the data usage agreement form in order to participate. You will then be provided with SFTP account login details to download the data.
  • till 31 July 2016 - Participants can provide and share their own development data on the participants' SFTP. The data must be publicly and freely available for research purposes by other participants. See the data section for details. Participants can use ONLY the data available on the SFTP.
  • till 12 Sept. 2016 - Work on your systems. You can score your output locally for tuning, but you need to upload your output to the leader board to get full development results.
  • 12 Sept. 2016 - Run submission deadline. We lock data upload and publish test set results.
  • till 28 Sept. 2016 - Work on your late systems. Try to find out what is working and what is not.
  • 30 Sept. 2016 - Working notes paper deadline. Write a short 2-page paper describing your system and experiments.
  • 20-21 Oct. 2016 - MediaEval 2016 Workshop, right after ACM MM 2016 in Amsterdam. Present your work and discuss your conclusions with other participants.

Task overview paper

Here is a task description paper: zerocost2016_overview.pdf


The Zero-cost task aims at bridging the gap between "top" labs and companies, which have enough budget to buy any data for speech recognition training, and "other" small players, who are limited to freely available data, tools, or even systems.

The goal of this task is to challenge teams to come up with, and experiment with, bootstrapping techniques that allow them to train an initial ASR system or speech tokenizer for "free". In other words, come up with techniques that let you train a speech recognition system on publicly available data without the need to buy (expensive, or ideally any) data sets.

Target group

Any speech lab working on low-/zero-resource techniques for the development of automatic speech recognizers or speech tokenizers. The outcome of this task for participants is not the Vietnamese speech recognizer itself, but the acquired knowledge and tools for building recognizers for any language they want.


An initial set of free resources (a mix of audio data and imperfect transcripts, such as audios/videos with subtitles) is provided to participants. We expect to provide several hours of data from multiple sources. In addition, if any participants have or know about other free resources, they are encouraged to share them with the other participants if they want to use them in the evaluation. The training set (data provided by the organizers and also by other participants) will be fixed by late spring; after that, participants will not be allowed to use any other data. This should prevent participants from pursuing a data-gathering race and let them focus on speech research. Participants can also share other resources: texts, dictionaries, feature extractors, audios, etc. The only limitation is that these resources must be freely available to everyone for research purposes. We do not limit the other resources to the Vietnamese language, so anyone is allowed to bootstrap from English, for example; this data must also be publicly available to all.

The Training, Development, and Evaluation sets are all available during system training. Participants can use them and adapt their systems to them (unsupervised adaptation on the Test set). However, they are not provided with reference transcripts, and they are not allowed to transcribe or manually analyze the development/evaluation data.

The following data is provided by the organizers:
  • Forvo.com (download of Vietnamese data from the Forvo.com service; participants are forbidden to download, use, or share any Forvo.com data on their own, so as not to accidentally mix train/dev/eval data). This data is preprocessed, split into train/dev/test, and converted to 16kHz wav + STM references. The raw download is also available.
    Note: We found that some files are "empty" (contain only noise or zeros). If you find any, please report them. We removed the ones we found from the official training set.
  • Rhinospike.com (download of Vietnamese data from the Rhinospike.com service; participants are forbidden to download, use, or share any Rhinospike.com data on their own, so as not to accidentally mix train/dev/eval data). This data is preprocessed, split into train/dev/test, and converted to 16kHz wav + STM references. The raw download is also available.
  • ELSA - Proprietary prompted data recorded with a mobile application by Vietnamese students. It contains several read sentences obtained from a book of Vietnamese quotes. Use of this data is limited to research purposes.
  • Other test data
The following data was provided by other participants:
  • BUT-opus-en-vi.txt.zip (65MB) - Download of EN-VI parallel texts from http://opus.lingfil.uu.se/
  • viki.com.tgz (930MB) - Download of a few videos + subtitles from http://viki.com
  • i2r-data-pack-v1.0.tar.gz (560MB) - Data provided by I2R: a list of Vietnamese web pages, the Vietnamese Wikipedia, and a Vietnamese wordlist
  • i2r-data-pack-v1.0-wiki_viet_extract.tar.gz (176MB) - Data provided by I2R: texts extracted from the Vietnamese Wikipedia
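The official packages ship the references as STM files next to the 16kHz wavs. As a rough illustration of how such references can be read, here is a minimal Python sketch assuming the common NIST STM layout (file id, channel, speaker, start time, end time, optional <label> set, transcript); check the released devel-local.stm for the exact fields used in this task.

```python
# Minimal STM reference parser (sketch). Field layout below is the usual
# NIST convention and is an assumption - verify against devel-local.stm.

def parse_stm(lines):
    """Yield (file_id, channel, speaker, start, end, transcript) tuples."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):   # skip comments / empty lines
            continue
        parts = line.split()
        file_id, channel, speaker = parts[0], parts[1], parts[2]
        start, end = float(parts[3]), float(parts[4])
        rest = parts[5:]
        if rest and rest[0].startswith("<"):    # optional label set like <o,f0>
            rest = rest[1:]
        yield (file_id, channel, speaker, start, end, " ".join(rest))
```

With such tuples in hand it is easy to cut segment-level wavs or to build Kaldi-style data directories from the packages.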


You can find the data and the Kaldi baseline on our SFTP (port 22).

Directory structure

  • data-organizers - Data packages provided by the organizers.
    • ZeroCost2016-OfficialData-v1.tgz - This is the package you probably need. It contains wavs, texts, and STMs.
    • ZeroCost2016-OfficialData-v1-raw.tgz - This package contains the raw data as downloaded/provided: HTMLs, MP3s, etc. You can get some extra information from the raw data (such as the speaker's location).
  • data-participants - Data packages provided by other participants.
  • kaldi-baseline-system - TGZ of the Kaldi baseline system. However, it is better to get the latest version from here: https://github.com/xanguera/mediaeval_zerocost
    • mediaeval_zerocost_kaldi_baseline_v1.tgz - Baseline system released together with the task kick-off.
  • scoring - Scoring scripts.
    • ZeroCost2016-Score-ASR-v1.tgz - Simple scoring scripts for the ASR task, generating a report similar to what the leader board provides. Only devel-local references are included.
    • ZeroCost2016-Score-SubWrd-v1.tgz - Simple scoring scripts for the subword task, generating a report similar to what the leader board provides. Only devel-local references are included.

Ground truth and evaluation

There are three evaluation sets defined: Devel-local, Devel, and Test.
  • Devel-local is a 1/5 subset of Devel. Participants are provided with references and a scoring script so they can score themselves on Devel-local. The purpose is to allow quick iteration during the training period and to avoid the need to upload output to the leader board too often (and clutter your results).
  • Devel is the full Devel set. Once a participant ends up with a good enough / sufficiently improved system, they are encouraged to upload their results to the leader board and be scored on much more data. Their Devel results are also published to the other participants.
  • Test is "unseen" data. It partly contains data similar to the training/devel data, but also unseen data. Feel free to adapt your system to this data (in an unsupervised way, of course).
Participants are provided with an overall score for each evaluation set plus a score per data source (Forvo, Rhinospike, ELSA).

ASR sub task

Systems will be evaluated on Word Error Rate (WER, using sctk). The WER is based on comparing the word-level transcripts (reference and generated). Both transcripts should be in uppercase and without punctuation, hesitation markers, etc. No other text normalization is done. You can check devel-local.stm to get an idea of what the references look like.
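For quick local checks before running the official scoring scripts, the metric can be sketched as word-level edit distance after the normalization described above (uppercase, punctuation stripped). This is only an illustration; the reported numbers come from the official sctk-based scoring.

```python
# WER sketch: Levenshtein distance over word sequences, divided by the
# reference length. Normalization here (uppercase, strip punctuation) mirrors
# the task description; the official scoring is authoritative.
import re

def normalize(text):
    """Uppercase and drop punctuation; keeps Vietnamese diacritics."""
    text = re.sub(r"[^\w\s]", " ", text)
    return text.upper().split()

def wer(reference, hypothesis):
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Dynamic-programming edit distance on words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("XIN CHAO CAC BAN", "XIN CHAO BAN")` counts one deletion out of four reference words, i.e. 0.25.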

Subword sub task

Systems will be evaluated on Normalized Mutual Information (NMI). We provide a simple Python scoring script (see SFTP). In principle, the scoring script aligns two sequences (your unit sequence and a reference phoneme alignment). Please note that the algorithm takes the timing of your units into account: it matches your units to the reference ones (by time) and then calculates the NMI.
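The idea behind the metric can be illustrated as follows: sample both the hypothesized unit sequence and the reference phoneme alignment at a fixed frame rate, then compute NMI over the resulting label pairs. The frame step and the normalization (geometric mean of the entropies) in this sketch are assumptions; the official script on the SFTP is authoritative.

```python
# NMI sketch for time-aligned label sequences. Frame step (10 ms) and the
# sqrt(H(X)H(Y)) normalization are assumptions, not the official definition.
import math
from collections import Counter

def to_frames(segments, step=0.01):
    """segments: list of (start, end, label); returns per-frame labels."""
    frames = []
    for start, end, label in segments:
        frames.extend([label] * int(round((end - start) / step)))
    return frames

def nmi(x, y):
    """Normalized mutual information between two equal-rate label sequences."""
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    hx = -sum(c / n * math.log(c / n) for c in px.values())
    hy = -sum(c / n * math.log(c / n) for c in py.values())
    mi = sum(c / n * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
             for (a, b), c in pxy.items())
    return 0.0 if hx == 0 or hy == 0 else mi / math.sqrt(hx * hy)
```

A perfect (up to relabeling) unit sequence reaches NMI of 1.0; units unrelated to the reference phonemes score near 0.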

Participants are provided with the training, development, and evaluation data at the same time. However, they do not have references for the development and evaluation data. They can use the leader board to score their systems and get development results. When the evaluations are over, we will also publish the scores on the evaluation set.

Leader board help

  • Each participant (team) should have one account. Please use a short team abbreviation consistent with your MediaEval registration.
  • Public leader board tables show only the top 3 systems per participant. Only the team, the system title, and the overall devel scores are shown. The title should be short but informative.
  • Private leader board tables also show a description (a participant can write a small paragraph describing their system so as not to lose track of their results). Participants are also provided with detailed results in a report (sctk output).
  • Participants must select the sub-task (ASR or subword) and submit one file: either plain text, gzipped, or zipped.
  • We do some basic sanity checks:
    • All files are present in the output
    • The output has correct formatting
  • If something goes wrong, an error is raised and the participant can see a report.
  • Participants cannot delete correctly submitted systems, only erroneous ones (to prevent messing up the results).
Feel free to ask szoke at fit dot vutbr dot cz or zerocost@googlegroups.com for help.

Frequently asked questions

Feel free to ask!

Recommended reading


Task organizers


We acknowledge data providers - services Forvo.com, Rhinospike.com, and ElsaNow.io.

Special thanks to the MediaEval organizers for taking this task under their umbrella.

Thanks to Josef Žižka for this web page and to Lucas Ondel for the subword task scoring metric.