Armenian Common Voice: Frequently Asked Questions

Jun 21, 2021 00:00 · 1415 words · 7 minute read

Armenian Common Voice was launched in May 2021. I have collected popular questions from a diverse community of developers, linguists, educators, hobbyists, and others to give better context on what it is and how you can help.

What is Armenian Common Voice?
What help is needed?
Is there a timeline?
What does the voice corpus look like?
How are dialects addressed? Is there a voice corpus for each dialect similar to how the language is split in Wikipedia?
Are there any links to documentation to get more information about Armenian Common Voice?
What can I do with a mixed dataset if my demographic is Eastern Armenian speakers?
About the voice recordings – is it bad if I stutter or re-read a few words? What about background noise?
If I am a Western Armenian speaker who is given an Eastern Armenian sentence in the application, should I skip it, and vice versa for Eastern Armenian speakers?

What is Armenian Common Voice? 🔗

Armenian Common Voice is a subcommunity of Common Voice, an initiative by Mozilla to help teach machines how people speak. The wheel that keeps it spinning is the people who donate their time to accumulate large public datasets - in the case of Common Voice, a collection of multilingual voice corpora that is already being used in academia and industry. Naturally, Armenian Common Voice is responsible for managing the Armenian voice corpus.

What help is needed? 🔗

  • More Armenian text from public-domain sources. Sentences can be added manually via the Sentence Collector.

  • More voice donations. Open an account through the app and start speaking sentences out loud into your microphone in your natural accent, even if the spelling and grammar do not match your own (more on dialects below).

  • More validations. For quality assurance, each audio recording must be validated by peers from our subcommunity.

  • Better data representation. Related to the first bullet point, we need to diversify the content. Wikipedia text alone is not enough to train a machine on speech patterns because (1) it does not contain any conversational/dialogic text and (2) its minuscule Western Armenian content misrepresents the vast amount of Western Armenian text out there in the real world.

Is there a timeline? 🔗

Kind of. Common Voice has a dataset release every six months. Our timeline can span one release cycle or several. It is really up to us - what is our definition of success?

In Mozilla’s words, success is:

  • At least 1,000 unique speakers per language.
  • 2,000 hours of validated voice to train a near-human general STT model.
  • 10,000 hours of validated voice for a very high-quality, general, large-vocabulary, continuous speech recognition model.

Success is taking baby steps towards building a baseline model for automatic speech recognition. Wav2Vec2.0 is very promising: it has been fine-tuned on various languages, including low-resource ones. For example, Hugging Face hosts a Wav2Vec2.0 model fine-tuned on 365 MB of Greek speech data from Common Voice (~10 hours), which achieved a word error rate of 10% on the test set. You can build with this.
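To make that concrete, here is a minimal inference sketch using the transformers library. The checkpoint name is hypothetical and stands in for any fine-tuned Wav2Vec2.0 model on the Hugging Face Hub:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Hypothetical checkpoint name; substitute any fine-tuned
# Wav2Vec2.0 model from the Hugging Face Hub.
MODEL_ID = "your-username/wav2vec2-large-xlsr-armenian"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Wav2Vec2.0 expects 16 kHz mono audio; resample if the clip differs.
waveform, sample_rate = torchaudio.load("clip.wav")
if sample_rate != 16_000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16_000)(waveform)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token at each frame.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```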

How soon can we get to 10 hours of validated recordings? We currently have 18 speakers and 30 minutes of validated recordings. At this rate, we can generate about 4 hours of recordings per year, so we will have 10 hours in roughly 2.5 years¹.
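For the curious, that figure is just back-of-the-envelope arithmetic:

```python
# Back-of-the-envelope estimate using the numbers above.
validated_hours = 0.5   # 30 minutes validated so far
hours_per_year = 4.0    # our current collection rate
target_hours = 10.0     # roughly what the Greek model was fine-tuned on

years_needed = (target_hours - validated_hours) / hours_per_year
print(f"{years_needed:.1f} years")  # -> 2.4, i.e. roughly the 2.5 quoted above
```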

I think we can and should do better. Given that automatic speech recognition is currently exploding, especially for low-resource languages, we need to catch up.

¹ Tech years can be treated like dog years. Basically, that is a really long time.

What does the voice corpus look like? 🔗

Armenian Common Voice was launched after Mozilla released their latest dataset. Therefore, Armenian will be available after their next dataset release, in mid-July 2021. Watch out for announcements, because soon you will be able to download Armenian from their dropdown of available languages. The downloadable content will be in the form of audio, text, and metadata files. The text is in sentence form, and its source (for now) is Armenian Wikipedia.
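Once the release is out, a first look at the download might go like this - a sketch assuming the standard Common Voice layout of a clips/ folder of MP3s plus TSV metadata files, with "hy-AM" as my assumption for the Armenian directory name:

```python
import pandas as pd

# validated.tsv lists every clip that passed peer validation;
# the directory name "hy-AM" is an assumption.
df = pd.read_csv("cv-corpus/hy-AM/validated.tsv", sep="\t")

print(df[["path", "sentence"]].head())        # audio filename and its transcript
print(df["client_id"].nunique(), "speakers")  # anonymized unique-speaker count
```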

How are dialects addressed? Is there a voice corpus for each dialect similar to how the language is split in Wikipedia? 🔗

Wikipedia hosts two sites for Armenian content, hy.wikipedia.org and hyw.wikipedia.org, due to differences in orthography. Prior to the launch, the main question was: should we follow the same pattern with Common Voice, i.e. split the language into two datasets based on dialect? We¹ discussed this exhaustively through a collection of Facebook posts, virtual meetups, and Google Docs collaboration with experienced developers and linguists, including one coder who worked for many years on the Google Translate team and a developer working directly with the Armenian Wikipedia team.

We decided that splitting the language is, in general, a bad idea, and we therefore maintain one voice corpus containing sentences from both hy.wikipedia and hyw.wikipedia. Why?

  • Splitting a language by some attribute (like dialect, accent, or geography) is an outdated approach when it comes to learning general speech. The architecture of modern speech-to-text models makes them robust to speech variance, and they are already being used in production (making $$) for other polycentric languages, including English, Arabic, Portuguese, French, and Spanish. Each of these maintains one voice corpus. Armenian should not be held up as a special case when no other language is treated that way (Chinese is an exception due to its massive scale).

  • The split already hurts Western Armenian - by isolating it, we are abandoning it. The evidence is clear with Wikipedia: take a look at the volume of hy.wikipedia and hyw.wikipedia content. The difference is embarrassingly wide - hyw contains less than 10% of the number of articles of hy. Hyw's low content also prevents it from being used in modern language systems (this literally happened with BERT), leaving it effectively and dangerously siloed.

  • It confuses developers and scientists outside the Armenian community when they see multiple Armenian-language data sources. The confusion typically leads to either abandoning the less popular one (likely Western Armenian) or abandoning the language altogether in development or research.

  • Even Mozilla implicitly says so. They encourage multiple accents and dialects in their speaker pools. The more variance in speaker data, the more robust a predictive model will be in addressing idiomatic speech and corner cases.

  • Our system takes input from all forms of Armenian speech and should be able to process each correctly. In general, the burden should be on the developer to write better code that handles varied incoming audio, not on the user to manipulate their speech just so the model can work.

¹ The Natural Language Processing in Armenia community, via Telegram and Facebook.

Are there any links to documentation to get more information about Armenian Common Voice? 🔗

What can I do with a mixed dataset if my demographic is Eastern Armenian speakers? 🔗

There is a generalized technique employed across academia and industry that we should follow: bootstrap a pretrained model, then fine-tune it with data specific to your use case.

That pretrained model is trained on large amounts of multilingual speech data to learn latent speech embeddings. Currently, no open-source model has learned Armenian speech. Armenian Common Voice intends to fix this by giving developers the resources needed to build this pretrained model. The burden is then on the developer to find data specific to their use case for fine-tuning.
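Here is a rough sketch of that recipe using the transformers and datasets libraries. The "hy-AM" dataset config and the vocab.json file are assumptions on my part, since the Armenian split only ships with the next release:

```python
from datasets import load_dataset
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# 1. Fine-tuning data: the Armenian Common Voice split ("hy-AM" is assumed).
train = load_dataset("common_voice", "hy-AM", split="train")

# 2. A character-level CTC tokenizer built from a vocab.json you derive
#    from the dataset transcripts, plus the stock feature extractor.
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)

# 3. Bootstrap: load the multilingual pretrained encoder and attach a
#    fresh CTC head sized to the Armenian character vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_extractor()  # keep the pretrained CNN features fixed

# From here, resample the audio to 16 kHz and train with transformers.Trainer.
```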

So, to answer the question: you are not blocked at all by the fact that the dataset is mixed. You are better equipped than ever, because now you will be able to plug your data into modern architectures and build something deployable. Take a look at the work being done at Hugging Face with automatic speech recognition.

About the voice recordings – is it bad if I stutter or re-read a few words? What about background noise? 🔗

Yes, it is bad. One minor stutter is forgivable; more than one, and you should re-do the recording. Validators should also reject recordings that contain more than one speech flaw.

Background noise is okay as long as your speech is intelligible to a normal listener.

If I am a Western Armenian speaker who is given an Eastern Armenian sentence in the application, should I skip it, and vice versa for Eastern Armenian speakers? 🔗

Please do not skip it if you know how to read both forms! Just read it naturally in your own accent.

Create an account now at Armenian Common Voice 🔗

(Image: common-voice-bot)