Join Transform 2021 this July 12-16. Register for the AI occasion of the 12 months.

For lots of the 700 million illiterate folks all over the world, speech recognition expertise might present a bridge to precious data. Yet in lots of nations, these folks have a tendency to talk solely languages for which the datasets needed to coach a speech recognition mannequin are scarce. This knowledge deficit persists for a number of causes, chief amongst them the truth that creating merchandise for languages spoken by smaller populations will be much less worthwhile.

Nonprofit efforts are underway to shut the hole, together with 1000 Words in 1000 Languages, Mozilla’s Common Voice, and the Masakhane venture, which seeks to translate African languages utilizing neural machine translation. But this week, researchers at Guinea-based tech accelerator GNCode and Stanford detailed a brand new initiative that uniquely advocates utilizing radio archives in creating speech programs for “low-resource” languages, notably Maninka, Pular, and Susu within the Niger Congo household.

“People who speak Niger Congo languages have among the lowest literacy rates in the world, and illiteracy rates are especially pronounced for women,” the coauthors word. “Maninka, Pular, and Susu are spoken by a combined 10 million people, primarily in seven African countries, including six where the majority of the adult population is illiterate.”

The concept behind the brand new initiative is to utilize unsupervised speech illustration studying, demonstrating that representations discovered from radio packages will be leveraged for speech recognition. Where labeled datasets don’t exist, unsupervised studying may also help to fill in area data by figuring out the correlations between knowledge factors after which coaching based mostly on the newly utilized knowledge labels.

New datasets

The researchers created two datasets, West African Speech Recognition Corpus and the West African Radio Corpus, meant for purposes concentrating on West African languages. The West African Speech Recognition Corpus comprises over 10,000 hours of recorded speech in French, Maninka, Susu, and Pular from roughly 49 audio system, together with Guinean first names and voice instructions like “update that,” “delete that,” “yes,” and “no.” As for the West African Radio Corpus, it consists of 17,000 audio clips sampled from archives collected from six Guinean radio stations. The broadcasts within the West African Radio Corpus span information and exhibits in languages together with French, Guerze, Koniaka, Kissi, Kono, Maninka, Mano, Pular, Susu, and Toma.

To create a speech recognition system, the researchers tapped Facebook’s wav2vec, an open supply framework for unsupervised speech processing. Wav2vec makes use of an encoder module that takes uncooked audio and outputs speech representations, that are fed right into a Transformer that ensures the representations seize whole-audio-sequence data. Created by Google researchers in 2017, the Transformer community structure was initially meant as a means to enhance machine translation. To this finish, it makes use of consideration capabilities as an alternative of a recurrent neural community to foretell what comes subsequent in a sequence.

speech recognition

Above: The accuracies of WAwav2vec.

Despite the truth that the radio dataset contains cellphone calls in addition to background and foreground music, static, and interference, the researchers managed to coach a wav2vec mannequin with the West African Radio Corpus, which they name WAwav2vec. In one experiment with speech throughout French, Maninka, Pular, and Susu, the coauthors say that they achieved multilingual speech recognition accuracy (88.01%) on par with Facebook’s baseline wav2vec mannequin (88.79%) — although the baseline mannequin was skilled on 960 hours of speech versus WAwav2vec’s 142 hours.

Virtual assistant

As a proof of idea, the researchers used WAwav2vec to create a prototype of a speech assistant. The assistant — which is out there in open supply together with the datasets — can acknowledge fundamental contact administration instructions (e.g., “search,” “add,” “update,” and “delete”) along with names and digits. As the coauthors word, smartphone entry has exploded within the Global South, with an estimated 24.5 million smartphone house owners in South Africa alone, in keeping with Statista, making this type of assistant prone to be helpful.

“To the best of our knowledge, the multilingual speech recognition models we trained are the first-ever to recognize speech in Maninka, Pular, and Susu. We also showed how this model can power a voice interface for contact management,” the coauthors wrote. “Future work could expand its vocabulary to application domains such as microfinance, agriculture, or education. We also hope to expand its capabilities to more languages from the Niger-Congo family and beyond, so that literacy or ability to speak a foreign language are not prerequisites for accessing the benefits of technology. The abundance of radio data should make it straightforward to extend the encoder to other languages.”


VentureBeat’s mission is to be a digital city sq. for technical decision-makers to achieve data about transformative expertise and transact. Our website delivers important data on knowledge applied sciences and techniques to information you as you lead your organizations. We invite you to develop into a member of our group, to entry:

  • up-to-date data on the themes of curiosity to you
  • our newsletters
  • gated thought-leader content material and discounted entry to our prized occasions, comparable to Transform 2021: Learn More
  • networking options, and extra

Become a member