Meta’s open-source speech AI recognizes over 4,000 spoken languages

Meta has created an AI language model that (in a refreshing change of pace) isn’t a ChatGPT clone. The company’s Massively Multilingual Speech (MMS) project can recognize over 4,000 spoken languages and produce speech (text-to-speech) in over 1,100. Like most of its other publicly announced AI projects, Meta is open-sourcing MMS today to help preserve language diversity and encourage researchers to build on its foundation. “Today, we are publicly sharing our models and code so that others in the research community can build upon our work,” the company wrote. “Through this work, we hope to make a small contribution to preserve the incredible language diversity of the world.”
Speech recognition and text-to-speech models typically require training on thousands of hours of audio with accompanying transcription labels. (Labels are essential to machine learning, allowing the algorithms to correctly categorize and “understand” the data.) But for languages that aren’t widely used in industrialized nations, many of which are in danger of disappearing in the coming decades, “this data simply doesn’t exist,” as Meta puts it.
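To make that labeling requirement concrete, here is a minimal, hypothetical sketch of what a single supervised ASR training example looks like; the file path and transcript are placeholders for illustration, not data from MMS.

```python
# One labeled training pair for supervised speech recognition:
# an audio clip plus its human-written transcript. Supervised ASR
# needs thousands of hours of pairs like this per language.
labeled_example = {
    "audio_path": "clips/utterance_0001.wav",   # placeholder path
    "sampling_rate": 16_000,                    # 16 kHz is typical for ASR
    "transcript": "the weather is clear today"  # the label the model learns from
}
```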
Meta used an unconventional approach to collecting audio data: tapping into audio recordings of translated religious texts. “We turned to religious texts, such as the Bible, that have been translated in many different languages and whose translations have been widely studied for text-based language translation research,” the company said. “These translations have publicly available audio recordings of people reading these texts in different languages.” Incorporating the unlabeled recordings of the Bible and similar texts, Meta’s researchers increased the model’s available languages to over 4,000.
If you’re like me, that approach may raise your eyebrows at first glance, since it sounds like a recipe for an AI model heavily biased toward Christian worldviews. But Meta says that isn’t the case. “While the content of the audio recordings is religious, our analysis shows that this does not bias the model to produce more religious language,” Meta wrote. “We believe this is because we use a connectionist temporal classification (CTC) approach, which is far more constrained compared with large language models (LLMs) or sequence-to-sequence models for speech recognition.” Furthermore, although most of the religious recordings were read by male speakers, that didn’t introduce a male bias either: the model performs equally well with female and male voices.
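For readers wondering why CTC is “more constrained”: a CTC model emits one label distribution per audio frame, and the loss only scores monotonic alignments of the target transcript against those frames, so the output stays tied to the audio rather than being generated freely the way an LLM generates text. Below is a minimal PyTorch sketch of the loss; the shapes and vocabulary size are illustrative assumptions, not MMS’s actual configuration.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 50 audio frames, batch of 1, 32-symbol vocabulary,
# with index 0 reserved for CTC's blank token.
T, N, C = 50, 1, 32
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # per-frame label scores

target = torch.randint(low=1, high=C, size=(N, 10))   # a 10-symbol transcript
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC marginalizes over all monotonic alignments of the transcript to the
# audio frames; the model cannot emit text unanchored from the audio.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, target, input_lengths, target_lengths)
print(loss.item())
```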
After training an alignment model to make the data more usable, Meta used wav2vec 2.0, the company’s “self-supervised speech representation learning” model, which can train on unlabeled data. Combining unconventional data sources with a self-supervised speech model led to impressive results. “Our results show that the Massively Multilingual Speech models perform well compared with existing models and cover 10 times as many languages.” Specifically, Meta compared MMS to OpenAI’s Whisper, and it exceeded expectations: “We found that models trained on the Massively Multilingual Speech data achieve half the word error rate, but Massively Multilingual Speech covers 11 times more languages.”
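If you want to try the released checkpoints, they can be loaded through Hugging Face’s transformers library. The sketch below shows MMS-style speech recognition with greedy CTC decoding; the checkpoint name (“facebook/mms-1b-all”), the language code, and the adapter-switching calls reflect the published MMS integration as I understand it, so verify them against Meta’s release before relying on this.

```python
import numpy as np
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

# Assumed checkpoint name for the multilingual ASR model released with MMS;
# verify against Meta's release notes.
model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# MMS swaps small per-language adapter weights; "fra" selects French.
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

# Placeholder input: one second of 16 kHz silence. Replace with real audio.
waveform = np.zeros(16_000, dtype=np.float32)
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: best label per frame, then collapse repeats and blanks.
ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(ids))
```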
Meta cautions that its new models aren’t perfect. “For example, there is some risk that the speech-to-text model may mistranscribe select words or phrases,” the company wrote. “Depending on the output, this could result in offensive and/or inaccurate language. We continue to believe that collaboration across the AI community is critical to the responsible development of AI technologies.”
Now that Meta has released MMS for open-source research, it hopes to reverse the trend of technology whittling the world’s languages down to the 100 or fewer that Big Tech most often supports. It sees a world where assistive technology, TTS and even VR/AR tech let everyone speak and learn in their native tongues. It said, “We envision a world where technology has the opposite effect, encouraging people to keep their languages alive since they can access information and use technology by speaking in their preferred language.”