
Pronunciation detection for Alexa’s new English-learning experience


In January 2023, Alexa launched a language-learning experience in Spain that helps Spanish speakers learn beginner-level English. The experience was developed in collaboration with Vaughan, the leading English-language-learning provider in Spain, and it aimed to offer an immersive English-learning program, with a particular focus on pronunciation evaluation.

We are now expanding this offering to Mexico and the Spanish-speaking population in the US and will be adding more languages in the future. The language-learning experience includes structured lessons on vocabulary, grammar, expression, and pronunciation, with practice exercises and quizzes. To try it, set your device language to Spanish and tell Alexa “Quiero aprender inglés.”

Mini-lesson content page: lessons covering vocabulary, grammar, expression, and pronunciation.

The highlight of this Alexa skill is its pronunciation feature, which provides accurate feedback whenever a customer mispronounces a word or sentence. At this year’s International Conference on Acoustics, Speech, and Signal Processing (ICASSP), we presented a paper describing our state-of-the-art approach to mispronunciation detection.

Pronunciation correction: Blue highlighting indicates correct pronunciation. Red highlighting indicates incorrect pronunciation. For incorrectly pronounced words and phrases, Alexa will provide instructions on how to pronounce them.

Our method uses a novel phonetic recurrent-neural-network-transducer (RNN-T) model that predicts phonemes, the smallest units of speech, from the learner’s pronunciation. The model can therefore provide fine-grained pronunciation evaluation at the word, syllable, or phoneme level. For example, if a learner mispronounces the word “rabbit” as “rabid”, the model will output the five-phoneme sequence R AE B IH D. It can then detect the mispronounced phonemes (IH D) and syllable (-bid) by using Levenshtein alignment to compare the phoneme sequence with the reference sequence R AE B AH T.
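To make the alignment step concrete, here is a minimal sketch (our own simplification, not the production system) that aligns a predicted phoneme sequence to a reference pronunciation with Levenshtein (edit-distance) alignment and flags the positions where they differ:

```python
# Minimal sketch: Levenshtein alignment of a predicted phoneme sequence
# against a reference pronunciation, flagging mispronounced phonemes.

def levenshtein_align(predicted, reference):
    """Return (predicted_phoneme, reference_phoneme) pairs, with None for gaps."""
    m, n = len(predicted), len(reference)
    # dp[i][j] = edit distance between predicted[:i] and reference[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == reference[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match / substitution
    # Backtrace to recover the aligned pairs.
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (predicted[i - 1] != reference[j - 1]):
            pairs.append((predicted[i - 1], reference[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((predicted[i - 1], None))  # extra phoneme
            i -= 1
        else:
            pairs.append((None, reference[j - 1]))  # missing phoneme
            j -= 1
    return list(reversed(pairs))

predicted = ["R", "AE", "B", "IH", "D"]   # learner said "rabid"
reference = ["R", "AE", "B", "AH", "T"]   # canonical "rabbit"
for pred, ref in levenshtein_align(predicted, reference):
    status = "ok" if pred == ref else "mispronounced"
    print(f"{ref or '-':>3} -> {pred or '-':>3}  {status}")
```

Running the sketch on the “rabbit”/“rabid” example marks the last two positions (AH → IH, T → D) as mispronounced, which is exactly the per-phoneme feedback described above.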

Related content

In a top-3% paper at ICASSP, Amazon researchers adapt graph-based label propagation to improve speech recognition on underrepresented pronunciations.

The paper highlights two data gaps that haven’t been addressed in previous pronunciation-modeling work. The first is the ability to disambiguate similar-sounding phonemes from different languages (e.g., the rolled “r” sound in Spanish vs. the “r” sound in English). We tackled this challenge by designing a multilingual pronunciation lexicon and building a large code-mixed phonetic dataset for training.
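The sketch below shows one simple way such a merged lexicon could be represented; the phoneme symbols (e.g., R for the English approximant, RR for the Spanish trill) and the entries are illustrative assumptions, not Amazon’s actual inventory:

```python
# Illustrative only: a merged multilingual lexicon in which similar-sounding
# phonemes from different languages keep distinct symbols, so the model can
# learn to tell them apart.

multilingual_lexicon = {
    # English entries (English approximant "r" written as R)
    "rabbit": [["R", "AE", "B", "AH", "T"]],
    "red":    [["R", "EH", "D"]],
    # Spanish entries: the trilled "rr" gets its own symbol (RR), distinct
    # from the English R; X stands in for the Spanish /x/ ("jota").
    "perro":  [["P", "E", "RR", "O"]],
    "rojo":   [["RR", "O", "X", "O"]],
}

def pronunciations(word):
    """Look up all reference pronunciations for a word in the merged lexicon."""
    return multilingual_lexicon.get(word.lower(), [])

print(pronunciations("perro"))  # [['P', 'E', 'RR', 'O']]
```

Keeping separate symbols for near-identical sounds is what lets a model trained on code-mixed data learn the fine distinction between, say, the Spanish trill and the English “r”.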

The other data gap is the ability to learn distinctive mispronunciation patterns from language learners. We achieve this by leveraging the autoregressiveness of the RNN-T model, meaning the dependence of its outputs on the inputs and outputs that preceded them. This context awareness means that the model can capture common mispronunciation patterns from training data. Our pronunciation model has achieved state-of-the-art performance in both phoneme prediction accuracy and mispronunciation detection accuracy.

L2 data augmentation

One of the key technical challenges in building a phonetic-recognition model for non-native (L2) speakers is that there are very limited datasets for mispronunciation evaluation. In our Interspeech 2022 paper “L2-GEN: A neural phoneme paraphrasing approach to L2 speech synthesis for mispronunciation diagnosis”, we proposed bridging this gap by using data augmentation. Specifically, we built a phoneme paraphraser that can generate realistic L2 phonemes for speakers from a specific locale (e.g., phonemes representing a native Spanish speaker talking in English).

Related content

Parallel speech recognizers, language ID, and translation models geared to conversational speech are among the modifications that make Live Translation possible.

As is common with grammatical-error-correction tasks, we use a sequence-to-sequence model but flip the task direction, training the model to mispronounce words rather than to correct mispronunciations. Additionally, to further enrich and diversify the generated L2 phoneme sequences, we propose a diversified and preference-aware decoding phase that combines a diversified beam search with a preference loss that is biased toward human-like mispronunciations.
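To illustrate the reversed task direction only (the utterances and phoneme labels below are invented for illustration, not data from the paper), the training pairs are simply flipped relative to the correction setting:

```python
# Sketch of the idea: in correction-style training the pairs run
# learner -> canonical; for paraphrasing-style augmentation we flip them so
# the model learns to turn canonical phonemes into plausible L2 realizations.

annotated = [
    # (canonical phonemes, phonemes an L2 speaker actually produced)
    (["R", "AE", "B", "AH", "T"], ["R", "AE", "B", "IH", "D"]),
    (["V", "EH", "R", "IY"],      ["B", "EH", "R", "IY"]),  # V -> B substitution
]

# Correction direction: learner realization -> canonical target.
correction_pairs = [(learner, canonical) for canonical, learner in annotated]

# Reversed (paraphrasing) direction: canonical source -> learner realization.
paraphrase_pairs = [(canonical, learner) for canonical, learner in annotated]

for src, tgt in paraphrase_pairs:
    print(" ".join(src), "->", " ".join(tgt))
```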

For each input phone, or speech sound, the model produces multiple candidate phonemes as outputs, and sequences of phonemes are modeled as a tree, with possibilities proliferating with each new phone. Usually, the top-ranked phoneme sequences are extracted from the tree via beam search, which pursues only those branches of the tree with the highest probabilities. In our paper, however, we propose a beam search method that prioritizes rare phonemes, or phoneme candidates that differ from most of the others at the same depth in the tree.
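The following sketch is our own simplification of that idea, not the paper’s implementation: when selecting the surviving hypotheses at each depth, candidates ending in a phoneme that other selected hypotheses already end in are penalized, so rarer phoneme variants stay in the beam instead of the most likely phoneme dominating every hypothesis. The toy per-step distributions are illustrative assumptions.

```python
import math

# Diversified beam search over per-step phoneme distributions (simplified).

def diversified_beam_search(step_distributions, beam_width=3, diversity_penalty=1.0):
    """step_distributions: list of dicts mapping phoneme -> probability at that step."""
    beams = [([], 0.0)]  # (phoneme sequence, cumulative log probability)
    for dist in step_distributions:
        # Expand every beam with every candidate phoneme at this step.
        expansions = [(seq + [ph], logp + math.log(p))
                      for seq, logp in beams
                      for ph, p in dist.items()]
        # Greedy diverse selection: pick the best remaining expansion, then
        # penalize other expansions ending in the same phoneme.
        selected, counts = [], {}
        while expansions and len(selected) < beam_width:
            best_idx = max(
                range(len(expansions)),
                key=lambda i: expansions[i][1]
                - diversity_penalty * counts.get(expansions[i][0][-1], 0),
            )
            seq, logp = expansions.pop(best_idx)
            counts[seq[-1]] = counts.get(seq[-1], 0) + 1
            selected.append((seq, logp))
        beams = selected
    return beams

# Toy distributions for the last two phonemes of "rabbit": with the penalty,
# the less likely learner variants (IH, D) survive alongside AH and T.
steps = [{"AH": 0.7, "IH": 0.3}, {"T": 0.6, "D": 0.4}]
for seq, logp in diversified_beam_search(steps, beam_width=3):
    print(" ".join(seq), round(logp, 3))
```

With the penalty set to zero, this reduces to ordinary beam search, which tends to fill every beam with the same high-probability phonemes; the penalty is what forces variety across hypotheses at the same depth.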

From established sources in the language-learning literature, we also assemble lists of common mispronunciations at the phoneme level, represented as pairs of phonemes, one the standard phoneme in the language and one its nonstandard variant. We construct a loss function that, during model training, prioritizes outputs that use the nonstandard variants on our list.
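As a sketch of what such a preference term could look like (PyTorch is our choice of framework here; the confusion pairs, the weighting, and the availability of a per-step canonical phoneme are illustrative assumptions, not details from the paper), ordinary cross-entropy on the L2 target sequence is combined with a bonus for probability mass placed on documented nonstandard variants:

```python
import torch
import torch.nn.functional as F

# Common (standard -> nonstandard) confusions for Spanish-L1 learners of
# English; illustrative entries only.
MISPRONUNCIATION_PAIRS = {"V": "B", "IY": "IH", "Z": "S"}

def preference_aware_loss(logits, l2_targets, canonical, phoneme_to_id, weight=0.1):
    """
    logits:     (seq_len, vocab_size) decoder scores
    l2_targets: (seq_len,) ids of the observed L2 phonemes (training targets)
    canonical:  canonical phoneme symbols aligned to each output step
    """
    ce = F.cross_entropy(logits, l2_targets)   # standard seq2seq training loss
    log_probs = F.log_softmax(logits, dim=-1)
    bonus = logits.new_zeros(())
    for step, phoneme in enumerate(canonical):
        variant = MISPRONUNCIATION_PAIRS.get(phoneme)
        if variant is not None:
            # Reward mass on the human-like nonstandard variant at this step.
            bonus = bonus + log_probs[step, phoneme_to_id[variant]]
    return ce - weight * bonus / max(len(canonical), 1)

# Tiny usage example with a made-up three-phoneme vocabulary.
phoneme_to_id = {"V": 0, "B": 1, "F": 2}
logits = torch.randn(1, 3)
l2_targets = torch.tensor([1])   # observed learner output: B
print(preference_aware_loss(logits, l2_targets, ["V"], phoneme_to_id))
```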

In experiments, we saw accuracy improvements of up to 5% in mispronunciation detection over a baseline model trained without augmented data.

Balancing false rejection and false acceptance

A key consideration in designing a pronunciation model for a language-learning experience is to balance the false-rejection and false-acceptance rates. A false rejection occurs when the pronunciation model detects a mispronunciation, but the customer was actually correct or used a consistent but lightly accented pronunciation. A false acceptance occurs when a customer mispronounces a word and the model fails to detect it.

Related content

Methods for learning from noisy data, using phonetic embeddings to improve entity resolution, and quantization-aware training are a few of the highlights.

Our system has two design features meant to balance these two metrics. To reduce false acceptances, we first combine our standard pronunciation lexicons for English and Spanish into a single lexicon, with multiple phoneme sequences corresponding to each word. Then, we use that lexicon to automatically annotate unannotated speech samples that fall into three categories: native Spanish, native English, and code-switched Spanish and English. Training the model on this dataset enables it to distinguish very subtle differences between phonemes.

To reduce false rejections, we use a multireference pronunciation lexicon in which each word is associated with multiple reference pronunciations. For example, the word “data” may be pronounced as either “day-tah” or “dah-tah”, and the system will accept both versions as correct.
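A minimal sketch of that acceptance check, assuming a small hand-written lexicon (the entries and phoneme labels are illustrative, not the production lexicon):

```python
# A pronunciation is accepted if it matches ANY reference pronunciation
# listed for the word in the multireference lexicon.

MULTIREFERENCE_LEXICON = {
    "data":   [["D", "EY", "T", "AH"],   # "day-tah"
               ["D", "AA", "T", "AH"]],  # "dah-tah"
    "either": [["IY", "DH", "ER"],
               ["AY", "DH", "ER"]],
}

def is_accepted(word, predicted_phonemes):
    """Return True if the predicted phonemes match any reference for the word."""
    return predicted_phonemes in MULTIREFERENCE_LEXICON.get(word.lower(), [])

print(is_accepted("data", ["D", "AA", "T", "AH"]))  # True: "dah-tah" is also valid
print(is_accepted("data", ["D", "AA", "D", "AH"]))  # False: T mispronounced as D
```

Only pronunciations outside every listed reference are flagged, which is what keeps legitimate variant pronunciations from being falsely rejected.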

In ongoing work, we are exploring several approaches to further improving our pronunciation evaluation feature. One of these is building a multilingual model that can be used for pronunciation evaluation across many languages. We are also expanding the model to diagnose more characteristics of mispronunciation, such as tone and lexical stress.


