Speech Recognition for Language Learners: A Technical Guide
Building pronunciation feedback into a language learning app requires more than dropping in a speech-to-text API. The general-purpose APIs (Whisper, Google STT, Azure Speech) are optimised for transcription accuracy on fluent speech — they're not designed for the specific problems of language learner speech, which has different phonological patterns, different error distributions, and different evaluation needs.
Here's a technical guide to implementing speech recognition that actually serves language learners.
The Core Problem: Learner Speech vs. Native Speech
Language learner speech has distinct characteristics that trip up models trained on native speaker corpora:
- Non-native phonemes: A Spanish speaker learning English will produce /v/ as /b/ or /β/. A Chinese speaker may not distinguish /r/ and /l/. These aren't random errors — they're systematic substitutions from the L1 phonological system.
- Prosodic errors: Wrong stress placement ("reCORD" vs "REcord"), flattened intonation, incorrect syllable timing.
- Hesitation phenomena: More frequent false starts, filled pauses (uh, um), and reformulations.
- Accent interference: Learner speech at B1–B2 level is recognisable but may use intonation patterns from the native language.
A standard transcription API that returns 95% word accuracy on native speech may drop to 75–80% on learner speech — which is bad for a transcription use case and catastrophic for pronunciation feedback, where you need to detect the subtle differences.
Architecture Options
Option 1: Standard STT → Text Comparison
The simplest approach: transcribe the learner's speech, compare the transcript to the target phrase.
```python
import openai

def check_pronunciation(audio_bytes: bytes, target_text: str) -> dict:
    transcript = openai.audio.transcriptions.create(
        model="whisper-1",
        file=("audio.wav", audio_bytes, "audio/wav"),
        language="en",  # force language to prevent misdetection
    ).text

    # Simple word-level comparison. Note the limitation: zip() pairs words
    # positionally, so a single inserted or dropped word misaligns every
    # subsequent pair.
    target_words = target_text.lower().split()
    spoken_words = transcript.lower().split()
    matches = sum(1 for t, s in zip(target_words, spoken_words) if t == s)
    accuracy = matches / len(target_words) if target_words else 0.0

    return {
        "transcript": transcript,
        "accuracy": accuracy,
        "matched": matches,
        "total": len(target_words),
    }
```
This approach works for basic go/no-go feedback ("you said it correctly / incorrectly") but can't identify which phoneme was mispronounced, which is what a learner actually needs.
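Even within the text-only approach, the positional zip comparison can be improved with a proper sequence alignment, so one inserted or dropped word doesn't cascade into spurious mismatches. A minimal sketch using the standard library's difflib (the punctuation normalisation is an assumption about input format):

```python
import difflib
import string

def align_words(target_text: str, transcript: str) -> list[dict]:
    """Align target and spoken word sequences so insertions/deletions
    don't misalign everything after them. Illustrative sketch only."""
    def normalise(s: str) -> list[str]:
        return [w.strip(string.punctuation) for w in s.lower().split()]

    target = normalise(target_text)
    spoken = normalise(transcript)
    ops = []
    matcher = difflib.SequenceMatcher(a=target, b=spoken)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops += [{"op": "match", "word": w} for w in target[i1:i2]]
        if tag in ("replace", "delete"):
            # target words the learner skipped or garbled
            ops += [{"op": "missed", "word": w} for w in target[i1:i2]]
        if tag in ("replace", "insert"):
            # extra words the recogniser heard
            ops += [{"op": "extra", "word": w} for w in spoken[j1:j2]]
    return ops
```

This gives per-word "missed"/"extra" labels rather than a single accuracy number, which is a more useful input for feedback.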
Option 2: Forced Alignment → Phoneme-Level Analysis
For granular pronunciation feedback, you need forced alignment: aligning the audio signal to the expected phoneme sequence to identify where divergences occur.
Montreal Forced Aligner (MFA) is the standard open-source tool for this. It takes audio + transcript and returns time-aligned phoneme segments:
```shell
# Install MFA — it is distributed via conda-forge; pip alone won't pull
# in the Kaldi binaries it depends on
conda install -c conda-forge montreal-forced-aligner

# Download a pretrained acoustic model and pronunciation dictionary
mfa model download acoustic english_mfa
mfa model download dictionary english_mfa

# Align a directory of audio + transcript files
mfa align audio_dir/ english_mfa english_mfa aligned_dir/
```
The output is a TextGrid file with word and phoneme tiers, each with start/end timestamps. You can then compare the actual phoneme segments to the expected phonemes from a G2P (grapheme-to-phoneme) model.
For a Python integration:
```python
from montreal_forced_aligner.alignment import PretrainedAligner

# Note: MFA's Python API targets batch corpus alignment and changes between
# releases — treat this as a sketch and check the docs for your installed
# version. (MFA 3.x also has an `mfa align_one` CLI command for single files.)
aligner = PretrainedAligner(
    acoustic_model_path="english_mfa",
    dictionary_path="english_mfa",
)

def get_phoneme_alignment(audio_path: str, transcript: str) -> list[dict]:
    # Returns a list of {phoneme, start, end} dicts.
    aligner.align_utterance(audio_path, transcript)
    # Parse the TextGrid that MFA writes alongside the audio;
    # parse_textgrid is a helper you supply (e.g. built on the praatio library)
    return parse_textgrid(audio_path.replace(".wav", ".TextGrid"))
```
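The TextGrid parsing step is usually handled by a library such as praatio, but as a self-contained illustration, here's a minimal parser for the long-format TextGrid files MFA writes (a sketch — it skips empty intervals and assumes well-formed output):

```python
import re

def parse_textgrid(path: str, tier_name: str = "phones") -> list[dict]:
    """Extract labelled intervals from one tier of a long-format Praat
    TextGrid. Minimal illustrative parser; use praatio in production."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Split the file into per-tier chunks at 'item [n]:' boundaries
    for tier in re.split(r"item \[\d+\]:", text)[1:]:
        name_match = re.search(r'name = "([^"]*)"', tier)
        if not name_match or name_match.group(1) != tier_name:
            continue
        segments = []
        # Each interval is an xmin/xmax/text triple; the tier header's own
        # xmin/xmax lines are not followed directly by a text line, so the
        # pattern below skips them.
        for m in re.finditer(
            r'xmin = ([\d.]+)\s+xmax = ([\d.]+)\s+text = "([^"]*)"', tier
        ):
            label = m.group(3)
            if label:  # skip empty (silence) intervals
                segments.append({
                    "phoneme": label,
                    "start": float(m.group(1)),
                    "end": float(m.group(2)),
                })
        return segments
    return []
```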
This lets you identify exactly which phoneme was mispronounced, at what timestamp, and by how much — enabling precise feedback ("you pronounced the /θ/ in 'think' as /s/").
Option 3: Whisper + Custom Pronunciation Scoring
A practical middle ground: use Whisper for transcription and a custom model for pronunciation scoring.
Whisper's word_timestamps feature (available in the Python library) returns per-word timestamps that you can use as an approximation of forced alignment:
```python
import whisper

model = whisper.load_model("base.en")

def transcribe_with_timestamps(audio_path: str) -> dict:
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en",
    )
    words = []
    for segment in result["segments"]:
        for word in segment.get("words", []):
            words.append({
                "word": word["word"].strip(),
                "start": word["start"],
                "end": word["end"],
                "probability": word.get("probability", 1.0),
            })
    return {
        "text": result["text"],
        "words": words,
        "language": result["language"],
    }
```
Low probability scores at the word level indicate the model was uncertain, which correlates with mispronunciation. This isn't as precise as forced alignment but is much simpler to implement and runs on consumer hardware.
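One way to operationalise this: flag words whose probability falls below a cutoff as candidates for feedback. A sketch over the output of `transcribe_with_timestamps` above (the 0.5 default is an assumed starting point, not a calibrated value — tune it against labelled learner recordings):

```python
def flag_uncertain_words(words: list[dict], threshold: float = 0.5) -> list[dict]:
    """Return words the model was uncertain about — likely mispronunciation
    candidates. Threshold is an assumption; calibrate on real learner data."""
    return [w for w in words if w["probability"] < threshold]

# Example with hypothetical output from transcribe_with_timestamps:
words = [
    {"word": "think", "start": 0.0, "end": 0.4, "probability": 0.31},
    {"word": "about", "start": 0.4, "end": 0.8, "probability": 0.93},
]
uncertain = flag_uncertain_words(words)
```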
Whisper for Multilingual Learner Speech
Whisper is trained on 680,000 hours of multilingual speech and handles learner speech better than most commercial APIs because its training data includes accented and non-native speech. Key parameters for learner use cases:
```python
result = model.transcribe(
    audio_path,
    language="es",            # force target language — prevents misidentifying
                              # accented speech as a different language
    task="transcribe",        # "transcribe" (same language) vs "translate" (→ English)
    temperature=0.0,          # deterministic output for consistent scoring
    no_speech_threshold=0.6,  # threshold for detecting silence/no speech
    logprob_threshold=-1.0,   # lower = accept lower-confidence transcriptions
    compression_ratio_threshold=2.4,  # reject degenerate, repetitive output
)
```
Setting language explicitly is critical — without it, a Spanish learner speaking heavily accented English may have their audio classified as Spanish, producing a transcription in the wrong language.
Pronunciation Scoring Approaches
Goodness of Pronunciation (GOP)
GOP is the standard quantitative measure of pronunciation quality at the phoneme level. It measures how well the learner's acoustic signal matches the expected phoneme model:
```python
import math

# Simplified GOP calculation using acoustic model likelihoods
def calculate_gop(phoneme_posterior: list[float], expected_phoneme_idx: int) -> float:
    # phoneme_posterior: posterior probability distribution over all phonemes
    # expected_phoneme_idx: index of the expected phoneme
    # Returns: GOP score in (-inf, 0]; higher (closer to 0) is better
    expected_prob = phoneme_posterior[expected_phoneme_idx]
    max_prob = max(phoneme_posterior)
    gop = math.log(expected_prob / max_prob + 1e-10)
    return gop
```
A GOP score near 0 means the expected phoneme had the highest posterior probability (correct pronunciation). A very negative score means a different phoneme dominated (mispronunciation).
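Scoring a whole utterance then reduces to running GOP over each aligned phoneme and thresholding. A sketch, assuming you already have per-phoneme posterior distributions from an acoustic model (the -0.5 flag threshold is an assumption; real systems tune it, often per phoneme):

```python
import math

def score_utterance(
    posteriors: list[list[float]],
    expected: list[int],
    flag_below: float = -0.5,  # assumed threshold, tune on learner data
) -> dict:
    """Apply a GOP-style score to each aligned phoneme in an utterance
    and flag likely mispronunciations. Illustrative sketch."""
    scores = []
    for dist, idx in zip(posteriors, expected):
        gop = math.log(dist[idx] / max(dist) + 1e-10)
        scores.append({"phoneme_idx": idx, "gop": gop, "flagged": gop < flag_below})
    n = len(scores)
    return {
        "mean_gop": sum(s["gop"] for s in scores) / n if n else 0.0,
        "flagged": [s for s in scores if s["flagged"]],
        "scores": scores,
    }
```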
Extended GOP (eGOP) for Language Learners
Standard GOP has high false positive rates for learner speech because it's calibrated on native speech distributions. Extended GOP variants (using acoustic models trained on learner speech corpora) significantly improve precision.
The ISLE corpus (Italian and German learners of English) and the L2-ARCTIC corpus (Indian, Korean, Mandarin, Spanish speakers of English) are the main training resources for learner-specific acoustic models.
React Native Integration
For a mobile language app, on-device inference is preferable to reduce latency:
```typescript
// Using @xenova/transformers (Transformers.js) for on-device Whisper
import { pipeline } from '@xenova/transformers';

const asr = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-base.en', // ~150MB, runs on-device
  { device: 'auto' }
);

async function transcribeAudio(audioUri: string): Promise<string> {
  const result = await asr(audioUri, {
    language: 'english',
    task: 'transcribe',
    return_timestamps: 'word',
  });
  return result.text;
}
```
For devices where on-device inference is too slow (older iPhones, low-end Android), a hybrid approach — stream audio to a server endpoint — adds ~200–400ms of network latency but maintains quality.
Feedback Design: What to Tell the Learner
Technical accuracy in pronunciation detection is only useful if the feedback is pedagogically sound. The design principles from SLA research:
- Prioritise actionable errors: Tell the learner which phoneme to work on, not just that pronunciation was incorrect.
- Recast before explicit correction: Play back the correct pronunciation immediately after the learner's attempt. Hearing the correct form in context is more effective than a score.
- Avoid overcorrection: Flag at most 1–2 errors per utterance. More than that is demoralising and cognitively overloading.
- Use visual feedback for prosody: A pitch track (waveform with pitch overlay) helps learners see sentence-level intonation patterns they can't easily hear in their own speech.
- Track improvement over time: Learners need to see progress. A per-phoneme accuracy trend chart (this week vs. last week) is more motivating than an absolute score.
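The last point is mostly a data-aggregation problem. A sketch of the rollup behind a trend chart, assuming attempts are logged as simple per-phoneme records (the record shape here is a hypothetical schema, not from any library):

```python
from collections import defaultdict

def phoneme_accuracy_by_week(attempts: list[dict]) -> dict:
    """Roll up logged attempts into per-phoneme, per-week accuracy.
    Assumes records shaped like:
        {"phoneme": "TH", "correct": True, "week": "2024-W21"}"""
    totals = defaultdict(lambda: [0, 0])  # (phoneme, week) -> [correct, total]
    for a in attempts:
        key = (a["phoneme"], a["week"])
        totals[key][0] += int(a["correct"])
        totals[key][1] += 1
    return {
        key: round(correct / total, 2)
        for key, (correct, total) in totals.items()
    }
```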
The speech recognition layer provides the data; the UX layer determines whether it produces learning. Getting both right is the product challenge.
I'm building Pocket Linguist, an AI-powered language tutor for iOS. It uses spaced repetition, camera translation, and conversational AI to help you reach conversational fluency faster. Try it free.