A preprint on arXiv, “Evaluation of Automatic Speech Recognition Using Generative Large Language Models,” proposes decoder-based generative LLMs as automatic evaluators for ASR systems and tests them on the HATS benchmark. Background context from the source: The authors frame the work against the field’s reliance on Word Error Rate (WER), which they describe as a traditional, form-based metric that is insensitive to meaning: it measures discrepancies between reference and hypothesis at the word level without accounting for whether two transcriptions preserve the same semantics. They also note that embedding-based semantic metrics have been shown to correlate better with human perception of transcription quality than WER, but prior efforts have focused mainly on encoder-style models, leaving decoder-based generative LLMs underexplored for this role.
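The form-versus-meaning limitation of WER is easy to see concretely. WER counts word-level substitutions, deletions, and insertions against the reference length, so two hypotheses with identical scores can differ drastically in meaning. The sketch below (a minimal illustration, not code from the paper) computes standard WER via word-level edit distance and shows one substitution that flips the meaning scoring the same as one that preserves it:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the patient is not stable"
# Both hypotheses have exactly one substitution, so WER = 1/5 = 0.2 for each,
# yet the first flips the meaning and the second preserves it.
print(wer(ref, "the patient is now stable"))  # 0.2, meaning flipped
print(wer(ref, "a patient is not stable"))    # 0.2, meaning preserved
```

This is exactly the insensitivity the authors point to: a purely form-based score cannot distinguish the two cases.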

To address that gap, the paper systematically evaluates decoder LLMs as ASR evaluators using three tasks: pairwise hypothesis selection, semantic distance estimation from generative embeddings, and qualitative error classification. In the pairwise selection setup, the model is given two ASR hypotheses for the same utterance and must choose the better one, enabling direct comparison with human annotator preferences on HATS. For semantic distance, the authors derive embeddings from the generative LLMs and compute distances between ASR outputs and reference transcriptions, treating those distances as semantic quality scores.
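The abstract does not specify how the authors pool token representations into a sentence embedding or which distance function they use. A common recipe for this kind of semantic distance (a hedged sketch under that assumption, with placeholder arrays standing in for a decoder LLM's hidden states) is mean-pooling per-token states and taking cosine distance between the pooled vectors:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity; near 0 for semantically close embeddings."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pool(hidden_states: np.ndarray) -> np.ndarray:
    """Mean-pool per-token hidden states of shape (seq_len, dim) into one vector."""
    return hidden_states.mean(axis=0)

# Placeholder per-token states standing in for an LLM's last hidden layer;
# the hypothesis states are a small perturbation of the reference states,
# mimicking a near-paraphrase transcription.
rng = np.random.default_rng(0)
ref_states = rng.normal(size=(6, 8))
hyp_states = ref_states + 0.05 * rng.normal(size=(6, 8))

score = cosine_distance(pool(hyp_states), pool(ref_states))
print(f"semantic distance: {score:.4f}")  # small value -> high semantic quality
```

In the paper's setup, the hidden states would come from the generative LLM itself, and the resulting distance is treated as a continuous semantic quality score for the hypothesis.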

In the qualitative error classification task, the LLM assigns ASR errors to categories, with the stated goal of making evaluation more interpretable and explicitly tied to meaning rather than only token-level mismatches. On the HATS pairwise hypothesis selection benchmark, the strongest decoder-based LLMs reach 92–94% agreement with human annotators when deciding which of two hypotheses is better. Under the same protocol, WER achieves 63% agreement with human judgments, a gap that highlights how often a purely form-based metric diverges from human preference about which transcription better captures the spoken content.
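The abstract does not describe the exact HATS scoring protocol, but an agreement figure like 92–94% versus 63% is naturally computed as the fraction of utterance pairs where a metric's pairwise choice matches the human annotators' preference. A minimal sketch of that computation, with toy data:

```python
def agreement_rate(metric_choices: list[str], human_choices: list[str]) -> float:
    """Fraction of pairs where the metric prefers the same hypothesis as humans."""
    assert len(metric_choices) == len(human_choices)
    matches = sum(m == h for m, h in zip(metric_choices, human_choices))
    return matches / len(human_choices)

# Toy data: for each of 5 utterance pairs, which hypothesis ("A" or "B") is preferred.
human = ["A", "B", "A", "A", "B"]
llm   = ["A", "B", "A", "B", "B"]  # the metric disagrees with humans on one pair

print(agreement_rate(llm, human))  # 0.8
```

Under this kind of protocol, a WER-based chooser would pick whichever hypothesis has the lower WER, and the reported 63% figure indicates it matches human preference only modestly more often than chance.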

The study further reports that decoder LLMs in this configuration also outperform existing semantic metrics on the HATS selection task, although the abstract does not provide specific percentages for those semantic baselines. Beyond discrete choices between hypotheses, the authors test whether decoder LLMs can support continuous semantic scoring via embeddings. They report that embeddings derived from these generative models yield semantic evaluation performance comparable to encoder-based embedding models when used to compute semantic distance between ASR outputs and references.

That result suggests that decoder-style models, despite being optimized for text generation rather than representation learning, can match encoder architectures on this core semantic evaluation function, at least under the conditions tested on HATS. In the qualitative error classification component, rather than only reporting aggregate token-level mismatches, the decoder LLMs surface structured information about what went wrong in a transcription, labeling errors with categories tied to semantics.

While the abstract does not enumerate the specific categories or provide quantitative breakdowns for this task, it positions qualitative classification as a complementary capability that leverages the generative models’ ability to reason about and describe errors. Taken together, the three tasks on HATS—pairwise hypothesis selection, semantic distance from generative embeddings, and qualitative error categorization—are presented as a systematic testbed for decoder-style LLMs as ASR evaluators. The authors conclude that decoder-based generative LLMs offer a promising direction for future ASR evaluation that is more interpretable, semantically grounded, and human-aligned than relying solely on WER.

They also position the work as extending semantic ASR evaluation beyond encoder-based embeddings: decoder LLMs both approximate human preference judgments on HATS substantially better than WER or current semantic metrics and provide richer, category-level insight into ASR errors. The abstract does not detail which decoder architectures or model sizes were evaluated, nor how HATS was constructed or what domains it covers, so the generalizability of these gains beyond this benchmark remains an open question from the summary alone.

Original source: http://arxiv.org/abs/2604.21928v1