Speech Intelligibility for Nuendo 11

Speech Intelligibility

Nuendo 11 adds two metering options: the Netflix Loudness Meter and the Intelligibility Meter. To ensure consistency of content production, the Netflix Loudness Meter in Nuendo 11 is calibrated to Netflix's official Sound Mix Specifications and Best Practices and measures the dialog-gated loudness that Netflix requires. Based on algorithms developed by the Oldenburg branch of the Fraunhofer IDMT in Germany, the new AI-powered Intelligibility Meter indicates in real time how much effort a listener needs to understand speech in the mix.

In a strict sense, speech intelligibility is measured as the proportion of speech items (e.g., words) that can be recognized correctly in a given situation. More broadly, the term “intelligibility” is often used to describe the perceived effort one has to spend to understand speech. This is also relevant for broadcast applications: even if listeners are technically able to understand every word of a dialog, they may still have to invest a lot of cognitive resources, e.g., when the background sounds are too loud. This broader sense of speech intelligibility is what we measure with Nuendo’s new tool.
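The strict definition can be illustrated with a few lines of Python. This is only a minimal sketch of the "proportion of correctly recognized items" idea; the function name and the word lists are made up for illustration and are not part of Nuendo or the Fraunhofer algorithm.

```python
def intelligibility_score(presented, recognized):
    """Fraction of presented speech items that were recognized correctly.

    Items are compared position by position, ignoring case.
    """
    if not presented:
        raise ValueError("presented must contain at least one item")
    if len(presented) != len(recognized):
        raise ValueError("presented and recognized must have the same length")
    correct = sum(p.lower() == r.lower() for p, r in zip(presented, recognized))
    return correct / len(presented)


# Hypothetical listening test: one of five words is misheard.
presented = ["the", "train", "leaves", "at", "noon"]
recognized = ["the", "rain", "leaves", "at", "noon"]
score = intelligibility_score(presented, recognized)
print(score)  # 0.8
```

A score like this says nothing about how hard the listener had to work for the four correct words, which is why the broader, effort-based sense of intelligibility needs a different kind of measurement.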

Speech consists of small building blocks called phonemes. Several phonemes combine to form syllables or words. Phonemes are what automatic speech recognition engines detect and convert to meaningful speech. In very clear speech, there is only a single phoneme at any given instant. In technical terms, a machine trained to recognize speech detects a high probability for the presence of a specific phoneme and a low probability for all other phonemes. The more disturbed the speech, the less distinct this probability distribution is: the machine is less certain which phoneme is present. This is what we use to quantify intelligibility.
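One common way to put a number on "how distinct" such a probability distribution is uses its entropy. The sketch below is an assumption for illustration, not the measure used in the Intelligibility Meter: the function name, the four-phoneme distributions, and the choice of normalized Shannon entropy are all hypothetical.

```python
import math


def phoneme_certainty(posterior):
    """Certainty of a phoneme probability distribution, between 0 and 1.

    Computed as 1 minus the normalized Shannon entropy, so a one-hot
    distribution gives 1.0 and a uniform distribution gives 0.0.
    """
    if len(posterior) < 2:
        raise ValueError("need probabilities for at least two phonemes")
    if any(p < 0 for p in posterior):
        raise ValueError("probabilities must be non-negative")
    total = sum(posterior)
    if total <= 0:
        raise ValueError("probabilities must not all be zero")
    probs = [p / total for p in posterior]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))


# Hypothetical recognizer outputs for one instant of audio.
clear_speech = [0.97, 0.01, 0.01, 0.01]      # one phoneme clearly dominates
disturbed_speech = [0.40, 0.30, 0.20, 0.10]  # recognizer is unsure

print(phoneme_certainty(clear_speech) > phoneme_certainty(disturbed_speech))  # True
```

Other measures, such as the highest probability alone or the gap between the two most likely phonemes, would capture the same idea.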

The algorithm has to perform several tasks. First, it must detect whether speech is present. This sounds trivial but is challenging given how diverse and “speech-like” broadcast background sounds can be. Then we use automatic speech recognition technology and compute how certain the recognizer is when detecting individual phonemes. Finally, we map this certainty to a scale that corresponds to human perception as measured in hundreds of hours of listening experiments. For all this to work robustly, we exploited deep learning with many thousands of hours of training material containing real speech and highly challenging backgrounds.
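The three steps can be sketched as a toy pipeline. Everything in this sketch is a stand-in: the real meter relies on trained deep learning models and a mapping fitted to listening experiments, whereas here the speech probability and phoneme probabilities are simply given as input, the certainty is the highest phoneme probability, and the logistic mapping with its `midpoint` and `slope` values is hypothetical.

```python
import math


def frame_certainty(posterior):
    """Stand-in certainty: probability of the most likely phoneme."""
    return max(posterior)


def map_to_scale(certainty, midpoint=0.5, slope=10.0):
    """Hypothetical logistic mapping from certainty to a 0..100 reading."""
    return 100.0 / (1.0 + math.exp(-slope * (certainty - midpoint)))


def meter(frames, speech_threshold=0.5):
    """Run the three steps on a sequence of frames.

    Each frame is a pair (speech_probability, phoneme_posterior).
    Frames without speech yield None instead of a reading.
    """
    readings = []
    for speech_prob, posterior in frames:
        # Step 1: is speech present at all?
        if speech_prob < speech_threshold:
            readings.append(None)
            continue
        # Step 2: how certain is the recognizer about the phoneme?
        certainty = frame_certainty(posterior)
        # Step 3: map the certainty to a display scale.
        readings.append(map_to_scale(certainty))
    return readings


# Hypothetical frames: music only, clear dialog, dialog under loud background.
frames = [
    (0.05, [0.25, 0.25, 0.25, 0.25]),
    (0.95, [0.90, 0.05, 0.03, 0.02]),
    (0.80, [0.40, 0.30, 0.20, 0.10]),
]
readings = meter(frames)
print(readings)
```

Gating on speech presence first matters because a certainty value computed on pure music or effects would be meaningless and would only clutter the display.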

For more information on speech intelligibility, visit the Fraunhofer website.