No. 14 - Simon Graf

Simon Graf: Design of Scenario-specific Features for Voice Activity Detection and Evaluation for Different Speech Enhancement Applications

To appear soon, 2023

 

Commission

  • Prof. Dr.-Ing. Gerhard Schmidt
    (first reviewer)
  • Prof. Dr.-Ing. Tim Fingscheidt
    (second reviewer)
  • Prof. Dr.-Ing. Michael Höft
    (examiner)
  • Prof. Dr.-Ing. Hermann Kohlstedt
    (head of the examination board)

 

Abstract

Many technical applications nowadays make use of human speech. In situations where controlling a device by hand is impossible or inconvenient, voice can be employed instead. Important use cases can be found in automotive environments, where driver distraction has to be reduced. Speech-enabled applications allow for dictating messages, controlling devices by voice, or making phone calls from the moving car via hands-free telephony. Even the communication between passengers inside the car can be facilitated using modern speech applications. In-car communication (ICC) systems amplify the passengers’ speech and allow for convenient conversations at high travel speeds. Outside the car as well, mobile speech applications, such as those on smartphones, are becoming more and more ubiquitous.

The desired speech signal recorded by microphones is inevitably superimposed with background noise. In automotive scenarios, primarily stationary noise components are observed. In contrast, smartphones can be used at almost any location, resulting in a much higher variability of noise scenarios. Distinguishing the desired speech from background noise is an essential prerequisite for many algorithms incorporated in speech applications. When speech is present in the signal, the goal is to capture and preserve these desired components. Conversely, the suppression of noise usually requires information on the background noise, which can be gathered primarily during speech pauses.

Voice activity detection (VAD) aims at identifying the presence of speech in a noisy signal. For this purpose, features are extracted: the signal is processed in such a way that certain distinctive properties of human speech are emphasized. Various features focusing on different speech properties have been introduced with the goal of telling speech and noise apart. A detector finally decides whether speech is present in the signal. Beyond this, automatic speech recognition systems, which usually incorporate a VAD, may identify what is said.
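The two stages described above (feature extraction, then a detector) can be illustrated with a minimal sketch. It is not one of the features or detectors proposed in the thesis: it assumes frame-wise log-energy as the only feature and a fixed margin above an assumed noise floor as the decision rule. The function names `frame_log_energy` and `detect_speech`, the frame length, and the 6 dB margin are hypothetical choices made for illustration only.

```python
import math

def frame_log_energy(signal, frame_len):
    """Feature extraction: log-energy (in dB) of non-overlapping frames."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small constant avoids log10(0) for all-zero frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies

def detect_speech(features, noise_floor_db, margin_db=6.0):
    """Detector: a frame counts as speech if it exceeds the noise floor by margin_db."""
    return [f > noise_floor_db + margin_db for f in features]

# Synthetic signal (assumption): weak "noise", a louder "speech-like" tone, weak "noise" again.
noise = [0.01 * math.sin(0.7 * n) for n in range(320)]
tone = [0.5 * math.sin(0.2 * n) for n in range(320)]
signal = noise + tone + noise

features = frame_log_energy(signal, frame_len=160)
noise_floor_db = features[0]  # assumption: the first frame contains noise only
decisions = detect_speech(features, noise_floor_db)
print(decisions)
```

A single energy feature of this kind only works when the noise is stationary and clearly weaker than the speech, which is one reason why features targeting other speech properties are of interest.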

In this thesis, many features for VAD are summarized and classified with respect to the properties of human speech that they exploit. New features are introduced that consider speech properties which are typically not taken into account. Since different features represent different aspects of human speech, a combination of multiple features is desirable. By considering the advantages and drawbacks of each feature, the final detection result can be improved. Adequate feature combinations may increase the robustness against interference.
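As a rough illustration of why combining features can help, the sketch below fuses two per-frame scores with a weighted average before thresholding. This is a generic fusion scheme chosen for illustration, not the combination method of the thesis. The score values, the feature names `energy_score` and `periodicity_score`, the equal weights, and the threshold of 0.5 are all hypothetical.

```python
def combine_features(feature_scores, weights):
    """Weighted average of per-frame feature scores, each assumed to lie in [0, 1]."""
    total = sum(weights)
    n_frames = len(feature_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, feature_scores)) / total
        for i in range(n_frames)
    ]

def decide(combined, threshold=0.5):
    """Declare speech for frames whose combined score exceeds the threshold."""
    return [score > threshold for score in combined]

# Hypothetical per-frame scores: frame 1 is a loud but non-periodic interference,
# frame 2 is loud and periodic (speech-like).
energy_score = [0.1, 0.9, 0.8, 0.2]
periodicity_score = [0.2, 0.05, 0.9, 0.1]

energy_only = decide(energy_score)
combined = decide(combine_features([energy_score, periodicity_score], weights=[1.0, 1.0]))
print(energy_only)
print(combined)
```

In this toy example the energy score alone flags the loud interference in frame 1 as speech, whereas the combined score rejects it because the second feature does not agree.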

In the literature, the results of VAD algorithms are typically evaluated without considering a specific application. Different aspects of the detection are evaluated; however, they are not related to the final application’s performance. The evaluations in this thesis are therefore tailored to the requirements of the target application. Some important applications are analyzed with respect to their dependency on VAD results. The importance of accurate VAD results is exemplified for algorithms in an ICC system and for the suppression of babble noise. These applications cover important use cases of VAD with particularly challenging yet contrary conditions. Tried and tested for these rather extreme cases, the approaches discussed in this thesis are also well suited to other applications with less strict constraints.