Simon Graf: Design of Scenario-specific Features for Voice Activity Detection and Evaluation for Different Speech Enhancement Applications
To appear soon, 2023
Many technical applications nowadays make use of human speech. In situations where controlling a device by hand is impossible or inconvenient, voice can be employed instead. Important use cases can be found in automotive environments, where driver distraction has to be reduced. Speech-enabled applications allow for dictating messages, controlling devices by voice, or making phone calls from a moving car via hands-free telephony. Even the communication between passengers inside the car can be facilitated using modern speech applications. In-car communication (ICC) systems amplify the passengers' speech and allow for convenient conversations at high travel speeds. Outside the car, too, mobile speech applications, for example on smartphones, are becoming more and more ubiquitous.
The desired speech signal recorded by microphones is inevitably superimposed with background noise. In automotive scenarios, primarily stationary noise components are observed. In contrast, smartphones can be used at almost any location, resulting in a much higher variability of noise scenarios. Distinguishing the desired speech from background noise is an essential prerequisite for many algorithms that are incorporated in speech applications. When speech is present in the signal, the goal is to capture and preserve these desired components. Conversely, the suppression of noise usually requires information on the background noise that can be gathered primarily during speech pauses.
Voice activity detection (VAD) aims at identifying the presence of speech in a noisy signal. For this purpose, features are extracted: the signal is processed in such a way that certain distinctive properties of human speech are emphasized. Various features focusing on different speech properties have been introduced with the goal of telling speech and noise apart. A detector finally decides whether speech is present in the signal. Beyond this, automatic speech recognition systems, which usually incorporate a VAD, may identify what is said.
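The two stages described above, feature extraction followed by a detector, can be illustrated with a deliberately simple sketch. It is not one of the features proposed in the thesis: it uses the frame energy as the only feature and a fixed margin above an estimated noise floor as the detector. The function names, the 6 dB margin, the percentile-based noise floor, and the synthetic test signal (a tone standing in for speech) are all assumptions made for illustration.

```python
import math
import random


def frame_log_energy(signal, frame_len):
    """Feature: log energy in dB of consecutive, non-overlapping frames."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # offset avoids log(0)
    return energies


def detect_speech(energies_db, margin_db=6.0):
    """Detector: a frame counts as active if it exceeds the noise floor by margin_db.

    Assumption: the 10th percentile of the frame energies approximates the
    noise floor, i.e. at least some frames contain background noise only.
    """
    ordered = sorted(energies_db)
    noise_floor = ordered[len(ordered) // 10]
    return [e > noise_floor + margin_db for e in energies_db]


# Synthetic example: 2 s of weak noise with a 200 Hz tone from 0.5 s to 1.0 s.
random.seed(0)
fs = 8000
signal = [random.gauss(0.0, 0.01) for _ in range(2 * fs)]
for n in range(fs // 2, fs):
    signal[n] += 0.5 * math.sin(2.0 * math.pi * 200.0 * n / fs)

flags = detect_speech(frame_log_energy(signal, frame_len=160))  # 20 ms frames
active = [i for i, f in enumerate(flags) if f]
print(active[0], active[-1])  # first and last frame index flagged as active
```

An energy feature of this kind only works when the noise is stationary and clearly weaker than the speech, which is one reason why combinations of several features are of interest.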
In this thesis, many features for VAD are summarized and classified with respect to properties of human speech that are exploited. New features are introduced considering speech properties that are typically not taken into account. Since different features represent different aspects of human speech, a combination of multiple features is desirable. By considering advantages and drawbacks of each feature, the final detection result can be improved. Adequate feature combinations may increase the robustness against interferences.
In the literature, the results of VAD algorithms are typically evaluated without considering a specific application. Different aspects of the detection are evaluated, but they are not related to the performance of the final application. The evaluations in this thesis are therefore tailored to the requirements of the target application. Some important applications are analyzed with respect to their dependency on VAD results. The importance of accurate VAD results is exemplified for algorithms in an ICC system and for the suppression of babble noise. These applications cover important use cases of VAD with particularly challenging yet contrasting conditions. Having been tried and tested in these rather extreme cases, the approaches discussed in this thesis are also well suited for other applications with less strict constraints.
Eric Elzenheimer: Analyse stimulationsevozierter Muskel- und Nervensignale mithilfe elektrischer und magnetischer Sensorik (Analysis of Stimulation-Evoked Muscle and Nerve Signals Using Electrical and Magnetic Sensors)
Shaker-Verlag, 2022
Polyneuropathies (PNPs) are neurological diseases with a prevalence of 5.5 % in people over 50 years old. These systemic diseases of the peripheral nerves can be categorized into inherited metabolic, acquired metabolic, immune-mediated, and toxic forms. Medical doctors must be able to differentiate among these forms to determine which type of therapy is needed. The burden on patients and the costs to health care providers may vary considerably, depending on the therapy administered. Motor nerve conduction studies assess compound muscle action potentials (CMAPs) by using neurography, from which neurophysiological variables are derived. These variables are used in addition to clinical evaluations to distinguish between the different etiologies. Despite the existing applications of neurography, current analytical strategies for PNP differentiation are inadequate for differential diagnostics, and improvements are needed. To overcome these problems, this thesis presents digital signal processing methods and approaches that can support medical doctors in making clinical decisions. The focus of these efforts was to quantitatively describe pathological CMAP signal differences without additional effort so that diagnoses can be made in a timely manner. In this context, a system-theoretical signal model was also developed to describe various physiological and pathological processes in human nerves. This model enables realistic insights into the pathophysiology of polyneuropathies.
In principle, electrode-based neurography can be complemented by magnetic detection. The use of novel magnetic field sensors in neurophysiology would, however, require closer examination. These sensors facilitate contactless data acquisition, which is an advantage over conventional methods that require electrodes. However, the pilot measurements of nerves and muscles presented in this study revealed some limitations, specifically for non-cryogenic magnetic field sensors. The observed disadvantages mainly resulted from the measurement bandwidth the sensors were able to support and from their detection limit. Consequently, these magnetic field sensors may be more suitable for other medical applications. Cardiology is particularly noteworthy here, since the signal with the highest field amplitude originates from the human heart. In a dedicated field study, the magnetic equivalent of a human R-wave was successfully detected within one minute for the first time by using a magnetoelectric (ME) sensor. This supports the hypothesis that ME sensors are valuable in magnetic diagnostics and encourages further development of this particular sensor type. Finally, sensor-specific advancements combined with digital readout techniques could advance magnetic detection in neurophysiology.
In this collaborative engineering and neuroscience work, the research methods utilized provide an in-depth assessment of nerves and may therefore be valuable for performing diagnostic tests in the long term. The experiments and results presented in this research represent the foundation of the technical concepts and analytical procedures necessary for a semiautomated disease classification system in clinical practice. An interdisciplinary team of researchers and an international manufacturer of neurography equipment have already joined forces to make such a system a reality in the form of a diagnostic tool.
Jonas Jungclaussen: Artificial Bandwidth Extension of Speech Signals using Neural Networks
PDF-based submission (freely available via the MACAU system), 2021
Although mobile broadband telephony has been standardized for over 15 years, many countries still do not have a nationwide network with good coverage. As a result, many cellphone calls are still downgraded to narrowband telephony. The resulting loss of quality can be reduced by artificial bandwidth extension. There has been great progress in bandwidth extension in recent years due to the use of neural networks. The topic of this thesis is the enhancement of artificial bandwidth extension using neural networks. A special focus is given to hands-free calls in a car, where the risk is high that the wideband connection is lost due to the fast movement.
The bandwidth of narrowband transmission is not only reduced towards higher frequencies above 3.5 kHz but also towards lower frequencies below 300 Hz. There are already methods that estimate the low-frequency components quite well, which will therefore not be covered in this thesis.
In most bandwidth extension algorithms, the narrowband signal is initially separated into a spectral envelope and an excitation signal. Both parts are then extended separately and finally recombined. While the extension of the excitation can be implemented using simple methods without reducing the speech quality compared to wideband speech, the estimation of the spectral envelope for frequencies above 3.5 kHz has not yet been solved satisfactorily. In most evaluations, current bandwidth extension algorithms can reduce the quality loss due to narrowband transmission by at most 50 %.
In this work, a modification of an existing method for excitation extension is proposed that achieves slight improvements without adding computational complexity. In order to enhance the wideband envelope estimation with neural networks, two modifications of the training process are proposed. On the one hand, the loss function is extended with a discriminative part to address the different characteristics of phoneme classes. On the other hand, by using a GAN (generative adversarial network) for the training phase, a second network is added temporarily to evaluate the quality of the estimation.
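The separation into spectral envelope and excitation mentioned above is commonly done with linear prediction, and the following sketch illustrates that idea only. The abstract does not state which method the thesis uses, so linear prediction is an assumption here, as are the function names and the synthetic second-order test signal. The predictor coefficients describe the envelope, and inverse filtering with them yields the excitation.

```python
import random


def autocorrelation(x, max_lag):
    """Autocorrelation values r[0..max_lag] of the sequence x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return the prediction-error filter A(z) = 1 + a1 z^-1 + ... + ap z^-p."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # remaining prediction error energy
    return a


def excitation(x, a):
    """Inverse-filter x with A(z); the result is the prediction residual."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


# Synthetic test signal (assumption): white noise shaped by a known envelope
# filter 1 / A(z) with A(z) = 1 - 1.3 z^-1 + 0.8 z^-2.
random.seed(1)
x = [0.0, 0.0]
for _ in range(4000):
    x.append(1.3 * x[-1] - 0.8 * x[-2] + random.gauss(0.0, 1.0))
x = x[2:]

coeffs = levinson_durbin(autocorrelation(x, 2), order=2)
residual = excitation(x, coeffs)
print([round(c, 2) for c in coeffs])  # estimated envelope coefficients
```

In a bandwidth extension system the two parts would then be extended separately, for example by estimating a wideband envelope from the narrowband coefficients; that step is not shown here.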
The neural networks that were trained are compared in subjective and objective evaluations. A final listening test addressed the scenario of a hands-free call in a car, which was simulated acoustically. The quality loss caused by the missing high frequency components could be reduced by 60 % with the proposed approach.
Minh H. Pham: Axial Movements in Older Adults and Patients with Parkinson’s Disease – Algorithm Development and Validation with Inertial Measurement Units Data
To appear soon, 2019
Movements that deviate from physiological performance are associated with many disabilities and reduce the ability to perform daily activities. These impaired movements are associated with, for example, aging and neurodegenerative diseases. An objective and quantitative evaluation of these impaired movements is of high clinical relevance, both for patients and for the professional medical team that treats them. Moreover, assessment in the usual environment of the affected persons may be superior to assessments performed in the clinic or the doctor's practice, because assessments in the latter environments may lead to artificial results and can only be performed at certain points in time.
The dynamic development of mobile technological devices has led to a new era of assessment in the medical field. The assessment of movements, especially axial movements (i.e. movements close to the body center or trunk), is of particular interest for this development, because sensors that detect movements accurately (e.g. accelerometers, gyroscopes and magnetometers) are highly developed, reasonably priced and easy to integrate into mobile technology. However, there is a substantial lack of useful and, in particular, validated algorithms for sensors and inertial measurement units that detect the quantity and quality of specific movements in vulnerable cohorts. This work contributes to this area by presenting and discussing three algorithms that detect and evaluate specific movements recorded with an inertial measurement unit (IMU) worn on the lower back by older adults and patients with Parkinson's disease (PD). It includes the evaluation of data from supervised and unsupervised environments and the validation of each algorithm.
Christin Baasch: Instrumentelle Analyse von Parkinson-Sprache (Instrumental Analysis of Parkinsonian Speech)
Shaker-Verlag, 2019
Parkinson’s Disease is one of the most frequent neurodegenerative diseases worldwide. Besides motor disorders, patients affected by this disease mostly suffer from a speech disorder named dysarthria.
Dysarthria is treated by a speech therapist, and both the success of the therapy and the progression of the dysarthria should be documented. A multitude of methods is available for this purpose, but they all have one thing in common: they are not completely objective, because they are not fully automatic. There is always a subjective component in which a rater or another person influences the process.
This work presents a system named SINAS for the fully automatic rating of dysarthria. The system contains two main components: a recording tool and an analysis tool. The recording tool enables the speech therapist to guide the patient easily through different speech tasks, with visual aid provided by HTML pages. As a result, the recordings are robust in level and independent of the microphone position.
In the analysis tool, acoustic measures are calculated from the recordings; these measures are intended to capture the three symptom clusters of dysarthria. They form the input of a neural network, which outputs an NTID rating. The NTID scale rates the intelligibility of the recording, and therefore the dysarthria of the patient, in six steps. The tool is validated by comparing its results with a survey in which people rated the recordings of Parkinson patients according to the NTID scale; the mean value for each recording is taken as the reference. The correlation, the mean absolute error, and the variance of the error serve as cost functions for evaluating the developed system, and the system is optimized on the basis of these functions.
For further evaluation, and to take the uncertainty of the raters into account, the epsilon-insensitive RMSE is used to evaluate the performance of the system. The results clearly show that a fully automatic NTID rating of patients is possible with the presented SINAS system.
The developed tool can now form the basis for many applications that support the speech therapy of Parkinson patients.
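The three evaluation measures named above can be computed as in the following sketch. The function name and the example ratings are hypothetical, and the abstract does not specify the exact definitions: the Pearson correlation and the population variance (division by n) are assumptions.

```python
import math


def evaluation_metrics(predicted, reference):
    """Return (Pearson correlation, mean absolute error, variance of the error)."""
    n = len(reference)
    errors = [p - r for p, r in zip(predicted, reference)]
    mean_abs_error = sum(abs(e) for e in errors) / n
    mean_error = sum(errors) / n
    error_variance = sum((e - mean_error) ** 2 for e in errors) / n

    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    correlation = cov / math.sqrt(var_p * var_r)
    return correlation, mean_abs_error, error_variance


# Hypothetical ratings: system output versus mean listener rating per recording.
system_ratings = [1.0, 2.0, 3.0, 4.0]
listener_means = [1.0, 2.0, 3.0, 5.0]
corr, mae, err_var = evaluation_metrics(system_ratings, listener_means)
print(corr, mae, err_var)
```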
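A simplified sketch of an epsilon-insensitive RMSE is given below. It assumes a single tolerance epsilon for all recordings and a plain division by the number of recordings. Standardized definitions typically use a per-recording tolerance derived from the confidence interval of the listener ratings, and the abstract does not say which variant the thesis applies. The function name and the example values are hypothetical.

```python
import math


def eps_insensitive_rmse(predicted, reference, epsilon):
    """RMSE in which deviations of up to epsilon from the reference are ignored.

    Assumption: one shared epsilon and division by the number of items.
    """
    total = 0.0
    for p, r in zip(predicted, reference):
        excess = max(0.0, abs(p - r) - epsilon)  # only the part beyond the tolerance
        total += excess * excess
    return math.sqrt(total / len(reference))


# Hypothetical ratings; only the second prediction lies outside the tolerance.
system_ratings = [3.0, 4.5, 2.0]
listener_means = [3.2, 4.0, 2.0]
tolerant = eps_insensitive_rmse(system_ratings, listener_means, epsilon=0.3)
plain = eps_insensitive_rmse(system_ratings, listener_means, epsilon=0.0)
print(tolerant, plain)
```

With epsilon set to zero the measure reduces to the ordinary RMSE, so the difference between the two values indicates how much of the error lies within the assumed rater uncertainty.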