ITG Conference on Speech Communication | 29.09.2021 - 01.10.2021 | Kiel




Below you can find an overview as well as details on the conference program.



Time      Wednesday                         Thursday               Friday
08:00 h
09:00 h                                     Keynote II             Keynote III
10:00 h                                     Active break           Active break
                                            Session II (Talks)     Session V (Talks)
11:00 h
                                            Session II (Poster)    Session V (Poster)
          ITG Technical Committee Meeting
12:00 h
                                            Lunch Break            Closing (Awards)
13:00 h
                                            Session III (Talks)
14:00 h
          Opening                           Session III (Poster)
          Keynote I
15:00 h
          Active break
16:00 h   Session I (Talks)
                                            Active break
                                            Session IV (Talks)
17:00 h   Session I (Poster)
                                            Session IV (Poster)
18:00 h
19:00 h
                                            Kiel Evening
20:00 h



Wednesday, 29.09.2021  
  14:00 h  

Opening and Welcome

Gerhard Schmidt and Peter Jax open the conference as chairs of the current and upcoming workshop, respectively.

Location: Auditorium

  14:15 h  

Keynote I - Pathological Speech Analyses: From Classical Machine Learning to Deep Learning

Juan Rafael Orozco-Arroyave
University of Antioquia, Colombia

Many diseases affect different aspects or dimensions of speech. Automatic evaluation of speech has evolved over the last decades to a point where it can be considered suitable to support the diagnosis and follow-up of patients (including their response to a given therapy). Work on this topic started mainly with classical signal processing and machine learning methods, and in recent years modern deep learning methods have been incorporated. However, there is still a lot to do in order to bring these models into normal clinical practice. The most important challenges include the amount of data available to address specific topics and the interpretability of the resulting models. The aim of this talk is to show some of the applications of classical and modern methods to model different speech signals of patients with different speech disorders, including hypokinetic dysarthria (which results from Parkinson’s disease and other neurological conditions), hoarseness (which results from laryngeal cancer) and hypernasality (which appears mainly in children with cleft lip and palate).

Location: Auditorium

  15:15 h  

Movement and Coffee Break

It's more than a normal break ... it's an "active break". If you want to join and need more information - see here.

Location: Gym

  15:30 h  

Session I (Talks): Speech Recognition and Synthesis

Session chairs: Reinhold Haeb-Umbach and Dorothea Kolossa

15:30 h - Kevin Wilkinghoff, Alessia Cornaggia-Urrigshardt and Fahrettin Gökgöz: Two-Dimensional Embeddings for Low-Resource Keyword Spotting Based on Dynamic Time Warping

State-of-the-art keyword spotting systems consist of neural networks trained as classifiers or trained to extract discriminative representations, so-called embeddings. However, a sufficient amount of labeled data is needed to train such a system. Dynamic time warping is another keyword spotting approach that uses only a single sample of each keyword as the pattern to be searched and thus does not require any training. In this work, we propose to combine the strengths of both keyword spotting approaches in two ways: First, an angular margin loss for training a neural network to extract two-dimensional embeddings is presented. It is shown that these embeddings can be used as features for dynamic time warping, outperforming cepstral features even when very few training samples are available. Second, dynamic time warping is applied to cepstral features to turn weak into strong labels and thus provide more labeled training data for the two-dimensional embeddings.

Location: Auditorium
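
For readers unfamiliar with dynamic time warping, the template matching mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `dtw_distance`, the Euclidean local cost and the toy two-dimensional feature sequences are assumptions chosen for this example.

```python
import math


def dtw_distance(query, segment):
    """Accumulated DTW cost between two sequences of feature vectors.

    Assumption: Euclidean distance as local cost and the standard
    step pattern (insertion, deletion, match).
    """
    n, m = len(query), len(segment)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of query[:i] with segment[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(query[i - 1], segment[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # query frame repeated
                cost[i][j - 1],      # segment frame repeated
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Hypothetical 2-D feature sequences: one keyword template, a slower
# rendition of the same pattern, and a different pattern.
template = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
stretched = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
other = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]

print(dtw_distance(template, stretched))  # 0.0, tempo differences are absorbed
print(dtw_distance(template, other))      # larger than zero
```

A keyword spotter built on this idea would slide the template over an utterance and flag segments whose accumulated cost falls below a threshold. The threshold and the segmentation strategy are left open here.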

15:50 h - Timo Lohrenz, Patrick Schwarz, Zhengyang Li and Tim Fingscheidt: Multi-Head Fusion Attention for Transformer-Based End-to-End Speech Recognition

Stream fusion is a widely used technique in automatic speech recognition (ASR) to explore additional information for a better recognition performance. While stream fusion is a well-researched topic in hybrid ASR, it remains to be further explored for end-to-end model-based ASR. In this work, striving to achieve optimal fusion in end-to-end ASR, we propose a middle fusion method performing the fusion within the multi-head attention function for the all-attention-based encoder-decoder architecture known as the transformer. Using an exemplary single-microphone setting with fusion of standard magnitude and phase features, we achieve a word error rate reduction of 12.1% relative compared to other authors' benchmarks on the well-known Wall Street Journal (WSJ) task and 9.7% relative compared to the best recently proposed fusion approach.

Location: Auditorium

16:10 h - Wentao Yu, Jan Freiwald, Sören Tewes, Fabien Huennemeyer and Dorothea Kolossa: Federated Learning in ASR: Not as Easy as You Think

With the growing availability of smart devices and cloud services, personal speech assistance systems are increasingly used on a daily basis. Most devices redirect the voice recordings to a central server, which uses them for upgrading the recognizer model. This leads to major privacy concerns, since private data could be misused by the server or third parties. Federated learning is a decentralized optimization strategy that has been proposed to address such concerns. Utilizing this approach, private data is used for on-device training. Afterwards, updated model parameters are sent to the server to improve the global model, which is redistributed to the clients. In this work, we implement federated learning for speech recognition in a hybrid and an end-to-end model. We discuss the outcomes of these systems, which both show great similarities and only small improvements, pointing to a need for a deeper understanding of federated learning for speech recognition.

Location: Auditorium
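
The server-side step of the loop described in the abstract above (clients train locally, send updated parameters, and the server merges them into a new global model) can be sketched as follows. The aggregation rule used here, a data-size-weighted average commonly known as federated averaging, is an assumption for illustration; the abstract does not state which rule the authors use. The function name and the flat parameter lists are likewise hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Merge client parameter vectors into one global parameter vector.

    Assumption: each client contributes in proportion to the number of
    local training samples it used (federated averaging).
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(n_params)
    ]


# Two hypothetical clients with two model parameters each.
# The second client trained on three times as much data.
global_model = federated_average(
    client_weights=[[1.0, 2.0], [3.0, 4.0]],
    client_sizes=[1, 3],
)
print(global_model)  # [2.5, 3.5]
```

In a full round, the returned parameters would be redistributed to the clients as the starting point for the next on-device training pass.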

  16:30 h  

Session I (Poster): Speech Recognition and Synthesis

Session chairs: Reinhold Haeb-Umbach and Dorothea Kolossa

16:30 h - Short poster overview in the Auditorium (presented by the session chairs, Reinhold Haeb-Umbach and Dorothea Kolossa, nothing needed from the poster authors)

Julia Pritzen, Michael Gref, Christoph Schmidt and Dietlind Zühlke: A Comparative Pronunciation Mapping Approach Using G2P Conversion for Anglicisms in German Speech Recognition

Anglicisms pose a challenge in German speech recognition due to their irregular pronunciation compared to native German words. To solve this issue, we propose a comparative approach that uses both a German and an English grapheme-to-phoneme model to create Anglicism pronunciations. Comparing their confidence measures, we chose the best resulting pronunciations and added them to an Anglicism pronunciation dictionary. We allowed using English pronunciations within a German ASR model by using phoneme mapping to transform English phonemes to their most likely German equivalents. With our approach, we utilize the original pronunciations of the Anglicisms' source language while keeping the German Anglicism pronunciations with high accuracy. Tested on a dedicated Anglicism evaluation set, we improved the recognition of Anglicisms compared to a baseline model, reducing the word error rate by 1.33 % relative and the Anglicism error rate by 4.08 % relative.

Location: P1 in Poster Area I

Yao Wang, Michael Gref, Oliver Walter and Christoph Andreas Schmidt: Bilingual I-Vector Extractor for DNN Hybrid Acoustic Model Training in German Speech Recognition Systems

In recent research, i-vectors have been shown to be significantly beneficial for speaker recognition and have been successfully applied in deep neural network (DNN) acoustic model (AM) training to improve the performance of automatic speech recognition (ASR). This paper describes our work in developing a bilingual i-vector extractor for training a German speech recognition system. A bilingual data set consisting of German and English speech data is used to train an i-vector extractor for a DNN hybrid acoustic model. The system is evaluated on different data sets. The results show that i-vector extractors trained with bilingual data can be used to improve the training of ASR models in the case of insufficient monolingual data. Additionally, using telephone speech as a case study, we show that i-vector extractor training with data from this domain leads to improvements in recognition.

Location: P2 in Poster Area I

Paul Baggenstoss: New Restricted Boltzmann Machines and Deep Belief Networks for Audio Classification

In this paper, the deep belief network (DBN), made popular by Hinton in 2006, is revitalized using maximum entropy sampling distributions, their corresponding activation functions, and a new direct training approach based on classifier performance. It is shown in keyword classification experiments that the DBN can compete with state-of-the-art classifiers and, using additive classifier combination, improves upon a state-of-the-art deep neural network.

Location: P3 in Poster Area I

Prachi Govalkar, Ahmed Mustafa, Nicola Pia, Judith Bauer, Metehan Yurt, Yigitcan Özer and Christian Dittmar: A Lightweight Neural TTS System for High-quality German Speech Synthesis

This paper describes a lightweight neural text-to-speech system for the German language. The system is composed of a non-autoregressive spectrogram predictor, followed by a recently proposed neural vocoder called StyleMelGAN. Our complete system has a very tiny footprint of 61 MB and is able to synthesize high-quality speech output faster than real-time both on CPU (2.55x) and GPU (50.29x). We additionally propose a modified version of the vocoder called Multi-band StyleMelGAN, which offers a significant improvement in inference speed with a small trade-off in speech quality. In a perceptual listening test with the complete TTS pipeline, the best configuration achieves a mean opinion score of 3.84 using StyleMelGAN, compared to 4.23 for professional speech recordings.

Location: P4 in Poster Area I

Ayimnisagul Ablimit and Tanja Schultz: Automatic Speech Recognition for Dementia Screening Using ILSE-Interviews

Spoken language skills are strong biomarkers for detecting cognitive decline. Studies like the Interdisciplinary Longitudinal Study of Adult Development and Aging (ILSE) are of particular interest to quantify the predictive power of biomarkers in terms of acoustic/linguistic features. ILSE consists of ca. 4200 hours of interviews, but only 10% were manually transcribed. To extract linguistic features, we need to build reliable ASR to provide transcriptions. The ILSE corpus is challenging for ASR due to a combination of factors. In this study, we present our effort to overcome some of these challenges. We automatically segmented 45 minutes of interviews into shorter segments and time-aligned them. Using these segments, we developed an HMM-DNN-based ASR system and achieved a WER of 33.55%. Based on this system, we recreated the time alignments for the manual transcriptions and derived acoustic and linguistic features for classifier training. We applied the resulting system to dementia screening and achieved a UAR of 0.867 for a three-class problem.

Location: P5 in Poster Area I



Thursday, 30.09.2021  
  08:30 h  

Keynote II - Statistical Signal Processing and Machine Learning for Speech Enhancement

Timo Gerkmann
Universität Hamburg, Department of Informatics, Signal Processing Research Group

Speech Signal Processing is an exciting research field with many applications such as Hearing Devices, Telephony and Smart Speakers. While in noisy environments the performance of these devices may be limited, leveraging modern Machine Learning techniques has recently shown impressive improvements in performance for the estimation of clean speech signals from noisy microphone signals. Yet, in order to build real-time, robust and interpretable algorithms, those machine learning techniques need to be combined with domain knowledge in signal processing, statistics and acoustics. In this talk, we will present recent research results from our group that follow this perspective by exploiting end-to-end learning, multichannel configurations and deep generative models.

Location: Auditorium

  09:30 h  

Movement and Coffee Break

It's more than a normal break ... it's an "active break". If you want to join and need more information - see here.

Location: Gym

  09:45 h  

Session II (Talks): Localisation, Tracking and Spatial Reproduction

Session chairs: Peter Jax and Rainer Martin

09:45 h - Tobias Gburrek, Joerg Schmalenstroeer and Reinhold Haeb-Umbach: On Source-Microphone Distance Estimation Using Convolutional Recurrent Neural Networks

Several features computed from an audio signal have been shown to depend on the distance between the acoustic source and the receiver, but at the same time are heavily influenced by room characteristics and the microphone setup. While neural networks, if trained on signals representing a large variety of setups, have been shown to deliver robust distance estimates from a coherent-to-diffuse power ratio (CDR) feature map at the input, we here push their modeling capabilities by additionally using the network as a feature extractor. It is shown that distance estimation based on short-time Fourier transform (STFT) features can achieve a smaller estimation error and can operate on shorter signal segments compared to the previous CDR-based estimator.

Location: Auditorium

10:05 h - Sebastian Nagel and Peter Jax: On the Use of Additional Microphones in Binaural Cue Adaptation

Binaural recording using artificial heads or microphones on or near the ears of a person, e.g., on modern headphones and hearables, is a well-established technology for spatial audio capture. Our previously presented Binaural Cue Adaptation method improves the reproduction of binaural recordings by adapting the signals recorded for a fixed head orientation to a listener’s dynamic head movements. In the present work, we exploit the fact that many binaural recording devices, such as hearables, are equipped with more than two microphones and generalize the method to use more than two microphone signals. Information from the additional microphone signals can be used to remove limitations on possible positions of sound sources in the recorded scene, and to improve the performance of the method in the presence of diffuse ambient signals and for larger head movements.

Location: Auditorium

10:25 h - Michael Günther, Andreas Brendel and Walter Kellermann: Microphone Utility-based Weighting for Robust Acoustic Source Localization in Wireless Acoustic Sensor Networks

Spatially distributed network nodes in Wireless Acoustic Sensor Networks (WASNs) offer different perspectives of an acoustic scene. While the spatial information contained in all recorded signals can in principle be exploited to determine the position of an acoustic Source of Interest (SoI), in challenging acoustic conditions, the observations provided by different network nodes are not equally reliable. Thus, accurately estimating the reliability of the nodes’ observations and incorporating them during the localization procedure is crucial, potentially even excluding observations. In this contribution, we demonstrate how a recently proposed microphone utility measure based on the correlation of single-channel feature streams can improve the localization accuracy when triangulating an acoustic source from node-wise Direction of Arrival (DOA) estimates.

Location: Auditorium
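
The idea of weighting node-wise DOA estimates during triangulation, as described in the abstract above, can be illustrated with a weighted least-squares intersection of bearing lines. This is a sketch under stated assumptions, not the authors' method: the function name `triangulate`, the 2-D geometry, the treatment of each DOA as an infinite line through the node, and the use of a weight in [0, 1] as a stand-in for a utility measure are all hypothetical.

```python
import math


def triangulate(positions, doas, weights):
    """Weighted least-squares intersection of 2-D bearing lines.

    Each node at position p with DOA angle theta (radians) defines a
    line through p. The estimate minimizes the weighted sum of squared
    perpendicular distances to all lines.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), theta, w in zip(positions, doas, weights):
        # Unit normal of the bearing line
        nx, ny = -math.sin(theta), math.cos(theta)
        a11 += w * nx * nx
        a12 += w * nx * ny
        a22 += w * ny * ny
        proj = nx * px + ny * py
        b1 += w * nx * proj
        b2 += w * ny * proj
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("bearing lines are (nearly) parallel")
    # Solve the 2x2 normal equations in closed form
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


# Hypothetical scene: three nodes, one of which reports a wrong DOA.
source = (2.0, 1.0)
nodes = [(0.0, 0.0), (4.0, 0.0), (2.0, 4.0)]
doas = [math.atan2(source[1] - y, source[0] - x) for x, y in nodes]
doas[2] += 0.5  # unreliable observation at the third node

uniform = triangulate(nodes, doas, weights=[1.0, 1.0, 1.0])
weighted = triangulate(nodes, doas, weights=[1.0, 1.0, 0.0])
print(uniform)   # pulled away from the true source position
print(weighted)  # close to (2.0, 1.0) once the bad node is excluded
```

Setting a weight to zero corresponds to excluding an observation entirely. Intermediate weights would let a less reliable node contribute with reduced influence.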

  10:45 h  

Session II (Poster): Localisation, Tracking and Spatial Reproduction

Session chairs: Peter Jax and Rainer Martin

10:45 h - Short poster overview in the Auditorium (presented by the session chairs, Peter Jax and Rainer Martin, nothing needed from the poster authors)

Magnus Schäfer and Leonie Geyer: Sound Source Localisation Using Neural Networks With Circular Binary Classification

There are several approaches for acoustically localizing sound sources. This contribution utilizes an eight-channel microphone array that is mounted on an artificial head and spatially samples the direct vicinity of both ears. This allows for a localization approach that resembles the experience of a human listener while avoiding the limitations of an artificial head (e.g., with respect to front-back confusions).

The signals are used by a convolutional neural network that treats the localization task as a classification problem. The classification approach employs a novel use of binary classifiers that allows for implicitly taking the localization error into account. A performance assessment of the localization system based on real audio recordings is presented. A comparison with a related machine learning approach that uses a regular classification and with two conventional beamforming algorithms highlights the positive impact of the new classifier design.

Location: P1 in Poster Area II

Stijn Kindt, Alexander Bohlender and Nilesh Madhu: 2D Acoustic Source Localisation Using Decentralised Deep Neural Networks on Distributed Microphone Arrays

This paper takes a previously proposed convolutional recurrent deep neural network (DNN) approach to direction of arrival (DoA) estimation and extends this to perform 2D localisation using distributed microphone arrays. Triangulation on the individual DoAs from each array is the most straightforward extension of the original DNN. This paper proposes to allow more co-operation between the individual microphone arrays by sharing part of their neural network, in order to achieve a higher localisation accuracy. Two strategies will be discussed: one where the shared network has narrowband information, and one where only broadband information is shared. Robustness against slight clock offsets between different arrays is ensured by only sharing information at deeper layers in the DNN. The position and configuration of the microphone arrays are assumed to be known in order to train the network. Simulations will show that combining information between neural network layers yields a significant improvement over the triangulation approach.

Location: P2 in Poster Area II

Klaus Brümann, Daniel Fejgin and Simon Doclo: Data-Dependent Initialization for ECM-Based Blind Geometry Estimation of a Microphone Array Using Reverberant Speech

Recently, a method has been proposed to blindly estimate the geometry of an array of distributed microphones using reverberant speech, which relies on estimating the coherence matrix of the reverberation using an iterative expectation-conditional maximization (ECM) approach. Instead of using a data-independent initial estimate of the coherence matrix and a matched beamformer to estimate the initial speech and reverberation power spectral densities, in this paper we propose to use a data-dependent initial estimate of the coherence matrix and a time-varying minimum-power-distortionless-response beamformer. Simulation results show that the proposed ECM initialization significantly improves the estimation accuracy of the microphone array geometry and increases the generalizability for microphone arrays of different sizes.

Location: P3 in Poster Area II

Semih Agcaer and Rainer Martin: Binaural Speaker Localization Based on Front/Back-Beamforming and Modulation-Domain Features

In this paper, we propose and evaluate a method for binaural speaker localization using modulation-domain features extracted from the binaural microphone signals of hearing devices. In contrast to most other localization methods the proposed method does not require perfectly synchronized audio signals from the left and the right ear but uses front/back cardioid signals and a classification approach with a small number (< 50) of modulation-domain features for each signal frame. The method employs an efficient implementation of such features using a bank of recursive IIR filters which makes it suitable for low-power portable devices and also allows the application of data-driven optimization procedures. We analyze the capability of these features to reflect not only interaural level differences but also temporal modulation patterns. We evaluate our method on simulated and real-world binaural signals and compare the proposed approach to a beamforming-based method which requires fully-synchronized microphone signals.

Location: P4 in Poster Area II

  12:00 h  

Lunch Break

  13:00 h  

Session III (Talks): Speech Enhancement and Separation

Session chairs: Simon Doclo and Timo Gerkmann

13:00 h - Tal Peer, Klaus-Johan Ziegert and Timo Gerkmann: Plosive Enhancement Using Phase Linearization and Smoothing

Despite their small share in overall signal energy, plosives have been previously shown to be important for speech perception. We propose a simple, yet effective, model-based phase-aware speech enhancement approach specifically targeted at plosives. Starting from a model of the plosive burst as a unit impulse, we introduce three phase enhancement schemes: simple replacement of the noisy phase with a linear function, linear regression, as well as smoothing by local polynomial regression. To improve the outcome and compensate for model mismatch we also propose an SNR-based weighting. All schemes are evaluated under both oracle and realistic conditions, showing a consistent improvement in instrumentally predicted speech quality and, to a lesser degree, speech intelligibility. When only frames containing plosives are considered, a segmental SNR improvement of 2 dB to 6 dB can be observed, depending on the input SNR.

Location: Auditorium

13:20 h - Ragini Sinha, Marvin Tammen, Christian Rollwage and Simon Doclo: Speaker-conditioned Target Speaker Extraction Based on Customized LSTM Cells

Speaker-conditioned target speaker extraction systems rely on auxiliary information about the target speaker to extract the target speaker signal from a mixture of multiple speakers. Typically, a deep neural network is applied to isolate the relevant target speaker characteristics. In this paper, we focus on a single-channel target speaker extraction system based on a CNN-LSTM separator network and a speaker embedder network requiring reference speech of the target speaker. In the LSTM layer of the separator network, we propose to customize the LSTM cells in order to only remember the specific voice patterns corresponding to the target speaker by modifying the information processing in the forget gate. Experimental results for two-speaker mixtures using the Librispeech dataset show that this customization significantly improves the target speaker extraction performance compared to using standard LSTM cells.

Location: Auditorium

13:40 h - Tim Owe Wisch and Gerhard Schmidt: Mixed Analog-digital Speech Communication for Underwater Applications

In some cases, a speech communication link is necessary in underwater environments. While advanced techniques are investigated and employed for data communication, mostly traditional (analog) forms of speech transmission are used for speech communication (at least in commercial products). In this contribution, we present a mixed analog-digital approach that tackles this problem by combining digital and analog approaches. Both transmission paths rely on each other in order to lower the necessary bitrate for the digital part and enhance the signal properties of the analog part. The proposed transmission scheme uses linear predictive coding (LPC) filtering combined with a codebook to whiten the input speech. Afterwards, the resulting signal is power normalized and transmitted in the analog transmission part, and the quantized normalization factor is transmitted together with the codebook index in the digital transmission part. This leads to an analog transmission signal which is strongly reduced in dynamics.

Location: Auditorium

  14:00 h  

Session III (Poster): Speech Enhancement and Separation

Session chairs: Simon Doclo and Timo Gerkmann

14:00 h - Short poster overview in the Auditorium (presented by the session chairs, Simon Doclo and Timo Gerkmann, nothing needed from the poster authors)

Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix and Reinhold Haeb-Umbach: Speeding Up Permutation Invariant Training for Source Separation

Permutation invariant training (PIT) is a widely used training criterion for neural network-based source separation, used for both utterance-level separation with utterance-level PIT (uPIT) and separation of long recordings with the recently proposed Graph-PIT. When implemented naively, both suffer from an exponential complexity in the number of utterances to separate, rendering them unusable for large numbers of speakers or long realistic recordings. We present a decomposition of the PIT criterion into the computation of a matrix and a strictly monotonically increasing function so that the permutation or assignment problem can be solved efficiently with several search algorithms. The Hungarian algorithm can be used for uPIT and we introduce various algorithms for the Graph-PIT assignment problem to reduce the complexity to be polynomial in the number of utterances.

Location: P1 in Poster Area III
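
The decomposition described in the abstract above can be illustrated for the utterance-level case: first compute a matrix of pairwise losses between every estimate and every target, then solve an assignment problem on that matrix instead of evaluating the loss for every permutation. The sketch below is not the authors' code. It assumes a mean squared error loss, and the function names are hypothetical. It uses `scipy.optimize.linear_sum_assignment` as the assignment solver when SciPy is installed and falls back to an exhaustive search over the precomputed matrix otherwise.

```python
from itertools import permutations

try:
    from scipy.optimize import linear_sum_assignment
except ImportError:  # SciPy not available: use exhaustive search instead
    linear_sum_assignment = None


def pairwise_mse(estimates, targets):
    """cost[i][j] = mean squared error between estimate i and target j."""
    return [
        [sum((e - t) ** 2 for e, t in zip(est, tgt)) / len(tgt) for tgt in targets]
        for est in estimates
    ]


def upit_assignment(estimates, targets):
    """Return (perm, loss) where estimate i is matched to target perm[i]."""
    cost = pairwise_mse(estimates, targets)  # computed only once
    n = len(cost)
    if linear_sum_assignment is not None:
        # Polynomial-time assignment on the cost matrix
        _, cols = linear_sum_assignment(cost)
        perm = [int(c) for c in cols]
    else:
        # Factorial-time fallback, still reusing the cost matrix
        perm = list(min(
            permutations(range(n)),
            key=lambda p: sum(cost[i][p[i]] for i in range(n)),
        ))
    loss = sum(cost[i][perm[i]] for i in range(n)) / n
    return perm, loss


# Hypothetical three-source example with shuffled, slightly noisy estimates.
targets = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [-1.0, -1.0, -1.0]]
estimates = [[-0.9, -1.0, -1.1], [1.0, 0.9, 1.1], [0.1, 0.0, -0.1]]

perm, loss = upit_assignment(estimates, targets)
print(perm)  # [2, 0, 1]
```

The Graph-PIT case mentioned in the abstract adds constraints between utterances and is not covered by this sketch.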

Wiebke Middelberg and Simon Doclo: Comparison of Generalized Sidelobe Canceller Structures Incorporating External Microphones for Joint Noise and Interferer Reduction

In this paper, we compare two extended generalized sidelobe canceller (GSC) structures, which exploit external microphones in conjunction with a local microphone array to improve the noise and interferer reduction. As a baseline algorithm we consider a local GSC using only the local microphones, for which the relative transfer function (RTF) vector of the target speaker is known. To incorporate the external microphones in a minimum power distortionless response beamformer, the RTF vector of the target speaker needs to be estimated. Since the estimation accuracy of this RTF vector depends on the signal-to-interferer ratio, the GSC with external speech references (GSC-ESR) pre-processes the external microphone signals to reduce the interferer. In a simplified extended structure, namely the GSC with external references (GSC-ER), no such pre-filtering operation is performed. Simulation results show that the GSC-ESR structure yields the best results in terms of noise and interferer reduction, especially in adverse conditions.

Location: P2 in Poster Area III

Stefan Kühl, Carlotta Anemüller, Christiane Antweiler, Florian Heese, Patrick Vicinus and Peter Jax: Feedback Cancellation for IP-based Teleconferencing Systems

In IP-based teleconferences, disturbing howling artifacts might occur whenever multiple participants are in the same room using different communication devices. This paper describes how the feedback problem can be reformulated as a system identification task and how it differs from the classical field of feedback cancellation. The deployment of a Kalman-filter-based adaptation in the frequency domain is proposed due to its robustness against long phases of double talk and near-end single talk. In order to account for sudden changes in the network delay or in the feedback path, a shadow filter system is additionally introduced. Simulation results confirm that the proposed system performs effective feedback cancellation, improving the communication quality significantly.

Location: P3 in Poster Area III

Huajian Fang, Guillaume Carbajal, Stefan Wermter and Timo Gerkmann: Joint Reduction of Ego-noise and Environmental Noise With a Partially-adaptive Dictionary

We consider the problem of simultaneous reduction of ego-noise, i.e., the noise produced by a robot, and environmental noise. Both noise types may occur simultaneously for humanoid interactive robots. Dictionary- and template-based approaches have been proposed for ego-noise reduction. However, most of them lack adaptability to unseen noise types and thus exhibit limited performance in real-world scenarios with environmental noise. Recently, a variational autoencoder (VAE)-based speech model combined with a fully-adaptive dictionary-based noise model, i.e., non-negative matrix factorization (NMF), has been proposed for environmental noise reduction, showing decent adaptability to unseen noise data. In this paper, we propose to extend this framework with a partially-adaptive dictionary-based noise model, which partly adapts to unseen environmental noise while keeping the part pre-trained on ego-noise unchanged. With appropriate sizes, we demonstrate that the partially-adaptive approach outperforms the approaches based on the fully-adaptive and completely-fixed dictionaries, respectively.

Location: P4 in Poster Area III

Mhd Modar Halimeh, Christian Hofmann and Walter Kellermann: Beam-specific System Identification

Due to the increasing availability of devices with multiple microphones, modern acoustic echo cancellation systems need to estimate and optimize a large number of adaptive filters, resulting in an ever-increasing computational cost of such systems. In this paper, a highly efficient multichannel system identification method is introduced for setups where multiple microphone signals are fused using beamformers. By exploiting the knowledge on these beamformers, the proposed approach achieves high levels of computational efficiency, similar to those of beamformer-first acoustic echo cancellation methods, while avoiding severe performance drops resulting from the use of time-varying beamformers. This is validated in an experimental part using both echo cancellation and system identification performance metrics.

Location: P5 in Poster Area III

Ashay Sathe, Adrian Herzog and Emanuel Habets: Low-Complexity Multichannel Wiener Filtering Using Ambisonic Warping

Applying multichannel noise reduction in the spherical harmonic domain (SHD) can be computationally expensive. To reduce the complexity, an approach is proposed that combines signal-dependent beamforming with spatial warping for dimensionality reduction. In particular, the multichannel Wiener filter combined with spatial warping is investigated. Two different methods to combine beamforming with warping are investigated, both aiming to reduce the computational complexity of the multichannel Wiener filter, which can be mainly attributed to the power spectral density matrix inversions required to compute the beamforming weights. A performance evaluation using simulated reverberant spherical microphone array signals is conducted wherein it is shown that the proposed methods significantly reduce the computational complexity whilst providing high degrees of signal enhancement when compared to order-truncation of the SHD signal without warping.

Location: P6 in Poster Area III

Christoph Boeddeker, Frederik Rautenberg and Reinhold Haeb-Umbach: A Comparison and Combination of Unsupervised Blind Source Separation Techniques

Unsupervised blind source separation methods do not require a training phase and thus cannot suffer from a train-test mismatch, which is a common concern in neural network based source separation. The unsupervised techniques can be categorized into two classes: those building upon the sparsity of speech in the short-time Fourier transform domain and those exploiting non-Gaussianity or non-stationarity of the source signals. In this contribution, spatial mixture models, which fall into the first category, and independent vector analysis (IVA), as a representative of the second category, are compared w.r.t. their separation performance and the performance of a downstream speech recognizer on a reverberant dataset of reasonable size. Furthermore, we introduce a serial concatenation of the two, where the result of the mixture model serves as initialization of IVA, which achieves significantly better WER performance than each algorithm individually and even approaches the performance of a much more complex neural network based technique.

Location: P7 in Poster Area III

Klaus Linhard, Philipp Bulling, Marco Gimm and Gerhard Schmidt: Robust and High Gain Acoustic Feedback Compensation in the Frequency Domain With a Simple Energy-decay Operator

Acoustic feedback compensation in speech applications often suffers from insufficient adaptation. As a result, feedback whistling may be audible, and the compensation gain and, more generally, the speech quality are limited. In this work, we introduce the concept of an energy-decay operator and show how to use it for tap-selective step size control of a multi-delay frequency domain filter (MDF). To this end, we derive a so-called "complex sign-sign algorithm" with an inherently included decay operator. Finally, we show robust simulation results with high amplification, even though the algorithm has very low complexity. Possible applications are in-car communication (ICC) systems or other sound reinforcement applications.

Location: P8 in Poster Area III

Jean-Marie Lemercier, Leroy Bartel, David Ditter and Timo Gerkmann: An Integrated Deep-Clustering Based System for Speaker Count Agnostic Speech Separation

This paper proposes to unify two deep-learning methods, CountNet and Deep Clustering, designed for speaker counting and separation respectively, in order to perform speaker count agnostic speech separation. Two approaches are compared, where the speaker count estimation and separation subnetworks are either trained separately or jointly. Training and evaluation are conducted on a tailored dataset WSJ0-mixN, which is an extension of the WSJ0-mix2 and WSJ0-mix3 datasets for an arbitrary number of speakers. Results show that both systems are capable of separating up to four sources without prior information on the number of speakers. Furthermore, the joint approach is able to perform similarly to its separate counterpart while using 46% fewer parameters.

Location: P9 in Poster Area III

Henri Gode, Marvin Tammen and Simon Doclo: Joint Multi-Channel Dereverberation and Noise Reduction Using a Unified Convolutional Beamformer With Sparse Priors

Recently, the convolutional weighted power minimization distortionless response (WPD) beamformer was proposed, which unifies multi-channel weighted prediction error dereverberation and minimum power distortionless response beamforming. To optimize the convolutional filter, the desired speech component is modeled with a time-varying Gaussian model, which promotes the sparsity of the desired speech component in the short-time Fourier transform domain compared to the noisy microphone signals. In this paper, we generalize the convolutional WPD beamformer by using an lp-norm cost function, introducing an adjustable shape parameter that makes it possible to control the sparsity of the desired speech component. Experiments based on the REVERB challenge dataset show that the proposed method outperforms the conventional convolutional WPD beamformer in terms of objective speech quality metrics.

Location: P10 in Poster Area III
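
As a rough illustration of the shape parameter mentioned in the abstract by Gode et al. above: an lp-norm cost is commonly minimized by iteratively reweighted least squares, where each time-frequency bin of the current speech estimate d is weighted by |d|^(p-2). The sketch below assumes this standard reweighting. The function name `lp_weights` and the floor `eps` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def lp_weights(d, p, eps=1e-8):
    """IRLS weights |d|^(p-2) for an lp-norm cost (hypothetical helper)."""
    mag = np.maximum(np.abs(d), eps)  # floor avoids division by zero
    return mag ** (p - 2.0)

# Toy complex-valued estimate with magnitudes 0.1, 1 and 2
d = np.array([0.1 + 0.0j, 1.0 + 0.0j, 0.0 + 2.0j])
for p in (0.0, 0.5, 1.0, 2.0):
    print(p, lp_weights(d, p))
```

With p = 2 all weights are equal (ordinary least squares). Smaller values of p give low-magnitude bins a larger weight, which favors sparser solutions.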

Haitham Afifi, Michael Guenther, Andreas Brendel, Holger Karl and Walter Kellermann: Reinforcement Learning-based Microphone Selection in Wireless Acoustic Sensor Networks Considering Network and Acoustic Utilities

Wireless Acoustic Sensor Networks (WASNs) have a wide range of audio signal processing applications. Due to the spatial diversity of the microphones and their relative positions to the acoustic source, not all microphones are equally useful for subsequent audio signal processing tasks, nor do they all have the same wireless data transmission rates. Hence, a central task in WASNs is to balance a microphone’s estimated acoustic utility against its transmission delay, selecting a best-possible subset of microphones to record audio signals. In this work, we use reinforcement learning to decide if a microphone should be used or switched off to maximize the acoustic quality at low transmission delays, while minimizing switching frequency. In experiments with moving sources in a simulated acoustic environment, our method outperforms naive baseline comparisons.

Location: P11 in Poster Area III

  15:45 h  

Movement and Coffee Break

It's more than a normal break ... it's an "active break". If you want to join and need more information - see here.

Location: Gym

  16:00 h  

Session IV (Talks): Medical Applications and Analytical Studies

Session chairs: Emanuël Habets and Ina Kodrasi

16:00 h - Bence Halpern, Julian Fritsch, Enno Hermann, Rob van Son, Odette Scharenborg and Mathew Magimai.-Doss: An Objective Evaluation Framework for Pathological Speech Synthesis

Development of pathological speech systems is currently hindered by the lack of a standardised objective evaluation framework. In this work, (1) we utilise existing detection and analysis techniques to propose a general framework for the consistent evaluation of synthetic pathological speech, then (2) using our proposed framework, we develop a dysarthric voice conversion system (VC) using CycleGAN-VC, and show that the developed system is able to synthesise dysarthric speech with different levels of speech intelligibility.

Location: Auditorium

16:20 h - Parvaneh Janbakhshi and Ina Kodrasi: Supervised Speech Representation Learning for Parkinson's Disease Classification

Recently proposed automatic pathological speech classification techniques use unsupervised auto-encoders to obtain a high-level abstract representation of speech. Since these representations are learned based on reconstructing the input, there is no guarantee that they are robust to pathology-unrelated cues such as speaker identity information. Further, these representations are not necessarily discriminative for pathology detection. In this paper, we exploit supervised auto-encoders to extract robust and discriminative speech representations for Parkinson's disease classification. To reduce the influence of speaker variabilities unrelated to pathology, we propose to obtain speaker identity-invariant representations by adversarial training of an auto-encoder and a speaker identification task. To obtain a discriminative representation, we propose to jointly train an auto-encoder and a pathological speech classifier. Experimental results on a Spanish database show that the proposed supervised representation learning methods yield more robust and discriminative representations for automatically classifying Parkinson's disease speech, outperforming the baseline unsupervised representation learning system.

Location: Auditorium

16:40 h - Mohamed Elminshawi, Wolfgang Mack and Emanuël Habets: Informed Source Extraction With Application to Acoustic Echo Reduction

Informed speaker extraction aims to extract a target speech signal from a mixture of sources given prior knowledge about the desired speaker. Recent deep learning-based methods leverage a speaker discriminative model that maps a reference snippet uttered by the target speaker into a single embedding vector that encapsulates the characteristics of the target speaker. However, such modeling deliberately neglects the time-varying properties of the reference signal. In this work, we assume that a reference signal is available that is temporally correlated with the target signal. To take this correlation into account, we propose a time-varying source discriminative model that captures the temporal dynamics of the reference signal. We also show that existing methods and the proposed method can be generalized to non-speech sources as well. Experimental results demonstrate that the proposed method significantly improves the extraction performance when applied in an acoustic echo reduction scenario.

Location: Auditorium

  17:00 h  

Session IV (Poster): Medical Applications and Analytical Studies

Session chairs: Emanuël Habets and Ina Kodrasi

17:00 h - Short poster overview in the Auditorium (presented by the session chairs, Emanuël Habets and Ina Kodrasi, nothing needed from the poster authors)

Harald Höge: A Cortical Model for a θ-Oscillator Segmenting Syllables

In the human cortex, the auditory signal is segmented robustly into syllables using θ-oscillations, where the phase and instantaneous frequency of each θ-cycle correspond to the position and duration of a syllable. Recently, in the superior temporal cortex, ensembles of neurons sensitive to edge features have been detected, which spike at the maximal rise of the envelope of the auditory signal at the onset of the nucleus of a syllable. The paper presents a cortical model of a θ-oscillator driven by features of the envelope of the auditory signal and by edge features resetting the θ-phase. First experiments on the performance of the θ-oscillator are presented.

Location: P1 in Poster Area IV

Florian Hilgemann, Johannes Fabry and Peter Jax: Active Acoustic Equalization: Performance Bounds for Time-Invariant Systems

In times of rising noise pollution, the demand for active noise control (ANC) in hearable devices continues to increase rapidly. To date, time-invariant feedforward systems constitute a favorable option for their implementation. Recently, active acoustic equalization (AAE) was proposed as a generalization of the feedforward design problem. It facilitates the design of digital filters for various use cases, including ANC and hear-through applications. In this contribution, we analyze performance bounds of time-invariant AAE systems and discuss their implications. We build upon techniques which are established in the field of ANC to provide estimates for arbitrary use cases. The approach facilitates an isolated analysis of individual aspects such as transfer path variance and processing delay.

Location: P2 in Poster Area IV

Jacek Kudera, Lauri Tavi, Bernd Möbius, Tania Avgustinova and Dietrich Klakow: The Effect of Surprisal on Articulatory Gestures in Polish Consonant-to-Vowel Transitions: A Pilot EMA Study

This study is concerned with the relation between the information-theoretic notion of surprisal and articulatory gesture in Polish consonant-to-vowel transitions. It addresses the question of the influence of diphone predictability on spectral trajectories and articulatory gestures by relating the effect of surprisal to motor fluency. The study triangulates the computation of locus equations (LE) with kinematic data obtained from electromagnetic articulography (EMA). The kinematic and acoustic data showed that a small coarticulation effect was present in the high- and low-surprisal clusters. Regardless of some small discrepancies across the measures, a high degree of overlap of adjacent segments is reported for the mid-surprisal group in both domains. Two explanations of the observed effect are proposed. The first refers to low-surprisal coarticulation resistance and suggests the need to disambiguate predictable sequences. The second, observed in high-surprisal clusters, refers to the prominence given to emphasize the unexpected concatenation.

Location: P3 in Poster Area IV
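
The two quantities named in the abstract by Kudera et al. above can be sketched briefly: surprisal is the negative base-2 logarithm of a probability, and a locus equation is a linear regression of the F2 value at the consonant-vowel onset on the F2 value in the vowel. The formant values and function names below are made up for illustration and do not come from the study.

```python
import math
import numpy as np

def surprisal_bits(prob):
    """Surprisal of an event with probability prob, in bits."""
    return -math.log2(prob)

def locus_equation(f2_vowel_hz, f2_onset_hz):
    """Fit F2_onset = slope * F2_vowel + intercept by least squares."""
    slope, intercept = np.polyfit(f2_vowel_hz, f2_onset_hz, 1)
    return slope, intercept

# Made-up formant values in Hz, for illustration only
f2_vowel = np.array([900.0, 1300.0, 1700.0, 2100.0])
f2_onset = 0.6 * f2_vowel + 500.0

slope, intercept = locus_equation(f2_vowel, f2_onset)
print(surprisal_bits(0.125), round(slope, 3), round(intercept, 1))
```

A slope close to 1 is usually read as strong coarticulation between consonant and vowel, while a slope close to 0 points to coarticulation resistance.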

Nikita Jarocky, Sebastian Urrigshardt and Frank Kurth: A Data Generation Framework for Acoustic Drone Detection Algorithms

State-of-the-art drone detection systems generally combine different sensors, such as radar, acoustic, radio frequency (RF), or optical sensors, each of which contributes individual capabilities. Acoustic sensors have a relatively low spatial range. On the other hand, they can be used in conditions of bad visibility and for scenarios without line-of-sight between sensor and drone. The development of acoustic drone detection (ADD) algorithms typically requires substantial amounts of realistic recordings of flying drones. Moreover, when evaluating methods for robustly extracting characteristic properties of drones, such as the fundamental frequency (F0) of the rotor blades from sensor data, knowledge of ground truth (GT) data is important. In this paper, we present a framework for automatically generating both acoustic drone recordings and GT transcripts of the corresponding time-synchronous motor rotations. We present an application using the generated data for evaluating an ADD algorithm based on F0 tracking.

Location: P4 in Poster Area IV
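
To make the link between motor rotations and an F0 ground truth in the abstract by Jarocky et al. above more concrete, the following purely synthetic sketch turns a motor speed track into a harmonic tone and the matching F0 track. It assumes that F0 equals the blade-pass frequency (rotation rate times number of blades). That assumption, the function name `synth_rotor_tone`, and all parameter values are illustrative and are not taken from the paper, whose framework is not reproduced here.

```python
import numpy as np

def synth_rotor_tone(rpm, n_blades, fs, n_harmonics=3):
    """Return (signal, f0_track) for a per-sample motor speed track in rpm."""
    rpm = np.asarray(rpm, dtype=float)
    f0 = rpm / 60.0 * n_blades                 # assumed blade-pass frequency in Hz
    phase = 2.0 * np.pi * np.cumsum(f0) / fs   # integrate frequency to phase
    signal = sum(np.sin(k * phase) / k for k in range(1, n_harmonics + 1))
    return signal, f0

fs = 8000
rpm = np.full(fs, 6000.0)                      # one second at constant 6000 rpm
x, f0_gt = synth_rotor_tone(rpm, n_blades=2, fs=fs)

# Strongest spectral peak, to compare against the ground-truth F0
peak_hz = np.argmax(np.abs(np.fft.rfft(x))) * fs / len(x)
print(f0_gt[0], peak_hz)
```

Because the F0 track is computed from the same speed track that drives the synthesis, it can serve as a sample-aligned reference when checking an F0 tracker on the generated signal.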

  19:00 h  

Kiel Evening

For more information - see here.

Location: Captain's Bay



Friday, 01.10.2021  
  08:30 h  

Keynote III - Conversational AI in Production Cars

Christophe Couvreur
Cerence, Merelbeke, Belgium

Artificial Intelligence is everywhere. Carmakers rely on conversational AI to offer a personalized and innovative experience to their users during the customer journey. We discuss the market trends and the technical solutions from Cerence and other providers that make it easy for consumers to intuitively and safely interact with the vehicle. We review the state of the art for the various components of conversational AI systems in cars today (audio signal processing, speech recognition, natural language understanding, dialogue management, contextual AI and content integration, natural language generation, speech synthesis, and multimodal integration) and we further zoom in on how recent advances in speech synthesis technology bring more natural and personalized voice assistants to the car.

Location: Auditorium

  09:30 h  

Movement and Coffee Break

It's more than a normal break ... it's an "active break". If you want to join and need more information - see here.

Location: Gym

  09:45 h  

Session V (Talks): Quality of Speech and Speech Communication Systems

Session chairs: Hans-Wilhelm Gierlich and Sebastian Möller

09:45 h - Thilo Michael and Sebastian Möller: Predicting Conversational Quality From Simulated Conversations With Transmission Delay

Conversations over a telephone network require a timely transmission of speech to enable smooth interaction. When a transmission delay is introduced, turn-taking signals arrive too late, and the conversational quality degrades. This disruption of the conversation flow also depends on the interactivity of the conversation. Current instrumental quality models do not take into account the interactivity of a conversation. However, the simulation of conversations has proven to replicate the turn-taking behavior of conversations with different interactivity levels. It can also model the changes in interactivity when transmission delay is introduced. In this paper, we simulate two types of conversations at various levels of transmission delay. We perform a parametric conversation analysis to extract interactivity parameters and use them to predict the conversational quality. We compare the results to the predictions of the E-model, an instrumental transmission planning model, and quality ratings obtained in a conversation experiment.

Location: Auditorium

10:05 h - Tobias Hübschen, Rasool Al-Mafrachi and Gerhard Schmidt: Impact of a Speaker Head Rotation on the Far-end Listening Situation

Speech communication and dialog systems inside vehicles are usually optimized for a speaker in the driver’s seat who is permanently facing forward. However, while driving, a speaker repeatedly rotates their head left and right, which changes the properties of the signal picked up by the microphone. This work analyzes to what extent such a speaker head rotation impacts the signal in terms of frequency content, listening quality, and listening effort. By presenting a simple method for compensating the rotation effects and by combining this method with noise reduction and band limitation of the signals, an analysis of the far-end listening situation with and without speaker head rotation compensation is achieved.

Location: Auditorium

10:25 h - David Hülsmeier, Christopher F. Hauth, Saskia Röttges, Paul Kranzusch, Jana Roßbach, Marc René Schädler, Bernd T. Meyer, Anna Warzybok and Thomas Brand: Towards Non-Intrusive Prediction of Speech Recognition Thresholds in Binaural Conditions

Four non-intrusive models are compared that predict human speech recognition thresholds (SRTs, i.e., signal-to-noise ratios with 50% word recognition rate) in different acoustic environments. Three of them use the blind binaural processing stage (bBSIM) as front-end, while one model uses the spectral representation of the left and right ear signal channels together with their difference. Predictions are evaluated for three acoustic environments (anechoic, office, and cafeteria) with speech from the front and noise from different directions. Despite many technical differences across the models, all of them perform quite accurately (root mean squared prediction errors below 2.2 dB for all models). This implies that any of the non-intrusive models can be used to predict SRTs for listeners with normal hearing measured in stationary noise, different acoustic environments, and spatial configurations.

Location: Auditorium
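
The accuracy figure quoted in the abstract by Hülsmeier et al. above is a root mean squared prediction error between predicted and measured SRTs. A minimal sketch of that metric follows. The SRT values are invented for illustration and are not results from the paper.

```python
import numpy as np

def rms_prediction_error(predicted_db, measured_db):
    """Root mean squared error between predicted and measured SRTs in dB."""
    predicted_db = np.asarray(predicted_db, dtype=float)
    measured_db = np.asarray(measured_db, dtype=float)
    return float(np.sqrt(np.mean((predicted_db - measured_db) ** 2)))

# Invented SRTs in dB SNR for three listening conditions
measured = [-7.0, -12.0, -9.5]
predicted = [-6.0, -13.0, -9.5]
print(rms_prediction_error(predicted, measured))
```

Since SRTs are given as signal-to-noise ratios in dB, the resulting error is also in dB and can be compared directly with a bound such as the 2.2 dB mentioned in the abstract.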

  10:45 h  

Session V (Poster): Quality of Speech and Speech Communication Systems

Session chairs: Hans-Wilhelm Gierlich and Sebastian Möller

10:45 h - Short poster overview in the Auditorium (presented by the session chairs, Hans-Wilhelm Gierlich and Sebastian Möller, nothing needed from the poster authors)

Jens Heitkaemper, Joerg Schmalenstroeer, Joerg Ullmann and Reinhold Haeb-Umbach: A Database for Research on Detection and Enhancement of Speech Transmitted Over HF links

In this paper we present an open database for the development of detection and enhancement algorithms of speech transmitted over HF radio channels. It consists of audio samples recorded by various receivers at different locations across Europe, all monitoring the same single-sideband modulated transmission from a base station in Paderborn, Germany. Transmitted and received speech signals are precisely time aligned to offer parallel data for supervised training of deep learning based detection and enhancement algorithms. For the task of speech activity detection two exemplary baseline systems are presented, one based on statistical methods employing a multi-stage Wiener filter with minimum statistics noise floor estimation, and the other relying on a deep learning approach.

Location: P1 in Poster Area V

Tal Peer and Timo Gerkmann: Intelligibility Prediction of Speech Reconstructed From Its Magnitude or Phase

While Fourier phase has long been considered unimportant for speech enhancement in the short-time Fourier domain, phase-aware speech processing has received increasing attention in recent years. Among other advances, it has been shown that when using very short frames (<2 ms), the phase spectrum carries enough information to allow an intelligible reconstruction from phase alone, whereas the magnitude spectrum remains virtually devoid of any information relevant to intelligibility. Formal listening experiments are expensive and laborious. Hence, in order to facilitate further research into very short frames, researchers require reliable instrumental intelligibility measures. We present a study of two common measures, Short-Time Objective Intelligibility (STOI) and Extended STOI (ESTOI), and compare their performance for speech reconstruction from phase. Our results indicate that only ESTOI is able to predict the general trend of intelligibility across a wide range of frame lengths (including very short frames) for both phase-based and magnitude-based reconstructions.

Location: P2 in Poster Area V

Anton Namenas and Gerhard Schmidt: Acoustic Ambiance Simulation Using Orthogonal Loudspeaker Signals

A reproducible acoustic environment forms the basis for scientific evaluations of speech enhancement systems, as well as their influence on speech quality and intelligibility. Studies with people usually use binaural recordings played back through headphones. However, the effort required to evaluate such a speech enhancement system in this way can quickly become impractical, especially if the system is a black box and does not allow access to internal signals. In order to be able to investigate such systems and the associated influences on the listeners, a reproducible acoustic environment simulation has been developed that is able to generate different noise scenarios in a soundproof room. The system uses loudspeakers to place both the listeners and the communication system in a realistic noise scenario. It recreates the statistical properties of the desired sound scenario at corresponding microphones using real recordings as reference.

Location: P3 in Poster Area V

Jan Reimes: Assessment of Listening Effort for Various Telecommunication Scenarios

Speech communication under adverse conditions may be extremely stressful for the person located at the receiving side. There, background noise may originate from the environment and cannot be reduced for the listener. Additionally, far-end speech might be degraded by the network, e.g., due to transcoding or packet loss. Several of the issues mentioned above are already partly addressed by existing state-of-the-art methods, although these often target speech quality. Examples include ETSI TS 103 281 for noise reduction in send direction or ITU-T P.863 for network-related processing. However, when quality is already considerably degraded, an intelligibility-related measure is of higher interest. As an alternative to the assessment of intelligibility, perceived listening effort was found to be a more suitable measure for such scenarios in several recent studies. This contribution presents the latest developments of an instrumental prediction model that is based on a large number of subjective databases.

Location: P4 in Poster Area V

  12:00 h  

Best Paper Awards and Closing of the Workshop

Announcement of the winners of the best paper award.

Gerhard Schmidt and Peter Jax will close the conference and announce the next one (of course ... "Nach dem Spiel ist vor dem Spiel!", i.e., "after the game is before the game").

Location: Auditorium