Attribution-NonCommercial-NoDerivs 2.0 Korea
You are free to copy, distribute, transmit, display, perform, and broadcast this work, provided you follow the conditions below:
Attribution. You must attribute the work to the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.
For any reuse or distribution, you must make clear the license terms applied to this work. These conditions can be waived with separate permission from the copyright holder. Your rights under copyright law are not affected by the above. This is a human-readable summary of the Legal Code.
Disclaimer

Thesis for the Degree of Doctor of Philosophy
Deep Learning Based Spatial
Information Modeling for Multi-Channel
Speech Enhancement toward Noise
Robust Automatic Speech Recognition
Minkyu Shin
School of Electrical Engineering
Graduate School
Korea University
December 2017


Abstract
Deep Learning Based Spatial Information Modeling
for Multi-Channel Speech Enhancement toward
Noise Robust Automatic Speech Recognition
Minkyu Shin
Advised by Prof. Hanseok Ko
School of Electrical Engineering
Graduate School
Korea University
Speech signal processing is playing an increasing role in human
interaction with smart devices. Therefore, the problem of improving the robustness
of automatic speech recognition in noisy environments has attracted considerable
research effort. Deep learning approaches to speech enhancement, particularly
those that incorporate a denoising auto-encoder, have achieved great success when
applied to single-channel audio signals. In the single-channel case, the signal
intensity in the time-frequency domain is the main information resource for
representing an input signal, and a simple neural network (NN) topology has been
sufficiently effective. Deep learning is effective in improving the performance of
an automatic speech recognition (ASR) system, but single-channel clean-signal
estimation exhibits a performance limit. It is expected that multichannel
signal-based approaches can overcome this limitation by using spatial information
in addition to signal intensity.
Attempts have been made to expand the application of deep learning into
multichannel speech enhancement. However, the full potential of deep
learning-based approaches for microphone array processing has not yet been
attained. The use of NNs for multiple channels faces many obstacles, mainly
because phase information in the time-frequency domain plays a vital role in
delivering the spatial information of multichannel signals, whereas NNs
traditionally process real-valued physical data and rely on real-valued weights. The
application of deep learning to spatial information for multichannel acoustic signals
is still under study, and no representative solution has emerged. This analysis is
supported by the fact that a variety of approaches are still actively being proposed,
such as developing features to be fed into an NN together with a multichannel signal,
or replacing part of an existing spatial filter-based algorithm with an NN. It is
noteworthy that fully deep learning-based approaches to multichannel speech
enhancement have not achieved success comparable to that in speech recognition or image
classification.
In this dissertation, problems encountered in applying deep learning to
multichannel speech enhancement are addressed, and mitigating approaches that
improve existing methods are proposed. For in-depth analysis and problem
formulation, traditional signal processing-based approaches are also discussed.
Notable approaches incorporating independent component analysis and non-negative
matrix factorization are revisited, and a relevant previous approach is
introduced. This provides an analysis of how and why deep learning is required.
The application of spatial diversity within a deep learning framework is
evaluated and analyzed from various perspectives. This includes a previous work
on combining a traditional spatial filter with a deep learning-based postfilter. By
designing an effective feature for the subsequent postfilter, the modified beamformer
structure yielded improved speech enhancement performance. Existing deep
learning-based algorithms are also analyzed, and improvement methods are
proposed. Modeling the phase information of the input signal is emphasized to
overcome the limitations of existing algorithms. By introducing a front-end structure
that accepts a real-valued representation of the time-frequency domain signal, the
proposed structure avoids forcing the NN to approximate complex algebra to
decode the phase difference between the input signals. As a result, deep learning
with a real-valued NN is effectively applied to exploit the spatial information
embedded in the phase difference between the input signals. To ensure
applicability as the front end of an arbitrary ASR system, the proposed methods
focus on clean speech estimation without requiring an acoustic model trained
with noise information. This allows ASR performance improvement without
predefining noise conditions when applied to two-channel real recorded noisy
signals, demonstrating that the proposed method can be applied to unseen noise
situations encountered in everyday life.

Contents
Abstract
Contents
List of Figures
List of Tables
List of Abbreviations
Chapter 1. Introduction
1.1. Background
1.2. Organization of Dissertation
Chapter 2. Classical Multichannel Speech Enhancement Algorithms
2.1. Signal Model and Definitions
2.2. Overview of Classical Beamformers
Chapter 3. Blind Source Separation-based Schemes in Relative Transfer Function Estimation
3.1. Issues and related works
3.2. RTF Estimation Using Peaks in Time-Domain RTF
3.3. Experiments
3.4. Conclusions
Chapter 4. Neural Network-based Approaches
4.1. Issues and related works
4.2. Motivation
4.3. Proposed system with phase-encoded input for NN-based mask estimation
4.4. The ASR stage in the proposed system
4.5. Experiments
4.6. Conclusions
Chapter 5. Neural Network-based postfilter
5.1. Issues and related works
5.2. New Generalized Sidelobe Canceller with Denoising Auto-Encoder for Improved Speech Enhancement
5.3. Experiments
5.4. Conclusions
Chapter 6. Conclusions and Future Works
6.1. Conclusions
6.2. Future Works
Bibliography

List of Figures
Figure 3.1 Basic two-channel linear BSS signal model
Figure 3.2 PTDRs in time-domain RTF
Figure 3.3 Spatial configurations
Figure 4.1 Structure of NN-based mask estimation in GEV beamformer
Figure 4.2 Time convolution and CLDNN layers
Figure 4.3 Factored multichannel raw waveform CLDNN architecture
Figure 4.4 NN topology in the proposed NN-based mask estimation system
Figure 4.5 LSTM memory block with one cell
Figure 4.6 Computation in TDNN with subsampling
Figure 4.7 The microphone array geometry
Figure 4.8 Evaluation results
Figure 5.1 The structure of the proposed GSC and SDA

List of Tables
Table 3.1 RTF estimation performance in terms of NSE and target speech suppression gain
Table 4.1 Context specification of time delay NN
Table 5.1 Evaluation results for speech enhancement and WERs

List of Abbreviations
ANC Adaptive Noise Canceller
ASR Automatic Speech Recognition
ATF Acoustic Transfer Function
BAN Blind Analytical Normalization
BM Blocking Matrix
BSS Blind Source Separation
CEC Constant Error Carousel
CLDNN Convolutional, Long Short-Term Memory, Deep Neural
Network
DOA Direction of Arrival
FBF Fixed Beamformer
FIR Finite-Impulse Response
FMLLR Feature-Space Maximum Likelihood Linear Regression
GEV Generalized Eigenvector
GEVBM Generalized Eigenvector Blocking Matrix
GSC Generalized Sidelobe Canceller
IPD Interaural Phase Difference
LCMV Linearly Constrained Minimum Variance
LDA Linear Discriminant Analysis
MFCC Mel-Frequency Cepstral Coefficients
MLLT Maximum Likelihood Linear Transformation
MVDR Minimum Variance Distortionless Response
NMF Non-Negative Matrix Factorization
NN Neural Network
NN-GEV Neural Network supported Generalized Eigenvalue
NSE Normalized Squared Error
PHAT Phase Transform
PSD Power Spectrum Density
PTDR Peaks in Time-Domain RTF
RIR Room Impulse Response
RTF Relative Transfer Function
SAT Speaker Adaptive Training
SDA Stacked Denoising Autoencoder
SDR Signal-to-Distortion Ratio
SMBR State-Level Minimum Bayes Risk
SNR Signal-to-Noise Ratio
SRP Steered Response Power
STOI Short-Time Objective Intelligibility
WER Word Error Rate
WSJ Wall Street Journal

Chapter 1. Introduction
1.1. Background
Multichannel speech enhancement algorithms can be classified into data-dependent
and data-independent approaches. The former exploit both the spectro-temporal
and the spatial information of the input signals, while the latter focus on
enhancing signals from a predefined direction without accounting for the statistics of
the signals. Among the data-dependent approaches, those that require reference
information, such as the spatial distribution of sources and sensors, are classified as
supervised approaches and are distinguished from unsupervised approaches that do not
require prior knowledge. In the case of unsupervised methods, the lack of prior
knowledge can be compensated by using a trained model [1]–[3] or by assumptions
about the target speech characteristics, such as sparseness in the time-frequency
domain and statistical independence between different sources [4], [5].
Two representative examples of traditional supervised multichannel filtering are
beamforming and multichannel Wiener filtering [6], [7]. The following sections focus on
one member of the beamformer family of algorithms, namely the minimum variance
distortionless response (MVDR) beamformer [8], which has proven its effectiveness in
realistic environments by achieving impressive results in noise-robust ASR challenges [9].
The term beamforming refers to designing a spatio-temporal filter according to specific
criteria. The MVDR beamformer performs noise reduction under the constraint that the
desired speech signals are processed without distortion.
1.2. Organization and Contributions of Dissertation
The contributions of this dissertation are summarized as follows.
Investigated classical multichannel speech enhancement approaches, especially
the beamformer family of algorithms, to provide a clear basis for the subsequent
chapters.
Developed directional BSS-based approaches to estimate the RTF in a practical
scenario with noise that naturally occupies the same time-frequency range as the
desired speech signals.
Introduced and described the issues and related works on improving the
performance of RTF estimation. Using the peaks in the time-domain RTF proposed
in a previous work, the relevant issues are introduced and described with
experimental results.
Developed the recently introduced NN-GEV beamforming approach and improved
it by deep learning-based spatial information modeling.
Proposed a front-end structure for the NN to effectively transfer the phase
information into deeper layers. By investigating the issues associated with deep
learning-based approaches to spatial information modeling, and the shortcomings of
existing algorithms, an NN structure that effectively transfers the phase information is
developed. Relevant evaluation results in terms of ASR performance and a discussion
of the results are presented.
Investigated and proposed the use of deep learning for compensating distortion
of the enhanced signal. Relevant issues and related works concerning postfiltering
the beamformer output are described. A structural change of the beamformer, which was
proposed in a previous work, is introduced to allow the deep learning-based postfilter
to take advantage of the multichannel information.
This dissertation is organized as follows:
In Chapter 2, classical multichannel speech enhancement approaches, especially
the beamformer family of algorithms are investigated.
In Chapter 3, directional BSS-based approaches to estimate the RTF are developed.
In Section 3.1, issues and related works are described. In Sections 3.2 and 3.3, the method
of improving performance by using the peaks in the time-domain RTF is introduced with
experimental results.
In Chapter 4, the recently proposed NN-GEV beamforming approach and its
improvement are explored. In Sections 4.1 and 4.2, issues associated with deep
learning-based approaches in spatial information modeling, and the shortcomings of
the existing algorithms, are investigated. In Section 4.3, a front-end structure of the NN for deep
learning-based spatial information modeling is introduced. Evaluation results in terms of
ASR performance and a discussion of the results are presented in Sections 4.5 and 4.6.

In Chapter 5, the use of deep learning in compensating distortion of the enhanced
signal is investigated. In Section 5.1, issues and related works concerning postfiltering
the beamformer output are described. In Section 5.2, a structural change of the beamformer
is introduced. The experimental results are presented in Section 5.3.
Finally, the concluding remarks of this dissertation and future research directions
are presented in Chapter 6.

Chapter 2. Classical Multichannel Speech
Enhancement Algorithms
2.1. Signal Model and Definitions
Let $s(t)$ be a speech signal impinging on a microphone array of arbitrary
geometry. The input signal at the $m$-th microphone is given by

$$y_m(t) = x_m(t) + v_m(t) = s(t) * h_m(t) + v_m(t), \qquad m = 1, 2, \ldots, M, \tag{2.1}$$

where $*$ is the convolution operator and $h_m(t)$ is the channel impulse response of the
$m$-th microphone, modeled as a finite impulse response. $x_m(t)$ is the noise-free
reverberant speech component, while $v_m(t)$ is the noise at the $m$-th microphone. It is
assumed that the noise signal is uncorrelated with $s(t)$. By assuming that the room
impulse responses (RIRs) change slowly over time, the time index is omitted in $h_m$. The
above signal model can be written in the frequency domain as

$$Y_m(l,k) = H_m(k)\,S(l,k) + V_m(l,k) = X_m(l,k) + V_m(l,k) \quad \text{for } m = 1, 2, \ldots, M. \tag{2.2}$$

Here, $Y_m(l,k)$, $H_m(k)$, $S(l,k)$, and $V_m(l,k)$ are the short-time Fourier transforms
(STFTs) of $y_m(t)$, $h_m(t)$, $s(t)$, and $v_m(t)$, respectively. $l$ is the frame index, and $k$ is
the frequency bin index. This time-frequency domain signal model can be represented in
vector form as

$$\mathbf{y}(l,k) = \mathbf{h}(k)\,S(l,k) + \mathbf{v}(l,k) = \mathbf{x}(l,k) + \mathbf{v}(l,k),$$

where

$$\mathbf{y}(l,k) = [Y_1(l,k)\;\; Y_2(l,k)\;\; \cdots\;\; Y_M(l,k)]^{\mathrm{T}},$$
$$\mathbf{h}(k) = [H_1(k)\;\; H_2(k)\;\; \cdots\;\; H_M(k)]^{\mathrm{T}},$$
$$\mathbf{x}(l,k) = [X_1(l,k)\;\; X_2(l,k)\;\; \cdots\;\; X_M(l,k)]^{\mathrm{T}},$$
$$\mathbf{v}(l,k) = [V_1(l,k)\;\; V_2(l,k)\;\; \cdots\;\; V_M(l,k)]^{\mathrm{T}}. \tag{2.3}$$

In speech enhancement, the goal is to reduce the noise and recover the speech
component $S(l,k)$. Classical beamformers apply a linear filter $\mathbf{w}(l,k)$ to the noisy
input signal to achieve this goal as

$$Z(l,k) = \mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{y}(l,k) = \mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{h}(k)\,S(l,k) + \mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{v}(l,k), \tag{2.4}$$

where $\mathbf{w}(l,k) \in \mathbb{C}^{M}$ is a beamformer weight vector decided by the criteria of the
beamformer, and $(\cdot)^{\mathrm{H}}$ denotes the conjugate transpose. To obtain a distortionless speech
component in the enhanced result, the goal can be set as making $\mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{h}(k) \approx 1$
while keeping $\mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{v}(l,k)$ small. In this case, the output of the beamformer
recovers $S(l,k)$, i.e., $Z(l,k) \approx S(l,k)$.
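As a numerical illustration of the vector signal model (2.3) and the beamformer output (2.4), the following sketch applies a per-bin weight to a synthetic multichannel STFT. All sizes and signals here are random stand-ins chosen for illustration, not data or parameters from this dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, K = 4, 100, 257   # mics, frames, frequency bins -- illustrative sizes only

# Random stand-ins for the ATF h(k) and the clean-speech STFT S(l,k).
h = rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))
S = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))
V = 0.1 * (rng.standard_normal((L, K, M)) + 1j * rng.standard_normal((L, K, M)))

# Signal model (2.3): y(l,k) = h(k) S(l,k) + v(l,k).
Y = h[None, :, :] * S[:, :, None] + V

# A weight satisfying the distortionless condition w^H(k) h(k) = 1
# (a normalized matched filter, used here only to illustrate (2.4)).
w = h / np.sum(np.abs(h) ** 2, axis=1, keepdims=True)

# Beamformer output (2.4): Z(l,k) = w^H y(l,k).
Z = np.einsum('km,lkm->lk', w.conj(), Y)

# In the noise-free case the same weight recovers S(l,k) exactly, since
# w^H h S = S; with noise present, the residual term w^H v remains in Z.
Z_clean = np.einsum('km,lkm->lk', w.conj(), Y - V)
assert np.allclose(Z_clean, S)
```

Any weight satisfying the distortionless condition behaves this way; the designs discussed in the next section differ only in how the weight is chosen with respect to the noise statistics.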

2.2. Overview of Classical Beamformers
In the context of applying a spatial filter, i.e., beamforming, the MVDR design [8]
is particularly popular. The MVDR beamformer is traditionally devised by minimizing
the power of the filtered signal subject to a no-speech-distortion constraint. The filter
is designed under the assumption that the ATF of the target signal, $\mathbf{h}(k)$, can be estimated
exactly. The beamformer filter weight is calculated as the optimal solution of

$$\min_{\mathbf{w}}\; \mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)\,\mathbf{w}(l,k), \qquad \text{subject to } \mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{h}(k) = c(l,k), \tag{2.5}$$

where $\boldsymbol{\Phi}_{\mathbf{yy}}$ is the so-called power spectral density (PSD) matrix of the noisy input signal
$\mathbf{y}(l,k)$. To impose the no-speech-distortion constraint, $c(l,k)$ is set to 1.
The PSD matrix for a given vector $\mathbf{y}(l,k)$ is defined as

$$\boldsymbol{\Phi}_{\mathbf{yy}}(l,k) = \mathrm{E}\big\{\mathbf{y}(l,k)\,\mathbf{y}^{\mathrm{H}}(l,k)\big\}, \tag{2.6}$$

where $\mathrm{E}\{\cdot\}$ denotes the expected value. To solve (2.5), a complex Lagrangian functional is
defined [10] as follows:

$$\mathcal{L}(\mathbf{w}) = \mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)\,\mathbf{w}(l,k) + \lambda\big[\mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{h}(k) - c(l,k)\big] + \lambda^{*}\big[\mathbf{h}^{\mathrm{H}}(k)\,\mathbf{w}(l,k) - c^{*}(l,k)\big], \tag{2.7}$$

where $\lambda$ is a Lagrange multiplier. By setting the derivative with respect to $\mathbf{w}^{*}$ to 0 [11] as

$$\nabla_{\mathbf{w}^{*}}\,\mathcal{L}(\mathbf{w}) = \boldsymbol{\Phi}_{\mathbf{yy}}(l,k)\,\mathbf{w}(l,k) + \lambda\,\mathbf{h}(k) = \mathbf{0}, \tag{2.8}$$

the optimal filter is calculated as

$$\mathbf{w}_{\mathrm{MVDR}}(l,k) = \frac{\boldsymbol{\Phi}_{\mathbf{yy}}^{-1}(l,k)\,\mathbf{h}(k)}{\mathbf{h}^{\mathrm{H}}(k)\,\boldsymbol{\Phi}_{\mathbf{yy}}^{-1}(l,k)\,\mathbf{h}(k)}. \tag{2.9}$$
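The closed-form solution (2.9) is straightforward to evaluate once estimates of the PSD matrix and the ATF are available. A minimal sketch for a single time-frequency bin, with a randomly generated Hermitian positive-definite matrix standing in for the true PSD (all values are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4   # number of microphones (arbitrary)

# Hypothetical ATF for one time-frequency bin, and a random Hermitian
# positive-definite matrix standing in for the noisy-input PSD Phi_yy(l,k).
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_yy = A @ A.conj().T + M * np.eye(M)   # Hermitian and well conditioned

# Closed-form MVDR solution (2.9): w = Phi_yy^{-1} h / (h^H Phi_yy^{-1} h).
phi_inv_h = np.linalg.solve(phi_yy, h)
w_mvdr = phi_inv_h / (h.conj() @ phi_inv_h)

# The distortionless constraint of (2.5) is satisfied: w^H h = 1.
assert np.isclose(w_mvdr.conj() @ h, 1.0)
```

Using `np.linalg.solve` instead of an explicit matrix inverse is the usual numerically preferable choice here.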
In the early versions of the MVDR beamformer and its adaptive implementations, i.e.,
the generalized sidelobe canceller (GSC) [12], [13], it was assumed that the propagation paths
between the source and the sensors are characterized by pure delays. This makes it
possible to set the distortionless constraint with only the knowledge of the direction of
arrival (DOA) of the target speech signal. In this case, the acoustic transfer function (ATF)
$\mathbf{h}(k)$ is simplified to a phase-shifted steering vector in the STFT domain. However,
the reflections present in a real, reverberant environment degrade this
simplified beamformer when it is applied in real life. More specifically, a wrong assumption on
$\mathbf{h}(k)$ invalidates the distortionless constraint and causes parts of the desired signal to be
cancelled. This is known as the signal cancellation phenomenon, and it is
especially well known in adaptive implementations of MVDR beamformers. As a
solution to the problem of the pure-delay assumption, the steering vector of simple time
delays is replaced by an arbitrary finite impulse response (FIR). In [14], the authors
showed that knowledge of the RTFs from the source to the individual sensors is sufficient
to construct the MVDR beamformer in an adaptive way. They introduced a system to
estimate the RTF using a least-squares approach that exploits the nonstationarity of speech
signals as opposed to the stationary character of noise. In [15], a procedure for obtaining
the RTFs from a generalized eigenvector decomposition is derived. By using the
eigenvector-based RTF estimation, the performance of the GSC was noticeably improved
in simulated environments. The eigenvector-based approach originates from the maximum
signal-to-noise ratio (SNR) criterion in [16], where broadband beamforming in the
frequency domain by maximizing the signal-to-noise power ratio was proposed. The
beamformer filter coefficients are determined by maximizing the SNR of the
output signal $Z(l,k)$, where the SNR is defined as

$$\mathrm{SNR}(l,k) = \frac{\mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{xx}}(l,k)\,\mathbf{w}(l,k)}{\mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}(l,k)} = \frac{\mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)\,\mathbf{w}(l,k)}{\mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}(l,k)} - 1. \tag{2.10}$$

The maximization of (2.10) leads to a generalized eigenvalue (GEV) problem; as a result, $\mathbf{w}(l,k)$ is found to
be the eigenvector corresponding to the largest eigenvalue of
$\boldsymbol{\Phi}_{\mathbf{vv}}^{-1}(l,k)\,\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)$. This
eigenvector is denoted as $\mathbf{w}_{\mathrm{GEV}}(l,k)$. The PSD matrix of the noisy input signal is given
by

$$\boldsymbol{\Phi}_{\mathbf{yy}}(l,k) = \boldsymbol{\Phi}_{\mathbf{xx}}(l,k) + \boldsymbol{\Phi}_{\mathbf{vv}}(l,k). \tag{2.11}$$
The PSD matrix of the target speech component is given by

$$\boldsymbol{\Phi}_{\mathbf{xx}}(l,k) = \phi_{ss}(l,k)\,\mathbf{h}(k)\,\mathbf{h}^{\mathrm{H}}(k). \tag{2.12}$$

Applying the GEVD to $\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)$ and $\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)$, and choosing the eigenvector
corresponding to the largest eigenvalue $\lambda(l,k)$, results in

$$\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k) = \lambda(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k). \tag{2.13}$$

Substituting (2.11) and (2.12) into the left-hand side of (2.13) yields

$$\phi_{ss}(l,k)\,\mathbf{h}(k)\,\mathbf{h}^{\mathrm{H}}(k)\,\mathbf{w}_{\mathrm{GEV}}(l,k) + \boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k) = \lambda(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k), \tag{2.14}$$

therefore

$$\phi_{ss}(l,k)\,\mathbf{h}(k)\,\mathbf{h}^{\mathrm{H}}(k)\,\mathbf{w}_{\mathrm{GEV}}(l,k) = \big(\lambda(l,k) - 1\big)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k). \tag{2.15}$$

Finally,

$$\mathbf{w}_{\mathrm{GEV}}(l,k) = \frac{\phi_{ss}(l,k)\,\mathbf{h}^{\mathrm{H}}(k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)}{\lambda(l,k) - 1}\,\boldsymbol{\Phi}_{\mathbf{vv}}^{-1}(l,k)\,\mathbf{h}(k). \tag{2.16}$$
As $\phi_{ss}(l,k)\,\mathbf{h}^{\mathrm{H}}(k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)/\big(\lambda(l,k)-1\big)$ is a scalar, this can be rewritten as

$$\mathbf{w}_{\mathrm{GEV}}(l,k) = \alpha\,\boldsymbol{\Phi}_{\mathbf{vv}}^{-1}(l,k)\,\mathbf{h}(k), \tag{2.17}$$

where $\alpha$ is an arbitrary complex scalar. Since the above filter weight is derived from
a narrowband SNR criterion, an uncontrolled amount of speech distortion is introduced
by the arbitrary gain $\alpha$. To achieve a distortionless response, an additional single-channel
postfilter $g(l,k)$ is required to impose unity gain on the speech signal:

$$g(l,k)\,\mathbf{w}_{\mathrm{GEV}}^{\mathrm{H}}(l,k)\,\mathbf{h}(k) = 1. \tag{2.18}$$

An example of such a postfilter is blind analytical normalization (BAN):

$$g_{\mathrm{BAN}}(l,k) = \frac{\sqrt{\mathbf{w}_{\mathrm{GEV}}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)/M}}{\mathbf{w}_{\mathrm{GEV}}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)}. \tag{2.19}$$

For a more intuitive representation of the relation between the max-SNR beamformer
coefficient and the RTF, (2.17) can be rewritten as

$$\mathbf{h}(k) = \alpha^{-1}\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k). \tag{2.20}$$
As shown in (2.20), $\mathbf{h}(k)$ is a scaled and rotated version of $\mathbf{w}_{\mathrm{GEV}}(l,k)$. To obtain the
ratios of the transfer functions from the source to the individual sensors, the ambiguity caused
by the scalar can be resolved by normalizing the right-hand side of (2.20) by its first
component,

$$\tilde{\mathbf{h}}(l,k) = \frac{\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)}{\big(\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)\big)_{1}}, \tag{2.21}$$

where $(\cdot)_{1}$ denotes the first component of the vector.
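The steps of (2.13), (2.19), and (2.21) can be checked numerically. In the sketch below the speech PSD is an exact rank-one matrix as in (2.12), so the normalized vector of (2.21) must reproduce the true ATF ratio; all matrices and powers are synthetic stand-ins chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
M = 4

# Rank-one speech PSD (2.12) and a Hermitian positive-definite noise PSD,
# both synthetic, for a single time-frequency bin.
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
phi_ss = 2.0                                   # hypothetical speech power
phi_xx = phi_ss * np.outer(h, h.conj())
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_vv = B @ B.conj().T + np.eye(M)
phi_yy = phi_xx + phi_vv                       # (2.11)

# (2.13): w_GEV is the principal eigenvector of Phi_vv^{-1} Phi_yy.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(phi_vv, phi_yy))
w_gev = eigvecs[:, np.argmax(eigvals.real)]

# BAN postfilter gain (2.19).
num = np.sqrt((w_gev.conj() @ phi_vv @ phi_vv @ w_gev).real / M)
g_ban = num / (w_gev.conj() @ phi_vv @ w_gev).real

# RTF estimate (2.20)-(2.21): Phi_vv w_GEV equals h up to a scalar, so
# normalizing by the first component removes the ambiguity.
rtf = phi_vv @ w_gev
rtf = rtf / rtf[0]
assert np.allclose(rtf, h / h[0])
```

Because the speech PSD is exactly rank one here, the recovery is exact; with estimated PSD matrices, (2.21) yields an estimate whose accuracy depends on the quality of the noise statistics.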

Chapter 3. Blind Source Separation-based
Schemes in Relative Transfer Function Estimation
3.1. Issues and related works
Beamformers critically depend on estimates of the statistics of the desired
speech component. The most common form of information required by beamformers is
the RTF. To estimate RTFs, it is typically assumed that a time period exists in which the only
nonstationary acoustic source is the desired speech, or in which the desired speech can be
observed alone. If this so-called RTF estimation period is provided, the RTF of the desired
speech can be estimated by applying the GEVD procedure to the cross-PSD of the period [15]
or by exploiting the nonstationarity of speech [17]. Obviously, ensuring the existence of the
RTF estimation period is very unnatural under realistic acoustic conditions, unless a
speaker reads a series of sentences for a substantial amount of time without any
movement. To avoid these stringent requirements, the spatial information of the target
speech may be exploited by using a trained model [18]–[20]. These
approaches treat the RTF as a random variable and attempt to estimate it with probabilistic
models trained for a specific room structure. They are not
applicable when the room structure is not known a priori, and considerable effort is
needed in the modeling process. The simplest way to avoid these problems is probably
to simplify the RTF to a steering vector by assuming that the relation between the desired
speech at each sensor is a pure time delay. RTF estimation can then be replaced with
noise-robust sound source localization of the desired speech. This is much easier, but results
in insufficient performance because it causes severe distortion of the desired speech.
Several studies have attempted to estimate the RTF of the desired speech in the
practical scenario with noise, which naturally occupies the same time-frequency range
as the desired speech. Early work on estimating the time-frequency bins in which the desired
speech is dominant, and using them to estimate the RTF, is presented in [21]. In this research,
the speech presence probability in the time-frequency domain is incorporated by
estimating the RTF from subintervals that contain speech. The speech presence
probability is obtained by applying a minima-controlled recursive averaging-based
algorithm in the time-frequency domain, and it depends heavily on the assumption that
the noise is more stationary than the desired speech. More recently, blind source separation (BSS)
algorithms for RTF estimation have been proposed. A geometric constraint on BSS was
introduced to specify the direction of the desired speech [22], and the time-domain BSS
weight matrix was updated to block the desired speech in the BSS output. The BSS
weights naturally approximate the RTF as they are updated to block signal
components related to the specified direction. However, the estimates obtained from the
BSS weights usually do not reach sufficient accuracy unless the input signal is
considerably long.

3.1.1. Generic ICA-based BSS algorithm
In this section, triple-N independent component analysis for convolutive
mixtures (TRINICON) [23] for BSS is briefly reviewed. A signal model for point
sources $s_q(n)$, $q = 1, \ldots, P$, is described by

$$x_p(n) = \sum_{q=1}^{P} h_{qp}(n) * s_q(n), \tag{3.1}$$

where $*$ represents convolution and $n$ is the discrete time index. $h_{qp}(n)$ represents the
transfer function from the position of the $q$-th sound source to the $p$-th sensor. The
demixing filter from the $p$-th microphone to the $q$-th output channel is denoted as
$w_{pq}(n)$, and the output signals of the demixing system are described by

$$y_q(n) = \sum_{p=1}^{P} \sum_{\kappa=0}^{L-1} w_{pq}(\kappa)\, x_p(n - \kappa) = \sum_{p=1}^{P} w_{pq}(n) * x_p(n), \tag{3.2}$$

where $w_{pq}(\kappa)$, $\kappa = 0, \ldots, L-1$, denote the current weights of the filter taps from the
$p$-th microphone to the $q$-th output channel. An example of the signal model for the
two-channel case ($P = 2$) is shown in Figure 3.1.

Figure 3.1 Basic two-channel linear BSS signal model
In TRINICON, the demixing filter $w_{pq}(n)$ is identified by minimizing the mutual information
between the output channels, based on the assumption that the acoustic sources are
statistically independent. The filter weights can be updated for each block, which consists
of $N$ output samples. The cost function for each block is usually calculated by replacing ensemble
averaging with temporal averaging over the block, under an assumption of
ergodicity within the individual blocks. To describe the block processing, the block output
signal matrix is introduced as

$$\mathbf{Y}_q(m) = \begin{bmatrix}
y_q(mL) & y_q(mL-1) & \cdots & y_q(mL-D+1) \\
y_q(mL+1) & y_q(mL) & \cdots & y_q(mL-D+2) \\
\vdots & \vdots & & \vdots \\
y_q(mL+N-1) & y_q(mL+N-2) & \cdots & y_q(mL+N-D)
\end{bmatrix}, \tag{3.3}$$

and the convolution in (3.2) is reformulated as

$$\mathbf{Y}_q(m) = \sum_{p=1}^{P} \mathbf{X}_p(m)\,\mathbf{W}_{pq} \tag{3.4}$$
with $m$ denoting the block time index and $N$ denoting the block length. The matrix
$\mathbf{X}_p(m)$ incorporates time lags into the correlation matrices of the cost function. To
ensure that the elements of $\mathbf{Y}_q(m)$ are produced by linear convolutions, the number of
columns of the input matrix $\mathbf{X}_p(m)$ is double the filter length of $\mathbf{W}_{pq}$:

$$\mathbf{X}_p(m) = \begin{bmatrix}
x_p(mL) & x_p(mL-1) & \cdots & x_p(mL-2L+1) \\
x_p(mL+1) & x_p(mL) & \cdots & x_p(mL-2L+2) \\
\vdots & \vdots & & \vdots \\
x_p(mL+N-1) & x_p(mL+N-2) & \cdots & x_p(mL-2L+N)
\end{bmatrix}. \tag{3.5}$$
Now, $\mathbf{W}_{pq}$ is given as the $2L \times D$ Sylvester matrix

$$\mathbf{W}_{pq} = \begin{bmatrix}
w_{pq}(0) & 0 & \cdots & 0 \\
w_{pq}(1) & w_{pq}(0) & \ddots & \vdots \\
\vdots & w_{pq}(1) & \ddots & 0 \\
w_{pq}(L-1) & \vdots & \ddots & w_{pq}(0) \\
0 & w_{pq}(L-1) & & w_{pq}(1) \\
\vdots & & \ddots & \vdots \\
0 & \cdots & 0 & w_{pq}(L-1) \\
\vdots & & & \vdots \\
0 & \cdots & \cdots & 0
\end{bmatrix}. \tag{3.6}$$

For a more convenient notation, the components are rewritten by combining all channels
as

$$\mathbf{Y}(m) = \mathbf{X}(m)\,\mathbf{W} \tag{3.7}$$

with the matrices

$$\mathbf{Y}(m) = [\mathbf{Y}_1(m), \ldots, \mathbf{Y}_P(m)], \qquad \mathbf{X}(m) = [\mathbf{X}_1(m), \ldots, \mathbf{X}_P(m)],$$
$$\mathbf{W} = \begin{bmatrix}
\mathbf{W}_{11} & \cdots & \mathbf{W}_{1P} \\
\vdots & \ddots & \vdots \\
\mathbf{W}_{P1} & \cdots & \mathbf{W}_{PP}
\end{bmatrix}. \tag{3.8}$$
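The Sylvester structure in (3.6) is what turns the matrix product of (3.4) into a bank of linear convolutions. A small numerical check of this property (filter length, block sizes, and signals are arbitrary choices for illustration, not parameters from the TRINICON experiments):

```python
import numpy as np

def sylvester(w, D):
    """Build the 2L x D Sylvester matrix of (3.6): column j holds the
    filter taps shifted down by j rows, with zeros elsewhere."""
    L = len(w)
    W = np.zeros((2 * L, D))
    for j in range(D):
        W[j:j + L, j] = w
    return W

rng = np.random.default_rng(3)
L, D, N = 8, 4, 16                  # filter length, lag width, block length
w = rng.standard_normal(L)
x = rng.standard_normal(N + 2 * L)  # enough past samples for every row

# Input matrix per (3.5): row i is [x(n_i), x(n_i - 1), ..., x(n_i - 2L + 1)].
X = np.array([x[n:n - 2 * L:-1] for n in range(2 * L, 2 * L + N)])

Y = X @ sylvester(w, D)

# Column j of Y is the linear convolution (w * x) delayed by j samples,
# which is exactly the lag structure that (3.4) requires.
full = np.convolve(w, x)
ref = np.array([[full[n - j] for j in range(D)]
                for n in range(2 * L, 2 * L + N)])
assert np.allclose(Y, ref)
```

The doubled row dimension ($2L$ instead of $L$) is what guarantees that every entry of the product is a complete linear convolution rather than a circular one.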
The cost function that includes all time lags of all auto-correlations and cross-correlations
of the output signals is introduced in [24], [25] as

$$\mathcal{J}(m, \mathbf{W}) = \sum_{i=0}^{m} \beta(i, m)\Big[\log\det\big(\mathrm{bdiag}\,\mathbf{R}_{\mathbf{yy}}(i)\big) - \log\det\mathbf{R}_{\mathbf{yy}}(i)\Big], \quad \text{where } \mathbf{R}_{\mathbf{yy}}(i) = \mathbf{Y}^{\mathrm{H}}(i)\,\mathbf{Y}(i). \tag{3.9}$$

Here, $\beta$ is a weighting function with finite support that is normalized according to
$\sum_{i=0}^{m} \beta(i, m) = 1$. The bdiag operation sets all submatrices on the off-diagonals to zero.
In this case, the matrix $\mathbf{Y}^{\mathrm{H}}(i)\mathbf{Y}(i)$ of size $PD \times PD$ is composed of channel-wise
$D \times D$ submatrices. The cost function becomes zero if and only if the output signals are
uncorrelated, so that all block-off-diagonal elements become zero. The update equation of
the filter coefficients is expressed by applying the natural gradient [25], [26] as

$$\Delta\mathbf{W}(m) = \nabla_{\mathbf{W}}^{\mathrm{NG}}\,\mathcal{J}(m, \mathbf{W}) = 2\sum_{i=0}^{m} \beta(i, m)\,\mathbf{W}(i)\Big[\mathbf{R}_{\mathbf{yy}}(i) - \mathrm{bdiag}\,\mathbf{R}_{\mathbf{yy}}(i)\Big]\big(\mathrm{bdiag}\,\mathbf{R}_{\mathbf{yy}}(i)\big)^{-1}. \tag{3.10}$$

When the update is applied to $\mathbf{W}$, the Sylvester structure of $\mathbf{W}$ is maintained by using
only the nonredundant values in $\Delta\mathbf{W}$ with a Sylvester constraint. The
coefficient update rule in [27] is applied to the proposed system to obtain a recursive
block-by-block solution based on offline minimization. The update of the time-domain
BSS weight matrix for the $m$-th block after the $j$-th iteration, $\mathbf{W}^{j}(m)$, is as follows:

$$\mathbf{W}^{j}(m) = \mathbf{W}^{j-1}(m) - \mu\,\Delta\mathbf{W}^{j}(m), \qquad j = 1, \ldots, J, \tag{3.11}$$

where $\mu$ is the step size and the update $\Delta\mathbf{W}^{j}(m)$ corresponds to the natural gradient
$\nabla_{\mathbf{W}}^{\mathrm{NG}}\,\mathcal{J}\big(m, \mathbf{W}^{j-1}(m)\big)$.
3.2. RTF Estimation Using Peaks in Time-Domain RTF
3.2.1. Motivation
When an optimum broadband solution of BSS is obtained, the demixing matrix
can separate the two input sources into the respective outputs. Assuming that $s_1(n)$ is the desired
speech and an ideal BSS system separates $s_2(n)$ from the mixed signal to obtain $y_1(n)$,
the target source $s_1(n)$ should be perfectly suppressed in $y_1(n)$ as follows:

$$s_1(n) * h_{11}(n) * w_{11}(n) + s_1(n) * h_{12}(n) * w_{21}(n) = 0,$$
$$h_{11}(n) * w_{11}(n) = -\,h_{12}(n) * w_{21}(n). \tag{3.12}$$

Equation (3.12) can be expressed in the STFT domain as

$$W_{11}(k) = -\frac{H_{12}(k)}{H_{11}(k)}\,W_{21}(k), \tag{3.13}$$

with $k$ as the frequency bin index and $L$ as the length of the filter weight. If $w_{21}$ is fixed to
a pure time delay with a negative sign, then

$$W_{11}(k) = \frac{H_{12}(k)}{H_{11}(k)}\,e^{-j 2\pi k \tau / L}, \tag{3.14}$$

where $\tau$ is a time delay. Now, $H_{12}(k)/H_{11}(k)$, which is the RTF of the target source,
can be calculated from the BSS weight $W_{11}(k)$. This holds only when the BSS system
suppresses only $s_1(n)$ in $y_1(n)$, which cannot be guaranteed if the number of sound
sources exceeds the number of sensors. In [22], [28], a geometric constraint has been
proposed to satisfy the condition. This constraint was used to force BSS weights that
correspond to output $y_1(n)$ to have a spatial null towards the direction of the target
source. The update of the time-domain BSS weight matrix for the $m$-th block after the
$j$-th iteration, $\mathbf{W}^{j}(m)$, becomes

$$\mathbf{W}^{j+1}(m) = \mathbf{W}^{j}(m) - \Delta\mathbf{W}, \qquad \Delta\mathbf{W} = \mu\,\Delta\mathbf{W}_{\mathrm{BSS}} + \eta\,\Delta\mathbf{W}_{\mathrm{GC}}, \tag{3.15}$$

where $\Delta\mathbf{W}_{\mathrm{BSS}}$ and $\Delta\mathbf{W}_{\mathrm{GC}}$ are the update values from the BSS algorithm and the geometric
constraint, respectively. Here $\mu$ is the step size and $\eta$ is a parameter that controls the
importance of the constraint relative to the BSS. To force $\mathbf{w}(k) = [W_{11}(k)\;\; W_{21}(k)]^{\mathrm{T}}$
to have a spatial null towards a target direction $\theta$, a constraint was set with a steering
vector $\mathbf{d}(\theta,k)$ to the target direction as follows:

$$\mathbf{d}^{\mathrm{H}}(\theta,k)\,\mathbf{w}(k) = 0, \quad \text{where } \mathbf{d}(\theta,k) = \big[e^{-j 2\pi k \Delta_\theta / L},\;\; 1\big]^{\mathrm{T}}, \qquad \Delta_\theta = \frac{d\,\sin\theta}{c}\,f_s. \tag{3.16}$$

Here $c$ is the sound velocity, $d$ is the distance between the sensors, and $f_s$ is the
sampling rate. The cost function of this condition is

$$\mathcal{J}_{\mathrm{GC}}\big(\mathbf{w}(k)\big) = \big[\mathbf{d}^{\mathrm{H}}(\theta,k)\,\mathbf{w}(k)\big]\big[\mathbf{d}^{\mathrm{H}}(\theta,k)\,\mathbf{w}(k)\big]^{\mathrm{H}}, \tag{3.17}$$
where $w_{21}(k)$ is fixed to a pure delay and the constraint is applied to $W_{11}$ by setting
$\Delta\mathbf{W}_{\mathrm{GC}}$ as the first derivative of (3.17) with respect to $W_{11}$. The steering vector in the
constraint of the conventional algorithm represents only the time delay of arrival from
the target direction. By applying the cost function (3.17), the estimated RTF $W_{11}$ is biased
towards a delayed delta function. The true RTF in the time domain is not a pure time delay;
it has many peaks, as shown in Figure 3.2 (a).
Figure 3.2 PTDRs in the time-domain RTF:
(a) ratio of the input target signals in the time domain (true RTF of the target signal), (b) RTF estimated in the time domain by the conventional algorithm, and (c) RTF estimated in the time domain by the proposed algorithm
Despite their important role in characterizing the RTF, these peaks are smoothed out by conventional algorithms, as shown in Figure 3.2 (b).
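As a sanity check of the relation between the demixing weight and the RTF, the sketch below builds random FIR responses for hypothetical transfer functions (all names, lengths, and the delay value are illustrative assumptions, not values from the thesis), fixes $W_{21}$ to a negated pure delay, forms the ideal demixing weight per (3.13), and verifies that undoing the delay term in $W_{11}$ recovers the RTF $H_{12}/H_{11}$ as in (3.14):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 64      # filter length (hypothetical)
tau = 8     # pure time delay fixed in w21, in samples (hypothetical)

# Hypothetical decaying random impulse responses from the target to both sensors.
decay = np.exp(-0.2 * np.arange(L))
h11 = rng.standard_normal(L) * decay
h12 = rng.standard_normal(L) * decay
H11, H12 = np.fft.fft(h11), np.fft.fft(h12)

k = np.arange(L)
W21 = -np.exp(-2j * np.pi * k * tau / L)   # pure delay with a negative sign
W11 = -(H12 / H11) * W21                   # ideal demixing weight per (3.13)

# Undoing the delay term in W11 recovers the target RTF, per (3.14).
rtf_est = W11 * np.exp(2j * np.pi * k * tau / L)
```

In practice the BSS weights only approximate this ideal solution, which is exactly why the constraint-induced bias towards a delayed delta function matters.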
3.2.2. Utilization of the peaks in time-domain RTF (PTDR)
A previous work [29] presents an effective algorithm for more accurate RTF estimation that improves on directional-BSS-based RTF estimation. The peaks in the time-domain RTF (PTDR) are used as a feature of the RTF. A non-negative matrix factorization (NMF) based algorithm is employed to estimate the PTDRs that correspond to the target source. A semi-NMF algorithm [30], a variant of NMF that extends its applicable range to negative data, is used. As a result, the peak-smoothing effect is overcome by replacing the smoothed peaks with PTDR estimates. The input signal ratio is written as

$$\frac{X_2(k)}{X_1(k)} = \frac{S_1(k)H_{12}(k) + S_2(k)H_{22}(k) + \cdots}{S_1(k)H_{11}(k) + S_2(k)H_{21}(k) + \cdots}. \quad (3.18)$$
As the dominant input signal changes due to the nonstationarity of the signals, the input signal ratio in some frames can take a value close to the RTF of the dominant signal source:

$$\frac{X_2(k)}{X_1(k)} \approx \frac{H_{12}(k)}{H_{11}(k)} \quad \text{when } S_1(k) \gg S_2(k), S_3(k), \ldots \quad (3.19)$$

When there is no dominant source, the input signal ratio has unpredictable values due to
the independence of each source. The peaks of the input signal ratio are assumed to be a weighted sum of the PTDRs from each sound source and unpredictable peaks. The peaks of the input signal ratios in each frame are stacked in a matrix $\mathbf{V}$ and fed into the semi-NMF algorithm as follows:

$$\mathbf{V} = [\mathbf{v}(1) \cdots \mathbf{v}(\ell) \cdots]^{\mathrm{T}}, \quad v_t(\ell) = \begin{cases} r_t(\ell), & |r_t(\ell)| > \varepsilon \text{ and } t \le T \\ 0, & \text{otherwise,} \end{cases} \quad (3.20)$$

where $r_t(\ell)$ is the $t$-th sample of the inverse discrete Fourier transform of the input signal ratio for the $\ell$-th frame, $\varepsilon$ is a threshold used to find peaks, and $T$ is a value used to decide the range of the PTDR search. The columns with synchronized fluctuation in the matrix $\mathbf{V}$ can be grouped into different bases:
$$\mathbf{V} \approx \mathbf{B}\mathbf{G}, \quad (3.21)$$

where the columns of $\mathbf{B}$ are the basis vectors and the rows of $\mathbf{G}$ are the corresponding activation weights, with $I$ as the number of bases. If only the direct path from the target direction is considered, the RTF in the time domain is a delayed delta function with a peak at the position corresponding to the time difference of arrival. By choosing, among the resulting bases of the semi-NMF, the single basis that has the largest value at this position, the PTDRs having synchronized fluctuations with the peak from the direct path can be obtained as the estimated target-signal PTDRs.
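The two computational steps above can be sketched as follows: peak extraction per (3.20), and a minimal semi-NMF in the style of Ding et al. [30], written here as V ≈ F Gᵀ with nonnegative G and unconstrained F. Function names, iteration counts, and the toy usage are my own choices, not taken from the thesis:

```python
import numpy as np

def extract_peaks(ratio_frames, eps_thr, T):
    """Build the peak matrix of (3.20): IDFT each frame's input signal ratio,
    keep samples inside the search range T whose magnitude exceeds eps_thr."""
    v = np.fft.ifft(ratio_frames, axis=1).real[:, :T]
    return np.where(np.abs(v) > eps_thr, v, 0.0)

def semi_nmf(V, n_basis, n_iter=300, eps=1e-9):
    """Minimal semi-NMF: V ~ F @ G.T with G >= 0, F unconstrained,
    using the standard multiplicative updates."""
    rng = np.random.default_rng(1)
    G = rng.random((V.shape[1], n_basis)) + eps
    for _ in range(n_iter):
        F = V @ G @ np.linalg.pinv(G.T @ G)          # exact least-squares step
        A, B = V.T @ F, F.T @ F
        Ap, An = (np.abs(A) + A) / 2, (np.abs(A) - A) / 2
        Bp, Bn = (np.abs(B) + B) / 2, (np.abs(B) - B) / 2
        G *= np.sqrt((Ap + G @ Bn) / (An + G @ Bp + eps))
    return F, G
```

In the thesis's setting the rows of V would be the thresholded peak vectors, and the basis with the largest value at the direct-path delay position would be selected as the target-signal PTDR basis.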

3.3. Experiments
To examine the effect of using PTDRs on RTF estimation, semi-NMF-based PTDR separation is conducted alongside RTF estimation with the conventional directional BSS. As each RTF estimation completes, the separated PTDRs are inserted into the resulting RTF estimate to compensate for the smoothing effect of the conventional algorithm. The experiments were conducted using speech (3 male and 3 female subjects) as the target and interfering sources, and the performance was evaluated by averaging the results over three separate experiments. The room impulse response was simulated, as shown in Figure 3.3.
Figure 3.3 Spatial configuration. Room dimensions: 10 m × 10 m × 3 m. Sensor positions: (5 m, 4.97 m, 2 m) and (5 m, 5.03 m, 2 m). Source–sensor distance: 1 m. The target speech, noise, and interference sources are indicated in the figure.

For each experiment, a 10-s mixed signal was used. BSS weights were updated every 8 frames of 1024 samples obtained at a 16-kHz sampling rate. The parameters were set as $T = 400$, $\eta = 0.5$, $\mu = 0.05$, and $\varepsilon = 5$. For experiment 2, the
sound of operating machinery in a factory was used as a noise source with power equal to that of the interference. For experiments 3 and 4, white noise was added to each microphone as diffuse background noise. The SNR was set to 0 dB excluding the diffuse noise in each experiment. When diffuse noise was included, it was normalized to a 10-dB SNR and added to the 0-dB SNR mixed signal. Performance was evaluated using the target speech suppression gain and the normalized squared error (NSE) between the estimated RTFs and the ideal RTF [22]. The speech suppression gain is defined as

$$\mathrm{Gain} = \frac{1}{2}\sum_{m=1}^{2} 10\log_{10}\frac{\sigma_{s,m}^2}{\sigma_{e}^2}, \quad (3.22)$$

where $\sigma_{s,m}^2$ and $\sigma_{e}^2$ denote the power of the target signal component at the $m$-th sensor and the power of the leakage signal, respectively. The leakage signal $e(n)$ is calculated as in (3.12), using the resulting BSS weights $w_{11}(n)$ and $w_{21}(n)$ and the target signal. A larger suppression gain indicates a more accurate RTF estimate.

$$e(n) = s_1(n) * h_{11}(n) * w_{11}(n) + s_1(n) * h_{12}(n) * w_{21}(n) \quad (3.23)$$
The NSE is calculated as

$$\mathrm{NSE} = 10\log_{10}\frac{\sum_{t=0}^{L_h-1}\big(\hat{h}(t) - h(t)\big)^2}{\sum_{t=0}^{L_h-1} h(t)^2}, \quad (3.24)$$

where $\hat{h}(t)$ is the estimated RTF and $h(t)$ is the true RTF of the target signal in the time domain.
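The two metrics can be sketched directly from (3.22) and (3.24); the function names and the toy signals in the checks are assumptions of this sketch, not from the thesis:

```python
import numpy as np

def suppression_gain_db(target_mc, leakage):
    """Target speech suppression gain per (3.22), averaged over the sensors:
    10*log10(target power at sensor m / leakage power)."""
    p_leak = np.mean(np.asarray(leakage) ** 2)
    gains = [10 * np.log10(np.mean(np.asarray(x) ** 2) / p_leak)
             for x in target_mc]
    return float(np.mean(gains))

def nse_db(h_est, h_true):
    """Normalized squared error between estimated and true time-domain RTFs,
    per (3.24), in dB (more negative = better)."""
    h_est, h_true = np.asarray(h_est), np.asarray(h_true)
    return float(10 * np.log10(np.sum((h_est - h_true) ** 2)
                               / np.sum(h_true ** 2)))
```

An accurate RTF estimate leaves little target leakage, so a larger gain and a more negative NSE both indicate better estimation, as reported in Table 3.1.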
Table 3.1 RTF estimation performance in terms of NSE and target speech suppression gain.

Idx  Noise component              (s)    Target speech suppression gain (dB)    NSE
                                         Conventional   Proposed                Conventional   Proposed
1    Interference                 0.25   12.8           13.7                    -7.0           -9.5
2    Interference+Noise           0.25   11.7           12.4                    -6.9           -9.6
3    Interference+Noise+Diffuse   0.25   10.7           11.8                    -6.9           -9.3
4    Interference+Noise+Diffuse   0.40   7.6            7.9                     -5.2           -5.9
3.4. Conclusions
Directional BSS is highly efficient in estimating the RTF because it requires only
the target direction, unlike other algorithms that require additional detectors. However,
the geometric constraint in this algorithm has a side effect where the estimated RTF has

smoothed peaks. In this work, the PTDR was effectively used to overcome this side effect. The PTDRs corresponding to the target source were separated from the other peaks by employing the semi-NMF algorithm. By using the PTDR estimates, the performance of directional-BSS-based RTF estimation was shown to improve in terms of speech suppression gain and normalized squared error.

Chapter 4. Neural Network-based Approaches
4.1. Issues and related works
After the breakthrough in training deep architectures presented by Hinton et al., speech enhancement techniques including noise reduction, separation, feature compensation, and dereverberation have been developed in the deep learning framework. A simple yet effective method for the monaural case is to estimate speech presence in each time–frequency bin of the noisy spectrogram. This was formulated as a binary classification problem for estimating the ideal binary mask (IBM) in [29] and shown to improve speech intelligibility. However, due to the binary nature of the mask, the improvement in speech quality is limited. This method was later adopted for the multichannel case in the time–frequency mask-based approach that has recently been successful with multichannel signals. Another successful method for the monaural case is the stacked denoising auto-encoder (SDA) based approach. Inspired by successful results on monaural speech [31]–[33], many approaches have tried to use multichannel signals in an SDA framework. The SDA can be trained to reconstruct the desired speech from the noisy input signal. The main idea behind this structure is to use DNNs to model the relationship between clean and noisy features.
Multichannel feature-based methods [34]–[37] have explored features that capture spatial information in addition to spectral information. The newly proposed features are usually fed to the NN along with conventional acoustic features. There are also studies that attempt to train a network directly on multichannel waveforms, such as [38], [39]. These approaches force the NN to perform the whole enhancement process, including modeling the relationship between channels, and thus require large amounts of training data. Furthermore, high dependence on the training data may cause performance degradation in unseen spatial configurations. Deep learning can cooperate with beamformers in several ways, and various methods are still being studied. Two representative methods that are considered the most successful are described below.
4.1.1. Neural Network-based mask estimation in GEV beamformer
Among the various approaches proposed to extend the applicable scope of deep
learning to multichannel signals, the recently proposed NN-GEV beamforming approach
[40], which combines deep learning-based timefrequency mask estimation and a GEV
beamformer, has achieved success in noise-robust ASR challenges. The success of NN-
GEV is due not only to its improved performance, but also to the simple and effective
application of deep learning. Since modeling is performed on each single-channel spectrum, this system allows the NN training steps to be conducted without considering the geometric configuration of the microphones. In a noise-aware scenario, data-driven approaches are applicable [40], [41] for estimating the spectral masks for speech and noise, and the estimated masks can be used to estimate the signal statistics required by a beamformer. In an NN-GEV system, two masks are estimated, one of them to indicate which time–frequency bins are presumably dominated by speech, and the other

to indicate which bins are dominated by noise. The estimated masks are used to estimate the cross-PSD matrix for each frequency as follows:

$$\Phi_{\nu\nu}(k) = \frac{\sum_{\ell=1}^{L} M_{\nu}(\ell,k)\,\mathbf{Y}(\ell,k)\mathbf{Y}^{\mathrm{H}}(\ell,k)}{\sum_{\ell=1}^{L} M_{\nu}(\ell,k)}, \quad \nu \in \{X, N\}, \quad (4.1)$$

where $M_X$ and $M_N$ are the estimated masks for speech and noise, respectively, and $\mathbf{Y}(\ell,k)$ is a vector of input noisy signals in which each element is one channel signal in the STFT domain for frame index $\ell$ and frequency index $k$.
Figure 4.1 Structure of NN-based mask estimation in GEV beamformer
In [40], it is assumed that the target speech is prevalent in $\Phi_{XX}(k)$, whereas noise is prevalent in $\Phi_{NN}(k)$, for each utterance, so the PSDs are calculated per utterance. The IBM for speech is set as

$$M_X(\ell,k) = \begin{cases} 1, & \dfrac{\|X(\ell,k)\|}{\|N(\ell,k)\|} > 10^{\,\mathrm{th}(k)} \\ 0, & \text{otherwise,} \end{cases} \quad (4.2)$$

and the NN is trained to estimate this mask from a noisy spectrum. The mask for noise can be calculated from the speech mask as

$$M_N(\ell,k) = 1 - M_X(\ell,k), \quad (4.3)$$

or estimated by the NN in the same way as the speech mask. In the latter case, the number of nodes in the output layer of the NN is doubled to generate both speech and noise masks at the same time. The GEV beamformer [16] is obtained by maximizing the SNR for each frequency bin as

$$\mathbf{F}_{\mathrm{GEV}}(k) = \underset{\mathbf{F}}{\arg\max}\;\frac{\mathbf{F}^{\mathrm{H}}(k)\Phi_{XX}(k)\mathbf{F}(k)}{\mathbf{F}^{\mathrm{H}}(k)\Phi_{NN}(k)\mathbf{F}(k)}. \quad (4.4)$$
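A compact sketch of this back end, assuming estimated masks are already available: mask-weighted PSD estimation per (4.1), and (4.4) solved per frequency bin as the principal eigenvector of $\Phi_{NN}^{-1}\Phi_{XX}$. Function names and the diagonal loading are my additions:

```python
import numpy as np

def masked_psd(Y, mask):
    """Mask-weighted cross-PSD matrices per (4.1).
    Y: (L, K, C) complex STFT frames, mask: (L, K). Returns (K, C, C)."""
    num = np.einsum('lk,lkc,lkd->kcd', mask, Y, Y.conj())
    return num / np.maximum(mask.sum(axis=0), 1e-8)[:, None, None]

def gev_weights(phi_xx, phi_nn, loading=1e-9):
    """GEV beamformer per (4.4): per-bin principal eigenvector of
    inv(Phi_NN) @ Phi_XX (diagonal loading keeps Phi_NN invertible)."""
    K, C, _ = phi_xx.shape
    W = np.zeros((K, C), dtype=complex)
    for k in range(K):
        mat = np.linalg.solve(phi_nn[k] + loading * np.eye(C), phi_xx[k])
        vals, vecs = np.linalg.eig(mat)
        W[k] = vecs[:, np.argmax(vals.real)]
    return W

# Toy check: speech arriving in phase at both mics against white noise; the
# beamformer should steer towards the in-phase direction.
phi_xx = np.array([np.outer([1.0, 1.0], [1.0, 1.0])], dtype=complex)
phi_nn = np.array([np.eye(2)], dtype=complex)
W = gev_weights(phi_xx, phi_nn)
```

Note that the GEV solution is only defined up to a per-bin scale and phase; in practice [40] applies a separate normalization to reduce the resulting distortion.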
4.1.2. A Fully deep learning-based Approach
Typical beamformers often use separate modules for each step. First, spatial information acquisition is performed, i.e., source localization in [12], [13] or RTF estimation in [14], [15]. After that, a linear filter, which is what the word "beamformer" generally denotes, is applied. In addition to this linear filter, a further filter with an arbitrary structure may be applied to improve the enhancement performance or to compensate for artefacts caused by the linear filter. This structure leaves room for optimization when the goal is to improve ASR performance, because the filter used in enhancement is calculated independently of the acoustic model of the ASR system. In an early effort to optimize the enhancement system and the acoustic model jointly, structures such as likelihood-maximizing beamforming were proposed [42]. As the NN-based acoustic model has recently become mainstream in ASR, these approaches have continued by optimizing the NN-based acoustic model jointly with the preceding enhancement module. In [43], the NN for time–frequency mask estimation in NN-GEV [40] is jointly optimized with the NN for acoustic modeling in ASR. Even though this system keeps using a linear filter calculated from a traditional optimization criterion (i.e., MVDR), the training algorithm is called "end-to-end" training. This naming seems reasonable, since the NN-based mask estimation comes at the front end of the enhancement process and is jointly trained with the acoustic model that follows the enhancement process.
A more unified system, better suited to the name "end-to-end", in which all steps between the input multichannel signal and the acoustic model for ASR are connected by one NN, was presented in [44]. This system takes the multichannel raw time-domain waveform directly and performs the speech enhancement operations, including spatial filtering and frequency decomposition, within its NN layers. This approach has been shown to be effective in utilizing the multichannel input and learning the desired spatial selectivity, and it seems quite promising given that deep learning has recently overtaken existing methods in a variety of fields. In [44], much of the information from multiple microphones is exploited by the lower layers, while the deeper layers perform an operation similar to that applied in the conventional single-channel case. The structure of the

deeper layers is known as the convolutional, long short-term memory, deep NN (CLDNN) [44], [45], and is shown in Figure 4.2.
Figure 4.2 Time convolution and CLDNN layers
When the CLDNN is applied to a single channel [45], the first layer in Figure 4.2 is a time-convolutional layer over the raw time-domain waveform, which can be thought of as a finite-impulse-response filterbank followed by a nonlinearity. By placing multiple filters in the time-convolution layer, this layer can approximate acoustic filterbanks that perform spectral filtering. The output of this layer is subsequently treated as a time–frequency representation of the input signal. The frame-level outputs from the first layer are fed to a following convolutional layer to model the underlying relationship between adjacent frequency bands. The output of the frequency convolution is passed to a stack of three LSTM
layers, which model the signal across long time scales. Finally, a single fully connected
DNN layer is applied to estimate the context-dependent state of the acoustic model in the
ASR system. When the CLDNN is applied to the multichannel signal, i.e., in the fully deep learning-based beamforming system, the lower layers are modified to use the spatial information encoded in the input signals. The structural changes are mainly in the lower layers, especially in the first layer [44], [46]. The modified structure for the multichannel signal is shown in Figure 4.3. The first layer, denoted tConv1, mimics filter-and-sum beamforming: it filters the signal from each microphone with a finite impulse response filter and sums the results. This process can be written as follows:

$$y^{1}[t,p] = \sum_{c=0}^{C-1}\sum_{n=0}^{N_1-1} h^{1}_{c}[n,p]\,x_{c}[t-n], \quad \text{for } p = 1,\ldots,P,\; P = 10. \quad (4.5)$$

Here $N_1$ is the number of taps in the filter $h^{1}_{c}[n,p]$ and is set to $N_1 = 80$ at a sampling rate of 16 kHz (5 ms). The filtering is performed on the raw input waveform $x_c[t]$, where $c$ is the input channel index. Equation (4.5) can be interpreted as a delay-and-sum beamformer applied to the $C$ channel signals when the filter $h^{1}_{c}[n,p]$ is set to a delayed impulse. If $C = 2$ is assumed, the look direction of this delay-and-sum beamformer is specified by the index $p$ and the corresponding time delay of arrival, determined by the relative delay along the first dimension between $h^{1}_{1}$ and $h^{1}_{2}$.
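A minimal sketch of (4.5), with the filters initialized as delayed impulses so that tConv1 acts as a delay-and-sum beamformer. The per-direction delay grid and signal lengths are illustrative assumptions:

```python
import numpy as np

C, P, N1 = 2, 10, 80                      # channels, look directions, taps
h1 = np.zeros((C, N1, P))
for p in range(P):                        # hypothetical delay grid: p samples
    h1[0, p, p] = 1.0                     # delay channel 1 by p samples
    h1[1, 0, p] = 1.0                     # channel 2 passes through

def tconv1(x, h1):
    """Multichannel time convolution of (4.5): filter each channel per look
    direction, then sum over the channels."""
    C, N1, P = h1.shape
    T = x.shape[1]
    y = np.zeros((T, P))
    for p in range(P):
        for c in range(C):
            y[:, p] += np.convolve(x[c], h1[c, :, p])[:T]
    return y

# With a 3-sample inter-channel delay, the matched look direction (p = 3)
# sums coherently and therefore carries the most output energy.
rng = np.random.default_rng(0)
s = rng.standard_normal(200)
x = np.stack([s, np.concatenate([np.zeros(3), s[:-3]])])
y = tconv1(x, h1)
best_p = int(np.argmax((y ** 2).sum(axis=0)))
```

Once the filters are trained rather than fixed, this energy-versus-direction interpretation no longer strictly holds, as the text notes.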

Figure 4.3 Factored multichannel raw waveform CLDNN architecture for P look directions [46].

This structure is called "factored" because this layer is expected to perform only spatial filtering, while the ability to perform spectral filtering is factored out to the next layer. The unfactored structure, in which spatial and spectral filtering are performed in one layer, differs from the factored one only in the length of the first dimension of $h^{1}_{c}[n,p]$, which is set to a longer filter (e.g., $N_1 = 400$). The factoring effect achieved by changing the filter length, and the improvement in performance realized by factoring, are shown in [46]. $h^{1}_{c}[n,p]$ can be either fixed or trained, and the experimental results in [46] show that training it improves performance compared to keeping it fixed. Note that by training the tConv1 layer, $h^{1}_{c}[n,p]$, which is initialized as a delayed impulse, can take an arbitrary shape, so that tConv1 is no longer a delay-and-sum beamformer. The output of tConv1 has shape $M \times 1 \times P$, where the first and last dimensions correspond to the sample index within a small window ($M = 560$) and the look-direction index, respectively. The second layer is also a time-convolution layer and is named tConv2. The time-convolution process can be written as follows:

$$\mathrm{tConv2}[t,p,f] = \sum_{n=0}^{N_2-1} h^{2}[n,f]\,y^{1}[t-n,p], \quad \text{for } p = 1,\ldots,P,\; f = 1,\ldots,F,\; P = 10,\; F = 128. \quad (4.6)$$

This can be interpreted as decomposing a time signal into $F$ different time signals, each of which corresponds to a different frequency band. The decomposition is repeated for each of the $P$ look directions using the same convolution filters. This layer's filters are

denoted $h^{2} \in \mathbb{R}^{N_2 \times 1 \times F}$, where the dimensions correspond to the taps (the sample index in the time-convolution filter), a look-direction dimension of 1 to indicate sharing across the $P$ directions, and the frequency band index. In the convolution, no zero-padding is performed at the edges of the signal (i.e., "valid" padding), so the output has shape $\mathrm{tConv2} \in \mathbb{R}^{(M-N_2+1) \times P \times F}$. The filter size $N_2$ is set larger than $N_1$ (e.g., $N_2 = 400$) to encourage this layer to perform spectral filtering with a better frequency resolution than the first layer. Next, the maximum of every 160 samples is pooled along the first dimension of tConv2. As a result, $z^{1}$ has shape $1 \times P \times F$. By selecting one sample per short interval, the short-time information, i.e., changes in the characteristics of the input signal within the range of 160 samples, is discarded. After the pooling, a rectified nonlinearity followed by a stabilized logarithm compression, meaning a logarithm with a small additive offset, $\log(\cdot + 0.01)$, is applied to generate the output:

$$z^{\mathrm{nonlinear}} = \log\!\big(\mathrm{ReLU}(z^{1}) + 0.01\big). \quad (4.7)$$
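The shapes flowing through this front end can be traced with a short sketch of (4.6)–(4.7); for simplicity the pooling here collapses the whole remaining time axis to one value per band and direction, which for M = 560 and N2 = 400 (161 remaining samples, pooling size 160) closely matches the behavior described above:

```python
import numpy as np

M, P, N2, F = 560, 10, 400, 128           # window, directions, taps, bands
rng = np.random.default_rng(0)
y1 = rng.standard_normal((M, P))          # tConv1 output for one window
h2 = rng.standard_normal((N2, F))         # spectral filters shared over P

# tConv2 per (4.6): 'valid' time convolution, shared across look directions.
tconv2 = np.zeros((M - N2 + 1, P, F))     # (161, 10, 128)
for f in range(F):
    for p in range(P):
        tconv2[:, p, f] = np.convolve(y1[:, p], h2[:, f], mode='valid')

# Pooling collapses the remaining time axis, then ReLU and the stabilized
# log of (4.7) produce one feature value per look direction and band.
pooled = tconv2.max(axis=0)               # (10, 128)
z = np.log(np.maximum(pooled, 0.0) + 0.01)
```

The resulting $P \times F$ map is what the subsequent CLDNN layers consume.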
The output of the nonlinear layer is then passed to the CLDNN. First, the fConv layer, a convolutional NN, applies frequency convolution. The frequency-convolution process can be written as

$$\mathrm{fConv}[p,j,c] = \sum_{f=0}^{F_w-1} w_{c}[f]\,z^{\mathrm{nonlinear}}[p, j+f], \quad (4.8)$$

for $p = 1,\ldots,P$, $c = 1,\ldots,C_f$, and $j = 1,\ldots,F-F_w+1$, with $P = 10$, $C_f = 256$, $F = 128$, and $F_w = 8$.
Nonoverlapping max pooling along frequency, with a pooling size of 3, is applied to $\mathrm{fConv}[p,j,c]$, and as a result the range of the frequency index in the second dimension is reduced to $j = 1,\ldots,\lfloor(F-F_w+1)/3\rfloor$. At this point, the output produces $P \times C_f \times \lfloor(F-F_w+1)/3\rfloor$ feature values for each frame. To reduce this large feature to a vector of affordable size, a 256-dimensional linear low-rank layer is applied: the three-dimensional feature is rearranged into a vector and an affine projection is applied to produce a vector of 256 values. The output of this step is passed to three LSTM layers with 832 cells and a 512-unit projection layer. Finally, one DNN layer with 1,024 hidden units is applied.
4.2. Motivation
The fully-DNN-based approach relies on the NN's modeling capability to solve these problems: the modeling of the relationship between the input signals, both spatially and spectrally, is done entirely by the network. Among the recent fully-DNN-based schemes, there has been an attempt to specify where in the network spatial and spectral filtering should be performed [46]. However, in that study, control over the NN amounted to observing where selectivity emerged along the spatial or spectral axis after adjusting the width of the convolutional layer. Due to the nature of their structure, most fully-DNN-based approaches have an unclear boundary between the utilization of spatial information and of spectral information. Therefore, to generalize to various spatial configurations, a training database covering all the corresponding diversity is necessary. Training data that generalizes the type, direction, and spatial characteristics of noise must consist of multichannel data, and collecting such data is much harder than collecting single-channel data. Moreover, feeding the multichannel signals to a large network [36], [47] usually makes the network too flexible for its parameters to be trained reliably.
In most existing studies, the constraints applied to the evaluation data were decided according to the characteristics of the proposed algorithms, making it difficult to compare their performances. Representative examples of such constraints include the presence of an RTF estimation interval and the assumption of a certain degree of noise stationarity. The NN-GEV method, on the other hand, has proved its effectiveness by performing well in a noise-robust ASR competition [48]–[51], where performance comparison is possible under identical conditions. Because this competition assumes a realistic noise environment, achieving high ranks there is meaningful. The NN-GEV method has the advantage of being easy to implement irrespective of the sensor configuration. In addition, there is no need to implement a separate sound source localization system, and voice activity detection is handled by the NN already included in the system. Despite these advantages, there is still room for improvement.

The method of applying the NN to each channel independently has the advantage of not depending on the microphone positions. On the other hand, this property leads to the disadvantage that the NN is not used at all for spatial information modeling. In realistic situations, where noise and desired speech naturally occupy the same time–frequency range, estimating time–frequency masks based only on signal intensity may yield insufficient performance. Moreover, signal intensity depends strongly on signal characteristics, so the time–frequency mask estimation is likely to be unstable when unseen noise is encountered. To deal with this problem, it is necessary to overcome the limitation of using only single-channel signal intensity for deep learning; the new method should be able to use multichannel phase information and its temporal characteristics. An NN topology that can handle such complex relationships between the phase information of the signal components is proposed in this section. Unlike the fully-DNN-based methods, the proposed method calculates the beamformer weights based on the MVDR criterion and limits the role of the NN to the estimation of the target-signal and noise-signal statistics. This reduces the degrees of freedom of the NN weights and the size of the required training database.

4.3. Proposed system with phase-encoded input for NN-based mask
estimation
4.3.1. Representing Phase Information in Real Numbers
Single-channel speech enhancement systems usually enhance only the magnitude spectrum and reuse the noisy phase during signal reconstruction. This is based on the belief that phase enhancement does not lead to significant improvements in equivalent SNR [52]. However, the spatial diversity of a multichannel signal is typically given as the TDOA of the time signal, which corresponds to a phase shift in the time–frequency domain. The inadequacy of deep learning-based multichannel systems compared to single-channel systems is partially attributed to the need to model phase information, whereas NNs have traditionally processed real-valued physical data with real-valued weights. Efforts to model the spatial information in a multichannel signal are directly or indirectly connected to modeling the phase information of complex numbers. There has been research on complex-valued NNs; however, their appropriateness is still a controversial issue [53]. The most typical method with a real-valued NN is to represent a complex-valued time–frequency domain signal by two channels containing the real and imaginary components respectively, forcing the NN to approximate complex algebra to decode the phase difference between the two signals [54]. A front-end structure for the NN is designed in this section to effectively transfer the phase information of the input signal to the real-valued NN. The relation of the two

complex numbers corresponding to the two-channel signal in the STFT domain, $X_1(\ell,k)$ and $X_2(\ell,k)$, for the $\ell$-th frame and the $k$-th frequency index, is expanded to a real-valued vector $\mathbf{z}(\ell,k,n)$, producing an NN-friendly representation of magnitude and phase difference as follows:

$$z(\ell,k,n) = |X_2(\ell,k)|\cos\!\left(\frac{2\pi n}{N} + \angle X_1(\ell,k) - \angle X_2(\ell,k)\right), \quad n = 1,\ldots,N,\; N = 20. \quad (4.9)$$
This process results in a three-dimensional extension of the existing two-dimensional time–frequency information. It avoids phase-related modeling on information represented by separate real and imaginary values, which would require computations such as inner products and angular differences. In addition, the proposed feature is characterized by the fact that its value reflects changes of phase and intensity in real values at once. As a result, the distance due to the phase difference also decreases when the intensity is low. This is effective in resolving the instability of phase-only features such as the interaural phase difference (IPD), which is frequently observed in low-signal-intensity regions. The extracted feature for each time–frequency bin, which is an $N$-dimensional vector, is fed to feed-forward layers that transform the features into a space that makes the output easier to model. The feed-forward weights are shared across several groups of adjacent frequency bands to reduce the number of parameters to be trained. The NN topology used to evaluate the effectiveness of the proposed front-end structure is shown in Figure 4.4. The frequency bands were uniformly divided into 10 groups. The outputs from the feed-forward layers are fed to deeper layers that have the same topology as [40].
Figure 4.4 NN topology in the proposed NN-based mask estimation system
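One plausible realization of the phase-encoded front-end feature is sketched below; the exact offset grid $2\pi n/N$ is my reconstruction of (4.9) and should be treated as an assumption:

```python
import numpy as np

def phase_encoded_feature(X1, X2, N=20):
    """Expand each two-channel STFT bin into N real values that couple the
    magnitude of channel 2 with the inter-channel phase difference.
    The sampling grid 2*pi*n/N is an assumption of this sketch."""
    ipd = np.angle(X1) - np.angle(X2)        # inter-channel phase difference
    n = np.arange(1, N + 1)
    # Shape (L, K, N); quiet bins automatically get near-zero feature values,
    # avoiding the instability of pure-IPD features at low intensity.
    return np.abs(X2)[..., None] * np.cos(2 * np.pi * n / N + ipd[..., None])

# Identical channels: the phase-difference term vanishes and the feature is a
# pure cosine scaled by the magnitude.
X = np.array([[1.0 + 1.0j]])
feat = phase_encoded_feature(X, X)
```

Because the magnitude multiplies the cosine, the feature distance induced by a given phase difference shrinks with intensity, matching the motivation given in the text.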

4.3.2. Bidirectional Long Short-Term Memory (BLSTM) layer
The output of the front-end stage is fed into a BLSTM layer, as in the mask estimation process in [40]. The bidirectional structure [55] is obtained by training two LSTMs simultaneously in the positive and negative time directions; at the output, the two results are concatenated before being fed to deeper layers. The LSTM structure [56] is able to learn to store information over time intervals. It was proposed in [57] to overcome the problem of decaying error back-flow in RNNs, which makes them hard to apply when learning to store information over extended time intervals. The LSTM can learn to bridge time intervals much longer than an RNN can, without losing short time-lag capabilities. The basic unit of an LSTM network is the memory block, which can contain one or more memory cells. Three adaptive, multiplicative gating units, the input, forget, and output gates, are shared by all cells in the block. Each memory cell stores its current state at its core and has a recurrently self-connected linear unit called the constant error carousel (CEC). The structure is shown in Figure 4.5. The input, forget, and output gates can be trained to decide what information to store in the memory and when to read it out. The gate activations are calculated using the current input from the lower layer and the previous output of the memory cell. In this system, the output of the front-end layers, i.e., the activation of the lower layer, is fed to the LSTM layer. The output of the front-end layer is given as an $N_{\mathrm{front3}}$-dimensional vector $\mathbf{z}^{\mathrm{front3}}(k)$ for the $k$-th frame.

Figure 4.5 LSTM memory block with one cell (rectangle)
Similarly, the cell state of the previous frame is denoted $\mathbf{z}^{\mathrm{cstate}}(k-1)$. The activations of the input ($\mathbf{i}(k)$), forget ($\mathbf{f}(k)$), and output ($\mathbf{o}(k)$) gates are calculated using (4.10) for each frame index $k$:

$$\mathbf{i}(k) = \mathrm{sigmoid}\big(\mathbf{W}_{iz}\mathbf{z}^{\mathrm{front3}}(k) + \mathbf{W}_{ic}\mathbf{z}^{\mathrm{cstate}}(k-1) + \mathbf{b}_i\big)$$
$$\mathbf{f}(k) = \mathrm{sigmoid}\big(\mathbf{W}_{fz}\mathbf{z}^{\mathrm{front3}}(k) + \mathbf{W}_{fc}\mathbf{z}^{\mathrm{cstate}}(k-1) + \mathbf{b}_f\big)$$
$$\mathbf{o}(k) = \mathrm{sigmoid}\big(\mathbf{W}_{oz}\mathbf{z}^{\mathrm{front3}}(k) + \mathbf{W}_{oc}\mathbf{z}^{\mathrm{cstate}}(k-1) + \mathbf{b}_o\big) \quad (4.10)$$

In the argument of each sigmoid function, the second term represents the connection between the gate and the cell state; this connection is named the peephole connection [57]. The peephole connections allow the gates to use the current cell state even when the output gate is closed. $\mathbf{b}_i$, $\mathbf{b}_f$, and $\mathbf{b}_o$ are the bias values of the sigmoid functions.

The cell state for the current frame is then calculated by adding the gated cell input to the previous state, which is gated by the forget gate:

$$\mathbf{z}^{\mathrm{cstate}}(k) = \mathbf{i}(k) \odot \tanh\big(\mathbf{W}_{cz}\mathbf{z}^{\mathrm{front3}}(k) + \mathbf{W}_{cc}\mathbf{z}^{\mathrm{cout}}(k-1) + \mathbf{b}_c\big) + \mathbf{f}(k) \odot \mathbf{z}^{\mathrm{cstate}}(k-1). \quad (4.11)$$

Here, $\odot$ denotes gating conducted by element-wise multiplication of vectors. The cell output is finally calculated by multiplying the cell state by the output gate activation:

$$\mathbf{z}^{\mathrm{cout}}(k) = \mathbf{o}(k) \odot \tanh\big(\mathbf{z}^{\mathrm{cstate}}(k)\big). \quad (4.12)$$

The output of the cell is fed to fully connected DNN layers to estimate the output labels.
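One step of the memory block in Figure 4.5 can be sketched as follows. Parameter names and sizes are mine; the gates see the current input and, through peepholes, the previous cell state, while the cell input also sees the previous cell output, following the form of (4.10)-(4.12):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(z_in, c_prev, h_prev, p):
    """One LSTM memory-block step with peephole connections (a sketch)."""
    i = sigmoid(p['Wi'] @ z_in + p['pi'] * c_prev + p['bi'])   # input gate
    f = sigmoid(p['Wf'] @ z_in + p['pf'] * c_prev + p['bf'])   # forget gate
    o = sigmoid(p['Wo'] @ z_in + p['po'] * c_prev + p['bo'])   # output gate
    g = np.tanh(p['Wc'] @ z_in + p['Rc'] @ h_prev + p['bc'])   # cell input
    c = i * g + f * c_prev                                     # CEC update
    h = o * np.tanh(c)                                         # cell output
    return h, c

# Tiny example: 3-dimensional input, 2 cells, small random parameters.
rng = np.random.default_rng(0)
p = {k: 0.1 * rng.standard_normal((2, 3)) for k in ('Wi', 'Wf', 'Wo', 'Wc')}
p['Rc'] = 0.1 * rng.standard_normal((2, 2))
for k in ('pi', 'pf', 'po', 'bi', 'bf', 'bo', 'bc'):
    p[k] = 0.1 * rng.standard_normal(2)
h, c = lstm_step(np.ones(3), np.zeros(2), np.zeros(2), p)
```

The additive form of the cell-state update is what keeps the error flow from decaying, which is the property the text attributes to the CEC.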
4.4. The ASR stage in the proposed system
4.4.1. Acoustic model structure
For ASR performance assessment, one of the recently proposed ASR systems in the Kaldi speech recognition toolkit [58] is used. This system adopts the time-delay NN architecture, which models long-term temporal dependencies. Because the time-delay NN contains only feed-forward layers, it can easily be parallelized, unlike RNNs, whose parallelization is hampered by the dependencies between the time frames processed during training. Even though it consists of feed-forward layers only, the time-delay NN can model long-term temporal dependencies in short-term speech features by processing a wider temporal context. To achieve this, a sufficient number of frames must be fed to the NN at a time to model the temporal relationships present in speech, which increases the number of parameters to be trained. In the time-delay NN structure, the initial layer learns an affine transform that covers a small window of frames; the same affine transform is then applied repeatedly to successive windows. After this operation has been performed over a sufficient span, the activations obtained from each window are fed to the next layer in a form similar to the time frames at the initial layer. The structure is shown in Figure 4.6. In this structure, the deeper layers process hidden activations from a wider temporal context, while each layer keeps a relatively small number of parameters by sharing the affine transform weights across windows. In a typical time-delay NN, the window of each layer is constructed at every time step, which results in large overlaps between the windows computed at neighboring time steps.

Figure 4.6 Computation in TDNN with subsampling (red) and without subsampling
(blue + red) in [59]
In [59], subsampling based on the assumption that neighboring activations are correlated is proposed. Subsampling is performed by omitting the connections that cover the central frames of each window. As a result, gaps exist between the frames fed to the next layer, unlike in the typical time-delay NN, which splices together contiguous temporal windows of frames. Table 4.1 shows the subsampling configurations and compares them with a typical time-delay NN. For example, {-1, 2} denotes splicing the frames at relative time steps -1 and 2, while [-1, 2] means splicing the frames corresponding to -1, 0, 1, and 2.
Table 4.1 Context specification of the time-delay NN in [59]

layer   Input context   Input context with subsampling
1       [-2, 2]         [-2, 2]
2       [-1, 2]         {-1, 2}
3       [-3, 3]         {-3, 3}
4       [-7, 2]         {-7, 2}
5       {0}             {0}
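Since the per-layer offsets of successive layers simply add, the total input context the network consumes can be computed from Table 4.1. The sketch below checks that subsampling leaves the overall receptive field unchanged at [-13, 9] frames:

```python
# Per-layer splicing offsets from Table 4.1: full contexts on the left,
# subsampled contexts on the right.
full = [[-2, -1, 0, 1, 2],
        [-1, 0, 1, 2],
        [-3, -2, -1, 0, 1, 2, 3],
        [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2],
        [0]]
sub = [[-2, -1, 0, 1, 2], [-1, 2], [-3, 3], [-7, 2], [0]]

def total_context(layers):
    """Offsets of successive layers add, so the overall receptive field is
    the sum of the per-layer extremes."""
    return sum(min(l) for l in layers), sum(max(l) for l in layers)
```

Subsampling thus only thins out which intermediate activations are computed; the span of input frames seen by the output is identical in both configurations.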
By using the subsampling scheme, the computational load of both the forward pass and backpropagation is reduced. Moreover, the number of parameters in the hidden layers is significantly reduced. The p-norm nonlinearity [60], which is a dimension-reducing nonlinearity, is used as the activation of each layer output. The p-norm nonlinearity is proposed in [60] as a generalization of the maxout unit and is defined as

y = \| \mathbf{x} \|_p = \left( \sum_i |x_i|^p \right)^{1/p}, (4.13)

where the value of p is configurable.
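Eq. (4.13) reduces each group of inputs to a single output; a minimal numpy sketch (the group size and function name are illustrative assumptions) is:

```python
import numpy as np

def pnorm(x, group_size, p=2):
    """Dimension-reducing p-norm unit (Eq. 4.13): split the layer output into
    non-overlapping groups and replace each group by its p-norm."""
    groups = np.asarray(x, dtype=float).reshape(-1, group_size)
    return (np.abs(groups) ** p).sum(axis=1) ** (1.0 / p)
```

With p = 2 and group size 2 this halves the dimensionality, which is why the layer is called dimension-reducing.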

4.4.2. Input feature
Mel-frequency cepstral coefficients (MFCCs) are extracted from each frame, with a frame length of 35 ms and a shift of 10 ms at a 16 kHz sampling rate. The MFCC features are fed to the NN without the adaptation or mean normalization that are usually required in ASR systems. Instead of applying feature compensation at the front end of the ASR system, an i-vector estimated in a left-to-right manner, i.e., using only the input frames up to each time step, is fed to the NN together with the MFCC features. This is based on the idea that the i-vector gives the NN as much as it needs to know about the speaker properties.
4.4.3. Training Database for ASR system
To ensure that the proposed system can be used for a general-purpose ASR system, the ASR system is trained on a clean speech database. LibriSpeech [61] is used as the training database; models built on it are known to achieve higher performance on the standard WSJ test set than models built on the Wall Street Journal (WSJ) corpus itself. The corpus is derived from audiobooks that are part of the LibriVox project and contains 1,000 hours of speech sampled at 16 kHz. For convenience, the training data is provided divided into several subsets, but the entire training set is used in this experiment. The speakers in the corpus were ranked according to WER and divided roughly in the middle, with the lower-WER speakers designated as "clean" and the higher-WER speakers designated as "other." The development set was formed by drawing 20 male and 20 female speakers from the "clean" pool at random. As a result, the dev set contains about 5 h 20 min of speech, with approximately 8 min per speaker.
4.5. Experiments
4.5.1. Database
The performance of the proposed system was assessed with the database released for the 2016 CHiME challenge [9], held to promote research at the interface of signal processing and automatic speech recognition. This challenge targets automatic speech recognition performance in a real-world, commercially motivated scenario: a person talking to a tablet device that has been fitted with a six-channel microphone array. The configuration of the recording device is shown in Figure 4.7. Recordings were made using an array of six omnidirectional microphones mounted in holes drilled through a custom-built frame surrounding a tablet computer. The frame is designed to be held in a landscape orientation and has three microphones spaced along both the top and bottom edges. All microphones face forward (i.e., towards the speaker holding the tablet) except the top-center microphone (mic 2), which faces backwards. Two types of noisy speech data, real and simulated, are provided for each noise environment. The real data consists of sentences from the WSJ0
corpus spoken live in the environments.
Figure 4.7 The microphone array geometry in the CHiME challenge 2016 [9]. All
microphones face forward except microphone 2.
The simulated data was generated by simulating six-channel utterances from single-channel clean sources and mixing them with background recordings of the environments. For ASR evaluation, the data is divided into training, development, and test sets. The scenario of this challenge is ASR for a multi-microphone tablet device used in everyday environments. Four varied environmental noise types were recorded: cafe (CAF), street junction (STR), public transport (BUS), and pedestrian area (PED). The speech recordings consist of utterances made by 12 US English talkers (6 male and 6 female) ranging in age from approximately 20 to 50 years. Recordings were made first in an acoustically isolated (but not anechoic) booth (BTH) and then in each of the four noisy target environments. About 100 sentences were recorded in each location. The speakers
were asked to use the tablet in whatever way felt natural and comfortable; the talker-to-tablet distance varied but was typically around 40 cm. Utterances were recorded at 48 kHz and downsampled to 16 kHz and 16 bits. Multiple sentences were recorded continuously (embedded), and for each continuous recording session an annotation file recorded the start and end time of each utterance with a precision of approximately 100 ms. Isolated utterances were extracted from the continuous audio according to the annotated start and end times, with 300 ms of padding prior to the utterance included during the extraction.
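The extraction step above can be sketched as follows; the function name and argument layout are our own assumptions, not part of the CHiME tooling:

```python
def extract_utterance(audio, start_s, end_s, fs=16000, pad_s=0.3):
    """Cut one utterance out of a continuous recording using the annotated
    start/end times, including pad_s seconds of padding before the utterance."""
    start = max(0, int(round((start_s - pad_s) * fs)))
    end = min(len(audio), int(round(end_s * fs)))
    return audio[start:end]
```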
In the case of simulated development or test data, the noise source is obtained by removing the speech component from recorded noisy speech, while in the case of simulated training data, a separately recorded noise background is used. The speech component in each microphone is estimated by applying an estimated impulse response between the tablet microphones and a close-talking recording; the close-talking recording is acquired simultaneously with a headset microphone while the noisy speech is recorded. In the impulse response estimation process, the microphone signals are represented in the complex-valued STFT domain using half-overlapping 256-sample sine windows. After partitioning the time frames into variable-length, half-overlapping, sine-windowed blocks such that the amount of speech is similar in each block, the per-block STFT-domain IRs between the close-talking microphone and the other microphones are estimated in the least-squares sense in each frequency bin [62]. The SNR used for mixing the speech and noise sources is also estimated using the speech component estimated from the real recorded noisy speech. The estimated SNRs had an average of
approximately 5 dB. To generate the simulated speech source in the simulation process, a time-varying steering vector that models the direct sound between the speaker and the microphones is convolved with a clean speech signal. In the case of training data, the clean speech signal is taken from the original WSJ0 recordings, while in the case of development and test data, it is taken from the booth recordings. In either case, the simulated speech signal is rescaled such that the SNR matches that of the real recording. The steering vector for each utterance in the real recordings is tracked using the steered response power phase transform (SRP-PHAT) algorithm. To estimate the steering vector, the signals are represented in the complex-valued STFT domain using half-overlapping sine windows of 1024 samples. The spatial position of the speaker is encoded by a nonlinear SRP-PHAT pseudo-spectrum for each frame. To stabilize the estimated position, the peaks of the SRP-PHAT pseudo-spectrum are then tracked over time using the Viterbi algorithm. The participants of the challenge were provided with baseline systems for front-end signal enhancement and state-of-the-art GMM/DNN-based ASR. Time-varying MVDR beamforming with diagonal loading [63] was selected as the baseline front-end signal enhancement. In the GMM version of the baseline ASR system, the MFCC feature is used. First, MFCCs of order 13 are extracted for each frame, and three frames each of left and right context are concatenated to form a 91-dimensional feature vector. The concatenated features are compressed to 40 dimensions using linear discriminant analysis (LDA). After that, maximum likelihood linear transformation (MLLT) and feature-space maximum likelihood linear regression (fMLLR) with speaker-adaptive training (SAT) are applied. In acoustic modeling, 2500 tied triphone HMM states are
modeled by a total of 15,000 Gaussians. In the DNN-HMM hybrid version of the baseline ASR system, a feed-forward NN with 7 layers and 2048 units per hidden layer is used as the acoustic model. Five frames of MFCC features are fed to the NN at a time. Pretraining using restricted Boltzmann machines, cross-entropy training, and sequence-discriminative training using the state-level minimum Bayes risk (sMBR) criterion are used in the training procedure.
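The splicing step of the baseline GMM front end (13-dim MFCCs with three frames of context on each side, giving 91 dimensions before the LDA projection) can be sketched as follows; the edge-padding choice and the function name are our assumptions:

```python
import numpy as np

def splice(feats, context=3):
    """Concatenate each frame with `context` frames of left and right context
    (13-dim MFCCs with context=3 give 91-dim vectors); edges are padded by
    repeating the first/last frame."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    return np.stack([padded[t:t + width].reshape(-1)
                     for t in range(feats.shape[0])])
```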
4.5.2. Experimental settings
The performance of the proposed system was evaluated in terms of WER against two types of baseline systems. Speech enhancement is performed before ASR with a frame size and shift of 640 and 160 samples, respectively. It is not easy to compare the proposed method with the existing method on the same basis, since the proposed system requires more NN layers and consequently more parameters to be trained. To make a fair comparison, an evaluation using another approach, which represents the STFT values as two channels containing the real and imaginary components respectively (Baseline2), is performed along with the conventional NN-GEV system in [40] (Baseline1). In Baseline2, each frame of STFT values is fed to one fully connected layer with 624 nodes, and the output of this layer is fed to the BLSTM and deep layers that have the same structure as Baseline1. Since the input features, consisting of the real and imaginary parts of the STFT values, are fed to the NN after concatenating the input channels, the dimension of the input feature
is 1284. As a result, even when only one hidden layer is used, the parameters to be trained in Baseline2 far outnumber those in the proposed system. The proposed and Baseline2 systems are implemented for two-channel input signals, and the Baseline1 system is also evaluated using only two-channel input signals. The reasons for this are as follows. In the NN-GEV system proposed in [40] (Baseline1), the training of the NN can be conducted without considering the geometric configuration of the microphones, by applying deep learning-based modeling to each channel separately. This ensures the versatility of the system when applied to an unseen microphone structure. To maintain this advantage, the evaluation of the proposed system was also performed on an unseen microphone structure, by excluding the microphone pairs to be evaluated from the training data. More specifically, the input signal pairs corresponding to the microphone index pairs 1-3, 1-4, 1-5, 1-6, 3-4, 3-5, 3-6, and 4-6 are used in the training step, and the input signal pairs corresponding to microphone index pair 4-5 are used in the evaluation. The IBM label of the second channel index is used in training in this two-channel setting. The training and test sets are also subdivided into two types. In the noise-aware set, data from all four noise environments are used in training, and the same noise types are used in the evaluation. In the unseen-noise set, the NN is trained using the data from all but one of the four noise environments, and the evaluation is performed using the excluded noise type. By performing the evaluation in an unseen noise environment, the ability to generalize over various noise types can be examined; this generalization is a representative advantage of using spatial information.

4.5.3. Results
Note that the evaluation is performed in mismatched conditions in terms of noise
and RIRs, on the assumption that speech enhancement is performed without prior
knowledge of the environment.
Figure 4.8 WERs for both the real and simulated, development and test sets. Models
are trained on clean data and tested either before or after enhancement.
Enhancement combines two channels (mic idx 4, 5).

In the noise-aware experiment case, performance changes are hardly noticeable, confirming that the existing system (Baseline1) yields an almost saturated performance with the time-frequency mask-based structure. The performance changes obtained by using spatial information can be found in the evaluation results for the unseen noise environments. In the unseen-noise case, the proposed system shows improved performance, while the Baseline2 system shows no significant change. The robust performance in the unseen noise environments can be interpreted as a result of using additional information, i.e., spatial information. In addition, the evaluation results for Baseline2, which uses the same multichannel information, indicate that the proposed structure succeeds in modeling spatial information effectively compared with a simple concatenation of real and imaginary parts.
4.6. Conclusions
In this chapter, the recently proposed NN-GEV-based system was investigated. To improve the estimation of the second-order statistics of the desired speech and noise signals, the use of spatial information from the multichannel signal was proposed. In the proposed system, a feature derived to effectively transfer the phase information to the NN was introduced. By applying the proposed input layer structure, the effectiveness of deep learning for exploiting the spatial information embedded in the phase differences between the signals was demonstrated.

Chapter 5. Neural Network-based postfilter
5.1. Issues and related works
The SDA is one type of deep NN and can be trained to reconstruct the desired speech from a noisy input signal. The main idea behind SDAs is to use DNNs to model the relationship between clean and noisy features. Given the outstanding performance of SDAs in modeling acoustic characteristics, extending the applicable scope of deep learning to multichannel signals is being extensively studied in noise-robust ASR research. Successful results in DNN-based monaural speech enhancement have also inspired approaches that use a DNN as a single-channel noise reduction algorithm on the enhanced output of a beamformer, i.e., as a postfilter [38]. In this structure, the beamformer can perform spatial filtering before spectro-temporal filtering is performed by the DNN-based postfilter.
5.2. New Generalized Sidelobe Canceller with Denoising Auto-
Encoder for Improved Speech Enhancement
5.2.1. Motivation
The direct application of a DNN to noisy signals is problematic because the DNN is required to model several kinds of information, such as spatial information and channel differences as well as spectro-temporal characteristics, at the same time. In particular, modeling spatial information is fundamentally different from modeling the distribution of signal intensity along the frequency axis. Hence, modeling with one NN is likely to be inadequate in terms of training efficiency. As an alternative to the direct approach, a conventional beamformer can be introduced prior to the SDA. The spatial information used by the beamformer, in the form of ratios of acoustic transfer functions, i.e., RTFs, is characterized by the path between the speaker and each microphone. However, the modeling ability of the SDA is limited when it is applied to single-channel audio signals.
5.2.2. New Generalized Sidelobe Canceller to generate multichannel feature
An approach proposed in previous research is thus to have the beamformer exploit the spatial information in its processing all the way up to the final stage, in which the signals are combined. This final stage is replaced by the SDA, which removes any distortion or noise created by the beamforming process. It is assumed that the beamformer works on each frame instead of a whole utterance, so that the proposed algorithm can be applied to a real-time speech enhancement system. As estimating noise statistics before observing the entire utterance is difficult, this approach estimates them frame by frame, eliminating the need to model noise in advance. Since no prior information on the noise statistics is required, restricting the noise to predefined types can also be avoided. Thus, a GSC, which can estimate noise statistics adaptively and can be implemented using only the direction of the target speech, is used.
A standard GSC consists of a fixed beamformer (FBF), a blocking matrix (BM), and an adaptive noise canceller (ANC). The input to the m-th microphone can be expressed as

X_m(l,k) = H_m(l,k) S(l,k) + N_m(l,k), (5.1)

where S(l,k) is the target speech, and X_m(l,k) and N_m(l,k) are the received signal and the noise at the m-th sensor in the short-time Fourier transform domain, respectively; l is the frame index and k is the frequency bin index. The target speech undergoes filtering by an RIR prior to being captured by the sensors. The filtering corresponding to the m-th sensor is modeled by the finite impulse response H_m(l,k). In [2], the FBF \mathbf{W}_{\mathrm{FBF}}(l,k) is designed to project the input signals onto the subspace spanned by the RTF \tilde{\mathbf{H}}(l,k),

\mathbf{W}_{\mathrm{FBF}}(l,k) = \tilde{\mathbf{H}}(l,k) \left( \tilde{\mathbf{H}}^{H}(l,k) \tilde{\mathbf{H}}(l,k) \right)^{-1},
where \tilde{\mathbf{H}}(l,k) = \mathbf{H}(l,k) / H_1(l,k),
and \mathbf{H}(l,k) = [H_1(l,k) \cdots H_M(l,k)]^{T}, (5.2)

by setting the speech signal picked up by the first sensor, H_1(l,k) S(l,k), as the desired signal.
The BM is designed to project the input signals onto the orthogonal complement of the target speech RTF. The generalized eigenvector blocking matrix (GEVBM) [15], which can be calculated using the max-SNR beamformer \mathbf{F}_{\mathrm{SNR}}(l,k), is

\mathbf{B}^{H}(l,k) = \mathbf{I} - \frac{\mathbf{\Phi}_{NN}(l,k)\, \mathbf{F}_{\mathrm{SNR}}(l,k)\, \mathbf{F}_{\mathrm{SNR}}^{H}(l,k)}{\mathbf{F}_{\mathrm{SNR}}^{H}(l,k)\, \mathbf{\Phi}_{NN}(l,k)\, \mathbf{F}_{\mathrm{SNR}}(l,k)}. (5.3)

\mathbf{F}_{\mathrm{SNR}}(l,k) can be calculated as the eigenvector belonging to the largest eigenvalue of \mathbf{\Phi}_{NN}^{-1}(l,k) \mathbf{\Phi}_{XX}(l,k) during the RTF estimation period, i.e., when the noise is stationary. \mathbf{\Phi}_{NN}(l,k) and \mathbf{\Phi}_{XX}(l,k) are the PSDs of the noise source and the input signals, respectively. \mathbf{F}_{\mathrm{SNR}}(l,k) equals \mathbf{\Phi}_{NN}^{-1}(l,k) \mathbf{H}(l,k) multiplied by an arbitrary complex constant, so that the RTF can be calculated from it. However, in realistic conditions, a satisfactory estimate of \mathbf{F}_{\mathrm{SNR}}(l,k) is usually not feasible because the noise is nonstationary. In this case, the RTF is usually simplified as a pure time delay, and \mathbf{H}(l,k) is replaced with a delayed impulse response. The ANC filter weight \mathbf{G}(l,k) can be updated with an unconstrained adaptation to minimize the power of the output signal Y(l,k),

Y(l,k) = \mathbf{W}_{\mathrm{FBF}}^{H}(l,k) \mathbf{X}(l,k) - \mathbf{G}^{H}(l,k) \mathbf{U}(l,k),
where \mathbf{U}(l,k) = \mathbf{B}^{H}(l,k) \mathbf{X}(l,k)
and \mathbf{X}(l,k) = [X_1(l,k) \cdots X_M(l,k)]^{T}. (5.4)
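The per-bin GSC processing of Eqs. (5.1)-(5.4) can be sketched in numpy as follows. This is a simplified illustration, not the thesis implementation: a projection blocking matrix that nulls the RTF is used as a stand-in for the GEVBM of Eq. (5.3), and all names are our own:

```python
import numpy as np

def gsc_output(x, h_tilde, g):
    """One frequency bin of a GSC (sketch of Eqs. (5.2)-(5.4)).
    x: (M,) input X(l,k); h_tilde: (M,) RTF; g: (M,) ANC weights.
    A simple projection blocking matrix stands in for the GEVBM."""
    h = h_tilde.reshape(-1, 1)
    w_fbf = (h @ np.linalg.inv(h.conj().T @ h)).ravel()    # Eq. (5.2)
    M = len(x)
    B_H = np.eye(M) - (h @ h.conj().T) / (h.conj().T @ h)  # nulls the target RTF
    u = B_H @ x                                            # noise reference U(l,k)
    y = w_fbf.conj() @ x - g.conj() @ u                    # Eq. (5.4)
    return y, u
```

For a purely target-driven input x = h_tilde * s, the blocking matrix yields u = 0 and the FBF passes the target with unit gain, which is the orthogonality property discussed next.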
The noise reference signal \mathbf{U}(l,k) is the output of the BM; thus, it is orthogonal to the target speech RTF. Owing to this orthogonality, the target speech components in the output of the FBF are not supposed to be affected by the subtraction of the estimated noise, i.e., the output of the ANC filter. However, the assumption of perfect orthogonality between \mathbf{U}(l,k) and the target speech components in the output of the FBF is seldom met in real-life situations, for many reasons, e.g., the leakage signal in the blocking matrix. The distortion of the target signal components in the enhanced signal is a major issue in beamformer-based algorithms, and many approaches, including the use of an SDA as a postfilter, have been attempted. In this work, a GSC structure that enables the effective adoption of an SDA is proposed to deal with the distortion in the output of the beamformer.
The proposed system is illustrated in Figure 5.1.
Figure 5.1 The structure of the proposed GSC and SDA
In the proposed system, each individual ANC filter is adapted separately to minimize the power of each input signal after the noise component estimate is removed:

J_m(l,k) = \left| X_m(l,k) - \mathbf{G}_m^{H}(l,k) \mathbf{U}(l,k) \right|^{2}, (5.5)

where \mathbf{G}_m(l,k) is the ANC coefficient vector corresponding to the m-th channel. The FBF takes the enhanced channel signals from the separated ANC filters and compensates the RTFs to generate the multichannel output features:

Z_m(l,k) = W_{\mathrm{FBF},m}^{*}(l,k) \left( X_m(l,k) - \mathbf{G}_m^{H}(l,k) \mathbf{U}(l,k) \right), (5.6)

where W_{\mathrm{FBF},m}(l,k) is the m-th channel component of the fixed beamformer. Note that
the proposed structure has the same filter coefficients as a conventional GSC if the output signals are summed into one channel. The distortion caused by imperfect noise cancellation lies, for each frame, in a multichannel spectral domain, i.e., a two-dimensional space with a channel axis and a frequency axis. The SDA is expected to model the underlying relationship between the distortion and adjacent frequency bins in other frequencies and other channels.
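A sketch of the per-channel outputs of Eqs. (5.5)-(5.6); the array shapes and names are our assumptions, and the ANC adaptation itself is omitted:

```python
import numpy as np

def proposed_gsc_features(x, w_fbf, u, G):
    """Multichannel output features of the proposed GSC (Eq. 5.6).
    x: (M,) inputs X_m(l,k); w_fbf: (M,) FBF weights; u: noise reference
    U(l,k); G: (M, len(u)) ANC weights, row m adapted separately per channel."""
    return np.array([w_fbf[m].conj() * (x[m] - G[m].conj() @ u)
                     for m in range(len(x))])
```

Each Z_m keeps its own channel, so the SDA receives a channel-by-frequency map instead of a single beamformed spectrum; summing the Z_m recovers a conventional single-channel GSC output.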
5.3. Experiments
To evaluate the performance of the proposed system, speech enhancement is conducted in a manner similar to [64], as depicted in Figure 5.1. Note that the use of mask estimation before the beamformer and more sophisticated SDA structures are not considered, because these improvements can be applied to both the conventional and the proposed systems. This experiment aims to assess the effectiveness of the proposed algorithm in its most typical configuration. The proposed system uses the outputs of the proposed GSC as the multichannel information and their summation as the beamformed signal. In the baseline systems, the output of the conventional GSC is used as the beamformed signal, and the noisy input signal itself (GSC-NOISY) [64] or the GSC-IPD [65] is used as the multichannel information. The IPD is calculated for each channel except the first using

\mathrm{IPD}_m(l,k) = \cos\left( \angle X_1(l,k) - \angle X_m(l,k) \right), \quad m = 2, \ldots, M, (5.7)

to generate information for M - 1 channels. These two forms of multichannel information are selected as conventional schemes for comparison because, as in the proposed method, they can be extracted without prior knowledge of the noise characteristics and used in real-time speech enhancement systems in the form of frame-by-frame feature extraction. To assess the advantages of using multichannel information, these systems are also compared with a single-channel SDA (GSC-ONLY), which uses only the conventional GSC output without multichannel information. Input signals are fed into the DNN in the form of 511-dimensional log power spectra for each frame. The window size and frame shift are set to 1024 and 512 samples, respectively, at a 16-kHz sampling rate. To evaluate the performance, six-channel data from CHiME [4] is
used. This database provides noise recorded in a cafe, on the street, on a bus, and in a pedestrian area. The recordings were acquired using an array of six microphones mounted in holes drilled through a frame surrounding a tablet computer, with three microphones spaced along both the top and bottom edges. The estimated tablet-microphone SNRs had an average of approximately 5 dB. To control target speech detection errors and spatial information estimation errors, which are beyond the scope of this study, clean target speech is required. Therefore, a simulated data set in which noise and speech are separately available is used. The speech components in the simulated data are generated by applying impulse responses to booth recordings, as in [4]. The RTF estimated from real recorded data is used in the simulation. In the beamforming step, an impulse response of pure time delay, calculated using the estimated time delay of the target signals, is used instead of the RTF. This simulates realistic situations with nonstationary noise, in which a satisfactory estimate of the noise RTF is usually not feasible. Time delay estimation is conducted using the steered response power phase transform (SRP-PHAT) algorithm used as the baseline in [4]. The development (dt) data are used to decide the number of iterations in SDA training, and the evaluation (et) data are used in the evaluation. The evaluation results are shown in Table 5.1. The signal-to-distortion ratio (SDR), which is defined as an energy ratio criterion [66], and the short-time objective intelligibility (STOI) measure described in [67] are used to measure speech enhancement performance. The word error rate in ASR is scored with an acoustic model trained on a clean database; the LibriSpeech database [61] is used in a time-delay NN-based ASR system [59].
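The IPD feature of Eq. (5.7) used by the GSC-IPD baseline reduces to a one-line numpy computation; the function name is our own:

```python
import numpy as np

def ipd_features(X):
    """Cosine inter-channel phase-difference features of Eq. (5.7).
    X: (M, K) complex STFT frame; returns (M-1, K) features for channels 2..M."""
    return np.cos(np.angle(X[0:1]) - np.angle(X[1:]))
```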

Table 5.1 Evaluation results for speech enhancement and WERs

Mult. Info.    SDR      STOI    WER (%)
Noisy input    -0.694   0.674   84.43
GSC-ONLY        7.915   0.835   30.57
GSC-NOISY       7.320   0.837   27.13
GSC-IPD         7.445   0.835   26.74
Proposed        8.687   0.856   20.83
Note that ASR evaluation is performed in mismatched conditions in terms of noise and
RIRs on the assumption that speech enhancement is performed without prior knowledge
of the environment. Evaluation results show that the proposed method consistently
outperforms the conventional methods. Note that the STOI score is expected to have a
monotonic relation with subjective speech intelligibility, where a higher value denotes
more intelligible speech.
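As a rough illustration of the energy-ratio idea behind the SDR (a simplification of the BSS-Eval criterion of [66], not the official implementation; all names are ours):

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified energy-ratio SDR in dB: project the estimate onto the
    reference and compare the target energy with the residual (distortion)
    energy."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    alpha = np.dot(reference, estimate) / np.dot(reference, reference)
    target = alpha * reference          # scaled projection onto the reference
    distortion = estimate - target      # everything the projection cannot explain
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))
```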
5.4. Conclusions
In the proposed system, the GSC exploits spatial information and generates multichannel enhanced signals on which the following SDA can act. As a result, the SDA can take advantage of the multiple channels by modeling the underlying relationship between the distortion and adjacent frequency bins in other frequencies and other channels. The main improvement in the proposed system concerns the change in the expected role of the SDA. In conventional systems that extract multichannel features from noisy signals, the SDA needs to handle a large amount of information, including the noise characteristics, the spatial information, and the relation between the desired speech and noise signals. In the proposed system, the beamformer exploits the spatial information and compensates for the differences in the transfer functions of each channel while removing the noise components. As a result, the modeling capability of the SDA can be concentrated on removing the artefacts caused by the beamformer. The evaluation results demonstrate that using the outputs of the proposed GSC structure as the input to the SDA is effective in improving noise reduction and speech recognition performance.

Chapter 6. Conclusions and Future Works
6.1. Conclusions
Although multichannel noise cancellation has been studied for decades to improve speech recognition performance, implementing a system that allows smooth conversation in everyday life is still difficult. In this dissertation, approaches to improving existing methods were proposed after analyzing the shortcomings of those methods at each stage. These stages comprise the estimation of the RTF, the estimation of signal statistics using deep learning, and the distortion compensation of the enhanced signal. The proposed methods mainly concern the recent application of deep learning to beamforming.

In the RTF estimation stage, a practical scenario was considered in which the noise naturally occupies the same time-frequency range as the desired speech. The shortcomings of the existing directional BSS-based algorithm were analyzed, and the use of the semi-NMF algorithm to estimate the PTDR, proposed in a previous work, was introduced. The proposed method effectively overcame the peak-smoothing side effect of the existing method on the time-domain RTF by recovering the peaks. The accuracy improvement in RTF estimation was assessed in terms of speech suppression gain and normalized squared error.
In the estimation of the second-order statistics of the desired speech signal, the recently proposed NN-GEV beamforming approach was exploited. The phase information of the input signal was emphasized, and a deep learning-based system was proposed to model it effectively. In addition, a method was proposed to effectively transfer the phase information of the input signal to the NN, which traditionally processes real-valued physical data and relies on real-valued weights. By effectively encoding the phase information in real-valued input features, the weakness of the existing method, in which deep learning is applied to each single-channel signal separately, was overcome.

In the distortion compensation of the enhanced signal, a structural change of the beamformer proposed in a previous work was introduced. This structural change allowed the following NN to take advantage of the multiple channels by modeling the underlying relationship between the distortion and adjacent frequency bins in other frequencies and other channels. This resulted in improved performance by overcoming the defect of existing deep learning-based postfilter methods, in which the modeling ability of the NN is limited to enhanced single-channel speech.

6.2. Future Works
To improve multichannel speech enhancement, the following topics of interest can be considered in the future:
1) Investigation of DNN generalization over the spatial configuration of multichannel signals
2) Development of DNN-based time-varying RTF modeling in practical noise environments
3) Development of a generative adversarial network-based far-field database
4) Development of fully DNN-based modeling of the second-order statistics of speech signals

Bibliography
[1] N. Q. K. Duong, E. Vincent, and R. Gribonval, "Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1830-1840, Sep. 2010.
[2] A. Ozerov and C. Fevotte, "Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 550-563, Mar. 2010.
[3] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, "New formulations and efficient algorithms for multichannel NMF," IEEE Workshop Appl. Signal Process. Audio Acoust., pp. 153-156, Oct. 2011.
[4] O. Yilmaz and S. Rickard, "Blind Separation of Speech Mixtures via Time-Frequency Masking," IEEE Trans. Signal Process., vol. 52, no. 7, pp. 1830-1847, Jul. 2004.
[5] S. Araki et al., "A novel blind source separation method with observation vector clustering," Proc. Int. Workshop Acoust. Echo Noise Control, pp. 117-120, 2005.
[6] T. Van den Bogaert, S. Doclo, J. Wouters, and M. Moonen, "Speech enhancement with multichannel Wiener filter techniques in multimicrophone binaural hearing aids," J. Acoust. Soc. Am., vol. 125, no. 1, pp. 360-371, 2009.
[7] B. Cornelis, M. Moonen, and J. Wouters, "A VAD-robust multichannel Wiener filter algorithm for noise reduction in hearing aids," Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 4, pp. 281-284, May 2011.
[8] J. Capon, "High-resolution frequency-wavenumber spectrum analysis," Proc. IEEE, vol. 57, no. 8, pp. 1408-1418, 1969.

[9] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines," Proc. 2015 IEEE Workshop Autom. Speech Recognit. Understanding (ASRU), pp. 504-511, 2016.
[10] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614-1626, 2001.
[11] D. H. Brandwood, "A complex gradient operator and its application in adaptive array theory," IEE Proceedings F - Communications, Radar and Signal Processing, vol. 130, no. 1, pp. 11-16, 1983.
[12] L. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. Antennas Propag., vol. 30, no. 1, pp. 27-34, Jan. 1982.
[13] O. L. Frost, III, "An algorithm for linearly constrained adaptive array processing," Proc. IEEE, vol. 60, no. 8, pp. 926-935, 1972.
[14] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614-1626, 2001.
[15] A. Krueger, E. Warsitz, and R. Haeb-Umbach, "Speech Enhancement With a GSC-Like Structure Employing Eigenvector-Based Transfer Function Ratios Estimation," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 206-219, Jan. 2011.
[16] E. Warsitz and R. Haeb-Umbach, "Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1529-1539, Jul. 2007.
[17] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614-1626, 2001.
[18] E. Georganti, T. May, S. van de Par, and J. Mourjopoulos, "Sound Source Distance Estimation in Rooms based on Statistical Properties of Binaural Signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 8, pp. 1727-1741, Aug. 2013.
[19] S. Vesa, "Binaural Sound Source Distance Learning in Rooms," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 8, pp. 1498–1507, Nov. 2009.
[20] P. Smaragdis and P. Boufounos, "Position and Trajectory Learning for Microphone Arrays," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 1, pp. 358–368, Jan. 2007.
[21] I. Cohen, "Relative Transfer Function Identification Using Speech Signals," IEEE Trans. Speech Audio Process., vol. 12, no. 5, pp. 451–459, Sep. 2004.
[22] Y. Zheng, K. Reindl, and W. Kellermann, "Analysis of dual-channel ICA-based blocking matrix for improved noise estimation," EURASIP J. Adv. Signal Process., vol. 2014, pp. 1–24, 2014.
[23] H. Buchner, R. Aichner, and W. Kellermann, "The TRINICON framework for adaptive MIMO signal processing with focus on the generic Sylvester constraint," in Proc. ITG Conf. Speech Commun., Aachen, Germany, 2008, pp. 8–11.
[24] K. Matsuoka, M. Ohoya, and M. Kawamoto, "A neural net for blind separation of nonstationary signals," Neural Networks, vol. 8, no. 3, pp. 411–419, Jan. 1995.
[25] H. Buchner, R. Aichner, and W. Kellermann, "A generalization of a class of blind source separation algorithms for convolutive mixtures," in Proc. ICA, 2003, pp. 945–950.
[26] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Advances in Neural Information Processing Systems, 1996, pp. 757–763.
[27] R. Aichner, H. Buchner, F. Yan, and W. Kellermann, "A real-time blind source separation scheme and its application to reverberant and noisy acoustic environments," Signal Processing, vol. 86, no. 6, pp. 1260–1277, Jun. 2006.
[28] Y. Zheng, K. Reindl, and W. Kellermann, "BSS for improved interference estimation for blind speech signal extraction with two microphones," in 2009 3rd IEEE Int. Workshop Comput. Adv. Multi-Sensor Adapt. Process. (CAMSAP), 2009, pp. 253–256.
[29] Y. Wang and D. Wang, "Towards Scaling Up Classification-Based Speech Separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Jul. 2013.
[30] C. Ding, T. Li, and M. I. Jordan, "Convex and semi-nonnegative matrix factorizations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 45–55, 2010.
[31] F. Weninger, S. Watanabe, Y. Tachioka, and B. Schuller, "Deep recurrent de-noising auto-encoder and blind de-reverberation for reverberated speech recognition," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4623–4627.
[32] A. L. Maas et al., "Recurrent Neural Networks for Noise Reduction in Robust ASR," in Proc. INTERSPEECH, 2012.
[33] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Deep learning for monaural speech separation," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1562–1566.
[34] S. Araki, T. Hayashi, M. Delcroix, M. Fujimoto, K. Takeda, and T. Nakatani, "Exploring multi-channel features for denoising-autoencoder-based speech enhancement," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 116–120.
[35] S. Renals and P. Swietojanski, "Neural networks for distant speech recognition," in 2014 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), 2014, pp. 172–176.
[36] Y. Liu, P. Zhang, and T. Hain, "Using neural network front-ends on far field multiple microphones based speech recognition," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 5542–5546.

[37] A. Narayanan and D. Wang, "Joint noise adaptive training for robust automatic speech recognition," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 2504–2508.
[38] S. Sivasankaran et al., "Robust ASR using neural network based speech enhancement and feature simulation," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 482–489.
[39] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4580–4584.
[40] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 196–200.
[41] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5210–5214.
[42] IEEE Signal Processing Society, IEEE Transactions on Speech and Audio Processing. Institute of Electrical and Electronics Engineers, 1993.
[43] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb-Umbach, "Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5325–5329.
[44] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, M. Bacchiani, and A. Senior, "Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 30–36.
[45] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015.
[46] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, and M. Bacchiani, "Factored spatial and spectral multichannel raw waveform CLDNNs," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5075–5079.
[47] P. Swietojanski, A. Ghoshal, and S. Renals, "Convolutional Neural Networks for Distant Speech Recognition," IEEE Signal Process. Lett., vol. 21, no. 9, pp. 1120–1124, Sep. 2014.
[48] T. Menne et al., "The RWTH/UPB/FORTH system combination for the 4th CHiME challenge evaluation," in The 4th International Workshop on Speech Processing in Everyday Environments, San Francisco, CA, USA, 2016, pp. 39–44.
[49] H. Erdogan et al., "Multi-channel speech recognition: LSTMs all the way through," in CHiME-4 Workshop, 2016.
[50] J. Heymann, L. Drude, and R. Haeb-Umbach, "Wide residual BLSTM network with discriminative speaker adaptation for robust speech recognition," in CHiME 2016 Workshop, 2016.
[51] J. Du et al., "The USTC-iFlytek system for CHiME-4 challenge," in Proc. CHiME, 2016, pp. 36–38.
[52] D. Wang and J. Lim, "The unimportance of phase in speech enhancement," IEEE Trans. Acoust., Speech, Signal Process., vol. 30, no. 4, pp. 679–681, Aug. 1982.
[53] L. Drude, B. Raj, and R. Haeb-Umbach, "On the Appropriateness of Complex-Valued Neural Networks for Speech Enhancement," in INTERSPEECH, 2016, pp. 1745–1749.
[54] T. N. Sainath et al., "Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 5, pp. 965–979, May 2017.
[55] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, 1997.
[56] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5–6, pp. 602–610, Jul. 2005.
[57] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[58] D. Povey et al., "The Kaldi Speech Recognition Toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[59] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. Interspeech, 2015, pp. 3214–3218.
[60] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, "Improving deep neural network acoustic models using generalized maxout networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 215–219.
[61] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[62] E. Vincent, R. Gribonval, and M. D. Plumbley, "Oracle estimators for the benchmarking of source separation algorithms," Signal Processing, vol. 87, no. 8, pp. 1933–1950, 2007.
[63] X. Mestre and M. A. Lagunas, "On diagonal loading for minimum variance beamformers," in Proc. 3rd IEEE Int. Symp. Signal Process. Inf. Technol. (ISSPIT), 2003, pp. 459–462.
[64] P. Swietojanski, A. Ghoshal, and S. Renals, "Hybrid acoustic models for distant and multichannel large vocabulary speech recognition," in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013, pp. 285–290.
[65] M. I. Mandel, R. J. Weiss, and D. Ellis, "Model-Based Expectation-Maximization Source Separation and Localization," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 382–394, Feb. 2010.
[66] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006.
[67] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, 2011.

Curriculum Vitae
Personal Information
Name: Minkyu Shin
Birth Date: February 11, 1987
E-mail: mkshin@ispl.korea.ac.kr
Education
Mar. 2012–present:
School of Electrical Engineering, Korea University
(combined M.S. and Ph.D. degree program)
Mar. 2006–Feb. 2012:
School of Electrical Engineering, Korea University
(received B.S. in 2012)
Research Interests
Machine Learning and Pattern Recognition
Automatic Speech Recognition
Multichannel Speech Enhancement
Voice Activity Detection
Sound Source Localization

Publications
International Journal
[1] Minkyu Shin and Hanseok Ko, "New Generalized Sidelobe Canceller with Denoising Auto-Encoder for Improved Speech Enhancement," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E100-A, no. 12, Dec. 2017.
[2] Minkyu Shin, Wooil Kim, David Han, and Hanseok Ko, "Relative Transfer Function (RTF) Estimation Utilizing Peaks in Time-Domain RTF," Electronics Letters, 2016.
[3] Seongkyu Mun, Minkyu Shin, Suwon Shon, Wooil Kim, David Han, and Hanseok Ko, "DNN Transfer Learning based Non-linear Feature Extraction for Acoustic Event Classification," IEICE Transactions on Information and Systems, Sep. 2017.
International Conference
[1] Sangwook Park, Jinsang Rho, Minkyu Shin, David K. Han, and Hanseok Ko, "Acoustic Feature Extraction for Robust Event Recognition on Cleaning Robot Platform," 2014 IEEE International Conference on Consumer Electronics, pp. 149–150, Las Vegas, NV, USA, January 10–13, 2014.
[2] Seongjae Lee, Daehun Kim, Suwon Shon, Seongkyu Mun, Minkyu Shin, Youngseng Chen, Sejong Hyun, M. Harris, and Hanseok Ko, "KU-ISPL TRECVID 2016 Multimedia Event Detection System," TRECVID Workshop, 2016.
Domestic Journal
[1] Minkyu Shin and Hanseok Ko, "CASA-Based Inter-Microphone Transfer Function Ratio Estimation Algorithm," Journal of the Acoustical Society of Korea, vol. 33, no. 1, pp. 54–59, January 2014.

Domestic Conference
[1] Suwon Shon, Seongkyu Mun, Minkyu Shin, and Hanseok Ko, "KU-ISPL Language Recognizer for the NIST 2015 i-Vector Machine Learning Challenge," Proceedings of the Fall Conference of the Acoustical Society of Korea, vol. 35, no. 2(s), p. 151, Nov. 2016.
[2] Minkyu Shin, Youngro Lee, and Hanseok Ko, "A Rejection-Sound Training Method for Deep Neural Network Based In-Home Acoustic Event Detectors," Proceedings of the Fall Conference of the Acoustical Society of Korea, vol. 34, no. 2(s), p. 6, Nov. 2015.
[3] Seongkyu Mun, Minkyu Shin, and Hanseok Ko, "A Study on Feature Selection for Acoustic Scene Awareness," Proceedings of the Fall Conference of the Institute of Electronics and Information Engineers, pp. 627–629, Nov. 2013.
[4] Minkyu Shin and Hanseok Ko, "GSC Performance Analysis with Respect to Microphone Gain Mismatch and Input SNR," Proceedings of the Spring Conference of the Acoustical Society of Korea, vol. 32, no. 1, pp. 173–175, May 2013.
[5] Jinsu Park, Minkyu Shin, and Hanseok Ko, "A Noise-Robust Voice Activity Detection Method for Speech Recognition," Proceedings of the 29th Conference on Speech Communication and Signal Processing, pp. 30–32, Aug. 2012.
[6] Kwangyoun Kim, Minkyu Shin, and Hanseok Ko, "A Combined Source Separation and Acoustic Event Recognition Algorithm for Surveillance Systems Based on Acoustic Information," Proceedings of the Spring Conference of the Acoustical Society of Korea, vol. 31, no. 1, pp. 47–50, May 2012.

Acknowledgments
I am deeply grateful to Professor Hanseok Ko, who supported and guided me unwaveringly through many difficulties over the past six years. I also thank Professor Eenjun Hwang, Professor Sungwon Han, Professor Wooil Kim, and Dr. Kwangil Hwang for the guidance that allowed me to complete my doctoral studies. My sincere thanks go to Dr. David Han of the United States Department of Defense, who always offered valuable advice on my research. I am grateful to all the proud seniors, juniors, and classmates of the Intelligent Signal Processing Laboratory, who shared long days of hard work with me.
To my father and mother, who worried about me not only during graduate school but throughout my whole upbringing: I will strive to be a son who repays the love you have given me. I thank my sister and brother-in-law, who have always cheered me on. I deeply appreciate the encouragement of my family, including my great-uncle and my uncle, who have long treated me as a doctor already, and I will do my best to live up to their expectations.
To Juyeon, who has always been by my side: thank you, always, and I love you.