Attribution-NonCommercial-NoDerivs 2.0 Korea
You are free to copy, distribute, transmit, display, perform, and broadcast this work, provided you follow the conditions below:
Attribution. You must attribute the work to the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.
For any reuse or distribution, you must make clear the license terms applied to this work. These conditions can be waived with separate permission from the copyright holder. Your rights under copyright law are not affected by the above. This is a human-readable summary of the Legal Code.
Disclaimer

Thesis for the Degree of Doctor of Philosophy
Deep Learning Based Spatial
Information Modeling for Multi-Channel
Speech Enhancement toward Noise
Robust Automatic Speech Recognition
Minkyu Shin
School of Electrical Engineering
Graduate School
Korea University
December 2017


Abstract
Deep Learning Based Spatial Information Modeling
for Multi-Channel Speech Enhancement toward
Noise Robust Automatic Speech Recognition
Minkyu Shin
Advised by Prof. Hanseok Ko
School of Electrical Engineering
Graduate School
Korea University
Speech signal processing is playing an increasing role in human
interaction with smart devices. Therefore, the problem of improving the robustness
of automatic speech recognition in noisy environments has attracted considerable
research effort. Deep learning approaches to speech enhancement, particularly
those that incorporate a denoising auto-encoder, have achieved great success when
applied to single-channel audio signals. In the single-channel case, the signal
intensity in the time-frequency domain is the main information resource for
representing an input signal, and a simple neural network (NN) topology has been
sufficiently effective. Deep learning is effective in improving the performance of
an automatic speech recognition (ASR) system, but single-channel clean-signal
estimation exhibits a performance limit. It is expected that multichannel
signal-based approaches can overcome this limitation by using spatial information
in addition to signal intensity.
Attempts have been made to expand the application of deep learning into
multichannel speech enhancement. However, the full potential of deep
learning-based approaches for microphone array processing has not yet been
attained. The use of NNs for multiple channels faces many obstacles, mainly
because phase information in the time-frequency domain plays a vital role in
delivering the spatial information of multichannel signals, whereas NNs
traditionally process real-valued physical data and rely on real-valued weights. The
application of deep learning to spatial information for multichannel acoustic signals
is still under study, and no representative solution has emerged. This analysis is
supported by the fact that a variety of approaches are still actively being proposed,
such as developing features to be fed into an NN together with a multichannel signal,
or replacing part of an existing spatial filter-based algorithm with an NN. It is
noteworthy that fully deep learning-based approaches to multichannel speech
enhancement have not achieved success comparable to that in speech recognition or image
classification.
In this dissertation, problems encountered in applying deep learning to
multichannel speech enhancement are addressed, and mitigating approaches that
improve existing methods are proposed. For in-depth analysis and problem
formulation, traditional signal processing-based approaches are also discussed.
Notable approaches incorporating independent component analysis and non-negative
matrix factorization are revisited, and a relevant previous approach is
introduced. This provides an analysis of how and why deep learning is required.
The application of spatial diversity within a deep learning framework is
evaluated and analyzed from various perspectives. This includes a previous work
on combining a traditional spatial filter with a deep learning-based postfilter. By
designing an effective feature for the subsequent postfilter, the modified beamformer
structure yielded improved speech enhancement performance. Existing deep
learning-based algorithms are also analyzed, and improvement methods are
proposed. Modeling the phase information of the input signal is emphasized to
overcome the limitations of existing algorithms. By introducing a front-end structure
that accepts a real-valued representation of the time-frequency domain signal, the
proposed structure avoids forcing the NN to approximate complex algebra to
decode the phase difference between the input signals. As a result, deep learning
with a real-valued NN is effectively applied to exploit the spatial information
embedded in the phase difference between the input signals. To ensure
applicability as the front end of an arbitrary ASR system, the proposed methods
focus on clean speech estimation without requiring an acoustic model trained
with noise information. This allows ASR performance improvement without
predefining noise conditions when applied to two-channel real recorded noisy
signals, demonstrating that the proposed method can be applied to unseen noise
situations encountered in everyday life.

Contents
Abstract
Contents
List of Figures
List of Tables
List of Abbreviations
Chapter 1. Introduction
1.1. Background
1.2. Organization of Dissertation
Chapter 2. Classical Multichannel Speech Enhancement Algorithms
2.1. Signal Model and Definitions
2.2. Overview of Classical Beamformers
Chapter 3. Blind Source Separation-based Schemes in Relative Transfer Function Estimation
3.1. Issues and related works
3.2. RTF Estimation Using Peaks in Time-Domain RTF
3.3. Experiments
3.4. Conclusions
Chapter 4. Neural Network-based Approaches
4.1. Issues and related works
4.2. Motivation
4.3. Proposed system with phase-encoded input for NN-based mask estimation
4.4. The ASR stage in the proposed system
4.5. Experiments
4.6. Conclusions
Chapter 5. Neural Network-based postfilter
5.1. Issues and related works
5.2. New Generalized Sidelobe Canceller with Denoising Auto-Encoder for Improved Speech Enhancement
5.3. Experiments
5.4. Conclusions
Chapter 6. Conclusions and Future Works
6.1. Conclusions
6.2. Future Works
Bibliography

List of Figures
Figure 3.1 Basic two-channel linear BSS signal model
Figure 3.2 PTDRs in time-domain RTF
Figure 3.3 Spatial configurations
Figure 4.1 Structure of NN-based mask estimation in GEV beamformer
Figure 4.2 Time convolution and CLDNN layers
Figure 4.3 Factored multichannel raw waveform CLDNN architecture
Figure 4.4 NN topology in the proposed NN-based mask estimation system
Figure 4.5 LSTM memory block with one cell
Figure 4.6 Computation in TDNN with subsampling
Figure 4.7 The microphone array geometry
Figure 4.8 Evaluation results
Figure 5.1 The structure of the proposed GSC and SDA

List of Tables
Table 3.1 RTF estimation performance in terms of NSE and target speech suppression gain
Table 4.1 Context specification of time delay NN
Table 5.1 Evaluation results for speech enhancement and WERs

List of Abbreviations
ANC Adaptive Noise Canceller
ASR Automatic Speech Recognition
ATF Acoustic Transfer Function
BAN Blind Analytical Normalization
BM Blocking Matrix
BSS Blind Source Separation
CEC Constant Error Carousel
CLDNN Convolutional, Long Short-Term Memory, Deep Neural
Network
DOA Direction of Arrival
FBF Fixed Beamformer
FIR Finite-Impulse Response
FMLLR Feature-Space Maximum Likelihood Linear Regression
GEV Generalized Eigenvector
GEVBM Generalized Eigenvector Blocking Matrix
GSC Generalized Sidelobe Canceller
IPD Interaural Phase Difference
LCMV Linearly Constrained Minimum Variance
LDA Linear Discriminant Analysis
MFCC Mel-Frequency Cepstral Coefficients
MLLT Maximum Likelihood Linear Transformation
MVDR Minimum Variance Distortionless Response
NMF Non-Negative Matrix Factorization
NN Neural Network
NN-GEV Neural Network supported Generalized Eigenvalue
NSE Normalized Squared Error
PHAT Phase Transform
PSD Power Spectrum Density
PTDR Peaks in Time-Domain RTF
RIR Room Impulse Response
RTF Relative Transfer Function
SAT Speaker Adaptive Training
SDA Stacked Denoising Autoencoder
SDR Signal-to-Distortion Ratio
SMBR State-Level Minimum Bayes Risk
SNR Signal-to-Noise Ratio
SRP Steered Response Power
STOI Short-Time Objective Intelligibility
WER Word Error Rate
WSJ Wall Street Journal

Chapter 1. Introduction
1.1. Background
Multichannel speech enhancement algorithms can be classified into data-dependent
and data-independent approaches. The former exploit both the spectro-temporal
and the spatial information of the input signals, while the latter focus on
enhancing signals from a predefined direction without accounting for the statistics of
the signals. Among the data-dependent approaches, those that require reference
information, such as the spatial distribution of sources and sensors, are classified as
supervised approaches and are distinguished from unsupervised approaches that do not
require prior knowledge. In the case of unsupervised methods, the lack of prior
knowledge can be compensated by using a trained model [1]–[3] or by assumptions
about the target speech characteristics, such as sparseness in the time-frequency
domain and statistical independence between different sources [4], [5].
Two representative examples of traditional supervised multichannel filtering are
beamforming and multichannel Wiener filtering [6], [7]. The following sections focus on
one member of the beamformer family of algorithms, namely the minimum variance
distortionless response (MVDR) beamformer [8], which has proven its effectiveness in
realistic environments by achieving impressive results in noise-robust ASR challenges [9].
The term beamforming refers to designing a spatio-temporal filter according to specific
criteria. The MVDR beamformer performs noise reduction under the constraint that the
desired speech signals are processed without distortion.
1.2. Organization and Contributions of Dissertation
The contributions of this dissertation are summarized as follows.
Investigated classical multichannel speech enhancement approaches, especially
the beamformer family of algorithms, to provide a clear basis for the subsequent
chapters.
Developed directional BSS-based approaches to estimate the RTF in a practical
scenario with noise that naturally occupies the same time-frequency range as the
desired speech signals.
Introduced and described the issues and related works on improving the
performance of RTF estimation. Using the peaks in the time-domain RTF proposed
in a previous work, the relevant issues are introduced and described with
experimental results.
Developed the recently introduced NN-GEV beamforming approach and improved
it by deep learning-based spatial information modeling.
Proposed a front-end structure for the NN to effectively transfer the phase
information into deeper layers. By investigating the issues associated with deep
learning-based approaches to spatial information modeling, and the shortcomings of
existing algorithms, an NN structure that effectively transfers the phase information is
developed. Relevant evaluation results in terms of ASR performance and a discussion
of the results are presented.
Investigated and proposed the use of deep learning for compensating distortion
of the enhanced signal. Relevant issues and related works concerning postfiltering
the beamformer output are described. A structural change of the beamformer, which was
proposed in a previous work, is introduced to allow the deep learning-based postfilter
to take advantage of the multichannel information.
This dissertation is organized as follows:
In Chapter 2, classical multichannel speech enhancement approaches, especially
the beamformer family of algorithms are investigated.
In Chapter 3, directional BSS-based approaches to estimate the RTF are developed.
In Section 3.1, issues and related works are described. In Sections 3.2 and 3.3, the method
of improving performance by using the peaks in the time-domain RTF is introduced with
experimental results.
In Chapter 4, the recently proposed NN-GEV beamforming approach and its
improvement are explored. In Sections 4.1 and 4.2, issues associated with deep
learning-based approaches in spatial information modeling, and the shortcomings of
the existing algorithms, are investigated. In Section 4.3, a front-end structure of the NN for deep
learning-based spatial information modeling is introduced. Evaluation results in terms of
ASR performance and a discussion of the results are presented in Sections 4.5 and 4.6.

In Chapter 5, the use of deep learning in compensating distortion of the enhanced
signal is investigated. In Section 5.1, issues and related works concerning postfiltering
the beamformer output are described. In Section 5.2, a structural change of the beamformer
is introduced. The experimental results are presented in Section 5.3.
Finally, the concluding remarks of this dissertation and future research directions
are presented in Chapter 6.

Chapter 2. Classical Multichannel Speech
Enhancement Algorithms
2.1. Signal Model and Definitions
Let $s(t)$ be a speech signal impinging on a microphone array of arbitrary
geometry. The input signal at the $m$-th microphone is given by

$$y_m(t) = x_m(t) + v_m(t) = s(t) * h_m(t) + v_m(t), \qquad m = 1, 2, \ldots, M, \tag{2.1}$$

where $*$ is the convolution operator and $h_m(t)$ is the channel impulse response of the
$m$-th microphone, modeled as a finite impulse response. $x_m(t)$ is the noise-free
reverberant speech component, while $v_m(t)$ is the noise at the $m$-th microphone. It is
assumed that the noise signal is uncorrelated with $s(t)$. By assuming that the room
impulse responses (RIRs) change slowly over time, the time index is omitted in $h_m$. The
above signal model can be written in the frequency domain as

$$Y_m(l,k) = H_m(k)\,S(l,k) + V_m(l,k) = X_m(l,k) + V_m(l,k) \quad \text{for } m = 1, 2, \ldots, M. \tag{2.2}$$

Here, $Y_m(l,k)$, $H_m(k)$, $S(l,k)$, and $V_m(l,k)$ are the short-time Fourier transforms
(STFTs) of $y_m(t)$, $h_m(t)$, $s(t)$, and $v_m(t)$, respectively. $l$ is the frame index, and $k$ is
the frequency bin index. This time-frequency domain signal model can be represented in
vector form as

$$\mathbf{y}(l,k) = \mathbf{h}(k)\,S(l,k) + \mathbf{v}(l,k) = \mathbf{x}(l,k) + \mathbf{v}(l,k),$$

where

$$\mathbf{y}(l,k) = [Y_1(l,k)\;\; Y_2(l,k)\;\; \cdots\;\; Y_M(l,k)]^{\mathrm{T}},$$
$$\mathbf{h}(k) = [H_1(k)\;\; H_2(k)\;\; \cdots\;\; H_M(k)]^{\mathrm{T}},$$
$$\mathbf{x}(l,k) = [X_1(l,k)\;\; X_2(l,k)\;\; \cdots\;\; X_M(l,k)]^{\mathrm{T}},$$
$$\mathbf{v}(l,k) = [V_1(l,k)\;\; V_2(l,k)\;\; \cdots\;\; V_M(l,k)]^{\mathrm{T}}. \tag{2.3}$$

In speech enhancement, the goal is to reduce the noise and recover the speech
component $S(l,k)$. Classical beamformers apply a linear filter $\mathbf{w}(l,k)$ to the noisy
input signal to achieve this goal as

$$Z(l,k) = \mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{y}(l,k) = \mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{h}(k)\,S(l,k) + \mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{v}(l,k), \tag{2.4}$$

where $\mathbf{w}(l,k) \in \mathbb{C}^{M}$ is a beamformer weight vector decided by the criteria of the
beamformer, and $(\cdot)^{\mathrm{H}}$ denotes the conjugate transpose. To obtain a distortionless speech
component in the enhanced result, the goal can be set as making $\mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{h}(k) \approx 1$
while keeping $\mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{v}(l,k)$ small. In this case, the output of the beamformer
recovers $S(l,k)$, i.e., $Z(l,k) \approx S(l,k)$.
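As a numerical illustration of the vector signal model (2.3) and the beamformer output (2.4), the following sketch applies a per-bin weight to a synthetic multichannel STFT. All sizes and signals here are random stand-ins chosen for illustration, not data or parameters from this dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, K = 4, 100, 257   # mics, frames, frequency bins -- illustrative sizes only

# Random stand-ins for the ATF h(k) and the clean-speech STFT S(l,k).
h = rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))
S = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))
V = 0.1 * (rng.standard_normal((L, K, M)) + 1j * rng.standard_normal((L, K, M)))

# Signal model (2.3): y(l,k) = h(k) S(l,k) + v(l,k).
Y = h[None, :, :] * S[:, :, None] + V

# A weight satisfying the distortionless condition w^H(k) h(k) = 1
# (a normalized matched filter, used here only to illustrate (2.4)).
w = h / np.sum(np.abs(h) ** 2, axis=1, keepdims=True)

# Beamformer output (2.4): Z(l,k) = w^H y(l,k).
Z = np.einsum('km,lkm->lk', w.conj(), Y)

# In the noise-free case the same weight recovers S(l,k) exactly, since
# w^H h S = S; with noise present, the residual term w^H v remains in Z.
Z_clean = np.einsum('km,lkm->lk', w.conj(), Y - V)
assert np.allclose(Z_clean, S)
```

Any weight satisfying the distortionless condition behaves this way; the designs discussed in the next section differ only in how the weight is chosen with respect to the noise statistics.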

2.2. Overview of Classical Beamformers
In the context of applying a spatial filter, i.e., beamforming, the MVDR design [8]
is particularly popular. The MVDR beamformer is traditionally devised by minimizing
the power of the filtered signal subject to a no-speech-distortion constraint. The filter
is designed under the assumption that the ATF of the target signal, $\mathbf{h}(k)$, can be estimated
exactly. The beamformer filter weight is calculated as the optimal solution of

$$\min_{\mathbf{w}}\; \mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)\,\mathbf{w}(l,k), \qquad \text{subject to } \mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{h}(k) = c(l,k), \tag{2.5}$$

where $\boldsymbol{\Phi}_{\mathbf{yy}}$ is the so-called power spectral density (PSD) matrix of the noisy input signal
$\mathbf{y}(l,k)$. To impose the no-speech-distortion constraint, $c(l,k)$ is set to 1.
The PSD matrix for a given vector $\mathbf{y}(l,k)$ is defined as

$$\boldsymbol{\Phi}_{\mathbf{yy}}(l,k) = \mathrm{E}\big\{\mathbf{y}(l,k)\,\mathbf{y}^{\mathrm{H}}(l,k)\big\}, \tag{2.6}$$

where $\mathrm{E}\{\cdot\}$ denotes the expected value. To solve (2.5), a complex Lagrangian functional is
defined [10] as follows:

$$\mathcal{L}(\mathbf{w}) = \mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)\,\mathbf{w}(l,k) + \lambda\big[\mathbf{w}^{\mathrm{H}}(l,k)\,\mathbf{h}(k) - c(l,k)\big] + \lambda^{*}\big[\mathbf{h}^{\mathrm{H}}(k)\,\mathbf{w}(l,k) - c^{*}(l,k)\big], \tag{2.7}$$

where $\lambda$ is a Lagrange multiplier. By setting the derivative with respect to $\mathbf{w}^{*}$ to 0 [11] as

$$\nabla_{\mathbf{w}^{*}}\,\mathcal{L}(\mathbf{w}) = \boldsymbol{\Phi}_{\mathbf{yy}}(l,k)\,\mathbf{w}(l,k) + \lambda\,\mathbf{h}(k) = \mathbf{0}, \tag{2.8}$$

the optimal filter is calculated as

$$\mathbf{w}_{\mathrm{MVDR}}(l,k) = \frac{\boldsymbol{\Phi}_{\mathbf{yy}}^{-1}(l,k)\,\mathbf{h}(k)}{\mathbf{h}^{\mathrm{H}}(k)\,\boldsymbol{\Phi}_{\mathbf{yy}}^{-1}(l,k)\,\mathbf{h}(k)}. \tag{2.9}$$
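The closed-form solution (2.9) is straightforward to evaluate once estimates of the PSD matrix and the ATF are available. A minimal sketch for a single time-frequency bin, with a randomly generated Hermitian positive-definite matrix standing in for the true PSD (all values are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4   # number of microphones (arbitrary)

# Hypothetical ATF for one time-frequency bin, and a random Hermitian
# positive-definite matrix standing in for the noisy-input PSD Phi_yy(l,k).
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_yy = A @ A.conj().T + M * np.eye(M)   # Hermitian and well conditioned

# Closed-form MVDR solution (2.9): w = Phi_yy^{-1} h / (h^H Phi_yy^{-1} h).
phi_inv_h = np.linalg.solve(phi_yy, h)
w_mvdr = phi_inv_h / (h.conj() @ phi_inv_h)

# The distortionless constraint of (2.5) is satisfied: w^H h = 1.
assert np.isclose(w_mvdr.conj() @ h, 1.0)
```

Using `np.linalg.solve` instead of an explicit matrix inverse is the usual numerically preferable choice here.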
In the early versions of the MVDR beamformer and its adaptive implementations, i.e.,
the generalized sidelobe canceller (GSC) [12], [13], it was assumed that the propagation paths
between the source and the sensors are characterized by pure delays. This makes it
possible to set the distortionless constraint with only the knowledge of the direction of
arrival (DOA) of the target speech signal. In this case, the acoustic transfer function (ATF)
$\mathbf{h}(k)$ is simplified to a phase-shifted steering vector in the STFT domain. However,
the reflections present in a real, reverberant environment degrade this
simplified beamformer when it is applied in real life. More specifically, a wrong assumption on
$\mathbf{h}(k)$ invalidates the distortionless constraint and causes parts of the desired signal to be
cancelled. This is known as the signal cancellation phenomenon, and it is
especially well known in adaptive implementations of MVDR beamformers. As a
solution to the problem of the pure-delay assumption, the steering vector of simple time
delays is replaced by an arbitrary finite impulse response (FIR). In [14], the authors
showed that knowledge of the RTFs from the source to the individual sensors is sufficient
to construct the MVDR beamformer in an adaptive way. They introduced a system to
estimate the RTF using a least-squares approach that exploits the nonstationarity of speech
signals as opposed to the stationary character of noise. In [15], a procedure for obtaining
the RTFs from a generalized eigenvector decomposition is derived. By using the
eigenvector-based RTF estimation, the performance of the GSC was noticeably improved
in simulated environments. The eigenvector-based approach originates from the maximum
signal-to-noise ratio (SNR) criterion in [16], where broadband beamforming in the
frequency domain by maximizing the signal-to-noise power ratio was proposed. The
beamformer filter coefficients are determined by maximizing the SNR of the
output signal $Z(l,k)$, where the SNR is defined as

$$\mathrm{SNR}(l,k) = \frac{\mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{xx}}(l,k)\,\mathbf{w}(l,k)}{\mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}(l,k)} = \frac{\mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)\,\mathbf{w}(l,k)}{\mathbf{w}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}(l,k)} - 1. \tag{2.10}$$

The maximization of (2.10) leads to a generalized eigenvalue (GEV) problem; as a result, $\mathbf{w}(l,k)$ is found to
be the eigenvector corresponding to the largest eigenvalue of
$\boldsymbol{\Phi}_{\mathbf{vv}}^{-1}(l,k)\,\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)$. This
eigenvector is denoted as $\mathbf{w}_{\mathrm{GEV}}(l,k)$. The PSD matrix of the noisy input signal is given
by

$$\boldsymbol{\Phi}_{\mathbf{yy}}(l,k) = \boldsymbol{\Phi}_{\mathbf{xx}}(l,k) + \boldsymbol{\Phi}_{\mathbf{vv}}(l,k). \tag{2.11}$$
The PSD matrix of the target speech component is given by

$$\boldsymbol{\Phi}_{\mathbf{xx}}(l,k) = \phi_{ss}(l,k)\,\mathbf{h}(k)\,\mathbf{h}^{\mathrm{H}}(k). \tag{2.12}$$

Applying the GEVD to $\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)$ and $\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)$, and choosing the eigenvector
corresponding to the largest eigenvalue $\lambda(l,k)$, results in

$$\boldsymbol{\Phi}_{\mathbf{yy}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k) = \lambda(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k). \tag{2.13}$$

Substituting (2.11) and (2.12) into the left-hand side of (2.13) yields

$$\phi_{ss}(l,k)\,\mathbf{h}(k)\,\mathbf{h}^{\mathrm{H}}(k)\,\mathbf{w}_{\mathrm{GEV}}(l,k) + \boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k) = \lambda(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k), \tag{2.14}$$

therefore

$$\phi_{ss}(l,k)\,\mathbf{h}(k)\,\mathbf{h}^{\mathrm{H}}(k)\,\mathbf{w}_{\mathrm{GEV}}(l,k) = \big(\lambda(l,k) - 1\big)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k). \tag{2.15}$$

Finally,

$$\mathbf{w}_{\mathrm{GEV}}(l,k) = \frac{\phi_{ss}(l,k)\,\mathbf{h}^{\mathrm{H}}(k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)}{\lambda(l,k) - 1}\,\boldsymbol{\Phi}_{\mathbf{vv}}^{-1}(l,k)\,\mathbf{h}(k). \tag{2.16}$$
As $\phi_{ss}(l,k)\,\mathbf{h}^{\mathrm{H}}(k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)/\big(\lambda(l,k)-1\big)$ is a scalar, this can be rewritten as

$$\mathbf{w}_{\mathrm{GEV}}(l,k) = \alpha\,\boldsymbol{\Phi}_{\mathbf{vv}}^{-1}(l,k)\,\mathbf{h}(k), \tag{2.17}$$

where $\alpha$ is an arbitrary complex scalar. Since the above filter weight is derived from
a narrowband SNR criterion, an uncontrolled amount of speech distortion is introduced
by the arbitrary gain $\alpha$. To achieve a distortionless response, an additional single-channel
postfilter $g(l,k)$ is required to impose unity gain on the speech signal:

$$g(l,k)\,\mathbf{w}_{\mathrm{GEV}}^{\mathrm{H}}(l,k)\,\mathbf{h}(k) = 1. \tag{2.18}$$

An example of such a postfilter is blind analytical normalization (BAN):

$$g_{\mathrm{BAN}}(l,k) = \frac{\sqrt{\mathbf{w}_{\mathrm{GEV}}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)/M}}{\mathbf{w}_{\mathrm{GEV}}^{\mathrm{H}}(l,k)\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)}. \tag{2.19}$$

For a more intuitive representation of the relation between the max-SNR beamformer
coefficient and the RTF, (2.17) can be rewritten as

$$\mathbf{h}(k) = \alpha^{-1}\,\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k). \tag{2.20}$$
As shown in (2.20), $\mathbf{h}(k)$ is a scaled and rotated version of $\mathbf{w}_{\mathrm{GEV}}(l,k)$. To obtain the
ratios of the transfer functions from the source to the individual sensors, the ambiguity caused
by the scalar can be resolved by normalizing the right-hand side of (2.20) by its first
component,

$$\tilde{\mathbf{h}}(l,k) = \frac{\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)}{\big(\boldsymbol{\Phi}_{\mathbf{vv}}(l,k)\,\mathbf{w}_{\mathrm{GEV}}(l,k)\big)_{1}}, \tag{2.21}$$

where $(\cdot)_{1}$ denotes the first component of the vector.
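The steps of (2.13), (2.19), and (2.21) can be checked numerically. In the sketch below the speech PSD is an exact rank-one matrix as in (2.12), so the normalized vector of (2.21) must reproduce the true ATF ratio; all matrices and powers are synthetic stand-ins chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
M = 4

# Rank-one speech PSD (2.12) and a Hermitian positive-definite noise PSD,
# both synthetic, for a single time-frequency bin.
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
phi_ss = 2.0                                   # hypothetical speech power
phi_xx = phi_ss * np.outer(h, h.conj())
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_vv = B @ B.conj().T + np.eye(M)
phi_yy = phi_xx + phi_vv                       # (2.11)

# (2.13): w_GEV is the principal eigenvector of Phi_vv^{-1} Phi_yy.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(phi_vv, phi_yy))
w_gev = eigvecs[:, np.argmax(eigvals.real)]

# BAN postfilter gain (2.19).
num = np.sqrt((w_gev.conj() @ phi_vv @ phi_vv @ w_gev).real / M)
g_ban = num / (w_gev.conj() @ phi_vv @ w_gev).real

# RTF estimate (2.20)-(2.21): Phi_vv w_GEV equals h up to a scalar, so
# normalizing by the first component removes the ambiguity.
rtf = phi_vv @ w_gev
rtf = rtf / rtf[0]
assert np.allclose(rtf, h / h[0])
```

Because the speech PSD is exactly rank one here, the recovery is exact; with estimated PSD matrices, (2.21) yields an estimate whose accuracy depends on the quality of the noise statistics.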

Chapter 3. Blind Source Separation-based
Schemes in Relative Transfer Function Estimation
3.1. Issues and related works
Beamformers critically depend on estimates of the statistics of the desired
speech component. The most common form of information required by beamformers is
the RTF. To estimate RTFs, it is typically assumed that a time period exists in which the only
nonstationary acoustic source is the desired speech, or in which the desired speech can be
observed alone. If this so-called RTF estimation period is provided, the RTF of the desired
speech can be estimated by applying the GEVD procedure to the cross-PSD of the period [15]
or by exploiting the nonstationarity of speech [17]. Obviously, ensuring the existence of the
RTF estimation period is very unnatural under realistic acoustic conditions, unless a
speaker reads a series of sentences for a substantial amount of time without any
movement. To avoid these stringent requirements, the spatial information of the target
speech may be exploited by using a trained model [18]–[20]. These
approaches treat the RTF as a random variable and attempt to estimate it with probabilistic
models trained for a specific room structure. They are not
applicable when the room structure is not known a priori, and considerable effort is
needed in the modeling process. The simplest way to avoid these problems is probably
to simplify the RTF to a steering vector by assuming that the relation between the desired
speech at each sensor is a pure time delay. RTF estimation can then be replaced with
noise-robust sound source localization of the desired speech. This is much easier, but results
in insufficient performance because it causes severe distortion of the desired speech.
Several studies have attempted to estimate the RTF of the desired speech in the
practical scenario with noise, which naturally occupies the same time-frequency range
as the desired speech. Early work on estimating the time-frequency bins in which the desired
speech is dominant, and using them to estimate the RTF, is presented in [21]. In this research,
the speech presence probability in the time-frequency domain is incorporated by
estimating the RTF from subintervals that contain speech. The speech presence
probability is obtained by applying a minima-controlled recursive averaging-based
algorithm in the time-frequency domain, and it depends heavily on the assumption that
the noise is more stationary than the desired speech. More recently, blind source separation (BSS)
algorithms for RTF estimation have been proposed. A geometric constraint on BSS was
introduced to specify the direction of the desired speech [22], and the time-domain BSS
weight matrix was updated to block the desired speech in the BSS output. The BSS
weights naturally approximate the RTF as they are updated to block signal
components related to the specified direction. However, the estimates obtained from the
BSS weights usually do not reach sufficient accuracy unless the input signal is
considerably long.

3.1.1. Generic ICA-based BSS algorithm
In this section, triple-N independent component analysis for convolutive
mixtures (TRINICON) [23] for BSS is briefly reviewed. A signal model for point
sources $s_q(n)$, $q = 1, \ldots, P$, is described by

$$x_p(n) = \sum_{q=1}^{P} h_{qp}(n) * s_q(n), \tag{3.1}$$

where $*$ represents convolution and $n$ is the discrete time index. $h_{qp}(n)$ represents the
transfer function from the position of the $q$-th sound source to the $p$-th sensor. The
demixing filter from the $p$-th microphone to the $q$-th output channel is denoted as
$w_{pq}(n)$, and the output signals of the demixing system are described by

$$y_q(n) = \sum_{p=1}^{P} \sum_{\kappa=0}^{L-1} w_{pq}(\kappa)\, x_p(n - \kappa) = \sum_{p=1}^{P} w_{pq}(n) * x_p(n), \tag{3.2}$$

where $w_{pq}(\kappa)$, $\kappa = 0, \ldots, L-1$, denote the current weights of the filter taps from the
$p$-th microphone to the $q$-th output channel. An example of the signal model for the
two-channel case ($P = 2$) is shown in Figure 3.1.

Figure 3.1 Basic two-channel linear BSS signal model
In TRINICON, the demixing filter $w_{pq}(n)$ is identified by minimizing the mutual information
between the output channels, based on the assumption that the acoustic sources are
statistically independent. The filter weights can be updated for each block, which consists
of $N$ output samples. The cost function for each block is usually calculated by replacing ensemble
averaging with temporal averaging over the block, under an assumption of
ergodicity within the individual blocks. To describe the block processing, the block output
signal matrix is introduced as

$$\mathbf{Y}_q(m) = \begin{bmatrix}
y_q(mL) & y_q(mL-1) & \cdots & y_q(mL-D+1) \\
y_q(mL+1) & y_q(mL) & \cdots & y_q(mL-D+2) \\
\vdots & \vdots & & \vdots \\
y_q(mL+N-1) & y_q(mL+N-2) & \cdots & y_q(mL+N-D)
\end{bmatrix}, \tag{3.3}$$

and the convolution in (3.2) is reformulated as

$$\mathbf{Y}_q(m) = \sum_{p=1}^{P} \mathbf{X}_p(m)\,\mathbf{W}_{pq} \tag{3.4}$$
with $m$ denoting the block time index and $N$ denoting the block length. The matrix
$\mathbf{X}_p(m)$ incorporates time lags into the correlation matrices of the cost function. To
ensure that the elements of $\mathbf{Y}_q(m)$ are produced by linear convolutions, the number of
columns of the input matrix $\mathbf{X}_p(m)$ is double the filter length of $\mathbf{W}_{pq}$:

$$\mathbf{X}_p(m) = \begin{bmatrix}
x_p(mL) & x_p(mL-1) & \cdots & x_p(mL-2L+1) \\
x_p(mL+1) & x_p(mL) & \cdots & x_p(mL-2L+2) \\
\vdots & \vdots & & \vdots \\
x_p(mL+N-1) & x_p(mL+N-2) & \cdots & x_p(mL-2L+N)
\end{bmatrix}. \tag{3.5}$$
Now, $\mathbf{W}_{pq}$ is given as the $2L \times D$ Sylvester matrix

$$\mathbf{W}_{pq} = \begin{bmatrix}
w_{pq}(0) & 0 & \cdots & 0 \\
w_{pq}(1) & w_{pq}(0) & \ddots & \vdots \\
\vdots & w_{pq}(1) & \ddots & 0 \\
w_{pq}(L-1) & \vdots & \ddots & w_{pq}(0) \\
0 & w_{pq}(L-1) & & w_{pq}(1) \\
\vdots & & \ddots & \vdots \\
0 & \cdots & 0 & w_{pq}(L-1) \\
\vdots & & & \vdots \\
0 & \cdots & \cdots & 0
\end{bmatrix}. \tag{3.6}$$

For a more convenient notation, the components are rewritten by combining all channels
as

$$\mathbf{Y}(m) = \mathbf{X}(m)\,\mathbf{W} \tag{3.7}$$

with the matrices

$$\mathbf{Y}(m) = [\mathbf{Y}_1(m), \ldots, \mathbf{Y}_P(m)], \qquad \mathbf{X}(m) = [\mathbf{X}_1(m), \ldots, \mathbf{X}_P(m)],$$
$$\mathbf{W} = \begin{bmatrix}
\mathbf{W}_{11} & \cdots & \mathbf{W}_{1P} \\
\vdots & \ddots & \vdots \\
\mathbf{W}_{P1} & \cdots & \mathbf{W}_{PP}
\end{bmatrix}. \tag{3.8}$$
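The Sylvester structure in (3.6) is what turns the matrix product of (3.4) into a bank of linear convolutions. A small numerical check of this property (filter length, block sizes, and signals are arbitrary choices for illustration, not parameters from the TRINICON experiments):

```python
import numpy as np

def sylvester(w, D):
    """Build the 2L x D Sylvester matrix of (3.6): column j holds the
    filter taps shifted down by j rows, with zeros elsewhere."""
    L = len(w)
    W = np.zeros((2 * L, D))
    for j in range(D):
        W[j:j + L, j] = w
    return W

rng = np.random.default_rng(3)
L, D, N = 8, 4, 16                  # filter length, lag width, block length
w = rng.standard_normal(L)
x = rng.standard_normal(N + 2 * L)  # enough past samples for every row

# Input matrix per (3.5): row i is [x(n_i), x(n_i - 1), ..., x(n_i - 2L + 1)].
X = np.array([x[n:n - 2 * L:-1] for n in range(2 * L, 2 * L + N)])

Y = X @ sylvester(w, D)

# Column j of Y is the linear convolution (w * x) delayed by j samples,
# which is exactly the lag structure that (3.4) requires.
full = np.convolve(w, x)
ref = np.array([[full[n - j] for j in range(D)]
                for n in range(2 * L, 2 * L + N)])
assert np.allclose(Y, ref)
```

The doubled row dimension ($2L$ instead of $L$) is what guarantees that every entry of the product is a complete linear convolution rather than a circular one.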
The cost function that includes all time lags of all auto-correlations and cross-correlations
of the output signals is introduced in [24], [25] as

$$\mathcal{J}(m, \mathbf{W}) = \sum_{i=0}^{m} \beta(i, m)\Big[\log\det\big(\mathrm{bdiag}\,\mathbf{R}_{\mathbf{yy}}(i)\big) - \log\det\mathbf{R}_{\mathbf{yy}}(i)\Big], \quad \text{where } \mathbf{R}_{\mathbf{yy}}(i) = \mathbf{Y}^{\mathrm{H}}(i)\,\mathbf{Y}(i). \tag{3.9}$$

Here, $\beta$ is a weighting function with finite support that is normalized according to
$\sum_{i=0}^{m} \beta(i, m) = 1$. The bdiag operation sets all submatrices on the off-diagonals to zero.
In this case, the matrix $\mathbf{Y}^{\mathrm{H}}(i)\mathbf{Y}(i)$ of size $PD \times PD$ is composed of channel-wise
$D \times D$ submatrices. The cost function becomes zero if and only if the output signals are
uncorrelated, so that all block-off-diagonal elements become zero. The update equation of
the filter coefficients is expressed by applying the natural gradient [25], [26] as

$$\Delta\mathbf{W}(m) = \nabla_{\mathbf{W}}^{\mathrm{NG}}\,\mathcal{J}(m, \mathbf{W}) = 2\sum_{i=0}^{m} \beta(i, m)\,\mathbf{W}(i)\Big[\mathbf{R}_{\mathbf{yy}}(i) - \mathrm{bdiag}\,\mathbf{R}_{\mathbf{yy}}(i)\Big]\big(\mathrm{bdiag}\,\mathbf{R}_{\mathbf{yy}}(i)\big)^{-1}. \tag{3.10}$$

When the update is applied to $\mathbf{W}$, the Sylvester structure of $\mathbf{W}$ is maintained by using
only the nonredundant values in $\Delta\mathbf{W}$ with a Sylvester constraint. The
coefficient update rule in [27] is applied to the proposed system to obtain a recursive
block-by-block solution based on offline minimization. The update of the time-domain
BSS weight matrix for the $m$-th block after the $j$-th iteration, $\mathbf{W}^{j}(m)$, is as follows:

$$\mathbf{W}^{j}(m) = \mathbf{W}^{j-1}(m) - \mu\,\Delta\mathbf{W}^{j}(m), \qquad j = 1, \ldots, J, \tag{3.11}$$

where $\mu$ is the step size and the update $\Delta\mathbf{W}^{j}(m)$ corresponds to the natural gradient
$\nabla_{\mathbf{W}}^{\mathrm{NG}}\,\mathcal{J}\big(m, \mathbf{W}^{j-1}(m)\big)$.
3.2. RTF Estimation Using Peaks in Time-Domain RTF
3.2.1. Motivation
When an optimum broadband solution of BSS is obtained, the demixing matrix
can separate the two input sources into the respective outputs. Assuming that $s_1(n)$ is the desired
speech and an ideal BSS system separates $s_2(n)$ from the mixed signal to obtain $y_1(n)$,
the target source $s_1(n)$ should be perfectly suppressed in $y_1(n)$ as follows:

$$s_1(n) * h_{11}(n) * w_{11}(n) + s_1(n) * h_{12}(n) * w_{21}(n) = 0,$$
$$h_{11}(n) * w_{11}(n) = -\,h_{12}(n) * w_{21}(n). \tag{3.12}$$

Equation (3.12) can be expressed in the STFT domain as

$$W_{11}(k) = -\frac{H_{12}(k)}{H_{11}(k)}\,W_{21}(k), \tag{3.13}$$

with $k$ as the frequency bin index and $L$ as the length of the filter weight. If $w_{21}$ is fixed to
a pure time delay with a negative sign, then

$$W_{11}(k) = \frac{H_{12}(k)}{H_{11}(k)}\,e^{-j 2\pi k \tau / L}, \tag{3.14}$$

where $\tau$ is a time delay. Now, $H_{12}(k)/H_{11}(k)$, which is the RTF of the target source,
can be calculated from the BSS weight $W_{11}(k)$. This holds only when the BSS system
suppresses only $s_1(n)$ in $y_1(n)$, which cannot be guaranteed if the number of sound
sources exceeds the number of sensors. In [22], [28], a geometric constraint has been
proposed to satisfy the condition. This constraint was used to force BSS weights that
correspond to output $y_1(n)$ to have a spatial null towards the direction of the target
source. The update of the time-domain BSS weight matrix for the $m$-th block after the
$j$-th iteration, $\mathbf{W}^{j}(m)$, becomes

$$\mathbf{W}^{j+1}(m) = \mathbf{W}^{j}(m) - \Delta\mathbf{W}, \qquad \Delta\mathbf{W} = \mu\,\Delta\mathbf{W}_{\mathrm{BSS}} + \eta\,\Delta\mathbf{W}_{\mathrm{GC}}, \tag{3.15}$$

where $\Delta\mathbf{W}_{\mathrm{BSS}}$ and $\Delta\mathbf{W}_{\mathrm{GC}}$ are the update values from the BSS algorithm and the geometric
constraint, respectively. Here $\mu$ is the step size and $\eta$ is a parameter that controls the
importance of the constraint relative to the BSS. To force $\mathbf{w}(k) = [W_{11}(k)\;\; W_{21}(k)]^{\mathrm{T}}$
to have a spatial null towards a target direction $\theta$, a constraint was set with a steering
vector $\mathbf{d}(\theta,k)$ to the target direction as follows:

$$\mathbf{d}^{\mathrm{H}}(\theta,k)\,\mathbf{w}(k) = 0, \quad \text{where } \mathbf{d}(\theta,k) = \big[e^{-j 2\pi k \Delta_\theta / L},\;\; 1\big]^{\mathrm{T}}, \qquad \Delta_\theta = \frac{d\,\sin\theta}{c}\,f_s. \tag{3.16}$$

Here $c$ is the sound velocity, $d$ is the distance between the sensors, and $f_s$ is the
sampling rate. The cost function of this condition is

$$\mathcal{J}_{\mathrm{GC}}\big(\mathbf{w}(k)\big) = \big[\mathbf{d}^{\mathrm{H}}(\theta,k)\,\mathbf{w}(k)\big]\big[\mathbf{d}^{\mathrm{H}}(\theta,k)\,\mathbf{w}(k)\big]^{\mathrm{H}}, \tag{3.17}$$
where $w_{21}(k)$ is fixed to a pure delay and the constraint is applied to $W_{11}$ by setting
$\Delta\mathbf{W}_{\mathrm{GC}}$ as the first derivative of (3.17) with respect to $W_{11}$. The steering vector in the
constraint of the conventional algorithm represents only the time delay of arrival from
the target direction. By applying the cost function (3.17), the estimated RTF $W_{11}$ is biased
towards a delayed delta function. The true RTF in the time domain is not a pure time delay;
it has many peaks, as shown in Figure 3.2 (a).
Figure 3.2 PTDRs in the time-domain RTF:
(a) ratio of the input target signals in the time domain (true RTF of the target signal), (b) RTF estimated in the time domain by the conventional algorithm, and (c) RTF estimated in the time domain by the proposed algorithm
Despite their important role in characterizing the RTF, these peaks are smoothed out by conventional algorithms, as shown in Figure 3.2 (b).
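As a sanity check of the relation between the demixing weight and the RTF, the sketch below builds random FIR responses for hypothetical transfer functions (all names, lengths, and the delay value are illustrative assumptions, not values from the thesis), fixes $W_{21}$ to a negated pure delay, forms the ideal demixing weight per (3.13), and verifies that undoing the delay term in $W_{11}$ recovers the RTF $H_{12}/H_{11}$ as in (3.14):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 64      # filter length (hypothetical)
tau = 8     # pure time delay fixed in w21, in samples (hypothetical)

# Hypothetical decaying random impulse responses from the target to both sensors.
decay = np.exp(-0.2 * np.arange(L))
h11 = rng.standard_normal(L) * decay
h12 = rng.standard_normal(L) * decay
H11, H12 = np.fft.fft(h11), np.fft.fft(h12)

k = np.arange(L)
W21 = -np.exp(-2j * np.pi * k * tau / L)   # pure delay with a negative sign
W11 = -(H12 / H11) * W21                   # ideal demixing weight per (3.13)

# Undoing the delay term in W11 recovers the target RTF, per (3.14).
rtf_est = W11 * np.exp(2j * np.pi * k * tau / L)
```

In practice the BSS weights only approximate this ideal solution, which is exactly why the constraint-induced bias towards a delayed delta function matters.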
3.2.2. Utilization of the peaks in time-domain RTF (PTDR)
A previous work [29] presents an effective algorithm for more accurate RTF estimation that improves on directional-BSS-based RTF estimation. The peaks in the time-domain RTF (PTDR) are used as a feature of the RTF. A non-negative matrix factorization (NMF) based algorithm is employed to estimate the PTDRs that correspond to the target source. A semi-NMF algorithm [30], a variant of NMF that extends its applicable range to negative data, is used. As a result, the peak-smoothing effect is overcome by replacing the smoothed peaks with PTDR estimates. The input signal ratio is written as

$$\frac{X_2(k)}{X_1(k)} = \frac{S_1(k)H_{12}(k) + S_2(k)H_{22}(k) + \cdots}{S_1(k)H_{11}(k) + S_2(k)H_{21}(k) + \cdots}. \quad (3.18)$$
As the dominant input signal changes due to the nonstationarity of the signals, the input signal ratio in some frames can take a value close to the RTF of the dominant signal source:

$$\frac{X_2(k)}{X_1(k)} \approx \frac{H_{12}(k)}{H_{11}(k)} \quad \text{when } S_1(k) \gg S_2(k), S_3(k), \ldots \quad (3.19)$$

When there is no dominant source, the input signal ratio has unpredictable values due to
the independence of each source. The peaks of the input signal ratio are assumed to be a weighted sum of the PTDRs from each sound source and unpredictable peaks. The peaks of the input signal ratios in each frame are stacked in a matrix $\mathbf{V}$ and fed into the semi-NMF algorithm as follows:

$$\mathbf{V} = [\mathbf{v}(1) \cdots \mathbf{v}(\ell) \cdots]^{\mathrm{T}}, \quad v_t(\ell) = \begin{cases} r_t(\ell), & |r_t(\ell)| > \varepsilon \text{ and } t \le T \\ 0, & \text{otherwise,} \end{cases} \quad (3.20)$$

where $r_t(\ell)$ is the $t$-th sample of the inverse discrete Fourier transform of the input signal ratio for the $\ell$-th frame, $\varepsilon$ is a threshold used to find peaks, and $T$ is a value used to decide the range of the PTDR search. The columns with synchronized fluctuation in the matrix $\mathbf{V}$ can be grouped into different bases:
$$\mathbf{V} \approx \mathbf{B}\mathbf{G}, \quad (3.21)$$

where the columns of $\mathbf{B}$ are the basis vectors and the rows of $\mathbf{G}$ are the corresponding activation weights, with $I$ as the number of bases. If only the direct path from the target direction is considered, the RTF in the time domain is a delayed delta function with a peak at the position corresponding to the time difference of arrival. By choosing, among the resulting bases of the semi-NMF, the single basis that has the largest value at this position, the PTDRs having synchronized fluctuations with the peak from the direct path can be obtained as the estimated target-signal PTDRs.
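The two computational steps above can be sketched as follows: peak extraction per (3.20), and a minimal semi-NMF in the style of Ding et al. [30], written here as V ≈ F Gᵀ with nonnegative G and unconstrained F. Function names, iteration counts, and the toy usage are my own choices, not taken from the thesis:

```python
import numpy as np

def extract_peaks(ratio_frames, eps_thr, T):
    """Build the peak matrix of (3.20): IDFT each frame's input signal ratio,
    keep samples inside the search range T whose magnitude exceeds eps_thr."""
    v = np.fft.ifft(ratio_frames, axis=1).real[:, :T]
    return np.where(np.abs(v) > eps_thr, v, 0.0)

def semi_nmf(V, n_basis, n_iter=300, eps=1e-9):
    """Minimal semi-NMF: V ~ F @ G.T with G >= 0, F unconstrained,
    using the standard multiplicative updates."""
    rng = np.random.default_rng(1)
    G = rng.random((V.shape[1], n_basis)) + eps
    for _ in range(n_iter):
        F = V @ G @ np.linalg.pinv(G.T @ G)          # exact least-squares step
        A, B = V.T @ F, F.T @ F
        Ap, An = (np.abs(A) + A) / 2, (np.abs(A) - A) / 2
        Bp, Bn = (np.abs(B) + B) / 2, (np.abs(B) - B) / 2
        G *= np.sqrt((Ap + G @ Bn) / (An + G @ Bp + eps))
    return F, G
```

In the thesis's setting the rows of V would be the thresholded peak vectors, and the basis with the largest value at the direct-path delay position would be selected as the target-signal PTDR basis.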

3.3. Experiments
To examine the effect of using PTDRs on RTF estimation, semi-NMF-based PTDR separation is conducted alongside RTF estimation with the conventional directional BSS. As each RTF estimation completes, the separated PTDRs are inserted into the resulting RTF estimate to compensate for the smoothing effect of the conventional algorithm. The experiments were conducted using speech (3 male and 3 female subjects) as the target and interfering sources, and the performance was evaluated by averaging the results over three separate experiments. The room impulse response was simulated, as shown in Figure 3.3.
Figure 3.3 Spatial configuration. Room dimensions: 10 m × 10 m × 3 m. Sensor positions: (5 m, 4.97 m, 2 m) and (5 m, 5.03 m, 2 m). Source–sensor distance: 1 m. The target speech, noise, and interference sources are indicated in the figure.

For each experiment, a 10-s mixed signal was used. BSS weights were updated every 8 frames of 1024 samples obtained at a 16-kHz sampling rate. The parameters were set as $T = 400$, $\eta = 0.5$, $\mu = 0.05$, and $\varepsilon = 5$. For experiment 2, the
sound of operating machinery in a factory was used as a noise source with power equal to that of the interference. For experiments 3 and 4, white noise was added to each microphone as diffuse background noise. The SNR was set to 0 dB excluding the diffuse noise in each experiment. When diffuse noise was included, it was normalized to a 10-dB SNR and added to the 0-dB SNR mixed signal. Performance was evaluated using the target speech suppression gain and the normalized squared error (NSE) between the estimated RTFs and the ideal RTF [22]. The speech suppression gain is defined as

$$\mathrm{Gain} = \frac{1}{2}\sum_{m=1}^{2} 10\log_{10}\frac{\sigma_{s,m}^2}{\sigma_{e}^2}, \quad (3.22)$$

where $\sigma_{s,m}^2$ and $\sigma_{e}^2$ denote the power of the target signal component at the $m$-th sensor and the power of the leakage signal, respectively. The leakage signal $e(n)$ is calculated as in (3.12), using the resulting BSS weights $w_{11}(n)$ and $w_{21}(n)$ and the target signal. A larger suppression gain indicates a more accurate RTF estimate.

$$e(n) = s_1(n) * h_{11}(n) * w_{11}(n) + s_1(n) * h_{12}(n) * w_{21}(n) \quad (3.23)$$
The NSE is calculated as

$$\mathrm{NSE} = 10\log_{10}\frac{\sum_{t=0}^{L_h-1}\big(\hat{h}(t) - h(t)\big)^2}{\sum_{t=0}^{L_h-1} h(t)^2}, \quad (3.24)$$

where $\hat{h}(t)$ is the estimated RTF and $h(t)$ is the true RTF of the target signal in the time domain.
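The two metrics can be sketched directly from (3.22) and (3.24); the function names and the toy signals in the checks are assumptions of this sketch, not from the thesis:

```python
import numpy as np

def suppression_gain_db(target_mc, leakage):
    """Target speech suppression gain per (3.22), averaged over the sensors:
    10*log10(target power at sensor m / leakage power)."""
    p_leak = np.mean(np.asarray(leakage) ** 2)
    gains = [10 * np.log10(np.mean(np.asarray(x) ** 2) / p_leak)
             for x in target_mc]
    return float(np.mean(gains))

def nse_db(h_est, h_true):
    """Normalized squared error between estimated and true time-domain RTFs,
    per (3.24), in dB (more negative = better)."""
    h_est, h_true = np.asarray(h_est), np.asarray(h_true)
    return float(10 * np.log10(np.sum((h_est - h_true) ** 2)
                               / np.sum(h_true ** 2)))
```

An accurate RTF estimate leaves little target leakage, so a larger gain and a more negative NSE both indicate better estimation, as reported in Table 3.1.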
Table 3.1 RTF estimation performance in terms of NSE and target speech suppression gain.

Idx  Noise component              (s)    Target speech suppression gain (dB)    NSE
                                         Conventional   Proposed                Conventional   Proposed
1    Interference                 0.25   12.8           13.7                    -7.0           -9.5
2    Interference+Noise           0.25   11.7           12.4                    -6.9           -9.6
3    Interference+Noise+Diffuse   0.25   10.7           11.8                    -6.9           -9.3
4    Interference+Noise+Diffuse   0.40   7.6            7.9                     -5.2           -5.9
3.4. Conclusions
Directional BSS is highly efficient in estimating the RTF because it requires only
the target direction, unlike other algorithms that require additional detectors. However,
the geometric constraint in this algorithm has a side effect where the estimated RTF has

smoothed peaks. In this work, the PTDR was effectively used to overcome this side effect. The PTDRs corresponding to the target source were separated from the other peaks by employing the semi-NMF algorithm. By using the PTDR estimates, the performance of directional-BSS-based RTF estimation was shown to improve in terms of speech suppression gain and normalized squared error.

Chapter 4. Neural Network-based Approaches
4.1. Issues and related works
After the breakthrough in training deep architectures presented by Hinton et al., speech enhancement techniques including noise reduction, separation, feature compensation, and dereverberation have been developed in the deep learning framework. A simple yet effective method for the monaural case is to estimate speech presence in each time–frequency bin of the noisy spectrogram. This was formulated as a binary classification problem for estimating the ideal binary mask (IBM) in [29] and shown to improve speech intelligibility. However, due to the binary nature of the mask, the improvement in speech quality is limited. This method was later adopted for the multichannel case in the time–frequency mask-based approach that has recently been successful with multichannel signals. Another successful method for the monaural case is the stacked denoising auto-encoder (SDA) based approach. Inspired by successful results on monaural speech [31]–[33], many approaches have tried to use multichannel signals in an SDA framework. The SDA can be trained to reconstruct the desired speech from the noisy input signal. The main idea behind this structure is to use DNNs to model the relationship between clean and noisy features.
Multichannel feature-based methods [34]–[37] have explored features that capture spatial information in addition to spectral information. The newly proposed features are usually fed to the NN along with conventional acoustic features. There are also studies that attempt to train a network directly on multichannel waveforms, such as [38], [39]. These approaches force the NN to perform the whole enhancement process, including modeling the relationship between channels, and thus require large amounts of training data. Furthermore, high dependence on the training data may cause performance degradation in unseen spatial configurations. Deep learning can cooperate with beamformers in several ways, and various methods are still being studied. Two representative methods that are considered the most successful are described below.
4.1.1. Neural Network-based mask estimation in GEV beamformer
Among the various approaches proposed to extend the applicable scope of deep
learning to multichannel signals, the recently proposed NN-GEV beamforming approach
[40], which combines deep learning-based timefrequency mask estimation and a GEV
beamformer, has achieved success in noise-robust ASR challenges. The success of NN-
GEV is due not only to its improved performance, but also to the simple and effective
application of deep learning. Since modeling is performed on each single-channel spectrum, this system allows the NN training steps to be conducted without considering the geometric configuration of the microphones. In a noise-aware scenario, data-driven approaches are applicable [40], [41] for estimating the spectral masks for speech and noise, and the estimated masks can be used to estimate the signal statistics required by a beamformer. In an NN-GEV system, two masks are estimated, one of them to indicate which time–frequency bins are presumably dominated by speech, and the other

to indicate which bins are dominated by noise. The estimated masks are used to estimate the cross-PSD matrix for each frequency as follows:

$$\Phi_{\nu\nu}(k) = \frac{\sum_{\ell=1}^{L} M_{\nu}(\ell,k)\,\mathbf{Y}(\ell,k)\mathbf{Y}^{\mathrm{H}}(\ell,k)}{\sum_{\ell=1}^{L} M_{\nu}(\ell,k)}, \quad \nu \in \{X, N\}, \quad (4.1)$$

where $M_X$ and $M_N$ are the estimated masks for speech and noise, respectively, and $\mathbf{Y}(\ell,k)$ is a vector of input noisy signals in which each element is one channel signal in the STFT domain for frame index $\ell$ and frequency index $k$.
Figure 4.1 Structure of NN-based mask estimation in GEV beamformer
In [40], it is assumed that the target speech is prevalent in $\Phi_{XX}(k)$, whereas noise is prevalent in $\Phi_{NN}(k)$, for each utterance, so the PSDs are calculated per utterance. The IBM for speech is set as

$$M_X(\ell,k) = \begin{cases} 1, & \dfrac{\|X(\ell,k)\|}{\|N(\ell,k)\|} > 10^{\,\mathrm{th}(k)} \\ 0, & \text{otherwise,} \end{cases} \quad (4.2)$$

and the NN is trained to estimate this mask from a noisy spectrum. The mask for noise can be calculated from the speech mask as

$$M_N(\ell,k) = 1 - M_X(\ell,k), \quad (4.3)$$

or estimated by the NN in the same way as the speech mask. In the latter case, the number of nodes in the output layer of the NN is doubled to generate both speech and noise masks at the same time. The GEV beamformer [16] is obtained by maximizing the SNR for each frequency bin as

$$\mathbf{F}_{\mathrm{GEV}}(k) = \underset{\mathbf{F}}{\arg\max}\;\frac{\mathbf{F}^{\mathrm{H}}(k)\Phi_{XX}(k)\mathbf{F}(k)}{\mathbf{F}^{\mathrm{H}}(k)\Phi_{NN}(k)\mathbf{F}(k)}. \quad (4.4)$$
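A compact sketch of this back end, assuming estimated masks are already available: mask-weighted PSD estimation per (4.1), and (4.4) solved per frequency bin as the principal eigenvector of $\Phi_{NN}^{-1}\Phi_{XX}$. Function names and the diagonal loading are my additions:

```python
import numpy as np

def masked_psd(Y, mask):
    """Mask-weighted cross-PSD matrices per (4.1).
    Y: (L, K, C) complex STFT frames, mask: (L, K). Returns (K, C, C)."""
    num = np.einsum('lk,lkc,lkd->kcd', mask, Y, Y.conj())
    return num / np.maximum(mask.sum(axis=0), 1e-8)[:, None, None]

def gev_weights(phi_xx, phi_nn, loading=1e-9):
    """GEV beamformer per (4.4): per-bin principal eigenvector of
    inv(Phi_NN) @ Phi_XX (diagonal loading keeps Phi_NN invertible)."""
    K, C, _ = phi_xx.shape
    W = np.zeros((K, C), dtype=complex)
    for k in range(K):
        mat = np.linalg.solve(phi_nn[k] + loading * np.eye(C), phi_xx[k])
        vals, vecs = np.linalg.eig(mat)
        W[k] = vecs[:, np.argmax(vals.real)]
    return W

# Toy check: speech arriving in phase at both mics against white noise; the
# beamformer should steer towards the in-phase direction.
phi_xx = np.array([np.outer([1.0, 1.0], [1.0, 1.0])], dtype=complex)
phi_nn = np.array([np.eye(2)], dtype=complex)
W = gev_weights(phi_xx, phi_nn)
```

Note that the GEV solution is only defined up to a per-bin scale and phase; in practice [40] applies a separate normalization to reduce the resulting distortion.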
4.1.2. A Fully deep learning-based Approach
Typical beamformers often use separate modules for each step. First, spatial information acquisition is performed, i.e., source localization in [12], [13] or RTF estimation in [14], [15]. After that, a linear filter, which is what the word "beamformer" generally denotes, is applied. In addition to this linear filter, a further filter with an arbitrary structure may be applied to improve the enhancement performance or to compensate for artefacts caused by the linear filter. This structure leaves room for optimization when the goal is to improve ASR performance, because the filter used in enhancement is calculated independently of the acoustic model of the ASR system. In an early effort to optimize the enhancement system and the acoustic model jointly, structures such as likelihood-maximizing beamforming were proposed [42]. As the NN-based acoustic model has recently become mainstream in ASR, these approaches have continued by optimizing the NN-based acoustic model jointly with the preceding enhancement module. In [43], the NN for time–frequency mask estimation in NN-GEV [40] is jointly optimized with the NN for acoustic modeling in ASR. Even though this system keeps using a linear filter calculated from a traditional optimization criterion (i.e., MVDR), the training algorithm is called "end-to-end" training. This naming seems reasonable, since the NN-based mask estimation comes at the front end of the enhancement process and is jointly trained with the acoustic model that follows the enhancement process.
A more unified system, better suited to the name "end-to-end", in which all steps between the input multichannel signal and the acoustic model for ASR are connected by one NN, was presented in [44]. This system takes the multichannel raw time-domain waveform directly and performs the speech enhancement operations, including spatial filtering and frequency decomposition, within its NN layers. This approach has been shown to be effective in utilizing the multichannel input and learning the desired spatial selectivity, and it seems quite promising given that deep learning has recently overtaken existing methods in a variety of fields. In [44], much of the information from multiple microphones is exploited by the lower layers, while the deeper layers perform an operation similar to that applied in the conventional single-channel case. The structure of the

deeper layers is known as the convolutional, long short-term memory, deep NN (CLDNN) [44], [45], and is shown in Figure 4.2.
Figure 4.2 Time convolution and CLDNN layers
When the CLDNN is applied to a single channel [45], the first layer in Figure 4.2 is a time-convolutional layer over the raw time-domain waveform, which can be thought of as a finite-impulse-response filterbank followed by a nonlinearity. By placing multiple filters in the time-convolution layer, this layer can approximate acoustic filterbanks that perform spectral filtering. The output of this layer is subsequently treated as a time–frequency representation of the input signal. The frame-level outputs from the first layer are fed to a following convolutional layer to model the underlying relationship between adjacent frequency bands. The output of the frequency convolution is passed to a stack of three LSTM
layers, which model the signal across long time scales. Finally, a single fully connected
DNN layer is applied to estimate the context-dependent state of the acoustic model in the
ASR system. When the CLDNN is applied to the multichannel signal, i.e., in the fully deep learning-based beamforming system, the lower layers are modified to use the spatial information encoded in the input signals. The structural changes are mainly in the lower layers, especially in the first layer [44], [46]. The modified structure for the multichannel signal is shown in Figure 4.3. The first layer, denoted tConv1, mimics filter-and-sum beamforming: it filters the signal from each microphone with a finite impulse response filter and sums the results. This process can be written as follows:

$$y^{1}[t,p] = \sum_{c=0}^{C-1}\sum_{n=0}^{N_1-1} h^{1}_{c}[n,p]\,x_{c}[t-n], \quad \text{for } p = 1,\ldots,P,\; P = 10. \quad (4.5)$$

Here $N_1$ is the number of taps in the filter $h^{1}_{c}[n,p]$ and is set to $N_1 = 80$ at a sampling rate of 16 kHz (5 ms). The filtering is performed on the raw input waveform $x_c[t]$, where $c$ is the input channel index. Equation (4.5) can be interpreted as a delay-and-sum beamformer applied to the $C$ channel signals when the filter $h^{1}_{c}[n,p]$ is set to a delayed impulse. If $C = 2$ is assumed, the look direction of this delay-and-sum beamformer is specified by the index $p$ and the corresponding time delay of arrival, determined by the relative delay along the first dimension between $h^{1}_{1}$ and $h^{1}_{2}$.
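A minimal sketch of (4.5), with the filters initialized as delayed impulses so that tConv1 acts as a delay-and-sum beamformer. The per-direction delay grid and signal lengths are illustrative assumptions:

```python
import numpy as np

C, P, N1 = 2, 10, 80                      # channels, look directions, taps
h1 = np.zeros((C, N1, P))
for p in range(P):                        # hypothetical delay grid: p samples
    h1[0, p, p] = 1.0                     # delay channel 1 by p samples
    h1[1, 0, p] = 1.0                     # channel 2 passes through

def tconv1(x, h1):
    """Multichannel time convolution of (4.5): filter each channel per look
    direction, then sum over the channels."""
    C, N1, P = h1.shape
    T = x.shape[1]
    y = np.zeros((T, P))
    for p in range(P):
        for c in range(C):
            y[:, p] += np.convolve(x[c], h1[c, :, p])[:T]
    return y

# With a 3-sample inter-channel delay, the matched look direction (p = 3)
# sums coherently and therefore carries the most output energy.
rng = np.random.default_rng(0)
s = rng.standard_normal(200)
x = np.stack([s, np.concatenate([np.zeros(3), s[:-3]])])
y = tconv1(x, h1)
best_p = int(np.argmax((y ** 2).sum(axis=0)))
```

Once the filters are trained rather than fixed, this energy-versus-direction interpretation no longer strictly holds, as the text notes.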

Figure 4.3 Factored multichannel raw waveform CLDNN architecture for P look directions [46].

This structure is called "factored" because this layer is expected to perform only spatial filtering, while the ability to perform spectral filtering is factored out to the next layer. The unfactored structure, in which spatial and spectral filtering are performed in one layer, differs from the factored one only in the length of the first dimension of $h^{1}_{c}[n,p]$, which is set to a longer filter (e.g., $N_1 = 400$). The factoring effect achieved by changing the filter length, and the improvement in performance realized by factoring, are shown in [46]. $h^{1}_{c}[n,p]$ can be either fixed or trained, and the experimental results in [46] show that training it improves performance compared to keeping it fixed. Note that by training the tConv1 layer, $h^{1}_{c}[n,p]$, which is initialized as a delayed impulse, can take an arbitrary shape, so that tConv1 is no longer a delay-and-sum beamformer. The output of tConv1 has shape $M \times 1 \times P$, where the first and last dimensions correspond to the sample index within a small window ($M = 560$) and the look-direction index, respectively. The second layer is also a time-convolution layer and is named tConv2. The time-convolution process can be written as follows:

$$\mathrm{tConv2}[t,p,f] = \sum_{n=0}^{N_2-1} h^{2}[n,f]\,y^{1}[t-n,p], \quad \text{for } p = 1,\ldots,P,\; f = 1,\ldots,F,\; P = 10,\; F = 128. \quad (4.6)$$

This can be interpreted as decomposing a time signal into $F$ different time signals, each of which corresponds to a different frequency band. The decomposition is repeated for each of the $P$ look directions using the same convolution filters. This layer's filters are

denoted $h^{2} \in \mathbb{R}^{N_2 \times 1 \times F}$, where the dimensions correspond to the taps (the sample index in the time-convolution filter), a look-direction dimension of 1 to indicate sharing across the $P$ directions, and the frequency band index. In the convolution, no zero-padding is performed at the edges of the signal (i.e., "valid" padding), so the output has shape $\mathrm{tConv2} \in \mathbb{R}^{(M-N_2+1) \times P \times F}$. The filter size $N_2$ is set larger than $N_1$ (e.g., $N_2 = 400$) to encourage this layer to perform spectral filtering with a better frequency resolution than the first layer. Next, the maximum of every 160 samples is pooled along the first dimension of tConv2. As a result, $z^{1}$ has shape $1 \times P \times F$. By selecting one sample per short interval, the short-time information, i.e., changes in the characteristics of the input signal within the range of 160 samples, is discarded. After the pooling, a rectified nonlinearity followed by a stabilized logarithm compression, meaning a logarithm with a small additive offset, $\log(\cdot + 0.01)$, is applied to generate the output:

$$z^{\mathrm{nonlinear}} = \log\!\big(\mathrm{ReLU}(z^{1}) + 0.01\big). \quad (4.7)$$
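The shapes flowing through this front end can be traced with a short sketch of (4.6)–(4.7); for simplicity the pooling here collapses the whole remaining time axis to one value per band and direction, which for M = 560 and N2 = 400 (161 remaining samples, pooling size 160) closely matches the behavior described above:

```python
import numpy as np

M, P, N2, F = 560, 10, 400, 128           # window, directions, taps, bands
rng = np.random.default_rng(0)
y1 = rng.standard_normal((M, P))          # tConv1 output for one window
h2 = rng.standard_normal((N2, F))         # spectral filters shared over P

# tConv2 per (4.6): 'valid' time convolution, shared across look directions.
tconv2 = np.zeros((M - N2 + 1, P, F))     # (161, 10, 128)
for f in range(F):
    for p in range(P):
        tconv2[:, p, f] = np.convolve(y1[:, p], h2[:, f], mode='valid')

# Pooling collapses the remaining time axis, then ReLU and the stabilized
# log of (4.7) produce one feature value per look direction and band.
pooled = tconv2.max(axis=0)               # (10, 128)
z = np.log(np.maximum(pooled, 0.0) + 0.01)
```

The resulting $P \times F$ map is what the subsequent CLDNN layers consume.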
The output of the nonlinear layer is then passed to the CLDNN. First, the fConv layer, a convolutional NN, applies frequency convolution. The frequency-convolution process can be written as

$$\mathrm{fConv}[p,j,c] = \sum_{f=0}^{F_w-1} w_{c}[f]\,z^{\mathrm{nonlinear}}[p, j+f], \quad (4.8)$$

for $p = 1,\ldots,P$, $c = 1,\ldots,C_f$, and $j = 1,\ldots,F-F_w+1$, with $P = 10$, $C_f = 256$, $F = 128$, and $F_w = 8$.
Nonoverlapping max pooling along frequency, with a pooling size of 3, is applied to $\mathrm{fConv}[p,j,c]$, and as a result the range of the frequency index in the second dimension is reduced to $j = 1,\ldots,\lfloor(F-F_w+1)/3\rfloor$. At this point, the output produces $P \times C_f \times \lfloor(F-F_w+1)/3\rfloor$ feature values for each frame. To reduce this large feature to a vector of affordable size, a 256-dimensional linear low-rank layer is applied: the three-dimensional feature is rearranged into a vector and an affine projection is applied to produce a vector of 256 values. The output of this step is passed to three LSTM layers with 832 cells and a 512-unit projection layer. Finally, one DNN layer with 1,024 hidden units is applied.
4.2. Motivation
The fully-DNN-based approach relies on the NN's modeling capability to solve these problems: the modeling of the relationship between the input signals, both spatially and spectrally, is done entirely by the network. Among the recent fully-DNN-based schemes, there has been an attempt to specify where in the network spatial and spectral filtering should be performed [46]. However, in that study, control over the NN amounted to observing where selectivity emerged along the spatial or spectral axis after adjusting the width of the convolutional layer. Due to the nature of their structure, most fully-DNN-based approaches have an unclear boundary between the utilization of spatial information and of spectral information. Therefore, to generalize to various spatial configurations, a training database covering all the corresponding diversity is necessary. Training data that generalizes the type, direction, and spatial characteristics of noise must consist of multichannel data, and collecting such data is much harder than collecting single-channel data. Moreover, feeding the multichannel signals to a large network [36], [47] usually makes the network too flexible for its parameters to be trained reliably.
In most existing studies, the constraints applied to the evaluation data were decided according to the characteristics of the proposed algorithms, making it difficult to compare their performances. Representative examples of such constraints include the presence of an RTF estimation interval and the assumption of a certain degree of noise stationarity. The NN-GEV method, on the other hand, has proved its effectiveness by performing well in a noise-robust ASR competition [48]–[51], where performance comparison is possible under identical conditions. Because this competition assumes a realistic noise environment, achieving high ranks there is meaningful. The NN-GEV method has the advantage of being easy to implement irrespective of the sensor configuration. In addition, there is no need to implement a separate sound source localization system, and voice activity detection is handled by the NN already included in the system. Despite these advantages, there is still room for improvement.

The method of applying the NN to each channel independently has the advantage of not depending on the microphone positions. On the other hand, this property leads to the disadvantage that the NN is not used at all for spatial information modeling. In realistic situations, where noise and desired speech naturally occupy the same time–frequency range, estimating time–frequency masks based only on signal intensity may yield insufficient performance. Moreover, signal intensity depends strongly on signal characteristics, so the time–frequency mask estimation is likely to be unstable when unseen noise is encountered. To deal with this problem, it is necessary to overcome the limitation of using only single-channel signal intensity for deep learning; the new method should be able to use multichannel phase information and its temporal characteristics. An NN topology that can handle such complex relationships between the phase information of the signal components is proposed in this section. Unlike the fully-DNN-based methods, the proposed method calculates the beamformer weights based on the MVDR criterion and limits the role of the NN to the estimation of the target-signal and noise-signal statistics. This reduces the degrees of freedom of the NN weights and the size of the required training database.

4.3. Proposed system with phase-encoded input for NN-based mask
estimation
4.3.1. Representing Phase Information in Real Numbers
Single-channel speech enhancement systems usually enhance only the magnitude spectrum and reuse the noisy phase during signal reconstruction. This is based on the belief that phase enhancement does not lead to significant improvements in equivalent SNR [52]. However, the spatial diversity of a multichannel signal is typically given as the TDOA of the time signal, which corresponds to a phase shift in the time–frequency domain. The inadequacy of deep learning-based multichannel systems compared to single-channel systems is partially attributed to the need to model phase information, whereas NNs have traditionally processed real-valued physical data with real-valued weights. Efforts to model the spatial information in a multichannel signal are directly or indirectly connected to modeling the phase information of complex numbers. There has been research on complex-valued NNs; however, their appropriateness is still a controversial issue [53]. The most typical method with a real-valued NN is to represent a complex-valued time–frequency domain signal by two channels containing the real and imaginary components respectively, forcing the NN to approximate complex algebra to decode the phase difference between the two signals [54]. A front-end structure for the NN is designed in this section to effectively transfer the phase information of the input signal to the real-valued NN. The relation of the two

complex numbers corresponding to the two-channel signal in the STFT domain, $X_1(\ell,k)$ and $X_2(\ell,k)$, for the $\ell$-th frame and the $k$-th frequency index, is expanded to a real-valued vector $\mathbf{z}(\ell,k,n)$, producing an NN-friendly representation of magnitude and phase difference as follows:

$$z(\ell,k,n) = |X_2(\ell,k)|\cos\!\left(\frac{2\pi n}{N} + \angle X_1(\ell,k) - \angle X_2(\ell,k)\right), \quad n = 1,\ldots,N,\; N = 20. \quad (4.9)$$
This process results in a three-dimensional extension of the existing two-dimensional time–frequency information. It avoids phase-related modeling on information represented by separate real and imaginary values, which would require computations such as inner products and angular differences. In addition, the proposed feature is characterized by the fact that its value reflects changes of phase and intensity in real values at once. As a result, the distance due to the phase difference also decreases when the intensity is low. This is effective in resolving the instability of phase-only features such as the interaural phase difference (IPD), which is frequently observed in low-signal-intensity regions. The extracted feature for each time–frequency bin, which is an $N$-dimensional vector, is fed to feed-forward layers that transform the features into a space that makes the output easier to model. The feed-forward weights are shared across several groups of adjacent frequency bands to reduce the number of parameters to be trained. The NN topology used to evaluate the effectiveness of the proposed front-end structure is shown in Figure 4.4. The frequency bands were uniformly divided into 10 groups. The outputs from the feed-forward layers are fed to deeper layers that have the same topology as [40].
Figure 4.4 NN topology in the proposed NN-based mask estimation system
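One plausible realization of the phase-encoded front-end feature is sketched below; the exact offset grid $2\pi n/N$ is my reconstruction of (4.9) and should be treated as an assumption:

```python
import numpy as np

def phase_encoded_feature(X1, X2, N=20):
    """Expand each two-channel STFT bin into N real values that couple the
    magnitude of channel 2 with the inter-channel phase difference.
    The sampling grid 2*pi*n/N is an assumption of this sketch."""
    ipd = np.angle(X1) - np.angle(X2)        # inter-channel phase difference
    n = np.arange(1, N + 1)
    # Shape (L, K, N); quiet bins automatically get near-zero feature values,
    # avoiding the instability of pure-IPD features at low intensity.
    return np.abs(X2)[..., None] * np.cos(2 * np.pi * n / N + ipd[..., None])

# Identical channels: the phase-difference term vanishes and the feature is a
# pure cosine scaled by the magnitude.
X = np.array([[1.0 + 1.0j]])
feat = phase_encoded_feature(X, X)
```

Because the magnitude multiplies the cosine, the feature distance induced by a given phase difference shrinks with intensity, matching the motivation given in the text.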

4.3.2. Bidirectional Long Short-Term Memory (BLSTM) layer
The output of the front-end stage is fed into a BLSTM layer, as in the mask estimation process in [40]. The bidirectional structure [55] is obtained by training two LSTMs simultaneously in the positive and negative time directions; at the output, the two results are concatenated before being fed to deeper layers. The LSTM structure [56] is able to learn to store information over time intervals. It was proposed in [57] to overcome the problem of decaying error back-flow in RNNs, which makes them hard to apply when learning to store information over extended time intervals. The LSTM can learn to bridge time intervals much longer than an RNN can, without losing short time-lag capabilities. The basic unit of an LSTM network is the memory block, which can contain one or more memory cells. Three adaptive, multiplicative gating units, the input, forget, and output gates, are shared by all cells in the block. Each memory cell stores its current state at its core and has a recurrently self-connected linear unit called the constant error carousel (CEC). The structure is shown in Figure 4.5. The input, forget, and output gates can be trained to decide what information to store in the memory and when to read it out. The gate activations are calculated using the current input from the lower layer and the previous output of the memory cell. In this system, the output of the front-end layers, i.e., the activation of the lower layer, is fed to the LSTM layer. The output of the front-end layer is given as an $N_{\mathrm{front3}}$-dimensional vector $\mathbf{z}^{\mathrm{front3}}(k)$ for the $k$-th frame.

Figure 4.5 LSTM memory block with one cell (rectangle)
Similarly, the cell state of the previous frame is denoted $\mathbf{z}^{\mathrm{cstate}}(k-1)$. The activations of the input ($\mathbf{i}(k)$), forget ($\mathbf{f}(k)$), and output ($\mathbf{o}(k)$) gates are calculated using (4.10) for each frame index $k$:

$$\mathbf{i}(k) = \mathrm{sigmoid}\big(\mathbf{W}_{iz}\mathbf{z}^{\mathrm{front3}}(k) + \mathbf{W}_{ic}\mathbf{z}^{\mathrm{cstate}}(k-1) + \mathbf{b}_i\big)$$
$$\mathbf{f}(k) = \mathrm{sigmoid}\big(\mathbf{W}_{fz}\mathbf{z}^{\mathrm{front3}}(k) + \mathbf{W}_{fc}\mathbf{z}^{\mathrm{cstate}}(k-1) + \mathbf{b}_f\big)$$
$$\mathbf{o}(k) = \mathrm{sigmoid}\big(\mathbf{W}_{oz}\mathbf{z}^{\mathrm{front3}}(k) + \mathbf{W}_{oc}\mathbf{z}^{\mathrm{cstate}}(k-1) + \mathbf{b}_o\big) \quad (4.10)$$

In the argument of each sigmoid function, the second term represents the connection between the gate and the cell state; this connection is named the peephole connection [57]. The peephole connections allow the gates to use the current cell state even when the output gate is closed. $\mathbf{b}_i$, $\mathbf{b}_f$, and $\mathbf{b}_o$ are the bias values of the sigmoid functions.

The cell state for the current frame is then calculated by adding the gated cell input to the previous state, which is gated by the forget gate:

$$\mathbf{z}^{\mathrm{cstate}}(k) = \mathbf{i}(k) \odot \tanh\big(\mathbf{W}_{cz}\mathbf{z}^{\mathrm{front3}}(k) + \mathbf{W}_{cc}\mathbf{z}^{\mathrm{cout}}(k-1) + \mathbf{b}_c\big) + \mathbf{f}(k) \odot \mathbf{z}^{\mathrm{cstate}}(k-1). \quad (4.11)$$

Here, $\odot$ denotes gating conducted by element-wise multiplication of vectors. The cell output is finally calculated by multiplying the cell state by the output gate activation:

$$\mathbf{z}^{\mathrm{cout}}(k) = \mathbf{o}(k) \odot \tanh\big(\mathbf{z}^{\mathrm{cstate}}(k)\big). \quad (4.12)$$

The output of the cell is fed to fully connected DNN layers to estimate the output labels.
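One step of the memory block in Figure 4.5 can be sketched as follows. Parameter names and sizes are mine; the gates see the current input and, through peepholes, the previous cell state, while the cell input also sees the previous cell output, following the form of (4.10)-(4.12):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(z_in, c_prev, h_prev, p):
    """One LSTM memory-block step with peephole connections (a sketch)."""
    i = sigmoid(p['Wi'] @ z_in + p['pi'] * c_prev + p['bi'])   # input gate
    f = sigmoid(p['Wf'] @ z_in + p['pf'] * c_prev + p['bf'])   # forget gate
    o = sigmoid(p['Wo'] @ z_in + p['po'] * c_prev + p['bo'])   # output gate
    g = np.tanh(p['Wc'] @ z_in + p['Rc'] @ h_prev + p['bc'])   # cell input
    c = i * g + f * c_prev                                     # CEC update
    h = o * np.tanh(c)                                         # cell output
    return h, c

# Tiny example: 3-dimensional input, 2 cells, small random parameters.
rng = np.random.default_rng(0)
p = {k: 0.1 * rng.standard_normal((2, 3)) for k in ('Wi', 'Wf', 'Wo', 'Wc')}
p['Rc'] = 0.1 * rng.standard_normal((2, 2))
for k in ('pi', 'pf', 'po', 'bi', 'bf', 'bo', 'bc'):
    p[k] = 0.1 * rng.standard_normal(2)
h, c = lstm_step(np.ones(3), np.zeros(2), np.zeros(2), p)
```

The additive form of the cell-state update is what keeps the error flow from decaying, which is the property the text attributes to the CEC.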
4.4. The ASR stage in the proposed system
4.4.1. Acoustic model structure
For ASR performance assessment, one of the recently proposed ASR systems in the Kaldi speech recognition toolkit [58] is used. This system adopts the time-delay NN architecture, which models long-term temporal dependencies. Because the time-delay NN contains only feed-forward layers, it can easily be parallelized, unlike RNNs, whose parallelization is hampered by the dependencies between the time frames processed during training. Even though it consists of feed-forward layers only, the time-delay NN can model long-term temporal dependencies in short-term speech features by processing a wider temporal context. To achieve this, a sufficient number of frames must be fed to the NN at a time to model the temporal relationships present in speech, which increases the number of parameters to be trained. In the time-delay NN structure, the initial layer learns an affine transform that covers a small window of frames; the same affine transform is then applied repeatedly to successive windows. After this operation has been performed over a sufficient span, the activations obtained from each window are fed to the next layer in a form similar to the time frames at the initial layer. The structure is shown in Figure 4.6. In this structure, the deeper layers process hidden activations from a wider temporal context, while each layer keeps a relatively small number of parameters by sharing the affine transform weights across windows. In a typical time-delay NN, the window of each layer is constructed at every time step, which results in large overlaps between the windows computed at neighboring time steps.

Figure 4.6 Computation in TDNN with subsampling (red) and without subsampling
(blue + red) in [59]
In [59], subsampling based on the assumption that neighboring activations are correlated is proposed. Subsampling is performed by omitting the connections that cover the central frames of each window. As a result, gaps exist between the frames fed to the next layer, unlike in the typical time-delay NN, which splices together contiguous temporal windows of frames. Table 4.1 shows the subsampling configurations and compares them with a typical time-delay NN. For example, {-1, 2} denotes splicing the frames at relative time steps -1 and 2, while [-1, 2] means splicing the frames corresponding to -1, 0, 1, and 2.
Table 4.1 Context specification of the time-delay NN in [59]

layer   Input context   Input context with subsampling
1       [-2, 2]         [-2, 2]
2       [-1, 2]         {-1, 2}
3       [-3, 3]         {-3, 3}
4       [-7, 2]         {-7, 2}
5       {0}             {0}
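Since the per-layer offsets of successive layers simply add, the total input context the network consumes can be computed from Table 4.1. The sketch below checks that subsampling leaves the overall receptive field unchanged at [-13, 9] frames:

```python
# Per-layer splicing offsets from Table 4.1: full contexts on the left,
# subsampled contexts on the right.
full = [[-2, -1, 0, 1, 2],
        [-1, 0, 1, 2],
        [-3, -2, -1, 0, 1, 2, 3],
        [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2],
        [0]]
sub = [[-2, -1, 0, 1, 2], [-1, 2], [-3, 3], [-7, 2], [0]]

def total_context(layers):
    """Offsets of successive layers add, so the overall receptive field is
    the sum of the per-layer extremes."""
    return sum(min(l) for l in layers), sum(max(l) for l in layers)
```

Subsampling thus only thins out which intermediate activations are computed; the span of input frames seen by the output is identical in both configurations.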
By using the subsampling scheme, the computational load of both the forward pass and backpropagation is reduced. Moreover, the number of parameters in the hidden layers is significantly reduced. The p-norm nonlinearity [60], which is a dimension-reducing nonlinearity, is used as the activation of each layer output. The p-norm nonlinearity is proposed in [60] as a generalization of the maxout unit and is defined as

y = \| \mathbf{x} \|_p = \left( \sum_i |x_i|^p \right)^{1/p}, (4.13)

where the value of p is configurable.
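Eq. (4.13) reduces each group of inputs to a single output; a minimal numpy sketch (the group size and function name are illustrative assumptions) is:

```python
import numpy as np

def pnorm(x, group_size, p=2):
    """Dimension-reducing p-norm unit (Eq. 4.13): split the layer output into
    non-overlapping groups and replace each group by its p-norm."""
    groups = np.asarray(x, dtype=float).reshape(-1, group_size)
    return (np.abs(groups) ** p).sum(axis=1) ** (1.0 / p)
```

With p = 2 and group size 2 this halves the dimensionality, which is why the layer is called dimension-reducing.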

4.4.2. Input feature
Mel-frequency cepstral coefficients (MFCCs) are extracted from each frame, with a frame length of 35 ms and a shift of 10 ms at a 16 kHz sampling rate. The MFCC features are fed to the NN without the adaptation or mean normalization that are usually required in ASR systems. Instead of applying feature compensation at the front end of the ASR system, an i-vector estimated in a left-to-right manner, i.e., using only the input frames up to each time step, is fed to the NN together with the MFCC features. This is based on the idea that the i-vector gives the NN as much as it needs to know about the speaker properties.
4.4.3. Training Database for ASR system
To ensure that the proposed system can be used for a general-purpose ASR system, the ASR system is trained on a clean speech database. LibriSpeech [61] is used as the training database; models built on it are known to achieve higher performance on the standard WSJ test set than models built on the Wall Street Journal (WSJ) corpus itself. The corpus is derived from audiobooks that are part of the LibriVox project and contains 1,000 hours of speech sampled at 16 kHz. For convenience, the training data is provided divided into several subsets, but the entire training set is used in this experiment. The speakers in the corpus were ranked according to WER and divided roughly in the middle, with the lower-WER speakers designated as "clean" and the higher-WER speakers designated as "other." The development set was formed by drawing 20 male and 20 female speakers from the "clean" pool at random. As a result, the dev set contains about 5 h 20 min of speech, with approximately 8 min per speaker.
4.5. Experiments
4.5.1. Database
The performance of the proposed system was assessed with the database released for the 2016 CHiME challenge [9], held to promote research at the interface of signal processing and automatic speech recognition. This challenge targets automatic speech recognition performance in a real-world, commercially motivated scenario: a person talking to a tablet device that has been fitted with a six-channel microphone array. The configuration of the recording device is shown in Figure 4.7. Recordings were made using an array of six omnidirectional microphones mounted in holes drilled through a custom-built frame surrounding a tablet computer. The frame is designed to be held in a landscape orientation and has three microphones spaced along both the top and bottom edges. All microphones face forward (i.e., towards the speaker holding the tablet) except the top-center microphone (mic 2), which faces backwards. Two types of noisy speech data, real and simulated, are provided for each noise environment. The real data consists of sentences from the WSJ0
corpus spoken live in the environments.
Figure 4.7 The microphone array geometry in the CHiME challenge 2016 [9]. All
microphones face forward except microphone 2.
The simulated data was generated by simulating six-channel utterances from single-channel clean sources and mixing them with background recordings of the environments. For ASR evaluation, the data is divided into training, development, and test sets. The scenario of this challenge is ASR for a multi-microphone tablet device used in everyday environments. Four varied environmental noise types were recorded: cafe (CAF), street junction (STR), public transport (BUS), and pedestrian area (PED). The speech recordings consist of utterances made by 12 US English talkers (6 male and 6 female) ranging in age from approximately 20 to 50 years. Recordings were made first in an acoustically isolated (but not anechoic) booth (BTH) and then in each of the four noisy target environments. About 100 sentences were recorded in each location. The speakers
were asked to use the tablet in whatever way felt natural and comfortable; the talker-to-tablet distance varied but was typically around 40 cm. Utterances were recorded at 48 kHz and downsampled to 16 kHz and 16 bits. Multiple sentences were recorded continuously (embedded), and for each continuous recording session an annotation file recorded the start and end time of each utterance with a precision of approximately 100 ms. Isolated utterances were extracted from the continuous audio according to the annotated start and end times, with 300 ms of padding prior to the utterance included during the extraction.
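The extraction step above can be sketched as follows; the function name and argument layout are our own assumptions, not part of the CHiME tooling:

```python
def extract_utterance(audio, start_s, end_s, fs=16000, pad_s=0.3):
    """Cut one utterance out of a continuous recording using the annotated
    start/end times, including pad_s seconds of padding before the utterance."""
    start = max(0, int(round((start_s - pad_s) * fs)))
    end = min(len(audio), int(round(end_s * fs)))
    return audio[start:end]
```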
In the case of simulated development or test data, the noise source is obtained by removing the speech component from recorded noisy speech, while in the case of simulated training data, a separately recorded noise background is used. The speech component in each microphone is estimated by applying an estimated impulse response between the tablet microphones and a close-talking recording; the close-talking recording is acquired simultaneously with a headset microphone while the noisy speech is recorded. In the impulse response estimation process, the microphone signals are represented in the complex-valued STFT domain using half-overlapping 256-sample sine windows. After partitioning the time frames into variable-length, half-overlapping, sine-windowed blocks such that the amount of speech is similar in each block, the per-block STFT-domain IRs between the close-talking microphone and the other microphones are estimated in the least-squares sense in each frequency bin [62]. The SNR used for mixing the speech and noise sources is also estimated using the speech component estimated from the real recorded noisy speech. The estimated SNRs had an average of
approximately 5 dB. To generate the simulated speech source in the simulation process, a time-varying steering vector that models the direct sound between the speaker and the microphones is convolved with a clean speech signal. In the case of training data, the clean speech signal is taken from the original WSJ0 recordings, while in the case of development and test data, it is taken from the booth recordings. In either case, the simulated speech signal is rescaled such that the SNR matches that of the real recording. The steering vector for each utterance in the real recordings is tracked using the steered response power phase transform (SRP-PHAT) algorithm. To estimate the steering vector, the signals are represented in the complex-valued STFT domain using half-overlapping sine windows of 1024 samples. The spatial position of the speaker is encoded by a nonlinear SRP-PHAT pseudo-spectrum for each frame. To stabilize the estimated position, the peaks of the SRP-PHAT pseudo-spectrum are then tracked over time using the Viterbi algorithm. The participants of the challenge were provided with baseline systems for front-end signal enhancement and state-of-the-art GMM/DNN-based ASR. Time-varying MVDR beamforming with diagonal loading [63] was selected as the baseline front-end signal enhancement. In the GMM version of the baseline ASR system, the MFCC feature is used. First, MFCCs of order 13 are extracted for each frame, and three frames each of left and right context are concatenated to form a 91-dimensional feature vector. The concatenated features are compressed to 40 dimensions using linear discriminant analysis (LDA). After that, maximum likelihood linear transformation (MLLT) and feature-space maximum likelihood linear regression (fMLLR) with speaker-adaptive training (SAT) are applied. In acoustic modeling, 2500 tied triphone HMM states are
modeled by a total of 15,000 Gaussians. In the DNN-HMM hybrid version of the baseline ASR system, a feed-forward NN with 7 layers and 2048 units per hidden layer is used as the acoustic model. Five frames of MFCC features are fed to the NN at a time. Pretraining using restricted Boltzmann machines, cross-entropy training, and sequence-discriminative training using the state-level minimum Bayes risk (sMBR) criterion are used in the training procedure.
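The splicing step of the baseline GMM front end (13-dim MFCCs with three frames of context on each side, giving 91 dimensions before the LDA projection) can be sketched as follows; the edge-padding choice and the function name are our assumptions:

```python
import numpy as np

def splice(feats, context=3):
    """Concatenate each frame with `context` frames of left and right context
    (13-dim MFCCs with context=3 give 91-dim vectors); edges are padded by
    repeating the first/last frame."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    return np.stack([padded[t:t + width].reshape(-1)
                     for t in range(feats.shape[0])])
```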
4.5.2. Experimental settings
The performance of the proposed system was evaluated in terms of WER against two types of baseline systems. Speech enhancement is performed before ASR with a frame size and shift of 640 and 160 samples, respectively. It is not easy to compare the proposed method with the existing method on the same basis, since the proposed system requires more NN layers and consequently more parameters to be trained. To make a fair comparison, an evaluation using another approach, which represents the STFT values as two channels containing the real and imaginary components respectively (Baseline2), is performed along with the conventional NN-GEV system in [40] (Baseline1). In Baseline2, each frame of STFT values is fed to one fully connected layer with 624 nodes, and the output of this layer is fed to the BLSTM and deep layers that have the same structure as Baseline1. Since the input features, consisting of the real and imaginary parts of the STFT values, are fed to the NN after concatenating the input channels, the dimension of the input feature
is 1284. As a result, even when only one hidden layer is used, the parameters to be trained in Baseline2 far outnumber those in the proposed system. The proposed and Baseline2 systems are implemented for two-channel input signals, and the Baseline1 system is also evaluated using only two-channel input signals. The reasons for this are as follows. In the NN-GEV system proposed in [40] (Baseline1), the training of the NN can be conducted without considering the geometric configuration of the microphones, by applying deep learning-based modeling to each channel separately. This ensures the versatility of the system when applied to an unseen microphone structure. To maintain this advantage, the evaluation of the proposed system was also performed on an unseen microphone structure, by excluding the microphone pairs to be evaluated from the training data. More specifically, the input signal pairs corresponding to the microphone index pairs 1-3, 1-4, 1-5, 1-6, 3-4, 3-5, 3-6, and 4-6 are used in the training step, and the input signal pairs corresponding to microphone index pair 4-5 are used in the evaluation. The IBM label of the second channel index is used in training in this two-channel setting. The training and test sets are also subdivided into two types. In the noise-aware set, data from all four noise environments are used in training, and the same noise types are used in the evaluation. In the unseen-noise set, the NN is trained using the data from all but one of the four noise environments, and the evaluation is performed using the excluded noise type. By performing the evaluation in an unseen noise environment, the ability to generalize over various noise types can be examined; this generalization is a representative advantage of using spatial information.

4.5.3. Results
Note that the evaluation is performed in mismatched conditions in terms of noise
and RIRs, on the assumption that speech enhancement is performed without prior
knowledge of the environment.
Figure 4.8 WERs for both the real and simulated, development and test sets. Models
are trained on clean data and tested either before or after enhancement.
Enhancement combines two channels (mic idx 4, 5).

In the noise-aware experiment case, performance changes are hardly noticeable, confirming that the existing system (Baseline1) yields an almost saturated performance with the time-frequency mask-based structure. The performance changes obtained by using spatial information can be found in the evaluation results for the unseen noise environments. In the unseen-noise case, the proposed system shows improved performance, while the Baseline2 system shows no significant change. The robust performance in the unseen noise environments can be interpreted as a result of using additional information, i.e., spatial information. In addition, the evaluation results for Baseline2, which uses the same multichannel information, indicate that the proposed structure succeeds in modeling spatial information effectively compared with a simple concatenation of real and imaginary parts.
4.6. Conclusions
In this chapter, the recently proposed NN-GEV-based system was investigated. To improve the estimation of the second-order statistics of the desired speech and noise signals, the use of spatial information from the multichannel signal was proposed. In the proposed system, a feature derived to effectively transfer the phase information to the NN was introduced. By applying the proposed input layer structure, the effectiveness of deep learning for exploiting the spatial information embedded in the phase differences between the signals was demonstrated.

Chapter 5. Neural Network-based postfilter
5.1. Issues and related works
The SDA is one type of deep NN and can be trained to reconstruct the desired speech from a noisy input signal. The main idea behind SDAs is to use DNNs to model the relationship between clean and noisy features. Given the outstanding performance of SDAs in modeling acoustic characteristics, extending the applicable scope of deep learning to multichannel signals is being extensively studied in noise-robust ASR research. Successful results in DNN-based monaural speech enhancement have also inspired approaches that use a DNN as a single-channel noise reduction algorithm on the enhanced output of a beamformer, i.e., as a postfilter [38]. In this structure, the beamformer can perform spatial filtering before spectro-temporal filtering is performed by the DNN-based postfilter.
5.2. New Generalized Sidelobe Canceller with Denoising Auto-
Encoder for Improved Speech Enhancement
5.2.1. Motivation
The direct application of a DNN to noisy signals is problematic because the DNN is required to model several kinds of information, such as spatial information and channel differences as well as spectro-temporal characteristics, at the same time. In particular, modeling spatial information is fundamentally different from modeling the distribution of signal intensity along the frequency axis. Hence, modeling with one NN is likely to be inadequate in terms of training efficiency. As an alternative to the direct approach, a conventional beamformer can be introduced prior to the SDA. The spatial information used by the beamformer, in the form of ratios of acoustic transfer functions, i.e., RTFs, is characterized by the path between the speaker and each microphone. However, the modeling ability of the SDA is limited when it is applied to single-channel audio signals.
5.2.2. New Generalized Sidelobe Canceller to generate multichannel feature
An approach proposed in previous research is thus to have the beamformer exploit the spatial information in its processing all the way up to the final stage, in which the signals are combined. This final stage is replaced by the SDA, which removes any distortion or noise created by the beamforming process. It is assumed that the beamformer works on each frame instead of a whole utterance, so that the proposed algorithm can be applied to a real-time speech enhancement system. As estimating noise statistics before observing the entire utterance is difficult, this approach estimates them frame by frame, eliminating the need to model noise in advance. Since no prior information on the noise statistics is required, restricting the noise to predefined types can also be avoided. Thus, a GSC, which can estimate noise statistics adaptively and can be implemented using only the direction of the target speech, is used.
A standard GSC consists of a fixed beamformer (FBF), a blocking matrix (BM), and an adaptive noise canceller (ANC). The input to the m-th microphone can be expressed as

X_m(l,k) = H_m(l,k) S(l,k) + N_m(l,k), (5.1)

where S(l,k) is the target speech, and X_m(l,k) and N_m(l,k) are the received signal and the noise at the m-th sensor in the short-time Fourier transform domain, respectively; l is the frame index and k is the frequency bin index. The target speech undergoes filtering by an RIR prior to being captured by the sensors. The filtering corresponding to the m-th sensor is modeled by the finite impulse response H_m(l,k). In [2], the FBF \mathbf{W}_{\mathrm{FBF}}(l,k) is designed to project the input signals onto the subspace spanned by the RTF \tilde{\mathbf{H}}(l,k),

\mathbf{W}_{\mathrm{FBF}}(l,k) = \tilde{\mathbf{H}}(l,k) \left( \tilde{\mathbf{H}}^{H}(l,k) \tilde{\mathbf{H}}(l,k) \right)^{-1},
where \tilde{\mathbf{H}}(l,k) = \mathbf{H}(l,k) / H_1(l,k),
and \mathbf{H}(l,k) = [H_1(l,k) \cdots H_M(l,k)]^{T}, (5.2)

by setting the speech signal picked up by the first sensor, H_1(l,k) S(l,k), as the desired signal.
The BM is designed to project the input signals onto the orthogonal complement of the target speech RTF. The generalized eigenvector blocking matrix (GEVBM) [15], which can be calculated using the max-SNR beamformer \mathbf{F}_{\mathrm{SNR}}(l,k), is

\mathbf{B}^{H}(l,k) = \mathbf{I} - \frac{\mathbf{\Phi}_{NN}(l,k)\, \mathbf{F}_{\mathrm{SNR}}(l,k)\, \mathbf{F}_{\mathrm{SNR}}^{H}(l,k)}{\mathbf{F}_{\mathrm{SNR}}^{H}(l,k)\, \mathbf{\Phi}_{NN}(l,k)\, \mathbf{F}_{\mathrm{SNR}}(l,k)}. (5.3)

\mathbf{F}_{\mathrm{SNR}}(l,k) can be calculated as the eigenvector belonging to the largest eigenvalue of \mathbf{\Phi}_{NN}^{-1}(l,k) \mathbf{\Phi}_{XX}(l,k) during the RTF estimation period, i.e., when the noise is stationary. \mathbf{\Phi}_{NN}(l,k) and \mathbf{\Phi}_{XX}(l,k) are the PSDs of the noise source and the input signals, respectively. \mathbf{F}_{\mathrm{SNR}}(l,k) equals \mathbf{\Phi}_{NN}^{-1}(l,k) \mathbf{H}(l,k) multiplied by an arbitrary complex constant, so that the RTF can be calculated from it. However, in realistic conditions, a satisfactory estimate of \mathbf{F}_{\mathrm{SNR}}(l,k) is usually not feasible because the noise is nonstationary. In this case, the RTF is usually simplified as a pure time delay, and \mathbf{H}(l,k) is replaced with a delayed impulse response. The ANC filter weight \mathbf{G}(l,k) can be updated with an unconstrained adaptation to minimize the power of the output signal Y(l,k),

Y(l,k) = \mathbf{W}_{\mathrm{FBF}}^{H}(l,k) \mathbf{X}(l,k) - \mathbf{G}^{H}(l,k) \mathbf{U}(l,k),
where \mathbf{U}(l,k) = \mathbf{B}^{H}(l,k) \mathbf{X}(l,k)
and \mathbf{X}(l,k) = [X_1(l,k) \cdots X_M(l,k)]^{T}. (5.4)
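The per-bin GSC processing of Eqs. (5.1)-(5.4) can be sketched in numpy as follows. This is a simplified illustration, not the thesis implementation: a projection blocking matrix that nulls the RTF is used as a stand-in for the GEVBM of Eq. (5.3), and all names are our own:

```python
import numpy as np

def gsc_output(x, h_tilde, g):
    """One frequency bin of a GSC (sketch of Eqs. (5.2)-(5.4)).
    x: (M,) input X(l,k); h_tilde: (M,) RTF; g: (M,) ANC weights.
    A simple projection blocking matrix stands in for the GEVBM."""
    h = h_tilde.reshape(-1, 1)
    w_fbf = (h @ np.linalg.inv(h.conj().T @ h)).ravel()    # Eq. (5.2)
    M = len(x)
    B_H = np.eye(M) - (h @ h.conj().T) / (h.conj().T @ h)  # nulls the target RTF
    u = B_H @ x                                            # noise reference U(l,k)
    y = w_fbf.conj() @ x - g.conj() @ u                    # Eq. (5.4)
    return y, u
```

For a purely target-driven input x = h_tilde * s, the blocking matrix yields u = 0 and the FBF passes the target with unit gain, which is the orthogonality property discussed next.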
The noise reference signal \mathbf{U}(l,k) is the output of the BM; thus, it is orthogonal to the target speech RTF. Owing to this orthogonality, the target speech components in the output of the FBF are not supposed to be affected by the subtraction of the estimated noise, i.e., the output of the ANC filter. However, the assumption of perfect orthogonality between \mathbf{U}(l,k) and the target speech components in the output of the FBF is seldom met in real-life situations, for many reasons, e.g., the leakage signal in the blocking matrix. The distortion of the target signal components in the enhanced signal is a major issue in beamformer-based algorithms, and many approaches, including the use of an SDA as a postfilter, have been attempted. In this work, a GSC structure that enables the effective adoption of an SDA is proposed to deal with the distortion in the output of the beamformer.
The proposed system is illustrated in Figure 5.1.
Figure 5.1 The structure of the proposed GSC and SDA
In the proposed system, each individual ANC filter is adapted separately to minimize the power of each input signal after the noise component estimate is removed:

J_m(l,k) = \left| X_m(l,k) - \mathbf{G}_m^{H}(l,k) \mathbf{U}(l,k) \right|^{2}, (5.5)

where \mathbf{G}_m(l,k) is the ANC coefficient vector corresponding to the m-th channel. The FBF takes the enhanced channel signals from the separated ANC filters and compensates the RTFs to generate the multichannel output features:

Z_m(l,k) = W_{\mathrm{FBF},m}^{*}(l,k) \left( X_m(l,k) - \mathbf{G}_m^{H}(l,k) \mathbf{U}(l,k) \right), (5.6)

where W_{\mathrm{FBF},m}(l,k) is the m-th channel component of the fixed beamformer. Note that
the proposed structure has the same filter coefficients as a conventional GSC if the output signals are summed into one channel. The distortion caused by imperfect noise cancellation lies, for each frame, in a multichannel spectral domain, i.e., a two-dimensional space with a channel axis and a frequency axis. The SDA is expected to model the underlying relationship between the distortion and adjacent frequency bins in other frequencies and other channels.
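A sketch of the per-channel outputs of Eqs. (5.5)-(5.6); the array shapes and names are our assumptions, and the ANC adaptation itself is omitted:

```python
import numpy as np

def proposed_gsc_features(x, w_fbf, u, G):
    """Multichannel output features of the proposed GSC (Eq. 5.6).
    x: (M,) inputs X_m(l,k); w_fbf: (M,) FBF weights; u: noise reference
    U(l,k); G: (M, len(u)) ANC weights, row m adapted separately per channel."""
    return np.array([w_fbf[m].conj() * (x[m] - G[m].conj() @ u)
                     for m in range(len(x))])
```

Each Z_m keeps its own channel, so the SDA receives a channel-by-frequency map instead of a single beamformed spectrum; summing the Z_m recovers a conventional single-channel GSC output.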
5.3. Experiments
To evaluate the performance of the proposed system, speech enhancement is conducted in a manner similar to [64], as depicted in Figure 5.1. Note that the use of mask estimation before the beamformer and more sophisticated SDA structures are not considered, because these improvements can be applied to both the conventional and the proposed systems. This experiment aims to assess the effectiveness of the proposed algorithm in its most typical configuration. The proposed system uses the outputs of the proposed GSC as the multichannel information and their summation as the beamformed signal. In the baseline systems, the output of the conventional GSC is used as the beamformed signal, and the noisy input signal itself (GSC-NOISY) [64] or the GSC-IPD [65] is used as the multichannel information. The IPD is calculated for each channel except the first using

\mathrm{IPD}_m(l,k) = \cos\left( \angle X_1(l,k) - \angle X_m(l,k) \right), \quad m = 2, \ldots, M, (5.7)

to generate information for M - 1 channels. These two forms of multichannel information are selected as conventional schemes for comparison because, as in the proposed method, they can be extracted without prior knowledge of the noise characteristics and used in real-time speech enhancement systems in the form of frame-by-frame feature extraction. To assess the advantages of using multichannel information, these systems are also compared with a single-channel SDA (GSC-ONLY), which uses only the conventional GSC output without multichannel information. Input signals are fed into the DNN in the form of 511-dimensional log power spectra for each frame. The window size and frame shift are set to 1024 and 512 samples, respectively, at a 16-kHz sampling rate. To evaluate the performance, six-channel data from CHiME [4] is
used. This database provides noise recorded in a cafe, on the street, on a bus, and in a pedestrian area. The recordings were acquired using an array of six microphones mounted in holes drilled through a frame surrounding a tablet computer, with three microphones spaced along both the top and bottom edges. The estimated tablet-microphone SNRs had an average of approximately 5 dB. To control target speech detection errors and spatial information estimation errors, which are beyond the scope of this study, clean target speech is required. Therefore, a simulated data set in which noise and speech are separately available is used. The speech components in the simulated data are generated by applying impulse responses to booth recordings, as in [4]. The RTF estimated from real recorded data is used in the simulation. In the beamforming step, an impulse response of pure time delay, calculated using the estimated time delay of the target signals, is used instead of the RTF. This simulates realistic situations with nonstationary noise, in which a satisfactory estimate of the noise RTF is usually not feasible. Time delay estimation is conducted using the steered response power phase transform (SRP-PHAT) algorithm used as the baseline in [4]. The development (dt) data are used to decide the number of iterations in SDA training, and the evaluation (et) data are used in the evaluation. The evaluation results are shown in Table 5.1. The signal-to-distortion ratio (SDR), which is defined as an energy ratio criterion [66], and the short-time objective intelligibility (STOI) measure described in [67] are used to measure speech enhancement performance. The word error rate in ASR is scored with an acoustic model trained on a clean database; the LibriSpeech database [61] is used in a time-delay NN-based ASR system [59].
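The IPD feature of Eq. (5.7) used by the GSC-IPD baseline reduces to a one-line numpy computation; the function name is our own:

```python
import numpy as np

def ipd_features(X):
    """Cosine inter-channel phase-difference features of Eq. (5.7).
    X: (M, K) complex STFT frame; returns (M-1, K) features for channels 2..M."""
    return np.cos(np.angle(X[0:1]) - np.angle(X[1:]))
```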

Table 5.1 Evaluation results for speech enhancement and WERs

Mult. Info.    SDR      STOI    WER (%)
Noisy input    -0.694   0.674   84.43
GSC-ONLY        7.915   0.835   30.57
GSC-NOISY       7.320   0.837   27.13
GSC-IPD         7.445   0.835   26.74
Proposed        8.687   0.856   20.83
Note that ASR evaluation is performed in mismatched conditions in terms of noise and
RIRs on the assumption that speech enhancement is performed without prior knowledge
of the environment. Evaluation results show that the proposed method consistently
outperforms the conventional methods. Note that the STOI score is expected to have a
monotonic relation with subjective speech intelligibility, where a higher value denotes
more intelligible speech.
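As a rough illustration of the energy-ratio idea behind the SDR (a simplification of the BSS-Eval criterion of [66], not the official implementation; all names are ours):

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified energy-ratio SDR in dB: project the estimate onto the
    reference and compare the target energy with the residual (distortion)
    energy."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    alpha = np.dot(reference, estimate) / np.dot(reference, reference)
    target = alpha * reference          # scaled projection onto the reference
    distortion = estimate - target      # everything the projection cannot explain
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))
```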
5.4. Conclusions
In the proposed system, the GSC exploits spatial information and generates multichannel enhanced signals on which the following SDA can act. As a result, the SDA can take advantage of the multiple channels by modeling the underlying relationship between the distortion and adjacent frequency bins in other frequencies and other channels. The main improvement in the proposed system concerns the change in the expected role of the SDA. In conventional systems that extract multichannel features from noisy signals, the SDA needs to handle a large amount of information, including the noise characteristics, the spatial information, and the relation between the desired speech and noise signals. In the proposed system, the beamformer exploits the spatial information and compensates for the differences in the transfer functions of each channel while removing the noise components. As a result, the modeling capability of the SDA can be concentrated on removing the artefacts caused by the beamformer. The evaluation results demonstrate that using the outputs of the proposed GSC structure as the input to the SDA is effective in improving noise reduction and speech recognition performance.

Chapter 6. Conclusions and Future Works
6.1. Conclusions
Although multichannel noise cancellation has been studied for decades to improve speech recognition performance, implementing a system that allows smooth conversation in everyday life is still difficult. In this dissertation, approaches to improving existing methods were proposed after analyzing the shortcomings of those methods at each stage. These stages comprise the estimation of the RTF, the estimation of signal statistics using deep learning, and the distortion compensation of the enhanced signal. The proposed methods mainly concern the recent application of deep learning to beamforming.

In the RTF estimation stage, a practical scenario was considered in which the noise naturally occupies the same time-frequency range as the desired speech. The shortcomings of the existing directional BSS-based algorithm were analyzed, and the use of the semi-NMF algorithm to estimate the PTDR, proposed in a previous work, was introduced. The proposed method effectively overcame the peak-smoothing side effect of the existing method on the time-domain RTF by recovering the peaks. The accuracy improvement in RTF estimation was assessed in terms of speech suppression gain and normalized squared error.
In the estimation of the second-order statistics of the desired speech signal, the recently proposed NN-GEV beamforming approach was exploited. The phase information of the input signal was emphasized, and a deep learning-based system was proposed to model it effectively. In addition, a method was proposed to effectively transfer the phase information of the input signal to the NN, which traditionally processes real-valued physical data and relies on real-valued weights. By effectively encoding the phase information in real-valued input features, the weakness of the existing method, in which deep learning is applied to each single-channel signal separately, was overcome.

In the distortion compensation of the enhanced signal, a structural change of the beamformer proposed in a previous work was introduced. This structural change allowed the following NN to take advantage of the multiple channels by modeling the underlying relationship between the distortion and adjacent frequency bins in other frequencies and other channels. This resulted in improved performance by overcoming the defect of existing deep learning-based postfilter methods, in which the modeling ability of the NN is limited to enhanced single-channel speech.

6.2. Future Works
To improve multichannel speech enhancement, the following topics of interest can be considered in the future:
1) Investigation of DNN generalization over the spatial configuration of multichannel signals
2) Development of DNN-based time-varying RTF modeling in practical noise environments
3) Development of a generative adversarial network-based far-field database
4) Development of fully DNN-based modeling of the second-order statistics of speech signals

Bibliography
[1] N. Q. K. Duong, E. Vincent, and R. Gribonval, "Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1830-1840, Sep. 2010.
[2] A. Ozerov and C. Fevotte, "Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 550-563, Mar. 2010.
[3] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, "New formulations and efficient algorithms for multichannel NMF," IEEE Workshop Appl. Signal Process. Audio Acoust., pp. 153-156, Oct. 2011.
[4] O. Yilmaz and S. Rickard, "Blind Separation of Speech Mixtures via Time-Frequency Masking," IEEE Trans. Signal Process., vol. 52, no. 7, pp. 1830-1847, Jul. 2004.
[5] S. Araki et al., "A novel blind source separation method with observation vector clustering," Proc. Int. Workshop Acoust. Echo Noise Control, pp. 117-120, 2005.
[6] T. Van den Bogaert, S. Doclo, J. Wouters, and M. Moonen, "Speech enhancement with multichannel Wiener filter techniques in multimicrophone binaural hearing aids," J. Acoust. Soc. Am., vol. 125, no. 1, pp. 360-371, 2009.
[7] B. Cornelis, M. Moonen, and J. Wouters, "A VAD-robust multichannel Wiener filter algorithm for noise reduction in hearing aids," Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 4, pp. 281-284, May 2011.
[8] J. Capon, "High-resolution frequency-wavenumber spectrum analysis," Proc. IEEE, vol. 57, no. 8, pp. 1408-1418, 1969.

[9] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines," Proc. 2015 IEEE Workshop Autom. Speech Recognit. Understanding (ASRU), pp. 504-511, 2016.
[10] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614-1626, 2001.
[11] D. H. Brandwood, "A complex gradient operator and its application in adaptive array theory," IEE Proceedings F - Communications, Radar and Signal Processing, vol. 130, no. 1, pp. 11-16, 1983.
[12] L. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. Antennas Propag., vol. 30, no. 1, pp. 27-34, Jan. 1982.
[13] O. L. Frost, III, "An algorithm for linearly constrained adaptive array processing," Proc. IEEE, vol. 60, no. 8, pp. 926-935, 1972.
[14] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614-1626, 2001.
[15] A. Krueger, E. Warsitz, and R. Haeb-Umbach, "Speech Enhancement With a GSC-Like Structure Employing Eigenvector-Based Transfer Function Ratios Estimation," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 206-219, Jan. 2011.
[16] E. Warsitz and R. Haeb-Umbach, "Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1529-1539, Jul. 2007.
[17] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614-1626, 2001.
[18] E. Georganti, T. May, S. van de Par, and J. Mourjopoulos, "Sound Source Distance Estimation in Rooms based on Statistical Properties of Binaural Signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 8, pp. 1727-1741, Aug. 2013.
[19] S. Vesa, "Binaural Sound Source Distance Learning in Rooms," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 8, pp. 1498–1507, Nov. 2009.
[20] P. Smaragdis and P. Boufounos, "Position and Trajectory Learning for Microphone Arrays," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 1, pp. 358–368, Jan. 2007.
[21] I. Cohen, "Relative Transfer Function Identification Using Speech Signals," IEEE Trans. Speech Audio Process., vol. 12, no. 5, pp. 451–459, Sep. 2004.
[22] Y. Zheng, K. Reindl, and W. Kellermann, "Analysis of dual-channel ICA-based blocking matrix for improved noise estimation," EURASIP J. Adv. Signal Process., vol. 2014, pp. 1–24, 2014.
[23] H. Buchner, R. Aichner, and W. Kellermann, "The TRINICON framework for adaptive MIMO signal processing with focus on the generic Sylvester constraint," in Proc. ITG Conf. Speech Commun., Aachen, Germany, 2008, pp. 8–11.
[24] K. Matsuoka, M. Ohoya, and M. Kawamoto, "A neural net for blind separation of nonstationary signals," Neural Networks, vol. 8, no. 3, pp. 411–419, Jan. 1995.
[25] H. Buchner, R. Aichner, and W. Kellermann, "A generalization of a class of blind source separation algorithms for convolutive mixtures," in Proc. ICA, 2003, pp. 945–950.
[26] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Advances in Neural Information Processing Systems, 1996, pp. 757–763.
[27] R. Aichner, H. Buchner, F. Yan, and W. Kellermann, "A real-time blind source separation scheme and its application to reverberant and noisy acoustic environments," Signal Processing, vol. 86, no. 6, pp. 1260–1277, Jun. 2006.
[28] Y. Zheng, K. Reindl, and W. Kellermann, "BSS for improved interference estimation for blind speech signal extraction with two microphones," in 2009 3rd IEEE Int. Workshop Comput. Adv. Multi-Sensor Adapt. Process. (CAMSAP), 2009, pp. 253–256.
[29] Y. Wang and D. Wang, "Towards Scaling Up Classification-Based Speech Separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Jul. 2013.
[30] C. Ding, T. Li, and M. I. Jordan, "Convex and semi-nonnegative matrix factorizations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 45–55, 2010.
[31] F. Weninger, S. Watanabe, Y. Tachioka, and B. Schuller, "Deep recurrent de-noising auto-encoder and blind de-reverberation for reverberated speech recognition," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4623–4627.
[32] A. L. Maas et al., "Recurrent Neural Networks for Noise Reduction in Robust ASR," in Proc. INTERSPEECH, 2012.
[33] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Deep learning for monaural speech separation," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1562–1566.
[34] S. Araki, T. Hayashi, M. Delcroix, M. Fujimoto, K. Takeda, and T. Nakatani, "Exploring multi-channel features for denoising-autoencoder-based speech enhancement," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 116–120.
[35] S. Renals and P. Swietojanski, "Neural networks for distant speech recognition," in 2014 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), 2014, pp. 172–176.
[36] Y. Liu, P. Zhang, and T. Hain, "Using neural network front-ends on far field multiple microphones based speech recognition," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 5542–5546.

[37] A. Narayanan and D. Wang, "Joint noise adaptive training for robust automatic speech recognition," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 2504–2508.
[38] S. Sivasankaran et al., "Robust ASR using neural network based speech enhancement and feature simulation," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 482–489.
[39] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4580–4584.
[40] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 196–200.
[41] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5210–5214.
[42] IEEE Signal Processing Society, IEEE Transactions on Speech and Audio Processing. Institute of Electrical and Electronics Engineers, 1993.
[43] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb-Umbach, "Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5325–5329.
[44] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, M. Bacchiani, and A. Senior, "Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 30–36.
[45] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015.
[46] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, and M. Bacchiani, "Factored spatial and spectral multichannel raw waveform CLDNNs," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5075–5079.
[47] P. Swietojanski, A. Ghoshal, and S. Renals, "Convolutional Neural Networks for Distant Speech Recognition," IEEE Signal Process. Lett., vol. 21, no. 9, pp. 1120–1124, Sep. 2014.
[48] T. Menne et al., "The RWTH/UPB/FORTH system combination for the 4th CHiME challenge evaluation," in The 4th International Workshop on Speech Processing in Everyday Environments, San Francisco, CA, USA, 2016, pp. 39–44.
[49] H. Erdogan et al., "Multi-channel speech recognition: LSTMs all the way through," in CHiME-4 Workshop, 2016.
[50] J. Heymann, L. Drude, and R. Haeb-Umbach, "Wide residual BLSTM network with discriminative speaker adaptation for robust speech recognition," in CHiME 2016 Workshop, 2016.
[51] J. Du et al., "The USTC-iFlytek system for CHiME-4 challenge," in Proc. CHiME, 2016, pp. 36–38.
[52] D. Wang and J. Lim, "The unimportance of phase in speech enhancement," IEEE Trans. Acoust., Speech, Signal Process., vol. 30, no. 4, pp. 679–681, Aug. 1982.
[53] L. Drude, B. Raj, and R. Haeb-Umbach, "On the Appropriateness of Complex-Valued Neural Networks for Speech Enhancement," in INTERSPEECH, 2016, pp. 1745–1749.
[54] T. N. Sainath et al., "Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 5, pp. 965–979, May 2017.
[55] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, 1997.
[56] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5–6, pp. 602–610, Jul. 2005.
[57] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[58] D. Povey et al., "The Kaldi Speech Recognition Toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[59] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. Interspeech, 2015, pp. 3214–3218.
[60] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, "Improving deep neural network acoustic models using generalized maxout networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 215–219.
[61] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[62] E. Vincent, R. Gribonval, and M. D. Plumbley, "Oracle estimators for the benchmarking of source separation algorithms," Signal Processing, vol. 87, no. 8, pp. 1933–1950, 2007.
[63] X. Mestre and M. A. Lagunas, "On diagonal loading for minimum variance beamformers," in Proc. 3rd IEEE Int. Symp. Signal Process. Inf. Technol. (ISSPIT), 2003, pp. 459–462.
[64] P. Swietojanski, A. Ghoshal, and S. Renals, "Hybrid acoustic models for distant and multichannel large vocabulary speech recognition," in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013, pp. 285–290.
[65] M. I. Mandel, R. J. Weiss, and D. Ellis, "Model-Based Expectation-Maximization Source Separation and Localization," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 382–394, Feb. 2010.
[66] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006.
[67] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, 2011.

Curriculum Vitae
Personal Information
Name: Minkyu Shin
Birth Date: February 11, 1987
E-mail: mkshin@ispl.korea.ac.kr
Education
Mar. 2012–present:
School of Electrical Engineering, Korea University
(combined M.S. and Ph.D. degree program)
Mar. 2006–Feb. 2012:
School of Electrical Engineering, Korea University
(received B.S. in 2012)
Research Interests
Machine Learning and Pattern Recognition
Automatic Speech Recognition
Multichannel Speech Enhancement
Voice Activity Detection
Sound Source Localization

Publications
International Journal
[1] Minkyu Shin and Hanseok Ko, "New Generalized Sidelobe Canceller with Denoising Auto-Encoder for Improved Speech Enhancement," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E100-A, no. 12, Dec. 2017.
[2] Minkyu Shin, Wooil Kim, David Han, and Hanseok Ko, "Relative Transfer Function (RTF) Estimation Utilizing Peaks in Time-Domain RTF," Electronics Letters, 2016.
[3] Seongkyu Mun, Minkyu Shin, Suwon Shon, Wooil Kim, David Han, and Hanseok Ko, "DNN Transfer Learning based Non-linear Feature Extraction for Acoustic Event Classification," IEICE Transactions on Information and Systems, Sep. 2017.
International Conference
[1] Sangwook Park, Jinsang Rho, Minkyu Shin, David K. Han, and Hanseok Ko, "Acoustic Feature Extraction for Robust Event Recognition on Cleaning Robot Platform," 2014 IEEE International Conference on Consumer Electronics, pp. 149–150, Las Vegas, NV, USA, January 10–13, 2014.
[2] Seongjae Lee, Daehun Kim, Suwon Shon, Seongkyu Mun, Minkyu Shin, Youngseng Chen, Sejong Hyun, M. Harris, and Hanseok Ko, "KU-ISPL TRECVID 2016 Multimedia Event Detection System," TRECVID Workshop, 2016.
Domestic Journal
[1] Minkyu Shin and Hanseok Ko, "CASA-Based Inter-Microphone Transfer Function Ratio Estimation Algorithm," Journal of the Acoustical Society of Korea, vol. 33, no. 1, pp. 54–59, January 2014.

Domestic Conference
[1] Suwon Shon, Seongkyu Mun, Minkyu Shin, and Hanseok Ko, "KU-ISPL Language Recognizer for the NIST 2015 i-Vector Machine Learning Challenge," Proceedings of the Fall Conference of the Acoustical Society of Korea, vol. 35, no. 2(s), p. 151, Nov. 2016.
[2] Minkyu Shin, Youngro Lee, and Hanseok Ko, "A Rejection-Sound Training Method for Deep Neural Network Based In-Home Acoustic Event Detectors," Proceedings of the Fall Conference of the Acoustical Society of Korea, vol. 34, no. 2(s), p. 6, Nov. 2015.
[3] Seongkyu Mun, Minkyu Shin, and Hanseok Ko, "A Study on Feature Selection for Acoustic Scene Awareness," Proceedings of the Fall Conference of the Institute of Electronics and Information Engineers, pp. 627–629, Nov. 2013.
[4] Minkyu Shin and Hanseok Ko, "GSC Performance Analysis with Respect to Microphone Gain Mismatch and Input SNR," Proceedings of the Spring Conference of the Acoustical Society of Korea, vol. 32, no. 1, pp. 173–175, May 2013.
[5] Jinsu Park, Minkyu Shin, and Hanseok Ko, "A Noise-Robust Voice Activity Detection Method for Speech Recognition," Proceedings of the 29th Conference on Speech Communication and Signal Processing, pp. 30–32, Aug. 2012.
[6] Kwangyoun Kim, Minkyu Shin, and Hanseok Ko, "A Combined Source Separation and Acoustic Event Recognition Algorithm for Surveillance Systems Based on Acoustic Information," Proceedings of the Spring Conference of the Acoustical Society of Korea, vol. 31, no. 1, pp. 47–50, May 2012.

Acknowledgments
I am deeply grateful to Professor Hanseok Ko, who supported and guided me unwaveringly through many difficulties over the past six years. I also thank Professor Eenjun Hwang, Professor Sungwon Han, Professor Wooil Kim, and Dr. Kwangil Hwang for the guidance that allowed me to complete my doctoral studies. My sincere thanks go to Dr. David Han of the United States Department of Defense, who always offered valuable advice on my research. I am grateful to all the proud seniors, juniors, and classmates of the Intelligent Signal Processing Laboratory, who shared long days of hard work with me.
To my father and mother, who worried about me not only during graduate school but throughout my whole upbringing: I will strive to be a son who repays the love you have given me. I thank my sister and brother-in-law, who have always cheered me on. I deeply appreciate the encouragement of my family, including my great-uncle and my uncle, who have long treated me as a doctor already, and I will do my best to live up to their expectations.
To Juyeon, who has always been by my side: thank you, always, and I love you.