Saturday, May 24, 2014

[paper] Two Microphone Binary Mask speech enhancement: application to diffuse and directional noise fields

Link to the paper: http://etrij.etri.re.kr/etrij/common/GetFile.do?method=filedownload&fileid=ERY-1398669567602

This paper uses binary masking to address two kinds of noise: diffuse noise and directional noise. Diffuse noise is always an unwanted signal, whereas a directional interferer may be either true noise or a competing speech source.

Binary masking methods emulate the human ear's ability to mask a weaker signal with a stronger one. [Moore, Brian CJ, and Brian C. Moore. An introduction to the psychology of hearing. Vol. 5. San Diego: Academic press, 2003.]

Spatial cues such as the interaural time difference (ITD) and the interaural level difference (ILD) are highly useful in source separation.

Many two-microphone systems rely on localization cues for speech segregation. However, these cues are only useful when each sound source is located at a single point, so that each signal arrives from a specific direction. Although this condition holds for speech and directional noise sources (such as car and street noise, or a competing speaker), in many environments the noise is diffuse and does not arrive from a specific direction (e.g., restaurants and large malls).

The main contribution of this paper is the pair of features proposed for estimating the masks: the Coherence Feature and the Phase Error Feature.

1) The Coherence Feature of two spectra $X_1(\lambda, k)$ and $X_2(\lambda, k)$ is defined as

$COH(\lambda, k)=\frac{|P_{X_1,X_2}(\lambda,k)|}{\sqrt{|P_{X_1}(\lambda, k)| |P_{X_2}(\lambda, k)|}}$

where $P_{X_i}(\lambda, k)$ is the smoothed power spectrum of signal $X_i$, $i\in\{1,2\}$, and is calculated as:

$P_{X_i}(\lambda, k) = \alpha P_{X_i}(\lambda - 1, k) + (1 - \alpha) |X_i (\lambda, k)|^2$.

$P_{X_1, X_2}(\lambda, k)$ is the smoothed cross power spectral density of the two spectra, which is computed as:

$P_{X_1, X_2}(\lambda, k) = \alpha P_{X_1, X_2}(\lambda - 1, k) + (1 - \alpha) X_1(\lambda, k) X_2^{*}(\lambda, k)$, where $X_2^{*}$ is the complex conjugate of $X_2$.

The Coherence of two signals measures how correlated, or similar, they are. In the case of a directional source, the signals received at the two microphones are highly similar to each other (they differ only in their time of arrival and amplitude attenuation), so their Coherence is near 1.

In the case of a diffuse noise source, however, the received signals are less similar, so their Coherence is noticeably smaller than 1.
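To make the recursion concrete, here is a minimal NumPy sketch of the Coherence Feature. It is not the authors' code: the function name `coherence`, the default `alpha=0.9`, the zero initialization of the smoothed spectra, and the small `eps` guard against division by zero are all my own assumptions. The cross term uses the complex conjugate of $X_2$, which is the standard definition of a cross power spectral density.

```python
import numpy as np

def coherence(X1, X2, alpha=0.9, eps=1e-12):
    """Coherence per time-frequency unit.

    X1, X2 : complex STFTs of the two microphone signals, shape (frames, bins).
    alpha  : smoothing factor of the recursive averages (assumed value).
    eps    : small constant to avoid division by zero (my addition).
    """
    n_frames, n_bins = X1.shape
    P1 = np.zeros(n_bins)                    # smoothed power spectrum of X1
    P2 = np.zeros(n_bins)                    # smoothed power spectrum of X2
    P12 = np.zeros(n_bins, dtype=complex)    # smoothed cross power spectrum
    coh = np.zeros((n_frames, n_bins))
    for lam in range(n_frames):
        P1 = alpha * P1 + (1 - alpha) * np.abs(X1[lam]) ** 2
        P2 = alpha * P2 + (1 - alpha) * np.abs(X2[lam]) ** 2
        P12 = alpha * P12 + (1 - alpha) * X1[lam] * np.conj(X2[lam])
        coh[lam] = np.abs(P12) / np.sqrt(P1 * P2 + eps)
    return coh
```

Feeding it a delayed and attenuated copy of the same spectrum should give values close to 1, while two independent random spectra should give clearly smaller values, which matches the directional versus diffuse behaviour described above.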

2) The Phase Error Feature is defined as

$PE(\lambda, k) = \Delta \phi(\lambda, k) - 2 \pi \cdot ITD$

where $\Delta \phi(\lambda, k) = \angle X_1(\lambda, k) - \angle X_2(\lambda, k)$.
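A rough sketch of the Phase Error Feature in NumPy follows. Two things here are my assumptions rather than something stated above: I scale the ITD by the centre frequency of each bin, $f_k = k f_s / N$, so that the expected phase difference is $2 \pi f_k \cdot ITD$ (the usual way a time delay maps to a phase), and I wrap the result to $(-\pi, \pi]$. The names `phase_error`, `fs` and `n_fft` are hypothetical.

```python
import numpy as np

def phase_error(X1, X2, itd, fs, n_fft):
    """Phase error per time-frequency unit.

    X1, X2 : complex STFTs, shape (frames, n_fft // 2 + 1).
    itd    : expected time delay of the target between the microphones, in seconds.
    fs     : sampling rate in Hz (assumed parameter).
    n_fft  : FFT length used for the STFT (assumed parameter).
    """
    k = np.arange(X1.shape[1])
    f_k = k * fs / n_fft                          # bin centre frequencies in Hz (assumption)
    delta_phi = np.angle(X1) - np.angle(X2)       # observed phase difference
    pe = delta_phi - 2 * np.pi * f_k * itd        # deviation from the expected phase
    return np.angle(np.exp(1j * pe))              # wrap to (-pi, pi] (my addition)
```

Under these assumptions, a time-frequency unit dominated by a source arriving with exactly the expected delay has a phase error near zero, and units dominated by other sources do not.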

In this paper, the authors use these two sets of features to estimate the binary masks. They experimented with neural networks (2 hidden layers), decision trees, Gaussian mixture models and support vector machines, and found no large differences in performance among them.

The evaluation criterion adopted in this paper is the hit and false alarm rate of the estimated masks. Personally, I don't think it is a good one, because the binary masks depend on the thresholds selected. Different applications may call for different thresholds, which lead to different masks, and the hit and false alarm rates vary accordingly. Moreover, for recognition, whether by humans or machines, it is hard to relate the hit and false alarm rates to recognition performance; in particular, it is unclear whether improvements in these rates translate into better recognition.
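To illustrate the threshold dependence, here is a small sketch of how these rates are usually computed. It assumes the common definitions: the hit rate is the fraction of target-dominant units (ideal mask = 1) that the estimate also labels 1, and the false alarm rate is the fraction of noise-dominant units (ideal mask = 0) wrongly labeled 1. The ideal mask is taken as local SNR above a threshold; the function names and the `lc_db` parameter are hypothetical, not from the paper.

```python
import numpy as np

def ideal_binary_mask(speech_pow, noise_pow, lc_db=0.0):
    """Ideal mask: True where the local SNR in dB exceeds the threshold lc_db."""
    snr_db = 10 * np.log10(np.asarray(speech_pow) / np.asarray(noise_pow))
    return snr_db > lc_db

def hit_fa(ideal, estimated):
    """Return (hit rate, false alarm rate) of an estimated mask."""
    ideal = np.asarray(ideal, dtype=bool)
    estimated = np.asarray(estimated, dtype=bool)
    hit = np.sum(estimated & ideal) / max(np.sum(ideal), 1)
    fa = np.sum(estimated & ~ideal) / max(np.sum(~ideal), 1)
    return hit, fa
```

Because `ideal_binary_mask` changes with `lc_db`, the same estimated mask can score differently against ideal masks built with different thresholds, which is the concern raised above.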