Preprocessing System for Speech Enhancement

Objective

Design a preprocessing system that enhances the speech signal before it is passed to the ASR (automatic speech recognition) models. The approaches I worked on include noise removal, dereverberation, and speaker separation for overlapping-speaker cases.

  1. Noise removal. Speech audio is often corrupted by environmental noise, which is typically modeled as additive noise and is the most straightforward case to handle. Initially, I used a predefined signal filter for noise removal. However, because additive noise comes in many different types, a fixed filter did not generalize well, so I moved to an approach based on AI models. To support this, additive noises likely to occur in real-world scenarios were surveyed and collected. These noises were then mixed into clean speech audio to create training data, with the expectation that the trained models could separate speech from the noise types they had learned as well as from similar ones.
  2. Dereverberation. Reverberation is another type of noise, but it is convolutional rather than additive: the recorded signal is the clean speech convolved with the impulse response of the room. Characterizing it is therefore harder than characterizing additive noise, because removing it requires estimating one or more impulse responses. Even with AI models trained extensively on reverberated data together with the corresponding impulse responses, accurately characterizing reverberation remains difficult.
  3. Speaker separation. Another type of additive noise is overlapping-speaker noise, which comes from human speech that was not intended to be recorded. This noise typically has very low energy or amplitude, resulting in a babbling sound that even human ears struggle to perceive. To address it, this background speech must be isolated and removed separately from the target speech, which requires AI modeling. There are many research directions for this approach, but results have been imperfect: occasionally, the AI model erroneously removes non-babble speech. Experimentation with this method continued after I left the company.
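
As a rough illustration of the additive and convolutional noise models described in items 1 and 2, the sketch below builds training examples from a clean signal. It is not the code used in the project: the function names (`power`, `mix_at_snr`, `reverberate`), the 16 kHz sample rate, the 10 dB SNR, and the toy impulse response are all assumptions made for the example, and it uses only the Python standard library so it runs as-is. A real pipeline would more likely use an array library and recorded noises and impulse responses.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Additive model: y[n] = s[n] + g * d[n], with g chosen to hit snr_db."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    reps = -(-len(speech) // len(noise))  # ceiling division
    noise = (noise * reps)[:len(speech)]
    # Gain that scales the noise to the requested speech-to-noise ratio.
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * d for s, d in zip(speech, noise)]


def reverberate(speech, impulse_response):
    """Convolutional model: y[n] = sum_k h[k] * s[n - k]."""
    out = [0.0] * (len(speech) + len(impulse_response) - 1)
    for n, s in enumerate(speech):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


# Toy stand-ins for real recordings (assumed 16 kHz sample rate).
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]
rir = [1.0, 0.0, 0.5, 0.0, 0.25]  # hypothetical decaying echo pattern

noisy = mix_at_snr(clean, noise, snr_db=10)   # additive-noise training input
reverbed = reverberate(clean, rir)            # reverberant training input
```

In both cases the clean signal serves as the training target. The sketch also shows why the two problems differ in difficulty: the additive case is undone by subtracting an estimate of the noise, while the convolutional case requires an estimate of the impulse response itself.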