

An overview of methods for generating, augmenting and evaluating room impulse response using artificial neural networks

Abstract

Methods based on artificial neural networks (ANNs) are widely used in audio signal processing, where they offer opportunities to optimize processes and reduce the computational resources required. One of the main objects needed to capture the acoustics of a room numerically is the room impulse response (RIR). Increasingly, researchers choose not to record these impulse responses in a real room but to generate them using ANNs, since generation gives them the freedom to prepare training datasets of unlimited size. Neural networks are also used to augment the generated impulse responses so that they resemble ones recorded in real rooms. The widest use of ANNs so far is observed in the evaluation of the generated results, for example through automatic speech recognition (ASR) tasks. This review also describes datasets of recorded RIRs commonly found in the literature that are used as training data for neural networks.
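To make the role of the RIR concrete, a minimal sketch (not taken from the reviewed works): reverberation is simulated by convolving a dry signal with an RIR. Here the RIR itself is a crude synthetic stand-in, modeled as white noise under an exponential decay matched to a target RT60; the `toy_rir` helper, its parameters, and the decay model are illustrative assumptions, not a generation method from the article.

```python
import numpy as np

def toy_rir(fs=16000, rt60=0.5, length_s=1.0, seed=0):
    """Crude synthetic RIR: white noise shaped by an exponential
    envelope whose rate matches the target RT60 (the time for the
    response to decay by 60 dB). Illustrative only."""
    rng = np.random.default_rng(seed)
    n = int(fs * length_s)
    t = np.arange(n) / fs
    # 60 dB decay over rt60 seconds -> amplitude factor 10**(-3 * t / rt60)
    envelope = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(n) * envelope
    return rir / np.max(np.abs(rir))

def reverberate(dry, rir):
    """Simulate room acoustics by convolving a dry signal with the RIR,
    then normalizing to avoid clipping."""
    wet = np.convolve(dry, rir)
    return wet / np.max(np.abs(wet))

fs = 16000
dry = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s of a 440 Hz tone
wet = reverberate(dry, toy_rir(fs=fs))
print(wet.shape)  # full convolution length: len(dry) + len(rir) - 1
```

In the augmentation pipelines the review surveys, the dry signal would be clean speech and the RIR either a recorded or an ANN-generated response; the convolution step itself is the same.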


Article in English.


An overview of methods for generating and evaluating room impulse responses using artificial neural networks (summary in Lithuanian, translated)


Summary


Methods based on artificial neural networks (ANNs) are widely applied to various audio signal processing tasks, offering opportunities to optimize processes and save the resources required for computation. One of the main challenges is computing the parameters that describe a room's acoustics and finding the room impulse response that simulates their effect. Increasingly, researchers in this field choose not to record room impulse response samples experimentally but to generate them using ANNs, because such generation allows the researcher to prepare training datasets of unlimited size. Neural networks are also used to process the generated impulse responses so that they resemble those recorded experimentally. The literature shows that ANNs are most often used to evaluate impulse response generation results indirectly, for example by examining changes in the performance of automatic speech recognition tasks. This review also examines the datasets of recorded room impulse responses commonly found in various studies, where the impulse responses are used as data for training neural networks.


Keywords (Lithuanian, translated): room impulse response, reverberation, acoustic simulation, data augmentation, neural networks, speech recognition.

Keywords: room impulse response, reverberation, acoustic simulation, data augmentation, artificial neural networks, speech recognition

How to Cite
Tamulionis, M. (2021). An overview of methods for generating, augmenting and evaluating room impulse response using artificial neural networks. Mokslas – Lietuvos Ateitis / Science – Future of Lithuania, 13. https://doi.org/10.3846/mla.2021.15152
Published in Issue
Aug 19, 2021

This work is licensed under a Creative Commons Attribution 4.0 International License.
