PHONETIC POSTERIORGRAMS FOR SPEECH INTELLIGIBILITY ASSESSMENT IN AUTOMATIC SPEAKER VERIFICATION
UDC 004.056.57 : 004.934

Andrey A. Lependin Email: lependin@phys.asu.ru
Pavel A. Zubkov Email: pav.zubkoff@mail.ru
Valentin V. Karev Email: krv.valentin@gmail.com

Abstract

This paper proposes a new approach to assessing distortions of speech signals. The approach uses a pretrained neural network model to compute phonetic posteriorgrams and measures their deviation from reference values with the Jensen-Shannon divergence. A High-Fidelity Neural Phonetic Posteriorgrams model trained on the Common Voice 21 dataset was used to compute the posteriorgrams. Three sets of speech recordings, with background noise of controlled power, nonlinear distortions, and reverberation, were generated from a test subset of the VoxCeleb1 dataset. The divergence of the phonetic posteriorgrams was calculated, and speaker verification quality was assessed in parallel using a TDNN model. The Jensen-Shannon divergence proved highly sensitive to the considered speech signal distortions and correlated well with the equal error rate of speaker verification. It can be applied both to assess the quality of speech recordings during biometric user verification and as a loss function for training new neural network methods for speech processing.
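
The comparison of posteriorgrams described in the abstract can be illustrated with a minimal sketch. It assumes that the reference and distorted posteriorgrams are already time-aligned arrays of shape (frames, phonemes) whose rows are probability distributions, that the divergence is taken per frame with a base-2 logarithm, and that frames are averaged into one score. The abstract states none of these details, so the function names `js_divergence` and `mean_js_divergence` and the averaging step are hypothetical and not the authors' implementation.

```python
import numpy as np


def js_divergence(p, q, eps=1e-12):
    """Frame-wise Jensen-Shannon divergence in bits.

    p, q: arrays of shape (frames, phonemes); each row is treated as a
    probability distribution over phoneme classes (an assumption).
    Returns an array of shape (frames,) with values in [0, 1].
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    if p.shape != q.shape:
        raise ValueError("posteriorgrams must have the same shape")

    # Renormalise rows so that small numerical drift does not bias the result.
    p = p / p.sum(axis=1, keepdims=True)
    q = q / q.sum(axis=1, keepdims=True)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence per row; terms with a == 0 contribute 0.
        ratio = np.where(a > 0, a / np.maximum(b, eps), 1.0)
        return np.sum(a * np.log2(ratio), axis=1)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_js_divergence(reference, distorted):
    """Average the frame-wise divergence into a single distortion score."""
    return float(np.mean(js_divergence(reference, distorted)))
```

With a base-2 logarithm the score is bounded: identical posteriorgrams give 0, and frames whose probability mass falls on disjoint phoneme classes give 1.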

Article Details

How to Cite
1. Lependin A. A., Zubkov P. A., Karev V. V. PHONETIC POSTERIORGRAMS FOR SPEECH INTELLIGIBILITY ASSESSMENT IN AUTOMATIC SPEAKER VERIFICATION // ПРОБЛЕМЫ ПРАВОВОЙ И ТЕХНИЧЕСКОЙ ЗАЩИТЫ ИНФОРМАЦИИ, 2026. № 13. P. 30-41. URL: https://journal.asu.ru/ptzi/article/view/18848.
Section
Problems of Technical Support of Information Security
