Text-to-Speech Technology Development Using FastSpeech2 Algorithm for the Story of the Prophet

Authors

  • Muhammad Raihan Firdaus Department of Informatics, UIN Sunan Gunung Djati Bandung
  • Muhammad Rihap Firdaus Department of Informatics, UIN Sunan Gunung Djati Bandung
  • Pancadrya Yashoda Pasha Department of Informatics, UIN Sunan Gunung Djati Bandung

DOI:

https://doi.org/10.15575/kjrt.v2i2.1099

Abstract

With Sustainable Development Goal target 4.6 set for 2030, improving literacy has become a pressing priority. Given today's technological advances, improving access to reading in the digital age is increasingly important, especially for individuals with limited time. Text-to-Speech (TTS) technology allows users to consume text content, such as books or journals, in audio form that can be listened to while doing other activities. This research develops a TTS model based on the FastSpeech2 algorithm, a non-autoregressive deep learning architecture that uses Transformers to generate high-quality audio quickly and efficiently. The LJSpeech dataset, which consists of 13,100 short audio clips with a total duration of about 24 hours, is used for training. Preprocessing involves text normalization, audio feature extraction, and data synchronization, while evaluation uses objective metrics such as Mel Cepstral Distortion (MCD) and Pitch Error to assess the quality of the synthesized speech. A key application of this TTS technology is narrating the stories of the Prophets, which are central in Islamic teaching for imparting moral values, fostering spiritual connection, and offering timeless lessons. The results show that FastSpeech2 produces high-quality synthesized speech quickly and accurately, making it an effective option for improving audio literacy and a practical solution for individuals with limited reading time.
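For readers who want to reproduce the objective evaluation, the sketch below illustrates how the two metrics named above are commonly computed. It is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes frame-aligned mel-cepstral coefficients (MCEPs) and F0 tracks have already been extracted from the reference and synthesized audio.

    # Minimal sketch of the two objective metrics named in the abstract.
    # Assumes frame-aligned inputs; names and shapes are illustrative.
    import numpy as np

    def mel_cepstral_distortion(c_ref: np.ndarray, c_syn: np.ndarray) -> float:
        """MCD in dB between two (frames, dims) MCEP arrays.

        Uses the standard formulation: the 0th (energy) coefficient is
        excluded, and per-frame distances are averaged over all frames.
        """
        diff = c_ref[:, 1:] - c_syn[:, 1:]                # drop energy term c0
        per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
        return float((10.0 / np.log(10.0)) * np.mean(per_frame))

    def pitch_rmse(f0_ref: np.ndarray, f0_syn: np.ndarray) -> float:
        """RMSE of F0 (Hz), computed over frames voiced in both tracks."""
        voiced = (f0_ref > 0) & (f0_syn > 0)              # unvoiced frames coded as 0
        return float(np.sqrt(np.mean((f0_ref[voiced] - f0_syn[voiced]) ** 2)))

Lower values are better for both metrics: MCD measures spectral distortion between the reference and synthesized mel-cepstra, while Pitch Error captures how faithfully the model reproduces the reference intonation contour.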


Published

2025-01-06
