Thematic Grouping of Quranic Verse Translations Based on Word2Vec and K-Means Clustering
DOI:
https://doi.org/10.15575/kjrt.v3i2.1748
Keywords:
Al-Qur'an, Digital Interpretation, K-Means Clustering, Natural Language Processing, Thematic Clustering, Word2Vec
Abstract
This study aims to thematically group Indonesian translations of Quranic verses using Word2Vec embeddings and the K-Means clustering algorithm. The pipeline consists of text preprocessing, construction of vector representations with Word2Vec, and clustering with K-Means, with cluster quality evaluated using the Silhouette Score. The experimental results show that the model forms six main thematic clusters, which semantically correspond to prayer and hope, moral evil, social law, the teachings of revelation, divinity, and the stories of figures and ethics. A two-dimensional PCA visualization supports the interpretation of the resulting clusters. These results indicate that unsupervised learning can support the automation of digital thematic interpretation in an objective and systematic way. In addition, the clustering results could serve as a basis for topic-based verse search systems, contextual Quranic learning applications, and technology-based exploration of Islamic studies. The study also supports Sustainable Development Goal (SDG) 4, which concerns increasing access to inclusive and quality education, through the use of information technology.
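The embed, cluster, and evaluate steps named in the abstract can be illustrated with a small standard-library sketch. It is not the authors' implementation: the 2-D word vectors, the toy English "verses", and the helper names `verse_vector`, `kmeans`, and `silhouette_score` are all hypothetical stand-ins, and averaging word vectors is only one common way to turn word embeddings into a sentence vector. A real run would presumably train Word2Vec on the preprocessed Indonesian translations and rely on library implementations of K-Means and the silhouette metric.

```python
import math
import random

# ASSUMPTION: hand-made 2-D vectors standing in for trained Word2Vec embeddings.
TOY_VECTORS = {
    "pray": (0.9, 0.1), "hope": (0.8, 0.2), "mercy": (1.0, 0.0),
    "law": (0.1, 0.9), "trade": (0.0, 1.0), "debt": (0.2, 0.8),
}

def verse_vector(tokens):
    """Embed a tokenized verse as the mean of its known word vectors."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    return tuple(sum(dim) / len(known) for dim in zip(*known))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels

def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b); assumes at least two non-empty clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical, already preprocessed "verses" (lowercased, tokenized).
verses = [
    ["pray", "hope"], ["hope", "mercy"], ["pray", "mercy"],
    ["law", "trade"], ["trade", "debt"], ["law", "debt"],
]
points = [verse_vector(v) for v in verses]
labels = kmeans(points, k=2)
score = silhouette_score(points, labels)
print(labels, round(score, 3))
```

On this toy data the two themes are well separated, so the silhouette score comes out close to 1. On real verse embeddings the score would typically be compared across several values of k before settling on a cluster count such as the six reported above.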
License
Copyright (c) 2026 Ahmad Badru Al Husaeni, Alif Firmansyah Putra, Adi Purnama, Adly Juliarta Lerian, Diman Fathurohman

This work is licensed under a Creative Commons Attribution 4.0 International License.