Transforming Story Ideas from Images to Text Using Convolutional Neural Networks (CNN) and Generative Pre-trained Transformer (GPT-2)

Authors

  • Moh Hasbi Rizqullah Department of Informatics, UIN Sunan Gunung Djati Bandung
  • Eva Nurlatifah Department of Informatics, UIN Sunan Gunung Djati Bandung
  • Wildan Budiawan Zulfikar Department of Informatics, UIN Sunan Gunung Djati Bandung

DOI:

https://doi.org/10.15575/istek.v14i2.2599

Keywords:

CNN, GPT-2, Image-to-Text, Story Generation

Abstract

The gap between rich visual inspiration and the difficulty of creative articulation (writer’s block) remains a major obstacle in the writing process. This study aims to bridge this gap by designing a two-stage, deep-learning-based artificial intelligence system that provides automated narrative stimuli. The proposed method implements a custom Convolutional Neural Network (CNN) architecture to detect seven classes of natural objects across 4,362 images. The detected objects are then used as prompts for a fine-tuned Generative Pre-trained Transformer (GPT-2) model to generate poetic narratives. Experimental results show that the CNN module achieved a peak classification accuracy of 61.96%. Confusion-matrix analysis reveals that this limitation stems not from overfitting but from high inter-class visual ambiguity. Although the GPT-2 module can generate narratives with a BERTScore F1 of up to 0.6455, the primary finding of this study is that overall narrative quality depends heavily on the accuracy of the CNN output, which acts as the critical bottleneck in the system.
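The abstract describes a two-stage hand-off: the CNN's predicted object class seeds the prompt for the fine-tuned GPT-2. The following is a minimal sketch of that hand-off only; the class names and prompt template are illustrative assumptions (the paper does not list them here), and the GPT-2 generation step is indicated in a comment rather than implemented.

```python
# Sketch of the CNN -> GPT-2 hand-off described in the abstract:
# the CNN's top-1 class label becomes the seed prompt for the
# language model. Class names and the prompt template below are
# illustrative assumptions, not taken from the paper.

NATURE_CLASSES = ["forest", "mountain", "river", "beach", "desert", "field", "sky"]

def label_to_prompt(logits):
    """Pick the argmax class from the CNN's output scores and wrap it
    in a text prompt for the story-generation stage."""
    top1 = max(range(len(logits)), key=lambda i: logits[i])
    label = NATURE_CLASSES[top1]
    return f"A poetic story about a {label}:"

# In the full system this prompt would then be passed to the
# fine-tuned GPT-2, e.g. via a Hugging Face text-generation
# pipeline: generator(prompt, max_new_tokens=...).
```

Because the prompt is derived from a single hard classification, any misclassification by the CNN propagates directly into the narrative, which is the bottleneck effect the abstract reports.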

Published

2025-12-29

How to Cite

Rizqullah, M. H., Nurlatifah, E., & Budiawan Zulfikar, W. (2025). Transforming Story Ideas from Images to Text Using Convolutional Neural Networks (CNN) and Generative Pre-trained Transformer (GPT-2). ISTEK, 14(2), 88–94. https://doi.org/10.15575/istek.v14i2.2599

Section

Articles