NLP-BASED FOLK STORY DOCUMENTATION SYSTEMS
DOI:
https://doi.org/10.29121/shodhkosh.v6.i5s.2025.6879
Keywords:
Cultural Heritage Preservation, Semantic Annotation, Transformer Models, Multilingual Corpus, Knowledge Graph, BERTopic, Ontology Alignment, Digital Humanities
Abstract [English]
Folk tales are invaluable sources of cultural wisdom, linguistic diversity, and collective memory. Much of this intangible heritage, however, is endangered by declining oral traditions and inadequate documentation mechanisms. This paper introduces an NLP-based framework for documenting and preserving folk stories that combines state-of-the-art natural language processing, semantic modeling, and ontology-based cultural representation. The proposed pipeline handles multilingual and dialect-rich texts through a sequence of computational stages: text preprocessing, linguistic analysis, motif discovery, semantic annotation, and construction of cultural knowledge graphs. Transformer-based models (mBERT, IndicBERT, and RoBERTa) were fine-tuned on language-specific tasks, while BERTopic and TransE embeddings were used for thematic clustering and for ontology alignment with the CIDOC-CRM schema. Evaluation showed substantial improvements over traditional baselines in linguistic accuracy (F1 = 0.91), motif classification (F1 = 0.83), and topic coherence (CV = 0.74). Expert validation by folklorists and linguists yielded a Cultural Authenticity Index (CAI) of 0.87, supporting the interpretive reliability of the system. The resulting knowledge base supports semantic querying, comparative motif analysis, and cultural pattern recognition, turning typically static folklore collections into intelligent, interactive archives. Applications include digital heritage management, education, creative storytelling, and cross-cultural analytics. Overall, the framework contributes a scalable and ethically grounded method for AI-assisted cultural preservation, helping the linguistic richness and narrative depth of folk traditions survive in the digital age.
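The abstract names TransE embeddings as part of the knowledge-graph and ontology-alignment stage. As a rough illustration of the scoring principle behind TransE (a triple is plausible when head + relation lies close to tail in the embedding space), the sketch below ranks candidate motifs for a story character. It is not the authors' implementation: the entity and relation names, the three-dimensional vectors, and the helper functions are all hypothetical, and real embeddings would be learned from the annotated corpus rather than written by hand.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# In practice these vectors would be learned from knowledge-graph triples.
entities = {
    "trickster_fox": [0.9, 0.1, 0.0],
    "deception_motif": [1.0, 0.6, 0.1],
    "river_spirit": [-0.4, 0.8, 0.7],
}
relations = {
    "exhibits_motif": [0.1, 0.5, 0.1],
}


def transe_distance(head, relation, tail):
    """Return the L2 distance ||h + r - t||; smaller means more plausible."""
    h, r, t = entities[head], relations[relation], entities[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation):
    """Rank every other entity as a candidate tail, most plausible first."""
    candidates = [name for name in entities if name != head]
    return sorted(candidates, key=lambda tail: transe_distance(head, relation, tail))


ranking = rank_tails("trickster_fox", "exhibits_motif")
for tail in ranking:
    print(tail, round(transe_distance("trickster_fox", "exhibits_motif", tail), 3))
```

With these toy vectors the deception motif ranks ahead of the unrelated entity because its embedding sits almost exactly at head + relation. The same distance could in principle be used to propose links between extracted story entities and ontology classes, though how the paper performs that alignment is not specified in the abstract.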
License
Copyright (c) 2025 Richa Srivastava, Ramu K, Ayush Gandhi, Shubhangi S. Shambharkar, Chandrashekhar Ramesh Ramtirthkar, Dr. L. Lakshmanan

This work is licensed under a Creative Commons Attribution 4.0 International License.
With the licence CC-BY, authors retain the copyright, allowing anyone to download, reuse, re-print, modify, distribute, and/or copy their contribution. The work must be properly attributed to its author.
It is not necessary to ask for further permission from the author or journal board.
This journal provides immediate open access to its content on the principle that making research freely available to the public supports a greater global exchange of knowledge.