NEURAL NETWORKS IN SOUND CLASSIFICATION FOR ART STUDENTS

Authors

  • Ayush Gandhi, Centre of Research Impact and Outcome, Chitkara University, Rajpura - 140417, Punjab, India
  • Dr. A. C. Santha Sheela, Associate Professor, Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai, Tamil Nadu, India
  • Lakshya Swarup, Chitkara Centre for Research and Development, Chitkara University, Solan - 174103, Himachal Pradesh, India
  • Ms. Ipsita Dash, Assistant Professor, Centre for Internet of Things, Institute of Technical Education and Research, Siksha 'O' Anusandhan Deemed to be University, Bhubaneswar, Odisha, India
  • Swati Srivastava, Associate Professor, School of Business Management, Noida International University - 203201, India
  • Dr. Varsha Kiran Bhosale, Associate Professor, Dynashree Institute of Engineering and Technology, India

DOI:

https://doi.org/10.29121/shodhkosh.v6.i3.2025.6668

Keywords:

Neural Networks, Sound Emotion Recognition, CNN–LSTM Architecture, Transformer Attention, Valence–Arousal Mapping

Abstract [English]

Sound classification has become an important element in contemporary creative practice, spanning digital art, interactive installation, performance design, and multimedia storytelling. For art students, a neural-network understanding of sound provides not only a technological foundation but also a creative toolkit for designing novel expressive modalities. This study presents a hybrid neural network for sound emotion mapping that combines a CNN-based spectral feature extractor, LSTM temporal modelling, and Transformer attention learning. Trained on the RAVDESS, EMO-DB, and IEMOCAP datasets, the model achieves high accuracy in categorical emotion recognition and strong alignment in continuous valence–arousal prediction. The attention mechanism improves interpretability by focusing on emotionally salient regions of the time–frequency representation. Results indicate that combining spatial, temporal, and contextual representations yields robust and generalizable emotion mapping and provides a reliable framework for affect-aware audio applications. The proposed approach advances the understanding of how neural networks interpret expressive sound and informs future work in creative computing and human-centered AI.
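To make the described pipeline concrete for readers who want to experiment, the following is a minimal PyTorch sketch of a CNN–LSTM–Transformer hybrid with two output heads (categorical emotion and valence–arousal). It assumes log-mel spectrogram input; the class name, layer sizes, number of mel bins, and the eight emotion classes are illustrative assumptions, not the authors' published configuration.

```python
import torch
import torch.nn as nn

class CNNLSTMTransformer(nn.Module):
    """Hypothetical hybrid: CNN spectral extractor -> LSTM temporal modelling ->
    Transformer self-attention, with categorical and valence-arousal heads."""

    def __init__(self, n_mels=64, n_emotions=8, d_model=128):
        super().__init__()
        # CNN front end on (batch, 1, n_mels, time) log-mel spectrograms
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        cnn_feat = 64 * (n_mels // 4)            # channels x pooled mel bins per frame
        self.lstm = nn.LSTM(cnn_feat, d_model, batch_first=True, bidirectional=True)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=2 * d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.emotion_head = nn.Linear(2 * d_model, n_emotions)   # categorical emotions
        self.va_head = nn.Linear(2 * d_model, 2)                  # valence, arousal

    def forward(self, spec):                      # spec: (batch, 1, n_mels, time)
        x = self.cnn(spec)                        # (batch, 64, n_mels/4, time/4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # frame-wise feature vectors
        x, _ = self.lstm(x)                       # temporal modelling
        x = self.transformer(x)                   # attention over emotionally salient frames
        pooled = x.mean(dim=1)                    # average pooling over time
        return self.emotion_head(pooled), self.va_head(pooled)

# Quick shape check on a random batch (hypothetical clip length of 256 frames).
model = CNNLSTMTransformer()
logits, va = model(torch.randn(2, 1, 64, 256))
print(logits.shape, va.shape)   # torch.Size([2, 8]) torch.Size([2, 2])
```

In a training setup of this kind, the categorical head would typically be optimized with cross-entropy against the dataset emotion labels, and the valence–arousal head with a regression objective such as MSE or concordance correlation.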



Published

2025-11-30

How to Cite

Gandhi, A., Sheela, A. S., Swarup, L., Dash, I., Srivastava, S., & Kiran Bhosale, V. (2025). NEURAL NETWORKS IN SOUND CLASSIFICATION FOR ART STUDENTS. ShodhKosh: Journal of Visual and Performing Arts, 6(3), 31–39. https://doi.org/10.29121/shodhkosh.v6.i3.2025.6668