Return to Article Details MULTIMODAL EMOTION RECOGNITION USING AUDIO-TEXT FUSION AND TRANSFORMER-BASED CONTEXTUAL REPRESENTATION LEARNING