🎭 Multimodal Emotion Recognition

This system predicts emotions from video by automatically extracting and analyzing:

🎤 Audio (extracted from video)
📝 Text (transcribed from audio using Whisper)
🎥 Video (visual frames)

How to use:

1. Upload a video file (MP4, AVI, MOV, etc.).
2. Click "Predict Emotion".
3. Wait while the system automatically extracts the audio, transcribes the speech, and analyzes all three modalities.

The model will provide emotion predictions based on all three inputs.
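
The audio-extraction step can be sketched as follows. This is a minimal illustration, not the app's actual code: it assumes ffmpeg is used for extraction, and the helper names build_audio_extract_cmd and default_wav_path are hypothetical. The 16 kHz mono target is also an assumption, chosen because Wav2Vec2 and Whisper both operate on 16 kHz audio.

```python
import shlex
from pathlib import Path


def default_wav_path(video_path):
    """Hypothetical helper: place the WAV next to the video, same stem."""
    return Path(video_path).with_suffix(".wav")


def build_audio_extract_cmd(video_path, wav_path, sample_rate=16000):
    """Build (but do not run) an ffmpeg command that extracts a mono WAV track.

    Assumption: ffmpeg is installed; the app itself may use a different tool.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(video_path),    # input video
        "-vn",                    # drop the video stream, keep audio only
        "-ac", "1",               # downmix to one channel (mono)
        "-ar", str(sample_rate),  # resample to the target rate
        str(wav_path),
    ]


cmd = build_audio_extract_cmd("clip.mp4", default_wav_path("clip.mp4"))
print(shlex.join(cmd))
# prints: ffmpeg -y -i clip.mp4 -vn -ac 1 -ar 16000 clip.wav
```

To actually run the extraction, the list can be passed to subprocess.run(cmd, check=True). With the openai-whisper package installed, the resulting WAV could then be transcribed with whisper.load_model("base").transcribe("clip.wav")["text"]; the "base" model size is only an example, since this page does not say which Whisper model is used.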

📌 Notes:

Supported emotions: Angry, Happy, Neutral, Sad
The model uses Wav2Vec2 (audio), BERT (text), and ResNet18 (video)
Results are best when the audio is clear, the transcript is accurate, and faces are visible
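
How the three modalities are combined is not stated here. One common option is late fusion, where each modality produces its own probabilities over the four emotions and the results are averaged. The sketch below illustrates that idea only; the function fuse_predictions, the equal weights, and the example probabilities are all hypothetical, and the real model may instead fuse features inside the network.

```python
EMOTIONS = ["Angry", "Happy", "Neutral", "Sad"]


def fuse_predictions(audio_probs, text_probs, video_probs, weights=(1.0, 1.0, 1.0)):
    """Late fusion (assumed, not confirmed by the app): weighted average
    of three per-modality probability vectors, then pick the top emotion."""
    modalities = (audio_probs, text_probs, video_probs)
    for probs in modalities:
        if len(probs) != len(EMOTIONS):
            raise ValueError("each modality needs one probability per emotion")

    total_weight = sum(weights)
    fused = [
        sum(w * probs[i] for w, probs in zip(weights, modalities)) / total_weight
        for i in range(len(EMOTIONS))
    ]
    best = max(range(len(EMOTIONS)), key=fused.__getitem__)
    return EMOTIONS[best], dict(zip(EMOTIONS, fused))


# Made-up probabilities for illustration only (order follows EMOTIONS).
label, scores = fuse_predictions(
    audio_probs=[0.10, 0.60, 0.20, 0.10],
    text_probs=[0.05, 0.50, 0.40, 0.05],
    video_probs=[0.20, 0.30, 0.40, 0.10],
)
print(label)  # prints: Happy
```

Raising the weight of one modality (for example, the video weight when the audio is noisy) shifts the fused result toward that modality, which is one way to see why clear audio and visible faces matter for the final prediction.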