# EmotionAI-voice

An AI-powered application for detecting human emotions.
EmotionAI Voice is an open-source deep learning project that classifies vocal emotions from raw `.wav` audio. It's designed for applications in mental health monitoring, UX analysis, and intelligent speech interfaces.

The model is trained from scratch on spectrogram-based audio features and aims to recognize 8 core emotions.
## Features

- Emotion recognition: `neutral`, `calm`, `happy`, `sad`, `angry`, `fearful`, `disgust`, `surprised`
- Accepts `.wav` audio inputs (from the RAVDESS dataset)
- CNN and CNN+GRU models implemented in PyTorch
- Real-time evaluation with confusion matrix and accuracy tracking
- Fully open-source and customizable (no pre-trained models)
- SpecAugment data augmentation (frequency/time masking)
## Dataset: RAVDESS

We use the RAVDESS dataset, which includes:

- 24 professional actors (balanced male/female)
- 1440 `.wav` files (16-bit, 48 kHz)
- 8 labeled emotions: `neutral`, `calm`, `happy`, `sad`, `angry`, `fearful`, `disgust`, `surprised`

Each `.wav` file is preprocessed into a Mel spectrogram and stored in `.npy` format, as sketched below.
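A minimal sketch of that preprocessing step, assuming `librosa` for audio loading; the directory layout (`data/wav`, `data/npy`) and the spectrogram parameters are illustrative, not necessarily those used in `src/`:

```python
import os
import numpy as np
import librosa

def wav_to_mel(path: str, n_mels: int = 128, sr: int = 22050) -> np.ndarray:
    """Load a .wav file and convert it to a log-Mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)               # resampled on load (illustrative rate)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)     # log scale in dB

# Store each spectrogram as .npy (hypothetical folder names)
for fname in os.listdir("data/wav"):
    if fname.endswith(".wav"):
        spec = wav_to_mel(os.path.join("data/wav", fname))
        np.save(os.path.join("data/npy", fname.replace(".wav", ".npy")), spec)
```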
## Model Architectures

Two models are implemented.

### CNN (best performance)

- 3x Conv1D + ReLU + MaxPool
- Fully connected layers
- Dropout regularization (adjustable)
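A minimal PyTorch sketch of this architecture; the channel widths, kernel sizes, and dropout rate are assumptions, not the exact values from `src/`:

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    def __init__(self, n_mels: int = 128, n_classes: int = 8, dropout: float = 0.3):
        super().__init__()
        self.features = nn.Sequential(
            # 3x (Conv1D + ReLU + MaxPool) over the time axis,
            # with the Mel bands as input channels
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(128, 256, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),                     # adjustable dropout regularization
            nn.Linear(256, 128), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_mels, time)
        h = self.features(x)
        h = h.mean(dim=-1)                           # global average pooling over time
        return self.classifier(h)
```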
### CNN + GRU

- CNN front-end for spatial encoding
- GRU recurrent layers to capture temporal dynamics
- Lower accuracy than the CNN-only model
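A corresponding sketch of the CNN + GRU variant, again with assumed layer sizes:

```python
import torch
import torch.nn as nn

class EmotionCNNGRU(nn.Module):
    def __init__(self, n_mels: int = 128, n_classes: int = 8, hidden: int = 128):
        super().__init__()
        # CNN front-end for spatial encoding of the spectrogram
        self.features = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        # GRU over the remaining time steps to capture temporal dynamics
        self.gru = nn.GRU(128, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)           # (batch, channels, time')
        h = h.transpose(1, 2)          # GRU expects (batch, time', channels)
        _, last = self.gru(h)          # final hidden state: (1, batch, hidden)
        return self.fc(last.squeeze(0))
```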
## SpecAugment: Data Augmentation

To improve generalization, we implemented `SpecAugmentTransform`, which applies:

- Time masking: hides random time intervals
- Frequency masking: hides random Mel frequency bands
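A sketch of what `SpecAugmentTransform` could look like using torchaudio's built-in masking transforms; the project's actual implementation and mask parameters may differ:

```python
import torch
import torchaudio.transforms as T

class SpecAugmentTransform:
    def __init__(self, freq_mask_param: int = 15, time_mask_param: int = 35):
        self.freq_mask = T.FrequencyMasking(freq_mask_param=freq_mask_param)
        self.time_mask = T.TimeMasking(time_mask_param=time_mask_param)

    def __call__(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (..., n_mels, time); zero out a random Mel band and time interval
        return self.time_mask(self.freq_mask(spec))
```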
## Training Results

- Best validation accuracy: ~49.6%
- Training set: Actors 1–20
- Validation set: Actors 21–24 (split by actor, as sketched below)
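The actor ID can be read from the RAVDESS filename convention (seven two-digit fields, the last being the actor), so the split can be derived as follows; the directory layout is an assumption:

```python
from pathlib import Path

def actor_id(path: Path) -> int:
    # RAVDESS filenames have seven two-digit fields, e.g. 03-01-06-01-02-01-12.wav;
    # the last field is the actor ID.
    return int(path.stem.split("-")[-1])

files = sorted(Path("data/npy").glob("*.npy"))          # hypothetical layout
train_files = [f for f in files if actor_id(f) <= 20]   # Actors 1-20
val_files   = [f for f in files if actor_id(f) >= 21]   # Actors 21-24
```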
*Confusion matrix example: see the output of `src/confusion_matrix.py`.*
Key observations:

- Surprised, calm, and disgust are the most accurately predicted emotions.
- Neutral, happy, and sad tend to be confused with one another, which is common given their subtle acoustic differences.
- The model struggles with fearful and angry in some cases, suggesting those emotions share overlapping vocal characteristics in this dataset.
- Happy and fearful are often misclassified due to variability in expression intensity among actors.
## Interpretation

While the model captures general emotion cues, it suffers from class overlap and limited generalization. The accuracy remains significantly above random (12.5% for 8 classes), but there is still room for improvement.
## Getting Started

1. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
2. Download the dataset from Kaggle, following the instructions in the `README.md` in the `data` folder.
3. Train the model:
   ```bash
   python src/train.py
   ```
4. Evaluate performance with a confusion matrix:
   ```bash
   python src/confusion_matrix.py
   ```
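For orientation, a minimal sketch of how such a confusion matrix can be computed with scikit-learn; `model` and `val_loader` stand in for the project's own training objects, and the label order is assumed:

```python
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

LABELS = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]

@torch.no_grad()
def evaluate(model, loader):
    """Collect predicted and true labels over a validation loader."""
    model.eval()
    preds, targets = [], []
    for x, y in loader:                               # (batch, n_mels, time), (batch,)
        preds.append(model(x).argmax(dim=1).cpu().numpy())
        targets.append(y.cpu().numpy())
    return np.concatenate(preds), np.concatenate(targets)

# preds, targets = evaluate(model, val_loader)        # objects from the training script
# cm = confusion_matrix(targets, preds)
# ConfusionMatrixDisplay(cm, display_labels=LABELS).plot()
# plt.show()
```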