EmotionAI-voice

An AI-powered application for detecting human emotions from voice

EmotionAI Voice is an open-source deep learning project that classifies vocal emotions using raw .wav audio.
It's designed for applications in mental health monitoring, UX analysis, and intelligent speech interfaces.

🔬 The model is trained from scratch on spectrogram-based audio features and aims to recognize 8 core emotions.


🎯 Features

  • 🧠 Emotion recognition: neutral, calm, happy, sad, angry, fearful, disgust, surprised
  • 🎧 Accepts .wav audio inputs (from the RAVDESS dataset)
  • 📊 CNN and CNN+GRU models implemented in PyTorch
  • 🔍 Real-time evaluation with confusion matrix and accuracy tracking
  • 🛠️ Fully open-source and customizable (no pre-trained models)
  • 🧪 Includes SpecAugment for data augmentation (frequency/time masking)

📚 Dataset — RAVDESS

We use the RAVDESS dataset, which includes:

  • 🎭 24 professional actors (balanced male/female)
  • 🎙️ 1440 .wav files (16-bit, 48 kHz)
  • 8 labeled emotions:
    neutral, calm, happy, sad, angry, fearful, disgust, surprised

Each .wav file is preprocessed into a Mel spectrogram and saved as a .npy file.
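
For reference, such a preprocessing step could look roughly like the sketch below. It assumes librosa and uses illustrative paths and Mel parameters; the repo's actual preprocessing script may differ:

```python
import numpy as np
import librosa

def wav_to_mel(path, sr=48000, n_mels=128, n_fft=2048, hop_length=512):
    """Load a RAVDESS .wav file and convert it to a log-Mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)  # resample to a fixed rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)  # log scale for numerical stability

# Example (illustrative paths): convert one file and store it as .npy
spec = wav_to_mel("data/ravdess/Actor_01/03-01-01-01-01-01-01.wav")
np.save("data/processed/03-01-01-01-01-01-01.npy", spec)
```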


🧠 Model Architectures

Two model architectures are implemented:

✅ CNN (Best Performance)

  • 3x Conv1D + ReLU + MaxPool
  • Fully connected layers
  • Dropout regularization (adjustable)

πŸ” CNN + GRU

  • CNN front-end for spatial encoding
  • GRU (recurrent layers) to capture temporal dynamics
  • Lower accuracy than CNN-only model
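
A corresponding sketch of the CNN + GRU variant, again with assumed layer sizes: the convolutional features are treated as a time sequence and summarized by the GRU's final hidden state.

```python
import torch
import torch.nn as nn

class EmotionCNNGRU(nn.Module):
    """Conv1D front-end followed by a GRU over the time dimension (illustrative sizes)."""
    def __init__(self, n_mels=128, n_classes=8, hidden=128, dropout=0.3):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.gru = nn.GRU(input_size=128, hidden_size=hidden, batch_first=True)
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(hidden, n_classes))

    def forward(self, x):                    # x: (batch, n_mels, time)
        feats = self.cnn(x).transpose(1, 2)  # -> (batch, time, channels)
        _, h = self.gru(feats)               # h: (num_layers, batch, hidden)
        return self.head(h[-1])              # last hidden state -> class logits
```
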

🧪 SpecAugment: Data Augmentation

To improve generalization, we implemented SpecAugmentTransform, which applies:

  • 🕒 Time masking: hides random time intervals
  • 📑 Frequency masking: hides random mel frequency bands
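
A sketch of what such a transform can look like, here built on torchaudio's masking transforms; the repo's SpecAugmentTransform and its masking parameters may be implemented differently:

```python
import torch
import torchaudio.transforms as T

class SpecAugment(torch.nn.Module):
    """Apply frequency and time masking to a Mel spectrogram tensor (freq, time)."""
    def __init__(self, freq_mask_param=15, time_mask_param=35):
        super().__init__()
        self.freq_mask = T.FrequencyMasking(freq_mask_param=freq_mask_param)
        self.time_mask = T.TimeMasking(time_mask_param=time_mask_param)

    def forward(self, spec):
        return self.time_mask(self.freq_mask(spec))

# Example: augment a (n_mels, time) spectrogram during training
augment = SpecAugment()
augmented = augment(torch.randn(128, 300))
```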

📈 Training Results

  • Best Validation Accuracy: ~49.6%
  • Training set: Actors 1–20
  • Validation set: Actors 21–24
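
The actor-based split can be derived from the RAVDESS filename convention, where the third field encodes the emotion (01–08) and the seventh field the actor ID. A sketch, assuming the preprocessed .npy files keep the original names and live in a hypothetical data/processed folder:

```python
from pathlib import Path

EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}

def split_by_actor(processed_dir="data/processed", val_actors=range(21, 25)):
    """Split preprocessed spectrograms into train/val lists by RAVDESS actor ID."""
    train, val = [], []
    for f in Path(processed_dir).glob("*.npy"):
        parts = f.stem.split("-")
        emotion = EMOTIONS[int(parts[2])]   # 3rd field: emotion code
        actor = int(parts[6])               # 7th field: actor ID
        (val if actor in val_actors else train).append((f, emotion))
    return train, val
```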

Confusion Matrix Example:

(confusion matrix image)

πŸ” Key Observations:

  • Surprised, calm, and disgust are the most accurately predicted emotions.
  • Neutral, happy, and sad tend to be confused with each other, which is common due to subtle acoustic variations.
  • The model struggles with fearful and angry in some cases β€” suggesting those may share overlapping vocal characteristics in this dataset.
  • Emotion classes like happy and fearful are often misclassified due to variability in expression intensity among different actors.
πŸ“ˆ Interpretation

While the model captures general emotion cues, it suffers from class overlap and limited generalization. The accuracy remains significantly above random (12.5% for 8 classes), but there is still room for improvement.


🚀 Getting Started

1. Install dependencies

pip install -r requirements.txt

2. Download the dataset from Kaggle

Follow the instructions in the README.md located in the data folder.

3. Train the model

python src/train.py

4. Evaluate the model with a confusion matrix

python src/confusion_matrix.py
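
For reference, the confusion matrix itself can be computed from collected predictions with scikit-learn. The snippet below is only a sketch with placeholder predictions, not the contents of src/confusion_matrix.py:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

LABELS = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]

# y_true / y_pred: integer class indices collected over the validation set
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 7, 1, 2])  # placeholder values
y_pred = np.array([0, 1, 2, 2, 4, 4, 6, 7, 1, 0])  # placeholder values

cm = confusion_matrix(y_true, y_pred, labels=range(len(LABELS)))
ConfusionMatrixDisplay(cm, display_labels=LABELS).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```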