The SpeechRecognition project, developed as part of the Digital Signal Processing Techniques course, aimed to create a simple EdgeAI system capable of classifying basic voice commands such as "left", "right", "stop", and "go" using a neural network deployed on an STM32 microcontroller.
First, the architecture of the overall system was defined. It consists of three main components:
- Signal acquisition (analog microphone sampled by the ADC)
- Audio preprocessing (filtering, decimation to 8 kHz, conversion to a spectrogram, and spectral noise reduction; see the sketch after this list)
- Spectrogram classification (neural-network inference)
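A minimal Python/SciPy sketch of the preprocessing chain is given below. The 48 kHz ADC rate, the STFT parameters, and the quiet-frame noise estimate are illustrative assumptions; only the decimation to 8 kHz, the spectrogram conversion, and the spectral noise reduction come from the pipeline above.

```python
# Preprocessing sketch: decimation -> spectrogram -> spectral subtraction.
# FS_IN, nperseg and the noise-estimation strategy are assumptions.
import numpy as np
from scipy import signal

FS_IN = 48_000   # assumed ADC sampling rate
FS_OUT = 8_000   # target rate after decimation (from the pipeline above)

def preprocess(pcm: np.ndarray) -> np.ndarray:
    """Decimate to 8 kHz, compute a magnitude spectrogram, and apply
    simple spectral subtraction for noise reduction."""
    # Decimate with the built-in anti-aliasing filter (48 kHz -> 8 kHz).
    audio = signal.decimate(pcm.astype(np.float32), FS_IN // FS_OUT)

    # Short-time Fourier transform -> magnitude spectrogram.
    _, _, stft = signal.stft(audio, fs=FS_OUT, nperseg=256, noverlap=128)
    mag = np.abs(stft)

    # Spectral subtraction: estimate the noise floor from the quietest
    # 10% of frames and subtract it from every frame, clipping at zero.
    frame_energy = mag.sum(axis=0)
    quiet = frame_energy.argsort()[: max(1, mag.shape[1] // 10)]
    noise_floor = mag[:, quiet].mean(axis=1, keepdims=True)
    return np.clip(mag - noise_floor, 0.0, None)
```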
The neural network was trained offline using the TensorFlow library on the Speech Commands dataset.
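The exact network topology is not documented here, so the following is only a plausible sketch of the offline training setup: the small CNN, the input shape, and the hyperparameters are assumptions, while TensorFlow and the Speech Commands dataset come from the project.

```python
# Training sketch; layer sizes, input shape and epochs are illustrative.
import tensorflow as tf

NUM_CLASSES = 35            # full Speech Commands v2 label set (assumed)
INPUT_SHAPE = (129, 61, 1)  # assumed spectrogram size (freq bins x frames)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=INPUT_SHAPE),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds / val_ds: hypothetical tf.data pipelines yielding
# (spectrogram, label) pairs built with the preprocessing above.
# model.fit(train_ds, validation_data=val_ds, epochs=20)
# model.save("best_model.keras")
```

Below is the confusion matrix for the best-performing model on the test data.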
The best model was then fine-tuned on samples recorded with the target microphone connected to the microcontroller, and the number of output classes was reduced to four. Before deployment to the STM32 microcontroller using the X-Cube-AI tool, the network was further optimized and quantized.
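A hedged sketch of this step is shown below, assuming the common route of swapping the classifier head and running full-integer post-training quantization to a TensorFlow Lite model, which X-Cube-AI can import. The file names, the `mic_train_ds` dataset, and the calibration details are hypothetical.

```python
import tensorflow as tf

# Load the best offline model (hypothetical file name) and replace its
# classifier head with the 4 target commands: left, right, stop, go.
base = tf.keras.models.load_model("best_model.keras")
backbone = tf.keras.Model(base.input, base.layers[-2].output)
outputs = tf.keras.layers.Dense(4, activation="softmax")(backbone.output)
ft_model = tf.keras.Model(backbone.input, outputs)
ft_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])

# mic_train_ds: hypothetical tf.data pipeline of spectrograms recorded
# with the target microphone.
# ft_model.fit(mic_train_ds, epochs=10)

def representative_data():
    # Feed real spectrograms so the converter can calibrate int8 ranges;
    # expand_dims adds the batch dimension the model expects.
    for spec, _ in mic_train_ds.take(200):
        yield [tf.expand_dims(tf.cast(spec, tf.float32), 0)]

# Full-integer post-training quantization to int8.
converter = tf.lite.TFLiteConverter.from_keras_model(ft_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
with open("speech4_int8.tflite", "wb") as f:
    f.write(converter.convert())
```

The resulting `.tflite` file would then be imported into X-Cube-AI in STM32CubeMX, which generates the C inference code for the target.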
Below is the confusion matrix showing the results of system-level tests.
The system’s performance could be improved by using a MEMS microphone with a digital I2S interface, increasing the audio sampling rate to fs = 16 kHz, and deploying on a microcontroller with more available SRAM.