Improving sound event detection through enhanced feature extraction and attention mechanisms
Higher Education Press
image: Overview of the proposed model based on the Mean-Teacher architecture
Credit: HIGHER EDUCATION PRESS
Sound Event Detection (SED) is a crucial area of research in audio processing, focused on identifying specific sound events and their corresponding timestamps within continuous audio streams. However, current SED methods face several challenges, including overlapping sound events, interference from background noise, and limitations in feature extraction capabilities. These issues often result in false detections or missed events, ultimately reducing the accuracy and robustness of detection systems.
To address these challenges, a research team led by Dongping ZHANG published new research on 15 October 2025 in Frontiers of Computer Science, a journal co-published by Higher Education Press and Springer Nature.
The research team introduced an enhanced SED method based on semi-supervised learning (SSL), incorporating Enhanced Feature Extraction and Attention Mechanisms (EFAM) to improve the model’s ability to detect and temporally localize sound events. Traditional SED methods often struggle with the limited availability of labeled training data and the inherent complexity of audio signals, leading to higher rates of false positives and missed detections. By leveraging SSL and the Mean-Teacher framework, the EFAM model effectively integrates both labeled and unlabeled data, significantly improving performance and generalization.
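For readers unfamiliar with the Mean-Teacher framework, it pairs a student network with a teacher network whose weights are an exponential moving average (EMA) of the student’s; labeled clips supply a supervised loss, while unlabeled clips contribute a consistency loss that pulls the two models’ predictions together. The following is a minimal PyTorch sketch of that training scheme, not the authors’ implementation; the loss weighting, EMA decay, and model interface are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay=0.999):
    """Move teacher weights toward the student's via an exponential moving average."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

def training_step(student, teacher, labeled_x, labels, unlabeled_x,
                  optimizer, w_cons=1.0):
    """One semi-supervised step: supervised BCE on labeled clips plus a
    consistency (MSE) loss tying student and teacher on unlabeled clips."""
    student.train()
    # Supervised loss: multi-label sound-event targets -> binary cross-entropy.
    sup_loss = F.binary_cross_entropy_with_logits(student(labeled_x), labels)
    # Consistency loss: the student should match the (frozen) teacher's
    # predictions on unlabeled audio.
    with torch.no_grad():
        teacher_pred = torch.sigmoid(teacher(unlabeled_x))
    cons_loss = F.mse_loss(torch.sigmoid(student(unlabeled_x)), teacher_pred)
    loss = sup_loss + w_cons * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)  # teacher tracks the student after each step
    return loss.item()
```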
The EFAM model improves sound event detection performance through three key components:
- Bi-Path Fusion Convolution Module (BPF-Conv): utilizes a bi-path architecture to enhance feature extraction, thereby improving feature representation and transfer.
- Dual-Head Self-Attention Pooling (DSAP) Function: aggregates frame-level predictions from weakly labeled samples, boosting prediction accuracy.
- Channel Attention Mechanism (CAM): selectively emphasizes critical features within the audio feature map while filtering out irrelevant information, effectively reducing false positives and missed detections (a sketch of the two attention components follows this list).
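To make the two attention ideas concrete, below is a minimal PyTorch sketch of a squeeze-and-excitation-style channel attention block and a dual-head attention pooling that collapses frame-level probabilities into a clip-level prediction. The layer sizes, reduction ratio, and head count are assumptions for illustration; the paper’s DSAP and CAM designs may differ in detail.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel gating: globally pool each channel,
    pass the vector through a small bottleneck MLP, and rescale the channels."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, channels, time, freq)
        weights = self.fc(x.mean(dim=(2, 3)))  # squeeze: global average pool per channel
        return x * weights[:, :, None, None]   # excite: reweight feature-map channels

class DualHeadAttentionPooling(nn.Module):
    """Pool frame-level class probabilities into a clip-level prediction using
    two attention heads whose outputs are averaged."""
    def __init__(self, feat_dim, num_classes, num_heads=2):
        super().__init__()
        self.score = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_heads)]
        )

    def forward(self, frame_probs, frame_feats):
        # frame_probs: (batch, time, classes); frame_feats: (batch, time, feat_dim)
        clip_preds = []
        for head in self.score:
            attn = torch.softmax(head(frame_feats), dim=1)      # per-class frame weights
            clip_preds.append((attn * frame_probs).sum(dim=1))  # weighted sum over time
        return torch.stack(clip_preds).mean(dim=0)              # average the heads
```

In this sketch the channel gate suppresses feature-map channels dominated by background noise, while the attention pooling lets weakly labeled clips (which have no timestamps) still supervise the frame-level predictions.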
In evaluations conducted within a semi-supervised SED system based on the Mean-Teacher architecture, the EFAM model achieved significantly higher accuracy than existing methods while requiring fewer labeled samples.
Future research will aim to integrate multimodal data, such as audio and video, to exploit complementary information across different modalities, further improving the model’s accuracy and robustness in complex environments.