

Wearable devices are used widely in a variety of health- and lifestyle-related applications, from tracking personal fitness to monitoring patients suffering from physical and mental ailments. In recent years, machine learning techniques have been employed to produce state-of-the-art results in several audio-related tasks. The success of these approaches has been largely due to access to large open-source datasets and increased computational resources. However, a shortcoming of these methods is that they often fail to generalize well to tasks from real-life scenarios, due to domain mismatch. One such task is foreground speech detection from wearable audio devices. Several interfering factors, such as dynamically varying environmental conditions, including background speakers, TV, or radio audio, make foreground speech detection a challenging task. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly.

In this work, we use multiple instance learning (MIL) to facilitate development of such models using annotations available at a lower time resolution (coarsely labeled). We show how MIL can be applied to localize foreground speech in coarsely labeled audio and report both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to densely distributed events, as observed in our application. Finally, we show improvements using speech activity detection embeddings as features for foreground detection.
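
To illustrate the role of pooling in MIL, the following minimal sketch aggregates instance-level probabilities (for example, one per short audio frame) into a single bag-level probability for a coarsely labeled segment. The three pooling functions shown (max, mean, and linear softmax) are common choices in the MIL literature and are assumptions made for illustration; they are not necessarily the pooling methods studied in this work, and the function names and example probabilities are hypothetical.

```python
from typing import Sequence


def max_pool(probs: Sequence[float]) -> float:
    """Bag probability is the largest instance probability."""
    return max(probs)


def mean_pool(probs: Sequence[float]) -> float:
    """Bag probability is the average instance probability."""
    return sum(probs) / len(probs)


def linear_softmax_pool(probs: Sequence[float], eps: float = 1e-8) -> float:
    """Each instance is weighted by its own probability: sum(p^2) / sum(p)."""
    total = sum(probs)
    return sum(p * p for p in probs) / (total + eps)


# Hypothetical instance-level foreground probabilities for two bags.
sparse_bag = [0.05, 0.10, 0.90, 0.05]  # one confident foreground instance
dense_bag = [0.80, 0.90, 0.85, 0.70]   # foreground speech in most instances

for name, bag in (("sparse", sparse_bag), ("dense", dense_bag)):
    print(
        name,
        round(max_pool(bag), 3),
        round(mean_pool(bag), 3),
        round(linear_softmax_pool(bag), 3),
    )
```

In this sketch, max pooling assigns the sparse bag a high score based on a single instance, mean pooling scores it low because most instances are negative, and linear softmax pooling falls between the two. When positive instances are densely distributed within a bag, the three functions give much closer scores, which is one reason the choice of pooling may need to be revisited for such data.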
