Voice Activity Detection

Voice activity detection (VAD) is an algorithm that detects the presence of human speech in an audio signal. Automating this step lets a system process a large volume of input data and significantly reduces the manual effort needed for speech detection. VAD is mainly used in speech processing and speech recognition, for example in speech-to-speech translation. Such a system is usually built by combining several sub-systems: speaker diarization, speech-to-text (STT) conversion, text-to-text machine translation (MT), and text-to-speech synthesis (TTS). In this setup, VAD serves as the speech-boundary detector at the front of an automatic speech recognition pipeline.
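To make the technique concrete, below is a minimal sketch of frame-level speech detection using the open-source webrtcvad package. It illustrates VAD in general, not the system described in this case study; the 16 kHz sample rate, 30 ms frame size, and aggressiveness level are assumptions.

```python
# Minimal frame-level VAD sketch built on the open-source `webrtcvad`
# package (an illustration only, not Unidatalab's system).
import webrtcvad

def speech_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30):
    """Yield (start_seconds, is_speech) for each fixed-size frame.

    `pcm` must be 16-bit mono PCM at 8/16/32/48 kHz, and frames must be
    10, 20, or 30 ms long; both are webrtcvad requirements.
    """
    vad = webrtcvad.Vad(2)  # aggressiveness: 0 (lenient) to 3 (strict)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes/sample
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[offset:offset + frame_bytes]
        yield offset / 2 / sample_rate, vad.is_speech(frame, sample_rate)
```

Consecutive speech frames can then be merged into (start, end) segments, which is what "speech boundaries" means in practice.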

Client

Our client is a Swiss startup founded in the early fall of 2021. It offers a SaaS application to media and education professionals. The solution integrates third-party systems from Google, Amazon, and Microsoft and presents them in a combination and form convenient for users. Needing to improve translation quality and time-boundary detection, the client decided to test the ready-made systems from Google, Amazon, and Microsoft against the voice activity detection (VAD) solution from Unidatalab. Based on the results of this experiment, they decided to replace parts of their system with solutions offered by Unidatalab.

Solution

Improve time-boundary detection in the existing system by combining voice activity detection (VAD) with Google STT.

How it works

Unidatalab's VAD showed impressive results for time-boundary detection: 0.5% higher accuracy in English and 2% higher in German compared to the alternative systems. Our system was therefore integrated in place of the previous one. For speech-to-text transcription, we kept Google STT, which now receives VAD-processed data.
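As an illustration, here is a hedged sketch of handing VAD-processed audio to Google STT with the google-cloud-speech Python client. The encoding, sample rate, and language settings are assumptions, and the synchronous recognize call shown here suits only short clips.

```python
# Hedged sketch: transcribing VAD-processed 16 kHz mono PCM with Google
# STT. `vad_processed_pcm` is a hypothetical placeholder for the bytes
# returned by the VAD module.
from google.cloud import speech

def transcribe(vad_processed_pcm: bytes) -> list[str]:
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        enable_word_time_offsets=True,  # keep word-level timestamps
    )
    audio = speech.RecognitionAudio(content=vad_processed_pcm)
    # Synchronous recognition; longer files would use long_running_recognize.
    response = client.recognize(config=config, audio=audio)
    return [result.alternatives[0].transcript for result in response.results]
```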

01. The user uploads video/audio data via the user interface.

02. The speech-to-speech translation system sends this data to the voice activity detection module.

03. The VAD module processes the data and returns the video/audio with only the segments where speech was detected; the other parts are replaced by silence (a sketch of this step follows the list).

04. The Google Speech API receives the VAD-processed data and transcribes it using speech-to-text AI.

05. The data then moves along the client's speech-to-speech translation pipeline.
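Step 03 can be pictured as a simple masking operation. The sketch below, using numpy, assumes the VAD has already produced a list of (start, end) speech segments in seconds; it is illustrative, not the production module.

```python
# Hedged sketch of step 03: keep the speech segments, replace all other
# samples with silence (zeros). Assumes mono audio as a numpy array.
import numpy as np

def silence_non_speech(samples: np.ndarray, sample_rate: int,
                       speech_segments: list[tuple[float, float]]) -> np.ndarray:
    """Return a copy of `samples` with non-speech regions zeroed out."""
    out = np.zeros_like(samples)
    for start, end in speech_segments:
        lo = int(start * sample_rate)
        hi = min(int(end * sample_rate), len(samples))
        out[lo:hi] = samples[lo:hi]
    return out
```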

Our challenges:

The client had difficulty identifying the key problems

Our customer didn't know which translation issues were the most painful, so it was challenging to choose the key scenarios for testing and evaluation. It was also impossible to predict which video categories would be a priority for the client in the future.

Video data in different languages

The video dataset contained several languages, as well as dialects and accents: besides English, there were German, Italian, Spanish, and French. It was difficult to label the data without understanding the context, as our ML engineers did not speak all the languages on the list. Part of the video data was therefore labeled with the help of the client's editorial team, and part was excluded from the dataset.

No AI expertise in the client's team

The customer had experience only in business and software integration, with no expertise in AI, which made discussing interim results and next steps difficult. Our team managed the entire research process and prepared the next steps for the client based on the research results.

Project stages

Stage 1. Data preparation

At this stage, we collaborated closely with the client's editorial team. We selected video data that was difficult to process and divided it into key categories: "clear speech", "background noise", "fast and slurred speech", and "background music". Our team analyzed this data and selected the most representative metrics to evaluate the test results correctly. We also annotated the transcripts with timestamps for further testing.
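One plausible way to score time-boundary detection against such timestamped transcripts is frame-level precision and recall of the predicted speech segments. The sketch below illustrates the idea; the metrics actually chosen for the project are not spelled out in this case study.

```python
# Hedged sketch of a boundary-detection metric: rasterize predicted and
# reference (start, end) segments into 10 ms frames, then compare.
def frame_labels(segments, duration, hop=0.01):
    """Mark each `hop`-second frame as speech (True) or not (False)."""
    n = int(duration / hop)
    labels = [False] * n
    for start, end in segments:
        for i in range(int(start / hop), min(int(end / hop), n)):
            labels[i] = True
    return labels

def precision_recall(pred_segments, ref_segments, duration):
    pred = frame_labels(pred_segments, duration)
    ref = frame_labels(ref_segments, duration)
    tp = sum(p and r for p, r in zip(pred, ref))
    fp = sum(p and not r for p, r in zip(pred, ref))
    fn = sum(r and not p for p, r in zip(pred, ref))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```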

Stage 2. Testing and comparison

Our engineers tested the client's existing services for speech-to-speech tasks and compared them with Unidatalab's solution. To achieve this, several scripts were written to automate the testing process. The results were analyzed, and recommendations for the next steps were developed.
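A comparison script of this kind typically scores each provider's transcript against the labeled reference. Below is a hedged sketch using word error rate from the open-source jiwer package; the provider names, placeholder transcripts, and the choice of metric are assumptions for illustration.

```python
# Hedged sketch of automated scoring: word error rate (WER) per provider,
# computed with the open-source `jiwer` package. Lower WER is better.
from jiwer import wer

def compare_providers(reference: str, hypotheses: dict[str, str]) -> dict[str, float]:
    """Score each provider's transcript against the reference text."""
    return {name: wer(reference, text) for name, text in hypotheses.items()}

# Hypothetical usage with placeholder transcripts:
scores = compare_providers(
    reference="the quick brown fox jumps over the lazy dog",
    hypotheses={
        "google_stt": "the quick brown fox jumps over a lazy dog",
        "vad_plus_google_stt": "the quick brown fox jumps over the lazy dog",
    },
)
```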

Stage 3. API development

Our software engineers created an API for VAD that receives video data and returns the same video with only the segments where speech was detected.
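The case study does not describe the API surface, but a minimal sketch of such an endpoint might look like the following; FastAPI, the route name, and the run_vad helper are all hypothetical.

```python
# Hedged sketch of a VAD service endpoint. FastAPI, the `/vad` route, and
# `run_vad` are hypothetical; the delivered component may differ.
from fastapi import FastAPI, UploadFile
from fastapi.responses import Response

app = FastAPI()

def run_vad(media: bytes) -> bytes:
    """Hypothetical placeholder: silence the non-speech parts of `media`."""
    return media

@app.post("/vad")
async def detect_speech(file: UploadFile) -> Response:
    processed = run_vad(await file.read())
    return Response(content=processed,
                    media_type=file.content_type or "application/octet-stream")
```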

Stage 4. Delivery

The API integration module was delivered to the client as a separate component with the corresponding documentation, together with a report on the testing and evaluation of the speech-to-speech services and recommendations for pipeline optimization.

Summary

Voice activity detection is now part of the pipeline of a speech-to-speech translation system. Our client decided to integrate Unidatalab's VAD for detecting the time boundaries of the input video/audio content, as our solution showed better results than the alternative systems. Once we receive the customer's feedback, the solution can be improved further to meet all of their needs.