Speaker Identification

Speech-to-text transcription converts spoken words into written form. Behind the scenes, speaker diarization and identification splits the recording into segments: each segment is labeled with timestamps that mark the boundaries between speakers and is attributed to a particular speaker, and gender or age detection can be added if needed. Speech recognition technology has advanced to the point where it can transcribe recordings of court hearings with accuracy that rivals human transcriptionists. This helps speed up business processes, extract insights, and simplify user interaction with a product.
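In practice, the output of diarization and identification can be thought of as an ordered list of labeled time segments. The sketch below is purely illustrative (the field names and speaker labels are hypothetical), but it shows the kind of structure the rest of this case study refers to.

```python
from dataclasses import dataclass

@dataclass
class SpeakerSegment:
    """One diarized slice of the recording: who spoke, and when."""
    speaker: str    # e.g. "SPEAKER_00", or a known identity after identification
    start: float    # segment start, seconds from the beginning of the recording
    end: float      # segment end, seconds
    text: str = ""  # filled in later by the speech recognition module

# A diarized hearing is then just an ordered list of such segments.
timeline = [
    SpeakerSegment("JUDGE", 0.0, 7.4),
    SpeakerSegment("COUNSEL_A", 7.4, 21.9),
    SpeakerSegment("JUDGE", 21.9, 25.1),
]

for seg in timeline:
    print(f"[{seg.start:6.1f} - {seg.end:6.1f}] {seg.speaker}")
```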

Problem

Our client is an international software firm headquartered in Sweden. One of its offerings was automatic audio translation into various languages, which lawyers found useful, but what legal companies really lacked was a high-accuracy transcription system. Manual transcription is time-consuming and labor-intensive and demands sustained focus from the transcriber. Under the new EU policy on digital accessibility in public affairs, recordings of official meetings and court hearings must be either transcribed or captioned. Legal companies also find transcription useful for private court hearings, since they can search for words, review, and analyze documents instead of relistening to the audio.

Solution

Leverage multilingual speech recognition and speaker diarization to create high-accuracy, structured legal documents from audio.

How it works

01

User uploads video/audio data via the user interface.

02

The audio data is preprocessed and sent to the speaker diarization and identification module.

03

The speaker diarization and identification module processes audio data and returns labeled segments with timestamps and speaker identifiers.

04

Audio data and labeled segments are then sent to the speech recognition module, where each segment is transcribed.

05

The transcribed segments, together with their speaker labels and timestamps, are assembled into a structured document and returned to the user (a code sketch of the overall flow follows below).
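The flow above can be summarized in a short sketch. The callables below (preprocess, diarize, transcribe, assemble) are hypothetical placeholders for the actual modules, which are not public; the sketch only mirrors the order of the steps.

```python
from typing import Callable, Dict, List

def run_pipeline(
    path: str,
    preprocess: Callable[[str], object],        # step 02: load / resample / clean the audio
    diarize: Callable[[object], List[dict]],    # step 03: labeled segments with timestamps
    transcribe: Callable[[object, dict], str],  # step 04: speech recognition for one segment
    assemble: Callable[[List[dict]], Dict],     # step 05: build the structured document
) -> Dict:
    audio = preprocess(path)
    segments = diarize(audio)
    for seg in segments:
        seg["text"] = transcribe(audio, seg)
    return assemble(segments)
```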

Our challenges:

Industry-specific jargon

Because of the nature of the legal industry, our team had to make sure the AI was tested on audio data from that domain, such as recordings of previous court hearings. Since legal jargon rarely appears in everyday speech, we had to adapt our model so it handles this industry lexicon well.
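The adaptation method used in the project is not described here. As a simplified illustration of lexicon biasing at its most basic, one lightweight approach is to post-correct frequent misrecognitions against a domain dictionary; the terms and corrections below are invented examples, not the project's actual technique.

```python
# Invented examples of frequent ASR misrecognitions of legal terms.
LEGAL_CORRECTIONS = {
    "sub peena": "subpoena",
    "habeus corpus": "habeas corpus",
    "a meekus brief": "amicus brief",
}

def apply_domain_lexicon(transcript: str) -> str:
    """Replace known misrecognitions with the correct legal terms."""
    fixed = transcript.lower()
    for wrong, right in LEGAL_CORRECTIONS.items():
        fixed = fixed.replace(wrong, right)
    return fixed

print(apply_domain_lexicon("The witness ignored the sub peena"))
# -> "the witness ignored the subpoena"
```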

A small amount of quality audio from previous hearings

There were not enough audio files of the necessary quality for testing the model. To compensate, we developed a method to artificially increase the amount of data using our text-to-speech technology (audio data augmentation).
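The client project used proprietary text-to-speech technology, so the sketch below uses the open-source pyttsx3 engine purely as a stand-in to show the general shape of TTS-based augmentation: reading existing transcripts and synthesizing audio files from them. The directory names are hypothetical.

```python
import pyttsx3
from pathlib import Path

def synthesize_corpus(text_dir: str, out_dir: str) -> None:
    """Turn each transcript file in text_dir into a synthetic WAV in out_dir."""
    engine = pyttsx3.init()
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for txt in sorted(Path(text_dir).glob("*.txt")):
        wav_path = Path(out_dir) / f"{txt.stem}.wav"
        # Queue synthesis of this transcript to an audio file on disk.
        engine.save_to_file(txt.read_text(encoding="utf-8"), str(wav_path))
    engine.runAndWait()  # run all queued synthesis jobs

# Hypothetical paths: transcripts in ./hearing_transcripts, output in ./synthetic_audio
synthesize_corpus("hearing_transcripts", "synthetic_audio")
```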

Handling speech overlap

Some recordings contained segments where speakers talked over each other. To deal with this, we used our speech overlap detection module to flag the overlapped segments so that the speech recognition module could handle them correctly.
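The overlap detection module itself is not public. As a simplified illustration of the underlying idea, the sketch below finds time intervals covered by two different speakers at once, given already-diarized segments; a production module would work on the raw audio signal instead.

```python
def find_overlaps(segments):
    """segments: list of (speaker, start, end) tuples, in any order.
    Returns (start, end) intervals where two different speakers overlap."""
    overlaps = []
    ordered = sorted(segments, key=lambda s: s[1])
    for i, (spk_a, start_a, end_a) in enumerate(ordered):
        for spk_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # later segments start even further to the right
            if spk_a != spk_b:
                overlaps.append((max(start_a, start_b), min(end_a, end_b)))
    return overlaps

print(find_overlaps([("JUDGE", 0.0, 8.0), ("COUNSEL_A", 6.5, 15.0)]))
# -> [(6.5, 8.0)]
```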

Project stages

Stage 1

At this stage, the client team provided audio data of an actual hearing, which was used for testing and evaluation. Textual data from previous hearings was also provided and used to generate audio that served as the training dataset for our model. Our team analyzed this data and selected the most representative metrics for evaluating the results correctly.
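The selected metrics are not named in this case study. Word error rate (WER) is the standard metric for the transcription part, and the snippet below shows how it might be computed with the jiwer library on an invented reference/hypothesis pair; diarization quality is typically measured separately (for example with diarization error rate).

```python
import jiwer

reference = "the defendant entered a plea of not guilty"
hypothesis = "the defendant entered a plea of guilty"

# WER = (substitutions + deletions + insertions) / number of words in the reference
print(f"WER: {jiwer.wer(reference, hypothesis):.2f}")
```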

Stage 2

Our AI team tested existing services that provide similar solutions and compared them with our proprietary solution. Several scripts were written to automate the testing process. The experiment results were obtained and analyzed, and recommendations for the client's next steps were developed.
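A comparison script of this kind might look roughly like the sketch below. The engine adapters and test-set format are assumptions for illustration; the services actually benchmarked in the project are not listed here.

```python
import jiwer

def benchmark(engines: dict, test_set: list) -> dict:
    """engines: name -> callable(audio_path) returning a transcript string.
    test_set: list of (audio_path, reference_transcript) pairs.
    Returns the mean WER per engine over the test set."""
    mean_wer = {}
    for name, transcribe in engines.items():
        errors = [jiwer.wer(ref, transcribe(audio)) for audio, ref in test_set]
        mean_wer[name] = sum(errors) / len(errors)
    return mean_wer
```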

Stage 3

Our software engineers created an API for speaker recognition with diarization and identification that accepts audio (or video) data and returns a timeline with speech segments, speaker identifiers, and transcription.
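The real API surface is not documented here. Below is a minimal sketch of what such an endpoint could look like, assuming a Python back end built with FastAPI; the route name, response fields, and the process_recording placeholder are all hypothetical.

```python
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def process_recording(data: bytes) -> list:
    """Placeholder for the diarization + identification + transcription pipeline."""
    return [{"start": 0.0, "end": 7.4, "speaker": "JUDGE",
             "text": "The court is now in session."}]

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    audio = await file.read()
    return {"filename": file.filename, "segments": process_recording(audio)}
```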

Stage 4

The API integration module was delivered to the client as a separate component, together with documentation and a research summary of the speaker recognition and diarization modules. Pipeline optimization recommendations were also provided.

Summary

The solution streamlines the work of legal teams by reducing the time spent transcribing court sessions. The client is now actively using the program and collecting feedback on it from lawyers.