Voice effects overlaying
The purpose of voice effects overlaying is to change the speaker's voice into another voice. Such a system significantly expands the voice base for speech synthesis and, at the same time, reduces the time needed to dub voice tracks without hiring voice actors. Using this system, you can make your voice sound like a child's, change the gender of the voice, make the voice older, sound like a famous person, or imitate a character's voice from cartoons and games. Usually, this effect is achieved with special software that processes sound and changes its characteristics, such as frequency, timbre, tempo, and pitch, or with audio plugin standards such as LADSPA or VST, where these settings are tuned to achieve the desired effect.
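To make the idea concrete, here is a minimal, purely illustrative sketch of one such characteristic change: shifting pitch by naive resampling. This is not the client's implementation or any vendor's algorithm, just a textbook toy (real voice processors shift pitch without changing tempo, e.g. via phase vocoders).

```python
import math

def resample(samples, factor):
    """Naive pitch shift by resampling with linear interpolation.
    factor > 1 raises the pitch (and shortens the clip);
    factor < 1 lowers it. Real effects decouple pitch from tempo."""
    n = int(len(samples) / factor)
    out = []
    for i in range(n):
        pos = i * factor
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a + (b - a) * frac)  # linear interpolation
    return out

def zero_crossings(x):
    """Count upward zero crossings: a rough proxy for frequency."""
    return sum(1 for a, b in zip(x, x[1:]) if a < 0 <= b)

# One second of a 440 Hz tone at a 16 kHz sample rate
sr = 16000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
up = resample(tone, 2.0)  # one octave up, half the duration
```

After resampling, the zero-crossing rate per sample roughly doubles, confirming the pitch moved up an octave.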
Our client is a Swiss start-up that specializes in video content creation and automatic video soundtrack translation. They offer software where users can upload any video and get its soundtrack translated into any language, using both original and custom voices. The software relies on Google, Amazon, and Microsoft speech synthesis solutions. As the client needed to go beyond the voices offered by those services, expand their voice base, and create new voices at users' requests, they decided to research open-source and commercial third-party solutions. Based on the research results, our customer wanted to extend the pipeline with a module that changes the input voice recording or applies various effects to it.
Solution
Expanding the voice database of the text-voicing service by integrating third-party tools that apply various effects to the standard voices in the existing pipeline.
How it works
The voice effects overlaying module accepts audio/video input and returns the same format. Since the module takes only audio/video data (not text) and serves as an additional tool for voice post-processing, it must be combined with a text-to-speech synthesis system for translation tasks. The module is therefore placed after the speech synthesis service and is applied at the user's request.
The user uploads video/audio data via the user interface.
Then the data moves along the client's video translation pipeline to the speech synthesis service.
The speech synthesis service returns voice recordings synthesized from the translated text, using a voice available in that service.
The voice effects overlaying module receives the voice recordings and applies the effect sets selected by the user.
The processed voice is then inserted into the translated video (in place of the original voice), which is returned to the user.
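The steps above can be sketched as a chain of functions. Everything here is a hypothetical placeholder (the function names, voice IDs, and effect names are ours for illustration, not the client's real API); the point is where the optional effects module sits in the flow.

```python
# Illustrative sketch of the translation pipeline described above.
# All "service calls" are stand-ins returning plain dicts.

def synthesize_speech(translated_text, voice_id):
    # Stand-in for the speech synthesis service (Google/Amazon/Microsoft)
    return {"voice_id": voice_id, "text": translated_text}

def apply_voice_effects(voice_record, effects):
    # Stand-in for the voice effects overlaying module
    processed = dict(voice_record)
    processed["effects"] = list(effects)
    return processed

def insert_into_video(video, voice_record):
    # Replace the original audio track with the processed voice
    return {"video": video, "audio": voice_record}

def translate_video(video, translated_text, voice_id, effects=None):
    record = synthesize_speech(translated_text, voice_id)
    if effects:  # the module runs only at the user's request
        record = apply_voice_effects(record, effects)
    return insert_into_video(video, record)

result = translate_video("clip.mp4", "Hallo Welt", "standard-f1",
                         ["age_up", "pitch_down"])
```

When no effects are requested, the synthesized voice passes through unchanged, which matches the module's optional, post-processing role.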
Our challenges:
Making a decision (customized open-source vs. commercial SaaS)
Based on the research results, the question arose of which approach was better: an open-source solution, which would require more time for implementation and preparation but provide complete independence from third-party vendors, or a commercial one, which would allow faster integration and earlier user feedback but involve subscription costs and a dependence on third-party vendors' effect lists and roadmaps. Our team helped the client weigh the pros and cons and make this decision.
Preparing the voice effects list
There was no ready-made list of effects to implement. Since most vendors offer only partial voice-processing functionality (for example, only gender change, voice aging, or voice cloning), we needed to identify the effects required for our client's particular business case. Together with the client, we reviewed the effect lists offered by similar businesses, then updated and prioritized them; based on that list, we did our research and shortlisted the top three voice-processing vendors.
Communication with commercial vendors
Since it was decided to use ready-made commercial voice-processing solutions to speed up integration, business terms had to be agreed with their owners before such a system could be implemented. The client's team didn't have a specialist who could fully manage this, so our team helped at the initial stage of communication.
Project stages
First, our team and the client agreed on the list of effects that were a priority for the client and had to be taken into account when choosing solutions.
Our team conducted research on existing open-source and commercial solutions to solve the task and selected several options according to the client’s requirements.
We prepared a summary listing the solutions the client could use to obtain each effect, the pros and cons of each solution, and voice samples for each effect so processing quality could be evaluated. We also prepared recommendations and the next steps for integrating the corresponding web service into the client's pipeline.
Our software engineers upgraded the system by creating an API that accepts audio/video input with parameters and returns processed voice recordings. The logic behind the API is based on integration with several third-party voice-processing solutions.
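One plausible shape for such an API, sketched under our own assumptions (the backend names, effect names, and routing scheme below are invented for illustration; the real vendor APIs are private), is a single entry point that routes each requested effect to a backend that supports it:

```python
# Hypothetical routing layer for a multi-vendor voice-processing API.
# Maps each supported effect to the third-party backend offering it.

BACKENDS = {
    "vendor_a": {"gender_swap", "age_up"},
    "vendor_b": {"pitch_down", "robot"},
}

def route_effect(effect):
    """Return the first backend advertising the requested effect."""
    for name, effects in BACKENDS.items():
        if effect in effects:
            return name
    raise ValueError(f"no backend supports effect {effect!r}")

def process(audio_bytes, effects):
    """Plan which backend handles each effect. A real implementation
    would POST audio_bytes to each backend in turn; here the call is
    only simulated, since the vendor endpoints are not public."""
    plan = [(effect, route_effect(effect)) for effect in effects]
    return {"audio": audio_bytes, "applied": plan}

result = process(b"...", ["age_up", "robot"])
```

Keeping the vendor-specific details behind one routing table makes it cheap to add or swap third-party providers later, which matters given the dependence on vendors' effect lists noted above.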
The system was delivered to the customer as a separate web service, together with the corresponding documentation and the research results with recommendations.