Support for additional voices

by realsanjeev - opened Oct 1, 2025

Hi, this is great work.

I was wondering how to support additional voices besides the predefined ones. Do we need to fine-tune the model for each new voice (for the English language), or can we perform inference directly using a custom voice?

I tried to create a custom voice for inference by passing the model an array of raw audio samples of shape (1, 256), which the model accepts. I loaded the audio, resampled it to 24 kHz, kept only the first 256 samples, and passed that array to the model. However, the output was gibberish.
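
A minimal sketch of the attempt described above, to make the steps concrete. The helper names (`resample_linear`, `naive_voice_from_audio`), the synthetic 16 kHz test tone, and the linear-interpolation resampler are all assumptions for illustration; they are not part of the model's API, and a real pipeline would likely use a proper resampler. The sketch also prints how much audio 256 samples at 24 kHz actually covers. Whether the (1, 256) input is meant to be a learned style vector rather than raw samples is not confirmed here; it is only one possible reason for the gibberish output.

```python
import numpy as np

TARGET_SR = 24_000  # sample rate mentioned in the post
STYLE_DIM = 256     # second dimension of the (1, 256) input

def resample_linear(wave, src_sr, dst_sr=TARGET_SR):
    """Naive linear-interpolation resampler (hypothetical stand-in for a real one)."""
    n_out = int(round(len(wave) * dst_sr / src_sr))
    src_t = np.arange(len(wave)) / src_sr
    dst_t = np.arange(n_out) / dst_sr
    return np.interp(dst_t, src_t, wave).astype(np.float32)

def naive_voice_from_audio(wave, src_sr):
    """Reproduce the attempt: first 256 resampled samples, shaped (1, 256)."""
    resampled = resample_linear(wave, src_sr)
    return resampled[:STYLE_DIM][np.newaxis, :]

# A synthetic 1-second, 220 Hz tone at 16 kHz stands in for a real recording.
sr = 16_000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 220 * t).astype(np.float32)

voice = naive_voice_from_audio(wave, sr)

# Duration of audio represented by 256 samples at 24 kHz.
covered_ms = 1000 * STYLE_DIM / TARGET_SR
print(voice.shape, f"{covered_ms:.2f} ms")  # (1, 256) 10.67 ms
```

The array has the accepted shape, but it holds only about 10.67 ms of raw waveform, which is very little information about a speaker's voice.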

Could you explain how to correctly obtain a style voice sample and pass it to the model? Do we need to use a specific model to encode the audio into the required dimensions, or are there alternative methods for voice styling in this architecture?
