Instructions to use KittenML/kitten-tts-nano-0.1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- KittenTTS
How to use KittenML/kitten-tts-nano-0.1 with KittenTTS:
```python
from kittentts import KittenTTS

m = KittenTTS("KittenML/kitten-tts-nano-0.1")
audio = m.generate("This high quality TTS model works without a GPU")

# Save the audio
import soundfile as sf
sf.write('output.wav', audio, 24000)
```
- Notebooks
- Google Colab
- Kaggle
Support for additional voices
Hi, this was great work.
I was wondering how to support additional voices besides the predefined ones. Do we need to fine-tune the model for each new voice (for the English language), or can we perform inference directly using a custom voice?
I tried to create a custom voice for inference by passing the model an array of raw audio samples with shape (1, 256), which the model accepts. I loaded the audio, resampled it to 24 kHz, kept only the first 256 samples, and passed those to the model. However, the output was gibberish.
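For reference, the attempt described above can be reproduced without the model. The following is only a minimal sketch: a synthetic tone stands in for the recorded voice, and `resample_linear` is a hypothetical helper, not part of KittenTTS. It shows that the slice has the accepted shape but covers only about 10.7 ms of waveform. That the model expects a learned style vector rather than raw samples is an assumption here and is not confirmed anywhere in this thread.

```python
import math

TARGET_SR = 24_000   # sample rate mentioned in the post
STYLE_DIM = 256      # width of the style input the model accepted, per the post

def resample_linear(samples, src_sr, dst_sr):
    """Hypothetical helper: resample by linear interpolation (stand-in for a real resampler)."""
    n_out = int(len(samples) * dst_sr / src_sr)
    out = []
    for i in range(n_out):
        pos = i * src_sr / dst_sr              # fractional position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# One second of a 220 Hz tone at 16 kHz, standing in for a recorded voice sample.
src_sr = 16_000
audio = [math.sin(2 * math.pi * 220 * t / src_sr) for t in range(src_sr)]

resampled = resample_linear(audio, src_sr, TARGET_SR)
style = [resampled[:STYLE_DIM]]                # nested list with shape (1, 256)

shape = (len(style), len(style[0]))
duration_ms = 1000 * STYLE_DIM / TARGET_SR     # how much audio 256 samples represent
print(shape, f"{duration_ms:.2f} ms of audio")
```

So the array passes the shape check while holding raw amplitude values from a tiny fraction of a second, which may explain why the model accepts it but produces unusable output.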
Could you explain how to correctly obtain a style voice sample and pass it to the model? Do we need to use a specific model to encode the audio into the required dimensions, or are there alternative methods for voice styling in this architecture?