Text to speech models

Intro to text-to-speech models

Dependencies

    librosa
    soundfile
    speechbrain
    torch
    torchaudio
    transformers

Setting up

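# SpeechBrain 1.0+ exposes pretrained interfaces under speechbrain.inference
# (the older speechbrain.pretrained path is deprecated)
from speechbrain.inference import EncoderClassifier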
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

1. Load the processor and the text-to-speech model

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts") # tokenizes the input text, used like a tokenizer
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts") # predicts a mel-spectrogram from the token IDs and a speaker embedding

2. Load the speaker embedding model (optional)

This model encodes a waveform into an x-vector, a fixed-length speaker embedding that is widely used in speech models. Here it conditions the voice of the synthesized speech. This step is optional: load the model only if your dataset does not already provide x-vectors and you need to compute them from the audio files.

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb", savedir="pretrained_models/spkrec-xvect-voxceleb")

Note: since SpeechBrain 1.0, the speechbrain.pretrained module is deprecated and redirects to speechbrain.inference, so EncoderClassifier should be imported from speechbrain.inference. See: https://github.com/speechbrain/speechbrain/releases/tag/v1.0.0
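
Because an x-vector summarizes who is speaking rather than what is said, two clips from the same speaker should yield embeddings that point in similar directions. The sketch below illustrates this with L2 normalization and cosine similarity in plain Python. The 4-dimensional vectors are made-up stand-ins for real 512-dimensional x-vectors, and the helper names are hypothetical, not part of SpeechBrain.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

# Toy stand-ins for x-vectors (made-up values, not real model output).
speaker_a_clip1 = [0.9, 0.1, -0.3, 0.2]
speaker_a_clip2 = [0.8, 0.2, -0.2, 0.3]
speaker_b_clip1 = [-0.4, 0.7, 0.5, -0.1]

same = cosine_similarity(speaker_a_clip1, speaker_a_clip2)
diff = cosine_similarity(speaker_a_clip1, speaker_b_clip1)
print(f"same speaker: {same:.3f}, different speaker: {diff:.3f}")
```

With real embeddings, the same comparison would be applied to the 512 values returned by the classifier. Normalizing the embedding to unit length before passing it to the text-to-speech model is a common convention, though whether it is needed depends on how the model was trained.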

3. Load a spectrogram vocoder

This model converts spectrograms into waveforms. Specifically, the loaded HiFi-GAN vocoder takes 80-bin mel-spectrograms as input and reconstructs the audio signal from them.

# Load the vocoder model
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
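
To give a feel for what "80-bin mel-spectrogram" means, the sketch below computes the center frequencies of 80 mel bands using the common HTK-style mel formula. The 0 to 8000 Hz range (the Nyquist limit for 16 kHz audio) is an assumption for illustration and is not read from the checkpoint's configuration, and the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style conversion from Hz to mel."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_mels=80, f_min=0.0, f_max=8000.0):
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]

centers = mel_center_frequencies()
print(f"{len(centers)} bins")
print(f"spacing at the low end:  {centers[1] - centers[0]:.1f} Hz")
print(f"spacing at the high end: {centers[-1] - centers[-2]:.1f} Hz")
```

The bands are narrow at low frequencies and wide at high frequencies, which roughly mirrors how pitch is perceived. The vocoder's job is to recover a full waveform from this compressed representation.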

Example

# Import the packages

import torchaudio
import torchaudio.transforms as T
from IPython.display import Audio

# In this example, a WAV file is loaded and used as the reference (conditioning) audio from which the speaker embedding is computed

sound_path = "01-00.04.75_00.07.46.wav"

# Load your .wav file
signal, fs = torchaudio.load(sound_path)

print(f"Shape of signal: {signal.shape} with resample {fs}Hz. ")

# Downmix multi-channel audio to mono by averaging the channels
if signal.size(0) > 1:
    print("Converting the signal to process as mono channel waveform")
    signal = signal.mean(dim=0, keepdim=True)

# The x-vector encoder and SpeechT5 both work with 16 kHz audio
if fs != 16000:
    print(f"Resampling from {fs}Hz to 16000Hz")
    resampler = T.Resample(orig_freq=fs, new_freq=16000)
    signal = resampler(signal)
    fs = 16000  # Update fs to the new sample rate

# Extract the x-vector using the classifier
embedding = classifier.encode_batch(signal) # signal is already mono here, with shape [1, num_samples]

# To get a NumPy vector instead:
# xvector = embedding.squeeze().detach().cpu().numpy()
print(f"Embedding shape: {embedding.shape}")

Output

Shape of signal: torch.Size([2, 119511]) with resample 44100Hz. 
Converting the signal to process as mono channel waveform
Resampling from 44100Hz to 16000Hz
Embedding shape: torch.Size([1, 1, 512])

To match the input expected by the x-vector encoder and SpeechT5, the audio is first downmixed to a single (mono) channel and then resampled from 44,100 Hz to 16,000 Hz. The encoder then maps the clip to a speaker embedding of shape torch.Size([1, 1, 512]), which can condition speech synthesis, as in the next step, or be used for tasks such as speaker recognition.
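
The two preprocessing steps can be checked by hand. The sketch below shows, in plain Python, what averaging two channels does and how many samples to expect after resampling. It assumes the resampler rounds the scaled length up, and the helper names and the three-sample toy signal are hypothetical.

```python
def downmix_to_mono(channels):
    """Average equal-length channels sample by sample."""
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]

def resampled_length(num_samples, orig_freq, new_freq):
    """Expected sample count after resampling, rounded up (assumed rule)."""
    return -(-num_samples * new_freq // orig_freq)  # integer ceiling division

# Toy stereo signal (made-up values): one list per channel.
left = [0.2, 0.4, -0.6]
right = [0.0, 0.2, -0.2]
mono = downmix_to_mono([left, right])

# Numbers taken from the output above: 119511 samples at 44100 Hz.
num_samples, orig_freq, new_freq = 119511, 44100, 16000
duration_s = num_samples / orig_freq
new_len = resampled_length(num_samples, orig_freq, new_freq)
print(f"{duration_s:.2f} s, {new_len} samples at {new_freq} Hz")
```

For the clip above this works out to 2.71 seconds of audio and 43,360 samples at 16 kHz. The duration is unchanged by resampling; only the number of samples per second differs.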

# Text to synthesize
inputs = processor(text="The aroma of fresh coffee filled the room, making it the perfect start to the day. He paused at the edge of the lake, staring at the still water, reflecting the clear blue sky above.", return_tensors="pt")
# Run the model: embedding.squeeze(0) has shape [1, 512], one speaker embedding for the batch
speech = model.generate_speech(inputs["input_ids"], embedding.squeeze(0), vocoder=vocoder)

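# Play the audio in a notebook (fs is 16000 here, matching SpeechT5's 16 kHz output)
Audio(speech, rate=fs)
# Save the audio to a file with torchaudio (expects shape [channels, num_samples])
torchaudio.save("output.wav", speech.unsqueeze(0), fs)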

