Text to speech models
Dependencies
librosa
soundfile
speechbrain
torchaudio
transformers
Setting up
from speechbrain.inference.speaker import EncoderClassifier  # SpeechBrain >= 1.0; older releases use speechbrain.pretrained
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
1. Load the Processor and the Text-to-Speech model
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts") # used like a tokenizer: turns the input text into token IDs
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts") # predicts a mel-spectrogram from the token IDs and a speaker embedding
2. Load the Speech Embedding model (Optional)
This model encodes .wav audio into x-vectors, fixed-length speaker embeddings that are widely used as input features for speech models. The step is optional: load the model only if your speaker reference is raw audio and still needs to be converted to an x-vector.
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb", savedir="pretrained_models/spkrec-xvect-voxceleb")
Note: with SpeechBrain 1.0 or later, importing from speechbrain.pretrained raises a UserWarning because the module was deprecated and now redirects to speechbrain.inference (see https://github.com/speechbrain/speechbrain/releases/tag/v1.0.0). Loading the classifier may also print a FutureWarning about torch.cuda.amp.custom_fwd; it originates in SpeechBrain's own speechbrain/utils/autocast.py, not in this script.
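Once an x-vector has been extracted, it is common to L2-normalize it before handing it to SpeechT5; the Hugging Face fine-tuning examples do this with torch.nn.functional.normalize, whereas the code on this page passes the raw embedding. The sketch below shows the same operation in plain Python so the arithmetic is visible. The function name l2_normalize and the stand-in vector are illustrative assumptions, not part of any library.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


# Stand-in for a 512-dimensional x-vector (hypothetical values).
xvector = [3.0, 4.0] + [0.0] * 510
unit = l2_normalize(xvector)

# The normalized vector has length 1.
unit_norm = math.sqrt(sum(x * x for x in unit))
print(len(unit), round(unit_norm, 6))
```

With a torch tensor of shape [1, 1, 512], the equivalent call would be torch.nn.functional.normalize(embedding, dim=2).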
3. Load a Spectrogram vocoder
This model converts spectrograms into waveforms. Specifically, the loaded HiFi-GAN vocoder takes 80-bin mel-spectrograms and reconstructs the audio signal from them.
# loading the vocoder model
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
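To get a feel for what the vocoder does with an 80-bin mel-spectrogram, the sketch below estimates how many audio samples a spectrogram of a given number of frames turns into. It assumes the vocoder upsamples in four stages of 4x each and outputs 16 kHz audio; these values are assumptions taken to be the defaults and should be confirmed against vocoder.config.upsample_rates and vocoder.config.sampling_rate. The helper names are hypothetical.

```python
from math import prod

# Assumed vocoder settings; verify against vocoder.config before relying on them.
UPSAMPLE_RATES = (4, 4, 4, 4)  # each stage multiplies the time axis by 4
SAMPLE_RATE = 16000            # output sample rate in Hz
NUM_MEL_BINS = 80              # input spectrogram has shape (num_frames, 80)


def waveform_length(num_frames: int) -> int:
    """Approximate number of audio samples generated for num_frames spectrogram frames."""
    return num_frames * prod(UPSAMPLE_RATES)


def duration_seconds(num_frames: int) -> float:
    """Approximate duration of the generated audio in seconds."""
    return waveform_length(num_frames) / SAMPLE_RATE


frames = 100
print(waveform_length(frames), duration_seconds(frames))
```

Under these assumptions each spectrogram frame covers 256 samples, or 16 ms of audio.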
Example
# Load the pkgs
import torchaudio
import torchaudio.transforms as T
from IPython.display import Audio
# The .wav file loaded below is the voice reference: its x-vector is the conditioning feature vector for the speech model
sound_path = "01-00.04.75_00.07.46.wav"
# Load your .wav file
signal, fs = torchaudio.load(sound_path)
print(f"Shape of signal: {signal.shape} with resample {fs}Hz. ")
if signal.size(0) > 1:
    print("Converting the signal to process as mono channel waveform")
    signal = signal.mean(dim=0, keepdim=True)  # average the channels into a single one
if fs != 16000:
    print(f"Resampling from {fs}Hz to 16000Hz")
    resampler = T.Resample(orig_freq=fs, new_freq=16000)
    signal = resampler(signal)
    fs = 16000 # Update fs to the new sample rate
# Extract x-vector using the classifier
embedding = classifier.encode_batch(signal) # NOTE: the source file is stereo, but signal is already mono here, with shape [1, num_samples]
# To get numpy vector
# xvector = embedding.squeeze().detach().cpu().numpy()
print(f"Embedding shape: {embedding.shape}")
Shape of signal: torch.Size([2, 119511]) with resample 44100Hz.
Converting the signal to process as mono channel waveform
Resampling from 44100Hz to 16000Hz
Embedding shape: torch.Size([1, 1, 512])
To ensure compatibility with the SpeechT5 pipeline, the stereo audio is first downmixed to a single mono channel. It is then resampled from 44,100 Hz to 16,000 Hz, the input rate the models expect. Finally, the speaker encoder turns the audio into an x-vector with shape torch.Size([1, 1, 512]), which is used below to condition the voice of the synthesized speech.
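The two preprocessing steps can be checked by hand. The sketch below reproduces them in plain Python: downmixing averages the channels sample by sample, and resampling changes the number of samples by the ratio of the two rates. The output length is assumed here to be rounded up to the next integer; the helper names and the toy stereo signal are hypothetical.

```python
def downmix_to_mono(channels):
    """Average equal-length channels sample by sample into one channel."""
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]


def resampled_length(num_samples, orig_freq, new_freq):
    """Ceiling of num_samples * new_freq / orig_freq, computed with integers."""
    return -(-num_samples * new_freq // orig_freq)


# Toy stereo signal: two channels of three samples each.
stereo = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]
mono = downmix_to_mono(stereo)

# The example file has 119511 samples per channel at 44100 Hz.
new_len = resampled_length(119511, 44100, 16000)
print(mono, new_len, new_len / 16000)
```

For the file used above, this gives 43360 samples at 16,000 Hz, about 2.71 seconds of audio, which is the same duration as the original 119511 samples at 44,100 Hz.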
# Insert text message to sound out
inputs = processor(text="The aroma of fresh coffee filled the room, making it the perfect start to the day. He paused at the edge of the lake, staring at the still water, reflecting the clear blue sky above.", return_tensors="pt")
# Run the model
speech = model.generate_speech(inputs["input_ids"], embedding.squeeze(0), vocoder=vocoder) # squeeze(0) gives the speaker embedding shape [1, 512]
# to play in notebook
Audio(speech.numpy(), rate=16000) # the SpeechT5 vocoder outputs 16 kHz audio
# to save sound file with torchaudio
torchaudio.save("output.wav", speech.unsqueeze(0), 16000)
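If torchaudio cannot write audio in your environment, the generated waveform can also be saved with Python's standard library. The sketch below assumes the waveform is a flat sequence of float samples in the range [-1.0, 1.0] at 16 kHz, as speech.tolist() would provide; the helper name save_wav_pcm16 is hypothetical, and a sine tone stands in for the synthesized speech.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav_pcm16(path, samples, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Stand-in for speech.tolist(): one second of a 440 Hz tone at 16 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
demo_path = os.path.join(tempfile.gettempdir(), "speecht5_demo_tone.wav")
save_wav_pcm16(demo_path, tone)
```

With the real model output, the call would be save_wav_pcm16("output.wav", speech.tolist()).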