Generation of Musical Timbres using a Text-Guided Diffusion Model

Technical University of Munich, Munich Center for Machine Learning


Abstract

In recent years, text-to-audio systems have achieved remarkable success, enabling the generation of complete audio segments directly from text descriptions. While these systems also facilitate music creation for general users, they often limit the deliberate, fine-grained expression that artists and musicians need. In contrast, this work allows composers, arrangers, and performers to create the basic building blocks of music production: audio of individual musical notes for use in electronic instruments and DAWs. Through text prompts, the user can specify the timbre characteristics of the audio. We introduce a system that combines a latent diffusion model and multi-modal contrastive learning to generate musical timbres conditioned on text descriptions. By jointly generating the magnitude and phase of the spectrogram, our method eliminates the phase-retrieval step that related methods require as post-processing.

The workflow of the proposed method:



A Gradio webapp implementing the workflow above is hosted here. (The Space goes to sleep after a period of inactivity; click 'Restart this Space' to wake it up, which may take up to five minutes.)



Text to musical notes comparison

Both frameworks faithfully interpret simple text descriptions, such as specifying a particular instrument like "reed". However, when additional constraints are added to the text, our framework produces more precise outputs. For instance, it effectively reduces high-frequency overtones in response to the extended prompt "with a dark tone" and enhances the tail end of the sound according to the extended prompt "and a long release."


Sampling with Different Guidance Scale (\( w \))


Results of conditioned sampling with varying guidance scales \( w \). As the guidance scale increases, more high-frequency components are introduced into the spectrogram, in line with the text description.


Audio Style Transfer


Smooth transitions in timbre are achieved by altering the guidance scale \( w \) (upper row), or by changing the noising strength through the initial time step \( T_0 \) (lower row). The text description is "guitar".
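One way such a transfer can be realized is to partially noise the latent of a source note up to step \( T_0 \) and then denoise it under the target text condition, following the standard DDPM sampling equations given in the "Denoising Diffusion on Latent Representations" section below. A minimal sketch, assuming a trained noise-prediction network eps_model(z_t, t, text_emb) and a linear beta schedule (all names are illustrative):

import torch

@torch.no_grad()
def style_transfer(z_src, text_emb, eps_model, betas, T0):
    # Partially noise a source latent to step T0, then denoise it under the
    # target text condition (illustrative sketch, not the authors' exact code).
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    # Forward process in closed form: jump straight to step T0
    z_t = alphas_bar[T0].sqrt() * z_src + (1 - alphas_bar[T0]).sqrt() * torch.randn_like(z_src)
    # Reverse process from T0 back to 0 (plain DDPM ancestral sampling)
    for t in reversed(range(T0)):
        eps_hat = eps_model(z_t, t, text_emb)
        mean = (z_t - betas[t] / (1 - alphas_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            var = (1 - alphas_bar[t - 1]) * betas[t] / (1 - alphas_bar[t])
            z_t = mean + var.sqrt() * torch.randn_like(z_t)
        else:
            z_t = mean
    return z_t

A smaller \( T_0 \) keeps more of the source timbre, while a larger \( T_0 \) lets the text condition dominate.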


Example music tracks synthesized by our workflow

Insights into the model

Architecture Overview

Architecture overview of our framework for fixed-length music note generation. It combines multi-modal contrastive learning and latent diffusion models. STFT+ and ISTFT+ denote the non-trainable time-frequency transformations of the audio signal \(s\) (see the Spectral Representation section). A pretrained LLM is used to augment labels such as "bright, guitar" from the NSynth dataset into diverse text descriptions. The training is divided into three phases:

  1. A VQ-GAN (in yellow) is trained as an autoencoder for the spectral representation of real samples. Its discriminator \(D\) is trained to distinguish spectral representations of real samples (i.e. \(x\) for all training samples) from those of generated samples (i.e. \(\hat{x}\) for all training samples). The encoder, decoder, and quantizer are trained to fool the discriminator, i.e. to produce realistic \(\hat{x}\).
  2. A text encoder (pretrained using CLAP) and a timbre encoder (both shown in green) are trained to map text descriptions and the timbre representation \(\hat{z}\) into a unified embedding space via contrastive learning.
  3. A diffusion model (in blue) is trained to produce latent representations conditioned on the text embeddings. During the inference stage, the output of the diffusion model is passed to the VQ-GAN decoder.

Text-conditioned Note Generation

Latent Representation of Notes:

We train the diffusion model on the lower-dimensional latent representation of the audio to expedite training. The audio is first converted to a spectral image representation. The Short-Time Fourier Transform (STFT) converts an audio signal \(s\) into its magnitude and phase, allowing near-lossless reconstruction of the original signal through the Inverse STFT (ISTFT). Based on the STFT, we encode the audio signal \(s\) into a spectral representation \(x \in \mathbb{R}^{3 \times H \times W}\), where the three channels correspond to log-magnitude, cosine phase, and sine phase, respectively.

We do not use mel-scaled spectrograms due to their tendency to compress high-frequency information, which is detrimental to high-quality music synthesis.

This spectral representation is compressed to and reconstructed from a latent representation \(\hat{z} \in \mathbb{R}^{C \times \frac{H}{r} \times \frac{W}{r}}\), where \(C\) represents the number of channels, and \(r\) denotes the spatial compression scale, via a VQ-GAN. As depicted in the yellow section of the Figure above, the VQ-GAN features an encoder-decoder architecture, employing convolutional and transposed convolutional layers with a stride of 2 for spatial downsampling and upsampling, respectively. The quantizer assigns each element of the spatial dimensions to its nearest discrete value in a codebook.
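To make the quantization step concrete, the sketch below assigns each spatial position of the encoder output to its nearest codebook entry. This is a minimal version with illustrative tensor names; it omits the exponential-moving-average codebook update and the straight-through gradient estimator used in practice:

import torch

def quantize(z_e, codebook):
    # z_e: encoder output of shape (B, C, H', W'); codebook: (K, C).
    # Returns the quantized latent with each spatial vector replaced by
    # its nearest codebook entry, plus the chosen indices.
    B, C, H, W = z_e.shape
    flat = z_e.permute(0, 2, 3, 1).reshape(-1, C)          # (B*H'*W', C)
    dist = torch.cdist(flat, codebook)                      # pairwise L2 distances
    idx = dist.argmin(dim=1)                                # nearest entry per vector
    z_q = codebook[idx].reshape(B, H, W, C).permute(0, 3, 1, 2)
    return z_q, idx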

Additionally, the VQ-GAN has a discriminator that distinguishes between the spectral representations of real samples from the training set and those reconstructed by the decoder. The discriminator is trained adversarially against the encoder, decoder, and quantizer.

Contrastive Representation Learning:

The multi-modal nature of our approach necessitates a shared representation between text and timbre. To ensure this, we train a timbre-encoder and a text-encoder, which respectively map the latent audio representation and text descriptions to their corresponding embeddings within a unified latent space.

This is achieved with a contrastive loss that increases the cosine similarity between the text and timbre embeddings of the same sample while decreasing it between different samples within the batch, as sketched below. The approach takes inspiration from the multi-class N-pair loss used in CLIP, which matches texts with images rather than audio.
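A minimal sketch of this symmetric cross-entropy contrastive objective, assuming matched text/timbre embedding batches of shape (B, D) and a temperature hyperparameter (names are illustrative):

import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, timbre_emb, temperature=0.07):
    # Symmetric cross-entropy over cosine similarities, CLIP-style.
    # text_emb, timbre_emb: (B, D) embeddings of matched pairs.
    text_emb = F.normalize(text_emb, dim=-1)
    timbre_emb = F.normalize(timbre_emb, dim=-1)
    logits = text_emb @ timbre_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    loss_t = F.cross_entropy(logits, targets)          # text -> timbre direction
    loss_a = F.cross_entropy(logits.t(), targets)      # timbre -> text direction
    return 0.5 * (loss_t + loss_a)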

The text-encoder is initialized with pretrained CLAP parameters, where CLAP is a multimodal pretrained model that bridges audio and text. The timbre-encoder, in contrast, undergoes preliminary classification pretraining on the labels of the NSynth dataset before the joint training. We fine-tune rather than use CLAP directly so that the text features align with musical notes instead of general sounds.

Uniquely for this task, since text descriptions are derived from labels, the dataset contains notes with identical labels and therefore semantically similar text descriptions for different samples. When optimizing the batch-constructed symmetric cross-entropy loss, samples with similar text descriptions may be erroneously treated as negative pairs, which impedes convergence. We therefore filter the data so that audio samples with exactly identical labels do not co-occur within a single batch during training; one possible realization is sketched below.
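One simple way to realize such filtering is rejection sampling of batches: redraw a candidate batch until its label strings are pairwise distinct. This is an illustrative sketch of the idea, not necessarily the exact mechanism used in training:

import random

def sample_batch(labels, batch_size, max_tries=100):
    # Draw a batch of indices whose label strings are pairwise distinct.
    for _ in range(max_tries):
        idx = random.sample(range(len(labels)), batch_size)
        if len({labels[i] for i in idx}) == batch_size:
            return idx                      # all labels in the batch are unique
    raise RuntimeError("could not draw a batch with distinct labels")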

More details on spectral representation, prompt engineering, learning objectives, and ablation studies using CLAP as the text encoder can be found in the supplementary material.

Denoising Diffusion on Latent Representations

Given pairs of latent representations and text embeddings \( (\hat{z}, e^{t}) \), we train a Denoising Diffusion Probabilistic Model (DDPM) \( p_{\theta}(\hat{z}|e^{t}) \) that closely replicates the conditioned distribution \( q(\hat{z}|e^{t}) \).

The DDPM defines an iterative forward process over time steps \(t \in \{1, \ldots, T\}\), in which noise \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\) is incrementally added to the true data points \(z_0 := \hat{z}\). This procedure is governed by a predetermined noise schedule \(0 < \beta_1 < \cdots < \beta_t < \cdots < \beta_T < 1\) and is formulated as a Markov chain as follows.

Forward Process Equations:

\( q(z_t | z_{t-1}) := \mathcal{N}(z_t; \sqrt{\alpha_t} z_{t-1}, \beta_t \mathbf{I}), \)
\( q(z_t | z_0) := \mathcal{N}(z_t; \sqrt{\overline{\alpha}_t}\, z_0, (1 - \overline{\alpha}_t) \mathbf{I}), \)

where \( \alpha_{t} := 1-\beta_{t} \) and \( \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s \). The reverse process is the inverse mechanism: a generative model that reconstructs the original data points from pure Gaussian noise. Specifically, in the task at hand we are interested in the conditional joint distribution, defined as:

\( p_{\theta}(z_{0:T}|e^t) := p(z_T) \prod_{t=1}^{T} p_{\theta}(z_{t-1}|z_t,e^t), \)

where \( p(z_T) := \mathcal{N}(\mathbf{0}, \mathbf{I}) \) and \( e^t \) is the text embedding. The intermediate transitions are parameterized by the estimates of a neural network \( \boldsymbol{\epsilon}_{\theta} \):

\( p_{\theta}(z_{t-1}|z_t,e^t) := \mathcal{N}\!\left(z_{t-1}; \mu_{\theta}(z_t, t, e^t), \frac{(1 - \bar{\alpha}_{t-1})\, \beta_{t}}{1 - \bar{\alpha}_{t}} \mathbf{I}\right), \)
\( \mu_{\theta}(z_t, t, e^t) := \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_{\theta}(z_t, t, e^t) \right). \)

The neural network is trained to estimate the injected noise by minimizing a simplified version of the variational lower bound on the negative log-likelihood:

\( L^{\textnormal{simple}} = \mathbb{E}_{t, z_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(z_t, t, e^t) \right\|^2 \right]. \)
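Putting the closed-form forward process and this objective together, one training step could look as follows. This is a sketch using the linear schedule given later (\(T = 1000\), \(\beta_1 = 10^{-4}\), \(\beta_T = 2 \times 10^{-2}\)) and an assumed noise-prediction network eps_model(z_t, t, text_emb):

import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)            # linear beta schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{\alpha}_t

def diffusion_loss(eps_model, z0, text_emb):
    # L_simple: sample a time step, noise z0 in closed form, predict the noise.
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    a_bar = alphas_bar.to(z0.device)[t].view(-1, *([1] * (z0.dim() - 1)))
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * noise    # sample q(z_t | z_0)
    return F.mse_loss(eps_model(z_t, t, text_emb), noise)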

Furthermore, classifier-free guidance is applied during training, which involves randomly replacing the text embedding \( e^t \) with the embedding of an empty string with probability \( p \).
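In code, training amounts to randomly swapping \( e^t \) for the empty-string embedding, and sampling combines the conditional and unconditional noise estimates with the guidance scale \( w \). A sketch assuming the same eps_model and a precomputed empty-string embedding, using one common convention for \( w \):

import random
import torch

def maybe_drop_condition(text_emb, empty_emb, p=0.1):
    # Classifier-free guidance training: replace the condition with probability p.
    return empty_emb if random.random() < p else text_emb

def guided_eps(eps_model, z_t, t, text_emb, empty_emb, w):
    # Sampling: extrapolate from the unconditional towards the conditional estimate.
    eps_uncond = eps_model(z_t, t, empty_emb)
    eps_cond = eps_model(z_t, t, text_emb)
    return eps_uncond + w * (eps_cond - eps_uncond)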

Model Hyperparameters and Training Configurations

VQ-GAN

Training Objective:

The objective in training the VQ-GAN involves minimizing a combination of a weighted reconstruction loss \(L_{\textnormal{rec}}\), a VQ loss \(L_{\textnormal{VQ}}\), and an adversarial loss \(L_{\textnormal{adv}}\). Specifically, the weighted reconstruction loss for the spectral representation \(x\) is defined channel-wise to jointly compress and reconstruct the magnitude and phase information. For the phase information in the latter two channels, the Mean Absolute Error (MAE) is used, while for the log-magnitude in the first channel a weighted Mean Absolute Error (wMAE) is employed that assigns greater weight to lower values, thereby suppressing noise at low amplitudes.

For a spectral representation sample \(x\) and its reconstruction \(\hat{x}\), the loss for the generator, that is, the encoder-decoder structure, \(L_{\textnormal{G}}\), and the discriminator loss \(L_{\textnormal{D}}\) are defined as follows:

\(L_{\textnormal{G}}(x, \hat{x}) := L_{\textnormal{rec}}(x, \hat{x}) + w_1 \cdot L_{\textnormal{VQ}} + w_2 \cdot L_{\textnormal{adv}}(\hat{x})\),
\(L_{\textnormal{rec}}(x, \hat{x}) := \text{wMAE}(x_1, \hat{x}_1) + w_3 \cdot (\text{MAE}(x_2, \hat{x}_2) + \text{MAE}(x_3, \hat{x}_3))\),
\(\text{wMAE}(x_1, \hat{x}_1) := \text{avg} \left(\frac{\left|\hat{x}_1 - x_1\right|}{\max(x_1, \epsilon)}\right)\),
\(L_{\textnormal{adv}}(\hat{x}) := -\log D(\hat{x})\),
\(L_{\textnormal{D}}(x, \hat{x}) := -\log D(x) - \log(1 - D(\hat{x}))\),

where \(D\) is the discriminator, \(w_1\), \(w_2\), \(w_3\) are weight parameters, \(x_i\) denotes the \(i\)-th channel of \(x\), and \(\epsilon\) is a hyper-parameter that adjusts weights for lower log-magnitude values. In this work, we set \(w_1 = 10\), \(w_2 = 0.1\), \(w_3 = 1\), \(\epsilon = 0.001\).
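These objectives could be computed roughly as follows. The sketch assumes the discriminator outputs probabilities in (0, 1) and that the VQ loss term is supplied by the quantizer; all names are illustrative:

import torch

def wmae(x1, x1_hat, eps=1e-3):
    # Weighted MAE on the log-magnitude channel: dividing by max(x1, eps)
    # up-weights errors at low magnitudes.
    return ((x1_hat - x1).abs() / torch.clamp(x1, min=eps)).mean()

def generator_loss(x, x_hat, d_fake, vq_loss, w1=10.0, w2=0.1, w3=1.0):
    # x, x_hat: (B, 3, H, W) spectral representations; d_fake = D(x_hat).
    rec = wmae(x[:, 0], x_hat[:, 0]) + w3 * (
        (x[:, 1] - x_hat[:, 1]).abs().mean() + (x[:, 2] - x_hat[:, 2]).abs().mean())
    adv = -torch.log(d_fake + 1e-8).mean()        # fool the discriminator
    return rec + w1 * vq_loss + w2 * adv

def discriminator_loss(d_real, d_fake):
    return (-torch.log(d_real + 1e-8) - torch.log(1.0 - d_fake + 1e-8)).mean()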

Architecture:

The VQ-GAN model comprises an encoder, a decoder, a quantization layer, and a discriminator.

The encoder and decoder consist of alternating stacks of ResNet blocks, efficient attention modules, and down/up-sampling modules. Each ResNet block combines group normalization, swish activation, and a 2D convolutional layer with kernel size 3, along with a skip connection that includes a convolution with kernel size 1.
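Such a block could be sketched as follows; channel counts and the number of normalization groups are illustrative:

import torch.nn as nn

class ResNetBlock(nn.Module):
    # Group norm -> swish -> 3x3 conv, with a 1x1-conv skip connection.
    def __init__(self, in_ch, out_ch, groups=8):
        super().__init__()
        self.block = nn.Sequential(
            nn.GroupNorm(groups, in_ch),
            nn.SiLU(),                                    # swish activation
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.block(x) + self.skip(x)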

The efficient attention module implements the attention mechanism with linear complexity. The down/up-sampling modules are convolution/transposed convolution layers with a stride of 2 and a kernel size of 4. The encoder features 6 ResNet blocks, 2 efficient attention modules, and two down-sampling modules. The input and output channels of the encoder are 3 and 4, while the hidden channels are 80 and 160. The decoder mirrors this structure but replaces down-sampling with up-sampling modules. Notably, the final activation layer of the decoder is specifically designed for generating spectral representations, using softplus activation for the first channel and tanh activation for the remaining two channels.

For the quantization layer, we employ the implementation based on exponential moving averages. The codebook stores 8192 discrete entries, each with a channel size of 4.

The discriminator employs an 18-layer ResNet architecture, with the first layer replaced by a 2D convolutional layer that accommodates spectral representation inputs and the final layers replaced by a binary classifier head.

The VQ-GAN model has a total of 1.5M trainable parameters. It was trained using the Adam optimizer for 80,000 steps, with a batch size of 4 and a learning rate of \(1 \times 10^{-4}\).

Contrastive Pretrain

The timbre-encoder consists of an LSTM, four single-layer classifier heads, and a single-layer projection head. The LSTM has a feature dimension of 512 and a hidden dimension of 1024, with three stacked layers. It processes the latent representation \(\hat{z}\) as a sequence of features along the temporal dimension, corresponding to the time frames of the spectral representation \(x\). During the pretraining of the timbre-encoder, the final feature output by the LSTM is passed to four classifier heads, which predict labels provided by the NSynth dataset. When jointly trained with the text-encoder, the LSTM's final feature is fed into the projection head, mapping it to a multi-modal feature space with a dimension of 512.
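A rough sketch of this encoder with the dimensions stated above; the per-head class counts are left as a parameter, since they depend on which NSynth labels are used:

import torch.nn as nn

class TimbreEncoder(nn.Module):
    # 3-layer LSTM over the time frames of the latent, with classifier heads
    # for pretraining and a projection head for the joint contrastive stage.
    def __init__(self, num_classes, feat_dim=512, hidden=1024, emb_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.classifiers = nn.ModuleList(nn.Linear(hidden, n) for n in num_classes)
        self.project = nn.Linear(hidden, emb_dim)

    def forward(self, z_seq, pretrain=False):
        _, (h, _) = self.lstm(z_seq)     # z_seq: (batch, time frames, feat_dim)
        feat = h[-1]                     # final feature of the last LSTM layer
        if pretrain:
            return [head(feat) for head in self.classifiers]
        return self.project(feat)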

The text-encoder utilizes the architecture and pretrained parameters of CLAP, along with a projection head. The projection head is also single-layered, mapping the extracted text feature of dimension 512 to the multi-modal feature space.

The timbre-encoder has 25M trainable parameters and was pretrained using the Adam optimizer for 20,000 steps, with a batch size of 64 and a learning rate of \(10^{-3}\). In the joint training phase, it is trained for 600,000 steps with a batch size of 16, together with the text-encoder, which has 155M trainable parameters. The optimizer used is AdamW, with a learning rate of \(10^{-5}\) and weight decay of \(10^{-3}\) for the text-encoder, and a learning rate of \(10^{-4}\) and weight decay of \(10^{-6}\) for the timbre-encoder. However, for the projection heads of both, the learning rate is set to \(10^{-4}\) with a weight decay of \(10^{-6}\).

Diffusion Model

The diffusion model shares similarities with the VQ-GAN in its encoder-decoder structure and building components, namely the ResNet blocks and efficient attention modules. However, it distinguishes itself by:

  • Incorporating skip connections between the encoder and decoder
  • Adding time embeddings to the feature maps
  • Transforming text embeddings and inserting them into the attention modules via cross attention

The total number of time steps \(T\) is set to 1000. A linear \(\beta\) schedule is used, with \(\beta_{1}=1 \times 10^{-4}\) and \(\beta_{T}=2 \times 10^{-2}\). For classifier-free guidance, the probability of replacing text descriptions with the empty string is \(p=0.1\).

The models were trained using the Adam optimizer for 0.3M steps, with a batch size of 8 and a learning rate of \(10^{-4}\). Training was conducted on a single NVIDIA T4 GPU for approximately 70 hours.

Baseline Models

GAN: Inspired by GANSynth, we trained a U-Net model using adversarial learning. This aimed to validate the superiority of the diffusion model over GANs in generating musical notes. The primary distinction between our framework and the GAN-based approach lies in the loss function, while the model architecture and training configurations remain unchanged.

Our Framework with Pretrained CLAP as Text-Encoder: To assess the benefits of contrastive pretraining, we employed the pre-trained CLAP as the text encoder, without fine-tuning it on our dataset. We retrained the diffusion model within our framework, referring to this model as Ours_C. The training settings were consistent with those used for Ours.

Our Framework with a Smaller Model Size: Additionally, we trained a diffusion model with a reduced channel size within our framework, referred to as Ours_S. The training settings for this model were consistent with those of Ours.

AudioLDM and Adapted AudioLDM: We utilized AudioLDM, which is pre-trained on a broad range of sound data including musical notes, as a baseline. Furthermore, we adapted the AudioLDM diffusion model to our dataset, resulting in AudioLDM_A as another baseline. This adaptation was performed using 2 NVIDIA T4 GPUs, involving 0.3M training steps over approximately 90 hours. This process demanded more computational resources than our framework. The training configuration followed the authors' recommendations, with a batch size of 2 and a learning rate of \(10^{-5}\).

Spectral Representation

Spectral Representation Illustration

Transformation between time signal \(s\) and spectral representation \(x\).

Pseudo Code for STFT-based and ISTFT-based Transformation

Algorithm STFT+(s)
Input: time signal s
Output: spectral representation x, which is a matrix with three channels representing log magnitude, cosine phase, and sine phase

1. Compute complex spectrum matrix D
   D <- STFT(s)

2. Compute magnitude of D
   magnitude <- absolute value of D

3. Compute phase of D
   phase <- angle of D

4. Compute log magnitude
   log_magnitude <- log(1 + magnitude)

5. Compute cosine of phase
   cos_phase <- cosine(phase)

6. Compute sine of phase
   sin_phase <- sine(phase)

7. Encode as three channels, with channel dimension first
   x <- stack(log_magnitude, cos_phase, sin_phase) along axis 0

8. Return x
          

STFT-based spectral representation encoding pseudo code.

Algorithm ISTFT+(x)
Input: x, a spectral representation matrix with three channels representing log magnitude, cosine phase, and sine phase
Output: time signal s

1. Extract channels from x
   log_magnitude <- first channel of x
   cos_phase <- second channel of x
   sin_phase <- third channel of x

2. Invert log magnitude transformation
   magnitude <- exp(log_magnitude) - 1

3. Calculate phase
   phase <- arctan2(sin_phase, cos_phase)

4. Reconstruct the complex spectrum matrix from magnitude and phase
   D <- magnitude * (cos(phase) + i * sin(phase))

5. Reconstruct time signal
   s <- ISTFT(D)

6. Return s
          

ISTFT-based spectral representation decoding pseudo code.

As illustrated in the Figure above and detailed in the pseudocode, we encode the time signal \(s\) into a spectral representation \(x\) using the STFT and reconstruct \(s\) using the ISTFT, together with the channel-wise processing shown above. Since \(x\) retains both magnitude and phase information, the representation and reconstruction of the time signal are almost lossless.
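For reference, a runnable counterpart of the two procedures above, sketched with NumPy and librosa. The STFT parameters (n_fft, hop_length) are illustrative defaults, not necessarily those used in the paper:

import numpy as np
import librosa

def stft_plus(s, n_fft=1024, hop_length=256):
    # Encode a time signal as (3, H, W): log magnitude, cos phase, sin phase.
    D = librosa.stft(s, n_fft=n_fft, hop_length=hop_length)
    magnitude, phase = np.abs(D), np.angle(D)
    return np.stack([np.log1p(magnitude), np.cos(phase), np.sin(phase)], axis=0)

def istft_plus(x, hop_length=256):
    # Decode the three-channel spectral representation back to a time signal.
    log_magnitude, cos_phase, sin_phase = x
    magnitude = np.expm1(log_magnitude)              # invert log(1 + m)
    phase = np.arctan2(sin_phase, cos_phase)
    D = magnitude * np.exp(1j * phase)
    return librosa.istft(D, hop_length=hop_length)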