Masked Diffusion Transformer for Music Generation

Authors

Vaibhav Rajendrakumar Mishra, MS in Artificial Intelligence

Yeshiva University, New York, USA

vmishra1@mail.yu.edu

Niranjan Kumar Kishore Kumar, MS in Artificial Intelligence

Yeshiva University, New York, USA

nkishore@mail.yu.edu

Sahil Kumar, PhD in Mathematics

Yeshiva University, New York, USA

skumar4@mail.yu.edu

Youshan Zhang, Department of Computer Science and Engineering

Yeshiva University, New York, USA

youshan.zhang@yu.edu

Abstract

Text-to-music (TTM) generation using diffusion models faces challenges due to the scarcity of high-quality paired datasets, resulting in limited output diversity and fidelity. To address these challenges, we propose a novel framework that incorporates a Masked Diffusion Transformer (MDT) with quality-aware training. The MDT architecture dynamically adapts to varying input quality and effectively models complex temporal and harmonic dependencies in music.

Additionally, a multi-stage caption refinement process ensures enhanced alignment between textual inputs and musical outputs. Our approach improves musical coherence and provides a scalable solution for generating high-fidelity music from text descriptions. Extensive experiments conducted on diverse datasets, including MusicCaps, Song-Describer, and a custom Hugging Face compilation, demonstrate state-of-the-art performance in metrics such as Fréchet Audio Distance (FAD), KL Divergence, and Inception Score (IS). This work significantly advances TTM research and contributes to practical generative music applications.

Architecture

Masked Diffusion Transformer Architecture

The Masked Diffusion Transformer (MDT) architecture consists of 20 encoder layers and 8 decoder layers, a split chosen to balance modeling capacity against computational cost. The deeper encoder captures complex temporal and harmonic dependencies, while the lighter decoder reconstructs audio features efficiently. The MDT processes audio with a patch-based approach (1 × 4 patch size), which reduces memory usage while preserving temporal granularity.
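The 1 × 4 patching and the masking step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 80 mel bins, 256 frames, and 30% mask ratio are illustrative assumptions; only the 1 × 4 patch size comes from the description above.

```python
import numpy as np

def patchify(spec: np.ndarray, patch_h: int = 1, patch_w: int = 4) -> np.ndarray:
    """Split a (freq, time) spectrogram into non-overlapping patch tokens.

    Returns an array of shape (num_patches, patch_h * patch_w), ordered
    frequency-major then time.
    """
    f, t = spec.shape
    assert f % patch_h == 0 and t % patch_w == 0, "spectrogram must tile evenly"
    return (
        spec.reshape(f // patch_h, patch_h, t // patch_w, patch_w)
            .transpose(0, 2, 1, 3)
            .reshape(-1, patch_h * patch_w)
    )

def random_mask(num_patches: int, mask_ratio: float,
                rng: np.random.Generator) -> np.ndarray:
    """Boolean mask marking which patch tokens are hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, num_masked, replace=False)] = True
    return mask

spec = np.random.randn(80, 256)   # assumed: 80 mel bins x 256 frames
tokens = patchify(spec)           # (80 * 64, 4) patch tokens
mask = random_mask(len(tokens), 0.3, np.random.default_rng(0))
```

The 1 × 4 shape keeps each token within a single frequency bin, so masking removes short temporal spans rather than whole harmonic bands.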

Results

Our proposed MDT outperforms baseline architectures such as the CNN-based U-Net and a simplified Diffusion Transformer (DiT) in both quality and alignment metrics. Experiments conducted on MusicCaps, Song-Describer, and a custom Hugging Face compilation demonstrate superior performance in Fréchet Audio Distance (FAD), KL Divergence, and Inception Score (IS). The MDT achieves state-of-the-art results while remaining computationally efficient: it was trained on NVIDIA A100 GPUs with a batch size of 64 and a learning rate of 8 × 10⁻⁵ for 42,800 steps.
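For reference, the stated training hyperparameters can be collected into a small config. The batch size, learning rate, and step count come from the text above; the warmup schedule is an illustrative assumption (the paper's actual schedule is not stated here).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 64           # stated above
    learning_rate: float = 8e-5    # stated above
    total_steps: int = 42_800      # stated above
    warmup_steps: int = 1_000      # assumption, for illustration only

    def lr_at(self, step: int) -> float:
        """Linear warmup then constant LR (schedule is an assumption)."""
        if step < self.warmup_steps:
            return self.learning_rate * step / self.warmup_steps
        return self.learning_rate

cfg = TrainConfig()
```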

The Masked Diffusion Transformer (MDT) achieved the lowest Fréchet Audio Distance (FAD) of 1.70, indicating superior perceptual quality, and the highest Inception Score (IS) of 2.73, reflecting exceptional diversity. Additionally, MDT recorded a CLAP score of 0.37, surpassing competitors like AudioLDM2 and MusicGen, highlighting its enhanced text-to-audio alignment. These results validate MDT's robustness in generating high-fidelity, coherent, and contextually accurate music.
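The Inception Score reported above measures diversity as the exponentiated average KL divergence between each clip's class posterior and the marginal class distribution. A minimal sketch of that computation (operating on pre-computed classifier probabilities, which here are assumed inputs):

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """IS = exp( E_x[ KL(p(y|x) || p(y)) ] ).

    probs: (num_clips, num_classes) per-clip class posteriors from an
    audio classifier (assumed to be given); rows must sum to 1.
    """
    marginal = probs.mean(axis=0, keepdims=True)          # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

If every clip yields the same posterior, IS is 1 (no diversity); if clips land confidently in n distinct classes, IS approaches n.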

🎷 Jazz Music

Prompt: "Generate a Jazz Music"

Our Model

AudioLDM2

MusicGen

🎤 Hiphop Music

Prompt: "Generate a Hiphop Music"

Our Model

AudioLDM2

MusicGen

🎶 Random Music

Prompt: "Generate a Random Music"

Our Model

AudioLDM2

MusicGen