Diffusion-based audio and music generation models commonly generate music by
constructing an image representation of audio (e.g., a mel-spectrogram) and then
converting it to audio using a phase reconstruction model or vocoder.
Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz),
which limits their effectiveness. We propose MusicHiFi, an efficient high-fidelity
stereophonic vocoder. Our method employs a cascade of three generative adversarial
networks (GANs) that convert low-resolution mel-spectrograms to audio,
upsample to high-resolution audio via bandwidth extension, and upmix to
stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based
generator and discriminator architecture and training procedure for each stage of
our cascade, 2) a new fast, near cycle-consistent bandwidth extension module,
and 3) a new fast cycle-consistent mono-to-stereo module that ensures the preservation of monophonic content in the output. We evaluate our proposed approach using both objective evaluations and subjective listening tests and find that it achieves comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work.
BibTeX
@article{zhu2024musichifi,
title={MusicHiFi: Fast High-Fidelity Stereo Vocoding},
author={Zhu, Ge and Caceres, Juan-Pablo and Duan, Zhiyao and Bryan, Nicholas J.},
year={2024},
archivePrefix={arXiv},
primaryClass={cs.SD},
}
Examples
We showcase sample outputs that highlight the capabilities of our high-fidelity, cascaded stereo vocoding system for music generation.
Starting from Mel-spectrograms, we first generate a waveform with a GAN-based vocoder and then enhance the generated music through GAN-based bandwidth extension
and mono-to-stereo upmixing.
Our demonstration includes both intermediate outputs from the different vocoding stages and the system's final output.
The input Mel-spectrograms are generated by a diffusion-based music generation system.
For the mono-to-stereo conversion, the spectrograms depicted represent the side channel of the stereo audio.
All audio samples are provided in MP3 format.
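The side channel shown in the mono-to-stereo spectrograms can be related to the left and right channels through standard mid/side coding. The sketch below is illustrative only: it assumes the common convention mid = (L + R) / 2 and side = (L - R) / 2, and the function names and sine-wave test signals are hypothetical, not taken from the MusicHiFi implementation. It also shows why a stereo signal built as mid plus and minus side downmixes back to the original mono signal, which is the property a cycle-consistent mono-to-stereo module relies on.

```python
import math


def mid_side_encode(left, right):
    """Split a stereo pair into mid (mono downmix) and side signals.

    Assumed convention: mid = (L + R) / 2, side = (L - R) / 2.
    """
    mid = [0.5 * (l + r) for l, r in zip(left, right)]
    side = [0.5 * (l - r) for l, r in zip(left, right)]
    return mid, side


def mid_side_decode(mid, side):
    """Rebuild left/right channels from mid and side signals."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Hypothetical test signals: a mono tone and a quieter "side" tone that
# stands in for whatever a mono-to-stereo model would predict.
sample_rate = 44100
num_samples = 1024
mono = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(num_samples)]
predicted_side = [
    0.1 * math.sin(2 * math.pi * 660 * n / sample_rate) for n in range(num_samples)
]

# Upmix to stereo, then downmix again.
left, right = mid_side_decode(mono, predicted_side)
recovered_mono, recovered_side = mid_side_encode(left, right)

# The downmix matches the input mono signal up to floating-point rounding,
# regardless of what the side signal contains.
max_mono_error = max(abs(a - b) for a, b in zip(recovered_mono, mono))
max_side_error = max(abs(a - b) for a, b in zip(recovered_side, predicted_side))
```

Under this convention, a stereo file whose two channels are identical has an all-zero side channel, so energy in a side-channel spectrogram reflects the difference between the left and right channels.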
Vocoded from Generated Mel-spectrograms
Vocoding
Bandwidth Extension
Mono-to-stereo
Below are samples from out-of-distribution data, comparing our generated audio with the original ground truth.
The Mel-spectrograms used for synthesis are extracted from Creative Commons-licensed music in the FMA dataset.
Detailed licensing information for each music piece can be found at this link.