MusicHiFi: Fast High-Fidelity Stereo Vocoding

Ge Zhu¹,²* Juan-Pablo Caceres² Zhiyao Duan¹ Nicholas J. Bryan²

¹University of Rochester, Rochester, NY
²Adobe Research
*Work done during an internship at Adobe Research


Abstract

Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi, an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsample to high-resolution audio via bandwidth extension, and upmix to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near cycle-consistent bandwidth extension module, and 3) a new fast cycle-consistent mono-to-stereo module that ensures the preservation of monophonic content in the output. We evaluate our proposed approach using both objective and subjective listening tests and find that it achieves comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work.

Bibtex

          
@article{zhu2024musichifi,
    title={MusicHiFi: Fast High-Fidelity Stereo Vocoding},
    author={Zhu, Ge and Caceres, Juan-Pablo and Duan, Zhiyao and Bryan, Nicholas J.},
    year={2024},
    archivePrefix={arXiv},
    primaryClass={cs.SD},
}

Examples

We showcase sample outputs that highlight the capabilities of our high-fidelity, cascaded stereo vocoding system for music generation. Starting from mel-spectrograms, we first generate a waveform with a GAN-based vocoder and then enhance the generated music through GAN-based bandwidth extension and mono-to-stereo upmixing. Our demonstration includes both intermediate outputs from the different vocoding stages and the system's final output. The input mel-spectrograms are generated by a diffusion-based music generation system. For the mono-to-stereo conversion, the spectrograms shown depict the side channel of the stereo audio. All audio samples are provided in MP3 format.
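For readers unfamiliar with the side channel mentioned above, the following is a minimal sketch of mid/side encoding. It assumes the common convention mid = (L + R) / 2 and side = (L - R) / 2, which this page does not spell out. The function names and the random stand-in for a predicted side signal are hypothetical and are not the MusicHiFi model itself. The sketch illustrates why a stereo signal rebuilt from a fixed mid channel and any side signal still downmixes to the same mono content.

```python
import numpy as np


def ms_encode(left, right):
    """Split a stereo pair into mid (mono content) and side (spatial difference)."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side


def ms_decode(mid, side):
    """Rebuild left/right channels from mid and side."""
    return mid + side, mid - side


rng = np.random.default_rng(0)
num_samples = 1000  # arbitrary length chosen for illustration

# Toy stereo signal standing in for real audio.
left = rng.standard_normal(num_samples)
right = rng.standard_normal(num_samples)
mid, side = ms_encode(left, right)

# Encoding followed by decoding recovers the original channels.
rec_left, rec_right = ms_decode(mid, side)
round_trip_ok = np.allclose(rec_left, left) and np.allclose(rec_right, right)

# Hypothetical stand-in for a model-predicted side channel (assumption:
# any side signal could be substituted here).
predicted_side = 0.3 * rng.standard_normal(num_samples)
new_left, new_right = ms_decode(mid, predicted_side)

# Downmixing the new stereo pair returns the original mono (mid) signal,
# because the side terms cancel in the average.
downmix = 0.5 * (new_left + new_right)
mono_preserved = np.allclose(downmix, mid)
```

Under this convention the side channel is zero for a purely monophonic signal, so a spectrogram of the side channel shows only the spatial content added by the upmixing stage.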

Vocoded from Generated Mel-spectrograms

Vocoding

Bandwidth Extension

Mono-to-stereo

Below are samples from out-of-distribution data, comparing our generated audio with the original ground truth. The mel-spectrograms used for synthesis are extracted from Creative Commons-licensed music in the FMA dataset. Detailed licensing information for each music piece can be found at this link. For the mono-to-stereo conversion, the spectrograms shown depict the side channel of the stereo audio. All audio samples are provided in MP3 format.