Stereo Single-stage LMs

GENERATING STEREOPHONIC MUSIC WITH
SINGLE-STAGE LANGUAGE MODELS

Xingda Li, Fan Zhuo, Dan Luo, Jun Chen, Shiyin Kang, Zhiyong Wu, Tao Jiang, Yang Li, Han Fang, Yahui Zhou

[Paper] Submitted to ICASSP 2023.

Abstract

The recent success of audio language models (LMs) has revolutionized the field of neural music generation. Among audio LM approaches, MusicGen has demonstrated the success of a music generation framework based on a single-stage LM, which avoids the need to train multiple LMs. Despite its promising performance in generating monophonic (mono) music, directly generating stereophonic (stereo) music with this framework results in perceptible quality degradation. In this paper, we first discuss the difficulty of directly encoding stereo music with a neural codec, and then provide a stable and practical solution based on a dual encoding approach. To use the dually encoded tokens in single-stage LMs, we also propose two token sequence patterns. We conduct an extensive evaluation covering various aspects of stereo music audio to examine the performance of stereo neural codec approaches and the generation quality of single-stage LMs. Our experimental results suggest that (i) the proposed dual encoding approach for neural codecs is significantly better than the typical joint encoding approach in terms of reconstruction quality, and (ii) the stereo single-stage LMs trained with the proposed token sequence patterns substantially improve perceptual quality over the state-of-the-art music generation model (i.e., MusicGen) in subjective tests.

Proposed Methods for Stereo Music Generation

(1) Dual Encoding Approach for Stereo Codec
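
The LR and MS labels used in the comparisons on this page presumably refer to left/right and mid/side channel pairs. As a minimal sketch, assuming the conventional mid/side definition M = (L + R) / 2 and S = (L - R) / 2 (the paper may use a different scaling), the two mono signals that a dual encoding setup would pass to a mono codec could be prepared as follows. The function names `lr_to_ms` and `ms_to_lr` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def lr_to_ms(left: List[float], right: List[float]) -> Tuple[List[float], List[float]]:
    """Convert left/right samples to mid/side samples (assumed convention)."""
    if len(left) != len(right):
        raise ValueError("left and right channels must have the same length")
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def ms_to_lr(mid: List[float], side: List[float]) -> Tuple[List[float], List[float]]:
    """Invert lr_to_ms: recover left/right samples from mid/side samples."""
    if len(mid) != len(side):
        raise ValueError("mid and side channels must have the same length")
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Each channel of the chosen pair (L/R or M/S) would then be encoded
# separately by a mono codec; the codec itself is out of scope here.
left = [0.5, -0.25, 1.0]
right = [0.25, 0.25, -1.0]
mid, side = lr_to_ms(left, right)
left_rec, right_rec = ms_to_lr(mid, side)
```

With this convention the transform is exactly invertible, so choosing M/S over L/R changes only which two mono signals the codec sees, not the information they carry.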

(2) Token Sequence Patterns for Stereo Single-stage LMs
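
The Seq and Para suffixes in the model labels on this page suggest a sequential and a parallel arrangement of the two channels' tokens. The sketch below shows one plausible reading only, and is not the paper's definition: it assumes each channel is encoded into K codebook rows of T tokens, that a sequential pattern places the second channel after the first along the time axis, and that a parallel pattern stacks both channels along the codebook axis. The function names and the nested-list layout are hypothetical.

```python
from typing import List

Tokens = List[List[int]]  # K codebook rows, each holding T token ids


def _check_same_shape(a: Tokens, b: Tokens) -> None:
    """Raise if the two token grids do not share the same K x T shape."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        raise ValueError("both channels must have the same K x T shape")


def sequential_pattern(a: Tokens, b: Tokens) -> Tokens:
    """Assumed 'Seq' layout: K rows of length 2T (channel a, then channel b)."""
    _check_same_shape(a, b)
    return [row_a + row_b for row_a, row_b in zip(a, b)]


def parallel_pattern(a: Tokens, b: Tokens) -> Tokens:
    """Assumed 'Para' layout: 2K rows of length T (rows of a, then rows of b)."""
    _check_same_shape(a, b)
    return [list(row) for row in a] + [list(row) for row in b]


# Toy example with K = 2 codebooks and T = 3 frames per channel.
chan_a = [[1, 2, 3], [4, 5, 6]]
chan_b = [[7, 8, 9], [10, 11, 12]]
seq = sequential_pattern(chan_a, chan_b)
para = parallel_pattern(chan_a, chan_b)
```

Under these assumptions, the sequential layout doubles the sequence length the LM must model, while the parallel layout keeps the length fixed and doubles the number of token streams predicted per step.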

Comparing Neural Codecs for Stereo Music Reconstruction

We follow the experimental settings described in the paper. We compare five codec methods, labelled SS-Joint, SS-Dual-LR, SS-Dual-MS, EC-Dual-LR and EC-Dual-MS, on music samples taken from MUSDB18. Each audio sample is 10 seconds long.

Ground-truth	SS-Joint	SS-Dual-LR	SS-Dual-MS	EC-Dual-LR	EC-Dual-MS

Comparing Single-stage LMs for Stereo Music Continuation

We follow the experimental settings described in the paper. We compare nine methods, labelled MusicGen, EC-LR-Seq, EC-LR-Para, EC-MS-Seq, EC-MS-Para, SS-LR-Seq, SS-LR-Para, SS-MS-Seq and SS-MS-Para, on music samples taken from MUSDB18. Each audio sample is 20 seconds long, with the first 10 seconds used as the prompt.

Ground-truth	MusicGen	EC-LR-Seq	EC-LR-Para	EC-MS-Seq	EC-MS-Para	SS-LR-Seq	SS-LR-Para	SS-MS-Seq	SS-MS-Para
