LongMergent: Pioneering audio mixing strategies for exquisite music generation
Vol 8, Issue 1, 2025
Abstract
Artificial intelligence-empowered music processing applies AI techniques to enhance music analysis, understanding, and generation, encompassing a variety of tasks from music generation to music comprehension. In practical applications, the complexity of interwoven tasks, differences in data representation, the scattered distribution of tool resources, and the barrier of specialized music knowledge often prevent developers from carrying out generative tasks smoothly. It is therefore essential to build a system that can automatically analyze user needs and invoke the appropriate tools to simplify the music processing workflow. Inspired by the recent success of Large Language Models (LLMs) in task automation, we developed a system named LongMergent, which integrates numerous music-related tools and autonomous workflows to address user requirements. By granting users the freedom to combine tools effortlessly, the system provides a seamless and rich musical experience.
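The abstract describes a system that analyzes a user request and invokes appropriate music tools. As an illustrative sketch only (not the actual LongMergent implementation, whose planner is LLM-driven), the core pattern can be shown as a tool registry plus a dispatcher; the tool names, keywords, and `ToolRegistry` class below are hypothetical stand-ins:

```python
# Minimal sketch of a tool registry with a naive keyword-based planner.
# A real agent would replace plan() with an LLM that parses the request
# and composes a tool chain; the tools here are placeholder lambdas.
from typing import Callable, Dict, List


class ToolRegistry:
    """Holds music-processing tools keyed by task name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}
        self._keywords: Dict[str, List[str]] = {}

    def register(self, task: str, keywords: List[str],
                 tool: Callable[[str], str]) -> None:
        self._tools[task] = tool
        self._keywords[task] = keywords

    def plan(self, request: str) -> List[str]:
        """Return task names whose keywords appear in the request."""
        text = request.lower()
        return [task for task, kws in self._keywords.items()
                if any(kw in text for kw in kws)]

    def run(self, request: str) -> List[str]:
        """Dispatch the request to every matching tool, in registration order."""
        return [self._tools[task](request) for task in self.plan(request)]


# Hypothetical tools standing in for real generation/separation backends.
registry = ToolRegistry()
registry.register("generate", ["compose", "generate"],
                  lambda r: "generated melody")
registry.register("separate", ["separate", "stems"],
                  lambda r: "separated stems")

results = registry.run("Please generate a melody and separate the stems")
```

Keeping planning (`plan`) separate from execution (`run`) mirrors the agent pattern in tool-automation systems such as HuggingGPT [6] and MusicAgent [7], where the language model first produces a task plan and the tools are invoked afterwards.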
Full Text: PDF
References
1. Gómez E, Gouyon F, Herrera P, Amatriain X. Using and enhancing the current MPEG-7 standard for a music content processing tool. Advances in Engineering Software. 2003.
2. Meng F, Zhang C, Liu N. Music style classification using deep convolutional neural networks. In: Proceedings of the 2020 3rd International Conference on Computer Graphics, Vision and Information Security (CGVIS). IEEE; 2020. pp. 87–91.
3. Hadjeres G, Pachet F, Nielsen F. DeepBach: A Steerable Model for Bach Chorales Generation. arXiv. 2016. doi: 10.48550/ARXIV.1612.01010
4. Chen J, Tan X, Luan J, et al. HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis. arXiv. 2020. doi: 10.48550/ARXIV.2009.01776
5. Yu B, Lu P, Wang R, et al. Museformer: Transformer with fine- and coarse-grained attention for music generation. In: Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS); 28 November–8 December 2022.
6. Shen Y, Song K, Tan X, et al. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv. 2023.
7. Yu D, Song K, Lu P, et al. MusicAgent: An AI agent for music understanding and generation with Large Language Models. arXiv. 2023.
8. Chen Y, Huang L, Gou T. Applications and Advances of Artificial Intelligence in Music Generation: A Review. arXiv. 2024. doi: 10.48550/ARXIV.2409.03715
9. Agostinelli A, Denk TI, Borsos Z, et al. MusicLM: Generating music from text. arXiv. 2023.
10. Sun T, Zhang X, He Z, et al. MOSS: An Open Conversational Large Language Model. Machine Intelligence Research. 2024; 21(5): 888–905. doi: 10.1007/s11633-024-1502-8
11. Wang L, Kawakami K, van den Oord A. Contrastive Predictive Coding of Audio with an Adversary. Interspeech 2020. 2020; 826–830. doi: 10.21437/interspeech.2020-1891
12. Wu S, Yu D, Tan X, Sun M. CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval. arXiv. 2023. doi: 10.48550/ARXIV.2304.11029
13. Stöter F, Virtanen T. A Multichannel Nonnegative Matrix Factorization Approach to Sound Scene Analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2016; 24(9): 1652–1663.
14. Engel J, Agrawal S, Chen D, et al. GANSynth: Adversarial Neural Audio Synthesis. In: Proceedings of the International Conference on Machine Learning (ICML); 10–15 June 2019; Long Beach, CA, USA.
15. Luo Y, Chen Z, Hershey JR, et al. Deep clustering and conventional networks for music separation: Stronger together. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2017; 61–65. doi: 10.1109/icassp.2017.7952118
16. Ji S, Yang X, Luo J. A Survey on Deep Learning for Symbolic Music Generation: Representations, Algorithms, Evaluations, and Challenges. ACM Computing Surveys. 2023; 56(1): 1–39. doi: 10.1145/3597493
17. Min S, Lyu X, Holtzman A, et al. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? arXiv. 2022. doi: 10.48550/ARXIV.2202.12837
18. Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. 2022; 35: 27730–27744.
19. Wu C, Yin S, Qi W, et al. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv. 2023. doi: 10.48550/ARXIV.2303.04671
20. Liu S, Hussain AS, Wu Q, et al. M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models. arXiv. 2023. doi: 10.48550/ARXIV.2311.11255
21. Chen C, Hu Y, Wang S, et al. Audio Large Language Models Can Be Descriptive Speech Quality Evaluators. arXiv. 2025. doi: 10.48550/ARXIV.2501.17202
22. Zeng G, Ding W, Xu B, et al. Adaptable and Precise: Enterprise-Scenario LLM Function-Calling Capability Training Pipeline. In: Proceedings of the 2025 International Conference on Learning Representations (ICLR); 24–28 April 2025.
23. Huang C-ZA, Vaswani A, Uszkoreit J, et al. Music transformer: Generating music with long-term structure. In: Proceedings of International Conference on Learning Representations (ICLR); 30 April–3 May 2018.
24. Dai Z, Yang Z, Yang Y, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL); 28 July–2 August 2019; Florence, Italy. pp. 2978–2988.
25. Beltagy I, Peters ME, Cohan A. Longformer: The Long-Document Transformer. arXiv. 2020. doi: 10.48550/ARXIV.2004.05150
DOI: https://doi.org/10.24294/csma11516
License URL: https://creativecommons.org/licenses/by/4.0/
This site is licensed under a Creative Commons Attribution 4.0 International License.