LongMergent: Pioneering audio mixing strategies for exquisite music generation
Vol 8, Issue 1, 2025
Abstract
Artificial intelligence-empowered music processing applies AI techniques to enhance music analysis, understanding, and generation, encompassing a variety of tasks from music generation to music comprehension. In practical applications, the complexity of interwoven tasks, differences in data representation, the scattered distribution of tool resources, and the barrier of specialized music knowledge often prevent developers from carrying out generative tasks smoothly. It is therefore essential to build a system that can automatically analyze user needs and invoke the appropriate tools to simplify the music processing workflow. Inspired by the recent success of Large Language Models (LLMs) in task automation, we developed a system named LongMergent, which integrates numerous music-related tools and autonomous workflows to address user requirements. By granting users the freedom to combine tools effortlessly, the system provides a seamless and rich musical experience.
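The abstract describes a system that analyzes a user request and invokes appropriate music tools. As an illustrative sketch only (not the actual LongMergent implementation, whose planner is LLM-driven), the core pattern can be shown as a tool registry plus a dispatcher; the tool names, keywords, and `ToolRegistry` class below are hypothetical stand-ins:

```python
# Minimal sketch of a tool registry with a naive keyword-based planner.
# A real agent would replace plan() with an LLM that parses the request
# and composes a tool chain; the tools here are placeholder lambdas.
from typing import Callable, Dict, List


class ToolRegistry:
    """Holds music-processing tools keyed by task name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}
        self._keywords: Dict[str, List[str]] = {}

    def register(self, task: str, keywords: List[str],
                 tool: Callable[[str], str]) -> None:
        self._tools[task] = tool
        self._keywords[task] = keywords

    def plan(self, request: str) -> List[str]:
        """Return task names whose keywords appear in the request."""
        text = request.lower()
        return [task for task, kws in self._keywords.items()
                if any(kw in text for kw in kws)]

    def run(self, request: str) -> List[str]:
        """Dispatch the request to every matching tool, in registration order."""
        return [self._tools[task](request) for task in self.plan(request)]


# Hypothetical tools standing in for real generation/separation backends.
registry = ToolRegistry()
registry.register("generate", ["compose", "generate"],
                  lambda r: "generated melody")
registry.register("separate", ["separate", "stems"],
                  lambda r: "separated stems")

results = registry.run("Please generate a melody and separate the stems")
```

Keeping planning (`plan`) separate from execution (`run`) mirrors the agent pattern in tool-automation systems such as HuggingGPT [6] and MusicAgent [7], where the language model first produces a task plan and the tools are invoked afterwards.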
Full Text: PDF
References
1. Gómez E, Gouyon F, Herrera P, Amatriain X. Using and enhancing the current MPEG-7 standard for a music content processing tool. Advances in Engineering Software. 2003.
2. Meng F, Zhang C, Liu N. Music style classification using deep convolutional neural networks. In: Proceedings of the 2020 3rd International Conference on Computer Graphics, Vision and Information Security (CGVIS). IEEE; 2020. pp. 87–91.
3. Hadjeres G, Pachet F, Nielsen F. DeepBach: A Steerable Model for Bach Chorales Generation. arXiv. 2016. doi: 10.48550/ARXIV.1612.01010
4. Chen J, Tan X, Luan J, et al. HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis. arXiv. 2020. doi: 10.48550/ARXIV.2009.01776
5. Yu B, Lu P, Wang R, et al. Museformer: Transformer with fine- and coarse-grained attention for music generation. In: Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS); 28 November–8 December 2022.
6. Shen Y, Song K, Tan X, et al. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv. 2023.
7. Yu D, Song K, Lu P, et al. MusicAgent: An AI agent for music understanding and generation with Large Language Models. arXiv. 2023.
8. Chen Y, Huang L, Gou T. Applications and Advances of Artificial Intelligence in Music Generation: A Review. arXiv. 2024. doi: 10.48550/ARXIV.2409.03715
9. Agostinelli A, Denk TI, Borsos Z, et al. MusicLM: Generating music from text. arXiv. 2023.
10. Sun T, Zhang X, He Z, et al. MOSS: An Open Conversational Large Language Model. Machine Intelligence Research. 2024; 21(5): 888–905. doi: 10.1007/s11633-024-1502-8
11. Wang L, Kawakami K, van den Oord A. Contrastive Predictive Coding of Audio with an Adversary. Interspeech 2020. 2020; 826–830. doi: 10.21437/interspeech.2020-1891
12. Wu S, Yu D, Tan X, Sun M. CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval. arXiv. 2023. doi: 10.48550/ARXIV.2304.11029
13. Stöter F, Virtanen T. A Multichannel Nonnegative Matrix Factorization Approach to Sound Scene Analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2016; 24(9): 1652–1663.
14. Engel J, Agrawal S, Chen D, et al. GANSynth: Adversarial Neural Audio Synthesis. In: Proceedings of the International Conference on Machine Learning (ICML); 10–15 June 2019; Long Beach, CA, USA.
15. Luo Y, Chen Z, Hershey JR, et al. Deep clustering and conventional networks for music separation: Stronger together. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2017; 61–65. doi: 10.1109/icassp.2017.7952118
16. Ji S, Yang X, Luo J. A Survey on Deep Learning for Symbolic Music Generation: Representations, Algorithms, Evaluations, and Challenges. ACM Computing Surveys. 2023; 56(1): 1–39. doi: 10.1145/3597493
17. Min S, Lyu X, Holtzman A, et al. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? arXiv. 2022. doi: 10.48550/ARXIV.2202.12837
18. Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. 2022; 35: 27730–27744.
19. Wu C, Yin S, Qi W, et al. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv. 2023. doi: 10.48550/ARXIV.2303.04671
20. Liu S, Hussain AS, Wu Q, et al. M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models. arXiv. 2023. doi: 10.48550/ARXIV.2311.11255
21. Chen C, Hu Y, Wang S, et al. Audio Large Language Models Can Be Descriptive Speech Quality Evaluators. arXiv. 2025. doi: 10.48550/ARXIV.2501.17202
22. Zeng G, Ding W, Xu B, et al. Adaptable and Precise: Enterprise-Scenario LLM Function-Calling Capability Training Pipeline. In: Proceedings of the 2025 International Conference on Learning Representations (ICLR); 24–28 April 2025.
23. Huang C-ZA, Vaswani A, Uszkoreit J, et al. Music transformer: Generating music with long-term structure. In: Proceedings of International Conference on Learning Representations (ICLR); 30 April–3 May 2018.
24. Dai Z, Yang Z, Yang Y, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL); 28 July–2 August 2019; Florence, Italy. pp. 2978–2988.
25. Beltagy I, Peters ME, Cohan A. Longformer: The Long-Document Transformer. arXiv. 2020. doi: 10.48550/ARXIV.2004.05150
DOI: https://doi.org/10.24294/csma11516
License URL: https://creativecommons.org/licenses/by/4.0/
This site is licensed under a Creative Commons Attribution 4.0 International License.