
NingBGM: Video Background Music Generation via Holistic Scene Understanding with Multi-Agent Collaboration

Project page, demo results, and open-source implementation for multimodal video background music generation with collaborative scene understanding.

Existing video-to-music generation methods rely primarily on a single visual modality and often fail to capture the rich multimodal information present in real-world scenarios. As a result, the generated music frequently lacks semantic and emotional alignment with the scene. To address this, we propose NingBGM, a video background music generation framework based on multi-agent collaboration. The framework introduces a holistic scene representation that fuses multi-source information to model temporal dynamics, objects, semantics, emotions, and narrative logic, and this representation serves as the guiding input for music generation. NingBGM organizes agents into three specialized teams for requirements analysis, scene understanding, and music generation. Each team follows a collaborative mechanism of supervisor assignment, expert execution, and verifier validation that simulates a professional creation workflow. In addition, a structured agent role definition and task design module allows general agents to rapidly transform into domain experts, translating high-level user intentions into executable micro-instructions. We also construct and open-source CSD-200, the first multimodal holistic scene test set for video-to-music tasks, which fills a gap in fine-grained evaluation benchmarks. Experiments show that NingBGM outperforms the baselines across multiple key metrics, validating the effectiveness of the multi-agent collaborative paradigm. Ablation studies further confirm that constructing holistic scenes through multimodal fusion is crucial for enhancing audio-visual synchronization.
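
The team workflow described above (supervisor assignment, expert execution, verifier validation, chained across the three teams) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the NingBGM implementation: the `Team` class, `run_pipeline`, the retry limit, and the stub experts and verifiers are all hypothetical names chosen for this example, and real agents would call language or music models instead of returning fixed strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical shared state passed between teams (not from the NingBGM codebase).
Context = Dict[str, str]


@dataclass
class Team:
    """One team: a supervisor assigns a task, an expert executes, a verifier checks."""
    name: str
    expert: Callable[[str, Context], str]   # produces a result for the assigned task
    verifier: Callable[[str], bool]         # accepts or rejects the expert's result
    max_rounds: int = 3                     # assumed retry budget

    def run(self, context: Context) -> str:
        # Supervisor step: turn the shared context into a concrete instruction.
        summary = "; ".join(f"{k}={v}" for k, v in sorted(context.items()))
        task = f"[{self.name}] {summary}"
        for attempt in range(1, self.max_rounds + 1):
            result = self.expert(task, context)
            if self.verifier(result):
                return result
            # Rejected: reassign the task with a note and try again.
            task = f"{task} (retry {attempt})"
        raise RuntimeError(
            f"{self.name}: no verified result after {self.max_rounds} rounds"
        )


def run_pipeline(teams: List[Team], context: Context) -> Context:
    """Run the teams in order; each verified result is added to the shared context."""
    for team in teams:
        context = {**context, team.name: team.run(context)}
    return context


def demo() -> Context:
    # Stub experts stand in for model calls; the strings are made up for illustration.
    teams = [
        Team("requirements",
             expert=lambda task, ctx: f"calm piano for {ctx['user_request']}",
             verifier=lambda out: bool(out)),
        Team("scene",
             expert=lambda task, ctx: "sunset beach, slow pacing, nostalgic",
             verifier=lambda out: "," in out),
        Team("music",
             expert=lambda task, ctx: f"{ctx['requirements']} | {ctx['scene']}",
             verifier=lambda out: "|" in out),
    ]
    return run_pipeline(teams, {"user_request": "travel vlog"})


if __name__ == "__main__":
    print(demo()["music"])
```

In this sketch the music team only runs after the requirements and scene teams have produced verified outputs, which mirrors the idea that the holistic scene description guides generation. The fixed retry budget is a design choice of the example, not something stated on this page.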


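Disclaimer

The following statements are provided for academic research and technical demonstration only. If you have questions about the source, copyright, or compliance of any content, please contact us.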

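Academic Use Statement
The model results, example samples, and comparisons on this page are for academic research and technical demonstration only. They illustrate the method design and experimental results and do not constitute any commercial commitment, license, or service guarantee.
Copyright Notice
Copyright in the original videos, images, audio, and other materials shown on this page belongs to the original rights holders. Apart from the paper's method, the page layout, and our own generated results, all other content is used only within the scope of reasonable display.
Acknowledgments
Some of the materials in the Video comparison section are drawn from https://vem-paper.github.io/VeM-page/
Citation
If this project is helpful to you, please consider citing our paper. Formal citation information will be added in a later version.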
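Usage Notes for Additional Results
  • Some of the displayed materials come from the public web or from public datasets.
  • The content is for academic research and technical demonstration only.
  • No individual or organization may use the content of this page directly for commercial purposes.
  • If there is any copyright, privacy, or compliance issue, please contact us and we will promptly address it or take down the content in question.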