NingBGM: Video Background Music Generation via Holistic Scene Understanding with Multi-Agent Collaboration
Project page, demo results, and open-source implementation for multimodal video background music generation with collaborative scene understanding.
Existing video-to-music generation methods rely primarily on a single visual modality, often failing to capture the rich multimodal information present in real-world scenarios. Consequently, the generated music frequently lacks semantic and emotional alignment with the scene. To address this, we propose NingBGM, a video background music generation framework based on multi-agent collaboration. The framework introduces a holistic scene representation that fuses multi-source information to model temporal dynamics, objects, semantics, emotions, and narrative logic, serving as a guiding input for music generation. NingBGM organizes agents into three specialized teams for requirements analysis, scene understanding, and music generation. Each team follows a collaborative mechanism of supervisor assignment, expert execution, and verifier validation, simulating professional creative workflows. Furthermore, a structured agent role definition and task design module allows general agents to rapidly transform into domain experts, translating high-level user intentions into executable micro-instructions. We also construct and open-source CSD-200, the first multimodal holistic scene test set for video-to-music tasks, filling a gap in fine-grained evaluation benchmarks. Experiments demonstrate that NingBGM outperforms baselines across multiple key metrics, validating the effectiveness of the multi-agent collaborative paradigm. Ablation studies further confirm that constructing holistic scenes through multimodal fusion is crucial for enhancing audio-visual synchronization.
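The supervisor-assignment / expert-execution / verifier-validation loop described above can be sketched as a minimal pipeline. All class names, role names, and the toy expert/verifier functions below are illustrative assumptions for exposition, not NingBGM's actual implementation or API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical sketch: each team runs a supervisor -> expert -> verifier loop,
# and the three teams (requirements analysis, scene understanding, music
# generation) are chained in sequence. Names are assumptions, not real APIs.

@dataclass
class Task:
    instruction: str                 # micro-instruction assigned by the supervisor
    result: Optional[str] = None     # expert's output
    approved: bool = False           # verifier's decision

class Team:
    """One specialized team: supervisor assigns, expert executes, verifier validates."""
    def __init__(self, name: str,
                 expert_fn: Callable[[str], str],
                 verify_fn: Callable[[str], bool],
                 max_rounds: int = 3):
        self.name = name
        self.expert_fn = expert_fn   # expert execution step
        self.verify_fn = verify_fn   # verifier validation step
        self.max_rounds = max_rounds # retry budget if validation fails

    def run(self, instruction: str) -> Task:
        task = Task(instruction)     # supervisor assignment
        for _ in range(self.max_rounds):
            task.result = self.expert_fn(task.instruction)
            if self.verify_fn(task.result):
                task.approved = True
                break
        return task

def pipeline(video_desc: str, teams: List[Team]) -> str:
    """Chain teams so each team's validated output feeds the next team."""
    payload = video_desc
    for team in teams:
        payload = team.run(payload).result
    return payload
```

With placeholder expert and verifier functions, chaining three teams passes the intermediate representation forward, e.g. a requirements summary into scene understanding, then into music generation.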
Disclaimer
The following notes are provided for academic research and technical demonstration purposes only. If you have any questions about content sources, copyright, or compliance, please contact us.
- Some of the displayed materials come from the public web or public datasets.
- The related content is used solely for academic research and technical demonstration.
- No individual or organization may use the content of this page directly for commercial purposes.
- If any copyright, privacy, or compliance issues arise, please contact us and we will promptly address them or take down the relevant content.