Deep Neural Networks-based Classification Methodologies of Speech, Audio and Music, and its Integration for Audio Metadata Tagging.

Hosung Park, Yoonseo Chung, Ji-Hwan Kim

J. Web Eng. (2023)

Abstract
Videos contain visual and auditory information. Visual information in a video can include images of people, objects, and landscapes, whereas auditory information includes voices, sound effects, background music, and the soundscape. The audio content can provide detailed information on the story through voice analysis and atmosphere analysis of the sound effects and soundscape. Metadata tags represent the results of a media analysis as text; they can be used to classify video content on social networking services such as YouTube. This paper presents methodologies for speech, audio, and music processing. We also propose integrating these audio tagging methods and applying them in an audio metadata generation system for video storytelling. The proposed system automatically creates metadata tags based on speech, sound-effect, and background-music information from the audio input. It comprises five subsystems: (1) automatic speech recognition, which generates text from the linguistic sounds in the audio; (2) audio event classification for the type of sound effect; (3) audio scene classification for the type of place, inferred from the soundscape; (4) music detection for the background music; and (5) keyword extraction from the automatic speech recognition results. First, the audio signal is converted into a form suitable for each subsystem; the outputs of the subsystems are then combined to create metadata for the audio content. We evaluated the proposed system using video logs (vlogs) on YouTube. The proposed system exhibits accuracy comparable to handcrafted metadata for the audio content, achieving 65.83% accuracy over 104 YouTube vlogs.
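The abstract describes a pipeline in which five subsystems process the same audio input and their outputs are merged into metadata tags. The following Python sketch illustrates that structure only; the function names, signatures, stub outputs, and the tag layout are assumptions for illustration, not the authors' implementation, and each stub stands in for a trained model.

"""A minimal sketch of the five-subsystem audio-metadata pipeline,
under the assumptions stated above."""

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class AudioMetadata:
    transcript: str             # (1) automatic speech recognition output
    keywords: List[str]         # (5) keywords extracted from the transcript
    audio_events: List[str]     # (2) detected sound-effect types
    acoustic_scene: str         # (3) place inferred from the soundscape
    has_background_music: bool  # (4) music-detection result


def automatic_speech_recognition(signal: np.ndarray, sr: int) -> str:
    return "placeholder transcript"  # a real system would run an ASR model here


def classify_audio_events(signal: np.ndarray, sr: int) -> List[str]:
    return ["applause"]  # stub for an audio-event classifier


def classify_acoustic_scene(signal: np.ndarray, sr: int) -> str:
    return "city street"  # stub for an acoustic-scene classifier


def detect_background_music(signal: np.ndarray, sr: int) -> bool:
    return True  # stub for a background-music detector


def extract_keywords(transcript: str) -> List[str]:
    return transcript.split()[:5]  # naive stand-in for keyword extraction


def generate_metadata(signal: np.ndarray, sr: int) -> AudioMetadata:
    """Run the five subsystems on the same audio input and merge their outputs."""
    transcript = automatic_speech_recognition(signal, sr)
    return AudioMetadata(
        transcript=transcript,
        keywords=extract_keywords(transcript),
        audio_events=classify_audio_events(signal, sr),
        acoustic_scene=classify_acoustic_scene(signal, sr),
        has_background_music=detect_background_music(signal, sr),
    )


if __name__ == "__main__":
    one_second = np.zeros(16000)  # 1 s of silence at 16 kHz as dummy input
    print(generate_metadata(one_second, 16000))

In this sketch the subsystems run independently on the converted audio signal and a single aggregation step assembles the tags, mirroring the convert-then-combine flow the abstract describes.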
Keywords
metadata, audio, music, classification methodologies, deep neural networks-based