Pitch Estimation Via Self-Supervision

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (2020)

Citations 7 | Views 27
Abstract
We present a method to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. In contrast to existing methods, our neural network can be fully trained only on unlabeled data, using self-supervision. A tiny amount of labeled data is needed solely for mapping the network outputs to absolute pitch values. The key to this is the observation that if one creates two examples from one original audio clip by pitch shifting both, the difference between the correct outputs is known, without even knowing the actual pitch value in the original clip. Somewhat surprisingly, this idea combined with an auxiliary reconstruction loss allows training a pitch estimation model. Our results show that our pitch estimation method obtains an accuracy comparable to fully supervised models on monophonic audio, without the need for large labeled datasets. In addition, we are able to train a voicing detection output in the same model, again without using any labels.
Keywords
audio pitch estimation, unsupervised learning, convolutional neural networks