A 1μW voice activity detector using analog feature extraction and digital deep neural network.
ISSCC (2018)
Abstract
Voice user interfaces (UIs) are highly compelling for wearable and mobile devices. They have the advantage of using compact and ultra-low-power (ULP) input devices (e.g. passive microphones). Together with ULP signal acquisition and processing, voice UIs can give energy-harvesting acoustic sensor nodes and battery-operated devices the sought-after capability of natural interaction with humans. Voice activity detection (VAD), separating speech from background noise, is a key building block in such voice UIs; e.g. it can enable power gating of higher-level speech tasks such as speaker identification and speech recognition [1]. As an always-on block, the VAD must minimize power consumption while maintaining high classification accuracy. Motivated by the high power efficiency of analog signal processing, a VAD system using analog feature extraction (AFE) and a mixed-signal decision tree (DT) classifier was demonstrated in [2]. While it achieved a record-low 6μW, that system requires machine-learning-based calibration of the DT thresholds on a chip-to-chip basis due to ill-controlled AFE variation. Moreover, the 7-node DT may deliver inferior classification accuracy, especially under low input SNR and difficult noise scenarios, compared to more advanced classifiers such as deep neural networks (DNNs) [1,3]. Although the heavy computational load of conventional floating-point DNNs prevents their adoption in embedded systems, the binarized neural networks (BNNs) with binary weights and activations proposed in [4] may pave the way to ULP implementations. In this paper, we present a 1μW VAD system utilizing AFE and a digital BNN classifier with an event-encoding A/D interface. The whole AFE is 9.4x more power-efficient than the prior art [5] and 7.9x more than the state-of-the-art digital filter bank [6], and the BNN consumes only 0.63μW.
To avoid costly chip-wise training, a variation-aware Python model of the AFE was created, and the generated features were used for offline BNN training. Measurements show 84.4%/85.4% mean speech/non-speech hit rates with 1.88%/4.65% 1-σ standard deviations across 10 dies using the same weights, for 10dB-SNR speech with restaurant noise.
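The key property that makes BNNs amenable to ULP hardware is that, with both weights and activations constrained to {-1, +1}, each multiply-accumulate collapses to an XNOR plus a population count. The sketch below is a minimal illustration of that idea in NumPy, not the paper's implementation; the layer size, the sign-binarization rule (0 mapped to +1), and the variable names are all assumptions for demonstration.

```python
import numpy as np

def binarize(x):
    # Sign binarization: map real values to {-1, +1}; 0 maps to +1 by convention.
    return np.where(x >= 0, 1, -1).astype(np.int8)

def bnn_dense(activations, weights):
    # Binarized fully connected layer: with inputs and weights in {-1, +1},
    # each dot product is an integer sum that hardware can realize as
    # XNOR followed by popcount -- no floating-point multipliers needed.
    a = binarize(activations)
    w = binarize(weights)
    return a.astype(np.int32) @ w.T.astype(np.int32)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)        # hypothetical 16-dim feature vector
W = rng.standard_normal((4, 16))   # hypothetical 4-neuron binarized layer
out = bnn_dense(x, W)              # 4 integer pre-activations in [-16, 16]
```

Because each output is a sum of sixteen ±1 terms, every pre-activation is an even integer bounded by the fan-in, which is what lets the hardware replace wide accumulators with small popcount logic.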
Keywords
voice UIs, energy-harvesting acoustic sensor nodes, battery-operated devices, voice activity detection, background noise, power gating, higher-level speech tasks, speaker identification, deep neural networks, classification accuracy, AFE variation, chip-wise training, mixed-signal decision tree classifier, ultra-low-power input devices, mobile devices, wearable devices, voice user interfaces, digital deep neural network, analog feature extraction, 1μW voice activity detector, 10dB SNR speech, 1-σ standard deviation, digital filter bank, event-encoding A/D interface, digital BNN classifier, 1μW VAD system utilizing AFE, ULP implementations, binarized neural networks, embedded systems, conventional floating-point DNNs, low input SNR, inferior classification accuracy, 7-node DT, chip-to-chip basis, DT thresholds, machine-learning-based calibration, analog signal processing, power 1.0 μW, power 6.0 μW, power 0.63 μW, SNR 10.0 dB