A 1μW voice activity detector using analog feature extraction and digital deep neural network.
ISSCC (2018)
Abstract
Voice user interfaces (UIs) are highly compelling for wearable and mobile devices. They have the advantage of using compact and ultra-low-power (ULP) input devices (e.g. passive microphones). Together with ULP signal acquisition and processing, voice UIs can give energy-harvesting acoustic sensor nodes and battery-operated devices the sought-after capability of natural interaction with humans. Voice activity detection (VAD), separating speech from background noise, is a key building block in such voice UIs; e.g. it can enable power gating of higher-level speech tasks such as speaker identification and speech recognition [1]. As an always-on block, the VAD must minimize power consumption while maintaining high classification accuracy. Motivated by the high power efficiency of analog signal processing, a VAD system using analog feature extraction (AFE) and a mixed-signal decision tree (DT) classifier was demonstrated in [2]. While it achieved a record-low 6μW, that system requires machine-learning-based calibration of the DT thresholds on a chip-to-chip basis due to ill-controlled AFE variation. Moreover, the 7-node DT may deliver inferior classification accuracy, especially under low input SNR and difficult noise scenarios, compared to more advanced classifiers such as deep neural networks (DNNs) [1,3]. Although the heavy computational load of conventional floating-point DNNs prevents their adoption in embedded systems, the binarized neural networks (BNNs) with binary weights and activations proposed in [4] may pave the way to ULP implementations. In this paper, we present a 1μW VAD system utilizing AFE and a digital BNN classifier with an event-encoding A/D interface. The whole AFE is 9.4x more power-efficient than the prior art [5] and 7.9x more than the state-of-the-art digital filter bank [6], and the BNN consumes only 0.63μW.
To avoid costly chip-wise training, a variation-aware Python model of the AFE was created, and the generated features were used for offline BNN training. Measurements show 84.4%/85.4% mean speech/non-speech hit rates with 1.88%/4.65% 1-σ standard deviations across 10 dies using the same weights, for 10dB-SNR speech with restaurant noise.
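The key property that makes BNNs amenable to ULP hardware is that, with both weights and activations constrained to {-1, +1}, each multiply-accumulate collapses to an XNOR plus a population count. The sketch below is a minimal illustration of that idea in NumPy, not the paper's implementation; the layer size, the sign-binarization rule (0 mapped to +1), and the variable names are all assumptions for demonstration.

```python
import numpy as np

def binarize(x):
    # Sign binarization: map real values to {-1, +1}; 0 maps to +1 by convention.
    return np.where(x >= 0, 1, -1).astype(np.int8)

def bnn_dense(activations, weights):
    # Binarized fully connected layer: with inputs and weights in {-1, +1},
    # each dot product is an integer sum that hardware can realize as
    # XNOR followed by popcount -- no floating-point multipliers needed.
    a = binarize(activations)
    w = binarize(weights)
    return a.astype(np.int32) @ w.T.astype(np.int32)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)        # hypothetical 16-dim feature vector
W = rng.standard_normal((4, 16))   # hypothetical 4-neuron binarized layer
out = bnn_dense(x, W)              # 4 integer pre-activations in [-16, 16]
```

Because each output is a sum of sixteen ±1 terms, every pre-activation is an even integer bounded by the fan-in, which is what lets the hardware replace wide accumulators with small popcount logic.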
Keywords
voice UIs, energy-harvesting acoustic sensor nodes, battery-operated devices, voice activity detection, background noise, power gating, higher-level speech tasks, speaker identification, deep neural networks, classification accuracy, AFE variation, chip-wise training, mixed-signal decision tree classifier, ultra-low-power input devices, mobile devices, wearable devices, voice user interfaces, digital deep neural network, analog feature extraction, 1μW voice activity detector, 10dB SNR speech, 1-σ standard deviation, digital filter bank, event-encoding A/D interface, digital BNN classifier, 1μW VAD system utilizing AFE, ULP implementations, binarized neural networks, embedded systems, conventional floating-point DNNs, low input SNR, inferior classification accuracy, 7-node DT, chip-to-chip basis, DT thresholds, machine-learning-based calibration, analog signal processing, power 1.0 μW, power 6.0 μW, power 0.63 μW, SNR 10.0 dB