Game of Trojans: Adaptive Adversaries Against Output-based Trojaned-Model Detectors
CoRR (2024)
Abstract
We propose and analyze an adaptive adversary that can retrain a Trojaned DNN
and is also aware of SOTA output-based Trojaned model detectors. We show that
such an adversary can (1) achieve high accuracy on both trigger-embedded and
clean samples and (2) bypass detection. Our approach is based on the observation
that the high dimensionality of the DNN parameters provides sufficient degrees
of freedom to simultaneously achieve these objectives. We also enable SOTA
detectors to be adaptive by allowing retraining to recalibrate their
parameters, thus modeling a co-evolution of parameters of a Trojaned model and
detectors. We then show that this co-evolution can be modeled as an iterative
game, and prove that the resulting (optimal) solution of this interactive game
leads to the adversary successfully achieving the above objectives. In
addition, we provide a greedy algorithm for the adversary to select a minimum
number of input samples for embedding triggers. We show that for cross-entropy
or log-likelihood loss functions used by the DNNs, the greedy algorithm
provides provable guarantees on the needed number of trigger-embedded input
samples. Extensive experiments on four diverse datasets – MNIST, CIFAR-10,
CIFAR-100, and SpeechCommand – reveal that the adversary effectively evades
four SOTA output-based Trojaned model detectors: MNTD, NeuralCleanse, STRIP,
and TABOR.