Evaluation Of Machine-Learning Protocols For Technology-Assisted Review In Electronic Discovery
IR(2014)
Abstract
Using a novel evaluation toolkit that simulates a human reviewer in the loop, we compare the effectiveness of three machine-learning protocols for technology-assisted review as used in document review for discovery in legal proceedings. Our comparison addresses a central question in the deployment of technology-assisted review: Should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning? On eight review tasks - four derived from the TREC 2009 Legal Track and four derived from actual legal matters - recall was measured as a function of human review effort. The results show that entirely non-random training methods, in which the initial training documents are selected using a simple keyword search, and subsequent training documents are selected by active learning, require substantially and significantly less human review effort (P < 0.01) to achieve any given level of recall, than passive learning, in which the machine-learning algorithm plays no role in the selection of training documents. Among passive-learning methods, significantly less human review effort (P < 0.01) is required when keywords are used instead of random sampling to select the initial training documents. Among active-learning methods, continuous active learning with relevance feedback yields generally superior results to simple active learning with uncertainty sampling, while avoiding the vexing issue of "stabilization" - determining when training is adequate, and therefore may stop.
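As an illustration of the kind of measurement described above, the following is a minimal sketch of continuous active learning with relevance feedback, seeded by a keyword search, with recall tracked as a function of review effort. It is not the authors' evaluation toolkit: the term-count scorer is a toy stand-in for a real learning algorithm, the corpus and labels are invented for the example, and every name (`train`, `score`, `continuous_active_learning`, `batch_size`) is hypothetical.

```python
from collections import Counter


def train(labeled):
    """Toy model (an assumption, not the paper's learner):
    each term gains +1 per relevant document and -1 per non-relevant one."""
    weights = Counter()
    for text, is_relevant in labeled:
        for term in set(text.split()):
            weights[term] += 1 if is_relevant else -1
    return weights


def score(weights, text):
    """Sum the learned weights of the distinct terms in a document."""
    return sum(weights[term] for term in set(text.split()))


def continuous_active_learning(docs, gold, seed_keyword, batch_size=2):
    """Simulate a reviewer in the loop and return (documents_reviewed, recall) points.

    The gold labels play the role of the human reviewer. If the seed keyword
    matches no document, the returned curve is empty.
    """
    total_relevant = sum(gold)
    reviewed = []
    remaining = set(range(len(docs)))
    # Seed set: documents that contain the keyword.
    batch = [i for i in range(len(docs)) if seed_keyword in docs[i].split()]
    curve = []
    while batch:
        for i in batch:
            reviewed.append(i)          # the simulated reviewer codes the document
            remaining.discard(i)
        found = sum(gold[i] for i in reviewed)
        curve.append((len(reviewed), found / total_relevant))
        # Retrain on everything reviewed so far, then pick the top-scoring
        # unreviewed documents as the next batch (relevance feedback).
        weights = train([(docs[i], gold[i]) for i in reviewed])
        ranked = sorted(remaining, key=lambda i: (-score(weights, docs[i]), i))
        batch = ranked[:batch_size]
    return curve


# Invented toy corpus; 1 marks a relevant document.
docs = [
    "merger contract price",
    "contract price negotiation",
    "lunch menu friday",
    "merger negotiation schedule",
    "football scores friday",
    "price list lunch",
    "merger announcement draft",
    "holiday party menu",
]
gold = [1, 1, 0, 1, 0, 0, 1, 0]

curve = continuous_active_learning(docs, gold, seed_keyword="contract")
for effort, recall in curve:
    print(f"reviewed={effort} recall={recall:.2f}")
```

In this sketch review continues until the collection is exhausted, so no separate decision about when training is adequate is needed; a practical deployment would instead need some stopping rule, which the sketch does not model.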
Keywords
Technology-assisted review, predictive coding, electronic discovery, e-discovery