Fast and Scalable Method To Search for Concept Occurrences in Elsevier Content

Pierre-Yves Vandenbussche, Darin McBeath

2020


Abstract

With more than 18M articles and 610 GB of text in Science Direct, it can be daunting to search for occurrences of concept labels such as "underground economy", "case-law" or "P53 inhibitor". Fear no more! We put together notebooks that perform, at scale, exact matching of any string using the Aho-Corasick algorithm. In a matter of minutes, the script can search for millions of concepts in a large corpus of text such as Science Direct. We used this concept search method to obtain annotations of cell line concepts in the biomedical domain and of Omniscience concepts from our corpus. The algorithm takes advantage of Spark's distributed computation for fast processing and scalability. Combined with AnnotationQuery, an open-source library developed in-house, we can then formulate complex search queries over the newly found concept occurrences. For example, we can retrieve other relevant concepts within the same sentence, or the SciBERT sentence embedding for the sentence in which a concept occurs. In this presentation we will explain how to take advantage of the notebooks for your projects and show examples of complex search queries run directly on a large body of documents.
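The notebooks themselves are not reproduced here. As a rough illustration of the exact-matching step only, the following is a minimal single-machine sketch of Aho-Corasick in pure Python. The function names `build_automaton` and `find_occurrences` are hypothetical and chosen for this example; the Spark distribution and the AnnotationQuery integration mentioned in the abstract are not shown.

```python
from collections import deque


def build_automaton(labels):
    """Build an Aho-Corasick automaton (hypothetical helper) from concept labels."""
    goto = [{}]   # goto[state][char] -> next state (trie edges)
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> labels that end when this state is reached

    # Phase 1: insert every non-empty label into the trie.
    for label in labels:
        if not label:
            continue
        state = 0
        for ch in label:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(label)

    # Phase 2: breadth-first traversal to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit labels that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_occurrences(text, automaton):
    """Return (start_offset, label) for every exact match, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists or we hit the root.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for label in out[state]:
            matches.append((i - len(label) + 1, label))
    return matches


if __name__ == "__main__":
    concepts = ["underground economy", "case-law", "P53 inhibitor"]
    automaton = build_automaton(concepts)
    sentence = "A P53 inhibitor may affect the underground economy."
    for start, label in find_occurrences(sentence, automaton):
        print(start, label)
```

The automaton is built once for the whole concept list, and each text is then scanned in a single pass regardless of how many labels there are. That property is what makes the approach plausible for millions of concepts. In a distributed setting one could, as an assumption, build or broadcast the automaton once and apply the search function to each document.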


Keywords

Concept search,Sentence,Spark (mathematics),Scalability,Theoretical computer science,Computer science,Embedding,Computation,Omniscience,Exact matching
