Measuring Societal Biases in Text Corpora via First-Order Co-occurrence

arXiv (2020)

Abstract
Text corpora are used to study societal biases, usually through statistical models such as word embeddings. The bias of a word towards a concept is typically estimated using vector similarity, which measures whether the word and the concept words share other words in their contexts. We argue that this second-order relationship introduces unrelated concepts into the measure, making the bias measurement imprecise. We propose instead to measure bias using the direct normalized co-occurrence associations between the word and the representative concept words, a first-order measure, by reconstructing the co-occurrence estimates inherent in word embedding models. To study our novel corpus bias measurement method, we calculate the correlation between the gender bias values estimated from the text and the actual gender bias statistics of the U.S. job market, provided by two recent collections. The results show a consistently higher correlation when using the proposed first-order measure with a variety of word embedding models, as well as a more severe degree of bias, especially towards women in a few specific occupations.
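
To make the idea of a first-order measure concrete, here is a minimal sketch in Python. It is not the method of the paper: the abstract describes reconstructing co-occurrence estimates from word embedding models, whereas this sketch counts co-occurrences directly in a toy corpus and assumes positive pointwise mutual information (PPMI) as the normalization. The corpus, the concept word lists, the window size, and the function names (`cooccurrence_counts`, `ppmi`, `first_order_bias`) are all hypothetical and chosen for illustration only.

```python
from collections import Counter
from math import log


def cooccurrence_counts(sentences, window=2):
    """Count ordered (word, context) pairs within a symmetric window."""
    pair_counts = Counter()
    word_counts = Counter()  # row marginals: total pairs in which a word is the center
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
                    word_counts[word] += 1
    return pair_counts, word_counts


def ppmi(word, context, pair_counts, word_counts):
    """Positive PMI of a (word, context) pair; 0.0 if the pair never co-occurs."""
    n_pair = pair_counts[(word, context)]
    if n_pair == 0:
        return 0.0
    total = sum(pair_counts.values())
    pmi = log(n_pair * total / (word_counts[word] * word_counts[context]))
    return max(0.0, pmi)


def first_order_bias(word, female_words, male_words, pair_counts, word_counts):
    """Mean association with female concept words minus mean association with male ones."""
    female = sum(ppmi(word, f, pair_counts, word_counts) for f in female_words) / len(female_words)
    male = sum(ppmi(word, m, pair_counts, word_counts) for m in male_words) / len(male_words)
    return female - male


# Toy corpus (illustrative assumption, not data from the paper).
sentences = [
    "she works as a nurse".split(),
    "the nurse said she was tired".split(),
    "he works as an engineer".split(),
    "the engineer said he was late".split(),
]
pair_counts, word_counts = cooccurrence_counts(sentences, window=2)

nurse_bias = first_order_bias("nurse", ["she"], ["he"], pair_counts, word_counts)
engineer_bias = first_order_bias("engineer", ["she"], ["he"], pair_counts, word_counts)
print(f"nurse: {nurse_bias:+.3f}, engineer: {engineer_bias:+.3f}")
```

In this sketch a positive score means the word co-occurs more strongly with the female concept words and a negative score with the male ones. Because the score depends only on direct co-occurrence with the concept words, words that merely share contexts with them do not contribute, which is the distinction from the second-order vector-similarity approach described above.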