A multilingual translator to SQL with database schema pruning to improve self-attention

Marcelo Archanjo Jose, Fabio Gagliardi Cozman

International Journal of Information Technology (2023)

Abstract

Databases hold large amounts of information that can be accessed through the structured query language (SQL), but this language requires technical knowledge. One way to facilitate access to this information is to pose queries in natural language and use an artificial intelligence model to translate them to SQL. Transformer-based language models have been very successful at this task. However, transformers are limited by the size of the input text, so long input sequences can degrade the quality of the results. We present two techniques to improve results. The first is an innovative technique that allows long text sequences to be handled by transformers with up to 512 input tokens: we run database schema pruning (removal of table names and column names that are useless for the query of interest) during the fine-tuning process. The second technique is a multilingual approach. The model is fine-tuned on a data-augmented Spider dataset (a specialized dataset for natural language to SQL, NL2SQL) in four languages simultaneously: English, Portuguese, Spanish, and French. Together, these techniques increased the exact set match accuracy from 0.718 to 0.736 on our validation dataset. Improving results is challenging because NL2SQL techniques are already heavily optimized; the two techniques presented here are important because they are applied to the training dataset, which allows them to be used with any current technique. Source code, evaluations, and checkpoints are available at https://github.com/C4AI/gap-text2sql.
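
As a rough illustration of the schema pruning idea described in the abstract, the Python sketch below keeps only the tables and columns whose names occur in the target SQL query before the schema is serialized into the model input. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `prune_schema` and `serialize`, the dictionary schema format, the `|` separator, and the simple token-matching rule are all hypothetical choices made for this example.

```python
import re


def prune_schema(schema, sql):
    """Keep only tables and columns whose names occur as tokens in the SQL.

    Hypothetical helper: plain token matching is an assumption of this
    sketch, not the procedure used in the paper.
    """
    # Lowercased identifier-like tokens that appear in the target query.
    tokens = set(re.findall(r"[a-z_][a-z0-9_]*", sql.lower()))
    pruned = {}
    for table, columns in schema.items():
        if table.lower() not in tokens:
            continue  # table is never mentioned in the query
        pruned[table] = [c for c in columns if c.lower() in tokens]
    return pruned


def serialize(question, schema):
    """Join the question and schema into one input string (assumed format)."""
    parts = [f"{table}: {', '.join(cols)}" for table, cols in schema.items()]
    return question + " | " + " | ".join(parts)


# Toy schema invented for illustration only.
schema = {
    "singer": ["singer_id", "name", "country", "age"],
    "concert": ["concert_id", "concert_name", "year"],
    "stadium": ["stadium_id", "location", "capacity"],
}
sql = "SELECT name FROM singer WHERE age > 30"

pruned = prune_schema(schema, sql)
model_input = serialize("Which singers are older than 30?", pruned)
```

Because the pruning relies on the gold SQL query, a rule like this only applies where that query is known, which matches the abstract's statement that pruning runs during fine-tuning. Token matching is also coarse: a column name shared by several tables in the query is kept in each of them, so a real pipeline would more likely work from a parsed query.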

Keywords

multilingual translator, SQL, database schema, self-attention
