LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
CVPR 2024
Abstract
Recently, leveraging large language models (LLMs) or multimodal large
language models (MLLMs) for document understanding has proven very
promising. However, previous works that employ LLMs/MLLMs for document
understanding have not fully explored and utilized the document layout
information, which is vital for precise document understanding. In this paper,
we propose LayoutLLM, an LLM/MLLM-based method for document understanding. The
core of LayoutLLM is a layout instruction tuning strategy, which is specially
designed to enhance the comprehension and utilization of document layouts. The
proposed layout instruction tuning strategy consists of two components:
Layout-aware Pre-training and Layout-aware Supervised Fine-tuning. To capture
the characteristics of document layout in Layout-aware Pre-training, three
groups of pre-training tasks, corresponding to document-level, region-level and
segment-level information, are introduced. Furthermore, a novel module called
layout chain-of-thought (LayoutCoT) is devised to enable LayoutLLM to focus on
regions relevant to the question and generate accurate answers. LayoutCoT is
effective for boosting the performance of document understanding. Meanwhile, it
brings a certain degree of interpretability, which could facilitate manual
inspection and correction. Experiments on standard benchmarks show that the
proposed LayoutLLM significantly outperforms existing methods that adopt
open-source 7B LLMs/MLLMs for document understanding. The training data for
LayoutLLM is publicly available at
https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/LayoutLLM