Introduction

TAT-DQA is a large-scale Document VQA dataset constructed by extending TAT-QA. It aims to stimulate progress in QA research over more complex and realistic visually-rich documents that combine tabular and textual content, especially questions requiring numerical reasoning.

The unique features of TAT-DQA include:

  • The documents are visually rich, combining tabular and textual content, and are sampled from real-world financial reports.
  • A document may span multiple pages (at most three).
  • Many questions require numerical reasoning over the document content.

In total, TAT-DQA contains 16,558 questions associated with 2,758 documents (3,067 document pages) sampled from real-world financial reports.

The following is an example from TAT-DQA. On the left is a given visually-rich document. On the right are a question associated with the document, the derivation used to compute the answer value, the scale of the numerical answer, and the final answer.

TAT-DQA Sample
For more information, please read our ACM MM 2022 paper [PDF].

Getting Started

Download the TAT-DQA Dataset from Google Drive. The data comprise three parts: the original documents in PDF format, the content converted from the PDFs in JSON format, and the QA pairs in JSON format. The following are detailed explanations of the data.
  • PDF documents: obtained by filtering, from the raw financial reports, the document pages that correspond to the QA pairs, e.g., 11ba155b7577c83fe5f3c4f766039e93.pdf. A PDF document in TAT-DQA contains at most three pages.

  • Converted content from PDF in JSON: obtained by extracting the text together with its bounding boxes, using a PDF reader for machine-readable PDF files or a commercial OCR engine for image-based ones, e.g., 11ba155b7577c83fe5f3c4f766039e93.json.
  • {
      "pages": [                                                                  # The document content are stored by page, each page will be one element in this array 
        {
          "bbox": [                                                               # The dimension of this document page, [x1, y1, x2, y2]. (x1, y1) is the top left of the page while (x2, y2) is the bottom right  
            0,
            0,
            1239,
            1754
          ],
          "blocks": [                                                             # The document content are stored by blocks for each page 
            {
              "bbox": [
                34,
                50,
                174,
                71
              ],
              "uuid": "8f1e47dc-af67-485d-b269-057f06e8714c",                     # The unique ID of the block  
              "text": "Table of Contents",                                        # The text content of the block  
              "words": {                                                          # The word list of the content for this block  
                "word_list": [
                  "Table",
                  "of",
                  "Contents"
                ],
                "bbox_list": [                                                    # The corresponding bounding box of each word in the word list  
                  [
                    34,
                    50,
                    75,
                    71
                  ],
                  [
                    84,
                    50,
                    100,
                    71
                  ],
                  [
                    108,
                    50,
                    174,
                    71
                  ]
                ]
              },
              "order": 1                                                        # The order of the block in this page, starting from 1 
            },
            ...
          ]
        }
      ]
    }
    
  • QA Pairs: the QA pairs associated with the given document (a minimal loading sketch covering both the converted-content JSON and the QA-pair JSON follows this list).
  • { 
      "doc": {
        "uid": "11ba155b7577c83fe5f3c4f766039e93",                                      # The unique id of the document, {uid}.pdf and {uid}.json are the given PDF document and the OCR result of the PDF 
        "page" : 1,                                                                     # The starting page no. of the document that is relevant to the QA pairs
        "source": "navios-maritime-holdings-inc_2019.pdf"                               # The financial statement that this doc comes from 
      },                   
      "questions": [                                                                    # The questions associated to the given document 
      {                   
        "uid": "9dfd4a8d0b9e91c2ab5ab4a8745226c6",                                      # The unique id of a question
        "order": 6,                                                                     # The order of the question in all questions, starting from 1
        "question": "What was the change in Impairment losses between 2017 and 2018?",  # The question itself
        "answer": 150092,                                                               # The ground-truth answer
        "derivation": "200,657-50,565",                                                 # The derivation that can be executed to arrive at the ground-truth answer
        "answer_type": "arithmetic",                                                    # The answer type including `span`, `spans`, `arithmetic` and `counting`.
        "scale": "thousand",                                                            # The scale of the answer including `None`, `thousand`, `million`, `billion` and `percent`
        "req_comparison": false,                                                        # A flag indicating if `comparison/sorting` is needed to answer the question whose answer is a single span or multiple spans
        "facts": [                                                                      #[Optional] The facts or evidencies used to infer the final answers, which are generated heuristically  
          "200,657",                     
          "50,565"                     
        ],                     
        "block_mapping": [                                                              # The facts' positions in the OCR result of the given document. 
          {
            "69352448-14ad-4854-87a9-3ac44358a660": [
              26,
              32
            ]
          },
          {
            "69352448-14ad-4854-87a9-3ac44358a660": [
              18,
              25
            ]
          }
        ]
        ...
      },
      ...
      ]
    }
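
To make the two JSON formats above concrete, the following is a minimal Python sketch that loads the converted content of one document and its QA pairs, indexes the blocks by uuid, and prints each question together with the blocks referenced by its block_mapping. The QA-pair file name, the assumption that a downloaded split is a list of {doc, questions} records, and the reading of the block_mapping values are illustrative assumptions; please verify them against the files you download.

    import json

    doc_uid = "11ba155b7577c83fe5f3c4f766039e93"

    # 1) Load the converted content of one document and index its blocks by uuid.
    with open(f"{doc_uid}.json", encoding="utf-8") as f:
        converted = json.load(f)

    blocks_by_uuid = {
        block["uuid"]: block
        for page in converted["pages"]
        for block in page["blocks"]
    }

    # 2) Load the QA pairs. "tatdqa_dataset_dev.json" is a placeholder name; a split
    #    file is assumed to be a list of {"doc": ..., "questions": [...]} records.
    with open("tatdqa_dataset_dev.json", encoding="utf-8") as f:
        qa_records = json.load(f)

    for record in qa_records:
        if record["doc"]["uid"] != doc_uid:
            continue
        for q in record["questions"]:
            print(q["question"])
            print("  answer:", q["answer"], q["scale"], "| derivation:", q.get("derivation"))
            # Each block_mapping entry points from a block uuid to the position of the
            # corresponding fact within that block (interpretation assumed here).
            for fact, mapping in zip(q.get("facts", []), q.get("block_mapping", [])):
                for block_uuid, span in mapping.items():
                    block = blocks_by_uuid[block_uuid]
                    print(f"  fact {fact!r}: block order {block['order']}, span {span}")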
    

Leaderboard

Rank  Model Name         Team Name  Exact Match  F1    Created      Paper  Codes
-     Human Performance  -          84.1         90.8  -            -      -
1     TAT-LLM (70B)      NExT       76.5         83.9  20 Jan 2024  Paper  N.A.
2     TAT-LLM (13B)      NExT       72.2         80.5  20 Jan 2024  Paper  N.A.
3     TAT-LLM (7B)       NExT       71.3         80.2  20 Jan 2024  Paper  N.A.
4     Doc2SoarGraph      NExT       59.2         67.6  10 Jan 2023  Paper  Codes
5     Baseline - MHST    NExT       41.5         50.7  10 Aug 2022  Paper  Codes

Evaluation

To evaluate your models, we have also made available the evaluation script that we use for official evaluation. To run the evaluation, use:

    python tatqa_eval.py --gold_path=:path_to_dev --pred_path=:path_to_predictions

Predictions Format

The predictions file is in JSON format and contains a dictionary with question IDs as keys and the predictions as values; each prediction must be an array containing both the `answer` and the `scale`. For example,

    {
     "9337c3e6-c53f-45a9-836a-02c474ceac16": [
        "4.6",
        "percent"
      ],
      "c4170232-e89c-487a-97c5-afad45e9d702": [
        "16",
        "thousand"
      ],
      "d81d1ae7-363c-4b47-8eea-1906fef33856": [
        ["2018", "2019"],
        ""
      ]
      ...
    }
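
As a minimal sketch (not an official tool), such a predictions file can be produced with a few lines of Python, assuming your model's outputs are collected in a dictionary keyed by question uid; the uids and values below are copied from the example above, and the output file name is only a placeholder:

    import json

    # Hypothetical model outputs: question uid -> (answer, scale).
    # A multi-span answer is a list of strings; an empty string means no scale.
    outputs = {
        "9337c3e6-c53f-45a9-836a-02c474ceac16": ("4.6", "percent"),
        "c4170232-e89c-487a-97c5-afad45e9d702": ("16", "thousand"),
        "d81d1ae7-363c-4b47-8eea-1906fef33856": (["2018", "2019"], ""),
    }

    predictions = {uid: [answer, scale] for uid, (answer, scale) in outputs.items()}

    with open("my_predictions.json", "w", encoding="utf-8") as f:
        json.dump(predictions, f, ensure_ascii=False, indent=2)

The resulting file can then be passed to tatqa_eval.py via --pred_path.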
    

We also provide a sample prediction file (on the dev set) for your reference:

    python tatqa_eval.py --gold_path=dataset_raw/tatqa_dataset_dev.json --pred_path=sample_prediction.json

Submission

Please email us the prediction file on the test set, together with the information needed for your leaderboard entry (e.g., model name, team name, and links to your paper and code).

Please allow up to two weeks for us to evaluate your submission; we will then add your model to the leaderboard.

Contact

The TAT-DQA dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

For more information, please contact:

Please kindly cite our work if the dataset helps your research.

    
    @inproceedings{zhu-etal-2021-tat,
        title = "{TAT}-{QA}: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance",
        author = "Zhu, Fengbin  and
          Lei, Wenqiang  and
          Huang, Youcheng  and
          Wang, Chao  and
          Zhang, Shuo  and
          Lv, Jiancheng  and
          Feng, Fuli  and
          Chua, Tat-Seng",
        booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
        month = aug,
        year = "2021",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2021.acl-long.254",
        doi = "10.18653/v1/2021.acl-long.254",
        pages = "3277--3287"
    }
    
    @inproceedings{zhu2022towards,
      title={Towards complex document understanding by discrete reasoning},
      author={Zhu, Fengbin and Lei, Wenqiang and Feng, Fuli and Wang, Chao and Zhang, Haozhou and Chua, Tat-Seng},
      booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
      pages={4857--4866},
      year={2022}
    }