TAT-DQA is a large-scale Document VQA dataset constructed by extending TAT-QA. It aims to stimulate progress in QA research over more complex and realistic visually-rich documents with rich tabular and textual content, especially those requiring numerical reasoning.
The unique features of TAT-DQA include:
In total, TAT-DQA contains 16,558 questions associated with 2,758 documents (3,067 document pages) sampled from real-world financial reports.
The following is an example of TAT-DQA. On the left is a given visually-rich document; on the right are a question associated with the document, the derivation used to compute the answer value, the scale of the numerical answer, and the final answer.
{
"pages": [ # The document content are stored by page, each page will be one element in this array
{
"bbox": [ # The dimension of this document page, [x1, y1, x2, y2]. (x1, y1) is the top left of the page while (x2, y2) is the bottom right
0,
0,
1239,
1754
],
"blocks": [ # The document content are stored by blocks for each page
{
"bbox": [
34,
50,
174,
71
],
"uuid": "8f1e47dc-af67-485d-b269-057f06e8714c", # The unique ID of the block
"text": "Table of Contents", # The text content of the block
"words": { # The word list of the content for this block
"word_list": [
"Table",
"of",
"Contents"
],
"bbox_list": [ # The corresponding bounding box of each word in the word list
[
34,
50,
75,
71
],
[
84,
50,
100,
71
],
[
108,
50,
174,
71
]
]
},
"order": 1 # The order of the block in this page, starting from 1
},
...
]
}
]
}
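As a sketch, the per-page block structure above can be traversed to reconstruct a page's plain text; the field names follow the example above, while the helper function itself is our own illustration, not part of the dataset tooling.

```python
import json

def page_text(page: dict) -> str:
    """Concatenate a page's block texts in reading order.

    Blocks carry an `order` field starting from 1, so sorting by it
    recovers the sequence in which the blocks appear on the page.
    """
    blocks = sorted(page["blocks"], key=lambda b: b["order"])
    return "\n".join(b["text"] for b in blocks)

# A minimal document in the format shown above.
doc = {
    "pages": [
        {
            "bbox": [0, 0, 1239, 1754],
            "blocks": [
                {
                    "bbox": [34, 50, 174, 71],
                    "uuid": "8f1e47dc-af67-485d-b269-057f06e8714c",
                    "text": "Table of Contents",
                    "order": 1,
                },
            ],
        }
    ]
}

print(page_text(doc["pages"][0]))  # Table of Contents
```

In practice you would load `{uid}.json` with `json.load` and apply the same traversal to each element of `pages`.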
{
"doc": {
"uid": "11ba155b7577c83fe5f3c4f766039e93", # The unique id of the document, {uid}.pdf and {uid}.json are the given PDF document and the OCR result of the PDF
"page" : 1, # The starting page no. of the document that is relevant to the QA pairs
"source": "navios-maritime-holdings-inc_2019.pdf" # The financial statement that this doc comes from
},
"questions": [ # The questions associated to the given document
{
"uid": "9dfd4a8d0b9e91c2ab5ab4a8745226c6", # The unique id of a question
"order": 6, # The order of the question in all questions, starting from 1
"question": "What was the change in Impairment losses between 2017 and 2018?", # The question itself
"answer": 150092, # The ground-truth answer
"derivation": "200,657-50,565", # The derivation that can be executed to arrive at the ground-truth answer
"answer_type": "arithmetic", # The answer type including `span`, `spans`, `arithmetic` and `counting`.
"scale": "thousand", # The scale of the answer including `None`, `thousand`, `million`, `billion` and `percent`
"req_comparison": false, # A flag indicating if `comparison/sorting` is needed to answer the question whose answer is a single span or multiple spans
"facts": [ #[Optional] The facts or evidencies used to infer the final answers, which are generated heuristically
"200,657",
"50,565"
],
"block_mapping": [ # The facts' positions in the OCR result of the given document.
{
"69352448-14ad-4854-87a9-3ac44358a660": [
26,
32
]
},
{
"69352448-14ad-4854-87a9-3ac44358a660": [
18,
25
]
}
]
},
...
]
}
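For `arithmetic` questions, the `derivation` string can be executed to reproduce the numeric `answer`. A minimal sketch of doing so safely (the whitelist-then-`eval` approach is our own illustration, not the official code):

```python
import re

def execute_derivation(derivation: str) -> float:
    """Evaluate a derivation string such as "200,657-50,565".

    Thousands separators are stripped first, and the expression is
    rejected unless it contains only digits, arithmetic operators,
    dots, parentheses and spaces, so eval() cannot run arbitrary code.
    """
    expr = derivation.replace(",", "")
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
        raise ValueError(f"unsupported derivation: {derivation!r}")
    return eval(expr)

print(execute_derivation("200,657-50,565"))  # 150092
```

The result is expressed in the question's `scale` (here `thousand`), so the ground-truth answer 150092 denotes 150,092 thousand.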
| Rank | Model Name | Team Name | Exact Match | F1 | Created | Paper | Codes |
|---|---|---|---|---|---|---|---|
| - | Human Performance | - | 84.1 | 90.8 | - | - | - |
| 1 | TAT-LLM (70B) | NExT | 76.5 | 83.9 | 20 Jan 2024 | Paper | N.A. |
| 2 | TAT-LLM (13B) | NExT | 72.2 | 80.5 | 20 Jan 2024 | Paper | N.A. |
| 3 | TAT-LLM (7B) | NExT | 71.3 | 80.2 | 20 Jan 2024 | Paper | N.A. |
| 4 | Doc2SoarGraph | NExT | 59.2 | 67.6 | 10 Jan 2023 | Paper | Codes |
| 5 | Baseline - MHST | NExT | 41.5 | 50.7 | 10 Aug 2022 | Paper | Codes |
To evaluate your models, we have also made available the evaluation script that we will use for the official evaluation. To run the evaluation, use
python tatqa_eval.py --gold_path=:path_to_dev --pred_path=:path_to_predictions
The prediction file, in JSON format, contains a dictionary with question ids as keys and the predictions as values (each prediction must be an array containing both `answer` and `scale`). For example,
{
"9337c3e6-c53f-45a9-836a-02c474ceac16": [
"4.6",
"percent"
],
"c4170232-e89c-487a-97c5-afad45e9d702": [
"16",
"thousand"
],
"d81d1ae7-363c-4b47-8eea-1906fef33856": [
["2018", "2019"],
""
]
...
}
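A minimal sketch of producing a prediction file in this format, reusing the uids from the sample above (the output filename is our own choice):

```python
import json

# Predictions keyed by question uid; each value is [answer, scale].
# A multi-span answer is itself a list, and an empty string means no scale.
predictions = {
    "9337c3e6-c53f-45a9-836a-02c474ceac16": ["4.6", "percent"],
    "c4170232-e89c-487a-97c5-afad45e9d702": ["16", "thousand"],
    "d81d1ae7-363c-4b47-8eea-1906fef33856": [["2018", "2019"], ""],
}

with open("my_prediction.json", "w") as f:
    json.dump(predictions, f, indent=2)
```

The resulting file can then be passed to the evaluation script via `--pred_path`.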
We also provide a sample prediction file (on Dev) for your reference.
python tatqa_eval.py --gold_path=dataset_raw/tatqa_dataset_dev.json --pred_path=sample_prediction.json
Please email us the prediction file of the test set along with the following information:
Please allow up to two weeks for us to evaluate your submission; we will then add your model to the leaderboard.
The TAT-DQA dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
For more information, please contact:
Please cite our work if the dataset helps your research.
@inproceedings{zhu-etal-2021-tat,
title = "{TAT}-{QA}: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance",
author = "Zhu, Fengbin and
Lei, Wenqiang and
Huang, Youcheng and
Wang, Chao and
Zhang, Shuo and
Lv, Jiancheng and
Feng, Fuli and
Chua, Tat-Seng",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.254",
doi = "10.18653/v1/2021.acl-long.254",
pages = "3277--3287"
}
@inproceedings{zhu2022towards,
title={Towards complex document understanding by discrete reasoning},
author={Zhu, Fengbin and Lei, Wenqiang and Feng, Fuli and Wang, Chao and Zhang, Haozhou and Chua, Tat-Seng},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
pages={4857--4866},
year={2022}
}
@inproceedings{zhu2024doc2soargraph,
title = "{D}oc2{S}oar{G}raph: Discrete Reasoning over Visually-Rich Table-Text Documents via Semantic-Oriented Hierarchical Graphs",
author = "Zhu, Fengbin and
Wang, Chao and
Feng, Fuli and
Ren, Zifeng and
Li, Moxin and
Chua, Tat-Seng",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.456",
pages = "5119--5131"
}