ACM MM 2019 Video Relation Understanding Challenge

Introduction

For decades, multimedia researchers mainly evaluate visual systems according to a set of application-driven tasks, such as the cross-modal retrieval and concept annotation etc. Although the recent advance in computer vision has effectively boosted the performance of visual systems on these tasks, a core question still cannot be explicitly answered: Does the machine understand what is happening in a video, and can the results of the analysis be interpretable by human users? Another way to look at the limitation is to evaluate how many facts that the machine can recognize from a video.

This new ACM MM 2019 Video Relation Understanding (VRU) Challenge will encourage researchers to explore a key aspect of recognizing facts from a video, that is relation understanding. In many AI and knowledge-based systems, a fact is represented by a relation between a subject entity and an object entity (i.e. <subject,predicate,object>), which forms the fundamental building block for high-level inference and decision making tasks. The challenge is based on VidOR Dataset, a large-scale user-generated video dataset with objects and relations densly annotated. We announce 3 pivotal tasks in video object detection, action detection and visual relation detection to push the limits of relation understanding.

Tasks

Task 1: Video Object Detection [details]

As the first step in relation understanding, the task is to detect objects of certain categories and spatio-temporally localize each detected object using a bounding-box trajectory in videos. For each object category, we compute Average Precision (AP) to evaluate the detection performance and rank according to the mean AP over all categories.

Task 2: Action Detection [details]

Action is another important semantic in videos. This task is to detect actions of certain categories and spatio-temporally localize the subject of each detected action using a bounding-box trajectory. For each action category, we compute AP to evaluate the detection performance and rank according to the mean AP over all categories.

Task 3: Visual Relation Detection [details]

Beyond recognizing object and action individually, this task is to detect relation triplets (i.e. <subject,predicate,object>) of interest and spatio-temporally localize the subject and object of each detected relation triplet using bounding-box trajectories. The categories of predicate will also include spatial type in addition to the action type. For each testing video, we compute AP to evaluate the detection performance and rank according to the mean AP over all testing videos.

Participation

The challenge is a team-based contest. Each team can have one or more members, and an individual cannot be a member of multiple teams in a task. For registration, please email the following information to the challenge organizers:

Participating Task (one of the above three tasks to participate)
Team Name
Team Members (first name/last name)
Team Leader (we will send important notifications and testing videos to the team leader)
Team Leader Email

Please note that each task needs a separate registration.

At the end of the challenge, all teams will be ranked based on the objective evaluation above. The top three performing teams of each task will receive award certificates. At the same time, by submitting a 4-page overview paper (plus 1-page reference) to ACM MM'19, all accepted submissions are eligible for the conference’s grand challenge award competition.

Leaderboard

Task 1: Video Object Detection

Rank	Team Name	Performance: mean AP	Team Members
1	DeepBlueAI	0.0944	Zhipeng Luo, Yuehan Yao, Zhenyu Xu, Feng Ni

Task 3: Visual Relation Detection

Rank	Team Name	Performance: mean AP	* Tagging Precision@5	Team Members
1	MAGUS.Gamma	0.0631	0.421	Xu Sun, Yuan Zi, Tongwei Ren, Gangshan Wu [paper]
2	RELAbuilder	0.00546	0.236	Sipeng Zheng, Xiangyu Chen, Qin Jin [paper]

* Tagging Precision@5 is an auxiliary metric, which specifically indicates the ability of tagging accurate visual relations in top 5 but won't evaluate the accuracy of the bounding-box trajectories of the subject and object.

** The challenge has 23 registered teams from around the world. However, due to the big challenges in both the tasks and dataset, only a few outstanding teams successfully submitted results by the final deadline.

Timeline

~~February 22, 2019: Web site and call for participation ready~~
~~February 24, 2019: Training and validation videos available for download~~
~~March 2, 2019: Training and validation annotations available for download~~
~~May 22, 2019: Testing videos open to registered participants for download~~
~~June 22, 2019 June 26, 2019, 23:59 AoE (extended): Results submission~~
~~July 1, 2019: Evaluation results announce~~
~~July 8, 2019: Paper submission deadline~~

Contact

For registration and general information about the challenge, please contact:

Xindi Shang, Donglin Di and Junbin Xiao
shangxin@comp.nus.edu.sg, donglin.di@u.nus.edu, xiaojunbin@u.nus.edu

For information about Task 1, please contact:

Junbin Xiao
xiaojunbin@u.nus.edu

For information about Task 2, please contact:

Donglin Di
donglin.di@u.nus.edu

For information about Task 3, please contact:

Xindi Shang
shangxin@comp.nus.edu.sg