Task 1: Video Object Detection

Object detection is the first step towards relation understanding in videos. The main goal of this task is to develop robust object detectors that not only localize objects from the 80 categories with bounding boxes in every video frame, but also link the bounding boxes that indicate the same object entity into a trajectory. The task requires participants to develop true video object detectors that can understand the identities and dynamics of object entities at the video level, which can benefit many applications that require fine-grained video understanding. The challenge aims to accelerate research on robust object detection in videos in the wild by providing a large number of user-generated videos with annotations. Additionally, the challenge encourages research on video object detection with relationships by providing relation annotations in the dataset. An expected challenge in this task is that the detectors should be able to re-identify objects that reappear after a long-term disappearance. Technically, the detectors have to overcome difficulties from two aspects:

In case of any problems, feel free to contact: xiaojunbin@u.nus.edu


The dataset for this task consists of 10K user-generated videos from Flickr, along with annotations for 80 object categories (e.g., "adult", "child", "dog", "table"). Bounding boxes are annotated for objects in each frame, and the objects' identities across frames are also provided. The training/validation/testing splits contain 7,000, 835 and 2,165 videos, respectively. In particular, an object is annotated even if only part of it is visible in the image (e.g., a person's hand). For more detailed information, please go to the dataset page.

The videos and annotations can be downloaded directly from here. Please note that the downloaded annotations also contain relation annotations. Participants in this task are allowed to use them to train their object detection models.

Evaluation Metric

We adopt average precision (AP) as the metric to evaluate the detection performance for each category. The trajectory-level mean AP (mAP) is defined as follows:
Given a predicted trajectory (a.k.a. tubelet) $\mathcal{T}_p$ and a ground truth trajectory $\mathcal{T}_g$ of a certain category, the temporal Intersection over Union (tIoU) between these two trajectories is defined by: $$\text{tIoU}(\mathcal{T}_p, \mathcal{T}_g)=\frac{D_p \cap D_g}{D_p\cup D_g}, $$ where $D_p$ and $D_g$ denote the time durations of the predicted trajectory and the ground truth trajectory, respectively. In our settings, the threshold for \( \text{tIoU} \) is 0.5, which means any result with \( \text{tIoU} \geq 0.5 \) is regarded as a true positive prediction. In addition, the \( \text{IoU} \) threshold for frame-level bounding boxes is set to (0.5, 0.7, 0.9). For each trajectory pair, the \( \text{tIoU} \) is averaged over these three frame-level IoU thresholds. The final mAP is obtained by averaging the APs across all categories. Note that we will not independently evaluate the image-level detection performance in this challenge.
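To make the metric more concrete, the following is a minimal Python sketch of one possible reading of the tIoU described above. It is not the official evaluation code. It assumes that a trajectory is represented as a mapping from frame index to an `(xmin, ymin, xmax, ymax)` box, that a shared frame counts towards the temporal intersection only when the frame-level box IoU reaches the given threshold, and that the final value is the mean over the three thresholds. The names `box_iou`, `tiou` and `averaged_tiou` are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout
Trajectory = Dict[int, Box]              # frame index -> box, assumed representation


def box_iou(a: Box, b: Box) -> float:
    """Spatial IoU of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tiou(pred: Trajectory, gt: Trajectory, box_thresh: float = 0.0) -> float:
    """Temporal IoU over frame indices.

    Assumption: a shared frame is counted in the intersection only if the
    two boxes in that frame overlap with IoU >= box_thresh.
    """
    union = len(set(pred) | set(gt))
    if union == 0:
        return 0.0
    shared = set(pred) & set(gt)
    matched = sum(1 for f in shared if box_iou(pred[f], gt[f]) >= box_thresh)
    return matched / union


def averaged_tiou(pred: Trajectory, gt: Trajectory,
                  thresholds: Sequence[float] = (0.5, 0.7, 0.9)) -> float:
    """Mean tIoU over the three frame-level IoU thresholds (assumed reading)."""
    return sum(tiou(pred, gt, t) for t in thresholds) / len(thresholds)


if __name__ == "__main__":
    pred = {f: (0.0, 0.0, 10.0, 10.0) for f in range(0, 10)}
    gt = {f: (0.0, 0.0, 10.0, 10.0) for f in range(5, 15)}
    # A prediction would count as a true positive if this value is >= 0.5.
    print(averaged_tiou(pred, gt))
```

With `box_thresh=0.0`, `tiou` reduces to the plain duration-based ratio in the formula. The evaluation code linked from this page remains the reference for how trajectories are actually matched.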

The evaluation code used by the evaluation server can be found here. The number of predictions per video is limited to 200.
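One way to respect the limit of 200 predictions per video is to keep only the highest-scoring trajectories before writing the submission file. The sketch below assumes that results are held in memory as a dictionary from video ID to a list of prediction dictionaries, each with a numeric `"score"` field. This layout and the function name `cap_predictions` are illustrative assumptions and are not taken from the official submission format.

```python
import heapq
from typing import Any, Dict, List

MAX_PREDICTIONS_PER_VIDEO = 200  # limit stated for this task


def cap_predictions(results: Dict[str, List[Dict[str, Any]]],
                    limit: int = MAX_PREDICTIONS_PER_VIDEO
                    ) -> Dict[str, List[Dict[str, Any]]]:
    """Keep at most `limit` predictions per video, highest score first.

    Assumption: every prediction dict carries a numeric "score" key.
    """
    return {
        video_id: heapq.nlargest(limit, preds, key=lambda p: p["score"])
        for video_id, preds in results.items()
    }


if __name__ == "__main__":
    # Hypothetical video ID with 300 dummy predictions.
    results = {"video_0": [{"score": i / 300} for i in range(300)]}
    capped = cap_predictions(results)
    print(len(capped["video_0"]))
```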

Submission Format

Please use the following JSON format when submitting your results for the challenge:


The example above is illustrative. Comments must be removed in your submission. A sample submission file is available here.