Format

Results on the test set must be submitted as a zip archive named after your algorithm. This archive should contain one result file per video in the test set. Each result file should be a comma-separated value (CSV) file named ‘test<video id>.csv’ (for example, ‘test01.csv’ for video ‘test01.mp4’). Result files should follow these guidelines:

  • Each line in these csv files should start with the frame ID and then provide the algorithm’s confidence level for each tool.
  • A confidence level can be any real number. In particular, it does not have to be a probability. The only restriction is that a larger number indicates a larger confidence that the tool is being used.
  • The order in which confidence levels for tools are written in the result files should be the order used in ground truth files (i.e. the order used in Fig. 1).
  • Note that ground truth files include a header line listing the tool names: do not include this header in the result files.

An example of a valid result file (for a very short video with only five frames) is given below:

1, -0.37, 0.73, -0.38, -0.93, -0.10, 0.16, -1.97, -0.44, -1.17, -0.72, -1.39, -0.61, 0.20, 0.15, 0.57, 1.10, 0.05, 1.09, -0.27, 0.85, -0.86
2, 0.63, 0.63, -0.74, 1.03, -0.06, 1.46, 0.39, -0.54, -0.84, 0.05, 0.26, 0.73, 0.81, -0.87, -0.57, 1.28, 1.42, 1.57, 0.75, 0.88, -1.36
3, 0.97, -1.12, 0.41, 1.28, 1.10, -0.52, -1.29, -0.88, 1.37, -1.49, 0.94, 0.34, 0.27, -0.67, 0.43, -0.14, 0.31, -0.72, 0.95, -1.08, 0.62
4, 0.95, -0.17, -0.11, -1.57, -0.55, 0.56, -0.62, 0.82, 1.18, 0.43, -0.49, -0.35, 0.72, -1.45, -3.36, 0.96, -0.12, -1.06, -0.71, 0.04, -1.74
5, -0.76, -0.16, -0.63, -0.13, -1.37, -1.39, -0.40, -1.47, -0.03, -1.13, -0.06, 0.32, 0.95, 0.76, -0.64, 0.81, 1.04, -0.48, -1.03, 0.32, 2.65
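
As a concrete illustration, here is a minimal Python sketch that writes a result file in this format using the standard csv module. The ‘confidences’ variable and the output file name are placeholders standing in for your algorithm’s actual scores, and the rows are truncated to three tools per frame for brevity.

    import csv

    # Placeholder scores: one list of per-tool confidence levels per frame,
    # in the ground-truth tool order (truncated to three tools for brevity).
    confidences = [
        [-0.37, 0.73, -0.38],   # frame 1
        [0.63, 0.63, -0.74],    # frame 2
    ]

    with open("test01.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for frame_id, scores in enumerate(confidences, start=1):
            # One row per frame: frame ID first, then one confidence per tool.
            # No header line is written.
            writer.writerow([frame_id] + scores)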

Performance Evaluation

A figure of merit is first computed for each tool label T: the annotation performance for tool T is defined as the area Az(T) under the receiver-operating characteristic (ROC) curve. This curve is obtained by varying a cutoff on the confidence levels for this tool label. Frames associated with a disagreement between experts (reference standard = 0.5 for tool T) are ignored when computing the ROC curve. A global figure of merit is then defined as the mean Az(T) value over all tool labels T.
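
For reference, a minimal Python sketch of this figure of merit is given below. It is not the official evaluate.py script; it assumes the reference standard and the predictions are provided as NumPy arrays of shape (number of frames, number of tools), and that at least one positive and one negative frame remain for each tool after discarding disagreements.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def mean_auc(reference, predictions):
        """Mean per-tool ROC AUC (sketch, not the official evaluate.py).

        reference   -- (n_frames, n_tools) array of 0, 1, or 0.5 values,
                       where 0.5 marks a disagreement between experts
        predictions -- (n_frames, n_tools) array of confidence levels
        """
        aucs = []
        for t in range(reference.shape[1]):
            keep = reference[:, t] != 0.5   # ignore frames with disagreement
            aucs.append(roc_auc_score(reference[keep, t], predictions[keep, t]))
        return float(np.mean(aucs))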

We chose the area under the ROC curve rather than figures of merit based on precision and recall, which evaluate cutoffs on the rank of tool labels sorted by decreasing confidence level. The reason for this choice is that a varying number of tools may be used in each frame (zero, one, two or three). A rank-based cutoff is of limited practical value in this scenario: algorithms should not always produce the same number of tool predictions regardless of the number of tools actually in use. Cutoffs on the confidence level, as used in ROC analysis, are more convenient: a binary prediction can be made independently for each tool label, leading to an adaptive number of tool predictions per frame, as illustrated by the sketch below.
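
The following sketch illustrates this with arbitrary placeholder cutoffs: thresholding each tool’s confidence level independently yields a different number of predicted tools in each frame.

    import numpy as np

    confidences = np.array([[-0.37, 0.73, -0.38],
                            [0.63, 0.63, -0.74]])
    cutoffs = np.array([0.0, 0.5, 0.0])   # one arbitrary cutoff per tool
    predicted = confidences > cutoffs     # boolean matrix: frames x tools
    print(predicted.sum(axis=1))          # tools predicted per frame: [1 2]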

The evaluation script (evaluate.py) can be downloaded.