Evaluation settings

Four different evaluation settings are suggested for both tasks. They differ in the source of the preprocessing information (morphological, syntactic, and semantic annotations):

  1. Closed challenge. Systems have to be built strictly with information contained in the given training and test corpora. The aim of this challenge is to compare the performance of the participating systems under exactly the same conditions.   

    1a. Gold-standard setting. Gold-standard manual annotations of the preprocessing information are provided at test time, plus the last column with true mention boundaries (the provided entity numbers are fake, maintained only to keep coherence with the described coreference annotation and compatibility with the scorer). Only the available annotation levels (see the info.txt file of each language) will be provided, for Catalan, English, German, and Spanish. [There is no gold-standard evaluation setting for Dutch or Italian.]

    1b. Regular setting. Only automatically predicted annotations of the preprocessing information are provided at test time (i.e., participants are not allowed to use the gold-standard columns or the last column with true mention boundaries).
  2. Open challenge. Systems can be developed using any kind of external tools and resources to predict the preprocessing information. The only condition is that such tools or resources must not have been developed with the annotations of the test set, neither with the input nor the output annotations of the data. In this challenge, we are interested in learning methods that make use of any tools or resources that might improve performance. For example, we encourage the use of rich semantic information from WordNet, Wikipedia, etc. Note that the comparison of different systems in this setting might obscure whether score differences are due to the coreference algorithm or to the preprocessing tools.

    2a. Gold-standard setting. See the description in 1a above.

    2b. Regular setting. See the description in 1b above.

We invite - and strongly encourage - participants to send the results of their systems run in ALL FOUR EVALUATION SCENARIOS (closed vs. open, gold-standard vs. regular) and for ALL SIX LANGUAGES. This is the only way to gain insight into the effect of the additional annotation layers on a single coreference resolution system (and across systems), as well as into the portability of systems across languages. Nonetheless, participants may also restrict themselves to any of the evaluation scenarios and/or to any of the languages.




Scoring metrics

Four different evaluation metrics will be used to rank the participating systems in the full task:

  1. MUC (Vilain et al., 1995)
  2. B-CUBED (Bagga and Baldwin, 1998)
  3. CEAF (Luo, 2005)
  4. BLANC (Recasens and Hovy, to appear in Natural Language Engineering), a newly proposed metric.
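To make the cluster-based scoring concrete, here is a minimal sketch of B-CUBED precision and recall in Python. The function name and input format are our own illustration, not the official scorer:

```python
def b_cubed(key_clusters, response_clusters):
    # B-CUBED (Bagga and Baldwin, 1998): each mention scores the overlap
    # between its key (gold) cluster and its response (system) cluster,
    # normalized by cluster size; precision and recall average these
    # per-mention scores. Assumes both partitions cover the same mentions.
    key_of = {m: frozenset(c) for c in key_clusters for m in c}
    resp_of = {m: frozenset(c) for c in response_clusters for m in c}
    n = len(key_of)
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in key_of) / n
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in key_of) / n
    return precision, recall
```

For example, a system that wrongly merges the gold entities {1, 2} and {3} into a single entity keeps perfect recall but loses precision.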

The subtask will be evaluated with simple accuracy.

The regular setting evaluation, where participants are not given the gold-standard NP boundaries, will be split into two measures, as the mention identification task is distinct from coreference resolution and thus should be evaluated separately:

Recognition of mentions. Standard precision and recall will be computed by comparing the gold set of mentions with the system set of mentions.
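Under exact-span matching, this comparison reduces to set intersection. The following sketch (names and the (start, end) span representation are illustrative assumptions, not the official scorer's format) shows the computation:

```python
def mention_prf(gold_mentions, system_mentions):
    # Exact-match mention recognition: a system mention counts as correct
    # only if its (start, end) span appears in the gold set.
    gold, system = set(gold_mentions), set(system_mentions)
    correct = len(gold & system)
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```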

Correctness of coreference. The MUC, B-CUBED, CEAF and BLANC measures will be applied only to the correctly recognized mentions, i.e., those that correspond to the gold set.
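In other words, before the coreference metrics are computed, each cluster is restricted to the correctly recognized mentions. A sketch of that filtering step, under our own naming (not the official code):

```python
def restrict_to_correct(clusters, correct_mentions):
    # Drop mentions that were not correctly recognized, then drop any
    # cluster left empty; the coreference metrics are then computed on
    # the remaining clusters.
    kept = [[m for m in c if m in correct_mentions] for c in clusters]
    return [c for c in kept if c]
```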

The official scorers for the task can be downloaded from the Download section. Different versions will be maintained throughout the task.



For any issue regarding the evaluation metrics, feel free to post in the forum.