Datasets and Formats

 

Corpora

Data formatting

Details on the coreference information

 


 

CORPORA

All corpora used in the task belong to the newspapers/newswire genre.

 

Catalan, Spanish

The data sets come from the AnCora corpora.

AnCora-ES (the Spanish part) contains 75k words from the Lexesp balanced 6-million-word corpus, 225k words from the EFE news agency, and 200k from the Spanish version of the 'El Periódico' newspaper.

AnCora-CA (the Catalan part) consists of 75k words from the EFE news agency, 225k words from the ACN news agency, and 200k words from the Catalan version of the 'El Periódico' newspaper. The 200k-word subsets from 'El Periódico' cover the same news stories in Catalan and Spanish, spanning January to December 2000.

Both corpora are hand-annotated with constituents, functions, thematic roles, semantic verb classes, named entities, WordNet nominal senses, and coreference.

Training: 300k. Test: 50k.

Freely available for research purposes.

  

English

The data set consists of a series of documents from the Reuters RCV1 newswire corpus. Reuters Corpus RCV1 is distributed by NIST.

Since the corpus does not come with any syntactic or semantic annotation, only automatic linguistic annotation produced by statistical taggers and parsers is available.

Training: 100k. Test: 30k.

 



 

DATA FORMATTING
 
Formatting is shared by all languages in the task. Data formats are inspired by the 2008/2009 CoNLL shared tasks on syntactic and semantic dependencies.
Trial data are provided as a single file per language. Each file contains several documents, each delimited by opening and closing comment lines:

    #begin document CESS-CAT-AAP/95694_20030723.tbf.xml
    ...
    sentences in the document
    ...
    #end document CESS-CAT-AAP/95694_20030723.tbf.xml

 

Inside a document, each sentence is laid out vertically with one word per line. Each word is described by several fields (columns) representing different layers of linguistic annotation. Columns are separated by TAB characters. Sentences are separated by a blank line.
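The layout described above can be read with a few lines of Python. This is a minimal sketch, not the official reader: the function name read_documents and the sample text are invented for illustration, and the sample rows carry only two columns instead of the full set.

```python
from typing import Dict, Iterable, List

Sentence = List[List[str]]  # one list of column values per token


def read_documents(lines: Iterable[str]) -> Dict[str, List[Sentence]]:
    """Group token rows into sentences, and sentences into documents."""
    documents: Dict[str, List[Sentence]] = {}
    doc_name = None
    sentence: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#begin document"):
            doc_name = line[len("#begin document"):].strip()
            documents[doc_name] = []
            sentence = []
        elif line.startswith("#end document"):
            if sentence:  # flush a sentence not followed by a blank line
                documents[doc_name].append(sentence)
            doc_name, sentence = None, []
        elif not line.strip():  # a blank line closes the current sentence
            if sentence and doc_name is not None:
                documents[doc_name].append(sentence)
            sentence = []
        elif doc_name is not None:
            sentence.append(line.split("\t"))  # columns are TAB-separated
    return documents


# Hypothetical two-column sample; real files have many more columns.
sample = (
    "#begin document demo\n"
    "1\tLa\n"
    "2\tplaça\n"
    "\n"
    "1\tRoura\n"
    "#end document demo\n"
)
docs = read_documents(sample.splitlines())
```

Here docs maps each document name to its list of sentences, and each sentence is a list of rows ready to be split into the columns listed below.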

The following columns are provided:

ID TOKEN LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL PRED PPRED APREDs PAPREDs COREF

 

Column 1
 1 ID: word identifiers in the sentence

Columns 2--8: words and morphosyntactic information
 2 TOKEN: word forms
 3 LEMMA: word lemmas (gold standard manual annotation)
 4 PLEMMA: word lemmas predicted by an automatic analyzer
 5 POS: coarse part of speech
 6 PPOS: same as 5 but predicted by an automatic analyzer
 7 FEAT: morphological features (part of speech type, number, gender, case, tense, aspect, degree of comparison, etc., separated by the character "|")
 8 PFEAT: same as 7 but predicted by an automatic analyzer

Columns 9--12: syntactic dependency tree
 9 HEAD: for each word, the ID of the syntactic head ('0' if the word is the root of the tree)
10 PHEAD: same as 9 but predicted by an automatic analyzer
11 DEPREL: dependency relation labels corresponding to the dependencies  described in 9
12 PDEPREL: same as 11 but predicted by an automatic analyzer

Columns 13--14+N+M: semantic role labeling
13 PRED: predicates are marked and annotated with a semantic class label
14 PPRED: same as 13 but predicted by an automatic analyzer
 * APREDs: N columns, one for each predicate in 13, containing the semantic roles/dependencies of each particular predicate
 * PAPREDs: M columns, one for each predicate in 14, with the same information as APREDs but predicted with an automatic analyzer.  

Last column: output to be predicted
 * COREF: coreference annotation in open-close notation, using "|" to separate multiple annotations (see more details below)

All columns but the last are to be considered input information. The predicted columns will always be provided. The gold standard manual annotations will be provided at test time only in the "gold standard" evaluation setting. In the regular setting, participants are not allowed to use the gold standard columns at test time. The last column (COREF) is the output information, that is, the annotation that systems have to predict.
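Because the number of APREDs and PAPREDs columns varies from sentence to sentence, a row cannot be split by fixed positions alone. The sketch below shows one possible way to do it, under two assumptions that are not stated on this page: tokens that are not predicates carry "_" in the PRED column, and every column after the APREDs except the last is a PAPREDs column. The helper names and all values in the toy rows are invented.

```python
from typing import Dict, List, Union

FIXED_COLUMNS = [
    "ID", "TOKEN", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
    "HEAD", "PHEAD", "DEPREL", "PDEPREL", "PRED", "PPRED",
]


def count_predicates(rows: List[List[str]], column: int) -> int:
    """Count tokens marked as predicates (assumed: anything but '_')."""
    return sum(1 for row in rows if row[column] != "_")


def split_row(row: List[str], n_gold: int) -> Dict[str, Union[str, List[str]]]:
    """Name the 14 fixed columns, then slice APREDs, PAPREDs and COREF."""
    fields: Dict[str, Union[str, List[str]]] = dict(zip(FIXED_COLUMNS, row[:14]))
    apreds_end = 14 + n_gold
    fields["APREDs"] = row[14:apreds_end]      # N gold argument columns
    fields["PAPREDs"] = row[apreds_end:-1]     # M predicted argument columns
    fields["COREF"] = row[-1]                  # always the last column
    return fields


# Toy sentence with invented labels: one gold and one predicted predicate.
rows = [
    ["1", "Roura", "Roura", "Roura", "n", "n", "_", "_", "2", "2",
     "suj", "suj", "_", "_", "arg0", "arg0", "(3)"],
    ["2", "parla", "parlar", "parlar", "v", "v", "_", "_", "0", "0",
     "sentence", "sentence", "pred1", "pred1", "_", "_", "_"],
]
n_gold = count_predicates(rows, 12)  # PRED is column 13, index 12
first = split_row(rows[0], n_gold)
```

Reading COREF from the end of the row rather than from a fixed index keeps the split valid regardless of how many predicate columns a sentence has.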

[Note] For the English corpus, the gold standard annotation is not available. The column format is exactly the same but includes "_" characters in all the columns with unavailable information.

The following tools have been used to generate the Predicted (P-) columns:

* Catalan and Spanish PLEMMA, PPOS, and PFEAT are generated with the FreeLing Open source suite of Language Analyzers. The accuracy in PLEMMA and PPOS columns is around 95%. Thanks to Lluís Padró (UPC) for helping with the annotation of the morphosyntactic information.

* English PLEMMA, PPOS, and PFEAT columns have been generated using SVMTagger trained on the Penn Treebank (WSJ) and the WordNet lemmatizer. The accuracy in PLEMMA and PPOS columns is expected to be above 96%.

* PHEAD, PDEPREL, PPRED and PAPREDs columns for all languages are generated by JointParser, which is a system trained in CoNLL-2008 and 2009 shared tasks.

For more details on the corpora and the annotation, see the README file included in the distribution (Downloads).  



 

DETAILS ON THE COREFERENCE ANNOTATION

 

The coreference annotation is given in the last column in a numerical-bracketed format. Every entity has an ID number, and every mention is marked with the ID of the entity it refers to. An opening parenthesis (before the entity ID) marks the beginning of the mention (first token), and a closing parenthesis (after the entity ID) marks the end of the mention (last token). The following examples are taken from this Catalan passage (AnCora-CA):

[La remodelada plaça del [Mercat]_2]_1 es va inaugurar ahir amb actes d'homenatge a [Josep_Roura_i_Estrada]_3 (1787-1860), conegut per la introducció de l'enllumenat públic de gas a Espanya. A la casa natal de [Roura]_3, a [la plaça]_1, s'[hi]_1 va instal·lar un fanal antic de gas.

Using the open-close notation from the task datasets:

la              [...]     (1
plaça           [...]     1)

Mentions with one single token show the entity ID within parentheses:

Roura           [...]      (3)

Tokens belonging to more than one mention separate the respective
entity IDs with a pipe symbol "|". For instance:

La              [...]     (1
remodelada      [...]
plaça           [...]
del             [...]
Mercat          [...]      (2)|1)
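For systems that must write the COREF column, the notation can be produced from a list of mention spans. The following is a minimal sketch: the function name encode_coref, the "_" filler for tokens outside any mention, and the ordering of annotations inside one cell (openings, then single-token mentions, then closings) are assumptions chosen to reproduce the example above, not rules stated by the task.

```python
from typing import List, Tuple


def encode_coref(n_tokens: int, mentions: List[Tuple[int, int, int]]) -> List[str]:
    """Build COREF cells from (entity_id, first_token, last_token) triples.

    Token positions are 0-based and inclusive.
    """
    opens: List[List[str]] = [[] for _ in range(n_tokens)]
    singles: List[List[str]] = [[] for _ in range(n_tokens)]
    closes: List[List[str]] = [[] for _ in range(n_tokens)]
    for entity, first, last in mentions:
        if first == last:
            singles[first].append(f"({entity})")   # single-token mention
        else:
            opens[first].append(f"({entity}")      # first token of the mention
            closes[last].append(f"{entity})")      # last token of the mention
    # "_" for tokens outside any mention is an assumption of this sketch.
    return ["|".join(o + s + c) or "_" for o, s, c in zip(opens, singles, closes)]


tokens = ["La", "remodelada", "plaça", "del", "Mercat"]
column = encode_coref(len(tokens), [(1, 0, 4), (2, 4, 4)])
```

For the five tokens above, the last cell combines the single-token mention of entity 2 with the closing of entity 1, as in the example.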

 

Since the two mentions "la plaça" and "hi" corefer with "La remodelada plaça del Mercat", the last column shows the same entity ID for both of them.

la              [...]     (1
plaça           [...]     1)
[...]
hi              [...]     (1)
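Going the other way, the open-close notation can be decoded into one list of mention spans per entity. This is a sketch under stated assumptions: the function name decode_coref is invented, empty cells and "_" are both treated as "no annotation", and the input column is a toy sequence assembled from the examples above rather than a real sentence.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

_PART = re.compile(r"^(\()?(\d+)(\))?$")


def decode_coref(column: List[str]) -> Dict[int, List[Tuple[int, int]]]:
    """Map each entity ID to its (first, last) token spans, 0-based inclusive."""
    pending: Dict[int, List[int]] = defaultdict(list)   # entity -> open starts
    clusters: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for index, cell in enumerate(column):
        if cell in ("", "_"):
            continue
        for part in cell.split("|"):
            match = _PART.match(part)
            if match is None or not (match.group(1) or match.group(3)):
                raise ValueError(f"malformed annotation {part!r} at token {index}")
            entity = int(match.group(2))
            opening = match.group(1) is not None
            closing = match.group(3) is not None
            if opening and closing:          # "(3)": single-token mention
                clusters[entity].append((index, index))
            elif opening:                    # "(1": mention starts here
                pending[entity].append(index)
            else:                            # "1)": mention ends here
                if not pending[entity]:
                    raise ValueError(f"unmatched close for entity {entity}")
                clusters[entity].append((pending[entity].pop(), index))
    if any(pending.values()):
        raise ValueError("some mentions were opened but never closed")
    return dict(clusters)


# Toy column built from the examples above (not a real sentence).
toy = ["(1", "_", "_", "_", "(2)|1)", "_", "(3)", "(1", "1)", "(1)"]
entities = decode_coref(toy)
```

Keeping a stack of start positions per entity lets nested mentions of the same entity close in the right order.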



 

For any queries, comments or feedback regarding the data sets, please post in the forum.