Datasets and Formats
CORPORA
All corpora used in the task belong to the newspaper/newswire genre.
Catalan, Spanish
The data sets come from the AnCora corpora. AnCora-ES (the Spanish part) contains 75k words from the Lexesp balanced 6-million-word corpus, 225k words from the EFE news agency, and 200k words from the Spanish version of the 'El Periódico' newspaper. AnCora-CA (the Catalan part) consists of 75k words from the EFE news agency, 225k words from the ACN news agency, and 200k words from the Catalan version of the 'El Periódico' newspaper. The 200k-word subset from 'El Periódico' corresponds to the same news in Catalan and Spanish, spanning from January to December 2000. The corpora are hand-annotated with constituents, functions, thematic roles, semantic verb classes, named entities, WordNet nominal senses, and coreference.
Training: 300k. Test: 50k. Freely available for research purposes.
English
The data set consists of a series of documents from the Reuters RCV1 newswire corpus. The Reuters Corpus RCV1 is distributed by NIST. Since it does not come with any syntactic or semantic annotation, we rely solely on automatic linguistic annotation produced by statistical taggers and parsers.
Training: 100k. Test: 30k.
DATA FORMATTING
#begin document CESS-CAT-AAP/95694_20030723.tbf.xml
Inside a document, the information for each sentence is organized vertically, with one word per line. The information associated with each word is described in several fields (columns) representing different layers of linguistic annotation. Columns are separated by TAB characters. Sentences are separated by a blank line.
ID TOKEN LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL PRED PPRED APREDs PAPREDs COREF
The following tools have been used to generate the Predicted (P-) columns:
For more details on the corpora and the annotation, see the README file included in the distribution (Downloads).
DETAILS ON THE COREFERENCE ANNOTATION
The annotation of coreference is shown in the last column in a numerical-bracketed format. Every entity has an ID number. Every mention is marked with the ID of the entity it refers to. An open parenthesis (before the entity ID) marks the beginning of the mention (first token), and a closed parenthesis (after the entity ID) marks the end of the mention (last token). The following examples are extracted from these Catalan sentences (AnCora-CA):
[La remodelada plaça del [Mercat]_2]_1 es va inaugurar ahir amb actes d'homenatge a [Josep_Roura_i_Estrada]_3 (1787-1860), conegut per la introducció de l'enllumenat públic de gas a Espanya. A la casa natal de [Roura]_3, a [la plaça]_1, s'[hi]_1 va instal·lar un fanal antic de gas.
Using the open-close notation from the task datasets:
la [...] (1
plaça [...] 1)
Mentions consisting of a single token show the entity ID within parentheses:
Roura [...] (3)
Tokens belonging to more than one mention separate the respective entity IDs with a vertical bar (|):
La [...] (1
remodelada [...]
plaça [...]
del [...]
Mercat [...] (2)|1)
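The open-close notation can be decoded mechanically. The following is a minimal Python sketch, not part of the task distribution: the function name extract_mentions is hypothetical, and the "_" placeholder for tokens without coreference annotation is an assumption (the examples here only show "[...]" for omitted columns). It takes the last-column values of one sentence and returns (entity ID, first token index, last token index) triples.

```python
import re
from collections import defaultdict

# One coreference item: optional "(", the entity ID, optional ")".
ITEM = re.compile(r"(\()?(\d+)(\))?")


def extract_mentions(coref_column):
    """Decode one sentence's coreference column into mention spans.

    coref_column holds one string per token, e.g. ["(1", "_", "1)"].
    Returns (entity_id, first_token, last_token) triples, 0-based.
    """
    open_starts = defaultdict(list)  # entity ID -> stack of open start indices
    mentions = []
    for index, cell in enumerate(coref_column):
        # A token inside several mentions carries several items joined by "|".
        for item in cell.split("|"):
            match = ITEM.fullmatch(item)
            if match is None:
                continue  # assumed placeholder such as "_": nothing to decode
            opens = match.group(1)
            entity = int(match.group(2))
            closes = match.group(3)
            if opens and closes:
                mentions.append((entity, index, index))  # single-token mention
            elif opens:
                open_starts[entity].append(index)
            elif closes and open_starts[entity]:
                mentions.append((entity, open_starts[entity].pop(), index))
    # Outer mentions first when two mentions start on the same token.
    return sorted(mentions, key=lambda m: (m[1], -m[2]))


# "La remodelada plaça del Mercat": entity 1 spans tokens 0-4, entity 2 is token 4.
print(extract_mentions(["(1", "_", "_", "_", "(2)|1)"]))
# [(1, 0, 4), (2, 4, 4)]
```

A stack per entity ID is used so that a mention of an entity nested inside another mention of the same entity would still pair each closing parenthesis with the most recent opening one.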
Since the two mentions "la plaça" and "hi" corefer with "La remodelada plaça del Mercat", the last column shows the same entity ID for both of them.
la [...] (1
plaça [...] 1)
[...]
hi [...] (1)
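Returning to the file layout described under DATA FORMATTING, the vertical format can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the function name read_documents is hypothetical, any other line starting with "#" is treated as a document marker that ends the current sentence, and the sample rows are shortened to three columns (ID, TOKEN, COREF) for illustration rather than reproducing the full column set.

```python
def read_documents(lines):
    """Group lines into {document name: [sentence, ...]}.

    A sentence is a list of rows; a row is the list of TAB-separated
    column values for one token.
    """
    documents = {}
    name = None      # name of the document currently being read
    current = []     # rows of the sentence currently being read

    def flush():
        # Store the sentence collected so far, if any.
        if current:
            documents.setdefault(name, []).append(list(current))
            current.clear()

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#begin document"):
            flush()
            name = line[len("#begin document"):].strip()
            documents.setdefault(name, [])
        elif line.startswith("#") or not line.strip():
            # Assumption: other "#" lines are markers; blank lines end a sentence.
            flush()
        else:
            current.append(line.split("\t"))
    flush()
    return documents


# Shortened, illustrative rows: ID, TOKEN, COREF only.
sample = (
    "#begin document CESS-CAT-AAP/95694_20030723.tbf.xml\n"
    "1\tla\t(1\n"
    "2\tplaça\t1)\n"
    "\n"
    "1\thi\t(1)\n"
)
docs = read_documents(sample.splitlines())
for doc_name, sentences in docs.items():
    for sentence in sentences:
        # TOKEN is the second column; COREF is the last column.
        print(doc_name, [(row[1], row[-1]) for row in sentence])
```

Rows are kept as plain lists and the coreference value is taken from the last position, since the coreference annotation is stated to be in the last column; addressing the other columns by fixed position is left out of this sketch.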
For any queries, comments or feedback regarding the data sets, please post in the forum.