Datasets and Formats

Corpora

Data formatting

Details on the coreference annotation

 


 

CORPORA

Five different corpora will be used in the task.

Each distribution includes a general README file as well as separate info.txt and (optionally) tagsets.pdf files with specific information about each data set (source, license, automatic tools, tagsets, etc.).

 

Catalan, Spanish

The data sets come from the AnCora-CO corpora (Recasens and Martí, 2009).

AnCora-ES (the Spanish part) contains 75k words from the Lexesp balanced 6-million-word corpus, 225k words from the EFE news agency, and 200k words from the Spanish version of the 'El Periódico' newspaper.

AnCora-CA (the Catalan part) consists of 75k words from the EFE news agency, 225k words from the ACN news agency, and 200k words from the Catalan version of the 'El Periódico' newspaper. The 200k-word subset from 'El Periódico' covers the same news stories in Catalan and Spanish, spanning January to December 2000.

Hand-annotated with constituents, functions, thematic roles, semantic verb classes, named entities, WordNet nominal senses, and coreference. 

Training: 300-330k words. Test: 50k words.

Freely available for research purposes.

 

Dutch

The data set comes from the KNACK-2002 corpus, which consists of texts from the Flemish weekly magazine Knack.

Hand-annotated with coreference.

Training: 168 documents.

 

English

The data set is an excerpt of news from the OntoNotes Corpus Release 2.0. The OntoNotes project is a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute to annotate a one-million-word English corpus by hand.

Hand-annotated with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology, NEs and coreference).  

Training: 100k words. Test: 24k words.

The OntoNotes corpus is distributed by LDC. LDC will distribute the task training and test data sets to SemEval2010 participants after they sign and submit a license agreement for the data. The license agreement requires the data to be returned/destroyed at the end of the task. The website will be announced in due time.

 

German

The data set comes from the TüBa-D/Z Treebank (Hinrichs et al. 2005), a German newspaper corpus based on data taken from the daily issues of "die tageszeitung" (taz).

Hand-annotated with inflectional morphology, constituent structure, grammatical functions, and anaphoric and coreference relations.

Training: 415k words.

'die tageszeitung', the owner of the copyright to the original texts, grants all participants of the SemEval task a temporary license for the duration of the task.

 

Italian

The data set comes from the LiveMemories corpus, an Italian corpus under construction as part of the LiveMemories project. The corpus includes texts from Wikipedia, blogs, and news articles. The excerpt for SemEval-2010 consists of texts from the Italian Wikipedia.

Training: 100k. Test: 50k.

The data are distributed under the Wikipedia distribution rules.

 


 


 

DATA FORMATTING
 
Formatting is shared by all languages in the task. Data formats are inspired by the 2008/2009 CoNLL shared tasks on syntactic and semantic dependencies.
Trial data are provided as a single file per language. Each file contains several documents, each delimited by an opening and a closing comment line:

    #begin document CESS-CAT-AAP/95694_20030723.tbf.xml
    ...
    sentences in the document
    ...
    #end document CESS-CAT-AAP/95694_20030723.tbf.xml

 

Inside a document, each sentence is laid out vertically, with one word per line. The information associated with each word is given in several fields (columns) representing different layers of linguistic annotation. Columns are separated by TAB characters. Sentences are separated by a blank line.
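The layout described above can be read with a few lines of code. The following is a minimal sketch, not an official reader: the function name `read_documents` and the nested-list return structure are illustrative assumptions, and it relies only on the `#begin document` / `#end document` comment lines, TAB-separated columns, and blank lines between sentences.

```python
from typing import Dict, Iterable, List

Sentence = List[List[str]]  # one list of column values per token


def read_documents(lines: Iterable[str]) -> Dict[str, List[Sentence]]:
    """Group raw lines into documents -> sentences -> token rows."""
    documents: Dict[str, List[Sentence]] = {}
    doc_name = None
    sentence: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#begin document"):
            # The document name is whatever follows the marker.
            doc_name = line[len("#begin document"):].strip()
            documents[doc_name] = []
            sentence = []
        elif line.startswith("#end document"):
            # Flush a sentence that was not followed by a blank line.
            if doc_name is not None and sentence:
                documents[doc_name].append(sentence)
            doc_name, sentence = None, []
        elif not line.strip():
            # A blank line closes the current sentence.
            if doc_name is not None and sentence:
                documents[doc_name].append(sentence)
            sentence = []
        elif doc_name is not None:
            # One token per line, columns separated by TAB characters.
            sentence.append(line.split("\t"))
    return documents
```

A typical call would be `read_documents(open(path, encoding="utf-8"))`, where `path` points to one of the per-language files; the UTF-8 encoding is an assumption to check against the README of the distribution.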

The following columns are provided:

ID TOKEN LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL NE PNE PRED PPRED APREDs PAPREDs COREF

Column 1
 1 ID: word identifiers in the sentence

Columns 2--8: words and morphosyntactic information
 2 TOKEN: word forms
 3 LEMMA: word lemmas (gold standard manual annotation)
 4 PLEMMA: word lemmas predicted by an automatic analyzer
 5 POS: coarse part of speech
 6 PPOS: same as 5 but predicted by an automatic analyzer
 7 FEAT: morphological features (part of speech type, number, gender, case, tense, aspect, degree of comparison, etc., separated by the character "|")
 8 PFEAT: same as 7 but predicted by an automatic analyzer

Columns 9--12: syntactic dependency tree
 9 HEAD: for each word, the ID of the syntactic head ('0' if the word is the root of the tree)
10 PHEAD: same as 9 but predicted by an automatic analyzer
11 DEPREL: dependency relation labels corresponding to the dependencies described in 9
12 PDEPREL: same as 11 but predicted by an automatic analyzer  

Columns 13--14: named entities
13 NE: named entities
14 PNE: same as 13 but predicted by a named entity recognizer

Columns 15--16+N+M: semantic role labeling
15 PRED: predicates are marked and annotated with a semantic class label
16 PPRED: same as 15 but predicted by an automatic analyzer
 * APREDs: N columns, one for each predicate in 15, containing the semantic roles/dependencies of each particular predicate
 * PAPREDs: M columns, one for each predicate in 16, with the same information as APREDs but predicted with an automatic analyzer.  

Last column: output to be predicted
 * COREF: coreference annotation in open-close notation, using "|" to separate multiple annotations (see more details below)

All columns but the last are to be considered input information. The predicted columns will always be provided. The gold standard manual annotations will be provided at test time only in the "gold standard" evaluation setting. In the regular setting, participants are not allowed to use the gold standard columns at test time. The last column (COREF) is the output information, that is, the annotation that the systems have to predict.
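Because the number of APREDs and PAPREDs columns varies from sentence to sentence, a row cannot be indexed by fixed positions beyond column 16. The sketch below shows one way to map a row to named fields. The helper names `split_row` and `count_predicates` are illustrative, and the use of "_" as the empty-cell marker is an assumption borrowed from the CoNLL shared task formats; check the README of the distribution for the actual convention.

```python
from typing import Dict, List, Union

# The 16 fixed columns, in the order given in the column list.
FIXED_COLUMNS = [
    "ID", "TOKEN", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
    "HEAD", "PHEAD", "DEPREL", "PDEPREL", "NE", "PNE", "PRED", "PPRED",
]


def count_predicates(sentence: List[List[str]], empty: str = "_") -> int:
    """Count the predicates marked in column 15 (PRED); this is N.

    Assumption: cells without a predicate contain the `empty` marker.
    """
    return sum(1 for row in sentence if row[14] != empty)


def split_row(row: List[str], n_gold_preds: int) -> Dict[str, Union[str, List[str]]]:
    """Map one token row to named fields.

    n_gold_preds is N, needed to tell the N APREDs columns apart from
    the M PAPREDs columns that follow them.
    """
    if len(row) < len(FIXED_COLUMNS) + n_gold_preds + 1:
        raise ValueError("row has fewer columns than expected")
    fields: Dict[str, Union[str, List[str]]] = dict(zip(FIXED_COLUMNS, row))
    variable = row[len(FIXED_COLUMNS):-1]  # everything between PPRED and COREF
    fields["APREDs"] = variable[:n_gold_preds]
    fields["PAPREDs"] = variable[n_gold_preds:]
    fields["COREF"] = row[-1]  # the last column is always COREF
    return fields
```

In the regular evaluation setting, a system would read only the P-prefixed fields (plus ID and TOKEN) from the returned dictionary.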

The following tools have been used to generate the Predicted (P-) columns:

* Catalan and Spanish PLEMMA, PPOS, and PFEAT columns are generated with the FreeLing open-source suite of language analyzers. The accuracy of the PLEMMA and PPOS columns is around 95%. Thanks to Lluís Padró (UPC) for helping with the annotation of the morphosyntactic information.

* English PLEMMA, PPOS, and PFEAT columns are generated with SVMTagger trained on the Penn Treebank (WSJ) and with the WordNet lemmatizer. The accuracy of the PLEMMA and PPOS columns is expected to be above 96%.

* PHEAD, PDEPREL, PPRED and PAPREDs columns for all languages are generated by JointParser, a system trained for the CoNLL-2008 and 2009 shared tasks.

For more details on the corpora and the annotation, see the README file included in the distribution (Downloads).  



 

DETAILS ON THE COREFERENCE ANNOTATION

 

The coreference annotation is given in the last column in a numerical-bracketed format. Every entity has an ID number, and every mention is marked with the ID of the entity it refers to. An opening parenthesis (before the entity ID) marks the beginning of the mention (first token), and a closing parenthesis (after the entity ID) marks its end (last token). The following examples are taken from this Catalan passage (AnCora-CA):

[La remodelada plaça del [Mercat]_2]_1 es va inaugurar ahir amb actes d'homenatge a [Josep_Roura_i_Estrada]_3 (1787-1860), conegut per la introducció de l'enllumenat públic de gas a Espanya. A la casa natal de [Roura]_3, a [la plaça]_1, s'[hi]_1 va instal·lar un fanal antic de gas.

Using the open-close notation from the task datasets:

la              [...]     (1
plaça           [...]     1)

Mentions with one single token show the entity ID within parentheses:

Roura           [...]     (3)

When a token belongs to more than one mention, the respective entity IDs are separated with a pipe symbol "|". For instance:

La              [...]     (1
remodelada      [...]
plaça           [...]
del             [...]
Mercat          [...]     (2)|1)
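The open-close notation can be decoded into mention spans with a small stack-based routine. This is a minimal sketch under stated assumptions: the function name `decode_mentions` is illustrative, token positions are 0-based indices into the list passed in, and cells without annotation are assumed to be either empty or "_" (the actual empty-cell marker is not specified here).

```python
import re
from typing import Dict, List, Tuple

# One annotation: optional "(", the numeric entity ID, optional ")".
_PART = re.compile(r"^(\()?(\d+)(\))?$")


def decode_mentions(coref_column: List[str]) -> List[Tuple[int, int, int]]:
    """Return (entity_id, first_token, last_token) triples for one COREF column."""
    open_starts: Dict[int, List[int]] = {}  # entity ID -> stack of start positions
    mentions: List[Tuple[int, int, int]] = []
    for i, cell in enumerate(coref_column):
        cell = cell.strip()
        if cell in ("", "_"):  # assumed markers for "no annotation"
            continue
        for part in cell.split("|"):
            m = _PART.match(part)
            if m is None or not (m.group(1) or m.group(3)):
                raise ValueError(f"malformed annotation {part!r} at token {i}")
            entity = int(m.group(2))
            if m.group(1) and m.group(3):
                mentions.append((entity, i, i))  # single-token mention, e.g. (3)
            elif m.group(1):
                open_starts.setdefault(entity, []).append(i)  # mention opens, e.g. (1
            else:
                stack = open_starts.get(entity)
                if not stack:
                    raise ValueError(f"unmatched close for entity {entity} at token {i}")
                mentions.append((entity, stack.pop(), i))  # mention closes, e.g. 1)
    if any(open_starts.values()):
        raise ValueError("some mentions are opened but never closed")
    return sorted(mentions, key=lambda t: (t[1], t[2]))


# The five tokens "La remodelada plaça del Mercat" from the example above:
print(decode_mentions(["(1", "_", "_", "_", "(2)|1)"]))
# [(1, 0, 4), (2, 4, 4)]
```

Keeping one stack per entity ID lets nested mentions of the same entity close in the right order, with the innermost mention closing first.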

 

Since the two mentions "la plaça" and "hi" corefer with "La remodelada plaça del Mercat", the last column shows the same entity ID for both of them.

la              [...]     (1
plaça           [...]     1)
[...]
hi              [...]     (1)
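Since systems have to produce the COREF column themselves, the reverse direction is also useful: turning predicted entities into open-close cells. The sketch below is one possible encoder, not the official one. The name `encode_coref`, the 0-based inclusive spans, the "_" filler for unannotated tokens, and the ordering of annotations within a cell (openings before closings, which reproduces "(2)|1)" from the example above) are all assumptions.

```python
from typing import Dict, List, Tuple


def encode_coref(
    n_tokens: int,
    entities: Dict[int, List[Tuple[int, int]]],
    empty: str = "_",
) -> List[str]:
    """Build the COREF column from entity ID -> list of (first, last) token spans."""
    # Each cell collects (is_close_only, text) pairs.
    cells: List[List[Tuple[bool, str]]] = [[] for _ in range(n_tokens)]
    for entity_id, spans in entities.items():
        for start, end in spans:
            if not 0 <= start <= end < n_tokens:
                raise ValueError(f"span ({start}, {end}) is out of range")
            if start == end:
                cells[start].append((False, f"({entity_id})"))  # single token
            else:
                cells[start].append((False, f"({entity_id}"))   # first token
                cells[end].append((True, f"{entity_id})"))      # last token
    column = []
    for parts in cells:
        # Stable sort: annotations that open a mention come before pure closings.
        ordered = sorted(parts, key=lambda p: p[0])
        column.append("|".join(text for _, text in ordered) if ordered else empty)
    return column


# "la plaça" (tokens 0-1) and "hi" (token 3) both refer to entity 1:
print(encode_coref(4, {1: [(0, 1), (3, 3)]}))
# ['(1', '1)', '_', '(1)']
```

All mentions of one entity share the same ID, so a coreference chain is simply the list of spans stored under that ID.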



 

For any queries, comments or feedback regarding the data sets, please post in the forum.