CNEC 2.0 CoNLL format extension

Download

Download the corpus including our extensions.

Or visit the pages of the original corpus.

License

The corpus is available under the CC BY-NC-SA 3.0 license.

Citation

If you are using our extension, please cite:

Konkol, M., Konopík, M.: CRF-Based Czech Named Entity Recognizer and Consolidation of Czech NER Research. TSD 2013. LNCS, vol. 8082, pp. 153–160. Springer, Heidelberg, 2013. PDF | bibtex | SpringerLink

You can also cite the original corpus:

Ševčíková, M., Žabokrtský, Z., Krůza, O.: Named entities in Czech: Annotating data and developing NE tagger. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 188–195. Springer, Heidelberg (2007)

Changes

We have made the following changes to the original corpus.

  1. We have found inconsistent tokenization that made some entities impossible to find, e.g. <ne>5</ne>times, where the whole string is one token and the entity covers only part of it. Such entities are unreachable in most setups, because NER systems usually classify whole tokens. We solved this by splitting these tokens. The changes were made and checked manually.

    There are more of these inconsistencies in the 2.0 version, especially in the handling of ',' and '-', e.g. <ne>Praha</ne>-<ne>Smíchov</ne> (one token) and many occurrences of "</ne>,", where the ',' becomes part of the token.

  2. We have created an extended XML format with <sentence id=""> tags, which makes sentences easier to handle in XML parsers. Data in this format can be found in

    data/extended-xml/

  3. We have found duplicate sentences in the corpus. Some sentences appear in multiple datasets (training, validation, test). The mapping of sentences to datasets and the duplicate sentences from the original corpus are in the following files:

    data/extended-xml/sentence.duplicities
    data/extended-xml/sentence-dataset.mapping

    We have kept each sentence only once. If a sentence appeared in multiple datasets, our priorities were train > val > test. So if a sentence was in both the train and test sets in the original corpus, we removed it from the test set.

  4. We have created a CoNLL format of the corpus. This version can be found in

    data/conll/

    We have also included lemmas and POS tags, so it is easier to compare your NER system with Czech language-dependent systems. The lemmas and POS tags were acquired automatically using a standard HMM tagger trained on the PDT 2.0 dataset. More information about PDT (including the tag format) can be found at

    https://ufal.mff.cuni.cz/pdt2.0/

    The original corpus used 10 supertypes and 62 types. The corpus was created in multiple rounds, and the annotation scheme differed slightly between these rounds. This leads to inconsistent annotation for some classes. In the CoNLL format we use only the 7 supertypes that were annotated consistently throughout the whole corpus. More details are given in the publication listed in the Acknowledgements section. This remapping was altered for the 2.0 version.
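
The deduplication rule from change 3 (train > val > test) can be illustrated with a short sketch. This is not the script used to build the corpus: the function name, the dictionary layout, and the example sentences are all hypothetical, and sentences are compared by exact string equality, which is an assumption. The sketch also drops repeats inside a single dataset.

```python
# Illustrative sketch only; names and data layout are assumptions,
# not part of the corpus distribution.
PRIORITY = ("train", "val", "test")  # highest priority first


def deduplicate(datasets):
    """Keep each sentence once, in the highest-priority dataset that contains it."""
    seen = set()
    result = {}
    for name in PRIORITY:
        kept = []
        for sentence in datasets.get(name, []):
            # A sentence already seen in a higher-priority set
            # (or earlier in this one) is skipped.
            if sentence not in seen:
                seen.add(sentence)
                kept.append(sentence)
        result[name] = kept
    return result


# Made-up example data, one string per sentence.
example = {
    "train": ["Praha je hlavní město .", "Bydlím v Plzni ."],
    "val": ["Bydlím v Plzni .", "Jedu do Brna ."],
    "test": ["Praha je hlavní město .", "Jedu do Brna .", "Ostrava leží na severu ."],
}
cleaned = deduplicate(example)
for name in PRIORITY:
    print(name, cleaned[name])
```

In this example the sentence shared by train and test stays only in train, and the sentence shared by val and test stays only in val.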