C-WEP
C-WEP is the Collection of Writing Errors by Professional Writers of German. It currently consists of 245 sentences with grammatical errors. All sentences are taken from published texts. All authors are professional writers who are highly proficient in German and well versed in the genres and topics they write about.
The purpose of this collection is to provide seeds for more sophisticated writing support tools, as only a very small proportion of these errors can be detected by state-of-the-art checkers.
Annotation
C-WEP is annotated on various levels and is freely available. C-WEP comes with one or more target hypotheses for each sentence and can thus also be used for the development and testing of grammar checkers.
Current annotation includes
- part-of-speech tagging (with Mbt trained on TüBa-D/Z)
- detailed structured morphological analyses (with Stripey Zebra, a German LAG Malaga morphology)
- dependency trees (with MATE)
- LFG analyses (with the IMS German ParGram Grammar).
Data Structure and License
The ZIP file contains the annotated collection as one XML file.
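Because the element names of the XML file are not documented here, a first step could be to inspect which tags occur and how often. The following is a minimal sketch under that assumption: it reads the first `.xml` member of a ZIP archive and counts element tags with a streaming parser, so the whole file never has to be loaded at once. The function name `count_tags` and the archive path are hypothetical and not part of C-WEP.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(zip_source):
    """Count element tags in the first XML member of a ZIP archive.

    `zip_source` may be a file path or a binary file-like object.
    The XML is parsed incrementally instead of being loaded as one tree.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        xml_names = [n for n in archive.namelist() if n.lower().endswith(".xml")]
        if not xml_names:
            raise ValueError("no XML file found in archive")
        with archive.open(xml_names[0]) as stream:
            # An "end" event fires once an element is completely read.
            for _event, elem in ET.iterparse(stream, events=("end",)):
                counts[elem.tag] += 1
                # Drop the element's children and text to keep memory low.
                elem.clear()
    return counts
```

A call such as `count_tags("c-wep.zip")` (the file name is an assumption) returns a `Counter` that maps each tag name to its frequency, which gives a quick overview of the annotation layers before writing schema-specific code.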
Collection of Writing Errors by Professional Writers of German (C-WEP) by Cerstin Mahlow is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Download C-WEP Version 0.9
Cite C-WEP
A description of C-WEP was published at LREC 2016 as "C-WEP―Rich Annotated Collection of Writing Errors by Professionals" (download BibTeX).
Acknowledgements
Wolfgang Seeker and Özlem Çetinoğlu from IMS Stuttgart and Michael Piotrowski from IEG Mainz helped with automatic annotation.
Known Issues for C-WEP 0.9
- Tokenization is incorrect. For example, quotation marks and parentheses are not separated from word forms and all periods are treated as sentence ends. However, for consistency, the same tokenization is used throughout.
- Some hypotheses themselves contain errors.
- The category tag contains several non-structured items.
- Comments are currently removed because they are inconsistent and incomplete.
- All data currently comes in a single XML file, which is probably too large for further processing and display.
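Users who need quotation marks and parentheses as separate tokens could post-process the word forms themselves. The following is a minimal sketch, not the tokenization used in C-WEP: the character set `DETACH` and the function names are illustrative assumptions. Other punctuation such as commas and colons is left attached, and the sketch does not address the sentence-end problem with periods. Note that re-tokenized text would no longer align with the annotation layers shipped in the collection.

```python
import re

# Quotation marks and brackets to detach from word forms (illustrative set;
# the typographic apostrophe is left out so forms like "geht’s" stay intact).
DETACH = "\"„“”‚‘»«›‹()[]"
_PATTERN = re.compile("([" + re.escape(DETACH) + "])")


def split_punctuation(token):
    """Split quotation marks and brackets off one whitespace-delimited token."""
    # The capturing group keeps the delimiters; empty strings are dropped.
    return [part for part in _PATTERN.split(token) if part]


def retokenize(sentence):
    """Tokenize on whitespace, then detach quotation marks and brackets."""
    tokens = []
    for raw in sentence.split():
        tokens.extend(split_punctuation(raw))
    return tokens
```

For example, `retokenize("Sie schrieb „Hallo“ (leise)")` yields `['Sie', 'schrieb', '„', 'Hallo', '“', '(', 'leise', ')']`.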