Linguistic and NLP Resources

We are currently working on writing support for German, so we use linguistic resources and computational linguistics methods for German. These include:

  • Morphological components
  • Taggers
  • Syntactical components
  • Linguistic resources

NLP Resources

Criteria for NLP Resources

To be usable in an interactive setting and suitable for integration into an editor (i.e., a real-world application), resources have to meet several requirements:

  • Fast execution
  • Good coverage
  • Reliable results, i.e., the resource should deliver the results it claims to deliver and the quality of these results should be as high as possible
  • Results should be delivered in a format suitable for further processing
  • The resource itself and the tools it needs should be freely available
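The first two criteria can be checked empirically. The following is a minimal sketch, assuming a resource can be wrapped as a function that returns a (possibly empty) list of analyses for a token. The names `evaluate_resource`, `toy_analyze`, and `TOY_LEXICON` are hypothetical and only stand in for a real component.

```python
import time
from typing import Callable, Dict, Iterable, List


def evaluate_resource(
    analyze: Callable[[str], List[str]], tokens: Iterable[str]
) -> Dict[str, float]:
    """Measure coverage and average time per token for an analyzer."""
    token_list = list(tokens)
    if not token_list:
        raise ValueError("need at least one token")
    covered = 0
    start = time.perf_counter()
    for token in token_list:
        if analyze(token):  # a non-empty result counts as covered
            covered += 1
    elapsed = time.perf_counter() - start
    return {
        "coverage": covered / len(token_list),
        "seconds_per_token": elapsed / len(token_list),
    }


# Hypothetical stand-in for a real morphological component.
TOY_LEXICON = {"entwickelt": ["verb"], "und": ["conjunction"]}


def toy_analyze(token: str) -> List[str]:
    return TOY_LEXICON.get(token.lower(), [])


report = evaluate_resource(toy_analyze, ["Entwickelt", "und", "Xyzzy", "und"])
```

For an interactive editor, the time per token matters more than total throughput, since analyses are requested while the author is typing.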

Morphological Components

  • Stripey Zebra, the German Malaga Morphology developed by Markus Schulze and Oliver Lorenz, based on the grammar development environment Malaga (developed by Björn Beutel) and the Left-Associative Grammar formalism (see Roland Hausser: Foundations of Computational Linguistics: Human-Computer Communication in Natural Language, Springer, 2001)
    As an example, see the analysis of “entwickelt” delivered by Stripey Zebra. This word form of 'to develop' is ambiguous: it can be a participle, the 3rd person singular present indicative, or the 2nd person plural present indicative.

    [Figure: analysis of entwickelt]
  • GERTWOL by Lingsoft Oy, based on the Two-Level Morphology approach developed by Kimmo Koskenniemi (see Kimmo Koskenniemi: Two-level morphology: a general computational model for word-form recognition and production. Publication No. 11. Helsinki: University of Helsinki Department of General Linguistics, 1983, and Kimmo Koskenniemi and Mariikka Haapalainen: GERTWOL - Lingsoft Oy, in Linguistische Verifikation: Dokumentation zur Ersten Morpholympics 1994, pp. 121-140, Max Niemeyer, Tübingen, 1996).
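To illustrate why an ambiguous form such as “entwickelt” needs more than one analysis, here is a minimal sketch of how the three readings named above could be represented for further processing. The `Reading` class and its field names are assumptions made for illustration; they do not reproduce the output format of Stripey Zebra or GERTWOL.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Reading:
    """One morphological reading of a word form (hypothetical schema)."""
    lemma: str
    form: str  # "participle" or "finite"
    person: Optional[int] = None
    number: Optional[str] = None
    mood: Optional[str] = None
    tense: Optional[str] = None


# The three readings of "entwickelt" listed in the text.
ENTWICKELT: List[Reading] = [
    Reading("entwickeln", "participle"),
    Reading("entwickeln", "finite", 3, "singular", "indicative", "present"),
    Reading("entwickeln", "finite", 2, "plural", "indicative", "present"),
]


def finite_readings(readings: List[Reading]) -> List[Reading]:
    """Keep only the readings in which the form is a finite verb."""
    return [r for r in readings if r.form == "finite"]
```

A downstream function, for example an agreement check, could call `finite_readings(ENTWICKELT)` and compare person and number with the subject of the clause.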

Unfortunately, both systems have to be licensed, so we will not be able to freely distribute the functions that make use of morphological components. However, none of the freely available morphological components meets the other requirements to the extent necessary, which makes them unsuitable for use in interactive real-world applications.

Tagging

  • MBT: Memory-based tagger generation and tagging, developed at the Induction of Linguistic Knowledge Research Group at Tilburg University, based on TiMBL (Tilburg Memory-Based Learner) (see http://ilk.uvt.nl/mbt/)

Linguistic Resources

C-WEP

During our work, we collect examples for the Collection of Writing Errors by Professional Writers of German (C-WEP). It contains sentences with grammatical errors, all taken from published texts. All authors are professional writers who are highly skilled with respect to German, the genres, and the topics. For each example, we provide various annotation layers and at least one valid target hypothesis (again with annotations).

C-WEP is freely available here.
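The structure described above (an erroneous sentence, annotation layers, and one or more annotated target hypotheses) can be sketched as a small data model. This is an illustration under assumptions, not the actual C-WEP format: the class names, the `error` layer, and the example sentence are invented here and are not taken from the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotatedSentence:
    tokens: List[str]
    # Each layer maps a layer name to one label per token.
    layers: Dict[str, List[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for name, labels in self.layers.items():
            if len(labels) != len(self.tokens):
                raise ValueError(f"layer {name!r} does not align with tokens")


@dataclass
class ErrorExample:
    original: AnnotatedSentence
    target_hypotheses: List[AnnotatedSentence]

    def __post_init__(self) -> None:
        if not self.target_hypotheses:
            raise ValueError("at least one target hypothesis is required")


# Invented example with a subject-verb agreement error ("wurde" vs. "wurden").
example = ErrorExample(
    original=AnnotatedSentence(
        tokens=["Die", "Ergebnisse", "der", "Studie", "wurde", "veröffentlicht", "."],
        layers={"error": ["-", "-", "-", "-", "agreement", "-", "-"]},
    ),
    target_hypotheses=[
        AnnotatedSentence(
            tokens=["Die", "Ergebnisse", "der", "Studie", "wurden", "veröffentlicht", "."],
            layers={"error": ["-"] * 7},
        )
    ],
)
```

Validating in `__post_init__` enforces the two properties the description implies: every layer aligns with the tokens, and every example has at least one target hypothesis.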

Word Lists

  • Conjunctions
  • Prepositions
  • Determiners
  • Relative pronouns
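Closed-class word lists like these can serve as a fast lookup before, or instead of, a full morphological analysis. The sketch below assumes each list is available as a set of lowercase word forms. The few entries shown are common German function words chosen for illustration and are not the contents of the lists above; `WORD_LISTS` and `categories_of` are hypothetical names.

```python
from typing import Dict, FrozenSet, Set

# Illustrative excerpts only; the real lists are far longer.
WORD_LISTS: Dict[str, FrozenSet[str]] = {
    "conjunction": frozenset({"und", "oder", "aber", "weil", "dass"}),
    "preposition": frozenset({"in", "auf", "mit", "von", "für"}),
    "determiner": frozenset({"der", "die", "das", "ein", "eine"}),
    "relative_pronoun": frozenset({"der", "die", "das", "welcher"}),
}


def categories_of(word: str) -> Set[str]:
    """Return every word-list category that contains the given word."""
    lowered = word.lower()
    return {name for name, words in WORD_LISTS.items() if lowered in words}
```

Because a form such as "der" appears in more than one list, the lookup returns a set of categories, and the choice among them is left to a tagger or a syntactic component.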