
Corpora and Corpus Annotation Tools on the WWW

This list of resources was collected by Markus Dickinson and Detmar Meurers (OSU), February 2002. Funding for this project was provided by an OSU College of Humanities Seed Grant.

Internal Documentation and Installed Corpora

You can find reference documentation for tools installed at OSU here.

You can find a list of our installed corpora here.


TOKENIZATION / SEGMENTATION TOOLS

  • LT TTT (Text Tokenisation Tool), a text tokenization system from the Language Technology Group

  • Segmenter segments texts into topical chunks

  • SATZ, an adaptive sentence boundary detector

  • MXTERMINATOR by Adwait Ratnaparkhi

  • CPAN has Text::Sentence (Ave), a module for splitting text into sentences.

  • Scott Piao's multilingual concordancer has a sentence splitter (I think).

  • The Illinois Cognitive Computation Group has a sentence splitter

  • Zhiping Zheng's QA system contains an online sentence segmenter

  • Lingua-EN-Sentence-0.25 (Shlomo) splits sentences based on regular expressions and lists of abbreviations.

  • Guenther(?), a sentence segmenter which is to appear, I think (site in German)

  • Jorg Schuster has a Test Sentencizer site which allows comparison of mxterminator, ave, and shlomo.

  • Oliver Mason has a tokenizer called QTOKEN
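
Several of the splitters above are described as working from regular expressions plus a list of abbreviations. The following is only a minimal sketch of that general idea, not the code of any tool listed here. The abbreviation list, the boundary pattern, and the function name `split_sentences` are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list; a real splitter would use a much longer one.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc", "e.g", "i.e", "vs"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and then
# an uppercase letter, a quote, or an opening parenthesis.
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-Z\"'(])")


def split_sentences(text):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Look at the last word before the punctuation mark.
        words = text[start:match.start(1)].split()
        last_word = words[-1].lstrip("(\"'") if words else ""
        # A period after a listed abbreviation is not treated as a boundary.
        if match.group(1) == "." and last_word.lower() in ABBREVIATIONS:
            continue
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith arrived late. He met Prof. Jones at noon! Was it useful? Yes."
    for sentence in split_sentences(sample):
        print(sentence)
```

A sketch like this will still mis-split on abbreviations missing from the list and on sentences that begin with a lowercase letter or a digit, which is one reason adaptive or trained detectors such as SATZ and MXTERMINATOR exist.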


TAGGERS

  • A demo from Xerox Research Centre Europe (XRCE)

  • WinBrill from Analyse et Traitement Informatique de la Langue Française (ATILF)

  • ACOPOST, a collection of POS taggers, including a maximum entropy tagger, a trigram tagger, an error-driven TBL tagger, and an example-based tagger.

  • Decision Tree Tagger, developed by Helmut Schmid

  • An online interface for TreeTagger can be found here.

  • CLAWS POS Tagger (costs). A trial version is available here.

  • AUTomatic Analysis SYStem (AUTASYS), using the LOB & ICE tagsets

  • XEROX tagger, available via FTP

  • TnT Tagger by Thorsten Brants. TnT = "Trigrams 'n Tags"

  • LT POS, a part-of-speech tagger from the Language Technology Group

  • Brill Tagger, a transformation-based POS tagger. Site also includes supervised & unsupervised POS taggers & a PP-attachment program. The FTP location is found here.

    Various demos, including one for the Brill Tagger, can be found at the Centre for Language Engineering Demonstrations

  • An online tagger for German can be found at the University of Zurich

  • Maximum Entropy POS Tagger (MXPOST) developed by Adwait Ratnaparkhi. Site also has MXTERMINATOR, a sentence boundary detector

  • QTAG, a probabilistic tagger roughly based on HMMs.

  • MuTBL, a transformation-based learning system which can train Brill taggers

  • fnTBL is a machine learning toolkit for NLP tasks.

  • MTP (Münster Tagging Project), featuring Xlex, a suite of tools including a tokenizer, segmenter, tagger, index tool, & collocation tool. An online demo of Xlex can be found here.

  • AMALGAM, Automatic Mapping Among Lexico-Grammatical Annotation, maps tagsets and phrase structure grammar schemes (includes a bibliography on lexico-grammatical annotation models).

  • In addition to a shallow parser and a sentence splitter, the Cognitive Computation Group at Illinois has a SNoW-based Tagger. SNoW papers available here

  • VISL has a free upload interface for automatic tagging/parsing of several languages at its website.


MORPHOLOGICAL ANALYZERS

  • Hermit Crab, self-described as a "morphological parser and generator for classical generative phonology and morphology"

  • POSTTAG, a tagger & morphological analyzer for use with Korean texts. POSTPAR is the syntactic analyzer.

  • Morphy, a morphological tool for German with some statistical POS tagging (site is in German)

  • Morphix, Günter Neumann's morphological component for inflectional languages

  • GERTWOL, a system for automatic recognition of German word forms, using two-level morphology

  • Word Manager is "a system for the acquisition and management of reusable morphological and phrasal dictionaries"

  • DeKo (Derivations- und Kompositionsmorphologie) analyzes complex words of the German language

  • John Carroll has some tools for morphological analysis (morpha), generation (morphg), and a/an insertion (ana).

  • PC-KIMMO is a two-level processor for morphological analysis, available from sil.org. Also available from sil is AMPLE, which breaks words into morphemes.

  • ALE-RA, an ALE extension with Realizational morphology and Automata Phonology

  • Project Deutscher Wortschatz at the University of Leipzig (site in German)

  • Deutsche Malaga-Morphologie (DMM) is a system for the automatic wordform recognition of German.

  • CISLEX from the University of Munich (site in German)

  • For Russian: RUSLO, a system for Russian derivational analysis and synthesis (not downloadable)

  • For Turkish: Turkish Morphological Analyzer is an online analyzer which treats both word formation and inflection; developed by Kemal Oflazer

  • Krzysztof Szafran's freeware Windows and Linux versions of a morphological analyser for Polish

  • ChaSen is a morphological analyzer for Japanese


PARSERS/CHUNKERS


TEXT ANALYSIS


VARIOUS TOOLS (ANNOTATE, SEARCH, TRANSCRIBE)

  • Corpuseye offers different searching techniques on different types of corpora and different languages.

  • NEGRA, an annotation tool

  • Test Suites for Natural Language Processing (TSNLP), an annotation scheme for use on test suites in German, French, & English

  • VERBMOBIL, some general annotation tools

  • TIGERSearch, a specialized search engine for syntactically annotated corpora

    The trees in TIGERSearch are drawn as SVG (Scalable Vector Graphics), which is rendered using Batik.

  • Transcriber, a tool for segmenting, labeling and transcribing speech from the Linguistic Data Consortium (LDC)

  • INTEX has multiple uses, including parsing & tagging

  • Xlex has a variety of tools

  • Alembic Workbench includes customizable tagsets & evaluation tools to analyze annotated data

  • The Callisto annotation tool supports "linguistic annotation of textual sources for any Unicode-supported language."

  • WordFreak is an annotation tool for manual and automatic annotation, as well as human correction.

  • ACE (Automatic Content Extraction) annotation tools support multiple annotation layers.

  • MMAX Annotation Tool (Multi-Modal Annotation in XML) supports stand-off annotation, among other things.

  • NXT (NITE XML) supports linguistic annotation for highly structured or cross-annotated data.

  • PALinkA (Perspicuous and Adjustable Links Annotator) has been used to annotate texts for anaphora resolution, centering, summarization, and so on.

  • Corpus Workbench (CWB) is used for searching corpora and extracting data for data-driven approaches. It uses the Corpus Query Processor (CQP).

  • SMES, Günter Neumann's information extraction system (with chunker & morphological analyzer)

  • Connexor has various annotation tools and some online demos of annotating sentences in various languages

  • As part of the BulTreeBank, the CLaRK system is an XML-based software system for corpora development.

  • AGTK (Annotation Graph ToolKit)

  • TGrep, for searching through the Penn Treebank, is downloadable here. Information on using tgrep is available here.

  • GSearch, a search tool which uses syntactic criteria, even if the corpus is not syntactically marked up.

  • LingPipe does named entity recognition, as well as other processing

  • GATE (General Architecture for Text Engineering) offers a lot of text processing tools

  • The TALP research center has various analyzers for Spanish and has recently released FreeLing, an open-source C++ library providing language analysis services


XML TOOLS


CORPORA

SYNTACTICALLY-ANNOTATED CORPORA

ONLINE CORPORA


META-SEARCHES AND OTHER ONLINE RESOURCES

  • Michael Barlow has a very nice page here, devoted to many facets of corpus linguistics

  • David Lee has a very extensive site devoted to corpora and corpus resources.

  • SFB441 has a listing of software for corpus linguistic research

  • Annotation: a site by Steven Bird which lists all sorts of tools for linguistic annotation. Many of them are speech-based.

  • Penn Tools is a listing of corpora and tools available at UPenn

  • Resources for Text, Speech and Language Processing

  • TIGER lists several useful links for Treebank projects

  • Frequency lists of words found in the BNC can be found here

  • ICAME has a bibliography online, as well as in searchable form.

  • EAGLES (Expert Advisory Group on Language Engineering Standards) provides recommendations on corpus typology.

  • W3C Corpus Linguistics Page at the University of Essex


Note

Our system (found under /home/corpora) is organized according to the 2-letter language codes (ISO 639) found at The XML Cover Pages.


Questions or comments? Contact Markus Dickinson.


Last modified: June 16, 2005