Stanford Encyclopedia of Philosophy
HTML Cleaners
Emsa HTML Tag Remover – A very easy-to-use tag cleaner program that can be run from a GUI or the command line. Activation code: 1760559.
Tokenizers
Boost Tokenizer Package – Part of the Boost C++ Libraries. It provides classes that break a string into a series of tokens.
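For readers new to tokenization, the idea of breaking up a string can be sketched in a few lines of Python. This is only an illustration using the standard library, not the Boost API; the function name `tokenize` and the default delimiter set are assumptions made for the example.

```python
import re


def tokenize(text, delimiters=" \t\n.,;:!?\"()"):
    """Split text on runs of delimiter characters and drop empty tokens.

    Hypothetical helper for illustration; the delimiter set is an assumption.
    """
    # Build a character class from the delimiters, escaping regex specials.
    pattern = "[" + re.escape(delimiters) + "]+"
    return [token for token in re.split(pattern, text) if token]


if __name__ == "__main__":
    print(tokenize("The TreeTagger is a tool, for annotating text."))
    # ['The', 'TreeTagger', 'is', 'a', 'tool', 'for', 'annotating', 'text']
```

A real tokenizer usually needs more care than this (contractions, hyphenated words, abbreviations), which is one reason to reach for an established package.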
Part of Speech Taggers
TreeTagger – “The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart.” It can also be used for noun, verb, adverb, adjective and prepositional phrase chunking. Linux and Win32 binaries are available. Usable from the command line.
Stanford Tagger – A log-linear part-of-speech tagger developed and maintained by Stanford. Usable from the command line; requires Java to run.
Parsers
Minipar – An efficient English parser that processes about 300 words per second.
Stanford Parser – A statistical parser developed and maintained by Stanford. Requires Java and runs from the command line.
Other Tools
WordNet – English lexicon developed and maintained by Princeton. Contains the meanings and relations of most nouns, verbs, adverbs and adjectives.
Smart Stop Words – A list of words often discarded for efficiency in search engines.
VirtualBox – Open source program for creating virtual machines of operating systems.
Ubuntu – Free Linux-based operating system.
Kevin’s Word List – Various word lists and links to other collections of word lists.
Computational Linguistics at Ohio State
Artificial Intelligence Lab
S-3-015, Science Building
University of Massachusetts Boston | Computer Science Department