This is one of the projects I completed last year. It is called stylometrics: a program that takes a text document containing a story by either Shakespeare or Melville and determines which of the two wrote it, based on the word usage in the text.
The code below is the portion of the program that measures word frequency: it scans the input text and finds its most commonly used words. The approach assumes that each author tends to use certain words more often than other authors do, so the most frequent words in a text can indicate who wrote it.
    def byFreq(pair):
        '''Used to return the word count for sorting purposes.'''
        return pair[1]

    def freqWords():
        '''Analyzes an input text file and prints the desired top number of
        most-used words with their counts, then the top 50 words with their ratios.'''
        counts = {}
        wordCount = 0
        wordList = []
        fname = input("File to analyze: ")
        with open(fname, 'r') as holder:
            text = holder.read()
        text = text.lower()
        # Replace punctuation with spaces; drop apostrophes so contractions stay one word.
        for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
            text = text.replace(ch, ' ')
        text = text.replace("'", "")
        words = text.split()
        for w in words:
            counts[w] = counts.get(w, 0) + 1
            wordCount += 1
        n = int(input("How many top words? "))
        items = list(counts.items())
        # Sort alphabetically first so words with equal counts appear in a stable order.
        items.sort()
        items.sort(key=byFreq, reverse=True)
        uniqueWords = "The number of unique words in " + str(fname) + " is " + str(len(counts)) + "."
        totalWords = "The total number of words in this text is " + str(wordCount) + "."
        print(totalWords)
        print(uniqueWords)
        # Never index past the end of the list when the text has few unique words.
        for i in range(min(n, len(items))):
            word, count = items[i]
            print("{0:<15}{1:>5}".format(word, count))
        for i in range(min(50, len(items))):
            word, count = items[i]
            count = count / float(wordCount)
            wordList += [[word, count]]
            print("{0:<15}{1:>5}".format(word, count))
        print(wordList)

Here is the download link for the full code.
Here is a sample text file you can use to test the code: Romeo and Juliet by William Shakespeare.
Here is a second sample text file you can use to test the code: Bartleby, The Scrivener by Herman Melville.
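The excerpt above stops at the word ratios and does not show how they are used to pick an author. As a rough illustration of the idea only, the sketch below compares a sample's ratios against one reference text per author and chooses the closest match. The function names (`word_ratios`, `profile_distance`, `guess_author`) and the distance measure (sum of absolute differences in word share) are assumptions made for this example, not the method used in the full project, and the reference strings are toy data rather than real excerpts.

```python
from collections import Counter
import string


def word_ratios(text, top=50):
    """Return {word: share of all words} for the `top` most frequent words."""
    # Same cleanup as the excerpt: lowercase, drop apostrophes,
    # and turn the remaining punctuation into spaces.
    text = text.lower().replace("'", "")
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    words = text.split()
    if not words:
        return {}
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.most_common(top)}


def profile_distance(sample, reference):
    """Sum of absolute differences in word share (a hypothetical metric)."""
    vocabulary = set(sample) | set(reference)
    return sum(abs(sample.get(w, 0.0) - reference.get(w, 0.0)) for w in vocabulary)


def guess_author(sample_text, references):
    """Return the author whose reference text has the closest word profile.

    `references` maps an author name to a text known to be by that author.
    """
    sample = word_ratios(sample_text)
    distances = {
        author: profile_distance(sample, word_ratios(ref_text))
        for author, ref_text in references.items()
    }
    return min(distances, key=distances.get)


# Toy reference strings, made up for illustration only.
references = {
    "Shakespeare": "thou art thy friend and thou art mine",
    "Melville": "i would prefer not to i would prefer not",
}
print(guess_author("thou art welcome", references))  # prints: Shakespeare
```

In practice the reference profiles would be built from full texts such as the two sample files above, since a handful of words is far too little to characterize an author.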