CSIT Programming Project 4    Sorting to find anagrams

 

See Project 4 on pg. 821 for the basic ideas of this project. We will find the longest anagrams in the words.txt provided in the Chapter 13 files on the author’s website.

 

1.      First we need to compute the canonical form of each word, so compose a static method

 

public static String canonicalForm(String word)

 

so that canonicalForm(“computer”) returns cemoprtu”, and canonicalForm(“poor”) returns oopr”.  For this first step, put this method in a little program CanonWord.java that inputs one word and outputs its canonical form:

 

java CanonWord

Enter word: computer

cemoprtu

 

To implement canonicalForm, program a loop that unloads the individual characters in the argued string and puts them in an array of char or ArrayList<Character>. Thus canonicalForm(“poor”) gets an array of 4 chars ‘p’, ‘o’, ‘o’, ‘r’ or an ArrayList<Character> with the corresponding Characters. Then sort that array or ArrayList<Character>. Finally, use the sorted characters to build a new String by a loop of s = s + c, where s is a String, and c is a char or Character. 

 

The program CanonWord needs a main method as well as the canonicalForm method.  All it does is use a Scanner to get the word from the user and call canonicalWord to convert it, and print out the resulting canonical form.

 

2.       Now write a program FindAnagrams that (in its main method) reads words.txt, and for each word there, adds it into an ArrayList<String> words and its canonical form into an ArrayList<String> codes. It obtains the canonical form by calling the same canonicalForm method you developed for CanonWord. Sort the codes ArrayList.  Then hunt for duplicates in the sorted ArrayList codes. Any duplicates there will indicate anagrams in the original words. Keep track of the longest duplicate code you have found so far as you scan codes. Finally using this longest code, scan the words list for words that have this canonical form and output them.  Don’t worry about ties at the longest length, just report on first anagram group with the longest length.

 

Test case:

words.txt       codes       sorted codes

rat             art         abt

hears           aehrs       abtt

share           aehrs       aehrs

tar             art         aehrs

bat             abt         aehrs

batt            abtt        art

shear           aehrs       art

 

The triplet of codes “aehrs” means that the corresponding words (hears, share, and shear) are anagrams.  Similarly the duplicate of codes “art” indicates that rat and tar are anagrams.  We are interested in the longest anagram, so the program should output as follows:

 

java FindAnagrams

hears

share

shear

 

We see there are two anagram groups here, one with code aehrs and the other with code art, so the longest is aehrs, corresponding to words hears, share, and shear.

 

For an example of comparing words next to each other in a sorted list, see line 34 of Vocabulary1.java, pg. 664.

Note that we are assuming that all the words in words.txt are in the same case, uppercase or lowercase. The textbook’s words.txt is all in lowercase.

 

 

3.      Write a narrative that describes how you went about designing and implementing these modifications, how you tested your program. Did you use the test case during development or did you use the final words.txt from the start, or what? Did you find it helpful that the canonicalForm computation was separately developed? Report on what you found to be the longest anagrams in the author’s words.txt from the Chapter 13 area. Put your discussion in a text file called memo.txt.

 

Delivery: in your it115/p4 directory:

 

CanonWord.java (with main and canonicalWord methods)

FindAnagrams.java (with main and same the canonicalWord method as in CanonWord)

memo.txt