This is the third and final installment of a script to read a text file and count the number of occurrences of each word. We left off last time with capitalized words being counted separately from their lower case versions. Some words also included punctuation and so were counted separately from their unpunctuated versions. Both problems are solved in this part. Finally, the dictionary of words is converted into a sorted list, and that list is returned instead of the dictionary.

I will use the translate function in the string module both to convert to lower case and to delete punctuation. First, a translation table named toLower is constructed using the maketrans function. maketrans takes two arguments: the characters to translate, and the characters to translate them to. The string module already defines the constants needed here. The constant string.uppercase is a string of the upper case letters (ABC ...XYZ) and string.lowercase is the lower case version of the same, so the toLower table translates from upper to lower case.

Note how translate is called on each line using the dot operator. The first argument to translate is the toLower translation table, and the second argument is an optional string of characters to be deleted. Another constant from the string module, string.punctuation, holds all the punctuation characters and is passed as that second argument. After the call to translate, the line is lower case and the punctuation is gone.

A sorted list of the results is constructed at the end of the function. First, a temp list is defined. The items method (or function) of the dictionary returns a list of (key, value) pairs in no particular order. The for loop reverses the key and value as it transfers each pair to temp, so that the list will sort by count, which is now first. The temp list is sorted in descending order (reverse=True) and then copied to the final result, with each pair swapped back to (word, count) order.
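To make the swap-and-sort step concrete, here is a small sketch using a made-up dictionary. The name sample and its counts are invented for illustration and are not taken from words.txt. It should run unchanged under Python 2 or Python 3.

```python
# hypothetical word counts, invented for illustration
sample = {'the': 3, 'cat': 1, 'dog': 2}

# swap each (word, count) pair to (count, word) so the sort compares counts first
temp = []
for k, v in sample.items():
    temp.append((v, k))

# sort in descending order, largest count first
temp.sort(reverse=True)

# swap back to (word, count) order for the final result
result = []
for v, k in temp:
    result.append((k, v))

print(result)  # [('the', 3), ('dog', 2), ('cat', 1)]
```

One side effect of sorting (count, word) tuples this way: when two words have the same count, the comparison falls through to the word itself, so tied words come out in reverse alphabetical order.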
There are better ways to convert a dictionary to a sorted list, but they involve topics not yet covered. This concludes the series on how to open a file and count the number of occurrences of each word.
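For the curious, one commonly used alternative is the built-in sorted function with a key argument, which avoids the two swapping loops. This is only a sketch with invented data, and it is not the approach used in the script in this post.

```python
# hypothetical word counts, invented for illustration
sample = {'the': 3, 'cat': 1, 'dog': 2}

# sort the (word, count) pairs by the count (item 1 of each pair), largest first
by_count = sorted(sample.items(), key=lambda pair: pair[1], reverse=True)

print(by_count)  # [('the', 3), ('dog', 2), ('cat', 1)]
```

The key function tells sorted which part of each pair to compare, so the pairs can stay in (word, count) order throughout.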
# wordCount2.py  count the words in a file
"""
This script shows how to:
- open a text file
- read file by line
- translate line to lower case and delete punctuation
- split line into lower case words
- count words
- convert dictionary of word counts to ordered list
- return ordered list of words and counts
"""
import string

# make an uppercase to lowercase translation table
toLower = string.maketrans(string.uppercase, string.lowercase)

# create an empty dictionary for words and their counts
wordMap = {}

def counter(file):
    # Open file and count words/frequency in file
    # open file for reading
    inFile = open(file, 'r')
    # read file by lines
    for line in inFile:
        #print 'line', line.strip()
        # remove unwanted characters
        line = line.translate(toLower, string.punctuation)
        #print 'trans line', line
        # split lines into words
        words = line.split()
        # print 'words', words
        # put word into dictionary incrementing count
        for word in words:
            #print 'word', word
            # update word's count or initialize if first time
            wordMap[word] = wordMap.get(word, 0) + 1
        # end for word loop
    # end for line loop
    # close open file
    inFile.close()
    # create temporary list of (count, word) pairs
    temp = []
    for k, v in wordMap.items():
        # the key (k) and value (v) are switched to sort by value
        temp.append((v, k))
    #print temp
    # sort the temp list
    temp.sort(reverse=True)
    #print temp
    # create final word list with key and value swapped
    wordCount = []
    for v, k in temp:
        wordCount.append((k, v))
    #print wordCount
    return wordCount
# end counter function

if __name__ == '__main__':
    file = 'words.txt'
    wordCount = counter(file)
    print 'list of words and counts:'
    for word, count in wordCount:
        print '%s %d' % (word, count)
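The listing above relies on the Python 2 string module. Under Python 3, string.maketrans, string.uppercase and string.lowercase are no longer available, so the translation step would need to change. The following is a sketch of how that one step might be ported, not part of the original script, and the sample line is invented.

```python
import string

# Python 3: str.maketrans accepts a third argument listing characters to delete,
# so the case mapping and the punctuation removal live in one table
toLower = str.maketrans(string.ascii_uppercase,
                        string.ascii_lowercase,
                        string.punctuation)

# hypothetical input line, invented for illustration
line = "The cat, the Dog; THE end."

# translate takes only the table in Python 3
clean = line.translate(toLower)
print(clean)  # the cat the dog the end
```

The print statements in the listing would also have to become print() calls to run under Python 3.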