[pythonvis] Word counting part 3 clean up words and sort the results

  • From: "Richard Dinger" <rrdinger@xxxxxxxxxx>
  • To: <pythonvis@xxxxxxxxxxxxx>
  • Date: Sat, 31 May 2014 08:21:04 -0700

This is the third and final installment of a script to read a text file and 
count the number of occurrences of each word.  We left off last time with 
capitalized words being counted separately from their lower case versions.  
Some words also had punctuation included and so counted differently from 
unpunctuated versions.  These two problems will be solved in this part.  
Finally, the dictionary of words will be converted into a sorted list and that 
list will be returned instead of the dictionary.

I will use the translate method of strings, together with the maketrans 
function in the string module, to both convert to lower case and delete 
punctuation.  First, a translation table named toLower is constructed using 
the maketrans function.  maketrans takes two arguments: first the characters 
to translate, and second the characters they translate to, matched position 
by position.  Note that the string module already defines constants for what 
we need here.  The constant string.uppercase is a string of the upper case 
letters (ABC...XYZ) and string.lowercase is the lower case version of the 
same, so the toLower table will translate from upper to lower case.
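
As a small sketch of what the table does, the lines below build toLower and 
apply it to a sample string.  The sample string and the maketrans fallback are 
assumptions made for illustration: string.maketrans is the Python 2 spelling 
used in this post, and the fallback to str.maketrans is only there so the 
snippet also runs under Python 3.  string.ascii_uppercase and 
string.ascii_lowercase are used because those names exist in both versions.

```python
import string

# Python 2 has string.maketrans; Python 3 moved it to str.maketrans.
# The fallback is an addition for illustration, not part of the script.
maketrans = getattr(string, 'maketrans', None) or str.maketrans

# build the upper case to lower case translation table
toLower = maketrans(string.ascii_uppercase, string.ascii_lowercase)

# hypothetical sample text
sample = 'The Quick Brown FOX'

# every upper case letter is replaced by its lower case partner
lowered = sample.translate(toLower)
print(lowered)
```

Characters that are not in the first argument, such as spaces and letters that 
are already lower case, pass through unchanged.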

Note how the translate method is called on each line using the dot operator.  
The first argument to translate is the toLower translation table and the 
second argument is an optional string of characters to be deleted.  The string 
module constant string.punctuation holds all the punctuation characters and is 
used for that second argument.  After the call to translate, the line is lower 
case and the punctuation is gone.
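
The effect of that call can be sketched on a single line of text.  The sample 
line is made up for illustration.  The two-argument form of translate is the 
Python 2 form used in this post; the Python 3 branch is an assumption added so 
the snippet runs there too, where the characters to delete are instead passed 
as a third argument to str.maketrans.

```python
import string
import sys

# hypothetical sample line
line = "Hello, World!  It's a well-known test."

if sys.version_info[0] >= 3:
    # Python 3: the characters to delete go into the table itself
    table = str.maketrans(string.ascii_uppercase,
                          string.ascii_lowercase,
                          string.punctuation)
    cleaned = line.translate(table)
else:
    # Python 2: same call shape as in the script
    table = string.maketrans(string.ascii_uppercase,
                             string.ascii_lowercase)
    cleaned = line.translate(table, string.punctuation)

# split the cleaned line into words
words = cleaned.split()
print(words)
```

Because string.punctuation includes the apostrophe and the hyphen, "It's" 
becomes "its" and "well-known" becomes "wellknown".  That is usually fine for 
a simple word count, but it is worth knowing about.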

Now a sorted list of the results is constructed at the end of the function.  
First, a temp list is defined.  The items method (or function) of the 
dictionary returns a list of (key, value) pairs in no particular order.  The 
for loop reverses the key and value data as it transfers the data to temp.  The 
data is reversed so it will sort by count, which is now first.
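
A short sketch of why the swap matters, using a small made-up dictionary in 
place of the real word counts: Python compares tuples element by element from 
left to right, so with the count first, the count decides the order and the 
word only breaks ties.

```python
# hypothetical counts standing in for the real dictionary
counts = {'the': 3, 'on': 2, 'cat': 1}

# swap each (word, count) pair into a (count, word) pair
pairs = []
for k, v in counts.items():
    pairs.append((v, k))

# the first elements differ, so the count decides the comparison
count_decides = (3, 'the') > (2, 'zebra')

# the counts are equal, so the words are compared instead
word_breaks_tie = (1, 'sat') > (1, 'cat')

print(count_decides)
print(word_breaks_tie)
```

Both comparisons are True.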

The temp list is sorted in descending order (reverse=True) and then copied 
into the final result, with each pair swapped back to (word, count).  Note 
there are better ways to do this conversion of a dictionary to a sorted list, 
but they involve topics not yet covered.
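
For the curious, one such alternative might look like the sketch below.  It is 
not the method used in the script; it relies on the built-in sorted function 
with a key function, and the dictionary contents and the helper name 
by_count_then_word are made up for illustration.  The swap-sort-swap version is 
included alongside it so the two results can be compared.

```python
# hypothetical counts standing in for the real dictionary
counts = {'the': 3, 'on': 2, 'sat': 1, 'cat': 1}

# the approach from the script: swap, sort descending, swap back
temp = []
for k, v in counts.items():
    temp.append((v, k))
temp.sort(reverse=True)
loopResult = []
for v, k in temp:
    loopResult.append((k, v))


def by_count_then_word(pair):
    # sort key: compare by count first, then by word to break ties
    word, count = pair
    return (count, word)


# the alternative: sort the (word, count) pairs directly
keyResult = sorted(counts.items(), key=by_count_then_word, reverse=True)

print(loopResult)
print(keyResult)
```

Both lists come out in the same order, highest count first, because the key 
function returns the same (count, word) tuple that the swap produces.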

This concludes the problem of how to open a file and count the number of 
occurrences of each word.  The complete script follows.

# wordCount2.py count the words in a file
""" This script shows how to:
- open a text file
- read file by line
- translate line to lower case and delete punctuation
- split line into lower case words
- count words
- convert dictionary of word counts to ordered list
- return ordered list of words and counts
"""

import string

# make an uppercase to lowercase translation table
toLower = string.maketrans(string.uppercase, string.lowercase)

def counter(file):
  # Open file and count words/frequency in file

  # create an empty dictionary for words and their counts
  # (local to the function so repeated calls do not add to earlier counts)
  wordMap = {}

  # open file for reading
  inFile = open(file, 'r')

  # read file by lines
  for line in inFile:
    #print 'line', line.strip()

    # remove unwanted characters 
    line = line.translate(toLower, string.punctuation)
    #print 'trans line', line

    # split lines into words
    words = line.split()
    # print 'words', words
    # put word into dictionary incrementing count
    for word in words:
      #print 'word', word

      # update word's count or initialize if first time
      wordMap[word] = wordMap.get(word, 0) + 1
      # end for word loop
    # end for line loop

  # close open file
  inFile.close()

  # create temporary list of (count, word) pairs
  temp = []
  for k, v in wordMap.items():
    # the key (k) and value (v) are switched to sort by value
    temp.append((v, k))
  #print temp
  # sort the temp list
  temp.sort(reverse=True)
  #print temp

  # create final word list with key and value swapped back
  wordCount = []
  for v, k in temp:
    wordCount.append((k, v))
  #print wordCount

  return wordCount
  # end counter function


if __name__ == '__main__':
  file = 'words.txt'

  wordCount = counter(file)
  
  print 'list of words and counts:'
  for word, count in wordCount:
    print '%s %d' % (word, count)
    
