[pythonvis] Part 2: How to split line into words and count them

  • From: "Richard Dinger" <rrdinger@xxxxxxxxxx>
  • To: <pythonvis@xxxxxxxxxxxxx>
  • Date: Thu, 22 May 2014 09:03:16 -0700

In part 1 the file was opened and read line by line.  Note there are print 
statements that can be uncommented to trace what is happening.

Once the file is opened each line is processed.  The string method split, 
called with no arguments, splits the line into a list of words at each run of 
whitespace.
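
To see split on its own, here is a small sketch.  The sample line is made up 
for illustration and is not from the script:

```python
# a made-up sample line, as it might come from a text file
line = 'the quick  brown fox\n'

# split() with no arguments splits on runs of whitespace
# (spaces, tabs, newlines) and never returns empty strings
words = line.split()

print(words)
```

Note that the double space and the trailing newline do not produce empty 
words.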

wordMap is a data structure called a dictionary.  A dictionary is sort of a 
list that is accessed by a key (such as a word in this example) rather than by 
an index.  So if our text file has the word 'the' in it 5 times:

wordMap['the']

would give 5.  So I use a dictionary with the words of the text file as the 
keys to count how many of each word there are.
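
A minimal sketch of the idea, using a hand-built dictionary with made-up 
counts (the script builds its dictionary from the file instead):

```python
# a hand-built dictionary: each word (key) maps to its count (value)
wordMap = {'the': 5, 'cat': 2}

# look up a count by its key instead of by a numeric index
print(wordMap['the'])

# assigning to a key that is not there yet adds a new entry
wordMap['dog'] = 1
print(len(wordMap))
```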

Another for loop processes each word in the list.  Words not already in the 
wordMap are added with a count of 1 and existing members are incremented.  The 
get method looks up its first argument as a key and returns the stored count; 
if the key is not in the wordMap, it returns the second argument instead.  So 
the statement:

wordMap[word] = wordMap.get(word, 0) + 1

has the same result as:

if word not in wordMap:
  wordMap[word] = 0
wordMap[word] = wordMap[word] + 1
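
If you want to convince yourself that the two forms agree, something like 
this sketch should do it.  The word list and the names countsGet and countsIf 
are made up for the comparison:

```python
# made-up word list for illustration
words = ['the', 'cat', 'saw', 'the', 'dog']

# version 1: one line using get with a default of 0
countsGet = {}
for word in words:
  countsGet[word] = countsGet.get(word, 0) + 1

# version 2: explicit membership test before incrementing
countsIf = {}
for word in words:
  if word not in countsIf:
    countsIf[word] = 0
  countsIf[word] = countsIf[word] + 1

# both dictionaries hold the same words and counts
print(countsGet == countsIf)
print(countsGet['the'])
```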

Once the whole file has been read, the result is printed out in no particular 
order.
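
If a repeatable order is wanted, one option (not used in the script) is to 
sort the entries before printing.  This sketch uses made-up counts.  As a side 
note, in Python 3.7 and later a dictionary iterates in the order the keys were 
added, so there the unsorted output would follow first appearance in the file.

```python
# made-up counts for illustration
counts = {'the': 5, 'cat': 2, 'dog': 1}

# sorted() on the (word, count) pairs orders them alphabetically by word
for word, count in sorted(counts.items()):
  print('%s %d' % (word, count))
```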

Note that the if __name__ == '__main__' test at the end of the file is True 
when the script is run directly and False when it is imported into another 
script.  So a section like this is a good place to put some testing code.  
Make up a file with some text in it and either name it words.txt or change the 
code to match the name.  Then run the script.
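
Here is a stripped-down sketch of that pattern.  The helper countLine is 
hypothetical and only stands in for the real counting function so the example 
runs without a text file:

```python
def countLine(line):
  # hypothetical helper: count the words in one line of text
  counts = {}
  for word in line.split():
    counts[word] = counts.get(word, 0) + 1
  return counts


if __name__ == '__main__':
  # this part runs only when the file is executed directly,
  # not when another script imports it
  print(countLine('a b a'))
```

Another script can import this file and call countLine without triggering the 
test print.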

This still needs some work, since capitalized words are counted separately 
from lowercase ones and punctuation attached to a word changes the counts.  
But we will look at that in the next version.
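
To make the problem concrete, this sketch counts a made-up sentence the same 
way the script does.  It only shows the issue and does not fix it:

```python
# made-up line with capitalized words and trailing punctuation
line = 'The cat saw the dog. The dog ran.'

counts = {}
for word in line.split():
  counts[word] = counts.get(word, 0) + 1

# 'The' and 'the' are separate keys, and so are 'dog.' and 'dog'
print(counts['The'])
print(counts['the'])
print('dog.' in counts)
```

So the capitalized and lowercase forms each get their own count, and the 
period sticks to the word in front of it.
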

Here is the script:

# wordCount1.py count the words in a file (no word cleanup)
""" This script shows how to:
- open a text file
- read file by line
- split line into words
- count words
- return dictionary of words->count
"""

# create an empty dictionary for words and their counts
wordMap = {}


def counter(file):
  # Open file and count words/frequency in file

  # open file for reading
  inFile = open(file, 'r')

  # read file by lines
  for line in inFile:
    # print('line: %s' % line.strip())

    # split lines into words
    words = line.split()
    # print('words: %s' % words)
    # put word into dictionary incrementing count
    for word in words:
      # print('word: %s' % word)

      # update word's count or initialize if first time
      wordMap[word] = wordMap.get(word, 0) + 1
      # end for word loop
    # end for line loop

  # close open file
  inFile.close()

  return wordMap
  # end counter function


if __name__ == '__main__':
  file = 'words.txt'

  counts = counter(file)
  #exit()
  # print with parentheses and items() run under both Python 2 and Python 3
  print('list of words and counts:')
  for k, v in counts.items():
    print('%s %d' % (k, v))
