Sunday, January 02, 2011

Jumble® and Python

So I was on the train the other day, looking at the daily jumble® in the paper (if you don't know how it works, have a look at their website). I had hit a mental block, trying to decode MUGPIN and only getting as far as "umping" (is it a verb, to "ump"?). I could have typed
$ grep '^......$' /usr/share/dict/words | grep m | grep u | grep g | 
    grep p | grep i | grep n
(which would give me a list of six-letter words in the system-supplied dictionary containing all the letters m,u,g,p,i,n) but I thought, "Hey, I have a little time; I should write a script." That way, rather than typing all those "grep"s, I could type the name of the script ("anagram", say) and the letters of interest (in this case mugpin).
For the impatient: →put your mouse here← to see the answer.
What did all those greps mean?

The first one, "grep '^......$' /usr/share/dict/words", selects lines containing six characters: ^ means beginning-of-line, . matches anything other than end-of-line, and $ matches end-of-line.

The | ("pipe" or vertical bar) means: take the output of the prior command, and feed it to the following command as its input.

The next six greps select lines containing one or more 'm's, one or more 'u's, etc. And it gives the right result, but it's icky having to type "grep" so many times.

How best to write a script for this? There is probably some nice way to get bash(1) to give me a list of letters in a word, but I didn't know what it was. Besides that, I'm trying to write more Python and less bash. It turns out that Python treats a string like a list -- even more so if you say "list('whatever')"
$ python
Python 2.6.5 (r265:79063, Jul  5 2010, 11:47:21) 
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> list('whatever')
['w', 'h', 'a', 't', 'e', 'v', 'e', 'r']
>>> 
My thought was, grab the list of words from /usr/share/dict/words, select the ones of the right length; from those, select the ones that contain all the letters in the supplied word. An initial whack might look something like this:
#!/usr/bin/python -utt
import sys              # for argv

def main(aword):
    dictwords = [X.strip() for X in open('/usr/share/dict/words', 'r')]

    aword = aword.strip().lower()
    awordlen = len(aword)
    somewords = [X for X in dictwords if len(X) == awordlen]
    for aletter in list(aword):
        somewords = [X for X in somewords if aletter in X]
    print aword, somewords

if __name__ == '__main__':
    main(sys.argv[1])

OK, let me explain a little. First, the #! line -- that tells the computer (if you're on a Linux® or Unix™-like system) to run the Python interpreter; the "-utt" says to unbuffer output (I'm so impatient) and to treat tabs as illegal for indentation -- safety first.

Then comes "main", which does the work of this script, takes a single input parameter, the scrambled word in question ('mugpin' in this case). The first thing the script does is get the word-list from the system-supplied dictionary, /usr/share/dict/words (if that file doesn't exist on your system, then I don't know where to get a nice word list from). I've become a fan of list comprehensions (the line beginning dictwords = for example) because I can say a lot in just one line without having it look like too much magic: what this line does is open the dictionary and put the blank-stripped version of each line into the array dictwords.

The script unblanks and downshifts the supplied word (that's what aword.strip().lower() does); it then calculates the resulting length. Then comes another list comprehension: somewords = [X for X in dictwords if len(X) == awordlen] -- this basically takes all words in dictwords of a certain length (the length of aword) and stores this subset into somewords

The loop ("for aletter in list(aword)") further narrows somewords down to only those words in the dictionary that have the letters in the supplied parameter. Finally, main() prints the supplied parameter and the list of matching words.

About the last two lines -- I never (well, hardly ever) write a Python script that just starts running, because when you try to run them under pdb they just run; generally the main part of the script goes into a routine main(), which is called only if this script is the main one being run. The last line passes sys.argv[1] as main()'s one parameter. Now to run in a debugger, I can type this:

% python
Python 2.3.5 (#1, Jan 12 2009, 14:43:55) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1819)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdb
>>> import ana2
>>> ana2.main('umping')
umping ['impugn']
>>> 
The script was saved in a file "ana2.py", so "import ana2" brought it "into" the Python interpreter. Here I just ran it, and it worked (casting caution to the wind and all that). If I wanted to do something a little more careful, I could type:
>>> pdb.run('ana2.main("umping")')
> (1)?()
(Pdb) s
--Call--
> /Users/collin/ana2.py(4)main()
-> def main(aword):
(Pdb) n
> /Users/collin/ana2.py(5)main()
-> dictwords = [X.strip() for X in open('/usr/share/dict/words', 'r')]
(Pdb) n
> /Users/collin/ana2.py(7)main()
-> aword = aword.strip().lower()
(Pdb) c
umping ['impugn']
>>> 
Here I told the debugger to step into the routine I gave it, which it did -- gelling me I was at line 4 of ana2.py; I then said to go to the next statement, line 5 -- which puts the whitespace-stripped list of lines from the dictionary into dictwords, and then to just continue -- then it printed the word I couldn't figure out.

But this naive script doesn't work so well when multiple letters are involved. Consider this embarrassment:

% ./ana2.py account
account ['account', 'auction', 'caution', 'conatus', 'courant', 'outcant']
Oops! A word like "auction" includes some letters besides the original, but "outcant" contains all and only the letters from the original word. We have to do a little more.

The basic idea I came up with was to find the duplicate letters (there may be more than one, as in "balloon" or "lightning"), and look for multiple occurrences of those multiply-occurring letters. As far as finding multiple letters, there are lots of ways to do it; I just sorted the letters and then looked for multiples in adjacent letters. Since there are only 26 letters it might be more efficient to zero an array of letters and look for collisions, but this way seems to work.

While I was at it, I decided to make the input case-insensitive by forcing the input to lowercase. Also I allowed multiple words to be given on the command line. And I added a few comments. Here's the next version:

#!/usr/bin/python -utt
# vim:et:sw=4
'''Usage:
\t%PROG% {word...}

For each _word_ given, print a list of its anagrams found
in the words file, /usr/share/dict/words .
'''

import re
import sys

DEBUG = False
LETTERSONLY = re.compile('[a-z]*$')

__doc__ = __doc__.replace('%PROG%', sys.argv[0])

def usage():
    print "Usage:", sys.argv[0], "{word...}"
    sys.exit(1)

def main(wordlist):
    if len(wordlist) == 0:              # Nothing to do, so don't do it.
        usage()

    # Fetch the words from the system-supplied dictionary
    dictwords = [X.strip() for X in open('/usr/share/dict/words','r')]

    # Screen each _word_, then look for anagrams
    for aword in wordlist:
        aword = aword.strip().lower()
        if LETTERSONLY.match(aword) is None:
            print "%%% Words must consist only of letters:", aword
            continue

        # The overall plan is: find words from dictwords[] that have 
        # the same length as _word_, and have all its letters.
letters = list(aword) mylen = len(letters) letters.sort() # letters[] is the letters in _word_, in ascending alphabetical order. # Now look for duplicates. Without handling duplicates, one might # end up with "abalone" or "albinos" or "ballons" when the user # typed "balloon" prevlet = '' dups = dict() for alet in list(letters): if alet == prevlet: # It's a dup. letters.remove(alet) # remove the dup if alet in dups: dups[alet] += 1 # increase the count else: dups[alet] = 2 prevlet = alet if DEBUG: print "DEBUG: letters", letters print "DEBUG: dups", dups
# Find the words from the dictionary that have the same length somewords = [X for X in dictwords if len(X) == mylen] # ... and have all the letters in the supplied _word_ for alet in letters: somewords = [ X for X in somewords if alet in X ] if DEBUG: print "DEBUG:", len(somewords), "candidates:", somewords # ... and have enough copies of each duplicate letter for duplet,dupcount in dups.items(): patstr = '' for junk in range(dupcount): patstr += ('.*' + duplet) pat = re.compile(patstr) somewords = [X for X in somewords if pat.match(X)] if DEBUG: print "DEBUG: after matching", patstr, ':', somewords print aword + ':', somewords sys.exit(0) if __name__ == '__main__': main(sys.argv[1:])
What's with the text that looks like this? Well, as I was writing about "sort" above, I began to feel a little embarrassed about how un-pythonic the above version was, so I hacked it. This background color indicates text that I changed in the version you see below.

Final(?) version

This one is about 14 lines shorter and doesn't bother with sorting. Changed text is indicated with this background color.
#!/usr/bin/python -utt
# vim:et:sw=4
'''Usage:
\t%PROG% {word...}

For each _word_ given, print a list of its anagrams found
in the words file, /usr/share/dict/words .
'''

import re
import sys

DEBUG = False
LETTERSONLY = re.compile('[a-z]*$')

__doc__ = __doc__.replace('%PROG%', sys.argv[0])

def usage():
    print "Usage:", sys.argv[0], "{word...}"
    sys.exit(1)

def main(wordlist):
    if len(wordlist) == 0:              # Nothing to do, so don't do it.
        usage()

    # Fetch the words from the system-supplied dictionary
    dictwords = [X.strip() for X in open('/usr/share/dict/words','r')]

    # Screen each _word_, then look for anagrams
    for aword in wordlist:
        aword = aword.strip().lower()
        if LETTERSONLY.match(aword) is None:
            print "%%% Words must consist only of letters:", aword
            continue

        # The overall plan is: find words from dictwords[] that have 
        # the same length as _word_, and have all its letters.
mylen = len(aword) # Handle duplicates: _letters_ will map each letter to its count # e.g., for 'off', letters will have 'f':2, 'o':1 letters = dict() for alet in list(aword): letters[alet] = letters.get(alet,0) + 1
# Find the words from the dictionary that have the same length somewords = [X for X in dictwords if len(X) == mylen] # ... and have all the letters in the supplied _word_ for alet in letters: somewords = [ X for X in somewords if alet in X ] if DEBUG: print "DEBUG:", len(somewords), "candidates:", somewords # ... and have enough copies of each duplicate letter for duplet,dupcount in [X for X in letters.items() if X[1] > 1]: patstr = '' for junk in range(dupcount): patstr += ('.*' + duplet) pat = re.compile(patstr) somewords = [X for X in somewords if pat.match(X)] if DEBUG: print "DEBUG: after matching", patstr, ':', somewords print aword + ':', somewords sys.exit(0) if __name__ == '__main__': main(sys.argv[1:])
Call me (off)-white and nerdy, but that was fun!

No comments: