So I was on the train the other day, looking at the daily jumble®
in the paper (if you don't know how it works, have a look at their website). I had hit a mental block, trying to decode MUGPIN
and only getting as far as "umping" (is it a verb, to "ump"?). I could have typed
$ grep '^......$' /usr/share/dict/words | grep m | grep u | grep g |
grep p | grep i | grep n
(which would give me a list of six-letter words in the system-supplied dictionary containing
all the letters m,u,g,p,i,n)
but I thought, "Hey, I have a little time; I should write a script." That way,
rather than typing all those "grep"s, I could type the name of the script
("anagram", say) and the letters of interest (in this case mugpin).
For the impatient:
→put your mouse here←
to see the answer.
What did all those greps mean?
The first one, "grep '^......$' /usr/share/dict/words",
selects lines containing six characters: ^ means beginning-of-line,
. matches anything other than end-of-line, and $ matches
end-of-line.
The | ("pipe" or vertical bar) means: take the output of the
prior command, and feed it to the following command as its input.
The next six greps select lines containing one or more 'm's,
one or more 'u's, etc. And it gives the right result, but it's icky having
to type "grep" so many times.
How best to write a script for this?
There is probably some nice way to get bash(1) to give me a list of letters
in a word, but I didn't know what it was. Besides that,
I'm trying to write more Python
and less bash. It turns out that Python treats a string like a list --
even more so if you say "list('whatever')"
$ python
Python 2.6.5 (r265:79063, Jul 5 2010, 11:47:21)
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> list('whatever')
['w', 'h', 'a', 't', 'e', 'v', 'e', 'r']
>>>
My thought was, grab the list of words from /usr/share/dict/words, select the
ones of the right length; from those, select the ones that contain all the
letters in the supplied word. An initial whack might look something like this:
#!/usr/bin/python -utt
import sys # for argv
def main(aword):
dictwords = [X.strip() for X in open('/usr/share/dict/words', 'r')]
aword = aword.strip().lower()
awordlen = len(aword)
somewords = [X for X in dictwords if len(X) == awordlen]
for aletter in list(aword):
somewords = [X for X in somewords if aletter in X]
print aword, somewords
if __name__ == '__main__':
main(sys.argv[1])
OK, let me explain a little. First, the #! line -- that tells the
computer (if you're on a Linux® or Unix™-like system) to run the Python
interpreter; the "-utt" says to unbuffer output (I'm so impatient) and to treat
tabs as illegal for indentation -- safety first.
Then comes "main", which does the work of this script, takes a single
input parameter, the scrambled word in question ('mugpin' in this case).
The first thing the script does is get the word-list from the system-supplied
dictionary, /usr/share/dict/words (if that file doesn't exist on
your system, then I don't know where to get a nice word list from).
I've become a fan of list comprehensions (the line beginning
dictwords = for example) because I can say a
lot in just one line without having it look like too much magic: what
this line does is open the dictionary and put the blank-stripped version
of each line into the array dictwords.
The script unblanks and downshifts the supplied word (that's what
aword.strip().lower() does); it then calculates the resulting
length. Then comes another list comprehension: somewords = [X for X in dictwords if len(X) == awordlen] -- this basically takes all words in dictwords
of a certain length (the length of aword) and stores this subset
into somewords
The loop ("for aletter in list(aword)") further narrows somewords
down to only those words in the dictionary that have the letters in
the supplied parameter. Finally, main() prints the supplied parameter and
the list of matching words.
About the last two lines --
I never (well, hardly ever) write a Python script that just starts running,
because when you try to run them under pdb they just run; generally
the main part of the script goes into a routine main(), which is called only if
this script is the main one being run. The last line passes sys.argv[1]
as main()'s one parameter. Now to run in a debugger, I can type this:
% python
Python 2.3.5 (#1, Jan 12 2009, 14:43:55)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1819)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdb
>>> import ana2
>>> ana2.main('umping')
umping ['impugn']
>>>
The script was saved in a file "ana2.py", so "import ana2" brought
it "into" the Python interpreter. Here I just ran it, and it worked
(casting caution to the wind and all that). If I wanted to do something a
little more careful, I could type:
>>> pdb.run('ana2.main("umping")')
> (1)?()
(Pdb) s
--Call--
> /Users/collin/ana2.py(4)main()
-> def main(aword):
(Pdb) n
> /Users/collin/ana2.py(5)main()
-> dictwords = [X.strip() for X in open('/usr/share/dict/words', 'r')]
(Pdb) n
> /Users/collin/ana2.py(7)main()
-> aword = aword.strip().lower()
(Pdb) c
umping ['impugn']
>>>
Here I told the debugger to step into the routine I gave it,
which it did -- gelling me I was at line 4 of ana2.py; I then said to go to
the next statement, line 5 -- which puts the whitespace-stripped
list of lines from the dictionary into dictwords, and then to just
continue -- then it printed the word I couldn't figure out.
But this naive script doesn't work so well when multiple letters are involved.
Consider this embarrassment:
% ./ana2.py account
account ['account', 'auction', 'caution', 'conatus', 'courant', 'outcant']
Oops! A word like "auction" includes some letters besides the original, but
"outcant" contains all and only the letters from the original word. We have to
do a little more.
The basic idea I came up with was to find the duplicate letters (there may
be more than one, as in "balloon" or "lightning"), and look for multiple occurrences
of those multiply-occurring letters. As far as finding multiple letters, there are
lots of ways to do it; I just sorted the letters and then looked for multiples
in adjacent letters. Since there are only 26 letters it might be more
efficient to zero an array of letters and look for collisions, but this way
seems to work.
While I was at it, I decided to make the input case-insensitive by
forcing the input to lowercase. Also I allowed multiple words to be given
on the command line. And I added a few comments. Here's the next version:
#!/usr/bin/python -utt
# vim:et:sw=4
'''Usage:
\t%PROG% {word...}
For each _word_ given, print a list of its anagrams found
in the words file, /usr/share/dict/words .
'''
import re
import sys
DEBUG = False
LETTERSONLY = re.compile('[a-z]*$')
__doc__ = __doc__.replace('%PROG%', sys.argv[0])
def usage():
print "Usage:", sys.argv[0], "{word...}"
sys.exit(1)
def main(wordlist):
if len(wordlist) == 0: # Nothing to do, so don't do it.
usage()
# Fetch the words from the system-supplied dictionary
dictwords = [X.strip() for X in open('/usr/share/dict/words','r')]
# Screen each _word_, then look for anagrams
for aword in wordlist:
aword = aword.strip().lower()
if LETTERSONLY.match(aword) is None:
print "%%% Words must consist only of letters:", aword
continue
# The overall plan is: find words from dictwords[] that have
# the same length as _word_, and have all its letters.
letters = list(aword)
mylen = len(letters)
letters.sort()
# letters[] is the letters in _word_, in ascending alphabetical order.
# Now look for duplicates. Without handling duplicates, one might
# end up with "abalone" or "albinos" or "ballons" when the user
# typed "balloon"
prevlet = ''
dups = dict()
for alet in list(letters):
if alet == prevlet:
# It's a dup.
letters.remove(alet) # remove the dup
if alet in dups:
dups[alet] += 1 # increase the count
else:
dups[alet] = 2
prevlet = alet
if DEBUG:
print "DEBUG: letters", letters
print "DEBUG: dups", dups
# Find the words from the dictionary that have the same length
somewords = [X for X in dictwords if len(X) == mylen]
# ... and have all the letters in the supplied _word_
for alet in letters:
somewords = [ X for X in somewords if alet in X ]
if DEBUG:
print "DEBUG:", len(somewords), "candidates:", somewords
# ... and have enough copies of each duplicate letter
for duplet,dupcount in dups.items():
patstr = ''
for junk in range(dupcount):
patstr += ('.*' + duplet)
pat = re.compile(patstr)
somewords = [X for X in somewords if pat.match(X)]
if DEBUG:
print "DEBUG: after matching", patstr, ':', somewords
print aword + ':', somewords
sys.exit(0)
if __name__ == '__main__':
main(sys.argv[1:])
What's with the text that looks like this?
Well, as I was writing about "sort" above, I began to feel a little embarrassed
about how un-pythonic the above version was, so I hacked it.
This background color indicates text that
I changed in the version you see below.
Final(?) version
This one is about 14 lines shorter and doesn't bother with sorting.
Changed text is indicated with this background color.
#!/usr/bin/python -utt
# vim:et:sw=4
'''Usage:
\t%PROG% {word...}
For each _word_ given, print a list of its anagrams found
in the words file, /usr/share/dict/words .
'''
import re
import sys
DEBUG = False
LETTERSONLY = re.compile('[a-z]*$')
__doc__ = __doc__.replace('%PROG%', sys.argv[0])
def usage():
print "Usage:", sys.argv[0], "{word...}"
sys.exit(1)
def main(wordlist):
if len(wordlist) == 0: # Nothing to do, so don't do it.
usage()
# Fetch the words from the system-supplied dictionary
dictwords = [X.strip() for X in open('/usr/share/dict/words','r')]
# Screen each _word_, then look for anagrams
for aword in wordlist:
aword = aword.strip().lower()
if LETTERSONLY.match(aword) is None:
print "%%% Words must consist only of letters:", aword
continue
# The overall plan is: find words from dictwords[] that have
# the same length as _word_, and have all its letters.
mylen = len(aword)
# Handle duplicates: _letters_ will map each letter to its count
# e.g., for 'off', letters will have 'f':2, 'o':1
letters = dict()
for alet in list(aword):
letters[alet] = letters.get(alet,0) + 1
# Find the words from the dictionary that have the same length
somewords = [X for X in dictwords if len(X) == mylen]
# ... and have all the letters in the supplied _word_
for alet in letters:
somewords = [ X for X in somewords if alet in X ]
if DEBUG:
print "DEBUG:", len(somewords), "candidates:", somewords
# ... and have enough copies of each duplicate letter
for duplet,dupcount in [X for X in letters.items() if X[1] > 1]:
patstr = ''
for junk in range(dupcount):
patstr += ('.*' + duplet)
pat = re.compile(patstr)
somewords = [X for X in somewords if pat.match(X)]
if DEBUG:
print "DEBUG: after matching", patstr, ':', somewords
print aword + ':', somewords
sys.exit(0)
if __name__ == '__main__':
main(sys.argv[1:])
Call me (off)-white and nerdy, but that was fun!