April 25, 2009

Detour into Python and optimizations (Part 1)

Posted in Python tagged , , , , at 4:48 pm by davy803

Sorry for the lack of updates.  I’ve been distracted for a while because a friend described a word game that is played where you take 2 words and try to convert one into the other by switching one letter at a time, where each iteration is a real word.

e.g.

MOON –> WOLF
GOON
GOOF
GOLF
WOLF

Like any good programmer, the first thing that popped in my head was “Hey I could write a program to solve that! ”  I’ve also been meaning to learn another language for some time, and very reliable sources suggested Python.  From what I read it seemed pretty accepted that Dive Into Python was the best intro to Python for programmers, and so I dove in.

And I’m glad I did.  Python does some pretty amazing things that I could do no justice by explaining here.  For a quick taste, take a look at List Comprehesion and Multi-variable Assignment.

What the game essentially boils down to is a search for an optimal path through a graph.  I originally thought of it as a tree, but actually with the same start and end points, you could end up generating different trees depending on what path you take, so it’s more accurately described by a simple undirected graph.   The best way to solve this seems to be an A* algorithm.  Dijkstra’s algorithm was also considered, but it assumes that all nodes and edges of the graph are known, and I choose to generate them on the fly.

The general premise of A* is for each node (a word in our case) you know the distance between the start and the current (called g(x)) and you have an estimate to how far you are from the goal (the heuristic, called h(x)).  In our case the heuristic would be the number of letters that differ.  And since it’s impossible to get to the end in fewer steps, that means that it can never overestimate the number of steps remaining and thus is an admissable heuristic.  It is proven that the A* algorithm will always generate an optimal path.  The way it works is the “score” (f(x)) of each word is the sum of g(x) and h(x)  and you always try the paths with the lowest f(x) first.

Let’s take a look at the algorithm in code:

```def AStar(start, goal, wordlist):
import Queue
closedset = []
openset = [start]
pqueue = Queue.PriorityQueue(0)
g_score = {start:0}
h_score = {start:distance(start, goal)}
f_score = {start:h_score[start]}
pqueue.put((f_score[start], start))
parent_dict = {}
while len(openset) > 0:
x = pqueue.get(False)[1]
if x == goal:
return reconstruct_path(parent_dict,goal)
openset.remove(x)
closedset.append(x)
for y in getchildren(x, wordlist):
if y in closedset:
continue
temp_g_score = g_score[x] + 1
temp_is_better = False
appended = False
if (not y in openset):
openset.append(y)
appended = True
h_score[y] = distance(y, goal)
temp_is_better = True
elif temp_g_score < g_score[y] :
temp_is_better = True
else :
pass
if temp_is_better:
parent_dict[y] = x
g_score[y] = temp_g_score
f_score[y] = g_score[y] + h_score[y]
if appended :
pqueue.put((f_score[y], y))
return None```

Now let’s break it down.  We have and openset and a closedset.  The openset is the list of nodes that we know about.  We have a score for each of these nodes, but we don’t know about that node’s children or their scores yet.  Once we get and score all of a node’s children it goes into the closedset which means we’re done looking at them.  The variable pqueue is a priority queue that contains the same nodes as the openset except it also stores and sorts by the f_score of each node.  Why not just have a priority queue instead of the list?  That was my original thought, but I needed to be able to determine whether a given node was already in openset which we can’t do with a queue.  I could also just sort the list at each iteration but it seemed less efficient.

After that we have the 3 different scores that I’ve already defined above.  They’re stored as dictionaries with the node being the key and the score being the value.  parent_dict stores the parents of each node along the optimal path.   This is updated if a better path is found.

Now we go through the openset and keep adding it’s decendants until we either find the goal or we run out of nodes.  We pull the node with the lowest f_score since that’s what pqueue is sorted by, call it x and put it into the closedset.  (Technically it should go to closed after we finish going through it’s children, but it doesn’t really matter in this case).

Next we get all the children (neighbors) of x.  If it’s already in the closedset, we already know all we need to know about it, move on.  Otherwise, the distance from the start to this node, y, is the distance to it’s parent, x, plus one.

“temp_is_better” is used to flag whether we should add y to the openset, or update it with a better path.  “appended” is really just a workaround because we can’t add y to pqueue until we know the f_score.  Basically if y isn’t in the openset add it.  If it is, but the current path to y is shorter than the previously discovered path to y, then update all the info about it.

Well that’s it.

Ok I lied.  There were 3 functions that I didn’t define:  getchildren, distance, and reconstruct_path.  Let’s look at getchildren first:

``````def getchildren(word, wordlist):
return [ w for w in wordlist if len(word) == len(w)
and distance(word, w) == 1 ]``````

This is an example of list comprehension in Python.  How you read this is for each word, “w”, in “wordlist” if the length of “w” is the same as the length of the paramter “word”, and the distance between word and w (which I’ll define next) is 1, then add w to the list to be returned.  The backslash indicates the the statement continues on the next line, since Python normally used linebreaks to end statements rather than a semicolon.

That was simple enough, now let’s look at the distance function:

``````def distance(word1, word2):
if len(word1) != len(word2):
raise ValueError("Word length do not match")
matches = 0
for i in range(len(word1)):
if word1[i] == word2[i]:
matches += 1
return len(word1) - matches``````

Here we take 2 words and compare to see how many letters are different between them.  First we make sure they contain the same number of letters and raise an exception if they don’t.   Python doesn’t have the standard for loop and instead everything is a foreach loop.  You can mimic a standard for loop by doing “for i in range(stopvalue)” because range is a built-in function that generates a list of numbers 0 to stopvalue -1.  Now it’s just a simple matter of counting the number of letters that match and subtracting that from the number of letters in the word.  Okay, I know what you’re thinking, and you’re right.  It’d be much simpler to count the number of letters that are different instead of the ones that are the same.  It didn’t strike me at the time of writing the first version of the program, but I correct that later on.

Now for the final missing function:

``````def reconstruct_path(parent_dict,node):
if node in parent_dict.keys():
p = reconstruct_path(parent_dict,parent_dict[node])
p.append(node)
return p
else:
return []   ``````

This is called when we find the goal, and all it does is backtrack recursively through all the parents nodes starting from the goal and finally returning a list of the path.

Now that we have the algorithm down, let’s actually use it:

``````wordfile = open("realwords.txt")
wordfile.close()
words = []
while True:
userentry = raw_input("Hello, enter the 2 words to play with separated by a space:n ")
words = [w.lower() for w in userentry.split()]
if(len(words) == 2 and len(words[0]) == len(words[1])):
break
print "You selected %s and %s as your words" % (words[0], words[1])
wordlist = [ w for w in wordlist if len(words[0]) == len(w)]
print "Minimum number of steps is %s" % (len(answer))
else:
print "Good luck!"
else:
print "Sorry, there's no answer to yours"``````

Python makes working with text files super easy.  As you can see, I opened the file in the first line.  The 2nd line might need some explaining.  First if you don’t specify a mode for opening a file, the default is readonly mode as text.  (I’m not sure how it handles encoding whether it’s ASCII, ANSI or Unicode, but for our purposes it’s just a plain text file and it just works.)  Specifying read() without parameters reads to the end of file, and since we opened in text mode it returns it as a string.  The split() method takes a string and splits it up into a list of strings for each word.  You can specify a word delimiter, but if you don’t it is assumed to be any kind of white space (space, linebreak, tab).  I wrote a quick method called cleanentries() to clean up some formatting specific to this file, it’s not real robust so not really worth going into.

After you’re done with a file you should always close it, so I do.  Now I need to ask the user for 2 words.  I use the split method again and convert all the words to lowercase.  Python doesn’t have a do-while loop, so to work around that people suggesting doing while True and then break on the opposite condition.  Here I make sure that exactly 2 words are entered, and that each word has the same number of letters.

I let the user know what words they entered and then remove all the words from the wordlist that don’t have the same number of letters as their input.  Now I use my magic algorithm, and if there’s a solution I tell them how many steps it takes and give them the option of whether or not they want to see the actual solution.

That’s it, all fine and dandy.  Now let’s run it.  One set that my friend did by hand was to convert from “poets” to “verse”, so let’s give that a try.

That took forever.  How can I speed this up?

There’s a saying about performance, always measure and profile.  Luckily Python comes with a pretty easy to use profiler out of the box.  After skimming through this guide, I found out all I needed to do was the open up a command prompt to the folder where my file is saved and type “python -m cProfile [scriptname].py”  So let’s see what it tells us:

What the screenshot doesn’t show is that 156 seconds is in the “raw_input” method which was basically waiting for user input so that doesn’t count, which leaves 189 seconds if you subtract the input time from the cumulative time for WordGame.py:1((module)).  About 1 second went to cleaning the entries from the list, and a little misc time here and there and you end up with the bulk of the time being in the AStar method.  Cumulative time counts the entire time spent in the method, including other method calls, where  total time excludes the time in other methods.  You’ll see that the majority of the time is spent in the distance method and that it’s called a whopping  5.3 million times.  Obviously some optimization needs to go in there.  Also not on the screenshot, but the “len” function takes up 57.6 seconds and is called 32million times.  len is a built-in function, so we can’t really speed up that implementation, but we could call it less often.

In my next post I’ll go into how I optimized the program from taking 189 seconds as you see here, down to 2.25 seconds through a series of optimizations with some help from the people over at Stackoverflow.com.