Saturday, September 17, 2011

Playing with Pac-Man

What the Puck?

For those unfamiliar, Pac-Man is an old arcade game, where you play as a cheese pizza trying to collect pellets while being chased round a maze by four whimsically named ghosts. Although really, if you are unfamiliar with Pac-Man, there's something wrong with you. Anyway, you can have a play online here.

The four ghost are Blinky, Pinky, Inky, and Clyde. For a good explanation of the ghosts' behaviours, here's a post from the tragically short-lived 'Game Internals' blog.

tl;dr Blinky chases, Pinky ambushes, Inky teams up with Blinky, and Clyde does pretty much whatever the fuck he wants. But really, you should read the article.

And as for the origin of the name, I'll let Scott Pilgrim explain. Or if you prefer to read, wikipedia.

Wakking Round in Circles

Graph Theory is a fantastic area of maths. Graph Theory can find you the shortest route, help you get out of a maze, tells you how to colour a map so no adjoining regions are of the same colour, to reconstruct DNA fragments, and help you find the best route for sight-seeing. Among many other things.

But sadly graph theory tends to be covered more in Computer Science courses than Maths courses. So for myself, i have to be content with books and the internet. Not that I let such thing hold me back.

So the question is this - Find the optimal route around the Pac-Man maze

What we want to do is find a route around the maze while collects every pellet. Or in graph theory terms, a route which travels along every edge at least once.

First of all, we need to turn the maze into a network graph. For this, we place a node at each junction, and join pairs of nodes according to whether or not there is a direct route between the pair.

So here's what the maze looks like, beside what the graph of that maze looks like.
The blue edges are the ones with pellets. The red edges are the ones without pellets, which you don't necessarily have to visit.

And we can assign to each edge a 'weight' - i.e. the distance (in grid squares) between the pairs of nodes
Now, first of all, we note that there are a lot of node where an odd number of edges meet (with odd degree). What this means is that the graph is not Eulerian - what that means is, there doesn't exist a route around the graph which visits each edge exactly once.

Which means that we will definitely have to travel along some paths more than once.

So the optimal path is going to be the one which minimises the total distance traveled. This is an example of the Chinese Postman Problem, and fortunately, there is an algorithm for finding the shortest route.

Without going into too much detail, what you do is pair up odd nodes, and find the shortest paths between them (using Dijkstra's Algorithm). This makes the graph Eulerian, i.e. so that a route that visits each edge at least once exists.

So if we take the bottom corner as an example, there are four node of odd degree (green).
By finding the shortest paths between pairs of odd nodes (red), we can make this mini-graph Eulerian - i.e. so all nodes have even degree. This means we can find the optimal tour around the graph (right). In this case we have to travel an additional 12 units.

Of course, it gets more complicated when you're working with the full Pac-Man graph.

When there are more than two nodes of odd degree, you also have to find the combination of pairings which minimises the total tour size. For an idea of scale, there are 20 odd nodes in the Pac-Man graph. That means there are 654,729,075 pair combinations to check and compare. And you have to run Dijkstra's algorithm 10 times for each of those permutations. It adds up.

So instead, I took the essence of the algorithm and constructed a route by hand. This is not necessarily the optimal route. For this route, you have to travel an extra 60 units - including 9 units along a path segment that doesn't contain any pellets.
There are other paths around the maze which cover (and recover) the same same ground as the above solution, and others that double back over different ground, but that add up to the same distance.

One constraint, in the case of Pac-Man, is that the path has to start at the point near the bottom of the maze, where Pac-Man starts in the game.

But also, it doesn't matter where the tour finishes (where as, the Chinese Postman starts and finishes at the same point). This means modifications to the 'optimal' solution may be possible/required.

In the bottom-corner example, we can eliminate some of the extra paths (up to 9 units worth) if we don't care where we end up.

If I ever get around to programming and running the proper algorithm, I'll let you know what the 'proper' solution is. And if you can find a better solution, let me know.

Of course what I haven't taken into account is the fact that, as you're making your way around the maze, you're also trying to avoid ghosts. I'll be honest, this path finding was more of an intellectual exercise than anything practical. You can find some strategic tips here.

This Maze Ain't Big Enough For the Both of Us.

If you follow me on Twitter, you may have already read about this - in a dream I formulated a multiplayer version of Pac-Man. But weird, subconscious machinations aside..

I looked it up on wikipedia, and there was already a multiplayer Pac-Man - Pac-Man Vs. But in that game you have one player playing as Pac-Man, and any additional players as ghosts. The aim is for the Pac-Man player to collect pellets and for ghost players to stop them. My idea is better.

First of all, you'd probably need to make the maze bigger. Each player plays as a Pac-Man, and like the normal game, their aim is to collect the most points, while trying to avoid ghosts. The game introduces a new special pellet which lets you cannibalise an opponent Pac-Man (much like the pellet that lets you eat ghosts).

Each player starts with two lives. If they lose both their lives they become a ghost. As a ghost, other ghosts can't hurt you, but you can't collect any pellets. If you catch an opponent Pac-Man, you steal one of their lives (becoming a 'living' Pac-Man again, with one life). If an opponent Pac-Man gets one of the special ghost pellets, they can eat a ghost opponent, killing them for good.

The game ends when all the pellets are collected or when all players are ghosts/dead. The winner is the player with the most points.

Or something like that.

If only I have the know how to make a real, playable version. But, hey, if there happens to be anyone out there reading this who could make it, please do..


Sunday, September 11, 2011

Picking and Scoring Words for Hangman

Hangman, for those unfamiliar, is a popular way for primary school teachers to teach spelling in the guise of a fun guessing game, and a way for secondary school teachers to pass the time at the end of term when they can't be bothered with teaching.

It Works Like This..

In a dystopian future where the powers that be like to play deadly games with their political prisoners, Hangman sees one such prisoner, 27 year old Adrian Silk, standing on a platform in front of a cheering crowd; his crime - logical thought.

Adrian is presented with an unknown word - 9 letters, title of a 21st century movie - and his task is to determine that word by guessing letters.

With each correct guess he gets closer to discovering the mystery word and winning a stay of execution. With each incorrect guess, another piece of the gallows is built. If he can't reveal the word before the gallows are fully constructed, he'll be dancing a dead man's jig for the crowd.

He starts by guessing the vowels - A, E, I, O, U - two incorrect guesses, two pieces of the gallows built: I__E__IO_

He guesses some common consonants - N, R, S, T - two more incorrect guesses, but still safe: IN_E_TION

Adrian stops to think - "are there any films called INFECTION? There are two, but they're pretty obscure.. Wait..!". And he hazards a guess - INCEPTION ..?

He's right! Adrian jumps for joy, and breathes a sigh of relief. And as the disappointed crowd jeers, Adrian is shot between the eyes by the Games Master - found guilty of the logical thought with which he'd been charged.

Still, if you think that's bad, you should see how they play KerPlunk!

Lets Play a Guessing Game

Okay, imagine you were guessing letters for an unknown word. But in this variation on the game, the hangman will only tell you when you've won, or else when you've run out of letters. That's all the information you get, nothing else.

The simplest approach to finding a word would be to just work your way through the alphabet - 26 letters, a maximum of 26 guesses to uncover any word.

But there will be a limit on how many incorrect guesses you can make - usually around 10. This means that, unless the word you're guessing contains 17+ unique letters this probably isn't the optimal approach.

So with no feedback and a guess limit, all you can do is guess letters 'at random'. But some letters appear more often in the English language than others. So you can make educated guesses, and go for the more common letters first: E, T, A, I, O, N, S, ...

If we assume that the probability, p(ci), of a person guessing a given letter is roughly equal to that letter's frequency in the English language, then the probability, big P, of that person guessing a correct letter in a word equals the sum of the probabilities of each unique letter in that word.

For example, in MISSISSIPPI, the unique letters would be {I,M,P,S}, so

P(MISSISSIPPI) = p(I)+p(M)+p(P)+p(S) = 4.025% + 2.406% + 1.929% + 6.327% = 14.687%

What this immediately suggests for choosing words is:

1) Words with 'uncommon' letters are better

Uncommon letters have lower probability p(ci), so will make totals lower. E.g. C is less common than H, so (in the game described above) CAT is harder to guess than HAT

2) Words with fewer unique letters are better

If there are fewer (and smaller) targets, it's harder to hit one of them by just firing at random.

Parting Words

So what about in a real game, where you get lots of feedback - which letters are right or wrong, where each letter appears and how many times.. This makes deduction a lot easier.

Have you ever tried to cheat at a crossword? There are loads of 'helpers' online - put in the pattern and see what possible words match it. Well for hangman we take that a step further.
Say our pattern is E_UA_I__; there are 5 words it could be - EQUALITY, EQUALING, EQUALISE, EQUATION, EQUATING

First of all, EQUALISE is not a valid solution because it introduces an extra E. And if we'd guessed all the vowels to get that pattern, we can also rule out EQUATION since we've already determined that O doesn't appear in the solution word.

So to put it more concisely - we want to find words that match a given pattern, eliminating those which contain extra letters already guessed.

In the case above we're left with 3 words - EQUALITY, EQUALING, EQUATING - and which of the three the correct word is can be determined by guessing the letter T:

1) E_UA_IT_ which can only be EQUALITY
2) E_UATI__ which can only be EQUATING
3) T doesn't appear in the word, in which case it can only be EQUALING

The fact that T is the most common consonant probably means each of those 3 words are poor choices for hangman. This is an example of a low entropy pattern.

Imagine, instead, that you're presented with a 3 letter word, and you've gone through the vowel and found you're left with the pattern _A_

What could that be? Maybe you guess T next and get _AT - well that could be BAT, CAT, HAT, RAT, .. Or maybe T is wrong, then the word could be MAN, CAN, BAY, RAY, WAX, TAX, .. how many guesses have you got left?

In fact, there are 179 possible words which match this pattern. This would be a high entropy pattern.

So, more tips for picking good words:

1) Pick obscure words

Even if there is only one word a pattern could fit, if the opponent doesn't know that word, then they are forced to keep guessing letter-wise. Which makes things a little more tricky.

2) Pick shorter words

There are more 3 and 4 letter words in the English language than 8 and 9 letter words.

3) Avoid too many repeated letters

For example, MISSISSIPPI only has 4 unique letters, but once you've guessed I and S, you've uncovered 73% of the word, and there's really nothing else _ISSISSI__I could be.

4) Pick words with high entropy patterns

A Question of Uncertainty

In Information Theory, entropy relates to the uncertainty in a piece of information. It's measured in bits, and, for a set of equally likely outcomes, is calculated as the base two logarithm of uncertainty.

So a coin toss has an uncertainty of 2, because there are two possible, equally likely outcome. That gives an entropy of 1 bit. For a given character in a word, if we assume all letters are equally likely, then there is an uncertainty of 26 - so that gives an entropy of log2(26) = 4.7 bits.

But as pointed out above, all letters are not equally likely in the English language. So the entropy is more complicated to work out: H = -SUM[p(ci)*log2(p(ci))]. Anyway, it turns out the entropy of a given unknown letter is 4.18 bits.

In the case of word patterns, we work out entropy as log2 of number of valid words matching the pattern, since all matching words have the same probability of being correct.

For the low entropy example above, there are 3 possible words that match the pattern so that would give an entropy of log2(3) = 1.58 bits. The 3 letter, high entropy example, on the other hand - with it's 179 possible solutions - would have an entropy of 7.48 bits.

So Where Does This Leave Us?

The idea is this - to come up with a scheme for scoring the 'quality' of a word in terms of how hard it is to guess in a game of hangman.

First, we score our word by its letter, as described above, to get the probability of guessing a correct letter. We then take one minus this to find the probability that a letter guess is wrong - the lower the probability of guessing a correct letter, the higher that word's score.

Entropy has a well define method of measurement in Information Theory, as discussed above.

But if we measure the entropy of a word before any letters are guessed, we find that all words of the same length have the same entropy. So instead, it would be more useful to measure a word's entropy after, say, some common letters have been filled in. Since most people start by guessing the vowels, that is what I'm going to go with.

In some cases, there are so many possible pattern matches, that you'd do better to simply make educated guesses at letters until the solution is found. For this, we work out the letter-wise entropy. Again, before any letters are guessed, all words with the same number of unique letters have the same entropy.

So. Each unknown, unique letter in a word can be one of those not already guessed. So we can work out the entropy, H(alpha), of the alphabet, sans the letters already guessed. Then, multiply that by the number of unique letters left to find, n.

So for example, if we guess the vowels and get the pattern _A_, then the entropy is  2*H(consonants) = 5.62 bits

From a Practical Stand Point

To work out the word entropy we can download a dictionary, then write some code which will find and count the words which match our required pattern, excluding those containing already guessed letters.

We then find the 'smallest maximum' entropy - which of the character entropy and the word entropy is smallest. In most cases it'll be the word entropy.

And finally we multiply that by the probability, one minus big P, to assign to each word a score:

(1-P)*min{H(W), H(C)}

Or you can use this code.

In and of themselves, the scores this gives don't have any specific meaning - the exact score for each word will depend on your choice of dictionary and initial guess letters. But so long as you use them consistently, the scores should stand as an indicator of each word's relative 'quality'.

For example, by this system, CAT gets a score of 5.3, and EQUALITY gets a score of 1.4. So CAT is a much better hangman word than EQUALITY.

So What's the Best Word

JAZZ, apparently. In fact, while researching this blog, I came across a few other people's attempts to find the best words for hangman. Pleasingly, they went with different approaches to me.

In fact, JAZZ never occurred to me when I was thinking up random good words. The best I thought of was WAX. Also reassuring is the fact that JAZZ does score highly under my system - 6.07 - and scores only slightly better than WAX, with its score of 5.92

QUIZ is an interesting word. It contains the two least common consonants and the two least common vowels (P=9.89%). But there are only 12 words that match _UI_, so it has low entropy. Its final score is 4.13 - lower than that of CAT, with its more common letter set (P=20.0%).

Similarly GYM; it has P=6.34% and has no vowels, but there are only 29 words it could be, so has a score of 5.48. Or MY, with P=4.38% but only 5 possibilities, score 3.18. So it's a matter of balance between probability and entropy.

Ideally, I'd have run a whole dictionary through my system to look for a list of best words. But I didn't. You can if you really want. Otherwise, I'd recommend one of the blogs linked above for lists.

In Summary

So there you go. You can score some of your own words if you so wish. But for on the fly word assessment, just remember, pick:

1) Words with 'uncommon' letters
2) Words with fewer unique letters
3) Obscure words
4) Shorter words
5) Words with fewer repeated letters
6) Words with high entropy patterns

And if you're trying to guess a 4 letter word with an A in the second position, odds are someone is trying to outfox you with JAZZ.


[Full disclosure: If I got the information theory stuff wrong, it's because I only know as much as the one chapter of this book, and odd bits on wikipedia.]