String Comparisons vs. Fuzzy Matching – dig·i·tal/ur·ban·ite

A friend of mine finally started playing my game this weekend. I’d sent out a general email to a bunch of my friends last September and most of them played around a bit. This particular friend decided to look at it this past weekend and boy did she have some comments!

I have like seven emails from her with various complaints, ranging from “what the hell” to “this is the correct answer” to “implement a spellcheck”.

While I argue that someone who is as much of a Star Wars fan as she is (PS: May the Fourth be with you!) should be able to spell Alderaan, I do admit that the strict comparison I currently perform in the game is a little, how shall I put this, unyielding.

Look, I’ve always been a good speller. I know which to/too/two to use. I know how to spell antidisestablishmentarianism. I know how to spell supercalifragilisticexpialidocious. I often take it for granted that other people can, or perhaps should, spell things properly.

So when I was writing the early code for my game, I was like, yes, sure, let’s do a strict string comparison of the submitted answer against the correct answer and the alternate answer.

At the time, I knew that it would be unforgiving and I thought to myself, “meh, I’ll worry about that later.”

So with my friend having written me several emails to complain, I figure that now is later.

As such, rather than do any of the things I desperately have to do in order to finish up the first challenge, I spent all of my coding time this weekend researching fuzzy matching in PHP.

There are apparently four ways to do this. The one that looks the absolute easiest is the similar_text() function. You can use it to compare one thing to another and it’ll give you a percentage of how much it matches. So, for example, if I have “Beverly” as the answer to a Star Trek: The Next Generation question, but someone answers “Beverley”, I can do this:

similar_text("beverley", "beverly", $percent);

echo $percent;

And that gives you 93.333333 (repeating, of course).

So that means that the user-submitted string of “beverley” is a 93.33% match to the actual correct answer of “beverly”.

Using this, I can set an appropriate percentage limit to grant a correct answer. So like, if the limit is 93% or under, then “beverley” will match “beverly” and will get the question correct.
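The threshold check described above could look something like this. It is only a minimal sketch: the helper name isCloseEnough() and the default 90% cutoff are assumptions made for illustration, not code from the game.

```php
<?php
// Hypothetical helper: returns true when the submitted answer is at least
// $threshold percent similar to the expected answer, per similar_text().
function isCloseEnough(string $submitted, string $expected, float $threshold = 90.0): bool
{
    // Normalise case and surrounding whitespace before comparing.
    $submitted = strtolower(trim($submitted));
    $expected  = strtolower(trim($expected));

    // similar_text() writes the similarity percentage into $percent.
    similar_text($submitted, $expected, $percent);

    return $percent >= $threshold;
}

var_dump(isCloseEnough('Beverley', 'beverly')); // bool(true): about 93.33% is above 90%
```

Lowercasing and trimming first means that capitalisation and stray spaces don’t eat into the percentage, so the limit only has to account for actual spelling differences.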

HOWEVER.

A lot of this works on length. So while adding an extra letter to Beverly isn’t a big deal, what about Data’s cat, Spot? If we’re looking for “spot” as an answer, what if someone adds an S and says “spots”? It’s a shorter word, so that only gives an 88.88888% (repeating) match. Should be simple, right? Just set the limit a little under 90 percent and we’re good, right?

Well, what if it’s a Red Dwarf question and I’m looking for the answer “dave” but someone submits “david”? Again, since it’s such a short word and because two letters are different (the e replaced with an i, and a d added), this is only a 66.66666% (repeating) match. Shouldn’t someone who types in “david” get points for “dave”?

What about the poor person who typos dave as save? There’s one letter different and so this gives a flat 75% match. While I would argue that save does not equal dave in the least, the S and the D are right next to one another on the keyboard.

What about things that are very similar and yet different? Say I’m looking for ursa major as an answer and someone types in ursa minor. That’s an 80% match. So maybe the match threshold needs to be higher than 80%. But where does that leave us with Dave vs. David, or Dave vs. Save?
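To see the problem side by side, here is a small sketch that runs every pair mentioned so far through similar_text(). The wrapper function matchPercent() is a made-up name for illustration.

```php
<?php
// Hypothetical wrapper around similar_text() that returns the percentage.
function matchPercent(string $submitted, string $expected): float
{
    similar_text($submitted, $expected, $percent);
    return $percent;
}

// Each pair is [submitted answer, expected answer].
$pairs = [
    ['beverley',   'beverly'],
    ['spots',      'spot'],
    ['david',      'dave'],
    ['save',       'dave'],
    ['ursa minor', 'ursa major'],
];

foreach ($pairs as [$submitted, $expected]) {
    printf("%-10s vs %-10s => %.2f%%\n", $submitted, $expected, matchPercent($submitted, $expected));
}
// Prints 93.33%, 88.89%, 66.67%, 75.00% and 80.00% respectively.
```

The wrong answer (ursa minor, 80%) scores higher than two of the answers I might want to forgive (david at 66.67% and save at 75%), so no single percentage limit can accept those while rejecting ursa minor.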

Even aside from “what should the limit be”, I’m now also thinking about whether or not there’s a notice to the user that “mmm, that’s not quite right, but I’ll accept it” and then show them the correct answers. Does it matter if someone spells it Beverley if they get it right? Does it matter if they eventually learn that it’s Beverly? How important is that knowledge if it doesn’t do anything different in the game?
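One way to support a “not quite right, but I’ll accept it” notice is to return three outcomes instead of a plain yes or no. This is a rough sketch under assumptions: the function name gradeAnswer(), the 90% cutoff, the notice wording, and the sample list of accepted answers are all invented for illustration.

```php
<?php
// Hypothetical grader: compares a submission against every accepted answer
// and reports 'exact', 'close' or 'wrong'. The 90% cutoff is an assumption.
function gradeAnswer(string $submitted, array $accepted, float $threshold = 90.0): string
{
    $submitted = strtolower(trim($submitted));
    $best = 0.0;

    foreach ($accepted as $answer) {
        $answer = strtolower(trim($answer));
        if ($submitted === $answer) {
            return 'exact'; // strict comparison still wins outright
        }
        similar_text($submitted, $answer, $percent);
        $best = max($best, $percent); // remember the closest accepted answer
    }

    return $best >= $threshold ? 'close' : 'wrong';
}

// Made-up sample data: a correct answer plus an alternate answer.
$accepted = ['beverly', 'beverly crusher'];

if (gradeAnswer('Beverley', $accepted) === 'close') {
    // Placeholder wording for the notice shown to the player.
    echo "Mmm, that's not quite right, but I'll accept it. Correct: "
        . implode(' / ', $accepted) . "\n";
}
```

Keeping 'exact' and 'close' separate leaves the later decisions open: the game can show the proper spelling, adjust the reward, or do nothing different at all.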

And then, of course, one has to think about whether or not properly spelling something should be an advantage in the game. Should I go farther if I can spell Beverly right as compared to Beverley? Maybe I could show a notice saying “mmmm, I’ll accept it, but you’ll only go half as far for this one” and then divide by two and round down to the nearest whole number (unless the number is zero, in which case it would become one).
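The halving rule is simple enough to sketch. The function name penalisedDistance() is hypothetical, and the sketch assumes the distance a player moves is a whole number.

```php
<?php
// Hypothetical helper: halve the distance for an accepted-but-misspelled
// answer, rounding down, but never dropping below one.
function penalisedDistance(int $distance): int
{
    // intdiv() performs integer division, which rounds down for positive values.
    return max(1, intdiv($distance, 2));
}

echo penalisedDistance(5), "\n"; // 2
echo penalisedDistance(1), "\n"; // 1 (half of 1 rounds down to 0, so it becomes 1)
```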

But does that unfairly reward the good speller? What if I do show the proper spelling, which then gives someone the chance to improve next time?

One of the most fascinating parts of game design, to me, is the question of “sure, I can do it, but should I?” That question has been a constant presence in the back of my mind since I started out on this project. I can do just about anything. But I’m coming at this from my own biases and perspectives, obviously. As someone who is a good speller, it’s only natural to me that I would want to be rewarded for that. But if I were a terrible speller, I think it would suck to only get a portion of what a good speller gets.

Food for thought!
