This content was originally published at Kyan’s blog at http://kyan.com/blog/2008/7/23/the-future-of-captcha, but is no longer available at that location. Reproduced here with permission.

The future of CAPTCHA

CAPTCHA (standing for Completely Automated Public Turing test to tell Computers and Humans Apart) must have seemed like a good idea when it was first invented in 2000. Spam was becoming a major problem on the web and a method was needed to fight back. At first glance CAPTCHA seems ideal: a distorted image that is instantly recognisable by humans yet incomprehensible to machines. Place some letters in the distorted image, ask the user to type them back, and bingo: you’ve stopped your spam problem.

Real life, though, is rarely so easy. The problem is that spam is profitable, and because of that it’s worthwhile to write programs that try to crack CAPTCHAs. The original CAPTCHA examples are now trivial for current algorithms to recognise, and the only option developers had was to increase the complexity of the distortion. Successive CAPTCHA systems have added more distortion, extraneous lines and shapes, fuzz on the letters, multiple colours and different sizes, all in an attempt to stay ahead of the spammers. This has led to the current situation where CAPTCHAs are so complex that it’s difficult, if not impossible, for a large proportion of humans to recognise any particular one, yet a sizeable proportion of CAPTCHA-breaking bots can solve that same one.

There is a further problem with CAPTCHAs – they are a complete block to many web users who have visual difficulties. Web standards demand alternative text for any image that contains information, but alternative text for a CAPTCHA image would defeat the test entirely. By using these tests we (as an industry) ghettoise a complete section of web users. Various workarounds have been proposed and implemented – for example reCAPTCHA’s audio equivalent – but these tend to be extremely difficult to use as well.

So is it possible to make CAPTCHA better? In its current form I’d have to say no: we’ve now reached the state where computers are so good at letter recognition that any system that lets the majority of humans through is going to be susceptible to bots. More recently, researchers have concentrated their efforts on upping the difficulty of recognition by switching to photos instead of words. Microsoft’s Asirra project was one of the first to attempt this: they use a database of cat and dog photos and ask the user to select the cats from a randomly chosen set of twelve.

At first glance this seems like a good solution, but it too has major problems. The first is sample set size: although Asirra has a set of around three million photos, this isn’t big enough to provide a completely new image every time one is presented. Given a bit of time a spammer (possibly co-operating with other spammers) could easily build a database mapping each photo to its animal (analogous to rainbow tables in password cracking). This can be worked around by programmatically generating the images – see this recent attempt – but both approaches fall prey to what is the ultimate problem for any CAPTCHA system: using humans to solve the tests instead. The basic idea with this approach is either to use a pay-for-services system like Amazon’s Mechanical Turk or simply to offer something of small value, like a mobile phone ringtone, in return for a solution. The spammer simply passes the CAPTCHAs they want cracked through to the human workforce and receives the answers in return.
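
To see why a fixed image set is so fragile, here is a minimal sketch in Python of the lookup-table attack described above (the function names and structure are hypothetical). It fingerprints images with an exact hash for simplicity; a real attacker would use a perceptual hash so that resized or re-encoded copies of the same photo still match.

    import hashlib

    def fingerprint(image_bytes):
        # Exact-byte fingerprint, for simplicity. A perceptual hash
        # would be needed to survive re-encoding in practice.
        return hashlib.sha256(image_bytes).hexdigest()

    lookup = {}  # fingerprint -> "cat" or "dog", labelled once by humans

    def record_label(image_bytes, label):
        # Phase 1: harvest challenge images and label each one once.
        lookup[fingerprint(image_bytes)] = label

    def classify(image_bytes):
        # Phase 2: answer future challenges from the table for free.
        return lookup.get(fingerprint(image_bytes))  # None if unseen

Once the table covers enough of the three million photos, every challenge the bot has seen before costs it nothing to solve.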

If we can’t rely on CAPTCHAs, then how can we stop spammers from abusing the services we want to provide? There are various alternatives, none of which is 100% effective at blocking spam, but which, used in combination, can remove the majority while still keeping the service accessible. First off, it’s a good idea to blacklist previous offenders and to provide traps in the form of hidden form controls that, if filled in, automatically invalidate the submission. Form-filling robots might not notice that they’re not meant to check them. We can also use common-sense questions to filter humans from bots (the classic example is “What colour is an orange?”), but more complex examples can be difficult for users with cognitive difficulties, and you have to make sure your collection of questions is large enough if you’re worried that spammers might focus on your site rather than just trying it as part of a random sweep.
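
As an illustration, a server-side version of those checks might look something like this sketch in Python (the field names, blacklist entries and question data are all made up for the example):

    BLACKLIST = {"203.0.113.7"}   # IPs of previous offenders (made up)
    QUESTIONS = {"q1": "orange"}  # question id -> expected answer

    def looks_human(form, remote_ip):
        if remote_ip in BLACKLIST:
            return False
        # "website" is a honeypot field hidden with CSS; a human never
        # sees it, so any value here betrays a form-filling bot.
        if form.get("website", "").strip():
            return False
        # Common-sense question, e.g. "What colour is an orange?"
        expected = QUESTIONS.get(form.get("question_id", ""))
        if expected is None:
            return False
        return form.get("answer", "").strip().lower() == expected

None of these checks is clever on its own; the point is that each one is invisible or trivial for a human while tripping up a naive bot.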

Assuming you’re trying to protect a content submission form and not a user account generator, though, the best solution is to base your reckoning of whether a user is a spammer on the content they submit. Anti-spam email services work in much the same way: alongside Bayesian filtering, they assign a points value to each of a set of rules that indicate whether something is spam. Every time a piece of content is submitted we check it against each rule, and if a chosen points threshold is reached we reject it. Examples of these rules could be “Does it contain the word Viagra?”, “Is there a web link?” and “Is this the first time the user has commented?”. Individually these are unlikely to be a problem; together they could be a sign that the user is a spammer.
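
That scoring scheme can be sketched in a few lines of Python. The rules, weights and threshold below are invented for illustration and would need tuning against real submissions:

    import re

    # Each rule returns True when it fires; the number is its points
    # value. Field names and weights are made up for this example.
    RULES = [
        (lambda s: "viagra" in s["text"].lower(), 3.0),
        (lambda s: re.search(r"https?://", s["text"]) is not None, 1.5),
        (lambda s: s["previous_comments"] == 0, 1.0),
    ]

    SPAM_THRESHOLD = 4.0  # arbitrary cut-off; tune against real data

    def is_spam(submission):
        score = sum(points for rule, points in RULES if rule(submission))
        return score >= SPAM_THRESHOLD

    # A first-time comment containing "viagra" and a link scores
    # 3.0 + 1.5 + 1.0 = 5.5 and is rejected; a first-time comment
    # with just a link scores 1.5 + 1.0 = 2.5 and gets through.

The appeal of this design is that no single rule has to be reliable: each one only nudges the score, and legitimate users who trip one rule still get through.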

Of course, if you know that your target audience is going to be limited to a particular set of users then you can do something like RBI’s signup form!