WordPress plugin reCAPTCHA - Digitize books while stopping spam

A Lifehacker article today led me to the reCAPTCHA project. This fascinating project creates CAPTCHAs from words that OCR software fails to read while digitizing text, then serves those CAPTCHAs to your site. The result is a seemingly symbiotic process - you prevent comment spam on your site with their CAPTCHA, and they receive assistance from thousands of humans correcting OCR errors. According to reCAPTCHA’s project description:

reCAPTCHA improves the process of digitizing books by sending words that cannot be read by computers to the Web in the form of CAPTCHAs for humans to decipher. More specifically, each word that cannot be read correctly by OCR is placed on an image and used as a CAPTCHA. This is possible because most OCR programs alert you when a word cannot be read correctly.

But if a computer can’t read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here’s how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
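To make that description concrete, here is a minimal Python sketch of the two-word check as I understand it from the quote above. It is not reCAPTCHA’s actual code: the names (`UnknownWord`, `check_challenge`), the case-insensitive exact match, and the thresholds of 3 votes and 75% agreement are all my own assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UnknownWord:
    """A word the OCR could not read; human guesses are stored as votes."""
    image_id: str
    votes: Counter = field(default_factory=Counter)

    def consensus(self, min_votes: int = 3, min_share: float = 0.75) -> Optional[str]:
        """Return the leading guess once enough people agree, else None.

        Both thresholds are illustrative assumptions, not reCAPTCHA's values.
        """
        total = sum(self.votes.values())
        if total == 0 or total < min_votes:
            return None
        guess, count = self.votes.most_common(1)[0]
        return guess if count / total >= min_share else None


def normalize(text: str) -> str:
    # Assumed matching rule: ignore case and surrounding whitespace.
    return text.strip().lower()


def check_challenge(control_answer: str, unknown: UnknownWord,
                    typed_control: str, typed_unknown: str) -> bool:
    """Pass the user only if the known (control) word is typed correctly.

    When the control word matches, the guess for the unknown word is
    recorded as one vote. When it does not match, nothing is recorded.
    """
    if normalize(typed_control) != normalize(control_answer):
        return False
    unknown.votes[normalize(typed_unknown)] += 1
    return True
```

With these assumed thresholds, a single response never settles a word: it takes at least three agreeing answers from users who also got the known word right before `consensus()` returns anything.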

I decided to check out this slick-sounding project. It’s pretty easy to integrate their CAPTCHA plugin into WordPress:

  1. Sign up for an account at reCAPTCHA - as far as I can tell, the service is free
  2. Register for an API key, one per domain
  3. Download the WordPress plugin (they have plugins for other applications as well)
  4. Upload, activate and set options for the plugin
  5. Follow the instructions here to insert the CAPTCHA into your comment loop

You can see the results below. The interface is a little confusing - it would be nice if it used a smaller field and a smaller widget, perhaps something more like bot-check. Perhaps reCAPTCHA 2.0 will integrate better into existing forms. Even so, it’s worth the small added confusion to help digitize books for projects like the Internet Archive.

Unless, of course, people stop leaving comments.

After playing with the plugin a little, I noticed the letters are sometimes hard to discern. For instance, I can’t even guess what the word on the left is in this one:

[Image: reCAPTCHA widget]
Dave says it’s cit.), - good call.

My guess is this widget’s refresh button will get a lot of use.

What do you think - is the interface too confusing? Test it out and tell me in the comments below.

UPDATE:

I decided to disable the plugin - it’s too cumbersome as it exists now, and I don’t want to make visitors work hard to leave comments. I like the idea of the project though, and hope a version 2.0 is in the works. I’ll leave up a screenshot of the plugin interface.

12 Comments

  1. Somebaudy:

    What if I only get one of the words right? I wanted to install the plugin, but now I’m having second thoughts…

  2. Cris:

    It seems to be inconsistent - I’ve typed only one of the two words and have both failed and succeeded.

    I have to agree - I’m reluctant to leave this plugin installed permanently. The interface is a little confusing and requires more than just typing something in a box - first you type two words, then you copy a code into another box. I can imagine that will be too much of a process for many people wanting to leave comments.

    Of course, that in and of itself might be seen by some as a filter of sorts as well. :)

  3. SusanO:

    I’d like to see one word per box, replace the speaker icon (what, am I going to be serenaded?) with the more commonly used wheelchair icon, and make the whole thing just a leeeeetle bit bigger.

  4. Josh:

    Just testing it out.

    Seems like a cool idea though I think there would be a huge potential for malicious activity, i.e. feeding incorrect words into the captcha for one word while doing the first correctly. I dunno.

  5. Frank:

    Just testing it out.

    Seems like a cool idea though I think there would be a huge potential for malicious activity, i.e. feeding incorrect words into the captcha for one word while doing the first correctly. I dunno.

  6. Dave:

    That’s true, Josh. Malicious corruption needs to be considered. But I’m not sure a person would always be able to tell which word is which. Like right now I have the choices lantern and foribio. Both are fairly easy to read, and unless they aren’t doing any sort of dictionary check, I know what my guess would be. But there isn’t a hard indication.

    Another way would be to rely on stats, and offer up each word multiple times.

  7. Cris:

    I like the idea of relying on statistical returns, especially since the captcha widget serves up words with punctuation. Assuming someone types in the second word correctly and relying on that one response seems a little overly trusting.

  8. a:

    @SusanO:

    The reason there are two words in the box is that the reCaptcha software doesn’t actually know the answer to the first one: the first word is the one that was incorrectly read by the OCR software and is being “read” by the captcha user. To verify that the poster is not a script or whatnot, there has to be a second word to which the widget knows the answer. I can’t think of a way around this for now while providing the same functionality.

  9. David Millar:

    Just an FYI to the comments above - I believe I read that they offer the misread words multiple times and build up a confidence level for the correct word before choosing a permanent answer.

  10. Ben Maurer:

    Just fyi, the whole “copy the code into a text box” thing is because you have JS turned off. Most visitors don’t have to go through that process.

    Also, if you can’t read one of the words, the refresh button is always there to help. We use this as a signal that the CAPTCHA is too hard, and eventually stop serving it to users. We also take a hint from inconsistent answers on a given word.

  11. jameswillisisthebest:

    This is my first post
    just saying HI
