INFOWORLD GRIPE LINE BY ED FOSTER

Borderline searches and seizures | 19 comments (19 topical) | Post A Comment
Word problems (#14)
by Anonymous User on Fri Jul 04, 2008 at 12:23:11 PM PDT

Math word problems are a good idea, but they shouldn't be too naive or they will still fall to brute-force attacks: a script can parse out the numbers and guess the operation, perhaps based on words or phrases (e.g. "together" suggests addition, "how many more" suggests subtraction, etc.).
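To make the brute-force worry concrete, here is a rough sketch of the kind of script I mean. The keyword table and the function names are my own invention for illustration, not taken from any real spam bot; a real attacker would presumably use a much longer list of phrases.

```python
import re

# Hypothetical lookup tables; only the two phrases mentioned above are included.
WORD_NUMBERS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
KEYWORD_OPS = [
    ("how many more", lambda a, b: abs(a - b)),  # suggests subtraction
    ("together", lambda a, b: a + b),            # suggests addition
]


def extract_numbers(text):
    """Return every number in the text, whether written as digits or as a word."""
    numbers = []
    for token in re.findall(r"[a-z]+|\d+", text.lower()):
        if token.isdigit():
            numbers.append(int(token))
        elif token in WORD_NUMBERS:
            numbers.append(WORD_NUMBERS[token])
    return numbers


def naive_guess(problem):
    """Guess an answer from the first two numbers and a keyword, or give up (None)."""
    numbers = extract_numbers(problem)
    if len(numbers) < 2:
        return None
    lowered = problem.lower()
    for phrase, operation in KEYWORD_OPS:
        if phrase in lowered:
            return operation(numbers[0], numbers[1])
    return None
```

A problem like "Tom has 3 apples and Sue has four. How many do they have together?" falls to this immediately, which is why the naive kind of word problem isn't enough.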

Here's a fairly good example. "John has five apples and eight oranges. Amy has three apples and two oranges. How many cores will be left over once they eat them all?" The answer is obviously eight, since that's the total number of apples and only the apples will result in cores. No script is likely to be able to solve problems resembling this one -- there's nothing easily parsed out of it to suggest which of the supplied numbers to use. The downside is that a fairly clever script might figure out that there are only a few likely right answers and try one at random, and a percentage of those guesses will get through.

The ending, and right answer, can be changed in a few ways that machines would have difficulty identifying:

  • How many fruit do they have? 18
  • How many fruit do they have that won't leave cores? Ten
  • How many fruit does the one with the fewest have? 5
  • If they divide what they have equally between them, how many cores will each leave behind? Four
  • Ditto, but how many peels? Five
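
For whoever ends up implementing this, the story and its interchangeable endings could be kept as data plus a small answer function per ending. This is only a sketch: the `Basket` class, the `QUESTION_ENDINGS` list and the `answer_key` function are names I made up, and the equal-split questions assume the totals divide evenly, as they do in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Basket:
    apples: int   # apples leave cores
    oranges: int  # oranges leave peels

    @property
    def total(self) -> int:
        return self.apples + self.oranges


# Each ending pairs the question text with a function that computes its answer.
QUESTION_ENDINGS = [
    ("How many cores will be left over once they eat them all?",
     lambda a, b: a.apples + b.apples),
    ("How many fruit do they have?",
     lambda a, b: a.total + b.total),
    ("How many fruit do they have that won't leave cores?",
     lambda a, b: a.oranges + b.oranges),
    ("How many fruit does the one with the fewest have?",
     lambda a, b: min(a.total, b.total)),
    ("If they divide what they have equally, how many cores will each leave behind?",
     lambda a, b: (a.apples + b.apples) // 2),   # assumes an even apple count
    ("If they divide what they have equally, how many peels will each leave behind?",
     lambda a, b: (a.oranges + b.oranges) // 2), # assumes an even orange count
]


def answer_key(first, second):
    """Return (question, answer) pairs for every ending of the story."""
    return [(question, solve(first, second)) for question, solve in QUESTION_ENDINGS]


john = Basket(apples=5, oranges=8)
amy = Basket(apples=3, oranges=2)
for question, answer in answer_key(john, amy):
    print(answer, question)
```

Randomizing the four fruit counts per page load would give the "random components" without anyone having to write new stories by hand.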

This much variety makes the number of machine-guessable possible right answers fairly large, at least as large as the number of multiple-choice answers now. It can be made larger by throwing in irrelevant details, e.g. "Amy is seventeen years old". Indeed, best is to have a small number of "stories", each containing as many as ten different numbers, and for each story several possible questions that use different subsets (two or three, typically) of those numbers and whose answers range fairly widely, preferably into the low triple digits.

Throw in random numbers, plus images that may include needed numbers (e.g. a girl in a soccer jersey with "56" on the front, or an image that looks like text showing the number 23, which the surrounding text uses as a number), make these large and clear, and you further confuse bots. They will need to use OCR, and a single image might show many potentially significant numbers.

Picture a baseball scoreboard and a question that might ask how many runs the visiting team scored, in total or in a particular inning, or even in a different game entirely. Suppose the text says "Pictured is the final score from their first game against one another. In the rematch, Dumont's Dudley scored their only run, in the third inning. What was Dumont's score?" The answer, one, may not even appear anywhere literally, the scoreboard image is completely irrelevant, and the number three occurs in the text where it's completely irrelevant too. There might be a total of two dozen numbers, 20 of them (inning scores and final scores) in the image plus four in the text, none of which are (or are used to compute) the genuine right answer. But of course the question could instead be "What inning contained Dudley's run?" (3) or "What did Dumont score in the second inning?" (zero) or something about the scoreboard image.

That image, I might add, is one OCR software might choke on, perhaps reading "0001032017" instead of ten separate numbers in a fairly plausible failure mode.

Guessing, even "educated" guessing after parsing text and images, could be made to produce a very low hit-rate, under 1%, at least in principle.
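
A crude way to see why the hit rate drops: count how many distinct answers a bot could plausibly build from the numbers it parsed, and assume it picks one at random. The model below (sums of one to three numbers, plus differences of pairs) is purely my own simplification, and `candidate_answers` and `blind_guess_hit_rate` are made-up names; a real bot would try other operations too, which only enlarges the pool.

```python
from itertools import combinations


def candidate_answers(numbers, max_subset=3):
    """Collect distinct answers a guessing bot might form from the parsed numbers."""
    candidates = set()
    for size in range(1, max_subset + 1):
        for subset in combinations(numbers, size):
            candidates.add(sum(subset))  # a single number, or a sum of two or three
            if size == 2:
                candidates.add(abs(subset[0] - subset[1]))  # "how many more"
    return candidates


def blind_guess_hit_rate(numbers, max_subset=3):
    """Chance that one uniformly random candidate is the right answer (at best)."""
    return 1 / len(candidate_answers(numbers, max_subset))


# The four numbers from the fruit story, then a hypothetical ten-number story.
print(len(candidate_answers([5, 8, 3, 2])))
print(blind_guess_hit_rate([5, 8, 3, 2, 17, 56, 23, 41, 64, 90]))
```

Note that this figure is an upper bound on the bot's luck, since the right answer may not be in its candidate pool at all (the total of 18 fruit, for instance, needs all four numbers).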

In fact, simply asking text-answer questions about images might work wonders. Have a couple dozen stock images and a few hundred questions about them with easy, unambiguous answers that OCR mostly won't work to produce, and watch the bots bash their heads against a brick wall, even if you allow for a certain amount of sloppiness in the answers.
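
As for allowing "a certain amount of sloppiness", one possible approach is to normalize the typed answer and accept anything close enough to a stored answer. The sketch below uses Python's standard `difflib`; the 0.8 threshold and the function names are arbitrary choices of mine and would need tuning against real answers.

```python
import difflib
import re


def normalize(answer):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", answer.lower())
    return " ".join(cleaned.split())


def is_acceptable(given, accepted_answers, threshold=0.8):
    """True if the typed answer is close enough to any stored answer."""
    given_norm = normalize(given)
    for accepted in accepted_answers:
        matcher = difflib.SequenceMatcher(None, given_norm, normalize(accepted))
        if matcher.ratio() >= threshold:
            return True
    return False
```

With this, a misspelling such as "Red Baloon!" would still pass for "red balloon", while an unrelated guess would not.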

This should work especially well when combined with the three-strikes rule suggested by the previous poster.

Just watch for someone to develop a script that exploits the finite repertoire of questions if you do this instead of math problems with random components. I'd just wait until an apparently automated spamming spree succeeds, then completely replace all of the images and questions, wait until it happens again, and repeat as needed. It should happen infrequently enough that, overall, very few spams make it through per day on average and very little work per day is actually required on average.

Presenting and requiring an answer to the captcha only if "href=" appears in the comment will further reduce any impact the captcha has on normal users, while making it a bit more awkward for a would-be spammer to catalogue all of the captcha questions that can occur (if the set is even finite).
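
The link check itself is nearly a one-liner. In this sketch, `needs_captcha` is a hypothetical hook name, and matching case-insensitively with optional whitespace around the "=" is my own addition to catch trivial variations; I don't know how the site's comment software actually parses posts.

```python
import re

# Matches "href=" in any letter case, with optional whitespace before the "=".
LINK_PATTERN = re.compile(r"href\s*=", re.IGNORECASE)


def needs_captcha(comment_body):
    """Only comments that appear to contain a link get the captcha."""
    return LINK_PATTERN.search(comment_body) is not None
```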

