[long] Some tests of how much AI "understands" what it says (spoiler: very little)

diz@awful.systems · 1 year ago

[long] Some tests of how much AI "understands" what it says (spoiler: very little)

Mii@awful.systems · 1 year ago

My favorite bullshit test is to have the AI count letters. Something as simple as “How many A’s are in this sentence?” will have the thing give you a seemingly random number (which might sometimes even be correct), but as soon as you ask it to list them, it will have a stroke and just make up stuff that makes no sense.

I can’t be arsed to make an account with OpenAI or whatever, so here’s an example with Bing Chat, but this has worked on every single chatbot I’ve tested it on.

Another fun test (as suggested by @flizzo last time I brought this up) is to have the chatbot attempt to crack a simple code. Just Caesar cipher something and ask it to crack it. Hilarity will ensue. You can even ask it to encrypt it first and then to crack it for extra fun.

Me: Encrypt the sentence “I am a stupid roided Clippy” with the Caesar cipher.

Clippy: [After explaining what the Ceasar cipher is, which I didn’t ask for, thanks] Convert the sentence to uppercase: “I AM A STUPID ROIED CLIPPY”. Let’s apply the Caesar cipher to your sentence using a shift of 7.

‘I’ becomes ‘P’

‘A’ becomes ‘H’

‘M’ becomes ‘T’

…

‘Y’ becomes ‘F’

So the encrypted sentence is: “PH HT H ZBHTLW YVPLUU”. 😊

Okay, first of all, it dropped a letter. And the shift isn’t even correct, lol. It said Y becomes F and then does whatever the fuck.

Okay, so let’s give it an easy example, and even tell it the shift. Let’s see how that works.

This shit doesn’t even produce one correct message. Internal state or not, it should at least be able to read the prompt correctly and then produce an answer based on that. I mean, the DuckDuckGo search field can fucking do it!

diz@awful.systems · 1 year ago

I feel like letter counting and other letter manipulation problems kind of under-sell the underlying failure to count - LLMs work on tokens, not letters, so they are expected to have a difficulty with letters.

The inability to count is of course wholly general - in a river crossing puzzle an LLM can not keep track of what’s on either side of the river, for example, and sometimes misreports how many steps it output.

scruiser@awful.systems · 1 year ago

Well, if they were really “generalizing” just from training on crap tons of written text, they could implicitly develop a model of letters in each token based on examples of spelling and word plays and turning words into acronyms and acrostic poetry on the internet. The AI hype men would like you to think they are generalizing just off the size of their datasets and length of training and size of the models. But they aren’t really “generalizing” that much (and even examples of them apparently doing any generalizing are kind of arguable) so they can’t work around this weakness.

The counting failure in general is even clearer and lacks the excuse of unfavorable tokenization. The AI hype would have you believe just an incremental improvement in multi-modality or scaffolding will overcome this, but I think they need to make more fundamental improvements to the entire architecture they are using.

diz@awful.systems · 1 year ago

The counting failure in general is even clearer and lacks the excuse of unfavorable tokenization. The AI hype would have you believe just an incremental improvement in multi-modality or scaffolding will overcome this, but I think they need to make more fundamental improvements to the entire architecture they are using.

Yeah.

I think the failure could be extremely fundamental - maybe local optimization of a highly parametrized model is fundamentally unable to properly learn counting (other than via memorization).

After all there’s a very large number of ways how a highly parametrized model can do a good job of predicting the next token, which would not involve actual counting. What makes counting special vs memorization is that it is relatively compact representation, but there’s no reason for a neural network to favor compact representations.

The “correct” counting may just be a very tiny local minimum, with tall hill all around it and no valley leading there. If that’s the case then local optimization will never find it.

[long] Some tests of how much AI "understands" what it says (spoiler: very little)

[long] Some tests of how much AI "understands" what it says (spoiler: very little)

A couple simple probes:

GPT4 is uncannily good at recognizing the river crossing puzzle

An Idiot With a Petascale Cheat Sheet

Is this a “hallucination”?

But after an update, GPT-whatever is so much better at such prompts.

The need for an Absolute Imbecile Level Reasoning Benchmark

Randomness in bullshitting