Standardized tests were always a poor measure of comprehensive intelligence.
But the idea, popular on Lemmy, that “LLMs aren’t intelligent” seems to be based on a misinformed understanding of how LLMs actually work.
At this point there have been multiple replications of the finding that transformers build world models abstracted from their training data rather than just relying on surface statistics.
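To make “probe for a world model” concrete, here’s a minimal sketch of the linear-probing methodology that line of research uses. This is not the Othello-GPT experiment itself; it probes GPT-2’s hidden states for a toy property (whether a sentence’s subject is singular or plural) with an off-the-shelf linear classifier, purely to show what “train a probe on the activations” means. The basic idea is the same: if a simple probe can read a property off the internal activations, the model is representing that property rather than just echoing token statistics.

```python
# Toy sketch of linear probing on a transformer's hidden states.
# Not the Othello-GPT setup -- just the general methodology, on a made-up
# property, using GPT-2 and scikit-learn.
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import LogisticRegression

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Tiny hand-made dataset: label 1 = plural subject, 0 = singular subject.
sentences = [
    ("The dog runs across the yard", 0),
    ("The dogs run across the yard", 1),
    ("A child plays in the park", 0),
    ("The children play in the park", 1),
    ("The bird sings in the morning", 0),
    ("The birds sing in the morning", 1),
]

features, labels = [], []
with torch.no_grad():
    for text, label in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # Mean-pool the final layer's hidden states into a single vector.
        vec = outputs.hidden_states[-1].mean(dim=1).squeeze(0)
        features.append(vec.numpy())
        labels.append(label)

# Fit a linear probe on the activations. In the actual papers the probe is
# trained to recover things like the latent board state, and is evaluated
# on held-out data; with six toy examples this is only an illustration.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe accuracy on training set:", probe.score(features, labels))
```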
The free version of ChatGPT (which I’m guessing is what most people have direct experience with) runs on several-year-old tech that is (and always has been) pretty dumb. Something like Claude 3 Opus is far more capable at critical thinking than GPT-3.5.
A lot of the word problems that models supposedly ‘fail’ are evaluating the wrong thing. When you give an LLM a variation of a classic word problem, the frequency of the standard form in the training data biases the answer back towards it unless you take measures to break the token similarities. If you do that, most modern models actually get the variation completely correct.
So for example, if you ask it to get a vegetarian wolf, a carnivorous goat, and a cabbage across a river, it will mess up even with standard prompting techniques. But if you ask it to get a vegetarian 🐺, a carnivorous 🐐, and a 🥬 across, it will get it correct.
GPT-3.5 will always fail it, but GPT-4 and more advanced models will get it right. And recently I’ve started seeing models get it right even without the emoji substitution, and trip up less on variations in general.
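If you want to try the comparison yourself, here’s a rough sketch using the OpenAI Python client (any chat-completion API would work the same way); the model name is just a placeholder to swap for whatever you want to test. It sends the puzzle once with the normal wording and once with emoji standing in for the animals, so the token sequence no longer matches the classic riddle.

```python
# Sketch: compare a model's answers on the standard wording vs. the
# emoji variant of the modified river-crossing puzzle. Assumes the
# OPENAI_API_KEY environment variable is set; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

standard = (
    "I need to get a vegetarian wolf, a carnivorous goat, and a cabbage "
    "across a river. I can only take one of them in the boat at a time. "
    "How do I get all three across without anything being eaten?"
)
# Same puzzle, with emoji substituted to break the surface similarity
# to the classic version of the riddle.
variant = (
    standard.replace("wolf", "🐺")
    .replace("goat", "🐐")
    .replace("cabbage", "🥬")
)

for label, prompt in [("standard wording", standard), ("emoji variant", variant)]:
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder; compare against e.g. "gpt-3.5-turbo"
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} ---")
    print(reply.choices[0].message.content)
```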
The field is moving rapidly and much of what was true about LLMs a few years ago with GPT-3 is no longer true with modern models.
This isn’t correct, and research over the past year has shown it isn’t correct over and over again.
https://arxiv.org/abs/2310.07582
Just a few of the relevant papers you might want to check out before stating things as facts.