- cross-posted to:
- apple_enthusiast@lemmy.world
They are large LANGUAGE models. It’s no surprise that they can’t solve those mathematical problems in the study. They are trained for text production. We already knew that they were no good at counting things.
“You see this fish? Well, it SUCKS at climbing trees.”
That’s not how you sell fish though. You gotta emphasize how at one time we were all basically fish and if you buy my fish for long enough, those fish will eventually evolve hands to climb!
“Premium fish for sale: GUARANTEED to never climb your trees”
The fun part isn’t even what Apple said - that the emperor is naked - but why it’s doing it. It’s a nice bullet against all four of its GAFAM competitors.
They’re a publicly traded company.
Their executives need something to point to so they can push back against pressure to jump on the trend.
This right here, this isn’t conscientious analysis of tech and intellectual honesty or whatever; it’s a calculated shot at its competitors, who are desperately trying to keep the generative AI house of cards from falling.
> The results of this new GSM-Symbolic paper aren’t completely new in the world of AI research. Other recent papers have similarly suggested that LLMs don’t actually perform formal reasoning and instead mimic it with probabilistic pattern-matching of the closest similar data seen in their vast training sets.
WTF kind of reporting is this, though? None of this is recent or new at all, like in the slightest. I am shit at math, but have a high-level understanding of statistical modeling concepts, mostly as of a decade ago, and even I knew this. I recall a stats PhD describing models as “stochastic parrots”: nothing more than probabilistic mimicry. It was obviously no different the instant LLMs came on the scene. If only tech journalists bothered to do a superficial amount of research, instead of being spoon fed spin from tech bros with a profit motive…
> If only tech journalists bothered to do a superficial amount of research, instead of being spoon fed spin from tech bros with a profit motive…
This is outrageous! I mean the pure gall of suggesting journalists should be something other than part of a human centipede!
> describing models as “stochastic parrots”
That is SUCH a good description.
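If anyone wants to see the “parrot” part made concrete, here’s a toy sketch of what probabilistic pattern-matching means in the crudest possible form: a bigram model that only ever predicts the next word from frequencies it saw in its training text. (This is obviously not how a transformer works internally, just the principle of next-token prediction shrunk down to something you can read in ten lines.)

```python
# Toy "stochastic parrot": predict the next word purely from how often it
# followed the previous word in the training text. No arithmetic, no logic.
import random
from collections import defaultdict, Counter

corpus = ("two plus two is four . two plus three is five . "
          "three plus three is six .").split()

follows = defaultdict(Counter)          # word -> Counter of words seen after it
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start, length=5):
    out = [start]
    for _ in range(length):
        options = follows[out[-1]]
        if not options:
            break
        words, counts = zip(*options.items())
        # Sample the continuation in proportion to how often it was seen.
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

print(generate("two"))  # can happily produce "two plus three is four ." - fluent, wrong
```

Scale the same objective up to billions of parameters and a web-sized corpus and the output gets vastly more fluent, but “predict the next token” is still the training goal, which is why something can sound right and still be wrong.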
Are the uncensored models more capable tho?
Given the use cases they were benchmarking I would be very surprised if they were any better.
> statistical engine suggesting words that sound like they’d probably be correct is bad at reasoning
How can this be??
Totally unexpectable!!!
antianticipatable!
astonisurprising!
I feel like a draft landed on Tim’s desk a few weeks ago, which would explain why they suddenly pulled back on OpenAI funding.
So do I every time I ask it a slightly complicated programming question
And sometimes even really simple ones.
Did anyone believe they had the ability to reason?
People are stupid, OK? I’ve had people who think that it can, in fact, do math “better than a calculator”.
Yes
> The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions.
Good thing they’re being trained on random posts and comments on the internet, which are known for being succinct and accurate.
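For anyone who hasn’t read the paper, the modification is roughly this shape (an invented example in the spirit of the quoted description, not one of the paper’s actual GSM-Symbolic templates):

```python
# Sketch of the perturbation described above: take a grade-school word problem
# and append a detail that *sounds* relevant but changes nothing about the answer.
# (Made-up example for illustration; not the paper's actual template text.)
base = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?")

distractor = ("Five of the kiwis picked on Saturday were a bit smaller "
              "than average. ")

# Insert the no-op sentence right before the question.
perturbed = base.replace("How many", distractor + "How many")

print(perturbed)
# The answer is 44 + 58 = 102 either way. Something that actually reasons ignores
# the extra sentence; a pattern-matcher is tempted to subtract the "smaller" five,
# because that's what similar-looking training examples usually call for.
```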
Here’s the cycle we’ve gone through multiple times and are currently in:
- AI winter (low research funding)
- Incremental scientific advancement
- Breakthrough: multiple incremental advances to the scientific models build on each other and unlock new capabilities (expert systems, neural networks, LLMs, etc.)
- Engineering creates new tech products/frameworks/services based on the new science
- Hype for the new tech creates sales, economic activity, research funding, subsidies, etc.
- People become familiar with the new tech’s capabilities and limitations through use (for LLMs, we’re here)
- The hype spending bubble bursts when the overspend doesn’t pay off in “infinite money, line goes up” returns or new research breakthroughs
- Back to AI winter, and around again…
Someone needs to pull the plug on all of that stuff.