Reddit blocking all major search engines, except Google

gedaliyah@lemmy.world · 5 months ago

Vanth@reddthat.com · 5 months ago

Cool, thank you. You seem to know quite a bit about this stuff.

If we do end up at a point without search engines, where AI does the search and summarizes an answer, what do you think their level of ability to tie back to source material will be?

I’m thinking of cases like asking about a technical detail for a hobby, “how do I get x to work”. I don’t necessarily want a response like “connect blue wire to red”. What I really want is the forum posts discussing the troubleshooting and solutions from various people. If an AI search can’t get me to those forums, it’s of little value to me, and when I do figure out an answer acceptable to my application, I’m not tied into that forum to share my findings (and generate new content for the AI to index).

Related to that, I’m thinking about these stories of lawyers relying on AI to write their briefs, and the AI cites non-existent cases as if they were real. It seems to me, not at all a programmer, that getting an AI to the point where it knows what’s real and what’s a hallucination would be a challenge. And until we get to that point, it’s hard to put full trust into an AI search.

tal@lemmy.today · edit-2 5 months ago

> If we do end up at a point without search engines, where AI does the search and summarizes an answer, what do you think their level of ability to tie back to source material will be?

I haven’t used the text-based search queries myself; I’ve used LLM software, but not for this, so I don’t know what the current situation is like. My understanding is that the current approach doesn’t really permit it, for two reasons:

There isn’t a direct link between one source and what’s being generated; the model isn’t really structured so as to retain this.
Many different sources probably contribute to the answer.

All information contributes a little bit to the probability of the next word that the thing is spitting out. It’s not that the software rapidly looks through all pages out there and then finds a single reputable source that it could then cite, the way a human might. That is, you aren’t searching an enormous database when the query comes in, but repeatedly predicting which word comes next in the correct response, and that probability is derived from many different sources. Maybe tens of thousands of people have made posts on a given subject; the response isn’t just a quote from one, and the generated text may appear in none of them.
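To make that concrete, here is a toy sketch in Python. It is a simple bigram word counter, nowhere near a real LLM, and the four "posts" are made up for illustration. It only shows the point above: the next-word probabilities are pooled from every post, and the generated sentence need not appear in any one of them.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus standing in for "many forum posts on one subject".
posts = {
    "post_a": "connect the red wire to power",
    "post_b": "connect the red wire to battery",
    "post_c": "connect the blue wire to ground",
    "post_d": "connect the green wire to ground",
}

def train_bigrams(corpus):
    """Count which word follows which, pooled across every post."""
    counts = defaultdict(Counter)
    for text in corpus.values():
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Probability of each possible next word after `prev`."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}

def generate(counts, start, max_words):
    """Greedy generation: always emit the most likely next word."""
    out = [start]
    while len(out) < max_words:
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

counts = train_bigrams(posts)
generated = generate(counts, "connect", 6)

print(next_word_probs(counts, "to"))   # all four posts contribute here
print(generated)
print(generated in posts.values())     # is it a quote from any single post?
```

The output sentence takes "red" from two posts and "ground" from two other posts, so there is no single post you could point at as "the source".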

To put that in terms of how a human might think, and to place you in the generative AI’s shoes: suppose I say to you “draw a house”. You draw a house with two windows, a flowerbed out front, whatever. I say “which house is that?” You can’t tell me, because you’re not trying to remember and present one house; you’re presenting me with a synthetic aggregate of many different houses, and probably every house you’ve ever seen has contributed a bit to it. Maybe you could think of a given house that you’ve seen in the past that looks a fair bit like that house, but that’s not quite what I’m asking you to tell me. The answer is really “it doesn’t reflect a single house in the real world”, which isn’t really what you want to hear.

It might be possible to basically run a traditional search for a generated response to find an example of that text, if it amounts to a quote (which it may not!).
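A rough sketch of what that backwards search could look like, using Python's standard `difflib` for fuzzy matching. The URLs, the source texts, and the 0.8 threshold are all made up for illustration; I'm not claiming any search engine does it this way.

```python
from difflib import SequenceMatcher

# Hypothetical indexed pages (URL -> text); placeholders, not real sites.
sources = {
    "forum.example/thread/1": "connect the red wire to power",
    "forum.example/thread/2": "connect the red wire to battery",
    "forum.example/thread/3": "connect the blue wire to ground",
}

def find_sources(generated, sources, threshold=0.8):
    """Return (url, is_exact_quote, similarity) for pages resembling the text."""
    hits = []
    for url, text in sources.items():
        exact = generated in text
        ratio = SequenceMatcher(None, generated, text).ratio()
        if exact or ratio >= threshold:
            hits.append((url, exact, round(ratio, 2)))
    # Exact quotes first, then the closest fuzzy matches.
    return sorted(hits, key=lambda h: (h[1], h[2]), reverse=True)

# Case 1: the response happens to be a verbatim quote from one page.
quoted = find_sources("connect the red wire to power", sources)

# Case 2: the response was synthesized and appears on no page verbatim.
synthesized = find_sources("connect the red wire to ground", sources)

print(quoted)
print(synthesized)
```

In the second case you only get near misses, and each of them says something different from the generated answer, which is exactly the problem with treating them as "the source".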

And if Google produces some kind of “reliability score” for a given piece of material and weights the material in the training set by that (which I’d guess they will do, if they don’t already), they could maybe use the reliability score to rank sources when doing that backwards search for relevant ones.
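Purely as a guess at how such a ranking might work, here is a small Python sketch. The reliability numbers, the URLs, the word-overlap similarity, and the idea of multiplying the two together are all my own assumptions for illustration, not anything Google has described.

```python
# Hypothetical pages: URL -> (text, reliability score between 0 and 1).
pages = {
    "forum.example/t/1": ("connect the red wire to power", 0.4),
    "manual.example/wiring": ("connect the blue wire to ground", 0.9),
    "blog.example/post": ("paint the house red", 0.8),
}

def word_overlap(a, b):
    """Jaccard similarity on word sets: shared words / all distinct words."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)

def rank_sources(generated, pages, min_similarity=0.5):
    """Drop unrelated pages, then order the rest by similarity * reliability."""
    scored = []
    for url, (text, reliability) in pages.items():
        similarity = word_overlap(generated, text)
        if similarity >= min_similarity:
            scored.append((url, similarity * reliability))
    return [url for url, _ in sorted(scored, key=lambda s: s[1], reverse=True)]

ranked = rank_sources("connect the red wire to ground", pages)
print(ranked)
```

Here the two wiring pages resemble the answer equally well, so the reliability score decides the order, and the unrelated blog post is dropped even though its reliability is high.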

But there’s no guarantee that that will succeed, because they’re ultimately synthesizing the response, not just quoting it, and because it can come from many sources. There may potentially be no one source that says what Google is handing back.

It’s possible that there will be other methods than the present ones used for generating responses in the future, and those could have very different characteristics. Like, I would not be surprised, if this takes off, if the resulting system ten years down the road is considerably more complex than what is presently being done, even if to a user, the changes under the hood aren’t really directly visible.

There’s been some discussion about developing systems that do permit this, and I believe that if you want to read up on it, the term used is “attributability”, but I have not been reading research on it.

Vanth@reddthat.com · 5 months ago

Attribution, great term to search. Thank you.

Websearching “attribution + AI” brings up a lot of hits on copyright concerns. Which opens up even more questions. If we get to the point where AI attributes its sources with some sort of scoring, then it’s almost certainly going to be using copyrighted materials at times. And depending on the copyright and what profits the AI company is gaining from their use, and probably a bunch more detailed copyright stuff beyond my civilian knowledge, there are probably financial and legal reasons for AI searches to not publicly attribute sources. Which loops me back to this: in many cases, I want to see the conflicting materials and make a judgement call on the final summary myself.

I’m sure there are many people much smarter than me with nothing but pure, ethical intentions figuring all this out. Who knows, maybe this will be the tipping point for better copyright and intellectual property protections in the US and elsewhere.