DuckDuckGo, Bing, Mojeek, and other search engines are not returning full Reddit results any more.
no mention of brave search, didn’t read the article yet though
After seeing this news I just created this lemmy account. I hope people make the right decision and move on to lemmy.
Welcome and don’t feel shy to contribute!
Welcome.
Heya jack, welcome aboard!
welcome, but maybe consider not using the world instance. it is pretty saturated and the point is to spread users out across many instances instead of having one monolithic one
Same here
Then welcome to you too! There’s a nice selection of apps if you haven’t tried them, since Lemmy has no financial incentive to limit access to the content.
Welcome!
Good.
I am hoping (probably naively so) that lemmy’s stock of technical answers will continue to grow and eventually become a half decent archive for people to search for potential solutions.
Really for technical answers things should be on a forum. Troubleshooting a linux distro, post on the distro’s forums. Troubleshooting a piece of software, make an issue on its codeberg/github/gitlab/etc. It makes sense that if you’re having an issue with a specific thing you ask for help on a forum dedicated to that thing. I don’t think it’s a positive that things are becoming centralised onto generalised social media, even for more decentralised federated social media like Lemmy. It just makes support for a given piece of software more spread out and harder to find.
The most annoying thing with that is that you need an account for every little forum that you want to post in. But you are still right.
This might be a hot take, but I hope that we as a platform are toxic enough to advertisers so that big tech’s enshittification and advertising never becomes a problem.
Twitter still has advertisers.
As I understand it, Lemmy, being FOSS, is pretty immune to this since there are no big tech shareholders to appease. Lemmy is susceptible to EEE (embrace, extend, extinguish) via something like Threads, however.
Problem is that we’ll probably need a dedicated search engine for that. As answers are spread in lots of instances, some of them without “lemmy” in the name, I assume.
Seems like a solvable problem though. We have a list of federated servers innately built into activitypub, right? Just need to tag results from those servers as being linked to a “lemmy” keyword search.
I’m sure I’m oversimplifying it, but all the pieces are there, just need search engines to be smart about how they index. Since there are a couple of federation based models that would be good to index, not just lemmy, it would probably behoove them to figure it out.
Highly doubtful.
The few times I have bothered to ask technical questions I mostly get one of the following:
- Ideological ranting. “The problem is you aren’t running arch linux in that corporate environment with proprietary hardware you need to interface with”
- Complete refusal to read the question. “I totally didn’t read that you said Foo was not viable for reasons XYZ but you should use Foo”
- Complete nonsense
Reddit has a lot of that too but ALSO has the institutional knowledge of people who actually care enough to answer. Similar to stack overflow.
I try to help where I can but this is an enthusiast “site”. So you have all the people who suggest all the crap they heard on linus tech tips rather than “Okay, for my day job we use X but no sane person should use that at home. Look into Y”.
That said: I have said it before and I’ll say it again. The age of the online message board for tech support is long gone. Because the super useful results might be talking about a bug from five years ago rather than a bug from today. The answer really is ephemeral discord servers.
Ephemeral discord servers are awful because they don’t scale and they can only ever help the lowest common denominator of questions/issues. We need something else, but it has yet to present itself as a solution.
I’m sure you’ve had bad experiences but they actually scale as well as any forum ever did and are great because the general vibe is not “This was asked ten years ago, go figure out some search terms” and more actually responding to and helping people.
The key is to have a moderated support channel.
It’s a never ending onslaught of beginner questions and experienced folks with domain knowledge burn out. I’m sure it’s good when it’s new and fresh and everyone is excited to participate, but that wears out. It’s why things went away from mailing lists, or why mailing lists started getting archived, so they could be searched.
I guess with most things it comes in cycles, and we’re at the on demand answers cycle right now.
That has not at all been my experience over the years.
Yes, the vast majority of questions are “beginner questions”. Which… is true no matter where you go.
But when someone has the ability to articulate a “real” problem? Everyone comes out of the woodwork because that is actually interesting. And there is a very strong communal feeling of “we all have the same problem and are trying to collect data”
Usually when I try to get help with a “real” problem through discord, I get crickets
Drop a link to a few places people ask tech questions and I will do my part to contribute
tbh I’ve never seen a Lemmy link when searching for stuff. Is it too small to show up? Or do search engines not index Lemmy instances?
Most of the originalish content on lemmy is linux related stuff, memes and porn. The latter 2 are mostly image/video based, so you don’t search for them very frequently or easily. I can see that in the future it will become a very relevant source of info in linux admin and user circles.
I go back to r*ddit sometimes for some local content which is non existent on lemmy. I see that the tech related subs are mostly dead there, or at least only shadows of their former selves. E.g. go to r/linux, sort by top all time. In the first 100 results you will barely find anything posted after the exodus.
Yeah, the notion that Lemmy is a Reddit replacement is misguided. It definitely doesn’t have the same Q&A balance Reddit does. It feels a lot more like 90s and early 2000s forums than the large-scale self-service link and customer service churn Reddit encourages.
Which I’m all for. I was never a Reddit guy and I do like it here. But in terms of how bad it is now that Reddit is not happy to host most of the actually useful online content for free… well, that’s a different conversation.
Yeah I mostly go back for r/BestofRedditorUpdates to get my trash drama fix and r/nursing to commiserate with my people. I’ve tried bringing in more hcw communities but sometimes it’s tiring to be the first of a few to move over. It elicits some pretty strong feelings of isolation.
Twice I have come across links to lemmy, definitely not the norm though.
I’m inclined to think that, due to the nature of the platform, content is constantly duplicated in the eyes of search engines, which hurts the authoritativeness of each instance and thereby hurts ranking.
Searx will show Lemmy results, at least on some Searx instances.
I’ve seen it a couple of times when searching on DDG.
A lot of Fediverse admins are just normal people like you and me with a budget, and disallowing bots and spiders helps save bandwidth, and the budget.
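For admins who want the polite version of this, a minimal sketch of serving a blanket robots.txt straight from Nginx might look like the following. It assumes the crawlers in question actually honor robots.txt (plenty of scrapers don't), and example.com is a placeholder rather than anyone's real config.

```nginx
server {
    server_name example.com;  # placeholder host name (assumption)

    # Answer /robots.txt directly from the config; no file on disk needed.
    location = /robots.txt {
        default_type text/plain;
        # "Disallow: /" asks every compliant crawler to skip the whole site.
        return 200 "User-agent: *\nDisallow: /\n";
    }
}
```

This only saves bandwidth from crawlers that choose to comply; it doesn't enforce anything on its own.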
Could it be possible to have one major global instance that aggregates everything so it can be indexed by search engines? Would that work? Or do I not fully understand how federation works?
That would defeat the purpose of federation.
It becomes a central choke point of moderation. Who gets to decide what instances are part of global and which ones aren’t. Because a free for all isn’t going to end well. And then you’re back at Reddit.
Right, but having a centralised search index thingy is better than none at all. Maybe there could be something where it’s a joint effort from admins from many of the biggest servers, idk if that would work.
Lemmy search already is quite excellent… at least here on lemm.ee, we don’t have many communities but tons of users subscribed to probably about everything on the lemmyverse so the servers have it all.
It might be interesting to team up with something like YaCy: Instances could operate as YaCy peers for everything they have. That is, integrate a p2p search protocol into ActivityPub itself so that also smaller instances can find everything. Ordinary YaCy instances, doing mostly web crawling, can in turn use posts here as interesting starting points.
I just wish lemmy search itself wasn’t broken…
Gotta keep some things that feel like reddit.
Yep. I block all bots to my instance.
Most are parasitic (GPTBot, ImageSift bot, Yandex, etc) but I’ve even blocked Google’s crawler (and its ActivityPub crawler bot) since it now feeds their LLM models. Most of my content can be found anyway because the instances it federates to don’t block those, but the bandwidth and processing savings are what I’m in it for.
I have two questions. How much do those bots consume your bandwidth? And by blocking search robots, do you stop being present in the search results or are you still present, but they do not show the content in question?
I ask these questions because I don’t know much about the topic when managing a website or an instance of the fediverse.
How much do those bots consume your bandwidth?
Pretty negligible per bot per request, but I’m not here to feed them. They also travel in packs, so the bandwidth does multiply. It also costs me money when I exceed my monthly bandwidth quota. I’ve blocked them for so long, I no longer have data I can tally to get an aggregate total (I only keep 90 days). SemrushBot alone, before I blocked it, was averaging about 15 GB a month. That one is fairly aggressive, though. Imagesift Bot, which pulls down any images it can find, would also use quite a bit, I imagine, if it were allowed.
With Lemmy, especially earlier versions, the queries were a lot more expensive, and bots hitting endpoints that triggered a heavy query (such as a post with a lot of comments) would put unwanted load on my DB server. That’s when I started blocking bot crawlers much more aggressively.
Static sites are a lot less impactful, and I usually allow those. I’ve got a different rule set for them which blocks the known AI scrapers but allows search indexers (though that distinction is slowly disappearing).
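A rough sketch of what such a split rule set could look like. The variable name $ua_ai_scraper, the host name, the root path, and the choice of which user agents count as "AI scrapers" are assumptions for illustration, not the actual rule set described above. The map block has to sit at the http level (for example in a conf.d file included from nginx.conf).

```nginx
# Hypothetical static-site rule set: flag only AI scrapers,
# so ordinary search indexers fall through to "default 0".
map $http_user_agent $ua_ai_scraper {
    default          0;
    "~GPTBot"        1;
    "~ClaudeBot"     1;
    "~CCBot"         1;
    "~Bytespider"    1;
    "~ImagesiftBot"  1;
}

server {
    server_name static.example.com;   # placeholder (assumption)
    root /var/www/static;             # placeholder path (assumption)

    # 444 is Nginx's non-standard "close the connection without a response".
    # Flagged agents are dropped; everyone else is served normally.
    if ($ua_ai_scraper) {
        return 444;
    }

    location / {
        try_files $uri $uri/ =404;
    }
}
```

Keeping the AI-scraper list in its own map means the stricter block-everything list can stay reserved for the heavier dynamic sites.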
And by blocking search robots, do you stop being present in the search results or are you still present, but they do not show the content in question?
I block bots by default, and that prevents my instance from being indexed since it can’t be crawled at all. Searching “dubvee” (my instance name / url) in Google returns no relevant results. I’m okay with that, lol, but some people would be appalled.
However, I can search for things I’ve posted from my instance if they’ve federated to another instance that is crawled; the link will just be to the copy on that instance.
For the few static sites I run (mostly local business sites since they’d be on Facebook otherwise), I don’t enforce the bot blocking, and Google, etc are able to index them normally.
Thanks for the explanation and it was clear to me.
Teach me oh wise one
Kinda long, so I’m putting it in spoilers. This applies to Nginx, but you can probably adapt it to other reverse proxies.
- Create a file to hold the mappings and store it somewhere you can include it from your other configs. I named mine map-bot-user-agents.conf
Here, I’m doing a regex comparison against the user agent ($http_user_agent) and mapping it to either a 0 (default/false) or 1 (true) and storing that value in the variable $ua_disallowed. The run-on string at the bottom was inherited from another admin I work with, and I never bothered to split it out.
map-bot-user-agents.conf
    # Map bot user agents
    map $http_user_agent $ua_disallowed {
        default 0;

        "~CCBot" 1;
        "~ClaudeBot" 1;
        "~VelenPublicWebCrawler" 1;
        "~WellKnownBot" 1;
        "~Synapse (bot; +https://github.com/matrix-org/synapse)" 1;
        "~python-requests" 1;
        "~bitdiscovery" 1;
        "~bingbot" 1;
        "~SemrushBot" 1;
        "~Bytespider" 1;
        "~AhrefsBot" 1;
        "~AwarioBot" 1;
        "~GPTBot" 1;
        "~DotBot" 1;
        "~ImagesiftBot" 1;
        "~Amazonbot" 1;
        "~GuzzleHttp" 1;
        "~DataForSeoBot" 1;
        "~StractBot" 1;
        "~Googlebot" 1;
        "~Barkrowler" 1;
        "~SeznamBot" 1;
        "~FriendlyCrawler" 1;
        "~facebookexternalhit" 1;

        "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfly|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTTrack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErsPRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^QueryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanner|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^VoidEYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;
    }
Once you have a mapping file set up, you’ll need to do something with it. This applies at the virtual host level and should go inside the server block of your configs (except the include for the mapping config). This assumes your configs are in conf.d/ and are included from nginx.conf.
The map-bot-user-agents.conf is included above the server block (since it’s an http level config item) and inside server, we look at the $ua_disallowed value where 0=false and 1=true (the values are set in the map). You could also do the mapping in the base nginx.conf since it doesn’t do anything on its own.
If the $ua_disallowed value is 1 (true), we immediately return an HTTP 444. The 444 status code is an Nginx thing, but it basically closes the connection immediately and wastes no further time/energy processing the request. You could, optionally, redirect somewhere, return a different status code, or return some pre-rendered LLM-generated gibberish if your bot list is configured just for AI crawlers (because I’m a jerk like that lol).
Example site1.conf
    include conf.d/includes/map-bot-user-agents.conf;

    server {
        server_name example.com;
        ...

        # Deny disallowed user agents
        if ($ua_disallowed) {
            return 444;
        }

        location / {
            ...
        }
    }
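For the "return a different status code" or "redirect somewhere" options mentioned above, a hedged sketch follows. It assumes the same $ua_disallowed map is included at the http level; the redirect target and the root path are made-up placeholders.

```nginx
include conf.d/includes/map-bot-user-agents.conf;

server {
    server_name example.com;

    # Variant 1: answer with an ordinary 403 instead of 444,
    # so the bot receives an explicit "forbidden" response.
    if ($ua_disallowed) {
        return 403;
    }

    # Variant 2 (use instead of variant 1, not alongside it):
    # send flagged agents elsewhere. The URL is a placeholder.
    # if ($ua_disallowed) {
    #     return 302 https://example.org/;
    # }

    location / {
        root /var/www/html;   # placeholder path (assumption)
    }
}
```

The tradeoff is that 403 and 302 both send a real response, so they cost slightly more than 444, which simply drops the connection.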
I’ve always been told to be scared about ifs in nginx configs
Yeah, if’s are weird in Nginx. The rule of thumb I’ve always gone by is that you shouldn’t try to if on variables directly unless they’re basically pre-processed to a boolean via a map (which is what the user agent map does).
So I would need to add this to every subdomain conf file I have? Preciate you!
I just include the map-bot-user-agents.conf in my base nginx.conf so it’s available to all of my virtual hosts. When I want to enforce the bot blocking on one or more virtual hosts (some I want to leave open to bots, others I don’t), I just include a deny-disallowed.conf in the server block of those.
deny-disallowed.conf
    # Deny disallowed user agents
    if ($ua_disallowed) {
        return 444;
    }
site.conf
    server {
        server_name example.com;
        ...

        include conf.d/includes/deny-disallowed.conf;

        location / {
            ...
        }
    }
I was worrying about precisely this. I’d be ok with blocking search engines if there was a better way of searching but AFAICT there isn’t federated search of any kind?
All a spider needs is an instance to download everything.
One of the major problems with Lemmy is that many posts get deleted and that nukes the comment section (which is where most of the answers will be).
I wish Lemmy deleted posts closer to how Reddit deletes posts - the post content should be deleted, but leave the comments alone.
You could always add “site:lemmy.world” to your search (remove the quotes). I commonly do that, as well as the same for reddit or stack overflow.
The problem with that is, lemmy.world is only one of many different instances. Too bad there isn’t a way to add a modifier that searches the entire fediverse.
yea i’ve been doing “inurl:lemmy” for that reason
Appending (intext:“modlog” & “instances” & “docs” & “code” & “join lemmy”) to your search query will search most instances. Works with Google, Startpage, SearXNG afaik.
Very nice, thanks!
Was able to find this thread:
(Heh, when testing this sanitized URL from the thousand character monster it was before, Google asked me if I was a bot. I think parentheses and stuff make them suspicious.)
You’d miss instances that don’t use “lemmy” in the URL, but it’s at least a better solution than specifying a single instance.
off the top of my head, that won’t include lemm.ee, sopuli, beehaw, szmer.info, slrpnk.net, sh.itjust.works, or other threadiverse instances like kbin/mbin.
Yeah but we need non technical stuff too, which is what i hate about the ui and stuff not being made simpler for non tech people to start using lemmy. I want doctors, lawyers, and casual people asking questions about everyday items and stuff so i can search “best sleep mask lemmy” or any product category and find good discussions. Would also help if lemmy.com was an instance instead of just redirecting to lemm.ee
I think we have to contribute our hours of UX assistance to see changes there. The brilliant engineers who donate their time probably both focus on working features first and specialize more in technical problem-solving than visual design.
Honestly the percentage of Reddit posts that Google returns which still actually retain their info is dropping substantially for me. Last few times it pulled up a Reddit post that should have my answer instead had either the question or the answers I needed deleted.
Enshittification continues but now it’s trying to enshittify competitors indirectly instead of altering their own products.
It is their internet, peasants… Say thank you daddy lets you use it at all, shitlord.
Slaves are getting brazen now… Regime might get ideas about how to fix that from our greatest ally. They call it mowing the lawn. They are doing it now! And regime likes it.
DDG is a metasearch engine that uses Google as a primary search source, so this would indicate Google is not returning search data to its own APIs.
I feared AI would lead to this, so hopefully I’m jumping to conclusions. DDG, searxng, kagi, and most other alternative search engines rely on this API.
Duckduckgo uses Bing for links.
from : https://duckduckgo.com/duckduckgo-help-pages/results/sources/
Good to know. I thought they used aggregated google too.
Not an ad, but Kagi is worth it. I’m ok with paying for search tho.
Stop shilling for this scam of a service
The free alternatives are better
Why do you say it’s a scam? I have not found a better alternative that produces decent search results. It’s not incentivized to send you to any particular websites, and I can personalize the results how I like.
If you wanna pay for inferior services, go crazy.
I’ll pass, but thanks for the reminder.
I’ve been using it for a while and agree. At least I know how they’re making money off of me.
Still seeing new Reddit posts on Kagi. Let’s hope it stays this way.
Kagi pays Google for search results to supplement its index so it will keep reddit results as long as Google lets them
I will pay for a server and then host SearX-NG rather than paying for a Good Service. Yep, I am weird.
I actually have an instance of searxng as well. I use it with my local chatGPT.
Yes an ad, and a free one at that.
I’m still getting full reddit results on DDG.
Old indexed results are fine. But anything from last week is not showing up
To be fair, Reddit is no longer that good of a source for answers in the later years.
Quality drop in comments is insane. Sometimes it looks like Quora.
Am I bbbrrrregnant?
If we consider all possible outcomes on a galaxy scale, then No.
Also my collection of hobbies seems to match up well with the people who nuked their post history after the API-ocalypse. Even when I get good search results I click through and… so many deleted comments…
It irritates me that so many forums and media sites allow you to edit your posts at will. There’s one site I go to that I like very much - it has a 5 minute edit window, and after that, your post can no longer be edited. You can’t change what you said, pretend you never said things, etc, once you say something it remains. It would be nice if more sites were like that. Or at least, if you edit/delete something, for there to be an option to check the history to see what it used to be, so if you try to delete some comment you made people can still check it. Whether it’s informational, or it’s because you’re trying to hide something you said that you realize was actually super shitty and people are getting angry at you for it, I prefer things to stick.
I was looking for Bluetooth speakers recommendations and it’s the first time I really noticed “generic bot replies” like “I’ve got this great product to recommend, not only is it good but it offers great sound quality as well! The product is [link to Amazon page]”
Gotta start searching using “before:” to get quality results…
I’m seeing bots promoted and sold to generate those kinds of replies, RIP internet, looking forward to SSN/DNA+background check review verification (I kid but I half dream of that privacy nightmare partially plugging the review fraud hole).
I’m with you on embracing the privacy nightmare to kill off cheaters in games. Tie an account to a real identity and that problem will quickly reduce.
thats fucked
Oh good, I can’t view most reddit threads without an account anymore, so it’ll be nice to see those results go away.
I wasn’t aware of this, when did it start? So far, it has never happened to me not to be able to view reddit threads
Weird, most of the results I get from Google’s search are from Quora (and they fucking suck). Google as a search engine has been going downhill for a while now. Reddit has become an increasingly spammy shithole full of corporate and political astroturfing too.
whoever is left there deserves whatever happens. they really showed what they thought of the users over the years. culminating in app/api control.
…and Pepperidge Farm definitely remembers saydra.