TIL that the entirety of Wikipedia is only ~100 GB and you can download it for offline use

retrospectology@lemmy.world · edited 4 months ago

lolola@lemmy.blahaj.zone · 4 months ago

So something akin to this joke image I saw the other day is actually feasible for Wikipedia?

Slovene@feddit.nl · 4 months ago

https://m.youtube.com/watch?v=1lRI35gKSPA

mctoasterson@reddthat.com · 4 months ago

I mean, you can self-host your own local LLMs using something like Ollama. Performance will be bound by the disk space you have (which limits the complexity of the model you’re able to store) and by the CPU or GPU you are using to run it, but it does work just fine. The results are probably about as good as ChatGPT for most use cases.
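
For anyone curious what talking to a local model looks like, here is a minimal sketch against Ollama's local HTTP API. It assumes Ollama is already running on its default port (11434) and that a model has been pulled. The model name `llama3` and the helper names `build_request` and `ask` are placeholders made up for illustration.

```python
import json
import urllib.request

# Address of a locally running Ollama server (assumption: default port 11434).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3"):
    """Build a POST request for Ollama's /api/generate endpoint.

    The model name is a placeholder; use whichever model you have pulled.
    """
    # "stream": False asks for one complete JSON reply instead of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="llama3"):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("Why is the sky blue?")` only works while the server is up, and the request never leaves your machine.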

Nooodel@lemmy.world · 4 months ago

We do this at work (lots of sensitive data that we don’t want OpenAI to capitalize on) and it works pretty well. It’s hosted locally and was set up by a security- and privacy-conscious admin, who specifically configured it not to save any queries, even on the server. A bit slower than ChatGPT, but not by much.

Farmfixit@lemmy.world · 4 months ago

I tried to download it but couldn’t get it to work :(

retrospectology@lemmy.world · edited 4 months ago

Download the Kiwix app for whatever OS you’re using, then open Kiwix, click the folder icon, and navigate to wherever you saved the .zim file you downloaded. Once you select it, it should open automatically and be viewable.

If you did that and it’s still failing, is it giving you a specific error or anything?
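
If it still won't open, one thing worth ruling out is a truncated or mislabeled download. Here is a rough sketch of that check in Python, assuming the file follows the openZIM layout, where the first four bytes are the magic number 72173914 stored little-endian. `looks_like_zim` is just a name made up for this example, and passing the check does not prove the rest of the file is intact.

```python
import struct

# Magic number at the start of a ZIM file header, per the openZIM format.
ZIM_MAGIC = 72173914  # bytes on disk: b"ZIM\x04"


def looks_like_zim(path):
    """Return True if the file starts with the ZIM magic number."""
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return False  # empty or badly truncated file
    # "<I" = one little-endian unsigned 32-bit integer.
    (magic,) = struct.unpack("<I", header)
    return magic == ZIM_MAGIC
```

If this returns False for your download, the file is probably not a ZIM archive at all (for example, a saved error page), and downloading it again is the likely fix.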

Slovene@feddit.nl · 4 months ago

It’s already been done: https://m.youtube.com/watch?v=1lRI35gKSPA

TheReturnOfPEB@reddthat.com · 4 months ago

And you should donate to Wikipedia if you are gonna do that.

Fenrisulfir@lemmy.ca · 4 months ago

Is there a git repo for it or do I have to redownload the whole thing to do an update?

CannedCairn@lemmy.world · 4 months ago

I did! I do! Also all the public domain books from Project Gutenberg.

bionicjoey@lemmy.ca · 4 months ago

The text version of Wikipedia*

The images and other media are a hell of a lot more.

retrospectology@lemmy.world · edited 4 months ago

The 100 GB version mentioned above only has thumbnails/low-res pictures, yeah. Better than nothing for some types of articles, but not everything. The true text-only version is actually only ~53 GB though.

ByteOnBikes@slrpnk.net · 4 months ago

Some of the high res photos are ridiculous.

Like an 8000x9000 uncompressed image of someone’s hand that weighs about 22 MB.

I know that because I use a lot of royalty free images.

owsei@programming.dev · 4 months ago

Is there an index of the images or something like that?

morhp@lemmynsfw.com · 4 months ago

https://commons.wikimedia.org/

The images are categorised and there’s a search function.

owsei@programming.dev · 4 months ago

Thank you very much!

BuddyTheBeefalo@lemmy.ml · 4 months ago

it’s 102GB with images, 53GB without

Silverseren@fedia.io · 4 months ago

I presume this is images directly hosted on English Wikipedia and not the entirety of Commons where the vast majority of images are kept, right?

BuddyTheBeefalo@lemmy.ml · edited 4 months ago

Wikimedia Commons is 373 TB of images. https://commons.m.wikimedia.org/wiki/Special:MediaStatistics

maegul (he/they)@lemmy.ml · 4 months ago

Kinda interesting at a broad level … that there’s still something to the efficiency of language.

Sure storage is cheap now, but so much of the calculation of the utility of data in modern tech is the presumption of an internet connection and retrieval of information over the network.

With the internet going to shit in various ways, local or decentralised computing is making more sense, at least depending on your priorities and perspective. And so all of a sudden, storage tradeoffs become a bit more meaningful. Do I need all of the pictures and media … or would a simple textual description suffice for most instances with high res media available at a more centralised archive if I’m really interested? A picture is worth 1000 words, but takes a hell of a lot more digital storage space!

Dasus@lemmy.world · 4 months ago

Without images, Wikipedia is a “mere” 22.14 GB.

https://en.m.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#:~:text=The%20total%20number%20of%20pages,about%2022.14%20GB%20without%20media.

Em Adespoton@lemmy.ca · 4 months ago

Aside from the text clarification, this is also only the US version of Wikipedia.

What worries me though is that most videos linked on Wikipedia are hosted on YouTube. That’s a pretty dangerous choke point.

superkret@feddit.org · 4 months ago

Videos aren’t an essential part of an encyclopedia.

PhobosAnomaly@feddit.uk · 4 months ago

Ten year old me would beg to differ.

Videos turned Encarta 95 from being an encyclopedia to the encyclopedia!

I jest - a multimedia experience helps but I agree that the text knowledge is the big draw.

Rai@lemmy.dbzer0.com · 4 months ago

I liked to look at the genitals of everyone possible

Thank you encarta

AbsoluteChicagoDog@lemm.ee · 4 months ago

The real ones remember wandering around that damn maze answering questions while managing limited torches to see the map.

Greg Clarke@lemmy.ca · 4 months ago

I remember watching the hand of God goal in the library many times using Encarta 95

AnUnusualRelic@lemmy.world · 4 months ago

I never even noticed any videos on Wikipedia. Maybe for some cinema articles.

aname@lemmy.one · 4 months ago

You mean the English version? There is no US version, thank god.

ByteOnBikes@slrpnk.net · 4 months ago

My brain immediately thought archive.org, but after the last incident, I kinda feel like archive.org is going to get lawsuited into oblivion.

whats_all_this_then@lemmy.world · 4 months ago

I tried searching but found nothing. What incident?

Silverseren@fedia.io · 4 months ago

The benefit of text not taking up much space.

Don_Dickle@piefed.social · 4 months ago

I am currently reading up on terrorists while in the States, and something tells me it will get my IP banned. But I have read a shitton and I highly doubt it’s just 100 GB. Otherwise you would see it more on piracy sites.

whoreticulture@lemmy.world · 4 months ago

But it’s freely and easily available to download, why would it be on piracy sites?

Serinus@lemmy.world · 4 months ago

China is making a copy. For… reasons.

Dasus@lemmy.world · 4 months ago

Otherwise you would see it more on piracy sites.

What on Earth do you mean? Piracy sites share things which aren’t available easily for free otherwise.

https://en.m.wikipedia.org/wiki/Wikipedia:Database_download

And the text-only version of Wikipedia is just 22.14 GB.

https://en.m.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#:~:text=The%20total%20number%20of%20pages,about%2022.14%20GB%20without%20media.

Aatube@kbin.melroy.org · 4 months ago

DYK that Kiwix was actually created by Wikipedia? Back in the late 2000s there was this gigantic effort to select and improve a ton of articles to make an offline “Wikipedia 1.0” release. The only remains of that effort are Kiwix, periodic backups, and an incredibly useful article-rating system.

felixwhynot@lemmy.world · 4 months ago

Can you write more about the rating system you mentioned?

Aatube@kbin.melroy.org · 4 months ago

- There is a set of criteria to rate an article B, C, Start or Stub. These are called classes. Similarly, articles can be rated at one of four importance levels for a particular WikiProject.
- There’s a banner on every article’s talk page. Any editor can boldly change an article’s rating to any of the above classes; if a revert happens, they discuss it according to the criteria.
- Some WikiProjects have their own criteria for rating articles. Some of them even have a process to make an article A-class.
- Before this system, Wikipedia already had processes to make an article a Good Article (GA) or Featured Article (FA).

- With GAs, a nominator puts a candidate onto a backlog. Later, a reviewer scrutinizes the article according to the criteria. Often, the reviewer asks the nominator to fix quite a few issues. If these issues are fixed promptly, or the reviewer thinks that there are only nitpicks, the article passes. If they aren’t fixed within a week, or the reviewer thinks that there are major problems, the article fails.
  - As with other processes, the nominator and reviewer can be anyone, though reviewers are usually experienced.
- With FAs, a nominator brings the candidate to a noticeboard. Editors there then come to a consensus about whether the article should pass.
- Both processes display a badge directly on passed articles.
- Both processes have an associated re-review process where editors come to a consensus on whether the article would fail if it were nominated today.
- There’s also an informal process called “peer review”, where someone just puts an article on a noticeboard and anyone can comment on its quality.

Articles are automatically sorted into categories by their rating and importance. Editors usually look at these to decide which articles to focus on nowadays.

Muffi@programming.dev · 4 months ago

This saved my ass at my engineering chemistry exam (still a requirement, even for software engineers) where only offline tools were allowed. Love Kiwix!

snrkl@lemmy.sdf.org · 4 months ago

LOL… Malicious compliance at its best…