Question:
Why does a dump of the Wikipedia database have more pages than Wikipedia itself has?
Emmy
2012-01-14 21:42:19 UTC
I downloaded a dump of the current articles in the English Wikipedia; the dump file is enwiki-20120104-pages-articles.xml. This is "current revisions only, no talk or user pages", and has no images. I installed a LAMP server with MediaWiki in a virtual machine and began using MWDumper to import the dump into my database. I haven't had any errors, and I can browse Wikipedia on this local server and see the articles that have been imported so far.

My problem is that I thought Wikipedia had about 3.85 million articles, but I've already imported 3.93 million pages into my database with MWDumper. I don't know how many are left, but when I browse my local Wikipedia, there are a lot of red links still. I looked on the talk page for MWDumper and saw that someone else complained that he or she expected 3.8 million pages and MWDumper imported 11 million.

I'm getting frustrated by how long this is taking. It's been importing for more than a week already and I thought it would be done today. I'm wondering why there are so many more pages in the Wikipedia dump than are in the English Wikipedia.
Three answers:
Nihiltres
2012-01-14 23:05:43 UTC
Certain classes of pages, even when in the main "article" namespace, don't count as "articles" for the purpose of statistics. Most notably, redirects! You're probably seeing those, among a few others that aren't formally counted.
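
To check this against a dump, here is a rough Python sketch that tallies the <page> entries in a MediaWiki XML export by kind. It assumes each <page> carries an <ns> element and that redirects carry a <redirect> element, as in the export-0.6 schema; older dumps may lack these and would need a different test. The function name count_pages and the tiny inline sample are made up for illustration. The "main_non_redirect" figure is only an upper bound on the official article count, which as far as I know applies further criteria.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def count_pages(source):
    """Tally pages in a MediaWiki XML export by kind.

    `source` is a filename or a binary file-like object.
    Assumption: each <page> has an <ns> child, and redirects have a
    <redirect> child (true for the export-0.6 schema).
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        ns = None
        is_redirect = False
        for child in elem:
            name = local_name(child.tag)
            if name == "ns":
                ns = (child.text or "").strip()
            elif name == "redirect":
                is_redirect = True
        counts["total"] += 1
        if ns != "0":
            # Templates, categories, project pages, and so on.
            counts["non_main_namespace"] += 1
        elif is_redirect:
            counts["main_redirect"] += 1
        else:
            counts["main_non_redirect"] += 1
        # Drop the page's contents to keep memory use down on large dumps.
        elem.clear()
    return counts


# Hypothetical three-page sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page><title>Cat</title><ns>0</ns><id>1</id>
    <revision><text>An article.</text></revision></page>
  <page><title>Kitty</title><ns>0</ns><id>2</id><redirect title="Cat" />
    <revision><text>#REDIRECT [[Cat]]</text></revision></page>
  <page><title>Template:Stub</title><ns>10</ns><id>3</id>
    <revision><text>A template.</text></revision></page>
</mediawiki>"""

counts = count_pages(io.BytesIO(SAMPLE))
for kind, n in sorted(counts.items()):
    print(kind, n)
```

For a real dump, pass the file path instead of the sample, e.g. count_pages("enwiki-20120104-pages-articles.xml"). It streams the file rather than loading it whole, but it will still take a while on a file that size.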
anonymous
2016-12-03 12:47:22 UTC
An event is random if you cannot predict it with 100% accuracy. However, an event that is mathematically modeled as random is usually not "truly" random. For example, thermal noise is modeled as random, even though the motion of electrons could in theory be predicted from electric and magnetic forces. However, models that account for every force responsible for this motion would be impractically complex and require too many inputs. Consequently, "beyond the human ability to predict" with 100% accuracy is the definition of most modeled random events. I don't know if there is such a thing as a truly random event that has no deterministic basis.
Masked Musketeer
2012-01-14 21:46:22 UTC
You'd be better off asking this at http://en.wikipedia.org/wiki/Wikipedia_talk:Database_download


This content was originally posted on Y! Answers, a Q&A website that shut down in 2021.