Question:
Hi, I want to get Wikipedia content into a text file, using any programming language, without HTML tags etc.?
shiran j
2010-02-25 01:45:26 UTC
Hi, I want to get Wikipedia content into a text file, using any programming language, without it containing HTML tags etc. Is there any wiki API to do that?
Four answers:
Nihiltres
2010-02-25 10:44:51 UTC
If you're OK with wikitext (the syntax MediaWiki uses), then the easiest thing to do is to grab raw pages with the URL parameter "action=raw", e.g. < http://en.wikipedia.org/w/index.php?title=Plutonium&action=raw >, which will return the raw source code of the page. This probably isn't what you want, though.
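
For what it's worth, here is a minimal Python sketch of the "action=raw" approach. The function names, the output filename, and the User-Agent string are made up for illustration, and it assumes the index.php endpoint above is also reachable over https.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Same endpoint as in the answer, but over https (an assumption).
INDEX_URL = "https://en.wikipedia.org/w/index.php"


def raw_url(title):
    """Build the action=raw URL for a page title."""
    return INDEX_URL + "?" + urlencode({"title": title, "action": "raw"})


def fetch_wikitext(title):
    """Download the page's wikitext. Needs network access."""
    # Placeholder User-Agent; replace it with something that identifies you.
    request = Request(raw_url(title), headers={"User-Agent": "wiki-text-example/0.1"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def save_wikitext(title, path):
    """Write the wikitext of a page to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(fetch_wikitext(title))
```

Calling save_wikitext("Plutonium", "Plutonium.txt") should leave you with a text file of wikitext, which still contains templates and link markup rather than clean prose.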



There's an API over at < http://en.wikipedia.org/w/api.php >, but I think you'd do better to simply scrape the printable version, which is accessible by adding a parameter "printable=yes" to a page's URL, e.g. < http://en.wikipedia.org/w/index.php?title=Plutonium&printable=yes >.
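
If you do want to try the API route, the sketch below is one possible way to ask it for plain text. It relies on "prop=extracts" with "explaintext", which comes from the TextExtracts extension and is newer than this answer, so treat its availability as an assumption. The function names and User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def extract_url(title):
    """Build a query URL asking for a plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",   # assumes the TextExtracts extension is enabled
        "explaintext": 1,     # plain text instead of HTML
        "redirects": 1,
        "format": "json",
        "titles": title,
    }
    return API_URL + "?" + urlencode(params)


def text_from_response(data):
    """Pull the extract text out of a decoded JSON response."""
    pages = data.get("query", {}).get("pages", {})
    return "\n\n".join(page.get("extract", "") for page in pages.values())


def fetch_plain_text(title):
    """Download and return the plain-text extract. Needs network access."""
    # Placeholder User-Agent; replace it with something that identifies you.
    request = Request(extract_url(title), headers={"User-Agent": "wiki-text-example/0.1"})
    with urlopen(request) as response:
        return text_from_response(json.load(response))
```

The extract drops tables and images entirely, so the caveat about losing some content applies here too.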



The printable version is still HTML, but it removes or tidies up most of what would cause you problems. I think that with a few regular expressions (< http://enwp.org/Regex >), it would be simple to remove tables and most images (you'd want to convert math images to the TeX markup used as their alt text, for example, rather than removing them). You'd have to handle certain kinds of representational formatting, particularly superscript and subscript, as it's often quite relevant, e.g. in physics/chemistry-related articles. This conversion process does involve losing some of the content (some Wikipedia content can't be adequately represented without a markup language supporting images, tables, etc.), but you'd get most of it without much work.
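
As a rough sketch of that clean-up, here is a Python version that swaps the regular expressions for the standard library's html.parser, which tends to cope better with nested tags. The class and function names are invented for this example, and the choice of "^" and "_" to mark superscript and subscript is just one convention.

```python
from html.parser import HTMLParser


class PlainText(HTMLParser):
    """Collect text, dropping tables and keeping alt text for images."""

    SKIP = {"table", "script", "style"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)  # e.g. TeX markup of a math image
        elif tag == "sup":
            self.parts.append("^")
        elif tag == "sub":
            self.parts.append("_")
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self.skip_depth:
                self.skip_depth -= 1
        elif not self.skip_depth and tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string, one non-empty line per block."""
    parser = PlainText()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

For example, html_to_text('<p>H<sub>2</sub>O</p><table><tr><td>x</td></tr></table>') gives "H_2O", with the table dropped.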
Kohs Knows
2010-02-25 16:04:51 UTC
MediaWiki (the software that undergirds Wikipedia) does not convert wiki markup or its HTML output to plain text very well at all. I'm afraid you are stuck in this situation, other than to just manually copy text and paste it into a text editor.
Frecklefoot
2010-02-25 15:12:56 UTC
You could just highlight the text you want and copy & paste it into a text editor, like Notepad. You'll still have to remove the occasional [1] and such, but that wasn't one of your requirements.
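
If the leftover [1]-style markers do bother you, a short Python sketch like the one below could strip them from the pasted text. The function name is made up, and it only targets purely numeric markers, so things like [note 2] stay, while any legitimate bracketed number in the text would be removed as well.

```python
import re

# Matches purely numeric footnote markers such as [1] or [23].
CITATION_MARKER = re.compile(r"\[\d+\]")


def strip_citation_markers(text):
    """Remove numeric footnote markers from copied article text."""
    return CITATION_MARKER.sub("", text)
```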
Paul Zucchini
2010-02-25 09:55:12 UTC
You can use this...



http://www.autohotkey.com/



It's totally free.



You can write a program in it to automate Firefox and have it save the text file. Or there are other ways to do it.



They have a free help forum there if you need more info.


This content was originally posted on Y! Answers, a Q&A website that shut down in 2021.