Scraping text from Wikipedia using PHP
Posted by jimblackler on Sep 25, 2007
Wikipedia has grown from one of many interesting websites into one of the most famous sites on the Internet. Volunteers have invested an enormous amount of time in it over the years, and the payoff is what we have today: a wealth of factual data in one place.
When Wikis were a new concept, many predicted they would descend into chaos as they grew. In the case of Wikipedia the reverse is true. It seems to become increasingly well organised as the site develops. Rather than becoming more jumbled, the natural development of article conventions and the more planned use of standardised templates has created an increasingly neat and consistent structure.
This careful organisation of the prose leads to the interesting possibility of extracting more structured data from Wikipedia for alternative purposes, while staying true to the letter and spirit of the GFDL under which the material is licensed.
There’s the potential for a kind of semantic reverse engineering of article content. HTML pages could be scraped, and pages scoured for hints as to the meaning of each text fragment.
Applications could include loading articles about a variety of subjects into structured databases. Subjects for this treatment could include countries, people, chemical elements, diseases, you name it. These databases could then be searched by a variety of applications.
I’ve knocked up a simple page that gives a kind of quasi-dictionary definition when a word is entered. It looks at the first sentence of the Wikipedia article, which typically describes the article topic concisely.
I’ll show here how the basic page scrape works, which is actually very easy with PHP, thanks to its HTML-reading abilities and the power of XPath.
- $html = @file_get_contents("http://en.wikipedia.org/wiki/France"); will pull down the HTML content of the Wikipedia article on France.
- $dom = new DOMDocument(); @$dom->loadHTML($html); will read the HTML into a DOM for querying.
- $xpath = new DOMXPath($dom); will make a new XPath query object.
- $results = $xpath->query('//div[@id="bodyContent"]/p'); will find every paragraph that is a direct child of the div with the id “bodyContent”; the first result, $results->item(0), is the opening paragraph. This is where the article always starts in a Wikipedia article page.
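Put together, the four steps above might look something like the sketch below. It is only an illustration, not the WhatIs source: the function name firstParagraph is made up for this example, and it runs against a small inline HTML sample rather than a live Wikipedia page, so it assumes nothing about the current page layout beyond the “bodyContent” div described above.

```php
<?php
// Hypothetical helper (not from the WhatIs source): returns the text of the
// first non-empty <p> that is a direct child of div#bodyContent, or null.
function firstParagraph(string $html): ?string
{
    if ($html === '') {
        return null; // nothing was downloaded
    }

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $results = $xpath->query('//div[@id="bodyContent"]/p');
    if ($results === false) {
        return null;
    }
    foreach ($results as $paragraph) {
        $text = trim($paragraph->textContent);
        if ($text !== '') {
            return $text; // first paragraph that actually contains text
        }
    }
    return null;
}

// Inline sample standing in for a downloaded article page.
$sample = '<html><body><div id="bodyContent">'
        . '<p>France is a country in Western Europe.</p>'
        . '<p>Second paragraph.</p>'
        . '</div></body></html>';

echo firstParagraph($sample), "\n"; // prints "France is a country in Western Europe."
```

To try it against a real page, pass the string returned by file_get_contents() instead of $sample. A null result means either the download failed or the page no longer has a matching div.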
I then perform some more processing on the results, including fallbacks in case any of the steps fail. For instance, to make the definitions snappier to read, I strip any text in brackets, either round or square. There’s also some additional logic to pick the first topic in the list if the page lists multiple subjects (a “disambiguation” page). Predicting the Wikipedia URL for a given topic also involves a small amount of processing.
Anyway, when you ask the page “what is France”, it will reply:
France, officially the French Republic, is a country whose metropolitan territory is located in Western Europe and that also comprises various overseas islands and territories located in other continents.
Can’t argue with that!
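For the curious, the bracket stripping and URL prediction mentioned above could be sketched roughly as follows. Both function names (stripBrackets and wikipediaUrl) and the exact rules they apply are assumptions made for this example, not the logic used by the real page, which may differ.

```php
<?php
// Hypothetical helper: removes (round) and [square] bracketed text,
// repeating until none is left so that nested brackets are handled.
function stripBrackets(string $text): string
{
    do {
        // Match an innermost bracket group, plus one leading space if present.
        $text = preg_replace('/ ?(?:\([^()\[\]]*\)|\[[^()\[\]]*\])/', '', $text, -1, $count);
    } while ($count > 0);

    // Collapse any double spaces left behind.
    return trim(preg_replace('/ {2,}/', ' ', $text));
}

// Hypothetical helper: guesses the article URL for a topic. Assumes the
// title is the topic with its first letter capitalised and spaces turned
// into underscores, which will not hold for every article.
function wikipediaUrl(string $topic): string
{
    $title = str_replace(' ', '_', ucfirst(trim($topic)));
    return 'http://en.wikipedia.org/wiki/' . rawurlencode($title);
}

echo stripBrackets('Paris (French pronunciation: [paʁi]) is the capital[1] of France.'), "\n";
// prints "Paris is the capital of France."

echo wikipediaUrl('french republic'), "\n";
// prints "http://en.wikipedia.org/wiki/French_republic"
```

Removing the innermost groups first keeps the pattern simple: each pass only has to match brackets that contain no other brackets.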
Edit, 1st March: By request, here is the source of the WhatIs application. It will work in any LAMP environment but the .sln file is for VS.PHP under Visual Studio 2005.
Hi Jim, thanks for this! Just what I needed at the moment :)
I have one problem though: after setting up DBConnection.php and creating the `WhatIs` database, this is what I get: “Table ‘whatis.subjects’ doesn’t exist”. Can you send me the .sql file for the MySQL structure?
Thanks :)
Thanks Rade. Try this:
CREATE TABLE IF NOT EXISTS `subjects` (
`Word` text NOT NULL,
`Description` text NOT NULL,
`DateTime` datetime NOT NULL,
`Link` text NOT NULL
);
That worked like a charm.
You saved me with this one, thanks again!
Fantastic! Thanks for sharing this Jim. It will be fun to play with and learn by. I appreciate it.
This is strange. I downloaded and tried the code, but could not make it work even with the database set up. I then created a small program using the 4 steps you listed above, and $results shows nothing. Any suggestions? Thanks.
Hi Jim. This is outstanding!!!
Wikipedia has started adding location coordinates to some entries (e.g. Paris, NASA). The code pulls this in place of a description. Any suggestions?
What’s missing here (from my perspective) is how to parse the Wikipedia text.
I couldn’t make that out.
TIA
[…] the techniques I worked out for my earlier article I wrote a new scraper to crawl pages from Wikipedia. This was a Java client running on my […]
For the life of me I cannot get this sweet little script to work… any pointers would be so appreciated…
Even just using the 4 lines of code, I am not retrieving any data.
Please help… :o)
[…] problem: The following links might be of some help, if I have understood the problem correctly. http://jimblackler.net/blog/?p=13 and […]
Hey,
This is just great.
Based on this concept, I have created a single “Wikitools” class that fetches the definition from Wikipedia based on the “Feeling Lucky” URL from Google.
Like:
$wiki = new Wikitools();
$result = $wiki->get( “madonna” );
Thanks for the inspiration.
It doesn’t always work though. For example:
What is “building”? Fails because the Wikipedia page is formatted differently.
Hi,
I am very thankful for your valuable document and source code.
It is really helpful.