The Mediawiki API makes it possible for web developers to access,
search and integrate all Wikipedia content into their applications.
Given that Wikipedia is the ultimate online encyclopedia, there are dozens of use cases in which this might be useful.
I used to post a lot of articles about using the webservice APIS of
third party sites on this blog. This is going to be another post like
that.
This post describes how to use the Java Wikipedia API to fetch and format the contents of a Wikipedia article.
The Wikipedia API
The Wikipedia API makes it possible to interact with Wikipedia/Mediawiki
through a webservice instead of the normal browserbased web interface.
The documentation for using this api is at http://www.mediawiki.org/wiki/API.
The specific documentation for the English Wikipedia(the mediawiki api can be called on all Wikimedia sites, so not just Wikipedia itself, but also Wikimedia Commons etc..) is at http://en.wikipedia.org/w/api.php.
The specific documentation for the English Wikipedia(the mediawiki api can be called on all Wikimedia sites, so not just Wikipedia itself, but also Wikimedia Commons etc..) is at http://en.wikipedia.org/w/api.php.
We cover a basic use case: getting the contents of the “Web service” article.
To fetch the contents for this article, the following url suffices:
http://en.wikipedia.org/w/api.php? format=xml&action=query&titles=Web%20service&prop=revisions&rvprop=content
A request to this url will return an xml document which includes the
current wiki markup for the page titled “Web service”. As the request
parameters indicate, these requests are highly configurable. For
example, other formats than xml, such as json, are possible. For a full
list of available parameters, visit http://en.wikipedia.org/w/api.php.
We are not going to construct these urls ourselves. We are going to use bliki, the Java wikipedia API library instead.
Getting the Java Wikipedia API lib
If you are using Maven you need to add the following repository to your pom:
<repository> <id>info-bliki-repository</id> <url>http://gwtwiki.googlecode.com/svn/maven-repository/</url> <releases> <enabled>true</enabled> </releases> </repository>
together with the following dependency:
<!-- bliki --> <dependency> <groupId>info.bliki.wiki</groupId> <artifactId>bliki-core</artifactId> <version>3.0.17</version> </dependency>
and if you want the addons:
<dependency> <groupId>info.bliki.wiki</groupId> <artifactId>bliki-addons</artifactId> <version>3.0.17</version> </dependency>
If you are not using Maven, just grab the jar from the linked project page.
Usage examples of this lib are at http://code.google.com/p/gwtwiki/wiki/HTML2Mediawiki.
Basic example: getting the contents of an article
However, the basic usage example given in the documentation, at this time, does not compile with the current version of the lib.
Therefore, we will start with a basic usage example of which no variant is listed there and extend this example.
We are going to list the code to fetch the content of the “Web
service” page and render it as html. Note that to get a specific page,
you need to know its title.
If the page does not exist, a result with one empty page will be returned.
For ambiguous titles, the disambiguation page will be given too, so even
if you get a non-empty result, you still need to check it thoroughly.
String[] listOfTitleStrings = { "Web service" }; User user = new User("", "", "http://en.wikipedia.org/w/api.php"); user.login(); List<Page> listOfPages = user.queryContent(listOfTitleStrings); for (Page page : listOfPages) { WikiModel wikiModel = new WikiModel("${image}", "${title}"); String html = wikiModel.render(page.toString()); System.out.println(html); }
We are instantiating a user on the English wikipedia endpoint. Since we are only going to read, we can login anonymously.
We query the english Wikipedia for the specified titles and get one page as result in the listOfPages variable.
We then instantiate a WikiModel. This class will render the html and its
constructor parameters – imageBaseUrl and linkBaseUrl – determine where
the rendered images and links will point too. For example, if you want
these to point to local files, you would supply a local path. In the
example, I made it completely relative. In the official documentation,
these are “http://www.mywiki.com/wiki/${image}” and
“http://www.mywiki.com/wiki/${title}”, which you would use if you were
putting a Wikipedia copy at http://www.mywiki.com/wiki/.
We then render the page as html and print it out to the console.
The outputted rendering is very rudimentary and is far from complete though:
- Wikipedia magic variables, recognizable by their {{…}} markup, are not rendered. Instead, they are displayed literally.
- By default, all markup is rendered. However, you might need to leave certain parts out or modify the content a bit before it is displayed for your particular use case.
Handling magic variables
Most magic words are not supported by the Java Wikipedia API.
We need to implement their rendering ourselves.
If you want to do some advanced converting of the Wikipedia content,
such as handling these magic words, you need to extend the WikiModel
class. More info about this is at http://code.google.com/p/gwtwiki/wiki/Mediawiki2HTML.
This is what we are doing here:
package com.integratingstuff.wikimedia; import java.util.Locale; import java.util.Map; import java.util.ResourceBundle; import info.bliki.wiki.model.Configuration; import info.bliki.wiki.model.WikiModel; import info.bliki.wiki.namespaces.INamespace; public class MyWikiModel extends WikiModel{ public MyWikiModel(Configuration configuration, Locale locale, String imageBaseURL, String linkBaseURL) { super(configuration, locale, imageBaseURL, linkBaseURL); } public MyWikiModel(Configuration configuration, ResourceBundle resourceBundle, INamespace namespace, String imageBaseURL, String linkBaseURL) { super(configuration, resourceBundle, namespace, imageBaseURL, linkBaseURL); } public MyWikiModel(Configuration configuration, String imageBaseURL, String linkBaseURL) { super(configuration, imageBaseURL, linkBaseURL); } public MyWikiModel(String imageBaseURL, String linkBaseURL) { super(imageBaseURL, linkBaseURL); } @Override public String getRawWikiContent(String namespace, String articleName, Map<String, String> templateParameters) { String rawContent = super.getRawWikiContent(namespace, articleName, templateParameters); if (rawContent == null){ return ""; } else { return rawContent; } } }
The overriden getRawWikiContent in the above MyWikiModel code returns
null for most magic words in its default implementation. A magic word
such as {{InfoBox}} would pass through this code with the default
namespace=”Template” and articleName=”InfoBox”. If null is returned, the
magic word will be outputted in the rendered html as is(so for
{{InfoBox}}, this would be {{InfoBox}}). So, the resulting html is full
of these unreadable tags, which does not make it look pretty printed.
What we are doing in the above code to solve this is returning “”
instead of null, so the magic word does not get rendered at all.
Nothing is stopping you from returning something else though.
Controlling the rendering of the html by implementing an ITextConverter
For my particular use case, I also did not want to render any html
links, I did not want to render any references and I did not want to
render any images. The WikiModel class does not implement support for
leaving out these things. However, the overloaded render method of
WikiModel can take an ITextConverter object as an argument, the object
that is responsible for converting the parsed nodes to html(or another
format, like pdf or plain text). The default ITextConverter, used when
none is specified as an argument, is HTMLConverter, with its property
noLinks set to false by default.
However, there is a HTMLConverter constructor which sets the noLinks
boolean. By passing true to this constructor, no links will be rendered.
Their content will be rendered as plain text instead.
Since I still had to leave out the reference and image elements, I still ended up subclassing the HTMLConverter.
First, I made a more extensible version of it:
package com.ceardannan.exams.wikimedia; import info.bliki.htmlcleaner.ContentToken; import info.bliki.htmlcleaner.EndTagToken; import info.bliki.htmlcleaner.TagNode; import info.bliki.htmlcleaner.Utils; import info.bliki.wiki.filter.HTMLConverter; import info.bliki.wiki.model.Configuration; import info.bliki.wiki.model.IWikiModel; import info.bliki.wiki.model.ImageFormat; import info.bliki.wiki.tags.HTMLTag; import java.io.IOException; import java.util.Iterator; import java.util.List; import java.util.Map; /** * A converter which renders the internal tree node representation as specific * HTML text, but which is easier to change in behaviour than its superclass and * has a noImages property, which can be set to leave out all images * */ public class ExtendedHtmlConverter extends HTMLConverter { private boolean noImages; public ExtendedHtmlConverter() { super(); } public ExtendedHtmlConverter(boolean noLinks) { super(noLinks); } public ExtendedHtmlConverter(boolean noLinks, boolean noImages) { this(noLinks); this.noImages = noImages; } protected void renderContentToken(Appendable resultBuffer, ContentToken contentToken, IWikiModel model) throws IOException { String content = contentToken.getContent(); content = Utils.escapeXml(content, true, true, true); resultBuffer.append(content); } protected void renderHtmlTag(Appendable resultBuffer, HTMLTag htmlTag, IWikiModel model) throws IOException { htmlTag.renderHTML(this, resultBuffer, model); } protected void renderTagNode(Appendable resultBuffer, TagNode tagNode, IWikiModel model) throws IOException { Map<String, Object> map = tagNode.getObjectAttributes(); if (map != null && map.size() > 0) { Object attValue = map.get("wikiobject"); if (!noImages) { if (attValue instanceof ImageFormat) { imageNodeToText(tagNode, (ImageFormat) attValue, resultBuffer, model); } } } else { nodeToHTML(tagNode, resultBuffer, model); } } public void nodesToText(List<? extends Object> nodes, Appendable resultBuffer, IWikiModel model) throws IOException { if (nodes != null && !nodes.isEmpty()) { try { int level = model.incrementRecursionLevel(); if (level > Configuration.RENDERER_RECURSION_LIMIT) { resultBuffer .append("<span class=\"error\">Error - recursion limit exceeded rendering tags in HTMLConverter#nodesToText().</span>"); return; } Iterator<? extends Object> childrenIt = nodes.iterator(); while (childrenIt.hasNext()) { Object item = childrenIt.next(); if (item != null) { if (item instanceof List) { nodesToText((List) item, resultBuffer, model); } else if (item instanceof ContentToken) { // render plain text content ContentToken contentToken = (ContentToken) item; renderContentToken(resultBuffer, contentToken, model); } else if (item instanceof HTMLTag) { HTMLTag htmlTag = (HTMLTag) item; renderHtmlTag(resultBuffer, htmlTag, model); } else if (item instanceof TagNode) { TagNode tagNode = (TagNode) item; renderTagNode(resultBuffer, tagNode, model); } else if (item instanceof EndTagToken) { EndTagToken node = (EndTagToken) item; resultBuffer.append('<'); resultBuffer.append(node.getName()); resultBuffer.append("/>"); } } } } finally { model.decrementRecursionLevel(); } } } protected void nodeToHTML(TagNode node, Appendable resultBuffer, IWikiModel model) throws IOException { super.nodeToHTML(node, resultBuffer, model); } }
The functionality of the above HTMLConverter is almost the same as
the original one, but the code is divided into more methods, for easier
overriding, and a noImages boolean is added as well, which leaves out
all the images at render time if set to true.
And then I subclassed this class like this:
package com.ceardannan.exams.wikimedia; import info.bliki.htmlcleaner.ContentToken; import info.bliki.htmlcleaner.Utils; import info.bliki.wiki.model.IWikiModel; import info.bliki.wiki.tags.HTMLTag; import java.io.IOException; public class MyHtmlConverter extends ExtendedHtmlConverter { public MyHtmlConverter() { super(); } public MyHtmlConverter(boolean noLinks) { super(noLinks); } public MyHtmlConverter(boolean noLinks, boolean noImages) { super(noLinks, noImages); } protected void renderContentToken(Appendable resultBuffer, ContentToken contentToken, IWikiModel model) throws IOException { String content = contentToken.getContent(); content = content.replaceAll("\\(,", "(").replaceAll("\\(\\)", "()"); content = Utils.escapeXml(content, true, true, true); resultBuffer.append(content); } protected void renderHtmlTag(Appendable resultBuffer, HTMLTag htmlTag, IWikiModel model) throws IOException { String tagName = htmlTag.getName(); if ((!tagName.equals("ref"))) { super.renderHtmlTag(resultBuffer, htmlTag, model); } } }
If the converter encounters a “ref” html tag, it does not render the
html tag. This results in no references getting rendered at all.
I also changed the rendering of the content a bit. This is because
returning of “” for the magic words(see above), might leave (, or () in
the text, and the line that replaces these cleans up the rendered html.
The code that we call to get the html is now:
MyWikiModel wikiModel = new MyWikiModel("${image}", "${title}"); String currentContent = page.getCurrentContent(); String html = wikiModel.render( new MyHtmlConverter(true, true), currentContent);