Saturday 18 July 2015

How to get Wikipedia data using the Java API

The MediaWiki API makes it possible for web developers to access, search and integrate all Wikipedia content into their applications.
Given that Wikipedia is the ultimate online encyclopedia, there are dozens of use cases in which this might be useful.
I used to post a lot of articles on this blog about using the web service APIs of third-party sites. This is going to be another post like that.
This post describes how to use the Java Wikipedia API to fetch and format the contents of a Wikipedia article.

The Wikipedia API


The Wikipedia API makes it possible to interact with Wikipedia/MediaWiki through a web service instead of the normal browser-based web interface.
The documentation for using this API is at http://www.mediawiki.org/wiki/API.
The specific documentation for the English Wikipedia (the MediaWiki API can be called on all Wikimedia sites, so not just Wikipedia itself but also Wikimedia Commons, etc.) is at http://en.wikipedia.org/w/api.php.
We cover a basic use case: getting the contents of the “Web service” article.
To fetch the contents for this article, the following url suffices:
http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Web%20service&prop=revisions&rvprop=content
A request to this url will return an XML document which includes the current wiki markup for the page titled “Web service”. As the request parameters indicate, these requests are highly configurable. For example, formats other than XML, such as JSON, are possible. For a full list of available parameters, visit http://en.wikipedia.org/w/api.php.
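Just to illustrate what happens under the hood, here is a minimal sketch of issuing that exact request from Java with nothing but the standard library and printing the raw XML response (class and variable names are my own):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RawWikipediaApiCall {
  public static void main(String[] args) throws Exception {
    String url = "http://en.wikipedia.org/w/api.php"
        + "?format=xml&action=query&titles=Web%20service&prop=revisions&rvprop=content";
    // Read the response line by line and print the raw XML to the console
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new URL(url).openStream(), "UTF-8"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}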
We are not going to construct these urls ourselves, though. We are going to use bliki, the Java Wikipedia API library, instead.

Getting the Java Wikipedia API lib

If you are using Maven you need to add the following repository to your pom:
<repository>
  <id>info-bliki-repository</id>
  <url>http://gwtwiki.googlecode.com/svn/maven-repository/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
</repository>
together with the following dependency:
<!-- bliki -->
<dependency>
  <groupId>info.bliki.wiki</groupId>
  <artifactId>bliki-core</artifactId>
  <version>3.0.17</version>
</dependency>
and if you want the addons:
<dependency>
  <groupId>info.bliki.wiki</groupId>
  <artifactId>bliki-addons</artifactId>
  <version>3.0.17</version>
</dependency>
If you are not using Maven, just grab the jar from the linked project page.
Usage examples of this lib are at http://code.google.com/p/gwtwiki/wiki/HTML2Mediawiki.

Basic example: getting the contents of an article

The basic usage example given in the documentation does not, at the time of writing, compile with the current version of the lib.
Therefore, we will start with our own basic usage example (no variant of which is listed there) and extend it.
We are going to list the code to fetch the content of the “Web service” page and render it as html. Note that to get a specific page, you need to know its title.
If the page does not exist, a result with one empty page will be returned.
For ambiguous titles, the disambiguation page will be returned as well, so even if you get a non-empty result, you still need to check it thoroughly.
String[] listOfTitleStrings = { "Web service" };
User user = new User("", "", "http://en.wikipedia.org/w/api.php");
user.login();
List<Page> listOfPages = user.queryContent(listOfTitleStrings);
for (Page page : listOfPages) {
  WikiModel wikiModel = new WikiModel("${image}", "${title}");
  String html = wikiModel.render(page.toString());
  System.out.println(html);
}
We are instantiating a user on the English Wikipedia endpoint. Since we are only going to read, we can log in anonymously.
We query the English Wikipedia for the specified titles and get one page as a result in the listOfPages variable.
We then instantiate a WikiModel. This class will render the html, and its constructor parameters – imageBaseUrl and linkBaseUrl – determine where the rendered images and links will point to. For example, if you want these to point to local files, you would supply a local path. In the example, I made them completely relative. In the official documentation, these are “http://www.mywiki.com/wiki/${image}” and “http://www.mywiki.com/wiki/${title}”, which you would use if you were putting a Wikipedia copy at http://www.mywiki.com/wiki/.
We then render the page as html and print it out to the console.
The resulting rendering is very rudimentary and far from complete, though:
  • Wikipedia magic variables, recognizable by their {{…}} markup, are not rendered. Instead, they are displayed literally.
  • By default, all markup is rendered. However, you might need to leave certain parts out or modify the content a bit before it is displayed for your particular use case.

Handling magic variables

Most magic words are not supported by the Java Wikipedia API.
We need to implement their rendering ourselves.
If you want to do some advanced converting of the Wikipedia content, such as handling these magic words, you need to extend the WikiModel class. More info about this is at http://code.google.com/p/gwtwiki/wiki/Mediawiki2HTML.
This is what we are doing here:
package com.integratingstuff.wikimedia;

import java.util.Locale;
import java.util.Map;
import java.util.ResourceBundle;

import info.bliki.wiki.model.Configuration;
import info.bliki.wiki.model.WikiModel;
import info.bliki.wiki.namespaces.INamespace;

public class MyWikiModel extends WikiModel{

  public MyWikiModel(Configuration configuration, Locale locale,
    String imageBaseURL, String linkBaseURL) {
      super(configuration, locale, imageBaseURL, linkBaseURL);
  }
  public MyWikiModel(Configuration configuration,
    ResourceBundle resourceBundle, INamespace namespace,
    String imageBaseURL, String linkBaseURL) {
      super(configuration, resourceBundle, namespace, imageBaseURL, linkBaseURL);
  }
  public MyWikiModel(Configuration configuration, String imageBaseURL,
    String linkBaseURL) {
      super(configuration, imageBaseURL, linkBaseURL);
  }
  public MyWikiModel(String imageBaseURL, String linkBaseURL) {
    super(imageBaseURL, linkBaseURL);
  }

  @Override
  public String getRawWikiContent(String namespace, String articleName,
    Map<String, String> templateParameters) {
      String rawContent = super.getRawWikiContent(namespace, articleName, templateParameters);

      if (rawContent == null){
        // Magic words/templates the library cannot resolve come back as null;
        // return "" so they are dropped instead of showing up literally as {{...}}
        return "";
      }
      else {
        return rawContent;
      }
    }
}
The getRawWikiContent method overridden in the above MyWikiModel code returns null for most magic words in its default implementation. A magic word such as {{InfoBox}} passes through this code with namespace "Template" and articleName "InfoBox". If null is returned, the magic word is output in the rendered html as is (so {{InfoBox}} literally appears as {{InfoBox}}), and the resulting html ends up littered with these unreadable tags.
What the above code does to solve this is return "" instead of null, so the magic word is not rendered at all.
Nothing is stopping you from returning something else though.
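For example, a variant of the override (my own sketch, not something from the library documentation) that keeps a readable placeholder instead of silently dropping the template could look like this:

  @Override
  public String getRawWikiContent(String namespace, String articleName,
    Map<String, String> templateParameters) {
      String rawContent = super.getRawWikiContent(namespace, articleName, templateParameters);
      if (rawContent == null) {
        // Unresolvable template/magic word: show a readable placeholder
        // instead of dropping it (the placeholder format is my own choice)
        return "[template: " + articleName + "]";
      }
      return rawContent;
  }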

Controlling the rendering of the html by implementing an ITextConverter

For my particular use case, I also did not want to render any html links, references or images. The WikiModel class does not support leaving these out. However, the overloaded render method of WikiModel can take an ITextConverter as an argument: the object responsible for converting the parsed nodes to html (or another format, like pdf or plain text). The default ITextConverter, used when none is specified, is HTMLConverter, which has its noLinks property set to false.
There is, however, an HTMLConverter constructor which takes the noLinks boolean. By passing true to it, no links are rendered; their content is rendered as plain text instead.
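If suppressing links is all you need, something along these lines should suffice (HTMLConverter is the info.bliki.wiki.filter class discussed above):

WikiModel wikiModel = new WikiModel("${image}", "${title}");
// noLinks = true: the link targets are dropped, the link text is kept as plain text
String html = wikiModel.render(new HTMLConverter(true), page.toString());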
Since I still had to leave out the reference and image elements, I ended up subclassing HTMLConverter anyway.
First, I made a more extensible version of it:
package com.ceardannan.exams.wikimedia;

import info.bliki.htmlcleaner.ContentToken;
import info.bliki.htmlcleaner.EndTagToken;
import info.bliki.htmlcleaner.TagNode;
import info.bliki.htmlcleaner.Utils;
import info.bliki.wiki.filter.HTMLConverter;
import info.bliki.wiki.model.Configuration;
import info.bliki.wiki.model.IWikiModel;
import info.bliki.wiki.model.ImageFormat;
import info.bliki.wiki.tags.HTMLTag;

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
 * A converter which renders the internal tree node representation as specific
 * HTML text, but which is easier to change in behaviour than its superclass and
 * has a noImages property, which can be set to leave out all images
 *
 */
public class ExtendedHtmlConverter extends HTMLConverter {

  private boolean noImages;

  public ExtendedHtmlConverter() {
    super();
  }

  public ExtendedHtmlConverter(boolean noLinks) {
    super(noLinks);
  }

  public ExtendedHtmlConverter(boolean noLinks, boolean noImages) {
    this(noLinks);
    this.noImages = noImages;
  }

  protected void renderContentToken(Appendable resultBuffer,
      ContentToken contentToken, IWikiModel model) throws IOException {
    String content = contentToken.getContent();
    content = Utils.escapeXml(content, true, true, true);
    resultBuffer.append(content);
  }

  protected void renderHtmlTag(Appendable resultBuffer, HTMLTag htmlTag,
      IWikiModel model) throws IOException {
    htmlTag.renderHTML(this, resultBuffer, model);
  }

  protected void renderTagNode(Appendable resultBuffer, TagNode tagNode,
      IWikiModel model) throws IOException {
    Map<String, Object> map = tagNode.getObjectAttributes();
    if (map != null && map.size() > 0) {
      Object attValue = map.get("wikiobject");
      if (!noImages) {
        if (attValue instanceof ImageFormat) {
          imageNodeToText(tagNode, (ImageFormat) attValue, resultBuffer, model);
        }
      }
    } else {
      nodeToHTML(tagNode, resultBuffer, model);
    }
  }

  public void nodesToText(List<? extends Object> nodes,
      Appendable resultBuffer, IWikiModel model) throws IOException {
    if (nodes != null && !nodes.isEmpty()) {
      try {
        int level = model.incrementRecursionLevel();

        if (level > Configuration.RENDERER_RECURSION_LIMIT) {
          resultBuffer
              .append("<span class=\"error\">Error - recursion limit exceeded rendering tags in HTMLConverter#nodesToText().</span>");
          return;
        }
        Iterator<? extends Object> childrenIt = nodes.iterator();
        while (childrenIt.hasNext()) {
          Object item = childrenIt.next();
          if (item != null) {
            if (item instanceof List) {
              nodesToText((List) item, resultBuffer, model);
            } else if (item instanceof ContentToken) {
              // render plain text content
              ContentToken contentToken = (ContentToken) item;
              renderContentToken(resultBuffer, contentToken, model);
            } else if (item instanceof HTMLTag) {
              HTMLTag htmlTag = (HTMLTag) item;
              renderHtmlTag(resultBuffer, htmlTag, model);
            } else if (item instanceof TagNode) {
              TagNode tagNode = (TagNode) item;
              renderTagNode(resultBuffer, tagNode, model);
            } else if (item instanceof EndTagToken) {
              EndTagToken node = (EndTagToken) item;
              resultBuffer.append('<');
              resultBuffer.append(node.getName());
              resultBuffer.append("/>");
            }
          }
        }
      } finally {
        model.decrementRecursionLevel();
      }
    }
  }

  protected void nodeToHTML(TagNode node, Appendable resultBuffer,
      IWikiModel model) throws IOException {
    super.nodeToHTML(node, resultBuffer, model);
  }

}
The functionality of the above converter is almost the same as the original HTMLConverter, but the code is split into more methods for easier overriding, and a noImages boolean has been added which leaves out all images at render time when set to true.
And then I subclassed this class like this:
package com.ceardannan.exams.wikimedia;

import info.bliki.htmlcleaner.ContentToken;
import info.bliki.htmlcleaner.Utils;
import info.bliki.wiki.model.IWikiModel;
import info.bliki.wiki.tags.HTMLTag;

import java.io.IOException;

public class MyHtmlConverter extends ExtendedHtmlConverter {

  public MyHtmlConverter() {
    super();
  }

  public MyHtmlConverter(boolean noLinks) {
    super(noLinks);
  }

  public MyHtmlConverter(boolean noLinks, boolean noImages) {
    super(noLinks, noImages);
  }

  @Override
  protected void renderContentToken(Appendable resultBuffer,
      ContentToken contentToken, IWikiModel model) throws IOException {
    String content = contentToken.getContent();
    // Clean up "(," and empty "()" fragments left behind by removed magic words
    content = content.replaceAll("\\(,", "(").replaceAll("\\(\\)", "");
    content = Utils.escapeXml(content, true, true, true);
    resultBuffer.append(content);
  }

  @Override
  protected void renderHtmlTag(Appendable resultBuffer, HTMLTag htmlTag,
      IWikiModel model) throws IOException {
    String tagName = htmlTag.getName();
    // Skip "ref" tags so references are not rendered at all
    if (!tagName.equals("ref")) {
      super.renderHtmlTag(resultBuffer, htmlTag, model);
    }
  }
}
If the converter encounters a “ref” html tag, it does not render it, so no references get rendered at all.
I also changed the rendering of the content a bit: returning “” for the magic words (see above) can leave stray “(,” or empty “()” fragments in the text, and the replaceAll line cleans these up in the rendered html.
The code that we call to get the html is now:
MyWikiModel wikiModel = new MyWikiModel("${image}", "${title}");
String currentContent = page.getCurrentContent();
String html = wikiModel.render(
  new MyHtmlConverter(true, true), currentContent);
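Putting everything together, the fetch-and-render flow from the basic example now becomes something like this (a sketch reusing the classes defined above):

public static void printArticleAsHtml(String title) throws Exception {
  // Anonymous, read-only access to the English Wikipedia api endpoint
  User user = new User("", "", "http://en.wikipedia.org/w/api.php");
  user.login();
  List<Page> listOfPages = user.queryContent(new String[] { title });
  for (Page page : listOfPages) {
    MyWikiModel wikiModel = new MyWikiModel("${image}", "${title}");
    // noLinks = true, noImages = true: no links, references or images in the output
    String html = wikiModel.render(new MyHtmlConverter(true, true), page.getCurrentContent());
    System.out.println(html);
  }
}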

Thursday 16 July 2015

Hadoop jar with the -libjars option and the importance of GenericOptionsParser

When you have written a MapReduce program that depends on third-party jars, it is very important to ensure these jars are shipped to every slave node that runs map/reduce tasks. Including these jars has always been a somewhat difficult job for users, so they tend to create a fat jar which bundles all the dependencies into the exported archive file.

However, there is another, more elegant option: "-libjars", which can be passed when running the MapReduce job with the hadoop jar command, followed by a comma-separated list of the dependent jar files.

eg: hadoop jar <some_path>/your_jar.jar <your.class.name> -libjars <lib_path>/commons-lang-1.2.jar,<some_path>/guava-1.13.jar,other_jars <inputs & outputs and other parameters...>

Making this command work is harder than it sounds.

To ensure -libjars works, you need to make sure of the following.

Your driver program has to extend Configured and implement Tool, and it should create a GenericOptionsParser to retrieve the remaining (non-generic) arguments. Example code is shown below.

Please note: the configuration used in the run method has to be the one received from the main method (via getConf()). Otherwise, the property values passed on the command line will not be picked up.

The reason is as follows.

When main calls ToolRunner.run with the configuration, a GenericOptionsParser reads the generic command-line options (such as -libjars and -D properties) and stores them in that configuration before run is invoked.

If this configuration is not used in run, and a new Configuration is created and used instead, those properties are lost: the job will not know about the lib jars and will fail with a ClassNotFoundException. This is one of the most common mistakes programmers make.
 

Also note that your execution command should have the class name first, then -libjars, and then the regular inputs.

eg:

hadoop jar ./srini.jar com.srini.test.TestMapreduce -libjars /usr/lib/hadoop/lib/commons-logging-1.1.3.jar,/usr/lib/hadoop/lib/commons-lang-1.1.3.jar <argument1> <argument2> <argument3> etc..


Example Code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestMapReduce extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

         // Reuse the configuration populated by ToolRunner; creating a new
         // Configuration() here would lose the -libjars and -D command line options
         Configuration mainConf = super.getConf();
         GenericOptionsParser parser = new GenericOptionsParser(args);
         

         // inputs below will have all the actual arguments passed
         String[] inputs = parser.getRemainingArgs();

       
         Job job = Job.getInstance(mainConf);
       
         // TestMapper is your own mapper class (not shown here)
         job.setMapperClass(TestMapper.class);
         job.setInputFormatClass(TextInputFormat.class);
         job.setJarByClass(TestMapReduce.class);
         FileInputFormat.addInputPath(job, new Path(inputs[0]));
         LazyOutputFormat.setOutputFormatClass(job,TextOutputFormat.class);
         FileOutputFormat.setOutputPath(job, new Path(inputs[1]));
       
         job.setNumReduceTasks(0);
         return (job.waitForCompletion(true)?0:1);
    }
   
    public static void main(String... args) throws Exception
    {
        Configuration conf = new Configuration();
        int res = ToolRunner.run(conf,new TestMapReduce(), args);
        System.exit(res);
    }
}

How to see Hadoop's configuration details if the property name is known

Sometimes you may want to see a specific property of your Hadoop cluster, and you might wish there were a command-line option that gives the value stored in the configuration. And yes, there is a way to view most of Hadoop's configuration details.

This is an important feature that comes in handy when you join a new organization and are handed a Hadoop cluster, but you are not sure of the Hadoop installation path, so you cannot check the properties yourself in the Hadoop .xml files such as core-site.xml, yarn-site.xml, hdfs-site.xml, etc.

Let me give you some commands which can come in handy.

1) hadoop org.apache.hadoop.mapred.JobConf

-- This gives the properties used by MapReduce jobs. It takes properties from mapred-site.xml and mapred-default.xml. It works anywhere and gives all the properties.

(It works in both MR1 and MR2; don't be confused by the class package.)

This will dump the complete set of properties as XML on the screen; you can use grep to filter for whatever you need.
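For example, to check a single property (the property name here is just an illustration):

eg: hadoop org.apache.hadoop.mapred.JobConf | grep mapreduce.task.io.sort.mb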

2) hadoop org.apache.hadoop.hdfs.tools.GetConf

-- By running this with the -confKey option, we can see the value of a specific property.

eg: hadoop org.apache.hadoop.hdfs.tools.GetConf -confKey mapreduce.task.io.sort.mb

3) hadoop org.apache.hadoop.conf.Configuration -- By running this we can see the properties from core-site.xml and core-default.xml.

If you want to alter a specific property for your MapReduce job, you can do so with the help of GenericOptionsParser or ToolRunner.
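For example (the jar, class and property names are just an illustration), the -D generic option overrides a property for a single job run:

eg: hadoop jar ./srini.jar com.srini.test.TestMapreduce -D mapreduce.task.io.sort.mb=256 <argument1> <argument2>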

Thank You...

Friday 10 July 2015

Writing output of mapper/reducer using MultipleOutputs

MultipleOutputs can be used with both mappers and reducers.
But if the mapper output is written through MultipleOutputs, you will still get the regular part files (which will be empty), plus new files with your custom name.


package com.srini.test;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DecryptorMapReduce extends Configured implements Tool {
   
    static class DecryptionMapper extends Mapper<Text, Text, Text, NullWritable>
    {
       
        MultipleOutputs<Text, NullWritable> multipleOutputs;
       
        // Initialize MultipleOutputs once per map task
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
                multipleOutputs = new MultipleOutputs<Text, NullWritable>(context);
            }
       
        @Override
        protected void map(Text key, Text value, Context context) throws IOException, InterruptedException
        {
               // Write using multipleOutputs instead of context.write;
               // "Output" becomes the base name of the generated files (e.g. Output-m-00000)
                multipleOutputs.write(new Text(key), NullWritable.get(), "Output");
         }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // MultipleOutputs must be closed, otherwise its output may not be flushed
            multipleOutputs.close();
        }
       
        @Override
        public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            try {
              while (context.nextKeyValue()) {
                  map(context.getCurrentKey(), context.getCurrentValue(), context);
              }
            } finally {
              cleanup(context);
            }
          }
    }

    public int run(String[] args) throws Exception {
         Configuration conf = new Configuration();       
         Job job = Job.getInstance(conf);
       
         job.setMapperClass(DecryptionMapper.class);
         // The mapper declares Text keys and values, so use KeyValueTextInputFormat
         // (TextInputFormat would supply LongWritable keys and cause a ClassCastException)
         job.setInputFormatClass(KeyValueTextInputFormat.class);
         job.setJarByClass(DecryptorMapReduce.class);
         FileInputFormat.setInputPathFilter(job, TransporterPathFilter.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileInputFormat.setInputDirRecursive(job, true);
         job.setOutputFormatClass(TextOutputFormat.class);
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
       
         job.setNumReduceTasks(0);
         return job.waitForCompletion(true) ? 0 : 1;
    }
   
    public static void main(String... args) throws Exception
    {
        int res = ToolRunner.run(new DecryptorMapReduce(), args);
        System.exit(res);
    }
       
}
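As a side note: if you also want to get rid of the empty default part files mentioned above, one option (the same trick used in the libjars example earlier) is to register the output format lazily instead of calling setOutputFormatClass directly:

         // Create the default part files only when something is actually written to them
         LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

(LazyOutputFormat lives in org.apache.hadoop.mapreduce.lib.output.)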

How to avoid the _SUCCESS file in the MapReduce output folder

How to avoid the _SUCCESS and _logs files in the MapReduce output:

Hadoop produces a _logs folder and writes job history into it. Newer versions of Hadoop no longer do this, but just in case you want to know how to disable it:

We need to set the hadoop.job.history.user.location property to none.
In your MapReduce program, you can simply set:
conf.set("hadoop.job.history.user.location","none");

Please ensure this property is set before the job is created.

How to avoid the _SUCCESS file?

We need to set the mapreduce.fileoutputcommitter.marksuccessfuljobs property to false.

like this:
conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");