Tuesday, September 13, 2011

Generating PDF reports in Java using the iText library

The following Java program demonstrates how to use iText to produce a PDF report in the form of a certificate. The generated certificate is a PDF file that looks like the figure shown. Three Java classes are used in the sample, as listed below. The Certificate class is the basic certificate object, encapsulating all the content wanted on the certificate PDF. CertificateReport is the main Java class; when run, it produces the PDF certificate report under the ./templates/ directory. The tutorial also uses XmlUtils, an XML utility class for reading XML files; it has a method that returns the list of entries in the XML file required by our application.

Classes used in the PDF report generating application:

1. Certificate.java
2. CertificateReport.java
3. XmlUtils.java


Libraries Used:
iText1-3.jar

Certificate.java

package org.dvst.dvs;
import java.util.Date;
import java.util.List;
/**
* DTO for producing certificate
*  @author bishal acharya
*/
public class Certificate {
private int registrationNo;
private String fullName;
private String fatherName;
private String permanentAddres;
private int citizenshipNo;
private int passportNo;
private String creditHours;
private String system;
private Date fromDate;
private Date toDate;
private String course;
private String imagePath;
private List<String> headers;
private String workSystem;

public int getRegistrationNo() {
return registrationNo;
}
public void setRegistrationNo(int registrationNo) {
this.registrationNo = registrationNo;
}
public String getFullName() {
return fullName;
}
public void setFullName(String fullName) {
this.fullName = fullName;
}
public String getFatherName() {
return fatherName;
}
public void setFatherName(String fatherName) {
this.fatherName = fatherName;
}
public String getPermanentAddres() {
return permanentAddres;
}
public void setPermanentAddres(String permanentAddres) {
this.permanentAddres = permanentAddres;
}
public int getPassportNo() {
return passportNo;
}
public void setPassportNo(int passportNo) {
this.passportNo = passportNo;
}
public int getCitizenshipNo() {
return citizenshipNo;
}
public void setCitizenshipNo(int citizenshipNo) {
this.citizenshipNo = citizenshipNo;
}
public String getCreditHours() {
return creditHours;
}
public void setCreditHours(String creditHours) {
this.creditHours = creditHours;
}
public String getSystem() {
return system;
}
public void setSystem(String system) {
this.system = system;
}
public Date getFromDate() {
return fromDate;
}
public void setFromDate(Date fromDate) {
this.fromDate = fromDate;
}
public Date getToDate() {
return toDate;
}
public void setToDate(Date toDate) {
this.toDate = toDate;
}
public String getCourse() {
return course;
}
public void setCourse(String course) {
this.course = course;
}

public String getImagePath() {
return imagePath;
}
public void setImagePath(String imagePath) {
this.imagePath = imagePath;
}
public List<String> getHeaders() {
return headers;
}
public void setHeaders(List<String> headers) {
this.headers = headers;
}
public String getWorkSystem() {
return workSystem;
}
public void setWorkSystem(String workSystem) {
this.workSystem = workSystem;
}
}



CertificateReport.java

package org.dvst.dvs;
import java.io.FileOutputStream;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import com.lowagie.text.Chunk;
import com.lowagie.text.Document;
import com.lowagie.text.Font;
import com.lowagie.text.PageSize;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;

/**
* @author bishal acharya
*/
public class CertificateReport {
public CertificateReport(Certificate certificate) throws Exception {
SimpleDateFormat formatter = new SimpleDateFormat("EEE, d MMM yyyy");
Document document = new Document(PageSize.A4.rotate());
document.setPageSize(PageSize.A4);
PdfWriter.getInstance(document,
new FileOutputStream("./templates/" + certificate.getFullName()
+ "-" + certificate.getRegistrationNo() + ".pdf"));
document.open();
Paragraph p1 = new Paragraph(30);
p1.add(new Chunk(certificate.getHeaders().get(0), new Font(
Font.TIMES_ROMAN, 8)));
p1.setAlignment(1);
Paragraph p2 = new Paragraph();
p2.add(new Chunk(certificate.getHeaders().get(1), new Font(
Font.TIMES_ROMAN, 9, Font.BOLD)));
p2.setAlignment(1);
Paragraph p3 = new Paragraph();
p3.add(new Chunk(certificate.getHeaders().get(2), new Font(
Font.TIMES_ROMAN, 14, Font.BOLD)));
p3.setAlignment(1);
Paragraph p4 = new Paragraph();
p4.add(new Chunk(certificate.getHeaders().get(3), new Font(
Font.TIMES_ROMAN, 14)));
p4.setAlignment(1);
Paragraph p5 = new Paragraph(60);
p5.add(new Chunk("Reg. No. (DVSDT) :- "
+ certificate.getRegistrationNo(),
new Font(Font.TIMES_ROMAN, 9)));
p5.setAlignment(0);
p5.setIndentationLeft(80);
Paragraph p6 = new Paragraph(45);
p6.add(new Chunk("Certificate", new Font(Font.TIMES_ROMAN, 17,
Font.BOLD)));
p6.setAlignment(1);
Paragraph p7 = new Paragraph(30);
p7.add(new Chunk("This certificate is awarded to " + "  ", new Font(
Font.TIMES_ROMAN, 9)));
p7.add(new Chunk(certificate.getFullName() + "  ", new Font(
Font.TIMES_ROMAN, 9, Font.BOLD)));
p7.add(new Chunk("son of  ", new Font(Font.TIMES_ROMAN, 9)));
p7.add(new Chunk("Mr.  " + certificate.getFatherName(), new Font(
Font.TIMES_ROMAN, 9, Font.BOLD)));
p7.setAlignment(1);
Paragraph p8 = new Paragraph(18);
p8.add(new Chunk("Permanent resident of" + "  ", new Font(
Font.TIMES_ROMAN, 9)));

p8.add(new Chunk(certificate.getPermanentAddres() + "  ", new Font(
Font.TIMES_ROMAN, 9, Font.BOLD)));
p8.add(new Chunk("holding citizenship No : ", new Font(
Font.TIMES_ROMAN, 9)));
p8.add(new Chunk(certificate.getCitizenshipNo() + " ", new Font(
Font.TIMES_ROMAN, 9, Font.BOLD)));
p8.setAlignment(1);
Paragraph p9 = new Paragraph(18);
p9.add(new Chunk("& passport No. ", new Font(Font.TIMES_ROMAN, 9)));
p9.add(new Chunk(certificate.getPassportNo() + " ", new Font(
Font.TIMES_ROMAN, 9, Font.BOLD)));
p9.add(new Chunk("for successful completion of "
+ certificate.getCreditHours()
+ " Credit Hours course on Preliminary", new Font(
Font.TIMES_ROMAN, 9)));
p9.setAlignment(1);
Paragraph p10 = new Paragraph(18);
p10.add(new Chunk(
"education for the workers going to Republic of Korea under",
new Font(Font.TIMES_ROMAN, 9)));
p10.add(new Chunk("  " + certificate.getWorkSystem(), new Font(
Font.TIMES_ROMAN, 9, Font.BOLD)));
p10.setAlignment(1);

Paragraph p11 = new Paragraph(18);
p11.add(new Chunk("from  "
+ formatter.format(certificate.getFromDate()) + "  " + "to  "
+ formatter.format(certificate.getToDate()), new Font(
Font.TIMES_ROMAN, 9)));
p11.setAlignment(1);

Paragraph p12 = new Paragraph(45);
p12.add(new Chunk(
"---------------------"
+ "                                                                "
+ "                                                          "
+ "  ---------------------------", new Font(
Font.TIMES_ROMAN, 8)));

p12.setAlignment(1);
Paragraph p13 = new Paragraph(10);
p13.add(new Chunk(
"   Coordinator"
+ "                                                           "
+ "                                                                        "
+ "Executive Director", new Font(Font.TIMES_ROMAN, 8)));
p13.setAlignment(1);
Paragraph p14 = new Paragraph(20);
p14.setAlignment(1);
p14.add(new Chunk(formatter.format(new Date()), new Font(
Font.TIMES_ROMAN, 7, Font.BOLD)));

document.add(p1);
document.add(p2);
document.add(p3);
document.add(p4);
document.add(p5);
document.add(p6);
document.add(p7);
document.add(p8);
document.add(p9);
document.add(p10);
document.add(p11);
document.add(p12);
document.add(p13);
document.add(p14);
com.lowagie.text.Image image = com.lowagie.text.Image
.getInstance("./templates/me.JPG");
image.setBorder(1);
image.scaleAbsolute(100, 100);
image.setAbsolutePosition(450, 730);
document.add(image);
document.close();

}

public static void main(String[] args) {
List<String> headerList = XmlUtils.getNodeValue("DataElement", "Value",
"./templates/Contents.xml");
try {
Certificate c = new Certificate();
c.setFullName("Bishal Acharya");
c.setFatherName("Manoj Acharya");
c.setRegistrationNo(15236);
c.setCitizenshipNo(102545);
c.setPassportNo(3518161);
c.setFromDate(new Date());
c.setToDate(new Date());
c.setCreditHours("Fourty Five (45)");
c.setPermanentAddres("Shahid Marga Biratnagar");
c.setWorkSystem("Employment Permit System");
c.setHeaders(headerList);
CertificateReport cR = new CertificateReport(c);

} catch (Exception e) {
System.out.println(e);
}
}
}


XmlUtils.java

package org.dvst.dvs;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/**
* Returns node value in given XML
* @author bishal acharya
*/
public class XmlUtils {
/**
* Gets Node Value List from given XML document with file Path
* 
* @param parentTag
* @param tagName
* @param layoutFile
* @return
*/
public static List<String> getNodeValue(String parentTag, String tagName,
String layoutFile) {
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory
.newInstance();
DocumentBuilder docBuilder = null;
Document doc = null;
List<String> valueList = new ArrayList<String>();
try {
docBuilder = docBuilderFactory.newDocumentBuilder();
doc = docBuilder.parse(new File(layoutFile));
} catch (SAXException e) {
e.printStackTrace();
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (IOException e) {
System.out.println("Could not load file");
}
NodeList layoutList = doc.getElementsByTagName(parentTag);

for (int s = 0; s < layoutList.getLength(); s++) {
Node firstPersonNode = layoutList.item(s);
if (firstPersonNode.getNodeType() == Node.ELEMENT_NODE) {
Element firstPersonElement = (Element) firstPersonNode;
NodeList firstNameList = (firstPersonElement)
.getElementsByTagName(tagName);
Element firstNameElement = (Element) firstNameList.item(0);
NodeList textFNList = (firstNameElement).getChildNodes();
valueList.add(textFNList.item(0).getNodeValue().trim());
}
}
return valueList;
}

/**
* Unit test for XmlUtils class
* @param args
*/
public static void main(String args[]) {
System.out.println(XmlUtils.getNodeValue("DataElement", "Value",
"./templates/Contents.xml"));
}
}



The xml file should be present under ./templates/Contents.xml
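For reference, here is a minimal sketch of what ./templates/Contents.xml could look like. The root element name and the header strings below are placeholders (the utility only looks for DataElement elements and their Value children); CertificateReport expects at least four entries, which it uses as the four header lines of the certificate.

<Contents>
    <DataElement>
        <Value>Placeholder header line 1</Value>
    </DataElement>
    <DataElement>
        <Value>Placeholder header line 2</Value>
    </DataElement>
    <DataElement>
        <Value>Placeholder header line 3</Value>
    </DataElement>
    <DataElement>
        <Value>Placeholder header line 4</Value>
    </DataElement>
</Contents>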






The me.JPG image file should be present in the same ./templates/ directory (./templates/me.JPG).

Commons Validator : Java based Validation Utility


A utility to validate common data fields like numeric, double, date, email, integer, currency and percent values using the Apache Commons Validator library. The class below uses the commons-validator-1.3.1.jar library. The bulk of the logic is implemented in the library itself; this class is a simple Java program that brings the individual validators together behind one utility.


import java.math.BigDecimal;
import java.util.Date;
import java.util.Locale;

import org.apache.commons.validator.EmailValidator;
import org.apache.commons.validator.routines.BigDecimalValidator;
import org.apache.commons.validator.routines.CurrencyValidator;
import org.apache.commons.validator.routines.DateValidator;
import org.apache.commons.validator.routines.IntegerValidator;
import org.apache.commons.validator.routines.PercentValidator;

/**
* Utility for validation of different fields
*
* @author Bishal Acharya
*/
public class ValidationUtility {
/**
* Check if provided String is Number
*
* @param str
* @return
*/
public static boolean checkIsNumeric(String str) {
if (str == null)
return false;
return str.matches("-?\\d+(\\.\\d+)?");
}

/**
* Check if given String provided is double
*
* @param str
* @return
*/
public static boolean checkIfDouble(String str) {
if (str == null)
return false;
try {
Double.parseDouble(str);
} catch (NumberFormatException nfe) {
return false;
}
return true;
}

/**
* Validates whether provided string is date field or not
*
* @param date
* date string expected in the default MM/dd/yyyy format
* @return
*/
public static boolean validateDate(String date) {
String format = "MM/dd/yyyy";
DateValidator validator = DateValidator.getInstance();

Date dateVal = validator.validate(date, format);
if (dateVal == null) {
return false;
}
return true;
}

/**
* Validates whether provided string is date field or not
*
* @param date
* @param format
* @return boolean status of whether given data is valid or not
*/
public static boolean validateDate(String date, String format) {

DateValidator validator = DateValidator.getInstance();

Date dateVal = validator.validate(date, format);
if (dateVal == null) {
return false;
}
return true;
}

/**
* Formats the given date as according to given formatter
*
* @param date
* @param format
* @return
*/
public static String formatDate(String date, String format) {
DateValidator validator = DateValidator.getInstance();

String dateVal = null;
try {
dateVal = validator.format(date, format);
} catch (IllegalArgumentException e) {
System.out.println("Bad date:" + date + ": cannot be formatted");
}
if (dateVal == null) {
return null;
}
return dateVal;
}

/**
* Validates whether clients data is Integer or not
*
* @param integer
* @return
*/
public static boolean IntegerValidator(String integer) {
IntegerValidator validator = IntegerValidator.getInstance();

Integer integerVal = validator.validate(integer, "#,##0.00");
if (integerVal == null) {
return false;
}
return true;
}

/**
* validates whether data is currency of not
*
* @param currency
* @param loc
* @return
*/
public static boolean currencyValidator(String currency, Locale loc) {
BigDecimalValidator validator = CurrencyValidator.getInstance();
if (loc == null) {
loc = Locale.US;
}
BigDecimal amount = validator.validate(currency, loc);
if (amount == null) {
return false;
}
return true;
}

/**
* Validates whether data provided is in percentage or not
*
* @param percentVal
* @return
*/
public static boolean percentValidator(String percentVal) {
BigDecimalValidator validator = PercentValidator.getInstance();
BigDecimal percent = validator.validate(percentVal, Locale.US);
if (percent == null) {
return false;
}
// Check the percent is between 0% and 100%
return validator.isInRange(percent, 0, 1);
}

/**
* validates correct email address
*
* @param email
* @return
*/
public static boolean emailValidator(String email) {
EmailValidator validator = EmailValidator.getInstance();
boolean isAddressValid = validator.isValid(email);
return isAddressValid;
}

public static void main(String args[]) {
String s = "12/18/1952";
System.out.println(DateValidator.getInstance().validate(s,
"MM/dd/yyyy"));
System.out.println("valid percent :"
+ ValidationUtility.percentValidator("100"));
System.out.println("Invalid percent :"
+ ValidationUtility.percentValidator("110"));

System.out.println("Valid Currency :"
+ ValidationUtility.currencyValidator("100", Locale.US));
System.out.println("InValid Currency :"
+ ValidationUtility.currencyValidator("Dollar", Locale.US));
System.out.println("Integer Validator :"
+ ValidationUtility.IntegerValidator("1"));
System.out.println("Integer Validator :"
+ ValidationUtility.IntegerValidator("1.2"));
System.out.println("Valid Numeric:"
+ ValidationUtility.checkIsNumeric("1"));
System.out.println("InValid Numeric:"
+ ValidationUtility.checkIsNumeric("ABCD")); }

}




Output :-

Thu Dec 18 00:00:00 NPT 1952
valid percent :true
Invalid percent :false
Valid Currency :true
InValid Currency :false
Integer Validator :true
Integer Validator :false
Valid Numeric:true
InValid Numeric:false

Utility for Evaluating XML Strings and Files in Java

Given below is a Java utility class that can be used to evaluate/parse XML files. It provides three methods that parse a given XML file or XML string.

getNodeValueListFromFile

This method returns a map of node-value lists from the XML given as a String.

getNodeValue

This method returns the value of a node from the given XML file.

printNodeValueFromFile

This method prints the node values from the XML given as a String.

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/**
* @author bacharya
*/
public class XmlUtils {
/**
* Gets Node Value from given XML document with file Path
*
* @param parentTag
* @param tagName
* @param layoutFile
* @return
*/
public static String getNodeValue(String parentTag, String tagName,
String layoutFile) {
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory
.newInstance();
DocumentBuilder docBuilder = null;
Document doc = null;

try {
docBuilder = docBuilderFactory.newDocumentBuilder();
doc = docBuilder.parse(new File(layoutFile));
} catch (SAXException e) {
e.printStackTrace();
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (IOException e) {
System.out.println("Could not load file");
}
NodeList layoutList = doc.getElementsByTagName(parentTag);

for (int s = 0; s < layoutList.getLength(); s++) {
Node firstPersonNode = layoutList.item(s);
if (firstPersonNode.getNodeType() == Node.ELEMENT_NODE) {
Element firstPersonElement = (Element) firstPersonNode;
NodeList firstNameList = (firstPersonElement)
.getElementsByTagName(tagName);
Element firstNameElement = (Element) firstNameList.item(0);
NodeList textFNList = (firstNameElement).getChildNodes();
return textFNList.item(0).getNodeValue().trim();
}
}
return null;
}

/**
* Prints the node values from the given XML passed as a String
*
* @param parentTag
* @param tagName
* @param xmlRecords
*/
public static void printNodeValueFromFile(String parentTag, String tagName,
String xmlRecords) {
try {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(xmlRecords));

Document doc = db.parse(is);
NodeList nodes = doc.getElementsByTagName(parentTag);

for (int i = 0; i < nodes.getLength(); i++) {
Element element = (Element) nodes.item(i);

NodeList name = element.getElementsByTagName(tagName);
Element line = (Element) name.item(0);
System.out.println(": " + getCharacterDataFromElement(line));
}
} catch (Exception e) {
e.printStackTrace();
}
return "";

}

/**
* Gets Node Value from given XML as Map of List of Strings
*
* @param parentTag
* @param tagName
* @param xmlRecords
* @return
*/
public static Map<String, List<String>> getNodeValueListFromFile(
String parentTag, String tagName, String xmlRecords) {
Map<String, List<String>> nodeMapList = new HashMap<String, List<String>>();
List<String> valueList = new ArrayList<String>();

try {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(xmlRecords));

Document doc = db.parse(is);
NodeList nodes = doc.getElementsByTagName(parentTag);

for (int i = 0; i < nodes.getLength(); i++) {
Element element = (Element) nodes.item(i);

NodeList name = element.getElementsByTagName(tagName);
Element line = (Element) name.item(0);
valueList.add(getCharacterDataFromElement(line));
}
} catch (Exception e) {
e.printStackTrace();
}
nodeMapList.put(tagName, valueList);
return nodeMapList;

}

public static String getCharacterDataFromElement(Element e) {
Node child = e.getFirstChild();
if (child instanceof CharacterData) {
CharacterData cd = (CharacterData) child;
return cd.getData();
}
return "?";
}

/**
* Unit test for XmlUtils class
*
* @param args
*/
public static void main(String args[]) {
String rec = "<Layout>" + " <Data>" + " <Name>testLayout</Name>"
+ " <Delimiter>s</Delimiter>" + " </Data>" + " <Details>"
+ " <DataElement>" + " <fieldName>GROUP</fieldName>"
+ " <Type>String</Type>" + " <Location>" + " <Num>1</Num>"
+ " </Location>" + " </DataElement>" + " <DataElement>"
+ " <fieldName>ENTITY_CODE</fieldName>"
+ " <Type>String</Type>" + " <Location>" + " <Num>2</Num>"
+ " </Location>" + " </DataElement>" + " </Details>"
+ "</Layout>";

Map<String, List<String>> nodeMapList = XmlUtils
.getNodeValueListFromFile("DataElement", "fieldName", rec);
Map<String, List<String>> nodeMapList1 = XmlUtils
.getNodeValueListFromFile("DataElement", "Type", rec);
Map<String, List<String>> nodeMapList2 = XmlUtils
.getNodeValueListFromFile("Location", "Num", rec);

List<String> val = nodeMapList.get("fieldName");

System.out.println(val.size());
for (int i = 0; i < val.size(); i++) {
System.out.println(nodeMapList1.get("Type").get(i));
System.out.println(nodeMapList2.get("Num").get(i));
}

XmlUtils.printNodeValueFromFile("DataElement", "fieldName", rec);

}
}



Test Output :

2
String
1
String
2
: GROUP
: ENTITY_CODE

Thursday, September 08, 2011

Using Distributed Cache in Hadoop (Hadoop 0.20.2)

Quite often we encounter situations where we need certain files, like configuration files, jar libraries, XML files or properties files, to be present on Hadoop's processing nodes at execution time. Understandably, Hadoop has a feature called Distributed Cache which helps in shipping these read-only files to the task nodes. In a Hadoop environment jobs are basically MapReduce jobs, and the required read-only files are copied to the tasktracker nodes at the beginning of job execution. The default size of the distributed cache in Hadoop is about 10 GB, but we can control it by setting the local.cache.size property in Hadoop's configuration.


Thus, the distributed cache is a mechanism for caching read-only data across a Hadoop cluster. The read-only files are registered at job creation time, and the framework makes the cached files available to the cluster nodes at computation time.


The following distributed cache Java program sends the necessary XML files to the task-executing nodes prior to job execution.


Java Program/Tutorial of Distributed Cache usage in Hadoop
Hadoop Version : 0.20.2
Java Version: Java-SE-1.6



The program below consists of two classes: the DcacheMapper class and its parent class. The job is configured in DcacheMapper's main method, pointing DistributedCache at the HDFS location of the file to be sent to all nodes. When the setup method of the parent class executes, we can retrieve the distributed configuration file and read it for our usage.

API doc for Distributed Cache can be found at the given URL.

http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html


Class to create a MapReduce job that uses the Distributed Cache


package com.bishal.mapreduce;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
* @author Bishal Acharya
*
*/
public class DcacheMapper extends ParentMapper {
public DcacheMapper() {
super();
}

public void map(Object key, Text value, Context context) throws IOException {
/**
* Write your map Implementation
*/
};


public static void main(String args[]) throws URISyntaxException,
IOException, InterruptedException, ClassNotFoundException {

Configuration conf = new Configuration();

final String NAME_NODE = "hdfs://localhost:9000";

Job job = new Job(conf);

DistributedCache.addCacheFile(new URI(NAME_NODE
+ "/user/root/input/Configuration/layout.xml"),
job.getConfiguration());


job.setMapperClass(DcacheMapper.class);
job.setJarByClass(DcacheMapper.class);
job.setNumReduceTasks(0);

FileInputFormat.addInputPath(job,
new Path(NAME_NODE + "/user/root/input/test.txt"));
FileOutputFormat.setOutputPath(job,
new Path(NAME_NODE + "/user/root/output/importOutput"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}




Parent Class to read data from Distributed Cache


package com.bishal.mapreduce;

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
* BaseMapper class to perform initialization setup and cleanUp tasks using Distributed Cache for Map
* Reduce job
*
* @author Bishal Acharya
*
*/
public class ParentMapper extends Mapper<Object, Text, Object, Text> {
protected Configuration conf;

public ParentMapper() {
initialize();
}

private void initialize() {
conf = new Configuration();
}

protected void setup(
org.apache.hadoop.mapreduce.Mapper<Object, Text, Object, Text>.Context context)
throws java.io.IOException, InterruptedException {

Path[] uris = DistributedCache.getLocalCacheFiles(context
.getConfiguration());

BufferedReader fis;

/**
* Prepare Objects from Layout XML
*/
for (int i = 0; i < uris.length; i++) {
if (uris[i].toString().contains("layout")) {

String chunk = null;
fis = new BufferedReader(new FileReader(uris[i].toString()));
String records = "";
while ((chunk = fis.readLine()) != null) {
records += chunk;
}
// do whatever you like with xml using parser
System.out.println("Records :" + records);
}

}

};

protected void cleanup(
org.apache.hadoop.mapreduce.Mapper<Object, Text, Object, Text>.Context context)
throws java.io.IOException, InterruptedException {
DistributedCache.purgeCache(conf);
};

}


In the parent class we read the cache file in the setup method of the job, and the cache purge operation is performed in the cleanup phase of the MapReduce job.
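As a follow-up, here is a minimal sketch of how the records string read in setup could be parsed, reusing the DOM approach from the XmlUtils class shown earlier. The element names (DataElement, fieldName) are assumptions about layout.xml carried over from that example, not something fixed by the Distributed Cache API.

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper for parsing the cached layout.xml content read in setup(). */
public class LayoutParser {
    /** Collects the text of every <fieldName> under each <DataElement>; element names are assumed. */
    public static List<String> readFieldNames(String xmlRecords) throws Exception {
        List<String> values = new ArrayList<String>();
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(new InputSource(new StringReader(xmlRecords)));
        NodeList elements = doc.getElementsByTagName("DataElement");
        for (int i = 0; i < elements.getLength(); i++) {
            Element element = (Element) elements.item(i);
            NodeList names = element.getElementsByTagName("fieldName");
            if (names.getLength() > 0) {
                values.add(names.item(0).getTextContent().trim());
            }
        }
        return values;
    }
}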

Wednesday, September 07, 2011

Evaluating Javascript expressions in Java

Since the Java virtual machine ships with a JavaScript runtime, we can evaluate JavaScript expressions directly in our Java classes. This approach can be helpful when we want certain JavaScript expressions to be evaluated in a Java class on the server rather than on the client side. The JVM has a built-in script engine called Mozilla Rhino. Rhino is an open-source implementation of JavaScript written in Java, which is used to run expressions given in JavaScript from the JVM.


Given below is the scripting engine info for the Java virtual machine. This information can be obtained by running the code below.

ScriptEngineManager mgr = new ScriptEngineManager();
List<ScriptEngineFactory> factories = mgr.getEngineFactories();

for (ScriptEngineFactory factory : factories) {
System.out.println("engName :" + factory.getEngineName()
+ ":engVersion:" + factory.getEngineVersion()
+ ":langName :" + factory.getLanguageName()
+ ":langVersion :" + factory.getLanguageVersion());
}


When the engine information is printed we can see the following results.

Script Engine: Mozilla Rhino (1.6 release 2)
Engine Alias: js
Engine Alias: rhino
Engine Alias: JavaScript
Engine Alias: javascript
Engine Alias: ECMAScript
Engine Alias: ecmascript
Language: ECMAScript (1.6)


Given below is the class which evaluates the JavaScript expression passed to it. The evaluate method takes the expression to be evaluated, the setter names, and the setterValueMap, which contains key/value pairs of setter fields and their corresponding values.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

/**
*@author Bishal Acharya
*
*/
public class JavascriptEvaluator {
/**
*
* @param expression
* the expression to evaluate
* @param setters
* list of setters for the expression
* @param setterValueMap
* Map having setterKey and its value
* @return Object the evaluated expression
*/
public static Object evaluate(String expression, String setters,
Map<String, String> setterValueMap) {
String[] setterArray = setters.split(",");
List<String> setterList = new ArrayList<String>();

for (String val : setterArray) {
setterList.add(val.trim());
}

ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine jsEngine = mgr.getEngineByName("JavaScript");
Object obj = null;
for (int i = 0; i < setterList.size(); i++) {
jsEngine.put(setterList.get(i),
setterValueMap.get(setterList.get(i)));
}
try {
obj = jsEngine
.eval("func1(); function func1(){" + expression + "}");
} catch (ScriptException e) {
e.printStackTrace();
}
return obj;
}

public static void main(String args[]) {
String expr = "return GROUP_NAME.substr(0,5).concat(' How are You');";
String setters = "GROUP_NAME";
Map<String, String> setterValueMap = new HashMap<String, String>();
setterValueMap.put("GROUP_NAME", "Hello World");
System.out.println(JavascriptEvaluator.evaluate(expr, setters,
setterValueMap));
}
}



Output of Running above class :

Hello How are You

Thursday, August 18, 2011

A JINI based approach towards dynamic cluster Management and Resource Utilization

Abstract
With offices nowadays relying on computers, we are faced with the challenge of maximizing the utilization of the computing resources offered by each computer while minimizing cost. With many computers come many idle resources. Jobs can be distributed out to idle servers or even idle desktops. Many of these resources remain idle during off-office hours, or even during office hours, with many users under-utilizing both computing and memory resources. We can manage policies allowing jobs to go only to computers that have free resources, letting others run normally, and hence maximize throughput as well as minimize cost. Our proposed model not only utilizes resources optimally but also makes the architecture more modular and adaptive, and provides dynamic failover recovery and linear scalability.

Keywords : JINI, javaspace, cluster, Space


1. Introduction

As the size of any organization increases, so does the issue of managing the increased resources. A little foresight may save huge costs for the organization. The question may arise of how to manage computer clusters so that computational tasks are carried out more efficiently. In this paper we focus on a prototype of cluster management using JINI, and also show how to set up a cluster management system that performs resource sharing. Traditional architectures normally focus on the client-server or peer-to-peer interaction model, but our focus is a different architecture: the space-based architecture. The space-based idea has several advantages compared to its counterparts. A space-based architecture is said to be more robust because one agent failing will not bring down the whole system, as is the case with the client-server model. Replication and mirroring of persistent spaces permit communication regardless of network failures. Communication between peers is anonymous and asynchronous, which lets computers in the cluster work together to solve a problem collectively. These attributes of the space-based architecture enable us to build an adaptive cluster. We will particularly focus on managing clusters in an adaptive manner, where an increase or decrease in the number of peers won't create any problem for the overall space. Our approach is based on one of the services of JINI, the JavaSpace.


2. JINI and JavaSpaces

JINI technology is a service-oriented architecture that defines a programming model which both exploits and extends the ability of Java technology to enable the creation of distributed systems consisting of federations of well-behaved networked services and clients. JINI technology can be used to build adaptive network systems that are scalable, evolvable and flexible, as typically required in dynamic distributed systems [1]. JINI enables computers to find each other and use each other's services on the network without prior information about each other or the protocols used. To make Jini self-healing, leases are utilized. Nearly every registration or resource must be leased, that is, it must periodically be confirmed that the registered resource is alive or that there is still interest in a resource.
If the lease is not renewed before its expiration, the resource or registration becomes unavailable. This provides a form of distributed garbage collection, where only healthy resources continue to be published.


2.1 An overview of JINI Infrastructure

Jini comes with several standard infrastructure components out of the box. To achieve the nonfunctional requirements (NFRs) of performance, resiliency, and scalability, multiple instances of these components run simultaneously on different machines.

The standard Jini services are:
· Lookup Service : The lookup service, named reggie, is the first among equals of Jini services.
All services in a Jini architecture register with the lookup service to make themselves available to other services. All initial access to other services is via the lookup service, after which clients bind directly.
· Class Server : The class server, a simple HTTP daemon, eliminates the coupling between clients and service implementations. Clients deal with interfaces. If a particular implementation class is required, it is downloaded transparently.
· Transaction Manager : Distributed transactions support is provided by the transaction manager service called mahalo.
· JavaSpace : Services with a requirement to share information with other services do so through a JavaSpace. The reference implementation of the JavaSpace is named outrigger.

2.2 JavaSpace technology

The JavaSpaces technology is a high-level tool for building distributed applications, and it can also be used as a coordination tool. A marked departure from classic distributed models that rely on message passing or RMI, the JavaSpaces model views a distributed application as a collection of processes that cooperate through the flow of objects into and out of one or more spaces. This programming model has its roots in Linda, a coordination language developed by Dr. David Gelernter at Yale University. However, no knowledge of Linda is required to understand and use JavaSpaces technology [2]. The dominant model of computation in distributed computing is the client-server model. This model is based on the assumption that local procedure calls are the same as remote procedure calls. JavaSpaces overcome the problems of synchronization, latency and partial failure, inherent in distributed systems, by providing loosely coupled interactions between the components of distributed systems. JavaSpace processes communicate through a space, not directly. Communication between processes on different physical machines is asynchronous and free from the main limitation of the traditional client/server model, where communication requires the simultaneous presence on the network of both parties, client and server. Sender and receiver in a JavaSpace don't need to be synchronized and can interact when the network is available. In a distributed application, JavaSpaces technology acts as a virtual space between providers and requesters of network resources or objects. This allows participants in a distributed solution to exchange tasks, requests, and information in the form of Java technology based objects [3]. The JavaSpace transaction management and notify features make it easier to build a dynamic cluster management framework. In particular they address the dynamic cluster problem where nodes can depart and join the cluster at any time.


2.3 Leasing

One of the important features of Jini is the concept of leasing. All resources in a Jini system are leased, including proxy records in the lookup service, transactions, and, of course, memory in a JavaSpace. When a lease expires, the resource is recovered and made available to other components. This prevents resource accretion, a common problem in distributed systems. Leases can be renewed explicitly, or implicitly through a lease renewal manager. In the interests of simplicity, the example below uses leases that last "forever". This is obviously inappropriate for a production system [4].

2.4 The Entry Interface

All objects that can be stored in a JavaSpace must implement the Entry interface. Entry is a simple tag interface that does not add any methods but does extend Serializable. Like JavaBeans, all Entry implementations must provide a public constructor that takes no arguments. Unlike JavaBeans, all of the data members of an Entry must be public. In production Jini systems, the public data members are a non-issue because of common techniques like the envelope-letter idiom. The Entry implementation acts as an envelope or wrapper around the "real" payload, which may be any serializable Java object. The only public data members exposed are those required for the template-based matching of the JavaSpaces API [5].
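To make the Entry contract concrete, here is a minimal sketch of an entry a cluster node might write when it joins the space. The class name and fields are illustrative, not part of the JINI API.

import net.jini.core.entry.Entry;

// Illustrative entry a node could write to the space when joining the cluster.
// Fields must be public and the class must have a public no-argument constructor.
public class JoinEntry implements Entry {
    public String hostName;   // node announcing itself
    public Integer freeCpus;  // advertised spare capacity

    public JoinEntry() {
        // required public no-arg constructor
    }

    public JoinEntry(String hostName, Integer freeCpus) {
        this.hostName = hostName;
        this.freeCpus = freeCpus;
    }
}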


3. JavaSpace based cluster management

At the focus of our system is a working space and entries that cluster nodes can write to the space. These entries are a Join entry and a Depart entry. The space itself is assumed to run on a host that forms the nucleus of the cluster, and in fact for the work reported here we assume this host and its space stay up. We are working on the problem of making this space system properly persistent and robust against temporary failure. It is further assumed that the space-hosting node is well known to the other nodes that participate in the cluster. This seems a reasonable assumption for a cluster within an administrative boundary.


3.1 Architecture Description





A JavaSpaces service holds entries, each of which is a typed group of objects expressed in a class that implements the interface net.jini.core.entry.Entry. Once an entry is written into a JavaSpaces service, it can be used in future lookup operations. Looking up entries is performed using templates, which are entry objects that have some or all of their fields set to specified values that must be matched exactly. All remaining fields, which are not used in the lookup, are left as wildcards.
There are two lookup operations: read() and take(). The read() method returns either an entry that matches the template or an indication that no match was found. The take() method operates like read(), but if a match is found, the entry is removed from the space. Distributed events can be used by requesting a JavaSpaces service to notify you when an entry that matches the specified template is written into the space. Note that each entry in the space can be taken at most once, but two or more entries may have exactly the same values. Using JavaSpaces technology, distributed applications are modeled as a flow of objects between participants, which is different from classic distributed models such as RMI. Figure 1 indicates what a JavaSpaces technology based application looks like. A client can interact with as many JavaSpaces services as needed. Clients perform operations that map entries or templates onto JavaSpaces services. Such operations can be singletons or contained in a transaction so that all or none of the operations take place. Notifications go to event catchers, which can be either clients or proxies for clients.
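A minimal sketch of these operations using the illustrative JoinEntry above. It assumes a JavaSpace proxy has already been obtained from the lookup service, and uses Lease.FOREVER purely for brevity, as cautioned in section 2.3.

import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

public class SpaceOperationsSketch {
    // 'space' is assumed to have been discovered via the lookup service (reggie).
    public static void demo(JavaSpace space) throws Exception {
        // A node announces itself by writing a Join entry.
        space.write(new JoinEntry("node-17", Integer.valueOf(4)), null, Lease.FOREVER);

        // A template with null fields acts as a wildcard match.
        JoinEntry template = new JoinEntry(null, null);

        // read() returns a matching entry (or null after the timeout) and leaves it in the space.
        JoinEntry seen = (JoinEntry) space.read(template, null, 1000);
        if (seen != null) {
            System.out.println("saw node: " + seen.hostName);
        }

        // take() removes the matching entry from the space.
        JoinEntry claimed = (JoinEntry) space.take(template, null, Long.MAX_VALUE);
        System.out.println("claimed node: " + claimed.hostName);
    }
}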


4. JavaSpace related concepts

4.1 Transactions

The JavaSpaces API uses the package net.jini.core.transaction to provide basic atomic transactions that group multiple operations across multiple JavaSpaces services into a bundle that acts as a single atomic operation. Either all modifications within the transaction will be applied or none will, regardless of whether the transaction spans one or more operations or one or more JavaSpaces services. Note that transactions can span multiple spaces and participants in general. A read(), write(), or take() operation that has a null transaction acts as if it were in a committed transaction that contained only that operation. As an example, a take() with a null transaction parameter performs as if a transaction was created, the take() was performed under that transaction, and then the transaction was committed.
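Below is a sketch of a take-then-write pair grouped under one transaction, reusing the illustrative JoinEntry from the section 2.4 sketch. It assumes a TransactionManager proxy (mahalo) has already been discovered; the ten-second transaction lease and the five-second take timeout are arbitrary illustrative choices.

import net.jini.core.lease.Lease;
import net.jini.core.transaction.Transaction;
import net.jini.core.transaction.TransactionFactory;
import net.jini.core.transaction.server.TransactionManager;
import net.jini.space.JavaSpace;

public class TransactionSketch {
    // 'space' and 'txnManager' are assumed to have been discovered via the lookup service.
    public static void updateAtomically(JavaSpace space, TransactionManager txnManager) throws Exception {
        // Create a transaction with a 10 second lease.
        Transaction.Created created = TransactionFactory.create(txnManager, 10 * 1000);
        Transaction txn = created.transaction;
        try {
            // Both operations are applied together, or not at all.
            JoinEntry node = (JoinEntry) space.take(new JoinEntry(null, null), txn, 5000);
            if (node != null) {
                node.freeCpus = Integer.valueOf(node.freeCpus.intValue() - 1);
                space.write(node, txn, Lease.FOREVER);
            }
            txn.commit();
        } catch (Exception e) {
            txn.abort();
            throw e;
        }
    }
}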

4.2 The Jini Outrigger JavaSpaces Service
The Jini Technology Starter Kit comes with the package com.sun.jini.outrigger, which provides an implementation of a JavaSpaces technology-enabled service. You can run it two ways:
· As a transient space that loses its state between executions: use com.sun.jini.outrigger.TransientOutriggerImpl.
· As a persistent space that maintains state between executions: use com.sun.jini.outrigger.PersistentOutriggerImpl.
The TransientOutriggerImpl can be run only as a non-activatable server, but the PersistentOutriggerImpl can be run as either an activatable or non-activatable server.

4.3 Distributed Data structure in JavaSpace

With a JavaSpace it is also possible to organize objects in the form of a tree structure or an array. Since remote processes may access these structures concurrently, they are called distributed data structures. A channel, in JavaSpaces terminology, is a distributed data structure that organizes messages in a queue. Several processes can write messages to the end of the channel, and several processes can read or take messages from the beginning of it. A channel is made up of two pointer objects, the head and the tail, which contain the numbers of the first and the last entry in the channel (Figure 2). It is possible to use several such channels, giving all actors associated with a space the possibility to handle messages in a FIFO fair manner. Channels may also be bounded, meaning that an upper limit can be set on how many messages a channel may contain.
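A sketch of the tail-pointer idea described above: a message is appended by taking the tail pointer, writing the message at that position, and writing the incremented pointer back. The entry classes and field names here are illustrative, not a standard API.

import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

// Illustrative pointer entry: the "head" or "tail" of a named channel.
class ChannelPointer implements Entry {
    public String channel;   // channel name
    public String kind;      // "head" or "tail"
    public Integer position; // index of the first/last message
    public ChannelPointer() {}
}

// Illustrative message entry stored at a numbered position in the channel.
class ChannelMessage implements Entry {
    public String channel;
    public Integer position;
    public String payload;
    public ChannelMessage() {}
}

public class ChannelSketch {
    /** Appends a message by taking the tail pointer, writing at its position, and putting it back incremented. */
    public static void append(JavaSpace space, String channel, String payload) throws Exception {
        ChannelPointer tailTemplate = new ChannelPointer();
        tailTemplate.channel = channel;
        tailTemplate.kind = "tail";
        // Taking the pointer also acts as a lock on the end of the channel.
        ChannelPointer tail = (ChannelPointer) space.take(tailTemplate, null, Long.MAX_VALUE);

        ChannelMessage msg = new ChannelMessage();
        msg.channel = channel;
        msg.position = tail.position;
        msg.payload = payload;
        space.write(msg, null, Lease.FOREVER);

        tail.position = Integer.valueOf(tail.position.intValue() + 1);
        space.write(tail, null, Lease.FOREVER);
    }
}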




4.4 Master Worker pattern

The Master-Worker pattern (sometimes called the Master-Slave pattern) is used for parallel processing and is the basic pattern for working with a JavaSpace. It follows a simple approach that allows applications to perform simultaneous processing across multiple machines or processes via a master and multiple workers. The master hands out units of work to the "space", and these are read, processed and written back to the space by the workers. In a typical environment there are several "spaces", several masters and many workers; the workers are usually designed to be generic, i.e. they can take any unit of work from the space and process the task.
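A compact sketch of the pattern with illustrative TaskEntry and ResultEntry classes: the master writes tasks into the space, and each worker loops, taking any task, processing it, and writing the result back.

import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

// Illustrative unit of work handed out by the master.
class TaskEntry implements Entry {
    public Integer taskId;
    public String input;
    public TaskEntry() {}
}

// Illustrative result written back by a worker.
class ResultEntry implements Entry {
    public Integer taskId;
    public String output;
    public ResultEntry() {}
}

public class MasterWorkerSketch {
    /** Master side: drop a task into the space. */
    public static void submit(JavaSpace space, int taskId, String input) throws Exception {
        TaskEntry task = new TaskEntry();
        task.taskId = Integer.valueOf(taskId);
        task.input = input;
        space.write(task, null, Lease.FOREVER);
    }

    /** Worker side: repeatedly take any task, process it, and write the result back. */
    public static void workLoop(JavaSpace space) throws Exception {
        TaskEntry anyTask = new TaskEntry(); // all-null template matches any task
        while (true) {
            TaskEntry task = (TaskEntry) space.take(anyTask, null, Long.MAX_VALUE);
            ResultEntry result = new ResultEntry();
            result.taskId = task.taskId;
            result.output = task.input.toUpperCase(); // placeholder for real processing
            space.write(result, null, Lease.FOREVER);
        }
    }
}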



5. Discussions and Conclusions

In this paper we have discussed using JINI/JavaSpace technology to provide clustering support. The approach presented is useful in settings that require clusters to be set up for resource-intensive work, like data processing or computational jobs. Our model can be realized using JINI/JavaSpace technology, which is open source and hence cost effective compared to proprietary solutions. For government offices in particular, this approach can prove beneficial as it provides effective solutions to clustering issues like scalability, fault tolerance, adaptability and utilization of resources. Creating adaptive systems in dynamic environments, where services and clients come and go all the time and system components may dynamically be added and removed, is a complex task. JavaSpaces has several features that ease this task, including its ability to provide asynchronous communication uncoupled in time, space and destination, based on associative addressing. Since a JavaSpace stores objects, it is a simple means of distributing both messages and agent behavior. Our space-based architecture exploits these possibilities together with the actor role abstraction to simplify the creation of adaptive systems. The architecture consists of three main types of agents that interact asynchronously through the space.



6 .References

[1] http://www.jini.org/wiki/Main_Page
[2] http://java.sun.com/developer/technicalArticles/tools/JavaSpaces/
[3] http://www.javafaq.nu/javaarticle150.html
[4] http://www.softwarematters.org/jiniintro.html
[5] http://www.artima.com/intv/swayP.html
[6] http://www.javaworld.com/javaworld/jw102000/jw1002jiniology.html?page=1
[7] http://java.sun.com/developer/technicalArticles/tools/JavaSpaces/
[8] http://www.theserverside.com/tt/articles/article.tss?l=UsingJavaSpaces
[9] http://www.artima.com/lejava/articles/dynamic_clustering.html
[10] http://www.artima.com/intv/cluster2.html
[11] Grid Computing: A Practical Guide To Technology And Applications, by Ahmar Abbas. Publisher: Charles River Media
[12] http://jan.newmarch.name/java/jini/tutorial/Jini.html
[13] Grid Computing: Software Environment and Tools, by Omer F. Rana (Editor) and Jose C. Cunha (Editor)
[14] http://java.sun.com/developer/Books/JavaSpaces/
[15] Dynamic Cluster Configuration and Management using JavaSpaces (K.A. Hawick and H.A. James, Computer Science Division, School of Informatics)


Semantic Modeling of Distributed key value stores featuring Hbase and Wikipedia

Abstract

This research paper intends to classify Wikipedia's articles and documents into a consistent ontology, such as persons, organizations, films, albums, video games, species and diseases, leveraging the distributed key-value store HBase. Each entity is kept as a column within a column family, which will be one of the individual ontologies mentioned above. Each get operation to HBase is responsible for retrieving the triples for the same key. The choice of a key-value store over a relational database promotes horizontal scalability and an increase in performance when the dataset is of the magnitude of Wikipedia. The proposed approach will help make semantic queries over the Wikipedia dataset and draw intelligent inferences, which is currently difficult, especially with traditional relational databases. The approach presents a statistical analysis of performance comparing RDBMS with HBase, a clone of Google's BigTable.
Index Terms—Key-value store, HBase, Hadoop, EVI (Encoded Vector Index), Semantic Modeling, OWL, RDF, MapReduce.


I.INTRODUCTION

ONE of the challenges of semantic search and retrieval in natural language processing is the storage and manipulation of a large number of subject-predicate-object expressions, which can be RDF, making statements about particular resources. Most work on knowledge representation involves persisting RDF data in relational databases or in native representations called Triple Stores (or Quad Stores if context is also persisted). But circumstances are complicated by the fact that the web today has billions of pages, and RDBMS are nowadays more rigorously questioned on whether they can sustain the tight scalability constraints imposed by the massive growth in datasets over the past years, a trend that is still growing. More specifically, we have the following concerns about whether an RDBMS suitably fits semantic data persistence: scalability, querying, efficiency, query latency and organizational overhead. In this paper we present an approach to semantic modeling of RDF triples with HBase, a sparse, distributed, persistent multidimensional sorted map for storing key-values. The paper works out a way of mapping RDF data to HBase and champions a new approach to the data processing pipeline. In relational databases, data is stored in the conventional horizontal way, which might present several problems, such as limitations on the number of columns (1012 for DB2 and Oracle), while we might have to model many properties of an object. Even if we were allowed more columns, an RDBMS stores nulls in the vacant cells, which wastes a lot of space and presents additional problems when the data size is substantially larger. Other limitations with RDBMS are the difficulty of changing the schema, like adding columns later, or of maintaining timestamps. Hence a large penalty in performance has to be paid if data records are wide or the dataset is huge. Finally, RDBMS face serious limitations when the questions are about addressing redundancy, parallelism and horizontal scalability. The flexibility of column-oriented key-value stores is that they can be used either as a wide table consisting of millions of columns grouped under a single column family, or as a very tall table consisting of billions of rows. This can be suitably exploited to make a persistent store for ontologies. Our approach is a natural way of storing RDF data in HBase: each RDF subject or resource corresponds to a row key, and each RDF predicate or property corresponds to a column. Keyspaces can be used to represent RDF repositories, and column families can be used to represent named graphs. The major advantage of this approach is that it inherits all the advantages of HBase, such as high availability, consistency, horizontal scalability and data partitioning across the cluster, to name some. The system is optimized for resource-oriented access patterns to RDF statements about a particular subject taken from Wikipedia.
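A minimal sketch of this subject-as-row-key, predicate-as-column mapping, assuming an HBase 0.20-style client setup. The table name ("wikipedia"), the column family ("person") and the sample triple are illustrative.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TripleStoreSketch {
    public static void main(String[] args) throws Exception {
        // Table "wikipedia" with column family "person" is assumed to exist already.
        HTable table = new HTable(new HBaseConfiguration(), "wikipedia");

        // Store the triple (subject, predicate, object):
        // subject -> row key, predicate -> column qualifier, object -> cell value.
        Put put = new Put(Bytes.toBytes("Tenzing_Norgay"));
        put.add(Bytes.toBytes("person"), Bytes.toBytes("birthPlace"), Bytes.toBytes("Nepal"));
        table.put(put);

        // Retrieve the object stored for this subject and predicate.
        Result result = table.get(new Get(Bytes.toBytes("Tenzing_Norgay")));
        byte[] object = result.getValue(Bytes.toBytes("person"), Bytes.toBytes("birthPlace"));
        System.out.println("birthPlace = " + Bytes.toString(object));
    }
}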

A semantic web is merely a collection of interrelated triples, the predicates of which are data in themselves. Every object has an identifier that is unique and has a distinct meaning. A triple is a record containing three values (subject, predicate, object). An RDF model consists of statements, which assert facts about a resource. Each statement has three parts and is also known as a triple. A triple might relate one object to another, or can link a constant value to a subject: "Nepal is a mountainous country" and "Nepal's zip code is 977" are two such cases. Several optimizations need to be performed to model triples with HBase. In our model each record consists of numerous fields. The fields in turn contain data that belongs to other objects, forming a tree-like structure. An object is not modeled explicitly, but rather by a series of linked tables distributed across the cluster on top of the distributed file system (HDFS).

II.PROBLEMS TO BE MET

A. Identification of unique triples

There can be cases where we might have to uniquely identify triples, when the combination of subject, predicate and object is itself not unique.

B. Optimal search performance given a Subject and a Predicate

How can we efficiently search the key-value store given a subject and a predicate?

C. Optimal search performance given a Predicate and an Object

How can we efficiently search the key-value store given a predicate and an object?


III.SOLUTIONS TO PROBLEMS

We address the problems mentioned above by maintaining custom indexes with EVI indexing. We maintain separate triple tables for each of the major datatypes needed (varchar(255), longtext, integer, double, datetime), and add multiple indexes for the two ways the tables are used: a subject+predicate combined key and a predicate+object combined key. Using MapReduce in the creation of indexes offers an effective yet powerful way to distribute the processing across the cluster. The index tables consist of serialized composite keys, and the cell value contains either the value or the subject of the triple. The figure above gives a picture of how we model HBase and its indexes in order to model semantic data. The first model presents a big table having predicates as its column names; here we exploit the fact that HBase, being a clone of BigTable, can support millions of columns. Similar categories are grouped under a column family; we have categorized articles under persons, movies, films, organizations, albums, video games, species, diseases etc.


IV.LEXICOGRAPHICAL ORDERING

As with HBase we have lexicographical ordering of strings; the keys in the main table as well as in the indexes are ordered in the following fashion. If Σ has a total order (cf. alphabetical order) one can define a total order on Σ*, the lexicographical order. Since Σ is finite, it is always possible to define an ordering on Σ and consequently on Σ*. If Σ = {0, 1} and 0 < 1, then the lexicographical ordering of Σ* is ε < 0 < 00 < 000 < … < 011 < 0110 < … < 01111 < … < 1 < 10 < 100 < … < 101 < … < 111 … and so on.


V.COMPRESSION


HBase sits on top of HDFS, the Hadoop distributed file system. Using LZO compression allows for a reduction in the size of the data and also a reduction in hard-drive read time. The compression additionally allows its blocks to be split into chunks for parallel processing with Hadoop. Creating indexes demands reading data from HDFS and inserting it into HBase, so compression enhances the performance of the MapReduce process in general. Moreover, compressing data in HBase increases performance because expensive I/O is reduced and useless bandwidth usage, which can sometimes saturate clusters, is avoided.


VI.SPARSE DATA


In HBase, a given row, which in our case is the search keyword, can have any number of columns in each group or column family. Here we will have row-based gaps, so the HBase table will be sparse, helping with quicker random reads.

VII.CONCEPTUAL MODELING OF DATA


A.Modeling of Main Table

The main table for storing RDF triples will be modeled as follows.



The table above shows how subjects, objects and predicates are stored in the main table in HBase. The subject is modeled as the row key, whereas the object can be the URI or the value itself. The predicates form the column names and are grouped under a column family, Persons in our case. There can be millions of predicates as columns within a family; the table above just gives three of them. Theoretically a column family can accommodate an unbounded number of columns. Timestamps allow us to keep different URIs or values for a given subject key. Although the conceptual model looks sparse, no table space is wasted because vacant cells have no physical representation. To retrieve a value, the user does a get using three keys: row key + column key + timestamp -> value.

B.Modeling of Index Tables

The index tables will be modeled as follows. There will be two index tables, one for the S+P combination and the other for the P+O combination, so that every type of query can be answered adequately. The tables below show the two cases.



There is a single column family and a single column in this case. The tables can be queried with the S+P or P+O combinations to get the list of serialized result values, which can later be used to query whatever we want from the main table.

VIII.DISTRIBUTED

HBase being built on top of the Hadoop distributed file system, the underlying data can be spread across several nodes in the cluster. It can sit on top of either HDFS or Amazon S3. The data is replicated across several participating nodes in the cluster, which protects the data in case one or several nodes go down, and allows for distributed processing and querying of data. The URL is inverted simply to keep related data together. Although conceptually our data sits inside an HBase table similar to a relational database table, HBase has a special file format for saving its data, the HFile format, and these files are eventually stored in HDFS. So, for proper functioning and data persistence HBase relies upon HDFS, the Hadoop distributed file system. Any file is split into one or more blocks, and these blocks are stored and distributed across several datanodes in the cluster. Thus HDFS provides data replication, robustness, tolerance of node failures, cluster rebalancing, data integrity and overall cluster management.


IX.MODEL AND SEMANTICS

The semantic model must be able to express all the entities, facts, relations between facts and properties of relations. Currently OWL (Web Ontology Language) serves for knowledge representation. It can express properties of relations and express relations between facts. In OWL and RDFS all the objects (persons, places, URLs) are represented as entities.


X.MAPREDUCE FOR INDEXING

Effective indexing serves as the backbone of any search and retrieval system. At the same time, the system should be able to prepare indexes effectively and quickly. Preparing indexes for huge datasets like Wikipedia can be a daunting and time-consuming task with traditional approaches, considering the rate at which the dataset grows daily and how frequently newer indexes have to be prepared. Imagining Google doing it with billions of pages and petabytes of data can give us some insight into the effectiveness and swiftness of their technology. Our approach to indexing utilizes MapReduce, the notion Google popularized, which has evolved into a de facto standard for analyzing datasets of petabyte magnitude over the past years. We use a MapReduce process to generate indexes from the main HBase table. The process is executed in parallel over a cluster of commodity machines. The map operations are distributed across numerous computers in the cluster by automatically partitioning the input dataset into M shards. Reduce invocations are also distributed, by partitioning the intermediate key space into R pieces using a hash partitioning function (e.g., hash(key) mod R). Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

Map(k1, v1) -> list(k2, v2)
k1 → multidimensional key of a row, v1 → cell value

The function is applied to all items in the input dataset, producing a list of (k2, v2) pairs for each call. All the intermediate pairs are collected and grouped, creating one group for each distinct generated key. The reduce function is then applied to each group, which produces a collection of values in the same domain.

Reduce(k2, list(v2)) -> list(v3)
k2 → S+P or P+O, v2 → list of all values corresponding to a common key, v3 → serialized list. The diagrammatic representation of the overall MapReduce process is shown in the diagram below; the image is taken from one of Google's original slides about MapReduce at Google Labs.

Thus MapReduce transforms all the key-value pairs into a list of values; in our case the list of values is either the serialized list of all object values making up a relationship or the serialized list of all the subjects, as depicted in the columns of figures 3 and 4. The serialized binary data is inserted into the cell corresponding to the particular key. Benchmarks for the time required to build the index will be provided after experimentation.
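The sketch below illustrates the indexing job in plain Hadoop MapReduce. For simplicity it assumes the triples arrive as tab-separated text lines (subject, predicate, object) rather than being scanned directly from the HBase table, and the key prefixes, class names, and paths are illustrative only; it is not the exact job used in our experiments.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TripleIndexJob {

    // Map(k1, v1) -> list(k2, v2): emit one record for each index (S+P and P+O).
    public static class TripleMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] t = line.toString().split("\t");
            if (t.length != 3) return;                                        // skip malformed triples
            ctx.write(new Text("SP|" + t[0] + "|" + t[1]), new Text(t[2]));   // S+P -> O
            ctx.write(new Text("PO|" + t[1] + "|" + t[2]), new Text(t[0]));   // P+O -> S
        }
    }

    // Reduce(k2, list(v2)) -> list(v3): serialize all values sharing a key into one row.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder list = new StringBuilder();
            for (Text v : values) {
                if (list.length() > 0) list.append(',');
                list.append(v);
            }
            ctx.write(key, new Text(list.toString()));   // one serialized list per index key
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "triple-index");
        job.setJarByClass(TripleIndexJob.class);
        job.setMapperClass(TripleMapper.class);
        // The default HashPartitioner already distributes keys as hash(key) mod R.
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The output of such a job could then be bulk-loaded into the S+P and P+O index tables, or the reducer could be replaced with an HBase TableReducer that writes the serialized lists directly.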


XI. WORDNET


WordNet is a semantic lexicon for the English language developed at the Cognitive Science Laboratory of Princeton University. WordNet distinguishes between words as they literally appear in texts and the actual senses of the words. A set of words that share one sense is called a synset; thus, each synset identifies one sense (i.e., semantic concept). Words with multiple meanings (ambiguous words) belong to multiple synsets. As of the current version 2.1, WordNet contains 81,426 synsets for 117,097 unique nouns. (WordNet also includes other types of words like verbs and adjectives, but we consider only nouns in this paper.) WordNet provides relations between synsets such as hypernymy/hyponymy (i.e., the relation between a subconcept and a superconcept) and holonymy/meronymy (i.e., the relation between a part and the whole); for this paper, we focus on hypernyms/hyponyms. Conceptually, the hypernymy relation in WordNet spans a directed acyclic graph (DAG) with a single source node called Entity. WordNet will be leveraged for finding relationships between words.


XII. CONCLUSION


In this paper we discussed HBase and the semantic modeling of Wikipedia data in HBase in order to perform intelligent search over Wikipedia data. We also presented an index modeling technique for storing RDF triples in HBase. We applied distributed MapReduce operations to the large Wikipedia dataset and modeled its data and indexes appropriately, and we presented a distributed and highly redundant way of building the solution. A comparison of the HBase model with MySQL remains to be done, and the resulting statistics are to be presented. The system was able to perform intelligent search over Wikipedia data and produce results that normal queries on Wikipedia cannot. The approach also promotes the use of the distributed key-value store HBase as an appropriate datastore for modeling the semantic web.


XIII. FUTURE WORK


A large-scale semantic modeling of Wikipedia data remains to be done. The immediate goal is to work out an effective and highly scalable design that scales up to terabytes of data. Some areas to be explored are improvements in efficiency, including query caching, disk allocation, and index optimization. Latency, query performance, and relevance of results are the top priorities. Although some tests of HBase performance with Hadoop have been done earlier, we have yet to establish that HBase forms an effective datastore for modeling semantic web data.




Friday, August 13, 2010

Ubuntu Commands for viewing hardware details

View Hard disk space
bacharya@bacharya:~$ df -h

Filesystem Size Used Avail Use% Mounted on
/dev/sda3 185G 59G 118G 34% /
varrun 1.7G 124K 1.7G 1% /var/run
varlock 1.7G 0 1.7G 0% /var/lock
udev 1.7G 44K 1.7G 1% /dev
devshm 1.7G 336K 1.7G 1% /dev/shm
lrm 1.7G 40M 1.6G 3% /lib/modules/2.6.24-28-generic/volatile
/dev/sda1 464M 47M 393M 11% /boot
/dev/sda4 43G 190M 40G 1% /others


View memory details (in MB)

bacharya@bacharya:~$ free -m
total used free shared buffers cached
Mem: 3286 3109 177 0 340 992
-/+ buffers/cache: 1776 1510
Swap: 3906 0 3906

View Processes

bacharya@bacharya:~$ top

View processor details
bacharya@bacharya:~$ cat /proc/cpuinfo


processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping : 9
cpu MHz : 2800.531
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc up pni monitor ds_cpl tm2 cid cx16 xtpr lahf_lm
bogomips : 5601.06

Wednesday, February 17, 2010

Apache Pig Tutorial


Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. (As defined by Apache.)

Requirements for Pig

You need to have your Hadoop system set up, because Pig is not meant to run alone but in coordination with HDFS/MapReduce.

The general workflow of Pig is depicted in the workflow image.
The first step involves loading data; the data can be logs, SQL dumps, or text files. The second phase involves manipulating the data, for example filtering it or using foreach, distinct, or any user-defined functions. The third step involves grouping the data. The final stage involves writing the data to the DFS, or repeating the steps when another dataset arrives.

An installation of Hadoop is required for Pig. Make sure the NameNode and JobTracker are set in the Hadoop configuration files. These configurations are necessary because Pig has to connect to the HDFS NameNode and the JobTracker node.

Running pig jobs in a cluster
Copy the sample dump to the DFS

bin/hadoop fs -copyFromLocal /home/hadoop/passwd .
In my case I am using a dump of passwd from /etc/passwd.

Edit pig.properties to set cluster = <name of the JobTracker (Map/Reduce) node>.

Edit the Pig startup script (bin/pig) to include the following lines:

export JAVA_HOME=<path to your Java home>
export PIG_CLASSPATH=<path to your hadoop/conf directory>

Now run the pig command from the Pig installation directory:
bin/pig
This should connect to the Hadoop NameNode and JobTracker.
If it does not, there are problems in the configuration; check the configuration settings.

Now run your queries and finally dump the output; a programmatic sketch follows.
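As a side note, the same load → filter → group → store workflow can also be driven from Java through Pig's PigServer API. The sketch below is only illustrative: the aliases, the passwd input file copied into HDFS earlier, and the output path are assumptions for this example.

import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigPasswdExample {
    public static void main(String[] args) throws IOException {
        // Connects to the cluster configured through PIG_CLASSPATH / pig.properties.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Load the passwd dump copied into HDFS earlier, splitting on ':'.
        pig.registerQuery("users = LOAD 'passwd' USING PigStorage(':') AS "
                + "(name:chararray, pw:chararray, uid:int, gid:int, "
                + "gecos:chararray, home:chararray, shell:chararray);");
        // Manipulate: keep only accounts with uid below 1000.
        pig.registerQuery("sys = FILTER users BY uid < 1000;");
        // Group by gid and count the members of each group.
        pig.registerQuery("bygid = GROUP sys BY gid;");
        pig.registerQuery("counts = FOREACH bygid GENERATE group, COUNT(sys);");
        // Write the result back to HDFS.
        pig.store("counts", "passwd_group_counts");
    }
}

Running this class submits the same statements you would otherwise type at the grunt shell started by bin/pig.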

Success !

Saturday, February 13, 2010

Disparity and dearth: does it portray my identity? A retrospective view of my understanding of my country, and a search for hope.

Being born in one of the most impoverished nations in the world, having seen and experienced what hunger is, I have seen children playing in war zones and conflict shattering the dreams of many, resulting in a bruised socio-economic and political condition of the beautiful Himalayan Shangri-La. I suppose the main cause of this was nothing but poverty and the gap poverty created between different classes of our society. When I was in Nepal, my definition of a better scientific world was not necessarily a world of great scientific achievements and lavishness, but a place where people would get two meals a day, a place where children would not be forced to carry weapons but would rather get a chance to go to school and receive a proper education; a place where there would be no discrimination by caste and creed, where peace and harmony are the languages spoken rather than war and anguish, and where nobody has to give his life to conflict or poverty. This would really be a great challenge for science: filling the gap, the difference between the West and third-world countries like Nepal. My thought is that the next step of science should be the scientific and uniform distribution of existing technological achievements throughout the world. As long as there is unevenness in the distribution of the assets of science, there will be war zones, inequality, malnutrition, and disease. Up until now, we have had scientific achievements and their distribution in a localized way, leading to socio-economic segregation between different parts of the world. So the next step should be to bridge the gap created between the technological society and the existing backward society.

Science is knowledge. Until now we have utilized this knowledge in positive as well as negative ways. Today we are proud to have achieved what our forefathers used to see in their dreams; however, our achievements are still in a tarnished state and the finishing remains to be done. Hereafter, science should focus on solidifying its achievements in order to avoid the further imbalance prevalent today among different sectors of the world and society. Just as an ecosystem is disturbed when the balance between production and consumption becomes uneven, we have to mobilize the achievements of science in order to avoid such disturbances as hunger, poverty, disease, and illiteracy.


Having got a chance to study in Japan, I luckily got to see the effect science and technology have had on this foreign soil. For a moment I felt I had arrived at the place where I had wanted my childhood to be nurtured; the difference I saw was stark. A big building reminded me of a common hut in my village made of bamboo and soil; a huge bridge across the river I compared with the wooden logs we used to cross on; big roads with huge numbers of vehicles reminded me of a pathway in my village where I watched sheep march and followed behind them. When I saw those crowds of people running after their destinations, I compared them to myself when I used to run after lost kites in the sky. My stay in Japan was really amazing: great food to eat, I could use laptops, and I learned a lot about computers, society, and cultures. My experience was similar to the dream I used to have as a child; at last I was in my dreamland. I was so immersed in this world that I left all the agony and frustration behind in my country.

At the same time, when I looked back at the ruins of my motherland, I believed that the role science has played in bringing humanity to the 21st century has had its limitations, and that its role will remain unfulfilled as long as countries like mine exist in this condition.

My personal view regarding scientific change is that change depends upon the hearts of those who initiate it. There may come a time when civilization reaches an inter-terrestrial level and expands its horizon throughout the universe, the foundation of which has already been laid in developed countries like the US. Poverty is like a disease; it should be stopped from spreading. Back in my country there are many places, like Rolpa, where we see people fighting for rotten grain unsuitable to eat. Although science has come a long way, we still see such happenings in many parts of the world; while the rest of the world was developing, my country, I realized, was sliding back toward the dark ages day by day. But one day I heard that an initiative to bridge the technology gap had been taken by a group of enthusiasts back in my country, and that they were working for the complete transformation of villages in remote areas of Nepal.

The place, called Myagdi, lay in the heights of the Himalayas, a hard seven-hour climb from the nearest roadway. These people had never seen a telephone, and even electricity was out of reach. School was miles away, and students didn't have pencils or paper to write with. Then there arose an inspirational man: a Nepali citizen who, after finishing his studies in the USA, came back to build a high school in this remote area. While he was teaching in that school, he had to travel for two days to reach the nearest city to check his email. He described this situation to the BBC, and when the BBC publicized his ideas, lots of volunteers from Europe and the Americas came to help. When he installed four used, donated computers in his school and taught students how to use them, it became a start. With the help of donated computers and wireless devices, he set up relay stations around his village. Suddenly these people were connected to the rest of the world with the latest technology, and the impact was immediate. People started using computers that were once waste to Westerners; tele-teaching began, local people benefited from medical advice from doctors in big cities, and villagers now e-marketed their local products and household goods. People now use online libraries, online health clinics, and e-shopping. The wireless boon of science really had a tremendous impact on the socio-economic life of the rural people. The effect was not confined to Myagdi; seeing its success, wireless links were similarly set up in nearby villages and rural parts of Nepal, and the once isolated villages are now a global village connected by one of the most fantastic technologies of science.

So, with an example set, the wireless project brought about a revolutionary effect in a so-called dark village. There are certainly more rural places like Myagdi in Nepal and in many parts of the world. Together, and with the aid of some of the fantastic innovations of science, we can make the world a better place to live.

These were places that were very difficult to reach and where it was almost impossible to access information about the outside world, yet science was making things possible and easing people's lives in a remarkable way. How that happened is a wonderful story I take great morale from. A man called Mahabir Pun, a Nepali student in the USA, initiated a dollar-a-person program among Nepalese abroad, collected the money, went back to the inaccessible villages of Nepal, and set up wireless devices that gave the poor mountain people not only easy access to information and the internet but also the ability to communicate easily with people in other mountainous regions. It may seem a small step to most people, but it turned out to be a miraculous aid for the people of that region. Just a year after the installation of the wireless devices, the lifestyle of those people changed dramatically and their purchasing power rose at once. Here we see how a small change brought about by science can bring about a tremendous change in people's lives. Not only did the technology change their status, it also enabled them to learn and acquire knowledge in much the same way Westerners do.

This bold step filled the gap between the people of remote areas and the rich people of the cities. A small step like that can really be a giant step for all mankind.

Friday, January 29, 2010

Unicode Encoding Problems with Java-MySQL-Tomcat Web Pages and the Solution

There are many things to take care of if we want to display UTF-8 characters from MySQL correctly. Also, if we wish to do internationalization backed by the database, the following steps serve as a good place to start.

Things to take care of:

First:
While entering UTF-8 data into the table from the shell, be careful to set names as follows.
By setting names to utf8 we are telling MySQL to use the given charset. We need to do this every time we log into the server. If we don't, MySQL treats whatever data is entered into the table as being in the default charset, and your program will not display the characters properly.

set names utf8 collate utf8_general_ci

Alternatively, default-character-set=utf8 can be set in the [client] section of my.cnf, or passed to the client as --default-character-set=utf8.

Check your MySQL server encodings with this command:

show variables like 'character\_set\_%';
Make sure you get a response similar to this:

+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| character_set_client      | utf8   |
| character_set_connection  | utf8   |
| character_set_database    | utf8   |
| character_set_filesystem  | binary |
| character_set_results     | utf8   |
| character_set_server      | utf8   |
| character_set_system      | utf8   |
+---------------------------+--------+


If not, edit /etc/my.cnf to set the encodings to utf8 as shown in the table above. It is not strictly necessary to change all the fields to utf8, though; I have the following configuration and it works perfectly fine, and I face no encoding problems provided I follow all the other steps carefully.

+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| character_set_client      | latin1 |
| character_set_connection  | latin1 |
| character_set_database    | utf8   |
| character_set_filesystem  | binary |
| character_set_results     | latin1 |
| character_set_server      | utf8   |
| character_set_system      | utf8   |
+---------------------------+--------+


Second: create the database with character set utf8, create the table with charset utf8, and also make the table's columns utf8. We need to do this to make sure characters are stored correctly. (A small JDBC sketch of steps one and two follows.)
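For illustration, a minimal JDBC sketch of steps one and two is given below; the database name, table, columns, and credentials are made up for the example, and the same DDL can just as easily be run from the mysql prompt.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class Utf8Setup {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/", "user", "password");   // placeholder credentials
        Statement st = con.createStatement();
        // Step one: tell MySQL which charset this session uses
        // (with Connector/J the URL parameters shown in step five are the usual way).
        st.execute("SET NAMES utf8 COLLATE utf8_general_ci");
        // Step two: create the database, table, and columns with the utf8 charset.
        st.executeUpdate("CREATE DATABASE IF NOT EXISTS testdb "
                + "CHARACTER SET utf8 COLLATE utf8_general_ci");
        st.executeUpdate("CREATE TABLE IF NOT EXISTS testdb.messages ("
                + " id INT PRIMARY KEY AUTO_INCREMENT,"
                + " body VARCHAR(255) CHARACTER SET utf8"
                + ") DEFAULT CHARACTER SET utf8");
        st.close();
        con.close();
    }
}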


Third: set the page charset and default encoding to UTF-8 if we are displaying table data in a web page.


<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>



Adding this directive instructs the browser to use the UTF-8 encoding for the page.
After performing step three we will be able to see hard-coded UTF-8 characters displayed properly on the page, but this might not be enough for database-backed UTF-8 encoded data; for that to work, we need to follow steps one and two properly.


This also might not work when the response is coming from the server. We have to explicitly tell the server to use UTF-8 as the encoding for our pages, so we move on to step four.

Fourth: we may also have to write a custom filter to encode the response sent by the server as UTF-8. Define a filter and set the response encoding to UTF-8.


Write a ResponseFilter class implementing Filter, and in the doFilter() method set the following to tell the server explicitly that you want a UTF-8 encoded response (a fuller sketch of the filter follows the snippet below).


response.setCharacterEncoding("utf-8");
request.setCharacterEncoding("utf-8");
chain.doFilter(request,response);
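For reference, a fuller sketch of such a filter is given below; the class name ResponseFilter matches the one mentioned above, and the filter still needs the usual <filter> and <filter-mapping> entries in web.xml to be applied to your pages.

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class ResponseFilter implements Filter {

    public void init(FilterConfig config) throws ServletException {
        // nothing to configure for this simple filter
    }

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // force UTF-8 for both the incoming parameters and the outgoing page
        request.setCharacterEncoding("utf-8");
        response.setCharacterEncoding("utf-8");
        chain.doFilter(request, response);
    }

    public void destroy() {
        // no resources to release
    }
}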



This should display UTF-8 characters properly in any web page or console.


Step five: this is not necessary if we performed all the steps above, at least not for the latest MySQL connectors. But for older versions of the driver we need to add the following to the connection URL (a short usage sketch follows the URL):

jdbc:mysql://localhost/table-test?useUnicode=yes&characterEncoding=UTF-8
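As a usage sketch, the parameters are simply part of the URL passed when opening the connection; the database name, user, and password below are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;

public class Utf8ConnectionExample {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");    // older Connector/J driver class
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/table-test?useUnicode=yes&characterEncoding=UTF-8",
                "user", "password");               // placeholder credentials
        System.out.println("Connected: " + !con.isClosed());
        con.close();
    }
}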


Note: the charset must be set to utf8 before entering any data into the table from the mysql prompt.

Must Read article :
unicode-how-to-get-characters-right


If problems persist, feel free to post your problem on this blog and I shall try to help solve the issue. It worked for me for Japanese, Arabic, Devanagari, and other characters.

Wednesday, January 06, 2010

Richfaces -Facelets -Spring JBoss configuration

JBOSS version is : jboss-4.2.0.GA

In order to run a RichFaces application with JBoss 4.2.0.GA we need to have the following libraries in the WEB-INF/lib directory:

Commons-beanutils
commons-digester
el-impl
el-ri
jsf-facelets
richfaces-api
richfaces-impl
richfaces-ui

The other libraries are included by default in the JBoss installation, so we don't need to add additional JSF or commons libraries; if added, they could cause deployment problems.

Sunday, January 03, 2010

eclipse java.lang.Error: Unresolved compilation problem solved

Recently I was working on a project when suddenly I got this error in Eclipse:

java.lang.Error: Unresolved compilation problem

Error bubbles were shown on all the source files. I searched the web for a solution but nothing worked for me. Finally, what I did was:

Right-clicked the project and opened its properties.
Set the compiler compliance level to 1.6 (initially it was set to 1.5).
Now the problem was solved. I guess it was an issue with different JRE versions used for the build.

Sunday, December 13, 2009

Getting started with JINI in Ubuntu Linux

Get JINI running on Ubuntu with the following steps:


Download the Java version of the JINI distribution from the following page. The Linux version has some problems, so it is better to download the Java version even if you are a Linux user.

http://www.jini.org/wiki/Category:Getting_Started

Run Jini_2.1.jar from the command line using the following command:
java -cp Jini_2.1.jar install

Follow the procedure for a full or custom installation of JINI.

After installation, go to the jini2_1/installverify/support folder of the installation.
Before you can run launch-all, you have to edit the launch-all file.

Search for the export LD_ASSUME_KERNEL=2.2.5 line in the file.

Comment out this line in launch-all:
export LD_ASSUME_KERNEL=2.2.5
#export LD_ASSUME_KERNEL=2.2.5

Now run launch-all from the command line:
./launch-all

All the JINI services will be started.
This completes the installation procedure for JINI under Ubuntu and other Linux platforms.

You can run the sample applications now.