Abstract
With offices nowadays have started using computers, we are also faced with a challenge to maximize the utilization of computing resources offered by each computer and minimize cost. With many computers, we are faced with many idle
resources. Jobs can be distributed out to idle servers or even idle desktops. Many of these resources remain idle during off office hours or even during office hours with many users under utilizing the computing as well as memory resources. We can manage policies allowing jobs to go only go to computers that are free of resources allowing others to run normally and hence maximize the throughput as well as minimize cost. Our proposed model not only utilizes resources to optimum but also makes the architecture more modular, adaptive, provides dynamic fail over recovery and linear scalability.
Keywords : JINI, javaspace, cluster, Space
1. Introduction
As the size of any organization increases, with it increases the issue of managing the increased resources. A little foresightedness may result saving huge cost for the organization. There might arise a question of managing computer clusters for carrying out computational task with more efficiency. In this paper we shall focus on the prototype cluster management using JINI, and also show how to setup a cluster management system that performs resource sharing. Traditional architectures normally focus on clientserver or peer to peer interaction model but our focus shall be upon a completely new architecture “Space based Architecture”. The space based idea
has several advantages compared to its counterparts. A space based architecture is said to be more robust because one agent failing will not bring down the whole system as is the case with clientserver model. Replication and mirroring of persistent spaces permits communication regardless network failure. Communication between peers is anonymous and asynchronous which makes computers in the cluster to work together to solve a problem collectively. These attributes of space based architecture enables us to make an adaptive cluster. We will particularly focus on managing clusters in an adaptive manner, where the increase or decrease in the number of peers wont create any problem to the overall space. Our approach will be based on the one of the services of JINI the “javaspace”.
2. JINI and JavaSpaces
JINI technology is a service oriented architecture that defines a programming model which both exploits and extends the ability of java technology to enable the creation of distributed systems consisting of federations of well behaved networked services and clients. JINI technology can be used to build adaptive network systems that are scalable, evolvable and flexible as typically required in dynamic distributed systems [1]. JINI enables computers to find each other and use each others services on the network without prior information about each other or the protocols used. To make Jini selfhealing,leases are utilized. Nearly every registration or resource must be leased, that is, it must periodically be confirmed that the registered resource is alive or that there is still interest in a resource.
If the lease is not renewed before its expiration, the resource or registration becomes unavailable. This provides a form of distributed garbage collection, where only healthy resources continue to be published.
2.1 An overview of JINI Infrastructure
Jini comes with several standard infrastructure components out of the box. To achieve the nonfunctional requirements (NFRs) of performance, resiliency, and scalability, multiple instances of these components run simultaneously on different machines.
The standard Jini services are:
· Lookup Service : The lookup service, named reggie, is the first among equals of Jini services.
All services in a Jini architecture register with the lookup service to make themselves available to other services. All initial access to other services is via the lookup service, after which clients bind directly.
· Class Server : The class server, a simple HTTP daemon, eliminates the coupling between clients and service implementations. Clients deal with interfaces. If a particular implementation class is required, it is downloaded transparently.
· Transaction Manager : Distributed transactions support is provided by the transaction manager service called mahalo.
· JavaSpace Services with a requirement to share information with other services do so through a JavaSpace. The reference implementation of the JavaSpace is named outrigger.
2.2 JavaSpace technology
The JavaSpaces technology is a highlevel tool for building distributed applications, and it can also be used as a coordination tool. A marked departure from classic distributed models that rely on message passing or RMI, the JavaSpaces model views a distributed application as a collection of processes that cooperate through the flow of objects into and out of one or more spaces. This programming model has its roots in Linda, a coordination language developed by Dr. David Gelernter at Yale niversity. However, no knowledge of Linda is required to understand and use JavaSpaces technology [2].The dominant model of computation in distributed computing is the ClientServer model. This model is based on the assumption that local procedure calls are the same as remote procedure calls. Javaspaces overcome the problems of synchronization, latency and partial failure, inherent in distributed systems, by providing loosely coupled interactions between the components of distributed systems. Javaspace processes communicate through a space, not than directly. Communication between processes on different physical machines is asynchronous and free from the main limitation of the traditional client/server model, when client/server communication requires simultaneous presence on network both parts client and server. Sender and receiver in JavaSpace don't need to be synchronized and can interact when network is available. In a distributed application, JavaSpaces technology acts as a virtual space between providers and requesters of network resources or objects. This allows participants in a distributed solution to exchange tasks, requests, and information in the form of Java
technology based objects[3]. The javaspace transactional mangement and notify feature makes it easier to build a dynamic cluster management framework. In particular it addresses the dynamic cluster problem where nodes can depart and join the cluster at any time.
2.3 Leasing
One of the important feature of Jini is the concept of leasing. All resources in a Jini system are leased,including proxy records in the lookup service, transactions, and, of course, memory in a JavaSpace.When a lease expires, the resource is recovered and made available to other components. This prevents resource accretion, a common problem in distributed systems. Leases can be renewed explicitly, or implicitly through a lease renewal manager. In the interests of simplicity, the example below uses leases that last "forever." This is obviously inappropriate for a
production system[4].
2.4 The Entry Interface
All objects that can be stored in a JavaSpace must implement the Entry interface. Entry is a simple tag interface that does not add any methods but does extend Serializable. Like JavaBeans, all Entry implementations must provide a public constructor that takes no arguments. Unlike JavaBeans, all of the data members of an Entry must be public. In production Jini systems, the public data members are a nonissue because of common techniques like the envelope letter idiom. The Entry implementation acts as an envelope or wrapper around the "real" payload, which may be any serializable Java object. The only public data members exposed are those required for the templatebased matching of the JavaSpaces API[5].
3. JavaSpace based cluster mangement
At the focus of our system is a working space and entries that cluster nodes can write to the space.These entries are a Join entry and a Depart entry. The space itself is assumed to be run on a host that forms the nucleus of the cluster and in fact for the work reported here we assume this host and its space stay up. We are working on the problem of making this space system properly persistent and robust
against temporary failure. It is further assumed that the space hosting node is well known to other nodes that participate in the cluster. This seems reasonable assumption for a cluster within an administrative boundary
3.1 Architecture Description
A JavaSpaces service holds entries, each of which is a typed group of objects expressed in a class that implements the interface net.jini.core.entry.Entry. Once an entry is written into a JavaSpaces service, it can be used in future lookup operations. Looking up entries is performed using templates, which are
entry objects that have some or all of their fields set to specified values that must be matched exactly. All remaining fields, which are not used in the lookup, are left as wildcards.
There are two lookup operations: read() and take(). The read() method returns either an entry that matches the template or an indication that no match was found. The take() method operates like read(), but if a match is found, the entry is removed from the space. Distributed events can be used by requesting a JavaSpaces service to notify you when an entry that matches the specified template is written into the space. Note that each entry in the space can be taken at most once, but two or more entries may have the exact same values. Using JavaSpaces technology, distributed applications are modeled as a flow of objects between participants, which is different from classic distributed models such as RMIs. Figure 1 indicates what a JavaSpaces technology based application looks like. A client can interact with as many JavaSpaces services as needed. Clients perform operations that map entries to templates onto JavaSpaces services. Such operations can be singleton or contained in a transaction so that all or none of the operationstake place. Notifications go to event catches, which can be either clients or proxies for clients.
4. Javaspace Realated concepts
4.1 Transactions
The JavaSpaces API uses the package net.jini.core.transaction to provide basic atomic
transactions that group multiple operations across multiple JavaSpaces services into a bundle that acts as a single atomic operation. Either all modifications within the transactions will be applied or none will, regardless of whether the transaction spans one or more operations or one or more JavaSpaces services. Note that transactions can span multiple spaces and participants in general.A read(), write(), or take() operation that has a null transaction acts as if it were in a committed transaction that contained that operation. As an example, a take() with a null
transaction parameter performs as if a transaction was created, the take() was performed under that transaction, and then the transaction was committed.
4.2 The Jini Outrigger JavaSpaces Service
The Jini Technology Starter Kit comes with the package com.sun.jini.outrigger, which
provides an implementation of a JavaSpaces technologyenabled
service. You can run it two ways:
· As a transient space that loses its state between executions: Use
com.sun.jini.outrigger.TransientOutriggerImpl.
· As a persistent space that maintains state between executions: Use
com.sun.jini.outrigger.PersistentOutriggerImpl.
The TransientOutriggerImpl can be run only as a nonactivatable server, but the
PersistentOutriggerImpl can be run as either an activatable or nonactivatable server.
4.3 Distributed Data structure in JavaSpace
With JavaSpace it is also possible to organize objects in form of a tree structure or an array. Since remote processes may access these structures concurrently, they are called distributed data structures. A channel in JavaSpaces terminology is a distributed data structure that organizes messages in a queue. Several processes can write messages to the end of the channel, and several processes can read or take messages from the beginning of it. A channel is made up of two pointer
objects, the head and the tail, which contain the numbers of the first and the last entry in the channel(Figure 2). It is possible to use several such channels, giving all Actors associated with a space the possibility to handle messages in a FIFO fair manner. Channels may also be bounded, meaning that an upper limit can be set for how many messages a channel may contain.
4.4 Master Worker pattern
The MasterWorker Pattern (sometimes called MasterSlave pattern) is used for parallel processing and is the basis pattern to work with javaspace. It follows a simple approach that allows applications to perform simultaneous processing across multiple machines or processes via a Master and multiple Workers. The Master hands out units of work to the "space", and these are read, processed and written back to the space by the workers. In a typical environment there are several "spaces", several masters
and many workers; the workers are usually designed to be generic, i.e. they can take any unit of work from the space and process the task.
5. Discussions and Conclusions
In this paper we have discussed using JINI/javaspace technology to providing clustering supporting.The approach presented is useful in places which requires clusters to be set up to perform resource intensive works, like data processing or computing works. Our model can be realized using JINI/javaspace technology which are open source technologies and hence can be cost effective as compared to other proprietary solutions. As with government offices, this approach can prove
beneficial as it provides effective solutions to clustering issues like scalability, fault tolerance,adaptability and utilization of resources. Creating adaptive systems in dynamic environments where services and clients come and go all the time and system components may dynamically be added and removed is a complex task. JavaSpaces has several features that can ease this task, including its ability to provide asynchronous and uncoupled communication in time, space and destination based on
associative addressing. Since a JavaSpace stores objects, it is a simple means of distributing both messages and agent behavior. Our space based architecture utilizes these possibilities together with the actor role abstraction to simplify the creation of adaptive systems. The architecture consists of three main types of agents that interact asynchronously through the space.
6 .References
[1] http://www.jini.org/wiki/Main_Page
[2] http://java.sun.com/developer/technicalArticles/tools/JavaSpaces/
[3] http://www.javafaq.nu/javaarticle150.
html
[4] http://www.softwarematters.org/jiniintro.
html
[5] http://www.artima.com/intv/swayP.html
[6]http://www.javaworld.com/javaworld/jw102000/
jw1002jiniology.
html?page=1
[7]http://java.sun.com/developer/technicalArticles/tools/JavaSpaces/
[8]http://www.theserverside.com/tt/articles/article.tss?l=UsingJavaSpaces
[9]http://www.artima.com/lejava/articles/dynamic_clustering.html
[10]http://www.artima.com/intv/cluster2.html
[11]Grid Computing: A Practical Guide To Technology And Applications By Ahmar
Abbas,Publisher:Charles River Media
[12]http://jan.newmarch.name/java/jini/tutorial/Jini.html
[13]Grid computing Software Environment and tools By : Omer F. Rana (Editor) and Jose C. Cunha
(Editor)
[14]http://java.sun.com/developer/Books/JavaSpaces/
[15]Dynamic Cluster Configuration and Management using JavaSpaces( K.A.Hawick and H.A.James
Computer Science Division, School of Informatics)
Thursday, August 18, 2011
Semantic Modeling of Distributed key value stores featuring Hbase and Wikipedia
Abstract—
The research paper intends to do classification of different articles, documents of Wikipedia into a consistent ontology like persons, organizations, films, album's, video games, species and diseases, leveraging distributed key value store Hbase. Each entity is kept as a column within a column family which will be the individual ontologies as mentioned above. Each get operation to Hbase will be responsible for
getting triples for the same key. The choice of key value store over any relational database promotes horizontal scalability and an increase in performance when the size of dataset is of the magnitude as that of Wikipedia. The proposed approach will
help make semantic queries to the Wikipedia dataset and make intelligent inferences, which is difficult doing currently specially with traditional Relational databases. The approach presents statistical analysis of performance with approaches of RDBMS
and that of Hbase a clone of Google's BigTable.
Index Terms—KeyValue store, Hbase, Hadoop, EVI (Encoded Vector Index), Semantic Modeling, OWL, RDF, MapReduce.
I.INTRODUCTION
ONE of the challenges of semantic search and retrieval in natural language processing is the storage and manipulation of large amount of subjectpredicateobject
expressions which can be RDF; making statements about particular resources. Most of the work of knowledge representation involves persisting the RDF data in relational
databases or native representations called Triple Stores, or Quad Stores if context is also persisted in RDBMS. But circumstances are more complicated by the fact that web today has billions of pages and RDBMS are nowadays more rigorously questioned whether they can sustain the tight scalability constraints imposed by the massive growth in dataset over the past years and the trend still growing. More specifically, we have the following concerns whether RDBMS suitably fits for semantic data persistence Scalability, Querying, Efficiency, Query latency and
organization Overhead. In this paper we present an approach of semantic modeling of RDF triples with Hbase, a sparse, distributed, persistent multidimensional sorted map for storing key values. The paper works for way of mapping RDF data to Hbase. It champions for a new approach of data processing pipeline.In relational databases, data is stored in conventional horizontal approach, which might present several problems like limitations in the number of columns, 1012 for DB2 and Oracle; as we might have several properties of an object which we have to model. Even if we are allowed more columns; RDBMS will have null in the vacant cells; which will kill up a lot of space and present additional problems when data size is substantially larger. Other limitations with RDBMS can be the difficulty in changing the schema like addition of columns later or maintaining timestamps. Hence a large penalty in performance has to be paid if data records are wider or the dataset is huge. Finally, RDBMS faces serious limitations when the questions are about addressing
redundancy, parallelism and horizontal scalability. Flexibility of column oriented key value stores is that they can be either used as wide table consisting of millions of columns, grouped under a single column family or as a very tall table consisting of billions of rows. This facility can be suitably exploited to make a persistent store for ontologies. Our approach involves better natural approach for storing RDF data in Hbase. Here, each RDF subject or resource corresponds to a row key, and each RDF predicate or property corresponds to a column. Keyspaces can be used to represent RDF repositories, and column families can be used to represent named graphs. The major advantages with this approach will be that it encompasses all the advantages of Hbase like high availability, consistency, horizontal scalability and data partitioning across the cluster to name some. The system is optimized for resource oriented access patterns to RDF statements about a particular subject taken from Wikipedia.
A semantic web is merely a collection of interrelated triples, the predicates of whose are data in themselves. Every object has an identifier that is unique and has a distinct meaning. A triple is a record containing three values (subject, predicate,
object). A RDF model consists of statements, which asserts facts about a resource. Each statement has three parts and is also known as a triple. A triple might relate one object to another or can link a constant value to a subject. Nepal is a mountainous country and Nepal's zip code is 977, are two of the cases. There are several optimizations which are needed to be performed for modeling triples with Hbase. In our model each record consists of numerous fields. The fields again contains data that belongs to another object forming a tree like structure. An object is not modeled explicitly, but rather by a series of linked tables distributed across the cluster on top of distributed File system (HDFS).
II.PROBLEMS TO BE MET
A.Identification of unique triples There can be cases when we might have to uniquely
identify the triples, when the combination of subject,predicate and the object is itself not unique.
B.Optimal Search performance given a Subject and a Predicate
How could we possibly perform search in key value store given Subject and a Predicate ?
C.Optimal Search performance given a Predicate and an Object
How could we possibly perform search in key value store given Predicate and an Object ?
III.SOLUTIONS TO PROBLEMS
We address the problems mentioned above by maintaining custom indexes with EVI indexing. Maintained separate triple tables for each of the major datatypes needed (varchar(255), longtext, integer, double, datetime). Added multiple indexes for the two ways the tables are used: a subject predicate combined key and a predicateobject
combined key. MapReduce utilization in the creation of indexes offers an effective yet powerful way for distribution of processing across the cluster. The index tables consists serialized composite keys and cell value consists either the value or
subject of the triple. The figure above gives the picture of how we are modeling
Hbase and its indexes in order to model semantic data. The first model presents a big table having predicates as their column names, here we are exploiting the fact that Hbase being a clone of Big Table can support millions of columns. Similar categories are grouped under a column family, we have categorized articles under person, Movies, Films, Organization, Album's, Video Games, Species, diseases etc..
IV.LEXICOGRAPHICAL ORDERING
As with Hbase we have Lexicographical ordering of strings. The keys in the main table as well as indexes are ordered in the following fashioned. If Σ has a total order (cf.alphabetical order) one can define a total order on Σ* the lexicographical order. since Σ is finite, it is always possible to define an ordering on Σ and consequently on Σ*. If Σ = {0, 1} and 0 < 1, then the lexicographical ordering of Σ* is ε < 0 < 00 < 000 < … < 011 < 0110 < … < 01111 < … < 1 < 10 < 100 < … < 101 < … < 111 … and so on.
V.COMPRESSION
Hbase sits on top of HDFS the Hadoop distributed file system. Using LZO compression will allow for reduction in the size of the data and also reduction in the read time of Hard Drives. The compression additionally allows for its block to be split into chunks for parallel processing with Hadoop.Creating indexes demands for reading data from HDFS and inserting them into Hbase, so compression enhances the performance of MapReduce process in general. Moreover, compressing data on Hbase increases its performance because an expensive I/O is reduced and useless bandwidth usage is negated which can sometimes saturate clusters, thus improving performance.
VI.SPARSE DATA
In Hbase, given row which in our case is the keyword of search can have any number of columns in each group or column family. Here we will have row based gaps. So, the
Hbase table will be sparse helping for quicker random reads.
VII.CONCEPTUAL MODELING OF DATA
A.Modeling of Main Table
The main table for storing RDF triples will be modeled as follows.
The table above provides how subject, object and predicates will be stored in the main table in Hbase. The Subject will be modeled as the row key whereas the object
can be the URI or the value itself. The predicates form the column name and are grouped under a column family,Persons in our case. There can be millions of predicates as the columns within a family, the table above just just gives three
of them. Theoretically a Column Family can accommodate infinite numbers of columns. Timestamps allows us to keep different Uri's or values for a given subject key. The
conceptual modeling although looks sparse, the table space is not eaten up because physically vacant cells has no significance. To retrieve a value, the user has to do a get using three keys, row key+column key+timestamps >value .
B.Modeling of Index Tables
The index tables will be modeled as follows. There will be two index tables one for S+P combination and other for P+O combination so that query for every type could be performed adequately. Tables below shows the cases for two of them.
There is a single column family and a single column in this case. The tables can be queried with the S+P or O+P combinations and get the list of serialized values of results, which can be later used to query whatever we want from the main table.
VIII.DISTRIBUTED
Hbase being built on top of Hadoop distributed Filesystem, the underlying data can be spread across several nodes in the cluster. It can either sit on top of HDFS or Amazon S3. The data will be replicated across several participating nodes in the cluster, which protects data in case one or several node goes down and allows for distributed processing and querying of data. The URL is being inverted simply to keep related data together. Although conceptually our data sits inside Hbase table similar to that of a Relational database table, but Hbase has a special file format for saving its data called Hfile format which are eventually stored in HDFS. So, for proper functioning and data persistence Hbase relies upon HDFS the Hadoop distributed FileSystem. Any file is split into one or more blocks and these blocks are stored and distributed across several datanodes in the cluster. Thus HDFS facilitates with data replication, Robustness, countering node failures, cluster
rebalancing, maintaining data integrity and managing cluster in overall.
IX.MODEL AND SEMANTICS
The semantic model must be able to express all the entities,facts, relations between facts and properties between relations. Currently OWL(Web Ontology Language) serves
for knowledge representation. It can express properties between relations and express relations between facts. In OWL and RDFS all the objects( persons, Places, URLS) are represented as entities.
X.MAPREDUCE FOR INDEXING
Effective indexing serves as the backbone of any search and retrieval systems. At the same time, the system should be effectively and quickly be able to prepare indexes. Preparing indexes for huge datasets like Wikipedia can be a daunting and time consuming tasks with traditional approaches, Consider the rate at which the dataset is growing daily and newer indexes has to be prepared frequently. Imagining
Google doing it with billions of pages and petabytes of data can provide us some insights into the effectiveness and swiftness of their technology. Our approach to indexing will utilize MapReduce; the notion Google popularized and has really evolved as a defacto for analyzing large datasets of the magnitudes of petabytes over the past years. We will utilize MapReduce process to generate indexes from the main Hbase table. The process will be executed in parallel over the cluster
of commodity machines. The Map operations are distributed across numerous computers across the cluster by automatically partitioning the input dataset into M shards.
Reduce invocations are also distributed by partitioning the intermediate key space into R pieces using a hash partitioning function(e.g., hash(key) mod R).Map takes one pair of data with a type in one data domain, and returns a list of pairs in a
different domain:
Map(k1,v1) -> list(k2,v2)
K1→ Multidimensional keys for row data, V1→ cell alue
The function is applied to all items in the input dataset, which produces a list of (k2,v2) pairs for each call. All the intermediate pairs are collected and are grouped together, creating one group for one different generated keys. The
reduce function is then applied to each group, which then produces collection of values in the same domain.
Reduce(k2, list (v2)) -> list(v3)
k2 → S+P or P+O, v2 → List of all values corresponding to common key, V3 → serialized list. The diagrammatic representation of the overall MapReduce process is shown in the diagram below. The image is taken from one of the original slides of Google present at Google Labs about MapReduce.
Thus MapReduce transforms all the key value pairs into a list of values, the list of values in our case are either the serialized list of all object values making up a relationship or the serialized list of all the subjects, as depicted in the column of
figure 3 and 4. The serialized binary data will be inserted into the cell corresponding the particular key. The benchmark for the time required to index will be provided appropriately after experimentation.
XI.WORDNET
WordNet is a semantic lexicon for the English language developed at the Cognitive Science Laboratory of Princeton University. WordNet distinguishes between words as literally appearing in texts and the actual senses of the words. A set of words that share one sense is called a synset. Thus, each synset identifies one sense (i.e., semantic concept). Words with multiple meanings (ambiguous words) belong to multiple synsets. As of the current version 2.1, WordNet contains 81,426 synsets for 117,097 unique nouns. (WordNet also includes other types of words like verbs and adjectives, but we consider only nouns in this paper.) WordNet provides relations between synsets such as hypernymy/hyponymy (i.e., the relation between a subconcept
and a superconcept) and holonymy/meronymy (i.e., the relation between a part and the
whole); for this paper, we focus on hypernyms/hyponyms. Conceptually, the hypernymy relation in WordNet spans a directed acyclic graph (DAG) with a single source node called Entity. Wordnet will be leveraged for finding relationship between words.
XII.CONCLUSION
In this paper we discussed about Hbase and semantic modeling of Wikipedia data to Hbase in order to do intelligent search over Wikipedia data. we also presented an index modeling technique for storing RDF triples in Hbase. We leveraged distributed MapReduce operations to the large dataset of Wikipedia and modeled its data and indexes appropriately. We presented a distributed and highly redundant alternative to making the solution. The comparison of Hbase modeling is to be done with Mysql and statistics are to be presented. The system was able to produce intelligent
search with Wikipedia data and was able to produce results which with normal query in Wikipedia we cannot have. The approach also promotes the usage of distributed key value store, Hbase; as an appropriate datastore for modeling a semantic web.
XIII.FUTURE WORK
A large scale semantic modeling of Wikipedia data remains to be done. The immediate goal is to workout an effective and highly scalable design to scale up to TeraBytes of data. Some areas to be explored are, Improvements in efficiency including query caching, disk allocation, and indices optimization. Latency, query performance, and relevance of results is at top priority. Although some tests have earlier been done regarding performance of Hbase with Hadoop, we are still to convince that Hbase forms an effective datastore for modeling semantic web data.
References
[1] Indexing Goes a new direction, by Richard Winter;
http://www.wintercorp.com/rwintercolumns/ie_9901.html.
[2] Jeffrey Dean & sanjay Ghemawat, “MapReduceA
simplified data
processing on Large Clusters”, http://labs.google.com/papers/mapreduceosdi04.
pdf
[3] Mark Slee, Aditya Agarwal and Marc Kwiatkowski, Thrift: Scalable
CrossLanguage
Services Implementation
[4] B. Smith, “An approach to graphs of linear forms (Unpublished work
style),” unpublished.
[5] Loannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos and
Nectarios Koziris,{ikons, eangelou, dtsouma,
nkoziris}@cslab.ece.ntua.gr} Distributed Indexing of Web Scale
Datasets for the Cloud
[6] Tim Lu, Shan Sinha, Ajay Sudan. “Panaché: A Scalable Distributed
Index for Keyword Search”,{timlu, ssinha, ajaytoo} @mit.edu
[7] Iadh Ounis, Craig Macdonald,Richard M. C. McCreadie, Comparing
Distributed Indexing: To MapReduce or Not?
[8] Apache Software Foundation. The apache hadoop project.
http://hadoop.apache.org
[9] S.Ghemawat, H. Gobiff, and S.T.Leung. The google FileSystem.
[10] Sergey Menlik, Sriram Raghavan, Beverly Yang, Hector GarciaMolina,
Building a FullText
distributed system for web.
http://www10.org/cdrom/papers/pdf/p275.pdf
[11] Rasmus Pagh , S. Srinivasa Rao,”Secondary Indexing in One
Dimension”
[12] Tadeusz Morzy, Maciej Zakrzewicz, “ Group Bitmap Index: A Structure
for Association Rules Retrieval”.
[13] Elizabeth O’Neil and Patrick O’Neil,Kesheng Wu,”Bitmap Index Design
Choices and Their Performance Implications”.
[14]Daniel J. Abadi Peter A. Boncz Stavros Harizopoulos,”Columnoriented
Database Systems”.
[15] Johan Jonsson, Coldbase A
ColumnOriented
InMemory
Database.
[16] Biswanath Panda,Joshua S. Herbach,Sugato Basu, Roberto J. Bayardo,
Google, Inc. PLANET: Massively Parallel Learning of Tree Ensembles
with MapReduce.
[17] G. W. Juette and L. E. Zeffanella, “Radio noise currents n short sections
on bundle conductors (Presented Conference Paper style),” presented at
the IEEE Summer power Meeting, Dallas, TX, June 22–27, 1990, Paper
90 SM 6900
PWRS.
[18] Ian Foster Carl Kesselman, “ Globus:A Metacomputing Infrastructure “.
[19] ohn Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski,, Patrick
Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea,Hakim
Weatherspoon, Westley Weimer, Chris Wells, and Ben
Zhao.OceanStore: An Architecture for GlobalScale
Persistent Storage.
[20] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski,
Christos Kozyrakis, Evaluating MapReduce for Multicore
and
Multiprocessor Systems
[21] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam
Silberstein,Philip Bohannon, HansArno
Jacobsen, Nick Puz, Daniel
Weaver and Ramana Yerneni Yahoo! Research. “PNUTS: Yahoo!’s
Hosted Data Serving Platform”
[22] Apache Software Foundation,The apache Hbase project
http://hbase.apache.org
[23] Morteza Zaker, Somnuk PhonAmnuaisuk,
SuCheng
Haw, An Adequate
Design for Large Data Warehouse Systems: Bitmap index versus Btree
index
The research paper intends to do classification of different articles, documents of Wikipedia into a consistent ontology like persons, organizations, films, album's, video games, species and diseases, leveraging distributed key value store Hbase. Each entity is kept as a column within a column family which will be the individual ontologies as mentioned above. Each get operation to Hbase will be responsible for
getting triples for the same key. The choice of key value store over any relational database promotes horizontal scalability and an increase in performance when the size of dataset is of the magnitude as that of Wikipedia. The proposed approach will
help make semantic queries to the Wikipedia dataset and make intelligent inferences, which is difficult doing currently specially with traditional Relational databases. The approach presents statistical analysis of performance with approaches of RDBMS
and that of Hbase a clone of Google's BigTable.
Index Terms—KeyValue store, Hbase, Hadoop, EVI (Encoded Vector Index), Semantic Modeling, OWL, RDF, MapReduce.
I.INTRODUCTION
ONE of the challenges of semantic search and retrieval in natural language processing is the storage and manipulation of large amount of subjectpredicateobject
expressions which can be RDF; making statements about particular resources. Most of the work of knowledge representation involves persisting the RDF data in relational
databases or native representations called Triple Stores, or Quad Stores if context is also persisted in RDBMS. But circumstances are more complicated by the fact that web today has billions of pages and RDBMS are nowadays more rigorously questioned whether they can sustain the tight scalability constraints imposed by the massive growth in dataset over the past years and the trend still growing. More specifically, we have the following concerns whether RDBMS suitably fits for semantic data persistence Scalability, Querying, Efficiency, Query latency and
organization Overhead. In this paper we present an approach of semantic modeling of RDF triples with Hbase, a sparse, distributed, persistent multidimensional sorted map for storing key values. The paper works for way of mapping RDF data to Hbase. It champions for a new approach of data processing pipeline.In relational databases, data is stored in conventional horizontal approach, which might present several problems like limitations in the number of columns, 1012 for DB2 and Oracle; as we might have several properties of an object which we have to model. Even if we are allowed more columns; RDBMS will have null in the vacant cells; which will kill up a lot of space and present additional problems when data size is substantially larger. Other limitations with RDBMS can be the difficulty in changing the schema like addition of columns later or maintaining timestamps. Hence a large penalty in performance has to be paid if data records are wider or the dataset is huge. Finally, RDBMS faces serious limitations when the questions are about addressing
redundancy, parallelism and horizontal scalability. Flexibility of column oriented key value stores is that they can be either used as wide table consisting of millions of columns, grouped under a single column family or as a very tall table consisting of billions of rows. This facility can be suitably exploited to make a persistent store for ontologies. Our approach involves better natural approach for storing RDF data in Hbase. Here, each RDF subject or resource corresponds to a row key, and each RDF predicate or property corresponds to a column. Keyspaces can be used to represent RDF repositories, and column families can be used to represent named graphs. The major advantages with this approach will be that it encompasses all the advantages of Hbase like high availability, consistency, horizontal scalability and data partitioning across the cluster to name some. The system is optimized for resource oriented access patterns to RDF statements about a particular subject taken from Wikipedia.
A semantic web is merely a collection of interrelated triples, the predicates of whose are data in themselves. Every object has an identifier that is unique and has a distinct meaning. A triple is a record containing three values (subject, predicate,
object). A RDF model consists of statements, which asserts facts about a resource. Each statement has three parts and is also known as a triple. A triple might relate one object to another or can link a constant value to a subject. Nepal is a mountainous country and Nepal's zip code is 977, are two of the cases. There are several optimizations which are needed to be performed for modeling triples with Hbase. In our model each record consists of numerous fields. The fields again contains data that belongs to another object forming a tree like structure. An object is not modeled explicitly, but rather by a series of linked tables distributed across the cluster on top of distributed File system (HDFS).
II.PROBLEMS TO BE MET
A.Identification of unique triples There can be cases when we might have to uniquely
identify the triples, when the combination of subject,predicate and the object is itself not unique.
B.Optimal Search performance given a Subject and a Predicate
How could we possibly perform search in key value store given Subject and a Predicate ?
C.Optimal Search performance given a Predicate and an Object
How could we possibly perform search in key value store given Predicate and an Object ?
III.SOLUTIONS TO PROBLEMS
We address the problems mentioned above by maintaining custom indexes with EVI indexing. Maintained separate triple tables for each of the major datatypes needed (varchar(255), longtext, integer, double, datetime). Added multiple indexes for the two ways the tables are used: a subject predicate combined key and a predicateobject
combined key. MapReduce utilization in the creation of indexes offers an effective yet powerful way for distribution of processing across the cluster. The index tables consists serialized composite keys and cell value consists either the value or
subject of the triple. The figure above gives the picture of how we are modeling
Hbase and its indexes in order to model semantic data. The first model presents a big table having predicates as their column names, here we are exploiting the fact that Hbase being a clone of Big Table can support millions of columns. Similar categories are grouped under a column family, we have categorized articles under person, Movies, Films, Organization, Album's, Video Games, Species, diseases etc..
IV.LEXICOGRAPHICAL ORDERING
As with Hbase we have Lexicographical ordering of strings. The keys in the main table as well as indexes are ordered in the following fashioned. If Σ has a total order (cf.alphabetical order) one can define a total order on Σ* the lexicographical order. since Σ is finite, it is always possible to define an ordering on Σ and consequently on Σ*. If Σ = {0, 1} and 0 < 1, then the lexicographical ordering of Σ* is ε < 0 < 00 < 000 < … < 011 < 0110 < … < 01111 < … < 1 < 10 < 100 < … < 101 < … < 111 … and so on.
V.COMPRESSION
Hbase sits on top of HDFS the Hadoop distributed file system. Using LZO compression will allow for reduction in the size of the data and also reduction in the read time of Hard Drives. The compression additionally allows for its block to be split into chunks for parallel processing with Hadoop.Creating indexes demands for reading data from HDFS and inserting them into Hbase, so compression enhances the performance of MapReduce process in general. Moreover, compressing data on Hbase increases its performance because an expensive I/O is reduced and useless bandwidth usage is negated which can sometimes saturate clusters, thus improving performance.
VI.SPARSE DATA
In Hbase, given row which in our case is the keyword of search can have any number of columns in each group or column family. Here we will have row based gaps. So, the
Hbase table will be sparse helping for quicker random reads.
VII.CONCEPTUAL MODELING OF DATA
A.Modeling of Main Table
The main table for storing RDF triples will be modeled as follows.
The table above provides how subject, object and predicates will be stored in the main table in Hbase. The Subject will be modeled as the row key whereas the object
can be the URI or the value itself. The predicates form the column name and are grouped under a column family,Persons in our case. There can be millions of predicates as the columns within a family, the table above just just gives three
of them. Theoretically a Column Family can accommodate infinite numbers of columns. Timestamps allows us to keep different Uri's or values for a given subject key. The
conceptual modeling although looks sparse, the table space is not eaten up because physically vacant cells has no significance. To retrieve a value, the user has to do a get using three keys, row key+column key+timestamps >value .
B.Modeling of Index Tables
The index tables will be modeled as follows. There will be two index tables one for S+P combination and other for P+O combination so that query for every type could be performed adequately. Tables below shows the cases for two of them.
There is a single column family and a single column in this case. The tables can be queried with the S+P or O+P combinations and get the list of serialized values of results, which can be later used to query whatever we want from the main table.
VIII.DISTRIBUTED
Hbase being built on top of Hadoop distributed Filesystem, the underlying data can be spread across several nodes in the cluster. It can either sit on top of HDFS or Amazon S3. The data will be replicated across several participating nodes in the cluster, which protects data in case one or several node goes down and allows for distributed processing and querying of data. The URL is being inverted simply to keep related data together. Although conceptually our data sits inside Hbase table similar to that of a Relational database table, but Hbase has a special file format for saving its data called Hfile format which are eventually stored in HDFS. So, for proper functioning and data persistence Hbase relies upon HDFS the Hadoop distributed FileSystem. Any file is split into one or more blocks and these blocks are stored and distributed across several datanodes in the cluster. Thus HDFS facilitates with data replication, Robustness, countering node failures, cluster
rebalancing, maintaining data integrity and managing cluster in overall.
IX.MODEL AND SEMANTICS
The semantic model must be able to express all the entities,facts, relations between facts and properties between relations. Currently OWL(Web Ontology Language) serves
for knowledge representation. It can express properties between relations and express relations between facts. In OWL and RDFS all the objects( persons, Places, URLS) are represented as entities.
X.MAPREDUCE FOR INDEXING
Effective indexing serves as the backbone of any search and retrieval systems. At the same time, the system should be effectively and quickly be able to prepare indexes. Preparing indexes for huge datasets like Wikipedia can be a daunting and time consuming tasks with traditional approaches, Consider the rate at which the dataset is growing daily and newer indexes has to be prepared frequently. Imagining
Google doing it with billions of pages and petabytes of data can provide us some insights into the effectiveness and swiftness of their technology. Our approach to indexing will utilize MapReduce; the notion Google popularized and has really evolved as a defacto for analyzing large datasets of the magnitudes of petabytes over the past years. We will utilize MapReduce process to generate indexes from the main Hbase table. The process will be executed in parallel over the cluster
of commodity machines. The Map operations are distributed across numerous computers across the cluster by automatically partitioning the input dataset into M shards.
Reduce invocations are also distributed by partitioning the intermediate key space into R pieces using a hash partitioning function(e.g., hash(key) mod R).Map takes one pair of data with a type in one data domain, and returns a list of pairs in a
different domain:
Map(k1,v1) -> list(k2,v2)
K1→ Multidimensional keys for row data, V1→ cell alue
The function is applied to all items in the input dataset, which produces a list of (k2,v2) pairs for each call. All the intermediate pairs are collected and are grouped together, creating one group for one different generated keys. The
reduce function is then applied to each group, which then produces collection of values in the same domain.
Reduce(k2, list (v2)) -> list(v3)
k2 → S+P or P+O, v2 → List of all values corresponding to common key, V3 → serialized list. The diagrammatic representation of the overall MapReduce process is shown in the diagram below. The image is taken from one of the original slides of Google present at Google Labs about MapReduce.
Thus MapReduce transforms all the key value pairs into a list of values, the list of values in our case are either the serialized list of all object values making up a relationship or the serialized list of all the subjects, as depicted in the column of
figure 3 and 4. The serialized binary data will be inserted into the cell corresponding the particular key. The benchmark for the time required to index will be provided appropriately after experimentation.
XI.WORDNET
WordNet is a semantic lexicon for the English language developed at the Cognitive Science Laboratory of Princeton University. WordNet distinguishes between words as literally appearing in texts and the actual senses of the words. A set of words that share one sense is called a synset. Thus, each synset identifies one sense (i.e., semantic concept). Words with multiple meanings (ambiguous words) belong to multiple synsets. As of the current version 2.1, WordNet contains 81,426 synsets for 117,097 unique nouns. (WordNet also includes other types of words like verbs and adjectives, but we consider only nouns in this paper.) WordNet provides relations between synsets such as hypernymy/hyponymy (i.e., the relation between a subconcept
and a superconcept) and holonymy/meronymy (i.e., the relation between a part and the
whole); for this paper, we focus on hypernyms/hyponyms. Conceptually, the hypernymy relation in WordNet spans a directed acyclic graph (DAG) with a single source node called Entity. Wordnet will be leveraged for finding relationship between words.
XII.CONCLUSION
In this paper we discussed about Hbase and semantic modeling of Wikipedia data to Hbase in order to do intelligent search over Wikipedia data. we also presented an index modeling technique for storing RDF triples in Hbase. We leveraged distributed MapReduce operations to the large dataset of Wikipedia and modeled its data and indexes appropriately. We presented a distributed and highly redundant alternative to making the solution. The comparison of Hbase modeling is to be done with Mysql and statistics are to be presented. The system was able to produce intelligent
search with Wikipedia data and was able to produce results which with normal query in Wikipedia we cannot have. The approach also promotes the usage of distributed key value store, Hbase; as an appropriate datastore for modeling a semantic web.
XIII.FUTURE WORK
A large scale semantic modeling of Wikipedia data remains to be done. The immediate goal is to workout an effective and highly scalable design to scale up to TeraBytes of data. Some areas to be explored are, Improvements in efficiency including query caching, disk allocation, and indices optimization. Latency, query performance, and relevance of results is at top priority. Although some tests have earlier been done regarding performance of Hbase with Hadoop, we are still to convince that Hbase forms an effective datastore for modeling semantic web data.
References
[1] Indexing Goes a new direction, by Richard Winter;
http://www.wintercorp.com/rwintercolumns/ie_9901.html.
[2] Jeffrey Dean & sanjay Ghemawat, “MapReduceA
simplified data
processing on Large Clusters”, http://labs.google.com/papers/mapreduceosdi04.
[3] Mark Slee, Aditya Agarwal and Marc Kwiatkowski, Thrift: Scalable
CrossLanguage
Services Implementation
[4] B. Smith, “An approach to graphs of linear forms (Unpublished work
style),” unpublished.
[5] Loannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos and
Nectarios Koziris,{ikons, eangelou, dtsouma,
nkoziris}@cslab.ece.ntua.gr} Distributed Indexing of Web Scale
Datasets for the Cloud
[6] Tim Lu, Shan Sinha, Ajay Sudan. “Panaché: A Scalable Distributed
Index for Keyword Search”,{timlu, ssinha, ajaytoo} @mit.edu
[7] Iadh Ounis, Craig Macdonald,Richard M. C. McCreadie, Comparing
Distributed Indexing: To MapReduce or Not?
[8] Apache Software Foundation. The apache hadoop project.
http://hadoop.apache.org
[9] S.Ghemawat, H. Gobiff, and S.T.Leung. The google FileSystem.
[10] Sergey Menlik, Sriram Raghavan, Beverly Yang, Hector GarciaMolina,
Building a FullText
distributed system for web.
http://www10.org/cdrom/papers/pdf/p275.pdf
[11] Rasmus Pagh , S. Srinivasa Rao,”Secondary Indexing in One
Dimension”
[12] Tadeusz Morzy, Maciej Zakrzewicz, “ Group Bitmap Index: A Structure
for Association Rules Retrieval”.
[13] Elizabeth O’Neil and Patrick O’Neil,Kesheng Wu,”Bitmap Index Design
Choices and Their Performance Implications”.
[14]Daniel J. Abadi Peter A. Boncz Stavros Harizopoulos,”Columnoriented
Database Systems”.
[15] Johan Jonsson, Coldbase A
ColumnOriented
InMemory
Database.
[16] Biswanath Panda,Joshua S. Herbach,Sugato Basu, Roberto J. Bayardo,
Google, Inc. PLANET: Massively Parallel Learning of Tree Ensembles
with MapReduce.
[17] G. W. Juette and L. E. Zeffanella, “Radio noise currents n short sections
on bundle conductors (Presented Conference Paper style),” presented at
the IEEE Summer power Meeting, Dallas, TX, June 22–27, 1990, Paper
90 SM 6900
PWRS.
[18] Ian Foster Carl Kesselman, “ Globus:A Metacomputing Infrastructure “.
[19] ohn Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski,, Patrick
Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea,Hakim
Weatherspoon, Westley Weimer, Chris Wells, and Ben
Zhao.OceanStore: An Architecture for GlobalScale
Persistent Storage.
[20] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski,
Christos Kozyrakis, Evaluating MapReduce for Multicore
and
Multiprocessor Systems
[21] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam
Silberstein,Philip Bohannon, HansArno
Jacobsen, Nick Puz, Daniel
Weaver and Ramana Yerneni Yahoo! Research. “PNUTS: Yahoo!’s
Hosted Data Serving Platform”
[22] Apache Software Foundation,The apache Hbase project
http://hbase.apache.org
[23] Morteza Zaker, Somnuk PhonAmnuaisuk,
SuCheng
Haw, An Adequate
Design for Large Data Warehouse Systems: Bitmap index versus Btree
index
Subscribe to:
Posts (Atom)