Abstract—
The research paper intends to do classification of different articles, documents of Wikipedia into a consistent ontology like persons, organizations, films, album's, video games, species and diseases, leveraging distributed key value store Hbase. Each entity is kept as a column within a column family which will be the individual ontologies as mentioned above. Each get operation to Hbase will be responsible for
getting triples for the same key. The choice of key value store over any relational database promotes horizontal scalability and an increase in performance when the size of dataset is of the magnitude as that of Wikipedia. The proposed approach will
help make semantic queries to the Wikipedia dataset and make intelligent inferences, which is difficult doing currently specially with traditional Relational databases. The approach presents statistical analysis of performance with approaches of RDBMS
and that of Hbase a clone of Google's BigTable.
Index Terms—KeyValue store, Hbase, Hadoop, EVI (Encoded Vector Index), Semantic Modeling, OWL, RDF, MapReduce.
I.INTRODUCTION
ONE of the challenges of semantic search and retrieval in natural language processing is the storage and manipulation of large amount of subjectpredicateobject
expressions which can be RDF; making statements about particular resources. Most of the work of knowledge representation involves persisting the RDF data in relational
databases or native representations called Triple Stores, or Quad Stores if context is also persisted in RDBMS. But circumstances are more complicated by the fact that web today has billions of pages and RDBMS are nowadays more rigorously questioned whether they can sustain the tight scalability constraints imposed by the massive growth in dataset over the past years and the trend still growing. More specifically, we have the following concerns whether RDBMS suitably fits for semantic data persistence Scalability, Querying, Efficiency, Query latency and
organization Overhead. In this paper we present an approach of semantic modeling of RDF triples with Hbase, a sparse, distributed, persistent multidimensional sorted map for storing key values. The paper works for way of mapping RDF data to Hbase. It champions for a new approach of data processing pipeline.In relational databases, data is stored in conventional horizontal approach, which might present several problems like limitations in the number of columns, 1012 for DB2 and Oracle; as we might have several properties of an object which we have to model. Even if we are allowed more columns; RDBMS will have null in the vacant cells; which will kill up a lot of space and present additional problems when data size is substantially larger. Other limitations with RDBMS can be the difficulty in changing the schema like addition of columns later or maintaining timestamps. Hence a large penalty in performance has to be paid if data records are wider or the dataset is huge. Finally, RDBMS faces serious limitations when the questions are about addressing
redundancy, parallelism and horizontal scalability. Flexibility of column oriented key value stores is that they can be either used as wide table consisting of millions of columns, grouped under a single column family or as a very tall table consisting of billions of rows. This facility can be suitably exploited to make a persistent store for ontologies. Our approach involves better natural approach for storing RDF data in Hbase. Here, each RDF subject or resource corresponds to a row key, and each RDF predicate or property corresponds to a column. Keyspaces can be used to represent RDF repositories, and column families can be used to represent named graphs. The major advantages with this approach will be that it encompasses all the advantages of Hbase like high availability, consistency, horizontal scalability and data partitioning across the cluster to name some. The system is optimized for resource oriented access patterns to RDF statements about a particular subject taken from Wikipedia.
A semantic web is merely a collection of interrelated triples, the predicates of whose are data in themselves. Every object has an identifier that is unique and has a distinct meaning. A triple is a record containing three values (subject, predicate,
object). A RDF model consists of statements, which asserts facts about a resource. Each statement has three parts and is also known as a triple. A triple might relate one object to another or can link a constant value to a subject. Nepal is a mountainous country and Nepal's zip code is 977, are two of the cases. There are several optimizations which are needed to be performed for modeling triples with Hbase. In our model each record consists of numerous fields. The fields again contains data that belongs to another object forming a tree like structure. An object is not modeled explicitly, but rather by a series of linked tables distributed across the cluster on top of distributed File system (HDFS).
II.PROBLEMS TO BE MET
A.Identification of unique triples There can be cases when we might have to uniquely
identify the triples, when the combination of subject,predicate and the object is itself not unique.
B.Optimal Search performance given a Subject and a Predicate
How could we possibly perform search in key value store given Subject and a Predicate ?
C.Optimal Search performance given a Predicate and an Object
How could we possibly perform search in key value store given Predicate and an Object ?
III.SOLUTIONS TO PROBLEMS
We address the problems mentioned above by maintaining custom indexes with EVI indexing. Maintained separate triple tables for each of the major datatypes needed (varchar(255), longtext, integer, double, datetime). Added multiple indexes for the two ways the tables are used: a subject predicate combined key and a predicateobject
combined key. MapReduce utilization in the creation of indexes offers an effective yet powerful way for distribution of processing across the cluster. The index tables consists serialized composite keys and cell value consists either the value or
subject of the triple. The figure above gives the picture of how we are modeling
Hbase and its indexes in order to model semantic data. The first model presents a big table having predicates as their column names, here we are exploiting the fact that Hbase being a clone of Big Table can support millions of columns. Similar categories are grouped under a column family, we have categorized articles under person, Movies, Films, Organization, Album's, Video Games, Species, diseases etc..
IV.LEXICOGRAPHICAL ORDERING
As with Hbase we have Lexicographical ordering of strings. The keys in the main table as well as indexes are ordered in the following fashioned. If Σ has a total order (cf.alphabetical order) one can define a total order on Σ* the lexicographical order. since Σ is finite, it is always possible to define an ordering on Σ and consequently on Σ*. If Σ = {0, 1} and 0 < 1, then the lexicographical ordering of Σ* is ε < 0 < 00 < 000 < … < 011 < 0110 < … < 01111 < … < 1 < 10 < 100 < … < 101 < … < 111 … and so on.
V.COMPRESSION
Hbase sits on top of HDFS the Hadoop distributed file system. Using LZO compression will allow for reduction in the size of the data and also reduction in the read time of Hard Drives. The compression additionally allows for its block to be split into chunks for parallel processing with Hadoop.Creating indexes demands for reading data from HDFS and inserting them into Hbase, so compression enhances the performance of MapReduce process in general. Moreover, compressing data on Hbase increases its performance because an expensive I/O is reduced and useless bandwidth usage is negated which can sometimes saturate clusters, thus improving performance.
VI.SPARSE DATA
In Hbase, given row which in our case is the keyword of search can have any number of columns in each group or column family. Here we will have row based gaps. So, the
Hbase table will be sparse helping for quicker random reads.
VII.CONCEPTUAL MODELING OF DATA
A.Modeling of Main Table
The main table for storing RDF triples will be modeled as follows.
The table above provides how subject, object and predicates will be stored in the main table in Hbase. The Subject will be modeled as the row key whereas the object
can be the URI or the value itself. The predicates form the column name and are grouped under a column family,Persons in our case. There can be millions of predicates as the columns within a family, the table above just just gives three
of them. Theoretically a Column Family can accommodate infinite numbers of columns. Timestamps allows us to keep different Uri's or values for a given subject key. The
conceptual modeling although looks sparse, the table space is not eaten up because physically vacant cells has no significance. To retrieve a value, the user has to do a get using three keys, row key+column key+timestamps >value .
B.Modeling of Index Tables
The index tables will be modeled as follows. There will be two index tables one for S+P combination and other for P+O combination so that query for every type could be performed adequately. Tables below shows the cases for two of them.
There is a single column family and a single column in this case. The tables can be queried with the S+P or O+P combinations and get the list of serialized values of results, which can be later used to query whatever we want from the main table.
VIII.DISTRIBUTED
Hbase being built on top of Hadoop distributed Filesystem, the underlying data can be spread across several nodes in the cluster. It can either sit on top of HDFS or Amazon S3. The data will be replicated across several participating nodes in the cluster, which protects data in case one or several node goes down and allows for distributed processing and querying of data. The URL is being inverted simply to keep related data together. Although conceptually our data sits inside Hbase table similar to that of a Relational database table, but Hbase has a special file format for saving its data called Hfile format which are eventually stored in HDFS. So, for proper functioning and data persistence Hbase relies upon HDFS the Hadoop distributed FileSystem. Any file is split into one or more blocks and these blocks are stored and distributed across several datanodes in the cluster. Thus HDFS facilitates with data replication, Robustness, countering node failures, cluster
rebalancing, maintaining data integrity and managing cluster in overall.
IX.MODEL AND SEMANTICS
The semantic model must be able to express all the entities,facts, relations between facts and properties between relations. Currently OWL(Web Ontology Language) serves
for knowledge representation. It can express properties between relations and express relations between facts. In OWL and RDFS all the objects( persons, Places, URLS) are represented as entities.
X.MAPREDUCE FOR INDEXING
Effective indexing serves as the backbone of any search and retrieval systems. At the same time, the system should be effectively and quickly be able to prepare indexes. Preparing indexes for huge datasets like Wikipedia can be a daunting and time consuming tasks with traditional approaches, Consider the rate at which the dataset is growing daily and newer indexes has to be prepared frequently. Imagining
Google doing it with billions of pages and petabytes of data can provide us some insights into the effectiveness and swiftness of their technology. Our approach to indexing will utilize MapReduce; the notion Google popularized and has really evolved as a defacto for analyzing large datasets of the magnitudes of petabytes over the past years. We will utilize MapReduce process to generate indexes from the main Hbase table. The process will be executed in parallel over the cluster
of commodity machines. The Map operations are distributed across numerous computers across the cluster by automatically partitioning the input dataset into M shards.
Reduce invocations are also distributed by partitioning the intermediate key space into R pieces using a hash partitioning function(e.g., hash(key) mod R).Map takes one pair of data with a type in one data domain, and returns a list of pairs in a
different domain:
Map(k1,v1) -> list(k2,v2)
K1→ Multidimensional keys for row data, V1→ cell alue
The function is applied to all items in the input dataset, which produces a list of (k2,v2) pairs for each call. All the intermediate pairs are collected and are grouped together, creating one group for one different generated keys. The
reduce function is then applied to each group, which then produces collection of values in the same domain.
Reduce(k2, list (v2)) -> list(v3)
k2 → S+P or P+O, v2 → List of all values corresponding to common key, V3 → serialized list. The diagrammatic representation of the overall MapReduce process is shown in the diagram below. The image is taken from one of the original slides of Google present at Google Labs about MapReduce.
Thus MapReduce transforms all the key value pairs into a list of values, the list of values in our case are either the serialized list of all object values making up a relationship or the serialized list of all the subjects, as depicted in the column of
figure 3 and 4. The serialized binary data will be inserted into the cell corresponding the particular key. The benchmark for the time required to index will be provided appropriately after experimentation.
XI.WORDNET
WordNet is a semantic lexicon for the English language developed at the Cognitive Science Laboratory of Princeton University. WordNet distinguishes between words as literally appearing in texts and the actual senses of the words. A set of words that share one sense is called a synset. Thus, each synset identifies one sense (i.e., semantic concept). Words with multiple meanings (ambiguous words) belong to multiple synsets. As of the current version 2.1, WordNet contains 81,426 synsets for 117,097 unique nouns. (WordNet also includes other types of words like verbs and adjectives, but we consider only nouns in this paper.) WordNet provides relations between synsets such as hypernymy/hyponymy (i.e., the relation between a subconcept
and a superconcept) and holonymy/meronymy (i.e., the relation between a part and the
whole); for this paper, we focus on hypernyms/hyponyms. Conceptually, the hypernymy relation in WordNet spans a directed acyclic graph (DAG) with a single source node called Entity. Wordnet will be leveraged for finding relationship between words.
XII.CONCLUSION
In this paper we discussed about Hbase and semantic modeling of Wikipedia data to Hbase in order to do intelligent search over Wikipedia data. we also presented an index modeling technique for storing RDF triples in Hbase. We leveraged distributed MapReduce operations to the large dataset of Wikipedia and modeled its data and indexes appropriately. We presented a distributed and highly redundant alternative to making the solution. The comparison of Hbase modeling is to be done with Mysql and statistics are to be presented. The system was able to produce intelligent
search with Wikipedia data and was able to produce results which with normal query in Wikipedia we cannot have. The approach also promotes the usage of distributed key value store, Hbase; as an appropriate datastore for modeling a semantic web.
XIII.FUTURE WORK
A large scale semantic modeling of Wikipedia data remains to be done. The immediate goal is to workout an effective and highly scalable design to scale up to TeraBytes of data. Some areas to be explored are, Improvements in efficiency including query caching, disk allocation, and indices optimization. Latency, query performance, and relevance of results is at top priority. Although some tests have earlier been done regarding performance of Hbase with Hadoop, we are still to convince that Hbase forms an effective datastore for modeling semantic web data.
References
[1] Indexing Goes a new direction, by Richard Winter;
http://www.wintercorp.com/rwintercolumns/ie_9901.html.
[2] Jeffrey Dean & sanjay Ghemawat, “MapReduceA
simplified data
processing on Large Clusters”, http://labs.google.com/papers/mapreduceosdi04.
pdf
[3] Mark Slee, Aditya Agarwal and Marc Kwiatkowski, Thrift: Scalable
CrossLanguage
Services Implementation
[4] B. Smith, “An approach to graphs of linear forms (Unpublished work
style),” unpublished.
[5] Loannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos and
Nectarios Koziris,{ikons, eangelou, dtsouma,
nkoziris}@cslab.ece.ntua.gr} Distributed Indexing of Web Scale
Datasets for the Cloud
[6] Tim Lu, Shan Sinha, Ajay Sudan. “Panaché: A Scalable Distributed
Index for Keyword Search”,{timlu, ssinha, ajaytoo} @mit.edu
[7] Iadh Ounis, Craig Macdonald,Richard M. C. McCreadie, Comparing
Distributed Indexing: To MapReduce or Not?
[8] Apache Software Foundation. The apache hadoop project.
http://hadoop.apache.org
[9] S.Ghemawat, H. Gobiff, and S.T.Leung. The google FileSystem.
[10] Sergey Menlik, Sriram Raghavan, Beverly Yang, Hector GarciaMolina,
Building a FullText
distributed system for web.
http://www10.org/cdrom/papers/pdf/p275.pdf
[11] Rasmus Pagh , S. Srinivasa Rao,”Secondary Indexing in One
Dimension”
[12] Tadeusz Morzy, Maciej Zakrzewicz, “ Group Bitmap Index: A Structure
for Association Rules Retrieval”.
[13] Elizabeth O’Neil and Patrick O’Neil,Kesheng Wu,”Bitmap Index Design
Choices and Their Performance Implications”.
[14]Daniel J. Abadi Peter A. Boncz Stavros Harizopoulos,”Columnoriented
Database Systems”.
[15] Johan Jonsson, Coldbase A
ColumnOriented
InMemory
Database.
[16] Biswanath Panda,Joshua S. Herbach,Sugato Basu, Roberto J. Bayardo,
Google, Inc. PLANET: Massively Parallel Learning of Tree Ensembles
with MapReduce.
[17] G. W. Juette and L. E. Zeffanella, “Radio noise currents n short sections
on bundle conductors (Presented Conference Paper style),” presented at
the IEEE Summer power Meeting, Dallas, TX, June 22–27, 1990, Paper
90 SM 6900
PWRS.
[18] Ian Foster Carl Kesselman, “ Globus:A Metacomputing Infrastructure “.
[19] ohn Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski,, Patrick
Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea,Hakim
Weatherspoon, Westley Weimer, Chris Wells, and Ben
Zhao.OceanStore: An Architecture for GlobalScale
Persistent Storage.
[20] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski,
Christos Kozyrakis, Evaluating MapReduce for Multicore
and
Multiprocessor Systems
[21] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam
Silberstein,Philip Bohannon, HansArno
Jacobsen, Nick Puz, Daniel
Weaver and Ramana Yerneni Yahoo! Research. “PNUTS: Yahoo!’s
Hosted Data Serving Platform”
[22] Apache Software Foundation,The apache Hbase project
http://hbase.apache.org
[23] Morteza Zaker, Somnuk PhonAmnuaisuk,
SuCheng
Haw, An Adequate
Design for Large Data Warehouse Systems: Bitmap index versus Btree
index
No comments:
Post a Comment