Abstract— This work presents a new algorithm to detect RDF
graph isomorphism. We show that the RDF graph isomorphism
problem can be solved efficiently by exploiting the information
carried by the graphs. The efficiency of the algorithm comes
from detecting isomorphism on RDF subgraphs. The theoretical
complexity of the algorithm is $O(n \log n + e_G)$, where $n$ is
the number of vertices and $e_G$ is the cost of checking all
edges. The proposed algorithm, called LPS (Longest Path
Subgraph), was compared with the Iterative Vertex Classification
algorithm in order to show its strengths. Experimental tests
show a much better performance for the LPS algorithm.
Index Terms— graph isomorphism, RDF graph, algorithms
I. INTRODUCTION
There is a growing need for data integration. For example,
life sciences research demands the integration of diverse and
heterogeneous data sets that originate from distinct
communities of scientists in separate subfields [1]. The
Semantic Web [2] is an effort to give semantics to web content
in order to allow data integration. The Resource Description
Framework (RDF) [3] is a metadata model and language that
serves as a basis for the Semantic Web infrastructure. RDF
allows data to be linked to and/or merged with other RDF data
by semantic web applications, using the eXtensible Markup
Language (XML) as an interchange syntax. XML is a simple,
very flexible text format proposed as a standard for structured
information interchange. Good references for the RDF model
are [4] and [5].
1.1 Problem Statement
Although XML and RDF are closely related, Berners-Lee points
out an important difference: given XML and RDF documents that
represent the same object, there are in general a large number
of ways in which the XML document can be mapped onto the
logical RDF graph [6]. In this sense, RDF is a data model, and a
data model can be represented syntactically in many ways. One
might therefore ask whether two documents that differ
syntactically but have the same meaning should have the same
RDF data model. The answer to this question is affirmative. On
the other hand, index structures and algorithms for querying
distributed RDF repositories have been proposed, and these
proposals aim to optimize queries on RDF repositories [7].
Such optimization, however, requires good algorithms to
determine whether two distributed sources represent the same
data model. This problem is equivalent to determining whether
two RDF graphs are isomorphic [8], which is not an easy task.
Several algorithms have been proposed to solve it (see [6][9]).
However, these algorithms are neither efficient nor robust.
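As a minimal illustration of the problem itself (not of the
algorithms in [6], [8] or [9], and not of the algorithm proposed
in this paper), the following Python sketch decides isomorphism
of two tiny RDF graphs by brute force. It assumes, for
illustration only, that a graph is a set of (subject, predicate,
object) string triples and that unnamed vertices are written
with a "_:" prefix; the function names are hypothetical.

```python
from itertools import permutations


def is_blank(term):
    """Assumed convention: unnamed (blank) vertices start with '_:'."""
    return isinstance(term, str) and term.startswith("_:")


def blank_nodes(graph):
    """Collect the blank vertices used as subject or object of any triple."""
    return sorted({t for s, _, o in graph for t in (s, o) if is_blank(t)})


def rdf_isomorphic(g1, g2):
    """Brute-force test: try every bijection between the blank vertices.

    Named vertices and predicates must match exactly; only blank
    vertices may be renamed. The cost is factorial in the number of
    blank vertices, so this is usable only on very small graphs.
    """
    g1, g2 = set(g1), set(g2)
    b1, b2 = blank_nodes(g1), blank_nodes(g2)
    if len(g1) != len(g2) or len(b1) != len(b2):
        return False
    for perm in permutations(b2):
        mapping = dict(zip(b1, perm))
        renamed = {(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in g1}
        if renamed == g2:
            return True
    return False


# Two graphs that differ only in the label of the unnamed vertex.
g1 = {("ex:alice", "ex:knows", "_:a"), ("_:a", "ex:name", '"Bob"')}
g2 = {("ex:alice", "ex:knows", "_:x"), ("_:x", "ex:name", '"Bob"')}
same = rdf_isomorphic(g1, g2)
```

The two example graphs are syntactically different but describe
the same data model, so the check succeeds. The factorial cost of
trying every renaming is what makes a more efficient algorithm
desirable.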
1.2 Contributions
In this paper, we propose a new algorithm to detect RDF
graph isomorphism. We use the extra information that comes
with the graphs to solve the isomorphism problem for RDF
graphs efficiently. The efficiency of the algorithm comes from
detecting isomorphism on RDF subgraphs. We implemented our
algorithm, called Longest Path Subgraph (LPS), as well as the
Iterative Vertex Classification (IVC) algorithm [8], and
obtained experimental results for both.
The LPS algorithm first preprocesses the RDF graphs in order
to generate two new graphs annotated with information on their
vertices and edges. Next, two RDF subgraphs are obtained from
the annotated graphs. Then, the algorithm checks whether the
RDF subgraphs are isomorphic. If the subgraphs are not
isomorphic, the RDF graphs cannot be isomorphic either, so it
is not necessary to check all vertices and edges of the RDF
graphs.
In addition, our algorithm addresses two latent difficulties in
the RDF graph matching problem: the first is the existence of
cycles in a graph, and the second is the presence of unnamed
vertices. Moreover, we believe that this algorithm can be
extended to other applications.
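The idea of rejecting non-isomorphic graphs before checking
every vertex and edge can be illustrated with a generic
pre-filter. The Python sketch below is not the LPS procedure:
it swaps in a standard vertex-signature comparison, and the
triple representation, the "_:" prefix for unnamed vertices and
all function names are assumptions made for this example.

```python
def is_blank(term):
    """Assumed convention: unnamed (blank) vertices start with '_:'."""
    return isinstance(term, str) and term.startswith("_:")


def vertex_signatures(graph):
    """Return the sorted list of per-vertex signatures.

    A signature is (label, outgoing predicates, incoming predicates),
    where the label of an unnamed vertex is the empty string so that
    its arbitrary identifier does not influence the comparison.
    """
    out_edges, in_edges = {}, {}
    for s, p, o in graph:
        out_edges.setdefault(s, []).append(p)
        in_edges.setdefault(o, []).append(p)
    signatures = []
    for v in set(out_edges) | set(in_edges):
        label = "" if is_blank(v) else v
        signatures.append((
            label,
            tuple(sorted(out_edges.get(v, []))),
            tuple(sorted(in_edges.get(v, []))),
        ))
    return sorted(signatures)


def may_be_isomorphic(g1, g2):
    """Necessary condition only: False means 'certainly not isomorphic'."""
    g1, g2 = set(g1), set(g2)
    if len(g1) != len(g2):
        return False
    return vertex_signatures(g1) == vertex_signatures(g2)


g1 = {("ex:alice", "ex:knows", "_:a"), ("_:a", "ex:name", '"Bob"')}
g2 = {("ex:alice", "ex:knows", "_:x"), ("_:x", "ex:age", '"Bob"')}
rejected_early = not may_be_isomorphic(g1, g2)
```

A negative answer lets the caller skip the full comparison. A
positive answer is inconclusive: two unnamed vertices joined in
a cycle and two unnamed vertices with one self-loop each have
identical signatures but are not isomorphic, so a complete check
is still required in that case.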
In the worst case, our algorithm is $O(n \log n + e_G)$, where
$n$ is the number of vertices and $e_G$ is the cost of checking
all edges. In order to verify its performance, we compared it
with the IVC algorithm. Both algorithms were implemented in C
and run on Red Hat Linux 7.2 with a 1.4 GHz Celeron processor,
512 MB of RAM and the gcc 2.95 compiler.
Longest Path Subgraph: A Novel and Efficient
Algorithm to Match RDF Graphs
Claudio Gutiérrez-Soto
Universidad del Bío-Bío
cogutier@ubiobio.cl
Pedro G. Campos
Universidad del Bío-Bío
pgcampos@ubiobio.cl
Julio Águila
Universidad de Magallanes
julio.aguila@umag.cl
2008 Mexican International Conference on Computer Science
1550-4069/08 $25.00 © 2008 IEEE
DOI 10.1109/ENC.2008.41