Content-Based Search Using Self-Organizing Peer-to-Peer Network

IGOR MEKTEROVIĆ, KREŠIMIR KRIŽANOVIĆ, MIRTA BARANOVIĆ
Department of Applied Computing
Faculty of Electrical Engineering and Computing, University of Zagreb
Unska 3, 10000 Zagreb
CROATIA

Abstract: - For a peer-to-peer (P2P) content-sharing network to be scalable, it is imperative to route queries through the network efficiently. Semantic-based search should, with as few messages (as little traffic) as possible, return just the relevant documents stored throughout the network, thus achieving precision and recall values comparable to those of a corresponding centralized system. In this article we propose protocols for a self-organizing P2P network that arranges links between peers according to the peers' content. In the proposed network, peers organize themselves into "semantic communities" without losing links to other semantic communities. The proposed network has no prior knowledge of the semantics of the documents that are to be stored in the system.

Key-Words: - Peer-to-peer, content-based search, information retrieval, algorithm

1 Introduction
With the advent of Napster we have witnessed an extraordinary expansion of interest in peer-to-peer content-sharing systems, both in the general population and in the scientific community. Over time, better and more efficient protocols (networks) for content sharing have been developed (e.g. Gnutella, eDonkey, BitTorrent). However, widely adopted peer-to-peer protocols for content sharing only allow metadata searches (file name, size, type, etc.). Peer-to-peer networks that enable content-based searches are still a subject of active research. The ones developed so far are usually divided into structured and unstructured networks. In unstructured networks, peers are unaware of the content held by neighboring peers, which forces them into less effective query routing (e.g. Gnutella flooding) and results in poor network scalability.
Structured P2P networks overcome scalability issues but require complex protocols that are not suitable for the highly transient peers typical of P2P systems [10]. In addition, structured P2P networks usually have to maintain high-dimensional DHTs (Distributed Hash Tables) that reflect the semantic space of the documents stored in the system. Such DHTs may be inappropriate for newly arrived documents whose semantics differ significantly from those of the documents that were taken into account when the DHT was constructed. Along the same lines, if we wanted to construct an initially empty network and offer it to the general public for sharing arbitrary documents, we would have no documents to sample in order to construct the semantic space.

That was the motivation for constructing a P2P network that allows semantic-based queries without prior knowledge of the documents that will be stored throughout the network. A self-organizing P2P network that arranges links between peers according to their content is proposed. In such a network, peers organize themselves into "semantic communities". Every peer represents its content with a set of vectors, and content likeness is determined as vector likeness. It is assumed that peers (i.e. users) sharing documents on a certain topic will most likely search for similar documents (e.g. someone who is sharing papers in the field of computer science is more likely to search for similar papers than for, say, biology papers). Of course, it is entirely possible for a user to search for something semantically completely different; it is therefore important not to lose links to other semantic communities.

2 Problem Formulation
Textual documents can be represented and stored as data objects in a P2P system. More precisely, a document is represented as an n-dimensional vector, called a Semantic Vector or Feature Vector. Each element of the vector represents the importance of a term in the document.
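As a rough illustration of this representation, the following Python sketch turns a toy collection into n-dimensional term vectors and compares them. It is only a sketch under stated assumptions: the whitespace tokenizer, the relative-frequency TF, the log(N/df) IDF and the function names `tokenize`, `tfidf_vectors` and `cosine` are illustrative choices that do not come from the paper, which names the TF*IDF scheme only in the following paragraph and defers its details to [1]. Cosine similarity stands in here for the "vector likeness" mentioned in the Introduction.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one n-dimensional vector per document.

    Assumed weighting: relative term frequency times log(N / document frequency).
    """
    token_lists = [tokenize(d) for d in docs]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for an empty document
        vectors.append([
            (counts[t] / total) * math.log(n_docs / df[t])
            for t in vocabulary
        ])
    return vocabulary, vectors


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection (hypothetical data, for illustration only).
docs = [
    "peer network routing",
    "peer network search",
    "cell biology protein",
]
vocabulary, vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # documents that share terms
print(cosine(vectors[0], vectors[2]))  # documents with no terms in common
```

In this toy collection the first two documents share the terms "peer" and "network" and therefore score higher than the first and third, which have no terms in common. Note that under this assumed IDF a term that occurs in every document receives weight zero.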
The importance of each term is computed using the TF*IDF (term frequency * inverse document frequency) scheme [1]. A term is considered more important in a document if it is used often in that document (TF) and seldom in the other documents in the collection (IDF). Such a term is important because it differentiates one document from the others. During the search process, documents are retrieved according to the similarity between the query vector (the query can itself be a full-blown document) and the document vector. The prevailing measure of similarity is the cosine of the angle between the vectors.

7th WSEAS Int. Conf. on SOFTWARE ENGINEERING, PARALLEL and DISTRIBUTED SYSTEMS (SEPADS '08), University of Cambridge, UK, Feb 20-22, 2008 ISSN: 1790-5117 50 ISBN: 978-960-6766-42-8

If the vectors are normalized, the cosine of the