Web Document Modelling and Clustering *

William Song
SISU, ELECTRUM 212, 164 40 KISTA, SWEDEN

(Position Paper)

Abstract

The great number of web documents, created by people from a great variety of backgrounds and used by everyone able to access the Internet, gives rise to the problem of how to search the Internet so that users easily obtain what they want and filter out what they do not. The problem is strongly related to how web documents are described or characterized. Meanwhile, labels are being introduced to cluster web documents. These labels are generated by different people using various categorization models, so conflicts may arise. A web document model is therefore necessary for the analysis and formulation of web documents, and hence for their formal clustering. In the SISU-PICS project, we propose a web document data model (WDM) to represent the various characteristics and components that identify web documents. We also establish a set of relatedness and similarity relations between WDM objects for computing web document clusters.

1 Introduction

The ever-growing information on the World Wide Web (web) poses the problem of how to search effectively and efficiently for what a web user requires among the explosively large number of web documents. The central problem is that a web document does not provide sufficient information in a straightforward manner for quickly and easily identifying what a user may need. That is, a web document itself cannot be used directly to uniquely locate the document being searched for. Most available search engines use a keyword (textual string) matching mechanism to search for all web pages that contain one or several keywords matching the web user's query. Because of the subjectivity of the keywords assigned and the multivocality of the keywords selected, keyword search results are not satisfactory, i.e., low in search precision and comprehension [6].
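The keyword matching mechanism described above can be illustrated with a minimal sketch. The toy corpus, the document identifiers, and the function name `keyword_search` are all hypothetical and chosen only for illustration; the sketch does not reproduce any particular search engine.

```python
# Hypothetical toy corpus; identifiers and texts are invented for illustration.
DOCUMENTS = {
    "doc1": "jaguar speed in the wild cat family",
    "doc2": "jaguar car dealer prices",
    "doc3": "big cat conservation programme",
}


def keyword_search(query, documents):
    """Return the sorted ids of documents containing at least one query keyword."""
    keywords = set(query.lower().split())
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        # A document matches as soon as it shares any word with the query.
        if keywords & words:
            hits.append(doc_id)
    return sorted(hits)


# A user interested in the animal also receives the car page (doc2).
results = keyword_search("jaguar cat", DOCUMENTS)
print(results)
```

In this sketch the word "jaguar" has more than one meaning, so a page about cars is returned for a query about animals, which lowers precision. Conversely, a query using a word that no document author happened to choose, such as "panther", returns nothing even though related documents exist.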
The introduction of labels (or rating services) to describe a web document offers a promising start for the identification of web documents, since the labels assigned to a web document at the metadata level are intended to let web users know to which class the document may belong. The motivation for deploying label techniques is to group web documents into classes, ideally with significant relationships between the classes also defined, so that web users whose browsers are capable of rating web documents can rapidly access or ignore the documents in the class or classes identified by labels [5]. A rating service will also provide classifications of labels. Such classifications or labels can be defined by web information providers, by some authority (e.g. a rating bureau), or by both.

* The authors' work in this paper is supported by SISU's project Sisu-PICS.