Finding Similar Academic Web Sites with Links, Bibliometric Couplings and Colinks

Mike Thelwall¹
School of Computing and Information Technology, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1SB, UK.
E-mail: m.thelwall@wlv.ac.uk Tel: +44 1902 321470 Fax: +44 1902 321478

David Wilkinson
School of Computing and Information Technology, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1SB, UK.
E-mail: d.wilkinson@wlv.ac.uk Tel: +44 1902 321452 Fax: +44 1902 321478

A common task in both Webmetrics and Web information retrieval is to identify a set of Web pages or sites that are similar in content. In this paper we assess the extent to which links, colinks and couplings can be used to identify similar Web sites. As an experiment, a random sample of 500 pairs of domains from the UK academic Web was taken, and human assessments of site similarity, based upon content type, were compared against ratings for the three concepts. The results show that using a combination of all three gives the highest probability of identifying similar sites, but surprisingly this was only a marginal improvement over using links alone. Another unexpected result was that high values for either colink counts or couplings were associated with only a small increased likelihood of similarity. The principal advantage of using couplings and colinks was found to be greater coverage, in that a much larger number of pairs of sites were connected by these measures, rather than an increased probability of similarity. In information retrieval terminology, this is improved recall rather than improved precision.

Keywords: Document clustering, web metrics, web information retrieval.

Introduction

A growing area of research in information science is the analysis of Web based documents, often using quantitative techniques (Almind & Ingwersen, 1997; Aguillo, 1998; Cronin, 2001; Borgman & Furner, 2002), commonly known as Webmetrics.
Much of the work in this area is focussed on Web links, motivated by both bibliometrics (Rousseau, 1997; Ingwersen, 1998) and computer science with graph theory (Broder et al., 2000; Björneborn, 2001b). Results have shown that university Web site links are influenced by a combination of geographic (Thelwall, 2002a) and research (Thelwall, 2001a) factors, showing that patterns can be mined from this kind of link data. Very simple mapping techniques have also been applied to visualise the flow of information between national educational systems (Thelwall & Smith, 2002) and other identifiable areas of the Web (Thelwall, 2001b). Colinks (see below) have also been used to map patterns of interlinking between universities in Europe (Polanco et al., 2001).

Information Processing & Management, to appear.

One trend in academic Webmetrics research is to increasingly focus on smaller units of study based around a particular discipline in a