Soft Semantic Web Intelligence

Eduardo Ramirez and Ramon Brena
Tecnologico de Monterrey

Abstract: In the context of recent efforts to make many internet-based applications “intelligent”, and thus more usable, the importance of automated meaning handling has been widely acknowledged. This is indeed the rationale behind initiatives like the so-called “Semantic Web”, which proposes the widespread use of “ontologies” to define concepts and to make online documentation self-describing. Nevertheless, several practical problems have hindered the Semantic Web from becoming mainstream. In this paper we propose a way to make explicit the underlying semantic structure of the ordinary (HTML-based) web by measuring joint keyword occurrences in web pages, organized around our notion of “Semantic Contexts”. The resulting infrastructure characterizes the contextual semantic information available on the internet. Semantic Contexts can have many practical uses, such as focusing internet searches so that results are much more relevant to users than those of current search engines, as we show in this paper. Further, we propose to make Semantic Contexts publicly available online, so that many internet applications can become semantic-aware in a simpler way than by using the currently marginal Semantic Web.

I. INTRODUCTION

The internet is acknowledged as one of the great technological revolutions of our time. Since its inception in the early 90s, the WWW has grown exponentially, reaching some 74.5 million websites [1], with at least 11.5 billion pages indexed by the main web search engines [2]. Nevertheless, web pages normally have the limitation of not taking into account the meaning or the context of the information they contain, but just its formatting.
This is indeed a serious limitation, because many internet applications could be much more usable if they were “semantic-aware”, that is, able to understand what the available information is about, how it relates to other information, and how it needs to be processed in order to be useful to human users. Issues of this kind show that additional “intelligence” is needed in internet applications, so fields like “web intelligence” [3] have been created and have received considerable attention.

For instance, one very important issue is to determine what a given web page is about. Much work has been done on categorizing texts [4], but the problem of determining the categories in the first place remains a source of incompatibilities and interoperability troubles, as there is no agreed “universal” classification. Nevertheless, the topic of a web page is indeed important because, for example, each web search is done with a specific context [5] in mind, and results outside that context are seen by the user as irrelevant. Indeed, every internet user is confronted with the inconvenience of receiving many irrelevant pages from search engines, due to the inability of search engines to contextualize keywords in meaningful concepts, areas, themes, etc.

The need for automatically taking meaning into account has been the rationale behind the “Semantic Web” [6], [7] initiative, which proposes markup languages, mainly based on XML [8], and intends to develop technologies for defining and using concepts and the relations among them in so-called “ontologies” [9]. Despite the great potential of the Semantic Web to give semantic intelligence to internet applications, many issues hinder its widespread use, for instance the arbitrary nature of human-defined ontologies, which may define the same concept in different ways, and the consequent need to align such equivalent definitions [10].
Another big question is how to relate the explicit semantics of Semantic Web ontologies to the implicit semantics of existing web pages. The basic attitude of the Semantic Web community, which regards the standard web as semantically hopeless, does not help to solve this issue.

In this paper we consider that the implicit semantic content of current web pages can be made explicit, at least in statistical terms, by relying on the large number of web pages that currently exist on the internet. We take the joint frequencies of keywords as representative of which concepts are implicitly semantically related in the existing internet, not in an ideal or futuristic one. Further, we define a semantic structure in the form of a collection of keyword clusters that we call “Semantic Contexts”. We also present some practical applications, in particular how to better focus web searches, and we present some tentative ideas on how to leverage Semantic Web definitions with the support of our infrastructure.

If this infrastructure is made explicit and publicly available, it could be used as a universal semantic reference for establishing topics and classifications, and for allowing many internet applications to be “semantic-aware” without the need for “classic” Semantic Web technology. Further, by allowing human users to attach Semantic Web markup content to our “Semantic Contexts” in a “wiki” style [11], the Semantic Web could be leveraged to relate to the standard web and serve it, instead of remaining marginal for many years.

After this introduction, the next section gives a technical presentation of our method, followed by an application to internet search. We then present some experimental results, followed by some ideas for relating our proposal to the Semantic Web, a discussion and comparison with related work, and finally a conclusion.
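
As a rough illustration of what "joint frequencies of keywords" can mean in practice, the following minimal sketch counts, for each pair of keywords, the number of pages in which both appear. It is not the method presented in the next section: the function name `joint_keyword_counts`, the representation of a page as a plain list of keywords, and the toy data are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def joint_keyword_counts(pages):
    """Count, for every unordered keyword pair, how many pages contain both.

    `pages` is assumed to be an iterable of keyword lists, one per page
    (a hypothetical input format chosen for this sketch).
    """
    counts = Counter()
    for keywords in pages:
        # Deduplicate and sort so each pair has a single canonical ordering.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return counts


# Toy data (invented): the keyword "java" appears in two different contexts.
pages = [
    ["java", "island", "indonesia"],
    ["java", "programming", "compiler"],
    ["java", "programming", "class"],
]
counts = joint_keyword_counts(pages)
```

In this toy collection the pair ("java", "programming") co-occurs in two pages and ("island", "java") in one, while "island" and "programming" never appear together. Co-occurrence counts of this kind are the sort of statistical signal from which keyword clusters could be derived.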