Exploiting Statistical and Relational Information on the Web and in Social Media

Lise Getoor & Lilyana Mihalkova
Department of Computer Science
University of Maryland
College Park, MD 20742

Abstract

This tutorial will provide an overview of statistical relational learning and inference techniques, motivating and illustrating them using web and social media applications. We will start by briefly surveying some of the sources of statistical and relational information on the web and in social media and will then dedicate most of the tutorial time to an introduction to representations and techniques for learning and reasoning with multi-relational information, viewing them through the lens of web and social media domains. We will end with a discussion of current trends and related fields, such as privacy in social networks.

1 Tutorial Overview

The growing popularity of Web 2.0, characterized by a proliferation of social media sites, and Web 3.0, with more richly semantically annotated objects and relationships, brings to light a variety of important prediction, ranking and extraction tasks. The input to these tasks is often best seen as some type of (noisy) multi-relational graph, which can be constructed in a variety of ways. For example, users’ behaviors on the Web define a graph, the click graph, which can be viewed as a bipartite graph with queries and URLs, where the edges are weighted by the number of times a user clicked on a URL given a particular query. Social media sites often capture more semantically rich relationships such as friendship and affiliations. In addition to clicking, there are also many actions that users can take such as making a posting, asking a question, rating an item, purchasing an item and so on. The common tasks that we want to perform in these situations typically depend both on the relational information contained in the graph, and also on statistics that are computed over the graph.
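As a concrete illustration of the click graph described above, the following is a minimal sketch that builds a weighted bipartite query-URL graph from a list of click events and then computes one simple statistic over it. The function names, the log format, and the example queries and URLs are hypothetical choices made for illustration; they are not taken from the tutorial.

```python
from collections import defaultdict


def build_click_graph(click_log):
    """Build a weighted bipartite click graph.

    click_log is an iterable of (query, url) pairs, one per click
    (an assumed log format). The result maps each query to a dict of
    URL -> number of clicks on that URL for that query.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for query, url in click_log:
        graph[query][url] += 1
    return graph


def click_distribution(graph, query):
    """Normalize the edge weights of one query into click frequencies."""
    edges = graph.get(query, {})
    total = sum(edges.values())
    if total == 0:
        return {}
    return {url: count / total for url, count in edges.items()}


# Hypothetical click events, invented for this example.
click_log = [
    ("srl tutorial", "example.org/srl"),
    ("srl tutorial", "example.org/srl"),
    ("srl tutorial", "example.org/ml"),
    ("graph models", "example.org/ml"),
]

click_graph = build_click_graph(click_log)
distribution = click_distribution(click_graph, "srl tutorial")
```

The nested dictionary holds the relational information (which queries link to which URLs), while the normalized frequencies are one example of a statistic computed over the graph.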
In the first part of this tutorial, we will briefly survey several common Web applications, and show how they can be described as reasoning over multi-relational graphs. In the second part of the tutorial, we will describe techniques developed in the field of statistical relational learning (SRL). SRL addresses the challenge of learning from multi-relational data and representing the acquired knowledge in a form that allows for probabilistic inference. Two of the hallmarks of SRL approaches are, first, the ability to learn, represent, and leverage the rich sets of relations present among the entities, and, second, the ability to perform effective learning and reasoning in the presence of a considerable amount of noise and uncertainty. We argue in favor of using SRL techniques for Web problems by pointing out that these two main characteristics of SRL problems also typify problems that arise on the Web and discuss in detail several successful SRL applications on the Web.

2 Content

The tutorial web page can be found at http://www.cs.umd.edu/projects/linqs/Tutorials/SRL-Web-SDM11/Home.html.

Part 1: Overview of Statistical and Relational Information on the Web. In the first part of the tutorial, we will provide a brief overview of some sources of relational information on the web.

Part 2: Toolkit. This is the main part of the tutorial. It will provide a survey of statistical relational learning and inference techniques, motivating and illustrating them from the point of view of web applications. This part will be structured as follows. First, we will discuss “flat” relational models in which relational information has been flattened by computing aggregates over relations, thus allowing training instances to be described by fixed-length feature vectors. We will give examples of web applications that follow this approach, e.g., [18].
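The flattening step can be sketched as follows. This is only an illustrative example under assumed data: the user names, the friendship and rating relations, and the choice of aggregates (counts and a mean) are hypothetical and are not the features used in [18].

```python
from statistics import mean


def flatten_user(user, friends, ratings):
    """Summarize a user's relations as a fixed-length feature vector.

    friends maps a user to a list of friends; ratings maps a user to a
    list of numeric ratings the user has given (both assumed formats).
    Returns [number of friends, number of ratings, mean rating].
    """
    user_friends = friends.get(user, [])
    user_ratings = ratings.get(user, [])
    return [
        len(user_friends),
        len(user_ratings),
        # Use 0.0 as a placeholder when the user has given no ratings.
        mean(user_ratings) if user_ratings else 0.0,
    ]


# Hypothetical relational data, invented for this example.
friends = {"ann": ["bob", "cat"], "bob": ["ann"], "cat": []}
ratings = {"ann": [4, 5], "bob": [2]}

features = {user: flatten_user(user, friends, ratings) for user in friends}
```

Every user ends up with a vector of the same length regardless of how many friends or ratings they have, so the vectors can be passed to any standard classifier. The cost is that the identity of the related entities is lost in the aggregation.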
Second, we will present collective models in which the labels of related instances are predicted collectively, thus allowing for assignments to be coordinated and some mistakes fixed. Here we will discuss two collective approaches: the ICA algorithm [13, 11] and graphical models. The discussion of graphical models will lead