Spectral Learning with Type-2 Fuzzy Numbers for Question/Answering System

Asli Celikyilmaz 1, I. Burhan Turksen 2

1. Computer Sciences Division, University of California, Berkeley, CA, USA
2. TOBB Economy and Technology University, Ankara, Turkey & University of Toronto, Canada

Abstract: Graph-based semi-supervised learning has recently emerged as a promising approach to data-sparse learning problems in natural language processing. Such methods rely on graphs that jointly represent each data point. How best to formulate the graph representation remains an open research topic. In this paper, we introduce a type-2 fuzzy arithmetic to characterize the edge weights of a formed graph as type-2 fuzzy numbers. The fuzzy numbers are identified by the changing parameters of the fuzzy kernel nearest neighbor algorithm, namely the degree of fuzziness and the hyper-parameter of the Gaussian kernel function, both of which affect the uncertainty in forming the affinity matrix of the graph. We introduce a new graph-based semi-supervised learning method that uses type-2 arithmetic operations. We apply this technique in the framework of label propagation and evaluate it on a question answering task. We demonstrate that the type-2 SSL can improve prediction accuracy and can be considered an alternative tool for text mining applications in computational linguistics.

Keywords: Graph-based semi-supervised learning, kernel fuzzy k-nearest neighbor, type-2 fuzzy numbers.

1 Introduction and Motivation

Building reliable models of real systems requires identifying exact values for the variables of the model equations. In real-life practice, precise parameter values may not be obtainable because the information is imprecise, noisy, vague, or incomplete.
Fuzzy logic provides explanatory tools for such tasks, mainly because of its capability to manage imprecise categories and to represent imperfect information by means of fuzzy sets, graduality, measures of resemblance, or aggregation methods. Type-1 fuzzy sets may not be enough to explain the whole spectrum of possible results, mainly because the values used to characterize the membership functions of type-1 fuzzy numbers are usually overly precise. Because the available level of information is usually inadequate for defining membership functions precisely, it becomes necessary to use type-2 fuzzy numbers to represent uncertainties in model parameters.

In this work, we focus mainly on uncertainties in finding similarities in text mining. We consider one of the most commonly used learning methods, semi-supervised learning (SSL) [1]. In machine learning classification problems such as text classification on web pages, automatic translation, or online question/answering systems, one often has to deal with a very small portion of labeled data and vast amounts of unlabeled data. For such cases, graph-based SSL methods (spectral learning methods) have been shown to outperform other learning methods. In graph-based methods the data is represented by the nodes of a graph (Fig. 1), the edges of which are labeled with the pairwise distance of the incident nodes. One problem with spectral learning methods is that the procedure is highly sensitive to the choice of kernel; for example, it is very sensitive to the choice of the spread (variance) of a Gaussian kernel, which naturally affects the similarity matrix defined for the given dataset.

As the phrase "words can mean different things to different people" suggests, an entailment relation between a candidate sentence and a question posed by the user may be evaluated differently by different people.
For instance, different people may assign different degrees of entailment to pairs made up of the question "Who bought Overture?" and candidate sentences such as "Yahoo bought Overture", "Yahoo owns Overture", or "Overture acquisition by Yahoo", using linguistic terms such as strict, loose, or direct entailment. Current methods can only use crisp values to define such relations, which therefore cannot be captured to their full extent. Type-2 fuzzy logic is well suited to defining the entailment relations between sentences. To our knowledge, characterizing the edge weights of a graph as type-2 fuzzy numbers, as presented in this paper, is a new approach. The novel type-2 SSL defines such uncertain entailment relations between two sentences by characterizing soft-linked graphs.

In this paper we concentrate on characterizing the uncertainties in the similarity measure when discovering knowledge from unstructured text using a graph-based SSL algorithm. A common way to construct the affinity matrix of a graph is to apply a nearest neighbor method. We use fuzzy k-nearest neighbor (FKNN) to allow fuzzy decisions based on fuzzy labels. In addition, we use its kernel extension [2] to handle problems that are not linearly separable and to obtain non-linear fuzzy boundaries instead of linear ones when necessary. Kernel methods have also been shown to help prevent over-fitting in high-dimensional feature spaces. For these reasons, we consider applying type-2 fuzzy arithmetic to situations where the similarity between two objects, i.e., two sentences, is imprecise. Thus, the novel type-2 SSL method learns the edge link weights via the kernel fuzzy k-nearest neighbor (KFKNN) algorithm [2]. We use the arithmetic operations on type-2 fuzzy numbers defined in [3].
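As a rough illustration of the kernel fuzzy k-nearest neighbor step mentioned above, the following Python sketch computes fuzzy class memberships for a query point. It assumes the common inverse-distance FKNN membership rule and the distance induced by a Gaussian kernel, d^2 = 2 - 2K(x, x_j); the exact formulation in [2] may differ. The function names, the toy data, and the parameter values (k, m, sigma) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel value between two vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


def kernel_fknn_memberships(x, X_train, U_train, k=3, m=2.0, sigma=1.0, eps=1e-12):
    """Fuzzy class memberships of x from its k nearest neighbours in kernel space.

    X_train: (n, d) array of labelled points.
    U_train: (n, c) array of fuzzy labels, each row summing to 1.
    m:       degree of fuzziness (assumed > 1).
    sigma:   spread of the Gaussian kernel.
    """
    X_train = np.asarray(X_train, dtype=float)
    U_train = np.asarray(U_train, dtype=float)

    # Squared kernel-induced distance: K(x,x) - 2K(x,xj) + K(xj,xj) = 2 - 2K(x,xj)
    k_vals = np.array([gaussian_kernel(x, xj, sigma) for xj in X_train])
    d2 = 2.0 - 2.0 * k_vals

    # Indices of the k nearest labelled points under that distance
    nearest = np.argsort(d2)[:k]

    # Inverse-distance weights: 1 / d^(2/(m-1)) = 1 / (d2)^(1/(m-1))
    w = 1.0 / np.maximum(d2[nearest], eps) ** (1.0 / (m - 1.0))

    # Weighted average of the neighbours' fuzzy labels
    return (w[:, None] * U_train[nearest]).sum(axis=0) / w.sum()


# Toy data (hypothetical): two well-separated classes in 2-D
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
U_train = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
query = [0.2, 0.3]

u = kernel_fknn_memberships(query, X_train, U_train, k=3, m=2.0, sigma=1.0)
print(u)
```

Rerunning the last call with different values of m or sigma changes the resulting memberships; this dependence on the two parameters is the uncertainty that the type-2 edge weights are meant to capture.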
For ease of calculation, we graduate the interval-valued degree of fuzziness and the kernel parameter and obtain bounded, discrete-valued (interval-valued) weights with associated type-2 membership grades, which allows each link weight to be represented by a type-2 fuzzy number. In a way, each membership value is further stretched (see footnote 1) based on the fuzziness of the model to capture the uncertainty interval of the membership values (Fig. 2).

Footnote 1: Zadeh [4] defines the membership values as elastic constraints that have to be stretched to get their full meaning.

ISBN: 978-989-95079-6-8 IFSA-EUSFLAT 2009 1388

Using the interval type-2 fuzzy