On Building a Quantitative Food-Disease-Gene Network

* Corresponding author. ‡: Primary student contributors.

Abstract

Nutritional genomics is a new science that studies the relationships among foods (or nutrients), diseases, and genes. A large number of scientific findings have been published in this area, primarily as unstructured text. Moreover, given a pair of entities, different studies can report different findings. It is hence important to obtain a holistic view of the reported relationships. In this article, we describe an information extraction system that aims to reach this goal. The system integrates natural language processing techniques, domain ontology, and statistical and machine learning methods. It consists of four main modules: (1) entity extraction, which recognizes and extracts five types of entities: foods, chemicals (or nutrients), diseases, proteins, and genes; (2) relationship extraction, which extracts binary relationships between entities; (3) relationship polarity analysis, which categorizes relationships into three groups: positive, negative, and neutral; and (4) strength analysis, which rates a relationship as weak, medium, or strong. To the best of our knowledge, we are the first to propose analyzing the polarity and strength of a binary relationship. We have evaluated our system using the GENIA corpus and datasets drawn from the MEDLINE database. The first two modules outperform the best reported results, with average F-scores of 0.89 and 0.82, respectively, while the last two also achieve promising results, with accuracies of 0.75-0.84 and ~0.90, respectively.

Key words: nutritional genomics, text mining, relationship extraction, relationship polarity, relationship strength

1 INTRODUCTION

Advances in biotechnology and the life sciences are leading to an ever-increasing volume of published research data, predominantly in unstructured text (or natural language).
At the time of writing, the MEDLINE database consists of 19 million scientific articles with a growth rate of ~400,000 articles per year [8]. This phenomenon is even more apparent in nutritional genomics, an emerging science that studies the relationships among foods (or nutrients), diseases, and genes [16]. For instance, soy products and green tea are two intensively studied foods in this new discipline due to their controversial relationship with cancer. A search of the MEDLINE database on “soy and cancer” returns a total of 1,287 articles, and a search on “green tea and cancer” returns 1,318 articles. Given the large number of publications every year, it is unrealistic for even the most motivated reader to manually go through these articles to obtain a full picture of the findings reported to date. Obtaining such a picture, however, has become ever more important and necessary for two reasons. First, given a pair of entities, e.g., green tea and cancer, different studies might report different findings with respect to their relationship. For example, Sonn et al. suggest that “green tea is beneficial to the treatment of cancer” [25], whereas Sauvaget et al. conclude that these two are not related [23]. In other words, a relationship can be positive (good), negative (bad), or neutral. We term this the relationship polarity. Second, even if different studies agree with each other on the relationship polarity between two entities, they may report it with a different level of decisiveness. For example, one study suggests that “soy intake … may protect against breast cancer …” [19], while another study indicates “soy intake is believed to be an essential factor for the incidence of hormone-dependent tumors (e.g., breast cancer) …” [28]. The latter is more decisive than the former. We term the decisiveness of a relationship the relationship strength.
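To make the two notions concrete, the following minimal Python sketch shows one way a reported relationship could be recorded with a polarity and a strength label, and how conflicting reports on the same entity pair could be tallied. It is an illustration only, not the system described in this article: the names `Finding` and `summarize` and the record layout are hypothetical, and the strength labels in the sample data are placeholders that are not taken from the cited studies.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Polarity(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


class Strength(Enum):
    WEAK = "weak"
    MEDIUM = "medium"
    STRONG = "strong"


@dataclass(frozen=True)
class Finding:
    """One reported relationship between two entities (hypothetical layout)."""
    entity_a: str
    entity_b: str
    polarity: Polarity
    strength: Strength
    source: str  # citation key of the reporting study


def summarize(findings):
    """Count (polarity, strength) labels for each unordered entity pair."""
    summary = {}
    for f in findings:
        pair = tuple(sorted((f.entity_a, f.entity_b)))
        summary.setdefault(pair, Counter())[(f.polarity, f.strength)] += 1
    return summary


# Polarities follow the green tea example in the text; the strength
# labels are placeholders chosen for illustration only.
findings = [
    Finding("green tea", "cancer", Polarity.POSITIVE, Strength.MEDIUM, "[25]"),
    Finding("green tea", "cancer", Polarity.NEUTRAL, Strength.MEDIUM, "[23]"),
]

for pair, counts in summarize(findings).items():
    for (polarity, strength), n in counts.items():
        print(pair, polarity.value, strength.value, n)
```

Keeping one record per reported finding, rather than one per entity pair, preserves the disagreement between studies, which is what a holistic view of the literature needs to expose.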
In this article, we propose to develop an information extraction system that automatically (1) extracts the binary relationships between foods, diseases, and genes; and (2) analyzes the polarity and strength of these relationships. The long-term goal is to build food-disease-gene networks that statistically quantify the various relationships reported in nutritional genomics. To reach this goal, one needs to first accurately recognize and extract the terms that describe foods, diseases, and genes. In addition, we also need to recognize chemicals (nutrients) in foods and proteins. This is because: (1) given a whole food (e.g., soy), scientific research often focuses on understanding how different organic compounds contained within the food (e.g., genistein in soy) impact certain diseases (e.g., breast cancer); and (2) genes and their protein products are often used interchangeably in practice. Past efforts on this task, termed Named Entity Recognition (NER), have focused on genes or proteins. (See the reviews by Cohen and Skusa et al. [2] [24] for a list of works in this area.) The best reported F-scores on NER are generally between 0.75 and 0.85 [24]. We adopt an approach that utilizes domain ontology, statistics, and syntactic information for this task and achieve an average F-score of 0.89.

Abhishek Sharma‡, Dept of Computer Science, San Francisco State Univ., San Francisco, CA, 94132, asharma1@sfsu.edu
Hui Yang*, Dept of Computer Science, San Francisco State Univ., San Francisco, CA, 94132, huiyang@sfsu.edu
Rajesh Swaminathan‡, Dept of Computer Science, San Francisco State Univ., San Francisco, CA, 94132, rajeshs@sfsu.edu
Vilas Ketkar, Dept of Computer Science, San Francisco State Univ., San Francisco, CA, 94132, vilask@sfsu.edu