Identification of Negated Regulation Events in the Literature: Exploring the Feature Space

Farzaneh Sarafraz 1, Goran Nenadic 1,2

1 School of Computer Science, University of Manchester, Manchester, UK
2 Manchester Interdisciplinary BioCentre, University of Manchester, UK

Email addresses: FS: sarafraf@cs.man.ac.uk; GN: g.nenadic@manchester.ac.uk

Abstract

Background. Regulation events are of critical importance to researchers trying to understand processes in living beings. These events are naturally complex and can involve both individual molecular entities and other biomedical events. Of equal importance is the ability to capture statements that refer to regulation events that do not take place. In this paper we explore the identification of negated regulation events in the literature using a number of features.

Results. We construe the problem as a classification task and apply support vector machines that use lexical, syntactic and semantic features associated with sentences that represent events. Lexical features include negation cues, part-of-speech tags and surface distances, whereas syntactic features are engineered from constituency parse trees, the command relation between constituents and parse-tree distances. Semantic features include event sub-type and participant types. On a test dataset, the best precision was achieved by combining all features, while ignoring surface-level distances resulted in the best recall. Overall, the best F-measure was 54%.

Conclusions. Syntactic features proved useful for improving recall, whereas semantic features proved useful for improving precision, demonstrating the potential and limits of task-specific feature engineering for negation detection. Contrasting statements are frequently used to express negated events, and many false negatives were due to not capturing those events.

Background

Several efforts have recently been initiated in the text mining community that focus on the extraction of structured information about biomedical relations and events, including protein-protein interactions, gene expression, etc. [1, 2]. These efforts aim to support both data consolidation (population of curated databases) and knowledge exploration (e.g. hypothesis generation) [3, 4]. A topic that has been of particular interest in biology and medicine is the investigation of gene regulatory networks, which are of critical importance to researchers trying to understand regulatory mechanisms in living beings. A number of databases have been developed to store knowledge about gene regulation in various model organisms (e.g. RegulonDB with regulation information in E. coli [5]), but populating such databases has proved challenging given the pace of publications and the complexity of the events.

Regulatory events are particularly complex as their participants can be either entities (e.g. a protein) or other events. Regulation events can therefore be recursively nested and, given that regulations can be positive (facilitating a particular process) or negative (inhibiting a particular event), they typically require complex linguistic expressions to report and explain regulation findings. In addition to affirmative findings, a number of events are also reported as negated (e.g. "However, NFATc.beta neither bound to the kappa3 element (an NFAT-binding site) in the tumor necrosis factor-alpha promoter nor activated the tumor necrosis factor-alpha promoter in cotransfection assays"). Detection of