Storytelling and Clustering for Cellular Signaling Pathways

M. Shahriar Hossain, Monika Akbar, Pramodh Pochu, Venkata Sesha Sanagavarapu, Nicholas F. Polys
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA.
{msh, amonika, ppramodh, ssanagav}@cs.vt.edu, npolys@vt.edu

ABSTRACT
In this project, we concentrate on discovering relationships between cellular signaling pathways that are organized as connection maps in the STKE dataset [1]. Signaling pathways are relations between proteins that transform cellular signals into appropriate biological responses. We observe that a relation between two components of a signal can appear in more than one pathway, and that such shared relations might help biologists identify new phenomena. We develop a tool that helps biologists discover relationships between pathways based on the structural overlaps among the pathways themselves or with their neighboring pathways. We address the problem of determining pathway relationships with two data mining approaches: clustering and storytelling. In the first approach, our tool places similar pathways in the same cluster. In the second, the tool determines intermediate overlapping pathways that might help biologists uncover new relationships between the pathways. We cast the problem of discovering pathway relationships as a subgraph discovery problem and propose a new technique called Subgraph-Extension Generation (SEG) that outperforms the traditional FSG [2] approach by orders of magnitude. The developed tool also provides an interface to compare these two approaches, along with a variety of similarity measures and clustering techniques, in terms of runtime and memory consumption.

Keywords
Apriori, Cellular Signaling Pathway, Clustering, Storytelling, Subgraph-Extension Generation, FSG.

1. INTRODUCTION
A cellular signaling pathway contains a set of molecules that interact with each other through signals and convey information, generally from outside the cell to its interior [3].
The Signal Transduction Knowledge Environment (STKE) dataset covers signal transduction in biology, allowing a study of how cells interact with each other through chemical signals [1]. Scientists all over the world have documented different cell signaling pathways over time. Still, relationships between components of different pathways may remain undiscovered because no suitable tool exists to analyze the overlaps between the pathways already documented. The goal of our project is to build a tool that can discover probable initial relationships between pathways using graph mining approaches. The resulting pathway relationships would help biologists analyze the pathways and discover new relationships between them.

In this project, we examine different algorithms for mining the frequent subgraphs that represent the commonality of the signaling pathways. The discovered frequent subgraphs are then used to calculate the similarity between every pair of pathways. We rely on the discovered subgraphs to cluster pathways or to discover a story between a pair of pathways. We have developed an interactive tool with which users can control the parameters at different phases of the pipeline. Additionally, we provide a runtime and memory consumption analysis for each of the algorithms used in this project.

In this work, we propose a graph-based storytelling approach that is similar to the text-based storytelling described by Kumar et al. [4]. Graph-based storytelling is more robust than the text-based approach because text is sometimes misleading and can generate meaningless stories. In our graph-based storytelling approach, we ensure that consecutive pathways of a generated story share overlapping signals, which the text-based storytelling algorithm does not guarantee. As a result, our algorithm is less likely than text-based storytelling to generate misleading or meaningless stories.
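To make the overlap guarantee concrete, the following is a minimal sketch and not the implementation used in our tool. It assumes each pathway is a set of directed relations, uses shared single relations as a stand-in for mined frequent subgraphs, and uses the Jaccard coefficient as one possible similarity measure. The pathway names, the data, and the functions `jaccard`, `pairwise_similarity`, and `find_story` are all hypothetical.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy data: each pathway is a set of directed relations
# (source component, target component). Not taken from the STKE dataset.
PATHWAYS = {
    "P1": {("A", "B"), ("B", "C")},
    "P2": {("B", "C"), ("C", "D")},
    "P3": {("C", "D"), ("D", "E")},
    "P4": {("X", "Y")},
}


def jaccard(edges_a, edges_b):
    """Jaccard similarity of two relation sets (assumed similarity measure)."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 0.0


def pairwise_similarity(pathways):
    """Similarity for every unordered pair of pathways, keyed by sorted names."""
    return {
        (a, b): jaccard(pathways[a], pathways[b])
        for a, b in combinations(sorted(pathways), 2)
    }


def find_story(pathways, start, goal, min_shared=1):
    """Breadth-first search for a shortest chain of pathways from start to goal.

    Consecutive pathways in the chain must share at least `min_shared`
    relations. Returns the chain as a list of names, or None if none exists.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the parent links back to the start, then reverse.
            story = []
            while current is not None:
                story.append(current)
                current = parents[current]
            return story[::-1]
        for candidate in sorted(pathways):
            if candidate in parents:
                continue  # already reached by a chain that is no longer
            if len(pathways[current] & pathways[candidate]) >= min_shared:
                parents[candidate] = current
                queue.append(candidate)
    return None


if __name__ == "__main__":
    print(pairwise_similarity(PATHWAYS))
    print(find_story(PATHWAYS, "P1", "P3"))
    print(find_story(PATHWAYS, "P1", "P4"))
```

In this toy data, P1 and P3 share no relation directly, so any story between them must pass through an intermediate pathway that overlaps with both, while P4 overlaps with nothing and cannot be reached. Breadth-first search is chosen here only because it yields a shortest chain; other search strategies would satisfy the same overlap constraint.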
The rest of the report is organized as follows. Section 2 describes related work. We describe the overall design in Section 3. Some illustrative experimental results are described in Section 4. We conclude this report in Section 5.

2. LITERATURE REVIEW
Several existing tools help biologists visualize and analyze signaling pathways. PathCase [5] presents a way to visualize signaling pathways as nested graphs and employs four abstraction levels to cope with the visual complexity of signaling pathways. It also introduces a Gene Ontology (GO) based functional visualization of pathways. Xu et al. [6] design a model for the WNT signaling pathway [7], a gene regulation route in the living cells of various organisms. They use Maude, an interpreter [8], to implement the model and to verify some of its key properties. Their system does not provide any automated approach to discover relations between pathways.

The BioPath system [9] compares metabolic pathways of different organisms using color codes. Each organism is associated with a different color, and similarities are indicated by a mixture of the respective colors. The mixed colors make these similarities difficult to identify. Moreover, the dissimilarities are not highlighted. Schreiber [10] presents a constraint graph drawing algorithm for visually comparing metabolic pathways of different species. In this system, similar parts of the similar pathways of different species are