GDist-RIA Crawler: A Greedy Distributed Crawler for Rich Internet Applications

Seyed M. Mirtaheri, Gregor V. Bochmann, Guy-Vincent Jourdan¹, and Iosif Viorel Onut²

¹ School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada
staheri@uottawa.ca, gvj@eecs.uottawa.ca, bochmann@eecs.uottawa.ca

² Security AppScan® Enterprise, IBM, 770 Palladium Dr, Ottawa, Ontario, Canada
vioonut@ca.ibm.com

Abstract. Crawling web applications is important for indexing, accessibility and security assessment. Crawling traditional web applications is an old problem, for which good and efficient solutions are known. Crawling Rich Internet Applications (RIAs) quickly and efficiently, however, is an open problem. Technologies such as AJAX and partial Document Object Model (DOM) updates make crawling RIAs even more time-consuming for the web crawler. One way to reduce the time to crawl a RIA is to crawl it in parallel on multiple computers. The previously published Dist-RIA Crawler presents a distributed breadth-first search algorithm to crawl RIAs. This paper extends the Dist-RIA Crawler in two ways. First, it introduces an adaptive load-balancing algorithm that enables the crawler to learn about the speed of the nodes and adapt to changes, and thus better utilize the resources. Second, it presents a distributed greedy algorithm to crawl a RIA in parallel, called the GDist-RIA Crawler. The GDist-RIA Crawler uses a server-client architecture in which the server dispatches crawling jobs to the crawling clients. This paper illustrates a prototype implementation of the GDist-RIA Crawler, explains some of the techniques used to implement the prototype, and inspects empirical performance measurements.

Keywords: Web Crawling, Rich Internet Application, Greedy Algorithm, Load-Balancing

1 Introduction

Crawling is the process of exploring and discovering states of a web application automatically. This problem has a long and interesting history.
Throughout the history of web crawling, the chief focus of web crawlers has been on crawling traditional web applications. In these applications there is a one-to-one correspondence between the state of the web application and its URL. The new generation of web applications, called Rich Internet Applications (RIAs), takes advantage of the availability of powerful client-side web browsers and shifts some part