ISSN 1054-6618, Pattern Recognition and Image Analysis, 2007, Vol. 17, No. 3, pp. 390–398. © Pleiades Publishing, Ltd., 2007.

Partial Evaluation Technique for Distributed Image Processing

A. Tchernykh^a, A. Cristóbal-Salas^b, V. Kober^a, and I. A. Ovseevich^c

^a Computer Science Department, CICESE Research Center, Ensenada, BC, Mexico 22830
^b School of Chemistry Sciences and Engineering, University of Baja California, Tijuana, B.C., Mexico 22390
^c Institute for Information Transmission Problems, Russian Academy of Sciences, Bol’shoi Karetnyi 19, Moscow, Russia
e-mail: chernykh@cicese.mx; cristobal@uabc.mx; vitally@iitp.ru; ovseev@iitp.ru

Abstract: In this paper, a partial evaluation technique to reduce the communication costs of distributed image processing is presented. It combines incomplete structures and partial evaluation with classical program optimizations such as constant propagation, loop unrolling, and dead-code elimination. Through a detailed performance analysis, we establish the conditions under which the technique is beneficial.

DOI: 10.1134/S1054661807030054

Received January 9, 2007

1. INTRODUCTION

Typical image processing tasks require a large amount of processing power, more than current state-of-the-art workstations can deliver. Parallel processing on distributed systems appears to be the only way to obtain sufficient processing power for handling all image processing levels. Although the exchange of information remains a critical bottleneck in such applications, which have inherently high communication costs, the message-passing paradigm has grown significantly in popularity with the proliferation of clusters and GRID technology. It appears that optimizing a parallel program is as demanding as designing the parallel algorithm itself.
Communication cost depends on many factors, such as memory manipulation overheads (message preparation, message interpretation) and network communication delays. Reducing this cost is vital to achieving good performance. There are several strategies to minimize it, for instance, overlapping computation and communication, optimizing the network topology, increasing bandwidth, reducing the number of messages (message coalescing, message caching), and message pipelining.

High performance of the fast Fourier transform (FFT) and wavelet transforms is a key issue in many image applications [1]. There have been several attempts to parallelize and optimize the FFT. For example, an algorithm suitable for a 64-processor nCUBE 3200 hypercube multicomputer was presented in [2], where a speedup of up to 16 with 64 processors was demonstrated. FFT implementations for hypercube multicomputers and vector multiprocessors were discussed in [3]. A parallel FFT algorithm for a 64-processor Intel iPSC was described in [4]. Techniques for parallelizing the multidimensional hypercomplex FFT are considered in [5]. FFT implementations based on incomplete data structures were presented in [6]. Speedup factors between 1.83 and 5.05 were achieved on shared-memory computers when the input size N is known. Good speedup is achieved despite the significant growth (N log2 N) of the code size.

Nevertheless, in spite of the reduction in complexity and time, the FFT and wavelet transforms remain expensive, mainly on distributed-memory parallel computers, where network latency significantly affects the performance of the algorithm. In most cases, the speedup gained by parallelization is limited by inter-process communication. That is why programming for distributed architectures has been somewhat restricted to regular, coarse-grained, and computation-intensive applications.
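To make the effect of latency on fine-grained communication concrete, the following is a minimal sketch of the common latency-bandwidth cost model, in which sending a message costs a fixed latency plus its size divided by the bandwidth. The model, the function name `transfer_time`, and the latency and bandwidth figures are illustrative assumptions, not measurements or formulas taken from this paper.

```python
def transfer_time(num_messages, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate the time to move total_bytes split into num_messages messages.

    Assumed model: each message pays a fixed latency, and the payload
    as a whole pays total_bytes / bandwidth.
    """
    return num_messages * latency_s + total_bytes / bandwidth_bytes_per_s


# Hypothetical network parameters (assumptions for illustration only).
LATENCY = 50e-6      # 50 microseconds per message
BANDWIDTH = 100e6    # 100 MB/s

# 1024 values of 8 bytes each, e.g. one exchange stage of a transform.
payload = 1024 * 8

# Fine-grained: one message per value.
fine = transfer_time(1024, payload, LATENCY, BANDWIDTH)

# Coalesced: the same payload packed into a single message.
coalesced = transfer_time(1, payload, LATENCY, BANDWIDTH)

print(f"fine-grained: {fine:.6f} s, coalesced: {coalesced:.6f} s")
```

Under this model the payload term is identical in both cases, so the whole difference is the latency paid for the extra messages. This is why reducing the number of messages matters most for fine-grained algorithms.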
The FFT exploits fine-grain parallelism, which means that an improvement at the communication level plays an extremely important role.

The aim of this work is to demonstrate that the performance benefits of incomplete information processing and partial evaluation can be brought to distributed image processing in a high-level manner that is transparent to users. To this end, partial evaluation is used not only to remove much of the excess overhead of sequential and shared-memory parallel computation, but also, in distributed applications, to reduce the number of messages when some information about the image or part of the image is known. The work demonstrates that good parallel speedups are attainable using MPI and that the technique can be integrated into existing distributed environments. Optimization of the 1D fast Fourier transform and the 2D Haar wavelet transform is presented.

The rest of the paper is organized as follows. In Section 2, a general description of the proposed optimization technique is presented. Partial evaluation, incomplete data structures and incomplete data structures

IMAGE PROCESSING, ANALYSIS, RECOGNITION, AND UNDERSTANDING