Average Vibrational Potentials of Oscillators in Condensed-matter Environments using Hadoop

Bojana Koteska, Anastas Mishev
Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovikj 16, P.O. Box 393, 1000 Skopje, Republic of Macedonia
Email: {bojana.koteska, anastas.mishev}@finki.ukim.mk

Ljupčo Pejov
Institute of Chemistry, Faculty of Natural Sciences and Mathematics, Ss. Cyril and Methodius University, Arhimedova 5, P.O. Box 162, 1000 Skopje, Republic of Macedonia
Email: ljupcop@iunona.pmf.ukim.edu.mk

Abstract—In the physical sciences, when condensed-matter systems (e.g. solids or liquids) are modeled with explicit inclusion of dynamical effects, the following computational problem often arises. A given property of an atomic or molecular system embedded in a condensed phase must either be computed at many possible structural arrangements and then averaged over configurations, or, alternatively, an averaged configuration of the dynamical surroundings that the system experiences can be generated and the property of interest computed at that configuration. Computing the average vibrational potentials of a large number of oscillators in various condensed-matter environments (sampled from a statistical physics simulation) falls into the category of problems with large data sets. In this paper, distributed and parallel processing of the large data sets needed to generate the averaged vibrational potential is performed efficiently using the MapReduce programming model and the Hadoop software library. Hadoop was chosen for several reasons: it can work on pieces of the data in parallel; the computing solutions it enables are scalable and flexible; its distributed file system enables rapid data transfer among nodes; and it is fault-tolerant, meaning that if a node fails the job is redirected to another node.
The main goal of this paper is to process efficiently the large data sets used in scientific applications.

Index Terms—Hadoop, Average vibrational potentials, Anharmonic oscillator, Condensed-matter environments, Schrödinger equation

I. INTRODUCTION

Theoretical models in the physical sciences are often used to understand the experimentally observed behavior of certain physical systems, or to predict their behavior under specific circumstances relevant to their actual or potential technological applications. Besides giving a more enlightening view of a system's behavior, theoretical models may be quite useful in discriminating among the various factors that lead to the observation of certain physical phenomena, or in quantifying the contribution of each factor to a given physical observable.

Most experimental data, however, are collected at finite temperatures, usually well above absolute zero. A reliable theoretical model that aims to provide a realistic description of the system in question therefore has to account for dynamical effects on a certain time-scale. Most models based on a quantum mechanical description of many-particle physical systems rely on explorations of the potential energy hypersurfaces (or certain cuts through these surfaces), which means that they do not meet this criterion. To include the dynamical behavior of the studied quantum system explicitly, one has to treat it within the framework of quantum dynamics. However, a fully exact quantum dynamical treatment of multi-particle systems is prohibitively expensive computationally. Fortunately, such a full quantum dynamical treatment is mandatory only in certain specific cases, usually when the study focuses on light particles (such as hydrogen atoms).
An acceptable alternative, which has been exploited to some extent in the literature, is to first carry out a classical dynamics (or statistical physics, e.g. Monte Carlo) simulation of the time evolution (or evolution in imaginary time) of the system in question, then to select a reasonably small number of configurations (snapshots from the classical simulation) and perform rigorous quantum mechanical calculations only on these configurations. Although these dynamical simulations are classical in the rigorous sense, note that the interaction potentials used throughout the simulations may themselves be derived from high-level quantum mechanical calculations.

II. RELATED WORK

Several papers have used the MapReduce paradigm to solve problems in the scientific domain. In [1], the authors applied the MapReduce model to High Energy Physics data analyses and K-means clustering. They also developed a streaming-based MapReduce implementation and compared its performance with Hadoop. They conclude that most scientific analyses that have some form of SPMD (Single Program, Multiple Data) algorithm can benefit from the MapReduce model and can achieve scalability and speedup. In [2], the authors present the MapReduce implementation at Google Inc. The implementation is highly scalable and processes terabytes of data on thousands of machines. Also, upwards of one thousand MapReduce jobs are executed on