Performance of Electronic Structure Calculations on Blue Gene and Cray XT4 Computers

Andrey Asadchev*, Veerendra Allada, Mark S. Gordon, Brett Bode
Iowa State University, Ames Laboratory
andrey@si.msg.chem.iastate.edu, allada@scl.ameslab.gov, mark@si.msg.chem.iastate.edu, brett@scl.ameslab.gov

Creating Materials and Energy Solutions

Abstract

Since our initial efforts of porting and exploring the performance of the General Atomic and Molecular Electronic Structure System [1,2] (GAMESS) on the Blue Gene/L and Cray XT4, several key improvements have been made to the Hartree-Fock (HF) and diagonalization codes. However, the performance of the 4-index transformation, central to several post-HF methods, was found to be lacking. A new 4-index transformation code has been implemented to make better use of modern massively parallel platforms (MPP). The key features of the new implementation are the use of BLAS2 and BLAS3 subroutines in computational kernels, remote memory access (RMA) and collective operations to maximize data locality, and scalability with respect to local and distributed memory.

1. Introduction

The result of a typical HF calculation is a reasonable approximation of a molecular geometry. However, the energies obtained during the HF step generally need to be refined by a post-HF method, such as second-order perturbation theory (MP2). Despite the availability of high-accuracy approximations to the MP2 method, such as the Resolution of the Identity (RI), traditional MP2 is still widely used. While the two-electron (2e) integrals are computed in terms of atomic orbitals (AO), MP2 theory is formulated in terms of molecular orbitals (MO), requiring a transformation from AOs to MOs. This transformation is known as the 4-index transformation; it scales as N^5 computationally and N^4 memory-wise, presenting a computational challenge suitable for a supercomputer.
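To make the N^5 compute and N^4 memory scaling concrete, a back-of-the-envelope estimate in Python (the orbital count N used here is illustrative, not taken from the paper):

```python
# Rough scaling estimate for the 4-index transformation.
# N: number of atomic orbitals (illustrative value, not from the paper).
N = 1000

# Performed as four successive quarter transformations, each index
# contraction costs O(N^5) multiply-adds, versus O(N^8) for a single
# direct contraction of all four indices at once.
flops_staged = 4 * N**5      # four quarter transformations
flops_naive = N**8           # direct four-index contraction

# Storing the full N^4 integral tensor in double precision:
bytes_per_double = 8
memory_bytes = N**4 * bytes_per_double

print(f"staged flops: {flops_staged:.2e}")
print(f"naive flops:  {flops_naive:.2e}")
print(f"N^4 storage:  {memory_bytes / 2**40:.1f} TiB")
```

Even at a modest 1000 AOs, the untransformed N^4 tensor occupies terabytes, which is why the paper's algorithm relies on blocking, symmetry, and distributed memory rather than ever materializing it whole.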
Besides the MP2 method, the 4-index transformation is central to other methods, e.g. Multi-Configurational Self-Consistent Field (MCSCF).

2. Theory

Let i, j be any of the N_a active occupied orbitals and a, b be any of the N_v virtual orbitals. Let p, q, r, s be any of the N_AO atomic orbitals. Then the second-order perturbation energy correction is given by:

    E_MP2 = Σ_ij Σ_ab ⟨ij|ab⟩ (2⟨ij|ab⟩ − ⟨ij|ba⟩) / (ε_i + ε_j − ε_a − ε_b)    (1)

    ⟨ij|ab⟩ = Σ_p Σ_q Σ_r Σ_s C_ip C_jq C_ar C_bs ⟨pq|rs⟩    (2)

where ⟨pq|rs⟩ is a 4-center 2e integral, and C and ε are the coefficient matrix and orbital energies obtained during the HF step.

For computational efficiency, the integrals are typically evaluated in terms of S, P, D, ... shells, where an S shell has 1 AO, a P shell has 3 AOs, etc.

3. Supercomputer Challenge

The defining features of BG/L and XT4 are:

• High processor count, 1024 or greater for BG/L.
• Modest local memory, 512 MB on BG/L and 8 GB on XT4.
• Large aggregate memory suitable for distributed arrays.
• High-performance, low-latency networks; dedicated networks for collective operations.
• Lack of local disks; I/O over the network.

Thus a successful theory implementation requires a scalable algorithm, especially in terms of local memory. The relatively modest memory of individual nodes and the high processor count require algorithms which are not limited by the memory.

The memory required by an MP2 energy calculation scales as N_v N_AO N_a(N_a + 1)/2. During the 4-index transformation there is an additional overhead due to work arrays. Thus, memory overhead is a significant issue with the classical MP2 method.

The existing MP2 implementations [3,4] in GAMESS are not directly suitable to run on MPP systems efficiently, as they suffer from excessive disk I/O and poor data locality, and do not scale well memory-wise. With the existing implementations the memory runs out well before a problem can be scaled efficiently across thousands of processors.
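Eq. (2) is never evaluated as a single O(N^8) contraction; it is factored into four successive quarter transformations, each O(N^5). A minimal numpy sketch of that factorization (toy sizes and random data; the order of the four contractions here is i, j, a, b, whereas the paper's algorithm interleaves occupied and virtual indices and exploits permutational symmetry, shell blocking, and integral screening):

```python
import numpy as np

# Toy dimensions (illustrative, not from the paper).
n_ao, n_act, n_virt = 6, 2, 3

rng = np.random.default_rng(0)
eri = rng.standard_normal((n_ao,) * 4)        # <pq|rs> in the AO basis
C_occ = rng.standard_normal((n_act, n_ao))    # C_ip, active occupied rows
C_virt = rng.standard_normal((n_virt, n_ao))  # C_ar, virtual rows

# Four successive index contractions; each step maps onto a matrix
# multiply (DGEMM) in the actual implementation.
t1 = np.einsum('ip,pqrs->iqrs', C_occ, eri)    # 1st index
t2 = np.einsum('jq,iqrs->ijrs', C_occ, t1)     # 2nd index
t3 = np.einsum('ar,ijrs->ijas', C_virt, t2)    # 3rd index
mo = np.einsum('bs,ijas->ijab', C_virt, t3)    # 4th index

# Check against the direct (O(N^8)) contraction of Eq. (2).
direct = np.einsum('ip,jq,ar,bs,pqrs->ijab',
                   C_occ, C_occ, C_virt, C_virt, eri, optimize=False)
assert np.allclose(mo, direct)
```

The staging is what makes the memory question hard: the intermediates t1..t3 are what the paper's blocked, distributed algorithm keeps small and local.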
For example, a larger calculation with 1500 AOs and 250 MOs would require a single Blue Gene/L rack just to satisfy the distributed memory requirements.

4. MP2 Algorithm

    Index and block shells by types S, P, D, ...
    for i_b blocks do
        for j_b ≤ i_b do
            for P, Q ∈ i_b, j_b do
                for R shells do
                    for S ≤ R do
                        Compute ⟨pq|rs⟩
                    end
                    First index transform: C⟨pq|rs⟩ → ⟨ir|pq⟩, C⟨pq|sr⟩ → ⟨ir|pq⟩
                end
            end
            Second index transform: ⟨ir|pq⟩ → ⟨ir|aq⟩
            Third index transform: C⟨ir|aq⟩ → ⟨ij|aq⟩, Acc ⟨ij|aq⟩
                                   C⟨ir|qa⟩ → ⟨ij|qa⟩, Acc ⟨ij|qa⟩
        end
    end
    Fourth index transform:
        Permute C by shell type
        for j ≤ i do
            Get/Transpose ⟨ij|aq⟩
            C⟨ij|aq⟩ → ⟨ij|ab⟩
            Transpose/Put ⟨ij|ab⟩
        end

Figure 1: MP2 energy algorithm.

In the new algorithm we took the key features of the previous algorithms and combined them with further improvements. The salient features of the new algorithm are:

• Exploiting the ⟨pq|rs⟩ = ⟨qp|rs⟩ and ⟨pq|rs⟩ = ⟨pq|sr⟩ symmetries.
• Data blocking; BLAS2 and BLAS3 computational kernels.
• Use of distributed memory.
• Locality-based RMA and collective operations.
• An optional tradeoff between memory overhead and re-computation.
• Multi-pass computation for very large matrices.

5. Data Blocking and Layout

• Small shells, e.g. S, P, are blocked together.
• Large shells, e.g. F, G, may be split into smaller blocks to minimize memory.
• Shells must be ordered by type to avoid unnecessary re-computation. To restore the original ordering, the C matrix must be permuted in the 4th transformation.
• Large calculations, e.g. proteins, require transformation matrices which may exceed physical memory. To go beyond the memory limits, the MP2 energy can be evaluated in segments at the expense of recalculating the integrals in multiple passes.

Figure 2: Ordering and blocking shells by type.

Compared to the previous implementation, the new implementation stores the transposed transformation matrix in distributed memory.
• The i, j occupied orbitals are contiguous in the memory of a single processor for given a, b indices.
• The accumulate operation in the third transformation spans only a subset of processors.
• The fourth transformation must transpose the distributed matrix for efficient computation.

Figure 3: Distributed transformation matrix.

6. Transformations

6.1 First Transformation

• N^4_AO N_a/2 cost; DGER/DGEMM and DGEMV/DGEMM kernels; Rys quadrature [5].
• ⟨pq|rs⟩ are evaluated as a block for all s. The extra memory overhead to block the 2e integrals is N_AO m_b^2 n_R, where n_R is the R shell size and m_b the block size.
• Insignificant integrals are screened out of the computation and transformation.

6.2 Second Transformation

• N^3_AO N_v N_a/2 cost; DGEMM kernel.
• The memory overhead is N_a N_v to store the temporary transformation.

6.3 Third Transformation

• N^2_AO N^2_a N_v/2 cost; DGER kernel.
• Two accumulate operations per ij_b iteration, of total size N^2_a N_v m_b/2 each. To minimize the memory overhead, the accumulate operations are split into m_b N_a/2 chunks.

6.4 Fourth Transformation

• N_AO N^2_a N^2_v/2 cost; DGEMM kernel.
• Local Put and Get operations; two matrix transposes via MPI_Alltoall.
• Least expensive transformation.

6.5 Memory

• C matrix: N_MO N_AO, read-only, allocated on each processor.
• Local transformation array in the 1st, 2nd and 3rd transformations: m_b^2 N_AO N_a, read-write, allocated on each processor. This is the dominant memory overhead and can be controlled by the m_b parameter.
• Work arrays in the 1st, 2nd and 3rd transformations have an upper bound of N_a N_v.
• The work array in the 4th transformation can occupy the memory allocated during the 1st, 2nd and 3rd transformations.

7. Load Balancing

Since one of the accumulate operations can be localized to a single processor for each i_b iteration, the locality of that operation determines the load balancing as follows:

• Process all local blocks.
• Process all blocks to the right of and above the processor.
• After this point there are no more local accumulate operations left; pick the next processor and proceed in the above manner.
• Since each processor traverses the entire processor matrix, duplication of tasks is avoided by means of a global task list, in which completed tasks are flagged.
• Collisions between accumulate operations are minimized.

Figure 4: Processor matrix traversal by P2.

8. Scalability and Performance

Taxol, a large biological molecule found in tree bark, is an anti-cancer drug. MP2 calculations on Taxol with the 6-311G basis set present a medium challenge, suitable for assessing scalability:

• 6-311G basis set: 959 AOs, 164 active and 733 virtual orbitals.
• 71 GB transformation matrix; requires two passes on 128 BG/L nodes.
• During production runs all available global memory would be allocated. For the scalability testing the distributed memory size is fixed at 32 GB.
• Communication overhead due to accumulate operations is significant, introducing a large penalty.
• Accumulate performs better on the Cray XT4; consequently the algorithm scales better on the XT4.

Figure 5: MP2 scalability (speedup vs. ideal speedup) on BG/L (200-1000 processors) and XT4 (100-500 processors).

• The 2e Rys quadrature is a significant bottleneck, since its implementation has poor cache performance and cannot be SIMDized.
• The 1st transformation performs well due to integral screening. However, the 2nd and 3rd transformations do not screen the integrals and perform worse.

                BG/L      XT4
    Processors  128       64
    T_2e        40.0%     11.0%
    T_1         10.5%     6.0%
    T_2         10.5%     39.0%
    T_3         34.0%     31.5%
    T_4         5.0%      12.5%
    T_comp      3721 s    3330 s

Table 1: T_comp ratios for the 2e, 1st, 2nd, 3rd and 4th transforms, Taxol 6-311G.

8.1 Large Calculations

One of the smallest proteins, the TRP-Cage protein (Protein Data Bank 1L2Y), is found in the saliva of the Gila monster lizard.
Having only 20 amino acids, its energy can be calculated in a reasonable time using the 6-311G basis set.

Figure 6: TRP-Cage and the Gila monster.

• 6-311G basis set: 1686 AOs, 425 active and 1107 virtual orbitals.
• The entire transformation matrix would require 1.25 TB of memory. Using the segmentation strategy, the calculation can be performed on a single BG/L rack in several passes.
• The effective distributed memory available is 216 MB times the number of nodes, due to the binary memory footprint and the overhead of the underlying HF calculation.
• The calculation can be performed on a single BG/L rack in 6 passes; the total wall time is 6 hours.
• The fractions of the unscreened 2nd and 3rd transformations become larger. The 3rd transformation, involving unfavourable DGER kernels and accumulate operations, becomes the dominant bottleneck: more than half the total runtime.

                Taxol     TRP Cage
    Processors  128       1024
    Passes      2/2       1/6
    T_2e        40.0%     17.0%
    T_1         10.5%     6.3%
    T_2         10.5%     16.4%
    T_3         34.0%     58.0%
    T_4         5.0%      2.3%
    T_comp      3721 s    3169 s

Table 2: T_comp ratios for the 2e, 1st, 2nd, 3rd and 4th transforms: Taxol vs. TRP-Cage on BG/L.

9. Conclusions

• The 2e integral code, a significant bottleneck on BG/L, needs to be improved with respect to cache performance and vectorization.
• The 1st transformation performs well due to blocking, integral screening, and the use of optimized math kernels.
• The 2nd transformation is a significant bottleneck on the XT4 and a large contributing factor in large calculations. Integral screening is necessary to reduce its execution time.
• The 3rd transformation becomes a dominant factor as the system size increases. Integrals in the 3rd transformation must be screened in order for the algorithm to perform well on large molecules. Integral screening reduces both computation and communication.
• The DGER kernels in the 3rd transformation must be replaced with DGEMM kernels to improve computational efficiency.
• The overhead of the accumulate operations can be reduced by aggregating accumulates with collective reduce operations, albeit at the expense of coarser load balancing.
• The algorithm does not take into account the Manhattan distances of 3D topologies. With slight modification, the algorithm can be adapted to traverse the processor matrix closest neighbors first.
• The 4th transformation and the relevant transpose operations are not a significant runtime contribution.
• Performing large post-HF calculations, e.g. on proteins, using classical methods is possible within a reasonable time frame.

10. Acknowledgements

Funding is provided by the US Department of Energy. Blue Gene/L calculations were performed on the system at Iowa State University. Cray XT4 calculations were performed on the system at Oak Ridge National Laboratory.

References

[1] M. W. Schmidt, K. K. Baldridge, J. A. Boatz, S. T. Elbert, M. S. Gordon, J. H. Jensen, S. Koseki, N. Matsunaga, K. A. Nguyen, S. J. Su, T. L. Windus, M. Dupuis, and J. A. Montgomery, "General Atomic and Molecular Electronic Structure System," J. Comput. Chem., pp. 1347–1363, 1993.
[2] M. S. Gordon and M. W. Schmidt, "Advances in electronic structure theory: GAMESS a decade later," in Theory and Applications of Computational Chemistry: The First Forty Years, ch. 41, pp. 1167–1189, Elsevier, 2005.
[3] G. D. Fletcher, A. P. Rendell, and P. Sherwood, Mol. Phys., p. 431, 1997.
[4] K. Ishimura, P. Pulay, and S. Nagase, J. Comput. Chem., p. 407, 2006.
[5] J. Rys, M. Dupuis, and H. F. King, "Computation of Electron Repulsion Integrals Using the Rys Quadrature Method," J. Comput. Chem., no. 2, pp. 154–157, 1983.

SuperComputing 08, Austin, TX