Compared accuracy of QR / Normal equations with IEEE arithmetic Gravity Field Parameter Estimation Using QR Factorization Marty, J.C .; Gratton, S.; Bruinsma, S.L.; Balmino, G. (CNES/GRGS) Baboulin, M. (Universidade de Coimbra) Some references: [1] M. Arioli, M. Baboulin, S. Gratton. A partial condition number for linear least-squares problems. SIAM J. Matrix Analysis and Applications. 29(2):413--433, 2007. [2] M. Baboulin, L. Giraud, S. Gratton, J. Langou. A distributed packed storage for large dense in-core parallel calculations Concurrency and Computation: Practice and Experience. 19(04),483--502, 2007. [3] M. Baboulin, L. Giraud, S. Gratton. A parallel distributed solver for large dense symmetric systems: applications to geodesy and electromagnetism problems. Int. J. High Speed Computing , 19(04),381--410, 2005. [4] M. Baboulin, L. Giraud, S. Gratton, J. Langou. Parallel tools for solving incremental dense least squares problems. Application to space geodesy. University of Tennessee Technical Report UT-CS-06-582 and CERFACS Technical Report TR/PA/06/63. Also appeared as LAPACK Working Note 179 , September 2006. To appear in Advances in Engineering Software. Results on IBM pSeries 690 – Power4 1.7 GHz Packed solver twice as cheap as ScaLAPACK in memory Iso-memory scalability for a right-looking implementation of the Cholesky/QR factorization. (constant problem size/processor = 10240, 64 procs problem size = 81920) Solved by summation of the normals followed by a Cholesky decomposition and 2 triangular solves Consider the solution of the least-squares problem Let be the solution of the perturbed least squares problem where the perturbations are due to round-off errors. Let be the relative error on the solution, and be machine precision, and be the condition number of A. It can be proved by first-order expansions that For the normal equations, For the QR approach, 2 ) ( ) ( b b x A A Min n x + - + x ~ x 16 10 . 2 ~ - ψ x x x Err - = ~ ψ . ) ( ~ 2 A K Err ψ . ) ( 1 ) ( ~ + x A r A K A K Err + = A A A K ) ( 1. QR approach is twice as expensive as normal equations, but much more accurate if Useful if round-off error dominates model error. COMPUTE K(A) 3. Normal equation will meet an overflow if compared with for the QR approach 4. QR approach is needed for multi-mission solvers . 1 ) ( << x A r A K 1 ). ( > ψ A K 1 . ) ( 2 > ψ A K Square root algorithm for incremental least-squares problems 2 1 1 b x A Min n x - with a priori normal equations Normal equation approach Square root approach Load factor from the QR factorization The covariance T T R R A A - - - = 1 1 ) ( 0 0 0 0 b A x A A T T = ( 29 1 1 0 0 1 1 0 0 b A b A x A A A A T T T T + = + 1 R [ ] 0 0 0 0 , R Q b A = [ ] 1 1 0 b A R Compute the R-factor of , denoted solve 0 R = ρ s r R 1 s x r = Costs 2 mn 2 n 2 ~ 2 mn 2 Note that is the Cholesky factor of 1 R ( 29 1 1 0 0 A A A A T T + 1 A m-by-n dense matrix by m >> n Costs mn 2 n 3 /3 2n 2 ~ mn 2 Cost ~ n 3 Packed Incremental QR factorization 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 Triangular matrix 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 Row oriented Packed representation Elementary block 3 0 1 2 3 0 1 2 3 0 1 2 Obs Main steps of our algorithm (submitted to ScaLAPACK) 2. Copy a row of the packed structure in the buffer 3. Perform the QR factorization (Block Householder) 4. Copy back the row in the packed structure 5. Go to 1. with next row 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 1.Copy 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 2.Fact or 3.Copy 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 1.Copy Obtaining good performance is based on the BLAS3 performance, and requires a tuning of the elementary block that may look like e.g. 3 1 0 3 1 0 3 1 0 2 2 2 3 1 0 3 1 0 3 1 0 2 2 2 ScaLAPACK is one of the best set of routines to store and compute with large, dense matrices on a distributed memory computer. A periodic two dimensional distribution is used where is a matrix (typically 128x128, depending on cache size) hosted by processor i Why such a complex layout ? The 2-D block cyclic distribution provides a good load-balancing i.e. good workload distribution among the processors for important linear algebra kernels such as matrix-matrix multiplication. View for processor 2 256x384 i Matrix 512x768 2 2 2 3 1 0 3 1 0 3 1 0 2 2 2 3 1 0 3 1 0 3 1 0 2 2 2 2 2 2 ScaLAPACK distributed format Design of parallel solvers for least squares based on a cyclic distribution as in ScaLAPACK Optimal memory usage obtained by introducing a packed parallel distributed layout that generalizes to distributed machines the well-known packed format Implementation of a normal equation solver and of a square root QR-based solver, in which the efficiency in Mflop/s is comparable to ScaLAPACK routines Not presented here: condition number estimators have been developed using the power and inverse power iterations implemented in the packed format. Computing a condition number is affordable when a factor (Cholesky or QR) is available, and costs only a few matrix-vector products. An implementation is the CNES/GRGS GINS software has been performed, and experiments in the framework of the GOCE mission will be carried out. Preliminary solution was computed using either Cholesky or QR decomposition. The evaluation of the gravity fields obtained by the two methods shows differences in the accumulated geoid errors below 1 micrometer (in agreement with sensitivity analysis and condition numbers) Further work includes sensitivity analysis in non nominal instrument configuration such as the loss of one or two axis of the gradiometer. This study was partly done under ESA contract 18308/04/NL/MM and supported by CNES. Performance of parallel QR [Mflop/s] 0 0,5 1 1,5 2 2,5 3 3,5 4 1 2 4 8 16 32 64 number of processors Packed solver ScaLAPACK Performance of parallel Cholesky [Mflop/s] 0 0,5 1 1,5 2 2,5 3 3,5 4 1 2 4 8 16 32 64 number of processors Packed solver ScaLAPACK GOCE simulation The objective of the European Space Agency’s Gravity Field and Steady-State Ocean Circulation Explorer (GOCE) mission is to provide a mean static (i.e., free from temporal gravity variations due to, for example, atmosphere and ocean mass re-distribution) gravity field model of the Earth, represented by spherical harmonic coefficients, and the corresponding geoid of a very high accuracy (ESA, 1999). The geoid-error specifications are 1-2 cm cumulated error at degree 200, which corresponds to a half-wavelength of about 100 km. To achieve this unprecedented performance, GOCE is equipped with a gravity gradiometer as well as a GPS receiver, whereas the orbit altitude will be as low as possible (265 km). In this simulation, gravity field parameters were estimated using only Satellite Gravity Gradiometry (SGG) observations, i.e., GPS or position data were not used. The main contractor Thales Alenia, using the gravity field EGM96 to degree and order 360, simulated the SGG data realistically. The data noise is based on the latest spacecraft and instrument performance. Only 20 days, which is less than the repeat period of 30 days, of 1 Hz SGG Txx, Tyy, and Tzz data were processed (the instrument X-axis is nominally aligned with the velocity vector, the Z-axis is approximately radial outward, and the Y- axis is normal to the orbital plane), computing either normal equations or a QR factorization. The same band pass filter was applied to both equation systems. The solution was computed using either Cholesky or QR decomposition. These gravity fields were evaluated by calculating the accumulated geoid errors in reversed order to highlight differences, which are more likely at high degree. These numbers are not estimates of mission performance, for which the period is much too short anyway, but relative precision indicators. First ESA Earth Explorer : GOCE (Gravity field and steady-state Ocean Circulation Explorer) Satellite scheduled for launch in 2008. GOCE Mission GOCE 50 cm X1 Z1 Y1 X2 Z2 Gravity gradiometer (SGG) Three pairs of accelerometers, baseline 0.5 m, perpendicular to each other. The read-out difference per pair of accelerometers provides one component of the gravity gradient, the average gives the common mode Angular accelerations are also obtained Measurement bandwidth (MBW): 0.005-0.1 Hz precision: 3-6 mE Hz -0.5 (1 E = 10 -9 s -2 ) Power: 75 W (125 kg) Frequency (Hz) MBW filter Gradiometer noise PSD Results and conclusion