ACM Communications in Computer Algebra, Vol. 45, No. 2, Issue 176, June 2011
ISSAC 2011 Poster Abstracts
Communicated by Manuel Kauers

Parallel and Cache-efficient Hensel Lifting

Fatima K. Abu Salem
Computer Science Department, American University of Beirut
P. O. Box 11-0236, Riad El Solh, Beirut 1107 2020, Lebanon
fatima.abusalem@aub.edu.lb

We present work in progress towards a high performance (HP) design for Hensel lifting in bivariate polynomial factorisation over a finite field $\mathbb{F}_q$. We discuss techniques that improve data locality, which is becoming increasingly important in today's algorithm design. We also discuss how to reorganise the iterative computations involved into a sequence of rounds, each of which can be executed in parallel. We propose the use of heaps, inspired by successful results on polynomial multiplication and division using the distributed polynomial representation (see Monagan and Pearce's work in [1, 3, 3, 4], to name a few). Additionally, we associate the order of polynomial computations with two-dimensional indices, and suggest a traversal of the two-dimensional grid that yields an evaluation order of the dependency graph with improved locality.

Let $f \in \mathbb{F}_q[x, y]$, where $q$ is a prime power, and let $n = \deg(f)$. We wish to factor $f$ into two factors $g$ and $h$ such that $f = gh$. Write $f = \sum_{k=0}^{n} f_k y^k$, where $f_k \in \mathbb{F}_q[x]$ and $\deg(f_k) = n - k$. Suppose we are given a (boundary) factorisation $f_0 = g_0 h_0$, where $f_0$ is squarefree and $g_0, h_0 \in \mathbb{F}_q[x]$. We wish to lift this boundary factorisation to a full factorisation of $f$ in $\mathbb{F}_q[x, y]$. Let $d = \gcd(g_0, h_0)$, with $u$ and $v$ chosen such that $u g_0 + v h_0 = d$.
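The coefficients $u$ and $v$ can be obtained with the extended Euclidean algorithm in $\mathbb{F}_q[x]$. The following is a minimal sketch only, assuming that $q = p$ is prime (so that field arithmetic is integer arithmetic modulo $p$) and that polynomials are stored as dense coefficient lists, lowest degree first. The helper names (`trim`, `add`, `sub`, `mul`, `divmod_poly`, `ext_gcd`) are illustrative and are not taken from the poster, which works with heaps and the distributed representation instead.

```python
# Sketch under stated assumptions: p prime, dense lists, low degree first.
# Requires Python 3.8+ for pow(x, -1, p) (modular inverse).

def trim(a):
    """Drop trailing zero coefficients; the zero polynomial is []."""
    a = list(a)
    while a and a[-1] == 0:
        a.pop()
    return a

def add(a, b, p):
    n = max(len(a), len(b))
    get = lambda t, i: t[i] if i < len(t) else 0
    return trim([(get(a, i) + get(b, i)) % p for i in range(n)])

def sub(a, b, p):
    n = max(len(a), len(b))
    get = lambda t, i: t[i] if i < len(t) else 0
    return trim([(get(a, i) - get(b, i)) % p for i in range(n)])

def mul(a, b, p):
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % p
    return trim(out)

def divmod_poly(a, b, p):
    """Quotient and remainder of a by a nonzero b over F_p."""
    r = trim([c % p for c in a])
    b = trim([c % p for c in b])
    inv_lead = pow(b[-1], -1, p)
    q = [0] * max(len(r) - len(b) + 1, 0)
    while len(r) >= len(b):
        shift = len(r) - len(b)
        c = (r[-1] * inv_lead) % p
        q[shift] = c
        for i, bi in enumerate(b):
            r[shift + i] = (r[shift + i] - c * bi) % p
        r = trim(r)  # the leading term cancels, so the degree drops
    return trim(q), r

def ext_gcd(a, b, p):
    """Return (d, u, v) with u*a + v*b = d, d the monic gcd (a, b not both zero)."""
    r0, r1 = trim([c % p for c in a]), trim([c % p for c in b])
    u0, u1 = [1], []
    v0, v1 = [], [1]
    while r1:
        q, r = divmod_poly(r0, r1, p)
        r0, r1 = r1, r
        u0, u1 = u1, sub(u0, mul(q, u1, p), p)
        v0, v1 = v1, sub(v0, mul(q, v1, p), p)
    c = pow(r0[-1], -1, p)  # scale so that d is monic
    scale = lambda t: [(c * x) % p for x in t]
    return scale(r0), scale(u0), scale(v0)

# Usage: g0 = x + 1 and h0 = x + 2 over F_5 are coprime.
p = 5
g0, h0 = [1, 1], [2, 1]
d, u, v = ext_gcd(g0, h0, p)
assert d == [1]
assert add(mul(u, g0, p), mul(v, h0, p), p) == d
```

The dense representation is chosen here purely for brevity; it says nothing about the memory behaviour of the design discussed in this abstract.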
When $d = 1$, and under certain restrictions on the degrees of each $g_k$ and $h_k$, there is at most one way of defining $g_k$ and $h_k$, as follows:

\[
g_k \equiv v \Bigl( f_k - \sum_{i=1}^{k-1} g_i h_{k-i} \Bigr) \bmod g_0, \qquad
h_k \equiv u \Bigl( f_k - \sum_{i=1}^{k-1} g_i h_{k-i} \Bigr) \bmod h_0. \tag{1}
\]

If the degree restrictions are observed, one continues lifting; otherwise, one halts lifting from the given pair $(g_0, h_0)$. It can be shown that there exists a certain boundary factorisation from which one can produce all monic factors of $f$ with total degree between $1$ and $\lfloor n/2 \rfloor$.

Computation in the order stipulated by Eq. (1) not only restricts parallelism to a very limited scale, but can also be shown to produce poor memory performance. We show that the sequential order of computation, in which a polynomial $g_k$ (or $h_k$) can only be computed in the $k$th iteration after $g_{k-1}$ and $h_{k-1}$ have been obtained, can be overcome by a sequence of parallel rounds such that in round $k'$ all of the following polynomial products can be obtained: $\{g_i h_j\}$ such that $(i, j) \in \{1, \dots, k'\}^2 \wedge (i = k' \vee j = k')$. Consequently, one can start producing $g_k$ (or $h_k$) whenever $g_{\lceil k/2 \rceil}$ and $h_{\lceil k/2 \rceil}$ have been obtained, and it can be shown that the number of parallel rounds grows asymptotically like the total number of lifting steps required to terminate. Further concurrency can be obtained, as partial terms appearing in $g_k$ can kick-start computations contributing to $\{g_{k'}\}_{k' > k}$, and similarly for $h_k$. Such concurrency reduces the number of synchronisation barriers needed, as it allows processors to hop vertically across the