Poster: Precise Dynamic Dataflow Tracking with Proximal Gradients

Gabriel Ryan, Abhishek Shah, Dongdong She, Suman Jana
Columbia University, New York, USA
{gabe, dongdong, suman}@cs.columbia.edu, abhishek.shah@columbia.edu

Koustubha Bhat
Vrije Universiteit, Amsterdam, Netherlands
k.bhat@vu.nl

Abstract—Dataflow analysis is a fundamental technique in the development of secure software. It has multiple security applications in detecting attacks, searching for vulnerabilities, and identifying privacy violations. Taint tracking is a type of dataflow analysis that tracks dataflow between a set of specified sources and sinks. However, taint tracking suffers from high false positive and false negative rates due to fundamentally imprecise propagation rules, which limits its utility in real-world applications. We introduce a novel form of dynamic dataflow analysis, called proximal gradient analysis (PGA), that not only provides much more precise dataflow information than taint tracking, but also more fine-grained information about dataflow behavior in the form of a gradient. PGA uses proximal gradients to estimate derivatives of program operations that are not numerically differentiable, making it possible to propagate gradient estimates through a program in the same way taint tracking propagates labels. By using gradients to track dataflows, PGA naturally avoids many of the propagation errors that occur in taint tracking. We evaluate PGA on 7 widely used programs and show that it achieves up to 39% better precision than taint tracking while incurring lower average overhead due to the increased precision.

Index Terms—poster, taint, dataflow, program analysis, nonsmooth optimization, gradient

I. INTRODUCTION

Dataflow analysis is a fundamental technique in the development of secure software. It has multiple security applications in detecting attacks, searching for vulnerabilities, and identifying privacy violations [1], [5].
One of the most effective techniques of dataflow analysis is taint tracking, which tracks which internal variables are affected by the input [3]. However, taint tracking suffers from high false positive and false negative rates due to fundamentally imprecise propagation rules, which limits its utility in real-world applications.

We introduce a novel form of dynamic program analysis, called proximal gradient analysis (PGA), that not only provides much more precise dataflow information than taint tracking, but also more overall information about program behavior in the form of a gradient. PGA uses proximal gradients to estimate derivatives of program operations that are not numerically differentiable, making it possible to propagate gradient estimates through a program in the same way taint tracking propagates labels [4]. By using gradients to track dataflows, PGA naturally avoids many of the over-approximation problems that occur in taint tracking.

Figure 1 gives an example of an operation on which PGA provides more precise and fine-grained dataflow information compared to taint tracking.

    // taint source: x, taint sink: y
    // x is a 4-byte int
    int x = 0x12345678;
    for (int i = 0; i < 6; i++) {
        y[i] = x;
        x = x << 8;
    }

    Taint to y from x:      1  1    1     1     1  1
    Gradient of y wrt. x:   1  256  2^16  2^24  0  0

Fig. 1. Example of a program in which an iterated shift operation on an integer will cause over-tainting, while gradient will precisely identify how the source variable (x) influences the sink (y). Deeper shades of red indicate greater degrees of influence.

The source integer x is left-shifted by a byte on every iteration of the for loop and then assigned to a position in the sink array y. After the first 4 iterations, all the bytes of x's initial value have been shifted out and x goes to 0. At this point, there is no dataflow between x and the value of y[i], since it will always be 0.
PGA correctly identifies this, and also identifies that changes in x have a much larger effect on higher indexes in the array y. In contrast, taint tracking will mark all of the integers in y with x's label.

II. BACKGROUND

Our approach to program analysis draws on work in three fields: Dynamic Dataflow Analysis, Nonsmooth Optimization, and Automatic Differentiation.

Dynamic Dataflow Analysis models the flow of data through a program by tracking variable interactions. It has applications in both compiler optimization and detection of security vulnerabilities, but suffers from high false positive rates that limit its utility.

Nonsmooth Gradient Approximation involves a collection of methods, developed in the field of Nonsmooth Optimization, for approximating gradients in cases where the gradient cannot be evaluated analytically. These methods make it possible to approximate gradients of discrete and nonsmooth functions in a principled way based on the local behavior of the function.

Finally, we draw on the field of Automatic Differentiation, which provides methods for computing gradients over programs composed of semi-smooth numerical operations, but not over general programs with discrete and nonsmooth operations.

To evaluate gradients over nonsmooth operations, we use a method from the discrete optimization literature called proximal gradients [4]. Proximal gradients use the minimum point within a soft bounded region. This region is defined
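For reference, the "minimum point within a soft bounded region" has a standard formulation in the nonsmooth optimization literature, the proximal mapping. The notation below (f, v, lambda) is ours, not from the poster:

```latex
% Standard proximal mapping: the minimizer of f within a soft
% (quadratically penalized) neighborhood of the point v, with
% the penalty weight lambda controlling the size of that region.
\operatorname{prox}_{\lambda f}(v)
  = \arg\min_{x} \Bigl( f(x) + \tfrac{1}{2\lambda}\,\lVert x - v \rVert_2^2 \Bigr)
```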