Systems Should Automatically Specialize Code and Data

Brandon Lucia    Todd Mytkowicz
Microsoft Research

Generality & Programmability are at Odds with Specialization & Efficiency

Today's languages, systems, and architectures include considerable support for generality and programmability. Systems are built to enable simple implementation of a broad class of programs, and to provide an acceptable level of performance and efficiency. The profusion of high-level languages, runtime and system support for tasks like garbage collection and virtualization, and architectural mechanisms like virtual memory speak to the importance placed by system designers on supporting programmability for a broad class of programs.

While systems go to these great lengths for generality and programmability, efficiency is also a key design requirement because the amount of time, space, and energy a computation requires ultimately determines its cost. Unfortunately, system support for generality is frequently at odds with the need for high performance and efficiency. Mechanisms and abstractions built to provide generality and programmability come at a cost in efficiency, and programmers frequently go to great lengths to sidestep those mechanisms and abstractions to make their computations more efficient.

Specialization is one approach programmers take to trade generality for efficiency. Broadly, specialization exploits structure present in a problem to increase the efficiency of its implementation. To specialize a program, a programmer makes assumptions about things like the nature of expected program inputs or the machine that will run the program. The programmer then carefully writes an implementation that takes advantage of those assumptions to improve efficiency. Specialization can affect a program's data representation, the algorithm that manipulates that data representation, and the implementation of both.
For example, say a programmer has implemented a program that sorts numbers and wants it to be more efficient. If the programmer knows that their general sorting code will only ever sort peoples' ages, they may choose to specialize their data structures and sorting code to store and manipulate only 7-bit values, assuming no one will live beyond 127 years. The age-specialized sorting implementation takes advantage of structure in the program's input, making more efficient use of resources than a general implementation that stores and manipulates 64-bit numbers.

While such ad hoc, manual specialization can be fruitful, it has several drawbacks. First, even for a particular program, different execution environments may deal with different input sets that vary in the structure they have and in whether they have structure at all. That variation requires the programmer to adapt the program to each different environment, which is onerous. Second, specialization may make programming more difficult, as programmers avoid general-purpose abstractions and take advantage of specific data or machine characteristics. Such abstractions often exist to insulate programmers from the complexity of data and machine characteristics, so manual specialization may lead to more complex (and potentially error-prone) programming.

Vision

It is our vision that specialization does not need to be ad hoc and does not need to be applied manually. We propose that it is a worthy and attainable research agenda to develop systems – programming languages, system software, and computer architectures – that automate the process of program specialization. We believe that to develop those systems we must borrow techniques from machine learning, especially from deep learning.
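To make the age-sorting specialization described earlier concrete, the following is a minimal Python sketch. The function names and the choice of counting sort are our own illustration, not prescribed by the text: the point is that once every value is assumed to fit in 7 bits (0–127), a fixed 128-entry table replaces general comparison sorting.

```python
import random

def sort_general(values):
    """General-purpose comparison sort: works for any orderable values."""
    return sorted(values)

def sort_ages(ages):
    """Specialized sort: assumes every value fits in 7 bits (0..127),
    so a counting sort over a fixed 128-entry table suffices."""
    counts = [0] * 128
    for age in ages:
        counts[age] += 1          # tally occurrences of each age
    result = []
    for age, n in enumerate(counts):
        result.extend([age] * n)  # emit each age in ascending order
    return result

# The specialized version agrees with the general one on valid inputs.
ages = [random.randrange(128) for _ in range(1000)]
assert sort_ages(ages) == sort_general(ages)
```

The specialized version runs in time linear in the input size (plus a constant 128-entry pass) rather than O(n log n), and its table stores small counts rather than 64-bit elements; in exchange, it silently assumes the input structure and fails on values outside 0–127.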
Strategy

Our strategy for automatic specialization is to use deep learning to produce specialized data representations and algorithms, and to use techniques for disciplined approximation to control or tolerate any imprecision introduced by what is learned.

Automatic Specialization with Deep Learning

Deep learning is the science of automatically learning abstract data representations. Deep learning algorithms are built up from a series of processing stages, and each stage learns a "layer" of abstraction. Each layer learns a more abstract data representation than the abstraction learned at the previous layer. Deep learning algorithms learn by looking at examples of data and then tuning the abstraction at each layer according to some optimization function, such as how much error the layers of abstraction introduce.

One of the key insights of deep learning is that an "algorithm" is trivial if the problem's data representation contains a simple representation of the answer. For example, if the final layer in a deep architecture contains a single bit which encodes whether an image contains a cat, the algorithm for determining whether an image contains a cat is trivial (is the bit set?). Deep learning recognizes that data representation and algorithm are inseparably intertwined, and as such has developed learning algorithms that explicitly learn both data representation and algorithm at the same time.

This paper suggests that learned data representations and algorithms are inherently specialized. Deep learning algorithms learn representations that are derived from example data, but that generalize to unseen examples that share characteristics of the example data.

Example

As an illustrative example, say a programmer wants to manually implement a program that manipulates images of faces to determine the disposition of the person in each image.
The naive way to do that is to store the images' entire pixel arrays and metadata and do the disposition computation over the full images. That naive implementation might then do a search over the image to find certain known mouth or eye shapes characteristic of certain dispositions.

Knowing that the images are all of faces, and knowing the nature of the task, the programmer may try to specialize the code and data representation. One way to do that is to store and manipulate an abstract data type representing eye and mouth characteristics only. While that specialization may improve efficiency by reducing storage and computation overhead, in general, it is hard to predict how efficient

2014/4/3