Enumerating x86-64 – It’s Not as Easy as Counting

William Mahoney
University of Nebraska at Omaha
PKI 281-E, 6001 Dodge Street
Omaha, Nebraska 68182
wmahoney@unomaha.edu

J. Todd McDonald
University of South Alabama
1121 Shelby Hall, 150 Jaguar Drive
Mobile, AL 36688
jtmcdonald@southalabama.edu

ABSTRACT
In our work on software watermarking, we have been examining the possibility of executable steganography: hiding Intel/AMD x86-64 instructions within the operands of other instructions. Early thoughts about this concept revolved around creating a database of sorts to reflect which x86-64 instructions had operand fields large enough to hold the hidden payload. It was assumed that this database would be easily constructed, but it turns out to be a surprisingly difficult endeavor. Even the question of “how many are there?” is challenging to answer. Different CPUs support different instruction sets, different instruction decoders or reverse assemblers give different results for the same combination of bytes, and even the number of distinct mnemonics for instructions is blurry. In attempting to construct the x86-64 database we encountered several stumbling blocks, and we report on them here. This white paper is not a traditional research paper with background and a survey of relevant prior work. Rather, it describes our attempts to answer what we thought was a fairly simple question: for various numbers of bytes, just how many legal x86-64 instructions exist?

CCS Concepts
• Computer systems organization → Architectures → Serial architectures → Complex instruction set computing

Keywords
Instruction set, instruction decoding, mnemonics

1. INTRODUCTION
While working on our executable steganography efforts [1], we set out to construct a database of x86-64 instructions and what we called their “cover numbers”.
Our intent is to hide short executable instructions inside the operands of longer x86-64 instructions, so that the code carries a hidden payload or watermark that is not normally visible to reverse engineering tools. The “cover number” of an instruction was defined as the number of bytes that the instruction is capable of hiding. For instance, an x86-64 instruction with a 64-bit operand would be capable of secretly encoding eight bytes in the operand, so its cover number would be eight.

Our early thoughts for the project included some kind of searchable database: given a need for an instruction with a cover number of at least three, the database would list all candidate instructions. Later it was determined that this database of cover numbers is not as useful as a database of which instructions are available for various numbers of bytes. Rather than looking up an instruction to see what it might be capable of hiding, the better approach is to determine which operations require only one byte, or two, or three, and so on. In this way the author of the code to be hidden can select operations based on the number of opcode bytes.

Although the Intel/AMD 64-bit instruction set is large, it would seem that there should be a relatively simple, programmatic way to generate all of the instructions and, from there or in the process, to determine the number of bytes for each. As it turns out, this task is surprisingly difficult. Even as recently as 2016 one could assume that “a formal semantics of the current 64-bit x86 ISA … would be readily available, but one would be wrong. In fact, the only published description of the x86-64 ISA is the Intel manual with over 3,800 pages written in an ad-hoc combination of English and pseudo-code” [2]. So, of course, when the assumption is that something should be simple, often it is not, and this is only discovered after “jumping in head first”.
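
As a small illustration of the cover-number idea defined earlier in this section, the following Python sketch hides a few bytes in the 64-bit immediate of a carrier instruction. It is only a sketch under stated assumptions, not our actual tooling: the function names `cover_number` and `embed_in_mov_rax` are hypothetical, the choice of `mov rax, imm64` (REX.W prefix 0x48, opcode 0xB8) as the carrier is made purely for illustration, and unused operand bytes are assumed to be padded with one-byte NOPs (0x90).

```python
# Hypothetical illustration of a "cover number"; not the authors' tool.
REX_W = 0x48          # REX prefix with W=1, selecting a 64-bit operand
MOV_RAX_IMM64 = 0xB8  # opcode B8+rd with rd=0 (RAX), followed by imm64
NOP = 0x90            # one-byte NOP, used here as padding (an assumption)


def cover_number(operand_bits: int) -> int:
    """Number of whole bytes an operand of the given width can hide."""
    return operand_bits // 8


def embed_in_mov_rax(payload: bytes) -> bytes:
    """Build a 10-byte `mov rax, imm64` whose immediate holds the payload."""
    capacity = cover_number(64)
    if len(payload) > capacity:
        raise ValueError(f"payload is {len(payload)} bytes, capacity is {capacity}")
    # Pad the payload on the right so the immediate is exactly 8 bytes long.
    padded = payload.ljust(capacity, bytes([NOP]))
    return bytes([REX_W, MOV_RAX_IMM64]) + padded


# Hidden instructions: 31 C0 = xor eax, eax ; FF C0 = inc eax
hidden = bytes([0x31, 0xC0, 0xFF, 0xC0])
carrier = embed_in_mov_rax(hidden)
print(carrier.hex(" "))  # 48 b8 31 c0 ff c0 90 90 90 90
```

A disassembler starting at the first byte sees a single ten-byte move of a constant into RAX, while execution starting two bytes later would see the hidden `xor`/`inc` sequence followed by NOPs. The four hidden bytes fit comfortably within the carrier's cover number of eight.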
The results of this “jumping in” are reported here. Our paper is less a research tome than a running commentary on how we approached the problem and what results we had (or did not have!) along the way. Section two presents terminology and states the problem in more detail. Section three describes our approach to exhaustively searching a list of x86 instructions. Are the results correct? Surprisingly, this is not an easy question to answer. The reasons are given in section four, along with some thoughts about future changes that could shed some light on the answer.

This research is supported by the National Science Foundation under the Secure and Trusted Computing (SaTC) grants CNS-1811560 and 1811578. The project is a collaborative effort between the University of Nebraska at Omaha (UNO) and the University of South Alabama (USA).

2. THE PROBLEM
In a nutshell, our question is: how many valid byte combinations correspond to legal x86-64 architecture instructions of a certain length? Can they be enumerated, and if so, how? Specifically, because of our steganography work, we are interested in instructions whose size is six or fewer bytes.

2.1 Considerations
On the surface, constructing our list of valid instructions seems an easy problem. But consider:
• The number of potential x86-64 instructions is huge, as the hardware limit is the number of bytes that the CPU is willing to fetch for one instruction. On x86-64 this is 15 bytes [3] (p. 208), and as a result there are $2^{15 \times 8} = 2^{120}$ potential instructions.
• Certain prefix bytes can be added in advance of the instruction, some of them causing extended behavior and some of them having no effect whatsoever. As a result, a simple instruction such as an addition of two registers can