RESEARCH STATEMENT
Shantanu Gupta (shangupt@umich.edu)

My interests span computer architecture and compiler technology, with a focus on system reliability, performance and energy efficiency. Within this scope, I have worked on numerous projects during the course of my doctoral research, industrial internships and collaborations within the University of Michigan. In the reliability domain, I have investigated hard-fault tolerance (in both processors and caches), soft-error tolerance, and concurrency bugs in parallel programs. In the performance domain, I have developed microarchitectural solutions for enabling dynamic multicores, which can cater to situations requiring single-thread performance, throughput computing and anything in between. Finally, in the energy-efficiency domain, I am exploring configurable compute engines that can save a large fraction of instruction and data supply energy.

Dissertation Research

With increasing silicon integration, transistors today are cheaper and faster than ever before. This transistor scaling has long been a source of dramatic performance gains. At the same time, however, it has raised operating temperatures and power densities, which can have serious repercussions on a chip’s reliability, performance and computational efficiency. For instance, given that most silicon wearout mechanisms are highly dependent on chip temperatures and device sizes, significantly higher failure rates are projected for future technology generations. In modern multicore chips, this can jeopardize the objective of sustaining throughput over the lifetime of a chip. In terms of performance, the multicore chips prevalent today (chosen as an alternative to complex monolithic designs) are effective for throughput computing, but they provide only small gains for sequential applications.
Even if a major transition towards parallel programming occurs in the future, Amdahl’s law dictates that the sequential component of an application will remain a performance bottleneck. Lastly, going forward, chip-wide power and energy constraints will limit the number of cores and resources that can be kept active on a chip, motivating the need for highly energy-efficient computing. My thesis is on the design of adaptive architectures that address all of the issues discussed above. Further, the proposed solutions are complementary to each other and, when applied together, can effectively tackle the reliability, performance and energy-efficiency demands expected in future microprocessors.

Hard Fault Tolerance (StageNet, 2007-10)

Traditionally, hard faults in high-end servers and mission-critical systems have been addressed using mechanisms such as dual- and triple-modular redundancy. However, such solutions incur high hardware overheads and can tolerate only a small number of defects. As a new direction in the hard-fault tolerance paradigm, I proposed StageNet, a fine-grained redundancy solution for multicore chips. StageNet is a highly reconfigurable multicore architecture that is designed as a network of pipeline stages, rather than isolated cores. Its interconnection flexibility allows it to salvage healthy pipeline stages by adaptively routing around defective ones in the multicore fabric. This fine-grained defect isolation enables StageNet to maintain a higher throughput over a system’s lifetime compared to a conventional multicore chip. The primary challenge in this project was the design of a decoupled pipeline microarchitecture that allows pipeline stages from different cores to assemble together and form a logical processor. The original decoupled pipeline design appeared in CASES’08 [1] and the full system in MICRO’08 [2] and TOC’10 [3]. A scalable and process-variation-tolerant version of StageNet also appeared in DSN’10 [4].
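The throughput benefit of stage-level defect isolation can be illustrated with a minimal sketch. It is an idealized model, not the StageNet implementation: the stage names, the fault list and both helper functions are hypothetical, and the interconnect is assumed to be fault-free and fully connected, so that any healthy stage can be paired with any other.

```python
from collections import Counter

# Hypothetical stage names, chosen only for illustration.
STAGES = ("fetch", "decode", "issue", "execute")


def conventional_cores(num_cores, faults):
    """Working cores when a single defective stage disables its whole core."""
    broken = {core for core, _stage in faults}
    return num_cores - len(broken)


def stagenet_pipelines(num_cores, faults):
    """Logical pipelines that can be assembled from healthy stages.

    Idealized assumption: a fault-free, fully connected interconnect, so the
    count is limited only by the scarcest healthy stage type.
    """
    bad_per_stage = Counter(stage for _core, stage in set(faults))
    return min(num_cores - bad_per_stage[stage] for stage in STAGES)


# Four cores, each with exactly one defective stage (made-up fault map).
faults = [(0, "fetch"), (1, "decode"), (2, "execute"), (3, "decode")]
print(conventional_cores(4, faults))  # 0
print(stagenet_pipelines(4, faults))  # 2
```

In this made-up fault map every core has lost one stage, so a conventional chip has no working cores, while the idealized stage network can still assemble two logical pipelines because only the decode stage is down to two healthy copies.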
Hard Fault Detection (Adaptive Testing, 2008-09)

With failure rates increasing in commodity systems, processors will need to be equipped with fault tolerance mechanisms that can detect in-field silicon defects. In this project, I proposed an adaptive on-line testing framework to significantly reduce the overhead of in-field hard fault detection. The insight here was to leverage health monitoring sensors to guide the amount of testing applied to different
http://www.eecs.umich.edu/shangupt