A Study of Thread Level Parallelism on Mobile Devices

Cao Gao*, Anthony Gutierrez*, Ronald G. Dreslinski*, Trevor Mudge*, Krisztian Flautner and Geoffery Blake

* Advanced Computer Architecture Laboratory, University of Michigan, {caogao, atgutier, rdreslin, tnm}@umich.edu
ARM Ltd., {krisztian.flautner, blakeg}@arm.com

Abstract—Mobile devices continue to increase the number of cores in an attempt to meet the needs of performance-demanding applications. However, the increasing number of cores does not necessarily translate into performance gain and/or power reduction. In this paper we investigate how multi-core mobile devices are utilized by applications. Our results demonstrate that mobile applications utilize fewer than 2 cores on average, which shows that multi-cores are generally underutilized by today's mobile applications. Unless application developers can significantly improve core utilization, further increasing core counts will result in little gain.

I. INTRODUCTION

Given the growing hardware demand from modern mobile applications, mobile device vendors have started shipping smartphones and tablets with embedded multi-core CPUs in volume. However, despite the great computational potential of multi-core CPUs, it is not clear how fully they can be utilized on mobile devices. To take advantage of a multi-core system, software developers have to divide their programs into parallel threads, which is difficult. For the similar situation on desktops, Blake et al. [2] performed a study on a suite of representative desktop applications. Their results suggested that the number of cores that can be profitably used is less than 3 for most commonly used applications.

In this work, we analyze a broad range of popular mobile applications on two up-to-date development boards to determine how the cores are utilized on mobile devices. We calculate the Thread Level Parallelism (TLP) of these applications.
Our results show an average TLP of 1.4 for a quad-core system. This suggests that mobile applications utilize fewer than 2 cores on average, even with several applications running concurrently. In fact, some recent mobile CPUs [1] are made with 2 cores and still provide the desired performance.

We also measure the same metric across a broad spectrum of configurations, including varying numbers of cores in the system, core frequencies, and different CPUs. We observe modest TLP scalability for most applications, and increasing the number of cores has little return on TLP. In addition, CPUs with higher frequencies tend to exhibit less TLP, which suggests that exploiting parallelism will only become more challenging in the future.

Overall, these studies suggest that multi-core CPUs are underutilized in mobile devices. It seems that software developers are lagging behind in exploiting parallelism in mobile applications, and increasing the number of CPU cores may have diminishing returns until that changes.

II. METHODOLOGY

A. Metrics

We use Thread Level Parallelism (TLP) [2], [3], which is defined in Equation 1 as the machine utilization over the non-idle portions of the benchmark's execution:

$$\mathrm{TLP} = \frac{\sum_{i=1}^{n} c_i \cdot i}{1 - c_0} \qquad (1)$$

where $c_i$ is the fraction of time that $i$ cores are concurrently running different threads, and $n$ is the number of cores. Specifically, $c_0$ is the idle time fraction. To calculate TLP, we collect all the context switch events using ftrace, the Linux kernel's internal tracer.

B. System setup

We choose two development boards that are representative of the latest mobile device technology. Most of the experiments are done on the Samsung Origenboard. It contains an Exynos 4412 SoC with a 1.4 GHz quad-core Cortex-A9 CPU and a Mali-400 GPU. For comparison, we also use a Qualcomm Dragonboard with a 2.3 GHz quad-core Krait CPU.

C. Benchmarks

We choose 16 popular applications from the Google Play Appstore and 4 native ones in the Android OS.
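As an aside on the metric of Section II-A, the following is a minimal sketch of how Equation 1 could be evaluated. It is not the authors' tooling. It assumes the context switch trace has already been reduced to per-core busy intervals, that is, spans during which a core runs a non-idle thread; the tuple layout `(core, start, end)` and the function names `concurrency_fractions` and `tlp_from_fractions` are hypothetical and chosen only for illustration. The sample intervals in the demo are invented.

```python
from typing import List, Sequence, Tuple

# Hypothetical input format: (core id, start time, end time) of a span in
# which that core runs a non-idle thread. Intervals on the same core are
# assumed not to overlap.
BusyInterval = Tuple[int, float, float]


def concurrency_fractions(busy: Sequence[BusyInterval], n_cores: int,
                          t_start: float, t_end: float) -> List[float]:
    """Return [c_0, ..., c_n]: fraction of the window with exactly i busy cores."""
    events = []
    for _core, start, end in busy:
        start, end = max(start, t_start), min(end, t_end)  # clip to window
        if start < end:
            events.append((start, +1))  # a core becomes busy
            events.append((end, -1))    # a core becomes idle
    # Tuples sort by time first; at equal times -1 precedes +1, so a core
    # that stops and restarts at the same instant is never counted twice.
    events.sort()

    time_at = [0.0] * (n_cores + 1)  # time spent with exactly i busy cores
    active, prev = 0, t_start
    for t, delta in events:
        time_at[active] += t - prev
        active += delta
        prev = t
    time_at[active] += t_end - prev  # tail of the window

    total = t_end - t_start
    return [x / total for x in time_at]


def tlp_from_fractions(c: Sequence[float]) -> float:
    """Equation 1: sum over i >= 1 of c_i * i, divided by (1 - c_0)."""
    non_idle = 1.0 - c[0]
    if non_idle <= 0.0:
        raise ValueError("TLP is undefined when the system is always idle")
    return sum(i * c_i for i, c_i in enumerate(c)) / non_idle


if __name__ == "__main__":
    # Invented 2-core example over a 10 s window:
    # core 0 is busy during [0, 6), core 1 during [2, 4).
    fractions = concurrency_fractions(
        [(0, 0.0, 6.0), (1, 2.0, 4.0)], n_cores=2, t_start=0.0, t_end=10.0)
    print(fractions)                      # c_0 = 0.4, c_1 = 0.4, c_2 = 0.2
    print(tlp_from_fractions(fractions))  # about 1.33
```

In the invented example the system is idle 40% of the time, yet the TLP is about 1.33 rather than 0.8, because Equation 1 normalizes by the non-idle fraction only. A single-threaded workload gives a TLP of exactly 1 regardless of how long the system idles.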
These applications have a large user base and are thus representative of current mobile software. They span 10 commonly used categories (shown in Fig. 1a). The testing actions on each application usually last for 30 seconds, which we found long enough to cover all common actions and most typical functions of the application under test. All experiments are repeated at least 5 times, and we observe a low standard deviation in the TLP results. Before testing, we kill all running and background applications to reduce experimental error. Besides single applications, we also choose four applications from the suite and run each concurrently with a set of other applications in the background to simulate multi-tasking scenarios.

III. RESULTS

In this section, we show that current mobile applications have rather low TLP on modern mobile device platforms. We also observe that increasing the number of cores yields only a small return on TLP, and that cores running at higher frequencies exhibit less TLP.

We present the overall TLP results in Fig. 1a. The results demonstrate that:

1) All the applications have some, but quite limited TLP. We do see a TLP higher than 1.2 for almost all the applications under test. However, the parallelism we observed is quite low: for a 4-core system, on average, we see a TLP of 1.4. The applications with high TLP, namely Games, Browser and Navigation, have TLPs around 1.5 to 1.6. Applications like Music and File Browser have rather low TLPs around 1.2 to 1.3.

2) Increasing the number of cores has little return on TLP. On average, TLP increases by 4.5%

978-1-4799-3606-9/14/$31.00 ©2014 IEEE