CAMEL: Cybersecurity, Autonomous system, and Machine learning Engineering Lab


Multi-core Multithreading Architecture

Related Publications


MICPRO
A Resource Utilization Based Instruction Fetch Policy for SMT Processors (ScienceDirect)
Lichen Weng and Chen Liu
Microprocessors and Microsystems (MICPRO), Vol. 39, Issue 1, pp. 1-10, February 2015
DOI: 10.1016/j.micpro.2014.10.001

Abstract: Simultaneous Multithreading (SMT) architectures are proposed to better exploit on-chip parallelism, which captures the essence of performance improvement in modern processors. SMT overcomes the limits of a single thread by fetching and executing instructions from multiple threads in a shared fashion. Long-latency operations, however, still cause inefficiency in SMT processors. When instructions have to wait for data from a lower level of the memory hierarchy, their dependent instructions cannot proceed and hence continue occupying the shared on-chip resources for an extended number of clock cycles. This introduces undesired inter-thread interference in SMT processors, which in turn degrades overall system throughput and average thread performance. In practice, instruction fetch policies take responsibility for assigning thread priority at the fetch stage, in an effort to better distribute the shared resources among the threads in the same core and to cope with long-latency operations and other runtime behavior of the threads.
In this paper we propose an instruction fetch policy, RUCOUNT, which considers the resource utilization of each individual thread in the prioritization process. The proposed policy observes instructions in the front-end stages of the pipeline as well as low-level data misses to summarize the resource utilization for thread management. Higher priority is granted to the thread(s) with the least utilized resources, so that overall resources are distributed more efficiently in SMT processors. As a result, it has two unique features compared to other policies: one is observing the hardware resources comprehensively, and the other is monitoring limited resource entries. Our experimental results demonstrate that RUCOUNT is 20% better than ICOUNT, 10% better than Stall, 8% better than DG, and 3% better than DWarn in terms of average performance. Considering that its hardware overhead is at a similar level to ICOUNT and DWarn, our proposed instruction fetch policy RUCOUNT is superior among the studied policies.
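
The prioritization idea can be illustrated with a simple scoring rule. This is a hypothetical sketch, not the paper's exact RUCOUNT formulation: the field names and the miss weight of 4 are assumptions made for illustration only.

```python
def rank_threads(threads, miss_weight=4):
    """Order thread ids so the least-utilized thread fetches first.

    `threads` maps tid -> {"front_end": occupied front-end pipeline entries,
                           "misses": outstanding lower-level data misses}.
    The miss weight of 4 is an illustrative assumption, reflecting that a
    pending low-level miss ties up resources for many cycles.
    """
    def utilization(tid):
        t = threads[tid]
        return t["front_end"] + miss_weight * t["misses"]
    return sorted(threads, key=utilization)

# Thread 1 holds more front-end entries and has pending misses,
# so thread 0 is granted fetch priority.
order = rank_threads({0: {"front_end": 5, "misses": 0},
                      1: {"front_end": 12, "misses": 3}})
```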

IJSSOE
Adaptive Virtual Machine Management in the Cloud: A Performance-Counter-Driven Approach (acm)
Gildo Torres and Chen Liu
International Journal of Systems and Service-Oriented Engineering (IJSSOE), Vol. 4, Issue 2, pp. 28-43, April 2014
DOI: 10.4018/ijssoe.2014040103

Abstract: The success of cloud computing technologies heavily depends on both the underlying hardware and the system software support for virtualization. In this study, we propose to elevate the capability of the hypervisor to monitor and manage co-running virtual machines (VMs) by capturing their dynamic behavior at runtime, and to adaptively schedule and migrate VMs across cores to minimize contention on system resources and hence maximize system throughput. Implemented at the hypervisor level, our proposed scheme does not require any changes or adjustments to the VMs themselves or the applications running inside them, and only minimal changes to the host OS. It also does not require any changes to existing hardware structures. These facts reduce the complexity of our approach and improve portability at the same time. The main intuition behind our approach is that because the host OS schedules entire virtual machines, it loses sight of the processes and threads running within the VMs; it only sees the averaged resource demands from the past time slice. In our design, we sought to recreate some of this low-level information by using performance counters and simple virtual machine introspection techniques. We implemented an initial prototype on the Kernel-based Virtual Machine (KVM), and our experimental results show that the presented approach has great potential to improve overall system throughput in the cloud environment.
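
A minimal sketch of such a counter-driven decision is shown below. The counter names (`llc_misses`, `instructions`) and the "migrate the most cache-hungry VM to the least-contended core" rule are illustrative assumptions, not the paper's implementation.

```python
def pick_migration(vm_counters, core_load):
    """Choose a (vm, target_core) pair: migrate the most cache-hungry VM
    to the least-contended core.

    `vm_counters`: vm -> {"llc_misses": ..., "instructions": ...}, sampled
    from hardware performance counters over the last scheduling interval.
    `core_load`: core -> aggregate contention score on that core.
    """
    # Misses-per-instruction approximates how much cache pressure a VM exerts.
    hungriest = max(vm_counters,
                    key=lambda v: vm_counters[v]["llc_misses"]
                    / vm_counters[v]["instructions"])
    target = min(core_load, key=core_load.get)
    return hungriest, target
```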

CF 2013
Scheduling Optimization in Multicore Multithreaded Microprocessors through Dynamic Modeling (acm)
Lichen Weng, Chen Liu, and Jean-Luc Gaudiot
ACM International Conference on Computing Frontiers (CF 2013), Ischia, Italy, May 14-16, 2013
DOI: 10.1145/2482767.2482774
Abstract: Complexity in resource allocation grows dramatically as multiple cores and threads are implemented on Multicore Multi-threaded Microprocessors (MMMP). Such complexity is escalated by variations in workload behavior. In an effort to support a dynamic, adaptive, and scalable operating system (OS) scheduling policy for MMMP, we propose architectural strategies that construct linear models to capture workload behavior and then schedule threads according to their resource demands. This paper describes the design in three steps: in the first step we convert a static scheduling policy into a dynamic one, which evaluates the thread mapping pattern at runtime. In the second step we employ regression models to ensure that the scheduling policy is capable of responding to the changing behavior of threads during execution. In the final step we limit the overhead of the proposed policy by adopting a heuristic approach, thus ensuring scalability as core and thread counts grow exponentially. The experimental results validate our proposed model in terms of throughput, adaptability, and scalability. Compared with the baseline static approach, our phase-triggered scheduling policy achieves up to 29% speedup. We also provide a detailed tradeoff study between performance and overhead that system architects can refer to when target systems and specific overhead budgets are given.
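
The regression step can be illustrated with a plain least-squares fit of one runtime metric against another. This is a generic sketch in pure Python; the paper's actual model inputs and coefficients are not reproduced here.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y ~ a*x + b, returning (a, b).

    A linear model like this can relate an observable thread metric
    (e.g., a counter value) to a resource demand to be predicted.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y over variance of x.
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Fit against samples lying exactly on y = 2x + 1.
slope, intercept = fit_linear([0, 1, 2, 3], [1, 3, 5, 7])
```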

JOIN
Introducing the Extremely Heterogeneous Architecture (World Scientific)
Shaoshan Liu, Won W. Ro, Chen Liu, Alfredo C. Salas, Christophe Cerin, Jian-Jun Han, and Jean-Luc Gaudiot
Journal of Interconnection Networks (JOIN), Vol. 13, No. 03n04, September-December 2012
DOI: 10.1142/S0219265912500107

Abstract: The computer industry is moving towards two extremes: extremely high-performance, high-throughput cloud computing, and low-power mobile computing. Cloud computing, while providing high performance, is very costly. Google and Microsoft Bing spend billions of dollars each year to maintain their server farms, mainly due to high power bills. On the other hand, mobile computing operates under a very tight energy budget, yet end users demand ever-increasing performance on these devices. This trend indicates that conventional architectures are not able to deliver high performance and low power consumption at the same time, and we need a new architecture model to address the needs of both extremes. In this paper, we thus introduce our Extremely Heterogeneous Architecture (EHA) project: EHA is a novel architecture that incorporates both general-purpose and specialized cores on the same chip. The general-purpose cores take care of generic control and computation. The specialized cores, including GPUs, hard accelerators (ASIC accelerators), and soft accelerators (FPGAs), are designed for accelerating frequently used or heavyweight applications. When acceleration is not needed, the specialized cores are turned off to reduce power consumption. We demonstrate that EHA is able to improve performance through acceleration and, at the same time, reduce power consumption. Since EHA is a heterogeneous architecture, it is suitable for accelerating heterogeneous workloads on the same chip; for example, data centers and clouds provide many services, including media streaming, searching, indexing, and scientific computations.
The ultimate goal of the EHA project is two-fold: first, to design a chip that is able to run different cloud services on it, and through this design greatly reduce the cost, both recurring and non-recurring, of data centers/clouds; second, to design a lightweight EHA that runs on mobile devices, providing end users with an improved experience even under tight battery budget constraints.

TEMM 2011
PCOUNT: A Power Aware Fetch Policy in Simultaneous Multithreading Processors (acm, ieee)
Lichen Weng, Gang Quan, and Chen Liu
The 1st International IEEE Workshop on Thermal Modeling and Management: Chips to Data Centers (TEMM 2011), in conjunction with IGCC 2011, Orlando, Florida, July 25, 2011
DOI: 10.1109/IGCC.2011.6008578
Abstract: The Simultaneous Multithreading (SMT) architecture improves resource efficiency by scheduling and executing concurrent threads in the same core. Fetch policies are proposed to assign priorities at the fetch stage in order to manage the shared resources. However, most fetch policies omit any study of power consumption, while today's power management schemes focus on multicore processors. Given the growing demand to manage processor power consumption, and the fully shared system resources in an SMT environment, detailed research is required to develop power management for an SMT processor. This paper proposes a power-aware fetch policy, PCOUNT, which evaluates the power consumption of two resource categories in SMT: computation resources and memory-access resources. PCOUNT fetches from the thread with the lowest evaluated power consumption in every CPU cycle, in order to reduce overall power consumption. Furthermore, this paper assesses the studied fetch policies using power efficiency, calculated as evaluated power consumption per unit of system throughput. As a result, PCOUNT improves power efficiency over ICOUNT by 26% and over DWarn by 31% on average. Meanwhile, PCOUNT is able to achieve better overall system throughput and average thread performance than ICOUNT and DWarn.
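
The selection rule can be sketched as follows. The per-category weights and field names are illustrative assumptions; the paper's actual power evaluation is not reproduced here.

```python
def pick_fetch_thread(counts, w_compute=1.0, w_mem=2.0):
    """Choose the thread with the lowest evaluated power this cycle.

    `counts`: tid -> {"compute": instructions occupying computation
                      resources, "mem": instructions occupying
                      memory-access resources}.
    The weights (memory accesses costing 2x compute) are illustrative
    assumptions standing in for a real per-category power estimate.
    """
    return min(counts, key=lambda tid: w_compute * counts[tid]["compute"]
               + w_mem * counts[tid]["mem"])
```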

IEEE GOLD 2010
Future Multicore Multithreading Microprocessor Design (pdf)
Chen Liu
IEEE Graduates of the Last Decade (GOLD) Rush Newsletter, September Issue, p. 13, 2010.

MASVD 2010
Cooperative Virtual Machine Scheduling on Multi-core Multi-threading Systems - A Feasibility Study (ResearchGate)
Dulcardo Arteaga, Ming Zhao, Chen Liu, Pollawat Thanarungroj, and Lichen Weng
Workshop on Micro Architectural Support for Virtualization, Data Center Computing, and Clouds (MASVDC 2010) in conjunction with MICRO-43, Atlanta, GA, December 4, 2010
Abstract: Virtual machines (VMs) and multi-core multi-threading microprocessors (MMMP) are two emerging technologies in software and hardware, respectively, and they are expected to become pervasive in computer systems in the near future. However, the nature of resource sharing on an MMMP introduces contention among VMs that are scheduled onto the cores and threads sharing the processor's computation resources and caches. Such contention can degrade the performance of individual VMs as well as the overall system throughput if not carefully managed. This paper proposes to address this problem through cooperative VM scheduling, which takes processor input into account to schedule VMs across processors and cores in a way that minimizes contention on processor resources and maximizes the total throughput of the VMs. As a first step towards this goal, this paper presents an experiment-based feasibility study of the proposed approach and focuses on the effectiveness of processor-contention-aware VM scheduling. The results confirm that when VMs are scheduled in a way that mitigates their contention on the shared cache, the cache miss rates of the VMs are reduced substantially, and so are the runtimes of the benchmarks.
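
One simple way to realize such contention-aware placement is to pair a high-miss VM with a low-miss VM on each shared-cache domain. The greedy pairing below is an illustrative heuristic, not the paper's scheduler.

```python
def pair_vms(miss_rates):
    """Pair VMs onto shared-cache domains so that a high-miss-rate VM
    shares a cache with a low-miss-rate one.

    `miss_rates`: vm -> measured cache miss rate. Greedy rule (an
    illustrative assumption): repeatedly pair the lowest-miss VM with
    the highest-miss VM still unassigned.
    """
    order = sorted(miss_rates, key=miss_rates.get)
    pairs = []
    while len(order) > 1:
        pairs.append((order.pop(0), order.pop(-1)))
    return pairs
```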

ICPP 2010
On Better Performance from Scheduling Threads according to Resource Demands in MMMP (acm, ieee)
Lichen Weng and Chen Liu
The 6th International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS 2010), in conjunction with ICPP 2010, San Diego, CA, September 13, 2010
DOI: 10.1109/ICPPW.2010.53
Abstract: The Multi-core Multi-threading Microprocessor introduces not only resource sharing among threads in the same core, e.g., computation resources and private caches, but also isolation of those resources between different cores. Moreover, when the Simultaneous Multithreading architecture is employed, the execution resources are fully shared among the concurrently executing threads in the same core, while the isolation worsens as the number of cores increases. Even though fetch policies, which decide how to assign priorities in the fetch stage, are well designed to manage the shared resources within a core, it is actually the scheduling policy that makes the distributed resources available to workloads by deciding how to send their threads to cores. On the other hand, threads consume various resources in different phases, and Cycles Per Instruction spent on Memory (CPImem) is used to express their resource demands. Consequently, aiming at better performance by scheduling according to resource demands, we propose Mix-Scheduling, which evenly mixes threads across cores so as to achieve thread diversity, i.e., CPImem diversity, in every core. As a result, our experiments show a 63% improvement in overall system throughput and a 27% improvement in average thread performance when comparing Mix-Scheduling with the reference Mono-Scheduling policy, which maintains CPImem uniformity among threads in every core. Furthermore, Mix-Scheduling makes an essential step towards shortening load latency, since it reduces the L2 cache miss rate by 6% relative to Mono-Scheduling.
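
One simple way to obtain CPImem diversity in every core is to sort threads by CPImem and deal them round-robin across cores. This is an illustrative realization of the mixing idea, not necessarily the paper's exact assignment rule.

```python
def mix_schedule(cpimem, n_cores):
    """Assign threads to cores so each core mixes low- and high-CPImem
    threads.

    `cpimem`: tid -> measured Cycles Per Instruction spent on Memory.
    Sorting then dealing round-robin (an assumed rule) spreads
    memory-bound and compute-bound threads across all cores.
    """
    order = sorted(cpimem, key=cpimem.get)
    cores = [[] for _ in range(n_cores)]
    for i, tid in enumerate(order):
        cores[i % n_cores].append(tid)
    return cores
```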

Handbook of Research on Scalable Computing Technologies Chapter
Simultaneous Multi-Threading (SMT) Microarchitecture (IGI Global)
Chen Liu, Xiaobin Li, Shaoshan Liu, and Jean-Luc Gaudiot
Chapter 24 in "Handbook of Research on Scalable Computing Technologies", Edited by Kuan-Ching Li, Ching-Hsien Hsu, Laurence Tianruo Yang, Jack Dongarra, and Hans Zima, IGI Global, July 2009
DOI: 10.4018/978-1-60566-661-7


ICA3PP 2009
The Impact of Resource Sharing Control on the Design of Multicore Processors (acm, springer)
Chen Liu and Jean-Luc Gaudiot
The 9th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2009), Vol. 5574/2009, pp. 315-326, Taipei, Taiwan, June 8-11, 2009
DOI: 10.1007/978-3-642-03095-6_31
Abstract: One major obstacle faced by designers entering the multicore era is how to harness the massive computing power these cores provide. Since Instruction-Level Parallelism (ILP) is inherently limited, a single thread is not capable of efficiently utilizing the resources of a single core. Hence, the Simultaneous MultiThreading (SMT) microarchitecture can be introduced in an effort to achieve improved system resource utilization and correspondingly higher instruction throughput through the exploitation of Thread-Level Parallelism (TLP) as well as ILP. However, when multiple threads execute concurrently in a single core, they automatically compete for system resources. Our research shows that, without control over the number of entries each thread can occupy in system resources such as the instruction fetch queue and/or the reorder buffer, a scenario called "mutual-hindrance" execution takes place. Conversely, introducing active resource sharing control mechanisms produces the opposite situation ("mutual-benefit" execution), with possibly significant performance improvement and lower cache miss frequency. This demonstrates that active resource sharing control is essential for future multicore multithreading microprocessor design.

ACSAC 2008
Resource sharing control in Simultaneous MultiThreading microarchitectures (ieee)
Chen Liu and Jean-Luc Gaudiot
The 13th IEEE Asia-Pacific Computer Systems Architecture Conference (ACSAC 2008), Hsinchu, Taiwan, August 04-06, 2008
DOI: 10.1109/APCSAC.2008.4625432
Abstract: Simultaneous multithreading (SMT) achieves improved system resource utilization and accordingly higher instruction throughput because it exploits thread-level parallelism (TLP) in addition to conventional instruction-level parallelism (ILP). The key to high-performance SMT is to optimize the distribution of shared system resources among the threads. However, the existing dynamic sharing mechanism has no control over the resource distribution, which can allow one thread to grab too many resources and clog the pipeline. Existing fetch policies address the resource distribution problem only indirectly. In this work, we strive to quantitatively determine the balance between controlled resource allocation and dynamic sharing of different system resources, and their impact on the performance of SMT processors. We find that controlling the resource sharing of either the instruction fetch queue (IFQ) or the reorder buffer (ROB) is not sufficient if implemented alone. However, controlling the resource sharing of both the IFQ and the ROB can yield an average performance gain of 38% compared with the dynamic sharing case. The average L1 D-cache miss rate is reduced by 33%, and the average time an instruction resides in the pipeline is reduced by 34%. This demonstrates the power of the resource sharing control mechanism we propose.
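
The control mechanism amounts to an admission check on both queues: an instruction from a thread enters only if that thread stays under its per-thread cap in the IFQ and the ROB. The sketch below is illustrative; the cap values and dictionary representation are assumptions.

```python
def admit(tid, ifq_used, rob_used, ifq_cap, rob_cap):
    """Admit a thread's instruction only if the thread remains under its
    per-thread entry cap in BOTH the IFQ and the ROB.

    `ifq_used`/`rob_used`: tid -> entries the thread currently occupies.
    Cap values are illustrative assumptions; controlling only one of the
    two queues would leave the other free to be monopolized.
    """
    return (ifq_used.get(tid, 0) < ifq_cap
            and rob_used.get(tid, 0) < rob_cap)
```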

IJPP
The Impact of Speculative Execution on SMT Processors (acm, springer)
Dongsoo Kang, Chen Liu, and Jean-Luc Gaudiot
International Journal of Parallel Programming (IJPP), Vol. 36, No. 4, pp. 361-385, August 2008
DOI: 10.1007/s10766-007-0052-3

Abstract: By executing two or more threads concurrently, Simultaneous MultiThreading (SMT) architectures are able to exploit both Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP) from the increased number of in-flight instructions fetched from multiple threads. However, due to incorrect control speculation, a significant number of these in-flight instructions are discarded from the pipelines of SMT processors (a direct consequence of these pipelines getting wider and deeper). Although increasing the accuracy of branch predictors may reduce the number of instructions so discarded, the prediction accuracy cannot easily be scaled up, since aggressive branch prediction schemes depend strongly on the predictability inherent in the application programs. In this paper, we present an efficient thread scheduling mechanism for SMT processors, called SAFE-T (Speculation-Aware Front-End Throttling): it is easy to implement and allows an SMT processor to selectively perform speculative execution of threads according to the confidence level of branch predictions, hence preventing wrong-path instructions from being fetched. SAFE-T provides an average reduction of 57.9% in the number of discarded instructions and improves instructions-per-cycle (IPC) performance by 14.7% on average over the ICOUNT policy across the multi-programmed workloads we simulate.
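
The throttling idea can be sketched as a per-cycle fetch gate: a thread is paused while it has too many unresolved low-confidence branches in flight. The threshold and the counter representation are illustrative assumptions, not the paper's parameters.

```python
def safe_t_gate(threads, threshold=2):
    """Return the set of threads allowed to fetch this cycle.

    `threads`: tid -> number of in-flight branch predictions flagged as
    low-confidence by the confidence estimator. A thread is throttled
    while that count reaches the threshold (2 here is an assumption),
    so likely wrong-path instructions are never fetched.
    """
    return {tid for tid, low_conf in threads.items() if low_conf < threshold}
```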

APPT 2005
Static Partitioning vs Dynamic Sharing of Resources in Simultaneous MultiThreading Microarchitectures (acm, springer)
Chen Liu and Jean-Luc Gaudiot
The 6th International Workshop on Advanced Parallel Processing Technologies (APPT 2005), Vol. 3756/2005, pp. 81-90, Hong Kong, China, October 27-28, 2005
DOI: 10.1007/11573937_11
Abstract: Simultaneous MultiThreading (SMT) achieves better system resource utilization and higher performance because it exploits Thread-Level Parallelism (TLP) in addition to “conventional” Instruction-Level Parallelism (ILP). Theoretically, system resources in every pipeline stage of an SMT microarchitecture can be dynamically shared. However, in commercial implementations, all the major queues are statically partitioned. From an implementation point of view, static partitioning of resources is easier to implement and has lower hardware overhead and power consumption. In this paper, we strive to quantitatively determine the tradeoff between static partitioning and dynamic sharing. We find that static partitioning of either the instruction fetch queue (IFQ) or the reorder buffer (ROB) is not sufficient if implemented alone (a 3% and 9% performance decrease, respectively, in the worst case compared with dynamic sharing), while statically partitioning both the IFQ and the ROB achieves an average performance gain of at least 9%, reaching 148% on floating-point benchmarks, when compared with dynamic sharing. We varied the number of functional units in an effort to isolate the reason for this performance improvement, and found that statically partitioning both queues outperformed all the other partitioning mechanisms under the same system configuration. This demonstrates that the performance gain is achieved by moving from dynamic sharing to static partitioning of the system resources.
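
The two allocation disciplines can be contrasted with a toy queue model: under static partitioning each thread owns a fixed slice, while under dynamic sharing any thread may consume entries until the queue fills. This is an illustrative model of the tradeoff, not the paper's simulator.

```python
class SharedQueue:
    """Toy model of a shared queue such as the IFQ or ROB, contrasting
    static partitioning with dynamic sharing (an illustrative sketch)."""

    def __init__(self, capacity, n_threads, static=True):
        self.capacity = capacity
        self.n_threads = n_threads
        self.static = static
        self.used = {}  # tid -> entries currently held

    def try_alloc(self, tid):
        if self.static:
            # Each thread owns a fixed slice of the queue.
            if self.used.get(tid, 0) >= self.capacity // self.n_threads:
                return False  # this thread's slice is full
        elif sum(self.used.values()) >= self.capacity:
            return False  # dynamically shared queue is completely full
        self.used[tid] = self.used.get(tid, 0) + 1
        return True
```

Under dynamic sharing a single stalled thread can occupy the entire queue and starve the others, which is exactly the behavior static partitioning prevents.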