2.5 Energy Efficient Systems and Architectures


Date: Tuesday 15 March 2016
Time: 11:30 - 13:00
Location / Room: Konferenz 3

Chair:
Mladen Berekovic, TU Braunschweig, DE

Co-Chair:
Rolf Ernst, TU Braunschweig, DE

This session explores novel technologies to reduce the energy and power consumption of computing systems. The first paper explores system-level DVFS approaches that maximize performance within a fixed thermal envelope. The second paper introduces a highly introspective system that can monitor and optimize its own energy usage at run time. The third paper presents a control algorithm that exploits a specialized SRAM cell design to trade off performance against reliability. The fourth paper finds new ways to better utilize GPU power resources by co-scheduling synergistic kernels.

Time / Label / Presentation Title / Authors
11:30 / 2.5.1 / A DISCRETE THERMAL CONTROLLER FOR CHIP-MULTIPROCESSORS
Speaker:
Yingnan Cui, Nanyang Technological University, SG
Authors:
Yingnan Cui1, Wei Zhang2 and Bingsheng He1
1Nanyang Technological University, SG; 2Hong Kong University of Science and Technology, HK
Abstract
The ever-increasing power density of modern chip-multiprocessors (CMPs) poses challenges for thermal management. Closed-loop thermal controllers offer fast response, high robustness, and high accuracy, but most previously proposed closed-loop automatic thermal controllers are designed with continuous control theory. Thermal controllers for microprocessors are, however, discrete by nature, and the traditional design methodology cannot analyze their discrete characteristics, such as the influence of the sampling frequency and signal distortion. In this paper, we propose an automatic thermal controller for microprocessors designed with discrete control theory. By explicitly accounting for the discrete nature of the thermal control system, our discrete thermal controller increases CMP performance by reducing the sampling frequency and improves the control quality of the thermal control system. Compared with state-of-the-art thermal controllers, our discrete thermal controller achieves up to a 50% reduction in sampling frequency and up to 20% higher CMP performance.
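The abstract does not give the controller's equations. As an illustration only, a discrete-time PI thermal controller closing the loop over a toy first-order thermal model might look like the sketch below; the gains, sampling period, and thermal-model parameters are all invented for illustration and are not the paper's design.

```python
# Illustrative discrete-time PI thermal controller (NOT the paper's design).
# Each sampling period it reads the core temperature and returns a power cap.

class DiscretePIThermalController:
    def __init__(self, t_target, kp, ki, dt, p_min, p_max):
        self.t_target = t_target  # target temperature (deg C)
        self.kp, self.ki = kp, ki  # proportional / integral gains
        self.dt = dt               # sampling period (s): the key discrete-domain knob
        self.p_min, self.p_max = p_min, p_max
        self.integral = 0.0

    def step(self, t_measured):
        """Return a power cap (W) from one temperature sample."""
        error = self.t_target - t_measured
        self.integral += error * self.dt
        p = self.kp * error + self.ki * self.integral
        # clamp the command to the feasible power range
        return max(self.p_min, min(self.p_max, p))

# Toy closed loop: first-order plant  tau * dT/dt = P*R_th - (T - T_amb)
ctrl = DiscretePIThermalController(t_target=80.0, kp=2.0, ki=0.5,
                                   dt=0.01, p_min=5.0, p_max=50.0)
temp, t_amb, r_th, tau = 60.0, 40.0, 1.5, 0.2
for _ in range(5000):                      # 50 s of simulated time
    power = ctrl.step(temp)
    temp += (power * r_th - (temp - t_amb)) * ctrl.dt / tau
print(f"{temp:.1f}")  # settles near the 80 deg C target
```

Lengthening `dt` in this sketch is the discrete-domain trade-off the paper studies: fewer samples mean less sensing/actuation overhead, but too coarse a period destabilizes the loop.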

12:00 / 2.5.2 / SWALLOW: BUILDING AN ENERGY-TRANSPARENT MANY-CORE EMBEDDED REAL-TIME SYSTEM
Speaker:
Steve Kerrison, University of Bristol, GB
Authors:
Steve Kerrison and Simon Hollis, University of Bristol, GB
Abstract
Swallow is a many-core platform of interconnected embedded real-time processors with time-deterministic execution and a cache-less memory subsystem. Its largest current configuration comprises 480 32-bit processors. It is open source, designed from the ground up to allow the exploration of flexibility, scalability, and energy efficiency in large systems of embedded processors, and it enables the behavior of various structures of parallel programs to be explored. It serves as a proof of concept and design example for other potential systems of this kind. We present the energy-transparency features and proportional energy scaling that allow the system to be expanded beyond hundreds of cores, and we discuss the design choices, construction, and novel network implementation of Swallow. Currently, the system provides up to 240 GIPS, with each core consuming 71-193 mW depending on workload. Its power per instruction is lower than that of almost all systems of comparable scale. We discuss the challenges of utilizing this system efficiently, particularly communication/computation ratios, and give recommendations for future systems and their software.
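The abstract's headline figures can be sanity-checked with back-of-envelope arithmetic. The sketch below combines the quoted core count, peak instruction rate, and per-core power range; taking the midpoint of the 71-193 mW range as the typical per-core power is our assumption, not a figure from the paper.

```python
# Back-of-envelope check of Swallow's figures from the abstract
# (480 cores, up to 240 GIPS, 71-193 mW per core; the midpoint is assumed).
cores = 480
peak_ips = 240e9                  # instructions per second, system-wide
mw_per_core = (71 + 193) / 2      # assumed mid-range workload power (mW)

total_power_w = cores * mw_per_core / 1000.0
energy_per_insn_pj = total_power_w / peak_ips * 1e12
print(f"{total_power_w:.1f} W total, {energy_per_insn_pj:.0f} pJ/instruction")
```

Under this assumption the whole 480-core system draws on the order of 60 W, i.e. a few hundred picojoules per instruction at peak throughput, which is the "power per instruction" metric the abstract compares against other systems.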

12:30 / 2.5.3 / A NOVEL CACHE-UTILIZATION BASED DYNAMIC VOLTAGE FREQUENCY SCALING (DVFS) MECHANISM FOR RELIABILITY ENHANCEMENTS
Speaker:
Yen-Hao Chen, National Tsing Hua University, TW
Authors:
Yen-Hao Chen1, Yi-Lun Tang1, Yi-Yu Liu2, Allen C.-H. Wu3 and TingTing Hwang1
1National Tsing Hua University, TW; 2Yuan Ze University, TW; 3Jiangnan University, CN
Abstract
We propose a cache architecture using a 7T/14T SRAM [1] and a control mechanism for reliability enhancement. Our control mechanism differs from conventional DVFS methods in that it considers not only CPI behavior but also cache utilization, for which we propose a novel metric. Experimental results show that, under ultra-low-voltage operation, our method incurs a thousand-fold fewer bit-error occurrences than conventional DVFS methods. Moreover, it not only incurs no performance or energy overhead but also achieves, on average, a 5.1% performance improvement and a 5% energy reduction compared with conventional DVFS methods.
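The abstract only names the two inputs of the decision (CPI and cache utilization), not the policy itself. A minimal sketch of a DVFS decision that consults both signals is given below; the thresholds, the two-level V/f abstraction, and the mapping to the 7T/14T SRAM modes are all invented for illustration.

```python
# Illustrative sketch (NOT the paper's algorithm): a DVFS policy that drops to
# the low-voltage, reliability-oriented SRAM mode only when the core is both
# stalled (high CPI) and making little use of the cache, so slowing down costs
# little performance. Thresholds are invented for illustration.

def choose_vf_level(cpi, cache_utilization,
                    cpi_threshold=2.0, util_threshold=0.3):
    """Return 'low' (low V/f, robust SRAM mode) or 'high' (nominal V/f)."""
    if cpi > cpi_threshold and cache_utilization < util_threshold:
        return "low"   # memory-bound and cache mostly idle: safe to slow down
    return "high"      # compute-bound or cache busy: keep nominal V/f

# A CPI-only policy would already have chosen 'low' in the second case below;
# consulting cache utilization keeps the cache at nominal voltage instead.
print(choose_vf_level(cpi=3.5, cache_utilization=0.1))  # low
print(choose_vf_level(cpi=3.5, cache_utilization=0.8))  # high
```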

12:45 / 2.5.4 / EFFICIENT KERNEL MANAGEMENT ON GPUS
Speaker:
Xiuhong Li, Peking University, CN
Authors:
Xiuhong Li and Yun Liang, Peking University, CN
Abstract
As the complexity of applications continues to grow, each new generation of GPUs has been equipped with advanced architectural features and more resources to sustain its performance-acceleration capability. Recent GPUs feature concurrent kernel execution, which is designed to improve resource utilization by executing multiple kernels simultaneously. However, prior systems achieve only limited performance improvement because they neither optimize the thread-level parallelism (TLP) of, nor model the resource contention among, the concurrently executing kernels. In this paper, we design a framework that optimizes performance and energy efficiency for multiple-kernel execution on GPUs. It employs two key techniques: first, an algorithm that adjusts the TLP of the concurrently executing kernels; second, cache bypassing to mitigate cache contention. Experiments indicate that our framework improves performance by 1.42X on average (and energy efficiency by 1.33X on average) compared with the default concurrent kernel execution framework.
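The core idea of adjusting TLP for co-scheduled kernels can be illustrated with a toy resource-split search. The sketch below brute-forces how many thread blocks to give each of two kernels under a shared budget, using invented per-kernel scaling curves as stand-ins for profiled data; it is not the paper's algorithm.

```python
# Illustrative sketch (NOT the paper's framework): choose the thread-block
# split between two co-scheduled GPU kernels by exhaustive search over a
# shared block budget, maximizing a simple additive throughput model.

def best_split(total_blocks, perf_a, perf_b):
    """perf_a/perf_b map a block count to modeled kernel throughput."""
    best = (0, total_blocks)
    best_score = perf_a(0) + perf_b(total_blocks)
    for a in range(total_blocks + 1):
        score = perf_a(a) + perf_b(total_blocks - a)
        if score > best_score:
            best, best_score = (a, total_blocks - a), score
    return best, best_score

# Invented curves: kernel A is memory-bound and saturates at 4 blocks,
# kernel B keeps scaling, so giving B every extra block past A's knee wins.
perf_a = lambda n: min(n, 4) * 10   # plateaus at 4 blocks
perf_b = lambda n: 6 * n            # scales linearly
split, score = best_split(8, perf_a, perf_b)
print(split, score)  # (4, 4) 64
```

A real system would replace the additive model with one that also captures cache contention, which is precisely why the paper pairs TLP adjustment with cache bypassing.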

13:00 / IP1-7, 83 / (Best Paper Award Candidate)
MACHINE LEARNED MACHINES: ADAPTIVE CO-OPTIMIZATION OF CACHES, CORES, AND ON-CHIP NETWORK
Speaker:
Rahul Jain, Indian Institute of Technology Delhi, IN
Authors:
Rahul Jain1, Preeti Ranjan Panda1 and Sreenivas Subramoney2
1Indian Institute of Technology Delhi, IN; 2Intel, IN
Abstract
Modern multicore architectures require runtime optimization techniques to address mismatches between the dynamic resource requirements of different processes and the runtime allocation. Choosing among multiple optimizations at runtime is complex because their effects are non-additive, which makes the adaptiveness of machine-learning techniques useful. We present a novel method, Machine Learned Machines (MLM), that uses online reinforcement learning (RL) to perform dynamic partitioning of the last-level cache (LLC) together with dynamic voltage and frequency scaling (DVFS) of the core and uncore (interconnection network and LLC). We show that this co-optimization yields a much lower energy-delay product (EDP) than any of the techniques applied individually. The results show an average 19.6% EDP improvement and 2.6% execution-time improvement over the baseline.
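The abstract names the ingredients (online RL, core/uncore DVFS, reward tied to EDP) without giving the learner. The sketch below shows the general shape of such an agent as a tiny single-state (bandit-style) Q-learning loop rewarded with negative EDP; the action set, toy EDP model, and all constants are invented for illustration and are not the MLM method.

```python
# Illustrative single-state Q-learning sketch (NOT the MLM method): the agent
# picks a (core V/f, uncore V/f) action and is rewarded with negative EDP,
# learning which combination minimizes the energy-delay product.
import random

random.seed(0)
actions = [(vf_core, vf_uncore) for vf_core in (0, 1) for vf_uncore in (0, 1)]
q = {a: 0.0 for a in actions}       # one-state Q table for brevity
alpha, epsilon = 0.2, 0.1           # learning rate, exploration rate

def edp(vf_core, vf_uncore):
    # Toy model: a fast core cuts delay a lot; a fast uncore mostly adds energy.
    delay = 2.0 - 0.8 * vf_core - 0.1 * vf_uncore
    energy = 1.0 + 0.5 * vf_core + 0.6 * vf_uncore
    return delay * energy

for _ in range(500):
    a = random.choice(actions) if random.random() < epsilon else max(q, key=q.get)
    reward = -edp(*a)                       # lower EDP -> higher reward
    q[a] += alpha * (reward - q[a])         # bandit update (no next state)

best = max(q, key=q.get)
print(best)  # the agent settles on fast core + slow uncore here
```

A real co-optimizer would use a state (workload phase features) and a much larger action space spanning LLC partition sizes as well, which is what makes the non-additive interactions hard to hand-tune.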

13:00 / End of session
Lunch Break in Großer Saal + Saal 1