5.4 Architectural-level Low-power Design

Time

Label

Presentation Title
Authors

08:30

5.4.1

MULTI-STORY POWER DISTRIBUTION NETWORKS FOR GPUS
Speaker:
Mark Gottscho, UCLA, US
Authors:
Qixiang Zhang¹, Liangzhen Lai², Mark Gottscho³ and Puneet Gupta³
¹Zhejiang University, CN; ²ARM/UCLA, US; ³UCLA, US
Abstract
High-performance chips require many power pins to support large currents, which increases fabrication cost, limits scalability, and degrades power efficiency. Multi-story serial power distribution networks (PDNs) are a promising approach to reducing pin counts and power losses. We study the feasibility of 2-story PDNs for graphics processing units (GPUs). These PDNs use either an auxiliary off-chip regulator or integrated on-die supercapacitors to stabilize the virtual rail voltage. Static SIMT thread scheduling (SSTS) and dynamic current compensation (DCC) can reduce transient impedance mismatch when the auxiliary regulator is omitted. Simulation results show that compared to a traditional 1-story design, our 2-story GPU architectures can reduce the required number of core power pins by up to 2X, power losses in the PDN by up to 3.6X, and/or maximum voltage swing by up to 2X without any performance degradation. Our results demonstrate the efficiency and cost advantages of multi-story PDNs for GPUs without any impact on performance.
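The pin-count and loss figures above follow from simple series-stacking arithmetic, sketched below. This is an idealized illustration, not the authors' simulation setup: the function name, the 200 W and 1.0 V operating point, and the fixed 1 mΩ delivery-path resistance are assumptions chosen for the example. With perfectly balanced stories, stacking two stories halves the delivered current and cuts resistive loss by 4X, so the reported "up to 2X" fewer pins and "up to 3.6X" lower PDN losses sit at or below these ideal bounds.

```python
def pdn_current_and_loss(power_w, core_voltage_v, stories, path_resistance_ohm):
    """Ideal series-stacked PDN: n balanced stories share one current path.

    Assumptions (hypothetical, for illustration only): every story draws the
    same power, and the delivery-path resistance does not change with stacking.
    """
    supply_voltage_v = stories * core_voltage_v      # stories are in series
    current_a = power_w / supply_voltage_v           # I = P / V
    loss_w = current_a ** 2 * path_resistance_ohm    # resistive loss = I^2 * R
    return current_a, loss_w


# Hypothetical operating point: 200 W of core power at 1.0 V per story.
i1, loss1 = pdn_current_and_loss(200.0, 1.0, stories=1, path_resistance_ohm=0.001)
i2, loss2 = pdn_current_and_loss(200.0, 1.0, stories=2, path_resistance_ohm=0.001)

print(f"current ratio (1-story / 2-story): {i1 / i2:.2f}")
print(f"loss ratio    (1-story / 2-story): {loss1 / loss2:.2f}")
```

Since the number of power pins scales roughly with the current they must carry, the halved current is what allows the pin count to drop in this simplified view.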

09:00

5.4.2

ENERGY-EFFICIENT CACHE MEMORIES USING A DUAL-VT 4T SRAM CELL WITH READ-ASSIST TECHNIQUES
Speaker:
Massoud Pedram, University of Southern California, US
Authors:
Alireza Shafaei Bejestan and Massoud Pedram, University of Southern California, US
Abstract
To improve the energy efficiency of cache memories, this paper presents a static random access memory (SRAM) cell composed of four transistors using dual-Vt FinFET devices. The proposed 4T SRAM cell is designed by (i) removing the pull-down transistors of the standard 6T SRAM, and (ii) using low-leakage high-Vt devices for the pull-up transistors and fast low-Vt devices for the access transistors. This dual-Vt design simultaneously improves hold and write characteristics, but results in a destructive read operation. Accordingly, read-assist techniques are employed to ensure a non-destructive and robust read operation. A selective row address decoder is also proposed to prevent undesired write operations in half-selected cells. Compared with its all-single-fin 6T counterpart, the 4T SRAM cell has a 25% smaller layout area with an aspect ratio closer to one. Furthermore, using 7nm FinFET devices with a nominal supply voltage of 0.45V, the 4T SRAM cell achieves 3.5X lower cell leakage power. Because of these features, the energy consumption of a 32KB L1 (256KB L2) cache memory using the 4T SRAM cell is reduced by 18% (2X) compared with its 6T counterpart, with 35% (19%) higher cache access frequency.

09:30

5.4.3

LEARNING-BASED DYNAMIC RELIABILITY MANAGEMENT FOR DARK SILICON PROCESSOR CONSIDERING EM EFFECTS
Speaker:
Sheldon X.-D. Tan, University of California, Riverside, US
Authors:
Taeyoung Kim¹, Xin Huang¹, Hai-Bao Chen², Valeriy Sukharev³ and Sheldon X.-D. Tan¹
¹University of California, Riverside, US; ²Shanghai Jiao Tong University, CN; ³Mentor Graphics Corporation, US
Abstract
In this article, we propose a new dynamic reliability management (DRM) technique for emerging dark silicon manycore processors. We formulate our DRM problem as minimizing the energy consumption subject to the reliability, performance and thermal constraints. The new approach is based on a newly proposed physics-based electromigration (EM) reliability model to predict the EM reliability of full-chip power grid networks. We consider thermal design power (TDP) as the power constraint for a dark silicon manycore processor. We employ both dynamic voltage and frequency scaling (DVFS) and dark silicon core using ON/OFF pulsing action as the two control knobs. To solve the problem, we apply the adaptive Q-learning based method, which is suitable for runtime operation as it can provide cost-effective yet good solutions. A large class of multithreaded applications is used as the benchmark to validate and compare the proposed dynamic reliability management methods. Experimental results on a 64-core dark silicon chip show that the proposed DRM algorithm can effectively reduce the energy consumption of a dark silicon manycore system when the system is not tightly constrained. The proposed method can outperform a simple global DVFS method significantly in this case.
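As a rough illustration of the Q-learning idea mentioned in the abstract, the sketch below trains a plain tabular Q-learning agent to pick a frequency level per workload level. It is not the authors' adaptive method: the state and action sets, the quadratic energy cost, the penalty of 20 for missing the performance demand, and all function names are hypothetical, and the EM reliability model, TDP constraint, and ON/OFF pulsing knob are not modeled.

```python
import random


def toy_reward(state, action):
    """Hypothetical reward: negative energy, plus a penalty when the chosen
    frequency level (action) is below the workload demand (state)."""
    energy = (action + 1) ** 2
    penalty = 20 if action < state else 0
    return -(energy + penalty)


def train_q_table(n_states=3, n_actions=3, steps=20000,
                  alpha=0.05, gamma=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration on the toy problem."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    state = rng.randrange(n_states)
    for _ in range(steps):
        # Explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: q[state][a])
        reward = toy_reward(state, action)
        # Toy assumption: the next workload level is random and does not
        # depend on the chosen action.
        next_state = rng.randrange(n_states)
        # Standard Q-learning update toward the bootstrapped target.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


def greedy_policy(q):
    """Best-known action for each state according to the learned table."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q]


print(greedy_policy(train_q_table()))
```

In this toy setup the learned policy should settle on the lowest frequency level that still meets each workload's demand, which mirrors the abstract's goal of minimizing energy subject to a performance constraint.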

10:00

IP2-8, 245

SEQUENTIAL ANALYSIS DRIVEN RESET OPTIMIZATION TO IMPROVE POWER, AREA AND ROUTABILITY
Speaker:
Srihari Yechangunja, Mentor Graphics Corporation, IN
Authors:
Srihari Yechangunja¹, Raj Shekhar¹, Mohit Kumar¹, Nikhil Tripathi¹, Abhishek Ranjan¹, Abhishek Mittal¹, Jianfeng Liu², Minyoung Mo², Kyungtae Do², Jung Yun Choi² and SungHo Park²
¹Mentor Graphics Corporation, IN; ²S.LSI, Samsung Electronics Co. Ltd, KR
Abstract
Resets are required in a design to initialize the hardware for system operation and to force it into a known state for simulation or to recover from an error. Given increasing design complexity and time-to-market pressures, identifying the registers that do not require resets is extremely challenging. In this paper, we present a novel algorithm that uses observability-based sequential analysis to identify the registers in a design that do not require resets. With the proposed algorithm, we have seen that in some cases 70% of the registers in a design have redundant resets. Further, removing the redundant resets on registers yields up to 22% sequential power savings and up to 3% post-layout area reduction.

10:00

End of session
Coffee Break in Exhibition Area

Visit us at DATE 2016