7.4 Runtime memory optimization and GPU/manycore architectures

Printer-friendly version PDF version

Date: Wednesday 26 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 2

Chair:
Alberto Nannarelli, DTU Copenhagen, DK

Co-Chair:
Alberto Macii, PoliTo Torino, IT

The session starts with memory design techniques under PVT variation and ageing for DRAMs and SRAM caches. Afterwards, bus, memory and partitioning techniques for 2D and 3D GPUs and manycores are presented.

TimeLabelPresentation Title
Authors
14:307.4.1EXPLOITING EXPENDABLE PROCESS-MARGINS IN DRAMS FOR RUN-TIME PERFORMANCE OPTIMIZATION
Speakers:
Karthik Chandrasekar1, Sven Goossens2, Christian Weis3, Martijn Koedam2, Benny Akesson4, Norbert Wehn3 and Kees Goossens5
1Delft University of Technology, NL; 2Eindhoven University of Technology, NL; 3University of Kaiserslautern, DE; 4Czech Technical University in Prague, CZ; 5Eindhoven university of technology, NL
Abstract
Manufacturing-time process (P) variations and runtime variations in voltage (V) and temperature (T) can affect a DRAM's performance (internal delays) severely. To counter the effects of these variations, DRAM vendors provide substantial design-time PVT margins to guarantee correct DRAM functionality under worst-case conditions. Unfortunately, with technology scaling these design margins have become large and very pessimistic for a majority of the manufactured DRAMs. While run-time variations are specific to operating conditions and their margins difficult to optimize, process variations are manufacturing-time effects and excessive process-margins can be reduced on a per-device basis, if properly identified. In this paper, we propose a generic post-manufacturing performance characterization methodology for DRAMs that identifies this excess in process-margins for any given DRAM device at runtime, while retaining the requisite margins for voltage (noise) and temperature variations. By doing so, the methodology ascertains the actual impact of process-variations on the particular DRAM device and optimizes its access latencies, thereby improving its overall performance. We evaluate this methodology on 48 DDR3 devices (from 12 DIMMs) and verify the derived timings under worst-case operating conditions, showing up to 33.3% and 25.9% reduction in DRAM read and write latencies, respectively.
15:007.4.2CACHE AGING REDUCTION WITH IMPROVED PERFORMANCE USING DYNAMICALLY RE-SIZABLE CACHE
Speakers:
Haroon Mahmood, Massimo Poncino and Enrico Macii, Politecnico di Torino, Torino Italy, IT
Abstract
Aging of transistors is a limiting factor for long term reliability of devices in sub-100nm technologies. It's a worst-case metric where the lifetime of a device is determined by the earliest failing component. Impact is more serious on memory arrays, where failure of a single SRAM cell would cause the failure of the whole system. Previous works have shown that partitioning based strategies based on power management techniques can effectively control aging effects and can extend lifetime of the cache significantly. However, such a benefit comes as a trade-off with performance which reduces proportionally as the time elapses. To address this problem and provide a single solution to concurrently improve aging, energy and performance of the cache, we propose an architectural solution based on the dynamically re-sizable cache and cache partitioning approaches. By this strategy, cache is dynamically re-sized and reconfigured whenever a cache block becomes unreliable. Coupling such aging mitigation technique along with dynamically re-sizable cache approach provides on average 30% lifetime improvement with less than 0.4x degradation in performance whereas, in previous solutions, performance degradation sometimes goes upto 10x.
15:157.4.3ON GPU BUS POWER REDUCTION WITH 3D IC TECHNOLOGIES
Speakers:
Young-joon Lee1 and Sung Kyu Lim2
1Intel Corporation, US; 2Georgia Institute of Technology, US
Abstract
The complex buses consume significant power in graphics processing units (GPUs). In this paper, we demonstrate how the power consumption of buses in GPUs can be reduced with 3D IC technologies. Based on layout simulations, we found that partitioning and floorplanning of 3D ICs affect the power benefit amount, as well as the technology setup, target clock frequency, and circuit switching activity. With 3D IC technologies, we achieved the total power reduction of up to 21.5% for our GPU.
15:457.4.4PROCESS VARIATION-AWARE WORKLOAD PARTITIONING ALGORITHMS FOR GPUS SUPPORTING SPATIAL MULTITASKING
Speakers:
Paula Aguilera1, Jungseob Lee1, Amin Farmahini Farahani1, Michael Schulte2, Katherine Morrow1 and Nam Sung Kim1
1University of Wisconsin-Madison, US; 2AMD, US
Abstract
High-level programming languages have transformed graphics processing units (GPUs) from domain-restricted devices into powerful compute platforms. Yet many "general-purpose GPU" (GPGPU) applications fail to fully utilize the GPU resources. Executing multiple applications simultaneously on different regions of the GPU (spatial multitasking) thus improves system performance. However, within-die process variations lead to significantly different maximum operating frequencies (Fmax) of the streaming multiprocessors (SMs) within a GPU. As the chip size and number of SMs per chip increase, the frequency variation is also expected to increase, exacerbating the problem. The increased number of SMs also provides a unique opportuni-ty: we can allocate resources to concurrently-executing applica-tions based on how those applications are affected by the differ-ent available Fmax values. In this paper, we study the effects of per-SM clocking on spatial multitasking-capable GPUs. We demonstrate two factors that affect the performance of simulta-neously-running applications: (i) the SM partitioning algorithm that decides how many resources to assign to each application, and (ii) the assignment of SMs to applications based on the oper-ating frequencies of those SMs and the applications characteris-tics. Our experimental results show that spatial multitasking that partitions SMs based on application characteristics, when com-bined with per-SM clocking, can greatly improve application performance by up to 46% on average compared to cooperative multitasking with global clocking.
16:00IP3-11, 240A THERMAL RESILIENT INTEGRATION OF MANY-CORE MICROPROCESSORS AND MAIN MEMORY BY 2.5D TSI I/OS
Speakers:
Sih-Sian Wu1, Kanwen Wang1, Sai Manoj P. D.1, Tsung-Yi Ho2 and Hao Yu1
1Nanyang Technological University, SG; 2National Cheng Kung University, TW
Abstract
One memory-logic-integration design platform is developed in this paper with thermal reliability analysis provided for 2.5D throughsilicon-interposer (TSI) and 3D through-silicon-via (TSV) based integrations. Temperature-dependent delay and power models have been developed at microarchitecture level for 2.5D and 3D integrations of many-core microprocessors and main memory, respectively. Experiments are performed by general-purpose benchmarks from SPEC CPU2006 and also cloud-oriented benchmarks from Phoenix with the following observations. The memory-logic integration by 3D RC-interconnected TSV I/Os can result in thermal runaway failures due to strong electrical-thermal couplings. On the other hand, the one by 2.5D transmission-line-interconnected TSI I/Os has shown almost the same energy efficiency and better thermal resilience.
16:01IP3-12, 24LEVERAGING ON-CHIP NETWORKS FOR EFFICIENT PREDICTION ON MULTICORE COHERENCE
Speaker:
Libo Huang, National University of Defense Technology, CN
Abstract
Coherent data prediction is introduced as a promising architectural technique for reducing cache-to-cache accesses in directory protocol. However, limited on-chip resources cause the accuracy of current prediction to be generally low. Low accuracy would result in a large number of unnecessary or incorrect predictions, which would consequently generate excessive network traffic. This leads to large power and performance overhead for coherent memory access. This paper proposes an early abort mechanism (EBT) that leverages NoC design to reduce the negative effect of wrong prediction operations, thus facilitating overall performance improvement and traffic reduction. Using detailed full-system simulations, we conclude that EBT provides a cost-effective solution for designing efficient multicore processors. To the best of our knowledge, this study is the first to leverage on-chip network for the prediction optimization on multicore coherence.
16:02IP3-13, 184AN ADAPTIVE MEMORY INTERFACE CONTROLLER FOR IMPROVING BANDWIDTH UTILIZATION OF HYBRID AND RECONFIGURABLE SYSTEMS
Speakers:
Vito Giovanni Castellana1, Antonino Tumeo2 and Fabrizio Ferrandi1
1Politecnico di Milano, DEIB, IT; 2Pacific Northwest National Laboratory, US
Abstract
Data mining, bioinformatics, knowledge discovery, social network analysis, are emerging irregular applications that exploits data structures based on pointers or linked lists, such as graphs, unbalanced trees or unstructured grids. These applications are characterized by unpredictable memory accesses and generally are memory bandwidth bound, but also presents large amounts of inherent dynamic parallelism because they can potentially spawn concurrent activities for each one of the element they are exploring. Hybrid architectures, which integrate general purpose processors with reconfigurable devices, appears promising target platforms for accelerating irregular applications. These systems often connect to distributed and multi-ported memories, potentially enabling parallel memory operations. However, these memory architectures introduce several challenges, such as the necessity to manage concurrency and synchronization to avoid structural conflicts on shared memory locations and to guarantee consistency. In this paper we present an adaptive Memory Interface Controller (MIC) that addresses these issues. The MIC is a general and customizable solution that can target several different memory structures, and is suitable for High Level Synthesis frameworks. It implements a dynamic arbitration scheme, which avoids conflicts on memory resources at runtime, and supports atomic memory operations, commonly exploited for synchronization directives in parallel programming paradigms. The MIC simultaneously maps multiple accesses to different memory ports, allowing fine grained parallelism exploitation and ensuring correctness also in the presence of irregular and statically unpredictable memory access patterns. We evaluated the effectiveness of our approach on a typical irregular kernel, graph Breadth First Search (BFS), exploring different design alternatives.
16:00End of session
Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).