2.5 Low-Power and Efficient Architectures


Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Konferenz 3

Chair:
Cristina Silvano, Politecnico di Milano, IT

Co-Chair:
Todd Austin, University of Michigan, US

This session presents three papers on energy efficiency in memory-intensive systems. The first paper targets energy-efficient scheduling of cooperative-thread arrays on GPGPUs for memory-intensive workloads by throttling warps on different cores. The second paper leverages application-specific knowledge of the next-generation parallelized high-efficiency video encoder to design a distributed scratchpad memory system with adaptive SPM data allocation and power management. The third paper explores the feasibility of non-volatile memories for instruction caches to improve energy efficiency; to handle the write delay and energy issues of NVMs, an analysis of, and extensions to, the miss status handling registers are proposed.

Time  Label  Presentation Title
Authors
11:30  2.5.1  ENERGY-EFFICIENT SCHEDULING FOR MEMORY-INTENSIVE GPGPU WORKLOADS
Speakers:
Seokwoo Song1, Minseok Lee1, John Kim1, Woong Seo2, Yeongon Cho2 and Soojung Ryu2
1KAIST, KR; 2Samsung, KR
Abstract
High performance for a GPGPU workload is obtained by maximizing parallelism and fully utilizing the available resources. However, this is not necessarily energy efficient, especially for memory-intensive GPGPU workloads. In this work, we propose Throttle CTA (cooperative-thread array) Scheduling (TCS), which leverages two types of throttling - throttling the number of active cores and throttling warp execution within the cores - to improve energy efficiency for memory-intensive GPGPU workloads. The algorithm requires the global CTA (thread block) scheduler to reduce the number of cores with assigned thread blocks, while leveraging the local warp scheduler to throttle memory requests on some of the cores to further reduce power consumption. TCS requires no off-line analysis and operates dynamically during execution. Instead of relying on conventional metrics such as misses per kilo-instruction (MPKI), we use the memory access latency metric to determine the memory intensity of a workload. Our evaluations show that TCS reduces energy by up to 48% (38% on average) across different memory-intensive workloads while having very little impact on performance for compute-intensive workloads.
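
As a rough illustration of the two-level throttling the abstract describes, the C sketch below drives a global "reduce active cores" decision and a local "throttle warps" decision off a sampled memory-latency metric. All names, the threshold, and the core-selection heuristic are hypothetical assumptions; the paper's actual scheduler logic is not reproduced here.

    /* Minimal sketch of two-level throttling in the spirit of TCS.
     * Names, threshold, and heuristic are illustrative assumptions. */
    #include <stdbool.h>

    #define NUM_CORES      16
    #define LAT_THRESHOLD 400  /* avg. memory latency (cycles) that marks a
                                  phase as memory-intensive (assumption) */

    struct core_state {
        bool active;          /* thread blocks assigned by the global
                                 CTA scheduler */
        bool warp_throttled;  /* local warp scheduler limits memory requests */
    };

    /* Invoked periodically with the sampled average memory access latency. */
    void tcs_update(struct core_state cores[NUM_CORES],
                    unsigned avg_mem_latency, unsigned target_active)
    {
        if (avg_mem_latency < LAT_THRESHOLD)
            return;                      /* compute-intensive: do nothing */

        unsigned kept = 0;
        for (unsigned i = 0; i < NUM_CORES; i++) {
            if (!cores[i].active)
                continue;
            if (kept >= target_active) {
                cores[i].active = false; /* level 1: fewer active cores */
            } else {
                /* level 2: throttle warps on every other remaining core */
                cores[i].warp_throttled = (kept % 2 == 1);
                kept++;
            }
        }
    }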
12:00  2.5.2  DSVM: ENERGY-EFFICIENT DISTRIBUTED SCRATCHPAD VIDEO MEMORY ARCHITECTURE FOR THE NEXT-GENERATION HIGH EFFICIENCY VIDEO CODING
Speakers:
Felipe Sampaio1, Muhammad Shafique2, Bruno Zatt3, Sergio Bampi4 and Jörg Henkel2
1Federal University of Rio Grande do Sul, BR; 2Karlsruhe Institute of Technology (KIT), DE; 3Federal University of Pelotas, BR; 4Federal University of Rio Grande do Sul, BR
Abstract
An energy-efficient distributed Scratchpad Video Memory Architecture (dSVM) for next-generation parallel High Efficiency Video Coding (HEVC) is presented. Our dSVM combines private and overlapping (shared) Scratchpad Memories (SPMs) to support data reuse within and across cores concurrently executing multiple parallel HEVC threads. We developed a statistical method to size and organize the SPMs, along with a supporting memory reading policy for energy efficiency. The key is to leverage knowledge of HEVC and of the video content. Furthermore, we integrate an adaptive power management policy that adjusts the power states of different SPM parts at run time depending upon the varying video content properties. Our experimental results show that our dSVM architecture reduces overall memory energy consumption by 51%-61% compared to parallelized state-of-the-art solutions [11]. The external memory energy savings of dSVM grow with the number of parallel HEVC threads and the size of the search window. Moreover, our SPM power management reacts to the current video properties and achieves up to 54% on-chip leakage energy savings.
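
A minimal sketch of what a content-adaptive SPM power manager of this kind might look like, assuming a hypothetical bank-granular design with on/retention/off states and a per-frame access-count heuristic (all illustrative assumptions, not the paper's policy):

    /* Sketch of content-adaptive SPM power management. Bank granularity,
     * power states, and the activity heuristic are assumptions. */
    enum spm_state { SPM_ON, SPM_RETENTION, SPM_OFF };

    struct spm_bank {
        enum spm_state state;
        unsigned accesses_last_frame; /* profiled per video frame */
    };

    /* After each frame, banks holding rarely used search-window data drop
     * into a low-leakage retention state; unused banks are switched off. */
    void spm_power_manage(struct spm_bank *banks, unsigned nbanks,
                          unsigned hot_threshold)
    {
        for (unsigned i = 0; i < nbanks; i++) {
            if (banks[i].accesses_last_frame == 0)
                banks[i].state = SPM_OFF;
            else if (banks[i].accesses_last_frame < hot_threshold)
                banks[i].state = SPM_RETENTION;
            else
                banks[i].state = SPM_ON;
            banks[i].accesses_last_frame = 0; /* new profiling window */
        }
    }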
12:30  2.5.3  FEASIBILITY EXPLORATION OF NVM BASED I-CACHE THROUGH MSHR ENHANCEMENTS
Speakers:
Manu Komalan1, José Ignacio Gómez Pérez2, Christian Tenllado2, Praveen Raghavan3, Matthias Hartmann3 and Francky Catthoor3
1imec, UCM(Universidad Complutense de Madrid), ES; 2Universidad Complutense de Madrid, ES; 3imec, BE
Abstract
SRAM-based memory systems are plagued by problems such as sub-threshold leakage and susceptibility to read/write failure under dynamic voltage scaling or low supply voltages. Non-Volatile Memory (NVM) technologies are therefore being explored extensively as replacements for conventional SRAM memories, even for level-1 (L1) caches. NVMs such as Spin Torque Transfer RAM (STT-MRAM), Resistive RAM (ReRAM), and Phase Change RAM (PRAM) are less hindered by leakage as technology scales and consume less area. However, simply replacing SRAM with NVM is not viable due to NVM write-related issues. The main focus of this paper is the exploration of write delay and write energy issues in an NVM-based L1 instruction cache (I-cache) for an ARM-like single-core system. We propose an NVM I-cache and extend its MSHR (Miss Status Handling Register) functionality to address the NVM's write-related issues. According to our simulations, appropriate tuning of selected architecture parameters reduces the performance penalty introduced by the NVM (∼45%) to tolerable levels (∼1%) and shows energy gains of up to 35%. Furthermore, when our modified NVM-based system is configured to occupy an area comparable to the original SRAM-based configuration, it outperforms the SRAM baseline and delivers even greater energy savings.
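
To make the MSHR-based idea concrete, the sketch below shows one plausible extension: the MSHR keeps a refilled block readable while the slow NVM array write drains in the background, so instruction fetches need not stall on the write. Field names, block size, and the lookup logic are assumptions for illustration, not the paper's design.

    /* Sketch of an MSHR extended to hide NVM write latency: the refilled
     * block is buffered (and served) until the array write completes. */
    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_WORDS 8   /* 32-byte block, assumption */

    struct mshr_entry {
        bool     valid;
        bool     data_ready;        /* refill arrived from L2/memory */
        uint32_t block_addr;        /* 32-byte aligned */
        uint32_t data[BLOCK_WORDS]; /* readable before the NVM write ends */
        unsigned write_cycles_left; /* remaining NVM array write latency */
    };

    /* Fetch path: a hit in the extended MSHR bypasses the pending write. */
    bool mshr_read(const struct mshr_entry *e, uint32_t addr, uint32_t *word)
    {
        if (!e->valid || !e->data_ready || (addr & ~31u) != e->block_addr)
            return false;
        *word = e->data[(addr >> 2) & (BLOCK_WORDS - 1)];
        return true;
    }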
13:00  IP1-8, 266  EVX: VECTOR EXECUTION ON LOW POWER EDGE CORES
Speakers:
Milovan Duric1, Oscar Palomar1, Aaron Smith2, Osman Unsal1, Adrian Cristal1, Mateo Valero1 and Doug Burger2
1Barcelona Supercomputing Center, ES; 2Microsoft Research, US
Abstract
In this paper, we present a vector execution model that provides the advantages of vector processors on low-power, general-purpose cores with limited additional hardware. While accelerating data-level parallel (DLP) workloads, the vector model increases efficiency and hardware resource utilization. We implement our approach, called EVX, on a modest dual-issue core based on an Explicit Data Graph Execution (EDGE) architecture. Unlike most DLP accelerators, which add hardware and increase the complexity of low-power processors, EVX leverages the existing resources of EDGE cores and allows them to be specialized at minimal cost. EVX adds control logic that increases the core area by 2.1%. We show that EVX yields an average speedup of 3x over a scalar baseline and outperforms multimedia SIMD extensions.
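
For readers unfamiliar with the kind of data-level parallelism EVX targets, the plain-C strip-mined loop below shows the execution pattern a vector model maps onto hardware lanes. The vector length VL is an assumed parameter, and this illustrates only the workload shape, not EVX itself.

    /* Strip-mined SAXPY: each inner strip of VL iterations is what a
     * vector engine would execute as one vector operation. VL is an
     * assumed, illustrative lane count. */
    #define VL 16

    void saxpy(float a, const float *x, float *y, int n)
    {
        for (int i = 0; i < n; i += VL)          /* one vector op per strip */
            for (int j = i; j < i + VL && j < n; j++)
                y[j] = a * x[j] + y[j];          /* per-lane work */
    }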
13:01  IP1-9, 730  PROGRAM AFFINITY PERFORMANCE MODELS FOR PERFORMANCE AND UTILIZATION
Speakers:
Ryan Moore and Bruce Childers, University of Pittsburgh, US
Abstract
Multithreaded applications exhibit a wide variety of behaviors, causing complex interactions with today's chip multiprocessors. Application threads may have large private working sets and may compete for cache space and memory bandwidth; such threads benefit from large private caches. Other threads share data or communicate, and thus execute more quickly with shared caches. Many applications fall somewhere in between, requiring careful thread-to-core assignment to maximize performance. Yet because today's chip multiprocessors admit a large number of thread-to-core assignments, exhaustively trying each one to determine the best is prohibitive in time and energy. In this paper, we present and demonstrate performance models that predict application performance for a proposed thread-to-core assignment. We show how these models can be built quickly and used to select thread-to-core assignments for multiple programs and to improve system utilization.
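
A toy version of the idea, predicting an assignment's runtime from a model instead of executing it, might look like the following. The pairwise cache-interference penalty matrix is a stand-in assumption for illustration, not the authors' model.

    /* Rank a proposed thread-to-core assignment with a simple model:
     * base runtime per thread plus a penalty for each co-located pair.
     * The penalty matrix is an illustrative assumption. */
    #define NTHREADS 8

    double predict_runtime(const double base[NTHREADS],
                           const double penalty[NTHREADS][NTHREADS],
                           const int cache_of[NTHREADS])
    {
        double worst = 0.0;
        for (int i = 0; i < NTHREADS; i++) {
            double t = base[i];
            for (int j = 0; j < NTHREADS; j++)
                if (j != i && cache_of[i] == cache_of[j])
                    t += penalty[i][j];  /* contention on a shared cache */
            if (t > worst)
                worst = t;  /* runtime is bounded by the slowest thread */
        }
        return worst;
    }

A scheduler would evaluate predict_runtime over candidate assignments and pick the minimum, avoiding an exhaustive run of every assignment.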
13:02  IP1-10, 791  ADVANCED SIMD: EXTENDING THE REACH OF CONTEMPORARY SIMD ARCHITECTURES
Speakers:
Matthias Boettcher1, Giacomo Gabrielli2, Mbou Eyole2, Alastair Reid2 and Bashir M. Al-Hashimi1
1University of Southampton, GB; 2ARM Ltd., GB
Abstract
SIMD extensions have gained widespread acceptance in modern microprocessors as a way to exploit data-level parallelism in general-purpose cores. Popular SIMD architectures (e.g. Intel SSE/AVX) have evolved by adding support for wider registers and datapaths, and advanced features like indexed memory accesses, per-lane predication, and inter-lane instructions, at the cost of additional silicon area and design complexity. This paper evaluates the performance impact of such advanced features on a set of workloads considered hard to vectorize for traditional SIMD architectures. Their sensitivity to the most relevant design parameters (e.g. register/datapath width and L1 data cache configuration) is quantified and discussed. We developed an ARMv7 NEON-based ISA extension (ARGON), augmented a cycle-accurate simulation framework to support it, and derived a set of benchmarks from the Berkeley dwarfs. Our analyses demonstrate how ARGON can, depending on the structure of an algorithm, achieve speedups of 1.5x to 16x.
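
The scalar C loop below spells out the semantics of two of the advanced features under study, per-lane predication and indexed (gather) memory access; a SIMD ISA offering these features executes all iterations of such a loop in parallel. The function is an illustrative example only, not ARGON code.

    /* Scalar spelling of predicated gather semantics: each "lane" loads
     * from an arbitrary index only if its predicate bit is set. */
    void gather_predicated(float *dst, const float *table,
                           const int *idx, const unsigned char *pred, int n)
    {
        for (int lane = 0; lane < n; lane++)
            if (pred[lane])                   /* per-lane predication */
                dst[lane] = table[idx[lane]]; /* indexed load (gather) */
    }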
13:03  IP1-11, 898  A TIGHTLY-COUPLED HARDWARE CONTROLLER TO IMPROVE SCALABILITY AND PROGRAMMABILITY OF SHARED-MEMORY HETEROGENEOUS CLUSTERS
Speakers:
Paolo Burgio1, Robin Danilo2, Andrea Marongiu3, Philippe Coussy4 and Luca Benini5
1University of Bologna, Université de Bretagne-Sud, IT; 2Université de Bretagne-Sud, FR; 3University of Bologna, IT; 4Universite de Bretagne-Sud / Lab-STICC, FR; 5Università di Bologna, IT
Abstract
Modern designs for embedded many-core systems increasingly include application-specific units that accelerate key computational kernels with orders-of-magnitude higher execution speed and energy efficiency than their software counterparts. A promising architectural template is based on heterogeneous clusters, where simple RISC cores and specialized HW Processing Units (HWPUs) communicate in a tightly-coupled manner via L1 shared memory. Efficiently integrating processors and a large number of HWPUs in such a system poses two main challenges: architectural scalability and programmability. In this paper, we describe an optimized Data Pump (DP) that connects several accelerators to a restricted set of communication ports and acts as a virtualization layer for programming, exposing FIFO queues through which "HW tasks" are offloaded via a set of lightweight APIs. We aim at optimizing both of these mechanisms, respectively reducing module area and making the programming sequence easier and lighter.
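
A minimal sketch of the style of lightweight FIFO-based offload API the abstract alludes to is given below. The descriptor layout, function names, and completion protocol are all hypothetical assumptions for a 32-bit cluster, not the paper's actual API.

    /* Hypothetical lightweight offload API: write a task descriptor into a
     * memory-mapped FIFO, then spin on a completion flag. 32-bit addresses
     * in L1 shared memory are assumed. */
    #include <stdint.h>

    struct hw_task {
        uint32_t kernel_id;  /* which HWPU kernel to run */
        void    *in, *out;   /* buffers in L1 shared memory */
        uint32_t len;        /* payload length in bytes */
    };

    void dp_offload(volatile uint32_t *fifo, const struct hw_task *t)
    {
        fifo[0] = t->kernel_id;
        fifo[1] = (uint32_t)(uintptr_t)t->in;
        fifo[2] = (uint32_t)(uintptr_t)t->out;
        fifo[3] = t->len;    /* writing the last word triggers dispatch */
    }

    void dp_wait(volatile uint32_t *done_flag)
    {
        while (*done_flag == 0)
            ;                /* spin until the HWPU raises the done flag */
    }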
13:00  End of session
Lunch Break in Exhibition Area
Sandwich lunch