2.5 Low-Power and Efficient Architectures


Date: Tuesday 25 March 2014
Time: 11:30 - 13:00
Location / Room: Konferenz 3

Chair:
Cristina Silvano, Politecnico di Milano, IT

Co-Chair:
Todd Austin, University of Michigan, US

This session presents three papers on energy efficiency in memory-intensive systems. The first paper targets energy-efficient scheduling of cooperative-thread arrays on GPGPUs for memory-intensive workloads by throttling warps on different cores. The second paper leverages application-specific knowledge of the next-generation parallelized high-efficiency video encoder to design a distributed scratchpad memory system with adaptive SPM data allocation and power management. The third paper explores the feasibility of non-volatile memories for instruction caches to improve energy efficiency; to handle the write delay and energy issues of NVMs, an analysis of, and extensions to, the miss status handling registers are proposed.

Time  Label  Presentation Title
Authors
11:30  2.5.1  ENERGY-EFFICIENT SCHEDULING FOR MEMORY-INTENSIVE GPGPU WORKLOADS
Speakers:
Seokwoo Song1, Minseok Lee1, John Kim1, Woong Seo2, Yeongon Cho2 and Soojung Ryu2
1KAIST, KR; 2Samsung, KR
Abstract
High performance for a GPGPU workload is obtained by maximizing parallelism and fully utilizing the available resources. However, this is not necessarily energy efficient, especially for memory-intensive GPGPU workloads. In this work, we propose Throttle CTA (cooperative-thread array) Scheduling (TCS), which leverages two types of throttling - throttling the number of active cores and throttling warp execution within the cores - to improve energy efficiency for memory-intensive GPGPU workloads. The algorithm requires the global CTA (thread block) scheduler to reduce the number of cores with assigned thread blocks, while leveraging the local warp scheduler to throttle memory requests on some of the cores to further reduce power consumption. TCS requires no off-line analysis and operates dynamically during execution. Instead of relying on conventional metrics such as misses per kilo-instruction (MPKI), we use the memory access latency metric to determine the memory intensity of a workload. Our evaluations show that TCS reduces energy by up to 48% (38% on average) across different memory-intensive workloads while having very little impact on performance for compute-intensive workloads.
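
As a rough illustration of the two-level throttling the abstract describes, the C sketch below drives a global "reduce active cores" decision and a local "throttle warps" decision off a sampled memory-latency metric. All names, the threshold, and the core-selection heuristic are hypothetical assumptions; the paper's actual scheduler logic is not reproduced here.

    /* Minimal sketch of two-level throttling in the spirit of TCS.
     * Names, threshold, and heuristic are illustrative assumptions. */
    #include <stdbool.h>

    #define NUM_CORES      16
    #define LAT_THRESHOLD 400  /* avg. memory latency (cycles) that marks a
                                  phase as memory-intensive (assumption) */

    struct core_state {
        bool active;          /* thread blocks assigned by the global
                                 CTA scheduler */
        bool warp_throttled;  /* local warp scheduler limits memory requests */
    };

    /* Invoked periodically with the sampled average memory access latency. */
    void tcs_update(struct core_state cores[NUM_CORES],
                    unsigned avg_mem_latency, unsigned target_active)
    {
        if (avg_mem_latency < LAT_THRESHOLD)
            return;                      /* compute-intensive: do nothing */

        unsigned kept = 0;
        for (unsigned i = 0; i < NUM_CORES; i++) {
            if (!cores[i].active)
                continue;
            if (kept >= target_active) {
                cores[i].active = false; /* level 1: fewer active cores */
            } else {
                /* level 2: throttle warps on every other remaining core */
                cores[i].warp_throttled = (kept % 2 == 1);
                kept++;
            }
        }
    }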
12:00  2.5.2  DSVM: ENERGY-EFFICIENT DISTRIBUTED SCRATCHPAD VIDEO MEMORY ARCHITECTURE FOR THE NEXT-GENERATION HIGH EFFICIENCY VIDEO CODING
Speakers:
Felipe Sampaio1, Muhammad Shafique2, Bruno Zatt3, Sergio Bampi4 and Jörg Henkel2
1Federal University of Rio Grande do Sul, BR; 2Karlsruhe Institute of Technology (KIT), DE; 3Federal University of Pelotas, BR; 4Federal University of Rio Grande do Sul, BR
Abstract
An energy-efficient distributed Scratchpad Video Memory Architecture (dSVM) for next-generation parallel High Efficiency Video Coding (HEVC) is presented. Our dSVM combines private and overlapping (shared) Scratchpad Memories (SPMs) to support data reuse within and across cores concurrently executing multiple parallel HEVC threads. We developed a statistical method to size and organize the SPMs, along with a supporting memory reading policy for energy efficiency. The key is to leverage knowledge of HEVC and of the video content. Furthermore, we integrate an adaptive power management policy that adjusts the power states of different SPM parts at run time depending upon the varying video content properties. Our experimental results show that our dSVM architecture reduces overall memory energy consumption by 51%-61% compared to parallelized state-of-the-art solutions [11]. The external memory energy savings of dSVM grow with the number of parallel HEVC threads and the size of the search window. Moreover, our SPM power management reacts to the current video properties and achieves up to 54% on-chip leakage energy savings.
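
A minimal sketch of what a content-adaptive SPM power manager of this kind might look like, assuming a hypothetical bank-granular design with on/retention/off states and a per-frame access-count heuristic (all illustrative assumptions, not the paper's policy):

    /* Sketch of content-adaptive SPM power management. Bank granularity,
     * power states, and the activity heuristic are assumptions. */
    enum spm_state { SPM_ON, SPM_RETENTION, SPM_OFF };

    struct spm_bank {
        enum spm_state state;
        unsigned accesses_last_frame; /* profiled per video frame */
    };

    /* After each frame, banks holding rarely used search-window data drop
     * into a low-leakage retention state; unused banks are switched off. */
    void spm_power_manage(struct spm_bank *banks, unsigned nbanks,
                          unsigned hot_threshold)
    {
        for (unsigned i = 0; i < nbanks; i++) {
            if (banks[i].accesses_last_frame == 0)
                banks[i].state = SPM_OFF;
            else if (banks[i].accesses_last_frame < hot_threshold)
                banks[i].state = SPM_RETENTION;
            else
                banks[i].state = SPM_ON;
            banks[i].accesses_last_frame = 0; /* new profiling window */
        }
    }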
12:30  2.5.3  FEASIBILITY EXPLORATION OF NVM BASED I-CACHE THROUGH MSHR ENHANCEMENTS
Speakers:
Manu Komalan1, José Ignacio Gómez Pérez2, Christian Tenllado2, Praveen Raghavan3, Matthias Hartmann3 and Francky Catthoor3
1imec, UCM(Universidad Complutense de Madrid), ES; 2Universidad Complutense de Madrid, ES; 3imec, BE
Abstract
SRAM-based memory systems are plagued by problems such as sub-threshold leakage and susceptibility to read/write failure under dynamic voltage scaling or low supply voltages. Non-Volatile Memory (NVM) technologies are therefore being explored extensively as replacements for conventional SRAM memories, even for level-1 (L1) caches. NVMs such as Spin Torque Transfer RAM (STT-MRAM), Resistive RAM (ReRAM), and Phase Change RAM (PRAM) are less hindered by leakage as technology scales and consume less area. However, simply replacing SRAM with NVM is not viable due to NVM write-related issues. The main focus of this paper is the exploration of write delay and write energy issues in an NVM-based L1 instruction cache (I-cache) for an ARM-like single-core system. We propose an NVM I-cache and extend its MSHR (Miss Status Handling Register) functionality to address the NVM's write-related issues. According to our simulations, appropriate tuning of selected architecture parameters reduces the performance penalty introduced by the NVM (∼45%) to tolerable levels (∼1%) and shows energy gains of up to 35%. Furthermore, when our modified NVM-based system is configured to occupy an area comparable to the original SRAM-based configuration, it outperforms the SRAM baseline and delivers even greater energy savings.
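
To make the MSHR-based idea concrete, the sketch below shows one plausible extension: the MSHR keeps a refilled block readable while the slow NVM array write drains in the background, so instruction fetches need not stall on the write. Field names, block size, and the lookup logic are assumptions for illustration, not the paper's design.

    /* Sketch of an MSHR extended to hide NVM write latency: the refilled
     * block is buffered (and served) until the array write completes. */
    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_WORDS 8   /* 32-byte block, assumption */

    struct mshr_entry {
        bool     valid;
        bool     data_ready;        /* refill arrived from L2/memory */
        uint32_t block_addr;        /* 32-byte aligned */
        uint32_t data[BLOCK_WORDS]; /* readable before the NVM write ends */
        unsigned write_cycles_left; /* remaining NVM array write latency */
    };

    /* Fetch path: a hit in the extended MSHR bypasses the pending write. */
    bool mshr_read(const struct mshr_entry *e, uint32_t addr, uint32_t *word)
    {
        if (!e->valid || !e->data_ready || (addr & ~31u) != e->block_addr)
            return false;
        *word = e->data[(addr >> 2) & (BLOCK_WORDS - 1)];
        return true;
    }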
13:00  IP1-8, 266  EVX: VECTOR EXECUTION ON LOW POWER EDGE CORES
Speakers:
Milovan Duric1, Oscar Palomar1, Aaron Smith2, Osman Unsal1, Adrian Cristal1, Mateo Valero1 and Doug Burger2
1Barcelona Supercomputing Center, ES; 2Microsoft Research, US
Abstract
In this paper, we present a vector execution model that provides the advantages of vector processors on low-power, general-purpose cores with limited additional hardware. While accelerating data-level parallel (DLP) workloads, the vector model increases efficiency and hardware resource utilization. We implement our approach, called EVX, on a modest dual-issue core based on an Explicit Data Graph Execution (EDGE) architecture. Unlike most DLP accelerators, which add hardware and increase the complexity of low-power processors, EVX leverages the existing resources of EDGE cores and allows them to be specialized at minimal cost. EVX adds control logic that increases the core area by 2.1%. We show that EVX yields an average speedup of 3x over a scalar baseline and outperforms multimedia SIMD extensions.
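
For readers unfamiliar with the kind of data-level parallelism EVX targets, the plain-C strip-mined loop below shows the execution pattern a vector model maps onto hardware lanes. The vector length VL is an assumed parameter, and this illustrates only the workload shape, not EVX itself.

    /* Strip-mined SAXPY: each inner strip of VL iterations is what a
     * vector engine would execute as one vector operation. VL is an
     * assumed, illustrative lane count. */
    #define VL 16

    void saxpy(float a, const float *x, float *y, int n)
    {
        for (int i = 0; i < n; i += VL)          /* one vector op per strip */
            for (int j = i; j < i + VL && j < n; j++)
                y[j] = a * x[j] + y[j];          /* per-lane work */
    }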
13:01  IP1-9, 730  PROGRAM AFFINITY PERFORMANCE MODELS FOR PERFORMANCE AND UTILIZATION
Speakers:
Ryan Moore and Bruce Childers, University of Pittsburgh, US
Abstract
Multithreaded applications exhibit a wide variety of behaviors, causing complex interactions with today's chip multiprocessors. Application threads may have large private working sets and may compete for cache space and memory bandwidth; such threads benefit from large private caches. Other threads share data or communicate, and thus execute more quickly with shared caches. Many applications fall somewhere in between, requiring careful thread-to-core assignment to maximize performance. Yet because today's chip multiprocessors admit a large number of thread-to-core assignments, exhaustively trying each one to determine the best is prohibitive in time and energy. In this paper, we present and demonstrate performance models that predict application performance for a proposed thread-to-core assignment. We show how these models can be built quickly and used to select thread-to-core assignments for multiple programs and to improve system utilization.
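
A toy version of the idea, predicting an assignment's runtime from a model instead of executing it, might look like the following. The pairwise cache-interference penalty matrix is a stand-in assumption for illustration, not the authors' model.

    /* Rank a proposed thread-to-core assignment with a simple model:
     * base runtime per thread plus a penalty for each co-located pair.
     * The penalty matrix is an illustrative assumption. */
    #define NTHREADS 8

    double predict_runtime(const double base[NTHREADS],
                           const double penalty[NTHREADS][NTHREADS],
                           const int cache_of[NTHREADS])
    {
        double worst = 0.0;
        for (int i = 0; i < NTHREADS; i++) {
            double t = base[i];
            for (int j = 0; j < NTHREADS; j++)
                if (j != i && cache_of[i] == cache_of[j])
                    t += penalty[i][j];  /* contention on a shared cache */
            if (t > worst)
                worst = t;  /* runtime is bounded by the slowest thread */
        }
        return worst;
    }

A scheduler would evaluate predict_runtime over candidate assignments and pick the minimum, avoiding an exhaustive run of every assignment.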
13:02  IP1-10, 791  ADVANCED SIMD: EXTENDING THE REACH OF CONTEMPORARY SIMD ARCHITECTURES
Speakers:
Matthias Boettcher1, Giacomo Gabrielli2, Mbou Eyole2, Alastair Reid2 and Bashir M. Al-Hashimi1
1University of Southampton, GB; 2ARM Ltd., GB
Abstract
SIMD extensions have gained widespread acceptance in modern microprocessors as a way to exploit data-level parallelism in general-purpose cores. Popular SIMD architectures (e.g. Intel SSE/AVX) have evolved by adding support for wider registers and datapaths, and advanced features like indexed memory accesses, per-lane predication, and inter-lane instructions, at the cost of additional silicon area and design complexity. This paper evaluates the performance impact of such advanced features on a set of workloads considered hard to vectorize for traditional SIMD architectures. Their sensitivity to the most relevant design parameters (e.g. register/datapath width and L1 data cache configuration) is quantified and discussed. We developed an ARMv7 NEON-based ISA extension (ARGON), augmented a cycle-accurate simulation framework to support it, and derived a set of benchmarks from the Berkeley dwarfs. Our analyses demonstrate how ARGON can, depending on the structure of an algorithm, achieve speedups of 1.5x to 16x.
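
The scalar C loop below spells out the semantics of two of the advanced features under study, per-lane predication and indexed (gather) memory access; a SIMD ISA offering these features executes all iterations of such a loop in parallel. The function is an illustrative example only, not ARGON code.

    /* Scalar spelling of predicated gather semantics: each "lane" loads
     * from an arbitrary index only if its predicate bit is set. */
    void gather_predicated(float *dst, const float *table,
                           const int *idx, const unsigned char *pred, int n)
    {
        for (int lane = 0; lane < n; lane++)
            if (pred[lane])                   /* per-lane predication */
                dst[lane] = table[idx[lane]]; /* indexed load (gather) */
    }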
13:03  IP1-11, 898  A TIGHTLY-COUPLED HARDWARE CONTROLLER TO IMPROVE SCALABILITY AND PROGRAMMABILITY OF SHARED-MEMORY HETEROGENEOUS CLUSTERS
Speakers:
Paolo Burgio1, Robin Danilo2, Andrea Marongiu3, Philippe Coussy4 and Luca Benini5
1University of Bologna, Université de Bretagne-Sud, IT; 2Université de Bretagne-Sud, FR; 3University of Bologna, IT; 4Universite de Bretagne-Sud / Lab-STICC, FR; 5Università di Bologna, IT
Abstract
Modern designs for embedded many-core systems increasingly include application-specific units that accelerate key computational kernels with orders-of-magnitude higher execution speed and energy efficiency than their software counterparts. A promising architectural template is based on heterogeneous clusters, where simple RISC cores and specialized HW Processing Units (HWPUs) communicate in a tightly-coupled manner via L1 shared memory. Efficiently integrating processors and a large number of HWPUs in such a system poses two main challenges: architectural scalability and programmability. In this paper, we describe an optimized Data Pump (DP) that connects several accelerators to a restricted set of communication ports and acts as a virtualization layer for programming, exposing FIFO queues through which "HW tasks" are offloaded via a set of lightweight APIs. We aim at optimizing both of these mechanisms, respectively reducing module area and making the programming sequence easier and lighter.
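
A minimal sketch of the style of lightweight FIFO-based offload API the abstract alludes to is given below. The descriptor layout, function names, and completion protocol are all hypothetical assumptions for a 32-bit cluster, not the paper's actual API.

    /* Hypothetical lightweight offload API: write a task descriptor into a
     * memory-mapped FIFO, then spin on a completion flag. 32-bit addresses
     * in L1 shared memory are assumed. */
    #include <stdint.h>

    struct hw_task {
        uint32_t kernel_id;  /* which HWPU kernel to run */
        void    *in, *out;   /* buffers in L1 shared memory */
        uint32_t len;        /* payload length in bytes */
    };

    void dp_offload(volatile uint32_t *fifo, const struct hw_task *t)
    {
        fifo[0] = t->kernel_id;
        fifo[1] = (uint32_t)(uintptr_t)t->in;
        fifo[2] = (uint32_t)(uintptr_t)t->out;
        fifo[3] = t->len;    /* writing the last word triggers dispatch */
    }

    void dp_wait(volatile uint32_t *done_flag)
    {
        while (*done_flag == 0)
            ;                /* spin until the HWPU raises the done flag */
    }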
13:00  End of session
Lunch Break in Exhibition Area
Sandwich lunch