10.5 Enhancing Memory in Next-Generation Platforms


Date: Thursday 17 March 2016
Time: 11:00 - 12:30
Location / Room: Konferenz 3

Chair:
Francisco Cazorla, Barcelona Supercomputing Center, ES

Co-Chair:
Jeronimo Castrillon, Technische Universität Dresden, DE

This session presents three interesting papers describing different approaches to enhancing memory, obtaining significant performance and energy improvements with respect to standard processor-centric architectures. The first paper introduces a near-data processing solution compatible with existing processor memory interfaces such as DDR3/4, requiring only minimal changes. The second paper introduces the HIVE architecture, which performs common vector operations directly inside the HMC, avoiding contention on the interconnections as well as cache pollution. The third paper proposes a minimalistic clustered flash array that exposes a simple, stable, error-free, shared-memory flash interface, enabling flexible cross-layer flash management optimizations and scalable distributed storage coordination.

Time  Label  Presentation Title / Authors

11:00  10.5.1  (Best Paper Award Candidate)
BUFFERED COMPARES: EXCAVATING THE HIDDEN PARALLELISM INSIDE DRAM ARCHITECTURES WITH LIGHTWEIGHT LOGIC
Speaker:
Kiyoung Choi, Seoul National University, KR
Authors:
Jinho Lee, Jung Ho Ahn and Kiyoung Choi, Seoul National University, KR
Abstract
We propose an approach called buffered compares, a less-invasive processing-in-memory solution that can be used with existing processor memory interfaces such as DDR3/4 with minimal changes. The approach is based on the observation that multi-bank architecture, a key feature of modern main memory DRAM devices, can be used to provide huge internal bandwidth without any major modification. We place a small buffer and a simple ALU per bank, define a set of new DRAM commands to fill the buffer and feed data to the ALU, and return the result for a set of commands (not for each command) to the host memory controller. By exploiting the under-utilized internal bandwidth using 'compare-n-op' operations, which are frequently used in many applications, we not only reduce the amount of energy-inefficient processor-memory communication, but also accelerate the computation of big data processing applications by utilizing parallelism of the buffered compare units in DRAM banks. Experimental results show that our solution significantly improves the performance and efficiency of the system on the tested workloads.
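The core idea above — a small buffer and ALU per bank, filled by new DRAM commands, returning one result per command set — can be sketched as a software model. This is purely illustrative: the class and method names (`BankComputeUnit`, `buf_fill`, `compare_n_op`) are hypothetical and do not reflect the paper's actual command set.

```python
# Illustrative software model of the "buffered compares" idea: each DRAM bank
# gets a small buffer and a simple ALU, and only the final (small) result
# crosses the memory bus. All names here are assumptions, not the paper's ISA.

class BankComputeUnit:
    """Models one DRAM bank extended with a buffer and a compare ALU."""
    def __init__(self):
        self.buffer = None  # operand latched by a hypothetical fill command

    def buf_fill(self, value):
        # Corresponds to a new DRAM command that loads the compare operand.
        self.buffer = value

    def compare_n_op(self, rows):
        # Scan the bank's rows locally; return only the matches, so the
        # host never streams the full data set over the channel.
        return [r for r in rows if r == self.buffer]

def scatter_scan(banks, data_per_bank, key):
    """Host side: broadcast the key to every bank, let all banks scan their
    local data in parallel, and collect one small result per bank."""
    results = []
    for bank, rows in zip(banks, data_per_bank):
        bank.buf_fill(key)
        results.extend(bank.compare_n_op(rows))
    return results
```

In hardware the per-bank scans would proceed concurrently, which is where the "hidden parallelism" of the multi-bank architecture comes from; the sequential loop here only models the functional behavior.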

11:30  10.5.2  LARGE VECTOR OPERATIONS INSIDE HMC
Speaker:
Luigi Carro, Universidade Federal do Rio Grande do Sul (UFRGS), BR
Authors:
Marco Antonio Zanata Alves, Matthias Diener, Paulo Santos and Luigi Carro, Universidade Federal do Rio Grande do Sul (UFRGS), BR
Abstract
One of the main challenges for embedded systems is the transfer of data between memory and processor. In this context, Hybrid Memory Cubes (HMCs) can provide substantial energy and bandwidth improvements compared to traditional memory organizations, while also allowing the execution of simple atomic instructions in the memory. However, the complex memory hierarchy still remains a bottleneck, especially for applications with a low reuse of data, limiting the usable parallelism of the HMC vaults and banks. In this paper, we introduce the HIVE architecture, which allows performing common vector operations directly inside the HMC, avoiding contention on the interconnections as well as cache pollution. Our mechanism achieves substantial speedups of up to 17.3x (9.4x on average) compared to a baseline system that performs vector operations in an 8-core processor. We show that the simple instructions provided by HMC actually hurt performance for streaming applications.
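The benefit of executing vector operations inside the cube can be illustrated with a toy traffic model: in a processor-centric design both operands and the result cross the interconnect, while an in-memory operation costs roughly one command. This is a back-of-envelope sketch under assumed unit costs, not HIVE's actual instruction set or cost model.

```python
# Toy model contrasting off-chip traffic (in elements transferred) for a
# vector add done on the processor vs. inside the memory cube. The unit
# costs below are illustrative assumptions, not measured HMC numbers.

def host_vector_add(a, b):
    # Processor-centric: read both operands over the link, write the result
    # back, so roughly 3N elements cross the interconnect.
    traffic = len(a) + len(b) + len(a)
    return [x + y for x, y in zip(a, b)], traffic

def hmc_vector_add(a, b):
    # HIVE-style: the whole vector operation executes inside the cube;
    # only a single command descriptor crosses the link.
    traffic = 1
    return [x + y for x, y in zip(a, b)], traffic
```

For low-reuse streaming workloads the 3N-vs-1 gap is exactly why caches see no benefit but plenty of pollution, matching the abstract's observation.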

12:00  10.5.3  MINFLASH: A MINIMALISTIC CLUSTERED FLASH ARRAY
Speaker:
Ming Liu, Massachusetts Institute of Technology (MIT), US
Authors:
Ming Liu¹, Sang-Woo Jun¹, Sungjin Lee¹, Jamey Hicks² and Arvind¹
¹Massachusetts Institute of Technology (MIT), US; ²Quanta Research Cambridge, US
Abstract
NAND flash is seeing increasing adoption in the data center because of its orders of magnitude lower latency and higher bandwidth compared to hard disks. However, flash performance is often degraded by (i) an inefficient storage I/O stack that hides flash characteristics under a Flash Translation Layer (FTL), and (ii) long-latency network protocols for distributed storage. In this paper, we propose a minimalistic clustered flash array (minFlash). First, minFlash exposes a simple, stable, error-free, shared-memory flash interface that enables the host to perform cross-layer flash management optimizations in file systems, databases and other user applications. Second, minFlash uses a controller-to-controller network to connect multiple flash drives with very little overhead. We envision minFlash to be used within a rack cluster of servers to provide fast scalable distributed flash storage. We show through benchmarks that minFlash can access both local and remote flash devices with negligible latency overhead, and it can expose near theoretical max performance of the NAND chips in a distributed setting.
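The "simple, stable, error-free" interface described above — raw page access with no embedded FTL, so the host (file system or database) does its own flash management — can be sketched as a minimal API. The class name, method names, and page size are assumptions for illustration; minFlash's real interface is a hardware shared-memory abstraction, not a Python API.

```python
# Minimal sketch of a minFlash-like interface: the device exposes plain page
# reads/writes and leaves address mapping, garbage collection, and wear
# leveling to the host software. Names and sizes are illustrative only.

PAGE_SIZE = 8192  # assumed flash page size in bytes

class MinimalFlashInterface:
    """Error-free raw page store; no FTL hidden inside the device."""
    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.pages = {}

    def write_page(self, addr, data):
        assert 0 <= addr < self.num_pages and len(data) <= PAGE_SIZE
        self.pages[addr] = bytes(data)

    def read_page(self, addr):
        # Unwritten pages read back as zeros in this toy model.
        assert 0 <= addr < self.num_pages
        return self.pages.get(addr, bytes(PAGE_SIZE))
```

Because the device hides nothing, a database could, for example, align its log segments to erase blocks — the kind of cross-layer optimization the abstract refers to.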

12:30  IP5-6, 681  EXPLORING SPECIALIZED NEAR-MEMORY PROCESSING FOR DATA INTENSIVE OPERATIONS
Speaker:
Salessawi Ferede Yitbarek, University of Michigan, US
Authors:
Salessawi Ferede Yitbarek¹, Tao Yang², Reetuparna Das¹ and Todd Austin¹
¹University of Michigan, US; ²University of California, San Diego, US
Abstract
Emerging 3D stacked memory systems provide significantly more bandwidth than current DDR modules. However, general-purpose processors do not take full advantage of the resources offered by these memory modules; exploiting the increased bandwidth requires specialized processing units. In this paper, we evaluate the benefits of placing hardware accelerators at the bottom layer of a 3D stacked memory system compared to accelerators that are placed external to the memory stack. Our evaluation of the design using cycle-accurate simulation and RTL synthesis shows that, for important data-intensive kernels, near-memory accelerators inside a single 3D memory package provide 3x-13x speedup over a quad-core Xeon processor. Most of the benefits come from the accelerators themselves, as the near-memory configurations provide marginal benefits compared to the same number of accelerators placed on a die external to the memory package. This comparable performance for external accelerators is due to the high bandwidth afforded by the high-speed off-chip links. On the other hand, near-memory accelerators consume 7%-39% less energy than the external accelerators.
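The abstract's conclusion — matched performance but lower energy for near-memory placement — follows from link energy per bit, which a back-of-envelope model makes concrete. Every number below is an illustrative assumption (not from the paper): off-chip SerDes links typically cost several pJ/bit while in-stack vias cost a small fraction of that.

```python
# Back-of-envelope model: if off-chip links can match in-stack bandwidth,
# the external accelerator matches performance, but each bit it moves costs
# more energy. The per-bit energies below are assumed, not measured values.

LINK_PJ_PER_BIT = 5.0   # assumed energy of a high-speed off-chip link
TSV_PJ_PER_BIT = 0.1    # assumed energy of in-stack (TSV) data movement

def accel_energy_pj(bytes_moved, compute_pj, per_bit_pj):
    # Total energy = fixed compute energy + data-movement energy.
    return compute_pj + bytes_moved * 8 * per_bit_pj

# Same kernel, same data volume (1 MiB), same compute energy; only the
# data path differs between the two placements.
external = accel_energy_pj(1 << 20, 1e6, LINK_PJ_PER_BIT)
near_mem = accel_energy_pj(1 << 20, 1e6, TSV_PJ_PER_BIT)
savings = 1 - near_mem / external
```

With these assumed constants the near-memory configuration moves the same data for far less energy, qualitatively matching the 7%-39% savings reported; the exact fraction depends on how compute-bound the kernel is.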

12:30  End of session
Lunch Break in Großer Saal + Saal 1
Keynote Lecture in "Saal 2" 13:30 - 14:00