DATE 2018

Practical Challenges in Delivering the Promises of Real Processing‐in‐Memory Machines

Nishil Talati^1,a, Ameer Haj Ali^1,b, Rotem Ben Hur^1,c, Nimrod Wald^1,d, Ronny Ronen^1,e, Pierre‐Emmanuel Gaillardon^2,g and Shahar Kvatinsky^1,f
^1Technion ‐ Israel Institute of Technology, Haifa, ISRAEL
^anishil.t@campus.technion.ac.il
^bameerh@campus.technion.ac.il
^crotembenhur@campus.technion.ac.il
^dnimrodw@campus.technion.ac.il
^eronny.ronen@gmail.com
^fshahar@ee.technion.ac.il
^2University of Utah, Salt Lake City, UT, USA
^gpierre-emmanuel.gaillardon@utah.edu

ABSTRACT

Processing‐in‐Memory (PiM) machines promise to overcome the von Neumann bottleneck and further scale the performance and energy efficiency of computing systems by reducing data transfer and offering ample parallelism. In this paper, we take the memristive Memory Processing Unit (mMPU) as a case study of a PiM machine and scrutinize it in practical scenarios. Specifically, we explore the limits on its parallelism and on its ability to eliminate data transfer. We argue that a lack of operand locality and of suitable operand arrangement might make data transfer inevitable in the mMPU. We then devise techniques to move data within the mMPU, without transferring it off‐chip, and quantify their costs. Additionally, we present electrical parameters that might limit the parallelism offered by the mMPU and evaluate their impact. Using benchmarks from the LGsynth91 suite, their vector extensions, and a few synthetic data‐parallel workloads, we show that internal data transfer increases the execution time by up to 1.5×, while the parallelism can be limited in some cases to 256 gates, increasing the execution time by 1.1× to 2×.

Keywords: Von Neumann bottleneck, Memristors, mMPU.
