Practical Challenges in Delivering the Promises of Real Processing‐in‐Memory Machines

Nishil Talati (1,a), Ameer Haj Ali (1,b), Rotem Ben Hur (1,c), Nimrod Wald (1,d), Ronny Ronen (1,e), Pierre‐Emmanuel Gaillardon (2), and Shahar Kvatinsky (1,f)
(1) Technion ‐ Israel Institute of Technology, Haifa, Israel
(a) nishil.t@campus.technion.ac.il
(b) ameerh@campus.technion.ac.il
(c) rotembenhur@campus.technion.ac.il
(d) nimrodw@campus.technion.ac.il
(e) ronny.ronen@gmail.com
(f) shahar@ee.technion.ac.il
(2) University of Utah, Salt Lake City, UT, USA
pierre-emmanuel.gaillardon@utah.edu

ABSTRACT


Processing‐in‐Memory (PiM) machines promise to overcome the von Neumann bottleneck in order to further scale performance and energy efficiency of computing systems by reducing the extent of data transfer and offering ample parallelism. In this paper, we take the memristive Memory Processing Unit (mMPU) as a case study of a PiM machine and scrutinize it in practical scenarios. Specifically, we explore the limitations of parallelism and data transfer elimination. We argue that lack of operand locality and arrangement might make data transfer inevitable in the mMPU. We then devise techniques to move data within the mMPU, without transferring it off‐chip, and quantify their costs. Additionally, we present electrical parameters that might limit the parallelism offered by the mMPU and evaluate their impact. Using benchmarks from the LGsynth91 suite, their vector extensions, and a few synthetic data‐parallel workloads, we show that the internal data transfer results in an increase of up to 1.5× in the execution time, while the parallelism can be limited in some cases to 256 gates, resulting in an increase in execution time by 1.1× to 2×.

Keywords: Von Neumann bottleneck, Memristors, mMPU.


