8.5 Don't Forget the Memory

Date: Wednesday, March 27, 2019
Time: 17:00 - 18:30
Location / Room: Room 5

Chair:
Christian Pilato, Politecnico di Milano, IT

Co-Chair:
Olivier Sentieys, INRIA, FR

Multi-core systems demand new solutions to overcome the increasing memory gap, and emerging memory technologies still need to find a suitable place in the traditional memory system. This session showcases different proposals covering memory, storage, and OS. The first presentation improves the parallelism of the Open-Channel SSD Linux implementation. The second presentation proposes a method to orchestrate multicore memory requests to maintain the main memory locality. The third presentation proposes a new method to improve directory entry lookup in deep directory structures. An interactive presentation completes the session with a new cache replacement algorithm for NVM disk read caches.

Time Label Presentation Title
Authors
17:00 8.5.1 DS-CACHE: A REFINED DIRECTORY ENTRY LOOKUP CACHE WITH PREFIX-AWARENESS FOR MOBILE DEVICES
Speaker:
Zhaoyan Shen, Shandong University, CN
Authors:
Lei Han1, Bin Xiao1, Xuwei Dong2, Zhaoyan Shen3 and Zili Shao4
1The Hong Kong Polytechnic University, HK; 2Northwestern Polytechnical University, CN; 3Shandong University, CN; 4The Chinese University of Hong Kong, HK
Abstract
Modern devices are filled with files, in directories upon directories, and applications generate heavy I/O activity on mobile devices. The directory cache is used to accelerate file lookup operations in the virtual file system. However, the original directory cache recursively walks all the components of a path for each lookup, leading to inefficient lookups and a lower cache hit ratio. In this paper, we fully investigate, for the first time, the characteristics of directory entry lookup in mobile devices. Based on our findings, we propose a new directory cache scheme, called Dynamic Skipping Cache (DS-Cache), which adopts an ASCII-based hash table to reduce path lookup complexity by skipping the common prefixes of paths. We also design a novel lookup scheme to optimize the directory cache hit ratio. We have implemented and deployed DS-Cache on a Google Nexus 6P smartphone. Experimental results show that it reduces the latency of invoking system calls by up to 57.4% and the completion time of real-world mobile applications by up to 64%.
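The difference between a component-by-component walk and a lookup that skips a common prefix can be illustrated with a small sketch. This is not the DS-Cache implementation: the nested-dict directory tree, the function names, and the plain Python dict standing in for the ASCII-based hash table are all assumptions made for illustration.

```python
# Hypothetical illustration only; not the authors' DS-Cache code.

def walk_lookup(root, path):
    """Baseline: visit every component of the path. Returns (entry, steps)."""
    node, steps = root, 0
    for name in path.strip("/").split("/"):
        node = node[name]
        steps += 1
    return node, steps


def build_prefix_table(root, prefix):
    """Map a full prefix string directly to its directory node (assumed design)."""
    node, _ = walk_lookup(root, prefix)
    return {prefix.rstrip("/"): node}


def prefix_lookup(root, table, path):
    """Skip a cached common prefix with one hash probe, then walk the remainder."""
    for prefix, node in table.items():
        if path.startswith(prefix + "/"):
            rest = path[len(prefix) + 1:]
            entry, steps = walk_lookup(node, rest)
            return entry, steps + 1  # +1 accounts for the hash probe
    return walk_lookup(root, path)  # no cached prefix matches: full walk


# Example tree with made-up names.
root = {"data": {"app": {"com.example": {"files": {"a.txt": "inode-a"}}}}}
table = build_prefix_table(root, "/data/app/com.example/files")
target = "/data/app/com.example/files/a.txt"
baseline = walk_lookup(root, target)          # five component visits
skipping = prefix_lookup(root, table, target)  # one probe plus one visit
```

In this toy setting the baseline visits five components, while the prefix-skipping lookup needs one hash probe and one component visit; the saving grows with the depth of the shared prefix.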
17:30 8.5.2 IMPROVING THE DRAM ACCESS EFFICIENCY FOR MATRIX MULTIPLICATION ON MULTICORE ACCELERATORS
Speaker:
Sheng Ma, National University of Defense Technology, CN
Authors:
Sheng Ma, Yang Guo, Shenggang Chen, Libo Huang and Zhiying Wang, National University of Defense Technology, CN
Abstract
The parallelization of matrix multiplication on multicore accelerators divides a matrix into several partitions. The existing design deploys an independent DMA transfer for each core to access its own partition from DRAM. This design has poor memory access efficiency, since memory access streams of multiple concurrent DMA transfers interfere with each other. We propose Distributed-DMA (D-DMA), which invokes one transfer to serve all cores. D-DMA accesses data in a row-major manner to efficiently exploit inter-partition locality to improve the DRAM access efficiency. Compared with a baseline design, D-DMA improves the bandwidth by 84.8% and reduces DRAM energy consumption by 43.1% for micro-benchmarks. It achieves higher performance for the GEMM benchmark. With much lower hardware cost, D-DMA significantly outperforms an out-of-order memory controller.
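A toy model can illustrate why several independent transfers may lose DRAM row-buffer locality that a single row-major transfer keeps. The model below is an assumption, not the authors' D-DMA design: it assumes one DRAM bank with an open-page policy, one matrix row per DRAM row, column-block partitions (one per core), and an arbiter that grants each per-core transfer `quantum` consecutive requests per turn.

```python
# Hypothetical model; all parameters and names are illustrative assumptions.

def row_hit_rate(requests):
    """Fraction of requests that hit the currently open DRAM row."""
    open_row, hits = None, 0
    for dram_row in requests:
        if dram_row == open_row:
            hits += 1
        open_row = dram_row
    return hits / len(requests)


def per_core_transfers(n_rows, n_cores, quantum):
    """Each core fetches its own column block; the arbiter serves the cores
    round-robin, `quantum` requests at a time. One request is one core's
    segment of one matrix row, so it touches that row's DRAM row."""
    requests = []
    for base in range(0, n_rows, quantum):
        for _core in range(n_cores):
            for row in range(base, min(base + quantum, n_rows)):
                requests.append(row)
    return requests


def single_row_major_transfer(n_rows, n_cores):
    """One transfer sweeps each matrix row across all partitions in turn."""
    return [row for row in range(n_rows) for _core in range(n_cores)]


independent = row_hit_rate(per_core_transfers(8, 4, quantum=2))
shared = row_hit_rate(single_row_major_transfer(8, 4))
```

With these made-up parameters the independent transfers never hit the open row, while the single row-major sweep hits on three of every four requests. If the per-core transfers happened to run in perfect lockstep (`quantum=1`), the two request streams would coincide, so the model only captures interference between unsynchronized transfers.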
18:00 8.5.3 QBLK: TOWARDS FULLY EXPLOITING THE PARALLELISM OF OPEN-CHANNEL SSDS
Speaker:
Hongwei Qin, Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System, Engineering Research Center of data storage systems and Technology, Ministry of Education of China, School of Computer Science and Technology, Huazhong University, CN
Authors:
Hongwei Qin1, Dan Feng1, Wei Tong1, Jingning Liu2 and Yutong Zhao2
1Wuhan National Lab for Optoelectronics, CN; 2Wuhan National Lab for Optoelectronics, CN
Abstract
By exposing physical channels to host software, Open-Channel SSD shows great potential for future high-performance storage systems. However, the existing scheme fails to achieve acceptable performance under heavy workloads. The main reasons lie not only in its single-buffer architecture but also, more importantly, in its line-based physical address management. In addition, the lock on the address mapping table is a performance burden under heavy workloads. We propose QBLK, an open-source driver that aims to better exploit the parallelism of Open-Channel SSDs. In particular, QBLK adopts four key techniques: (1) multi-queue based buffering, (2) per-channel based address management, (3) lock-free address mapping, and (4) fine-grained draining. Experimental results show that QBLK achieves up to 97.4% bandwidth improvement compared with the state-of-the-art PBLK scheme.
18:30 IP4-2, 1013 A WRITE-EFFICIENT CACHE ALGORITHM BASED ON MACROSCOPIC TREND FOR NVM-BASED READ CACHE
Speaker:
Ning Bao, Renmin University of China, CN
Authors:
Ning Bao1, Yunpeng Chai1 and Xiao Qin2
1Renmin University of China, CN; 2Auburn University, US
Abstract
Compared with traditional storage technologies, non-volatile memory (NVM) technologies offer excellent I/O performance, but they have high costs and either limited write endurance (e.g., NAND and PCM) or high write energy consumption (e.g., STT-MRAM). As a result, storage systems prefer to use NVM devices as read caches for a performance boost. Unlike write caches, read caches have greater potential for write reduction because their writes are triggered only by cache updates. However, traditional cache algorithms like LRU and LFU have to update cached blocks frequently because it is difficult for them to predict data popularity far into the future. Although some newer algorithms like SieveStore reduce cache write pressure, they still rely on those traditional cache schemes for data popularity prediction. Because their long-term popularity prediction is poor, these newer algorithms lead to a significant and unnecessary decrease in cache hit ratios. In this paper, we propose a new Macroscopic Trend (MT) cache replacement algorithm that reduces cache updates effectively while maintaining high cache hit ratios. This algorithm discovers long-term hot data effectively by observing the macroscopic trend of data blocks. We have conducted extensive experiments driven by a series of real-world traces, and the results indicate that, compared with LRU, the MT cache algorithm can achieve 15.28 times longer lifetime or less energy consumption of NVM caches with a similar hit ratio.
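The point that a read cache writes to the NVM only on cache updates can be made concrete by counting writes under plain LRU. The sketch below models only the LRU baseline mentioned in the abstract, not the MT algorithm; the function name and the block trace are made up for illustration.

```python
from collections import OrderedDict

# Hypothetical illustration of the LRU baseline; not the MT algorithm.

def lru_read_cache(trace, capacity):
    """Replay a block trace through an LRU read cache.
    Returns (hits, writes): every miss inserts a block, costing one NVM write."""
    cache = OrderedDict()
    hits = writes = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # refresh recency, no NVM write
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
            writes += 1                    # cache update means one NVM write
    return hits, writes


# Block 1 is hot over the long run, but one-off blocks keep evicting it.
trace = [1, 2, 1, 3, 4, 1, 5, 6, 1, 7]
hits, writes = lru_read_cache(trace, capacity=2)
```

On this made-up trace, LRU performs nine writes for a single hit, because the long-term hot block is repeatedly evicted by blocks that are never reused.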
18:31 IP4-3, 626 SRAM DESIGN EXPLORATION WITH INTEGRATED APPLICATION-AWARE AGING ANALYSIS
Speaker:
Alexandra Listl, Technical University of Munich, DE
Authors:
Alexandra Listl1, Daniel Mueller-Gritschneder2, Sani Nassif3 and Ulf Schlichtmann4
1Chair of Electronic Design Automation, DE; 2Technical University of Munich, DE; 3Radyalis, US; 4TU München, DE
Abstract
On-chip SRAMs are an integral part of safety-critical Systems-on-Chip. At the same time, however, they are also the most susceptible to reliability threats such as Bias Temperature Instability (BTI), which originate from the continuous trend of technology shrinking. BTI leads to significant performance degradation, especially in the Sense Amplifiers (SAs) of SRAMs, where failures are fatal because the data of a whole column is destroyed. As BTI strongly depends on the workload of an application, the aging rates of SAs in a memory array differ significantly, and incorporating workload information into aging simulations is vital. Especially in safety-critical systems, precise estimation of application-specific reliability requirements to predict the memory lifetime is a key concern. In this paper, we present a workload-aware aging analysis for on-chip SRAMs that incorporates the workload of real applications executed on a processor. According to this workload, we predict the performance degradation of the SAs in the memory. We integrate this aging analysis into an aging-aware SRAM design exploration framework that generates and characterizes memories of different array granularity to select the most reliable memory architecture for the intended application. We show that this technique can mitigate SA degradation significantly, depending on the environmental conditions and the application workload.
18:30 End of session