4.5 Memory System Architectures


Date: Tuesday 25 March 2014
Time: 17:00 - 18:30
Location / Room: Konferenz 3

Chair:
Muhammad Shafique, Karlsruhe Institute of Technology, DE

Co-Chair:
Cristina Silvano, Politecnico di Milano, IT

The memory sub-system plays an increasingly important role in modern multicore systems. Novel solutions are needed in order to deliver the expected performance improvements with minimal energy overheads. In addition, new solutions should preferably be backward compatible with existing approaches. This session has four papers dealing with different aspects of the memory hierarchy in modern computing systems. ALLARM provides a novel, power-efficient approach to cache coherence that simultaneously improves performance and reduces energy. The next paper in this session presents a novel packet-based interface and compression scheme, which reduces communication overhead. The third paper deals with prefetcher aggressiveness and proposes a sound solution to reduce overall execution time. The last paper of this session proposes a novel extension of the shared L2 cache memory system, providing a very high aggregate bandwidth with very low impact on L2 cache design complexity and operating frequency.

Time  Label  Presentation Title / Authors
17:00  4.5.1  ACHIEVING EFFICIENT PACKET-BASED MEMORY SYSTEM BY EXPLOITING CORRELATION OF MEMORY REQUESTS
Speakers:
Tianyue Lu, Licheng Chen and Mingyu Chen, Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
A packet-based interface is a probable trend for future memory systems to alleviate capacity and bandwidth bottlenecks, while fine-grained memory access has been shown to reduce memory power effectively. However, combining these two technologies results in high packet overhead, because previous implementations all adopt a simple design in which a single packet is dedicated to a single request (SPSR). In this paper, we propose three optimizations that overcome this problem by exploiting correlations among memory requests. First, we propose a novel single packet multiple requests (SPMR) interface that encapsulates multiple requests into one packet so that they share the packet header and tail. Second, we propose an adaptive compression mechanism for the addresses within a packet based on a base-difference algorithm. Third, we propose a mechanism that merges multiple memory requests with contiguous addresses into a single request before packing. In this way, the 64-byte granularity constraint is broken down, improving the efficiency of request scheduling and of the row buffer. The experimental results show that, for memory-intensive workloads, the optimizations can effectively reduce packet overhead by about 53.9% and improve system performance by about 63.6%.
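As a rough illustration of the base-difference compression idea described in the abstract, the sketch below packs several correlated request addresses into one packet as a full base address plus small deltas. The field widths, struct layout, and fallback behavior are assumptions for illustration, not the paper's actual interface design.

```c
/* Minimal sketch (not the authors' design): base-difference compression of
 * request addresses within one SPMR packet, with assumed 16-bit delta fields. */
#include <stdio.h>
#include <stdint.h>

#define MAX_REQS 8

typedef struct {
    uint64_t base;            /* full address of the first request */
    uint16_t delta[MAX_REQS]; /* remaining addresses as offsets from base */
    int      nreqs;
} spmr_packet_t;

/* Try to pack n addresses into one packet; returns 0 if a delta overflows
 * the 16-bit field, in which case the sender would fall back to SPSR. */
int spmr_pack(const uint64_t *addr, int n, spmr_packet_t *p)
{
    p->base = addr[0];
    p->nreqs = n;
    for (int i = 1; i < n; i++) {
        uint64_t d = addr[i] - p->base;
        if (d > UINT16_MAX) return 0;   /* too far apart: cannot compress */
        p->delta[i] = (uint16_t)d;
    }
    return 1;
}

int main(void)
{
    /* Four correlated requests: same 4 KiB region, different cache lines. */
    uint64_t addr[4] = {0x40000000, 0x40000040, 0x40000100, 0x40000fc0};
    spmr_packet_t pkt;
    if (spmr_pack(addr, 4, &pkt)) {
        /* One 64-bit base plus three 16-bit deltas replaces four full
         * per-request headers, shrinking the packet overhead. */
        printf("packed %d requests, base=%#llx\n", pkt.nreqs,
               (unsigned long long)pkt.base);
        for (int i = 1; i < pkt.nreqs; i++)
            printf("  delta[%d]=%#x\n", i, pkt.delta[i]);
    }
    return 0;
}
```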
17:30  4.5.2  ALLARM: OPTIMIZING SPARSE DIRECTORIES FOR THREAD-LOCAL DATA
Speakers:
Amitabha Roy1 and Timothy Jones2
1EPFL, CH; 2University of Cambridge, GB
Abstract
Large-scale cache-coherent systems often impose unnecessary overhead on data that is thread-private for the whole of its lifetime. This overhead includes resources devoted to tracking the coherence state of the data, as well as unnecessary coherence messages sent out over the interconnect. In this paper we show how the memory allocation strategy for non-uniform memory access (NUMA) systems can be exploited to remove any coherence-related traffic for thread-local data, as well as removing the need to track those cache lines in sparse directories. Our strategy is to allocate directory state only on a miss from a node in a different affinity domain from the directory. We call this ALLocAte on Remote Miss, or ALLARM. Our solution is entirely backward compatible with existing operating systems and software, and provides a means to scale cache coherence into the many-core era. On a mix of SPLASH2 and Parsec workloads, ALLARM is able to improve performance by 13% on average while reducing dynamic energy consumption by 9% in the on-chip network and 15% in the directory controller. This is achieved through a 46% reduction in the number of sparse directory entries evicted.
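The allocation test the abstract describes can be sketched in a few lines: a sparse-directory entry is allocated only when the missing node lies in a different affinity domain than the line's home. The domain mapping, node count, and address interleaving below are all assumptions made for illustration.

```c
/* Illustrative sketch of the ALLARM allocation decision, under an assumed
 * NUMA layout: 4 nodes per domain and 1 GiB domain-interleaved physical
 * memory. Thread-local data allocated first-touch is homed in its own
 * domain, so its misses never allocate directory state. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NODES_PER_DOMAIN 4

static int domain_of_node(int node) { return node / NODES_PER_DOMAIN; }

/* Assumed mapping: bits [31:30] of the physical address pick the home
 * domain (1 GiB granules across 4 domains). */
static int home_domain_of_line(uint64_t paddr)
{
    return (int)((paddr >> 30) & 0x3);
}

bool allarm_should_allocate(int requester_node, uint64_t paddr)
{
    return domain_of_node(requester_node) != home_domain_of_line(paddr);
}

int main(void)
{
    uint64_t thread_local_line = 0x00000040;  /* homed in domain 0 */
    uint64_t shared_line       = 0xC0000040;  /* homed in domain 3 */
    /* Node 1 (domain 0) missing on its own data: no directory entry. */
    printf("local miss allocates:  %d\n",
           allarm_should_allocate(1, thread_local_line));
    /* Node 1 missing on data homed in domain 3: must be tracked. */
    printf("remote miss allocates: %d\n",
           allarm_should_allocate(1, shared_line));
    return 0;
}
```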
18:00  4.5.3  INTRODUCING THREAD CRITICALITY AWARENESS IN PREFETCHER AGGRESSIVENESS CONTROL
Speakers:
Biswabandan Panda and Shankar Balachandran, IIT Madras, IN
Abstract
A single parallel application running on a multi-core system shows sub-linear speedup because of the slow progress of one or more threads, known as critical threads. Among the reasons for the slow progress of threads are (1) load imbalance, (2) frequent cache misses, and (3) the effect of synchronization primitives. Identifying critical threads and minimizing their cache miss latencies can improve system performance. One way to hide and tolerate cache misses is through hardware prefetching, one of the most commonly used memory latency hiding techniques. Previous studies have shown the effectiveness of hardware prefetchers for multiprogrammed workloads (multiple sequential applications running independently on different cores). In contrast to multiprogrammed workloads, the performance of a single parallel application depends on the progress of its slow-progress (critical) threads. This paper introduces Thread Criticality-aware Prefetcher Aggressiveness Control (TCPAC). TCPAC controls the aggressiveness of prefetchers at the L2 prefetching controllers (TCPAC-P), the DRAM controller (TCPAC-D), and the Last Level Cache (LLC) controller (TCPAC-C) based on prefetch accuracy and thread progress. Each TCPAC sub-technique outperforms the respective state-of-the-art technique, namely HPAC [2], PADC [4], and PACMan [3], and the combination of all the TCPAC sub-techniques, named TCPAC-PDC, outperforms the combination of HPAC, PADC, and PACMan. On average, on an 8-core system, TCPAC-PDC improves execution time over the combination of HPAC, PADC, and PACMan by 7.61%. For 12 and 16 cores, TCPAC-PDC beats the state-of-the-art combinations by 7.21% and 8.32%, respectively.
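The control idea, feeding both prefetch accuracy and a thread-progress estimate into the aggressiveness decision, can be sketched as a small policy function. The thresholds, the criticality metric, and the four-level aggressiveness scale below are illustrative assumptions, not the thresholds used in the paper.

```c
/* Hedged sketch of a TCPAC-style decision: throttle a core's prefetcher
 * from per-interval prefetch accuracy and how critical (slow) the thread
 * is relative to its siblings. All thresholds are assumed values. */
#include <stdio.h>

enum aggr { AGGR_OFF, AGGR_LOW, AGGR_MID, AGGR_HIGH };

/* accuracy:    fraction of prefetches that were useful, 0.0..1.0
 * criticality: how far this thread lags the fastest thread, 0.0..1.0 */
enum aggr prefetch_level(double accuracy, double criticality)
{
    if (accuracy < 0.40)
        return AGGR_OFF;        /* inaccurate: stop polluting shared caches */
    if (criticality > 0.50)
        return AGGR_HIGH;       /* critical thread: hide its misses aggressively */
    return accuracy > 0.75 ? AGGR_MID : AGGR_LOW;  /* non-critical thread */
}

int main(void)
{
    printf("%d\n", prefetch_level(0.90, 0.80)); /* critical + accurate -> HIGH */
    printf("%d\n", prefetch_level(0.30, 0.80)); /* inaccurate          -> OFF  */
    printf("%d\n", prefetch_level(0.90, 0.10)); /* non-critical        -> MID  */
    return 0;
}
```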
18:15  4.5.4  A MULTI BANKED - MULTI PORTED - NON BLOCKING SHARED L2 CACHE FOR MPSOC PLATFORMS
Speakers:
Igor Loi1 and Luca Benini2
1University of Bologna, IT; 2Università di Bologna, IT
Abstract
On-chip L2 cache architectures, well established in high-performance parallel computing systems, are now becoming a performance-critical component also for multi/many-core architectures targeted at lower-power, embedded applications. The very stringent requirements on power and cost of these systems pose one of the key challenges in many-core design, mandating the deployment of highly efficient L2 caches. In this perspective, sharing the L2 cache layer among all system cores has important advantages, such as increased utilization, fast inter-core communication, and reduced aggregate footprint, because no undesired replication of lines occurs. This paper presents and explores a novel architecture for a shared L2 cache system with multi-port and multi-bank features. We target this L2 cache at a many-core platform based on a hierarchical cluster structure that does not employ private data caches and therefore does not require complex coherency mechanisms. In fact, our shared L2 cache can logically be seen as a Last Level Cache (LLC), adopting the terminology of higher-performance many-core products, although in the latter the LLC is more often an L3 layer. Our experimental results show a maximum aggregate bandwidth of 28 GB/s (89% of the maximum channel capacity) for 100% hit traffic with random banking conflicts, as a realistic case. Physical implementation results in 28nm Fully-Depleted Silicon-on-Insulator (FD-SOI) show that our L2 cache can operate at up to 1 GHz with a memory density loss of only 20% with respect to an L2 scratchpad for a 2 MB configuration.
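A multi-banked shared cache reaches high aggregate bandwidth by interleaving consecutive cache lines across banks, so that concurrent requests from different cores usually target different banks. The sketch below shows that mapping; the bank count and bit positions are assumptions for illustration, not the configuration evaluated in the paper.

```c
/* Sketch of line-interleaved bank selection for a multi-banked shared L2:
 * with a power-of-two bank count, the bank index is a bit slice of the
 * line address. Parameters below are assumed (16 banks, 64 B lines). */
#include <stdio.h>
#include <stdint.h>

#define NBANKS    16   /* power of two so selection is a simple bit slice */
#define LINE_BITS 6    /* 64-byte cache lines */

static unsigned bank_of(uint64_t addr)
{
    return (unsigned)((addr >> LINE_BITS) & (NBANKS - 1));
}

int main(void)
{
    /* A streaming core touching consecutive lines rotates through banks,
     * so a second core rarely conflicts with it on every access. */
    for (uint64_t a = 0x1000; a < 0x1000 + 4 * 64; a += 64)
        printf("addr %#llx -> bank %u\n", (unsigned long long)a, bank_of(a));
    return 0;
}
```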
18:30  IP2-2, 150  DRAM-BASED COHERENT CACHES AND HOW TO TAKE ADVANTAGE OF THE COHERENCE PROTOCOL TO REDUCE THE REFRESH ENERGY
Speakers:
Zoran Jaksic and Ramon Canal, Universitat Politecnica de Catalunya, ES
Abstract
Recent technology trends have turned DRAMs into an interesting candidate to substitute traditional SRAM-based on-chip memory structures (i.e., register files and cache memories). Nevertheless, a major problem with these cells is that they lose their state (i.e., value) over time and have to be refreshed. This paper proposes the implementation of coherent caches with DRAM cells. Furthermore, we propose to use the coherence state to tune the refresh overhead. According to our analysis, an average of up to 57% of the refresh energy can be saved. Also, compared to caches implemented in SRAM, total energy savings are on average up to 39%, depending on the refresh policy, with a performance loss below 8%.
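The core observation, that the coherence state tells the refresh logic which lines still hold a live value, can be captured in a one-line predicate. The MESI encoding and the specific gating policy below are assumptions sketched for illustration; the paper evaluates several refresh policies.

```c
/* Sketch of coherence-gated refresh for a DRAM-based cache: a line whose
 * coherence state says its value is dead (Invalid) need not be refreshed.
 * The policy shown is an assumed, conservative variant. */
#include <stdio.h>
#include <stdbool.h>

typedef enum { M, E, S, I } mesi_t;

/* Invalid lines hold no architecturally visible value, so letting them
 * decay costs nothing. A more aggressive policy could also skip clean
 * (E/S) lines that can be re-fetched, trading refresh energy for misses. */
static bool needs_refresh(mesi_t state) { return state != I; }

int main(void)
{
    mesi_t lines[8] = { M, I, S, I, E, I, I, S };
    int refreshed = 0;
    for (int i = 0; i < 8; i++)
        if (needs_refresh(lines[i])) refreshed++;
    printf("refreshed %d of 8 lines\n", refreshed);  /* 4 of 8 skipped */
    return 0;
}
```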
18:31  IP2-3, 302  (Best Paper Award Candidate)
REDUCING SET-ASSOCIATIVE L1 DATA CACHE ENERGY BY EARLY LOAD DATA DEPENDENCE DETECTION (ELD3)
Speakers:
Alen Bardizbanyan1, Magnus Själander2, David Whalley2 and Per Larsson-edefors1
1Chalmers University of Technology, SE; 2Florida State University, US
Abstract
Fast set-associative level-one data caches (L1 DCs) access all ways in parallel during load operations for reduced access latency. This is required in order to resolve data dependencies as early as possible in the pipeline, which otherwise would suffer from stall cycles. A significant amount of energy is wasted due to this fast access, since the data can only reside in one of the ways. While it is possible to reduce L1 DC energy usage by accessing the tag and data memories sequentially, hence activating only one data way on a tag match, this approach significantly increases execution time due to an increased number of stall cycles. We propose an early load data dependency detection (ELD3) technique for in-order pipelines. This technique makes it possible to detect if a load instruction has a data dependency with a subsequent instruction. If there is no such dependency, then the tag and data accesses for the load are sequentially performed so that only the data way in which the data resides is accessed. If there is a dependency, then the tag and data arrays are accessed in parallel to avoid introducing additional stall cycles. For the MiBench benchmark suite, the ELD3 technique enables about 49% of all load operations to access the L1 DC sequentially. Based on 65-nm data using commercial SRAM blocks, the proposed technique reduces L1 DC energy by 13%.
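The decision ELD3 makes per load can be sketched as a register-match check over a short dependence window: if no nearby instruction reads the load's destination register, the cache can afford the slower, single-way sequential access. The window size and instruction encoding below are illustrative assumptions, not the paper's pipeline parameters.

```c
/* Sketch of an ELD3-style decision for an in-order pipeline: scan the
 * next few instruction slots for a reader of the load's destination
 * register. Window size (2) and two-source encoding are assumed. */
#include <stdio.h>
#include <stdbool.h>

#define ELD3_WINDOW 2   /* assumed: only this-close consumers cause stalls */

typedef struct { int src1, src2; } insn_t;   /* source register ids */

static bool load_has_early_dependence(int load_dst, const insn_t *next, int n)
{
    int w = n < ELD3_WINDOW ? n : ELD3_WINDOW;
    for (int i = 0; i < w; i++)
        if (next[i].src1 == load_dst || next[i].src2 == load_dst)
            return true;   /* consumer nearby: access tag+data in parallel */
    return false;          /* no consumer: sequential access, one way active */
}

int main(void)
{
    insn_t stream[2] = { { 3, 4 }, { 5, 6 } };
    printf("r7 loaded, window reads r3..r6 -> sequential: %d\n",
           !load_has_early_dependence(7, stream, 2));
    printf("r3 loaded, consumed next slot  -> parallel:   %d\n",
           load_has_early_dependence(3, stream, 2));
    return 0;
}
```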
18:32  IP2-4, 444  DISTRIBUTED COOPERATIVE SHARED LAST-LEVEL CACHING IN TILED MULTIPROCESSOR SYSTEM ON CHIP
Speakers:
Preethi Parayil Mana Damodaran1, Stefan Wallentowitz2 and Andreas Herkersdorf3
1LIS, Technical University of Munich, DE; 2Technische Universität München, Institute for Integrated Systems, DE; 3TU München, DE
Abstract
In a shared-memory based tiled many-core system-on-chip architecture, memory accesses present a huge performance bottleneck in terms of access latency as well as bandwidth requirements. The best-practice approach to address this issue is to provide a multi-level cache hierarchy and a suitable cache-coherency mechanism. This paper presents a method to increase memory access performance in distributed-directory-coherency-protocol based tiled many-core systems. The proposed method introduces an alternative design for the system-wide shared last-level cache (LLC) placed between the memory and the node-private caches (NPC). The proposed system-wide shared LLC layer is distributed over the entire network and interacts with the home directories of specific cache lines. Results from simulating SPEC2000 benchmark applications executed on a SystemC model of the proposed design show a minimum performance improvement of 20-25% compared to a model without the shared cache layer, at the expense of an additional 2% of total cache memory space (NPC + LLC memory). In addition, the proposed design shows a minimum 7-15% and an average 14-15% performance improvement in comparison to a centralized system-wide shared LLC of equivalent size and a dynamically mapped distributed LLC of equivalent size, respectively.
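A distributed shared LLC of this kind needs a static mapping from each cache line to the tile holding its LLC slice and directory entry; a node-private-cache miss is routed to that home tile before going off-chip. The hash and tile count below are assumptions sketched for illustration, not the mapping used in the paper.

```c
/* Sketch of an address-to-home-tile mapping for a distributed shared LLC
 * on a tiled many-core: each line has exactly one home tile. The XOR-fold
 * hash and 16-tile mesh are assumed parameters. */
#include <stdio.h>
#include <stdint.h>

#define TILES     16
#define LINE_BITS 6    /* 64-byte cache lines */

static unsigned home_tile(uint64_t addr)
{
    /* XOR-fold a few index bits so hot address regions spread across tiles. */
    uint64_t idx = addr >> LINE_BITS;
    return (unsigned)((idx ^ (idx >> 4)) % TILES);
}

int main(void)
{
    /* Each NPC miss is routed to its line's home tile, which checks its
     * LLC slice (and directory) before forwarding to off-chip memory. */
    uint64_t misses[3] = { 0x2000, 0x2040, 0x9f80 };
    for (int i = 0; i < 3; i++)
        printf("line %#llx -> home tile %u\n",
               (unsigned long long)misses[i], home_tile(misses[i]));
    return 0;
}
```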
18:30  End of session
Exhibition Reception at several serving points inside the Exhibition Area (Terrace Level)
The Exhibition Reception will take place in the exhibition area (Terrace Level). All exhibitors are welcome to provide drinks and snacks for delegates and visitors.