3.5 Robust Architectures

Date: Tuesday 25 March 2014
Time: 14:30 - 16:00
Location / Room: Konferenz 3

Chair:
Todd Austin, University of Michigan, US

Co-Chair:
Muhammad Shafique, Karlsruhe Institute of Technology, DE

This session presents the design of novel architectures to support real-time and secure systems. The first paper couples a time-division multiplexed NoC with a real-time memory controller to design a cost-effective real-time system with improved worst-case latency at reduced area and power consumption. The next paper proposes bus designs for multi-cores that are analyzable for probabilistic timing analysis. The final paper in this session designs a lightweight hardware solution using lockstep shadow thread execution to detect and prevent code injection attacks.

Time	Label	Presentation Title Authors
14:30	3.5.1	(Best Paper Award Candidate) COUPLING TDM NOC AND DRAM CONTROLLER FOR COST AND PERFORMANCE OPTIMIZATION OF REAL-TIME SYSTEMS Speakers: Manil Dev Gomony¹, Benny Akesson² and Kees Goossens³ ¹Eindhoven University of Technology, NL; ²Czech Technical University in Prague, CZ; ³Eindhoven University of Technology, NL Abstract Existing memory subsystems and TDM NoCs for real-time systems are optimized independently in terms of cost and performance by configuring their arbiters according to the bandwidth and/or latency requirements of their clients. However, when they are used in conjunction and run in different clock domains, i.e. they are decoupled, there exists no structured methodology to select the NoC interface width and operating frequency for minimizing area and/or power consumption. Moreover, the multiple arbitration points, one in the NoC and the other in the memory subsystem, introduce additional overhead in the worst-case guaranteed latency. This makes it hard to design cost-efficient real-time systems. The three main contributions in this paper are: (1) We present a novel methodology to couple any existing TDM NoC with a real-time memory controller and compute the different NoC interface width and operating frequency combinations for minimal area and/or power consumption. (2) For two different TDM NoC types, one packet-switched and the other circuit-switched, we show the trade-off between area and power consumption with the different NoC configurations, for different DRAM generations. (3) We compare the coupled and decoupled architectures with the two NoCs, in terms of guaranteed worst-case latency, area and power consumption, by synthesizing the designs in 40 nm technology. Our experiments show that using a coupled architecture in a system consisting of 16 clients results in savings of over 44% in guaranteed latency, 18% and 17% in area, 19% and 11% in power consumption for a packet-switched and a circuit-switched TDM NoC, respectively, with different DRAM types.
15:00	3.5.2	BUS DESIGNS FOR TIME-PROBABILISTIC MULTICORE PROCESSORS Speakers: Javier Jalle¹, Leonidas Kosmidis¹, Jaume Abella², Eduardo Quinones¹ and Francisco Cazorla³ ¹Barcelona Supercomputing Center, ES; ²Barcelona Supercomputing Center (BSC-CNS), ES; ³Barcelona Supercomputing Center and IIIA-CSIC, ES Abstract Probabilistic Timing Analysis (PTA) reduces the amount of information needed to provide tight WCET estimates in real-time systems with respect to classic timing analysis. PTA imposes new requirements on hardware design that have been shown implementable for single-core architectures. However, no support has been proposed for multicores so far. In this paper, we propose several probabilistically-analysable bus designs for multicore processors ranging from 4 cores connected with a single bus, to 16 cores deploying a hierarchical bus design. We derive analytical models of the probabilistic timing behaviour for the different bus designs, show their suitability for PTA and evaluate their hardware cost. Our results show that the proposed bus designs (i) fulfil PTA requirements, (ii) allow deriving WCET estimates with the same cost and complexity as in single-core processors, and (iii) provide higher guaranteed performance than single-core processors, 3.4x and 6.6x on average for an 8-core and a 16-core setup respectively.
15:30	3.5.3	PROGRAMMABLE DECODER AND SHADOW THREADS: TOLERATE REMOTE CODE INJECTION EXPLOITS WITH DIVERSIFIED REDUNDANCY Speakers: Weidong Shi¹, Ziyi Liu¹, Shouhuai Xu² and Zhiqiang Lin³ ¹University of Houston, US; ²University of Texas at San Antonio, US; ³University of Texas at Dallas, US Abstract We present a lightweight hardware framework for providing high-assurance detection and prevention of code injection attacks using a lockstep diversified shadow execution. Recent studies show that hardware diversification can detect software attacks by checking the consistency of their behavior simultaneously. Unfortunately, the severe performance degradation and extra system costs caused by these methods are unacceptable in many applications. This paper presents a hardware-level, lockstep shadow thread framework to enrich the diversity of the software execution, facilitated by a programmable hardware decoder and novel CPU support for a tightly coupled, non-executing shadow thread technique. Specifically, given a piece of (legacy) binary code, we first generate diversified binary versions using an offline binary rewriter and a programmable hardware binary translator at runtime. Two diversified binary code images are launched as dual simultaneous threads in the hardware layer, with one as the primary thread and the other as the shadow thread. Instructions from the shadow thread are not executed but just compared, and thus incur no OS side-effects. The extended CPU is able to decode instructions from both threads and dispatch them to the next pipeline stage for a lockstep comparison. Any mismatch of the decoded instructions from the two threads caused by remotely injected binary code will be detected. Our design provides instruction set randomization (ISR) with minimal cost in performance when compared with a straightforward ISR implementation. The simulation results indicate that our framework incurs very small overheads and provides protection against code injection attacks.
16:00	IP1-19, 268	EXPLOITING NARROW-WIDTH VALUES FOR IMPROVING NON-VOLATILE CACHE LIFETIME Speakers: Guangshan Duan and Shuai Wang, Nanjing University, CN Abstract Due to their high cell density, low leakage power consumption, and lower vulnerability to soft errors, non-volatile memory technologies are among the most promising alternatives for replacing the traditional DRAM and SRAM technologies used in implementing main memory and caches in modern microprocessors. However, one of the difficulties is the limited write endurance of most non-volatile memory technologies. In this paper, we propose to exploit narrow-width values to improve the lifetime of non-volatile last-level caches. A leading-zeros masking scheme is first proposed to reduce the write stress to the upper half of the narrow-width data. To balance the write variations between the upper half and the lower half of the narrow-width data, two swap schemes, swap on write (SW) and swap on replacement (SRepl), are proposed. To further reduce the write stress to the non-volatile cache, we adopt two optimization schemes, multiple dirty bit (MDB) and read before write (RBW), to improve its lifetime. Our experimental results show that by combining all our proposed schemes, the lifetime of the non-volatile caches can be improved by 245% on average.
16:01	IP1-20, 166	PARTIAL-SET: WRITE SPEEDUP OF PCM MAIN MEMORY Speakers: Li Bing¹, Shan Shuchang², Hu Yu² and Li XiaoWei³ ¹ICT, UCAS, CN; ²ICT, CAS, CN; ³ICT, CAS, CN Abstract Phase change memory (PCM) is a promising non-volatile memory technology developed as a possible DRAM replacement. Although it offers a read latency close to that of DRAM, PCM generally suffers from long write latency. Long write requests may block the read requests on the critical path of cache/memory access, adversely affecting system performance. Besides, the write performance of PCM is very asymmetric, i.e., the SET operation (writing '1') is much slower than the RESET operation (writing '0'). In this work, we re-examine the resistance transform process during the SET operation of PCM and propose a novel Partial-SET scheme to alleviate the long write latency issue of PCM. During a write access to a memory line, a short Partial-SET pulse is applied first to program the PCM cells to a pre-stable state, achieving the same write latency as RESET. The partially-SET cells are then fully programmed within the retention window to preserve data integrity. Experimental results show that our Partial-SET scheme can improve the memory access performance of PCM by more than 45% on average with very marginal storage overhead.
16:00		End of session. Coffee Break in Exhibition Area. On Tuesday-Thursday, the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).
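
Illustrative note on 3.5.3: the lockstep comparison described in that abstract can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the authors' design. A per-image XOR key stands in for the binary rewriter and programmable decoder, single bytes stand in for instructions, and the names encode, decode and lockstep_check are hypothetical. It shows why a payload injected as the same raw bytes into both diversified images decodes differently in the two threads and is therefore flagged.

```python
# Toy model of lockstep comparison between two diversified images.
# Assumption: a per-image XOR key plays the role of the diversification;
# the actual framework uses a binary rewriter and a programmable decoder.

def encode(instructions: bytes, key: int) -> bytes:
    """Produce one diversified image by XOR-ing every byte with the key."""
    return bytes(b ^ key for b in instructions)

def decode(image: bytes, key: int) -> bytes:
    """Undo the diversification for one thread (XOR is its own inverse)."""
    return bytes(b ^ key for b in image)

def lockstep_check(primary_image, shadow_image, primary_key, shadow_key):
    """Return the index of the first mismatching decoded byte, or None."""
    primary = decode(primary_image, primary_key)
    shadow = decode(shadow_image, shadow_key)
    for index, (p, s) in enumerate(zip(primary, shadow)):
        if p != s:
            return index
    return None

PRIMARY_KEY, SHADOW_KEY = 0x5A, 0xC3  # hypothetical, must differ
program = bytes([0x10, 0x22, 0x35, 0x47])

primary_image = encode(program, PRIMARY_KEY)
shadow_image = encode(program, SHADOW_KEY)

# Untampered images decode to the same instruction stream.
clean_result = lockstep_check(primary_image, shadow_image,
                              PRIMARY_KEY, SHADOW_KEY)

# An attacker who does not know the keys overwrites position 2
# with the same raw byte in both images.
payload = bytes([0x90])
attacked_primary = primary_image[:2] + payload + primary_image[3:]
attacked_shadow = shadow_image[:2] + payload + shadow_image[3:]
attack_result = lockstep_check(attacked_primary, attacked_shadow,
                               PRIMARY_KEY, SHADOW_KEY)

print(clean_result, attack_result)
```

Because the two keys differ, any identical raw byte decodes to different values in the two threads, so the check reports the position of the injected byte while the untampered images pass.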
