IP2 Interactive Presentations


Date: Wednesday 26 March 2014
Time: 10:00 - 10:30
Location / Room: Conference Level, foyer

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated with each IP paper is on display throughout the morning. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session, the 'Best IP of the Day' award is given.

Label Presentation Title
Authors
IP2-1 FAST AND ACCURATE COMPUTATION USING STOCHASTIC CIRCUITS
Speakers:
Armin Alaghi and John P. Hayes, University of Michigan - Ann Arbor, US
Abstract
Stochastic computing (SC) is a low-cost design technique that has great promise in applications such as image processing. SC enables arithmetic operations to be performed on stochastic bit-streams using ultra-small and low-power circuitry. However, accurate computations tend to require long run-times due to the random fluctuations inherent in stochastic numbers (SNs). We present novel techniques for SN generation that lead to better accuracy/run-time trade-offs. First, we analyze a property called progressive precision (PP) which allows computational accuracy to grow systematically with run-time. Second, borrowing from Monte Carlo methods, we show that SC performance can be greatly improved by replacing the usual pseudo-random number sources by low-discrepancy (LD) sequences that are predictably progressive. Finally, we evaluate the use of LD stochastic numbers in SC, and show they can produce significantly faster and more accurate results than existing stochastic designs.
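The accuracy/run-time trade-off described in this abstract can be illustrated with a minimal sketch. It assumes the standard SC encoding in which a bit-stream represents a probability and an AND gate multiplies two independent streams. The van der Corput sequences (bases 2 and 3) stand in for the low-discrepancy sources. All function names are hypothetical and this is not the authors' generator design.

```python
import random

def van_der_corput(n, base=2):
    """Return the n-th element of the van der Corput sequence in [0, 1)."""
    value, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        value += digit / denom
    return value

def stream_from_samples(p, samples):
    """Encode probability p as a bit-stream: a bit is 1 when its sample is below p."""
    return [1 if s < p else 0 for s in samples]

def sc_multiply(p, q, length, low_discrepancy, seed=0):
    """Multiply p and q by AND-ing two stochastic bit-streams of the given length."""
    if low_discrepancy:
        # Assumption: different bases keep the two streams uncorrelated.
        xs = [van_der_corput(i, 2) for i in range(length)]
        ys = [van_der_corput(i, 3) for i in range(length)]
    else:
        rng = random.Random(seed)
        xs = [rng.random() for _ in range(length)]
        ys = [rng.random() for _ in range(length)]
    a = stream_from_samples(p, xs)
    b = stream_from_samples(q, ys)
    product = [x & y for x, y in zip(a, b)]
    return sum(product) / length

if __name__ == "__main__":
    # Compare the absolute error of both sources as the stream grows.
    for length in (16, 64, 256, 1024):
        err_ld = abs(sc_multiply(0.5, 0.5, length, True) - 0.25)
        err_pr = abs(sc_multiply(0.5, 0.5, length, False) - 0.25)
        print(length, err_ld, err_pr)
```

Reading the stream prefix by prefix, as the loop does, is one way to observe progressive precision: the estimate is usable early and tightens as more bits arrive.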
IP2-2 DRAM-BASED COHERENT CACHES AND HOW TO TAKE ADVANTAGE OF THE COHERENCE PROTOCOL TO REDUCE THE REFRESH ENERGY
Speakers:
Zoran Jaksic and Ramon Canal, Universitat Politecnica de Catalunya, ES
Abstract
Recent technology trends have turned DRAMs into an interesting candidate to replace traditional SRAM-based on-chip memory structures (i.e. register file, cache memories). Nevertheless, a major obstacle to introducing these cells is that they lose their state (i.e. value) over time and have to be refreshed. This paper proposes the implementation of coherent caches with DRAM cells. Furthermore, we propose to use the coherence state to tune the refresh overhead. According to our analysis, an average of up to 57% of refresh energy can be saved. Also, compared to caches implemented in SRAM, total energy savings are on average up to 39%, depending on the refresh policy, with a performance loss below 8%.
IP2-3 (Best Paper Award Candidate)
REDUCING SET-ASSOCIATIVE L1 DATA CACHE ENERGY BY EARLY LOAD DATA DEPENDENCE DETECTION (ELD3)
Speakers:
Alen Bardizbanyan1, Magnus Själander2, David Whalley2 and Per Larsson-Edefors1
1Chalmers University of Technology, SE; 2Florida State University, US
Abstract
Fast set-associative level-one data caches (L1 DCs) access all ways in parallel during load operations for reduced access latency. This is required in order to resolve data dependencies as early as possible in the pipeline, which otherwise would suffer from stall cycles. A significant amount of energy is wasted due to this fast access, since the data can only reside in one of the ways. While it is possible to reduce L1 DC energy usage by accessing the tag and data memories sequentially, hence activating only one data way on a tag match, this approach significantly increases execution time due to an increased number of stall cycles. We propose an early load data dependency detection (ELD3) technique for in-order pipelines. This technique makes it possible to detect if a load instruction has a data dependency with a subsequent instruction. If there is no such dependency, then the tag and data accesses for the load are sequentially performed so that only the data way in which the data resides is accessed. If there is a dependency, then the tag and data arrays are accessed in parallel to avoid introducing additional stall cycles. For the MiBench benchmark suite, the ELD3 technique enables about 49% of all load operations to access the L1 DC sequentially. Based on 65-nm data using commercial SRAM blocks, the proposed technique reduces L1 DC energy by 13%.
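The decision this abstract describes (sequential access when no nearby instruction consumes the loaded value, parallel access otherwise) can be sketched as follows. The instruction record, the one-instruction hazard window and the function name are assumptions made for illustration. The abstract does not specify how the hardware performs the detection.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

class Instr(NamedTuple):
    op: str                  # e.g. "load", "add"
    dest: Optional[str]      # destination register, if any
    srcs: Tuple[str, ...]    # source registers

def load_access_mode(program: Sequence[Instr], index: int, window: int = 1) -> str:
    """Pick the cache access mode for the load at program[index].

    `window` is a hypothetical parameter: the number of following instructions
    that would stall if the load data arrived one cycle later.
    """
    load = program[index]
    if load.op != "load":
        raise ValueError("instruction at index is not a load")
    for later in program[index + 1 : index + 1 + window]:
        if load.dest in later.srcs:
            return "parallel"    # dependent use nearby: avoid the extra stall
        if later.dest == load.dest:
            break                # value overwritten before any read
    return "sequential"          # no nearby use: read only the matching way

program = [
    Instr("load", "r1", ("r10",)),
    Instr("add", "r2", ("r1", "r3")),   # consumes r1 immediately
    Instr("load", "r4", ("r11",)),
    Instr("sub", "r5", ("r2", "r3")),   # does not consume r4
]
modes = [load_access_mode(program, i) for i in (0, 2)]
```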
IP2-4 DISTRIBUTED COOPERATIVE SHARED LAST-LEVEL CACHING IN TILED MULTIPROCESSOR SYSTEM ON CHIP
Speakers:
Preethi Parayil Mana Damodaran1, Stefan Wallentowitz2 and Andreas Herkersdorf3
1LIS, Technical University of Munich, DE; 2Technische Universität München, Institute for Integrated Systems, DE; 3TU München, DE
Abstract
In a shared-memory based tiled many-core system-on-chip architecture, memory accesses present a huge performance bottleneck in terms of access latency as well as bandwidth requirements. The best practice approach to address this issue is to provide a multi-level cache hierarchy and a suitable cache-coherency mechanism. This paper presents a method to increase the memory access performance in distributed-directory-coherency-protocol based tiled many-core systems. The proposed method introduces an alternate design for the system-wide shared last-level caches (LLC) placed between the memory and the node private caches (NPC). The proposed system-wide shared LLC layer is distributed over the entire network and it interacts with the home directories of specific cache lines. Results from simulating SPEC2000 benchmark applications executed on a SystemC model of the proposed design show a minimum performance improvement of 20-25% when compared to a model without the shared cache layer at the expense of an additional 2% of the total cache memory space (NPC + LLC memory). In addition, the proposed design shows a minimum 7-15% and an average 14-15% improvement in performance in comparison to centralized system-wide shared LLC of equivalent size and dynamic mapped distributed LLC of equivalent size respectively.
IP2-5 DESIGN OF SAFETY CRITICAL SYSTEMS BY REFINEMENT
Speakers:
Alex Iliasov1, Arseniy Alekseyev2, Danil Sokolov3 and Andrey Mokhov3
1Newcastle University, GB; 2Newcastle University, GB; 3Newcastle University, GB
Abstract
An increasingly large number of safety-critical embedded systems rely on software to prevent and mitigate hazards occurring due to design errors and unexpected interactions of the system with its users and the environment. Implementing a safety instrumented function in the way advocated by the traditional software methods requires an intimate understanding and thorough validation of a complex ecosystem of programming languages, compilers, operating systems and hardware. We propose to consider an alternative where a system designer, for each individual problem, creates in a correct-by-construction manner both the design of a system and its compilation and execution infrastructure. This permits an uninterrupted chain of a formal correctness argument spanning from formalised requirements all the way to the gate-level characterisation of an execution environment. The past decade of advances in verification technology turned the mechanical verification of large-scale models into a reality while the pressure of certification makes the cost of a formally verified development routine increasingly acceptable. The proposed technique fits the Grand Challenge for Computer Research posed by Hoare in 2003, namely, development of a Verifying Compiler which not only mechanically translates a given program from one language to another but also verifies its correctness according to a formal specification. This allows meeting the most stringent software certification requirements such as SIL 4. We illustrate the idea with a small case-study developed using the Event-B modelling notation and tools.
IP2-6 ENERGY OPTIMIZATION IN ANDROID APPLICATIONS THROUGH WAKELOCK PLACEMENT
Speakers:
Faisal Alam1, Preeti Ranjan Panda1, Nikhil Tripathi2, Namita Sharma3 and Sanjiv Narayan2
1IIT Delhi, IN; 2Calypto Design Systems, IN; 3Indian Institute of Technology Delhi, IN
Abstract
Energy efficiency is a critical factor in mobile systems, and a significant body of recent research has focused on reducing the energy dissipation in mobile hardware and applications. The Android OS Power Manager provides programming interface routines called wakelocks for controlling the activation state of devices on a mobile system. An appropriate placement of wakelock acquire and release functions in the application can make a significant difference to the energy consumption. In this paper, we propose a data flow analysis based strategy for determining the placement of wakelock statements corresponding to the uses of devices in an application. Our experimental evaluation on a set of Android applications shows significant (up to 32%) energy savings with the proposed optimization strategy.
IP2-7 A WEAR-LEVELING-AWARE DYNAMIC STACK FOR PCM MEMORY IN EMBEDDED SYSTEMS
Speakers:
Qingan Li1, Yanxiang He2, Yong Chen2, Chun Xue3, Nan Jiang2 and Chao Xu2
1Wuhan University & City University of Hong Kong, CN; 2Wuhan University, CN; 3City University of Hong Kong, CN
Abstract
Phase Change Memory (PCM) is a promising DRAM replacement in embedded systems due to its attractive characteristics such as extremely low leakage power, high storage density and good scalability. However, PCM's low endurance constrains its practical applications. In this paper, we propose a Wear Leveling aware dynamic stack to extend PCM's lifetime when it is adopted in embedded systems as main memory. Through a dynamic stack, the memory space is circularly allocated to stack objects, and thus an even usage of PCM memory is achieved. The experimental results show that the proposed method can significantly reduce the write variation on PCM cells and enhance the lifetime of PCM memory.
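The effect of circular allocation on write distribution can be illustrated with a toy simulation. It assumes a deliberately simplified model in which every call writes each cell of its frame once and returns before the next call. The function and parameter names are hypothetical and the authors' allocator is not reproduced here.

```python
def simulate_frame_writes(memory_cells, frame_size, calls, circular):
    """Count writes per cell for repeated calls of one function.

    With circular=False every frame starts at address 0 (a conventional stack).
    With circular=True each new frame starts where the previous one ended and
    wraps around, so frames rotate through the whole memory region.
    """
    writes = [0] * memory_cells
    base = 0
    for _ in range(calls):
        for offset in range(frame_size):
            writes[(base + offset) % memory_cells] += 1
        if circular:
            base = (base + frame_size) % memory_cells
    return writes

fixed = simulate_frame_writes(64, 8, 80, circular=False)
rotating = simulate_frame_writes(64, 8, 80, circular=True)

# The most-written cell bounds the lifetime of an endurance-limited memory.
worst_fixed, worst_rotating = max(fixed), max(rotating)
```

In this toy setting the total number of writes is identical in both runs; only their distribution over the cells changes, which is the quantity wear leveling targets.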
IP2-8 LIFETIME HOLES AWARE REGISTER ALLOCATION FOR CLUSTERED VLIW PROCESSORS
Speakers:
Xuemeng Zhang1, Hui Wu2, Haiyan Sun1 and Jingling Xue3
1National University of Defense Technology, CN; 2The University of New South Wales, AU; 3UNSW, AU
Abstract
This paper presents an on-the-fly register allocator which dynamically detects and utilises lifetime holes for clustered VLIW processors. A lifetime hole is an interval in which a variable does not contain a valid value. A register holding a lifetime hole can be allocated to another variable whose live range fits in the lifetime hole, leading to more efficient utilisation of registers. We propose efficient techniques for dynamically utilising lifetime holes and incorporate these techniques into our on-the-fly register allocator. We have simulated our register allocator and a linear scan register allocator without considering lifetime holes by using the MediaBench II benchmark suite. Our simulation results show that our register allocator reduces the number of spills by 12.5%, 11.7%, 12.7%, for three different processor models, respectively.
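The core test behind reusing a lifetime hole (does another variable's live range fit entirely inside the hole?) can be sketched as below. Half-open intervals over instruction positions, the dictionary layout and both function names are assumptions for illustration. The on-the-fly allocator itself is not reconstructed.

```python
def fits_in_hole(hole, live_range):
    """True if live_range [start, end) lies completely inside hole [start, end)."""
    return hole[0] <= live_range[0] and live_range[1] <= hole[1]

def find_register_with_hole(holes_by_register, live_range):
    """Return the first register whose current holder has a hole covering live_range."""
    for register, holes in holes_by_register.items():
        for hole in holes:
            if fits_in_hole(hole, live_range):
                return register
    return None  # no hole fits: a free register or a spill would be needed

# Hypothetical state: r0's variable holds no valid value during [10, 30),
# r1's variable holds no valid value during [5, 12).
holes_by_register = {"r0": [(10, 30)], "r1": [(5, 12)]}

reuse = find_register_with_hole(holes_by_register, (12, 20))
no_fit = find_register_with_hole(holes_by_register, (8, 35))
```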
IP2-9 A LOW-POWER, HIGH-PERFORMANCE APPROXIMATE MULTIPLIER WITH CONFIGURABLE PARTIAL ERROR RECOVERY
Speakers:
Cong Liu1, Jie Han1 and Fabrizio Lombardi2
1University of Alberta, CA; 2Northeastern University, US
Abstract
Approximate circuits have been considered for error-tolerant applications that can tolerate some loss of accuracy with improved performance and energy efficiency. Multipliers are key arithmetic circuits in many such applications such as digital signal processing (DSP). In this paper, a novel approximate multiplier with a lower power consumption and a shorter critical path than traditional multipliers is proposed for high-performance DSP applications. This multiplier leverages a newly-designed approximate adder that limits its carry propagation to the nearest neighbors for fast partial product accumulation. Different levels of accuracy can be achieved through a configurable error recovery by using different numbers of most significant bits (MSBs) for error reduction. The approximate multiplier has a low mean error distance, i.e., most of the errors are not significant in magnitude. Compared to the Wallace multiplier, a 16-bit approximate multiplier implemented in a 28nm CMOS process shows a reduction in delay and power of 20% and up to 69%, respectively. It is shown that by utilizing an appropriate error recovery, the proposed approximate multiplier achieves similar processing accuracy as traditional exact multipliers but with significant improvements in power and performance.
IP2-10 A LINUX-GOVERNOR BASED DYNAMIC RELIABILITY MANAGER FOR ANDROID MOBILE DEVICES
Speakers:
Pietro Mercati1, Andrea Bartolini2, Francesco Paterna1, Tajana Simunic Rosing1 and Luca Benini2
1UCSD, US; 2University of Bologna, IT
Abstract
Reliability is a major concern in multiprocessors. Dynamic Reliability Management (DRM) aims at trading off processor performance with lifetime. The state-of-the-art publications study only the theory supported by simulation. This paper presents the first complete software implementation, working on real hardware, of a low-overhead, Android-compatible workload-aware DRM Governor for mobile multiprocessors. We discuss the design challenges and the run-time overhead involved. We show the effectiveness of our governor in guaranteeing the predefined target lifetime and show that it achieves up to 100% lifetime improvement with respect to traditional governors, while providing comparable performance for critical applications.
IP2-11 YIELD AND TIMING CONSTRAINED SPARE TSV ASSIGNMENT FOR THREE-DIMENSIONAL INTEGRATED CIRCUITS
Speakers:
Yu-Guang Chen1, Kuan-Yu Lai1, Ming-Chao Lee2, Yiyu Shi3, Wing-Kai Hon1 and Shih-Chieh Chang1
1National Tsing Hua University, TW; 2MediaTek Inc., TW; 3Missouri University of Science and Technology, US
Abstract
Through Silicon Via (TSV) is a critical enabling technique in three-dimensional integrated circuits (3D ICs). However, it may suffer from many reliability issues. Various fault-tolerance mechanisms have been proposed in literature to improve yield, at the cost of significant area overhead. In this paper, we focus on the structure that uses one spare TSV for a group of original TSVs, and study the optimal assignment of spare TSVs under yield and timing constraints to minimize the total area overhead. We show that such problem can be modeled through constrained graph decomposition. An efficient heuristic is further developed to address this problem. Experimental results show that under the same yield and timing constraints, our heuristic can reduce the area overhead induced by the fault-tolerance mechanisms by up to 38%, compared with a seemingly more intuitive nearest-neighbor based heuristic.
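The yield side of the one-spare-per-group structure can be made concrete with a small calculation. The sketch assumes independent TSV failures with a common probability, a spare that can itself fail, and a group that works when at most one of its TSVs is faulty. This is a textbook binomial model chosen for illustration, not necessarily the yield model used in the paper, and the function names are hypothetical.

```python
def group_yield(n_tsvs, fail_prob):
    """Probability that a group of n_tsvs signal TSVs plus one spare is functional.

    Assumption: the group survives if at most one of its n_tsvs + 1 TSVs fails.
    """
    total = n_tsvs + 1
    ok = 1.0 - fail_prob
    return ok ** total + total * fail_prob * ok ** (total - 1)

def chip_yield(group_sizes, fail_prob):
    """Yield of a chip whose TSVs are partitioned into independent groups."""
    result = 1.0
    for size in group_sizes:
        result *= group_yield(size, fail_prob)
    return result

# Larger groups need fewer spares (less area) but tolerate fewer faults per TSV.
few_large_groups = chip_yield([8, 8], 0.01)
many_small_groups = chip_yield([4, 4, 4, 4], 0.01)
```

Under these assumptions the trade-off in the abstract becomes visible: the partition into four groups spends two more spares than the partition into two groups and obtains a higher yield in return.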
IP2-12 COMPILER-DRIVEN DYNAMIC RELIABILITY MANAGEMENT FOR ON-CHIP SYSTEMS UNDER VARIABILITIES
Speakers:
Semeen Rehman, Florian Kriebel, Muhammad Shafique and Jörg Henkel, Karlsruhe Institute of Technology (KIT), DE
Abstract
This paper presents a novel Dynamic Reliability Management System (DyReMS) for on-chip systems that performs resilience-driven resource allocation and mapping. It accounts for both the tasks' resilience properties and heterogeneous error recovery features of different cores. DyReMS also chooses a reliable task version (out of multiple reliability-aware transformed options) depending upon the reliability level of the allocated core. In case of error detection, rollbacks are performed. Our system provides 70%-87% improved task reliability compared to a timing reliability-optimizing core assignment, i.e. minimizing the probability of deadline misses (with EDF scheduling).
IP2-13 (Best Paper Award Candidate)
MINIMIZING STATE-OF-HEALTH DEGRADATION IN HYBRID ELECTRICAL ENERGY STORAGE SYSTEMS WITH ARBITRARY SOURCE AND LOAD PROFILES
Speakers:
Yanzhi Wang1, Xue Lin1, Qing Xie1, Naehyuck Chang2 and Massoud Pedram1
1University of Southern California, US; 2Seoul National University, KR
Abstract
Hybrid electrical energy storage (HEES) systems consisting of heterogeneous electrical energy storage (EES) elements are proposed to exploit the strengths of different EES elements and hide their weaknesses. The cycle life of the EES elements is one of the most important metrics. The cycle life is directly related to the state-of-health (SoH), which is defined as the ratio of full charge capacity of an aged EES element to its designed (or nominal) capacity. Existing battery SoH degradation models in the literature apply only to charging/discharging cycles with the same state-of-charge (SoC) swing. To address this shortcoming, this paper derives a novel battery SoH degradation model for charging/discharging cycles with arbitrary patterns. Based on the proposed model, this paper presents a near-optimal charge management policy focusing on extending the cycle life of battery elements in the HEES systems while simultaneously improving the overall cycle efficiency.
IP2-14 DYNAMIC FLIP-FLOP CONVERSION TO TOLERATE PROCESS VARIATION IN LOW POWER CIRCUITS
Speakers:
Mehrzad Nejat, Bijan Alizadeh and Ali Afzali Kusha, School of Electrical and Computer Eng., College of Eng., University of Tehran, IR
Abstract
A novel time borrowing method called dynamic Flip-Flop conversion is presented in this paper. A timing violation predictor detects the violations halfway in the critical path and dynamically converts the critical Flip-Flop to a latch. This way, time borrowing benefits of latches are utilized in a Flip-Flop based design which is more adaptable with Computer-Aided-Design tools. The overhead of this method is smaller than that of similar methods due to the elimination of delay elements. According to the post-synthesis simulations and Monte-Carlo analysis of Spice simulations on some ITC'99 benchmark circuits, the power overhead of the proposed method is about 15% and 19% smaller than that of Soft-Edge-Flip-Flop and Dynamic-Clock-Stretching circuits respectively in a simple case of about 40% yield improvement. This overhead would be relatively even smaller for higher performance and yield improvements.
IP2-15 A LOW POWER AND ROBUST CARBON NANOTUBE 6T SRAM DESIGN WITH METALLIC TOLERANCE
Speakers:
Luo Sun1, Jimson Mathew1, Rishad Shafik2, Dhiraj Pradhan1 and Zhen Li1
1University of Bristol, GB; 2University of Southampton, GB
Abstract
Carbon nanotube field-effect transistor (CNTFET) is envisioned as a promising device to overcome the limitations of traditional CMOS based MOSFETs due to its favourable physical properties. This paper presents a novel six-transistor (6T) static random access memory (SRAM) bitcell design using CNTFETs. Extensive validations and comparative analyses are carried out with the proposed SRAM design using SPICE based simulations. We show that the proposed CNTFET based SRAM has a significantly better static noise margin (SNM) and write ability margin (WAM) compared to a CNTFET-based standard 6T bitcell, equivalent to isolated read-port 8T cell based on CNTFET, while consuming less dynamic power. We further demonstrate that it exhibits higher robustness under process, voltage and temperature (PVT) variations when compared with the traditional CMOS SRAM cell designs. Furthermore, metallic CNTs removal technique is used considering metallic tolerance to make the proposed SRAM design more reliable.
IP2-16 MAKE IT REAL: EFFECTIVE FLOATING-POINT REASONING VIA EXACT ARITHMETIC
Speakers:
Miriam Leeser1, Saoni Mukherjee1, Jaideep Ramachandran1 and Thomas Wahl2
1Northeastern University, US; 2Northeastern University, Boston, US
Abstract
Floating-point arithmetic is widely used in scientific computing. While many programmers are subliminally aware that floating-point numbers only approximate the reals, few are cognizant of the dangers this entails for programming. Such dangers range from tolerable rounding errors in sequential programs, to unexpected, divergent control flow in parallel code. To address these problems, we present a decision procedure for floating-point arithmetic (FPA) that exploits the proximity to real arithmetic (RA), via a lossless reduction from FPA to RA. Our procedure does not involve any form of bit-blasting or bit-vectorization, and can thus generate much smaller back-end decision problems, albeit in a more complex logic. This tradeoff is beneficial for the exact and reliable analysis of parallel scientific software, which tends to give rise to large but benignly structured formulas. We have implemented a prototype decision engine and present encouraging results analyzing such software for numerical accuracy.
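The dangers mentioned in this abstract can be reproduced in a few lines. The sketch sums the same three values in two association orders, as two differently scheduled parallel reductions might, and then branches on the result. Exact rational arithmetic from Python's standard `fractions` module serves as the reference. This only illustrates the problem; it is not the decision procedure the paper presents, and the function names and the threshold are chosen for illustration.

```python
from fractions import Fraction

def sum_left(a, b, c):
    """Reduce as (a + b) + c."""
    return (a + b) + c

def sum_right(a, b, c):
    """Reduce as a + (b + c)."""
    return a + (b + c)

def within_budget(total, limit):
    """A branch whose outcome depends on the computed sum."""
    return total <= limit

floats = (0.1, 0.2, 0.3)
exact = tuple(Fraction(text) for text in ("0.1", "0.2", "0.3"))

# In binary floating point the two orders round differently.
float_orders_agree = sum_left(*floats) == sum_right(*floats)

# In exact arithmetic addition is associative, so the orders always agree.
exact_orders_agree = sum_left(*exact) == sum_right(*exact)

# The rounding difference is enough to send control flow down different paths.
branch_left = within_budget(sum_left(*floats), 0.6)
branch_right = within_budget(sum_right(*floats), 0.6)
```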
IP2-17 WIDTH MINIMIZATION IN THE SINGLE-ELECTRON TRANSISTOR ARRAY SYNTHESIS
Speakers:
Chian-Wei Liu1, Chang-En Chiang1, Ching-Yi Huang1, Chun-Yao Wang1, Yung-Chih Chen2, Suman Datta3 and Vijaykrishnan Narayanan4
1Dept. of Computer Science, National Tsing Hua University, TW; 2Dept. of Computer Science and Engineering, Yuan Ze University, TW; 3Department of Electrical Engineering, The Pennsylvania State University, US; 4Department of Computer Science and Engineering, The Pennsylvania State University, US
Abstract
Power consumption has become one of the primary challenges to sustaining Moore's law. The Single-Electron Transistor (SET) operating at room temperature has been demonstrated as a promising device for extending Moore's law due to its ultra-low power consumption during operation. Prior work has proposed an automated mapping approach for SET architecture which focuses on minimizing the number of hexagons in an SET array. However, the area of an SET array is more closely related to its width. Consequently, in this work, we propose an approach for width minimization of the SET arrays. The experimental results show that the proposed approach saves 26% of width compared with the state-of-the-art for a set of MCNC and IWLS 2005 benchmarks while spending similar CPU time.
IP2-18 AREA MINIMIZATION SYNTHESIS FOR RECONFIGURABLE SINGLE-ELECTRON TRANSISTOR ARRAYS WITH FABRICATION CONSTRAINTS
Speakers:
Yi-Hang Chen, Jian-Yu Chen and Juinn-Dar Huang, Department of Electronics Engineering, National Chiao Tung University, TW
Abstract
As fabrication processes exploit even deeper submicron technology, power dissipation has become a crucial issue for most electronic circuit and system designs nowadays. In particular, leakage power is becoming a dominant source of power consumption. Recently, the reconfigurable single-electron transistor (SET) array has been proposed as an emerging circuit design style for continuing Moore's Law due to its ultra-low power consumption. Several automated synthesis approaches have been developed for the reconfigurable SET array in the past few years. Nevertheless, all of those existing methods consider fabrication constraints, which are mandatory, merely in late synthesis stages. In this paper, we propose a synthesis algorithm, featuring both variable reordering and product term reordering, for area minimization. In addition, our algorithm takes those mandatory fabrication constraints into account in early stages for better outcomes. Experimental results show that our new method can achieve an area reduction of up to 24% as compared to current state-of-the-art techniques.
IP2-19 SOFTWARE-BASED PAULI TRACKING IN FAULT-TOLERANT QUANTUM CIRCUITS
Speakers:
Alexandru Paler1, Simon Devitt2, Kae Nemoto2 and Ilia Polian1
1University of Passau, DE; 2National Institute of Informatics, JP
Abstract
The realisation of large-scale quantum computing is no longer simply a hardware question. The rapid development of quantum technology has resulted in dozens of control and programming problems that should be directed towards the classical computer science and engineering community. One such problem is known as Pauli tracking. Methods for implementing quantum algorithms that are compatible with crucial error correction technology utilise extensive quantum teleportation protocols. These protocols are intrinsically probabilistic and result in correction operators that occur as byproducts of teleportation. These byproduct operators do not need to be corrected in the quantum hardware itself, but are tracked through the circuit and output results reinterpreted. This tracking is routinely ignored in quantum information as it is assumed that tracking algorithms will eventually be developed. In this work we help fill this gap and present an algorithm for tracking byproduct operators through a quantum computation.
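The idea of tracking byproduct operators in software rather than correcting them in hardware can be sketched with a classical Pauli frame. The sketch keeps one X bit and one Z bit per qubit and uses the standard textbook propagation rules for Hadamard and CNOT gates, ignoring global phase. The class and method names are hypothetical, and this is not the algorithm presented in the paper, which targets teleportation-based fault-tolerant circuits.

```python
class PauliFrame:
    """Classical record of the pending X/Z byproduct on every qubit."""

    def __init__(self, num_qubits):
        self.x = [0] * num_qubits
        self.z = [0] * num_qubits

    def apply_byproduct(self, qubit, pauli):
        """Record a byproduct 'X', 'Y' or 'Z' instead of applying it in hardware."""
        if pauli in ("X", "Y"):
            self.x[qubit] ^= 1
        if pauli in ("Z", "Y"):
            self.z[qubit] ^= 1

    def h(self, qubit):
        """A Hadamard exchanges X and Z byproducts."""
        self.x[qubit], self.z[qubit] = self.z[qubit], self.x[qubit]

    def cnot(self, control, target):
        """X spreads from control to target, Z spreads from target to control."""
        self.x[target] ^= self.x[control]
        self.z[control] ^= self.z[target]

    def interpret_z_measurement(self, qubit, raw_bit):
        """Reinterpret a Z-basis outcome: a pending X byproduct flips it."""
        return raw_bit ^ self.x[qubit]

frame = PauliFrame(2)
frame.apply_byproduct(0, "X")      # byproduct reported by a teleportation step
frame.cnot(0, 1)                   # the byproduct now also affects qubit 1
corrected = frame.interpret_z_measurement(1, 0)
```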
IP2-20 AN EFFICIENT TEMPERATURE-GRADIENT BASED BURN-IN TECHNIQUE FOR 3D STACKED ICS
Speakers:
Nima Aghaee, Zebo Peng and Petru Eles, Linköping University, SE
Abstract
Burn-in is usually carried out with high temperature and elevated voltage. Since some of the early-life failures depend not only on high temperature but also on temperature gradients, simply raising the temperature of an IC is not sufficient to detect them. This is especially true for 3D stacked ICs, since they usually have very large temperature gradients. The efficient detection of these early-life failures requires that specific temperature gradients are enforced as a part of the burn-in process. This paper presents an efficient method to do so by applying high power stimuli to the cores of the IC under burn-in through the test access mechanism. Therefore, no external heating equipment is required. The scheduling of the heating and cooling intervals to achieve the required temperature gradients is based on thermal simulations and is guided by functions derived from a set of thermal equations. Experimental results demonstrate the efficiency of the proposed method.
IP2-21 TEST AND NON-TEST CUBES FOR DIAGNOSTIC TEST GENERATION BASED ON MERGING OF TEST CUBES
Speaker:
Irith Pomeranz, Purdue University, US
Abstract
Test generation by merging of test cubes supports test compaction and test data compression. This paper describes a new approach to the use of test cube merging for the generation of compact diagnostic test sets. For this the paper uses the new concept of non-test cubes. While a test cube for a fault f_i0 detects the fault, a non-test cube for a fault f_i1 prevents the fault from being detected. Merging a test cube for a fault f_i0 and a non-test cube for a fault f_i1 produces a diagnostic test cube that distinguishes the two faults. The paper describes a procedure for diagnostic test generation based on merging of test and non-test cubes. Experimental results demonstrate that compact diagnostic test sets are obtained.
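The merging operation itself can be sketched in a few lines. Cubes are written as strings over 0, 1 and x (unspecified), and two cubes merge when no position carries conflicting specified values. The cube values below are invented for illustration, and the function name is hypothetical. How test and non-test cubes are derived for a given circuit is outside this sketch.

```python
def merge_cubes(first, second):
    """Merge two cubes over {'0', '1', 'x'}, or return None if they conflict."""
    if len(first) != len(second):
        raise ValueError("cubes must have the same number of inputs")
    merged = []
    for a, b in zip(first, second):
        if a == "x":
            merged.append(b)          # take whatever the other cube specifies
        elif b == "x" or a == b:
            merged.append(a)
        else:
            return None               # one cube needs 0 where the other needs 1
    return "".join(merged)

# Hypothetical cubes: one detects the first fault, the other blocks detection
# of the second fault. Their merge does both, so it distinguishes the pair.
test_cube = "1x0x"
non_test_cube = "x10x"
diagnostic_cube = merge_cubes(test_cube, non_test_cube)

incompatible = merge_cubes("1xxx", "0xxx")
```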
IP2-22 NEW IMPLEMENTATIONS OF PREDICTIVE ALTERNATE ANALOG/RF TEST WITH AUGMENTED MODEL REDUNDANCY
Speakers:
Haithem Ayari, Florence Azais, Serge Bernard, Mariane Comte, Vincent Kerzerho and Michel Renovell, LIRMM, CNRS/Univ. Montpellier 2, FR
Abstract
This paper discusses new implementations of the predictive alternate test strategy that exploit model redundancy in order to improve test confidence. The key idea is to build during the training phase, not only one regression model for each specification as in the classical implementation, but several regression models. This redundancy is then used during the testing phase to identify suspect predictions and remove the corresponding devices from the alternate test flow. In this paper, we explore various options for implementing model redundancy, based on the use of different indirect measurement combinations and/or different partitions of the training set. The proposed implementations are evaluated on a real case study for which we have production test data from 10,000 devices.
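The use of redundancy during the testing phase can be sketched as a disagreement check. Several models predict the same specification from indirect measurements, and a device whose predictions spread too widely is treated as suspect and removed from the alternate test flow. The spread criterion, the threshold and the toy models are assumptions for illustration. The abstract does not state which consistency check the authors apply.

```python
def predict_with_redundancy(models, measurements, max_spread):
    """Return (prediction, all_predictions) for one device.

    `prediction` is the mean of the redundant predictions, or None when the
    models disagree by more than max_spread (a hypothetical threshold), in
    which case the device would fall back to conventional specification test.
    """
    predictions = [model(measurements) for model in models]
    spread = max(predictions) - min(predictions)
    if spread > max_spread:
        return None, predictions
    return sum(predictions) / len(predictions), predictions

# Toy stand-ins for regression models trained on different measurement
# combinations; real models would come from the training phase.
models = [
    lambda m: 2.0 * m[0],
    lambda m: m[1] + 1.0,
    lambda m: m[0] + 0.5 * m[1] + 0.5,
]

trusted, _ = predict_with_redundancy(models, (1.0, 1.0), max_spread=0.5)
suspect, raw = predict_with_redundancy(models, (1.0, 3.0), max_spread=0.5)
```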