IP5 Interactive Presentations

Date: Thursday 27 March 2014
Time: 15:30 - 16:00
Location / Room: Conference Level, foyer

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated with the IP paper is on display throughout the afternoon. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon's Interactive Presentations session, the 'Best IP of the Day' award is given.

Label Presentation Title
Authors
IP5-1 HYBRID WIRE-SURFACE WAVE ARCHITECTURE FOR ONE-TO-MANY COMMUNICATION IN NETWORK-ON-CHIP
Speakers:
Ammar Karkar1, Nizar Dahir1, Ra'ed Al-Dujaily2, Kenneth Tong3, Terrence Mak4 and Alex Yakovlev1
1School of Electrical and Electronic Engineering, Newcastle University, Newcastle upon Tyne, GB; 2General Systems Company, Baghdad - Iraq, IQ; 3Department of Electrical and Electronic Engineering, University College London, GB; 4Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong, CN
Abstract
Network-on-chip (NoC) is a communication paradigm that has emerged to tackle different on-chip challenges and has satisfied different demands in terms of high performance and economical interconnect implementation. However, a purely metal-based NoC offers limited scalability under relentless technology scaling, especially for one-to-many (1-to-M) communication. To meet the scalability demand, this paper proposes a new hybrid architecture empowered by both metal interconnects and Zenneck surface wave interconnects (SWI). This architecture, in conjunction with newly proposed routing and global arbitration schemes, avoids overloading the NoC and alleviates traffic hotspots compared to the trend of handling 1-to-M traffic as unicast. This work addresses the system-level challenges for intra-chip multicasting. Evaluation results, based on a cycle-accurate simulation and hardware description, demonstrate the effectiveness of the proposed architecture, with a power reduction of 2 to 12X and an average delay reduction of 25X or more compared to a regular NoC. These results are achieved with negligible hardware overheads.
IP5-2 FAILURE ANALYSIS OF A NETWORK-ON-CHIP FOR REAL-TIME MIXED-CRITICAL SYSTEMS
Speakers:
Eberle A Rambo1, Alexander Tschiene1, Jonas Diemer1, Leonie Ahrendts1 and Rolf Ernst2
1Technische Universität Braunschweig, DE; 2TU Braunschweig, DE
Abstract
Multi- and many-core architectures using Networks-on-Chip (NoC) are being explored for use in real-time safety-critical applications for their performance and efficiency. Such systems must provide isolation between tasks that may present distinct criticality levels. The NoC is critical to maintain the isolation property as it is a heavily used shared resource. To meet safety-standard requirements, such architectures require a systematic evaluation of the effects of all possible failures such as in the form of a Failure Mode and Effects Analysis (FMEA). We present the results of a detailed system-level analysis of a typical real-time mixed-critical network-on-chip architecture. This comprises an FMEA and error effects classification regarding duration and isolation violation.
IP5-3 COOLIP: SIMPLE YET EFFECTIVE JOB ALLOCATION FOR DISTRIBUTED THERMALLY-THROTTLED PROCESSORS
Speakers:
Pratyush Kumar, Hoeseok Yang, Iuliana Bacivarov and Lothar Thiele, ETH Zurich, CH
Abstract
Thermal constraints limit the time for which a processor can run at high frequency. Such thermal throttling complicates the computation of response times of jobs. For multiple processors, a key decision is where to allocate the next job. For distributed thermally-throttled processors, we present COOLIP with a simple allocation policy: a job is allocated to the earliest available processor, and if there are several available simultaneously, to the coolest one. For Poisson distribution of inter-arrival times and Gaussian distribution of execution demand of jobs, COOLIP matches the 95th-percentile response time of the Earliest Finish-Time (EFT) policy, which minimizes response time with full knowledge of execution demand of unfinished jobs and thermal models of processors. We argue that COOLIP performs well because it directs the processors into states such that a defined sufficient condition of optimality holds.
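The allocation rule stated in this abstract (earliest available processor, ties broken by lowest temperature) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Processor` record, its field names, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    # Hypothetical record; field names are illustrative, not from the paper.
    name: str
    available_at: float  # time at which the processor can accept a new job
    temperature: float   # current temperature estimate


def pick_processor(processors):
    """Return the earliest-available processor; break ties by lowest temperature."""
    return min(processors, key=lambda p: (p.available_at, p.temperature))


procs = [
    Processor("P0", available_at=4.0, temperature=61.0),
    Processor("P1", available_at=2.0, temperature=70.0),
    Processor("P2", available_at=2.0, temperature=55.0),
]
chosen = pick_processor(procs)
print(chosen.name)  # P2: ties with P1 on availability, but is cooler
```

Note that the rule needs only availability times and temperatures, whereas the EFT policy it is compared against requires execution demands and thermal models.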
IP5-4 ENERGY OPTIMIZATION IN 3D MPSOCS WITH WIDE-I/O DRAM USING TEMPERATURE VARIATION AWARE BANK-WISE REFRESH
Speakers:
Mohammadsadegh Sadri1, Matthias Jung2, Christian Weis2, Norbert Wehn2 and Luca Benini1
1Department of Electrical, Electronic and Information Engineering (DEI) University of Bologna, IT; 2Microelectronic Systems Design Research Group, University of Kaiserslautern, DE
Abstract
Heterogeneous 3D integrated systems with Wide-I/O DRAMs are a promising solution to squeeze more functionality and storage bits into an ever decreasing volume. Unfortunately, with 3D stacking, the challenges of high power densities and thermal dissipation are exacerbated. We improve DRAM refresh power by considering the lateral and vertical temperature variations in the 3D structure and adapting the per-DRAM-bank refresh period accordingly. To provide proof of our concepts, we develop an advanced virtual platform which models the performance, power, and thermal behavior of a 3D-integrated MPSoC with Wide-I/O DRAMs in detail. On this platform we run the Android OS with real-world benchmarks to quantify the advantages of our ideas. We show improvements of 16% in DRAM refresh power due to temperature variation aware bank-wise refresh. Furthermore, two solutions are investigated to speed up system simulations: (1) Adaptive tuning of sampling intervals based on the estimated chip thermal profile, which results in speedups of 2X. (2) Hardware acceleration of thermal simulations using the Maxeler engine, which shows possible speedups of 12X.
IP5-5 EFFICIENT SIMULATION AND MODELLING OF NON-RECTANGULAR NOC TOPOLOGIES
Speakers:
Ji Qi and Mark Zwolinski, University of Southampton, GB
Abstract
With increasing chip complexity, Networks-on-Chips (NoCs) are becoming a central platform for future on-chip communications. Many regular NoC architectures have been proposed to eliminate the communication bottlenecks of traditional bus-based networks. Non-rectangular and irregular architectures have also been proposed to increase performance. However, the complexity of designing custom non-rectangular networks leads to a rapid increase in design and verification times. To alleviate the conflict between performance and efficiency, this paper proposes a novel method that efficiently constructs virtual non-rectangular topologies on a mesh network by using time-regulated models to emulate irregular patterns. Data routings on virtual hexagonal and two irregular geometries validate the proposed method. An MPEG-4 decoder is used to exemplify the proposed method for media applications. Results analysis shows the virtual topologies emulated by the proposed method can provide precise timing and energy performance.
IP5-6 MOVING FROM CO-SIMULATION TO SIMULATION FOR EFFECTIVE SMART SYSTEMS DESIGN
Speakers:
Franco Fummi1, Michele Lora2, Francesco Stefanni3, Dimitrios Trachanis4, Jan Vanhese4 and Sara Vinco2
1University of Verona, EDALab s.r.l., IT; 2University of Verona, IT; 3EDALab s.r.l., IT; 4Agilent Technologies, BE
Abstract
Design of smart systems needs to cover a wide variety of domains, ranging from analogue to digital, with power devices, micro-sensors and actuators, up to MEMS. This high level of heterogeneity makes design a very challenging task, as each domain is supported by specific languages, modeling formalisms and simulation frameworks. A major issue is furthermore posed by simulation, which heavily impacts the design and verification loop and is very hard to build in such a heterogeneous context. On the other hand, achieving efficient simulation would indeed make smart system design feasible with respect to budget constraints. This work provides a formalization of the typical abstraction levels and design domains of a smart system. This taxonomy makes it possible to identify a precise role in the design flow for co-simulation and simulation scenarios. Moreover, a methodology is proposed to move from the co-simulated heterogeneity to a simulatable homogeneous representation in C++ of the entire smart system. The impact of heterogeneous or homogeneous models of computation is also examined. Experimental results prove the effectiveness of the proposed C++ generation for reaching high-speed simulation.
IP5-7 AUTOMATING DATA REUSE IN HIGH-LEVEL SYNTHESIS
Speakers:
Wim Meeus1 and Dirk Stroobandt2
1Imec and Ghent University, BE; 2Ghent University, BE
Abstract
Current High-Level Synthesis (HLS) tools perform excellently for the synthesis of computation kernels, but they often don't optimize memory bandwidth. As memory access is a bottleneck in many algorithms, the performance of the generated circuit will benefit substantially from memory access optimization. In this paper we present an automated method and a toolchain to detect reuse of array data in loop nests and to build hardware that exploits this data reuse. This saves memory bandwidth and improves circuit performance. We make use of the polyhedral representation of the source program, which makes our method computationally easy. Our software complements the existing HLS flows. Starting from a loop nest written in C, our tool generates a reuse buffer and a loop controller, and preprocesses the loop body for synthesis with an existing HLS tool. Our automated tool produces designs from unoptimized source code that are as efficient as those generated by a commercial HLS tool from manually-optimized source code.
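To make the idea of array data reuse in a loop concrete, the sketch below compares a naive three-element sliding-window sum, which re-reads overlapping array elements, with a version that keeps the window in a small reuse buffer and reads each element once. It is a software analogy only, under the assumption of a simple 1D window; the function names are hypothetical and this is not the authors' polyhedral toolchain or its generated hardware.

```python
from collections import deque


def window3_naive(a):
    """Sum of three neighbours; every output re-reads three array elements."""
    reads = 0
    out = []
    for i in range(len(a) - 2):
        out.append(a[i] + a[i + 1] + a[i + 2])
        reads += 3
    return out, reads


def window3_reuse(a):
    """Same computation with a 3-entry reuse buffer; each element is read once."""
    reads = 0
    out = []
    window = deque(maxlen=3)  # models a small on-chip reuse buffer
    for value in a:
        window.append(value)  # one memory read per loop iteration
        reads += 1
        if len(window) == 3:
            out.append(sum(window))
    return out, reads


data = list(range(8))
naive_out, naive_reads = window3_naive(data)
reuse_out, reuse_reads = window3_reuse(data)
print(naive_out == reuse_out, naive_reads, reuse_reads)  # True 18 8
```

For an input of 8 elements the buffered version performs 8 reads instead of 18, which is the kind of memory bandwidth saving the abstract refers to.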
IP5-8 A UNIVERSAL SYMMETRY DETECTION ALGORITHM
Speaker:
Peter Maurer, Dept. of Computer Sci., Baylor University, US
Abstract
Research on symmetry detection focuses on identifying and detecting new types of symmetry. We present an algorithm that is capable of detecting any type of permutation-based symmetry, including many types for which there are no existing algorithms. General symmetry detection is library-based, but symmetries that can be parameterized (i.e., total, partial, rotational, and dihedral symmetry) can be detected without using libraries. In many cases the algorithm is faster than existing techniques. Furthermore, it is simpler than most existing techniques and can easily be incorporated into existing software.
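As a reminder of what permutation-based symmetry means, the brute-force check below tests whether a Boolean function is unchanged when its inputs are permuted. It only illustrates the definition by exhaustive enumeration and is not the algorithm presented in the paper; the helper name and the example functions are assumptions.

```python
from itertools import product


def is_invariant(f, n, perm):
    """True if f gives the same value on every input and on its permuted copy."""
    for bits in product((0, 1), repeat=n):
        permuted = tuple(bits[perm[i]] for i in range(n))
        if f(*bits) != f(*permuted):
            return False
    return True


def majority(a, b, c):
    return int(a + b + c >= 2)


def and_not(a, b, c):
    # Depends on the order of a and b, so swapping them changes the function.
    return int(a and not b)


swap_ab = (1, 0, 2)  # exchange the first two inputs
rotate = (1, 2, 0)   # rotate all three inputs

print(is_invariant(majority, 3, swap_ab))  # True
print(is_invariant(majority, 3, rotate))   # True
print(is_invariant(and_not, 3, swap_ab))   # False
```

The exhaustive loop costs 2^n evaluations per permutation, so it is only practical for small functions.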
IP5-9 OPTIMIZATION OF DESIGN COMPLEXITY IN TIME-MULTIPLEXED CONSTANT MULTIPLICATIONS
Speakers:
Levent Aksoy1, Paulo Flores2 and Jose Monteiro3
1INESC-ID, PT; 2INESC-ID/IST ULisbon, PT; 3INESC-ID / IST, ULisbon, PT
Abstract
The multiplication of constants by a data input is an essential operation in digital signal processing (DSP) systems. For applications requiring a large number of constant multiplications under stringent hardware constraints, this operation is generally realized in a folded architecture where, at any given time, a single constant selected from a set of multiple constants is multiplied by the data input; this is called time-multiplexed constant multiplication (TMCM). This paper addresses the problem of optimizing the complexity of a TMCM design and introduces an algorithm that finds the least complex TMCM design by sharing the logic operators, i.e., adders, subtractors, adders/subtractors, and multiplexors (MUXes). It includes efficient search methods, yielding better results than existing TMCM algorithms.
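A small example may help picture operator sharing in a TMCM block. The sketch below multiplies an input by either 5 or 7 using one shift, one shared adder/subtractor, and a select signal standing in for the MUXes. The constants 5 and 7 and the function name are chosen purely for illustration; this is not the optimization algorithm from the paper.

```python
def tmcm_5_or_7(x, sel):
    """Multiply x by 5 (sel=0) or by 7 (sel=1) with shared shift-add logic.

    5*x = (x << 2) + x
    7*x = (x << 3) - x
    """
    shifted = x << (3 if sel else 2)            # MUX selects the shift amount
    return shifted - x if sel else shifted + x  # one shared adder/subtractor


print(tmcm_5_or_7(3, 0))  # 15
print(tmcm_5_or_7(3, 1))  # 21
```

Both constants are realized with a single arithmetic operator instead of a general multiplier, which is the kind of sharing a TMCM optimizer searches for across larger constant sets.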
IP5-10 HARDWARE PRIMITIVES FOR THE SYNTHESIS OF MULTITHREADED ELASTIC SYSTEMS
Speakers:
Giorgos Dimitrakopoulos1, Seitanidis Ioannis2, Anastasios Psarras1, Konstantinos Tsiouris1, Pavlos Matthaiakis3 and Jordi Cortadella4
1Democritus University of Thrace, GR; 2Democritus University of Thrace, GR; 3Mentor Graphics, FR; 4Universitat Politecnica de Catalunya, ES
Abstract
Elastic systems operate in a dataflow-like mode using a distributed scalable control and tolerating variable latency computations. At the same time, multithreading increases the utilization of processing units and hides the latency of each operation by time-multiplexing operations of different threads in the datapath. This paper proposes a model to unify multithreading and elasticity. A new multithreaded elastic control protocol is introduced supported by low-cost elastic buffers that minimize the storage requirements without sacrificing performance. To enable the synthesis of multithreaded elastic architectures, new hardware primitives are proposed and utilized in two circuit examples to prove the applicability of the proposed approach.
IP5-11 DCM: AN IP FOR THE AUTONOMOUS CONTROL OF OPTICAL AND ELECTRICAL RECONFIGURABLE NOCS
Speakers:
Wolfgang Büter1, Christof Osewold1, Daniel Gregorek1 and Alberto Garcia-Ortiz2
1University of Bremen, DE; 2ITEM (U.Bremen), DE
Abstract
The increasing requirements for bandwidth and quality-of-service motivate the use of parallel interconnect architectures with several degrees of reconfiguration. This paper presents an IP, called Distributed Channel Management (DCM), to extend existing packet-switched NoCs with a reconfigurable point-to-point network seamlessly, i.e., without the need for any modification of the routers. The configuration of the reconfigurable network takes place dynamically and autonomously, so that the topology can be changed at run time. Furthermore, the architecture is scalable due to the autonomous decentralized administration of the links. The paper reports a thorough experimental analysis of the overhead of the approach at the gate level that considers different network parameters such as flit size and timing constraints.
IP5-12 MINIMALLY BUFFERED SINGLE-CYCLE DEFLECTION ROUTER
Speakers:
Gnaneswara Rao Jonna1, John Jose1, Rachana Radhakrishnan2 and Madhu Mutyam1
1Indian Institute of Technology, Madras, IN; 2Rajagiri School of Engineering & Technology, Kochi, IN
Abstract
With the drift from computation-centric designs to communication-centric designs in the Chip Multi Processor (CMP) era, the interconnect fabric is gaining more importance. An efficient NoC in terms of power, area and average flit latency has a huge impact on the overall performance of a CMP. In the current work, we propose MinBSD - a minimally buffered, single cycle, deflection router. It incorporates different operations (Injection, Ejection, Preemption, Re-injection) in a single module to handle the traffic effectively and ensures smooth flow of flits through the router pipeline. It performs overlapped execution of independent operations. These factors not only allow MinBSD to operate in a single cycle but also reduce the critical path latency, resulting in a faster interconnect network. Experimental results show that MinBSD reduces the average flit latency on real workloads and reduces die area and power consumption when compared to existing state-of-the-art minimally buffered deflection routers.
IP5-13 FUNCTIONAL TEST GENERATION GUIDED BY STEADY-STATE PROBABILITIES OF ABSTRACT DESIGN
Speakers:
Jian Wang1, Huawei Li2, Tao Lv2, Tiancheng Wang2 and Xiaowei Li2
1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
This paper presents a novel method for functional test generation that aims at exploring the control state space of the design. The steady-state probabilities (SP's) of the abstract design's control FSM are used to guide test generation. The SP's of the states reflect how hard the states are to reach, and the hard-to-reach states are assigned high priority to be exercised. Experimental results show that our method has better performance in test generation in comparison with constrained random simulation, and demonstrate that SP's provide good guidance on traversing hard-to-reach states of the design under validation.
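The notion of steady-state probabilities guiding test priority can be illustrated on a toy three-state control FSM modelled as a Markov chain. The sketch below estimates the stationary distribution by power iteration and ranks states from hardest to easiest to reach. The transition matrix, the state names, and the use of power iteration are assumptions for illustration; the paper's abstraction and test generation flow are not reproduced here.

```python
def steady_state(P, iterations=1000):
    """Approximate the stationary distribution of row-stochastic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        # One step of pi <- pi * P
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical FSM: 0 = IDLE, 1 = BUSY, 2 = ERROR
P = [
    [0.9, 0.1, 0.0],  # IDLE mostly stays idle
    [0.5, 0.4, 0.1],  # BUSY occasionally reaches ERROR
    [1.0, 0.0, 0.0],  # ERROR always returns to IDLE
]

pi = steady_state(P)
# States with the lowest steady-state probability get the highest priority.
priority = sorted(range(len(pi)), key=lambda s: pi[s])
print(priority)  # [2, 1, 0]
```

For this chain the exact stationary distribution is (60/71, 10/71, 1/71), so the ERROR state is visited least often and would be targeted first.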
IP5-14 AUTOMATED SYSTEM TESTING USING DYNAMIC AND RESOURCE RESTRICTED CLIENTS
Speakers:
Mirko Caspar, Mirko Lippmann and Wolfram Hardt, Technische Universität Chemnitz, DE
Abstract
Testing at system level using a static and homogeneous architecture of clients is common practice. This paper introduces a new approach that uses a heterogeneous and dynamic set of resource-restricted test clients for automated testing. Due to changing resources and availability of the clients, the test case distribution needs to be recalculated dynamically during the test execution. All necessary conditions and parameters are represented by a formal model. It is shown that the algorithmic problem of DYNAMIC TESTPARTITIONING can be solved in polynomial time by a heuristic recursive algorithm. A testbench architecture is introduced, and simulation shows that the testbench can execute the test requirements within a small variation using several hundred clients. The system can react dynamically to changing resources and availability of the test clients within several seconds. The approach is generic and can be adapted to a huge number of systems.
IP5-15 RELIABILITY-AWARE MAPPING OPTIMIZATION OF MULTI-CORE SYSTEMS WITH MIXED-CRITICALITY
Speakers:
Shin-Haeng Kang1, Hoeseok Yang2, Sungchan Kim3, Iuliana Bacivarov2, Soonhoi Ha1 and Lothar Thiele4
1Seoul National University, KR; 2ETH Zurich, CH; 3Chonbuk National University, KR; 4Swiss Federal Institute of Technology Zurich, CH
Abstract
This paper presents a novel mapping optimization technique for mixed-critical multi-core systems with different reliability requirements. To this end, we derive a quantitative reliability metric and present a scheduling analysis that certifies given mixed-criticality constraints. Our framework is capable of investigating re-execution, passive replication, and modular redundancy with optimized voter placement, while typical hardening approaches consider only one or two of these techniques. The proposed technique complies with existing safety standards and is power-efficient, as demonstrated by our experiments.
IP5-16 (Best Paper Award Candidate)
FROM SIMULINK TO NOC-BASED MPSOC ON FPGA
Speakers:
Francesco Robino and Johnny Öberg, KTH Royal Institute of Technology, SE
Abstract
Network-on-chip (NoC) based multi-processor systems are promising candidates for future embedded system platforms. However, because of their complexity, new high-level modeling techniques are needed to design, simulate and synthesize embedded systems targeting NoC-based MPSoC. Simulink is a popular modeling environment suitable for modeling at system level. However, there is no clear standard to synthesize Simulink models into SW and HW towards a NoC-based MPSoC implementation. In addition, many of the proposed solutions require large overhead in terms of SW components and memory requirements, resulting in complex and customized multi-processor platforms. In this paper we present a novel design flow to synthesize Simulink models onto a NoC-based MPSoC running on low-cost FPGAs. Our design flow constrains the MPSoC and the Simulink model to share a common semantics domain. This reduces the need for resource-consuming SW components and thus the memory requirements on the platform. At the same time, the performance (throughput) of dataflow applications can increase when the number of processors of the target platform is increased. This is shown through a case study on FPGA.
IP5-17 (Best Paper Award Candidate)
THERMAL ANALYSIS AND MODEL IDENTIFICATION TECHNIQUES FOR A LOGIC + WIDEIO STACKED DRAM TEST CHIP
Speakers:
Francesco Beneventi1, Andrea Bartolini1, Pascal Vivet2, Denis Dutoit2 and Luca Benini1
1DEI - University of Bologna, IT; 2CEA-Leti, Grenoble, FR
Abstract
High temperature is one of the limiting factors and major concerns in 3D-chip integration. In this paper we use a 3D test chip (WIDEIO DRAM on top of a logic die) equipped with temperature sensors and heaters to explore thermal effects. We correlated real temperature measurements with the power dissipated by the heaters using model learning techniques. The resulting compact thermal model is able to predict temperatures at chip locations far from the temperature sensors and to infer the power dissipation at any location of the chip. Results are verified by means of an off-sample validation technique and show a high accuracy of the compact thermal model when compared with silicon measurements.
IP5-18 ADAPTIVE POWER ALLOCATION FOR MANY-CORE SYSTEMS INSPIRED FROM MULTIAGENT AUCTION MODEL
Speakers:
Xiaohang Wang1, Baoxin Zhao1, Terrence Mak2, Mei Yang3, Yingtao Jiang3, Masoud Daneshtalab4 and Maurizio Palesi5
1Guangzhou Institute of Advanced Technology, CN; 2The Chinese University of Hong Kong, CN; 3University of Nevada, Las Vegas, US; 4University of Turku, FI; 5University of Enna, Kore, IT
Abstract
Scaling of future many-core chips is hindered by the challenge imposed by ever-escalating power consumption. At its worst, an increasing fraction of the chips will have to be shut down, as power supply is inadequate to simultaneously switch all the transistors. This so-called dark silicon problem brings up a critical issue regarding how to achieve the maximum performance with a given limited power budget. This issue is further complicated by two facts. First, high variation in power budget calls for wide-range power control capability, whereas most current frequency/voltage scaling techniques cannot effectively adjust power over such a wide range. Second, as the applications' behavior becomes more complicated, there is a pressing need for scalability and global coordination, rendering heuristic-based centralized or fully distributed control schemes inefficient. To address the aforementioned problems, this paper proposes a power allocation method employing multiagent auction models, referred to as Hierarchal MultiAgent based Power allocation (HiMAP). Tiles act as consumers that bid for power budget, and the whole process is modeled by a combinatorial auction for which HiMAP finds the Walrasian equilibria. Experimental results have confirmed that HiMAP can reduce the execution time by as much as 45% compared to three competing methods. The runtime overhead and cost of HiMAP are also small, which makes it suitable for adaptive power allocation in many-core systems.
IP5-19 UNIFIED, ULTRA COMPACT, QUADRATIC POWER PROXIES FOR MULTI-CORE PROCESSORS
Speakers:
Muhammad Yasin1, Ibrahim (Abe) Elfadel2 and Anas Shahrour2
1New York University - Abu Dhabi, AE; 2Masdar Institute of Science and Technology, AE
Abstract
Per-core power proxies for multi-core processors are known to use several dozens of hardware activity monitors to achieve a 2% accuracy on core power estimation. These activity monitors are typically not accessible to the user, and even if they were accessible, there would be a significant overhead in using them at the kernel or OS level for power monitoring or control. Furthermore, when scaled up to hundreds of cores per chip, such power proxies become a computational bottleneck for power management operations such as chip power capping. In this paper, we show that a 4% accuracy or better for per-core power estimation can be achieved using an ultra compact power proxy based on a hybrid set of only four user-accessible parameters, namely core frequency, core temperature, instruction-per-cycle and active-state residency. Our proxy is nonlinear, valid across all P and C states, and is based on a randomized power data collection strategy that aims at exercising all the P and C levels of each core. We illustrate the accuracy of the model using the full suite of the SPEC CPU 2006 benchmarks on a 12-core processor.
IP5-20 3D FPGA USING HIGH-DENSITY INTERCONNECT MONOLITHIC INTEGRATION
Speakers:
Ogun Turkyilmaz1, Gerald Cibrario2, Olivier Rozeau2, Perrine Batude2 and Fabien Clermidy3
1CEA-LETI, Minatec Campus, FR; 2CEA, FR; 3CEA-LETI, FR
Abstract
A new 3D technology, called "Monolithic Integration", offers very dense 3D interconnect capabilities. In this paper, we propose a 3D FPGA architecture with a logic-on-memory approach based on this technology. The routing and computation blocks are split into two layers, where the logic is placed on the top and the memory on the bottom. Using values extracted from layout in 14nm FDSOI technology, typical benchmark circuits are evaluated in the VPR5 toolflow. The results show an area reduction of 55% compared to the 2D FPGA. More importantly, due to the lowered routing congestion, the EDP of the 3D FPGA is improved by 47%.
IP5-21 JOINT COMMUNICATION SCHEDULING AND INTERCONNECT SYNTHESIS FOR FPGA-BASED MANY-CORE SYSTEMS
Speakers:
Alessandro Cilardo, Edoardo Fusella, Luca Gallo and Antonino Mazzeo, University of Naples Federico II, IT
Abstract
This work proposes an automated methodology for optimizing FPGA-based many-core interconnect architectures. Based on the application communication requirements, the methodology concurrently defines the structure of the interconnect and the communication task scheduling, taking into account possible dependencies between tasks under given area constraints. The resulting architecture improves the level of communication parallelism that can be exploited while keeping area costs low. The paper thoroughly describes the proposed approach and discusses a few case-studies showing the impact of the proposed technique.
IP5-22 A NOVEL EMBEDDED SYSTEM FOR VISION TRACKING
Speakers:
Antonis Nikitakis1, Theofilos Paganos1 and Ioannis Papaefstathiou2
1Technical University of Crete, Department of Electronic and Computer Engineering Kounoupidiana, Chania, Crete, GR73100, Greece, GR; 2Synelixis Solutions Ltd, Farmakidou 10,Chalkida, GR34100, Greece, GR
Abstract
One of the most important challenges in the field of Computer Vision is the implementation of low-power embedded systems that will execute very accurate, yet real-time, algorithms. In the visual tracking sector one of the most promising approaches is the recently introduced OpenTLD algorithm which uses a random forest classification method. While it is very robust, it cannot be efficiently parallelized in its native form as its memory access pattern has certain characteristics that make it hard to take advantage of the conventional memory hierarchies. In this paper, we present a novel embedded system implementing this algorithm. We accelerate the bottleneck of the algorithm by designing and implementing a high bandwidth distributed memory sub-system which is independent of the various software parameters. We demonstrate the applicability and efficiency of this novel approach by implementing our scheme in a modern FPGA.