IP5 Interactive Presentations

Date: Thursday 27 March 2014
Time: 15:30 - 16:00
Location / Room: Conference Level, foyer

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated to the IP paper is on display throughout the afternoon. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session the award 'Best IP of the Day' is given.

Label	Presentation Title Authors
IP5-1	HYBRID WIRE-SURFACE WAVE ARCHITECTURE FOR ONE-TO-MANY COMMUNICATION IN NETWORK-ON-CHIP Speakers: Ammar Karkar¹, Nizar Dahir¹, Ra'ed Al-Dujaily², Kenneth Tong³, Terrence Mak⁴ and Alex Yakovlev¹ ¹School of Electrical and Electronic Engineering, Newcastle University, Newcastle upon Tyne, GB; ²General Systems Company, Baghdad - Iraq, IQ; ³Depart- ment of Electrical and Electronic Engineering, University College London, GB; ⁴Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong, CN Abstract Network-on-chip (NoC) is a communication paradigm that has emerged to tackle different on-chip challenges and has satisfied different demands in terms of high performance and economical interconnect implementation. However, merely metal based NoC pursuit offers limited scalability with the relentless technology scaling, especially in one-to-many (1-to-M) communication. To meet the scalability demand, this paper proposes a new hybrid architecture empowered by both metal interconnects and Zenneck surface wave interconnects (SWI). This architecture, in conjunction with newly proposed routing and global arbitration schemes, avoids overloading the NoC and alleviates traffic hotspots compared to the trend of handling 1-to-M traffic as unicast. This work addresses the system level challenges for intra chip multicasting. Evaluation results, based on a cycle-accurate simulation and hardware description, demonstrate the effectiveness of the proposed architecture in terms of power reduction ratio of 2 to 12X and average delay reduction of 25X or more, compared to a regular NoC. These results are achieved with negligible hardware overheads.
IP5-2	FAILURE ANALYSIS OF A NETWORK-ON-CHIP FOR REAL-TIME MIXED-CRITICAL SYSTEMS Speakers: Eberle A Rambo¹, Alexander Tschiene¹, Jonas Diemer¹, Leonie Ahrendts¹ and Rolf Ernst² ¹Technische Universität Braunschweig, DE; ²TU Braunschweig, DE Abstract Multi- and many-core architectures using Networks-on-Chip (NoC) are being explored for use in real-time safety-critical applications for their performance and efficiency. Such systems must provide isolation between tasks that may present distinct criticality levels. The NoC is critical to maintain the isolation property as it is a heavily used shared resource. To meet safety-standard requirements, such architectures require a systematic evaluation of the effects of all possible failures such as in the form of a Failure Mode and Effects Analysis (FMEA). We present the results of a detailed system-level analysis of a typical real-time mixed-critical network-on-chip architecture. This comprises an FMEA and error effects classification regarding duration and isolation violation.
IP5-3	COOLIP: SIMPLE YET EFFECTIVE JOB ALLOCATION FOR DISTRIBUTED THERMALLY-THROTTLED PROCESSORS Speakers: Pratyush Kumar, Hoeseok Yang, Iuliana Bacivarov and Lothar Thiele, ETH Zurich, CH Abstract Thermal constraints limit the time for which a processor can run at high frequency. Such thermal-throttling complicates the computation of response times of jobs. For multiple processors, a key decision is where to allocate the next job. For distributed thermally-throttled procesosrs, we present COOLIP with a simple allocation policy: a job is allocated to the earliest available processor, and if there are several available simultaneously, to the coolest one. For Poisson distribution of inter-arrival times and Gaussian distribution of execution demand of jobs, COOLIP matches the 95-percentile response time of Earliest Finish-Time (EFT) policy which minimizes response time with full knowledge of execution demand of unfinished jobs and thermal models of processors. We argue that COOLIP performs well because it directs the processors into states such that a defined sufficient condition of optimality holds.
IP5-4	ENERGY OPTIMIZATION IN 3D MPSOCS WITH WIDE-I/O DRAM USING TEMPERATURE VARIATION AWARE BANK-WISE REFRESH Speakers: Mohammadsadegh Sadri¹, Matthias Jung², Christian Weis², Norbert Wehn² and Luca Benini¹ ¹Department of Electrical, Electronic and Information Engineering (DEI) University of Bologna, IT; ²Microelectronic Systems Design Research Group, University of Kaiserslautern, DE Abstract Heterogeneous 3D integrated systems with Wide-I/O DRAMs are a promising solution to squeeze more functionality and storage bits into an ever decreasing volume. Unfortunately, with 3D stacking, the challenges of high power densities and thermal dissipation are exacerbated. We improve DRAM refresh power by considering the lateral and vertical temperature variations in the 3D structure and adapting the per-DRAM-bank refresh period accordingly. In order to provide proof of our concepts we develop an advanced virtual platform which models the performance, power, and thermal behavior of a 3D-integrated MPSoC with Wide-I/O DRAMs in detail. On this platform we run the Android OS with real-world benchmarks to quantify the advantages of our ideas. We show improvements of 16% in DRAM refresh power due to temperature variation aware bank-wise refresh. Furthermore, two solutions are investigated to speedup system simulations: (1) Adaptive tuning of sampling intervals based on the estimated chip thermal profile, which results in speedups of 2X. (2) Hardware acceleration of thermal simulations using the Maxeler engine, which shows possible speedups of 12X.
IP5-5	EFFICIENT SIMULATION AND MODELLING OF NON-RECTANGULAR NOC TOPOLOGIES Speakers: Ji Qi and Mark Zwolinski, University of Southampton, GB Abstract With increasing chip complexity, Networks-on-Chips (NoCs) are becoming a central platform for future on-chip communications. Many regular NoC architectures have been proposed to eliminate the communication bottlenecks of traditional bus-based networks. Non-rectangular and irregular architectures have also been proposed to increase performance. However, the complexity of designing custom non-rectangular networks leads to a rapid increase in design and verification times. To alleviate the conflict between performance and efficiency, this paper proposes a novel method that efficiently constructs virtual non-rectangular topologies on a mesh network by using time-regulated models to emulate irregular patterns. Data routings on virtual hexagonal and two irregular geometries validate the proposed method. An MPEG-4 decoder is used to exemplify the proposed method for media applications. Results analysis shows the virtual topologies emulated by the proposed method can provide precise timing and energy performance.
IP5-6	MOVING FROM CO-SIMULATION TO SIMULATION FOR EFFECTIVE SMART SYSTEMS DESIGN Speakers: Franco Fummi¹, Michele Lora², Francesco Stefanni³, Dimitrios Trachanis⁴, Jan Vanhese⁴ and Sara Vinco² ¹University of Verona, EDALab s.r.l., IT; ²University of Verona, IT; ³EDALab s.r.l., IT; ⁴Agilent Technologies, BE Abstract Design of smart systems needs to cover a wide variety of domains, ranging from analogue to digital, with power devices, micro-sensors and actuators, up to MEMS. This high level of heterogeneity makes design a very challenging task, as each domain is supported by specific languages, modeling formalisms and simulation frameworks. A major issue is furthermore posed by simulation, that heavily impacts the design and verification loop and that is very hard to be built in such an heterogeneous context. On the other hand, achieving efficient simulation would indeed make smart system design feasible with respect to budget constraints. This work provides a formalization of the typical abstraction levels and design domains of a smart system. This taxonomy allows to identify a precise role in the design flow for co-simulation and simulation scenarios. Moreover, a methodology is proposed to move from the co-simulated heterogeneity to a simulatable homogeneous representation in C++ of the entire smart system. The impact of heterogeneous or homogeneous models of computation is also examined. Experimental results prove the effectiveness of the proposed C++ generation for reaching high-speed simulation.
IP5-7	AUTOMATING DATA REUSE IN HIGH-LEVEL SYNTHESIS Speakers: Wim Meeus¹ and Dirk Stroobandt² ¹Imec and Ghent University, BE; ²Ghent University, BE Abstract Current High-Level Synthesis (HLS) tools perform excellently for the synthesis of computation kernels, but they often don't optimize memory bandwidth. As memory access is a bottleneck in many algorithms, the performance of the generated circuit will benefit substantially from memory access optimization. In this paper we present an automated method and a toolchain to detect reuse of array data in loop nests and to build hardware that exploits this data reuse. This saves memory bandwidth and improves circuit performance. We make use of the polyhedral representation of the source program, which makes our method computationally easy. Our software complements the existing HLS flows. Starting from a loop nest written in C, our tool generates a reuse buffer and a loop controller, and preprocesses the loop body for synthesis with an existing HLS tool. Our automated tool produces designs from unoptimized source code that are as efficient as those generated by a commercial HLS tool from manually-optimized source code.
IP5-8	A UNIVERSAL SYMMETRY DETECTION ALGORITHM Speaker: Peter Maurer, Dept. of Computer Sci., Baylor University, US Abstract Research on symmetry detection focuses on identifying and detecting new types of symmetry. We present an algorithm that is capable of detecting any type of permutation-based symmetry, including many types for which there are no existing algorithms. General symmetry detection is library-based, but symmetries that can be parameterized, (i.e. total, partial, rotational, and dihedral symmetry), can be detected without using libraries. In many cases it is faster than existing techniques. Furthermore, it is simpler than most existing techniques, and can easily be incorporated into existing software.
IP5-9	OPTIMIZATION OF DESIGN COMPLEXITY IN TIME-MULTIPLEXED CONSTANT MULTIPLICATIONS Speakers: Levent Aksoy¹, Paulo Flores² and Jose Monteiro³ ¹INESC-ID, PT; ²INESC-ID/IST ULisbon, PT; ³INESC-ID / IST, ULisbon, PT Abstract The multiplication of constants by a data input is an essential operation in digital signal processing (DSP) systems. For applications requiring a large number of constant multiplications under stringent hardware constraints, it is generally realized under a folded architecture, where a single constant selected from a set of multiple constants is multiplied by the data input at each time, called time-multiplexed constant multiplication (TMCM). This paper addresses the problem of optimizing the complexity of a TMCM design and introduces an algorithm that finds the least complex TMCM design by sharing the logic operators, i.e., adders, subtractors, adders/subtractors, and multiplexors (MUXes). It includes efficient search methods, yielding better results than existing TMCM algorithms.
IP5-10	HARDWARE PRIMITIVES FOR THE SYNTHESIS OF MULTITHREADED ELASTIC SYSTEMS Speakers: Giorgos Dimitrakopoulos¹, Seitanidis Ioannis², Anastasios Psarras¹, Konstantinos Tsiouris¹, Pavlos Matthaiakis³ and Jordi Cortadella⁴ ¹Democritus University of Thrace, GR; ²Democritus University of Thrac, GR; ³Mentor Graphics, FR; ⁴Universitat Politecnica de Catalunya, ES Abstract Abstract—Elastic systems operate in a dataflow-like mode using a distributed scalable control and tolerating variable latency computations. At the same time, multithreading increases the utilization of processing units and hides the latency of each operation by time-multiplexing operations of different threads in the datapath. This paper proposes a model to unify multithreading and elasticity. A new multithreaded elastic control protocol is introduced supported by low-cost elastic buffers that minimize the storage requirements without sacrificing performance. To enable the synthesis of multithreaded elastic architectures, new hardware primitives are proposed and utilized in two circuit examples to prove the applicability of the proposed approach.
IP5-11	DCM: AN IP FOR THE AUTONOMOUS CONTROL OF OPTICAL AND ELECTRICAL RECONFIGURABLE NOCS. Speakers: Wolfgang Büter¹, Christof Osewold¹, Daniel Gregorek¹ and Alberto Garcia-Ortiz² ¹University of Bremen, DE; ²ITEM (U.Bremen), DE Abstract The increasing requirements for bandwidth and quality-of-service motivate the use of parallel interconnect architectures with several degrees of reconfiguration. This paper presents an IP, called Distributed Channel Management (DCM), to extend existing packet-switched NoCs with a reconfigurable point-to-point network seamlessly, i.e., without the need for any modification on the routers. The configuration of the reconfigurable network takes place dynamically and autonomously, so that the topology can be changed at run time. Furthermore, the architecture is scalable due to the autonomous decentralized administration of the links. The Paper reports a thorough experimental analysis of the overhead of the approach at the gate level that considers different network parameters such as flit size and timing constraints.
IP5-12	MINIMALLY BUFFERED SINGLE-CYCLE DEFLECTION ROUTER Speakers: Gnaneswara Rao Jonna¹, John Jose¹, Rachana Radhakrishnan² and Madhu Mutyam¹ ¹Indian Institute of Technology, Madras., IN; ²Rajagiri School of Engineering & Technology, Kochhi., IN Abstract With the drift from computation centric designs to communication centric designs in the Chip Multi Processor (CMP) era, the interconnect fabric is gaining more importance. An efficient NoC in terms of power, area and average flit latency has a huge impact on the overall performance of a CMP. In the current work, we propose MinBSD - a minimally buffered, single cycle, deflection router. It incorporates different operations (Injection, Ejection, Preemption, Re-injection) in a single module to handle the traffic effectively and ensures smooth flow of flits through router pipeline. It performs overlapped execution of independent operations. These factors not only make MinBSD to operate in a single cycle but also to reduce the critical path latency resulting in a faster interconnect network. Experimental results show that MinBSD reduces the average flit latency on real work loads, reduces die area and power consumption when compared to the existing state-of-the-art minimally buffered deflection routers.
IP5-13	FUNCTIONAL TEST GENERATION GUIDED BY STEADY-STATE PROBABILITIES OF ABSTRACT DESIGN Speakers: Jian Wang¹, Huawei Li², Tao Lv², Tiancheng Wang² and Xiaowei Li² ¹Institute of Computing Technology, Chinese Academy of Sc iences, CN; ²Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract This paper presents a novel method for functional test generation aiming at exploring control state space of the design. The steady-state probabilities (SP's) of the abstract design's control FSM are used to guide test generation. The SP's of the states can reflect how hard the states can be reached, and the hard-to-reach states are assigned with high priority to be exercised. Experimental results show that our method has better performance in test generation in comparison with constrained random simulation, and demonstrate that SP's provide good guidance on traversing hard-to-reach states of the design under validation.
IP5-14	AUTOMATED SYSTEM TESTING USING DYNAMIC AND RESOURCE RESTRICTED CLIENTS Speakers: Mirko Caspar, Mirko Lippmann and Wolfram Hardt, Technische Universität Chemnitz, DE Abstract Testing on system level using a static and homogeneous architecture of clients is common practice. This paper introduces a new approach to use a heterogeneous and dynamic set of resource restricted test clients for automated testing. Due to changing resources and availability of the clients, the test case distribution needs to be recalculated dynamically during the test execution. All necessary conditions and parameters are represented by a formal model. It is shown that the algorithmic problem of DYNAMIC TESTPARTITIONING can be solved in polynomial time by a heuristic recursive algorithm. A testbench architecture is introduced and by simulation it is shown that the testbench can execute the test requirements within a small variation using a number of several hundred clients. The system can react dynamically on changing resources and availability of the test clients within several seconds. The approach is generic and can be adapted to a huge number of systems.
IP5-15	RELIABILITY-AWARE MAPPING OPTIMIZATION OF MULTI-CORE SYSTEMS WITH MIXED-CRITICALITY Speakers: Shin-Haeng Kang¹, Hoeseok Yang², Sungchan Kim³, Iuliana Bacivarov², Soonhoi Ha¹ and Lothar Thiele⁴ ¹Seoul National University, KR; ²ETH Zurich, CH; ³Chonbuk National University, KR; ⁴Swiss Federal Institute of Technology Zurich, CH Abstract This paper presents a novel mapping optimization technique for mixed critical multi-core systems with different reliability requirements. For this scope, we derived a quantitative reliability metric and presented a scheduling analysis that certifies given mixed-criticality constraints. Our framework is capable of investigating re-execution, passive replication, and modular redundancy with optimized voter placement, while typical hardening approaches consider only one or two of these techniques. The proposed technique complies with existing safety standards and is power-efficient, as demonstrated by our experiments.
IP5-16	(Best Paper Award Candidate) FROM SIMULINK TO NOC-BASED MPSOC ON FPGA Speakers: Francesco Robino and Johnny Öberg, KTH Royal Institute of Technology, SE Abstract Network-on-chip (NoC) based multi-processor systems are promising candidates for future embedded system platforms. However, because of their complexity, new high level modeling techniques are needed to design, simulate and synthesize embedded systems targeting NoC-based MPSoC. Simulink is a popular modeling environment suitable to model at system level. However, there is no clear standard to synthesize Simulink models into SW and HW towards a NoC-based MPSoC implementation. In addition, many of the proposed solutions require large overhead in terms of SW components and memory requirements, resulting in complex and customized multi-processor platforms. In this paper we present a novel design flow to synthesize Simulink models onto a NoC-based MPSoC running on low-cost FPGAs. Our design flow constrains the MPSoC and the Simulink model to share a common semantics domain. This permits to reduce the need of resource consuming SW components, reducing the memory requirements on the platform. At the same time, performances (throughput) of dataflow applications can increase when the number of processors of the target platform is increased. This is shown through a case study on FPGA.
IP5-17	(Best Paper Award Candidate) THERMAL ANALYSIS AND MODEL IDENTIFICATION TECHNIQUES FOR A LOGIC + WIDEIO STACKED DRAM TEST CHIP Speakers: Francesco Beneventi¹, Andrea Bartolini¹, Pascal Vivet², Denis Dutoit² and Luca Benini¹ ¹DEI - University of Bologna, IT; ²CEA-Leti, Grenoble, FR Abstract High temperature is one of the limiting factors and major concerns in 3D-chip integration. In this paper we use a 3D test chip (WIDEIO DRAM on top of a logic die) equipped with temperature sensors and heaters to explore thermal effects. We correlated real temperature measurements with the power dissipated by the heaters using model learning techniques. The resulting compact thermal model is able to predict temperatures at chip locations far from the temperature sensors and to infer the power dissipation at any location of the chip. Results are verified by mean of an off-sample validation technique and show a high accuracy of the compact thermal model when compared with silicon measurements.
IP5-18	ADAPTIVE POWER ALLOCATION FOR MANY-CORE SYSTEMS INSPIRED FROM MULTIAGENT AUCTION MODEL Speakers: Xiaohang Wang¹, Baoxin Zhao¹, Terrence Mak², Mei Yang³, Yingtao Jiang³, Masoud Daneshtalab⁴ and Maurizio Palesi⁵ ¹Guangzhou Institute of Advanced Technology, CN; ²The Chinese University of Hong Kong, CN; ³University of Nevada, Las Vegas, US; ⁴University of Turku, FI; ⁵University of Enna, Kore, IT Abstract Scaling of future many-core chips is hindered by the challenge imposed by ever-escalating power consumption. At its worst, an increasing fraction of the chips will have to be shut down, as power supply is inadequate to simultaneously switch all the transistors. This so-called dark silicon problem brings up a critical issue regarding how to achieve the maximum performance with a given limited power budget. This issue is further complicated by two facts. First, high variation in power budget calls for wide range power control capability, whereas most current frequency/voltage scaling techniques cannot effectively adjust power over such a wide range. Second, as the applications' behavior becomes more complicated, there is a pressing need for scalability and global coordination, rendering heuristic-based centralized or fully distributed control schemes inefficient. To address the aforementioned problems, in this paper, a power allocation method employing multiagent auction models is proposed, referred as Hierarchal MultiAgent based Power allocation (HiMAP). Tiles act the role of consumers to bid for power budget and the whole process is modeled by a combinatorial auction, whereas HiMAP finds the Walrasian equilibria. Experimental results have confirmed that HiMAP can reduce the execution time by as much as 45% compared to three competing methods. The runtime overhead and cost of HiMAP are also small, which makes it suitable for adaptive power allocation in many-core systems.
IP5-19	UNIFIED, ULTRA COMPACT, QUADRATIC POWER PROXIES FOR MULTI-CORE PROCESSORS Speakers: Muhammad Yasin¹, Ibrahim (Abe) Elfadel² and Anas Shahrour² ¹New York University - Abu Dhabi, AE; ²Masdar Institute of Science and Technology, AE Abstract Per-core power proxies for multi-core processors are known to use several dozens of hardware activity monitors to achieve a 2% accuracy on core power estimation. These activity monitors are typically not accessible to the user, and even if they were accessible, there would be a significant overhead in using them at the kernel or OS level for power monitoring or control. Furthermore, when scaled up to hundreds of cores per chip, such power proxies become a computational bottleneck for power management operations such as chip power capping. In this paper, we show that a 4% accuracy or better for per-core power estimation can be achieved using an ultra compact power proxy based on a hybrid set of only four user-accessible parameters, namely core frequency, core temperature, instruction-per-cycle and active-state residency. Our proxy is nonlinear, valid across all P and C states, and is based on a randomized power data collection strategy that aims at exercising all the P and C levels of each core. We illustrate the accuracy of the model using the full suite of the SPEC CPU 2006 benchmarks on a 12-core processor.
IP5-20	3D FPGA USING HIGH-DENSITY INTERCONNECT MONOLITHIC INTEGRATION Speakers: Ogun Turkyilmaz¹, Gerald Cibrario², Olivier Rozeau², Perrine Batude² and Fabien Clermidy³ ¹CEA-LETI, Minatec Campus, FR; ²CEA, FR; ³CEA-LETI, FR Abstract New 3D technology, called "Monolithic Integration", offers very dense 3D interconnect capabilities. In this paper, we propose a 3D FPGA architecture with logic-on-memory approach based on this technology. The routing and computation blocks are splitted into two layers where the logic is placed on the top and memory on the bottom. Using extracted values from layout in 14nm FDSOI technology, typical benchmark circuits are evaluated in the VPR5 toolflow. The results show an area reduction of 55% compared to the 2D FPGA. More importantly, due to the lowered routing congestion, the EDP of the 3D FPGA is improved by 47%.
IP5-21	JOINT COMMUNICATION SCHEDULING AND INTERCONNECT SYNTHESIS FOR FPGA-BASED MANY-CORE SYSTEMS Speakers: Alessandro Cilardo, Edoardo Fusella, Luca Gallo and Antonino Mazzeo, University of Naples Federico II, IT Abstract This work proposes an automated methodology for optimizing FPGA-based many-core interconnect architectures. Based on the application communication requirements, the methodology concurrently defines the structure of the interconnect and the communication task scheduling, taking into account possible dependencies between tasks under given area constraints. The resulting architecture improves the level of communication parallelism that can be exploited while keeping area costs low. The paper thoroughly describes the proposed approach and discusses a few case-studies showing the impact of the proposed technique.
IP5-22	A NOVEL EMBEDDED SYSTEM FOR VISION TRACKING Speakers: Antonis Nikitakis¹, Theofilos Paganos¹ and Ioannis Papaefstathiou² ¹Technical University of Crete, Department of Electronic and Computer Engineering Kounoupidiana, Chania, Crete, GR73100, Greece, GR; ²Synelixis Solutions Ltd, Farmakidou 10,Chalkida, GR34100, Greece, GR Abstract One of the most important challenges in the field of Computer Vision is the implementation of low-power embedded systems that will execute very accurate, yet real-time, algorithms. In the visual tracking sector one of the most promising approaches is the recently introduced OpenTLD algorithm which uses a random forest classification method. While it is very robust, it cannot be efficiently parallelized in its native form as its memory access pattern has certain characteristics that make it hard to take advantage of the conventional memory hierarchies. In this paper, we present a novel embedded system implementing this algorithm. We accelerate the bottleneck of the algorithm by designing and implementing a high bandwidth distributed memory sub-system which is independent of the various software parameters. We demonstrate the applicability and efficiency of this novel approach by implementing our scheme in a modern FPGA.

< Return to last page

Submissions

IP5 Interactive Presentations