IP5 Interactive Presentations

Label	Presentation Title Authors
IP5-1	RELIABILITY AND PERFORMANCE TRADE-OFFS FOR 3D NOC-ENABLED MULTICORE CHIPS Speaker: Partha Pande, Washington State University, US Authors: Sourav Das¹, Janardhan Rao Doppa¹, Partha Pande¹ and Krishnendu Chakrabarty² ¹Washington State University, US; ²Duke University, US Abstract Three-dimensional (3D) integration, a breakthrough technology to achieve "More Moore and More Than Moore," provides the benefits of better performance, lower power consumption, and increased bandwidth through the use of vertical interconnects and 3D stacking. The vertical interconnects enable the design of a high-bandwidth and energy-efficient small-world (SW) network-based 3D network-on-chip (3D SWNoC) for massive multicore platforms. However, the anticipated performance gain of a 3D SWNoC-enabled multicore chip may be compromised due to the potential failures of through-silicon vias (TSVs) that are predominantly used as vertical interconnects. In particular, due to the non-homogeneous traffic patterns, heavily used TSVs may wear out quickly and can also contribute to the wear-out of neighboring TSVs. As a result, the mean-time-to-failure (MTTF) of those TSVs will decrease, which will adversely affect the overall lifetime of the chip. In this paper, we address this traffic-dependent TSV wear-out problem in 3D SWNoC. We demonstrate that by employing an adaptive routing mechanism, we can improve the MTTF of 3D SWNoC significantly while still providing 21% lower energy-delay-product (EDP) compared to a conventional 3D MESH. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-2	MEMORY-ACCESS AWARE DVFS FOR NETWORK-ON-CHIP IN CMPS Speaker: Yuan Yao, KTH Royal Institute of Technology, SE Authors: Yuan Yao and Zhonghai Lu, KTH Royal Institute of Technology, SE Abstract We present a new DVFS technique for network-on-chip (NoC) that adjusts the voltage/frequency scales of routers according to memory-access characteristics of application running on the CMP. The memory characteristics are periodically profiled, reflecting both resource-access density in the network and memory-access criticality for application performance. The network conducts per-router voltage/frequency tuning using the memory-access density information while it performs priority-based switch allocation to speed up critical packets and avoid starvation using the memory-criticality information. Compared to a latest per-router DVFS approach, benchmark experiments demonstrate that our memory-access characteristics aware DVFS technique achieves not only better power saving, energy-delay product, but also enhanced network and application performance. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-3	A DYNAMICALLY RECONFIGURABLE ECC DECODER ARCHITECTURE Speaker: Philippe Coussy, Universite Bretagne Sud / Lab-STICC, FR Authors: Awais Sani¹, Philippe Coussy² and Cyrille Chavet³ ¹Universite de Bretagne-Sud, FR; ²Universite de Bretagne-Sud / Lab-STICC, FR; ³Lab-STICC / Université de Bretagne Sud, FR Abstract Due to their impressive error correction performances, Error Correcting Codes (ECC) are now widely used in communication systems. In order to achieve high throughput requirements ECC decoders are based on parallel architectures, which results in a major issue: memory access conflicts. In this paper, we introduce a new class of ECC decoder architectures that dynamically reconfigures by executing on-chip a memory mapping approach. For that purpose, a dedicated algorithm taking into account network constraint is presented. A smart architecture based on a butterfly network and a reconfiguration unit is also proposed. Experimental results show that real-time reconfiguration at reasonable hardware cost is possible. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-4	RESISTIVE BLOOM FILTERS: FROM APPROXIMATE MEMBERSHIP TO APPROXIMATE COMPUTING WITH BOUNDED ERRORS Speaker: Abbas Rahimi, University of California, Berkeley, US Authors: Vahideh Akhlaghi¹, Abbas Rahimi² and Rajesh K. Gupta¹ ¹University of California, San Diego, US; ²University of California, Berkeley, US Abstract Approximate computing provides an opportunity for exploiting application characteristics to trade the accuracy for gains in energy efficiency. However, such opportunity must be able to bound the error that the system designer provides to the application developer. Space-efficient probabilistic data structure such as Bloom filter can provide one such means. Bloom filter supports approximate set membership queries with a tunable rate of false positives (i.e., errors) and no false negatives. We propose a resistive Bloom filter (ReBF) to approximate a function by tightly integrating it to a functional unit (FU) implementing the function. ReBF approximately mimics partial functionality of the FU by recalling its frequent input patterns for computational reuse. The accuracy of the target FU is guaranteed by bounding the ReBF error behavior at the design time. We further lower energy consumption of a FU by designing its ReBF using low-power memristor arrays. The experimental results show that function approximation using ReBF for five image processing kernels running on the AMD Southern Islands GPU yields on average 24.1% energy saving in 45 nm technology compared to the exact computation. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-5	REAL-TIME SYSTEM-LEVEL IMPLEMENTATION OF A TELEPRESENCE ROBOT USING AN EMBEDDED GPU PLATFORM Speaker: Swathi Gurumani, Advanced Digital Sciences Center, SG Authors: Muhammad Teguh Satria¹, Swathi Gurumani¹, Wang Zheng², Keng Peng Tee², Augustine Koh¹, Pan Yu², Kyle Rupnow¹ and Deming Chen³ ¹Advanced Digital Sciences Center, SG; ²Institute for Infocomm Research, SG; ³UIUC, US Abstract Real-time applications such as telepresence systems present an opportunity to use embedded GPUs for compute acceleration to meet platform goals. In this paper, we develop a prototype of a portable, standalone telepresence robot that performs real-time attention-directed control using an NVIDIA Jetson TK1 embedded platform. We perform platform-specific optimizations to improve thread occupancy, optimize computation workload and improve accuracy of face detection on the embedded GPU and achieve real-time performance of 30 frames per second on the Jetson TK1 and an overall speedup of 10x compared to the ARM CPU version. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-6	EXPLORING SPECIALIZED NEAR-MEMORY PROCESSING FOR DATA INTENSIVE OPERATIONS Speaker: Salessawi Ferede Yitbarek, University of Michigan, US Authors: Salessawi Ferede Yitbarek¹, Tao Yang², Reetuparna Das¹ and Todd Austin¹ ¹University of Michigan, US; ²University of California, San Diego, US Abstract Emerging 3D stacked memory systems provide significantly more bandwidth than current DDR modules. However, general purpose processors do not take full advantage of these resources offered by the memory modules. Taking advantage of the increased bandwidth requires the use of specialized processing units. In this paper, we evaluate the benefits of placing hardware accelerators at the bottom layer of a 3D stacked memory system compared to accelerators that are placed external to the memory stack. Our evaluation of the design using cycle-accurate simulation and RTL synthesis shows that, for important data intensive kernels, near-memory accelerators inside a single 3D memory package provide 3x-13x speedup over a Quad-core Xeon processor. Most of the benefits are from the application of accelerators, as the near-memory configurations provide marginal benefits compared to the same number of accelerators placed on a die external to the memory package. This comparable performance for external accelerators is due to the high bandwidth afforded by the high-speed off-chip links. On the other hand, near-memory accelerators consume 7%-39% less energy than the external accelerators. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-7	MATLAB TO C COMPILATION TARGETING APPLICATION SPECIFIC INSTRUCTION SET PROCESSORS Speaker: Francky Catthoor, Interuniversity Microelectronics Centre (IMEC), BE Authors: Ioannis Latifis¹, Karthick Parashar², Grigoris Dimitroulakos¹, Hans Cappelle², Christakis Lezos¹, Konstantinos Masselos¹ and Francky Catthoor² ¹University of Peloponnese, GR; ²Interuniversity Microelectronics Centre (IMEC), BE Abstract This paper discusses a MATLAB to C compiler exploiting custom instructions such as instructions for SIMD processing and instructions for complex arithmetic present in Application Specific Instruction Set Processors (ASIPs). The compiler generates ANSI C code in which the processor's special instructions are represented via specialized intrinsic functions. By doing this the generated code can be used as input to any C/C++ compiler. Thus the proposed compiler allows the description of the specialized instruction set of the target processor in a parameterized way allowing the support of any processor. The proposed compiler has been used for the generation of application code for an ASIP targeting DSP applications. The code generated by the proposed compiler achieves a speed up between 2x-30x on the targeted ASIP for six DSP benchmarks compared to the code generated by Mathworks MATLAB to C compiler. Thus the proposed compiler can be employed to reduce the development time/effort/cost and time to market by raising the abstraction of application design in an embedded systems / system-on-chip development context while still improving implementation efficiency. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-8	SAMPLING-BASED BUFFER INSERTION FOR POST-SILICON YIELD IMPROVEMENT UNDER PROCESS VARIABILITY Speaker: Grace Li Zhang, Technische Universität München (TUM), DE Authors: Grace Li Zhang, Bing Li and Ulf Schlichtmann, Technische Universität München (TUM), DE Abstract At submicron manufacturing technology nodes process variations affect circuit performance significantly. This trend leads to a large timing margin and thus overdesign to maintain yield. To combat this pessimism, post-silicon clock tuning buffers can be inserted into circuits to balance timing budgets of critical paths with their neighbors. After manufacturing, these clock buffers can be configured for each chip individually so that chips with timing failures may be rescued to improve yield. In this paper, we propose a sampling-based method to determine the proper locations of these buffers. The goal of this buffer insertion is to reduce the number of buffers and their ranges, while still maintaining a good yield improvement. Experimental results demonstrate that our algorithm can achieve a significant yield improvement (up to 35%) with only a small number of buffers. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-9	PRADA: COMBATING VOLTAGE NOISE IN THE NOC POWER SUPPLY THROUGH FLOW-CONTROL AND ROUTING ALGORITHMS Speaker: Prabal Basu, Utah State University, US Authors: Prabal Basu, Rajesh JayashankaraShridevi, Koushik Chakraborty and Sanghamitra Roy, Utah State University, US Abstract Network-on-Chip (NoC) has become the de-facto standard for on-chip communication in MPSoCs. The growing NoC power footprint, increase in the transistor current, and high switching speed of the logic devices exacerbate the peak power supply noise (PSN) in the NoC power delivery network (PDN). Hence, preserving power supply integrity in the NoC PDN is critical. In this work, we propose PRADA (PSN-aware Runtime Adaptation), a collection of a novel flow-control protocol (PAF) and an adaptive routing algorithm (PAR), to mitigate PSN in NoCs. Our best scheme achieves 14% and 12% improvements in the regional peak PSN and energy efficiency, with an average of 4.6% performance overhead and marginal area and power footprints. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-10	A POWER-EFFICIENT 3-D ON-CHIP INTERCONNECT FOR MULTI-CORE ACCELERATORS WITH STACKED L2 CACHE Speaker: Kyungsu Kang, Samsung, KR Authors: Kyungsu Kang¹, Luca Benini², Giovanni De Micheli³, Sangho Park¹ and Jong-Bae Lee¹ ¹Samsung, KR; ²Università di Bologna, IT; ³École Polytechnique Fédérale de Lausanne (EPFL), CH Abstract The use of multi-core clusters is a promising option for data-intensive embedded applications such as multimodal sensor fusion, image understanding, and mobile augmented reality. In this paper, we propose a power-efficient 3-D on-chip interconnect for multi-core clusters with stacked L2 cache memory. A new switch design makes a circuit-switched Mesh-of-Tree (MoT) interconnect reconfigurable to support power-gating of processing cores, memory blocks, and unnecessary interconnect resources (routing switch, arbitration switch, inverters placed along the on-chip wires). The proposed 3-D MoT improves the power efficiency by up to 77% in terms of energy-delay product (EDP). Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-11	POWER-EFFICIENT LOAD-BALANCING ON HETEROGENEOUS COMPUTING PLATFORMS Speaker: Muhammad Shafique, Karlsruhe Institute of Technology (KIT), DE Authors: Muhammad Usman Karim Khan¹, Muhammad Shafique¹, Apratim Gupta², Thomas Schumann² and Jörg Henkel¹ ¹Karlsruhe Institute of Technology (KIT), DE; ²University of Applied Sciences, Darmstadt, DE Abstract In order to address the throughput constraints of the system at minimal power consumption, the workload of computing nodes should be balanced. This requires accounting for the underlying hardware characteristics (e.g., power vs. frequency profiles) and throughput sustainable by these nodes. This work provides a workload distribution and balancing methodology of a divisible load under a throughput constraint, on heterogeneous nodes. The power efficiency of each node is considered during load distribution. For load balancing, the frequency of the node is determined which just fulfills the job requirements of the nodes. We functionally verify our methodology by implementing it on an FPGA-based system, with heterogeneous multi-cores and hardware accelerators, and report results for different image processing benchmarks. Compared to a state-of-the-art-approach, our approach results in up to 64% performance improvement for the benchmarks evaluated in this paper. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-12	TOPAZ: MINING HIGH-LEVEL SAFETY PROPERTIES FROM LOGIC SIMULATION TRACES Speaker: Fadi Kurdahi, University of California, Irvine, US Authors: Ahmed Nassar¹, Fadi Kurdahi¹ and Salam Zantout² ¹University of California, Irvine, US; ²American University of Beirut, LB Abstract Formal specifications are hard to formulate and maintain for evolving complex digital hardware designs. Specification mining offers a (partially) automated route to discovering specifications from large simulation traces. In this paper, we embark on a novel and rigorous mining methodology (data preparation, mining algorithms, selection criteria, etc.) for finite-state automata checkers using an iterative and interactive mining tool, called Topaz. Topaz is evaluated using an open-source 32-bit RISC CPU design as a case study to demonstrate extraction of complex temporal properties cross-cutting through all CPU pipeline stages, guided by the CPU instruction set specification. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-13	EXPLOITING TRANSACTION LEVEL MODELS FOR OBSERVABILITY-AWARE POST-SILICON TEST GENERATION Speaker: Prabhat Mishra, University of Florida, US Authors: Farimah Farahmandi¹, Prabhat Mishra¹ and Sandip Ray² ¹University of Florida, US; ²Intel Corporation, US Abstract A critical problem in post-silicon debug is to generate efficient tests that both activate requisite coverage goals on the target hardware as well as produce results that are observable through a given on-chip design-for-debug architecture. Unfortunately, such tests cannot be generated directly from RTL models, both due to design complexity and due to bugs in the design itself. In this paper, we propose an approach to address this problem by exploiting transaction-level models (TLM). Our approach involves mapping test and observability requirements between TLM and RTL, enabling TLM analysis to generate post-silicon tests. We provide case studies from a number of different design classes to demonstrate the flexibility and effectiveness of the approach. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-14	SEERAD: A HIGH SPEED YET ENERGY-EFFICIENT ROUNDING-BASED APPROXIMATE DIVIDER Speaker: Ali Afzali-Kusha, University of Tehran, IR Authors: Reza Zendegani¹, Mehdi Kamal¹, Arash Fayyazi¹, Ali Afzali-Kusha¹, Saeed Safari¹ and Massoud Pedram² ¹University of Tehran, IR; ²University of Southern California, US Abstract In this paper, a high speed yet energy-efficient approximate divider for error resilient applications is proposed. For the division operation, the divisor is rounded to a value with a specific form, which transforms the division into a multiplication. The proposed approximate divider enjoys the flexibility of increasing the accuracy at the price of higher delay and hardware usage. The efficacy of the proposed approximate divider is evaluated in comparison to three different implementations of the SRT divider. The results show that the delay and energy consumption of the proposed approximate divider are, on average, 14 and 300 times smaller than those of the Radix-2 SRT with the carry-save remainder computation. Additionally, the effectiveness of the proposed approximate divider is studied in an image division operation performed in image processing applications. The results suggest the appropriateness of the proposed approximate divider for digital signal processing applications. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-15	IMPROVING PERFORMANCE GUARANTEES IN WORMHOLE MESH NOC DESIGNS Speaker: Milos Panic, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES Authors: Milos Panic¹, Carles Hernandez², Jaume Abella², Antoni Roca Perez³, Eduardo Quinones² and Francisco Cazorla⁴ ¹Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES; ²Barcelona Supercomputing Center, ES; ³Universitat Politècnica de Catalunya, ES; ⁴Barcelona Supercomputing Center and IIIA-CSIC, ES Abstract Wormhole-based mesh Networks-on-Chip (wNoC) are deployed in high-performance many-core processors due to their physical scalability and low-cost. Delivering tight and time composable Worst-Case Execution Time (WCET) estimates for applications as needed in safety-critical real-time embedded systems is challenged by wNoCs due to their distributed nature. We propose a bandwidth control mechanism for wNoCs that enables the computation of tight time-composable WCET estimates with low average performance degradation and high scalability. Our evaluation with the EEMBC automotive suite and an industrial real-time parallel avionics application confirms so. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-16	A DATA LAYOUT TRANSFORMATION (DLT) ACCELERATOR: ARCHITECTURAL SUPPORT FOR DATA MOVEMENT OPTIMIZATION IN ACCELERATED-CENTRIC HETEROGENEOUS SYSTEMS Speaker: Tung Hoang, University of Chicago, US Authors: Tung Hoang, Amirali Shambayati and Andrew A. Chien, University of Chicago, US Abstract Technology scaling and growing use of accelerators make optimization of data movement of increasing importance in all computing systems. Further, growing diversity in memory structures makes embedding such optimization in software non-portable. We propose a novel architectural solution called Data Layout Transformation (DLT) associated with a simple set of instructions that enable software to describe the required data movement compactly, and free the implementation to optimize the movement based on the knowledge of the memory hierarchy and system structure. The DLT architecture ideas are applicable to both general-purpose and accelerator-based heterogeneous systems. Experimental results first show that the proposed DLT architecture can make use of the full bandwidth (>97%) of a wide range of memory systems (DDR3 and HMC) while its implementation cost, in 32 nm, is low (only 0.246 mm² and 75 mW at 1 GHz). Our evaluation of using the DLT accelerator in an accelerator-based heterogeneous system across DDR3 and HMC memory shows that the DLT can enhance system performance in the range of 4.6x-99x (DDR3) and 4.4x-115x (HMC), which translates into a 2.8x-48x (DDR3) and 1.4x-39x (HMC) improvement in energy efficiency. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-17	OUESSANT: FLEXIBLE INTEGRATION OF DEDICATED COPROCESSORS IN SYSTEMS ON CHIP Speaker: Pierre-Henri Horrein, Lab-STICC/Télécom Bretagne, FR Authors: Pierre-Henri Horrein, Philip-Dylan Gleonec, Erwan Libessart, André Lalevée and Matthieu Arzel, Lab-STICC/Télécom Bretagne, FR Abstract Integration of hardware accelerators in Systems on Chip is often complex. When dealing with reconfigurable hardware, this greatly limits the attainable flexibility. In this paper, we propose an alternative approach to the Molen paradigm [1]. This approach, named Ouessant, is based on a very simple general purpose instruction set designed for close interaction with dedicated hardware accelerators. This instruction set is used to program a dedicated controller, which commands the accelerator's execution and data transfer with minimal CPU intervention. The resulting architecture is flexible, extensible, and can be easily integrated in Systems on Chip. Adding new accelerators is also made easier. Implementation of the architecture on different FPGA resources shows a very low footprint and a very small impact on attainable performance. Ouessant is freely available under an open-source license. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-18	A NOVEL BACKGROUND SUBTRACTION SCHEME FOR IN-CAMERA ACCELERATION IN THERMAL IMAGERY Speaker: Konstantinos Makantasis, Institute of Communication and Computer Systems, GR Authors: Antonis Nikitakis¹, Ioannis Papaefstathiou², Konstantinos Makantasis³ and Anastasios Doulamis⁴ ¹Technical University of Crete, GR; ²Synelixis Solutions Ltd, GR; ³Institute of Communication and Computer Systems, GR; ⁴National Technical University of Athens, GR Abstract Real-time segmentation of moving regions in image sequences is a very important task in numerous surveillance and monitoring applications. A common approach for such tasks is the "background subtraction" which tries to extract regions of interest from the image background for further processing or action; as a result its accuracy as well as its real-time performance is of great significance. In this work we utilize a novel scheme, designed and optimized for FPGA-based implementations, which models the intensities of each pixel as a mixture of Gaussian components; following a Bayesian approach, our method automatically estimates the number of Gaussian components as well as their parameters. Our novel system is based on an efficient and highly accurate on-line updating mechanism, which permits our system to be automatically adapted to dynamically changing operation conditions, while it avoids over/under fitting. We also present two reference implementations of our Background Subtraction Parallel System (BSPS) in Reconfigurable Hardware achieving both high performance as well as low power consumption; the presented FPGA-based systems significantly outperform a multi-core ARM and two multi-core low power Intel CPUs in terms of energy consumed per processed pixel as well as frames per second. Moreover, our low-cost, low-power devices allow for the implementation, for the first time, of a highly distributed surveillance system which will alleviate the main problems of the existing centralized approaches. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-19	RADIATION-HARDENED DSP CONFIGURATIONS FOR IMPLEMENTING ARITHMETIC FUNCTIONS ON FPGA Speaker: Felipe Serrano, Universidad Complutense de Madrid, ES Authors: Marcos Sanchez-Elez, Inmaculada Pardines, Felipe Serrano and Hortensia Mecha, Universidad Complutense de Madrid, ES Abstract This paper presents a study of different implementations of arithmetic operations on FPGAs. Radiation vulnerability has been analyzed for each implementation using the fault injection platform NESSY. Results in terms of area, delay and reliability are presented. Taking into account the performed tests we propose to build a library of HDL templates. This library is used during the design process with a synthesis tool that implements digital circuits as reliable as possible. Experimental results show that those implementations using DSP slices are the ones which achieve better results. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-20	CONFIGURATION PREFETCHING AND REUSE FOR PREEMPTIVE HARDWARE MULTITASKING ON PARTIALLY RECONFIGURABLE FPGAS Speaker: Ann Gordon-Ross, University of Florida, US Authors: Aurelio Morales-Villanueva, Rohit Kumar and Ann Gordon-Ross, University of Florida, US Abstract Partially reconfigurable (PR) FPGAs enable preemptive hardware (HW) multitasking using PR regions (PRRs). To enable this multitasking, the HW task's partial bitstream is downloaded to only the task's PRR, and only that PRR is reconfigured. Since only a small portion of the FPGA fabric is reconfigured, reconfiguration time is significantly reduced as compared to reconfiguring the entire fabric, however this time is not negligible. Reconfiguration time can be reduced/hidden using two techniques: configuration prefetching and configuration reuse. Even though these techniques can effectively reduce/hide reconfiguration overhead, prior works in preemptive HW multitasking did not use these techniques. To the best of our knowledge, no prior work evaluated physical implementations of these techniques on PR FPGAs, which precludes consideration of physical-implementation-specific details, such as delays in accessing bitstreams, speed limitations during reconfiguration, etc. In this work, we present a novel implementation of configuration prefetching and reuse for preemptive HW multitasking on a Virtex-5 FPGA, however, our established fundamentals are device-family independent. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-21	ANALOG CIRCUIT TOPOLOGICAL FEATURE EXTRACTION WITH UNSUPERVISED LEARNING OF NEW SUB-STRUCTURES Speaker: Alex Doboli, Stony Brook University, US Authors: Hao Li, Fanshu Jiao and Alex Doboli, Stony Brook University, US Abstract This paper presents novel techniques to automatically extract the topological (structural) features in analog circuits. The extracted features include basic building blocks, structural templates and hierarchical structures. Finding structural features is important for tasks like circuit synthesis and sizing, design verification, design reuse, and design knowledge description, summarization and management. The paper presents algorithms for supervised feature extraction and unsupervised learning of new block connections. Experiments discuss feature extraction for a set of 34 state-of-the-art analog circuits. Download Paper (PDF; Only available from the DATE venue WiFi)
IP5-22	DESIGN AUTOMATION TASKS SCHEDULING FOR ENHANCED PARALLEL EXECUTION OF A STATE-OF-THE-ART LAYOUT-AWARE SIZING APPROACH Speaker: Nuno Horta, Instituto de Telecomunicações/Instituto Superior Técnico, PT Authors: David Neves, Ricardo Martins, Nuno Lourenço and Nuno Horta, Instituto de Telecomunicações/Instituto Superior Técnico, PT Abstract This paper presents an innovative methodology to efficiently schedule design automation tasks during the execution of an analog IC layout-aware sizing process. The referred synthesis process includes several sub-tasks such as DC simulation, floorplanning, placement, global routing, parasitic extraction, and circuit simulations in multiple worst case corners. The schedule of the design tasks is here optimized taking into account standard multi-core architectures, tasks dependencies, accurate time estimations for each task and a limited number of licenses for using commercial tools, e.g., number of simulator licenses. The proposed methodology, first, considers a directed acyclic graph for representing the design flow and task dependencies, then, an evolutionary kernel is used to implement a single-objective multi-constraint optimization. The efficiency and impact of the proposed approach is validated by using a state-of-the-art Analog IC design automation environment. Download Paper (PDF; Only available from the DATE venue WiFi)
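
Illustrative note on IP5-4: the abstract relies on the standard Bloom filter property that membership queries can return false positives at a tunable rate but never false negatives. The following is a minimal software sketch of that property in Python. It is not the ReBF hardware design from the paper; the class name BloomFilter, the SHA-256-based double hashing, and the chosen sizes are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Plain software Bloom filter (hypothetical helper, not the ReBF design)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit position, chosen for readability over compactness.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str) -> list[int]:
        # Derive num_hashes positions from one SHA-256 digest via double
        # hashing; this hashing scheme is an assumption of the sketch.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_rate(bf: BloomFilter, probes: list[str]) -> float:
    """Fraction of never-inserted probes that the filter reports as present."""
    hits = sum(bf.might_contain(p) for p in probes)
    return hits / len(probes)


if __name__ == "__main__":
    bf = BloomFilter(num_bits=4096, num_hashes=3)
    members = [f"in-{i}" for i in range(200)]
    for m in members:
        bf.add(m)
    # Every inserted item is always reported as present (no false negatives).
    print(all(bf.might_contain(m) for m in members))
    # Non-members are occasionally reported as present (false positives).
    print(false_positive_rate(bf, [f"out-{i}" for i in range(1000)]))
```

Increasing num_bits relative to the number of inserted items lowers the false-positive rate, which is the tunable error bound the abstract refers to.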

Label

Presentation Title
Authors

IP5-1

RELIABILITY AND PERFORMANCE TRADE-OFFS FOR 3D NOC-ENABLED MULTICORE CHIPS
Speaker:
Partha Pande, Washington State University, US
Authors:
Sourav Das¹, Janardhan Rao Doppa¹, Partha Pande¹ and Krishnendu Chakrabarty²
¹Washington State University, US; ²Duke University, US
Abstract
Three-dimensional (3D) integration, a breakthrough technology to achieve "More Moore and More Than Moore," provides the benefits of better performance, lower power consumption, and increased bandwidth through the use of vertical interconnects and 3D stacking. The vertical interconnects enable the design of a high-bandwidth and energy-efficient small-world (SW) network-based 3D network-on-Chip (3D SWNoC) for massive multicore platforms. However, the anticipated performance gain of a 3D SWNoC-enabled multicore chip may be compromised due to the potential failures of through-silicon- vias (TSVs) that are predominantly used as vertical interconnects. In particular, due to the non-homogeneous traffic patterns, heavily used TSVs may wear-out quickly and can also contribute to the wear-out of neighboring TSVs. As a result, the mean-time-to-failure (MTTF) of those TSVs will decrease, which will adversely affect the overall lifetime of the chip. In this paper, we address this traffic-dependent TSV wear-out problem in 3D SWNoC. We demonstrate that by employing an adaptive routing mechanism, we can improve the MTTF of 3D SWNoC significantly while still providing 21% lower energy-delay-product (EDP) compared to a conventional 3D MESH.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-2

MEMORY-ACCESS AWARE DVFS FOR NETWORK-ON-CHIP IN CMPS
Speaker:
Yuan Yao, KTH Royal Institute of Technology, SE
Authors:
Yuan Yao and Zhonghai Lu, KTH Royal Institute of Technology, SE
Abstract
We present a new DVFS technique for network-on-chip (NoC) that adjusts the voltage/frequency scales of routers according to memory-access characteristics of application running on the CMP. The memory characteristics are periodically profiled, reflecting both resource-access density in the network and memory-access criticality for application performance. The network conducts per-router voltage/frequency tuning using the memory-access density information while it performs priority-based switch allocation to speed up critical packets and avoid starvation using the memory-criticality information. Compared to a latest per-router DVFS approach, benchmark experiments demonstrate that our memory-access characteristics aware DVFS technique achieves not only better power saving, energy-delay product, but also enhanced network and application performance.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-3

A DYNAMICALLY RECONFIGURABLE ECC DECODER ARCHITECTURE
Speaker:
Philippe Coussy, Universite Bretagne Sud / Lab-STICC, FR
Authors:
Awais Sani¹, Philippe Coussy² and Cyrille Chavet³
¹Universite de Bretagne-Sud, FR; ²Universite de Bretagne-Sud / Lab-STICC, FR; ³Lab-STICC / Université de Bretagne Sud, FR
Abstract
Due to their impressive error correction performances, Error Correcting Codes (ECC) are now widely used in communication systems. In order to achieve high throughput requirements ECC decoders are based on parallel architectures, which results in a major issue: memory access conflicts. In this paper, we introduce a new class of ECC decoder architectures that dynamically reconfigures by executing on-chip a memory mapping approach. For that purpose, a dedicated algorithm taking into account network constraint is presented. A smart architecture based on a butterfly network and a reconfiguration unit is also proposed. Experimental results show that real-time reconfiguration at reasonable hardware cost is possible.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-4

RESISTIVE BLOOM FILTERS: FROM APPROXIMATE MEMBERSHIP TO APPROXIMATE COMPUTING WITH BOUNDED ERRORS
Speaker:
Abbas Rahimi, University of California, Berkeley, US
Authors:
Vahideh Akhlaghi¹, Abbas Rahimi² and Rajesh K. Gupta¹
¹University of California, San Diego, US; ²University of California, Berkeley, US
Abstract
Approximate computing provides an opportunity to exploit application characteristics, trading accuracy for gains in energy efficiency. However, exploiting such an opportunity requires bounding the error that the system designer exposes to the application developer. A space-efficient probabilistic data structure such as a Bloom filter provides one such means. A Bloom filter supports approximate set membership queries with a tunable rate of false positives (i.e., errors) and no false negatives. We propose a resistive Bloom filter (ReBF) to approximate a function by tightly integrating it with a functional unit (FU) implementing the function. ReBF approximately mimics partial functionality of the FU by recalling its frequent input patterns for computational reuse. The accuracy of the target FU is guaranteed by bounding the ReBF error behavior at design time. We further lower the energy consumption of an FU by designing its ReBF using low-power memristor arrays. The experimental results show that function approximation using ReBF for five image processing kernels running on the AMD Southern Islands GPU yields on average 24.1% energy saving in 45 nm technology compared to the exact computation.
Download Paper (PDF; Only available from the DATE venue WiFi)
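
The membership behaviour that the IP5-4 abstract relies on (tunable false positives, no false negatives) can be illustrated in software. The following is a minimal sketch only, not the resistive hardware design from the paper: the class name BloomFilter, the parameters num_bits and num_hashes, and the SHA-256-based hashing are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Software Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # must be below 256 in this sketch
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        # Set every bit selected by the hash functions.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: bytes) -> bool:
        # Report membership only if all selected bits are set.
        return all(self.bits[pos] for pos in self._positions(item))
```

Every item that was added is always reported as present, because its bits stay set. An item that was never added can still be reported as present if other items happened to set the same bits; increasing num_bits lowers that false-positive rate at the cost of more storage.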

IP5-5

REAL-TIME SYSTEM-LEVEL IMPLEMENTATION OF A TELEPRESENCE ROBOT USING AN EMBEDDED GPU PLATFORM
Speaker:
Swathi Gurumani, Advanced Digital Sciences Center, SG
Authors:
Muhammad Teguh Satria¹, Swathi Gurumani¹, Wang Zheng², Keng Peng Tee², Augustine Koh¹, Pan Yu², Kyle Rupnow¹ and Deming Chen³
¹Advanced Digital Sciences Center, SG; ²Institute for Infocomm Research, SG; ³UIUC, US
Abstract
Real-time applications such as telepresence systems present an opportunity to use embedded GPUs for compute acceleration to meet platform goals. In this paper, we develop a prototype of a portable, standalone telepresence robot that performs real-time attention-directed control using an NVIDIA Jetson TK1 embedded platform. We perform platform-specific optimizations to improve thread occupancy, optimize computation workload and improve accuracy of face detection on the embedded GPU and achieve real-time performance of 30 frames per second on the Jetson TK1 and an overall speedup of 10x compared to the ARM CPU version.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-6

EXPLORING SPECIALIZED NEAR-MEMORY PROCESSING FOR DATA INTENSIVE OPERATIONS
Speaker:
Salessawi Ferede Yitbarek, University of Michigan, US
Authors:
Salessawi Ferede Yitbarek¹, Tao Yang², Reetuparna Das¹ and Todd Austin¹
¹University of Michigan, US; ²University of California, San Diego, US
Abstract
Emerging 3D stacked memory systems provide significantly more bandwidth than current DDR modules. However, general purpose processors do not take full advantage of these resources offered by the memory modules. Taking advantage of the increased bandwidth requires the use of specialized processing units. In this paper, we evaluate the benefits of placing hardware accelerators at the bottom layer of a 3D stacked memory system compared to accelerators that are placed external to the memory stack. Our evaluation of the design using cycle-accurate simulation and RTL synthesis shows that, for important data intensive kernels, near-memory accelerators inside a single 3D memory package provide 3x-13x speedup over a Quad-core Xeon processor. Most of the benefits are from the application of accelerators, as the near-memory configurations provide marginal benefits compared to the same number of accelerators placed on a die external to the memory package. This comparable performance for external accelerators is due to the high bandwidth afforded by the high-speed off-chip links. On the other hand, near-memory accelerators consume 7%-39% less energy than the external accelerators.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-7

MATLAB TO C COMPILATION TARGETING APPLICATION SPECIFIC INSTRUCTION SET PROCESSORS
Speaker:
Francky Catthoor, Interuniversity Microelectronics Centre (IMEC), BE
Authors:
Ioannis Latifis¹, Karthick Parashar², Grigoris Dimitroulakos¹, Hans Cappelle², Christakis Lezos¹, Konstantinos Masselos¹ and Francky Catthoor²
¹University of Peloponnese, GR; ²Interuniversity Microelectronics Centre (IMEC), BE
Abstract
This paper discusses a MATLAB to C compiler exploiting custom instructions, such as instructions for SIMD processing and instructions for complex arithmetic, present in Application Specific Instruction Set Processors (ASIPs). The compiler generates ANSI C code in which the processor's special instructions are represented via specialized intrinsic functions. By doing this, the generated code can be used as input to any C/C++ compiler. Thus the proposed compiler allows the description of the specialized instruction set of the target processor in a parameterized way, allowing the support of any processor. The proposed compiler has been used for the generation of application code for an ASIP targeting DSP applications. The code generated by the proposed compiler achieves a speedup of 2x-30x on the targeted ASIP for six DSP benchmarks compared to the code generated by the MathWorks MATLAB to C compiler. Thus the proposed compiler can be employed to reduce the development time/effort/cost and time to market by raising the abstraction of application design in an embedded systems / system-on-chip development context while still improving implementation efficiency.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-8

SAMPLING-BASED BUFFER INSERTION FOR POST-SILICON YIELD IMPROVEMENT UNDER PROCESS VARIABILITY
Speaker:
Grace Li Zhang, Technische Universität München (TUM), DE
Authors:
Grace Li Zhang, Bing Li and Ulf Schlichtmann, Technische Universität München (TUM), DE
Abstract
At submicron manufacturing technology nodes, process variations affect circuit performance significantly. This trend leads to a large timing margin and thus overdesign to maintain yield. To combat this pessimism, post-silicon clock tuning buffers can be inserted into circuits to balance timing budgets of critical paths with their neighbors. After manufacturing, these clock buffers can be configured for each chip individually so that chips with timing failures may be rescued to improve yield. In this paper, we propose a sampling-based method to determine the proper locations of these buffers. The goal of this buffer insertion is to reduce the number of buffers and their ranges, while still maintaining a good yield improvement. Experimental results demonstrate that our algorithm can achieve a significant yield improvement (up to 35%) with only a small number of buffers.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-9

PRADA: COMBATING VOLTAGE NOISE IN THE NOC POWER SUPPLY THROUGH FLOW-CONTROL AND ROUTING ALGORITHMS
Speaker:
Prabal Basu, Utah State University, US
Authors:
Prabal Basu, Rajesh JayashankaraShridevi, Koushik Chakraborty and Sanghamitra Roy, Utah State University, US
Abstract
Network-on-Chip (NoC) has become the de-facto standard for on-chip communication in MPSoCs. The growing NoC power footprint, the increase in transistor current, and the high switching speed of logic devices exacerbate the peak power supply noise (PSN) in the NoC power delivery network (PDN). Hence, preserving power supply integrity in the NoC PDN is critical. In this work, we propose PRADA (PSN-aware Runtime Adaptation), a collection of a novel flow-control protocol (PAF) and an adaptive routing algorithm (PAR), to mitigate PSN in NoCs. Our best scheme achieves 14% and 12% improvements in the regional peak PSN and energy efficiency, respectively, with an average of 4.6% performance overhead and marginal area and power footprints.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-10

A POWER-EFFICIENT 3-D ON-CHIP INTERCONNECT FOR MULTI-CORE ACCELERATORS WITH STACKED L2 CACHE
Speaker:
Kyungsu Kang, Samsung, KR
Authors:
Kyungsu Kang¹, Luca Benini², Giovanni De Micheli³, Sangho Park¹ and Jong-Bae Lee¹
¹Samsung, KR; ²Università di Bologna, IT; ³École Polytechnique Fédérale de Lausanne (EPFL), CH
Abstract
The use of multi-core clusters is a promising option for data-intensive embedded applications such as multimodal sensor fusion, image understanding, and mobile augmented reality. In this paper, we propose a power-efficient 3-D on-chip interconnect for multi-core clusters with stacked L2 cache memory. A new switch design makes a circuit-switched Mesh-of-Tree (MoT) interconnect reconfigurable to support power-gating of processing cores, memory blocks, and unnecessary interconnect resources (routing switch, arbitration switch, inverters placed along the on-chip wires). The proposed 3-D MoT improves the power efficiency by up to 77% in terms of energy-delay product (EDP).
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-11

POWER-EFFICIENT LOAD-BALANCING ON HETEROGENEOUS COMPUTING PLATFORMS
Speaker:
Muhammad Shafique, Karlsruhe Institute of Technology (KIT), DE
Authors:
Muhammad Usman Karim Khan¹, Muhammad Shafique¹, Apratim Gupta², Thomas Schumann² and Jörg Henkel¹
¹Karlsruhe Institute of Technology (KIT), DE; ²University of Applied Sciences, Darmstadt, DE
Abstract
In order to address the throughput constraints of the system at minimal power consumption, the workload of computing nodes should be balanced. This requires accounting for the underlying hardware characteristics (e.g., power vs. frequency profiles) and the throughput sustainable by these nodes. This work provides a workload distribution and balancing methodology for a divisible load under a throughput constraint on heterogeneous nodes. The power efficiency of each node is considered during load distribution. For load balancing, the frequency of each node is set to the value that just fulfills that node's job requirements. We functionally verify our methodology by implementing it on an FPGA-based system, with heterogeneous multi-cores and hardware accelerators, and report results for different image processing benchmarks. Compared to a state-of-the-art approach, our approach results in up to 64% performance improvement for the benchmarks evaluated in this paper.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-12

TOPAZ: MINING HIGH-LEVEL SAFETY PROPERTIES FROM LOGIC SIMULATION TRACES
Speaker:
Fadi Kurdahi, University of California, Irvine, US
Authors:
Ahmed Nassar¹, Fadi Kurdahi¹ and Salam Zantout²
¹University of California, Irvine, US; ²American University of Beirut, LB
Abstract
Formal specifications are hard to formulate and maintain for evolving complex digital hardware designs. Specification mining offers a (partially) automated route to discovering specifications from large simulation traces. In this paper, we embark on a novel and rigorous mining methodology (data preparation, mining algorithms, selection criteria, etc.) for finite-state automata checkers using an iterative and interactive mining tool, called Topaz. Topaz is evaluated using an open-source 32-bit RISC CPU design as a case study to demonstrate extraction of complex temporal properties cross-cutting through all CPU pipeline stages, guided by the CPU instruction set specification.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-13

EXPLOITING TRANSACTION LEVEL MODELS FOR OBSERVABILITY-AWARE POST-SILICON TEST GENERATION
Speaker:
Prabhat Mishra, University of Florida, US
Authors:
Farimah Farahmandi¹, Prabhat Mishra¹ and Sandip Ray²
¹University of Florida, US; ²Intel Corporation, US
Abstract
A critical problem in post-silicon debug is to generate efficient tests that both activate requisite coverage goals on the target hardware as well as produce results that are observable through a given on-chip design-for-debug architecture. Unfortunately, such tests cannot be generated directly from RTL models, both due to design complexity and due to bugs in the design itself. In this paper, we propose an approach to address this problem by exploiting transaction-level models (TLM). Our approach involves mapping test and observability requirements between TLM and RTL, enabling TLM analysis to generate post-silicon tests. We provide case studies from a number of different design classes to demonstrate the flexibility and effectiveness of the approach.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-14

SEERAD: A HIGH SPEED YET ENERGY-EFFICIENT ROUNDING-BASED APPROXIMATE DIVIDER
Speaker:
Ali Afzali-Kusha, University of Tehran, IR
Authors:
Reza Zendegani¹, Mehdi Kamal¹, Arash Fayyazi¹, Ali Afzali-Kusha¹, Saeed Safari¹ and Massoud Pedram²
¹University of Tehran, IR; ²University of Southern California, US
Abstract
In this paper, a high speed yet energy-efficient approximate divider for error resilient applications is proposed. For the division operation, the divisor is rounded to a value with a specific form, which transforms the division operation into a multiplication. The proposed approximate divider enjoys the flexibility of increasing the accuracy at the price of higher delay and hardware usage. The efficacy of the proposed approximate divider is evaluated in comparison to three different implementations of the SRT divider. The results show that the delay and energy consumption of the proposed approximate divider are, on average, 14 and 300 times smaller than those of the Radix-2 SRT with the carry-save remainder computation. Additionally, the effectiveness of the proposed approximate divider is studied in an image division operation performed in image processing applications. The results suggest the appropriateness of the proposed approximate divider for digital signal processing applications.
Download Paper (PDF; Only available from the DATE venue WiFi)
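
The general idea in the IP5-14 abstract, replacing a division by a multiplication after rounding the divisor, can be sketched as follows. This is a simplified illustration under stated assumptions and not the SEERAD rounding scheme itself: the function name approx_divide, the parameter frac_bits, and the choice of a rounded reciprocal scaled by a power of two are all hypothetical.

```python
def approx_divide(dividend: int, divisor: int, frac_bits: int = 3) -> int:
    """Approximate dividend / divisor with one multiplication and one shift."""
    if dividend < 0 or divisor <= 0:
        raise ValueError("sketch handles dividend >= 0 and divisor > 0 only")
    k = divisor.bit_length() - 1  # 2**k <= divisor < 2**(k + 1)
    shift = k + frac_bits
    # Rounded reciprocal of the divisor, scaled by 2**shift. A hardware design
    # would obtain this small constant without a divider; the integer division
    # here is only for brevity (assumption of this sketch).
    multiplier = ((1 << shift) + divisor // 2) // divisor
    # The division becomes a multiplication followed by a right shift.
    return (dividend * multiplier) >> shift
```

With the default frac_bits=3, approx_divide(100, 10) returns 9, while approx_divide(100, 10, frac_bits=8) returns 10. Raising frac_bits widens the multiplier and tightens the error, which mirrors the accuracy versus hardware cost trade-off mentioned in the abstract.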

IP5-15

IMPROVING PERFORMANCE GUARANTEES IN WORMHOLE MESH NOC DESIGNS
Speaker:
Milos Panic, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES
Authors:
Milos Panic¹, Carles Hernandez², Jaume Abella², Antoni Roca Perez³, Eduardo Quinones² and Francisco Cazorla⁴
¹Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES; ²Barcelona Supercomputing Center, ES; ³Universitat Politècnica de Catalunya, ES; ⁴Barcelona Supercomputing Center and IIIA-CSIC, ES
Abstract
Wormhole-based mesh Networks-on-Chip (wNoC) are deployed in high-performance many-core processors due to their physical scalability and low cost. Delivering tight and time-composable Worst-Case Execution Time (WCET) estimates for applications, as needed in safety-critical real-time embedded systems, is challenged by wNoCs due to their distributed nature. We propose a bandwidth control mechanism for wNoCs that enables the computation of tight time-composable WCET estimates with low average performance degradation and high scalability. Our evaluation with the EEMBC automotive suite and an industrial real-time parallel avionics application confirms this.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-16

A DATA LAYOUT TRANSFORMATION (DLT) ACCELERATOR: ARCHITECTURAL SUPPORT FOR DATA MOVEMENT OPTIMIZATION IN ACCELERATED-CENTRIC HETEROGENEOUS SYSTEMS
Speaker:
Tung Hoang, University of Chicago, US
Authors:
Tung Hoang, Amirali Shambayati and Andrew A. Chien, University of Chicago, US
Abstract
Technology scaling and the growing use of accelerators make optimization of data movement increasingly important in all computing systems. Further, growing diversity in memory structures makes embedding such optimization in software non-portable. We propose a novel architectural solution called Data Layout Transformation (DLT), associated with a simple set of instructions that enable software to describe the required data movement compactly and free the implementation to optimize the movement based on knowledge of the memory hierarchy and system structure. The DLT architecture ideas are applicable to both general-purpose and accelerator-based heterogeneous systems. Experimental results first show that the proposed DLT architecture can make use of the full bandwidth (>97%) of a wide range of memory systems (DDR3 and HMC) while its implementation cost, in 32nm, is low (only 0.246 mm² and 75mW at 1GHz). Our evaluation of the DLT accelerator in an accelerator-based heterogeneous system across DDR3 and HMC memory shows that the DLT can enhance system performance by 4.6x-99x (DDR3) and 4.4x-115x (HMC), which translates into a 2.8x-48x (DDR3) and 1.4x-39x (HMC) improvement in energy efficiency.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-17

OUESSANT: FLEXIBLE INTEGRATION OF DEDICATED COPROCESSORS IN SYSTEMS ON CHIP
Speaker:
Pierre-Henri Horrein, Lab-STICC/Télécom Bretagne, FR
Authors:
Pierre-Henri Horrein, Philip-Dylan Gleonec, Erwan Libessart, André Lalevée and Matthieu Arzel, Lab-STICC/Télécom Bretagne, FR
Abstract
Integration of hardware accelerators in Systems on Chip is often complex. When dealing with reconfigurable hardware, this greatly limits the attainable flexibility. In this paper, we propose an alternative approach to the Molen paradigm [1]. This approach, named Ouessant, is based on a very simple general purpose instruction set designed for close interaction with dedicated hardware accelerators. This instruction set is used to program a dedicated controller, which commands the accelerator's execution and data transfer with minimal CPU intervention. The resulting architecture is flexible, extensible, and can be easily integrated in Systems on Chip. Adding new accelerators is also made easier. Implementation of the architecture on different FPGA resources shows a very low footprint and a very small impact on attainable performance. Ouessant is freely available under an open-source license.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-18

A NOVEL BACKGROUND SUBTRACTION SCHEME FOR IN-CAMERA ACCELERATION IN THERMAL IMAGERY
Speaker:
Konstantinos Makantasis, Institute of Communication and Computer Systems, GR
Authors:
Antonis Nikitakis¹, Ioannis Papaefstathiou², Konstantinos Makantasis³ and Anastasios Doulamis⁴
¹Technical University of Crete, GR; ²Synelixis Solutions Ltd, GR; ³Institute of Communication and Computer Systems, GR; ⁴National Technical University of Athens, GR
Abstract
Real-time segmentation of moving regions in image sequences is a very important task in numerous surveillance and monitoring applications. A common approach for such tasks is "background subtraction", which tries to extract regions of interest from the image background for further processing or action; as a result, its accuracy as well as its real-time performance is of great significance. In this work we utilize a novel scheme, designed and optimized for FPGA-based implementations, which models the intensities of each pixel as a mixture of Gaussian components; following a Bayesian approach, our method automatically estimates the number of Gaussian components as well as their parameters. Our novel system is based on an efficient and highly accurate on-line updating mechanism, which permits our system to adapt automatically to dynamically changing operating conditions while avoiding over/under fitting. We also present two reference implementations of our Background Subtraction Parallel System (BSPS) in reconfigurable hardware, achieving both high performance and low power consumption; the presented FPGA-based systems significantly outperform a multi-core ARM and two multi-core low-power Intel CPUs in terms of energy consumed per processed pixel as well as frames per second. Moreover, our low-cost, low-power devices allow for the implementation, for the first time, of a highly distributed surveillance system which will alleviate the main problems of the existing centralized approaches.
Download Paper (PDF; Only available from the DATE venue WiFi)
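
The per-pixel, on-line updating idea behind background subtraction in the IP5-18 abstract can be illustrated with a much simpler model. The sketch below swaps in a single running Gaussian per pixel in place of the Bayesian mixture of Gaussians described in the abstract, so it is not the authors' method; the function name update_background and the parameters alpha, k and min_var are assumptions made for illustration.

```python
def update_background(mean, var, frame, alpha=0.05, k=2.5, min_var=4.0):
    """One step of a per-pixel running-Gaussian background model.

    mean, var and frame are equal-length lists of pixel statistics and
    intensities. Returns the foreground mask; mean and var are updated in place.
    """
    mask = []
    for i, pixel in enumerate(frame):
        diff = pixel - mean[i]
        # A pixel is foreground if it lies more than k standard deviations
        # away from the background mean.
        is_foreground = diff * diff > k * k * var[i]
        mask.append(is_foreground)
        if not is_foreground:
            # On-line update: blend the new observation into the model so the
            # background follows slow changes in the scene.
            mean[i] += alpha * diff
            var[i] = max(min_var, (1 - alpha) * var[i] + alpha * diff * diff)
    return mask
```

Only background pixels update the model here, so a moving object does not get absorbed into the background immediately. A mixture model keeps several such Gaussians per pixel, which is what lets it handle multi-modal backgrounds.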

IP5-19

RADIATION-HARDENED DSP CONFIGURATIONS FOR IMPLEMENTING ARITHMETIC FUNCTIONS ON FPGA
Speaker:
Felipe Serrano, Universidad Complutense de Madrid, ES
Authors:
Marcos Sanchez-Elez, Inmaculada Pardines, Felipe Serrano and Hortensia Mecha, Universidad Complutense de Madrid, ES
Abstract
This paper presents a study of different implementations of arithmetic operations on FPGAs. Radiation vulnerability has been analyzed for each implementation using the fault injection platform NESSY. Results in terms of area, delay and reliability are presented. Taking into account the performed tests we propose to build a library of HDL templates. This library is used during the design process with a synthesis tool that implements digital circuits as reliable as possible. Experimental results show that those implementations using DSP slices are the ones which achieve better results.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-20

CONFIGURATION PREFETCHING AND REUSE FOR PREEMPTIVE HARDWARE MULTITASKING ON PARTIALLY RECONFIGURABLE FPGAS
Speaker:
Ann Gordon-Ross, University of Florida, US
Authors:
Aurelio Morales-Villanueva, Rohit Kumar and Ann Gordon-Ross, University of Florida, US
Abstract
Partially reconfigurable (PR) FPGAs enable preemptive hardware (HW) multitasking using PR regions (PRRs). To enable this multitasking, the HW task's partial bitstream is downloaded to only the task's PRR, and only that PRR is reconfigured. Since only a small portion of the FPGA fabric is reconfigured, reconfiguration time is significantly reduced compared to reconfiguring the entire fabric; however, this time is not negligible. Reconfiguration time can be reduced/hidden using two techniques: configuration prefetching and configuration reuse. Even though these techniques can effectively reduce/hide reconfiguration overhead, prior works in preemptive HW multitasking did not use these techniques. To the best of our knowledge, no prior work evaluated physical implementations of these techniques on PR FPGAs, which precludes consideration of physical-implementation-specific details, such as delays in accessing bitstreams, speed limitations during reconfiguration, etc. In this work, we present a novel implementation of configuration prefetching and reuse for preemptive HW multitasking on a Virtex-5 FPGA; however, our established fundamentals are device-family independent.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-21

ANALOG CIRCUIT TOPOLOGICAL FEATURE EXTRACTION WITH UNSUPERVISED LEARNING OF NEW SUB-STRUCTURES
Speaker:
Alex Doboli, Stony Brook University, US
Authors:
Hao Li, Fanshu Jiao and Alex Doboli, Stony Brook University, US
Abstract
This paper presents novel techniques to automatically extract the topological (structural) features in analog circuits. The extracted features include basic building blocks, structural templates and hierarchical structures. Finding structural features is important for tasks like circuit synthesis and sizing, design verification, design reuse, and design knowledge description, summarization and management. The paper presents algorithms for supervised feature extraction and unsupervised learning of new block connections. Experiments discuss feature extraction for a set of 34 state-of-the-art analog circuits.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP5-22

DESIGN AUTOMATION TASKS SCHEDULING FOR ENHANCED PARALLEL EXECUTION OF A STATE-OF-THE-ART LAYOUT-AWARE SIZING APPROACH
Speaker:
Nuno Horta, Instituto de Telecomunicações/Instituto Superior Técnico, PT
Authors:
David Neves, Ricardo Martins, Nuno Lourenço and Nuno Horta, Instituto de Telecomunicações/Instituto Superior Técnico, PT
Abstract
This paper presents an innovative methodology to efficiently schedule design automation tasks during the execution of an analog IC layout-aware sizing process. The referred synthesis process includes several sub-tasks such as DC simulation, floorplanning, placement, global routing, parasitic extraction, and circuit simulations in multiple worst-case corners. The schedule of the design tasks is optimized here taking into account standard multi-core architectures, task dependencies, accurate time estimations for each task and a limited number of licenses for using commercial tools, e.g., the number of simulator licenses. The proposed methodology first considers a directed acyclic graph for representing the design flow and task dependencies; then, an evolutionary kernel is used to implement a single-objective multi-constraint optimization. The efficiency and impact of the proposed approach are validated by using a state-of-the-art analog IC design automation environment.
Download Paper (PDF; Only available from the DATE venue WiFi)
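
The scheduling problem in the IP5-22 abstract (a task DAG, a fixed number of cores, and a limited number of tool licenses) can be made concrete with a small example. The sketch below uses a greedy list scheduler, not the evolutionary optimization described in the abstract; the names Task and list_schedule, the integer durations, and the longest-task-first priority rule are assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    duration: int
    deps: tuple = ()
    needs_license: bool = False


def list_schedule(tasks: dict, num_cores: int, num_licenses: int):
    """Greedy list scheduling; returns ({task name: start time}, makespan)."""
    finished, start = set(), {}
    running = []  # min-heap of (finish_time, name)
    free_cores, free_licenses, now = num_cores, num_licenses, 0
    while len(finished) < len(tasks):
        # Unstarted tasks whose predecessors are all done, longest first.
        ready = sorted(
            (n for n, t in tasks.items()
             if n not in start and all(d in finished for d in t.deps)),
            key=lambda n: (-tasks[n].duration, n),
        )
        for name in ready:
            task = tasks[name]
            if free_cores == 0:
                break
            if task.needs_license and free_licenses == 0:
                continue  # a later ready task may not need a license
            free_cores -= 1
            free_licenses -= int(task.needs_license)
            start[name] = now
            heapq.heappush(running, (now + task.duration, name))
        if not running:
            raise ValueError("cyclic dependencies or insufficient resources")
        # Advance time to the next completion and release its resources.
        now, name = heapq.heappop(running)
        finished.add(name)
        free_cores += 1
        free_licenses += int(tasks[name].needs_license)
    return start, now
```

For a flow where three corner simulations of equal length become ready at the same time and each needs a simulator license, two licenses force the third simulation to wait for one of the others, while three licenses let all of them run in parallel. A greedy rule like this gives a valid schedule but not necessarily the shortest one, which is where a search-based optimizer can help.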

Visit us at DATE 2016