DATE Executive Committee
DATE Sponsors Committee
Technical Program Topic Chairs
Technical Program Committee
Best Paper Awards
Call for Papers: DATE 2011
Design of complex systems is essentially about connections: connection of concepts, connection of objects, connection of teams. And products of the future will be connected seamlessly across physical and virtual domains. Connections can produce systems that offer more than the sum of the components, but they can also yield systems that are less powerful than the sum of the components or that are so compromised by their interactions that they do not work at all.
The rise of the wireless internet is the megatrend in the communication industry. Mobile communication devices will be the dominant platform for information access, gaming, music and spending time with distant friends. About 5 bn or three quarters of the world population is using mobile phones. Internet-enabled mobile phones at affordable cost will be the most common internet access device in the near future. At the high end, powerful and versatile smartphones need faster and more energy-efficient semiconductors.
Cyber-Physical Systems require distributed architectures to support safety-critical real-time control. Kopetz' Time-Triggered Architectures (TTA) have been proposed for such systems as both an architecture and a comprehensive paradigm for systems architecture. To relax the strict requirements on synchronization imposed by TTA, Loosely Time-Triggered Architectures (LTTA) have been recently proposed. In LTTA, computation and communication units are all triggered by autonomous, non-synchronized clocks. Communication media act as shared memories between writers and readers, and communication is non-blocking. In this paper we review the different variants of LTTA and discuss their principles and usage.
Leakage power consumption contributes significantly to the overall power dissipation for systems that are manufactured in advanced deep sub-micron technology. Different from many previous results, this paper explores leakage-aware energy-efficient scheduling if leakage power consumption depends on temperature. We propose a pattern-based approach which divides a given time horizon into several time segments with the same length, where the processor is in the active (dormant, respectively) mode for a fixed amount of time at the beginning (end, respectively) of each time segment. Computation is advanced in the active mode, whereas the dormant mode helps reduce the temperature via cooling as well as the leakage power consumption. Since the pattern-based approach leads to a steady state with an equilibrium temperature, we develop a procedure to find the optimal pattern whose energy consumption in steady state is the minimum. Compared to existing work, our approach is more effective, has less run-time scheduling overhead, and requires only a simple scheduler to control the system mode periodically. The paper contains extensive simulation results which validate the new models and methods.
Index Terms - energy-efficient task scheduling, temperature-dependent leakage power consumption, real-time systems.
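The steady-state behaviour of such an active/dormant pattern can be illustrated with a small simulation. The sketch below is not the paper's model: it assumes a first-order thermal RC model, leakage that grows linearly with temperature, zero power in the dormant mode, and forward-Euler integration. Every constant and function name is a hypothetical illustration value.

```python
# All constants are hypothetical illustration values, not taken from the paper.
T_AMB = 25.0   # ambient temperature (deg C)
R_TH = 2.0     # thermal resistance (K/W)
C_TH = 0.5     # thermal capacitance (J/K)
P_DYN = 10.0   # dynamic power in active mode (W)
LEAK_A = 2.0   # leakage power at ambient temperature (W)
LEAK_B = 0.1   # leakage growth per kelvin above ambient (W/K)


def active_power(temp):
    """Dynamic power plus linearized, temperature-dependent leakage (assumed)."""
    return P_DYN + LEAK_A + LEAK_B * (temp - T_AMB)


def run_segment(temp, active, dormant, dt=1e-3):
    """Simulate one segment: active phase first, then dormant phase.

    Returns (temperature at segment end, energy consumed in the segment).
    The dormant mode is assumed to draw no power at all.
    """
    energy = 0.0
    for phase_len, is_active in ((active, True), (dormant, False)):
        for _ in range(round(phase_len / dt)):
            power = active_power(temp) if is_active else 0.0
            energy += power * dt
            # Forward-Euler step of the first-order thermal RC model.
            temp += dt * (R_TH * power - (temp - T_AMB)) / (R_TH * C_TH)
    return temp, energy


def steady_state(active, dormant, tol=1e-6, max_segments=100_000):
    """Repeat the pattern until the segment start temperature stops changing."""
    temp = T_AMB
    for _ in range(max_segments):
        end_temp, energy = run_segment(temp, active, dormant)
        if abs(end_temp - temp) < tol:
            return end_temp, energy
        temp = end_temp
    raise RuntimeError("pattern did not reach a steady state")


def average_power(active, dormant):
    """Steady-state energy per segment divided by the segment length."""
    _, energy = steady_state(active, dormant)
    return energy / (active + dormant)
```

Comparing `average_power` across candidate patterns that provide the same active fraction gives a crude version of the pattern search; a realistic search would also have to account for mode-switching overhead, which this sketch ignores.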
We present a high-level method for rapidly and accurately estimating energy and performance overhead of Real-Time Operating Systems. Unlike most other approaches, which rely on Transaction-Level Modeling (TLM), we infer the information we need directly from executing the algorithmic specification, without needing to build any high-level architectural model. We distinguish two main components in our approach: first, an accurate one-time pre-characterization of the main RTOS functionalities in terms of energy and cycles; second, the development of an algorithm to rapidly predict the occurrences of such RTOS functionalities. Finally, we demonstrate the feasibility of our approach by comparing it against gate level for accuracy and against TLM for speed. We obtain a worst-case energy error of 12% against a mean speedup of 36X.
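The two components combine naturally: once each RTOS functionality has a pre-characterized cost, the overhead estimate is the predicted occurrence count times that cost, summed over all functionalities. A minimal sketch follows, in which the service names, the cost values and the function name are all hypothetical placeholders rather than data from the paper.

```python
from collections import Counter

# Hypothetical pre-characterization table: service -> (energy in nJ, cycles).
RTOS_COSTS = {
    "context_switch": (120.0, 450),
    "semaphore_take": (35.0, 140),
    "timer_tick": (20.0, 80),
}


def estimate_overhead(service_calls):
    """Sum pre-characterized energy and cycles over predicted RTOS service calls."""
    counts = Counter(service_calls)
    energy = sum(RTOS_COSTS[name][0] * n for name, n in counts.items())
    cycles = sum(RTOS_COSTS[name][1] * n for name, n in counts.items())
    return energy, cycles
```

Here `service_calls` stands in for the output of the occurrence-prediction step, whose algorithm the abstract does not describe.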
With new technologies, temperature has become a major issue to be considered at system level design. In this paper we propose a temperature aware idle time distribution technique for energy optimization with dynamic voltage scaling (DVS). A temperature analysis approach is also proposed which is accurate and, yet, sufficiently fast to be used inside the optimization loop for idle time distribution and voltage selection.
The use of dynamic voltage and frequency scaling (DVFS) in contemporary multicores provides significant protection from unpredictable thermal events. A side effect of DVFS can be an increased processor exposure to soft errors. To address this issue, a flexible fault prevention mechanism has been developed to selectively enable a small amount of per-core dual modular redundancy (DMR) in response to increased vulnerability, as measured by the processor architectural vulnerability factor (AVF). Our new algorithm for DMR deployment aims to provide a stable effective soft error rate (SER) by using DMR in response to DVFS caused by thermal events. The algorithm is implemented in real-time on the multicore using a dedicated monitor network-on-chip and controller which evaluates thermal information and multicore performance statistics. Experiments with a multicore simulator using standard benchmarks show an average 6% improvement in overall power consumption and a stable SER by using selective DMR versus continuous DMR deployment.
Keywords-architectural vulnerability, DVFS, monitor network
Requiring more bandwidth at reasonable power consumption, new communication infrastructures must provide adequate solutions to guarantee performance during physical integration. In this paper, we propose the design of a low-power asynchronous Network-on-Chip which is implemented in a bottom-up approach using optimized hard-macros. This architecture is fully testable and a new design flow is proposed to overcome CAD tools limitations regarding asynchronous logic. The proposed architecture has been successfully implemented in CMOS 65nm in a complete circuit. It achieves a 550Mflit/s throughput on silicon, and exhibits 86% power reduction compared to an equivalent synchronous NoC version.
Supporting Distributed Shared Memory (DSM) is essential for multi-core Network-on-Chips for the sake of reusing a huge amount of legacy code and easy programmability. We propose a microcoded controller as a hardware module in each node to connect the core, the local memory and the network. The controller is programmable: the DSM functions such as virtual-to-physical address translation, memory access and synchronization are realized using microcode. To enable concurrent processing of memory requests from the local and remote cores, our controller features two mini-processors, one dealing with requests from the local core and the other with requests from remote cores. Synthesis results suggest that the controller consumes 51k gates for the logic and runs at up to 455 MHz in 130 nm technology. To evaluate its performance, we use synthetic and application workloads. Results show that, when the system size is scaled up, the delay overhead incurred by the controller may become less significant when compared with the network delay. In this way, the delay efficiency of our DSM solution is close to that of hardware solutions on average while still having all the flexibility of software solutions.
The shared-memory model has been adopted, both for data exchange as well as synchronization using semaphores in almost every on-chip multiprocessor implementation, ranging from general purpose chip multiprocessors (CMPs) to domain specific multi-core graphics processing units (GPUs). Low-latency synchronization is desirable but is hard to achieve in practice due to the memory hierarchy. On the contrary, an explicit exchange of synchronization tokens among the processing elements through dedicated on-chip links would be beneficial for the overall system performance. In this paper we propose the Medea NoC-based framework, a hybrid shared-memory/message-passing approach. Medea has been modeled with a fast, cycle-accurate SystemC implementation enabling a fast system exploration varying several parameters like number and types of cores, cache size and policy and NoC features. In addition, every SystemC block has its RTL counterpart for physical implementation on FPGAs and ASICs. A parallel version of the Jacobi algorithm has been used as a test application to validate the methodology. Results confirm expectations about performance and effectiveness of system exploration and design.
Aggressive technology scaling has an ever-increasing adverse impact on the lifetime reliability of microprocessors. This paper proposes a novel simulation framework for evaluating the lifetime reliability of processor-based systems-on-a-chip (SoCs), namely AgeSim, which helps designers make design decisions that affect SoCs' mean time to failure. Unlike existing work, AgeSim can simulate failure mechanisms with arbitrary lifetime distributions and does not need to trace the system's reliability-related factors over its entire lifetime, and hence is more efficient and accurate. Two case studies are conducted to show the flexibility and effectiveness of the proposed methodology.
This paper presents an automated technique to perform SRAM-wide statistical analysis in the presence of process variability. The technique is implemented in a prototype tool and is demonstrated on several 45 and 32nm industry-grade SRAM vehicles. Selected case studies show how this approach successfully captures non-trivial statistical interactions between the cells and the periphery, which remain uncovered when only using statistical electrical simulations of the critical path or applying a digital corner approach. The presented tool provides the designer with valuable information on what performance metrics to expect from the manufactured SRAM. Since this feedback takes place in the design phase, a significant reduction in development time and cost can be achieved.
Ever-increasing test mode IR-drop results in a significant number of defect-free chips failing at-speed testing. The lack of a systematic IR-drop failure identification technique leads to greatly increased failure analysis time and cost and to significant yield loss. In this paper, we propose a failure-adaptive test scheme that enables a fast differentiation of IR-drop induced failures from the actual defects of the chip. The proposed technique debugs the failing chips using low IR-drop vectors that are custom-generated from the observed faulty response. Since these special vectors are designed in such a way that all the actual defects captured by the original vectors are still manifestable, their application can clearly pinpoint whether the root cause of failure is IR-drop or not, thus eliminating reliance on an intrusive debugging process that incurs quite a high cost. Such a test scheme further enables effective yield recovery from failing chips by passing the ones validated by the debugging vectors whose IR-drop level matches the functional mode. Experimental results show that the proposed scheme delivers a significant IR-drop reduction in the second test (debugging) phase, thus enabling a highly effective IR-drop failure identification and yield recovery at a slightly increased test cost.
Power gating is an effective technique for reducing leakage power which involves powering off idle circuits through power switches, but those power-gated circuits which need to retain their states store their data in state retention registers. When power-gated circuits are switched from sleep to active mode, the sudden rush of current has the potential to corrupt the stored data in the state retention registers, which could be a reliability problem. This paper presents a methodology for improving the reliability of power-gated designs by protecting the integrity of state retention registers through state monitoring and correction. This is achieved by scan chain data encoding and decoding. The methodology is compatible with EDA tool design and power gating control flows. A detailed analysis of the proposed methodology's capability in detecting and correcting errors is given, including the area overhead and energy consumption of the protection circuitry. The methodology is validated using an FPGA, and the results show that it is possible to correct all single errors with a Hamming code and detect all multiple errors with a CRC-16 code. To the best of our knowledge this is the first study in the area of reliable power gating designs through state monitoring and correction.
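Single-error correction with a Hamming code can be illustrated on a small data word. The abstract does not state the code length or how scan chain data is grouped, so the sketch below assumes the textbook Hamming(7,4) code purely for illustration; the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Correct at most one flipped bit; return (data bits, syndrome).

    A non-zero syndrome is the 1-based position of the flipped bit.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], syndrome
```

In a state-retention setting, the encode step would correspond to computing check bits before entering sleep mode and the correct step to checking the retained state after wake-up, although the mapping onto scan chains is specific to the paper and not reproduced here.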
Virtual Prototypes (VPs) based on Transaction Level Modeling (TLM) have become a de-facto standard in today's SoC design, enabling early SW development. However, due to the growing complexity of SoC architectures full system simulations (HW+SW) become a bottleneck reducing this benefit. Hence, it is necessary to develop modeling styles which allow for further abstraction beyond the currently applied TLM methodology. This paper introduces such a modeling style, referred to as TLM+. It enables a higher modeling abstraction through merging hardware dependent driver software at the lowest level with the HW interface. Thus, sequences of HW transactions can be merged to single HW/SW transactions while preserving both the HW architecture and the low-level to high-level SW interfaces. In order to maintain the ability to validate timing-critical paths, a new resource model concept is introduced which compensates the loss of timing information, induced by merging HW transactions. Experimental results show a speed-up of up to 1000x at a timing error of approximately 10%.
In the past few years, many research groups have presented methods which are very valuable for analytically evaluating the timing behavior of automotive electric/electronic (E/E) systems. From an industrial perspective (view of an OEM), the novelty of this topic leads to a situation where the necessary input data for timing analysis is partially not specified or not available in an appropriate manner. Therefore, this paper presents a methodology for a systematic extraction of periodic and sporadic events in order to refine the input data for subsequent timing analysis. Our experimental results obtained with a real-world E/E system point out the impact of the contribution.
Model Driven Software Development has offered a faster way to design and implement embedded real-time software by moving the design to a model level, and by transforming models to code. However, the testing of embedded systems has remained at the code level. This paper presents a Graphical Model Debugger Framework, providing an auxiliary avenue of analysis of system models at runtime by executing generated code and updating models synchronously, which allows embedded developers to focus on the model level. With the model debugger, embedded developers can graphically test their design model and check the running status of the system, which offers a debugging capability on a higher level of abstraction. The framework is intended as a tool contribution to the Eclipse community, especially suitable for model-driven development of embedded systems.
Keywords - embedded systems; model-driven development; model debugger; eclipse
Throughput and programmability have always been the central, but generally conflicting, concerns for modern IP router designs. Current high performance routers depend on proprietary hardware solutions, which make it difficult to adapt to ever-changing network protocols. On the other hand, software routers offer the best flexibility and programmability, but could only achieve a throughput one order of magnitude lower. Modern GPUs offer significant computing power, and their data-parallel computing model matches the typical patterns of packet processing on routers well. Accordingly, in this research we investigate the potential of CUDA-enabled GPUs for IP routing applications. As a first step toward exploring the architecture of a GPU-based software router, we developed GPU solutions for a series of core IP routing applications such as IP routing table lookup and pattern matching. For the deep packet inspection application, we implemented both a Bloom-filter based string matching algorithm and a finite automata based regular expression matching algorithm. A GPU-based routing table lookup solution is also proposed in this work. Experimental results show that the GPU could accelerate the routing processing by one order of magnitude. Our work suggests that, with proper architectural modifications, GPU-based software routers could deliver significantly higher throughput than previous CPU-based solutions.
Keywords-GPU; CUDA; router; table lookup; Deep packet inspection; Bloom filter; DFA
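The idea behind Bloom-filter based string matching can be sketched on the CPU: a compact bit array rejects most payload windows cheaply, and only the rare filter hits are verified exactly. The sketch below is a sequential Python illustration under stated assumptions, not the paper's CUDA implementation: all signatures share one length, the hash functions are salted SHA-256, and the class, function and parameter names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Set-membership filter with possible false positives, no false negatives."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def scan_payload(payload, signatures):
    """Return (offset, signature) for every signature occurrence in payload.

    Assumes a non-empty list of equal-length byte-string signatures.
    """
    length = len(signatures[0])
    assert all(len(s) == length for s in signatures)
    bloom = BloomFilter()
    for sig in signatures:
        bloom.add(sig)
    exact = set(signatures)
    hits = []
    for offset in range(len(payload) - length + 1):
        window = payload[offset:offset + length]
        # Cheap filter test first; exact check removes false positives.
        if bloom.might_contain(window) and window in exact:
            hits.append((offset, window))
    return hits
```

Each window offset is tested independently of the others, which is the property that makes this kind of scan a plausible fit for a data-parallel device.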
Historically, processor performance has increased at a much faster rate than that of main memory, and upcoming NoC-based many-core architectures are further tightening the memory bottleneck. 3D integration based on TSV technology may provide a solution, as it enables stacking of multiple memory layers, with orders-of-magnitude increase in memory interface bandwidth, speed and energy efficiency. To fully exploit this potential, the architectural interface to vertically stacked memory must be streamlined. In this paper we present an efficient and flexible distributed memory interface for 3D-stacked DRAM. Our interface ensures ultra-low-latency access to the memory modules on top of each processing element (vertically local memory neighborhoods). Communication to these local modules does not travel through the NoC and takes full advantage of the lower latency of vertical interconnect, thus significantly speeding up the common case. The interface still supports a convenient global address space abstraction with high-latency remote access, due to the slower horizontal interconnect. Experimental results demonstrate significant bandwidth improvement that ranges from 1.44x to 7.40x as compared to the JEDEC standard, with peaks of 4.53GB/s for direct memory access, and 850MB/s for remote access through the NoC.
Emerging TSV-based 3D integration technologies have shown great promise to overcome scalability limitations in 2D designs by stacking multiple memory dies on top of a manycore die. Application software developers need programming models and tools to fully exploit the potential of vertically stacked memory. In this work, we focus on efficient data mapping for SPMD parallel applications on an explicitly managed 3D-stacked memory hierarchy, which requires placement of data across multiple vertical memory stacks to be carefully optimized. We propose a programming framework with compiler support that enables array partitioning. Each partition is mapped to the 3D-stacked memory on top of the processor that accesses it most, to take advantage of the lower latencies of vertical interconnect and to minimize high-latency traffic on the horizontal plane.
Liquid cooling has emerged as a promising solution for addressing the elevated temperatures in 3D stacked architectures. In this work, we first propose a framework for detailed thermal modeling of the microchannels embedded between the tiers of the 3D system. In multicore systems, workload varies at runtime, and the system is generally not fully utilized. Thus, it is not energy-efficient to adjust the coolant flow rate based on the worst-case conditions, as this would cause an excess in pump power. For energy-efficient cooling, we propose a novel controller to adjust the liquid flow rate to meet the desired temperature and to minimize pump energy consumption. Our technique also includes a job scheduler, which balances the temperature across the system to maximize cooling efficiency and to improve reliability. Our method guarantees operating below the target temperature while reducing the cooling energy by up to 30%, and the overall energy by up to 12% in comparison to using the highest coolant flow rate.
In this paper, we explore the design and optimization of an on-chip active cooling system based on thin-film thermoelectric coolers (TEC). We start our investigation by establishing the compact thermal model for the chip package with integrated thin-film TEC devices. We observe that deploying an excessive number of TEC devices and/or providing the TEC devices with an improper supply current might adversely result in the overheating of the chip, rendering the cooling system ineffective. A large amount of supply current could even cause the thermal runaway of the system. Motivated by this observation, we formulate the deployment of the integrated TEC devices and their supply current setting as a system-level design problem. We propose a greedy algorithm to determine the deployment of TEC devices and a convex programming based scheme for setting the supply current levels. Leveraging the theory of inverse-positive matrix, we provide an optimality condition for the current setting algorithm. We have tested our algorithms on various benchmarks. We observe that our algorithms are able to determine the proper deployment and supply current level of the TEC devices which reduces the temperatures of the hot spots by as much as 7.5 °C compared to the cases without integrated TEC devices.
With the majority of chip real estate being filled with re-used IP blocks, the process of block assembly has significantly grown in importance. Marketing literature seems to suggest that assembling a chip from IP is as easy as browsing a library of blocks, assembling them in a block diagram and then pushing a button.
The current energy and environmental cost trends of datacenters are unsustainable. It is critically important to develop datacenter-wide power and thermal management (PTM) solutions that improve the energy efficiency of the datacenters. This paper describes one such approach where a PTM engine decides on the number and placement of ON servers while simultaneously adjusting the supplied cold air temperature. The goal is to minimize the total power consumption (for both servers and air conditioning units) while meeting an upper bound on the maximum temperature seen in any server chassis in the datacenter. To achieve this goal, it is important to be able to predict the incoming workload in terms of requests per second (which is done by using a short-term workload forecasting technique) and to have efficient runtime policies for bringing new servers online when the workload is high or shutting them off when the workload is low. Datacenter-wide power saving is thus achieved by a combination of chassis consolidation and efficient cooling. Experimental results demonstrate the effectiveness of the proposed dynamic resource provisioning method.
Keywords-datacenter, cloud computing, resource provisioning, energy efficient, power optimization, temperature aware
In this paper we study the effectiveness of two power gating methods - transistor switches and MEMS switches - in reducing the power consumption of a design with a certain target throughput. Transistor switches are simple, but have fundamental limitations in their effectiveness. MEMS switches, with zero leakage in the off state, have achieved much focus over the past decade in the RF field, but have only very recently been explored in the context of power gating. In this paper we study both methods in conjunction with voltage scaling and show that MEMS switches are the superior choice over a wide range of target throughputs, especially low-throughput applications such as wireless sensor networks and biomedical implants. We also show that the architectural choices and operating conditions in a throughput-aware design can be profoundly different when using MEMS switches as opposed to transistor switches. For instance, while transistor switches favor smaller and slower architectures, the MEMS switches favor larger and faster designs when the target throughput is low. Moreover, while the optimal operating voltage of a transistor-switched design resides in the subthreshold region, that of a MEMS-switched design can be above or near the threshold voltage. To prove this, we provide both a mathematical analysis and experimental results from four different FFT architectures.
Phase change memory (PCM) is one of the most promising technologies among emerging non-volatile random access memory technologies. Implementing a cache memory using PCM provides many benefits such as high density, non-volatility, low leakage power, and high immunity to soft errors. However, its disadvantages such as high write latency, high write energy, and limited write endurance prevent it from being used as a drop-in replacement for an SRAM cache. In this paper, we study a set of techniques to design an energy- and endurance-aware PCM cache. We also model the timing, energy, endurance, and area of PCM caches and integrate the models into a PCM cache simulator to evaluate the techniques. Experiments show that our PCM cache design can achieve 8% of energy saving and 3.8 years of lifetime compared with a baseline PCM cache having less than an hour of lifetime.
To respond to variations in solar energy, harvested-energy prediction is essential to harvested-energy management approaches. The effectiveness of such approaches depends on both the achievable accuracy and the computation overhead of the prediction algorithm implementation. This paper presents a detailed evaluation of a recently reported solar energy prediction algorithm to determine empirical bounds on achievable accuracy and implementation overhead using an effective error evaluation technique. We evaluate the algorithm performance over varying prediction horizons and propose guidelines for algorithm parameter selection across different real solar energy profiles to simplify implementation. The prediction algorithm computation overhead is measured on actual hardware to demonstrate the prediction accuracy-cost trade-off. Finally, we motivate the basis for a dynamic prediction algorithm and show that more than a 10% increase in prediction accuracy can be achieved compared to the static algorithm.
We propose a novel self-reference sensing scheme for Spin-Transfer Torque Random Access Memory (STT-RAM) to overcome the large bit-to-bit variation of Magnetic Tunneling Junction (MTJ) resistance. Different from all the existing schemes, our solution is nondestructive: the stored value in the STT-RAM cell does NOT need to be overwritten by a reference value. Hence, the long write-back operation (of the original stored value) is eliminated. The robustness analyses of the existing scheme and our proposed nondestructive scheme are also presented. The measurement results from a 16kb test chip successfully confirmed the effectiveness of our technique.
Keywords-spin-transfer torque; STT-RAM; self-reference
Flexible electronics have attracted much attention since they enable promising applications such as low-cost RFID tags and e-paper. Thin-film transistors (TFTs) are considered an ideal candidate to implement flexible electronics on low-cost substrates. Most TFT technologies, however, have only mono-type - either n- or p-type - devices and thus modern design technologies for silicon-based electronics cannot be directly applied. In this paper, we propose a novel design style, Pseudo-CMOS, for flexible electronics that uses only mono-type TFTs while achieving performance comparable to that of complementary-type designs. The manufacturing cost and complexity can therefore be significantly reduced while the circuit yield and reliability are also enhanced with the built-in capability of post-fabrication tuning. Some standard cells have been designed and fabricated in p-type organic and n-type InGaZnO (IGZO) TFT technologies which successfully verify the superiority of the proposed Pseudo-CMOS design style. To the best of our knowledge, this is the first design solution that has proven superior performance for both types of TFT technologies.
With the prospect of atomic-scale computing, we study cumulative energy profiles of spin-spin interactions in nonferromagnetic lattices (Ising spin-glasses) - an established topic in solid-state physics that is now becoming relevant to atomic-scale EDA. Recent proposals suggest non-traditional computing devices based on nature's ability to find min-energy states. Spinto utilizes EDA-inspired high-performance algorithms to (i) simulate natural energy minimization in spin systems and (ii) study its potential for solving hard computational problems. Unlike previous work, our algorithms are not limited to planar Ising topologies. In one CPU-day, our branch-and-bound algorithm finds min-energy (ground) states on 100 spins, while our local search approximates ground states on 1,000,000 spins. We use this computational tool to study the significance of hyper-couplings in the context of recently implemented adiabatic quantum computers.
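The difference between finding and approximating ground states can be seen on a toy instance. The sketch below uses the standard Ising energy E = -Σ J_ij s_i s_j over pairwise couplings on an arbitrary (not necessarily planar) graph, a plain single-spin-flip descent, and exhaustive enumeration in place of branch-and-bound. It makes no claim to reproduce Spinto's algorithms, it omits hyper-couplings, and all function names are hypothetical.

```python
import itertools
import random


def build_neighbors(num_spins, couplings):
    """Adjacency lists of (neighbor, coupling) from a {(a, b): J} dictionary."""
    neighbors = [[] for _ in range(num_spins)]
    for (a, b), j in couplings.items():
        neighbors[a].append((b, j))
        neighbors[b].append((a, j))
    return neighbors


def energy(spins, couplings):
    """Ising energy E = -sum over coupled pairs of J_ab * s_a * s_b."""
    return -sum(j * spins[a] * spins[b] for (a, b), j in couplings.items())


def flip_delta(spins, neighbors, i):
    """Energy change caused by flipping spin i (negative means improvement)."""
    local_field = sum(j * spins[k] for k, j in neighbors[i])
    return 2 * spins[i] * local_field


def local_search(spins, couplings):
    """Flip single spins while any flip lowers the energy; returns a new list."""
    spins = list(spins)
    neighbors = build_neighbors(len(spins), couplings)
    improved = True
    while improved:
        improved = False
        for i in range(len(spins)):
            if flip_delta(spins, neighbors, i) < 0:
                spins[i] = -spins[i]
                improved = True
    return spins


def brute_force_ground_energy(num_spins, couplings):
    """Exact ground-state energy by enumeration; feasible only for tiny systems."""
    return min(energy(s, couplings)
               for s in itertools.product((-1, 1), repeat=num_spins))
```

The descent stops at a configuration that no single flip can improve, which may be a local rather than a global minimum; this is why local search only approximates ground states while exact methods are limited to far fewer spins.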
3D technology provides many benefits including high density, high bandwidth, low power, and small form factor. The Through Silicon Via (TSV), which provides communication links for dies in the vertical direction, is a critical design issue in 3D integration. Just like other components, the fabrication and bonding of TSVs can fail. A failed TSV may cause a number of known-good dies that are stacked together to be discarded. This can severely increase the cost and decrease the yield as the number of dies to be stacked increases. A redundant TSV architecture with reasonable cost for ASICs is proposed in this paper. Design issues including recovery rate and timing problems are addressed. Based on probabilistic models, some interesting findings are reported. First, the probability that three or more TSVs fail in a tier is less than 0.002%. The assumption that there are at most two failed TSVs in a tier is sufficient to cover 99.998% of all possible fault-free and faulty cases. Next, with one redundant TSV allocated to one TSV block, limiting the number of TSVs in each TSV block to be no greater than 50 and 25 leads to 90% and 95% recovery rates, respectively, when 2 failed TSVs are assumed. Finally, analysis on overall yield shows that the proposed design can successfully recover most of the failed chips and increase the yield of TSV bonding to 99.99%. This can effectively reduce the cost of manufacturing 3D ICs.
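The kind of probabilistic reasoning behind these findings can be sketched with two elementary formulas. The abstract does not give the per-TSV failure rate, the number of TSVs per tier, or the exact recovery condition, so the sketch assumes independent failures (a binomial model) and assumes that two failures are recoverable exactly when they fall in different blocks. The function names and the example parameter values are hypothetical.

```python
from math import comb


def prob_at_least(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p): at least k of n TSVs fail."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))


def prob_two_failures_recoverable(total_tsvs, block_size):
    """Chance that two distinct, uniformly chosen failed TSVs hit different blocks.

    Assumes each block has one redundant TSV and can therefore repair
    one failure, so a pair of failures in the same block is unrecoverable.
    """
    same_block = (block_size - 1) / (total_tsvs - 1)
    return 1.0 - same_block


# Hypothetical example: 1000 TSVs in a tier, each failing with probability 1e-5.
three_or_more = prob_at_least(1000, 1e-5, 3)
```

Under this model, smaller blocks make it less likely that two failures share a block, at the cost of more redundant TSVs, which is the trade-off the block-size limits express.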
The release of general-purpose GPU programming environments has given widespread access to computing performance that was once only available on supercomputers. The availability of such computational power has fostered the creation and re-deployment of algorithms, new and old, creating entirely new classes of applications. In this paper, a GPU implementation of the Center-Surround Distribution Distance (CSDD) algorithm for detecting features within images and video is presented. While an optimized CPU implementation requires anywhere from several seconds to tens of minutes to perform analysis of an image, the GPU-based approach has the potential to improve upon this by up to 28X, with no loss in accuracy.
Subdivision Surfaces provide a compact way to describe a smooth surface using a mesh model. They are widely used in 3D animation and nearly all modern modeling programs support them. In this work we describe a complete parallel pipeline for real-time interactive editing, processing and rendering of smooth surface primitives on the Cell BE. Our approach makes it possible to edit and render these high-order graphics primitives passing them directly to a parallel pipeline which tessellates them just before rendering. We describe a combination of algorithmic, architectural and back-end optimizations that enable us to render smooth subdivision surfaces in real time and to dynamically deform 3D models represented by subdivision surfaces.
Applications like 4G baseband modems require single-chip implementation to meet the integration and power consumption requirements. These applications demand a high computing performance with real-time constraints, low power consumption and low cost. With the rapid evolution of telecom standards and the increasing demand for multi-standard products, the need for flexible baseband solutions is growing. The concept of Multi-Processor System-on-Chip (MPSoC) is well adapted to enable hardware reuse between products and between multiple wireless standards in the same device. Heterogeneous architectures are well-known solutions but they have limited flexibility. Based on the experience of two heterogeneous Software Defined Radio (SDR) telecom chipsets, this paper presents the homoGENEous Processor arraY (GENEPY) platform for 4G applications. This platform is built with Smart ModEm Processors (SMEP) interconnected with a Network-on-Chip. The SMEP, implemented in 65nm low-power CMOS, can perform 3.2 GMAC/s with 77 GBits/s internal bandwidth at 400MHz. Two implementations of homogeneous GENEPY are compared to a heterogeneous platform in terms of silicon area, performance and power consumption. Results show that a homogeneous approach can be more efficient and flexible than a heterogeneous approach in the context of 4G Mobile Terminals.
We present Huckleberry, a tool for automatically generating parallel implementations for multi-core platforms from sequential recursive divide-and-conquer programs. The recursive programming model is a good match for parallel systems because it highlights the temporal and spatial locality of data use. Recursive algorithms are used by Huckleberry's code generator not only to automatically divide a problem up into smaller tasks, but also to derive lower-level parts of the implementation, such as data distribution and inter-core synchronization mechanisms. We apply Huckleberry to a multicore platform based on the Cell BE processor and show how it generates parallel code for a variety of sequential benchmarks.
Current multi-core design methodologies are facing increasing unpredictability in terms of quality due to the actual diversity of the workloads that characterize the deployment scenario. To cope with this, these systems expose a set of dynamic parameters which can be tuned at run-time to achieve a specified Quality of Service (QoS) in terms of performance. A run-time manager operating system module is in charge of matching the specified QoS with the available platform resources by manipulating the overall degree of task-level parallelism of each application as well as the frequency of operation of each of the system cores. In this paper, we introduce a design space exploration framework for enabling and supporting enhanced resource management through software re-configuration on an industrial multicore platform. At design time, the framework identifies a set of promising operating points which represent the optimal trade-off between the target power consumption and performance. After the system has been deployed, these operating points support an enhanced resource management policy. This is done by a light-weight resource management layer which filters and selects the optimal parallelism of each application and operating frequency of each core to achieve the QoS constraints imposed by the external world and/or the user. We show how the proposed design-time and run-time techniques can be used to optimally manage the resources of a multiple-stream MPEG4 encoding chip dedicated to automotive cognitive safety tasks.
This paper describes a multi-gigahertz test module that enhances the performance capabilities of automated test equipment (ATE), such as high-speed signal generation, loopback testing and jitter injection. The test module includes a core logic block consisting of a high-performance FPGA. It is designed to be compatible with existing ATE infrastructure, connecting to the device under test (DUT) via a device interface board (DIB). The core logic block controls the test module's functionality, thereby allowing it to operate independently of the ATE. Exploiting recent advances in FPGA SerDes, the test module is able to generate very high (multi-GHz) data rates at a relatively low cost. In this paper we demonstrate multiplexing logic to generate higher data rates (up to 10Gbps) and a low-jitter buffered loopback path to carry high speed signals from the DUT back to the DUT. The test module can generate 10Gbps signals with ~32ps (p-p) jitter, while the loopback path adds ~20ps (p-p) jitter to the input signal.
Keywords: Automated Test Equipment (ATE); built-in self test (BIST); Serializer/Deserializer (SerDes); Field Programmable Gate Array (FPGA); high-speed testing; multi-gigahertz testing; loopback testing; test modules; test enhancement.
Recent research has shown that different defects can manifest themselves as failures at different temperature spectra. Therefore, we need multi-temperature testing which applies tests at different temperature levels. In this paper, we discuss the need and problems for testing core-based systems-on-chip at different temperatures. To address the long test time problem for multi-temperature test, we propose a test scheduling technique that generates the shortest test schedules while keeping the cores under test within a temperature interval. Experimental results show the efficiency of the proposed technique.
Keywords: multi-temperature testing; system-on-chip test; test scheduling; thermal-aware test
Many systems are based on embedded microcontrollers. Applications demand production and power-on testing, including memory testing. Because low-end microcontrollers may not have memory BIST, the CPU will be the only resource available to perform at least the power-on tests. This paper shows the problems, solutions and limitations of CPU-based at-speed memory testing, illustrated with examples from the ATMEL RISC microcontroller.
Keywords: Memory testing, CPU-based memory testing, assembler language, ATMEL RISC microcontroller
The admission control problem is concerned with determining whether a new task may be accepted by a system consisting of a set of running tasks, such that the already admitted and the new task are all schedulable. Clearly, admission control decisions are to be taken on-line, and hence, this constitutes a general problem that arises in many real-time and embedded systems. As a result, there has always been a strong interest in developing efficient admission control algorithms for various setups. In this paper, we propose a novel constant-time admission control test for the Deadline Monotonic (DM) policy, i.e., the time taken by the test does not depend on the number of admitted tasks currently in the system. While it is possible to adapt known utilization bounds from the literature to derive constant-time admission control tests (e.g., the Liu and Layland bound, or the more recent hyperbolic bound), the test we propose is less pessimistic. We illustrate this analytically where possible and through a set of detailed experiments. Apart from the practical relevance of the proposed test in the specific context of DM tasks, the underlying technique is general enough and can possibly be extended to other scheduling policies as well.
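The utilization-bound baselines mentioned in the abstract above can be sketched as follows. This is not the test proposed in the paper, which the abstract does not detail. The sketch assumes independent periodic tasks with implicit deadlines (deadline equal to period), for which Deadline Monotonic and Rate Monotonic priorities coincide; the class and method names are hypothetical. Each test keeps a running aggregate, so a decision takes constant time regardless of how many tasks are already admitted.

```python
class LiuLaylandAdmission:
    """Admission test based on the Liu and Layland bound n(2^(1/n) - 1)."""

    def __init__(self):
        self.total_utilization = 0.0  # sum of U_i over admitted tasks
        self.count = 0                # number of admitted tasks

    def try_admit(self, wcet, period):
        n = self.count + 1
        candidate = self.total_utilization + wcet / period
        if candidate <= n * (2 ** (1.0 / n) - 1):
            self.total_utilization = candidate
            self.count = n
            return True
        return False


class HyperbolicAdmission:
    """Admission test based on the hyperbolic bound prod(U_i + 1) <= 2."""

    def __init__(self):
        self.product = 1.0  # running product of (U_i + 1) over admitted tasks

    def try_admit(self, wcet, period):
        candidate = self.product * (wcet / period + 1.0)
        if candidate <= 2.0:
            self.product = candidate
            return True
        return False


# Two tasks with utilizations 0.5 and 0.33 (illustrative values only).
tasks = [(1, 2), (33, 100)]
ll, hb = LiuLaylandAdmission(), HyperbolicAdmission()
ll_decisions = [ll.try_admit(c, t) for c, t in tasks]
hb_decisions = [hb.try_admit(c, t) for c, t in tasks]
```

With these illustrative values the second task is rejected by the Liu and Layland test (0.83 exceeds the two-task bound of about 0.828) but accepted by the hyperbolic test (1.5 x 1.33 = 1.995), which shows how the two sufficient tests differ in pessimism.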
In this paper we present a new technique which exploits timing-correlation between tasks for scheduling analysis in multiprocessor and distributed systems with non-preemptive scheduled resources. Previously developed techniques also allow capturing and exploiting timing-correlation in distributed systems. However, they focus on timing correlations resulting from data dependencies between tasks. The new technique presented in this paper is orthogonal to the existing ones and allows capturing timing-correlations between the output event streams of tasks resulting from the use of a non-preemptive scheduling policy on a resource. We also show how these timing-correlations can be exploited to calculate tighter bounds for the worst-case response time analysis for tasks activated by such correlated event streams.
Due to the increasing demand for reconfigurability in embedded systems, real-time task scheduling is challenged by non-negligible reconfiguration overheads. If such overheads are not considered, tasks may not be schedulable under the given deadlines, which degrades the quality of service. We introduce the problem of real-time periodic task scheduling under transition overhead on heterogeneous reconfigurable systems. We formulate the problem as a network flow problem and provide a mixed integer linear programming solution. We compare our proposed solution with optimal scheduling under maximum fixed transition overhead. We deployed our method on task scheduling for multiple communication protocols on reconfigurable FPGA-like systems. Results show that our proposed scheduling improves the task schedulability by 24.2% in comparison with non-preemptive EDF and by 17.5% in comparison with maximum-transition-overhead scheduling.
Keywords- Dynamically reconfigurable systems and Real-time task scheduling
With the advancement of the CMOS manufacturing process to the nano-scale, future microprocessors will be increasingly vulnerable to intermittent faults. Quantitatively characterizing the vulnerability of microprocessor structures to intermittent faults at an early design stage helps significantly in balancing system performance and reliability. Prior research has proposed several metrics to characterize the vulnerability of microprocessor structures to soft errors and permanent faults; however, the vulnerability of these structures to intermittent faults is still rarely considered. In this work, we propose a metric, the intermittent vulnerability factor (IVF), to characterize the vulnerability of microprocessor structures to intermittent faults. A structure's IVF is the probability that an intermittent fault in that structure causes an externally visible error. We instrument a cycle-accurate execution-driven simulator, Sim-Alpha, to compute IVFs for the reorder buffer and the register file. Experimental results show that the IVF of the reorder buffer is much higher than that of the register file. Moreover, IVF varies significantly across different structures and workloads, which suggests that partially protecting the most vulnerable structures can improve system reliability with less overhead.
Time-dependent performance degradation due to transistor aging caused by mechanisms such as Negative Bias Temperature Instability (NBTI) and Hot Carrier Injection (HCI) is one of the most important reliability concerns for deep nano-scale regime VLSI circuits. Hence, aging-resilient design methodologies are necessary to address this issue in order to improve reliability, preferably with minimal impact on area, power and performance. This work offers two major contributions to the aging-resilient circuit design methodology literature. First, it introduces a novel sensor circuit that can detect the aging of pipeline architectures by monitoring the arrival time of data signals at flip-flops. The area overhead of the proposed circuit is estimated to be less than 45%, compared to over 95% for previous approaches. To ensure the accuracy of its operation, a comprehensive timing analysis is performed on the proposed circuit, including the influence of process variations. As a second contribution, this work presents an innovative correction technique to reduce the probability of timing failures caused by aging. This method employs novel reconfigurable flip-flops, which operate as normal flip-flops as long as the circuit is fresh, but function as time-borrowing flip-flops once the circuit ages. This unique flip-flop design allows utilization of the advantages of the time-borrowing technique while avoiding potential race conditions that can be created by employing such a technique. It is shown via simulations that by employing the proposed design methodology, the probability of timing failures in the aged circuits can be reduced by as much as 10X for various benchmark circuits.
Keywords- Reliability; Fault-Tolerance; Aging; Timing Analysis; Negative/Positive Bias Temperature Instability (NBTI/PBTI); Process Variation; Diagnostics and Built-in tests.
The design of a digital system for energy efficiency often requires the analysis of circuit tradeoffs in addition to architectural tradeoffs. To assist with this analysis, we present a framework for performing joint exploration of both the architectural and circuit design spaces. In our approach, we use statistical inference techniques to create a model of a large microarchitectural design space from a small number of simulation samples. We then characterize the design tradeoffs of each of the underlying circuits and integrate these with the higher level architectural models to define the joint circuit-architecture design space. We use posynomial forms for all our models, enabling the use of convex optimization tools to efficiently search the joint design space. As an example, we apply this methodology to explore the power-performance tradeoffs in a dual-issue superscalar out-of-order processor, showing how the framework can be used to determine the optimal set of design parameters for energy efficiency. Compared to current architectural tools that use fixed circuit costs, joint optimization can reduce energy by up to 30% by considering circuit tradeoff characteristics.
Since the foundation of AUTomotive Open System ARchitecture (AUTOSAR), the AUTOSAR Core Partners and more than 65 Premium and Development Members have been working on the standardization of vehicles' software architecture. As a result of its joint development activities, AUTOSAR has already provided several Releases, which comprise a set of specifications describing software architecture components and defining their interfaces. With Release 2.1 and Release 3.0/3.1 the majority of partners and members started their series roll-out of AUTOSAR. When introducing the AUTOSAR standard in series products, dedicated migration scenarios need to be applied. BMW is migrating to AUTOSAR Basic Software in its current and upcoming product lines. This also includes common functionality that is today already realized as AUTOSAR-compliant extensions to the basic software. Furthermore, BMW's strategy for providing application software as ready-to-integrate AUTOSAR software components is described.
This paper will present how the new concepts of the AUTOSAR system methodology influence the SW-development tool-chain landscape.
Keywords: AUTOSAR methodology, Tool Chains, ARTOP
The development of complex control units requires mature and reliable basic software as well as integration support particularly in early phases of the project. In this presentation Elektrobit Automotive will focus on new AUTOSAR basic software features such as multi core and functional safety. We will show how integration and validation will be enhanced by diagnostic logging and tracing functionalities.
We formally define a high-level power-aware protocol model based on a Markov chain, and consider two aspects of power consumption: the general switching activity, and the cost of data transfers. A state-assignment algorithm is devised that results in a state encoding that is near the theoretical lower bound (in the protocols we have studied). We have analyzed a set of protocol "converters" that have been synthesised for the AMBA protocol, and compared the (high-level) predicted power consumption with the power actually used during (low-level) simulation of these converters. We observe high fidelity.
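The link between a Markov-chain protocol model and state-encoding switching activity can be illustrated with a small Python sketch. It is not the paper's state-assignment algorithm: it assumes that switching cost is the Hamming distance between consecutive state codes, weighted by how often each transition is taken, and it searches encodings by brute force, which is only practical for very small state sets. All function names and the example chain are hypothetical.

```python
from itertools import permutations


def hamming(a, b):
    """Number of bit positions in which two state codes differ."""
    return bin(a ^ b).count("1")


def expected_switching(stationary, transition, codes):
    """Expected number of state-register bit flips per protocol step."""
    n = len(codes)
    return sum(
        stationary[i] * transition[i][j] * hamming(codes[i], codes[j])
        for i in range(n)
        for j in range(n)
    )


def best_encoding(stationary, transition, bits):
    """Exhaustively search all assignments of distinct codes to states."""
    n = len(stationary)
    best = min(
        permutations(range(2 ** bits), n),
        key=lambda codes: expected_switching(stationary, transition, codes),
    )
    return best, expected_switching(stationary, transition, best)


# Illustrative four-state protocol that cycles 0 -> 1 -> 2 -> 3 -> 0.
transition = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
]
stationary = [0.25, 0.25, 0.25, 0.25]

binary_cost = expected_switching(stationary, transition, (0, 1, 2, 3))
gray_cost = expected_switching(stationary, transition, (0, 1, 3, 2))
best_codes, best_cost = best_encoding(stationary, transition, bits=2)
```

For this cyclic example a plain binary encoding flips 1.5 bits per step on average, while a Gray-code ordering flips exactly one bit per step, which is the lower bound because distinct codes differ in at least one bit.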
This paper presents a design space exploration of a selective load value prediction scheme suitable for energy-aware Simultaneous Multi-Threaded (SMT) architectures. A load value predictor is an architectural enhancement which speculates on the result of a microprocessor load instruction to speed up the execution of the following instructions. The proposed architectural enhancement differs from a classic predictor in its improved selection scheme, which activates the predictor only when a miss occurs in the first level of cache. We analyze the effectiveness of the selective predictor in terms of overall energy reduction and performance improvement. To this end, we show how the proposed predictor can produce benefits (in terms of overall cost) when the cache size of the SMT architecture is reduced, and we compare it with a classic non-selective load value prediction scheme. The experimental results have been gathered with a state-of-the-art SMT simulator running the SPEC2000 benchmark suite, both in SMT and non-SMT mode.
Scaling down in very deep submicron (VDSM) technologies increases the delay and power consumption of on-chip interconnects, while reliability and yield decrease. In high-performance integrated circuits wires become the performance bottleneck, and we are shifting towards communication-centric design paradigms. Networks-on-chip and stacked 3D integration are two emerging technologies that alleviate the performance difficulties of on-chip interconnects in nano-scale designs. In this paper we present a design-time configurable error correction scheme integrated at link level in the 3D Spidergon STNoC on-chip communication platform. The proposed scheme detects errors and selectively corrects them on the fly, depending on the critical nature of the transmitted information, thus making the correction software-controllable. Moreover, the proposed scheme can correct multiple error patterns by using interleaved single error correction codes, providing an increased level of reliability. The performance of the link and its cost in silicon and vertical wires are evaluated for various configurations.
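How interleaving lets single error correction codes handle multi-bit error patterns can be sketched in Python as below. The abstract does not say which code the link uses, so this sketch assumes a Hamming(7,4) code and a simple bit-level interleaver; the function names and the interleaving depth are hypothetical. With depth D, any burst of up to D adjacent bit errors on the link hits each codeword at most once, so every codeword stays within its single-error correction capability.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 1-based error position, 0 if clean
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def encode_interleaved(data, depth):
    """Encode depth*4 data bits and interleave the codewords bit by bit."""
    words = [hamming74_encode(data[4 * k:4 * k + 4]) for k in range(depth)]
    return [words[k][i] for i in range(7) for k in range(depth)]


def decode_interleaved(stream, depth):
    """De-interleave the link bits and decode each codeword independently."""
    data = []
    for k in range(depth):
        data.extend(hamming74_decode(stream[k::depth]))
    return data


data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
stream = encode_interleaved(data, depth=4)

# Inject a burst of four adjacent bit errors on the link.
corrupted = list(stream)
for i in range(5, 9):
    corrupted[i] ^= 1

recovered = decode_interleaved(corrupted, depth=4)
```

Here `recovered` equals `data` even though four neighbouring link bits were flipped, because the burst is spread across four different codewords.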
A framework is proposed to analyze circuit layout geometries to predict chip lifetime due to low-k time-dependent dielectric breakdown (TDDB). The methodology uses as inputs data from test structures, which have been designed and fabricated to detect the impact of area and metal linewidth on low-k TDDB.
Negative Bias Temperature Instability (NBTI) has become an important reliability concern for nano-scaled Complementary Metal Oxide Semiconductor (CMOS) devices. In this paper, we present an analysis of temperature impact on various sub-processes that contribute to NBTI degradation. We demonstrate our analysis on a 90nm industrial design operating in the temperature range 25-125°C. The key temperature impacts observed in our simulation are: (a) the threshold voltage increase in P-type Metal Oxide Semiconductor (PMOS) due to NBTI is very sensitive to temperature, and increases by 34% due to the temperature increment, (b) the hole mobility in the PMOS inversion layer reduces by 11% with the temperature increment, and (c) the temperature has a marginal impact on the transistor delay, which increases by 3% with the temperature increment.
Multicore environments are rapidly emerging and are widely used in SoC, but the accompanying parallel programming and debugging affect the ordinary sequential world. Unfortunately, according to Heisenberg's uncertainty principle, the instrument trying to probe the target will cause probe effects. Therefore, current intrusive debugging methodologies for sequential programs cannot be used directly in parallel programs in a multicore environment. This work developed a non-intrusive run-time assertion (RunAssert) for parallel program development based on a novel non-uniform debugging architecture. Our approaches are as follows: (a) a current language extension for parallel program debugging, (b) corresponding non-intrusive hardware configuration logic and checking methodologies, and (c) several real-world cases using the extensions mentioned above. In general, the target program can be executed at its original speed without altering the parallel sequences, thereby eliminating the possibility of probe effects. The net hardware cost is relatively low: the reconfigurable logic for RunAssert is 0.6%-2.5% in a NUDA cluster with 8 cores, such that RunAssert can readily scale up for increasingly complex multicore systems.
Keywords: Many-core, Debugging, Architecture, Race detection, debugging programming model
In conventional SoC designs, on-chip memories occupy more than 50% of the total die area. 3D technology enables the distribution of logic and memories on separate stacked dies (tiers). This allows redesigning the memory tier as a configurable product to be used in multiple system designs. Previously proposed dynamically reconfigurable solutions demonstrate a strong dependence between read latency and the dimensions of the mapped memory, leading to potential performance limitations. In this paper we propose a one-time configurable memory tier designed to minimize the performance overhead incurred by making the tier a commodity. Flexible configuration is enabled by a smart organization of memory macros and I/Os and by customizable redistribution layer routing. Compared with dynamic reconfigurability, the proposed design offers up to 40% faster access time, while saving more than 10% of energy per access. In addition, production cost trade-offs are analyzed.
In state of the art systems, workload scheduling and server fan speed operate independently leading to cooling inefficiencies. In this work we propose GentleCool, a proactive multi-tier approach for significantly lowering the fan cooling costs without compromising the performance. Our technique manages the fan speed through intelligently allocating the workload across different machines. The experimental results show our approach delivers average cooling energy savings of 72% and improves the mean time between failures (MTBF) of the fans by 2.3X compared to the state of the art.
Despite an increasing interest in digital sub-threshold circuits, little research has been dedicated to timing modeling in this voltage domain so far. In particular, high timing variability makes proper modeling necessary to allow for the prediction of timing behavior and timing yield on the path towards design automation. This paper first deals with gate timing characterization at sub-threshold voltages, and a characterization waveform that closely resembles the actual transistor-level waveforms in this voltage domain is proposed. The error made in this abstraction step is identified and shown to be typically below 3%. Secondly, the modeling of timing variability is considered, and the high correlation between gate delays due to slope propagation, combined with strong non-linearities in the delay-slope dependencies, is pointed out as a modeling challenge. A path-based logic-level Monte-Carlo technique, already orders of magnitude faster than transistor-level simulation, is applied and shown to match transistor-level Monte-Carlo simulation results to within 3% in mean and 7% in standard deviation values.
Index Terms - sub-threshold circuits, timing modeling, timing analysis, SSTA, characterization
Ambipolar devices have been reported in many technologies, including carbon nanotube field effect transistors (CNTFETs). The ambipolarity can be in-field controlled with a second gate, enabling the design of generalized logic gates with a high expressive power, i.e., the ability to implement more functions with fewer physical resources. Reported circuit design techniques using generalized logic gates show an improvement in terms of area and delay with respect to conventional CMOS circuits. In this paper, we characterize and study the power dissipation of generalized logic gates based on ambipolar CNTFETs. Our results show that the logic gates in the generalized CNTFET library dissipate 28% less power on average than a library of conventional CMOS gates. Further, we also perform logic synthesis and technology mapping, demonstrating that synthesized circuits mapped with the library of ambipolar logic gates dissipate 57% less power than CMOS circuits. By combining the benefits coming from the expressive power of generalized logic and from the CNTFET technology, we demonstrate that we can reduce the energy-delay-product by a factor of 20x using the ambipolar CNTFET technology.
We propose a novel synthesis technique for reversible logic based on ant colony optimization (ACO). In our ACO-based approach, reversible logic synthesis is formulated as a best-path search problem, where artificial ants, starting from their nest (reversible function output), attempt to find the best path to the food source (reversible function input). The experimental results have demonstrated superior performance in terms of both synthesis quality and computation time. They also show that the proposed method is scalable in handling large reversible functions.
FinFETs with channel surface along the <110> plane can be easily fabricated by rotating the fins by 45° from the <100> plane. By designing logic gates which have pFinFETs in the <110> plane and nFinFETs in the <100> plane, the gate delay can be reduced by as much as 14% compared to conventional <100> logic gates. The reduction in delay can be traded off for reduced power in FinFET circuits. In this paper, we propose a low-power FinFET-based circuit synthesis methodology based on surface orientation optimization. We study various logic design styles, which depend on different FinFET channel orientations, for synthesizing low-power circuits. We use BSIM, a process/physics based double-gate model in HSPICE, to derive accurate delay and power estimates. We design layouts of standard library cells containing FinFETs in different orientations to obtain an accurate area estimate for the low-power synthesized netlists after place-and-route. We use a linear programming based optimization methodology that gives power-optimized netlists, consisting of oriented gates, at tight delay constraints. Experimental results demonstrate the efficacy of our scheme.
In this paper, a new type of combinational logic circuit realization is presented. Logic values are implemented as sinusoidal signals. Sinusoidal signals of the same frequency are phase shifted by π to destructively interfere with each other, and represent the logic 0 and 1 values of Boolean logic. These properties of sinusoids can be used to identify a signal without ambiguity. Thus, representing logic values as sinusoidal signals yields a realizable system of logic. The paper presents a logic gate family that can operate using the sinusoidal signals for logic 0 and logic 1 values. Due to the orthogonality of sinusoidal signals with different frequencies, multiple sinusoids could be transmitted on a single wire. This provides a natural way of implementing multilevel logic. Signals traveling long distances could take advantage of this fact and share interconnect lines. Recent research in circuit design has made it possible to harvest sinusoidal signals of the same frequency and 180° phase difference from a single resonant clock ring in a distributed manner. Another advantage of such a logic family is its immunity to external additive noise. The experiments in this paper indicate that this paradigm, when used to implement binary valued logic, yields an improvement in switching (dynamic) power.
Face recognition techniques are used to quickly screen a huge number of persons in open environments without being intrusive, or to substitute ID cards in companies or research institutes. For several reasons, systems implementing these techniques must be reliable. This paper presents the design of a reliable face recognition system implemented on a Field Programmable Gate Array (FPGA). The proposed implementation uses the concepts of multiprocessor architecture, parallel software and dynamic reconfiguration to satisfy the requirement of a reliable system. The target multiprocessor architecture is extended to support the dynamic reconfiguration of the processing unit in order to tolerate processor faults. The experimental results show that, due to the multiprocessor architecture, the parallel face recognition algorithm can achieve a speed-up of 63% with respect to the sequential version. Results regarding the overhead of maintaining a reliable architecture are also shown.
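The signal properties this abstract relies on (phase-shifted sinusoids that cancel, and orthogonality between different frequencies) can be checked numerically with a short Python sketch. This models only the mathematics, not the paper's gate circuits. The assignment of phase 0 to logic 1 and phase π to logic 0, the sample count, the chosen frequencies and all function names are assumptions made for illustration.

```python
import math

SAMPLES = 1000  # samples over one observation window (assumed value)


def carrier(freq, bit):
    """Sinusoid with `freq` whole cycles per window; phase encodes the bit."""
    phase = 0.0 if bit else math.pi
    return [
        math.sin(2 * math.pi * freq * n / SAMPLES + phase)
        for n in range(SAMPLES)
    ]


def combine(*signals):
    """Superimpose several signals on one shared wire."""
    return [sum(values) for values in zip(*signals)]


def correlate(signal, freq):
    """Project the wire signal onto the reference sinusoid at `freq`."""
    return sum(
        s * math.sin(2 * math.pi * freq * n / SAMPLES)
        for n, s in enumerate(signal)
    ) / SAMPLES


def detect(signal, freq):
    """Recover the logic value carried at `freq`."""
    return 1 if correlate(signal, freq) > 0 else 0


# Two logic values share one wire on different frequencies.
wire = combine(carrier(3, 1), carrier(5, 0))
bit_at_3 = detect(wire, 3)
bit_at_5 = detect(wire, 5)

# Same frequency, opposite phase: the two signals cancel.
cancelled = combine(carrier(3, 1), carrier(3, 0))
residual = max(abs(v) for v in cancelled)
```

Because each frequency completes a whole number of cycles in the window, the correlation at one frequency is unaffected by the signal at the other, so both bits are recovered from the shared wire.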
Today we can identify a big gap between requirement specification and the generation of test environments. This article extends the Classification Tree Method for Embedded Systems (CTM/ES) to fill this gap with new concepts for the precise specification of stimuli for operational ranges of continuous control systems. It introduces novel means for continuous acceptance criteria definition and for functional coverage definition.
The paper describes a new approach of boundary scan emulation based testing for adaptive failure diagnostics using programmable logic. The motivation to speed up boundary scan based testing as well as the approach taken for this new concept and architecture are presented. With this approach the possibilities of boundary scan testing can be extended by using the available on-board resources for a faster and more real-time oriented test. The new options and benefits, as well as the necessary fundamentals of this approach are indicated. An example and first test results are given as well, to indicate the advantage of the proposed system.
Keywords: adaptive systems; automatic test equipment; boundary scan testing; field programmable gate arrays
As Electronic Control Units (ECUs) and embedded software functions within an automobile keep increasing in number, the scale and complexity of automotive embedded systems are growing at a very rapid pace. Hence, the automotive industry has been developing the Automotive Open System Architecture (AUTOSAR) to harness the reusability of common interfaces to communication buses, real-time operating systems and services. These common interfaces foster ease of adoption, interoperability, maintainability, predictability, and analyzability. However, realizing such standards also requires strong support from end-to-end design tool chains. In this paper, we describe some key analytical components that together characterize the end-to-end timing properties of hierarchical bus structures composed of FlexRay, CANbus and LINbus. Our analysis shows that the practical constraints imposed by standards such as AUTOSAR can lead to higher levels of schedulable resource utilization. This reduces both the overall component count and cost, while facilitating easy enhancements. Our analytical results show (a) how a schedulable utilization of 100% can be obtained for time-triggered FlexRay static segments under AUTOSAR compliance, (b) average-case schedulable utilization of 87% for the event-triggered CAN bus, and (c) similarities between LINbus and FlexRay analyses. We generalize the analytical results from different bus technologies by exploiting their common underlying structure to enable an integrated end-to-end timing analysis of hierarchical heterogeneous networks. These together yield an end-to-end framework to analyze heterogeneously networked AUTOSAR-compliant automotive systems.
Future microprocessors increasingly rely on an unreliable CMOS fabric due to aggressive scaling of voltage and frequency, and shrinking design margins. Fortunately, many emerging applications can tolerate computational errors caused by hardware unreliabilities, at least during certain execution intervals. In this paper, we propose scalable stochastic processors, a computing platform for error-tolerant applications that is able to scale gracefully according to performance demands and power constraints while producing outputs that are, in the worst case, stochastically correct. Scalability is achieved by exposing to the application layer multiple functional units that differ in their architecture but share functionality. A mobile video encoding application is thus able to achieve the lowest power consumption at any bitrate demand by dynamically switching between functional-unit architectures.
This work presents the Adaptive Vgs Multiplexer (AVGS-Mux) technique. The proposed method controls the transistor current by the source voltage. It can provide ±1.6X control on the delay and ±7X exponential control on sub-threshold and gate leakages in the switch-box, LUT, and interconnects. For equal leakage, it improves the speed by 9%, reduces dynamic power by 13%, and reduces the effect of random dopant fluctuations. AVGS-Mux is a good replacement for adaptive body biasing and adaptive supply voltage techniques in emerging multi-gate devices, which have a very small body effect and cannot tolerate voltages higher than nominal VDD due to reliability issues.
Keywords: FPGA fabric; inter-die process variation; low-power; leakage; source biasing; body biasing; adaptive supply voltage
Read and write assist techniques are now commonly used to lower the minimum operating voltage (Vmin) of an SRAM. In this paper, we review the efficacy of four leading write-assist (WA) techniques and their behavior at lower supply voltages in commercial SRAMs from 65nm, 45nm and 32nm low power technology nodes. In particular, the word-line boosting and negative bit-line WA techniques seem most promising at lower voltages. These two techniques help reduce the value of WLcrit by a factor of ~2.5X at 0.7V and also decrease the 3σ spread by ~3.3X, thus significantly reducing the impact of process variations. These write-assist techniques also impact the dynamic read noise margin (DRNM) of half-selected cells during the write operation. The negative bit-line WA technique has virtually no impact on the DRNM, but all other WA techniques degrade the DRNM by 10-15%. In conjunction with the benefit (decrease in WLcrit) and the negative impact (decrease in DRNM), the implementation overhead in terms of area and performance must be analyzed to choose the best write-assist technique for lowering the SRAM Vmin.
With the increasing demand for energy-efficient power delivery network (PDN) in today's electronic systems, configuring an optimal PDN that supports power management techniques, e.g., dynamic voltage scaling (DVS), has become a daunting, yet vital task. This paper describes how to model and configure such a PDN so as to minimize the total energy dissipation in DVS-enabled systems, while satisfying total PDN cost and/or power conversion efficiency constraints. The problem of configuring an energy-efficient PDN under various constraints is subsequently formulated by using a controllable Markovian decision process (MDP) model and solved optimally as a policy optimization problem. The key rationale for utilizing MDP for solving the PDN configuration problem is to manage stochastic behavior of the power mode transition times of DVS-enabled systems. Simulation results demonstrate that the proposed technique ensures energy savings, while satisfying design goals in terms of total PDN cost and its power efficiency.
Design-time application mapping is limited to a predefined set of applications and a static platform. Resource management at run-time is required to handle future changes in the application set, and to provide some degree of fault tolerance, due to imperfect production processes and wear of materials. This paper concerns resource allocation at run-time, allowing multiple real-time applications to run simultaneously on a heterogeneous MPSoC. Low-complexity algorithms are required, in order to respond fast enough to unpredictable execution requests. We present a decomposition of this problem into four phases. The allocation of tasks to specific locations in the platform is the main contribution of this work. Experiments on a real platform show the feasibility of this approach, with execution times in tens of milliseconds for a single allocation attempt.
The pipelined Multiprocessor System on Chip (MPSoC) paradigm is well suited to the data flow nature of streaming applications. A pipelined MPSoC is a system where processing elements (PEs) are connected in a pipeline. Each PE is implemented using one of a number of processor configurations (configurations differ by instruction sets and cache sizes) available for that PE. The goal is to select a pipelined MPSoC with a mapping of a processor configuration to every PE. To estimate the runtime of a pipelined MPSoC, designers typically perform cycle-accurate simulation of the whole pipelined system. Since the number of possible pipelined implementations can be in the order of billions, estimation methods are necessary. In this paper, we propose two methods to estimate the runtime of a pipelined MPSoC, minimizing the use of slow cycle-accurate simulations. The first method estimates the runtime of the pipelined MPSoC by performing cycle-accurate simulations of individual processor configurations (rather than the whole pipelined system), and then utilizing an analytical model to estimate the runtime of the pipelined system. In the second method, runtimes of individual processor configurations are estimated using an analytical processor model (which uses cycle-accurate simulations of selected configurations, and an equation based on ISA and cache statistics). These estimated runtimes of individual processor configurations are then used to estimate the total runtime of the pipelined system. By evaluating our approach on three benchmarks, we show that the maximum estimation error is 5.91% and 16.45%, with an average estimation error of 2.28% and 6.30% for the first and second method respectively. The time to simulate all the possible pipelined implementations (design points) using a cycle-accurate simulator is in the order of years, as design spaces with at least 10^10 design points are considered in this paper. However, the time to simulate all processor configurations individually (first method) is tens of hours, while the time to simulate a subset of processor configurations and estimate their runtimes (second method) is only a few hours. Once these simulations are done, the runtime of each pipelined implementation can be estimated within milliseconds.
Future embedded system products, e.g. smart handheld mobile terminals, will accommodate a large number of applications that will partly run sequentially and independently, partly concurrently and interacting on massively parallel computing platforms. Already for systems of moderate complexity, the design space will be huge and its exploration requires that the system architect is able to quickly evaluate the performance of candidate architectures and application mappings. The mainstream evaluation technique today is the system-level performance simulation of the applications and platforms using abstracted workload and processing capacity models, respectively. These virtual system models allow fast simulation of large systems at an early phase of development with reasonable modeling effort and time. The accuracy of the performance results is dependent on how closely the models used reflect the actual system. This paper presents a compiler based technique for automatic generation of workload models for performance simulation, while exploiting an overall approach and platform performance capacity models developed previously. The resulting workload models are evaluated using x264 video and JPEG encoding application examples.
Static and dynamic variations, which have a negative impact on the reliability of microelectronic systems, increase with smaller CMOS technology. Thus, further downscaling is only profitable if the costs in terms of area, energy and delay for reliability stay within limits. Therefore, the traditional worst case design methodology will become infeasible. Future architectures have to be error resilient, i.e., the hardware architecture has to autonomously tolerate transient errors. In this paper, we present an FPGA based rapid prototyping system for multi-processor systems-on-chip composed of autonomous hardware units for error-resilient processing and interconnect. This platform allows the fast architectural exploration of various error protection techniques under different failure rates on the microarchitectural level while keeping track of the system behavior. We demonstrate its applicability on a concrete wireless communication system.
The set of applications communicating via a Network-on-Chip (NoC) and the NoC itself both have varying run-time requirements on reliability and power-efficiency. To meet these requirements, we propose a novel Power-aware and Reliable Encoding Schemes Supported reconfigurable Network-on-Chip (PRESSNoC) architecture which allows processing elements, routers, and data encoding methods to be reconfigured at run-time. Further, an intelligent selection of encoding methods is achieved through a REasoning And Learning (REAL) framework at run-time. An instance of PRESSNoC was implemented on a Xilinx Virtex 4 FPGA device, which required 25.5% fewer slices than a conventional NoC with a full-fledged encoding method. The average benefit to overhead ratio of the proposed architecture is greater than that of a conventional NoC by 71%, 32%, and 277% when we consider the individual effects of interference rate per instruction, application domains, and system characteristics, respectively. Experiments have thus shown that PRESSNoC induces a higher probability toward the reduction of crosstalk interferences and dynamic power consumption, at the same amount of overheads in performance and hardware usage.
Heterogeneous reconfigurable processing architectures are often limited by the speed at which they can access data in external memory. Such architectures are designed for flexibility to support a broad range of target applications, including advanced algorithms with significant processing and data requirements. Clearly, strong performance of applications in this category is an extremely relevant metric for demonstrating the full performance potential of heterogeneous computing platforms. One such example, a film grain noise reduction application for high-definition video, which is composed of multiple image processing tasks, requires enormous data rates due to its large input image size and real-time processing constraints. This application is especially representative of highly parallel, heterogeneous, data-intensive programs that can properly exploit the advantages offered by computing platforms with multiple heterogeneous reconfigurable processing elements. To accomplish this task and meet the above requirements, a bandwidth-optimized external memory controller has been designed for use with a heterogeneous reconfigurable architecture and its NoC interconnect. With the help of the application described above, this paper evaluates the proposed architecture in two forms: (1) with a basic memory controller IP and (2) with the advanced memory controller design. The results illustrate the full potential of the computing platform as well as the power of heterogeneous reconfigurable computing combined with high-speed access to large external memories.
Motion Estimation (ME) is the most computationally intensive part of video compression and video enhancement systems. One bit transform (1BT) based ME algorithms have low computational complexity. Therefore, in this paper, we propose a high performance reconfigurable hardware architecture for 1BT based multiple reference frame (MRF) ME. The proposed ME hardware architecture performs full search ME for 4 Macroblocks and 4 reference frames in parallel. The proposed hardware is faster than the 1BT based ME hardware reported in the literature even though it is capable of searching in 4 reference frames. MRF ME increases the ME performance at the expense of increased computational complexity. The reconfigurability of the proposed ME hardware is used to statically configure the number and selection of reference frames based on the application requirements in order to trade off ME performance and computational complexity. The proposed hardware architecture is implemented in Verilog HDL. The MRF ME hardware consumes 65% of the slices in a Xilinx XC2VP30-7 FPGA. It can work at 191 MHz in the same FPGA and is capable of processing 83 1920x1080 full High Definition frames per second.
Keywords-Motion Estimation, One Bit Transform, Multiple Reference Frame, Hardware Implementation, FPGA.
Deep Packet Inspection (DPI) involves searching a packet's header and payload against thousands of rules to detect possible attacks. The increase in Internet usage and growing number of attacks which must be searched for has meant hardware acceleration has become essential in the prevention of DPI becoming a bottleneck to a network if used on an edge or core router. In this paper we present a new multi-pattern matching algorithm which can search for the fixed strings contained within these rules at a guaranteed rate of one character per cycle independent of the number of strings or their length. Our algorithm is based on the Aho-Corasick string matching algorithm with our modifications resulting in a memory reduction of over 98% on the strings tested from the Snort ruleset. This allows the search structures needed for matching thousands of strings to be small enough to fit in the on-chip memory of an FPGA. Combined with a simple architecture for hardware, this leads to high throughput and low power consumption. Our hardware implementation uses multiple string matching engines working in parallel to search through packets. It can achieve a throughput of over 40 Gbps (OC-768) when implemented on a Stratix 3 FPGA and over 10 Gbps (OC-192) when implemented on the lower power Cyclone 3 FPGA.
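For orientation, the textbook Aho-Corasick automaton that the abstract above builds on can be sketched as follows. This is a minimal software illustration only: it does not include the authors' memory-reduction modifications or their hardware architecture, and the function names build_automaton and search are hypothetical. Because this sketch follows failure links in a loop, it also does not reproduce the fixed one-character-per-cycle guarantee described in the abstract.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto/fail/output tables for a set of fixed strings."""
    goto = [{}]   # goto[state][char] -> next state (state 0 is the root)
    fail = [0]    # failure link for each state
    out = [[]]    # patterns recognised on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # inherit the longest proper suffix that is also a prefix in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def search(text, automaton):
    """Return (start_index, pattern) for every match in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

A call such as search("ushers", build_automaton(["he", "she", "his", "hers"])) reports all overlapping matches in a single pass over the text, which is the property that makes the algorithm attractive for rule sets with thousands of strings.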
As new protein sequences are discovered on an everyday basis and protein databases continue to grow exponentially with time, computational tools take more and more time to search protein databases to discover the common ancestors of the sequences. HMMER is among the most used tools in protein search and comparison and multiple efforts have been made to accelerate its execution by using dedicated hardware prototyped on FPGAs. In this paper we introduce a novel algorithm called the Divergence Algorithm, which not only enables the FPGA accelerator to reduce execution time, but also enables further acceleration of the alignment generation algorithm of the HMMER programs by reducing the number of cells of the Dynamic Programming matrices it has to calculate. We also propose a more accurate performance measurement strategy that considers all the execution times while doing protein searches and alignments, while other works only consider hardware execution times and do not include alignment generation times. Using our proposed hardware accelerator and the Divergence Algorithm, we were able to achieve gains up to 182x when compared to the unaccelerated HMMER software running on a general purpose CPU.
Keywords - Bioinformatics; Hidden Markov Models; HMMER; Hardware accelerator; FPGA.
Due to fast technology scaling, negative bias temperature instability (NBTI) has become a major reliability concern in designing modern integrated circuits. In this paper, we present a simple and proactive NBTI recovery scheme targeting critical and busy functional units with storage cells in modern microprocessors. Existing schemes have limitations when recovering these functional units. By exploiting the idle time of busy functional units at per-buffer-entry level, our scheme achieves on average 5.57x MTTF (Mean Time To Failure) improvement at the cost of <1% IPC degradation and <1% area overhead.
With every process generation, the problem of variability in physical parameters and environmental conditions poses a great challenge to the design of fast and reliable circuits. Propagation delays, which decide circuit performance, are likely to suffer the most from this phenomenon. While statistical static timing analysis (SSTA) is used extensively for this purpose, it does not account for dynamic conditions during operation. In this paper, we present a multivariate regression based technique that computes the propagation delay of circuits subject to manufacturing process variations in the presence of temporal variations like temperature. It can be used to predict the dynamic behavior of circuits under changing operating conditions. The median error between the proposed model and circuit-level simulations is below 5%. With this model, we ran a study of the effect of temperature on access time delays for 500 cache samples. The study ran in 0.557 seconds, compared to 20 h 4 min for the SPICE simulation, a speedup of over 1×10^5. As a case study, we show that the access times of caches can vary as much as 2.03X at high temperatures in future technologies under process variations.
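The speedup quoted in the abstract above can be checked from the two run times it reports. The short calculation below uses only those two figures; the variable names are illustrative.

```python
# Run times as reported in the abstract.
spice_seconds = 20 * 3600 + 4 * 60   # 20 h 4 min of SPICE simulation
model_seconds = 0.557                # regression-model study

speedup = spice_seconds / model_seconds
print(f"speedup = {speedup:.3e}")    # roughly 1.3e5, consistent with "over 1x10^5"
```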
With aggressive gate oxide scaling, latent defects in the gate oxide manifest as traps that, in time, lead to gate oxide breakdown. Progressive gate oxide breakdown, also referred to as time-dependent dielectric breakdown (TDDB), is emerging as one of the most important sources of performance degradation in nanoscale CMOS devices. This paper describes an accurate analytical model to predict the delay of combinational logic gates subject to TDDB. The analytical model can be seamlessly integrated into a static timing analysis tool to analyze TDDB effects in large combinational logic circuits across a range of supply voltages and severity of oxide breakdown. Simulation results for an early version of an industrial 32nm library show that the model is accurate to within 3% of SPICE with orders of magnitude improvement in runtime.
The main contribution of this work is a static and dynamic enhancement of bit-cell stability for low-power SRAM in nanometer technologies. We consider a wide layout topology without bends in diffusion layers for the nanometer SRAM cell design to minimize the impact of process variations. The design restrictions imposed by such a nanometer SRAM cell design prevent the application of traditional read SNM improvement techniques. We use the SNM as a measure of the cell stability during read operations, and Qcrit to quantify the robustness against SEE during hold mode. The proposed techniques have a low impact on read time and leakage current while significantly improving the SNM. Moreover, the word-line modulation technique has no impact on strategic cell parameters like area and leakage when in hold mode. Results obtained from both a commercial 65nm CMOS technology and a 45nm BPTM technology are provided.
Keywords-Nanometre SRAM; Critical Charge; Static Noise Margin.
Embedded software development has become one of the greatest challenges in the automotive domain, due to the rising complexity of vehicle systems. A method to handle the complexity of automotive software is Model Based Design (MBD). As MBD offers great advantages in early simulation and testing, it has become today's mainstream method for automotive software engineering. However, some aspects can be tested only after the integration of software on real hardware components (usually by the supplier), when all parts of a system (e.g. bus systems, sensors, actuators) are present. The consequence is that the requirement specification of the corresponding system may contain gaps that can lead to software defects. New technologies like the AUTOSAR standard open up additional potential for the validation of model based developed software. Thanks to the AUTOSAR software architecture, an OEM can realize an early "virtual" software integration with acceptable effort and, as a next step, front-load system tests. In this paper we present an approach that improves the quality of the requirement specification artifacts by using test front loading. In detail, we analyze the requirement engineering part of the software development process to identify aspects that cannot be tested without having all system components. Afterwards, we classify these aspects and define an abstract test pattern that can be globally used for testing. Additionally, we illustrate our approach in a case study on an interior light system for the next Mercedes-Benz M-Class generation.
Keywords-model based design; validation and test; AUTOSAR; virtual integration; front loading
The SPEEDS project is aimed at making rich components models (RCM) into a mature framework in all phases of the design of complex distributed embedded systems. The RCM model is required to be expressive enough to cover the entire development process from requirements to code through design, and also capture both functional and non-functional aspects. In this paper we propose a language-based framework for real-time component interfaces in SPEEDS that is suitable at the ECU layer when a target processor has been identified, and WCET analysis done. We assume a discrete time model.
We propose an automated, tool-supported approach to scenario-based analysis and synthesis of real-time embedded systems. The inter-object behaviors of a system are modeled as a set of live sequence charts (LSCs), and the scenario-based user requirement is specified as a separate LSC. By translating the set of LSC charts into a behavior-equivalent network of timed automata (TA), we reduce the problems of model consistency checking and property verification to classical CTL real-time model checking problems, and reduce the problem of centralized synthesis for open systems to a timed game solving problem. We implement a prototype LSC-to-TA translator, which can be linked to existing real-time model checker UPPAAL and timed game solver UPPAAL-TIGA. Preliminary experiments on a number of examples show that it is a viable approach.
In this paper we present a stochastic model order reduction technique for interconnect extraction in the presence of process variabilities, i.e. variation-aware extraction. It is becoming increasingly evident that sampling based methods for variation-aware extraction are more efficient than more computationally complex techniques such as the stochastic Galerkin method or the Neumann expansion. However, one of the remaining computational challenges of sampling based methods is how to simultaneously and efficiently solve the large number of linear systems corresponding to each different sample point. In this paper, we present a stochastic model reduction technique that exploits the similarity among the different solves to reduce the computational complexity of subsequent solves. We first suggest how to build a projection matrix such that the statistical moments and/or the coefficients of the projection of the stochastic vector on some orthogonal polynomials are preserved. We further introduce a proximity measure, which we use to determine a priori if a given system needs to be solved, or if it is instead properly represented using the currently available basis. Finally, in order to reduce the time required for the system assembly, we use the multivariate Hermite expansion to represent the system matrix. We verify our method by solving a variety of variation-aware capacitance extraction problems ranging from on-chip capacitance extraction in the presence of width and thickness variations, to off-chip capacitance extraction in the presence of surface roughness. We further solve very large scale problems that cannot be handled by any other state of the art technique.
We present an efficient and highly accurate approach to high-frequency impedance extraction for VLSI interconnects and intentional on-chip inductors. The approach is based on a three-dimensional (3D) loop formalism that uses discrete complex images approximations applied to a quasi-magnetostatic treatment of the vector potential, resulting in closed-form expressions for the impedance matrix of current filaments in the presence of a multi-layer substrate. Populating the impedance (Z) matrix for 3D configurations of finite transverse dimensions (including non-Manhattan wires and inductors) is computationally inexpensive, and includes substrate eddy current effects that become quantitatively important in the frequency regime beyond 20 GHz which is imminent at the 45 nm technology node onwards. The accuracy, as exemplified by the magnitude of inductor impedance |Z|, is within 5% of a full-wave electromagnetic field solver for frequencies up to 100 GHz, with an order of magnitude lower computation cost. The proposed method represents a core technology for incorporation into system level extraction of analog systems consisting of multiple inductors and nearby interconnects, for CMOS on-chip circuits in the nanometer era.
This paper describes a Model Order Reduction algorithm for multi-dimensional parameterized systems, based on a sampling procedure which incorporates a low order moment matching paradigm into a multi-point based methodology. The procedure seeks to maximize the subspace generated by a given number of samples, selected among an initial candidate set. The selection is based on a global criterion that chooses the sample whose associated vector adds the most information to the existing subspace. However, the initial candidate set can be extremely large for high-dimensional systems, and thus the procedure can be costly. To improve efficiency we propose a scheme to incorporate information from low order moments into the basis at small extra cost, in order to extend the approximation to a wider region around the selected point. This will allow reduction of the initial candidate set without decreasing the level of confidence. We further improve the procedure by generating the global subspace based on the composition of local approximations. To achieve this, the initial candidates will be split into subsets that will be considered as independent regions, and in a first phase the procedure is applied locally, thus enabling improved efficiency and providing a framework for almost perfect parallelization.
The super node algorithm performs model order reduction based on physical principles. Although the algorithm provides us with compact models, its passivity has not yet been thoroughly studied. The loss of passivity is a serious problem because simulations of the reduced network may exhibit artificial behavior which renders the simulations useless. In this paper we identify why the algorithm delivers non-passive models and propose a modified version of the algorithm which guarantees passivity. This is done by applying a passivity enforcement procedure after the frequency fitting step within the algorithm. This preserves passivity while keeping the main advantages of the algorithm. Finally, numerical examples validate the proposed approach.
Index Terms - electromagnetic modeling, interconnect system, model order reduction, passivity.
Innovations in micro and nano technology form the basis of modern ICT. However, the steady growth in the ICT sector now has a significant ecological footprint: 2% of global CO2 emissions are due to ICT systems already today - one fourth of the emissions caused by cars. The energy costs for running ICT infrastructure have turned into a significant economic factor. The most urgent challenge in the area of micro and nanotechnology is therefore to massively increase energy efficiency, in particular for ICT as a key sector for economic growth. Significant improvements in this area can only be achieved through disruptive innovations and new system approaches, which rely on a combination of excellent research & development and world leading know-how of semiconductor production. But will hardware be the driver to fulfill these requirements, with software having to adapt to whatever hardware concepts are developed? Or should the ability to program systems energy efficiently define the design of the hardware architecture? This session will present the different perspectives on the problem and try to bring both sides together.
The paper tackles the problem of property qualification, focusing in particular on the identification of vacuous properties. It proposes a methodology based on a combination of dynamic and static techniques that, given a set of properties defined to check the correctness of a design implementation, performs vacuity detection. Existing approaches for vacuity checking are as complex as model checking, and they require further properties to be defined and model checked, thus increasing the verification time. Moreover, for some formulae they fail to detect vacuity, for example in the case of a tautology. These problems are overcome by our approach. It is based on mutation analysis and thus does not require the definition of new properties, which speeds up the vacuity analysis process. Moreover, it provides highly accurate vacuity alerts which also capture propositional and temporal tautologies.
Index Terms - vacuity analysis, mutation analysis, simulation, model checking
In order to combine the power of simulation-based and formal techniques, semi-formal methods have been widely explored. Among these methods, abstraction-guided simulation is a quite promising one. In this paper, we propose an abstraction-guided simulation approach aiming to cover hard-to-reach states in functional verification of microprocessors. A Markov model is constructed utilizing the high level functional specification, i.e. the ISA. This model integrates vector correlations. Furthermore, several strategies utilizing abstraction information are proposed as an effective guidance to the test generation. Experimental results on two complex microprocessors show that our approach is more efficient in covering hard-to-reach states than similar methods. Compared with work based on other intelligent engines, our approach can guarantee a higher hit ratio of target states without efficiency loss.
Model checking techniques are promising for automated generation of directed tests. However, due to the prohibitively large time and resource requirements, conventional model checking techniques do not scale well when checking complex designs. In SAT-based BMC, many variable ordering heuristics have been investigated to improve counterexample (test) generation involving only one property. This paper presents efficient decision ordering techniques that can improve the overall test generation time of a cluster of similar properties. Our method exploits the assignments of previously generated tests and incorporates it in the decision ordering heuristic for current test generation. Our experimental results using both software and hardware benchmarks demonstrate that our approach can drastically reduce the overall test generation time.
Increasing the speed of cache simulation to obtain hit/miss rates enables performance estimation, cache exploration for embedded systems and energy estimation. Previously, such simulations, particularly exact approaches, have been exclusively for caches which utilize the least recently used (LRU) replacement policy. In this paper, we propose a new, fast and exact cache simulation method for the First In First Out (FIFO) replacement policy. This method, called DEW, is able to simulate multiple level 1 cache configurations (different set sizes, associativities, and block sizes) with FIFO replacement policy. DEW utilizes a binomial tree based representation of cache configurations and a novel searching method to speed up simulation over single cache simulators like Dinero IV. Depending on different cache block sizes and benchmark applications, DEW operates around 8 to 40 times faster than Dinero IV. Dinero IV compares 2.17 to 19.42 times more cache ways than DEW to determine accurate miss rates.
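To make the FIFO replacement policy in the abstract above concrete, the sketch below simulates a single set-associative cache configuration on an address trace. It is a naive one-configuration-at-a-time baseline, not the DEW method: the binomial tree representation and the search technique are not reproduced here, and the function name simulate_fifo and its parameters are hypothetical.

```python
from collections import deque

def simulate_fifo(addresses, num_sets, assoc, block_size):
    """Return (hits, misses) for one set-associative cache with FIFO replacement."""
    sets = [deque() for _ in range(num_sets)]   # each deque holds tags, oldest first
    hits = misses = 0
    for addr in addresses:
        block = addr // block_size
        index = block % num_sets
        tag = block // num_sets
        ways = sets[index]
        if tag in ways:
            hits += 1            # FIFO: a hit does not change the eviction order
        else:
            misses += 1
            if len(ways) == assoc:
                ways.popleft()   # evict the block that was inserted first
            ways.append(tag)
    return hits, misses

# Exploring several configurations with this baseline means one full pass
# over the trace per configuration.
trace = [0, 1, 0, 2, 0]
for assoc in (1, 2, 4):
    print(assoc, simulate_fifo(trace, num_sets=1, assoc=assoc, block_size=1))
```

With one set and two ways, the trace 0, 1, 0, 2, 0 gives one hit under FIFO, because the hit on block 0 does not protect it from being evicted when block 2 arrives; an LRU cache of the same size would hit twice. This difference is why results from LRU simulators cannot simply be reused for FIFO caches.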
Flash memory is widely used in consumer electronics products, such as cell-phones and music players, and is increasingly displacing hard disk drives as the primary storage device in laptops, desktops, and even servers. There is a rich microarchitectural design space for flash memory and there are several architectural options for incorporating flash into the memory hierarchy. Exploring this design space requires detailed insights into the power characteristics of flash memory. In this paper, we present FlashPower, a detailed analytical power model for Single-Level Cell (SLC) based NAND flash memory, which is used in high-performance flash products. We have integrated FlashPower with CACTI 5.3, which is widely used in the architecture community for studying memory organizations. FlashPower takes as input device technology and microarchitectural parameters to estimate the power consumed by a flash chip during its various operating modes. We have validated FlashPower against published chip power measurements and show that they are comparable.
A new sub-space max-monomial modeling scheme for CMOS transistors in sub-micron technologies is proposed to improve the modeling accuracy. Major electrical parameters of CMOS transistors in each sub-space of the design space are modeled with max-monomials. This approach is demonstrated to have better accuracy for sub-micron technologies than single-space models. Sub-space modeling based geometric programming (GP) power optimization has been successfully applied to three different op-amps in 0.18μm technology. HSPICE simulation results show that sub-space modeling based GP optimization allows efficient and accurate analog design. Computational effort can be kept at an acceptable level when searching sub-spaces for transistors by using practical constraints. An efficient scheme for dealing with the non-convex constraint inherent in Kirchhoff's voltage law is suggested in this paper. By using this scheme, a non-convex constraint, such as a posynomial equality, can be relaxed to a convex constraint without affecting the result.
Keywords-power optimization; CMOS op-amps; geometric programming; monomial; posynomial
Structured ASIC has been introduced to bridge the power, performance, area and design cost gaps between ASIC and FPGA. As technology scales, leakage power consumption becomes a serious problem. Among the leakage power reduction techniques, power gating is commonly used to disconnect idle logic blocks from power network to curtail sub-threshold leakage. In this paper, we apply power gating to structured ASICs for leakage power reduction. We present a power-gated via-configurable logic block (PGVCLB) and a power gated design flow mostly using existing standard cell design tools. We can configure PGVCLBs in a design to implement fine-grained power gating, coarse-grained/cluster-based power gating or even distributed sleep transistor network (DSTN). With fine-grained power gating, we can achieve 52% leakage reduction on average with only 8% area and 17% delay overheads when compared to the data obtained using a non-power-gated library.
Keywords: power-gating, low power, via-configurable, structured ASIC.
Dual-Vth technique is a mature and effective method for reducing leakage power consumption. Previously proposed algorithms assign logic gates with sufficient timing slack to high threshold voltage to reduce leakage power without impact on timing. Meanwhile, clock skew scheduling algorithms are always utilized to optimize period or timing slack. In order to further reduce subthreshold leakage power consumption, in this paper, we ingeniously combine dual voltage assignment technique with intended clock skew scheduling: First, a leakage weight based clock skew scheduling algorithm is proposed to enlarge the leakage power optimization potential. Then we employ a dual-threshold voltage assignment algorithm to minimize leakage power. The experimental results on ISCAS89 benchmark circuits show that, within only several seconds, the leakage power can be further reduced by as much as 41.30% and by 9.87% on average with this new approach, compared to using the traditional method without considering clock skews. Three timing optimized industrial circuit blocks, among which each has around one hundred thousand gates, have also been optimized. It is shown that an average leakage power reduction of 9.95% can be achieved within minutes compared with traditional techniques.
Keywords- low power, leakage, dual-threshold, clock skew
This paper presents an innovative and effective approach to design and test a regulator for an automotive alternator with programmable functionalities. The prototype system consists of two different parts: an integrated circuit (IC) and an FPGA. The IC, implemented in austriamicrosystems HVCMOS 0.35 μm technology, includes all the high voltage parts and a power switch with very low ON resistance. It is able to manage full reverse polarity on every pin, including the reverse battery condition, and over voltages up to 50 V. The programmability is guaranteed through the FPGA. This prototype system can be used to develop a new type of intelligent, flexible regulator that implements many additional programmable functions, giving the car maker better control and helping to reduce vehicle fuel consumption and CO2 emissions. In the demonstrator all the implemented functions, including regulation, can be changed during the development phase and many properties, including loop stability, can be checked before releasing a final version of the regulator. The proposed system is also included in a standard brush-holder that can be mounted on a Valeo Engine and Electrical System mechatronic alternator and verified directly in a real application.
Keywords: regulator for automotive alternators; programmable regulator; reverse polarity; HV CMOS voltage regulator
This paper describes the design of an automotive traffic sign recognition application. All stages of the design process, starting on system-level with an abstract, pure functional model down to final hardware/software implementations on an FPGA, are shown. The proposed design flow tackles existing bottlenecks of today's system-level design processes, following an early model-based performance evaluation and analysis strategy, which takes into account hardware, software and real-time operating system aspects. The experiments with the traffic sign recognition application show that the developed mechanisms are able to identify appropriate system configurations and to provide a seamless link into the underlying implementation flows.
Design and specification errors are hard to find in the traditional automotive system design flow. Consequently, these errors may be detected very late e.g. in a hardware prototype or even worse in the final product. In order to allow the verification of distributed embedded systems in early design phases, this work proposes a flexible and efficient virtual prototyping approach in order to check the consistency of system specifications. Our virtual prototyping approach has been applied to the Media Oriented Systems Transport (MOST) specification revision 3.0 and verifies the influence of two newly specified algorithms, namely Ring Break Diagnosis and Sudden Signal Off detection, with respect to numerous network configurations. In total we have verified the specification using more than 105 automatically generated network configurations. The overall costs for network modelling and verification compared to cost-expensive error detection and correction at later design phases have been significantly reduced.
Automotive network technologies such as FlexRay present a cost-optimized structure in order to tailor the system to the required functionalities and to the environment. The space exploration for optimization of single components (cable, transceiver, communication controller, middleware, application) as well as the integration of these components (e.g. selection of the topology) are complex activities that can be efficiently supported by means of simulation. The main challenge while simulating communication architectures is to efficiently integrate the heterogeneous models in order to obtain accurate results for a relevant operation time of the system. In this work, a run-time model switching method is introduced for the holistic simulation of FlexRay networks. Based on a complete modeling of the main network components, the simulation performance increase is analyzed and the new test and diagnosis possibilities resulting from this holistic approach are discussed.
In the current environment of rapidly changing in-vehicle requirements and ever-increasing functional content for automotive EE systems, there are several sources of uncertainty in the definition of EE architecture design. This is also true for communication schedule synthesis where key decisions are taken early because of interactions with the suppliers. The possibility of change necessitates a design process that can analyze schedules for robustness to uncertainties, e.g., changes in estimated task durations or communication load. A robust design would be able to accommodate these changes incrementally without changes in the system scheduling, thus reducing validation times and increasing reusability. This paper introduces a novel approach based on the info-gap decision theory that provides a systematic scheme for analyzing robustness of schedules by computing the greatest horizon of uncertainty that still satisfies the performance requirements. The paper formulates info-gap models for potential uncertainties in schedule synthesis for a distributed automotive system communicating over a FlexRay network, and shows their application to a case study.
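The notion of "greatest horizon of uncertainty" in the abstract above can be illustrated with a small sketch. It assumes a deliberately simple, hypothetical uncertainty model in which every task duration may grow by a fraction h of its nominal value, and a single end-to-end deadline as the performance requirement; the paper's own info-gap models for FlexRay schedules are not reproduced here.

```python
def worst_case_latency(nominal, h):
    """Worst-case end-to-end latency when every duration grows by a fraction h."""
    return sum(d * (1.0 + h) for d in nominal)

def robustness(nominal, deadline, h_max=10.0, tol=1e-9):
    """Largest h (up to h_max) for which the worst case still meets the deadline.

    Uses bisection, which is valid because the worst-case latency grows
    monotonically with h. Returns 0.0 if even the nominal schedule fails.
    """
    if worst_case_latency(nominal, 0.0) > deadline:
        return 0.0
    lo, hi = 0.0, h_max
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if worst_case_latency(nominal, mid) <= deadline:
            lo = mid  # requirement still satisfied, widen the horizon
        else:
            hi = mid  # requirement violated, shrink the horizon
    return lo

# Hypothetical task chain (durations in ms) with a 12 ms deadline.
h_hat = robustness([2.0, 3.0, 5.0], 12.0)
```

For this toy chain the nominal latency is 10 ms, so durations may grow by about 20% before the 12 ms deadline is violated.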
Adaptive testing is a generic term for a number of techniques which aim at improving the test quality and/or reducing the test application costs. In adaptive tests, the test content or pass/fail limits are not fixed as in conventional tests, but dependent on other test results of the currently or previously tested chips. Part-average testing, outlier detection, and neighborhood screening are just a few examples of adaptive testing. With this Embedded Tutorial, we are offering an introduction to this topic, which is hot in the test community, to the wider DATE audience.
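As one hedged illustration of a pass/fail limit that depends on other test results, the sketch below derives part-average limits from the measured population itself (mean plus or minus k standard deviations) and flags parts that pass the fixed specification limits but are statistical outliers. The choice of k, the readings and the function names are assumptions made for illustration only.

```python
from statistics import mean, stdev

def part_average_limits(readings, k=6.0):
    """Return (low, high) limits computed from the tested population."""
    mu = mean(readings)
    sigma = stdev(readings)
    return mu - k * sigma, mu + k * sigma

def screen(readings, spec_low, spec_high, k=6.0):
    """Indices of parts that are within spec but outside the population limits."""
    low, high = part_average_limits(readings, k)
    outliers = []
    for idx, value in enumerate(readings):
        in_spec = spec_low <= value <= spec_high
        in_population = low <= value <= high
        if in_spec and not in_population:
            outliers.append(idx)
    return outliers

# Hypothetical readings for one lot; the last part is in spec but atypical.
lot = [1.00, 1.01, 0.99, 1.02, 0.98, 1.00, 1.01, 0.99, 1.00, 1.30]
suspects = screen(lot, spec_low=0.5, spec_high=1.5, k=2.0)
```

Because the outlier itself inflates the sample standard deviation, a production flow would likely use robust statistics; the plain mean and standard deviation are used here only to keep the sketch short.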
Parallel file systems are very sensitive to adverse conditions, and the lack of synergy between such file systems and some of the applications running on them has a negative impact on the overall system performance. Our observations indicate that the increased pressure on metadata management is one of the relevant causes of performance drops. This paper proposes a virtualization layer above the native file system that, transparently to the user, reorganizes the underlying directory tree, mitigating bottlenecks by taking advantage of the native file system optimizations and limiting the effects of potentially harmful application behavior. We developed COFS (COmposite File System) as a proof-of-concept virtual layer to evaluate the feasibility of the proposal.
The increasing complexity of today's system-on-a-chip (SoC) design is challenging the design engineers to evaluate the system performance and explore the design space. Electronic system-level (ESL) design methodology has been of great help in addressing these challenges in recent years. In this paper, we present a system-level architecture refinement flow and implement a dual-DSP-core virtual system based on the highly accurate mixed abstraction-level modeling methodology. The constructed virtual platform can run various multimedia applications and achieve high accuracy. Compared with the traditional RTL simulation, the error rate is less than 5% and the simulation speed is around 100 times faster. Using the architecture refinement flow, system performance profiling and architecture exploration are also realized so that the software and hardware engineers can scrutinize the complicated system.
Keywords- electronic system-level (ESL); transaction-level modeling (TLM); system validation; architecture refinement
The success of server virtualization has led to the deployment of a huge number of virtual machines in today's data centers, making manual virtualization management very labor-intensive. The development of appropriate management solutions is hindered by the various management interfaces of different hypervisors. Uniform management can therefore be simplified by a layer abstracting from these dedicated hypervisor interfaces. The libvirt management library provides such an interface to different hypervisors. Unfortunately, remote hypervisor management using libvirt has not been possible without altering the managed servers. To overcome this limitation, we have integrated remote hypervisor management facilities into the libvirt driver infrastructure for VMware ESX and Microsoft Hyper-V. This paper presents the resulting architecture as well as experiences gained during the implementation process.
In aggressively scaled technologies, reliability concerns such as oxide breakdown have become a key issue. Dynamic reliability management (DRM) has been proposed as a mechanism to dynamically explore the tradeoff between system performance and reliability margin. However, existing DRM methods are hampered by the fact that they do not accurately model spatial and temporal variations in process and temperature parameters which have a strong impact on chip reliability. In addition, they make the simplifying assumption that the future workloads are identical to the currently observed one. This makes them sensitive to sudden workload variations and outliers. In this paper, we present a novel workload-aware dynamic reliability management framework that accounts for local variations in both the process and temperature. The reliability estimation, along with the predicted remaining workload, is fed to a dynamic voltage/frequency scaling module to manage the system reliability and optimize processor performance. Using a fast on-line analytical/table-look-up method we demonstrate an average error of 1% with up to 5 orders of magnitude speedup compared to Monte Carlo simulation. Experiments on an Alpha-like processor show our DRM framework fully utilizes the available margin and achieves 28.7% performance improvement on average.
We present a framework and control policies for optimizing dynamic control of various self-tuning parameters over lifetime in the presence of circuit aging. Our framework introduces dynamic cooling as one of the self-tuning parameters, in addition to supply voltage and clock frequency. Our optimized self-tuning satisfies performance constraints at all times and maximizes a lifetime computational power efficiency (LCPE) metric, which is defined as the total number of clock cycles achieved over lifetime divided by the total energy consumed over lifetime. Our framework features three control policies: 1. Progressive-worst-case-aging (PWCA), which assumes worst-case aging at all times; 2. Progressive-on-state-aging (POSA), which estimates aging by tracking active/sleep mode, and then assumes worst-case aging in active mode and long recovery effects in sleep mode; 3. Progressive-real-time-aging-assisted (PRTA), which estimates the actual amount of aging and initiates optimized control action. Simulation results on benchmark circuits, using aging models validated by 45nm CMOS stress measurements, demonstrate the practicality and effectiveness of our approach. We also analyze design constraints and derive system design guidelines to maximize self-tuning benefits.
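The LCPE metric defined in the abstract above (total clock cycles over lifetime divided by total energy over lifetime) can be written down directly. The sketch below assumes the lifetime is described as a list of intervals with constant clock frequency and power; that representation and the numbers are illustrative assumptions, not data from the paper.

```python
def lcpe(intervals):
    """Lifetime computational power efficiency in cycles per joule.

    intervals is an iterable of (frequency_hz, power_w, duration_s) tuples,
    each describing a stretch of lifetime with constant operating conditions.
    """
    intervals = list(intervals)
    total_cycles = sum(f * t for f, _, t in intervals)
    total_energy = sum(p * t for _, p, t in intervals)
    return total_cycles / total_energy

# Hypothetical lifetime: a fast phase followed by a slower, lower-power phase.
example = lcpe([(1.0e9, 1.0, 10.0), (0.5e9, 0.25, 10.0)])
```

In this toy case the slower phase contributes a third of the cycles for a fifth of the energy, which raises the overall figure above the 1e9 cycles per joule of the fast phase alone.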
The occupancy of caches has tended to be dominated by the logic bit value '0' approximately 75% of the time. Periodic bit flipping can reduce this to 50%. Combining cache power saving strategies with bit flipping can lower the effective logic bit value '0' occupancy ratios even further. We investigate how Negative Bias Temperature Instability (NBTI) affects different power saving cache strategies employing symmetric and asymmetric 6-transistor (6T) and 8T Static Random Access Memory (SRAM) cells. We observe that a recovery of more than 38% to 66% in the stability parameters (SNM and WNM) is achieved under different power saving cache strategies for caches based on different SRAM cells. We also study the effect of process variations along with NBTI for the 32nm and 45nm technology nodes. It is observed that the rate of recovery in caches based on asymmetric SRAM cells is slightly higher than in caches based on symmetric and 8T SRAM cells.
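The step from 75% to 50% occupancy in the abstract above follows from simple averaging, sketched here under the assumption that the stored content is held inverted for a fixed fraction of the time. The function name and the flipped_fraction parameter are hypothetical.

```python
def zero_occupancy(p_zero, flipped_fraction):
    """Fraction of time a cell holds '0' when its content is stored inverted
    for flipped_fraction of the time and in true form otherwise."""
    return (1.0 - flipped_fraction) * p_zero + flipped_fraction * (1.0 - p_zero)

# With 75% zeros and the content inverted half of the time.
balanced = zero_occupancy(0.75, 0.5)
unflipped = zero_occupancy(0.75, 0.0)
```

With a flipped fraction of one half the result is 0.5 regardless of the original bias, which is why periodic flipping balances the stress on the two sides of the cell.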
Energy consumption has always been considered as the key issue of the state-of-the-art SoCs. Implementing an on-chip Cache is one of the most promising solutions. However, traditional Cache may suffer from performance and energy penalties due to the Cache conflict. In order to deal with this problem, this paper firstly introduces a Time-Slotted Cache Conflict Graph to model the behavior of Data Cache conflict. Then, we implement an Integer Nonlinear Programming to select the most profitable data pages and employ Virtual Memory System to remap those data pages, which can cause severe Cache conflict within a time slot, to the on-chip Scratchpad Memory (SPM). In order to minimize the swapping overhead of dynamic SPM allocation, we introduce a novel SPM controller with a tightly coupled DMA to issue the swapping operations without CPU's intervention. The proposed method can optimize all of the data segments, including global data, heap and stack data in general, and reduce 24.83% energy consumption on average without any performance degradation.
Keywords-Time-Slotted Cache Conflict Graph; Scratchpad Memory; Energy Optimization; Virtual Memory System
This paper presents a novel power management technique based on enhanced Q-learning algorithms. By exploiting the submodularity and monotonic structure in the cost function of a power management system, the enhanced Q-learning algorithm is capable of exploring ideal trade-offs in the power-performance design space and converging to a better power management policy. We further propose a linear adaptation algorithm that adapts the Lagrangian multiplier λ to search for the power management policy that minimizes the power consumption while delivering the exact required performance. Experimental results show that, compared to the existing expert-based power management, the proposed Q-learning based power management achieves up to 30% and 60% power reduction for synthetic workload and real workload, respectively, while on average maintaining performance within 7% variation of the given constraint.
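To make the two ingredients named above concrete, here is a minimal sketch of a plain tabular Q-learning update on a Lagrangian cost (power plus λ times delay) together with a linear adjustment of λ toward a performance target. It is standard textbook Q-learning, not the paper's enhanced variant that exploits submodularity, and all state names, action names, step sizes and the delay-based performance measure are assumptions.

```python
def q_update(q, state, action, next_state, power, delay, lam,
             alpha=0.1, gamma=0.9):
    """One cost-minimizing Q-learning step on cost = power + lam * delay."""
    cost = power + lam * delay
    best_next = min(q[next_state].values())  # lowest expected future cost
    q[state][action] += alpha * (cost + gamma * best_next - q[state][action])
    return q[state][action]

def adapt_lambda(lam, measured_delay, target_delay, step=0.05):
    """Raise lam when performance is worse than required, lower it otherwise."""
    return max(0.0, lam + step * (measured_delay - target_delay))

# Hypothetical two-state, two-action table initialised to zero.
q = {s: {"sleep": 0.0, "run": 0.0} for s in ("idle", "busy")}
updated = q_update(q, "idle", "sleep", "busy", power=1.0, delay=2.0, lam=0.5)
lam_up = adapt_lambda(0.5, measured_delay=3.0, target_delay=2.0)
lam_floor = adapt_lambda(0.0, measured_delay=1.0, target_delay=2.0)
```

A larger λ weights delay more heavily in the cost, so the learned policy trades power for performance until the measured delay settles at the target.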
The simulation speed is a key issue in virtual prototyping of Multi-Processor Systems on Chip (MPSoCs). The SystemC TLM2.0 (Transaction Level Modeling) approach accelerates the simulation by using Interface Method Calls (IMC) to implement the communications between hardware components. Another source of speedup can be exploited by parallel simulation. Multi-core workstations are becoming the mainstream, and SMP workstations will soon contain several tens of cores. The standard SystemC simulation engine uses a centralized scheduler, which is clearly the bottleneck for a parallel simulation. This paper has two main contributions. The first is a general modeling strategy for shared memory MPSoCs, called TLM-DT (Transaction Level Modeling with Distributed Time). The second is a truly parallel simulation engine, called SystemC-SMP. First experimental results on a 40 processor MPSoC virtual prototype running on a dual-core workstation demonstrate a speedup of 1.8 versus a sequential simulation.
Keywords MPSoC, Parallel Simulation, SystemC, SMP workstations
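The reported speedup of 1.8 on a dual-core workstation can be put in perspective with two small calculations: the parallel efficiency, and the parallel fraction that Amdahl's law would imply. Applying Amdahl's law here is an assumption of this sketch, not an analysis made in the abstract, and it ignores synchronization overhead.

```python
def parallel_efficiency(speedup, cores):
    """Speedup achieved per core."""
    return speedup / cores

def amdahl_parallel_fraction(speedup, cores):
    """Parallel fraction p solving speedup = 1 / ((1 - p) + p / cores)."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)

efficiency = parallel_efficiency(1.8, 2)
fraction = amdahl_parallel_fraction(1.8, 2)
```

Under that assumption, roughly 89% of the sequential simulation time would have to be parallelizable to reach a speedup of 1.8 on two cores.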
High speed serial interfaces represent the new trend for device-to-device communication. These systems require clock recovery modules to avoid clock forwarding. In this paper we present a high-speed clock recovery method usable with low-cost FPGAs. Our proposed solution features increased speed and reduced size compared to existing designs. The method allows a maximum throughput of 400Mbps compared to the vendor supplied solution capable of only 160Mbps. The module was also integrated and tested within a serial transceiver system. Although the implementation is specific to a given vendor, the idea can also be applied to other devices because it uses only components that are generally available from most vendors.
Keywords-clock recovery, serial communication, FPGA
This paper describes the implementation of an FPGA prototype of an application based on embedded ASIC technology. The overall goal is to implement a system that can monitor an Ethernet data stream and extract configuration data marked by the EtherType field in the Ethernet header. For evaluation the application is implemented on a prototype consisting of two XILINX FPGA boards. Since the target platform is an ASIC with embedded reconfigurable architectures, the prototype is divided into the corresponding parts. One board emulates the embedded reconfigurable architecture that contains the Ethernet MAC. Ethernet packets can reconfigure this MAC. The second board emulates the static part of the application that controls the reconfiguration process.
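The filtering step described above, selecting frames by the EtherType field, can be sketched in software as follows. The sketch assumes an untagged Ethernet II frame (6-byte destination, 6-byte source, 2-byte EtherType) and uses 0x88B5, an EtherType reserved for local experimental use, as a placeholder; the value actually used in the prototype is not given in the abstract, and VLAN-tagged frames are not handled.

```python
import struct

ETH_HEADER_LEN = 14          # destination MAC + source MAC + EtherType
CONFIG_ETHERTYPE = 0x88B5    # placeholder: local experimental EtherType

def extract_config(frame, ethertype=CONFIG_ETHERTYPE):
    """Return the payload of a frame carrying the given EtherType, else None."""
    if len(frame) < ETH_HEADER_LEN:
        return None
    # Network byte order: two 6-byte addresses followed by a 16-bit type field.
    _dst, _src, etype = struct.unpack("!6s6sH", frame[:ETH_HEADER_LEN])
    if etype != ethertype:
        return None
    return frame[ETH_HEADER_LEN:]

# Hypothetical frames: one configuration frame and one IPv4 frame.
header = b"\xff" * 6 + b"\x02\x00\x00\x00\x00\x01"
config_frame = header + b"\x88\xb5" + b"cfg"
ipv4_frame = header + b"\x08\x00" + b"data"
payload = extract_config(config_frame)
ignored = extract_config(ipv4_frame)
```

In hardware the same comparison would be made on bytes 12 and 13 of the incoming stream; the Python version only illustrates the header layout.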
Electronic systems for safety-critical automotive applications must operate for many years in harsh environments. Reliability issues are worsening as devices scale down, while performance and quality requirements are increasing. One of the key reliability issues is long-term performance degradation due to aging. For safe operation, aging monitoring should be performed on chip, namely using built-in aging sensors (activated from time to time). The purpose of this paper is to present a novel programmable nanometer aging sensor. The proposed aging sensor allows several levels of circuit failure prediction and exhibits low sensitivity to PVT (Process, power supply Voltage and Temperature) variations. Simulation results with a 65 nm sensor design are presented that ascertain the usefulness of the proposed solution.
Keywords: aging sensors, reliability in nanometer technologies, failure prediction
In this paper we present a passive reduced order modeling algorithm for linear multiport interconnect structures. The proposed technique uses rational fitting via semidefinite programming to identify a passive transfer matrix from given frequency domain data samples. Numerical results are presented for a power distribution grid and an array of inductors, and the proposed approach is compared to two existing rational fitting techniques.
We present GOLDMINE, a methodology for generating assertions automatically. Our method involves a combination of data mining and static analysis of the Register Transfer Level (RTL) design. We present results of using GoldMine for assertion generation of the RTL of a 1000-core processor design that is still in an evolving stage. Our results show that GoldMine can generate complex, high coverage assertions in RTL, thereby minimizing human effort in this process.
Today, mobile and embedded real time systems have to cope with the migration and allocation of multiple software tasks running on top of a real time operating system (RTOS) residing on one or several processors. For scaling of each task set and processor configuration, instruction set simulation and worst case timing analysis are typically applied. This paper presents a complementary approach for the verification of RTOS properties based on an abstract RTOS Model in SystemC. We apply IEEE P1850 PSL for which we present an approach and first experiences for the assertion-based verification of RTOS properties.
Keywords- real-time operating systems; verification; PSL
With technology scaled to deep submicron era, temperature and temperature gradient have emerged as important design criteria. We propose two post-placement techniques to reduce peak temperature by intelligently allocating whitespace in the hotspots. Both methods are fully compliant with commercial technologies, and can be easily integrated with state-of-the-art thermal-aware design flow. Experiments in a set of tests on circuits implemented in STM 65nm technologies show that our methods achieve better peak temperature reduction than directly increasing circuit's area.
Clock Gating has been the most widely used method to reduce dynamic power for digital designs. Increasingly the need to assess the quality of a clock gating implementation has resulted in generation of various benchmarking criteria. These criteria however fail to provide a feel to the designer about quality of a current implementation and the scope available for further clock gating. Prior work has also reported various datapath based clock gating techniques to optimize for dynamic power. In this paper we present a new approach to analyze this problem through the IO Exclusivity (IOEX) graphs and Cluster Efficiency (CE) plots. The IOEX graph captures the datapath activity across the sequential elements normalized to the source clock. This exercise produces sequential elemental clusters or modules, which are amenable to clock gating. A CE plot then provides a visual insight into these fine grained implementations to aid the designer to further gate the given cluster. This additional gating can then be implemented either at synthesis or at the layout stages, depending on the design cycle time. Results from 65nm designs show that up to 20% dynamic power savings can be achieved with our approach over and above the industry standard low power synthesis solutions.
Safety-critical automotive systems must fulfill hard real-time constraints for reliability and safety. This paper presents a case study for the application of an AUTOSAR-based language for timing modeling and analysis. We present and apply the Timing Augmented Description Language (TADL) and demonstrate a methodology for the development of a speed-adaptive steer-by-wire system. We examine the impact of TADL and the methodology on the development process and the suitability and interoperability of the applied tools with respect to the AUTOSAR-based tool chain in the context of our case study.
Virtualization has become a key technology in the design of embedded systems. Within the scope of virtualization, emulation is a central aspect to overcome the limits induced by the heterogeneity of complex distributed embedded systems. Most of the techniques developed for desktops and servers are not directly applicable to embedded systems due to their strict timing requirements. We will show the problems of existing emulation methods when applying them to embedded real-time systems and will propose a metric to determine the worst-case overhead caused by emulation. Based on this metric, we then propose an emulation method minimizing the worst-case overhead.
Statistical variability (SV) presents increasing challenges to CMOS scaling and integration at nanometer scales. It is essential that SV information is accurately captured by compact models in order to facilitate reliable variability aware design. Using statistical compact model parameter extraction for the new industry standard compact model PSP, we investigate the accuracy of standard statistical parameter generation strategies in statistical circuit simulations. Results indicate that the typical use of uncorrelated normal distribution of the statistical compact model parameters may introduce considerable errors in the statistical circuit simulations.
Keywords- Statistical variability; mismatch; statistical compact modelling; MOSFETs;
Power consumption can be significantly reduced in Systems-on-Chip (SoC) by scaling down the voltage levels of the Processing Elements (PEs). The power efficiency of this Voltage Islanding technique comes at the cost of energy and area overhead due to the level shifters between voltage islands. Moreover, from the physical design perspective it is not desirable to have an excessive number of voltage islands on the chip. Considering voltage islanding at an early phase of design, such as during floorplanning of the PEs, can address several of these issues. In this paper, we propose a new cost function for the floorplanning objective different from the traditional floorplanning objective. The new cost function not only includes the overall area requirement, but also incorporates the overall power consumption and the design constraint imposed on the maximum number of voltage islands. We propose a greedy heuristic based on the proposed cost function for the floorplanning of the PEs with several voltage islands. Experimental results using benchmark data study the effect of several parameters on the outcome of the heuristic. It is evident from the results that power consumption can be significantly reduced using our algorithm without significant area overhead. The area obtained from the heuristic is also compared with the optimal, and found to be within 4% of the optimal on average, when area minimization is given the priority.
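The abstract above does not give the form of its cost function, so the following is only one plausible shape for a cost that combines area, power and a cap on the number of voltage islands: a weighted sum of normalized area and power, with floorplans that exceed the island limit rejected outright. The weights, normalization references and the hard-rejection rule are all assumptions of this sketch.

```python
def floorplan_cost(area, power, num_islands, max_islands,
                   w_area=0.5, w_power=0.5, area_ref=1.0, power_ref=1.0):
    """Hypothetical floorplanning cost; lower is better.

    Floorplans with more voltage islands than allowed are treated as
    infeasible and receive an infinite cost.
    """
    if num_islands > max_islands:
        return float("inf")
    return w_area * area / area_ref + w_power * power / power_ref

feasible = floorplan_cost(area=2.0, power=4.0, num_islands=3, max_islands=4)
infeasible = floorplan_cost(area=2.0, power=4.0, num_islands=5, max_islands=4)
```

A greedy heuristic could evaluate such a cost for each candidate placement and keep the cheapest; setting w_power to zero recovers a purely area-driven objective.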
Wireless sensor-and-control systems are starting to make inroads in medical health care. Monitoring of vital signs, advanced imaging, and innovative treatments and prosthetics are emerging. The challenge with all these systems is that they have to be "energy-frugal", that is, they have to get by with the energy they scavenge from the environment around them. At the same time, they need to be adaptive to the time-varying needs of the systems they observe or control. In this talk we will discuss how we could simultaneously explore the lower bounds of energy dissipation while at the same time providing "hugely-scalable" performance.
Structural Health Monitoring (SHM) is a new challenge in wireless sensor design. In our project we are proposing an ultrasonic measurement system to find malfunctions within structures. Examples of such structures are aircraft bodies and the wings of a wind turbine. The system is required to allow long-term monitoring from a single battery or even using energy scavenging techniques. First of all, the measurement algorithms are analyzed. Here we have to find a trade-off between energy consumption of the sensor, required memory size, computational power, duty cycle and timing jitter between actuator and sensors. The most important method to control the energy consumption of the system is efficient duty cycle control. As the sensor is sleeping most of the time, the overall timing of the system has to be maintained and refreshed in a continuous manner. A high timing accuracy also results in higher power consumption during sleep mode, while a low power RC oscillator needs a higher refresh rate. For the wireless system we need to trade off between power consumption during active mode and data rates. While we are able to transfer more data in a shorter period of time with a high performance system, the overall energy consumption with low data rates could be higher if the on-time of the RF (radio-frequency) frontends is much longer - even if the RF frontends consume less power in a low data rate mode. In the paper and presentation the use cases and design decisions are described based on the outcome of the system analysis.
6LoWPAN devices are intended to be deployed in IPv6 networks whose subnets will often be physically disjoint and perhaps separated by large distances. A major advantage of exploiting the nearly inexhaustible pool of addresses available in IPv6 is the ease with which true host-to-host communication can be realised. This, however, amplifies the importance of security in the network. It must be guaranteed with nearly 100% certainty that whenever a sensor node solicits data from or furnishes data to another node, the solicited node is in fact the node from which the data is required and, just as importantly, that the soliciting node is in fact a node authorised to request the data. This paper will examine the requirements for such peer-to-peer networks and discuss solutions that will fulfill them.
The paper introduces novel field programmable gate array (FPGA) circuits based on hybrid CMOS/resistive switching device (memristor) technology and explores several logic architectures. The novel FPGA structure is based on the combination of CMOL (Cmos + MOLecular scale devices) FPGA circuits and recent improvements and generalization of the CMOL concept to allow multilayer crossbar integration, compatibility with state-of-the-art foundries, and a wide range of available memristive crosspoint devices. Preliminary results indicate that with no optimization and only conventional CMOS technology, the proposed circuits can be at least ten times denser (and potentially faster) than CMOS FPGAs with the same design rules and similar power density. The second part of this paper shows that this performance can be further improved using optimal MUX-based logic architecture.
Keywords: FPGA, hybrid circuits, logic architecture, resistance switching, memristors, three dimensional circuits
Spintronic memristor devices based upon spin torque induced magnetization motion are presented and potential application examples are given. The structure and material of these proposed spin torque memristors are based upon existing (and/or commercialized) magnetic devices and can be easily integrated on top of a CMOS. This provides better controllability and flexibility to realize the promises of nanoscale memristors. Utilizing its unique device behavior, the paper explores spintronic memristor potential applications in multibit data storage and logic, novel sensing scheme, power management and information security.
Keywords-spintronic; memristor; spin torque; storage; sensing; power management; security.
In this paper, we present a compact model of the spintronic memristor based on the magnetic-domain-wall motion mechanism for circuit design. Our model also takes into account the variations of material parameters and fabrication process, which significantly affect the actual electrical characteristics of a memristor in nano-scale technologies. Our proposed model can be easily implemented in Verilog-A and is compatible with SPICE-based simulation. Based on our model, we also show some potential applications of the memristor in computing systems, including detailed analysis and optimizations.
Keywords-Memristor, spin torque, spintronic, magnetic tunneling junction (MTJ), compact model
There is today little doubt on the fact that a high-performance and cost-effective Network-on-Chip can only be designed in 45nm and beyond under a relaxed synchronization assumption. In this direction, this paper focuses on a GALS system where the NoC and its end-nodes have independent clocks (unrelated in frequency and phase) and are synchronized via dual-clock FIFOs at network interfaces. Within the network, we assume mesochronous synchronization implemented with hierarchical clock tree distribution. This paper contributes two essential components of any practical design automation support for network instantiation in the target system. On one hand, it introduces a switch design which greatly reduces the overhead for mesochronous synchronization and can be adapted to meet different layout constraints. On the other hand, the paper illustrates a design space exploration framework of mesochronous links that can direct the selection of synchronization options on a port-by-port basis for all the switches in the NoC, based on timing and layout constraints. A final case study illustrates how a cost-effective GALS NoC can be assembled, placed and routed by exploiting the flexibility of the architecture and the outcomes of the exploration framework, thus proving the viability and effectiveness of the design platform.
Associated with the ever growing integration scales is the increase in process variability. In the context of network-on-chip, this variability affects the maximum frequency that could be sustained by each link that interconnects two cores in a chip multiprocessor. In this paper we present a methodology to model delay variations in NoC links. We also show its application to several technologies, namely 45nm, 32nm, 22nm, and 16nm. Simulation results show that conclusions about variability greatly depend on the implementation context.
Recent developments have shown the possibility of leveraging silicon nanophotonic technologies for chip-scale interconnection fabrics that deliver high bandwidth and power efficient communications both on- and off-chip. Since optical devices are fundamentally different from conventional electronic interconnect technologies, new design methodologies and tools are required to exploit the potential performance benefits in a manner that accurately incorporates the physically different behavior of photonics. We introduce PhoenixSim, a simulation environment for modeling computer systems that incorporate silicon nanophotonic devices as interconnection building blocks. PhoenixSim has been developed as a cross-discipline platform for studying photonic interconnects at both the physical-layer level and at the architectural and system levels. The broad scope at which modeled systems can be analyzed with PhoenixSim provides users with detailed insight into the physical feasibility of the implementation, as well as the network and system performance. Here, we describe details about the implementation and methodology of the simulator, and present two case studies of silicon nanophotonic-based networks-on-chip.
A wide tuning range LO generation architecture for software defined radio is presented. A dual VCO approach followed by a programmable divider chain based on high-speed dynamic CMOS latches provides full rail-to-rail operation with low power consumption. The 1.2V 90nm CMOS implementation achieves a VCO tuning range from 6 to 13.6GHz for a power consumption of 3.5 to 13.4mW and a phase noise figure of merit of 182dBc/Hz measured at 3MHz offset from a 12GHz carrier. The VCO-multiplexer and divider chain consume between 5.9 and 8.1mW over this frequency range.
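The reason a VCO range wider than one octave matters can be checked with a little arithmetic. The sketch below assumes the divider chain offers power-of-two ratios, which the abstract does not state; `lo_bands` and `is_contiguous` are hypothetical helper names used only for illustration.

```python
def lo_bands(f_min_ghz, f_max_ghz, max_stages):
    """Return (divide_ratio, low_ghz, high_ghz) for each power-of-two ratio.

    Assumption: the divider chain is modelled as cascaded divide-by-2 stages.
    """
    return [(2 ** k, f_min_ghz / 2 ** k, f_max_ghz / 2 ** k)
            for k in range(max_stages + 1)]


def is_contiguous(bands):
    """True if every divided band reaches up to the bottom of the band above it."""
    return all(lower[2] >= upper[1] for upper, lower in zip(bands, bands[1:]))


if __name__ == "__main__":
    bands = lo_bands(6.0, 13.6, max_stages=4)
    for ratio, low, high in bands:
        print(f"divide by {ratio:2d}: {low:.3f} to {high:.3f} GHz")
    print("gap-free coverage:", is_contiguous(bands))
```

Because 13.6 GHz is more than twice 6 GHz, each divided band overlaps the one above it, so under this assumption the LO output has no frequency gaps down to the lowest divided band.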
This paper presents a 90 nm CMOS digital amplitude modulator for polar transmitters. It reaches an output power of -2.5 dBmRMS using a WLAN OFDM 64QAM modulation at 2.45GHz achieving -26.1 dB EVM and 18% efficiency. To reduce the aliases due to the discrete-time to continuous-time conversion, a 2-fold interpolation has been implemented. The amplitude modulator has a segmented architecture. This results in a very compact 0.007 mm² chip area.
This paper presents the design of a 14 bit, 280 kS/s cyclic ADC which consumes 1.6 mW power and achieves 100 dB SFDR. The design is optimized with a half-scale residue transfer characteristic (RTC) which lowers swing and slew requirements on the opamp. Further advantages of this RTC are exploited to reduce the number and magnitude of dominant error sources, and the residual error is randomized with dithering. Capacitor scaling and optimized allocation of conversion time to each step add to power savings. The ADC fabricated in a 0.35 μm CMOS process occupies 1.04 mm² silicon area.
Keywords - Cyclic analog-to-digital converter (ADC); half-scale residue transfer characteristic (RTC); residue amplifier (RA); dithering; integral nonlinearity (INL); differential nonlinearity (DNL)
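For readers unfamiliar with cyclic conversion, the following is a minimal behavioural sketch of a conventional 1-bit-per-cycle cyclic ADC with ideal components. It is not the half-scale RTC proposed in the paper, and the function names and the normalized `vref` are assumptions made for illustration.

```python
def cyclic_adc(vin, vref=1.0, n_bits=14):
    """Convert vin in [-vref, vref] to n_bits decisions, MSB first.

    Each cycle compares the residue with zero, then doubles it and
    subtracts or adds vref so the residue stays within [-vref, vref].
    """
    residue, bits = vin, []
    for _ in range(n_bits):
        decision = 1 if residue >= 0 else 0
        bits.append(decision)
        residue = 2 * residue - vref if decision else 2 * residue + vref
    return bits


def reconstruct(bits, vref=1.0):
    """Map the decisions back to a voltage; bit i carries weight vref / 2**(i + 1)."""
    return vref * sum((2 * d - 1) * 2.0 ** -(i + 1) for i, d in enumerate(bits))


if __name__ == "__main__":
    for vin in (-0.9, -0.3, 0.123, 0.77):
        code = cyclic_adc(vin)
        print(vin, reconstruct(code))
```

In this ideal model the reconstruction error is bounded by `vref / 2**n_bits`, since the final residue never leaves the range `[-vref, vref]`.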
This article discusses system-level techniques to optimize the power-performance trade-off in subthreshold circuits and presents a uniform platform for implementing ultra-low power power-scalable analog and digital integrated circuits. The proposed technique is based on using a subthreshold source-coupled or current-mode approach for both analog and digital circuits. In addition to the possibility of operating with ultra-low power dissipation, because the analog and digital parts are constructed on a similar basis, a common power management unit could be used for optimizing the power-performance of the entire mixed-signal system. Some circuit examples have been provided to show the performance of the proposed circuits in practice.
Soft errors have been a critical reliability concern in nanoscale integrated circuits, especially in sequential circuits where a latched error can be propagated for multiple clock cycles and affect more than one output, more than once. This paper presents an analytical methodology for enhancing the soft error tolerance of sequential circuits. By using clock skew scheduling, we propose to minimize the probability of unwanted transient pulses being latched and also prevent latched errors from propagating through sequential circuits repeatedly. The overall methodology is formulated as a piecewise linear programming problem whose optimal solution can be found by existing mixed integer linear programming solvers. Experiments reveal that 30-40% reduction in the soft error rate for a wide range of benchmarks can be achieved.
This paper describes a hardware-/software-based technique to make the data path of a statically scheduled superscalar processor fault tolerant. The results of concurrently executed operations can be compared with little hardware overhead in order to detect a transient or permanent fault. Furthermore, the hardware extension allows recovery from a fault within one to two clock cycles and distinguishes between transient and permanent faults. If a permanent fault is detected, this fault is masked for the rest of the program execution such that no further time is needed for recovering from that fault. The proposed extensions were implemented in the data path of a simple VLIW processor in order to prove the feasibility and to determine the hardware overhead. Finally, a reliability analysis is presented. It shows that for medium- and large-scale data paths our extension provides up to 98% better reliability than triple modular redundancy.
Inductive and capacitive coupling are responsible for slowing down signals. Existing bus encoding techniques tackle the issue by avoiding certain types of transitions. This work proposes a codeword generation method for such techniques that is scalable to very wide buses. Experimentation on a recent encoding technique confirms that the conventional method is limited to 16-bit buses while the proposed method is easily extended beyond 128 bits.
As the VLSI technology scaling continues and the device dimension keeps shrinking, memories are more and more sensitive to soft errors. Memory cores usually occupy a large portion of an SOC and have significant impact on the chip reliability. Therefore error detection and correction (EDAC) techniques are commonly used for protecting the system against soft errors. This paper presents a novel EDAC scheme, which provides adaptive code rate for random access memories (RAMs). Under a certain reliability restriction, the proposed design allows more error bits than a conventional EDAC design.
Index Terms - Error correction codes, memory, Hsiao code, fault tolerance, reliability
Employing COTS components in real-time embedded systems leads to timing challenges. When multiple CPU cores and DMA peripherals run simultaneously, contention for access to main memory can greatly increase a task's WCET. In this paper, we introduce an analysis methodology that computes upper bounds to task delay due to memory contention. First, an arrival curve is derived for each core representing the maximum memory traffic produced by all tasks executed on it. Arrival curves are then combined with a representation of the cache behavior for the task under analysis to generate a delay bound. Based on the computed delay, we show how tasks can be feasibly scheduled according to assigned time slots on each core.
We use the polyhedral process network (PPN) model of computation to program embedded Multi-Processor Systems on Chip (MPSoCs) platforms. If a designer wants to reduce the number of processes in a network due to resource constraints, for example, then the process merging transformation can be used to achieve this. We present a compile-time approach to evaluate the system throughput of PPNs in order to select a merging candidate which gives a system throughput as close as possible to the original PPN. We show results for two experiments on the ESPAM platform prototyped on a Xilinx Virtex 2 Pro FPGA.
Nowadays, most embedded devices need to support multiple applications running concurrently. In contrast to desktop computing, very often the set of applications is known at design time and the designer needs to assure that critical applications meet their constraints in every possible use-case. In order to do this, all possible use-cases, i.e. subsets of applications running simultaneously, have to be verified thoroughly. An approach to reduce the verification effort is to perform composability analysis, which has been studied for sets of applications modeled as Synchronous Dataflow Graphs. In this paper we introduce a framework that supports a more general parallel programming model based on the Kahn Process Networks Model of Computation and integrates a complete MPSoC programming environment that includes compiler-centric analysis, performance estimation, simulation, as well as mapping and scheduling of multiple applications. In our solution, composability analysis is performed on parallel traces obtained by instrumenting the application code. A case study performed on three typical embedded applications, JPEG, GSM and MPEG-2, proved the applicability of our approach.
Predicting timing behavior is key to reliable real-time system design and verification, but becomes increasingly difficult for current multiprocessor systems on chip. The integration of formerly separate functionality into a single multicore system introduces new inter-core timing dependencies, resulting from the common use of the now shared resources. In order to conservatively bound the delay due to the shared resource accesses, upper bounds on the potential amount of conflicting requests from other processors are required. This paper proposes a method that captures the request distances of multiple shared resource accesses by single tasks and also by multiple tasks that are dynamically scheduled on the same processor. Unlike previous work, we acknowledge the fact that on a single processor, tasks will not actually execute in parallel, but in alternation. This consideration leads to a more accurate load model. In a final step, the approach is extended to also address dynamic cache misses that do not occur at predefined times but surface dynamically during the execution of the tasks.
A new delay-insensitive data encoding scheme for global asynchronous communication is introduced. The goal of this work is to combine the timing-robustness of delay-insensitive (i.e., unordered) codes with the fault-tolerance of error-correcting codes. The proposed error-correcting unordered (ECU) code, called Zero-Sum, can safely accommodate arbitrary skew in arrival times of individual bits in a packet, while simultaneously providing 1-bit correction and 2-bit detection. A systematic code is targeted, where data can be directly extracted from the codewords. A basic method for generating the code is presented, as well as detailed designs for the supporting hardware blocks. An outline of the system micro-architecture and its operating protocol is also given. When compared to the best previous systematic ECU code, the new code provides a 5.74 to 18.18% reduction in transition power for most field sizes, with better or comparable coding efficiency. Pre-layout technology-mapped implementations of the supporting hardware (encoder, completion detector, error-corrector) were synthesized with the UC Berkeley ABC tool using a 90nm industrial standard cell library. Results indicate that they have moderate area and delay overheads, while the best non-systematic ECU codes have 3.82 to 10.44x greater area for larger field sizes.
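The "unordered" property mentioned above can be illustrated with the classic Berger code, used here as a stand-in: it is systematic and unordered, but it is not the Zero-Sum code of the paper and provides no error correction. The sketch checks by brute force that no codeword covers another, which is what lets a receiver detect completion regardless of bit arrival order. All function names are hypothetical.

```python
from itertools import product


def berger_encode(data_bits):
    """Append the binary count of zeros in data_bits (systematic Berger code)."""
    k = len(data_bits)
    check_len = k.bit_length()          # enough bits to hold counts 0..k
    zeros = list(data_bits).count(0)
    check = [(zeros >> i) & 1 for i in reversed(range(check_len))]
    return tuple(data_bits) + tuple(check)


def covers(a, b):
    """True if word a has a 1 in every position where word b has a 1."""
    return all(x >= y for x, y in zip(a, b))


def is_unordered(words):
    """True if no word in the set covers a different word in the set."""
    return not any(covers(a, b) for a in words for b in words if a != b)


# All 16 codewords for 4 data bits.
codewords = [berger_encode(bits) for bits in product((0, 1), repeat=4)]

if __name__ == "__main__":
    print("plain 4-bit words unordered:", is_unordered(list(product((0, 1), repeat=4))))
    print("Berger codewords unordered: ", is_unordered(codewords))
```

Plain binary words fail the check (for example, 1111 covers every other word), whereas the Berger codewords pass, because setting more data bits to 1 necessarily lowers the zero count stored in the check bits.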
We propose a methodology for Boolean matching under permutations of inputs and outputs (PP-equivalence checking problem) - a key step in incremental logic design that identifies large sections of a netlist that are not affected by a change in specifications. Finding reusable sections of a netlist reduces the amount of work in each design iteration and accelerates design closure. Our approach integrates graph-based, simulation-driven and SAT-based techniques to make Boolean matching feasible for large circuits. Experimental results confirm scalability of our techniques to circuits with hundreds and even thousands of inputs and outputs.
This paper introduces the concept of kl-feasible cuts, by controlling both the number k of inputs and the number l of outputs in a circuit cut. To provide scalability, the concept of factor cuts is extended to kl-cuts. Algorithms for computing such cuts, including kl-cuts with unbounded k, are presented and results are shown. As a practical application, a covering algorithm using these cuts is presented.
Keywords - AIG; cut enumeration; technology mapping.
Reliability analysis for a logic circuit is one of the primary tasks in fault-tolerant logic synthesis. Given a fault model, it quantifies the impact of faults on the full-chip fault rate. We present RALF, an exact algorithm for calculating the reliability of a logic circuit. RALF is based on the compilation of a circuit to deterministic decomposable negation normal form (d-DNNF), a representation for Boolean formulas that can be more succinct than BDDs. Our algorithm can solve a large set of MCNC benchmark circuits within 5 minutes, enabling an optimality study of Monte Carlo simulation, a popular estimation method for reliability analysis, on real benchmark circuits. Our study shows that Monte Carlo simulation with a small set of random vectors generally has a high fidelity for the computation of full-chip fault rates and the criticality of single gates. While we focus on reliability analysis, RALF can also be used to efficiently locate random pattern resistant faults. This can be used to identify where methods other than random simulation should be used for accurate criticality calculations and where to enhance the testability of a circuit.
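The comparison between exact reliability computation and Monte Carlo estimation can be sketched on a toy circuit. The example below uses brute-force enumeration rather than d-DNNF compilation, and it assumes a simple fault model in which each gate output flips independently with probability `eps`; the circuit, the fault model, and all names are illustrative assumptions rather than the RALF algorithm.

```python
import random
from itertools import product


def circuit(a, b, c, flip_and=0, flip_or=0):
    """y = (a AND b) OR c; a set flip_* bit inverts that gate's output."""
    n1 = (a & b) ^ flip_and
    return (n1 | c) ^ flip_or


def exact_error_rate(eps):
    """Enumerate all inputs (uniformly weighted) and all gate-fault patterns."""
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        golden = circuit(a, b, c)
        for f1, f2 in product((0, 1), repeat=2):
            prob = (eps if f1 else 1 - eps) * (eps if f2 else 1 - eps)
            if circuit(a, b, c, f1, f2) != golden:
                total += prob / 8
    return total


def monte_carlo_error_rate(eps, samples, seed=0):
    """Estimate the same quantity from random input and fault vectors."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(samples):
        a, b, c = (rng.randint(0, 1) for _ in range(3))
        f1, f2 = (int(rng.random() < eps) for _ in range(2))
        errors += circuit(a, b, c, f1, f2) != circuit(a, b, c)
    return errors / samples


if __name__ == "__main__":
    print("exact:      ", exact_error_rate(0.1))
    print("Monte Carlo:", monte_carlo_error_rate(0.1, samples=20000))
```

For this circuit the exact output error rate works out to 1.5·eps − eps², so the two numbers can be compared directly. Enumeration grows exponentially with circuit size, which is why a compact compiled representation is needed for realistic benchmarks.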
We have seen the practical use of multi-processors in complex SoCs and systems grow in the past several years, and the discussions range from architectures through to programming models. One of the issues that poses several challenges to design and verification teams is that of multi-core debug, especially in heterogeneous systems where the processors may be from different vendors, and even when from the same vendor, may be very application-specific. In this panel, designers and researchers who have practical experience with heterogeneous multiprocessor systems, both commercial and research, will draw on those experiences.
Keywords-debugging, SoC, multicore, manycore
The IC industry is facing a huge paradox. On one hand, with the slowing of the performance and power gains provided by scaling, designers need to find new ways of delivering value to their customers. Historically this has meant creating more application specialized chips and systems. On the other hand, the rising NRE costs for chip design (now over $10M/chip) have caused the number of chip design starts to fall. Everyone today seems to be talking about building programmable platforms to ensure the total available market is large enough to justify the chip design costs. To get out of this paradox, we need to change the way we think about chip design. Reducing digital NRE costs requires moving the end user designers up a level in abstraction. For many reasons I don't believe that either the current SoC or high-level language efforts will succeed. Instead, we should acknowledge that working out the interactions in a complex design is complex, and will cost a lot of money, even when we do it well. The key is to leverage this work over a broader class of chips. This approach leads to the idea of building chip-generators and not chips. That is, instead of building a programmable chip to meet a broad class of application needs, you create a virtual programmable chip that is MUCH more flexible than any real chip. The application designer (the new chip designer) will then configure this substrate to optimize for their application. The generator will take this information and then create the desired chip. While there are many very hard problems that need to be addressed to make this work, none of them seem insurmountable. In fact I will provide some examples which indicate the promise of this approach, like having the generator choose the core that is the most energy efficient for your application mix.
The X-GOLD® SDR 2x family of programmable baseband processors is designed for hosting multiple standards of mobile communication, connectivity, and reception of broadcast services. Processors from the X-GOLD® SDR 2x family obtain the necessary flexibility from a set of programmable SIMD (single-instruction, multiple-data) processor cores, which exchange data through shared on-chip memories. The processors are supported by a few dedicated configurable hardware accelerators for those DSP tasks which require little or no flexibility, by an ARM® core for the execution of the upper layers of the protocol stack and by standard IO-components.
LTE is the first cellular communications standard that has been designed from the start for delivering internet content to mobile devices. This will enable a whole new set of applications and devices that are currently beyond our imagination. While new devices will easily consume orders of magnitude more data delivered over the cellular communication pipe than their early 2G counterparts, the challenge of low power consumption remains. LTE not only promises higher data rates, it also offers lower power consumption on a per bit basis than any preceding cellular communications standard, thus making it ideal for battery powered data hungry devices. This talk focuses on low power data transmission over LTE from a system perspective. Taking an electronic reading device as an example, an analysis will be made of where power in the LTE modem is consumed. An outline of further system optimization with respect to low power, exploiting the possibilities offered by the LTE standard and beyond, will be given.
Static noise margin analysis using butterfly curves has traditionally played a leading role in the sizing and optimization of SRAM cell structures. Heightened variability and reduced supply voltages have resulted in increased attention being paid to new methods for characterizing dynamic robustness. In this work, a technique based on vector field analysis is presented for quickly extracting both static and dynamic stability characteristics of arbitrary SRAM topologies. It is shown that the traditional butterfly curve simulation for 6T cells is actually a special case of the proposed method. The proposed technique not only allows for standard SNM "smallest-square" measurements, but also enables tracing of the state-space separatrix, an operation critical for quantifying dynamic stability. It is established via importance sampling that cell characterization using a combination of both separatrix tracing and butterfly SNM measurements is significantly more correlated to cell failure rates than using SNM measurements alone. The presented technique is demonstrated to be thousands of times faster than the brute force transient approach and can be implemented with widely available, standard design tools.
Keywords- memory; stability; robustness; dynamic noise margin; static noise-margin;
The impact of process variation in deep-submicron technologies is especially pronounced for SRAM architectures which must meet demands for higher density and higher performance at increased levels of integration. Due to the complex structure of SRAM, estimating the effect of process variation accurately has become very challenging. In this paper, we address this challenge in the context of estimating SRAM timing variation. Specifically, we introduce a method called loop flattening that demonstrates how the evaluation of the timing statistics in the complex, highly structured circuit can be reduced to that of a single chain of component circuits. To then very quickly evaluate the timing delay of a single chain, we employ a statistical method based on importance sampling augmented with targeted, high-dimensional, spherical sampling. Overall, our methodology provides an accurate estimation with 650X or greater speed-up over the nominal Monte Carlo approach.
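The benefit of importance sampling for rare timing failures can be seen in a one-dimensional toy problem: estimating the probability that a standard normal "delay" exceeds a far-tail threshold. The sketch below uses a plain mean-shifted Gaussian proposal, not the spherical sampling described in the abstract; the threshold, sample count, and function names are assumptions chosen for illustration.

```python
import math
import random


def tail_prob_exact(t):
    """P(X > t) for X ~ N(0, 1), from the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2))


def tail_prob_importance(t, samples, seed=0):
    """Estimate P(X > t) by sampling from N(t, 1) and reweighting.

    The weight is the density ratio N(0,1)/N(t,1) = exp(-t*x + t*t/2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = rng.gauss(t, 1.0)
        if x > t:
            total += math.exp(-t * x + 0.5 * t * t)
    return total / samples


if __name__ == "__main__":
    t = 4.0
    print("exact:              ", tail_prob_exact(t))
    print("importance sampling:", tail_prob_importance(t, samples=20000))
```

At a threshold of four standard deviations the failure probability is on the order of 1e-5, so crude Monte Carlo with the same 20,000 samples would most likely observe at most a single failure, while the shifted proposal places about half of its samples in the failure region.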
Advanced sampling and variance reduction techniques have recently been adopted as efficient alternatives to the slow crude-MC method for the analysis of timing yield in digital circuits. However, these techniques, the Quasi-MC method and the order-statistics-based estimator, are prone to bias or offer negligible improvement over the crude-MC method when only an early-stage timing analysis with few (10s) simulation iterations can be afforded. In this paper, these issues are studied and a control-variate-based technique is developed to accurately estimate the moments of circuits' critical delays with very few timing simulation iterations. A skew-normal distribution is then used to form a closed-form cumulative distribution function of timing yield. Analysis of the benchmark circuits shows a 3-10X reduction of the confidence interval ranges of the estimated yield compared to crude MC, translating to a 9-100X reduction in the number of samples for the same analysis accuracy.
We present a new technique for statistical static timing analysis (SSTA) based on Markov chain Monte Carlo (MCMC), that allows fast and accurate estimation of the right-hand tail of the delay distribution. A "naive" MCMC approach is inadequate for SSTA. Several modifications and enhancements, presented in this paper, enable application of MCMC to SSTA. Moreover, such an approach overcomes inherent limitations of techniques such as importance sampling and Quasi-Monte Carlo. Our results on open source designs, with an independent delay variation model, demonstrate that our technique can obtain more than an order of magnitude improvement in computation time over simple Monte Carlo, given an estimation accuracy target at a point in the tail. Our approach works by providing a large number of samples in the region of interest. Open problems include extension of algorithm applicability to a broader class of synthesis conditions, and handling of correlated delay variations. In a broader context, this work aims to show that MCMC and associated techniques can be useful in rare event analyses related to circuits, particularly for high-dimensional problems.
Facing the requirements of next generation applications, current approaches of embedded systems design will soon hit the limit where they may no longer perform efficiently. The unpredictable nature and diverse processing behavior of future applications require transgressing the barrier of tailor-made, application-/domain-specific embedded system designs. As a consequence, next generation architectures for embedded systems have to react much more flexibly to unforeseeable run-time scenarios. In this paper we present our innovative processor architecture concept KAHRISMA (KArlsruhe's Hypermorphic Reconfigurable-Instruction-Set Multi-grained-Array). It tightly integrates coarse- and fine-grained run-time reconfigurable fabrics that can incorporate to realize hardware acceleration for computationally complex algorithms. Furthermore, the fabrics can be combined to realize different Instruction Set Architectures that may execute in parallel. With the help of an encrypted H.264 en-/decoding case study we demonstrate that our novel KAHRISMA architecture will deliver the required flexibility to design future-proof embedded systems that are not limited to a certain computational domain.
The optimal size of a large on-chip cache can be different for different programs: at some point, the reduction of cache misses achieved when increasing cache size hits diminishing returns, while the higher cache latency hurts performance. This paper presents the Amorphous Cache (AC), a reconfigurable L2 on-chip cache aimed at improving performance as well as reducing energy consumption. AC is composed of heterogeneous sub-caches as opposed to common caches using homogeneous sub-caches. The sub-caches are turned off depending on the application workload to conserve power and minimize latencies. A novel reconfiguration algorithm based on Basic Block Vectors is proposed to recognize program phases, and a learning mechanism is used to select the appropriate cache configuration for each program phase. We compare our reconfigurable cache with existing proposals of adaptive and non-adaptive caches. Our results show that the combination of AC and the novel reconfiguration algorithm provides the best power consumption and performance. For example, on average, it reduces the cache access latency by 55.8%, the cache dynamic energy by 46.5%, and the cache leakage power by 49.3% with respect to a non-adaptive cache.
Keywords - Cache, Dynamic adaptation, Processor evaluation.
rSesame is a generic modeling and simulation framework which can explore and evaluate reconfigurable systems at the early design stages. The framework can be used to explore different HW/SW partitionings, task mappings and scheduling strategies at both design time and runtime. The framework strives for a high degree of flexibility, ease of use, fast performance and applicability. In this paper, we want to evaluate the framework's characteristics by showing that it can easily and quickly model, simulate and compare a wide range of runtime mapping heuristics from various domains. A case study with a Motion-JPEG (MJPEG) application demonstrates that the presented model can be efficiently used to model and simulate a wide variety of mapping heuristics as well as to perform runtime exploration of various non-functional design parameters such as execution time, number of reconfigurations, area usage, etc.
Due to the runtime flexibility offered by field programmable gate arrays (FPGAs), FPGAs are popular devices for stream processing systems, since many stream processing applications require runtime adaptability (i.e. throughput, data transformations, etc.). FPGAs can offer this adaptability through runtime assembly of stream processing systems that are decomposed into hardware modules. Runtime hardware module assembly consists of dynamic hardware module replacement and hardware module communication reconfiguration. In this paper, we architect a flexible base embedded system amenable to runtime assembly of stream processing systems using custom communication architecture with dynamic streaming channel establishment between hardware modules. We present a hardware module swapping methodology that replaces hardware modules without stream processing interruption. Finally, we formulate two design flows, system and application construction, to provide system and application designer assistance.
Parallel programming techniques have become one of the great challenges in the transition from single-core to multicore architectures. In this paper, we investigate the parallelization of the Montgomery multiplication, a very common and time-consuming primitive in public-key cryptography. A scalable parallel programming scheme, called pSHS, is presented to map the Montgomery multiplication to a general multicore architecture. The pSHS scheme offers a considerable speedup. Based on 2-, 4-, and 8-core systems, the speedup of a parallelized 2048-bit Montgomery multiplication is 1.98, 3.74, and 6.53, respectively. pSHS delivers stable performance, high portability, high throughput and low latency over different multicore systems. These make pSHS a good candidate for public-key software implementations, including RSA, DSA, and ECC, based on general multicore platforms. We present a detailed analysis of pSHS, and verify it on dual-core, quad-core and eight-core prototypes.
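For reference, the primitive being parallelized can be written in a few lines. The following is a sequential, textbook Montgomery multiplication on Python big integers; it is not the pSHS scheme and says nothing about how the work is split across cores. The function names and the choice R = 2**k are assumptions for illustration, and `pow(n, -1, r)` requires Python 3.8 or later.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n < 2**k, with R = 2**k."""
    assert n % 2 == 1 and n < (1 << k)
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r      # satisfies n * n_prime == -1 (mod R)
    return r, n_prime


def montgomery_multiply(a_bar, b_bar, n, k, n_prime):
    """Return a_bar * b_bar * R^-1 mod n using only shifts, masks and multiplies."""
    mask = (1 << k) - 1
    t = a_bar * b_bar
    m = ((t & mask) * n_prime) & mask   # chosen so that t + m*n is divisible by R
    u = (t + m * n) >> k
    return u - n if u >= n else u


def modmul(a, b, n, k):
    """Compute a * b mod n by converting to and from the Montgomery domain."""
    r, n_prime = montgomery_setup(n, k)
    a_bar, b_bar = (a * r) % n, (b * r) % n
    c_bar = montgomery_multiply(a_bar, b_bar, n, k, n_prime)
    return montgomery_multiply(c_bar, 1, n, k, n_prime)


if __name__ == "__main__":
    n = 1000000007
    print(modmul(123456789, 987654321, n, 32), (123456789 * 987654321) % n)
```

The conversions into and out of the Montgomery domain cost ordinary modular reductions, so the technique pays off when many multiplications are chained, as in the modular exponentiation used by RSA.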
In this paper, we present BCDL (Balanced Cell-based Dual-rail Logic), a new counter-measure against Side Channel Attacks (SCA) on cryptoprocessors implementing symmetrical algorithms on FPGA. BCDL is a DPL (Dual-rail Precharge Logic), which aims at overcoming most of the usual vulnerabilities of such counter-measures, by using specific synchronization schemes, while maintaining a reasonable complexity. We compare our architecture in terms of complexity, performance and ease of design with other DPLs (WDDL, IWDDL, MDPL, iMDPL, STTL, DRSL, SecLib). It is shown that BCDL can be optimized to achieve higher performance than any other DPL (more than 1/2 times the nominal data rate) with an affordable complexity. Finally, we implement a BCDL AES on an FPGA and compare its robustness against DPA by using the number of Measurements To Disclosure (MTD) required to find the key with regard to unprotected AES. It is observed that the SCA on a BCDL implementation failed for 150,000 power consumption traces, which represents a gain greater than 20 w.r.t. the unprotected version. Moreover, the fault attack study has pointed out the natural resistance of BCDL against simple fault attacks.
Keywords: Side Channel Attacks, Dual-rail Precharge Logic, Synchronization, Differential Power Analysis, FPGA.
For any computing system to be secure, both hardware and software have to be trusted. If the hardware layer in a secure system is compromised, not only would it be possible to extract secret information about the software, but it would also be extremely hard for the software to detect that an attack is underway. In this work we detail a complete end-to-end fault-attack on a microprocessor system and practically demonstrate how hardware vulnerabilities can be exploited to target secure systems. We developed a theoretical attack on the RSA signature algorithm, and we realized it in practice against an FPGA implementation of the system under attack. To perpetrate the attack, we inject transient faults in the target machine by regulating the voltage supply of the system. Thus, our attack does not require access to the victim system's internal components, but simply proximity to it. The paper makes three important contributions: first, we develop a systematic fault-based attack on the modular exponentiation algorithm for RSA. Second, we expose and exploit a severe flaw in the implementation of the RSA signature algorithm in OpenSSL, a widely used package for SSL encryption and authentication. Third, we report on the first physical demonstration of a fault-based security attack on a complete microprocessor system running unmodified production software: we attack the original OpenSSL authentication library running on a SPARC Linux system implemented on FPGA, and extract the system's 1024-bit RSA private key in approximately 100 hours.
An increasing concern amongst designers and integrators of military and defense-related systems is the underlying security of the individual microprocessor components that make up these systems. Malicious circuitry can be inserted and hidden at several stages of the design process through the use of third-party Intellectual Property (IP), design tools, and manufacturing facilities. Such hardware Trojan circuitry has been shown to be capable of shutting down the main processor after a random number of cycles, broadcasting sensitive information over the bus, and bypassing software authentication mechanisms. In this work, we propose an architecture that can prevent information leakage due to such malicious hardware. Our technique is based on guaranteeing certain behavior in the memory system, which will be checked at an external guardian core that "approves" each memory request. By sitting between off-chip memory and the main core, the guardian core can monitor bus activity and verify the compiler-defined correctness of all memory writes. Experimental results on a conventional x86 platform demonstrate that application binaries can be statically re-instrumented to coordinate with the guardian core to monitor off-chip access, resulting in less than 60% overhead for the majority of the studied benchmarks.
Systems based on satellite localization are enabling new scenarios for road charging schemes by offering the possibility to charge drivers as a function of their road usage. An in-vehicle installation of a black box with the capabilities of a Location Based Service terminal suffices to deploy such a scheme. In the most straightforward architecture, a back-end server collects the vehicle's location data in order to extract the correct fees. However, with industry, governments and users being more and more aware of privacy issues, the deployment of such a system seems to be contradictory. Our contribution is the demonstration of a practical and functional road charging system based on PriPAYD. Our black box is built guaranteeing most of the processing of location data in real-time, thus minimizing overheads required to ensure security and privacy. The performance of our software-based prototype is tested and proves that the deployment of a privacy-friendly solution can be achieved within a minimum cost increment compared to existing road charging schemes.
Various X-filling methods have been proposed for reducing the shift and/or capture power in scan testing. The main drawback of these methods is that X-filling for low power leads to lower defect coverage than random-fill. We propose a unified low-power and defect-aware X-filling method for scan testing. The proposed method reduces shift power under constraints on the peak power during response capture, and the power reduction is comparable to that for the Fill-Adjacent X-filling method. At the same time, this approach provides high defect coverage, which approaches and in many cases is higher than that for random-fill, without increasing the pattern count. The advantages of the proposed method are demonstrated with simulation results for the largest ISCAS and the IWLS benchmark circuits.
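As a point of reference for the Fill-Adjacent baseline named above, the following is a minimal sketch of adjacent filling on a single scan pattern. It is not the proposed defect-aware method. The function names are hypothetical, and the rule used for leading X bits (copy the first care bit, or a default value if the pattern has none) is an assumption made for illustration.

```python
def fill_adjacent(pattern: str, default: str = "0") -> str:
    """Replace every 'X' with the nearest care bit to its left.

    Assumption: leading X bits copy the first care bit in the pattern,
    or `default` when the pattern contains no care bit at all.
    """
    bits = list(pattern)
    last = next((b for b in bits if b in "01"), default)
    for i, b in enumerate(bits):
        if b in "01":
            last = b          # remember the most recent care bit
        else:
            bits[i] = last    # fill the don't-care with it
    return "".join(bits)


def transitions(pattern: str) -> int:
    """Count adjacent-bit toggles, a simple proxy for shift power."""
    return sum(a != b for a, b in zip(pattern, pattern[1:]))


filled = fill_adjacent("X1XX0X1X")
print(filled, transitions(filled))
```

Adjacent filling keeps long runs of equal bits, which is why it lowers the number of toggles shifted through the scan chain.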
In this paper, a new very fast fault simulation method to handle the X-fault model is proposed. The method is based on a two-phase procedure. In the first phase, a parallel exact critical path fault tracing is used to determine all the detected stuck-at faults in the circuit, and in the second phase a postprocess is launched which will determine the detectability of X-faults.
Keywords-digital circuits; fault simulation; X-fault model; parallel exact critical path fault tracing
We propose a multiple-fault diagnosis method with high diagnosability, resolution, first-hit and short run time. The method makes no assumption on fault models, and thus can diagnose arbitrary faults. To cope with the multiple-fault mask and reinforcement effect, two key techniques of construction and scoring of fault-tuple equivalence trees are introduced to choose and rank the final candidate locations. Experimental results show that, when the circuits have 2 arbitrary faults, the average diagnosability and resolution are 98% and 0.95, respectively, with the best case 100% and 1.00. Moreover, on average, even when 21 arbitrary faults exist, our method can still identify 93% of them with a resolution of 0.78, increased by 41% and 39% in comparison with the latest work, where the diagnosability and resolution are 66% and 0.56. Finally, 96% of our top-ranked candidate locations are actual fault locations.
Keywords-multiple arbitrary faults; diagnosis; mask and reinforcement effect; fault-tuple equivalence tree
Modern automotive and aerospace embedded applications require very high-performance simulations that are able to produce new values every microsecond. Simulations must now rely on scalable performance of multi-core systems rather than faster clock frequencies. Novel parallelization techniques are needed to satisfy the industrial simulation demands that are essential for the development of safety-critical systems. Simulink formalism is the industrial de facto standard, but current state-of-the-art simulation and code generation techniques fail to fully exploit the parallelism in modern multi-core systems. However, closed-loop and dynamic system simulations are very difficult to parallelize because of the loop-carried dependencies. In this paper we introduce a novel skewed pipelining technique that overcomes these difficulties and allows loop-carried Simulink applications to be executed concurrently in multi-core systems. By delaying the forwarding of values for a few iterations, we can break some data dependencies and coarsen the granularity of programs. This improves the concurrency and reduces the high cost of inter-processor communication. Implementation studies to demonstrate the viability of our method on a commodity multicore system with 2, 3, and 4 processors show a 1.72, 2.38, and 3.33 fold speedup over uniprocessor execution.
Our work focuses on allocating and scheduling a synchronous data-flow (SDF) graph onto a multi-core platform subject to a minimum throughput requirement. This problem has traditionally been tackled by incomplete approaches based on problem decomposition and local search, which could not guarantee optimality. Exact algorithms used to be considered reasonable only for small problem instances. We propose a complete algorithm based on Constraint Programming which solves the allocation and scheduling problem as a whole. We introduce a number of search acceleration techniques that significantly reduce run-time by aggressively pruning the search space without compromising optimality. The solver has been tested on a number of non-trivial instances and demonstrated promising run-times on SDFGs of practical size and one order of magnitude speed-up w.r.t. the fastest known complete approach.
Integration of system components is a crucial challenge in the design of embedded real-time systems, as complex non-functional interdependencies may exist. We propose a software update service with self-protection capabilities against unverified system updates - thus solving the integration problem in-system. As modern embedded systems may evolve through software updates, component replacement or even self-optimization, possible system configurations are hard to predict. Thus the designer of system updates does not know the exact system configuration. This turns the proof of system feasibility into a critical challenge. This paper presents the architecture of a framework and associated protocols enabling updates in embedded systems while ensuring safe operation w.r.t. non-functional properties. The proposed process employs contract based principles at the interfaces towards applications to perform an in-system verification. Practical feasibility of our approach is demonstrated by an implementation of the update process, which is analyzed w.r.t. the memory consumption overhead and execution time.
This paper suggests a new approach for bitstream processing of embedded systems, combining C++ metaprogramming with architecture extensions of a customizable embedded processor. Firstly, by using C++ metaprogramming techniques, we are able to code application software that needs to manipulate bitstreams in a very compact manner. Secondly, by using the architecture extensions of the Tensilica embedded processor indirectly via C++ operator overloading, the application code can seamlessly exploit custom architecture extensions. The intention is to do bitstream related processing with low programming effort, while generating runtime efficient code. Compared to other bitstream processing approaches, we require no compiler modifications to exploit custom architecture features. Rather, we put the bitstream related manipulation functionality into an active library, generated by a C++ metaprogram.
The introduction of Phase-Change Memory (PCM) as a main memory technology has great potential to achieve a large energy reduction. PCM has desirable energy and scalability properties, but its use for main memory also poses challenges such as limited write endurance with at most 10^7 writes per bit cell before failure. This paper describes techniques to enhance the lifetime of PCM when used for main memory. Our techniques are (a) writeback minimization with new cache replacement policies, (b) avoidance of unnecessary writes, which write only the bit cells that are actually changed, and (c) endurance management with a novel PCM-aware swap algorithm for wear-leveling. A failure detection algorithm is also incorporated to improve the reliability of PCM. With these approaches, the lifetime of a PCM main memory is increased from just a few days to over 8 years.
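Technique (b), writing only the bit cells that actually change, can be illustrated with a read-compare-write step. This is a minimal sketch under assumptions, not the paper's implementation: the function name, the word width and the integer model of a memory word are all hypothetical.

```python
from typing import Tuple


def differential_write(stored: int, new: int, width: int = 8) -> Tuple[int, int]:
    """Model a write that programs only the cells whose value differs.

    Returns the new word contents and the number of bit cells that
    were actually programmed (and therefore wear out).
    """
    mask = (1 << width) - 1
    changed = (stored ^ new) & mask          # 1 where old and new bits differ
    cells_written = bin(changed).count("1")  # only these cells are programmed
    return new & mask, cells_written


word, wear = differential_write(0b10101010, 0b10101110)
print(f"{word:08b}", wear)
```

A conventional write would program all `width` cells. Here the wear is proportional to the number of differing bits, and it is zero when the same value is written back.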
We consider the problem of on-chip L2 cache management and replacement policies. We propose a new adaptive cache replacement policy, called Dueling CLOCK (DC), that has several advantages over the Least Recently Used (LRU) cache replacement policy. LRU's strength is that it keeps track of the "recency" information of memory accesses. However, a) LRU has a high overhead cost of moving cache blocks into the most recently used position each time a cache block is accessed; b) LRU does not exploit "frequency" information of memory accesses; and, c) LRU is prone to cache pollution when a sequence of single-use memory accesses that is larger than the cache size is fetched from memory (i.e., it is not scan resistant). The DC policy was developed to have low overhead cost, to capture "recency" information in memory accesses, to exploit the "frequency" pattern of memory accesses and to be scan resistant. In this paper, we propose a hardware implementation of the CLOCK algorithm for use within an on-chip cache controller to ensure low overhead cost. We then present the DC policy, which is an adaptive replacement policy that alternates between the CLOCK algorithm and the scan resistant version of the CLOCK algorithm. We present experimental results showing the MPKI (misses per thousand instructions) comparison of DC against existing replacement policies, such as LRU. The results for an 8-way 1MB L2 cache show that DC can lower the MPKI of the SPEC CPU2000 benchmarks by an average of 10.6% when compared to the tree based Pseudo-LRU cache replacement policy.
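For readers unfamiliar with the basic CLOCK algorithm on which DC builds, here is a minimal software sketch. It does not model the adaptive DC policy, the scan-resistant variant or the proposed hardware implementation. The class and attribute names are hypothetical, and inserting new blocks with a cleared reference bit is an assumption (some CLOCK variants set it on insertion).

```python
class ClockCache:
    """Basic CLOCK replacement over a fixed number of frames."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.frames = [None] * capacity   # block tag held by each frame
        self.ref = [False] * capacity     # one reference bit per frame
        self.index = {}                   # tag -> frame number
        self.hand = 0                     # the clock hand

    def access(self, tag) -> bool:
        """Access a block; return True on a hit, False on a miss."""
        slot = self.index.get(tag)
        if slot is not None:
            self.ref[slot] = True         # a hit only sets a bit, no reordering
            return True
        # Miss: sweep the hand, clearing reference bits, until a frame
        # with a cleared bit is found. That frame is the victim.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % self.capacity
        victim = self.frames[self.hand]
        if victim is not None:
            del self.index[victim]
        self.frames[self.hand] = tag
        self.index[tag] = self.hand
        self.ref[self.hand] = False       # assumption: inserted with bit cleared
        self.hand = (self.hand + 1) % self.capacity
        return False


cache = ClockCache(2)
print([cache.access(t) for t in "ABACA"])
```

In this trace, block A is referenced again before the miss on C, so the hand clears A's bit and evicts B instead. The hit path touches a single bit, which is the low-overhead property contrasted with LRU above.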
Ternary content addressable memories (TCAMs) are becoming very popular due to the simple-to-design IP lookup units included in high-speed routers; they are fast and simple to manage, and they provide a one-clock lookup solution. However, a major drawback of TCAM-based IP lookup schemes lies in their high power consumption. Thus, the rapid increase of routing tables inevitably deteriorates TCAM power efficiency. Although on-chip TCAM minimizers aim to improve TCAM power efficiency quickly and with a small amount of memory, the minimizers are not efficient for a large-scale prefix table. In this paper, we present a hash-based on-chip TCAM minimization for a power and throughput-efficient IP lookup. In a hash-based TCAM minimization (HTM), we convert prefixes into keys and merge keys with a fast hash lookup in O(nW) complexity, where n is the number of prefixes and W is the number of IP bits. Additionally, by building a forest of merging trees and choosing a subset among them, we can achieve a higher minimization ratio. The simulation with two routing tables shows that our HTM scheme uses 8.6 times less computation time and 4.0 times less memory than a contemporary on-chip minimizer.
As pointed out in the ITRS roadmap, the level of embedded software complexity is greater than the pure HW complexity of SoCs when comparing, for example, lines of HDL and C code. Even worse, SW complexity grows faster than HW complexity (Moore's Law) and SW productivity increases more slowly than HW productivity. A new design gap - the gap of embedded software - has appeared. EDA has now identified ESL as a field with sufficient revenue and revenue growth, but does it really approach the SW productivity challenge? This panel gives the answer by presenting recent EDA products and solutions in the area of ESW, contrasting them with needs of industry, and discussing ways out of the ESL productivity crisis.
In this paper, we propose a binary-tree waveguide connected Optical-Network-on-Chip (ONoC) to accelerate the establishment of the lightpath. By broadcasting the control data in the proposed power-efficient binary-tree waveguide, the maximum number of hops for establishing a lightpath is reduced to two. With extensive simulations and analysis, we demonstrate that the proposed ONoC significantly reduces the setup time, and in turn the packet latency.
This work presents a High-Voltage Low-Power CMOS DC-DC buck regulator for automotive applications. The overall system, including the high and low voltage analog devices, the power MOS and the low voltage digital devices, was realized in the Austriamicrosystems 0.35 μm HVCMOS technology, resulting in a 6.5 mm² die. The regulator is able to manage a supply voltage down to 4.5 V and up to 50 V and generates a fixed regulated output voltage of 5 V or a variable one in the whole automotive temperature range. The regulator sinks only a maximum of 1.8 μA of current in standby mode and a maximum of 25 μA when no load is connected. It can be used to supply low voltage devices from the battery when low power dissipation and low current consumption are needed. The system output current can be selected in the range 350-700 mA. When a higher output current is needed, it is possible to connect more regulators in parallel, multiplying the output current without any problem.
Keywords: DC-DC regulator; buck converter; pulse frequency modulation; current control; low quiescent current.
Though tag bits in the data caches are vulnerable to transient errors, little effort has been made to reduce their vulnerability. In this paper, we propose to exploit prevalent same tag bits to improve the error protection capability of the tag bits in the data caches. When data are fetched from the main memory, it is checked whether adjacent cache lines have the same tag bits as those of the data fetched. This similarity information is stored in the data caches as extra bits to be used later. When an error is detected in the tag bits, the similarity information is used to recover from the error in the tag bits. The proposed scheme has small area, energy, and performance overheads with error protection coverage of 97.9% on average. In contrast, the previously proposed In-Cache Replication scheme is shown to incur large performance and energy overheads.
Technology scaling requires lowering Vcc due to power constraints. Unfortunately, permanent faulty bit rates grow due to the higher impact of process variations at low Vcc, especially in the register file, whose critical timing limits circuit optimizations. This paper proposes a novel register file design based on splitting registers and discarding faulty blocks to increase the number of registers available. By increasing the number of registers available, higher performance can be obtained, and yield increases because a larger number of processors reach the minimum number of registers required to operate.
Synchronous languages offer a deterministic model of concurrency at the level of actions. However, essentially all compilers for synchronous languages compile these actions into a single thread by sophisticated methods to guarantee dynamic schedules for the sequential execution of these actions. In this paper, we present the compilation of synchronous programs to multi-threaded OpenMP-based C programs. We thereby start at the level of synchronous guarded actions which is a comfortable intermediate language for synchronous languages. In addition to the explicit parallelism given in the source program, our method also exploits the implicit parallelism which is due to the underlying synchronous model of computation and the data dependencies of the guarded actions. We show how viable tasks can be constructed from the actions of a program and show the feasibility of our approach by a small example.
This paper addresses the problem of execution time estimation for tasks in a software pipeline, independent of the application structure or the underlying architecture. A regression model is developed to obtain the estimates from previously observed data. To improve the quality of the estimates, the execution times of predecessor tasks in the software pipeline are exploited. Since the model order (the number of past observations required to obtain an optimal estimate) cannot be determined at design time, we propose means to update the order dynamically and hence obtain a critical-fit model without resorting to analytical benchmarking or calibration runs. The estimation scheme comprises two estimation methods, namely "Wiener-Hopf" and order-recursive estimation. The selection of the estimation method is automatic and depends on the required quality of the estimate against a user-selectable threshold. In order recursion, a new model order is obtained in conjunction with the estimates, so order recursion solves the system for order and estimate simultaneously. We experimented on two multicore platforms using an H.264 decoder, a control-dominant, computationally demanding application. Results show that estimates obtained by our method are up to 39% better for the first task in the software pipeline. The estimate quality improves significantly for tasks with predecessor(s) in the pipeline, and comparison shows up to 54% improvement in estimation results.
Error tolerance formally captures the notion that - for a wide variety of applications including audio, video, graphics, and wireless communications - a defective chip that produces erroneous values at its outputs may be acceptable, provided the errors are of certain types and their severities are within application-specified thresholds. All previous research on error tolerance has focused on identifying such defective but acceptable chips during post-fabrication testing to improve yield. In this paper, we explore a completely new approach to exploit error tolerance based on the following observation: If certain deviations from the nominal output values are acceptable, then we can exploit this flexibility during circuit design to reduce circuit area and delay as well as to increase yield. The specific metric of error tolerance we focus on is error rate, i.e., how often the circuit produces erroneous outputs. We propose a new logic synthesis approach for the new problem of identifying how to exploit a given error rate threshold to maximally reduce the area of the synthesized circuit. Experimental results show that for an error rate threshold within 1%, our approach provides a 9.43% literal reduction on average for all the benchmarks that we target.
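The error rate metric can be made concrete by comparing an exact function with an approximate one over all input vectors. The sketch below is a toy illustration, not the synthesis approach itself: the two example functions and all names are hypothetical, and weighting every input vector equally is an assumption.

```python
from itertools import product


def error_rate(exact, approx, n_inputs: int) -> float:
    """Fraction of input vectors on which the two functions disagree.

    Assumption: all 2**n_inputs input vectors are equally likely.
    """
    vectors = list(product((0, 1), repeat=n_inputs))
    mismatches = sum(exact(*v) != approx(*v) for v in vectors)
    return mismatches / len(vectors)


def exact(a, b, c, d):
    return (a & b) | (b & c & d)   # 5 literals


def approx(a, b, c, d):
    return a & b                   # 2 literals: one product term dropped


print(error_rate(exact, approx, 4))
```

Dropping the term `b & c & d` changes the output only for the input a=0, b=1, c=1, d=1, so the toy approximation saves three literals at an error rate of 1 in 16. A synthesis flow would accept such a simplification only if the resulting rate stayed within the given threshold.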
Keywords - error tolerance, logic synthesis, approximate logic function, functional yield
This paper presents a method for automatic microarchitectural pipelining of systems with loops. The original specification is pipelined by performing provably-correct transformations including conversion to a synchronous elastic form, early evaluation, inserting empty buffers, anti-tokens, and retiming. The design exploration is done by solving an optimization problem followed by simulation of solutions. The method is explained on a DLX microprocessor example. The impact of different microarchitectural parameters on the performance is analyzed.
For CMOS feature size of 65 nm and below, local (or intra-die or within-die) variations in transistor Vt contribute stochastic variation in logic delay that is a large percentage of the nominal delay. Moreover, when circuits are operated at low voltage (Vdd ≤ 0.5V), the standard deviation of gate delay becomes comparable to nominal delay, and the Probability Density Function (PDF) of the gate delay is highly non-Gaussian. This paper presents a computationally efficient algorithm for computing the PDF of logic Timing Path (TP) delay, which results from local variations. This approach is called Non-linear Operating Point Analysis for Local Variations (NLOPALV). The approach is implemented using commercial STA tools and integrated into the standard CAD flow using custom scripts. Timing paths from a 28nm commercial DSP are analyzed using the proposed technique and the performance is observed to be within 5% accuracy compared to SPICE based Monte-Carlo analysis.
Keywords- SSTA, Local Variations, Low-voltage, Statistical Design
This paper presents dynamic reconfiguration of a register file of a Very Long Instruction Word (VLIW) processor implemented on an FPGA. We developed an open-source reconfigurable and parameterizable VLIW processor core based on the VLIW Example (VEX) Instruction Set Architecture (ISA), capable of supporting reconfigurable operations as well. The VEX architecture supports up to 64 multiported shared registers in a register file for a single cluster VLIW processor. This register file accounts for a considerable amount of area in terms of slices when the VLIW processor is implemented on an FPGA. Our processor design supports dynamic partial reconfiguration, allowing the creation of dedicated register file sizes for different applications. Therefore, valuable area can be freed and utilized for other implementations running on the same FPGA when the full register file is not needed. Our design requires 924 slices on a Xilinx Virtex-II Pro device for dynamically placing a chunk of 8 registers, and places registers in multiples of 8 to simplify the design. Consequently, when 64 registers are not needed at all times, the area utilization can be reduced during run-time.
In conventional static implementations for correlated streaming applications, computing resources may be inefficiently utilized since multiple stream processors may supply their sub-results at asynchronous rates for result correlation or synchronization. To enhance the resource utilization efficiency, we analyze multi-streaming models and implement an adaptive architecture based on FPGA Partial Reconfiguration (PR) technology. The adaptive system can intelligently schedule and manage various processing modules during run-time. Experimental results demonstrate up to 78.2% improvement in throughput-per-unit-area on unbalanced processing of correlated streams, as well as only 0.3% context switching overhead in the overall processing time in the worst-case.
Electromagnetic analysis is an important class of attacks against cryptographic devices. In this article, we prove that Correlation-based ElectroMagnetic Analysis (CEMA) on a hardware-based high-performance AES module is possible from a distance as far as 50 cm. First we show that the signal-to-noise ratio (SNR) tends to a non-zero limit when moving the antenna away from the cryptographic device. An analysis of the leakage structure shows that the Hamming distance model, although suitable for small distances, gets more and more distorted when the antenna is displaced far from the device. As we cannot devise any physical model that would predict the observations, we instead pre-characterized it using a first-order template construction. With this model, we enhanced the CEMA by a factor of up to ten. Therefore, we conclude that EMA at large distance is feasible with our amplification strategy coupled to an innovative training phase aiming at pre-characterizing accurate coefficients of a parametric weighted distance leakage model.
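The unweighted Hamming distance model that serves as the starting point above can be sketched as follows. This is a toy illustration of a correlation analysis step, not the paper's weighted model or template construction: the register values are made up, the "measured" samples are synthetic (an exact affine function of the prediction, with no noise), and all names are hypothetical.

```python
import math


def hamming_distance(before: int, after: int) -> int:
    """Number of bits that toggle when a register goes from before to after."""
    return bin(before ^ after).count("1")


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Hypothetical register transitions (state before, state after).
transitions = [(0x00, 0xFF), (0x0F, 0x0E), (0xA5, 0x5A), (0x12, 0x13), (0x3C, 0xC3)]
predicted = [hamming_distance(a, b) for a, b in transitions]

# Synthetic, noise-free stand-in for measured leakage samples.
measured = [2.0 + 0.5 * p for p in predicted]

print(predicted, pearson(predicted, measured))
```

In a real analysis the prediction is computed for each key hypothesis and the hypothesis with the highest correlation against the measured traces is retained. A weighted distance model would replace the plain bit count with per-bit coefficients.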
Keywords: Side-Channel Attacks (SCA), ElectroMagnetic Analysis (EMA), Correlation EMA (CEMA), Leakage model, Template estimation.
Messerges, Dabbish and Sloan proposed a DPA attack which analyzes the address values of registers. This attack is called the Address-bit DPA (ADPA) attack. As countermeasures against ADPA, Itoh, Izu and Takenaka proposed algorithms that randomize address bits. In this paper, we point out that one of their countermeasures has a vulnerability even if the address bits are uniformly randomized. When a register is overwritten by the same data as that already stored in the register during a data move process, the power consumption is lower than when it is overwritten by different data. This fact enables us to separate the power traces. As a result, in the case of that algorithm, we could invalidate the randomness of the random bits and perform ADPA to retrieve a secret key. Moreover, for the purpose of overcoming the vulnerability, we propose a new countermeasure algorithm.
Bug-free first silicon is not guaranteed by the existing pre-silicon verification techniques. To have impeccable products, it is now required to identify any bug as soon as the first silicon becomes available. We consider the Assertion Based Verification techniques for the post-silicon debugging based on the insertion of hardware checkers in the debug infrastructure for complex systems on chip. This paper proposes a method to cluster hardware-assertion checkers using the graph partitioning approach. It turns out that having the clusters of hardware-assertions and controlling each cluster selectively during the debug mode and normal operation of the circuit makes integration of assertions inside the circuits easier, and causes lower energy consumption and efficient debug scheduling.
In this paper we provide an overview of CPM, a cross-layer framework for Constrained Power Management, and we present its application on a real use case. This framework involves different layers of a typical embedded system, ranging from device drivers to applications. The main goals of CPM are (i) to aggregate applications' QoS requirements and (ii) to exploit them to support an efficient coordination between different drivers' local optimization policies. This role is supported by a system-wide and multi-objective optimization policy which could be also changed at run-time. In this paper we mostly focus on a real use case to show the very low overhead of CPM both on the management of QoS requirements and on the tracking of hardware cross-dependencies, which cannot be directly considered by local optimization policies.
To overcome issues originating from CMOS technology, a large-scale reconfigurable data-path (LSRDP) processor based on single-flux quantum circuits is introduced. The LSRDP augments a general purpose processor to accelerate the execution of data flow graphs (DFGs) extracted from scientific applications. The procedure for mapping large DFGs onto the LSRDP is discussed, and our proposed techniques for reducing the area of the accelerator within the design procedure are introduced as well.
Keywords- Reconfigurable accelerator; single-flux quantum circuit; data flow graph; placement and routing
Due to the ever increasing number of microprocessors which can be integrated in very large systems on chip, the need for robust, easily modifiable microprocessors has emerged. Within this paper a light-weight cycle compatible implementation of the MicroBlaze architecture called MB-LITE is presented in an attempt to fill the gap in quality between commercial and open source processors. Experimental results showed that MB-LITE obtains very high performance compared with other open source processors while using very few hardware resources. The microprocessor can be easily extended with existing IP thanks to an easily configurable data memory bus and a wishbone bus adapter. All components are modular to optimize design reuse and are developed using a two-process design methodology for improved performance, simulation and synthesis speeds. All components have been thoroughly tested and verified on an FPGA. Currently an architecture with four MB-LITE cores in a NoC architecture is in development, which will be implemented in 90nm process technology.
We present a transactional datapath specification (T-spec) and the tool (T-piper) to automatically synthesize an in-order pipelined implementation from it. T-spec abstractly views a datapath as executing one transaction at a time, computing next system states based on current ones. From a T-spec, T-piper can synthesize a pipelined implementation that preserves original transaction semantics, while allowing simultaneous execution of multiple overlapped transactions across pipeline stages. T-piper not only ensures the correctness of pipelined executions, but can also employ forwarding and speculation to minimize performance loss due to data dependencies. Design case studies on RISC and CISC processor pipeline development are reported.
This presentation will discuss the challenges of reducing the power consumption of modern PC systems. While efforts are made on all levels from the silicon to the software, PC systems still use 2 to 3 orders of magnitude more power than mobile devices. The presentation will outline some of the reasons for this and how this problem is being addressed on the PC architecture side as well as from the operating system side. It will also discuss the tradeoffs that have to be made. Peak performance and low idle power consumption are two contradicting design goals. The growing number of CPU cores and new mechanisms to turn off unused units create new challenges for the OS. The talk will discuss some strategies developed by AMD to address this.
The challenges of managing energy efficiency and performance in SoC designs often result in sleepless nights searching for a solution. Today processors have to deliver more computational power, while maintaining flexibility and delivering the lowest power envelope simultaneously. These requirements are fundamentally contradictory in nature. This paradox keeps designers up at night trying to develop the perfect tradeoff between energy efficiency, flexibility and performance. This session will discuss some of the latest and most advanced techniques in power and performance management. Topics will include clocking, controlling idle and active power, optimizing data pipelines, hardware accelerators, advancements in microprocessor architecture and utilizing optimized libraries, manufacturing process, tools and design flow to ensure optimal power and performance ratios. The challenges of low-power microprocessor design are unique in the sense that significant power savings are desired with little or no performance and area impact. ARM has developed a series of optimized processor architectures and optimized libraries to provide the highest level of flexibility in meeting your area, performance and power requirements. In addition, they have worked with leading EDA companies, Foundries and Silicon Manufacturers to develop a complete design solution. This session will reference the ARM Cortex™ processor family and optimized ARM libraries to demonstrate the best-in-class strategies for designing optimal low power, high performance consumer devices.
Variations of process parameters have an important impact on reliability and yield in deep sub-micron IC technologies. One methodology to estimate the influence of these effects on power and delay times at chip level is Monte Carlo simulation, which can be very accurate but time consuming if applied to transistor-level models. We present an alternative approach, namely a statistical gate-level simulation flow, based on parameter sensitivities and a generated VHDL cell model. This solution provides a good speed/accuracy tradeoff by using the event-driven digital simulation domain together with an extended consideration of signal slope times directly in the cell model. The designer gets a fast and accurate overview of the statistical behavior of power consumption and timing of the circuit depending on the manufacturing variations. The paper briefly illustrates the general flow from cell characterization to the model structure and presents first simulation results.
Keywords: Simulation, digital IC design, statistical timing analysis, statistical power analysis
Technology scaling has an increasing impact on the resilience of CMOS circuits. This outcome is the result of (a) increasing sensitivity to various intrinsic and extrinsic noise sources as circuits shrink, and (b) a corresponding increase in parametric variability causing behavior similar to what would be expected with hard (topological) faults. This paper examines the issue of circuit resilience, then proposes and demonstrates a roadmap for evaluating fault rates starting at the 45nm and going down to the 12nm nodes. The complete infrastructure necessary to make these predictions is placed in the open source domain, with the hope that it will invigorate research in this area.
We are rapidly approaching an inflection point where the conventional target of producing perfect, identical transistors that operate without upset can no longer be maintained while continuing to reduce the energy per operation. With power requirements already limiting chip performance, continuing to demand perfect, upset-free transistors would mean the end of scaling benefits. The big challenges in device variability and reliability are driven by uncommon tails in distributions, infrequent upsets, one-size-fits-all technology requirements, and a lack of information about the context of each operation. Solutions co-designed across traditional layer boundaries in our system stack can change the game, allowing architecture and software (a) to compensate for uncommon variation, environments, and events, (b) to pass down invariants and requirements for the computation, and (c) to monitor the health of collections of devices. Cross-layer codesign provides a path to continue extracting benefits from further scaled technologies despite the fact that they may be less predictable and more variable. While some limited multi-layer mitigation strategies do exist, moving forward requires redefining traditional layer abstractions and developing a framework that facilitates cross-layer collaboration.
Current electronic systems implement reliability using only a few layers of the system stack, which simplifies the design of other layers but is becoming increasingly expensive over time. In contrast, cross-layer resilient systems, which distribute the responsibility for tolerating errors, device variation, and aging across the system stack, have the potential to provide the resilience required to implement reliable, high-performance, low-power systems in future fabrication processes at significantly lower cost. These systems can implement less-frequent resilience tasks in software to save power and chip area, can tune their reliability guarantees to the needs of applications, and can use the information available at each level in the system stack to optimize performance and power consumption. In this paper, we outline an approach to cross-layer system design that describes resilience as a set of tasks that systems must perform in order to detect and tolerate errors and variation. We then present strawman examples of how this task-based design process could be used to implement general-purpose computing and SoC systems, drawing on previous work and identifying key areas for future research.
With increasing sources of disturbances in the underlying hardware, the most significant challenge in design of robust systems is to meet user expectations at very low cost. Cross-layer resilience techniques, implemented across multiple layers of the system stack and designed to work together, can potentially enable effective robust system design at low cost. This paper brings to the forefront two major cross-layer resilience challenges: 1. Quantification and validation of the effectiveness of a cross-layer resilience approach to robust system design in overcoming hardware reliability challenges. 2. Global optimization of a robust system design using cross-layer resilience techniques.
We present a Pareto-efficient design method for multi-dimensional optimization of run-time reconfigurable streaming applications on CPU/FPGA platforms, which automatically allocates applications with optimized buffer requirements and software/hardware implementation cost. At the same time, application performance is guaranteed with sustainable throughput during run-time reconfigurations. As the main contribution, we formulate the constraint-based application allocation, scheduling, and reconfiguration analysis, and propose a design Pareto-point calculation flow. A public-domain solver, Gecode, is used to find solutions. The capability of our method has been exemplified by two case studies on applications from the media and communication domains.
Media processing systems often have limited resources and strict performance requirements. An implementation must meet those design constraints while minimizing resource usage and energy consumption. Design-space exploration techniques help system designers to pinpoint bottlenecks in a system for a given configuration. The trade-offs between performance and resources in the design space can guide designers to tailor and tune the system. Many applications in those systems are computationally intensive and can be modeled by a synchronous dataflow graph. We present a bottleneck-analysis-driven technique to explore the design space of those systems automatically and incrementally. The feasibility and efficiency of the technique are demonstrated with experiments on a set of realistic application models ranging from multimedia to digital printing.
Index Terms - Synchronous dataflow, Design-space exploration, Bottleneck identification
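As background for the synchronous dataflow (SDF) model mentioned above, the following is a minimal sketch of how the repetition vector of an SDF graph can be derived from its balance equations. It is not the bottleneck-analysis technique of the paper; the function name, the edge tuple layout, and the example graph are assumptions made for illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Solve the SDF balance equations q[src] * produced == q[dst] * consumed.

    actors: list of actor names (hypothetical example input).
    edges:  list of (src, dst, produced_per_firing, consumed_per_firing).
    Returns the smallest positive integer firing counts per actor.
    Raises ValueError for an inconsistent or disconnected graph.
    """
    rate = {actors[0]: Fraction(1)}  # relative firing rate of each actor
    pending = [actors[0]]
    while pending:
        current = pending.pop()
        for src, dst, produced, consumed in edges:
            if src == current:
                other, value = dst, rate[current] * produced / consumed
            elif dst == current:
                other, value = src, rate[current] * consumed / produced
            else:
                continue
            if other not in rate:
                rate[other] = value
                pending.append(other)
            elif rate[other] != value:
                raise ValueError("inconsistent SDF graph: no valid repetition vector")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the rational rates to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {actor: int(r * scale) for actor, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {actor: c // common for actor, c in counts.items()}


# Hypothetical three-actor chain: A produces 2 tokens, B consumes 3, and so on.
example = repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)])
```

In this example one graph iteration fires A three times, B twice, and C once, which is the kind of baseline information a design-space exploration would start from before sizing buffers.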
Wireless Sensor Networks are gaining more and more importance in various application fields. Often, energy autonomy on the node level is an essential nonfunctional constraint to be met. Therefore, when simulating such networks, the energy consumption on the node level has to be included into the simulation. To make this time consuming task feasible, an overall simulation speedup on the network level is desirable. In this paper, we propose to use techniques similar to those used in Transaction Level Models of Bus Systems.
Refinement of untimed TLM models into a timed HW/SW platform is a step-by-step design process that involves a tradeoff between the timing accuracy of the models used and correct estimation of the final timing performance. The use of an RTOS on the target platform is mandatory when real-time properties must be guaranteed. Thus, the question is when the RTOS must be introduced in this step-by-step refinement process. This paper proposes a four-level RTOS-aware refinement methodology that, starting from an untimed TLM SystemC description of the whole system, progressively introduces HW/SW partitioning, timing, device driver and RTOS functionalities, until it obtains an accurate model of the final platform, where SW tasks run upon an RTOS hosted by QEMU and HW components are modeled by cycle-accurate TLM descriptions. Each refinement level allows the designer to estimate more and more accurate timing properties, thus anticipating design decisions without being constrained to leave timing analysis to the final step of the refinement. The effectiveness of the methodology has been evaluated in the design of two complex platforms.
To obtain a better trade-off between cost and security, practical DPA countermeasures are not likely to deploy full masking that uses one distinct mask bit for each signal. A common approach is to use the same mask on several instances of an algorithm. This paper proposes a novel power analysis method called Power Variance Analysis (PVA) to reveal the danger of such implementations. PVA uses the fact that the side-channel leakage of parallel circuits has a big variance when they are given the same but random inputs. This paper introduces the basic principle of PVA and a series of PVA experiments including a successful PVA attack against a prototype RSL-AES implemented on SASEBO-R.
Keywords: Side Channel Attacks, Variance, RSL, Masking
Physical Unclonable Functions (PUFs) are employed to generate unique signatures to be used for integrated circuit (IC) identification and authentication. Existing PUFs exploit only process variations for generating unique signatures. Due to the spatial correlation between process parameters, such PUFs are vulnerable to modeling or to leaking information under side-channel attacks. The PUF we present in this paper, called PE-PUF, takes into account both process and environmental variations, which magnifies chip-to-chip signature randomness and uniqueness. PE-PUF takes into account process variations, temperature, power supply noise and crosstalk; all these effects are major sources of variations and noise in integrated circuits. Designers would be able to select the PE-PUF response by applying different input patterns. Furthermore, PE-PUF imposes no routing constraints on the design. The gates in PE-PUF are distributed across the entire chip and cannot be easily identified or modeled, nor do they easily leak side-channel information. Simulation results demonstrate that each IC can be uniquely characterized by PE-PUF with a higher secrecy rate when compared to other PUFs that use only process variations.
Keywords: PUF, IC Authentication, Process Variations, Environmental Variations, Hardware Security
The design and first measurement results of an ultra-low power 12-bit Successive-Approximation ADC for autonomous multi-sensor systems are presented. The comparator and the DAC are optimised for the lowest power consumption. The proposed design has a power consumption of 0.52μW at a bit clock of 50-kHz and of 0.85μW at 100-kHz with a 1.2-V supply. As far as we know, the Figure-of-Merit of 66 fJ/conversion-step is the best reported so far. The ADC was realised in the NXP CMOS 0.14μm technology with an area of 0.35 mm². Only four metal layers were used in order to allow 3D integration of the sensors.
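For readers unfamiliar with the energy-per-conversion-step metric, ADC figures of merit of this kind are commonly computed with the Walden definition shown below. The abstract does not state which definition, effective number of bits (ENOB), or sampling rate was used, so this is background under that assumption and not a reconstruction of the reported 66 fJ value. Here P is the power consumption and f_s the sampling rate.

```latex
\mathrm{FoM} = \frac{P}{2^{\mathrm{ENOB}} \cdot f_s}
\qquad \left[\frac{\mathrm{J}}{\text{conversion-step}}\right]
```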
This paper presents a flexible architecture for an integrated Ultra-Wideband (UWB) Transmitter capable of generating pulses suited for breast cancer detection imaging systems. A flexible design allows the generation of a large variety of UWB signals fully compatible with those used in real experiments in the recent state of the art. The flexibility and high degree of programmability of the mixed-signal system also allow the non-ideal effects of building blocks to be compensated through digital calibration. It is also shown that not only internal non-idealities are accounted for, but also that channel and antenna responses can be compensated through a digital pre-emphasis of UWB pulses. The circuit is designed in a 130nm CMOS technology and simulated at transistor level. Simulations showed 2% maximum NRMSE pulse error with respect to ideal Gaussian and Modulated and Modified Hermite Polynomial (MMHP) Matlab templates.
Index Terms - Pulse-Based UWB, Breast Cancer Screening, Pulse Generator, Transmitter, AWG.
Pressure sensors are ideal candidates for implementing portable digital music instruments. Existing commercial pressure sensors, however, are not optimized to meet both timing and precision requirements for acoustic uses. In this paper, we demonstrate a portable multi-pitch electronic drum (e-drum) system based on large-area (> 15cm in diameter) ring-shaped pressure sensors made with low-cost screen-printing process. This e-drum system, which can accurately generate six different pitches of sounds in the current prototype, has the following key advantages: 1) a light-weight, flexible, bendable, and robust human-instrument interface, 2) real-time sound responses, 3) comparable acoustic sound quality with the conventional drums, and 4) easily expandable to a much larger number of sound-pitches. The digital music synthesis is implemented using a TI-DSP board and can be easily re-configured to realize other percussion instruments such as pianos and xylophones. To the best of our knowledge, this is the first successful demonstration of a portable e-drum based on large-area ring-shaped flexible sensors, whose success could open up many new applications.
For any analog integrated circuit, a simultaneous analysis of the performance trade-offs and impact of variability can be conducted by computing the Pareto front of the realizable specifications. The resulting Specification Pareto front shows the most ambitious specification combinations for a given minimum parametric yield. Recent Pareto optimization approaches compute a so-called yield-aware specification Pareto front by applying a two-step approach. First, the Pareto front is calculated for nominal conditions. Then, a subsequent analysis of the impact of variability is conducted. In the first part of this work, it is shown that such a two-step approach fails to generate the most ambitious realizable specification bounds for mismatch-sensitive performances. In the second part of this work, a novel single-step approach to compute yield-optimized specification Pareto fronts is presented. Its optimization objectives are the realizable specification bounds themselves. Experimental results show that for mismatch-sensitive performances the resulting yield-optimized specification Pareto front is superior to the yield-aware specification Pareto front.
This paper demonstrates a deterministic, variability-aware reliability modeling and simulation method. The purpose of the method is to efficiently simulate failure-time dispersion in circuits subjected to die-level stress effects. A Design of Experiments (DoE) with a quasi-linear complexity is used to build a Response Surface Model (RSM) of the time-dependent circuit behavior. This reduces simulation time, when compared to random-sampling techniques, and guarantees good coverage of the circuit factor space. The DoE consists of a linear screening design, to filter out important circuit factors, followed by a resolution 5 fractional factorial regression design to model the circuit behavior. The method is validated over a broad range of both analog and digital circuits and compared to traditional random-sampling reliability simulation techniques. It is shown to outperform existing simulators with a simulation speed improvement of up to several orders of magnitude. Also, it is proven to have a good simulation accuracy, with an average model error varying from 1.5 to 5 % over all test circuits.
Probabilistic CMOS is considered a promising technology for future generations of computing devices. By embracing possibly incorrect calculations, the technology makes it possible to trade correctness of circuit operations for potentially significant energy saving. For systematic design of probabilistic circuits, accurate mathematical models are indispensable. To this end, we propose a model of probabilistic ripple-carry adders. Compared to existing models, ours is applicable under a wide range of noise assumptions, including the popular additive-noise assumption. Our model provides recursive equations that can accurately capture propagation of carry errors. The proposed model is validated by HSPICE simulation, and we find that the model is able to predict multi-bit error-rates of a simulated probabilistic ripple-carry adder with reasonable accuracy.
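To make the notion of a multi-bit error rate concrete, here is a minimal Monte Carlo sketch of a noisy ripple-carry adder. It assumes a deliberately simple noise model in which every sum and carry output flips independently with a fixed probability; this is an illustrative assumption and is neither the paper's recursive analytical model nor its additive-noise model. All function names are hypothetical.

```python
import random


def noisy_ripple_add(a, b, width, p_flip, rng):
    """Add two width-bit integers with a ripple-carry adder in which each
    full-adder sum and carry output flips independently with probability
    p_flip (assumed noise model)."""
    carry = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry
        c = (x & y) | (carry & (x ^ y))
        if rng.random() < p_flip:
            s ^= 1  # erroneous sum bit
        if rng.random() < p_flip:
            c ^= 1  # erroneous carry, propagates to the next stage
        result |= s << i
        carry = c
    return result | (carry << width)


def error_rate(width, p_flip, trials, seed=0):
    """Fraction of random additions whose full result word is wrong."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        if noisy_ripple_add(a, b, width, p_flip, rng) != a + b:
            errors += 1
    return errors / trials
```

Calling `error_rate(8, 0.01, 10000)` gives an empirical estimate that an analytical model of carry-error propagation would aim to predict without simulation.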
Monte-Carlo (MC) simulation is still the most commonly used technique for yield estimation of analog integrated circuits, because of its generality and accuracy. However, although some acceleration methods for MC simulation have been proposed, their efficiency is not high enough for MC-based yield optimization (which determines optimal device sizes and optimizes yield at the same time), which requires repeated yield calculations. In this paper, a new sampling-based yield optimization approach is presented, called the Memetic Ordinal Optimization (OO)-based Hybrid Evolutionary Constrained Optimization (MOHECO) algorithm, which significantly enhances the efficiency of yield optimization while maintaining the high accuracy and generality of MC simulation. By proposing a two-stage estimation flow and introducing the OO technology in the first stage, sufficient samples are allocated to promising solutions, and repeated MC simulations of non-critical solutions are avoided. With the proposed memetic search operators, the convergence speed of the algorithm can be considerably enhanced. With the same accuracy, the resulting MOHECO algorithm can achieve yield optimization with approximately 7 times less computational effort compared to a state-of-the-art MC-based algorithm integrating the acceptance sampling (AS) plus the Latin-hypercube sampling (LHS) techniques. Experiments and comparisons in 0.35μm and 90nm CMOS technologies show that MOHECO presents important advantages in terms of accuracy and efficiency.
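The baseline that the abstract refers to, plain Monte Carlo yield estimation, can be sketched as follows. This is not the MOHECO algorithm; the toy performance function, the Gaussian parameter model, and all names are assumptions for illustration. In a real flow each sample would be a circuit simulation, which is why repeating this estimate inside an optimization loop is expensive.

```python
import random


def estimate_yield(performance, meets_spec, nominal, sigma, n_samples, seed=0):
    """Plain Monte Carlo yield estimate.

    performance: maps a list of parameter values to a performance figure.
    meets_spec:  predicate on that figure.
    nominal, sigma: per-parameter mean and standard deviation
                    (independent Gaussian variation is an assumption).
    """
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sd) for mu, sd in zip(nominal, sigma)]
        if meets_spec(performance(sample)):
            passed += 1
    return passed / n_samples


def toy_gain(params):
    """Hypothetical stand-in for a circuit simulation: gain = gm * r_out."""
    gm, r_out = params
    return gm * r_out


# Yield of the toy design against a hypothetical spec "gain >= 9".
toy_yield = estimate_yield(toy_gain, lambda g: g >= 9.0,
                           nominal=[1.0, 10.0], sigma=[0.05, 0.5],
                           n_samples=5000)
```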
This paper presents reuse-aware modulo scheduling to maximize stream reuse and improve concurrency for stream-level loops running on stream processors. The novelty lies in the development of a new representation for an unrolled and software-pipelined stream-level loop using a set of reuse equations, resulting in simultaneous optimization of two performance objectives for the loop, reuse and concurrency, in a unified framework. We have implemented this work in the compiler developed for our 64-bit FT64 stream processor. Our experimental results obtained on FT64 and by simulation using nine representative stream applications demonstrate the effectiveness of the proposed approach.
The stream processing characteristics of many embedded system applications in multimedia and networking domains have led to the advent of stream-based programming formats. Several multicore processors aimed at embedded domains incorporate scratchpad memories (SPM) due to their superior power consumption characteristics. The paper addresses the problem of compiling stream programs onto multicore processors that incorporate SPM. Performance optimization on SPM-based processors requires effective schemes for software-based management of code and/or data overlay. In the context of our problem instance, the code overlay scheme impacts both the stream-element-to-core mapping and the memory available for inter-processor communication. The paper presents an integer linear programming (ILP) formulation and a heuristic approach that effectively exploit the SPM to maximize the throughput of stream programs when mapped to multicore processors. The experimental results demonstrate the effectiveness of the proposed techniques by compiling StreamIt-based benchmark applications on the IBM Cell processor and comparing against an existing approach.
Scratch-pad memory has been employed as a partial or entire replacement for cache memory due to its better energy efficiency. In this paper, we propose scratch-pad memory management techniques for priority-based preemptive multi-task systems. Our techniques are applicable to a real-time environment. The three methods which we propose, i.e., spatial, temporal, and hybrid methods, bring about effective usage of the scratch-pad memory space, and achieve energy reduction in the instruction memory subsystems. We formulate each method as an integer programming problem that simultaneously determines (1) partitioning of scratch-pad memory space for the tasks, and (2) allocation of program code to scratch-pad memory space for each task. Notably, the periods and priorities of tasks are considered in the formulations. Additionally, we implement an RTOS-hardware cooperative support mechanism for runtime code allocation to the scratch-pad memory space. We have conducted experiments with a fully functional real-time operating system. The experimental results with four task sets have demonstrated the effectiveness of our techniques. Up to 73 % energy reduction compared to a standard method was achieved.
Elementary functions are extensively used in computer graphics, signal and image processing, and communication systems. This paper presents a special-purpose compiler that automatically generates customized look-up tables and implementations for elementary functions under user given constraints. The generated implementations include a C/C++ code that can be used directly by applications running on multicores, as well as a MATLAB-like code that can be translated directly to a hardware module on FPGA platforms. The experimental results show that our solutions for function evaluation bring significant performance improvements to applications on multicores as well as significant resource savings to designs on FPGAs.
This paper proposes a new architecture-level thermal modeling method to address the emerging thermal-related analysis and optimization problem for high-performance multi-core microprocessor design. The new approach builds the thermal behavioral models from the measured or simulated thermal and power information at the architecture level for multi-core processors. Compared with existing behavioral thermal modeling algorithms, the proposed method can build the behavioral models from given arbitrary transient power and temperature waveforms used as the training data. Such an approach can make the modeling process much easier and less restrictive than before, and more amenable to practical measured data. The new method is based on a subspace identification method to build the thermal models, which first generates a Hankel matrix of Markov parameters, from which state matrices are obtained through least-squares optimization. To overcome the overfitting problems of the subspace method, the new method employs an overfitting mitigation technique to improve model accuracy and predictive ability. Experimental results on a real quad-core microprocessor show that the proposed method, ThermSID, is more accurate than the existing ThermPOF method. Furthermore, the proposed overfitting mitigation technique is shown to significantly improve modeling accuracy and predictability.
Index Terms - Thermal analysis, architecture thermal modeling, multicore processor
This paper describes a robust and accurate blackbox macromodeling technique, in which the constitutive equations combine both closed-form delay operators and low-order rational coefficients. These models describe efficiently electrically long interconnect links. The algorithm is based on an iterative weighted least-squares process and can be interpreted as a generalization of the well-known Vector Fitting. The paper is focused, in particular, on the passivity enforcement of these models. We present two perturbation methods and we show how the accuracy of the models is well-preserved during the passivity enforcement process.
An efficient algorithm based on the Extended Hamiltonian Pencil was previously proposed for systems with hybrid representation. Here we further extend the Extended Hamiltonian Pencil method to systems described with scattering representation, i.e. S-parameter systems. The derivation of the Extended Hamiltonian Pencil for S-parameter systems is presented. Some properties that allow passivity enforcement based on eigenvalue displacement are reported. Experimental results demonstrate the effectiveness of the proposed method.
This paper presents a modeling method for power distribution networks (PDNs) consisting of multilayered power/ground planes of the PCB/Package. Using our proposed method, multiple stacked power/ground plane pairs having holes and apertures can be modeled as an equivalent circuit. The structure of this equivalent circuit is suitable for the Latency Insertion Method (LIM), which is one of the fast transient simulation methods based on the "leapfrog" algorithm. Numerical results show that the leapfrog algorithm enables a speed-up of 105 and 486 times compared to the linear circuit simulator based on the sparse LU-decomposition and HSPICE, respectively, with the same level of accuracy.
Keywords: power integrity; power distribution network; leapfrog algorithm; SPICE
Carbon Nanotube Field-Effect Transistors (CNFETs) can potentially provide significant energy-delay-product benefits compared to silicon CMOS. However, CNFET circuits are subject to several sources of imperfections. These imperfections lead to incorrect logic functionality and substantial circuit performance variations. Processing techniques alone are inadequate to overcome the challenges resulting from these imperfections. An imperfection-immune design methodology is required. We present an overview of imperfection-immune design techniques to overcome two major sources of CNFET imperfections: metallic Carbon Nanotubes (CNTs) and CNT density variations.
Temperature has a strong influence on integrated circuit (IC) performance, power consumption, and reliability. However, accurate thermal analysis can impose high computation costs during the IC design process. We analyze the performance and accuracies of a variety of time-domain dynamic thermal analysis techniques and use our findings to propose a new analysis technique that improves performance by 38-138x relative to popular methods such as the fourth-order globally adaptive Runge-Kutta method while maintaining accuracy. More precisely, we prove that the step sizes of step doubling based globally adaptive fourth-order Runge-Kutta method and Runge-Kutta-Fehlberg methods always converge to a constant value regardless of the initial power profile, thermal profile, and error threshold during dynamic thermal analysis. Thus, these widely-used techniques are unable to adapt to the requirements of individual problems, resulting in poor performance. We also determine the effect of using a number of temperature update functions and step size adaptation methods for dynamic thermal analysis, and identify the most promising approach considered. Based on these observations, we propose FATA, a temporally-adaptive technique for fast and accurate dynamic thermal analysis.
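The step-doubling adaptive fourth-order Runge-Kutta baseline discussed above can be sketched on a single-node thermal RC model. This is not FATA; the one-node model, its parameter values, and the step-size controller constants are assumptions chosen for illustration. The function returns the list of accepted step sizes so their evolution over time can be inspected.

```python
import math

# Hypothetical one-node thermal model: C * dT/dt = P - (T - T_amb) / R
R_TH, C_TH, P_W, T_AMB = 2.0, 0.5, 10.0, 25.0


def dT_dt(t, temp):
    return (P_W - (temp - T_AMB) / R_TH) / C_TH


def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate_step_doubling(f, t0, y0, t_end, h0, tol):
    """Adaptive RK4 using step doubling: compare one step of size h with two
    steps of size h/2 and adapt h from their difference."""
    t, y, h = t0, y0, h0
    accepted = []
    while t < t_end:
        h = min(h, t_end - t)
        full = rk4_step(f, t, y, h)
        half = rk4_step(f, t + h / 2, rk4_step(f, t, y, h / 2), h / 2)
        err = abs(half - full)
        if err <= tol:  # accept the more accurate two-half-step result
            t += h
            y = half
            accepted.append(h)
        # Standard controller for a fourth-order method, clamped to [0.1, 2].
        scale = 0.9 * (tol / err) ** 0.2 if err > 0 else 2.0
        h *= min(2.0, max(0.1, scale))
    return y, accepted


final_temp, step_sizes = integrate_step_doubling(dT_dt, 0.0, T_AMB, 10.0, 0.1, 1e-6)
```

Every trial step costs three RK4 evaluations, accepted or not, which illustrates why the choice of step-size adaptation scheme matters for the run time of dynamic thermal analysis.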
In this paper a comprehensive assertion-based verification methodology for the digital, analog and software domain of heterogeneous systems is presented. The proposed methodology combines a novel mixed-signal assertion language and the corresponding automatic verification algorithm. The algorithm translates the heterogeneous temporal properties into observer automata for a semi-formal verification. This enables automatic verification of complex heterogeneous properties that cannot be verified by existing approaches. The experimental results show the integration of mixed-signal assertions into a simulation environment and demonstrate the broad applicability and the high value of the evolved solution.
This paper proposes a novel software Transaction-Level Modeling (TLM) approach for efficient HW/SW co-simulation. In HW/SW co-simulation, timing synchronization is needed between the hardware and software simulations to maintain their concurrency. However, handling timing synchronization improperly either slows down the simulation or sacrifices simulation accuracy. Our approach performs timing synchronization only at the points of HW/SW interaction, so accurate simulation results can be achieved efficiently. Furthermore, we define three abstraction levels of software TLM models based on the type of interactions captured. Given the target software, the software TLM models can be automatically generated in multiple abstraction layers. The experimental results show that our software TLM models attain 3 million instructions per second (MIPS) for low-level abstraction and go as high as 248 MIPS for higher-level abstraction. Therefore, designers can have efficient co-simulation by selecting a proper layer according to the abstraction of the corresponding hardware components.
We present a set of modeling constructs accompanied by a high performance simulation kernel for accuracy adaptive transaction level models. In contrast to traditional, fixed accuracy TLMs, accuracy of adaptive TLMs can be changed during simulation to the level which is most suitable for a given use case and scenario. Ad-hoc development of adaptive models can result in complex models, and the implementation detail of adaptivity mechanisms can obscure the actual logic of a model. To simplify and enable systematic development of adaptive models, we have identified several mechanisms which are applicable to a wide variety of models. The proposed constructs relieve the modeler from low level implementation details of those mechanisms. We have developed an efficient, light-weight simulation kernel optimized for the proposed constructs, which enables parallel simulation of large models on widely available, low-cost multi-core simulation hosts. The modeling constructs and the kernel have been evaluated using industrial benchmark applications.
Starting Electronic System Level (ESL) design flows with executable High-Level Models (HLMs) has the potential to sustainably improve productivity. However, writing good HLMs for complex systems is still a challenging task. In the context of network controller design, modeling complexity has two major sources: (1) the functionality to handle a single connection, and (2) the number of connections to be handled in parallel. In this paper, we will propose an efficient actor-oriented modeling approach for complex systems by (1) integrating hierarchical FSMs into dynamic dataflow models, and (2) providing new channel types to allow concurrent processing of multiple connections. We will show the applicability of our proposed modeling approach to real-world system designs by presenting results from modeling and simulating a network controller for the Parallel Sysplex architecture used in IBM System z mainframes.
In this paper we propose a design methodology to explore partial and dynamic reconfiguration of modern FPGAs. We improve a UML-based co-design methodology to allow dynamic properties in embedded systems. Our approach targets MPSoPC (Multiprocessor System on Programmable Chip), which allows area optimization through partial reconfiguration without performance penalty. In our case, area reduction is achieved by reconfiguring co-processors connected to embedded processors. Most of the system is automatically generated by means of MDE techniques. Our modeling approach allows designers to target dynamic reconfiguration without being experts in modern FPGAs, as many implementation details are hidden during the modeling step. Such a methodology allows design time speedup and a significant reduction of the gap between hardware and software modeling. In order to validate our approach, an object tracking application has been implemented on a reconfigurable system composed of 4 embedded processors and 3 co-processors. Dynamic reconfiguration has been performed for one co-processor which dynamically implements 3 different computations.
UML is widely applied for the specification and modeling of software and some studies have demonstrated that it is applicable for HW/SW codesign. However, in this area there is still a big gap from UML modeling to SystemC-based verification and synthesis environments. This paper presents an efficient approach to bridge this gap in the context of Systems-on-a-Chip (SoC) design. We propose a framework for the seamless integration of a customized SysML entry with code generation for HW/SW cosimulation and high-level FPGA synthesis. For this, we extended the SysML UML profile by SystemC and synthesis capabilities. Two case studies demonstrate the applicability of our approach.
The IEEE standard PSL is now a commonly accepted specification language for the Assertion-Based Verification (ABV) of complex systems. In addition to its Boolean and Temporal layers, it is syntactically extended with the Modeling layer, which borrows the syntax of the HDL in which the PSL assertions are included, to manage auxiliary variables. In this paper we propose a formal, operational semantics of PSL enriched with the Modeling layer. Moreover, we describe the implementation of this notion in our tool for the dynamic ABV of SystemC TLM models. Illustrative examples are presented.
The use of commercial electronic components is increasingly attractive for the space domain. This paper discusses the current degree of use of these components in space avionics, the selection and qualification phases to be successfully completed before they can be used, and an overview of the constraints the designers of hardware and software architectures have to face regarding these components, with the corresponding solutions. Concerning the issue of upsets, this paper describes possible solutions at architecture and system level and illustrates them with real examples that have already flown or are being developed. The constraints inherent in space avionics do not allow the total performance range of commercial electronic components to be fully exploited; nevertheless, these components - and particularly microprocessors, on which this paper focuses - are among the technologies having a potential disruptive capability for future space missions.
Keywords - space avionics; commercial electronic components; COTS; performance limitation; fault-tolerant architectures; disruptive technology
AFDX (Avionics Full Duplex Switched Ethernet), standardized as ARINC 664, is a major upgrade for avionics systems. However, network delay analysis is required to evaluate upper bounds on the end-to-end delay. The Network Calculus approach, which has been used to evaluate such end-to-end delay upper bounds for certification purposes, is briefly described. The Trajectory approach is an alternative method that can be applied to an AFDX avionics network. We show, on an industrial configuration, in which cases the Trajectory approach outperforms the existing end-to-end delay upper bounds and how the combination of the two methods can improve the existing analysis.
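As a brief illustration of the kind of bound Network Calculus produces, the textbook single-node case can be written out. This is a sketch under stated assumptions, not the analysis used in the paper: the leaky-bucket arrival curve, the rate-latency service curve, and the mapping of a virtual link to $(\sigma, \rho)$ are standard modelling choices assumed here, and the numeric values are hypothetical.

```latex
% Assumed arrival curve of a flow (burst \sigma, long-term rate \rho):
\alpha(t) = \sigma + \rho\, t
% Assumed rate-latency service curve of a node (rate R, latency T), with \rho \le R:
\beta(t) = R \,\max(0,\; t - T)
% The delay bound is the maximal horizontal deviation between the two curves:
D \;\le\; h(\alpha, \beta) \;=\; T + \frac{\sigma}{R}
% A common (assumed) mapping for a virtual link with maximum frame size s_{\max}
% and bandwidth allocation gap \mathrm{BAG}:
\sigma = s_{\max}, \qquad \rho = \frac{s_{\max}}{\mathrm{BAG}}
% Hypothetical numbers: s_{\max} = 4000 bits, R = 100 Mbit/s, T = 16 \mu s
D \;\le\; 16\,\mu\mathrm{s} + \frac{4000\ \mathrm{bits}}{100\ \mathrm{Mbit/s}} = 56\,\mu\mathrm{s}
```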
Packaging is becoming an important issue in aerospace equipment because of high integration and severe environmental constraints. In order to develop products that meet the specifications at minimum cost, Thales performs both mechanical and thermal simulations. The simulation level depends on the design phase (preliminary or detailed). The major challenges are encountered in thermal management, with power higher than 100 W at the module level and local hot spots greater than 100 W/cm². Under these conditions, standard cooling approaches using forced air are no longer applicable. To address these points, Thales has launched European collaborative research programs: "COSEE" for the development of two-phase cooling systems and "NANOPACK" for the development of thermal interface materials.
Keywords-electronics, cooling, heat pipe, avionics, packaging, integration
Modern FPGAs have been designed with advanced integrated circuit techniques that allow high speed and low power performance, together with reconfiguration capabilities. This makes new FPGA devices very advantageous for space and avionics computing. However, higher levels of integration make the FPGA's configuration memory more prone to Multi-Cell Upset errors (MCUs), caused by a single radiation particle that can flip the content of multiple nearby cells. In particular, MCUs are on the rise for the new generation of SRAM-based FPGAs, since their configuration memory is based on volatile programming cells designed with smaller geometries that are more sensitive to proton- and heavy ion-induced effects. MCUs drastically limit the capabilities of specific hardening techniques adopted in space-based electronic systems, mainly based on Triple Modular Redundancy (TMR). In this paper we describe a new placement algorithm for hardening TMR circuits mapped on SRAM-based FPGAs against the effects of MCUs. The algorithm is based on layout information of the FPGA's configuration memory and on metrics related to the logic and interconnection resources locations. Experimental results obtained from MCU static analysis on a set of benchmark circuits hardened by the proposed algorithm prove the efficiency of our approach.
We describe a storage scheme for functional test sequences where a test sequence T is associated with a primary input vector B called a background vector. T is stored by storing only the differences between its test vectors and B . We describe a procedure for computing a background vector B for a given test sequence T . We also describe a procedure that modifies T so as to reduce its storage requirements with respect to B . We present experimental results demonstrating that the single background vector B , computed based on T , allows T to be modified such that a vast majority of its entries are equal to the corresponding entries of B . Consequently, storage of T reduces to storage of a small number of entries.
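The storage idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's procedure: choosing B by per-input majority vote is an assumption made here, and the function names `background_vector`, `encode`, and `decode` are hypothetical.

```python
def background_vector(seq):
    """Pick B as the per-input majority value over the sequence.

    Majority vote is an assumption for illustration; ties resolve to '0'.
    """
    width = len(seq[0])
    bits = []
    for i in range(width):
        ones = sum(1 for vector in seq if vector[i] == "1")
        bits.append("1" if ones * 2 > len(seq) else "0")
    return "".join(bits)


def encode(seq, b):
    """For each test vector, keep only the (position, value) pairs that differ from B."""
    return [[(i, v[i]) for i in range(len(b)) if v[i] != b[i]] for v in seq]


def decode(diffs, b):
    """Rebuild the full test sequence from B and the stored differences."""
    sequence = []
    for entry in diffs:
        bits = list(b)
        for position, value in entry:
            bits[position] = value
        sequence.append("".join(bits))
    return sequence
```

For a sequence such as `["1010", "1000", "1011", "0010"]`, calling `encode(T, background_vector(T))` keeps one entry per deviating bit, and `decode` restores the original vectors. The fewer entries of T that differ from B, the less has to be stored.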
Built-In Self-Test (BIST) is less often applied to random logic than to embedded memories for the following reasons: Firstly, for a satisfactory fault coverage it may be necessary to apply additional deterministic patterns, which cause additional hardware costs. Secondly, the BIST signature reveals only poor diagnostic information. Recently, the first issue has been addressed successfully. The paper at hand proposes a viable, effective and cost-efficient solution for the second problem. The paper presents a new method for Built-In Self-Diagnosis (BISD). The core of the method is an extreme response compaction architecture, which for the first time enables an autonomous on-chip evaluation of test responses with negligible hardware overhead. The key advantage of this architecture is that all data relevant for a subsequent diagnosis is gathered during just one test session. The BISD method comprises a hardware scheme, a test pattern generation approach and a diagnosis algorithm. Experiments conducted with industrial designs substantiate that the additional hardware overhead introduced by the BISD method is on average about 15% of the BIST area, and the same diagnostic resolution can be obtained as for external testing.
Index Terms - Logic BIST, Diagnosis
Increasing yield is important, especially for nano-scale technologies. Also, pipelines are an important aspect of many SoC architectures. In this paper we present new approaches to improve the yield and yield/area of pipeline architectures by using (1) an appropriate number of redundant copies for each module, and (2) sufficient steering logic resources. We present an optimal algorithm of time complexity O(n^3) that adds redundant modules to an n-stage pipeline so as to maximize yield. Experimental results indicate that for parameter values of interest, this algorithm also improves the yield/area of the pipeline, especially when the yield for some modules is low.
Keywords-algorithm, yield/area, switch, redundancy, pipeline
Pattern recognition has many applications in design automation. A generalized pattern recognition algorithm is presented in this paper which can efficiently extract similar patterns in programs. Compared to previous pattern-based techniques, our approach overcomes their limitation in handling control-flow-aware patterns, and leads to more opportunities for optimization. Our algorithm uses a feature-based filtering approach for fast pruning, and an elegant graph similarity metric called the generalized edit distance for measuring variations in CDFGs. Furthermore, our pattern recognition algorithm is applied to solve the area optimization problem in behavioral synthesis. Our experimental results show up to a 40% area reduction on a set of real-world benchmarks with a moderate 9% latency overhead, compared to synthesis results without pattern extractions; and up to a 30% area reduction, compared to the results using only data-flow patterns.
Keywords: Behavioral Synthesis, control flow, pattern, feature
Dual-Vth design is an effective leakage power reduction technique at behavioral synthesis level. It allows designers to replace modules on non-critical paths with the high-Vth implementation. However, the existing constructive algorithms fail to find the optimal solution due to the complexity of the problem and do not consider the on-chip temperature variation. In this paper, we propose a two-stage thermal-dependent leakage power minimization algorithm by using a dual-Vth library during behavioral synthesis. In the first stage, we quantitatively evaluate the timing impact on other modules caused by replacing certain modules with high Vth. Based on this analysis and the characteristics of the dual-Vth module library, we generate a small set of candidate solutions for the module replacement. Then in the second stage, we obtain the on-chip thermal information from thermal-aware floorplanning and thermal analysis to select the final solution from the candidate set. Experimental results show an average of 17.8% saving in leakage power consumption and a slightly shorter runtime compared to the best known work. In most cases, our algorithm can actually find the optimal solutions obtained from a complete solution space exploration.
Keywords-Leakage Power, Behavioral Synthesis, Dual-Vth, Thermal-aware
Reducing resource usage is one of the most important optimization objectives in behavioral synthesis due to its direct impact on power, performance and cost. The datapath in a typical design is composed of different kinds of components, including functional units, registers and multiplexers. To optimize the overall resource usage, a behavioral synthesis tool should consider all kinds of components at the same time. However, most previous work on behavioral synthesis has the limitations of (i) not being able to consider all kinds of resources globally, and/or (ii) separating the synthesis process into a sequence of optimization steps without a consistent optimization objective. In this paper we present a behavioral synthesis flow in which all types of components in the datapath are modeled and optimized consistently. The key idea is to feed to the scheduler the intentions for sharing functional units and registers in favor of the global optimization goal (such as total area), so that the scheduler could generate a schedule that makes the sharing intentions feasible. Experiments show that compared to the solution of minimizing functional unit requirements in scheduling and using the least number of functional units and registers in binding, our solution achieves a 24% reduction in total area; compared to the online tool provided by c-to-verilog.com, our solution achieves a 30% reduction on average.
The move to low-k1 lithography makes it increasingly difficult to print feature sizes which are a small fraction of the wavelength of light. Manufacturing processes currently treat a target layout as a fixed requirement for lithography. However, in reality layout features may vary within certain bounds without violating design constraints. The knowledge of such tolerances, coupled with models for process variability, can help improve the manufacturability of layout features while still meeting design requirements. In this paper, we propose a methodology to convert electrical slack in a design to shape slack or tolerances on individual layout shapes using a two-phase approach. In the first step, we redistribute delay slack to generate delay bounds on individual cells using linear programming. In the second phase, which is solved as a quadratic program, we convert these delay bounds to shape tolerances to maximize the process window of each shape. The shape tolerances produced by our methodology can be used within a process-window optical proximity correction (PWOPC) flow to reduce delay errors arising from variations in the lithographic process. Our experiments on 45nm SOI cells using accurate process models show that the use of our shape slack generation in conjunction with PWOPC reduces delay errors from 3.6% to 1.4%, on average, compared to the simplistic way of tolerance band generation.
Keywords-process-window optical proximity correction, tolerance bands, design-intent, DFM.
Double patterning technology (DPT) is emerging as the dominant technology to achieve the 32-nm node and beyond. Two challenges faced by DPT are layout decomposition and overlay error. To handle the challenges, some effort has been made to consider DPT during detailed routing. In this paper, we propose two enhancing techniques for DPT-friendly detailed routing: lazy color decision and last conflict segment recording. Experiments show that our techniques are able to reduce the number of stitches by 15~20% with 4% increase in running time.
Keywords- Detailed Routing, Double Patterning Technology
In deep sub-micron technology, accurate modeling of output waveforms of library cells under different input slew and load capacitance values is crucial for precise timing and noise analysis of VLSI circuits. Construction of a compact and efficient model of such waveforms becomes even more challenging when manufacturing process and environmental variations are considered. This paper introduces a rigorous and robust foundation to mathematically model output waveforms under sources of variability and to compress the library data. The proposed approach is suitable for today's current source model (CSM) based ASIC libraries. It employs an orthonormal transformation to represent the output waveforms as a linear combination of some appropriately-derived basis waveforms. More significantly, Robust Principal Component Analysis (RPCA) is used to stratify the library waveforms into a small number of groups for which different sets of principal components are calculated. This stratification results in a very high compression ratio for the variational CSM library while meeting a maximum error tolerance. Interpolation and further compression are obtained by representing the coefficients as signomial functions of various parameters, e.g., input slew, load capacitance, supply voltage, and temperature. We propose a procedure to calculate the coefficients and power of the signomial functions. Experimental results demonstrate the effectiveness of the proposed variational CSM modeling framework and the stratification-based compression approach.
Keywords- Current Source Model; Robust Principal Component Analysis; Stratification; signomial;
We present a throughput-driven partitioning algorithm and a throughput-preserving merging algorithm for the high-level physical synthesis of latency-insensitive (LI) systems. These two algorithms are integrated along with a published floorplanner in a new iterative physical synthesis flow to optimize system throughput and reduce area occupation. The partitioning algorithm performs bottom-up clustering of the internal logic of a given IP core to divide it into smaller ones, each of which has no combinational path from input to output and thus is legal for LI-interface encapsulation. Applying this algorithm to cores on critical feedback loops optimizes their length and in turn enables throughput optimization via the subsequent floorplanning. The merging algorithm reduces the number of cores on non-critical loops, lowering the overall area taken by LI interfaces without hurting the system throughput. Experimental results on a large system-on-chip design show a 16.7% speedup in system throughput and a 2.1% reduction in area occupation.
In this paper, we present a method to analyze different implementations of stream-based applications on heterogeneous multiprocessor systems. We take both resource usage and performance constraints into account. For the first aspect we use an empirical cost model. For the second aspect we build a network of cycle-accurate processor simulators. The simulation and resource cost estimation have been integrated in an existing framework, allowing one to generate fast exploration simulations, cycle-accurate simulations and FPGA implementations from a single system level specification. We show that with our methodology cycle-accurate performance numbers of candidate systems can be obtained. In our experiments with the QR and MJPEG applications, we found that the error of our resource cost model is below two percent.
Differential Power Analysis (DPA) is a powerful Side-Channel Attack (SCA) targeting both symmetric and asymmetric ciphers. Its principle is based on a statistical treatment of power consumption measurements monitored on an Integrated Circuit (IC) computing cryptographic operations. Many works have proposed improvements of the attack, but none focuses on ordering measurements. Our proposal consists of a statistical preprocessing step which ranks measurements in a statistically optimized order to accelerate DPA and reduce the number of measurements required to disclose the key.
Monte Carlo (MC) simulation is a well-known solution to the statistical analysis of analog circuits in the presence of device mismatch. Despite MC's superior accuracy compared with that of the sensitivity-based techniques, an accurate analysis that involves traditional MC-based techniques requires a large number of circuit simulations. In this paper, a correlation-controlled sampling technique is developed to enhance the quality of the variance estimations. The superiority of the developed technique is verified by variability analysis of the input-referred offset voltage of a comparator, the frequency mismatch of a ring oscillator, and the AC parameters of an operational transconductance amplifier.
We model and verify analog designs in the presence of noise and process variation using an automated theorem prover, MetiTarski. Due to the statistical nature of noise, we propose to use stochastic differential equations (SDE) to model the designs. We find a closed form solution for the SDEs, then integrate the device variation due to the 0.18μm fabrication process and verify properties using MetiTarski. We illustrate the proposed approach on an inverting Op-Amp Integrator and a Band-Gap reference bias circuit.
Model-Based Development (MBD) provides an additional level of abstraction, the model, which lets engineers focus on the business aspect of the developed system. MBD permits automatic treatment of these models with dedicated tools, such as synthesis of the system's application by automatic code generation. Real-Time and Embedded Systems (RTES) are often constrained by their environment and/or the resources they own in terms of memory and energy consumption with respect to performance requirements. Hence, an important problem to deal with in RTES development is linked to the optimization of their software part. Although automatic code generation and the use of optimizing compilers bring some answers to the application optimization issue, we will show in this paper that optimization results may be enhanced by adding a new level of optimizations in the modeling process. Our arguments are illustrated with examples of Unified Modeling Language (UML) state machine diagrams, which are widely used for control aspect modeling of RTES. The well-known GNU Compiler Collection (GCC) is used for this study. The paper concludes with a proposal for a two-step optimization approach that allows existing compiler optimizations to be reused as they are.
Hardware acceleration uses hardware to perform some software functions faster than is possible on a processor. This paper proposes to optimize hardware acceleration using path-based scheduling algorithms derived from dataflow static scheduling, and from control-flow state machines. These techniques are applied to the MIPS-to-Verilog (M2V) compiler, which translates blocks of MIPS machine code into a hardware design represented in Verilog for reconfigurable platforms. The simulation results demonstrate a factor of 22 in performance improvement for simple self-looped basic blocks over the base compiler.
Keywords-compiler; MIPS; Verilog, FPGA, scheduling
The finite-impulse response (FIR) filter technique has been widely used for pre-emphasis of channels to mitigate the intersymbol interference (ISI) resulting from both frequency-dependent losses and reflections. This paper proposes a systematic methodology, based on arbitrary step response, to determine the tap setting of a multi-tap FIR filter for best eye diagram improvement. The required tap number and the optimal tap coefficients are determined according to the compensation efficiency, and hence the ultimate performance of the FIR filter is evaluated. Finally, the compensation results for two specific 5Gbps signalling systems, which include significant effects of losses and multiple reflections, are demonstrated to validate the optimization method.
Keywords - Finite-impulse response (FIR); step response; eye diagram; lossy line; reflection; pre-emphasis; signal integrity.
This paper discusses signal integrity (SI) issues and signalling techniques for Through Silicon Via (TSV) interconnects in 3-D Integrated Circuits (ICs). Field-solver extracted parasitics of TSVs have been employed in Spice simulations to investigate the effect of each parasitic component on performance metrics such as delay and crosstalk and identify a reduced-order electrical model that captures all relevant effects. We show that in dense TSV structures voltage-mode (VM) signalling does not lend itself to achieving high data-rates, and that current-mode (CM) signalling is more effective for high throughput signalling as well as jitter reduction. Data rates, energy consumption and coupled noise for the different signalling modes are extracted.
Integrated circuit process technology is entering the ultra deep submicron era. At this level, interconnect structure becomes very stiff and the metal resistance shielding effects problem is more serious. Although several delay metrics have been proposed, they are inefficient and difficult to implement. Hence, we propose a new delay and slew metric for interconnect based on Beta distribution and which does not require a look-up table to be built. Our metrics are efficient and easy to implement; the overall standard deviation and error mean are smaller than in previous works.
In this paper, we present an accurate timed RTOS model within transaction level models (TLMs). Our RTOS model, implemented on top of a system level design language (SLDL), incorporates two key features: an RTOS behavior model and an RTOS overhead model. The RTOS behavior model provides dynamic scheduling, inter-process communication (IPC), and external communication for timing-annotated user applications. While the RTOS behavior model is running, all RTOS events, such as context switches and interrupt handling, are passed to the RTOS overhead model to apply the overhead during system execution. Our RTOS overhead model has processor- and RTOS-specific precharacterized overhead information to provide cycle-approximate estimation. We demonstrate the applicability of our model using a multi-core platform executing a JPEG encoder. Experimental results show that the proposed RTOS model provides high accuracy, within 7% of on-board measurements, while simulating at speeds close to the reference C code.
This paper describes a system-level modeling method in UML for performance evaluation of embedded systems. The core technology of this modeling method is reverse modeling based on dynamic analysis. A case study of real MFPs (multifunction peripherals/printers) is presented in this paper to evaluate the modeling method.
Nowadays, modeling languages like UML are essential in the design of complex software systems and also start to enter the domain of hardware and hardware/software codesign. Due to shortening time-to-market demands, "first time right" requirements have thereby to be satisfied. In this paper, we propose an approach that makes use of Boolean satisfiability for verifying UML/OCL models. We describe how the respective components of a verification problem, namely system states of a UML model, OCL constraints, and the actual verification task, can be encoded and afterwards automatically solved using an off-the-shelf SAT solver. Experiments show that our approach can solve verification tasks significantly faster than previous methods while still supporting a large variety of UML/OCL constructs.
This paper presents the definition of an integrated processor core ASIC named SCOC3 which is designed for space computers. It also presents the validation method that has led to a successful ASIC run on the first attempt, thanks to Astrium's improved control in microelectronic component design. It is based on the LEON3FT Sparc processor associated with a Floating Point Unit, both from Aeroflex Gaisler. The core provides many interfaces to connect various types of electronic units. This ASIC is one of the first applications implemented on the new ATC018RHA ATMEL technology.
Keywords-space, computer, processor, component ASIC, development, process, validation
Due to the need for reducing system size and weight while increasing performance, many military and commercial systems today require high-temperature electronics to run actuators, high-speed motors or generators. Of the many passive devices required to satisfy the needs for a complete high temperature system, none has been more problematic than the capacitor, particularly for larger devices requiring values of several micro- or milli-farads. Here we introduce a polymer metal composite we have recently developed that meets typical aerospace design constraints of high reliability, robustness, light weight, as well as high temperature (up to 300°C) operation. Our recent discovery of the capacitive behaviour in perfluorinated sulfonic acid polymers sandwiched between metal electrodes has led to the exciting development of high-temperature-capable, high-density passive storage components. These composites exhibit capacitance per unit planar area of ~1.0 mF cm⁻² or 40 mF/g for a ~100 μm-thick polymer substrate, with only a small predictable decrease in capacitance immediately after heating to 100°C followed by constant capacitance up to 300°C. Here we report the design and testing of single step microfabrication of metal electrodes to these polymer composites sandwiched between two thin metal films along with their performance at high temperatures.
Faster-than-at-speed testing provides an effective way for detecting and debugging small delay defects in modern fabricated chips. However, the use of external automatic test equipment for faster-than-at-speed delay testing could be costly. In this paper, we present an on-chip clock generation scheme which facilitates faster-than-at-speed delay testing for both launch on capture and launch on shift test frameworks. The required test clock frequency with a high resolution can be obtained by specifying the information in the test patterns, which is then shifted into the delay control stages to configure the launch and capture clock generation circuit (LCCG) embedded on-chip. Similarly, the control information for selecting various test frameworks and clock signals can also be embedded in the test patterns. Experimental results are presented to validate the proposed scheme.
Keywords-launch on capture; launch on shift; faster-than-at-speed; on-chip; small delay defect
High-level synthesis has recently started to gain industrial acceptance, due to the improved quality of results and the multi-objective optimizations offered. One optimization area lately addressed is reconfigurable computing, where parts of a DFG are merged and mapped into coarse grained reconfigurable components. This paper presents an alternative approach, the construction of dual mode components which are exchanged with regular components in the resulting RTL architecture. The dual mode components are constructed by exhaustive search for dual mode functional primitives inside the datapath of complicated RTL components. Such components, like multipliers and dividers, that would remain idle in certain control steps, are able to work full-time in two different modes, without any reconfiguration overhead applied to the critical path of the application. The results obtained with different DSP benchmarks show an average performance gain of 15%, without any practical datapath area increase, offering uniform and balanced resource utilization.
Index Terms - high-level synthesis; scheduling; reconfigurable computing; coarse grained reconfigurable components;
During Electronic System-Level (ESL) design, High-Level Synthesis (HLS) tools normally translate the system description to a Control/Data Flow Graph. At this level, several transformations are performed as early as possible to reduce the number and complexity of the data operations. These preliminary transformations (for example, common sub-expression elimination, constant propagation, etc.) are typically applied in algebraic expressions with arithmetic operators. This paper presents preliminary transformations that optimize Data-Flow Graphs with relational, maximum/minimum and arithmetic (addition/subtraction) operations. The proposed techniques produce a significant reduction in the number of operations. HLS tools and even software compilers and symbolic algebra packages are not able to generate similar results. The efficiency of the techniques has been evaluated with several modules of real telecommunications standards and their HW implementations show important area reductions and, sometimes, low impact on latency or critical path.
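To give a flavour of what operation-reducing rewrites over relational, maximum/minimum and addition/subtraction operators can look like, a few well-known identities are listed here. They are illustrative assumptions only, not necessarily the transformations proposed in the paper, and they hold for exact arithmetic; fixed-width hardware operators would additionally need overflow to be ruled out.

```latex
% A common addend can be factored out of a maximum (two additions become one):
\max(a + c,\; b + c) = \max(a, b) + c
% The same holds for a minimum:
\min(a + c,\; b + c) = \min(a, b) + c
% A common addend cancels in a comparison (both additions disappear):
(a + c < b + c) \;\Longleftrightarrow\; (a < b)
% Maximum and minimum of the same operands always sum to the operands' sum:
\max(a, b) + \min(a, b) = a + b
```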
Microprocessor chips employ an increasingly large number of thermal sensing devices. These devices are networked by an underlying infrastructure, which provides bias currents to sensing devices and collects measurements. In this work, we address the optimization of the bias current distribution network utilized by the sensing devices. We show that the choice between two fundamental topologies (the 2-wire and the 4-wire measurement) for this network has a non-negligible impact on the precision of the monitoring system. We also show that the 4-wire measurement principle supports the remote sensing technique better. However, it requires more routing resources. We thus propose a novel routing algorithm to minimize its routing cost. We also present a detailed evaluation of the quality of the resulting system in the presence of process and thermal variations. Our Monte Carlo simulations using the IBM 10SF 65nm SPICE models show that the monitoring accuracies can be as high as 0.6°C under a considerable amount of process and temperature variation. Moreover, by adopting a customized routing approach for the current mirror network, the total wire length of the bias current network can be reduced by as much as 42.74% and by 27.65% on average.
Manufacturing hotspots are layout patterns which cause excessive difficulties for the manufacturing process. Design rules are effective at handling sizing/spacing-induced hotspots, but are inadequate at dealing with topological hotspots. In wire routing, existing approaches often remove the hotspots by iteratively ripping up and rerouting one net at a time, guided by litho-simulations. This procedure can be very time-consuming because litho-simulation is typically very slow and the rerouting may result in new hotspots due to its heuristic nature. In this paper, we propose a new approach for improving the efficiency of hotspot removal. In our approach, multiple nets in each hotspot region are simultaneously ripped up and rerouted based on Boolean satisfiability (SAT). The hotspot patterns, which are described and stored in a pre-built library, are forbidden to appear in the reroute through SAT constraints. Since multiple nets are simultaneously processed and SAT is guaranteed to find a feasible solution if one exists, our approach can greatly accelerate the convergence on manufacturability. Experimental results on benchmark circuits show that our approach can remove over 90% of the hotspots in less than one minute on circuits with more than 20K nets and hundreds of hotspots.
As CMOS technology continues to scale, the accurate prediction of silicon timing through the use of pre-silicon modeling and analysis has become especially difficult. These timing mismatches are important because they make it hard to accurately design circuits that meet timing specifications at first silicon. Among all the parameters leading to the timing discrepancy between simulation and silicon, this paper studies the effect of dynamic IR-drop on the delay of a path. We propose a noise index model, NIM, which can be used to predict the mismatch between expected and real path delays. The noise index considers both the proximity of switching activity to the path and physical characteristics of the design. To evaluate the method, we performed silicon measurements on randomly selected paths from an industrial 65nm design and compared these with Spice simulations. We show that a very strong correlation exists between the noise index model and the deviations between simulations and silicon measurements.
Keywords- Timing Mismatch, Path Delay Test, Performance Test, Power Supply Noise, IR-Drop, Post-Silicon Measurement.
In a recent keynote speech, UCB Prof. Randy Katz defined power as the 21st century's most limited, as well as most wasted, resource: not only do we seem unable to generate "greener" power but, at all levels, we also seem to be really poor at making efficient use of our precious power. We waste while doing things, as well as while doing nothing! "Electronics is sized for peak power consumption, and designed for continuous activity." The issue starts at the transistor level, where static power, i.e. the power which is consumed by ICs even while doing nothing, has passed the 50% threshold at 32 nanometers. All technology improvements, including costly high-k dielectric, a one-shot weapon introduced at 45 nanometers, have temporarily mitigated, but not resolved, the issue.
Scalable cache coherence is imperative as systems move into the many-core era with core counts numbering in the hundreds. Directory protocols are often favored as more scalable in terms of bandwidth requirements than broadcast protocols; however, directories incur storage overheads that can become prohibitive with large systems. In this paper, we explore the impact that reducing directory overheads has on the network-on-chip and propose SigNet to mitigate these issues. SigNet utilizes signatures within the network fabric to filter out extraneous requests before they reach their destination. Overall, we demonstrate average reductions in interconnect activity of 21% and latency improvements of 20% over a coarse vector directory while utilizing as little as 25% of the area of a full-map directory.
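The abstract does not say how SigNet's signatures are built. As a rough illustration only, the sketch below assumes a Bloom-filter-style signature: a request is forwarded only when the signature says the address may be present. The class name, the bit width, the hash count and the use of SHA-256 are all hypothetical choices and not taken from the paper.

```python
import hashlib


class Signature:
    """Bloom-filter-style signature (an assumed structure, not SigNet's actual design)."""

    def __init__(self, bits: int = 256, hashes: int = 3) -> None:
        self.bits = bits        # hypothetical signature width
        self.hashes = hashes    # hypothetical number of hash functions
        self.field = 0          # bit vector stored as a Python integer

    def _positions(self, address: int):
        # Derive one bit position per hash function from the address.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{address}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def insert(self, address: int) -> None:
        for p in self._positions(address):
            self.field |= 1 << p

    def may_contain(self, address: int) -> bool:
        # False means "definitely absent", so the request can be filtered out.
        # True means "possibly present", so the request must be forwarded.
        return all((self.field >> p) & 1 for p in self._positions(address))
```

A structure of this kind never produces false negatives, so filtering cannot drop a request that is actually needed; false positives only cost some extra traffic.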
In this paper, we employ formal feedback control theory to achieve desired communication throughput across a network-on-chip (NoC) based multicore. When the output of the system needs to follow a certain reference input over time, our controller regulates the system to obtain the desired effect on the output. In this work, targeting a multicore that executes multiple applications simultaneously, we demonstrate how to design and employ a PID (Proportional Integral Derivative) controller to obtain the desired throughput for communications by tuning the weights of the virtual channels of the routers in the NoC. We also propose a global controller architecture that implements policies to handle situations in which the network cannot provide the overlapping communications with sufficient resources, or in which the throughputs of the communications can be enhanced (beyond their specified values) due to the availability of excess resources. Finally, we discuss how our novel control architecture works under different scenarios by presenting experimental results obtained using four embedded applications. These results show how the global controller adjusts the virtual channel weights to achieve the desired throughputs of different communications across the NoC, and as a result, the system output successfully tracks the specified input.
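To make the control loop concrete, here is a minimal sketch of a textbook discrete PID controller driving a single virtual-channel weight. The gains, the linear plant model (throughput proportional to weight) and all names are illustrative assumptions; the paper's actual controller design and NoC model are not given in the abstract.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller; the gains used below are assumptions."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, reference: float, measured: float, dt: float = 1.0) -> float:
        error = reference - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(target: float, steps: int = 200, plant_gain: float = 0.5) -> float:
    """Drive a toy plant toward the target throughput and return the last reading."""
    pid = PIDController(kp=0.2, ki=0.5, kd=0.05)
    weight = 0.0
    throughput = 0.0
    for _ in range(steps):
        # Hypothetical plant: throughput grows linearly with the channel weight.
        throughput = plant_gain * weight
        # The controller output is used directly as the new (non-negative) weight.
        weight = max(0.0, pid.update(target, throughput))
    return throughput
```

With these assumed gains the loop is stable for the toy plant, so `simulate(2.0)` settles close to the reference; a real NoC would need gains tuned against measured behavior.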
The on-chip interconnection network is a crucial design component in high-performance System-on-Chips (SoCs). Many previous works have focused on the automation of its topology design, since the topology largely determines its overall performance. For this purpose, they mostly require a switch library which includes all possible switch configurations (e.g. the number of input/output ports and data width) with their implementation costs such as delay, area, and power. More precisely, they characterize the switches by synthesizing them with a common design objective (e.g. minimizing area) and common design constraints for a given gate-level design library. The implementation costs are used in evaluating the topologies throughout the topology synthesis. The major drawback of the single-switch-library approach is that it forces the topology synthesis methods to search for the best topology under the assumption that all the switches comprising a topology will be implemented (synthesized) with a common design objective and common design constraints. This assumption prevents them from exploring diverse combinations of the switches for a topology from the implementation perspective. To tackle this issue, we propose a topology synthesis method with multiple switch libraries, where the switch libraries are prepared with different design objectives and design constraints. The experimental results show that the power consumption and the area of optimal topologies can be reduced by up to 67.1% and 27.2%, respectively, by the proposed method with negligible synthesis time overhead.
In this paper, a VLSI architecture of a high-throughput and high-performance soft-output (SO) MIMO detector (the recently presented Layered ORthogonal Lattice Detector, LORD) is presented. The baseline implementation includes optimal (i.e. maximum-likelihood - ML - in the max-log sense) SO generation. A reduced-complexity variant of the SO generation stage is also described. To the best of the authors' knowledge, the proposed architecture is the first VLSI implementation of a max-log ML MIMO detector that includes QR decomposition and SO generation. The SO generation stage achieves a deterministic, very high throughput thanks to its fully parallelizable structure, and the design is parameterizable in both the number of transmit and receive antennas and the supported modulation orders. The two designs achieve a very high throughput, making them particularly suitable for MIMO-OFDM systems such as IEEE 802.11n WLANs: the most demanding requirements are satisfied at a reasonable cost of area and power consumption.
We present an algorithm and architecture of a soft-output sphere decoder with an optimized hardware implementation for 2x2 MIMO-OFDM reception. We introduce a novel table look-up approach for symbol enumeration that simplifies the implementation of soft-output decoders. The HW implementation is targeted towards WLAN (IEEE 802.11n) with stringent latency and throughput requirements. The current implementation supports all modulation schemes (BPSK, QPSK, 16-QAM, 64-QAM) and shows near-optimal real-time performance. To achieve this, the sphere decoder computes, in the worst case, Euclidean distances for 4.1 Giga QAM symbols per second. This challenging requirement is met by a scalable, multi-standard HW architecture which can be tuned to other applications such as LTE and WiMAX with no re-design effort. The current instance for WLAN occupies an area of only 0.17 mm² in 45 nm CMOS technology while providing a guaranteed throughput of 374 Msoftbits/s at a 312 MHz clock rate (i.e. outputting 2x6 softbits worst-case every 10 clock cycles).
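The guaranteed throughput quoted above follows directly from the clock rate and the worst-case output rate. The short check below reproduces that arithmetic; the function and variable names are illustrative only.

```python
def softbit_throughput(clock_hz: float, softbits_per_batch: int, cycles_per_batch: int) -> float:
    """Softbits per second when a batch of softbits is emitted every few clock cycles."""
    return clock_hz * softbits_per_batch / cycles_per_batch


# Figures from the abstract: 312 MHz clock, 2x6 softbits every 10 cycles (worst case).
rate = softbit_throughput(312e6, 2 * 6, 10)
print(f"{rate / 1e6:.1f} Msoftbits/s")
```

The result is 374.4 Msoftbits/s, which the abstract rounds down to 374 Msoftbits/s.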
In this paper, we examine the design process of a Network on-Chip (NoC) for a high-end commercial System on-Chip (SoC) application. We present several design choices and focus on the power optimization of the NoC while achieving the required performance. Our design steps include module mapping and allocation of customized capacities to links. Unlike previous studies, in which point-to-point, per-flow timing constraints were used, we demonstrate the importance of using the application end-to-end traversal latency requirements during the optimization process. In order to evaluate the different alternatives, we report the synthesis results of a design that meets the actual throughput and timing requirements of the commercial SoC. According to our findings, the proposed technique offers up to 40% savings in the total router area and a reduction of up to 49% in the inter-router wiring area.
Keywords- System on-chip, Network on-chip, Optimization
To address the challenges in processor design for next-generation wireless communication systems, this paper first proposes a system-level design flow for communication-domain-specific processors, and then uses this design flow to propose GAEA, a novel processor architecture for next-generation wireless communication. GAEA is a shared-memory multi-core SoC based on a Software Controlled Time Division Multiplexing Bus, with which programmers can easily exploit the memory-level parallelism of applications through proper instructions and scheduling algorithms. MPE, the kernel component of GAEA, adopts a hybrid parallel processing scheme to exploit instruction-level and data-level parallelism. The pipeline and instruction set of GAEA are also optimized for next-generation wireless communication systems. The evaluation and implementation results show that the GAEA architecture is suitable for next-generation wireless communication systems.
3GPP long term evolution (LTE) enhances the wireless communication standards UMTS and HSDPA towards higher throughput. A throughput of 150 Mbit/s is specified for LTE using 2x2 MIMO. For this, highly punctured Turbo codes with rates up to 0.95 are used for channel coding, which is a big challenge for decoder design. This paper investigates efficient decoder architectures for highly punctured LTE Turbo codes. We present a 150 Mbit/s 3GPP LTE Turbo code decoder, which is part of an industrial SDR multi-standard baseband processor chip.
Testing for small-delay defects (SDDs) is necessary to ensure the quality and reliability of high-performance integrated circuits fabricated with the latest technologies. These timing defects can be caused by process variations, crosstalk, and power-supply noise, as well as by physical defects such as resistive opens and shorts. Timing-aware ATPG tools have been developed for SDD detection. However, they only use static timing analysis reports for path-length calculation and neglect important parameters such as process variations, crosstalk, and power-supply noise, which can induce small delays into the circuit and impact the timing of targeted paths. In this paper, we present an efficient pattern evaluation and selection procedure for screening SDDs that are caused by physical defects and by delays added to paths by process variations and crosstalk. In this procedure, the best patterns for SDDs are selected from a large repository test set. Experimental results demonstrate that our method sensitizes more long paths and detects more SDDs with a much smaller pattern count compared with a commercial timing-aware ATPG tool.
When testing delay faults on critical paths, conventional structural test patterns may be applied in functionally-unreachable states, leading to over-testing or under-testing of the circuits. In this paper, we propose novel layout-aware pseudo-functional testing techniques to tackle the above problem. Firstly, by taking the circuit layout information into account, functional constraints related to delay faults on critical paths are extracted. Then, we generate functionally-reachable test cubes for every true critical path in the circuit. Finally, we fill the don't-care bits in the test cubes to maximize power supply noise on critical paths under the consideration of functional constraints. The effectiveness of the proposed methodology is verified with large ISCAS'89 benchmark circuits.
Functional broadside tests were defined to avoid overtesting that may occur under structural scan-based tests. Overtesting occurs due to nonfunctional operation conditions created by unreachable scan-in states. Functional broadside tests were computed assuming that functional operation starts after the circuit is synchronized. We discuss the definition of functional broadside tests for the case where hardware reset is used for bringing the circuit into a known state before functional operation starts. We show that the set of reachable states for a circuit with hardware reset contains the set of reachable states based on a synchronizing sequence. Consequently, the set of functional broadside tests and the set of detectable faults for a circuit with hardware reset contain those obtained based on a synchronizing sequence. In addition, there are differences between different reset states in the sets of reachable states and the sets of detectable faults.
This paper studies the dilemma between fault tolerance and energy efficiency in frame-based real-time systems. Given a set of K tasks to be executed on a system that supports L voltage levels, the proposed heuristic-based scheduling technique minimizes the energy consumption of task execution when faults are absent, and preserves feasibility under the worst case of fault occurrences. The proposed technique first finds the optimal solution in a comparable system that supports continuous voltage scaling, then converts the solution to the original system. The runtime complexity is only O(LK^2). Experimental results show that the proposed approach produces near-optimal results in polynomial time.
Detailed diagnostic data is a prerequisite for debugging problems and understanding runtime performance in distributed wireless embedded systems. Severe bandwidth limitations, tight timing constraints, and limited program text space hinder the application of standard diagnostic tools within this domain. This work introduces the Log Instrumentation Specification (LIS), which provides a high level logging interface to developers and is able to create extremely compact diagnostic logs. LIS uses a token scoping technique to aggressively compact identifiers that are packed into bit aligned log buffers. LIS is evaluated in the context of recording call traces within a network of wireless sensor nodes. Our evaluation shows that logs generated using LIS require less than 50% of the bandwidth utilized by alternate logging mechanisms. Through microbenchmarking of a complete LIS implementation for the TinyOS operating system, we demonstrate that LIS can comfortably fit onto low-end embedded systems. By significantly reducing log bandwidth, LIS enables extraction of a more complete picture of runtime behavior from distributed wireless embedded systems.
Cyber Physical Systems are distributed systems-of-systems that integrate sensing, processing, networking and actuation. Aggregating physical data over space and in time emerges as an intrinsic part of data acquisition, and is critical for dependable decision making under performance and resource constraints. This paper presents a Linear Programming-based method for optimizing the aggregation of data sampled from geographically-distributed areas while satisfying timing, precision, and resource constraints. The paper presents experimental results for data aggregation, including a case study on gas detection using a network of sensors.
In this paper, we examine the impact of application task mapping on the reliability of MPSoC in the presence of single-event upsets (SEUs). We propose a novel soft error-aware design optimization using joint power minimization with voltage scaling and reliability improvement through application task mapping. The aim is to minimize the number of SEUs experienced by the MPSoC for a suitably identified voltage scaling of the system processing cores such that the power is reduced and the specified real-time constraint is met. We evaluate the effectiveness of the proposed optimization technique using an MPEG-2 decoder and random task graphs. We show that for an MPEG-2 decoder with four processing cores, our optimization technique produces a design that experiences 38% fewer SEUs than soft error-unaware design optimization for a soft error rate of 10^-9, while consuming 9% less power and meeting a given real-time constraint. Furthermore, we investigate the impact of architecture allocation (varying the number of MPSoC cores) on the power consumption and SEUs experienced. We show that for an MPSoC with six processing cores and a given real-time constraint, the proposed technique experiences up to 7% fewer SEUs compared to soft error-unaware optimization, while consuming only 3% more power.
On-chip clock networks are remarkable in their impact on the performance and power of synchronous circuits, in their susceptibility to adverse effects of semiconductor technology scaling, as well as in their strong potential for improvement through better CAD algorithms and tools. Our work offers new algorithms and a methodology for SPICE-accurate optimization of clock networks, coordinated to satisfy slew constraints and achieve best trade-offs between skew, insertion delay, power, as well as tolerance to variations. Our implementation, called Contango, is evaluated on 45nm benchmarks from IBM Research and Texas Instruments with up to 50K sinks.
To conserve energy, a design which utilizes different power modes has been widely adopted. However, when a design has many different power modes, clock tree optimization (CTO) becomes very difficult. In this paper, we propose a two-level power-mode-aware CTO methodology. Among all different power modes, the chip-level CTO globally reduces clock skew among modules, whereas the module-level CTO reduces clock skew within a single module. Our experimental results show that the power-mode-aware CTO can achieve significant improvement in the worst-case condition with only a minor penalty in area.
Keywords-power modes, clock tree, clock skew
Nondeterminism of multi-clock systems often complicates various system validation processes such as post-silicon debugging and at-speed testing, which has brought many difficulties to system designers and testers. The major source of nondeterministic behaviors is clock domain crossing, because the clocks that determine the timing of events are sensitive to variations. In this paper, we propose a general method to eliminate the nondeterminism resulting from clock domain crossing. This method does not assume any specific relationship among the clocks. Instead, to adapt to various clock conditions, an automatic configuration procedure and a periodic error canceling mechanism, which only require trivial hardware support, are proposed by analyzing the deterministic boundaries theoretically. To demonstrate the applicability of our method in practice, we implement it on an FPGA platform. Experimental results validate that the performance loss brought by our method over a conventional multi-clock FIFO is less than 2%.
Panelists: A. Nohl, B. Douglass, F. Schaefer, H. de Groot and F. Fummi
This panel proposes a discussion on the role and challenges of embedded software testing. Invited speakers represent different industrial and academic points of view on this topic. Specifically, the panelists were invited to discuss what exactly defines the testing of embedded software, what the roles of the platform designer and of the software designer are with respect to test, how embedded software testing really differs from traditional software testing, and whether current solutions and tools are sufficient for this task.
Keywords- Embedded software testing; embedded test; software test
Leading-edge CMOS technologies today are unique examples of nanoscale engineering at an industrial scale. As we celebrate this remarkable achievement of our industry that forms the ever-expanding technology basis of modern society, we cannot help but ponder the question of how we can continue to push the envelope of nano-electronics. With the end of Si FET scaling appearing increasingly near, searching for more scalable transistor structures in Si and in "beyond-Si" solutions has become imperative; from relatively "easy" transitions to non-planar Si structures, to the incorporation of high mobility semiconductors, like Ge and III-V's, to even higher mobility new materials such as carbon nanotubes, graphene, or other molecular structures. And even further, there are searches for new information representation and processing concepts beyond charge in FETs, as for example, in spin-state devices. Of course, declaring silicon dead is premature at best, and with this in mind I will discuss the challenges and possible scenarios for the introduction of novel nano-electronic devices.
This paper summarizes a special session on multicore/multi-processor system-on-chip (MPSoC) programming challenges. Wireless multimedia terminals are among the key drivers for MPSoC platform evolution. Heterogeneous multiprocessor architectures achieve high performance and can lead to a significant reduction in energy consumption for this class of applications. However, just designing energy-efficient hardware is not enough. Programming models and tools for efficient MPSoC programming are equally important to ensure optimum platform utilization. Unfortunately, this discipline is still in its infancy, which endangers the return on investment for MPSoC architecture designs. On one hand there is a need for maintaining and gradually porting a large amount of legacy code to MPSoCs. On the other hand, special C language extensions for parallel programming as well as adapted process network programming models provide a great opportunity to completely rethink the traditional sequential programming paradigm for the sake of higher efficiency and productivity. MPSoC programming is more than just code parallelisation, though. Besides energy efficiency, limited and specialized processing resources, and real-time constraints, growing software complexity and the mapping of simultaneous applications also need to be taken into account. We analyze the programming methodology requirements for heterogeneous MPSoC platforms and outline new approaches.
Due to increases in design complexity, routing a reset signal to all registers is becoming more difficult. One way to solve this problem is to reset only certain registers and rely on a software initialization sequence to reset other registers. This approach, however, may allow unknown values (also called X-values) in uninitialized registers to leak to other registers, leaving the design in a nondeterministic state. Although logic simulation can find some X-problems, it is not accurate and may miss bugs. A recent approach based on symbolic simulation can handle Xs accurately; however, it is not scalable. In this work we analyze the characteristics of X-problems and propose a methodology that leverages the accuracy of formal X-analysis and can scale to large designs. This is achieved by our novel partitioning techniques and the intelligent use of waveforms as stimulus. We applied our methodology to an industrial design and successfully identified several Xs unknown to the designers, including three real bugs, demonstrating the effectiveness of our approach.
Behavioral synthesis is the compilation of an Electronic system-level (ESL) design into an RTL implementation. We present a suite of optimizations for equivalence checking of RTL generated through behavioral synthesis. The optimizations exploit the high-level structure of the ESL description to ameliorate verification complexity. Experiments on representative benchmarks indicate that the optimizations can handle equivalence checking of synthesized designs with tens of thousands of lines of RTL.
Module paths are often used to specify the delays of cells in a Verilog cell library description; they define the propagation delay for an event from an input to an output. Specifying such paths manually is an error-prone task; a forgotten path is interpreted as a zero delay, which can cause further flaws in the subsequent design steps. Moreover, one can specify superfluous module paths, i.e., module paths that can never occur in any practical run of the model and hence impose excessive restrictions on subsequent design decisions. This paper presents a method to check whether the given module paths are reflected in the functional implementation. Complementing this check, we also present a method to derive module paths from a functional description of a cell.
Existing reachability analysis techniques tend to fail when applied to large compositional linear hybrid systems, since their memory usage rises quickly as system size increases. To address this problem, we propose a tool, BACH 2, that adopts a path-oriented method for bounded reachability analysis of compositional linear hybrid systems. For each component, a path is selected, and all selected paths compose a path set for reachability analysis. Each path is independently encoded as a set of constraints, and synchronization controls are encoded as a set of constraints as well. By merging all the constraints into one set, the path-oriented reachability problem of a path set can be transformed into the feasibility problem of the resulting linear constraint set, which can be solved efficiently by linear programming. Based on this path-oriented method, BACH 2 adopts a shared label sequence guided depth first search (SLS-DFS) method to perform bounded reachability analysis of compositional linear hybrid systems, where all potential path sets within the bound limit are identified and verified one by one. Since only the structure of a system and the most recently visited path in each component need to be stored in memory, the memory consumption of BACH 2 is very small at runtime. As a result, BACH 2 enables the verification of extremely large systems, as is demonstrated in our experiments.
The task scheduler of an energy harvesting wireless sensor node (WSN) must adapt the task complexity and maximize the accuracy of the tasks within the constraint of limited energy reserves. Structural Health Monitoring (SHM) represents a great example of such an application, comprising both steady-state operations and sporadic externally triggered events. To this end, we propose a task scheduler based on a Linear Regression Model embedded with Dynamic Voltage and Frequency Scaling (DVFS) functionality. Our results show an improvement in the average accuracy of an SHM measurement, setting it at 80% of the maximum achievable accuracy. There is also an increase of 50% in the number of SHM measurements.
Keywords - Energy harvester (EH); Task manager; Structural Health Monitoring (SHM); DVFS;
Wearable, mobile computing platforms are envisioned to be used in out-patient monitoring and care. These systems continuously perform signal filtering, transformations, and classification, which are quite compute-intensive and quickly drain the system energy. The design space of these human activity sensors is large and includes the choice of sampling frequency, feature detection algorithm, length of the window of transition detection, etc., and all these choices fundamentally trade off power/performance for accuracy of detection. In this work, we explore this design space and draw several conclusions that can be used as rules of thumb for quick, yet power-efficient designs of such systems. For instance, we find that the x-axis of our signal, which was oriented to be parallel to the forearm, is the most important signal to be monitored for our set of hand activities. Our experimental results show that by carefully choosing system design parameters, there is considerable (5X) scope for improving the performance/power of the system, for minimal (5%) loss in accuracy.
Blood oxygen saturation is one of the key parameters for health monitoring of premature infants at the neonatal intensive care unit (NICU). In this paper, we propose and demonstrate a design of a wearable wireless blood saturation monitoring system. A reflectance pulse oximeter based on Near Infrared Spectroscopy (NIRS) techniques is applied to enhance the flexibility of measurements at different locations on the body of the neonates and the compatibility for integration into a non-invasive monitoring platform, such as a neonatal smart jacket. Prototypes with the reflectance sensors embedded in soft fabrics are built. The thickness of the device is minimized to optimize comfort. To evaluate the performance of the prototype, experiments on premature babies were carried out at the NICU of Máxima Medical Centre (MMC) in Veldhoven, the Netherlands. The results show that the heart rate and SpO2 measured by the proposed design correspond to the readings of the standard monitor.
Keywords- neonatal monitoring; reflectance pulse oximeter; blood oxygen saturation monitoring; design process
The paper presents an active vision system for the automatic detection of falls and the recognition of several postures for elderly homecare applications. A wall-mounted Time-Of-Flight camera provides accurate measurements of the acquired scene in all illumination conditions, allowing the reliable detection of critical events. Preliminarily, an off-line calibration procedure estimates the external camera parameters automatically without landmarks, calibration patterns or user intervention. The calibration procedure searches for different planes in the scene, selecting the one that satisfies the floor plane constraints. Subsequently, the moving regions are detected in real-time by applying a Bayesian segmentation to the whole 3D point cloud. The distance of the 3D human centroid from the floor plane is evaluated by using the previously defined calibration parameters, and the corresponding trend is used as a feature in a thresholding-based clustering for fall detection. The fall detection shows high performance in terms of efficiency and reliability on a large real dataset in which almost one half of events are falls acquired in different conditions. The posture recognition is carried out by using both the 3D human centroid distance from the floor plane and the orientation of the body spine estimated by applying a topological approach to the range images. Experimental results on synthetic data validate the correctness of the proposed posture recognition approach.
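The centroid-height feature described above can be sketched with the standard point-to-plane distance formula. The simple threshold rule, the 0.4 m threshold and the three-frame window below are hypothetical stand-ins for the paper's thresholding-based clustering, whose details the abstract does not give.

```python
import math
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float, float]
Plane = Tuple[float, float, float, float]   # coefficients of a*x + b*y + c*z + d = 0


def centroid(points: Sequence[Point]) -> Point:
    """Mean position of the 3D points segmented as the moving person."""
    n = len(points)
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n,
            sum(p[2] for p in points) / n)


def distance_to_plane(point: Point, plane: Plane) -> float:
    """Perpendicular distance from a point to the (calibrated) floor plane."""
    a, b, c, d = plane
    x, y, z = point
    return abs(a * x + b * y + c * z + d) / math.sqrt(a * a + b * b + c * c)


def is_fall(heights: Iterable[float], threshold: float = 0.4, min_frames: int = 3) -> bool:
    """Flag a fall when the centroid stays below the threshold for several frames.

    Both the threshold (metres) and the frame count are assumed values.
    """
    run = 0
    for h in heights:
        run = run + 1 if h < threshold else 0
        if run >= min_frames:
            return True
    return False
```

Requiring several consecutive low frames is one simple way to avoid flagging a single noisy measurement as a fall.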
Keywords- Fall detection, posture recognition, range imaging, self-calibration, plane detection.
In this paper we present a cross-domain application for ambient and health monitoring. The system architecture is intended to be openly extensible in order to fulfil unanticipated needs. Our implementation addresses diverse groups, from those requiring heart related monitoring, which could be dependent on the environmental conditions, to those who need to maximize environmental comfort under specified energy consumption constraints. In this application the ambient information is used to enrich the biomedical data and provide a more complete picture to the information consumers, such as doctors and building superintendents. The cross-domain nature of the scenario requires data interoperability, which is ensured by a shared Smart Space. The Smart Space represents the information in Resource Description Framework and its semantics are ontology driven. A simple ontology for the addressed information domain is also presented. The Smart Space platform is provided by the JTI Artemis SOFIA project.
Keywords- Information interoperability; smart space; cross-domain application; ontology; heart rate; discomfort index
Pervasive computing environments consist of many independent collaborating electronic devices, including sensors and actuators. Ad hoc extensibility of such systems is desirable, but current network technologies either rely on a central coordinator device in the network or define application profiles which are not easy to extend and maintain. The distributed architecture proposed in this paper allows these devices to organize themselves automatically to execute a pervasive system application without the intervention of a central controlling device. The knowledge that defines interactions between these devices is derived from an ontological model of a particular domain. This knowledge is distributed over the devices such that every device only has information about its own interactions and operations. A simple demonstration of this architecture is presented.
Increasing dynamic variability with technology scaling has made it essential to incorporate large design-time timing margins to ensure yield and reliable operation. Online techniques for timing error resilience help recover timing margins, improving performance and/or power consumption. This paper presents TIMBER, a technique for online timing error resilience that masks timing errors by borrowing time from successive pipeline stages. TIMBER-based error masking can recover timing margins without instruction replay or roll-back support. Two sequential circuit elements - TIMBER flip-flop and TIMBER latch - that implement error masking based on time-borrowing are described. Both circuit elements are validated using corner-case circuit simulations, and the overhead and trade-offs of TIMBER-based error masking are evaluated on an industrial processor.
There is a growing concern about the increasing vulnerability of future computing systems to errors in the underlying hardware. Traditional redundancy techniques are expensive for designing energy-efficient systems that are resilient to high error rates. We present Error Resilient System Architecture (ERSA), a low-cost robust system architecture for emerging killer probabilistic applications such as Recognition, Mining and Synthesis (RMS) applications. While resilience of such applications to errors in low-order bits of data is well-known, execution of such applications on error-prone hardware significantly degrades output quality (due to high-order bit errors and crashes). ERSA achieves high error resilience to high-order bit errors and control errors (in addition to low-order bit errors) using a judicious combination of 3 key ideas: (1) asymmetric reliability in many-core architectures, (2) error-resilient algorithms at the core of probabilistic applications, and (3) intelligent software optimizations. Error injection experiments on a multi-core ERSA hardware prototype demonstrate that, even at very high error rates of 20,000 errors/second/core or 2×10^-4 errors/cycle/core (with errors injected in architecturally-visible registers), ERSA maintains 90% or better accuracy of output results, together with minimal impact on execution time, for probabilistic applications such as K-Means clustering, LDPC decoding and Bayesian networks. Moreover, we demonstrate the effectiveness of ERSA in tolerating high rates of static memory errors that are characteristic of emerging challenges such as Vccmin problems and erratic bit errors. Using the concept of configurable reliability, ERSA platforms may also be adapted for general-purpose applications that are less resilient to errors (but at higher costs).
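The two error rates quoted in the ERSA abstract (per second and per cycle) can be cross-checked with a line of arithmetic. The sketch below is only a consistency check: the clock frequency is not stated in the abstract and is inferred here from the ratio of the two figures, and the function name `implied_clock_hz` is hypothetical.

```python
def implied_clock_hz(errors_per_second, errors_per_cycle):
    """Clock frequency implied by quoting one error rate in two units.

    errors/second divided by errors/cycle leaves cycles/second.
    """
    return errors_per_second / errors_per_cycle


# Figures quoted in the abstract; the resulting frequency is an inference,
# not a number reported by the authors.
clock_hz = implied_clock_hz(20_000, 2e-4)
print(f"implied clock: about {clock_hz / 1e6:.0f} MHz")
```

Under these two figures the implied core clock is on the order of 100 MHz, which is plausible for a hardware prototype but remains an inference.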
Topology virtualization techniques are proposed for NoC-based many-core processors with core-level redundancy to isolate hardware changes caused by on-chip defective cores. Prior work focuses on homogeneous cores with symmetric performance and optimizes on-chip communication only. However, core-to-core performance asymmetry due to manufacturing process variations poses new challenges for constructing virtual topologies. Lower performance cores may be scattered over a virtual topology, while operating systems typically allocate tasks to contiguous cores. As a result, parallel applications are likely to be assigned to a region containing many slower cores that become bottlenecks. To tackle this problem, we present a novel performance-asymmetry-aware reconfiguration algorithm, Bubble-Up, based on a new metric called core fragmentation factor (CFF). Bubble-Up arranges cores with similar performance closer together while maintaining reasonable hop distances between virtual neighbors, thus accelerating applications with a higher degree of parallelism without changing existing OS allocation strategies. Experimental results show its effectiveness.
Continued CMOS scaling is expected to make future microprocessors susceptible to transient faults, hard faults, manufacturing defects and process variations causing fault tolerance to become important even for general purpose processors targeted at the commodity market. To mitigate the effect of decreased reliability, a number of fault-tolerant architectures have been proposed that exploit the natural coarse-grained redundancy available in chip multiprocessors (CMPs). These architectures execute a single application using two threads, typically as one leading thread and one trailing thread. Errors are detected by comparing the outputs produced by these two threads. These architectures schedule a single application on two cores or two thread contexts of a CMP. As a result, besides the additional energy consumption and performance overhead that is required to provide fault tolerance, such schemes also impose a throughput loss. Consequently a CMP which is capable of executing 2n threads in non-redundant mode can only execute half as many (n) threads in fault-tolerant mode. In this paper we propose multiplexed redundant execution (MRE), a low-overhead architectural technique that executes multiple trailing threads on a single processor core. MRE exploits the observation that it is possible to accelerate the execution of the trailing thread by providing execution assistance from the leading thread. Execution assistance combined with coarse-grained multithreading allows MRE to schedule multiple trailing threads concurrently on a single core with only a small performance penalty. Our results show that MRE increases the throughput of fault-tolerant CMP by 16% over an ideal dual modular redundant (DMR) architecture.
This paper presents a methodology to evaluate and optimize the robustness of an embedded system in terms of invariability in case of design revisions. Early decisions in embedded system design may be revised in later stages, resulting in additional costs. A method that quantifies the expected additional costs as the robustness value is proposed. Since the determination of the robustness based on arbitrary revisions is computationally expensive, an efficient set-based approach that uses a symbolic encoding as Binary Decision Diagrams is presented. Moreover, a methodology for the integration of the optimization of the robustness into a design space exploration is proposed. Based on an external archive that also accepts near-optimal solutions, this robustness-aware optimization is efficient since, unlike previous approaches, it does not require additional function evaluations. Two realistic case studies give evidence of the benefits of the proposed approach.
In this paper, we consider energy minimization for multiprocessor system-on-a-chip (MPSoC) under a lifetime reliability constraint of the system, which has become a serious concern for the industry with technology scaling. As today's complex embedded systems typically have multiple execution modes, we first identify a set of "good" task allocations and schedules for each execution mode in terms of lifetime reliability and/or energy consumption, and then we introduce novel techniques to obtain an optimal combination of these single-mode solutions, which is able to minimize the energy consumption of the entire multi-mode system while satisfying a given lifetime reliability constraint. Experimental results on several hypothetical MPSoC platforms with various task graphs demonstrate the effectiveness of the proposed approach.
Multi-Processor System-on-Chips (MPSoCs) exploit task-level parallelism to achieve high computation throughput, but concurrent memory accesses from multiple PEs may cause a memory bottleneck. Therefore, to maximize system performance, it is important to simultaneously consider the PE and on-chip memory architecture design. However, in a traditional MPSoC design flow, PE allocation and on-chip memory allocation are often considered independently. To tackle this problem, we propose the first PE and Memory Co-synthesis (PM-COSYN) framework for MPSoCs. One critical issue in such a memory-aware MPSoC design is how to utilize the available die area to achieve a balanced design between memory and computation subsystems. Therefore, the goal of PM-COSYN is to allocate PE and on-chip memory for MPSoCs with Network-on-Chip (NoC) architecture such that system performance is maximized and the area constraint is met. The experimental results show that PM-COSYN can synthesize NoC resource allocation according to the needs of the target task set. Compared to a Simulated-Annealing method, PM-COSYN generates a comparable solution with much shorter CPU time.
Wear-out related permanent faults are projected to make system lifetime a critical issue for all designs. In embedded systems, lifetime can be increased using slack, underutilization in execution and storage resources, so that when components fail, data and tasks can be re-mapped and re-scheduled. The design space of possible slack allocation is both large and complex. However, based on the observation that useful slack is often quantized, we have developed an approach that effectively and efficiently allocates execution and storage slack to jointly optimize system lifetime and cost. While exploring less than 1.4% of the slack allocation design space, our approach consistently outperforms alternative slack allocation techniques to find sets of designs within 1.4% of the lifetime-cost Pareto-optimal front.
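The lifetime-cost Pareto-optimal front mentioned in the slack allocation abstract can be illustrated with a small filter over candidate designs. This is a minimal sketch of the general notion of Pareto dominance, not the authors' exploration method; the function name `pareto_front` and the sample design points are hypothetical.

```python
def pareto_front(designs):
    """Return the designs that no other design dominates.

    Each design is a (lifetime, cost) pair. Longer lifetime is better and
    lower cost is better. Design j dominates design i when j is at least as
    good in both objectives and strictly better in at least one.
    """
    front = []
    for i, (life_i, cost_i) in enumerate(designs):
        dominated = any(
            life_j >= life_i and cost_j <= cost_i
            and (life_j > life_i or cost_j < cost_i)
            for j, (life_j, cost_j) in enumerate(designs)
            if j != i
        )
        if not dominated:
            front.append((life_i, cost_i))
    return front


# Hypothetical (lifetime in years, cost in arbitrary units) design points.
candidates = [(5.0, 10.0), (6.0, 12.0), (4.0, 11.0), (6.0, 15.0)]
print(pareto_front(candidates))
```

The quadratic scan is adequate for a handful of candidates; a large design space would call for sorting by one objective first.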
Energy harvesting has emerged as a feasible and attractive option to improve battery lifetime in micro-scale electronic systems such as biomedical implants and wireless sensor nodes. A key challenge in designing micro-scale energy harvesting systems is that miniature energy transducers (e.g., photovoltaic cells, thermo-electric generators, and fuel cells) output very low voltages (0-0.4V). Therefore, a fully on-chip power converter (usually based on a charge pump) is used to boost the output voltage of the energy transducer and transfer charge into an energy buffer for storage. However, the charge transfer capability of widely used linear charge pump based power converters degrades when used with ultra-low voltage energy transducers. This paper presents the design of a new tree topology charge pump that has a reduced charge sharing time, leading to an improved charge transfer capability. The proposed design has been implemented using 65nm technology and circuit simulations demonstrate that the proposed design results in an increase of up to 30% in harvested power compared to existing linear charge pumps.
Nowadays digital systems have very high switching frequencies. Hence analogue effects can have a serious impact on data transmissions of connected modules in System-on-Chip (SoC) designs. The implications include attenuation, delay, and others which have to be considered as important effects. However, analogue technology models comprise too many details to be usable at system level, as the simulation time would be far too high compared to traditional Transaction Level Modelling (TLM) models. In this paper we illustrate different aspects of using analogue line models as a transmission method for transactions between TLM models. This includes the introduction of analogue signal paths for TLM models and how to avoid the simulation time penalty of analogue technology models. We show how this approach can even be used to apply analogue effects to electronic system level (ESL) performance evaluations by further reducing the amount of detail of the analogue effects.
This paper proposes a circuit optimization approach that can ease the computational burden on simulation-based circuit optimizers by leveraging simple design equations that reflect the designer's intent. The technique is inspired by continuation methods (a.k.a. homotopy) in numerical analysis, where a hard problem is solved by constructing an easier problem first and gradually refining its solution to that of the hard problem. In a circuit optimization context, the designer's simplified equations for the circuit serve as the easier problem. These simplified design equations are easy to write as they need not be completely accurate and have intuitive, well-understood solutions. Nonetheless, in several circuit examples, it was found that the designer's equations serve as better guidance than the conventional, fixed-point equations. As a result, the proposed approach demonstrates better convergence to the desired solution with less computational effort.
Keywords-Transistor Sizing, Circuit Optimization, Automated Design
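The continuation idea described in the circuit optimization abstract above can be sketched on a scalar toy problem. The abstract does not give the homotopy it uses; the convex-combination form H(x, λ) = (1 − λ)·g(x) + λ·f(x) below is a textbook choice and an assumption here, as are the function name `continuation_solve` and the example equations.

```python
def continuation_solve(f, df, g, dg, x0, steps=10, newton_iters=20, tol=1e-12):
    """Track a root from the easy problem g(x) = 0 to the hard one f(x) = 0.

    lam is raised from 0 to 1 in equal steps. At each step Newton's method
    solves H(x, lam) = (1 - lam) * g(x) + lam * f(x) = 0, starting from the
    solution of the previous step.
    """
    x = x0
    for k in range(1, steps + 1):
        lam = k / steps
        for _ in range(newton_iters):
            h = (1 - lam) * g(x) + lam * f(x)
            dh = (1 - lam) * dg(x) + lam * df(x)
            step = h / dh
            x -= step
            if abs(step) < tol:
                break
    return x


# Hypothetical stand-ins: f plays the role of the accurate model and
# g the role of a designer's rough equation whose root (x = 2) is known.
root = continuation_solve(
    f=lambda x: x**3 - 2 * x - 5,
    df=lambda x: 3 * x**2 - 2,
    g=lambda x: x - 2,
    dg=lambda x: 1.0,
    x0=2.0,
)
print(root)
```

Because each Newton solve starts close to its answer, the intermediate problems stay cheap; in a simulation-based flow each evaluation of `f` would be a circuit simulation rather than a closed-form expression.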
Data centers are an integral part of daily life. From web search to online banking, online shopping to medical records, we rely on data centers for almost everything. Malfunctions in the operation of such data centers have become an inseparable part of our daily lives as well. Major causes of malfunction include hardware and software failures, design errors, malicious attacks and incorrect human interactions. The consequences of such malfunctions are enormous: loss of human life, financial loss, fraud, waste of time and energy, loss of productivity, and frustration with computing. Therefore, the reliability of these systems plays a critical role in all aspects of our day-to-day life.
We have proposed (σ, ρ)-based flow regulation to reduce delay and backlog bounds in SoC architectures, where σ bounds the traffic burstiness and ρ the traffic rate. The regulation is conducted per-flow for its peak rate and traffic burstiness. In this paper, we optimize these regulation parameters in networks on chips where many flows may have conflicting regulation requirements. We formulate an optimization problem for minimizing total buffers under performance constraints. We solve the problem with the interior point method. Our case study results exhibit a 48% reduction of total buffers and a 16% reduction of total latency for the proposed problem. The optimization solution has low run-time complexity, enabling quick exploration of a large design space.
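The (σ, ρ) constraint in the abstract above says that, over any interval of length t, a flow may send at most σ + ρ·t units of traffic. A token bucket of depth σ refilled at rate ρ is the usual way to check this. The sketch below is a conformance check only, not the authors' regulator or their parameter optimization; the function name `conforms` and the sample traces are hypothetical.

```python
def conforms(arrivals, sigma, rho):
    """Check whether a flow respects the (sigma, rho) arrival bound.

    arrivals: list of (time, flits) pairs sorted by time.
    The bucket starts full with sigma tokens and gains rho tokens per time
    unit, capped at sigma. Each arrival must find enough tokens.
    """
    tokens = sigma
    last_time = arrivals[0][0] if arrivals else 0.0
    for time, flits in arrivals:
        tokens = min(sigma, tokens + rho * (time - last_time))
        last_time = time
        if flits > tokens:
            return False  # burst exceeds sigma + rho * t over some interval
        tokens -= flits
    return True


# Hypothetical traces in (cycle, flits).
print(conforms([(0, 4), (1, 1), (2, 1)], sigma=4, rho=1))
print(conforms([(0, 4), (1, 2)], sigma=4, rho=1))
```

In the second trace, 6 flits arrive within an interval of length 1, which exceeds σ + ρ·t = 5, so the check fails.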
Networks-on-Chip (NoCs) are a promising interconnect paradigm to address the communication bottleneck of Systems-on-Chip (SoCs). Wormhole flow control is widely used as the transmission protocol in NoCs, as it offers high throughput and low latency. To match the application characteristics, customized irregular topologies and routing functions are used. With wormhole flow control and custom irregular NoC topologies, deadlocks can occur during system operation. Ensuring deadlock-free operation of custom NoCs is a major challenge. In this paper, we address this important issue and present a method to remove deadlocks in application-specific NoCs. Our method can be applied to any NoC topology and routing function, and the potential deadlocks are removed by adding a minimal number of virtual or physical channels. Experiments on a variety of realistic benchmarks show that, compared to state-of-the-art deadlock removal methods, our method yields a large reduction in the number of resources needed (88% on average) as well as reductions in NoC power consumption and area (66% area savings on average).
Keywords - Network-on-Chip (NoC), deadlock, topology, application specific
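A common way to reason about wormhole deadlock, which the deadlock removal abstract above builds on at least conceptually, is the channel dependency graph: for deterministic routing, an acyclic graph is a sufficient condition for deadlock freedom. The sketch below only detects cycles and shows one being broken by a hypothetical extra virtual channel; it is not the authors' removal algorithm, and the channel names are made up.

```python
def has_cycle(cdg):
    """Depth-first search for a cycle in a channel dependency graph.

    cdg maps each channel to the channels a packet holding it may request
    next. A cycle means a set of packets can wait on each other forever.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(channel):
        color[channel] = GREY
        for nxt in cdg.get(channel, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                return True  # back edge: dependency cycle found
            if state == WHITE and visit(nxt):
                return True
        color[channel] = BLACK
        return False

    return any(color.get(c, WHITE) == WHITE and visit(c) for c in list(cdg))


# Four channels of a unidirectional ring: every channel waits on the next.
ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}

# Same ring with the wrap-around dependency moved to an added virtual
# channel (hypothetical name c0_vc1) on which packets drain.
ring_fixed = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0_vc1"], "c0_vc1": []}

print(has_cycle(ring), has_cycle(ring_fixed))
```

The recursive search is fine for small illustrative graphs; a large NoC would need an iterative traversal to avoid Python's recursion limit.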
Today, due to the increasing demand for more and more complex applications in the consumer electronic market segment, Systems-on-Chip consist of many processing elements and become larger and larger. While on-chip system designers must be able to get fast and accurate communication performance analysis for such huge systems, the simulation-based approaches are not adequate anymore. Addressing the increasing need for early performance evaluation in NoC-based system design flow, this paper presents a generic analytical method to estimate communication latencies and link-buffer utilizations for a given NoC architecture with a given application mapped on it. The accuracy of our method is experimentally compared with the results obtained from Cycle-Accurate SystemC simulations.
MIMO wireless technology is required to increase the data rates for a broad range of applications, including low cost mobile devices. In this paper we present a very low area reconfigurable MIMO detector which achieves a high throughput of 103 Mbps and uses 27 kilogates when implemented in a commercial 180 nm CMOS process. The low area is achieved by the proposed in-place architecture. This architecture implements the K-best algorithm and reduces area 4-fold compared to the widely used multi-stage architecture, while providing reconfigurability in terms of antenna configuration during real-time operation.
The paper describes an embedded circuit for single-shot jitter measurement of a clock signal. Based on a jitter amplification technique with a pulse removing mechanism, picosecond-level resolution is achieved over a wide frequency range. In addition, a gain-locked loop calibration scheme is proposed to keep the amplification ratio constant under PVT variations. Fabricated in a 0.13-µm CMOS process, the tested circuit achieves a resolution of 2 ps root mean square (rms) jitter for inputs ranging from a few tens of megahertz to 1.6 GHz.
Production verification of analog circuit specifications is a challenging task requiring expensive test equipment and time consuming procedures. This paper presents a method for low cost on-chip parameter verification based on the analysis of a digital signature. A 65 nm CMOS on-chip monitor is proposed and validated in practice. The monitor composes two signals (x(t), y(t)) and divides the X-Y plane with nonlinear boundaries in order to generate a digital code for every analog (x, y) location. A digital signature is obtained using the digital code and its time duration. A metric defining a discrepancy factor is used to verify circuit parameters. The method is applied to detect possible deviations in the natural frequency of a Biquad filter. Simulated and experimental results show the possibilities of the proposal.
Index Terms - Mixed-Signal Test, Specification Verification, Monitoring, Nonlinear Zone Boundary.
This paper addresses the problem of stochastic task execution time estimation agnostic to the process distributions. The proposed method is orthogonal to the application structure and underlying architecture. We build a time-varying state-space model of the task execution time. In the case of software pipelined tasks, to refine the estimate quality, the state space is modeled as a Multiple Input Single Output (MISO) system by taking into account the current execution time of the predecessor task. To obtain nearly Bayesian estimates, irrespective of the process distribution, the sequential Monte Carlo method is applied, which forms a recursive solution that reduces overheads and comprises time update and correction steps. We experimented on three different platforms, including multicore, using a time-parallelized H.264 decoder (a control-dominant, computationally demanding application) and an AES encoder (a pure data-flow application). Results show that estimates obtained by our method are superior in quality and are up to 68% better in comparison to others.
The growing trend towards using a component-based design approach in embedded system development requires addressing newer system engineering challenges. These systems are usually time critical and require timing guarantees from components. The articulation of desirable response bounds for the components is often ad hoc and happens late in development. In this work, we present a formal-methods-based methodology for early-stage design space exploration. We focus on the real-time response of a component as a basis for exploration and allow the developer to model it using constant values or parameters. To quantify the parameters, we propose a novel constraint synthesis technique to correlate response times of interacting components. Finally, for system integration, we introduce a new notion of timing layout to specify time-budgeting for each component. The selection of a suitable layout can be made based on system optimization criteria. We have demonstrated our methodology on an automotive Adaptive Cruise Control feature.
We present a new language called Precision Timed C (PRET-C), for predictable and lightweight multithreading in C. PRET-C supports synchronous concurrency, preemption, and a high-level construct for logical time. In contrast to existing synchronous languages, PRET-C offers C-based shared memory communications between concurrent threads, which is guaranteed to be thread safe via the proposed semantics. Mapping of logical time to physical time is achieved by a Worst Case Reaction Time (WCRT) analyser. To improve throughput while maintaining predictability, a hardware accelerator specifically designed for PRET-C is added to a soft-core processor. We then demonstrate through extensive benchmarking that the proposed approach not only achieves completely predictable execution, but also improves overall throughput when compared to the software execution of PRET-C. The PRET-C software approach is also significantly more efficient in comparison to two other light-weight concurrent C variants called SC and Protothreads, as well as the well-known synchronous language Esterel.
We present an Inversed Temperature Dependence (ITD) aware clock skew scheduling framework. Specifically, we demonstrate how our framework can assist dual-Vth assignment in preventing timing violations arising due to ITD effect. We formulate the ITD aware synthesis problem and prove that it is NP-Hard. Then, we propose an algorithm for synergistic temperature aware clock skew scheduling and dual-Vth assignment. Experiments on ISCAS89 benchmarks reveal that several circuits synthesized by the traditional high-temperature corner based flow with a commercial tool exhibit timing violations in the low temperature range while all circuits generated using our methodology for the same timing constraints have guaranteed timing.
Energy consumption is a critical parameter in wireless healthcare systems which consist of battery operated devices such as sensors and local aggregators. The system battery lifetime depends on the allocation of processing, sensing, and communication tasks to devices of the system. In this paper, we optimize the battery life of a wireless healthcare system by efficiently assigning tasks to the available resources. There are several dynamically changing characteristics in the system, such as task parameters (processing complexity, arrival rate, and output data), each device's available battery capacity, varying wireless channel conditions, and network load. Our dynamic task assignment algorithm, "DynAHeal", adapts to such changing conditions and improves the battery life. Our experiments show that the task assignment given by DynAHeal improves the overall system lifetime under varying dynamic conditions by 60% on average relative to sending all the data for processing to the base station, and by 35% with respect to an optimal static design-time assignment.
Fault tolerance (FT) has become a major concern in computing systems. Instruction duplication has been proposed to verify application execution at run time. Two techniques, instruction memoization and precomputation, have been shown to improve the performance and fault coverage of duplication. This work shows that the combination of these two techniques is much more powerful than either one in isolation. In addition to performance, it improves the long-lasting transient and permanent fault coverage relative to the memoization scheme. Compared to the precomputation scheme, it reduces the long-lasting transient and permanent fault coverage of 10.6% of the instructions, but covers 2.6 times as many instructions against shorter transient faults. On a system with 2 integer ALUs, the combined scheme reduces the performance degradation due to duplication by on average 27.3% and 22.2% compared to the precomputation and memoization-based techniques, respectively, with similar hardware requirements.
Modern embedded multimedia systems process multiple concurrent streams of data processing jobs. Streams often have throughput requirements. These jobs are implemented on a multiprocessor system as a task graph. Tasks communicate data over buffers, where tasks wait on sufficient space in output buffers before producing their data. For cost reasons, jobs share resources. Because jobs can share resources with other jobs that include tasks with data-dependent execution rates, we assume run-time scheduling on shared resources. Budget schedulers are applied, because they guarantee a minimum budget in a maximum replenishment interval. Both the buffer sizes and the budgets influence the temporal behaviour of a job. Interestingly, a trade-off exists: a larger buffer size can allow for a smaller budget while still meeting the throughput requirement. This work is the first to address the simultaneous computation of budget and buffer sizes. We solve this non-linear problem by formulating it as a second-order cone program. We present tight approximations to obtain a non-integral second-order cone program that has polynomial complexity. Our experiments confirm the non-linear trade-off between budget and buffer sizes.
Trajectory piecewise-linear macromodeling (TPWL) technique has been widely employed to characterize strong nonlinear circuits, and makes the reduction of the strong nonlinear circuits possible. The trajectory piecewise-linear macromodeling technique linearizes nonlinear circuits around multiple expansion points which are extracted from state trajectories driven by training inputs. However, the accuracy of the trajectory piecewise-linear macromodeling technique heavily relies on the extracted expansion points and the training inputs. It will lead to large error in simulation if state vector reaches regions far away from the extracted expansion points. In this paper, we propose an efficient transistor-level piecewise linearization scheme for macromodeling of nonlinear circuits. Piecewise linear models are first built for each transistor. The macromodel of the whole nonlinear circuit is then constructed by combining all the piecewise-linear models of the transistors together with appropriate weight functions. The proposed approach can cover remarkably larger state space than the TPWL method. By using the complete piecewise-linear models of the transistors, the constructed piecewise-linear models of the nonlinear circuits are capable of covering the whole state space of the nonlinear circuits. More importantly, model order reduction of the proposed transistor-level piecewise linearization macromodel is also possible, which makes the proposed method a potentially good macromodeling approach for model order reduction of nonlinear circuits.
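The core mechanism shared by TPWL and the transistor-level scheme in the abstract above, blending several local linearizations with weight functions, can be shown on a scalar function. The abstract does not specify the weight functions; the normalized Gaussian-style weights below are an assumption, as are the function name `make_pwl_model`, the parameter `beta`, and the cubic test function.

```python
import math


def make_pwl_model(f, df, expansion_points, beta=50.0):
    """Build a weighted piecewise-linear approximation of a scalar function.

    Each expansion point xi contributes its tangent line
    f(xi) + df(xi) * (x - xi). The lines are blended with normalized
    weights that decay with the squared distance from xi.
    """
    def model(x):
        sq_dists = [(x - xi) ** 2 for xi in expansion_points]
        nearest = min(sq_dists)
        # Subtracting the nearest distance keeps the largest weight at 1,
        # which avoids dividing by zero when x is far from every point.
        weights = [math.exp(-beta * (d - nearest)) for d in sq_dists]
        total = sum(weights)
        return sum(
            (w / total) * (f(xi) + df(xi) * (x - xi))
            for w, xi in zip(weights, expansion_points)
        )
    return model


# Hypothetical nonlinearity standing in for a device characteristic.
model = make_pwl_model(lambda v: v**3, lambda v: 3 * v * v,
                       expansion_points=[0.0, 0.5, 1.0, 1.5, 2.0])
print(model(1.0), model(1.2))
```

The approximation is accurate near an expansion point and degrades between points, which mirrors the abstract's remark that accuracy depends on how well the expansion points cover the state space.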
This panel session will address the great challenges and opportunities of post-CMOS research in the More Than Moore and Beyond CMOS domains. The opening talk of the panel will set the ground for discussion with some key examples placed at the intersection of the two domains. In particular, the role of functional diversification and of new research and application drivers, different from scaling, will be critically discussed by a team of high-level experts in the field. Moreover, the debate will center on the impact of the post-CMOS era on the way academic research and engineering education are conceived today and on how they should be adapted in the future.
3D integration is a key solution to the predicted performance increase of future electronic systems. It offers extreme miniaturization and fabrication of More than Moore products. This can be accomplished by the combination of Through-Silicon-Via (TSV) technologies for shortened electrical signal lines and Solid Liquid Interdiffusion (SLID) for highly reliable assembly. Depending on the chosen technology concept, TSVs are filled with either tungsten or copper metal. Thinning of silicon as part of the process flow enables devices as thin as 30 μm, so multilayer stacking will result in ultra-thin systems. All these 3D integration concepts focus on wafer level processing to achieve the highest miniaturization degree and highest processing reliability as well as enabling high volume cost-effective fabrication.
Keywords: Through-Silicon-Via, Solid Liquid Interdiffusion
3D stacking and integration can provide system advantages. Following a brief technology review, this abstract explores application drivers, design and CAD for 3D ICs. The main application area explored in detail is that of logic on memory. This application is explored in a specific DSP example. Finally critical areas that need better solutions are explored. These include design planning, test management, and thermal management.
Keywords-3DIC; 3D IC; three dimensional IC; TSV; stacked memory; memory on logic
To meet customers' product-quality expectations, each individual IC needs to be tested for manufacturing defects incurred during its many high-precision, and hence defect-prone, manufacturing steps; these tests should be both effective and cost-efficient. The semiconductor industry is now preparing itself for three-dimensional stacked ICs (3D-SICs) based on Through-Silicon Vias (TSVs), which, due to their many compelling benefits, are quickly gaining ground. Test solutions need to be ready for this new generation of "super chips". 3D-SICs are chips where all basic, as well as most advanced, test technologies come together. In addition, they pose some truly new test challenges with respect to complexity and cost, due to their advanced manufacturing processes and physical access limitations. This presentation focuses on the available solutions and still open challenges for testing 3D-SICs. It discusses flows for wafer-level and package-level tests, the challenges with respect to test contents and wafer-level probe access, and the on-chip Design-for-Test (DfT) infrastructure required for 3D-SICs.
Many CAD for VLSI problems can be naturally encoded as Quantified Boolean Formulas (QBFs) and solved with QBF solvers. Furthermore, such problems often contain circuit-based information that is lost during the translation to Conjunctive Normal Form (CNF), the format accepted by most modern solvers. In this work, a novel preprocessing framework for circuit-based QBF problems is presented. It leverages structural circuit dominators to reduce the problem size and expedite the solving process. Our circuit-based QBF preprocessor PReDom recursively reduces dominated subcircuits to return a simpler but equisatisfiable QBF instance. A rigorous proof is given for eliminating subcircuits dominated by single outputs, irrespective of input quantifiers. Experimental results are presented for circuit diameter computation problems. With preprocessing times of at most five seconds using PReDom, three state-of-the-art QBF solvers can solve 27% to 45% of our problem instances, compared to none without preprocessing.
Networks-on-chips (NoCs) are emerging as a promising interconnect solution for efficient Multi-Processor Systems-on-Chips. We propose a methodology that supports the specification of parametric NoCs. We provide sufficient constraints that ensure deadlock-free routing, functional correctness, and liveness of the design. To illustrate our method, we discharge these constraints for a parametric NoC inspired by the HERMES architecture.
We address the problem of computing the exact abstraction of a program with respect to a given set of predicates, a key computation step in Counter-Example Guided Abstraction Refinement. We build on a recently proposed approach that integrates BDD-based quantification techniques with SMT-based constraint solving to compute the abstraction. We extend the previous work in three main directions. First, we propose a much tighter integration of the BDD-based and SMT-based reasoning where the two solvers strongly collaborate to guide the search. Second, we propose a technique to reduce redundancy in the search by blocking already visited models. Third, we present an algorithm exploiting a conjunctively partitioned representation of the formula to quantify. This algorithm provides a general framework where all the presented optimizations integrate in a natural way. Moreover, it overcomes the limitations of the original approach, which used a monolithic BDD representation of the formula to quantify. We experimentally evaluate the merits of the proposed optimizations and show that they significantly improve over previous approaches.
The H.264/AVC video encoder standard significantly improves the compression efficiency by using variable block-sized Inter (P) and Intra (I) Macroblock (MB) coding modes. In this paper, we propose a novel Human Visual System based Adaptive Computational Complexity Reduction Scheme (ACCoReS). It performs Prognostic Early Mode Exclusion and a Hierarchical Fast Mode Prediction to exclude as many I-MB and P-MB coding modes as possible (up to 73%) even before the actual Rate Distortion Optimized Mode Decision (RDO-MD) and Motion Estimation while keeping a good quality. In the best case, ACCoReS processes exactly one MB Type and one corresponding near-optimal coding mode, such that the complete RDO-MD process is skipped. Experimental results show that compared to state-of-the-art approaches, ACCoReS achieves a speedup of up to 9.14x (average 3x) with an average PSNR loss of 0.66 dB. Compared to exhaustive RDO-MD, our ACCoReS provides a performance improvement of up to 19x (average 10x) for an average 3% PSNR loss.
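The quality figures in the abstract above are given as PSNR losses. For reference, PSNR is computed from the mean squared error between a reference and a reconstructed frame as 10·log10(peak² / MSE). The sketch below uses flat lists of 8-bit samples; the function name `psnr` and the sample values are hypothetical, and nothing here reproduces the authors' measurements.

```python
import math


def psnr(reference, test, peak=255):
    """Peak Signal to Noise Ratio in dB between two equal-length sample lists.

    peak is the largest representable sample value (255 for 8-bit video).
    Identical inputs have zero error, so the ratio is reported as infinity.
    """
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical four-pixel "frames": two samples are off by one level.
original = [100, 100, 100, 100]
decoded = [101, 99, 100, 100]
print(f"{psnr(original, decoded):.2f} dB")
```

Because the scale is logarithmic, a loss of a fraction of a dB, such as the 0.66 dB quoted above, corresponds to a modest increase in mean squared error.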
Ubiquitous image processing tasks (such as transform decompositions, filtering and motion estimation) do not currently provide graceful degradation when their clock-cycle budgets are reduced, e.g. when delay deadlines are imposed in a multi-tasking environment to meet throughput requirements. This is an important obstacle in the quest for full utilization of modern programmable platforms' capabilities, since: (i) worst-case considerations must be in place for reasonable quality of results; (ii) throughput-distortion tradeoffs are not possible for distortion-tolerant image processing applications without cumbersome (and potentially costly) system customization. In this paper, we extend the functionality of the recently-proposed software framework for operational refinement of image processing (ORIP) and demonstrate its inherent throughput-distortion and energy-distortion scalability. Importantly, our extensions allow for such scalabilities at the software level, without needing hardware-specific customization. Extensive tests on a mainstream notebook computer and on OLPC's subnotebook ("xo-laptop") verify that the proposed designs provide for: (i) seamless quality-complexity scalability per video frame; (ii) up to 60% increase in processing throughput with graceful degradation in output quality; (iii) up to 20% more images captured and filtered for the same power-level reduction on the xo-laptop.
Keywords: software realizations of image processing; programmable platforms; incremental refinement of computation; energy-distortion scalability; throughput-distortion scalability
The limited energy resources in portable multimedia devices require the reduction of encoding complexity. The complex Motion Estimation (ME) scheme of H.264/MPEG-4 AVC accounts for a major part of the encoder energy. In this paper we present a Run-Time Adaptive Predictive Energy Budgeting (enBudget) scheme for energy-aware ME that predicts the energy budget for different video frames and different Macroblocks (MBs) in an adaptive manner considering the run-time changing scenarios of available energy, video frame characteristics, and user-defined coding constraints while keeping a good video quality. It assigns different Energy-Quality Classes to different video frames and fine-tunes at MB level depending upon the predictive energy quota in order to cope with the above-mentioned unpredictable run-time scenarios. Compared to UMHexagonS, EPZS, and FastME, our enBudget scheme for energy-aware ME achieves an energy saving of up to 93%, 90%, 88% (average 88%, 77%, 66%), respectively. It suffers from an average Peak Signal to Noise Ratio (PSNR) loss of 0.29 dB compared to Full Search. We also demonstrate that enBudget is equally beneficial to various other state-of-the-art fast adaptive MEs. We have evaluated our scheme for ASIC and various FPGAs.
This paper deals with the evolutionary design of area-efficient filters for impulse burst noise, which is often present in remote sensing images such as satellite images. Evolved filters require much smaller area in the FPGA than conventional filters. At the same time, they exhibit at least comparable filtering capabilities with respect to conventional filters. Low-cost embedded systems equipped with low-end FPGAs represent a target application for the presented filters.
Hardware sharing can be used to reduce the area and the power dissipation of a design. This is of particular interest in the field of image and video compression, where an encoder must deal with different design tradeoffs depending on the characteristics of the signal to be encoded and the constraints imposed by the users. This paper introduces a novel methodology for exploring the design space based on the amount of hardware sharing between different functional blocks, giving as a result a set of feasible solutions which are broad in terms of hardware cost and throughput capabilities. The proposed approach, inspired by the notion of a partition in set theory, has been applied to optimize and to evaluate the sharing alternatives of a group of image and video compression key computational kernels when mapped onto a Xilinx Virtex-5 FPGA.
Keywords: hardware sharing; image and video encoders; JPEG; FPGA
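The abstract above says the exploration is inspired by the notion of a partition in set theory. As a minimal sketch of that idea only (not the authors' methodology), the following enumerates every way of grouping a set of functional blocks, under the assumption that each group stands for blocks sharing one hardware unit. The kernel names are hypothetical placeholders.

```python
from typing import Iterator, List


def set_partitions(items: List[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing group in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a group of its own (no sharing for this block).
        yield [[first]] + partition


# Hypothetical kernels; each group in a partition is assumed to share hardware.
kernels = ["dct", "quantizer", "motion_est", "entropy_coder"]
alternatives = list(set_partitions(kernels))
print(len(alternatives), "sharing alternatives")
```

The number of alternatives grows as the Bell numbers, so a real exploration would presumably prune candidates by cost and throughput rather than enumerate exhaustively.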
Stereoscopic 3D reconstruction is an important algorithm in the field of Computer Vision, with a variety of applications in embedded and real-time systems. Existing software-based implementations cannot satisfy the performance requirements for such constrained systems; hence an embedded hardware mechanism might be more suitable. In this paper, we present an architecture of a 3D reconstruction system for stereoscopic images, which we implement on a Virtex2 Pro FPGA. The architecture uses a Sobel edge detector to achieve real-time (75 fps) performance, and is configurable in terms of various application parameters, making it suitable for a number of application environments. The paper also presents a design exploration on algorithmic parameters such as disparity range, correlation window size, and input image size, illustrating the impact of each parameter on performance.
Keywords: FPGA Signal Processing; Stereo Vision; Disparity Computation
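To make the parameters named in the abstract above (disparity range, correlation window size) concrete, here is a minimal software sketch of window-based disparity search on a single scanline. It assumes a sum-of-absolute-differences (SAD) cost, which the abstract does not specify, and it omits the Sobel edge detection stage; the function name and the sample data are hypothetical.

```python
def disparity_row(left, right, max_disp, window):
    """Per-pixel disparity for one scanline using a SAD cost (an assumption)."""
    half = window // 2
    n = len(left)
    out = [0] * n
    for x in range(half, n - half):
        best_d, best_cost = 0, None
        for d in range(max_disp + 1):
            if x - d - half < 0:
                break  # window would fall off the left edge of the right image
            cost = sum(abs(left[x + k] - right[x - d + k])
                       for k in range(-half, half + 1))
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        out[x] = best_d
    return out


# Hypothetical scanlines: the left view is the right view shifted by 2 pixels.
right_row = [0, 0, 10, 50, 90, 20, 0, 0, 0, 0]
left_row = [0, 0, 0, 0, 10, 50, 90, 20, 0, 0]
print(disparity_row(left_row, right_row, max_disp=3, window=3))
```

The inner loop runs once per candidate disparity and touches `window` pixels each time, which illustrates why both parameters directly affect throughput.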
This paper presents a robust, low-cost ADC code hit counting technique to record the number of times each ADC output code word appears with respect to the ramp input. Using a smart center code tracking engine, the proposed code hit counter performs robustly against code transition noise, missing code segments, and non-monotonicity; furthermore, the required hardware and test time are at the same level as the best known results. The robustness together with the low overhead makes the proposed code hit counter suitable for (on-line) ADC self-testing and self-calibration applications.
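For readers unfamiliar with code hit counting, the sketch below shows the plain histogram version of the idea: count how often each output code appears for a ramp input, then derive differential nonlinearity (DNL) from the counts. It does not implement the center code tracking engine described in the abstract, and the 3-bit ideal converter and sample counts are assumptions chosen for illustration.

```python
from collections import Counter


def code_hits(samples, n_codes):
    """Number of times each ADC output code appears in `samples`."""
    counts = Counter(samples)
    return [counts.get(code, 0) for code in range(n_codes)]


def dnl(hits):
    """DNL in LSB for the inner codes (first and last codes are excluded)."""
    inner = hits[1:-1]
    average = sum(inner) / len(inner)
    return [h / average - 1.0 for h in inner]


# Hypothetical ideal 3-bit ADC driven by a slow ramp: 16 samples per code.
n_codes = 1 << 3
samples_per_code = 16
ramp_codes = [i // samples_per_code for i in range(n_codes * samples_per_code)]
hits = code_hits(ramp_codes, n_codes)
print(hits, dnl(hits))
```

A missing code shows up as a zero in the hit list and a DNL of -1 LSB for that code.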
This paper presents a new analog ATPG (AATPG) framework that generates near-optimal test stimulus for the digitally-assisted adaptive equalizers in high-speed serial links. Based on the dynamic-signature-based testing scheme developed recently, our AATPG utilizes a Genetic Algorithm (GA) which attempts to maximize the difference between the fault-free and faulty dynamic signatures of the target fault. Our test generation framework takes into account process variations and signal noise in selecting the test stimulus, which minimizes the number of misclassified devices. The experimental results on a 5-tap feed-forward adaptive equalizer demonstrate that the GA-tests generated by our framework can effectively detect faults that are hard to detect by the hand-crafted tests.
We discuss a fault diagnosis scheme for analog integrated circuits. Our approach is based on an assemblage of learning machines that are trained beforehand to guide us through diagnosis decisions. The central learning machine is a defect filter that distinguishes failing devices due to gross defects (hard faults) from failing devices due to excessive parametric deviations (soft faults). Thus, the defect filter is key in developing a unified hard/soft fault diagnosis approach. Two types of diagnosis can be carried out according to the decision of the defect filter: hard faults are diagnosed using a multi-class classifier, whereas soft faults are diagnosed using inverse regression functions. We show how this approach can be used to single out diagnostic scenarios in an RF low noise amplifier (LNA).
Daily experience with product designers and with test and diagnosis engineers shows that the depth of interaction among them ought to be high for successful diagnosis of analogue circuits. With this in mind, we chose a popular diagnostic method and defined a systematic procedure that binds together the product knowledge of design, test and diagnostic engineers. A set of software utilities was developed that assists in automating these procedures and in collecting appropriate data for effective diagnosis of analogue circuits. This paper discusses in detail the chosen methodology for diagnosis and the associated procedures for block-level diagnosis of analogue electronic circuits. The paper concludes with an illustration of the methodology and the related procedures on an industrial automotive voltage regulator circuit as a representative example.
Creating latency-insensitive or asynchronous designs from clocked designs has potential benefits of increased modularity and robustness to variations. Several transformations have been suggested in the literature, and each of these requires a handshake control network (examples include synchronous elasticization and desynchronization). Numerous implementations of the control network are possible. This paper reports on an algorithm that has been proven to generate an optimal control network consisting of the minimum number of 2-input join and 2-output fork control components. This can substantially reduce the area and power consumption of a system. The algorithm has been implemented in a CAD tool called CNG. It has been applied to the MiniMIPS processor, showing a 14% reduction in the number of control steering units over a hand-optimized design in a contemporary work.
Speculative Functional Units (SFUs) enable a new execution paradigm for High Level Synthesis (HLS). SFUs are arithmetic functional units that operate using a predictor for the carry signal, which reduces the critical path delay. The performance of these units is determined by the success in the prediction of the carry value, i.e. the hit rate of the prediction. Hence SFUs reduce critical path at a low cost, but they cannot be used in HLS with the current techniques. In order to use them, it is necessary to include hardware support to recover from mispredictions of the carry signals. In this paper, we present techniques for designing a datapath controller for seamless deployment of SFUs in HLS. We have developed two techniques for this goal. The first approach stops the execution of the entire datapath for each misprediction and resumes execution once the correct value of the carry is known. The second approach decouples the functional unit suffering from the misprediction from the rest of the datapath. Hence, it allows the rest of the SFUs to carry on execution and be at different scheduling states at different times. Experiments show that it is possible to reduce execution time by as much as 38% and by 33% on average.
Keywords: Dynamic scheduling, HLS, speculation
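As a behavioural illustration of the speculative functional units described above, the sketch below models an adder whose upper half starts from a predicted carry and is corrected on a misprediction. It is a simplified software model under several assumptions that are not taken from the paper: a two-way split, a static "predict 0" predictor, and a one-cycle stall penalty, roughly in the spirit of the first (stall-the-datapath) approach.

```python
def speculative_add(a, b, predicted_carry, width=16):
    """Add two `width`-bit values; the high half speculates on the carry."""
    half = width // 2
    mask = (1 << half) - 1
    low = (a & mask) + (b & mask)
    actual_carry = low >> half
    high_operands = ((a >> half) & mask) + ((b >> half) & mask)
    high = high_operands + predicted_carry       # speculative result
    hit = predicted_carry == actual_carry
    if not hit:
        high = high_operands + actual_carry      # recovery after a miss
    result = ((high & mask) << half) | (low & mask)
    return result, hit


def run_with_stalls(operations, width=16):
    """Assumed cost model: 1 cycle per hit, 2 cycles per misprediction."""
    results, cycles = [], 0
    for a, b in operations:
        value, hit = speculative_add(a, b, predicted_carry=0, width=width)
        results.append(value)
        cycles += 1 if hit else 2
    return results, cycles


print(run_with_stalls([(3, 4), (0x00FF, 0x0001)]))
```

With a high hit rate most additions finish in the short speculative cycle, which is where the execution-time savings reported in the abstract would come from.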
Multi-rate digital signal processing (DSP) algorithms are usually modeled by synchronous dataflow graphs (SDFGs). Achieving sufficiently high throughput is a key real-time requirement of a DSP algorithm. Decreasing the iteration period of an SDFG to meet the real-time requirement of the system under consideration is therefore a very important problem. Retiming is a prominent graph transformation technique for performance optimization. In this paper, by proving some useful properties about the relationship between an SDFG and its equivalent homogeneous SDFG (HSDFG), we present an efficient retiming algorithm that does not need to convert the SDFG to an HSDFG, for finding a feasible retiming to reduce the iteration period of an SDFG as required.
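One way to see why avoiding the HSDFG conversion matters: the equivalent HSDFG has one node per actor firing in an iteration, so its size is the sum of the SDFG's repetition vector. The sketch below computes that vector from the standard balance equations (firings of the producer times tokens produced equals firings of the consumer times tokens consumed, for every edge). It is not the retiming algorithm of the paper, it assumes a connected graph, and the example graph and function name are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Smallest positive integer firing counts satisfying the balance equations.

    `edges` holds tuples (src, dst, produced, consumed). Connected graph assumed.
    """
    rate = {actors[0]: Fraction(1)}
    pending = [actors[0]]
    while pending:
        u = pending.pop()
        for src, dst, produced, consumed in edges:
            if src == u:
                other, value = dst, rate[u] * produced / consumed
            elif dst == u:
                other, value = src, rate[u] * consumed / produced
            else:
                continue
            if other not in rate:
                rate[other] = value
                pending.append(other)
            elif rate[other] != value:
                raise ValueError("inconsistent SDFG: no repetition vector")
    # Scale the rational rates to the smallest integer solution.
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rate.values()), 1)
    counts = {a: int(r * scale) for a, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {a: c // common for a, c in counts.items()}


# Hypothetical chain: A produces 2, B consumes 3; B produces 1, C consumes 2.
q = repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)])
print(q, "HSDFG nodes:", sum(q.values()))
```

For this small chain the HSDFG would have six nodes instead of three; for graphs with widely differing rates the blow-up can be far larger.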
Starting from sequential programs, we present an approach combining data reuse, multi-level MapReduce, and pipelining to automatically find the most power-efficient designs that meet speed and area constraints in the design space on Field-Programmable Gate Arrays (FPGAs). This combined approach enables trade-offs in power, speed and area: we show that a 63% reduction in power can be achieved with a 27% increase in execution time. Compared to the sequential designs, our approach yields designs with up to 158 times reduction in execution time. Moreover, for a given execution time, our combined approach generates designs using up to 1.4 times less power than those produced by the same optimizations applied separately and can also find solutions missed by separating the optimizations.
FPGA structures are widely used due to early time-to-market and reduced non-recurring engineering costs in comparison to ASIC designs. Interconnections play a crucial role in modern FPGAs, because they dominate delay, power and area. Multiple-valued logic allows the reduction of the number of signals in the circuit, and hence can serve as a means to effectively curtail the impact of interconnections. In this work we propose a new FPGA structure based on a low-power quaternary voltage-mode device. The most important characteristics of the proposed architecture are the reduced fanout, low number of wires and switches, and the small wire length. We use a set of FIR filters as a demonstrator of the benefits of the quaternary representation in FPGAs. Results show a significant reduction in power consumption with small timing penalties.
As FPGA sizes and densities grow, their manufacturing yields decrease. This work looks toward reclaiming some of this lost yield. Several previous works have suggested fault-aware CAD tools for intelligently routing around faults. In this work we evaluate such an approach quantitatively with respect to some standard benchmarks. We also quantify the trade-offs between performance and fault tolerance in such a method. Leveraging existing CAD tools, we show that up to 30% of slices being faulty can be tolerated. Such approaches could potentially allow manufacturers to sell larger chips with manufacturing faults as smaller chips using a nomenclature that appropriately captures the reduction in logic resources.
Negative bias temperature instability (NBTI) significantly affects nanoscale integrated circuit performance and reliability. The degradation in threshold voltage (Vth) due to NBTI is further affected by the initial value of Vth from fabrication-induced process variation (PV). Addressing these challenges in embedded FPGA designs is possible, as FPGA reconfigurability can be exploited to measure the exact timing degradation of an FPGA due to the joint effect of NBTI and PV at run time with low overhead. The gathered information can then be used to improve the run-time performance and reliability of FPGA designs without targeting the pessimistic worst case. In this paper, we present joint NBTI/PV-aware placement techniques for FPGAs, including NBTI/PV-aware timing analysis, region-based delay estimation, and a new move-acceptance procedure. To evaluate the proposed techniques, we combine PV measurements from 15 Xilinx Virtex-II Pro FPGAs with a model of NBTI. The proposed techniques reduce the effect of NBTI/PV by more than 60% for over 60% of the 15 FPGA chips used in the experiments, with a typical run-time overhead of 1.4-1.8X. The standalone move-acceptance procedure also produces good results with negligible run-time overhead, making it suitable for online FPGA compilation and optimization flows.