Call for Papers: DATE 2010
The development of a satisfactory Embedded Systems Design Science presents a timely challenge and an opportunity for reinvigorating Computer Science. Embedded systems are components that integrate software and hardware and are jointly and specifically designed to provide given, often critical, functionalities. They are used in many application areas including transport, consumer electronics and electrical appliances, energy distribution, and manufacturing systems. Embedded systems design requires techniques that take into account extra-functional requirements regarding the optimal use of resources such as time, memory and energy, while ensuring autonomy, reactivity and robustness. Jointly meeting these requirements raises a grand scientific and technical challenge: extending Computer Science with paradigms and methods from Control Theory and Electrical Engineering. Computer Science is based on discrete computation models that do not encompass physical time and resources, and which are by nature very different from the analytic models used by other engineering disciplines. We summarise some current trends in embedded systems design and point out some of their characteristics, such as the chasm between analytical and computational models and the gap between safety-critical and best-effort engineering practices. We call for a coherent scientific foundation for embedded systems design, and we discuss a few key demands on such a foundation: the need to encompass several manifestations of heterogeneity, and the need for design paradigms ensuring constructivity and adaptivity. We discuss the main aspects of this challenge and associated research directions for different areas such as modelling, programming, compilers, operating systems and networks.
The multiprocessor system-on-chip (MPSoC) is an attractive platform for high-performance applications. Networks-on-Chip (NoCs) can improve the on-chip communication bandwidth of MPSoCs. However, traditional metallic interconnects consume a significant amount of power to deliver the even higher communication bandwidth required in the near future. Optical NoCs are based on CMOS-compatible optical waveguides and microresonators, and promise significant bandwidth and power advantages. This paper proposes a fat-tree-based optical NoC (FONoC), including its topology, floorplan, protocols, and a low-power and low-cost optical router, the optical turnaround router (OTAR). Unlike other optical NoCs, FONoC does not require building a separate electronic NoC for network control. It carries both payload data and network control data on the same optical network, using circuit switching for the former and packet switching for the latter. The FONoC protocols are designed to minimize network control data and the related power consumption. An optimized turnaround routing algorithm is designed to exploit the low-power feature of OTAR, which can passively route packets without powering on any microresonator in 40% of all cases. Compared with other optical routers, OTAR has the lowest optical power loss and uses the smallest number of microresonators. An analytical model is developed to characterize the power consumption of FONoC. We compare the power consumption of FONoC with a matched electronic NoC in 45 nm and show that FONoC can save 87% power compared with the electronic NoC on a 64-core MPSoC. We simulate FONoC for the 64-core MPSoC and show the end-to-end delay and network throughput under different offered loads and packet sizes.
Three-dimensional integrated circuits are a promising approach to address the integration challenges faced by current Systems on Chip (SoCs). Designing an efficient Network on Chip (NoC) interconnect for a 3D SoC that meets not only the application performance constraints but also the constraints imposed by the 3D technology is a significant challenge. In this work we present a design tool, SunFloor 3D, to synthesize application-specific 3D NoCs. The proposed tool determines the best NoC topology for the application, finds paths for the communication flows, assigns the network components onto the 3D layers, and performs a placement of them in each layer. We perform experiments on several SoC benchmarks and present a comparative study between 3D and 2D NoC designs. Our studies show large improvements in interconnect power consumption (average of 38%) and delay (average of 13%) for the 3D NoC when compared to the corresponding 2D implementation. Our studies also show that the synthesized topologies result in large power (average of 54%) and delay savings (average of 21%) when compared to standard topologies.
Keywords: 3D ICs, Networks on Chip (NoC), synthesis, topology, placement
In this paper, we present a design methodology for automatic platform generation of future heterogeneous systems in which communication happens via the Network-on-Chip (NoC) approach. As a novel contribution, we explicitly incorporate information about the user experience into a design flow that aims at minimizing the workload variance; this allows the system to better adapt to different types of user needs and workload variations. More specifically, we first collect user traces from various applications and generate specific clusters using machine learning techniques. For each cluster of user traces, depending on the architectural parameters extracted from high-level specifications, we propose an optimization method to generate the NoC system architecture. Finally, we validate the user-centric design space exploration using realistic traces and compare it to the traditional NoC design methodology.
Current trends in technology scaling foreshadow worsening transistor reliability as well as greater numbers of transistors in each system. The combination of these factors will soon make long-term product reliability extremely difficult to achieve in complex modern systems such as system-on-chip (SoC) and chip multiprocessor (CMP) designs, where even a single device failure can cause fatal system errors. Resiliency to device failure will be a necessary condition at future technology nodes. In this work, we present a network-on-chip (NoC) routing algorithm that boosts the robustness of interconnect networks by reconfiguring them to avoid faulty components while maintaining connectivity and correct operation. This distributed algorithm can be implemented in hardware with fewer than 300 gates per network router. Experimental results over a broad range of 2D-mesh and 2D-torus networks demonstrate 99.99% reliability on average when 10% of the interconnect links have failed.
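The core idea of routing around failed links while preserving connectivity can be illustrated in software. The sketch below is a centralized breadth-first route search on a 2D mesh that simply refuses faulty links; it is only an illustration of the behavior, not the paper's distributed, gate-level reconfiguration algorithm, and the `route` function and its data encoding are our own assumptions.

```python
from collections import deque

def route(n, faulty, src, dst):
    """Find a shortest path on an n x n mesh that avoids faulty links.
    Nodes are (x, y) tuples; `faulty` is a set of frozenset({a, b})
    pairs naming the endpoints of each failed link.
    Returns the node sequence, or None if dst is unreachable."""
    parent = {src: None}           # doubles as the visited set
    q = deque([src])
    while q:
        node = q.popleft()
        if node == dst:            # reconstruct the path back to src
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nb[0] < n and 0 <= nb[1] < n
                    and nb not in parent
                    and frozenset({node, nb}) not in faulty):
                parent[nb] = node
                q.append(nb)
    return None                    # network disconnected for this pair
```

Because the search is breadth-first, the returned route is a shortest one among those that avoid every failed link, mirroring the connectivity guarantee the paper's reconfiguration aims for.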
Despite recent advances in FPGA, GPU, and general-purpose processor technologies, the challenges posed by real-time digital image processing at high resolutions cannot be fully overcome due to insufficient processing capability, inadequate data transport and control mechanisms, and often prohibitively high costs. To address these issues, we propose a two-phase solution for a real-time film grain noise reduction application. The first phase is based on a state-of-the-art FPGA platform used as a reference design. The second phase is based on a novel heterogeneous reconfigurable computing platform that offers flexibility not available from other computing paradigms. This paper introduces the heterogeneous platform and briefly reviews our previous work on the application in question, as well as its implementation on the FPGA demonstration board during the first phase. We then present a decomposition of the application that allows an efficient mapping to the new heterogeneous computing platform through the use of its diverse reconfigurable computing units and run-time reconfiguration.
Multi-core architectures are increasingly being adopted in the design of emerging complex embedded systems. Key issues in designing such systems include on-chip interconnects, memory architecture, and task mapping and scheduling. This paper presents an integer linear programming formulation for the task mapping and scheduling problem. The technique incorporates profiling-driven loop-level task partitioning, task transformations, functional pipelining, and memory-architecture-aware data mapping to reduce system execution time. Experiments are conducted to evaluate the technique by implementing a series of DSP applications on several multi-core architectures based on dynamically reconfigurable processor cores. The results demonstrate that the proposed technique is able to generate high-quality mappings of realistic applications on the target multi-core architecture, achieving up to 1.3x parallel efficiency by employing only two dynamically reconfigurable processor cores.
The Active Buffer project is part of the CBM (Compressed Baryonic Matter) experiment and takes advantage of dynamic partial reconfiguration (DPR) technology, in which a dynamic module can be reconfigured while the static part and the other dynamic modules keep running untouched. Thanks to DPR, design flexibility and simplicity are achieved at the same time. Correctness and performance have been verified by multiple tests.
Many emerging communication technologies significantly increase the complexity of the physical layer and have dramatically increased the number of operating configurations. To ensure maximum performance, designers have to optimize their algorithm implementations, which calls for comprehensive performance testing in all possible operating modes and under various channel conditions. This paper presents a flexible and affordable framework for baseband algorithm development and performance verification for digital communication systems with an arbitrary number of modules, each operating at a possibly different sampling rate with various latencies. The proposed architecture is scalable to support complex scenarios, such as multiple-antenna systems, and is compact enough to be implemented within a single field-programmable gate array.
With the relentless scaling of semiconductor technology, the lifetime reliability of embedded multiprocessor platforms has become one of the major concerns for the industry. If it is not taken into consideration during the task allocation and scheduling process, some processors might age much faster than the others and become the reliability bottleneck of the system, significantly reducing the system's service life. To tackle this problem, we propose an analytical model to estimate the lifetime reliability of multiprocessor platforms when executing periodic tasks, and we present a novel lifetime reliability-aware task allocation and scheduling algorithm based on simulated annealing. In addition, to speed up the annealing process, several techniques are proposed to simplify the design space exploration while retaining satisfactory solution quality. Experimental results on various multiprocessor platforms and task graphs demonstrate the efficacy of the proposed approach.
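The shape of such a simulated-annealing allocation loop can be sketched as below. This is a rough illustration only: the objective here is a plain load-balance proxy (maximum per-processor load, so that no single core ages much faster than the rest), not the paper's analytical lifetime-reliability model, and all names and parameters are our own assumptions.

```python
import math
import random

def anneal(costs, n_procs, iters=5000, T0=1.0, alpha=0.999, seed=0):
    """Simulated-annealing task allocation sketch.
    costs[i] is the execution demand of task i; the returned assignment
    maps each task to a processor, minimizing the maximum load."""
    rng = random.Random(seed)
    assign = [rng.randrange(n_procs) for _ in costs]

    def objective(a):
        load = [0.0] * n_procs
        for task, proc in enumerate(a):
            load[proc] += costs[task]
        return max(load)           # stand-in for a reliability cost

    cur = objective(assign)
    best, best_cost = list(assign), cur
    T = T0
    for _ in range(iters):
        i = rng.randrange(len(costs))
        old = assign[i]
        assign[i] = rng.randrange(n_procs)   # random re-mapping move
        new = objective(assign)
        # Accept improvements always; accept worsenings with a
        # probability that shrinks as the temperature cools.
        if new <= cur or rng.random() < math.exp((cur - new) / T):
            cur = new
            if new < best_cost:
                best, best_cost = list(assign), new
        else:
            assign[i] = old                  # reject: undo the move
        T *= alpha
    return best, best_cost
```

Swapping in a real lifetime-reliability estimate for `objective` is exactly the kind of plug-in the paper's framework performs; the annealing skeleton itself does not change.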
Many embedded control systems comprise several control loops that are closed over a network of computation nodes. In such systems, complex timing behavior and communication lead to delay and jitter, which both degrade the performance of each control loop and must be considered during the controller synthesis. Also, the control performance should be taken into account during system scheduling. The contribution of this paper is a control-scheduling co-design method that integrates controller design with both static and priority-based scheduling of the tasks and messages, and in which the overall control performance is optimized.
The computing engines of many current applications are powered by MPSoC platforms, which promise significant speedup but suffer increased reliability problems as a result of ever-growing integration density and chip size. While static MPSoC execution schedules deliver predictable worst-case performance, their lack of dynamic variability unfortunately constrains their usefulness in such an unreliable execution environment. Adaptive static schedules with predictable responses to runtime resource variations have consequently been proposed, yet the extra constraints that adaptivity imposes on task assignment have resulted in schedule length increases. We propose to eliminate the associated performance degradation of such techniques while retaining all of their benefits, by exploiting an inherent degree of freedom in task assignment: the logical-to-physical core mapping. The proposed technique relies on core reordering and rotation using a graph representation model, which enables a direct translation of inter-core communication paths into ordering requirements between cores. The algorithmic implementation results confirm that the proposed technique can drastically reduce the schedule length overhead of both pre- and post-reconfiguration schedules.
In this paper, we propose a multi-task mapping/scheduling technique for heterogeneous and scalable MPSoCs. To utilize the large number of cores embedded in an MPSoC, the proposed technique considers temporal and data parallelism as well as task parallelism. We define a multi-task mapping/scheduling problem covering all these forms of parallelism and propose a QEA (quantum-inspired evolutionary algorithm)-based heuristic. Compared with an ILP (Integer Linear Programming) approach, experiments with real-life examples show the feasibility and efficiency of the proposed technique.
Negative Bias Temperature Instability (NBTI), a PMOS aging phenomenon causing significant loss on circuit performance and lifetime, has become a critical challenge for temporal reliability concerns in nanoscale designs. Aggressive technology scaling trends, such as thinner gate oxide without proportional downscaling of supply voltage, necessitate a design optimization flow considering NBTI effects at the early stages. In this paper, we present a novel framework using joint logic restructuring and pin reordering to mitigate NBTI-induced performance degradation. Based on detecting functional symmetries and transistor stacking effects, the proposed methodology involves only wire perturbation and introduces no gate area overhead at all. Experimental results reveal that, by using this approach, on average 56% of performance loss due to NBTI can be recovered. Moreover, our methodology reduces the number of critical transistors remaining under severe NBTI and thus, transistor resizing can be applied to further mitigate NBTI effects with low area overhead.
As semiconductor manufacturing enters the advanced nanometer design paradigm, aging and device-wear-out-related degradation is becoming a major concern. Negative Bias Temperature Instability (NBTI) is one of the main sources of device lifetime degradation. The severity of such degradation depends on the operation history of a chip in the field, including characteristics such as temperature and workloads. In this paper, we propose a system-level reliability management scheme in which a chip dynamically adjusts its own operating frequency and supply voltage over time as the device ages. Major benefits of the proposed approach are (i) increased performance due to reduced frequency guard-banding in the factory and (ii) continuous field adjustments that take environmental operating conditions, such as the actual room temperature and the power supply tolerance, into account. The greatest challenge in implementing such a scheme is performing calibration without a tester. Much of this work is performed by hypervisor-like software with very little hardware assistance, which keeps both the hardware overhead and the system complexity low. This paper describes the entire system architecture, including the hardware and software components. Our simulation data indicate that under aggressive wear-out conditions, a scheduling interval of days or weeks is sufficient to reconfigure and keep the system operational; thus the run-time overhead of such adjustments is negligible.
There is a growing concern about timing errors resulting from design marginalities and the effects of circuit aging on speed-paths in logic circuits. This paper presents a low overhead solution for masking timing errors on speed-paths in logic circuits. Error masking at the outputs of a logic circuit is achieved by synthesis of a non-intrusive error-masking circuit that has at least 20% timing slack over the original logic circuit. The error-masking circuit can also be used to collect runtime information when the speed-paths are exercised to (i) predict the onset of wearout and (ii) assist in in-system silicon debug. Simulation results for several benchmark circuits and modules from the OpenSPARC T1 processor are presented to illustrate the effectiveness of the proposed solution. 100% masking of timing errors on all speed-paths within 10% of the critical path delay is achieved for all circuits with an average area (power) overhead of 16% (18%).
The synchronous model of computation together with a suitable execution platform facilitates system-level timing predictability. This paper introduces an algebraic framework for precisely capturing worst-case reaction time (WCRT) characteristics of Esterel-style reactive processors with hardware-supported multithreading. This framework provides a formal grounding for the WCRT problem and makes it possible to improve upon earlier heuristics by accurately and modularly characterizing timing interfaces.
Many application domains require adaptive real-time embedded systems that can change their functionality over time. In such systems it is necessary to guarantee timing constraints not only in every operating mode, but also during the transitions between different modes. Known approaches that address the problem of timing analysis across mode changes are restricted to fixed-priority scheduling policies. In addition, most of them are also limited to simple periodic event stream models and therefore cannot faithfully abstract the bursty timing behavior that can be observed in embedded systems. In this paper, we propose a new method for the design and analysis of adaptive multi-mode systems that supports any event stream model and can handle earliest deadline first (EDF) as well as fixed-priority (FP) scheduling of tasks. We embed the analysis method into a well-established modular performance analysis framework based on Real-Time Calculus and prove its applicability by analyzing a case study.
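On the EDF side, the standard textbook schedulability check is the processor-demand (demand-bound-function) test, which is the kind of per-mode analysis such a framework generalizes. A minimal sketch, assuming integer periods and deadlines and a synchronous release (not the paper's Real-Time Calculus machinery):

```python
from math import lcm

def edf_feasible(tasks):
    """Processor-demand test for preemptive EDF on a single core.
    tasks: list of (C, T, D) tuples with integer WCET C, period T,
    and relative deadline D. The task set is feasible iff the demand
    bound function dbf(t) <= t at every absolute deadline t up to the
    hyperperiod (for U <= 1 with synchronous release)."""
    if sum(C / T for C, T, D in tasks) > 1:
        return False                       # overloaded: trivially infeasible
    H = lcm(*(T for _, T, _ in tasks))     # hyperperiod
    deadlines = {D + k * T for C, T, D in tasks
                 for k in range(H // T + 1) if D + k * T <= H}
    for t in sorted(deadlines):
        # Jobs with both release and deadline inside [0, t] contribute C each.
        demand = sum(max(0, (t - D) // T + 1) * C for C, T, D in tasks)
        if demand > t:
            return False
    return True
```

The mode-change difficulty the paper targets is precisely what this per-mode test misses: each mode may pass the check in isolation while the transient demand during a transition still causes deadline misses.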
Fast real-time feasibility tests and analysis algorithms are necessary for a high acceptance of formal techniques by industrial software engineers. This paper presents a way to considerably reduce the computation time required to calculate the worst-case response time of a task in a fixed-priority task set with jitter. The correctness of the approach is proven analytically, and experimental comparisons with the currently fastest known tests show the improvement of the new method.
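The computation being accelerated is the classic response-time fixed-point recurrence with release jitter (Joseph/Pandya with Tindell's extension). A minimal sketch of the baseline iteration follows; the paper's speed-up technique itself is not reproduced here, and the termination bound used below is a deliberately naive one.

```python
import math

def response_time(task, hp):
    """Worst-case response time of `task` under fixed-priority
    preemptive scheduling. Each task is a (C, T, J) tuple of WCET,
    period, and release jitter; `hp` lists the higher-priority tasks.
    Iterates R = C + sum over hp of ceil((R + Jj) / Tj) * Cj until a
    fixed point is reached."""
    C, T, J = task
    R = C                                    # initial guess: own WCET
    while True:
        R_next = C + sum(math.ceil((R + Jj) / Tj) * Cj
                         for (Cj, Tj, Jj) in hp)
        if R_next == R:
            return R + J                     # own jitter adds to the response
        if R_next > T:                       # naive divergence bound; a full
            return None                      # test would compare against D
        R = R_next
```

Each iteration re-counts the preemptions each higher-priority task can cause within the current window; the paper's contribution is reducing how many such iterations (or candidate points) must be evaluated.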
For a number of years, dataflow concepts have provided designers of digital signal processing systems with environments capable of expressing high-level software architectures as well as low-level, performance-oriented kernels. But analysis of system-level trade-offs has been inhibited by the diversity of models and the dynamic nature of modern dataflow applications. To facilitate design space exploration for software implementations of heterogeneous dataflow applications, developers need tools capable of deeply analyzing and optimizing the application. To this end, we present a new scheduling approach that leverages a recently proposed general model of dynamic dataflow called core functional dataflow (CFDF). CFDF supports high-level application descriptions in multiple models of dataflow by structuring actors with sets of modes that represent fixed behaviors. In this work we show that by decomposing a dynamic dataflow graph as directed by its modes, we can derive a set of static dataflow graphs that interact dynamically. This enables designers to readily apply existing model-specific scheduling techniques to all or some parts of the application while applying custom schedulers to others. We demonstrate this generalized dataflow scheduling method on dynamic mixed-model applications and show that run time and buffer sizes improve significantly compared to a baseline dynamic dataflow scheduler and simulator.
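Once the decomposition yields static (synchronous) dataflow pieces, the classic static analysis becomes available: solving the SDF balance equations for a repetition vector, from which a periodic schedule and buffer bounds follow. The sketch below is a simplified stand-in (connected, consistent graphs assumed; no consistency check), not the CFDF machinery of the paper, and its function and argument names are our own.

```python
from fractions import Fraction
from math import lcm

def repetition_vector(edges, actors):
    """Solve the SDF balance equations q[src] * prod == q[dst] * cons
    for each edge (src, prod, dst, cons), assuming the graph is
    connected and consistent. Returns the smallest integer firing
    counts per actor for one schedule period."""
    ratios = {}                               # adjacency with firing ratios
    for s, p, d, c in edges:
        ratios.setdefault(s, []).append((d, Fraction(p, c)))
        ratios.setdefault(d, []).append((s, Fraction(c, p)))
    q = {actors[0]: Fraction(1)}              # pin one actor, propagate
    work = [actors[0]]
    while work:
        a = work.pop()
        for b, r in ratios.get(a, []):
            if b not in q:
                q[b] = q[a] * r
                work.append(b)
            # a full implementation would verify q[b] == q[a] * r here
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}
```

For example, with A producing 2 tokens per firing onto an edge B consumes 3 from, A must fire 3 times for every 2 firings of B; deriving such vectors per static region is what lets existing SDF schedulers be reused on parts of a dynamic application.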
This paper describes an efficient graph-based method to optimize data-flow expressions for the best hardware implementation. The method is based on factorization, common subexpression elimination (CSE), and decomposition of algebraic expressions performed on a canonical representation, the Taylor Expansion Diagram. The method is generic, applicable to arbitrary algebraic expressions, and does not require specific knowledge of the application domain. Experimental results show that the data-flow graphs (DFGs) generated from such optimized expressions are better suited for high-level synthesis, and the final, scheduled implementations are characterized, on average, by 15.5% lower latency and 7.6% better area than those obtained using traditional CSE and algebraic decomposition.
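The CSE component can be illustrated independently of Taylor Expansion Diagrams with a toy tuple representation of expression trees (our own encoding, not the paper's): structurally identical subtrees hash equal, so each unique subexpression is counted, and implemented in hardware, only once.

```python
def naive_op_count(expr):
    """Operator count with no sharing: every tuple node
    ('op', arg, ...) costs one hardware operator, duplicates included."""
    if not isinstance(expr, tuple):
        return 0                     # leaf (variable or constant)
    return 1 + sum(naive_op_count(a) for a in expr[1:])

def shared_op_count(expr):
    """Operator count after CSE: structurally equal subtrees are
    collapsed, so each unique subexpression is implemented once."""
    seen = set()
    def walk(e):
        if isinstance(e, tuple) and e not in seen:
            seen.add(e)
            for a in e[1:]:
                walk(a)
    walk(expr)
    return len(seen)
```

For `(a+b)*c + (a+b)*d`, the naive count is 5 operators but the shared count is 4, since `a+b` is built once; the paper's factorization and decomposition steps expose more such sharing than syntactic CSE alone.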
The automated design of SoCs from pre-selected IPs that may require different clocks is challenging because of the following issues. Firstly, protocol mismatches between IPs need to be resolved automatically before IPs are integrated. Secondly, the presence of multiple clocks makes the protocol conversion even more difficult. Thirdly, it is desirable that the resulting integration is correct-by-construction, i.e., the resulting SoC satisfies given system-level specifications. All of these issues have been studied extensively, although not in a unifying manner. In this paper we propose a framework based on protocol conversion that addresses all these issues. We have extensively studied many SoC design problems and show that the proposed methodology is capable of handling them better than other known approaches. A significant contribution of the proposed approach is that it nicely generalizes many existing techniques for formal SoC design and integrates them into a single approach.
In the field of chip design, hardware module reuse is a standard solution to the increasing complexity of chip architectures and the pressure to reduce time to market. In the absence of a single module interface standard, integration of pre-designed modules often requires the use of protocol converters. For an arbitrary pair of incompatible protocols, more than one converter is usually possible. However, existing approaches to automatic synthesis of protocol converters either produce a single suggested converter or provide a general nondeterministic solution from which a designer is required to extract a deterministic converter. In this work we present a novel approach for design space exploration of FSM-based protocol converters. We present algorithms for extraction of minimal converters for a given pair of incompatible protocols. We demonstrate the process through a simple example and report on results of experiments with converters for the commercial protocols AMBA ASB, APB and the Open Core Protocol (OCP). The experiments show a reduction in the number of states in the converter of as much as 62% (with an average reduction of 42%) and a reduction in the number of transitions of as much as 85% (with an average reduction of 61%), demonstrating the benefits of design space exploration.
The high computational effort of modern image processing applications such as medical imaging or high-resolution video processing often demands massively parallel special-purpose architectures in the form of FPGAs or ASICs. However, their efficient implementation is still a challenge, as the design complexity causes exploding development times and costs. This paper presents a new design flow that permits specifying, analyzing, and synthesizing complex image processing algorithms. A novel buffer requirement analysis allows exploiting possible trade-offs between required communication memory and computational logic for multi-rate applications. The derived schedule and buffer results are taken into account for resource-optimized synthesis of the required hardware accelerators. Application to a multi-resolution filter shows that buffer analysis is possible in less than one second and that scheduling alternatives influence the required communication memory by up to 24% and the computational resources by up to 16%.
Panelists: A. Domic, M. Montalti, M. Muller, J. Sawicki
Subthreshold logic shows good promise as a viable ultra-low-power circuit design technique for power-limited applications. For this design technique to gain widespread adoption, one of the most pressing concerns is how to improve the robustness of subthreshold logic to process and temperature variations. We propose a variation-resilient adaptive controller for subthreshold circuits with the following novel features: a new sensor based on a time-to-digital converter for accurately capturing the variations as digital signatures, and an all-digital DC-DC converter incorporating the sensor, capable of generating an operating Vdd from 0 V to 1.2 V with a resolution of 18.75 mV, suitable for subthreshold circuit operation. The benefits of the proposed controller are reflected in an energy improvement of up to 55% compared to when no controller is employed. The detailed implementation and validation of the proposed controller are discussed.
Negative Bias Temperature Instability (NBTI) is a significant reliability concern for nanoscale CMOS circuits. Its effects on circuit timing can be especially pronounced for circuits with standby-mode equipped functional units because these units can be subjected to static NBTI stress for extended periods of time. This paper proposes internal node control, in which the inputs to individual gates are directly manipulated to prevent this static NBTI fatigue. We give a mixed integer linear program formulation for an optimal solution to this problem. The optimal placement of internal node control yields an average 26.7% reduction in NBTI-induced delay over a ten year period for the ISCAS85 benchmarks. We find that the problem is NP-complete and present a linear-time heuristic that can be used to quickly find near-optimal solutions. The heuristic solutions are, on average, within 0.17% of optimal and all were within 0.60% of optimal.
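A brute-force illustration of the underlying idea: a PMOS whose gate input is held at logic 0 is under static NBTI stress, so a good standby configuration minimizes the number of nets held at 0. The sketch below searches only over primary-input vectors (the paper's internal node control additionally forces individual gate inputs, and its MILP scales far beyond this exhaustive toy); the function name, gate encoding, and topological-order assumption are all our own.

```python
from itertools import product

def best_standby_vector(inputs, gates):
    """Exhaustively pick the standby input vector that minimizes the
    number of logic-0 nets (a proxy for statically NBTI-stressed PMOS
    devices). `gates` maps net name -> (op, [operand nets]) and is
    assumed to be listed in topological order."""
    ops = {"and": all, "or": any, "not": lambda v: not v[0]}
    best = None
    for vec in product([0, 1], repeat=len(inputs)):
        val = dict(zip(inputs, vec))
        for net, (op, args) in gates.items():        # evaluate the netlist
            val[net] = int(ops[op]([bool(val[a]) for a in args]))
        stressed = sum(1 for v in val.values() if v == 0)
        if best is None or stressed < best[0]:
            best = (stressed, dict(val))
    return best                                      # (stress count, net values)
```

Even this toy shows why the problem is combinatorial: the best vector for one gate's inputs propagates through the netlist and may stress nets downstream, which is what motivates the paper's MILP formulation and linear-time heuristic.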
Nanometer CMOS scaling has resulted in greatly increased circuit variability, with extremely adverse consequences for design predictability and yield. A number of recent works have focused on adaptive post-fabrication tuning approaches to mitigate this problem. Adaptive Body Bias (ABB) is one of the most successful tuning "knobs" in use today in high-performance custom design. Through forward body bias (FBB), the threshold voltage of CMOS devices can be reduced after fabrication to bring slow dies back within the range of acceptable specs. FBB is usually applied at a very coarse, core-level granularity at the price of significantly increased leakage power. In this paper, we propose a novel, physically clustered FBB scheme for a row-based standard-cell layout style that enables selective forward body biasing of only the rows that contain the most timing-critical gates, thereby reducing the leakage power overhead. We propose exact and heuristic algorithms to partition the design and allocate optimal body bias voltages to achieve minimum leakage power overhead. This style is fully compatible with state-of-the-art commercial physical design flows and imposes minimal area blowup. Benchmark results show large leakage power savings, with maximum savings of 30% in the case of 5% compensation and 47.6% in the case of 10% compensation with respect to block-level FBB, and minimal implementation area overhead.
Supply voltage fluctuations that result from inductive noise are increasingly troublesome in modern microprocessors. A voltage "emergency", i.e., a swing beyond tolerable operating margins, jeopardizes the safe and correct operation of the processor. Techniques aimed at reducing power consumption, e.g., by clock gating or by reducing nominal supply voltage, exacerbate this noise problem, requiring ever-wider operating margins. We propose an event-guided, adaptive method for avoiding voltage emergencies, which exploits the fact that most emergencies are correlated with unique microarchitectural events, such as cache misses or the pipeline flushes that follow branch mispredictions. Using checkpoint and rollback to handle unavoidable emergencies, our method adapts dynamically by learning to trigger avoidance mechanisms when emergency-prone events recur. After tightening supply voltage margins to increase clock frequency and accounting for all costs, the net result is a performance improvement of 8% across a suite of fifteen SPEC CPU2000 benchmarks.
BLAST is a very popular computational biology algorithm. Since it is computationally expensive, it is a natural target for acceleration research, and many reconfigurable architectures have been proposed offering significant improvements. In this paper we take a different approach: we propose a BLAST preprocessor that efficiently identifies the portions of the database that must be processed by the full algorithm in order to find the complete set of desired results. We show that this preprocessing is feasible and quick, and requires minimal FPGA resources, while achieving a significant reduction in the size of the database that needs to be processed by BLAST. We also determine the parameters under which prefiltering is guaranteed to identify the same set of solutions as the original NCBI software. We model our preprocessor in VHDL and implement it on a reconfigurable architecture. To evaluate the performance, we use a large set of datasets and compare against the original (NCBI) software. Prefiltering is able to determine that between 80% and 99.9% of the database will not produce matches and can be safely ignored. Processing only the remaining portions using software such as NCBI-BLAST improves system performance (reduces execution time) by 3 to 15 times. Since our prefiltering technique is generic, it can be combined with any other software or reconfigurable acceleration technique.
Interconnect structures significantly contribute to the delay, power consumption, and silicon area of modern reconfigurable architectures. The demand for higher clock frequencies and logic densities is also important for the Field-Programmable Gate Array (FPGA) paradigm. Three-dimensional (3-D) integration can alleviate such performance limitations by accommodating a number of additional silicon layers. However, the benefits of 3-D integration have yet to be sufficiently investigated. In this paper, we propose a software-supported methodology to explore and evaluate 3-D FPGAs fabricated with alternative technologies. Based on the evaluation results, the proposed FPGA device improves speed and energy dissipation by approximately 38% and 26%, respectively, as compared to 2-D FPGAs. Furthermore, these gains are achieved in addition to reducing the interlayer connections, as compared to existing design approaches, leading to cheaper and more
Keywords: FPGA; 3-D integration; interconnection architectures; CAD tools
We present an application-tailored, packet-based SoC communication system with one-hop communication between all entities, priority-based arbitration, and broadcast and multicast support on a bus-shaped topology. It is positioned as a hybrid between NoC and bus approaches, closing the gap for mostly streaming-based systems that need highly flexible communication patterns and multicast messages below a certain size. The system is implemented and evaluated on an FPGA within a car-to-car communication gateway application.
We will explore how processing power of LEON3 processor can be enhanced by connecting small commercially available embedded FPGA (eFPGA) IP with the processor. We will analyze integration of eFPGA with LEON3 in two ways, inside the processor pipeline and as a co-processor. The enhanced processing power helps to reduce dynamic power consumption by Dynamic Frequency Scaling. More computational power at lower frequency helps fabrication of chip in LP (Low Power) process compared to GP (General Purpose) which helps to significantly reduce Static Power which has become a very crucial issue at and beyond 90nm technologies. Use of reconfigurable accelerator raises the question of its programming complexity, HW/SW partitioning and silicon overhead. We will present that silicon overhead of eFPGA is small compared to the benefits which can be obtained with it. We will present a profiling tool which we created for our experiments. To analyze the issue of programming complexity we have explored state of the art CatapultTM ESL tool of Mentor Graphics®.
The topic will cover the use of functional qualification for measuring the quality of functional verification of TLM models. Functional qualification is based on the theory of mutation analysis but considers a mutation to have been killed only if a testcase fails. A mutation model of TLM behaviors is proposed to qualify a verification environment based on both testcases and assertions. The presentation describes at first the theoretic aspects of this topic and then it focuses on its application to real cases by using actual EDA tools, thus showing advantages and limitations of the application of mutation analysis to TLM.
Checking the equivalence of a system-level model against an RTL design is a major challenge. The reason is that usually the system-level model is written by a system architect, whereas the RTL implementation is created by a hardware designer. This approach leads to two models that are significantly different. Checking the equivalence of real-life designs requires strong solver technology. The challenges can only be overcome with a combination of bit-level and word-level reasoning techniques, combined with the right orchestration. In this paper, we discuss solver technology that has shown to be effective on many real-life equivalence checking problems.
A large part of a modern SOC's debug complexity resides in the interaction between the main system components. Transaction-level debug moves the abstraction level of the debug process up from the bit and cycle level to the transactions between IP blocks. In this paper we raise the debug abstraction level further, by utilising structural and temporal abstraction techniques, combined with debug data interpretation and logical communication views. The combination of these techniques and views allow us, among others, to single-step and observe the operation of the network on a per-connection basis. As an example, we show how these higher-level abstractions have been implemented in the debug environment for the Æthereal NOC architecture and present a generic debug API, which can be used to visualise an SOC's state at the logical communication level.
During post-silicon processor debugging, we need to frequently capture and dump out the internal state of the processor. Since internal state constitutes all memory elements, the bulk of which is composed of cache, the problem is essentially that of transferring cache contents off-chip, to a logic analyzer. In order to reduce the transfer time and save expensive logic analyzer memory, we propose to compress the cache contents on their way out. We present a hardware compression engine for cache data using a Cache Aware Compression strategy that exploits knowledge of the cache fields and their behavior to achieve an effective compression. Experimental results indicate that the technique results in 7-31% better compression than one that treats the data as just one long bit stream. We also describe and evaluate a parallel compression architecture that uses multiple compression engines, resulting in a 54% reduction in transfer time.
According to the standard IEC 61508 fault insertion
testing is required for the verification of fail-safe systems. Usually
these systems are realized with microcontrollers. Fail-safe
systems based on a novel CPLD-based architecture require a
different method to perform fault insertion testing than
microcontroller-based systems. This paper describes a method to
accomplish fault insertion testing of a system based on the novel
CPLD-based architecture using the original system hardware.
The goal is to verify the realized safety integrity measures of the
system by inserting faults and observing the behavior of the
system. The described method exploits the fact, that the system
contains two channels, where both channels contain a CPLD.
During a test one CPLD is configured using a modified
programming file. This file is available after the compilation of a
VHDL-description, which was modified using saboteurs or
mutants. This allows injecting a fault into this CPLD. The other
CPLD is configured as fault-free device. The entire system has to
detect the injected fault using its safety integrity measures.
Consequently it has to enter and/or maintain a safe state.
Keywords-IEC 61508; fail-safe system; safety integrity; fault insertion testing; fault injection; CPLD; VHDL
Core-based system-on-chips (SoCs) fabricated on three-dimensional (3D) technology are emerging for better integration capabilities. Effective test architecture design and optimization techniques are essential to minimize the manufacturing cost for such giga-scale integrated circuits. In this paper, we propose novel test solutions for 3D SoCs manufactured with die-to-wafer and die-to-die bonding techniques. Both testing time and routing cost associated with the test access mechanisms in 3D SoCs are considered in our simulated annealing-based technique. Experimental results on ITC'02 SoC benchmark circuits are compared to those obtained with two baseline solutions, which show the effectiveness of the proposed technique.
In this paper we propose a UML/MDA approach, called MoPCoM methodology, to design high quality real-time embedded systems. We have defined a set of rules to build UML models for embedded systems, from which VHDL code is automatically generated by means of MDA techniques. We use the MARTE profile as an UML extension to describe real-time properties and perform platform modeling. The MoPCoM methodology defines three abstraction levels: abstract, execution and detailed modeling levels (AML, EML and DML, respectively). We detail the lowest MoPCoM level, DML, design rules in order to perform automatically VHDL code generation. A viterbi coder has been used as a first case study.
Building highly optimized embedded systems demands hardware/software (HW/SW) co-design. A key challenge in co-design is the design of HW/SW interfaces, which is often a design bottleneck. We propose a novel approach to HW/SW interface design based on the concept of bridge component. Bridge components fill the HW/SW semantic gap by propagating events across the HW/SW boundary and raise the abstraction level for designing HW/SW interfaces by abstracting processors, buses, embedded OS, etc. of embedded system platforms. Bridge components are specified in platform-specific Bridge Specification Languages (BSLs) and compiled by the BSL compilers for simulation and deployment.We have applied our approach to two different embedded system platforms. Case studies have shown that bridge components greatly simplify component-based codesign of embedded systems and system simulation speed can be improved three orders of magnitude by simulating bridge components on the transaction level.
IP-XACT is a well accepted standard for the exchange of
IP components at Electronic System and Register Transfer Level.
Still, the creation and manipulation of these descriptions at the XML
level can be time-consuming and error-prone. In this paper, we show
that the UML can be consistently applied as an efficient and
comprehensible frontend for IP-XACT-based IP description and
integration. For this, we present an IP-XACT UML profile that
enables UML-based descriptions covering the same information as a
corresponding IP-XACT description. This enables the automated
generation of IP-XACT component and design descriptions from
respective UML models. In particular, it also allows the integration of
existing IPs with UML. To illustrate our approach, we present an
application example based on the IBM PowerPC Evaluation Kit.
Keywords-ESL design, RTL design, IP-XACT, IP Management, UML Profile
IP-XACT is a standard for describing intellectual property metadata for System-on-Chip (SoC) integration. Recently researchers have proposed visualizing and abstracting IP-XACT objects using structural UML2 model elements and diagrams. Despite the number of proposals at conceptual level, experiences on utilizing this representation in practical SoC development environments are very limited. This paper presents how UML2 models of IP-XACT features can be utilized to efficiently design and implement a multiprocessor SoC prototype on FPGA. The main contribution of this paper is the experimental development of a multiprocessor platform on FPGA using UML2 design capture, IP-XACT compatible components, and design automation tools. In addition, modeling concepts are improved from earlier work for the utilized integration methodology.
To accommodate the growing number of applications integrated on a single chip, Networks on Chip (NoC) must offer scalability not only on the architectural, but also on the physical and functional level. In addition, real-time applications require Guaranteed Services (GS), with latency and throughput bounds. Traditionally, NoC architectures only deliver scalability on two of the aforementioned three levels, or do not offer GS. In this paper we present the composable and predictable aelite NoC architecture, that offers only GS, based on flit-synchronous Time Division Multiplexing (TDM). In contrast to other TDM-based NoCs, scalability on the physical level is achieved by using mesochronous or asynchronous links. Functional scalability is accomplished by completely isolating applications, and by having a router architecture that does not limit the number of service levels or connections. We demonstrate how aelite delivers the requested service to hundreds of simultaneous connections, and does so with 5 times less area compared to a state-of-the-art NoC.
Reliability concerns associated with upcoming technology nodes coupled with unpredictable system scenarios resulting from increasingly complex systems require considering runtime adaptivity in all possible parts of future on-chip systems. We are presenting a novel configurable link which can change its supported bandwidth on-demand at runtime (2X-Links) for an adaptive on-chip communication architecture. We have evaluated our results using real-time multi-media and the E3S application benchmark suits. Our 2X-Links provide a higher throughput of up to 36%, with an average throughput increase of 21.3%, compared to the Normal-Full-Duplex-Links , , ,  and keep performance-related guarantees with as low as 50% of the Normal-Full-Duplex-Links capacity. Our simulation shows when some links fail, the NoC with 2X-Links can recover from these faults with an average probability of 82.2% whereas these faults would be fatal for the Normal-Full-Duplex-Links.
In on-chip multiprocessor communication, link failures and dynamically changing application scenarios represent demanding constraints for the provision of suitable Quality of Service. Networks-on-Chip (NoCs) featuring dynamic routing are a known way to tackle these issues, but deadlock freedom and message ordering concerns arise. NoCs with configurable routing, whereby the communication routes are explicitly chosen at runtime out of a set of statically predefined alternatives, provide intelligent adaptation without impacting the consistency of traffic flows. However, configurable source routing on a NoC platform requires a design that provides fast path lookup coupled with low area and power consumption. This paper presents an exploration and synthesis approach that, depending on the required amount of routing flexibility, can for example reduce by 3 to 15 times the area cost of the NoC routing tables by adopting partially reprogrammable routing logic instead of fully reprogrammable tables. Further optimizations based on path redundancy allow to reduce up to 17 times the silicon cost.
Parallel architectures have become an increasingly popular method in which to achieve high performance with low power consumption. In order to leverage these benefits, applications are decomposed into multiple computational modules (tasks) that collectively operate and communicate in parallel. In this paper, we present a scalable and highly parametric streams-based communication architecture for inter-module communication for FPGA-based systems. SCORES. This communication architecture improves on previous methods by providing increased application specialization and heterogeneous module clock frequencies, as well as providing a means for low latency communication and data throughput guarantees.
Panelists: J. Cessna, G. Goelz, V. Meyer zu Bexten and E. Petrus
This paper gives an overview of some recent advances in topological approaches to analog layout synthesis and in layout-aware analog sizing. The core issue in these approaches is the modeling of layout constraints for an efficient exploration process. This includes fast checking of constraint compliance, reducing the search space, and quickly relating topological encodings to placements. Sequence-pairs, B*-trees, circuit hierarchy and layout templates are described as advantageous means to tackle these tasks.
This paper presents an accurate interconnect thermal model for analyzing the temperature distribution of an on-chip interconnect wire. The model addresses the ambient temperatures and the heat transfer rates of the packaging materials. Particularly, the model considers the effect of the interconnect temperature gradients. The paper employs an equivalent transmission line circuit to obtain the temperature distribution solution from the model. Then an O(n) algorithm is introduced to compute the interconnect temperatures. Experimental results demonstrate the accuracy of the thermal model, by comparisons with the computational fluid dynamics tool FLUENT.
SystemC is a discrete event simulator that enables the programmer to model complex designs with varying levels of abstraction. In order to improve precision, it can be coupled to more specialized simulators. This article introduces the concept of loose simulator coupling between an analogue simulator and SystemC. It explains the properties and advantages which include a higher simulation performance as well as a higher degree of flexibility. A design example in which SystemC will be connected to SwitcherCad will demonstrate the benefits of loose coupling.
This work proposes reliability aware through silicon via (TSV) planning for the 3D stacked silicon integrated circuits (ICs). The 3D power distribution network is modeled and extracted in frequency domain which includes the impact of skin effect. The worst case power noise of the 3D power delivery networks (PDN) with local TSV failures resulting from fabrication process or circuit operation is identified in both frequency and time domain. From the experimental results, it is observed that a single TSV failure could increase the maximum voltage variation up to 70% which should be considered in nanoscale ICs. The parameters of the 3D PDN are designed such that the power distribution is reliable under local TSV failures. The spatial distribution of the power noise, reliability and block out area is analyzed to enhance the reliability of the 3D PDN under local TSV failure1.
Optical shrink for process migration, manufacturing process variation, temperature and voltage changes lead to clock skew as well as path delay variations in a manufactured chip. Such variations end up degrading the performance of manufactured chips. Since, such variations are hard to predict in pre-silicon phase, tunable clock buffers have been used in several designs. These buffers are tuned to improve maximum operating clock frequency of a design. Previously, we have presented an algorithmic approach that uses delay measurements on a few selected patterns to determine which buffers should be targeted for tuning. In this paper, a study on impact of tunable buffer placement on performance is reported. Greatest benefit from tunable buffer placement is observed, when the clock tree is designed by the proposed tuning system assuming random delay perturbations during design. Accordingly, we present a clock tree synthesis procedure which offer very good protection against process variation as borne out by the results.
NBTI (Negative Bias Temperature Instability) has emerged as the dominant PMOS device failure mechanism for sub-100nm VLSI designs. There is little research to quantify its impact on skew of clock trees. This paper demonstrates a mathematical framework to compute the impact of NBTI on gating-enabled clock tree considering their workload dependent temperature variation. Circuit design techniques are proposed to deal with NBTI induced clock skew by achieving balance in NBTI degradation of clock devices. Our technique achieves up-to 70% reduction in clock skew degradation with miniscule (<0.1%) power and area penalty.
Partial Reconfiguration (PR) of FPGAs presents many
opportunities for application design flexibility, enabling tasks to
dynamically swap in and out of the FPGA without entire system
interruption. However, mapping a task to any available PR
region (PRR) requires a unique partial bitstream for each PRR.
This replication can introduce significant overheads in terms of
bitstream storage and communication requirements. Previous
research in partial bitstream relocation can alleviate these
overheads by transforming a single partial bitstream to map to
any available PRR. However, careful steps are necessary to
ensure proper functionality of relocated partial bitstreams and
may result in clock routing inefficiencies. These routing
inefficiencies can be alleviated by using regional clock resources
introduced in the Virtex-4 FPGAs to implement local clock
domains. PRRs can internally drive local clock domains, enabling
each PRR to vary its clock frequency with respect to a single
global clock signal, as opposed to sending multiple global clock
signals (one for each desired clock frequency) to each PRR. We
introduce this novel local clock domain (LCD) concept, which
provides enhanced PR design flexibility. However, integration of
LCDs and partial bitstream relocation introduces new challenges.
In this paper, we identify motivating application domains for this
integration, analyze integration benefits, and provide a detailed
Keywords-partial reconfiguration; relocatable; local clock
In this paper, we present a fully parallel transistor level full-chip circuit simulation tool with SPICE-accuracy for general circuit designs. The proposed overlapping domain decomposition approach partitions the circuit into a linear subdomain and multiple non-linear subdomains based on circuit non-linearity and connectivity. Parallel iterative matrix solver is used to solve the linear domain while non-linear subdomains are parallelly distributed into different processors topologically and solved by direct solver. To achieve maximum parallelism, device model evaluation is done parallelly. Parallel domain decomposition technique is used to iteratively solve the different partitions of the circuit and ensure convergence. Orders of magnitude speedup over SPICE is observed for sets of large-scale circuit designs on up to 64 processors.
In recent years, pre-fabricated design styles grow up rapidly to amortize the mask cost. However, the interconnection delay of the pre-fabricated design styles slows down the circuit performance due to the high capacitive load. In this paper, we propose a technique to insert dual-rail wires for pre-fabricated design styles. Furthermore, we propose an effective dual-rail insertion algorithm to reduce the routing area overheads caused by the inserted dual-rail wires. Taking the wire criticality, the delay significance, and the wire congestion into consideration, our proposed algorithm is capable of trading additional routing area overheads for the interconnection performance improvement. The experimental results demonstrate that our proposed algorithm reduces the interconnection delay by 11.4% with 5.8% routing area overheads.
Software Defined Radio (SDR) terminals are crucial to enable seamless and transparent inter-working between fourth generation wireless access systems or communication modes. On the longer term, SDRs will be extended to become Cognitive Radios enabling efficient spectrum usage. Future communication modes will have heavy hardware resource requirements and switching between them will introduce dynamism in respect with timing and size of resource requests. In this paper, we propose a modeling framework that enables the simulation of such complex, dynamic hardware/software SDR designs. Thus, we can do an exploration, which can pinpoint the coarse grain platform component requirements for future SDR applications in a very early design phase. Our solution differs from existing ones by combining multiple simulation granularities in a way that is specialized for SDR simulation. Finally, we demonstrate the effectiveness of our approach with a case study for dimensioning the on-chip interconnect of a prospective SDR platform.
The need to have Transaction Level models early in the design cycle is becoming more and more important to shorten the development times of complex Systems-on-Chip (SoC). These models need to be functional and timing accurate in order to address different design use-cases during the SoC development. However the typical issue with Transaction Level Modeling (TLM) techniques is the accuracy vs. simulation speed trade-off. Models that can run at high simulation speeds are often modeled at abstraction levels that make them unsuitable for use-cases where timing accuracy is required. Similarly, most models that are cycle accurate are inherently too slow (due to clock sensitive processes) to be used in use-cases where high simulation speed is key. This paper introduces a new methodology that enables the creation of fast and cycle accurate protocol specific bus-based communication models, based on the new TLM 2.0 standard from the Open SystemC Initiative (OSCI).
In this work, the focus is put on the behavior of a system in case a fault occurs that disables the system from executing its applications. Instead of executing a random subset of the applications depending on the fault, an approach is presented that optimizes the systems structure and behavior with respect to a possible graceful degradation. It includes a degradation-aware reliability analysis that guides the optimization of the resource allocation and function distribution, and provides data-structures for an efficient online degradation algorithm. Thus, the proposed methodology covers both, the design phase with a structural optimization and the online phase with a behavioral optimization of the system. A case study shows the effectiveness of the proposed approach.
Redundancy Addition and Removal (RAR) is a restructuring technique used in the synthesis and optimization of logic designs. It can remove an existing target wire and add an alternative wire in the circuit such that the functionality of the circuit is intact. However, not every irredundant target wire can be successfully removed due to some limitations. Thus, this paper proposes a new restructuring technique, IRredundancy Removal and Addition (IRRA), which successfully removes any desired target wire by constructing a rectification network which exactly corrects the error caused by removing the target wire.
As technology scales, the aging effect caused by Negative Bias Temperature Instability (NBTI) has become a major reliability concern for circuit designers. On the other hand, reducing leakage power remains to be one of the design goals. Because both NBTI-induced circuit degradation and standby leakage power have a strong dependency on the input vectors, Input Vector Control (IVC) technique may be adopted to mitigate leakage and NBTI. However, IVC technique is in-effective for larger circuits. Therefore, in this paper, we propose two fast gate replacement algorithms together with optimal input vector selection to simultaneously mitigate leakage power and NBTI induced circuit degradation: Direct Gate Replacement (DGR) algorithm and Divide and Conquer Based Gate Replacement (DCBGR) algorithm. Our experimental results on 20 benchmark circuits at 65nm technology node reveal that: 1) Both DGR and DCBGR algorithms outperform pure IVC about on average 20% for three different object functions: leakage power reduction only, NBTI mitigation only, and leakage/NBTI co-optimization. 2) The DCBGR algorithm leads to better optimization results and save on average 100X runtime compared with the DGR algorithm.
Clock-gating and power-gating have proven to be very effective solutions for reducing dynamic and static power, respectively. The two techniques may be coupled in such a way that the clock-gating information can be used to drive the control signal of the power-gating circuitry, thus providing additional leakage minimization conditions w.r.t. those manually inserted by the designer. This conceptual integration, however, poses several challenges when moved to industrial design flows. Although both clock and power-gating are supported by most commercial synthesis tools, their combined implementation requires some flexibility in the back-end tools that is not currently available. This paper presents a layout-oriented synthesis flow which integrates the two techniques and that relies on leading-edge, commercial EDA tools. Starting from a gated-clock netlist, we partition the circuit in a number of clusters that are implicitly determined by the groups of cells that are clock-gated by the same register. Using a row-based granularity, we achieve runtime leakage reduction by inserting dedicated sleep transistors for each cluster. The entire flow has been benchmarked on a industrial design mapped onto a commercial, 65nm CMOS technology library.
Memories are increasingly dominating Systems on Chip (SoC) designs and thus contribute a large percentage of the total system's power dissipation, area and reliability. In this paper, we present a tool which captures the effects of supply voltage Vdd and temperature on memory performance and their interrelationships. We propose a Temperature- and Reliability- Aware Memory Design (TRAM) approach which allows designers to examine the effects of frequency, supply voltage, power dissipation, and temperature on reliability in a mutually interrelated manner. Our experimental results indicate that thermal unaware estimation of probability of error can be off by at least two orders of magnitude and up to five orders of magnitude from the realistic, temperature-aware cases. We also observed that thermal aware Vdd selection using TRAM can reduce the total power dissipation by up to 2.5X while attaining an identical predefined limit on errors.
In today's aircraft, system complexity increases are
making it particularly challenging for engineers to validate
systems architectures. To ease this burden, the integration test
rig, often known as the "iron bird" integration simulator, has
been developed, and allows testing of real systems in a simulated
The computing host platform and interface equipment used in
the integration simulator, are evolving rapidly. The capability to
predict the performance of both the simulation application and
the infrastructure on which it runs, is crucial in order to select
the proper architecture for the future test rigs.
This paper presents the results of an AADL development that
simulates the test rig simulator in order to predict its needs. We
illustrate the use of model based engineering techniques on a real
industrial application where we simulate the simulator in order
to architect its computing infrastructure.
Firstly, the simulation application model built with AADL
language is presented. Secondly, the producer-consumer
paradigm is introduced and it is shown how it is used to model
the simulation infrastructure host platform. Thirdly, the time
reference used to abstract time in the simulation is presented.
And finally the capacity of the AADL simulation to match the
simulators currently used in our company is illustrated.
Keywords: modeling, real-time, simulation, AADL.
The availability of multimillion Commercial-Off-The-Shelf (COTS) Field Programmable Gate Arrays (FPGAs) is making now possible the implementation on a single device of complex systems embedding processor cores as well as huge memories and ad-hoc hardware accelerators exploiting the programmable logic (Systems on Programmable Chip, or SoPCs). When deployed in safety- or mission-critical applications, as avionic- and space-oriented ones, Singe Event Effects (SEEs) affecting COTS FPGA, which may have catastrophic effects if neglected, have to be considered and SEE mitigation techniques have to be employed. In this paper we explore the adoption of known techniques (such as lockstep, checkpointing and rollback recovery) for SEE mitigation to processors cores embedded in SoPCs, and propose their customization, specifically addressing the characteristics of programmable devices. Since the resulting design flow can easily be supported by automation tools, its adoption is particularly suitable to reduce the design and validation costs. Experimental results show the effectiveness of the proposed approach when compared to conventional TMR-based solutions.
Body sensor networks are emerging as a promising platform for healthcare monitoring. These systems are composed of battery-operated embedded devices which process physiological data. The reduction in the power consumption is an important factor to increase the lifetime for such systems and to enhance their wearability through reducing the size of the battery. In this paper, we develop an energy-efficient communication scheme that uses buffers to reduce the number of transmissions among the sensor nodes constrained to limited hardware resources. A direct acyclic graph is used to model the information flow. We define a communication optimization problem and solve it using convex optimization techniques. We present results that support the efficiency of the proposed technique.
In this paper we present a reconfigurable Class-E Power Amplifier (PA) whose operation frequency covers all uplink bands of GSM standard. We describe the circuit design strategy to reconfigure PA operation frequency maximizing the efficiency. Two dies, manufactured using CMOS and MEMS technologies, are assembled through bondwires in a SiP fashion. Prototypes deliver 20dBm output power with 38% and 26% drain efficiencies at lower and upper bands, respectively. MEMS technological issues degrading performance are also discussed.
High-performance analog-to-digital converters (ADCs) are essential elements in the development of high-performance image sensors. These circuits need a large number of ADCs to reach the required resolution at a specified speed. Moreover, power dissipation has nowadays become a key metric in analog design, especially for portable devices. Designing such circuits is a challenging task that requires a combination of advanced digital circuitry, analog design expertise and an iterative design process. Amplifier sharing is a commonly used technique to reduce power dissipation in pipelined ADCs. In this paper we present a partial amplifier sharing topology for a 12-bit pipeline ADC, developed in a 0.35μm CMOS process. Its performance is compared with a conventional amplifier scaling topology and with a fully amplifier-sharing one.
Keywords- ADC, pipeline, CMOS, low-power
Panelists: J. Cong, G. Clave, T. Makelainen, Z. Zhang, V. Kathail and J. Kunkel
In this paper we propose a novel statistical framework to model the impact of process variations on semiconductor circuits through the use of process-sensitive test structures. Based on multivariate statistical assumptions, we propose the use of the expectation-maximization algorithm to estimate any missing test measurements and to accurately calculate the statistical parameters of the underlying multivariate distribution. We also propose novel techniques to validate our statistical assumptions and to identify any outliers in the measurements. Using the proposed model, we analyze the impact of the systematic and random sources of process variations to reveal their spatial structures. We utilize the proposed model to develop a novel application that significantly reduces the volume, time, and cost of the parametric test measurement procedure without compromising its accuracy. We extensively verify our models and results on measurements collected from more than 300 wafers and over 25 thousand dies fabricated at a state-of-the-art facility. We prove the accuracy of our proposed statistical model and demonstrate its applicability towards reducing the volume and time of parametric test measurements by about 2.5-6.1X with no impact on test quality.
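As a heavily simplified sketch of the EM idea (two correlated test parameters, a handful of dies, some missing the second measurement; the paper's full multivariate treatment, outlier detection and validation are beyond this illustration):

```python
# Simplified EM sketch (hypothetical data; two test parameters per die,
# some dies missing the second measurement). E-step: impute missing y by
# its conditional expectation given x; M-step: re-estimate means and
# (co)variances from the completed data. The full multivariate treatment
# in the paper also tracks conditional variances and tests for outliers.

# (x, y) pairs; None marks a missing y measurement.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1),
        (5.0, None), (6.0, None)]

def em_impute(data, iters=50):
    xs = [x for x, _ in data]
    ys = [y if y is not None else 0.0 for _, y in data]  # initial guess
    n = len(xs)
    for _ in range(iters):
        # M-step: parameters of the (assumed) bivariate Gaussian
        mx, my = sum(xs) / n, sum(ys) / n
        vx = sum((x - mx) ** 2 for x in xs) / n
        cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
        # E-step: conditional mean of y given x for the missing entries
        ys = [y if d[1] is not None else my + cxy / vx * (x - mx)
              for d, x, y in zip(data, xs, ys)]
    return ys

ys = em_impute(data)
print(ys[4], ys[5])  # imputed values converge near 10.15 and 12.18
```

The imputed values settle on the linear trend implied by the observed dies (y roughly 2x), which is the fixed point of the alternating estimate-and-impute iteration.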
Lithographic variability and its impact on printability is a major concern in today's semiconductor manufacturing process. To address sub-wavelength printability, a number of resolution enhancement techniques (RETs) have been used. While RETs allow printing of sub-wavelength features, the feature width itself becomes highly sensitive to process parameters, which in turn detracts from yield due to small perturbations in manufacturing parameters. Yield loss is a function of random variables such as depth-of-focus and exposure dose. In this paper, we present a first-order canonical dose/focus model that takes into account both the correlated and independent randomness of the effects of lithographic variation. A novel tile-based yield estimation technique for a given layout, based on a statistical model of process variability, is presented. Another novel contribution of this paper is the computation of global and local line-yield probabilities. The key issues addressed in this paper are (i) layout error modeling, (ii) avoidance of mask simulation for chip layouts, (iii) avoidance of full Monte-Carlo simulation for variational lithography modeling, and (iv) building a methodology for yield estimation based on existing commercial tools. Numerical results based on our approach are shown for 45nm ISCAS85 layouts.
Keywords-Photolithography, depth-of-focus, exposure dose, focus-exposure matrix (FEM), chemical mechanical polishing, stratified sampling, linewidth-based yield.
Low-voltage SRAMs are critical for power-constrained designs. Currently, the choice of supply voltage in SRAMs is governed by bit-cell read static noise margin, writability, data retention, etc. However, at nanometer technology nodes, the choice of supply voltage impacts the reliability of SRAMs as well. Two important reliability challenges for current and future generation SRAMs are gate oxide degradation and soft error susceptibility. Current-generation transistors have ultra-thin gate oxides to improve device performance, and they are prone to breakdown due to higher levels of electric field stress. In addition, the soft error susceptibility of SRAMs has significantly increased in the nanometer regime. In this work, we quantify the impact of voltage scaling on the soft error susceptibility of gate-oxide-degraded SRAMs. We show that when gate oxide degradation is taken into account, there exists an optimal voltage (Vopt) at which the bit-cell Qcrit is maximized. Further, we show that both Vopt and the maximum Qcrit (Qcrit,max) are a function of the level of oxide degradation. Finally, we investigate the impact of technology node scaling and analyze the trend of Vopt and Qcrit,max. As the technology node shrinks to sub-45nm, both Vopt and Qcrit,max decrease sharply, significantly decreasing the reliability of SRAMs.
As flash memory has become popular across various platforms, there is strong demand to address the performance degradation caused by the special characteristics of flash memory. This research proposes the design of a file-system-aware flash translation layer, in which a filter mechanism separates the access requests for file-system metadata and file contents for better performance. A recovery scheme is then proposed to maintain the integrity of the file system. The proposed flash translation layer is implemented as a Linux device driver and evaluated with the ext2 and ext3 file systems. The experimental results show significant performance improvement over ext2 and ext3 file systems with limited system overheads.
NAND flash memories require Garbage Collection (GC) and Wear Leveling (WL) operations to be carried out by the Flash Translation Layer (FTL) that oversees flash management. Owing to expensive erasures and data copying, these two operations essentially determine application response times. Since file systems do not share any file deletion information with the FTL, dead data is treated as valid by the FTL, resulting in significant WL and GC overheads. In this work, we propose a novel method to dynamically interpret and treat dead data at the FTL level so as to reduce the above overheads and improve application response times, without necessitating any changes to existing file systems. We demonstrate that our resource-efficient approach can improve application response times and memory write access times by 22% and reduce erasures by 21.6% on average.
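The abstract's mechanism is analogous to a TRIM-style deletion hint. A minimal toy sketch, not the paper's implementation (names and structure are illustrative):

```python
# Toy page-level FTL sketch (illustrative only, not the paper's design):
# a deletion hint marks pages dead so garbage collection does not copy them.

class ToyFTL:
    def __init__(self, pages_per_block=4):
        self.pages_per_block = pages_per_block
        self.valid = {}   # (block, page) -> logical address
        self.copies = 0   # pages copied during GC (the overhead being reduced)

    def write(self, block, page, logical_addr):
        self.valid[(block, page)] = logical_addr

    def hint_dead(self, logical_addr):
        """File-deletion hint: invalidate pages whose data is dead."""
        self.valid = {k: v for k, v in self.valid.items() if v != logical_addr}

    def erase_block(self, block, free_block):
        """GC: copy still-valid pages elsewhere, then erase the block."""
        for page in range(self.pages_per_block):
            addr = self.valid.pop((block, page), None)
            if addr is not None:
                self.copies += 1
                self.write(free_block, page, addr)

# Without the hint, all four pages of block 0 would be copied by GC;
# with the hint, the two dead pages are skipped.
ftl = ToyFTL()
for p, addr in enumerate([10, 11, 12, 13]):
    ftl.write(0, p, addr)
ftl.hint_dead(11)
ftl.hint_dead(12)
ftl.erase_block(0, free_block=1)
print(ftl.copies)  # 2 instead of 4
```

Fewer copied pages directly translate into fewer erasures and shorter GC pauses, which is the effect the abstract quantifies.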
With the wide applicability of flash memory in various application domains, reliability has become a critical issue. This research is motivated by the need to resolve the lifetime problem of flash memory and by a strong demand for turning thrown-away flash-memory chips into downgraded products. We propose a set-based mapping strategy with an effective implementation and low resource requirements, e.g., SRAM. A configurable management design and the wear-leveling issue are considered. The behavior of the proposed method is also analyzed with respect to popular implementations in industry. We show through a series of experiments over a realistic trace that the endurance of flash memory can be significantly improved. Our experiments also show that read performance is substantially improved.
In this paper, we propose a novel, energy-aware scheduling algorithm for applications running on DVS-enabled multiprocessor systems, which exploits variation in execution times of individual tasks. In particular, our algorithm takes into account latency and resource constraints, precedence constraints among tasks and input-dependent variation in execution times of tasks to produce a scheduling solution and voltage assignment such that the average energy consumption is minimized. Our algorithm is based on a mathematical programming formulation of the scheduling and voltage assignment problem and runs in polynomial time. Experiments with randomly generated task graphs show that up to 30% savings in energy can be obtained by using our algorithm over existing techniques. We perform experiments on two real-world applications - MPEG-4 decoder and MJPEG encoder. Simulations show that the scheduling solution generated by our algorithm can provide up to 25% reduction in energy consumption over greedy dynamic slack reclamation algorithms.
Index Terms - DVS, scheduling, average energy consumption, precedence constraints, convex optimization
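The underlying energy-speed tradeoff that makes slack exploitation worthwhile can be illustrated with a toy model (not the paper's mathematical program, and all constants are hypothetical): dynamic energy scales roughly with V squared while frequency scales roughly with V, so stretching a task to fill its slack saves energy quadratically:

```python
# Simplified DVS illustration (not the paper's formulation): with dynamic
# energy ~ C * V^2 per cycle and frequency roughly proportional to V,
# stretching a task to fill its slack lets us lower V and save energy.

V_NOM = 1.0        # nominal supply voltage (normalized)

def energy(v, cycles=1e6):
    return cycles * v ** 2          # E ~ C * V^2 per cycle (normalized C = 1)

def exec_time(v, cycles=1e6, f_nom=1e6):
    return cycles / (f_nom * v)     # f assumed proportional to V (simplified)

def min_voltage_meeting_deadline(deadline, cycles=1e6, f_nom=1e6):
    """Lowest normalized voltage that still meets the deadline."""
    v = cycles / (f_nom * deadline)
    return min(max(v, 0.5), V_NOM)  # clamp to an assumed feasible V range

# Task needs 1.0 s at nominal voltage, but its deadline is 2.0 s:
v = min_voltage_meeting_deadline(2.0)
print(v, energy(v) / energy(V_NOM))  # 0.5 -> energy drops to 25%
```

The paper's contribution is choosing such voltages jointly for many tasks under precedence, latency and resource constraints, and for input-dependent execution times, which this single-task sketch ignores.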
Complex software programs are largely characterized by phase behavior and runtime distributions. Due to the dynamism of these two characteristics, it is not efficient to make workload predictions at design time. In our work, we present a novel online DVFS method that exploits both phase behavior and runtime distribution at run-time in combined Vdd/Vbb scaling. The presented method performs a bi-modal analysis of the runtime distribution, and then a runtime distribution-aware workload prediction based on the analysis. In order to minimize the runtime overhead of the sophisticated workload prediction method, it performs table lookups into pre-characterized data at run-time without compromising the quality of energy reduction. It also offers a new concept of program phase suitable for DVFS. Experiments show the effectiveness of the presented method in the case of an H.264 decoder with two sets of long-term scenarios consisting of a total of 4655 frames. It offers a 6.6%-33.5% reduction in energy consumption compared with existing offline and online solutions.
As industry moves towards many-core chips, networks-on-chip (NoCs) are emerging as the scalable fabric for interconnecting the cores. With power now the first-order design constraint, early-stage estimation of NoC power has become crucially important. ORION was amongst the first NoC power models released, and has since been fairly widely used for early-stage power estimation of NoCs. However, when validated against recent NoC prototypes - the Intel 80-core Teraflops chip and the Intel Scalable Communications Core (SCC) chip - we saw significant deviation that can lead to erroneous NoC design choices. This prompted our development of ORION 2.0, an extensive enhancement of the original ORION models which includes completely new subcomponent power models, area models, as well as improved and updated technology models. Validation against the two Intel chips confirms a substantial improvement in accuracy over the original ORION. A case study with these power models plugged into the COSI-OCC NoC design space exploration tool confirms the need for, and value of, accurate early-stage NoC power estimation. To ensure the longevity of ORION 2.0, we will be releasing it wrapped within a semi-automated flow that automatically updates its models as new technology files become available.
Panelists: J. Abraham, R. Goldman and J. McLean
SoC development requires interaction between a wide range of engineering disciplines, each of which brings in optimisation factors that impact the other disciplines. Therefore, concurrent development and end-to-end planning between these disciplines are necessary. This session will show the overlap between design, packaging, silicon manufacturing, test and yield optimisation.
During the 1990s, silicon-based CMOS made steady advances in miniaturization and lower power consumption by exploiting scaling, and expanded its share by invading the bipolar transistor and compound semiconductor markets. At the same time, new semiconductor application technologies grew rapidly one after the other in conjunction with the development of silicon CMOS technologies. Such developments included microprocessors for PCs, server and router chipsets for internet applications, RF for cellular phones, analogue circuitry, baseband processors, and wireless LAN technologies. In memory, flash memory technology was introduced into the market, followed by FeRAM, MRAM, and PRAM technologies based on new principles.
Non-volatile logic-in-memory architecture, where nonvolatile memory elements are distributed over a logic-circuit plane, is expected to achieve both ultra-low power and reduced interconnection delay. This paper presents novel non-volatile logic circuits based on logic-in-memory architecture using magnetic tunnel junctions (MTJs) in combination with MOS transistors. Since the MTJ with a spin-injection write capability is the only device that combines all of the following superior features: large resistance ratio, virtually unlimited endurance, fast read/write access, scalability, complementary MOS (CMOS) process compatibility, and nonvolatility, it is well suited to implementing MOS/MTJ-hybrid logic circuits with logic-in-memory architecture. A concrete nonvolatile logic-in-memory circuit is designed and fabricated in a 0.18 μm CMOS/MTJ process, and its future prospects and issues are discussed.
Keywords-nonvolatile; logic-in-memory; MTJ; standby-power-free; quick sleep/wake-up
Carbon Nanotube Field-Effect Transistors (CNFETs) show great promise as extensions to silicon CMOS because: 1) ideal CNFETs can provide significant energy and performance benefits over silicon CMOS, and 2) CNFET processing is compatible with existing silicon-CMOS processing. However, future gigascale systems cannot rely solely on existing chemical synthesis to guarantee ideal devices. VLSI-scale logic circuits using CNFETs must overcome major challenges posed by: 1) misaligned and mis-positioned Carbon Nanotubes (CNTs); 2) metallic CNTs; and 3) CNT density variations. This paper performs a detailed analysis of the impact of these challenges on CNFET circuit performance. A combination of design and processing techniques, presented in this paper, can enable VLSI-scale CNFET logic circuits that are immune to high rates of inherent imperfections. These techniques are inexpensive compared to traditional defect- and fault-tolerance techniques, do not impose major changes in VLSI design flows, and are compatible with VLSI processing because they do not require special customization on a chip-by-chip basis.
It is generally acknowledged that nanoelectronics will eventually replace traditional silicon CMOS in high-performance integrated circuits. To that end, considerable investments are being made in the research and development of new nanoelectronic devices and fabrication techniques. When these technologies mature, they can be used to create the next generation of electronic systems. Given the intrinsic properties of nanomaterials, such systems are likely to deviate considerably from their predecessors. In this paper, we compare two potential architectures for the design of nanoelectronic FPGAs. By evaluating the performance of nanoelectronic devices at the systems level, we aim to provide insights into how they can be
Keywords-FPGAs; nano-architecture; nanoelectronics; carbon nanotube devices
Software defined radio (SDR) is a rapidly evolving technology which implements some functional modules of a radio system in software executing on a programmable processor. SDR provides a flexible mechanism to reconfigure the radio, enabling networked devices to easily adapt to user preferences and the operating environment. However, the very mechanisms that provide the ability to reconfigure the radio through software also give rise to serious security concerns such as unauthorized modification of the software, leading to radio malfunction and interference with other users' communications. Both the SDR device and the network need to be protected from such malicious radio reconfiguration. In this paper, we propose a new architecture to protect SDR devices from malicious reconfiguration. The proposed architecture is based on robust separation of the radio operation environment and user application environment through the use of virtualization. A secure radio middleware layer is used to intercept all attempts to reconfigure the radio, and a security policy monitor checks the target configuration against security policies that represent the interests of various parties. Therefore, secure reconfiguration can be ensured in the radio operation environment even if the operating system in the user application environment is compromised. We have prototyped the proposed secure SDR architecture using VMware and the GNU Radio toolkit, and demonstrate that the overheads incurred by the architecture are small and tolerable. Therefore, we believe that the proposed solution could be applied to address SDR security concerns in a wide range of both general-purpose and embedded computing systems.
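The intercept-and-check pattern of the secure middleware layer can be sketched as follows (policy names and limits are hypothetical, not from the paper):

```python
# Sketch of the intercept-and-check pattern described above (all policy
# names and limits are hypothetical, not from the paper): the middleware
# validates every requested radio configuration before it reaches the radio.

POLICIES = [
    # each policy either lets a configuration pass (True) or returns a reason
    lambda cfg: cfg["freq_hz"] <= 2.5e9 or "frequency outside licensed band",
    lambda cfg: cfg["tx_power_dbm"] <= 20 or "transmit power exceeds limit",
]

def reconfigure(radio_state: dict, requested: dict):
    """Apply the requested configuration only if every policy passes."""
    for policy in POLICIES:
        verdict = policy(requested)
        if verdict is not True:
            return False, verdict        # blocked: reason string
    radio_state.update(requested)
    return True, "applied"

radio = {"freq_hz": 2.4e9, "tx_power_dbm": 10}
ok, why = reconfigure(radio, {"freq_hz": 2.4e9, "tx_power_dbm": 30})
print(ok, why)   # blocked by the power policy; radio state unchanged
```

The essential property, which virtualization enforces in the paper's architecture, is that the radio state can only change through this checked path, even if user-environment software is compromised.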
Hardware/software codesign of Elliptic Curve Cryptography (ECC) has been extensively studied in recent years. However, most of these designs have focused on the computational aspect of the ECC hardware, and not on system integration into a SoC architecture. We study the impact of the communication link between CPU and coprocessor hardware for a typical ECC design, and demonstrate that the SoC may become performance-limited due to coprocessor data and instruction transfers. A dual strategy is proposed to remove the bottleneck: introduction of local control as well as local storage in the coprocessor. We quantify the impact of this strategy on a prototype implementation for Field Programmable Gate Arrays (FPGAs) and measure an average speed-up of 9.4 times over the baseline ECC system, while the resulting system area increases by a factor of 1.6. The optimal area-time product improvement of our ECC coprocessor is 4.3 times compared to that of the baseline ECC coprocessor. Through design space exploration of a large number of system configurations using the latest FPGA technology and tools, we show that the optimal choice of ECC coprocessor parameters is strongly dependent on the efficiency of system-level communication.
Reliable and verifiable hardware, software and content usage metering (HSCM) are of primary importance for wide segments of e-commerce, including intellectual property and digital rights management. We have developed the first HSCM technique that employs intrinsic aging properties of components in modern and upcoming integrated circuits (ICs) to create the first self-enforceable HSCM approach. There is a variety of hardware aging mechanisms, ranging from electro-migration in wires to slow-down of crystal-based clocks. We focus on transistor aging due to negative bias temperature instability (NBTI) effects, where the delay of gates increases proportionally to usage time. We address the problem of measuring the amount of time a particular licensed software (LS) is used by designing an aging circuitry and exposing it to the unique inputs associated with each LS. If a particular LS is used longer than specified, it automatically disables itself. Our novel HSCM technique uses a multi-stage optimization that computes the delays of gates, their aging degradation factors, and finally LS usage using convex programming. The experimental results show not just the viability of the technique but also surprisingly high accuracy in the presence of measurement noise and imperfect aging models. HSCM can be used for many other business and engineering applications such as power minimization, software evaluation, and processor design.
MPSoC is evolving towards processor-pool (PP)-based architectures, which employ a hierarchical on-chip network for inter- and intra-PP communication. Since the design space of PP-based MPSoC is extremely wide, application-specific optimization of on-chip communication is a nontrivial task. This paper presents a systematic methodology for on-chip network design of PP-based MPSoC. The proposed approach allows independent configuration of PPs, which leads to more efficient solutions than previous work. Since time-consuming simulation is inevitable to evaluate a complicated on-chip network during exploration, we prune the design space early using a bandwidth analysis technique that considers task execution dependencies. Our approach yields the Pareto-optimal solutions between clock frequency and area requirements. The experiments show that the proposed technique finds more efficient architectures compared with previous approaches.
In this paper, a novel design space exploration approach is proposed that enables a concurrent optimization of the topology, the process binding, and the communication routing of a system. Given an application model written in SystemC TLM 2.0, the proposed approach performs a fully automatic optimization by a simultaneous resource allocation, task binding, data mapping, and transaction routing for MPSoC platforms. To cope with the huge complexity of the design space, a transformation of the transaction level model to a graph-based model and symbolic representation that allows multi-objective optimization is presented. Results from optimizing a Motion-JPEG decoder illustrate the effectiveness of the proposed approach.
Rapid design space exploration with accurate models is necessary to improve designer productivity at the electronic system level. We describe how to use a new event-based design framework, Metro II, to carry out simulation and design space exploration of multi-core architectures. We illustrate the design methodology on a UMTS data link layer design case study with both a timed and untimed functional model as well as a complete set of MPSoC architectural services. We compare different architectures (including RTOSes) explored with Metro II and quantify the associated simulation overhead.
Owing to semiconductor technology development, fault tolerance is important not only for safety-critical systems but also for general-purpose (non-safety-critical) systems. However, instead of guaranteeing that deadlines are always met, for general-purpose systems it is important to minimize the average execution time (AET) while ensuring fault tolerance. For a given job and a soft (transient) error probability, we define mathematical formulas for the AET that include bus communication overhead for both voting (active replication) and rollback recovery with checkpointing (RRC). And, for a given multi-processor system-on-chip (MPSoC), we define integer linear programming (ILP) models that minimize the AET including bus communication overhead when: (1) selecting the number of checkpoints when using RRC, (2) finding the number of processors and the job-to-processor assignment when using voting, and (3) selecting the fault-tolerance scheme (voting or RRC) per job and defining its usage for each job. Experiments demonstrate significant savings in AET.
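A toy version of the AET tradeoff for RRC can be sketched as follows; the paper's formulas additionally include bus communication overhead, which this sketch omits, and all constants are hypothetical:

```python
# Toy AET model for rollback recovery with checkpointing (the paper's
# formulas also include bus communication overhead, omitted here). A job of
# length T is split into n segments; each checkpoint costs c; a soft error
# hits a time unit with probability p_err; an erroneous segment is
# re-executed until it completes error-free (geometric number of attempts).

def aet(n, T=100.0, c=2.0, p_err=0.01):
    seg = T / n + c                       # segment length incl. checkpoint
    p_ok = (1.0 - p_err) ** seg           # P(segment runs error-free)
    return n * seg / p_ok                 # expected attempts per segment = 1/p_ok

def best_checkpoint_count(max_n=50):
    return min(range(1, max_n + 1), key=aet)

n_opt = best_checkpoint_count()
print(n_opt, aet(1), aet(n_opt))
```

Few checkpoints mean expensive re-execution on error; many checkpoints mean heavy checkpointing overhead, so the AET curve has an interior minimum, which is the quantity the paper's ILP models optimize jointly with processor assignment and scheme selection.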
An increasing number of hardware failures can be attributed to device reliability problems that cause partial system failure or shutdown. In this paper we propose a scheme for improving the reliability of a homogeneous chip multiprocessor (CMP) that also serves to improve manufacturing yield. Our solution centers on exploiting the natural redundancy that already exists in multi-core systems by using services from other cores for functional units that are defective in a faulty core. A micro-architectural modification allows a core on a CMP to use another core as a coprocessor to service any instruction that the former cannot execute correctly. This service is accessed to improve yield and reliability, but at the cost of some loss of performance. In order to quantify this loss we have used a cycle-accurate simulator to simulate the performance of a dual-core system with one or two cores sustaining partial failure. Our results indicate that when a large and sparingly-used unit such as a floating point arithmetic unit fails in a core, even for a floating point intensive benchmark, we can continue to run each faulty core with help from companion cores with as little as 10% impact on performance and less than 1% area overhead.
Keywords- yield; reliability; microarchitecture; multiprocessors
In ultra-deep submicron technology, two paramount reliability concerns are soft errors and device aging. Although intensive studies have addressed the two challenges, most treat them separately, thereby missing better performance-cost tradeoffs. To support a more efficient design tradeoff, we present a new fault model, Stability Violation, derived from an analysis of signal behavior. Furthermore, we propose a unified fault detection scheme, Stability Violation based Fault Detection (SVFD), by which soft errors (both Single Event Upset and Single Event Transient), aging delay, and delay faults can be uniformly handled. SVFD can greatly facilitate soft-error-resistant and aging-aware designs. SVFD is validated by a set of intensive HSPICE simulations targeting 65nm CMOS technology. Experimental results show that SVFD has a more robust fault detection capability than previous schemes at comparable overhead in terms of area, power, and performance.
Fault injection has become a classical method to determine the dependability of an integrated system with respect to soft errors. Due to the huge number of possible error configurations in complex circuits, a random selection of a subset of potential errors is usual in practical experiments. The main limitation of such a selection is that the confidence in the outcomes is never quantified in the articles. This paper proposes an approach to quantify both the error on the presented results and the confidence in the presented interval. The computation of the number of faults that must be injected in order to achieve a given confidence and error interval is also discussed. Experimental results are shown and fully support the presented approach.
Keywords-dependability analysis, statistical fault injection
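A commonly used sample-size formula for such experiments is the finite-population form n = N / (1 + e^2 (N-1) / (t^2 p (1-p))); whether this matches the paper's exact derivation is an assumption:

```python
# Sample-size calculation with finite-population correction (a standard
# statistical formula; the paper's exact derivation may differ). N is the
# total fault population, e the error margin, t the normal quantile for the
# desired confidence, and p the assumed failure proportion (p = 0.5 is the
# conservative worst case).
import math

def faults_to_inject(N, e=0.01, t=2.576, p=0.5):
    """Injections needed for a +/- e interval at ~99% confidence (t = 2.576)."""
    return math.ceil(N / (1 + e * e * (N - 1) / (t * t * p * (1 - p))))

# For a population of 10 million possible faults, a 1% error margin at 99%
# confidence needs only a few thousand injections:
print(faults_to_inject(10_000_000))  # 16562
```

The key practical point is that the required number of injections saturates as N grows, which is what makes statistical fault injection tractable for complex circuits.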
Flash memory is a good candidate for the storage device in real-time systems due to its non-fluctuating performance, low power consumption and high shock resistance. However, garbage collection of invalid pages in flash memory can incur a long blocking time. Moreover, under current flash management techniques the worst-case blocking time is significantly longer than the best-case blocking time. In this paper, we propose a novel flash translation layer (FTL), called KAST, in which the user can configure the maximum log-block associativity to control the worst-case blocking time. Performance evaluation using simulations shows that the overall performance of KAST is better than current FTL schemes, and that KAST guarantees the longest blocking time is shorter than the specified value.
In this paper, a novel approach for integrating static non-preemptive software scheduling into formal bottom-up performance evaluation of embedded system models is described. The presented analysis methodology uses a functional SystemC implementation of communicating processes as input. Necessary model extensions towards capturing static non-preemptive scheduling are introduced, and the integration of the software scheduling into the formal analysis process is explained. The applicability of the approach in an automated design flow is demonstrated using a SystemC model of a JPEG encoder.
In the design and development of embedded real-time systems, timing behavior plays a central role. In particular, the evaluation of different scheduling approaches, algorithms and configurations is an elementary precondition for creating not only reliable but also efficient systems - a key to success in industrial mass production. This is becoming even more important as multi-core systems increasingly penetrate the world of embedded systems, together with the large (and growing) variety of scheduling policies available for such systems. In this work, simple mathematical concepts are used to define performance indicators that quantify the benefit of different solutions to the scheduling challenge for a given application. As a sample application, some aspects of analyzing the dynamic behavior of a combustion engine management system for the automotive domain are shown. The described approach is flexible enough to support the specific optimization needs arising from the timing requirements of the application domain and can be used with simulation data as well as target-system measurements.
As multiprocessor systems are increasingly used in real-time environments, scheduling and synchronization analysis of these platforms receive growing attention. However, most known schedulability tests lack general applicability. Common constraints are a periodic or sporadic task activation pattern with deadlines no larger than the period, or no support for shared resource arbitration, which is frequently required for embedded systems. In this paper, we address these constraints and present a general analysis which allows the calculation of response times for fixed-priority task sets with arbitrary activations and deadlines in a partitioned multiprocessor system with shared resources. Furthermore, we derive an improved bound on the blocking time in this setup for the case where the shared resources are protected according to the Multiprocessor Priority Ceiling Protocol (MPCP).
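For background, the classical uniprocessor fixed-priority response-time iteration that such multiprocessor analyses extend (a simplification of the paper's setting, with a constant blocking term B) is:

```python
# Classical fixed-priority response-time iteration (uniprocessor, no
# arbitrary activations), which analyses like the one above generalize:
#   R = C_i + B_i + sum over higher-priority j of ceil(R / T_j) * C_j
import math

def response_time(C, B, higher_prio, deadline=None, max_iter=1000):
    """C: WCET of the task, B: blocking time, higher_prio: [(C_j, T_j), ...]."""
    R = C + B
    for _ in range(max_iter):
        R_next = C + B + sum(math.ceil(R / T_j) * C_j for C_j, T_j in higher_prio)
        if R_next == R:
            return R                      # converged to the fixed point
        if deadline is not None and R_next > deadline:
            return None                   # unschedulable
        R = R_next
    return None

# Three tasks, highest priority first, (C, T) = (1, 4), (2, 6), (3, 10):
print(response_time(1, 0, []))                      # 1
print(response_time(2, 0, [(1, 4)]))                # 3
print(response_time(3, 0, [(1, 4), (2, 6)]))        # 10
```

The paper's contribution lies in replacing the periodic interference term ceil(R / T_j) with general activation event models and in tightening the blocking term B under the MPCP, neither of which this background sketch captures.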
To deal with the "memory wall" problem, microprocessors include large secondary on-chip caches. But as these caches enlarge, they open a new latency gap between them and fast L1 caches (the inter-cache latency gap). Recently, Non-Uniform Cache Architectures (NUCAs) have been proposed to sustain the size growth trend of secondary caches that is threatened by wire-delay problems. NUCAs are size-oriented, and they were not conceived to close the inter-cache latency gap. To tackle this problem, we propose Light NUCAs (L-NUCAs), leveraging on-chip wire density to interconnect small tiles through specialized networks, which convey packets with distributed and dynamic routing. Our design reduces the tile delay (cache access plus one-hop routing) to a single processor cycle and places cache lines at a finer granularity than conventional caches, reducing cache latency. Our evaluations show that, in general, an L-NUCA simultaneously improves performance, energy, and area when integrated into both conventional and D-NUCA hierarchies.
Modern processors are becoming more complex, and as features and application sizes increase, their evaluation is becoming more time-consuming. To date, design space exploration relies on extensive use of software simulation, which, when highly accurate, is slow. In this paper we propose ReSim, a parameterizable ILP processor simulation acceleration engine based on reconfigurable hardware. We describe ReSim's trace-driven microarchitecture, which allows us to simulate the operation of a complex ILP processor in a cycle-serial fashion, aiming to simplify implementation complexity and to boost operating frequency. Being trace-driven, ReSim can simulate timing in an almost ISA-independent fashion, and supports all SimpleScalar ISAs, i.e. PISA, Alpha, etc. We implemented ReSim for the latest Xilinx devices. In our experiments with a 4-way superscalar processor, ReSim achieves a simulation throughput of up to 28 MIPS and offers more than a 5x improvement over the best reported ILP processor hardware simulators.
Reconfigurable architectures are good candidates for application accelerators that cannot be set in stone at production time. FPGAs, however, often suffer from the area and performance penalty intrinsic to gate-level reconfigurability. To reduce this overhead, coarse-grained reconfigurable arrays (CGRAs) are reconfigurable at the ALU level, but a successful design needs more than computational power - the main bottleneck usually being memory transfers. Just like the integration of hardwired multiplier and memory blocks enabled FPGAs to efficiently implement digital signal processing applications, in this paper we study a customizable architecture template based on heterogeneous processing elements (multipliers, ALU clusters and memories) that provides enough flexibility to realize fast pipelined implementations of various loop kernels on a CGRA.
In this paper, two general algorithms for the automatic generation of instruction-set extensions are presented. The basic instruction set of a reconfigurable architecture is specialized with new application-specific instructions. The paper proposes two methods for the generation of convex multiple-input multiple-output instructions, under hardware resource constraints, based on a two-step clustering process. Initially, the application is partitioned into single-output instructions of variable size; then, selected clusters are combined into convex multiple-output clusters following different policies. Our results on well-known kernels show that the extended instruction set allows applications to execute more efficiently, needing fewer cycles. A significant overall application speed-up is achieved even for large kernels (up to 2.2x for the ADPCM decoder and up to 5.5x for the TWOFISH encoder).
In embedded computing we face continuously growing algorithm complexity combined with a constantly rising number of applications running on a single system. Multi-core systems are becoming popular to cope with these requirements. Growing computational complexity is handled by increasing the number of cores and core types within one system, leading to heterogeneous many-core MPSoCs in the near future. One key challenge in designing such systems is determining the number of cores required to meet performance, power, and area constraints. In this paper we present a novel parallelism analysis methodology that helps dimension these systems within seconds. The presented methodology has an average performance estimation error of less than 4% compared to transaction-level simulation.
Networks-on-Chip (NoCs) have emerged as a design strategy to overcome the limitations, in terms of scalability, efficiency, and power consumption, of current buses. In this paper, we discuss the idea of using NoCs to monitor system behaviour at run-time by tracing activities at initiators and targets. The main goal of the monitoring system is to retrieve information useful for run-time optimization and resource allocation in adaptive systems. Information detected by probes embedded within NIs is sent to a central unit in charge of collecting and processing the data. We detail the design of the basic blocks and analyse the overhead associated with an ASIC implementation of the monitoring system, as well as the implications in terms of the additional traffic generated in the NoC.
Most past evaluations of fat-trees for on-chip interconnection networks rely on oversimplified or even unrealistic architecture and traffic pattern assumptions, and very few layout analyses are available to relieve practical feasibility concerns in nanoscale technologies. This work aims at providing an in-depth assessment of the physical synthesis efficiency of fat-trees and at extrapolating silicon-aware performance figures to back-annotate system-level performance analysis. A 2D mesh is used as a reference architecture for comparison, and a 65 nm technology is targeted by our study. Finally, in an attempt to mitigate the implementation cost of k-ary n-tree topologies, we also review an alternative unidirectional multi-stage interconnection network which simplifies the fat-tree architecture with minimal impact on performance.
In this paper, we propose a novel on-chip communication scheme that divides the resources of a traditional packet-switched network-on-chip between a packet-switched and a circuit-switched sub-network. The former directs packets according to the traditional packet-switching mechanism, while the latter forwards packets over circuits established directly between two non-adjacent nodes, bypassing the intermediate routers. A packet may switch between the sub-networks several times to reach its destination. The circuits are set up using a low-latency, low-cost setup network. The network resources are split between the two sub-networks using Spatial-Division Multiplexing (SDM). The work aims to improve the power and performance metrics of Network-on-Chip (NoC) architectures, benefiting from the power and scalability advantages of packet-switched NoCs and the superior communication performance of circuit switching. The evaluation results show a significant reduction in power and latency over a traditional packet-switched NoC.
This paper presents a new two-level, page-based memory bus protection scheme. A trusted operating system drives a hardware cryptographic unit and manages security contexts for each protected memory page. The hardware unit is located between the internal system bus and the memory controller. It protects the integrity and confidentiality of selected memory pages. For better acceptability, the processor (CPU) architecture and the application software level are unmodified. The impact of the security on cost and performance is optimized by several algorithmic and hardware techniques and by differentiated handling of memory pages, depending on their characteristics.
Networks-on-chip (NoCs) for general-purpose multiprocessors require quality-of-service mechanisms to allow real-time streaming applications to be executed alongside latency-sensitive general-purpose processing tasks. In this paper, we propose a NoC link arbitration technique that supports bandwidth guarantees along with best-effort latency optimizations. In contrast to many existing quality-of-service mechanisms, our technique prioritizes best-effort over guaranteed-bandwidth traffic for optimal latency. Distributed traffic shaping is used to offer bandwidth guarantees over previously reserved connections, which are established dynamically using control messages. Initial simulation results show that our arbitration scheme can provide tight bandwidth guarantees for streaming traffic under network overload conditions. At the same time, the average latency of best-effort traffic is improved compared to a simple prioritization of streaming traffic.
We propose (σ, ρ)-based flow regulation as a design instrument for System-on-Chip (SoC) architects to control quality-of-service and achieve cost-effective communication, where σ bounds the traffic burstiness and ρ the traffic rate. This regulation changes the burstiness and timing of traffic flows, and can be used to decrease delay and reduce buffer requirements in the SoC infrastructure. In this paper, we define and analyze the regulation spectrum, which bounds the upper and lower limits of regulation. Experiments on a Network-on-Chip (NoC) with guaranteed service demonstrate the benefits of regulation. We conclude that flow regulation may exert significant positive impact on communication performance and buffer requirements.
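A (σ, ρ) regulator can be sketched as a token bucket: over any window of length t, the shaped flow carries at most σ + ρ·t packets, so σ bounds burstiness and ρ the sustained rate. The code below is an illustrative unit-packet software model of such a shaper, not the paper's hardware design.

```python
def shape(arrivals, sigma, rho):
    """Token-bucket (σ, ρ) shaper for unit-size packets.
    arrivals: sorted packet arrival times (in cycles).
    Returns departure times such that over any window of length t
    the output carries at most sigma + rho * t packets."""
    tokens, last, out = float(sigma), 0.0, []
    for a in arrivals:
        # a packet cannot leave before it arrives or before its predecessor
        t = max(a, out[-1] if out else 0.0)
        # accrue tokens at rate rho since the previous departure, capped at σ
        tokens = min(sigma, tokens + (t - last) * rho)
        if tokens < 1.0:
            # wait until a full token has accrued
            t += (1.0 - tokens) / rho
            tokens = 1.0
        tokens -= 1.0
        last = t
        out.append(t)
    return out
```

For example, five simultaneous arrivals shaped with σ = 2, ρ = 0.5 leave as a burst of two followed by one packet every two cycles, which is exactly the σ + ρ·t envelope.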
Traditional digital circuit synthesis flows start from an HDL behavioral definition and assume that circuit functions are almost completely defined, making don't-care conditions rare. However, recent design methodologies do not always satisfy these assumptions. For instance, third-party IP blocks used in a system-on-chip are often over-designed for the requirements at hand. By focusing only on the input combinations occurring in a specific application, one could resynthesize the system to reduce its area and power consumption. We therefore extend modern digital synthesis with a novel technique, called SWEDE, that uses external don't-cares present implicitly in existing simulation-based verification environments for circuit customization. Experiments indicate that SWEDE scales to large ICs with half a million input vectors and handles practical cases well.
We are interested in the problem of improving IP reuse in SoC design. This paper presents an MDE-based approach built on a proposed IP-XACT standard extension. The approach combines the benefits of MDE techniques in SoC design, such as abstraction-level definition and model transformation for code generation, with the benefits of the IP-XACT standard, such as a unique exchange format for packaged IPs (Intellectual Property) with reuse capabilities.
Today, verification, testing, and debugging of SystemC models can be applied at an early stage in the design process. To supply these techniques with the required information about the model, the SystemC Verification Library (SCV) implements a concept called data introspection. Unfortunately, data introspection exhibits problems that grow with increasing use of language features; native C++ data types, for instance, will not appear in the meta-data extracted by introspection. In this paper we propose a non-intrusive analysis concept to overcome the drawbacks of traditional data introspection. The presented approach is a hybrid technique joining a parser that collects static information with a code generator that evaluates run-time information.
Index Terms - SystemC, data introspection, analysis, intermediate representation
Ever since the invention of various leakage power reduction techniques, leakage and dynamic power reduction techniques have been treated as two separate sets; most of them cannot be applied together at runtime. The gap between them is due to the large energy breakeven time (EBT) and wake-up time (WUT) of conventional leakage reduction techniques. This paper proposes a new leakage reduction technique (SLITH) based on Vth hopping. SLITH has very low EBT and WUT, yet keeps the effectiveness of leakage reduction. It thus reduces the gap and enables joint dynamic and leakage power reduction. SLITH can be applied together with clock gating, precomputation, operand isolation, etc., and significantly reduces both dynamic and active leakage power consumption.
Index Terms - runtime leakage power reduction, energy breakeven time, wake-up time, Vth hopping
D-NUCA L2 caches are able to tolerate the increasing wire delay effects of technology scaling thanks to their banked organization, broadcast line search, and data promotion/demotion mechanism. The promotion mechanism aims at moving frequently accessed data near the core, but causes additional accesses on cache banks, hence increasing dynamic energy consumption. We show how, in some cases, this migration mechanism fails to reduce data access latency and can be selectively and dynamically inhibited, thus reducing dynamic energy consumption without affecting performance.
Panelists: A. Aznar, J.-A. Carballo, R. Madhavan, M. Merced, A. Shubat and R. Yavatkar
The rapid deployment of 3G wireless networks is accelerating the demand for application processors to deliver multimedia-rich wireless services to end customers. Texas Instruments has pioneered the way with OMAP(tm) technology. Each generation of OMAP application processors has delivered breakthrough performance with ultra-low power consumption. This challenging combination has been achieved by applying state-of-the-art power management technologies to application processors manufactured in leading-edge silicon technologies. The trend towards more performance will continue to drive innovation.
While prior research has extensively evaluated the performance advantage of moving from a 2D to a 3D design style, the impact of process parameter variations on 3D designs has been largely ignored. In this paper, we attempt to bridge this gap by proposing a variability-aware design framework for fully-synchronous (FS) and multiple clock-domain (MCD) 3D systems. First, we develop analytical system-level models of the impact of process variations on the performance of FS 3D designs. The accuracy of the model is demonstrated by comparison against transistor-level Monte Carlo simulations in SPICE: we observe a maximum error of only 0.7% (average 0.31% error) in the mean of the maximum critical path delay distribution. Second, to mitigate the impact of process variations on 3D designs, we propose a variability-aware 3D integration strategy for MCD 3D systems that maximizes the probability of the design meeting specified system performance constraints. The proposed optimization strategy is shown to significantly outperform FS and MCD 3D implementations that are conventionally assembled; for example, MCD designs assembled with the proposed integration strategy provide, on average, 44% and 16.33% higher absolute yield than the FS and conventional MCD designs respectively, at the 50% yield point of the conventional MCD designs.
Heat removal and power delivery are two major reliability concerns in 3D stacked IC technology. Liquid cooling based on micro-fluidic channels has been proposed as a viable solution to dramatically reduce the operating temperature of 3D ICs. In addition, designers use a highly complex hierarchical power distribution network in conjunction with decoupling capacitors to deliver currents to all parts of the 3D IC while suppressing power supply noise to an acceptable level. These so-called silicon ancillary technologies, however, pose major challenges to routing completion and congestion. The thermal and power/ground interconnects, together with those used for signal delivery, compete with one another for routing resources, including various types of Through-Silicon Vias (TSVs). This paper presents work on routing these interconnects in 3D: signal, power, and thermal networks. We demonstrate how to consider the various physical, electrical, and thermo-mechanical requirements of these interconnects to successfully complete routing while addressing various reliability concerns.
The quest for technologies with superior device characteristics has thrust Carbon Nanotube Field Effect Transistors (CNFETs) into the limelight. Among the many design aspects that must be addressed to make CNFET technology viable, achieving functional immunity to Carbon Nanotube (CNT) manufacturing issues (such as mispositioned CNTs and metallic CNTs) is of paramount importance. In this work we present a new design technique to build compact layouts while ensuring 100% functional immunity to mispositioned CNTs. As the second contribution of this work, we have developed a CNFET Design Kit (DK) to realize a complete logic-to-GDSII design flow following the conventional CMOS design flow. This flow enables a framework for accurate comparison between CMOS and CNFET-based circuits. The paper also presents simulation results to illustrate such analysis: with respect to the Energy-Delay Product (EDP) metric, a CNFET-based inverter can achieve gains of more than 4x in delay and 2x in energy/cycle, along with significant area savings (more than 30%), compared to a corresponding CMOS inverter benchmarked in an industrial 65nm technology.
Keywords - Carbon Nanotube Transistors, Logic Synthesis, CNT, Imperfection Immune, Misaligned Immune, CNFET.
This paper exploits the unique in-field controllability of the device polarity of ambipolar carbon nanotube field effect transistors (CNTFETs) to design a technology library with higher expressive power than conventional CMOS libraries. Based on generalized NOR-NAND-AOI-OAI primitives, the proposed library of static ambipolar CNTFET gates efficiently implements XOR functions, provides full-swing outputs, and is extensible to alternate forms with area-performance tradeoffs. Since the design of the gates can be regularized, the ability to functionalize them in-field opens opportunities for novel regular fabrics based on ambipolar CNTFETs. Technology mapping of several multi-level logic benchmarks, including multipliers, adders, and linear circuits, indicates that on average it is possible to reduce both the number of gates and the area by ~38% while also improving performance by 6.9x.
In the field of Side Channel Analysis (SCA), the electromagnetic radiation of a cryptographic device is the richest source of information. Indeed, it allows greater accuracy, since the EM probe can be positioned smartly near a given logic block, filtering out signal that is not useful for a given attack. But this advantage can easily become a drawback if the attacker is unable to position her probe correctly on the device. Our contribution is an accurate method for detecting a hot spot on the device, i.e. the position where a correlation electromagnetic attack (CEMA) should be most successful. The strategy is based on an indicator evaluated during a cartography. Its performance has been tested on a hardware AES implemented on an Altera Stratix II.
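A cartography indicator of this kind can be thought of as a per-position correlation score. The sketch below is a hypothetical, simplified version (the paper's actual indicator is not specified here): it ranks probe positions by the Pearson correlation between the traces measured at each position and an assumed leakage model, such as the Hamming weight of an intermediate AES value.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def hot_spot(traces_by_pos, leakage_model):
    """Pick the probe position whose measurements correlate best
    (in absolute value) with a hypothetical leakage model,
    i.e. the position where a CEMA is most likely to succeed."""
    return max(traces_by_pos,
               key=lambda p: abs(pearson(traces_by_pos[p], leakage_model)))
```

A position whose traces track the model linearly scores |r| near 1 and is selected; positions dominated by unrelated activity score near 0.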
Side channel attacks are known to be efficient techniques to retrieve secret data. In this context, this paper concerns the evaluation of the robustness of triple rail logic against power and electromagnetic analyses on FPGA devices. More precisely, it aims at demonstrating that the basic concepts behind triple rail logic are valid and may provide interesting design guidelines to get DPA-resistant circuits which are also more robust against DEMA.
Index Terms - DPA, CPA, DEMA, Logic Style, DES, FPGA, Side-Channel Attacks.
In this paper, we propose a preprocessing method to improve Side Channel Attacks (SCAs) on the Dual-rail with Precharge Logic (DPL) countermeasure family. The strength of our method is that it uses intrinsic characteristics of the countermeasure: classical methods fail when the countermeasure is perfect, whereas our method still works and enables us to perform advanced attacks. We have experimentally validated the proposed method by attacking a DES cryptoprocessor embedded in a Field Programmable Gate Array (FPGA) and protected by the Wave Dynamic Differential Logic (WDDL) countermeasure. This successful attack, unambiguous as the full key is retrieved, is the first to be reported.
Keywords: Side-Channel Analysis (SCA), Differential Power Analysis (DPA), ElectroMagnetic Analysis (EMA), Dual-rail with Precharge Logic (DPL), Wave Dynamic Differential Logic (WDDL), Field Programmable Gate Array (FPGA).
In the coming years, new hash function candidates will replace the old MD5 and SHA-1 standards and the current SHA-2 family. The hash algorithms RadioGatún and irRUPT are potential successors based on a stream structure, which allows high throughput (particularly with long input messages) with minimal area occupation. In this paper, several hardware architectures of the two above-mentioned hash algorithms are investigated. The ASIC implementation of RadioGatún with a word length of 64 bits shows a complexity of 46 kGE (gate equivalents) and reaches 5.7 Gbps throughput with a 3.64-bit input message. The same design approaches 120 Gbps on ASIC with long input messages (63.4 Gbps on a Virtex-4 FPGA with 2.9 kSlices). On the other hand, the irRUPT core turns out to be the most compact circuit (only 5.8 kGE on ASIC and 370 slices on FPGA), achieving 2.4 Gbps (with long input messages) on ASIC and 1.1 Gbps on FPGA.
Violations in memory references cause tremendous loss of productivity, catastrophic mission failures, loss of privacy and security, and much more. Software mechanisms to detect memory violations have high false positive and negative rates or huge performance overhead. This paper proposes architectural support to detect memory reference violations in inherently unsafe languages such as C and C++. In this approach, the ISA is extended to include "safety" instructions that provide compile-time information on pointers and objects, and the microarchitecture is extended to execute the safety instructions efficiently. We explore optimizations, such as delayed violation detection and stack-based handling of local pointers, to reduce the performance overhead. Our experiments show that the synergy between hardware and software results in this approach having less than 5% average performance overhead, while an exclusively software mechanism incurs a 480% overhead for the same benchmarks.
Ensuring correct execution of complex multi-core processor systems deployed in the field remains to this day an extremely challenging task. The major part of this effort is concentrated on design verification, where different pre- and post-silicon techniques are used to guarantee that devices behave exactly as stated in the specification. Unfortunately, the performance of even state-of-the-art validation tools lags behind the growing complexity of multi-core designs. Therefore, subtle bugs still slip into released components, causing incorrect computational results or even compromising the security of end-user systems. In this work we present Caspar, an approach for in-the-field patching of the memory subsystem hardware in multi-core chips. Caspar relies on a checkpointing system, which periodically logs the state of the chip, and a novel error detection and recovery scheme, which uses a simplified mode of operation to bypass cache coherence and consistency errors. The implementation of Caspar employs hardware detectors (on-die programmable circuits) to identify system configurations that may lead to bugs and to trigger recovery and bypass. Our experimental results show that Caspar can effectively detect and bypass a variety of memory subsystem bugs, with as little as 2% performance impact and 6% area overhead during bug-free operation.
Existing architectures for speculative addition are all based on the assumption that operands have uniformly distributed bits, an assumption rarely verified in real applications. As a consequence, they may be disadvantageous for real-world workloads, although in principle faster than standard adders. To address this limitation, we introduce a new architecture based on an innovative technique for speculative global carry evaluation. The proposed architecture overcomes the main drawback of existing schemes and, evaluated on real-world benchmarks, exhibits an interesting performance improvement with respect to both standard adders and alternative architectures for speculative addition.
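To see why the uniform-bit assumption matters, consider a generic truncated-carry-chain speculative adder, a common scheme in this literature (the paper's own architecture differs). Each carry is computed from a fixed window of lower bits, which is exact unless a carry propagates farther than the window; that event is rare for uniformly random operands but common for real ones, e.g. sign-extended or highly correlated values.

```python
def speculative_add(a, b, width=32, window=8):
    """Behavioral model of a speculative adder: the carry into each
    bit position is computed from only the `window` bits below it
    (truncated carry chain). The result equals a + b unless a carry
    propagates farther than `window` bits."""
    result = 0
    for i in range(width):
        lo = max(0, i - window)
        mask = (1 << i) - (1 << lo)            # window bits below bit i
        carry = ((a & mask) + (b & mask)) >> i  # speculative carry into bit i
        bit = ((a >> i) ^ (b >> i) ^ carry) & 1
        result |= bit << i
    return result
```

For 0xFF + 1 the carry ripples through 8 positions, so a 4-bit window mispredicts while a 16-bit window is exact; this is precisely the long-carry case that real-world operands trigger more often than uniform ones.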
Caches often employ write-back instead of write-through, since write-back avoids unnecessary transfers for multiple writes to the same block. For several reasons, however, it is undesirable that a significant number of cache lines are marked "dirty". Energy-efficient cache organizations, for example, often apply techniques that resize, reconfigure, or turn off (parts of) the cache. In such organizations, dirty lines have to be written back before the cache is reconfigured. The delay imposed by these write-backs, or the required additional logic and buffers, can significantly reduce the attained energy savings. A cache organization called the clean/dirty cache (CD-cache) is proposed that combines the properties of write-back and write-through: it avoids unnecessary transfers for recurring writes, while restricting the number of dirty lines to a hard limit. Detailed experimental results show that the CD-cache reduces the number of dirty lines significantly, while achieving similar or better performance. We also use the CD-cache to implement cache decay. Experimental results show that the CD-cache attains similar or higher performance than a normal decay cache, while using a significantly less complex design.
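A minimal behavioral sketch of the clean/dirty idea, under assumed policy details (the paper's microarchitecture may differ): writes behave as write-back until a hard cap on dirty lines is reached, after which they go through to memory, so reconfiguration never has to flush more than `dirty_limit` lines.

```python
class CDCache:
    """Toy model of a clean/dirty cache write policy with a hard
    cap on the number of dirty lines."""

    def __init__(self, dirty_limit):
        self.dirty_limit = dirty_limit
        self.dirty = set()       # lines currently marked dirty
        self.mem_writes = 0      # traffic actually sent to memory

    def write(self, line):
        if line in self.dirty:
            return               # recurring write absorbed, no memory traffic
        if len(self.dirty) < self.dirty_limit:
            self.dirty.add(line)     # write-back: defer the memory write
        else:
            self.mem_writes += 1     # cap reached: write through instead

    def flush(self):
        """Write-backs needed before a resize/decay event: bounded by the cap."""
        self.mem_writes += len(self.dirty)
        self.dirty.clear()
```

Recurring writes to a line already marked dirty cost nothing, as in a write-back cache, yet a flush is always bounded by `dirty_limit` write-backs, as in a write-through cache.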
In many applications, the traditionally wired interfaces of electronic systems are being replaced by wireless interfaces. Testing of electronic systems (both integrated circuits and printed circuit boards), however, still requires physical electrical contact through probe needles and/or sockets. This paper addresses the state of the art, options, and remaining hurdles of contactless testing, which would resolve many test challenges caused by the shrinking size and pitch of pads and pins and the inaccessibility of advanced assembly techniques such as System-in-Package (SiP) and 3D stacked ICs.
In this paper we propose an approach to the design optimization of fault-tolerant hard real-time embedded systems which combines hardware and software fault tolerance techniques. We trade off selective hardening in hardware against process re-execution in software to provide the required levels of fault tolerance against transient faults at the lowest possible system cost. We propose a system failure probability (SFP) analysis that connects the hardening level with the maximum number of re-executions in software. We present design optimization heuristics to select the fault-tolerant architecture and decide process mapping such that the system cost is minimized, deadlines are satisfied, and the reliability requirements are fulfilled.
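The coupling between hardening level and re-execution count can be read as simple probability arithmetic. The sketch below is an illustrative simplification, not the paper's SFP analysis: it assumes transient faults are independent, that hardening shows up only as a lower per-execution fault probability, and that a process fails only if all of its k+1 executions fail.

```python
def process_failure_prob(fault_prob, reexecutions):
    """Probability that a process still fails after k re-executions:
    the original execution and all k re-executions must each fault."""
    return fault_prob ** (reexecutions + 1)

def system_failure_prob(fault_probs, reexecutions):
    """System failure probability: the system fails if any process
    exhausts its re-executions. fault_probs[i] reflects the hardening
    level of processing element i (more hardening, lower probability)."""
    ok = 1.0
    for p, k in zip(fault_probs, reexecutions):
        ok *= 1.0 - process_failure_prob(p, k)
    return 1.0 - ok
```

Under these assumptions, extra hardening (smaller fault probability) and extra re-executions are interchangeable ways of driving the system failure probability below a reliability target, which is exactly the trade-off the optimization heuristics explore.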
We consider multiprocessor distributed real-time systems where concurrency control is managed using software transactional memory (STM). For such a system, we propose an algorithm to compute an upper bound on the response time. The proposed algorithm can be used to study the behavior of systems where node crash failures are possible. We compare the result of the proposed algorithm to a simulation of the system being studied in order to determine its efficacy. The results of our study indicate that it is possible to provide timeliness guarantees for multiprocessor distributed systems programmed using STM.
As application complexity increases, modern embedded systems have adopted heterogeneous processing elements to enhance computing capability or to reduce power consumption. This heterogeneity introduces challenges for energy efficiency in hardware and software implementations. This paper studies how to partition real-time tasks on a platform with heterogeneous processing elements (processors) so that the energy consumption is minimized. The power consumption models considered in this paper are very general, assuming only that the energy consumption at a higher workload is larger than that at a lower workload, which is true for many systems. We propose an approximation scheme to derive near-optimal solutions for different hardware configurations in energy/power consumption. When the number of processors is a constant, the scheme is a fully polynomial-time approximation scheme (FPTAS) that derives a solution with energy consumption very close to the optimum in polynomial time/space complexity. Experimental results reveal that the proposed scheme is very effective in energy efficiency compared with the optimal solutions.
Keywords: Multiprocessor scheduling, Heterogeneous multiprocessor, Energy-efficient scheduling.
This paper introduces a graph-grammar-based approach to automated topology synthesis of analog circuits. A grammar is developed to generate circuits through production rules that are encoded in the form of a derivation tree. The synthesis is sped up by using dynamically obtained, design-suitable building blocks. Our technique has certain advantages compared to other tree-based approaches such as GP-based structure generation. Experiments conducted on an op-amp and a VCO design show that, unlike previous works, we are capable of generating both manual-like designs (bookish circuits) and novel designs (unfamiliar circuits) for multi-objective analog circuit design benchmarks.
This paper demonstrates a system that performs multi-objective sizing across 100,000 analog circuit topologies simultaneously, with SPICE accuracy. It builds on a previous system, MOJITO, which searches through 3500 topologies defined by a hierarchically organized set of 30 analog blocks. This paper improves MOJITO's results quality via three key extensions. First, it enlarges the block library to enable symmetrical transconductance amplifiers and more. Second, it improves initial topology diversity via optimization-based constraint satisfaction. Third, it maintains topology diversity during search via a novel multi-objective selection mechanism, dubbed TAPAS. MOJITO+TAPAS is demonstrated on a problem with 6 objectives, returning a tradeoff holding 17438 nondominated designs. The tradeoff comprises 152 unique topologies, including the newly introduced ones; 59 designs across 12 topologies outperform an expert-designed reference circuit.
A new approach to hierarchical optimisation is presented which is capable of optimising both the performance and the yield of an analogue design. Performance and yield trade-offs are analysed using a combination of multi-objective evolutionary algorithms and Monte Carlo simulations. A behavioural model that combines the performance and variation of a given circuit topology is developed, which can be used to optimise the system-level structure. The approach enables top-down system optimisation, not only for performance but also for yield. The model has been developed in Verilog-A and tested extensively on practical designs using the Spectre simulator. A performance and variation model of a 5-stage voltage-controlled ring oscillator has been developed, and a PLL design is used to demonstrate hierarchical optimisation at the system level. The results have been verified with transistor-level simulations and suggest that accurate performance and yield prediction can be achieved with the proposed algorithm.
Intermodulation distortion is one of the key design requirements of Radio Frequency circuits. The standard approach for analyzing distortion using circuit simulators is to mimic measurement environments and compute the response due to a two-tone input. This considerably increases the CPU cost of the simulation because of the large number of variables resulting from the harmonics of these two tones and their intermodulation products. In this paper, we propose an analytical method for directly obtaining the intermodulation distortion from the Harmonic Balance equations with only a single-tone input, without the need to perform a Harmonic Balance simulation. The proposed method is shown to be significantly faster than traditional simulation-based approaches.
To speed up analog design cycles and keep up with continuously decreasing times to market, iterative design refinement and redesigns are more than ever regarded as showstoppers. To deal with this issue, referred to as the design and verification gap, the development of continuous and consistent verification is mandatory. In digital design, formal verification methods are considered a key technology for efficient design flows. However, the industrial availability of formal methods for analog circuit verification is still negligible despite a growing need. In recent years, research institutions have made considerable advances in the area of formal verification of analog circuits. This paper presents a selection of four recent approaches in analog verification that cover a broad scope of verification philosophies.
Novel nonvolatile memory technologies are attracting significant attention from the semiconductor industry in the race toward universal memory. We use Spin-Transfer Torque Random Access Memory (STT-RAM) and Resistive Random Access Memory (R-RAM) as examples to discuss the implications of emerging nonvolatile memories for tools and architectures. Three aspects are discussed in detail: device and memory cell modeling, device/circuit co-design considerations, and novel memory architectures. The goal of these discussions is to design a high-density, low-power, high-performance nonvolatile memory with a simple architecture and minimal circuit design complexity.
Keywords - Universal memory; STT-RAM; R-RAM; MTJ device modeling; memory yield improvement.
Caches made of non-volatile memory technologies, such as Magnetic RAM (MRAM) and Phase-change RAM (PRAM), offer dramatically different power-performance characteristics compared with SRAM-based caches, particularly in static/dynamic power consumption, read and write access latency, and cell density. In this paper, we propose to take advantage of the best characteristics each technology has to offer through read-write aware Hybrid Cache Architecture (RWHCA) designs, where a single level of cache can be partitioned into read and write regions, each of a different memory technology with disparate read and write characteristics. We explore the potential of hardware support for intra-cache data movement within RWHCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that a conservative RWHCA setup can provide a geometric-mean 55% power reduction and a 5% IPC improvement over a baseline SRAM cache design across a collection of 30 workloads. Furthermore, a 2-layer 3D cache stack (3DRWHCA) of high-density memory technology with the same chip footprint still gives a 10% power reduction and boosts performance by 16% in IPC over the baseline.
Recent breakthroughs in circuit and process technology have enabled new usage models for non-volatile memory technologies such as Flash and phase change RAM (PCRAM) in the general-purpose computing environment. These technologies offer high density, low power consumption, and persistence, all appealing properties in a memory device. This paper summarizes our earlier work on improving NAND Flash based disk caches and extends it to consider PCRAM. We first present the primary challenges in reliably managing non-volatile memories such as NAND Flash, reviewing our past work on architectural support for Flash manageability. We then provide a preliminary analysis of how our current Flash manageability architecture may be simplified when Flash is replaced with PCRAM. Our evaluation of PCRAM shows a potential for more than a 65% throughput improvement for a disk-intensive database workload. Although more detailed studies are needed, we conclude that PCRAM is a strong contender to replace Flash if it becomes cost-effective.
We present the aEqualized routing algorithm, a novel algorithm for the Spidergon Network-on-Chip. aEqualized combines the well-known aFirst and aLast algorithms proposed in the literature, obtaining an optimized use of the network's channels. This optimization allows reducing the number of channels actually implemented on the chip while maintaining performance similar to that achieved by the two basic algorithms. In the second part of this paper, we propose a variation on the Spidergon's router architecture that enhances the performance of the network, especially when the aEqualized routing algorithm is adopted.
Most CMPs use on-chip networks to connect cores and tend to integrate more, simpler cores on a single die. Low-radix networks, such as the 2D-MESH, are widely used in tiled CMPs since they can be mapped to on-chip networks efficiently. However, low-radix networks introduce high network latency caused by their long diameter. In this paper, we propose the use of a group-caching design in NoC-based multicore cache-coherent systems. In our design, on-chip L2 banks are organized into multiple groups. Each cache group behaves like a shared L2 cache for the cores inside the group, while cache coherence between groups is maintained by coherence messages. In addition, group-caching adopts a new cache replacement policy to improve the inefficient use of the aggregate L2 cache capacity. Compared to a banked, shared L2 design, the hop count is significantly reduced because most L2 accesses are served by the local cache group. Experimental results based on full-system simulation show that, for the 2D-MESH, group-caching can increase performance by 2%~8% compared to a banked, shared L2 design, with network energy consumption reduced by 11%~13%. The results also show that the communication overhead inside a cache group plays an important role in the performance of group-caching.
Keywords: CMP; NoC; network latency; L2 banks; cache coherence; group-caching; performance; power
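The latency argument above can be made concrete with a back-of-the-envelope sketch. The 8x8 mesh and 2x2 group sizes below are illustrative assumptions, not figures taken from the paper:

```python
# Sketch (assumed parameters): average Manhattan hop count on an 8x8
# 2D-MESH when an L2 request targets a random bank anywhere on the chip,
# versus a random bank inside the requester's own 2x2 cache group.

WIDTH = 8          # 8x8 tiled CMP (64 tiles), one L2 bank per tile
GROUP = 2          # hypothetical 2x2 cache groups

def hops(a, b):
    """Manhattan distance between tiles a and b (XY routing)."""
    return abs(a % WIDTH - b % WIDTH) + abs(a // WIDTH - b // WIDTH)

def avg_hops(pairs):
    pairs = list(pairs)
    return sum(hops(a, b) for a, b in pairs) / len(pairs)

tiles = range(WIDTH * WIDTH)
global_avg = avg_hops((a, b) for a in tiles for b in tiles)

def group_of(t):
    return (t % WIDTH // GROUP, t // WIDTH // GROUP)

group_avg = avg_hops((a, b) for a in tiles for b in tiles
                     if group_of(a) == group_of(b))

print(global_avg, group_avg)  # group-local accesses need far fewer hops
```

Under these toy assumptions, serving most L2 accesses inside the group cuts the average hop count from 5.25 to 1.0, which is the effect the abstract attributes to group-caching.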
In many current SoCs, the architectural interface to on-chip monitors is ad hoc and inefficient. In this paper, a new architectural approach which advocates the use of a separate low-overhead subsystem for monitors is described. A key aspect of this approach is an on-chip interconnect specifically designed for monitor data with different priority levels. The efficiency of our monitor interconnect is assessed for a multicore system using both an interconnect and a system-level simulator. Collected monitor information is used by a dedicated processor to control the frequency and voltage of individual multicore processors. Experimental results show that the new low-overhead subsystem facilitates employment of thermal and delay-aware dynamic voltage and frequency scaling.
This paper presents a novel technique for the modeling, simulation, and analysis of real-time applications on Multi-Processor Systems-on-Chip (MPSoCs). The technique is based on an application-transparent emulation of OS primitives, including support for RTOS elements. The proposed methodology enables a quick evaluation of the real-time performance of an application under different design choices, including the study of the system's behavior as task deadlines become stricter or looser. The approach has been verified on a large set of multi-threaded benchmarks. Results show that our methodology (a) enables accurate real-time and responsiveness analysis of parallel applications running on MPSoCs, (b) allows the designer to devise an optimal interrupt distribution mechanism for the given application, and (c) helps dimension the system to meet performance and real-time needs.
Chip multiprocessors (CMPs) present a unique scenario for software data prefetching, with subtle tradeoffs between memory bandwidth and performance. In a shared L2 based CMP, multiple cores compete for the shared on-chip cache space and limited off-chip pin bandwidth. Purely software based prefetching techniques tend to increase this contention, leading to performance degradation. In some cases, prefetches can become harmful by kicking out useful data from the shared cache whose next usage is earlier than that of the prefetched data, and the fraction of such harmful prefetches usually increases when we increase the number of cores used for executing a multi-threaded application code. In this paper, we propose two complementary techniques to address the problem of harmful prefetches in the context of shared L2 based CMPs. These techniques, namely, suppressing select data prefetches (if they are found to be harmful) and pinning select data in the L2 cache (if they are found to be frequent victims of harmful prefetches), are evaluated in this paper using two embedded application codes. Our experiments demonstrate that these two techniques are very effective in mitigating the impact of harmful prefetches, and as a result, we extract significant benefits from software prefetching even with large core counts.
Multiprocessor System on Chip (MPSoC) architectures are rapidly gaining momentum for modern embedded devices. Vulnerabilities in software on MPSoCs are often exploited to mount software attacks, the most common type of attack on embedded systems. Therefore, we propose an MPSoC architectural framework, CUFFS, for an Application Specific Instruction set Processor (ASIP) design that has a dedicated security processor, called iGuard, for detecting software attacks. The CUFFS framework instruments the source code in the application processors at the basic block (BB) level with special instructions that allow communication with iGuard at runtime. The framework also analyzes the code in each application processor at compile time to determine the program control flow graph and the number of instructions in each basic block, which are then stored in the hardware tables of iGuard. iGuard uses its hardware tables to verify the applications' execution at runtime. For the first time, we propose a framework that probes the application processors to obtain their instruction count and employs an actively engaging security processor that can detect attacks even when an application processor does not communicate with iGuard. CUFFS relies on the exact number of instructions in each basic block to detect an attack, which is superior to the time-frame-based measures proposed in the literature. We present a systematic analysis of how CUFFS can thwart common software attacks. Our implementation of CUFFS on the Xtensa LX2 processor from Tensilica Inc. had a worst-case run-time penalty of 44% and an area overhead of about 28%.
Categories and Subject Descriptors B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids
General Terms Design, Performance, Security
Keywords Architecture, Instruction Count, MPSoC, Attacks, Tensilica
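The CUFFS check can be pictured with a small software model. Everything here (table contents, event format, the `verify` routine) is a hypothetical sketch of the kind of check iGuard performs in hardware, not code from the paper:

```python
# Illustrative sketch only: verify a runtime stream of (basic block,
# instruction count) events reported by an instrumented application
# processor against compile-time tables. Table contents are invented.

bb_len = {0: 5, 1: 3, 2: 7}            # basic block -> instruction count
cfg = {0: {1, 2}, 1: {2}, 2: {0}}      # basic block -> legal successors

def verify(trace, entry=0):
    """trace: iterable of (bb_id, executed_instruction_count) events."""
    prev = None
    for bb, icount in trace:
        legal = (bb == entry) if prev is None else (bb in cfg[prev])
        if not legal:
            return False               # control flow left the static CFG
        if icount != bb_len[bb]:
            return False               # e.g. injected code changed the BB
        prev = bb
    return True

assert verify([(0, 5), (2, 7), (0, 5), (1, 3)])
assert not verify([(0, 5), (1, 4)])    # wrong instruction count: attack
assert not verify([(1, 3)])            # execution must begin at the entry
```

The exact per-block instruction count is what distinguishes this style of check from time-frame-based measures: any injected or skipped instruction changes the count immediately.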
Soft errors in combinational and sequential elements of digital circuits are an increasing concern as a result of technology scaling. Several techniques for gate and latch hardening have been proposed to synthesize circuits that are tolerant to soft errors. However, each such technique has associated overheads of power, area, and performance. In this paper, we present a new methodology to compute the failures in time (FIT) rate of a sequential circuit where the failures are at the system level. System-level failures are detected by monitors derived from functional specifications. Our approach includes efficient methods to compute the FIT rate of combinational circuits (CFIT), incorporating effects of logical, timing, and electrical masking. The contribution of circuit components to the FIT rate of the overall circuit can be computed from the CFIT and the probabilities of system-level failure due to soft errors in those elements. Designers can use this information to perform Pareto-optimal hardening of selected sequential and combinational components against soft errors. We present experimental results demonstrating that our analysis is efficient, accurate, and provides data that can be used to synthesize a low-overhead, low-FIT sequential circuit.
Ensuring reliable computation at the nanoscale requires mechanisms to detect and correct errors during normal circuit operation. In this paper we propose a method for designing efficient online error detection schemes for circuits based on the identification of invariant relationships in hardware. More specifically, we present a technique that automatically identifies multi-cycle gate-level invariant relationships - where no knowledge of high-level behavioral constraints is required to identify the relationships - and generates the checker logic that verifies these implications. Our results show that cross-cycle implications are particularly useful in discovering difficult-to-detect errors near latch boundaries, and can have a significant impact on boosting error detection rates.
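The invariant-mining step can be sketched in a few lines. The trace format and the restriction to single-cycle-lookahead implications are simplifying assumptions for illustration, not the paper's actual algorithm:

```python
# Sketch (assumed setup): mine cross-cycle implications of the form
# "x = 1 at cycle t  =>  y = 1 at cycle t+1" from a gate-level
# simulation trace; surviving candidates become checker logic.

def mine_implications(trace):
    """trace: list of dicts, one per cycle, mapping signal name -> 0/1."""
    signals = trace[0].keys()
    cands = {(a, b) for a in signals for b in signals if a != b}
    for t in range(len(trace) - 1):
        # Drop every candidate the trace falsifies at this cycle pair.
        cands = {(a, b) for (a, b) in cands
                 if not (trace[t][a] == 1 and trace[t + 1][b] == 0)}
    return cands

trace = [{"req": 1, "gnt": 0},
         {"req": 0, "gnt": 1},
         {"req": 1, "gnt": 0},
         {"req": 1, "gnt": 1},
         {"req": 0, "gnt": 1}]
invs = mine_implications(trace)
assert ("req", "gnt") in invs       # req=1 is always followed by gnt=1
assert ("gnt", "req") not in invs   # falsified at cycles 3 -> 4
```

A real flow would validate mined candidates on longer traces (or formally) before synthesizing checkers, since trace-mined implications are only as strong as the stimulus that produced them.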
In this paper, we present an open-architecture Virtual Test Environment (VTE) which can be easily integrated into various modularized Automatic Test Systems (ATS) compliant with the Open Standard Architecture (OSA). The focus of this paper is to analyze and address the major issues that still keep Virtual Test (VT) out of day-to-day practice. As a pilot demonstration, a VHDL-AMS based VTE is established and an ADC test is performed. The environment is intended to seamlessly interoperate with the test system during test program development.
Keywords: Virtual Test, Test generation, Simulation, Hardware description language, VHDL, ATML, IEEE 1641
Scheduling task graphs under hard (end-to-end) timing constraints is an extensively studied NP-hard problem of critical importance for predictable software mapping on Multiprocessor System-on-Chip (MPSoC) platforms. In this work we focus on an off-line (design-time) version of this problem, where the target task graph is known before execution time. We address the issue of scheduling robustness, i.e., providing hard guarantees that the schedule will meet the end-to-end deadline in the presence of bounded variations of task execution times, expressed as min-max intervals known at design time. We present a robust scheduling algorithm that proactively inserts sequencing constraints when they are needed to ensure that execution will have no inserted idle times and will meet the deadline for any possible combination of task execution times within the specified intervals. The algorithm is complete, i.e., it will return a feasible graph augmentation if one exists. Moreover, we provide an optimization version of the algorithm that can compute the shortest deadline that can be met in a robust way.
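One ingredient of such a robustness guarantee can be sketched under simplifying assumptions (this is not the paper's algorithm): once sequencing constraints make the schedule work-conserving with no inserted idle time, the deadline holds for every execution-time combination if it holds when each task takes the *maximum* of its min-max interval. The graph and numbers below are invented for illustration:

```python
# Sketch (assumed task graph): worst-case makespan of a fully sequenced
# task graph, taking every task at the max end of its execution-time
# interval, then comparing against the end-to-end deadline.

import functools

succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
wcet = {"a": 4, "b": 3, "c": 6, "d": 2}   # max of each min-max interval

@functools.lru_cache(None)
def finish(task):
    """Worst-case completion time of `task` from the graph's start."""
    preds = [t for t in succ if task in succ[t]]
    return wcet[task] + max((finish(p) for p in preds), default=0)

makespan = max(finish(t) for t in succ)
print(makespan)        # longest path a -> c -> d with maximal durations
assert makespan <= 13  # deadline met for all execution-time combinations
```

The hard part, which the paper addresses, is *choosing* the sequencing constraints so that this worst-case bound is both valid and as tight as possible.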
OpenMP is a de facto standard interface of the shared address space parallel programming model. Recently, there have been many attempts to use it as a programming environment for embedded MultiProcessor Systems-On-Chip (MPSoCs). This is due both to the ease of specifying parallel execution within a sequential code with OpenMP directives, and to the lack of a standard parallel programming method on MPSoCs. However, MPSoC platforms for embedded applications often feature non-uniform, explicitly managed memory hierarchies with no hardware cache coherency as well as heterogeneous cores with heterogeneous run-time systems. In this paper we present an optimized implementation of the compiler and runtime support infrastructure for OpenMP programming for a non-cache-coherent distributed memory MPSoC with explicitly managed scratchpad memories (SPM). The proposed framework features specific extensions to the OpenMP programming model that leverage explicit management of the memory hierarchy. Experimental results on different real-life applications confirm the effectiveness of the optimization in terms of performance improvements.
Future computing systems will feature many cores that run fast, but might show more faults compared to existing CMOS technologies. New software methodologies must be adopted to utilize communication bandwidth and the computational power of a few slow, reliable cores that could be employed in such systems to verify the results of the fast, faulty cores. Employing traditional Triple Module Redundancy (TMR) at the core instruction level would not be as effective due to its blind replication of computations. We propose two software development methods that utilize what we call Smart TMR (STMR) and fingerprinting to statistically monitor the results of computations and selectively replicate computations that exhibit faults. Experimental results show significant speedup and reliability improvement over traditional TMR approaches.
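The selective-replication idea can be sketched roughly as follows. The core model, the use of SHA-256 as the fingerprint, and the arbitration policy are all assumptions for illustration, not details from the paper:

```python
# Rough sketch (assumed details): run a computation on two fast-but-
# faulty cores, compare result fingerprints, and fall back to a slow
# reliable core only on disagreement -- instead of blindly triple-
# executing everything as classic TMR does.

import hashlib

def fingerprint(result):
    return hashlib.sha256(repr(result).encode()).hexdigest()

def stmr(task, arg, fast_cores, reliable_core):
    r1 = fast_cores[0](task, arg)
    r2 = fast_cores[1](task, arg)
    if fingerprint(r1) == fingerprint(r2):
        return r1                      # agreement: no replication needed
    return reliable_core(task, arg)    # disagreement: arbitrate reliably

# Hypothetical cores: the faulty one perturbs the result (a soft error).
ok_core = lambda f, x: f(x)
faulty_core = lambda f, x: f(x) + 1

assert stmr(lambda x: x * x, 7, [ok_core, ok_core], ok_core) == 49
assert stmr(lambda x: x * x, 7, [ok_core, faulty_core], ok_core) == 49
```

Compared with TMR, the reliable core is consulted only for the (statistically rare) computations whose fingerprints disagree, which is where the claimed speedup comes from.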
With the increasing scaling of manufacturing technology, process variation is a phenomenon that has become more prevalent. As a result, in the context of Chip Multiprocessors (CMPs) for example, it is possible that identically-designed processor cores on the chip have non-identical peak frequencies and power consumptions. To cope with such a design, each processor can be assumed to run at the frequency of the slowest processor, resulting in wasted computational capability. This paper considers an alternate approach and proposes an algorithm that intelligently maps (and remaps) computations onto available processors so that each processor runs at its peak frequency. In other words, by dynamically changing the thread-to-processor mapping at runtime, our approach allows each processor to maximize its performance, rather than simply using the chip-wide lowest frequency amongst all cores and highest cache latency. Experimental evidence shows that, as compared to a process variation agnostic thread mapping strategy, our proposed scheme achieves as much as 29% improvement in overall execution latency, the average improvement being 13% over the benchmarks tested. We also demonstrate in this paper that our savings are consistent across different processor counts, latency maps, and latency distributions.
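The contrast between the two policies can be illustrated with a toy model. The workloads, frequencies, and greedy pairing rule are assumptions for illustration, not the paper's mapping algorithm:

```python
# Toy sketch (assumed model): map the heaviest threads onto the fastest
# cores so each core runs at its own peak frequency, and compare with
# clocking every core at the chip-wide slowest frequency.

work = [90, 80, 60, 40]        # cycles of work per thread (hypothetical)
freq = [1.0, 1.4, 0.8, 1.2]    # per-core peak frequency after variation

# Greedy pairing: heaviest thread on fastest core, and so on.
pairs = zip(sorted(work, reverse=True), sorted(freq, reverse=True))
variation_aware = max(w / f for w, f in pairs)       # parallel makespan

# Variation-agnostic baseline: all cores clocked at the slowest peak.
worst_case_clock = max(w / min(freq) for w in work)

print(variation_aware, worst_case_clock)
assert variation_aware < worst_case_clock
```

Even this crude model shows why letting each core keep its own peak frequency, and choosing the mapping accordingly, recovers capability that the lowest-common-frequency policy throws away.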
Today, many chips are designed with predefined discrete cell libraries. In this paper we present a new fast gate sizing algorithm that works natively with discrete cell choices and realistic timing models. The approach iteratively assigns signal slew targets to all source pins of the chip and chooses discrete layouts of minimum size preserving the slew targets. Using slew targets instead of delay budgets, accurate estimates for the input slews are available during the sizing step. Slew targets are updated by an estimate of the local slew gradient. To demonstrate the effectiveness, we propose a new heuristic to estimate lower bounds for the worst path delay. On average, we violate these bounds by 6%. A subsequent local search decreases this gap quickly to 2%. This two-stage approach is capable of sizing designs with more than 5.8 million cells within 2.5 hours and thus helping to decrease turn-around times of multi-million cell designs.
Multi-domain clock skew scheduling is a cost-effective technique for performance improvement. However, the wire length and area overhead due to the phase shifters required to realize such a clock schedule may be considerable if registers are placed without considering their assigned skews. Focusing on this issue, in this paper we propose a skew scheduling-aware register placement algorithm that enables clock tree optimization by considering, during placement, the domains assigned to registers. Our experimental results show that the proposed approach remarkably decreases clock wire length and clock network power consumption at the cost of a slight increase in total wire length.
Decoupling capacitors (decaps) are typically used to reduce the noise in the power supply network. Because the delay of gates and interconnects is affected by the supply voltage level, decaps can be used to improve circuit performance as well. In this paper, we present an analytical delay model under IR drop, Ldi/dt noise, and decaps to study how decaps affect both gate and interconnect delay. Given a floorplanning solution, we study how to allocate the whitespace for decap insertion so that the delay is minimized under the given noise and area constraints. We employ the Sequential Linear Programming method to solve the non-linear whitespace allocation problem. Our experimental results show that intelligent decap allocation makes further delay reduction possible without adding any additional decaps.
Due to the increasing complexity of design interactions between the chip, package, and PCB, it is essential to consider them at the same time. In particular, the finger/pad locations significantly affect the performance of the chip and the package. In this paper, we develop chip-package codesign techniques to decide the locations of fingers/pads, addressing package routability and signal integrity concerns in chip core design. Our finger/pad assignment is a two-step method: first we optimize the wire congestion problem in package routing, and then we minimize IR-drop violations by refining the finger/pad solution. The experimental results are encouraging. Compared with randomly optimized methods, our approaches reduce, on average, 42% and 68% of the maximum density in package routing and 10.61% of the IR-drop for the test circuits.
Today's innovations in the automotive sector are, to a great extent, based on electronics. The increasing integration complexity and stringent cost reduction goals turn E/E platform design into a challenging task. Timing/performance is becoming a key aspect of architecture design, because the platform must be dimensioned to provide just the right amount of computing power and network bandwidth, including reserves for future extensions, in order to be cost efficient. In other words, it must be as powerful as needed but as cheap as possible. Finding this sweet spot is a key challenge. Therefore, OEMs and Tier-1 suppliers are in search of new methods, processes, and timing analysis techniques that assist in early platform design stages. In this paper, we demonstrate how some selected techniques that are established for verification (in late design stages) can also be used to guide the design (in early stages). We present examples in the areas of ECUs (OSEK), buses (CAN, FlexRay), and gated networks. Flow and applicability aspects are highlighted. As a key result, we show that, and how, late-stage verification can inform early-stage design. Finally, we also outline future challenges in the area of multi-core ECUs.
This article describes important challenges regarding the design, specification and implementation of FlexRay-based automotive networks. The authors outline a design approach that especially accounts for timing constraints of the network, namely end-to-end and cycle timing constraints. The schedule generation for electronic control units (ECU) as well as bus entities is addressed and constraint compatibility with basic FlexRay configuration properties is investigated. The discussed design approach considers three practical design challenges of the automotive industry: first, the function-based cycle timing constraints and their dependency to basic bus design is presented. Second, the challenge of distributed development of modern on-board networks by many different teams and an approach for collaboration improvement is discussed. Finally, the third part describes the configuration of time-triggered ECU schedules with respect to different constraint types.
The adoption of AUTOSAR in the development of automotive electronics can increase the portability and reuse of functional components. Inside each component, the behavior is represented by a set of runnables, defining reactions executed in response to an event or periodic computations. The implementation of AUTOSAR runnables in a concurrent program executing as a set of tasks reveals several issues and trade-offs because of the need to protect communication and state variables and to ensure time determinism. We discuss some of these tradeoffs and options and outline a problem formulation that can be used to compute the solution with minimum memory requirements executing within the deadlines.
MIMO systems (with multiple transmit and receive antennas) are becoming increasingly popular, and many next-generation systems such as WiMAX, 3GPP LTE and IEEE 802.11n wireless LANs rely on the increased throughput of MIMO systems with up to four antennas at receiver and transmitter. High-throughput implementation of the detection unit for MIMO systems is a significant challenge, especially for higher-order modulation schemes. To achieve superior Bit Error Rate (BER) or Frame Error Rate (FER) performance, the detector has to provide soft values to advanced Forward Error Correction (FEC) schemes like Turbo Codes. This paper presents a systolic soft detector architecture for high-dimensional (e.g., 4x4, 64-QAM) MIMO systems. A single detector core achieves a throughput of 215 Mbps and a power consumption of 23.6 mW, while using only 33.1K gate equivalents (for the l2 norm). Impressive SNR gains of almost 2 dB are observed with respect to the hard detection counterpart over a block fading channel (at an FER of 1%). Additionally, the architecture can be stacked to give a linear increase in throughput with a linear increase in hardware resources.
Longer range, faster speed, and stronger links are mandatory characteristics of today's wireless systems. Tremendous efforts are being deployed to create new and improved wireless protocols. However, these new protocols are being tested in harsh and uncontrolled environments. Simulation tools help to capture the expected behavior, but the proposed designs might not work in real-life situations due to the lack of accurate simulation models. Testbed platforms are able to test designs in real-life settings, but the flexibility of the design is reduced and design exploration becomes a complex task. This paper presents a hybrid platform composed of a simulation tool and a testbed environment, which makes it possible to easily design and accurately test new wireless protocols.
Non-Linear Feedback Shift Registers (NLFSRs) have been proposed as an alternative to Linear Feedback Shift Registers (LFSRs) for generating pseudo-random sequences for stream ciphers. Conventional NLFSRs use the Fibonacci configuration in which the feedback is applied to the last bit only. In this paper, we show how to transform a Fibonacci NLFSR into an equivalent NLFSR in the Galois configuration, in which the feedback can be applied to every bit. Such a transformation can potentially reduce the depth of the circuits implementing feedback functions, thus decreasing the propagation time and increasing the throughput.
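The two configurations are easiest to see in the linear special case, an LFSR, which is a toy illustration rather than the paper's nonlinear transformation. For the primitive polynomial x^4 + x^3 + 1, both the Fibonacci form (feedback into the last bit only) and a Galois form (feedback spread over several bits) cycle through all 15 nonzero 4-bit states:

```python
# Linear toy example only: Fibonacci vs. Galois shift-register
# configurations for x^4 + x^3 + 1, both with maximal period 15.

def fib_step(s):
    """Fibonacci: feedback computed from taps, shifted into one end."""
    fb = ((s >> 3) ^ (s >> 2)) & 1       # taps on bits 3 and 2
    return ((s << 1) | fb) & 0xF

def galois_step(s, mask=0b1100):
    """Galois: on shift-out of a 1, feedback is applied to several bits."""
    lsb = s & 1
    s >>= 1
    return s ^ mask if lsb else s

def period(step, seed=1):
    s, n = step(seed), 1
    while s != seed:
        s, n = step(s), n + 1
    return n

assert period(fib_step) == 15
assert period(galois_step) == 15
```

The throughput argument in the abstract is visible even here: the Fibonacci feedback is a chain of XORs computed before the shift, while the Galois form splits the feedback into shallow, parallel updates.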
Motion Estimation (ME) is the most computationally intensive part of video compression and video enhancement systems. For the recently available high-definition frame sizes and high frame rates, the computational complexity of the full search (FS) ME algorithm is prohibitively high, while the PSNR obtained by fast search ME algorithms is low. Therefore, in this paper, we propose a new ME algorithm and a high-performance reconfigurable systolic ME hardware architecture for efficiently implementing it. The proposed ME algorithm performs up to three search iterations of different granularity in different-size search ranges, based on the application requirements. Simulation results showed that the proposed ME algorithm performs very close to the FS algorithm, even though it searches far fewer locations. It outperforms popular fast search ME algorithms by searching more locations than they do. The proposed reconfigurable ME hardware is implemented in VHDL and mapped to a low-cost Xilinx XC3S1500-5 FPGA. It runs at 130 MHz and is capable of processing high-definition, high-frame-rate video formats in real time. Therefore, it can be used in flat panel displays for frame rate conversion and de-interlacing, and in video encoders.
This paper proposes a novel approach for design space exploration by characterizing hardware sharing based on the notion of a partition in set theory. Related designs with different degrees of hardware sharing can be captured concisely by a Hasse diagram, highlighting designs with shared building blocks. Hardware sharing can be implemented in various ways, such as component multiplexing, instruction-set processors, or run-time reconfiguration. We illustrate how the proposed approach can be applied to exploring the design space for FPGA implementations of JPEG image compression.
The demand for embedded computing power is continuously increasing, and FPGAs are becoming very interesting computing platforms, as they provide huge amounts of customizable parallelism. However, programming them is challenging, let alone from a high-level language. The ESPAM methodology was previously presented as a way to quickly obtain realizations on FPGAs from sequential C code. The realization consists of a network of processors and IP cores. In that approach, a problem was that the IP cores had to be provided manually. In this paper, we present an extension of the ESPAM methodology that incorporates the industrial high-level synthesis tool PICO from Synfora Inc. In this way, we realize the automated generation of efficient hardware implementations on FPGAs from a single sequential C input specification of a streaming application. We demonstrate our approach on the Sobel and QR applications.
This paper presents a low-cost and simple distributed force sensor that is particularly suitable for measuring grip force and hand position on a steering wheel. The sensor can be used in automotive active safety systems that aim at detecting driver fatigue, a major issue in preventing road accidents. The key point of our approach is to design a chain of sensor units, each of them provided with some intelligence and general-purpose capabilities, so that it can serve as a platform for integrating different kinds of sensors into the steering wheel. A proof-of-concept demonstration of the distributed sensor, consisting of 16 units based on capacitive sensing elements, has been realised and preliminary results are presented.
DNA self-assembly is emerging as the most promising technique for nanoscale self-assembly as it uses the simple, yet precise rules of DNA binding to create macroscale assemblies from nanoscale components. However, DNA self-assembly is also highly error-prone and requires the use of error-resilience techniques in order to unlock its potential. In this paper we propose a technique for error-resilience that is based on information redundancy but, in contrast to previous information redundancy schemes, can achieve much higher resilience to growth errors. By expanding the neighborhood from which redundant information is taken, we can extend the distance that errors are propagated and therefore increase the likelihood of the error being reversed. Given a growth error rate of ε, we show that with a neighborhood of only 2 we can reduce the error rate to ε^3.64 for arbitrary functions (as compared to the ε^2.33 previously achieved). Compared with spatial redundancy approaches, our technique allows for higher-density nanostructures and has a greatly reduced assembly time.
In this paper, a novel diagnosis method is proposed. It uses machine learning techniques instead of traditional cause-effect and/or effect-cause analysis, which gives it several advantages over traditional diagnosis methods, especially for volume diagnosis. In the proposed method, since the time-consuming diagnosis process is reduced to merely evaluating several decision functions, the run-time complexity is much lower than that of traditional diagnosis methods. The proposed technique can provide not only high-resolution diagnosis but also statistical data, by classifying defective chips according to the locations of their defects. Even with highly compressed output responses, the proposed diagnosis technique can correctly locate defects for most defective chips: it correctly located defects for more than 90% (86%) of defective chips at 50x (100x) output compaction. The run time for diagnosing a single simulated defective chip was only tens of milliseconds.
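The classification view of diagnosis can be sketched loosely as follows. The fault dictionary, the signature format, and the 1-nearest-neighbour decision rule are illustrative assumptions; the paper's actual learners and features are not described here:

```python
# Loose sketch (assumed setup): simulated signatures of candidate
# defect locations form the training set, and an observed (possibly
# compacted) response is assigned to the nearest one.

# Hypothetical fault dictionary: defect location -> compacted signature.
dictionary = {
    "net_a": (1, 0, 0, 1),
    "net_b": (0, 1, 1, 0),
    "net_c": (1, 1, 0, 0),
}

def diagnose(observed):
    """1-nearest-neighbour by Hamming distance over signatures."""
    def dist(sig):
        return sum(o != s for o, s in zip(observed, sig))
    return min(dictionary, key=lambda loc: dist(dictionary[loc]))

assert diagnose((1, 0, 0, 1)) == "net_a"   # exact signature match
assert diagnose((1, 0, 1, 1)) == "net_a"   # tolerates one flipped bit
```

Evaluating such decision functions is cheap at diagnosis time, which is the source of the run-time advantage the abstract claims over simulation-heavy cause-effect analysis.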
In deep submicron designs of MultiProcessor Systems-on-Chip (MPSoC) architectures, uncompensated within-die process variations and aging effects will lead to increasing uncertainty and unbalancing of expected core lifetimes. In this paper we present an adaptive workload allocation strategy for run-time compensation of variation- and aging-induced unbalanced core lifetimes by means of core activity duty cycling. The proposed technique regulates the percentage of idle time on short-expected-life cores to meet the platform lifetime target with minimum performance degradation. Experiments have been conducted on a multiprocessor simulator of a next-generation industrial MPSoC platform for multimedia applications, made of a general-purpose processor and programmable accelerators.
Panelists: A. Jantsch, P. Urard, F. Schirrmeister, P. Mosterman, L. Le-Toumelin and C. Engblom
This paper proposes a novel Process Variation Aware SRAM architecture designed to inherently support voltage scaling. The peripheral circuitry of the SRAM is modified to selectively allow overdriving a wordline which contains weak cell(s). This architecture allows reducing the power on the entire array; however it selectively trades power for correctness when rows containing weak cells are accessed. The cell sizing is designed to assure successful read operations. This avoids flipping the content of the cells when the wordline is overdriven. Our simulations report 23% to 30% improvement in cell access time and 31% to 51% improvement in cell write time in overdriven wordlines. Total area overhead is negligible (4%). Low voltage operation achieves more than 40% reduction in dynamic power consumption and approximately 50% reduction in leakage power consumption.
This paper presents a six-transistor (6T) single-ended static random access memory (SE-SRAM) bitcell with an isolated read-port, suitable for low-VDD and low-power embedded applications. The proposed bitcell has a better static noise margin (SNM) and write-ability than a standard 6T bitcell, equivalent to those of an 8T bitcell. An 8Kbit SRAM module with the proposed and standard 6T bitcells is simulated, including full-blown parasitics, using the BPTM 65nm CMOS technology node to evaluate and compare different performance parameters. The active power dissipation in the proposed 6T design is 28% and 25% lower than that of standard 6T and 8T SRAM modules, respectively.
Convergence of communication, consumer applications, and computing within mobile systems pushes memory requirements in terms of size, bandwidth, and power consumption. The existing solution for the memory bottleneck is to increase the amount of on-chip memory. However, this solution is becoming prohibitively expensive, allowing 3D stacked DRAM to become an interesting alternative for mobile applications. In this paper, we examine the power/performance benefits for three different 3D stacked DRAM scenarios. Our high-level memory and Through Silicon Via (TSV) models have been calibrated on state-of-the-art industrial processes. We model the integration of a logic die with TSVs on top of both an existing DRAM and a DRAM with redesigned transceivers for 3D. Finally, we take advantage of the interconnect density enabled by 3D technology to analyze an ultra-wide memory interface. Experimental results confirm that TSV-based 3D integration is a promising technology option for future mobile applications, and that its full potential can be unleashed by jointly optimizing memory architecture and interface logic.
This paper presents a DRAM architecture that improves the DRAM performance/power trade-off, increasing its usability in low-power chip design using 3D interconnect technology. The use of a finer matrix subdivision and the buffering of the bitline signal at the local-block level reduce both the energy per access and the access time. The obtained performance matches that of a typical low-power SRAM, while achieving a significant area and static power reduction compared to such memories. The 128 kb memory architecture proposed here achieves an access time of 1.3 ns at a dynamic energy of less than 0.2 pJ per bit. A localized refresh mechanism gains a factor of 10 in the static power consumption associated with the cell, and a factor of 2 in area, when compared with an equivalent SRAM.
In video recording, ever-increasing demands on image resolution, frame rate, and quality require substantial memory bandwidth and energy. This paper presents and evaluates this potential memory load in future handheld multimedia devices. Based on the simulation results, multi-channel memories provide the capability for high bandwidth without excessive overhead in terms of energy consumption. Full-HDTV (1080p) video recording with H.264/AVC encoding at 30 frames per second (fps) is found to require 4.3 GB/s of memory bandwidth. According to the simulations, this requirement can be fulfilled with four 32-bit memory channels operating at 400 MHz and consuming 345 mW of power. As another example, a 400 MHz 8-channel memory configuration is able to provide the required bandwidth for video recording at up to 3840x2160@30 fps. Die stacking is considered the technology able to provide the required bandwidth, sufficiently low power consumption, and the multi-channel memory organization.
H.264/AVC (Advanced Video Codec) is a video coding standard developed by a joint effort of the ITU-T VCEG and ISO/IEC MPEG. This standard provides higher coding efficiency than former standards at the expense of higher computational requirements. Implementing the H.264 video encoder on an embedded System-on-Chip (SoC) is a big challenge. For an efficient implementation, we motivate the use of multiprocessor platforms for the execution of a parallel model of the encoder. In this paper, we propose a high-level, target-architecture-independent parallelization methodology for the development of an optimized parallel model of an H.264/AVC encoder (i.e., a process network model balanced in communication and computation workload).
Dynamic Thermal Management techniques have been widely accepted as a thermal solution for their low cost and simplicity. These techniques have been used to manage heat dissipation and operating temperature to avoid thermal emergencies, but they are not aware of application behavior in Chip Multiprocessors (CMPs). In this paper, we propose a temperature-aware scheduler based on applications' thermal behavior groups, classified by a K-means clustering method, in multicore systems. An application's thermal behavior group shares a similar thermal pattern as well as thermal parameters. With these thermal behavior groups, we provide thermal balance among cores with negligible performance overhead. We implement and evaluate our schemes in 4-core (Intel Quad Core Q6600) and 8-core (two Quad Core Intel XEON E5310 processors) systems running several benchmarks. The experimental results show that the temperature-aware scheduler based on thermal behavior grouping reduces the peak temperature by up to 8 °C and 5 °C in our 4-core and 8-core systems, with only 12% and 7.52% performance overhead, respectively, compared to the standard Linux scheduler.
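The scheduler above groups applications by their thermal behavior using K-means clustering. A minimal sketch of such a grouping is shown below; the feature choice, data values, and deterministic initialization are illustrative assumptions, not the paper's actual setup.

```python
def kmeans(points, k, iters=20):
    """Minimal K-means over feature vectors such as [avg_temp, peak_temp].
    Deterministic initialization (first k points) for reproducibility."""
    centroids = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Illustrative per-application features: [average temp, peak temp] in deg C.
apps = [[55, 62], [56, 64], [78, 90], [80, 88], [57, 63], [79, 91]]
centroids, clusters = kmeans(apps, k=2)
# Cooler and hotter applications end up in separate thermal behavior groups,
# which the scheduler can then balance across cores.
```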
The sustained push for performance, transistor count, and instruction-level parallelism has reached a point where chip-level power density issues are at the forefront of design constraints. Many high-performance computing platforms integrate several homogeneous or heterogeneous processing cores on the same die to fit small form factors. Given the design limitations of using expensive cooling solutions, such complex chip multiprocessors require an architectural solution to mitigate thermal problems. Many current systems deploy Dynamic Voltage and Frequency Scaling (DVFS) to address thermal emergencies, either within the operating system or in hardware. These techniques have certain limitations in terms of response lag, scalability, cost, and their reactive nature. In this paper, we present an alternative thermal management system to address these limitations, based on a hardware/software co-design architecture. The results show that in the 65nm technology, a predictive, targeted, and localized response to thermal events improves quad-core performance by an average of 50% over conventional chip-level DVFS.
Processors that deploy fine-grained reconfigurable fabrics to implement application-specific accelerators on demand have received significant attention within the last decade. They trade off the flexibility of general-purpose processors against the performance of application-specific circuits, without tailoring the processor towards a specific application domain like Application Specific Instruction Set Processors (ASIPs). A vast number of reconfigurable processors have been proposed, differing in multifarious architectural decisions. However, it has remained an open question which of the proposed concepts is more efficient in certain application and/or parameter scenarios. Various reconfigurable processors have been investigated in certain scenarios, but never before has a systematic design space exploration across diverse reconfigurable processor concepts been conducted with the aim of aiding the designer of a reconfigurable processor. We have developed a first-of-its-kind comprehensive design space exploration tool that systematically explores diverse reconfigurable processors and architectural parameters. Our tool enables the first cross-architectural design space exploration of multiple fine-grained reconfigurable processors on a fair, comparable basis. After categorizing fine-grained reconfigurable processors and their relevant parameters, we present our tool and an in-depth analysis of reconfigurable processors in different relevant scenarios.
The inherent reconfigurability of FPGAs enables us to optimize an FPGA implementation over different time intervals by generating new optimized FPGA configurations and reconfiguring the FPGA at the interval boundaries. With conventional methods, generating a configuration at run-time requires an unacceptable amount of resources. In this paper, we describe a tool flow that can automatically map a large set of applications to a self-reconfiguring platform, without an excessive need for resources at run-time. The self-reconfiguring platform is implemented on a Xilinx Virtex-II Pro FPGA and uses the FPGA's PowerPC as configuration manager. This configuration manager generates optimized configurations on the fly and writes them to the configuration memory using the ICAP. We successfully used our approach to implement an adaptive 32-tap FIR filter on a Xilinx XUP board. This resulted in a 40% reduction in FPGA resources compared to a conventional implementation, with a manageable reconfiguration overhead.
Dynamic Partial Reconfiguration (DPR) is a promising technology ready for use, enabling the design of more flexible and efficient systems. However, existing design flows for DPR are either low-level and complex or lack support for automatic synthesis. In this paper, we present a SystemC based modelling and synthesis flow using the OSSS+R framework for reconfigurable systems. Our approach addresses reconfiguration already on application level enabling early exploration and analysis of the effects of DPR. Moreover it also allows quick implementation of such systems using our automatic synthesis flow. We demonstrate our approach using an educational example.
In partially reconfigurable architectures, system components can be dynamically loaded and unloaded, allowing resources to be shared over time. This paper focuses on the relation between the design options of partial reconfiguration modules and their placement at run-time. For a set of dynamic system components, we propose a design method that optimizes the feasible positions of the resulting partial reconfiguration modules to minimize position overlaps. We introduce the concept of subregions, which guarantees the parallel execution of a certain number of partial reconfiguration modules for tiled reconfigurable systems. Experimental results, which are based on a Xilinx Virtex-4 implementation, show that at run-time the average number of available positions can be increased by up to 6.4 times and the number of placement violations can be reduced by up to 60.6%.
Since pre-silicon functional verification is insufficient to detect all design errors, re-spins are often needed due to malfunctions that escape into the silicon. This paper presents an automated software solution to analyze the data collected during silicon debug. The proposed methodology analyzes the test sequences to detect suspects in both the spatial and the temporal domains. A set of software debug techniques is proposed to analyze the data acquired from hardware testing and provide suggestions for the setup of the test environment in the next debug session. A comprehensive set of experiments demonstrates its effectiveness in terms of run-time and resolution.
Improving diagnosis resolution is becoming very important in nanometer technologies, where defects increasingly occur at the gate and transistor levels. In this paper, we present a new method for the volume diagnosis of intra-gate defects affecting standard-cell Integrated Circuits (ICs). Our method can identify the cause of failure for different intra-gate defects such as bridge, open and resistive-open defects. It gives accurate results because it is based on physical information extracted from the library cell layouts, and it can also locate intra-gate defects in the presence of multiple faults. Experimental results show the efficiency of our approach in isolating injected defects in industrial designs.
We describe a preprocessing step to fault diagnosis of an observed response obtained from a faulty chip. In this step, a fault model for diagnosing the observed response is selected. This step allows fault diagnosis to be performed based on a single fault model after identifying the most appropriate one. We describe a specific implementation of this preprocessing step based on what is referred to as the unique output response of a fault model. As an example, we apply it to the diagnosis of multiple stuck-at faults, selecting between single and double stuck-at faults as the fault model for diagnosis. Experimental results demonstrate improvements compared to diagnosis based on single stuck-at faults, and compared to diagnosis based on both single and double stuck-at faults.
To reduce test data volume, encoded tests and compacted test responses are widely used in industry. The use of test response compaction negatively impacts fault diagnosis, since the errors captured in scan cells due to defects are not directly observed. We propose a simple and effective way to enhance the diagnostic resolution achievable with production tests at a minimal increase in pattern count. In this work we present experimental results for the case of multiple scan-chain faults to demonstrate the effectiveness of the proposed method.
We present an embedded software application for the real-time estimation of building occupancy using a network of video cameras. We analyze a series of alternative decompositions of the main application tasks and profile each of them by running the corresponding embedded software on three different processors. Based on the profiling measures, we build various alternative embedded platforms by combining different embedded processors, memory modules and network interfaces. In particular, we consider the choice of two possible network technologies: ARCnet and Ethernet. After deriving an analytical model of the network costs, we use it to complete an exploration of the design space as we scale the number of video cameras in a hypothetical building. We compare our results with those obtained for two real buildings with different characteristics. We conclude by discussing the results of our case study in the broader context of other camera-network applications.
More and more processor manufacturers have launched embedded multicore processors for consumer electronics products, because such processors provide the high performance and low power consumption required by mobile computing and multimedia applications. To effectively utilize the computing power of multicore processors, software designers are interested in using concurrent processing for such architectures. The master-slave model is one of the popular programming models for concurrent processing. Even though it is a simple model, potential concurrency faults and unreliable slave systems can still lead to anomalies in the entire system. In this paper, we present an adaptive testing tool called pTest to stress-test a slave system and to detect the synchronization anomalies of concurrent software in master-slave systems on embedded multicore processors. We use a probabilistic finite-state automaton (PFA) to model the test patterns for stress testing and show how a PFA can be applied in pTest in practice.
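A probabilistic finite-state automaton of the kind pTest employs can be sketched as a random walk that emits test commands; the states, command symbols, and probabilities below are invented for illustration and are not the paper's actual model.

```python
import random

# Illustrative PFA: states model master-slave interaction phases, symbols
# are commands sent to the slave, and probabilities bias the walk toward
# stress-prone command sequences. All names and values are assumptions.
PFA = {
    "idle": [("send", "busy", 0.7), ("poll", "idle", 0.3)],
    "busy": [("send", "busy", 0.5), ("ack", "idle", 0.4), ("abort", "idle", 0.1)],
}

def generate_pattern(pfa, start, length, seed=0):
    """Random walk over the PFA: at each state, pick an outgoing edge with
    its probability and emit the edge's symbol as the next test command."""
    rng = random.Random(seed)
    state, pattern = start, []
    for _ in range(length):
        edges = pfa[state]
        r, acc = rng.random(), 0.0
        for symbol, nxt, p in edges:
            acc += p
            if r <= acc:
                pattern.append(symbol)
                state = nxt
                break
        else:                        # guard against floating-point round-off
            pattern.append(edges[-1][0])
            state = edges[-1][1]
    return pattern

pattern = generate_pattern(PFA, "idle", 10)
```

Seeding the generator makes a stress run reproducible, which matters when a detected anomaly must be replayed.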
This paper presents a software estimation methodology to enable the design space exploration of heterogeneous multiprocessor systems. Starting from a fork-join representation of the application specification, along with a high-level description of the target multiprocessor architecture and a mapping of application components onto architecture resource elements, it estimates the performance of the application on the target architecture. The proposed methodology includes the effect of basic compiler optimizations and integrates lightweight memory simulation and instruction mapping for complex instructions to improve estimation accuracy. To estimate the performance degradation due to contention for shared resources such as memory and buses, synthetic access traces coupled with an interval analysis technique are employed. The methodology has been validated on a real heterogeneous platform; the results show that the estimation can predict performance with average errors of around 11%.
The extreme heterogeneity of networked embedded platforms makes both the design and the reuse of applications hard, and decreases portability. A middleware is the software layer that abstracts the actual characteristics of each embedded platform. Using a middleware eases application design, but programming for different middlewares is still a barrier to portability. This paper presents a design methodology based on an abstract middleware environment that abstracts even the services provided. This is achieved by allowing the designer to move smoothly across different design paradigms. As a proof of concept, the paper shows how to mix and exchange applications between tuple-space and message-oriented middleware environments.
Exploding health-care demands and the costs of aging and stressed populations necessitate more in-home monitoring and personalized health care. Electronics hold great promise to improve the quality and reduce the cost of health care. The speakers in this hot-topic session will discuss the field of health-care electronics from all aspects. First, the market for health-care electronics is described, and realities, trends and hypes are pointed out. The second presentation describes the engineering challenges in ultra-low-power disposable electronics for wireless body sensor applications; the sensor aspects, the related signal processing, and business models will be discussed. The third presentation covers embedded bio-stimulation applications in cochlear implants, highlighting the design challenges in terms of power consumption and the extreme reliability of these devices. The final presentation discusses the application of brain stimulation and recording with respect to artifact reduction and field steering, and describes aspects of the modeling and design strategy. In this way, this hot-topic session offers attendees a complete picture of the field of health-care electronics, ranging from business to technological aspects.
Keywords: health care, medical electronics, implants, embedded SoC, wireless body sensor networks, neural stimulation, electrical field modeling, FEM.
In this paper, we propose a scalable and transparent thread-based parallelization scheme for multi-core processors. The performance achieved by our scheme scales with the number of cores, and the application program is not affected by the actual number of cores. For efficiency, we designed the threads so that they never suspend and do not start executing until the data they need are available. We implemented our design using three modules: the dependency controller, which tracks dependencies among threads; the thread pool, which manages the ready threads; and the thread dispatcher, which fetches threads from the pool and executes them on the cores. Our design and implementation provide efficient thread scheduling with low overhead; moreover, by hiding the actual number of cores, they achieve transparency. We confirmed the transparency and scalability of our scheme by applying it to an H.264 decoder program. With this scheme, no modification of the application program is necessary even if the number of cores changes due to disparate requirements, which shortens development time and reduces development cost.
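The interplay of the three modules can be sketched in a few lines of threaded Python; this is a minimal illustration under the paper's design rules (threads never suspend mid-task and start only when their inputs are ready), not the authors' implementation, and the diamond-shaped task graph is invented.

```python
import collections
import queue
import threading

def run_graph(tasks, deps, num_workers=3):
    """Run a task graph with worker threads. tasks: name -> callable;
    deps: name -> list of prerequisite names. The 'dependency controller'
    is the per-task counter of unmet prerequisites, the 'thread pool' is
    the ready queue, and the workers play the 'dispatcher' role."""
    remaining = {n: len(deps.get(n, [])) for n in tasks}
    dependents = collections.defaultdict(list)
    for n, ds in deps.items():
        for d in ds:
            dependents[d].append(n)
    ready = queue.Queue()
    for n, c in remaining.items():
        if c == 0:
            ready.put(n)           # tasks with no inputs are ready at once
    lock = threading.Lock()
    order, left = [], [len(tasks)]

    def worker():
        while True:
            n = ready.get()
            if n is None:          # shutdown sentinel
                return
            tasks[n]()             # a task never suspends once started
            with lock:
                order.append(n)
                left[0] -= 1
                finished = left[0] == 0
                for m in dependents[n]:
                    remaining[m] -= 1
                    if remaining[m] == 0:
                        ready.put(m)
            if finished:
                for _ in range(num_workers):
                    ready.put(None)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return order

# Illustrative diamond-shaped dependency graph: a -> {b, c} -> d.
order = run_graph({n: (lambda: None) for n in "abcd"},
                  {"b": ["a"], "c": ["a"], "d": ["b", "c"]})
```

Note that `num_workers` is the only place the core count appears, which is the transparency property the abstract claims: the task graph itself is unchanged when the number of cores changes.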
With the multiplication of mobile and wireless communication networks and standards, the physical layer of communication systems (i.e., the modem part of the system) has to be completely flexible. This assumption leads to the well-known Software Defined Radio (SDR) concept, which enables the implementation and deployment of different waveforms on the same platform. This concept has been widely investigated since the early 2000s, mainly for processor- and software-based approaches, but less so for reconfigurable hardware or DSP implementations. This paper deals with a specific architecture and an innovative design methodology developed within the framework of a fully flexible high-data-rate Software Defined Radio wireless modem. The approach focuses on the waveform part of the system, and its goal is to reach a fully flexible physical layer. In case of modem evolutions or upgrades, it avoids significant rework and extra cost in terms of waveform development. Moreover, the association of the right architecture with the right methodology makes it possible to master and manage the complexity of the modem (which offers several hundred configurations with different kinds of parameters) and provides the needed flexibility. The development methodology is based on a C/C++ approach that allows all the parameters to be managed at the system level. The architecture coupled with this development methodology offers a high level of modularity, which enables the waveform to be modified easily, simply by replacing blocks with other blocks. The efficiency and flexibility of the modem are then obtained by designing not a single waveform but a family of waveforms.
Keywords: physical layer, high-data-rate modem, mobile and wireless communication, Software Defined Radio (SDR), flexibility, modularity, architecture, high-level synthesis, FPGA.
The entry-level mobile phone market is a mass-volume segment where the modem and application technologies are commoditized and fully proven. Nevertheless, the cost and power reduction targets continue to heavily drive leading-edge innovations. Semiconductor companies strive to integrate more and more Printed Circuit Board (PCB) components into one single chip, without discontinuing the technology-node shrink roadmap from 130nm down to 65nm. This duality between integration level and aggressive silicon feature-size reduction generates an innovative environment in which design engineers must create new methodologies to cope with complex cross-coupling mechanisms and additional power dissipation. This paper describes one aspect of the design methodology for reducing die and package crosstalk, and focuses on the package co-design flow. The chip considered is a 65nm single-chip System on Chip (SoC) including EDGE RF, Power Management Unit (PMU), Audio Front End (AFE) and FM Radio (FMR) circuits.
Keywords: eWLB; Chip&Package Co-Design; SoC integration; cross-coupling; aggressors; victims
Multicore SoCs integrate an increasing number of heterogeneous programmable units and sophisticated communication interconnects. Unlike classic computer design, SoC design includes building an application-specific architecture, a specific interconnect, and the other hardware components required to execute the software for a well-defined class of applications. In this case, the programming model hides both hardware and software interfaces, which may include sophisticated communication and synchronisation concepts to handle parallel programs running on the processors. This embedded tutorial introduces the key technologies for the design of such complex devices.
Packet-switched interconnect fabric is a promising on-chip communication solution for many-core architectures. It offers high throughput and excellent scalability for on-chip data and protocol transactions. The main problem posed by this communication fabric is the potentially high and nondeterministic network latency caused by router data buffering and resource arbitration. This paper describes a new method to minimize on-chip network latency, motivated by the observation that only a small percentage of on-chip data and protocol traffic is latency-critical. Existing work focusing on minimizing average network latency is thus suboptimal: such techniques expend most of their design, area, and power overhead accelerating latency-noncritical traffic, for which there is no corresponding application-level speedup. We propose run-time techniques that identify latency-critical traffic by leveraging network data-transaction and protocol information. Latency-critical traffic is permitted to bypass router pipeline stages and latency-noncritical traffic. These techniques are evaluated via a router design that has been implemented using TSMC 65nm technology. Detailed network latency simulation and hardware characterization demonstrate that, for latency-critical traffic, the proposed solution closely approximates the ideal interconnect even under heavy load, while preserving throughput for both latency-critical and noncritical traffic.
Data-intensive functions on chip (e.g., codecs, 3D graphics, and pixel processing) need to make the best use of the increased bandwidth of the multiple memories enabled by 3D die stacking, by accessing multiple memories in parallel. Parallel memory accesses with originally in-order requirements necessitate reorder buffers to avoid deadlock. Reorder buffers are expensive in terms of area and power consumption and, in their conventional form, suffer from low resource utilization. In this work, we present a novel idea, called the in-network reorder buffer, to increase the utilization of reorder buffer resources. In our method, we move the reorder buffer resources and related functions from the network entry/exit points to the network routers. The in-network reorder buffers can then be better utilized in two ways: first, they can be used by packets without in-order requirements when no in-order packets are present; second, in-order packets themselves benefit by enjoying a larger share of reorder buffers than before. This increase in reorder buffer utilization improves NoC performance while still supporting the original in-order requirements. Experimental results with an industrial-strength DTV SoC example show that the presented idea improves the total execution cycles by 16.9%.
Nowadays, in MPSoCs and NoCs, multicast communication is heavily used by many parallel applications, such as cache coherency in distributed shared-memory architectures, clock synchronization, replication, and barrier synchronization. Among the multicast schemes proposed for on-chip interconnection networks, the path-based multicast scheme has been proven more efficient than the tree-based and unicast-based ones. In this paper, a low-distance path-based multicast scheme is proposed. The proposed method takes advantage of network partitioning and of an efficient destination-ordering algorithm. The results in performance and power consumption show that the proposed method outperforms previous on-chip path-based multicasting algorithms.
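To see why destination ordering matters in path-based multicast, even a simple greedy nearest-neighbor ordering on a 2D mesh (Manhattan distance) shortens the delivery path considerably; this sketch is purely illustrative and is not the partitioning or ordering algorithm proposed in the paper.

```python
def manhattan(a, b):
    """Hop distance between two (x, y) mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def order_destinations(source, dests):
    """Greedy nearest-neighbor ordering of multicast destinations:
    always visit the closest remaining destination next."""
    order, cur, left = [], source, list(dests)
    while left:
        nxt = min(left, key=lambda d: manhattan(cur, d))
        left.remove(nxt)
        order.append(nxt)
        cur = nxt
    return order

def path_length(source, order):
    """Total hops of a path visiting the destinations in the given order."""
    total, cur = 0, source
    for d in order:
        total += manhattan(cur, d)
        cur = d
    return total

# Invented example on a mesh: ordering cuts the path from 16 hops to 6.
src, dests = (0, 0), [(3, 3), (0, 1), (2, 2), (0, 2)]
ordered = order_destinations(src, dests)
```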
In this paper we introduce Priority-Based Forced Requeue to decrease worst-case latencies in NoCs offering best-effort services. Forced Requeue prematurely lifts low-priority packets out of the network and requeues them outside it using priority queues. The first benefit of this approach, applicable to any NoC offering best-effort services, is that packets that have not yet entered the network now compete with packets inside the network, and hence tighter bounds on admission times can be given. The second benefit, more specific to deflective routing as in the Nostrum NoC, is that packet "reshuffling" dramatically reduces the latency inside the network for bursty traffic, due to a lowered risk of collisions at the exit of the network. This paper studies Forced Requeuing on a mesh with varying burst sizes and traffic scenarios. The experimental results show a 50% reduction in worst-case latency from a system perspective, thanks to a reshaped latency distribution, whilst keeping the average latency the same.
FlexRay is an automotive standard for high-speed and reliable communication that is being widely deployed in next-generation cars. The protocol has powerful error-detection mechanisms, but its error-management scheme forces a corrupted frame to be dropped without any notification to the transmitter. In this paper, we analyze the feasibility of, and propose an optimization approach for, an application-level acknowledgement and retransmission scheme for which transmission time is allocated on top of an existing schedule. We formulate the problem as a Mixed Integer Linear Program. The optimization comprises two stages: the first optimizes a fault tolerance metric; the second improves scheduling by minimizing the latencies of the acknowledgement and retransmission messages. We demonstrate the effectiveness of our approach on a case study based on an experimental vehicle designed at General Motors.
Distributed systems, especially time-triggered ones, implement clock synchronization algorithms to provide and maintain a common view of time among the different nodes. Such architectures rely heavily on the nodes' local oscillators remaining within given accuracy bounds. However, measuring the oscillator frequencies (e.g., for maintenance or diagnosis) is usually difficult, since it requires physical access to each single node and may interfere with the running application; moreover, clock synchronization tends to mask clock deviations. In this work, we propose a non-intrusive method for the remote measurement of the individual oscillator drifts within a distributed system. Our approach is based on a tester that sends carefully aligned messages to stimulate the clock synchronization service and records the resulting bus traffic for an analysis of the nodes' synchronization behavior. This tester needs access to the communication bus only. We focus our work on FlexRay and validate our approach by experiments.
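The core of such a drift analysis can be illustrated by fitting a node's observed message arrival times against their nominal schedule; the sketch below is a simplified stand-in for the paper's method (which additionally stimulates the synchronization service with aligned frames), and the cycle length and drift value are invented.

```python
def estimate_drift_ppm(nominal_period, arrival_times):
    """Least-squares slope of arrival time vs. message index, compared to
    the nominal period; returns the node's oscillator drift in ppm."""
    n = len(arrival_times)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(arrival_times) / n
    slope = (sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(xs, arrival_times))
             / sum((x - mean_x) ** 2 for x in xs))
    return (slope / nominal_period - 1.0) * 1e6

# A node whose nominal 5 ms communication cycle actually runs 50 ppm fast
# (i.e., its period is slightly shorter than nominal):
times = [i * 0.005 * (1 - 50e-6) for i in range(100)]
drift = estimate_drift_ppm(0.005, times)   # close to -50 ppm
```

Only bus-level timestamps are needed, which mirrors the paper's claim that the tester requires access to the communication bus alone.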
As the number of electronic components in automobiles steadily increases, the demand for higher communication bandwidth also rises dramatically. Instead of installing new wiring harnesses and new bus structures, it would be useful if already available structures could be used, but driven at higher data rates. In this paper, we a) propose an extension of the well-known Controller Area Network (CAN), called CAN+, with which the target rate of 1 Mbit/s can be increased by up to 16 times. Moreover, b) existing CAN hardware and devices not designed for these boosted data rates can still be used without interference with communication. The major idea is a change of the protocol: in particular, we exploit the fact that data can be sent in time slots where CAN-conformant nodes do not listen. Finally, c) an implementation of this overclocking scheme on an FPGA is provided to prove the feasibility and the impressive throughput gains.
This paper presents an innovative and effective method to improve the performance of a micromechanical gyroscope by damping its sensing quality factor. The sensing quality factor is a key parameter of the micromechanical gyroscope's dynamics; in particular, a high sensing quality factor means a long settling time, high response overshoot and high sensitivity to external disturbances (shocks and vibrations) that are typical of the harsh automotive environment. For this reason, micromechanical gyroscopes employed in the automotive environment need high shock and vibration immunity. This paper proposes a solution that reaches this goal by adding a "virtual damping" to the system with an electrostatic feedback technique. The approach has been applied to a real automotive yaw-gyro system, and simulations performed in the Simulink™ environment show an appreciable reduction in output overshoot, with the benefit of higher vibration immunity, once the feedback technique is implemented.
Keywords: micromechanical gyroscope; electrostatic feedback technique; shock immunity enhancement
Several European research projects in the vehicular area address the enhancement of vehicular safety. Within the frame of the Caring Cars project, an on-board car-gateway embedded architecture for safety and wellness applications has been designed. This paper puts forward the essentials of this modular, dynamic and robust architecture and defines in detail the advanced emergency call (eCall+), one of the most innovative applications in the project. By means of eCall+, the emergency services will always be able to track the affected vehicle and monitor the state of the car. The driver may also contact them through videoconference in a critical situation. Thus, the system can either prevent an accident or help the vehicle occupants and the emergency services to save the occupants' lives after an accident.
Keywords: eCall, eCall+, emergency, safety, automotive, localization, services
This paper explores the use of SAT-Modulo Theory to determine bit-widths for finite-precision implementations of numerical calculations, specifically in the context of scientific computing, where division occurs frequently. Employing SAT-Modulo Theory leads to more accurate bound estimation than other analytical methods provide, in turn yielding smaller bit-widths.
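As a concrete illustration of why SMT-style reasoning can beat analytical bounds: interval arithmetic loses the correlation between a numerator and a denominator that share inputs, while an exact decision procedure does not. The sketch below contrasts naive interval bounds with the exact reachable range of a small integer expression; exhaustive search stands in for the SMT query, and the expression and input ranges are invented for illustration.

```python
def interval_bounds(a_rng, b_rng):
    """Naive interval bounds for (a + b) // (b + 1): with a positive
    denominator interval, the quotient bound is
    [num_lo // den_hi, num_hi // den_lo]. The shared variable b in the
    numerator and denominator is treated as independent, so the bound
    over-approximates."""
    num = (a_rng[0] + b_rng[0], a_rng[1] + b_rng[1])
    den = (b_rng[0] + 1, b_rng[1] + 1)
    return num[0] // den[1], num[1] // den[0]

def exact_bounds(a_rng, b_rng):
    """Exact reachable range of the same expression, found by exhaustive
    search over the inputs (playing the role of the SMT bound query)."""
    vals = [(a + b) // (b + 1)
            for a in range(a_rng[0], a_rng[1] + 1)
            for b in range(b_rng[0], b_rng[1] + 1)]
    return min(vals), max(vals)

rng = (0, 7)
iv, ex = interval_bounds(rng, rng), exact_bounds(rng, rng)
# iv == (0, 14) needs 4 magnitude bits; ex == (0, 7) needs only 3.
```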
An inherent performance gap between custom designs and ASICs is one of the reasons why many designers still start their designs from a register-transfer-level (RTL) description rather than from a behavioral description, which can be synthesized to RTL via high-level synthesis (HLS). Sequencing overhead is one factor in this performance gap: the choice between latch and flip-flop is typically not taken into account during HLS, even though it affects all HLS steps. HLS-l is a new design framework that employs high-performance latches during scheduling, allocation, and controller synthesis. Its main feature is a new scheduler based on the concept of a phase step (as opposed to a conventional control step), which allows scheduling at a finer granularity, register allocation that resolves the conflict of a latch being read and written at the same time, and controller synthesis that exploits dual-edge-triggered storage elements to support phase-step-based scheduling. In experiments on benchmark designs implemented in a 1.2 V, 65-nm CMOS technology, HLS-l reduced latency by 16.6% on average, with 9.5% less circuit area, compared to designs produced by conventional HLS.
This paper presents a technique to perform arbitrary fixed permutations on streaming data. We describe a parameterized architecture that takes as input n data points streamed at a rate of w per cycle, performs a permutation over all n points, and outputs the result in the same streaming format. We describe the system and its requirements mathematically and use this mathematical description to show that the datapaths resulting from our technique can sustain a full throughput of w words per cycle without stalling. Additionally, we provide an algorithm to configure the datapath for a given permutation and streaming width. Using this technique, we have constructed a full synthesis system that takes as input a permutation and a streaming width and outputs a register-transfer level Verilog description of the datapath. We present an evaluation of our generated designs over varying problem sizes and streaming widths, synthesized for a Xilinx Virtex-5 FPGA.
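As a rough functional illustration of the streaming interface described above (a reference model only, not the paper's stall-free hardware datapath), a frame of n points arriving w words per cycle can be buffered, permuted, and re-emitted w words per cycle; the bit-reversal permutation in the usage note is just an example choice:

```python
def stream_permute(samples, perm, w):
    """Functional model: permute each n-point frame of a stream that
    arrives w words per cycle (n = len(perm); w must divide n)."""
    n = len(perm)
    assert n % w == 0 and len(samples) % n == 0
    out = []
    # Collect one frame at a time, permute it, re-emit w words per cycle.
    for start in range(0, len(samples), n):
        frame = samples[start:start + n]
        permuted = [frame[perm[i]] for i in range(n)]
        for c in range(0, n, w):
            out.append(permuted[c:c + w])
    return out
```

For instance, with the 8-point bit-reversal permutation [0, 4, 2, 6, 1, 5, 3, 7] and w = 2, the frame 0..7 streams out as the cycles [0, 4], [2, 6], [1, 5], [3, 7].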
Although Triple Modular Redundancy (TMR) has been widely used to mitigate single event upsets (SEUs) in SRAM-based FPGAs, SEU-caused bridging faults between the TMR modules can still break the correctness of a TMR design under SEU. In this paper, we present a novel approximation algorithm for resource binding on scheduled datapaths in the presence of TMR, which aims at containing each SEU within a single replica of the tripled operations. The key challenges are to avoid resource sharing between modular redundant operations and to reduce the possibility of TMR masking breaches in resource allocation. We introduce the notion of a vulnerability gap during resource sharing to reduce the effort for white-space allocation at the physical design stage needed to avoid bridging faults between TMR resources. The experimental results show that our proposed resource binding algorithm, followed by a floorplanner, reduces the potential for TMR breaches by 20% on average.
Keywords: Triple modular redundancy; single event upset; high level design; FPGA
Multi-detect (N-detect) testing suffers from the drawback that its test length grows linearly with N. We present a new method to generate compact test sets that provide high defect coverage. The proposed technique makes judicious use of a new pattern-quality metric based on the concept of output deviations. We select the most effective patterns from a large N-detect pattern repository, and guarantee a small test set as well as complete stuck-at coverage. Simulation results for benchmark circuits show that with a compact, 1-detect stuck-at test set, the proposed method provides considerably higher transition-fault coverage and coverage ramp-up than another recently published method. Moreover, in all cases, the proposed method either outperforms or is as effective as the competing approach in terms of bridging-fault coverage and the surrogate BCE+ metric. In many cases, higher transition-fault coverage is obtained than with much larger N-detect test sets for several values of N. Finally, our results provide the insight that, instead of using N-detect testing with as large an N as possible, it is more efficient to combine the output-deviations metric with multi-detect testing to obtain high-quality, compact test sets.
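The selection step described above can be sketched as a greedy cover over the pattern repository. The fault sets, deviation scores, and tie-breaking rule below are illustrative assumptions, not the paper's exact output-deviation procedure:

```python
def select_patterns(repository, all_faults):
    """Greedy selection: repository maps pattern -> (detected fault set,
    deviation score). Picks patterns that cover many new stuck-at faults,
    breaking ties with the (hypothetical) output-deviation score, until
    complete coverage of all_faults is guaranteed."""
    undetected = set(all_faults)
    chosen = []
    while undetected:
        # Prefer patterns detecting many new faults; tie-break on quality.
        best = max(repository,
                   key=lambda p: (len(repository[p][0] & undetected),
                                  repository[p][1]))
        faults, _score = repository[best]
        if not (faults & undetected):
            break  # remaining faults are undetectable by this repository
        chosen.append(best)
        undetected -= faults
    return chosen, undetected
```

With three candidate patterns covering faults {a,b}, {b,c}, and {c}, the sketch first picks the high-deviation pattern among the two that cover two new faults, then fills the gap with the best remaining pattern.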
This paper presents a scalable method to generate close to minimal size test pattern sets for stuck-at faults in scan based circuits. The method creates sets of potentially compatible faults based on necessary assignments. It guides the justification and propagation decisions to create patterns that will accommodate most targeted faults. The technique presented achieves close to minimal test pattern sets for ISCAS circuits. For industrial circuits it achieves much smaller test pattern sets than other methods in designs sensitive to decision order used in ATPG.
In this paper, we present an X-fill method (QC-Fill) that not only slashes test time but also reduces test power (including both capture power and shift power). QC-Fill, built upon the existing multicasting scan architecture, can coexist with most low-capture-power (LCP) X-fill methods through a multicasting-driven X-fill method incorporating a clique-stripping scheme. QC-Fill is independent of the ATPG patterns and does not require any area overhead, since it can operate directly on an existing scan architecture incorporating test compression.
Index Terms - Scan Test, Multicasting, Test Compression, Low-Power Scan, Low-capture-power X-fill
This paper presents a tool for exploring different parallelization options for an application. It can be used to quickly find a high-quality match between an application and a multi-processor platform architecture. By specifying the parallelization at a high abstraction level, and leaving the actual source code transformations to the tool, a designer can try out many parallelizations in a short time. A parallelization may use either functional or data-level splits, or a combination of both. An accompanying high-level simulator provides rapid feedback about the expected performance of a parallelization, based on platform parameters and profiling data of the sequential application on the target processor. The use of the tool and simulator are demonstrated on an MPEG-4 video encoder application and two different platform architectures.
This paper reports on experience gained and lessons learned from an intensive investigation of model-driven engineering (MDE) methodology and technology for application to high-integrity systems. A favourable experimental context was provided by ASSERT, a 40-month project partly funded by the EC as part of the 6th Framework Programme. The goodness of fit of the MDE paradigm for the industrial domain of interest was critically assessed on a small number of candidate solutions. One of the main axes of investigation concerned HRT-UML/RCM, an advanced method and integrated tool for the model-driven development of embedded real-time software systems. HRT-UML/RCM leveraged version 2 of the OMG UML standard extensively and combined it with the development of a domain-specific metamodel in the quest to attain correctness-by-construction from the ground up. The prototype tool developed in the project supported: (1) the separation of functional (sequential) design from the specification of real-time and concurrency requirements and properties to be preserved at run time; and (2) the exploitation of a fully generative approach to development, equipped with support for model-based feasibility analysis and round-trip engineering.
Designing reconfigurable yet critical embedded and complex systems (i.e., systems composed of different subsystems) requires making these systems adaptable while guaranteeing that they operate with respect to predefined safety properties. When it comes to complex systems, component-based software engineering methods provide solutions to master this complexity ("divide and conquer"). In addition, architecture description languages provide solutions to design and analyze critical and reconfigurable embedded systems. In this paper we propose a methodology that combines the benefits of these two approaches by leaning on both the AADL and Lightweight CCM standards. This methodology is materialized through a complete design process and an associated framework, MyCCM-HI, dedicated to designing reconfigurable, critical, and complex embedded systems.
AADL is an Architecture Description Language for embedded real-time systems. The behavior annex is an extension of the dispatch mechanism of the AADL execution model. This paper proposes a formal semantics for the AADL behavior annex using the Timed Abstract State Machine (TASM). First, the semantics of the AADL default execution model is given; then we formally define the semantics of some aspects of the behavior annex. A prototype for real-time behavior modeling and verification is proposed, and finally a case study is given to validate the approach.
Keywords- AADL; behavior annex; execution model; TASM
Due to higher integration and increasing frequency-based effects, full electromagnetic (EM) models are needed for accurate prediction of the real behavior of integrated passives and interconnects. Furthermore, these structures are subject to parametric effects due to small variations of the geometric and physical properties of the inherent materials and manufacturing process. Accuracy requirements lead to huge models, which are expensive to simulate, and this cost increases when parameters and their effects are taken into account. This paper presents a complete procedure for the efficient reduction of realistic, hierarchy-aware, EM-based parametric models. Knowledge of the structure of the problem is explicitly exploited using domain partitioning and novel electromagnetic connector modeling techniques to generate a hierarchical representation. This enables the efficient use of block parametric model order reduction techniques to generate block-wise compressed models that satisfy overall requirements and provide accurate approximations of the complete EM behaviour, which are cheap to evaluate and simulate.
This paper describes a waveform compression technique suitable for the efficient utilization, storage, and interchange of the emerging current source model (CSM) based cell libraries. The technique is based on pre-processing a collection of voltage/current waveforms for the cells in the library and then constructing an orthogonal time-voltage/time-current waveform basis using singular-value decomposition. Compression is achieved by representing all waveforms as linear combination coefficients of an adaptive subset of the basis waveforms. Experimental results indicate that this adaptive waveform representation yields higher compression ratios than representation as a function of a fixed set of basis functions. Interpolation and further compression are obtained by representing the coefficients as simple functions of various parameters, e.g., input slew, load capacitance, supply voltage, and temperature. The methods introduced in this paper are tested and validated on several industrial-strength libraries, with spectacular compression results.
Keywords- Current Source Model; Adaptive Data Compression; Parameterization; Principal Component; Pre-processing
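A minimal sketch of the basis-plus-coefficients idea above, with Gram-Schmidt orthonormalization standing in for the singular-value decomposition used in the paper (to keep the example dependency-free); waveforms are plain sample lists:

```python
def gram_schmidt(waveforms, tol=1e-9):
    """Build an orthonormal basis spanning the sampled waveforms
    (a stand-in for the SVD-derived basis in the paper)."""
    basis = []
    for v in waveforms:
        r = list(v)
        for b in basis:
            c = sum(x * y for x, y in zip(r, b))
            r = [x - c * y for x, y in zip(r, b)]
        norm = sum(x * x for x in r) ** 0.5
        if norm > tol:                       # skip linearly dependent waveforms
            basis.append([x / norm for x in r])
    return basis

def compress(waveform, basis):
    """Represent a waveform by its coefficients in the basis."""
    return [sum(x * y for x, y in zip(waveform, b)) for b in basis]

def reconstruct(coeffs, basis):
    """Rebuild the waveform from its linear-combination coefficients."""
    return [sum(c * b[i] for c, b in zip(coeffs, basis))
            for i in range(len(basis[0]))]
```

Any waveform in the span of the basis round-trips exactly through compress/reconstruct; storing only the coefficients (and a shared basis) is where the compression comes from.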
As clock frequencies exceed the gigahertz range, the extra power loss due to conductor surface roughness in interconnects and packages becomes more evident and thus demands proper accounting for accurate prediction of signal integrity and energy consumption. Existing techniques based on analytical approximation often suffer from a narrow valid range, i.e., the small- or large-roughness limit. In this paper, we propose a new simulation methodology for surface roughness loss that is applicable to general surface roughness and a wide frequency range. The method is based on 3D statistical modeling of surface roughness and the numerical solution of scalar wave modeling (SWM) with the method of moments (MoM). The spectral stochastic collocation method (SSCM) is applied in conjunction with random surface modeling to avoid time-consuming Monte Carlo (MC) simulation. Comparisons with existing methods in their respective valid regions verify the effectiveness of our approach.
This paper proposes an efficient decoupling capacitance (decap) optimization algorithm to reduce the voltage noise of on-chip power grid networks. The new method is based on an efficient charge formulation of the decap allocation problem but, unlike the existing work, it applies more accurate piecewise polynomial micromodels to estimate the voltage noise during the linear programming process. The resulting method overcomes the over-estimation problem that plagues the existing method. The proposed method has the best of both worlds: the efficiency of the charge-based methods and the accuracy of the sensitivity-based methods. Experimental results demonstrate that the proposed method leads to decap values similar to those of the sensitivity-based methods, which give the best reported results and are much better than the existing charge-based method, while at the same time enjoying efficiency similar to that of the charge-based method.
This contribution shows and discusses the requirements and constraints that an industrial engineering process defines for the integration of hardware IP into the system development flow. It describes the developed strategy for automating the step of making hardware descriptions available in a MATLAB/Simulink based system modeling and validation environment. It also explains the transformation technique on which that strategy is based. An application of the strategy is shown in terms of an industrial automotive electronic hardware IP block.
Model Based Design tools based around Simulink from The MathWorks are a popular technology for creating streaming DSP designs for FPGAs, since they offer the promise of rapid design exploration through immediate quantitative feedback on algorithm performance. Current tools typically use a library of components that reflect an explicit representation of the underlying FPGA device features. This is undesirable, since the designer is forced to mix implementation and architecture, and it leads to long design cycles and non-portable results. This paper shows that introducing high-level synthesis techniques allows more elegant design at a higher level of abstraction. This results in fewer components needed for a design, which translates into a faster design cycle, more portable designs, and fewer defects. Pushbutton clock frequencies of up to 500 MHz are achieved without detailed knowledge of FPGA architectures. Although the capabilities described are embodied in the DSP Builder tool from Altera, this paper describes the technology involved rather than the details of the tool. Four major technologies are described: a latency-insensitive system representation, the module-level internal representation with associated transformations, hardware retiming, and lastly a FIR filter design tool layered on top.
Keywords- Model Based Design; High Level Synthesis; FPGAs; Technology Mapping; Retiming; FIR Filter Design.
In modern digital ICs, the increasing demand for performance and throughput requires operating frequencies of hundreds of megahertz, in several cases exceeding the gigahertz range. Following the technology scaling trends, this demand will continue to rise, thus increasing the electromagnetic interference (EMI) generated by electronic systems. The enforcement of strict governmental regulations and international standards, mainly (but not only) in the automotive domain, is driving new efforts towards design solutions for electromagnetic compatibility (EMC). Hence, EMC/EMI is rapidly becoming a major concern for high-speed circuit and package designers. The on-chip power rail noise is one of the most detrimental sources of electromagnetic (EM) conducted emissions, since it propagates to the board through the power and ground I/O pads. In this work we investigate the impact of power rail noise on EMI, and we show that by limiting this noise source it is possible to drastically reduce the conducted emissions. Furthermore, we present a transistor-level lumped-element simulation model of the system power distribution network (PDN) that allows chip, package, and board designers to assess the power integrity and predict the conducted emissions at critical chip I/O pads. The experimental results obtained on an industrial microcontroller for automotive applications demonstrate the effectiveness of our approach.
The verification of embedded software has become an important subject over the last years. This work presents a new semiformal verification approach called SofTPaDS. It combines assertion-based and symbolic simulation approaches for the verification of embedded software with hardware dependencies. SofTPaDS proves more efficient than software model checkers at tracing deep state spaces and improves state coverage relative to a simulation-based verification tool. We have successfully applied our approach to industrial automotive embedded software.
The single stuck-at fault coverage is often used as a figure of merit for scan testing even with respect to other fault models such as transition faults, bridging faults, and crosstalk faults. This paper analyzes how far this assumption is justified. Since the scan test infrastructure allows reaching states not reachable in the application mode, and since faults only detectable in such unreachable states are not relevant in the application mode, we distinguish those irrelevant faults from relevant faults, i.e., faults detectable in the application mode. We prove that every combinational circuit with exactly 100% stuck-at fault coverage has 100% transition fault test coverage for those faults which are relevant in the application. This does not necessarily imply that combinational circuits with almost 100% single stuck-at coverage automatically have high transition fault coverage; this is shown by an extreme example of a circuit with nearly 100% stuck-at coverage but 0% transition fault coverage.
We present a hardware-based approach to improve the resilience of a computer system against errors occurring in main memory with the help of error detecting and correcting (EDAC) codes. Checksums are placed in the same type of memory locations and addressed in the same way as normal data. Consequently, the checksums are accessible from outside the main memory just like normal data, which enables implicit fault tolerance for interconnection and solid-state secondary storage sub-systems. A small hardware module manages the sequential retrieval of checksums each time the integrity of the data accessed by the processor sub-system needs to be verified. The proposed approach has the following properties: (a) it is cost efficient, since it can be used with simple storage and interconnection sub-systems that do not possess any inherent EDAC mechanism, (b) it allows on-line modification of the memory protection levels, and (c) no modification of the application software is required.
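A toy model of the addressing scheme above, under stated assumptions: one XOR checksum word per four data words (the real design uses EDAC codes with correction capability, not a plain XOR), with checksums stored at ordinary addresses above a configurable base, as the abstract describes:

```python
BLOCK = 4  # data words per checksum word (an assumed ratio)

class ChecksummedMemory:
    """Toy model: checksums live in ordinary memory locations above
    checksum_base and are fetched like normal data on each verify."""
    def __init__(self, data_words, checksum_base):
        self.mem = [0] * (checksum_base + data_words // BLOCK)
        self.base = checksum_base

    def _block_xor(self, blk):
        acc = 0
        for a in range(blk * BLOCK, (blk + 1) * BLOCK):
            acc ^= self.mem[a]
        return acc

    def write(self, addr, value):
        """Write a data word and keep its block's checksum consistent."""
        self.mem[addr] = value
        blk = addr // BLOCK
        self.mem[self.base + blk] = self._block_xor(blk)

    def verify(self, addr):
        """Re-fetch the checksum (a normal memory read) and compare."""
        blk = addr // BLOCK
        return self._block_xor(blk) == self.mem[self.base + blk]
```

A bit flip injected anywhere in a block makes verify() fail for every address in that block, which mimics how the hardware module would flag a corrupted access.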
Flash-based FPGAs are increasingly in demand in safety-critical fields, in particular space and avionics, due to their non-volatile configuration memory. Although they are almost immune to permanent loss of the configuration data, they are composed of floating-gate-based switches that can suffer transient effects if hit by highly energetic particles, with critical consequences for the implemented logic. This paper presents a new way to analyze the impact of Single Event Effects in Flash-based FPGAs. We propose a new methodology to identify the most critical switches inside the configuration logic block and the most redundant and robust configuration selection for each logic function. The experimental results achieved by fault injection demonstrate the feasibility of the proposed method and show that by using the most robust functional mapping it is possible to enhance the reliability of the entire design with respect to a non-robust one.
Complex signal processing algorithms are often specified in floating-point precision, so a type conversion is needed when the targeted platform requires fixed-point precision. In this work we propose a new method to evaluate the final impact of finite-precision processing in wireless applications. The method combines analytical analysis with simulations, extending previous work to include the effect of decision-making errors resulting from quantization. Thereby, efficient dimensioning of the minimum bit-widths that satisfy a given accuracy constraint can be performed. The method is validated with two representative case studies, namely an OFDM inner receiver and a near-ML MIMO (Multiple-Input, Multiple-Output) detector.
The IR-drop problem in test mode exacerbates delay defects and results in false failures. In this paper, we take an X-filling approach to reduce the IR-drop effect during at-speed test. The main difference between our approach and previous X-filling methods lies in two aspects. The first is that we take spatial information into consideration. The second is how the X-filling is performed: we propose a backward-propagation approach instead of the forward-propagation approach taken in previous work. The experimental results show a 42.81% reduction in the worst-case IR-drop and a 45.71% reduction in the average IR-drop compared to the random-fill method.
Aggressive scaling to nanometer CMOS technologies causes both analog and digital circuit parameters to degrade over time due to die-level stress effects (e.g., NBTI, HCI, and TDDB). In addition, failure-time dispersion increases due to increasing process variability. In this paper an innovative methodology to simulate analog circuit reliability is presented. Advantages over current state-of-the-art reliability simulators include, among others, the possibility to estimate the impact of variability and the ability to account for the effects of complex time-varying stress signals. Results show that taking time-varying stress signals into account provides circuit reliability information not visible with classic DC-only reliability simulators. Also, variability-aware reliability simulation results indicate a significant percentage of early circuit failures compared to failure-time results based on nominal design only.
Low-Density Parity Check (LDPC) codes have recently been chosen in the CCSDS standard for use in near-earth applications. The specified code belongs to the class of quasi-cyclic LDPC codes, which provide very high data rates and high reliability. Even though these codes are suited to high data rates, the complexity of LDPC decoding is a real challenge for hardware engineers. This paper presents a generic architecture for a CCSDS LDPC decoder. The architecture exploits the regularity and parallelism of the code, with genericity based on an optimized storage of the data. Two FPGA implementations are proposed: the first is low-cost oriented and the second targets a high-speed decoder.
Verification is a major issue in circuit and system design. Formal methods like bounded model checking (BMC) can guarantee a high quality of the verification. There are several techniques that can check if a set of formal properties forms a complete specification of a design. But, in contrast to simulation-based methods, like random testing, formal verification requires a detailed knowledge of the design implementation. Finding the correct set of properties is a tedious and time consuming process. In this paper, two techniques are presented that provide automatic support for writing properties in a quality-driven BMC flow. The first technique can be used to analyze properties in order to remove redundant assumptions and to separate different scenarios. The second technique - inverse property checking - automatically generates valid properties for a given expected behavior. The techniques are integrated with a coverage check for BMC. Using the presented techniques, the number of iterations to obtain full coverage can be reduced, saving time and effort.
The complexity of the test infrastructure and test strategies in systems-on-chip approaches the complexity of the functional design space. This paper presents test design space exploration and validation of test strategies and schedules using transaction-level models (TLMs). Since many aspects of testing involve the transfer of a significant amount of test stimuli and responses, the communication-centric view of TLMs suits this purpose exceptionally well.
Index Terms - Test of systems-on-chip, design-for-test, transaction level modeling
This paper presents a multi-core SoC architecture for consumer multimedia applications. The comprehensive functionality of such multimedia systems is described using the example of a hybrid TV application. The successful usage of a heterogeneous multi-core SoC platform is presented, and it is shown how specific challenges such as inter-processor communication and real-time performance guarantees in physically centralized memory systems are addressed.
Keywords: multiprocessor; TV; physically centralized memory system
High-end mobile phones support multiple radio standards and a rich suite of applications, which involves advanced radio, audio, video, and graphics processing. The overall digital workload amounts to nearly 100 GOPS, from 4-bit integer to 24-bit floating-point operations. With a power budget of only 1 W, this inevitably leads to heterogeneous multi-core architectures with aggressive power management. We review the state of the art as well as trends.
Modern Systems-on-Chip strongly rely on highly complex, specialized, mixed hardware/software subsystems to handle processing-intensive tasks: 3D graphics, imaging, video, software radio, positioning, and more. The cost and difficulty of super-integration, the lack of flexibility, and the limited resource sharing, combined with a new class of issues attached to deep-submicron process variability and reliability, open opportunities to revisit more regular, programmable approaches as an alternative. Will our industry see the emergence of a new generation of standard mega-cells that can be assembled into homogeneous many-core fabrics as an alternative to today's heterogeneous SoCs? We strongly believe that the answer is yes, and in this talk we will go through the many folds of this question.
Wireless sensor networks hold the potential to open new domains to distributed data acquisition. However, low-cost battery-powered nodes are often used to implement such networks, resulting in tight energy and communication bandwidth constraints. Cluster-based data compression and aggregation helps to reduce communication energy consumption. However, neglecting to adapt cluster sizes to local network conditions has limited the efficiency of previous clustering schemes. We have found that sensor node distances and densities are key factors in clustering. To the best of our knowledge, this is the first work taking these factors into consideration when adaptively forming data aggregation clusters. Compared with previous uniform-size clustering techniques, the proposed algorithm achieves up to 24% communication energy savings in uniform density networks and 36% savings in non-uniform density networks.
This paper presents a systematic methodology for designing the adaptation policies of reconfigurable sensor networks. The work is motivated by the need to provide efficient sensing, processing, and networking capabilities under tight hardware, bandwidth, and energy constraints. The design flow includes two main steps: generation of alternative design points representing different performance-cost trade-offs, and finding the switching rates between the points to achieve effective adaptation. Experiments studied the scaling of the methods with the size of the networks, and the effectiveness of the produced policies with respect to data loss, latency, power consumption, and buffer space.
Programmable logic arrays (PLAs) using self-assembled nanowire crossbars have shown promising potential for future nano-scale circuit design. However, due to the density and size factors of nanowires and molecular switches, the fabrication fault densities are much higher than those of conventional silicon technology, and hence pose greater design challenges. In this paper, we propose a novel defect-aware logic mapping framework via Boolean satisfiability (SAT). Compared with prior work, our technique considers PLA defects on both input and output planes at the same time. This synergistic approach can help solve logic mapping problems with higher defect rates. The proposed method is universally suitable for various nanoscale PLAs, including AND/OR, NOR/NOR structures, etc. The experimental results show that it can efficiently solve large mapping problems at a total defect rate of 20% or even higher. We further investigate the impact of different defects on PLA mapping, which provides an initial contribution toward yield estimation and the utilization of partially defective PLAs.
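The mapping problem can be illustrated on a single PLA plane: the backtracking search below stands in for the paper's SAT encoding (which handles both planes simultaneously), and the defect model, a set of unusable crosspoint columns per physical row, is an assumption for the sketch:

```python
def map_terms(terms, defects, rows):
    """Backtracking search (a stand-in for a SAT formulation): assign each
    product term (a set of required crosspoint columns) to a distinct
    physical row whose defective crosspoints, defects[row], do not
    overlap the term's required columns. Returns a dict or None."""
    assignment = {}

    def ok(term, row):
        return not (terms[term] & defects.get(row, set()))

    def search(i, names):
        if i == len(names):
            return True
        t = names[i]
        for r in rows:
            if r not in assignment.values() and ok(t, r):
                assignment[t] = r          # tentative placement
                if search(i + 1, names):
                    return True
                del assignment[t]          # backtrack
        return False

    return assignment if search(0, list(terms)) else None
```

A SAT solver scales this to large crossbars by encoding the same "distinct row, no defective crosspoint" constraints as clauses; the brute-force version is only meant to show what a satisfying assignment looks like.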
Intensive research is being performed to find post-CMOS technologies. Quantum computers, which build on reversible logic, are a very promising direction. While synthesis, testing, and verification have been investigated in the domain of reversible logic, debugging of reversible circuits has not yet been considered. The goal of debugging is to determine the gates of an erroneous circuit that explain the observed incorrect behavior. In this paper we propose the first approach for automatic debugging of reversible Toffoli networks. Our method uses a formulation of the debugging problem based on Boolean satisfiability. We show the differences from classical (irreversible) debugging and present theoretical results. These are used to speed up the debugging approach as well as to improve the resulting quality. Our method is able to find and correct single errors automatically.
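A brute-force version of the single-error debugging task just described, with exhaustive candidate enumeration standing in for the paper's SAT-based formulation; gates are modeled as (target, controls) Toffoli gates with positive controls:

```python
from itertools import combinations

def apply_gate(bits, target, controls):
    """Toffoli semantics: flip target iff all control lines are 1."""
    if all(bits[c] for c in controls):
        bits[target] ^= 1

def run(circuit, x, n):
    """Simulate an n-line Toffoli network on input x (little-endian)."""
    bits = [(x >> i) & 1 for i in range(n)]
    for tgt, ctrls in circuit:
        apply_gate(bits, tgt, ctrls)
    return sum(b << i for i, b in enumerate(bits))

def debug_single_error(circuit, spec, n):
    """Try every gate position and every candidate Toffoli gate; return
    the first single-gate replacement whose circuit matches spec on all
    2^n inputs (brute force in place of a SAT formulation)."""
    lines = range(n)
    candidates = [(t, c) for t in lines
                  for k in range(n)
                  for c in combinations([l for l in lines if l != t], k)]
    for pos in range(len(circuit)):
        for cand in candidates:
            fixed = circuit[:pos] + [cand] + circuit[pos + 1:]
            if all(run(fixed, x, n) == spec(x) for x in range(2 ** n)):
                return pos, cand
    return None
```

For a 3-line network where a CNOT's control was corrupted from line 2 to line 0, the search localizes the faulty gate and returns the corrected replacement.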
Recent advances in droplet-based digital microfluidics have enabled biochip devices for DNA sequencing, immunoassays, clinical chemistry, and protein crystallization. Since cross-contamination between droplets of different biomolecules can lead to erroneous outcomes for bioassays, the avoidance of cross-contamination during droplet routing is a key design challenge for biochips. We propose a droplet-routing method that avoids cross-contamination in the optimization of droplet flow paths. The proposed approach targets disjoint droplet routes and minimizes the number of cells used for droplet routing. We also minimize the number of wash operations that must be used between successive routing steps that share unit cells in the microfluidic array. Two real-life biochemical applications are used to evaluate the proposed droplet-routing methods.
Energy efficient communication is a key issue in wireless sensor networks. Common belief is that a multi-hop configuration is the only viable energy efficient technique. In this paper we show that the use of forward error correction techniques in combination with ARQ is a promising alternative. Exploiting the asymmetry between lightweight sensor nodes and a more powerful base station even advanced techniques known from cellular networks can be efficiently applied to sensor networks. Our investigations are based on realistic power models and real measurements and, thus, consider all side-effects. This is to the best of our knowledge the first investigation of advanced forward error correction techniques in sensor networks which is based on real experiments.
Various Orthogonal Frequency Division Multiplexing (OFDM)-based wireless communication standards have raised ever more stringent requirements on the throughput and flexibility of the Fast Fourier Transform (FFT), a kernel data transformation task in communication systems. The application-specific instruction set processor (ASIP) has emerged as a promising solution to meet these requirements. In this paper, we propose a novel ASIP design tailored for FFT computation. We reconstruct the FFT computation flow into a scalable array structure based on an 8-point butterfly unit (BU). Any-point FFT computation can be carried out in the array structure, which can easily expand along both the horizontal and vertical dimensions. We incorporate custom register files to reduce memory accesses. The data address for the custom registers in each FFT stage changes accordingly, and we derive a regular address-changing (AC) rule. With these microarchitecture modifications, we extend the instruction set with three corresponding custom instructions. Our FFT ASIP implementation achieves great performance improvement over a standard FFT software implementation, a TI DSP processor, and a commercial Xtensa ASIP, with data throughput improvements of 866.5X, 5.9X, and 2.3X, respectively. Meanwhile, the area and power consumption overhead of the custom hardware is negligible.
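A software model of the radix-8 decomposition around an 8-point butterfly unit (the BU array, register files, and custom instructions are of course not captured here); `dft8` plays the role of the BU, and the example restricts lengths to powers of 8 for simplicity:

```python
import cmath

def dft8(x):
    """8-point butterfly unit, modeled as a direct 8-point DFT kernel."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / 8)
                for n in range(8)) for k in range(8)]

def fft_radix8(x):
    """Radix-8 decimation-in-time FFT for len(x) a power of 8: each
    stage feeds twiddled sub-results through the 8-point BU."""
    N = len(x)
    if N == 8:
        return dft8(x)
    subs = [fft_radix8(x[r::8]) for r in range(8)]  # 8 interleaved sub-FFTs
    X = [0j] * N
    M = N // 8
    for k in range(M):
        # Apply twiddle factors, then combine through one BU invocation.
        tw = [subs[r][k] * cmath.exp(-2j * cmath.pi * r * k / N)
              for r in range(8)]
        bu = dft8(tw)
        for r in range(8):
            X[k + r * M] = bu[r]
    return X
```

Each recursion level performs N/8 BU invocations, which is exactly the work a row of the hardware BU array would absorb per stage.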
In this paper a programmable Forward Error Correction (FEC) IP for a DVB-S2 receiver is presented. It is composed of a Low-Density Parity Check (LDPC) decoder, a Bose-Chaudhuri-Hocquenghem (BCH) decoder, and pre- and postprocessing units. Special emphasis is put on LDPC decoding, since it accounts for by far the largest share of the IP core's complexity. We propose a highly efficient LDPC decoder which applies Gauss-Seidel decoding. In contrast to previous publications, we show in detail how to solve the well-known problem of superpositions of permutation matrices. The enhanced convergence speed of Gauss-Seidel decoding is used to reduce area and power consumption. Furthermore, we propose a modified version of the λ-Min algorithm which further decreases the memory requirements of the decoder by compressing the stored check-node messages. Compared to the latest published DVB-S2 LDPC decoders, we could reduce the clock frequency by 40% and the memory consumption by 16%, yielding large energy and area savings while offering the same throughput.
Index Terms - Forward Error Correction, Soft Decision Decoding, LDPC, DVB-S2, Check Node approximation.
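The memory-saving idea behind λ-Min-style check-node processing can be sketched in a few lines. The code below assumes a generic LLR formulation (not the paper's modified variant): each outgoing message is computed from only the λ smallest-magnitude incoming messages, so only those values need to be stored.

```python
import math

def boxplus(a, b):
    """Exact pairwise check-node (box-plus) operation on LLRs."""
    return (math.copysign(1, a) * math.copysign(1, b) * min(abs(a), abs(b))
            + math.log1p(math.exp(-abs(a + b)))
            - math.log1p(math.exp(-abs(a - b))))

def lambda_min_update(llrs, lam=3):
    """Check-node update using a lambda-Min style approximation.

    For each outgoing edge, the box-plus sum runs only over the `lam`
    incoming messages of smallest magnitude (the edge's own message
    excluded).  A sketch of the compression idea, not the paper's
    exact modified algorithm.
    """
    out = []
    for i in range(len(llrs)):
        others = [v for j, v in enumerate(llrs) if j != i]
        others.sort(key=abs)
        sel = others[:lam]
        acc = sel[0]
        for v in sel[1:]:
            acc = boxplus(acc, v)
        # The sign still uses all excluded messages' signs.
        sign = 1.0
        for v in others[lam:]:
            sign *= math.copysign(1, v)
        out.append(sign * acc)
    return out
```

Because the box-plus magnitude is bounded by the smallest operand magnitude, storing only the λ minima loses little accuracy while shrinking the message memory.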
The richness of the wavelet transform is known in many fields, and different classes of wavelet filters can be used depending on the application. In this paper, we propose an IEEE 754 floating-point lifting-based wavelet processor that can perform various forward and inverse Discrete Wavelet Transforms (DWTs) and Discrete Wavelet Packets (DWPs). Our architecture is based on processing elements that can perform either prediction or update on a continuous data stream every two clock cycles. We also consider the normalization step that takes place at the end of the forward DWT/DWP or at the beginning of the inverse DWT/DWP. To cope with different wavelet filters, we feature a multi-context configuration to select among various DWTs/DWPs. Different memory sizes and multi-level transformations are supported. For the 32-bit implementation, the estimated area of the proposed processor with 2x512 words of memory and 8 PEs in a 0.18-μm process is 3.7 mm² and the estimated operating speed is 353 MHz.
Statistical static timing analysis deals with the increasing variations in manufacturing processes to reduce the pessimism of worst-case timing analysis. Because of the correlation between delays of circuit components, timing model generation and hierarchical timing analysis face more challenges than in static timing analysis. In this paper, a novel method to generate timing models for combinational circuits considering variations is proposed. The resulting timing models have accurate input-output delays and are about 80% smaller than the original circuits. Additionally, an accurate hierarchical timing analysis method at design level using pre-characterized timing models is proposed. This method incorporates the correlation between modules by replacing independent random variables, improving timing accuracy. Experimental results show that the correlation between modules strongly affects the delay distribution of the hierarchical design and that the proposed method has good accuracy compared with Monte Carlo simulation while being three orders of magnitude faster.
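The effect of inter-module correlation on a hierarchical delay distribution can be seen in a toy Monte Carlo experiment. The distributions, means and correlation value below are illustrative, not drawn from the paper: two module delays share a common process variable, and ignoring that correlation underestimates the spread of the path delay.

```python
import random

def mc_path_delay(n=50000, rho=0.8, seed=1):
    """Monte Carlo of a two-module path delay whose module delays
    share a common process variable with correlation `rho`.

    Returns (mean, std) of the total delay.  With rho=0.8 the variance
    is 2 + 2*rho = 3.6, larger than the 2.0 an independence assumption
    would predict -- the effect hierarchical models must capture.
    """
    rng = random.Random(seed)
    total = total2 = 0.0
    k = (1 - rho ** 2) ** 0.5
    for _ in range(n):
        common = rng.gauss(0, 1)                 # shared process variation
        d1 = 10 + rho * common + k * rng.gauss(0, 1)
        d2 = 12 + rho * common + k * rng.gauss(0, 1)
        d = d1 + d2
        total += d
        total2 += d * d
    mean = total / n
    var = total2 / n - mean * mean
    return mean, var ** 0.5
```

The sample standard deviation comes out near sqrt(3.6) ≈ 1.9 rather than sqrt(2) ≈ 1.41, which is the gap the correlation-aware hierarchical analysis closes.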
Equivalence checking and property checking are powerful techniques to detect error traces. Debugging these traces is a time-consuming design task where automation provides help. In particular, debugging based on Boolean Satisfiability (SAT) has been shown to be quite efficient. Given some error traces, the algorithm returns fault candidates. But using random error traces cannot ensure that a fault candidate is sufficient to explain all erroneous behaviors. Our approach provides a more accurate diagnosis by iterating the generation of counterexamples and debugging. This increases the accuracy of the debugging result and yields more valuable counterexamples. As a consequence, fewer time-consuming manual iterations between verification and debugging are required - thus the debugging productivity increases.
In recent years, the verification of digital designs has become one of the most challenging, time-consuming and critical tasks in the entire hardware development process. Within this area, the vast majority of the verification effort in industry relies on logic simulation tools. However, logic simulators deliver limited performance when faced with the vast complexity of modern systems, especially synthesized netlists. The consequences are poor design coverage, delayed product releases and bugs that escape into silicon. Thus, we developed a novel GPU-accelerated logic simulator, called GCS, optimized for large structural netlists. By leveraging the vast parallelism offered by GP-GPUs and a novel netlist balancing algorithm tuned for the target architecture, we can attain an order-of-magnitude performance improvement on average over commercial logic simulators, and simulate large industrial-size designs, such as the OpenSPARC processor core.
Today's complex integrated circuit designs increasingly rely on post-silicon validation to eliminate bugs that escape from pre-silicon verification. One effective silicon debug technique is to monitor and trace the behaviors of the circuit during its normal operation. However, designers can only afford to trace a small number of signals in the design due to the associated overhead. Selecting which signals to trace is therefore a crucial issue for the effectiveness of this technique. This paper proposes an automated trace signal selection strategy that is able to dramatically enhance the visibility in post-silicon validation. Experimental results on benchmark circuits show that the proposed technique is more effective than existing solutions.
Core-cell stability represents the ability of a core-cell to retain the stored data. With the rapid development of semiconductor memories, their test is becoming a major concern in VDSM technologies. Core-cell stability testing provides information about the SRAM design reliability, and its effectiveness is therefore mandatory for safety applications. Existing core-cell stability Design-for-Test (DfT) techniques consist of controlling the voltage levels of the bit lines to apply a weak write stress to the core-cell under test. If the core-cell is weak, the weak write stress induces a faulty swap of the core-cell. However, these solutions are costly in terms of area and test application time, and generally require modifications of critical parts of the SRAM (the core-cell array and/or the structure generating the internal auto-timing). In this paper, we present a new DfT technique for stability fault detection. It consists of modulating the word line activation in order to apply an adjustable weak write stress to the targeted core-cell. Compared to existing DfT solutions, the proposed technique offers many advantages: programmability, low area overhead, and low test application time. Moreover, it does not require any modification of critical parts of the SRAM.
Multiple-voltage design is an effective dynamic power reduction technique. Recent research has shown that testing for resistive bridging faults in such designs requires more than one voltage setting for 100% defect coverage; however, switching between several supply voltage settings has a detrimental impact on the overall cost of test. This paper proposes an effective gate sizing technique for reducing the test cost of multi-Vdd designs with bridge defects. Using synthesized ISCAS benchmarks and a parametric fault model, experimental results show that for all the circuits the proposed technique achieves 100% defect coverage at a single Vdd setting; in addition, it has a lower overhead than the recently proposed test point insertion technique in terms of timing, area and power.
Index Terms - Gate Sizing, Test Cost, Resistive Bridging Faults, Multiple-Vdd designs, Design for Testability
During volume testing, test application time, test data volume and high-performance automatic test equipment (ATE) are the major cost factors. Embedded testing, including built-in self-test (BIST), and multi-site testing are quite effective cost reduction techniques, but they may make diagnosis more complex. This paper presents a test response compaction scheme and a corresponding diagnosis algorithm which are especially suited for BIST and multi-site testing. Experimental results on industrial designs show that test time and response data volume are reduced significantly and that the diagnostic resolution even improves with this scheme. A comparison with X-Compact indicates that simple parity information provides higher diagnostic resolution per response data bit than more complex signatures.
Keywords - Diagnosis, Embedded diagnosis, Multi-site test, Compaction, Design-for-test
DRAM is usually used as the main memory for program execution. The thermal behavior of a memory block in a 3D SiP is affected not only by its power behavior but also by the heat-dissipating ability of that block. The power behavior of a block is related to the applications run on the system, while the heat-dissipating ability is determined by the tier and position at which the block is located. Therefore, a thermal-aware memory allocator should consider two points: first, not only the power behavior of a memory block but also its physical location during memory mapping; second, the changing temperature of a physical block during program execution. In this paper, we propose a memory mapping algorithm that takes these two points into consideration. Our technique can be classified as static thermal management applied to embedded software designs. Experiments show that our method can reduce the temperature of the memory system by 17.2°C compared to a straightforward mapping in the best case, and by 13.4°C on average.
With continuous technology scaling, soft errors are becoming an increasingly important design concern even for earth-bound applications. While compiler approaches have the potential to mitigate the effect of soft errors with minimal runtime overheads, static vulnerability estimation - an essential part of compiler approaches - has been lacking due to its inherent complexity. This paper presents a static analysis approach for Register File (RF) vulnerability estimation. We decompose the vulnerability of a register into intrinsic and conditional basic-block vulnerabilities. This decomposition allows us to develop a fast, yet reasonably accurate, linear equation-based RF vulnerability estimation mechanism. We demonstrate its practical application to compiler optimizations. Our experimental results on benchmarks from the MiBench suite indicate that not only is our static RF vulnerability estimation fast and accurate, but the compiler optimizations it enables also achieve very cost-effective protection of register files against soft errors.
This paper explores the use of dynamic compilation for continuing execution even if one or more of the memory banks used by an application become temporarily unavailable (but their contents are preserved), that is, the number of memory banks available to the application varies at runtime. We implemented the proposed dynamic compilation approach using a code instrumentation system and performed experiments with 12 embedded benchmark codes. The results collected so far are very encouraging and indicate that, even when all the overheads incurred by dynamic compilation are included, the proposed approach still brings significant benefits over an alternate approach that suspends application execution when there is a reduction in memory bank availability and resumes later when all the banks are up and running.
This paper presents a design methodology for fully reconfigurable low-voltage Delta-Sigma converters, as used for instance in next-generation wireless applications. The design methodology first finds the power-optimized noise transfer functions for the different standards at system level and then translates them into optimal granularities of programmability and circuit parameters, such as resistance and capacitance values for the integrators. Reconfiguration is done in the passive component arrays, modulator orders, number of quantizer bits and transconductances for optimal power consumption. This gives the design the best trade-off between power and performance for every configuration mode.
This paper describes a systematic approach that facilitates yield improvement of integrated circuits at the post-manufacture stage. A new Configurable Analogue Transistor (CAT) structure is presented that allows the adjustment of devices after manufacture. The technique enables both performance and yield to be improved as part of the normal test process. The optimal sizing of the inserted CAT devices is crucial to ensure the greatest improvement in yield and this paper considers this challenge in detail. An analysis and description of the underlying theory of the sizing problem is given along with examples of incorrect sizing. Guidelines to achieve optimal CAT sizing are proposed, and results are provided to demonstrate the overall effectiveness of the CAT approach.
This paper proposes, for the first time, an automated energy harvester design flow which is based on a single HDL software platform that can be used to model, simulate, configure and optimise energy harvester systems. A demonstrator prototype incorporating an electromagnetic mechanical-vibration-based micro-generator and a limited number of library models has been developed and a design case study has been carried out. Experimental measurements have validated the simulation results which show that the outcome from the design flow can improve the energy harvesting efficiency by 75%.
In this work, we propose an enhanced design method for filterless class-D audio amplifiers based on a multilevel architecture. The multilevel technique consists of a multilevel converter and a time division adder following the modulator. In this method, the modulated signal is arranged into several time divisions and then integrated into a binary number. The binary number is then encoded into a set of parallel control signals for the multilevel converter, which delivers a multilevel signal to the loudspeaker instead of conventional two-level signals. Consequently, the total harmonic distortion (THD) and signal-to-noise ratio (SNR) improve significantly without sacrificing power efficiency. Moreover, the proposed method can be applied to many class-D amplifier designs by simply inserting a time division adder behind the modulator and replacing the output stage with a multilevel converter.
Panelists: N. Topham, D. Pulley, M. Harrand, J. Goodacre, G. Martin and Y. Tanurhan
Adaptive body bias (ABB) and adaptive supply voltage (ASV) have been shown to be effective methods for post-silicon tuning of circuit properties to reduce variability. While their properties have been compared on generic combinational circuits or microprocessor sub-blocks, the advent of multi-core systems is bringing a new application domain to the forefront. Global interconnects are evolving into complex communication channels with drivers and receivers, in an attempt to mitigate the effects of reverse scaling and reduce power. Characterizing the performance spread of these links and exploring effective, power-aware compensation techniques for them is becoming a key design issue. This work compares the variability compensation efficiency of ABB and ASV when put to work in two representative link architectures of today's ICs: a traditional full-swing interconnect and a low-swing signaling scheme for low-power communication. We provide guidelines for the post-silicon variability compensation of these communication channels.
Technology scaling has caused the feature sizes to shrink continuously, whereas interconnects, unlike transistors, have not followed the same trend. Designing 3D stack architectures is a recently proposed approach to overcome the power consumption and delay problems associated with the interconnects by reducing the length of the wires going across the chip. However, 3D integration introduces serious thermal challenges due to the high power density resulting from placing computational units on top of each other. In this work, we first investigate how the existing thermal management, power management and job scheduling policies affect the thermal behavior in 3D chips. We then propose a dynamic thermally-aware job scheduling technique for 3D systems to reduce the thermal problems at very low performance cost. Our approach can also be integrated with power management policies to reduce energy consumption while avoiding the thermal hot spots and large temperature variations.
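A minimal sketch of thermally-aware dispatch follows, assuming a toy linear heating model and a coolest-core-first policy; both are illustrative stand-ins, not the paper's dynamic scheduling technique.

```python
def thermally_aware_assign(jobs, core_temps):
    """Greedy coolest-core-first dispatch for a 3D stack.

    jobs: {name: power estimate}; core_temps: current per-core
    temperatures.  Hotter (higher-power) jobs are dispatched first,
    each to the currently coolest core, and a toy linear heating
    model updates that core's temperature.  Returns {name: core}.
    """
    temps = list(core_temps)
    assign = {}
    # Dispatch the most power-hungry jobs first to spread heat.
    for name, power in sorted(jobs.items(), key=lambda kv: -kv[1]):
        core = min(range(len(temps)), key=lambda c: temps[c])
        assign[name] = core
        temps[core] += power   # toy heating increment, illustrative only
    return assign
```

Even this greedy sketch balances hot spots: the heavy job lands on the cool core, and subsequent jobs alternate as temperatures equalize.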
We present an optimal methodology for the dynamic voltage scheduling problem in the presence of realistic assumptions such as leakage power and intra-task overheads. Our contribution is an optimal algorithm for energy minimization that concurrently assumes (1) non-convex energy-speed models, as opposed to the previously studied convex models, (2) a discrete set of operational modes (voltages), and (3) intra-task energy and delay overheads. We tested our algorithm on MediaBench and on task sets used in previous papers. Our simulation results show an average of 22% improvement in energy reduction in comparison with optimal algorithms for convex models without switching overhead, and an average of 24% when energy and delay overheads are taken into account. This analysis lays the groundwork for improving functionality in CAD design through non-convex techniques for discrete models.
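For small instances, the discrete-mode selection problem with switching overheads can be solved exactly by exhaustive search. The sketch below makes the cost model concrete; all numbers and names are illustrative, and this brute force is a baseline, not the paper's optimal algorithm.

```python
import itertools

def min_energy_schedule(tasks, deadline, switch_t=1, switch_e=0.1):
    """Pick one voltage mode per task to minimize total energy under
    a hard deadline, charging a time/energy overhead for each mode
    switch between consecutive tasks.

    tasks: per-task list of (exec_time, energy) pairs, one per mode;
    the energy/speed relation may be arbitrary (non-convex).
    Returns (energy, mode_assignment) or None if infeasible.
    """
    best = None
    for modes in itertools.product(*[range(len(t)) for t in tasks]):
        time = energy = 0.0
        for i, m in enumerate(modes):
            t, e = tasks[i][m]
            time += t
            energy += e
            if i > 0 and m != modes[i - 1]:   # mode-switch overhead
                time += switch_t
                energy += switch_e
        if time <= deadline and (best is None or energy < best[0]):
            best = (energy, list(modes))
    return best
```

With a loose deadline the low-voltage (slow, cheap) modes win; tightening the deadline forces the fast, expensive modes, and the switch overheads penalize mixed assignments.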
Localized heating creates thermal hotspots across the chip, with the integer register file ranked as the hottest unit in high-performance microprocessors. In this paper, we perform a detailed study on the thermal behavior of a low-power value-aware register file (VARF) that is subjected to internal fine-grain hotspots. To further optimize its thermal behavior, we propose and evaluate three thermal-aware control schemes - thermal sensor (TS), access counter (AC), and register-id (ID) based - to balance the access activity and thus the temperature across different partitions in the VARF. The simulation results using SPEC CINT2000 benchmarks show that the register-id controlled VARF (ID-VARF) scheme achieves optimized thermal behavior at minimum cost as compared to the other schemes. We further evaluate the performance impact of the thermal-aware VARF design with dynamic thermal management (DTM). The experimental results show that the ID-VARF can improve the performance by 26.1% and 7.2% over the conventional register file and the original VARF design, respectively.
With the trend toward high-quality large form factor displays on high-end handhelds, LCD backlight accounts for a significant and increasing percentage of the total energy budget. Substantial energy savings can be achieved by dynamically adapting backlight intensity levels while compensating for the ensuing visual quality degradation with image pixel transformations. Several compensation techniques have been recently developed to this purpose, but none of them has been fully characterized in terms of quality losses considering jointly the non-idealities present in a real embedded video chain and the peculiar characteristics of the human visual system (HVS). We have developed a quality analysis framework based on an accurate embedded visualization system model and HVS-aware metrics. We use it to assess the visual quality performance of existing dynamic backlight scaling (DBS) solutions. Experimental results show that none of the DBS techniques available today is fully capable of keeping quality loss under control, and that there is significant room for improvement in this direction.
The H.264/AVC Intra Frame Codec (i.e. all frames are coded as I-frames) targets high-resolution/high-end encoding applications (e.g. digital cinema and high-quality archiving), providing much better compression efficiency at lower computational complexity compared to MJPEG2000. Moreover, in the case of video coding of very high motion scenes, the number of Intra macroblocks is dominant. Intra Prediction is a compute-intensive and memory-critical part that consumes 80% of the computation time of the entire Intra compression process when executing the H.264 encoder on a MIPS processor. We therefore present a novel hardware architecture for H.264 Intra Prediction that processes all the prediction modes in parallel inside one integrated module (i.e. mode-level parallelism), enabling us to exploit the full space of optimization. It exhibits a group-based write-back scheme to reduce memory transfers in order to facilitate fast mode-decision schemes. Our Luma 4x4 hardware is 3.6x, 5.2x, and 5.5x faster than three state-of-the-art approaches, including QS0. Our results show that processing Luma 16x16, Chroma 8x8, and Luma 4x4 with the proposed approach is 7.2x, 6.5x, and 1.8x faster (while giving an energy saving of 60%, 80%, and 74%) when compared with the Dedicated Module Approach, in which each prediction mode is processed by its own independent hardware module, i.e. a typical ASIC style for Intra Prediction. We get an area saving of 58% for the Luma 4x4 hardware.
Diverse approaches to parallel implementation of H.264 have been proposed; however, they all share a common problem: the entropy decoder in H.264 remains mapped on a single processing element (PE). Due to its inherently sequential and context-adaptive nature, the entropy decoder cannot be parallelized, which renders it a bottleneck for the entire decoding process. Depending on the type of processing core and the video bit-rate, the performance of the entire decoding process is bounded by entropy decoding. It is therefore necessary to research and implement new algorithmic solutions to compensate for this bottleneck, and thereby make optimal use of parallel implementations of the H.264 decoder on mainstream multi-core systems. This paper presents a new CAVLC decoding method derived by constructing custom CAVLC decoding tables using "table grouping". While the conventional "sequential table look-up" method requires multiple memory accesses, our proposed method accesses the custom tables only once to decode any symbol. Moreover, in our proposed method the symbol decoding time does not depend on the symbol length and is constant for each symbol, resulting in a nearly linear increase in computational complexity with increasing video fidelity, as opposed to the nonlinear increase of earlier methods. Experimental results show that our proposed algorithm achieves up to 7x higher performance and 83% fewer memory accesses compared to three commonly used, state-of-the-art CAVLC algorithms: table look-up by sequential search, table look-up by binary search, and "Moon's method".
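The single-access idea can be sketched by expanding a prefix-free code into one flat table indexed by a fixed-width bit window; the real custom CAVLC tables are more involved, so the code below is an illustration of the principle only, with an invented toy code.

```python
def build_flat_table(code, width):
    """Expand a prefix-free code into a flat table of size 2**width.

    Every table entry indexed by the next `width` bits stores
    (symbol, codeword_length), so any symbol decodes with exactly one
    lookup.  All codewords must be at most `width` bits long.
    """
    table = [None] * (1 << width)
    for sym, bits in code.items():
        assert len(bits) <= width
        pad = width - len(bits)
        base = int(bits, 2) << pad
        for tail in range(1 << pad):      # all suffixes share the entry
            table[base + tail] = (sym, len(bits))
    return table

def decode(bitstream, table, width):
    """Decode a bit string with one table access per symbol."""
    out, pos = [], 0
    padded = bitstream + "0" * width      # guard bits for the last peek
    while pos < len(bitstream):
        sym, length = table[int(padded[pos:pos + width], 2)]
        out.append(sym)
        pos += length                     # constant-time advance
    return out
```

Because every codeword maps to a contiguous block of table entries, decoding cost per symbol is constant regardless of codeword length, which is exactly the property that removes the data-dependent branch-and-search loop of sequential look-up.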
Power management at any abstraction level is a key issue for many mobile multimedia and embedded applications. In this paper a design workflow to generate system-level power models is presented, tailored to support quantitative runtime power optimization policies implemented within an operating system. The approach we followed to derive power models is strongly use-case oriented. Starting from a comprehensive, accurate model of a representative architecture for embedded applications (including a multi-core MPSoC, accelerators, interfaces and peripherals), a methodology to derive compact models is presented, based upon the distinctive characteristics of the selected use cases. This methodology, whose exploitation is foreseen within a power manager working at the OS level, is the focus of the paper. The value and accuracy of the approach are quantitatively and statistically justified through extensive experiments carried out on a development board designed for multimedia applications.
Common sub-expression elimination (CSE) serves as a useful optimization technique in the synthesis of arithmetic datapaths described at RTL. However, CSE has a limited potential for optimization when many common sub-expressions are not exposed. Given a suitable transformation of the polynomial system representation, which exposes many common sub-expressions, subsequent CSE can offer a higher degree of optimization. The objective of this paper is to develop algebraic techniques that perform such a transformation, and present a methodology to integrate it with CSE to further enhance the potential for optimization. In our experiments, we show that this integrated approach outperforms conventional methods in deriving area-efficient hardware implementations of polynomial systems.
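Plain CSE on expression trees can be sketched in a few lines; the paper's contribution lies in the algebraic transformation applied *before* this step to expose more shared terms, which this toy pass does not perform. The tuple representation and temporary names are illustrative.

```python
def cse(exprs):
    """Eliminate common sub-expressions from expression trees.

    Expressions are nested tuples like ('+', ('*', 'a', 'b'), 'c').
    Repeated subtrees are hoisted into temporaries t0, t1, ...
    Returns (definitions, rewritten expressions).
    """
    counts, temps, order = {}, {}, []

    def count(e):
        if isinstance(e, tuple):
            counts[e] = counts.get(e, 0) + 1
            for child in e[1:]:
                count(child)

    for e in exprs:
        count(e)

    def rewrite(e):
        if not isinstance(e, tuple):
            return e
        e2 = (e[0],) + tuple(rewrite(c) for c in e[1:])
        if counts.get(e, 0) > 1:          # shared subtree -> temporary
            if e not in temps:
                temps[e] = "t%d" % len(temps)
                order.append((temps[e], e2))
            return temps[e]
        return e2

    return order, [rewrite(e) for e in exprs]
```

On the datapath `a*b + c` and `a*b - d`, the shared product `a*b` is computed once and reused, which is the multiplier saving CSE delivers in RTL synthesis.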
This paper uses an under-approximation of the unreachable states of a design to derive incomplete specifications of combinational logic. The resulting incompletely-specified functions are decomposed to enhance the quality of technology-dependent synthesis. The decomposition choices are computed implicitly using a novel formulation of symbolic bi-decomposition that is applied recursively to decompose logic in terms of simple primitives. The ability of BDDs to compactly represent certain exponentially large combinatorial sets helps us to implicitly enumerate and explore a variety of decomposition choices, improving the quality of synthesized circuits. The benefits of the symbolic technique are demonstrated in the sequential synthesis of publicly available benchmarks as well as of realistic industrial designs.
We investigate restructuring techniques based on decomposition/factorization, with the objective of moving critical signals toward the output while minimizing area. A specific application is synthesis for minimum switching activity (or high performance) with minimum area penalty, where decompositions with respect to specific critical variables are needed (for example, the ones with the highest switching activity). In this paper we describe new types of factorization that extend Shannon cofactoring and are based on projection functions, which change the Hamming distance of the original minterms, and on appropriate don't-care sets, to favor logic minimization of the component blocks. We define two new general forms of decomposition that are special cases of the pattern F = G(H(X),Y). The related implementations, called P-Circuits, show experimentally promising results in area with respect to Shannon cofactoring.
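The classical Shannon cofactoring that these factorizations extend can be written down directly. The sketch below is the generic baseline (not P-Circuits): it splits a function into its two cofactors around a variable and recomposes them, with functions represented as callables on assignment dictionaries.

```python
def shannon(f, var):
    """Shannon cofactoring: f = var*f|var=1 + var'*f|var=0.

    `f` is a callable on an assignment dict; returns the two
    cofactors (f restricted to var=1 and var=0) as callables.
    """
    f1 = lambda assign: f({**assign, var: 1})   # positive cofactor
    f0 = lambda assign: f({**assign, var: 0})   # negative cofactor
    return f1, f0

def recompose(f1, f0, var):
    """Rebuild f from its cofactors (the inverse direction)."""
    return lambda assign: bool(assign[var] and f1(assign)
                               or not assign[var] and f0(assign))
```

For f = x XOR y, the cofactors around x are y' and y; recomposition reproduces f on every assignment, which is the identity the extended projection-based factorizations generalize.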
In modern sub-micron design, achieving low-skew clock distribution is a growing challenge for high-performance circuits. Symmetric global clock distribution and clock tree synthesis (CTS) for local clock optimization have been used so far, but new methodologies are necessary as the technology node advances. In this paper, we study the register placement problem, a key component of local clock optimization for high-performance circuit design along with local clock distribution. We formulate it as a minimum-weight maximum independent set problem on a weighted conflict graph and propose a novel, efficient two-stage heuristic to solve it. To reduce the graph size, techniques based on register flipping and the Manhattan circle are also presented. Experiments show that our heuristic can place all registers without overlaps and achieves significant improvement in the total and maximal register movement.
Scan compression has emerged as the most successful solution to the problem of rising manufacturing test cost. Compression technology, however, is not hierarchical in nature: hierarchical implementations need test access mechanisms that preserve the isolation between the different tests applied through the different compressors and decompressors. In this paper we discuss a test access mechanism for Adaptive Scan that addresses the problem of reducing test data and test application time in a hierarchical and low pin count environment. An active test access mechanism is used that becomes part of the compression scheme and unifies the test data for multiple CODEC implementations, thus allowing hierarchical DFT implementations with flat ATPG.
The main disadvantage of LFSR-based compression is that it usually has to be combined with a constrained ATPG process and, as a result, cannot be effectively applied to IP cores of unknown structure. In this paper, a new LFSR-based compression approach that overcomes this problem is proposed. The proposed method allows each LFSR seed to encode as many slices as possible. To achieve this, a special-purpose slice, called a stop-slice, that indicates the end of a seed's usage is encoded as the last slice of each seed. Thus, the seeds include by construction the information of where they should stop and, for that reason, we call them self-stoppable. A stop-slice generation procedure is proposed that exploits the inherent test set characteristics and generates stop-slices which impose minimum compression overhead. Moreover, the architecture implementing the proposed technique requires negligible additional hardware overhead compared to the standard LFSR-based architecture. The proposed technique is also accompanied by a seed calculation algorithm that tries to minimize the number of calculated seeds.
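The self-stopping expansion can be sketched with a toy Fibonacci-style LFSR: slices are emitted until the reserved stop-slice pattern appears in the output. The taps, widths and stop pattern below are illustrative choices, not the paper's.

```python
def lfsr_slices(seed, taps, width, stop_slice):
    """Expand an LFSR seed into `width`-bit test slices until the
    reserved stop-slice pattern appears, mimicking the self-stoppable
    seed idea: the seed itself encodes where its expansion ends.

    seed: list of 0/1 (LFSR state); taps: state indices XORed into
    the feedback.  Returns the list of decompressed slices.
    """
    state = list(seed)
    slices = []
    while True:
        out = []
        for _ in range(width):
            bit = state[-1]                  # serial output bit
            fb = 0
            for t in taps:
                fb ^= state[t]
            state = [fb] + state[:-1]        # shift with feedback
            out.append(bit)
        if tuple(out) == tuple(stop_slice):
            return slices                    # stop-slice ends this seed
        slices.append(tuple(out))
        if len(slices) > 100:                # safety bound for the sketch
            return slices
```

The decompressor needs no external slice count per seed: expansion simply runs until the stop pattern emerges, which is the information the encoding builds into the seed.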
Test data volume and test application time are major concerns for large industrial circuits. In recent years, many compression techniques have been proposed and evaluated using industrial designs. However, these methods do not target sequence- or timing-dependent failures while compressing the test patterns. Timing-related failures in high-performance integrated circuits are now increasingly dominated by small-delay defects (SDDs). We present an SDD-aware seed-selection technique for LFSR-reseeding-based test compression. Experimental results show that a significant increase in test-pattern quality can be achieved when seeds are selected to target SDDs.
Growing test data volume and overtesting caused by excessive scan capture power are two of the major concerns for the industry when testing large integrated circuits. Various test data compression (TDC) schemes and low-power X-filling techniques have been proposed to address these problems. These methods, however, exploit the very same "don't-care" bits in the test cubes to achieve different objectives and hence may contradict each other. In this work, we propose a generic framework for reducing scan capture power in a test compression environment. Using the entropy of the test set to measure the impact of capture power-aware X-filling on the potential test compression ratio, the proposed holistic solution is able to keep capture power under a safe limit with little compression ratio loss for any fixed-length symbol-based TDC method. Experimental results on benchmark circuits demonstrate the efficacy of the proposed approach.
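The entropy measure used to bound the achievable compression can be computed directly on a filled test cube. The repeat-fill below is one common low-entropy X-filling heuristic, used here purely for illustration; the paper's framework trades this entropy against capture power.

```python
import math
from collections import Counter

def symbol_entropy(bits, w):
    """Shannon entropy (bits per symbol) of a test vector split into
    fixed-length w-bit symbols -- a lower bound on the bits/symbol any
    fixed-length symbol-based TDC scheme needs on average.
    """
    syms = [bits[i:i + w] for i in range(0, len(bits) - w + 1, w)]
    n = len(syms)
    return -sum((c / n) * math.log2(c / n) for c in Counter(syms).values())

def fill_repeat(cube):
    """X-fill that repeats the last specified bit.

    Long constant runs keep the symbol distribution skewed, hence the
    entropy (and compressed size) low; capture-power-aware fills would
    assign the X bits differently and may raise this entropy.
    """
    out, last = [], "0"
    for b in cube:
        last = b if b in "01" else last   # 'X' inherits the previous bit
        out.append(last)
    return "".join(out)
```

Comparing the entropy of two candidate fills of the same cube quantifies exactly the compression-ratio cost that the framework keeps small while meeting the capture-power limit.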
The generation of device drivers is a very time-consuming and error-prone activity. All strategies proposed so far to simplify this task require a manual, possibly formal, specification of the device driver functionalities. In system-level design, IP functionalities are tested using testbenches, which implement the communication protocols needed to interact correctly with the device. The aim of this paper is to present a methodology to automatically generate device drivers from the testbench of any RTL IP. The only manual step required is tagging the states corresponding to the different device functionalities. Extended Finite State Machines (EFSMs) are then used to create a correct-by-construction two-level device driver: the lower level deals with architectural choices, while the higher one is derived from the EFSMs and implements the communication protocols. The effectiveness of this methodology has been proven by applying it to a platform provided by STMicroelectronics.
We address the problem of scheduling real-time streaming applications on hybrid CPU/FPGA architectures. The main contribution is a two-step approach to minimize the buffer requirement of streaming applications with throughput guarantees. A novel declarative, constraint-based scheduling method for real-time hybrid SW/HW systems is proposed, in which the application throughput is guaranteed by periodic phases in the execution. We use a voice-band modem application to exemplify the scheduling capabilities of our method. The experimental results show the advantages of our techniques, in both lower buffer requirements and higher guaranteed throughput, compared to the traditional PAPS method.
Behavioral models for analog and mixed-signal (AMS) designs are developed at various levels of abstraction, using various types of languages, to cater to a wide variety of requirements, ranging from verification and design-space exploration to test generation and application demonstration. In this paper we present a high-level formalism for capturing the AMS design intent from the specification and present techniques for automatic generation of AMS behavioral models. The proposed formalism is language-independent, yet the design intent is modeled at a level of abstraction that enables easy translation into common modeling standards. We demonstrate the translation into Verilog-A and SPICE, which are fundamentally different standards for behavioral modeling. The proposed approach is demonstrated using a family of Low-Dropout Regulators (LDOs) as a reference.
This paper describes a systematic approach to integrating the Discrete Event System Specification (DEVS) methodology into SystemC. It thus combines Model of Computation (MoC)-specific properties with the features of an advanced SystemC environment. The execution speed of abstract system-level DEVS models is comparable to that of pure SystemC models and significantly faster than that of other DEVS environments. Thus, system-level models based on abstract MoCs may easily be executed in a SystemC environment. The proposed integration is realized as a non-introspective extension of the SystemC 2.2 kernel. The DEVS models are implemented in an additional software layer above the SystemC simulation kernel. Our approach may be used simultaneously with other layered extensions, e.g., SystemC-AMS or TLM.
Clock skew scheduling (CSS) is an effective technique for optimizing the clock period of sequential designs. However, CSS is not effective in the presence of certain structural design constraints that limit it. In this paper, we present an analysis of several structural design constraints that affect CSS and propose techniques to resolve them. Furthermore, we propose a CSS-enabled FPGA architecture and a novel clock-period optimization (CPO) flow that tackles some of these constraints by exploiting the reconfigurability of FPGAs. Experimental results demonstrate that the proposed FPGA architecture with the CPO flow achieves an average performance improvement of 24.4%, which is 10.7% better on average than the CPO flow without constraint handling.
FPGAs are widely used for evaluating the error-floor performance of LDPC (low-density parity-check) codes. We propose a scalable vector decoder for FPGA-based implementation of quasi-cyclic (QC) LDPC codes that takes advantage of the high bandwidth of the embedded memory blocks (called Block RAMs in Xilinx FPGAs) by packing multiple messages into the same word. We describe a vectorized overlapped message-passing algorithm that results in a 3.5X to 5.5X speedup over state-of-the-art FPGA implementations in the literature.
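The message-packing idea, storing several fixed-width messages in one wide memory word, can be illustrated with a small sketch (the 4-bit message width used in the test below is an assumption for illustration, not the decoder's actual quantization):

```python
def pack_messages(msgs, width):
    """Pack several fixed-width messages into one memory word,
    with message i occupying bits [i*width, (i+1)*width)."""
    word = 0
    for i, m in enumerate(msgs):
        word |= (m & ((1 << width) - 1)) << (i * width)
    return word

def unpack_messages(word, width, count):
    """Recover the individual messages from a packed word."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]
```

Packing lets one Block RAM read deliver several messages per cycle, which is what enables the vectorized decoding.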
This paper explores runtime reconfiguration of custom instructions in the context of multi-tasking real-time embedded systems. We propose a pseudo-polynomial time algorithm that minimizes processor utilization through customization and runtime reconfiguration, while satisfying all the timing constraints. Our experimental infrastructure consists of a Stretch customizable processor supporting runtime reconfiguration as the hardware platform and realistic embedded benchmarks as applications. We observe that runtime reconfiguration of custom instructions can help reduce processor utilization by up to 64%. The experimental results also demonstrate that our algorithm is highly scalable and achieves optimal or near-optimal (within 3%) processor utilization.
Statistical analysis is generally seen as the next EDA technology for timing and power sign-off. Research in this field saw significant activity starting about five years ago; recently, interest appears to have fallen off somewhat. Moreover, while much effort has been put into research fundamentals, very few industrial applications have been reported so far. Therefore, a group including Infineon Technologies as a leading semiconductor IDM, various universities and research institutes, and an EDA provider has tackled key challenges to enable statistical design in industry in a publicly funded project called "Sigma65". Sigma65 strives to provide key foundations for a change from traditional deterministic design methods to future design methods driven by statistical considerations. The project starts with statistical modeling and optimization of library components and extends to statistical techniques for designing ICs at gate level and higher levels. In this paper, we present some results of this project, demonstrating how the interaction between an industrial partner, research institutions and an EDA provider enables solutions that are applicable in the near future. After an overview of the industrial perspective on the current situation in dealing with variations, recent results on both statistical timing and power analysis are given. In addition, recent research advances on fast yield estimation concerning parametric timing yield are presented.
Keywords: Simulation, digital IC design, statistical timing analysis, statistical power analysis
Advances in chip-multiprocessor processing capabilities have led to increased power consumption and temperature hotspots. Maintaining the on-chip temperature is important for both power reduction and reliability. Achieving the highest performance while meeting the temperature constraint is a challenge. We develop analytical solutions for the optimal control of the frequency of each core in a chip multiprocessor. The objective is to minimize the makespan, i.e., the latest completion time over all tasks. We show that the optimal frequency policy is bang-bang when the temperature constraint is not active and exponential when the temperature constraint is active. We show a significant improvement in overall throughput with the proposed solution while all cores operate below the thermal limit.
As the number of cores continues to grow in both digital signal and general-purpose processors, tools that perform automatic scheduling from model-based designs are of increasing interest. This scheduling consists of statically distributing the tasks that constitute an application among the available cores of a multi-core architecture in order to minimize the final latency; the problem has been proven NP-complete. A static scheduling algorithm is usually described as a monolithic process that carries out two distinct functionalities: choosing the core to execute a specific function, and evaluating the cost of the generated solutions. This paper describes a scheduling module that splits these functionalities into two sub-modules. This division improves scalability in terms of schedule quality and computation time, and also separates the heuristic complexity from the precision of the architecture model.
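The split into a core-choice sub-module and a cost-evaluation sub-module can be sketched as a simple greedy list scheduler; the cost model below (no communication delays, tasks supplied in topological order) is an illustrative assumption, not the paper's heuristic:

```python
def schedule_tasks(tasks, cores, exec_time, deps):
    """Greedy static scheduler with the two functionalities split:
    cost_on() evaluates a candidate assignment, the loop chooses
    the core. tasks: list in topological order; exec_time maps
    (task, core) to run time; deps maps task -> predecessor list."""
    core_free = {c: 0 for c in cores}   # next free time per core
    finish = {}
    assignment = {}

    def cost_on(task, core):            # cost-evaluation sub-module
        ready = max((finish[p] for p in deps.get(task, ())), default=0)
        return max(ready, core_free[core]) + exec_time[(task, core)]

    for task in tasks:                  # core-choice sub-module
        best = min(cores, key=lambda c: cost_on(task, c))
        finish[task] = cost_on(task, best)
        core_free[best] = finish[task]
        assignment[task] = best
    return assignment, max(finish.values())
```

Swapping in a more precise cost model (e.g. one with inter-core communication delays) changes only `cost_on`, which is exactly the modularity the paper argues for.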
Recently proposed techniques for peak power management involve centralized decision-making and assume quick evaluation of the various power management states. These techniques do not prevent instantaneous power from exceeding the peak power budget, but instead trigger corrective action once the budget has been exceeded. Moreover, they are not suitable for many-core architectures (processors with tens or possibly hundreds of cores on the same die) due to an exponential explosion in the number of global power management states. In this paper, we present a hierarchical and a gradient-ascent-based technique for decentralized peak power management in many-core architectures. The proposed techniques prevent power from exceeding the peak power budget and enable the placement of several more cores on a die than the power budget would normally allow. We show up to 47% (33% on average) improvement in throughput for a given power budget. Our techniques outperform the static oracle by 22%.
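The gradient-ascent flavour of such a decentralized scheme can be sketched as follows; this controller is a minimal illustration of the idea, not the paper's algorithm:

```python
def rebalance_step(cores, delta=0.5):
    """One decentralized gradient-ascent step: shift `delta` watts
    from the core with the smallest marginal throughput-per-watt
    gradient to the core with the largest. Total power is conserved,
    so a peak budget that held before the step still holds after it.
    cores: {name: (power, gradient)}; returns {name: new_power}."""
    lo = min(cores, key=lambda c: cores[c][1])
    hi = max(cores, key=lambda c: cores[c][1])
    power = {c: p for c, (_p, _g) in cores.items() for p in [_p]}
    if lo != hi:
        moved = min(delta, power[lo])   # never drive a core negative
        power[lo] -= moved
        power[hi] += moved
    return power
```

Because each step only redistributes an already-feasible allocation, the budget is never exceeded even transiently, which is the key contrast with reactive schemes.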
This paper introduces a high-level trace qualification language and compiler that enable the user to define analysis tasks efficiently and to fully utilize the powerful features of Infineon's Multi-Core Debug Solution (MCDS) without having to understand its internals. The language and compiler are already in industrial use, where software development based on MCDS-enabled SoCs helps developers achieve better product quality and shorter product development cycles.
In this paper we present an adaptive technique to locally adjust the frequency of processing elements in an MPSoC. The proposed method, based on game theory, optimizes the system while fulfilling dynamic constraints. A telecom test case has been used to demonstrate the effectiveness of our technique. For the evaluated scenario, the proposed technique achieves up to a 20% latency gain and a 38% energy gain.
We propose a new methodology based on Mixed Integer Linear Programming (MILP) for determining the input values that will exercise a specified execution path in a program. In order to seamlessly handle variable values, pointers and arrays, and variable aliasing, our method uses memory addresses for data references. This implies a dynamic methodology in which all decisions are taken as the program executes. During execution, we gather constraints for the MILP problem, whose solution directly yields the input values for the desired path. We present results that demonstrate the effectiveness of this approach. The methodology has been implemented in a fully functional tool capable of handling medium-sized real programs written in the C language. Our work is motivated by the complexity of validating embedded systems and uses an approach similar to existing HDL functional vector generation. The joint solution of the MILP problems will provide a hardware/software co-validation tool.
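The dynamic constraint-gathering step can be sketched as follows; the branch-record format and the integer-negation rule below are illustrative assumptions, and the hand-off to an actual MILP solver is omitted:

```python
def gather_path_constraints(trace):
    """Turn the branch decisions of one concrete execution into
    linear constraints for an MILP solver.
    trace: list of (coeffs, rhs, taken) records, each meaning the
    branch condition sum(c*x) <= rhs over integer variables.
    A taken branch keeps the condition; a non-taken branch is
    negated to sum(c*x) >= rhs + 1, i.e. sum(-c*x) <= -rhs - 1."""
    constraints = []
    for coeffs, rhs, taken in trace:
        if taken:
            constraints.append((coeffs, '<=', rhs))
        else:
            constraints.append(([-c for c in coeffs], '<=', -rhs - 1))
    return constraints
```

Solving the accumulated constraint system then yields concrete input values that steer execution down the desired path.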
This paper describes how an efficient and deterministic multitasking run-time environment supporting the Ravenscar tasking model of Ada 2005 was implemented on the Atmel AVR32 UC3A microcontroller. The open source GNU Ada Compiler (GNAT GPL 2007) was also ported to AVR32 as a part of this work, making a working Ada development environment available on the architecture for the first time.
In this paper we propose a virtualization layer to handle program execution on reconfigurable computers, addressing one of their biggest problems: the management of the reconfigurable hardware in a multitasking environment. The virtualization layer is responsible for allocating the hardware at run time based on the status of the system. Furthermore, it provides a consistent and low-overhead interface that decouples software development from hardware design, making the software independent of the underlying reconfigurable hardware. This paper discusses the virtual layer's specification and components. Our preliminary results for a prototype simulated on the Molen hardware organization show competitive performance compared with an optimal solution.
Keywords: run-time support, reconfigurable computers, virtualization
While the compilation of imperative synchronous languages like Esterel has been widely studied, the separate compilation of synchronous modules has not, and remains a challenge. We propose a new compilation method inspired by traditional sequential code generation techniques to produce coroutines whose hierarchical structure reflects the control flow of the original source code. A minimalistic runtime system executes the separately compiled modules.
This paper summarizes a special session on multi-core/multi-processor system-on-chip (MPSoC) programming challenges. The current trend towards MPSoC platforms in most computing domains does not only mean a radical change in computer architecture. Even more important from a SW developer's viewpoint, the classical sequential von Neumann programming model needs to be overcome at the same time. Efficient utilization of MPSoC HW resources demands radically new models and corresponding SW development tools, capable of exploiting the available parallelism and guaranteeing bug-free parallel SW. While several standards are established in the high-performance computing domain (e.g. OpenMP), it is clear that more innovations are required for successful deployment of heterogeneous embedded MPSoCs. On the other hand, at least for the coming years, the freedom for disruptive programming technologies is limited by the huge amount of certified sequential code, which demands a more pragmatic, gradual tool and code replacement strategy.
Boolean satisfiability (SAT) solving has become an enabling technology with wide-ranging applications in numerous disciplines. These applications tend to be most naturally encoded using arbitrary Boolean expressions, but to use modern SAT solvers, one has to generate expressions in Conjunctive Normal Form (CNF). This process can significantly affect SAT solving times. In this paper, we introduce a new linear-time CNF generation algorithm. We have implemented our algorithm and have conducted extensive experiments, which show that our algorithm leads to faster SAT solving times and smaller CNF than existing approaches.
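A linear-time CNF generation in the spirit of the classical Tseitin transformation, the standard technique that such work builds on, looks like the sketch below (this is the textbook baseline, not the paper's new algorithm):

```python
def tseitin_cnf(expr, next_var):
    """Linear-time CNF encoding: one fresh variable and a constant
    number of clauses per Boolean operator.
    expr: nested tuples over ('and'|'or'|'not', ...) with positive
    integer leaves; next_var: largest variable index already in use.
    Returns a clause list; literals are signed ints."""
    clauses = []

    def encode(e):
        nonlocal next_var
        if isinstance(e, int):
            return e
        if e[0] == 'not':
            return -encode(e[1])
        a, b = encode(e[1]), encode(e[2])
        next_var += 1
        t = next_var
        if e[0] == 'and':   # t <-> (a & b)
            clauses.extend([[-t, a], [-t, b], [t, -a, -b]])
        else:               # 'or': t <-> (a | b)
            clauses.extend([[t, -a], [t, -b], [-t, a, b]])
        return t

    clauses.append([encode(expr)])  # assert the top-level output
    return clauses
```

The CNF size grows linearly with the expression size, and the choices made here (which sub-formulas get fresh variables, clause polarity) are exactly the knobs that affect downstream SAT solving times.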
In this paper we present a procedure for solving quantified Boolean formulas (QBF) that uses And-Inverter Graphs (AIGs) as the core data structure. We make extensive use of structural information extracted from the input formula, such as functional definitions of variables and non-linear quantifier structures. We show how this information can be exploited directly by the symbolic, AIG-based representation. We implemented a prototype QBF solver based on these ideas and performed a number of experiments demonstrating the effectiveness of our approach and, moreover, showing that our method is able to solve QBF instances on which state-of-the-art QBF solvers from the literature fail.
Bit-precise verification with variables modeled as bit-vectors has recently drawn much interest. However, a huge search space usually results after bit-blasting. To accelerate the verification of bit-vector formulae, we propose an efficient algorithm to discover non-uniform encoding widths of variables in the verification model, which may be smaller than their original modeling widths but sufficient to find a counterexample. Different from existing approaches, our algorithm is path-oriented, in that it takes advantage of the controllability and observability values in the structure of the model to guide the computation of the paths, their encoding widths, and the effective adjustment of these widths in subsequent steps. For path selection, a subset of single-bit path-controlling variables is set to constant values. This restricts the search away from paths deemed less favorable or already checked in previous steps, thus simplifying the problem. Experiments show that our algorithm can significantly speed up the search by focusing first on promising, easy paths when verifying path-intensive models, with a reduced, non-uniform bit-width encoding.
Emerging SDR baseband platforms are usually based on multiple DLP+ILP processors with massive parallelism. Although these platforms would theoretically enable advanced SDR signal processing, existing work has implemented only basic systems and simple algorithms. Importantly, MIMO is not fully supported in most implementations; prior work that did implement MIMO used only a simple linear detector. Our work explores the feasibility of SDR implementations of soft-output ML MIMO detectors, which bring 6-12 dB SNR gains compared to popular linear detectors. Although soft-output ML MIMO detectors are considered challenging even for ASICs, we combine architecture-friendly algorithms, application-specific instructions, code transformations and ILP/DLP explorations to make SDR implementations feasible. In our work, a 2x4 ADRES-based ASIP with 16-way SIMD can deliver 193 Mbps for 2x2 64-QAM and 368 Mbps for 2x2 16-QAM transmissions. To the best of our knowledge, this is the first work exploring SDR-based soft-output ML MIMO detectors.
The IEEE 802.15.4a amendment has introduced ultra-wideband impulse radio (UWB IR) as a promising physical layer for energy-efficient, low-data-rate communications. A critical part of UWB IR receiver design is the low-power implementation of the digital baseband processing required for synchronization and data decoding. In this paper we present the development of an application-specific instruction-set processor (ASIP) that is tailored to the requirements defined by the baseband algorithms. We report a number of optimizations applied to the algorithms as well as to the hardware architecture. These enable performance increases of up to 122x and energy consumption reductions of up to 90x compared to a 16-bit baseline architecture. Furthermore, the ASIP offers greater flexibility than an ASIC implementation due to its programmability.
A novel 16-bit flexible application-specific instruction-set processor (ASIP) for an MMSE-IC linear equalizer, used in iterative turbo receivers, is presented in this paper. The proposed ASIP has an SIMD architecture with a specialized instruction set and 7-stage pipeline control. It supports the diverse requirements of MIMO-OFDM wireless standards, such as QPSK, 16-QAM and 64-QAM modulation in 2x2 and 4x4 spatially multiplexed MIMO-OFDM environments. For these various operational modes, an analysis of the MMSE-IC LE equations and the corresponding complex data representations was conducted. Efficient sharing of computational and storage resources is achieved through: (1) Matrix Register Bank (MRB) multiplexing, (2) a 16-bit Complex Arithmetic Unit (CAU) comprising 4 combined complex adder/subtractor/multiplier units, 2 real multipliers, 5 complex adders, and 2 complex subtractors, and (3) flexible 32-bit to 16-bit data conversion at the multipliers' outputs. With this architecture, the designed ASIP ensures, along with flexibility, high performance in terms of throughput and area. Logic synthesis results reveal a maximum clock frequency of 546 MHz and a total area of 0.37 mm2 in 90 nm technology. For a 2x2 spatially multiplexed MIMO system, the proposed ASIP achieves a throughput of 273 MSymbol/s.
This paper presents a novel VLSI implementation of a MIMO detector for OFDM systems. The proposed architecture is able to perform both linear MMSE and lattice-reduction-aided MIMO detection, making it possible to adjust the balance between performance and power consumption. In order to facilitate real-time detection in the lattice-reduction-aided mode of operation, a novel fixed-complexity version of the LLL lattice reduction algorithm has been developed, allowing strict practical timing requirements, such as those specified for new-generation IEEE 802.11n wireless LAN systems, to be met. An implementation of the MIMO detector for a system employing up to 4 transmit and receive antennas is described, and its complexity and performance are evaluated.
Today, mobile and embedded real-time systems have to cope with the migration and allocation of multiple software tasks running on top of a real-time operating system (RTOS) residing on one or more system processors. RTOS simulation and timing analysis are applied for fast, early estimation in order to configure the RTOS to the individual needs of the application and environment. In this context, high accuracy of the simulation compared to an instruction set simulation (ISS) is of key importance. In this paper, we investigate the accuracy of abstract RTOS simulation and compare it to ISS and to the behavior of the physical system. We show that the accuracy of the simulation increases when we inject noise into the time model. Our results indicate that it is sufficient to inject uniformly distributed random time values into the RTOS real-time clock.
This paper presents an accurate and scalable implementation of an energy-aware simulator for wireless sensor networks (WSNs). Scalability and accuracy have been achieved through an energy-aware instrumentation of the instruction set simulator of the node's microcontroller and a functional SystemC TLM model of the radio module implementing the IEEE 802.15.4 protocol. The framework allows executing actual software and accurately evaluating its effect on the network lifetime. We validated the framework on a first prototype of a wireless sensor node. The methodology, compared against state-of-the-art simulators such as NS-2, represents a flexible and scalable solution for fast and accurate prototyping of WSN software.
Addressing both standby and active power is a major challenge in developing system-on-chip designs for battery-powered products. Powering off sections of logic or memories loses internal register and RAM state, so designers have to weigh the benefits and costs of implementing state retention on some or all of the power-gated subsystems where state recovery has a significant real-time or energy cost, compared to resetting the subsystem and re-acquiring state from scratch. Library IP and EDA tools can support state retention in hardware synthesized from standard RTL, but due to the silicon area costs there is strong interest in retaining only selected state, for example the "architectural state" of a CPU, to implement sleep modes. Currently there is no known rigorous technique for checking the integrity of selective state retention, owing to the complexity of checking that the correctness of the design is not compromised in any way. The complexity is exacerbated by the interaction between the retained and the non-retained state, and exhaustive simulation rapidly becomes infeasible. This paper presents a case study based on symbolic simulation for assisting designers in implementing selective retention correctly. The main finding of our study is that the programmer-visible, or architectural, state of the CPU needs to be implemented using retention registers, whereas micro-architectural enhancements such as pipeline registers, TLBs and caches can be implemented using normal registers without retention. This has a profound impact on power and area savings for chip design. By selectively retaining the state of the programmer's architectural model, and not the increasing proportion of extra state, one can incorporate energy-efficient sleep modes. To the best of our knowledge this is the first study in the area of rigorous design and implementation of selective state retention.
We propose a new method for the integral nonlinearity (INL) and differential nonlinearity (DNL) testing of D/A-A/D converter pairs employing the recently developed stimulus identification method. This allows both converters to be measured independently but simultaneously, without significant fault-masking problems. Simulations show that the INL and DNL estimation errors for 12-bit A/D and D/A converters are less than 0.5 least-significant-bit (LSB) units, and experimental tests give similar results.
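The static linearity metrics themselves follow directly from the textbook definitions; the sketch below computes INL and DNL from measured output levels using an endpoint-fit line, independently of the paper's stimulus identification method:

```python
def inl_dnl(levels):
    """INL and DNL in LSB units from measured converter output
    levels (one per input code, ascending), using an endpoint-fit
    line as the ideal transfer characteristic."""
    n = len(levels)
    lsb = (levels[-1] - levels[0]) / (n - 1)          # ideal step size
    dnl = [(levels[i + 1] - levels[i]) / lsb - 1 for i in range(n - 1)]
    inl = [(v - (levels[0] + i * lsb)) / lsb for i, v in enumerate(levels)]
    return inl, dnl
```

The stimulus identification method's role is to obtain accurate `levels` for both converters simultaneously; the metric computation is unchanged.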
This paper proposes a novel self-healing methodology for embedded RF low-noise amplifiers (LNAs) in RF sub-systems. The proposed methodology is based on oscillation principles, in which the device under test (DUT) itself generates the output test signature with the help of additional circuitry. The self-generated test signature from the DUT is analyzed using on-chip resources for testing the LNA and controlling its calibration knobs to compensate for multi-parameter variations in the LNA manufacturing process. Thus, the proposed methodology enables self-test and self-calibration of RF circuits without the need for an external test stimulus. The methodology is demonstrated through simulations as well as measurements performed on an RF LNA.
Linear Model-based Test and Diagnosis (MbT&D) has been successfully applied to single-block modules such as Digital-to-Analog Converters (DACs) with a static nonlinear transfer characteristic. For multi-block modules, a diagnosis methodology is needed that can deal with cascades of several linear and nonlinear blocks. In contrast to nonlinear methods, linear MbT&D methods only require matrix operations, with relatively low computational effort. A modification of linear MbT&D in combination with Volterra series is presented that can be applied to cascaded nonlinear systems, for example a DAC followed by a low-pass filter. It enables the simultaneous identification of numerous frequency-domain Volterra kernels and thus the testing of compliance with data-sheet specifications.
This paper discusses the generation of information-rich, arbitrarily large synthetic data sets that can be used to (a) efficiently learn tests that correlate a set of low-cost measurements with a set of device performances and (b) grade such tests with parts-per-million (PPM) accuracy. This is achieved by sampling a non-parametric estimate of the joint probability density function of measurements and performances. Our case study is an ultra-high-frequency receiver front-end, and the focus of the paper is to learn the mapping between a low-cost test measurement pattern and a single pass/fail test decision that reflects compliance with all performances. The small fraction of devices for which such a test decision is prone to error is identified and retested through standard specification-based test. The mapping can be tuned to explore thoroughly the tradeoff between test escapes, yield loss, and the percentage of retested devices.
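The core sampling step, drawing synthetic devices from a non-parametric density estimate, can be sketched with a Gaussian kernel; the single shared bandwidth below is a simplification of what a production estimator would use:

```python
import random

def sample_kde(data, bandwidth):
    """Draw one synthetic device from a Gaussian-kernel estimate of
    the joint (measurements, performances) density: pick a measured
    device uniformly at random, then add kernel noise to each
    coordinate. data: list of equal-length tuples of floats."""
    base = random.choice(data)
    return tuple(x + random.gauss(0.0, bandwidth) for x in base)
```

Repeating this millions of times yields the arbitrarily large data sets needed to grade a test decision with PPM-level resolution.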
The process of sequential redundancy identification is the cornerstone of sequential synthesis and equivalence checking frameworks. The scalability of the proof obligations inherent in redundancy identification hinges not only upon the ability to cross-assume those redundancies, but also upon the way in which these assumptions are leveraged. In this paper, we study the technique of speculative reduction for efficiently modeling redundancy assumptions. We provide theoretical and experimental evidence to demonstrate that speculative reduction is fundamental to the scalability of the redundancy identification process under various proof techniques. We also propose several techniques to speed up induction-based redundancy identification. Experiments demonstrate the effectiveness of our techniques in enabling substantially faster redundancy identification, up to six orders of magnitude on large designs.
The ability of logic transformations to enhance safety property checking has been well-established, and many industrial-strength verification solutions accordingly rely upon a variety of synthesis and abstraction techniques for speed and scalability. However, little prior work has addressed the applicability of such transformations in the domain of liveness checking. In this paper, we provide the theoretical foundation to enable the efficient use of a variety of (possibly customized) transformations in a liveness-checking framework. We demonstrate the practical utility of this theory on a variety of complex verification problems.
Constraints represent a key component of state-of-the-art verification tools based on compositional approaches and assume-guarantee reasoning. In recent years, most of the research efforts on verification constraints have focused on defining formats and techniques to encode, or to synthesize, constraints starting from the specification of the design. In this paper, we analyze the impact of constraints on the performance of model checking tools, and we discuss how to effectively exploit them. We also introduce an approach to explicitly derive verification constraints hidden in the design and/or in the property under verification. Such constraints may simply come from true design constraints, embedded within the properties, or may be generated in the general effort to reduce or partition the state space. Experimental results show that, in both cases, we can reap benefits for the overall verification process in several hard-to-solve designs, where we obtain speed-ups of more than one order of magnitude.
Model Checking is an automated formal method for verifying whether a finite-state system satisfies a user-supplied specification. The usefulness of the verification result depends on how well the specification distinguishes intended from non-intended system behavior. Vacuity is a notion that helps formalize this distinction in order to improve the user's understanding of why a property is satisfied. The goal of this paper is to expose vacuity in a property in a way that increases our knowledge of the design. Our approach, based on abstraction refinement, computes a maximal set of atomic subformula occurrences that can be strengthened without compromising satisfaction. The result is a shorter and stronger and thus, generally, more valuable property. We quantify the benefits of our technique on a substantial set of circuit benchmarks.
In the digital VLSI design cycle, logic transformations are often required to modify the design to meet different synthesis and optimization goals. Logic transformations on sequential circuits are hard to perform due to the vast underlying solution space. This paper proposes an SPFD-based sequential logic transformation methodology that tackles the problem with no sacrifice in performance. It first presents an efficient approach to construct approximate SPFDs (aSPFDs) for sequential circuits. It then demonstrates an algorithm that uses aSPFDs to perform the desired sequential logic transformations using both combinational and sequential don't-cares. Experimental results show the effectiveness and robustness of the approach.
Variable-latency designs may improve the performance of circuits in which the worst-case delay paths are infrequently activated. Telescopic units emerged as a scheme to automatically synthesize variable-latency circuits. In this paper, a novel approach is proposed that brings three main contributions with respect to the methods used for telescopic units: first, no multi-cycle timing analysis is required to ensure the correctness of the circuit; second, the method can be applied to large circuits; third, the circuit can be optimized for the most frequent input patterns. The approach is based on finding approximations of critical nodes in the netlist that substitute for the exact behavior; two cycles are required when the approximations are not correct. These approximations can be obtained by simulating traces applied to the circuit. Experimental results on selected examples show a tangible speedup (15%) with a small area overhead (3%).
Accurate timing analysis is crucial for obtaining the optimal clock frequency, and for other design stages such as power analysis. Most methods for estimating propagation delay identify multi-cycle paths (MCPs), which allow timing to be relaxed, but ignore the set of reachable states, achieving scalability at the cost of a severe lack of precision. Even simple circuits contain paths affecting timing that can only be detected if the set of reachable states is considered. We examine the theoretical foundations of MCP identification and characterise the MCPs in a circuit by a fixed-point equation. The optimal solution to this equation can be computed iteratively and yields the largest set of MCPs in a circuit. Further, we define conservative approximations of this set, show how different MCP identification methods in the literature compare in terms of precision, and show one method to be unsound. The practical application of these results is a new method to detect multi-cycle paths using techniques for computing invariants in a circuit. Our implementation performs well on several benchmarks, including an exponential improvement on circuits analysed in the literature.