DATE 2008 ABSTRACTS




Keynote Addresses

Designing Micro/Nano Systems for a Safer and Healthier Tomorrow [p. 1]
G. De Micheli

The ongoing scaling and hybridisation of manufacturing technologies enable us to attain unprecedented levels of performance as well as to integrate electronic and fluidic circuits with sensors and actuators. Smart micro/nano systems will be the building blocks of wearable and ambient systems that gather and integrate heterogeneous data in real time and operate and communicate in a wireless and ultra-low-power mode. These systems will foster a revolution in health and environmental management, with the final objective of improving security and quality of life. At the same time, they will create a large market of components and systems, and a renewed perspective for electronic design and manufacturing companies. To accomplish such an ambitious goal, new technologies and architectures must be matched and tailored to the operational environment by solving novel and challenging design and optimisation problems, through the creation of novel design methodologies and tools.

Perspective on Embedded Systems: Challenges, Solutions and Research Priorities [p. 2]
D. Vernay

The societal demands in Europe for Health, Security & Safety, Energy & Environment, and the market demands in nomadic, transport, communications and entertainment products, call for innovation and technical leadership. Enabling embedded systems requires challenging new solutions such as multi-physics devices, millions of interconnected nodes, very low power for autonomy, trusted and safe operation, and reliability. The talk will introduce the THALES vision and research priorities for embedded systems and will illustrate them through presentations of solutions and on-going research projects and initiatives. The THALES effort related to mission-critical systems is focused on advanced high-performance embedded computing platforms, on middleware technologies, on software systems design and verification tools for safety and security and on the emergence of open standards in these domains. THALES is also actively contributing to the development of innovation eco-systems: the Joint Undertaking ARTEMIS in Europe; the Pôle de Compétitivité SYSTEM@TIC PARIS REGION in France.


1.2: Transaction-Level Modelling (TLM)

Moderators: S. Bocchio, STMicroelectronics, IT; W. Mueller, Paderborn U, DE
Cycle-approximate Retargetable Performance Estimation at the Transaction Level [p. 3]
Y. Hwang, S. Abdi and D. Gajski

This paper presents a novel cycle-approximate performance estimation technique for automatically generated transaction level models (TLMs) for heterogeneous multicore designs. The inputs are application C processes and their mapping to processing units in the platform. The processing unit model consists of a pipelined datapath, a memory hierarchy and a branch delay model. Using the processing unit model, the basic blocks in the C processes are analyzed and annotated with estimated delays. This is followed by a code generation phase where delay-annotated C code is generated and linked with a SystemC wrapper consisting of inter-process communication channels. The generated TLM is compiled and executed natively on the host machine. Our key contribution is that the estimation technique is close to cycle-accurate, can be applied to any multi-core platform, and produces high-speed natively compiled TLMs. For experiments, timed TLMs for industrial-scale designs such as an MP3 decoder were automatically generated for 4 heterogeneous multi-processor platforms with up to 5 PEs in under 1 minute. Each TLM simulated in under 1 second, compared to 3-4 hrs of instruction set simulation (ISS) and 15-18 hrs of RTL simulation. Comparison to on-board measurement showed only 8% error on average in the estimated number of cycles.

A Method for the Efficient Development of Timed and Untimed Transaction-Level Models of Systems-on-Chip [p. 9]
J. Cornet, F. Maraninchi and L. Maillet-Contoz

Transaction Level Modeling (TLM) captures abstract models of Systems-on-Chip that simulate faster than traditional RTL simulations and are available earlier in the design flow. Such models allow the development of the embedded software on a virtual prototype of the hardware, before the chip is available. Various levels of detail in TL models are needed; using untimed and timed models for different purposes is common practice. We present a method for developing very abstract untimed models first, and then enriching them to get detailed timed models, while preserving the functionality. The timed models can be as rich as the models usually written from scratch. The experiments with industrial case studies show improved simulation speed and reduced modeling effort for both untimed and timed models.

Integrating RTL IPs into TLM Designs Through Automatic Transactor Generation [p. 15]
N. Bombieri, N. Deganello and F. Fummi

Transaction Level Modeling (TLM) is an emerging design practice for overcoming increasing design complexity. It aims at simplifying the design flow of embedded systems by designing and verifying a system at different abstraction levels. In this context, transactors play a fundamental role since they allow communication between system components implemented at different abstraction levels. Reuse of RTL IPs in TLM systems is a meaningful example of a key advantage gained by exploiting transactors. Nevertheless, transactor implementation is still manual, tedious and error-prone, and the effort spent to verify their correctness often outweighs the benefits of the TLM-based design flow. In this paper we present a methodology to automatically generate transactors for RTL IPs. We show how the transactor code can be automatically generated by exploiting the testbench of any RTL IP.


1.3: Invited Industrial Session - Industrial System Designs in Transportation and Information Technologies

Moderators: B. Candaele, Thales, FR; L. Fanucci, Pisa U, IT
Tailored Solutions for Safety-Installations in the Loetschberg Tunnel - A Project with Importance for the Trans-European Rail Traffic [p. 21]
W. Fuß

The Loetschberg base tunnel was the largest project for the Swiss railway infrastructure in the last five years. With a length of 34.6 km it is the third longest tunnel in the world at present. The maximum permitted speed is 250 km/h. The project comprised four interlocking stations, ETCS Level 2 and additional automatic functions to handle the traffic through this tunnel in an optimized way. To fulfill all the safety requirements and meet the challenges of reliability and maintainability of this very long tunnel, many new functions had to be realized in each of the mentioned systems. These demands were met by following a unified approach using certified hardware and middleware and by extending the scope of testing to a total system simulation. This article focuses on the challenges for the interlocking system LockTrac 6131 ELEKTRA.

On the Verification of High-Order Constraint Compliance in IC Design [p. 26]
J. Freuer, G. Jerke, J. Gerlach and W. Nebel

The increasing quality requirements on safety-critical electronic components and the rapid technological progress necessitate compliance with all specified functional and non-functional design constraints. This paper introduces a novel verification method based on a unified data representation of constraints to enable multi-tool verification tasks. A Constraint Engineering System is presented which provides flexible, extensible, and multi-tool definitions of complex constraints and high-order verification tasks. Existing verification and simulation tools are combined so that the achieved complexity level of the high-order verification by far exceeds the level of the single tools. The shown examples target practical applications in analog system design and demonstrate the flexibility and the potential of this new verification approach.

Industrial IP Integration Flows Based on IP-XACT™ Standards [p. 32]
W. Kruijtzer, P. van der Wolf, E. de Kock, J. Stuyt, W. Ecker, A. Mayer, S. Hustin, C. Amerijckx, S. de Paoli and E. Vaumorin

Effective integration of advanced Systems-on-Chip (SoC) requires extensive reuse of IP modules as well as automation of the IP integration process, including verification. Key enablers for this are standards to describe and package IP modules. We focus on the IP-XACT standards and demonstrate how these standards are deployed in three industrial IP integration flows. Further, we report on two future extensions to IP-XACT that are currently being explored in the SPRINT project, i.e. IP-XACT based verification software generation and IP-XACT based configuration of debug environments. We conclude that IP-XACT is enabling powerful IP integration methodologies and that future extensions can further increase the effectiveness of IP-XACT standards.


1.4: Application of Reconfigurable and Adaptive Systems

Moderators: C. Heer, Infineon Technologies, DE; M. Hübner, Karlsruhe U, DE
A Reconfigurable Application Specific Instruction Set Processor for Convolutional and Turbo Decoding in a SDR Environment [p. 38]
T. Vogt and N. Wehn

Future mobile and wireless communication networks require flexible modem architectures to support seamless services between different network standards. Hence, a common hardware platform that can support multiple protocols implemented or controlled by software, generally referred to as software defined radio (SDR), is essential. This paper presents a family of dynamically reconfigurable application-specific instruction-set processors (ASIP) for the application domain of channel coding in wireless communication systems. As a weakly programmable IP core, it can implement trellis based channel decoding in an SDR environment. It features binary convolutional decoding, and turbo decoding for binary as well as duobinary turbo codes for all current and upcoming standards. The ASIPs consist of a specialized pipeline with 15 stages and a dedicated communication and memory infrastructure. Logic synthesis revealed a maximum clock frequency of 400 MHz and a total area of 0.42 mm² for a 65 nm technology. Simulation results for Viterbi and turbo decoding demonstrate maximum throughput of 196 and 34 Mbps, respectively, and outperform existing SDR based approaches for channel decoding.

Using Reconfigurable Logic to Optimise GPU Memory Accesses [p. 44]
B. Cope, P. Y. K. Cheung and W. Luk

Memory access patterns common in video processing algorithms, which are unsuited to the GPU (Graphics Processing Unit) memory system, are identified. We develop REDA (Reconfigurable Engine for Data Access) to improve GPU performance for such access patterns, by employing reconfigurable logic for address mapping. It is shown that a sixty-fold reduction in the number of video memory accesses can be achieved for previously unsuited access patterns, with no detriment to well-suited patterns. Surprisingly, memory access locality is also improved.

Cost- and Power-Optimized FPGA-Based System Integration: Methodologies and Integration of a Low-Power Capacity-Based Measurement Application on Xilinx FPGAs [p. 50]
K. Paulsson, M. Hübner and J. Becker

The application of Field Programmable Gate Arrays (FPGAs) in low-power and low-cost industrial mass products has become an important issue for designers of electronic systems. The flexibility and performance offered by reconfigurable hardware architectures often come at the cost of increased power consumption in comparison to Application Specific Integrated Circuit (ASIC) solutions. By exploiting the flexibility of reconfigurable hardware architectures, e.g. the capability of run-time HW reconfiguration of some modern FPGA devices, the power consumption of FPGA-based solutions can be further decreased. This paper presents an approach for cost- and power-optimized system integration of a low-power capacity-based measurement system by exploiting the dynamic and partial reconfiguration capability of Xilinx FPGAs.
Keywords: Low-power applications, reconfigurable architectures, hardware reconfiguration

Design Flow for Embedded FPGAs Based on a Flexible Architecture Template [p. 56]
B. Neumann, T. Von Sydow, H. Blume and T. G. Noll

Modern digital signal processing applications have an increasing demand for computational power while needing to preserve low power dissipation and high flexibility. For many applications, the growth of algorithmic complexity is already faster than the growth of computational power provided by discrete general purpose processors [1]. A typical approach to address this problem is the combination of a processor core with dedicated accelerators. Since changes in standards or algorithms can change the demands on the accelerators, an attractive alternative to highly customised VLSI macros is the use of reconfigurable embedded FPGAs (eFPGAs). The first commercial products combining a general purpose processor core and an embedded FPGA recently emerged (e.g. Stretch S6000 [2], Menta eFPGA-augmented CPUs [3]). For many digital signal processing applications, a significantly improved efficiency in terms of power dissipation, throughput and chip area can be achieved by tailoring both the processor core and the reconfigurable accelerator to the given application domain [4]. In this work, a methodology to design highly customisable eFPGA architectures starting from a high-level description is presented. The design framework elaborated during this work enables a physically optimised VLSI design of the specified eFPGA and aims to support simulation of the corresponding eFPGA macros at both the functional and the netlist level by providing an elementary configuration tool based on the same high-level description as the eFPGA architecture.


1.5: Advances in BIST Techniques for Mixed-Signal Devices

Moderators: H. Kerkhoff, Twente U/ CTIT-TDT, NL; J. Machado Da Silva, INESC, PT
Optimal High-Resolution Spectral Analyzer [p. 62]
A. Tchegho, H. Mattes and S. Sattler

This paper presents a new application field for the Goertzel algorithm. The test of mixed-signal circuits involves the generation and analysis of signals. A standard method for signal analysis is the Fast Fourier Transform (FFT). Such complex algorithms are not suitable for BIST (Built-In Self-Test) or BOST (Built-Off Self-Test) solutions due to their high demand for resources. In this paper, the Goertzel algorithm is presented as an alternative to FFT algorithms. A new optimized structure of the Goertzel algorithm and its implementation in an FPGA (Field Programmable Gate Array) are presented. A comparison within the scope of the production test of RF transceiver devices shows a considerable reduction of the test time (factor 6) and resources (factor 10) compared to an FFT software solution and an FFT hardware solution, respectively.
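
As an illustration of why the Goertzel algorithm is attractive when only a few spectral lines are needed, the following is a minimal sketch of the textbook second-order recurrence in Python. It is not the optimized structure proposed in the paper; the function name `goertzel_power` and the test-tone parameters are assumptions chosen for the example.

```python
import math

def goertzel_power(samples, sample_rate, target_freq):
    """Squared magnitude of the DFT bin nearest to target_freq (textbook Goertzel)."""
    n = len(samples)
    k = round(n * target_freq / sample_rate)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        # One multiply and two adds per sample; no buffer of the whole block is needed.
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

# Hypothetical test signal: a 1 kHz unit-amplitude sine sampled at 8 kHz.
fs = 8000.0
n = 200
tone = [math.sin(2.0 * math.pi * 1000.0 * i / fs) for i in range(n)]
p_on = goertzel_power(tone, fs, 1000.0)   # bin that contains the tone
p_off = goertzel_power(tone, fs, 2000.0)  # bin that does not
print(p_on, p_off)
```

Each monitored frequency costs one pass over the samples with constant state, which is the property that makes the recurrence a candidate for resource-constrained test hardware.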

A General Method to Evaluate RF BIST Techniques Based on Non-Parametric Density Estimation [p. 68]
H.-G. Stratigopoulos, J. Tongbong and S. Mir

We present a general method to evaluate RF Built-In Self-Test (BIST) techniques during the design stage. In particular, the adaptive kernel estimator is used to construct an estimate of the joint probability density function of the performances of the RF device under test and the actual BIST measurements. The density is sampled to generate a large volume of new data, which is subsequently used to estimate the relevant test metrics with parts per million (ppm) accuracy given the BIST limits. Thus, the BIST limits can be set to obtain the desired trade-offs between different test metrics. The proposed method aims to assist designers in comparing RF BIST techniques on the basis of accurately calculated test metrics and to provide information for early BIST refinements, thus reducing the design cycles. The method is demonstrated for a previously published RF BIST technique [1] applied to an LNA.
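
The sample-then-count idea described above can be sketched as follows. This is a simplified illustration, not the authors' method: it uses a fixed-bandwidth Gaussian kernel rather than the adaptive kernel estimator, a single performance and a single BIST measurement, and synthetic input data. All names, bandwidths and limits are hypothetical.

```python
import random

def sample_kde(data, bandwidths, n, rng):
    """Draw n points from a Gaussian KDE: pick a data point, then add kernel noise."""
    out = []
    for _ in range(n):
        perf, meas = rng.choice(data)
        out.append((rng.gauss(perf, bandwidths[0]), rng.gauss(meas, bandwidths[1])))
    return out

def estimate_test_metrics(samples, spec_min, bist_min):
    """Return (test escape, yield loss) as fractions of all sampled devices.

    Test escape: performance violates the spec but the BIST measurement passes.
    Yield loss:  performance meets the spec but the BIST measurement fails.
    """
    escapes = sum(1 for p, m in samples if p < spec_min and m >= bist_min)
    losses = sum(1 for p, m in samples if p >= spec_min and m < bist_min)
    return escapes / len(samples), losses / len(samples)

rng = random.Random(0)
# Synthetic stand-in for a small set of simulated (performance, BIST measurement) pairs.
data = []
for _ in range(200):
    perf = rng.gauss(15.0, 1.0)
    data.append((perf, perf + rng.gauss(0.0, 0.5)))

samples = sample_kde(data, (0.2, 0.2), 20000, rng)
escape, yield_loss = estimate_test_metrics(samples, spec_min=13.5, bist_min=13.5)
print(escape, yield_loss)
```

Sweeping `bist_min` over the same sample set shows the trade-off the abstract mentions: a stricter limit lowers test escape at the price of higher yield loss. Resolving metrics at the ppm level would need far more samples than this sketch draws.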

Diagnostic Analysis of Static Errors in Multi-Step Analog to Digital Converters [p. 74]
A. Zjajo and J. Pineda De Gyvez

A new approach for diagnostic analysis of static errors in multi-step ADCs based on the steepest-descent method is proposed. To set initial data, estimate the parameter update and guide the test, dedicated sensors have been designed. The information obtained through monitoring process variations is re-used to supplement the circuit calibration. The technique also allows the test procedure to test only for the most likely group of faults induced by a manufacturing process. The implemented design-for-test approach permits circuit reconfiguration in such a way that all sub-blocks are tested for their full input range, allowing full observability and controllability of the device under test.

Practical Implementation of a Network Analyzer for Analog BIST Applications [p. 80]
M.J. Barragán, D. Vázquez and A. Rueda

This paper presents a practical implementation of a network analyzer for analog BIST applications. The network analyzer consists of a sinewave generator and a sinewave evaluator based on switched-capacitor techniques. Both the generator and the evaluator have been integrated in a 0.35 μm CMOS technology. The functionality of the system has been proved in the lab. For this purpose, a demonstrator board has been developed including the proposed network analyzer and a filter as DUT. Measurements in the lab demonstrate a dynamic range of 70 dB in the frequency range up to 20 kHz.


1.6: HOT TOPIC - Quantitative Evaluation in Embedded Systems Design

Organizer: B.R. Haverkort, Twente U, NL
Moderator: R. Hersemeule, RWTH Aachen U, DE

Quantitative Evaluation in Embedded System Design: Trends in Modeling and Analysis Techniques [p. 86]
J.-P. Katoen

The evaluation of extra-functional properties of embedded systems, such as reliability, timeliness, and energy consumption, as well as dealing with uncertainty, e.g., in the timing of events, is becoming more and more important. What are the models and approaches to analyze such properties in a reliable way? We survey some main developments and trends in the modeling and analysis of these aspects and stress the importance of approaches that tackle both extra-functional and correctness aspects.

Quantitative Evaluation in Embedded System Design: Validation of Multiprocessor Multithreaded Architectures [p. 88]
N. Coste, H. Garavel, H. Hermanns, R. Hersemeule, Y. Thonnart and M. Zidouni

As levels of parallelism are becoming increasingly complex in multiprocessor architectures, GALS, and asynchronous circuits, methodologies and software tools are needed to verify their functional behavior (qualitative properties) and to predict their performance (quantitative properties). This paper presents the work currently done in the Multival project (pôle de compétitivité mondial Minalogic), in which verification and performance evaluation tools developed at INRIA and Saarland University are applied to three industrial architectures designed by Bull, CEA/Leti and STMicroelectronics.

Quantitative Evaluation in Embedded System Design: Predicting Battery Lifetime in Mobile Devices [p. 90]
L. Cloth and B.R. Haverkort

In the design process of an (embedded) computer system there are several important attributes the developer has to take care of. First of all, the final product should do the right thing; we then speak of functional correctness. Second, the performance should be adequate, expressed in measures such as throughput, delay or loss probability. Third, when relying on a battery as power source, it becomes increasingly important that the system behaves in an energy-aware manner. We could assess any of the three attributes in isolation, using completely different sets of models and tools. However, since the alteration of one of the attributes most surely also affects the other two, an integrated framework where all aspects can be evaluated and balanced is definitely desirable. We present such an integrated approach, but focus on the evaluation of battery lifetime. The system under consideration is represented by a stochastic workload model which is then combined with a battery model. In doing so, several design alternatives in the behaviour of the system can be compared early in the design process and the optimum with respect to functionality, performance and energy consumption can be chosen.


1.7: System-Level Power Management and Energy Harvesting

Moderators: D. Stroobandt, Ghent U, BE; T. Ishihara, Kyushu U, JP
A Framework of Stochastic Power Management Using Hidden Markov Model [p. 92]
Y. Tan and Q. Qiu

The effectiveness of stochastic power management relies on an accurate system and workload model and on effective policy optimization. Workload modeling is a machine learning procedure that finds the intrinsic pattern of the incoming tasks based on the observed workload attributes. Markov Decision Process (MDP) based models have been widely adopted for stochastic power management because they deliver provably optimal policies. Given a sequence of observed workload attributes, the hidden Markov model (HMM) of the workload is trained. If the observed workload attributes and states in the workload model do not have one-to-one correspondence, the MDP becomes a Partially Observable Markov Decision Process (POMDP). This paper presents a framework of modeling and optimization for stochastic power management using HMM and POMDP. The proposed technique discovers the HMM of the workload by maximizing the likelihood of the observed attribute sequence. The POMDP optimization is formulated and solved as a quadratically constrained linear program (QCLP). Compared with the traditional optimization technique, which is based on value iteration, the QCLP-based optimization provides a superior policy by enabling stochastic control.
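
The quantity that HMM training maximizes, the likelihood of an observed attribute sequence under a given model, can be computed with the standard forward algorithm. The sketch below shows only that evaluation step, not the training procedure or the QCLP policy optimization from the paper. The two-state workload model and all of its probabilities are hypothetical values chosen for illustration.

```python
def sequence_likelihood(obs, init, trans, emit):
    """Forward algorithm: probability of the observation sequence under the HMM.

    init[s]     : probability of starting in hidden state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of observing symbol o while in state s
    """
    n_states = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)

# Hypothetical workload model: hidden states 0 = "idle phase", 1 = "busy phase";
# observed attribute 0 = "long inter-arrival time", 1 = "short inter-arrival time".
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

likelihood = sequence_likelihood([0, 0, 1], init, trans, emit)
print(likelihood)
```

A training loop would adjust `init`, `trans` and `emit` to increase this value for the recorded attribute sequence. For long sequences a practical implementation would rescale `alpha` at each step or work in log space to avoid underflow.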

Harvesting Wasted Heat in a Microprocessor Using Thermo-Electric Generators: Modeling, Analysis And Measurement [p. 98]
Y. Zhou, S. Paul and S. Bhunia

Harvesting energy from previously unemployed ambient sources can play an important role in saving energy and reducing the dependence of an electronic system on primary power sources (AC power or battery). High-performance integrated circuits such as microprocessors typically suffer from high surface temperature (on the order of 80-100°C) resulting from the high power density and limited cooling capacity of the package. In this paper, we consider the scope of harvesting thermoelectric energy from the wasted heat in a microprocessor by leveraging the temperature gradient between the processor die surface and the environment. First, we develop an analytical model to accurately estimate the recycled energy considering the non-uniformity of the temperature distribution on the die surface. Next, we analyze the effectiveness of the approach for thermoelectric generators (TEGs) with different efficiencies (measured in terms of the figure of merit, ZT) under varying processor workload. Finally, we propose a possible arrangement for using the TEG on a processor and provide measurement results on the amount of harvested energy. The measurements on a Pentium III processor running at 1 GHz show that we can harvest ~7 mW of power from the processor for average workload using a commercial TEG.
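
To give a feel for how the figure of merit ZT bounds conversion efficiency, the sketch below evaluates the standard textbook expression for the maximum efficiency of an ideal thermoelectric generator operating between a hot side at T_h and a cold side at T_c: eta = (1 - T_c/T_h) * (sqrt(1 + ZT) - 1) / (sqrt(1 + ZT) + T_c/T_h). This is not the analytical model developed in the paper, which also accounts for non-uniform die temperature. The ambient temperature and the ZT values are assumptions for illustration.

```python
import math

def teg_max_efficiency(t_hot_k, t_cold_k, zt):
    """Ideal maximum TEG efficiency for hot/cold side temperatures in kelvin."""
    carnot = 1.0 - t_cold_k / t_hot_k
    root = math.sqrt(1.0 + zt)
    return carnot * (root - 1.0) / (root + t_cold_k / t_hot_k)

t_hot = 273.15 + 90.0   # die surface, inside the 80-100 °C range quoted above
t_cold = 273.15 + 25.0  # assumed ambient temperature (not from the abstract)

for zt in (0.5, 1.0, 2.0):  # hypothetical figures of merit
    eta = teg_max_efficiency(t_hot, t_cold, zt)
    print(f"ZT = {zt}: maximum efficiency {eta:.2%}")
```

The expression stays below the Carnot limit for any finite ZT and grows with ZT, which is why the achievable harvest depends so strongly on the thermoelectric material.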

An Efficient Solar Energy Harvester for Wireless Sensor Nodes [p. 104]
D. Brunelli, L. Benini, C. Moser and L. Thiele

Solar harvesting circuits have recently been proposed to increase the autonomy of embedded systems. One key design challenge is how to optimize the efficiency of solar energy collection under non-stationary light conditions. This paper proposes a scavenger that exploits miniaturized photovoltaic modules to perform automatic maximum power point tracking at a minimum energy cost. The system adjusts dynamically to the light intensity variations and its measured power consumption is less than 1 mW. Experimental results show increments of global efficiency up to 80%, diverging from the ideal situation by less than 10%, and demonstrate the flexibility and the robustness of our approach.

Temperature Control of High-Performance Multi-core Platforms Using Convex Optimization [p. 110]
S. Murali, A. Mutapcic, D. Atienza, R. Gupta, S. Boyd, L. Benini and G. De Micheli

With technology advances, the number of cores integrated on a chip and their speed of operation are increasing. This, in turn, is leading to a significant increase in chip temperature. Temperature gradients and hot-spots not only affect the performance of the system, but also lead to unreliable circuit operation and affect the life-time of the chip. Meeting the temperature constraints and reducing the hot-spots are critical for achieving reliable and efficient operation of complex multi-core systems. In this work, we present Pro-Temp, a convex optimization based method that pro-actively controls the temperature of the cores, while minimizing the power consumption and satisfying application performance constraints. The method guarantees that the temperature of the cores is below a user-defined threshold at all instances of operation, while also reducing the hot-spots. We perform experiments on several realistic multi-core benchmarks, which show that the proposed method guarantees that the cores never exceed the maximum temperature limit, while matching the application performance requirements. We compare this to traditional methods, where we find several temperature violations during the operation of the system.
Keywords: Thermal-aware design, temperature control, dynamic frequency scaling, static and dynamic optimization


2.2: Heterogeneous System Modelling, Analysis and Implementation

Moderators: J. Lilius, Abo Akademi U, FI; A. Fouilliart, Thales Communications, FR
Parametric Throughput Analysis of Synchronous Data Flow Graphs [p. 116]
A. H. Ghamarian, M.C.W. Geilen, T. Basten and S. Stuijk

Synchronous Data Flow Graphs (SDFGs) have proved to be a very successful tool for modeling, analysis and synthesis of multimedia applications targeted at both single- and multiprocessor platforms. One of the most prominent performance constraints of concurrent real-time applications is throughput. For given actor execution times, throughput can be verified by analyzing the SDFG models of such applications, for instance using maximum cycle mean analysis or state space analysis. In various contexts, such as design space exploration or run-time reconfiguration, many fast throughput computations are required for varying actor execution times. We present methods to compute throughput of an SDFG where actor execution times can be parameters. The throughput of these graphs is obtained in the form of a function of these parameters. Recalculation of throughput is then merely an evaluation of this function for specific parameter values, which is much faster than the standard throughput analysis. We propose three different algorithms for parametric throughput analysis and evaluate these algorithms experimentally, showing the feasibility of the approach and showing that a divide and conquer algorithm performs best.
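
The idea of obtaining throughput as a function of the execution-time parameters can be illustrated for the simplest case. The sketch below assumes a strongly connected homogeneous SDFG (every rate equal to one) whose simple cycles have been listed by hand; for such graphs the throughput is the reciprocal of the maximum cycle mean, i.e. the largest ratio of total execution time to initial tokens over all cycles. It is not one of the three algorithms proposed in the paper, and the graph, actor names and execution times are hypothetical.

```python
def make_throughput_function(cycles):
    """Build throughput(exec_times) for a homogeneous SDFG.

    cycles: list of (actor_names, initial_tokens), one entry per simple cycle.
    The returned function only evaluates a closed-form expression, so it can be
    called repeatedly for new parameter values without re-analysing the graph.
    """
    def throughput(exec_times):
        cycle_means = [
            sum(exec_times[a] for a in actors) / tokens
            for actors, tokens in cycles
        ]
        return 1.0 / max(cycle_means)
    return throughput

# Hypothetical graph: cycle A-B holds 1 initial token, cycle B-C holds 2.
cycles = [(("A", "B"), 1), (("B", "C"), 2)]
thr = make_throughput_function(cycles)

print(thr({"A": 2.0, "B": 3.0, "C": 5.0}))  # cycle A-B is critical
print(thr({"A": 2.0, "B": 3.0, "C": 9.0}))  # cycle B-C becomes critical
```

The second call shows why the result is piecewise in the parameters: which cycle limits throughput changes as the execution times vary.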

Introducing Preemptive Scheduling in Abstract RTOS Models Using Result Oriented Modeling [p. 122]
G. Schirner and R. Dömer

With the increasing SW content of modern SoC designs, modeling and development of Hardware Dependent Software (HDS) become critical. Previous work addressed this by introducing abstract RTOS modeling [6], which exposes dynamic scheduling effects early in the system design flow. However, such models insufficiently capture preemption. In particular, the accuracy of preemption depends on the granularity of the timing annotation. For an accurately modeled interrupt response time, very fine-grained timing annotation is necessary, which contradicts the RTOS abstraction idea and is detrimental to simulation performance. In this paper, we eliminate the granularity dependency by applying the Result Oriented Modeling (ROM) technique previously used only for communication modeling. Our ROM approach allows precise preemptive scheduling, while retaining all the benefits of abstract RTOS modeling. Our experimental results demonstrate tremendous improvements. While the traditional model simulated an interrupt response time with a severe inaccuracy (12x longer on average and 40x longer for the 96th percentile), our ROM-based model was accurate within 8% (average and 50th percentile) using identical timing annotations.

SystemC-Based Modeling, Seamless Refinement, and Synthesis of a JPEG 2000 Decoder [p. 128]
K. Grüttner, F. Oppenheimer, W. Nebel, F. Colas-Bigey and A.-M. Fouilliart

This paper describes and evaluates, by way of example, the OSSS methodology for embedded hardware/software systems and its use in a JPEG 2000 decoder case study. The OSSS approach defines a design flow starting from an Application Model providing a rich subset of SystemC™/C++ augmented with specific OSSS language concepts. It can be used to identify the most promising parallel structure by comparing different design alternatives. A clearly defined refinement process leads to the Virtual Target Architecture (VTA) Model. These refinements enable an analysis of the system behaviour at cycle-accurate granularity and support the exploration of different target architectures for the JPEG 2000 decoder. VTA models can be used as direct input for the FOSSY synthesis tool, which performs an automatic transformation into implementation models; that is, it generates VHDL code for hardware, C/C++ for software, and platform configuration files for the target technology.

Modeling and Refining Heterogeneous Systems with SystemC-AMS: Application to WSN [p. 134]
M. Vasilevski, F. Pecheux, N. Beilleau, H. Aboshady and K. Einwich

The paper presents a system-level approach for the modeling and simulation of a paradigmatic Wireless Sensor Network composed of two nodes using SystemC-AMS, an open-source C++ extension to the OSCI SystemC Standard dedicated to the description of heterogeneous systems containing digital, analog, RF hardware IPs as well as embedded software. The paper is composed of three parts. The first part details the modeled WSN (physical sensor, sigma-delta ADC, ATMEGA128 8-bit microcontroller running the embedded application, QPSK-based 2.4 GHz RF transceiver), presents the corresponding implementation in SystemC-AMS, and gives insight into how multi-frequency simulation is handled in SystemC-AMS. The second part shows how to introduce several RF designer specifications (noise figure, IIP3, ...) into models and how to express them in SystemC-AMS. The third part proves that the combination of C++ and RF baseband equivalent dramatically reduces simulation time while keeping excellent accuracy and code readability. The paper concludes with the possibilities offered by this approach in terms of validation and optimization of heterogeneous systems through open-source simulation.


2.3: New Directions in Analogue Circuit Modelling

Moderators: C. Grimm, TU Vienna, AT; D. Mueller, TU Munich, DE
Sizing Rules for Bipolar Analog Circuit Design [p. 140]
T. Massier, H. Graeb and U. Schlichtmann

This paper presents sizing rules for basic building blocks in analog bipolar circuit design. Sizing rules efficiently capture design knowledge on the technology-specific level of transistor-pair groups. This reduces the effort for analog circuit synthesis and improves the quality of its results. We present a hierarchical library of transistor-pair groups as basic building blocks for analog bipolar circuits. Sizing rules are constraints associated with these building blocks that must be satisfied to guarantee the function and robustness of each block. Results of applications like circuit sizing or design centering show that the use of sizing rules leads to improved and robust results.

Efficient Circuit-Level Modeling of Ballistic CNT Using Piecewise Non-Linear Approximation of Mobile Charge Density [p. 146]
T. J. Kazmierski, D. Zhou and B. M. Al-Hashimi

This paper presents a new carbon nanotube transistor (CNT) modelling technique which is based on an efficient numerical piece-wise non-linear approximation of the non-equilibrium mobile charge density. The technique facilitates the solution of the self-consistent voltage equation in a carbon nanotube such that the CNT drain-source current evaluation is accelerated by more than three orders of magnitude while maintaining high modelling accuracy. The model is currently limited to ballistic transport but can be extended to non-ballistic modes of transport once a suitable theory is developed, as researchers continue to study phenomena that sometimes prevent electrons in a carbon nanotube from going ballistic. Our results show that while the accuracy and speed of the proposed model vary with the number of piece-wise segments in the mobile charge approximation, it is possible to obtain a speed-up of more than 1000 times while maintaining the accuracy within less than 2% in terms of average RMS error compared with the state-of-the-art theoretical reference CNT model implemented in FETToy. This numerical efficiency makes our model particularly suitable for implementation in circuit-level, e.g. SPICE-like, simulators where large numbers of such devices may be used to build complex circuits.

A New Approach for Combining Yield and Performance in Behavioral Models for Analogue Integrated Circuits [p. 152]
S. Ali, R. Wilcock, P. Wilson and A. Brown

A new algorithm is presented that combines performance and variation objectives in a behavioural model for a given analogue circuit topology and process. The tradeoffs between performance and yield are analysed using a combination of a multi-objective evolutionary algorithm and Monte Carlo simulation. The results indicate a significant improvement in overall simulation time and efficiency compared to conventional simulation based approaches, without a corresponding drop in accuracy. This approach is particularly useful in the hierarchical design of large and complex circuits where computational overheads are often prohibitive. The behavioural model has been developed in Verilog-A and tested extensively with practical designs using the Spectre™ simulator. A benchmark OTA circuit was used to demonstrate the proposed algorithm and the behaviour has been verified with transistor level simulations of this circuit and a higher level filter design. This has demonstrated that an accurate performance and yield prediction can be achieved using this model, in a fraction of the time of conventional simulation based methods.


2.4: Automotive System Design and Verification

Moderators: L. Fanucci, Pisa U, IT; J. Gerlach, Robert Bosch GmbH, DE
Symbolic Reliability Analysis and Optimization of ECU Networks [p. 158]
M. Glaß, M. Lukasiewycz, F. Reimann, C. Haubelt and J. Teich

Increasing reliability at a minimum amount of extra cost is a major challenge in today's ECU network design. Considering reliability as an objective already in early design phases has the potential to avoid expensive modifications in later design phases. Hence, there is a need for an appropriate optimization process and efficient analysis techniques to evaluate the found implementations. In this paper, we will show how symbolic techniques can be used to efficiently analyze and optimize such reliable systems. The contribution of this paper is (1) a symbolic reliability analysis that makes use of a partitioned structure function and (2) a symbolic optimization process based on binary ILP solvers. Our case study from the automotive area will show a significant speed-up using our analysis technique. Moreover, our optimization approach is able to offer implementations with considerably improved reliability at no additional costs as well as implementations with reduced costs without decreasing their reliability.

Verification of Temporal Properties in Automotive Embedded Software [p. 164]
D. Lettnin, P.K. Nalla, J. Ruf, T. Kropf, W. Rosenstiel, T. Kirsten, V. Schönknecht and S. Reitemeyer

The amount of software in embedded systems has increased significantly over the last years and, therefore, the verification of embedded software is of fundamental importance. One of the main problems in embedded software is to verify variables and functions based on temporal properties. Formal property verification using a model checker often suffers from the state space explosion problem when a large software design is considered. In this paper, we propose two new approaches to integrate assertions in the verification of embedded software using simulation-based verification. Firstly, we extended a SystemC hardware temporal checker with interfaces in order to monitor the embedded software variables and functions that are stored in a microprocessor memory model. Secondly, we derived a SystemC model from the original C program in order to integrate directly with the SystemC temporal checker. We performed a case study on embedded software from the automotive industry that is responsible for controlling read and write requests to a non-volatile memory.

A Novel Approach for EMI Design of Power Electronics [p. 170]
B. Stube, B. Schroeder, E. Hoene and A. Lissner

The placement of passive components significantly influences the EMI behavior of power electronic systems. Filter components in particular are affected by magnetic field coupling, which reduces filter performance. In this paper we introduce a novel approach for a methodical EMI design of power electronic circuits. Based on the results of EMI prediction, design rules for component placement are derived. To meet the design rules, a prototype of a dedicated placement tool was developed. This tool offers extensive interactive and automatic placement functionality to solve the very complex design task efficiently. Using the proposed approach in the design stage allows both a statement on achievable performance with the given components and the minimization of the system volume. Development costs can be reduced considerably.

Hardware/Software Architecture of an Algorithm for Vision-Based Real-Time Vehicle Detection in Dark Environments [p. 176]
N. Alt, C. Claus and W. Stechele

Hardware/software partitioning of algorithms is gaining more and more importance in order to benefit from the advantages of both worlds. Pure software implementations are easy to change but the processing time is rather high. By contrast, pure hardware implementations usually result in faster processing due to inherent parallelism but they do not offer the necessary flexibility for quick changes and adaptations. In this paper the hardware/software co-design of a self-developed algorithm to detect cars by their taillights as well as its implementation on an embedded system (FPGA) is presented. Instead of utilizing expensive sensors such as RADAR, which can also be used to detect obstacles in dark environments, the detection method presented here is based solely on grayscale images taken by a low-cost on-board camera which was mounted on a moving vehicle. Only computationally intense parts - namely pixel or sliding window operations - are implemented in hardware to achieve the necessary real-time requirements. The remainder of the algorithm - the so-called higher level application code - is running on standard embedded CPU cores. With this architecture it is possible to process the incoming video-stream (25 frames/s) and detect cars in real-time on an embedded system.
Keywords: driver assistance, real-time video processing, hardware acceleration, taillight detection


2.5: Advances in SoC Test

Moderators: R. Dorsch, IBM Boeblingen, DE; P. Harrod, ARM Ltd, UK
Analysis of the Test Data Volume Reduction Benefit of Modular SOC Testing [p. 182]
O. Sinanoglu and E. J. Marinissen

Modular SOC testing offers numerous benefits that include test power reduction, ease of timing closure, and test re-use among many others. While all these benefits have been emphasized by researchers, the test time and data volume comparisons have been mostly constrained within the context of modular SOC testing only, by comparing the impact of various different modular SOC testing techniques to each other. In this paper, we provide a theoretical test data volume analysis that compares the monolithic test of a flattened design with the same design tested in a modular manner; we present numerous experiments that gauge the magnitude of this benefit. We show that the test data volume reduction delivered by modular SOC testing directly hinges on the test pattern count variation across different modules, and that this reduction can exceed 99% in the SOC benchmarks that we have experimented with.
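
The dependence on pattern count variation can be illustrated with a deliberately simplified model. The sketch below assumes (this is not taken from the paper) that test data volume is the pattern count times the number of scan cells loaded per pattern, that a monolithic test loads every scan cell for as many patterns as the most demanding module needs, and that a modular test loads each module only for its own patterns. Wrapper overhead is ignored, and all module sizes and pattern counts are hypothetical.

```python
def monolithic_volume(modules):
    """Flattened design: every scan cell is loaded for the largest pattern count."""
    total_cells = sum(cells for cells, _ in modules)
    max_patterns = max(patterns for _, patterns in modules)
    return total_cells * max_patterns


def modular_volume(modules):
    """Modular test: each module is loaded only for its own pattern count."""
    return sum(cells * patterns for cells, patterns in modules)


def reduction(modules):
    """Fraction of test data volume saved by modular testing (0.0 to 1.0)."""
    return 1.0 - modular_volume(modules) / monolithic_volume(modules)


# Hypothetical modules as (scan_cells, pattern_count) pairs.
uniform = [(1000, 100), (1000, 100), (1000, 100)]
skewed = [(100, 10000), (5000, 50), (5000, 50)]

print(f"uniform pattern counts: {reduction(uniform):.1%} reduction")
print(f"skewed pattern counts:  {reduction(skewed):.1%} reduction")
```

In this toy model the reduction is zero when all modules need the same number of patterns and grows as one small module dominates the pattern count.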

Test-Architecture Optimization and Test Scheduling for SOCs with Core-Level Expansion of Compressed Test Patterns [p. 188]
A. Larsson, E. Larsson, K. Chakrabarty, P. Eles and Z. Peng

The ever-increasing test data volume for core-based system-on-chip (SOC) integrated circuits is resulting in high test times and excessive tester memory requirements. To reduce both test time and test data volume, we propose a technique for test-architecture optimization and test scheduling that is based on core-level expansion of compressed test patterns. For each wrapped embedded core and its decompressor, we show that the test time does not decrease monotonically with the width of test access mechanism (TAM) at the decompressor input. We optimize the wrapper and decompressor designs for each core, as well as the TAM architecture and the test schedule at the SOC level. Experimental results for SOCs crafted from several industrial cores demonstrate that the proposed method leads to significant reduction in test data volume and test time, especially when compared to a method that does not rely on core-level decompression of patterns.

A Novel Methodology for Reducing SoC Test Data Volume on FPGA-based Testers [p. 194]
P. Bernardi and M. Sonza Reorda

Low-Cost test methodologies for Systems-on-Chip are increasingly popular. They dictate which features have to be included on-chip and which test procedures have to be adopted in order to guarantee high test quality, while minimizing application costs. Consequently, Low-Cost test strategies can be run on testers offering lower performance and/or reduced features with respect to traditional Automatic Test Equipment (ATE); such equipment is usually referred to as Low-Cost testers. This paper proposes a methodology for reducing the test data volume for the application of SoC Low-Cost test procedures. The method exploits a tester architecture organization suitable for SoC testing, which includes a programmable device: the use of this configurable block, combined with the analysis of test pattern regularities, permits minimizing the test data volume, thus improving the tester capabilities. The proposed method relies on test pattern compression at system level and it does not address core level pattern manipulation, as several other previously published works do. Case studies are proposed, which provide data about the application of the proposed methodology to the test of SoCs including self-testable processor and memory cores. IEEE 1149.1 and IEEE 1500 test access mechanisms are considered. The achieved pattern depth reduction ratio is up to about 64% for the considered case studies.


2.6: Performance Analysis and Exploration of MPSoC Architectures

Moderators: G. Beltrame, European Space Agency; F. Schaefer, Cadence Design Systems, DE
Performance Analysis of SoC Architectures Based on Latency-Rate Servers [p. 200]
J. P. Vink, K. Van Berkel and P. Van Der Wolf

This paper presents a method for static performance analysis of SoC architectures. The method is based on a network calculus theory known as LR servers. This network calculus is extended and applied to make it support SoC performance analysis. Performance requirements of subsystems are elegantly captured as traffic flows and associated latency constraints. The SoC infrastructure is modeled as a set of LR servers to validate that the worst-case delays in handling the traffic flows meet the latency constraints. A multi-channel DVB-T set-top box case study demonstrates the power of the method. Key architecture choices, such as schedule or interconnect variant, can be varied easily to support exploration of architecture options.
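
As background, the classical latency-rate result (not the extended calculus of the paper) can be sketched as follows: a chain of LR servers behaves like a single LR server whose latency is the sum of the individual latencies and whose rate is the minimum allocated rate, and a flow with burst size sigma and a sustained rate no larger than that minimum sees a worst-case delay of at most the summed latency plus sigma divided by the minimum rate. The class name, function names, and numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LRServer:
    latency: float  # worst-case service latency, in seconds
    rate: float     # rate allocated to the flow, in bytes per second


def worst_case_delay(burst, flow_rate, servers):
    """Classical delay bound for a (burst, rate) flow through a chain of LR servers."""
    bottleneck = min(s.rate for s in servers)
    if flow_rate > bottleneck:
        raise ValueError("flow rate exceeds the allocated rate; delay is unbounded")
    return sum(s.latency for s in servers) + burst / bottleneck


def meets_constraint(burst, flow_rate, servers, max_latency):
    """Check a latency constraint against the worst-case bound."""
    return worst_case_delay(burst, flow_rate, servers) <= max_latency


# Hypothetical infrastructure: an interconnect followed by a memory controller.
chain = [LRServer(latency=2e-6, rate=100e6), LRServer(latency=5e-6, rate=50e6)]
delay = worst_case_delay(burst=1000, flow_rate=40e6, servers=chain)
print(f"worst-case delay: {delay * 1e6:.1f} us")
```

Changing a schedule or interconnect variant in such a model amounts to swapping the latency and rate parameters of the affected server and re-evaluating the bound.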

Slack Allocation Based Co-Synthesis and Optimization of Bus and Memory Architectures for MPSoCs [p. 206]
S. Pandey and R. Drechsler

In this paper, we present a bus and memory architecture co-synthesis technique. The co-synthesis problem is formulated as an optimization problem, where scheduling, allocation, and binding of tasks are done simultaneously in order to optimize the bus widths, the number of buses, and the memory sizes. As a main contribution, bus and memory architectures are optimized simultaneously by allocating different amounts of slack to them during co-synthesis. The method finds a balance of slack allocation for both bus and memory optimization. Because previous co-synthesis approaches do not consider slack allocation, the bus and memory architectures they synthesize are not optimal in terms of area and energy consumption. Experimental results on real-life applications show 19% and 24% reduction in bus and memory area, respectively, and 37% reduction in energy overhead due to a bridge, compared to the previous co-synthesis approach.

Run-Time Spatial Mapping of Streaming Applications to a Heterogeneous Multi-Processor System-on-Chip (MPSoC) [p. 212]
P.K.F. Hölzenspies, J.L. Hurink, J. Kuper and G.J.M. Smit

In this paper, we present an algorithm for run-time allocation of hardware resources to software applications. We define the sub-problem of run-time spatial mapping and demonstrate our concept for streaming applications on heterogeneous MPSoCs. The underlying algorithm and the methods used therein are implemented and their use is demonstrated with an illustrative example.

Architecture Exploration of NAND Flash-Based Multimedia Card [p. 218]
S. Kim, C. Park and S. Ha

In this paper, we present an architecture exploration methodology for low-end embedded systems where the reduction of cost is a primary design concern. The architecture exploration of such systems needs to explore a wide design space spanned by detailed architecture parameters through cycle-accurate performance estimation. For fast exploration, the proposed methodology is based on an efficient evolutionary algorithm, called QEA, and trace-driven simulation to evaluate architecture candidates quickly. We applied the proposed methodology to a NAND flash-based Multimedia Card as a case study considering the following design parameters: buffer size, flash memory configuration, clock, communication architecture, and memory allocation. The experimental results validate the proposed methodology by showing the optimal architecture configurations with varying performance constraints and design parameters.


2.7: Advanced Power Management Techniques

Moderators: C. Silvano, Politecnico di Milano, IT; A. Hemani, Royal Institute of Technology (KTH), SE
Resilient Dynamic Power Management under Uncertainty [p. 224]
H. Jung and M. Pedram

With the increasing levels of variability and randomness in the characteristics and behavior of manufactured nanoscale structures and devices, achieving performance optimization under process, voltage, and temperature (PVT) variations as well as current, voltage, and thermal (CVT) stress has become a daunting, yet vital, task. In this paper, we present a stochastic dynamic power management (DPM) framework to improve the accuracy of decision making under probabilistic conditions induced by PVT variations and/or stress. More precisely, we propose a resilient power management technique that guarantees to select an optimal policy under sources of uncertainty. A key characteristic of the proposed technique is that the effects of uncertainties due to variability and stress are captured by stochastic processes which control a self-improving power manager. Simulation results with a 65nm processor design show that, compared to the worst-case PVT conditions, the proposed DPM technique ensures energy efficiency, while reducing the uncertain behaviors of the system.

Robust and Low Complexity Rate Control for Solar Powered Sensors [p. 230]
C. Moser, L. Thiele, D. Brunelli and L. Benini

This paper is concerned with solar driven sensors deployed in an outdoor environment. We present feedback controllers which adapt parameters of the application such that a maximal utility is obtained while respecting the time-varying amount of available energy. We show that already simple applications lead to complex optimization problems, involving unacceptable running times and energy consumptions for resource constrained nodes. In addition, naive designs are highly susceptible to energy prediction errors. We address both issues by proposing a hierarchical control approach which both reduces complexity and increases robustness towards prediction uncertainty. As a key component of this hierarchical approach, we propose a new worst-case energy prediction algorithm which guarantees sustainable operation. All methods are evaluated using long-term measurements of solar energy in an outdoor setting. Furthermore, we measured the implementation overhead on a real sensor node.

Energy Aware Dynamic Voltage and Frequency Selection for Real-Time Systems with Energy Harvesting [p. 236]
S. Liu, Q. Qiu and Q. Wu

In this paper, an energy aware dynamic voltage and frequency selection (EA-DVFS) algorithm is proposed. The EA-DVFS algorithm adjusts the processor's behavior depending on the summation of the stored energy and the harvested energy in a future duration. Specifically, if the system has sufficient energy, tasks are executed at full speed; otherwise, the processor slows down task execution to save energy. Simulation results show that when the utilization is low, the EA-DVFS algorithm gives a deadline miss rate that is at least 50% lower than the one given by the lazy scheduling policy. Similarly, when the workload is low, the minimum storage size is reduced by at least 25%.
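
The decision rule described in the abstract can be sketched as below. This is a minimal illustration, not the EA-DVFS algorithm itself: the function name, the parameters, the energy model (full-speed power times execution time), and the choice of the slowest deadline-meeting speed when energy is short are all assumptions made for the example.

```python
def select_speed(stored_energy, harvested_energy, work, deadline, full_power):
    """Return a normalized processor speed in (0, 1].

    work       -- execution time of the task at full speed (assumed known)
    deadline   -- time remaining until the task's deadline
    full_power -- power drawn at full speed (assumed constant)
    """
    if work > deadline:
        raise ValueError("task cannot meet its deadline even at full speed")
    available = stored_energy + harvested_energy
    energy_at_full_speed = full_power * work
    if available >= energy_at_full_speed:
        return 1.0  # sufficient energy: run at full speed
    # Assumed slow-down policy: stretch the task so it finishes at its deadline.
    return work / deadline


print(select_speed(stored_energy=10.0, harvested_energy=5.0,
                   work=2.0, deadline=4.0, full_power=5.0))
print(select_speed(stored_energy=2.0, harvested_energy=3.0,
                   work=2.0, deadline=4.0, full_power=5.0))
```

The first call has enough energy (15 units against 10 needed) and selects full speed; the second does not and selects half speed under the assumed policy.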

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution [p. 242]
S. Hong, S. Yoo, B. Bin, K.-M. Choi, S.-K. Eo and T. Kim

This paper presents a method of dynamic voltage scaling (DVS) that tackles both switching and leakage power with combined Vdd/Vbs scaling and gives minimum average energy consumption exploiting the runtime distribution of software execution. We present a mathematical formulation of the DVS problem and an efficient numerical solution. Experimental results show that the presented method achieves up to 44% further reduction in energy consumption compared with existing methods. The presented method proves most effective when the leakage power consumption is significant, i.e. when temperature is high.


IP1 Interactive Presentations

Built-In Clock Skew System for On-Line Debug and Repair [p. 248]
A. Chattopadhyay and Z. Zilic

We present a low-cost on-line system for clock skew management in integrated circuits. Our Built-In Clock Skew System (BICSS) uses a centralized approach to identify, quantify and correct skew using a two-step method. The technique assesses the time-of-flight between the central debug circuitry and each region, or tap under test to account for the measurement error due to differences in path length common in existing techniques. The system can be used to detect skew above a user-adjustable margin using a variable tolerance phase detector. The result is a solution which provides silicon debug and repair capability of on-chip clock skews with a very small area overhead.

Analysis and Optimization of the Recessed Probe Launch for High Frequency Measurements of PCB Interconnects [p. 252]
R. Rimolo-Donadio, C. Schuster, X. Gu, Y.H. Kwark and M.B. Ritter

Measurements of internal printed circuit board (PCB) structures such as striplines and vias face the problem of launching clean test signals into the device under test (DUT). Traditionally, coaxial connectors or surface probing with high frequency microprobes are used to provide interfaces to test equipment. Both approaches have to be carefully optimized in order to give adequate results for the multi-GHz range. This paper discusses a different access technique, the recessed probe launch (RPL), which was previously used by the authors for measurements up to 40 GHz. Full-wave 3D electromagnetic modeling is applied to analyze the parasitics of the proposed launch technique and to find strategies for its optimization. Comparison to measurement shows that the models are able to predict the major physics of the launch but several details still need to be explored, e.g. accurate modeling of the microprobes, material parameters, and network analyzer calibration.

On Automated Trigger Event Generation in Post-Silicon Validation [p. 256]
H.F. Ko and N. Nicolici

When searching for functional bugs in silicon, debug data is acquired after a trigger event occurs. A trigger event can be configured at run-time using a set of control registers that uniquely identify the event that initiates data acquisition. Nonetheless the values loaded in these programmable registers interact only with a set of pre-defined trigger signals that are selected at design-time. If the state conditions required for triggering cannot be expressed directly in terms of the pre-defined trigger signals, the common practice is that the designer manually searches for an equivalent trigger event that can be programmed on-chip. In this paper we investigate if trigger events can be automatically generated from a set of state conditions.

Dynamic Round-Robin Task Scheduling to Reduce Cache Misses for Embedded Systems [p. 260]
K.W. Batcher and R.A. Walker

Modern embedded CPU systems rely on a growing number of software features, but this growth increases the memory footprint and increases the need for efficient instruction and data caches. The embedded operating system will often juggle a changing set of tasks in a round-robin fashion, which inevitably results in cache misses due to conflicts between different tasks. Our technique reduces cache misses by continuously monitoring CPU cache misses to grade the performance of running tasks. Through a series of step-wise refinements, our software system tunes the round-robin ordering to find a better temporal sequence for the tasks. This tuning is done dynamically during program execution and hence can adapt to changes in work load or external input stimulus. The benefits of this technique are illustrated using an ARM processor running application benchmarks with different cache organizations and round-robin scheduling techniques.

Improving the Efficiency of Run Time Reconfigurable Devices by Configuration Locking [p. 264]
Y. Qu, J.-P. Soininen and J. Nurmi

Run-time reconfigurable logic is a very attractive alternative in the design of SoC. However, configuration overhead can largely decrease the system performance. In this work, we present a novel configuration locking technique to reduce the effect of the overhead. The idea is to lock, at run time, a number of the most frequently used tasks on the configuration memory so that they cannot be evicted by other tasks. With real applications in validation, the results show that using a proper amount of resources to lock tasks can significantly outperform simply using more resources. In addition, an algorithm has been developed for estimating the lock ratio. Experimental results show that the estimates are close to optimal results and the measured computer runtime is less than 4 µs in a commercial embedded processor.

Logic Synthesis with Nanowire Crossbar: Reality Check and Standard Cell-Based Integration [p. 268]
M. Dong and L. Zhong

Nanowire crossbar is one of the most promising circuit solutions for nanoelectronics. We show nanowire crossbars do not scale well in terms of logic density and speed. We consequently propose a Crossbar Cell design based on judicious use of silicon nanowire crossbars with microscale pitches and small dimensions. The Crossbar Cell is compatible with the conventional MOSFET fabrication and standard cell-based integration. We evaluate logic circuits using Crossbar Cells and show that they can improve density by more than fourfold over the traditional MOSFET circuits with the same process technology, while achieving close performance and over threefold power reduction.

Merged Computation for Whirlpool Hashing [p. 272]
R. Chaves, G. Kuzmanov, L. Sousa and S. Vassiliadis

This paper presents an improved hardware structure for the computation of the Whirlpool hash function. By merging the round key computation with the data compression and by using embedded memories to perform part of the Galois Field (2^8) multiplication, a core can be implemented in just 43% of the area of the best current related art while achieving a 12% higher throughput. The proposed core improves the Throughput per Slice compared to the state of the art by 160%, achieving a throughput of 5.47 Gbit/s with 2110 slices and 32 BRAMs on a VIRTEX II Pro FPGA. Results for a real application are also presented by considering a polymorphic computational approach.
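
To make the role of embedded memories concrete, the sketch below shows how multiplication by a fixed constant in Whirlpool's field, GF(2^8) with reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), can be replaced by a 256-entry lookup table. The table organisation and function names are illustrative assumptions and do not describe the paper's actual core.

```python
REDUCTION_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1, as used by Whirlpool


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add with modular reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        a <<= 1                  # multiply a by x
        if a & 0x100:
            a ^= REDUCTION_POLY  # reduce back into 8 bits
        b >>= 1
    return result


def make_table(constant):
    """Precompute constant * x for every byte x, as a memory block could store it."""
    return [gf_mul(constant, x) for x in range(256)]


TIMES_2 = make_table(2)

# x * x^7 = x^8, which reduces to x^4 + x^3 + x^2 + 1 = 0x1D.
print(hex(TIMES_2[0x80]))  # prints 0x1d
```

A table lookup trades logic for memory, which is the general kind of trade-off the abstract refers to when it moves part of the multiplication into BRAMs.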

Source-Level Timing Annotation and Simulation for a Heterogeneous Multiprocessor [p. 276]
T. Meyerowitz, A. Sangiovanni-Vincentelli, M. Sauermann and D. Langen

A generic and retargetable tool flow is presented that enables the export of timing data from software running on a cycle-accurate Virtual Prototype (VP) to a concurrent functional simulator. First, an annotation framework takes information gathered from running an application on the VP and automatically annotates the line-level delays back to the original source code. Then, a SystemC-based timed functional simulator runs the annotated source code much faster than the VP while preserving timing accuracy. This simulator is API-compatible with the multiprocessor's operating system. Therefore, it can compile and run unmodified applications on the host PC. This flow has been implemented for MuSIC (Multiple SIMD Cores) [6], a heterogeneous multiprocessor developed at Infineon to support Software Defined Radio (SDR). When compared with an optimized cycle-accurate VP of MuSIC on a variety of tests, including a multiprocessor JPEG encoder, the accuracy is within 20%, with speedups from 10x to 1000x.

Safe Automatic Flight Back and Landing of Aircraft. Flight Reconfiguration Function (FRF) [p. 280]
J.A. Herrería García

The SOFIA (Safe Automatic Flight Back and Landing of Aircraft) project is a response to the challenge of developing concepts and techniques enabling the safe and automatic return to ground in the event of hostile actions. Activities in this sense have been started in the framework of the SAFEE SP3 (Secure Aircraft in the Future European Environment Sub-Project 3) project. The SOFIA project is proposed as the continuation of the SAFEE work on the FRF (Flight Reconfiguration Function), the system to automatically return the aircraft to ground. SOFIA will design architectures for integrating the FRF system into several typologies of avionics for civil transport aircraft; develop one of these architectures; validate, following E-OCVM (European Operational Concept Validation Methodology), the FRF concept and the means to integrate it in the current ATM (Air Traffic Management); and assess the safety of FRF at aircraft and operational (ATC, Air Traffic Control) levels. The SOFIA product is the FRF system that will take control of the aircraft and will manage to safely return it to ground under a security emergency (e.g. hijacking), disabling the control and command of the aircraft from the cockpit. This means creating and executing a new flight plan towards a secure airport and landing the aircraft there. The flight plan can be generated on the ground (ATC) or in a military airplane and transmitted to the aircraft, or created autonomously by the FRF.

PWM-Based Test Stimuli Generation for BIST of High Resolution Sigma-Delta ADCs [p. 284]
D. De Venuto and L. Reyneri

A fully digital test stimuli generation and on-chip specifications evaluation for cheap, fast, though accurate testing of high resolution ΣΔ ADCs are presented here. Simulations and measurements showed a discrimination threshold on specification parameters up to -90dBc. The proposed method helps to reduce the cost of ADC production test, to extend test coverage and to enable Built-In Self-Test and test-based self-calibration.


3.2: System Synthesis

Moderators: J. Teich, Erlangen-Nuremberg U, DE; P. Pop, DTU, DK
Temperature-Aware Scheduling and Assignment for Hard Real-Time Applications on MPSoCs [p. 288]
T. Chantem, R.P. Dick and X.S. Hu

Thermal effects in MPSoCs may cause the violation of timing constraints in real-time systems. This paper presents a mixed integer linear programming based solution to this problem. Tasks are assigned and scheduled to an MPSoC to minimize peak temperature, subject to real-time constraints. The proposed approach outperforms existing methods, reducing peak temperature by up to 24.66 °C and by an average of 8.75 °C when compared to minimal-energy solutions. We also present a heuristic for use on large problem instances. Steady-state thermal analysis is used for tasks with long execution times compared to the RC thermal time constants of the cores. Transient analysis is used otherwise. The steady-state analysis based heuristic finds solutions with at most 3.40 °C deviation from optimal peak temperature (0.22 °C on average) while improving upon existing techniques by as much as 25.71 °C and 10.86 °C on average. The transient analysis based heuristic further reduces peak temperature by 1 °C in the best case and 0.17 °C on average.

A Formal Approach to the Protocol Converter Problem [p. 294]
K. Avnit, V. D'Silva, A. Sowmya, S. Ramesh and S. Parameswaran

In the absence of a single module interface standard, integration of pre-designed modules in System-on-Chip design often requires the use of protocol converters. Existing approaches to automatic synthesis of protocol converters mostly lack formal foundations and either employ abstractions that ignore crucial low level behaviors, or grossly simplify the structure of the protocols considered. We present a state-machine based formal model for bus based communication protocols, and precisely define protocol compatibility, and correct protocol conversion. Our model is expressive enough to capture features of commercial protocols such as bursts, pipelined transfers, wait state insertion, and data persistence, in cycle accurate detail. We show that the most general, correct converter for a pair of protocols, can be described as the greatest fixed point of a function for updating buffer states. This characterization yields a natural algorithm for automatic synthesis of a provably correct converter by iterative computation of the fixed point. We report our experience with automatic converter synthesis between widely used commercial bus protocols, such as AMBA AHB, ASB, APB, and OCP, considering features which are beyond the scope of current techniques.
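
The fixed-point computation mentioned in the abstract follows a general pattern that can be sketched on a toy example. The code below is a generic greatest-fixed-point iteration over finite sets, not the paper's buffer-state update function: the state names, transition relation, and pruning rule are hypothetical and chosen only to show how iteration from the full set converges.

```python
def greatest_fixed_point(universe, step):
    """Iterate a monotone function downwards from the full set until it stabilises."""
    current = frozenset(universe)
    while True:
        nxt = frozenset(step(current)) & current
        if nxt == current:
            return current
        current = nxt


# Hypothetical transition system: keep only states that are not bad and
# whose successors all remain inside the kept set.
transitions = {
    "s0": {"s1"},
    "s1": {"s0", "s2"},
    "s2": {"s3"},
    "s3": {"s3"},
    "s4": {"s4"},
}
bad = {"s3"}


def prune(states):
    return {s for s in states if s not in bad and transitions[s] <= states}


safe = greatest_fixed_point(transitions.keys(), prune)
print(sorted(safe))  # prints ['s4']
```

Each pass removes states that can no longer be kept, so the result is the largest set that the pruning rule leaves unchanged; on a finite state space the loop always terminates.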

Cache Aware Mapping of Streaming Applications on a Multiprocessor System-on-Chip [p. 300]
A. Moonen, M. Bekooij, R. van den Berg and J. van Meerbergen

Efficient use of the memory hierarchy is critical for achieving high performance in a multiprocessor system-on-chip. An external memory that is shared between processors is a bottleneck in current and future systems. Cache misses and a large cache miss penalty contribute to a low processor utilisation. In this paper, we describe a novel cache optimisation technique to reduce instruction and data cache misses for streaming applications. The instruction and data locality are improved by executing a task multiple times before moving to the next task. Furthermore, we introduce a dataflow model that is used to trade off the number of cache misses against end-to-end latency and memory usage. For our industrial application, which is a Digital Radio Mondiale receiver, the number of cache misses is reduced by a factor of 4.2.
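
The effect of executing a task several times before switching can be illustrated with a toy instruction-cache model. The sketch assumes (this is not the paper's dataflow model) that the cache holds the code of exactly one task, so every switch to a different task reloads all of that task's cache lines; task names and sizes are hypothetical.

```python
def instruction_cache_misses(tasks, iterations, batch, lines_per_task):
    """Count instruction-cache misses when each task runs `batch` times in a row.

    Assumes a cache that holds the code of exactly one task at a time.
    """
    if iterations % batch:
        raise ValueError("iterations must be a multiple of batch")
    misses = 0
    cached = None
    for _ in range(iterations // batch):
        for task in tasks:
            for _ in range(batch):
                if cached != task:
                    misses += lines_per_task  # reload the task's code
                    cached = task
    return misses


tasks = ["demodulate", "decode", "audio"]
unbatched = instruction_cache_misses(tasks, iterations=8, batch=1, lines_per_task=64)
batched = instruction_cache_misses(tasks, iterations=8, batch=4, lines_per_task=64)
print(unbatched, batched)  # prints 1536 384
```

In this model the misses drop in proportion to the batch size, while the first output of the last task is delayed because upstream tasks run several iterations first, which is the latency and buffering cost the abstract trades against.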

Synthesizing Synchronous Elastic Flow Networks [p. 306]
G. Hoover and F. Brewer

This paper describes an implementation language and synthesis system for automatically generating latency insensitive synchronous digital designs. These designs decouple behavioral correctness from design performance by allowing any sub-component to dynamically stall without changing correct system activity. This is accomplished by imposition of global invariants and use of local control in the form of Synchronous-Elastic Flow (SELF) networks, which are directly synthesized. This design description format reduces the complexity of implementing correct SELF networks and does not require pre-design of a correct conventional synchronous design. The design description is a specialized guarded atomic action language which is particularly suited for succinctly describing SELF designs. We present the language syntax, semantics and synthesis techniques illustrated by the design of a latency tolerant cache controller.


3.3: Analogue Simulation, Synthesis and Verification

Moderators: T. Kazmierski, Southampton U, UK; H. Graeb, TU Munich, DE
Periodic Steady-State Analysis Augmented with Design Equality Constraints [p. 312]
I. Vytyaz, P.K. Hanumolu, U.-K. Moon and K. Mayaram

A design-oriented periodic steady-state analysis is presented in this paper. The new analysis finds the values of circuit parameters that result in a desired circuit performance specified by a set of equality constraints. This is done by including the design equality constraints and the circuit parameters directly in the steady-state analysis as additional equations and unknowns. A time-domain finite difference method and the numerical implementation for the proposed analysis are described. Several examples demonstrate that the new analysis accurately and efficiently tunes circuit parameters that conform to a wide range of design specifications.

Analysis of Oscillator Injection Locking by Harmonic Balance Method [p. 318]
M.M. Gourary, S.G. Rusakov, S.L. Ulyanov, M.M. Zharov, B.J. Mulvaney and K.K. Gullapalli

A new approach to analyze the injection locking mode of oscillators under small external excitation is proposed. The proposed approach exploits existence conditions of the solution of the HB linear system with a degenerate matrix. The method allows one to obtain the locking range for an arbitrary oscillator circuit with an arbitrary periodic injection waveform. The approach can be easily implemented in a circuit simulator. Examples are given to confirm the correctness of the new approach.

Model Checking of Analog Systems Using an Analog Specification Language [p. 324]
S. Steinhorst and L. Hedrich

In this contribution an advanced methodology for model checking of analog systems is introduced. A new Analog Specification Language (ASL) for efficient property specifications is defined and model checking algorithms for implementing this language are presented. This allows verification of complex static and dynamic circuit properties like Oscillation and Startup Time that have not yet been formally verifiable with previous approaches. The new verification methodology is applied to example circuits and experimental results are discussed and compared to conventional circuit simulation.


3.4: Aerospace Designs and MEMs Systems

Moderators: P. Manet, U Catholique de Louvain, BE; B. Candaele, Thales, FR
Mapping Semantics of CORBA IDL and GIOP to Open Core Protocol for Portability and Interoperability of SDR Waveform Components [p. 330]
G. Gailliard, H. Balp, M. Sarlotte and F. Verdier

Patterns, middlewares and frameworks have been used for decades in software architecture to address the main problems encountered today by the MPSoC and NoC communities: heterogeneity of languages, programming models, simulation/execution environments, interaction semantics and communication protocols. A complete semantics mapping of CORBA Interface Definition Language (IDL) and General Inter-ORB Protocol (GIOP) on the Open Core Protocol (OCP) has been investigated for hardware components. This mapping is generic, highly configurable and illustrated through our target application: Software Defined Radio.

On the Design of Tunable Fault Tolerant Circuits on SRAM-Based FPGAs for Safety Critical Applications [p. 336]
L. Sterpone, M. Aguirre, J. Tombs and H. Guzmán-Miranda

Mission-critical applications such as space or avionics increasingly demand high fault tolerance capabilities of their electronic systems. Among the fault tolerance characteristics, the performance and costs of an electronic system remain the leading factors in the space and avionics market. In particular, when considering SRAM-based FPGAs, specific hardening techniques generally based on Triple Modular Redundancy need to be adopted in order to guarantee the desired fault tolerance degree. While effectively increasing the fault tolerance capability, these techniques introduce a significant performance degradation and a dramatic area overhead that results in higher design costs. In this paper, we propose an innovative design flow that allows the implementation of fault tolerant circuits in SRAM-based FPGA devices with different degrees of fault tolerance capability. We introduce a new metric that allows a designer to precisely estimate and set the desired fault tolerance capabilities. Experimental analysis performed on a realistic industrial-type case study demonstrates the efficiency of our methodology.

Hot Wire Anemometric MEMs Sensor for Water Flow Monitoring [p. 342]
M. Melani, L. Bertini, M. De Marinis, P. Lange, F. D'Ascoli and L. Fanucci

This paper presents an application based on a hot wire anemometric sensor in MEMS technology in the field of water flow monitoring. New generations of MEMS sensors feature remarkable savings in area, cost and power with respect to conventional discrete devices but, as a drawback, they require complex electronic interfaces for signal conditioning to achieve high performance and high reliability. This anemometric sensor implementation has been developed with ISIF, a Platform SoC aimed at fast prototyping of a wide range of sensors thanks to its highly configurable resources. The presented system achieves good performance with respect to commercial devices, featuring a resolution of ±0.35% up to ±1.76% with a repeatability of roughly ±1% with respect to the full scale (0-250 cm/s). Furthermore, the proposed system, thanks to the compact size of the sensor, its robustness and its low cost, can represent a solution for diffusive monitoring in water distribution networks.


3.5: Fault Tolerant Techniques

Moderators: L. Anghel, TIMA Laboratory, FR; D. Appello, STMicroelectronics, IT
Guiding Circuit Level Fault-Tolerance Design with Statistical Methods [p. 348]
D.C. Ness and D.J. Lilja

In the last decade, the focus of fault-tolerance methods has tended towards circuit level modifications, such as transistor resizing, and away from expensive system level redundancy approaches. We present the results from a screening experiment to identify significant parameters in circuit level soft error simulations to guide such approaches to fault-tolerance. This approach allows us to assess which parameters will have the most significance for reducing soft error rates and the impact that process variation will have on the accuracy of soft error rate estimates. We identify supply voltage and transistor type as being the most significant parameters affecting soft errors in logic cells across several technology scales. Additionally, we provide a ranking of more than a dozen parameters, across four technology scales, based on the significance of their impact on soft error rates.

A Delay-Efficient Radiation-Hard Digital Design Approach Using CWSP Elements [p. 354]
C. Nagpal, R. Garg and S.P. Khatri

In this paper, we present a radiation-hardened digital design approach. This approach is based on the use of Code Word State Preserving (CWSP) elements at each flip-flop of the design, leaving the rest of the design unaltered. The CWSP element provides 100% SET protection for glitch widths up to min{Dmin/2, (Dmax - δ)/2}, where Dmin and Dmax are the minimum and maximum circuit delay respectively and δ is an extra delay associated with our SET protection circuit. The CWSP circuit has two inputs - the latch output signal and the same signal delayed by a quantity d. If an SET error is detected, the current computation is repeated using the correct output, which is generated later in the same clock period by the CWSP element. Unlike previous approaches, we use the CWSP element in a secondary path and the CWSP logic is designed to minimally impact the critical delay path of the design. The delay penalty of our approach (averaged over several designs) is less than 1%. Thus our technique is applicable for high-speed designs, where the additional delay associated with SET protection must be kept at a minimum.
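
The protected glitch-width bound quoted above is easy to evaluate numerically. The helper below is a minimal sketch: the function name, the parameter `delta` (standing for the extra delay of the protection circuit) and the picosecond values are assumptions made for illustration, not figures from the paper.

```python
def max_protected_glitch_width(d_min: float, d_max: float, delta: float) -> float:
    """Upper bound on the covered SET glitch width: min{Dmin/2, (Dmax - delta)/2}."""
    if d_min < 0 or d_max < d_min or delta < 0:
        raise ValueError("expected 0 <= d_min <= d_max and delta >= 0")
    return min(d_min / 2.0, (d_max - delta) / 2.0)


# Hypothetical circuit delays in picoseconds, chosen only for illustration.
SHORT_PATH_LIMITED = max_protected_glitch_width(d_min=400.0, d_max=1000.0, delta=100.0)
LONG_PATH_LIMITED = max_protected_glitch_width(d_min=900.0, d_max=1000.0, delta=200.0)
```

Which term of the minimum dominates depends on how close the shortest path delay is to the longest one.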

Towards Fault Tolerant Parallel Prefix Adders in Nanoelectronic Systems [p. 360]
W. Rao and A. Orailoglu

Future nanoelectronics based arithmetic components will enjoy abundant hardware, yet at the same time confront severe unreliability challenges. We focus on the fault tolerance of high performance parallel prefix adders (PPA), and exploit the inherent redundancy in PPAs to develop efficient fault tolerance approaches. We show that the internal invariant inherent in the parallel prefix adders provides support for online fault detection and fault masking. Furthermore, based on the particular regular structure of PPAs, an online diagnosis scheme can be developed, thus enabling the application of reconfigurability of nanoelectronics for the highly flexible online repair approaches. In contrast to traditional fault tolerance techniques that rely solely on significant external overhead, the proposed approach opens up a new genre of efficient fault tolerance techniques for arithmetic components in the nanoelectronic environment.

A Novel Low Overhead Fault Tolerant Kogge-Stone Adder Using Adaptive Clocking [p. 366]
S. Ghosh, P. Ndai and K. Roy

As the feature size of transistors gets smaller, fabricating them becomes challenging. The manufacturing process follows various corrective design-for-manufacturing (DFM) steps to avoid shorts/opens/bridges. However, it is not possible to completely eliminate the possibility of such defects. If spare units are not present to replace the defective parts, then such failures cause yield loss. In this paper, we present a fault tolerant technique to leverage the redundancy present in high speed regular circuits such as the Kogge-Stone adder (KSA). Due to its regularity and speed, the KSA is widely used in ALU design. In a KSA, the carries are computed fast by computing them in parallel. Our technique is based on the fact that even and odd carries are mutually exclusive. Therefore, a defect in an even bit can only corrupt the even Sum outputs whereas the odd Sums are computed correctly (and vice versa). To efficiently utilize the above property of the KSA in the presence of defects, we perform addition in two clock cycles. In cycle-1, one of the correct sets of bits (even or odd) is computed and stored at the output registers. In cycle-2, the operands are shifted by one bit and the remaining set of bits (odd or even) is computed and stored. This allows us to tolerate the defect at the cost of throughput degradation while maintaining high frequency and yield. The proposed technique can tolerate any number of faults as long as they are confined to either even or odd bits (but not both). Further, this technique is applicable to any type of fault model (stuck-at, bridging, complete opens/shorts). We performed simulations on a 64-bit KSA using 180nm devices. The results indicate that the proposed technique incurs less than 1% area overhead. Note that there is very little throughput degradation (<0.3%) for the fault-free adders. The proposed technique utilizes the existing scan flip-flops for storage and shifting operations to minimize the area/performance overhead.
Finally, the proposed technique is used in a superscalar processor, whereby the faulty adder is assigned lower priority than fault-free adders to reduce the overall throughput degradation. Experiments performed using Simplescalar for a superscalar pipeline (with four integer adders) show throughput degradation of 0.5% in the presence of a single defective adder.
Keywords: Stuck-at faults, Fault tolerant adder, Adaptive clocking, Kogge-Stone adder, Scheduling.
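
The parity argument can be checked on a small behavioural model. The sketch below is not the authors' circuit or their two-cycle scheme: it assumes a 16-bit Kogge-Stone prefix network and a hypothetical fault model in which the generate output of one prefix node, identified by `(level, column)`, is inverted. Note that the carry out of column i feeds sum bit i+1, so the parity of the affected sum bits is shifted by one relative to the affected carry columns.

```python
from typing import List, Optional, Tuple

WIDTH = 16
Fault = Optional[Tuple[int, int]]  # (prefix level, column), hypothetical fault site


def kogge_stone_carries(a: int, b: int, fault: Fault = None) -> List[int]:
    """Return the carry out of every column, i.e. the group generate G[i:0]."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(WIDTH)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(WIDTH)]
    level, dist = 0, 1
    while dist < WIDTH:
        new_g, new_p = g[:], p[:]
        for i in range(dist, WIDTH):
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        if fault is not None and fault[0] == level:
            new_g[fault[1]] ^= 1  # inject the fault: invert one generate output
        g, p = new_g, new_p
        level, dist = level + 1, dist * 2
    return g


def kogge_stone_sum(a: int, b: int, fault: Fault = None) -> int:
    """Sum bit i is the propagate bit XOR the carry out of column i-1."""
    carry_in = 0
    for i, carry in enumerate(kogge_stone_carries(a, b, fault)):
        carry_in |= carry << (i + 1)
    return ((a ^ b) ^ carry_in) & ((1 << WIDTH) - 1)


def corrupted_columns(a: int, b: int, fault: Tuple[int, int]) -> List[int]:
    """Columns whose carry differs between the fault-free and the faulty network."""
    good = kogge_stone_carries(a, b)
    bad = kogge_stone_carries(a, b, fault)
    return [i for i in range(WIDTH) if good[i] != bad[i]]
```

Because every prefix level after the faulty node combines columns that are an even distance apart, a single faulty node can only disturb carries in columns of its own parity in this model.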


3.6: EMBEDDED TUTORIAL - Software for Wireless Networked Embedded Systems

Organizers: J. Beutel, ETH Zurich, CH; M. Beigl, TU Braunschweig, DE
Moderator: M. Beigl, TU Braunschweig, DE

Software for Wireless Networked Embedded Systems [p. 372]
Presenters: A. Dunkels, K. Langendoen, J. Beutel

Embedded systems driven by future applications will be tightly coupled with the increasing complexity of the real world. Consisting of myriads of wireless networked devices, of heterogeneous architectures, distributed and interacting in a number of ways and serving a multitude of purposes, systems have to adapt and take advantage of conditions unpredictable at design time. In their realisation, software on both the system and the application level is playing an increasingly important role, and the two levels cannot be designed independently. Dominant design factors are the severe resource constraints, the unreliability of the wireless medium and the dynamics of both the applications and the environment. Selected challenges in the area of wireless sensor networks are addressed by the speakers in this special session, highlighting the current gap between theory and practice in an emerging field.


3.7: Power Optimisation by Supply and Ground Voltage Control

Moderators: R. Zafalon, STMicroelectronics, IT; D. Soudris, Democritus U of Thrace, GR
Fine-Grained Supply Gating Through Hypergraph Partitioning and Shannon Decomposition for Active Power Reduction [p. 373]
L. Leinweber and S. Bhunia

Energy-efficient performance has emerged as the key design objective of high-performance logic circuits to address power-induced reliability concerns and battery life requirements in portable devices. In the sub-65nm technology regime, these problems continue to grow as leakage power becomes the predominant form of power consumption. Among numerous power reduction techniques employed at the circuit and architectural levels, supply gating has been proven to be very effective for standby power reduction. In this paper, we propose application of fine-grained supply gating to large complex circuits for active leakage and dynamic power reduction. A design methodology and associated CAD tool is developed to synthesize combinational logic using hypergraph partitioning and Shannon decomposition, which reduces both leakage and switching power by disabling unused logic dynamically in small clusters of gates. Simulation results for a set of ISCAS-85 benchmarks show that the proposed approach can achieve up to 40% saving in total power in active mode (and up to 37% saving in standby power) with negligible impact on performance and die area for a predictive 32 nm technology.
Index Terms - Low Power Design, Supply Gating, Active Power, Hypergraph Partitioning.
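
Shannon decomposition splits a function on one input x into two cofactors, f = x·f(x=1) + x'·f(x=0), so only one cofactor is needed for any given value of x. The sketch below illustrates that idea in software only; `shannon_cofactors`, `evaluate_gated` and the `majority` example are hypothetical names, and the code does not model the paper's partitioning or gating circuitry.

```python
from typing import Callable, Tuple

BoolFn = Callable[..., bool]


def shannon_cofactors(f: BoolFn, var: int) -> Tuple[BoolFn, BoolFn]:
    """Return (f with input `var` fixed to 1, f with input `var` fixed to 0)."""
    def fix(value: bool) -> BoolFn:
        def cofactor(*rest: bool) -> bool:
            args = list(rest)
            args.insert(var, value)  # put the fixed input back in its position
            return bool(f(*args))
        return cofactor
    return fix(True), fix(False)


def evaluate_gated(f: BoolFn, var: int, *inputs: bool) -> Tuple[bool, str]:
    """Evaluate f through a single cofactor and report which one was active."""
    positive, negative = shannon_cofactors(f, var)
    rest = inputs[:var] + inputs[var + 1:]
    if inputs[var]:
        return positive(*rest), "positive"  # the negative cofactor could be gated off
    return negative(*rest), "negative"      # the positive cofactor could be gated off


def majority(a: bool, b: bool, c: bool) -> bool:
    """Hypothetical example function: true when at least two inputs are true."""
    return (a and b) or (a and c) or (b and c)
```

For example, `evaluate_gated(majority, 0, True, False, True)` evaluates only the cofactor with the first input fixed to 1; the idle cofactor is the kind of logic that supply gating would disable.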

A Scalable Algorithmic Framework for Row-Based Power-Gating [p. 379]
A. Sathanur, A. Pullini, L. Benini, A. Macii, E. Macii and M. Poncino

Leakage power is a serious concern in nanometer CMOS technologies. In this paper we focus on leakage reduction through automatic insertion of sleep transistors for power gating in standard cell based designs. In particular, we propose clustering algorithms for a row-based power-gating methodology, which uses rows of the layout as the granularity for clustering. Our clustering methodology performs power-gating driven by both timing and area constraints, in contrast to the purely timing-driven power-gating proposed in previous works. We present two distinct clustering algorithms with different accuracy-efficiency trade-offs: an optimal one, which exploits a 0-1 or Binary Integer Programming approach, and a heuristic one, which resorts to an implicit enumeration of the layout rows. Results show that, for all the benchmarks, the leakage power savings, as compared to previous techniques, are more than 75% when we have the same timing constraints but half the sleep transistor area, and at least 60% when the area constraint is set at one fourth. We also show that we can perform clustering with no speed degradation and achieve maximum leakage power savings of up to 83%.

Coarse-Grain MTCMOS Sleep Transistor Sizing Using Delay-Budgeting [p. 385]
E. Pakbaznia and M. Pedram

Power gating is one of the most effective techniques in reducing the standby leakage current of VLSI circuits. In this paper we introduce a new approach for sleep transistor sizing which minimizes the total sleep transistor width for a coarse-grain multi-threshold CMOS circuit assuming a given standard cell and sleep transistor placement. First, the circuit is decomposed into a set of modules, each containing the set of logic cells that are closest to a sleep transistor cell. Next, given an upper bound on the overall circuit speed degradation, the global timing slack is distributed among different clusters using delay budgeting. The slack distribution result is then used to size the sleep transistors such that the total sleep transistor width is minimized while accounting for the parasitic resistances of the virtual ground net. Results show that the proposed sizing algorithm produces sleep transistor sizes that are 40% smaller than those produced by previous approaches.


4.1: Physical Architectures (Automotive Systems Day)

Organizers: A. Sangiovanni-Vincentelli, UC Berkeley, US; M. Di Natale, Scuola S Anna, Pisa, IT
Moderator: A. Sangiovanni-Vincentelli, UC Berkeley, US
Physical Architectures of Automotive Systems [p. 391]
T. Forest, A. Ferrari, G. Audisio, M. Sabatini, A. Sangiovanni-Vincentelli and M. Di Natale

This section will provide insight into new developments and advances in automotive electronics architectures. The design of innovative chip architectures, new upcoming standards for high-bandwidth and deterministic communication (FlexRay) and sensors are the domains of interest, with emphasis on reliability and support for advanced active safety functions.


4.2: High-Level Models for Validation

Moderators: I. Harris, UC Irvine, US; V. Bertacco, U of Michigan, US
A Mutation Model for the SystemC TLM 2.0 Communication Interfaces [p. 396]
N. Bombieri, F. Fummi and G. Pravadelli

Mutation analysis is a widely-adopted strategy in software testing with two main purposes: measuring the quality of test suites, and identifying redundant code in programs. Similar approaches are applied in hardware verification and testing too, especially at RTL or gate level, where mutants are generally referred to as faults, and mutation analysis is performed by means of fault modeling and fault simulation. However, in modern embedded systems there is a close integration between HW and SW parts, and verification strategies should be applied early in the design flow. This requires the definition of new mutation analysis-based strategies that work at system level, where HW and SW functionalities are not partitioned yet. In this context, the paper proposes a mutation model for perturbing transaction level modeling (TLM) SystemC descriptions. In particular, the main constructs provided by the SystemC TLM 2.0 library have been analyzed, and a set of mutants is proposed to perturb the primitives related to the TLM communication interfaces.

Efficient Design Validation Based on Cultural Algorithms [p. 402]
W. Wu and M.S. Hsiao

We introduce a new semi-formal design validation framework to justify hard-to-reach corner-case states. We propose a cultural learning technique to identify the swarming of domain knowledge during the search. In addition, our guidance strategy abstracts sets of partitioned state variables, from which pre-images are computed to capture the expanded portions of the state spaces related to a target state. Experimental results show that our approach is more effective at reaching hard-to-reach states than existing methods.

Algorithms for Maximum Satisfiability Using Unsatisfiable Cores [p. 408]
J. Marques-Silva and J. Planes

Many decision and optimization problems in Electronic Design Automation (EDA) can be solved with Boolean Satisfiability (SAT). Moreover, well-known extensions of SAT also find application in EDA, including Pseudo-Boolean Optimization, Quantified Boolean Formulas, Multi-Valued SAT and, more recently, Maximum Satisfiability (MaxSAT). Algorithms for MaxSAT are still fairly inefficient in industrial settings, in part because the most effective SAT techniques cannot be easily extended to MaxSAT. This paper proposes a novel algorithm for MaxSAT that improves on existing state-of-the-art solvers by orders of magnitude on industrial benchmarks. The new algorithm exploits modern SAT solvers, being based on the identification of unsatisfiable subformulas. Moreover, the new algorithm provides additional insights into the relationship between unsatisfiable subformulas and the maximum satisfiability problem.
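
To pin down what MaxSAT asks for, the sketch below solves a tiny instance by exhaustive search. It is deliberately not the core-guided algorithm of the paper, only a brute-force baseline; the function `max_sat`, the DIMACS-style literal encoding and the example clauses are assumptions chosen for illustration.

```python
from itertools import product
from typing import Sequence


def max_sat(clauses: Sequence[Sequence[int]], num_vars: int) -> int:
    """Return the largest number of clauses any assignment satisfies.

    Literals follow the DIMACS convention: k means variable k is true,
    -k means variable k is false (variables are numbered from 1).
    """
    best = 0
    for bits in product([False, True], repeat=num_vars):
        satisfied = sum(
            any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        best = max(best, satisfied)
    return best


# Hypothetical instance: (x1) and (not x1) form an unsatisfiable subformula,
# so at least one of the four clauses must be falsified.
CLAUSES = [[1], [-1], [1, 2], [-2]]
BEST = max_sat(CLAUSES, num_vars=2)
```

Every unsatisfiable subformula forces at least one falsified clause, which is the link between unsatisfiable cores and the MaxSAT optimum that core-based approaches exploit.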

In-Band Cross-Trigger Event Transmission for Transaction-Based Debug [p. 414]
S. Tang and Q. Xu

Cross-trigger, the mechanism to trigger activities in one debug entity from debug events that happened in another debug entity, is a very useful technique for debugging applications involving multiple embedded cores. Existing solutions rely on dedicated interconnects (i.e., different from functional interconnects) to transfer debug events and cannot guarantee that the arrival time of the debug events coincides with the arrival time of the data messages between multiple cores. This results in mismatches between the observed system internal operations and the ones that designers expect to watch. To tackle the above problem, in this paper, we propose to package the cross-trigger events and the actual data together into transaction messages and transfer them along the same functional interconnects (namely in-band debug event transmission), with the help of novel design-for-debug circuits. Simulation results on a hypothetical NoC-based system show the effectiveness of the proposed technique.


4.3: Power Grid and Interconnect Modelling

Moderators: R. Suaya, Mentor Graphics, FR; N. van der Meijs, TU Delft, NL
Efficient Representation and Analysis of Power Grids [p. 420]
J.M.S. Silva, J.R. Phillips and L.M. Silveira

Modern deep sub-micron ULSI designs with hundreds of millions of devices require huge grids for power distribution. Such grids, operating with increasingly low-power voltages, are a design limiting factor and accurate analysis of their behavior is of paramount importance as any voltage drops can seriously impact performance or functionality. As power grid models have millions of unknowns, highly optimized special purpose simulation tools are required to handle the time and memory complexity of solving for their dynamic behavior. In this work, we propose a hierarchical matrix representation of the power grid model that is both space and time efficient. With this representation, reduced storage matrix factors are efficiently computed and applied in the analysis at every time-step of the simulation. Results show an almost linear complexity growth, namely O(n log^a(n)), for some small constant a, in both space and time, when using this matrix representation. Comparisons of our academic implementation with production-quality code prove this method to be very efficient when dealing with the simulation of large power grid models.

High-Frequency Mutual Impedance Extraction of VLSI Interconnects in the Presence of a Multi-Layer Conducting Substrate [p. 426]
N. Srivastava, R. Suaya and K. Banerjee

We propose a computationally efficient method to calculate, with high accuracy, the mutual impedance between two wires in the presence of multilayer substrates, as needed for high frequency CAD applications. The resulting accuracy (errors smaller than 2%) and CPU time reduction (factors of seven) emerge from three different ingredients: a two dimensional Green's function approach with the correct quasi-static limit, a modified discrete complex image approximation to the Green's function, and a novel discrete dipole approximation to evaluate the magnetic vector potential. This approach permits the evaluation of the mutual impedance between two loops in terms of easily computable analytical expressions that involve the relative separations and the electromagnetic parameters of the multi-layer substrate. The results are valid for long wires, for any separation, and for frequencies up to 100 GHz.

ETBR: Extended Truncated Balanced Realization Method for On-Chip Power Grid Network Analysis [p. 432]
D. Li, S.X.-D. Tan and B. McGaughy

In this paper, we present a novel simulation approach for power grid network analysis. The new approach, called ETBR for extended truncated balanced realization, is based on model order reduction techniques to reduce the circuit matrices before the simulation. Different from the (improved) extended Krylov subspace methods EKS/IEKS [15, 2], ETBR performs fast truncated balanced realization on the response Grammian to reduce the original system at a computation cost similar to that of EKS. ETBR also avoids the adverse explicit moment representation of the input signals. Instead, it uses a spectrum representation of the input signals obtained by fast Fourier transformation. As a result, ETBR is more flexible for different types of input sources and can better capture the high frequency contents than EKS, and this leads to more accurate results especially for fast changing input signals. Experimental results on a number of large networks (up to one million nodes) show that, given the same order of the reduced model, ETBR is indeed more accurate than the EKS method especially for input sources rich in high-frequency components. ETBR also shows computation costs similar to those of EKS and lower memory consumption than EKS.

Bandwidth-Centric Optimization for Area-Constrained Links with Crosstalk Avoidance Methods [p. 438]
B. Halak and A. Yakovlev

The effect of crosstalk avoidance codes on the throughput of fixed width communication channels is studied. Closed form expressions of the throughput, which incorporate the dimensions of the interconnects and the wire overheads incurred by such techniques, are derived for lines under different buffering conditions. These formulae are utilised to optimise the bandwidth of fixed width parallel buses under different latency and reliability constraints. Our results are confirmed by the simulations we have performed in Spectre for a UMC CMOS 90nm technology.


4.4: Algorithms and Architectures Optimisation for Baseband Processing

Moderators: J. Dielissen, NXP Semiconductors, NL; C. Bouganis, Imperial College London, UK
Optimizing Near-ML MIMO Detector for SDR Baseband on Parallel Programmable Architectures [p. 444]
M. Li, B. Bougard, D. Novo, L. Van Der Perre and F. Catthoor

ML and near-ML MIMO detectors have attracted a lot of interest in recent years. However, almost all the reported implementations are delivered in ASICs or FPGAs. Our contribution is optimizing the near-ML MIMO detector for parallel programmable architectures, such as those with ILP and DLP features. In the proposed SSFE (Selective Spanning with Fast Enumeration), architecture-friendliness is explicitly introduced from the very beginning of the design flow. Importantly, high level algorithmic transformations make the dataflow pattern and structure fit architecture-characteristics very well. We enable abundant vector-parallelism with highly regular and deterministic dataflow in the SSFE; memory rearrangements, shuffling and non-predictable dynamism are all elaborately excluded. Hence, the SSFE can be easily parallelized and efficiently mapped onto ILP and DLP architectures. Furthermore, to fine-tune the SSFE on parallel architectures, extensive pre-compiler transformations are applied with the help of the application-level information. These optimize not only computation-operations but also address-generations and memory-accesses. Experiments show that the SSFE brings very efficient resource-utilizations on real-life VLIW architectures. Specifically, with the SSFE the percentage of NOP instructions on VLIW is below 1%, even better than that achieved by the software-pipelined FFT. To the best of our knowledge, this is the first reported work about comprehensive optimizations of near-ML MIMO detectors for parallel programmable architectures.

Vectorization of Reed Solomon Decoding and Mapping on the EVP [p. 450]
A. Kumar and K. Van Berkel

Reed Solomon (RS) codes are used in a variety of (wireless) communication systems. Although commonly implemented in dedicated hardware, this paper explores the mapping of high-throughput RS decoding on vector DSPs. The four modules of such a decoder, viz. Syndrome Computation, Key Equation Solver, Chien Search, and Forney, pose different vectorization challenges. Their vectorizations are explained in detail, including optimizations specific to the Embedded Vector Processor (EVP). For RS (255,239), this solution is benchmarked against published implementations, and scalability up to vector size 64 is explored. The best and the worst case throughput of our implementation are 8 times and 2 times higher, respectively, than those of other architectures.

A Case Study in Reliability-Aware Design: A Resilient LDPC Code Decoder [p. 456]
M. May, M. Alles and N. Wehn

Chip reliability becomes a great threat to the design of future microelectronic systems with the continuation of the progressive downscaling of CMOS technologies. Hence increasing the robustness of chip implementations in terms of error tolerance becomes an important issue. In this paper we present a case study in reliability-aware design tolerating transient errors. A state-of-the-art WiMAX channel decoder for LDPC codes is investigated on all design levels to increase its reliability for a given system performance with minimum hardware overhead. We show that an efficient exploitation of the algorithmic fault-tolerance yields a fairly small area overhead with nearly no degradation in communications performance even under high error injection rates.


4.5: DFX: Support for Test, Manufacturing, and Diagnosis

Moderators: T. Yoneda, Nara Inst. of Science and Technology, JP; J. Schloeffel, NXP Semiconductors, NL
Low Power Illinois Scan Architecture for Simultaneous Power and Test Data Volume Reduction [p. 462]
A. Chandra, F. Ng and R. Kapur

We present Low Power Illinois scan architecture (LPILS) to achieve power dissipation and test data volume reduction, simultaneously. By using the proposed scan architecture, dynamic power dissipation during scan testing in registers and combinational cells can be significantly reduced without modifying the clock tree of the design. The proposed architecture is independent of the ATPG patterns and imposes a very small combinational area penalty due to the logic added between the scan cells and the CUT. Experimental results for two industrial circuits show that we can simultaneously achieve up to 47% reduction in dynamic power dissipation due to switching and 10X test data volume reduction with LPILS over basic scan.

Scan Chain Organization for Embedded Diagnosis [p. 468]
M. Elm and H.-J. Wunderlich

Keeping diagnostic resolution as high as possible while maximizing the compaction ratio has been a subject of research since the advent of embedded test. In this paper, we present a novel scan design methodology to maximize diagnostic resolution when compaction is employed. The essential idea is to consider the diagnostic resolution during the clustering of scan elements to scan chains. Our methodology does not depend on a fault model and is helpful with any type of compactor. A linear time heuristic is presented to solve the scan chain clustering problem. We evaluate our approach for industrial and academic benchmark circuits. It turns out to be superior to both random and layout-driven scan chain clustering. The methodology is applicable to any gate-level design and fits smoothly into an industrial design flow.
Keywords - Design for diagnosis, embedded test, scan design

State Skip LFSRs: Bridging the Gap between Test Data Compression and Test Set Embedding for IP Cores [p. 474]
V. Tenentes, X. Kavousianos and E. Kalligeros

We present a new type of Linear Feedback Shift Registers, State Skip LFSRs. State Skip LFSRs are normal LFSRs with the addition of a small linear circuit, the State Skip circuit, which can be used, instead of the characteristic-polynomial feedback structure, for advancing the state of the LFSR. In such a case, the LFSR performs successive jumps of constant length in its state sequence, since the State Skip circuit omits a predetermined number of states by calculating directly the state after them. By using State Skip LFSRs we get the well-known high compression efficiency of test set embedding with substantially reduced test sequences, since the useless parts of the test sequences are dramatically shortened by traversing them in State Skip mode. The length of the shortened test sequences approaches that of test data compression methods. A systematic method for minimizing the test sequences of reseeding-based test set embedding methods, and a low overhead decompression architecture are also presented.
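
The idea of jumping a fixed number of states with a linear circuit can be illustrated with a small sketch. It assumes a Fibonacci-style LFSR whose one-step update is a matrix over GF(2); the k-step skip is then the k-th power of that matrix. The function names, the 4-bit width and the tap mask are illustrative assumptions, not taken from the paper, and the sketch says nothing about the paper's hardware architecture.

```python
def lfsr_step(state, taps, n):
    """Advance an n-bit Fibonacci LFSR by one state (taps is a feedback bitmask)."""
    feedback = bin(state & taps).count("1") & 1
    return ((state << 1) | feedback) & ((1 << n) - 1)

def step_matrix(taps, n):
    """One-step transition matrix over GF(2); row i is a bitmask of the
    old state bits that are XORed together to form new bit i."""
    rows = [taps]                                  # new bit 0 = feedback
    rows += [1 << (i - 1) for i in range(1, n)]    # new bit i = old bit i-1
    return rows

def apply_matrix(rows, state):
    """Multiply a GF(2) matrix (list of row bitmasks) by a state vector."""
    out = 0
    for i, row in enumerate(rows):
        out |= (bin(row & state).count("1") & 1) << i
    return out

def compose(a, b):
    """Matrix product a*b over GF(2): apply b first, then a."""
    n = len(a)
    result = []
    for i in range(n):
        acc = 0
        for j in range(n):
            if (a[i] >> j) & 1:
                acc ^= b[j]
        result.append(acc)
    return result

def skip_matrix(taps, n, k):
    """Linear map that advances the LFSR by k states in a single application."""
    one_step = step_matrix(taps, n)
    result = [1 << i for i in range(n)]            # identity matrix
    for _ in range(k):
        result = compose(one_step, result)
    return result

# Hypothetical 4-bit LFSR; jump 5 states at once from state 0b1001.
skip5 = skip_matrix(0b1100, 4, 5)
jumped = apply_matrix(skip5, 0b1001)
```

Because the skip map is linear, each row of `skip5` corresponds to a fixed XOR network, which is why a constant-length jump can be realized as a small combinational circuit.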

Automated Testability Enhancements for Logic Brick Libraries [p. 480]
J.G. Brown, B. Taylor, R.D.S. Blanton and L. Pileggi

Circuit fabrics composed of highly regular structures, called logic bricks, have been described recently for improving yield. An automated logic brick design flow based on a SAT formulation of the brick routing has been developed to minimize wire length and the number of vias while maintaining several design-for-manufacturability constraints. In this work, testability enhancements are imposed on a logic brick to reduce the likelihood of (i) feedback bridges to improve test and (ii) equivalent faults to improve diagnosis. This is accomplished by adding constraints to the SAT formulation of the logic brick routing that restrict certain wires from being routed in close proximity, thus making bridges between them unlikely. Application to several brick designs resulted in critical-area reductions for targeted bridges with little degradation in terms of additional wire length and via count.


4.6: Model-Based Design for Embedded Systems

Moderators: E. Brinksma, Embedded Systems Institute, NL; P. Mosterman, The MathWorks, US
A Game-Theoretic Approach to Real-Time System Testing [p. 486]
A. David, K.G. Larsen, S. Li and B. Nielsen

This paper presents a game-theoretic approach to the testing of uncontrollable real-time systems. By modelling the systems with Timed I/O Game Automata and specifying the test purposes as Timed CTL formulas, we employ a recently developed timed game solver UPPAAL-TIGA to synthesize winning strategies, and then use these strategies to conduct black-box conformance testing of the systems. The testing process is proved to be sound and complete with respect to the given test purposes. Case study and preliminary experimental results indicate that this is a viable approach to uncontrollable timed system testing.

Modeling Event Stream Hierarchies with Hierarchical Event Models [p. 492]
J. Rox and R. Ernst

Compositional Scheduling Analysis couples local scheduling analysis via event streams. While local analysis has successfully been extended to include hierarchical scheduling strategies, event streams are still flat. In this paper, we generalize the concept of a stream hierarchy to embed different types of streams in a higher level structure. We explain why this extension is a natural match to model streams generated by communication stacks that are ubiquitous in networked embedded systems. We formally define the hierarchical event model and give operations to encode, combine, and extract stream properties that can be used in flat or hierarchical local scheduling analysis. Finally, we give an example and demonstrate that the proposed model enables superior analysis results.

Semantics for Model-Based Validation of Continuous/Discrete Systems [p. 498]
L. Gheorghe, F. Bouchhima, G. Nicolescu and H. Boucheneb

Continuous and discrete components can be integrated in diverse systems including defense, medical, electronic, communication, and automotive applications. Given the heterogeneity of concepts that have to be taken into consideration, their design involves overcoming specific global modeling and validation challenges. This paper presents semantics for model-based validation of continuous/discrete systems. It focuses on the simulation interfaces semantics, representation and verification. The proposed approach is applied for the validation of a continuous/discrete medical system, an automatic glycemia level regulator.

Using UML as Front-End for Heterogeneous Software Code Generation Strategies [p. 504]
L.B. Brisolara, M.F.S. Oliveira, R. Redin, L.C. Lamb, L. Carro and F. Wagner

In this paper we propose an embedded software design flow, which starts from a UML model and provides automatic mapping to other models like Simulink or finite-state machines (FSM). An automatic synthesis of an executable and synthesizable Simulink model is also proposed, enabling the use of UML as front-end for a multi-model design strategy that includes a Simulink-based MPSoC target design flow. In addition, the proposed synthesis tool automatically handles processor allocation, mapping of threads to processors, and insertion of required Simulink temporal barriers, ports, and dataflow connections. Following this approach, the UML model is mapped to the most appropriate model and specialized code generators are used. Therefore, this approach allows designers to employ UML to model the whole system and reuse this model to generate code using different strategies and targeting different platforms.


4.7: PANEL SESSION - Caution Ahead: The Road to Design and Manufacturing at 32 and 22 nm

Organizer: S. Turnoy, Synopsys, US
Moderator: P. Wintermeyr, Elektronik.net, DE

PANEL - Caution Ahead: The Road to Design and Manufacturing at 32 and 22 nm [p. 510]
Panelists: R. Aitken, R. Lauwereins, J. Tracy Weed, V. Kiefer and J. Hartmann

At 32 and 22 nm, which manufacturing technology changes will be so revolutionary as to cause upheavals in the semiconductor supply chain and in design practices?
  • Will there be economic fallout from the higher mask cost associated with dual patterning? How will designers deal with place-and-route restrictions?
  • How likely is "direct write"? What design and OPC tool changes will be required?
  • When dealing with stress and CMP, will we need to replace DRC with a new breed of tools?
  • How will designers "sign off" on a design at 32 nm?

These are just some of the challenges ahead. For every solution, collateral adjustments must be made to design technologies and methodologies. Everyone from designer to foundry equipment manufacturer would do well to look ahead at these potential hazards on the road to 32 and 22 nm.


IP2 Interactive Presentations

Fault Clustering in Deep-Submicron CMOS Processes [p. 511]
J. Schat

The fraction of ICs that pass all production tests but fail in the application is called the defect level. Defect levels depend on the average number of defects per IC, and also on the clustering of these defects. High clustering leads to a higher yield and a lower defect level. This paper compiles the coefficients for defect clustering using research findings from 1970 until 2001. Because recent data for deep submicron processes are missing in the literature, the clustering coefficient has been calculated using scan fail distributions of ICs in a 180 nm process. Clustering coefficients show a steady trend towards higher defect clustering. This is beneficial, but it is probably not sufficient to achieve today's ambitious target of 'zero defects'.
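
The claim that higher clustering raises yield for the same average defect count can be checked numerically. The sketch below assumes the widely used negative binomial yield model, Y = (1 + λ/α)^(-α), where λ is the average number of defects per IC and α is a clustering parameter (smaller α means stronger clustering). Whether the paper's clustering coefficient is defined exactly this way is an assumption, and the numeric values are illustrative only.

```python
import math

def yield_negative_binomial(avg_defects, alpha):
    """Yield under the negative binomial model; smaller alpha = more clustering."""
    return (1.0 + avg_defects / alpha) ** (-alpha)

def yield_poisson(avg_defects):
    """Yield with no clustering (the limit of the model above as alpha grows)."""
    return math.exp(-avg_defects)

# Same average of one defect per IC, three degrees of clustering.
strongly_clustered = yield_negative_binomial(1.0, 0.5)
weakly_clustered = yield_negative_binomial(1.0, 2.0)
unclustered = yield_poisson(1.0)
```

With clustered defects, the same total number of defects is concentrated on fewer ICs, so more ICs remain defect-free.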

Energy Efficient and High Speed On-Chip Ternary Bus [p. 515]
C. Duan and S.P. Khatri

We propose two crosstalk reducing coding schemes using ternary busses. In addition to low power consumption and reduced delay, our schemes offer other advantages over binary coding schemes such as zero area overhead and simple, regular and fast CODEC design.

Task Scheduling with Configuration Prefetching and Anti-Fragmentation Techniques on Dynamically Reconfigurable Systems [p. 519]
F. Redaelli, M.D. Santambrogio and D. Sciuto

The aim of this paper is to define a schedule for the task graph of an application that minimizes its total execution time on a partially dynamically reconfigurable FPGA. The scheduler has to take into account the reconfiguration overhead of each task, the area constraint of the target FPGA, the precedences between the tasks, configuration prefetching and module reuse. We introduce an ILP formulation to solve the task scheduling problem in the reconfigurable architecture scenario. This formulation has been used to identify interesting features for a possible heuristic scheduler. The results of the ILP solution show how a reconfiguration-aware scheduler exploiting all the reconfiguration features can outperform one with partial knowledge.

Fast Analog Circuit Synthesis Using Sensitivity Based Near Neighbor Searches [p. 523]
A. Pradhan and R. Vemuri

We present an efficient analog synthesis algorithm employing regression models of circuit matrices. Circuit matrix models achieve accurate and speedy synthesis of analog circuits. In this paper, synthesis is accelerated by eliminating numerous computations of the matrix elements during a synthesis run. Computations are avoided by reusing exact or nearby design points visited during previous synthesis iterations. Hashing and multidimensional nearest neighbor lookup are used in incremental evaluation of design solutions encountered during synthesis. Sensitivity of the design variables is considered for locating a neighboring solution. Neighbor lookup is efficiently performed using box-decomposition trees. The proposed method is used to synthesize three benchmark circuits. Results show that with hashing and neighbor lookup, synthesis is 6x-13x faster than with the use of matrix models alone.

Spatial Correlation Extraction via Random Field Simulation and Production Chip Performance Regression [p. 527]
B. Liu

Statistical timing analysis needs a priori knowledge of process variations. Lack of such a priori knowledge of process variations prevents accurate statistical timing analysis, for which foundry confidentiality policy has largely been blamed. A significant part of process variations are design specific, and can only be extracted from production chip performance statistics. In this paper, I adopt the homogeneous isotropic random field model for intra-die random variations, apply fast Fourier transform (FFT) to simulate a homogeneous isotropic random field, obtain corners for Monte Carlo SPICE simulation of timing critical paths in a VLSI circuit, and apply regression to match production chip performance statistics. Experimental results based on a timing critical path in an industry design with 65nm Predictive Technology Models reveal constant mean, increased standard deviation, and decreased skewness of a signal propagation path delay as spatial correlation increases. The proposed spatial correlation extraction technique can be applied in a chip tapeout process, where process variations extracted from an early tapeout help to improve statistical timing analysis accuracy and guide engineering change order of subsequent tapeouts.

A Methodology for Improving Software Design Lifecycle in Embedded Control Systems [p. 533]
M.E.M. Ben Gaid, R. Kocik, Y. Sorel and R. Hamouche

Control design and real-time implementation are usually performed in isolation. The effects of the computer implementation on control system performance are still evaluated in the last phases of the development cycle. It is expected that modeling the computer implementation in order to simulate its impact on control would help reduce the length and the effort of the development cycle. This paper proposes ideas towards achieving these objectives. To this end, implementation effect on control performance is first studied. Then, we describe the preliminary ideas of a methodology considering a control law designed with the Scicos simulation environment and implemented on a distributed architecture with the SynDEx system-level CAD tool. This methodology allows simulating the impact of the distributed implementation early in the design lifecycle and provides an automatic code generation of this implementation.

Finding the Worst Voltage Violation in Multi-Domain Clock Gated Power Network [p. 537]
W. Zhang, Y. Zhu, W. Yu, L. Zhang, R. Shi, H. Peng, Z. Zhu, L. Chua-Eoan, R. Murgai, T. Shibuya, N. Ito and C.-K. Cheng

This paper proposes an efficient method to find the worst case of voltage violation by multi-domain clock gating in an on-chip power network. We first present a voltage response in an arbitrary multi-domain clock gating pattern, using a superposition technique. Then, an integer linear programming (ILP) formulation is proposed to identify the worst-case gating pattern and the maximum variation area. The ILP based method is significantly faster than a conventional method based on enumeration. The experimental results are also compared with a case where peak voltage variation is induced, which shows the latter technique largely underestimated the overall variation effect.

A System Architecture for Reconfigurable Trusted Platforms [p. 541]
B. Glas, A. Klimm, O. Sander, K. Müller-Glaser and J. Becker

For improving the security of embedded systems, trusted computing is a promising technology. For the area of microprocessors in general and personal computers in particular the Trusted Computing Group (TCG) has published detailed specifications. The resulting hardware has been available for some years. This contribution discusses the feasibility of deploying ideas from trusted computing in the domain of reconfigurable hardware, esp. FPGAs, and possible benefits and drawbacks. We propose using currently available FPGA technology to build a trusted platform on reconfigurable hardware. We also show how trusted computing can deal with partial dynamic reconfiguration while still allowing the user to fully exploit its potential.
Keywords: Trusted computing, TPM, FPGA, reconfigurable hardware, partial dynamic reconfiguration, embedded systems.

Automatic Generation of Complex Properties for Hardware Designs [p. 545]
F. Rogin, T. Klotz, G. Fey, R. Drechsler and S. Rülke

Property checking is a promising approach to prove the correctness of today's complex designs. However, in practice this requires the formulation of formal properties which is a time consuming and non-trivial task. Therefore the acceptance and efficiency of formal verification techniques can be raised by an automated support for formulating design properties. In this paper we propose a new methodology to automatically generate complex properties for a given design. The tool, Dianosis, implements this methodology by analyzing a simulation trace. The extracted properties describe the abstract design behavior and are presented in a format that is easy to read and can be added to the set of properties used for formal or assertion-based verification. We provide experimental results on industrial hardware designs that show the effectiveness of Dianosis and motivate the practical use.


5.1.1: Software Components for Reliable Automotive Systems (Automotive Systems Day)

Organizers: M. Di Natale, Scuola S Anna, Pisa, IT; A. Sangiovanni-Vincentelli, UC Berkeley, US
Moderator: M. Di Natale, Scuola S Anna, Pisa, IT
Software Components for Reliable Automotive Systems [p. 549]
H. Heinecke, W. Damm, B. Josko, A. Metzner, H. Kopetz, A. Sangiovanni-Vincentelli and M. Di Natale

System-level integration requires an overall understanding of the interplay of the sub-systems to enable component-based development with portability, reconfigurability and extensibility, together with guaranteed reliability and performance levels. Integration by simple interfaces and plug-and-play of sub-systems, which is the main objective of AUTOSAR, requires solving essential technical problems. We discuss to what degree the existing AUTOSAR standard can support the development of safety- and time-critical software and what is required to move toward the desirable goal of timing isolation when integrating multiple applications into the same execution platform.


5.1.2: LUNCH-TIME KEYNOTE (Automotive Systems Day)

Moderator: A. Sangiovanni-Vincentelli, UC Berkeley, US

Model-Based-Design is Nice, But... [p. 555]
H. Hanselmann

Without Model-Based-Design (MBD) today's automotive embedded systems would not exist. However, MBD generates its own challenges. Tools and concepts are helping in many areas, but the user's needs often seem to outpace the capabilities of tools and processes, especially for large systems with complex software interacting across boundaries. System Design is underdeveloped. In this keynote, an assessment of the current situation is given as well as a vision of how developers should design and test systems in the future.


5.2: Timing-Based Validation

Moderators: M. Lajolo, NEC Labs, US; F. Gaffiot, INL - ECL, FR
A Simulation Methodology for Worst-Case Response Time Estimation of Distributed Real-Time Systems [p. 556]
S. Samii, S. Rafiliu, P. Eles and Z. Peng

In this paper, we propose a simulation-based methodology for worst-case response time estimation of distributed real-time systems. Schedulability analysis produces pessimistic upper bounds on process response times. Consequently, such an analysis can lead to overdesigned systems resulting in unnecessarily increased costs. Simulations, if well conducted, can lead to tight lower bounds on worst-case response times, which can be an essential input at design time. Moreover, such a simulation methodology is very important in situations when the running application or the underlying platform is such that no formal timing analysis is available. Another important application of the proposed simulation environment is the validation of formal analysis approaches, by estimating their degree of pessimism. We have performed such an estimation of pessimism for two response-time analysis approaches for distributed embedded systems based on two of the most important automotive communication protocols: CAN and FlexRay.
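
The relationship between an analytical upper bound and a simulated lower bound can be sketched on a much simpler setting than the one in the paper: a single processor with preemptive fixed-priority scheduling of independent periodic tasks, rather than a distributed CAN or FlexRay system. The task set, the function names and the tick-based simulator are illustrative assumptions. The analysis is the classic response-time recurrence R = C_i + Σ ⌈R/T_j⌉·C_j over higher-priority tasks j.

```python
import math
import random

def rta_bound(tasks, i):
    """Response-time analysis for task i; tasks are (C, T) pairs sorted by
    decreasing priority. Assumes the task set is schedulable, so the
    fixed-point iteration converges."""
    c_i, _ = tasks[i]
    r = c_i
    while True:
        nxt = c_i + sum(math.ceil(r / t_j) * c_j for c_j, t_j in tasks[:i])
        if nxt == r:
            return r
        r = nxt

def simulate_max_response(tasks, offsets, horizon):
    """Tick-based preemptive fixed-priority simulation. Returns the largest
    response time observed for each task within the horizon."""
    remaining = [0] * len(tasks)
    release = [0] * len(tasks)
    worst = [0] * len(tasks)
    for now in range(horizon):
        # Release a new job of every task whose period boundary is reached.
        for k, (c, t) in enumerate(tasks):
            if now >= offsets[k] and (now - offsets[k]) % t == 0:
                remaining[k] = c
                release[k] = now
        # Run the highest-priority pending job for one tick.
        for k in range(len(tasks)):
            if remaining[k] > 0:
                remaining[k] -= 1
                if remaining[k] == 0:
                    worst[k] = max(worst[k], now + 1 - release[k])
                break
    return worst

tasks = [(1, 4), (2, 6), (3, 12)]                 # hypothetical (C, T) values
bounds = [rta_bound(tasks, i) for i in range(len(tasks))]
rng = random.Random(0)
offsets = [rng.randrange(t) for _, t in tasks]    # one random release pattern
observed = simulate_max_response(tasks, offsets, 200)
```

Each simulation run can only report response times that actually occurred, so `observed` never exceeds `bounds`; the gap between the two is a measure of how pessimistic the analysis might be for the scenarios explored.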

Signal Probability Based Statistical Timing Analysis [p. 562]
B. Liu

VLSI timing analysis and power estimation target the same circuit switching activity. Power estimation techniques are categorized as (1) static, (2) statistical, and (3) simulation and testing based methods. Similarly, statistical timing analysis methods are in three counterpart categories: (1) statistical static timing analysis, (2) probabilistic technique based statistical timing analysis, and (3) Monte Carlo (SPICE) simulation and testing. Leveraging existing power estimation techniques, I propose signal probability (i.e., the logic one occurrence probability on a net) based statistical timing analysis, for improved accuracy and reduced pessimism over the existing statistical static timing analysis methods, and improved efficiency over Monte Carlo (SPICE) simulation. Experimental results on ISCAS benchmark circuits show that SPSTA computes the means (standard deviations) of the maximum signal arrival times within 5.6% (7.7%), SSTA within 16.5% (46.9%), and STA within 83.0% (132.4%) on average of Monte Carlo simulation results, respectively. More significant accuracy improvements are expected in the presence of increased process and environmental variations.
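
The notion of signal probability used above (the probability that a net carries logic one) can be illustrated with the standard propagation rules from power estimation. The sketch assumes statistically independent gate inputs and a tiny hypothetical circuit; it covers only the definition of signal probability, not the timing analysis proposed in the paper.

```python
from itertools import product

def p_not(p):
    """Signal probability at the output of an inverter."""
    return 1.0 - p

def p_and(*ps):
    """Signal probability of an AND gate with independent inputs."""
    result = 1.0
    for p in ps:
        result *= p
    return result

def p_or(*ps):
    """Signal probability of an OR gate with independent inputs."""
    all_zero = 1.0
    for p in ps:
        all_zero *= 1.0 - p
    return 1.0 - all_zero

# Hypothetical circuit y = (a AND b) OR (NOT c), each input is 1 with probability 0.5.
p_y = p_or(p_and(0.5, 0.5), p_not(0.5))

# Cross-check by enumerating all equally likely input combinations.
ones = sum(1 for a, b, c in product([0, 1], repeat=3) if (a and b) or not c)
p_y_exact = ones / 8
```

The two OR inputs here depend on disjoint primary inputs, so the independence assumption holds exactly; with reconvergent fanout the simple product rules would only be approximate.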

A Current Source Model for CMOS Logic Cells Considering Multiple Input Switching and Stack Effect [p. 568]
B. Amelifard, S. Hatami, H. Fatemi and M. Pedram

This paper presents a current source model (CSM) of a CMOS logic cell, which captures simultaneous switching of multiple inputs while accounting for the effect of internal node voltages of the logic cell. Characterization procedures for various components of the proposed CSM are described and application of the model to output waveform computation is discussed. Experimental results to assess the accuracy and efficiency of the proposed multiple input switching CSM in the context of noise and timing analyses in VLSI circuits are reported.

Current Source Based Standard Cell Model for Accurate Signal Integrity and Timing Analysis [p. 574]
A. Goel and S. Vrudhula

The inductance and coupling effects in interconnects and non-linear receiver loads have resulted in complex input signals and output loads for gates in modern deep submicron CMOS technologies. As a result, the conventional method of timing characterization, which is based on lookup tables with input slew and output load capacitance as indices, is no longer adequate. The focus has now shifted to current source based standard cell models which are based on the fundamental property of transconductance of MOSFETs. In this paper we propose a systematic methodology for obtaining a current based delay model for gates, which can accommodate both single (SIS) and multi-input (MIS) switching signals of arbitrary shape and complex non-linear output loads. We use an analytical model for the gate output current expressed as a function of the node voltages. This results in an average error less than 0.5% with maximum standard deviation of 2.5% in error when compared with SPICE for a large number of standard cells. When compared with SPICE, using the proposed models gives stage delay and output slew with an average error of less than 3% and 2% respectively for arbitrary inputs and output load combinations.


5.3: Variation-Aware Modelling of Gates and Interconnects

Moderators: W. Schilders, NXP Semiconductors, NL; P. Feldmann, IBM T J Watson Research Center, US
An Efficient Method for Chip-Level Statistical Capacitance Extraction Considering Process Variations with Spatial Correlation [p. 580]
W. Zhang, W. Yu, Z. Wang, Z. Yu, R. Jiang and J. Xiong

An efficient method is proposed to consider the process variations with spatial correlation, for chip-level capacitance extraction based on the window technique. In each window, an efficient technique of Hermite polynomial collocation (HPC) is presented to extract the statistical capacitance. The capacitance covariances between windows are then calculated to reflect the spatial correlation. The proposed method is practical for chip-level extraction tasks, and the experiments on full-path extraction exhibit its high accuracy and efficiency.

SPARE - A Scalable Algorithm for Passive, Structure Preserving, Parameter-Aware Model Order Reduction [p. 586]
J. Fernández Villena and L.M. Silveira

In this paper we describe a flexible and efficient new algorithm for model order reduction of parameterized systems. The method is based on the reformulation of the parametric system as a parallel interconnection of the nominal transfer function and the non-parametric transfer function sensitivities with respect to the parameter variations. Such a formulation reveals an explicit dependence on each parameter which is exploited by reducing each component system independently via a standard non-parametric structure preserving algorithm. Therefore, the resulting smaller size interconnected system retains the structure of the original with respect to parameter dependence. This allows for better accuracy control, enabling independent adaptive order determination with respect to each parameter and adding flexibility in simulation environments. It is shown that the method is efficiently scalable and preserves relevant system properties such as passivity. The new technique can handle fairly large parameter variations on systems whose outputs exhibit smooth dependence on the parameters. Several examples show that besides the added flexibility and control, when compared with competing algorithms, the proposed technique can, in some cases, produce smaller reduced models with potential accuracy gains.

Transistor-Specific Delay Modeling for SSTA [p. 592]
B. Cline, K. Chopra, D. Blaauw, A. Torres and S. Sundareswaran

SSTA has received a considerable amount of attention in recent years. However, it is a general rule that any approach can only be as accurate as the underlying models. Thus, variation models are an important research topic, in addition to the development of statistical timing tools. These models attempt to predict fluctuations in parameters like doping concentration, critical dimension (CD), and ILD thickness, as well as their spatial correlations. Modeling CD variation is a difficult problem because it contains a systematic component that is context dependent as well as a probabilistic component that is caused by exposure and defocus variation. Since these variations are dependent on topology, modern-day designs can potentially contain thousands of unique CD distributions. To capture all of the individual CD distributions within statistical timing, a transistor-specific model is required. However, statistical CD models used in industry today do not distinguish between transistors contained within different standard cell types (at the same location in a die), nor do they distinguish between transistors contained within the same standard cell. In this work we verify that the current methodology is error-prone using a 90nm industrial library and lithography recipe (with industrial OPC) and propose a new SSTA delay model that on average reduces error of standard deviation from 11.8% to 4.1% when the total variation (σ/μ) is 4.9% - a 2.9X reduction. Our model is compatible with existing SSTA techniques and can easily incorporate other sources of variation such as random dopant fluctuation and line-edge roughness.


5.4: Signal Processing on Massive Parallel Architectures

Moderators: B. Bougard, IMEC, BE; F. Kienle, Kaiserslautern U, DE
Generic Multi-Phase Software-Pipelined Partial-FFT on Instruction-Level-Parallel Architectures and SDR Baseband Applications [p. 598]
M. Li, D. Novo, B. Bougard, L. Van Der Perre and F. Catthoor

The PFFT (Partial FFT) is an extended FFT where only part of the input or output bins are used. By pruning the useless dataflow, the PFFT can potentially achieve a significant speedup in many important applications. Although theoretical aspects of the PFFT have been thoroughly studied in the past three decades, efficient implementations were rarely reported. The most important obstacle is the highly irregular dataflow and the associated control flow. In addition, a size-N PFFT has 2^N dataflow possibilities, so that delivering both flexibility and efficiency in the same implementation is very challenging. This paper presents a generic scheme to map the highly irregular dataflow of arbitrary PFFT onto ILP architectures with highly efficient SWP (SoftWare-Pipelining). Constraints and opportunities of algorithms and architecture are carefully analyzed and exploited. We introduce a multi-phase partitioning, bringing heterogeneous control structures and heterogeneous software pipelining schemes to minimize control overheads and to maximize the efficiency of SWP. The proposal has been tested with 10 representative benchmarks extracted from baseband applications. In experiments cycle-counts, instructions, NOPs, L1D/L1P access/miss/hit are thoroughly analyzed. Compared to full FFTs with efficient SWP, our work reduces 20.5% - 87.5% cycle-counts, 11.2% - 86.5% instructions, 16.1% - 79.4% L1D cache accesses and 19.5% - 87.1% L1P cache accesses. To the best of our knowledge, this is the first reported work about the generic software-pipelined PFFT on ILP architectures.

A Novel Recursive Algorithm for Bit-Efficient Realization of Arbitrary Length Inverse Modified Cosine Transforms [p. 604]
R. Koenig, T. Stripf and J. Becker

In this paper a novel approach for Inverse Modified Cosine Transform (IMDCT) computation is presented, based on a recursive algorithm. Due to its nature, this IMDCT calculation can be performed on a reduced bit width datapath without loss of accuracy, compared to alternative recursive architectures. Combined with the regular structure, the approach allows for a much more area efficient VLSI implementation compared to existing systems. Due to its bit efficiency this approach is also attractive for implementation on reconfigurable architectures of the DSP domain.

Definition and SIMD Implementation of a Multi-Processing Architecture Approach on FPGA [p. 610]
P. Bonnot, F. Lemonnier, G. Edelin, G. Gaillat, O. Ruch and P. Gauget

In a context of high performance, low technology access cost and application code reusability objectives, this paper presents an "architectured FPGA" approach that consists in the definition of a general frame for embedded system application implementations. Addressing image processing as a first application domain, an FPGA architecture implementation based on that approach is presented. Built around SIMD architecture, the "Ter@Core" FPGA implementation illustrates the competitiveness of the approach compared to off-the-shelf processors and to the usual FPGA approach. The presented implementation gathers 128 processing elements on a single FPGA providing 19.2 GOPS performance and very high application development productivity.
Keywords: image processing, data dependent processing, long lifecycle, FPGA, platform approach, domain specific API, MIMD architecture, SIMD architecture, middleware.


5.5: Statistical, Physical Defect Based Testing

Moderators: J. Segura, Balearic Islands U, ES; H. Manhaeve, Q-Star Test, BE
On Modeling and Testing of Lithography Related Open Faults In Nano-CMOS Circuits [p. 616]
A. Sreedhar, A. Sanyal and S. Kundu

Scaling of transistor feature size over time has been facilitated by corresponding improvement in lithography technology. However, in recent times the wavelength of the optical light source used for photolithography has not scaled at the same rate as the minimum feature size of the transistor. In fact, starting with 180nm devices, the wavelength of optical source has remained the same (at 193nm) due to difficulties in finding a flicker-free, high energy, coherent light source with compatible improvement in lens material for focusing this light. Consequently, upcoming technology nodes (65nm, 45nm, 32nm and 22nm) will be using a light source with wavelength much greater than the feature size. This creates a peculiar problem where line width on manufactured devices is a function of relative spacing between adjacent lines. Despite numerous restrictions on layout rules, interconnects may still suffer from constriction due to this peculiarity, also known as the forbidden pitch problem. A small manufacturing variation turns the constrictions to open faults. Gate leakage current is a significant concern for present and upcoming technology nodes. Due to gate leakage, an open fault is not truly an open circuit. Our simulation studies show that the leakage current steers the floating input of a gate to certain metastable states. This property actually makes it easier to detect open faults either through side channel excitation or by stuck-at tests. The major contributions of this paper are (i) lithographic simulation based identification of potential open fault sites, (ii) identification of metastable input states for these open inputs, (iii) length calculation for side channel signals for definitive detection of open faults. Together, they provide a complete CAD framework for testing lithography related open faults.
Keywords: Open Faults, Lithography, Forbidden Pitch, Logic Switching Threshold

Optimal Margin Computation for At-Speed Test [p. 622]
J. Xiong, V. Zolotov, C. Visweswariah and P.A. Habitz

In the face of increased process variations, at-speed manufacturing test is necessary to detect subtle delay defects. This procedure necessarily tests chips at a slightly higher speed than the target frequency required in the field. The additional performance required on the tester is called test margin. There are many good reasons for margin including voltage and temperature requirements, incomplete test coverage, aging effects, coupling effects and accounting for modeling inaccuracies. By taking advantage of statistical timing, this paper proposes an optimal method of test margin determination to maximize yield while staying within a prescribed Shipped Product Quality Loss (SPQL) limit. If process information is available from wafer testing of scribe line structures or on-chip process monitoring circuitry, this information can be leveraged to determine a per-chip test margin which can further improve yield.

Resistive Bridging Fault Simulation of Industrial Circuits [p. 628]
P. Engelke, I. Polian, J. Schloeffel and B. Becker

We report the successful application of a resistive bridging fault (RBF) simulator to industrial benchmark circuits. Despite the slowdown due to the consideration of the sophisticated RBF model, the run times of the simulator were within an order of magnitude of the run times for pattern-parallel complete-circuit stuck-at fault simulation. Industrial-size circuits, including a multi-million-gates design, could be simulated in reasonable time despite a significantly higher number of faults to be simulated compared with stuck-at fault simulation.
Keywords: Resistive bridging faults, bridging fault simulation, case study

Physically-Aware N-Detect Test Pattern Selection [p. 634]
Y.-T. Lin, O. Poku, N.K. Bhatti and R.D.S. Blanton

N-detect test has been shown to have a higher likelihood for detecting defects. However, traditional definitions of N-detect test do not necessarily exploit the localized characteristics of defects. In physically-aware N-detect test, the objective is to ensure that the N tests establish N different logical states on the signal lines that are in the physical neighborhood surrounding the targeted fault site. We present a test selection procedure for creating a physically-aware N-detect test set that satisfies a user-provided constraint on test-set size. Results produced for an industrial test chip demonstrate the effectiveness and practicability of our pattern selection approach. Specifically, we show that we can virtually detect the same number of faults 10 or more times as a traditional 10-detect test set and increase the number of neighborhood states and the number of faults with 10 or more states by 18.0% and 4.7%, respectively, without increasing the number of tests over a traditional 10-detect test set.


5.6: Tuning System Parameters for QoS Constrained Multimedia Applications

Moderators: T. Givargis, UC Irvine, US; P. Pop, DTU, DK
Computation of Buffer Capacities for Throughput Constrained and Data Dependent Inter-Task Communication [p. 640]
M.H. Wiggers, M.J.G. Bekooij and G.J.M. Smit

Streaming applications are often implemented as task graphs. Currently, techniques exist to derive buffer capacities that guarantee satisfaction of a throughput constraint for task graphs in which the inter-task communication is data-independent, i.e. the amount of data produced and consumed is independent of the data values in the processed stream. This paper presents a technique to compute buffer capacities that satisfy a throughput constraint for task graphs with data dependent inter-task communication, given that the task graph is a chain. We demonstrate the applicability of the approach by computing buffer capacities for an MP3 playback application, of which the MP3 decoder has a variable consumption rate. We are not aware of alternative approaches to compute buffer capacities that guarantee satisfaction of the throughput constraint for this application.

Constraint Refinement for Online Verifiable Cross-Layer System Adaptation [p. 646]
M. Kim, M.-O. Stehr, C. Talcott, N. Dutt and N. Venkatasubramanian

Adaptive resource management is critical to ensuring the quality of real-time distributed applications, particularly for energy-constrained mobile handheld devices. In this context, an optimization that simultaneously considers multiple layers (e.g., application, middleware, operating system) needs to be developed for continuous adaptation of system parameters. The tuning of system parameters greatly affects the system's ability to meet QoS requirements, and also directly affects the energy consumption and system robustness. We present a novel approach to developing cross-layer optimization for resource limited real-time distributed systems, based on a constraint refinement technique combined with formal specification and feedback from system implementation. Our approach tunes the parameters in a compositional manner allowing coordinated interaction among sub-layer optimizers that enables holistic cross-layer optimization. We present experiments on a realistic multimedia application which demonstrate that constraint refinement enables us to generate robust and near optimal parameter settings. The constraint language can be used as an interface for composition by encapsulating the details of local optimization algorithms.

Adaptive Scheduling and Voltage Scaling for Multiprocessor Real-Time Applications with Non-Deterministic Workload [p. 652]
P. Malani, P. Mukre, Q. Qiu and Q. Wu

The computational workload of some real-time applications varies significantly during runtime, which makes task scheduling and power management a challenge. One of the major influences on the workload of an application is the selection of conditional branches, which may activate or deactivate a large set of operations. Focusing on real-time applications whose workload varies due to random branch selection, this paper presents a framework of task mapping, scheduling and dynamic voltage and frequency scaling (DVFS) for a multiprocessor system. The proposed framework maintains workload awareness using dynamic profiling of branch probability. The profiled information is utilized by the scheduling and DVFS algorithms that are adopted in this framework to generate a statistically optimal solution.
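
As a rough illustration of profiling-driven DVFS, the sketch below tracks a branch probability with an exponential moving average and picks the lowest frequency whose expected execution time meets a deadline. The names `BranchProfiler`, `expected_cycles` and `pick_frequency`, the smoothing factor and the cycle counts are hypothetical and not taken from the paper.

```python
class BranchProfiler:
    """Running estimate of the probability that a branch is taken."""

    def __init__(self, initial=0.5, alpha=0.1):
        self.p_taken = initial
        self.alpha = alpha  # smoothing factor of the moving average (assumed)

    def observe(self, taken):
        """Update the estimate with one observed branch outcome."""
        sample = 1.0 if taken else 0.0
        self.p_taken += self.alpha * (sample - self.p_taken)


def expected_cycles(p_taken, cycles_taken, cycles_not_taken):
    """Expected workload of a task with one conditional branch."""
    return p_taken * cycles_taken + (1.0 - p_taken) * cycles_not_taken


def pick_frequency(freqs_hz, cycles, deadline_s):
    """Lowest frequency that finishes `cycles` within the deadline, else the highest."""
    for f in sorted(freqs_hz):
        if cycles / f <= deadline_s:
            return f
    return max(freqs_hz)
```

Running at the lowest adequate frequency is only a stand-in for a real energy model; an actual policy would also have to handle the case where the less likely, heavier branch is taken.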


5.7: EMBEDDED TUTORIAL - ARTEMIS and ENIAC Joint Undertakings: A New Approach to Conduct Research in Europe

Organizer/Moderator: E. Schutz, STMicroelectronics, BE

ARTEMIS and ENIAC Joint Undertakings: A New Approach to Conduct Research in Europe [p. 658]
Presenters: K. Glinos, D. Beenaert, L. Gide

This special session will present the first two Europe-wide public-private R&D partnerships, ARTEMIS and ENIAC. ARTEMIS will address the invisible computers (embedded systems) that today run all machines, from cars, planes and phones to energy networks, factories, washing machines and televisions. ENIAC will target the very high level of miniaturisation required for the next generations of nanoelectronics components. These Joint Technology Initiatives (JTIs) on Embedded Computing Systems and Nano-electronics will pool industry, Member States and Commission resources into targeted research programmes. The session will include global presentations on the initiatives and information on the expected research topics included in the first calls in 2008.


6.1: Methods, Tools and Standards for the Analysis and Evaluation of Modern Automotive Architectures (Automotive Systems Day)

Organizers: M. Di Natale, Scuola S Anna, Pisa, IT; A. Sangiovanni-Vincentelli, UC Berkeley, US
Moderator: M. Di Natale, Scuola S Anna, Pisa, IT
Methods, Tools and Standards for the Analysis and Evaluation of Modern Automotive Architectures [p. 659]
E. Frank, R. Wilhelm, R. Ernst, A. Sangiovanni-Vincentelli and M. Di Natale

Automotive systems are increasingly distributed and complex. Reduced time-to-market, cost and safety concerns require advance validation of the integrated system and its components, from the functional, timing, and reliability standpoints. In particular, function correctness and performance may depend on communication and computation delays imposed by the selected architecture platform. Hence the need for methods and tools capable of predicting the system-level timing behaviour (latencies and jitter) resulting from the HW platform selection, the synchronization between tasks and messages, and also from the synchronization and queuing policies of the middleware and RTOS levels. In this paper, we review methods and tools for the evaluation of the function performance and its timing correctness by simulation or by worst case static analysis.


6.2: Simulation-Based Validation

Moderators: F. Fummi, Verona U, IT; P. Sanchez, Cantabria U, ES
Random Stimulus Generation Using Entropy and XOR Constraints [p. 664]
S.M. Plaza, I.L. Markov and V. Bertacco

Despite the growing research effort in formal verification, constraint-based random simulation remains an integral part of design validation, especially for large design components where formal techniques do not scale. However, stimulating important aspects of a design to uncover bugs often requires the construction of complex constraints to guide stimulus generation. We propose Toggle, a stimulus generation engine, which features (1) an entropy-based coverage analysis to efficiently find portions of the design inadequately sensitized by simulation and (2) a novel strategy to automatically stimulate these portions through a specialized SAT algorithm that uses small randomized XOR constraints. As our experimental results demonstrate, Toggle requires minimal input from the verification engineer, and significantly improves the coverage qualities of the generated stimuli when compared to plain random simulation.
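
The idea of steering stimulus generation with small randomized XOR constraints can be sketched as follows. This is not Toggle's SAT-based algorithm: the sketch enumerates all input vectors by brute force, which only works for a handful of variables, and every function name is hypothetical. Each random parity constraint keeps roughly half of the remaining solutions, so a few of them carve out a small random cell from which one stimulus is drawn.

```python
import itertools
import random

def random_xor_constraint(num_vars, size, rng):
    """Pick `size` distinct variables and a parity bit.

    The constraint holds when the XOR of the chosen variables equals the bit.
    """
    return rng.sample(range(num_vars), size), rng.randint(0, 1)

def satisfies(assignment, constraint):
    """Check one XOR (parity) constraint against a tuple of 0/1 values."""
    variables, parity = constraint
    return sum(assignment[v] for v in variables) % 2 == parity

def sample_stimulus(num_vars, design_constraint, num_xors, xor_size, rng):
    """Return one vector meeting the design constraint and the random XORs, or None."""
    xors = [random_xor_constraint(num_vars, xor_size, rng) for _ in range(num_xors)]
    cell = [bits for bits in itertools.product((0, 1), repeat=num_vars)
            if design_constraint(bits) and all(satisfies(bits, x) for x in xors)]
    return rng.choice(cell) if cell else None
```

In a real flow the enumeration would be replaced by a SAT solver call on the design constraints conjoined with the XOR clauses.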

MCjammer: Adaptive Verification for Multi-Core Designs [p. 670]
I. Wagner and V. Bertacco

The challenge of verification of multi-core and multi-processor designs grows dramatically with each new generation of systems produced today. Validation of memory coherence of such systems, which include multiple levels of cache and complex protocols, constitutes a major fraction of this task. Unfortunately, current tools are incapable of addressing these challenges, allowing bugs, which cause unpredictable software behavior and wrong computation results, to slip into hardware. In this work we present a scalable approach to the verification of memory coherence protocols in large multi-core and multi-processor systems. We accomplish this task through a distributed network of cooperating agents, which feed the processors with stimuli, each agent attempting to accomplish its own verification goals and support other agents on theirs as well. The agents can dynamically change the stimuli based on coverage and pressure observed during simulation. Since each agent has a minimal knowledge of the entire system, their communication and decision process is greatly simplified. Moreover, since the agents' view of the system is linear in the number of nodes in it, our approach can be efficiently scaled to target large multi-core systems. Experimental results on two common coherence protocols and a range of multi-core configurations demonstrate that our technique can reach high levels of coverage of the system-level protocol much faster than a constrained-random generator.

Efficient Implementation of Native Software Simulation for MPSoC [p. 676]
P. Gerin, X. Guérin and F. Pétrot

Efficient and precise simulation models at a high abstraction level are required in order to perform early design validations and architecture explorations of Multi-Processor System-On-Chip (MPSoC) platforms. Although native software simulation approaches provide interesting capabilities, they quickly become unsuitable when complex hardware architectures have to be considered. In this paper, we present a SystemC-based MPSoC platform implementation that allows native software simulation while keeping details of the underlying hardware model. The key contribution of this work is a realistic memory mapping model that makes possible the simulation of Operating Systems and software applications on complex hardware models with multiple processors and DMA devices. This method also allows the reuse of different software components for the target processor(s). Experimental results show the efficiency of the proposed method to validate software on complex hardware architectures.

Simulation-Directed Invariant Mining for Software Verification [p. 682]
X. Cheng and M.S. Hsiao

With the advance of SAT solvers, transforming a software program to a propositional formula has generated much interest for bounded model checking of software in recent years. However, reasoning at the Boolean level often may not be able to identify some key relations among the original high-level program variables. In this paper, we propose a novel framework that uses simulation-directed data mining in the original program to extract a set of high-level potential property invariants according to the dynamic execution data of the software. When these learned invariants are added as constraints to the bounded model checking instances of the software, they help to significantly reduce the search space. The simulation-directed invariant mining framework exhibits more flexibility compared to the conventional static program analysis approaches, and the experimental results showed that our approach can lead to up to an order of magnitude of speedup in software verification via bounded model checking.


6.3: Robust Mixed-Signal System Design

Moderators: A. Doboli, State U of New York at Stony Brook, US; M. Ortmanns, Freiburg U, DE
Comparison of Opamp-Based and Comparator-Based Delta-Sigma Modulation [p. 688]
M. Momeni, P.B. Bacinschi and M. Glesner

Comparator-based switched capacitor (CBSC) circuits present an alternative approach to designing sampled data systems based on the principle of detecting a virtual ground condition with a comparator rather than actively enforcing it with a high-gain operational amplifier (opamp) in feedback. This work demonstrates a 2nd-order ΔΣ converter designed using the CBSC technique. The same modulator topology was also implemented using two conventional design methods for a two-stage Miller-compensated amplifier and a single-stage folded cascode amplifier, such that all three blocks can be used as 'drop-in replacements' in the top-level circuit. The designs are done in a 0.13 μm UMC technology. The SNDR performance and power consumption of all three approaches were simulated with a sampling frequency of 5.12 MHz and an oversampling ratio of 64. It can be concluded that the CBSC method provides a great simplification of design effort and significant power savings compared to the traditional OTA-based methods.
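
For orientation, the signal bandwidth implied by the quoted figures can be worked out as below, assuming the usual definition of the oversampling ratio relative to the Nyquist rate. The symbol f_B for the signal bandwidth is introduced here and does not appear in the abstract.

```latex
\mathrm{OSR} = \frac{f_s}{2 f_B}
\quad\Longrightarrow\quad
f_B = \frac{f_s}{2\,\mathrm{OSR}} = \frac{5.12\ \mathrm{MHz}}{2 \times 64} = 40\ \mathrm{kHz}
```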

A Novel Technique for Improving Temperature Independency of Ring-ADC [p. 694]
S. Li, H. Chen and F. Zhou

A new temperature compensation technique for ring-oscillator-based ADC is proposed in this paper. It employs a novel fixed-number-based algorithm and a CTAT current biasing technology to compensate the temperature-dependent variations of the output, thus eliminating the need for digital calibration. Simulation results show that, with the proposed technique, the resolution over the temperature range of 0°C to 100°C can reach a 2-mV quantization bin size with an input voltage span of 120mV, at the sampling frequency fs=100kHz.

An Analog On-Chip Adaptive Body Bias Calibration for Reducing Mismatches in Transistor Pairs [p. 698]
P.B. Bacinschi, T. Murgan, K. Koch and M. Glesner

Device parameter variations exhibit an increasingly serious impact on analog and mixed-signal circuit behavior. In this paper, we propose a novel fully-analog on-chip adaptive body bias calibration method, for efficiently reducing mismatches in transistor pairs. We present three circuit implementations which achieve a mismatch reduction between 61% and 73% in terms of standard deviation.

Integrated Approach to Energy Harvester Mixed Technology Modeling and Performance Optimization [p. 704]
L. Wang, T.J. Kazmierski, B.M. Al-Hashimi, S.P. Beeby and R.N. Torah

This paper presents an integrated approach to energy harvester modelling and performance optimisation where the complete mixed physical-domain energy harvester system (micro generator, voltage booster, storage element and load) can be modelled and optimised in a systematic manner using one simulation platform. We developed an accurate HDL model for the energy harvester and demonstrated its accuracy by validating it experimentally and comparing it with recently reported models. To address the performance loss due to the close mechanical-electrical interaction that takes place in energy harvesters, we proposed a holistic methodology to the energy harvester optimisation based on the HDL model. The effectiveness of employing such an approach has been demonstrated by showing that it is possible to improve vibration-based energy harvester efficiency (energy delivered to load/harvested energy) by 30% through optimising the micro-generator size and the voltage booster circuit components.


6.4: Architectures for Wireless Communications

Moderators: W. Eberle, IMEC, BE; G. Gielen, KU Leuven, BE
A Scalable Low-Power Digital Communication Network Architecture and an Automated Design Path for Controlling the Analog/RF Part of SDR Transceivers [p. 710]
W. Eberle and M. Goffioul

Emerging new wireless standards, the move towards multi-standard transceivers, and ultimately software-defined radios impose the need for a tighter interaction between digital baseband and analog/RF parts. Software-defined radio transceivers may face more than 400 control bits in the analog/RF part [9][10]. Configuring the transmit/receive chain to particular standards, monitoring front-end performance, and dynamically controlling front-end behavior require a tight bidirectional interaction. We have developed a generic concept of a flexible and scalable low-power digital communication network in a multi-standard analog/RF front-end. Our approach is layout-friendly, reduces interconnect area significantly (by 96%) compared to a star topology, scales easily with analog/RF design changes such as pin additions, and exhibits a generic bidirectional interface to the system and digital designer. Moreover, an almost fully automated design flow - starting from an on-chip connection list for all analog blocks up to VHDL code generation - has been developed and implemented, reducing design effort and potential errors. The architecture and the design flow have been successfully proven in two 0.13-μm full software-defined radio transceiver designs. In the first design, the flow was still manually instantiated. In the second design, the automated flow was used and led to a significant design-time speed-up.

A Coarse-Grained Array Based Baseband Processor for 100Mbps+ Software Defined Radio [p. 716]
B. Bougard, B. De Sutter, S. Rabou, D. Novo, O. Allam, S. Dupont and L. Van der Perre

The Software-Defined Radio (SDR) concept aims at enabling cost-effective multi-mode baseband solutions for wireless terminals. However, the growing complexity of new communication standards applying, e.g., multi-antenna transmission techniques, together with the reduced energy budget, is challenging SDR architectures. Coarse-Grained Array (CGA) processors are strong candidates to deliver both high performance and low power. The design of a candidate hybrid CGA-SIMD processor for an SDR baseband platform is presented. The processor, designed in TSMC 90G process according to a dual-VT standard-cells flow, achieves a clock frequency of 400MHz in worst case conditions and consumes at most 310mW active and 25mW leakage power (typical conditions) when delivering up to 25.6GOPS (16-bit). The mapping of a 20MHz 2x2 MIMO-OFDM transmit and receive baseband functionality is detailed as an application case study, achieving 100Mbps+ throughput with an average consumption of 220mW.
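
Two derived figures help put these numbers in perspective. The arithmetic below assumes that the peak rate of 25.6 GOPS is quoted at the 400 MHz clock and takes the throughput as exactly 100 Mbit/s, so the energy per bit is an upper estimate for the "100Mbps+" case. Neither derived figure is stated in the abstract.

```latex
\frac{25.6 \times 10^{9}\ \mathrm{ops/s}}{400 \times 10^{6}\ \mathrm{cycles/s}} = 64\ \mathrm{ops/cycle},
\qquad
\frac{220\ \mathrm{mW}}{100\ \mathrm{Mbit/s}} = 2.2\ \mathrm{nJ/bit}
```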

Scenario-Based Fixed-Point Data Format Refinement to Enable Energy-Scalable Software Defined Radios [p. 722]
D. Novo, B. Bougard, A. Lambrechts, L. Van der Perre and F. Catthoor

User demand, standards and products for digital nomadic communications are evolving quickly. The combination of this changing environment with the need for short time-to-market pushes for more flexible implementations. Software Defined Radios (SDR) have been introduced as the ultimate way to achieve such flexibility. The reduced energy budget required by battery-powered solutions makes the typical worst-case static dimensioning unaffordable under highly dynamic operating conditions. Instead, more energy-scalable algorithms and implementations are needed to provide flexibility while maintaining the required energy efficiency. In particular, energy-scalable implementations can exploit data format properties to offer different tradeoffs between accuracy and energy. In this paper, such a technique is developed and applied to the SDR implementation of a two-antenna 200 Mbps+ OFDM (Orthogonal Frequency-Division Multiplexing) inner modem receiver on a C-programmable CGA (Coarse Grain Array) processor with extensive SIMD (Single Instruction Multiple Data) support. By defining separate implementations for different combinations of modulation scheme and coding rate, up to 3-fold gains can be achieved in the average energy consumption.


6.5: HOT TOPIC - Test Challenges for Low Power Devices

Organizer: P. Girard, LIRMM/CNRS, FR
Moderator: A. Raghunathan, NEC Laboratories, US
Test Strategies for Low Power Devices [p. 728]
C.P. Ravikumar, M. Hirech and X. Wen

Ultra low-power devices are being developed for embedded applications in bio-medical electronics, wireless sensor networks, environment monitoring and protection, etc. The testing of these low-cost, low-power devices is a daunting task. Depending on the target application, there are stringent guidelines on the number of defective parts per million shipped devices. At the same time, since such devices are cost-sensitive, test cost is a major consideration. Since system-level power-management techniques are employed in these devices, test generation must be power-management-aware to avoid stressing the power distribution infrastructure in the test mode. Structural test techniques such as scan test, with or without compression, can result in excessive heat dissipation during testing and damage the package. False failures may result due to the electrical and thermal stressing of the device in the test mode of operation, leading to yield loss. This paper considers different aspects of testing low-power devices and some new techniques to address these problems.


6.6: Software Architectures for Embedded Multi-CPU Systems

Moderators: C. Schlaeger, AMD, DE; P. Felber, Neuchatel U, CH
Thermal Balancing Policy for Streaming Computing on Multiprocessor Architectures [p. 734]
F. Mulas, M. Pittau, M. Buttu, S. Carta, A. Acquaviva, L. Benini, D. Atienza and G. De Micheli

As feature sizes decrease, power dissipation and heat generation density exponentially increase. Thus, temperature gradients in Multiprocessor Systems on Chip (MPSoCs) can seriously impact system performance and reliability. Thermal balancing policies based on task migration have been proposed to modulate power distribution between processing cores to achieve temperature flattening. However, in the context of MPSoC for multimedia streaming computing, where timeliness is critical, the impact of migration on quality of service must be carefully analyzed. In this paper we present the design and implementation of a lightweight thermal balancing policy that reduces on-chip temperature gradients via task migration. This policy exploits run-time temperature and load information to balance the chip temperature. Moreover, we assess the effectiveness of the proposed policy for streaming computing architectures using a cycle-accurate thermal-aware emulation infrastructure. Our results using a real-life software defined radio multitask benchmark show that our policy achieves thermal balancing while keeping migration costs bounded.
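
A migration-based balancing step of the general kind described above might look like the following sketch. It is not the authors' policy: the threshold rule, the choice of the lightest task as the migration candidate (one simple way to keep migration cost small) and the name `balance_step` are all assumptions.

```python
def balance_step(core_temps, core_tasks, threshold_c):
    """Perform at most one task migration from the hottest to the coolest core.

    core_temps: list of temperatures in degrees Celsius, one per core.
    core_tasks: list of dicts {task_name: load}, one per core (modified in place).
    Returns (task, src, dst) for the migration performed, or None if the
    temperature gradient is within the threshold or the hot core has no tasks.
    """
    cores = range(len(core_temps))
    hot = max(cores, key=core_temps.__getitem__)
    cold = min(cores, key=core_temps.__getitem__)
    if core_temps[hot] - core_temps[cold] <= threshold_c or not core_tasks[hot]:
        return None
    # Move the task with the smallest load (assumed proxy for migration cost).
    task = min(core_tasks[hot], key=core_tasks[hot].get)
    core_tasks[cold][task] = core_tasks[hot].pop(task)
    return task, hot, cold
```

Such a step would be invoked periodically with fresh sensor readings; how often it may run without harming streaming deadlines is exactly the quality-of-service question the abstract raises.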

A Practical Approach for Reconciling High and Predictable Performance in Non-Regular Parallel Programs [p. 740]
O. Certner, Z. Li, P. Palatin, O. Temam, F. Arzel and N. Drach

Increasingly complex consumer electronics applications call for embedded processors with higher performance. Multi-cores are capable of delivering the required performance. However, many of these embedded applications must meet some form of soft real-time constraints, and program behavior on multi-cores is even harder to predict than on single-cores. In this article, we highlight the greater performance variability of irregular applications (non-regular control flow and/or data structures) across data sets when parallelized and run on a multi-core. We then show that a proper parallelization approach coupled with a lightweight run-time system can drastically reduce this performance variability without sacrificing their performance. This approach requires no complex program or architecture analysis or modeling. Moreover, we show that parallel program performance becomes stable enough that it is possible to reasonably and accurately predict it by sampling a few training runs.

Exact and Approximate Task Assignment Algorithms for Pipelined Software Synthesis [p. 746]
M. Hashemi and S. Ghiasi

Pipelined execution of streaming applications enables processing of high-throughput data under performance constraints. We present an integrated approach to synthesizing pipelined software for dual-core architectures. We target streaming applications modeled as task graphs that are amenable to static analysis. We develop a versatile task assignment algorithm that considers the combined effect of workload imbalance between processors and inter-processor communication. Our technique, which runs in pseudo-linear time, provably maximizes application throughput. Furthermore, we develop an approximation algorithm for task assignment whose complexity is strictly polynomial. It provides the designer with an adjustable knob to controllably trade solution quality with algorithm runtime and memory requirement. Empirical throughput measurements using an FPGA-based dual-core system validate our theoretical results. Our exact algorithm consistently outperforms a recent competitor. Compared to exact task assignment, the approximate method runs about 3 times faster, requires about 20 times less memory, and results in only 1% to 5% throughput loss.
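
To make the objective concrete, the sketch below searches all two-processor assignments of a small task graph and returns the one with the smallest bottleneck time. It is an exhaustive search, exponential in the number of tasks, and not the paper's pseudo-linear algorithm. The cost model (each processor pays for its own tasks plus every edge crossing the cut) and the name `best_assignment` are assumptions.

```python
import itertools

def best_assignment(task_cost, edge_cost):
    """Exhaustively find the dual-core assignment with the smallest bottleneck time.

    task_cost: {task: compute time}
    edge_cost: {(src, dst): transfer time if src and dst are on different cores}
    Returns (period, {task: core}), where period is the larger per-core time;
    pipeline throughput under this model is 1 / period.
    """
    tasks = sorted(task_cost)
    best = None
    for bits in itertools.product((0, 1), repeat=len(tasks)):
        side = dict(zip(tasks, bits))
        # Communication cost of all edges whose endpoints sit on different cores.
        cut = sum(c for (u, v), c in edge_cost.items() if side[u] != side[v])
        loads = [sum(task_cost[t] for t in tasks if side[t] == p) + cut
                 for p in (0, 1)]
        period = max(loads)
        if best is None or period < best[0]:
            best = (period, side)
    return best
```

For a three-task chain with costs 4, 3 and 3 and edge costs 1 and 5, the search cuts the cheap edge rather than the more balanced but expensive one, which shows why imbalance and communication have to be considered together.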


6.7: Instruction-Set Optimisations

Moderators: G. Gaydadjiev, TU Delft, NL; T. Austin, U of Michigan, US
Run-Time System for an Extensible Embedded Processor with Dynamic Instruction Set [p. 752]
L. Bauer, M. Shafique, S. Kreutz and J. Henkel

One of the upcoming challenges in embedded processing is to incorporate an increasing amount of adaptivity in order to respond to the multifarious constraints induced by today's embedded systems that feature complex and diverse application behaviors. We present a novel concept (evaluated with a hardware prototype) that moves traditional design-time jobs to run time in order to increase efficiency (in this paper we focus on performance). Adaptivity is achieved dynamically through what we call Special Instructions (SIs) which may change during run time according to non-predictable application behavior. The new contribution of this paper is the principal component that actually makes the entire embedded processor work efficiently, namely the "Special Instruction Scheduler". It determines during run time 'when' and 'how' Special Instructions are composed and executed. We achieve a 2.38x performance increase over a reconfigurable processor system with dynamic instruction set (Molen [19]). Our whole platform consists of a toolchain including estimation and simulation tools plus a running hardware prototype. Throughout this paper, we discuss the functionality by means of an H.264 video encoder in detail even though the concept is not limited to this application.

Harnessing Horizontal Parallelism and Vertical Instruction Packing of Programs to Improve System Overall Efficiency [p. 758]
H. Lin and Y. Fei

Multi-issue processors can exploit the Instruction Level Parallelism (ILP) of programs to improve the performance greatly. How to reduce the energy consumption while maintaining the high performance of programs running on multi-issue processors remains a challenging problem. In this paper, we propose a novel approach to apply the instruction register file (IRF) technique from single-issue processors to VLIW architectures. Frequently executed instructions are selected to be placed in the on-chip IRF for fast access in program execution. Violation of synchronization among VLIW instruction slots is avoided by introducing new instruction formats and microarchitectural support. The enhanced VLIW architecture is thus able to orchestrate the horizontal instruction parallelism and vertical instruction packing for programs to improve system overall efficiency. Our experimental results show that the proposed processor architecture achieves both the performance advantage provided by the VLIW architecture and high energy efficiency provided by the IRF-based instruction packing technique (e.g., 71.1% reduction in the fetch energy consumption for a 4-way VLIW architecture with 8-entry IRFs).

Instruction Set Extension Exploration in Multiple-Issue Architecture [p. 764]
I.-W. Wu, Z.-Y. Chen, J.-J. Shann and C.-P. Chung

To satisfy high-performance computing demand in modern embedded devices, current embedded processor architectures provide designers with the possibility either to define customized instruction set extensions (ISE) or to increase instruction issue width. Previous studies have shown that deploying ISE in multiple-issue architecture can significantly improve performance. However, identifying ISE for multiple-issue architecture by using current ISE exploration algorithms will result in unnecessary waste of silicon area and limitation of performance improvement. This is because most algorithms overlook two important considerations: (1) only packing the operations lying on the critical path into ISE can improve performance; (2) the critical path usually changes after packing operations into an ISE. With these considerations, this paper presents an algorithm for ISE exploration based on list scheduling and Ant Colony Optimization (ACO), which combines ISE exploration and critical path identification (i.e. instruction scheduling). Results indicate that our approach outperforms the previous work in both performance improvement and area efficiency.

Instruction Re-Encoding Facilitating Dense Embedded Code [p. 770]
T. Bonny and J. Henkel

Reducing the code size of embedded applications is one of the important constraints in embedded system design. Code compression can provide substantial savings in terms of size. In this paper, we introduce a novel and efficient hardware-supported approach. Our approach investigates the benefits of re-encoding the unused bits (we call them re-encodable bits) in the instruction format for a specific application to improve the compression ratio. Re-encoding those bits may reduce the size of the decoding table by more than 37%. We achieve compression ratios as low as 44% (including all overhead incurred). We have conducted evaluations using a representative set of applications and have applied it to two major embedded processors, namely MIPS and ARM.


IP3 Interactive Presentations

Test Instrumentation for a Laser Scanning Localization Technique for Analysis of High Speed DRAM Devices [p. 776]
M. Versen, A. Schramm, J. Schnepp and D. Diaconescu

Soft defect localization (SDL) is a method of laser scanning microscopy that utilizes the changing pass/fail behavior of an integrated circuit under test and temperature influence. Historically the pass and fail states are evaluated by a tester, which leads to long and impracticable measurement times for dynamic random access memories (DRAM). The new method, using a high speed comparison device, allows SDL image acquisition times of a few minutes and a localization of functional DRAM fails caused by defects in the DRAM periphery, which has not been possible before. This new method significantly speeds up the turn-around time in the failure analysis (FA) process compared to knowledge based FA.

A Mapping Framework for Guided Design Space Exploration of Heterogeneous MP-SoCs [p. 780]
B. Ristau, T. Limberg and G. Fettweis

When designing heterogeneous MP-SoCs designers have to take into account various objectives such as power, die size, flexibility, performance or programmability. But to be able to evaluate a given system according to these objectives, it is necessary to know how applications will behave on that system. Since time-to-market is one key factor in chip design, it is important to be able to evaluate these systems at a very early design stage. Today this is usually done with simulations in languages such as Simulink or SystemC. We will show how the behavior of such systems can be analyzed without the need for time-consuming implementations of simulation models. This enables fast evaluation and modification of a given system at a very early design stage allowing efficient pruning of the design space.

Impact of Leakage Current on Data Retention of RF-Powered Devices during Amplitude-Modulation-Based Communication [p. 784]
J. Haid, B. Zimek, T. Leutgeb and T. Kuenemund

Devices powered by an electromagnetic field are inherently power-constrained and thus must carefully manage static and dynamic power. High ambient temperatures and field strengths can increase the temperature of RF-powered devices to more than 100 degrees Celsius, thereby allowing the leakage current to rise to a dominating portion of the static power consumption. Leakage reduction techniques for application in RF-powered devices are examined in this paper with the goal of avoiding malfunction of the device during amplitude-modulation-based communication. Results show that without leakage reduction, correct operation cannot be guaranteed for the investigated 130 nm process technology for energy gaps that are defined by the widely applied ISO/IEC 14443-2 standard (100% field modulation). The evaluation of leakage reduction techniques shows that applying body biasing prolongs the data retention time by nearly 200%, while source biasing in general degraded the circuit's robustness against power gaps (reduction in data retention time by up to 76%), as did voltage scaling (up to 98% reduction).

Accuracy-Adaptive Simulation of Transaction Level Models [p. 788]
M. Radetzki and R.S. Khaligh

Simulation of transaction level models (TLMs) is an established embedded systems design technique. Its use cases include virtual prototyping for early software development, platform simulation for design space exploration, and reference modelling for verification. The different use cases mandate different trade-offs between simulation performance and accuracy. Therefore, multiple TLM abstraction layers have been defined of which one has to be chosen and integrated into the system model prior to simulation. In this contribution we present a modelling technique that allows covering several layers in a single model and switching between the layers at any time, in particular dynamically during simulation. This feature is employed to automatically adapt simulation accuracy to an appropriate level depending on the model's state, leading to an improved trade-off between simulation performance and accuracy.

Zero-Efficient Buffer Design for Reliable Network-on-Chip in Tiled Chip-Multi-Processor [p. 792]
J. Wang, H. Zeng, K. Huang, G. Zhang and Y. Tang

Network-on-Chip (NoC) is a promising solution for efficient interconnection between processor cores in Chip-Multi-Processor (CMP). This paper focuses on the energy-efficient design of buffers, a group of the most important components in NoC. Our investigation shows that an overwhelming majority of the content of the packets transmitted in NoC for CMP is "zero". A zero-efficient buffer design is proposed, together with an error control scheme. Compared with a conventional design, up to 43% of the energy consumption can be saved. We use a 90nm CMOS process in our simulation.

Wire Sizing Alternative - An Uniform Dual-Rail Routing Architecture [p. 796]
F.-W. Chen and Y.-Y. Liu

To achieve minimum signal propagation delay, the non-uniform wire width routing architecture has been widely used in modern VLSI design. The non-uniform routing architecture exploits the wire width flexibilities to trade area for performance. However, many additional design rules, which confine the routing flexibilities, are introduced in nanoscale circuit designs. With the increasing difficulties of fabricating nanoscale circuits, the conventional non-uniform routing architecture becomes clumsy. We propose a uniform dual-rail routing architecture to cope with these new challenges. The proposed architecture exploits the anti-Miller effect between two adjacent wires with the same signal source. Hence, the coupling capacitance between these two wires is reduced. The simulation results demonstrate that, compared to the conventional non-uniform routing architecture, our proposed architecture provides a signal propagation channel with similar propagation delay, less crosstalk noise, and less power consumption, with moderate routing area overheads. In terms of the properties and the scalabilities, we argue that the uniform dual-rail routing architecture is a wire sizing alternative without incurring layout irregularity and stacked vias overheads.

Structural Synthesis of Four-Quadrant Multiplier Based on Hierarchical Topology [p. 800]
X. Wang and L. Hedrich

This paper presents a method towards automatic structural synthesis of analog multipliers based on a hierarchical topology, the "super-topology", which is abstracted from the most standard four-quadrant multipliers. The essential components in the super-topology are four identical cells, which consist of several MOS transistors and determine the features and performance of the multipliers. We build all possible cells within 3 transistors. Experimental results present three new multiplier structures with simulation results to show the creativity of our method.

A Virtual Prototype for Bluetooth over Ultra Wide Band System Level Design [p. 804]
A. Lewicki, J. del Prado Pavon, J. Talayssat, E. Dekneuvel and G. Jacquemod

The industry is merging two different Wireless Personal Area Networks (WPAN) technologies, Bluetooth (BT) and WiMedia Ultra Wide Band (UWB), into a single BT over UWB (BToUWB) specification. The goal is to provide low cost, low power and a wide range of data rate wireless communications for multimedia and mobile applications. The complexity of studying such a system requires the development of a virtual prototype at a high level of abstraction. The model needs a fast simulation time in order to explore the algorithms necessary for the merging of the standards. Moreover, as the merging is still in a standardization phase, this virtual prototype helps to actively participate in this effort. The aim of this paper is to provide an overview of the methodology used to create a virtual prototype of a BToUWB device.

Re-Examining the Use of Network-on-Chip as Test Access Mechanism [p. 808]
F. Yuan, L. Huang and Q. Xu

Existing work on testing NoC-based systems advocates reusing the on-chip network itself as test access mechanism (TAM) to transport test data to/from embedded cores. While this methodology obviously reduces the routing cost when compared to the case where dedicated test buses are introduced as TAMs, it is not clear whether it is beneficial in terms of other important factors that significantly affect test cost, e.g., testing time, test control complexity and test reliability. In this paper, we therefore re-examine the issue of using NoC as TAM in order to help designers construct a cost-effective system test architecture based on their requirements.


7.1: PANEL SESSION - The Future Car: Technology, Methods and Tools (Automotive Systems Day)

Organizers: A. Sangiovanni-Vincentelli, UC Berkeley, US; M. Di Natale, Scuola S Anna, Pisa, IT
Moderator: A. Sangiovanni-Vincentelli, UC Berkeley, US

PANEL - The Future Car: Technology, Methods and Tools [p. 812]
Panelists: H. Hanselmann, H. Heineke, A. Bouali, H. Kopetz, H. Fennel and T. Weber

The car of the future will be based on very advanced software and hardware technologies for improved safety and additional features such as autonomous driving, vehicle to vehicle communication, extensive communication and entertainment subsystems. What are the limiting factors for introducing new technology in cars? What are the standards, methods and tools that will be needed to bring these cars to market quickly and with guaranteed properties? The experts in the panel will address these questions and discuss their preferred solutions.


7.2: Formal Methods for Hardware and Software Verification

Moderators: R. Bloem, TU Graz, AT; R. Drechsler, Bremen U, DE
Improving Constant-Coefficient Multiplier Verification by Partial Product Identification [p. 813]
C.-Y. Lai, C.-Y. Huang and K.-Y. Khoo

Constant-coefficient multipliers are fundamental components in digital signal processing and arithmetic-based systems. Their verification, however, remains difficult and time-consuming. This is caused by the inability to identify the partial products from the number representation system of the constant. In this paper, we introduce an efficient number representation system as an observation on how modern synthesizers interpret constants. We also propose a robust and efficient partial product identification algorithm to improve the verification process. Experimental results show that our algorithm not only reduces the number of failing verification cases to one third but also speeds up the verification process by an average of at least 25%.

Improved Visibility in One-to-Many Trace Concretization [p. 819]
K. Nanshi and F. Somenzi

We present an improved algorithm for concretization of abstract error traces in abstraction refinement-based invariant checking. The proposed algorithm maps each transition of the abstract error trace to one or more transitions in the concrete model by using a combination of simulation and satisfiability checking. Prior simulation-based approaches were hindered by limited visibility, which often resulted in excessive backtracking or refinements. The proposed technique addresses this issue in three ways: by identifying variables whose addition to the abstract trace significantly improves its predictive power at a low computational cost; by combining SAT checks with pseudo-random simulation in the construction of the concrete trace; and by a more flexible budgeting of simulation vectors that accounts for the progress made in concretization.

Efficient Symbolic Simulation of Low Level Software [p. 825]
T. Arons, E. Elster, S. Ozer, J. Shalev and E. Singerman

Symbolic execution has long been a staple technique for formal hardware verification. Its application to software requires methods for dealing with software-specific complexities. In this paper we elaborate methods for the efficient symbolic simulation of embedded software; some methods are new, others are improvements of existing methods. Using these techniques we have been able to symbolically execute real-life microcode of thousands of lines, allowing formal methods to become an integral part of microcode validation at Intel Corporation.

Completeness in SMT-Based BMC for Software Programs [p. 831]
M.K. Ganai and A. Gupta

Bounded Model Checking (BMC) is incomplete without a completeness threshold (CT) bound. Previous methods, using recurrence diameter for obtaining CT, check for existence of a longest loop-free path at every depth k. For terminating software programs, we propose an efficient method for obtaining CT that requires solving a formula of size O(k) at some depths only, as compared to previous methods that require solving a formula of O(k^2) (or O(k log k)) size at every depth. We augment previous methods for BMC simplifications using model transformation and control flow information, with context-sensitive analysis. This results in more BMC simplifications and further reduction in the number of CT checks. We have implemented our techniques in a Satisfiability Modulo Theory (SMT)-based BMC framework. Our controlled experiments on real-world software programs show that our proposed formulation provides significant improvements over previous approaches.


7.3: Physical Design: From Pins to Transistors

Moderators: L. Scheffer, Cadence Design Systems, US; I. Markov, U of Michigan, US
Novel Pin Assignment Algorithms for Components with Very High Pin Counts [p. 837]
T. Meister, J. Lienig and G. Thomke

The wiring effort, and thus the routability, of electronic designs such as printed circuit boards, multi chip modules and single chip modules largely depends on the assignment of signals to component pins. For modern components that have as many as several thousand pins, this pin assignment cannot be optimized manually. This paper presents four novel pin assignment algorithms that automatically create optimized pin assignments for wiring substrate designs with components that have very high pin counts. We also present and evaluate quality estimation metrics that enable fast assessment of the pin assignment results. The efficiency of our algorithms allows the creation of optimized pin assignments using only minutes of computation time. We show the applicability of all four algorithms, including their strengths and weaknesses, in specific design applications.

A Generic Standard Cell Design Methodology for Differential Circuit Styles [p. 843]
S. Badel, E. Güleyüpoglu, O. Inaç, A.P. Martinez, P. Vietti, F.K. Gürkaynak and Y. Leblebici

In this paper we present a generic methodology for the rapid generation and implementation of standard cell libraries for differential circuit design styles. We demonstrate a systematic approach for the classification of circuit topologies (footprints) and for generating the templates that correspond to a large number of functions. The generation of an extensive cell library with more than 4500 standard cells based on 19 footprints is demonstrated using a 180 nm CMOS technology.

Layout Level Timing Optimization by Leveraging Active Area Dependent Mobility of Strained-Silicon Devices [p. 849]
A. Chakraborty, X. Shi and D.Z. Pan

Advanced MOSFETs such as Strained Silicon (SS) devices have emerged as critical enablers to keep Moore's law on track for sub-100nm technologies. Use of Strained Silicon devices provides performance improvement equivalent to use of next generation devices, without actually requiring scaling. Traditionally, the research in the field of SS has been focussed on device modeling and process characterization. Recently (in [1] [2]), the dependence of mobility of a SS MOSFET device on its poly-to-poly distance has been reported. In this work, we propose a new methodology to exploit this dependence to achieve cycle time reduction of a design at the layout level. To the best of our knowledge, this is the first research work to tackle timing closure by layout modifications using active area dependent mobility of SS devices. Our methodology shows consistent improvement for benchmark designs mapped onto various 90nm commercial standard cell libraries. This work enables reduction of cycle time by as much as 6.31% (and on average 5.25%) very late in the design closure cycle without requiring any optimization iterations.

Exploiting Correlation Kernels for Efficient Handling of Intra-Die Spatial Correlation, with Application to Statistical Timing [p. 856]
A. Singhee, S. Singhal and R.A. Rutenbar

Intra-die manufacturing variations are unavoidable in nanoscale processes. These variations often exhibit strong spatial correlation. Standard grid-based models assume model parameters (grid-size, regularity) in an ad hoc manner and can have high measurement cost. The random field model overcomes these issues. However, no general algorithm has been proposed for the practical use of this model in statistical CAD tools. In this paper, we propose a robust and efficient numerical method, based on the Galerkin technique and Karhunen-Loève Expansion, that enables effective use of the model. We test the effectiveness of the technique using a Monte Carlo-based Statistical Static Timing Analysis algorithm, and see errors less than 0.7%, while reducing the number of random variables from thousands to 25, resulting in speedups of up to 100x.


7.4: Advanced Design Techniques for Sensor and Communication Applications

Moderators: R. Forsyth, Austriamicrosystems AG, AT; G. Van der Plas, IMEC, BE
A Triple-Mode Reconfigurable Sigma-Delta Modulator for Multi-Standard Wireless Applications [p. 862]
A. Morgado, R. del Río and J.M. de la Rosa

This paper presents the implementation and experimental characterization of a reconfigurable ΣΔ modulator intended for multi-mode wireless receivers that is capable of performing the analog-to-digital conversion for GSM, Bluetooth, and UMTS standards. The ΣΔ modulator reconfigures its cascade topology and building blocks in order to adapt the performance to the diverse standard specifications with optimized power consumption. The prototype has been implemented in a 130-nm CMOS technology and features dynamic ranges of 86.7/81.0/63.3dB and peak signal-to-(noise+distortion) ratios of 74.0/68.4/52.8dB at 400ksps/2Msps/8Msps, respectively. The modulator power consumption is 25.2/25.0/44.5mW, of which 11.0/10.5/24.8mW are dissipated in the analog circuitry.

Low-Noise Sigma-Delta Capacitance-to-Digital Converter for Sub-pF Capacitive Sensors with Integrated Dielectric Loss Measurement [p. 868]
M. Bingesser, T. Loeliger, W. Hinn, J. Hauer, S. Mödl, R. Dorn and M. Völker

A sigma-delta capacitance-to-digital converter (CDC) with a resolution down to 19.3 aF at a bandwidth of 10 kHz, corresponding to a noise level of 0.2 aF/√Hz, is presented. An integrated dielectric loss measurement circuit by means of two parallel channels with different integration times offers a complex permittivity measurement in a single-chip solution. The achieved dielectric loss angle resolution is as low as 0.3 ° for a material density ratio of 0.55 %. A test chip with two converter blocks including two 2nd order and two 4th order modulators has been produced in the austriamicrosystems AG C35B3C0 0.35 μm DPTM CMOS process, operating at a single 3.3 V supply. Applications of this circuit include mass measurement and analysis of material compositions.
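
As a quick consistency check of the two figures quoted above, and assuming white noise that is flat over the stated bandwidth (an assumption made here, not something the abstract states), the resolution $\sigma_C$, the bandwidth $B$ and the noise density $n_C$ are related as follows; the symbols are introduced only for this illustration:

```latex
\[
n_C \;=\; \frac{\sigma_C}{\sqrt{B}}
    \;=\; \frac{19.3\,\mathrm{aF}}{\sqrt{10\,\mathrm{kHz}}}
    \;=\; \frac{19.3\,\mathrm{aF}}{100\,\sqrt{\mathrm{Hz}}}
    \;\approx\; 0.19\,\mathrm{aF}/\sqrt{\mathrm{Hz}}
    \;\approx\; 0.2\,\mathrm{aF}/\sqrt{\mathrm{Hz}}
\]
```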

Calibration of Integrated CMOS Hall Sensors Using Coil-on-Chip in ATE Environment [p. 873]
M. Badaroglu, G. Decabooter, F. Laulanet and O. Charlier

Due to high demand for Hall sensors, mostly in automotive and industrial applications, development and manufacturing of Hall sensors in System-on-Chip (SoC) became more important. On the other hand, options for test and characterization of Hall sensors in a manufacturing environment are very limited. In most cases external field generators are used in order to characterize the Hall sensors on a small set of production samples. In this paper, we present our Coil-on-Chip (CoC) calibration methodology where there is no need for a dedicated setup/assembly. Our methodology is also immune to self-heating. Our methodology enables reduced costs in test equipment, 100% screening of Hall sensors in manufacturing tests, and reliable trimming of sensitivity spread over temperature from -40°C to 150°C. Measurement results before trimming show less than 20% six-sigma spread for normalized sensitivity across 120 samples of different Hall sensor structures processed in a 0.35 μm high-voltage CMOS process.

A Programmable and Low-EMI Integrated Half-Bridge Driver in BCD Technology [p. 879]
F. D'Ascoli, L. Bacciarelli, M. Melani, L. Fanucci, G. Ricotti, E. Pardi, F. Vincis, M. Forliti and M. De Marinis

This paper presents the design and the laboratory results of an integrated half-bridge driver for power electronic systems in a 0.35μm Bipolar CMOS DMOS (BCD) technology. The proposed solution is designed for frequency applications up to several hundred kHz and it has a driving current capability up to 50 mA. This work features a design configuration and a digital control to reduce electromagnetic interference (EMI). Moreover it includes short circuit protection, programmability of voltage references and digital control circuitry implementing a mechanism to prevent dangerous failures of the driver. After a detailed description of the circuit we show the laboratory results of the half-bridge driver used to drive a 20 kHz antenna.


7.5: Design Techniques for Error Mitigation

Moderators: C. Papachristou, Case Western Reserve U, US; D. Pradhan, Bristol U, UK
CASP: Concurrent Autonomous Chip Self-Test Using Stored Test Patterns [p. 885]
Y. Li, S. Makar and S. Mitra

CASP, Concurrent Autonomous chip self-test using Stored test Patterns, is a special kind of self-test where a system tests itself concurrently during normal operation without any downtime visible to the end-user. CASP consists of two ideas: 1. Storage of very thorough test patterns in non-volatile memory; and, 2. Architectural and system-level support for autonomous testing of one or more cores in a multi-core system using stored patterns, concurrently with normal system operation, without bringing down the entire system. CASP enables design of robust systems with built-in features for circuit failure prediction, error detection, self-diagnosis and self-repair. Such systems are necessary to overcome major reliability challenges in scaled-CMOS technologies. Implementation of CASP in the OpenSPARC T1 multi-core processor demonstrates its effectiveness and practicality.

Defect Tolerance in Homogeneous Manycore Processors Using Core-Level Redundancy with Unified Topology [p. 891]
L. Zhang, Y. Han, Q. Xu and X. Li

Homogeneous manycore processors are emerging for terascale computation. Effective defect tolerance techniques are essential to improve the yield of such complex integrated circuits. In this paper, we propose to achieve fault tolerance by employing redundancy at the core-level instead of at the microarchitecture-level. When faulty cores exist on-chip in this architecture, how to reconfigure the processor with the most effective topology is a relevant research problem. We present novel solutions for this problem, which not only maximize the performance of the manycore processor, but also provide a unified topology to the operating system and application software running on the processor. Experimental results show the effectiveness of the proposed techniques.

A Low-Cost Concurrent Error Detection Technique for Processor Control Logic [p. 897]
R. Vemu, A. Jas, J.A. Abraham, S. Patil and R. Galivanche

This paper presents a concurrent error detection technique targeted towards control logic in a processor with emphasis on low area overhead. Rather than detect all modeled transient faults, the technique selects faults which have a high probability of causing damage to the architectural state of the processor and protects the circuit against these faults. Fault detection is achieved through a series of assertions. Each assertion is an implication from the inputs to the outputs of a combinational circuit. Fault simulation experiments performed on control logic modules of an industrial processor suggest that a high reduction in damage-causing faults can be achieved with a low overhead.
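
As a rough illustration of the assertion idea only, the following sketch checks one implication from inputs to outputs of a small combinational block. The `stall_logic` block, its signal names and the single assertion are hypothetical and are not taken from the paper; the paper's fault selection and assertion generation are not reproduced here.

```python
from typing import Callable, Dict, List, Optional, Tuple

Signals = Dict[str, int]


class Assertion:
    """Implication: whenever `antecedent` holds on the inputs,
    `consequent` must hold on the outputs."""

    def __init__(self,
                 antecedent: Callable[[Signals], bool],
                 consequent: Callable[[Signals], bool]) -> None:
        self.antecedent = antecedent
        self.consequent = consequent

    def violated(self, inputs: Signals, outputs: Signals) -> bool:
        # An implication fails only if its premise is true and its conclusion false.
        return self.antecedent(inputs) and not self.consequent(outputs)


def stall_logic(inputs: Signals,
                fault: Optional[Tuple[str, int]] = None) -> Signals:
    """Hypothetical control block: raise `stall` on a valid cache miss.
    `fault` optionally forces one output to a value to mimic a transient error."""
    outputs = {"stall": inputs["miss"] & inputs["valid"]}
    if fault is not None:
        name, value = fault
        outputs[name] = value
    return outputs


# Hypothetical assertion covering the case assumed to matter most here:
# a valid miss must always produce a stall.
ASSERTIONS: List[Assertion] = [
    Assertion(lambda i: i["miss"] == 1 and i["valid"] == 1,
              lambda o: o["stall"] == 1),
]


def error_detected(inputs: Signals, outputs: Signals) -> bool:
    """Flag an error if any assertion is violated."""
    return any(a.violated(inputs, outputs) for a in ASSERTIONS)
```

In this toy setup a fault that clears `stall` during a valid miss is flagged, while a spurious `stall` when there is no miss goes unnoticed, which mirrors the idea of protecting only a selected subset of faults in exchange for low overhead.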

Approximate Logic Circuits for Low Overhead, Non-Intrusive Concurrent Error Detection [p. 903]
M.R. Choudhury and K. Mohanram

This paper describes a scalable, technology-independent algorithm for the synthesis of approximate logic circuits. A low overhead, non-intrusive solution for concurrent error detection (CED) based on such circuits is described in this paper. CED based on approximate logic circuits does not impose any performance penalty on the original design. The proposed synthesis algorithm for approximate logic circuits scales with circuit size, and provides fine-grained trade-offs between area-power overhead and CED coverage.


7.6: Safety-Driven Embedded Systems Design

Moderators: J. Sztipanovits, Vanderbilt U, US; J. Beutel, ETH Zurich, CH
Logical Reliability of Interacting Real-Time Tasks [p. 909]
K. Chatterjee, A. Ghosal, T.A. Henzinger, D. Iercan, C.M. Kirsch, C. Pinello and A. Sangiovanni-Vincentelli

We propose the notion of logical reliability for real-time program tasks that interact through periodically updated program variables. We describe a reliability analysis that checks if the given short-term (e.g., single-period) reliability of a program variable update in an implementation is sufficient to meet the logical reliability requirement (of the program variable) in the long run. We then present a notion of design by refinement where a task can be refined by another task that writes to program variables with less logical reliability. The resulting analysis can be combined with an incremental schedulability analysis for interacting real-time tasks proposed earlier for the Hierarchical Timing Language (HTL), a coordination language for distributed real-time systems. We implemented a logical-reliability-enhanced prototype of the compiler and runtime infrastructure for HTL.

Scheduling of Fault-Tolerant Embedded Systems with Soft and Hard Timing Constraints [p. 915]
V. Izosimov, P. Pop, P. Eles and Z. Peng

In this paper we present an approach to the synthesis of fault-tolerant schedules for embedded applications with soft and hard real-time constraints. We aim to guarantee the deadlines for the hard processes even in the case of faults, while maximizing the overall utility. We use time/utility functions to capture the utility of soft processes. Process re-execution is employed to recover from multiple faults. A single static schedule computed off-line is not fault tolerant and is pessimistic in terms of utility, while a purely online approach, which computes a new schedule every time a process fails or completes, incurs an unacceptable overhead. Thus, we use a quasi-static scheduling strategy, where a set of schedules is synthesized off-line and, at run time, the scheduler selects the right schedule based on the occurrence of faults and the actual execution times of processes. The proposed schedule synthesis heuristics have been evaluated using extensive experiments.
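
The run-time half of a quasi-static strategy can be pictured as a table lookup. The sketch below is only an illustration under stated assumptions: the `Schedule` fields, the selection rule and the example table are hypothetical, and the off-line synthesis heuristics of the paper are not shown.

```python
from typing import List, NamedTuple, Sequence


class Schedule(NamedTuple):
    max_faults: int        # usable while at most this many faults have occurred
    latest_finish: int     # usable if the previous process finished by this time
    order: Sequence[str]   # remaining processes, in execution order


def select_schedule(table: List[Schedule],
                    faults_so_far: int,
                    finish_time: int) -> Schedule:
    """Pick a precomputed schedule that matches the observed situation.

    Assumed rule: among the applicable schedules, prefer the least
    pessimistic one (fewest assumed faults, then tightest finish bound).
    """
    candidates = [s for s in table
                  if faults_so_far <= s.max_faults
                  and finish_time <= s.latest_finish]
    if not candidates:
        raise LookupError("no precomputed schedule covers this situation")
    return min(candidates, key=lambda s: (s.max_faults, s.latest_finish))


# Hypothetical table synthesized off-line: soft processes are dropped
# as faults occur or as earlier processes finish late.
TABLE = [
    Schedule(0, 40, ("P2_hard", "P3_soft", "P4_soft")),
    Schedule(0, 70, ("P2_hard", "P3_soft")),
    Schedule(1, 100, ("P2_hard",)),
]
```

For example, `select_schedule(TABLE, 0, 35)` returns the first entry, which keeps both soft processes, whereas one observed fault leaves only the entry that runs the hard process alone.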

Tool Support for Incremental Failure Mode and Effects Analysis of Component-Based Systems [p. 921]
J. Elmqvist and S. Nadjm-Tehrani

Failure Mode and Effects Analysis (FMEA) is a well-known technique widely used for safety assessment in the area of safety-critical systems. However, FMEA is traditionally done manually, which makes it both time-consuming and costly, especially for large and complex systems. Also, small modifications in the design may result in a complete revision of the initial FMEA. This paper presents tool support for automated incremental component-based FMEA of SW and HW. It is based on component safety interfaces and a formal compositional safety analysis method. This tool support enables engineers to focus on more important steps in the safety assessment process. Also, during system upgrades, the tool incrementally registers the changes and identifies possible effects in the FMEA, which enables the reuse of earlier safety analysis results. Finally, this formal approach based on design models of the components and the system always creates FMEAs which are consistent with the system design.

Compositional Design of Isochronous Systems [p. 928]
J.-P. Talpin, J. Ouy, L. Besnard and P. Le Guernic

The synchronous modeling paradigm provides strong execution correctness guarantees to embedded system design while making minimal environmental assumptions. In most related frameworks, global execution correctness is achieved by ensuring endochrony: the insensitivity of (logical) time in the system from (real) time in the environment. Interestingly, endochrony can be statically checked, making it fast to ensure design correctness. Unfortunately, endochrony is not preserved by composition, making it difficult to exploit with component-based design concepts in mind. Compositionality can be achieved by weakening the objective of endochrony but at the cost of an exhaustive state-space exploration. This raises a trade-off between performance and precision. Our aim is to balance it by proposing a formal design methodology that adheres to a weakened global design objective: the non-blocking composition of weakly endochronous processes, while preserving local endochrony objectives. This yields an ad-hoc yet cost-efficient approach to compositional synchronous modeling.


7.7: Hot Topic - Quantitative Productivity Measurement in IC Design

Organizer/Moderator: A. Vörg, edacentrum, DE
Quantitative Productivity Measurement in IC Design [p. 934]
F. Badstübner and A. Vörg

This paper describes ongoing research in the field of quantitative productivity measurement in IC Design and simulation of different scenarios as decision support. Five topics out of this research field give insight into the preparation of real design flows for productivity measurement and how these measurements are used for analysis, simulation and optimization of design flows. This paper starts with an introduction in section 1 of the PRODUKTIV+ project in which most of the research has been done. The modeling of projects and extraction of the important indicators complexity and quality is explained in sections 2 and 3. In section 4 Synopsys as an EDA vendor from outside of PRODUKTIV+ adds its view on productivity measurement. Section 5 contributes to the modeling of a verification process for productivity simulations. Section 6 explains an optimization process for a microprocessor design flow under productivity considerations. Most of this work has been carried out in the PRODUKTIV+ project (label 01 M 3077) that is partly funded by the German government [17].

Determining the Technical Complexity of Integrated Circuits [p. 935]
P. Leppelt and E. Barke

The classification and quantification of a projected design's technical properties is essential for the prediction of success or failure of a microelectronic development project. The derived values have to mirror the design's capacity and thus allow for an estimation of the design complexity. This chapter depicts the PRODUKTIV+ solution approach to the ascertainment of a design artifact and the determination of equations in particular.

Qualitative and Quantitative Analysis of IC Designs [p. 935]
S. Häusler, F. Poppen, K. Hausmann, A. Hahn and W. Nebel

A project's output needs to be quantified to enable the evaluation of its productivity. Besides complexity, the quality of the result is a main criterion to consider. Based on the quality definition of "conformance to requirements" (see e.g. [14]), our approach combines requirements and quality modelling to allow real-time tracking of the project status for integrated circuit design. Each design project has its individual (quality-) requirements and even components of the same design may differ in this aspect. A general quality evaluation concept has to cover this individualism. A simple example is a component that should be reusable in multiple designs and therefore has to fulfil specific criteria ([11], [13]). Our approach utilises a machine-readable requirements definition in combination with common quality modelling techniques. Requirement fulfilment degrees and the current quality are computable based on this requirements definition and a snapshot of the current development status. The Permeter framework developed by OFFIS is used to collect the data that represents the current development status. Permeter offers the functionality to load product data from different sources and establishes links between the data, e.g. requirements and corresponding components. Permeter offers both manual and (semi-) automatic linkage facilities. For a detailed description of the data integration process refer to [15].

Capturing and Analyzing IC Design Productivity Metrics [p. 936]
J. Young

You can't improve what you can't measure, and many people won't take the time to measure. This tutorial describes a practical, low-impact method used by Synopsys Professional Services design teams to measure and analyze design flow and runtime metrics on their customer chip projects. Details about the capture methodology, database, and reporting infrastructure will be discussed, as will uses for the metrics reports and an overall context for design productivity improvement. Although the details are provided within the framework of the Synopsys Design Environment, the concepts described are applicable to any structured design environment.

Application of Workflow Petri Nets to Modeling of Formal Verification Processes in Design Flow of Digital Integrated Circuits [p. 937]
K. Weinberger, S. Bulach and W. Rosenstiel

According to statistics the verification of digital integrated circuits (IC) claims up to 70 % of the design time and effort in the design process. This means that the verification process must be well structured and organized in order to efficiently reach desired verification goals. This paper describes the modelling of an exhaustive formal verification process of a digital IC with Workflow Petri Nets [8] and the WoPeD (Workflow Petri net Designer) tool [9], which supports modelling, simulation and analysis of a workflow process. The purpose of this work is to formalize and quantify the verification process such that it could subsequently be structurally and behaviourally analyzed according to the means provided by Petri Nets and, if desired, simulated with a particular scenario. This approach makes it possible to explicitly examine and derive the interaction of different factors which influence a verification process such that their relationships could be quantified. Initial experimental results are presented and advantages and disadvantages of this methodology are discussed.

Optimization of Design Flows for Multi-Core x86 Microprocessors in 45 and 32nm Technologies under Productivity Considerations [p. 938]
H.-J. Brand

Designing next generation 45nm and 32nm multi-core microprocessors creates new challenges caused by a dramatic increase of design complexity and constraints such as:

  • increasing number of cores per chip
  • enhancing cache sizes and cache systems
  • increasing frequency for memory and serial interfaces
  • heterogeneous multi-core architectures
  • functional enhancements (security, virtualization, ...)
  • DfY/DfM/DfV require the consideration of more and new technology specific characteristics.
Without a considerable improvement in design productivity, new products will not reach the market in time to create maximum economic value. The presentation describes how an infrastructure for measuring productivity-relevant parameters of a microprocessor design flow for 45 and 32nm technologies can be built up.


8.1: Dependable Computing in the Face of Scaled CMOS Challenges (Dependable Embedded Systems Day)

Organizers: N. Suri, TU Darmstadt, DE; C. Fetzer, TU Dresden, DE
Moderator: N. Suri, TU Darmstadt, DE
Implications of Technology Trends on System Dependability [p. 940]
J.A. Abraham

CMOS has been the dominant integrated circuit (IC) technology for nearly four decades, following the trends predicted by Moore's Law, and fueling the information and communication revolution. As chip geometries decrease and transistor densities increase, new types of faults - from manufacturing defects and operational transients to long-term wearout - need to be addressed. These faults and the resulting logic errors have been dealt with at both the low and high levels of the design. This talk deals with approaches for improving dependability at the system level.

Globally Optimized Robust Systems to Overcome Scaled CMOS Challenges [p. 941]
S. Mitra

Future system design methodologies must accept the fact that the underlying hardware will be imperfect, and enable design of robust systems that are resilient to hardware imperfections. Three techniques that can enable a sea change in robust system design are: 1. Built-In Soft Error Resilience (BISER), 2. Circuit Failure Prediction, and 3. Concurrent Autonomous self-test using Stored Patterns (CASP). Global optimization across multiple abstraction layers is essential for cost-effective robust system design using these techniques.

Software Protection Mechanisms for Dependable Systems [p. 947]
U. Wappler and M. Müller

We expect that in the future commodity hardware will be used in safety-critical applications. However, the commodity microprocessors used will become less reliable because of decreasing feature size and reduced power supply. Thus, software-implemented approaches to deal with unreliable hardware will be required. As one basic step towards software-implemented hardware-fault tolerance (SIHFT) we aim at providing failure virtualization by turning arbitrary value failures caused by erroneous execution into crash failures, which are easier to handle. Existing SIHFT approaches either are not broadly applicable or lack the ability to reliably deal with permanent hardware faults. In contrast, Forin [7] introduced the Vital Coded Microprocessor, which reliably detects transient and permanent hardware errors but is not applicable to arbitrary programs and requires special hardware. We discuss different approaches to generalize Forin's approach and make it applicable to modern infrastructures.
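
The idea of turning value failures into crash failures can be illustrated with a simple arithmetic (AN) code. This is a minimal sketch only: the abstract does not specify an encoding, so the use of an AN code, the constant `A`, and all function names below are assumptions made for illustration, not the authors' method.

```python
A = 58659  # hypothetical code constant; any odd A > 1 works for this sketch


class ValueFailure(Exception):
    """Raised when a corrupted code word is detected (fail-stop behaviour)."""


def encode(x: int) -> int:
    # A valid code word is always a multiple of A.
    return A * x


def check(xc: int) -> int:
    # A word that is not a multiple of A cannot be valid: stop instead of
    # silently continuing with a wrong value.
    if xc % A != 0:
        raise ValueFailure(f"invalid code word {xc}")
    return xc


def decode(xc: int) -> int:
    return check(xc) // A


def add(xc: int, yc: int) -> int:
    # The sum of two code words is again a code word: A*x + A*y = A*(x + y).
    return check(xc) + check(yc)
```

With an odd `A`, flipping any single bit changes a code word by a power of two, which is never a multiple of `A`, so `decode(encode(20) ^ 8)` raises `ValueFailure` rather than returning a wrong value.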


8.2: Invited Industrial Session - Industrial System Designs in Information Technologies

Moderators: C. Heer, Infineon Technologies, DE; O. Deprez, Texas Instruments, FR
Subsystem Exchange in a Concurrent Design Process Environment [p. 953]
M. Strik, A. Gonier and P. Williams

This paper provides insight into the novel solutions used to build SoCs targeting increased productivity in a complex environment. Design of such SoCs relies on multi-team, multi-site cooperation and data exchange. The data exchange, made possible through descriptions based on The SPIRIT Consortium's IP-XACT™ specification and the automation for its processing, forms the basis of the approach. Initially, the specification focused on IP reuse; this has now been extended to SoC subsystem exchange. This paper also describes state-of-the-art subsystem design automation and improvement opportunities, based on a close collaboration between NXP Semiconductors and Mentor Graphics. We do not cover all the aspects of reuse but mainly stress the concurrent engineering process.

Cooperative Safety: Combination Of Multiple Technologies [p. 959]
R. Panazzi, P. Capozio, M. Duncan, A. Scuderi, M. Siti and E. Merli

Governmental transportation authorities' interest in Car-to-Car and Car-to-Infrastructure communication has grown dramatically over the last few years, with the aim of increasing road safety and reducing traffic emissions. Achieving these objectives depends on the development of three aspects: Transmission, Localization and Sensor Networks. A new wireless technique evolved from current WiFi technology must reduce latency to achieve timely and efficient communication among vehicles. Relative positioning is essential to predict whether two cars are on a collision course. Experts estimate that positioning accuracy must be below one meter in order to provide the necessary reaction time. Many technical issues exist in this field as current GPS solutions do not provide this level of accuracy. Multiple standalone approaches exist for sensing networks, including imaging, radar and lidar. In order to create fault-tolerant SIL3-compliant systems, data fusion is obligatory. The amalgamation of these different data streams requires powerful multicore processing to recognize and react to multiple concurrent scenarios.

System Performance Optimization Methodology for Infineon's 32-Bit Automotive Microcontroller Architecture [p. 962]
A. Mayer and F. Hellwig

Microcontrollers are the core part of automotive Electronic Control Units (ECUs). A significant investment of the ECU manufacturers and even their customers is linked to the specified microcontroller family. To preserve this investment it is required to continuously design new generations of the microcontroller with hardware and software compatibility but higher system performance and/or lower cost. The challenge for the microcontroller manufacturer is to get the relevant inputs for improving the system performance, since a microcontroller is used by many customers in many different applications. For Infineon's latest TriCore® based 32-bit microcontroller product line, the required statistical data is gathered by using the trace features of the Emulation Device (ED). Infineon's customers use EDs in their unchanged target system and application environment. With an analytical methodology and based on this statistical data, the performance improvements of different SoC architecture and implementation options can be quantified. This allows an objective assessment of improvement options by comparing their performance cost ratios.


8.3: Power-Aware Circuit and Process Techniques

Moderators: J. Henkel, Karlsruhe U, DE; M. Smith, Royal Institute of Technology (KTH), SE
Process Variation Tolerant Design Through a Placement-Aware Multiple Voltage Island Design Style [p. 967]
S. Bonesi, D. Bertozzi, L. Benini and E. Macii

A common technique to compensate process variation induced performance deviations during post-silicon testing consists of the dynamic adaptation of processor voltage. This however comes at a significant power cost. We envision multi supply voltage design (MSV) as a promising technique to mitigate such power overhead. Voltage islands are widely recognized as the state-of-the-art in MSV design. In this paper, we develop a novel design methodology that leverages voltage islands to compensate process variations through a commercial synthesis flow. Possible violation scenarios of performance requirements in fabricated chips are pre-characterized at design time through statistical static timing analysis. Then, during post-silicon testing the supply voltage of a proper number of voltage islands is raised depending on the actual violation scenario, thus bringing performance back within nominal values. Voltage islands are generated by exploiting cell proximity for minimal perturbation of performance pre-optimized placements.

Optimal MTCMOS Reactivation under Power Supply Noise and Performance Constraints [p. 973]
A. Calimera, L. Benini and E. Macii

Sleep transistor insertion is one of today's most promising and widely adopted solutions for controlling stand-by leakage power in nanometer circuits. Although single-cycle power mode transition reduces wake-up latency, it originates large discharge current spikes, thereby causing IR-drop and inductive ground bounce for the surrounding circuit blocks. We propose a new reactivation solution which helps in controlling power supply fluctuations and in achieving minimum reactivation times. Our structure limits the turn-on current below a given threshold through sequential activation of the sleep transistors, which are connected in parallel and are sized using a novel optimal sizing algorithm. The proposed methodology is validated using HSPICE simulations of several benchmark circuits, which have been synthesized onto a commercial 65nm CMOS technology library.

A Single-supply True Voltage Level Shifter [p. 979]
R. Garg, G. Mallarapu, S.P. Khatri

When a signal traverses on-chip voltage domains, a level shifter is required. Inverters can handle a high to low voltage shift with minimal leakage. For a low to high voltage level translation, inverters tend to consume a large amount of leakage power, and hence special circuits have been proposed for this type of translation. This paper reports a novel single-supply "true" voltage level shifter, true in the sense that it can handle both low-to-high and high-to-low voltage level conversion. Such a requirement arises in many modern ICs or Systems-on-Chip (SoCs). The use of a single supply voltage reduces circuit complexity by eliminating the need for routing both supply voltages. The proposed circuit was extensively simulated in a 90nm technology using SPICE. Simulation results demonstrate that the level shifter is able to perform voltage level shifting with low leakage for both low to high and high to low voltage level translation. We have also validated the correct operation of the proposed level shifter under process and temperature variations.

Clock Distribution Scheme Using Coplanar Transmission Lines [p. 985]
V.H. Cordero and S.P. Khatri

The current work describes a new standing wave oscillator scheme for clock propagation on coplanar transmission lines on a silicon die. The design targets clock signaling in the Gigahertz range (we are able to achieve clock rates of 8GHz and above). The clock is transported as an oscillatory wave on a pair of conductors. An oscillatory standing wave is formed across a transmission line loop, which is connected beginning-to-end through a Mobius configuration. A single cross-coupled inverter pair is required to maintain oscillation across the ring. The design aims to achieve low-skew, low-power and extremely high-frequency global clock distribution. The energy recycling nature of a standing wave along a transmission line allows us to sustain very high frequency oscillations along a conductor with almost no power consumption. A special wide input range driver was designed to convert the differential signals on the coplanar transmission lines into a square clock pulse for standard clock sinks. The design uses CMOS 90nm BSim3v model cards for all simulations, with the transmission lines implemented on Metal8.


8.4: Multicore Design Solutions

Moderators: M. Coppola, STMicroelectronics, FR; F. Petrot, TIMA Laboratory, FR
Compositional, Dynamic Cache Management for Embedded Chip Multiprocessors [p. 991]
A.M. Molnos, M.J.M. Heijligers and S.D. Cotofana

This paper proposes a dynamic cache repartitioning technique that enhances compositionality on platforms executing media applications with multiple utilization scenarios. The repartitioning among scenarios requires a cache flush, thus two undesired effects may occur: (1) the execution of critical tasks may be disturbed and (2) a performance penalty is involved. To cope with these effects we propose a method which: (1) determines, at design time, the cache footprint of each task, such that it creates the premises for critical tasks safety, and reduces the amount of required flush, and (2) enforces these footprints and further decreases the flush penalty, at run-time. We implement our dynamic cache management strategy on a CAKE multiprocessor with 4 Trimedia cores. The experimental workload consists of 6 multimedia applications, each of which is formed by multiple tasks belonging to an extended MediaBench suite. For the repartitioned cache we found on average that: (1) the relative variations of critical tasks' execution time are less than 0.1%, regardless of the scenario switching frequency, (2) for realistic scenario switching frequencies the inter-task cache interference is at most 4%, and (3) the off-chip memory traffic is reduced by 60%, and the performance (in cycles per instruction) improves by 10%, when compared with the shared cache.

Comparison of Memory Write Policies for NoC Based Multicore Cache Coherent Systems [p. 997]
P. Guironnet de Massas and F. Pétrot

The following study shows a direct comparison of memory write policies in Shared Memory Multicore Systems. Although there is much work and there are many studies on this issue, our work takes into account the difficulties related to on-chip communication using network-like interconnects. Our study is based on Cycle Approximate Bit Accurate (CABA) simulations of platforms with up to 64 processors, accurately modelling all the aspects of multi-threaded program execution and memory accesses. Our main results show that write-through caches perform well compared to write-back ones, with a slightly simpler implementation and comparable traffic.

Serialized Asynchronous Links for NoC [p. 1003]
S. Ogg, E. Valli, B. Al-Hashimi, A. Yakovlev, C. D'Alessandro and L. Benini

This paper proposes an asynchronous serialized link for NoC that can achieve the same levels of performance in terms of flits per second as a synchronous link but with a reduced number of wires in the point to point switch links and reduced power consumption. This is achieved by employing serialization in the asynchronous domain as opposed to synchronous to facilitate the removal of global clocking on the serial links. Based on transistor level simulations using 0.12 μm foundry models it has been shown that it is possible to achieve the same level of performance as synchronous but with 75% reduction in wires and 65% reduction in power for a 300 MFlit/s link with 8 buffers with a switch clock speed of 300 MHz. Furthermore the paper presents the design requirements arising from interfacing switches of synchronous NoC and asynchronous serial links.
Keywords: Network-on-Chip, Serial, Asynchronous, Point-to-Point Links.


8.5: Innovative and Emerging Technologies, Systems and Applications

Moderators: M. Geilen, TU Eindhoven, NL; H. Ben Jamaa, EPFL, Lausanne, CH
Design Guidelines for Metallic-Carbon-Nanotube-Tolerant Digital Logic Circuits [p. 1009]
J. Zhang, N.P. Patil and S. Mitra

Metallic Carbon Nanotubes (CNTs) create source-drain shorts in Carbon Nanotube Field Effect Transistors (CNFETs), causing excessive leakage, degraded noise margin and delay variation. There is no known CNT growth technique that guarantees 0% metallic CNTs. Therefore, metallic CNT removal techniques are necessary. Unfortunately, such removal techniques alone are imperfect and insufficient. This paper demonstrates the necessity for co-optimization of processing techniques for metallic CNT removal together with CNFET-based circuit design. We present a probabilistic CNFET circuit model which forms the basis for such co-optimization, and use the model to derive design and processing guidelines that enable design of CNFET-based digital circuits with practical constraints on leakage, noise margin and delay variations. These guidelines are essential for designing robust metallic-carbon-nanotube-tolerant digital circuits.

Quantified Synthesis of Reversible Logic [p. 1015]
R. Wille, H.M. Le, G.W. Dueck and D. Große

In recent years, synthesis of reversible logic functions has emerged as an important research area. Other fields such as low-power design, optical computing and quantum computing benefit directly from the improvements achieved. Recently, several approaches for exact synthesis of Toffoli networks have been proposed. They all use Boolean satisfiability to solve the underlying synthesis problem. In this paper a new exact synthesis approach based on Quantified Boolean Formula (QBF) satisfiability - a generalization of Boolean satisfiability - is presented. Besides the application of QBF solvers, we propose Binary Decision Diagrams to solve the quantified problem formulation. This makes it easy to support different gate libraries during synthesis. In addition, all minimal networks are found in a single step and the best one with respect to quantum costs can be chosen. Experimental results confirm that the new technique is faster than the best previously known approach and leads to cheaper realizations in terms of quantum costs.
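
For readers unfamiliar with Toffoli networks, the following sketch simulates a small one and checks that it is reversible, i.e. that it maps the input patterns one-to-one onto the output patterns. It does not perform synthesis; the example network and all function names are hypothetical and chosen only for illustration.

```python
from itertools import product


def toffoli(bits, controls, target):
    """Multiple-control Toffoli gate: flip `target` iff all `controls` are 1."""
    out = list(bits)
    if all(out[c] for c in controls):
        out[target] ^= 1
    return tuple(out)


def run(network, bits):
    """Apply the gates of `network` in order to the tuple `bits`."""
    for controls, target in network:
        bits = toffoli(bits, controls, target)
    return bits


def is_reversible(network, n):
    """True if the network permutes the 2**n input patterns."""
    images = {run(network, b) for b in product((0, 1), repeat=n)}
    return len(images) == 2 ** n


# Hypothetical 3-line network: CNOT(0 -> 1) followed by Toffoli(0, 1 -> 2).
NETWORK = [((0,), 1), ((0, 1), 2)]
```

Because each Toffoli gate is its own inverse, running the same gates in reverse order undoes the network: `run(NETWORK[::-1], run(NETWORK, b))` returns `b` for every input `b`.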

Adaptive Simulation for Single-Electron Devices [p. 1021]
N. Allec, R. Knobel and L. Shang

Single-electron devices have drawn much attention in the last two decades. They have been widely used for device research and also show promise as a potential alternative to complementary metal-oxide-semiconductor circuits due to their ultra low power dissipation. Three techniques have been used for single-electron device modeling in the past: Monte Carlo simulation, master equation, and SPICE modeling. Among these, the Monte Carlo method provides accuracy, but lacks the time efficiency required for large scale simulation. In this work, we introduce an adaptive multi-scale approach to single-electron device simulation using the Monte Carlo method as a basis, which significantly improves time efficiency while maintaining accuracy. We have shown it is possible to reduce simulation time by up to 40 times and maintain an average error of 3.3% compared to the non-adaptive Monte Carlo method. Going beyond simplistic approximations, we have modeled important secondary effects including cotunneling and Cooper pair tunneling, which are critical for device research.

OS-Based Sensor Node Platform and Energy Estimation Model for Health-Care Wireless Sensor Networks [p. 1027]
F.J. Rincón, M. Paselli, J. Recas, Q. Zhao, M. Sánchez Eles, D. Atienza, J. Penders and G. De Micheli

Accurate power and performance figures are critical to assess the effective design of possible sensor node architectures in Body Area Networks (BANs) since they operate on limited energy storage. Therefore, accurate power models and simulation tools that can model real-life working conditions need to be developed and validated with real platforms. In this paper we propose a sensor node platform designed for health-care applications and a validated simulation model based on event-driven operating system simulation that can be used to accurately analyze performance and power consumption in BANs composed of multiple nodes. Thus, this model can be employed to tune the node architecture and communication layer for different working conditions, applications and topologies of BANs. In this paper we validate the proposed simulation model on different real-life applications and working conditions. Our results show variations of less than 4% between the presented simulation framework and measurements in the final platforms.


8.6: New Real-Time Scheduling Approaches and their Applications

Moderators: S. Goddard, U of Nebraska - Lincoln, US; P. Mosterman, The MathWorks, US
Improvements in Polynomial-Time Feasibility Testing for EDF [p. 1033]
A. Masrur, S. Drössler and G. Färber

This paper presents two fully polynomial-time sufficient feasibility tests for EDF when considering periodic tasks with arbitrary deadlines and preemptive scheduling on uniprocessors. Both proposed methods are proven, analytically and by means of an extensive experimental comparison, to be more accurate than known polynomial-time feasibility tests. Additionally, we show for a wide interval of practical processor utilization that one of these methods presents almost the same efficiency, in terms of accepted task sets, as the more complex pseudo-polynomial-time exact feasibility tests.
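
As background on what a polynomial-time sufficient test looks like, the sketch below implements the classic density bound, which accepts a task set when the sum of C / min(D, T) does not exceed 1. It is a well-known baseline, not either of the two tests proposed in the paper; the function name and the (C, D, T) tuple layout are assumptions for illustration.

```python
from fractions import Fraction


def density_test(tasks):
    """Sufficient EDF feasibility test based on task density.

    tasks: iterable of (C, D, T) integer triples giving worst-case execution
    time, relative deadline and period of each periodic task.
    Returns True if feasibility is guaranteed; False means "not proven",
    not "infeasible".
    """
    # Exact rational arithmetic avoids rounding errors at the boundary.
    total = sum(Fraction(c, min(d, t)) for c, d, t in tasks)
    return total <= 1
```

The test runs in linear time but is pessimistic. For example, the set `[(2, 2, 10), (1, 3, 10)]` is rejected because its density is 4/3, even though both jobs released together can finish by their deadlines (2 time units by time 2, 3 time units by time 3) and the processor utilization is only 0.3. Tighter sufficient tests aim to narrow this gap.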

A Dual-Priority Real-Time Multiprocessor System on FPGA for Automotive Applications [p. 1039]
A. Tumeo, M. Branca, L. Camerini, M. Ceriani, M. Monchiero, G. Palermo, F. Ferrandi and D. Sciuto

This paper presents the implementation of a dual-priority scheduling algorithm for real-time embedded systems on a shared memory multiprocessor on FPGA. The dual-priority microkernel is supported by a multiprocessor interrupt controller to trigger periodic and aperiodic thread activation and manage context switching. We show how the dual-priority algorithm performs on a real system prototype compared to the theoretical performance simulations with a typical standard workload of automotive applications, underlining where the differences are.

An Application-Based EDF Scheduler For OSEK/VDX [p. 1045]
C. Diederichs, U. Margull, F. Slomka and G. Wirrer

Earliest deadline first (EDF) scheduling achieves processor utilization of up to 100 percent and improved robustness in overload situations. However, most automotive applications run under a static priority policy. Because of this, the standard operating system in the automotive industry, OSEK/VDX, supports only priority scheduling. This paper describes an EDF scheduler plug-in for OSEK/VDX. The plug-in provides EDF scheduling without changes to the operating system by delaying task activations. The add-on was tested on an engine management system developed by SiemensVDO. Results of this experiment are presented and discussed, showing that EDF scheduling techniques can improve the system in terms of robustness and resource utilization.

Time Properties of the BuST Protocol under the NPA Budget Allocation Scheme [p. 1051]
G. Franchino, G. Buttazzo and T. Facchinetti

Token passing is a channel access technique used in several communication networks. Among them, one of the most effective solutions for supporting both real-time traffic (synchronous messages) and non real-time traffic (asynchronous messages) is the so-called timed-token protocol. Recently, a new token passing protocol, called Budget Sharing Token protocol (BuST), was proposed to improve the existing timed-token approaches in terms of synchronous bandwidth guarantee, while guaranteeing a minimum throughput for the asynchronous traffic. This paper analyzes the ability of BuST to manage real-time and non real-time traffic in comparison with the classic timed-token protocol and its modified version, under the Normalized Proportional Allocation (NPA) scheme. We will show that BuST achieves higher guaranteed real-time bandwidth than the original timed-token protocol, and improves the service for the non real-time traffic with respect to its modified version.


8.7: High-Level Synthesis and IP Protection

Moderators: P. Brisk, EPFL, Lausanne, CH; N. Dutt, UC Irvine, US
Simultaneous FU and Register Binding Based on Network Flow Method [p. 1057]
J. Cong and J. Xu

With the rapid increase of design complexity and the decrease of device features in nano-scale technologies, interconnection optimization in digital systems becomes more and more important. In this paper we develop a simultaneous FU and register (SFR) binding algorithm for multiplexer optimization based on min-cost network flow. Unlike most of the prior approaches in which functional unit binding and register binding are performed sequentially, our approach performs these two highly correlated tasks gradually and concurrently. We also present an ILP formulation of the combined functional unit and register binding problem for the optimality study of heuristics. Experimental results show that when compared to traditional binding algorithms, our simultaneous resource binding algorithm is close to optimal solutions for small-size designs (only 5% more MUX) and achieves significant reduction for MUX area (12%) and timing (10%) for a set of real-life benchmark designs.

A Variation Aware High Level Synthesis Framework [p. 1063]
F. Wang, G. Sun and Y. Xie

The worst-case delay/power of function units has been used in traditional high level synthesis to facilitate design space exploration. As technology scales to the nanometer regime, the impact of process variations increases. The degree of variability encountered in the new process technologies makes worst-case analysis undesirable, because it may result in unexpected performance/power discrepancy or a pessimistic estimation, and may end up using excess resources to guarantee design constraints. In this paper, we propose a high level synthesis framework that takes into account the performance/power variation of function units. An effective metric called parametric yield, which is defined as the probability of the synthesized data flow graph (DFG) meeting the performance and power constraints, is used to guide scheduling, module selection, and resource sharing. An efficient performance/power yield perturbation computation method for DFG significantly improves the effectiveness of our yield driven high level synthesis algorithm. The experimental results show that our variation-aware synthesis framework achieves significant yield improvements, and has a much faster (3X) runtime compared with the previous approach.

EPIC: Ending Piracy of Integrated Circuits [p. 1069]
J.A. Roy, F. Koushanfar and I.L. Markov

As semiconductor manufacturing requires greater capital investments, the use of contract foundries has grown dramatically, increasing exposure to mask theft and unauthorized excess production. While only recently studied, IC piracy has now become a major challenge for the electronics and defense industries [6]. We propose a novel comprehensive technique to end piracy of integrated circuits (EPIC). It requires that every chip be activated with an external key, which can only be generated by the holder of IP rights, and cannot be duplicated. EPIC is based on (i) automatically-generated chip IDs, (ii) a novel combinational locking algorithm, and (iii) innovative use of public-key cryptography. Our evaluation suggests that the overhead of EPIC on circuit delay and power is negligible, and the standard flows for verification and test do not require change. In fact, major required components have already been integrated into several chips in production. We also use formal methods to evaluate combinational locking and computational attacks. A comprehensive protocol analysis concludes that EPIC is surprisingly resistant to various piracy attempts.


IP4 Interactive Presentations

VLSI Implementation of SISO Arithmetic Decoder for Joint Source Channel Coding [p. 1075]
S. Zezza and G. Masera

In this paper we propose an efficient VLSI implementation of a Soft Input Soft Output (SISO) arithmetic code (AC) decoder for joint source channel coding. The addressed application shows a very high level of processing complexity, but, to the best of our knowledge, no papers have been published in the literature on the hardware implementation of the considered joint source channel scheme. First we introduce a simplified algorithm for the SISO AC, which is 1.3 times faster than the standard one. Then an efficient SISO AC architecture is proposed and synthesis results on a 0.13 μm standard cells technology are reported for two different sets of parameters (M=128, M=256). The proposed core runs at 338.9 MHz and can decode up to 124.987 kbit/s.

Error Detection/Correction in DNA Algorithmic Self-Assembly [p. 1079]
S. Frechette and F. Lombardi

A novel error detection/correction technique for algorithmic self-assembly is presented in this paper. Through the use of a tile set that allows errors to be isolated and propagated to the boundary edge of 2D (two-dimensional) assemblies, the proposed technique permits growth errors to be detected and corrected. For assemblies in which each four-sided tile is a party to only one tile mismatch, all growth errors in the assembly can be detected and corrected using the proposed method with only two additional tiles. This technique relies on the attachment of so-called isolation tiles at set periods, thus implementing a checkpoint for error detection/correction. The physical environment and related features for the removal of the erroneous sections of an assembly are presented.
Index Terms: error detection and correction, check-pointing, error tolerance, DNA self-assembly, tiling.

Temperature-Aware Voltage Selection for Energy Optimization [p. 1083]
M. Bao, A. Andrei, P. Eles and Z. Peng

This paper proposes a temperature-aware dynamic voltage selection technique for energy minimization and presents a thorough analysis of the parameters that influence the potential gains that can be expected from such a technique, compared to a voltage selection approach that ignores temperature.

A Fast Approximation Algorithm for MIN-ONE SAT [p. 1087]
L. Fang and M.S. Hsiao

In this paper, we propose a novel approximation algorithm (RelaxSAT) for MIN-ONE SAT. RelaxSAT generates a set of constraints from the objective function to guide the search. The constraints are gradually relaxed to eliminate the conflicts with the original Boolean SAT formula until a solution is found. The experiments demonstrate that RelaxSAT is able to handle very large instances which cannot be solved by existing MIN-ONE algorithms; furthermore, very tight bounds on the solution were obtained with one to two orders of magnitude speedup.
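
To make the problem statement concrete, the sketch below solves MIN-ONE SAT exhaustively: among all satisfying assignments it returns one with the fewest variables set to true. It is an exponential-time reference for tiny instances, not RelaxSAT; the function name and the DIMACS-style clause encoding are assumptions for illustration.

```python
from itertools import combinations


def min_one_sat(num_vars, clauses):
    """Exhaustive MIN-ONE SAT for small instances.

    clauses: list of clauses, each a list of non-zero integers where +v means
    variable v is true and -v means variable v is false (variables 1..num_vars).
    Returns the set of variables assigned true in a satisfying assignment with
    the minimum number of ones, or None if the formula is unsatisfiable.
    """
    # Try assignments in order of increasing number of true variables,
    # so the first satisfying assignment found is minimal.
    for k in range(num_vars + 1):
        for ones in combinations(range(1, num_vars + 1), k):
            true_vars = set(ones)
            if all(any((lit > 0) == (abs(lit) in true_vars) for lit in clause)
                   for clause in clauses):
                return true_vars
    return None
```

For the formula (x1 or x2) and (not x1 or x3) and (x2 or x3), `min_one_sat(3, [[1, 2], [-1, 3], [2, 3]])` returns `{2}`: setting only x2 to true satisfies all three clauses, and no assignment with zero ones does.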

Deep Submicron Interconnect Timing Model with Quadratic Random Variable Analysis [p. 1091]
J.-K. Zeng and C.-P. Chen

Shrinking feature sizes and process variations are of increasing concern in modern technology. It is urgent that we develop statistical interconnect timing models which are harmonious with the current trend in statistical timing analysis flow. Although statistical model order reduction techniques have been explored, the statistical interconnect timing model has not yet been fully analyzed. In this work, we develop a novel algorithm and its corresponding analysis for the statistical interconnect timing model, using second-order statistical variations to model the non-Gaussian distribution effects. As this model is fully congruous with current statistical static timing analysis with the canonical model and does not require any Monte Carlo simulation analysis, performance is greatly improved. Experimental results show that the proposed closed-form quadratic interconnect timing model is within 0.0046% error of the corresponding Monte Carlo simulation.

An Efficient Algorithm for Free Resources Management on the FPGA [p. 1095]
Y. Lu, T. Marconi, G. Gaydadjiev and K. Bertels

Finding the available empty space for arriving tasks on FPGAs with runtime partial reconfiguration capabilities is the most time-consuming phase in on-line placement algorithms. Naturally, this phase has the highest impact on the overall system performance. In this paper, we present a new algorithm which is used to find the complete set of maximum free rectangles on the FPGA at runtime. During scanning, our algorithm relies on dynamic information about the edges of all already placed tasks. Simulation results show that our algorithm has 1.5x to 5x speedup compared to state-of-the-art algorithms aiming at maximum free rectangles. In addition, our proposal requires at least 4.4x less scanning load.
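
As an illustration of what "the complete set of maximum free rectangles" means, here is a brute-force sketch on a small occupancy grid. It is not the edge-based scanning algorithm of the paper; the grid representation and the function name are assumptions. A free rectangle is reported when it cannot be grown by one row or column in any direction.

```python
def maximal_free_rectangles(occupied):
    """occupied[r][c] is True when the cell is used by an already placed task.

    Returns the set of (top, left, bottom, right) tuples, bounds inclusive,
    of all free rectangles that are not contained in a larger free rectangle.
    """
    rows, cols = len(occupied), len(occupied[0])

    def is_free(top, left, bottom, right):
        # Rectangles that leave the device area are never free.
        if top < 0 or left < 0 or bottom >= rows or right >= cols:
            return False
        return not any(
            occupied[r][c]
            for r in range(top, bottom + 1)
            for c in range(left, right + 1)
        )

    result = set()
    for top in range(rows):
        for left in range(cols):
            for bottom in range(top, rows):
                for right in range(left, cols):
                    if not is_free(top, left, bottom, right):
                        continue
                    # Maximal means no one-step extension stays free.
                    extendable = (
                        is_free(top - 1, left, bottom, right)
                        or is_free(top, left - 1, bottom, right)
                        or is_free(top, left, bottom + 1, right)
                        or is_free(top, left, bottom, right + 1)
                    )
                    if not extendable:
                        result.add((top, left, bottom, right))
    return result


# A 3x4 device with one 2x2 task placed in the top-left corner.
grid = [
    [True, True, False, False],
    [True, True, False, False],
    [False, False, False, False],
]
free = maximal_free_rectangles(grid)
```

The exhaustive enumeration is far too slow for on-line placement, which is precisely the cost the abstract's algorithm aims to reduce.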

Performance-Constrained Different Cell Count Minimization for Continuously-Sized Circuits [p. 1099]
H. Yoshida and M. Fujita

A continuously-sized circuit resulting from transistor sizing consists of gates with a large variety of sizes. In this paper, we first provide a formal formulation of the performance-constrained different cell count minimization problem, and then propose an effective hill-climbing heuristic which iteratively minimizes the number of cells under performance constraints such as area, delay and power. To the best of our knowledge, this is the first attempt to address the different cell count minimization problem.

Test Scheduling for Wafer-Level Test-During-Burn-In of Core-Based SoCs [p. 1103]
S. Bahukudumbi, K. Chakrabarty and R. Kacprowicz

Wafer-level test during burn-in (WLTBI) has recently emerged as a promising technique to reduce test and burn-in costs in semiconductor manufacturing. However, the testing of multiple cores of a system-on-chip (SoC) in parallel during WLTBI leads to constantly varying device power over the duration of the test. This power variation adversely affects predictions of temperature and the time required for burn-in. We present a test-scheduling technique for WLTBI of core-based SoCs, where the primary objective is to minimize the variation in power consumption during test. A secondary objective is to minimize the test application time. Simulation results are presented for two ITC'02 SoC benchmarks, and the proposed technique is compared with two baseline methods.

CARbridge, Reduction of System Complexity by Standardization of the System-Basis-Chips for Automotive Applications [p. 1107]
P. Scheer, E. Schmidt and S. Burges

Semiconductor manufacturers continue to integrate functionality into systems-on-chip. The current focus in the automotive area is on system basis chips (SBCs). In this context, system basis chips comprise all components surrounding embedded microcontrollers, such as transceivers, watchdogs, voltage regulators, sensor interfaces, switches and diagnosis functions. Because no standard exists, implementations differ and acceptance in the development community is lacking. The potential evolution of the combined CPU+SBC system also does not take place, because no common target exists. Major car manufacturers are therefore going to introduce a new standard: CARbridge.


9.1.1: Synthesis of Dependable Embedded Systems (Dependable Embedded Systems Day)

Organizers: N. Suri, TU Darmstadt, DE; C. Fetzer, TU Dresden, DE
Moderator: C. Fetzer, TU Dresden, DE
Specification and Design Considerations for Reliable Embedded Systems [p. 1111]
A. Israr and S. Huss

The objective of this paper is to introduce a novel representation as a means to consider both permanent and temporal errors in order to increase the overall reliability of an embedded system. The deployment of embedded systems in safety-critical applications, e.g. in the automotive domain, demands that the fundamental set of design criteria consisting of functionality, timeliness, and production costs be extended to consider reliability as an optimization criterion. Thus reliability engineering becomes part of the overall design flow for embedded systems. The proposed approach is based on the introduction of Permanent/Transient error Decision Diagrams and on dedicated algorithms for the generation of system implementation sets which feature maximum reliability at minimal costs in terms of redundant resources. The proposed approach is demonstrated for a control system taken from the automotive domain.

Synthesis of Fault-Tolerant Embedded Systems [p. 1117]
P. Eles, V. Izosimov, P. Pop and Z. Peng

This work addresses the issue of design optimization for fault-tolerant hard real-time systems. In particular, our focus is on the handling of transient faults using both checkpointing with rollback recovery and active replication. Fault tolerant schedules are generated based on a conditional process graph representation. The formulated system synthesis approaches decide the assignment of fault-tolerance policies to processes, the optimal placement of checkpoints and the mapping of processes to processors, such that multiple transient faults are tolerated, transparency requirements are considered, and the timing constraints of the application are satisfied.


9.1.2: LUNCH TIME KEYNOTE - (Dependable Embedded Systems Day)

Organizer/Moderator: N. Suri, TU Darmstadt, DE
Reliable Services in an Imperfect World [p. 1123]
H. Kopetz

With the ongoing trends of hardware complexity (increasing device density, shrinking geometries, lower switching thresholds, etc.), hardware increasingly exhibits transient faults. Software is not perfect either, and its increasing complexity results in Heisenbugs. Consequently, it becomes a complex technological challenge to build dependable embedded systems that can accommodate and mitigate these hardware and software transients such that the user-perceived services are not seriously impacted.


9.2: HOT TOPIC - The Memory Challenge in NoC Based Systems

Organizers: A. Hemani, Royal Institute of Technology, Stockholm SE; A. Jantsch, Royal Institute of Technology, Stockholm SE
Moderator: A. Hemani, Royal Institute of Technology, Stockholm SE
Video Processing Requirements on SoC Infrastructures [p. 1124]
P. van der Wolf and T. Henriksson

Applications from the embedded consumer domain put challenging requirements on SoC infrastructures, i.e. interconnect and memory. Specifically, video applications demand large storage capacity and high bandwidth while data accesses can be irregular. The SoC architectures used for implementing these applications typically contain a heterogeneous collection of processing elements and use a single interface to off-chip DRAM in order to provide the required storage capacity at a low cost. Proper integration of interconnect and memory architecture is required to achieve the required bandwidths and latencies for accessing memory. The application requirements as well as the characteristics and constraints for accessing memory are key inputs for NoC design. Future memory technologies may cause a paradigm shift by offering high-bandwidth memory access, possibly via multiple memory interfaces.

Memory Technology for Extended Large-Scale Integration in Future Electronics Applications [p. 1126]
D. Pamunuwa

Extending 2-D planar topologies in integrated circuits (ICs) to a 3-D implementation has the obvious benefits of reducing the overall footprint and average interconnection length, with associated improvements in cost, and delay and energy consumption, while also providing an opportunity to integrate disparate technologies. Such advances are very much technology driven, and early research into 3-D integration has now crystallised into commercially viable options that are being pursued by many companies. Being able to position memory in closer proximity to processing elements in a NoC architecture as afforded by a 3-D physical architecture has the potential to improve the memory bandwidth and mitigate the general nature of delay constrained performance in IC design. Understanding the nature of the opportunities and constraints provided in such a 3-D physical architecture is crucial in realising the true benefits of 3-D integration in future applications.

Memory-aware NoC Exploration and Design [p. 1128]
N. Dutt

In the past decade, tremendous progress has been made in NoC research, spanning architectures, protocols and tools. In addition to a large number of academic and research projects, we are now seeing several commercial realizations of NoC-based chip designs. With chip capacities going well beyond the billion transistor mark, on the one hand large amounts of the die are occupied by memory resources, and on the other hand many complex applications being mapped to these chips are also memory-intensive. In such instances, memories dominate all the axes of traditional design constraints, including, but not limited to, performance, area (cost), and power/energy. Furthermore, the move towards sub-nanometer technologies elevates another critical design consideration: process variability and thermal sensitivity, which in turn critically affect the reliability of memories as well. All of these trends make the case for a memory-aware NoC design methodology.


9.3: Timing Issues in Logic Synthesis

Moderators: M. Fujita, Tokyo U, JP; T. Shiple, Synopsys, FR
Incremental Criticality and Yield Gradients [p. 1130]
J. Xiong, V. Zolotov and C. Visweswariah

Criticality and yield gradients are two crucial diagnostic metrics obtained from Statistical Static Timing Analysis (SSTA). They provide valuable information to guide timing optimization and timing-driven physical synthesis. Existing work in the literature, however, computes both metrics in a non-incremental manner, i.e., after one or more changes are made in a previously-timed circuit, both metrics need to be recomputed from scratch, which is obviously undesirable for optimizing large circuits. The major contribution of this paper is to propose two novel techniques to compute both criticality and yield gradients efficiently and incrementally. In addition, while node and edge criticalities are addressed in the literature, this paper for the first time describes a technique to compute path criticalities. To further improve algorithmic efficiency, this paper also proposes a novel technique to update "chip slack" incrementally. Numerical results show our methods to be over two orders of magnitude faster than previous work.

Latch Modeling for Statistical Timing Analysis [p. 1136]
S.X. Shi, A. Ramalingam, D. Wang and D.Z. Pan

Latch based circuits are widely adopted in high performance circuits. But there is a lack of accurate latch models for doing timing analysis. In this paper, we propose a new latch delay model in the context of SSTA based on a new perspective of latch timing. The proposed latch model also takes into account the external timing variations such as data slew. The new latch model is integrated into SSTA by considering the timing analysis of both the combinational logic network and the clock distribution network simultaneously. The experimental results show that ignoring accurate latch modeling may lead to large errors (e.g., 50% at PDF peak).

Conditional Partial Order Graphs and Dynamically Reconfigurable Control Synthesis [p. 1142]
A. Mokhov and A. Yakovlev

The paper introduces a new formal model for specifying control paths in the context of asynchronous system design. The model, called Conditional Partial Order Graph (CPOG), is capable of capturing concurrency and choice in a system's behaviour in a compact and efficient way. A problem of CPOG synthesis is formulated and solved; various CPOG optimisation techniques are presented. The introduced model can be used for the specification of system behaviour and for synthesis of area-efficient dynamically reconfigurable controllers. The synthesis of a controller is based on a novel generic architecture, called Transition Sequence Encoder (TSE). The synthesized controllers are speed independent and thus very robust to parametric variations. The ideas presented in the paper can be applied for CPU control synthesis as well as for synthesis of different kinds of event-coordination circuits often used in data coding and communication in digital systems.


9.4: Secured Systems

Moderators: O. Deprez, Texas Instruments, FR; J. Quevremont, Thales, FR
Efficient Software Architecture for IPSec Acceleration Using a Programmable Security Processor [p. 1148]
J. Thoguluva, A. Raghunathan and S.T. Chakradhar

Cryptographic accelerators and security processors are often used in embedded systems in order to enable enhanced security without significantly impacting performance or power consumption. However, realizing the performance promised by them requires the design of efficient software architectures for crypto offloading (offloading cryptographic operations from a host processor). In this paper, we describe an efficient software architecture for IPSec crypto offloading on a state-of-the-art mobile application processor system-on-chip (SoC) that includes a programmable security processor. We consider both user-space and kernel-space implementations of IPSec, compare their performance, and identify factors that limit the efficiency of crypto offloading. We describe two optimizations, called protocol-level crypto offloading and adaptive crypto offloading, which further improve the performance of IPSec by (i) offloading higher granularity computations to reduce the crypto offloading overheads, and (ii) using crypto offloading judiciously based on the trade-off between the savings in processing cycles vs. the overhead of communication with the security processor. We measure the performance of our implementation of IPSec crypto offloading using a commercial network protocol stack on the mobile application processor SoC, under a wide range of workloads. Our results indicate that efficient crypto offloading can result in application-level improvements of up to 10.6X in data rate and up to 5X in latency, enabling IPSec to be used for emerging high-bandwidth and interactive mobile applications.

Operating System Controlled Processor-Memory Bus Encryption [p. 1154]
X. Chen, R.P. Dick and A. Choudhary

Unencrypted data appearing on the processor-memory bus can result in security violations, e.g., allowing attackers to gather keys to financial accounts and personal data. Although on-chip bus encryption hardware can solve this problem, it requires hardware redesign or increases processor cost. Application redesign to prevent sensitive data from appearing on the processor-memory bus is extremely difficult. We propose and evaluate a processor-memory bus encryption technique for embedded systems that requires no changes to applications or hardware. This technique exploits cache locking or scratchpad memory, features present in many embedded processors, permitting the operating system (OS) virtual memory infrastructure to automatically encrypt data belonging to protected processes as they are written to off-chip memory. Pages belonging to unprotected processes are stored unencrypted to prevent performance and energy consumption penalties. We evaluate the proposed bus encryption technique using full system simulation. Experimental results indicate that it is possible to prevent the working data sets of processes from appearing on the processor-memory bus in plaintext, without using dedicated hardware and without changing applications. The OS based technique results in 1.37x slowdown for protected processes for processors with 512KB of L2 cache and 1.78x slowdown for processors with 256KB of L2 cache. There are negligible performance penalties for unprotected processes.

An Efficient FPGA Implementation of Principle Component Analysis Based Network Intrusion Detection System [p. 1160]
A. Das, S. Misra, S. Joshi, J. Zambreno, G. Memik and A. Choudhary

Modern Network Intrusion Detection Systems (NIDSs) use anomaly detection to capture malicious attacks. Since such connections are described by a large set of dimensions, processing these huge amounts of network data becomes extremely slow. To solve this time-efficiency problem, statistical methods like Principal Component Analysis (PCA) can be used to reduce the dimensionality of the network data. In this paper, we design and implement an efficient FPGA architecture for Principal Component Analysis to be used in NIDSs. Moreover, using representative network intrusion traces, we show that our architecture correctly classifies attacks with detection rates exceeding 99.9% and false alarm rates as low as 1.95%. Our implementation on a Xilinx Virtex-II Pro FPGA platform provides a core throughput of up to 24.72 Gbps, clocking at a frequency of 96.56 MHz.


9.5: Test Generation for New Technologies

Moderators: J. Teixeira, INESC-ID, PT; H. Obermeir, Infineon, DE
A Bridging Fault Model Where Undetectable Faults Imply Logic Redundancy [p. 1166]
I. Pomeranz and S.M. Reddy

We define a robust fault model as a model where the existence of an undetectable fault implies the existence of logic redundancy, or more generally, a suboptimality in the synthesis of the circuit. The stuck-at fault model is robust, but other fault models such as certain bridging fault models are not. A robust fault model provides a mechanism to synthesize circuits in which all the target faults are detectable and 100% fault coverage is achievable. The ability to achieve 100% fault coverage, or understand why it is not achievable, is important since the requirement to achieve high test quality translates into a requirement to achieve complete fault coverage for target faults, regardless of the metrics used to measure test quality. We discuss a robust bridging fault model and its use as part of a test generation process for a non-robust bridging fault model (a non-robust bridging fault model may have to be used in order to capture the behavior of bridging defects). We also present experimental results related to the robust bridging fault model.

Layout-Aware, IR-Drop Tolerant Transition Fault Pattern Generation [p. 1172]
J. Lee, S. Narayan, M. Kapralos and M. Tehranipoor

Market and customer demands have continued to push the limits of CMOS performance. At-speed test has become a common method to ensure these high performance chips are being shipped to the customers fault-free. However, at-speed tests have been known to create higher-than-average switching activity, which normally is not accounted for in the design of the power supply network. This potentially creates conditions for additional delay in the chip; causing it to fail during test. In this paper, we propose a pattern compaction technique that considers the layout and gate distribution when generating transition delay fault patterns. The technique focuses on evenly distributing switching activity generated by the patterns across the layout rather than allowing high switching activity to occur in a small area in the chip that could occur with conventional delay fault pattern generation. Due to the relationship between switching activity and IR-drop, the reduction of switching will prevent large IR-drop in high demand regions while still allowing a suitable amount of switching to occur elsewhere on the chip to prevent fault coverage loss. This even distribution of switching on the chip will also result in avoiding hot-spots.

Multi-Vector Tests: A Path to Perfect Error-Rate Testing [p. 1178]
S. Shahidi and S. Gupta

The importance of testing approaches that exploit error tolerance to improve yield has previously been established. Error rate, defined as the percentage of vectors for which the value at a circuit's output deviates from the corresponding error-free value, has been identified as a key metric for severity. In error-rate testing every chip that has an error rate greater than or equal to a threshold specified by the application is unacceptable for the application and discarded; all other chips are acceptable. The objective of error-rate testing is to reject every unacceptable chip while accepting all (or a maximum number) of the acceptable chips. We previously showed that it is not always possible to generate a test set that detects all unacceptable faults, i.e., faults that cause an error rate greater than or equal to the threshold error rate, without detecting some of the acceptable faults, i.e., faults that cause an error rate less than the threshold. In this paper, we introduce the new notion of multi-vector testing and prove that this notion enables us to detect all unacceptable faults without detecting any of the acceptable faults. We derive an upper bound on the size of such a test for a general case. As this universal bound can be large in some cases, we use a structural approach and find much tighter upper bounds for special classes of circuits. Experiments on benchmark circuits show that the required test-sizes for arbitrary circuits are much lower than our universal bounds, and practically useful.
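
The error-rate metric defined above can be illustrated with a small exhaustive simulation. This is only a sketch of the definition and of the accept/reject rule, not of the multi-vector test generation proposed in the paper; the example circuit, the injected fault and all function names are assumptions.

```python
from itertools import product


def error_rate(good, faulty, num_inputs):
    """Percentage of input vectors for which the faulty output differs from the
    error-free output, computed by applying every possible vector."""
    vectors = list(product([0, 1], repeat=num_inputs))
    errors = sum(1 for v in vectors if good(*v) != faulty(*v))
    return 100.0 * errors / len(vectors)


def is_acceptable(rate, threshold):
    """A chip is unacceptable when its error rate is greater than or equal to the threshold."""
    return rate < threshold


def and3(a, b, c):
    """Error-free circuit: three-input AND."""
    return a & b & c


def and3_c_stuck_at_1(a, b, c):
    """Same circuit with input c stuck at 1 (hypothetical fault)."""
    return a & b & 1


rate = error_rate(and3, and3_c_stuck_at_1, 3)
```

Here the fault changes the output only for the vector a=1, b=1, c=0, so one vector out of eight is erroneous.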

iFill: An Impact-Oriented X-Filling Method for Shift- and Capture-Power Reduction in At-Speed Scan-Based Testing [p. 1184]
J. Li, Q. Xu, Y. Hu and X. Li

In scan-based tests, power consumption in both shift and capture phases may be significantly higher than that in normal mode, which threatens circuits' reliability during manufacturing test. In this paper, by analyzing the impact of X-bits on circuit switching activities, we present an X-filling technique, called iFill, that can decrease both shift- and capture-power to guarantee the reliability of scan tests. Moreover, unlike prior work on X-filling for shift-power reduction, which can only reduce shift-in power, iFill is able to decrease power consumption during both shift-in and shift-out. Experimental results on ISCAS'89 benchmark circuits show the effectiveness of the proposed technique.


9.6: Memory-Centric Code Optimisation

Moderators: C. Haubelt, Erlangen-Nuremberg U, DE; R. Leupers, RWTH Aachen U, DE
Hiding Cache Miss Penalty Using Priority-based Execution for Embedded Processors [p. 1190]
S. Park, A. Shrivastava and Y. Paek

The contribution of memory latency to execution time continues to increase, and latency hiding mechanisms become ever more important for efficient processor design. While high-end processors can use elaborate techniques like multiple issue, out-of-order execution, speculative execution, value prediction etc. to tolerate high memory latencies, they are often not viable solutions for embedded processors, due to significant area, power and chip complexity overheads. This paper proposes a hardware-software cooperative approach, called priority-based execution, to hide cache miss penalty for embedded processors. The compiler classifies the instructions into low-priority and high-priority instructions. The processor executes the high-priority instructions, but delays the execution of low-priority instructions. They are executed on a cache miss to hide the cache miss penalty. We empirically evaluate our proposal on the Intel XScale compiler and microarchitecture. Experimental results on benchmarks from Multimedia, MediaBench, MiBench, and SPEC2000 demonstrate an average 17% performance improvement, hiding 75% of the cache miss penalty.

Instruction Cache Energy Saving Through Compiler Way-Placement [p. 1196]
T.M. Jones, S. Bartolini, B. De Bus, J. Cavazos and M.F.P. O'Boyle

Fetching instructions from a set-associative cache in an embedded processor can consume a large amount of energy due to the tag checks performed. Recent proposals to address this issue involve predicting or memoizing the correct way to access. However, they also require significant hardware storage which negates much of the energy saving. This paper proposes way-placement to save instruction cache energy. The compiler places the most frequently executed instructions at the start of the binary and at runtime these are mapped to explicit ways within the cache. We compare with a state-of-the-art hardware technique and show that our scheme saves almost 50% of the instruction cache energy compared to 32% for the hardware approach. We report results on a variety of cache sizes and associativities, achieving 59% instruction cache energy savings and an ED product of 0.80 in the best configuration with negligible hardware overhead and no ISA changes.

Effective Loop Partitioning and Scheduling under Memory and Register Dual Constraints [p. 1202]
C.J. Xue, E.H.-M. Sha, Z. Shao and M. Qiu

Loops are the most important sections for embedded applications. To achieve high performance, two loop transformation techniques are often applied, namely loop pipelining and loop partitioning. Loop pipelining is an effective approach to increase parallelism and reduce schedule length. Loop partitioning with prefetching increases data locality and hides memory latency. However, loop pipelining increases register pressure and loop partitioning increases local memory requirement. As most embedded systems have a limited number of registers and limited memory, these two techniques cannot be applied effectively without careful study. In this paper, we propose an effective scheduling framework, Register and Memory Sensitive Partitioning (RMSP), to minimize average schedule length per iteration under register and memory dual constraints for parallel embedded systems. Experiments show that RMSP reduces schedule length by 14.1% on average compared to previous methods applied directly.


9.7: Acceleration of Reconfigurable Applications

Moderators: J. Becker, Karlsruhe Inst. of Technology - KIT, DE; K. Bertels, TU Delft, NL
Transparent Reconfigurable Acceleration for Heterogeneous Embedded Applications [p. 1208]
A.C.S. Beck, M.B. Rutzig, G. Gaydadjiev and L. Carro

Embedded systems are becoming increasingly complex. Besides the additional processing capabilities, they are characterized by a high diversity of computational models coexisting in a single device. Although reconfigurable architectures have already been shown to be a potential solution for such systems, they only deliver significant speedups for very specific dataflow-oriented kernels. Furthermore, reconfigurable fabric is still held back by the need for special tools and compilers, clearly not sustaining backward software compatibility. In this paper, we propose a new technique to optimize both dataflow and control-flow oriented code in a totally transparent process, without the need for any modification of the source or binary code. For that, we have developed a Binary Translation algorithm implemented in hardware, which works in parallel with a MIPS processor. The proposed mechanism is responsible for transforming sequences of instructions at runtime to be executed on a dynamic coarse-grain reconfigurable array, supporting speculative execution. Executing the MIBench suite, we show performance improvements of up to 2.5 times, while reducing the required energy by 1.7 times, using trivial hardware resources.

Automatic Selection of Application-Specific Reconfigurable Processor Extensions [p. 1214]
C. Wolinski and K. Kuchcinski

This paper presents a new method for automatic selection of application-specific processor extensions and shows how applications are scheduled on these new reconfigurable architectures. The extensions are implemented as specialized sequential or parallel instructions. They correspond to the most frequently occurring computational patterns identified, or to other interesting patterns, and are finally selected during mapping and scheduling. Our methods can handle both time-constrained and resource-constrained scheduling. Experimental results show that the presented method provides high coverage of application graphs with a small number of patterns and ensures high application execution speed-up for both sequential and parallel application execution with processor extensions implementing the selected patterns.

An Optimized Message Passing Framework for Parallel Implementation of Signal Processing Applications [p. 1220]
S. Saha, J. Schlessman, S. Puthenpurayil, S.S. Bhattacharyya and W. Wolf

Novel reconfigurable computing platforms enable efficient realizations of complex signal processing applications by allowing exploitation of parallelization resulting in high throughput in a cost-efficient way. However, the design of such systems poses various challenges due to the complexities posed by the applications themselves as well as the heterogeneous nature of the targeted platforms. One of the most significant challenges is communication between the various computing elements for parallel implementation. In this paper, we present a communication interface, called the signal passing interface (SPI), that attempts to overcome this challenge by integrating relevant properties of two different yet important paradigms in this context: dataflow and the message passing interface (MPI). SPI is targeted towards signal processing applications and, due to its careful specialization, is more performance-efficient for their embedded implementation. It is also easier and more intuitive to use. Earlier, a preliminary version of SPI was presented [12] which was restricted to static dataflow behavior. Here, we present a more complete version of SPI with new features to address both static and dynamic dataflow behavior, and to provide new optimization techniques. We develop a hardware description language (HDL) realization of the SPI library, and demonstrate its functionality on the Xilinx Virtex-4 FPGA. Details of the HDL-based SPI library along with experiments with two signal processing applications on the FPGA are also presented.


10.1: Dependability Aspects (Dependable Embedded Systems Day)

Organizers: N. Suri, TU Darmstadt, DE; C. Fetzer, TU Dresden, DE
Moderator: C. Fetzer, TU Dresden, DE
Dependability for High-Tech Systems: An Industry-as-Laboratory Approach [p. 1226]
E. Brinksma and J. Hooman

The dependability of high-volume embedded systems, such as consumer electronic devices, is threatened by a combination of quickly increasing complexity, decreasing time-to-market, and strong cost constraints. This poses challenging research questions that are investigated in the Trader project, following the industry-as-lab approach. We present the main vision of this project, which is based on a model-based control paradigm, and the current status of the project results.


10.2: Application Mapping onto NoCs and Flow Control

Moderators: D. Atienza, DACYA/Madrid Complutense U, ES; T. Basten, TU Eindhoven, NL
User-Aware Dynamic Resource Allocation in Networks-on-Chip [p. 1232]
C.-L. Chou and R. Marculescu

In this paper, we propose a run-time strategy for allocating the application tasks to platform resources in homogeneous Networks-on-Chip (NoCs). As a novel contribution, we incorporate the user behavior information in the resource allocation process; this allows the system to better respond to real-time changes and adapt dynamically to user needs. Several algorithms are then proposed for solving the task allocation problem, while minimizing the communication energy consumption and network contention. If user behavior is taken into consideration, we observe about 60% communication energy savings (with negligible energy and run-time overhead) compared to an arbitrary task allocation strategy.

Minimizing Virtual Channel Buffer for Routers in On-Chip Communication Architectures [p. 1238]
M.A. Al Faruque and J. Henkel

We present a novel methodology for design space exploration using a two-step scheme to optimize the number of virtual channel buffers (buffers take the premier share of the router in a NoC [10]) used to implement logical channels multiplexed across the physical channel in a router output port for QoS-supported on-chip communication. In the first step, the number of virtual channels is minimized during the mapping of tasks to the NoC at the design time of a System on Chip (SoC), for which we use a swarm intelligence-based Ant Colony Optimization (ACO) algorithm. In the second step, a probabilistic approach based on the traffic model of the application is used to further minimize the number of virtual channels. We achieve on average 90.2% reduction in the number of virtual channels compared to a fixed state-of-the-art (i.e. QNoC [1]) allocation for the E3S embedded application benchmark suite. The reduction depends on the designer and the QoS parameter, and on the specific application-driven traffic model. We demonstrate our design space exploration by means of a complete robot application and also extend our exploration by evaluating the E3S embedded application benchmark suite.

An Open-Loop Flow Control Scheme Based on the Accurate Global Information of On-Chip Communication [p. 1244]
W.-C. Kwon, S.-M. Hong, S. Yoo, B. Min, K.-M. Choi, S.-K. Eo

3D stacked memory is being adopted as a promising solution to offer high bandwidth and low latency in memory access. Compared with on-chip network design for conventional off-chip memory, it poses a new problem of minimizing communication conflicts, since multiple concurrent high-bandwidth data transfers will flow through the on-chip network. In order to tackle this problem, we propose applying an open-loop flow control scheme based on the accurate global information (destination and status) of on-chip communication. The proposed open-loop flow control scheme exploits the information and selectively buffers and arbitrates data transfers to remove conflicts at destinations in a preventive manner. As an implementation of the presented scheme, we present on-chip buffers called Buf3D's that share the global information with each other to perform the selective buffering and arbitration of data transfers. Experiments with synthetic test cases and an industrial-strength DTV design show that the proposed method improves aggregate memory bandwidth significantly (average 19.0%~25.8% in the synthetic cases and up to 18.4% in the DTV case) with a small area overhead (15.2% in the DTV case) of on-chip network.


10.3: Arithmetic and Logic Processing

Moderators: C. Wolinski, Rennes 1 U, FR; H. Yoshida, Tokyo U, JP
Variable Latency Speculative Adder: A New Paradigm for Arithmetic Circuit Design [p. 1250]
A.K. Verma, P. Brisk and P. Ienne

Adders are one of the key components in arithmetic circuits. Enhancing their performance can significantly improve the quality of arithmetic designs. This is the reason why the theoretical lower bounds on the delay and area of an adder have been analysed, and circuits with performance close to these bounds have been designed. In this paper, we present a novel adder design that is exponentially faster than traditional adders; however, it produces incorrect results, deterministically, for a very small fraction of input combinations. We have also constructed a reliable version of this adder that can detect and correct mistakes when they occur. This creates the possibility of a variable-latency adder that produces a correct result very fast with extremely high probability; however, in some rare cases when an error is detected, the correction term must be applied and the correct result is produced after some time. Since errors occur with extremely low probability, this new type of adder is significantly faster than state-of-the-art adders when the overall latency is averaged over many additions.
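The speculation idea in the abstract above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the authors' circuit: it assumes each sum bit looks at a fixed window of lower bits to guess its carry-in, and that a possible error is flagged whenever a run of `window` consecutive propagate positions exists. The names `speculative_add`, `variable_latency_add`, `width` and `window`, and the cycle counts, are hypothetical.

```python
def speculative_add(a, b, width=16, window=4):
    """Return (speculative_sum, may_be_wrong) for two unsigned integers.

    Assumption: the carry into bit i is guessed from only the `window`
    bits below it, with a carry-in of 0 at the bottom of that window.
    """
    result = 0
    for i in range(width):
        lo = max(0, i - window)
        seg_mask = (1 << (i - lo)) - 1
        seg_a = (a >> lo) & seg_mask
        seg_b = (b >> lo) & seg_mask
        carry = (seg_a + seg_b) >> (i - lo)   # carry out of the window
        bit = ((a >> i) ^ (b >> i) ^ carry) & 1
        result |= bit << i

    # Conservative detector: a guess can only be wrong when a carry has to
    # travel across `window` consecutive propagate positions (a_i XOR b_i = 1).
    run = (a ^ b) & ((1 << width) - 1)
    for _ in range(window - 1):
        run &= run >> 1
    return result, run != 0


def variable_latency_add(a, b, width=16, window=4):
    """Return (exact_sum_mod_2^width, cycles); cycle counts are illustrative."""
    guess, may_be_wrong = speculative_add(a, b, width, window)
    if not may_be_wrong:
        return guess, 1                       # fast path: guess is exact
    return (a + b) & ((1 << width) - 1), 2    # slow path: corrected result
```

The detector in this model is conservative: it may raise the flag when the guess happens to be correct, but a cleared flag implies an exact result.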

Improving Synthesis of Compressor Trees on FPGAs via Integer Linear Programming [p. 1256]
H. Parandeh-Afshar, P. Brisk and P. Ienne

Multi-input addition is an important operation for many DSP and video processing applications. On FPGAs, multi-input addition has traditionally been implemented using trees of carry-propagate adders. This approach has been used because the traditional lookup table (LUT) structure of FPGAs is not amenable to compressor trees, which are used to implement multi-input addition and parallel multiplication in ASIC technology. In prior work, we developed a greedy heuristic method to map compressor trees onto the general logic of an FPGA using a component called a generalized parallel counter (GPC). This technique reduced the combinational delay of our circuits, when synthesized onto Altera Stratix-II FPGAs, by 27% on average; however, the area increased by 11% on average. To further reduce the delay and limit the increase in area, we have developed a new solution to the mapping problem based on integer linear programming. This new approach reduced the delay of the compressor tree by 32% on average and reduced the area by 3% compared to an adder tree.

An Adaptable FPGA-Based System for Regular Expression Matching [p. 1262]
I. Bonesana, M. Paolieri and M.D. Santambrogio

In many applications, string pattern matching is one of the most intensive tasks in terms of computation time and memory accesses. Network Intrusion Detection Systems and DNA Sequence Matching are two examples. Since software solutions are not able to satisfy the performance requirements, specialized hardware architectures are required. In this paper we propose a complete framework for regular expression matching, covering both its architecture and its compiler. This special-purpose processor is programmed using regular expressions as the programming language. With the parallelism exploited in the design, it is possible to achieve a throughput greater than one character per clock cycle, requiring O(n) memory space. The VHDL description of the proposed architecture is fully configurable. A design space exploration to find the optimal architecture based on an area and performance cost function is presented.

Comparison of Boolean Satisfiability Encodings on FPGA Detailed Routing Problems [p. 1268]
M.N. Velev and P. Gao

We compare 12 new encodings for representing FPGA detailed routing problems as equivalent Boolean Satisfiability (SAT) problems against the only 2 previously used encodings. We also consider two symmetry-breaking heuristics. Compared to other methods for FPGA detailed routing, SAT-based approaches have the advantage that they can prove the unroutability of a global routing for a particular number of tracks per channel, and that they consider all nets simultaneously. The experiments were run on the standard MCNC benchmarks. The combination of one new encoding with a new symmetry-breaking heuristic resulted in a speedup of 3 orders of magnitude, or 1,139x, in the total execution time on the collection of benchmarks, when proving the unroutability of FPGA global routings. The maximum obtained speedup was 9,499x on an individual benchmark. On the other hand, most of the encodings had comparable and very efficient performance when finding solutions for configurations that were routable. The availability of many SAT encodings, which can each be combined with various symmetry-breaking heuristics, opens the possibility to design portfolios of parallel strategies - each a combination of a SAT encoding and a symmetry-breaking heuristic - that can be run in parallel on different cores of a multicore CPU in order to reduce the solution time, with the rest of the runs terminated as soon as one of them returns an answer. We found that a portfolio of three particular parallel strategies produced an additional speedup of more than 2x.


10.4: Security Building Blocks

Moderators: L. Fesquet, TIMA Laboratory, FR; B. Candaele, Thales, FR
Defeating Classical Hardware Countermeasures: A New Processing for Side Channel Analysis [p. 1274]
D. Real, C. Canovas, J. Clediere, M. Drissi and F. Valette

In the field of Side Channel Analysis, hardware distortions such as glitches and random frequency are classical countermeasures. A glitch influences the side channel amplitude, while a random frequency damages the signal both in time and in amplitude. To minimize the effects of these countermeasures, some trace treatments based on peak extraction or auto-correlation methods exist. However, none of them takes into account the amplitude mistake. In this paper, we show that this amplitude mistake is created not only by glitches but also by a random frequency. We then propose a reshaping process that erases these effects on side channel traces on both the time and amplitude axes. The solution reconstructs a side channel signal, avoiding the hardware countermeasures and the clock relativity consequences, which can be meaningful for Side Channel Attacks. Its efficiency is demonstrated on a Differential Power Attack performed on a DES implementation and on a Template Attack performed on an RSA implementation.

Power Balanced Gates Insensitive to Routing Capacitance Mismatch [p. 1280]
K.J. Kulikowski, V. Venkatarama, Z. Wang and A. Taubin

Cryptographic hardware is vulnerable to power analysis attacks. To resist these attacks, special balanced dual-rail gates have been developed which have equal power consumption for all valid data values and transitions. A limitation of existing designs is that they require balanced routing of the dual-rail interconnect between gates. Natural process variation and suboptimal routing tools make it practically impossible to perfectly match the capacitances of the dual-rail pair, making the balanced routing constraint difficult to satisfy. We present a general method and designs which achieve power balance in dual-rail circuits without requiring matching of gate output load capacitances or random masking. The method and design are based on a directional discharge protocol which ensures that both rails are always fully discharged and charged in each cycle.

On Analysis and Synthesis of (n,k)-Non-Linear Feedback Shift Registers [p. 1286]
E. Dubrova, M. Teslenko and H. Tenhunen

Non-Linear Feedback Shift Registers (NLFSRs) have been proposed as an alternative to Linear Feedback Shift Registers (LFSRs) for generating pseudo-random sequences for stream ciphers. In this paper, we introduce (n, k)-NLFSRs, which can be considered a generalization of the Galois type of LFSR. In an (n, k)-NLFSR, the feedback can be taken from any of the n bits, and the next state functions can be any Boolean function of up to k variables. Our motivation for considering this type of NLFSR is that their Galois configuration makes it possible to compute each next state function in parallel, thus increasing the speed of output sequence generation. Thus, for stream cipher applications where the encryption speed is important, (n, k)-NLFSRs may be a better alternative than the traditional Fibonacci ones. We derive a number of properties of (n, k)-NLFSRs. First, we demonstrate that they are capable of generating output sequences with good statistical properties which cannot be generated by the Fibonacci type of NLFSRs. Second, we show that the period of the output sequence of an (n, k)-NLFSR is not necessarily equal to the length of the largest cycle of its states. Third, we compute the period of an (n, k)-NLFSR constructed from several parallel NLFSRs whose outputs are XOR-ed and show how to maximize this period. We also present an algorithm for estimating the length of cycles of states of (n, k)-NLFSRs which uses Binary Decision Diagrams for representing the set of states and the transition relation on this set.
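To make the register model in the abstract above concrete, the following sketch simulates a register in which every bit has its own next-state function, and measures the length of the state cycle reached from a given start state by explicit enumeration. It is an illustration only: the abstract's algorithm uses Binary Decision Diagrams, which are not modelled here, and the example functions `LFSR4` and `NLFSR4` and the helper names are hypothetical choices made for this sketch.

```python
def step(state, next_state_fns):
    """Apply every next-state function to the current state "in parallel"."""
    return tuple(f(state) & 1 for f in next_state_fns)


def cycle_length(state, next_state_fns):
    """Length of the state cycle eventually entered from `state`.

    Explicit enumeration; only practical for small n.
    """
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = step(state, next_state_fns)
        t += 1
    return t - seen[state]


# Fibonacci-style 4-bit LFSR: only the last bit has a non-trivial function
# (recurrence s[t+4] = s[t] XOR s[t+1], i.e. polynomial x^4 + x + 1).
LFSR4 = [
    lambda s: s[1],
    lambda s: s[2],
    lambda s: s[3],
    lambda s: s[0] ^ s[1],
]

# Illustrative (4, 3)-NLFSR: two bits have non-linear next-state functions,
# each depending on at most k = 3 state variables.
NLFSR4 = [
    lambda s: s[1],
    lambda s: s[2] ^ (s[0] & s[3]),
    lambda s: s[3],
    lambda s: s[0] ^ (s[1] & s[2]),
]
```

Because the next-state map of a general NLFSR need not be invertible, `cycle_length` records the first time each state is visited and reports the cycle that the trajectory falls into, rather than assuming a return to the start state.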

FPGA Design for Algebraic Tori Based Public Key Cryptography [p. 1292]
J. Fan, L. Batina, K. Sakiyama and I. Verbauwhede

Algebraic torus-based cryptosystems are an alternative for Public-Key Cryptography (PKC). They maintain the security of a larger group while the actual computations are performed in a subgroup. Compared with RSA at the same security level, they allow faster exponentiation and require much less bandwidth for the transmitted data. In this work we implement a torus-based cryptosystem, the so-called CEILIDH, on a multicore platform with an FPGA. This platform consists of a Xilinx MicroBlaze core and a multicore coprocessor. The platform supports CEILIDH, RSA and ECC over prime fields. The results show that one 170-bit torus T6 exponentiation requires 20 ms, which is 5 times faster than a 1024-bit RSA implementation on the same platform.


10.5: A Smorgasbord of Test

Moderators: E.J. Marinissen, NXP Semiconductors, NL; A. Leininger, Infineon Technologies, DE
Automated Trace Signals Identification and State Restoration for Improving Observability in Post-Silicon Validation [p. 1298]
H.F. Ko and N. Nicolici

Embedded logic analysis has emerged as a powerful technique for identifying functional bugs during post-silicon validation, as it enables at-speed acquisition of data from the circuit nodes in real-time. Nonetheless, the amount of data that is observed is limited by the capacity of the on-chip trace buffers. This paper introduces an automated method for improving the utilization of the on-chip storage, by identifying a small set of trace signals from which a large number of states can be restored using a compute-efficient algorithm. This enlarged set of data can then be used to aid the search for functional bugs in the fabricated circuit.

Functional Self-Testing for Bus-Based Symmetric Multiprocessors [p. 1304]
A. Apostolakis, D. Gizopoulos, M. Psarakia and A. Paschalis

Functional, instruction-based self-testing of microprocessors has recently emerged as an effective alternative or supplement to other testing approaches, and is progressively being adopted by major microprocessor manufacturers. In this paper, we study, for the first time, the applicability of functional self-testing on bus-based symmetric multiprocessors (SMPs) and the exploitation of SMP parallelism during testing. We focus on the impact of the memory system architecture and the cache coherency mechanisms on the execution of self-test programs on the processor cores. We propose a generic self-test routine scheduling algorithm aiming at the reduction of the total test application time for the SMP by reducing both bus contention and data cache coherency invalidation. We demonstrate the proposed solutions with detailed experiments in two-core and four-core SMP benchmarks based on a RISC processor core.

Theoretical and Practical Aspects of IDDQ Settling - Impact on Measurement Timing and Quality [p. 1310]
B. Straka, H. Manhaeve, J. Brenkus and S. Kerckenaere

This paper discusses the parameters involved in making fast and reliable quiescent current (IDDQ or ISSQ) measurements, with particular attention to the test setup and the point of measurement. For that purpose, a detailed theoretical and practical study was made of the IDDQ settling behaviour as a function of measurement instrument positioning. The conclusions are that instrument positioning is a critical factor in achieving the fast, high-resolution, reliable and repeatable IDDQ measurements needed to support advanced decision making strategies and Nanotechnology IDDQ application, and that the use of add-on instrumentation offers the best prospects for reaching these goals.


10.6: HOT TOPIC - Analogue: How to Survive in the Era of Nano CMOS

Organizers: G. Gielen, KU Leuven, BE; L. Fanucci, Pisa U, IT
Moderator: L. Fanucci, Pisa U, IT
Advanced Analog Filters for Telecommunications [p. 1316]
M. De Matteis, S. D'Amico and A. Baschirotto

In this paper, advances in analog filter design for telecom transceivers are addressed. Portable devices require a strong reduction in power consumption to increase battery life. Since a considerable part of the power consumption is due to the analog baseband filters, improved and/or novel analog filter design approaches have to be developed. In this paper, some advances in this field reported in recent years are summarized. Each design (developed for different standards) exploits the standard specifications with different architectures and circuit strategies devoted to power consumption reduction. The first is for reconfigurable Bluetooth/UMTS/WLAN receivers, the second is for very-low-voltage (550mV) WLAN receivers, the third is for impulse-radio UWB receivers, while the fourth is for very low-power OFDM-UWB receivers.

Emerging Yield and Reliability Challenges in Nanometer CMOS Technologies [p. 1322]
G. Gielen, P. DeWit, E. Maricau, J. Loeckx, J. Martín-Martínez, B. Kaczer, G. Groeseneken, R. Rodríguez and M. Nafría

With further scaling of nanometer CMOS technologies, yield and reliability become an increasing challenge. This paper reviews the most important phenomena affecting yield and reliability. For each effect, the basic physical mechanisms causing the effect and its impact on transistor parameters are described. Possible solutions to cope with these effects at the design level are discussed as well.

Novel Front-End Circuit Architectures for Integrated Bio-Electronic Interfaces [p. 1328]
C. Guiducci, A. Schmid, F.K. Gürkaynak and Y. Leblebici

The prospective use of upcoming nanometer CMOS technology nodes (65nm, 45nm, and beyond) in bioelectronic interfaces is raising a number of important issues concerning circuit architectures and design. In particular, the advantages of scaling and higher density integration must be balanced against the requirements of low noise design, uniform power density and surface temperature distribution, better component matching, and immunity to parameter variations. Dealing with these constraints also requires more innovative approaches towards hybrid integration technologies. In this paper, we discuss the key design issues with specific examples from DNA detection, protein detection, and neuro-electronic interfaces.


10.7: Reconfigurable Architectures and Run-Time Optimisations

Moderators: W. Luk, Imperial College London, UK; M. Huebner, Karlsruhe U (TH), DE
High-Level Modeling and Exploration of Coarse-Grained Re-Configurable Architectures [p. 1334]
A. Chattopadhyay, X. Chen, H. Ishebabi, R. Leupers, G. Ascheid and H. Meyr

The increasing complexity of today's multimedia and wireless applications is motivating the system designers to innovate continuously. With the challenge to keep various performance metrics in a tight balance while designing a complex system, an entire range of components are now being offered as choices for system building blocks. Coarse-Grained Re-configurable Architecture (CGRA), a strongly emerging class, is currently receiving due attention for offering excellent performance as well as flexibility post fabrication. Compared to the programmable and flexible microprocessors these architectures are shown to yield stronger performance, especially in case of regular and data-driven applications. A variety of system designs are proposed of late, with CGRA as one of the key building blocks. Most of the research initiatives taken in this area have resorted to a template-based approach, where the structure of the reconfigurable architecture is partially fixed with several tunable parameters. In this paper, we present a language-driven modelling and exploration framework for CGRAs. In the domain of CGRAs, this framework attempts to bring modelling ease, genericity, early exploration and path to implementation together. The modelling formalism proposed in this paper as well as the exploration capabilities are demonstrated via experiments with several algorithmic kernels.

Scalable Architecture for On-Chip Neural Network Training Using Swarm Intelligence [p. 1340]
A. Farmahini-Farahani, S.M. Fakhraie and S. Safari

This paper presents a novel architecture for on-chip neural network training using particle swarm optimization (PSO). PSO is an evolutionary optimization algorithm with a growing field of applications which has recently been used to train neural networks. The architecture exploits the PSO algorithm to evolve network weights, as well as a method called layer partitioning to implement neural networks. In the proposed method, a neural network is partitioned into groups of neurons and the groups are sequentially mapped to available functional units. Thus, the architecture is reconfigurable for training and implementing different multilayer feedforward neural networks without the need for modifying the architecture. The implementation is intended for real-time applications regarding hardware cost and speed. The results show that the proposed system provides a trade-off between resource requirements and speed.

Intelligent Merging OnLine Task Placement Algorithm for Partially Reconfigurable Systems [p. 1346]
T. Marconi, Y. Lu, K. Bertels and G. Gaydadjiev

Speed and placement quality are two very important attributes of a good online placement algorithm, because the time taken by the algorithm is considered as an overhead to the application's overall execution time. To solve this problem, we propose three techniques: Merging Only if Needed (MON), Partial Merging (PM), and Direct Combine (DC). Our IM (intelligent merging) algorithm dynamically uses these three techniques to exploit their specific advantages. IM outperforms Bazargan's algorithm as it has placement quality within 0.89% but is 1.72 times faster.

Design of A HW/SW Communication Infrastructure for A Heterogeneous Reconfigurable Processor [p. 1352]
A. Deledda, C. Mucci, A. Vitkovski, M. Kuehnle, F. Ries, M. Huebner, J. Becker, P. Bonnot, A. Grasset, P. Millet, M. Coppola, L. Pieralisi, R. Locatelli, G. Maruccia, F. Campi and T. DeMarco

Reconfigurable architectures and NoC (Network-on-Chip) have introduced new research directions for technology and flexibility issues, which have been largely investigated in the last decades. Exploiting run-time adaptivity opens a new area of research by considering dynamic reconfiguration. In this paper, we present the architecture and associated development tools of a heterogeneous reconfigurable SoC, focusing on the chosen communication infrastructure. The SoC integrates units of various sizes of reconfiguration granularity. The included NoC approach demonstrates the mentioned benefits and scalability for actual and future SoC design. On a reference CMOS090 implementation, the described interconnect system works at the system reference frequency of 200 MHz, sustaining the required run-time bandwidth on a set of reference applications, at a cost of less than 10% in area and power consumption with respect to the overall system.


IP5 Interactive Presentations

Automated Dynamic Throughput-Constrained Structural-Level Pipelining in Streaming Applications [p. 1358]
M. Muir, T. Arslan and I. Lindsay

Stream processing applications such as image signal processing demand high throughput. However, customers increasingly demand runtime flexibility in their designs, which cannot be provided by custom ASIC solutions. Currently, reconfigurable processors tend to offer insufficient throughput for widespread use in streaming applications. This paper demonstrates how structural-level pipelining techniques can be applied to rapidly dynamically reconfigurable computing architectures, in order to increase throughput. This is done by automatically inserting registers into the data path of performance critical code sections that have already been optimised into a single configuration context. A new algorithm is presented to choose the insertion point of pipeline stage registers in order to meet a specified throughput whilst minimising register resource usage. The paper then demonstrates a new approach where properties of dynamic reconfiguration can be utilised to perform the tasks of pipeline stage initialisation and flushing. The technique is demonstrated on a real-life application: the demosaic filter in a standard image signal processing pipe used in modern digital cameras, and can be seen to boost the throughput from 16MPixels/s to 51MPixels/s on an example reconfigurable processor.

Towards Trojan-Free Trusted ICs: Problem Analysis and Detection Scheme [p. 1362]
F. Wolff, C. Papachristou, S. Bhunia and R.S. Chakraborty

There have been serious concerns recently about the security of microchips from hardware trojan horse insertion during manufacturing. This issue has been raised recently due to outsourcing of the chip manufacturing processes to reduce cost. This is an important consideration especially in critical applications such as avionics, communications, military, industrial and so on. A trojan is inserted into a main circuit at manufacturing and is mostly inactive unless it is triggered by a rare value or time event; then it produces a payload error in the circuit, potentially catastrophic. Because of its nature, a trojan may not be easily detected by functional or ATPG testing. The problem of trojan detection has been addressed only recently in very few works. Our work analyzes and formulates the trojan detection problem based on a frequency analysis under rare trigger values and provides procedures to generate input trigger vectors and trojan test vectors to detect trojan effects. We also provide experimental results.

Wrapper and TAM Co-Optimization for Reuse of SoC Functional Interconnects [p. 1366]
T. Yoneda and H. Fujiwara

This paper presents a wrapper and TAM co-optimization method for reuse of SoC functional interconnects to minimize test time under area constraint. The proposed method consists of (1) an ILP formulation for wrapper and transparent TAM co-optimization, and (2) a simulated annealing based heuristic approach to reduce the computational cost of the proposed ILP model. Experimental results show the effectiveness of the proposed methods compared to the previous transparency-based TAM approaches and the conventional dedicated test bus approaches. Keywords: SoC test, wrapper, TAM, reuse of interconnect.

De Bruijn Graph as a Low Latency Scalable Architecture for Energy Efficient Massive NoCs [p. 1370]
M. Hosseinabady, M.R. Kakoee, J. Mathew and D.K. Pradhan

In this paper, we use the generalized binary de Bruijn (GBDB) graph as a scalable and efficient network topology for an on-chip communication network. Using just two-layer wiring, we propose an optimum tile-based implementation for a GBDB-based Network-on-Chip (NoC). Our experimental results show that the latency and energy consumption of the generalized de Bruijn graph are much lower than those of Mesh and Torus, the two common NoC architectures in the literature.
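The low-latency claim rests on the small diameter of de Bruijn-style topologies. As a rough illustration, the sketch below builds the successor relation commonly used to define a generalized binary de Bruijn graph, in which node u has directed links to (2u) mod N and (2u + 1) mod N, and measures hop counts by breadth-first search. Treating links as directed and ignoring the tile layout and wiring are simplifying assumptions of this sketch; the function names are hypothetical.

```python
from collections import deque


def gdb_successors(u, n_nodes):
    """Directed neighbours of node u in a generalized binary de Bruijn graph."""
    return [(2 * u + r) % n_nodes for r in (0, 1)]


def hops(src, dst, n_nodes):
    """Minimum number of links from src to dst (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in gdb_successors(u, n_nodes):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None  # unreachable; not expected for this topology


def diameter(n_nodes):
    """Largest minimum hop count over all ordered node pairs."""
    return max(hops(s, d, n_nodes)
               for s in range(n_nodes) for d in range(n_nodes))
```

After k hops a packet starting at u can be at any node of the form (2^k u + r) mod N with 0 <= r < 2^k, so every node is reachable once 2^k >= N. The hop count therefore grows logarithmically with the number of nodes, whereas the diameter of a square mesh grows with the square root of the node count.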

Adaptive Filesystem Compression for Embedded Systems [p. 1374]
L.S. Bai, H. Lekatsas and R.P. Dick

Embedded system secondary storage size is often constrained, yet storage demands are growing as a result of increasing application complexity and storage of personal data and multimedia files. Filesystem compression offers a solution. This paper formalizes the problem of automatic filesystem compression using multiple compression algorithms. The average latency of on-line file accesses is optimized under a constraint on filesystem capacity. Our solution is based on predictive control. Predicted latency implications are used to solve the file compression state selection problem using a multiple choice knapsack problem formulation. This approach is evaluated on filesystem traces and compared with other efficient heuristics. Our approach results in 34.1% reduction in file access latency compared to a straight-forward heuristic that decompresses frequently-accessed files and compresses least recently used files with more aggressive compression algorithms. It reduces file access latency by 67.7% compared to uniformly compressing files to the shallowest level required to meet storage capacity constraints.
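The multiple-choice knapsack formulation mentioned above can be sketched as follows: each file offers several compression states, each with a stored size and a predicted access-latency cost; exactly one state is chosen per file so that the total size fits the capacity and the total latency is minimal. The data layout, the integer size units, and the function name `select_states` are assumptions made for this illustration; the predictive-control part of the approach is not modelled.

```python
def select_states(files, capacity):
    """Pick one (size, latency) option per file, minimising total latency.

    files:    list of option lists; each option is (size, latency)
    capacity: maximum total size (same integer units as the sizes)
    Returns (total_latency, chosen_option_indices), or None if nothing fits.
    """
    # best[used_size] = (lowest latency found, option index chosen per file)
    best = {0: (0, [])}
    for options in files:
        nxt = {}
        for used, (latency, picks) in best.items():
            for idx, (size, cost) in enumerate(options):
                total = used + size
                if total > capacity:
                    continue
                candidate = (latency + cost, picks + [idx])
                if total not in nxt or candidate[0] < nxt[total][0]:
                    nxt[total] = candidate
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda entry: entry[0])


# Hypothetical example: option 0 is uncompressed (largest size, no latency
# penalty); later options are smaller on disk but slower to access.
example_files = [
    [(4, 0), (2, 3)],
    [(6, 0), (3, 5), (2, 9)],
]
```

With a capacity of 7 units, the only zero-penalty combination (4 + 6) does not fit, so the selection keeps the first file uncompressed and stores the second file in its middle state.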

Partially Redundant Logic Detection Using Symbolic Equivalence Checking in Reversible and Irreversible Logic Circuits [p. 1378]
D.Y. Feinstein, M.A. Thornton and D.M. Miller

This paper investigates partially redundant logic detection and gate modification coverage in both reversible and irreversible (classical) logic circuits. Our methodology is to repeatedly compare a benchmark circuit with a modified copy of itself using an equivalence checker. We have found many instances in the irreversible logic ISCAS85 benchmarks where single gate replacements were not detected, indicating no change in functionality after gate replacement. In contrast, we demonstrate that the Maslov reversible and quantum logic benchmarks exhibit very high gate modification fault coverage, in line with the expectation that reversible circuits, which implement bijective functions, have maximal information content.

TinyTimber, Reactive Objects in C for Real-Time Embedded Systems [p. 1382]
P. Lindgren, J. Eriksson, S. Aittamaa and J. Nordlander

Embedded systems often operate under hard real-time constraints. Such systems are naturally described as time-bound reactions to external events, a point of view made manifest in the high-level programming and systems modeling language Timber. In this paper we demonstrate how the Timber semantics for parallel reactive objects translates to embedded real-time programming in C. This is accomplished through the use of a minimalistic Timber Run-Time system, TinyTimber (TT). The TT kernel ensures state integrity, and performs scheduling of events based on given time-bounds in compliance with the Timber semantics. In this way, we avoid the volatile task of explicitly coding parallelism in terms of processes/threads/semaphores/monitors, and side-step the delicate task of encoding time-bounds into priorities. In this paper, the TT kernel design is presented and performance metrics are presented for a number of representative embedded platforms, ranging from small 8-bit to more potent 32-bit microcontrollers. The resulting system runs on bare metal, completely free of references to external code (even C-lib), which provides a solid basis for further analysis. In comparison to a traditional thread-based real-time operating system for embedded applications (FreeRTOS), TT has tighter timing performance and considerably lower code complexity. In conclusion, TinyTimber is a viable alternative for implementing embedded real-time applications in C today.

Dynamic Task Allocation Strategies in MPSoC for Soft Real-Time Applications [p. 1386]
E. Wenzel Brião, D. Barcelos, F. Rech Wagner

This work evaluates task allocation strategies based on bin-packing algorithms in the context of multiprocessor systems-on-chip (MPSoCs) with task migration capabilities, running soft real-time applications. The task migration model assumes that the whole code and data of the tasks are transferred from an origin node to the chosen destination node. We combine two types of algorithms to obtain better allocation results. Experimental results show that there is a trade-off between deadline misses and system energy consumption when applying bin-packing and linear clustering algorithms. In order to save energy, our system turns off idle processors and applies Dynamic Voltage Scaling to processors with slack. Depending on the algorithm selection and on the application, it is possible to obtain a reduction in deadline misses from 30% to 100% and energy consumption savings from 60% to 80%.
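As a simple illustration of bin-packing-based allocation, the sketch below applies the classic first-fit decreasing heuristic to task utilizations, treating each processor as a bin of fixed capacity. The abstract does not say which bin-packing variants were evaluated, so first-fit decreasing is an assumption here, as are the integer percentage units and the name `first_fit_decreasing`; task migration, linear clustering and voltage scaling are not modelled.

```python
def first_fit_decreasing(utilizations, capacity=100):
    """Assign task utilizations (integer percent) to processors.

    Tasks are considered from largest to smallest; each goes to the first
    processor with enough remaining capacity, and a new processor is opened
    only when none fits. Returns one list of utilizations per processor.
    """
    processors = []
    for load in sorted(utilizations, reverse=True):
        if load > capacity:
            raise ValueError("task does not fit on any processor")
        for assigned in processors:
            if sum(assigned) + load <= capacity:
                assigned.append(load)
                break
        else:
            processors.append([load])
    return processors
```

Packing tasks onto as few processors as possible leaves the remaining processors idle, which is what allows them to be switched off; the price is less slack on the loaded processors, one way in which a trade-off between energy and deadline misses can arise.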

Mixed-Signal Design Space Exploration of Time-Interleaved A/D Converters for Ultra-Wide Band Applications [p. 1390]
P. Nuzzo, C. Nani, S. Saponara, L. Fanucci and G. Van der Plas

This paper addresses system-level design of time-interleaved analog-to-digital converters (TI-ADCs) for ultra-wide band communications. Design space exploration of a TI successive approximation architecture is performed via Monte Carlo simulations, by exploiting behavioral models built bottom-up after characterizing the main ADC blocks in a 90-nm 1-V CMOS technology. Different speed/resolution scenarios are efficiently investigated and the impact of parallelism on system performance, yield and power consumption is assessed starting from the early design phases, finally enabling the selection of two candidate implementations (a 6-bit 4.6-mW and a 7-bit 8.1-mW ADC targeting 1 GS/s) that effectively trade accuracy for energy efficiency and area.


11.1: PANEL SESSION - New Directions and Challenges (Dependable Embedded Systems Day)

Organizers: N. Suri, TU Darmstadt, DE; C. Fetzer, TU Dresden, DE
Moderator: N. Suri, TU Darmstadt, DE
Dependable Embedded Systems Day Panel: Issues and Challenges in Dependable Embedded Systems [p. 1394]
Panelists: J. Abraham, S. Poledna, A. Mendelson and S. Mitra

Embedded Systems are pervasively appearing in virtually all walks of life - communication, computing, e-/m-commerce, leisure, medical, WSN, transportation, biometrics. The utility of these embedded systems and services is based, in large part, on our depending on their sustained functionality in spite of the encountered operational or malicious disruptions. As the number of transient and also permanent disruptions (given the decreasing device geometries, higher device density, lower voltage latching, faster clocks etc) is expected to increase substantially, this will not only be a key issue for the hardware community but also for the systems community in general. Solutions using a combination of hardware and software might be more effective than hardware-only or software-only solutions. Building upon the discussions on the conceptual and applied issues for design, analysis and validation of dependable embedded systems, the panel will bring together both the academic and industrial perspectives on the upcoming challenge themes. Specifically, the coverage will encompass the spectrum of device level, communication aspects and system level aspects, tackling both synergistic and across-the-board needs for the future dependable embedded "systems".


11.2: Routing and Link Design

Moderators: P. Kundu, Intel, US; S. Murali, EPFL, CH
Multicast Parallel Pipeline Routing Architecture for Network-on-Chip [p. 1396]
F.A. Samman, T. Hollstein and M. Glesner

This paper presents a flexible mesh router architecture using synchronous parallel pipeline worm-switching supporting unicast and multicast services. A very flexible mechanism to manage broadcast flows sharing the communication links in the on-chip network is proposed. The proposed mechanism guarantees that all flits in multicast packets can be accepted in their multiple destination nodes. Our Network-on-Chip (NoC) is implemented based on modular synthesizable VHDL objects. The architecture is flexible enough to design new NoC prototypes. The area overhead to update the NoC from unicast to multicast with the same routing algorithm is only about 15%.

Variation Tolerant NoC Design by Means of Self-Calibrating Links [p. 1402]
S. Medardoni, M. Lajolo and D. Bertozzi

We present the implementation and analysis of a variation tolerant version of a switch-to-switch link in a NoC. The goal is to tolerate the effects of process variations on NoC architectures using self-correcting links that automatically detect delay variations and compensate them. The correction is applied without increasing the switch-to-switch latency by substituting the output flip-flops of the sending switch with a self-correcting flip-flop followed by an adaptive voltage swing selector. Higher delay variations will result in a smaller slack in the switch-to-switch path, but the adaptive voltage swing selector could mitigate its impact on the NoC communication by increasing the voltage swing on the link, thus allowing a compensation of the delay variation. As a result, it is possible to tolerate delay variations at the cost of additional power consumption.

BARP- A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs [p. 1408]
P. Lotfi-Kamran, M. Daneshtalab, C. Lucas and Z. Navabi

A novel routing algorithm, named Balanced Adaptive Routing Protocol (BARP), is proposed for NoCs to provide adaptive routing and ensure deadlock-free and livelock-free routing at the same time. Evenly distributing the input packets of a router among all its shortest-path output ports yields a novel adaptive routing protocol that avoids congestion. It is observed that BARP can achieve better performance compared to static XY routing, odd-even routing and dynamic XY routing.

Developing Mesochronous Synchronizers to Enable 3D NoCs [p. 1414]
I. Loi, F. Angiolini and L. Benini

The Network-on-Chip (NoC) interconnection paradigm has been gaining momentum thanks to its flexibility, scalability and suitability to deep submicron technology processes. The next challenge is to use NoCs as the backbones of the upcoming generation of 3D chips, assembled by stacking multiple silicon layers. Multiple technical issues have to be tackled in this respect. One of the foremost is the unsuitability of a purely synchronous design style, as it is not straightforward to impose a strict bound on the clock skew among multiple clock trees across different layers. In this paper, we present a scheme to handle mesochronous communication in 3D NoCs and analyze (i) the circuit design, (ii) the timing properties, (iii) the requirements to support flow control across mesochronous links, and (iv) the implementation cost of such a scheme after placement and routing.


11.3: Microarchitecture Analysis and Optimisation

Moderators: T. Austin, U of Michigan, US; G. Gaydadjiev, TU Delft, NL
Memory Organization with Multi-Pattern Parallel Accesses [p. 1420]
A. Vitkovski, G. Kuzmanov and G. Gaydadjiev

We propose an interleaved memory organization supporting multi-pattern parallel accesses in two-dimensional (2D) addressing space. Our proposal targets computing systems with high memory bandwidth demands such as vector processors, multimedia accelerators, etc. We substantially extend prior research on interleaved memory organizations introducing 2D-strided accesses along with additional parameters, which define a large variety of 2D data patterns. The proposed scheme guarantees minimum memory latency and efficient bandwidth utilization for arbitrary configuration parameters of the data pattern. We provide mathematical descriptions and proofs of correctness for the proposed addressing schemes. The design complexity and the critical paths are evaluated using technology independent resource counts and confirm the scalability of the proposal. Hardware synthesis results for 90nm CMOS technology suggest that throughputs in the range between 44 and 1182 Gbit/s can be obtained at the cost of 26-212 Kgates for configurations of 2x2 32-bit up to 8x8 64-bit memory modules.
Index Terms - Conflict-free access, high bandwidth, multi-pattern access, parallel memories.

CATCH: A Mechanism for Dynamically Detecting Cache-Content-Duplication and Its Application to Instruction Caches [p. 1426]
M. Kleanthous and Y. Sazeides

Cache-Content-Duplication (CCD) occurs when there is a miss for a block in a cache and the entire content of the missed block is already in the cache in a block with a different tag. Caches aware of content-duplication can have lower miss rates by allowing only blocks with unique content to enter a cache. This work examines the potential of CCD for instruction caches. We show that CCD is a frequent phenomenon and that an idealized duplication-detection mechanism for instruction caches has the potential to increase performance of an out-of-order processor, with a 2-way, eight-instruction-per-block, 16KB instruction cache, often by more than 5% and up to 20%. This work also proposes CATCH, a hardware based mechanism for dynamically detecting CCD. Experimental results for an out-of-order processor show that a CATCH with a 2.32KB cost usually captures 60% or more of the CCD's idealized potential.
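
The definition of CCD in the abstract above can be illustrated with a minimal sketch. It assumes a small fully associative cache with LRU replacement, which is a simplification chosen for this example and not the CATCH hardware mechanism; the function name and the trace format are hypothetical.

```python
from collections import OrderedDict

def count_duplicate_misses(accesses, capacity):
    """Count misses, and misses whose content is already cached under another tag.

    accesses: iterable of (tag, content) pairs (hypothetical trace format).
    capacity: number of blocks in the assumed fully associative LRU cache.
    """
    cache = OrderedDict()  # tag -> block content, oldest entry first
    misses = 0
    duplicate_misses = 0
    for tag, content in accesses:
        if tag in cache:
            cache.move_to_end(tag)  # hit: refresh LRU position
            continue
        misses += 1
        # CCD: the missed block's entire content is present under a different tag.
        if content in cache.values():
            duplicate_misses += 1
        cache[tag] = content
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict the least recently used block
    return misses, duplicate_misses

trace = [
    (0x100, "add;ret"),
    (0x200, "add;ret"),  # miss, but identical content is cached at tag 0x100
    (0x300, "mul;ret"),
    (0x100, "add;ret"),  # hit
]
print(count_duplicate_misses(trace, capacity=4))
```

A duplication-aware cache could serve the second access from the block already held at tag 0x100 instead of allocating a new block.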

MAGELLAN: A Search and Machine Learning-Based Framework for Fast Multi-Core Design Space Exploration and Optimization [p. 1432]
S. Kahng and R. Kumar

In this paper, we treat multi-core processor design space exploration as an application-driven machine learning problem. We develop two machine learning-based techniques for efficiently exploring the processor design space. We observe that these techniques result in multi-core processors whose performance is comparable (within 1%) to a processor design that requires an exhaustive exploration of the design space. These techniques often take orders of magnitude (a factor of 3800 at the minimum) less time for coming up with these processors. The benefits are up to 13% over intelligent search techniques that have been adapted to do multi-core design space exploration. We leverage the knowledge gained in this research to develop Magellan - a framework for accelerating multi-core design space exploration and optimization. Magellan can be used to find the highest throughput processors of a given type for a given area, power, or time budget. It can be used to aid even experienced processor designers that prefer to rely on intuition by allowing fast refinements to an input design.

Process Variation Aware Issue Queue Design [p. 1438]
R. K and M. Mutyam

In sub-90nm process technology it becomes harder to control the fabrication process, which in turn causes variations between the design-time parameters and the fabricated parameters. Variations in the critical process parameters can result in significant fluctuations in the switching speed and leakage power consumption of different transistors in the same chip. In this paper, we study the impact of process variation on issue queues. Due to process variation, issue queues can take variable access latency. In order to work with nonuniform access latency issue queues, by exploiting ready operands of instructions at dispatch time, we propose a process variation aware issue queue design. Experimental results reveal that, for a 64-entry issue queue with half of the entries affected by process variation, our technique recovers most of the lost performance due to process variation and incurs a performance penalty of less than 2% with respect to the performance of issue queues without process variation.


11.4: System Implementations for Network and Cryptography

Moderators: L. Torres, LIRMM, Montpellier, FR; W. Eberle, IMEC, BE
Implementation of Parallel LFSR-Based Applications on an Adaptive DSP Featuring a Pipelined Configurable Gate Array [p. 1444]
C. Mucci, L. Vanzolini, I. Mirimin, D. Gazzola, A. Deledda, S. Goller, J. Knaeblein, A. Schneider, L. Ciccarelli and F. Campi

Linear feedback shift registers (LFSRs) are common structures in many application fields, including cryptography, digital broadcasting and communication. High-throughput requirements need highly parallel implementations, usually accomplished in state of the art system on chips (SoCs) with application specific coprocessors. Although this approach achieves the required performance, it rapidly shows lack of flexibility when those devices are proposed, as an example, for multi-standard modems or for security applications in which run-time update can provide added value. This paper shows the implementation of parallel LFSR-based applications on an embedded adaptive DSP featuring a Pipelined Configurable Gate Array (PiCoGA). With respect to standard embedded FPGAs, pipelined devices usually provide better performance, e.g. in terms of speed, but they commonly show the undeniable drawback of additional design constraints. As a test-case, we consider the implementation of the 32-bit CRC used in the Ethernet standard that achieves on the target architecture up to ~25Gbit/sec throughput, with a parallel LFSR processing 128 bits at a time, which is comparable to the performance offered by some ASIC devices.
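
As a software illustration of what "parallel LFSR" means for the Ethernet CRC-32, the sketch below compares a serial update (one bit per step) with a table-driven update (eight bits per step). It is not the PiCoGA mapping described in the paper, which processes 128 bits at a time in hardware; the function names are hypothetical, and the result is checked against Python's `zlib.crc32`, which computes the same CRC.

```python
import zlib

POLY = 0xEDB88320  # CRC-32 generator polynomial, bit-reflected form

def crc32_bitwise(data: bytes) -> int:
    """Serial LFSR: one shift-and-conditional-XOR step per input bit."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def _make_table() -> list:
    """Precompute the effect of eight consecutive LFSR steps for each byte value."""
    table = []
    for value in range(256):
        for _ in range(8):
            value = (value >> 1) ^ (POLY if value & 1 else 0)
        table.append(value)
    return table

TABLE = _make_table()

def crc32_bytewise(data: bytes) -> int:
    """Parallel form: eight LFSR steps collapsed into one table lookup per byte."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF

message = b"123456789"
assert crc32_bitwise(message) == crc32_bytewise(message) == zlib.crc32(message)
```

Both functions compute the same state update; the second folds several serial steps into one, which is the general idea behind wider parallel LFSR implementations.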

GMDS: Hardware Implementation of Novel Real Output Queuing Architecture [p. 1450]
R. Arteaga, F. Tobajas, R. Esper-Chain, V. de Armas and R. Sarmiento

In this paper, a real output queuing switch prototype implementation is presented. This implementation is based on a novel high speed multidrop backplane and a general purpose line card which includes a Virtex-II 6000 FPGA. This switch is named GMDS (Gigabit MultiDrop Switch). Its main features are the replacement of the switch matrix by the multidrop backplane (increasing system reliability), variable-length packet switching support (avoiding loss of bandwidth efficiency), a multiple output queuing structure for supporting QoS (Quality of Service), and a minimum speedup.

Front End Device for Content Networking [p. 1456]
J. Buboltz and T. Kocak

The bandwidth and speed of network connections are continually increasing. The speed increase in network technology is set to soon outpace the speed increase in CMOS technology. This asymmetrical growth is beginning to cause software applications that once worked with then-current levels of network traffic to flounder under the new high data rates. Processes that were once executed in software now have to be executed partially, if not wholly, in hardware. One such application that could benefit from hardware implementation is high layer routing. By allowing a network device to peer into higher layers of the OSI model, the device can scan for viruses, provide higher quality-of-service (QoS), and efficiently route packets. This paper proposes an architecture for a device that will utilize hardware-level string matching to distribute incoming requests for a server farm. The proposed architecture is implemented in VHDL, synthesized, and laid out on an Altera FPGA.

Power Aware Reconfigurable Multiprocessor for Elliptic Curve Cryptography [p. 1462]
M. Purnaprajna, C. Puttmann, M. Porrmann

Reconfigurable architectures are being increasingly used for their flexibility and extensive parallelism to achieve accelerations for computationally intensive applications. Although these architectures provide easy adaptability, they do so with an overhead in terms of area, power and timing, as compared to non-reconfigurable ASICs. Here, we propose a low overhead reconfigurable multiprocessor, which provides both parallelism and flexibility. The architecture has been evaluated for its energy efficiency for a computationally intensive algorithm used in elliptic curve cryptography (ECC). Typically, algorithms in ECC exhibit task-level parallelism and demand a large amount of computational resources for custom implementations to achieve a significant speedup. A finite field multiplication in GF(2^233) was chosen as a sample application to evaluate the performance on the QuadroCore reconfigurable multiprocessor architecture. A three-fold performance improvement as compared to a single processor implementation was observed. Further, via reconfiguration to suit the application, power savings of about 24% were noted in UMC's 90nm standard cell technology.
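
For readers unfamiliar with the sample application, a multiplication in the binary field with 2^233 elements can be sketched as a shift-and-add loop over polynomials with coefficients in GF(2). The abstract does not state the reduction polynomial; the sketch assumes x^233 + x^74 + 1, the one used by the NIST 233-bit binary curves. The function name is hypothetical, and this serial version says nothing about how the work is partitioned across the QuadroCore processors.

```python
M = 233
# Assumption: reduction polynomial x^233 + x^74 + 1 (not stated in the abstract).
REDUCTION_POLY = (1 << 233) | (1 << 74) | 1

def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements, each encoded as an integer below 2**233.

    Bit i of an integer is the coefficient of x^i. Addition in GF(2) is XOR.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a       # add the current multiple of a
        b >>= 1
        a <<= 1               # multiply a by x
        if a >> M:            # degree reached 233: reduce modulo the polynomial
            a ^= REDUCTION_POLY
    return result

# x * x^232 = x^233, which reduces to x^74 + 1 under the assumed polynomial.
assert gf_mul(2, 1 << 232) == (1 << 74) | 1
```

Each loop iteration handles one bit of `b`, so a full multiplication takes up to 233 dependent steps, which hints at why such operations are attractive targets for parallel or custom hardware.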


11.5: Jitter Test and Fault Diagnosis

Moderators: M. Sonza Reorda, Politecnico di Torino, IT; A. Zjajo, NXP Semiconductors, NL
Digital Bit Stream Jitter Testing Using Jitter Expansion [p. 1468]
H. Choi and A. Chatterjee

This paper presents a time-domain jitter expansion technique for high-speed digital bit sequence jitter testing. While jitter expansion has been applied to phase noise measurements of sinusoidal signals before, its applicability to random clock jitter testing and data-dependent jitter testing has not been explored. The latter problems have wide application and necessitate new analysis procedures given in this paper. Since low phase noise sinusoids can be generated relatively easily as compared to low jitter digital clocks, the proposed technique utilizes a low-frequency sine wave as a reference signal which can be fed to the device under test with less concern for reference signal noise. A special circuit called a jitter-sensor is used for jitter extraction and produces a low-speed output signal with higher jitter values that track the jitter of the high-speed digital test signal. Thus, conventional narrow-bandwidth testers are able to analyze the sensor output. This makes high resolution jitter testing for high-speed digital signals possible at low cost.

A Same/Different Fault Dictionary: An Extended Pass/Fail Fault Dictionary with Improved Diagnostic Resolution [p. 1474]
I. Pomeranz and S.M. Reddy

We describe a new type of fault dictionary called a same/different fault dictionary. The same/different fault dictionary is similar to a pass/fail fault dictionary in that it contains a single bit b_{i,j} for every modeled fault f_i and test vector t_j. However, in a pass/fail fault dictionary, b_{i,j} is determined by comparing the output vector of the faulty circuit with the output vector of the fault free circuit; while in a same/different fault dictionary, b_{i,j} is determined by comparing the output vector of the faulty circuit with a preselected output vector called a baseline output vector. By selecting appropriately the baseline output vectors for all the test vectors, it is possible to obtain increased diagnostic resolution with a same/different fault dictionary compared to a pass/fail fault dictionary. We describe a procedure for selecting baseline output vectors and present experimental results.
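
The difference between the two dictionaries can be seen on a toy example. In the sketch below, the output vectors are invented data, a bit value of 1 is assumed to mean "differs from the reference vector", and the greedy baseline selection is a simple stand-in written for this illustration; it is not the selection procedure described in the paper.

```python
from itertools import combinations

def distinguished_pairs(signatures):
    """Count fault pairs whose dictionary rows differ."""
    return sum(1 for a, b in combinations(signatures, 2) if a != b)

def build_dictionary(responses, references):
    """responses[i][j] is the output vector of fault i under test j.

    Bit (i, j) is 1 when that output differs from references[j].
    """
    return [tuple(int(out != ref) for out, ref in zip(row, references))
            for row in responses]

def greedy_baselines(responses, fault_free):
    """Pick one baseline per test, greedily maximizing distinguished pairs so far."""
    chosen = []
    for j, good in enumerate(fault_free):
        candidates = sorted({good} | {row[j] for row in responses})

        def score(base):
            partial = [row[:j + 1] for row in responses]
            return distinguished_pairs(build_dictionary(partial, chosen + [base]))

        chosen.append(max(candidates, key=score))
    return chosen

# Hypothetical data: two tests, four faults, 2-bit output vectors.
fault_free = ["00", "11"]
responses = [
    ["01", "11"],  # fault 0
    ["10", "11"],  # fault 1
    ["01", "10"],  # fault 2
    ["10", "10"],  # fault 3
]

pass_fail = build_dictionary(responses, fault_free)
same_diff = build_dictionary(responses, greedy_baselines(responses, fault_free))
print(distinguished_pairs(pass_fail), distinguished_pairs(same_diff))
```

With this data the pass/fail dictionary cannot separate fault 0 from fault 1 (or fault 2 from fault 3), because both fail the first test; comparing against a baseline that is itself a faulty response separates them while still storing one bit per fault and test.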

A Design-for-Diagnosis Technique for SRAM Write Drivers [p. 1480]
A. Ney, P. Girard, S. Pravossoudovitch, A. Virazel, M. Bastian and V. Gouin

Diagnosis is becoming a major concern with the rapid development of semiconductor memories. It provides information about the location of manufacturing defects in the memory, and its effectiveness allows a fast yield ramp up. Most existing diagnosis methods use a fault dictionary to provide detailed information on fault localization. However, these solutions are most of the time unable to distinguish between all faults, and more importantly often fail to identify the actual faulty block of the memory. Identifying which block of a memory (core-cell array, write drivers, address decoders, pre-charge circuits, etc.) is defective saves a considerable amount of time during the ramp up phase. In this paper, we propose a very low cost Design-for-Diagnosis (DfD) solution for identifying faulty write drivers. It consists of verifying logic and analog conditions that guarantee the fault-free behavior of the write driver. The proposed solution allows a fast diagnosis (only three consecutive write operations are needed to fully diagnose the write driver) and induces a low area overhead (about 0.5% for a 512x512 SRAM). Besides diagnosis, an additional interest of such a solution is its usefulness during a post-silicon characterization process, where it can be used to extract the main features of write drivers (logic and analog levels on bit lines).

Variable Delay of Multi-Gigahertz Digital Signals for Deskew and Jitter-Injection Test Applications [p. 1486]
D.C. Keezer, D. Minier and P. Ducharme

The ability to precisely control the timing of digital signals is especially important for multi-GHz testing applications where errors are measured in picoseconds or even 100fs. While many solutions exist for continuous clock-type signals, delay of wide-bandwidth data signals is not so easy. In this paper we introduce a novel technique for adjusting the delay of ~7Gbps data signals on a picosecond scale without significant distortion. The approach is based on a timing/amplitude dependency effect observed in a variable-gain SiGe buffer. A prototype is demonstrated with a variable delay range of about 50ps. This circuit is enhanced by adding a "coarse" delay section, including four 33ps steps, to provide the desired total range of ~140ps. The end application requires several of these circuits for deskewing parallel buses of 6.4Gbps ATE signals. The circuit is also useful for injecting a variable amount of jitter, limited by the fine-delay adjustment range.


11.6: Software Synthesis and Embedded Code Generation

Moderators: K. Larsen, Aalborg U, DK; J. Gerlach, Robert Bosch GmbH, DE
Retargetable Code Optimization for Predicated Execution [p. 1492]
M. Hohenauer, F. Engel, R. Leupers, G. Ascheid, H. Meyr, G. Bette and B. Singh

Retargetable C compilers are key components of today's embedded processor design platforms for quickly obtaining compiler support and performing early processor architecture exploration. The inherent problem of the retargetable compilation approach, though, is the well known trade-off between the compiler's flexibility and the quality of generated code. However, it can be circumvented by designing flexible, configurable code optimization techniques applicable to a certain range of target architectures. This paper focuses on target machines with predicated execution support which is wide-spread in deeply pipelined and highly parallel embedded processors used in next generation high-end video, multimedia and wireless devices. We present an efficient and quickly retargetable code optimization technique for predicated execution that is integrated into an industrial retargetable C compiler. Experimental results for several embedded processors demonstrate that the proposed technique is applicable to real-life target machines and that it produces significant code quality improvements for control intensive applications.

Programming Shared Memory Multiprocessors with Deterministic Message-Passing Concurrency: Compiling SHIM to Pthreads [p. 1498]
S.A. Edwards, N. Vasudevan and O. Tardieu

Multicore shared-memory architectures are becoming prevalent and bring many programming challenges. Among the biggest are data races: accesses to shared resources that make a program's behavior depend on scheduling decisions beyond its control. To eliminate such races, the SHIM concurrent programming language adopts deterministic message passing as its sole communication mechanism. We demonstrate that such language restrictions are practical by presenting a SHIM to C-plus-Pthreads compiler that can produce efficient code for shared-memory multiprocessors. We present a parallel JPEG decoder and FFT exhibiting 3.05x and 3.3x speedups on a four-core processor.

Modularity vs. Reusability: Code Generation from Synchronous Block Diagrams [p. 1504]
R. Lublinerman and S. Tripakis

We present several methods to generate modular code from synchronous hierarchical block diagrams. Modularity means code is generated for a given macro (i.e., composite) block independently from context, that is, without knowing where this block is to be used, and also with minimal knowledge about its sub-blocks. We achieve this by generating a set of interface functions for each block and a set of dependencies between these functions that is exported along with the interface. The main trade-off is the degree of modularity (number of interface functions) vs. reusability (the set of diagrams that the block can be used in without creating dependency cycles).

ezRealtime: A Domain-Specific Modeling Tool for Embedded Hard Real-Time Software Synthesis [p. 1510]
F. Cruz, R. Barreto, L. Cordeiro and P. Maciel

In this paper, we introduce the ezRealtime project, which relies on the Time Petri Net (TPN) formalism and defines a Domain-Specific Modeling (DSM) tool to provide an easy-to-use environment for specifying Embedded Hard Real-Time (EHRT) systems and for synthesizing timely and predictable scheduled C code. To that end, this paper presents a generative programming method that boosts code quality and substantially improves developer productivity by making use of automated software synthesis. The ezRealtime tool reads and automatically translates the system's specification to a time Petri net model through composition of building blocks, with the purpose of providing a complete model of all tasks in the system. This model is then used to find a feasible schedule by applying a depth-first search algorithm. Finally, the scheduled code is generated by traversing the feasible schedule and replacing transition instances by the respective code segments. We also present the application of the proposed method in an expressive case study.


11.7: HOT TOPIC - 3D Integration or How to Scale in the 21st Century

Organizers: B. Bougard, IMEC, BE; P. Marchal, IMEC, BE
Moderator: P. Marchal, IMEC, BE

3D Integration or How to Scale in the 21st Century [p. 1516]
Presenters: L. Benini, D. Keitel-Schulz, N. Checka

3D integration offers numerous opportunities for design, and is probably the best hope for carrying ICs along (and even beyond) the path of Moore's Law in the 21st century. However, many questions still need to be answered to take advantage of 3D. First, what will become the mainstream 3D technology? Today, many technology options are proposed, each with different cost, design and test implications. Secondly, how to make 3D designs reliable? Many unknowns still exist related to thermal load, reliability and signal integrity challenges. Finally, what about design solutions/methods and architectural modifications for 3D integration? The objective of this special session is to create a better understanding of forthcoming 3D technologies and their implications for design and test. An attempt will be made to roadmap 3D technologies and their design implications. This will enable R&D planning by design houses, EDA vendors, foundries and academia, paving the way for a widespread acceptance of 3D technologies.