DATE 2008 ABSTRACTS
-
Designing Micro/Nano Systems for a Safer and Healthier Tomorrow [p. 1]
-
G. De Micheli
The ongoing scaling and hybridisation of manufacturing technologies enable us to attain unprecedented levels
of performance as well as to integrate electronic and fluidic circuits with sensors and actuators.
Smart micro/nano systems will be the building blocks of wearable and ambient systems that gather and integrate
heterogeneous data in real time and operate and communicate in a wireless and ultra low power mode.
These systems will foster a revolution in health and environmental management, with the final objective of
improving security and quality of life. At the same time, they will create a large market of components and
systems, and a renewed perspective for electronic design and manufacturing companies.
To accomplish such an ambitious goal, new technologies and architectures must be matched and tailored to the
operational environment by solving novel and challenging design and optimisation problems, through the creation
of novel design methodologies and tools.
-
Perspective on Embedded Systems: Challenges, Solutions and Research Priorities [p. 2]
-
D. Vernay
The societal demands in Europe for Health, Security & Safety, Energy & Environment, and the market demands
in nomadic, transport, communications and entertainment products call for innovation and technical leadership.
Enabling embedded systems require challenging new solutions such as multi-physics devices, millions of
interconnected nodes, very low power for autonomy, trusted and safe operation, and reliability.
The talk will introduce the THALES vision and research priorities for embedded systems and will illustrate them
through presentations of solutions and on-going research projects and initiatives. The THALES effort related to
mission-critical systems is focused on advanced high-performance embedded computing platforms, on
middleware technologies, on software systems design and verification tools for safety and security and on the
emergence of open standards in these domains. THALES is also actively contributing to the development of
innovation eco-systems: the Joint Undertaking ARTEMIS in Europe; the Pôle de Compétitivité SYSTEM@TIC
PARIS REGION in France.
Moderators: S. Bocchio, STMicroelectronics, IT; W. Mueller, Paderborn U, DE
-
Cycle-approximate Retargetable Performance Estimation at the Transaction Level [p. 3]
-
Y. Hwang, S. Abdi and D. Gajski
This paper presents a novel cycle-approximate performance
estimation technique for automatically generated
transaction level models (TLMs) for heterogeneous multicore
designs. The inputs are application C processes and
their mapping to processing units in the platform. The processing
unit model consists of a pipelined datapath, a memory
hierarchy and a branch delay model. Using the processing
unit model, the basic blocks in the C processes are analyzed
and annotated with estimated delays. This is followed by
a code generation phase where delay-annotated C code is
generated and linked with a SystemC wrapper consisting of
inter-process communication channels. The generated TLM
is compiled and executed natively on the host machine. Our
key contribution is that the estimation technique is close to
cycle-accurate, it can be applied to any multi-core platform
and it produces high-speed native compiled TLMs. For experiments,
timed TLMs for industrial scale designs such as an
MP3 decoder were automatically generated for 4 heterogeneous
multi-processor platforms with up to 5 PEs in under
1 minute. Each TLM simulated in under 1 second, compared
to 3-4 hrs of instruction set simulation (ISS) and 15-18 hrs
of RTL simulation. Comparison to on-board measurement
showed only 8% error on average in estimated number of
cycles.
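The delay-annotation idea can be illustrated with a minimal sketch: each basic block of the application code carries a call that advances a simulated cycle counter, and the code then runs natively. The `Clock` class, the `process` function and all cycle counts below are hypothetical and are not taken from the paper.

```python
class Clock:
    """Accumulates estimated cycles while the annotated code runs natively."""

    def __init__(self):
        self.cycles = 0

    def wait(self, n):
        self.cycles += n


def process(samples, clock):
    """Application code with one hypothetical delay annotation per basic block."""
    clock.wait(4)            # entry block: assumed 4 cycles
    acc = 0
    for s in samples:
        clock.wait(12)       # loop body: assumed 12 cycles per iteration
        acc += s * s
    clock.wait(3)            # exit block: assumed 3 cycles
    return acc


clock = Clock()
result = process([1, 2, 3], clock)
```

After the call, `clock.cycles` holds the estimated cycle count for this particular execution path, while `result` is the ordinary functional output.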
-
A Method for the Efficient Development of Timed and Untimed Transaction-Level Models of
Systems-on-Chip [p. 9]
-
J. Cornet, F. Maraninchi and L. Maillet-Contoz
Transaction Level Modeling (TLM) captures abstract
models of Systems-on-Chip that simulate faster than traditional
RTL simulations and are available earlier in the design
flow. Such models allow the development of the embedded
software on a virtual prototype of the hardware, before
the chip is available. Various levels of details in TL models
are needed; using untimed and timed models for different
purposes is a usual practice.
We present a method for developing very abstract untimed
models first, and then enriching them to get detailed
timed models, while preserving the functionality. The timed
models can be as rich as the models usually written from
scratch. The experiments with industrial case-studies show
improved simulation speed and reduced modeling effort for
both untimed and timed models.
-
Integrating RTL IPs into TLM Designs Through Automatic Transactor Generation [p. 15]
-
N. Bombieri, N. Deganello and F. Fummi
Transaction Level Modeling (TLM) is an emerging design
practice for overcoming increasing design complexity.
It aims at simplifying the design flow of embedded systems
by designing and verifying a system at different abstraction
levels. In this context, transactors play a fundamental role
since they allow communication between the system components,
implemented at different abstraction levels. Reuse of
RTL IPs into TLM systems is a meaningful example of a key
advantage gained by exploiting transactors. Nevertheless,
transactor implementation is still manual, tedious and
error-prone, and the effort spent to verify their correctness
often outweighs the benefits of the TLM-based design flow.
In this paper we present a methodology to automatically
generate transactors for RTL IPs. We show how the transactor
code can be automatically generated by exploiting the
testbench of any RTL IP.
Moderators: B. Candaele, Thales, FR; L. Fanucci, Pisa U, IT
-
Tailored Solutions for Safety-Installations in the Loetschberg Tunnel - A Project with
Importance for the Trans-European Rail Traffic [p. 21]
-
W. Fuß
The Loetschberg base tunnel was the largest project for
the Swiss railway infrastructure in the last five years.
With a length of 34.6 km it is the third longest tunnel in
the world at present. The maximum permitted speed
is 250 km/h. The project comprised four
interlocking stations, ETCS Level 2 and additional
automatic functions to handle the traffic through this
tunnel in an optimized way. To fulfill all the safety
requirements and the challenges of reliability and
maintainability of this very long tunnel, many new
functions had to be realized in each of the mentioned
systems. These demands were met by following a unified
approach using certified hardware and middleware and by
extending the scope of testing to a total system simulation.
This article focuses on the challenges for the
interlocking system LockTrac 6131 ELEKTRA.
-
On the Verification of High-Order Constraint Compliance in IC Design [p. 26]
-
J. Freuer, G. Jerke, J. Gerlach and W. Nebel
The increasing quality requirements on safety-critical
electronic components and the rapid technological progress
necessitate compliance with all specified functional and
non-functional design constraints. This paper introduces
a novel verification method based on a unified data representation
of constraints to enable multi-tool verification
tasks. A Constraint Engineering System is presented which
provides flexible, extensible, and multi-tool definitions of
complex constraints and high-order verification tasks. Existing
verification and simulation tools are combined so that
the achieved complexity level of the high-order verification
by far exceeds the level of the single tools. The shown examples
target practical applications in analog system design
and demonstrate the flexibility and the potential of this new
verification approach.
-
Industrial IP Integration Flows Based on IP-XACT™ Standards [p. 32]
-
W. Kruijtzer, P. van der Wolf, E. de Kock, J. Stuyt, W. Ecker, A. Mayer, S. Hustin, C. Amerijckx,
S. de Paoli and E. Vaumorin
Effective integration of advanced Systems-on-Chip (SoC)
requires extensive reuse of IP modules as well as
automation of the IP integration process, including
verification. Key enablers for this are standards to describe
and package IP modules. We focus on the IP-XACT
standards and demonstrate how these standards are
deployed in three industrial IP integration flows. Further,
we report on two future extensions to IP-XACT that are
currently being explored in the SPRINT project, i.e. IP-XACT
based verification software generation and IP-XACT
based configuration of debug environments. We conclude
that IP-XACT is enabling powerful IP integration
methodologies and that future extensions can further
increase the effectiveness of IP-XACT standards.
Moderators: C. Heer, Infineon Technologies, DE; M. Hübner, Karlsruhe U, DE
-
A Reconfigurable Application Specific Instruction Set Processor for Convolutional and
Turbo Decoding in a SDR Environment [p. 38]
-
T. Vogt and N. Wehn
Future mobile and wireless communication networks require
flexible modem architectures to support seamless services
between different network standards. Hence, a common
hardware platform that can support multiple protocols
implemented or controlled by software, generally referred
to as software defined radio (SDR), is essential.
This paper presents a family of dynamically reconfigurable
application-specific instruction-set processors (ASIP) for
the application domain of channel coding in wireless communication
systems. As a weakly programmable IP core, it
can implement trellis based channel decoding in a SDR environment.
It features binary convolutional decoding, and
turbo decoding for binary as well as duobinary turbo codes
for all current and upcoming standards.
The ASIPs consist of a specialized pipeline with 15
stages and a dedicated communication and memory infrastructure.
Logic synthesis revealed a maximum clock frequency
of 400 MHz and a total area of 0.42 mm2 for a
65 nm technology. Simulation results for Viterbi and turbo
decoding demonstrate maximum throughput of 196 and 34
Mbps, respectively, outperforming existing SDR based
approaches for channel decoding.
-
Using Reconfigurable Logic to Optimise GPU Memory Accesses [p. 44]
-
B. Cope, P. Y. K. Cheung and W. Luk
Memory access patterns common in video processing algorithms,
which are unsuited to the GPU (Graphics Processing
Unit) memory system, are identified. We develop
REDA (Reconfigurable Engine for Data Access) to improve
GPU performance for such access patterns, by employing
reconfigurable logic for address mapping. It is shown that
a sixty-fold reduction in the number of video memory accesses
can be achieved for previously unsuited access patterns,
with no detriment to well suited patterns. Surprisingly,
memory access locality is also improved.
-
Cost- and Power Optimized FPGA Based System Integration: Methodologies and
Integration of a Low-Power Capacity-Based Measurement Application on Xilinx FPGAs [p. 50]
-
K. Paulsson, M. Hübner and J. Becker
The application of Field Programmable Gate Arrays
(FPGAs) in low power and low cost industrial mass
products has become an important issue for designers of
electronic systems. The flexibility and performance
offered by reconfigurable hardware architectures often
come at the cost of increased power consumption in
comparison to Application Specific Integrated Circuit
(ASIC) solutions. By exploiting the flexibility of
reconfigurable hardware architectures, e.g. the capability
of run-time HW reconfiguration of some modern FPGA
devices, power consumption of FPGA-based solutions can
be further decreased. This paper presents an approach
for cost- and power optimized system integration of a
low-power capacity-based measurement system by
exploiting the dynamic and partial reconfiguration
capability of Xilinx FPGAs.
Keywords: Low-power applications, reconfigurable
architectures, hardware reconfiguration
-
Design Flow for Embedded FPGAs Based on a Flexible Architecture Template [p. 56]
-
B. Neumann, T. Von Sydow, H. Blume and T. G. Noll
Modern digital signal processing applications have an
increasing demand for computational power while
needing to preserve low power dissipation and high
flexibility. For many applications, the growth of
algorithmic complexity is already faster than the growth
of computational power provided by discrete general
purpose processors [1]. A typical approach to address
this problem is the combination of a processor core with
dedicated accelerators. Since changes in standards or
algorithms can change the demands on the accelerators,
an attractive alternative to highly customised VLSI-macros
is the use of reconfigurable embedded FPGAs
(eFPGAs). First commercial products combining a
general purpose processor core and an embedded FPGA
recently emerged (e.g. Stretch S6000 [2], Menta eFPGA-augmented
CPUs [3]). For many digital signal
processing applications, a significantly improved
efficiency in terms of power dissipation, throughput and
chip area can be achieved by tailoring both the processor
core and the reconfigurable accelerator to the given
application domain [4].
In this work, a methodology to design highly
customisable eFPGA-architectures starting from a high
level description is presented. The design framework
elaborated during this work enables a physically
optimised VLSI-design of the specified eFPGA and aims
to support simulation of the corresponding eFPGA-macros
both on a functional and netlist-level by providing an
elementary configuration tool based on the same high
level description as the eFPGA-architecture.
Moderators: H. Kerkhoff, Twente U/ CTIT-TDT, NL; J. Machado Da Silva, INESC, PT
-
Optimal High-Resolution Spectral Analyzer [p. 62]
-
A. Tchegho, H. Mattes and S. Sattler
This paper presents a new application field for the Goertzel
algorithm. The test of mixed-signal circuits involves
the generation and analysis of signals. A standard method
for signal analysis is the Fast Fourier Transform (FFT).
FFT algorithms are complex and, owing to their high demand for
resources, are not suitable for BIST (Built-In Self-Test) or
BOST (Built-Off Self-Test) solutions. In this paper,
the Goertzel algorithm is presented as an alternative to
FFT algorithms. A new optimized structure of the Goertzel
algorithm and its implementation in an FPGA (Field Programmable
Gate Array) are presented. A comparison within
the scope of the production test of RF transceiver devices
shows a considerable reduction of test time (factor 6)
and resources (factor 10) compared to an FFT software solution
and an FFT hardware solution, respectively.
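For orientation, the textbook form of the Goertzel recurrence is sketched below. It evaluates a single DFT bin with one multiply-accumulate per sample, which is why it needs far fewer resources than a full FFT when only a few frequencies matter. This is the standard algorithm, not the optimized structure proposed in the paper; the function name and the test tone are illustrative assumptions.

```python
import cmath
import math


def goertzel(samples, k):
    """Return DFT bin k of `samples` using the textbook Goertzel recurrence."""
    n = len(samples)
    w = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        # s[n] = x[n] + 2*cos(w)*s[n-1] - s[n-2]
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Final step: X[k] = e^{jw} * s[N-1] - s[N-2]
    return cmath.exp(1j * w) * s_prev - s_prev2


# Illustrative input: a cosine that falls exactly on bin 3 of a 16-point frame.
tone = [math.cos(2.0 * math.pi * 3 * i / 16) for i in range(16)]
magnitude = abs(goertzel(tone, 3))
```

Only two state variables and one real coefficient are needed per monitored frequency, so the per-sample cost is independent of the frame length.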
-
A General Method to Evaluate RF BIST Techniques Based on Non-Parametric Density
Estimation [p. 68]
-
H.-G. Stratigopoulos, J. Tongbong and S. Mir
We present a general method to evaluate RF Built-
In Self-Test (BIST) techniques during the design stage. In
particular, the adaptive kernel estimator is used to construct
an estimate of the joint probability density function of the
performances of the RF device under test and the actual BIST
measurements. The density is sampled to generate a large volume
of new data, which is subsequently used to estimate the relevant
test metrics with parts per million (ppm) accuracy given the BIST
limits. Thus, the BIST limits can be set to obtain the desired
trade-offs between different test metrics. The proposed method
aims to assist designers in comparing RF BIST techniques on
the basis of accurately calculated test metrics and to provide
information for early BIST refinements, thus reducing the design
cycles. The method is demonstrated for a previously published
RF BIST technique [1] applied to an LNA.
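The sample-then-count step described above can be sketched as follows. For brevity the sketch uses a fixed-bandwidth Gaussian kernel rather than the adaptive kernel estimator of the paper, and it reports plain joint fractions as test metrics. The function names, the pass rule (a value passes when it is at or above its limit) and all numbers are assumptions made for illustration.

```python
import random


def sample_kde(data, bandwidths, n, rng):
    """Draw n synthetic (performance, measurement) pairs from a Gaussian KDE.

    Sampling a KDE amounts to picking a training point uniformly at random
    and perturbing it with kernel noise.
    """
    out = []
    for _ in range(n):
        base = rng.choice(data)
        out.append(tuple(v + rng.gauss(0.0, h) for v, h in zip(base, bandwidths)))
    return out


def estimate_metrics(points, spec_limit, bist_limit):
    """Return (test escape fraction, yield loss fraction) for the given limits."""
    escapes = sum(1 for perf, meas in points
                  if perf < spec_limit and meas >= bist_limit)
    losses = sum(1 for perf, meas in points
                 if perf >= spec_limit and meas < bist_limit)
    return escapes / len(points), losses / len(points)


# Hypothetical simulated devices: (true performance, BIST measurement).
devices = [(1.00, 1.05), (0.90, 0.88), (0.65, 0.82), (0.75, 0.70)]
synthetic = sample_kde(devices, (0.05, 0.05), 100000, random.Random(0))
escape, loss = estimate_metrics(synthetic, spec_limit=0.7, bist_limit=0.8)
```

Sweeping `bist_limit` over a range and re-running `estimate_metrics` on the same synthetic population gives the trade-off curve between the two metrics.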
-
Diagnostic Analysis of Static Errors in Multi-Step Analog to Digital Converters [p. 74]
-
A. Zjajo and J. Pineda De Gyvez
A new approach for diagnostic analysis of
static errors in multi-step ADCs based on the steepest-descent
method is proposed. To set initial data, estimate
the parameter update and guide the test, dedicated
sensors have been designed. The information obtained
through monitoring process variations is re-used to
supplement the circuit calibration. The technique also
allows the test procedure to test only for the most likely
group of faults induced by a manufacturing process. The
implemented design-for-test approach permits circuit reconfiguration
in such a way that all sub-blocks are
tested for their full input range allowing full
observability and controllability of the device under test.
-
Practical Implementation of a Network Analyzer for Analog BIST Applications [p. 80]
-
M.J. Barragán, D. Vázquez and A. Rueda
This paper presents a practical implementation of a
network analyzer for analog BIST applications. The
network analyzer consists of a sinewave generator and a
sinewave evaluator based on switched-capacitor techniques.
Both the generator and the evaluator have been integrated
in a 0.35 μm CMOS technology. The functionality of the
system has been proved in the lab. For this purpose, a
demonstrator board has been developed including the
proposed network analyzer and a filter as DUT.
Measurements in the lab demonstrate a dynamic range of
70 dB in the frequency range up to 20 kHz.
Organizer: B.R. Haverkort, Twente U, NL
Moderator: R. Hersemeule, RWTH Aachen U, DE
-
Quantitative Evaluation in Embedded System Design: Trends in Modeling and Analysis
Techniques [p. 86]
-
J.-P. Katoen
The evaluation of extra-functional properties of embedded
systems, such as reliability, timeliness, and energy consumption,
as well as dealing with uncertainty, e.g., in the
timing of events, is getting more and more important. What
are the models and approaches to analyze such properties
in a reliable way? We survey some main developments and
trends in the modeling and analysis of these aspects and
stress the importance of approaches that tackle both extra-functional
and correctness aspects.
-
Quantitative Evaluation in Embedded System Design: Validation of Multiprocessor
Multithreaded Architectures [p. 88]
-
N. Coste, H. Garavel, H. Hermanns, R. Hersemeule, Y. Thonnart and M. Zidouni
As levels of parallelism are becoming increasingly complex
in multiprocessor architectures, GALS, and asynchronous
circuits, methodologies and software tools are
needed to verify their functional behavior (qualitative properties)
and to predict their performance (quantitative properties).
This paper presents the work currently done in the
Multival project (pôle de compétitivité mondial Minalogic),
in which verification and performance evaluation tools developed
at INRIA and Saarland University are applied to
three industrial architectures designed by Bull, CEA/Leti
and STMicroelectronics.
-
Quantitative Evaluation in Embedded System Design: Predicting Battery Lifetime in Mobile
Devices [p. 90]
-
L. Cloth and B.R. Haverkort
In the design process of an (embedded) computer system
there are several important attributes the developer has
to take care of: first of all, the final product should do the
right thing; we then speak of functional correctness. Second,
the performance should be adequate, expressed in measures
such as throughput, delay or loss probability. Third,
when relying on a battery as power source, it becomes increasingly
important that the system behaves in an energy-aware
manner. We could assess any of the three attributes
in isolation, using completely different sets of models and
tools. However, since the alteration of one of the attributes
most surely also affects the other two, an integrated framework
where all aspects can be evaluated and balanced is definitely
desirable. We present such an integrated approach,
but focus on the evaluation of battery lifetime. The system
under consideration is represented by a stochastic workload
model which then is combined with a battery model. In doing
so, several design alternatives in the behaviour of the
system can be compared early in the design process and
the optimum with respect to functionality, performance and
energy-consumption can be chosen.
Moderators: D. Stroobandt, Ghent U, BE; T. Ishihara, Kyushu U, JP
-
A Framework of Stochastic Power Management Using Hidden Markov Model [p. 92]
-
Y. Tan and Q. Qiu
The effectiveness of stochastic power management
relies on accurate system and workload models and effective
policy optimization. Workload modeling is a machine learning
procedure that finds the intrinsic pattern of the incoming tasks
based on the observed workload attributes. The Markov Decision
Process (MDP) based model has been widely adopted for
stochastic power management because it delivers a provably
optimal policy. Given a sequence of observed workload
attributes, the hidden Markov model (HMM) of the workload
is trained. If the observed workload attributes and states in the
workload model do not have one-to-one correspondence, the
MDP becomes a Partially Observable Markov Decision Process
(POMDP). This paper presents a framework of modeling and
optimization for stochastic power management using HMM
and POMDP. The proposed technique discovers the HMM of
the workload by maximizing the likelihood of the observed
attribute sequence. The POMDP optimization is formulated
and solved as a quadratically constrained linear program
(QCLP). Compared with the traditional optimization technique,
which is based on value iteration, the QCLP based optimization
provides a superior policy by enabling stochastic control.
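The quantity that HMM training maximizes, the likelihood of an observed attribute sequence, can be computed with the standard forward algorithm. The sketch below shows only that evaluation step; it does not include the training loop or the QCLP policy optimization, and the two-state workload model with its probabilities is a made-up example.

```python
def forward_likelihood(obs, init, trans, emit):
    """Return P(obs | HMM) using the forward algorithm.

    init[s]      probability of starting in hidden state s
    trans[p][s]  probability of moving from state p to state s
    emit[s][o]   probability of observing symbol o in state s
    """
    n_states = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            emit[s][o] * sum(alpha[p] * trans[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical workload model: hidden states "idle"/"busy", observations 0/1.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.3, 0.7]]
likelihood = forward_likelihood([0, 0, 1, 1, 1], init, trans, emit)
```

A training procedure would adjust `init`, `trans` and `emit` to increase this value for the recorded workload trace.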
-
Harvesting Wasted Heat in a Microprocessor Using Thermo-Electric Generators: Modeling,
Analysis And Measurement [p. 98]
-
Y. Zhou, S. Paul and S. Bhunia
Harvesting energy from previously unemployed
ambient sources can play an important role in saving energy and
reducing the dependency of an electronic system on primary power
sources (AC power or battery). High-performance
integrated circuits such as microprocessors typically suffer
from high surface temperature (on the order of 80-100°C)
resulting from the high power density and limited cooling
capacity of the package. In this paper, we consider the scope of
harvesting thermoelectric energy from the wasted heat in a
microprocessor by leveraging the temperature gradient
between the processor die surface and the environment. First, we
develop an analytical model to accurately estimate the recycled
energy considering the non-uniformity of the temperature
distribution on the die surface. Next, we analyze the
effectiveness of the approach for thermoelectric generators
(TEGs) with different efficiencies (measured in terms of the
figure of merit, ZT) under varying processor workload. Finally,
we propose a possible arrangement for using the TEG on a
processor and provide measurement results on the amount of
harvested energy. The measurements on a Pentium III
processor running at 1 GHz show that we can harvest ~7 mW of
power from the processor for an average workload using a
commercial TEG.
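To make the role of the figure of merit concrete, the sketch below evaluates the standard textbook expression for the maximum conversion efficiency of a thermoelectric generator, which is the Carnot factor scaled by a term that depends on ZT. This is a general formula and not the analytical model developed in the paper; the function name and the example temperatures are assumptions.

```python
import math


def teg_max_efficiency(t_hot_k, t_cold_k, zt):
    """Ideal maximum TEG efficiency for hot/cold side temperatures in kelvin.

    eta = (Th - Tc)/Th * (sqrt(1 + ZT) - 1) / (sqrt(1 + ZT) + Tc/Th),
    with ZT taken at the mean operating temperature.
    """
    carnot = (t_hot_k - t_cold_k) / t_hot_k
    root = math.sqrt(1.0 + zt)
    return carnot * (root - 1.0) / (root + t_cold_k / t_hot_k)


# Illustrative operating point: 90 degC die surface, 25 degC ambient.
eta_zt1 = teg_max_efficiency(363.15, 298.15, 1.0)
eta_zt2 = teg_max_efficiency(363.15, 298.15, 2.0)
```

The efficiency is zero for ZT = 0, grows with ZT, and always stays below the Carnot limit set by the two temperatures.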
-
An Efficient Solar Energy Harvester for Wireless Sensor Nodes [p. 104]
-
D. Brunelli, L. Benini, C. Moser and L. Thiele
Solar harvesting circuits have been recently proposed to increase
the autonomy of embedded systems. One key design challenge
is how to optimize the efficiency of solar energy collection
under non-stationary light conditions. This paper proposes a scavenger
that exploits miniaturized photovoltaic modules to perform
automatic maximum power point tracking at a minimum energy
cost. The system adjusts dynamically to the light intensity variations
and its measured power consumption is less than 1 mW.
Experimental results show increments of global efficiency up to
80%, diverging from the ideal situation by less than 10%, and demonstrate
the flexibility and the robustness of our approach.
-
Temperature Control of High-Performance Multi-core Platforms Using Convex Optimization [p. 110]
-
S. Murali, A. Mutapcic, D. Atienza, R. Gupta, S. Boyd, L. Benini and G. De Micheli
With technology advances, the number of cores integrated on a chip
and their speed of operation are increasing. This, in turn, is leading
to a significant increase in chip temperature. Temperature gradients
and hot-spots not only affect the performance of the system,
but also lead to unreliable circuit operation and affect the life-time
of the chip. Meeting the temperature constraints and reducing the
hot-spots are critical for achieving reliable and efficient operation
of complex multi-core systems. In this work, we present Pro-Temp,
a convex optimization based method that pro-actively controls the
temperature of the cores, while minimizing the power consumption
and satisfying application performance constraints. The method
guarantees that the temperature of the cores is below a user-defined
threshold at all instances of operation, while also reducing
the hot-spots. We perform experiments on several realistic multicore
benchmarks, which show that the proposed method guarantees
that the cores never exceed the maximum temperature limit, while
matching the application performance requirements. We compare
this to traditional methods, where we find several temperature violations
during the operation of the system.
Keywords: Thermal-aware design, temperature control, dynamic frequency
scaling, static and dynamic optimization.
Moderators: J. Lilius, Abo Akademi U, FI; A. Fouilliart, Thales Communications, FR
-
Parametric Throughput Analysis of Synchronous Data Flow Graphs [p. 116]
-
A. H. Ghamarian, M.C.W. Geilen, T. Basten and S. Stuijk
Synchronous Data Flow Graphs (SDFGs) have proved
to be a very successful tool for modeling, analysis and synthesis
of multimedia applications targeted at both single- and multiprocessor
platforms. One of the most prominent performance constraints
of concurrent real-time applications is throughput. For
given actor execution times, throughput can be verified by analyzing
the SDFG models of such applications, for instance using
maximum cycle mean analysis or state space analysis. In various
contexts, such as design space exploration or run-time reconfiguration,
many fast throughput computations are required for varying
actor execution times.
We present methods to compute throughput of an SDFG where
actor execution times can be parameters. The throughput of these
graphs is obtained in the form of a function of these parameters.
Recalculation of throughput is then merely an evaluation of this
function for specific parameter values, which is much faster than
the standard throughput analysis. We propose three different algorithms
for parametric throughput analysis and evaluate these algorithms
experimentally, showing the feasibility of the approach and
showing that a divide and conquer algorithm performs best.
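The idea of throughput as a function of execution-time parameters can be illustrated on the simplest case. For a homogeneous graph, in which every actor fires once per iteration, throughput is the reciprocal of the maximum cycle mean, so once the cycles are known each re-evaluation is a maximum over a few linear expressions. The sketch assumes the cycles have already been enumerated and does not reproduce any of the three algorithms of the paper; the graph, actor names and token counts are hypothetical.

```python
def make_throughput_function(cycles):
    """Build throughput(exec_time) for a homogeneous data flow graph.

    cycles: list of (actor_names, tokens_on_cycle) pairs, one per simple cycle.
    """
    def throughput(exec_time):
        # Cycle mean = total execution time on the cycle / initial tokens on it.
        worst = max(sum(exec_time[a] for a in actors) / tokens
                    for actors, tokens in cycles)
        return 1.0 / worst

    return throughput


# Hypothetical graph with three cycles: A-B (1 token), B-C (2 tokens), self-loop on A.
cycles = [(("A", "B"), 1), (("B", "C"), 2), (("A",), 1)]
tp = make_throughput_function(cycles)

fast = tp({"A": 2, "B": 3, "C": 5})   # limited by cycle A-B
slow = tp({"A": 1, "B": 1, "C": 7})   # limited by cycle B-C
```

The cycle analysis is done once when `tp` is built; each later call only substitutes parameter values.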
-
Introducing Preemptive Scheduling in Abstract RTOS Models Using Result Oriented
Modeling [p. 122]
-
G. Schirner and R. Dömer
With the increasing SW content of modern SoC designs,
modeling and development of Hardware Dependent Software
(HDS) become critical. Previous work addressed this
by introducing abstract RTOS modeling [6], which exposes
dynamic scheduling effects early in the system design flow.
However, such models insufficiently capture preemption. In
particular, the accuracy of preemption depends on the granularity
of the timing annotation. For an accurately modeled
interrupt response time, very fine-grained timing annotation
is necessary, which contradicts the RTOS abstraction idea
and is detrimental to simulation performance.
In this paper, we eliminate the granularity dependency
by applying the Result Oriented Modeling (ROM) technique
previously used only for communication modeling.
Our ROM approach allows precise preemptive scheduling,
while retaining all the benefits of abstract RTOS modeling.
Our experimental results demonstrate tremendous improvements.
While the traditional model simulated an interrupt
response time with a severe inaccuracy (12x longer
on average and 40x longer for the 96th percentile), our ROM-based
model was accurate within 8% (average and 50th
percentile) using identical timing annotations.
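The granularity dependency can be illustrated with a small calculation. The sketch assumes a simplified traditional model in which the scheduler regains control only at the end of a timing annotation, so an interrupt that arrives in the middle of an annotated step is served at the next boundary. The function name and the numbers are illustrative and are not taken from the paper.

```python
import math


def coarse_response_delay(irq_time, step):
    """Extra delay before a coarse-grained model can react to an interrupt.

    irq_time: simulated time at which the interrupt is raised
    step:     length of each timing annotation (same time unit)
    """
    next_boundary = math.ceil(irq_time / step) * step
    return next_boundary - irq_time


# An interrupt at t = 130 with 100-unit annotations waits until t = 200.
delay_coarse = coarse_response_delay(130, 100)
# With 1-unit annotations the same interrupt is served immediately.
delay_fine = coarse_response_delay(130, 1)
```

Shrinking `step` removes the error but multiplies the number of simulated wait calls, which is the trade-off the abstract describes.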
-
SystemC-Based Modeling, Seamless Refinement, and Synthesis of a JPEG 2000 Decoder [p. 128]
-
K. Grüttner, F. Oppenheimer, W. Nebel, F. Colas-Bigey and A.-M. Fouilliart
This paper describes and evaluates, by way of example, the
OSSS methodology for embedded hardware/software systems
and its use in a JPEG 2000 decoder case-study. The
OSSS approach defines a design flow starting from an Application
Model providing a rich subset of SystemC™/C++
augmented with specific OSSS language concepts. It can
be used to identify the most promising parallel structure
by comparing different design alternatives. A clearly defined
refinement process leads to the Virtual Target Architecture
(VTA) Model. These refinements enable an analysis
of the system behaviour at cycle-accurate granularity and
support the exploration of different target architectures for
the JPEG 2000 decoder. VTA models can be used as direct
input for the FOSSY synthesis tool, which performs an automatic
transformation into implementation models; that is
to generate VHDL code for hardware, C/C++ for software,
and platform configuration files for the target technology.
-
Modeling and Refining Heterogeneous Systems with SystemC-AMS: Application to WSN [p. 134]
-
M. Vasilevski, F. Pecheux, N. Beilleau, H. Aboshady and K. Einwich
The paper presents a system-level approach for the modeling
and simulation of a paradigmatic Wireless Sensor Network composed of
two nodes using SystemC-AMS, an open-source C++ extension to the
OSCI SystemC Standard dedicated to the description of heterogeneous
systems containing digital, analog, RF hardware IPs as well as embedded
software. The paper is composed of three parts. The first part details
the modeled WSN (physical sensor, sigma-delta ADC, ATMEGA128 8-bit
microcontroller running the embedded application, QPSK-based 2.4
GHz RF transceiver), presents the corresponding implementation in
SystemC-AMS, and gives insight into how multi-frequency simulation
is handled in SystemC-AMS. The second part shows how to introduce
several RF designer specifications (noise figure, IIP3, ...) into models and
how to express them in SystemC-AMS. The third part proves that the
combination of C++ and RF baseband equivalent dramatically reduces
simulation time while keeping excellent accuracy and code readability.
The paper concludes with the possibilities offered by this approach in
terms of validation and optimization of heterogeneous systems through
open-source simulation.
Moderators: C. Grimm, TU Vienna, AT; D. Mueller, TU Munich, DE
-
Sizing Rules for Bipolar Analog Circuit Design [p. 140]
-
T. Massier, H. Graeb and U. Schlichtmann
This paper presents sizing rules for basic building blocks
in analog bipolar circuit design. Sizing rules efficiently capture
design knowledge on the technology-specific level of
transistor-pair groups. This reduces the effort for and improves
the resulting quality of analog circuit synthesis. We
present a hierarchical library of transistor-pair groups as
basic building blocks for analog bipolar circuits. Sizing
rules are constraints associated with these building blocks that
must be satisfied to guarantee the function and robustness
of each block. Results of applications like circuit sizing or
design centering show that the use of sizing rules leads to
improved and robust results.
-
Efficient Circuit-Level Modeling of Ballistic CNT Using Piecewise Non-Linear
Approximation of Mobile Charge Density [p. 146]
-
T. J. Kazmierski, D. Zhou and B. M. Al-Hashimi
This paper presents a new carbon nanotube transistor
(CNT) modelling technique which is based on an efficient
numerical piece-wise non-linear approximation of the
non-equilibrium mobile charge density. The technique facilitates
the solution of the self-consistent voltage equation in a carbon
nanotube such that the CNT drain-source current evaluation
is accelerated by more than three orders of magnitude while
maintaining high modelling accuracy. The model is currently
limited to ballistic transport but can be extended to non-ballistic
modes of transport when a suitable theory is developed while
researchers study phenomena that sometimes prevent electrons
in a carbon nanotube from going ballistic. Our results show
that while the accuracy and speed of the proposed model vary
with the number of piece-wise segments in the mobile charge
approximation, it is possible to obtain a speed-up of more than
1000 times while maintaining the accuracy within less than 2%
in terms of average RMS error compared with the state of the art
theoretical reference CNT model implemented in FETToy. This
numerical efficiency makes our model particularly suitable for
implementation in circuit-level, e.g. SPICE-like, simulators where
large numbers of such devices may be used to build complex
circuits.
-
A New Approach for Combining Yield and Performance in Behavioral Models for Analogue
Integrated Circuits [p. 152]
-
S. Ali, R. Wilcock, P. Wilson and A. Brown
A new algorithm is presented that combines performance
and variation objectives in a behavioural model for a
given analogue circuit topology and process. The tradeoffs
between performance and yield are analysed using a
combination of a multi-objective evolutionary algorithm
and Monte Carlo simulation. The results indicate a
significant improvement in overall simulation time and
efficiency compared to conventional simulation based
approaches, without a corresponding drop in accuracy.
This approach is particularly useful in the hierarchical
design of large and complex circuits where computational
overheads are often prohibitive. The behavioural model
has been developed in Verilog-A and tested extensively
with practical designs using the Spectre™ simulator. A
benchmark OTA circuit was used to demonstrate the
proposed algorithm and the behaviour has been verified
with transistor level simulations of this circuit and a
higher level filter design. This has demonstrated that an
accurate performance and yield prediction can be
achieved using this model, in a fraction of the time of
conventional simulation based methods.
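
The yield side of such a performance/yield trade-off can be illustrated with a minimal Monte Carlo sketch. The performance function, parameter spread, and specification limits below are hypothetical placeholders; they are not the behavioural model or the OTA benchmark from the paper.

```python
import random

def estimate_yield(perf, nominal, sigma, spec_min, n_samples=10000, seed=0):
    """Fraction of randomly perturbed parameter sets whose performance meets spec_min."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_samples):
        # Draw one process sample: each parameter varies around its nominal value
        sample = [rng.gauss(mu, sigma) for mu in nominal]
        if perf(sample) >= spec_min:
            passed += 1
    return passed / n_samples

# Hypothetical performance figure: product of two design parameters
def toy_gain(params):
    return params[0] * params[1]

# A looser specification gives a higher yield than a tighter one
yield_loose = estimate_yield(toy_gain, [10.0, 5.0], 0.5, spec_min=40.0)
yield_tight = estimate_yield(toy_gain, [10.0, 5.0], 0.5, spec_min=50.0)
```

Because both calls use the same seed, they evaluate the same samples, so tightening the specification can only lower the estimated yield.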
Moderators: L. Fanucci, Pisa U, IT; J. Gerlach, Robert Bosch GmbH, DE
-
Symbolic Reliability Analysis and Optimization of ECU Networks [p. 158]
-
M. Glaß, M. Lukasiewycz, F. Reimann, C. Haubelt and J. Teich
Increasing reliability at a minimum amount of extra cost
is a major challenge in today's ECU network design. Considering
reliability as an objective already in early design
phases has the potential to avoid expensive modifications
in later design phases. Hence, there is a need for an appropriate
optimization process and efficient analysis techniques
to evaluate the found implementations. In this paper,
we will show how symbolic techniques can be used to efficiently
analyze and optimize such reliable systems. The
contribution of this paper is (1) a symbolic reliability analysis
that makes use of a partitioned structure function and
(2) a symbolic optimization process based on binary ILP
solvers. Our case study from the automotive area will show
a significant speed-up using our analysis technique. Moreover,
our optimization approach is able to offer implementations
with considerably improved reliability at no additional
costs as well as implementations with reduced costs without
decreasing their reliability.
-
Verification of Temporal Properties in Automotive Embedded Software [p. 164]
-
D. Lettnin, P.K. Nalla, J. Ruf, T. Kropf, W. Rosenstiel, T. Kirsten, V. Schönknecht and S.
Reitemeyer
The amount of software in embedded systems
has increased significantly over the last years and,
therefore, the verification of embedded software is
of fundamental importance. One of the main problems
in embedded software is to verify variables
and functions based on temporal properties. Formal
property verification using a model checker often
suffers from the state space explosion problem
when a large software design is considered. In this
paper, we propose two new approaches to integrate
assertions in the verification of embedded software
using simulation-based verification. Firstly, we extended
a SystemC hardware temporal checker with
interfaces in order to monitor the embedded software
variables and functions that are stored in a
microprocessor memory model. Secondly, we derived
a SystemC model from the original C program
in order to integrate directly with the SystemC
temporal checker. We performed a case study
on embedded software from the automotive industry
which is responsible for controlling read and write
requests to a non-volatile memory.
-
A Novel Approach for EMI Design of Power Electronics [p. 170]
-
B. Stube, B. Schroeder, E. Hoene and A. Lissner
The placement of passive components significantly
influences the EMI behavior of power electronic systems.
Particularly filter components are affected by magnetic
field coupling reducing filter performance. In this paper
we introduce a novel approach for a methodical EMI
design of power electronic circuits. Based on the results
of EMI prediction, design rules for component placement
are derived. To meet the design rules a prototype of a
dedicated placement tool was developed. This tool offers
extensive interactive and automatic placement functionality to
solve the very complex design task efficiently. Using the
proposed approach in the design stage allows both a
statement on achievable performance with the given
components and the minimization of the system volume.
Development costs can be significantly reduced.
-
Hardware/Software Architecture of an Algorithm for Vision-Based Real-Time Vehicle
Detection in Dark Environments [p. 176]
-
N. Alt, C. Claus and W. Stechele
Hardware/software partitioning of algorithms is
gaining more and more importance in order to benefit from
the advantages of both worlds. Pure software implementations
are easy to change but the processing time is rather high. By
contrast pure hardware implementations usually result in faster
processing due to inherent parallelism but they do not offer
the necessary flexibility for quick changes and adaptions. In
this paper the hardware/software co-design of a self-developed
algorithm to detect cars by their taillights as well as its implementation
on an embedded system (FPGA) is presented. Instead
of utilizing expensive sensors such as RADAR which also can
be used to detect obstacles in dark environments, the detection
method presented here is based solely on grayscale images taken
by a low-cost on-board camera which was mounted on a moving
vehicle. Only computationally intense parts - namely pixel or
sliding window operations - are implemented in hardware to
achieve the necessary real-time requirements. The remainder of
the algorithm - the so called higher level application code - is
running on standard embedded CPU cores. With this architecture
it is possible to process the incoming video stream (25 frames/s)
and detect cars in real-time on an embedded system.
Keywords: driver assistance, real-time video processing,
hardware acceleration, taillight detection
Moderators: R. Dorsch, IBM Boeblingen, DE; P. Harrod, ARM Ltd, UK
-
Analysis of the Test Data Volume Reduction Benefit of Modular SOC Testing [p. 182]
-
O. Sinanoglu and E. J. Marinissen
Modular SOC testing offers numerous benefits that include test
power reduction, ease of timing closure, and test re-use among
many others. While all these benefits have been emphasized by
researchers, the test time and data volume comparisons have been
mostly constrained within the context of modular SOC testing
only, by comparing the impact of various different modular SOC
testing techniques to each other. In this paper, we provide a theoretical
test data volume analysis that compares the monolithic
test of a flattened design with the same design tested in a modular
manner; we present numerous experiments that gauge the magnitude
of this benefit. We show that the test data volume reduction
delivered by modular SOC testing directly hinges on the test pattern
count variation across different modules, and that this reduction
can exceed 99% in the SOC benchmarks that we have experimented
with.
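
The dependence on pattern-count variation can be illustrated with a simplified volume model. As assumptions that are not taken from the paper, the sketch supposes that a monolithic test applies the largest module pattern count to every scan cell, that a modular test applies each module's own pattern count to its own cells, and that wrapper overhead is ignored. The module figures are hypothetical.

```python
def monolithic_volume(modules):
    """modules: list of (scan_cells, pattern_count) pairs.
    Assumed model: every pattern is loaded into all scan cells of the flattened design."""
    total_cells = sum(cells for cells, _ in modules)
    max_patterns = max(patterns for _, patterns in modules)
    return total_cells * max_patterns

def modular_volume(modules):
    """Assumed model: each module receives only its own patterns (wrapper cost ignored)."""
    return sum(cells * patterns for cells, patterns in modules)

def reduction(modules):
    """Relative test data volume saved by modular testing under the assumed model."""
    return 1 - modular_volume(modules) / monolithic_volume(modules)

# Hypothetical SOCs: identical pattern counts versus strongly varying pattern counts
uniform = [(1000, 100), (1000, 100)]
skewed = [(10000, 10), (100, 1000)]
```

Under this model `reduction(uniform)` is zero, while `reduction(skewed)` exceeds 98%, since a small module with many patterns no longer forces those patterns onto the large module.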
-
Test-Architecture Optimization and Test Scheduling for SOCs with Core-Level Expansion of
Compressed Test Patterns [p. 188]
-
A. Larsson, E. Larsson, K. Chakrabarty, P. Eles and Z. Peng
The ever-increasing test data volume for core-based
system-on-chip (SOC) integrated circuits is resulting in high test
times and excessive tester memory requirements. To reduce
both test time and test data volume, we propose a technique
for test-architecture optimization and test scheduling that is
based on core-level expansion of compressed test patterns.
For each wrapped embedded core and its decompressor, we
show that the test time does not decrease monotonically with
the width of test access mechanism (TAM) at the
decompressor input. We optimize the wrapper and
decompressor designs for each core, as well as the TAM
architecture and the test schedule at the SOC level.
Experimental results for SOCs crafted from several
industrial cores demonstrate that the proposed method
leads to significant reduction in test data volume and test
time, especially when compared to a method that does not
rely on core-level decompression of patterns.
-
A Novel Methodology for Reducing SoC Test Data Volume on FPGA-based Testers [p. 194]
-
P. Bernardi and M. Sonza Reorda
Low-Cost test methodologies for Systems-on-Chip are
increasingly popular. They dictate which features have to
be included on-chip and which test procedures have to be
adopted in order to guarantee high test quality, while
minimizing application costs. Consequently, Low-Cost test
strategies can be run on testers offering lower
performance and/or reduced features with respect to
traditional Automatic Test Equipment (ATE); these
machines are usually referred to as Low-Cost testers.
This paper proposes a methodology for reducing the
test data volume for the application of SoC Low-Cost test
procedures. The method exploits a tester architecture
organization suitable for SoC testing, which includes a
programmable device: the use of this configurable block,
combined with the analysis of test pattern regularities, permits
minimizing the test data volume, thus improving the tester
capabilities. The proposed method relies on test pattern
compression at system level and it does not address core
level pattern manipulation, as several other previously
published works do.
Case studies are proposed, which provide data about
the application of the proposed methodology to the test of
SoCs including self-testable processor and memory cores.
IEEE 1149.1 and IEEE 1500 test access mechanisms are
considered. The achieved pattern depth reduction ratio is
up to about 64% for the considered case studies.
Moderators: G. Beltrame, European Space Agency; F. Schaefer, Cadence Design Systems, DE
-
Performance Analysis of SoC Architectures Based on Latency-Rate Servers [p. 200]
-
J. P. Vink, K. Van Berkel and P. Van Der Wolf
This paper presents a method for static performance
analysis of SoC architectures. The method is based on a
network calculus theory known as LR servers. This network
calculus is extended and applied to make it support SoC
performance analysis. Performance requirements of subsystems
are elegantly captured as traffic flows and associated
latency constraints. The SoC infrastructure is modeled as
a set of LR servers to validate that the worst-case delays
in handling the traffic flows meet the latency constraints.
A multi-channel DVB-T set-top box case study demonstrates
the power of the method. Key architecture choices, such as
schedule or interconnect variant, can be varied easily to
support exploration of architecture options.
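
As a rough illustration of the kind of check involved, the sketch below uses the classical latency-rate result: a flow with burst size sigma and rate rho, served at that rate by a chain of LR servers, is delayed by at most the sum of the server latencies plus sigma/rho. The function names and numbers are hypothetical, and the extensions made in the paper are not modelled.

```python
def lr_delay_bound(burst, rate, latencies):
    """Worst-case delay of a (burst, rate) constrained flow through a chain of
    LR servers, each guaranteeing `rate` after its own latency."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return sum(latencies) + burst / rate

def meets_constraint(burst, rate, latencies, max_latency):
    """True if the worst-case delay stays within the flow's latency constraint."""
    return lr_delay_bound(burst, rate, latencies) <= max_latency

# Hypothetical flow: 64-word burst, 16 words per time unit, two servers in series
bound = lr_delay_bound(64, 16, [2.0, 3.0])
```

Changing an architecture option, such as the arbitration scheme of one server, would change only the corresponding latency entry, which is what makes this style of analysis convenient for exploration.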
-
Slack Allocation Based Co-Synthesis and Optimization of Bus and Memory Architectures for
MPSoCs [p. 206]
-
S. Pandey and R. Drechsler
In this paper, we present a bus and memory architectures
co-synthesis technique. The co-synthesis problem is formulated
as an optimization problem, where scheduling, allocation,
and binding of tasks are done simultaneously in order to
optimize the bus widths, the number of buses, and the memory
sizes. As a main contribution, bus and memory architectures
are optimized simultaneously by allocating different
amounts of slack to them during co-synthesis. The method
finds a balance of slack allocation for both bus and memory
optimization. Because previous co-synthesis approaches do
not consider slack allocation, the bus and memory
architectures they synthesize are not optimal in terms
of area and energy consumption. Experimental results
on real-life applications show 19% and 24% reduction
in bus and memory area, respectively, and 37% reduction
in energy overhead due to a bridge, compared to
the previous co-synthesis approach.
-
Run-Time Spatial Mapping of Streaming Applications to a Heterogeneous Multi-Processor
System-on-Chip (MPSoC) [p. 212]
-
P.K.F. Hölzenspies, J.L. Hurink, J. Kuper and G.J.M. Smit
In this paper, we present an algorithm for run-time allocation
of hardware resources to software applications.
We define the sub-problem of run-time spatial mapping
and demonstrate our concept for streaming applications
on heterogeneous MPSoCs. The underlying algorithm and
the methods used therein are implemented and their use is
demonstrated with an illustrative example.
-
Architecture Exploration of NAND Flash-Based Multimedia Card [p. 218]
-
S. Kim, C. Park and S. Ha
In this paper, we present an architecture exploration methodology
for low-end embedded systems where the reduction of cost is a
primary design concern. The architecture exploration of such
systems needs to explore a wide design space spanned by detailed
architecture parameters through cycle-accurate performance
estimation. For fast exploration, the proposed methodology is
based on an efficient evolutionary algorithm, called QEA, and
trace-driven simulation to evaluate architecture candidates
quickly. We applied the proposed methodology to a NAND flash-based
Multimedia Card as a case study considering the following
design parameters: buffer size, flash memory configuration, clock,
communication architecture, and memory allocation. The
experimental results validate the proposed methodology by
showing the optimal architecture configurations with varying
performance constraints and design parameters.
Moderators: C. Silvano, Politecnico di Milano, IT; A. Hemani, Royal Institute of Technology
(KTH), SE
-
Resilient Dynamic Power Management under Uncertainty [p. 224]
-
H. Jung and M. Pedram
With the increasing levels of variability and randomness in the
characteristics and behavior of manufactured nanoscale structures
and devices, achieving performance optimization under process,
voltage, and temperature (PVT) variations as well as current,
voltage, and thermal (CVT) stress has become a daunting, yet vital,
task. In this paper, we present a stochastic dynamic power
management (DPM) framework to improve the accuracy of decision
making under probabilistic conditions induced by PVT variations
and/or stress. More precisely, we propose a resilient power
management technique that guarantees to select an optimal policy
under sources of uncertainty. A key characteristic of the proposed
technique is that the effects of uncertainties due to variability and
stress are captured by stochastic processes which control a self-improving
power manager. Simulation results with a 65nm
processor design show that, compared to the worst-case PVT
conditions, the proposed DPM technique ensures energy efficiency,
while reducing the uncertain behaviors of the system.
-
Robust and Low Complexity Rate Control for Solar Powered Sensors [p. 230]
-
C. Moser, L. Thiele, D. Brunelli and L. Benini
This paper is concerned with solar driven sensors deployed
in an outdoor environment. We present feedback controllers
which adapt parameters of the application such that a maximal
utility is obtained while respecting the time-varying
amount of available energy. We show that already simple
applications lead to complex optimization problems, involving
unacceptable running times and energy consumptions
for resource constrained nodes. In addition, naive designs
are highly susceptible to energy prediction errors. We address
both issues by proposing a hierarchical control approach
which both reduces complexity and increases robustness
towards prediction uncertainty. As a key component
of this hierarchical approach, we propose a new worst-case
energy prediction algorithm which guarantees sustainable
operation. All methods are evaluated using long-term measurements
of solar energy in an outdoor setting. Furthermore,
we measured the implementation overhead on a real
sensor node.
-
Energy Aware Dynamic Voltage and Frequency Selection for Real-Time Systems with
Energy Harvesting [p. 236]
-
S. Liu, Q. Qiu and Q. Wu
In this paper, an energy aware dynamic voltage and frequency
selection (EA-DVFS) algorithm is proposed. The EA-DVFS
algorithm adjusts the processor's behavior depending on the
summation of the stored energy and the harvested energy in a future
duration. Specifically, if the system has sufficient energy, tasks are
executed at full speed; otherwise, the processor slows down task
execution to save energy. Simulation results show that when the
utilization is low, the EA-DVFS algorithm gives a deadline miss rate
that is at least 50% lower than the one given by the lazy scheduling
policy. Similarly, when the workload is low, the minimum storage size
is reduced by at least 25%.
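
The decision rule described above can be sketched as follows. This is a minimal illustration, not the published EA-DVFS algorithm: the operating points, the energy model (power times execution time), and the choice of the lowest-energy feasible level when energy is short are all assumptions.

```python
def select_speed(work_cycles, time_to_deadline, stored, harvest_forecast, levels):
    """Pick an operating point for one task.
    levels: list of (frequency_hz, power_w) pairs (hypothetical operating points).
    Returns the chosen pair, or None if no level can meet the deadline."""
    # Energy usable before the deadline: stored plus predicted harvest
    available = stored + harvest_forecast
    feasible = [(f, p) for f, p in levels if work_cycles / f <= time_to_deadline]
    if not feasible:
        return None
    full = max(feasible)  # highest frequency among the feasible levels
    if full[1] * work_cycles / full[0] <= available:
        return full  # sufficient energy: run at full speed
    # Otherwise slow down: take the feasible level with the lowest task energy
    return min(feasible, key=lambda fp: fp[1] * work_cycles / fp[0])

LEVELS = [(1e6, 0.1), (2e6, 0.5)]  # hypothetical (Hz, W) pairs
rich = select_speed(1e6, 2.0, stored=0.2, harvest_forecast=0.1, levels=LEVELS)
poor = select_speed(1e6, 2.0, stored=0.1, harvest_forecast=0.05, levels=LEVELS)
```

With ample energy the task runs at the 2 MHz point; with a smaller budget the sketch falls back to the 1 MHz point, which still meets the 2 s deadline.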
-
Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution [p. 242]
-
S. Hong, S. Yoo, B. Bin, K.-M. Choi, S.-K. Eo and T. Kim
This paper presents a method of dynamic voltage scaling (DVS)
that tackles both switching and leakage power with combined
Vdd/Vbs scaling and gives minimum average energy consumption
exploiting the runtime distribution of software execution. We
present a mathematical formulation of the DVS problem and an
efficient numerical solution. Experimental results show that the
presented method achieves up to 44% further reduction in energy
consumption compared with existing methods. Especially, when the
leakage power consumption is significant, i.e. when temperature is
high, the presented method is proven to be the most effective.
-
Built-In Clock Skew System for On-Line Debug and Repair [p. 248]
-
A. Chattopadhyay and Z. Zilic
We present a low-cost on-line system for clock skew
management in integrated circuits. Our Built-In Clock
Skew System (BICSS) uses a centralized approach to
identify, quantify and correct skew using a two-step
method. The technique assesses the time-of-flight
between the central debug circuitry and each region,
or tap under test to account for the measurement error
due to differences in path length common in existing
techniques. The system can be used to detect skew
above a user-adjustable margin using a variable
tolerance phase detector. The result is a solution
which provides silicon debug and repair capability of
on-chip clock skews with a very small area overhead.
-
Analysis and Optimization of the Recessed Probe Launch for High Frequency Measurements
of PCB Interconnects [p. 252]
-
R. Rimolo-Donadio, C. Schuster, X. Gu, Y.H. Kwark and M.B. Ritter
Measurements of internal printed circuit board (PCB)
structures such as striplines and vias face the problem of
launching clean test signals into the device under test
(DUT). Traditionally, coaxial connectors or surface
probing with high frequency microprobes are used to
provide interfaces to test equipment. Both approaches
have to be carefully optimized in order to give adequate
results for the multi-GHz range. This paper discusses a
different access technique, the recessed probe launch
(RPL), which was previously used by the authors for
measurements up to 40 GHz. Full-wave 3D
electromagnetic modeling is applied to analyze the
parasitics of the proposed launch technique and to find
strategies for its optimization. Comparison to
measurement shows that the models are able to predict
the major physics of the launch but several details still
need to be explored, e.g. accurate modeling of the
microprobes, material parameters, and network analyzer
calibration.
-
On Automated Trigger Event Generation in Post-Silicon Validation [p. 256]
-
H.F. Ko and N. Nicolici
When searching for functional bugs in silicon, debug
data is acquired after a trigger event occurs. A trigger event
can be configured at run-time using a set of control registers
that uniquely identify the event that initiates data acquisition.
Nonetheless the values loaded in these programmable
registers interact only with a set of pre-defined trigger signals
that are selected at design-time. If the state conditions
required for triggering cannot be expressed directly in terms
of the pre-defined trigger signals, the common practice is
that the designer manually searches for an equivalent trigger
event that can be programmed on-chip. In this paper we
investigate if trigger events can be automatically generated
from a set of state conditions.
-
Dynamic Round-Robin Task Scheduling to Reduce Cache Misses for Embedded Systems [p. 260]
-
K.W. Batcher and R.A. Walker
Modern embedded CPU systems rely on a growing number of
software features, but this growth increases the memory footprint
and increases the need for efficient instruction and data caches.
The embedded operating system will often juggle a changing set of
tasks in a round-robin fashion, which inevitably results in cache
misses due to conflicts between different tasks. Our technique
reduces cache misses by continuously monitoring CPU cache
misses to grade the performance of running tasks. Through a
series of step-wise refinements, our software system tunes the
round-robin ordering to find a better temporal sequence for the
tasks. This tuning is done dynamically during program execution
and hence can adapt to changes in work load or external input
stimulus. The benefits of this technique are illustrated using an
ARM processor running application benchmarks with different
cache organizations and round-robin scheduling techniques.
-
Improving the Efficiency of Run Time Reconfigurable Devices by Configuration Locking [p. 264]
-
Y. Qu, J.-P. Soininen and J. Nurmi
Run-time reconfigurable logic is a very attractive
alternative in the design of SoC. However, configuration
overhead can largely decrease the system performance. In
this work, we present a novel configuration locking
technique to reduce the effect of the overhead. The idea is
to lock, at run time, a number of the most frequently used
tasks in the configuration memory so that they cannot be
evicted by other tasks. With real applications in
validation, the results show that using a proper amount of
resources to lock tasks can significantly outperform
simply using more resources. In addition, an algorithm
has been developed for estimating the lock ratio.
Experimental results show that the estimates are close to
optimal results and the measured computer runtime is less
than 4 µs in a commercial embedded processor.
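
The locking idea can be illustrated with a small replay model. The sketch assumes an LRU replacement policy for the unlocked slots, one slot per task, and a hypothetical task trace; none of these details come from the paper, and the lock-ratio estimation algorithm is not modelled.

```python
from collections import OrderedDict

def count_reconfigurations(trace, slots, locked=()):
    """Replay a task trace on a configuration memory with `slots` entries.
    Tasks in `locked` are loaded once up front and never evicted;
    the remaining slots are managed with LRU replacement (an assumption)."""
    locked = set(locked)
    free = slots - len(locked)
    if free < 1:
        raise ValueError("at least one slot must stay unlocked")
    lru = OrderedDict()
    loads = len(locked)  # initial load of each locked task
    for task in trace:
        if task in locked:
            continue  # always resident, no reconfiguration
        if task in lru:
            lru.move_to_end(task)  # hit: mark as most recently used
            continue
        loads += 1  # miss: the task must be configured
        if len(lru) >= free:
            lru.popitem(last=False)  # evict the least recently used task
        lru[task] = True
    return loads

# Hypothetical trace in which plain LRU keeps evicting every task before reuse
TRACE = ["A", "B", "C", "D"] * 3
without_lock = count_reconfigurations(TRACE, slots=3)
with_lock = count_reconfigurations(TRACE, slots=3, locked=("A",))
```

On this trace the unlocked memory reconfigures on every access, while locking task A removes its reloads without adding any slots.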
-
Logic Synthesis with Nanowire Crossbar: Reality Check and Standard Cell-Based
Integration [p. 268]
-
M. Dong and L. Zhong
Nanowire crossbar is one of the most promising circuit
solutions for nanoelectronics. We show nanowire crossbars
do not scale well in terms of logic density and speed. We
consequently propose a Crossbar Cell design based on judicious
use of silicon nanowire crossbars with microscale
pitches and small dimensions. The Crossbar Cell is compatible
with the conventional MOSFET fabrication and
standard cell-based integration. We evaluate logic circuits
using Crossbar Cells and show that they can improve density
by more than fourfold over the traditional MOSFET
circuits with the same process technology, while achieving
close performance and over threefold power reduction.
-
Merged Computation for Whirlpool Hashing [p. 272]
-
R. Chaves, G. Kuzmanov, L. Sousa and S. Vassiliadis
This paper presents an improved hardware structure for
the computation of the Whirlpool hash function. By merging
the round key computation with the data compression and
by using embedded memories to perform part of the Galois
Field (2^8) multiplication, a core can be implemented in just
43% of the area of the best current related art while achieving
a 12% higher throughput. The proposed core improves
the Throughput per Slice compared to the state of the art
by 160%, achieving a throughput of 5.47 Gbit/s with 2110
slices and 32 BRAMs on a VIRTEX II Pro FPGA. Results
for a real application are also presented by considering a
polymorphic computational approach.
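
For readers unfamiliar with the arithmetic involved, the sketch below multiplies in GF(2^8) using the Whirlpool reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D) and precomputes a 256-entry table for one constant. The table only illustrates how a memory lookup can replace part of a multiplication; it is not the hardware structure proposed in the paper, and the function names are hypothetical.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add with modular reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= POLY  # reduce when the degree reaches 8
        b >>= 1
    return result

def constant_table(c):
    """256-entry lookup table for multiplication by the constant c."""
    return [gf_mul(x, c) for x in range(256)]

TIMES_2 = constant_table(0x02)
```

A lookup such as `TIMES_2[x]` then stands in for `gf_mul(x, 0x02)`, which is the kind of operation an embedded memory block can absorb.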
-
Source-Level Timing Annotation and Simulation for a Heterogeneous Multiprocessor [p. 276]
-
T. Meyerowitz, A. Sangiovanni-Vincentelli, M. Sauermann and D. Langen
A generic and retargetable tool flow is presented that enables
the export of timing data from software running on
a cycle-accurate Virtual Prototype (VP) to a concurrent
functional simulator. First, an annotation framework takes
information gathered from running an application on the
VP and automatically annotates the line-level delays back
to the original source code. Then, a SystemC-based timed
functional simulator runs the annotated source code much
faster than the VP while preserving timing accuracy. This
simulator is API-compatible with the multiprocessor's operating
system. Therefore, it can compile and run unmodified
applications on the host PC. This flow has been implemented
for MuSIC (Multiple SIMD Cores) [6], a heterogeneous
multiprocessor developed at Infineon to support Software
Defined Radio (SDR). When compared with an optimized
cycle-accurate VP of MuSIC on a variety of tests,
including a multiprocessor JPEG encoder, the accuracy is
within 20%, with speedups from 10x to 1000x.
-
Safe Automatic Flight Back and Landing of Aircraft. Flight Reconfiguration Function (FRF) [p. 280]
-
J.A. Herrería García
SOFIA (Safe Automatic Flight Back and Landing of
Aircraft) project is a response to the challenge of
developing concepts and techniques enabling the safe and
automatic return to ground in the event of hostile actions.
Activities in this sense have been started in the framework
of the SAFEE SP3 (Secure Aircraft in the Future
European Environment Sub-Project 3) project. SOFIA
project is proposed as the continuation of the SAFEE
works on FRF (Flight Reconfiguration Function), the
system to automatically return the aircraft to ground.
SOFIA will design architectures for integrating the FRF
system into several typologies of avionics for civil
transport aircraft; development of one of these
architectures; validation, following E-OCVM (European
Operational Concept Validation Methodology) of the FRF
concept and the means to integrate it in the current ATM
(Air Traffic Management); safety assessment of FRF at
aircraft and operational (ATC-Air Traffic Control) levels.
The SOFIA product is the FRF system that will take the
control of the aircraft and will manage to safely return it
to ground under a security emergency (e.g. hijacking),
disabling the control and command of the aircraft from the
cockpit. This means to create and execute a new flight
plan towards a secure airport and landing the aircraft at
it. The flight plan can be generated in ground (ATC), or in
a military airplane and transmitted to the aircraft, or
created autonomously at the FRF.
-
PWM-Based Test Stimuli Generation for BIST of High Resolution Sigma-Delta ADCs [p. 284]
-
D. De Venuto and L. Reyneri
A fully digital test stimuli generation and on-chip
specifications evaluation for cheap, fast, though accurate
testing of high resolution ΣΔ ADCs are here presented.
Simulations and measurements showed a discrimination
threshold on specification parameters up to -90dBc. The
proposed method helps reduce the cost of ADC
production test, to extend test coverage and to enable
Built-In Self-Test and test-based self-calibration.
Moderators: J. Teich, Erlangen-Nuremberg U, DE; P. Pop, DTU, DK
-
Temperature-Aware Scheduling and Assignment for Hard Real-Time Applications on
MPSoCs [p. 288]
-
T. Chantem, R.P. Dick and X.S. Hu
Thermal effects in MPSoCs may cause the violation
of timing constraints in real-time systems. This paper presents
a mixed integer linear programming based solution to this
problem. Tasks are assigned and scheduled to an MPSoC to
minimize peak temperature, subject to real-time constraints.
The proposed approach outperforms existing methods, reducing
peak temperature by up to 24.66 °C and by an average of
8.75 °C when compared to minimal-energy solutions. We also
present a heuristic for use on large problem instances. Steady-state
thermal analysis is used for tasks with long execution
times compared to the RC thermal time constants of the cores.
Transient analysis is used otherwise. The steady-state analysis
based heuristic finds solutions with at most 3.40 °C deviation from
optimal peak temperature (0.22 °C on average) while improving
upon existing techniques by as much as 25.71 °C and 10.86 °C on
average. The transient analysis based heuristic further reduces
peak temperature by 1 °C in the best case and 0.17 °C on average.
-
A Formal Approach to the Protocol Converter Problem [p. 294]
-
K. Avnit, V. D'Silva, A. Sowmya, S. Ramesh and S. Parameswaran
In the absence of a single module interface standard, integration
of pre-designed modules in System-on-Chip design
often requires the use of protocol converters. Existing
approaches to automatic synthesis of protocol converters
mostly lack formal foundations and either employ abstractions
that ignore crucial low level behaviors, or grossly simplify
the structure of the protocols considered. We present
a state-machine based formal model for bus based communication
protocols, and precisely define protocol compatibility,
and correct protocol conversion. Our model is expressive
enough to capture features of commercial protocols
such as bursts, pipelined transfers, wait state insertion,
and data persistence, in cycle accurate detail. We show that
the most general, correct converter for a pair of protocols,
can be described as the greatest fixed point of a function for
updating buffer states. This characterization yields a natural
algorithm for automatic synthesis of a provably correct
converter by iterative computation of the fixed point. We report
our experience with automatic converter synthesis between
widely used commercial bus protocols, such as AMBA
AHB, ASB, APB, and OCP, considering features which are
beyond the scope of current techniques.
-
Cache Aware Mapping of Streaming Applications on a Multiprocessor System-on-Chip [p. 300]
-
A. Moonen, M. Bekooij, R. van den Berg and J. van Meerbergen
Efficient use of the memory hierarchy is critical
for achieving high performance in a multiprocessor system-on-chip.
An external memory that is shared between processors
is a bottleneck in current and future systems. Cache
misses and a large cache miss penalty contribute to a low
processor utilisation. In this paper, we describe a novel
cache optimisation technique to reduce instruction and data
cache misses for streaming applications. The instruction
and data locality are improved by executing a task multiple
times before moving to the next task. Furthermore, we
introduce a dataflow model that is used to trade off the number
of cache misses against end-to-end latency and memory
usage. For our industrial application, which is a Digital
Radio Mondiale receiver, the number of cache misses is reduced
by a factor of 4.2.
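The effect of executing a task several times in a row can be sketched with a toy model. The model below is an assumption made for illustration, not the dataflow model of the paper: the instruction cache is taken to hold the code of exactly one task, so every change of task costs one code reload. Task names and the batch sizes are hypothetical.

```python
def schedule(tasks, iterations, batch):
    """Run each task `batch` times in a row before moving to the next."""
    order = []
    for start in range(0, iterations, batch):
        count = min(batch, iterations - start)
        for task in tasks:
            order.extend([task] * count)
    return order


def count_code_reloads(order):
    """Toy cache model (an assumption): the instruction cache holds one
    task, so every change of task forces a reload of that task's code."""
    reloads, resident = 0, None
    for task in order:
        if task != resident:
            reloads += 1
            resident = task
    return reloads


tasks = ["demodulate", "decode", "output"]   # hypothetical task names
for batch in (1, 2, 4):
    order = schedule(tasks, iterations=8, batch=batch)
    # Buffers between tasks must hold `batch` tokens, and the first output
    # appears only after the earlier tasks have run `batch` times.
    print(batch, count_code_reloads(order))
```

In this toy setting a larger batch lowers the number of reloads, while the buffers between tasks and the time until the first output both grow with the batch size.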
-
Synthesizing Synchronous Elastic Flow Networks [p. 306]
-
G. Hoover and F. Brewer
This paper describes an implementation language and
synthesis system for automatically generating latency insensitive
synchronous digital designs. These designs decouple
behavioral correctness from design performance by
allowing any sub-component to dynamically stall without
changing correct system activity. This is accomplished by
imposition of global invariants and use of local control
in the form of Synchronous-Elastic Flow (SELF) networks,
which are directly synthesized. This design description format
reduces the complexity of implementing correct SELF
networks and does not require pre-design of a correct conventional
synchronous design. The design description is a
specialized guarded atomic action language which is particularly
suited for succinctly describing SELF designs. We
present the language syntax, semantics and synthesis techniques
illustrated by the design of a latency tolerant cache
controller.
Moderators: T. Kazmierski, Southampton U, UK; H. Graeb, TU Munich, DE
-
Periodic Steady-State Analysis Augmented with Design Equality Constraints [p. 312]
-
I. Vytyaz, P.K. Hanumolu, U.-K. Moon and K. Mayaram
A design-oriented periodic steady-state analysis is presented
in this paper. The new analysis finds the values of circuit parameters
that result in a desired circuit performance specified by a set of equality
constraints. This is done by including the design equality constraints and
the circuit parameters directly in the steady-state analysis as additional
equations and unknowns. A time-domain finite difference method and
the numerical implementation for the proposed analysis are described.
Several examples demonstrate that the new analysis accurately and
efficiently tunes circuit parameters that conform to a wide range of
design specifications.
-
Analysis of Oscillator Injection Locking by Harmonic Balance Method [p. 318]
-
M.M. Gourary, S.G. Rusakov, S.L. Ulyanov, M.M. Zharov, B.J. Mulvaney and K.K. Gullapalli
A new approach to analyze injection locking mode of
oscillators under small external excitation is proposed. The
proposed approach exploits existence conditions of the
solution of HB linear system with degenerate matrix. The
method allows one to obtain the locking range for an
arbitrary oscillator circuit with an arbitrary periodic
injection waveform. The approach can be easily
implemented in a circuit simulator. Examples are given to
confirm the correctness of the new approach.
-
Model Checking of Analog Systems Using an Analog Specification Language [p. 324]
-
S. Steinhorst and L. Hedrich
In this contribution an advanced methodology for model checking
of analog systems is introduced. A new Analog Specification
Language (ASL) for efficient property specifications is defined and
model checking algorithms for implementing this language are
presented. This allows verification of complex static and dynamic
circuit properties like Oscillation and Startup Time that have
not yet been formally verifiable with previous approaches. The
new verification methodology is applied to example circuits and
experimental results are discussed and compared to conventional
circuit simulation.
Moderators: P. Manet, U Catholique de Louvain, BE; B. Candaele, Thales, FR
-
Mapping Semantics of CORBA IDL and GIOP to Open Core Protocol for Portability and
Interoperability of SDR Waveform Components [p. 330]
-
G. Gailliard, H. Balp, M. Sarlotte and F. Verdier
Patterns, middlewares and frameworks have been
used for decades in software architecture to address the
main problems encountered today by the MPSoC and
NoC communities: heterogeneity of languages,
programming models, simulation/execution environments,
interaction semantics and communication protocols. A
complete semantics mapping of CORBA Interface
Definition Language (IDL) and General Inter-ORB
Protocol (GIOP) on the Open Core Protocol (OCP) has
been investigated for hardware components. This
mapping is generic, highly configurable and illustrated
through our target application: Software Defined Radio.
-
On the Design of Tunable Fault Tolerant Circuits on SRAM-Based FPGAs for Safety Critical
Applications [p. 336]
-
L. Sterpone, M. Aguirre, J. Tombs and H. Guzmán-Miranda
Mission-critical applications such as space or avionics
increasingly demand high fault tolerance capabilities of
their electronic systems. Among the fault tolerance
characteristics, the performance and costs of an electronic
system remain the leading factors in the space and avionics
market. In particular, when considering SRAM-based
FPGAs, specific hardening techniques generally based on
Triple Modular Redundancy need to be adopted in order
to guarantee the desired fault tolerance degree. While
effectively increasing the fault tolerance capability, these
techniques introduce a significant performance
degradation and a dramatic area overhead, which result in
higher design costs. In this paper, we propose an
innovative design flow that allows the implementation of
fault tolerant circuits in SRAM-based FPGA devices with
different fault tolerance capability degrees. We introduce
a new metric that allows a designer to precisely estimate
and set the desired fault tolerance capabilities.
Experimental analysis performed on a realistic industrial-type
case study demonstrates the efficiency of our
methodology.
-
Hot Wire Anemometric MEMS Sensor for Water Flow Monitoring [p. 342]
-
M. Melani, L. Bertini, M. De Marinis, P. Lange, F. D'Ascoli and L. Fanucci
This paper presents an application based on a hot wire
anemometric sensor in MEMS technology in the field of
water flow monitoring. New generations of MEMS
sensors feature remarkable savings in area, cost and
power with respect to conventional discrete devices, but as
a drawback, they require complex electronic interfaces for
signal conditioning to achieve high performance and
high reliability. This anemometric sensor implementation
has been developed with ISIF, a Platform SoC that aims at
fast prototyping of a wide range of sensors thanks to its highly
configurable resources.
The presented system achieves good performance
with respect to commercial devices, featuring a resolution
of ±0.35% up to ±1.76% with a repeatability of roughly ±1%
with respect to the full scale (0-250 cm/s). Furthermore, the
proposed system, thanks to the compact size of the sensor,
its robustness and its low cost, can represent a solution
for widespread monitoring in water distribution networks.
Moderators: L. Anghel, TIMA Laboratory, FR; D. Appello, STMicroelectronics, IT
-
Guiding Circuit Level Fault-Tolerance Design with Statistical Methods [p. 348]
-
D.C. Ness and D.J. Lilja
In the last decade, the focus of fault-tolerance methods
has tended towards circuit level modifications, such as transistor
resizing, and away from expensive system level redundancy
approaches. We present the results from a screening
experiment to identify significant parameters in circuit level
soft error simulations to guide such approaches to fault-tolerance.
This approach allows us to assess which parameters
will have the most significance for reducing soft error
rates and the impact that process variation will have on
the accuracy of soft error rate estimates. We identify supply
voltage and transistor type as being the most significant
parameters affecting soft errors in logic cells across several
technology scales. Additionally, we provide a ranking
of more than a dozen parameters, across four technology
scales, based on the significance of their impact on soft error
rates.
-
A Delay-Efficient Radiation-Hard Digital Design Approach Using CWSP Elements [p. 354]
-
C. Nagpal, R. Garg and S.P. Khatri
In this paper, we present a radiation-hardened digital design approach.
This approach is based on the use of Code Word State Preserving (CWSP)
elements at each flip-flop of the design, and leaving the rest of the design
unaltered. The CWSP element provides 100% SET protection for glitch
widths up to $\min\{D_{\min}/2,\ (D_{\max} - \Delta)/2\}$, where
$D_{\min}$ and $D_{\max}$ are the
minimum and maximum circuit delay respectively and $\Delta$ is an extra delay
associated with our SET protection circuit. The CWSP circuit has two
inputs - the latch output signal and the same signal delayed by a quantity
$\delta$. In case an SET error is detected, then the current computation is
repeated, using the correct output, which is generated later in the same
clock period by the CWSP element. Unlike previous approaches, we use
the CWSP element in a secondary path and the CWSP logic is designed to
minimally impact the critical delay path of the design. The delay penalty
of our approach (averaged over several designs) is less than 1%. Thus
our technique is applicable for high-speed designs, where the additional
delay associated with SET protection must be kept at a minimum.
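As a small worked example of the protection bound stated above, the following sketch evaluates the minimum of the two terms for made-up delays. The function name and all numeric values are hypothetical; only the shape of the bound is taken from the abstract.

```python
def max_protected_glitch_width(d_min, d_max, extra_delay):
    """Evaluate min{Dmin/2, (Dmax - extra_delay)/2}, where extra_delay is
    the extra delay of the SET protection circuit."""
    return min(d_min / 2, (d_max - extra_delay) / 2)


# Hypothetical delays in picoseconds, chosen only for illustration.
short_path_limited = max_protected_glitch_width(d_min=900, d_max=1500, extra_delay=100)
long_path_limited = max_protected_glitch_width(d_min=1400, d_max=1500, extra_delay=300)
print(short_path_limited, long_path_limited)
```

In the first case the shortest path limits the protected glitch width; in the second the longest path minus the extra delay does.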
-
Towards Fault Tolerant Parallel Prefix Adders in Nanoelectronic Systems [p. 360]
-
W. Rao and A. Orailoglu
Future nanoelectronics based arithmetic components will enjoy abundant
hardware, yet at the same time confront severe unreliability challenges.
We focus on the fault tolerance of high performance parallel
prefix adders (PPA), and exploit the inherent redundancy in PPAs to
develop efficient fault tolerance approaches. We show that the internal
invariant inherent in the parallel prefix adders provides support for online
fault detection and fault masking. Furthermore, based on the particular
regular structure of PPAs, an online diagnosis scheme can be
developed, which allows the reconfigurability of nanoelectronics to be used
for highly flexible online repair. In contrast
to traditional fault tolerance techniques that rely solely on significant
external overhead, the proposed approach opens up a new genre of
efficient fault tolerance techniques for arithmetic components in the
nanoelectronic environment.
-
A Novel Low Overhead Fault Tolerant Kogge-Stone Adder Using Adaptive Clocking [p. 366]
-
S. Ghosh, P. Ndai and K. Roy
As the feature size of transistors gets smaller,
fabricating them becomes challenging. The manufacturing process
follows various corrective design-for-manufacturing (DFM) steps to
avoid shorts/opens/bridges. However, it is not possible to completely
eliminate the possibility of such defects. If spare units are not
present to replace the defective parts, then such failures cause yield
loss. In this paper, we present a fault tolerant technique to leverage
the redundancy present in high speed regular circuits such as the
Kogge-Stone adder (KSA). Due to its regularity and speed, the KSA is
widely used in ALU design. In a KSA, the carries are computed quickly by
computing them in parallel. Our technique is based on the fact that
even and odd carries are mutually exclusive. Therefore, a defect in
an even bit can only corrupt the even Sum outputs whereas the odd
Sums are computed correctly (and vice versa). To efficiently utilize
the above property of the KSA in the presence of defects, we perform
addition in two clock cycles. In cycle-1, one of the correct sets of bits
(even or odd) is computed and stored at the output registers. In cycle-2,
the operands are shifted by one bit and the remaining set of bits
(odd or even) is computed and stored. This allows us to tolerate the
defect at the cost of throughput degradation while maintaining high
frequency and yield. The proposed technique can tolerate any
number of faults as long as they are confined to either even or odd
bits (but not in both). Further, this technique is applicable for any
type of fault model (stuck-at, bridging, complete opens/shorts). We
performed simulations on a 64-bit KSA using 180nm devices. The
results indicate that the proposed technique incurs less than 1% area
overhead. Note that there is very little throughput degradation
(<0.3%) for the fault-free adders. The proposed technique utilizes
the existing scan flip-flops for storage and shifting operation to
minimize the area/performance overhead. Finally, the proposed
technique is used in a superscalar processor, whereby the faulty
adder is assigned lower priority than fault-free adders to reduce the
overall throughput degradation. Experiments performed using
Simplescalar for a superscalar pipeline (with four integer adders)
show throughput degradation of 0.5% in the presence of a single
defective adder.
Keywords: Stuck-at faults, Fault tolerant adder, Adaptive clocking,
Kogge-Stone adder, Scheduling.
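The separation between even and odd carries can be checked on a small behavioural model. The sketch below is not the authors' circuit or their two-cycle scheme: it is a plain Kogge-Stone prefix computation with a hypothetical stuck-at fault forced on the generate output of one prefix node, and it uses its own bit indexing (carry `G[i]` is the carry out of bit `i`, with carry-in assumed 0).

```python
def kogge_stone_carries(a, b, width=6, fault=None):
    """Return G, where G[i] is the carry out of bit i (carry-in assumed 0).

    fault: optional (stage, position, stuck_value) that forces the generate
    output of one prefix node to a constant (hypothetical stuck-at model).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    stage, dist = 0, 1
    while dist < width:
        new_g, new_p = g[:], p[:]
        for i in range(dist, width):
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        if fault is not None and fault[0] == stage:
            new_g[fault[1]] = fault[2]
        g, p = new_g, new_p
        stage += 1
        dist *= 2
    return g


def add_with_carries(a, b, width=6):
    """Fault-free addition modulo 2**width using the prefix carries."""
    carries = kogge_stone_carries(a, b, width)
    total = 0
    for i in range(width):
        p_i = ((a >> i) ^ (b >> i)) & 1
        c_in = carries[i - 1] if i > 0 else 0
        total |= (p_i ^ c_in) << i
    return total


def corrupted_positions(width, fault):
    """Carry positions that differ from the fault-free value for any input."""
    bad = set()
    for a in range(1 << width):
        for b in range(1 << width):
            good = kogge_stone_carries(a, b, width)
            faulty = kogge_stone_carries(a, b, width, fault)
            bad.update(i for i in range(width) if good[i] != faulty[i])
    return bad


# Stuck-at-1 on the first-stage prefix node at (even) position 2.
bad = corrupted_positions(6, (0, 2, 1))
print(sorted(bad))
```

Because every later stage combines position `i` with position `i - d` for an even distance `d`, a fault on a prefix node output at an even position can only reach carries at even positions in this model, and likewise for odd positions.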
Organizers: J. Beutel, ETH Zurich, CH; M. Beigl, TU Braunschweig, DE
Moderator: M. Beigl, TU Braunschweig, DE
-
Software for Wireless Networked Embedded Systems [p. 372]
-
Presenters: A. Dunkels, K. Langendoen, J. Beutel
-
Embedded systems driven by future applications will be tightly coupled with the increasing complexity of the
real world. Consisting of myriads of wireless networked devices with heterogeneous architectures, distributed and
interacting in a number of ways and serving a multitude of purposes, these systems have to adapt to and take advantage of
conditions that are unpredictable at design time.
In their realisation, software both on a system and on an application level is playing an increasingly important
role that cannot be designed independently. Dominant design factors are the severe resource constraints, the
unreliability of the wireless medium and the dynamics of both the applications and the environment. Selected
challenges in the area of wireless sensor networks are addressed by the speakers in this special session
highlighting the current gap between theory and practice in an emerging field.
Moderators: R. Zafalon, STMicroelectronics, IT; D. Soudris, Democritus U of Thrace, GR
-
Fine-Grained Supply Gating Through Hypergraph Partitioning and Shannon Decomposition
for Active Power Reduction [p. 373]
-
L. Leinweber and S. Bhunia
Energy-efficient performance has emerged as the key
design objective of high-performance logic circuits to address
power-induced reliability concerns and battery life requirements in
portable devices. In the sub-65nm technology regime, these
problems continue to grow as leakage power becomes the
predominant form of power consumption. Among numerous power
reduction techniques employed at the circuit and architectural
levels, supply gating has been proven to be very effective for
standby power reduction. In this paper, we propose application of
fine-grained supply gating to large complex circuits for active
leakage and dynamic power reduction. A design methodology and
associated CAD tool is developed to synthesize combinational logic
using hypergraph partitioning and Shannon decomposition, which
reduces both leakage and switching power by disabling unused
logic dynamically in small clusters of gates. Simulation results for
a set of ISCAS-85 benchmarks show that the proposed approach
can achieve up to 40% saving in total power in active mode (and up
to 37% saving in standby power) with negligible impact on
performance and die area for a predictive 32 nm technology.
Index Terms - Low Power Design, Supply Gating, Active Power,
Hypergraph Partitioning.
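Shannon decomposition, on which the synthesis above relies, rewrites a function f as x·f(x=1) + x'·f(x=0). The following sketch checks that identity on a small example; it says nothing about the authors' partitioning or gating circuitry, and the example function and helper names are hypothetical.

```python
from itertools import product


def cofactor(f, index, value):
    """Return f with input `index` fixed to `value`."""
    def g(*bits):
        fixed = list(bits)
        fixed[index] = value
        return f(*fixed)
    return g


def shannon_expand(f, index):
    """Rebuild f as x*f(x=1) + x'*f(x=0), selecting on input `index`."""
    f1, f0 = cofactor(f, index, 1), cofactor(f, index, 0)

    def g(*bits):
        # Only one cofactor is evaluated for a given value of the select
        # input; the logic of the other one is idle in that cycle.
        return f1(*bits) if bits[index] else f0(*bits)
    return g


def majority(a, b, c):          # hypothetical example function
    return (a & b) | (b & c) | (a & c)


expanded = shannon_expand(majority, 0)
equivalent = all(majority(*v) == expanded(*v)
                 for v in product((0, 1), repeat=3))
print(equivalent)
```

For each input vector only one of the two cofactors contributes to the output, which is the property that makes the idle cofactor logic a candidate for gating.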
-
A Scalable Algorithmic Framework for Row-Based Power-Gating [p. 379]
-
A. Sathanur, A. Pullini, L. Benini, A. Macii, E. Macii and M. Poncino
Leakage power is a serious concern in nanometer CMOS technologies.
In this paper we focus on leakage reduction through automatic
insertion of sleep transistors for power gating in standard cell based
designs. In particular, we propose clustering algorithms for a row-based
power-gating methodology, which uses rows of
the layout as the granularity for clustering. Our clustering methodology
performs timing and area constraint driven power-gating, in contrast
to the timing-only driven power-gating proposed in previous
works. We present two distinct clustering algorithms with different
accuracy-efficiency trade-offs: an optimal one, which exploits a 0-1
or Binary Integer Programming approach, and a heuristic one,
which resorts to an implicit enumeration of the layout rows.
Results show that, for all the benchmarks, the leakage power savings,
as compared to previous techniques, are more than 75% when
we have the same timing constraints but half the sleep transistor area,
and at least 60% when the area constraint is set at one fourth. We also
show that we can perform clustering with no speed degradation and
achieve maximum leakage power savings of up to 83%.
-
Coarse-Grain MTCMOS Sleep Transistor Sizing Using Delay-Budgeting [p. 385]
-
E. Pakbaznia and M. Pedram
Power gating is one of the most effective techniques in reducing the
standby leakage current of VLSI circuits. In this paper we introduce
a new approach for sleep transistor sizing which minimizes the total
sleep transistor width for a coarse-grain multi-threshold CMOS
circuit assuming a given standard cell and sleep transistor
placement. First, the circuit is decomposed into a set of modules,
each containing the set of logic cells that are closest to a sleep
transistor cell. Next, given an upper bound on the overall circuit
speed degradation, the global timing slack is distributed among
different clusters using delay budgeting. The slack distribution
result is then used to size the sleep transistors such that the total
sleep transistor width is minimized while accounting for the
parasitic resistances of the virtual ground net. Results show that the
proposed sizing algorithm produces sleep transistor sizes that are
40% smaller than those produced by previous approaches.
Organizers: A. Sangiovanni-Vincentelli, UC Berkeley, US; M. Di Natale, Scuola S Anna, Pisa, IT
Moderator: A. Sangiovanni-Vincentelli, UC Berkeley, US
-
Physical Architectures of Automotive Systems [p. 391]
-
T. Forest, A. Ferrari, G. Audisio, M. Sabatini, A. Sangiovanni-Vincentelli and
M. Di Natale
This section will provide insight into new developments
and advances in automotive electronics architectures. The
design of innovative chip architectures, new upcoming standards
for high-bandwidth and deterministic communication
(FlexRay) and sensors are the domains of interest, with emphasis
on reliability and support for advanced active safety
functions.
Moderators: I. Harris, UC Irvine, US; V. Bertacco, U of Michigan, US
-
A Mutation Model for the SystemC TLM 2.0 Communication Interfaces [p. 396]
-
N. Bombieri, F. Fummi and G. Pravadelli
Mutation analysis is a widely-adopted strategy in software
testing with two main purposes: measuring the quality of test
suites, and identifying redundant code in programs. Similar approaches
are applied in hardware verification and testing too,
especially at RTL or gate level, where mutants are generally referred
to as faults, and mutation analysis is performed by means
of fault modeling and fault simulation. However, in modern
embedded systems there is a close integration between HW and
SW parts, and verification strategies should be applied early
in the design flow. This requires the definition of new mutation
analysis-based strategies that work at system level, where HW
and SW functionalities are not partitioned yet. In this context,
the paper proposes a mutation model for perturbing transaction
level modeling (TLM) SystemC descriptions. In particular,
the main constructs provided by the SystemC TLM 2.0 library
have been analyzed, and a set of mutants is proposed to perturb
the primitives related to the TLM communication interfaces.
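The role of mutation analysis in measuring test-suite quality can be shown on a tiny software example. This sketch is written in Python rather than SystemC and does not use the TLM 2.0 library or the mutants proposed in the paper; the function under test, the mutants and the test vectors are all hypothetical.

```python
import operator


def make_saturating_add(op=operator.add, limit=255):
    """Build the function under test; `op` and `limit` are mutation points."""
    def saturating_add(x, y):
        return min(op(x, y), limit)
    return saturating_add


# Each mutant perturbs exactly one construct of the original function.
mutants = {
    "add->sub": make_saturating_add(op=operator.sub),
    "add->or": make_saturating_add(op=operator.or_),
    "limit 255->256": make_saturating_add(limit=256),
}


def killed(mutant, tests):
    """A mutant is killed if some test observes a different result."""
    original = make_saturating_add()
    return any(mutant(x, y) != original(x, y) for x, y in tests)


def mutation_score(tests):
    """Fraction of mutants killed by the test suite."""
    return sum(killed(m, tests) for m in mutants.values()) / len(mutants)


weak_tests = [(0, 0), (1, 2)]
strong_tests = weak_tests + [(1, 1), (200, 100)]
print(mutation_score(weak_tests), mutation_score(strong_tests))
```

The weak suite kills only the subtraction mutant, while the added vectors expose the bitwise-or mutant and the changed saturation limit, so the score separates the two suites.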
-
Efficient Design Validation Based on Cultural Algorithms [p. 402]
-
W. Wu and M.S. Hsiao
We introduce a new semi-formal design validation framework
to justify hard-to-reach corner-case states. We propose
a cultural learning technique to identify the swarming of domain
knowledge during the search. In addition, our guidance
strategy abstracts sets of partitioned state variables, from
which pre-images are computed to capture the expanded portions
of the state spaces related to a target state. Experimental
results show that our approach is more effective at reaching hard-to-reach
states than existing methods.
-
Algorithms for Maximum Satisfiability Using Unsatisfiable Cores [p. 408]
-
J. Marques-Silva and J. Planes
Many decision and optimization problems in Electronic
Design Automation (EDA) can be solved with Boolean Satisfiability
(SAT). Moreover, well-known extensions of SAT
also find application in EDA, including Pseudo-Boolean
Optimization, Quantified Boolean Formulas, Multi-Valued
SAT and, more recently, Maximum Satisfiability (MaxSAT).
Algorithms for MaxSAT are still fairly inefficient in industrial
settings, in part because the most effective SAT techniques
cannot be easily extended to MaxSAT. This paper
proposes a novel algorithm for MaxSAT that improves on existing
state-of-the-art solvers by orders of magnitude on industrial
benchmarks. The new algorithm exploits modern
SAT solvers, being based on the identification of unsatisfiable
subformulas. Moreover, the new algorithm provides
additional insights into the relationship between unsatisfiable subformulas and
the maximum satisfiability problem.
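To make the MaxSAT problem concrete, the following sketch solves a four-clause instance by exhaustive search. It is a definition-level illustration only and is not the core-guided algorithm of the paper, which relies on a SAT solver that reports unsatisfiable subformulas. The clause encoding (signed integers, DIMACS style) and the instance are assumptions chosen for the example.

```python
from itertools import product


def satisfied(clause, assignment):
    """A clause is a list of non-zero ints; -v means NOT v (DIMACS style)."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def max_sat(clauses, num_vars):
    """Return (best_count, assignment) by exhaustive search."""
    best = (-1, None)
    for values in product((False, True), repeat=num_vars):
        assignment = dict(enumerate(values, start=1))
        count = sum(satisfied(c, assignment) for c in clauses)
        if count > best[0]:
            best = (count, assignment)
    return best


# x1, NOT x1, (x1 OR x2), NOT x2: unsatisfiable as a whole.
clauses = [[1], [-1], [1, 2], [-2]]
best_count, best_assignment = max_sat(clauses, num_vars=2)
print(best_count, best_assignment)
```

The clauses `x1` and `NOT x1` form an unsatisfiable subformula on their own, so any assignment must give up at least one of them; this is the kind of information that links unsatisfiable subformulas to the MaxSAT optimum.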
-
In-Band Cross-Trigger Event Transmission for Transaction-Based Debug [p. 414]
-
S. Tang and Q. Xu
Cross-trigger, the mechanism to trigger activities in one debug entity
from debug events that happened in another debug entity, is a very useful
technique for debugging applications involving multiple embedded
cores. Existing solutions rely on dedicated interconnects (i.e., different
from functional interconnects) to transfer debug events and
cannot guarantee that the arrival time of the debug events coincides with
the arrival time of the data messages between multiple cores. This
results in mismatches between the observed system internal operations
and the ones that designers expect to watch. To tackle the
above problem, in this paper, we propose to package the cross-trigger
events and the actual data together into transaction messages and
transfer them along the same functional interconnects (namely in-band
debug event transmission), with the help of novel design-for-debug
circuits. Simulation results on a hypothetical NoC-based system
show the effectiveness of the proposed technique.
Moderators: R. Suaya, Mentor Graphics, FR; N. van der Meijs, TU Delft, NL
-
Efficient Representation and Analysis of Power Grids [p. 420]
-
J.M.S. Silva, J.R. Phillips and L.M. Silveira
Modern deep sub-micron ULSI designs with hundreds of
millions of devices require huge grids for power distribution.
Such grids, operating with increasingly low-power
voltages, are a design limiting factor and accurate analysis
of their behavior is of paramount importance as any
voltage drops can seriously impact performance or functionality.
As power grid models have millions of unknowns,
highly optimized special purpose simulation tools are required
to handle the time and memory complexity of solving
for their dynamic behavior. In this work, we propose a hierarchical
matrix representation of the power grid model that
is both space and time efficient. With this representation,
reduced storage matrix factors are efficiently computed and
applied in the analysis at every time-step of the simulation.
Results show an almost linear complexity growth, namely
$O(n \log^{a} n)$, for some small constant $a$, in both space and
time, when using this matrix representation. Comparisons
of our academic implementation with production-quality
code prove this method to be very efficient when dealing
with the simulation of large power grid models.
-
High-Frequency Mutual Impedance Extraction of VLSI Interconnects in the Presence of a
Multi-Layer Conducting Substrate [p. 426]
-
N. Srivastava, R. Suaya and K. Banerjee
We propose a computationally efficient method to calculate,
with high accuracy, the mutual impedance between two wires in
the presence of multilayer substrates, as needed for high
frequency CAD applications. The resulting accuracy (errors
smaller than 2%) and CPU time reduction (factors of seven)
emerge from three different ingredients: a two dimensional
Green's function approach with the correct quasi-static limit, a
modified discrete complex image approximation to the Green's
function, and a novel discrete dipole approximation to evaluate
the magnetic vector potential. This approach permits the
evaluation of the mutual impedance between two loops in terms
of easily computable analytical expressions that involve the
relative separations and the electromagnetic parameters of the
multi-layer substrate. The results are valid for long wires, for
any separation, and for frequencies up to 100 GHz.
-
ETBR: Extended Truncated Balanced Realization Method for On-Chip Power Grid Network
Analysis [p. 432]
-
D. Li, S.X.-D. Tan and B. McGaughy
In this paper, we present a novel simulation approach for
power grid network analysis. The new approach, called
ETBR for extended truncated balanced realization, is based
on model order reduction techniques to reduce the circuit
matrices before the simulation. Unlike the (improved)
extended Krylov subspace methods EKS/IEKS [15,
2], ETBR performs fast truncated balanced realization on
the response Gramian to reduce the original system at a
computation cost similar to that of EKS. ETBR also avoids the
adverse explicit moment representation of the input signals.
Instead, it uses spectrum representation of input signals by
fast Fourier transformation. As a result, ETBR is more flexible
for different types of input sources and can better capture
the high frequency contents than EKS, and this leads
to more accurate results especially for fast changing input
signals. Experimental results on a number of large networks
(up to one million nodes) show that, given the same order
of the reduced model, ETBR is indeed more accurate
than the EKS method especially for input sources rich in
high-frequency components. ETBR also shows computation
costs similar to those of EKS and lower memory consumption than
EKS.
-
Bandwidth-Centric Optimization for Area-Constrained Links with Crosstalk Avoidance
Methods [p. 438]
-
B. Halak and A. Yakovlev
The effect of crosstalk avoidance codes on the
throughput of fixed width communication channels is
studied. Closed form expressions of the throughput which
incorporate the dimensions of the interconnects and the
wire overheads incurred by such techniques are derived for lines
under different buffering conditions. These formulae are
utilised to optimise the bandwidth of fixed width parallel
buses under different latency and reliability constraints.
Our results are confirmed by the simulations we have
performed in Spectre for a UMC CMOS 90nm
technology.
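The wire overhead of a crosstalk avoidance code can be illustrated by counting codewords. The sketch below uses the forbidden-pattern-free condition (no three adjacent wires carrying 010 or 101), one commonly described family of such codes; whether this is the code family analysed in the paper is an assumption, and the function names are hypothetical.

```python
from itertools import product
from math import floor, log2


def forbidden_pattern_free(word):
    """True if no three adjacent wires carry 010 or 101."""
    return all(word[i:i + 3] not in ((0, 1, 0), (1, 0, 1))
               for i in range(len(word) - 2))


def codebook(wires):
    """All words of the given width that satisfy the condition."""
    return [w for w in product((0, 1), repeat=wires)
            if forbidden_pattern_free(w)]


def payload_bits(wires):
    """Data bits that fit in the codebook of a bus with `wires` wires."""
    return floor(log2(len(codebook(wires))))


for wires in (3, 4, 5):
    print(wires, len(codebook(wires)), payload_bits(wires))
```

For a fixed bus width, the gap between the number of wires and the payload bits is the overhead that a throughput expression has to weigh against the reduced crosstalk delay.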
Moderators: J. Dielissen, NXP Semiconductors, NL; C. Bouganis, Imperial College London, UK
-
Optimizing Near-ML MIMO Detector for SDR Baseband on Parallel Programmable
Architectures [p. 444]
-
M. Li, B. Bougard, D. Novo, L. Van Der Perre and F. Catthoor
ML and near-ML MIMO detectors have attracted
a lot of interest in recent years. However, almost all the reported
implementations are delivered in ASICs or FPGAs. Our contribution
is optimizing the near-ML MIMO detector for parallel
programmable architectures, such as those with ILP and DLP
features. In the proposed SSFE (Selective Spanning with Fast
Enumeration), architecture-friendliness is explicitly introduced
from the very beginning of the design flow. Importantly, high
level algorithmic transformations make the dataflow pattern and
structure fit architecture-characteristics very well. We enable
abundant vector-parallelism with highly regular and deterministic
dataflow in the SSFE; memory rearrangements, shuffling and
non-predictable dynamism are all elaborately excluded. Hence,
the SSFE can be easily parallelized and efficiently mapped onto
ILP and DLP architectures. Furthermore, to fine-tune the SSFE
on parallel architectures, extensive pre-compiler transformations
are applied with the help of the application-level information.
These optimize not only computation operations but also address generation
and memory accesses. Experiments show that the
SSFE achieves very efficient resource utilization on real-life VLIW
architectures. Specifically, with the SSFE the percentage of NOP
instructions on VLIW is below 1%, even better than that achieved
by the software-pipelined FFT. To the best of our knowledge, this
is the first reported work about comprehensive optimizations of
near-ML MIMO detectors for parallel programmable architectures.
-
Vectorization of Reed Solomon Decoding and Mapping on the EVP [p. 450]
-
A. Kumar and K. Van Berkel
Reed Solomon (RS) codes are used in a variety of (wireless)
communication systems. Although commonly implemented
in dedicated hardware, this paper explores the mapping
of high-throughput RS decoding on vector DSPs. The
four modules of such a decoder, viz. Syndrome Computation,
Key Equation Solver, Chien Search, and Forney pose
different vectorization challenges. Their vectorizations are
explained in detail, including optimizations specific to the Embedded
Vector Processor (EVP). For RS (255,239), this solution
is benchmarked against published implementations, and
scalability up to vector size 64 is explored. The best-case and
worst-case throughputs of our implementation are 8 times and
2 times higher, respectively, than those of other architectures.
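As an illustration of the first decoder module, the sketch below computes the 16 syndromes of an RS (255,239) word in scalar Python. It is not the EVP mapping described in the paper. The field polynomial 0x11D, the choice of roots alpha^0 to alpha^15, and the coefficient ordering are assumptions; real systems fix these in their standard.

```python
PRIM_POLY = 0x11D   # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1

# Exponential and logarithm tables for GF(2^8) with generator alpha = 2.
EXP = [0] * 510
LOG = [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= PRIM_POLY
for power in range(255, 510):
    EXP[power] = EXP[power - 255]


def gf_mul(x, y):
    """Multiply two field elements using the log/antilog tables."""
    if x == 0 or y == 0:
        return 0
    return EXP[LOG[x] + LOG[y]]


def syndromes(received, num_parity=16):
    """S_j = r(alpha^j) for j = 0..num_parity-1, using Horner's rule.
    received[0] is the highest-degree coefficient."""
    result = []
    for j in range(num_parity):
        root = EXP[j]
        acc = 0
        for symbol in received:
            acc = gf_mul(acc, root) ^ symbol
        result.append(acc)
    return result


# Single error of value 7 at the coefficient of x^3 in an all-zero codeword.
word = [0] * 255
word[255 - 1 - 3] = 7
print(syndromes(word))
```

Each syndrome is an independent polynomial evaluation, so the outer loop over `j` is a natural candidate for distribution over vector lanes.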
-
A Case Study in Reliability-Aware Design: A Resilient LDPC Code Decoder [p. 456]
-
M. May, M. Alles and N. Wehn
Chip reliability becomes a great threat to the design of
future microelectronic systems with the continuation of the
progressive downscaling of CMOS technologies. Hence increasing
the robustness of chip implementations in terms of
error tolerance becomes an important issue. In this paper
we present a case study in reliability-aware design tolerating
transient errors. A state-of-the-art WiMAX channel decoder for
LDPC codes is investigated on all design levels to increase
its reliability for a given system performance
with minimum hardware overhead. We show that an efficient
exploitation of the algorithmic fault-tolerance yields
a fairly small area overhead with nearly no degradation in
communications performance even under high error injection rates.
Moderators: T. Yoneda, Nara Inst. of Science and Technology, JP; J. Schloeffel, NXP
Semiconductors, NL
-
Low Power Illinois Scan Architecture for Simultaneous Power and Test Data Volume
Reduction [p. 462]
-
A. Chandra, F. Ng and R. Kapur
We present Low Power Illinois scan architecture (LPILS)
to achieve power dissipation and test data volume
reduction, simultaneously. By using the proposed scan
architecture, dynamic power dissipation during scan
testing in registers and combinational cells can be
significantly reduced without modifying the clock tree of
the design. The proposed architecture is independent of
the ATPG patterns and imposes a very small
combinational area penalty due to the logic added
between the scan cells and the CUT. Experimental results
for two industrial circuits show that we can simultaneously
achieve up to 47% reduction in dynamic power dissipation
due to switching and 10X test data volume reduction with
LPILS over basic scan.
-
Scan Chain Organization for Embedded Diagnosis [p. 468]
-
M. Elm and H.-J. Wunderlich
Keeping diagnostic resolution as high as possible while
maximizing the compaction ratio has been a subject of research since
the advent of embedded test. In this paper, we present a
novel scan design methodology to maximize diagnostic resolution
when compaction is employed. The essential idea
is to consider the diagnostic resolution during the clustering
of scan elements to scan chains. Our methodology does
not depend on a fault model and is helpful with any type of
compactor.
A linear time heuristic is presented to solve the scan
chain clustering problem. We evaluate our approach for
industrial and academic benchmark circuits. It turns out to
be superior to both random and layout-driven scan chain
clustering. The methodology is applicable to any gate-level
design and fits smoothly into an industrial design flow.
Keywords - Design for diagnosis, embedded test, scan
design
-
State Skip LFSRs: Bridging the Gap between Test Data Compression and Test Set
Embedding for IP Cores [p. 474]
-
V. Tenentes, X. Kavousianos and E. Kalligeros
We present a new type of Linear Feedback Shift Registers,
State Skip LFSRs. State Skip LFSRs are normal LFSRs with
the addition of a small linear circuit, the State Skip circuit,
which can be used, instead of the characteristic-polynomial
feedback structure, for advancing the state of the LFSR. In
such a case, the LFSR performs successive jumps of constant
length in its state sequence, since the State Skip circuit omits a
predetermined number of states by calculating directly the
state after them. By using State Skip LFSRs we get the well-known
high compression efficiency of test set embedding with
substantially reduced test sequences, since the useless parts
of the test sequences are dramatically shortened by traversing
them in State Skip mode. The length of the shortened test sequences
approaches that of test data compression methods. A
systematic method for minimizing the test sequences of reseeding-based
test set embedding methods, and a low overhead
decompression architecture are also presented.
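
The jump described above can be illustrated in software. The following is a minimal sketch, not the authors' circuit: it treats one LFSR step as a linear map over GF(2) and builds a "skip" map equal to K consecutive steps, so that one application jumps K states. The register width, tap mask, and skip length (N, TAPS, K) are arbitrary assumptions chosen for illustration.

```python
def lfsr_step(state: int, taps: int, n: int) -> int:
    """Advance an n-bit Fibonacci LFSR by one state (normal mode)."""
    feedback = bin(state & taps).count("1") & 1  # XOR of the tapped bits
    return ((state << 1) | feedback) & ((1 << n) - 1)


def step_matrix(taps: int, n: int) -> list:
    """One-step map as a GF(2) matrix; column j is the image of unit vector e_j."""
    return [lfsr_step(1 << j, taps, n) for j in range(n)]


def mat_vec(cols: list, state: int) -> int:
    """Multiply a GF(2) matrix (stored as column bitmasks) by a state vector."""
    out, j = 0, 0
    while state:
        if state & 1:
            out ^= cols[j]
        state >>= 1
        j += 1
    return out


def mat_mul(a: list, b: list) -> list:
    """Matrix product a*b over GF(2): apply a to every column of b."""
    return [mat_vec(a, col) for col in b]


def skip_matrix(taps: int, n: int, k: int) -> list:
    """k-th power of the one-step matrix, by square-and-multiply."""
    result = [1 << j for j in range(n)]  # identity matrix
    base = step_matrix(taps, n)
    while k:
        if k & 1:
            result = mat_mul(base, result)
        base = mat_mul(base, base)
        k >>= 1
    return result


# Hypothetical parameters, chosen only for this illustration.
N, TAPS, K = 8, 0b10111000, 5
SKIP = skip_matrix(TAPS, N, K)


def skip(state: int) -> int:
    """State Skip mode: jump K states ahead with one linear transformation."""
    return mat_vec(SKIP, state)
```

In hardware, each row of the skip matrix would correspond to an XOR of state bits, which is why the skip logic stays a small linear circuit in this sketch.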
-
Automated Testability Enhancements for Logic Brick Libraries [p. 480]
-
J.G. Brown, B. Taylor, R.D.S. Blanton and L. Pileggi
Circuit fabrics composed of highly regular structures, called
logic bricks, have been described recently for improving
yield. An automated logic brick design flow based on a
SAT formulation of the brick routing has been developed
to minimize wire length and the number of vias while maintaining
several design-for-manufacturability constraints. In
this work, testability enhancements are imposed into a logic
brick to reduce the likelihood of (i) feedback bridges to
improve test and (ii) equivalent faults to improve diagnosis.
This is accomplished by adding constraints to the SAT
formulation of the logic brick routing that restrict certain
wires from being routed in close proximity, thus making
bridges between them unlikely. Application to several
brick designs resulted in critical-area reductions for targeted
bridges with little degradation in terms of additional
wire length and via count.
Moderators: E. Brinksma, Embedded Systems Institute, NL; P. Mosterman, The MathWorks, US
-
A Game-Theoretic Approach to Real-Time System Testing [p. 486]
-
A. David, K.G. Larsen, S. Li and B. Nielsen
This paper presents a game-theoretic approach to the
testing of uncontrollable real-time systems. By modelling
the systems with Timed I/O Game Automata and specifying
the test purposes as Timed CTL formulas, we employ a recently
developed timed game solver UPPAAL-TIGA to synthesize
winning strategies, and then use these strategies to
conduct black-box conformance testing of the systems. The
testing process is proved to be sound and complete with respect
to the given test purposes. A case study and preliminary
experimental results indicate that this is a viable approach
to uncontrollable timed system testing.
-
Modeling Event Stream Hierarchies with Hierarchical Event Models [p. 492]
-
J. Rox and R. Ernst
Compositional Scheduling Analysis couples local
scheduling analysis via event streams. While local analysis
has successfully been extended to include hierarchical
scheduling strategies, event streams are still flat. In this
paper, we generalize the concept of a stream hierarchy to
embed different types of streams in a higher level structure.
We explain why this extension is a natural match to model
streams generated by communication stacks that are ubiquitous
in networked embedded systems. We formally define
the hierarchical event model and give operations to encode,
combine, and extract stream properties that can be used in
flat or hierarchical local scheduling analysis. Finally, we
give an example and demonstrate that the proposed model
enables superior analysis results.
-
Semantics for Model-Based Validation of Continuous/Discrete Systems [p. 498]
-
L. Gheorghe, F. Bouchhima, G. Nicolescu and H. Boucheneb
Continuous and discrete components can be
integrated in diverse systems including defense,
medical, electronic, communication, and automotive
applications. Given the heterogeneity of concepts
that have to be taken into consideration, their design
involves overcoming specific global modeling and
validation challenges. This paper presents semantics
for model-based validation of continuous/discrete
systems. It focuses on the simulation interfaces
semantics, representation and verification. The
proposed approach is applied for the validation of a
continuous/discrete medical system, an automatic
glycemia level regulator.
-
Using UML as Front-End for Heterogeneous Software Code Generation Strategies [p. 504]
-
L.B. Brisolara, M.F.S. Oliveira, R. Redin, L.C. Lamb, L. Carro and F. Wagner
In this paper we propose an embedded software design
flow, which starts from a UML model and provides
automatic mapping to other models like Simulink or
finite-state machines (FSM). An automatic synthesis of an
executable and synthesizable Simulink model is also
proposed, enabling the use of UML as front-end for a
multi-model design strategy that includes a Simulink-based
MPSoC target design flow. In addition, the
proposed synthesis tool automatically handles processor
allocation, mapping of threads to processors, and
insertion of required Simulink temporal barriers, ports,
and dataflow connections. Following this approach, the
UML model is mapped to the most appropriate model
and specialized code generators are used. Therefore, this
approach allows designers to employ UML to model the
whole system and reuse this model to generate code using
different strategies and targeting different platforms.
Organizer: S. Turnoy, Synopsys, US
Moderator: P. Wintermeyr, Elektronik.net, DE
-
PANEL - Caution Ahead: The Road to Design and Manufacturing at 32 and 22 nm [p. 510]
Panelists: R. Aitken, R. Lauwereins, J. Tracy Weed, V. Kiefer and J. Hartmann
-
At 32 and 22 nm, which manufacturing technology changes will be so revolutionary as to cause upheavals in the
semiconductor supply chain and in design practices?
- Will there be economic fallout from the higher mask cost associated with dual patterning? How will
designers deal with place-and-route restrictions?
- How likely is "direct write"? What design and OPC tool changes will be required?
- When dealing with stress and CMP, will we need to replace DRC with a new breed of tools?
- How will designers "sign off" on a design at 32 nm?
These are just some of the challenges ahead. For every solution, collateral adjustments must be made to design
technologies and methodologies. Everyone from designer to foundry equipment manufacturer would do well to
look ahead at these potential hazards on the road to 32 and 22 nm.
-
Fault Clustering in Deep-Submicron CMOS Processes [p. 511]
-
J. Schat
The fraction of ICs that pass all production tests but fail
in the application is called the defect level. Defect levels
depend on the average number of defects per IC, and also
on the clustering of these defects. High clustering leads to
a higher yield and a lower defect level.
This paper compiles the coefficients for defect clustering
using research findings from 1970 until 2001. Because
recent data for deep submicron processes are missing in
the literature, the clustering coefficient has been calculated
using scan fail distributions of ICs in a 180 nm
process.
Clustering coefficients show a steady trend towards
higher defect clustering. This is beneficial, but it is
probably not sufficient to achieve today's ambitious target
of 'zero defects'.
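
The effect of clustering on yield and defect level can be sketched numerically. The example below assumes the negative binomial yield model that is common in the yield literature, where a smaller parameter alpha means stronger clustering, and it assumes each defect is detected independently with probability equal to the test coverage. These modelling choices and all parameter values are illustrative assumptions, not data from the paper.

```python
import math


def yield_nb(lam: float, alpha: float) -> float:
    """Probability of zero defects with mean defect count lam.

    Assumed negative binomial model: smaller alpha = stronger clustering.
    alpha = infinity gives the unclustered Poisson limit exp(-lam).
    """
    if math.isinf(alpha):
        return math.exp(-lam)
    return (1.0 + lam / alpha) ** (-alpha)


def defect_level(lam: float, alpha: float, coverage: float) -> float:
    """Fraction of ICs that pass the test but still contain a defect.

    Assumption: each defect is detected independently with probability
    `coverage`, so the detected defects follow the same model with mean
    lam * coverage and the same alpha.
    """
    p_defect_free = yield_nb(lam, alpha)
    p_pass_test = yield_nb(lam * coverage, alpha)
    return 1.0 - p_defect_free / p_pass_test


# Hypothetical values: one defect per IC on average, 90% coverage.
LAM, COVERAGE = 1.0, 0.9
ALPHAS = [math.inf, 2.0, 1.0, 0.5]  # increasing clustering
YIELDS = [yield_nb(LAM, a) for a in ALPHAS]
DEFECT_LEVELS = [defect_level(LAM, a, COVERAGE) for a in ALPHAS]
```

Under these assumptions, YIELDS increases and DEFECT_LEVELS decreases along the list, which matches the direction stated in the abstract.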
-
Energy Efficient and High Speed On-Chip Ternary Bus [p. 515]
-
C. Duan and S.P. Khatri
We propose two crosstalk reducing coding schemes
using ternary busses. In addition to low power consumption and
reduced delay, our schemes offer other advantages over binary
coding schemes such as zero area overhead and simple, regular
and fast CODEC design.
-
Task Scheduling with Configuration Prefetching and Anti-Fragmentation Techniques on
Dynamically Reconfigurable Systems [p. 519]
-
F. Redaelli, M.D. Santambrogio and D. Sciuto
The aim of this paper is to define a scheduling of the task
graph of an application that minimizes its total execution
time on a partially dynamically reconfigurable FPGA. The
scheduler has to take into account the reconfiguration overhead
of each task, the area constraint of the target FPGA,
the precedences between the tasks, configuration prefetching
and module reuse. We introduce an ILP formulation to
solve the task scheduling problem in the reconfigurable architecture
scenario. This formulation has been used to identify
interesting features for a possible heuristic scheduler.
The results of the ILP solution show how a reconfiguration-aware
scheduler exploiting all the reconfiguration features
can outperform one with partial knowledge.
-
Fast Analog Circuit Synthesis Using Sensitivity Based Near Neighbor Searches [p. 523]
-
A. Pradhan and R. Vemuri
We present an efficient analog synthesis algorithm
employing regression models of circuit matrices. Circuit
matrix models achieve accurate and speedy synthesis of
analog circuits. In this paper, synthesis is accelerated
by eliminating numerous computations of the matrix elements
during a synthesis run. Computations are avoided
by reusing exact or nearby design points visited during
previous synthesis iterations. Hashing and multidimensional
nearest neighbor lookup are used in incremental
evaluation of design solutions encountered during synthesis.
Sensitivity of the design variables is considered
for locating a neighboring solution. Neighbor lookup is
efficiently performed using box-decomposition trees. The
proposed method is used to synthesize three benchmark
circuits. Results show that with hashing and neighbor
lookup, synthesis is 6x-13x faster than with the use of
matrix models alone.
-
Spatial Correlation Extraction via Random Field Simulation and Production Chip
Performance Regression [p. 527]
-
B. Liu
Statistical timing analysis needs a priori knowledge of
process variations. Lack of such a priori knowledge of process
variations prevents accurate statistical timing analysis,
for which foundry confidentiality policy has largely been
blamed. A significant part of process variations are design
specific, and can only be extracted from production
chip performance statistics. In this paper, I adopt the homogeneous
isotropic random field model for intra-die random
variations, apply fast Fourier transform (FFT) to simulate
a homogeneous isotropic random field, obtain corners
for Monte Carlo SPICE simulation of timing critical
paths in a VLSI circuit, and apply regression to match production
chip performance statistics. Experimental results
based on a timing critical path in an industry design with
65nm Predictive Technology Models reveal constant mean,
increased standard deviation, and decreased skewness of
a signal propagation path delay as spatial correlation increases.
The proposed spatial correlation extraction technique
can be applied in a chip tapeout process, where process
variations extracted from an early tapeout help to improve
statistical timing analysis accuracy and guide engineering
change order of subsequent tapeouts.
-
A Methodology for Improving Software Design Lifecycle in Embedded Control Systems [p. 533]
-
M.E.M. Ben Gaid, R. Kocik, Y. Sorel and R. Hamouche
Control design and real-time implementation are usually
performed in isolation. The effects of the computer implementation
on control system performance are still evaluated
in the last phases of the development cycle. It is expected
that modeling the computer implementation in order to simulate
its impact on control would help reduce the length
and the effort of the development cycle. This paper proposes
ideas towards achieving these objectives. To this end,
implementation effect on control performance is first studied.
Then, we describe the preliminary ideas of a methodology
considering a control law designed with the Scicos
simulation environment and implemented on a distributed
architecture with the SynDEx system-level CAD tool. This
methodology allows simulating the impact of the distributed
implementation early in the design lifecycle and provides an
automatic code generation of this implementation.
-
Finding the Worst Voltage Violation in Multi-Domain Clock Gated Power Network [p. 537]
-
W. Zhang, Y. Zhu, W. Yu, L. Zhang, R. Shi, H. Peng, Z. Zhu, L. Chua-Eoan, R. Murgai, T.
Shibuya, N. Ito and C.-K. Cheng
This paper proposes an efficient method to find the
worst case of voltage violation by multi-domain clock
gating in an on-chip power network. We first present a
voltage response in an arbitrary multi-domain clock
gating pattern, using a superposition technique. Then, an
integer linear programming (ILP) formulation is proposed
to identify the worst-case gating pattern and the maximum
variation area. The ILP based method is significantly
faster than a conventional method based on enumeration.
The experimental results are also compared with a case
where peak voltage variation is induced, which shows the
latter technique largely underestimated the overall
variation effect.
-
A System Architecture for Reconfigurable Trusted Platforms [p. 541]
-
B. Glas, A. Klimm, O. Sander, K. Müller-Glaser and J. Becker
For improving the security of embedded systems, trusted
computing is a promising technology. For the area of microprocessors
in general and personal computers in particular
the Trusted Computing Group (TCG) has published detailed
specifications. The resulting hardware has been available
for some years. This contribution discusses the feasibility
of deploying ideas from trusted computing in the domain of
reconfigurable hardware, esp. FPGAs, and possible benefits
and drawbacks. We propose using currently
available FPGA technology to build a trusted platform on
reconfigurable hardware. We also show how trusted computing
can deal with partial dynamic reconfiguration while
still allowing the user to fully exploit its potential.
Keywords: Trusted computing, TPM, FPGA, reconfigurable
hardware, partial dynamic reconfiguration,
embedded systems.
-
Automatic Generation of Complex Properties for Hardware Designs [p. 545]
-
F. Rogin, T. Klotz, G. Fey, R. Drechsler and S. Rülke
Property checking is a promising approach to prove the
correctness of today's complex designs. However, in practice
this requires the formulation of formal properties which
is a time consuming and non-trivial task. Therefore the
acceptance and efficiency of formal verification techniques
can be raised by an automated support for formulating design
properties. In this paper we propose a new methodology
to automatically generate complex properties for a
given design. The tool, Dianosis, implements this methodology
by analyzing a simulation trace. The extracted properties
describe the abstract design behavior and are presented
in a format that is easy to read and can be added
to the set of properties used for formal or assertion-based
verification. We provide experimental results on industrial
hardware designs that show the effectiveness of Dianosis
and motivate the practical use.
Organizers: M. Di Natale, Scuola S Anna, Pisa, IT; A. Sangiovanni-Vincentelli, UC Berkeley, US
Moderator: M. Di Natale, Scuola S Anna, Pisa, IT
-
Software Components for Reliable Automotive Systems [p. 549]
-
H. Heinecke, W. Damm, B. Josko, A. Metzner, H. Kopetz, A. Sangiovanni-Vincentelli and M. Di Natale
System-level integration requires an overall understanding
of the interplay of the sub-systems to enable component-based
development with portability, reconfigurability and
extensibility, together with guaranteed reliability and performance
levels. Integration by simple interfaces and plug-and-play
of sub-systems, which is the main objective of AUTOSAR,
requires solving essential technical problems. We
discuss to what degree the existing AUTOSAR standard can
support the development of safety- and time-critical software
and what is required to move toward the desirable
goal of timing isolation when integrating multiple applications
into the same execution platform.
Moderator: A. Sangiovanni-Vincentelli, UC Berkeley, US
-
Model-Based-Design is Nice, But... [p. 555]
-
H. Hanselmann
Without Model-Based-Design (MBD) today's automotive embedded systems would not exist. However, MBD
generates its own challenges. Tools and concepts are helping in many areas, but the user's needs often seem to
outpace the capabilities of tools and processes, especially for large systems with complex software interacting
across boundaries. System Design is underdeveloped. In this keynote, an assessment of the current situation is
given as well as a vision of how developers should design and test systems in the future.
Moderators: M. Lajolo, NEC Labs, US; F. Gaffiot, INL - ECL, FR
-
A Simulation Methodology for Worst-Case Response Time Estimation of Distributed
Real-Time Systems [p. 556]
-
S. Samii, S. Rafiliu, P. Eles and Z. Peng
In this paper, we propose a simulation-based methodology
for worst-case response time estimation of distributed real-time
systems. Schedulability analysis produces pessimistic
upper bounds on process response times. Consequently, such
an analysis can lead to overdesigned systems resulting in unnecessarily
increased costs. Simulations, if well conducted,
can lead to tight lower bounds on worst-case response times,
which can be an essential input at design time. Moreover,
such a simulation methodology is very important in situations
when the running application or the underlying platform
is such that no formal timing analysis is available.
Another important application of the proposed simulation
environment is the validation of formal analysis approaches,
by estimating their degree of pessimism. We have performed
such an estimation of pessimism for two response-time
analysis approaches for distributed embedded systems
based on two of the most important automotive communication
protocols: CAN and FlexRay.
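
The relation between an analytic upper bound and a simulated lower bound can be sketched on a much simpler system than the distributed CAN and FlexRay setups of the paper. The example below assumes a single processor with fixed-priority preemptive scheduling, uses the classic response-time recurrence as the formal analysis, and uses a unit-step simulation for the observed response times. The task set and offsets are hypothetical.

```python
import math


def rta_upper_bound(tasks, i):
    """Classic response-time analysis for task i (index 0 = highest priority).

    Each task is (C, T): execution time and period, with deadline = period.
    The analysis ignores release offsets, so it may be pessimistic.
    """
    c_i, t_i = tasks[i]
    r = c_i
    while True:
        nxt = c_i + sum(math.ceil(r / t_j) * c_j for c_j, t_j in tasks[:i])
        if nxt == r or nxt > t_i:  # fixed point reached or deadline exceeded
            return nxt
        r = nxt


def simulate_max_response(tasks, offsets, horizon):
    """Unit-step simulation; returns the largest observed response time per task.

    Assumes a schedulable task set, so a job finishes before the next release.
    """
    n = len(tasks)
    remaining = [0] * n  # remaining execution time of the current job
    release = [0] * n    # release time of the current job
    worst = [0] * n
    for t in range(horizon):
        for k, (c, period) in enumerate(tasks):
            if t >= offsets[k] and (t - offsets[k]) % period == 0:
                remaining[k], release[k] = c, t
        for k in range(n):  # run the highest-priority ready task for one unit
            if remaining[k] > 0:
                remaining[k] -= 1
                if remaining[k] == 0:
                    worst[k] = max(worst[k], t + 1 - release[k])
                break
    return worst


# Hypothetical task set (C, T), sorted by priority.
TASKS = [(1, 4), (2, 6), (3, 12)]
BOUNDS = [rta_upper_bound(TASKS, i) for i in range(len(TASKS))]
OBSERVED = simulate_max_response(TASKS, offsets=[0, 2, 3], horizon=48)
PESSIMISM = [b - o for b, o in zip(BOUNDS, OBSERVED)]
```

Every observed value is a lower bound on the true worst case, and the gap to the analytic bound gives an estimate of how pessimistic the analysis might be for the simulated configuration.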
-
Signal Probability Based Statistical Timing Analysis [p. 562]
-
B. Liu
VLSI timing analysis and power estimation target the
same circuit switching activity. Power estimation techniques
are categorized as (1) static, (2) statistical, and (3)
simulation and testing based methods. Similarly, statistical
timing analysis methods are in three counterpart categories:
(1) statistical static timing analysis, (2) probabilistic
technique based statistical timing analysis, and (3)
Monte Carlo (SPICE) simulation and testing. Leveraging
existing power estimation techniques, I propose signal
probability (i.e., the logic one occurrence probability
on a net) based statistical timing analysis (SPSTA), for improved accuracy
and reduced pessimism over the existing statistical
static timing analysis methods, and improved efficiency over
Monte Carlo (SPICE) simulation. Experimental results on
ISCAS benchmark circuits show that SPSTA computes the
means (standard deviations) of the maximum signal arrival
times within 5.6% (7.7%), SSTA within 16.5% (46.9%), and
STA within 83.0% (132.4%) on average of Monte Carlo simulation
results, respectively. More significant accuracy improvements
are expected in the presence of increased process
and environmental variations.
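
The notion of signal probability used above can be made concrete with a small sketch. It propagates the probability of a logic one through basic gates under the assumption that gate inputs are statistically independent. This is a textbook simplification for illustration only and is not the SPSTA algorithm itself; the example circuit is hypothetical.

```python
import math


def p_not(p: float) -> float:
    """Probability that the output of an inverter is logic one."""
    return 1.0 - p


def p_and(*ps: float) -> float:
    """AND gate: all inputs must be one (inputs assumed independent)."""
    return math.prod(ps)


def p_or(*ps: float) -> float:
    """OR gate: output is zero only if all inputs are zero."""
    return 1.0 - math.prod(1.0 - p for p in ps)


def p_xor(a: float, b: float) -> float:
    """Two-input XOR: exactly one input is one."""
    return a + b - 2.0 * a * b


# Hypothetical circuit y = (a AND b) OR (NOT c), with P(one) = 0.5 on each input.
P_A = P_B = P_C = 0.5
P_Y = p_or(p_and(P_A, P_B), p_not(P_C))
```

Reconvergent fanout breaks the independence assumption, which is one reason real probabilistic analyses need more care than this sketch.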
-
A Current Source Model for CMOS Logic Cells Considering Multiple Input Switching and
Stack Effect [p. 568]
-
B. Amelifard, S. Hatami, H. Fatemi and M. Pedram
This paper presents a current source model (CSM) of a
CMOS logic cell, which captures simultaneous switching
of multiple inputs while accounting for the effect of
internal node voltages of the logic cell. Characterization
procedures for various components of the proposed CSM
are described and application of the model to output
waveform computation is discussed. Experimental
results to assess the accuracy and efficiency of the
proposed multiple input switching CSM in the context of
noise and timing analyses in VLSI circuits are reported.
-
Current Source Based Standard Cell Model for Accurate Signal Integrity and Timing
Analysis [p. 574]
-
A. Goel and S. Vrudhula
The inductance and coupling effects in interconnects
and non-linear receiver loads have resulted in complex input
signals and output loads for gates in the modern deep submicron
CMOS technologies. As a result, the conventional method
of timing characterization, which is based on lookup tables
with input slew and output load capacitance as indices, is no
longer adequate. The focus has now shifted to current source
based standard cell models which are based on the fundamental
property of transconductance of MOSFETs. In this paper we
propose a systematic methodology for obtaining a current based
delay model for gates, which can accommodate both single (SIS)
and multi-input (MIS) switching signals of arbitrary shape and
complex non-linear output loads. We use an analytical model
for the gate output current expressed as a function of the node
voltages. This results in an average error less than 0.5% with
maximum standard deviation of 2.5% in error when compared
with SPICE for a large number of standard cells. When compared
with SPICE, using the proposed models gives stage delay and
output slew with an average error of less than 3% and 2%
respectively for arbitrary inputs and output load combinations.
Moderators: W. Schilders, NXP Semiconductors, NL; P. Feldmann, IBM T J Watson Research
Center, US
-
An Efficient Method for Chip-Level Statistical Capacitance Extraction Considering Process
Variations with Spatial Correlation [p. 580]
-
W. Zhang, W. Yu, Z. Wang, Z. Yu, R. Jiang and J. Xiong
An efficient method is proposed to consider the process
variations with spatial correlation, for chip-level
capacitance extraction based on the window technique. In
each window, an efficient technique of Hermite polynomial
collocation (HPC) is presented to extract the statistical
capacitance. The capacitance covariances between
windows are then calculated to reflect the spatial
correlation. The proposed method is practical for
chip-level extraction tasks, and the experiments on full-path
extraction exhibit its high accuracy and efficiency.
-
SPARE - A Scalable Algorithm for Passive, Structure Preserving, Parameter-Aware Model
Order Reduction [p. 586]
-
J. Fernández Villena and L.M. Silveira
In this paper we describe a flexible and efficient new
algorithm for model order reduction of parameterized systems.
The method is based on the reformulation of the parametric
system as a parallel interconnection of the nominal
transfer function and the non-parametric transfer function
sensitivities with respect to the parameter variations. Such
a formulation reveals an explicit dependence on each parameter
which is exploited by reducing each component system
independently via a standard non-parametric structure
preserving algorithm. Therefore, the resulting smaller size
interconnected system retains the structure of the original
with respect to parameter dependence. This allows for better
accuracy control, enabling independent adaptive order
determination with respect to each parameter and adding
flexibility in simulation environments. It is shown that the
method is efficiently scalable and preserves relevant system
properties such as passivity. The new technique can handle
fairly large parameter variations on systems whose outputs
exhibit smooth dependence on the parameters. Several
examples show that besides the added flexibility and control,
when compared with competing algorithms, the proposed
technique can, in some cases, produce smaller reduced
models with potential accuracy gains.
-
Transistor-Specific Delay Modeling for SSTA [p. 592]
-
B. Cline, K. Chopra, D. Blaauw, A. Torres and S. Sundareswaran
SSTA has received a considerable amount of attention in recent
years. However, it is a general rule that any approach can only be
as accurate as the underlying models. Thus, variation models are
an important research topic, in addition to the development of statistical
timing tools. These models attempt to predict fluctuations
in parameters like doping concentration, critical dimension (CD),
and ILD thickness, as well as their spatial correlations. Modeling
CD variation is a difficult problem because it contains a systematic
component that is context dependent as well as a probabilistic
component that is caused by exposure and defocus variation.
Since these variations are dependent on topology, modern-day
designs can potentially contain thousands of unique CD distributions.
To capture all of the individual CD distributions within statistical
timing, a transistor-specific model is required. However,
statistical CD models used in industry today do not distinguish
between transistors contained within different standard cell types
(at the same location in a die), nor do they distinguish between
transistors contained within the same standard cell. In this work
we verify that the current methodology is error-prone using a
90nm industrial library and lithography recipe (with industrial
OPC) and propose a new SSTA delay model that on average
reduces error of standard deviation from 11.8% to 4.1% when the
total variation (σ/μ) is 4.9% - a 2.9X reduction. Our model is
compatible with existing SSTA techniques and can easily incorporate
other sources of variation such as random dopant fluctuation
and line-edge roughness.
Moderators: B. Bougard, IMEC, BE; F. Kienle, Kaiserslautern U, DE
-
Generic Multi-Phase Software-Pipelined Partial-FFT on Instruction-Level-Parallel
Architectures and SDR Baseband Applications [p. 598]
-
M. Li, D. Novo, B. Bougard, L. Van Der Perre and F. Catthoor
The PFFT (Partial FFT) is an extended FFT where
only part of input or output bins are used. By pruning the
useless dataflow, the PFFT can potentially achieve a significant
speedup in many important applications. Although theoretical
aspects of the PFFT have been thoroughly studied in the past
three decades, efficient implementations were rarely reported.
The most important obstacle is the highly irregular dataflow
and the associated control flow. In addition, a size-N PFFT
has 2^N dataflow possibilities, so that delivering both flexibility
and efficiency in the same implementation is very challenging.
This paper presents a generic scheme to map the highly
irregular dataflow of arbitrary PFFT onto ILP architectures
with highly efficient SWP (SoftWare-Pipelining). Constraints
and opportunities of algorithms and architecture are carefully
analyzed and exploited. We introduce a multi-phase partitioning,
bringing heterogeneous control structures and heterogeneous
software pipelining schemes to minimize control overheads and
to maximize the efficiency of SWP. The proposal has been tested
with 10 representative benchmarks extracted from baseband
applications. In the experiments, cycle counts, instructions, NOPs,
L1D/L1P access/miss/hit are thoroughly analyzed. Compared to
full FFTs with efficient SWP, our work reduces cycle counts by
20.5% - 87.5%, instructions by 11.2% - 86.5%, L1D cache accesses
by 16.1% - 79.4% and L1P cache accesses by 19.5% - 87.1%. To the
best of our knowledge, this is the first reported work about the
generic software-pipelined PFFT on ILP architectures.
-
A Novel Recursive Algorithm for Bit-Efficient Realization of Arbitrary Length Inverse
Modified Cosine Transforms [p. 604]
-
R. Koenig, T. Stripf and J. Becker
In this paper a novel approach for Inverse Modified
Cosine Transform (IMDCT) computation is presented, based on
a recursive algorithm. Due to its nature, this IMDCT calculation
can be performed on a reduced bit width datapath without
loss of accuracy, compared to alternative recursive architectures.
Combined with the regular structure, the approach allows for
a much more area efficient VLSI implementation compared
to existing systems. Due to its bit efficiency this approach is
attractive to be implemented on reconfigurable architectures of
the DSP domain as well.
-
Definition and SIMD Implementation of a Multi-Processing Architecture Approach on FPGA [p. 610]
-
P. Bonnot, F. Lemonnier, G. Edelin, G. Gaillat, O. Ruch and P. Gauget
In a context of high performance, low technology
access cost and application code reusability objectives,
this paper presents an "architectured FPGA" approach
that consists in the definition of a general frame for
embedded system application implementations. Addressing
image processing as a first application domain, an FPGA
architecture implementation based on that approach is
presented. Built around SIMD architecture, the
"Ter@Core" FPGA implementation illustrates the
competitiveness of the approach compared to off-the-shelf
processors and to the usual FPGA approach. The presented
implementation gathers 128 processing elements on a
single FPGA providing 19.2 GOPS performance and very
high application development productivity.
Keywords: image processing, data dependent processing,
long lifecycle, FPGA, platform approach, domain specific
API, MIMD architecture, SIMD architecture, middleware.
Moderators: J. Segura, Balearic Islands U, ES; H. Manhaeve, Q-Star Test, BE
-
On Modeling and Testing of Lithography Related Open Faults In Nano-CMOS Circuits [p. 616]
-
A. Sreedhar, A. Sanyal and S. Kundu
Scaling of transistor feature size over time has been facilitated
by corresponding improvement in lithography technology.
However, in recent times the wavelength of the optical light
source used for photolithography has not scaled at the same rate
as that of the minimum feature size of the transistor. In fact,
starting with 180nm devices, the wavelength of optical source has
remained the same (at 193nm) due to difficulties in finding a
flicker-free, high energy, coherent light source with compatible
improvement in lens material for focusing this light.
Consequently, upcoming technology nodes (65nm, 45nm, 32nm
and 22nm) will be using a light source with wavelength much
greater than the feature size. This creates a peculiar problem
where line width on manufactured devices is a function of relative
spacing between adjacent lines. Despite numerous restrictions on
layout rules, interconnects may still suffer from constriction due to
this peculiarity, also known as the forbidden pitch problem. A small
manufacturing variation turns the constrictions to open faults.
Gate leakage current is a significant concern for present and
upcoming technology nodes. Due to gate leakage, an open fault is
not truly an open circuit. Our simulation studies show that the
leakage current steers the floating input of a gate to certain meta-stable
states. This property actually makes it easier to detect open
faults either through side channel excitation or by stuck-at tests.
The major contributions of this paper are (i) lithographic
simulation based identification of potential open fault sites, (ii)
identification of meta-stable input states for these open inputs, (iii)
length calculation for side channel signals for definitive detection
of open faults. Together, they provide a complete CAD framework
for testing lithography related open faults.
Keywords: Open Faults, Lithography, Forbidden Pitch,
Logic Switching Threshold
-
Optimal Margin Computation for At-Speed Test [p. 622]
-
J. Xiong, V. Zolotov, C. Visweswariah and P.A. Habitz
In the face of increased process variations, at-speed
manufacturing test is necessary to detect subtle delay defects.
This procedure necessarily tests chips at a slightly higher speed
than the target frequency required in the field. The additional
performance required on the tester is called test margin. There
are many good reasons for margin including voltage and temperature
requirements, incomplete test coverage, aging effects,
coupling effects and accounting for modeling inaccuracies. By
taking advantage of statistical timing, this paper proposes an
optimal method of test margin determination to maximize yield
while staying within a prescribed Shipped Product Quality Loss
(SPQL) limit. If process information is available from wafer
testing of scribe line structures or on-chip process monitoring
circuitry, this information can be leveraged to determine a per-chip
test margin which can further improve yield.
-
Resistive Bridging Fault Simulation of Industrial Circuits [p. 628]
-
P. Engelke, I. Polian, J. Schloeffel and B. Becker
We report the successful application of a resistive bridging
fault (RBF) simulator to industrial benchmark circuits. Despite
the slowdown due to the consideration of the sophisticated
RBF model, the run times of the simulator were within
an order of magnitude of the run times for pattern-parallel
complete-circuit stuck-at fault simulation. Industrial-size
circuits, including a multi-million-gates design, could be
simulated in reasonable time despite a significantly higher
number of faults to be simulated compared with stuck-at fault
simulation.
Keywords: Resistive bridging faults, bridging fault simulation,
case study
-
Physically-Aware N-Detect Test Pattern Selection [p. 634]
-
Y.-T. Lin, O. Poku, N.K. Bhatti and R.D.S. Blanton
N-detect test has been shown to have a higher likelihood
for detecting defects. However, traditional definitions of N-detect
test do not necessarily exploit the localized characteristics
of defects. In physically-aware N-detect test, the
objective is to ensure that the N tests establish N different
logical states on the signal lines that are in the physical
neighborhood surrounding the targeted fault site. We
present a test selection procedure for creating a physically-aware
N-detect test set that satisfies a user-provided constraint
on test-set size. Results produced for an industrial
test chip demonstrate the effectiveness and practicability
of our pattern selection approach. Specifically, we show
that we can detect virtually the same number of faults 10 or
more times as a traditional 10-detect test set, while increasing
the number of neighborhood states and the number of faults
with 10 or more states by 18.0% and 4.7%, respectively, without
increasing the number of tests over a traditional 10-detect
test set.
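The neighborhood-state metric described in this abstract can be illustrated with a minimal sketch. It assumes each detecting test is summarised as a (fault, neighborhood state) pair, where the state is a tuple of logic values on the neighboring lines. The function names and data layout are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def neighborhood_state_counts(detections):
    """Count distinct neighborhood states observed per fault.

    `detections` is an iterable of (fault, state) pairs, one per test that
    detects the fault; `state` holds the logic values on the lines that are
    physically close to the fault site (an assumed representation).
    """
    states = defaultdict(set)
    for fault, state in detections:
        states[fault].add(tuple(state))
    return {fault: len(seen) for fault, seen in states.items()}

def faults_meeting_n(counts, n):
    """Return the faults that were detected under at least n distinct states."""
    return sorted(fault for fault, count in counts.items() if count >= n)
```

Two tests that detect the same fault under the same neighborhood state count only once here, which is the difference from a traditional N-detect count.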
Moderators: T. Givargis, UC Irvine, US; P. Pop, DTU, DK
-
Computation of Buffer Capacities for Throughput Constrained and Data Dependent Inter-Task
Communication [p. 640]
-
M.H. Wiggers, M.J.G. Bekooij and G.J.M. Smit
Streaming applications are often implemented
as task graphs. Currently, techniques exist to derive
buffer capacities that guarantee satisfaction of a throughput
constraint for task graphs in which the inter-task communication
is data-independent, i.e. the amount of data produced
and consumed is independent of the data values in the
processed stream. This paper presents a technique to compute
buffer capacities that satisfy a throughput constraint
for task graphs with data dependent inter-task communication,
given that the task graph is a chain. We demonstrate
the applicability of the approach by computing buffer capacities
for an MP3 playback application, of which the MP3
decoder has a variable consumption rate. We are not aware
of alternative approaches to compute buffer capacities that
guarantee satisfaction of the throughput constraint for this
application.
-
Constraint Refinement for Online Verifiable Cross-Layer System Adaptation [p. 646]
-
M. Kim, M.-O. Stehr, C. Talcott, N. Dutt and N. Venkatasubramanian
Adaptive resource management is critical to ensuring the
quality of real-time distributed applications, particularly for
energy-constrained mobile handheld devices. In this context,
an optimization that simultaneously considers multiple layers
(e.g., application, middleware, operating system) needs to be
developed for continuous adaptation of system parameters. The
tuning of system parameters greatly affects the system's ability
to meet QoS requirements, and also directly affects the energy
consumption and system robustness. We present a novel
approach to developing cross-layer optimization for resource-limited
real-time distributed systems, based on a constraint
refinement technique combined with formal specification and
feedback from system implementation. Our approach tunes the
parameters in a compositional manner allowing coordinated
interaction among sub-layer optimizers that enables holistic
cross-layer optimization. We present experiments on a realistic
multimedia application which demonstrate that constraint
refinement enables us to generate robust and near optimal parameter
settings. The constraint language can be used as an
interface for composition by encapsulating the details of local
optimization algorithms.
-
Adaptive Scheduling and Voltage Scaling for Multiprocessor Real-Time Applications with
Non-Deterministic Workload [p. 652]
-
P. Malani, P. Mukre, Q. Qiu and Q. Wu
The computational workload of some real-time
applications varies significantly during runtime, which makes
the task scheduling and power management a challenge. One of
the major influences on the workload of an application is the
selection of conditional branches, which may activate or
deactivate a large set of operations. Focusing on real-time
applications with variable workload which is due to random
branch selection, this paper presents a framework of task
mapping, scheduling and dynamic voltage and frequency scaling
(DVFS) for a multiprocessor system. The proposed framework
maintains workload awareness using dynamic profiling of
branch probability. The profiled information is utilized by the
scheduling and DVFS algorithms that are adopted in this
framework to generate a statistically optimal solution.
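To make the link between branch probabilities and frequency selection concrete, here is a minimal single-task sketch. It is not the framework of the paper: it only computes an expected cycle count from profiled branch probabilities and picks the lowest frequency that meets a deadline. All names, the cycle model, and the frequency list are assumptions for illustration.

```python
def expected_cycles(base_cycles, branches):
    """Expected cycle count of a task with conditional branches.

    `branches` is a list of (p_taken, cycles_if_taken, cycles_if_not_taken)
    tuples; p_taken would come from run-time profiling (assumed model).
    """
    return base_cycles + sum(p * taken + (1 - p) * not_taken
                             for p, taken, not_taken in branches)

def lowest_feasible_frequency(cycles, deadline_s, frequencies_hz):
    """Pick the slowest available frequency that still meets the deadline."""
    for freq in sorted(frequencies_hz):
        if cycles / freq <= deadline_s:
            return freq
    return None  # no available frequency meets the deadline
```

Sizing for the expected workload rather than the worst case saves energy on average. A real-time system would still need a fallback for when the expensive branch is actually taken.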
Organizer/Moderator: E. Schutz, STMicroelectronics, BE
-
ARTEMIS and ENIAC Joint Undertakings: A New Approach to Conduct Research in Europe [p. 658]
-
Presenters: K. Glinos, D. Beenaert, L. Gide
-
This special session will present the first two Europe-wide public-private R&D partnerships, ARTEMIS and
ENIAC. ARTEMIS will address the invisible computers (embedded systems) that today run all machines, from
cars, planes and phones to energy networks, factories, washing machines and televisions. ENIAC will
target the very high level of miniaturisation required for the next generations of nanoelectronics components.
These Joint Technology Initiatives (JTIs) on Embedded Computing Systems and Nanoelectronics will pool
industry, Member State and Commission resources into targeted research programmes. The session will include
global presentations on the initiatives and information on the expected research topics included in the first calls
in 2008.
Organizers:
M. Di Natale, Scuola S Anna, Pisa, IT;
A. Sangiovanni-Vincentelli, UC Berkeley, US
Moderator:
M. Di Natale, Scuola S Anna, Pisa, IT
-
Methods, Tools and Standards for the Analysis and Evaluation of Modern
Automotive Architectures [p. 659]
-
E. Frank, R. Wilhelm, R. Ernst, A. Sangiovanni-Vincentelli and M. Di Natale
Automotive systems are increasingly distributed and
complex. Reduced time-to-market, cost and safety concerns
require advance validation of the integrated system and
its components, from the functional, timing, and reliability
standpoints. In particular, function correctness and performance
may depend on communication and computation delays
imposed by the selected architecture platform. Hence,
the need for methods and tools capable of predicting the
system-level timing behaviour (latencies and jitter), resulting
from the HW platform selection, the synchronization between
tasks and messages, and also from the synchronization
and queuing policies of the middleware and RTOS levels.
In this paper, we review methods and tools for the evaluation
of the function performance and its timing correctness
by simulation or by worst case static analysis.
Moderators: F. Fummi, Verona U, IT; P. Sanchez, Cantabria U, ES
-
Random Stimulus Generation Using Entropy and XOR Constraints [p. 664]
-
S.M. Plaza, I.L. Markov and V. Bertacco
Despite the growing research effort in formal verification,
constraint-based random simulation remains an integral
part of design validation, especially for large design
components where formal techniques do not scale. However,
stimulating important aspects of a design to uncover
bugs often requires the construction of complex constraints
to guide stimulus generation. We propose Toggle, a stimulus
generation engine, which features (1) an entropy-based coverage
analysis to efficiently find portions of the design inadequately
sensitized by simulation and (2) a novel strategy
to automatically stimulate these portions through a specialized
SAT algorithm that uses small randomized XOR constraints.
As our experimental results demonstrate, Toggle
requires minimal input from the verification engineer, and
significantly improves the coverage qualities of the generated
stimuli when compared to plain random simulation.
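As an illustration of entropy-based coverage analysis, the sketch below computes the binary Shannon entropy of each signal from simulation samples and flags signals that rarely change value. The abstract does not give the exact entropy definition used by Toggle, so the formula, the threshold, and the trace format are assumptions.

```python
import math

def signal_entropy(samples):
    """Binary Shannon entropy, in bits, of a sequence of 0/1 samples."""
    if not samples:
        return 0.0
    p_one = sum(samples) / len(samples)
    if p_one in (0.0, 1.0):
        return 0.0  # a constant signal carries no information
    return -(p_one * math.log2(p_one) + (1 - p_one) * math.log2(1 - p_one))

def low_entropy_signals(trace, threshold=0.5):
    """Return names of signals whose entropy falls below the threshold.

    `trace` maps a signal name to its sampled values (hypothetical format).
    """
    return sorted(name for name, samples in trace.items()
                  if signal_entropy(samples) < threshold)
```

Signals reported by `low_entropy_signals` are candidates for the inadequately sensitized portions of a design that a targeted stimulus generator would then try to exercise.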
-
MCjammer: Adaptive Verification for Multi-Core Designs [p. 670]
-
I. Wagner and V. Bertacco
The challenge of verification of multi-core and multi-processor
designs grows dramatically with each new generation of systems
produced today. Validation of memory coherence of such
systems, which include multiple levels of cache and complex
protocols, constitutes a major fraction of this task. Unfortunately,
current tools are incapable of addressing these challenges,
allowing bugs, which cause unpredictable software behavior
and wrong computation results, to slip into hardware.
In this work we present a scalable approach to the verification
of memory coherence protocols in large multi-core and
multi-processor systems. We accomplish this task through
a distributed network of cooperating agents, which feed the
processors with stimuli, each agent attempting to accomplish
its own verification goals and support other agents on theirs
as well. The agents can dynamically change the stimuli based
on coverage and pressure observed during simulation. Since
each agent has a minimal knowledge of the entire system,
their communication and decision process is greatly simplified.
Moreover, since the agents' view of the system is linear
in the number of nodes in it, our approach can be efficiently
scaled to target large multi-core systems. Experimental
results on two common coherence protocols and a range
of multi-core configurations demonstrate that our technique
can reach high levels of coverage of the system-level protocol
much faster than a constrained-random generator.
-
Efficient Implementation of Native Software Simulation for MPSoC [p. 676]
-
P. Gerin, X. Guérin and F. Pétrot
Efficient and precise simulation models at a high abstraction
level are required in order to perform early design
validations and architecture explorations of Multi-Processor
System-On-Chip (MPSoC) platforms. Although
native software simulation approaches provide interesting
capabilities, they quickly become unsuitable when complex
hardware architectures have to be considered.
In this paper, we present a SystemC-based MPSoC platform
implementation that allows native software simulation
while keeping details of the underlying hardware model.
The key contribution of this work is a realistic memory mapping
modelling that makes possible the simulation of Operating
Systems and software applications on complex hardware
models with multiple processors and DMA devices.
This method also allows the reuse of different software components
for the target processor(s). Experimental results
show the efficiency of the proposed method to validate software
on complex hardware architectures.
-
Simulation-Directed Invariant Mining for Software Verification [p. 682]
-
X. Cheng and M.S. Hsiao
With the advance of SAT solvers, transforming a software
program to a propositional formula has generated much interest
for bounded model checking of software in recent years.
However, reasoning at the Boolean level often may not be able
to identify some key relations among the original high-level
program variables. In this paper, we propose a novel framework
that uses simulation-directed data mining in the original
program to extract a set of high-level potential property
invariants according to the dynamic execution data of the
software. When these learned invariants are added as
constraints to the bounded model checking instances of the
software, they help to significantly reduce the search space. The
simulation-directed invariant mining framework exhibits more
flexibility compared to the conventional static program analysis
approaches, and the experimental results showed that our
approach can lead to up to an order of magnitude of speedup in
software verification via bounded model checking.
Moderators: A. Doboli, State U of New York at Stony Brook, US; M. Ortmanns, Freiburg U, DE
-
Comparison of Opamp-Based and Comparator-Based Delta-Sigma Modulation [p. 688]
-
M. Momeni, P.B. Bacinschi and M. Glesner
Comparator-based switched capacitor (CBSC) circuits
present an alternative approach to designing sampled data
systems based on the principle of detecting a virtual ground
condition with a comparator rather than actively enforcing
it with a high-gain operational amplifier (opamp) in feedback.
This work demonstrates a 2nd-order ΔΣ converter
designed using the CBSC technique. The same modulator
topology was also implemented using two conventional design
methods for a two-stage Miller-compensated amplifier
and a single-stage folded cascode amplifier, such that all
three blocks can be used as 'drop-in replacements' in the
top-level circuit. The designs are done in a 0.13 μm UMC
technology. The SNDR performance and power consumption
of all three approaches were simulated with a sampling
frequency of 5.12 MHz and an oversampling ratio of 64. It
can be concluded that the CBSC method provides a great
simplification of design effort and significant power savings
compared to the traditional OTA-based methods.
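The quoted sampling frequency and oversampling ratio fix the signal bandwidth of the modulator. The short sketch below works this out using the usual definition OSR = fs / (2 * fB). The function name is illustrative, and the abstract itself does not state the resulting bandwidth.

```python
def signal_bandwidth_hz(sampling_hz, osr):
    """Signal bandwidth fB implied by OSR = fs / (2 * fB)."""
    return sampling_hz / (2 * osr)

# Values quoted in the abstract: fs = 5.12 MHz, OSR = 64.
bandwidth = signal_bandwidth_hz(5.12e6, 64)
```

With these values the implied signal bandwidth is 40 kHz.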
-
A Novel Technique for Improving Temperature Independency of Ring-ADC [p. 694]
-
S. Li, H. Chen and F. Zhou
A new temperature compensation technique for ring-oscillator-based
ADCs is proposed in this paper. It employs
a novel fixed-number-based algorithm and a CTAT current
biasing technique to compensate the temperature-dependent
variations of the output, thus eliminating the
need for digital calibration. Simulation results show that,
with the proposed technique, the resolution over the
temperature range of 0°C to 100°C can reach a 2-mV
quantization bin size with an input voltage span of 120 mV,
at a sampling frequency of fs = 100 kHz.
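The bin size and input span quoted above translate into a number of quantization levels and an equivalent bit count. The sketch below does that arithmetic. Reading the resolution as log2 of the number of bins is an interpretation, not a figure from the paper.

```python
import math

def quantization_levels(span_mv, bin_mv):
    """Number of quantization bins covering the input span."""
    return span_mv / bin_mv

def equivalent_bits(levels):
    """Equivalent resolution in bits for a given number of levels."""
    return math.log2(levels)

levels = quantization_levels(120, 2)  # 120 mV span, 2 mV bins
bits = equivalent_bits(levels)
```

A 120 mV span with 2 mV bins gives 60 levels, which is a little under 6 bits.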
-
An Analog On-Chip Adaptive Body Bias Calibration for Reducing Mismatches in Transistor
Pairs [p. 698]
-
P.B. Bacinschi, T. Murgan, K. Koch and M. Glesner
Device parameter variations exhibit an increasingly serious
impact on analog and mixed-signal circuit behavior.
In this paper, we propose a novel fully-analog on-chip adaptive
body bias calibration method, for efficiently reducing
mismatches in transistor pairs. We present three circuit implementations
which achieve a mismatch reduction between
61% and 73% in terms of standard deviation.
-
Integrated Approach to Energy Harvester Mixed Technology Modeling and Performance
Optimization [p. 704]
-
L. Wang, T.J. Kazmierski, B.M. Al-Hashimi, S.P. Beeby and R.N. Torah
This paper presents an integrated approach to energy
harvester modelling and performance optimisation where
the complete mixed physical-domain energy harvester system
(micro generator, voltage booster, storage element and
load) can be modelled and optimised in a systematic manner
using one simulation platform. We developed an accurate
HDL model for the energy harvester and demonstrated
its accuracy by validating it experimentally and comparing
it with recently reported models. To address the performance
loss due to the close mechanical-electrical interaction
that takes place in energy harvesters, we proposed
a holistic methodology to the energy harvester optimisation
based on the HDL model. The effectiveness of employing
such an approach has been demonstrated by showing that it
is possible to improve vibration-based energy harvester efficiency
(energy delivered to load/harvested energy) by 30%
through optimising the micro-generator size and the voltage
booster circuit components.
Moderators: W. Eberle, IMEC, BE; G. Gielen, KU Leuven, BE
-
A Scalable Low-Power Digital Communication Network Architecture and an Automated
Design Path for Controlling the Analog/RF Part of SDR Transceivers [p. 710]
-
W. Eberle and M. Goffioul
Emerging new wireless standards, the move towards
multi-standard transceivers, and ultimately software-defined
radios impose the need for a tighter interaction
between digital baseband and analog/RF parts. Software-defined
radio transceivers may face more than 400
control bits in the analog/RF part [9][10]. Configuring
the transmit/receive chain to particular standards,
monitoring front-end performance, and dynamically
controlling front-end behavior require a tight
bidirectional interaction. We have developed a generic
concept of a flexible and scalable low-power digital
communication network in a multi-standard analog/RF
front-end.
Our approach is layout-friendly, reduces interconnect
area significantly (by 96%) compared to a star topology,
scales easily with analog/RF design changes such as pin
additions, and exhibits a generic bidirectional interface to
the system and digital designer. Moreover, an almost fully
automated design flow - starting from an on-chip
connection list for all analog blocks up to VHDL code
generation - has been developed and implemented,
reducing design effort and potential errors.
The architecture and the design flow have been
successfully proven in two 0.13-um full software-defined
radio transceiver designs. In the first design, the flow was
still manually instantiated. In the second design, the
automated flow was used and led to a significant design-time
speed-up.
-
A Coarse-Grained Array Based Baseband Processor for 100Mbps+ Software Defined Radio [p. 716]
-
B. Bougard, B. De Sutter, S. Rabou, D. Novo, O. Allam, S. Dupont and L. Van der Perre
The Software-Defined Radio (SDR) concept aims at enabling cost-effective
multi-mode baseband solutions for wireless terminals.
However, the growing complexity of new communication standards
applying, e.g., multi-antenna transmission techniques, together with
the reduced energy budget, is challenging SDR architectures. Coarse-Grained
Array (CGA) processors are strong candidates to deliver
both high performance and low power.
The design of a candidate hybrid CGA-SIMD processor for an SDR
baseband platform is presented. The processor, designed in TSMC
90G process according to a dual-VT standard-cells flow, achieves a
clock frequency of 400MHz in worst case conditions and consumes
maximally 310mW active and 25mW leakage power (typical
conditions) when delivering up to 25.6 GOPS (16-bit). The mapping
of a 20MHz 2x2 MIMO-OFDM transmit and receive baseband
functionality is detailed as an application case study, achieving
100Mbps+ throughput with an average consumption of 220mW.
-
Scenario-Based Fixed-Point Data Format Refinement to Enable Energy-Scalable Software
Defined Radios [p. 722]
-
D. Novo, B. Bougard, A. Lambrechts, L. Van der Perre and F. Catthoor
User demand, standards and products for digital
nomadic communications are evolving quickly. The
combination of this changing environment together with the
need for short time-to-market pushes for more flexible
implementations. Software Defined Radios (SDR) have been
introduced as the ultimate way to achieve such flexibility. The
reduced energy budget required by battery-powered solutions
makes the typical worst-case static dimensioning unaffordable
under highly dynamic operating conditions. Instead, more
energy-scalable algorithms and implementations are needed to
provide flexibility while maintaining the required energy
efficiency. In particular, energy-scalable implementations can
exploit data format properties to offer different tradeoffs
between accuracy and energy. In this paper, such a technique is
developed and applied to the SDR implementation of a
two-antenna 200 Mbps+ OFDM (Orthogonal Frequency-Division
Multiplexing) inner modem receiver on a C-programmable
CGA (Coarse Grain Array) processor with extensive SIMD
(Single Instruction Multiple Data) support. By defining
separate implementations for different combinations of
modulation scheme and coding rate, up to 3-fold gains can be
achieved in the average energy consumption.
Organizer: P. Girard, LIRMM/CNRS, FR
Moderator: A. Raghunathan, NEC Laboratories, US
-
Test Strategies for Low Power Devices [p. 728]
-
C.P. Ravikumar, M. Hirech and X. Wen
Ultra low-power devices are being developed for
embedded applications in bio-medical electronics, wireless
sensor networks, environment monitoring and protection,
etc. The testing of these low-cost, low-power devices is a
daunting task. Depending on the target application, there
are stringent guidelines on the number of defective parts
per million shipped devices. At the same time, since such
devices are cost-sensitive, test cost is a major consideration.
Since system-level power-management techniques are
employed in these devices, test generation must be power-management-aware
to avoid stressing the power
distribution infrastructure in the test mode. Structural test
techniques such as scan test, with or without compression,
can result in excessive heat dissipation during testing and
damage the package. False failures may result due to the
electrical and thermal stressing of the device in the test
mode of operation, leading to yield loss. This paper
considers different aspects of testing low-power devices
and some new techniques to address these problems.
Moderators: C. Schlaeger, AMD, DE; P. Felber, Neuchatel U, CH
-
Thermal Balancing Policy for Streaming Computing on Multiprocessor Architectures [p. 734]
-
F. Mulas, M. Pittau, M. Buttu, S. Carta, A. Acquaviva, L. Benini, D. Atienza and G. De Micheli
As feature sizes decrease, power dissipation and heat generation
density exponentially increase. Thus, temperature gradients in Multiprocessor
Systems on Chip (MPSoCs) can seriously impact system
performance and reliability. Thermal balancing policies based
on task migration have been proposed to modulate power distribution
between processing cores to achieve temperature flattening.
However, in the context of MPSoC for multimedia streaming computing,
where timeliness is critical, the impact of migration on quality
of service must be carefully analyzed. In this paper we present
the design and implementation of a lightweight thermal balancing
policy that reduces on-chip temperature gradients via task migration.
This policy exploits run-time temperature and load information
to balance the chip temperature. Moreover, we assess the effectiveness
of the proposed policy for streaming computing architectures
using a cycle-accurate thermal-aware emulation infrastructure.
Our results using a real-life software defined radio multitask
benchmark show that our policy achieves thermal balancing while
keeping migration costs bounded.
-
A Practical Approach for Reconciling High and Predictable Performance in Non-Regular
Parallel Programs [p. 740]
-
O. Certner, Z. Li, P. Palatin, O. Temam, F. Arzel and N. Drach
Increasingly complex consumer electronics applications
call for embedded processors with higher performance.
Multi-cores are capable of delivering the required performance.
However, many of these embedded applications must
meet some form of soft real-time constraints, and program behavior
on multi-cores is even harder to predict than on single-cores.
In this article, we highlight the greater performance
variability of irregular applications (non-regular control flow
and/or data structures) across data sets when parallelized
and run on a multi-core. We then show that a proper parallelization
approach coupled with a lightweight run-time system
can drastically reduce this performance variability without
sacrificing performance. This approach requires no
complex program or architecture analysis or modeling. Moreover,
we show that parallel program performance becomes
stable enough that it is possible to reasonably and accurately
predict it by sampling a few training runs.
-
Exact and Approximate Task Assignment Algorithms for Pipelined Software Synthesis [p. 746]
-
M. Hashemi and S. Ghiasi
Pipelined execution of streaming applications enables processing of high-throughput
data under performance constraints. We present an integrated approach to synthesizing
pipelined software for dual-core architectures. We target streaming applications
modeled as task graphs that are amenable to static analysis. We develop a versatile
task assignment algorithm that considers the combined effect of workload imbalance
between processors and inter-processor communication. Our technique, which runs in
pseudo-linear time, provably maximizes application throughput. Furthermore,
we develop an approximation algorithm for task assignment whose complexity is strictly
polynomial. It provides the designer with an adjustable knob to controllably trade
solution quality for algorithm runtime and memory requirement. Empirical throughput
measurements using an FPGA-based dual-core system validate our theoretical results. Our
exact algorithm consistently outperforms a recent competitor. Compared to exact
task assignment, the approximate method runs about 3 times faster, requires about
20 times less memory, and results in only 1% to 5% throughput loss.
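The objective in this abstract, balancing processor workload against inter-processor communication on two cores, can be shown with a brute-force sketch. This is not the pseudo-linear exact algorithm or the approximation algorithm of the paper. It enumerates all assignments for a small task graph under an assumed cost model in which every edge that crosses cores adds its communication cost to both cores. All names are hypothetical.

```python
from itertools import product

def best_dual_core_assignment(exec_time, edges):
    """Exhaustively find the assignment with the smallest pipeline period.

    exec_time: dict mapping task -> execution time per iteration.
    edges: dict mapping (src, dst) -> communication cost, charged to both
           cores when src and dst are on different cores (assumed model).
    Returns (period, assignment); throughput is 1 / period.
    """
    tasks = sorted(exec_time)
    best = None
    for bits in product((0, 1), repeat=len(tasks)):
        core = dict(zip(tasks, bits))
        load = [0.0, 0.0]
        for task in tasks:
            load[core[task]] += exec_time[task]
        for (src, dst), cost in edges.items():
            if core[src] != core[dst]:
                load[core[src]] += cost
                load[core[dst]] += cost
        period = max(load)  # the slower core bounds the pipeline
        if best is None or period < best[0]:
            best = (period, core)
    return best
```

The search visits 2^n assignments, so it is only usable for very small graphs. That cost is the motivation for the efficient exact and approximate algorithms the abstract describes.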
Moderators: G. Gaydadjiev, TU Delft, NL; T. Austin, U of Michigan, US
-
Run-Time System for an Extensible Embedded Processor with Dynamic Instruction Set [p. 752]
-
L. Bauer, M. Shafique, S. Kreutz and J. Henkel
One of the upcoming challenges in embedded processing is to incorporate
an increasing amount of adaptivity in order to respond
to the multifarious constraints induced by today's embedded systems
that feature complex and diverse application behaviors.
We present a novel concept (evaluated with a hardware prototype)
that moves traditional design-time jobs to run time in order
to increase efficiency (in this paper we focus on performance).
Adaptivity is achieved dynamically through what we call Special
Instructions (SIs) which may change during run time according
to non-predictable application behavior. The new contribution of
this paper is the principal component that actually makes the entire
embedded processor work efficiently, namely the "Special Instruction
Scheduler". It determines during run time 'when' and
'how' Special Instructions are composed and executed.
We achieve a 2.38x performance increase over a reconfigurable
processor system with dynamic instruction set (Molen [19]).
Our whole platform consists of a toolchain including estimation
and simulation tools plus a running hardware prototype.
Throughout this paper, we discuss the functionality by means of
an H.264 video encoder in detail even though the concept is not
limited to this application.
-
Harnessing Horizontal Parallelism and Vertical Instruction Packing of Programs to Improve
System Overall Efficiency [p. 758]
-
H. Lin and Y. Fei
Multi-issue processors can exploit the Instruction Level
Parallelism (ILP) of programs to improve the performance
greatly. How to reduce the energy consumption while maintaining
the high performance of programs running on multi-issue
processors remains a challenging problem. In this
paper, we propose a novel approach to apply the instruction
register file (IRF) technique from single-issue processors
to VLIW architectures. Frequently executed instructions
are selected to be placed in the on-chip IRF for fast access
in program execution. Violation of synchronization among
VLIW instruction slots is avoided by introducing new instruction
formats and microarchitectural support. The enhanced
VLIW architecture is thus able to orchestrate the
horizontal instruction parallelism and vertical instruction
packing for programs to improve system overall efficiency.
Our experimental results show that the proposed processor
architecture achieves both the performance advantage provided
by the VLIW architecture and high energy efficiency
provided by the IRF-based instruction packing technique
(e.g., 71.1% reduction in the fetch energy consumption for
a 4-way VLIW architecture with 8-entry IRFs).
-
Instruction Set Extension Exploration in Multiple-Issue Architecture [p. 764]
-
I.-W. Wu, Z.-Y. Chen, J.-J. Shann and C.-P. Chung
To satisfy high-performance computing demand in modern
embedded devices, current embedded processor
architectures provide designers with the possibility either to
define a customized instruction set extension (ISE) or to
increase instruction issue width. Previous studies have
shown that deploying ISE in multiple-issue architecture
can significantly improve performance. However,
identifying ISE for multiple-issue architecture by using
current ISE exploration algorithms will result in
unnecessary waste of silicon area and limitation of
performance improvement. This is because most algorithms
overlook two important considerations: (1) only packing
the operations lying on the critical path into ISE can
improve performance; (2) the critical path usually changes
after packing operations into an ISE. With these
considerations, this paper presents an algorithm for ISE
exploration based on list scheduling and Ant Colony
Optimization (ACO), which combines ISE exploration
and critical path identification (i.e. instruction
scheduling). Results indicate that our approach
outperforms the previous work in both performance
improvement and area efficiency.
-
Instruction Re-Encoding Facilitating Dense Embedded Code [p. 770]
-
T. Bonny and J. Henkel
Reducing the code size of embedded applications is one
of the important constraints in embedded system design.
Code compression can provide substantial savings in terms
of size. In this paper, we introduce a novel and efficient
hardware-supported approach. Our approach investigates
the benefits of re-encoding the unused bits (we call them
re-encodable bits) in the instruction format for a specific
application to improve the compression ratio. Re-encoding
those bits may reduce the size of the decoding table by more
than 37%. We achieve compression ratios as low as 44%
(including all incurred overhead). We have conducted
evaluations using a representative set of applications and
have applied it to two major embedded processors, namely
MIPS and ARM.
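Because lower is better for the ratios quoted here, the metric appears to be compressed size divided by original size. The sketch below assumes that definition and counts the decoding table as overhead. The function name and the byte counts in the usage line are made up for illustration.

```python
def compression_ratio(original_bytes, compressed_bytes, overhead_bytes=0):
    """Compressed size, including overhead, as a fraction of original size."""
    return (compressed_bytes + overhead_bytes) / original_bytes

# Hypothetical sizes: 1000-byte program, 400 bytes of compressed code,
# 40 bytes of decoding table.
ratio = compression_ratio(1000, 400, 40)
```

Under this definition, a smaller decoding table lowers the overall ratio even when the compressed code itself is unchanged.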
-
Test Instrumentation for a Laser Scanning Localization Technique for Analysis of High
Speed DRAM Devices [p. 776]
-
M. Versen, A. Schramm, J. Schnepp and D. Diaconescu
Soft defect localization (SDL) is a method of laser
scanning microscopy that utilizes the changing pass/fail
behavior of an integrated circuit under test and
temperature influence. Historically, the pass and fail
states are evaluated by a tester, which leads to long and
impracticable measurement times for dynamic random
access memories (DRAM). The new method using a
high speed comparison device allows SDL image
acquisition times of a few minutes and a localization of
functional DRAM fails that are caused by defects in the
DRAM periphery, which has not been possible before. This
new method significantly speeds up the turn-around time
in the failure analysis (FA) process compared to
knowledge based FA.
-
A Mapping Framework for Guided Design Space Exploration of Heterogeneous MP-SoCs [p. 780]
-
B. Ristau, T. Limberg and G. Fettweis
When designing heterogeneous MP-SoCs, designers have
to take into account various objectives such as power, die
size, flexibility, performance or programmability. But to be
able to evaluate a given system according to these objectives,
it is necessary to know how applications will behave
on that system. Since time-to-market is one key factor in
chip design, it is important to be able to evaluate these systems
at a very early design stage. Today this is usually done
with simulations in languages such as Simulink or SystemC.
We will show how the behavior of such systems can be analyzed
without the need for time-consuming implementations
of simulation models. This enables fast evaluation and modification
of a given system at a very early design stage allowing
efficient pruning of the design space.
-
Impact of Leakage Current on Data Retention of RF-Powered Devices during Amplitude-Modulation-Based Communication [p. 784]
-
J. Haid, B. Zimek, T. Leutgeb and T. Kuenemund
Devices powered by an electromagnetic field are inherently
power-constrained and thus must carefully manage
static and dynamic power. High ambient temperatures and
field strengths can increase the temperature of RF-powered
devices up to more than 100 degrees Celsius, thereby allowing
the leakage current to rise to a dominating portion
of the static power consumption.
Leakage reduction techniques for application in RF-powered
devices are examined in this paper with the
goal to avoid malfunction of the device during amplitude
modulation-based communication. Results show that without
leakage reduction a correct operation cannot be guaranteed
for the investigated 130 nm process technology for
energy gaps that are defined by the widely applied ISO/IEC
14443-2 standard (100% field modulation). The evaluation
of leakage reduction techniques shows that applying body
biasing prolongs the data retention time by nearly 200%,
while source biasing in general reduces the circuit's robustness
against power gaps (data retention time reduced by up
to 76%), as does voltage scaling (up to
98% reduction).
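To make the reported percentages easier to compare, the short Python sketch below converts them into multiples of a normalized baseline retention time. The baseline of 1.0 and the function name `retention_after` are assumptions; the percentages are the bounds quoted above ("nearly" and "up to"), not exact measurements.

```python
def retention_after(baseline, percent_change):
    """Scale a baseline data retention time by a signed percentage change."""
    return baseline * (1.0 + percent_change / 100.0)

baseline = 1.0  # normalized retention time without leakage reduction (assumed)
body_biasing = retention_after(baseline, +200)    # roughly 3x the baseline
source_biasing = retention_after(baseline, -76)   # roughly 0.24x the baseline
voltage_scaling = retention_after(baseline, -98)  # roughly 0.02x the baseline
```

Read this way, a prolongation of nearly 200% means close to three times the baseline retention time, while a 98% reduction leaves about one fiftieth of it.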
-
Accuracy-Adaptive Simulation of Transaction Level Models [p. 788]
-
M. Radetzki and R.S. Khaligh
Simulation of transaction level models (TLMs) is an established
embedded systems design technique. Its use cases include
virtual prototyping for early software development,
platform simulation for design space exploration, and reference
modelling for verification. The different use cases
mandate different trade-offs between simulation performance
and accuracy. Therefore, multiple TLM abstraction
layers have been defined of which one has to be chosen and
integrated into the system model prior to simulation. In this
contribution we present a modelling technique that allows
covering several layers in a single model and switching between
the layers at any time, in particular dynamically during
simulation. This feature is employed to automatically
adapt simulation accuracy to an appropriate level depending
on the model's state, leading to an improved trade-off
between simulation performance and accuracy.
-
Zero-Efficient Buffer Design for Reliable Network-on-Chip in Tiled Chip-Multi-Processor [p. 792]
-
J. Wang, H. Zeng, K. Huang, G. Zhang and Y. Tang
Network-on-Chip (NoC) is a promising solution for efficient
interconnection between processor cores in a
Chip-Multi-Processor (CMP). This paper focuses on the
energy-efficient design of buffers, which are among the most
important components in an NoC. Our investigation shows that
the packets transmitted in an NoC for a CMP consist
overwhelmingly of "zero" values. A zero-efficient buffer design
is proposed along with an error control scheme. Compared
with a conventional design, up to 43% of the energy consumption
can be saved. We use a 90nm CMOS process in our simulation.
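The kind of measurement behind this observation can be sketched in a few lines of Python. The sketch assumes that "zero" refers to individual bits of fixed-width flits; the abstract does not say whether zeros are counted per bit or per word, and the function name, flit width and sample values are hypothetical.

```python
def zero_bit_fraction(flits, width=32):
    """Fraction of zero bits in a list of fixed-width flits."""
    mask = (1 << width) - 1
    total_bits = len(flits) * width
    one_bits = sum(bin(f & mask).count("1") for f in flits)
    return (total_bits - one_bits) / total_bits

# Hypothetical 8-bit flits: 9 of the 32 bits are ones.
sample_fraction = zero_bit_fraction([0x00, 0x01, 0xFF, 0x00], width=8)
```

A buffer that spends less energy on storing zeros benefits more as this fraction approaches 1.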
-
Wire Sizing Alternative - An Uniform Dual-Rail Routing Architecture [p. 796]
-
F.-W. Chen and Y.-Y. Liu
To achieve minimum signal propagation delay, the non-uniform
wire width routing architecture has been widely
used in modern VLSI design. The non-uniform routing
architecture exploits the wire width flexibilities to trade area
for performance. However, many additional design rules,
which confine the routing flexibilities, are introduced in
nanoscale circuit designs. With the increasing difficulties
of fabricating nanoscale circuits, the conventional non-uniform
routing architecture becomes clumsy. We propose a
uniform dual-rail routing architecture to cope with these new
challenges. The proposed architecture exploits the anti-Miller
effect between two adjacent wires with the same signal
source. Hence, the coupling capacitance between these two
wires is reduced. The simulation results demonstrate that
our proposed architecture provides a signal propagation
channel whose propagation delay is similar to, and whose
crosstalk noise and power consumption are lower than, those of
the conventional non-uniform routing architecture, at a
moderate routing area overhead.
In terms of properties and scalability, we argue that
the uniform dual-rail routing architecture is a wire sizing
alternative that avoids layout irregularity and stacked-via
overhead.
-
Structural Synthesis of Four-Quadrant Multiplier Based on Hierarchical Topology [p. 800]
-
X. Wang and L. Hedrich
This paper presents a method towards automatic structural
synthesis of analog multipliers based on a hierarchical
topology, the "super-topology", which is abstracted from
the most standard four-quadrant multipliers. The essential
components in the super-topology are four identical
cells, which consist of several MOS-transistors and determine
features and performances of multipliers. We build
all possible cells with up to 3 transistors. Experimental results
present three new multiplier structures with simulation results
to show the creativity of our method.
-
A Virtual Prototype for Bluetooth over Ultra Wide Band System Level Design [p. 804]
-
A. Lewicki, J. del Prado Pavon, J. Talayssat, E. Dekneuvel and G. Jacquemod
The industry is merging two different Wireless
Personal Area Networks (WPAN) technologies: Bluetooth
(BT) and WiMedia Ultra Wide Band (UWB), into a single
BT over UWB (BToUWB) specification. The goal is to
provide low cost, low power and a wide range of data rate
wireless communications for multimedia and mobile
applications. The complexity of studying such a system
requires the development of a virtual prototype at a high
level of abstraction. The model needs a fast simulation
time in order to explore the algorithms necessary for the
merging of the standards. Moreover, as the merging is still
in a standardization phase, this virtual prototype helps to
actively participate in this effort. The aim of this paper is
to provide an overview of the methodology used to create a
virtual prototype of a BToUWB device.
-
Re-Examining the Use of Network-on-Chip as Test Access Mechanism [p. 808]
-
F. Yuan, L. Huang and Q. Xu
Existing work on testing NoC-based systems advocates reusing the
on-chip network itself as test access mechanism (TAM) to transport
test data to/from embedded cores. While this methodology obviously
reduces the routing cost when compared to the case that dedicated test
buses are introduced as TAMs, it is not clear whether it is beneficial in
terms of other important factors that significantly affect test cost, e.g.,
testing time, test control complexity and test reliability. Therefore, in
this paper, we re-examine the issue of using NoC as TAM in order to
help designers construct a cost-effective system test architecture
based on their requirements.
Organizers: A. Sangiovanni-Vincentelli, UC Berkeley, US; M. Di Natale, Scuola S Anna, Pisa, IT
Moderator: A. Sangiovanni-Vincentelli, UC Berkeley, US
-
PANEL - The Future Car: Technology, Methods and Tools [p. 812]
-
Panelists: H. Hanselmann, H. Heineke, A. Bouali, H. Kopetz, H. Fennel and T. Weber
-
The car of the future will be based on very advanced software and hardware technologies for improved safety
and additional features such as autonomous driving, vehicle to vehicle communication, extensive communication
and entertainment subsystems. What are the limiting factors for introducing new technology in cars? What are
the standards, methods and tools that will be needed to bring these cars to market quickly and with guaranteed
properties? The experts in the panel will address these questions and discuss their preferred solutions.
Moderators: R. Bloem, TU Graz, AT; R. Drechsler, Bremen U, DE
-
Improving Constant-Coefficient Multiplier Verification by Partial Product Identification [p. 813]
-
C.-Y. Lai, C.-Y. Huang and K.-Y. Khoo
Constant-coefficient multipliers are fundamental
components in digital signal processing and arithmetic-based
systems. Their verification, however, remains
difficult and time-consuming. This is caused by the
inability to identify the partial products from the number
representation system of the constant. In this paper, we
introduce an efficient number representation system based on
an observation of how modern synthesizers interpret
constants. We also propose a robust and efficient partial
product identification algorithm to improve the verification
process. Experimental results show that our
algorithm not only reduces the number of failing cases
of the verification to one third but also speeds up the
verification process by at least 25% on average.
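To illustrate how the number representation of a constant determines its partial products, the following Python sketch recodes a constant into canonical signed-digit (CSD) form, a textbook representation. CSD is swapped in here purely as an example; the abstract does not state which representation the authors propose, and the function names are hypothetical.

```python
def csd_digits(c):
    """Canonical signed-digit form of c >= 0: digits in {-1, 0, 1}, LSB first."""
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)   # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= d            # makes c divisible by 2 (and by 4 when d == -1)
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits

def multiply_by_constant(x, c):
    """Sum the shifted partial products x << i selected by the CSD digits of c."""
    return sum(d * (x << i) for i, d in enumerate(csd_digits(c)))
```

For example, the constant 7 is recoded as 8 - 1, so multiplying by 7 needs two partial products instead of the three implied by its plain binary form 111. A verifier that expects one set of partial products while the synthesizer used another has to recognise that both describe the same constant.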
-
Improved Visibility in One-to-Many Trace Concretization [p. 819]
-
K. Nanshi and F. Somenzi
We present an improved algorithm for concretization of abstract error
traces in abstraction refinement-based invariant checking. The
proposed algorithm maps each transition of the abstract error trace
to one or more transitions in the concrete model by using a combination
of simulation and satisfiability checking. Prior simulation-based
approaches were hindered by limited visibility, which often
resulted in excessive backtracking or refinements. The proposed
technique addresses this issue in three ways: by identifying variables
whose addition to the abstract trace significantly improves its
predictive power at a low computational cost; by combining SAT
checks with pseudo-random simulation in the construction of the
concrete trace; and by a more flexible budgeting of simulation vectors
that accounts for the progress made in concretization.
-
Efficient Symbolic Simulation of Low Level Software [p. 825]
-
T. Arons, E. Elster, S. Ozer, J. Shalev and E. Singerman
Symbolic execution has long been a staple technique for
formal hardware verification. Its application to software requires
methods for dealing with software specific complexities.
In this paper we elaborate methods for the efficient
symbolic simulation of embedded software; some methods
are new, others are improvements of existing methods. Using
these techniques we have been able to symbolically execute
real life microcode of thousands of lines, allowing formal
methods to become an integral part of microcode validation
at Intel Corporation.
-
Completeness in SMT-Based BMC for Software Programs [p. 831]
-
M.K. Ganai and A. Gupta
Bounded Model Checking (BMC) is incomplete without
a completeness threshold (CT) bound. Previous methods,
using recurrence diameter for obtaining CT, check for existence
of a longest loop-free path at every depth k. For terminating
software programs, we propose an efficient method
for obtaining CT that requires solving a formula of size
O(k) at some depths only, as compared to previous methods
that require solving a formula of O(k^2) (or O(k log k))
size at every depth. We augment previous methods for BMC
simplifications using model transformation and control flow
information, with context-sensitive analysis. This results in
more BMC simplifications and further reduction in the number
of CT checks. We have implemented our techniques in
a Satisfiability Modulo Theory (SMT)-based BMC framework.
Our controlled experiments on real-world software
programs show that our proposed formulation provides significant
improvements over previous approaches.
Moderators: L. Scheffer, Cadence Design Systems, US; I. Markov, U of Michigan, US
-
Novel Pin Assignment Algorithms for Components with Very High Pin Counts [p. 837]
-
T. Meister, J. Lienig and G. Thomke
The wiring effort, and thus the routability, of electronic
designs such as printed circuit boards, multi chip modules
and single chip modules largely depends on the
assignment of signals to component pins. For modern
components that have as many as several thousand pins,
this pin assignment cannot be optimized manually. This
paper presents four novel pin assignment algorithms that
automatically create optimized pin assignments for wiring
substrate designs with components that have very high pin
counts. We also present and evaluate quality estimation
metrics that enable fast assessment of the pin assignment
results. The efficiency of our algorithms allows the
creation of optimized pin assignments using only minutes
of computation time. We show the applicability of all four
algorithms, including their strengths and weaknesses, in
specific design applications.
-
A Generic Standard Cell Design Methodology for Differential Circuit Styles [p. 843]
-
S. Badel, E. Güleyüpoglu, O. Inaç, A.P. Martinez, P. Vietti, F.K. Gürkaynak and Y. Leblebici
In this paper we present a generic methodology for the
rapid generation and implementation of standard cell libraries
for differential circuit design styles. We demonstrate
a systematic approach for the classification of circuit
topologies (footprints) and for generating the templates that
correspond to a large number of functions. The generation
of an extensive cell library with more than 4500 standard
cells based on 19 footprints is demonstrated using a 180 nm
CMOS technology.
-
Layout Level Timing Optimization by Leveraging Active Area Dependent Mobility of
Strained-Silicon Devices [p. 849]
-
A. Chakraborty, X. Shi and D.Z. Pan
Advanced MOSFETs such as Strained Silicon (SS) devices have
emerged as critical enablers to keep Moore's law on track for sub-100nm
technologies. Use of Strained Silicon devices provides performance
improvement equivalent to use of next generation devices,
without actually requiring scaling. Traditionally, the research in
the field of SS has been focussed on device modeling and process
characterization. Recently (in [1] [2]), the dependence of mobility
of a SS MOSFET device on its poly-to-poly distance has been
reported. In this work, we propose a new methodology to exploit
this dependence to achieve cycle time reduction of a design at the
layout level. To the best of our knowledge, this is the first research
work to tackle timing closure by layout modifications using active
area dependent mobility of SS devices. Our methodology shows
consistent improvement for benchmark designs mapped onto various
90nm commercial standard cell libraries. This work enables
reduction of cycle time by as much as 6.31% (5.25% on
average) very late in the design closure cycle without requiring any
optimization iterations.
-
Exploiting Correlation Kernels for Efficient Handling of Intra-Die Spatial Correlation, with
Application to Statistical Timing [p. 856]
-
A. Singhee, S. Singhal and R.A. Rutenbar
Intra-die manufacturing variations are unavoidable in nanoscale
processes. These variations often exhibit strong spatial correlation.
Standard grid-based models assume model parameters (grid-size,
regularity) in an ad hoc manner and can have high measurement
cost. The random field model overcomes these issues. However,
no general algorithm has been proposed for the practical use of this
model in statistical CAD tools. In this paper, we propose a robust
and efficient numerical method, based on the Galerkin technique
and Karhunen-Loève expansion, that enables effective use of the
model. We test the effectiveness of the technique using a Monte
Carlo-based Statistical Static Timing Analysis algorithm, and see
errors less than 0.7%, while reducing the number of random variables
from thousands to 25, resulting in speedups of up to 100x.
Moderators: R. Forsyth, Austriamicrosystems AG, AT; G. Van der Plas, IMEC, BE
-
A Triple-Mode Reconfigurable Sigma-Delta Modulator for Multi-Standard Wireless
Applications [p. 862]
-
A. Morgado, R. del Río and J.M. de la Rosa
This paper presents the implementation and experimental
characterization of a reconfigurable ΣΔ modulator intended
for multi-mode wireless receivers that is capable of
performing the analog-to-digital conversion for GSM, Bluetooth,
and UMTS standards. The ΣΔ modulator reconfigures
its cascade topology and building blocks in order to
adapt the performance to the diverse standard specifications
with optimized power consumption. The prototype has
been implemented in a 130-nm CMOS technology and features
dynamic ranges of 86.7/81.0/63.3dB and peak signal-
to-(noise+distortion) ratios of 74.0/68.4/52.8dB at
400ksps/2Msps/8Msps, respectively. The modulator power
consumption is 25.2/25.0/44.5mW, of which 11.0/10.5/
24.8mW are dissipated in the analog circuitry.
-
Low-Noise Sigma-Delta Capacitance-to-Digital Converter for Sub-pF Capacitive Sensors
with Integrated Dielectric Loss Measurement [p. 868]
-
M. Bingesser, T. Loeliger, W. Hinn, J. Hauer, S. Mödl, R. Dorn and M. Völker
A sigma-delta capacitance-to-digital converter (CDC)
with a resolution down to 19.3 aF at a bandwidth of 10 kHz,
corresponding to a noise level of 0.2 aF/√Hz, is presented.
An integrated dielectric loss measurement circuit by means
of two parallel channels with different integration times offers
a complex permittivity measurement in a single-chip
solution. The achieved dielectric loss angle resolution is
as low as 0.3° for a material density ratio of 0.55%. A
test chip with two converter blocks including two 2nd order
and two 4th order modulators has been produced in the
austriamicrosystems AG C35B3C0 0.35 μm DPTM CMOS
process, operating at a single 3.3 V supply. Applications of
this circuit include mass measurement and analysis of material
compositions.
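The link between the quoted noise level and the quoted resolution can be checked with a one-line calculation. The Python sketch below assumes a flat (white) noise density over the whole bandwidth, which the abstract does not state explicitly; the function name is hypothetical.

```python
from math import sqrt

def rms_resolution(noise_density, bandwidth_hz):
    """RMS noise over a bandwidth, assuming a flat (white) noise density."""
    return noise_density * sqrt(bandwidth_hz)

# 0.2 aF/sqrt(Hz) integrated over 10 kHz, result in aF
resolution_af = rms_resolution(0.2, 10e3)
```

Under that assumption the result is about 20 aF, which agrees with the reported 19.3 aF if the noise level of 0.2 aF/√Hz is a rounded figure.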
-
Calibration of Integrated CMOS Hall Sensors Using Coil-on-Chip in ATE Environment [p. 873]
-
M. Badaroglu, G. Decabooter, F. Laulanet and O. Charlier
Due to high demand for Hall sensors, mostly in
automotive and industrial applications, the development
and manufacturing of Hall sensors in System-on-Chip
(SoC) designs has become more important. On the other hand,
options for test and characterization of Hall sensors in a
manufacturing environment are very limited. In most
cases, external field generators are used to
characterize the Hall sensors on a small set of
production samples. In this paper, we present our Coil-on-Chip
(CoC) calibration methodology, which needs
no dedicated setup/assembly. Our
methodology is also immune to self-heating. It
enables reduced costs in test equipment,
100% screening of Hall sensors in manufacturing tests,
and reliable trimming of sensitivity spread over
temperature from -40°C to 150°C. Measurement results
before trimming show less than 20% six-sigma spread
for normalized sensitivity across 120 samples of
different Hall sensor structures processed in a 0.35 μm
high-voltage CMOS process.
-
A Programmable and Low-EMI Integrated Half-Bridge Driver in BCD Technology [p. 879]
-
F. D'Ascoli, L. Bacciarelli, M. Melani, L. Fanucci, G. Ricotti, E. Pardi, F. Vincis, M. Forliti and
M. De Marinis
This paper presents the design and the laboratory results
of an integrated half-bridge driver for power electronic
systems in a 0.35μm Bipolar CMOS DMOS (BCD)
technology. The proposed solution is designed for
frequency applications up to several hundred kHz and
it has a driving current capability up to 50 mA. This work
features a design configuration and a digital control to
reduce electromagnetic interference (EMI). Moreover, it
includes short circuit protection, programmability of
voltage references and digital control circuitry
implementing a mechanism to prevent dangerous failures of
the driver. After a detailed description of the circuit, we show
the laboratory results of the half-bridge driver used to
drive a 20 kHz antenna.
Moderators: C. Papachristou, Case Western Reserve U, US; D. Pradhan, Bristol U, UK
-
CASP: Concurrent Autonomous Chip Self-Test Using Stored Test Patterns [p. 885]
-
Y. Li, S. Makar and S. Mitra
CASP, Concurrent Autonomous chip self-test using Stored test
Patterns, is a special kind of self-test where a system tests
itself concurrently during normal operation without any
downtime visible to the end-user. CASP consists of two ideas:
1. Storage of very thorough test patterns in non-volatile
memory; and, 2. Architectural and system-level support for
autonomous testing of one or more cores in a multi-core
system using stored patterns, concurrently with normal system
operation, without bringing down the entire system. CASP
enables design of robust systems with built-in features for
circuit failure prediction, error detection, self-diagnosis and
self-repair. Such systems are necessary to overcome major
reliability challenges in scaled-CMOS technologies.
Implementation of CASP in the OpenSPARC T1 multi-core
processor demonstrates its effectiveness and practicality.
-
Defect Tolerance in Homogeneous Manycore Processors Using Core-Level Redundancy with
Unified Topology [p. 891]
-
L. Zhang, Y. Han, Q. Xu and X. Li
Homogeneous manycore processors are emerging for terascale
computation. Effective defect tolerance techniques are
essential to improve the yield of such complex integrated circuits.
In this paper, we propose to achieve fault tolerance
by employing redundancy at the core level instead of at the
microarchitecture level. When faulty cores exist on-chip in
this architecture, how to reconfigure the processor with the most
effective topology is a relevant research problem. We present
novel solutions for this problem, which not only maximize the
performance of the manycore processor, but also provide a unified
topology to operating system and application software running
on the processor. Experimental results show the effectiveness
of the proposed techniques.
-
A Low-Cost Concurrent Error Detection Technique for Processor Control Logic [p. 897]
-
R. Vemu, A. Jas, J.A. Abraham, S. Patil and R. Galivanche
This paper presents a concurrent error detection technique
targeted towards control logic in a processor with emphasis
on low area overhead. Rather than detect all modeled
transient faults, the technique selects faults which have
a high probability of causing damage to the architectural
state of the processor and protects the circuit against these
faults. Fault detection is achieved through a series of assertions.
Each assertion is an implication from inputs to the
outputs of a combinational circuit. Fault simulation experiments
performed on control logic modules of an industrial
processor suggest that a large reduction in damage-causing
faults can be achieved with a low overhead.
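The idea of an assertion as an implication from inputs to outputs can be sketched in Python as follows. The control block, its signal names and the stuck-at fault are all hypothetical and are not taken from the industrial processor evaluated in the paper.

```python
from itertools import product

def control_logic(is_load, is_store, stall):
    """Hypothetical control block: memory strobes are suppressed while stalled."""
    mem_read = is_load and not stall
    mem_write = is_store and not stall
    return mem_read, mem_write

def assertion_holds(inputs, outputs):
    """Implication 'stall => no memory strobe'; False signals a detected error."""
    _, _, stall = inputs
    mem_read, mem_write = outputs
    return (not stall) or not (mem_read or mem_write)

def detected_inputs(logic):
    """Input vectors for which the assertion flags the outputs of `logic`."""
    return [i for i in product([False, True], repeat=3)
            if not assertion_holds(i, logic(*i))]

def stuck_at_one_write(is_load, is_store, stall):
    """Fault model for illustration: mem_write stuck at 1."""
    mem_read, _ = control_logic(is_load, is_store, stall)
    return mem_read, True
```

The fault-free block never violates the assertion, while the faulty one is flagged only when `stall` is asserted. This mirrors the trade-off described above: a cheap assertion covers the cases in which a fault would corrupt state, not every modeled fault.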
-
Approximate Logic Circuits for Low Overhead, Non-Intrusive Concurrent Error Detection [p. 903]
-
M.R. Choudhury and K. Mohanram
This paper describes a scalable, technology-independent algorithm
for the synthesis of approximate logic circuits. A low overhead,
non-intrusive solution for concurrent error detection (CED) based
on such circuits is described in this paper. CED based on approximate
logic circuits does not impose any performance penalty on the
original design. The proposed synthesis algorithm for approximate
logic circuits scales with circuit size, and provides fine-grained
trade-offs between area-power overhead and CED coverage.
Moderators: J. Sztipanovits, Vanderbilt U, US; J. Beutel, ETH Zurich, CH
-
Logical Reliability of Interacting Real-Time Tasks [p. 909]
-
K. Chatterjee, A. Ghosal, T.A. Henzinger, D. Iercan, C.M. Kirsch, C. Pinello and
A. Sangiovanni-Vincentelli
We propose the notion of logical reliability for real-time
program tasks that interact through periodically updated
program variables. We describe a reliability analysis that
checks if the given short-term (e.g., single-period) reliability
of a program variable update in an implementation is
sufficient to meet the logical reliability requirement (of the
program variable) in the long run. We then present a notion
of design by refinement where a task can be refined
by another task that writes to program variables with less
logical reliability. The resulting analysis can be combined
with an incremental schedulability analysis for interacting
real-time tasks proposed earlier for the Hierarchical Timing
Language (HTL), a coordination language for distributed
real-time systems. We implemented a logical-reliability-enhanced
prototype of the compiler and runtime infrastructure
for HTL.
-
Scheduling of Fault-Tolerant Embedded Systems with Soft and Hard Timing Constraints [p. 915]
-
V. Izosimov, P. Pop, P. Eles and Z. Peng
In this paper we present an approach to the synthesis of fault-tolerant
schedules for embedded applications with soft and hard
real-time constraints. We aim to guarantee the deadlines
for the hard processes even in the case of faults, while maximizing
the overall utility. We use time/utility functions to
capture the utility of soft processes. Process re-execution is employed
to recover from multiple faults. A single static schedule
computed off-line is not fault tolerant and is pessimistic in terms
of utility, while a purely online approach, which computes a new
schedule every time a process fails or completes, incurs an unacceptable
overhead. Thus, we use a quasi-static scheduling
strategy, where a set of schedules is synthesized off-line and, at
run time, the scheduler will select the right schedule based on
the occurrence of faults and the actual execution times of processes.
The proposed schedule synthesis heuristics have been
evaluated using extensive experiments.
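A much simplified view of the quasi-static strategy is sketched below in Python: schedules are prepared off-line, and at run time the scheduler picks the one that still tolerates the faults seen so far and yields the highest utility. The schedule table, the step-shaped time/utility function and all names are assumptions for illustration; the paper's synthesis heuristics and switching conditions are not reproduced here.

```python
def utility(completion_time, steps):
    """Step-shaped time/utility function; steps = [(deadline, value), ...] sorted."""
    for deadline, value in steps:
        if completion_time <= deadline:
            return value
    return 0

# Hypothetical schedules synthesized off-line. More slack for re-execution
# tolerates more faults but delays the soft process.
SCHEDULE_TABLE = [
    {"name": "S0", "tolerates_faults": 0, "soft_completion": 30},
    {"name": "S1", "tolerates_faults": 1, "soft_completion": 50},
    {"name": "S2", "tolerates_faults": 2, "soft_completion": 90},
]
SOFT_UTILITY = [(40, 20), (60, 10)]  # utility 20 up to t=40, 10 up to t=60, then 0

def select_schedule(faults_observed):
    """Pick the feasible precomputed schedule with the highest soft utility."""
    feasible = [s for s in SCHEDULE_TABLE
                if s["tolerates_faults"] >= faults_observed]
    if not feasible:
        raise ValueError("no precomputed schedule tolerates this many faults")
    return max(feasible, key=lambda s: utility(s["soft_completion"], SOFT_UTILITY))
```

Selecting from a small precomputed table keeps the run-time cost to a lookup, which is the point of avoiding a purely online scheduler.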
-
Tool Support for Incremental Failure Mode and Effects Analysis of Component-Based
Systems [p. 921]
-
J. Elmqvist and S. Nadjm-Tehrani
Failure Mode and Effects Analysis (FMEA) is a well-known
technique widely used for safety assessment in the
area of safety-critical systems. However, FMEA is traditionally
done manually, which makes it both time-consuming
and costly, especially for large and complex systems. Also,
small modifications in the design may result in a complete
revision of the initial FMEA.
This paper presents tool support for automated incremental
component-based FMEA of SW and HW. It is based
on component safety interfaces and a formal compositional
safety analysis method. This tool support enables engineers
to focus on more important steps in the safety assessment
process. Also, during system upgrades, the tool incrementally
registers the changes and identifies possible effects in
the FMEA which enables the use of earlier safety analysis
results. Finally, this formal approach based on design models
of the components and the system always creates FMEAs
which are consistent with the system design.
-
Compositional Design of Isochronous Systems [p. 928]
-
J.-P. Talpin, J. Ouy, L. Besnard and P. Le Guernic
The synchronous modeling paradigm provides strong execution
correctness guarantees to embedded system design
while making minimal environmental assumptions. In
most related frameworks, global execution correctness is
achieved by ensuring endochrony: the insensitivity of (logical)
time in the system to (real) time in the environment.
Interestingly, endochrony can be statically checked, making
it fast to ensure design correctness. Unfortunately, endochrony
is not preserved by composition, making it difficult
to exploit with component-based design concepts in
mind. Compositionality can be achieved by weakening the
objective of endochrony but at the cost of an exhaustive
state-space exploration. This raises a trade-off between performance
and precision. Our aim is to balance it by proposing
a formal design methodology that adheres to a weakened
global design objective: the non-blocking composition
of weakly endochronous processes, while preserving
local endochrony objectives. This yields an ad-hoc yet
cost-efficient approach to compositional synchronous modeling.
Organizer/Moderator: A. Vörg, edacentrum, DE
-
Quantitative Productivity Measurement in IC Design [p. 934]
-
F. Badstübner and A. Vörg
This paper describes ongoing research in the field of
quantitative productivity measurement in IC Design and
simulation of different scenarios as decision support. Five
topics from this research field give insight into the
preparation of real design flows for productivity
measurement and how these measurements are used for
analysis, simulation and optimization of design flows. This
paper starts in section 1 with an introduction to the
PRODUKTIV+ project, in which most of the research has
been done. The modeling of projects and extraction of the
important indicators complexity and quality is explained
in sections 2 and 3. In section 4 Synopsys as an EDA
vendor from outside of PRODUKTIV+ adds its view on
productivity measurement. Section 5 contributes to the
modeling of a verification process for productivity
simulations. Section 6 explains an optimization process
for a microprocessor design flow under productivity
considerations.
Most of this work has been carried out in the
PRODUKTIV+ project (label 01 M 3077) that is partly
funded by the German government [17].
-
Determining the Technical Complexity of Integrated Circuits [p. 935]
-
P. Leppelt and E. Barke
The classification and quantification of a projected
design's technical properties is essential for the prediction
of success or failure of a microelectronic development
project. The derived values have to mirror the design's
capacity and thus allow for an estimation of the design
complexity. This chapter depicts the PRODUKTIV+
solution approach to the ascertainment of a design artifact
and the determination of equations in particular.
-
Qualitative and Quantitative Analysis of IC Designs [p. 935]
-
S. Häusler, F. Poppen, K. Hausmann, A. Hahn and W. Nebel
A project's output needs to be quantified to enable the
evaluation of its productivity. Besides complexity the
quality of result is a main criterion to consider.
Based on the quality definition of "conformance to
requirements" (see e.g. [14]), our approach combines
requirements and quality modelling to allow real time
tracking of the project status for integrated circuit design.
Each design project has its individual (quality-)
requirements and even components of the same design
may differ in this aspect. A general quality evaluation
concept has to cover this individualism. A simple example
is a component that should be reusable in multiple designs
and therefore has to fulfil specific criteria ([11], [13]).
Our approach utilises a machine readable requirements
definition in combination with common quality modelling
techniques. Requirement fulfilment degrees and the
current quality are computable based on this requirements
definition and a snapshot of the current development
status.
The Permeter framework developed by OFFIS is used to
collect the data that represents the current development
status. Permeter offers the functionality to load product
data from different sources and establishes links between
the data, e.g. requirements and corresponding components.
Permeter offers both manual and (semi-) automatic
linkage facilities. For a detailed description of the data
integration process refer to [15].
-
Capturing and Analyzing IC Design Productivity Metrics [p. 936]
-
J. Young
You can't improve what you can't measure and many
people won't take the time to measure. This tutorial
describes a practical, low impact method used by
Synopsys Professional Services design teams to measure
and analyze design flow and runtime metrics on their
customer chip projects. Details about the capture
methodology, database, and reporting infrastructure will
be discussed. Uses for the metrics reports, as well as an
overall context for design productivity improvement will
be discussed. Although the details are provided within the
framework of the Synopsys Design Environment, the
concepts described are applicable to any structured design
environment.
-
Application of Workflow Petri Nets to Modeling of Formal Verification Processes in Design
Flow of Digital Integrated Circuits [p. 937]
-
K. Weinberger, S. Bulach and W. Rosenstiel
According to statistics the verification of digital integrated
circuits (IC) claims up to 70 % of the design time and
effort in the design process. This means that the
verification process must be well structured and organized
in order to efficiently reach desired verification goals.
This paper describes the modelling of an exhaustive
formal verification process of a digital IC with Workflow
Petri Nets [8] and the WoPeD (Workflow Petri net
Designer) tool [9], which supports modelling, simulation
and analysis of a workflow process. The purpose of this
work is to formalize and quantify the verification process
such that it could subsequently be structurally and
behaviourally analyzed according to the means provided
by Petri Nets and, if desired, simulated with a particular
scenario. This approach makes it possible to explicitly
examine and derive the interaction of different factors
which influence a verification process such that their
relationships could be quantified. Initial experimental
results are presented and advantages and disadvantages of
this methodology are discussed.
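
As a rough illustration of the token game that underlies workflow Petri nets, the following Python sketch fires transitions in a three-step workflow. It is not the authors' model and does not use WoPeD; the place and transition names (`start`, `run_prover`, and so on) are hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical workflow: each transition maps to (input places, output places).
TRANSITIONS = {
    "write_properties": (["start"], ["properties_ready"]),
    "run_prover": (["properties_ready"], ["results_ready"]),
    "review_results": (["results_ready"], ["end"]),
}


def enabled(marking, name):
    """A transition is enabled when every input place holds enough tokens."""
    inputs, _ = TRANSITIONS[name]
    return all(marking[p] >= n for p, n in Counter(inputs).items())


def fire(marking, name):
    """Return the marking reached by firing an enabled transition."""
    if not enabled(marking, name):
        raise ValueError(f"transition {name!r} is not enabled")
    inputs, outputs = TRANSITIONS[name]
    new = Counter(marking)
    new.subtract(inputs)   # consume one token per input place
    new.update(outputs)    # produce one token per output place
    return +new            # unary plus drops places with zero tokens


def run_to_completion(marking):
    """Fire enabled transitions until none is left; return final marking and trace."""
    trace = []
    while True:
        ready = [t for t in TRANSITIONS if enabled(marking, t)]
        if not ready:
            return marking, trace
        marking = fire(marking, ready[0])
        trace.append(ready[0])
```

Starting from a single token in `start`, `run_to_completion(Counter({"start": 1}))` ends with a single token in `end`, which is the basic "proper completion" behaviour that workflow-net analysis checks for.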
-
Optimization of Design Flows for Multi-Core x86 Microprocessors in 45 and 32nm
Technologies under Productivity Considerations [p. 938]
-
H.-J. Brand
Designing next generation 45nm and 32nm multi-core
microprocessors creates new challenges caused by a
dramatic increase of design complexity and constraints
such as:
- increasing number of cores per chip
- enhancing cache sizes and cache systems
- increasing frequency for memory and serial interfaces
- heterogeneous multi-core architectures
- functional enhancements (security, virtualization, ...)
- DfY/DfM/DfV require the consideration of more and
new technology specific characteristics.
Without a considerable improvement of design
productivity new products will not be available in time to
market to create maximum economic value.
The presentation describes how an infrastructure to
measure productivity-relevant parameters for a
microprocessor design flow for 45 and 32nm technologies
can be built up.
Organizers: N. Suri, TU Darmstadt, DE; C. Fetzer, TU Dresden, DE
Moderator: N. Suri, TU Darmstadt, DE
-
Implications of Technology Trends on System Dependability [p. 940]
-
J.A. Abraham
CMOS has been the dominant integrated circuit (IC)
technology for nearly four decades, following the trends
predicted by Moore's Law, and fueling the information and
communication revolution. As chip geometries decrease
and transistor densities increase, new types of faults - from
manufacturing defects and operational transients to long-term
wearout - need to be addressed. These faults and
the resulting logic errors have been dealt with at both the
low and high levels of the design. This talk deals with approaches
for improving dependability at the system level.
-
Globally Optimized Robust Systems to Overcome Scaled CMOS Challenges [p. 941]
-
S. Mitra
Future system design methodologies must accept the fact
that the underlying hardware will be imperfect, and enable
design of robust systems that are resilient to hardware
imperfections. Three techniques that can enable a sea change
in robust system design are: 1. Built-In Soft Error Resilience
(BISER), 2. Circuit Failure Prediction, and 3. Concurrent
Autonomous self-test using Stored Patterns (CASP). Global
optimization across multiple abstraction layers is essential for
cost-effective robust system design using these techniques.
-
Software Protection Mechanisms for Dependable Systems [p. 947]
-
U. Wappler and M. Müller
We expect that in future commodity hardware will be
used in safety critical applications. But the used commodity
microprocessors will become less reliable because of
decreasing feature size and reduced power supply. Thus
software-implemented approaches to deal with unreliable
hardware will be required. As one basic step to software-implemented
hardware-fault tolerance (SIHFT) we aim at
providing failure virtualization by turning arbitrary value
failures caused by erroneous execution into crash failures
which are easier to handle. Existing SIHFT approaches either
are not broadly applicable or lack the ability to reliably
deal with permanent hardware faults. In contrast, Forin [7]
introduced the Vital Coded Microprocessor which reliably
detects transient and permanent hardware errors but is not
applicable to arbitrary programs and requires special hardware.
We discuss different approaches to generalize Forin's
approach and make it applicable to modern infrastructures.
Moderators: C. Heer, Infineon Technologies, DE; O. Deprez, Texas Instruments, FR
-
Subsystem Exchange in a Concurrent Design Process Environment [p. 953]
-
M. Strik, A. Gonier and P. Williams
This paper provides insight into the novel solutions used
to build SoCs targeting increased productivity in a complex
environment. Design of such SoCs relies on multi-team,
multi-site cooperation and data exchange. The data
exchange, made possible through descriptions based on The
SPIRIT Consortium's IP-XACT™ specification and the
automation for its processing, forms the basis of the
approach. Initially, the specification focused on IP reuse;
this has now been extended to SoC subsystem exchange.
This paper also describes state-of-the-art subsystem design
automation and improvement opportunities, based on a
close collaboration between NXP Semiconductors and
Mentor Graphics. We do not cover all the aspects of reuse
but mainly stress the concurrent engineering process.
-
Cooperative Safety: Combination Of Multiple Technologies [p. 959]
-
R. Panazzi, P. Capozio, M. Duncan, A. Scuderi, M. Siti and E. Merli
Governmental Transportation Authorities' interest
in Car to Car and Car to Infrastructure communication has grown
dramatically over the last few years, with the aim of increasing
road safety and reducing traffic emissions.
The achievement of these objectives is subject to development
of three aspects: Transmission, Localization and Sensor
Networks.
New wireless techniques evolved from current WiFi technology
shall be able to cut latency enough to achieve timely
and efficient communication among vehicles.
Relative positioning is essential to predict whether two cars
are on a collision course. Experts estimate that positioning
accuracy must be below one meter in order to provide the
necessary reaction time. Many technical issues exist in this field
as current GPS solutions do not provide this level of accuracy.
There are multiple standalone approaches existing for sensing
networks including imaging, radar and lidar. In order to create
fault tolerant SIL3 compliant systems, data fusion is obligatory.
The amalgamation of these different data streams requires
powerful multicore processing to recognize and react to multiple
concurrent scenarios.
-
System Performance Optimization Methodology for Infineon's 32-Bit Automotive
Microcontroller Architecture [p. 962]
-
A. Mayer and F. Hellwig
Microcontrollers are the core part of automotive Electronic
Control Units (ECUs). A significant investment of the ECU
manufacturers and even their customers is linked to the
specified microcontroller family. To preserve this
investment it is required to continuously design new
generations of the microcontroller with hardware and
software compatibility but higher system performance
and/or lower cost. The challenge for the microcontroller
manufacturer is to get the relevant inputs for improving the
system performance, since a microcontroller is used by
many customers in many different applications.
For Infineon's latest TriCore® based 32-bit microcontroller
product line, the required statistical data is gathered by
using the trace features of the Emulation Device (ED).
Infineon's customers use EDs in their unchanged target
system and application environment. With an analytical
methodology and based on this statistical data, the
performance improvements of different SoC architecture
and implementation options can be quantified. This allows
an objective assessment of improvement options by
comparing their performance cost ratios.
Moderators: J. Henkel, Karlsruhe U, DE; M. Smith, Royal Institute of Technology (KTH), SE
-
Process Variation Tolerant Design Through a Placement-Aware Multiple Voltage Island
Design Style [p. 967]
-
S. Bonesi, D. Bertozzi, L. Benini and E. Macii
A common technique to compensate process variation induced
performance deviations during post-silicon testing consists of the
dynamic adaptation of processor voltage. This however comes
at a significant power cost. We envision multi supply voltage
design (MSV) as a promising technique to mitigate such power
overhead. Voltage islands are widely recognized as the state-of-the-art
in MSV design. In this paper, we develop a novel design
methodology that leverages voltage islands to compensate process
variations through a commercial synthesis flow. Possible violation
scenarios of performance requirements in fabricated chips
are pre-characterized at design time through statistical static timing
analysis. Then, during post-silicon testing the supply voltage
of a proper number of voltage islands is raised depending on the
actual violation scenario, thus bringing performance back within
nominal values. Voltage islands are generated by exploiting cell
proximity for minimal perturbation of performance pre-optimized
placements.
-
Optimal MTCMOS Reactivation under Power Supply Noise and Performance Constraints [p. 973]
-
A. Calimera, L. Benini and E. Macii
Sleep transistor insertion is one of today's most promising
and widely adopted solutions for controlling stand-by leakage
power in nanometer circuits. Although single-cycle power
mode transition reduces wake-up latency, it originates large
discharge current spikes, thereby causing IR-drop and inductive
ground bounce for the surrounding circuit blocks. We
propose a new reactivation solution which helps in controlling
power supply fluctuations and in achieving minimum
reactivation times. Our structure limits the turn-on current
below a given threshold through sequential activation of
the sleep transistors, which are connected in parallel and are
sized using a novel optimal sizing algorithm. The proposed
methodology is validated using HSPICE simulations of several
benchmark circuits, which have been synthesized onto a
commercial 65nm CMOS technology library.
-
A Single-supply True Voltage Level Shifter [p. 979]
-
R. Garg, G. Mallarapu, S.P. Khatri
When a signal traverses on-chip voltage domains, a level shifter
is required. Inverters can handle a high to low voltage shift with
minimal leakage. For a low to high voltage level translation, inverters
tend to consume a large amount of leakage power, and
hence special circuits have been proposed for this type of translation.
This paper reports a novel single-supply "true" voltage level
shifter, where "true" means that it can handle both low-to-high
and high-to-low voltage level translation. Such a requirement arises in many
modern ICs or Systems-on-Chip (SoCs). The use of single supply
voltage reduces circuit complexity by eliminating the need for
routing both supply voltages. The proposed circuit was extensively
simulated in a 90nm technology using SPICE. Simulation results
demonstrate that the level shifter is able to perform voltage level
shifting with low leakage for both low to high, as well as high to
low voltage level translation. We have validated the correct operation
of the proposed level shifter under process and temperature
variations as well.
-
Clock Distribution Scheme Using Coplanar Transmission Lines [p. 985]
-
V.H. Cordero and S.P. Khatri
The current work describes a new standing wave oscillator scheme aimed for
clock propagation on coplanar transmission lines on a silicon die. The design
is aimed for clock signaling in the Gigahertz range (we are able to achieve
clock rates of 8GHz and above). The clock is transported as an oscillatory
wave on a pair of conductors. An oscillatory standing wave is formed across a
transmission line loop, which is connected beginning-to-end through a Möbius
configuration. A single cross coupled inverter pair is required to maintain
oscillation across the ring. The design is aimed to achieve low skew, low
power and extremely high frequency global clocking. The energy recycling
nature of a standing wave along a transmission line allows us to keep
very high frequency oscillations along a conductor with almost no power
consumption at all. A special wide input range driver was designed to convert
the differential signals on the coplanar transmission lines into a square clock
pulse for standard clock sinks. The design uses CMOS 90nm BSim3v model
cards for all simulations, with the transmission lines implemented on Metal8.
Moderators: M. Coppola, STMicroelectronics, FR; F. Petrot, TIMA Laboratory, FR
-
Compositional, Dynamic Cache Management for Embedded Chip Multiprocessors [p. 991]
-
A.M. Molnos, M.J.M. Heijligers and S.D. Cotofana
This paper proposes a dynamic cache repartitioning technique
that enhances compositionality on platforms executing
media applications with multiple utilization scenarios. The
repartitioning among scenarios requires a cache flush, thus
two undesired effects may occur: (1) the execution of critical
tasks may be disturbed and (2) a performance penalty
is involved. To cope with these effects we propose a method
which: (1) determines, at design time, the cache footprint
of each task, such that it creates the premises for critical
tasks safety, and reduces the amount of required flush, and
(2) enforces these footprints and further decreases the flush
penalty, at run-time. We implement our dynamic cache management
strategy on a CAKE multiprocessor with 4 Trimedia
cores. The experimental workload consists of 6 multimedia
applications, each of which is formed by multiple tasks belonging
to an extended MediaBench suite. For the repartitioned
cache we found on average that: (1) the relative variations
of critical tasks execution time are less than 0.1%, regardless
of the scenario switching frequency, (2) for realistic scenario
switching frequencies the inter-task cache interference
is at most 4%, and (3) the off-chip memory traffic is reduced
by 60%, and the performance (in cycles per instruction)
improves by 10%, when compared with the shared cache.
-
Comparison of Memory Write Policies for NoC Based Multicore Cache Coherent Systems [p. 997]
-
P. Guironnet de Massas and F. Pétrot
The following study shows a direct comparison of memory
write policies in Shared Memory Multicore Systems. Although
there is much work and many studies about this
issue, our work takes into account the difficulties related
to on chip communication using network-like interconnects.
Our study is based on Cycle Approximate Bit Accurate simulations
(CABA) of platforms with up to 64 processors,
modelling accurately all the aspects of multi-threaded program
execution and memory accesses. Our main results
show that write-through caches perform well compared to
write-back ones, with a slightly simpler implementation and
comparable traffic.
-
Serialized Asynchronous Links for NoC [p. 1003]
-
S. Ogg, E. Valli, B. Al-Hashimi, A. Yakovlev, C. D'Alessandro and L. Benini
This paper proposes an asynchronous
serialized link for NoC that can achieve the same levels of
performance in terms of flits per second as a synchronous
link but with a reduced number of wires in the point to
point switch links and reduced power consumption. This
is achieved by employing serialization in the
asynchronous domain as opposed to synchronous to
facilitate the removal of global clocking on the serial
links. Based on transistor level simulations using 0.12 μm
foundry models it has been shown that it is possible to
achieve the same level of performance as synchronous
but with 75% reduction in wires and 65% reduction in
power for a 300 MFlit/s link with 8 buffers with a switch
clock speed of 300 MHz. Furthermore the paper presents
the design requirements arising from interfacing switches
of synchronous NoC and asynchronous serial links.
Keywords: Network-on-Chip, Serial, Asynchronous,
Point-to-Point Links.
Moderators: M. Geilen, TU Eindhoven, NL; H. Ben Jamaa, EPFL, Lausanne, CH
-
Design Guidelines for Metallic-Carbon-Nanotube-Tolerant Digital Logic Circuits [p. 1009]
-
J. Zhang, N.P. Patil and S. Mitra
Metallic Carbon Nanotubes (CNTs) create source-drain
shorts in Carbon Nanotube Field Effect Transistors (CNFETs),
causing excessive leakage, degraded noise margin and delay
variation. There is no known CNT growth technique that
guarantees 0% metallic CNTs. Therefore, metallic CNT
removal techniques are necessary. Unfortunately, such removal
techniques alone are imperfect and insufficient. This paper
demonstrates the necessity for co-optimization of processing
techniques for metallic CNT removal together with CNFET-based
circuit design. We present a probabilistic CNFET circuit
model which forms the basis for such co-optimization, and use
the model to derive design and processing guidelines that
enable design of CNFET-based digital circuits with practical
constraints on leakage, noise margin and delay variations.
These guidelines are essential for designing robust
metallic-carbon-nanotube-tolerant digital circuits.
-
Quantified Synthesis of Reversible Logic [p. 1015]
-
R. Wille, H.M. Le, G.W. Dueck and D. Große
In recent years synthesis of reversible logic functions
has emerged as an important research area. Other fields
such as low-power design, optical computing and quantum
computing benefit directly from achieved improvements. Recently,
several approaches for exact synthesis of Toffoli networks
have been proposed. They all use Boolean satisfiability
to solve the underlying synthesis problem. In this paper a
new exact synthesis approach based on Quantified Boolean
Formula (QBF) satisfiability - a generalization of Boolean
satisfiability - is presented. Besides the application of QBF
solvers, we propose Binary Decision Diagrams to solve the
quantified problem formulation. This makes it easy to support
different gate libraries during synthesis. In addition,
all minimal networks are found in a single step and the best
one with respect to quantum costs can be chosen. Experimental
results confirm that the new technique is faster than
the best previously known approach and leads to cheaper
realizations in terms of quantum costs.
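
For readers unfamiliar with Toffoli networks, the sketch below simulates a small cascade of multiple-control Toffoli gates and checks that the realized function is a bijection, which is what "reversible" means here. It only illustrates the gate model, not the QBF- or BDD-based synthesis described in the abstract; the example network `NETWORK` is hypothetical.

```python
from itertools import product


def toffoli(bits, controls, target):
    """Multiple-control Toffoli gate: flip the target iff all controls are 1."""
    bits = list(bits)
    if all(bits[c] for c in controls):
        bits[target] ^= 1
    return tuple(bits)


def run_network(bits, gates):
    """Apply a cascade of (controls, target) gates from left to right."""
    for controls, target in gates:
        bits = toffoli(bits, controls, target)
    return bits


def is_reversible(gates, n):
    """A function on n bits is reversible iff it maps inputs to outputs one-to-one."""
    outputs = {run_network(b, gates) for b in product((0, 1), repeat=n)}
    return len(outputs) == 2 ** n


# Hypothetical 3-line network: a CNOT (one control) followed by a Toffoli gate.
NETWORK = [((0,), 1), ((0, 1), 2)]
```

Because every Toffoli gate is its own inverse, running the same gates in reverse order (`NETWORK[::-1]`) undoes the computation.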
-
Adaptive Simulation for Single-Electron Devices [p. 1021]
-
N. Allec, R. Knobel and L. Shang
Single-electron devices have drawn much attention in the last two
decades. They have been widely used for device research and also
show promise as a potential alternative to complementary metal-oxide-semiconductor
circuits due to their ultra low power dissipation. Three
techniques have been used for single-electron device modeling in the
past, including Monte Carlo simulation, master equation, and SPICE
modeling. Among these, the Monte Carlo method provides accuracy,
but lacks the time efficiency required for large scale simulation. In
this work, we introduce an adaptive multi-scale approach to single-electron
device simulation using the Monte Carlo method as a basis, which
significantly improves time efficiency while maintaining accuracy. We
have shown it is possible to reduce simulation time by up to 40 times
and maintain an average error of 3.3% compared to the non-adaptive
Monte Carlo method. Going beyond simplistic approximations, we have
modeled important secondary effects including cotunneling and Cooper
pair tunneling, which are critical for device research.
-
OS-Based Sensor Node Platform and Energy Estimation Model for Health-Care Wireless
Sensor Networks [p. 1027]
-
F.J. Rincón, M. Paselli, J. Recas, Q. Zhao, M. Sánchez Eles, D. Atienza, J. Penders and G. De
Micheli
Accurate power and performance figures are critical to
assess the effective design of possible sensor node architectures
in Body Area Networks (BANs) since they operate on
limited energy storage. Therefore, accurate power models
and simulation tools that can model real-life working conditions
need to be developed and validated with real platforms.
In this paper we propose a sensor node platform designed
for health-care applications and a validated simulation
model based on event-driven operating system simulation
that can be used to accurately analyze performance and
power consumption in BANs composed of multiple nodes.
Thus, this model can be employed to tune the node architecture
and communication layer for different working conditions,
applications and topologies of BANs. In this paper
we validate the proposed simulation model on different real-life
applications and working conditions. Our results show
variations of less than 4% between the presented simulation
framework and measurements in the final platforms.
Moderators: S. Goddard, U of Nebraska - Lincoln, US; P. Mosterman, The MathWorks, US
-
Improvements in Polynomial-Time Feasibility Testing for EDF [p. 1033]
-
A. Masrur, S. Drössler and G. Färber
This paper presents two fully polynomial-time sufficient
feasibility tests for EDF when considering periodic tasks
with arbitrary deadlines and preemptive scheduling on
uniprocessors. Both proposed methods are proven, analytically
and by means of an extensive experimental comparison,
to be more accurate than known polynomial-time
feasibility tests. Additionally, we show for a wide interval
of practical processor utilization that one of these methods
presents almost the same efficiency, in terms of accepted
task sets, as the more complex pseudo-polynomial-time exact
feasibility tests.
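
To make the notion of a sufficient polynomial-time test concrete, the sketch below implements the classic density test, a well-known baseline and not one of the two methods proposed in this paper. A task is assumed to be a tuple `(C, D, T)` of worst-case execution time, relative deadline and period; this representation is an assumption of the example.

```python
from fractions import Fraction


def density_test(tasks):
    """Sufficient EDF feasibility test on a uniprocessor.

    tasks: iterable of (C, D, T) with integer execution time, relative
    deadline and period. Returns True if sum(C / min(D, T)) <= 1.
    A False result is inconclusive: the task set may still be feasible.
    """
    total = sum(Fraction(c, min(d, t)) for c, d, t in tasks)
    return total <= 1
```

For example, `density_test([(1, 4, 4), (2, 6, 8)])` accepts the set (total density 7/12), while `density_test([(1, 1, 10), (1, 2, 10)])` rejects a set that EDF can in fact schedule, since the first job can run in [0, 1] and the second in [1, 2]. This pessimism is what more accurate sufficient tests aim to reduce.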
-
A Dual-Priority Real-Time Multiprocessor System on FPGA for Automotive Applications [p. 1039]
-
A. Tumeo, M. Branca, L. Camerini, M. Ceriani, M. Monchiero, G. Palermo, F. Ferrandi
and D. Sciuto
This paper presents the implementation of a dual-priority
scheduling algorithm for real-time embedded systems
on a shared memory multiprocessor on FPGA. The
dual-priority microkernel is supported by a multiprocessor
interrupt controller to trigger periodic and aperiodic thread
activation and manage context switching. We show how the
dual-priority algorithm performs on a real system prototype
compared to the theoretical performance simulations with a
typical standard workload of automotive applications, underlining
where the differences are.
-
An Application-Based EDF Scheduler For OSEK/VDX [p. 1045]
-
C. Diederichs, U. Margull, F. Slomka and G. Wirrer
Earliest deadline first scheduling achieves processor
utilization of up to 100 percent and improved robustness in
overload situations. However, most automotive applications
are running under a static priority policy. Because of
this, the standard operating system in the automotive industry,
OSEK/VDX, just supports priority scheduling. This
paper describes an EDF scheduler plug-in for OSEK/VDX.
The plug-in provides EDF scheduling without changes to
the operating system by delaying task activations.
The add-on was tested for an engine management system
developed by SiemensVDO. Results of this experiment are
presented and discussed, showing that the EDF scheduling
techniques can improve the system in aspects of robustness
and resource utilization.
-
Time Properties of the BuST Protocol under the NPA Budget Allocation Scheme [p. 1051]
-
G. Franchino, G. Buttazzo and T. Facchinetti
Token passing is a channel access technique used in
several communication networks. Among them, one of
the most effective solutions for supporting both real-time
traffic (synchronous messages) and non real-time traffic
(asynchronous messages), is the so-called timed-token
protocol. Recently, a new token passing protocol, called
Budget Sharing Token protocol (BuST), was proposed to
improve the existing timed-token approaches in terms of
synchronous bandwidth guarantee, while guaranteeing a
minimum throughput for the asynchronous traffic.
This paper analyzes the ability of BuST to manage real-time
and non real-time traffic in comparison with the classic
timed-token protocol and its modified version, under
the Normalized Proportional Allocation (NPA) scheme.
We will show that BuST achieves higher guaranteed real-time
bandwidth than the original timed-token protocol,
and improves the service for the non real-time traffic with respect
to its modified version.
Moderators: P. Brisk, EPFL, Lausanne, CH; N. Dutt, UC Irvine, US
-
Simultaneous FU and Register Binding Based on Network Flow Method [p. 1057]
-
J. Cong and J. Xu
With the rapid increase of design complexity and the
decrease of device features in nano-scale technologies,
interconnection optimization in digital systems becomes more
and more important. In this paper we develop a simultaneous
FU and register (SFR) binding algorithm for multiplexer
optimization based on min-cost network flow. Unlike most of the
prior approaches in which functional unit binding and register
binding are performed sequentially, our approach performs
these two highly correlated tasks gradually and concurrently. We
also present an ILP formulation of the combined functional unit
and register binding problem for the optimality study of
heuristics. Experimental results show that when compared to
traditional binding algorithms, our simultaneous resource
binding algorithm is close to optimal solutions for small-size
designs (only 5% more MUX) and achieves significant reduction
for MUX area (12%) and timing (10%) for a set of real-life
benchmark designs.
-
A Variation Aware High Level Synthesis Framework [p. 1063]
-
F. Wang, G. Sun and Y. Xie
The worst-case delay/power of function units has been used
in traditional high level synthesis to facilitate design space exploration. As
technology scales to nanometer regime, the impact of process variations
increases. The degree of variability encountered in the new process
technologies makes worst-case analysis undesirable, because it may
result in unexpected performance/power discrepancy or a pessimistic
estimation, and may end up using excess resources to guarantee design
constraints. In this paper, we propose a high level synthesis framework
that takes into account the performance/power variation for function
units. An effective metric called parametric yield, which is defined as
the probability of the synthesized data flow graph (DFG) meeting the
performance and power constraints, is used to guide scheduling, module
selection, and resource sharing. An efficient performance/power yield
perturbation computation method for DFG significantly improves the
effectiveness of our yield driven high level synthesis algorithm. The
experimental results show that our variation-aware synthesis framework
achieves significant yield improvements, and runs much faster (3X)
than the previous approach.
-
EPIC: Ending Piracy of Integrated Circuits [p. 1069]
-
J.A. Roy, F. Koushanfar and I.L. Markov
As semiconductor manufacturing requires greater capital
investments, the use of contract foundries has grown
dramatically, increasing exposure to mask theft and unauthorized
excess production. While only recently studied, IC
piracy has now become a major challenge for the electronics
and defense industries [6].
We propose a novel comprehensive technique to end
piracy of integrated circuits (EPIC). It requires that every
chip be activated with an external key, which can only be
generated by the holder of IP rights, and cannot be duplicated.
EPIC is based on (i) automatically-generated chip
IDs, (ii) a novel combinational locking algorithm, and (iii)
innovative use of public-key cryptography. Our evaluation
suggests that the overhead of EPIC on circuit delay and
power is negligible, and the standard flows for verification
and test do not require change. In fact, major required components
have already been integrated into several chips in
production. We also use formal methods to evaluate combinational
locking and computational attacks. A comprehensive
protocol analysis concludes that EPIC is surprisingly
resistant to various piracy attempts.
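
The general idea of locking a combinational circuit with key gates can be sketched on a toy example. The code below is an illustration under stated assumptions, not the EPIC algorithm itself: the circuit, the choice of wires and the key value `CORRECT_KEY` are all hypothetical, and the chip IDs and public-key protocol described in the abstract are not modelled.

```python
from itertools import product


def original(a, b, c):
    """Toy circuit: out = (a AND b) OR c."""
    return (a & b) | c


def locked(a, b, c, k0, k1):
    """Same circuit with two key gates inserted on internal wires.

    Wire w1 carries an XOR key gate (transparent when k0 == 0);
    wire w2 carries an XNOR key gate (transparent when k1 == 1).
    """
    w1 = (a & b) ^ k0
    w2 = 1 - (c ^ k1)  # XNOR of c and k1
    return w1 | w2


CORRECT_KEY = (0, 1)  # hypothetical key for this toy circuit


def matches_original(key):
    """True if the locked circuit equals the original on every input."""
    return all(locked(a, b, c, *key) == original(a, b, c)
               for a, b, c in product((0, 1), repeat=3))
```

In this toy case only `CORRECT_KEY` restores the original function; each of the other three keys produces a wrong output for at least one input.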
-
VLSI Implementation of SISO Arithmetic Decoder for Joint Source Channel Coding [p. 1075]
-
S. Zezza and G. Masera
In this paper we propose an efficient VLSI implementation
of a Soft Input Soft Output (SISO) arithmetic code (AC)
decoder for joint source channel coding. The addressed application
shows a very high level of processing complexity,
but, to the best of our knowledge, no papers have been
published in the literature on the hardware implementation
of the considered joint source channel scheme. First we
introduce a simplified algorithm for the SISO AC, which
is 1.3 times faster than the standard one. Then an efficient
SISO AC architecture is proposed and synthesis results
on a 0.13 μm standard cells technology are reported
for two different sets of parameters (M=128, M=256). The
proposed core runs at 338.9 MHz and can decode up to
124.987 kbit/s.
-
Error Detection/Correction in DNA Algorithmic Self-Assembly [p. 1079]
-
S. Frechette and F. Lombardi
A novel error detection/correction technique for algorithmic
self-assembly is presented in this paper. Through
the use of a tile set that allows errors to be isolated and
propagated to the boundary edge of 2D (two-dimensional)
assemblies, the proposed technique permits growth errors
to be detected and corrected. For assemblies in which each
four-sided tile is a party to only one tile mismatch, all
growth errors in the assembly can be detected and corrected
using the proposed method with only two additional
tiles. This technique relies on the attachment of so-called
isolation tiles at set periods, thus implementing a checkpoint
for error detection/correction. The physical environment
and related features for the removal of the erroneous
sections of an assembly are presented.
Index Terms: error detection and correction, check-pointing, error tolerance, DNA self-assembly, tiling.
-
Temperature-Aware Voltage Selection for Energy Optimization [p. 1083]
-
M. Bao, A. Andrei, P. Eles and Z. Peng
This paper proposes a temperature-aware dynamic voltage
selection technique for energy minimization and presents a
thorough analysis of the parameters that influence the potential
gains that can be expected from such a technique, compared
to a voltage selection approach that ignores temperature.
-
A Fast Approximation Algorithm for MIN-ONE SAT [p. 1087]
-
L. Fang and M.S. Hsiao
In this paper, we propose a novel approximation algorithm (RelaxSAT)
for MIN-ONE SAT. RelaxSAT generates a set of constraints
from the objective function to guide the search. The
constraints are gradually relaxed to eliminate the conflicts with
the original Boolean SAT formula until a solution is found. The
experiments demonstrate that RelaxSAT is able to handle very
large instances which cannot be solved by existing MIN-ONE
algorithms; furthermore, very tight bounds on the solution were
obtained with one to two orders of magnitude speedup.
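
To pin down the problem being approximated, the following sketch solves MIN-ONE SAT exactly by brute force: it looks for a satisfying assignment with as few variables set to true as possible. This is a reference definition for tiny instances only, not RelaxSAT; the DIMACS-style clause encoding is an assumption of the example.

```python
from itertools import combinations


def min_one_sat(clauses, n_vars):
    """Return the set of variables set to True in a satisfying assignment
    with the fewest True variables, or None if the formula is unsatisfiable.

    Clauses use DIMACS-style literals: k means variable k, -k its negation.
    Exhaustive search, exponential in n_vars.
    """
    for ones in range(n_vars + 1):  # try assignments with 0, 1, 2, ... ones
        for true_vars in combinations(range(1, n_vars + 1), ones):
            true_set = set(true_vars)
            if all(any((lit > 0) == (abs(lit) in true_set) for lit in clause)
                   for clause in clauses):
                return true_set
    return None
```

For the formula (x1 OR x2) AND (NOT x1 OR x3) AND (x2 OR x3), `min_one_sat([[1, 2], [-1, 3], [2, 3]], 3)` returns `{2}`: setting only x2 to true satisfies every clause.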
-
Deep Submicro Interconnect Timing Model with Quadratic Random Variable Analysis [p. 1091]
-
J.-K. Zeng and C.-P. Chen
Shrinking feature sizes and process variations are of increasing
concern in modern technology. It is urgent that
we develop statistical interconnect timing models which are
harmonious with the current trend in statistical timing analysis
flow. Although statistical model order reduction techniques
have been explored, the statistical interconnect timing
model has not yet been fully analyzed.
In this work, we develop a novel algorithm and its corresponding
analysis for the statistical interconnect timing
model, using second-order statistical variations to model
the non-Gaussian distribution effects. As this model is
fully congruous with current statistical static timing analysis
with the canonical model and does not require any
Monte Carlo simulation analysis, performance is greatly
improved. Experimental results show that the proposed
closed-form quadratic interconnect timing model is within
0.0046% error of the corresponding Monte Carlo simulation.
-
An Efficient Algorithm for Free Resources Management on the FPGA [p. 1095]
-
Y. Lu, T. Marconi, G. Gaydadjiev and K. Bertels
Finding the available empty space for arrival tasks on FPGAs
with runtime partially reconfigurable abilities is the most time
consuming phase in on-line placement algorithms. Naturally, this
phase has the highest impact on the overall system performance.
In this paper, we present a new algorithm which is used to find
the complete set of maximum free rectangles on the FPGA at runtime.
During scanning, our algorithm relies on dynamic information
about the edges of all already placed tasks. Simulation results
show that our algorithm has 1.5x to 5x speedup compared to state
of the art algorithms aiming at maximum free rectangles. In addition,
our proposal requires at least 4.4x less scanning load.
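To make the notion of a maximum free rectangle concrete, the following is a naive sketch that enumerates them on a small occupancy grid. It is not the edge-based scanning algorithm of the paper; the grid representation, coordinate convention and function name are assumptions for illustration. A free rectangle is reported when it cannot be grown by one cell in any of the four directions.

```python
def maximal_free_rectangles(grid):
    """Return all maximal empty rectangles of an occupancy grid.

    grid[r][c] is True where the cell is occupied by a placed task.
    Each rectangle is (top, left, bottom, right) with inclusive bounds.
    Brute force, intended only for small illustrative grids.
    """
    rows, cols = len(grid), len(grid[0])

    def is_free(t, l, b, r):
        return all(not grid[y][x]
                   for y in range(t, b + 1)
                   for x in range(l, r + 1))

    def in_bounds_and_free(t, l, b, r):
        # Bounds are checked first so is_free never sees invalid indices.
        return t >= 0 and l >= 0 and b < rows and r < cols and is_free(t, l, b, r)

    result = set()
    for t in range(rows):
        for l in range(cols):
            for b in range(t, rows):
                for r in range(l, cols):
                    if not is_free(t, l, b, r):
                        continue
                    # Candidates grown by one cell up, left, down and right.
                    grown = [(t - 1, l, b, r), (t, l - 1, b, r),
                             (t, l, b + 1, r), (t, l, b, r + 1)]
                    if not any(in_bounds_and_free(*g) for g in grown):
                        result.add((t, l, b, r))
    return result
```

The cost of this exhaustive scan illustrates why free-space search dominates the run time of on-line placement.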
-
Performance-Constrained Different Cell Count Minimization for Continuously-Sized
Circuits [p. 1099]
-
H. Yoshida and M. Fujita
A continuously-sized circuit resulting from transistor sizing
consists of gates with a large variety of sizes. In this paper,
we first provide a formal formulation of the performance-constrained
different cell count minimization problem, and
then propose an effective hill-climbing heuristic which iteratively
minimizes the number of cells under performance
constraints such as area, delay and power. To the best of our
knowledge, this is the first attempt to address the different
cell count minimization problem.
-
Test Scheduling for Wafer-Level Test-During-Burn-In of Core-Based SoCs [p. 1103]
-
S. Bahukudumbi, K. Chakrabarty and R. Kacprowicz
Wafer-level test during burn-in (WLTBI) has recently
emerged as a promising technique to reduce test and
burn-in costs in semiconductor manufacturing. However, the
testing of multiple cores of a system-on-chip (SoC) in parallel
during WLTBI leads to constantly-varying device power during
the duration of the test. This power variation adversely affects
predictions of temperature and the time required for burn-in.
We present a test-scheduling technique for WLTBI of core-based
SoCs, where the primary objective is to minimize the variation
in power consumption during test. A secondary objective is
to minimize the test application time. Simulation results are
presented for two ITC'02 SoC benchmarks, and the proposed
technique is compared with two baseline methods.
-
CARbridge, Reduction of System Complexity by Standardization of the System-Basis-Chips
for Automotive Applications [p. 1107]
-
P. Scheer, E. Schmidt and S. Burges
Semiconductor manufacturers continue to
integrate functionality into systems on a chip.
In the automotive area, the current focus is on
system basis chips (SBCs). In this context, system basis chips
are all the components surrounding embedded microcontrollers,
such as transceivers, watchdogs,
voltage regulators, sensor interfaces, switches and
diagnosis functions. Because of the lack of a
standard, implementations differ and acceptance is
missing in the development community. Moreover, the
potential evolution of the CPU+SBC system does not
happen, because no common target exists.
Therefore major car manufacturers are going to
introduce a new standard: CARbridge.
Organizers: N. Suri, TU Darmstadt, DE; C. Fetzer, TU Dresden, DE
Moderator: C. Fetzer, TU Dresden, DE
-
Specification and Design Considerations for Reliable Embedded Systems [p. 1111]
-
A. Israr and S. Huss
The objective of this paper is to introduce a novel
representation as a means to consider both permanent
and temporal errors in order to increase the overall reliability
of an embedded system. The deployment of embedded
systems in safety critical applications, e.g. in
the automotive domain, demands that the fundamental
set of design criteria consisting of functionality, timeliness,
and production costs be extended to consider
reliability as an optimization criterion. Thus reliability
engineering becomes part of the overall design flow
for embedded systems. The proposed approach is based
on the introduction of Permanent/Transient error Decision
Diagrams and on dedicated algorithms for the
generation of system implementation sets which feature
maximum reliability at minimal costs in terms of
redundant resources. The proposed approach is demonstrated
for a control system taken from the automotive
domain.
-
Synthesis of Fault-Tolerant Embedded Systems [p. 1117]
-
P. Eles, V. Izosimov, P. Pop and Z. Peng
This work addresses the issue of design optimization for fault-tolerant
hard real-time systems. In particular, our focus is on the
handling of transient faults using both checkpointing with rollback
recovery and active replication. Fault tolerant schedules
are generated based on a conditional process graph representation.
The formulated system synthesis approaches decide the assignment
of fault-tolerance policies to processes, the optimal
placement of checkpoints and the mapping of processes to processors,
such that multiple transient faults are tolerated, transparency
requirements are considered, and the timing constraints
of the application are satisfied.
Organizer/Moderator: N. Suri, TU Darmstadt, DE
-
Reliable Services in an Imperfect World [p. 1123]
-
H. Kopetz
With the ongoing trends of hardware complexity - device density increases, reducing geometries, lower
switching thresholds etc. - hardware increasingly exhibits transient faults. Software is not perfect and the
increasing complexity results in Heisenbugs. Consequently it becomes a complex technological challenge to
build dependable embedded systems that can accommodate and mitigate these facts of hardware and software
transients such that the user perceived services are not seriously impacted.
Organizer/Moderator: A. Hemani, Royal Institute of Technology, Stockholm SE; A. Jantsch, Royal Institute of Technology, Stockholm SE
Moderator: A. Hemani, Royal Institute of Technology, Stockholm SE
-
Video Processing Requirements on SoC Infrastructures [p. 1124]
-
P. van der Wolf and T. Henriksson
Applications from the embedded consumer domain put
challenging requirements on SoC infrastructures, i.e.
interconnect and memory. Specifically, video applications
demand large storage capacity and high bandwidth while
data accesses can be irregular. The SoC architectures used
for implementing these applications typically contain a
heterogeneous collection of processing elements and use a
single interface to off-chip DRAM in order to provide the
required storage capacity at a low cost. Proper integration
of interconnect and memory architecture is required to
achieve the required bandwidths and latencies for
accessing memory. The application requirements as well as
the characteristics and constraints for accessing memory
are key inputs for NoC design. Future memory technologies
may cause a paradigm shift by offering high-bandwidth
memory access, possibly via multiple memory interfaces.
-
Memory Technology for Extended Large-Scale Integration in Future Electronics
Applications [p. 1126]
-
D. Pamunuwa
Extending 2-D planar topologies in integrated circuits
(ICs) to a 3-D implementation has the obvious benefits of
reducing the overall footprint and average interconnection
length, with associated improvements in cost,
delay and energy consumption, while also providing an
opportunity to integrate disparate technologies. Such
advances are very much technology driven, and early
research into 3-D integration has now crystallised into
commercially viable options that are being pursued by
many companies. Being able to position memory in closer
proximity to processing elements in a NoC architecture as
afforded by a 3-D physical architecture has the potential
to improve the memory bandwidth and mitigate the general
nature of delay constrained performance in IC
design. Understanding the nature of the opportunities and
constraints provided in such a 3-D physical architecture is
crucial in realising the true benefits of 3-D integration in
future applications.
-
Memory-aware NoC Exploration and Design [p. 1128]
-
N. Dutt
In the past decade, tremendous progress has been made in NoC
research, spanning architectures, protocols and tools. In
addition to a large number of academic and research projects,
we are now seeing several commercial realizations of NoC-based
chip designs. With chip capacities going well beyond
the billion transistor mark, on one hand large amounts of the
die are occupied by memory resources and on the other hand
many complex applications being mapped to these chips are
also memory-intensive. In such instances, memories dominate
all the axes of traditional design constraints, including, but not
limited to performance, area (cost), and power/energy.
Furthermore, the move towards sub-nanometer technologies
elevates another critical design consideration: process
variability and thermal sensitivity, which in turn critically
affect the reliability of memories as well. All of these trends
make the case for a memory-aware NoC design methodology.
Moderators: M. Fujita, Tokyo U, JP; T. Shiple, Synopsys, FR
-
Incremental Criticality and Yield Gradients [p. 1130]
-
J. Xiong, V. Zolotov and C. Visweswariah
Criticality and yield gradients are two crucial diagnostic
metrics obtained from Statistical Static Timing Analysis (SSTA). They
provide valuable information to guide timing optimization and timing-driven
physical synthesis. Existing work in the literature, however,
computes both metrics in a non-incremental manner, i.e., after one
or more changes are made in a previously-timed circuit, both metrics
need to be recomputed from scratch, which is obviously undesirable
for optimizing large circuits. The major contribution of this paper is
to propose two novel techniques to compute both criticality and yield
gradients efficiently and incrementally. In addition, while node and edge
criticalities are addressed in the literature, this paper for the first time
describes a technique to compute path criticalities. To further improve
algorithmic efficiency, this paper also proposes a novel technique to
update "chip slack" incrementally. Numerical results show our methods
to be over two orders of magnitude faster than previous work.
-
Latch Modeling for Statistical Timing Analysis [p. 1136]
-
S.X. Shi, A. Ramalingam, D. Wang and D.Z. Pan
Latch based circuits are widely adopted in high
performance circuits. But there is a lack of accurate latch models
for doing timing analysis. In this paper, we propose a new latch
delay model in the context of SSTA based on a new perspective of
latch timing. The proposed latch model also takes into account the
external timing variations such as data slew. The new latch model
is integrated into SSTA by considering the timing analysis of both
the combinational logic network and the clock distribution
network simultaneously. The experimental results show that
ignoring accurate latch modeling may lead to large errors (e.g.,
50% at PDF peak).
-
Conditional Partial Order Graphs and Dynamically Reconfigurable Control Synthesis [p. 1142]
-
A. Mokhov and A. Yakovlev
The paper introduces a new formal model for specifying
control paths in the context of asynchronous system design.
The model, called Conditional Partial Order Graph
(CPOG), is capable of capturing concurrency and choice
in a system's behaviour in a compact and efficient way. A
problem of CPOG synthesis is formulated and solved; various
CPOG optimisation techniques are presented.
The introduced model can be used for the specification
of system behaviour and for synthesis of area-efficient dynamically
reconfigurable controllers. The synthesis of a
controller is based on a novel generic architecture, called
Transition Sequence Encoder (TSE). The synthesized controllers
are speed independent and thus very robust to parametric
variations. The ideas presented in the paper can be
applied for CPU control synthesis as well as for synthesis
of different kinds of event-coordination circuits often used
in data coding and communication in digital systems.
Moderators: O. Deprez, Texas Instruments, FR; J. Quevremont, Thales, FR
-
Efficient Software Architecture for IPSec Acceleration Using a Programmable Security
Processor [p. 1148]
-
J. Thoguluva, A. Raghunathan and S.T. Chakradhar
Cryptographic accelerators and security processors are
often used in embedded systems in order to enable enhanced
security without significantly impacting performance
or power consumption. However, realizing the performance
promised by them requires the design of efficient software
architectures for crypto offloading (offloading cryptographic
operations from a host processor). In this paper, we describe
an efficient software architecture for IPSec crypto offloading
on a state-of-the-art mobile application processor
system-on-chip (SoC) that includes a programmable security
processor. We consider both user-space and kernel-space
implementations of IPSec, compare their performance, and
identify factors that limit the efficiency of crypto offloading.
We describe two optimizations, called protocol-level crypto
offloading and adaptive crypto offloading, which further improve
the performance of IPSec by (i) offloading higher granularity
computations to reduce the crypto offloading overheads,
and (ii) using crypto offloading judiciously based on
the trade-off between the savings in processing cycles vs.
the overhead of communication with the security processor.
We measure the performance of our implementation of IPSec
crypto offloading using a commercial network protocol stack
on the mobile application processor SoC, under a wide range
of workloads. Our results indicate that efficient crypto offloading
can result in application-level improvements of up
to 10.6X in data rate and up to 5X in latency, enabling IPSec
to be used for emerging high-bandwidth and interactive mobile
applications.
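The trade-off behind adaptive crypto offloading, processing cycles saved versus the overhead of communicating with the security processor, can be illustrated with a simple linear cost model. This is a sketch under assumed parameters, not the decision rule used in the paper; the function name, the linear model and all cycle counts are hypothetical.

```python
def should_offload(payload_bytes, host_cycles_per_byte,
                   offload_fixed_cycles, offload_cycles_per_byte):
    """Decide whether to offload one crypto operation (hypothetical model).

    Host cost is modelled as proportional to the payload size. Offload
    cost is a fixed communication overhead plus a per-byte transfer and
    processing cost. Offload only when it is estimated to be cheaper.
    """
    host_cost = host_cycles_per_byte * payload_bytes
    offload_cost = offload_fixed_cycles + offload_cycles_per_byte * payload_bytes
    return offload_cost < host_cost
```

With assumed costs of 20 host cycles per byte, a 5000-cycle fixed overhead and 2 offload cycles per byte, the break-even point is about 278 bytes: a 64-byte packet is cheaper to process on the host, while a 1500-byte packet is worth offloading.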
-
Operating System Controlled Processor-Memory Bus Encryption [p. 1154]
-
X. Chen, R.P. Dick and A. Choudhary
Unencrypted data appearing on the processor-memory
bus can result in security violations, e.g., allowing
attackers to gather keys to financial accounts and personal
data. Although on-chip bus encryption hardware can solve this
problem, it requires hardware redesign or increases processor
cost. Application redesign to prevent sensitive data from appearing
on the processor-memory bus is extremely difficult.
We propose and evaluate a processor-memory bus encryption
technique for embedded systems that requires no changes to
applications or hardware. This technique exploits cache locking
or scratchpad memory, features present in many embedded
processors, permitting the operating system (OS) virtual memory
infrastructure to automatically encrypt data belonging to protected
processes as they are written to off-chip memory. Pages
belonging to unprotected processes are stored unencrypted to
prevent performance and energy consumption penalties.
We evaluate the proposed bus encryption technique using full
system simulation. Experimental results indicate that it is possible
to prevent the working data sets of processes from appearing on
the processor-memory bus in plaintext, without using dedicated
hardware and without changing applications. The OS based
technique results in 1.37x slowdown for protected processes
for processors with 512KB of L2 cache and 1.78x slowdown
for processors with 256KB of L2 cache. There are negligible
performance penalties for unprotected processes.
-
An Efficient FPGA Implementation of Principle Component Analysis Based Network
Intrusion Detection System [p. 1160]
-
A. Das, S. Misra, S. Joshi, J. Zambreno, G. Memik and A. Choudhary
Modern Network Intrusion Detection Systems (NIDSs)
use anomaly detection to capture malicious attacks. Since
such connections are described by a large set of dimensions,
processing these huge amounts of network data becomes extremely
slow. To solve this time-efficiency problem, statistical
methods like Principal Component Analysis (PCA) can
be used to reduce the dimensionality of the network data. In
this paper, we design and implement an efficient FPGA architecture
for Principal Component Analysis to be used in
NIDSs. Moreover, using representative network intrusion
traces, we show that our architecture correctly classifies attacks
with detection rates exceeding 99.9% and false alarm
rates as low as 1.95%. Our implementation on a Xilinx
Virtex-II Pro FPGA platform provides a core throughput of
up to 24.72 Gbps, clocking at a frequency of 96.56 MHz.
Moderators: J. Teixeira, INESC-ID, PT; H. Obermeir, Infineon, DE
-
A Bridging Fault Model Where Undetectable Faults Imply Logic Redundancy [p. 1166]
-
I. Pomeranz and S.M. Reddy
We define a robust fault model as a model where
the existence of an undetectable fault implies the existence
of logic redundancy, or more generally, a suboptimality in
the synthesis of the circuit. The stuck-at fault model is
robust, but other fault models such as certain bridging
fault models are not. A robust fault model provides a
mechanism to synthesize circuits in which all the target
faults are detectable and 100% fault coverage is achievable.
The ability to achieve 100% fault coverage, or
understand why it is not achievable, is important since the
requirement to achieve high test quality translates into a
requirement to achieve complete fault coverage for target
faults, regardless of the metrics used to measure test quality.
We discuss a robust bridging fault model and its use
as part of a test generation process for a non-robust
bridging fault model (a non-robust bridging fault model
may have to be used in order to capture the behavior of
bridging defects). We also present experimental results
related to the robust bridging fault model.
-
Layout-Aware, IR-Drop Tolerant Transition Fault Pattern Generation [p. 1172]
-
J. Lee, S. Narayan, M. Kapralos and M. Tehranipoor
Market and customer demands have continued to
push the limits of CMOS performance. At-speed test has become
a common method to ensure these high performance chips are
being shipped to the customers fault-free. However, at-speed tests
have been known to create higher-than-average switching activity,
which normally is not accounted for in the design of the power
supply network. This potentially creates conditions for additional
delay in the chip, causing it to fail during test. In this paper, we
propose a pattern compaction technique that considers the layout
and gate distribution when generating transition delay fault
patterns. The technique focuses on evenly distributing switching
activity generated by the patterns across the layout rather than
allowing high switching activity to occur in a small area in
the chip that could occur with conventional delay fault pattern
generation. Due to the relationship between switching activity and
IR-drop, the reduction of switching will prevent large IR-drop
in high demand regions while still allowing a suitable amount
of switching to occur elsewhere on the chip to prevent fault
coverage loss. This even distribution of switching on the chip
will also result in avoiding hot-spots.
-
Multi-Vector Tests: A Path to Perfect Error-Rate Testing [p. 1178]
-
S. Shahidi and S. Gupta
The importance of testing approaches that exploit error
tolerance to improve yield has previously been established.
Error rate, defined as the percentage of vectors for which
the value at a circuit's output deviates from the corresponding
error-free value, has been identified as a key metric for
severity. In error-rate testing every chip that has an error
rate greater than or equal to a threshold specified by the application
is unacceptable for the application and discarded;
all other chips are acceptable. The objective of error-rate
testing is to reject every unacceptable chip while accepting
all (or a maximum number) of the acceptable chips.
We previously showed that it is not always possible to generate
a test set that detects all unacceptable faults, i.e., faults
that cause an error rate greater than or equal to the threshold
error rate, without detecting some of the acceptable faults,
i.e., faults that cause an error rate less than the threshold.
In this paper, we introduce the new notion of multi-vector
testing and prove that this notion enables us to detect all
unacceptable faults without detecting any of the acceptable
faults. We derive an upper bound on the size of such a test for
a general case. As this universal bound can be large in some
cases, we use a structural approach and find much tighter
upper bounds for special classes of circuits. Experiments on
benchmark circuits show that the required test-sizes for arbitrary
circuits are much lower than our universal bounds, and
practically useful.
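The error-rate definition above (the percentage of input vectors for which the output deviates from the error-free value) can be sketched for small circuits by exhaustive simulation. The functions below are an illustration of the metric and of the accept/reject rule only, not of multi-vector test generation; the names and the modelling of circuits as Python callables are assumptions.

```python
from itertools import product


def error_rate(good, faulty, num_inputs):
    """Percentage of input vectors on which the faulty output differs.

    good, faulty: callables taking num_inputs bits (0 or 1) and returning
    the circuit output. Exhaustive, so only practical for few inputs.
    """
    vectors = list(product((0, 1), repeat=num_inputs))
    errors = sum(1 for v in vectors if good(*v) != faulty(*v))
    return 100.0 * errors / len(vectors)


def is_acceptable(good, faulty, num_inputs, threshold_percent):
    """A chip is acceptable only if its error rate is below the threshold."""
    return error_rate(good, faulty, num_inputs) < threshold_percent
```

For example, a two-input AND gate whose second input is stuck at 1 behaves like a buffer on the first input and differs from the fault-free gate on one of four vectors, giving an error rate of 25%.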
-
iFill: An Impact-Oriented X-Filling Method for Shift- and Capture-Power Reduction in At-Speed Scan-Based Testing [p. 1184]
-
J. Li, Q. Xu, Y. Hu and X. Li
In scan-based tests, power consumptions in both shift and
capture phases may be significantly higher than that in
normal mode, which threatens circuits' reliability during
manufacturing test. In this paper, by analyzing the impact of
X-bits on circuit switching activities, we present an X-filling
technique, called iFill, that can decrease both shift- and capture-power to
guarantee the reliability of scan tests. Moreover,
different from prior work on X-filling for shift-power
reduction which can only reduce shift-in power, iFill is able
to decrease power consumptions during both shift-in and
shift-out. Experimental results on ISCAS'89 benchmark
circuits show the effectiveness of the proposed technique.
Moderators: C. Haubelt, Erlangen-Nuremberg U, DE; R. Leupers, RWTH Aachen U, DE
-
Hiding Cache Miss Penalty Using Priority-based Execution for Embedded Processors [p. 1190]
-
S. Park, A. Shrivastava and Y. Paek
The contribution of memory latency to execution
time continues to increase, and latency hiding mechanisms
become ever more important for efficient processor design. While
high-end processors can use elaborate techniques like multiple
issue, out-of-order execution, speculative execution, value prediction
etc. to tolerate high memory latencies, they are often
not viable solutions for embedded processors, due to significant
area, power and chip complexity overheads. This paper proposes
a hardware-software cooperative approach, called priority-based
execution to hide cache miss penalty for embedded processors.
The compiler classifies the instructions into low-priority and high-priority
instructions. The processor executes the high-priority
instructions, but delays the execution of low-priority instructions.
They are executed on a cache miss to hide the cache miss
penalty. We empirically evaluate our proposal on the Intel XScale
compiler and microarchitecture. Experimental results on benchmarks
from Multimedia, MediaBench, MiBench, and SPEC2000
demonstrate an average 17% performance improvement, hiding
75% cache miss penalty.
-
Instruction Cache Energy Saving Through Compiler Way-Placement [p. 1196]
-
T.M. Jones, S. Bartolini, B. De Bus, J. Cavazos and M.F.P. O'Boyle
Fetching instructions from a set-associative cache in an
embedded processor can consume a large amount of energy
due to the tag checks performed. Recent proposals to address
this issue involve predicting or memoizing the correct
way to access. However, they also require significant hardware
storage which negates much of the energy saving.
This paper proposes way-placement to save instruction
cache energy. The compiler places the most frequently executed
instructions at the start of the binary and at runtime
these are mapped to explicit ways within the cache. We compare
with a state-of-the-art hardware technique and show
that our scheme saves almost 50% of the instruction cache
energy compared to 32% for the hardware approach. We
report results on a variety of cache sizes and associativities,
achieving 59% instruction cache energy savings and
an ED product of 0.80 in the best configuration with negligible
hardware overhead and no ISA changes.
-
Effective Loop Partitioning and Scheduling under Memory and Register Dual Constraints [p. 1202]
-
C.J. Xue, E.H.-M. Sha, Z. Shao and M. Qiu
Loops are the most important sections for embedded applications. To achieve high
performance, two loop transformation techniques are often applied, namely loop pipelining
and loop partitioning. Loop pipelining is an effective approach to increase parallelism
and reduce schedule length. Loop partitioning with prefetching increases data locality
and hides memory latency. However, loop pipelining increases register pressure and loop
partitioning increases local memory requirement. As most embedded systems have a limited
number of registers and limited memory, without careful study, these two techniques
cannot be applied effectively. In this paper, we propose an effective scheduling
framework, Register and Memory Sensitive Partitioning (RMSP), to minimize
average schedule length per iteration under register and memory dual constraints
for parallel embedded systems. Experiments show that RMSP reduces schedule length by 14.1%
on average compared to previous methods applied directly.
Moderators: J. Becker, Karlsruhe Inst. of Technology - KIT, DE; K. Bertels, TU Delft, NL
-
Transparent Reconfigurable Acceleration for Heterogeneous Embedded Applications [p. 1208]
-
A.C.S. Beck, M.B. Rutzig, G. Gaydadjiev and L. Carro
Embedded systems are becoming increasingly complex. Besides
the additional processing capabilities, they are characterized by
high diversity of computational models coexisting in a single
device. Although reconfigurable architectures have already been shown
to be a potential solution for such systems, they present
significant speedups only for very specific dataflow-oriented kernels.
Furthermore, reconfigurable fabric is still held back by the need for
special tools and compilers, clearly not sustaining backward
software compatibility. In this paper, we propose a new technique
to optimize both dataflow and control-flow oriented code in a
totally transparent process, without the need of any modification
in the source or binary codes. For that, we have developed a
Binary Translation algorithm implemented in hardware, which
works in parallel to a MIPS processor. The proposed mechanism
is responsible for transforming sequences of instructions at runtime
to be executed on a dynamic coarse-grain reconfigurable
array, supporting speculative execution. Executing the MIBench
suite, we show performance improvements of up to 2.5 times,
while reducing 1.7 times the required energy, using trivial
hardware resources.
-
Automatic Selection of Application-Specific Reconfigurable Processor Extensions [p. 1214]
-
C. Wolinski and K. Kuchcinski
This paper presents a new method for automatic selection
of application-specific processor extensions and shows
how applications are scheduled on these new reconfigurable
architectures. The extensions are implemented as specialized
sequential or parallel instructions. They correspond to
identified most frequently occurring computational patterns
or other interesting patterns and are finally selected during
mapping and scheduling. Our methods can handle both
time-constrained and resource-constrained scheduling. Experimental
results show that the presented method provides
high coverage of application graphs with small number of
patterns and ensures high application execution speed-up
both for sequential and parallel application execution with
processor extensions implementing selected patterns.
-
An Optimized Message Passing Framework for Parallel Implementation of Signal Processing
Applications [p. 1220]
-
S. Saha, J. Schlessman, S. Puthenpurayil, S.S. Bhattacharyya and W. Wolf
Novel reconfigurable computing platforms enable efficient
realizations of complex signal processing applications
by allowing exploitation of parallelization resulting in high
throughput in a cost-efficient way. However, the design of
such systems poses various challenges due to the complexities
posed by the applications themselves as well as the heterogeneous
nature of the targeted platforms. One of the
most significant challenges is communication between the
various computing elements for parallel implementation. In
this paper, we present a communication interface, called the
signal passing interface (SPI), that attempts to overcome
this challenge by integrating relevant properties of two different
yet important paradigms in this context - dataflow
and the message passing interface (MPI). SPI is targeted
towards signal processing applications and, due to its careful
specialization, is more performance-efficient for their
embedded implementation. It is also easier and more intuitive
to use. Earlier, a preliminary version of SPI was presented
[12] which was restricted to static dataflow behavior.
Here, we present a more complete version of SPI with new
features to address both static and dynamic dataflow behavior,
and to provide new optimization techniques. We develop
a hardware description language (HDL) realization of the
SPI library, and demonstrate its functionality on the Xilinx
Virtex-4 FPGA. Details of the HDL-based SPI library along
with experiments with two signal processing applications on
the FPGA are also presented.
Organizers: N. Suri, TU Darmstadt, DE; C. Fetzer, TU Dresden, DE
Moderator: C. Fetzer, TU Dresden, DE
-
Dependability for High-Tech Systems: An Industry-as-Laboratory Approach [p. 1226]
-
E. Brinksma and J. Hooman
The dependability of high-volume embedded systems, such as consumer electronic devices,
is threatened by a combination of quickly increasing complexity, decreasing time-to-market,
and strong cost constraints. This poses challenging research questions that are investigated
in the Trader project, following the industry-as-lab approach. We present the main
vision of this project, which is based on a model-based control paradigm, and the
current status of the project results.
Moderators: D. Atienza, DACYA/Madrid Complutense U, ES; T. Basten, TU Eindhoven, NL
-
User-Aware Dynamic Resource Allocation in Networks-on-Chip [p. 1232]
-
C.-L. Chou and R. Marculescu
In this paper, we propose a run-time strategy for allocating the
application tasks to platform resources in homogeneous Networks-on-Chip
(NoCs). As a novel contribution, we incorporate the
user behavior information in the resource allocation process; this
allows the system to better respond to real-time changes and adapt
dynamically to user needs. Several algorithms are then proposed
for solving the task allocation problem, while minimizing the
communication energy consumption and network contention. If
user behavior is taken into consideration, we observe about 60%
communication energy savings (with negligible energy and run-time
overhead) compared to an arbitrary task allocation strategy.
-
Minimizing Virtual Channel Buffer for Routers in On-Chip Communication Architectures [p. 1238]
-
M.A. Al Faruque and J. Henkel
We present a novel methodology for design space exploration
using a two-step scheme to optimize the number of
virtual channel buffers (buffers take the premier share of the
router in a NoC [10]) used to implement logical channels
multiplexed across the physical channel in a router output
port for QoS supported on-chip communication. In the first
step, the number of virtual channels is minimized during the
mapping of tasks to the NoC at the design time of a System
on Chip (SoC) for which we use a swarm intelligence-based
Ant Colony Optimization (ACO) algorithm. In the second
step, a probabilistic approach based on the traffic model of
the application is used to further minimize the number of
virtual channels. We achieve on average 90.2% reduction
in the number of virtual channels compared to a fixed
state-of-the-art (i.e. QNoC [1]) allocation for the E3S embedded
application benchmark suite. The reduction depends on the
designer and the QoS parameter, and it is dependent on the
specific application driven traffic model. We demonstrate
our design space exploration by means of a complete robot
application and also extend our exploration by evaluating
the E3S embedded application benchmark suite.
-
An Open-Loop Flow Control Scheme Based on the Accurate Global Information of On-Chip
Communication [p. 1244]
-
W.-C. Kwon, S.-M. Hong, S. Yoo, B. Min, K.-M. Choi, S.-K. Eo
3D stacked memory is being adopted as a promising solution to
offer high bandwidth and low latency in memory access. Compared
with the on-chip network design with conventional off-chip memory,
it gives a new problem of minimizing communication conflicts since
multiple concurrent high bandwidth data transfers will flow
through the on-chip network. In order to tackle this problem, we
propose applying an open-loop flow control scheme based on the
accurate global information (destination and status) of on-chip
communication. The proposed open-loop flow control scheme
exploits the information and selectively buffers and arbitrates data
transfers to remove conflicts at destinations in a preventive manner.
As an implementation of the presented scheme, we present on-chip
buffers called Buf3D's that share the global information with each
other to perform the selective buffering and arbitration of data
transfers. Experiments with synthetic test cases and an industrial
strength DTV design show that the proposed method improves
aggregate memory bandwidth significantly (average 19.0%~25.8%
in the synthetic cases and up to 18.4% in the DTV case) with a
small area overhead (15.2% in the DTV case) of on-chip network.
Moderators: C. Wolinski, Rennes 1 U, FR; H. Yoshida, Tokyo U, JP
-
Variable Latency Speculative Adder: A New Paradigm for Arithmetic Circuit Design [p. 1250]
-
A.K. Verma, P. Brisk and P. Ienne
Adders are one of the key components in arithmetic circuits.
Enhancing their performance can significantly improve the quality
of arithmetic designs. This is the reason why the theoretical
lower bounds on the delay and area of an adder have been analysed,
and circuits with performance close to these bounds have
been designed. In this paper, we present a novel adder design that
is exponentially faster than traditional adders; however, it produces
incorrect results, deterministically, for a very small fraction
of input combinations. We have also constructed a reliable version
of this adder that can detect and correct mistakes when they
occur. This creates the possibility of a variable-latency adder that
produces a correct result very fast with extremely high probability;
however, in some rare cases when an error is detected, the
correction term must be applied and the correct result is produced
after some time. Since errors occur with extremely low probability,
this new type of adder is significantly faster than state-of-the-art
adders when the overall latency is averaged over many additions.
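
The variable-latency idea above can be sketched in software. This is a minimal illustration, not the authors' circuit: it assumes the speculation consists of letting each carry depend only on the k bits below it, and the names speculative_add, may_be_wrong and variable_latency_add are hypothetical.

```python
def speculative_add(a, b, n, k):
    """Add two n-bit integers, letting each carry see only the k bits below it."""
    result = 0
    for i in range(n):
        lo = max(0, i - k)              # lowest bit of the carry window
        width = i - lo
        window = (1 << width) - 1
        # Carry out of the window, assuming no carry enters it (the speculation).
        carry = (((a >> lo) & window) + ((b >> lo) & window)) >> width
        bit = ((a >> i) ^ (b >> i) ^ carry) & 1
        result |= bit << i
    return result


def may_be_wrong(a, b, n, k):
    """Conservative error flag: some k consecutive bits all propagate a carry."""
    propagate = (a ^ b) & ((1 << n) - 1)
    run = (1 << k) - 1
    return any(((propagate >> s) & run) == run for s in range(n - k))


def variable_latency_add(a, b, n, k):
    """Return (sum mod 2**n, cycles): 1 cycle normally, 2 when corrected."""
    if may_be_wrong(a, b, n, k):
        # Assumed correction step: fall back to an exact addition.
        return (a + b) & ((1 << n) - 1), 2
    return speculative_add(a, b, n, k), 1


if __name__ == "__main__":
    print(variable_latency_add(1, 2, 8, 3))    # fast path
    print(variable_latency_add(15, 1, 8, 3))   # long carry chain, corrected
```

The flag is deliberately conservative: it may request a correction that is not strictly needed, but whenever it stays low the speculative sum equals the exact sum.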
-
Improving Synthesis of Compressor Trees on FPGAs via Integer Linear Programming [p. 1256]
-
H. Parandeh-Afshar, P. Brisk and P. Ienne
Multi-input addition is an important operation for many DSP and
video processing applications. On FPGAs, multi-input addition has
traditionally been implemented using trees of carry-propagate
adders. This approach has been used because the traditional lookup
table (LUT) structure of FPGAs is not amenable to compressor
trees, which are used to implement multi-input addition and
parallel multiplication in ASIC technology. In prior work, we
developed a greedy heuristic method to map compressor trees onto
the general logic of an FPGA using a component called a
generalized parallel counter (GPC). This technique
reduced the combinational delay of our circuits, when synthesized
onto Altera Stratix-II FPGAs, by 27% on average; however, the
area increased by an average of 11%. To further reduce the delay
and limit the increase in area, we have developed a new solution to
the mapping problem based on integer linear programming. This
new approach reduced the delay of the compressor tree by 32% on
average and reduced the area by 3% compared to an adder tree.
-
An Adaptable FPGA-Based System for Regular Expression Matching [p. 1262]
-
I. Bonesana, M. Paolieri and M.D. Santambrogio
In many applications string pattern matching is one of
the most intensive tasks in terms of computation time and
memory accesses. Network Intrusion Detection Systems
and DNA Sequence Matching are two examples. Since
software solutions are not able to satisfy the performance
requirements, specialized hardware architectures are required.
In this paper we propose a complete framework for
regular expression matching, both in its architecture and
compiler. This special-purpose processor is programmed
using regular expressions as programming language. With
the parallelism exploited in the design it is possible to
achieve a throughput greater than one character per clock
cycle, requiring O(n) memory space. The VHDL description
of the proposed architecture is fully configurable. A
design space exploration to find the optimal architecture
based on area and performance cost-function is presented.
-
Comparison of Boolean Satisfiability Encodings on FPGA Detailed Routing Problems [p. 1268]
-
M.N. Velev and P. Gao
We compare 12 new encodings for representing FPGA
detailed routing problems as equivalent Boolean Satisfiability
(SAT) problems against the only 2 previously used encodings.
We also consider two symmetry-breaking heuristics. Compared
to other methods for FPGA detailed routing, SAT-based
approaches have the advantage that they can prove the
unroutability of a global routing for a particular number of
tracks per channel, and that they consider all nets simultaneously.
The experiments were run on the standard MCNC
benchmarks. The combination of one new encoding with a
new symmetry-breaking heuristic resulted in a speedup of 3
orders of magnitude (1,139x) in the total execution time on
the collection of benchmarks, when proving the unroutability
of FPGA global routings. The maximum obtained speedup
was 9,499x on an individual benchmark. On the other hand,
most of the encodings had comparable and very efficient performance
when finding solutions for configurations that were
routable. The availability of many SAT encodings, that can
each be combined with various symmetry-breaking heuristics,
opens the possibility to design portfolios of parallel strategies -
each a combination of a SAT encoding and a symmetry-breaking
heuristic - that can be run in parallel on different
cores of a multicore CPU in order to reduce the solution time,
with the rest of the runs terminated as soon as one of them
returns an answer. We found that a portfolio of three particular
parallel strategies produced additional speedup of more
than 2x.
Moderators: L. Fesquet, TIMA Laboratory, FR; B. Candaele, Thales, FR
-
Defeating Classical Hardware Countermeasures: A New Processing for Side Channel
Analysis [p. 1274]
-
D. Real, C. Canovas, J. Clediere, M. Drissi and F. Valette
In the field of Side Channel Analysis, hardware distortions
such as glitches and random frequency are classical
countermeasures. A glitch influences the side channel
amplitude while a random frequency damages the signal
both in time and in amplitude. To minimize the effects of these
countermeasures, some trace treatments based on
peak extraction or auto-correlation methods exist. However,
none of them takes into account the amplitude mistake.
In this paper, we show that this amplitude mistake is
created by glitches but also by a random frequency. We then
propose a reshaping process that erases these effects
on side channel traces on both the time and amplitude axes.
The solution reconstructs a side channel signal, avoiding
the hardware countermeasures and the clock relativity consequences
which can be meaningful for Side Channel Attacks.
Its efficiency is demonstrated on a Differential Power
Attack performed on a DES implementation and on a Template
Attack performed on an RSA implementation.
-
Power Balanced Gates Insensitive to Routing Capacitance Mismatch [p. 1280]
-
K.J. Kulikowski, V. Venkatarama, Z. Wang and A. Taubin
Cryptographic hardware is vulnerable to power analysis attacks. To resist
these attacks, special balanced dual-rail gates have been developed which
have equal power consumption for all valid data values and transitions. A
limitation of existing designs is that they require balanced routing of the
dual-rail interconnect between gates. Natural process variation and suboptimal
routing tools make it practically impossible to perfectly match the capacitances
of the dual-rail pair, making the balanced routing constraint difficult to satisfy.
We present a general method and designs which achieve power balance in dual-rail
circuits without requiring matching of gate output load capacitances or random
masking. The method and design are based on a directional discharge protocol
which ensures that both rails are always fully discharged and charged in
each cycle.
-
On Analysis and Synthesis of (n,k)-Non-Linear Feedback Shift Registers [p. 1286]
-
E. Dubrova, M. Teslenko and H. Tenhunen
Non-Linear Feedback Shift Registers (NLFSRs) have been
proposed as an alternative to Linear Feedback Shift Registers (LFSRs) for
generating pseudo-random sequences for stream ciphers. In this paper, we
introduce (n,k)-NLFSRs which can be considered a generalization of the
Galois type of LFSR. In an (n, k)-NLFSR, the feedback can be taken from
any of the n bits, and the next state functions can be any Boolean function
of up to k variables. Our motivation for considering this type of NLFSR
is that their Galois configuration makes it possible to compute each next
state function in parallel, thus increasing the speed of output sequence
generation. Thus, for stream cipher applications where the encryption
speed is important, (n, k)-NLFSRs may be a better alternative than the
traditional Fibonacci ones. We derive a number of properties of (n, k)-
NLFSRs. First, we demonstrate that they are capable of generating output
sequences with good statistical properties which cannot be generated by
the Fibonacci type of NLFSRs. Second, we show that the period of the
output sequence of an (n, k)-NLFSR is not necessarily equal to the length
of the largest cycle of its states. Third, we compute the period of an
(n, k)-NLFSR constructed from several parallel NLFSRs whose outputs
are XOR-ed and show how to maximize this period. We also present an
algorithm for estimating the length of cycles of states of (n, k)-NLFSRs
which uses Binary Decision Diagrams for representing the set of states
and the transition relation on this set.
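
As a small illustration of the Galois-style update described above, where every bit has its own next-state function and all of them are evaluated in parallel, the following sketch enumerates states explicitly to measure cycle lengths. It does not use Binary Decision Diagrams as the paper does; the helper names and the two 3-bit example registers are hypothetical.

```python
def step(state, next_funcs):
    """Apply every bit's next-state function to the current state at once."""
    return tuple(f(state) & 1 for f in next_funcs)


def cycle_length(start, next_funcs):
    """Length of the state cycle eventually reached from `start`."""
    seen = {}
    state, t = tuple(start), 0
    while state not in seen:
        seen[state] = t
        state = step(state, next_funcs)
        t += 1
    return t - seen[state]


# Hypothetical 3-bit examples (state is (s0, s1, s2)).
# A Fibonacci-style LFSR: only the last bit has a non-trivial function.
lfsr = [lambda s: s[1], lambda s: s[2], lambda s: s[0] ^ s[1]]
# A Galois-style NLFSR: feedback enters two bits, one of them non-linearly.
nlfsr = [lambda s: s[1], lambda s: s[2] ^ (s[0] & s[1]), lambda s: s[0]]

if __name__ == "__main__":
    print(cycle_length((1, 0, 0), lfsr))    # 7
    print(cycle_length((1, 1, 0), nlfsr))   # 4
```

Explicit enumeration only scales to very small registers, which is why a symbolic representation of the state set is attractive for realistic sizes.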
-
FPGA Design for Algebraic Tori Based Public Key Cryptography [p. 1292]
-
J. Fan, L. Batina, K. Sakiyama and I. Verbauwhede
Algebraic torus-based cryptosystems are an alternative
for Public-Key Cryptography (PKC). They maintain the security
of a larger group while the actual computations are
performed in a subgroup. Compared with RSA at the same
security level, they allow faster exponentiation and much
lower bandwidth for the transmitted data. In this work
we implement a torus-based cryptosystem, the so-called
CEILIDH, on a multicore platform with an FPGA. This
platform consists of a Xilinx MicroBlaze core and a multicore
coprocessor. The platform supports CEILIDH, RSA
and ECC over prime fields. The results show that one 170-bit
torus T6 exponentiation requires 20 ms, which is 5 times
faster than a 1024-bit RSA implementation on the same platform.
Moderators: E.J. Marinissen, NXP Semiconductors, NL; A. Leininger, Infineon Technologies, DE
-
Automated Trace Signals Identification and State Restoration for Improving Observability in
Post-Silicon Validation [p. 1298]
-
H.F. Ko and N. Nicolici
Embedded logic analysis has emerged as a powerful
technique for identifying functional bugs during post-silicon
validation, as it enables at-speed acquisition of data
from the circuit nodes in real-time. Nonetheless, the amount
of data that is observed is limited by the capacity of the
on-chip trace buffers. This paper introduces an automated
method for improving the utilization of the on-chip storage,
by identifying a small set of trace signals from which a large
number of states can be restored using a compute-efficient
algorithm. This enlarged set of data can then be used to aid
the search for functional bugs in the fabricated circuit.
-
Functional Self-Testing for Bus-Based Symmetric Multiprocessors [p. 1304]
-
A. Apostolakis, D. Gizopoulos, M. Psarakis and A. Paschalis
Functional, instruction-based self-testing of microprocessors
has recently emerged as an effective alternative
or supplement to other testing approaches, and is
progressively adopted by major microprocessor manufacturers.
In this paper, we study, for the first time, the applicability
of functional self-testing on bus-based symmetric
multiprocessors (SMP) and the exploitation of
SMP parallelism during testing. We focus on the impact
of the memory system architecture and the cache coherency
mechanisms on the execution of self-test programs
on the processor cores. We propose a generic scheduling
algorithm for self-test routines, aiming at the reduction of
the total test application time for the SMP by reducing
both bus contention and data cache coherency invalidation.
We demonstrate the proposed solutions with detailed
experiments in two-core and four-core SMP
benchmarks based on a RISC processor core.
-
Theoretical and Practical Aspects of IDDQ Settling - Impact on Measurement Timing and
Quality [p. 1310]
-
B. Straka, H. Manhaeve, J. Brenkus and S. Kerckenaere
This paper discusses the parameters involved in making
fast and reliable quiescent current (IDDQ or ISSQ)
measurements, with particular attention to the test setup
and the point of measurement. For that purpose a detailed
theoretical and practical study was made of the IDDQ
settling behaviour as a function of proper measurement
instrument positioning. The conclusions are that
instrument positioning is a critical factor in
achieving fast, high resolution, reliable and repeatable
IDDQ measurements needed to support advanced decision
making strategies and Nanotechnology IDDQ application,
and that the use of add-on instrumentation offers the best
perspectives to reach these goals.
Organizers: G. Gielen, KU Leuven, BE; L. Fanucci, Pisa U, IT
Moderators: L. Fanucci, Pisa U, IT
-
Advanced Analog Filters for Telecommunications [p. 1316]
-
M. De Matteis, S. D'Amico and A. Baschirotto
In this paper, advances in analog filter design for
telecom transceivers are addressed. Portable devices require a
strong power consumption reduction to increase the battery
life. Since a considerable part of the power consumption is due
to the analog baseband filters, improved and/or novel analog
filter design approaches have to be developed. In this paper
some advances in this field reported in recent years are
summarized. Each design (developed for different standards)
exploits the standard specifications with different architectures
and circuit strategies devoted to power consumption reduction.
The first is for reconfigurable Bluetooth/UMTS/WLAN
receivers, the second is for very-low voltage (550mV) WLAN
receivers, the third one is for impulse-radio UWB receivers,
while the fourth is for very low-power OFDM-UWB receivers.
-
Emerging Yield and Reliability Challenges in Nanometer CMOS Technologies [p. 1322]
-
G. Gielen, P. DeWit, E. Maricau, J. Loeckx, J. Martín-Martínez, B. Kaczer, G. Groeseneken, R.
Rodríguez and M. Nafría
With further scaling of nanometer CMOS technologies,
yield and reliability become an increasing challenge. This
paper reviews the most important phenomena affecting
yield and reliability. For each effect, the basic physical
mechanisms causing the effect and its impact on transistor
parameters are described. Possible solutions to
cope with these effects at the design level are discussed
as well.
-
Novel Front-End Circuit Architectures for Integrated Bio-Electronic Interfaces [p. 1328]
-
C. Guiducci, A. Schmid, F.K. Gürkaynak and Y. Leblebici
The prospective use of upcoming nanometer CMOS
technology nodes (65nm, 45nm, and beyond) in bio-electronic
interfaces is raising a number of important
issues concerning circuit architectures and design. In
particular, the advantages of scaling and higher density
integration must be balanced against the requirements of
low noise design, uniform power density and surface
temperature distribution, better component matching, and
immunity to parameter variations. Dealing with these
constraints also requires more innovative approaches
towards hybrid integration technologies. In this paper, we
discuss the key design issues with specific examples from
DNA detection, protein detection, and neuro-electronic
interfaces.
Moderators: W. Luk, Imperial College London, UK; M. Huebner, Karlsruhe U (TH), DE
-
High-Level Modeling and Exploration of Coarse-Grained Re-Configurable Architectures [p. 1334]
-
A. Chattopadhyay, X. Chen, H. Ishebabi, R. Leupers, G. Ascheid and H. Meyr
The increasing complexity of today's multimedia and wireless applications
is motivating the system designers to innovate continuously.
With the challenge to keep various performance metrics in
a tight balance while designing a complex system, an entire range
of components are now being offered as choices for system building
blocks. Coarse-Grained Re-configurable Architecture (CGRA), a
strongly emerging class, is currently receiving due attention for offering
excellent performance as well as flexibility post fabrication.
Compared to the programmable and flexible microprocessors these
architectures are shown to yield stronger performance, especially in
case of regular and data-driven applications. A variety of system
designs have been proposed of late, with CGRA as one of the key building
blocks. Most of the research initiatives taken in this area have
resorted to a template-based approach, where the structure of the reconfigurable
architecture is partially fixed with several tunable parameters.
In this paper, we present a language-driven modelling and
exploration framework for CGRAs. In the domain of CGRAs, this
framework attempts to bring modelling ease, genericity, early exploration
and path to implementation together. The modelling formalism
proposed in this paper as well as the exploration capabilities are
demonstrated via experiments with several algorithmic kernels.
-
Scalable Architecture for On-Chip Neural Network Training Using Swarm Intelligence [p. 1340]
-
A. Farmahini-Farahani, S.M. Fakhraie and S. Safari
This paper presents a novel architecture for on-chip
neural network training using particle swarm optimization
(PSO). PSO is an evolutionary optimization algorithm
with a growing field of applications which has been recently
used to train neural networks. The architecture exploits
the PSO algorithm to evolve network weights as well as
a method called layer partitioning to implement neural networks.
In the proposed method, a neural network is partitioned
into groups of neurons and the groups are sequentially
mapped to available functional units. Thus, the architecture
is reconfigurable for training and implementing different
multilayer feedforward neural networks without the
need for modifying the architecture. The implementation is
intended for real-time applications regarding hardware cost
and speed. The results show that the proposed system provides
a trade-off between resource requirements and speed.
-
Intelligent Merging OnLine Task Placement Algorithm for Partially Reconfigurable Systems [p. 1346]
-
T. Marconi, Y. Lu, K. Bertels and G. Gaydadjiev
Speed and placement quality are two very important attributes
of a good online placement algorithm, because the
time taken by the algorithm is considered an overhead to
the application's overall execution time. To solve this problem,
we propose three techniques: Merging Only if Needed
(MON), Partial Merging (PM), and Direct Combine (DC).
Our IM (intelligent merging) algorithm dynamically uses
these three techniques to exploit their specific advantages.
IM outperforms Bazargan's algorithm as it has placement
quality within 0.89% but is 1.72 times faster.
-
Design of A HW/SW Communication Infrastructure for A Heterogeneous Reconfigurable
Processor [p. 1352]
-
A. Deledda, C. Mucci, A. Vitkovski, M. Kuehnle, F. Ries, M. Huebner, J. Becker, P. Bonnot, A.
Grasset, P. Millet, M. Coppola, L. Pieralisi, R. Locatelli, G. Maruccia, F. Campi and T. DeMarco
Reconfigurable architectures and NoC (Network-on-Chip)
have introduced new research directions for technology
and flexibility issues, which have been largely investigated
in the last decades. Exploiting run-time adaptivity
opens a new area of research by considering dynamic reconfiguration.
In this paper, we present the architecture and
associated development tools of a heterogeneous reconfigurable
SoC focusing on the chosen communication infrastructure.
The SoC integrates units of various sizes of reconfiguration
granularity. The included NoC approach demonstrates
the mentioned benefits and scalability for actual and
future SoC design.
On a reference CMOS090 implementation the described
interconnect system works at the system reference frequency
of 200 MHz sustaining the required run-time bandwidth on
a set of reference applications, at a price < 10% in area and
power consumption with respect to the overall system.
-
Automated Dynamic Throughput-Constrained Structural-Level Pipelining in Streaming
Applications [p. 1358]
-
M. Muir, T. Arslan and I. Lindsay
Stream processing applications such as image signal
processing demand high throughput. However, customers
increasingly demand runtime flexibility in their designs,
which cannot be provided by custom ASIC solutions. Currently,
reconfigurable processors tend to offer insufficient
throughput for widespread use in streaming applications.
This paper demonstrates how structural-level pipelining
techniques can be applied to rapidly dynamically reconfigurable
computing architectures, in order to increase
throughput. This is done by automatically inserting registers
into the data path of performance critical code sections
that have already been optimised into a single configuration
context. A new algorithm is presented to choose the
insertion point of pipeline stage registers in order to meet
a specified throughput whilst minimising register resource
usage. The paper then demonstrates a new approach where
properties of dynamic reconfiguration can be utilised to perform
the tasks of pipeline stage initialisation and flushing.
The technique is demonstrated on a real-life application:
the demosaic filter in a standard image signal processing
pipe used in modern digital cameras, and can be seen to
boost the throughput from 16MPixels/s to 51MPixels/s on
an example reconfigurable processor.
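
To make the register-insertion step concrete, here is a deliberately simplified sketch. It assumes the data path is a linear chain of operations with known delays and that the throughput target has already been converted into a maximum stage delay; it minimises only the number of pipeline cuts and ignores register width, so it is not the algorithm of the paper. The function name insert_registers is hypothetical.

```python
def insert_registers(delays, period):
    """Greedy pipeline cuts on a linear chain of operations.

    delays -- combinational delay of each operation, in chain order
    period -- maximum allowed delay of one pipeline stage
    Returns the indices i where a register is placed before operation i.
    """
    cuts, acc = [], 0
    for i, d in enumerate(delays):
        if d > period:
            raise ValueError("operation %d alone exceeds the stage period" % i)
        if acc + d > period:
            cuts.append(i)   # close the current stage in front of operation i
            acc = 0
        acc += d
    return cuts


if __name__ == "__main__":
    print(insert_registers([3, 4, 2, 5, 1, 6], 8))   # [2, 5]
```

Filling each stage as far as the period allows is optimal for the number of stages on a chain; a real data path is a graph, where the choice of cut also determines how many bits must be registered.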
-
Towards Trojan-Free Trusted ICs: Problem Analysis and Detection Scheme [p. 1362]
-
F. Wolff, C. Papachristou, S. Bhunia and R.S. Chakraborty
There have been serious concerns recently about the security
of microchips from hardware trojan horse insertion during
manufacturing. This issue has been raised recently due
to outsourcing of the chip manufacturing processes to reduce
cost. This is an important consideration especially in critical
applications such as avionics, communications, military, industrial
and so on. A trojan is inserted into a main circuit at
manufacturing and is mostly inactive unless it is triggered by
a rare value or time event; then it produces a payload error
in the circuit, potentially catastrophic. Because of its nature,
a trojan may not be easily detected by functional or ATPG
testing. The problem of trojan detection has been addressed
only recently in very few works. Our work analyzes and formulates
the trojan detection problem based on a frequency
analysis under rare trigger values and provides procedures to
generate input trigger vectors and trojan test vectors to detect
trojan effects. We also provide experimental results.
-
Wrapper and TAM Co-Optimization for Reuse of SoC Functional Interconnects [p. 1366]
-
T. Yoneda and H. Fujiwara
This paper presents a wrapper and TAM co-optimization
method for reuse of SoC functional interconnects to minimize
test time under area constraint. The proposed method consists
of (1) an ILP formulation for wrapper and transparent TAM co-optimization,
and (2) a simulated annealing based heuristic approach
to reduce the computational cost of the proposed ILP
model. Experimental results show the effectiveness of the proposed
methods compared to the previous transparency-based TAM
approaches and the conventional dedicated test bus approaches.
keywords: SoC test, wrapper, TAM, reuse of interconnect.
-
De Bruijn Graph as a Low Latency Scalable Architecture for Energy Efficient Massive NoCs [p. 1370]
-
M. Hosseinabady, M.R. Kakoee, J. Mathew and D.K. Pradhan
In this paper, we use the generalized binary de Bruijn (GBDB)
graph as a scalable and efficient network topology for an on-chip
communication network. Using just two-layer wiring, we
propose an optimum tile-based implementation for a GBDB-based
Network-on-Chip (NoC). Our experimental results show
that the latency and energy consumption of the generalized de Bruijn
graph are much lower than those of Mesh and Torus, the two
common NoC architectures in the literature.
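
The low hop count of such a topology can be checked with a short sketch. It assumes the usual definition of a generalized binary de Bruijn graph, in which node i has directed links to nodes (2i) mod N and (2i + 1) mod N; the tile layout and the energy model of the paper are not modelled, and the function names are hypothetical.

```python
from collections import deque


def gbdb_neighbors(node, n_nodes):
    """Successors of `node` in a generalized binary de Bruijn graph."""
    return [(2 * node) % n_nodes, (2 * node + 1) % n_nodes]


def hops_from(src, n_nodes):
    """Breadth-first search: hop count from `src` to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in gbdb_neighbors(u, n_nodes):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(n_nodes):
    """Largest shortest-path hop count over all source nodes."""
    return max(max(hops_from(s, n_nodes).values()) for s in range(n_nodes))


if __name__ == "__main__":
    print(diameter(16))   # 4, versus 6 hops corner to corner in a 4x4 mesh
```

With this definition the diameter grows with the logarithm of the node count, whereas the diameter of a square mesh grows with its side length.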
-
Adaptive Filesystem Compression for Embedded Systems [p. 1374]
-
L.S. Bai, H. Lekatsas and R.P. Dick
Embedded system secondary storage size is often constrained,
yet storage demands are growing as a result of increasing
application complexity and storage of personal data and multimedia
files. Filesystem compression offers a solution. This paper formalizes the
problem of automatic filesystem compression using multiple compression
algorithms. The average latency of on-line file accesses is optimized
under a constraint on filesystem capacity. Our solution is based on
predictive control. Predicted latency implications are used to solve the
file compression state selection problem using a multiple choice knapsack
problem formulation. This approach is evaluated on filesystem traces and
compared with other efficient heuristics. Our approach results in a 34.1%
reduction in file access latency compared to a straight-forward heuristic
that decompresses frequently-accessed files and compresses least recently
used files with more aggressive compression algorithms. It reduces file
access latency by 67.7% compared to uniformly compressing files to the
shallowest level required to meet storage capacity constraints.
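
The multiple choice knapsack formulation mentioned above can be illustrated as follows. This sketch assumes each file offers a few compression states, each given as a hypothetical (stored size, predicted latency penalty) pair with integer sizes, and picks exactly one state per file by dynamic programming. The predictive-control part of the paper is not modelled, and choose_compression is an illustrative name.

```python
def choose_compression(files, capacity):
    """Pick one (size, latency) option per file.

    files    -- list of option lists, one list per file
    capacity -- total storage budget (same integer unit as the sizes)
    Returns (minimum total latency, chosen option index per file),
    or (inf, None) if no selection fits the budget.
    """
    inf = float("inf")
    # best[c] = best (latency, choices) for the files seen so far using at most c units
    best = [(0, [])] * (capacity + 1)
    for options in files:
        nxt = [(inf, None)] * (capacity + 1)
        for c in range(capacity + 1):
            for idx, (size, latency) in enumerate(options):
                if size <= c and best[c - size][1] is not None:
                    cand = best[c - size][0] + latency
                    if cand < nxt[c][0]:
                        nxt[c] = (cand, best[c - size][1] + [idx])
        best = nxt
    return best[capacity]


if __name__ == "__main__":
    # Option 0 = uncompressed, higher indices = more aggressive compression.
    files = [[(4, 0), (2, 3), (1, 5)],
             [(6, 0), (3, 2), (2, 6)]]
    print(choose_compression(files, 6))   # (5, [1, 1])
```

In the example neither file can stay uncompressed alongside a lightly compressed partner within 6 units, so the cheapest feasible choice compresses both files moderately.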
-
Partially Redundant Logic Detection Using Symbolic Equivalence Checking in Reversible
and Irreversible Logic Circuits [p. 1378]
-
D.Y. Feinstein, M.A. Thornton and D.M. Miller
This paper investigates partially redundant logic
detection and gate modification coverage in both
reversible and irreversible (classical) logic circuits. Our
methodology is to repeatedly compare a benchmark
circuit with a modified copy of itself using an equivalence
checker. We have found many instances in the
irreversible logic ISCAS85 benchmarks where single gate
replacements were not detected, indicating no change in
functionality after gate replacement. In contrast, we
demonstrate that the Maslov reversible and quantum
logic benchmarks exhibit very high gate modification
fault coverage, in line with the expectation that reversible
circuits, which implement bijective functions, have
maximal information content.
-
TinyTimber, Reactive Objects in C for Real-Time Embedded Systems [p. 1382]
-
P. Lindgren, J. Eriksson, S. Aittamaa and J. Nordlander
Embedded systems are often operating under hard real-time
constraints. Such systems are naturally described as time-bound
reactions to external events, a point of view made manifest in the
high-level programming and systems modeling language Timber.
In this paper we demonstrate how the Timber semantics for parallel
reactive objects translates to embedded real-time programming
in C. This is accomplished through the use of a minimalistic
Timber Run-Time system, TinyTimber (TT). The TT kernel
ensures state integrity, and performs scheduling of events based
on given time-bounds in compliance with the Timber semantics.
In this way, we avoid the volatile task of explicitly coding parallelism
in terms of processes/threads/semaphores/monitors, and
side-step the delicate task of encoding time-bounds into priorities.
In this paper, the TT kernel design is presented and performance
metrics are presented for a number of representative embedded
platforms, ranging from small 8-bit to more potent 32-bit
microcontrollers. The resulting system runs on bare metal,
completely free of references to external code (even C-lib) which
provides a solid basis for further analysis. In comparison to
a traditional thread based real-time operating system for embedded
applications (FreeRTOS), TT has tighter timing performance
and considerably lower code complexity. In conclusion,
TinyTimber is a viable alternative for implementing embedded
real-time applications in C today.
-
Dynamic Task Allocation Strategies in MPSoC for Soft Real-Time Applications [p. 1386]
-
E. Wenzel Brião, D. Barcelos, F. Rech Wagner
This work evaluates task allocation strategies based
on bin-packing algorithms in the context of multiprocessor
systems-on-chip (MPSoCs) with task migration
capabilities, running soft real-time applications. The
task migration model assumes that the whole code and
data of the tasks are transferred from an origin node to
the chosen destination node. We combine two types of
algorithms to obtain better allocation results. Experimental
results show that there is a trade-off between
deadline misses and system energy consumption when
applying bin-packing and linear clustering algorithms.
In order to save energy, our system turns off idle processors
and applies Dynamic Voltage Scaling to processors
with slack. Depending on the algorithm selection
and on the application, it is possible to obtain a reduction
in deadline misses from 30% to 100% and energy
consumption savings from 60% to 80%.
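
As an illustration of bin-packing-based allocation, the sketch below uses first-fit decreasing, one common bin-packing heuristic; the abstract does not state which variants were evaluated, so this choice is an assumption. Task loads are hypothetical processor utilisations in percent, and task migration, linear clustering and voltage scaling are not modelled.

```python
def first_fit_decreasing(loads, capacity):
    """Pack task loads onto as few processors as the heuristic finds.

    loads    -- utilisation of each task (integer percent)
    capacity -- utilisation one processor can sustain
    Returns one list of task loads per processor that is switched on.
    """
    processors = []
    for load in sorted(loads, reverse=True):
        if load > capacity:
            raise ValueError("task load %d exceeds processor capacity" % load)
        for tasks in processors:
            if sum(tasks) + load <= capacity:
                tasks.append(load)      # first processor with enough slack
                break
        else:
            processors.append([load])   # power on another processor
    return processors


if __name__ == "__main__":
    print(first_fit_decreasing([50, 70, 30, 20, 40, 10], 100))
```

Packing tasks tightly leaves more processors idle and therefore eligible to be turned off, at the cost of less slack on the busy ones, which is one way to picture the trade-off between energy and deadline misses.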
-
Mixed-Signal Design Space Exploration of Time-Interleaved A/D Converters for Ultra-Wide
Band Applications [p. 1390]
-
P. Nuzzo, C. Nani, S. Saponara, L. Fanucci and G. Van der Plas
This paper addresses system-level design of time-interleaved
analog-to-digital converters (TI-ADCs) for
ultra-wide band communications. Design space
exploration of a TI successive approximation architecture
is performed via Monte Carlo simulations, by exploiting
behavioral models built bottom-up after characterizing
the main ADC blocks in a 90-nm 1-V CMOS technology.
Different speed/resolution scenarios are efficiently
investigated and the impact of parallelism on system
performance, yield and power consumption is assessed
starting from the early design phases, finally enabling the
selection of two candidate implementations (a 6-bit 4.6-mW
and a 7-bit 8.1-mW ADC targeting 1 GS/s) that
effectively trade accuracy for energy efficiency and area.
Organizers: N. Suri, TU Darmstadt, DE; C. Fetzer, TU Dresden, DE
Moderator: N. Suri, TU Darmstadt, DE
-
Dependable Embedded Systems Day Panel: Issues and Challenges in Dependable Embedded Systems [p. 1394]
-
Panelists: J. Abraham, S. Poledna, A. Mendelson and S. Mitra
Embedded Systems are pervasively appearing in virtually
all walks of life - communication, computing, e-/m-commerce,
leisure, medical, WSN, transportation, biometrics.
The utility of these embedded systems and services
is based, in large part, on our depending on their sustained
functionality in spite of the encountered operational or malicious
disruptions. As the number of transient and also
permanent disruptions (given the decreasing device geometries,
higher device density, lower voltage latching, faster
clocks, etc.) is expected to increase substantially, this will not
only be a key issue for the hardware community but also the
systems community in general. Solutions using a combination
of hardware and software might be more effective than
hardware-only or software-only solutions.
Building upon the discussions on the conceptual and applied
issues for design, analysis and validation of dependable
embedded systems, the panel will bring together both
the academic and industrial perspectives on the upcoming
challenge themes. Specifically the coverage will encompass
the spectrum of device level, communication aspects and
system level aspects tackling both synergistic and across the
board needs for the future dependable embedded "systems".
Moderators: P. Kundu, Intel, US; S. Murali, EPFL, CH
-
Multicast Parallel Pipeline Routing Architecture for Network-on-Chip [p. 1396]
-
F.A. Samman, T. Hollstein and M. Glesner
This paper presents a flexible mesh router architecture
using synchronous parallel pipeline worm-switching supporting
unicast and multicast services. A very flexible
mechanism to manage broadcast-flow to share the communication
link in the on-chip network is proposed. The proposed
mechanism guarantees that all flits in multicast packets
can be accepted in their multiple destination nodes. Our
Network-on-Chip (NoC) is implemented based on modular
synthesizable VHDL objects. The architecture is flexible to
design new NoC prototypes. Area overhead to update the
NoC from unicast to multicast with the same routing algorithm
is only about 15%.
-
Variation Tolerant NoC Design by Means of Self-Calibrating Links [p. 1402]
-
S. Medardoni, M. Lajolo and D. Bertozzi
We present the implementation and analysis of a variation tolerant
version of a switch-to-switch link in a NoC. The goal is to tolerate
the effects of process variations on NoC architectures using
self-correcting links that automatically detect delay variations and
compensate them. The correction is applied without increasing the
switch-to-switch latency by substituting the output flip-flops of the
sending switch with a self-correcting flip-flop followed by an adaptive
voltage swing selector. Higher delay variations will result in a
smaller slack in the switch-to-switch path, but the adaptive voltage
swing selector could mitigate its impact on the NoC communication
by increasing the voltage swing on the link, thus allowing a
compensation of the delay variation. As a result, it is possible to
tolerate delay variations at the cost of additional power consumption.
-
BARP - A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs [p. 1408]
-
P. Lotfi-Kamran, M. Daneshtalab, C. Lucas and Z. Navabi
A novel routing algorithm, named Balanced Adaptive
Routing Protocol (BARP), is proposed for NoCs to provide
adaptive routing and ensure deadlock-free and livelock-free
routing at the same time. By evenly distributing input
packets of a router among all its shortest path output ports,
a novel adaptive routing protocol for avoiding congestion
condition emerges. It is observed that BARP can achieve
better performance compared to static XY routing, odd-even
routing and dynamic XY routing.
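
The idea of spreading packets evenly over all shortest-path output ports can be illustrated with a minimal sketch. This is not the BARP algorithm itself: it assumes a 2D mesh with (x, y) coordinates, uses a simple round-robin choice, and omits the deadlock and livelock avoidance that the paper addresses. The names `minimal_ports` and `BalancedRouter` are hypothetical.

```python
def minimal_ports(cur, dst):
    """Return the output ports lying on some shortest path in a 2D mesh.

    Assumption: x grows towards "E" and y grows towards "N".
    """
    (cx, cy), (dx, dy) = cur, dst
    ports = []
    if dx > cx:
        ports.append("E")
    elif dx < cx:
        ports.append("W")
    if dy > cy:
        ports.append("N")
    elif dy < cy:
        ports.append("S")
    return ports or ["LOCAL"]


class BalancedRouter:
    """Round-robin selection among the minimal output ports (illustrative only)."""

    def __init__(self, pos):
        self.pos = pos
        self._next = {}  # candidate port tuple -> index of the next port to use

    def route(self, dst):
        ports = tuple(minimal_ports(self.pos, dst))
        i = self._next.get(ports, 0)
        self._next[ports] = (i + 1) % len(ports)
        return ports[i]


if __name__ == "__main__":
    router = BalancedRouter((1, 1))
    # Packets for the same destination alternate between the two minimal ports.
    print([router.route((3, 3)) for _ in range(4)])
```

A real adaptive router would also have to restrict the permitted turns or use virtual channels to stay deadlock-free; that part is deliberately left out of this sketch.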
-
Developing Mesochronous Synchronizers to Enable 3D NoCs [p. 1414]
-
I. Loi, F. Angiolini and L. Benini
The Network-on-Chip (NoC) interconnection paradigm has
been gaining momentum thanks to its flexibility, scalability and
suitability to deep submicron technology processes. The next challenge
is to use NoCs as the backbones of the upcoming generation
of 3D chips, assembled by stacking multiple silicon layers. Multiple
technical issues have to be tackled in this respect. One of the
foremost is the unsuitability of a purely synchronous design style,
as it is not straightforward to impose a strict bound on the clock
skew among multiple clock trees across different layers. In this
paper, we present a scheme to handle mesochronous communication
in 3D NoCs and analyze (i) the circuit design, (ii) the timing
properties, (iii) the requirements to support flow control across
mesochronous links, (iv) the implementation cost of such a scheme
after placement and routing.
Moderators: T. Austin, U of Michigan, US; G. Gaydadjiev, TU Delft, NL
-
Memory Organization with Multi-Pattern Parallel Accesses [p. 1420]
-
A. Vitkovski, G. Kuzmanov and G. Gaydadjiev
We propose an interleaved memory organization supporting
multi-pattern parallel accesses in two-dimensional
(2D) addressing space. Our proposal targets
computing systems with high memory bandwidth demands
such as vector processors, multimedia accelerators,
etc. We substantially extend prior research on interleaved
memory organizations introducing 2D-strided accesses
along with additional parameters, which define a
large variety of 2D data patterns. The proposed scheme
guarantees minimum memory latency and efficient bandwidth
utilization for arbitrary configuration parameters
of the data pattern. We provide mathematical descriptions
and proofs of correctness for the proposed addressing
schemes. The design complexity and the critical paths are
evaluated using technology independent resource counts
and confirm the scalability of the proposal. Hardware
synthesis results for 90nm CMOS technology suggest that
throughputs in the range between 44 and 1182 Gbit/s can
be obtained at the cost of 26-212 Kgates for configurations
of 2x2 32-bit up to 8x8 64-bit memory modules.
Index Terms - Conflict-free access, high bandwidth,
multi-pattern access, parallel memories.
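
To make the notion of conflict-free parallel access concrete, the following sketch uses the simplest possible module assignment, (row mod p, col mod q), for a p x q array of memory modules. This is an assumption for illustration only and is not the addressing scheme of the paper, which additionally covers strided and other 2D patterns. The function names are hypothetical.

```python
def module_of(row, col, p, q):
    """Map a 2D address to one of p*q memory modules (simple interleaving)."""
    return (row % p, col % q)


def is_conflict_free(row0, col0, p, q):
    """Check that a dense p x q block starting at (row0, col0) touches
    every module exactly once, so all p*q words can be read in parallel."""
    modules = {
        module_of(row0 + r, col0 + c, p, q)
        for r in range(p)
        for c in range(q)
    }
    return len(modules) == p * q


if __name__ == "__main__":
    # Any dense 2x2 block, aligned or not, is conflict-free under this mapping.
    print(all(is_conflict_free(r, c, 2, 2) for r in range(8) for c in range(8)))
    # A stride-2 row access is not: both elements fall into the same module.
    print(module_of(0, 0, 2, 2) == module_of(0, 2, 2, 2))
```

The second check shows why plain interleaving is not enough for strided patterns, which is the kind of limitation a multi-pattern scheme has to overcome.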
-
CATCH: A Mechanism for Dynamically Detecting Cache-Content-Duplication and Its
Application to Instruction Caches [p. 1426]
-
M. Kleanthous and Y. Sazeides
Cache-Content-Duplication (CCD) occurs when there is
a miss for a block in a cache and the entire content of
the missed block is already in the cache in a block with
a different tag. Caches aware of content-duplication can
have lower miss rates by allowing only blocks with unique
content to enter a cache. This work examines the potential
of CCD for instruction caches. We show that CCD is
a frequent phenomenon and that an idealized duplication-detection
mechanism for instruction caches has the potential
to increase performance of an out-of-order processor,
with a 2-way, eight-instructions-per-block, 16KB instruction
cache, often by more than 5% and up to 20%. This work
also proposes CATCH, a hardware based mechanism for
dynamically detecting CCD. Experimental results for an
out-of-order processor show that a CATCH with a 2.32KB
cost usually captures 60% or more of the CCD's idealized
potential.
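
The definition of CCD given above can be made concrete with a small trace-driven sketch. It models an idealized detector on a fully associative LRU cache, which is an assumption chosen for brevity; it is not the CATCH hardware mechanism, and the function name and trace format are hypothetical.

```python
from collections import OrderedDict


def count_duplicate_misses(trace, capacity):
    """Count misses and content-duplicate misses for a block access trace.

    trace: iterable of (tag, content) pairs, one per block access.
    Returns (misses, duplicate_misses). A duplicate miss is a miss whose
    block content is already cached under a different tag.
    """
    cache = OrderedDict()  # tag -> content, least recently used first
    misses = duplicates = 0
    for tag, content in trace:
        if tag in cache:
            cache.move_to_end(tag)  # hit: refresh LRU position
            continue
        misses += 1
        if content in cache.values():
            duplicates += 1  # same content present under another tag
        cache[tag] = content
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict the least recently used block
    return misses, duplicates


if __name__ == "__main__":
    trace = [(0, "A"), (1, "B"), (2, "A"), (0, "A")]
    print(count_duplicate_misses(trace, capacity=4))
```

The ratio of duplicate misses to all misses gives a rough upper bound on what a duplication-aware cache could save on such a trace.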
-
MAGELLAN: A Search and Machine Learning-Based Framework for Fast Multi-Core
Design Space Exploration and Optimization [p. 1432]
-
S. Kahng and R. Kumar
In this paper, we treat multi-core processor design
space exploration as an application-driven machine learning
problem. We develop two machine learning-based techniques
for efficiently exploring the processor design space.
We observe that these techniques result in multi-core processors
whose performance is comparable (within 1%) to
a processor design that requires an exhaustive exploration
of the design space. These techniques often take orders of
magnitude (a factor of 3800 at the minimum) less time for
coming up with these processors. The benefits are up to 13%
over intelligent search techniques that have been adapted to
do multi-core design space exploration.
We leverage the knowledge gained in this research to develop
Magellan - a framework for accelerating multi-core
design space exploration and optimization. Magellan can
be used to find the highest throughput processors of a given
type for a given area, power, or time budget. It can be used
to aid even experienced processor designers that prefer to
rely on intuition by allowing fast refinements to an input design.
-
Process Variation Aware Issue Queue Design [p. 1438]
-
R. K and M. Mutyam
In sub-90nm process technology it becomes harder to
control the fabrication process, which in turn causes variations
between the design-time parameters and the fabricated
parameters. Variations in the critical process parameters
can result in significant fluctuations in the switching
speed and leakage power consumption of different transistors
in the same chip.
In this paper, we study the impact of process variation
on issue queues. Due to process variation, issue queues can
take variable access latency. In order to work with nonuniform
access latency issue queues, by exploiting ready
operands of instructions at dispatch time, we propose a process
variation aware issue queue design. Experimental results
reveal that, for a 64-entry issue queue with half of the
entries affected by process variation, our technique recovers
most of the lost performance due to process variation
and incurs a performance penalty of less than 2% with respect
to the performance of issue queues without process
variation.
Moderators: L. Torres, LIRMM, Montpellier, FR; W. Eberle, IMEC, BE
-
Implementation of Parallel LFSR-Based Applications on an Adaptive DSP Featuring a
Pipelined Configurable Gate Array [p. 1444]
-
C. Mucci, L. Vanzolini, I. Mirimin, D. Gazzola, A. Deledda, S. Goller, J. Knaeblein, A. Schneider,
L. Ciccarelli and F. Campi
Linear feedback shift registers (LFSRs) are common
structures in many application fields, including cryptography,
digital broadcasting and communication. High-throughput
requirements need highly parallel implementations,
usually accomplished in state of the art system on
chips (SoCs) with application specific coprocessors. Although
this approach achieves the required performance,
it rapidly shows lack of flexibility when those devices are
proposed, as an example, for multi-standard modems or for
security applications in which run-time update can provide
added value. This paper shows the implementation of parallel
LFSR-based applications on an embedded adaptive DSP
featuring a Pipelined Configurable Gate Array (PiCoGA).
With respect to standard embedded FPGAs, pipelined devices
usually provide better performance, e.g. in terms of
speed, but they commonly show the undeniable drawback of
additional design constraints. As a test-case, we consider
the implementation of the 32-bit CRC used in the Ethernet
standard that achieves on the target architecture up to
~25Gbit/sec throughput, with a parallel LFSR processing
128 bits at a time, which is comparable to the performance
offered by some ASIC devices.
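
As a software analogue of serial versus parallel LFSR processing, the sketch below computes the Ethernet CRC-32 one bit per step and then eight bits per step with a precomputed table. It only illustrates the principle of advancing the LFSR by several bits at once; the 128-bit-wide PiCoGA implementation described in the abstract is a hardware design and is not reproduced here. The function names are hypothetical, and `zlib.crc32` from the Python standard library is used as a reference.

```python
import zlib

POLY = 0xEDB88320  # reflected form of the IEEE 802.3 CRC-32 polynomial


def crc32_serial(data: bytes) -> int:
    """Bit-serial LFSR: one input bit is shifted in per step."""
    crc = 0xFFFFFFFF
    for byte in data:
        for i in range(8):
            bit = (byte >> i) & 1        # input bits are taken LSB first
            feedback = (crc ^ bit) & 1
            crc >>= 1
            if feedback:
                crc ^= POLY
    return crc ^ 0xFFFFFFFF


# Precompute the effect of eight LFSR steps for every possible low byte.
TABLE = []
for n in range(256):
    c = n
    for _ in range(8):
        c = (c >> 1) ^ POLY if c & 1 else c >> 1
    TABLE.append(c)


def crc32_bytewise(data: bytes) -> int:
    """Parallel form: the LFSR state advances eight bits per step."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    msg = b"123456789"
    print(hex(crc32_serial(msg)), hex(crc32_bytewise(msg)), hex(zlib.crc32(msg)))
```

Both functions implement the same linear recurrence; widening the step simply trades more combinational logic (here, a larger table) for fewer steps per message.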
-
GMDS: Hardware Implementation of Novel Real Output Queuing Architecture [p. 1450]
-
R. Arteaga, F. Tobajas, R. Esper-Chain, V. de Armas and R. Sarmiento
In this paper, a real output queuing switch prototype implementation
is presented. This implementation is based on
a novel high speed multidrop backplane and a general purpose
line card which includes a Virtex-II 6000 FPGA. This
switch is named GMDS (Gigabit MultiDrop Switch) and its
main features are the replacement of the switch matrix by the multidrop
backplane (increasing system reliability), variable-length
packet switching support (avoiding bandwidth efficiency
loss), a multiple output queuing structure for supporting
QoS (Quality of Service) and a minimum speedup.
-
Front End Device for Content Networking [p. 1456]
-
J. Buboltz and T. Kocak
The bandwidth and speed of network connections
are continually increasing. The speed increase in
network technology is set to soon outpace the speed
increase in CMOS technology. This asymmetrical
growth is beginning to cause software applications
that once worked with then-current levels of network
traffic to flounder under the new high data rates.
Processes that were once executed in software now
have to be executed, partially if not wholly, in
hardware. One such application that could benefit
from hardware implementation is high layer routing.
By allowing a network device to peer into higher layers
of the OSI model, the device can scan for viruses,
provide higher quality-of-service (QoS), and efficiently
route packets. This paper proposes an architecture for
a device that will utilize hardware-level string
matching to distribute incoming requests for a server
farm. The proposed architecture is implemented in
VHDL, synthesized, and laid out on an Altera FPGA.
-
Power Aware Reconfigurable Multiprocessor for Elliptic Curve Cryptography [p. 1462]
-
M. Purnaprajna, C. Puttmann, M. Porrmann
Reconfigurable architectures are being increasingly used
for their flexibility and extensive parallelism to achieve accelerations
for computationally intensive applications. Although
these architectures provide easy adaptability, this comes
so with an overhead in terms of area, power and timing,
as compared to non-reconfigurable ASICs. Here, we propose
a low overhead reconfigurable multiprocessor, which
provides both parallelism and flexibility. The architecture
has been evaluated for its energy efficiency for a computationally
intensive algorithm used in elliptic curve cryptography
(ECC).
Typically, algorithms in ECC exhibit task-level parallelism
and demand a large amount of computational resources
for custom implementations to achieve a significant
speedup. A finite field multiplication in GF(2^233) was chosen
as a sample application to evaluate the performance
on the QuadroCore reconfigurable multiprocessor architecture.
A three-fold performance improvement as compared to
a single processor implementation was observed. Further,
via reconfiguration to suit the application, power savings of
about 24% were noted in UMC's 90nm standard cell technology.
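
For readers unfamiliar with the sample application, the following is a minimal shift-and-add sketch of multiplication in GF(2^233), with field elements stored as Python integers whose bits are polynomial coefficients. The reduction polynomial x^233 + x^74 + 1 is the one specified by NIST for its 233-bit binary curves; that the paper uses this polynomial is an assumption, and the code says nothing about how the work is partitioned across the QuadroCore processors.

```python
M = 233
# Assumed reduction polynomial: x^233 + x^74 + 1 (NIST binary field, degree 233).
REDUCTION = (1 << 233) | (1 << 74) | 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^233); inputs must be below 2**233."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^m) is bitwise XOR
        b >>= 1
        a <<= 1                  # multiply a by x
        if a >> M:               # degree reached 233: reduce modulo the polynomial
            a ^= REDUCTION
    return result


if __name__ == "__main__":
    x232 = 1 << 232
    # x^232 * x = x^233, which reduces to x^74 + 1 under the assumed polynomial.
    print(gf_mul(x232, 2) == (1 << 74) | 1)
```

Each loop iteration depends on the previous one only through `a`, so the partial products can in principle be computed on several processing elements and combined with XOR.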
Moderators: M. Sonza Reorda, Politecnico di Torino, IT; A. Zjajo, NXP Semiconductors, NL
-
Digital Bit Stream Jitter Testing Using Jitter Expansion [p. 1468]
-
H. Choi and A. Chatterjee
This paper presents a time-domain jitter expansion technique
for high-speed digital bit sequence jitter testing.
While jitter expansion has been applied to phase noise measurements
of sinusoidal signals before, its applicability to
random clock jitter testing and data-dependent jitter testing
has not been explored. The latter problems have wide application
and necessitate new analysis procedures given in
this paper. Since low phase noise sinusoids can be generated
relatively easily as compared to low jitter digital
clocks, the proposed technique utilizes a low-frequency sine
wave as a reference signal which can be fed to the device
under test with less concern for reference signal noise. A
special circuit called a jitter-sensor is used for jitter extraction
and produces a low-speed output signal with higher
jitter values that track the jitter of the high-speed digital
test signal. Thus, conventional narrow-bandwidth testers
are able to analyze the sensor output. This makes high-resolution
jitter testing for high-speed digital signals possible
at low cost.
-
A Same/Different Fault Dictionary: An Extended Pass/Fail Fault Dictionary with Improved
Diagnostic Resolution [p. 1474]
-
I. Pomeranz and S.M. Reddy
We describe a new type of fault dictionary called a
same/different fault dictionary. The same/different fault
dictionary is similar to a pass/fail fault dictionary in that
it contains a single bit b_{i,j} for every modeled fault f_i and
test vector t_j. However, in a pass/fail fault dictionary, b_{i,j}
is determined by comparing the output vector of the faulty
circuit with the output vector of the fault-free circuit;
while in a same/different fault dictionary, b_{i,j} is determined
by comparing the output vector of the faulty circuit
with a preselected output vector called a baseline output
vector. By appropriately selecting the baseline output vectors
for all the test vectors, it is possible to obtain
increased diagnostic resolution with a same/different fault
dictionary compared to a pass/fail fault dictionary. We
describe a procedure for selecting baseline output vectors
and present experimental results.
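
The difference between the two dictionary types can be shown on a toy example. The sketch below builds both dictionaries from a table of faulty-circuit output vectors and counts how many fault pairs each one distinguishes. The response data and the baseline choice are made up for illustration, and the paper's baseline selection procedure is not reproduced; all function names are hypothetical.

```python
def build_dictionary(responses, reference):
    """responses[i][j] is the output vector of fault f_i under test t_j.
    reference[j] is the vector compared against for test t_j: the fault-free
    output (pass/fail) or a chosen baseline output (same/different).
    Returns one tuple of bits b_{i,j} per fault."""
    return [
        tuple(int(out != reference[j]) for j, out in enumerate(row))
        for row in responses
    ]


def distinguished_pairs(dictionary):
    """Number of fault pairs whose dictionary rows differ."""
    n = len(dictionary)
    return sum(
        dictionary[i] != dictionary[k]
        for i in range(n)
        for k in range(i + 1, n)
    )


if __name__ == "__main__":
    # Hypothetical two-output circuit, three faults, two tests.
    fault_free = ["00", "00"]
    responses = [
        ["01", "00"],  # f_0
        ["10", "00"],  # f_1
        ["01", "11"],  # f_2
    ]
    baselines = ["01", "00"]  # hand-picked baseline output vectors
    pass_fail = build_dictionary(responses, fault_free)
    same_diff = build_dictionary(responses, baselines)
    print(distinguished_pairs(pass_fail), distinguished_pairs(same_diff))
```

In this toy case f_0 and f_1 both fail the first test, so a pass/fail dictionary cannot separate them, while a baseline equal to the response of f_0 does. Both dictionaries store the same number of bits per fault.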
-
A Design-for-Diagnosis Technique for SRAM Write Drivers [p. 1480]
-
A. Ney, P. Girard, S. Pravossoudovitch, A. Virazel, M. Bastian and V. Gouin
Diagnosis is becoming a major concern with the rapid
development of semiconductor memories. It provides
information about the location of manufacturing defects in
the memory, and its effectiveness allows a fast yield ramp
up. Most existing diagnosis methods use a fault
dictionary to provide detailed information on fault
localization. However, these solutions are most of the time
unable to distinguish between all faults, and more
importantly often fail to identify the actual faulty block of
the memory. Identifying which block of a memory (core-cell
array, write drivers, address decoders, pre-charge
circuits, etc.) is defective saves a considerable
amount of time during the ramp-up phase.
In this paper, we propose a very low cost Design-for-Diagnosis
(DfD) solution for identifying faulty write
drivers. It consists in verifying logic and analog conditions
that guarantee the fault-free behavior of the write driver.
The proposed solution allows a fast diagnosis (only three
consecutive write operations are needed to fully diagnose
the write driver) and induces a low area overhead (about
0.5% for a 512x512 SRAM). Besides diagnosis, an
additional interest of such a solution is its usefulness
during a post-silicon characterization process, where it
can be used to extract the main features of write drivers
(logic and analog levels on bit lines).
-
Variable Delay of Multi-Gigahertz Digital Signals for Deskew and Jitter-Injection Test
Applications [p. 1486]
-
D.C. Keezer, D. Minier and P. Ducharme
The ability to precisely control the timing of digital
signals is especially important for multi-GHz testing
applications where errors are measured in picoseconds
or even 100fs. While many solutions exist for continuous
clock-type signals, delay of wide-bandwidth data signals
is not so easy. In this paper we introduce a novel
technique for adjusting the delay of ~7Gbps data signals
on a picosecond scale without significant distortion. The
approach is based on a timing/amplitude dependency
effect observed in a variable-gain SiGe buffer. A
prototype is demonstrated with a variable delay range of
about 50ps. This circuit is enhanced by adding a
"coarse" delay section, including four 33ps steps, to
provide the desired total range of ~140ps. The end
application requires several of these circuits for
deskewing parallel buses of 6.4Gbps ATE signals. The
circuit is also useful for injecting a variable amount of
jitter, limited by the fine-delay adjustment range.
Moderators: K. Larsen, Aalborg U, DK; J. Gerlach, Robert Bosch GmbH, DE
-
Retargetable Code Optimization for Predicated Execution [p. 1492]
-
M. Hohenauer, F. Engel, R. Leupers, G. Ascheid, H. Meyr, G. Bette and B. Singh
Retargetable C compilers are key components of today's
embedded processor design platforms for quickly obtaining
compiler support and performing early processor architecture
exploration. The inherent problem of the retargetable
compilation approach, though, is the well known trade-off
between the compiler's flexibility and the quality of generated
code. However, it can be circumvented by designing
flexible, configurable code optimization techniques applicable
to a certain range of target architectures. This paper focuses
on target machines with predicated execution support
which is wide-spread in deeply pipelined and highly parallel
embedded processors used in next generation high-end
video, multimedia and wireless devices. We present an efficient
and quickly retargetable code optimization technique
for predicated execution that is integrated into an industrial
retargetable C compiler. Experimental results for several
embedded processors demonstrate that the proposed technique
is applicable to real-life target machines and that it
produces significant code quality improvements for control
intensive applications.
-
Programming Shared Memory Multiprocessors with Deterministic Message-Passing
Concurrency: Compiling SHIM to Pthreads [p. 1498]
-
S.A. Edwards, N. Vasudevan and O. Tardieu
Multicore shared-memory architectures are becoming
prevalent and bring many programming challenges. Among
the biggest are data races: accesses to shared resources that
make a program's behavior depend on scheduling decisions
beyond its control. To eliminate such races, the SHIM concurrent
programming language adopts deterministic message
passing as its sole communication mechanism.
We demonstrate such language restrictions are practical
by presenting a SHIM to C-plus-Pthreads compiler that can
produce efficient code for shared-memory multiprocessors.
We present a parallel JPEG decoder and FFT exhibiting 3.05x
and 3.3x speedups on a four-core processor.
-
Modularity vs. Reusability: Code Generation from Synchronous Block Diagrams [p. 1504]
-
R. Lublinerman and S. Tripakis
We present several methods to generate modular code
from synchronous hierarchical block diagrams. Modularity
means code is generated for a given macro (i.e., composite)
block independently from context, that is, without knowing
where this block is to be used, and also with minimal knowledge
about its sub-blocks. We achieve this by generating a
set of interface functions for each block and a set of dependencies
between these functions that is exported along with
the interface. The main trade-off is the degree of modularity
(number of interface functions) vs. reusability (the set
of diagrams that the block can be used in without creating
dependency cycles).
-
ezRealtime: A Domain-Specific Modeling Tool for Embedded Hard Real-Time Software
Synthesis [p. 1510]
-
F. Cruz, R. Barreto, L. Cordeiro and P. Maciel
In this paper, we introduce the ezRealtime project, which
relies on the Time Petri Net (TPN) formalism and defines a
Domain-Specific Modeling (DSM) tool to provide an easy-to-use
environment for specifying Embedded Hard Real-Time (EHRT)
systems and for synthesizing timely and predictable scheduled
C code. Therefore, this paper presents
a generative programming method in order to boost code
quality and substantially improve developer productivity by
making use of automated software synthesis. The ezRealtime
tool reads and automatically translates the system's
specification to a time Petri net model through composition
of building blocks with the purpose of providing a complete
model of all tasks in the system. Hence, this model
is used to find a feasible schedule by applying a depth-first
search algorithm. Finally, the scheduled code is generated
by traversing the feasible schedule, and replacing transition's
instances by the respective code segments. We also
present the application of the proposed method in an expressive
case study.
Organizers: B. Bougard, IMEC, BE; P. Marchal, IMEC, BE
Moderator: P. Marchal, IMEC, BE
-
3D Integration or How to Scale in the 21st Century [p. 1516]
-
Presenters: L. Benini, D. Keitel-Schulz, N. Checka
-
3D integration offers numerous opportunities for design, and is probably the best hope for carrying ICs along
(and even beyond) the path of Moore's Law in the 21st century. However, many questions still need to be
answered to take advantage of 3D. First, what will become the mainstream 3D technology? Today, many
technology options are proposed, each with different cost, design and test implications. Secondly, how to
make 3D designs reliable? Many unknowns still exist related to thermal load, reliability and signal integrity
challenges. Finally, what about design solutions/methods and architectural modifications for 3D integration?
The objective of this special session is to create a better understanding of forthcoming 3D technologies and their
implications for design and test. An attempt will be made to roadmap 3D technologies and their design
implications. This will enable R&D planning by design houses, EDA vendors, foundries and academia, paving
the way for a widespread acceptance of 3D technologies.