DATE 2006 ABSTRACTS
-
EDA Challenges in the Converging Application World [p. 1]
-
R. Penning de Vries
Whereas Moore's law continues to push the industry to ever more complex technologies (More Moore)
supporting sophisticated digital applications, the so-called More than Moore technologies are finding more
and more heterogeneous application domains. EDA challenges in More Moore are related to power
optimisation, DfM and verification needs whereas the More than Moore technologies require EDA tools
that relate various electrical, logical and physical domains in one environment. System level design is
badly needed in More Moore and in More than Moore.
-
Sociology of Design and EDA [p. 2]
-
W. C. Rhines
Successful design of complex electronic systems increasingly requires the bi-directional flow of
information among groups of design specialists who are becoming more dispersed geographically and
organisationally. This affects the type of design flows we develop, the nature of the design tools, how
design software is supported and the organisational structure of the EDA and electronics industries.
Dr Rhines will provide examples and discuss issues that suggest how the interaction among designers and
design organisations will affect the future evolution of design methodology and EDA.
Moderators: P. Eles, Linkoping U, SE; D. Atienza, DACYA/Madrid Complutense U, ES
-
Communication-Aware Allocation and Scheduling Framework for Stream-Oriented Multi-Processor
Systems-on-Chip [p. 3]
-
M. Ruggiero, A. Guerri, D. Bertozzi, F. Poletti, and M. Milano
This paper proposes a complete allocation and scheduling
framework, where an MPSoC virtual platform is used to accurately
derive input parameters, validate abstract models of system components
and assess constraint satisfaction and objective function
optimization. The optimizer implements an efficient and exact approach
to allocation and scheduling based on problem decomposition.
The allocation subproblem is solved through Integer Programming
while the scheduling one through Constraint Programming.
The two solvers can interact by means of no-good generation,
thus building an iterative procedure which has been proven to
converge to the optimal solution. Experimental results show significant
speedups w.r.t. pure IP and CP exact solution strategies as well
as high accuracy with respect to cycle accurate functional simulation.
A case study further demonstrates the practical viability of our
framework for real-life systems and applications.
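
The decomposition loop described above can be illustrated with a toy sketch. It is not the authors' implementation: brute-force enumeration stands in for the Integer Programming allocation solver, a simple per-processor load check stands in for the Constraint Programming scheduler, and the function name, cost table and deadline are hypothetical. The no-good mechanism is the part being shown: each rejected allocation yields a cut that the master problem must respect on the next iteration.

```python
from itertools import product


def allocate_and_schedule(durations, cost, n_procs, deadline):
    """Toy allocation/scheduling loop driven by no-good cuts.

    durations[t] is the run time of task t, cost[t][p] the (hypothetical)
    cost of placing task t on processor p. Returns (cost, allocation) for
    the cheapest schedulable allocation, or None if none exists.
    """
    n_tasks = len(durations)
    # Each no-good (p, tasks) forbids placing all of `tasks` on processor p.
    nogoods = []
    while True:
        # Master problem (stand-in for IP): cheapest allocation that
        # violates no recorded no-good.
        best = None
        for alloc in product(range(n_procs), repeat=n_tasks):
            if any(all(alloc[t] == p for t in tasks) for p, tasks in nogoods):
                continue
            total = sum(cost[t][alloc[t]] for t in range(n_tasks))
            if best is None or total < best[0]:
                best = (total, alloc)
        if best is None:
            return None  # every allocation has been ruled out
        total, alloc = best

        # Subproblem (stand-in for CP): tasks on one processor run back to
        # back, so the allocation is schedulable iff each load fits.
        cuts = []
        for p in range(n_procs):
            tasks = frozenset(t for t in range(n_tasks) if alloc[t] == p)
            if sum(durations[t] for t in tasks) > deadline:
                cuts.append((p, tasks))
        if not cuts:
            return total, alloc  # first schedulable proposal is optimal
        nogoods.extend(cuts)


if __name__ == "__main__":
    print(allocate_and_schedule([3, 3, 2], [[1, 5], [1, 5], [1, 5]], 2, 5))
```

Because the master problem always proposes the cheapest allocation not yet excluded, and every cut removes only unschedulable allocations, the first proposal that passes the scheduling check is optimal for this toy model.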
-
Efficient Link Capacity and QoS Design for Network-on-Chip [p. 9]
-
Z. Guz, I. Walter, E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny
This paper addresses the allocation of link capacities in
the automated design process of a network-on-chip based
system. Communication resource costs are minimized
under Quality-of-Service timing constraints.
First, we introduce a novel analytical delay model for
virtual channeled wormhole networks with non-uniform
link capacities that eliminates costly simulations at the
inner-loop of the optimization process. Second, we present
an efficient capacity allocation algorithm that assigns link
capacities such that packet delay requirements for each
flow are satisfied. We demonstrate the benefit of capacity
allocation for a typical system on chip, where the traffic is
heterogeneous and delay requirements may largely vary,
in comparison with the standard approach which assumes
uniform-capacity links.
-
Supporting Task Migration in Multi-Processor Systems-on-Chip: A Feasibility Study [p. 15]
-
S. Bertozzi, A. Acquaviva, D. Bertozzi, and A. Poggiali
With the advent of multi-processor systems-on-chip, the interest
in process migration is again on the rise both in research and
in product development. New challenges associated with the new
scenario include increased sensitivity to implementation complexity,
tight power budgets, requirements on execution predictability,
and the lack of virtual memory support in many low-end MPSoCs. As
a consequence, the effectiveness and applicability of traditional transparent
migration mechanisms are called into question in this context.
Our paper proposes a task management software infrastructure
that is well suited for the constraints of single chip multiprocessors
with distributed operating systems. Load balancing in the
system is maintained by means of intelligent initial placement and
task migration. We propose a user-managed migration scheme
based on code checkpointing and user-level middleware support
as an effective solution for many MPSoC application domains. In
order to prove the practical viability of this scheme, we also propose
a characterization methodology for task migration overhead.
We derive the minimum execution time following a task migration
event during which the system configuration should be frozen to
make up for the migration cost.
Moderators: L. Daniel, MIT, US; L. M. Silveira, TU Lisbon and INESC, PT
-
Time Domain Model Order Reduction by Wavelet Collocation Method [p. 21]
-
X. Zeng, L. Feng, Y. Su, W. Cai, D. Zhou, C. Chiang
In this paper, a wavelet based approach is proposed
for the model order reduction of linear circuits in time
domain. Compared with the Chebyshev reduction method, the
wavelet reduction approach can achieve smaller reduced
order circuits with very high accuracy, especially for those
circuits with strong singularities. Furthermore, to compute
the basis function coefficient vectors, a fast Sylvester
equation solver is proposed, which runs one to two
orders of magnitude faster than the vector equation solver
employed by the Chebyshev reduction method. The proposed
wavelet method is also compared with the frequency
domain model reduction method, which may lose
accuracy in time domain. Both theoretical analysis and
experimental results have demonstrated the high speed and
high accuracy of the proposed method.
-
Large Power Grid Analysis Using Domain Decomposition [p. 27]
-
Q. Zhou, K. Sun, K. Mohanram and D. C. Sorensen
This paper presents a domain decomposition (DD)
technique for efficient simulation of large-scale linear circuits such
as power distribution networks. Simulation results show that by integrating
the proposed DD framework, existing linear circuit simulators
can be extended to handle otherwise intractable systems.
-
Analysis and Modeling of Power Grid Transmission Lines [p. 33]
-
J. Balachandran, S. Brebels, G. Carchon, T. Webers, W. De Raedt, B. Nauwelaers, and E. Beyne
Power distribution and signal transmission are becoming
key limiters for chip performance in the nanometer era. These
issues can be simultaneously addressed by designing
transmission lines in power grids. The transmission lines are
well suited for high quality intra-chip signal transmission at
multi gigabit data rates. By having signal lines between the
power grids, the VDD and GND lines in the grid can be
exploited as return paths besides being used for regular
power distribution. This approach also improves wiring
density. In this paper, we rigorously analyze and discuss the
design considerations for laying transmission lines in power
grids. We also present design oriented modeling methods in
2D and 3D geometry. We show how the grid modeling
complexity is simplified. We experimentally validate our
results with fabricated test structures. We also show that the VDD
lines in the grid act as a good return path without external
decoupling capacitors in our design. Further, we discuss
substrate effects and deduce guidelines for designing power
grid transmission lines on a low resistive silicon substrate.
-
A Logarithmic Full-Chip Thermal Analysis Algorithm Based on Multi-Layer Green's Function [p. 39]
-
B. Wang and P. Mazumder
This paper derives the multi-layer heat conduction
Green's function, by integrating the eigen-expansion technique
and the classic transmission line theories, and
presents a logarithmic full-chip thermal analysis algorithm,
which is verified by comparisons with a computational
fluid dynamics tool (FLUENT). The paper considers
Dirichlet's and general heat convection boundary conditions
at chip surfaces. Experimental results show that
the algorithm offers superior computing speed, compared
to FLUENT and traditional Green's function based
methods. The paper also studies the limitations of the traditional
single-layer thermal model.
-
Large Scale RLC Circuit Analysis Using RLCG-MNA Formulation [p. 45]
-
Y. Tanji, T. Watanabe, H. Kubota and H. Asai
A fast method for timing analysis of large scale RLC networks
using the RLCG-MNA formulation, which provides
good properties for fast matrix solvers, is presented. The
proposed method is faster than INDUCTWISE and more
general than the RLP algorithm, both of which are known
as state-of-the-art simulation methods. A numerical example
illustrates the good performance of the proposed method
compared with previous work.
Moderators: R. Galivanche, Intel Corporation, US; M. Goessel, Potsdam U, DE
-
Soft Delay Error Analysis in Logic Circuits [p. 47]
-
B. Gill, C. Papachristou and F. Wolff
In this paper, we present an analysis methodology to
compute circuit node sensitivity due to charged particle induced delay
(timing) errors, Soft Delay Errors (SDE). We define a node sensitivity
metric and describe a step-by-step procedure to compute node sensitivity.
We use mixed-mode simulations to extract accurate current pulses for
the characterization of SDE. A technique for logic cell library characterization
for SDE is described. Our approach is orders of magnitude faster
than using Spice based analysis and its accuracy is close to Spice. Using
our approach, we provide the distribution of node sensitivity for various
ISCAS85 circuits and two adders. Such analysis is important for applying
node hardening techniques to selected nodes to increase the reliability
of CMOS circuits. We use two test circuits to apply a node hardening
technique to the highly sensitive nodes identified by our
approach. Results are provided for the reduction of the circuit sensitivity.
-
A Built-In Redundancy-Analysis Scheme for RAMs with 2D Redundancy Using 1D Local Bitmap [p. 53]
-
T.-W. Tseng, J.-F. Li and D.-M. Chang
The built-in self-repair (BISR) technique is gaining popularity
for repairing embedded memory cores in system-on-chips
(SOCs). To increase the utilization of memory redundancy,
the BISR technique usually needs to perform a built-in
redundancy-analysis (BIRA) algorithm for redundancy allocation.
This paper presents an efficient BIRA scheme for
embedded memory repair. The BIRA scheme executes the
2D redundancy allocation based on the 1D local bitmap.
This allows the BIRA circuitry to be implemented
with low area cost. Also, the BIRA algorithm can provide
a good repair rate (i.e., the ratio of the number of repaired
memories to the number of defective memories). Experimental
results show that the repair rate of the proposed
BIRA scheme approaches that of the optimal scheme
for the memories with different fault distributions. Also, the
ratio of the analysis time to the test time is small.
-
Analysis of the Impact of Bus Implemented EDCs on On-Chip SSN [p. 59]
-
D. Rossi, C. Steiner and C. Metra
In this paper we analyze the impact of error detecting
codes, implemented on an on-chip bus, on the on-chip
simultaneous switching noise (SSN). First, we analyze in
detail how SSN is impacted by different bus transitions,
pointing out its dependency on the number and placement
of switching wires. Afterwards, we present an analytical
model that we have developed in order to estimate the SSN,
and that we prove to be very accurate in SSN prediction.
Finally, by employing the developed model, we estimate the
SSN due to different EDCs implemented on an on-chip bus.
In particular, we highlight how their differences in the number
of switching wires, bus parallelism and codewords influence
the on-chip SSN.
-
Optimal Periodic Testing of Intermittent Faults in Embedded Pipelined Processor Applications [p. 65]
-
N. Kranitis, A. Merentitis, N. Laoutaris, G. Theodorou, A. Paschalis, D. Gizopoulos and C. Halatsis
Today's nanometer technology trends have a very
negative impact on the reliability of semiconductor
products. Intermittent faults constitute the largest part of
reliability failures that are manifested in the field during
the semiconductor product operation. Since Software-Based
Self-Test (SBST) has been proposed as an effective
strategy for on-line testing of processors integrated in
non-safety critical low-cost embedded system
applications, optimal test period specification is becoming
increasingly challenging.
In this paper we first introduce a reliability analysis
for optimal periodic testing of intermittent faults that
minimizes the test cost incurred based on a two-state
Markov model for the probabilistic modeling of
intermittent faults. Then, we present for the first time an
enhanced SBST strategy for on-line testing of complex
pipelined embedded processors. Finally, we demonstrate
the effectiveness of the proposed optimal periodic SBST
strategy by applying it to a fully-pipelined RISC embedded
processor and providing experimental results.
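
As a rough illustration of the kind of two-state Markov model mentioned above, the sketch below treats an intermittent fault as a discrete-time chain that alternates between an inactive and an active state. The abstract does not give the model's parameters, so the per-step transition probabilities `p_on` and `p_off`, the function names and the numeric values are assumptions made purely for illustration.

```python
def stationary_active(p_on, p_off):
    """Long-run probability that the fault is active.

    p_on:  per-step probability of moving from inactive to active.
    p_off: per-step probability of moving from active to inactive.
    (Both are hypothetical parameters, not values from the paper.)
    """
    return p_on / (p_on + p_off)


def active_probability_after(steps, p_on, p_off, start_active=0.0):
    """Probability that the fault is active after `steps` transitions."""
    active = start_active
    for _ in range(steps):
        # Stay active with probability (1 - p_off), or become active
        # from the inactive state with probability p_on.
        active = active * (1.0 - p_off) + (1.0 - active) * p_on
    return active


if __name__ == "__main__":
    print(stationary_active(0.1, 0.3))
    print(active_probability_after(200, 0.1, 0.3))
```

Under this simplified model, a test run at a random instant can observe the fault only while the chain is in the active state, which is why the choice of test period interacts with the transition probabilities.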
-
Berger Code-Based Concurrent Error Detection in Asynchronous Burst-Mode Machines [p. 71]
-
S. Almukhaizim and Y. Makris
We discuss the use of the Berger code for Concurrent Error Detection (CED)
in Asynchronous Burst-Mode Machines (ABMMs). We present a state encoding
method which guarantees the existence of the two key components for
Berger-encoding an ABMM, namely an inverter-free ABMM implementation
of the circuit and an ABMM implementation of the corresponding Berger
code generator. We also propose improved solutions to two inherent
problems of CED in ABMMs, namely checking synchronization and detection
of error-induced hazards. Experimental results demonstrate that Berger-code-based
CED reduces significantly the cost of previous CED methods for ABMMs.
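
For readers unfamiliar with the Berger code, the following minimal sketch shows the standard encoding on plain bit strings: the check symbol is the number of 0s in the information word, written in binary. It illustrates the code itself, not the ABMM state encoding or checker proposed in the paper, and the function names are hypothetical.

```python
def berger_check(info_bits):
    """Return the Berger check symbol for a string of '0'/'1' characters."""
    k = len(info_bits)
    width = k.bit_length()  # ceil(log2(k + 1)) check bits
    zeros = info_bits.count("0")
    return format(zeros, "0{}b".format(width))


def berger_encode(info_bits):
    """Append the check symbol to the information bits."""
    return info_bits + berger_check(info_bits)


def berger_is_valid(codeword, k):
    """Check a codeword whose first k bits are the information bits."""
    info, check = codeword[:k], codeword[k:]
    return berger_check(info) == check


if __name__ == "__main__":
    word = berger_encode("1011")
    print(word, berger_is_valid(word, 4))
```

Any error that flips bits in only one direction (all 1 to 0, or all 0 to 1) increases the zero count of the information part while never increasing the stored check value, or vice versa, so the two can no longer match. This is the property that makes the code suitable for detecting unidirectional errors.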
Moderators: G. De Micheli, EPFL, CH; R. Zafalon, STMicroelectronics, IT
-
Two-Phase Resonant Clocking for Ultra-Low-Power Hearing Aid Applications [p. 73]
-
F. Carbognani, F. Buergin, N. Felber, H. Kaeslin and W. Fichtner
Resonant clocking holds the promise of trading speed
for energy in CMOS circuits that can afford to operate at
low frequency, like hearing aids. An experimental chip with
110k transistors and more than 2500 latches has been designed,
fabricated and tested. The measured energy consumption
of the design at 0.8 V is 62 μW/MHz, about 7.5%
less than the conventional single-edge-triggered benchmark.
Closer analysis reveals that much of the energy
savings brought about by resonant clocking at low supply
voltages are lost when a CMOS circuit is operated
at higher voltages. This is because of the crossover currents
that persist for much of a clock period when a circuit
is driven from a sine-type clock waveform.
-
A Network-On-Chip with 3Gbps/Wire Serialized On-Chip Interconnect Using Adaptive Control Schemes [p. 79]
-
S.-J. Lee, K. Kim, H. Kim, N. Cho and H.-J. Yoo
An on-chip interconnect is implemented with 3Gbps/wire
bandwidth using an 8:1 serialization scheme. Such
high-speed serialization is achieved using a novel serialization
scheme, Wave-Front-Train. In order to apply this high-speed
link technique to Network-on-Chip channels, three adaptive
control schemes are used: supply voltage dependent reference
voltage control, a phase compensation scheme with a self-calibrating
function, and adaptive bandwidth control. The chip
is fabricated using 0.18µm CMOS technology.
-
A Single Photon Avalanche Diode Array Fabricated in Deep-Submicron CMOS Technology [p. 81]
-
C. Niclass, M. Sergio and E. Charbon
We report the first fully integrated single photon avalanche
diode array fabricated in 0.35 µm CMOS technology. At
25 μm, the pixel pitch achieved by this design is the smallest
ever reported. Thanks to the level of miniaturization
enabled by this design, we were able to build the largest
single photon streak camera ever built in any technology,
thus proving the scalability of the technology. Applications
requiring low noise, high dynamic range, and/or
picosecond timing accuracies are the prime candidates for
this technology. Examples include bio-imaging at cellular
and molecular level, fast optical imaging, single photon
telecommunications, 3D cameras, optical rangefinders,
LIDAR, and low light level imagers.
Organiser/Moderator: W. Mueller, Paderborn U, DE
-
MATLAB/Simulink for Automotive Systems Design [p. 87]
-
J. Friedman
Automotive systems are becoming increasingly
difficult and expensive to design successfully as the
market demands increasing complexity. Body electronics
are particularly affected by this trend, a good example
being power windows design. This seemingly mundane
area involves meeting market and legislative
requirements, which means creating a control system that
combines the input from several sensors and follows
complex behavioral rules [1].
Traditional design methodologies involve writing a
text specification and implementing algorithms in C.
However, algorithms cannot be verified without hardware.
This approach leaves the engineer in the unenviable
position of waiting for the last piece of hardware to arrive
to enable them to test their system.
To avoid these problems, engineers need to decouple
algorithm development and verification from the
availability of hardware. To address this need, OEMs and
suppliers around the world are switching to Model-Based
Design.
-
Model-Based Development of In-Vehicle Software [p. 89]
-
M. Conrad and H. Doerr
Mathematical modeling, which has long been established
in many engineering domains, is now also
gaining importance in the development of
embedded software. In the automotive sector [9],
modeling is used on the one hand for the conceptual
anticipation of the functionality to be realized
(open/closed loop control, monitoring) and, on the other,
for the simulation of the behavior of real physical systems
(plant, environment).
-
Model-Based Testing of Automotive Electronics [p. 91]
-
K. Lamberg
The increasing importance of electronics in the automotive
industry is illustrated by the growing proportion of
manufacturing costs taken up by electrical and electronic
systems - this has now reached approx. 30%. At the same time,
electrical and electronic systems are the main cause of vehicle
failures in the field, accounting for approx. 30% of these.
Manufacturers and also suppliers are well aware of the problems
caused by the increasing number of electronic control units
(ECUs). Thus, quality assurance is becoming increasingly
important, as problems in quality are a liability risk, with the
danger of image problems and the cost of recall campaigns and
rectification. The realization is that "good quality is expensive,
bad quality even more so".
Quality must not be left behind by the immense speed at which
new technologies and functions are being developed. Quality is
becoming a decisive factor in competition, and quality assurance
is becoming a key task and a core competence; and testing is a
key component of quality assurance.
To allow testing throughout the entire development process,
powerful and efficient means of developing and describing tests
are necessary. These also have to take into account the various
requirements of the test tasks and the different development
phases. This contribution gives an overview of modern test
development in various phases of development and of test
management throughout the overall process, using a model-based
test process.
-
Designing Signal Processing Systems for FPGAs [p. 92]
-
J. Heighton
DSP system designers have no shortage of great ideas
and are forever finding new, powerful, and creative
algorithms. This is what they are good at, and it is this
skill they generally use to secure their paycheck. So there
doesn't appear to be any problem with this situation: you
have ideas people being paid to come up with ideas, and
everyone is happy.
Most of the companies who pay system designers have
a product to complete, and it is getting this product to
market, and selling the product, that is the challenge. This
is not just a business problem, but also a technical issue as
designers are being driven to consider more efficient ways
of verifying and implementing their designs.
For a long time software tools have helped the designer
develop and simulate their designs, thus helping to
improve productivity. However as designs become more
complex, such as Wireless Broadband, Software Defined
Radio, and Video/Imaging systems, the task of converting
algorithms into code for hardware implementation
becomes exponentially more difficult.
-
From UML/SysML to Matlab/Simulink: Current State and Future Perspectives [p. 93]
-
Y. Vanderperren and W. Dehaene
Several recent EDA surveys [1-2] confirm that The
Mathworks Matlab/Simulink and the Unified Modelling
Language (UML) are both gaining increased attention as
Electronic System Level (ESL) languages. While Matlab
is commonly used to model signal processing intensive
systems, UML has the potential to support innovative ESL
methodologies which tie the architecture, design and
verification aspects in a unified perspective. Integrated
design flows which exploit the benefits of the
complementarity between UML and Matlab provide an
interesting answer to the issues of mono-disciplinary
modeling and the necessity of moving beyond point-tool
solutions [3]. This paper summarizes how UML and
Matlab/Simulink can be associated and examines the impact
of SysML, a new modeling language based on UML that
describes complex heterogeneous systems.
Moderators: F. Fummi, Verona U, IT; I. Harris, UC Irvine, US
-
An Efficient TLM/T Modeling and Simulation Environment Based on Conservative Parallel Discrete
Event Principles [p. 94]
-
E. Viaud, F. Pecheux and A. Greiner
The paper presents an innovative simulation scheme
to speed up simulations of multi-cluster multi-processor
SoCs at the TLM/T (Transaction Level Model with Time)
abstraction level. The hardware components of the SoC architecture
are written in standard SystemC. The goal is to
describe the dynamic behavior of a given software application
running on a given hardware architecture (including
the dynamic contention in the interconnect and the cache
effects), in order to provide the system designer with the
same reliable timing information as a cycle accurate simulation,
with a simulation speed similar to a TLM simulation.
The key idea is to apply Parallel Discrete Event Simulation
(PDES) techniques to a collection of communicating
SystemC SC_THREAD processes. Experimental results show a simulation
speedup of a factor up to 50 versus a BCA simulation
(Bus Cycle Accurate), for a timing error lower than 10^-3.
-
Exploiting TLM and Object Introspection for System-Level Simulation [p. 100]
-
G. Beltrame, D. Sciuto, C. Silvano, D. Lyonnard and C. Pilkington
The introduction of Transaction Level Modeling (TLM)
allows a system designer to model a complete application,
composed of hardware and software parts, at several levels
of abstraction. The simulation speed of TLM is orders
of magnitude faster than traditional RTL simulation; nevertheless,
it can become a limiting factor when considering a
Multi-Processor System-On-Chip (MP-SoC), as the analysis
of these systems can be very complex. The main goal of
this paper is to introduce a novel way of exploiting TLM features
to increase simulation efficiency of complex systems by
switching TLM models at runtime. Results show that simulation
performance can be increased significantly without
sacrificing the accuracy of critical application kernels.
-
Efficient Assertion Based Verification Using TLM [p. 106]
-
A. Habibi, S. Tahar, A. Samarah, D. Li and O. A. Mohamed
Recent advances in hardware design have encouraged the use of a
transaction-based model as a new intermediate design level. Supporters of
the Transaction Level Modeling (TLM) trend claim its efficiency in
terms of rapid prototyping and fast simulation in comparison to the
classical RTL-based approach. Intuitively, from a verification point
of view, faster simulation induces better coverage results. This is
driven by two factors: coverage measurement and simulation guidance.
In this paper, we propose to use an abstract model of the design,
written in the Abstract State Machines Language (AsmL), in
order to provide an adequate way for measuring the functional coverage.
Then, we use this metric in defining the fitness function of a
genetic algorithm proposed to improve the simulation efficiency. Finally,
we compare our coverage and simulation results to: (1) random
simulation at TLM; and (2) the Specman tool of Verisity at RTL.
-
Constructing Portable Compiled Instruction-Set Simulators -- An ADL-Driven Approach [p. 112]
-
J. D'Errico and W. Qin
Instruction set simulators are common tools used
for the development of new architectures and embedded
software among countless other functions. This paper
presents a framework that quickly generates fast
and flexible instruction-set simulators from a specification
based on a C-like architecture-description language.
The framework provides a consistent platform for
constructing and evaluating different classes of simulators,
including interpreters, static-compiled simulators,
and dynamic-compiled simulators. The framework also features
a new construction method for dynamic-compiled
simulators that involves no low-level programming. It profiles
and translates frequently executed regions of simulated
binary to C++ code and invokes GCC to compile
such code into dynamically loaded libraries, which
are then loaded into the simulator at run time to accelerate
simulation. Our experimental results based on
the MIPS architecture and the SPEC CPU2000 benchmarks
show that our dynamic-compiled simulator is capable
of achieving up to 11 times speedup compared to our
fast interpreter. Compared to other dynamic-compiled simulators
requiring significant system programming expertise
to construct, the proposed approach is simpler to implement
and more portable.
Moderators: M. Coppola, STMicroelectronics, FR; B. Candaele, Thales Communications, FR
-
A Methodology for Mapping Multiple Use-Cases onto Networks on Chips [p. 118]
-
S. Murali, M. Coenen, A. Radulescu, K. Goossens, and G. De Micheli
A communication-centric design approach, Networks on
Chips (NoCs), has emerged as the design paradigm for
designing a scalable communication infrastructure for future
Systems on Chips (SoCs). As technology advances, the
number of applications or use-cases integrated on a single
chip increases rapidly. The different use-cases of the SoC
have different communication requirements (such as different
bandwidth, latency constraints) and traffic patterns. The
underlying NoC architecture has to satisfy the constraints of
all the use-cases. In this work, we present a methodology
to map multiple use-cases onto the NoC architecture, satisfying
the constraints of each use-case. We present dynamic
re-configuration mechanisms that match the NoC configuration
to the communication characteristics of each use-case,
also accounting for use-cases that can run in parallel. The
methodology is applied to several real and synthetic SoC
benchmarks, which result in a large reduction in NoC area
(an average of 80%) and power consumption (an average
of 54%) compared to traditional design approaches.
Keywords: Networks on Chips, Systems on Chips, Use-Cases,
Modes, Dynamic Re-Configuration.
-
Contrasting a NoC and a Traditional Interconnect Fabric with Layout Awareness [p. 124]
-
F. Angiolini, P. Meloni, S. Carta, L. Benini and L. Raffo
Increasing miniaturization is posing multiple challenges to electronic
designers. In the context of Multi-Processor System-on-Chips (MPSoCs),
we focus on the problem of implementing efficient
interconnect systems for devices which are ever more densely
packed with parallel computing cores. Since traditional
buses clearly cannot provide enough bandwidth, a revolutionary path
to scalability is provided by packet-switched Network-on-Chips
(NoCs), while a more conservative approach dictates the addition
of bandwidth-rich components (e.g. crossbars) within the preexisting
fabrics. While both alternatives have already been explored,
a thorough contrastive analysis is still missing. In this paper,
we bring crossbar and NoC designs to the chip layout level in
order to highlight the respective strengths and weaknesses in terms
of performance, area and power, keeping an eye on future scalability.
-
A Low Complexity Heuristic for Design of Custom Network-on-Chip Architectures [p. 130]
-
K. Srinivasan and K. S. Chatha
Network-on-Chip (NoC) has been proposed to replace traditional
bus based architectures to address the global communication challenges
in nanoscale technologies. In future SoC architectures, minimizing
power consumption will continue to be an important design
goal. In this paper, we present a novel heuristic technique
consisting of system-level physical design, and interconnection network
generation that generates custom low power NoC architectures
for application specific SoC. We demonstrate the quality of
the solutions produced by our technique by experimentation with
many benchmarks. Our technique has a low computational complexity,
and consumes only 1.25 times the power consumption, and
0.85 times the number of router resources compared to an optimal
MILP based technique [1] whose computational complexity is not
bounded.
-
A Dynamically Reconfigurable Packet-Switched Network-on-Chip [p. 136]
-
T. Pionteck, C. Albrecht and R. Koch
This paper presents the design of an adaptable NoC for
FPGA based dynamically reconfigurable SoCs. At runtime,
switches can be added or removed from the network, allowing
the NoC to adapt to the number, size and location of
currently configured hardware modules. By using dynamic
routing tables, reconfiguration can be done without stopping
or stalling the NoC. The proposed architecture avoids
the limitations of bus-based interconnection schemes which
are often applied in partially dynamically reconfigurable
FPGA designs.
Moderators: T. Ifstroem, Robert Bosch GmbH, DE; E. Martens, KU Leuven, BE
-
Arbitrary Design of High Order Noise Transfer Function for a Novel Class of Reduced-Sample-Rate
Delta-Sigma-Pipeline ADCs [p. 138]
-
V. Majidzadeh and O. Shoaei
A novel noise transfer function (NTF) for high order
reduced-sample-rate sigma-delta-pipeline (SDP) ADCs
is presented. The proposed NTF determines the location
of the non-zero poles improving the stabilization of the
loop and implementing the reduced-sample-rate
structure, concurrently. A design methodology based on
a simulated annealing algorithm is developed to design
the optimum NTF. To verify the usefulness of the
proposed NTF and design procedure, two different
modulators are presented. Simulation results show that
with a 4th order modulator designed using the
proposed approach, maximum SNDRs of 115 dB and
124.1 dB can be achieved with OSRs of only 8 and 16,
respectively.
-
Systematic and Optimal Design of CMOS Two-Stage Opamps with Hybrid Cascode Compensation [p. 144]
-
M. Yavari, O. Shoaei and A. Rodriguez-Vazquez
This paper presents a systematic and optimal design of
the hybrid cascode compensation method, which is used in
fully differential two-stage CMOS operational
transconductance amplifiers (OTAs). The closed loop
analysis results are given to obtain a design procedure. A
simple design procedure for the minimum settling time of
the hybrid cascode compensation technique for a two-stage
class A/AB amplifier is proposed. Optimal design
issues of power dissipation are considered to achieve the
lowest power consumption for the required settling time.
Finally, a design example is presented to show both the
usefulness of the hybrid cascode compensation and the
proposed design procedure. The proposed design
technique can assist circuit designers and can also be
used in computer-aided circuit design tools.
-
Systematic Stability-Analysis Method for Analog Circuits [p. 150]
-
G. Vandersteen, S. Bronckers, P. Dobrovolny and Y. Rolain
Analyzing the stability of an analog circuit is an important
part of the circuit design. Several commercial simulators
are equipped with special stability analysis techniques.
Problems arise when the design kit does not support such a
simulator. Another issue arises when the designer wants to gain
insight into the sources of the instability in order to propose a stabilization.
This can be done through analyzing the open-loop
or the closed-loop transfer function of the circuit.
The aim of this paper is to propose an automated analysis
method which identifies the nodes to be considered for
stabilization. The method does not need to break feedback
loops or to manipulate netlists. It only uses AC simulations
and does not require the full modified nodal equations. The
method is illustrated on 3 design examples: a Voltage Controlled
Oscillator (VCO), a reference bias circuit and the
common-mode feedback network in a gm-C filter.
-
ALAMO: An Improved Sigma-Space Based Methodology for Modeling Process Parameter Variations
in Analog Circuits [p. 156]
-
H. Zhang, Y. Zhao and A. Doboli
This paper describes an original methodology for accurately
modeling MOSFET process parameter variations. As
compared to other process parameter variation modeling
methods, the proposed methodology is capable of correctly
modeling not only differences of process/model parameters,
but also the process parameter variations for individual devices.
This capability is very important for popular analog
circuits like current biasing circuits, voltage reference circuits,
and single-ended output amplifiers.
-
A Synthesis Tool for Power-Efficient Base-Band Filter Design [p. 162]
-
V. Giannini, P. Nuzzo, F. De Bernardinis, J. Craninckx, B. Come, S. D'Amico and A. Baschirotto
A baseband filter synthesizer that takes a
behavioural description of the design and produces an
efficient transistor level implementation is presented. The
tool optimizes the filter at the cascade level, providing the
best trade-off between power consumption and dynamic
range, and at the cell level, selecting minimum power
solutions, through accurate analytical models and an
efficient bi-quad topology. Unlike past cascade
design techniques based on dynamic range optimization
through linear programming [2], we focus on power
minimization while guaranteeing minimum performance
levels, given the increasing importance of power savings
in hand-held devices. A synthesized filter has been realized
in silicon demonstrating the effectiveness of our approach.
Moderators: S. Piestrak, Metz U, FR; D. Gizopoulos, Piraeus U, GR
-
An Efficient Static Algorithm for Computing the Soft Error Rates of Combinational Circuits [p. 164]
-
R. Rao, K. Chopra, D. Blaauw and D. Sylvester
Soft errors have emerged as an important reliability challenge for
nanoscale VLSI designs. In this paper, we present a fast and efficient
soft error rate (SER) computation algorithm for combinational circuits.
We first present a novel parametric waveform model based on
the Weibull function to represent particle strikes at individual nodes
in the circuit. We then describe the construction of the SET descriptor
that efficiently captures the correlation between the transient
waveforms and their associated rate distribution functions. The proposed
algorithm consists of operations to inject, propagate and
merge SET descriptors while traversing forward along the gates in a
circuit. The parameterized waveforms enable an efficient static
approach to calculate the SER of a circuit. We exercise the proposed
approach on a wide variety of combinational circuits and observe
that our algorithm has linear runtime with the size of the circuit. The
runtimes for soft error estimation were observed to be in the order of
about one second, compared to several minutes or even hours for
previously proposed methods.
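The idea of a parametric, Weibull-shaped transient can be illustrated with a minimal sketch. It assumes the textbook Weibull density as the pulse shape; the function names, the `amplitude` scaling and the restriction to shape values above 1 are assumptions made for illustration and are not taken from the paper.

```python
import math


def weibull_pulse(t, scale, shape, amplitude=1.0):
    """Weibull-density-shaped pulse evaluated at time t (zero for t < 0).

    Hypothetical parameterization: `scale` stretches the pulse in time,
    `shape` controls its skew, `amplitude` scales its height.
    """
    if shape <= 1.0:
        raise ValueError("this sketch assumes shape > 1 (pulse starts at zero)")
    if t < 0:
        return 0.0
    x = t / scale
    return amplitude * (shape / scale) * x ** (shape - 1) * math.exp(-(x ** shape))


def peak_time(scale, shape):
    """Time at which the pulse peaks (mode of the Weibull density, shape > 1)."""
    return scale * ((shape - 1.0) / shape) ** (1.0 / shape)
```

A whole transient is then described by a few numbers (here `scale`, `shape`, `amplitude`) rather than by a sampled waveform. A compact representation of this kind is what would let a static analysis pass waveforms from gate to gate.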
-
Low-Cost and Highly Reliable Detector for Transient and Crosstalk Faults Affecting FPGA Interconnects [p. 170]
-
M. Omaña, J. M. Cazeaux, D. Rossi and C. Metra
In this paper we present a novel circuit for the on-line
detection of transient and crosstalk faults affecting the
interconnects of systems implemented using Field
Programmable Gate-Arrays (FPGAs). The proposed
detector features self-checking ability with respect to
faults possibly affecting itself, thus being suitable for
systems with high reliability requirements, like those for
space applications. Compared to alternative solutions, the
proposed circuit requires a significantly lower area
overhead, while implying a comparable, or lower,
impact on system performance. We have verified our
circuit operation and self-checking ability by means of
post-layout simulations.
-
Evaluating Coverage of Error Detection Logic for Soft Errors Using Formal Methods [p. 176]
-
U. Krautz, M. Pflanz, C. Jacobi, H. W. Tast, K. Weber and H. T. Vierhaus
In this paper we describe a methodology to measure
exactly the quality of fault-tolerant designs by combining fault-injection
in high level design (HLD) descriptions with a formal
verification approach. We utilize BDD based symbolic simulation
to determine the coverage of online error-detection and
-correction logic. We describe an easily portable approach, which
can be applied to a wide variety of multi-GHz industrial designs.
Index Terms—Formal Verification, Soft Error Injection,
Error Detection and Correction, Fault/Error Coverage
-
Soft-Error Classification and Impact Analysis on Real-Time Operating Systems [p. 182]
-
N. Ignat, B. Nicolescu, Y. Savaria and G. Nicolescu
This paper investigates the sensitivity of real-time
systems running applications under operating systems
that are subject to soft-errors. We consider
applications using different real-time operating system
services: scheduling, time and memory management,
intertask communication and synchronization. We
report results of a detailed analysis regarding the
impact of soft-errors on real-time operating system
cores, taking into account the application timing
constraints. Our results show the extent to which soft-errors
occurring in a real-time operating system's
kernel impact its reliability.
Moderators: R. Lauwereins, IMEC, BE; J. Becker, Karlsruhe U, DE
-
40Gbps De-Layered Silicon Protocol Engine for TCP Record [p. 188]
-
H. Shrikumar
We present a de-layered protocol engine for termination of
40Gbps TCP connections using a reconfigurable FPGA silicon
platform. This protocol engine is designed for a planned
attempt at the Internet Speed Record. In laboratory demonstrations
at 40Gbps, this core beat the previous record of
7.2Gbps by a factor of five. We present an aggressive cross-layer
optimization methodology and corresponding design-flow and
tools used to implement this record-breaking TCP
Protocol Engine.
The 40Gbps TCP Offload Engine has been implemented on
a Xilinx FPGA platform, based on a VirtexII-pro 2VP7 device.
Each FPGA device terminates a 10Gbps OC-768 channel,
and the aggregate capacity of the four FPGA devices is
40Gbps. The four 10Gbps channels are intended to be connected
to four trunked 10GbE ethernet ports on a router. The
40Gbps TCP implementation has been demonstrated in the
lab in system level as well as gate-level simulations, and live
implementations have been tested with each 10Gbps channel
FPGA board connected back-to-back in transmission tests at
full wire-speed. We believe this to be the fastest TCP protocol
engine to have been implemented so far.
-
A Reconfigurable HW/SW Platform for Computation Intensive High-Resolution Real-Time Digital
Film Applications [p. 194]
-
A. do Carmo Lucas, S. Heithecker, P. Rueffer, R. Ernst, H. Rueckert, G. Wischermann, K. Gebel,
R. Fach, W. Huther, S. Eichner and G. Scheller
This paper presents a multi-board, multi-FPGA hardware/software
architecture for computation intensive, high
resolution (2048x2048 pixels), real-time (24 frames per second)
digital film processing. It is based on Xilinx Virtex-II Pro
FPGAs, large SDRAM memories for multiple frame
storage and a PCI express communication network. The architecture
reaches record performance running a complex
noise reduction algorithm including a 2.5 dimensions DWT
and a full 16x16 motion estimation at 24 fps requiring a
total of 203 Gops/s net computing performance and a total
of 28 Gbit/s DDR-SDRAM frame memory bandwidth.
To increase design productivity and yet achieve high clock
rates (125MHz), the architecture combines macro component
configuration and macro level floorplanning with weak
programmability using distributed microcoding. As an example,
the core of the bidirectional motion estimation using
2772 CLBs reaching 155 Gop/s (1538 op/pixel) requiring
7 Gbit/s external memory bandwidth was developed in two
man-months.
Keywords: motion-estimation, weak-programming,
stream-based architecture, digital film, reconfigurable,
FPGA
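The quoted motion-estimation figures can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the abstract (2048x2048 pixels, 24 frames per second, 1538 op/pixel); the function names are hypothetical.

```python
def pixel_rate(width, height, fps):
    """Pixels processed per second for a given frame size and frame rate."""
    return width * height * fps


def required_gops(width, height, fps, ops_per_pixel):
    """Operations per second, in Gop/s, implied by a per-pixel operation count."""
    return pixel_rate(width, height, fps) * ops_per_pixel / 1e9


# Figures quoted in the abstract.
rate = pixel_rate(2048, 2048, 24)           # about 100.7 Mpixel/s
gops = required_gops(2048, 2048, 24, 1538)  # about 154.8 Gop/s
```

The result, roughly 154.8 Gop/s, agrees with the stated 155 Gop/s for the bidirectional motion estimation core.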
-
Disclosing the LDPC Code Decoder Design Space [p. 200]
-
T. Brack, F. Kienle and N. Wehn
The design of future communication systems with high
throughput demands will become a critical task, especially
when sophisticated channel coding schemes have to be applied.
LDPC codes are one of the most promising candidates
because of their outstanding communications performance.
One major problem for a decoder hardware realization
is the huge design space composed of many interrelated
parameters which enforces drastic design trade-offs.
Another important issue is the need for flexibility of such
systems.
In this paper we illuminate this design space with special
emphasis on the strong interrelations of these parameters.
Three design studies are presented to highlight the effects
on a generic architecture if some parameters are constrained
by a given standard, given technology, and given area constraints.
Moderators: E. M. Panainte, TU Delft, NL; S. Vassiliadis, TU Delft, NL
-
Automating Processor Customisation: Optimised Memory Access and Resource Sharing [p. 206]
-
R. Dimond, O. Mencer and W. Luk
We propose a novel methodology to generate Application
Specific Instruction Processors (ASIPs) including
custom instructions. Our implementation balances performance
and area requirements by making custom instructions
reusable across similar pieces of code. In addition to
arithmetic and logic operations, table look-ups within custom
instructions reduce costly accesses to global memory.
We present synthesis and cycle-accurate simulation results
for six embedded benchmarks running on customised processors.
Reusable custom instructions achieve an average
319% speedup with only 5% additional area. The maximum
speedup of 501% for the Advanced Encryption Standard
(AES) requires only 3.6% additional area.
-
Automatic Identification of Application-Specific Functional Units with Architecturally Visible Storage [p. 212]
-
P. Biswas, N. Dutt, P. Ienne and L. Pozzi
Instruction Set Extensions (ISEs) can be used effectively
to accelerate the performance of embedded processors. The
critical, and difficult task of ISE selection is often performed
manually by designers. A few automatic methods for ISE
generation have shown good capabilities, but are still limited
in the handling of memory accesses, and so they fail
to directly address the memory wall problem. We present
here the first ISE identification technique that can automatically
identify state-holding Application-specific Functional
Units (AFUs) comprehensively, thus being able to
eliminate a large portion of memory traffic from cache and
main memory. Our cycle-accurate results obtained by the
SimpleScalar simulator show that the identified AFUs with
architecturally visible storage gain significantly more than
previous techniques, and achieve an average speedup of
2.8x over pure software execution. Moreover, the number
of required memory-access instructions is reduced by two
thirds on average, suggesting corresponding benefits on energy
consumption.
-
Combining Algorithm Exploration with Instruction Set Design: A Case Study in Elliptic Curve
Cryptography [p. 218]
-
J. Groszschaedl, P. Ienne, L. Pozzi, S. Tillich and A. K. Verma
In recent years, processor customization has matured to
become a trusted way of achieving high performance with
limited cost/energy in embedded applications. In particular,
Instruction Set Extensions (ISEs) have been proven very
effective in many cases. A large body of work exists today on
creating tools that can select efficient ISEs given an application
source code: ISE automation is crucial for increasing
the productivity of design teams. In this paper we show that
an additional motivation for automating the ISE process is
to facilitate algorithm exploration: the availability of ISE
can have a dramatic impact on the performance of different
algorithmic choices to implement identical or equivalent
functionality. System designers need fast feedback on the
ISE-ability of various algorithmic flavors. We use a case
study in elliptic curve (EC) cryptography to exemplify the
following contributions: (1) ISE can reverse the relative
performance of different algorithms for one and the same
operation, and (2) automatic ISE, even without predicting
speed-ups as precisely as detailed simulation can, is able to
show exactly the trends that the designer should follow.
-
Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors [p. 224]
-
A. Zmily and C. Kozyrakis
Code size and energy consumption are critical design
concerns for embedded processors as they determine the
cost of the overall system. Techniques such as reduced
length instruction sets lead to significant code size savings
but also introduce performance and energy consumption
impediments such as additional dynamic instructions or decompression
latency. In this paper, we show that a block-aware
instruction set (BLISS) which stores basic block descriptors
in addition to and separately from the actual instructions
in the program allows embedded processors to
achieve significant improvements in all three metrics: reduced
code size, improved performance, and lower energy
consumption.
Moderators: E. Villar, Cantabria U, ES; T. Schattkowsky, Paderborn U, DE
-
Quantitative Analysis of Transaction Level Models for the AMBA Bus [p. 230]
-
G. Schirner and R. Doemer
The increasing complexity of embedded systems pushes
system designers to higher levels of abstraction. Transaction
Level Modeling (TLM) has been proposed to model
communication in systems in an abstract manner. Although
widely accepted, TLMs have not been analyzed for
their loss in accuracy.
This paper will analyze and quantify the speed-accuracy
tradeoff of TLM using a case study on AMBA, an industry
bus standard. It shows the results of modeling the Advanced
High-performance Bus (AHB) of AMBA using a set
of models at different abstraction levels. The analysis of
the simulation speed shows improvements of two orders of
magnitude for each TLM abstraction, while the timing in
the model remains accurate for many applications.
As a result, the paper will classify the different models
towards their applicability in typical modeling situations,
allowing the system designer to achieve fast and accurate
simulation of communication.
-
Combining Simulation and Formal Methods for System-Level Performance Analysis [p. 236]
-
S. Kuenzli, F. Poletti, L. Benini and L. Thiele
Recent research on performance analysis for embedded
systems shows a trend to formal compositional models and
methods. These compositional methods can be used to determine
the performance of embedded systems by composing
formal analytical models of the individual components.
In case there exist no formal component models with the
required precision, simulation-based approaches are used
for system-level performance analysis. The often high runtimes
of simulation runs lead to the new approach described
in this paper: Analytical methods are combined with
simulation-based approaches to speed up simulation. We
describe how the simulation models can be coupled with the
formal analysis framework, specify the interfaces needed
for such a combination and show the applicability of the
approach using a case study.
-
Formal Performance Analysis and Simulation of UML/SysML Models for ESL Design [p. 242]
-
A. Viehl, T. Schoenwald, O. Bringmann and W. Rosenstiel
UML2 and SysML try to adopt techniques known from software
development to systems engineering. However, the focus has
been put on modeling aspects until now and quantitative performance
analysis is not adequately taken into account in early design
stages of the system. In this paper, we present our approach
for formal and simulation based performance analysis of systems
specified with UML2/SysML. The basis of our analysis approach is
the detection of communications that synchronize the control flow
of the corresponding instances of the system, making the relationship
explicit. Using this knowledge, we are able to determine a
global timing behavior and violations of preset constraints on it.
Hence, it is also possible to detect potential conflicts on
shared communication resources if a specification of the target architecture
is given. With this information it is possible to evaluate
system models at an early design stage.
-
Performance Evaluation for System-on-Chip Architectures Using Trace-Based Transaction Level
Simulation [p. 248]
-
T. Wild, A. Herkersdorf and R. Ohlendorf
The ever increasing complexity and heterogeneity of
modern System-on-Chip (SoC) architectures make an early
and systematic exploration of alternative solutions
mandatory. Efficient performance evaluation methods are
of highest importance for a broad search in the solution
space. In this paper we present an approach that captures
the SoC functionality for each architecture resource as
sequences of trace primitives. These primitives are
translated at simulation runtime into transactions and
superposed on the system architecture. The method uses
SystemC as modeling language, requires low modeling
effort and yet provides accurate results within reasonable
turnaround times. A concluding application example
demonstrates the effectiveness of our approach.
Organiser: R. Rutenbar, Carnegie Mellon U, US
Moderator: J. Cohn, IBM Microelectronics, US
-
Is "Network" the Next "Big Idea" in Design? [p. 254]
-
R. Marculescu, J. Rabaey and A. Sangiovanni-Vincentelli
As the complexity of today's systems continues to grow,
we are moving away from creating individual components
from scratch, toward methodologies that emphasize composition
of re-usable components via the network paradigm.
Complex component interactions can create a range
of amazing behaviors, some useful, some unwanted, some
even dangerous. To manage them, a “science” for network
design is evolving, applicable in some surprising
areas. In this paper, we consider a few application
domains and discuss the design challenges involved from a
methodology standpoint. From large-scale hardware/software
systems, to dynamically adaptive sensor networks,
and network-on-chip architectures, these ideas find wide
application.
Moderators: G. Vandersteen, IMEC, BE; L. Hedrich, Frankfurt U, DE
-
Verifying Analog Oscillator Circuits Using Forward/Backward Abstraction Refinement [p. 257]
-
G. Frehse, B. H. Krogh and R. A. Rutenbar
Properties of analog circuits can be verified formally by
partitioning the continuous state space and applying hybrid
system verification techniques to the resulting abstraction.
To verify properties of oscillator circuits, cyclic invariants
need to be computed. Methods based on forward reachability
have proven to be inefficient and in some cases inadequate
in constructing these invariant sets. In this paper
we propose a novel approach combining forward- and
backward-reachability while iteratively refining partitions
at each step. The technique can yield dramatic memory and
runtime reductions. We illustrate the effectiveness by verifying,
for the first time, the limit cycle oscillation behavior of
a third-order model of a differential VCO circuit.
-
Efficient AC Analysis of Oscillators Using Least-Squares Methods [p. 263]
-
T. Mei and J. Roychowdhury
We present a generalization of standard AC analysis to oscillators
by exploiting least-squares solution techniques. This provides an
attractive alternative to the current practice of employing transient simulation
for small signal analysis of oscillators. Unlike phase condition based
oscillator analysis techniques, which suffer from numerical artifacts, the
least-squares approach of this paper results in a robust and efficient oscillator
AC technique. We validate our method on LC and ring oscillators,
obtaining speedups of 1-3 orders of magnitude over transient simulation,
and 4-6x over phase-condition-based techniques.
-
Double-Strength CAFFEINE: Fast Template-Free Symbolic Modeling of Analog Circuits
via Implicit Canonical Form Functions and Explicit Introns [p. 269]
-
T. McConaghy and G. Gielen
CAFFEINE, introduced previously, automatically
generates nonlinear, template-free symbolic performance
models of analog circuits from SPICE data. Its key was a
directly-interpretable functional form, found via
evolutionary search. In application to automated sizing
of analog circuits, CAFFEINE was shown to have the best
predictive ability from among 10 regression techniques,
but was too slow to be used practically in the optimization
loop. In this paper, we describe Double-Strength
CAFFEINE, which is designed to be fast enough for
automated sizing, yet retain good predictive abilities. We
design "smooth, uniform" search operators which have
been shown to greatly improve efficiency in other
domains. Such operators are not straightforward to
design; we achieve them in functions by simultaneously
making the grammar-constrained functional form
implicit, and embedding explicit "introns" (subfunctions
appearing in the candidate that are not expressed).
Experimental results on six test problems show that
Double-Strength CAFFEINE achieves an average
speedup of 5x on the most challenging problems and 3x
overall; thus making the technique fast enough for
automated sizing.
-
Top-Down Heterogeneous Synthesis of Analog and Mixed-Signal Systems [p. 275]
-
E. Martens and G. Gielen
A new approach for automated synthesis of analog and
mixed-signal systems is presented. The heterogeneous genetic
optimization strategy starts from a functional description
and evolves a simple design solution in a strict top-down
design process to a complex one that fulfills multiple
objectives. Transformations of both architecture and parameters
are applied. The expected improvement of the violated
objectives is used as driver for the transformation
selection. The topology is really created, giving the opportunity
to explore new architectures.
-
Nonlinear Model Order Reduction Using Remainder Functions [p. 281]
-
J. A. Martinez, S. P. Levitan and D. M. Chiarulli
This paper describes a novel approach to the problem of
model order reduction (MOR) of very large nonlinear
systems. We consider the behavior of a dynamic
nonlinear system as having two fundamental
characteristics: a global behavioral "envelope" that
describes major transformations to the state of the system
under external stimuli and a local behavior that describes
small perturbation responses. The nonlinear low order
envelope function is generated by using the remainders
from the coalescence of projection bases taken through a
state-space sample. A behavioral model can then be
expressed as the superposition of these two descriptions,
operating according to the input stimuli and the current
state value.
The global behavior describes major transformations to
the state of the system under external stimuli and the local
behavior describes small perturbation responses. Local
effects are captured by regions through a set of linear
projections to a reduced state-space while global effects
are captured by examining the non-commonality among
these projections. These "remainders" are used to build a
modulation function that will generate the required
dynamic changes in the common linear projection.
The advantage of the envelope representation for
strongly nonlinear systems is that it simplifies the
complexity of the model into a two-part problem.
Depending on the complexity or cost of the behavioral
separation procedure, it can be repeated recursively.
-
Efficient Temperature-Dependent Symbolic Sensitivity Analysis and Symbolic Performance Evaluation
in Analog Circuit Synthesis [p. 283]
-
H. Yang and R. Vemuri
We present a new methodology for fast analog circuit
synthesis, based on the use of temperature-dependent symbolic
sensitivity analysis and symbolic performance evaluation
in the synthesis loop. Fast sensitivity analysis and
performance estimation are achieved using element-coefficient diagrams
(ECDs). Sensitivity and performance evaluation expressions
are generated from ECDs at the same time, which
reduces overall runtime greatly.
demonstrate that the speed and convergence of analog synthesis
are improved significantly.
Moderators: R. Dorsch, IBM Deutschland Entwicklung GmbH, DE; E. Larsson, Linkoping U, SE
-
Hierarchy-Aware and Area-Efficient Test Infrastructure Design for Core-Based System Chips [p. 285]
-
A. Sehgal, S. K. Goel, E. J. Marinissen and K. Chakrabarty
Multiple levels of design hierarchy are common in current-generation
system-on-chip (SOC) integrated circuits. However,
most prior work on test access mechanism (TAM) optimization
and test scheduling is based on a flattened design hierarchy. We
investigate hierarchy-aware test infrastructure design, wherein
wrapper/TAM optimization and test scheduling are carried out
for hierarchical SOCs for two practical design scenarios. In the
first scenario, the wrapper and TAM implementation for the embedded
child cores in hierarchical (parent) cores are delivered
in a hard form by the core provider. In the second scenario, the
wrapper and TAM architecture of the child cores embedded in
the parent cores are implemented by the system integrator. Experimental
results are presented for the ITC'02 SOC test benchmarks.
-
Power Constrained and Defect-Probability Driven SoC Test Scheduling with Test Set Partitioning [p. 291]
-
Z. He, Z. Peng and P. Eles
This paper presents a test scheduling approach for system-on-chip
production tests with peak-power constraints. An abort-on-first-fail
test approach is assumed, whereby the test is terminated
as soon as the first fault is detected. Defect probabilities of
individual cores are used to guide the test scheduling and the
peak-power constraint is considered in order to limit the test
concurrency. Test set partitioning is used to divide a test set into
several test sequences so that they can be tightly packed into the
two-dimensional space of power and time. The partitioning of test
sets is integrated into the test scheduling process. A heuristic has
been developed to find an efficient test schedule which leads to
reduced expected test time. Experimental results have shown the
efficiency of the proposed test scheduling approach.
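Why defect probabilities matter under abort-on-first-fail can be seen in a simplified, purely sequential setting. The sketch below makes several assumptions: there is no test concurrency, power limit or test set partitioning, each core's test detects its defect with certainty, and the test aborts only at the end of a failing core test. It is not the heuristic from the paper, and the function names are hypothetical.

```python
from math import inf


def expected_test_time(tests):
    """Expected total time for tests run in the given order.

    `tests` is a list of (test_time, defect_probability) pairs. A test is
    only executed if every earlier test passed.
    """
    expected = 0.0
    reach_prob = 1.0  # probability that all earlier tests passed
    for test_time, defect_prob in tests:
        expected += reach_prob * test_time
        reach_prob *= 1.0 - defect_prob
    return expected


def order_by_time_per_defect(tests):
    """Sort tests by time / defect probability (ascending).

    For this sequential model, an adjacent-exchange argument shows that
    this order minimizes the expected test time.
    """
    return sorted(tests, key=lambda td: td[0] / td[1] if td[1] > 0 else inf)
```

For example, with tests `[(10.0, 0.1), (2.0, 0.5)]` the given order has an expected time of 11.8, while the reordered schedule reaches 7.0, because the short test that is likely to fail runs first.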
-
Power-Constrained Test Scheduling for Multi-Clock Domain SoCs [p. 297]
-
T. Yoneda, K. Masuda and H. Fujiwara
This paper presents a wrapper and test access mechanism
design for multi-clock domain SoCs that consist of
cores with different clock frequencies during test. We also
propose a test scheduling algorithm for multi-clock domain
SoCs to minimize test time under power constraint. In the
proposed method, we use virtual TAM to solve the frequency
gaps between cores and the ATE, and also to reduce power
consumption of a core during test while maintaining the test
time of the core. Experimental results show the effectiveness
of our method not only for multi-clock domain SoCs, but
also for single-clock domain SoCs with power constraints.
keywords: multi-clock domain SoC, test scheduling, test
access mechanism, power consumption
-
Reuse-Based Test Access and Integrated Test Scheduling for Network-on-Chip Systems [p. 303]
-
C. Liu, Z. Link and D. K. Pradhan
In this paper, we propose a new method for test access and
test scheduling in NoC-based systems. It relies on a progressive
reuse of the network resources for transporting test data
to routers. We present possible solutions to the implementation
of this scheme. We also show how the router testing
can be scheduled concurrently with core testing to reduce
test application time. Experimental results for the ITC'02
SoC benchmarks show that the proposed method can lead to
substantial reduction in test application time compared to
previous work based on the use of serial boundary scan. The
method can also help to reduce hardware overhead.
-
A Design for Failure Analysis (DFFA) Technique to Ensure Incorruptible Signatures [p. 309]
-
S. Kundu
Fast failure analysis is a key enabler in shortening the
time between design tape out and product introduction in
the market. With faster detection of manufacturability
issues, problems associated with parametric variations,
model approximations or physical design rules can be
fixed faster either at the process control level or at the
mask level. Failure analysis can be accelerated with
additional hardware support for design-for-testability
(DFT) and design-for-failure-analysis (DFFA). In this
paper, we will focus on one such DFFA technique
deployed in the industry, identify its shortcomings and
offer improvements to fix deficiencies.
Moderators: G. Stromberg, Infineon Technologies, DE; C. Paulus, Siemens, DE
-
Test Generation for Combinational Quantum Cellular Automata (QCA) Circuits [p. 311]
-
P. Gupta, N. K. Jha and L. Lingappan
In this paper, we present a test generation framework
for testing of quantum cellular automata (QCA) circuits. QCA is a
nanotechnology that has attracted significant recent attention and shows
immense promise as a viable future technology. This work is motivated
by the fact that the stuck-at fault test set of a circuit is not guaranteed to
detect all defects that can occur in its QCA implementation. We show how
to generate additional test vectors to supplement the stuck-at fault test
set to guarantee that all simulated defects in the QCA gates get detected.
Since nanotechnologies will be dominated by interconnects, we also target
bridging faults on QCA interconnects. The efficacy of our framework is
established through its application to QCA implementations of MCNC
benchmarks that use majority gates as primitives.
-
Analysis and Synthesis of Quantum Circuits by Using Quantum Decision Diagrams [p. 317]
-
A. Abdollahi and M. Pedram
Quantum information processing technology is in its pioneering
stage and no proficient method for synthesizing quantum circuits
has been introduced so far. This paper introduces an effective
analysis and synthesis framework for quantum logic circuits. The
proposed synthesis algorithm and flow can generate a quantum
circuit using the most basic quantum operators, i.e., the rotation
and controlled-rotation primitives. The paper introduces the
notion of quantum factored forms and presents a canonical and
concise representation of quantum logic circuits in the form of
quantum decision diagrams (QDDs), which are amenable to
efficient manipulation and optimization including recursive
unitary functional bi-decomposition. This paper concludes by
presenting the QDD-based algorithm for automatic synthesis of
quantum circuits.
-
Droplet Routing in the Synthesis of Digital Microfluidic Biochips [p. 323]
-
F. Su, W. Hwang and K. Chakrabarty
Recent advances in microfluidics are expected to lead to sensor
systems for high-throughput biochemical analysis. CAD tools are
needed to handle increased design complexity for such systems.
Analogous to classical VLSI synthesis, a top-down design
automation approach can shorten the design cycle and reduce
human effort. We focus here on the droplet routing problem, which
is a key issue in biochip physical design automation. We develop the
first systematic droplet routing method that can be integrated with
biochip synthesis. The proposed approach minimizes the number of
cells used for droplet routing, while satisfying constraints imposed
by throughput considerations and fluidic properties. A real-life
biochemical application is used to evaluate the proposed method.
-
Priority Scheduling in Digital Microfluidics-Based Biochips [p. 329]
-
A. J. Ricketts, K. Irick, N. Vijaykrishnan and M. J. Irwin
Discrete droplet digital microfluidics-based biochips face
problems similar to those in other VLSI CAD systems, but
with new constraints and interrelations. We focus on one
such problem of resource constrained scheduling for digital
microfluidic biochips. Since the problem is NP-complete,
finding the optimal solution is a very time-consuming task.
We propose a hybrid priority scheduling algorithm solution
directly applicable to digital microfluidics with the potential
to yield near optimal schedules in the general case in a very
short time. Furthermore we propose the use of
configurable detectors that allow for even more improved
system performance.
-
A Hybrid Framework for Design and Analysis of Fault-Tolerant Architectures for Nanoscale
Molecular Crossbar Memories [p. 335]
-
D. Bhaduri, S. Shukla, D. Coker, V. Taylor, P. Graham and M. Gokhale
It is anticipated that self-assembled ultra-dense
nanomemories will be more susceptible to manufacturing
defects and transient faults than conventional CMOS-based
memories, thus the need exists for fault-tolerant memory
architectures. The development of such architectures
will require intense analysis in terms of achievable performance
measures - power dissipation, area, delay and reliability.
In this paper, we propose and develop a hybrid automation
framework, called HMAN, that aids the design
and analysis of fault-tolerant architectures for nanomemories.
Our framework can analyze memory architectures
at two different levels of the design abstraction, namely
the system and circuit levels. To the best of our knowledge,
this is the first such attempt at analyzing memory
systems at different levels of abstraction and then correlating
the different performance measures. We also illustrate
the application of our framework to self-assembled crossbar
architectures by analyzing a hierarchical fault-tolerant
crossbar-based memory architecture that we have developed.
-
Optical Routing for 3D System-on-Package [p. 337]
-
J. R. Minz, S. Thyagaraja and S.-K. Lim
Optical interconnects enable faster signal propagation
with virtually no crosstalk. In addition, wavelength division
multiplexing allows a single waveguide to be shared among
multiple interconnects. This paper proposes efficient algorithms
for the construction of timing and congestion-driven waveguides
considering the optical resource constraints. We develop the
first optical router for System-on-Packages (SOPs), which reduces
electrical wirelength by 11% and improves performance by 23%,
when a single optical layer is introduced for every placement
layer.
Moderators: P. Ienne, EPFL Lausanne, CH; T. Austin, The U of Michigan, US
-
Distributed Loop Controller Architecture for Multi-Threading in Uni-Threaded VLIW Processors [p. 339]
-
P. Raghavan, A. Lambrechts, M. Jayapala, F. Catthoor and D. Verkest
Reduced energy consumption is one of the most important
design goals for embedded application domains like
wireless, multimedia and biomedical. Instruction memory
hierarchy has been proven to be one of the most power
hungry parts of the system. This paper introduces an architectural
enhancement for the instruction memory to reduce
energy and improve performance. The proposed distributed
instruction memory organization requires minimal
hardware overhead and allows execution of multiple loops
in parallel in a uni-processor system. This architecture enhancement
can reduce the energy consumed in the instruction
and data memory hierarchy by 70.01% and improve the
performance by 32.89% compared to enhanced SMT based
architectures.
-
Compositional, Efficient Caches for a Chip Multi-Processor [p. 345]
-
A. M. Molnos, M. J. M. Heijligers, S. D. Cotofana and J. T. J. Van Eijndhoven
In current multi-media systems major parts of the functionality
consist of software tasks executed on a set of concurrently
operating processors. Those tasks interfere with
each other when they share memory and other hardware
components. For instance when the tasks share caches and
no precautions are taken they potentially flush each other's
data at random. In this case the control over the system
performance is lost. However, in media processing the performance
must be under tight control. In particular the performance
of each individual task must be preserved if the
tasks are executed concurrently in arbitrary combinations
or if additional tasks are added. A system satisfying this
property is addressed as being compositional.
This paper proposes a novel cache partitioning technique
that enhances compositionality. We assume a cache
to be a rectangular array of memory elements arranged in
"sets" (rows) and "ways" (columns). We perform two partitioning
types. First, each task and each inter-task common
data gets an exclusive part of the cache sets. Second,
inside the cache sets of common data each task accessing
it gets a number of ways. We apply the proposed method
on a homogeneous multiprocessor using two applications:
H.264 decoding and picture-in-picture-TV. Our experiments
indicate that, for both applications, under our partitioning
scheme the sum of misses of the individual tasks executed
separately and the number of misses of all tasks executed
concurrently differs at most by 4%. We conclude that compositionality
is achieved within reasonable bounds. Additionally,
our technique appears to improve the efficiency of
the cache operation.
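The first partitioning type (exclusive cache sets per task) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' mechanism: the `SetPartition` type, the modulo mapping and the example sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetPartition:
    """A contiguous, exclusive range of cache sets owned by one task (hypothetical)."""
    first_set: int
    num_sets: int


def partitioned_set_index(block_address: int, part: SetPartition) -> int:
    """Map a memory block to a set inside the owning task's partition only."""
    return part.first_set + (block_address % part.num_sets)


# Example: an 8-set cache split evenly between two tasks.
task_a = SetPartition(first_set=0, num_sets=4)
task_b = SetPartition(first_set=4, num_sets=4)
sets_a = {partitioned_set_index(b, task_a) for b in range(64)}
sets_b = {partitioned_set_index(b, task_b) for b in range(64)}
```

Because the two index ranges are disjoint, neither task in this sketch can evict the other's lines, which is the property the abstract calls compositionality.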
-
Efficient Design Space Exploration of High Performance Embedded Out-of-Order Processors [p. 351]
-
S. Eyerman, L. Eeckhout and K. De Bosschere
Previous work on efficient customized processor design
primarily focused on in-order architectures. However, with
the recent introduction of out-of-order processors for high-end
high-performance embedded applications, researchers
and designers need to address how to automate the design
process of customized out-of-order processors. Because of
the parallel execution of independent instructions in out-of-order
processors, in-order processor design methodologies
which subdivide the search space in independent components
are unlikely to be effective in terms of accuracy for
designing out-of-order processors. In this paper we propose
and evaluate various automated single- and multi-objective
optimizations for exploring out-of-order processor designs.
We conclude that the newly proposed genetic local search
algorithm outperforms all other search algorithms in terms
of accuracy. In addition, we propose two-phase simulation
in which the first phase explores the design space through
statistical simulation; a region of interest is then simulated
through detailed simulation in the second phase. We show
that two-phase simulation yields simulation time speedups of
2.2X to 7.3X.
-
Application-Specific Reconfigurable XOR-Indexing to Eliminate Cache Conflict Misses [p. 357]
-
H. Vandierendonck, P. Manet and J.-D. Legat
Embedded systems allow application-specific optimizations
to improve the power/performance trade-off. In this
paper, we show how application-specific hashing of the address
can eliminate a large number of conflict misses in
caches. We consider XOR-functions: each set index bit is
computed as the XOR of a subset of the address bits.
Previous work has considered simpler bit-selecting functions.
Compared to such work, the contributions of this paper
are two-fold. Firstly, we present a heuristic algorithm
to construct application-specific XOR-functions. Secondly,
in order to adapt the hashing to the application, we show
that a reconfigurable XOR-function selector is inherently
less complex than a reconfigurable selector for bit-selecting
functions. This is possible by placing restrictions on the allowed
XOR-functions.
Our evaluation shows a reduction of cache misses for
standard benchmarks averaging between 30% and 60%, depending
on the cache size.
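The XOR-function idea, where each set index bit is the XOR of a subset of address bits, can be sketched as below. The masks, cache geometry and addresses are assumptions chosen for illustration; the heuristic that constructs application-specific functions is not reproduced here.

```python
def bit_select_index(address: int, index_bits: int, offset_bits: int) -> int:
    """Conventional indexing: take the address bits just above the block offset."""
    return (address >> offset_bits) & ((1 << index_bits) - 1)


def xor_index(address: int, masks: list[int]) -> int:
    """XOR indexing: index bit i is the parity of the address bits selected by masks[i]."""
    index = 0
    for i, mask in enumerate(masks):
        parity = bin(address & mask).count("1") & 1
        index |= parity << i
    return index


# Assumed geometry: 4 sets (2 index bits), 16-byte blocks (4 offset bits).
# Index bit 0 = addr[4] XOR addr[6]; index bit 1 = addr[5] XOR addr[7].
masks = [0x50, 0xA0]
addresses = [0x00, 0x40, 0x80, 0xC0]  # stride-64 accesses
conventional = [bit_select_index(a, 2, 4) for a in addresses]
hashed = [xor_index(a, masks) for a in addresses]
```

In this example the stride-64 addresses all collide in set 0 under conventional indexing, while the XOR function spreads them over all four sets.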
Moderators: P. Lysaght, Xilinx, US; W. Luk, Imperial College London, UK
-
A Spatial Mapping Algorithm for Heterogeneous Coarse-Grained Reconfigurable Architectures [p. 363]
-
M. Ahn, J. W. Yoon, Y. Paek, Y. Kim, M. Kiemb and K. Choi
In this work, we investigate the problem of automatically
mapping applications onto a coarse-grained reconfigurable
architecture and propose an efficient algorithm to solve the
problem. We formalize the mapping problem and show that
it is NP-complete. To solve the problem within a reasonable
amount of time, we divide it into three subproblems:
covering, partitioning and layout. Our empirical results
demonstrate that our technique produces nearly as good
performance as hand-optimized outputs for many kernels.
-
Compiler-Driven FPGA-Area Allocation for Reconfigurable Computing [p. 369]
-
E. M. Panainte, K. Bertels and S. Vassiliadis
In this paper, we propose two FPGA-area allocation algorithms
based on profiling results for reducing the impact
on performance of dynamic reconfiguration overheads. The
problem of FPGA-area allocation is presented as a 0-1 integer
linear programming problem and efficient solvers are
incorporated for finding the optimal solutions. Additionally,
we discuss the FPGA-area allocation problem in two scenarios.
In the first scenario, all hardware operations are
allocated on the FPGA while in the second scenario, any
hardware operation can be switched to software execution
in order to provide an overall performance improvement.
We evaluate our proposed algorithms using the MPEG2
and MJPEG encoder multimedia benchmarks and the hardware
implementations for SAD, DCT, IDCT, Quantization
and VLC tasks. We show that a significant performance improvement
(up to 61% for MPEG2 and 94% for MJPEG)
can be achieved when the proposed algorithms are used,
while the reconfiguration overhead is reduced by at least 36% for MJPEG.
-
Temporal Partitioning for Image Processing Based on Time-Space Complexity in Reconfigurable
Architectures [p. 375]
-
P. S. Brandão do Nascimento and M. E. de Lima
Temporal partitioning techniques are useful to
implement large and complex applications, which can be
split into partitions in FPGA devices. In order to minimize
resources, each of these partitions can be multiplexed in
a single FPGA area by reconfiguration techniques. These
multiplexing approaches increase the effective area,
allowing parallelism exploitation in small devices.
However, multiplexing incurs reconfiguration time, which
can degrade application performance. Thus,
intensive parallelism in massive computation
applications must be exploited to compensate for this
drawback and improve processes. In this work, a
temporal partitioning technique is presented for a class of
image processing (massive computation) applications.
The proposed technique is based on the algorithmic
complexity (area x time) for each task that composes the
applications. Experimental results are used to
demonstrate the efficiency of the approach when
compared to the optimal solution obtained by exhaustive
timing search.
-
System-Level Scheduling on Instruction Cell Based Reconfigurable Systems [p. 381]
-
Y. Yi, I. Nousias, M. Milward, S. Khawam, T. Arslan and I. Lindsay
This paper presents a new operation chaining
reconfigurable scheduling algorithm (CRS) based on list
scheduling that maximizes instruction level parallelism
available in distributed high performance instruction cell
based reconfigurable systems. Unlike other typical scheduling
methods, it considers the placement and routing effect,
register assignment and advanced operation chaining
compilation technique to generate higher performance
scheduled code. The effectiveness of this approach is
demonstrated here using a recently developed industrial
distributed reconfigurable instruction cell based architecture
[11]. The results show that schedules using this approach
achieve equivalent throughput to VLIW architectures but at
much lower power consumption.
Organisers: M. Buehler, IBM Deutschland Entwicklung GmbH, DE; A. Ripp, MunEDA GmbH, DE
Moderator: A. Ripp, MunEDA GmbH, DE
-
DFM/DFY Design for Manufacturability and Yield - Influence of Process Variations and Increased
Defect Sensitivity in Digital, Analogue and Mixed-Signal Circuit Design [p. 387]
-
M. Buehler, J. Koehl, J. Bickford, J. Hibbeler, U. Schlichtmann, R. Sommer, M. Pronath and A. Ripp
The concepts of Design for Manufacturability and
Design for Yield DFM/DFY are bringing together
domains that co-existed mostly separated until now -
circuit design, physical design and manufacturing
process. New requirements like SoC, mixed
analog/digital design and deep-submicron
technologies force a mutual integration of all
levels. A major challenge coming with new deep-submicron
technologies is to design and verify
integrated circuits for high yield. Random and
systematic defects as well as parametric process
variations have a large influence on quality and yield
of the designed and manufactured circuits. With
further shrinking of process technology, the on-chip
variation is getting worse for each technology node.
For technologies larger than 180nm feature sizes,
variations are mostly in a range of below 10%. Here
an acceptable yield range is achieved by regular but
error-prone re-shifts of the drifting process. However,
shrinking technologies down to 90nm, 65nm and
below cause on-chip variations of more than 50%. It
is understandable that tuning the technology process
alone is not enough to guarantee sufficient yield and
robustness levels any more. Redesigns and, therefore,
respins of the whole development and manufacturing
chain lead to high costs of multiple manufacturing
runs. All together the risk to miss the given market
window is extremely high. Thus, it becomes
inevitable to have a seamless DFM/DFY concept
realized for the design phase of digital, analog, and
mixed-signal circuits. New DFY methodologies are
coming up for parametric yield analysis and
optimization and have recently been made available
for the industrial design of individual analog blocks
on transistor level up to 1500 transistors. The transfer
of yield analysis and yield optimization techniques to
other abstraction levels - both for digital as well as
for analog - is a big challenge. Yield analysis and
optimization is currently applied to individual circuit
blocks and not to the overall chip, yielding on the one
hand often too pessimistic results - best/worst case
and OCV (On Chip Variation) factor - for the digital parts.
On the other hand, for analog, very high efforts are often
spent to design individual blocks with high robustness (>6σ).
For abstraction to higher digital levels, first approaches like
statistical static timing analysis (SSTA) are under development.
For the analog parts, a strategy to develop macro models and
hierarchical simulation or behavioral simulation methodologies
is required that includes low-level statistical effects
caused by local and global process variation of the individual devices.
Moderators: M. Glesner, TU Darmstadt, DE; D. Leenaerts, Philips Research Labs, NL
-
Systematic Methodology for Designing Reconfigurable ΔΣ Modulator Topologies for Multimode Communication Systems [p. 393]
-
Y. Wei, H. Tang and A. Doboli
This paper proposes a methodology for designing reconfigurable
continuous-time ΔΣ modulator topologies. The
methodology is based on the concept of generic topology
that expresses all possible signal paths in a reconfigurable
topology. Topologies are optimized for minimizing the complexity
of the topologies, maximizing the sharing of circuitry for
different modes, maximizing the topology robustness with
respect to circuit nonidealities, and minimizing
total power consumption. The paper presents a case study
for designing topologies for a three mode reconfigurable
ΔΣ modulator, and compares topologies with state-of-the-art design.
-
Double-Sampling Single-Loop Sigma-Delta Modulator Topologies for Broadband Applications [p. 399]
-
M. Yavari, O. Shoaei and A. Rodriguez-Vazquez
This paper presents novel double sampling high order
single-loop sigma-delta modulator structures for
wideband applications. To alleviate the quantization
noise folding into the inband frequency region, two
previously reported techniques are used. The DAC
sampling paths are implemented with the single capacitor
approach and an additional zero is placed at the half of
the sampling frequency of the modulator’s noise transfer
function (NTF). The detrimental effect of this additional
zero on both the NTF and signal transfer function (STF)
is also resolved through the proposed modulator
architectures with a low additional circuit requirement.
-
A 10 GHz 15 dB Four-Stage Distributed Amplifier in 0.18 μm CMOS Process [p. 405]
-
K. K. Moez and M. I. Elmasry
This paper presents a four-stage CMOS distributed
amplifier (DA) design implemented in standard 0.18 μm
CMOS technology. The proposed design eliminates the need
for transmission line capacitors and, consequently, uses
significantly smaller spiral inductors compared with the
previous designs. Using the minimum size inductor, the
bandwidth of the amplifiers is extended, and the quality
factors of the on-chip inductor are improved. Proposed DA
occupies the smallest die area (0.3μm*0.8μm) amongst the
DAs reported with the same performance. A unity gain
bandwidth of 10 GHz and a gain of 15 dB are measured.
DC power dissipation is 56 mW.
-
Bootstrapped Full-Swing CMOS Driver for Low Supply Voltage Operation [p. 410]
-
J. Garcia, J. A. Montiel-Nelson and S. Nooshabadi
This paper reports a high speed and low power consumption
direct-indirect bootstrapped full-swing CMOS inverter
driver circuit (bfi-driver). The simulation results, based on
0.13 μm triple well CMOS technology, show that, when operated
at 1 V, the bfi-driver is 94% faster and consumes 22%
less power compared to a counterpart direct bootstrap circuit
[1].
Moderators: H. Obermeir, Infineon Technologies, DE; N. Nicolici, McMaster U, CA
-
An Effective Technique for Minimizing the Cost of Processor Software-Based Diagnosis In SoCs [p. 412]
-
P. Bernardi, E. Sánchez, M. Schillaci, G. Squillero and M. Sonza Reorda
The ever increasing usage of microprocessor devices is
sustained by a high volume production that in turn
requires a high production yield, backed by a controlled
process. Fault diagnosis is an integral part of the
industrial effort towards these goals. This paper presents a
novel cost-effective approach to the construction of
diagnostic software-based test sets for microprocessors.
The methodology exploits an existing post-production test
set, designed for software-based self-test, and an already
developed infrastructure IP to perform the diagnosis. An
initial diagnostic test set is built, and then iteratively
refined resorting to an evolutionary method. Experimental
results are reported in the paper showing the feasibility
and effectiveness of the approach for an Intel i8051
processor core.
-
Timing-Reasoning-Based Delay Fault Diagnosis [p. 418]
-
K. Yang and K.-T. Cheng
In this paper, we propose a timing-reasoning algorithm to
improve the resolution of delay fault diagnosis. In contrast
to previous approaches which identify candidates by utilizing
only logic conditions, we propose a timing-simulation-based
method to perform the candidate reasoning. Based
on the circuit timing information, we identify invalid candidates
which cannot maintain the consistency of failure
behaviors. By eliminating those invalid candidates, the diagnosis
resolution can be improved. We then analyze the
problem of circuit timing uncertainty caused by the delay
variation and the simulation model. We calculate a metric,
named invalid-probability, for each candidate. Then we
propose a candidate-ranking heuristic which is robust with
respect to such sources of timing uncertainty. By ranking
the candidates based on their invalid-probability, we
can improve the candidate first-hit-rate of the traditional
critical path tracing (CPT) technique. To demonstrate the
efficiency of the proposed method, we have developed a
timing diagnosis framework which can simulate the real
diagnosis process to evaluate and compare different algorithms.
-
Multiple-Fault Diagnosis Based on Single-Fault Activation and Single-Output Observation [p. 424]
-
Y.-C. Lin and K.-T. Cheng
In this paper, we propose a new circuit transformation
technique in conjunction with the use of a special diagnostic
test pattern, named SO-SLAT pattern, to achieve higher
multiple-fault diagnosis resolutions. For a given list of
candidate faults, which could be stuck-at, transition, bridging,
or other faults, we generate a set of SO-SLAT patterns,
each of which attempts to activate only one fault in the list
and propagate its effects to only one observation point.
Observing the responses to SO-SLAT patterns helps more
precisely identify fault candidates. The method can also
tolerate most of the timing hazards for more accurate diagnosis
of failures caused by timing faults. The experimental
results demonstrate the effectiveness of the proposed
method for diagnosing multiple faults.
-
Software-Based Self-Test of Processors under Power Constraints [p. 430]
-
J. Zhou and H.-J. Wunderlich
Software-based self-test (SBST) of processors offers
many benefits, such as dispensing with expensive test
equipment, test execution during maintenance and in the
field or initialization tests for the whole system. In this
paper, for the first time a structural SBST methodology is
proposed which optimizes energy, average power consumption,
test length and fault coverage at the same time.
Key words: Test program generation, processor test,
low power test
-
Diagnosis of Defects on Scan Enable and Clock Trees [p. 436]
-
Y. Huang and K. Gallie
Scan is the most widely used DFT technique in
today's VLSI industry. Mux-DFF and Level Sensitive
Scan Design (LSSD) are the most popular scan
architectures. For Mux-DFF, when scan enable is set to
"1", the scan chain is in shift mode. When scan enable
is set to "0", the scan chain is in capture mode. For
LSSD, two clocks are used to control the shift. When
scan enable or scan clock has defects, it is desirable to
locate the defects at logic level by algorithmic
techniques to guide failure analysis.
Similar to the defects on other signals, faulty scan
enable / clock signals may be caused by numerous types
of defects, e.g., a shorted net, an open net or incorrect
timing with respect to clock or scan data stream. The
following examples are used to illustrate how to apply
various fault models for different defects.
If a scan enable signal is shorted to VCC, only
incorrect capturing will result. Scan cells will capture
data from the previous scan cell instead of capturing data
from system logic. We may use a stuck-at-1 fault model
for this scenario. Clearly, the chain integrity test will
pass since these patterns don't have the capture
operation. The scan patterns would fail and the scan
logic diagnosis will be used in this scenario. In the rest of
this paper, we do not discuss this category of scan enable
defects.
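The Mux-DFF behaviour described above (shift when scan enable is "1", capture when it is "0") and the effect of a scan enable stuck at 1 can be sketched with a small behavioural model. The function names, the three-cell chain and the inverting system logic are hypothetical; this is not the diagnosis algorithm of the paper.

```python
def clock_scan_chain(state, scan_in, system_data, scan_enable):
    """One clock edge of a Mux-DFF chain; state[0] is the cell nearest scan-in."""
    if scan_enable == 1:
        # Shift mode: every cell takes the value of the previous cell.
        return [scan_in] + state[:-1]
    # Capture mode: every cell takes its value from the system logic.
    return list(system_data)


def shift_in(load_bits):
    """Load a pattern serially; shifting works even if scan enable is stuck at 1."""
    state = [0] * len(load_bits)
    for bit in reversed(load_bits):
        state = clock_scan_chain(state, bit, None, 1)
    return state


def capture(load_bits, system_logic, scan_enable):
    """Load a pattern, then apply one capture clock with the given scan enable value."""
    state = shift_in(load_bits)
    return clock_scan_chain(state, 0, system_logic(state), scan_enable)


def invert(bits):
    """Assumed system logic for the example: invert every cell value."""
    return [1 - b for b in bits]


good = capture([1, 1, 0], invert, scan_enable=0)    # fault-free capture
faulty = capture([1, 1, 0], invert, scan_enable=1)  # scan enable stuck at 1
```

In the faulty case each cell holds its predecessor's value instead of the system logic response, while the shift-only load is unaffected, which matches the observation that the chain integrity test passes.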
Moderators: S. Baruah, North Carolina U Chapel Hill, US; G. Fohler, Malardalen U, SE
-
Lock-Free Synchronization for Dynamic Embedded Real-Time Systems [p. 438]
-
H. Cho, B. Ravindran and E. D. Jensen
We consider lock-free synchronization for dynamic embedded
real-time systems that are subject to resource overloads
and arbitrary activity arrivals. We model activity arrival
behaviors using the unimodal arbitrary arrival model
(or UAM). UAM embodies a stronger "adversary" than
most traditional arrival models. We derive the upper bound
on lock-free retries under the UAM with utility accrual
scheduling - the first such result. We establish the tradeoffs
between lock-free and lock-based sharing under UAM.
These include conditions under which activities' accrued
timeliness utility is greater under lock-free than lock-based,
and the consequent upper bound on the increase in accrued
utility that is possible with lock-free. We confirm our analytical
results with a POSIX RTOS implementation.
-
Performance Analysis of Greedy Shapers in Real-Time Systems [p. 444]
-
E. Wandeler, A. Maxiaguine and L. Thiele
Traffic shaping is a well-known technique in the
area of networking and is proven to reduce global buffer requirements
and end-to-end delays in networked systems. Due to these
properties, shapers also play an increasingly important role in the
design of multi-processor embedded systems that exhibit a considerable
amount of on-chip traffic. Despite their growing importance
in this area, no methods exist to analyze shapers in distributed embedded
systems, and to incorporate them into a system-level performance
analysis. Hence, until now it has not been possible to determine
the effect of shapers on end-to-end delay guarantees or buffer requirements
in these systems. In this work, we present a method to
analyze greedy shapers, and we embed this analysis method into
a well-established modular performance analysis framework. The
presented approach enables system-level performance analysis of
complete systems with greedy shapers, and we prove its applicability
by analyzing two case study systems.
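To illustrate what a greedy shaper does, the sketch below simulates a discrete-time token-bucket shaper that buffers events violating the shaping constraint and releases them as early as possible. The token-bucket curve, slot-based timing and function name are assumptions for illustration; the paper's analysis method itself is not reproduced.

```python
def greedy_token_bucket_shaper(arrivals, burst, rate):
    """Return events released per time slot by a greedy token-bucket shaper.

    arrivals: events arriving in each slot; burst: bucket size; rate: tokens added per slot.
    """
    if burst <= 0 or rate <= 0:
        raise ValueError("burst and rate must be positive")
    tokens = burst
    backlog = 0
    departures = []
    pending = list(arrivals)
    while pending or backlog > 0:
        backlog += pending.pop(0) if pending else 0
        sent = min(backlog, tokens)  # greedy: release as much as tokens allow
        backlog -= sent
        tokens = min(burst, tokens - sent + rate)
        departures.append(sent)
    return departures


shaped_burst = greedy_token_bucket_shaper([5, 0, 0, 0], burst=2, rate=1)
shaped_smooth = greedy_token_bucket_shaper([1, 1, 1], burst=2, rate=1)
```

A burst of five events is spread over four slots, whereas a stream that already conforms to the constraint passes through without delay.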
-
Improved Offset-Analysis Using Multiple Timing-References [p. 450]
-
R. Henia and R. Ernst
In this paper, we present an extension to existing approaches
that capture and exploit timing-correlation between
tasks for scheduling analysis in distributed systems.
Previous approaches consider a unique timing-reference for
each set of time-correlated tasks and thus, do not capture the
complete timing-correlation between task activations. Our
approach is to consider multiple timing-references which
allows us to capture more information about the timing-correlation
between tasks. We also present an algorithm that
exploits the captured information to calculate tighter bounds
for the worst-case response time analysis under a static priority
preemptive scheduler.
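For context, the classical worst-case response time iteration for independent tasks under static priority preemptive scheduling can be sketched as below. It ignores offsets and timing-correlation entirely, so it represents only the pessimistic baseline that offset-based analyses tighten, not the algorithm of this paper; the task parameters are made up.

```python
def response_times(tasks):
    """Worst-case response times for (wcet, period) tasks sorted by decreasing priority.

    Deadlines are assumed equal to periods; None marks a task that misses its deadline.
    """
    results = []
    for i, (wcet, period) in enumerate(tasks):
        r = wcet
        while True:
            # Interference from every higher-priority task: ceil(r / T_j) * C_j.
            interference = sum(-(-r // t) * c for c, t in tasks[:i])
            nxt = wcet + interference
            if nxt > period:
                r = None
                break
            if nxt == r:
                break
            r = nxt
        results.append(r)
    return results


schedulable = response_times([(1, 4), (2, 6), (3, 12)])
overloaded = response_times([(2, 3), (2, 4)])
```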
-
Procrastinating Voltage Scheduling with Discrete Frequency Sets [p. 456]
-
Z. Lu, Y. Zhang, M. Stan, J. Lach and K. Skadron
This paper presents an efficient method to find the optimal
intra-task voltage/frequency scheduling for single tasks
in practical real-time systems using statistical workload information.
Our method is analytic in nature and proved to
be optimal. Simulation results verify our theoretical analysis
and show significant energy savings over previous methods.
In addition, in contrast to the previous techniques in
which all available frequencies are used in a schedule, we
find that, by carefully selecting a subset of a small number
of frequencies, one can still design a reasonably good
schedule while avoiding unnecessary transition overheads.
Moderators: G. Martin, Tensilica, US; P. Pop, Linkoping University, SE
-
Communication and Co-Simulation Infrastructure for Heterogeneous System Integration [p. 462]
-
G. Yang, X. Chen, F. Balarin, H. Hsieh and A. Sangiovanni-Vincentelli
With the increasing complexity and heterogeneity of embedded
electronic systems, a unified design methodology at
higher levels of abstraction becomes a necessity. Meanwhile,
it is also important to incorporate the current design practice
emphasizing IP reuse at various abstraction levels. However,
the abstraction gap prohibits easy communication and
synchronization in IP integration and co-simulation. In this
paper, we present a communication infrastructure for an integrated
design framework that enables co-design and co-simulation
of heterogeneous design components specified at
different abstraction levels and in different languages. The
core of the approach is to abstract different communication
interfaces or protocols to a common high level communication
semantics. Designers only need to specify the interfaces
of the design components using extended regular expressions;
communication adapters can then be automatically
generated for the co-simulation or other co-design and
co-verification purposes.
-
A SW Performance Estimation Framework for Early System-Level-Design Using Fine-Grained
Instrumentation [p. 468]
-
T. Kempf, K. Karuri, S. Wallentowitz, G. Ascheid, R. Leupers and H. Meyr
The increasing demands for high performance in embedded
applications under shortening time-to-market have
prompted system architects in recent times to opt for Multi-Processor
Systems-on-Chip (MP-SoCs) employing several
programmable devices. The programmable cores provide a
high amount of flexibility and reusability, and can be optimized
to the requirements of the application to deliver high-performance
as well. Since application software forms the
basis of such designs, the need to tune the underlying SoC
architecture for extracting maximum performance from the
software code has become imperative.
In this paper, we propose a framework that enables software
development, verification and evaluation from the very
beginning of MP-SoC design cycle. Unlike traditional SoC
design flows where software design starts only after the initial
SoC architecture is ready, our framework allows a co-development
of the hardware and the software components
in a tightly coupled loop where the hardware can be refined
by considering the requirements of the software in a stepwise
manner. The key element of this framework is the integration
of a fine-grained software instrumentation tool into a
System-Level-Design (SLD) environment to obtain accurate
software performance and memory access statistics. The
accuracy of such statistics is comparable to that obtained
through Instruction Set Simulation (ISS), while the execution
speed of the instrumented software is almost an order of
magnitude faster than ISS. Such a combined design approach
assists system architects to optimize both the hardware and
the software through fast exploration cycles, and can result in
far shorter design cycles and high productivity. We demonstrate
the generality and the efficiency of our methodology
with two case studies selected from two most prominent and
computationally intensive embedded application domains.
-
A Unified System-Level Modeling and Simulation Environment for MPSoC Design: MPEG-4 Decoder
Case Study [p. 474]
-
V. Reyes, W. Kruijtzer, T. Bautista, G. Alkadi and A. Nuñez
New generation Electronic System-Level design tools
are the key to overcome the complexity and the increasing
design productivity gap in the development of future
Multiprocessor Systems-on-Chip. This paper presents a
SystemC-based system-level simulation environment,
called CASSE, which helps in the modelling and analysis
of complex SoCs. CASSE combines application modeling,
architecture modeling, mapping and analysis within a
unified environment, with the aim to ease and speed up
these modeling steps. The main contribution of this tool is
to enable this fast modelling and analysis at the very
beginning of the design process, helping in the design
space exploration phase. CASSE capabilities are disclosed
in this work by means of a case study where an
MPEG-4 decoder application is implemented on an Altera
Excalibur platform.
-
Task-Accurate Performance Modeling in SystemC for Real-Time Multi-Processor Architectures [p. 480]
-
M. Streubuehr, J. Falk, C. Haubelt, J. Teich, R. Dorsch and T. Schlipf
We propose a novel framework, called Virtual Processing
Components (VPC), that permits the modeling and simulation
of multiple processors running arbitrary scheduling
strategies in SystemC. The granularity is given by task accuracy,
which guarantees a small simulation overhead.
Organisers: H. Meyr, RWTH Aachen U, DE and CoWare Inc; G. Fettweis, TU Dresden, DE
Moderator: O. Schliebusch, CoWare Inc, DE
-
Distributed Object Models for Multi-Processor SoC's, with Application to Low-Power Multimedia
Wireless Systems [p. 482]
-
P. G. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, O. Benny, D. Lyonnard, B. Lavigueur
and D. Lo
This paper summarizes the characteristics of distributed
object models used in large-scale distributed software
systems. We examine the common subset of requirements
for distributed software systems and systems-on-a-chip
(SoC), namely: openness, heterogeneity and multiple
forms of transparency. We describe the application of
these concepts to the emerging class of complex, parallel
SoC's, including multiple heterogeneous embedded
processors interacting with hardware co-processors and
I/O devices. An implementation of this approach is
embodied in STMicroelectronics' DSOC (Distributed
System Object Component) programming model. The use
of this programming model for an architecture exploration
of ST's Nomadik mobile multimedia platform is described.
-
Virtual Prototyping of Embedded Platforms for Wireless and Multimedia [p. 488]
-
T. Kogel and M. Braun
Most of the challenges related to the development of
multi-processor platforms for complex wireless and multi-media
applications fall into the Electronic System Level
(ESL) domain. That is to say, design tasks like embedded
SW development, architecture definition, or system verification
have to be addressed before the silicon or even
the RTL implementation becomes available. We believe
that one of the major obstacles preventing the urgently
required adoption and proliferation of an ESL based design
approach is the nonexistence of an efficient and intuitive
methodology for modeling complex platforms. This
extended abstract gives a rough overview of a modeling
methodology we have developed on the basis of SystemC
based Transaction Level Modeling (TLM) in order to
remedy this lack of modeling competence.
-
Application Specific NoC Design [p. 491]
-
L. Benini
Scalable Networks on Chips (NoCs) are needed to match
the ever-increasing communication demands of large-scale
Multi-Processor Systems-on-chip (MPSoCs) for high-end
wireless communications applications. The heterogeneous
nature of on-chip cores, and the energy efficiency requirements
typical of wireless communications call for
application-specific NoCs which eliminate much of the
overheads connected with general-purpose communication
architectures. However, application-specific NoCs must be
supported by adequate design flows to reduce design time
and effort.
In this paper we survey the main challenges in
application-specific NoC design, and we outline a complete
NoC design flow and methodology. A case study on
a high complexity SoC demonstrates that it is indeed possible
to generate an application-specific NoC from a high
level specification in a few hours. Comparison with a hand-tuned
solution shows that the automatically generated one
is very competitive from the area, performance and power
viewpoint, while design time is reduced from days to hours.
Keywords: Systems on chip, networks on chip,
application-specific integrated systems, design methodologies
Moderators: J. Henkel, Karlsruhe U, DE; D. Liu, Linkoping U, SE
-
Automatic Insertion of Low Power Annotations in RTL for Pipelined Microprocessors [p. 496]
-
V. Viswanath, J. A. Abraham and W. A. Hunt Jr.
We propose instruction-driven slicing, a new technique for
annotating microprocessor descriptions at the Register Transfer
Level (RTL) in order to achieve lower power dissipation.
Our technique automatically annotates existing RTL code to
optimize the circuit for lowering power dissipated by switching
activity. Our technique can be applied at the architectural
level as well, achieving similar power gains. We demonstrate
our technique on architectural and RTL models of a 32-bit
OpenRISC processor (OR1200), showing power gains for the
SPEC2000 benchmarks.
-
Power Analysis of Mobile 3D Graphics [p. 502]
-
B. Mochocki, K. Lahiri and S. Cadambi
The world of 3D graphics, until recently restricted
to high-end workstations and game consoles, is rapidly expanding
into the domain of mobile platforms such as cellular phones
and PDAs. Even as the mobile chip market is poised to exceed
production of 500 million chips per year, incorporation of 3D
graphics in handhelds poses several serious challenges to the
hardware designer. Compared with other platforms, graphics
on handhelds have to contend with limited energy supplies and
lower computing horsepower. Nevertheless, images must still be
rendered at high quality since handheld screens are typically
held closer to the observer’s eye, making imperfections and
approximations very noticeable.
In this paper, we provide an in-depth quantitative analysis of
the power consumption of mobile 3D graphics pipelines. We analyze
the effects of various 3D graphics factors such as resolution,
frame rate, level of detail, lighting and texture maps on power
consumption. We demonstrate that significant imbalance exists
across the workloads of different graphics pipeline stages. In
addition, we illustrate how this imbalance may vary dynamically,
depending on the characteristics of the graphics application.
Based on this observation, we identify and compare the benefits
of candidate Dynamic Voltage and Frequency Scaling (DVFS)
schemes for mobile 3D graphics pipelines. In our experiments
we observe that DVFS for mobile 3D graphics reduces energy
by as much as 50%.
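As a rough illustration of why workload imbalance across pipeline stages leaves room for DVFS, the sketch below uses the textbook CMOS dynamic-energy model (energy per cycle proportional to the square of the supply voltage). It assumes that voltage scales linearly with clock frequency down to a floor and that idle and leakage energy are ignored. The function name, the per-stage cycle counts and the voltage floor are illustrative assumptions, not schemes or figures from the paper.

```python
def dvfs_energy_saving(stage_cycles, v_floor=0.5):
    """Estimate the fractional dynamic-energy saving from per-stage DVFS.

    stage_cycles: cycles each pipeline stage needs per frame (hypothetical data).
    v_floor: lowest usable supply voltage as a fraction of nominal (assumption).
    """
    if not stage_cycles or max(stage_cycles) <= 0:
        raise ValueError("need at least one stage with a positive cycle count")
    if not 0.0 < v_floor <= 1.0:
        raise ValueError("v_floor must be in (0, 1]")

    busiest = max(stage_cycles)            # this stage sets the frame time
    baseline = float(sum(stage_cycles))    # every stage at nominal voltage (V = 1)
    scaled = 0.0
    for cycles in stage_cycles:
        f_ratio = cycles / busiest         # slowest clock that still meets the frame time
        v_ratio = max(f_ratio, v_floor)    # assumed linear V-f relation, clamped at the floor
        scaled += cycles * v_ratio ** 2    # dynamic energy ~ cycles * V^2
    return 1.0 - scaled / baseline


# Three stages with a 4:2:1 workload imbalance.
print(round(dvfs_energy_saving([100, 50, 25]), 3))  # 0.321
```

Under these assumptions a balanced pipeline yields no saving, since every stage must already run at nominal speed; the saving grows with the imbalance until the voltage floor is reached.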
-
Automatic Run-Time Selection of Power Policies for Operating Systems [p. 508]
-
N. Pettis, J. Ridenour and Y.-H. Lu
A significant volume of research has concentrated on
operating-system directed power management (OSPM).
The primary focus of previous research has been the
development of OSPM policies. Under different conditions,
one policy may outperform another and vice
versa. In this paper, we explain how to select the best
policies at run-time without user or administrator intervention.
We present a hardware-neutral architecture
portable across different platforms running Linux. Our
experiments reveal that changing policies at run-time
can adapt to workloads more quickly than using any of
the policies individually.
-
Energy Reduction by Workload Adaptation in a Multi-Process Environment [p. 514]
-
C. Xian and Y.-H. Lu
Reducing energy consumption is an important issue in
modern computers. Dynamic power management (DPM)
has been extensively studied in recent years. One approach
for DPM is to adjust workloads, such as clustering or eliminating
requests, as a way to trade-off energy consumption
and quality of services. Previous studies focus on single processes.
However, when multiple concurrently running processes
are considered, workload adjustment must be determined
based on the interleaving of the processes' requests.
When multiple processes share the same hardware component,
adjusting one process may not save energy. This paper
presents an approach to assign energy responsibility to individual
processes based on how they affect power management.
The assignment is used to estimate potential energy
reduction by adjusting the processes. We use the estimation
to guide runtime adaptation of workload behavior. Experiments
demonstrate that our approach can save more energy
and improve energy efficiency.
-
Dynamic Bit-Width Adaptation in DCT: Image Quality Versus Computation Energy Trade-Off [p. 520]
-
J. Park, J. H. Choi and K. Roy
We present a dynamic bit-width adaptation scheme in
DCT applications for efficient trade-off between image
quality and computation energy. Based on sensitivity differences of
64 DCT coefficients, various operand bit-widths
are used for different frequency components to reduce computation
energy in DCT operation. Numerical results show
that our DCT architecture can achieve power savings ranging from
36% to 75% compared to normal operation.
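The general idea of spending fewer bits on less sensitive frequency components can be sketched with a 1-D 8-point DCT-II in which the cosine constant used for each output coefficient is rounded to a per-coefficient number of fractional bits. This is only a minimal sketch: the choice to quantize the constants (rather than the data operands), the bit allocation and the sample row are assumptions made for illustration and do not reproduce the authors' architecture.

```python
import math


def dct8(x, frac_bits=None):
    """8-point DCT-II; frac_bits[k] optionally limits the precision of the
    cosine constants used for output coefficient k (hypothetical scheme)."""
    n = 8
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = 0.0
        for i in range(n):
            c = scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            if frac_bits is not None:
                q = 1 << frac_bits[k]
                c = round(c * q) / q      # fixed-point constant with frac_bits[k] fractional bits
            acc += x[i] * c
        out.append(acc)
    return out


row = [52, 55, 61, 66, 70, 61, 64, 73]    # sample pixel row (made up)
bits = [12, 12, 10, 10, 8, 8, 6, 6]       # fewer bits for higher frequencies (assumption)
exact = dct8(row)
approx = dct8(row, bits)
for k in range(8):
    print(k, round(exact[k], 3), round(approx[k], 3))
```

Because each constant is off by at most half a quantization step, the error in coefficient k is bounded by sum(|x|) * 2^-(frac_bits[k] + 1), which makes the quality side of the trade-off easy to reason about.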
Moderators: P. Feldmann, IBM T J Watson Research Center, US; E. Beyne, IMEC, BE
-
Bus Stuttering: An Encoding Technique to Reduce Inductive Noise in Off-Chip Data Transmission [p. 522]
-
B. J. LaMeres and S. P. Khatri
Simultaneous switching noise due to inductance in VLSI packaging is a significant
limitation to system performance. The inductive parasitics within IC packaging
cause bounce on the power supply pins in addition to glitches and rise-time
degradation on the signal pins. These factors bound the maximum performance
of off-chip busses, which limits overall system performance. Until recently,
the parasitic inductance problem was addressed by aggressive package design which
attempts to decrease the total inductance in the package interconnect. In this
work we present an encoding technique for off-chip data transmission to limit
bounce on the supplies and reduce inductive signal coupling. This is accomplished
by inserting intermediate (henceforth called "stutter") states in the data
transmission to bound the maximum number of signals that switch simultaneously,
thereby limiting the overall inductive noise. Bus stuttering is cheaper than
expensive package design since it increases the bus performance without changing
the package. We demonstrate that bus stuttering can bound the maximum amount
of inductive noise, which results in increased bus performance even after
accounting for the encoding overhead. Our results show that the performance
of an encoded bus can be increased by up to 225% over un-encoded data. In
addition, synthesis results of the encoder in a TSMC 0.13μm process show that
the encoder size and delay are negligible in a modern VLSI design.
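The core idea of bounding the number of simultaneously switching signals by inserting intermediate states can be sketched as follows. This is a minimal illustration, not the authors' encoder: the rule for choosing which bits flip in each stutter state (lowest differing bits first) and the explicit data/stutter flag are assumptions, and the sketch ignores how a receiver would be told which states carry data.

```python
def hamming(a: int, b: int) -> int:
    """Number of bus lines that switch between states a and b."""
    return bin(a ^ b).count("1")


def stutter(words, max_switch):
    """Return a list of (state, is_data) pairs in which no two consecutive
    states differ in more than max_switch bits."""
    if max_switch < 1:
        raise ValueError("max_switch must be at least 1")
    out = []
    for word in words:
        if out:
            cur = out[-1][0]
            diff = cur ^ word
            while bin(diff).count("1") > max_switch:
                step = 0
                remaining = diff
                for _ in range(max_switch):   # pick the lowest differing bits
                    low = remaining & -remaining
                    step |= low
                    remaining ^= low
                cur ^= step                   # intermediate ("stutter") state
                out.append((cur, False))
                diff = cur ^ word
        out.append((word, True))
    return out


encoded = stutter([0b00000000, 0b11111111, 0b11110000], max_switch=3)
print([format(state, "08b") for state, _ in encoded])
```

With at most three lines allowed to switch at once, the 8-bit transition from all zeros to all ones is spread over three bus cycles. The extra cycles are the encoding overhead that the abstract says is outweighed by the higher achievable bus speed.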
-
Statistical Timing Analysis with Path Reconvergence and Spatial Correlations [p. 528]
-
L. Zhang, Y. Hu and C. C.-P. Chen
State of the art statistical timing analysis (STA) tools
often yield less accurate results when timing variables become
correlated. Spatial correlation and correlation caused by path
reconvergence are among those which are most difficult to deal
with. Existing methods treating these correlations will either
suffer from high computational complexity or significant errors.
In this paper, we present a new sensitivity pruning method
which will significantly reduce the computational cost to consider
path reconvergence correlation. We also develop an accurate and
efficient model to deal with the spatial correlation.
-
Non-Gaussian Statistical Interconnect Timing Analysis [p. 533]
-
S. Abbaspour, H. Fatemi and M. Pedram
This paper focuses on statistical interconnect timing analysis in a
parameterized block-based statistical static timing analysis tool. In
particular, a new framework for performing timing analysis of RLC
networks with step inputs, under both Gaussian and non-Gaussian
sources of variation, is presented. In this framework, resistance,
inductance, and capacitance of the RLC line are modeled in a
canonical first order form and used to produce the corresponding
propagation delay and slew (time) in the canonical first-order form.
To accomplish this step, mean, variance, and skewness of delay and
slew distributions are obtained in an efficient, yet accurate, manner.
The proposed framework can be extended to consider higher order
terms of the various sources of variation. Experimental results show
average errors of less than 2% for the mean, variance and skewness
of interconnect delay and slew while achieving orders of magnitude
speedup with respect to a Monte Carlo simulation with 10^4 samples.
-
Cell Delay Analysis Based on Rate-of-Current Change [p. 539]
-
S. Nazarian and M. Pedram
A cell delay model based on rate-of-current change
is presented, which accounts for the impact of the
shape of the noisy waveform on the output voltage
waveform. More precisely, a pre-characterized table of
time derivatives of the output current as a function of input
voltage and output load values is constructed. The data in
this table, in combination with the Taylor series expansion
of the output current, is utilized to progressively compute
the output current waveform, which is then integrated to
produce the output voltage waveform. Experimental
results show the effectiveness and efficiency of this new
delay model.
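The progressive computation described above can be sketched with a first-order Taylor step, I(t + dt) ≈ I(t) + (dI/dt)·dt, followed by integration of the current onto the load capacitance. This is a minimal sketch under stated assumptions: the pre-characterized table is stood in for by a caller-supplied function `didt(v_in, c_load)`, the expansion is truncated after the first-order term, and both waveforms start from zero. None of these details come from the paper.

```python
from typing import Callable, List, Tuple


def output_waveform(v_in: List[float], dt: float, c_load: float,
                    didt: Callable[[float, float], float]
                    ) -> Tuple[List[float], List[float]]:
    """Step the output current with a first-order Taylor update and integrate
    it onto c_load to obtain the output voltage (forward Euler)."""
    i_out = [0.0]
    v_out = [0.0]
    for v in v_in[:-1]:
        i_now = i_out[-1]
        # dV = I * dt / C: integrate the current onto the load
        v_out.append(v_out[-1] + i_now * dt / c_load)
        # Taylor step using the looked-up derivative of the output current
        i_out.append(i_now + didt(v, c_load) * dt)
    return i_out, v_out


# Toy "table": a constant current slope, independent of input voltage and load.
i_wave, v_wave = output_waveform([1.0] * 5, dt=1.0, c_load=2.0,
                                 didt=lambda v, c: 4.0)
print(i_wave)  # [0.0, 4.0, 8.0, 12.0, 16.0]
print(v_wave)  # [0.0, 0.0, 2.0, 6.0, 12.0]
```

A real characterization would replace the lambda with interpolation over measured or simulated data points; the stepping loop itself would stay the same.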
-
A Practical Method to Estimate Interconnect Responses to Variabilities [p. 545]
-
F. Liu
Variabilities in metal interconnect structures can affect circuit
timing performance or even cause function failure in
VLSI designs. This paper proposes a method to estimate
the difference between the nominal and perturbed circuit
waveforms by calculating the moments in frequency-domain
via an efficient iterative method. The algorithm can be used to
accurately reproduce the differential waveforms, or to provide
efficient early estimates on the timing impact of the
variabilities for RC networks.
Organisers: H.-J. Wunderlich, Stuttgart U, DE; P. van Staa, Robert Bosch GmbH, DE
Moderator: C. Sebeke, Robert Bosch GmbH, DE
Panellists: C. Jung, BMW Group, DE; K. Harbich, Robert Bosch GmbH, DE; S. Fuchs, Credence Systems
GmbH, DE; J. Schwarz, DaimlerChrysler AG, DE; P. Goehner, Stuttgart U, DE
-
Test and Reliability Challenges in Automotive Microelectronics [p. 547]
-
C. Jung, K. Harbich, S. Fuchs, J. Schwarz and P. Goehner
Absolutely fail-safe operation in any critical situation, highest reliability in
day-to-day operation and best-in-class convenience at a reasonable price: all drive innovation in automotive electronics. These goals
result in car systems with ever-increasing complexity, challenging every single component, IC and line of
code. As electronics’ failure rates are perceived to grow, we introduce root cause analysis, key
technologies and new measures that enable carmakers to keep pace. The goal is to introduce test and
reliability challenges and respective solutions for automotive systems. Representatives of car companies
and suppliers will explain their views and practical experiences.
Organisers/Moderators: J. Becker, Karlsruhe U, DE; J. Bortolazzi, DaimlerChrysler AG, DE
-
Exploring Trade-offs between Centralized versus Decentralized Automotive Architectures Using a Virtual
Integration Environment [p. 548]
-
S. Kanajan, H. Zeng, C. Pinello and A. Sangiovanni-Vincentelli
The large variety of architectural dimensions in automotive
electronics design, for example, bus protocols, number
of nodes, sensor and actuator interconnections and power
distribution topologies, makes the architecture design task a
very complex but crucial design step, especially for OEMs.
This situation motivates the need for a design environment
that accommodates the integration of a variety of models
in a manner that enables the exploration of design alternatives
in an efficient and seamless fashion. Exploring these
design alternatives in a virtual environment and evaluating
them with respect to metrics such as cost, latency, flexibility
and reliability provide an important competitive advantage
to OEMs and help minimize integration risks later in the design
cycle. In particular, the choice of the degree of decentralization
of the architecture has become a crucial issue in
automotive electronics. In this paper, we demonstrate how
a rigorous methodology (Platform-Based Design) and the
Metropolis framework can be used to find the balance between
centralized and decentralized architectures.
-
Management of Complex Automotive Communication Networks [p. 554]
-
T. Weber
Automakers are still facing an increasing complexity in vehicle requirements with regard to
their EE systems. This complexity is not only caused by innovations, which are being
provided for tomorrow's drivers, but is also due to system requirements regarding EE
Architecture cost, managing software updates and diagnostics concepts.
One way to conquer this challenge is the use of standards in the field of basic technology.
Ongoing activities such as Autosar, FlexRay, HIS (Herstellerinitiative
Software - OEM initiative software) and more, underline the car industry's contribution to creating and establishing
these standards.
Another way - more linked to OEM's internal processes - is to undertake a deeper analysis of
Architecture work. Here at first, a profound description is essential. This tool-based
description is the basis for a more detailed analysis. Two options should be focussed upon:
expert reviews and automatically calculated metrics in the tool, such as cost, weight or even
more sophisticated metrics for feasibility. With this technique, iteration by iteration the EE
Architecture reaches more profound stability and will meet the functional and non-functional
requirements far better.
-
AutoVision - Flexible Processor Architecture for Video-assisted Driving [p. 556]
-
A. Herkersdorf
Future automotive security systems will benefit from visual scene analysis based on a fusion of video, infrared,
and radar images. Today we already have functions like lane departure warning and automatic cruise control
(ACC) for fairly well-defined driving environments, such as highways and primary roads. Recent research
activities concentrate on more complex environments, such as city traffic with a wide variety of traffic
participants moving in an unpredictable manner, e.g. bikes, pedestrians, children, and even animals, and under
changing weather and lighting conditions.
-
Domain Specific Model Driven Design for Automotive Electronic Control Units [p. 557]
-
K. D. Mueller-Glaser
To enhance efficiency and reliability in the design of distributed electronic control units with
hard real-time constraints, new methods and computer aided tools are required, especially to
support early system design phases. Domain specific tools are required to support design
space exploration in the concept phase of electric/electronic systems. Design and verification
based on heterogeneous models (closed loop control systems, reactive systems and UML
based software intensive systems) and using a CASE-tool integration platform will allow for a
seamless design flow.
-
Electric and Electronic Vehicle Architecture Assessment [p. 558]
-
P. Dégardins
Automotive suppliers like Siemens VDO cover the full spectrum of automotive electronics. Our customers
expect from such suppliers the capability to deliver cross-domain solutions and products. The deliveries should
not only be correctly interconnected and integrated, but also simultaneously defined and developed in order to
ensure that they achieve on the global vehicle level the optimum of cost, quality, flexibility and scalability.
SV combined the expertise of its divisions into a corporate Vehicle System Group, in charge of developing the
cross-divisional expertise and solutions. The final aim of the division is to successfully assist the OEM during
the pre-development phase of the architecture, by allocating system engineers who will help in the understanding
of the requirements in order to select from a product portfolio the optimum combination.
Such an activity requires a strong expertise in intersystem technologies, such as Flexray and AUTOSAR, where
SV is a leading contributor today.
In order to master the complexity of an architecture definition phase, where many actors are involved, many
hypotheses analysed, Siemens VDO invested a lot of effort in the definition of the processes and in the
development of an architecture toolchain, SEDAM, which supports and accelerates this process.
The conference will explain the vision of architecture Design and Assessment within Siemens VDO Automotive.
-
Automotive Semi-Conductor Trend and Challenges [p. 559]
-
P. Leteinturier
Automotive electronics has been introduced in multiple waves over time: powertrain, safety &
vehicle dynamics, body & convenience, telematics. The future is already knocking at the door and
revolutionary systems are currently being developed: X-by-wire, E-safety, Hybrid vehicle. The increasing
requirements for fuel economy, safety, emission reduction, and onboard diagnosis push the automotive
industry for more innovative solutions with a rapid increase of complexity. The presentation will highlight
the motivation to introduce high performance electronics in the car.
In the early days of electronics, ECUs (electronic control units) were seen as being the system; with the birth
of networking, the complete car became the system to be controlled; today, with modern communication &
services, the car is just a node in the traffic, and the traffic is now the system to be considered. Innovation
in individual transportation is 90% enabled by electronics. The development of such systems presents
three main challenges: dependable communication, dependable computation and dependable power.
Modern high-end cars run more than 80 ECUs, and the communication bandwidth and message
determinism require the development of new busses such as FlexRay. The increasing power demand is
pushing for a different voltage class. The cost pressure and the time to market are forcing the automotive
industry to re-invent processes and development cycles and to introduce standards.
This panel discussion will demonstrate the key elements to provide a powerful, scalable and configurable
control solution that offers a migration path to evolution and even revolution of automotive electronics.
Moderators: W. Mueller, C-LAB/Paderborn U, DE; A. Gerstlauer, UC Irvine, US
-
A Systematic IP and Bus Subsystem Modeling for Platform-Based System Design [p. 560]
-
J. Um, W.-C. Kwon, S. Hong, Y.-T. Kim, K.-M. Choi, J.-T. Kong, S.-K. Eo and T. Kim
The topic of platform-based system modeling has
received a great deal of attention recently. One of the
important tasks that significantly affect the effectiveness
and efficiency of the system modeling is the modeling of
IP components and communication between IPs. To be
effective, it is generally accepted that the system modeling
should be performed in two steps. In the first step, a fast
but somewhat inaccurate system modeling is considered to
facilitate the simultaneous development of software and
hardware. The second step then refines the models of the
software and hardware blocks (i.e., IPs) to increase the
simulation accuracy for the system performance analysis.
Here, one critical factor required for a successful system
modeling is a systematic modeling of the IP blocks and
bus subsystem connecting the IPs. In this respect, this
work addresses the problem of systematic modeling of the
IPs and bus subsystem in different levels of refinements.
In the experiments, we found that by applying our
proposed IP and bus modeling methods to the MPEG-4
application, we are able to achieve 4x performance
improvement and at the same time, reduce the software
development time by 35%, compared with
conventional modeling methods.
-
Heterogeneous Behavioral Hierarchy for System Level Designs [p. 565]
-
H. D. Patel, S. K. Shukla and R.A. Bergamaschi
Enhancing productivity for designing complex embedded systems
requires system level design methodology and language support
for capturing complex design in high level models. For an effective
methodology, efficiency of simulation and a sound refinement
based implementation path are also necessary. Although some of
the recent system level design languages (SLDLs) support system level abstractions,
several essential ingredients are missing from these. We
consider (i) explicit support for multiple models of computation
(MoCs) or heterogeneity; (ii) the ability to build complex behaviors
by hierarchically composing simpler behaviors; and (iii) hierarchical
composition of behaviors that belong to distinct models
of computation, as essential for successful SLDLs. These render
an SLDL with modeling fidelity that exploits both heterogeneity
and hierarchy and allows for simpler modeling and efficient simulation.
One important requirement for such an SLDL should be
that the simulation semantics be also compositional, and hence no
flattening of hierarchically composed behaviors be needed for simulation.
In this paper we show how we designed SystemC extensions
to provide facilities for heterogeneous behavioral hierarchy,
compositional simulation semantics, and implemented a simulation
kernel which we show experimentally as up to 50% more efficient
than standard SystemC simulation.
-
Design with Race-Free Hardware Semantics [p. 571]
-
P. Schaumont, S. Shukla and I. Verbauwhede
Most hardware description languages do not enforce
determinacy, meaning that they may yield races. Race conditions
pose a problem for the implementation, verification,
and validation of hardware. Enforcing determinacy at the
modeling level provides a solution to this problem. In this
paper, we consider a common model of computation for
hardware modeling - a network of cycle-true finite-state machines
with datapaths (FSMDs) - and we identify the
conditions under which such models are guaranteed to be
race-free. We base our analysis on the Kahn Principle and
a formal framework to represent FSMD semantics. We
present our conclusions as four simple and easy to enforce
modeling rules. A hardware designer who applies those four
modeling rules will thus obtain race-free hardware.
-
Comfortable Modeling of Complex Reactive Systems [p. 577]
-
S. Prochnow and R. von Hanxleden
Modeling systems based on semi-formal graphical formalisms,
such as Statecharts, has become standard practice
in the design of reactive embedded devices. However,
the modeling of realistic applications often results in very
large and unmanageable graphics, severely compromising
their readability and practical use. To overcome this, we
present a methodology to support the easy development and
understanding of complex Statecharts. Central to our approach
is the definition of a Statechart Normal Form (SNF),
which provides a standardized layout that is compact and
makes systematic use of secondary notations to aid readability.
This concept is extended to dynamic Statecharts.
-
Faster Exploration of High Level Design Alternatives Using UML for Better Partitions [p. 579]
-
W. Ahmed and D. Myers
Partitioning is a time-consuming and
computationally complex optimization problem in the
codesign of hardware/software systems. The stringent
time-to-market requirements have resulted in truncating
this step, resulting in sub-optimal solutions being offered
to consumers.
To obtain the true global minimum, which translates to
finding the best solution available, a new methodology
is needed that can achieve this goal in a minimal time.
An approach is presented that forms a basis for design
space exploration from a partitioning perspective using
UML 2.0.
Organisers: H. Meyr, RWTH Aachen U, DE and CoWare Inc; G. Fettweis, TU Dresden, DE
Moderator: G. Ascheid, RWTH Aachen U, DE
-
A Design Flow for Configurable Embedded Processors Based on Optimized Instruction Set Extension
Synthesis [p. 581]
-
R. Leupers, K. Karuri, S. Kraemer and M. Pandey
Design tools for application specific instruction set processors
(ASIPs) are an important discipline in system-level
design for wireless communications and other embedded
application areas. Some ASIPs are still designed
completely from scratch to meet extreme efficiency demands.
However, there is also a trend towards use of partially
predefined, configurable RISC-like embedded processor
cores that can be quickly tuned to given applications
by means of instruction set extension (ISE) techniques.
While the problem of optimized ISE synthesis has been
studied well from a theoretical perspective, there are still
few approaches to an overall HW/SW design flow for
configurable cores that take all real-life constraints into account.
In this paper, we therefore present a novel procedure
for automated ISE synthesis that accommodates both
user-specified and processor-specific constraints in a flexible
way and that produces valid, optimized ISE solutions in
short time. Driven by an advanced application C-code analysis/
profiling frontend, the ISE synthesis core algorithm is
embedded into a complete design flow, where the backend
is formed by a state-of-the-art industrial tool for processor
configuration, ISE HW synthesis, and SW tool retargeting.
The proposed design flow, including ISE synthesis,
is demonstrated via several benchmarks for the MIPS
CorExtend configurable RISC processor platform.
-
Energy Efficiency vs. Programmability Trade-off: Architectures and Design Principles [p. 587]
-
J.P. Robelly, H. Seidel, K.C. Chen, and G. Fettweis
Performance achievements on programmable architectures
due to process technology are reaching their limits,
since designs are becoming wire- and power-limited rather
than device limited. Likewise, traditional exploitation of instruction
level parallelism saturates as the conventional approach
for designing wider issue machines leads to very expensive
interconnections, big instruction memory footprint
and high register file pressure. New architectural concepts
targeted to the application domain of media processing are
needed in order to push current state-of-the-art limitations.
To this end, we regard media applications as a collection
of tasks which consume and produce chunks of data. The
exploitation of task level parallelism as well as more traditional
forms of parallelism is a key issue for achieving the
required amount of MOPS/Watt and MOPS/mm^2 for media
applications. Tasks comprise data transfers and number
crunching algorithm kernels, which are very computing-intensive
yet highly predictable. Moreover, most of the data
manipulated by a task is of a local nature. Granularity and
characteristics of these tasks will lead us in this paper to
draw conclusions about memory hierarchy, task scheduling
strategies and efficient low-overhead programmable architectures
for highly predictable kernel computations.
-
Advanced Receiver Algorithms for MIMO Wireless Communication [p. 593]
-
A. Burg, M. Borgmann, M. Wenk, C. Studer and H. Boelcskei
We describe the VLSI implementation of MIMO detectors
that exhibit close-to-optimum error-rate performance,
but still achieve high throughput at low silicon area. In particular,
algorithms and VLSI architectures for sphere decoding
(SD) and K-best detection are considered, and the corresponding
trade-offs between uncoded error-rate performance,
silicon area, and throughput are explored. We show
that SD with a per-block run-time constraint is best suited
for practical implementations.
Organiser/Moderator: H. Meyr, CoWare, DE
-
Next Generation Architectures Can Dramatically Reduce the 4G Deployment Cycle [p. 599]
-
D. Shaver
We have been "talking" about 4G systems emerging in 2010 for many years.
However, to deploy these systems in 2010, we should already know with high confidence
the 4G signal processing and SoC architectures for 4G handsets. It realistically takes 2
years to develop a power-efficient, cost competitive system-on-a-chip (SoC) for a volume
market. There are standards to be completed, field trials, and wide scale acceptance
before a system solution becomes viable. The entire cycle is at least 5 years. But, rather
than giving up on 2010 as the year for 4G, we need to continue developing the right
signal processing, network protocols, and SoC architectures given our knowledge of
Moore's Law, emerging tool sets, and advanced receiver technology, which together
facilitate rapid time-to-market of energy efficient solutions. The market winners will
quickly adapt to the emerging 4G ecosystem and will develop solutions before others.
This talk provides some historical perspectives on architectures and systems evolution
with the goal of providing an optimistic view that 4G is very near.
Moderators: J. Haid, Infineon Technologies, DE; R. Zafalon, STMicroelectronics, IT
-
Automatic ADL-Based Operand Isolation for Embedded Processors [p. 600]
-
A. Chattopadhyay, B. Geukes, D. Kammler, E. M. Witte, O. Schliebusch, H. Ishebabi, R. Leupers,
G. Ascheid and H. Meyr
Cutting-edge applications of future embedded systems demand
highest processor performance with low power consumption to get
acceptable battery-life times. Therefore, low power optimization
techniques are strongly applied during the development of modern
Application Specific Instruction Set Processors (ASIPs). Electronic
System Level design tools based on Architecture Description Languages
(ADL) offer a significant reduction in design time and effort
by automatically generating the software tool-suite as well as
the Register Transfer Level (RTL) description of the processor. In
this paper, the automation of power optimization in ADL-based RTL
generation is addressed.
Operand isolation is a well-known power optimization technique
applicable at all stages of processor development. With increasing
design complexity, several efforts have been undertaken to automate
operand isolation. In pipelined datapaths, where isolating signals
are often implicitly available, the traditional RTL-based approach
introduces unnecessary overhead. We propose an approach which
extracts high-level structural information from the ADL representation
and systematically uses the available control signals. Our
experiments with state-of-the-art embedded processors show a significant
power reduction (improvement in power efficiency).
-
Power/Performance Hardware Optimization for Synchronization Intensive Applications in MPSoCs [p. 606]
-
M. Monchiero, G. Palermo, C. Silvano and O. Villa
This paper explores optimization techniques of the synchronization
mechanisms for MPSoCs based on complex
interconnect (Network-on-Chip), targeted at future power
efficient systems. The proposed solution is based on the idea
of locally performing synchronization operations which require
the continuous polling of a shared variable, thus featuring
large contention (e.g. spin locks). We introduce a HW
module, the Synchronization-operation Buffer (SB), which
queues and manages the requests issued by the processors.
Experimental validation has been carried out by using
GRAPES, a cycle-accurate performance/power simulation
platform. For an 8-processor target architecture, we show
that the proposed solution achieves up to 40% performance
improvement and 30% energy saving with respect to synchronization
based on a directory-based coherence protocol.
-
An Analytical State Dependent Leakage Power Model for FPGAs [p. 612]
-
A. Kumar and M. Anis
In this paper we present a state dependent analytical leakage
power model for FPGAs. The model accounts for subthreshold
leakage and gate leakage in FPGAs, since these are the two dominant
components of total leakage power. The power model takes
into account the dependency of gate and subthreshold leakage on
the probability of the state of circuit inputs. The leakage power
model has two main components, one which computes the probability
of a state for a particular FPGA circuit element, and the
other which computes the leakage of the FPGA circuit element for
a given input using analytical equations. This FPGA power model
is particularly important for rapidly analyzing various FPGA architectures
across different technology nodes.
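
The two-component structure described above can be illustrated with a short sketch. It is not the authors' model: the per-state leakage values, the signal probabilities and the assumption of statistically independent inputs are placeholders chosen for illustration. The first function computes the probability of each input state of a circuit element, and the second weights a per-state leakage figure by that probability.

```python
from itertools import product


def state_probabilities(p_one):
    """Probability of every input state, assuming independent inputs.

    p_one[i] is the (assumed) probability that input i is at logic 1.
    """
    probs = {}
    for state in product((0, 1), repeat=len(p_one)):
        p = 1.0
        for bit, p1 in zip(state, p_one):
            p *= p1 if bit else (1.0 - p1)
        probs[state] = p
    return probs


def expected_leakage(p_one, leakage_per_state):
    """Probability-weighted leakage of one circuit element."""
    probs = state_probabilities(p_one)
    return sum(probs[s] * leakage_per_state[s] for s in probs)


# Hypothetical per-state leakage (nA) of a 2-input element; in the paper
# these numbers would come from analytical equations, not from a table.
leak_na = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
avg_leak_na = expected_leakage([0.5, 0.5], leak_na)
```

Summing such per-element averages over all elements of an architecture would give a total leakage estimate, which is the kind of rapid evaluation the abstract refers to.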
-
Smart Bit-Width Allocation for Low Power Optimization in a SystemC Based ASIC Design Environment [p. 618]
-
A. Mallik, D. Sinha, P. Banerjee and H. Zhou
The modern era of embedded system design is geared towards
design of low-power systems. One way to reduce power in an
ASIC implementation is to reduce the bit-width precision of its
computation units. This paper describes algorithms to optimize the
bit-widths of fixed point variables for low power in a SystemC
design environment. We propose an algorithm for optimal bit-width
precision for two variables and a greedy heuristic which
works for any number of variables. The algorithms are used in the
automation of converting floating point SystemC programs into
ASIC synthesizable SystemC programs. Expected inputs are
profiled to estimate errors in the finite precision conversions.
Experimental results on the trade-offs between quantization error,
power consumption and hardware resources used are reported on
a set of four SystemC benchmarks that are mapped onto a 0.18
micron ASIC cell library from Artisan Components. We
demonstrate that it is possible to reduce the power consumption by
50% on average by allowing round-off errors to increase from
0.5% to 1%.
-
Value-Based Bit Ordering for Energy Optimization of On-Chip Global Signal Buses [p. 624]
-
K. Sundaresan and N. R. Mahapatra
In this paper, we present a technique that exploits the statistical
behavior of data values transmitted on global signal
buses to determine an energy-efficient ordering of bits that
minimizes the inter-wire coupling energy and also reduces
total bus energy. Statistics are collected for instruction and
data bus traffic from eight SPEC CPU2K benchmarks and
an optimization problem is formulated and solved optimally
using a publicly-available tool. Results obtained using the
optimal bit order on large non-overlapping test samples
from the same set of benchmarks show that, on average, adjacent
inter-wire coupling energies reduce by about 35.4%
for instruction buses and by about 21.6% for data buses using
the proposed technique.
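
A toy version of the bit-ordering problem is sketched below. The cost model (squared difference of the transitions on adjacent wires) and the exhaustive search over permutations are assumptions made for illustration; the paper formulates an optimization problem and solves it with a publicly available tool, which is not reproduced here. The trace is a made-up 4-bit example.

```python
from itertools import permutations


def coupling_cost(trace, order):
    """Sum over consecutive words of (delta_i - delta_j)^2 for adjacent wires.

    trace is a list of bit tuples; order[k] is the logical bit placed on
    physical wire k. A delta is -1, 0 or +1 per wire and per transition.
    """
    cost = 0
    for prev, curr in zip(trace, trace[1:]):
        delta = [curr[b] - prev[b] for b in order]
        cost += sum((a - b) ** 2 for a, b in zip(delta, delta[1:]))
    return cost


def best_order(trace, width):
    """Exhaustive search; only practical for very narrow buses."""
    return min(permutations(range(width)),
               key=lambda order: coupling_cost(trace, order))


# Hypothetical traffic: bits 0 and 2 always toggle together, bits 1 and 3 idle.
trace = [(0, 0, 0, 0), (1, 0, 1, 0), (0, 0, 0, 0), (1, 0, 1, 0)]
identity = (0, 1, 2, 3)
order = best_order(trace, 4)
```

In this example the search places the two correlated bits on neighbouring wires, which lowers the modelled coupling cost relative to the natural order.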
Moderators: F. Gaffiot, Ecole Centrale de Lyon, FR; M. Zwolinski, Southampton U, UK
-
Modeling Multiple Input Switching of CMOS Gates in DSM Technology Using HDMR [p. 626]
-
J. Sridharan and T. Chen
Continuing scaling of CMOS technology has allowed
aggressive pursuit of increased clock rates in DSM chips.
The ever shorter clock period has made switching times of
different inputs on a logic gate ever closer to each other. The
traditional method of static timing analysis assuming single input
switching is no longer adequate to capture gate level
delays accurately. Gate delay models considering multiple input
switching are needed for DSM chips. We propose a new method of
systematically modeling gate delays using the high dimensional
model representation (HDMR) method. The proposed method
models gate delays with respect to the relative signal arrival
times (RSAT) of its inputs. The systematic nature of the proposed
algorithm allows gate delay characterization with more inputs
switching close to each other. This paper will show, for the
first time, gate delay models of up to 5 inputs. In addition, the
proposed model is extended to allow the input signal slope and
process variations to be taken into account for statistical static
timing analysis. Our results show that the proposed HDMR model
gives an error between 2.2% and 12.9% for a variety of static and
dynamic logic gates as compared to SPICE results, depending
on the number of inputs involved in switching.
-
A Signal Theory Based Approach to the Statistical Analysis of Combinatorial Nanoelectronic Circuits [p. 632]
-
O. Soffke, P. Zipf, T. Murgan and M. Glesner
In this paper we present a method which allows the statistical
analysis of nanoelectronic Boolean networks with
respect to timing uncertainty and noise. All signals are
considered to be instationary random processes, which is
the most general signal representation. As one cannot deal
with random processes per se, we focus on certain statistical
properties which are propagated through networks of
Boolean gates yielding the instationary probability density
function (pdf) of each signal in the network. Finally, several
values of interest such as the error probability, the average
path delay or the average signal trace over time can be extracted
from these pdfs.
-
Using Conjugate Symmetries to Enhance Gate-Level Simulations [p. 638]
-
P. M. Maurer
State machine based simulation of Boolean functions is
substantially faster if the function being simulated is
symmetric. Unfortunately function symmetries are
comparatively rare. Conjugate symmetries can be used to
reduce the state space for functions that have no
detectable symmetries, allowing the benefits of symmetry
to be applied to a much wider class of functions.
Substantial improvements in simulation speed, of 30-40%,
have been realized using these techniques.
-
HDL Models of Ferromagnetic Core Hysteresis Using Timeless Discretisation of the Magnetic Slope [p. 644]
-
H. Al-Junaid and T. Kazmierski
A new methodology is presented to assure numerically
reliable integration of the magnetisation slope in the Jiles-Atherton
model of ferromagnetic core hysteresis. Two HDL
implementations of the technique are presented, one in SystemC
and the other in VHDL-AMS. The new model uses
timeless discretisation of the magnetisation slope equation
and provides superior accuracy and numerical stability especially
at the discontinuity points that occur in hysteresis.
Numerical integration of the magnetisation slope is carried
out by the model itself rather than by the underlying analogue
solver. The robustness of the model is demonstrated
by practical simulations of examples involving both major
and minor hysteresis loops.
Moderators: E. J. Marinissen, Philips Research, NL; A. Rueda, Seville U/IMSE-CNM, ES
-
An RF Improved Loopback for Test Time Reduction [p. 646]
-
M. Negreiros, L. Carro and A. A. Susin
In this work a method to improve the loopback test used
in RF analog circuits is described. The approach is
targeted to the SoC environment, being able to reuse
system resources in order to minimize the test overhead.
An RF sampler is used to observe spectral
characteristics of the RF signal path during loopback
operation. While able to improve the observability of the
signal path, the method also allows faster diagnosis
than conventional loopback tests, as the number of
transmitted symbols can be greatly reduced. Practical
results for a prototyped RF link at 860MHz are
presented in order to demonstrate the relevance of the
method.
-
Test Scheduling with Thermal Optimization for Network-on-Chip Systems Using Variable-Rate
On-Chip Clocking [p. 652]
-
C. Liu and V. Iyengar
Chip overheating has become a critical problem during test
of today's complex core-based systems. In this paper, we address
the overheating problem in Network-on-Chip (NoC) systems
through thermal optimization using variable-rate on-chip
clocking. We control the core temperatures during test scheduling
by assigning different test clock frequencies to cores. We
present two heuristics to achieve thermal optimization and reduced
test time. Experimental results for example NoC systems
show that the proposed method can guarantee thermal safety
and yield better thermal balance, compared to previous methods
using power constraints. Test application time is also reduced.
-
Online RF Checkers for Diagnosing Multi-Gigahertz Automatic Test Boards on Low Cost ATE
Platforms [p. 658]
-
G. Srinivasan, F. Taenzler and A. Chatterjee
Digital and analog centric load boards have well
established board check methodologies as part of their
"release to production requirements", while for RF load
boards this is still an open research issue. Potential faults
on RF load boards can be caused by mechanical/electrical defects
of components and sockets used on the board. Hence, we
propose a novel methodology to accurately check/diagnose
the RF path using only reflection measurements with
suitable terminations of these paths. These reflection
measurements and derived "checker equations" are used to
accurately diagnose the RF path on the load board during
production test at no extra test cost. A pilot test vehicle is
used to demonstrate the practical implementation and
production worthiness of the proposed board check and
diagnosis methodology.
-
Pseudorandom Functional BIST for Linear and Nonlinear MEMS [p. 664]
-
A. Dhayni, S. Mir, L. Rufer and A. Bounceur
Pseudorandom test techniques are widely used for
measuring the impulse response (IR) for linear devices and
Volterra kernels for nonlinear devices, especially in the
acoustics domain. This paper studies the application of
pseudorandom functional test techniques to linear and
nonlinear MEMS Built-In-Self-Test (BIST). We will first
present the classical pseudorandom BIST technique for Linear
Time Invariant (LTI) systems which is based on the evaluation
of the IR of the Device Under Test (DUT) stimulated by a
Maximal Length Sequence (MLS). Then we will introduce a
new type of pseudorandom stimuli called the Inverse-Repeat
Sequence (IRS) that provides better immunity to noise and
distortion than MLS. Next, we will illustrate the application of
these techniques for weakly nonlinear, purely nonlinear and
strongly nonlinear devices.
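
The classical MLS-based measurement mentioned above can be sketched as follows for an ideal, noise-free LTI device. This is an illustration only: the 4-bit LFSR, the circular-convolution model of the DUT and the impulse response h_true are assumptions, and the IRS stimulus and the nonlinear cases from the paper are not covered.

```python
def mls_bits(n_bits=4, taps=(4, 3)):
    """One period (2**n_bits - 1 bits) of a Fibonacci LFSR output.

    n_bits=4 with taps=(4, 3) is a maximal-length configuration; other
    sizes need taps taken from a table of primitive polynomials.
    """
    state = [1] * n_bits                  # any non-zero seed works
    out = []
    for _ in range(2 ** n_bits - 1):
        out.append(state[-1])
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]
    return out


def circular_response(h, x):
    """Steady-state output of an LTI system with impulse response h
    driven by the periodic stimulus x (circular convolution)."""
    n = len(x)
    return [sum(h[m] * x[(i - m) % n] for m in range(len(h)))
            for i in range(n)]


def estimate_ir(y, x):
    """Recover the impulse response by circular cross-correlation."""
    n = len(x)
    c = [sum(y[i] * x[(i - k) % n] for i in range(n)) for k in range(n)]
    # A +/-1 MLS has autocorrelation n at lag 0 and -1 at every other lag,
    # so c[k] = (n + 1) * h[k] - sum(h), and sum(c) equals sum(h).
    total = sum(c)
    return [(ck + total) / (n + 1) for ck in c]


stimulus = [1 - 2 * b for b in mls_bits()]   # map bits {0, 1} to {+1, -1}
h_true = [1.0, 0.5, -0.25, 0.125]            # hypothetical DUT response
h_est = estimate_ir(circular_response(h_true, stimulus), stimulus)
```

In a BIST setting the estimated response would then be compared against a tolerance band; that decision step is omitted here.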
-
On-Chip 8GHz Non-Periodic High-Swing Noise Detector [p. 670]
-
M. Abbas, M. Ikeda and K. Asada
In this paper we present an overview of an on-chip
noise detection circuit. Mainly, this work is different from
the previous works concerning on-chip noise
measurement in one or more of the following: First: it
does not assume specific noise properties such as
periodicity. Second: the requested bandwidth of the output
channel can be adjusted freely; therefore, the user can
avoid the effect of on-chip parasitics and the need for sophisticated
off-chip monitoring tools. Third: the detector is
equipped with an on-chip voltage divider, which enables
measuring the high and low swing fluctuations accurately.
Therefore, the detector is suitable to measure non-periodic/single-event
noise for the purpose of reliability
evaluation and performance modeling. A slower version
of the detector is implemented in a test chip using Hitachi
0.18 μm technology.
Moderators: M. Broy, TU Munich, DE; J. Sztipanovits, ISIS - Vanderbilt U, US
-
Battery-Aware Code Partitioning for a Text to Speech System [p. 672]
-
A. Lahiri, A. Basu, M. Choudhury and S. Mitra
The advent of multi-core embedded processors has
brought along new challenges for embedded system
design. This paper presents an efficient, battery aware,
code partitioning technique for a text to speech system,
which is executed on a multi-core embedded processor.
The system achieves significant performance
improvements both in terms of execution time as well as
battery lifetimes. The mentioned technique provides a new
paradigm for battery aware embedded system design
which can be easily extended to other applications.
-
Performance Optimization for Energy-Aware Adaptive Checkpointing in Embedded Real-Time Systems [p. 678]
-
Z. Li, H. Chen and S. Yu
Using additional store-checkpoints (SCPs) and
compare-checkpoints (CCPs), we present an adaptive
checkpointing scheme for double modular redundancy (DMR) in
this paper. The proposed approach can dynamically adjust
the checkpoint intervals. We also design methods to
calculate the optimal numbers of checkpoints, which can
minimize the average execution time of tasks. Further, the
adaptive checkpointing is combined with the DVS (dynamic
voltage scaling) scheme to achieve energy reduction.
Simulation results show that, compared with the
previous methods, the proposed approach significantly
increases the likelihood of timely task completion and
reduces energy consumption in the presence of faults.
-
Software Annotations for Power Optimization on Mobile Devices [p. 684]
-
R. Cornea, A. Nicolau and N. Dutt
Modern applications for mobile devices, such as multimedia
video/audio, often exhibit a common behavior: they
process streams of incoming data in a regular, predictable
way. The runtime behavior of these applications can be accurately
estimated most of the time by analyzing the data to
be processed and annotating the stream with the information
collected. We introduce a software annotation based
approach to power optimization and demonstrate its application
on a backlight adjustment technique for LCD displays
during multimedia playback, for improved battery
life and user experience. Results from analysis and simulation
show that up to 65% of backlight power can be saved
through our technique, with minimal or no visible quality
degradation.
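
The annotation idea can be illustrated with a minimal sketch. It assumes 8-bit luminance samples, a perceived brightness proportional to backlight level times pixel value, and a lossless policy in which the backlight follows the brightest pixel of each frame. All function names are hypothetical, and the paper's actual policy and quality metric are not reproduced.

```python
def annotate(frames):
    """Offline pass: record the peak luminance of each frame (the annotation)."""
    return [max(frame) for frame in frames]


def compensate(frame, peak, full=255):
    """Playback pass: dim the backlight to peak/full and brighten the pixels.

    Returns (backlight_level, compensated_frame), where backlight_level is
    a fraction of full power.
    """
    if peak == 0:
        return 0.0, list(frame)           # all-black frame: backlight off
    level = peak / full
    scaled = [min(full, round(p * full / peak)) for p in frame]
    return level, scaled


# Hypothetical two-frame stream of luminance samples.
frames = [[0, 32, 128], [255, 10, 0]]
peaks = annotate(frames)
level, scaled = compensate(frames[0], peaks[0])
```

With these assumptions a dark frame such as the first one runs at about half backlight power, while a frame containing a full-white pixel leaves the backlight unchanged.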
-
Dynamic Partitioning of Processing and Memory Resources in Embedded MPSoC Architectures [p. 690]
-
L. Xue, O. Ozturk, F. Li, M. Kandemir and I. Kolcu
Current trends indicate that multiprocessor-system-on-chip
(MPSoC) architectures are being increasingly
used in building complex embedded systems. While circuit/architectural
support for MPSoC based systems
is making significant strides, programming these devices
and providing suitable software support (e.g., compiler
and operating systems) seems to be a tougher problem.
This is because either programmers or compilers will
have to make code explicitly parallel to run on these systems.
An additional difficulty occurs when multiple applications
use an MPSoC at the same time, because MPSoC
resources should be partitioned across these applications
carefully. This paper explores a proactive resource
partitioning scheme for parallel applications simultaneously
exercising the same MPSoC system. The proposed
approach has two major components. The first component
includes an offline preprocessing of applications which
gives us an estimated profile for each application. Each application
to be executed on our MPSoC is profiled and
annotated with the profile information. The second component
of our approach is an online resource partitioning,
which partitions both the processing cores (i.e., computation
resources) and on-chip memory space (i.e.,
storage resource) among simultaneously-executing applications.
Our experimental evaluation with this partitioner
shows that it generates much better results than conventional
operating system based resource management. The
results also reveal that both memory partitioning and processor
partitioning are very important for obtaining the
best results.
-
Activity Clustering for Leakage Management in SPMs [p. 696]
-
M. Kandemir, G. Chen, F. Li, M. J. Irwin and I. Kolcu
This paper proposes a compiler-based leakage optimization
strategy for on-chip scratch-pad memories
(SPMs). The idea is to keep only a small set of SPM regions
active at a given time and pre-activate SPM regions
based on the compiler-extracted data access pattern.
Our strategy, called activity clustering, increases the
length of the idle periods of SPM regions by clustering accesses
to a small set of regions at a time. It thus allows an
SPM to take better advantage of the underlying leakage optimization
mechanism.
-
Adaptive Data Placement in an Embedded Multiprocessor Thread Library [p. 698]
-
P. Stanley-Marbell, K. Lahiri and A. Raghunathan
Embedded multiprocessors pose new challenges in
the design and implementation of embedded software. This has
led to the need for programming interfaces that expose the
capabilities of the underlying hardware. In addition, for systems
that implement applications consisting of multiple concurrent
threads of computation, the optimized management of inter-thread
communication is crucial for realizing high performance.
This paper presents the design of an application-adaptive
thread library that conforms to the IEEE POSIX 1003.1c
threading standard (Pthreads). The library adapts the placement
of both explicitly marked application data objects, as well as
implicitly created data objects, in a physically distributed on-chip
memory architecture, based on the application’s data access
characteristics.
Moderators: J. Madsen, TU Denmark, DK; J. Teich, Erlangen-Nuremberg U, DE
-
COSMECA: Application Specific Co-Synthesis of Memory and Communication Architectures for MPSoC [p. 700]
-
S. Pasricha and N. Dutt
Memory and communication architectures have a significant
impact on the cost, performance, and time-to-market of complex
multi-processor system-on-chip (MPSoC) designs. The memory
architecture dictates most of the data traffic flow in a design, which in
turn influences the design of the communication architecture. Thus
there is a need to co-synthesize the memory and communication
architectures to avoid making sub-optimal design decisions. This is in
contrast to traditional platform-based design approaches where
memory and communication architectures are synthesized separately.
In this paper, we propose an automated application specific co-synthesis
methodology for memory and communication architectures
(COSMECA) in MPSoC designs. The primary objective is to design a
communication architecture having the least number of busses, which
satisfies performance and memory area constraints, while the
secondary objective is to reduce the memory area cost. Results of
applying COSMECA to several industrial strength MPSoC
applications from the networking domain indicate a saving of as
much as 40% in number of busses and 29% in memory area
compared to the traditional approach.
-
Synthesis of Fault-Tolerant Schedules with Transparency/Performance Trade-Offs for Distributed
Embedded Systems [p. 706]
-
V. Izosimov, P. Pop, P. Eles and Z. Peng
In this paper we present an approach to the scheduling of fault-tolerant
embedded systems for safety-critical applications. Processes and messages
are statically scheduled, and we use process re-execution for recovering
from multiple transient faults. If process recovery is
performed such that the operation of other processes is not affected, we
call it transparent recovery. Although transparent recovery has the advantages
of fault containment, improved debuggability and less memory
needed to store the fault-tolerant schedules, it will introduce delays
that can violate the timing constraints of the application. We propose a
novel algorithm for the synthesis of fault-tolerant schedules that can
handle the transparency/performance trade-offs imposed by the designer,
and makes use of the fault-occurrence information to reduce the
overhead due to fault tolerance. We model the application as a conditional
process graph, where the fault occurrence information is represented
as conditional edges and the transparent recovery is captured
using synchronization nodes.
-
Communication Architecture Optimization: Making the Shortest Path Shorter in Regular Networks-on-Chip [p. 712]
-
U. Y. Ogras, R. Marculescu, H. G. Lee and N. Chang
Network-on-Chip (NoC)-based communication represents a
promising solution to complex on-chip communication problems.
Due to their regular structure, mesh-like NoC architectures
have become very popular recently. However, they have
poor topological properties such as long inter-node distances.
In this paper, we address this very issue and explore
the potential of partial NoC customization to improve both
static and dynamic properties of the network significantly,
while minimally affecting its regularity. Precise energy measurements
on an FPGA prototype show that the improvement
in network properties is achieved without a significant penalty
in area and communication energy consumption.
-
Buffer Space Optimisation with Communication Synthesis and Traffic Shaping for NoCs [p. 718]
-
S. Manolache, P. Eles and Z. Peng
This paper addresses communication optimisation for applications
implemented on networks-on-chip. The mapping of data packets
to network links and the timing of the release of the packets
are critical for avoiding destination contention. This reduces
the demand for communication buffers with obvious advantages in
chip area and energy savings. We propose a buffer need analysis
approach and a strategy for communication synthesis and packet
release timing with minimum communication buffer demand that
guarantees worst-case response times.
-
Cooptimization of Interface Hardware and Software for I/O Controllers [p. 724]
-
K. J. Lin, S. H. Huang and S. C. Fang
The allocation of device variables on I/O registers affects
the code size and performance of an I/O device driver.
This work seeks the allocation with the minimal software
or hardware cost in a hardware/software codesign
environment. The problems of exact minimization under
constraints are formulated as zero-one integer linear
programming problems. Heuristic algorithms based on
iterative refinement are also proposed. The proposed
design methodology was implemented in C language.
Compared with industrial designs, the system can obtain
design alternatives that reduce both software and
hardware costs.
Organisers: H. Meyr, RWTH Aachen U, DE and CoWare Inc, US; G. Fettweis
Moderator: L. Gaszi, Infineon, DE
-
Cross Disciplinary Aspects (4G Wireless Special Day) [p. 726]
-
Speakers: T. G. Noll and U. Lambrette
This session addresses the inter-dependency of physical and system level design and the economic issues of 4G implementations.
-
SoC - Fuelling the Hopes of the Mobile Industry [p. 727]
-
U. Lambrette
Status. Supply and Demand in the mobile operator seem almost decoupled - New
technologies and ever increasing bandwidths compete for the attention of CTOs.
On higher network layers, IMS and VoIP are key technologies shaking up the mobile
value chain. New contenders, WiMax, WiFi, jointly with independent VoIP based
operators threaten the value proposition of mobile altogether. Mobile technology is
still evolving and the rate of innovation is high.
The marketing department, on the other hand, focuses on value creation, large
bundles, simplified terminals, format competition, thereby slowly coming to terms
with the likely reduction of ARPU. Marketing has to deal with all the features of a
mature market. Surely, some experiments on mobile broadband propositions are
being launched, for example Mobile TV, music download, browsing, but none of
them has yet delivered significant ARPU contributions. Finally, mobile
communications are mainly peer-to-peer communications, i.e. too much player
differentiation implies limited ability to communicate, an effect that has significantly
hampered e.g. the success of MMS.
The impact of all this on mobile terminals, specifically, is that diversity of
requirements in terms of supported radio and coding standards, applications, speed
and power efficiency increases dramatically. Operators will carry a high cost burden
against a background of not yet amortized investment into licenses and network
equipment.
Moderators: F. Balarin, Cadence Berkeley Laboratories, US; H. Hsieh, UC Riverside, US
-
Integrated Data Relocation and Bus Reconfiguration for Adaptive System-on-Chip Platforms [p. 728]
-
K. Sekar, K. Lahiri, A. Raghunathan and S. Dey
Dynamic variations in application functionality
and performance requirements can lead to the imposition
of widely disparate requirements on System-on-Chip (SoC)
platform hardware over time. This has led to interest in
the design and use of adaptive SoC platforms that are
capable of providing high performance in the face of such
variations. Recent advances in circuits and architectures are
enabling platforms that contain various mechanisms for runtime
adaptation. However, the problem of exploiting such
configurability in a coordinated manner at the system level
remains a challenging task.
In this work, we focus on two configurable subsystems
of SoC platforms that play a crucial role in determining
overall system performance, namely, the on-chip communication
architecture, and the on-chip memory architecture.
Using detailed case studies, we demonstrate the limitations
of designs in which the architectural configuration of a bus-based
communication architecture and the placement of data
in memory are statically optimized, and those in which each
is customized separately, without considering their interdependence.
We propose an integrated methodology for dynamically
relocating on-chip data and reconfiguring the communication
architecture, and discuss the necessary hardware support.
Experiments conducted on an SoC platform that integrates
decoders for the UMTS (3G) and IEEE 802.11a (Wireless LAN)
standards demonstrate that the proposed integrated adaptation
technique helps boost the maximum achievable performance
by up to 32% over the best statically optimized design.
-
FPGA Architecture Characterization for System Level Performance Analysis [p. 734]
-
D. Densmore, A. Donlin and A. Sangiovanni-Vincentelli
We present a modular and scalable approach for automatically
extracting actual performance information from a set
of FPGA-based architecture topologies. This information is
used dynamically during simulation to support performance
analysis in a System Level Design environment. The topologies
capture systems representing common designs using
FPGA technologies of interest. Their characterization is
done only once; the results are then used during simulation
of actual systems being explored by the designer. Our approach
allows a rich set of FPGA architectures to be explored
accurately at various abstraction levels to seek optimized solutions
with minimal effort by the designer. To offer an
industrial example of our results, we describe the characterization
process for Xilinx CoreConnect-based platforms and
the integration of this data into the Metropolis modeling
environment.
-
Dynamic Data Type Refinement Methodology for Systematic Performance-Energy Design
Exploration of Network Applications [p. 740]
-
A. Bartzas, S. Mamagkakis, G. Pouiklis, D. Atienza, F. Catthoor, D. Soudris and A. Thanailakis
Network applications are becoming increasingly popular
in the embedded systems domain requiring high performance,
which leads to high energy consumption. In networks,
it is observed that due to their inherent dynamic nature
the dynamic memory subsystem is a main contributor
to the overall energy consumption and performance. This
paper presents a new systematic methodology, generating
performance-energy trade-offs by implementing Dynamic
Data Types (DDTs), targeting network applications. The
proposed methodology consists of: (i) the application-level
DDT exploration, (ii) the network-level DDT exploration
and (iii) the Pareto-level DDT exploration. The methodology,
supported by an automated tool, offers the designer a
set of optimal dynamic data type design solutions. The effectiveness
of the proposed methodology is tested on four
representative real-life case studies. By applying the second
step, it is proved that energy savings up to 80% and performance
improvement up to 22% (compared to the original
implementations of the benchmarks) can be achieved. Additional
energy and performance gains can be achieved and a
wide range of possible trade-offs among our Pareto-optimal
design choices are obtained, by applying the third step. We
achieved up to 93% reduction in energy consumption and
up to 48% increase in performance.
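
The final exploration step relies on keeping only Pareto-optimal design points. A minimal sketch of such a filter is given below, assuming each candidate is summarised by a hypothetical (energy, execution time) pair where lower is better for both; the numbers are invented and the paper's tool flow is not reproduced.

```python
def pareto_front(points):
    """Keep the points that no other point dominates.

    A point q dominates p when q is no worse than p in both metrics
    and differs from p.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (energy, time) results for five DDT implementations.
candidates = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]
front = pareto_front(candidates)
```

The designer would then pick one point from the resulting front according to the energy or performance budget of the target system.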
-
Customization of Application Specific Heterogeneous Multi-Pipeline Processors [p. 746]
-
S. Radhakrishnan, H. Guo and S. Parameswaran
In this paper we propose Application Specific Instruction Set Processors
with heterogeneous multiple pipelines to efficiently exploit
the available parallelism at instruction level. We have developed a
design system based on the Thumb processor architecture. Given an
application specified in C language, the design system can generate
a processor with a number of pipelines specifically suitable to the application,
and the parallel code associated with the processor. Each
pipeline in such a processor is customized, and implements its own
special instruction set so that the instructions can be executed in parallel
with low hardware overhead. Our simulations and experiments
with a group of benchmarks, largely from the MiBench suite, show that
on average, 77% performance improvement can be achieved compared
to a single pipeline ASIP, with the overheads of 49% on area,
51% on leakage power, 17% on switching activity, and 69% on code
size.
-
Impact of Bit-Width Specification on the Memory Hierarchy for a Real-Time Video Processing System [p. 752]
-
B. Thornberg and M. O'Nils
Bit-width specification will affect the total memory
storage requirement of a video processing system.
However, what is not so obvious is that the bit-width
specification will also affect the design of the memory
hierarchy. Experiments with a real-life surveillance
system show how the optimal allocation of shift registers
for the storage of intermediate results is sensitive to bit-widths.
It is shown that the total on-chip memory storage
requirement can be reduced by 61 percent compared to a
non-optimal design.
-
Efficient Factorization of DSP Transforms Using Taylor Expansion Diagram [p. 754]
-
J. Guillot, E. Boutillon, Q. Ren, M. Ciesielski, D. Gomez-Prado and S. Askar
This paper describes an efficient method to perform factorization
of DSP transforms based on Taylor Expansion
Diagram (TED). It is shown that TED can efficiently represent
and manipulate mathematical expressions. We demonstrate
that it enables efficient factorization of arithmetic expressions
of DSP transforms, resulting in a simplification of
the computation.
Moderators: P. Groeneveld, TU Eindhoven, NL; A. Kuehlmann, Cadence Berkeley Laboratories, US
-
Integrated Placement and Skew Optimization for Rotary Clocking [p. 756]
-
G. Venkataraman, J. Hu, F. Liu and C.-N. Sze
The clock distribution network is a key component of any
synchronous VLSI design. As technology moves into the
nanometer era, innovative clocking techniques are required
to solve the power dissipation and variability issues. Rotary
clocking is a novel technique which employs unterminated
rings formed by differential transmission lines to save
power and reduce skew variability. Despite its appealing advantages,
rotary clocking requires latch locations to match
pre-designed clock skew on rotary clock rings. This requirement
is a difficult chicken-and-egg problem which prevents
its wide application. In this work, we propose an integrated
placement and skew scheduling methodology to break this
hurdle, making rotary clocking compatible with practical
design flows. A network flow based latch assignment algorithm
and a cost-driven skew optimization algorithm are
developed. Experiments show that our method can generate
chip placements which satisfy the unique requirements
of rotary clocks, without sacrificing design quality. By enabling
concurrent clock network and placement design, our
method can also be applied to other clocking methodologies.
-
Associative Skew Clock Routing for Difficult Instances [p. 762]
-
M.-S. Kim and J. Hu
In clock network synthesis, sometimes skew constraints
are required only within certain groups of clock sinks and
do not exist between different groups. This is the so-called
associative skew clock routing problem. Although the number
of constraints is reduced, the problem becomes more difficult
to solve due to the enlarged solution space. Perhaps the
only previous work used a very primitive delay model
and cannot handle difficult instances in which sink groups
are intermingled. We reuse existing techniques to solve this
problem, including the difficult instances, based on a more
accurate and popular delay model. Experimental results
show that our algorithm can reduce the total clock routing
wirelength by 12% on average compared to greedy-DME
which is one of the best zero skew routing algorithms.
-
Efficient Timing-Driven Incremental Routing for VLSI Circuits Using DFs and Localized
Slack-Satisfaction Computations [p. 768]
-
S. Dutt and H. Arslan
In current very deep submicron (VDSM) circuits, incremental
routing is crucial to incorporating engineering change orders (ECOs)
late in the design cycle. In this paper, we address the important incremental
routing objective of satisfying timing constraints in high-speed designs
while minimizing wirelength, vias and routing layers. We develop an effective
timing-driven (TD) incremental routing algorithm TIDE for ASIC
circuits that addresses the dual goals of time-efficiency, and slack satisfaction
coupled with effective optimizations. There are three main novelties
in our approach: (i) a technique for locally determining slack satisfaction
of the entire routing tree when either a new pin is added to the tree or an
interconnect in it is re-routed - this technique is used in both the global and
detailed routing phases; (ii) an interval-intersection and tree-truncation algorithm,
used in global routing, for quickly determining a near-minimum-length
slack-satisfying interconnection of a pin to a partial routing tree;
(iii) a depth-first-search process, used in detailed routing, that allows new
nets to bump and re-route existing nets in a controlled manner in order
to obtain better optimized designs. Experimental results show that within
the constraint of routing all nets in only two metal layers, TIDE succeeds in
routing more than 94% of ECO-generated nets, and also that its failure rate
is 7 and 6.7 times less than that of the TD versions of previous incremental
routers Standard (Std) and Ripup&Reroute (R&R), respectively. It is also
able to route nets with very little (3.4%) slack violations, while the other
two methods have appreciable slack violations (16-19%). TIDE is about 2
times slower than the simple TD-Std method, but more than 3 times faster
than TD-R&R.
Moderators: D. Pradhan, Bristol U, UK; K. Chakrabarty, Duke U, US
-
Defect Tolerance of QCA Tiles [p. 774]
-
J. Huang, M. Momenzadeh and F. Lombardi
Quantum dot Cellular Automata (QCA) is one of the
promising technologies for nano scale implementation. The
operation of QCA systems is based on a new paradigm generally
referred to as processing-by-wire (PBW). This paper
analyzes the defect tolerance properties of PBW when tiles
are employed using molecular QCA cells. Based on a 3x3
QCA block, with different input/output arrangements, different
tiles are analyzed and simulated using a coherence vector engine.
The functional characterization and polarization level of
these tiles for undeposited cell defects are
reported. It is shown that novel features of PBW are possible
due to spatial redundancy and QCA tiles are robust and
inherently defect tolerant.
Index words: QCA, defect tolerance, emerging technologies.
-
Temporal Performance Degradation under NBTI: Estimation and Design for Improved Reliability
of Nanoscale Circuits [p. 780]
-
B. C. Paul, K. Kang, H. Kufluoglu, M. A. Alam and K. Roy
Negative Bias Temperature Instability (NBTI) has become
one of the major causes for temporal reliability
degradation of nanoscale circuits. In this paper, we analyze
the temporal delay degradation of logic circuits due to
NBTI. We show that knowing the threshold voltage degradation
of a single transistor due to NBTI, one can predict
the performance degradation of a circuit with a reasonable
degree of accuracy. We also propose a sizing algorithm
taking NBTI-affected performance degradation
into account to ensure the reliability of nanoscale circuits
for a given period of time. Experimental results on several
benchmark circuits show that with an average of 8.7% increase
in area one can ensure reliable performance of circuits
for 10 years.
-
Novel Designs for Thermally Robust Coplanar Crossing in QCA [p. 786]
-
S. Bhanja, M. Ottavi, F. Lombardi and S. Pontarelli
In this paper, different circuit arrangements of Quantum-dot
Cellular Automata (QCA) are proposed for the so-called
coplanar crossing. These arrangements exploit the majority
voting properties of QCA to allow a robust crossing of
wires on the Cartesian plane. This is accomplished using
enlarged lines and voting. Using a Bayesian Network (BN)
based simulator, new results are provided to evaluate the
robustness of these arrangements to so-called kinks under thermal
variations. The BN simulator provides fast and reliable
computation of the signal polarization versus normalized
temperature. It is shown that by modifying the layout, a
higher polarization level can be achieved in the routed signal
by utilizing the proposed QCA arrangements.
-
Designing MRF Based Error Correcting Circuits for Memory Elements [p. 792]
-
K. Nepal, R. I. Bahar, J. Mundy, W. R. Patterson and A. Zaslavsky
As devices are scaled to the nanoscale regime, it is clear that
future nanodevices will be plagued by higher soft error rates
and reduced noise margins. Traditional implementations
of error correcting codes (ECC) can add to the reliability
of systems but can be ineffective in highly noisy operating
conditions. This paper proposes an implementation of ECC
based on the theory of Markov random fields (MRF). The
MRF probabilistic model is mapped onto CMOS circuitry,
using feedback between transistors to reinforce the correct
joint probability of valid logical states. We show that our
MRF approach provides superior noise immunity for memory
systems that operate under highly noisy conditions.
Moderators: P. Puschner, TU Vienna, AT; S. Goddard, Nebraska U Lincoln, US
-
A Time-Triggered Ethernet (TTE) Switch [p. 794]
-
K. Steinhammer, P. Grillinger, A. Ademaj and H. Kopetz
This paper presents the design of a Time-Triggered Ethernet
(TTE) Switch, which is one of the core units of the
Time-Triggered Ethernet system. Time-triggered Ethernet is
a communication architecture intended to support event-triggered
and time-triggered traffic in a single communication
system. The TTE Switch distinguishes between two
classes of traffic. The Event-Triggered (ET) traffic is handled
in conformance with the existing Ethernet standard,
while the Time-Triggered (TT) traffic is transmitted with
temporal guarantees. A TTE Switch is used in the Time-Triggered
Ethernet system for exchanging time-triggered
messages in a time-predictable way while continuing the
support of standard Ethernet traffic in order to use existing
networking protocols such as IP, UDP or IPX without
any modifications.
In this paper we present the mechanisms the TTE Switch
uses to guarantee a constant transmission delay for time-triggered
traffic. Also an experimental validation of these
mechanisms is given.
-
A Time Predictable Java Processor [p. 800]
-
M. Schoeberl
This paper presents a Java processor, called JOP, designed
for time-predictable execution of real-time tasks. JOP
is the implementation of the Java virtual machine in hardware.
We propose a processor architecture that favors low
worst-case execution time (WCET) over average case performance.
The resulting processor is an easy target for the
low-level WCET analysis.
-
Optimizing the Generation of Object-Oriented Real-Time Embedded Applications Based on the
Real-Time Specification for Java [p. 806]
-
M. A. Wehrmeister, C. E. Pereira and L. B. Becker
The object-oriented paradigm has become popular over
the last years due to its characteristics that help manage
the complexity in computer systems design. This feature
also attracted the embedded systems community, as
today's embedded systems need to cope with several
complex functionalities as well as timing, power, and
area restrictions. Such a scenario has promoted the use of
the Java language and its real-time extension (RTSJ) for
embedded real-time systems design. Nevertheless, the
RTSJ was not primarily designed to be used within the
embedded domain. This paper presents an approach to
optimize the use of the RTSJ for the development of embedded
real-time systems. Firstly, it describes how to
design real-time embedded applications using an API
based on RTSJ. Secondly, it shows how the generated
code is optimized to cope with the tight resources available,
without interfering with the mandatory timing predictability
of the generated system. Finally it discusses an
approach to synthesize the applications on top of affordable
FPGAs. The approach used to synthesize the embedded
real-time system ensures a bounded timing behavior
of the object-oriented aspects of the application, like
the polymorphism mechanism and read/write access to
object's data fields.
Moderators: G. Cabodi, Politecnico di Torino, IT; R. Drechsler, Bremen U, DE
-
Quantifier Structure in Search Based Procedures for QBFs [p. 812]
-
E. Giunchiglia, M. Narizzano and A. Tacchella
The best currently available solvers for Quantified
Boolean Formulas (QBFs) process their input in prenex
form, i.e., all the quantifiers have to appear in the prefix
of the formula separated from the purely propositional part
representing the matrix. However, in many QBFs deriving
from applications, the propositional part is intertwined with
the quantifier structure. To tackle this problem, the standard
approach is to first convert them into prenex form, thereby
losing structural information about the prefix.
In this paper we show that conversion to prenex form is
not necessary, i.e., that it is relatively easy to extend current
search based solvers in order to exploit the original
quantifier structure, i.e., to handle non prenex QBFs. Further,
we show that the conversion can lead to the exploration
of search spaces bigger than the space explored by solvers
handling non prenex QBFs. To validate our claims, we
implemented our ideas in the state-of-the-art search based
solver QUBE, and conducted an extensive experimental
analysis. The results show that very substantial speedups
can be obtained.
-
Strong Conflict Analysis for Propositional Satisfiability [p. 818]
-
H. S. Jin and F. Somenzi
We present a new approach to conflict analysis for propositional
satisfiability solvers based on the DPLL procedure and clause recording.
When conditions warrant it, we generate a supplemental clause
from a conflict. This clause does not contain a unique implication
point, and therefore cannot replace the standard conflict clause.
However, it is very effective at reducing excessive depth in the implication
graphs and at preventing repeated conflicts on the same
clause. Experimental results show consistent improvements over
state-of-the-art solvers and confirm our analysis of why the new
technique works.
-
Equivalence Verification of Arithmetic Datapaths with Multiple Word-Length Operands [p. 824]
-
N. Shekhar, P. Kalla and F. Enescu
This paper addresses the problem of equivalence
verification of RTL descriptions that implement
arithmetic computations (add, mult, shift) over bit-vectors
that have differing bit-widths. Such designs are
found in many DSP applications where the widths of input
and output bit-vectors are dictated by the desired precision.
A bit-vector of size n can represent integer values
from 0 to $2^n - 1$, i.e. integers reduced modulo $2^n$.
Therefore, to verify bit-vector arithmetic over multiple word-length
operands, we model the RTL datapath as a polynomial
function from $\mathbb{Z}_{2^{n_1}} \times \mathbb{Z}_{2^{n_2}} \times \cdots \times \mathbb{Z}_{2^{n_d}}$ to
$\mathbb{Z}_{2^m}$. Subsequently, RTL equivalence $f \equiv g$ is solved by
proving whether $(f - g) \equiv 0$ over such mappings. Exploiting concepts
from number theory and commutative algebra, a systematic, complete
algorithmic procedure is derived for this purpose. Experimentally, we
demonstrate how this approach can be applied within a practical CAD
setting. Using our approach, we verify a set of arithmetic datapaths
at RTL where contemporary approaches prove to be infeasible.
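
As a rough illustration of the equivalence condition stated in this abstract, the sketch below checks by brute force whether (f - g) is congruent to 0 modulo 2^m for every input of the given bit-widths. This is not the authors' algebraic procedure, which avoids enumeration; the helper name `equivalent_mod`, the datapaths `f`, `g`, `h` and the chosen widths are all assumptions made for the example.

```python
from itertools import product

def equivalent_mod(f, g, input_widths, out_width):
    """Exhaustively check f(x) == g(x) (mod 2**out_width) for all inputs,
    where input i ranges over 0 .. 2**input_widths[i] - 1.
    Brute force only: the cost is exponential in the total input width."""
    modulus = 1 << out_width
    ranges = [range(1 << w) for w in input_widths]
    return all((f(*xs) - g(*xs)) % modulus == 0 for xs in product(*ranges))

# Hypothetical datapaths with a 4-bit input x and a 3-bit input y.
def f(x, y):
    return 4 * x * x + y

def g(x, y):
    # Differs from f by 4*x*(x+1); x*(x+1) is always even,
    # so the difference is a multiple of 8.
    return y - 4 * x

def h(x, y):
    return y + 1

same_3bit = equivalent_mod(f, g, [4, 3], 3)   # compared modulo 8
same_4bit = equivalent_mod(f, g, [4, 3], 4)   # compared modulo 16
```

The two calls differ only in output width, which shows why the word-length of the result matters: two descriptions can agree as 3-bit outputs and still differ as 4-bit outputs.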
Speakers: G. Fettweis, TU Dresden, DE; H. Meyr, RWTH Aachen U, DE and CoWare Inc
-
4G Applications, Architectures, Design Methodology and Tools for MPSoC [p. 830]
-
Motivation
The telecommunications and semiconductor industries are inseparably linked. More than sixty
percent of the revenue of the semiconductor industry is attributed to communications and
multimedia applications. On the other hand wireless communications (and multimedia)
applications demand ultrahigh computational performance and therefore have become
technology drivers for the semiconductor industry. The single most important aspect of
designing and successfully deploying 4G is its cross-disciplinary character, ranging from
semiconductors to services to deployment to business.
Moderators: C. Silvano, Politecnico di Milano, IT; M. Poncino, Politecnico di Torino, IT
-
Thermal Resilient Bounded-Skew Clock Tree Optimization Methodology [p. 832]
-
A. Chakraborty, P. Sithambaram, K. Duraisami, A. Macii, E. Macii and M. Poncino
The existence of non-uniform thermal gradients on the substrate
in high performance IC's can significantly impact the
performance of global on-chip interconnects. This issue is
further exacerbated by the aggressive scaling and other factors
such as dynamic power management schemes and non-uniform
gate level switching activity.
In high-performance systems, one of the most important
problems is clock skew minimization since it has a direct
impact on the maximum operating frequency of the system.
Since clocks are routed across the entire chip, the presence
of thermal gradients can significantly alter their characteristics
because wire resistance increases linearly as the temperature
increases. This often results in failure to meet original
timing constraints thereby rendering the original topology
unusable. Therefore it is necessary to perform a temperature
aware re-embedding of the original topology to meet
timing under these temperature effects.
This work primarily explores these issues by proposing two
algorithms that re-structure an existing clock tree topology
to compensate for such temperature effects and as a result
also meet timing constraints.
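
To make the mechanism in this abstract concrete, here is a minimal sketch of how a linear temperature dependence of wire resistance turns a thermal gradient into clock skew. It uses a lumped Elmore delay model, which is an assumption of this example and not necessarily the model used by the authors. The temperature coefficient `beta`, the reference temperature, and all segment values are illustrative placeholders.

```python
def wire_resistance(r0, temp, t_ref=25.0, beta=0.004):
    """Linear model: R(T) = R0 * (1 + beta * (T - T_ref)).
    beta (per degree C) is an assumed illustrative coefficient."""
    return r0 * (1.0 + beta * (temp - t_ref))

def elmore_delay(segments, load_cap):
    """Elmore delay of a chain of lumped RC segments from source to sink.
    segments: list of (resistance, capacitance); load_cap: sink capacitance."""
    delay = 0.0
    downstream = load_cap + sum(c for _, c in segments)
    for r, c in segments:
        # Each resistance charges its own capacitance and everything after it.
        delay += r * downstream
        downstream -= c
    return delay

def branch_delay(temp, n_segments=2, r0=100.0, c=10e-15, load_cap=5e-15):
    """Delay of one clock branch whose wires all sit at temperature temp."""
    segs = [(wire_resistance(r0, temp), c) for _ in range(n_segments)]
    return elmore_delay(segs, load_cap)

# Two geometrically identical branches: one over a cool region, one over a hot spot.
skew = branch_delay(85.0) - branch_delay(25.0)
```

The two branches are identical at a uniform temperature, so their skew is zero by construction; the non-zero `skew` comes entirely from the thermal gradient. This is the effect a temperature-aware re-embedding would have to compensate.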
-
Exploring "Temperature-Aware" Design in Low-Power MPSoCs [p. 838]
-
G. Paci, P. Marchal, F. Poletti and L. Benini
The power density inside high performance systems continues to
rise with every process technology generation, thereby increasing the
operating temperature and creating "hot spots" on the die. As a result,
the performance, reliability and power consumption of the system
degrade. To avoid these "hot spots", "temperature-aware" design
has become a must. For low-power embedded systems though,
it is not clear whether similar thermal problems occur. These systems
have very different characteristics from the high performance
ones: they consume a hundred times less power, they are based on a
multi-processor architecture with lots of embedded memory and rely
on cheap packaging solutions. In this paper, we investigate the need
for temperature-aware design in low-power systems-on-a-chip and
provide guidelines to delimit the conditions for which temperature-aware
design is needed.
-
Adaptive Chip-Package Thermal Analysis for Synthesis and Design [p. 844]
-
Y. Yang, Z. Gu, C. Zhu, L. Shang and R.P. Dick
Ever-increasing integrated circuit (IC) power densities and peak
temperatures threaten reliability, performance, and economical cooling.
To address these challenges, thermal analysis must be embedded
within IC synthesis. However, detailed thermal analysis requires accurate
three-dimensional chip-package heat flow analysis. This has typically
been based on numerical methods that are too computationally
intensive for numerous repeated applications during synthesis or
design. Thermal analysis techniques must be both accurate and fast
for use in IC synthesis.
This article presents a novel, accurate, incremental, self-adaptive,
chip-package thermal analysis technique, called ISAC, for use in IC
synthesis and design. It is common for IC temperature variation
to strongly depend on position and time. ISAC dynamically adapts
spatial and temporal modeling granularity to achieve high efficiency
while maintaining accuracy. Both steady-state and dynamic thermal
analysis are accelerated by the proposed heterogeneous spatial resolution
adaptation and temporally decoupled element time marching
techniques. Each technique enables orders of magnitude improvement
in performance while preserving accuracy when compared with other
state-of-the-art adaptive steady-state and dynamic IC thermal analysis
techniques. Experimental results indicate that these improvements are
sufficient to make accurate dynamic and static thermal analysis practical
within the inner loops of IC synthesis algorithms. ISAC has been
validated against reliable commercial thermal analysis tools using industrial
and academic synthesis test cases and chip designs. It has
been implemented as a software package suitable for integration in IC
synthesis and design flows and has been publicly released.
-
On-Chip Bus Thermal Analysis and Optimization [p. 850]
-
F. Wang, Y. Xie, N. Vijaykrishnan and M. J. Irwin
As technology scales, increasing clock rates, decreasing
interconnect pitch, and the introduction of low-k dielectrics have
made self-heating of the global interconnects an important issue in
VLSI design. In this paper, we study the self-heating of on-chip
buses and show that the thermal impact due to self-heating of on-chip
buses increases as technology scales, thus motivating the
need to find solutions to mitigate this effect. Based on the
theoretical analysis, we propose an irredundant bus encoding
scheme for on-chip buses to tackle the thermal issue. Simulation
results show that our encoding scheme is very efficient at reducing
the on-chip bus temperature rise over substrate temperature, with
much less overhead compared to other low power encoding
schemes.
Moderators: E. Schmidt, ChipVision Design Systems, DE; A. Bogliolo, Urbino U, IT
-
Ultralow Power Computing with Sub-Threshold Leakage: A Comparative Study of Bulk and
SOI Technologies [p. 856]
-
A. Raychowdhury, B.C. Paul, S. Bhunia and K. Roy
This paper presents a novel design methodology for ultralow
power design (in bulk and double-gate SOI technology) using sub-threshold
leakage as the operating current (suitable for medium
frequency of operation: tens to hundreds of MHz). It has been shown
that a complete co-design at all levels of hierarchy (device, circuit and
architecture) is necessary to reduce the overall power consumption.
Simulation results of co-design on a five-tap FIR filter show ~2.5x (for
bulk) and ~3.8x (for SOI) improvement in throughput at iso-power
compared to a conventional design. It has been further demonstrated
that the double-gate SOI technology is better suited for sub-threshold
operation.
-
Low Power Synthesis of Dynamic Logic Circuits Using Fine-Grained Clock Gating [p. 862]
-
N. Banerjee, K. Roy, H. Mahmoodi and S. Bhunia
Clock power consumes a significant fraction of total power dissipation in high speed precharge/evaluate logic styles. In this paper, we present a novel low-cost design methodology for reducing clock power in the active mode for dynamic circuits with fine-grained clock gating. The proposed technique also improves switching power by preventing redundant computations. A logic synthesis approach for domino/skewed logic styles based on Shannon expansion is proposed, that dynamically identifies idle parts of logic and applies clock gating to them to reduce power in the active mode of operation. Results on a set of MCNC benchmark circuits in predictive 70nm process exhibit improvements of 15% to 64% in total power with minimal overhead in terms of delay and area compared to conventionally synthesized domino/skewed logic.
-
Enabling Fine-Grain Leakage Management by Voltage Anchor Insertion [p. 868]
-
P. Babighian, L. Benini, A. Macii and E. Macii
Functional unit shutdown based on MTCMOS devices is effective
for leakage reduction in aggressively scaled technologies.
However, the applicability of MTCMOS-based shutdown
in a synthesis-based design flow poses the challenge
of interfacing logic blocks in shutdown mode with active
units: The outputs of inactive gates can float at intermediate
voltages, causing very large short-circuit currents in
the active gates they drive.
In this paper, we propose two novel low-overhead elementary
cells that fully address this issue. These cells can be
added to any synthesis library, and they can be inserted
into a netlist at the boundary between shutdown and active
regions. Our results show that: (i) Our cells solve the
interfacing problem with minimum overhead; (ii) A non-intrusive
design flow enhancement is sufficient to automatically
insert interface cells in post-synthesis netlists.
-
Automated Exploration of Pareto-Optimal Configurations in Parameterized Dynamic Memory
Allocation for Embedded Systems [p. 874]
-
S. Mamagkakis, D. Atienza, C. Poucet, F. Catthoor, D. Soudris and J. M. Mendias
New applications in embedded systems are becoming
increasingly dynamic. In addition to increased dynamism,
they have massive data storage needs. Therefore,
they rely heavily on dynamic, run-time memory allocation.
The design and configuration of a dynamic memory
allocation subsystem requires a big design effort,
without always achieving the desired results. In this paper,
we propose a fully automated exploration of dynamic
memory allocation configurations. These configurations
are fine tuned to the specific needs of applications
with the use of a number of parameters. We assess the effectiveness
of the proposed approach in two representative
real-life case studies of the multimedia and wireless network
domains and show up to 76% decrease in memory accesses
and 66% decrease in memory footprint within the
Pareto-optimal trade-off space.
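
For readers unfamiliar with the term, the following sketch shows what selecting the Pareto-optimal configurations means when every objective is minimized. It is a generic dominance filter and not the authors' exploration tool. The function name `pareto_front` and the sample (memory accesses, footprint) pairs are hypothetical.

```python
def pareto_front(points):
    """Return the points not dominated by any other point,
    assuming every objective is to be minimized."""
    def dominates(a, b):
        # a dominates b if it is no worse everywhere and strictly better somewhere.
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (memory accesses, footprint in bytes) per allocator configuration.
configs = [(120, 900), (100, 1000), (150, 700), (130, 950), (100, 1100)]
front = pareto_front(configs)
```

Each configuration left in `front` trades accesses against footprint: none of them can be improved on one metric without getting worse on the other.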
-
A Control Theoretic Approach to Run-Time Energy Optimization of Pipelined Processing in MPSoCs [p. 876]
-
A. Alimonda, A. Acquaviva, S. Carta and A. Pisano
In this work we take a control-theoretic approach to feedback-based
dynamic voltage scaling (DVS) in Multi Processor System
on Chip (MPSoC) pipelined architectures. We present and discuss
a novel feedback approach based on both linear and non-linear
techniques aimed at controlling interprocessor queue occupancy.
Theoretical analysis and experiments, carried out on a cycle-accurate
multiprocessor simulation platform, show that feedback-based
control reduces energy consumption with respect to standard
local DVS policies and highlight that non-linear strategies
allow a more flexible and robust implementation in the presence of
variable workload conditions.
Moderators: J. Koehl, IBM, DE; F. Johannes, TU Munich, DE
-
3D Floorplanning with Thermal Vias [p. 878]
-
E. Wong and S. K. Lim
3D circuits have the potential to improve performance
over traditional 2D circuits by reducing wirelength and
interconnect delay. One major problem with 3D circuits is that
their higher device density due to reduced footprint area leads
to greater temperatures. Thermal vias are a potential solution
to this problem. This paper presents a thermal via insertion
algorithm that can be used to plan thermal via locations during
floorplanning. The thermal via insertion algorithm relies on
a new thermal analyzer based on random walk techniques.
Experimental results show that, in many cases, considering
thermal vias during floorplanning stages can significantly reduce
the temperature of a 3D circuit.
-
Timing-Driven Cell Layout De-Compaction for Yield Optimization by Critical Area Minimization [p. 884]
-
T. Iizuka, M. Ikeda and K. Asada
This paper proposes a yield optimization method for
standard-cells under timing constraints. Yield-aware logic
synthesis and physical optimization require yield-enhanced
standard cells and the proposed method automatically creates
yield-enhanced cell layouts by de-compacting the original
cell layout. However, the careless modification of
the original layout may degrade its performance severely.
Therefore, the proposed method de-compacts the original
layout under given timing constraints using Linear Programming
(LP). We develop a new accurate linear delay
model which approximates the difference from the original
delay and use this model to formulate the timing constraints
in the LP. Experimental results show that the proposed
method can pick up the yield variants of a cell layout
from the trade-off curve of cell delay versus critical area
and is used to create the yield-enhanced cell library which
is essential to realize yield-aware VLSI design flows.
-
Lens Aberration Aware Timing-Driven Placement [p. 890]
-
A. B. Kahng, C.-H. Park, P. Sharma and Q. Wang
Process variations due to lens aberrations are to a large extent
systematic, and can be modeled for purposes of analyses and
optimizations in the design phase. Traditionally, variations induced
by lens aberrations have been considered random due to
their small extent. However, as process margins reduce, and as
improvements in reticle enhancement techniques control variations
due to other sources with increased efficacy, lens aberration-induced
variations gain importance. For example, our experiments
indicate that lens aberration can result in up to 8% variation
in cell delay. In this paper, we propose an aberration-aware
timing-driven analytical placement approach that accounts for
aberration-induced variations during placement. Our approach
minimizes the design's cycle time and prevents hold-time violations
under systematic aberration-induced variations. On average,
the proposed placement technique reduces cycle time by
~5% at the cost of ~2% increase in wirelength.
Moderators: M. Renovell, LIRMM, FR; C. Hawkins, New Mexico U, US
-
On Test Conditions for the Detection of Open Defects [p. 896]
-
B. Kruseman and M. Heiligers
The impact of test conditions on the detectability of
open defects is investigated. We performed an inductive
fault analysis on representative standard gates. The simulation
results show that open-like defects result in a wide
range of different voltage-delay dependencies, ranging
from a strongly increasing to a strongly decreasing delay
as a function of voltage. The behaviour is not only determined
by the defect location but also by the test pattern.
Knowing the expected behaviour of a certain defect location
helps failure localisation. The detectability of a defect
is strongly determined by the behaviour of the
affected path as well as that of the longest path. Our simulations
and measurements show that in general elevated
supply voltages give a better detectability of open-like defects.
-
A Compact Model to Identify Delay Faults Due to Crosstalk [p. 902]
-
J. L. Rossello and J. Segura
In this work we present an analytical formulation to estimate quickly and
accurately the impact of crosstalk-induced delay in submicron CMOS IC gates, taking
into account time skew. Crosstalk delay is computed from the additional charge
injected from the aggressor gate on the victim gate during simultaneous switching.
The model provides a very good agreement with HSPICE simulations for a 0.18μm
technology.
-
Generation of Broadside Transition Fault Test Sets That Detect Four-Way Bridging Faults [p. 907]
-
I. Pomeranz and S. M. Reddy
Generation of n-detection test sets is typically done for a
single fault model. In this work we investigate the generation
of n-detection test sets by pairing each fault of a target
fault model with n faults of a different fault model.
Tests are generated such that they detect both faults of a
pair. To facilitate test generation, we ensure that the faults
included in a single pair have overlapping requirements
for their detection. The advantage of this approach is that
it ensures the detection of additional faults that would not
be targeted during n-detection test generation for a single
fault model. Experimental results with transition faults as
the first fault model and four-way bridging faults as the
second fault model are presented.
-
Extraction of Defect Density and Size Distributions from Wafer Sort Test Results [p. 913]
-
J. E. Nelson, T. Zanon, R. Desineni, J. G. Brown, N. Patil, W. Maly and R. D. Blanton
Defect density and defect size distributions (DDSDs) are key
parameters used in IC yield loss predictions. Traditionally,
memories and specialized test structures have been used to
estimate these distributions. In this paper, we propose a strategy
to accurately estimate DDSDs for shorts in metal layers
using production IC test results.
Moderators: J. Teich, Erlangen-Nuremberg U, DE; H. van Someren, ACE Associated Compiler Experts, NL
-
An Interprocedural Code Optimization Technique for Network Processors Using Hardware
Multi-Threading Support [p. 919]
-
H. Scharwaechter, M. Hohenauer, R. Leupers, G. Ascheid and H. Meyr
Sophisticated C compiler support for network processors
(NPUs) is required to improve their usability and consequently,
their acceptance in system design. Nonetheless,
high-level code compilation always introduces overhead,
regarding code size and performance compared to handwritten
assembly code. This overhead results partially from
high-level function calls that usually introduce memory accesses
in order to save and reload register contents. A
key feature of many NPU architectures is hardware multi-threading
support, in the form of separate register files, for
fast context switching between different application tasks.
In this paper, a new NPU code optimization technique to
use such HW contexts is presented that minimizes the overhead
for saving and reloading register contents for function
calls via the runtime stack. The feasibility and the performance
gain of this technique are demonstrated for the Infineon
Technologies PP32 NPU architecture and typical network
application kernels.
-
An Integrated Scratch-Pad Allocator for Affine and Non-Affine Code [p. 925]
-
S. Udayakumaran and R. Barua
Scratch-Pad memory (SPM) allocators that exploit the
presence of affine references to arrays are important for scientific
benchmarks. On the other hand, such allocators have
so far been limited in their general applicability. In this paper
we propose an integrated scheme that for the first time
combines the specialized solution for affine program allocation
with a general framework for other code. We find that
our integrated framework does as well as or outperforms other
allocators for a variety of SPM sizes.
-
Dynamic Scratch-Pad Memory Management for Irregular Array Access Patterns [p. 931]
-
G. Chen, O. Ozturk, M. Kandemir and M. Karakoy
There exist many embedded applications such as those
executing on set-top boxes, wireless base stations, HDTV,
and mobile handsets that are structured as nested loops
and benefit significantly from a software managed memory.
Prior work on scratchpad memories (SPMs) focused primarily
on applications with regular data access patterns.
Unfortunately, some embedded applications do not fit in
this category and consequently conventional SPM management
schemes will fail to produce the best results for them.
In this work, we propose a novel compilation strategy for
data SPMs for embedded applications that exhibit irregular
data access patterns. Our scheme divides the task of optimization
between compiler and runtime. The compiler processes
each loop nest and inserts code to collect information
at runtime. Then, the code is modified in such a fashion
that, depending on the collected information, it dynamically
chooses to use or not to use the data SPM for a given set of
accesses to irregular arrays. Our results indicate that this
approach is very successful with the applications that have
irregular patterns and improves their execution cycles by
about 54% over a state-of-the-art SPM management technique
and 23% over the conventional cache memories. Also,
the additional code size overhead incurred by our approach
is less than 5% for all the applications tested.
-
Restructuring Field Layouts for Embedded Memory System [p. 937]
-
K. Shin, J. Kim, S. Kim and H. Han
In many computer systems with large data computations,
the delay of memory access is one of the major performance
bottlenecks. In this paper, we propose an enhanced field
remapping scheme for dynamically allocated structures in
order to provide better locality than conventional field layouts.
Our proposed scheme reduces cache miss rates drastically
by aggregating and grouping fields from multiple
instances of the same structure, which implies performance
improvement and power reduction. Our methodology
will become more important in design space exploration,
especially as embedded systems for data-oriented
applications become prevalent. Experimental results show
that average L1 and L2 data cache misses are reduced by
23% and 17%, respectively. Due to the enhanced localities,
our remapping achieves 13% faster execution time on
average than original programs. It also reduces power consumption
by 18% for data cache.
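Illustrative sketch (not the authors' remapping scheme): the effect of
grouping one field from many instances can be seen by counting the cache
lines touched when a single hot field is read across all instances. The
structure, the 8-byte field size, the 64-byte line size and all names
below are assumptions made only for this example.

```python
# Hypothetical structure with four 8-byte fields; only "key" is hot.
FIELDS = ["key", "a", "b", "c"]
FIELD_SIZE = 8      # bytes per field (assumed)
LINE_SIZE = 64      # cache line size in bytes (assumed)


def addr_conventional(instance, field):
    """Byte address when each instance stores its fields contiguously."""
    return (instance * len(FIELDS) + FIELDS.index(field)) * FIELD_SIZE


def addr_grouped(instance, field, n_instances):
    """Byte address when the same field of all instances is stored together."""
    return (FIELDS.index(field) * n_instances + instance) * FIELD_SIZE


def lines_touched(addresses):
    """Number of distinct cache lines covered by the given addresses."""
    return len({a // LINE_SIZE for a in addresses})


N = 64
conventional = lines_touched(addr_conventional(i, "key") for i in range(N))
grouped = lines_touched(addr_grouped(i, "key", N) for i in range(N))
print(conventional, grouped)
```

In this toy setting the grouped layout touches a quarter of the lines,
because every fetched line holds only the field that is actually read.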
-
Power-Aware Compilation for Embedded Processors with Dynamic Voltage Scaling and Adaptive
Body Biasing Capabilities [p. 943]
-
P.-K. Huang and S. Ghiasi
Traditionally, active power has been the primary
source of power dissipation in CMOS designs. Although
leakage power is becoming increasingly more important
as technology feature sizes continue to shrink,
traditional power optimization techniques often neglect
its contribution to total system power. In this
paper, we present a power-aware compilation methodology
that targets an embedded processor with both
dynamic voltage scaling (DVS) and adaptive body biasing
(ABB) capabilities. Our technique has the unique
advantage of optimizing design power by jointly optimizing
dynamic and leakage power dissipation. Considering
the delay and energy penalty of switching between
processor modes, our compiler generates code with minimum
power consumption under deadline constraints.
Compared to not performing any optimization, or using
DVS alone, our technique improves the power
consumption of a number of embedded application kernels
by 26% and 14%, respectively.
-
Dynamic Code Overlay of SDF-Modeled Programs on Low-End Embedded Systems [p. 945]
-
H.-W. Park, K. Oh, S. Park, M.-M. Sim, and S. Ha
In this paper we propose a dynamic code overlay
technique for synchronous data-flow (SDF)-modeled
programs on low-end embedded systems which lack MMU support.
With this technique, the system can utilize
expensive SRAM memory more efficiently by using flash
memory as code storage. SRAM is divided into several
regions called overlay slots. A data-flow block or a cluster
of data-flow blocks is loaded into the corresponding
overlay slot on demand at run-time. Which blocks are
clustered together and which overlay slots are allocated to
the clusters are statically decided by the clustering and
placement algorithm. We also propose an automatic code
generation framework that generates the C-program code,
dynamic loader and linker script files from the given SDF-modeled
blocks and schematic, so we can run or simulate
the program immediately without any additional coding
effort. Experiments report that we can reduce the SRAM
size significantly with a reasonable amount of time
overhead for several real applications.
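Illustrative sketch (not the authors' loader or clustering algorithm):
the on-demand loading of blocks into statically assigned overlay slots
can be modelled by counting how often a block has to be copied from
flash. The function name, the block names and the slot maps are
hypothetical.

```python
def run_with_overlays(schedule, slot_of):
    """Count flash-to-SRAM loads for a block firing schedule.

    schedule: sequence of block names in firing order.
    slot_of:  static map from block name to overlay slot index.
    """
    resident = {}   # slot index -> block currently held in that slot
    loads = 0
    for block in schedule:
        slot = slot_of[block]
        if resident.get(slot) != block:
            resident[slot] = block   # copy the block's code from flash
            loads += 1
    return loads


schedule = ["A", "B", "C", "A", "B", "C"]
shared = {"A": 0, "B": 0, "C": 1}      # A and B compete for slot 0
separate = {"A": 0, "B": 1, "C": 2}    # one slot per block, more SRAM
print(run_with_overlays(schedule, shared), run_with_overlays(schedule, separate))
```

The two slot maps show the trade-off named in the abstract: sharing a
slot saves SRAM but adds reload time whenever the competing blocks
alternate.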
Moderators: S. Vernalde, IMEC, BE; K. Bertels, TU Delft, NL
-
optiMap: A Tool for Automated Generation of NoC Architectures Using Multi-Port Routers for FPGAs [p. 947]
-
B. Sethuraman and R. Vemuri
The Networks-on-Chip (NoC) approach to system design has been
introduced to overcome the communication and performance
bottlenecks of bus-based system design. Area is at
a premium in FPGAs. In this research, we propose to reduce
network area overhead by reducing the number of routers,
by making the router handle multiple logic cores. We implement
an improved multi-local port router design with variable
number of local ports. In addition to substantial area
savings, we observe significant performance improvement.
We discuss the issues involved in the use of multi-local port
routers for NoC design in FPGAs. We observe an average
of 36% area savings (maximum of 47.5%) on an XC2VP30
FPGA and significant performance gain (30% average compared
to single-local port version) with a multi-local port
router. Mapping of cores onto such a non-traditional NoC architecture
is a complex task. We present an algorithm which
optimally maps the cores based on the given set of objectives.
For the given task graph and the set of constraints, the algorithm
finds the optimal number of routers, configuration of
each router, optimal mesh topology and the final mapping.
We test the algorithm on a wide variety of benchmarks and
report the results.
-
Hardware Efficient Architectures for Eigenvalue Computation [p. 953]
-
Y. Liu, C.-S. Bouganis, P. Y. K. Cheung, P. H. W. Leong and S. J. Motley
Eigenvalue computation is essential in many fields of science
and engineering. For high performance and real-time
applications, this may need to be done in hardware. This
paper focuses on the exploration of hardware architectures
which compute eigenvalues of symmetric matrices. We propose
to use the Approximate Jacobi Method for the general-case
symmetric matrix eigenvalue problem. The paper illustrates
that the proposed architecture is more efficient than previous
architectures reported in the literature. Moreover, for
the special case of 3x3 symmetric matrices, we propose to
use an Algebraic Method. It is shown that the pipelined architecture
based on the Algebraic Method has a significant
advantage in terms of area.
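Illustrative sketch (software only, and using the classical cyclic
Jacobi method rather than the Approximate Jacobi Method or the hardware
architecture of the paper): each step applies a plane rotation that
zeroes one off-diagonal pair of a symmetric matrix, and the diagonal
converges to the eigenvalues. Function and parameter names are
assumptions for this example.

```python
import math


def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]          # work on a copy
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes a[p][q]; |theta| <= pi/4.
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):          # A <- A J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):          # A <- J^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


print(jacobi_eigenvalues([[2.0, 1.0], [1.0, 2.0]]))
```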
-
Memory Centric Thread Synchronization on Platform FPGAs [p. 959]
-
C. Kulkarni and G. Brebner
Concurrent programs are difficult to write, reason about,
re-use, and maintain. In particular, for system-level
descriptions that use a shared memory abstraction for
thread or process synchronization, the current practice
involves manual scheduling of processes, introduction of
guard conditions, and clocking tricks, to enforce memory
dependencies. This process is tedious, time consuming,
and error-prone. At the same time, the need for a
concurrent programming model is becoming ever more essential
to bridge the productivity gap that is widening with every
manufacturing process generation. In this paper, we
present two novel techniques to automatically enforce
memory dependencies in platform FPGAs using on-chip
memories, starting from a system-level description. Both
techniques utilize static analysis to generate circuits for
enforcing these dependencies. This paper will investigate
these two techniques for their generality, overhead in
implementation, and usefulness or otherwise for different
application requirements.
-
A Parallel Configuration Model for Reducing the Run-Time Reconfiguration Overhead [p. 965]
-
Y. Qu, J.-P. Soininen and J. Nurmi
Multitasking on reconfigurable logic can achieve very
high silicon reusability. However, configuration latency is
a major limitation and it can largely degrade the system
performance. One reason is that tasks can run in parallel
but configurations of the tasks can be done only in
sequence. This work presents a novel configuration model
to enable configuration parallelism. It consists of multiple
homogeneous tiles and each tile has its own configuration
SRAM that can be individually accessed. Thus multiple
configuration controllers can load tasks in parallel and
more speedups can be achieved. We used a prefetch
scheduling technique to evaluate the model with randomly
generated tasks. The experimental results reveal that, on
average, using multiple controllers can reduce the
configuration overheads by 21%. Compared to the best cases
of using multiple tiles with a single controller, an additional
40% speedup can be achieved using multiple controllers.
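Illustrative sketch (a toy timing model, not the authors' prefetch
scheduler or experimental setup): with a single configuration
controller the loads are serialized even when tiles are idle, whereas
several controllers can load tiles in parallel. The task list, the
greedy earliest-free assignment and all names are assumptions.

```python
import heapq


def makespan(tasks, n_tiles, n_controllers):
    """Finish time when tasks are configured and run in list order.

    tasks: list of (config_time, run_time) pairs, assumed independent.
    Each task takes the earliest-free tile and the earliest-free
    configuration controller; loading starts once both are free.
    """
    tiles = [0] * n_tiles            # time at which each tile becomes free
    ctrls = [0] * n_controllers      # time at which each controller becomes free
    finish = 0
    for config_time, run_time in tasks:
        tile_free = heapq.heappop(tiles)
        ctrl_free = heapq.heappop(ctrls)
        loaded = max(tile_free, ctrl_free) + config_time
        done = loaded + run_time
        heapq.heappush(ctrls, loaded)   # controller is free after loading
        heapq.heappush(tiles, done)     # tile is free after the task ends
        finish = max(finish, done)
    return finish


tasks = [(2, 3)] * 4
for k in (1, 2, 4):
    print(k, makespan(tasks, n_tiles=4, n_controllers=k))
```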
Organiser/Moderator: C. Enz, CSEM, CH
-
Wireless Sensor Networks and Beyond [p. 970]
-
P. J. M. Havinga
Wireless sensor networks are a hot issue worldwide, and significant progress has been achieved in the past
few years. However, we are only beginning to find out about their real potential, and there are still major
challenges that need to be solved. In this presentation an overview of the biggest challenges in wireless
sensor networks is given, and some of the solutions are highlighted. Then, some applications of
sensor network technologies are presented, which go beyond traditional sensor network applications.
-
The Ultra Low-Power WiseNET System [p. 971]
-
A. El-Hoiydi, C. Arm, R. Caseiro, S. Cserveny, J.-D. Decotignie, C. Enz, F. Giroud, S. Gyger,
E. Leroux, T. Melly, V. Peiris, F. Pengg, P.-D. Pfister, N. Raemy, A. Ribordy, D. Ruffieux and P. Volet
The WiseNET system includes an ultra low-power
system-on-chip (SoC) hardware platform and WiseMAC, a
low power medium access control protocol (MAC) dedicated
to duty-cycled radios. Both elements have been designed
to meet the specific requirements of wireless sensor
networks and are particularly well suited to ad-hoc and hybrid
networks. The WiseNET radio offers dual-band operation
(434-MHz and 868-MHz) and runs from a single 1.5-V
battery. It consumes only 2.5-mW in receive mode with a
sensitivity smaller than -108-dBm at a BER of 10^-3 and
for a 100-kb/s data rate. In addition to this low-power radio,
the WiseNET system-on-chip (SoC) also includes
all the functions required for data acquisition, processing
and storage of the information provided by the sensors.
Ultra-low power consumption with the WiseNET system
is achieved thanks to the combination of the low
power consumption of the transceiver and the high energy
efficiency of the WiseMAC protocol. The WiseNET solution
consumes more than 250 times less power than
comparable solutions based on the IEEE 802.15.4 standard.
-
Fast-prototyping Using the BTnode Platform [p. 977]
-
J. Beutel
The BTnode platform is a versatile and flexible platform
for functional prototyping of ad hoc and sensor networks.
Based on an Atmel microcontroller, a Bluetooth radio and
a low-power ISM band radio it offers ample resources to
implement and test a broad range of algorithms and applications
ranging from pure technology studies to complete
application demonstrators. Accompanying the hardware is
a suite of system software, application examples and tutorials
as well as support for debugging, test, deployment
and validation of wireless sensor network applications. We
discuss aspects of system design, development and deployment
based on our experience with real wireless sensor network
experiments. We further discuss our approach of a
deployment-support network that tries to close the gap between
current proof-of-concept experiments and sustainable
real-world sensor network solutions.
Moderators: M. Miranda, IMEC, BE; A. Macii, Politecnico di Torino, IT
-
Circuit-Aware Device Design Methodology for Nanometer Technologies: A Case Study for Low
Power SRAM Design [p. 983]
-
Q. Chen, S. Mukhopadhyay, A. Bansal and K. Roy
In this paper, we propose a general Circuit-aware Device
Design methodology, which can improve the overall circuit design
by taking advantage of individual circuit characteristics during
the device design phase. The proposed methodology analytically
derives the optimal device in terms of the pre-specified circuit
quality factor. We applied the proposed methodology to SRAM
design and achieved significant reduction in standby leakage and
access time (11% and 7%, respectively, for conventional 6T-SRAM).
Also, we observed that the optimal devices selected
depend considerably on the applied circuit techniques. We believe
that the proposed Circuit-aware Device Design methodology will
be useful in the sub-90nm technology, where different leakage
components (subthreshold, gate, and junction tunneling) are
comparable in magnitude. Also, in this work, we have presented a
design automation framework for SRAM, which is conventionally
custom designed and optimized.
-
Architectural and Technology Influence on the Optimal Total Power Consumption [p. 989]
-
C. Schuster, J.-L. Nagel, C. Piguet and P.-A. Farine
In this paper, an approximated closed-form total power
consumption equation for circuits working at their
optimal supply and threshold voltage is presented.
Comparisons of this formula to the numerical calculation
show an error of less than 3% on a set of thirteen 16-bit
multipliers. Starting from this equation the influence of
architecture transformations (including pipelining,
parallelization, sequentialization) on the optimal total
power is discussed. Finally, by a similar approach, the
impact of the technology choice on achievable power
saving is considered, showing how a moderated tradeoff
between leakage and speed is the key characteristic of a
good low power technology.
-
Reducing the Sub-Threshold and Gate-Tunneling Leakage of SRAM Cells Using Dual-Vt and
Dual-Tox Assignment [p. 995]
-
B. Amelifard, F. Fallah and M. Pedram
Aggressive CMOS scaling results in low threshold
voltage and thin oxide thickness for transistors manufactured in
very deep submicron regime. As a result, reducing the
subthreshold and gate-tunneling leakage currents has become
one of the most important criteria in the design of VLSI
circuits. This paper presents a method based on dual-Vt and
dual-Tox assignment to reduce the total leakage power
dissipation of SRAMs while maintaining their performance.
The proposed method is based on the observation that the read
and write delays of a memory cell in an SRAM block depend
on the physical distance of the cell from the sense amplifier and
the decoder. Thus, the idea is to deploy different types of six-transistor
SRAM cells corresponding to different threshold
voltage and oxide thickness assignments for the transistors.
Unlike other techniques for low-leakage SRAM design, the
proposed technique incurs neither area nor delay overhead. In
addition, it results in a minor change in the SRAM design flow.
Simulation results with a 65nm process demonstrate that this
technique can reduce the total leakage power dissipation of a
64Kb SRAM by more than 50%.
-
Exploiting Data-Dependent Slack Using Dynamic Multi-VDD to Minimize Energy Consumption
in Datapath Circuits [p. 1001]
-
K. R. Gandhi and N. R. Mahapatra
Modern microprocessors feature wide datapaths to support
large on-chip memory and to enable computation on
large-magnitude operands. With device scaling and rising
clock frequencies, energy consumption and power density
have become critical concerns, especially in datapath circuits.
Datapaths are typically designed to optimize delay for
worst-case operands. However, such operands rarely occur;
the most frequently occurring input operand words (comprising
long strings or subwords of 0's and 1's) present two
major opportunities for energy optimization: (1) avoiding
unnecessary computation involving such "special" input
operand subword values and (2) exploiting timing slack in
circuits (designed to accommodate worst-case inputs) arising
due to such values. Previous techniques have exploited
only one or the other of these factors, but not both simultaneously.
Our new technique, dynamic multi-VDD, which is
capable of dynamically switching between supply voltages
in hardware submodules, simultaneously exploits both factors.
Using the computation bypass framework and multiple
supply voltages, we estimate data-dependent slack based on
submodules that will be bypassed and exploit this slack by
operating active submodules at a lower supply voltage. Our
analysis of SPEC CPU2K benchmarks shows energy savings
of up to 55% (and 46.53% on average) in functional
units with minimal performance overheads.
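Illustrative sketch (a toy energy model, not the authors' circuit or
results): a subword is treated as "special" when it is all 0's or all
1's, special subwords are bypassed, and the remaining active submodules
run at a lower supply whenever at least one subword is bypassed.
Dynamic energy is taken as proportional to VDD squared. The 8-bit
subword width, the two supply values and the slack rule are
assumptions.

```python
def special_subwords(value, width=32, sub=8):
    """Flags (least-significant subword first): True if all 0's or all 1's."""
    mask = (1 << sub) - 1
    flags = []
    for i in range(width // sub):
        s = (value >> (i * sub)) & mask
        flags.append(s == 0 or s == mask)
    return flags


def relative_energy(value, v_high=1.0, v_low=0.7, width=32, sub=8):
    """Energy relative to running every submodule at v_high."""
    flags = special_subwords(value, width, sub)
    active = flags.count(False)
    # Assumed rule: any bypassed submodule frees enough slack for v_low.
    vdd = v_low if active < len(flags) else v_high
    return active * vdd ** 2 / (len(flags) * v_high ** 2)


for operand in (0x12345678, 0x00001234, 0x000000FF):
    print(hex(operand), special_subwords(operand), relative_energy(operand))
```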
Moderators: J. Marques-Silva, Southampton U, UK; E. M. Aboulhamid, Montreal U, CA
-
On the Evaluation of Transactor-Based Verification for Reusing TLM Assertions and Testbenches
at RTL [p. 1007]
-
N. Bombieri, F. Fummi and G. Pravadelli
Transaction level modeling (TLM) is becoming a usual
practice for simplifying system-level design and architecture
exploration. It allows the designers to focus on the
functionality of the design, while abstracting away implementation
details that will be added at lower abstraction
levels. However, moving from transaction level to RTL requires
redefining TLM testbenches and assertions. Such
a wasteful and error prone conversion can be avoided by
adopting transactor-based verification (TBV). Many recent
works adopt this strategy to propose verification methodologies
that allow (1) mixing TLM and RTL components, and
(2) reusing TLM assertions and testbenches at RTL. Even if
practical advantages of such an approach are evident, there
are no papers in the literature that evaluate the effectiveness
of the TBV compared to a more traditional RTL verification
strategy. This paper is intended to fill this gap. It theoretically
compares the quality of TBV with that of rewriting
assertions and testbenches at RTL, with respect to both
fault coverage and assertion coverage.
-
Functional Verification Methodology Based on Formal Interface Specification and Transactor Generation [p. 1013]
-
F. Balarin and R. Passerone
Transaction level models promise to be the basis of the
verification environment for the whole design process. Realizing
this promise requires connecting transaction level
and RTL blocks through an object called a transactor, which
translates back and forth between RTL signal-based communication,
and transaction level function-call based communication.
Each transactor is associated with a pair of interfaces,
one at RTL and one at transaction level. Typically,
however, a pair of interfaces is associated with more than one
transactor, each assuming a different role in the verification
process. In this paper we propose a methodology in which
both the interfaces and their relation are captured by a single
formal specification. By using the specification, we show
how the code for all the transactors associated with a pair
of interfaces can be automatically generated.
-
A Coverage Metric for the Validation of Interacting Processes [p. 1019]
-
I. G. Harris
We present a coverage metric which evaluates the testing
of a set of interacting concurrent processes. Existing behavioral
coverage metrics focus almost exclusively on the testing
of individual processes. However, the vast majority of
practical hardware descriptions are composed of many processes
which must correctly interact to implement the system.
Coverage metrics which evaluate processes separately
are unlikely to model the range of design errors which manifest
themselves when components are integrated to build
a system. A metric which models component interactions
is essential to enable validation techniques to scale with
growing design complexity. We describe the effectiveness of
our metric and provide results to demonstrate that coverage
computation using our metric is tractable.
-
New Methods and Coverage Metrics for Functional Verification [p. 1025]
-
V. Jerinic, J. Langer, U. Heinkel and D. Mueller
An ever increasing portion of design effort is spent on functional
verification. The verification space as the set of possible
combinations of a design's attributes is likely to be very
large making it infeasible to verify each point in this space.
State-of-the-art verification tools tackle this problem by using
directed random generation of combinations in conjunction
with manually defined corner cases in order to get
satisfactory coverage with the desired distribution. In this
work, the underlying methodology for automatically generating
complete sets of disjoint coverage models on the basis
of formal attribute definitions is extended to take relational
constraints into account. This allows the utilization of coverage
models with non-orthogonal, non-planar boundaries,
which can make hole analysis for coverage data obsolete. It
is demonstrated how the proposed methodology can
be used to automatically determine corner cases more accurately
than is possible with conventional approaches.
-
Classification Trees for Random Tests and Functional Coverage [p. 1031]
-
A. Krupp and W. Mueller
This article presents the classification tree method for
functional verification to close the gap from the specification
of a test plan to SystemVerilog [2] testbench generation.
Our method supports the systematic development of
test configurations and is based on the classification tree
method for embedded systems (CTM/ES) [1] extending
CTM/ES for random test generation as well as for functional
coverage and property specification. We support the
structured coding of assertions and constraints by a two-step
method: (i) creation of the classification tree (ii) creation
of (sample) abstract test sequences. For SystemVerilog
testbench generation, we introduce a mapping to SystemVerilog
random tests, assertions, and functional coverage
specifications. As our method is derived from the
CTM/ES, it is also compliant with the V-method and thus applies
to IEC61508-conformant development of electronic
safety related systems. The remainder of this paper gives
an overview of the classification tree method (CTM) before
presenting our extension for functional verification.
Moderators: J. Schloeffel, Philips Semiconductors, DE; H. T. Vierhaus, Cottbus U, DE
-
Efficient Test-Data Compression for IP Cores Using Multilevel Huffman Coding [p. 1033]
-
X. Kavousianos, E. Kalligeros and D. Nikolos
In this paper we introduce a new test-data compression
method for IP cores with unknown structure. The proposed
method encodes the test data provided by the core vendor
using a new, very effective compression scheme based on
multilevel Huffman coding. Specifically, three different kinds
of information are compressed using the same Huffman code,
and thus significant test data reductions are achieved. A
simple architecture is proposed for decoding on-chip the
compressed data. Its hardware overhead is very low and
comparable to that of the most efficient methods in the literature.
Additionally, the proposed technique offers increased
probability of detection of unmodeled faults since the majority
of the unknown values of the test set are replaced by
pseudorandom data generated by an LFSR.
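Illustrative sketch (plain single-level Huffman coding of fixed-size
test-data blocks; the multilevel scheme and the on-chip decoder of the
paper are not reproduced): frequent blocks receive short codewords, so
the encoded stream is shorter than the raw test data. The 4-bit block
size and the sample data are assumptions.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return a prefix-free code {symbol: bitstring} for the given symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def split_blocks(bits, size):
    """Cut a bit string into consecutive blocks of `size` bits."""
    return [bits[i:i + size] for i in range(0, len(bits), size)]


test_data = "0000" * 6 + "1111" * 3 + "0101" * 2 + "0011"
blocks = split_blocks(test_data, 4)
code = huffman_code(blocks)
encoded = "".join(code[b] for b in blocks)
print(len(test_data), len(encoded))
```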
-
Functional Constraints vs. Test Compression in Scan-Based Delay Testing [p. 1039]
-
I. Polian and H. Fujiwara
We present an approach to prevent overtesting in scan-based
delay test. The test data is transformed with respect to functional
constraints while simultaneously keeping as many positions
as possible unspecified in order to facilitate test compression.
The method is independent of the employed delay
fault model, ATPG algorithm and test compression technique,
and it is easy to integrate into an existing flow. Experimental
results emphasize the severity of overtesting in scanbased
delay test. Influence of different functional constraints
on the amount of the required test data and the compression
efficiency is investigated. To the best of our knowledge,
this is the first systematic study on the relationship between
overtesting prevention and test compression.
Keywords: Overtesting prevention, Functional constraints,
Scan-based delay test, Test compression
-
Concurrent Core Test for SoC Using Shared Test Set and Scan Chain Disable [p. 1045]
-
G. Zeng and H. Ito
A concurrent core test approach is proposed to reduce the
test cost of SOC. Multiple cores in SOC can be tested
simultaneously by using a shared test set and scan chain
disable. Prior to test, the test sets corresponding to cores
under test (CUT) are merged by using the proposed
merging algorithm to obtain a shared test set with
minimum size. During test, the on-chip scan chain disable
signal (SCDS) generator is employed to retrieve the
original test vectors from the shared test set. The
approach is non-intrusive and automatic test pattern
generator (ATPG) independent. Moreover, the approach
can reduce test cost further by combining with general test
compression/decompression technique. Experimental
results for ISCAS 89 benchmark circuits have proven the
efficiency of the proposed approach.
-
Efficient Unknown Blocking Using LFSR Reseeding [p. 1051]
-
S. Wang, K. J. Balakrishnan and S. T. Chakradhar
This paper presents an efficient method to block unknown
values from entering temporal compactors. The control signals
for the blocking logic are generated by an LFSR. The
proposed technique minimizes the size of the LFSR by propagating
only one fault effect for each fault and balancing the
number of specified bits in each control pattern. The linear
solver to find seeds of the LFSR intelligently chooses a
solution such that the impact on test quality is minimal. Experimental
results show that sizes of control data for the proposed
method are smaller than prior work and run time of
the proposed method is several orders of magnitude smaller
than that of prior work. Hardware overhead is very low.
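Illustrative sketch (not the authors' method): LFSR reseeding looks for
a seed whose output stream agrees with every specified bit of a control
pattern, leaving the don't-care positions free. The abstract mentions a
linear solver; for brevity this toy version simply tries every seed of
a 4-bit LFSR. The tap positions, the pattern and the function names are
assumptions.

```python
def lfsr_bits(seed, taps, n_bits, length):
    """First `length` output bits of a right-shifting Fibonacci LFSR."""
    state = seed
    out = []
    for _ in range(length):
        out.append(state & 1)               # output is the lowest state bit
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = (state >> 1) | (feedback << (n_bits - 1))
    return out


def find_seed(pattern, taps, n_bits):
    """Smallest non-zero seed matching all '0'/'1' positions; 'x' is free."""
    for seed in range(1, 1 << n_bits):
        bits = lfsr_bits(seed, taps, n_bits, len(pattern))
        if all(p == "x" or int(p) == b for p, b in zip(pattern, bits)):
            return seed
    return None


pattern = "1xx0x1xx"            # three specified bits, five don't-cares
seed = find_seed(pattern, taps=(0, 1), n_bits=4)
print(seed, lfsr_bits(seed, (0, 1), 4, len(pattern)))
```

Fewer specified bits per pattern leave more seeds that satisfy it,
which is why balancing specified bits allows a smaller LFSR.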
-
Coverage Loss by Using Space Compactors in Presence of Unknown Values [p. 1053]
-
M. C.-T. Chao, S. Wang, S. T. Chakradhar, W. Wei and K.-T. Cheng
The presence of unknown values in simulation is the greatest barrier to effective
test response compaction. For space compactors, some response may not be observable
due to the masking effect caused by unknown values. This paper reports on
experiments conducted to evaluate the impact on the test quality of various
percentages of observable responses for both modeled and un-modeled faults.
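Illustrative sketch (a toy XOR space compactor, not the compactors or
experiments of the paper): an unknown value X on any input of an XOR
tree makes that output unknown, so the known responses feeding the same
output become unobservable. The grouping of scan outputs and the sample
response are assumptions.

```python
def xor3(a, b):
    """XOR over {0, 1, 'X'}: any unknown input makes the output unknown."""
    return "X" if "X" in (a, b) else a ^ b


def compact(responses, groups):
    """One XOR output per group of response positions."""
    outputs = []
    for group in groups:
        acc = 0
        for idx in group:
            acc = xor3(acc, responses[idx])
        outputs.append(acc)
    return outputs


def observable_fraction(responses, groups):
    """Share of known response bits that reach an X-free compactor output."""
    known = sum(1 for r in responses if r != "X")
    seen = sum(len(g) for g in groups
               if all(responses[i] != "X" for i in g))
    return seen / known


responses = [1, 0, "X", 1, 0, 1, 1, 0]
groups = [(0, 1, 2, 3), (4, 5, 6, 7)]
print(compact(responses, groups), observable_fraction(responses, groups))
```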
Moderators: S. Baruah, North Carolina U, US; H. van Someren, ACE Associated Compiler Experts, NL
-
Online Energy-Aware I/O Device Scheduling for Hard Real-Time Systems [p. 1055]
-
H. Cheng and S. Goddard
Much research has focused on power conservation for the
processor, while power conservation for I/O devices has received
little attention. In this paper, we analyze the problem
of online energy-aware I/O scheduling for hard real-time
systems based on the preemptive periodic task model.
We propose an online energy-aware I/O device scheduling
algorithm: Energy-efficient Device Scheduling (EEDS).
The EEDS algorithm utilizes device slack to perform device
power state transitions to save energy, without jeopardizing
temporal correctness. An evaluation of the approach shows
that it yields significant energy savings with respect to no
Dynamic Power Management (DPM) techniques.
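Illustrative sketch (the generic break-even rule used in dynamic power
management, not the EEDS algorithm itself): a device is worth powering
down only if its slack is at least the break-even time, the idle
interval at which the transition energy is repaid by the lower sleep
power. The device parameters are made-up values, and the slack is
assumed to already cover the wake-up latency.

```python
def break_even_time(p_on, p_sleep, e_transition, t_transition):
    """Shortest idle interval for which sleeping does not cost more energy.

    Staying on for T costs p_on * T; sleeping costs
    e_transition + p_sleep * (T - t_transition). Equating the two gives T.
    """
    t = (e_transition - p_sleep * t_transition) / (p_on - p_sleep)
    return max(t, t_transition)


def should_sleep(slack, p_on, p_sleep, e_transition, t_transition):
    """Power the device down only if the slack covers the break-even time."""
    return slack >= break_even_time(p_on, p_sleep, e_transition, t_transition)


# Hypothetical device: watts, watts, joules, seconds.
device = dict(p_on=1.0, p_sleep=0.1, e_transition=0.5, t_transition=0.2)
print(break_even_time(**device))
print(should_sleep(0.4, **device), should_sleep(1.0, **device))
```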
-
Multiprocessor Synthesis for Periodic Hard Real-Time Tasks under a Given Energy Constraint [p. 1061]
-
H.-R. Hsu, J.-J. Chen and T.-W. Kuo
The energy-aware design for electronic systems has been an important
issue in hardware and/or software implementations, especially
for embedded systems. This paper targets a synthesis problem for
heterogeneous multiprocessor systems to schedule a set of periodic
real-time tasks under a given energy consumption constraint. Each
task is required to execute on a processor without migration, where
tasks might have different execution times on different processor
types. Our objective is to minimize the processor cost of the entire
system under the given timing and energy consumption constraints.
The problem is first shown to be NP-hard and to have
no polynomial-time algorithm with a constant approximation ratio
unless NP = P. We propose polynomial-time approximation algorithms
with (m + 2)-approximation ratios for this challenging
problem, where m is the number of the available processor types.
Experimental results show that the proposed algorithms could always
derive solutions with system costs close to those of optimal
solutions.
Keywords: Energy-aware systems, Task scheduling, Real-time
systems, Task partitioning, Multiprocessor synthesis.
-
Scheduling under Resource Constraints Using Dis-Equations [p. 1067]
-
H. Cherroun, A. Darte and P. Feautrier
Scheduling is an important step in high-level synthesis
(HLS). In our tool, we perform scheduling in two steps:
coarse-grain scheduling, in which we take into account the
whole control structure of the program including imperfect
loop nests, and fine-grain scheduling, where we refine each
logical step using a detailed description of the available resources.
This paper focuses on the second step. Tasks are
modeled as reservation tables (or templates) and we express
resource constraints using dis-equations (i.e., negations
of equations). We give an exact algorithm based on a
branch-and-bound method, coupled with variants of Dijkstra's
algorithm, which we compare with a greedy heuristic.
Both algorithms are tested on pieces of scientific applications
to demonstrate their suitability for HLS tools.
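Illustrative sketch (exhaustive search on a tiny example, not the
authors' branch-and-bound algorithm): if task i uses a resource at
offset a and task j uses it at offset b, the two collide exactly when
t_i + a = t_j + b, so the resource constraint is the dis-equation
t_i - t_j != b - a. The reservation tables, resource names and horizon
are assumptions.

```python
from itertools import product


def forbidden_differences(table_i, table_j):
    """Values d such that t_i - t_j == d causes a resource conflict."""
    return {b - a
            for res in table_i.keys() & table_j.keys()
            for a in table_i[res]
            for b in table_j[res]}


def feasible(starts, tables):
    """True if every pair of tasks satisfies all its dis-equations."""
    n = len(starts)
    return all(
        starts[i] - starts[j] not in forbidden_differences(tables[i], tables[j])
        for i in range(n) for j in range(i + 1, n))


def best_schedule(tables, horizon):
    """Feasible start times with the smallest makespan (brute force)."""
    lengths = [max(max(offs) for offs in t.values()) + 1 for t in tables]
    candidates = (s for s in product(range(horizon), repeat=len(tables))
                  if feasible(s, tables))
    return min(candidates,
               key=lambda s: max(t + n for t, n in zip(s, lengths)),
               default=None)


mul = {"mul": {0, 1}}      # occupies the multiplier for two cycles
add = {"alu": {0}}         # occupies the ALU for one cycle
tables = [mul, mul, add]
print(forbidden_differences(mul, mul), best_schedule(tables, horizon=4))
```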
-
Scalable Performance-Energy Trade-Off Exploration of Embedded Real-Time Systems
on Multiprocessor Platforms [p. 1073]
-
Z. Ma and F. Catthoor
Conventional task scheduling on real-time systems with
multiple processors is notorious for its computational intractability.
This problem becomes even harder when designers
also have to consider other constraints such as energy
consumption. Such a multi-objective trade-off exploration
is a crucial step to generating cost-efficient real-time
embedded systems. Although previous task schedulers have
attempted to provide fast heuristics for design space exploration,
they cannot handle large systems efficiently. As
today's embedded systems become increasingly larger, we
need a scalable scheduler to handle this complexity. This
paper presents a hierarchical scheduler that combines the
graph partition and the task interleaving to tackle the tradeoff
exploration problem in a scalable way. Our scheduler
can employ the existing flattened scheduler and significantly
accelerate the design space explorations for large tasks.
A speed-up of up to two orders of magnitude has been obtained
for large task models compared to the conventional
flattened scheduler.
Moderators: T. Shiple, Synopsys, FR; R. Drechsler, Bremen U, DE
-
Building a Better Boolean Matcher and Symmetry Detector [p. 1079]
-
D. Chai and A. Kuehlmann
Boolean matching is a powerful technique that has been used in
technology mapping to overcome the limitations of structural pattern
matching. The current basis for performing Boolean matching
is the computation of a canonical form to represent functions that
are equivalent under negation and permutation of inputs and outputs.
In this paper, we first present a detailed analysis of previous
techniques for Boolean matching. We then describe a novel combination
of existing methods and new ideas that results in a matcher
which is dramatically faster than previous work. We point out that
the presented algorithm is equally relevant for detecting generalized
functional symmetries, which has broad applications in logic
optimization and verification.
-
Optimizing Sequential Cycles through Shannon Decomposition and Retiming [p. 1085]
-
C. Soviani, O. Tardieu and S. A. Edwards
Optimizing sequential cycles is essential for many types
of high-performance circuits, such as pipelines for packet
processing. Retiming is a powerful technique for speeding
pipelines, but it is stymied by tight sequential cycles. Designers
usually attack such cycles by manually combining Shannon decomposition
with retiming - effectively a form of speculation -
but such manual decomposition is error-prone.
We propose an efficient algorithm that simultaneously applies
Shannon decomposition and retiming to optimize circuits
with tight sequential cycles. While the algorithm is only able
to improve certain circuits (roughly half of the benchmarks we
tried), the performance increase can be dramatic (7%-61%)
with only a modest increase in area (3%-12%). The algorithm
is also fast, making it a practical addition to a synthesis flow.
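The speculation idea behind Shannon decomposition can be sketched as follows. This illustrates only the identity f(x, y) = x ? f(1, y) : f(0, y), not the authors' combined algorithm; the function next_state and the helper names are made up for the example. Both cofactors can be evaluated before the late signal arrives, leaving only a multiplexer on the critical cycle.

```python
from itertools import product


def shannon_decompose(f, index):
    """Return the cofactors (f with argument `index` fixed to 0, and to 1)."""
    def cofactor(value):
        def g(*args):
            full = list(args)
            full.insert(index, value)  # re-insert the fixed argument
            return f(*full)
        return g
    return cofactor(0), cofactor(1)


def mux(select, when0, when1):
    """Two-input multiplexer: the only logic left behind the late signal."""
    return when1 if select else when0


# Illustrative next-state function; `late` stands for the critical feedback signal.
def next_state(late, a, b):
    return (late & a) | ((late ^ 1) & (a ^ b))


f0, f1 = shannon_decompose(next_state, 0)

# The decomposed form agrees with the original on every input combination.
for late, a, b in product((0, 1), repeat=3):
    assert mux(late, f0(a, b), f1(a, b)) == next_state(late, a, b)
```

In a circuit, f0 and f1 would be computed speculatively in parallel, which is what gives retiming room to move registers across the decomposed logic.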
-
Efficient Incremental Clock Latency Scheduling for Large Circuits [p. 1091]
-
C. Albrecht
The clock latency scheduling problem is usually solved
on the sequential graph, also called register-to-register
graph. In practice, the extraction of the sequential graph
for the given circuit is much more expensive than computing
the clock latency schedule for the sequential graph. In this
paper we present a new algorithm for clock latency scheduling
which does not require the complete sequential graph as
input. The new algorithm is based on the parametric shortest
paths algorithm by Young, Tarjan and Orlin. It extracts
the sequential timing graph only partly, that is in the critical
regions, through a call back. It is still guaranteed that
the algorithm finds the critical cycle and the minimum clock
period. As additional input the algorithm only requires for
every register the maximum delay of any outgoing combinational
path. Computing these maximum delays for all the
registers is equivalent to the timing analysis problem, hence
they can be computed very efficiently. Computational results
on recently released public benchmarks and industrial designs
show that on average only 20.0% of the edges in the
sequential graph need to be extracted and this reduces the
overall runtime to 5.8%.
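For intuition, the underlying optimization can be sketched on a fully extracted sequential graph. The code below is not the paper's incremental algorithm and does not use the parametric shortest paths method; it swaps in plain bisection with a Bellman-Ford style feasibility check, and it assumes a setup-only model (hold constraints ignored). The graph and all names are hypothetical.

```python
def feasible(num_regs, edges, period):
    """True if latencies exist with lat[u] + delay <= lat[v] + period on every edge."""
    lat = [0.0] * num_regs
    for _ in range(num_regs):
        changed = False
        for u, v, delay in edges:
            # Setup constraint rewritten as lat[v] >= lat[u] + delay - period.
            needed = lat[u] + delay - period
            if needed > lat[v] + 1e-9:
                lat[v] = needed
                changed = True
        if not changed:
            return True   # every constraint is satisfied
    return False          # still relaxing after num_regs passes: positive cycle


def min_clock_period(num_regs, edges, tol=1e-6):
    """Bisect on the period; the largest edge delay is always feasible."""
    lo, hi = 0.0, max(delay for _, _, delay in edges)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if feasible(num_regs, edges, mid):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical register-to-register graph: (source, sink, max combinational delay).
edges = [(0, 1, 4.0), (1, 2, 2.0), (2, 0, 3.0), (1, 0, 1.0)]
period = min_clock_period(3, edges)
print(round(period, 3))
```

In this example the critical cycle 0 -> 1 -> 2 -> 0 has a mean delay of 3.0, so the bisection converges to about 3.0, whereas a zero-skew clock would need a period of 4.0 (the largest single delay).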
-
Analyzing Timing Uncertainty in Mesh-Based Clock Architectures [p. 1097]
-
S. M. Reddy, G. R. Wilke and R. Murgai
Mesh architectures are used to distribute critical global signals on
a chip, such as clock and power/ground. Redundancy created by
mesh loops smooths out undesirable variations between signal nodes
spatially distributed over the chip. However, one problem with the
mesh architectures is the difficulty in accurately analyzing large instances.
Furthermore, variations in process and temperature, supply
noise and crosstalk noise cause uncertainty in the delay from clock
source to flip-flops. In this paper, we study the problem of analyzing
timing uncertainty in mesh-based clock architectures. We propose
solutions for both pure mesh and (mesh + global-tree) architectures.
The solutions can handle large design and mesh instances. The maximum
error in uncertainty values reported by our solutions is 1-3ps
with respect to the golden Monte Carlo simulations, which is at most
0.5% of the nominal clock latency of about 600ps.
Organiser/Moderator: J. Rabaey, UC Berkeley, US
-
Deploying Networks Based on TinyOS
-
D. Culler
-
Platform-Based Design of Wireless Sensor Networks for Industrial Applications [p. 1103]
-
A. Bonivento, L. P. Carloni and A. Sangiovanni-Vincentelli
We present a methodology, an environment and
supporting tools to map an application on a wireless sensor
network (WSN). While the method is quite general, we make
extensive use of an example from the domain of industrial control, as
it is one of the most promising applications of WSNs and yet is
largely untouched by them.
Our design flow starts from a high level description of the
control algorithm and a set of candidate hardware platforms
and automatically derives an implementation that satisfies system
requirements while optimizing for power consumption. To
manage the heterogeneity and complexity inherent in this rather
complete design flow, we identify three abstraction layers and
introduce the tools to transition between different layers and
obtain the final solution.
We present a case study of a control application for manufacturing
plants that shows how the methodology covers all the
aspects of the design process, from conceptual description to
implementation.
-
An Environment for Controlled Experiments with In-House Sensor Networks [p. 1108]
-
V. Handziski, A. Koepke, A. Willig and A. Wolisz
Controlled experiments with larger sensor network configurations (100+ nodes), as
well as discovering inefficiencies in their operation, are rather complex. In this paper we
will present a concept of a testbed for WSNs supporting easy change of configuration,
multi tier operation and precise observation of network behaviour using a wired backbone
connectivity. The design considerations will be accompanied by early usage experience.
Organiser/Moderator: C. Enz, CSEM, CH
-
Hogthrob: Towards a Sensor Network Infrastructure for Sow Monitoring [p. 1109]
-
P. Bonnet, M. Leopold and K. Madsen
We aim at developing a next-generation system for sow monitoring. Today, farmers use RFID based
solutions with an ear tag on the sows and a reader located inside the feeding station. This does not allow
the farmers to locate a sow in a large pen, or to monitor the life cycle of the sow (detect heat period, detect
injury...). Our goal is to explore the design of a sensor network that supports such functionalities and meets
the constraints of this industry in terms of price, energy consumption and availability.
Moderators: C. Piguet, CSEM, CH; P. Maurine, LIRMM, FR
-
Ultra Efficient (Embedded) SoC Architectures Based on Probabilistic CMOS (PCMOS) Technology [p. 1110]
-
L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V. Palem and B. Seshasayee
Major impediments to technology scaling in the nanometer
regime include power (or energy) dissipation and "erroneous"
behavior induced by process variations and noise susceptibility.
In this paper, we demonstrate that CMOS devices whose
behavior is rendered probabilistic by noise (yielding probabilistic
CMOS or PCMOS) can be harnessed for ultra low energy
and high performance computation. PCMOS devices are
inherently probabilistic in that they are guaranteed to compute
correctly with a probability 1/2 < p < 1 and thus, by
design, they are expected to compute incorrectly with a probability
(1-p). In this paper, we show that PCMOS technology yields significant
improvements, both in the energy consumed as well as in the performance,
for probabilistic applications with broad utility. These benefits are derived using an
application-architecture-technology (A2T) co-design methodology
introduced here, yielding an entirely novel family of probabilistic system-on-a-chip
(PSOC) architectures. All of our application and architectural savings are
quantified using the product of the energy and the performance denoted (energy x
performance): the PCMOS based gains are as high as a substantial
multiplicative factor of over 560 when
compared to a competing energy-efficient CMOS based realization.
-
Minimizing Ohmic Loss and Supply Voltage Variation Using a Novel Distributed Power Supply Network [p. 1116]
-
M. Budnik and K. Roy
IR and di/dt events may cause ohmic losses and large
supply voltage variations due to system parasitics. Today,
parallelism in the power delivery path is used to reduce
ohmic loss while decoupling capacitance is used to
minimize the supply voltage variation. Future integrated
circuits, however, will exhibit large enough currents and
current transients to mandate additional safeguards. A
novel, distributed power delivery and decoupling network
is introduced reducing the supply voltage variation
magnitude by 67% and the future ohmic loss by 15.9W
(compared to today's power delivery and decoupling
networks) using conventional processing and packaging
techniques in a 130nm technology node.
-
An Ultra Low-Power TLB Design [p. 1122]
-
Y.-J. Chang
This paper presents an ultra low-power TLB design, which
combines two techniques to minimize the power dissipated in
TLB accesses. In our design, we first propose a real-time filter
scheme to eliminate the redundant TLB accesses. Without
delay penalty the proposed real-time filter can distinguish the
redundant TLB access as soon as the virtual address is
generated. The second technique is a banking-like structure,
which aims to reduce the TLB power consumption in case of
necessary accesses. We present two adaptive variants of the
banked TLB. Compared to the conventional banked TLB, these
two variants achieve better power efficiency without
increasing the TLB miss ratio. The experimental results show
that by filtering out all the redundant TLB accesses and then
minimizing the power consumption per TLB access, our design
can effectively improve the Energy*Delay product of the TLBs,
especially for the data TLBs with poor spatial locality.
-
Determining the Optimal Timeout Values for a Power-Managed System Based on the Theory of
Markovian Processes: Offline and Online Algorithms [p. 1128]
-
P. Rong and M. Pedram
This paper presents a timeout-driven DPM technique which relies
on the theory of Markovian processes. The objective is to
determine the energy-optimal timeout values for a system with
multiple power saving states while satisfying a set of user defined
performance constraints. More precisely, a controllable
Markovian process is exploited to model the power management
behavior of a system under the control of a timeout policy.
Starting with this model, a perturbation analysis technique is
applied to develop an offline gradient-based approach to
determine the optimal timeout values. Online implementation of
this technique for a system with dynamically-varying system
parameters is also described. Experimental results demonstrate
the effectiveness of the proposed approach.
Introduction
Dynamic power management (DPM), which refers to selective
shut-off or slow-down of components that are idle or
underutilized, has proven to be a particularly effective technique
for reducing power dissipation in such systems. In the literature,
various DPM techniques have been proposed, from heuristic
methods presented in early works [1][2] to stochastic
optimization approaches [3][4].
Among the heuristic DPM methods, the timeout policy is
the most widely used approach in industry and has been
implemented in many operating systems. Examples include the
power management scheme incorporated into the Windows
system, the low-power saving mode of the IEEE 802.11a-g
protocol for wireless LAN card, and the enhanced adaptive
battery life extender (EABLE) for the Hitachi disk drive. Most of
these industrial DPM techniques provide mechanisms to adjust
the timeout values at the user level.
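To make the timeout policy concrete, here is a minimal sketch assuming a single sleep state and a fixed list of observed idle periods. It is a brute-force comparison of candidate timeouts, not the Markovian perturbation analysis described above, and all power and energy figures are invented for illustration.

```python
def idle_energy(idle_periods, timeout, p_idle, p_sleep, e_wakeup):
    """Energy spent over the given idle periods under a fixed timeout policy.

    The device stays idle (power p_idle) for `timeout` seconds, then sleeps
    (power p_sleep) and pays e_wakeup once when the next request arrives.
    """
    total = 0.0
    for idle in idle_periods:
        if idle <= timeout:
            total += p_idle * idle  # request arrived before the timeout expired
        else:
            total += p_idle * timeout + p_sleep * (idle - timeout) + e_wakeup
    return total


# Invented workload (seconds) and device parameters (watts, joules).
idle_periods = [0.5, 2.0, 10.0, 0.2, 30.0]
candidates = [0.0, 1.0, 5.0, 20.0]

best = min(candidates, key=lambda t: idle_energy(idle_periods, t, 1.0, 0.1, 2.0))
print(best)
```

A very short timeout pays the wake-up cost too often, while a very long one wastes idle power; this trade-off is what a timeout optimization has to balance, here without modeling the performance constraints.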
Moderators: M. Lajolo, NEC Labs, US; A. Fedeli, STMicroelectronics, IT
-
A Formal Model and Efficient Traversal Algorithm for Generating Testbenches for Verification
of IEEE Standard Floating Point Division [p. 1134]
-
D. W. Matula and L. D. McFearin
We utilize a formal model of division for determining
a testbench of p-bit (dividend, divisor) pairs whose output
2p-bit quotients have properties characterizing these instances
as the most challenging for verifying any division
algorithm design and implementation. Specifically, our
test suites yield 2p-bit quotients where the leading p-bits
traverse all or a pseudo-random sample of leading bit
combinations, and the next p-bits comprise a round bit
followed by (p-1) identical bits. These values are proven
to be closest to the p-bit quotient rounding boundaries
and shown to possess other desirable coverage properties.
We introduce an efficient method of generating these test-benches.
We also describe applications of these testbenches
at the design simulation stage and the product evaluation
stage.
-
On the Relation between Simulation-Based and SAT-Based Diagnosis [p. 1139]
-
G. Fey, S. Safarpour, A. Veneris and R. Drechsler
The problem of diagnosis - or locating the source of an
error or fault - occurs in several areas of computer aided design,
such as dynamic verification, property checking, equivalence
checking and production test. Manually locating errors
can be a time consuming and resource-intensive process. Several
automated approaches for diagnosis have been presented,
among them simulation-based and SAT-based techniques.
These two approaches are found to be robust even for large
circuits as well as being applicable to a broad range of diagnosis
problems. An in-depth comparison of both approaches
necessary to augment our knowledge of diagnosis procedures
has not been addressed by previous work.
This paper provides a thorough analysis of the similarities
and differences between simulation-based and SAT-based
procedures for diagnosis. The relation between the basic approaches
is theoretically analyzed. Issues regarding performance
and diagnosis quality (resolution) are discussed. Experimental
data strengthens the theoretical results. This detailed
understanding of the relations between the techniques
is necessary to provide further improvements to the field of
diagnosis. The initial steps towards building a hybrid technique
are also presented.
-
An Integrated Open Framework for Heterogeneous MPSoC Design Space Exploration [p. 1145]
-
F. Angiolini, J. Ceng, R. Leupers, F. Ferrari, C. Ferri and L. Benini
In recent years, increasing manufacturing density has allowed
the development of Multi-Processor Systems-on-Chip (MPSoCs).
Application-Specific Instruction Set Processors (ASIPs) stand out
as one of the most efficient design paradigms and could be especially
effective as SoC computing engines. However, multiple
hurdles which are hindering the productivity of SoC designers and
researchers must be overcome first. Among them is the difficulty of
thoroughly exploring the design space by simultaneously sweeping
axes like processing elements, memory hierarchies and chip
interconnect fabrics. We tackle this challenge by proposing an integrated
approach where state-of-the-art platform modeling infrastructures,
at the IP core level and at the system level, meet to provide
the designer with maximum openness and flexibility in terms
of design space exploration.
-
Parallel Co-Simulation Using Virtual Synchronization with Redundant Host Execution [p. 1151]
-
D. Kim, S. Ha and R. Gupta
In traditional parallel co-simulation approaches, the
simulation speed is heavily limited by time synchronization
overhead between simulators and idle time caused by
data dependency. Recent work has shown that the time
synchronization overhead can be reduced significantly by
predicting the next synchronization points more effectively
or by separating trace-driven architecture simulation
from trace generation from component simulators.
The latter is known as virtual synchronization technique.
In this paper, we propose redundant host execution to
minimize the simulation idle time caused by data dependency
in simulation models. By combining virtual synchronization
and redundant host execution techniques we
could make parallel execution of multiple simulators a
viable solution for fast but cycle-accurate co-simulation.
Experiments show about 40% performance gain over a
technique which uses virtual synchronization only.
-
An Efficient and Portable Scheduler for RTOS Simulation and its Certified Integration to SystemC [p. 1157]
-
H. Nakamura, N. Sato and N. Tabuchi
We propose a new task scheduling algorithm for
timed-functional simulation of concurrent software
tasks. It attains efficiency by reducing the frequency of
context-switching between concurrent tasks. It also provides
a high degree of portability in the sense that it only
needs the underlying system to support a very small number
of primitives. We provide a concrete implementation
built on top of the SystemC scheduler and show some results
of preliminary evaluation.
Moderators: J. Leenstra, IBM Boeblingen, DE; C. Papachristou, Case Western Reserve U, US
-
Minimizing Test Power in SRAM through Reduction of Pre-Charge Activity [p. 1159]
-
L. Dilillo, P. Rosinger, B. M. Al-Hashimi and P. Girard
In this paper we analyze the test power of SRAM
memories and demonstrate that the full functional precharge
activity is not necessary during test mode because
of the predictable addressing sequence. We exploit this
observation in order to minimize power dissipation during
test by eliminating the unnecessary power consumption
associated with the pre-charge activity. This is achieved
through a modified pre-charge control circuitry,
exploiting the first degree of freedom of March tests,
which allows choosing a specific addressing sequence.
The efficiency of the proposed solution is validated
through extensive Spice simulations.
-
Efficient On-Line Interconnect Testing in FPGAs with Provable Detectability for Multiple Faults [p. 1165]
-
V. Suthar and S. Dutt
We present a very effective on-line interconnect built-in-self-test
(BIST) method I-BIST for FPGAs that uses a combination of the following
novel techniques: a track-adjacent and a switch-adjacent (also called
"mirror adjacent") pairwise net comparison mechanism that achieves high
detectability, a carefully designed set of only five net-configurations that
cover all types and locations of wire-segment and switch faults, a 2-phase
global-detailed testing approach, and a divide-and-conquer technique used
in detailed testing to quickly narrow down the set of potential suspect interconnects
that are then detail-diagnosed. These techniques result in I-BIST
having provable detectability in the presence of an unbounded number of
multiple faults, very high diagnosability of 99-100% even for high fault densities
of up to 10% that are expected in emerging nano-scale technologies,
and much lower test times or fault latencies than the previous best interconnect
BIST techniques. In particular, for application to on-line testing, our
method requires 2n roving-tester (ROTE) configurations to test an entire
n x n FPGA, while the previous best online interconnect BIST technique
requires n^2 configurations. Thus, I-BIST is an order of magnitude more
time- as well as power-efficient, and will scale well with rapidly increasing
FPGA device sizes that are expected in emerging technologies.
-
A Concurrent Testing Method for NoC Switches [p. 1171]
-
M. Hosseinabady, A. Banaiyan, M. N. Bojnordi and Z. Navabi
This paper proposes reuse of on-chip networks for
testing switches in Network on Chips (NoCs). The
proposed algorithm broadcasts test vectors of switches
through the on-chip networks and detects faults by
comparing output responses of switches with each other.
This algorithm alleviates the need for: (1) external
comparison of the output response of the circuit-under-test
with the response of a fault free circuit stored on a tester,
(2) on-chip signature analysis, and (3) a dedicated test-bus to
reach test vectors and collect their responses.
Experimental results on a few test benches compare the
proposed algorithm with traditional System on Chip (SoC)
test methods.
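The comparison step can be pictured with a small sketch. It assumes that identical switches receive the same broadcast test vector and that a majority vote decides which response is taken as fault free; the majority rule and the names below are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def suspect_switches(responses):
    """Return the ids of switches whose response differs from the majority."""
    majority, _ = Counter(responses.values()).most_common(1)[0]
    return sorted(sid for sid, resp in responses.items() if resp != majority)


# Hypothetical output responses of four identical switches to one test vector.
responses = {"sw0": 0b1011, "sw1": 0b1011, "sw2": 0b1001, "sw3": 0b1011}

suspects = suspect_switches(responses)
print(suspects)
```

Because the switches check each other, no stored golden response, signature analyzer or dedicated test bus appears in the sketch.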
-
A Secure Scan Design Methodology [p. 1177]
-
D. Hély, F. Bancel, M.-L. Flottes and B. Rouzeyre
It has been proven that scan path is a potent hazard
for secure chips. Scan based attacks have been recently
demonstrated against DES or AES and several solutions
have been presented in the literature in order to secure
the scan chain. Nevertheless, the different proposed
techniques are all ad hoc techniques, which are not
always easy to integrate into a completely automated
design flow or in an IP reuse environment. In this paper,
we propose a scan chain integrity detection mechanism,
which respects both automated design flow and IP reuse
environment.
Moderators: F. Ferrandi, Politecnico di Milano, IT; E. De Kock, Philips Research, NL
-
RAS-NANO: A Reliability-Aware Synthesis Framework for Reconfigurable Nanofabrics [p. 1179]
-
C. He and M. F. Jacome
Entering the nanometer era, a major challenge to current design
methodologies and tools is to effectively address the high defect densities
projected for nanotechnologies. To this end, we proposed a
reconfiguration-based defect-avoidance methodology for defect-prone
nanofabrics. It judiciously architects the nanofabric, using probabilistic
considerations, such that a very large number of alternative
implementations can be mapped into it, enabling defects to be circumvented
at configuration time in a scalable way. Building on this
foundation, in this paper we propose a synthesis framework aimed
at implementing this new design paradigm. A key novelty of our approach
with respect to traditional high level synthesis is that, rather
than carefully optimizing a single ("deterministic") solution, our goal
is to simultaneously synthesize a large family of alternative solutions,
so as to meet the required probability of successful configuration, or
yield, while maximizing the family's average performance. Experimental
results generated for a set of representative benchmark kernels,
assuming different defect regimes and target yields, empirically
show that our proposed algorithms can effectively explore the complex
probabilistic design space associated with this new class of high
level synthesis problems.
-
Layout Driven Data Communication Optimization for High Level Synthesis [p. 1185]
-
R. Kastner, W. Gong, X. Hao, F. Brewer, A. Kaplan, P. Brisk and M. Sarrafzadeh
High level synthesis transformations play a major part in
shaping the properties of the final circuit. However, most
optimizations are performed without much knowledge of the
final circuit layout. In this paper, we present a physically
aware design flow for mapping high level application
specifications to a synthesizable register transfer level
hardware description. We study the problem of optimizing
the data communication of the variables in the application
specification. Our algorithm uses floorplan information that
guides the optimization. We develop a simple, yet effective,
incremental floorplanner to handle the perturbations caused
by the data communication optimization. We show that the
proposed techniques can reduce the wirelength of the final
design, while maintaining a legal floorplan with the same
area as the initial floorplan.
-
Physical-Aware Simulated Annealing Optimization of Gate Leakage in Nanoscale Datapath Circuits [p. 1191]
-
S. P. Mohanty, R. Velagapudi and E. Kougianos
For CMOS technologies below 65nm, gate oxide direct
tunneling current is a major component of the total power
dissipation. This paper presents a simulated annealing
based algorithm for the gate leakage current reduction by
simultaneous scheduling, allocation and binding during behavioral
synthesis. Gate leakage current reduction is based
on the use of functional units of different oxide thickness
while simultaneously accounting for process variations. We
present a cost function that minimizes leakage and area
overhead. The algorithm minimizes the cost function for
a given delay trade-off factor. It uses a pre-characterized
cell library for tunneling current, delay and area, expressed
as analytical functions of the gate oxide thickness Tox. We
tested our approach using a number of behavioral level
benchmark circuits characterized for a 45nm library by integrating
our algorithm into a high-level synthesis system.
We obtained an average gate leakage reduction of 76.88%
with an average area overhead of 17.38% for different delay
trade-off factors ranging from 1.0 to 1.4.
-
Automatic Generation of Operation Tables for Fast Exploration of Bypasses in Embedded Processors [p. 1197]
-
S. Park, A. Shrivastava, N. Dutt, E. Earlie, A. Nicolau and Y. Paek
Customizing the bypasses in an embedded processor uncovers
valuable trade-offs between the power, performance and
the cost of the processor. Meaningful exploration of bypasses
requires a bypass-sensitive compiler. Operation Tables (OTs)
have been proposed to perform bypass-sensitive compilation.
However, due to lack of automated methods to generate OTs,
OTs are currently manually specified by the designer. Manual
specification of OTs is not only an extremely time consuming
task, but is also highly error-prone. In this paper,
we present AutoOT, an algorithm to automatically generate
OTs from a high-level processor description. Our experiments
on the Intel XScale processor model running MiBench
benchmarks demonstrate that AutoOT greatly reduces the
time and effort of specification. Automatic generation of
OTs makes it feasible to perform full bypass exploration on
the Intel XScale and thus discover interesting alternate bypass
configurations in a reasonable time. To further reduce
the compile-time overhead of OT generation, we propose another
novel algorithm, AutoOTDB. AutoOTDB is able to
cut the compile-time overhead of OT generation by half.
-
High Level Synthesis of Higher Order Continuous Time State Variable Filters with Minimum Sensitivity
and Hardware Count [p. 1203]
-
S. Pandit, S. Kar, C. Mandal and A. Patra
The sensitivity of the response of an analog system to
circuit parameter variations is a vital performance metric
for evaluation of its quality. This paper proposes a unified
high level synthesis methodology for higher order
continuous time state variable filters, considering the
optimization of this metric. Minimization of the hardware
count, which is another important issue, has also been
taken into account at a much earlier stage of design. The
entire methodology is illustrated with the case study of a
state variable low pass filter and the benefits of the
approach are clearly brought out.
Moderators: E. Giunchiglia, Genova U, IT; A. Tacchella, DIST - Genova U, IT
-
Disjunctive Image Computation for Embedded Software Verification [p. 1205]
-
C. Wang, Z. Yang, F. Ivancic and A. Gupta
Finite state models generated from software programs
have unique characteristics that are not exploited by existing
model checking algorithms. In this paper, we propose
a novel disjunctive image computation algorithm and other
simplifications based on these characteristics. Our algorithm
divides an image computation into a disjunctive set of
easier ones that can be performed in isolation. Hypergraph
partitioning is used to minimize the number of live variables
in each disjunctive component. We use the live variables
to simplify transition relations and reachable state subsets.
Our experiments on a set of real-world C programs show
that the new algorithm achieves orders-of-magnitude performance
improvement over the best known conjunctive image
computation algorithm.
-
Distance-Guided Hybrid Verification with GUIDO [p. 1211]
-
S. Shyam and V. Bertacco
Constrained random simulation is a widespread technique
used to perform functional verification on complex digital
designs, because it can generate simulation vectors at a very
high rate. However, the generation of high-coverage tests
remains a major challenge even in light of this high performance.
In this paper we present Guido, a hybrid verification
software that uses formal verification techniques to
guide the simulation towards a verification goal. Guido is
novel in that 1) it guides the simulation by means of a distance
function derived from the circuit structure, and 2) it
has a trace sequence controller that monitors and controls
the direction of the simulation by striking a balance between
random chance and controlled hill-climbing. We present experimental
results indicating that Guido can tackle complex
designs, including a picoJava microprocessor, and reach a
verification goal in far fewer simulation cycles than random
simulation.
-
What Lies between Design Intent Coverage and Model Checking? [p. 1217]
-
S. Das, P. Basu, P. Dasgupta and P. P. Chakrabarti
Practitioners of formal property verification often work
around the capacity limitations of formal verification tools
by breaking down properties into smaller properties that
can be checked on the sub-modules of the parent module.
To support this methodology, we have developed a formal
methodology for verifying whether the decomposition
is indeed sound and complete, that is, whether verifying
the smaller properties on the submodules actually guarantees
the original property on the parent module. In practice,
however, designers do not write properties for all modules
and thereby our previous methodology was applicable
to selected cases only. In this paper we present new formal
methods that allow us to handle RTL blocks in the analysis.
We believe that the new approach will significantly widen
the scope of the methodology, thereby enabling the validation
engineer to handle much larger designs than admitted
by existing formal verification tools.
-
On the Numerical Verification of Probabilistic Rewriting Systems [p. 1223]
-
J. Ben Hassen and S. Tahar
We present in this paper a technique for the formal verification
of probabilistic systems described in PMAUDE, a
probabilistic extension of the rewriting system Maude. Our
methodology is based on a numerical verification using the
probabilistic symbolic model checking tool PRISM. In particular,
we show how we can construct an abstract system
from the runs of a model that preserve all the probabilistic
properties of the latter. Then we deduce the probabilistic
matrix that will be used for the verification in PRISM.
-
Avoiding False Negatives in Formal Verification for Protocol-Driven Blocks [p. 1225]
-
G. Fey, D. Grosse and R. Drechsler
During Bounded Model Checking (BMC) blocks of a
design are often considered separately due to complexity
issues. Because the environment of a block is not available
for the proof, invalid input sequences frequently lead
to false negatives, i.e. counter-examples that can not occur
in the complete design. Finding and understanding such
false negatives is currently a time-consuming manual task.
Here, we propose a method to automatically avoid false
negatives which are caused by invalid input sequences for
blocks connected by standard communication protocols.
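
A minimal sketch of why an unconstrained block produces false negatives, assuming a hypothetical one-signal handshake block and using explicit enumeration of input sequences as a stand-in for a SAT-based BMC engine. The names `step`, `bounded_check` and `protocol` are illustrative and are not taken from the paper.

```python
from itertools import product

def step(state, req):
    """One clock cycle of a hypothetical block: (busy, error) -> next state."""
    busy, error = state
    if req and busy:
        error = True   # a request arrived while the block was busy
    busy = req         # the block is busy for one cycle after a request
    return (busy, error)

def violates(seq, input_ok):
    """True if seq is a valid input sequence that sets the error flag."""
    state = (False, False)
    for req in seq:
        if not input_ok(state, req):
            return False   # sequence breaks the protocol, so discard it
        state = step(state, req)
        if state[1]:
            return True    # property "error is never set" fails
    return False

def bounded_check(depth, input_ok=lambda state, req: True):
    """Return a counter-example of the given length, or None if there is none."""
    for seq in product([False, True], repeat=depth):
        if violates(seq, input_ok):
            return seq
    return None

# Assumed protocol rule: the environment never sends a request to a busy block.
protocol = lambda state, req: not (req and state[0])
```

Without the constraint, `bounded_check(3)` reports a counter-example that the complete design could never produce. Passing `protocol` as `input_ok` removes it.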
Organisers: E. Macii, Politecnico di Torino, IT; M. Casale-Rossi, Synopsys Inc, IT
Panel Moderator: G. De Micheli, EPFL Lausanne, CH
-
Low-Power Design Tools: Are EDA Vendors Taking this Matter Seriously? [p. 1227]
-
E. Macii, M. Pedram, D. Friebel, R. Aitken, A. Domic and R. Zafalon
While transistors per square millimeter and on-chip clock keep scaling smoothly according to
Moore's Law, Vdd does not, nor does Vth. This leads to a dramatic increase in chip power density, and to a
significant shift in the balance between dynamic and leakage power. In spite of the recent effort made by
EDA vendors in delivering novel solutions that help mitigate the effects of technology scaling on power
consumption, the question of whether the EDA industry is taking the low-power matter seriously still remains.
This session will provide an answer to this intriguing question, by first offering a short review of the
state-of-the-art in design technologies for dynamic and leakage power minimisation. The session will then
continue with a public "trial", in which OEMs, IDMs, IP and fabless semiconductor vendors will play the
role of the public prosecutor, against defendant EDA industry. The court's ruling will tell us about the
future targets the EDA vendors will pursue in low-power design technologies.
Moderators: V. Bertacco, The U of Michigan, US; R. Bloem, Inst for Software Tech Graz, AT
-
Formal Verification of SystemC Designs Using a Petri-Net Based Representation [p. 1228]
-
D. Karlsson, P. Eles and Z. Peng
This paper presents an effective approach to formally
verify SystemC designs. The approach translates SystemC
models into a Petri-Net based representation. The Petri-net
model is then used for model checking of properties expressed
in a timed temporal logic. The approach is particularly
suitable for, but not restricted to, models at a high
level of abstraction, such as transaction-level. The efficiency
of the approach is illustrated by experiments.
-
Monolithic Verification of Deep Pipelines with Collapsed Flushing [p. 1234]
-
R. Kane, P. Manolios and S. K. Srinivasan
We introduce collapsed flushing, a new flushing-based
refinement map for automatically verifying safety and liveness
properties of term-level pipelined machine models. We
also present a new method for handling liveness that is
both simpler to define and easier to verify than previous
approaches. To empirically validate collapsed flushing, we
ran extensive experiments which show more than an order-of-magnitude
improvement in verification times over standard
flushing. Furthermore, by combining collapsed flushing
with commitment refinement maps, we can monolithically
verify complex pipelined machine models with deep
pipelines - a salient feature of state-of-the-art microprocessor
designs - that previous approaches cannot handle.
-
Functional Test Generation Using Property Decompositions for Validation of Pipelined Processors [p. 1240]
-
H.-M. Koo and P. Mishra
Functional validation is a major bottleneck in pipelined processor
design. Simulation using functional test vectors is the
most widely used form of processor validation. While existing
model checking based approaches have proposed several
promising ideas for efficient test generation, many challenges
remain in applying them to realistic pipelined processors. The
time and resources required for test generation using existing
model checking based techniques can be extremely large. This
paper presents an efficient test generation technique using decompositional
model checking. The contribution of the paper
is the development of both property and design decomposition
procedures for efficient test generation of pipelined processors.
Our experimental results using a multi-issue MIPS processor
demonstrate several orders-of-magnitude reduction in memory
requirement and test generation time.
-
Proven Correct Monitors from PSL Specifications [p. 1246]
-
K. Morin-Allory and D. Borrione
We developed an original method to synthesize monitors
from declarative specifications written in the PSL standard.
Monitors observe sequences of values on their input signals,
and check their conformance to a specified temporal expression.
Our method implements both the weak and strong
versions of PSL FL operators, and has been proven correct
using the PVS theorem prover. This paper discusses the
salient aspects of the proof of our prototype implementation
for on-line design verification.
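
As a hedged illustration of the difference between weak and strong operators, the sketch below checks `p until q` over a finite trace in software. It is not the synthesized hardware monitor described above. The function name and the trace encoding (a list of dicts of booleans) are assumptions made for this example.

```python
def holds_until(trace, p, q, strong):
    """Check `p until q` starting at the first cycle of a finite trace.

    trace  : list of dicts mapping signal names to booleans, one per cycle
    strong : True  -> q must occur within the trace (PSL `until!`)
             False -> a trace where p holds to the end is also accepted (PSL `until`)
    """
    for cycle in trace:
        if cycle[q]:
            return True      # q arrived, the obligation is discharged
        if not cycle[p]:
            return False     # p dropped before q arrived
    return not strong        # trace ended without q


pending = [{"req": True, "ack": False}] * 3
served = pending + [{"req": False, "ack": True}]
```

On `pending`, the weak form holds and the strong form fails. On `served`, both hold.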
Moderators: B. Straube, FhG IIS/EAS Dresden, DE; I. Pomeranz, Purdue U, US
-
Space of DRAM Fault Models and Corresponding Testing [p. 1252]
-
Z. Al-Ars, S. Hamdioui and A. J. van de Goor
DRAMs play an important role in the semiconductor industry,
due to their highly dense layout and their low price per bit.
This paper presents the first framework of fault models
specifically designed to describe the
faulty behavior of DRAMs. The fault models in this paper
are the outcome of a close collaboration with the industry,
and are validated using a detailed Spice-based analysis
of the faulty behavior of real DRAMs. The resulting fault
space is then used to derive a couple of new DRAM-specific
tests, needed to detect some of the faults in practice.
-
Automatic March Tests Generations for Static Linked Faults in SRAMs [p. 1258]
-
A. Benso, A. Bosio, S. Di Carlo, G. Di Natale and P. Prinetto
Static Linked Faults are considered an interesting
class of memory faults. Their capability of influencing
the behavior of other faults causes the hiding of the
fault effect and makes test algorithm design a very
complex task. A large number of March Tests with
different fault coverage have been published and some
methodologies have been presented to automatically
generate March Tests. In this paper we present an
approach to automatically generate March Tests for
Static Linked Faults. The proposed approach
generates better test algorithms than previous ones, by
reducing the test length.
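
For readers unfamiliar with March tests, the sketch below runs the classic MATS+ algorithm, {any(w0); up(r0,w1); down(r1,w0)}, on a simulated memory. MATS+ targets simple stuck-at faults and is not one of the linked-fault tests generated by the proposed approach. The `FaultyMemory` model and the tuple encoding of March elements are assumptions made for illustration.

```python
class FaultyMemory:
    """Bit-wide memory with an optional stuck-at cell (hypothetical fault model)."""

    def __init__(self, size, stuck_addr=None, stuck_value=0):
        self.cells = [0] * size
        self.stuck_addr = stuck_addr
        self.stuck_value = stuck_value

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        if addr == self.stuck_addr:
            return self.stuck_value   # the faulty cell ignores what was written
        return self.cells[addr]


# Each March element: (address order, list of (operation, value)).
MATS_PLUS = [
    ("up", [("w", 0)]),
    ("up", [("r", 0), ("w", 1)]),
    ("down", [("r", 1), ("w", 0)]),
]


def run_march(memory, size, march):
    """Return True if every read returns its expected value."""
    for order, ops in march:
        addrs = range(size) if order == "up" else range(size - 1, -1, -1)
        for addr in addrs:
            for kind, value in ops:
                if kind == "w":
                    memory.write(addr, value)
                elif memory.read(addr) != value:
                    return False
    return True


def march_length(march):
    """Test length in operations per memory cell."""
    return sum(len(ops) for _, ops in march)
```

Here `march_length(MATS_PLUS)` is 5, which is the per-cell cost that a shorter test algorithm would reduce.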
-
Test Compaction for Transition Faults under Transparent-Scan [p. 1264]
-
I. Pomeranz and S. M. Reddy
Transparent-scan was proposed as an approach to test
generation and test compaction for scan circuits. Its effectiveness
was demonstrated earlier in reducing the test
application time for stuck-at faults. We show that similar
advantages exist when considering transition faults. We
first show that a test sequence under the transparent-scan
approach can imitate the application of broadside tests for
transition faults. Test compaction can then proceed as for
stuck-at faults by omitting test vectors from the test
sequence. A new approach for enhancing test compaction
is also described, whereby additional broadside tests are
embedded in the transparent-scan sequence without
increasing its length or reducing its fault coverage.
-
Test Set Enrichment Using a Probabilistic Fault Model and the Theory of Output Deviations [p. 1270]
-
Z. Wang, K. Chakrabarty and M. Goessel
We present a probabilistic fault model that allows
any number of gates in an integrated circuit to fail probabilistically.
Tests for this fault model, determined using the theory of
output deviations, can be used to supplement tests for classical
fault models, thereby increasing test quality and reducing the
probability of test escape. Output deviations can also be used
for test selection, whereby the most effective test patterns can be
selected from large test sets during time-constrained and high-volume
production testing. Experimental results are presented to
evaluate the effectiveness of patterns with high output deviations
for the single stuck-at and bridging fault models.
Moderators: T. Austin, The U of Michigan, US; S. Vassiliadis, TU Delft, NL
-
Vulnerability Analysis of L2 Cache Elements to Single Event Upsets [p. 1276]
-
H. Asadi, V. Sridharan, M. B. Tahoori and D. Kaeli
Memory elements are the most vulnerable system component
to soft errors. Since memory elements in cache arrays consume a
large fraction of the die in modern microprocessors, the probability
of particle strikes in these elements is high and can significantly
impact overall processor reliability. Previous work [2] has developed
effective metrics to accurately measure the vulnerability of
cache memory elements. Based on these metrics, we have developed
a reliability-performance evaluation framework, which has
been built upon the Simplescalar simulator.
In this work, we focus on the reliability aspects of L1 and L2
caches. Specifically, we present algorithms for tag vulnerability
computation and investigate and report in detail on the vulnerability
of data, tag, and status bits in the L2 array. Experiments on
SPECint2K and SPECfp2K benchmarks show that one class of error,
replacement error, makes up almost 85% of the total tag vulnerability
of a 1MB write-back L2 cache. In addition, the vulnerability
of L2 tag-addresses significantly increases as the size of the
memory address space increases. Results show that the L2 tag array
can be as susceptible as first-level instruction and data caches
(IL1/DL1) to soft errors.
-
Area-Efficient Error Protection for Caches [p. 1282]
-
S. Kim
Due to increasing concern about various errors, current
processors adopt error protection mechanisms. In particular,
protecting L2/L3 caches incurs as much as 12.5% area
overhead due to error correcting codes. Considering large
L2/L3 caches of current processors, the area overhead is
very high. This paper proposes an area-efficient error protection
scheme for L2/L3 caches. First, it selectively applies
ECC (Error Correcting Code) to only dirty cache lines and
other clean cache lines are protected using simple parity
check codes. Second, the dirty cache lines are periodically
cleaned by exploiting the generational behavior of cache
lines. Experimental results show that the cleaning technique
effectively reduces the number of dirty cache lines per cycle.
The ECCs of this reduced number of dirty cache lines
can be maintained in a small storage. Our proposed scheme
is shown to reduce the area overhead of a 1MB L2 cache for
error protection by 59% for SPEC2000 benchmarks running
on a typical four-issue superscalar processor.
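
The 12.5% figure is consistent with a common ECC configuration of 8 check bits per 64 data bits. The short calculation below makes that arithmetic explicit under stated assumptions: the 64-bit word size, the one-parity-bit-per-word choice and the `mixed_overhead` model are illustrative and are not taken from the paper, and the sketch does not attempt to reproduce the reported 59% reduction.

```python
def check_bit_overhead(data_bits, check_bits):
    """Storage overhead of a code as a fraction of the protected data bits."""
    return check_bits / data_bits


# Assumption: SECDED ECC with 8 check bits per 64 data bits.
ECC_OVERHEAD = check_bit_overhead(64, 8)

# Assumption: a single parity bit per 64 data bits.
PARITY_OVERHEAD = check_bit_overhead(64, 1)


def mixed_overhead(ecc_fraction):
    """Hypothetical model: parity on every word, ECC storage for a fraction of them."""
    return PARITY_OVERHEAD + ecc_fraction * ECC_OVERHEAD
```

Under this model, the fewer lines that need ECC storage at any one time, the closer the total overhead gets to that of parity alone.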
-
Microarchitectural Floorplanning under Performance and Thermal Tradeoff [p. 1288]
-
M. Healy, M. Vittes, M. Ekpanyapong, C. Ballapuram, S. K. Lim, H.-H. S. Lee and G. H. Loh
In this paper, we present the first multi-objective
microarchitectural floorplanning algorithm for designing high-performance,
high-reliability processors in the early design phase.
Our floorplanner takes a microarchitectural netlist and determines
the placement of the functional modules while simultaneously
optimizing for performance and thermal reliability. The
traditional design objectives such as area and wirelength are also
considered. Our multi-objective hybrid floorplanning approach
combining Linear Programming and Simulated Annealing is
shown to be fast and effective in obtaining high-quality solutions.
We evaluate the trade-off of performance, temperature, area, and
wirelength and provide comprehensive experimental results.
Moderators: R. Hermida, Madrid Complutense U, ES; T. Shiple, Synopsys, FR
-
Optimizing High Speed Arithmetic Circuits Using Three-Term Extraction [p. 1294]
-
A. Hosangadi, F. Fallah and R. Kastner
Carry Save Adder (CSA) trees are commonly used for
high speed implementation of multi-operand additions.
We present a method to reduce the number of the adders
in CSA trees by extracting common three-term
subexpressions. Our method can optimize multiple CSA
trees involving any number of variables. This
optimization has a significant impact on the total area of
the synthesized circuits, as we show in our experiments.
To the best of our knowledge, this is the only known
method for eliminating common subexpressions in CSA
structures. Since extracting common subexpressions can
potentially increase delay, we also present a delay aware
extraction algorithm that takes into account the different
arrival times of the signals.
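
A small worked example of the idea, assuming two sums F1 = a+b+c+d and F2 = a+b+c+e that share the three-term subexpression a+b+c. Built separately, each four-operand sum needs two carry-save adders, four in total. Sharing the CSA for a+b+c brings the total to three. The function names are hypothetical, and the sketch models CSAs on non-negative Python integers rather than on a netlist.

```python
def csa(x, y, z):
    """Carry-save adder: compress three operands into a sum word and a carry word."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def shared_sums(a, b, c, d, e):
    """Compute F1 = a+b+c+d and F2 = a+b+c+e with three CSAs instead of four."""
    s, k = csa(a, b, c)        # shared CSA for the common subexpression a+b+c
    s1, k1 = csa(s, k, d)      # second level for F1
    s2, k2 = csa(s, k, e)      # second level for F2
    # The final additions model the carry-propagate adders at the tree outputs.
    return s1 + k1, s2 + k2
```

For example, `shared_sums(3, 5, 7, 11, 13)` returns the same values as evaluating `3+5+7+11` and `3+5+7+13` directly.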
-
Efficient Minimization of Fully Testable 2-SPP Networks [p. 1300]
-
A. Bernasconi, V. Ciriani, R. Drechsler and T. Villa
The paper presents a heuristic algorithm for the minimization
of 2-SPP networks, i.e., three-level EXOR-AND-OR
forms with EXOR gates restricted to fan-in 2. Previous
works had presented exact algorithms for the minimization
of unrestricted SPP networks and of 2-SPP networks. The
exact minimization procedures were formulated as covering
problems as in the minimization of SOP forms and had
worst-case exponential complexity. Extending the expand-irredundant-reduce
paradigm of the ESPRESSO heuristic,
we propose a minimization algorithm for 2-SPP networks
that iterates local minimization and reshape of a solution
until no further improvement is found. We also introduce the notion of
EXOR-irredundant to prove that OR-AND-EXOR irredundant
networks are fully testable and guarantee that our algorithm
yields OR-AND-EXOR irredundant solutions. We
report a large set of experiments showing impressive high-quality
results with affordable run times, handling also examples
whose exact solutions could not be computed.
-
Pre-Synthesis Optimization of Multiplications to Improve Circuit Performance [p. 1306]
-
R. Ruiz-Sautua, M. C. Molina, J. M. Mendias and R. Hermida
Conventional high-level synthesis uses the worst case
delay to relate all inputs to all outputs of an operation.
This is a very conservative approximation of reality,
especially in arithmetic operations (where some bits are
required later than others and some bits are produced
earlier than others). This paper proposes a pre-synthesis
optimization algorithm that takes advantage of this
feature for more efficient high-level synthesis of data-flow
graphs formed by additions and multiplications. The
presented pre-processor analyzes the critical path at bit-granularity
and splits the arithmetic operations into subword
fragments. In particular, some of the specification
multiplications are broken up into several smaller
multiplications, additions, and other operations of three
new types specially defined to reduce the clock cycle
duration. These fragments become the input to any
regular high-level synthesis tool to speed up circuit
execution times. The experimental results
show that implementations obtained from the optimized
specification are on average 70% faster and in most cases
substantial area reductions are also achieved.
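
To illustrate what breaking a multiplication into smaller multiplications and additions means, the sketch below uses the standard schoolbook decomposition at a split point of `k` bits. This is a generic identity, not the paper's algorithm, and it does not model the three new operation types mentioned above. The function name is hypothetical.

```python
def split_multiply(a, b, k):
    """Multiply non-negative integers a and b using k-bit low/high fragments."""
    mask = (1 << k) - 1
    a_lo, a_hi = a & mask, a >> k
    b_lo, b_hi = b & mask, b >> k
    # a*b = a_hi*b_hi*2^(2k) + (a_hi*b_lo + a_lo*b_hi)*2^k + a_lo*b_lo
    high = (a_hi * b_hi) << (2 * k)
    middle = (a_hi * b_lo + a_lo * b_hi) << k
    low = a_lo * b_lo
    return high + middle + low
```

The low fragment `a_lo * b_lo` depends only on the low input bits, so its result bits can be available before the high-order inputs arrive. This is the kind of bit-level timing slack that a worst-case delay model ignores.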
-
Crosstalk-Aware Domino Logic Synthesis [p. 1312]
-
Y.-Y. Liu and T. T. Hwang
We propose a logic synthesis flow that uses the
functionality of the circuit to synthesize a domino-cell network
in which more wire pairs are crosstalk-immune to each other.
For that purpose, output phase flipping and
crosstalk-aware technology mapping are used. In addition, a
metric to measure the crosstalk sensitivity of domino cells at the
synthesis level is proposed. Experimental results demonstrate
that the crosstalk sensitivity of the synthesized domino-cell
network is reduced by 51% using our synthesis flow
as compared with the conventional methodology. Furthermore,
after placement and routing are performed, the ratio of the
number of crosstalk-immune wire pairs to the number of
total wire pairs is about 25% using our methodology as
compared to 9% using conventional techniques.
Moderators: R. Ernst, TU Braunschweig, DE; P. Ienne, EPFL IC LAP, CH
-
TRAIN: A Virtual Transaction Layer Architecture for TLM-Based HW/SW Codesign of
Synthesizable MPSoC [p. 1318]
-
W. Klingauf, H. Gaedke and R. Guenzel
Our concept of a virtual transaction layer (VTL) architecture
allows transaction-level communication channels to be
mapped directly onto a synthesizable multiprocessor SoC implementation.
The VTL is above the physical MPSoC communication
architecture, acting as a hardware abstraction layer for both
HW and SW components. TLM channels are represented by
virtual channels which efficiently route transactions between
SW and HW entities through the on-chip communication network
with respect to quality-of-service and realtime requirements.
The goal is to methodically simplify MPSoC design by
systematic HW/SW interface abstraction, thus enabling early
SW verification, rapid prototyping and fast exploration of critical
design issues. With TRAIN, we present our implementation
of such a VTL architecture for Virtex-II Pro and PowerPC
and illustrate its efficiency by experimentation.
-
Configurable Multiprocessor Platform with RTOS for Distributed Execution of UML 2.0
Designed Applications [p. 1324]
-
T. Arpinen, P. Kukkala, E. Salminen, M. Hännikäinen and T. D. Hämäläinen
This paper presents the design and full prototype
implementation of a configurable multiprocessor platform
that supports distributed execution of applications
described in UML 2.0. The platform is comprised of
multiple Altera Nios II softcore processors and custom
hardware accelerators connected by the Heterogeneous
IP Block Interconnection (HIBI) communication
architecture. Each processor has a local copy of the eCos
real-time operating system for the scheduling of multiple
application threads. The mapping of a UML application
into the proposed platform is presented by distributing a
WLAN medium access control protocol onto multiple
CPUs. The experiments performed on FPGA show that
our approach raises system design to a new level. To our
knowledge, this is the first real implementation combining
a high-level design flow with a synthesizable platform.
-
ASIP-Based Multiprocessor SoC Design for Simple and Double Binary Turbo Decoding [p. 1330]
-
O. Muller, A. Baghdadi and M. Jézéquel
This paper presents a new multiprocessor platform for
high throughput turbo decoding. The proposed platform is
based on a new configurable ASIP combined with an
efficient memory and communication interconnect scheme.
This Application-Specific Instruction-set Processor has an
SIMD architecture with a specialized and extensible
instruction-set and 5-stage pipeline control. The attached
memories and communication interfaces enable the design
of efficient multiprocessor architectures. These
multiprocessor architectures benefit from the recent
shuffling technique introduced in the turbo-decoding field to
reduce communication latency. The major characteristics of
the proposed platform are its flexibility and scalability
which make it reusable for various standards and operating
modes. Results obtained for double binary DVB-RCS turbo
codes demonstrate a 100 Mbit/s throughput using a 16-ASIP
multiprocessor architecture.
|