DATE 2005 ABSTRACTS
Volume I
-
SoC in Nanoera: Challenges and Endless Possibility [p. 2]
-
J. Kong
Growth of the semiconductor industry has been driven by a series of electronic system applications, such
as personal computers, home entertainment, and mobile handsets. The most recent growth is driven by the revolution of
the information technology (IT) industry. The key word of this next revolution is "Ubiquitous". As semiconductor
technology is scaled into the nanometer regime where hundreds of millions of transistors can be placed on a chip,
designers are now incorporating their advanced system concepts into silicon. These systems include digital,
analogue, and RF components. System-on-a-Chip (SoC) enables the IT industry to realise various products that can
comply with rapidly changing market requirements as well as with unprecedented ubiquitous life style. However,
SoC products in the ubiquitous era are facing challenges such as high performance, low-power, small-size and low-cost.
These factors may jeopardise the success of SoC unless there is a breakthrough from system-level design
through manufacturing technologies. Advanced EDA technology is indispensable to cope with ever-increasing
design complexity of gigascale integration and complicated physical effects inherent in nanoscale technology.
In this talk, the speaker will provide an overview of the key challenges with SoC developments in days to come,
namely: Issues in the system-level design, low power, high performance, verification, and relevant nanometer
technology. Solutions including some of Samsung's recent R&D activities in those areas will be discussed and the
speaker will conclude his speech by saying that all these challenges will promise the endless possibilities of the SoC.
-
Striking a New Balance in the Nanometer Era: First-Time-Right and Time-To-Market
Demands Versus Technology Challenges [p. 3]
-
G. Hughes
Today's semiconductor marketplace demands nanometer designs of unprecedented complexity and
performance, with uncompromising time-to-market requirements. This drives a focus on predictable, high-quality
design results despite the challenges associated with these next-generation technologies.
This scenario is complicated even further by the need to address these challenges across a wide spectrum of
products, ranging from high-frequency processor designs to extremely complex ASIC designs. In the nanometer era,
the common factor for ensuring market leadership across this broad variety of products is achieving single-pass
design success to avoid costly re-spins and the loss of market opportunities: design turnaround time must be
minimized without compromising design efficiency and first-time-right requirements.
Design automation tools must balance both requirements, while providing designers with information that enables
them to "design around" potential trouble spots in both today's and tomorrow's environment to ensure an
exceptional level of built-in quality. This discussion highlights some of the innovations IBM is developing, such as
variation-aware and statistical timing, faster serial and parallel processing, more highly integrated data models and
tools, and concurrent chip and package design, which optimise the competing requirements of simultaneously
reducing design turnaround time and achieving single-pass design success, while effectively managing the technical
challenges associated with nanometer designs.
Moderators: S. Vernalde, IMEC, BE; S. Vassiliadis, TU Delft, NL
-
A Register Allocation Algorithm in the Presence of Scalar Replacement for
Fine-Grain Configurable Architectures [p. 6]
-
N. Baradaran and P. Diniz
The aggressive application of scalar replacement to array
references substantially reduces the number of memory
operations at the expense of a possibly very large number
of registers. In this paper we describe a register allocation
algorithm that assigns registers to scalar replaced array
references along the critical paths of a computation, in
many cases exploiting the opportunity for concurrent memory
accesses. Experimental results, for a set of image/signal
processing code kernels, reveal that the proposed algorithm
leads to a substantial reduction of the number of execution
cycles for the corresponding hardware implementation on
a contemporary Field-Programmable-Gate-Array (FPGA)
when compared to other greedy allocation algorithms, in
some cases using even fewer registers.
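As a rough illustration of the trade-off described above, the sketch below applies scalar replacement by hand to a 3-tap FIR loop and counts reads of the input array. The function names and the filter itself are hypothetical examples and are not taken from the paper; the register allocation algorithm the authors propose is not shown.

```python
def fir3_naive(x, c):
    """3-tap FIR where every tap re-reads the input array x."""
    reads = 0  # counts reads of x only (coefficient reads are ignored)
    y = []
    for i in range(2, len(x)):
        y.append(c[0] * x[i] + c[1] * x[i - 1] + c[2] * x[i - 2])
        reads += 3
    return y, reads


def fir3_scalar_replaced(x, c):
    """Same filter with x[i-1] and x[i-2] held in scalars ("registers")."""
    if len(x) < 3:
        return [], 0
    x2, x1 = x[0], x[1]  # two initial loads fill the register window
    reads = 2
    y = []
    for i in range(2, len(x)):
        x0 = x[i]  # the only read of x in each iteration
        reads += 1
        y.append(c[0] * x0 + c[1] * x1 + c[2] * x2)
        x2, x1 = x1, x0  # rotate the register window
    return y, reads
```

Both versions compute the same outputs; the second trades two extra scalars for roughly one third of the array reads, which is the kind of register pressure the abstract refers to.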
-
Resource Sharing and Pipelining in Coarse-Grained Reconfigurable Architecture for
Domain-Specific Optimization [p. 12]
-
Y. Kim, M. Kiemb, C. Park, J. Jung, and K. Choi
Coarse-grained reconfigurable architectures aim to
achieve both goals of high performance and flexibility.
However, existing reconfigurable array architectures
require many resources without considering the specific
application domain. Functional resources that take long
latency and/or large area can be pipelined and/or
shared among the processing elements. Therefore the
hardware cost and the delay can be effectively reduced
without any performance degradation for some application
domains. We suggest such a reconfigurable array architecture
template and a design space exploration flow
for domain-specific optimization. Experimental results
show that our approach is much more efficient both in
performance and area compared to existing reconfigurable
architectures.
-
A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores Using
Dynamic Hardware/Software Partitioning [p. 18]
-
R. Lysecky and F. Vahid
Field programmable gate arrays (FPGAs) provide designers with
the ability to quickly create hardware circuits. Increases in FPGA
configurable logic capacity and decreasing FPGA costs have
enabled designers to more readily incorporate FPGAs in their
designs. FPGA vendors have begun providing configurable soft
processor cores that can be synthesized onto their FPGA
products. While FPGAs with soft processor cores provide
designers with increased flexibility, such processors typically
have degraded performance and energy consumption compared to
hard-core processors. Previously, we proposed warp processing,
a technique capable of optimizing a software application by
dynamically and transparently re-implementing critical software
kernels as custom circuits in on-chip configurable logic. In this
paper, we study the potential of a MicroBlaze soft-core based
warp processing system to eliminate the performance and energy
overhead of a soft-core processor compared to a hard-core
processor. We demonstrate that the soft-core based warp
processor achieves average speedups of 5.8 and energy
reductions of 57% compared to the soft core alone. Our data
shows that a soft-core based warp processor yields performance
and energy consumption competitive with existing hard-core
processors, thus expanding the usefulness of soft processor cores
on FPGAs to a broader range of applications.
Keywords
Hardware/software partitioning, warp processing, FPGA, dynamic
optimization, soft cores, MicroBlaze.
-
Reconfigurable Elliptic Curve Cryptosystems on a Chip [p. 24]
-
R. Cheung, W. Luk, and P. Cheung
This paper presents a System-on-a-Chip (SoC) architecture
for Elliptic Curve Cryptosystems (ECC) which targets
reconfigurable hardware. A four-level partitioning
scheme is described for exploring the area and speed trade-offs.
A design generator is used to generate parameterisable
building blocks for the configurable SoC architecture.
A secure web server, which runs on a reconfigurable soft-processor
and an embedded hard-processor, shows over
2000 times speedup when the computationally-intensive operations
run on the customised building blocks. The embedded
on-chip timer block gives accurate performance information.
The design factors of configurable SoC architectures
are also discussed and evaluated.
-
An Infrastructure to Functionally Test Designs Generated by Compilers Targeting FPGAs [p. 30]
-
R. Rodrigues and J. Cardoso
This paper presents an infrastructure to test the functionality
of the specific architectures output by a high-level
compiler targeting dynamically reconfigurable
hardware. It results in a suitable scheme to verify the architectures
generated by the compiler, each time new optimization
techniques are included or changes in the compiler
are performed. We believe this kind of infrastructure
is important to verify, by functional simulation, further research
techniques, as far as compilation to Field-Programmable
Gate Array (FPGA) platforms is concerned.
-
FPGA Architecture for Multi-Style Asynchronous Logic [p. 32]
-
N. Huot, H. Dubreuil, L. Fesquet, and M. Renaudin
This paper presents a novel FPGA architecture for
implementing various styles of asynchronous logic. The
main objective is to break the dependency between the
FPGA architecture dedicated to asynchronous logic and
the logic style. The innovative aspects of the architecture
are described. Moreover the structure is well suited to be
rebuilt and adapted to fit with further asynchronous
logic evolutions thanks to the architecture genericity. A
full-adder was implemented in different styles of logic to
show the architecture flexibility.
Organiser/Moderator: G. Gielen, KU Leuven, BE
Speakers: G. Gielen, KU Leuven, BE; W. Dehaene, KU Leuven, BE; D. Draxelmayr, Infineon, AT;
E. Janssens, ST Microelectronics, BE; T. Vucurevich, Cadence, US; K. Maex, IMEC, BE; P. Christie, Philips, NL
-
Analog and Digital Circuit Design in 65 nm CMOS: End of the Road? [p. 36]
-
G. Gielen, W. Dehaene, P. Christie, D. Draxelmayr, E. Janssens, K. Maex, and T. Vucurevich
This special session addresses the problems that designers face when implementing analog and digital circuits
in nanometer technologies. An introductory embedded tutorial will give an overview of the design problems at
hand: the leakage power and process variability and their implications for digital circuits and memories, and the
reducing supply voltages, the design productivity and signal integrity problems for embedded analog blocks. Next,
a panel of experts from both industrial semiconductor houses and design companies, EDA vendors and research
institutes will present and discuss with the audience their opinions on whether the design road ends at marker "65nm" or not.
Moderators: E. Larsson, Linkoping U, SE; R. Dorsch, IBM, DE
-
On-Chip Test Infrastructure Design for Optimal Multi-Site Testing of System Chips [p. 44]
-
S. Goel and E. Marinissen
Multi-site testing is a popular and effective way to increase test
throughput and reduce test costs. We present a test throughput
model, in which we focus on wafer testing, and consider parameters
like test time, index time, abort-on-fail, and contact yield.
Conventional multi-site testing requires sufficient ATE resources,
such as ATE channels, to allow testing of multiple SOCs in parallel.
In this paper, we design and optimize on-chip DfT, in order to
maximize the test throughput for a given SOC and ATE. The on-chip
DfT consists of an E-RPCT wrapper, and, for modular SOCs,
module wrappers and TAMs. We present experimental results for
a Philips SOC and several ITC'02 SOC Test Benchmarks.
-
Test Planning for Mixed-Signal SOCs with Wrapped Analog Cores [p. 50]
-
A. Sehgal, F. Liu, S. Ozev, and K. Chakrabarty
Many SOCs today contain both digital and analog embedded cores.
Even though the test cost for such mixed-signal SOCs is significantly
higher than that for digital SOCs, most prior research in this area has
focused exclusively on digital cores. We propose a low-cost test development
methodology for mixed-signal SOCs that allows the analog
and digital cores to be tested in a unified manner, thereby minimizing
the overall test cost. The analog cores in the SOC are wrapped such
that they can be accessed using a digital test access mechanism (TAM).
We evaluate the impact of the use of analog test wrappers on area overhead
and test time. To reduce area overhead, we present an analog test
wrapper optimization technique, which is then combined with TAM optimization
in a cost-oriented heuristic approach for test scheduling. We
also demonstrate the feasibility of using analog wrappers by presenting
transistor-level simulations for an analog wrapper and a representative
core. We present experimental results on test scheduling for
an ITC'02 benchmark SOC that has been augmented with five analog
cores.
-
Logic Design for On-Chip Test Clock Generation - Implementation Details and
Impact on Delay Test Quality [p. 56]
-
M. Beck, O. Barondeau, M. Kaibel, F. Poehl, X. Lin, and R. Press
This paper addresses delay test for SOC devices with high frequency clock domains. A logic design for on-chip high-speed clock generation, implemented to avoid expensive test equipment, is described in detail. Techniques for on-chip clock generation, meant to reduce test vector count and to increase test quality, are discussed. ATPG results for the proposed techniques are given.
-
Test Time Reduction Reusing Multiple Processors in a Network-on-Chip Based Architecture [p. 62]
-
A. Amory, M. Lubaszewski, F. Moraes, and E. Moreno
The increasing complexity and the short life cycles of
embedded systems are pushing the current system-on-chip
designs towards a rapid increase in the number
of programmable processing units, while decreasing the
gate count for custom logic. Considering this trend, this
work proposes a test planning method capable of reusing
available processors as test sources and sinks, and
the on-chip network as the test access mechanism. Experimental
results are based on ITC'02 benchmarks and
on two open core processors compliant with MIPS and
SPARC instruction sets. The results show that the cooperative
use of both the on-chip network and the embedded
processors can increase the test parallelism and
reduce the test time without additional cost in area and
pins.
Organiser: G. Martin, Tensilica, US
Moderator: L. Lavagno, Politecnico di Torino, IT
Speakers: S. Edwards, Columbia U, US; A. Dean, North Carolina State U, US; I. Oliver, Nokia, FI
-
The Challenges of Hardware Synthesis from C-like Languages [p. 66]
-
S. Edwards
MANY TECHNIQUES for synthesizing digital hardware from
C-like languages have been proposed, but none have emerged
as successful as Verilog or VHDL for register-transfer-level design.
This paper looks at two of the fundamental challenges:
concurrency and timing control.
Familiarity is the main reason C-like languages have been
proposed for hardware synthesis. Synthesize hardware from
C, proponents claim, and we will be able to turn a C programmer
into a hardware designer. Another common motivation is
hardware/software codesign: today's systems usually contain a
mix of hardware and software, and it is often unclear initially
which portions to implement in hardware. Here, using a single
language should simplify the migration task.
-
Software Thread Integration and Synthesis for Real-Time Applications [p. 68]
-
A. Dean
Software Thread Integration (STI) [1] and Asynchronous
STI (ASTI) [2] are compiler techniques which interleave
functions from separate program threads at the assembly
language level, creating implicitly multithreaded functions
which provide low-cost concurrency on generic hardware.
This extends the reach of software and reduces the need to
rely upon dedicated hardware. STI and ASTI are driven by
two types of timing requirements: thread-level (e.g. the delay
between an event occurring and a service thread running)
and instruction-level (e.g. when a specific instruction
or code region must begin executing relative to the
start of the function or another such instruction or region).
These coarse- and fine-grain approaches provide a precise
method of defining timing requirements. STI provides synchronous
thread progress; both functions proceed in lock-step.
ASTI provides asynchronous (independent) thread progress
through the use of lightweight context switches (coroutine
calls) between primary and secondary threads. The primary
thread has hard-real-time constraints, while the secondary
thread is not real-time, or has much longer deadlines.
We assume that instructions take a predictable number
of cycles to execute. This implies a straightforward instruction
execution pipeline (if used) and a predictable memory
system (e.g. the cache is locked, software managed, or
not present). These requirements are met for the processors
we target: 8 and 16 bit microcontrollers. We target applications
with only one hard real-time thread (the primary
thread, used for the communication protocol), although recent
extensions to STI [3] support multiple hard-real-time
primary threads. We have implemented a thread-integrating
compiler Thrint which implements many of these analyses
and transformations for the AVR architecture, which is 8-bit,
load/store, and optimized for embedded C code.
-
Applying UML and MDA to Real Systems Design [p. 70]
-
I. Oliver
Traditionally system design has been made from a black
box/functionality only perspective which forces the developer
to concentrate on how the functionality can be decomposed
and recomposed into so called components. While this
technique is well established and well known it does suffer
from some drawbacks; namely that the systems produced can
often be forced into certain, incompatible architectures, difficult
to maintain or reuse and the code itself difficult to debug.
Now that ideas such as the OMG's Model Driven Architecture
(MDA) or Model Based Engineering (MBE) and the ubiquitous
modelling language UML are being used (allegedly)
and desired we face a number of challenges to existing techniques.
When working with the UML, one must take into consideration
object orientation. The UML is a language for expressing
systems (or whatever) in terms of object oriented
concepts and its meta-model and its semantics make this explicit.
Object orientation, unlike functional based approaches
makes both functionality and data first-class modelling elements.
Whenever anything is specified in UML, that modelling
element is either based on the notion of a class or is
directly related to a class. Some methods appear to adhere to
this but fail to use classes in this way by assuming the existence
of a "global" system and then just using classes as data
elements. Effectively the UML equivalent of programming
Fortran in C++.
Moderators: C. Piguet, CSEM, CH; A. Macii, Politecnico di Torino, IT
-
Energy Bounds for Fault-Tolerant Nanoscale Designs [p. 74]
-
D. Marculescu
The problem of determining lower bounds for the
energy cost of a given nanoscale design is addressed via a
complexity theory-based approach. This paper provides a
theoretical framework that is able to assess the trade-offs
existing in nanoscale designs between the amount of
redundancy needed for a given level of resilience to errors and
the associated energy cost. Circuit size, logic depth and error
resilience are analyzed and brought together in a theoretical
framework that can be seamlessly integrated with automated
synthesis tools and can guide the design process of nanoscale
systems comprised of failure prone devices. The impact of
redundancy addition on the switching energy and its
relationship with leakage energy is modeled in detail. Results
show that 99% error resilience is possible for fault-tolerant
designs, but at the expense of at least 40% more energy if
individual gates fail independently with probability of 1%.
-
DVS for On-Chip Bus Designs Based on Timing Error Correction [p. 80]
-
H. Kaul, D. Sylvester, D. Blaauw, T. Mudge, and T. Austin
On-chip buses are typically designed to meet performance
constraints at worst-case conditions, including process corner,
temperature, IR-drop, and neighboring net switching pattern. This
can result in significant performance slack at more typical
operating conditions. In this paper, we propose a dynamic voltage
scaling (DVS) technique for buses, based on a double sampling
latch which can detect and correct for delay errors without the
need for retransmission. The proposed approach recovers the
available slack at non-worst-case operating points through more
aggressive voltage scaling and tracks changing conditions by
monitoring the error recovery rate. Voltage margins needed in
traditional designs to accommodate worst-case performance
conditions are therefore eliminated, resulting in a significant
improvement in energy efficiency. The approach was implemented
for a 6mm memory read bus operating at 1.5GHz (0.13 μm
technology node) and was simulated for a number of benchmark
programs. Even at the worst-case process and environment
conditions, energy gains of up to 17% are achieved, with error
recovery rates under 2.3%. At more typical process and
environment conditions, energy gains range from 35% to 45%,
with a performance degradation under 2%. An analysis of
optimum interconnect architectures for maximizing energy gains
with this approach shows that the proposed approach performs
well with technology scaling.
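The idea of tracking operating conditions through the error recovery rate can be pictured as a simple feedback loop. The sketch below is only an assumed controller for illustration: the function name, the target rate, the step size, and the voltage limits are all hypothetical and are not taken from the paper.

```python
def next_voltage(v, error_rate, target=0.02, step=0.01, v_min=0.8, v_max=1.2):
    """Return the bus supply voltage for the next monitoring window.

    error_rate is the fraction of transfers that needed correction in the
    last window. All numeric defaults are hypothetical placeholders.
    """
    if error_rate > target:
        v = v + step  # too many corrected errors: raise the voltage
    else:
        v = v - step  # slack is available: scale the voltage down further
    return min(v_max, max(v_min, v))  # clamp to the allowed range
```

Calling `next_voltage(1.0, 0.05)` raises the supply by one step, while `next_voltage(1.0, 0.0)` lowers it; a real design would also need to bound how quickly the voltage may change.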
-
Joint Power Management of Memory and Disk [p. 86]
-
L. Cai and Y.-H. Lu
This paper presents a scheme to combine memory
and power management for achieving better energy
reduction. Our method periodically adjusts the size of
physical memory and the timeout value to shut down
a hard disk for reducing the average power consumption.
We use Pareto distributions to model the distributions
of idle time. The parameters of the distributions
are adjusted at run-time for calculating the
corresponding timeout value of the disk power management.
The memory size is changed based on the
inclusion property to predict the number of disk accesses
at different memory sizes. Experimental results
show more than 50% energy savings compared to a 2-competitive
fixed-timeout method.
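For context on the baseline mentioned above, a fixed timeout equal to the disk's break-even time is the classic 2-competitive policy: it never spends more than twice the energy of an oracle that knows each idle period in advance. The sketch below checks that bound on Pareto-distributed idle times. The simplified disk model (one active power, one sleep power, one lumped transition energy) and every parameter value are assumptions for illustration, not figures from the paper, and the adaptive method the paper proposes is not shown.

```python
import random


def break_even_time(p_on, p_sleep, e_transition):
    """Idle length at which sleeping starts to pay off."""
    return e_transition / (p_on - p_sleep)


def oracle_energy(t, p_on, p_sleep, e_transition):
    """Energy of an offline policy that knows the idle length t."""
    return min(p_on * t, e_transition + p_sleep * t)


def timeout_energy(t, timeout, p_on, p_sleep, e_transition):
    """Energy of a fixed-timeout policy over one idle period of length t."""
    if t <= timeout:
        return p_on * t  # the disk never spins down
    return p_on * timeout + e_transition + p_sleep * (t - timeout)


def sample_idle_times(n, alpha, scale, seed=0):
    """Draw n Pareto-distributed idle times with shape alpha and minimum scale."""
    rng = random.Random(seed)
    return [scale * rng.paretovariate(alpha) for _ in range(n)]
```

With hypothetical values `p_on=2.0`, `p_sleep=0.2`, and `e_transition=9.0`, the break-even time is about 5 time units, and `timeout_energy` with that timeout stays within a factor of two of `oracle_energy` for every sampled idle period.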
-
Assertion-Based Design Exploration of DVS in Network Processor Architectures [p. 92]
-
J. Yu, W. Wu, X. Chen, H. Hsieh, J. Yang, and F. Balarin
With the scaling of technology and higher requirements on
performance and functionality, power dissipation is becoming
one of the major design considerations in the development of
network processors. In this paper, we use an assertion-based
methodology for system-level power/performance analysis to
study two dynamic voltage scaling (DVS) techniques, traffic-based
DVS and execution-based DVS, in a network processor
model. Using the automatically generated distribution analyzers,
we analyze the power and performance distributions and
study their trade-offs for the two DVS policies with different
parameter settings such as threshold values and window sizes.
We discuss the optimal configurations of the two DVS policies
under different design requirements. By a set of experiments,
we show that the assertion-based trace analysis methodology
is an efficient tool that can help a designer easily compare
and study optimal architectural configurations in a large design
space.
Moderators: F. Kurdahi, UC Irvine, US; C. Passerone, Politecnico di Torino, IT
-
Instruction Scheduling for Dynamic Hardware Configuration [p. 100]
-
E. Panainte, K. Bertels, and S. Vassiliadis
Although the huge reconfiguration latency of the available
FPGA platforms is a well-known shortcoming of the
current FCCMs, little research in instruction scheduling
has been undertaken to eliminate or diminish its negative
influence on performance. In this paper, we introduce an
instruction scheduling algorithm that minimizes the number
of executed hardware reconfiguration instructions taking
into account the "FPGA area placement conflicts" between
the available configurations. The algorithm is based
on compiler analyses and feedback-directed techniques and
it can switch from hardware execution to software execution
for an operation, when the reconfiguration latency could not
be reduced. The algorithm has been tested for the M-JPEG
encoder application and the real hardware implementations
for DCT, Quantization and VLC operations. Based on simulation
results, we determine that, while a simple scheduling
produces a significant performance decrease, our proposed
scheduling contributes up to a 16x M-JPEG encoder
speedup.
-
A Hybrid Prefetch Scheduling Heuristic to Minimize at Run-Time the Reconfiguration
Overhead of Dynamically Reconfigurable Hardware [p. 106]
-
J. Resano, D. Mozos, and F. Catthoor
Due to the emergence of highly dynamic multimedia
applications there is a need for flexible platforms and runtime
scheduling support for embedded systems. Dynamic
Reconfigurable Hardware (DRHW) is a promising
candidate to provide this flexibility but, currently, there is
insufficient run-time scheduling support to deal with
run-time reconfigurations. Moreover, executing at
run-time a complex scheduling heuristic to provide this
support may generate an excessive run-time penalty.
Hence, we have developed a hybrid design/run-time
prefetch heuristic that schedules the reconfigurations at
run-time, but carries out the scheduling computations at
design-time by carefully identifying a set of near-optimal
schedules that can be selected at run-time. This approach
provides run-time flexibility with a negligible penalty.
-
Optimized Generation of Data-Path from C Codes for FPGAs [p. 112]
-
Z. Guo, B. Buyukkurt, W. Najjar, and K. Vissers
FPGAs, as computing devices, offer significant speedup
over microprocessors. Furthermore, their configurability
offers an advantage over traditional ASICs. However, they
do not yet enjoy high-level language programmability, as
microprocessors do. This has become the main obstacle for
their wider acceptance by application designers.
ROCCC is a compiler designed to generate circuits from
C source code to execute on FPGAs, more specifically on
CSoCs. It generates RTL level HDLs from frequently
executing kernels in an application. In this paper, we
describe ROCCC's system overview and focus on its data
path generation. We compare the performance of ROCCC-generated
VHDL code with that of Xilinx IPs. The synthesis
result shows that ROCCC-generated circuit takes around
2x ~ 3x area and runs at comparable clock rate.
Moderators: G. Vandersteen, IMEC, BE; H. Graeb, TU Munich, DE
-
Time-Domain Simulation of Sampled Weakly Nonlinear Systems Using Analytical Integration and
Orthogonal Polynomial Series [p. 120]
-
E. Martens and G. Gielen
This paper presents a novel method for simulation of
sampled systems with weakly nonlinear behavior. These
systems can be characterized by adding weakly non-linear
terms to the linear state-space equations of the system resulting
in an extended state-space model. Perturbation theory
is used to split these equations into an ideal linear behavior
and a non-ideal small perturbation. The linear equations
are solved analytically which reduces simulation time compared
to numerical evaluation. The solution of the perturbation
equations is approximated by orthogonal polynomials.
This methodology not only reduces simulation time compared
to traditional numerical simulations, but also deals
naturally with clock jitter and the discontinuous behavior
of sampled systems. An implementation of the methodology
has been used to analyze systems including switched filters
and continuous-time ΔΣ modulators.
-
Hierarchical Variance Analysis for Analog Circuits Based on Graph Modelling and
Correlation Loop Tracing [p. 126]
-
F. Liu, J. Flomenberg, D. Yasaratne, and S. Ozev
Process variations play an increasingly important role in the
success of analog circuits. State-of-the-art analog circuits are
based on complex architectures and contain many hierarchical
layers and parameters. Knowledge of the parameter variances
and their contribution patterns is crucial for a successful design
process. This information is valuable to find solutions for many
problems in design, design automation, testing, and fault tolerance.
In this paper, we present a hierarchical variance analysis
methodology for analog circuits. In the proposed method, we make
use of previously computed values whenever possible so as to reduce
computational time. Experimental results indicate that the
proposed method provides both accuracy and computational efficiency
when compared with prior approaches.
-
On Statistical Timing Analysis with Inter and Intra-die Variations [p. 132]
-
H. Mangassarian and M. Anis
In this paper, we highlight a fast, effective and practical statistical
approach that deals with inter and intra-die variations in VLSI
chips. Our methodology is applied to a number of random variables
while accounting for spatial correlations. Our methodology sorts
the Probability Density Functions (PDFs) of the critical paths of a
circuit based on a confidence-point. We show the mathematical accuracy
of our method and implement a typical program to test
it on various benchmarks. We find that worst-case analysis over-estimates
path delays by more than 50% and that a path's probabilistic
rank with respect to delay is very different from its deterministic
rank.
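The last observation, that a path's probabilistic rank can differ from its deterministic rank, can be illustrated with two hypothetical paths. The sketch below assumes independent Gaussian path delays and ranks paths by the delay at a chosen confidence level; the Gaussian model, the path data, and the function names are assumptions for illustration and do not reproduce the paper's method or its handling of spatial correlation.

```python
from statistics import NormalDist


def rank_by_nominal(paths):
    """Deterministic ranking: slowest nominal (mean) delay first."""
    return sorted(paths, key=lambda name: paths[name][0], reverse=True)


def rank_by_confidence_point(paths, confidence=0.99):
    """Statistical ranking: slowest delay at the given confidence level first."""
    def point(name):
        mean, sigma = paths[name]
        return NormalDist(mean, sigma).inv_cdf(confidence)
    return sorted(paths, key=point, reverse=True)


# Hypothetical paths: name -> (mean delay, standard deviation), arbitrary units.
paths = {"A": (100.0, 2.0), "B": (98.0, 5.0)}
```

Here path A is critical by nominal delay, but path B, with its larger spread, becomes the critical one at the 99% confidence point.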
-
Multi-Placement Structures for Fast and Optimized Placement in Analog Circuit Synthesis [p. 138]
-
R. Badaoui and R. Vemuri
This paper presents the novel idea of multi-placement structures,
for a fast and optimized placement instantiation in analog
circuit synthesis. These structures need to be generated only
once for a specific circuit topology. When used in synthesis,
these pre-generated structures instantiate various layout floorplans
for various sizes and parameters of a circuit. Unlike procedural
layout generators, they enable fast placement of circuits
while keeping the quality of the placements at a high level during
a synthesis process. The fast placement is a result of high speed
instantiation resulting from the efficiency of the multi-placement
structure. The good quality of placements derives from the extensive
and intelligent search process that is used to build the
multi-placement structure. The target benchmarks of these structures
are analog circuits in the vicinity of 25 modules. An algorithm
for the generation of such multi-placement structures is
presented. Experimental results show placement execution times
with an average of a few milliseconds making them usable during
layout-aware synthesis for optimized placements.
Moderators: A. Chatterjee, Georgia Institute of Technology, US; J. Carbonero, STMicroelectronics, FR
-
On-Chip Multi-Channel Waveform Monitoring for Diagnostics of Mixed-Signal VLSI Circuits [p. 146]
-
K. Noguchi and M. Nagata
A multi-channel waveform monitoring technique enhances the
built-in test and diagnostic capability of mixed-signal VLSI
circuits. An 8-channel prototype system incorporates adaptive
sample time generation with a 10-bit variable step delay
generator and algorithmic digitization with a 10-bit incremental
reference voltage generator. The prototype in
a 0.18-μm CMOS technology demonstrated on-chip waveform
acquisition at 40-ps and 200-μV resolutions. The
waveforms were as accurate as those by an off-chip measurement
technique, while more than 95% reduction of the
waste time in waveform monitoring was achieved. The area
of 700μm x 600μm was occupied by a single waveform acquisition
kernel that was shared with 8 front-end modules of
60μm x 200μm each. The developed on-chip multi-channel
waveform monitoring technique is waveform accurate, area
efficient, and low cost, which are all requisite factors for
diagnosing methodology toward mixed analog and digital
signal integrity in a systems-on-a-chip era.
-
Low-Cost Multi-Gigahertz Test Systems Using CMOS FPGAs and PECL [p. 152]
-
D. Keezer, C. Gray, A. Majid, and N. Taher
This paper describes two research projects that
develop new low-cost techniques for testing devices with
multiple high-speed (2 to 5 Gbps) signals. Each project
uses commercially available components to keep costs
low, yet achieves performance characteristics comparable
to (and in some ways exceeding) more expensive ATE. A
common CMOS FPGA-based logic core provides
flexibility, adaptability, and communication with
controlling computers while customized positive emitter-coupled
logic (PECL) achieves multi-gigahertz data rates
with about ±25 ps timing accuracy.
-
Noise Figure Evaluation Using Low Cost BIST [p. 158]
-
M. Negreiros, L. Carro, and A. Susin
A technique for evaluating noise figure suitable for BIST
implementation is described. It is based on a low cost
single-bit digitizer, which allows the simultaneous
evaluation of noise figure in several test points of the
analog circuit. The method is also able to benefit from
SoC resources, like memory and processing power.
Theoretical background and experimental results are
presented in order to demonstrate the feasibility of the
approach.
-
Specification Test Compaction for Analog Circuits and MEMS [p. 164]
-
S. Biswas, R. Blanton, L. Pileggi, and P. Li
Testing a non-digital integrated system against all of its specifications
can be quite expensive due to the elaborate test application and measurement
setup required. We propose to eliminate redundant tests by employing ε-SVM
based statistical learning. Application of the proposed methodology to
an operational amplifier and a MEMS accelerometer reveals that redundant tests
can be statistically identified from a complete set of specification-based tests
with negligible error. Specifically, after eliminating five of seven
specification-based tests for an operational amplifier, the defect escape
and yield loss are small at 0.6% and 0.9%, respectively. For the accelerometer,
defect escape of 0.2% and yield loss of 0.1% occur when the hot and cold
tests are eliminated. For the accelerometer, this level of compaction would
reduce test cost by more than half.
-
Optimising Test Sets for a Low Noise Amplifier with a Defect-Oriented Approach [p. 170]
-
V. Danelon, J. Carbonero, R. Kheriji, and S. Mir
This paper is aimed at studying defect-oriented test
techniques for RF components in order to optimize
production test sets. This study is mandatory for the
definition of an efficient test flow strategy. We have
carried out a fault simulation campaign for a Low-Noise
Amplifier (LNA) for reducing a test set while maintaining
high fault coverage. The set of production test
measurements should include low-cost structural tests
such as simple current consumption and only a few more
sophisticated tests dedicated to functional specifications
such as S parameters, Noise Figure (NF) or IP3.
-
IEEE 1149.4 Compatible ABMs for Basic RF Measurements [p. 172]
-
P. Syri, J. Hakkinen, and M. Moilanen
An analogue testing standard IEEE 1149.4 is mainly
targeted for low-frequency testing. The problem studied
in this paper is extending the standard also for radio
frequency testing. IEEE 1149.4 compatible measurement
structures (ABMs) developed in this study extract the
information one is measuring from the radio frequency
signal and represent the result as a DC voltage level. The
ABMs presented in this paper are targeted for power and
frequency measurements operating in frequencies from 1
GHz to 2 GHz. The power measurement error caused by
temperature, supply voltage and process variations is
roughly 2 dB and the frequency measurement error is 0.1
GHz, respectively.
-
Fault-Trajectory Approach for Fault Diagnosis on Analog Circuits [p. 174]
-
C. Savioli, C. Czendrodi, J. Calvano, and A. Mesquita
This issue discusses the fault-trajectory approach suitability
for fault diagnosis on analog networks. Recent works have
shown promising results concerning a method based on this
concept for ATPG for diagnosing faults on analog networks.
Such a method relies on evolutionary techniques, where a
genetic algorithm (GA) is coded to generate a set of
optimum frequencies capable of disclosing faults.
Moderators: T. Basten, TU Eindhoven, NL; R. Marculescu, Carnegie Mellon U, US
-
Secure Embedded Processing through Hardware-Assisted Run-time Monitoring [p. 178]
-
D. Arora, N. Jha, S. Ravi, and A. Raghunathan
Security is emerging as an important concern in embedded
system design. The security of embedded systems is often compromised due
to vulnerabilities in "trusted" software that they execute. Security attacks
exploit these vulnerabilities to trigger unintended program behavior, such
as the leakage of sensitive data or the execution of malicious code.
In this work, we present a hardware-assisted paradigm to enhance
embedded system security by detecting and preventing unintended program
behavior. Specifically, we extract properties of an embedded program
through static program analysis, and use them as the bases for enforcing
permissible program behavior in real-time as the program executes.
We present an architecture for hardware-assisted run-time monitoring,
wherein the embedded processor is augmented with a hardware monitor
that observes the processor's dynamic execution trace, checks whether
the execution trace falls within the allowed program behavior, and flags
any deviations from the expected behavior to trigger appropriate response
mechanisms. We present properties that can be used to capture permissible
program behavior at different levels of granularity within a program,
namely inter-procedural control flow, intra-procedural control flow, and
instruction stream integrity. We also present a systematic methodology
to design application-specific hardware monitors for any given embedded
program. We have evaluated the hardware requirements and performance
of the proposed architecture for several embedded software benchmarks.
Hardware implementations using a commercial design flow, and architectural
simulations using the SimpleScalar framework, indicate that the
proposed technique can thwart several common software and physical
attacks, facilitating secure program execution with minimal overheads.
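
The inter-procedural control-flow check described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the call graph `ALLOWED_CALLS`, the function names in it, and the trace format are hypothetical, and the monitor in the paper is a hardware block rather than Python code.

```python
# Hypothetical software model of an inter-procedural control-flow monitor.
# ALLOWED_CALLS stands in for the properties extracted by static analysis;
# the function names are invented for illustration.
ALLOWED_CALLS = {
    "main": {"read_input", "process"},
    "process": {"encrypt", "log"},
    "read_input": set(),
    "encrypt": set(),
    "log": set(),
}


def check_trace(trace, allowed=ALLOWED_CALLS):
    """Return the index of the first disallowed (caller, callee) pair, or None."""
    for i, (caller, callee) in enumerate(trace):
        # A call is permissible only if static analysis recorded this edge.
        if callee not in allowed.get(caller, set()):
            return i
    return None


good_trace = [("main", "read_input"), ("main", "process"), ("process", "encrypt")]
bad_trace = [("main", "read_input"), ("read_input", "encrypt")]

print(check_trace(good_trace))  # no deviation found
print(check_trace(bad_trace))   # index of the first deviation
```

A deviation index would correspond to the point at which the monitor flags the execution and triggers a response mechanism.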
-
Energy-Aware Routing for E-Textile Applications [p. 184]
-
J.-C. Kao and R. Marculescu
As the scale of electronic devices shrinks, "electronic
textiles" (e-textiles) will make possible a wide variety of novel
applications which are currently unfeasible. Due to the
wearability concerns, low-power techniques are critical for e-textile
applications. In this paper, we address the issue of
energy-aware routing for e-textile platforms and propose an
efficient algorithm to solve it. The platform we consider
consists of dedicated components for e-textiles, including
computational modules, dedicated transmission lines and
thin-film batteries on fiber substrates. Furthermore, we derive
an analytical upper bound for the achievable number of jobs
completed over all possible routing strategies. From a
practical standpoint, for the Advanced Encryption Standard
(AES) cipher, the routing technique we propose achieves
about fifty percent of this analytical upper bound. Moreover,
compared to the non-energy-aware counterpart, our routing
technique increases the number of encryption jobs completed
by one order of magnitude.
-
LORD: A Localized, Reactive and Distributed Protocol for Node Scheduling in
Wireless Sensor Networks [p. 190]
-
A. Ghosh and T. Givargis
The lifetime of wireless sensor networks can be increased
by minimizing the number of active nodes that provide complete
coverage, while switching off the rest. In this paper,
we propose a distributed and scalable node-scheduling algorithm
that conserves overall system energy by minimizing
the number of active nodes, localizing the execution to the
dying sensor(s), and minimizing the frequency of execution
by reacting only to the occurrence of a sensing hole. This effects
an increased system lifetime while maintaining coverage
over an application-defined threshold value. We compare
our algorithm to a network with a centralized node-scheduling
algorithm. Our results show equivalent coverage
degree over a wide range of sensor networks.
Keywords
Wireless Sensor Network, Coverage, Set Cover
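
For orientation, the centralized baseline can be pictured as a set-cover computation. The sketch below uses the classical greedy set-cover heuristic, which is an assumption: the abstract does not state which centralized algorithm was used, and the node names and coverage sets are hypothetical.

```python
def greedy_active_nodes(coverage, targets):
    """Pick nodes greedily until every target point is covered.

    coverage: dict mapping node name -> set of target points it senses.
    targets:  iterable of points that must stay covered.
    """
    uncovered = set(targets)
    active = []
    while uncovered:
        # Choose the node covering the most still-uncovered points
        # (ties are broken alphabetically via sorted()).
        best = max(sorted(coverage), key=lambda n: len(coverage[n] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:
            raise ValueError("targets cannot be fully covered")
        active.append(best)
        uncovered -= gained
    return active


# Hypothetical deployment: four nodes, five points to monitor.
coverage = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1, 5}}
active = greedy_active_nodes(coverage, range(1, 6))
print(active)  # nodes left switched on; the rest can sleep
```

A localized, reactive protocol would instead rerun such a selection only around a dying sensor when a sensing hole appears.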
-
Energy Efficiency of the IEEE 802.15.4 Standard in Dense Wireless Microsensor Networks:
Modeling and Improvement Perspectives [p. 196]
-
B. Bougard, F. Catthoor, D. Daly, A. Chandrakasan, and W. Dehaene
Wireless microsensor networks, which have been the
topic of intensive research in recent years, are now
emerging in industrial applications. An important
milestone in this transition has been the release of the
IEEE 802.15.4 standard that specifies interoperable
wireless physical and medium access control layers
targeted to sensor node radios. In this paper, we evaluate
the potential of an 802.15.4 radio for use in an ultra
low power sensor node operating in a dense network.
Starting from measurements carried out on the off-the-shelf
radio, effective radio activation and link adaptation
policies are derived. It is shown that, in a typical
sensor network scenario, the average power per node
can be reduced down to 211 μW. Next, the energy consumption
breakdown between the different phases of a
packet transmission is presented, indicating which part
of the transceiver architecture can most effectively be
optimized in order to further reduce the radio power,
enabling self-powered wireless microsensor networks.
-
Lifetime Modeling of a Sensor Network [p. 202]
-
V. Rai and R. Mahapatra
In this paper, we study a communication/sensing network that
comprises a large number of radio-enabled sensors. These sensors are
either randomly or deterministically placed within a certain region to
monitor events that are spatially and temporally independent of each
other. Possible applications include: habitat and climate monitoring,
diagnosing faults in industrial supply lines, measuring data such as
traffic-intensity, detecting human/vehicular intrusion, etc. The sensor
nodes in these networks are powered by a battery with limited power,
which is dissipated during the data transmission/reception. A cheap
and effective approach is to replace the sensor nodes in due course
instead of replenishing their batteries. Thus, the objective is to
find the replacement time Tr such that none of the sensor nodes
run out of their batteries (disconnected) before Tr. An alternative
way of formulating this problem is to find the lifetime T of the
network, which is defined as the time after which the first node in
the network disconnects. Studies evaluating the lifetime model of the
sensor networks have been done before in [1], [2], [4]. However,
the primary difference between previous approaches and our work is
that we specifically model a data generation process at an individual
sensor node, where each node covers certain area and the amount of
data generated at a node is proportional to its coverage area.
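
The lifetime definition above (the time at which the first node disconnects) can be made concrete with a deliberately simplified sketch. It assumes each node only transmits its own data, that the data rate is proportional to coverage area, and that energy per unit of data is constant; the parameter names and numbers are hypothetical and relaying traffic is ignored.

```python
def network_lifetime(nodes, data_per_area=1.0, energy_per_unit_data=1.0):
    """Return the time until the first node exhausts its battery.

    nodes: list of (battery_energy, coverage_area) pairs with area > 0.
    """
    lifetimes = []
    for battery, area in nodes:
        # Data generated per unit time is proportional to the coverage area.
        power = data_per_area * area * energy_per_unit_data
        lifetimes.append(battery / power)
    # The network lifetime T is set by the earliest-dying node.
    return min(lifetimes)


# Hypothetical nodes: (battery energy, coverage area)
nodes = [(100.0, 2.0), (100.0, 5.0), (80.0, 1.0)]
T = network_lifetime(nodes)
print(T)
```

Under these assumptions the node with the largest coverage area relative to its battery determines T, so any replacement time Tr would have to be chosen below that value.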
Moderators: E. Schmidt, Chip Vision Design Systems, DE; J. Haid, Infineon, DE
-
A Fast Concurrent Power-Thermal Model for Sub-100nm Digital ICs [p. 206]
-
J. Rosselló, V. Canals, S. Bota, J. Segura, and A. Keshavarzi
As technology scales down, the static power is expected to
become a significant fraction of the total power. The
exponential dependence of static power on the operating
temperature makes the thermal profile estimation of high-performance
ICs a key issue to compute the total power
dissipated in next-generations. In this paper we present
accurate and compact analytical models to estimate the static
power dissipation and the temperature of operation of CMOS
gates. The models are the fundamentals of a performance
estimation tool in which numerical procedures are avoided for
any computation to set a faster estimation and optimization.
The models developed are compared to measurements and
SPICE simulations for a 0.12 μm technology showing excellent
results.
-
Activity Packing in FPGAs for Leakage Power Reduction [p. 212]
-
H. Hassan, M. Anis, A. El Daher, and M. Elmasry
In this paper, two packing algorithms for the detection of activity
profiles in MTCMOS-based FPGA structures are proposed
for leakage power mitigation. The first algorithm is a connection-based
packing technique by which the proximity of the logic blocks
is accounted for, and the second algorithm is a logic-based packing
approach by which the weighted Hamming distance between
the blocks' activities is considered. After both algorithms are analyzed,
they are applied to a number of FPGA benchmarks for verification.
Once the activity profiles are realized, sleep transistors are
carefully positioned to contain the clustered blocks that share similar
activity profiles. Finally, the percentage of the leakage power
savings for each of the two algorithms is evaluated.
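
The weighted Hamming distance used by the logic-based packing approach can be sketched as follows. The encoding is an assumption: each block's activity profile is taken to be a bit vector over time intervals (1 = active), and the weights are hypothetical per-interval importance values.

```python
def weighted_hamming(a, b, weights):
    """Sum the weights of the positions where two activity vectors differ."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("vectors and weights must have equal length")
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


# Hypothetical activity profiles of two logic blocks over four intervals.
block_a = [1, 1, 0, 0]
block_b = [1, 0, 0, 1]
weights = [1, 2, 3, 4]

distance = weighted_hamming(block_a, block_b, weights)
print(distance)
```

Blocks with a small distance have similar activity profiles and are natural candidates for sharing a sleep transistor.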
-
Simultaneous Partitioning and Frequency Assignment for On-chip Bus Architectures [p. 218]
-
S. Srinivasan, L. Li, and N. Vijaykrishnan
In this paper, we provide a methodology to perform
both bus partitioning and bus frequency assignment to
each of the bus segments simultaneously while
optimizing both power consumption and performance of
the system. We use a genetic algorithm and design an
appropriate cost function which optimizes the solution
on the basis of its power consumption and performance.
The evaluation of our approach using a set of
multiprocessor applications shows an average
reduction in energy consumption of 60% over a
single shared bus architecture. Our results also show
that it is beneficial to assign bus
frequencies and perform bus partitioning simultaneously instead of
performing them sequentially.
-
Modeling and Analysis of Loading Effect in Leakage of Nano-Scaled Bulk-CMOS Logic Circuits [p. 224]
-
S. Mukhopadhyay, S. Bhunia, and K. Roy
In nanometer-scaled CMOS devices, a significant increase in the
subthreshold, the gate and the reverse-biased junction
band-to-band tunneling (BTBT) leakage results in a large increase of
total leakage power in a logic circuit. Leakage components
interact with each other at the device level (through device
geometry, doping profile) and also at the circuit level (through
node voltages). Due to the circuit level interaction of the
different leakage components, the leakage of a logic gate
strongly depends on the circuit topology i.e. number and nature
of the other logic gates connected to its input and output. In this
paper, for the first time, we have analyzed loading effect on
leakage and proposed a method to accurately estimate the total
leakage in a logic circuit, from its logic level description
considering the impact of loading and transistor stacking.
-
Leakage-Aware Interconnect for On-Chip Network [p. 230]
-
Y.-F. Tsai, V. Narayanan, Y. Xie, and M. Irwin
On-chip networks have been proposed as the interconnect fabric
for future systems-on-chip and multi-processors on chip. Power is
one of the main constraints of these systems and interconnect
consumes a significant portion of the power budget. In this paper,
we propose four leakage-aware interconnect schemes. Our
schemes achieve 10.13%~63.57% active leakage savings and
12.35%~95.96% standby leakage savings across schemes while the
delay penalty ranges from 0% to 4.69%.
Moderators: K. Goossens, Philips Research, NL; P. Ienne, EPFL, CH
-
Centralized Run-Time Resource Management in a Network-on-Chip
Containing Reconfigurable Hardware Tiles [p. 234]
-
V. Nollet, T. Marescaux, P. Avasare, J.-Y. Mignolet, and D. Verkest
Run-time management of both communication and computation
resources in a heterogeneous Network-on-Chip
(NoC) is a challenging task. First, platform resources need
to be assigned in a fast and efficient way. Secondly, the resources
might need to be reallocated when platform conditions
or user requirements change. We developed a run-time
resource management scheme that is able to efficiently manage
a NoC containing fine grain reconfigurable hardware
tiles. This paper details our task assignment heuristic and
two run-time task migration mechanisms that deal with the
message consistency problem in a NoC. We show that specific
reconfigurable hardware tile support improves performance
of the heuristic and that task migration mechanisms
need to be tailored to on-chip networks.
-
Symmetric Multiprocessing on Programmable Chips Made Easy [p. 240]
-
A. Hung, W. Bishop, and A. Kennings
Vendor-provided softcore processors often support advanced
features such as caching that work well in uniprocessor
or uncoupled multiprocessor architectures. However,
it is a challenge to implement Symmetric Multiprocessor
on a Programmable Chip (SMPoPC) systems using
such processors. This paper presents an implementation of
a tightly-coupled, cache-coherent symmetric multiprocessing
architecture using a vendor-provided softcore processor.
Experimental results show that this implementation can be
achieved without invasive changes to the vendor-provided
softcore processor and without degradation of the performance
of the memory system.
-
A Complete Network-On-Chip Emulation Framework [p. 246]
-
N. Genko, G. De Micheli, D. Atienza, J. Mendias, R. Hermida, and F. Catthoor
Current Systems-On-Chip (SoC) execute applications
that demand extensive parallel processing. Networks-On-Chip
(NoC) provide a structured way of realizing interconnections
on silicon, and obviate the limitations of bus-based
solutions. NoCs can have regular or ad hoc topologies, and
functional validation is essential to assess their correctness
and performance. In this paper, we present a flexible emulation
environment implemented on an FPGA that is suitable
to explore, evaluate and compare a wide range of NoC solutions
with a very limited effort. Our experimental results
show a speed-up of four orders of magnitude with respect to
cycle-accurate HDL simulation, while retaining cycle accuracy.
With our emulation framework, designers can explore
and optimize a wide range of solutions, as well as quickly
characterize performance figures.
-
Low Cost Task Migration Initiation in a Heterogeneous MP-SoC [p. 252]
-
V. Nollet, P. Avasare, J.-Y. Mignolet, and D. Verkest
Run-time task migration in a heterogeneous multiprocessor
System-on-Chip (MP-SoC) is a challenge that requires
cooperation between the task and the operating system. In
task migration, minimization of the overhead during normal
task execution (i.e., when not migrating) and the minimization
of the migration reaction time are important. We introduce
a novel technique that reuses the processor's debug
registers in order to minimize the overhead during normal
execution. This paper explains our task migration proof-of-concept
setup and compares it to the state of the art. By
reusing existing hardware and software functionality our
approach reduces the run time overhead.
-
Predictable Embedding of Large Data Structures in Multiprocessor Networks-On-Chip [p. 254]
-
S. Stuijk, T. Basten, B. Mesman, and M. Geilen
This extended abstract presents models to derive timing
and resource usage numbers for an application when distant,
shared memories are used in an important class of future embedded
platforms, namely network-on-chip-based multiprocessors.
Moderators: T. Ifström, Robert Bosch, DE; A. Rodriguez, IMSE-CNM, ES
-
Top-Down Design of a Low-Power Multi-Channel 2.5-Gbit/s/Channel Gated Oscillator
Clock-Recovery Circuit [p. 258]
-
P. Muller, Y. Leblebici, M. Atarodi, and A. Tajalli
We present a complete top-down design of a low-power
multi-channel clock recovery circuit based on gated
current-controlled oscillators. The flow includes several
tools and methods used to specify block constraints, to
design and verify the topology down to the transistor level,
as well as to achieve a power consumption as low as
5mW/Gbit/s. Statistical simulation is used to estimate the
achievable bit error rate in presence of phase and
frequency errors and to prove the feasibility of the concept.
VHDL modeling provides extensive verification of the
topology. Thermal noise modeling based on well-known
concepts delivers design parameters for the device sizing
and biasing. We present two practical examples of possible
design improvements analyzed and implemented with this
methodology.
-
MINLP Based Topology Synthesis for Delta Sigma Modulators Optimized for
Signal Path Complexity, Sensitivity and Power Consumption [p. 264]
-
H. Tang, Y. Wei, and A. Doboli
This paper proposes a novel architecture synthesis algorithm
for single-loop single-bit ΔΣ modulators. We defined a generic
modulator architecture and derived its noise and signal transfer
function (NTF/STF) in symbolic forms. We then used the
TF in MINLP to generate optimal topologies for a variety of
design requirements, such as modulator complexity, sensitivity
and power consumption, which appeared as cost functions.
Experiments show the superiority of synthesized topologies as
compared to traditional solutions.
-
Simulation Methodology for Analysis of Substrate Noise Impact on Analog / RF Circuits
Including Interconnect Resistance [p. 270]
-
C. Soens, P. Wambacq, G. Van Der Plas, and S. Donnay
This paper reports a novel simulation methodology for
analysis and prediction of substrate noise impact on
analog / RF circuits taking into account the role of the
parasitic resistance of the on-chip interconnect in the
impact mechanism. This methodology allows investigation
of the role of the separate devices (also parasitic devices) in
the analog / RF circuit in the overall impact. This reveals
which devices have to be taken care of (shielding,
topology change) to protect the circuit against substrate
noise. The developed methodology is used to analyze impact
of substrate noise on a 3 GHz LC-tank Voltage Controlled
Oscillator (VCO) designed in a high-ohmic 0.18 μm 1P6M
CMOS technology. For this VCO (in the investigated
frequency range from DC to 15 MHz) impact is mainly
caused by resistive coupling of noise from the substrate to
the non-ideal on-chip ground interconnect, resulting in
analog ground bounce and frequency modulation. Hence,
the presented test-case reveals the important role of the on-chip
interconnect in the phenomenon of substrate noise
impact.
-
Systematic Figure of Merit Computation for the Design of Pipeline ADC [p. 277]
-
L. Barrandon, S. Crand, and D. Houzet
The emerging concept of SoC-AMS calls for research into new top-down
methodologies to aid system designers in sizing analog and mixed-signal devices.
This work applies this idea to the high-level optimization of pipeline
ADCs. Considering a given technology, it consists in comparing different
configurations according to their imperfections and their architectures
without FFT computation or time-consuming simulators. The final selection
is based on a figure of merit.
-
Designer-Driven Topology Optimization for Pipelined Analog to Digital Converters [p. 279]
-
Y.-T. Chien, J.-H. Lou, D. Chen, G.-K. Ma, R. Rutenbar, and T. Mukherjee
This paper suggests a practical "hybrid" synthesis
methodology which integrates designer-derived analytical
models for system-level description with simulation-based
models at the circuit level. We show how to optimize stage-resolution
to minimize the power in a pipelined ADC.
Exploration (via detailed synthesis) of several ADC
configurations is used to show that a 4-3-2... resolution
distribution uses the least power for a 13-bit 40 MSPS
converter in a 0.25 μm CMOS process.
Moderators: C. Metra, Bologna U, IT; R. Leveugle, TIMA Laboratory, FR
-
Accurate Reliability Evaluation and Enhancement via Probabilistic Transfer Matrices [p. 282]
-
S. Krishnaswamy, G. Viamontes, I. Markov, and J. Hayes
Soft errors are an increasingly serious problem for logic
circuits. To estimate the effects of soft errors on such circuits,
we develop a general computational framework based
on probabilistic transfer matrices (PTMs). In particular, we
apply them to evaluate circuit reliability in the presence of
soft errors, which involves combining the PTMs of gates
to form an overall circuit PTM. Information such as output
probabilities, the overall probability of error, and signal
observability can then be extracted from the circuit PTM.
We employ algebraic decision diagrams (ADDs) to improve
the efficiency of PTM operations. A particularly challenging
technical problem, solved in our work, is to simultaneously
extend tensor products and matrix multiplication in
terms of ADDs to non-square matrices. Our PTM-based
method enables accurate evaluation of reliability for moderately
large circuits and can be extended by circuit partitioning.
To demonstrate the power of the PTM approach,
we apply it to several problems in fault-tolerant design and
reliability improvement.
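
How gate PTMs combine into a circuit PTM can be shown on a toy circuit. The sketch below uses plain dense matrices rather than the ADD representation the paper relies on. It assumes each gate flips its output independently with probability p, and the example circuit (two inverters feeding a NAND, i.e. an OR function) is hypothetical.

```python
def kron(a, b):
    """Tensor (Kronecker) product: combines gates that operate in parallel."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matmul(a, b):
    """Matrix product: combines circuit levels connected in series."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def not_ptm(p):
    # Rows: input 0, 1. Columns: output 0, 1. p = gate error probability.
    return [[p, 1 - p], [1 - p, p]]


def nand_ptm(p):
    # Rows: inputs 00, 01, 10, 11. Columns: output 0, 1.
    return [[p, 1 - p], [p, 1 - p], [p, 1 - p], [1 - p, p]]


def or_circuit_ptm(p):
    """PTM of NAND(NOT(a), NOT(b)), which ideally computes a OR b."""
    level1 = kron(not_ptm(p), not_ptm(p))  # two inverters side by side
    return matmul(level1, nand_ptm(p))     # followed by the NAND gate


ptm = or_circuit_ptm(0.1)
# Probability that the output is correct (0) for input 00.
print(ptm[0][0])
```

Each row of the result is the output distribution for one input combination, from which output probabilities and the overall probability of error can be read off.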
-
Soft-Error Tolerance Analysis and Optimization of Nanometer Circuits [p. 288]
-
Y. Dhillon, A. Diril, and A. Chatterjee
Nanometer circuits are becoming increasingly
susceptible to soft-errors due to alpha-particle and
atmospheric neutron strikes as device scaling reduces
node capacitances and supply/threshold voltage scaling
reduces noise margins. It is becoming crucial to add soft-error
tolerance estimation and optimization to the design
flow to handle the increasing susceptibility. The first part
of this paper presents a tool for accurate soft-error
tolerance analysis of nanometer circuits (ASERTA) that
can be used to estimate the soft-error tolerance of
nanometer circuits consisting of millions of gates. The
tolerance estimates generated by the tool match SPICE
generated estimates closely while taking orders of
magnitude less computation time. The second part of the
paper presents a tool for soft-error tolerance optimization
of nanometer circuits (SERTOPT) using the tolerance
estimates generated by ASERTA. The tool finds optimal
sizes, channel lengths, supply voltages and threshold
voltages to be assigned to gates in a combinational circuit
such that the soft-error tolerance is increased while
meeting the timing constraint. Experiments on ISCAS'85
benchmark circuits showed that soft-error rate of the
optimized circuit decreased by as much as 47% with
marginal increase in circuit delay.
-
Improving the Process-Variation Tolerance of Digital Circuits Using Gate Sizing and
Statistical Techniques [p. 294]
-
O. Neiroukh and X. Song
A new approach for enhancing the process-variation tolerance of digital circuits is described. We extend recent advances in statistical timing analysis into an optimization framework. Our objective is to reduce the performance variance of a technology-mapped circuit where delays across elements are represented by random variables which capture the manufacturing variations. We introduce the notion of statistical critical paths, which account for both means and variances of performance variation. An optimization engine is used to size gates with a goal of reducing the timing variance along the statistical critical paths. We apply a pair of nested statistical analysis methods deploying a slower more accurate approach for tracking statistical critical paths and a fast engine for evaluation of gate size assignments. We derive a new approximation for the max operation on random variables which is deployed for the faster inner engine. Circuit optimization is carried out using a gain-based algorithm that terminates when constraints are satisfied or no further improvements can be made. We show optimization results that demonstrate an average of 72% reduction in performance variation at the expense of average 20% increase in design area.
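
As background for the max operation on random variables, the sketch below implements Clark's classical moment-matching formulas for the maximum of two independent Gaussian delays. This is not the new approximation derived in the paper; it is shown only to illustrate what such an operator computes, and the independence and Gaussian assumptions are ours.

```python
import math


def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def clark_max(mu1, var1, mu2, var2):
    """Mean and variance of max(X, Y) for independent Gaussians X and Y."""
    theta = math.sqrt(var1 + var2)
    if theta == 0.0:
        # Both inputs are deterministic.
        return max(mu1, mu2), 0.0
    alpha = (mu1 - mu2) / theta
    first = mu1 * _cdf(alpha) + mu2 * _cdf(-alpha) + theta * _pdf(alpha)
    second = ((mu1 ** 2 + var1) * _cdf(alpha)
              + (mu2 ** 2 + var2) * _cdf(-alpha)
              + (mu1 + mu2) * theta * _pdf(alpha))
    return first, max(second - first ** 2, 0.0)


# Two identical unit-variance arrival times.
mean, var = clark_max(0.0, 1.0, 0.0, 1.0)
print(mean, var)
```

In a statistical timing engine such an operator is applied wherever paths converge, so both its accuracy and its cost matter for the inner optimization loop.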
-
Circuit-Level Modeling for Concurrent Testing of Operational Defects due to Gate Oxide Breakdown [p. 300]
-
J. Carter, S. Ozev, and D. Sorin
As device sizes shrink and current densities increase, the probability of device failures due to gate oxide breakdown (OBD) also increases. To provide designs that are tolerant to such failures, we must investigate and understand the manifestations of this physical phenomenon at the circuit and system level. In this paper, we develop a model for operational OBD defects, and we explore how to test for faults due to OBD. For a NAND gate, we derive the necessary input conditions that excite and detect errors due to OBD defects at the gate level. We show that traditional pattern generators fail to exercise all of these defects. Finally, we show that these test patterns can be propagated and justified for a combinational circuit in a manner similar to traditional ATPG.
-
An Accurate SER Estimation Method Based on Propagation Probability [p. 306]
-
G. Asadi and M. Tahoori
In this paper, we present an accurate but very fast soft
error rate (SER) estimation technique for digital circuits
based on error propagation probability (EPP) computation.
Experimental results and comparison of the results with
the random simulation technique show that our proposed
method is on average within 6% of the random simulation
method and four to five orders of magnitude faster.
-
Techniques for Fast Transient Fault Grading Based on Autonomous Emulation [p. 308]
-
C. López-Ongil, M. García-Valderas, M. Portela-García, and L. Entrena-Arrontes
Very deep submicron and nanometer technologies have
notably increased the sensitivity of integrated circuits (ICs) to
radiation. Soft errors now appear in ICs
operating at the earth's surface. Hardened circuits are currently
required in many applications where Fault Tolerance (FT)
was not a requirement in the very near past. The use of
platform FPGAs for the emulation of single-event upset
effects (SEU) is gaining attention in order to speed up the
FT evaluation. In this work, a new emulation system for
FT evaluation with respect to SEU effects is proposed,
providing shorter evaluation times by performing all the
evaluation process in the FPGA and avoiding emulator-host
communication bottlenecks.
Moderators: R. Seepold, Carlos III de Madrid U, ES; G. Martin, Tensilica, US
-
TDMA Time Slot and Turn Optimization with Evolutionary Search Techniques [p. 312]
-
A. Hamann and R. Ernst
In this paper we present arithmetic real-coded variation
operators tailored for time slot and turn optimization on
TDMA-scheduled resources with evolutionary algorithms.
Our operators implement a heuristic strategy to converge
towards the solution space and are able to escape local
minima. Furthermore, we explicitly separate the variation
of the admitted loads and the turn-length in order to
give the designer increased control over the optimization
process. Experimental results show that our variation operators
have advantages over string-coded binary variation
operators which are frequently used to solve continuous
optimization problems.
-
Scheduling of Soft Real-Time Systems for Context-Aware Applications [p. 318]
-
J. Wong, F. Li, W. Liao, L. He, and M. Potkonjak
Context-aware applications pose new challenges, including
a need for new computational models, uncertainty
management, and efficient optimization under uncertainty.
Uncertainty can arise at two levels: multiple and single
tasks. When a mobile user changes environments, the context
changes resulting in the possibility of the user requesting
tasks which are specific for the new environment. However,
as the user moves these requested tasks may no longer
be context relevant. Additionally, the runtime of each task is
often highly dependent on the input data.
We introduce a hierarchical multi-resolution statistical
task model that captures relevant aspects at the task and intertask
levels, and captures not only uncertainty, but also introduces
the notion of utility for the user. We have developed
a system of non-parametric statistical techniques for modeling
the runtime of a specific task. This model is a framework
where we define problems of design and optimization of statistical
soft real-time systems (SSRTS). The main algorithmic
novelty is a cumulative potential-based task scheduling
heuristic for maximizing utility. The heuristic conducts
global optimization and induces low runtime overhead. We
demonstrate the effectiveness of the scheduling heuristic using
a Trimaran-based evaluation platform.
-
Model Reuse through Hardware Design Patterns [p. 324]
-
F. Rincón, F. Moya, J. Barba, and J. López
Increasing reuse opportunities is a well-known problem
for software designers as well as for hardware designers.
Nonetheless, current software and hardware engineering
practices have embraced different approaches to this problem.
Software designs are usually modelled after a set of
proven solutions to recurrent problems called design patterns.
This approach differs from the component-based
reuse usually found in hardware designs: design patterns
do not specify unnecessary implementation details.
Several authors have already proposed translating structural
design patterns concepts to hardware design. In this
paper we extend the discussion to behavioural design patterns.
Specifically, we describe how the hardware version
of the Iterator can be used to enhance model reuse.
-
A Public-Key Watermarking Technique for IP Designs [p. 330]
-
A. Abdel-Hamid, S. Tahar, and E. Aboulhamid
Sharing IP blocks in today's competitive market
poses significant security risks. Creators and owners of IP
designs want assurances that their content will not be illegally
redistributed by consumers, and consumers want assurances that
the content they buy is legitimate. Recently, digital watermarking
emerged as a candidate solution for copyright protection of IP
blocks. In this paper, we propose a new approach for watermarking
IP designs based on the embedding of the ownership
proof as part of the IP design's FSM. The approach utilizes
coinciding as well as unused transitions in the state transition
graph of the design. Our approach increases the robustness
of the watermark and allows a secure implementation, hence
enabling the development of the first public-key IP watermarking
scheme at the FSM level. We also define evaluation criteria for our
approach, and use experimental measures to prove its robustness.
-
Design of a Virtual Component Neutral Network-on-Chip Transaction Layer [p. 336]
-
P. Martin
Research studies have demonstrated the
feasibility and advantages of Network-on-Chip (NoC)
over traditional bus-based architectures but have not
focused on compatibility with communication standards. This
paper describes a number of issues faced when designing
a VC-neutral NoC, i.e. compatible with standards such
as AHB 2.0, AXI, VCI, OCP, and various other
proprietary protocols, and how a layered approach to
communication helps solve these issues.
Moderators: J. Henkel, Karlsruhe U, DE; W. Nebel, OFFIS, DE
-
Quality-Driven Proactive Computation Elimination for Power-Aware Multimedia Processing [p. 340]
-
S. Yardi, M. Hsiao, T. Martin, and D. Ha
We present a novel, quality-driven, architectural-level approach
that trades off the output quality to enable power-aware
processing of multimedia streams. The error tolerance of multimedia
data is exploited to selectively eliminate computation
while maintaining a specified output quality. We construct relaxed,
synthesized power macro-models for power-hungry units
to predict the cycle-accurate power consumption of the input
stream on the fly. The macro-models, together with an effective
quality model, are integrated into a programmable architecture
that allows both power savings and quality to be dynamically
tuned with the available battery-life. In a case study, power monitors
are integrated with functional units of the IDCT module of
an MPEG-2 decoder. Experiments indicate that, for a moderate
power monitor energy overhead of 5%, power savings of 72% in
the functional units can be achieved resulting in an increase in
battery life by 1.95x.
-
HEBS: Histogram Equalization for Backlight Scaling [p. 346]
-
A. Iranli, H. Fatemi, and M. Pedram
In this paper, a method is proposed for finding a pixel
transformation function that maximizes backlight dimming while maintaining a
pre-specified image distortion level for a liquid crystal display. This is achieved
by finding a pixel transformation function, which maps the original image
histogram to a new histogram with lower dynamic range. Next the contrast of the
transformed image is enhanced so as to compensate for brightness loss that
would arise from backlight dimming. The proposed approach relies on an
accurate definition of the image distortion which takes into account both the pixel
value differences and a model of the human visual system and is amenable to
highly efficient hardware realization. Experimental results show that the
histogram equalization for backlight scaling method results in about 45% power
saving with an effective distortion rate of 5% and 65% power saving for a 20%
distortion rate. This is significantly higher power savings compared to previously
reported backlight dimming approaches.
-
Energy- and Performance-Driven NoC Communication Architecture Synthesis Using a
Decomposition Approach [p. 352]
-
U. Ogras and R. Marculescu
In this paper, we present a methodology for customized
communication architecture synthesis that matches the communication
requirements of the target application. This is an
important problem, particularly for network-based implementations
of complex applications. Our approach is based on
using frequently encountered generic communication primitives
as an alphabet capable of characterizing any given communication
pattern. The proposed algorithm searches
through the entire design space for a solution that minimizes
the system total energy consumption, while satisfying the
other design constraints. Compared to the standard mesh
architecture, the customized architecture generated by the
newly proposed approach shows about 36% throughput
increase and 51% reduction in the energy required to encrypt
128 bits of data with a standard encryption algorithm.
-
A Way Memoization Technique for Reducing Power Consumption of Caches in
Application Specific Integrated Processors [p. 358]
-
T. Ishihara and F. Fallah
This paper presents a technique for eliminating redundant
cache-tag and cache-way accesses to reduce power
consumption. The basic idea is to keep a small number
of Most Recently Used (MRU) addresses in a Memory Address
Buffer (MAB) and to omit redundant tag and way accesses
when there is a MAB-hit. Since the approach keeps
only tag and set-index values in the MAB, the energy and
area overheads are relatively small even for a MAB with a
large number of entries. Furthermore, the approach does
not sacrifice the performance. In other words, neither the
cycle time nor the number of executed cycles increases.
The proposed technique has been applied to Fujitsu VLIW
processor (FR-V) and its power saving has been estimated
using NanoSim. Experiments for 32kB 2-way set associative
caches show the power consumption of I-cache and
D-cache can be reduced by 40% and 50%, respectively.
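The MAB mechanism can be sketched with a toy cache model. The Python example below is a minimal illustration under stated assumptions: the class name `WayMemoCache`, the LRU replacement policy, the parameter defaults, and the rule that a MAB entry is dropped when its cache line is evicted are all choices made for this sketch, since the abstract does not describe how the real design keeps the MAB consistent. The model only counts how many tag lookups are skipped; it does not estimate power.

```python
from collections import OrderedDict


class WayMemoCache:
    """Toy set-associative cache with a memory address buffer (MAB)."""

    def __init__(self, num_sets=4, num_ways=2, line_bytes=16, mab_entries=2):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.mab_entries = mab_entries
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        # Per-set way order, least recently used first.
        self.lru = [list(range(num_ways)) for _ in range(num_sets)]
        # MAB maps (tag, set index) to the way that holds the line, MRU last.
        self.mab = OrderedDict()
        self.tag_lookups = 0
        self.mab_hits = 0

    def _touch(self, set_index, way):
        order = self.lru[set_index]
        order.remove(way)
        order.append(way)

    def access(self, address):
        line = address // self.line_bytes
        set_index = line % self.num_sets
        tag = line // self.num_sets
        key = (tag, set_index)
        if key in self.mab:
            # MAB hit: the way is already known, so no tag comparison is needed.
            self.mab_hits += 1
            self.mab.move_to_end(key)
            self._touch(set_index, self.mab[key])
            return "mab-hit"
        self.tag_lookups += 1
        ways = self.tags[set_index]
        if tag in ways:
            way, outcome = ways.index(tag), "hit"
        else:
            way, outcome = self.lru[set_index][0], "miss"
            if ways[way] is not None:
                # Assumption: evicting a line invalidates its MAB entry.
                self.mab.pop((ways[way], set_index), None)
            ways[way] = tag
        self._touch(set_index, way)
        self.mab[key] = way
        if len(self.mab) > self.mab_entries:
            self.mab.popitem(last=False)  # drop the least recently used entry
        return outcome
```

Consecutive accesses to the same cache line, which are common in instruction fetch, show up here as `mab-hit` results that leave `tag_lookups` unchanged.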
Moderators: F. Petrot, Pierre et Marie Curie U, Paris VI, FR; H. Hsieh, UC Riverside, US
-
Design Space Exploration for Dynamically Reconfigurable Architectures [p. 366]
-
B. Miramond and J.-M. Delosme
By incorporating reconfigurable hardware in embedded
system architectures it has become easier to
satisfy the performance constraints of demanding applications
while lowering system cost. In order to evaluate
the performance of a candidate architecture, the nodes
(tasks) of the data flow graphs that describe an application
must be assigned to the computing resources of
the architecture: programmable processors and reconfigurable
FPGAs, whose run-time reconfiguration capabilities
must be exploited. In this paper we present a novel design
exploration tool - based on a local search algorithm
with global convergence properties - which simultaneously
explores choices for computing resources, assignments
of nodes to these resources, task schedules on the
programmable processors and context definitions for the reconfigurable
circuits. The tool finds a solution that minimizes
system cost while meeting the performance constraints;
more precisely it lets the designer select the quality
of the optimization (hence its computing time) and finds accordingly
a solution with close-to-minimal cost.
-
A Dependability-Driven System-Level Design Approach for Embedded Systems [p. 372]
-
A. Jhumka, S. Klaus, and S. Huss
The objective of this paper is to introduce dependability
as an optimization criterion in the system-level design
process of embedded systems. Given the pervasiveness of
embedded systems, especially in the area of highly dependable
and safety-critical systems, it is imperative to directly
consider dependability in the system level design process.
This naturally leads to a multi-objective optimization problem,
as cost and time have to be considered too. This paper
proposes a genetic algorithm to solve this multi-objective
optimization problem and to determine a set of Pareto optimal
design alternatives in a single optimization run. Based
on these alternatives, the designer can choose his best solution,
finding the desired tradeoff between cost, schedulability,
and dependability.
-
A Time Slice Based Scheduler Model for System Level Design [p. 378]
-
V. Shah, C. Passerone, L. Lavagno, and Y. Watanabe
Efficient evaluation of design choices, in terms of selection
of algorithms to be implemented as hardware or software,
and finding an optimal hw/sw design mix is an important
requirement in the design flow of Embedded Systems.
Time-to-market, faster upgradability and flexibility are some
of the driving points to put increasing amounts of functionality
as software executed on general purpose processing elements.
In this scenario, dividing a monolithic task into multiple
interacting tasks, and scheduling them on limited processing
elements has become very important for a system designer.
This paper presents an approach to model time-slice
based task schedulers in the designs where the performance
estimate of hardware and software models is less than time-slice
accurate. The approach aims to increase the simulation
efficiency of designs modeled at system level. We used
Metropolis [1] as our codesign environment.
-
A Prediction Packetizing Scheme for Reducing Channel Traffic in Transaction-Level
Hardware/Software Co-Emulation [p. 384]
-
J.-G. Lee, M.-K. Chung, K.-Y. Ahn, S.-H. Lee, and C.-M. Kyung
This paper presents a scheme for efficient channel usage
between simulator and accelerator where the accelerator
models some RTL sub-blocks in the accelerator-based
hardware/software co-simulation while the simulator runs
a transaction-level model of the remaining part of the whole
chip being verified. With a conventional simulation accelerator,
evaluations of simulator and accelerator alternate at
every valid simulation time, which results in poor simulation
performance due to startup overhead of simulator-accelerator
channel access. The startup overhead can be
reduced by merging multiple transactions on the channel
into a single burst traffic. We propose a predictive packetizing
scheme for reducing channel traffic by merging as
many transactions into a burst traffic as possible based on
"prediction and rollback". Under ideal conditions with 100%
prediction accuracy, the proposed method shows a performance
gain of 1500% compared to the conventional one.
-
Automated Synthesis of Assertion Monitors Using Visual Specification [p. 390]
-
A. Gadkari and S. Ramesh
Automated synthesis of monitors from high-level properties plays a significant role in assertion-based verification. We present here a methodology to synthesize assertion monitors from visual specifications given in CESC (Clocked Event Sequence Chart). CESC is a visual language designed for specifying system level interactions involving single and multiple clock domains. It has well-defined graphical and textual syntax and formal semantics based on the synchronous language paradigm, enabling formal analysis of specifications. In this paper we provide an overview of the CESC language with a few illustrative examples. The algorithm for automated synthesis of assertion monitors from CESC specifications is described. A few examples from standard bus protocols (OCP-IP and AMBA) are presented to demonstrate the application of the monitor synthesis algorithm.
-
A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms [p. 396]
-
G. Stitt and F. Vahid
In this paper, we present a software compilation approach for
microprocessor/FPGA platforms that partitions a software
binary onto custom hardware implemented in the FPGA. Our
approach imposes fewer restrictions on software tool flow than
previous compiler approaches, allowing software designers to
use any software language and compiler. Our approach uses a
back-end partitioning tool that utilizes decompilation techniques
to recover important high-level information, resulting in
performance comparable to high-level compiler-based
approaches.
Moderators: M. Berkelaar, Magma Design Automation, NL; T. Villa, DIEGM - Udine U, IT
-
Statistical Timing Based Optimization Using Gate Sizing [p. 400]
-
A. Agarwal, K. Chopra, and D. Blaauw
The increased dominance of intra-die process variations has
motivated the field of Statistical Static Timing Analysis (SSTA) and
has raised the need for SSTA-based circuit optimization. In this
paper, we propose a new sensitivity based, statistical gate sizing
method. Since brute-force computation of the change in circuit
delay distribution to gate size change is computationally expensive,
we propose an efficient and exact pruning algorithm. The pruning
algorithm is based on a novel theory of perturbation bounds which
are shown to decrease as they propagate through the circuit. This
allows pruning of gate sensitivities without complete propagation of
their perturbations. We apply our proposed optimization algorithm
to ISCAS benchmark circuits and demonstrate the accuracy and
efficiency of the proposed method. Our results show an improvement
of up to 10.5% in the 99-percentile circuit delay for the same
circuit area, using the proposed statistical optimizer and a run time
improvement of up to 56x compared to the brute-force approach.
-
An Efficient Algorithm for Finding Double-Vertex Dominators in Circuit Graphs [p. 406]
-
M. Teslenko and E. Dubrova
Graph dominators provide a general mechanism for identifying
re-converging paths in circuits. This is useful in a number
of CAD applications including computation of signal probabilities
for test generation, switching activities for power and noise
analysis, statistical timing analysis, cut point selection in equivalence
checking, etc. Single-vertex dominators are too rare in
circuit graphs to handle re-converging paths in a practical way.
This paper addresses the problem of finding double-vertex dominators,
which occur more frequently. First, we introduce a data
structure, called dominator chain, which allows representing all
possible O(n^2) double-vertex dominators of a given vertex in O(n)
space, where n is the number of vertices of the circuit graph. Dominator
chains can be efficiently manipulated, e.g. it takes constant
time to look-up whether a given pair of vertices is a double-vertex
dominator. Second, we present an efficient algorithm for finding
double-vertex dominators. The experimental results show that the
presented algorithm is an order of magnitude faster than existing
algorithms for finding double-vertex dominators. Thus, it is suitable
for running in an incremental manner during logic synthesis.
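To make the notion concrete, the Python sketch below checks by brute force whether a pair of vertices is a double-vertex dominator. It assumes the usual definition: every path from the vertex `u` to the root passes through `v1` or `v2`, and neither vertex alone blocks all paths. The function names and the adjacency-dictionary graph format are hypothetical, and this reachability check is only a reference for the definition. It is not the dominator-chain algorithm of the paper and is far slower.

```python
def reaches(graph, src, dst, removed):
    """Depth-first search from src to dst that never enters a removed vertex."""
    if src in removed:
        return False
    stack, seen = [src], {src}
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        for w in graph.get(v, ()):
            if w not in seen and w not in removed:
                seen.add(w)
                stack.append(w)
    return False


def is_double_dominator(graph, u, root, v1, v2):
    """True if {v1, v2} dominates u with respect to root and is minimal."""
    if len({u, root, v1, v2}) < 4:
        return False
    blocks_together = not reaches(graph, u, root, {v1, v2})
    # If removing one vertex still leaves a path, that vertex alone is not
    # a single-vertex dominator, so both members of the pair are needed.
    v1_alone_fails = reaches(graph, u, root, {v1})
    v2_alone_fails = reaches(graph, u, root, {v2})
    return blocks_together and v1_alone_fails and v2_alone_fails


# Small example: edges point from the vertex u towards the root r.
EXAMPLE = {"u": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["r"], "d": ["r"]}
```

In `EXAMPLE`, the pairs {c, d} and {b, c} are double-vertex dominators of `u`, while {a, d} is not, because the path u, b, c, r avoids both vertices.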
-
SAT-Based Complete Don't-Care Computation for Network Optimization [p. 412]
-
A. Mishchenko and R. Brayton
This paper describes an improved approach to Boolean network
optimization using internal don't-cares. The improvements concern
the type of don't-cares computed, their scope, and the computation
method. Instead of the traditionally used compatible observability
don't-cares (CODCs), we introduce and justify the use of complete
don't-cares (CDC). To ensure the robustness of the don't-care
computation for very large industrial networks, an optional
windowing scheme is implemented that computes substantial subsets
of the CDCs in reasonable time. Finally, we give a SAT-based don't-care
computation algorithm that is more efficient than BDD-based
algorithms. Experimental results confirm that these improvements
work well in practice. Complete don't-cares allow for a reduction in
the number of literals compared to the CODCs. Windowing
guarantees robustness, even for very large benchmarks on which
previous methods could not be applied. SAT reduces the runtime and
enhances robustness, making don't-cares affordable for a variety of
other Boolean methods applied to the network.
-
Efficient Solution of Language Equations Using Partitioned Representations [p. 418]
-
A. Mishchenko, R. Brayton, R. Jiang, T. Villa, and N. Yevtushenko
A class of discrete event synthesis problems can be reduced to
solving language equations F · X ⊆ S, where F is the fixed
component and S the specification. Sequential synthesis deals with
FSMs when the automata for F and S are prefix closed, and are
naturally represented by multi-level networks with latches. For this
special case, we present an efficient computation, using partitioned
representations, of the most general prefix-closed solution of the
above class of language equations. The transition and the output
relations of the FSMs for F and S in their partitioned form are
represented by the sets of output and next state functions of the
corresponding networks. Experimentally, we show that using
partitioned representations is much faster than using monolithic
representations, as well as applicable to larger problem instances.
-
DPA on Quasi Delay Insensitive Asynchronous Circuits: Formalization and Improvement [p. 424]
-
G. Bouesse, M. Renaudin, S. Dumont, and F. Germain
The purpose of this paper is to formally specify a flow
devoted to the design of Differential Power Analysis (DPA)
resistant QDI asynchronous circuits. The paper first
proposes a formal modeling of the electrical signature of
QDI asynchronous circuits. The DPA is then applied to the
formal model in order to identify the source of leakage of
this type of circuits. Finally, a complete design flow is
specified to minimize the information leakage. The
relevancy and efficiency of the approach is demonstrated
using the design of an AES crypto-processor.
-
Bound Set Selection and Circuit Re-Synthesis for Area/Delay Driven Decomposition [p. 430]
-
A. Martinelli and E. Dubrova
This paper addresses two problems related to disjoint-support
decomposition of Boolean functions. First, we
present a heuristic for finding a subset of variables,
X, which results in the disjoint-support decomposition
f(X,Y) = h(g(X),Y) with a good area/delay
trade-off. Second, we present a technique for re-synthesis of
the original circuit implementing f(X,Y) into a circuit implementing
the decomposed representation h(g(X),Y).
Preliminary experimental results indicate that the proposed
approach has a significant potential.
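Whether a candidate bound set X admits a decomposition f(X,Y) = h(g(X),Y) with a single-output g can be tested with the classical column-multiplicity criterion (Ashenhurst): such a decomposition exists exactly when the decomposition chart has at most two distinct columns. The Python sketch below implements that textbook check by exhaustive enumeration. The function name `column_multiplicity` and the calling convention are assumptions for illustration; it does not reproduce the bound-set selection heuristic or the re-synthesis technique of the paper.

```python
from itertools import product


def column_multiplicity(f, num_vars, bound):
    """Count distinct columns of the decomposition chart of f.

    f        : function of num_vars arguments, each 0 or 1, returning 0 or 1
    bound    : indices of the variables placed in the bound set X
    A column is the truth table of f over the free variables Y for one
    fixed assignment of the bound variables.
    """
    bound = list(bound)
    free = [i for i in range(num_vars) if i not in bound]
    columns = set()
    for x_bits in product((0, 1), repeat=len(bound)):
        column = []
        for y_bits in product((0, 1), repeat=len(free)):
            bits = [0] * num_vars
            for i, b in zip(bound, x_bits):
                bits[i] = b
            for i, b in zip(free, y_bits):
                bits[i] = b
            column.append(f(*bits))
        columns.add(tuple(column))
    return len(columns)


def example_function(a, b, c):
    """f(a, b, c) = (a XOR b) AND c."""
    return (a ^ b) & c
```

For `example_function`, the bound set {a, b} gives a multiplicity of 2, matching g = a XOR b, while the bound set {a, c} gives 3, so no single-output g exists for it.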
-
Uniformly-Switching Logic for Cryptographic Hardware [p. 432]
-
I. Markov and D. Maslov
Recent work on Differential Power Analysis shows that even
mathematically-secure cryptographic protocols may be vulnerable at
the physical implementation level. By measuring energy consumed by
a working digital circuit, one can glean enough information to break
encryption. Thwarting such attacks requires a new approach to logic
and physical design. In this work, we seek to equalize switching activity
of a circuit over all possible inputs and input transitions by adding
redundant gates and increasing the overall number of signal transitions.
We introduce uniformly-switching (U-S) logic, and present a
doubling construction that equalizes power dissipation without requiring
drastic changes in CAD tools.
-
Exact Synthesis of 3-qubit Quantum Circuits from Non-Binary Quantum Gates Using
Multiple-Valued Logic and Group Theory [p. 434]
-
G. Yang, W. Hung, X. Song, and M. Perkowski
We propose an approach to optimally synthesize
quantum circuits from non-permutative quantum gates
such as Controlled-Square-Root-of-Not (i.e. Controlled-V).
Our approach reduces the synthesis problem to
multiple-valued optimization and uses group theory. We
devise a novel technique that transforms the quantum
logic synthesis problem from a multi-valued constrained
optimization problem to a group permutation problem.
The transformation enables us to utilize group theory to
exploit the properties of the synthesis problem. Assuming
a cost of one for each two-qubit gate, we found all
reversible circuits with quantum costs of 4, 5, 6, etc, and
give another algorithm to realize these reversible circuits
with quantum gates.
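The gate named in the abstract can be checked numerically. The Python sketch below uses the standard 2x2 matrix for V, the square root of NOT, and verifies that V applied twice equals NOT and that V is unitary. A Controlled-V gate applies this matrix to the target qubit when the control qubit is 1. The helper names are hypothetical, and the code illustrates only this gate identity, not the synthesis method.

```python
# V = (1 + i)/2 * [[1, -i], [-i, 1]], the standard square root of NOT.
SCALE = (1 + 1j) / 2
V = [[SCALE * 1, SCALE * -1j],
     [SCALE * -1j, SCALE * 1]]
NOT = [[0, 1], [1, 0]]
IDENTITY = [[1, 0], [0, 1]]


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    """Conjugate transpose of a 2x2 complex matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]


def close(a, b, tol=1e-12):
    """Element-wise comparison with a small absolute tolerance."""
    return all(abs(a[i][j] - b[i][j]) <= tol
               for i in range(2) for j in range(2))


v_squared_is_not = close(matmul(V, V), NOT)
v_is_unitary = close(matmul(V, dagger(V)), IDENTITY)
```

Because V is not a permutation matrix, circuits built from it pass through non-binary intermediate values, which is why the abstract speaks of non-permutative gates and multiple-valued logic.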
Moderators: R. Aitken, Artisan, US; C. Hawkins, New Mexico U, US
-
Memory Testing under Different Stress Conditions: An Industrial Evaluation [p. 438]
-
A. Majhi, M. Azimane, G. Gronthoud, M. Lousberg, S. Eichenberger, and F. Bowen
This paper presents the effectiveness of various stress conditions
(mainly voltage and frequency) on detecting the resistive shorts and open
defects in deep sub-micron embedded memories in an industrial environment.
Simulation studies on very-low voltage, high voltage and at-speed testing
show the need of the stress conditions for high quality products; i.e.,
low defect-per-million (DPM) level, which is driving the semiconductor
market today. The above test conditions have been validated to screen out
bad devices on real silicon (a test-chip) built on CMOS 0.18μm
technology. IFA (inductive fault analysis) based simulation technique
leads to an efficient fault coverage and DPM estimator, which helps the
customers upfront to make decisions on test algorithm implementations under
different stress conditions in order to reduce the number of test escapes.
-
Worst-Case and Average-Case Analysis of n-Detection Test Sets [p. 444]
-
I. Pomeranz and S. Reddy
Test sets that detect each target fault n times (n-detection
test sets) are typically generated for restricted values of n
due to the increase in test set size with n. We perform
both a worst-case analysis and an average-case analysis
to check the effect of restricting n on the unmodeled fault
coverage of an (arbitrary) n-detection test set. Our
analysis is independent of any particular test set or test
generation approach. It is based on a specific set of target
faults and a specific set of untargeted faults. It shows that,
depending on the circuit, very large values of n may be
needed to guarantee the detection of all the untargeted
faults. We discuss the implications of these results.
-
Defect Aware Test Patterns [p. 450]
-
H. Tang, G. Chen, S. Reddy, C. Wang, J. Rajski, and I. Pomeranz
A method to generate test patterns referred to as defect
aware test patterns is proposed. Defect aware test
patterns have greater ability to detect un-modeled
defects. The proposed method can be used with any test
generation procedure to improve the effectiveness of the
tests in detecting un-modeled defects. Experimental
results on several industrial designs show the
effectiveness of defect aware tests. We also propose a
measure to estimate the effectiveness of given test sets in
detecting un-modeled defects.
-
Computational Intelligence Characterization Method of Semiconductor Device [p. 456]
-
E. Liau and D. Schmitt-Landsiedel
Characterization of semiconductor devices is used to gather as much data about the device as possible to determine weaknesses in design or trends in the manufacturing process.
In this paper, we propose a novel multiple trip point characterization concept to overcome the constraint of the single trip point concept in the device characterization phase. In addition, we use computational intelligence techniques (e.g. neural network, fuzzy and genetic algorithm) to further manipulate these sets of multiple trip point values and tests based on semiconductor test equipment. Our experimental results demonstrate an excellent design parameter variation analysis in the device characterization phase, as well as detection of a set of worst case tests that can provoke the worst case variation, which the traditional approach was not capable of detecting.
-
A New Embedded Measurement Structure for eDRAM Capacitor [p. 462]
-
L. Lopez, D. Née, and J. Portal
Embedded DRAM (eDRAM) is increasingly used in Systems on Chip (SoC).
Integrating the DRAM capacitor process into a logic process while
achieving satisfactory yields is challenging. The specific process of DRAM
capacitor and the low capacitance value (~30fF) of this device induce
problems of process monitoring and failure analysis. We propose a new test
structure to measure the capacitance value of each DRAM cell capacitor in a
DRAM array. This concept has been validated by simulation on a 0.18μm
eDRAM technology.
-
Smart Temperature Sensor for Thermal Testing of Cell-Based ICs [p. 464]
-
S. Bota, M. Rosales, J. Rosseló, and J. Segura
In this paper we present a simple and efficient built-in temperature sensor
for thermal monitoring of standard-cell based VLSI circuits. The proposed
smart temperature sensor uses a ring-oscillator composed of complex gates
instead of inverters to optimize their linearity. Simulation results from
a 0.18μm CMOS technology show that the non-linearity error of the
sensor can be reduced when an adequate set of standard logic gates is
selected.
Moderators: S. Baruah, North Carolina U, US; J.-D. Decotignie, CSEM, CH
-
An Approximation Algorithm for Energy-Efficient Scheduling on a Chip Multiprocessor [p. 468]
-
C.-Y. Yang, J.-J. Chen, and T.-W. Kuo
In the recent decade, voltage scaling has become an attractive
feature for many system component designs. In this
paper, we consider energy-efficient real-time task scheduling
over a chip multiprocessor architecture. The objective is to
schedule a set of frame-based tasks with the minimum energy
consumption, where all tasks are ready at time 0 and share a
common deadline. We show that such a minimization problem
is NP-hard and then propose a 2.371-approximation algorithm.
The strength of the proposed algorithm was demonstrated
by a series of simulations, for which near optimal results
were obtained.
-
Energy-Efficient, Utility Accrual Real-Time Scheduling Under the Unimodal Arbitrary Arrival Model [p. 474]
-
H. Wu, B. Ravindran, and E. Jensen
We present an energy-efficient real-time scheduling algorithm
called EUA*, for the unimodal arbitrary arrival
model (or UAM). UAM embodies a "stronger" adversary
than most arrival models. The algorithm considers application
activities that are subject to time/utility function time
constraints, UAM, and the multi-criteria scheduling objective
of probabilistically satisfying utility lower bounds,
and maximizing system-level energy efficiency. Since the
scheduling problem is intractable, EUA* allocates CPU
cycles, scales clock frequency, and heuristically computes
schedules using statistical estimates of cycle demands, in
polynomial-time. We establish that EUA* achieves optimal
timeliness during under-loads, and identify the conditions
under which timeliness assurances hold. Our simulation experiments
illustrate EUA*'s superiority.
-
Context-Aware Scheduling Analysis of Distributed Systems with Tree-Shaped Task-Dependencies [p. 480]
-
R. Henia and R. Ernst
In this paper we present a new technique which exploits
timing-correlation between tasks for scheduling analysis
in multiprocessor and distributed systems with tree-shaped
task-dependencies. Previously developed techniques also
allow capturing and exploiting timing-correlation in distributed
systems. However, they are only suitable for linear
systems, where tasks cannot trigger more than one succeeding
task. The new technique presented in this paper, allows
capturing timing-correlation between tasks in parallel paths
in a more accurate way, enabling its exploitation to calculate
tighter bounds for the worst-case response time analysis for
tasks scheduled under a static priority preemptive scheduler.
-
A New Task Model for Streaming Applications and its Schedulability Analysis [p. 486]
-
S. Chakraborty and L. Thiele
In this paper we introduce a new task model that
is specifically targeted towards representing stream processing
applications. Examples of such applications are
those involved in network packet processing (such as a
software-based router) and multimedia processing (such as
an MPEG decoder application). Our task model is made up
of two parts: (i) a new task structure to accurately model the
software structures of stream processing applications such
as conditional branches and different end-to-end deadlines
for different types of input data items, and (ii) a new event
model to represent the arrival pattern of the data items to
be processed, which triggers the task structure. This event
model is more expressive than classical models such as
purely periodic, periodic with jitter or sporadic event models.
We then present algorithms for the schedulability analysis
of this task model. The basic scheme underlying our
algorithms is a generalization of the techniques used for the
schedulability analysis of the recently proposed generalized
multiframe and the recurring real-time task models.
-
Efficient Feasibility Analysis for Real-Time Systems with EDF Scheduling [p. 492]
-
K. Albers and F. Slomka
This paper presents new fast exact feasibility tests for
uniprocessor real-time systems using preemptive EDF
scheduling. Task sets which are accepted by previously
described sufficient tests will be evaluated in nearly the
same time as with the old tests by the new algorithms.
Many task sets are not accepted by the earlier tests
despite them being feasible. These task sets will be
evaluated by the new algorithms a lot faster than with
known exact feasibility tests. Therefore it is possible to
use them for many applications for which only sufficient
tests are suitable. Additionally, this paper shows that the
best previously known sufficient test, the best known
feasibility bound and the best known approximation
algorithm can be derived from these new tests. As a result,
this leads to an integrated schedulability theory for
EDF.
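For context, the classical exact test that such work is measured against is the processor demand criterion. The Python sketch below implements that textbook test for synchronous periodic tasks with relative deadlines no longer than their periods and integer parameters. It is the baseline kind of "known exact feasibility test", not the new algorithm of the paper, and the function names and the (C, D, T) tuple layout are assumptions for illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def demand(tasks, t):
    """Total execution time of all jobs with release and deadline in [0, t].

    Each task is a tuple (C, D, T): worst-case execution time, relative
    deadline and period, all positive integers with D <= T.
    """
    return sum(((t - d) // p + 1) * c for c, d, p in tasks if t >= d)


def edf_feasible(tasks):
    """Exact EDF feasibility check by the processor demand criterion."""
    # Utilisation above 1 is infeasible on one processor.
    if sum(Fraction(c, p) for c, _, p in tasks) > 1:
        return False
    # With D <= T and synchronous release it is enough to check every
    # absolute deadline up to the hyperperiod.
    hyper = reduce(lambda a, b: a * b // gcd(a, b), (p for _, _, p in tasks))
    deadlines = sorted({d + k * p
                        for _, d, p in tasks
                        for k in range((hyper - d) // p + 1)})
    return all(demand(tasks, t) <= t for t in deadlines)
```

The cost of this check grows with the number of deadlines in the hyperperiod, which can be very large for task sets with co-prime periods. This is the runtime problem that faster exact tests aim to reduce.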
-
Unified Modeling of Complex Real-Time Control Systems [p. 498]
-
H. He, Y.-F. Zhong, and C.-L. Cai
A complex real-time control system is a software-intensive
and algorithm-intensive system whose design requires
modern software engineering techniques.
UML is an object-oriented industry-standard
modeling language that is increasingly used in the real-time
domain. This paper first analyses the advantages and
problems of using UML for real-time control systems
design. Then, it proposes an extension of UML-RT to
support modeling of time-continuous subsystems. This makes it
possible to unify the modeling of complex real-time control
systems on the UML-RT platform, from requirement
analysis, model design, and simulation through to code
generation.
Moderators: M. Poncino, Verona U, IT; R. Zafalon, STMicroelectronics, IT
-
Exploring NoC Mapping Strategies: An Energy and Timing Aware Technique [p. 502]
-
C. Marcon, A. Susin, N. Calazans, F. Moraes, F. Hessel, and I. Reis
Complex applications implemented as Systems on
Chip (SoCs) demand extensive use of system level modeling
and validation. Their implementation gathers a large
number of complex IP cores and advanced interconnection
schemes, such as hierarchical bus architectures or
networks on chip (NoCs). Modeling applications involves
capturing their computation and communication characteristics.
Previously proposed communication weighted models
(CWM) consider only the application communication
aspects. This work proposes a communication dependence
and computation model (CDCM) that can simultaneously
consider both aspects of an application. It presents a solution
to the problem of mapping applications on regular
NoCs while considering execution time and energy consumption.
The use of CDCM is shown to provide estimated
average reductions of 40% in execution time, and
20% in energy consumption, for current technologies.
-
Exploring Energy/Performance Tradeoffs in Shared Memory MPSoCs:
Snoop-Based Cache Coherence vs. Software Solutions [p. 508]
-
M. Loghi and M. Poncino
Shared memory is a common interprocessor communication
paradigm for single-chip multi-processor platforms.
Snoop-based cache coherence is a very successful technique
that provides a clean shared-memory programming
abstraction in general-purpose chip multi-processors, but
there is no consensus on its usage in resource-constrained
multiprocessor systems on chips (MPSoCs) for embedded
applications.
This work aims at providing a comparative energy and
performance analysis of cache coherence support schemes
in MPSoCs. Thanks to the use of a complete multiprocessor
simulation platform, which relies on accurate
technology-homogeneous power models, we were able to
explore different cache-coherent shared-memory communication
schemes for a number of cache configurations and
workloads.
-
Quasi-Static Voltage Scaling for Energy Minimization with Time Constraints [p. 514]
-
A. Andrei, P. Eles, Z. Peng, M. Schmitz, and B. Al Hashimi
Supply voltage scaling and adaptive body-biasing are important techniques
that help to reduce the energy dissipation of embedded systems.
This is achieved by dynamically adjusting the voltage and performance
settings according to the application needs. In order to take full advantage
of slack that arises from variations in the execution time, it is
important to recalculate the voltage (performance) settings during runtime,
i.e., online. However, voltage scaling (VS) is computationally expensive,
and thus significantly hampers the possible energy savings. To
overcome the online complexity, we propose a quasi-static voltage scaling
scheme, with a constant online time complexity O(1). This makes it possible to
increase the exploitable slack as well as to avoid the energy dissipated
due to online recalculation of the voltage settings. We conduct several
experiments that demonstrate the advantages of the proposed technique
over the previously published voltage scaling approaches.
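
As a rough illustration of the quasi-static idea (not the authors' implementation), the sketch below assumes that voltage settings for one task have been computed offline for a few possible start-time slots and stored in a table. Online, the scheduler only indexes the table, which takes constant time. The slot width, the table values and the function name are hypothetical.

```python
# Hypothetical offline-computed table for one task: entry i holds the
# voltage (V) to use when the task starts in time slot i.
# Slot width and values are illustrative assumptions only.
SLOT_WIDTH_MS = 2
VOLTAGE_TABLE = [1.0, 1.2, 1.4, 1.8]  # earlier start (more slack) -> lower voltage


def select_voltage(start_time_ms: int) -> float:
    """Constant-time online lookup of a precomputed voltage setting."""
    slot = start_time_ms // SLOT_WIDTH_MS
    # Clamp to the table: the latest start gets the highest (safe) voltage.
    slot = min(max(slot, 0), len(VOLTAGE_TABLE) - 1)
    return VOLTAGE_TABLE[slot]
```

A task that finishes early lets its successor start in an earlier slot, so the successor picks up a lower voltage without any online optimisation.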
-
Tag Overflow Buffering: An Energy-Efficient Cache Architecture [p. 520]
-
P. Azzoni, M. Loghi, and M. Poncino
We propose a novel energy-efficient memory architecture
which relies on the use of a cache with a reduced number of
tag bits. The idea behind the proposed architecture is
to move a large number of the tag bits from the cache
into an external register (Tag Overflow Buffer) that identifies
the current locality of the memory references; additional
hardware dynamically updates the value of
the reference locality contained in the buffer. Energy efficiency
is achieved by using, for most of the memory accesses,
a reduced-tag cache.
This architecture is minimally intrusive for existing designs,
since it assumes the use of a regular cache, and does
not require any special circuitry internal to the cache such
as row or column activation mechanisms. Average energy
savings are 51% on tag energy, corresponding to about 20%
saving on total cache energy, measured on a set of typical
embedded applications.
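
The following toy model is one way to picture the mechanism described above; it is a sketch under stated assumptions, not the architecture from the paper. It assumes a direct-mapped cache, hypothetical field widths, and a simple policy that invalidates all lines whenever the high tag bits of an access differ from the buffer contents.

```python
class ReducedTagCache:
    """Toy direct-mapped cache whose lines store only the low tag bits.

    The high tag bits live in a single shared register (the tag overflow
    buffer). Field widths and the flush-on-change policy are assumptions
    made for illustration.
    """

    def __init__(self, index_bits=4, offset_bits=2, low_tag_bits=4):
        self.index_bits = index_bits
        self.offset_bits = offset_bits
        self.low_tag_bits = low_tag_bits
        self.lines = [None] * (1 << index_bits)  # stored low tag per line
        self.tob = None  # high tag bits of the current locality

    def access(self, addr: int) -> bool:
        """Return True on a hit, False on a miss (the line is then filled)."""
        index = (addr >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        low = tag & ((1 << self.low_tag_bits) - 1)
        high = tag >> self.low_tag_bits
        if high != self.tob:
            # Locality changed: stored low tags are no longer meaningful.
            self.lines = [None] * len(self.lines)
            self.tob = high
        hit = self.lines[index] == low
        self.lines[index] = low
        return hit
```

As long as accesses stay within one locality, only the short stored tag is compared on each access, which is where the tag-energy saving would come from in this model.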
-
Q-DPM: An Efficient Model-Free Dynamic Power Management Technique [p. 526]
-
M. Li, X. Wu, R. Yao, and X. Yan
When applying a Dynamic Power Management (DPM)
technique to pervasively deployed embedded systems, the
technique needs to be efficient enough to be
implemented on a low-end processor with a tight memory
budget. Furthermore, it should be able
to track time-varying behavior rapidly, because
time variation is an inherent characteristic of real-world
systems. Existing methods, which are usually model-based,
may not satisfy these requirements. In this
paper, we propose a model-free DPM technique based on
Q-Learning. Q-DPM is much more efficient because it
removes the overhead of the parameter estimator and mode-switch
controller. Furthermore, its policy optimization is
performed via consecutive online trials, which also
leads to a very rapid response to time-varying behavior.
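
For readers unfamiliar with Q-learning, a minimal tabular sketch is given below. It shows the standard update rule only; the action names, the state encoding and the reward are hypothetical and are not taken from the paper.

```python
import random

ACTIONS = ("stay_active", "sleep")  # hypothetical power-mode commands


class QPowerManager:
    """Tabular Q-learning agent; state encoding and reward are assumptions."""

    def __init__(self, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = {}  # (state, action) -> estimated long-term reward
        self.rng = random.Random(seed)

    def choose(self, state):
        """Epsilon-greedy choice: mostly exploit, occasionally try an action."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q.get((state, a), 0.0))

    def update(self, state, action, reward, next_state):
        """One Q-learning step: Q += alpha * (r + gamma * max Q' - Q)."""
        best_next = max(self.q.get((next_state, a), 0.0) for a in ACTIONS)
        old = self.q.get((state, action), 0.0)
        self.q[(state, action)] = old + self.alpha * (
            reward + self.gamma * best_next - old
        )
```

No model of the workload is estimated: the table is adjusted directly from observed rewards, so a change in behavior shows up in the policy after a few updates.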
-
Hardware Accelerated Power Estimation [p. 528]
-
J. Coburn, S. Ravi, and A. Raghunathan
In this paper, we present power emulation, a novel design paradigm
that utilizes hardware acceleration for the purpose of fast power estimation.
Power emulation is based on the observation that the functions
necessary for power estimation (power model evaluation, aggregation,
etc.) can be implemented as hardware circuits. Therefore, we can enhance
any given design with "power estimation hardware", map it to a
prototyping platform, and exercise it with any given test stimuli to obtain
power consumption estimates. Our empirical studies with industrial designs
reveal that power emulation can achieve significant speedups (10X
to 500X) over state-of-the-art commercial register-transfer level (RTL)
power estimation tools.
Organiser/Moderator: J. Bortolazzi, DaimlerChrysler, DE
Speakers: A. Sangiovanni-Vincentelli, UC Berkeley, US; H. Brinkmeyer, IBB, DE; S. Ortmann, Carmeq, DE;
J. Langenwalter, The MathWorks Inc, US/DE
-
Integrated Electronics in the Car and the Design Chain Evolution or Revolution? [p. 532]
-
A. Sangiovanni-Vincentelli
Given the lack of an overall understanding of the interplay of
the sub-systems and the difficulties encountered in integrating
very complex parts, system integration is increasingly becoming
a nightmare. In fact, Jurgen Hubbert, in
charge of the Mercedes-Benz passenger car division, publicly
stated in 2003: "The industry is fighting to solve problems
that are coming from electronics and companies that
introduce new technologies face additional risks. We have
experienced blackouts on our cockpit management and navigation
command system and there have been problems with
telephone connections and seat heating". I believe that this
sorry state is the rule for the leading OEMs, not the exception,
in today's environment. The source of these problems
is clearly the increased complexity but also the difficulty
of the OEMs in managing the integration and maintenance
process with subsystems that come from different
suppliers who use different design methods, different
software architecture, different hardware platforms, different
(and often proprietary) Real-Time Operating Systems.
Therefore, standards in the software and
hardware domains that will allow plug-and-play of subsystems
and their implementation are essential, while the
competitive advantage of an OEM will increasingly reside
in essential functionalities (e.g. stability control).
-
A New Approach to Component Testing [p. 534]
-
H. Brinkmeyer
Carefully tested electric/electronic components are a
requirement for effective hardware-in-the-loop tests and
vehicle tests in the automotive industry. A new method for
the definition and execution of component tests is described.
The most important advantage of this method is
independence from the test stand. It therefore offers the
opportunity to build up knowledge over a long period of
time and the ability to share this knowledge with different
partners.
-
Process Oriented Software Quality Assurance - An Experience Report in Process Improvement -
OEM Perspective [p. 536]
-
T. Illgen and S. Ortmann
In the early 1900s, Henry Ford had the vision of an
economically built car priced for everyone. Henry
Ford realised his vision through a new organisation of
well defined management and engineering
processes. A historical citation is
"It means a lot to me to prove clearly that our
ideas are all about accomplishable, that they
are not automotive specific but rather they are a
part of a global code."
One day he explained that in the future he would
build only one kind of car and that every car would
have the same chassis. He explained that each customer
could have his car in any colour he wanted, as long
as it was black. His consistent path to success through
a very limited, standardised offer of one car also
became his problem later, when GM started a model
offensive with several configuration possibilities.
What can we learn from Henry Ford's vision
and its implementation? He realised that
individual construction steps and a large number of
individually built parts are the main obstacle to
reaching high quality and low costs. In modern
cars we find the same problems today with
complex software systems built by individual
"artists". A well defined software construction
process and standards for key components seem
to be a successful way forward for software
engineering. One of the main topics in building
complex and safety critical software systems is
establishing constructive quality as key
knowledge for automotive engineers.
-
Embedded Automotive System Development Process Steer-by-Wire System [p. 538]
-
J. Langenwalter
Model based design enables the automatic
generation of final-build software from models
for high-volume automotive embedded
systems.
This paper presents a framework of
processes, methods and tools for the design
of automotive embedded systems. A steer-by-wire
system serves as an example.
Moderators: P. Ellervee, TU Tallinn, ES; S. Singh, Microsoft, US
-
Functional Validation of System Level Static Scheduling [p. 542]
-
S. Abdi and D. Gajski
The increase in system level modeling has given rise to a
need for efficient functional validation of models above cycle
accurate level. This paper presents a technique for
comparing system level models, before and after the static
scheduling of tasks on processing elements of the architecture.
We derive a graph representation from models written
in system level design languages (SLDLs) and define their
execution semantics. The notion of functional equivalence of
system level models is established using these graphs. We
then present well defined rules for reduction of such graphs
to a normal form. Finally, we show how to check for functional
equivalence of two system level models by isomorphism
of their normal graph representations. A checker
built on the above concept is used to automatically validate
the functional correctness of the static scheduling step. As
a result, the models generated for various scheduling decisions
do not have to be reverified using costly simulations.
-
Defining an Enhanced RTL Semantics [p. 548]
-
S. Zhao and D. Gajski
In this paper we formally define an enhanced RTL semantics.
This is intended to elevate the RTL design abstraction
level and help bridge the HDL semantic gap among
synthesis, simulation and formal verification tools. We
define the enhanced semantics based on a new RTL++
language that supports pipelined operations using a new
pipelined register variable concept. The execution semantics
of RTL++ is specified in a structural operational semantics
style aimed to form the basis for related simulation
and formal verification algorithm development. An RFSM
model is defined to support natively the synthesis semantics
of RTL++. We also present an example of extending SystemC
to support the notion of pipelined register variable.
-
RTK-Spec TRON: A Simulation Model of an ITRON Based RTOS Kernel in SystemC [p. 554]
-
H. Hassan, K. Sakanushi, Y. Takeuchi, and M. Imai
This paper presents the methodology and the modeling
constructs we have developed to capture the real time
aspects of RTOS simulation models in a System Level
Design Language (SLDL) like SystemC. We describe
these constructs and show how they are used to build a
simulation model of an RTOS kernel targeting the μ-ITRON
OS specification standard.
-
Design for Verification of SystemC Transaction Level Models [p. 560]
-
A. Habibi and S. Tahar
Transaction level modeling allows exploring several SoC design
architectures leading to better performance and easier verification
of the final product. In this paper, we present an approach
to design and verify SystemC models at the transaction level. We
integrate the verification as part of the design-flow. In the proposed
approach, we first model both the design and the properties
(written in PSL) in UML. Then, we translate them into an intermediate
format modeled with Abstract State Machines (ASM).
The ASM model is used to generate an FSM of the design including
the properties. Checking the correctness of the properties is
performed on-the-fly while generating the state machine. Finally,
we translate the verified design to SystemC and map the properties
to a set of assertions (as monitors in C#) that can be re-used
to validate the design at lower levels through simulation. We illustrate
our approach on two case studies including the PCI bus
standard and a generic Master/Slave architecture from the SystemC
library.
-
Systematic Transaction Level Modeling of Embedded Systems with SystemC [p. 566]
-
W. Klingauf
This paper gives an overview of a transaction level modeling
(TLM) design flow for straightforward embedded system
design with SystemC. The goal is to systematically develop
both application-specific HW and SW components of
an embedded system using the TLM approach, thus allowing
for fast communication architecture exploration, rapid
prototyping and early embedded SW development. To this
end, we specify the lightweight transaction-based communication
protocol SHIP and present a methodology for automatic
mapping of the communication part of a system to
a given architecture, including HW/SW interfaces.
-
Modeling and Verification of Globally Asynchronous and Locally Synchronous Ring Architectures [p. 568]
-
S. Dasgupta and A. Yakovlev
The goal of this paper is to demonstrate a prevalent
global deadlock situation resulting from a local deadlock
in a GALS ring architecture. We present a novel design for
building systems which will be tolerant to such deadlocks
arising in the local modules. This paper concentrates on
the modeling of the proposed design methodology, whose
correctness is proved with the help of a public domain verification tool.
Organiser: Y. Zorian, Virage Logic, US
Moderator: J. Barr, Buckingham Capital, US
Panellists: D. Wassung, AH&H, US; J. Ensel, Virage Logic, US; G. Stark, Synopsys, US;
M. Gianfagna, eSilicon, US; K. Ruparel, Cisco Systems, US; A. de la Haye, Philips Semiconductors, NL
-
Semiconductor Industry Disaggregation vs Reaggregation: Who will be the Shark? [p. 572]
-
Y. Zorian, J. Barr, D. Wassung, J. Ensel, G. Stark, M. Gianfagna, K. Ruparel, A. de la Haye
Several years ago, the vertically integrated semiconductor companies started to disaggregate into
separate sectors, such as fabless, EDA, IP, Design Services, DFT, foundry, and test & packaging
houses. On the one hand, the disaggregated sectors are in the process of merging and optimizing their
product lines, roles and responsibilities. On the other hand, certain companies are trying to reverse the
trend by reaggregating some of the disaggregated sectors. Will the reaggregation trend dominate?
Which trend enables better semiconductor market growth? Which trend allows superior
technological offerings? Who will be the shark?
Moderators: D. Gizopoulos, Piraeus U, GR; M. Sonza Reorda, Politecnico di Torino, IT
-
An Efficient Transparent Test Scheme for Embedded Word-Oriented Memories [p. 574]
-
J.-F. Li, T.-W. Tseng, and C.-L. Wey
Memory cores are usually the densest portion with the
smallest feature size in system-on-chip (SOC) designs. The
reliability of memory cores thus has a heavy impact on the
reliability of SOCs. Transparent testing is one useful technique
for improving the reliability of memories during their
lifetime. This paper presents a systematic algorithm
for transforming a bit-oriented march test into a transparent
word-oriented march test. The transformed transparent
march test has lower test complexity than
those proposed in previous works [12, 13]. For example,
if a memory with 32-bit words is tested with March
C, the time complexity of the transparent word-oriented test
transformed by the proposed scheme is only about 56% or
19% of the time complexity of the transparent word-oriented test
converted by the scheme reported in [12] or [13], respectively.
-
On the Analysis of Reed Solomon Coding for Resilience to Transient/Permanent Faults in
Highly Reliable Memories [p. 580]
-
L. Schiano, M. Ottavi, F. Lombardi, S. Pontarelli, and A. Salsano
Single Event Upsets (SEU) as well as permanent faults
can significantly affect the correct on-line operation of digital
systems, such as memories and microprocessors; a memory
can be made resilient to permanent and transient faults
by using modular redundancy and coding. In this paper,
different memory systems are compared: these systems utilize
simplex and duplex arrangements with a combination
of Reed Solomon coding and scrubbing. The memory systems
and their operations are analyzed by novel Markov
chains to characterize performance for dynamic reconfiguration
as well as error detection and correction under the
occurrence of permanent and transient faults. For a specific
Reed Solomon code, the duplex arrangement makes it possible to cope
efficiently with the occurrence of permanent faults, while
the use of scrubbing makes it possible to cope with transient faults.
Index Terms: High Reliability Systems, Reliability Evaluation,
Reed-Solomon Codes, Scrubbing, Dynamic Redundancy.
-
Increasing Register File Immunity to Transient Errors [p. 586]
-
G. Memik, M. Kandemir, and O. Ozturk
Transient errors are one of the major reasons for system downtime
in many systems. While prior research has mainly focused on the
impact of transient errors on datapath, caches and main memories,
the register file has largely been neglected. Since the register file is
accessed very frequently, the probability of transient errors is high.
In addition, errors in it can quickly spread to different parts of the
system, and cause application crash or silent data corruption.
This paper addresses the reliability of register files in superscalar
processors. Particularly, we propose to duplicate actively used
physical registers in unused physical registers. The rationale behind
this idea is that if the protection mechanism (parity or ECC) used
for the primary copy indicates an error, the duplicate can provide
the data as long as it is not corrupted. We implement two types of
strategies based on this register duplication idea. In the
"conservative strategy," we limit ourselves to the given register
usage behavior, and duplicate register contents only in otherwise
unused registers. Consequently, there is no impact on the original
performance when there is no error, except for the protection
mechanism used for the primary copy. Our experiments with two
different versions of this strategy show that, with the more powerful
conservative scheme, 78% of the accesses are to the physical
registers with duplicates. The "aggressive strategy" sacrifices some
performance to increase the number of register accesses with
duplicates. It does so by marking the registers not used for a long
time as "dead" and using them for duplicating actively used
registers. The experiments with this strategy indicate that it takes
the fraction of the reliable register accesses to 84%, and degrades
the overall performance by only 0.21% on the average.
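
The fallback behaviour described above can be pictured with the small sketch below. It is an illustrative model only: it assumes a single even-parity bit per copy, and the class and function names are hypothetical rather than taken from the paper.

```python
def parity(value: int) -> int:
    """Even-parity bit of a non-negative integer."""
    return bin(value).count("1") & 1


class ProtectedRegister:
    """A register value with a parity bit and an optional duplicate copy."""

    def __init__(self, value: int, duplicated: bool):
        self.primary = value
        self.primary_parity = parity(value)
        # In this toy model the duplicate lives in an otherwise unused register.
        self.duplicate = value if duplicated else None
        self.duplicate_parity = parity(value) if duplicated else None

    def read(self) -> int:
        """Return the primary copy, or the duplicate on a parity error."""
        if parity(self.primary) == self.primary_parity:
            return self.primary
        if (self.duplicate is not None
                and parity(self.duplicate) == self.duplicate_parity):
            return self.duplicate
        raise RuntimeError("unrecoverable register error")
```

A single bit flip in the primary copy is detected by the parity check and masked by the duplicate; without a duplicate, the same flip is detected but cannot be corrected.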
-
An Efficient BICS Design for SEUs Detection and Correction in Semiconductor Memories [p. 592]
-
B. Gill, M. Nicolaidis, F. Wolff, C. Papachristou, and S. Garverick
In this paper we propose a new Built-In Current Sensor
(BICS) to detect single event upsets in SRAM. The BICS
is designed and validated for a 100nm process technology.
A reliability analysis of the BICS with respect to process, voltage,
temperature, and power supply noise is provided. The BICS detects
various shapes of current pulses generated by particle
strikes. The BICS power consumption and area overhead are
also provided. The BICS is found to be very reliable under process,
voltage and temperature variation and under stringent
noise conditions.
Moderators: P. Puschner, TU Vienna, AT; G. Fohler, Malardalen U, SE
-
Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software [p. 600]
-
L. Wehmeyer and P. Marwedel
Safety-critical embedded systems having to meet real-time constraints
are expected to be highly predictable in order to guarantee
at design time that certain timing deadlines will always
be met. This requirement usually prevents designers from utilizing
caches due to their highly dynamic, thus hardly predictable
behavior. The integration of scratchpad memories represents
an alternative approach which allows the system to benefit from
a performance gain comparable to that of caches while at the
same time maintaining predictability. In this work, we compare
the impact of scratchpad memories and caches on worst
case execution time (WCET) analysis results. We show that
caches, despite requiring complex techniques, can have a negative
impact on the predicted WCET, while the estimated WCET
for scratchpad memories scales with the achieved performance
gain at no extra analysis cost.
-
Automatic Timing Model Generation by CFG Partitioning and Model Checking [p. 606]
-
I. Wenzel, B. Rieder, R. Kirner, and P. Puschner
In this paper we present a new measurement-based
worst-case execution time (WCET) analysis method.
Exhaustive end-to-end measurements are computationally
intractable in most cases. Therefore, we propose to
measure execution times of subparts of the application.
We use heuristic methods and model checking to generate
test data, forcing the execution of selected paths to
perform runtime measurements. The measured times
are used to calculate the WCET in a final computation
step. As we operate on source code level our approach
is platform independent except for the run time measurements
performed on the target host. We show the feasibility
of the required steps and explain our approach by means of a case study.
-
A Contribution to Branch Prediction Modeling in WCET Analysis [p. 612]
-
C. Burguière and C. Rochange
The wider and wider use of high-performance processors
as part of real-time systems makes it more and more
difficult to guarantee that programs will respect their strict
deadlines. While the computation of Worst-Case Execution
Times relies on static analysis of the code, the challenge is
to model with enough safety and accuracy the behaviour
of intrinsically dynamic components. In this paper, we focus
on the dynamic branch predictor. Several models to
bound the number of branch mispredictions have been previously
published. Some of them exhibit a high complexity
while other ones have shown that taking into account semantic
information from the source code makes things more
tractable. We extend this work to more general nested loop
structures. We also give some simulation results that show
that the way branch mispredictions are usually taken into
account cannot be both safe and accurate in the case of
high-performance pipelines. We propose a more realistic
approach to be used as part of WCET computation.
-
Verifying Safety-Critical Timing and Memory-Usage Properties of Embedded Software by
Abstract Interpretation [p. 618]
-
C. Ferdinand and R. Heckman
Static program analysis by abstract interpretation is an
efficient method to determine properties of embedded software.
One example is value analysis, which determines the
values stored in the processor registers. Its results are used
as input to more advanced analyses, which ultimately yield
information about the stack usage and the timing behavior
of embedded software.
Moderators: C. Svensson, Linkoping U, SE; A.J. Acosta Jimenez, Seville U, ES
-
An Iterative Algorithm for Battery-Aware Task Scheduling on Portable Computing Platforms [p. 622]
-
J. Khan and R. Vemuri
In this work we consider battery powered portable systems
which either have Field Programmable Gate Arrays (FPGA)
or voltage and frequency scalable processors as their main
processing element. An application is modeled in the form of
a precedence task graph at a coarse level of granularity. We
assume that for each task in the task graph several unique
design-points are available which correspond to different
hardware implementations for FPGAs and different voltage-frequency
combinations for processors. It is assumed that
performance and total power consumption estimates for each
design-point are available for any given portable platform,
including the peripheral components such as memory and
display power usage. We present an iterative heuristic
algorithm which finds a sequence of tasks along with an
appropriate design-point for each task, such that a deadline
is met and the amount of battery energy used is as small as
possible. A detailed illustrative example along with a case
study of a real-world application of a robotic arm controller
which demonstrates the usefulness of our algorithm is also
presented.
-
Design Method for Constant Power Consumption of Differential Logic Circuits [p. 628]
-
K. Tiri and I. Verbauwhede
Side channel attacks are a major security concern for
smart cards and other embedded devices. They analyze
the variations in the power consumption to find the secret
key of the encryption algorithm implemented within the
security IC. To address this issue, logic gates that have a
constant power dissipation independent of the input
signals are used in security ICs. This paper presents a
design methodology to create fully connected differential
pull down networks. Fully connected differential pull
down networks are transistor networks that for any
complementary input combination connect all the internal
nodes of the network to one of the external nodes of the
network. They are memoryless and for that reason have a
constant load capacitance and power consumption. This
type of network is used in specialized logic gates to
guarantee a constant contribution of the internal nodes
into the total power consumption of the logic gate.
-
Exploiting Dynamic Workload Variation in Low Energy Preemptive Task Scheduling [p. 634]
-
L.-F. Leung, C.-Y. Tsui, and X. Hu
A novel energy reduction strategy to maximally exploit the
dynamic workload variation is proposed for the offline voltage
scheduling of preemptive systems. The idea is to construct a
fully-preemptive schedule that leads to minimum energy
consumption when the tasks take on approximately the average
execution cycles yet still guarantees no deadline violation
during the worst-case scenario. End-time for each sub-instance
of the tasks obtained from the schedule is used for the on-line
dynamic voltage scaling (DVS) of the tasks. For the tasks that
normally require a small number of cycles but occasionally a
large number of cycles to complete, such a schedule provides
more opportunities for slack utilization and hence results in
larger energy saving. The concept is realized by formulating the
problem as a Non-Linear Programming (NLP) optimization
problem. Experimental results show that, by using the proposed
scheme, the total energy consumption at runtime is reduced by
as much as 60% for randomly generated task sets when
compared with the static scheduling approach that uses only the
worst-case workload.
-
Low Power Oriented CMOS Circuit Optimization Protocol [p. 640]
-
A. Verle, X. Michel, N. Azemard, P. Maurine, and D. Auvergne
Low power oriented circuit optimization consists in
selecting the best alternative between gate sizing, buffer
insertion and logic structure transformation, for
satisfying a delay constraint at minimum area cost.
In this paper we use a closed form model of delay in
CMOS structures to define metrics for a deterministic
selection of the optimization alternative. The target is
delay constraint satisfaction with minimum area cost. We
validate the design space exploration method, defining
maximum and minimum delay bounds on logical paths.
Then we adapt this method to a "constant sensitivity
method" that sizes a circuit at minimum area under
a delay constraint. An optimisation protocol is finally
defined to manage the trade-off between performance constraint
and circuit structure. These methods are implemented in an
optimization tool (POPS) and validated by comparing, on
a 0.25μm process, the optimization efficiency obtained on
various benchmarks (ISCAS'85) to that resulting from an
industrial tool.
-
Area-Efficient Selective Multi-Threshold CMOS Design Methodology for
Standby Leakage Power Reduction [p. 646]
-
T. Kitahara, N. Kawabe, F. Minami, K. Seta, and T. Furusawa
This paper presents a design flow for an improved selective
multi-threshold (Selective-MT) circuit. The Selective-MT
circuit is improved so that multiple MT-cells can share one
switch transistor. We propose a design methodology from
RTL (Register Transfer Level) to final layout that optimizes
the switch transistor structure.
-
Hotspot Prevention through Runtime Reconfiguration in Network-On-Chip [p. 648]
-
G. Link and N. Vijaykrishnan
Many existing thermal management techniques focus on
reducing the overall power consumption of the chip, and do
not address location-specific temperature problems referred
to as hotspots. We propose the use of dynamic runtime reconfiguration
to shift the hotspot-inducing computation periodically
and make the thermal profile more uniform. Our
analysis shows that dynamic reconfiguration is an effective
technique in reducing hotspots for NoCs.
-
Power-Performance Trade-offs in Nanometer-Scale Multi-Level Caches Considering Total Leakage [p. 650]
-
R. Bai, N.-S. Kim, T. Kgil, T. Mudge, and D. Sylvester
In this paper, we investigate the impact of Tox and Vth on power-performance trade-offs for
on-chip caches. We start by examining the optimization of the various components of a single
level cache and then extend this to two level cache systems. In addition to leakage, our studies
also account for the dynamic power expended as a result of cache misses. Our results show that
one can often reduce overall power by increasing the size of the L2 cache if we only allow one
pair of Vth/Tox in L2. However, if we allow the memory cells and the peripherals to have their own
Vth's and Tox's, we show that a two-level cache system with smaller L2's will yield less total leakage.
We further show that two Vth's and two Tox's are sufficient to get close to an optimal solution, and that
Vth is generally a better design knob than Tox for leakage optimization, thus it is better to restrict
the number of Tox's rather than Vth's if cost is a concern.
Organiser/Moderator: J. Bortolazzi, DaimlerChrysler, DE
Speakers: J.-L. Maté, SiemensVDO, FR; J. Becker, Karlsruhe U, DE; C. Morgano, Microsoft Europe
-
Panel Session - Automotive System Architectures [p. 654]
-
J. Bortolazzi, J.-L. Maté, J. Becker, and C. Morgano
This session addresses different approaches to automotive design architectures: state-of-the-art and trends in
automotive system architectures from a tier one supplier's perspective, a new approach to reconfigurable
architectures as well as a new trend in automotive system design, i.e. platforms to integrate several in-car services
and telecommunication services in one configurable approach. The speakers are a mix of industrial and academic
experts with experience in automotive system design.
-
Automotive System Design - Challenges and Potential [p. 656]
-
H. Heinecke
Increasing functional and non-functional requirements
in automotive electric/electronic vehicle development will
significantly enhance the integration of novel functions in
the embedded networks. Major driving forces are the
demand for driver assistance functions, active and passive
safety systems and the fulfillment of environmental and
legal requirements.
The contribution will demonstrate that this task in
system design can only be managed if the non-competitive
elements are developed jointly within the automotive industry,
leading to infrastructure standards such as
AUTOSAR, FlexRay and LIN.
Working on such a basis, the OEMs can have a
dedicated system design environment for the competitive
implementations of functions, starting as early as the
feasibility-study phases. This basis is consequently a
fixed point throughout series development and even in the
maintenance phase, and it enables shared functional
development and exploitation as well as in-project
adaptations of hardware developments driven by
non-automotive industries.
Volume II
Moderators: V. Bertacco, Michigan U, US; R. Bloem, TU Graz, AT
-
Effective Lower Bounding Techniques for Pseudo-Boolean Optimization [p. 660]
-
V. Manquinho and J. Marques-Silva
Linear Pseudo-Boolean Optimization (PBO) is a widely
used modeling framework in Electronic Design Automation
(EDA). Due to significant advances in Boolean Satisfiability
(SAT), new algorithms for PBO have emerged,
which are effective on highly constrained instances. However,
these algorithms fail to handle effectively the information
provided by the cost function of PBO. This paper addresses
the integration of lower bound estimation methods
with SAT-related techniques in PBO solvers. Moreover, the
paper shows that the utilization of lower bound estimates
can dramatically improve the overall performance of PBO
solvers for most existing benchmarks from EDA.
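The idea of pruning a PBO search with a lower bound on the cost function can be illustrated with a minimal sketch. This is not the authors' algorithm: the bound below is a deliberately simple one chosen for illustration, and the names `lower_bound` and `solve` are hypothetical. It assumes non-negative costs and constraints of the form sum(a[j] * x[j]) >= b over binary variables.

```python
from typing import List, Optional, Tuple

# A constraint is (coefficients a, bound b), meaning sum(a[j] * x[j]) >= b.
Constraint = Tuple[List[int], int]


def lower_bound(costs: List[int], constraints: List[Constraint],
                assign: List[Optional[int]]) -> Optional[int]:
    """Cheap lower bound on the cost of any completion of a partial assignment.

    Returns None when some constraint can no longer be satisfied.
    Assumes all costs are non-negative.
    """
    paid = sum(c for c, v in zip(costs, assign) if v == 1)
    extra = 0
    for coeffs, bound in constraints:
        lhs = sum(a for a, v in zip(coeffs, assign) if v == 1)
        if lhs >= bound:
            continue
        # Not satisfied yet: at least one free variable with a > 0 must be set.
        helpers = [c for c, a, v in zip(costs, coeffs, assign)
                   if v is None and a > 0]
        if not helpers:
            return None
        # Any single constraint gives a valid bound, so take the strongest one.
        extra = max(extra, min(helpers))
    return paid + extra


def solve(costs: List[int], constraints: List[Constraint]):
    """Branch and bound; returns (best cost, best assignment, nodes visited)."""
    n = len(costs)
    best_cost: Optional[int] = None
    best_assign: Optional[List[int]] = None
    nodes = 0

    def search(assign: List[Optional[int]], depth: int) -> None:
        nonlocal best_cost, best_assign, nodes
        nodes += 1
        lb = lower_bound(costs, constraints, assign)
        if lb is None:
            return  # infeasible branch
        if best_cost is not None and lb >= best_cost:
            return  # bound conflict: cannot improve on the incumbent
        if depth == n:
            best_cost, best_assign = lb, list(assign)
            return
        for value in (0, 1):
            assign[depth] = value
            search(assign, depth + 1)
        assign[depth] = None

    search([None] * n, 0)
    return best_cost, best_assign, nodes


# Minimize 3*x0 + 2*x1 + 4*x2 subject to x0 + x1 >= 1 and x1 + x2 >= 1.
cost, assignment, visited = solve([3, 2, 4], [([1, 1, 0], 1), ([0, 1, 1], 1)])
print(cost, assignment, visited)
```

In this toy instance the bound prunes every branch with x0 = 1 as soon as the incumbent of cost 2 is found; stronger estimates would prune more, at a higher cost per node.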
-
Efficient Conflict-Based Learning in an RTL Circuit Constraint Solver [p. 666]
-
G. Parthasarathy, M. Iyer, and K.-T. Cheng
We present new techniques for improving search in a hybrid
Davis-Putnam-Logemann-Loveland based constraint
solver for RTL circuits (HDPLL). In earlier work on HDPLL
[7], the authors combined solvers for integer and
Boolean domains using finite-domain constraint propagation
with heuristic conflict-based learning. In this work, we
describe a new algorithm that extends the conflict-based
unique-implication point learning in Boolean SAT solvers
to hybrid Boolean-Integer domains in HDPLL. We describe
data-structures for efficient constraint propagation on
the hybrid learned relations, similar to two-literal watching
in Boolean SAT. We demonstrate that these new techniques
provide considerable performance benefits when
compared with other combinations of decision theories.
-
A Faster Counterexample Minimization Algorithm Based on Refutation Analysis [p. 672]
-
S. Shen, Y. Qin, and S.-K. Li
Eliminating irrelevant variables from a counterexample,
to make it easier to understand, is an active research topic.
The BFL algorithm is the most effective counterexample
minimization algorithm compared to all other approaches.
But its time overhead is very large due to one
call to the SAT solver for each candidate variable to be eliminated.
The key to reducing time overhead is to eliminate multiple
variables simultaneously. Therefore, we propose a faster
counterexample minimization algorithm based on refutation
analysis in this paper. We perform refutation analysis on
those UNSAT instances of BFL, to extract the set of variables
that lead to UNSAT. All variables not belonging to this set
can be eliminated simultaneously as irrelevant variables.
Thus we can eliminate multiple variables with only one call
to the SAT solver. Theoretical analysis and experimental results
show that our algorithm can be 2 to 3 orders of magnitude
faster than the existing BFL algorithm, with only a minor
loss in counterexample minimization ability.
-
Functional Coverage Driven Test Generation for Validation of Pipelined Processors [p. 678]
-
P. Mishra and N. Dutt
Functional verification of microprocessors is one of the most
complex and expensive tasks in the current system-on-chip design
process. A significant bottleneck in the validation of such
systems is the lack of a suitable functional coverage metric. This
paper presents a functional coverage based test generation technique
for pipelined architectures. The proposed methodology
makes three important contributions. First, a general graph-theoretic
model is developed that can capture the structure and
behavior (instruction-set) of a wide variety of pipelined processors.
Second, we propose a functional fault model that is used
to define the functional coverage for pipelined architectures. Finally,
test generation procedures are presented that accept the
graph model of the architecture as input and generate test programs
to detect all the faults in the functional fault model. Our
experimental results on two pipelined processor models demonstrate
that the number of test programs generated by our approach
to obtain a fault coverage is an order of magnitude
less than those generated by traditional random or constrained-random
test generation techniques.
-
Pueblo: A Modern Pseudo-Boolean SAT Solver [p. 684]
-
H. Sheini and K. Sakallah
This paper introduces a new SAT solver that integrates logic-based
reasoning and integer programming methods to systems
of CNF and PB constraints. Its novel features include an efficient
PB literal watching strategy and several PB learning
methods that take advantage of the pruning power of PB constraints
while minimizing their overhead.
-
Space-Efficient Bounded Model Checking [p. 686]
-
J. Katz, Z. Hanna, and N. Dershowitz
Current algorithms for bounded model checking use
SAT methods for checking satisfiability of Boolean
formulae. These methods suffer from the potential memory
explosion problem. Methods based on the validity of
Quantified Boolean Formulae (QBF) allow an
exponentially more succinct representation of formulae to
be checked, because no "unrolling" of the transition
relation is required. These methods have not been widely
used, because of the lack of an efficient decision procedure
for QBF. We evaluate the usage of QBF in bounded model
checking (BMC), using general-purpose SAT and QBF
solvers. We develop a special-purpose decision procedure
for QBF used in BMC, and compare our technique with the
methods using general-purpose SAT and QBF solvers on
real-life industrial benchmarks.
-
Circuit Based Quantification: Back to State Set Manipulation within Unbounded Model Checking [p. 688]
-
G. Cabodi, M. Crivellari, S. Nocco, and S. Quer
In this paper a non-canonical circuit-based state set representation
is used to efficiently perform quantifier elimination.
The novelty of this approach lies in adapting equivalence
checking and logic synthesis techniques, to the goal of
compacting circuit based state set representations resulting
from existential quantification. The method can be efficiently
combined with other verification approaches such as inductive
and SAT-based pre-image verifications.
Moderators: E. Villar, Cantabria U, ES; A. Jantsch, Royal Institute of Technology, SE
-
A Model-Based Approach for Executable Specifications on Reconfigurable Hardware [p. 692]
-
T. Schattkowsky, W. Mueller, and A. Rettberg
UML 2.0 provides a rich set of diagrams for systems
documentation and specification. Many efforts have been
undertaken to employ different aspects of UML for
multiple domains, mainly in the area of software systems.
Considering the area of electronic design automation,
however, we currently see only very few approaches,
which investigate UML for hardware design and
hardware/software co-design. In this article, we present
an approach for executable UML closing the gap from
system specification to its model-based execution on
reconfigurable hardware. For this purpose, we present
our Abstract Execution Platform (AEP), which is based
on a Virtual Machine running an executable UML subset
for embedded software and reconfigurable hardware.
This subset combines UML 2.0 Class, StateMachine and
Sequence Diagrams for complete system specification. We
describe how these binary encoded UML specifications
can be directly executed and give the implementation of
such a virtual machine on a Virtex II FPGA. Finally, we
present evaluation results comparing the AEP
implementation with C code on a C167 microcontroller.
-
The Role of Model-Level Transactors and UML in Functional Prototyping of Systems-on-Chip:
A Software-Radio Application [p. 698]
-
A. Chureau, Y. Savaria, and E. Aboulhamid
Developing a functional prototype of a system-on-chip
provides a unifying vehicle for model validation and
system refinement. Keeping the prototype executable
across several abstraction levels, clock domains and
design tools is a key requirement to effective prototyping.
This paper presents how model-level transactors address
design heterogeneity by unifying event-based and cycle-based
worlds from specification to implementation.
Transactors are used to build a functional prototype of a
software-radio component. An executable UML model is
bridged to a hardware abstraction of a radio stream
developed with Simulink to implement a realistic and
working prototype. Model validation and performance
measurements are realized through prototype execution
and real-time monitoring.
-
A SoC Design Methodology Involving a UML 2.0 Profile for SystemC [p. 704]
-
E. Riccobene, P. Scandurra, A. Rosti, and S. Bocchio
In this paper, we present a SoC design methodology joining
the capabilities of UML and SystemC to operate at system-level.
We present a UML 2.0 profile of the SystemC
language exploiting the MDA capabilities of defining
modeling languages, platform independent and reducible
to platform dependent languages. The UML profile
captures both the structural and the behavioral features of
the SystemC language, and allows high level modeling of
system-on-a-chip with straightforward translation to
SystemC code.
-
UML 2.0 Profile for Embedded System Design [p. 710]
-
P. Kukkala, J. Riihimäki, M. Hännikäinen, T. Hämäläinen, and K. Kronlöf
Unified Modeling Language (UML) 2.0 is emerging in
the area of embedded system design. This paper presents
a new UML 2.0 profile - called TUT-Profile - that
introduces a set of stereotypes and design rules for an
application, platform, and mapping. The profile classifies
different application and platform components, and
enables their parameterization. TUT-Profile concentrates
on the structure of an application and platform, and
utilizes standard UML 2.0 for the behavioral modeling.
The application is seen as a set of active classes with an
internal behavior. Correspondingly, the platform is seen
as a component library with a parameterized presentation
in UML 2.0 for each library component.
-
UML 2 and SysML: An Approach to Deal with Complexity in SoC/NoC Design [p. 716]
-
Y. Vanderperren and W. Dehaene
UML is gaining increased attention as a system design
language, as indicated by current standardization
activities such as the SysML initiative and the UML for
SoC Forum. Moreover the adoption of UML 2 is a
significant step towards a broader range of modeling
capabilities. This paper provides an overview of the
impact of these recent advances on the application of
UML for SoC and NoC development, proposes a model-driven
development method taking benefit of the best
techniques recently introduced, and investigates the design
of power efficient systems with UML.
-
Design Refinement for Efficient Clustering of Objects in Embedded Systems [p. 718]
-
W. Ahmed and D. Myers
Hardware software co-design seeks to meet
performance objectives via a combination of hardware
and software modules. One difficulty in reaching these
objectives lies in lack of cohesion and increased
coupling amongst the implemented modules that results
in an increased inter module communication cost. While
most of the traditional partitioning approaches are
initiated in the post-coding phase, we suggest the design
stage may be a better focus of attention in addressing
this problem.
In this paper, we propose a novel approach that uses
information from sequence diagrams in UML designs to
help ease the partitioning problem.
Organiser: E. Marinissen, Philips Research Laboratories, NL
Moderator: J. Hendrickx, Philips Semiconductors, NL
Speakers: B. Prince, Memory Strategies International, US; D. Keitel-Schulz, Infineon Technologies, DE;
Y. Zorian, Virage Logic, US
-
Challenges in Embedded Memory Design and Test [p. 722]
-
E. Marinissen, B. Prince, D. Keitel-Schulz, and Y. Zorian
Both the number of embedded memories and the total embedded memory content in our chips are growing steadily.
It is time for chip designers, EDA makers, and test engineers to update their knowledge on memories. This Hot Topic paper
provides an embedded tutorial on embedded memories, in terms of what is new and coming versus what is old and
vanishing, and what are the associated design, test, and repair challenges related to using embedded memories.
Moderators: J. Teich, Erlangen-Nuremberg U, DE; R. Leupers, RWTH Aachen, DE
-
Evaluation of Bus Based Interconnect Mechanisms in Clustered VLIW Architectures [p. 730]
-
A. Gangwar, M. Balakrishnan, P. Panda, and A. Kumar
With new sophisticated compiler technology, it is possible
to schedule distant instructions efficiently. As a
consequence, the amount of exploitable instruction level
parallelism (ILP) in applications has gone up considerably.
However, monolithic register file VLIW architectures
present scalability problems due to a centralized register
file which is far slower than the functional units (FU). Clustered
VLIW architectures, with a subset of FUs connected
to any RF are the solution to this scalability problem.
Recent studies with a wide variety of inter-cluster interconnection
mechanisms have presented substantial gains in
performance (number of cycles) over the most studied RF-to-RF
type interconnections. However, these studies have
compared only one or two design points in the RF-to-RF
interconnects design space. In this paper, we extend the
previously reported work. We consider both multi-cycle and
pipelined buses. To obtain realistic bus latencies, we synthesized
the various architectures and found out post layout
clock periods. The results demonstrate that while there is
very little variation in interconnect area, all the bus based
architectures are heavily performance constrained. Also,
neither multi-cycle nor pipelined buses, nor increasing the
number of buses itself, is able to achieve performance comparable
to point-to-point type interconnects.
-
Flexible Hardware/Software Support for Message Passing on a Distributed Shared Memory Architecture [p. 736]
-
F. Poletti, A. Poggiali, and P. Marchal
With the advent of multi-processor systems on a chip,
interest in message passing libraries has revived. Message
passing helps in mastering the design complexity of parallel
systems. However, to satisfy the stringent energy-budget
of embedded applications, the message passing overhead
should be limited. Recently, several hardware extensions
have been proposed for reducing the transfer cost on a
distributed memory architecture. Unfortunately, they ignore
the synchronization cost between sender/receiver and/or require
many dedicated hardware blocks. To overcome the
above limitations, we present in this paper light-weight
support for message passing. Moreover, we have made
our library as flexible as possible such that we can optimally
match the application with the target architecture. We
demonstrate the benefits of our approach by means of representative
benchmarks from the multimedia domain.
-
Lightweight Multitasking Support for Embedded Systems Using the Phantom Serializing Compiler [p. 742]
-
A. Nácul and T. Givargis
Embedded software continues to play an ever increasing role in
the design of complex embedded applications. In part, the elevated
level of abstraction provided by a high-level programming
paradigm immensely facilitates a short design cycle, fewer errors,
portability, and reuse. Serializing compilers have been proposed as
an alternative to traditional OS techniques, enabling a designer to
develop multitasking applications without the need of OS support.
In this work, we outline the inner workings of the Phantom serializing
compiler and analyze the quality of the generated code with
respect to memory and processing overheads. Our results show
that such serializing compilers are extremely efficient, making them
ideal to be used in design of highly parallel applications (e.g., multimedia,
graphics, and signal processing applications).
-
Multithreaded Extension to Multicluster VLIW Processors for Embedded Applications [p. 748]
-
D. Barretta, W. Fornaciari, M. Sami, and D. Bagni
Instruction Level Parallelism (ILP) extraction for multi-cluster
VLIW processors is a very hard task. In this paper,
we propose a retargetable architecture that can exploit ILP
and thread level parallelism jointly, thus allowing an easier
parallelism extraction and improving the performance
with respect to traditional multicluster VLIW processors.
Moderators: M. Zwolinski, Southampton U, UK; F. Gaffiot, Ecole Centrale de Lyon, FR
-
An Efficiently Preconditioned GMRES Method for Fast Parasitic-Sensitive Deep-Submicron
VLSI Circuit Simulation [p. 752]
-
Z. Li and C.-J. Shi
We propose an efficiently preconditioned
generalized minimal residual (GMRES) method for fast
SPICE-accurate transient simulation of parasitic-sensitive
deep-submicron VLSI circuits. First, when time step-sizes
vary within a predefined range, the preconditioned
GMRES method is applied to solve circuit matrix equations
rather than LU factorization. The preconditioner we use
comes directly from the previously factorized L and U
matrices. Second, to keep using the same preconditioner
during nonlinear iteration, the successive variable chord
method is applied as an alternative to the Newton-Raphson
method. An improved piecewise weakly nonlinear
definition of MOSFETs is adopted and the low-rank update
technique is implemented to refresh the preconditioner
efficiently. With these techniques, the number of required
LU factorizations during transient simulation is reduced
dramatically. Experimental results on power/ground
networks have demonstrated that the proposed method
yields SPICE-like accuracy with about an 18X overall
CPU time speedup over SPICE3 for circuits with tens of
thousands of elements.
-
Nano-Sim: A Step Wise Equivalent Conductance Based Statistical Simulator for
Nanotechnology Circuit Design [p. 758]
-
B. Sukhwani, U. Padmanabhan, and J. Wang
New nanotechnology based devices are replacing CMOS
devices to overcome CMOS technology's scaling
limitations. However, many such devices exhibit nonmonotonic
I-V characteristics and uncertain properties
which lead to the negative differential resistance (NDR)
problem and the chaotic performance. This paper proposes
a new circuit simulation approach that can effectively
simulate nanotechnology devices with uncertain input
sources and the negative differential resistance (NDR)
problem. The experimental results show a 20-30 times
speedup compared with existing simulators.
-
Statistical Timing Analysis Using Levelized Covariance Propagation [p. 764]
-
K. Kang, B. Paul, and K. Roy
Variability in process parameters is making accurate timing analysis
of nano-scale integrated circuits an extremely challenging task.
In this paper, we propose a new algorithm for statistical timing analysis
using Levelized Covariance Propagation (LCP). The algorithm
simultaneously considers the impact of random placement of dopants
(which makes every transistor in a die independent in terms of threshold
voltage) and the spatial correlation of the process parameters
such as channel length, transistor width and oxide thickness due to
the intra-die variations. It also considers the signal correlation due
to reconvergent paths in the circuit. Results on several benchmark circuits
in 70nm technology show an average of 0.21% and 1.07% errors
in mean and the standard deviation, respectively, in timing analysis
using the proposed technique compared to the Monte-Carlo analysis.
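Statistical timing analysis repeatedly needs the distribution of the maximum of two correlated arrival times. The sketch below uses Clark's classical moment-matching formulas for the maximum of two jointly Gaussian variables. It is a standard building block shown here for illustration only, not the Levelized Covariance Propagation algorithm itself; the function name `clark_max` is an assumption of this example.

```python
import math


def clark_max(mu1: float, sigma1: float,
              mu2: float, sigma2: float, rho: float):
    """Mean and standard deviation of max(X, Y) for jointly Gaussian X, Y.

    X ~ N(mu1, sigma1^2), Y ~ N(mu2, sigma2^2), correlation coefficient rho.
    Uses Clark's first- and second-moment formulas.
    """
    a2 = sigma1 ** 2 + sigma2 ** 2 - 2.0 * rho * sigma1 * sigma2
    if a2 <= 1e-18:
        # X - Y is (numerically) constant, so the larger mean always wins.
        return (mu1, sigma1) if mu1 >= mu2 else (mu2, sigma2)
    a = math.sqrt(a2)
    alpha = (mu1 - mu2) / a
    pdf = math.exp(-0.5 * alpha * alpha) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))
    first = mu1 * cdf + mu2 * (1.0 - cdf) + a * pdf
    second = ((mu1 ** 2 + sigma1 ** 2) * cdf
              + (mu2 ** 2 + sigma2 ** 2) * (1.0 - cdf)
              + (mu1 + mu2) * a * pdf)
    return first, math.sqrt(max(second - first * first, 0.0))


# Two independent standard-normal arrival times.
mean, std = clark_max(0.0, 1.0, 0.0, 1.0, 0.0)
print(mean, std)
```

For two independent standard-normal inputs the exact mean of the maximum is 1/sqrt(pi), which the formulas reproduce. The result of the maximum is not itself Gaussian, so treating it as one when propagating further is an approximation.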
-
A Probabilistic Collocation Method Based Statistical Gate Delay Model Considering
Process Variations and Multiple Input Switching [p. 770]
-
S. Kumar, J. Li, C. Talarico, and J. Wang
Since the advent of new nanotechnologies, the variability of gate delay
due to process variations has become a major concern. This paper proposes
a new gate delay model that includes impact from both process variations
and multiple input switching. The proposed model uses an orthogonal polynomial
based probabilistic collocation method to construct an analytical delay equation
from circuit timing performance. From the experimental results, our approach
has less than 0.2% error on the mean delay of gates and less than 3% error
on the standard deviation.
-
Modeling and Propagation of Noisy Waveforms in Static Timing Analysis [p. 776]
-
S. Nazarian, M. Pedram, E. Tuncer, T. Lin, and A. Ajami
A technique based on the sensitivity of the output to input
waveform is presented for accurate propagation of delay
information through a gate for the purpose of static timing analysis
(STA) in the presence of noise. Conventional STA tools represent a
waveform by its arrival time and slope. However, this is not an
accurate way of modeling the waveform for the purpose of noise
analysis. The key contribution of our work is the development of a
method that allows efficient propagation of equivalent waveforms
throughout the circuit. Experimental results demonstrate higher
accuracy of the proposed sensitivity-based gate delay propagation
technique, SGDP, compared to the best of existing approaches.
SGDP is compatible with the current level of gate characterization
in conventional ASIC cell libraries, and as a result, it can be easily
incorporated into commercial STA tools to improve their accuracy.
Moderators: M. Lajolo, NEC Laboratories, US; E. Aboulhamid, Montreal U, CA
-
A Network Traffic Generator Model for Fast Network-on-Chip Simulation [p. 780]
-
S. Mahadevan, M. Storgaard, R. Olsen, J. Sparsø, J. Madsen, F. Angiolini, and L. Benini
For Systems-on-Chip (SoCs) development, a predominant
part of the design time is the simulation time. Performance
evaluation and design space exploration of such systems
in bit- and cycle-true fashion is becoming prohibitive.
We propose a traffic generation (TG) model that provides
a fast and effective Network-on-Chip (NoC) development
and debugging environment. By capturing the type and the
timestamp of communication events at the boundary of an
IP core in a reference environment, the TG can subsequently
emulate the core's communication behavior in different environments.
Access patterns and resource contention in a
system are dependent on the interconnect architecture, and
our TG is designed to capture the resulting reactiveness.
The regenerated traffic, which represents a realistic workload,
can thus be used to undertake faster architectural exploration
of interconnection alternatives, effectively decoupling
simulation of IP cores and of interconnect fabrics. The
results with the TG on an AMBA interconnect show a simulation
time speedup above a factor of 2 over a complete
system simulation, with close to 100% accuracy.
-
Generic Pipelined Processor Modeling and High Performance Cycle-Accurate Simulator Generation [p. 786]
-
M. Reshadi and N. Dutt
Detailed modeling of processors and high performance cycle-accurate
simulators are essential for today's hardware and software
design. These problems are challenging enough by themselves and
have seen many previous research efforts. Addressing both
simultaneously is even more challenging, with many existing
approaches focusing on one over another. In this paper, we propose
the Reduced Colored Petri Net (RCPN) model that has two
advantages: first, it offers a very simple and intuitive way of modeling
pipelined processors; second, it can generate high performance cycle-accurate
simulators. RCPN benefits from all the useful features of
Colored Petri Nets without suffering from their exponential growth in
complexity. RCPN processor models are very intuitive since they are a
mirror image of the processor pipeline block diagram. Furthermore, in
our experiments on the generated cycle-accurate simulators for XScale
and StrongArm processor models, we achieved an order of magnitude
(~15 times) speedup over the popular SimpleScalar ARM simulator.
-
Cycle Accurate Binary Translation for Simulation Acceleration in Rapid Prototyping of SoCs [p. 792]
-
J. Schnerr, O. Bringmann, and W. Rosenstiel
In this paper, the application of a cycle accurate binary translator
for rapid prototyping of SoCs will be presented. This translator generates
code to run on a rapid prototyping system consisting of a VLIW
processor and FPGAs. The generated code is annotated with information
that triggers cycle generation for the hardware in parallel to the
execution of the translated program. The VLIW processor executes the
translated program whereas the FPGAs contain the hardware for the
parallel cycle generation and the bus interface that adapts the bus of
the VLIW processor to the SoC bus of the emulated processor core.
-
Virtual Hardware Prototyping through Timed Hardware-Software Co-Simulation [p. 798]
-
F. Fummi, M. Loghi, M. Poncino, S. Martini, G. Perbellini, and M. Monguzzi
Designers of factory automation applications increasingly
demand tools for rapid prototyping of hardware extensions
to existing systems and verification of resulting
behaviors through hardware and software co-simulation.
This work presents a framework for the timing-accurate cosimulation
of HDL models and their verification against
hardware and software running on an actual embedded device
of which only a minimal knowledge of the current design
is required.
Experiments on real-life applications show that early architectural
and design decisions can be taken by measuring the
expected performance on the models realized using the proposed
framework.
-
Fast Dynamic Memory Integration in Co-Simulation Frameworks for Multiprocessor System on-Chip [p. 804]
-
O. Villa, M. Monchiero, G. Palermo, P. Schaumont, and I. Verbauwhede
This paper proposes a technique to integrate and
simulate a dynamic memory in a multiprocessor framework
based on C/C++/SystemC. Using the host machine's memory
management capabilities, dynamic data processing is supported
without compromising speed and accuracy of the
simulation. A first prototype in a shared memory context is
presented.
Moderators: R. Hermida, Madrid U, ES; E. De Kock, Philips Research, NL
-
FORAY-GEN: Automatic Generation of Affine Functions for Memory Optimizations [p. 808]
-
I. Issenin and N. Dutt
In today's embedded applications a significant portion of
energy is spent in the memory subsystem. Several approaches
have been proposed to minimize this energy, including the use
of scratch pad memories, with many based on static analysis of
a program. However, often it is not possible to perform static
analysis and optimization of a program's memory access
behavior unless the program is specifically written for this
purpose. In this paper we introduce the FORAY model of a
program that permits aggressive analysis of the application's
memory behavior that further enables such optimizations since
it consists of "for" loops and array accesses which are easily
analyzable. We present FORAY-GEN: an automated profile-based
approach for extraction of the FORAY model from the
original program. We also demonstrate how FORAY-GEN
enhances applicability of other memory subsystem optimization
approaches, resulting in an average of two times increase in
the number of memory references that can be analyzed by
existing static approaches.
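The notion of recovering an affine access function from a profile can be sketched as follows. This is only an illustration under simplifying assumptions, not the FORAY-GEN procedure: the trace format (a mapping from loop-iteration tuples to observed addresses) and the function name `fit_affine` are hypothetical, and the trace is assumed to contain the origin and each unit iteration vector.

```python
from typing import Dict, List, Optional, Tuple


def fit_affine(samples: Dict[Tuple[int, ...], int]
               ) -> Optional[Tuple[int, List[int]]]:
    """Try to explain observed addresses as base + sum(coeff[k] * index[k]).

    samples maps an iteration vector (i, j, ...) to the address observed
    at that iteration. Returns (base, coeffs), or None when the trace is
    not affine or lacks the origin / unit vectors needed by this sketch.
    """
    if not samples:
        return None
    dims = len(next(iter(samples)))
    origin = (0,) * dims
    if origin not in samples:
        return None
    base = samples[origin]
    coeffs = []
    for k in range(dims):
        unit = tuple(1 if d == k else 0 for d in range(dims))
        if unit not in samples:
            return None
        coeffs.append(samples[unit] - base)
    # Accept the model only if it reproduces every observed access.
    for index, address in samples.items():
        predicted = base + sum(c * i for c, i in zip(coeffs, index))
        if predicted != address:
            return None
    return base, coeffs


# Hypothetical profile of a[i][j]: rows of 8 four-byte elements at 0x1000.
trace = {(i, j): 0x1000 + 32 * i + 4 * j for i in range(3) for j in range(8)}
print(fit_affine(trace))

# A quadratic access pattern a[i * i] is rejected.
quadratic = {(i,): 0x2000 + 4 * i * i for i in range(4)}
print(fit_affine(quadratic))
```

Once an access is known to be affine in the loop indices, static techniques that expect "for" loops with array references can reason about it directly.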
-
Nonuniform Banking for Reducing Memory Energy Consumption [p. 814]
-
O. Ozturk and M. Kandemir
Main memories can consume a large percentage of overall
energy in many data-intensive embedded applications.
Past research has proposed and evaluated memory banking
as a possible approach for reducing memory energy consumption.
One of the common characteristics/assumptions
made by most of the past work on banking is that all the
banks are of the same size. While this makes the formulation
of the problem easy, it also restricts the potential solution
space. Motivated by this observation, this paper investigates
the possibility of employing nonuniform bank sizes for
reducing memory energy consumption. Specifically, it proposes
an integer linear programming (ILP) based approach
that returns the optimal nonuniform bank sizes and accompanying
data-to-bank mapping. It also studies how data migration
can further improve over nonuniform banking. We
implemented our approach using an ILP tool and performed
extensive experiments. The results show that the proposed
strategy brings important energy benefits over the uniform
banking scheme, and data migration across banks tends to
increase these savings.
-
Systematic Analysis of Active Clock Deskewing Systems Using Control Theory [p. 820]
-
V. Verghase, T. Chen, and P. Young
A formal methodology for the analysis of a closed loop
clock distribution and active deskewing network is proposed. In this
paper an active clock distribution and deskewing network is modeled
as a closed loop feedback system using state space equations. State
space analysis allows systematic analysis of any clock distribution and
deskewing system to determine the conditions under which a system
can over-compensate and become potentially unstable. Such an analysis
can be very useful to designers, as they will be able to determine
analytically how the clock deskewing system behaves. By using
the proposed approach, repeated simulations can be greatly limited
and maybe entirely avoided. We applied the proposed method to an
experimental clock deskewing system to illustrate the effectiveness of the
proposed approach. The proposed approach can be further extended to
determine performance of such systems under different configurations.
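How a state-space model exposes over-compensation can be shown with a toy discrete-time loop. The model below is an assumption made for illustration and is not the authors' system: the skew is corrected each cycle by a gain applied to a measurement that is one cycle old, e[k+1] = e[k] - g * e[k-1], so the state matrix is A = [[1, -g], [1, 0]] and the loop is stable exactly when the spectral radius of A is below 1.

```python
import cmath
from typing import List


def spectral_radius(gain: float) -> float:
    """Largest eigenvalue magnitude of A = [[1, -gain], [1, 0]].

    The eigenvalues solve lambda^2 - trace * lambda + det = 0
    with trace = 1 and det = gain.
    """
    trace, det = 1.0, gain
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return max(abs((trace + root) / 2.0), abs((trace - root) / 2.0))


def is_stable(gain: float) -> bool:
    """The loop converges when every eigenvalue lies inside the unit circle."""
    return spectral_radius(gain) < 1.0


def simulate(gain: float, initial_skew: float, steps: int) -> List[float]:
    """Iterate e[k+1] = e[k] - gain * e[k-1], starting with e[-1] = e[0]."""
    previous = current = initial_skew
    history = [current]
    for _ in range(steps):
        previous, current = current, current - gain * previous
        history.append(current)
    return history


for g in (0.2, 0.5, 1.2):
    final = simulate(g, 1.0, 40)[-1]
    print(g, round(spectral_radius(g), 4), is_stable(g), final)
```

In this toy model a gain of 0.5 lets the skew decay, while a gain of 1.2 over-compensates and the skew oscillates with growing amplitude. The eigenvalue check reaches the same conclusion without running the simulation, which is the kind of saving the abstract points to.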
-
Buffer Insertion for Bridges and Optimal Buffer Sizing for Communication Sub-System of
Systems-on-Chip [p. 826]
-
S. Kallakuri, A. Doboli, and E. Feinberg
We present an optimal buffer sizing and buffer
insertion methodology which uses stochastic models of the
architecture and Continuous Time Markov Decision Processes
(CTMDPs). Such a methodology is useful in managing
the scarce buffer resources available on chip, as compared
to network based data communication which can have large
buffer space. Modeling this problem in a CTMDP
framework leads to a nonlinear formulation due to the usage
of bridges in the bus architecture. We present a methodology
to split the problem into several smaller, linear
systems, and we then solve these subsystems.
-
Extended Control Flow Graph Based Performance Optimization Using Scratch-Pad Memory [p. 828]
-
H. Pu, M. Ling, and J. Jin
This paper presents an exploration approach that lets the researcher
choose a suitable Scratch-Pad memory (SPM) size
for maximal performance improvement of a specified
application. The approach uses an extended control flow
graph (ECFG) to describe the application and provides a
solution to reduce the additional overhead of moving nodes
to SPM. Experiments show an average 11% increase
in performance compared to previous approaches and a
44% decrease in the application's runtime compared to
an environment without SPM.
Organiser: W. Mueller, Paderborn U, DE
Moderator: G. Martin, Tensilica, US
Speakers: T. Schattkowsky, Paderborn U/C-LAB, DE; S. Mellor and J. Wolfe, Mentor Graphics, US;
Q. Zhu, Fujitsu Laboratories, JP
-
UML 2.0 - Overview and Perspectives in SoC Design [p. 832]
-
T. Schattkowsky
The design productivity gap requires more efficient
design methods. Software systems have faced the same
challenge and seem to have mastered it with the
introduction of more abstract design methods. The UML
has become the standard for software systems modeling
and thus the foundation of new design methods. Although
the UML is defined as a general purpose modeling
language, its application to hardware and
hardware/software codesign is very limited. In order to
successfully apply the UML in these fields, it is essential
to understand its capabilities and to map it to a new
domain.
-
Why Systems-on-Chip Needs More UML like a Hole in the Head [p. 834]
-
S. Mellor, J. Wolfe, and C. McCausland
Let's be clear from the outset: SoC can most certainly
make use of UML; SoC just doesn't need more UML, or
even all of it. The advent of model mappings, coupled with
marks that indicate which mapping rule to apply, enables a
major simplification of the use of UML in SoC.
-
Integrating UML into SoC Design Process [p. 836]
-
Q. Zhu, R. Oishi, T. Hasegawa, and T. Nakata
In this paper, we propose a method for integrating a UML
model into the current SoC design process. UML is
introduced as a formal model of specification for SoC
design. The consistency and completeness of the
specification is validated based on the formal UML model.
The implementation is validated by a systematic derivation
of test scenarios from the UML model. The method has been
applied to the design of a new media-processing chip for
mobile devices. The application of the method shows that it
is not only effective for finding logical errors in the
implementation, but also eliminates errors due to
inconsistency and incompleteness of the specification.
Moderators: P. Prinetto, Politecnico di Torino, IT; C. Papachristou, Case Western Reserve U, US
-
Rapid Generation of Thermal-Safe Test Schedules [p. 840]
-
P. Rosinger, B. Al-Hashimi, and K. Chakrabarty
Overheating has been acknowledged as a major
issue in testing complex SOCs. Several power
constrained system-level DFT solutions (power constrained
test scheduling) have recently been proposed
to tackle this problem. However, as will be shown
in this paper, imposing a chip-level maximum power
constraint doesn't necessarily avoid local overheating
due to the non-uniform distribution of power across
the chip. This paper proposes a new approach for
dealing with overheating during test, by embedding
thermal awareness into test scheduling. The proposed
approach facilitates rapid generation of thermal-safer
test schedules without requiring time-consuming thermal
simulations. This is achieved by employing a low-complexity
test session thermal model used to guide
the test schedule generation algorithm. This approach
reduces the chances of a design re-spin due to potential
overheating during test.
-
Simultaneous Reduction of Dynamic and Static Power in Scan Structures [p. 846]
-
S. Sharifi, J. Jaffari, M. Hosseinabady, A. Afzali-Kusha, and Z. Navabi
Power dissipation during test is a major challenge in testing
integrated circuits. Dynamic power has been the dominant
part of power dissipation in CMOS circuits; however, in future
technologies the static portion of power dissipation will
exceed the dynamic portion. This paper proposes an
efficient technique to reduce both dynamic and static power
dissipation in scan structures. Scan cell outputs which are not
on the critical path(s) are multiplexed to fixed values during
scan mode. These constant values and primary inputs are
selected such that the transitions occurring on non-multiplexed
scan cells are suppressed and the leakage current during scan
mode is decreased. A method for finding these vectors is also
proposed. Effectiveness of this technique is proved by
experiments performed on ISCAS89 benchmark circuits.
-
A Fast Diagnosis Scheme for Distributed Small Embedded SRAMs [p. 852]
-
B. Wang, A. Ivanov, and Y. Wu
This paper proposes a diagnosis scheme aimed at
reducing diagnosis time of distributed small embedded
SRAMs (e-SRAMs). This scheme improves the one
proposed in [7, 8]. The improvements are mainly two-fold.
On one hand, the diagnosis of time-consuming
Data Retention Faults (DRFs), which is neglected by the
diagnosis architecture in [7, 8], is now considered and
performed via a DFT technique referred to as the "No
Write Recovery Test Mode (NWRTM)". On the other
hand, a pair comprising a Serial to Parallel Converter
(SPC) and a Parallel to Serial Converter (PSC) is
utilized to replace the bi-directional serial interface, to
avoid the problems of serial fault masking and defect
rate dependent diagnosis. Results from our evaluations
show that the proposed diagnosis scheme achieves an
increased diagnosis coverage and reduces diagnosis
time compared to those obtained in [7, 8], with
negligible extra area cost.
Keywords: Distributed Small Embedded SRAMs,
Memory Diagnosis, Data Retention Fault, SPC, PSC,
Diagnosis Time
-
New Schemes for Self-Testing RAM [p. 858]
-
G. Bodean, D. Bodean, and A. Labunetz
This paper gives an overview of a new technique,
named pseudo-ring testing (PRT). PRT can be applied for
testing a wide range of random access memories (RAM): bit- or word-oriented
and single- or dual-port RAMs. An
essential feature of the proposed methodology is the
emulation of a linear automaton over a Galois field by
the memory's own components.
-
At-Speed Logic BIST for IP Cores [p. 860]
-
B. Cheon, E. Lee, L.-T. Wang, X. Wen, P. Hsu, J. Cho, J. Park, H. Chao, and S. Wu
This paper describes a flexible logic BIST scheme that
features high fault coverage achieved by fault-simulation
guided test point insertion, real at-speed test capability
for multi-clock designs without clock frequency
manipulation, and easy physical implementation due to
the use of a low-speed SE signal. Application results of
this scheme to two widely used IP cores are also reported.
Moderators: J. Madsen, TU Denmark, DK; J. Lopéz, Castilla-la Mancha U, ES
-
Design Optimization of Time- and Cost-Constrained Fault-Tolerant Distributed Embedded Systems [p. 864]
-
V. Izosimov, P. Pop, P. Eles, and Z. Peng
In this paper we present an approach to the design optimization of fault-tolerant
embedded systems for safety-critical applications. Processes are
statically scheduled and communications are performed using the time-triggered
protocol. We use process re-execution and replication for tolerating
transient faults. Our design optimization approach decides the mapping
of processes to processors and the assignment of fault-tolerant
policies to processes such that transient faults are tolerated and the timing
constraints of the application are satisfied. We present several heuristics
which are able to find fault-tolerant implementations given a limited
amount of resources. The developed algorithms are evaluated using extensive
experiments, including a real-life example.
-
Locality-Aware Process Scheduling for Embedded MPSoCs [p. 870]
-
M. Kandemir and G. Chen
Utilizing on-chip caches in embedded multiprocessor
system-on-a-chip (MPSoC) based systems is critical from
both performance and power perspectives. While most of
the prior work that targets optimizing cache behavior
is performed at the hardware and compilation levels, the
operating system (OS) can also play a major role as it sees
the global access pattern information across applications.
This paper proposes a cache-conscious OS process
scheduling strategy based on data reuse. The proposed
scheduler implements two complementary approaches.
First, the processes that do not share any data between
them are scheduled at different cores if it is possible to do
so. Second, the processes that could not be executed at the
same time (due to dependences) but share data among
each other are mapped to the same processor core so that
they share the cache contents. Our experimental results
using this new data locality aware OS scheduling strategy
are promising, and show significant improvements in task
completion times.
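
The two placement rules above can be illustrated with a minimal greedy sketch. It is not the authors' scheduler: the function name `assign_cores`, the pairwise `shares_data` and `depends` sets, and the least-loaded fallback are all assumptions made for illustration.

```python
def assign_cores(processes, shares_data, depends, num_cores):
    """Greedy illustration of cache-conscious process-to-core mapping.

    processes:   process names, in an order that respects dependences
    shares_data: set of frozenset({p, q}) pairs that reuse common data
    depends:     set of frozenset({p, q}) pairs that cannot run concurrently
    Returns a dict mapping each process to a core index.
    (All names and the tie-breaking policy are hypothetical.)
    """
    core_of = {}
    load = [0] * num_cores
    for p in processes:
        # Second rule: a dependent process that shares data with an already
        # placed process goes to the same core, so it finds a warm cache.
        partner_core = next(
            (core_of[q] for q in core_of
             if frozenset((p, q)) in shares_data
             and frozenset((p, q)) in depends),
            None,
        )
        if partner_core is not None:
            core = partner_core
        else:
            # First rule: otherwise spread processes over the cores,
            # here simply by picking the least-loaded one.
            core = min(range(num_cores), key=lambda c: load[c])
        core_of[p] = core
        load[core] += 1
    return core_of


if __name__ == "__main__":
    pair = frozenset(("A", "B"))
    print(assign_cores(["A", "B", "C", "D"], {pair}, {pair}, 2))
```

In this toy input, A and B are serialized by a dependence and share data, so they land on one core, while the unrelated processes C and D are placed on the other.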
-
A Modular Simulation Framework for Spatial and Temporal Task Mapping onto
Multi-Processor SoC Platforms [p. 876]
-
T. Kempf, M. Doerper, R. Leupers, G. Ascheid, H. Meyr, T. Kogel, and B. Vanthournout
Heterogeneous Multi-Processor SoC platforms bear the potential
to optimize conflicting performance, flexibility and energy
efficiency constraints as imposed by demanding signal
processing and networking applications. However, in order to
take advantage of the available processing and communication
resources, an optimal mapping of the application tasks
onto the platform resources is of crucial importance.
In this paper, we propose a SystemC-based simulation
framework, which enables the quantitative evaluation
of application-to-platform mappings by means of an executable
performance model. A key element of our approach is
a configurable event-driven Virtual Processing Unit to capture
the timing behavior of multi-processor/multi-threaded
MP-SoC platforms. The framework features an XML-based
declarative construction mechanism of the performance
model to significantly accelerate the navigation in large design
spaces.
The capabilities of the proposed framework in terms of design
space exploration are presented by a case study of a commercially
available MP-SoC platform for networking applications.
Focusing on the application to architecture mapping,
our introduced framework highlights the potential for optimization
of an efficient design space exploration environment.
-
Access Pattern-Based Code Compression for Memory-Constrained Embedded Systems [p. 882]
-
O. Ozturk, H. Saputra, M. Kandemir, and I. Kolcu
As compared to a large spectrum of performance
optimizations, relatively little effort has been dedicated to
optimize other aspects of embedded applications such as
memory space requirements, power, real-time
predictability, and reliability. In particular, many modern
embedded systems operate under tight memory space
constraints. One way of satisfying these constraints is to
compress executable code and data as much as possible.
While research on code compression has studied
efficient hardware and software based code strategies,
many of these techniques do not take application behavior
into account, that is, the same
compression/decompression strategy is used irrespective
of the application being optimized. This paper presents a
code compression strategy based on control flow graph
(CFG) representation of the embedded program. The idea
is to start with a memory image wherein all basic blocks
are compressed, and decompress only the blocks that are
predicted to be needed in the near future. When the
current access to a basic block is over, our approach also
decides the point at which the block could be compressed.
We propose several compression and decompression
strategies that try to reduce memory requirements without
excessively increasing the original instruction cycle
counts.
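
As a rough illustration of the idea of keeping only soon-to-be-needed basic blocks decompressed, the sketch below treats "needed in the near future" as "reachable within k CFG edges of the current block". That prediction rule, the function names, and the peak-count metric are assumptions for this example, not the strategies evaluated in the paper.

```python
from collections import deque


def within_k_edges(cfg, start, k):
    """Blocks reachable from `start` by following at most k CFG edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        block, dist = frontier.popleft()
        if dist == k:
            continue  # do not look further than k edges ahead
        for succ in cfg.get(block, ()):
            if succ not in seen:
                seen.add(succ)
                frontier.append((succ, dist + 1))
    return seen


def peak_decompressed(cfg, trace, k):
    """Peak number of simultaneously decompressed blocks over a trace.

    On entering each block, the blocks within k edges are kept (or made)
    decompressed and every other block is compressed again.
    """
    peak = 0
    for block in trace:
        decompressed = within_k_edges(cfg, block, k)
        peak = max(peak, len(decompressed))
    return peak


if __name__ == "__main__":
    cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    print(peak_decompressed(cfg, ["A", "B", "D"], k=1))
```

A larger k decompresses more blocks ahead of time, which trades memory footprint for fewer stalls on decompression.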
-
System Synthesis for Networks of Programmable Blocks [p. 888]
-
R. Mannion, H. Hsieh, S. Cotterell, and F. Vahid
The advent of sensor networks presents untapped opportunities
for synthesis. We examine the problem of synthesis of behavioral
specifications into networks of programmable sensor blocks. The
particular behavioral specification we consider is an intuitive
user-created network diagram of sensor blocks, each block
having a pre-defined combinational or sequential behavior. We
synthesize this specification to a new network that utilizes a
minimum number of programmable blocks in place of the predefined
blocks, thus reducing network size and hence network cost
and power. We focus on the main task of this synthesis problem,
namely partitioning pre-defined blocks onto a minimum number
of programmable blocks, introducing the efficient but effective
PareDown decomposition algorithm for the task. We describe the
synthesis and simulation tools we developed. We provide results
showing excellent network size reductions through such synthesis,
and significant speedups of our algorithm over exhaustive search
while obtaining near-optimal results for 15 real network designs
as well as nearly 10,000 randomly generated designs.
-
Distributed HW/SW-Partitioning for Embedded Reconfigurable Networks [p. 894]
-
T. Streichert, C. Haubelt, and J. Teich
In this paper, we propose a distributed online HW/SW-partitioning
strategy for increasing fault tolerance in
HW/SW-reconfigurable networked systems.
-
Synchronization Processor Synthesis for Latency Insensitive Systems [p. 896]
-
P. Bomel, E. Martin, and E. Boutillon
In this paper we present our contribution in terms of
synchronization processor for a SoC design methodology
based on the theory of the latency insensitive systems
(LIS) of Carloni et al. [1]. Our contribution consists of IP
encapsulation into a new wrapper model whose speed
and area are optimized and whose synthesizability is guaranteed.
The main benefit of our approach is to preserve the local
IP performances when encapsulating them and reduce
SoC silicon area.
-
Thermal-Aware Task Allocation and Scheduling for Embedded Systems [p. 898]
-
W.-L. Hung, Y. Xie, N. Vijaykrishnan, M. Kandemir, and M. Irwin
Temperature affects not only the reliability but also the
performance, power, and cost of the embedded system. This
paper proposes a thermal-aware task allocation and scheduling
algorithm for embedded systems. The algorithm is used as a
sub-routine for hardware/software co-synthesis to reduce the
peak temperature and achieve a thermally even distribution
while meeting real time constraints. The paper investigates
both power-aware and thermal-aware approaches to task
allocation and scheduling. The experimental results show that
the thermal-aware approach outperforms the power-aware
schemes in terms of maximal and average temperature
reductions. To the best of our knowledge, this is the first task
allocation and scheduling algorithm that takes temperature into
consideration.
Moderators: J. Koehl, IBM Microelectronics, DE; J. Lienig, TU Dresden, DE
-
An Improved Multi-Level Framework for Force-Directed Placement [p. 902]
-
K. Vorwerk and A. Kennings
One of the greatest impediments to achieving high quality
placements using force-directed methods lies in the large
amount of overlap initially present in these techniques.
This overlap makes the determination of cell ordering difficult
and can lead to the inadvertent separation of highly-connected
cells by the spreading forces. We show that a
multi-level clustering strategy can minimize the ill effects
of overlap and improve the quality of placements generated
by the force-directed tool FDP. Moreover, we present
a means of improving initial cell ordering through the unification
of min-cut partitioning and force-based placement,
and describe an enhanced median improvement heuristic
which further aids in minimizing HPWL. Numerical results
are presented showing that our flow generates placements
which are, on average, 15% better than mPG and 4% better
than Capo 9.0 on mixed-size designs.
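
For readers unfamiliar with the HPWL metric mentioned above, half-perimeter wirelength sums, over all nets, the half perimeter of the bounding box enclosing the net's pins. The sketch below computes it for cell-centre coordinates; the function name and the data layout are assumptions for illustration and do not reflect FDP's internals.

```python
def hpwl(nets, positions):
    """Half-perimeter wirelength of a placement.

    nets:      iterable of nets, each a list of cell names
    positions: dict mapping cell name -> (x, y) coordinate
    """
    total = 0.0
    for net in nets:
        xs = [positions[cell][0] for cell in net]
        ys = [positions[cell][1] for cell in net]
        # Bounding-box width plus height for this net.
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


if __name__ == "__main__":
    positions = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
    nets = [["a", "b"], ["a", "b", "c"]]
    print(hpwl(nets, positions))
```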
-
Bright-Field AAPSM Conflict Detection and Correction [p. 908]
-
X. Xu, A. Kahng, S. Sinha, C. Chiang, and A. Zelikovsky
As feature sizes shrink, it will be necessary to use AAPSM (Alternating-Aperture Phase Shift Masking) to
image critical features, especially on the polysilicon layer. This imposes additional constraints on the
layouts beyond traditional design rules. Of particular note is the requirement that all critical features
be flanked by opposite-phase shifters, while the shifters obey minimum width and spacing requirements. A
layout that meets this requirement is called phase-assignable; otherwise, the phase conflicts have to be removed to enable the use of AAPSM for
the layout. Previous work has sought to detect a suitable set of phase conflicts to be removed, as well
as correct them [3,4,5,6,8].
The contributions of this paper are the following: (1) a new approach to detect a minimal set of phase
conflicts (also referred to as AAPSM conflicts), which when corrected will produce a phase-assignable
layout; (2) a novel layout modification scheme for correcting these AAPSM conflicts. The proposed approach
for conflict detection shows significant improvements in the quality of results and runtime for real
industrial circuits, when compared to previous methods. To the best of our knowledge, this is the first
time layout modification results are presented for bright-field AAPSM. Our experiments show that the
percentage area increase for making a layout phase-assignable ranges from 0.7-11.8%.
-
Systematic Analysis of Energy and Delay Impact of Very Deep Submicron
Process Variability Effects in Embedded SRAM Modules [p. 914]
-
H. Wang, M. Miranda, W. Dehaene, F. Catthoor, and K. Maex
Variability is becoming a serious problem in process
technology for nanometer technology nodes. The increasing
difficulty in controlling the uniformity of critical process
parameters (e.g. doping levels) in the smaller devices,
makes the electrical properties of such scaled devices much
less predictable than in the past. In this paper, we study
how these technology effects influence the energy and delay
of a SRAM module. Despite the implications in the correct
operation of the module, in practically all cases the affected
memory implementations also become slower while
consuming on average more energy than nominally. This is
partly counter-intuitive and no existing literature describes
this in a systematic generic way for SRAMs. In this paper,
we identify and illustrate the different mechanisms behind
this unexpected behavior and quantify the impact of these
effects for on-chip SRAMs at the 65nm technology node.
-
TSUNAMI: An Integrated Timing-Driven Place and Route Research Platform [p. 920]
-
C. Alexandre, H. Clément, J.-P. Chaput, M. Sroka, C. Masson, and R. Escassut
In this paper, we present an experimental integrated
platform for the research, development and evaluation of
new VLSI back-end algorithms and design flows. Interconnect
scaling to nanometer processes presents many difficult
challenges to CAD flows. Academic research on back-end
mostly focuses on specific algorithmic issues separately.
However, one key issue that must also be addressed is the cooperation
of multiple algorithmic tools. TSUNAMI, our platform, is
based on an integrated C++ database around which all
tools consistently interact and collaborate. Above this platform
a fixed die standard cell timing-driven placement and
global routing flow has been developed.
-
Inductive and Capacitive Coupling Aware Routing Methodology Driven by a
Higher Order RLCK Moment Metric [p. 922]
-
A. Bhaduri and R. Vemuri
A new routing methodology, which accounts for inductive
and capacitive coupling between neighboring wires is proposed.
The inductive and capacitive coupling of the wires are
introduced through a "moment" based higher order RLCK cost
function. The routing process guided by this cost-function ensures
that the final solution has minimum ringing and delay.
Moderators: L. Lavagno, Politecnico di Torino, IT; W. Kruijtzer, Philips Research, NL
-
Statistical Modeling of Pipeline Delay and Design of Pipeline under Process
Variation to Enhance Yield in sub-100nm Technologies [p. 926]
-
A. Datta, S. Bhunia, S. Mukhopadhyay, N. Banerjee, and K. Roy
Operating frequency of a pipelined circuit is determined by the
delay of the slowest pipeline stage. However, under statistical
delay variation in sub-100nm technology regime, the slowest
stage is not readily identifiable and the estimation of the pipeline
yield with respect to a target delay is a challenging problem. We
have proposed analytical models to estimate yield for a pipelined
design based on delay distributions of individual pipe stages.
Using the proposed models, we have shown that change in logic
depth and imbalance between the stage delays can improve the
yield of a pipeline. A statistical methodology has been developed
to optimally design a pipeline circuit for enhancing yield.
Optimization results show that proper imbalance among the
stage delays in a pipeline improves design yield by 9% for the
same area and performance (and area reduction by about 8.4%
under a yield constraint) over a balanced design.
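
To make the notion of pipeline yield concrete: a pipeline meets a target delay only if every stage does. The sketch below assumes, purely for illustration, that stage delays are independent and normally distributed, so the yield is the product of the per-stage probabilities. The abstract does not state the authors' actual analytical model, and the function names are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ Normal(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def pipeline_yield(stages, target_delay):
    """Probability that all stage delays are within target_delay.

    stages: list of (mean_delay, std_dev) pairs, one per pipeline stage.
    Assumes independent Gaussian stage delays (an illustrative assumption).
    """
    yield_estimate = 1.0
    for mu, sigma in stages:
        yield_estimate *= normal_cdf(target_delay, mu, sigma)
    return yield_estimate


if __name__ == "__main__":
    # Delays in arbitrary units; the numbers are made up for the example.
    print(pipeline_yield([(0.90, 0.05), (0.95, 0.03), (0.85, 0.08)], 1.0))
```

Under this simplified model, adding stages or widening any stage's spread can only lower the product, which is why the slowest stage alone no longer determines the achievable frequency.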
-
Compositional Memory Systems for Multimedia Communicating Tasks [p. 932]
-
A. Molnos, S. Cotofana, M. Heijligers, and J. Van Eijndhoven
Conventional cache models are not suited for real-time
parallel processing because tasks may flush each other's
data out of the cache in an unpredictable manner. In this
way the system is not compositional so the overall performance
is difficult to predict and the integration of new tasks
expensive. This paper proposes a new method that imposes
compositionality on the system's performance and makes
different memory hierarchy optimizations possible for multimedia
communicating tasks when running on embedded
multiprocessor architectures. The method is based on a
cache allocation strategy that assigns sets of the unified
cache exclusively to tasks and to the communication buffers.
We also analytically formulate the problem and describe a
method to compute the cache partitioning ratio for optimizing
the throughput and the consumed power. When applied
to a multiprocessor with memory hierarchy our technique
also delivers a performance gain. Compared to the shared
cache case, for an application consisting of two jpeg decoders
and one edge detection algorithm 5 times fewer misses
are experienced and for an mpeg2 decoder 6.5 times fewer
misses are experienced.
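
The exclusive assignment of cache sets can be pictured with a small sketch: each task or communication buffer owns a contiguous range of sets, and its addresses are indexed only within that range, so no owner can evict another's lines. The contiguous layout, the modulo indexing, the 64-byte line size, and all names are assumptions for illustration; the paper's allocation mechanism and partitioning-ratio computation are not reproduced here.

```python
def build_partitions(num_sets, shares):
    """Give each owner an exclusive, contiguous range of cache sets.

    shares: dict mapping owner name -> number of sets (insertion ordered)
    Returns a dict mapping owner -> (first_set, set_count).
    """
    if any(count <= 0 for count in shares.values()):
        raise ValueError("every owner needs at least one set")
    if sum(shares.values()) > num_sets:
        raise ValueError("shares exceed the number of cache sets")
    partitions, base = {}, 0
    for owner, count in shares.items():
        partitions[owner] = (base, count)
        base += count
    return partitions


def set_index(partitions, owner, address, line_size=64):
    """Cache set used by `owner` for `address`, confined to its partition."""
    base, count = partitions[owner]
    return base + (address // line_size) % count


if __name__ == "__main__":
    parts = build_partitions(8, {"jpeg": 4, "edge": 2, "buffer": 2})
    print(parts)
    print(set_index(parts, "edge", 0x1240))
```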
-
Introducing Flexible Quantity Contracts into Distributed SoC and Embedded System Design Processes [p. 938]
-
J. Kruse, C. Thomsen, R. Ernst, T. Volling, and T. Spengler
Increasing design complexity eventually leads to a design
process that is distributed over several companies. This
is already found in the automotive industry but SoC design
appears to move in the same direction. Design processes
for complex systems are iterative, but iteration hardly
reaches beyond company borders. Iterations require availability
of preliminary design data and estimations, but due
to cost and liability issues suppliers often hesitate to provide
such preliminary data. Moreover, companies are rarely
able to judge the accuracy and precision of externally estimated
data. So, the systems integrator experiences increased
design risk. Particular mechanisms are needed to
ensure that the integrated system will meet the overall requirements
even if some of the early estimations are wrong
or imprecise. Based on work in supply chain management,
we propose an inter-company design process that is based
on formal techniques from real-time systems engineering
and so-called flexible quantity contracts. In this process,
formal techniques control design risk and flexible contracts
regulate cooperation and cost distribution. The process effectively
delays the design freeze point beyond the contract
conclusion to enable design iterations. We explain the process
and give an example.
-
A New System Design Methodology for Wire Pipelined SoC [p. 944]
-
M. Casu and L. Macchiarulo
Wire Pipelining (WP) has been proposed in order to limit
the impact of increasing wire delays. In general, the added
pipeline elements alter the system such that architectural
changes are needed to preserve functionality. We illustrate
a proposal that, while allowing the use of IP blocks without
modification, takes advantage of a minimal knowledge
of the IP's communication profile to dramatically increase
performance. We show the formal equivalence between
the WP and original systems and demonstrate the higher performance
achievable through a relevant case study.
-
A Memory Hierarchical Layer Assigning and Prefetching Technique to Overcome the
Memory Performance/Energy Bottleneck [p. 946]
-
M. Dasygenis, D. Soudris, A. Thanailakis, E. Brockmeyer, B. Durinck, and F. Catthoor
The memory subsystem has always been a bottleneck
in performance as well as a significant power contributor in
memory intensive applications. Many researchers have presented
multi-layered memory hierarchies as a means to design
energy and performance efficient systems. However,
most of the previous work does not explore trade-offs systematically.
We fill this gap by proposing a formalized technique
that takes into consideration data reuse, limited lifetime
of the arrays of an application and application specific
prefetching opportunities, and performs a thorough tradeoff
exploration for different memory layer sizes. This technique
has been implemented on a prototype tool, which was
tested successfully using nine real-life applications of industrial
relevance. Following this approach we have been able
to reduce execution time by up to 60%, and energy consumption
by up to 70%.
Organiser/Moderator: W. Rosenstiel, Tuebingen U, DE
Panellists: R. Bergamaschi, IBM, US; F. Ghenassia, STMicroelectronics, FR; T. Groetker, Synopsys, DE;
M. Kawarabayashi, NEC, JP; M. van Lier, Philips Semiconductors, NL; A. Mayer, Infineon Technologies, DE;
M. Meredith, Forte Design Systems, US; M. Milligan, CoWare, US; S. Swan, Cadence, US
-
Is There a Market for SystemC Tools? [p. 950]
-
W. Rosenstiel, R. Bergamaschi, F. Ghenassia, T. Groetker,
M. Kawarabayashi, A. Mayer, M. Meredith, and M. Milligan
SystemC, users and tool providers are at a crossroads. More and more companies are using
SystemC; however EDA companies are hesitant to give a full commitment to SystemC tools,
especially at system-level. There are several reasons for this dichotomy. While users seem
excited about SystemC for its technical qualities for system-level design, tool providers may
not share this excitement because of the current small market for such tools. Are the existing
(free) reference implementation and the current small market for system level tools to blame,
or are there any technical issues impeding the fast development of SystemC tools? Among the
SystemC users from industry and academia there is currently some uncertainty about the
future availability of state of the art EDA tools supporting SystemC. This panel brings
together industrial SystemC users as well as EDA companies to discuss these issues. The
industrial panellists will present the current situation regarding the use of SystemC in
industry, its importance (or lack thereof) for system design and future needs for such tools.
The tool providers will explain their current position regarding the commitment to SystemC
and clarify their future plans for supporting it.
Moderators: P. Feldmann, IBM TJ Watson Research Center, US; D. Luca, MIT, US
-
Statistical Timing Analysis with Extended Pseudo-Canonical Timing Model [p. 952]
-
L. Zhang, W. Chen, Y. Hu, and C.C.-P. Chen
State of the art statistical timing analysis (STA)
tools often yield less accurate results when timing variables
become correlated due to global sources of variation and path
reconvergence. To the best of our knowledge, no good solution is
available for dealing with both types of correlation simultaneously.
In this paper, we present a novel extended pseudo-canonical
timing model to retain and evaluate both types of correlation
during statistical timing analysis with minimum computation
cost. Also, an intelligent pruning method is introduced to enable
trading off runtime against accuracy.
Tested with ISCAS benchmark suites, our method shows both
high accuracy and high performance. For example, on the circuit
c6288, our distribution estimation error shows 15x accuracy
improvement compared with previous approaches.
-
Modeling Interconnect Variability Using Efficient Parametric Model Order Reduction [p. 958]
-
P. Li, F. Liu, S. Nassif, and L. Pileggi
Assessing IC manufacturing process fluctuations and their
impacts on IC interconnect performance has become
unavoidable for modern DSM designs. However, the
construction of parametric interconnect models is often
hampered by the rapid increase in computational cost and
model complexity. In this paper we present an efficient yet
accurate parametric model order reduction algorithm for
addressing the variability of IC interconnect performance.
The efficiency of the approach lies in a novel combination of
low-rank matrix approximation and multi-parameter moment
matching. The complexity of the proposed parametric model
order reduction is as low as that of a standard Krylov
subspace method when applied to a nominal system. Under
the projection-based framework, our algorithm also
preserves the passivity of the resulting parametric models.
-
Stochastic Power Grid Analysis Considering Process Variations [p. 964]
-
P. Ghanta, S. Vrudhula, J. Wang, and R. Panda
In this paper, we investigate the impact of interconnect and device
process variations on voltage fluctuations in power grids. We
consider random variations in the power grid's electrical parameters
as spatial stochastic processes and propose a new and efficient
method to compute the stochastic voltage response of the power
grid. Our approach provides an explicit analytical representation
of the stochastic voltage response using orthogonal polynomials in
a Hilbert space. The approach has been implemented in a prototype
software called OPERA (Orthogonal Polynomial Expansions
for Response Analysis). Use of OPERA on industrial power grids
demonstrated speed-ups of up to two orders of magnitude. The results
also show a significant variation of about ± 35% in the nominal
voltage drops at various nodes of the power grids and demonstrate
the need for variation-aware power grid analysis.
-
Buffer Insertion Considering Process Variation [p. 970]
-
J. Xiong, K. Tam, and L. He
A comprehensive probabilistic methodology is proposed to
solve the buffer insertion problem with the consideration
of process variations. In contrast to a recent work, we
point out, for the first time, that the correlation between
the required arrival time and the downstream loading capacitance
must be considered in order to solve the problem
"correctly". We develop an efficient bottom-up recursive algorithm
to calculate the joint probability density function
that accurately captures the above correlation, and propose
effective pruning rules to exclude probabilistically inferior
solutions. We verify our buffer insertion using timing analysis
with both device and interconnect variations, and show
that compared to the conventional buffer insertion algorithm
using nominal device and interconnect parameters, our new
buffer insertion methodology can reduce the probability of
timing violation by up to 30%.
-
EM Wave Coupling Noise Modeling Based on Chebyshev Approximation and
Exact Moment Formulation [p. 976]
-
B. Wang and P. Mazumder
This paper presents a new mathematical approach to
modeling EM wave coupling noise so that it can be easily
integrated into chip-level noise analysis tools. The
new method employs Chebyshev approximation technique
to model the distributed sources arising in the
Telegrapher's equations due to EM wave coupling. A uniform
plane wave illumination metric is provided to determine
the order of approximation. Closed-form formulas
for the noise transfer functions' moments are derived.
By utilizing the formulated moments, reduced order models
can be efficiently obtained to generate the induced noise
caused by EM wave illumination. The accuracy of the proposed
method is verified by Hspice simulation.
-
Modeling the Non-Linear Behavior of Library Cells for an Accurate Static Noise Analysis [p. 982]
-
C. Forzan and D. Pandini
In signal integrity analysis, the joint effect of propagated noise through library cells, and of the noise injected on a quiet net by neighboring switching nets through coupling capacitances, must be considered in order to accurately estimate the overall noise impact on design functionality and performance. In this work the impact of the cell non-linearity on the noise glitch waveform is analyzed in detail, and a new macromodel that allows accurate and efficient modeling of the non-linear effects of the victim driver in noise analysis is presented.
Experimental results demonstrate the effectiveness of our method, and confirm that existing noise analysis approaches based on linear superposition of the propagated and crosstalk-injected noise can be highly inaccurate, thus impairing the sign-off functional verification phase.
-
Performance Driven Decoupling Capacitor Allocation Considering Data and Clock Interactions [p. 984]
-
A. Chandy and T. Chen
We propose a sensitivity-based method to allocate
decaps incorporating leakage constraints and tighter data and
clock interactions. The proposed approach attempts to allocate
decaps not only based on the power grid integrity criteria, but
also based on the impact of power grid noise on timing criticality
and robustness. The resulting algorithm reduces the power grid
noise to below a threshold and improves the performance or timing
robustness of the circuit at the same time.
-
Reduction of CMOS Power Consumption and Signal Integrity Issues by Routing Optimization [p. 986]
-
P. Zuber, A. Windschiegl, R. de Otálora, W. Stechele, and A. Herkersdorf
This paper suggests a methodology to decrease the
power of a static CMOS standard cell design at layout level by
focusing on switched capacitance. The term switched is the key:
if a capacitance is not switched often, it may be high. If it is
frequently switched, it should be minimized in order to reduce
power consumption. This can be done by an algorithm based
on forces that automatically optimizes the position and length
of every single wire segment in a routed design. The forces are
proportional to the toggle activities derived from a gate level
simulation. The novelty is that this makes it possible to iteratively find a
new topology for the wire segments. Our algorithm takes as input
an already given, grid routed layout.
Moderators: R. Galivanche, Intel, US; H. Obermeir, Infineon Technologies, DE
-
Implicit and Exact Path Delay Fault Grading in Sequential Circuits [p. 990]
-
V. Kumar, S. Tragoudas, R. Jayabharathi, and S. Chakravarty
The first path implicit and exact non-robust path delay
fault grading technique for non-scan sequential circuits
is presented. Non-enumerative exact coverage is
obtained by allowing any latched error representing
a delayed transition to propagate to a primary output
with the support of other potentially latched errors. The
generalized error propagation is done by symbolic simulation.
Appropriate data structures for function manipulation
are used. The advantage of the proposed
method is demonstrated experimentally with consistent
improvement in coverage over an existing pessimistic
heuristic despite enforced bounds on the memory requirements.
-
Extraction Error Modeling and Automated Model Debugging in High-Performance
Low Power Custom Designs [p. 996]
-
Y.-S. Yang, A. Veneris, P. Thadikaran, and S. Venkataraman
Test model generation is common in the design cycle of custom
made high performance low power designs targeted for high
volume production. Logic extraction is a key step in test model
generation to produce a logic level netlist from the transistor level
representation. This is a semi-automated process which is error
prone. This paper analyzes typical extraction errors applicable to
clocking schemes seen in high-performance designs today. An automated
debugging solution for these errors in designs with no state
equivalence information is also presented. A suite of experiments
on circuits with architectures similar to those found in industry
confirms the fitness and practicality of the solution.
-
Integration of Learning Techniques into Incremental Satisfiability for
Efficient Path-Delay Fault Test Generation [p. 1002]
-
K. Chandrasekar and M. Hsiao
In recent years, several Electronic Design Automation
(EDA) problems in testing and verification have been formulated
as Boolean Satisfiability (SAT) instances due to
the development of efficient general-purpose SAT solvers.
Problem-specific learning techniques and heuristics can be
integrated into the SAT solver to further speed up the search
for a satisfying assignment. In this paper, we target the problem
of generating a complete test-suite for the path delay
fault (PDF) model. We provide an Incremental Satisfiability
framework that learns from (1) static logic implications,
(2) segment-specific clauses, and (3) unsatisfiability cores of
each untestable partial PDF. These learning techniques improve
the test generation for path delay faults that have
common testable and/or untestable segments. The experimental
results show that a significant portion of PDFs can
be excluded dynamically in the proposed incremental SAT
formulation for large benchmark circuits, thus potentially
achieving speed-ups for PDF test generation.
-
The Accidental Detection Index as a Fault Ordering Heuristic for Full-Scan Circuits [p. 1008]
-
I. Pomeranz and S. Reddy
We investigate a new fault ordering heuristic for test generation
in full-scan circuits. The heuristic is referred to as
the accidental detection index. It associates a value
ADI(f) with every circuit fault f. The heuristic estimates
the number of faults that will be detected by a test generated
for f. Fault ordering is done such that a fault with
a higher accidental detection index appears earlier in the
ordered fault set and targeted earlier during test generation.
This order is effective for generating compact test
sets, and for obtaining a test set with a steep fault coverage
curve. Such a test set has several applications. We
present experimental results to demonstrate the effectiveness
of the heuristic.
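
The ordering step can be illustrated with a minimal sketch. The abstract does not say how ADI(f) is computed, so the estimate below is an assumption made only for illustration: a sample of vectors is fault-simulated, and ADI(f) is taken as the average number of faults detected by the vectors that detect f. The names `accidental_detection_index`, `order_faults` and the vector/fault labels are hypothetical.

```python
def accidental_detection_index(detected_by):
    """Estimate ADI(f) for every fault f.

    detected_by maps each simulated vector to the set of faults it detects.
    Assumption (not taken from the abstract): ADI(f) is the average number
    of faults detected by the vectors that detect f.
    """
    sizes = {}
    for faults in detected_by.values():
        for f in faults:
            sizes.setdefault(f, []).append(len(faults))
    return {f: sum(s) / len(s) for f, s in sizes.items()}


def order_faults(adi):
    """Order faults so that a higher ADI is targeted earlier (ties by name)."""
    return sorted(adi, key=lambda f: (-adi[f], f))


# Hypothetical fault-simulation result for three sample vectors.
detected_by = {
    "u1": {"f1", "f2", "f3"},
    "u2": {"f3"},
    "u3": {"f2", "f4"},
}
adi = accidental_detection_index(detected_by)
ordered = order_faults(adi)
```

A test generator would then target the faults in the order given by `ordered`, dropping every fault that an earlier test happens to detect.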
-
Diagnostic and Detection Fault Collapsing for Multiple Output Circuits [p. 1014]
-
R. Sandireddy and V. Agrawal
We discuss fault equivalence and dominance relations
for multiple output combinational circuits. The conventional
definition for equivalence says that "Two faults are
equivalent if and only if the corresponding faulty circuits
have identical output functions". This definition, which is
based on indistinguishability of the faults, is extended for
multiple output circuits as "Two faults of a Boolean circuit
are equivalent if and only if the pair of the output functions
is identical at each output of the circuit". This is termed
diagnostic equivalence in this paper. "If all tests that detect
a fault also detect another fault, not necessarily on the same
output, then the two faults are called detection equivalent".
Two detection equivalent faults need not be indistinguishable.
The definitions for fault dominance follow on similar
lines. A novel algorithm based on redundancy identification
has been proposed to find the equivalence and dominance
collapsed sets based on diagnostic and detection collapsing.
Applying the algorithm to a 4-bit ALU would collapse
the total fault set of 502 faults to 253 and 155, respectively,
according to diagnostic equivalence and dominance. The
collapsed sets have 234 and 92 faults, respectively, for detection
equivalence and dominance. In comparison, the traditional
structural equivalence and dominance collapsing
results in 301 and 248 faults, respectively. Finally, we use
library-based functional collapsing in a hierarchical system
and find that smaller fault sets are obtained with an order
of magnitude reduction in CPU time for very large circuits.
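
The difference between the two notions can be checked exhaustively on a toy circuit. The sketch below is only an illustration under stated assumptions: the two-output circuit and the three faults are hypothetical, and detection equivalence is read as "the two faults have the same set of detecting tests". It is not the redundancy-based algorithm of the paper.

```python
from itertools import product

# Hypothetical two-output circuit: both outputs compute a AND b
# through separate gates.
def good(a, b):
    return (a & b, a & b)

# Each fault is modeled directly by the faulty circuit's output functions.
faulty = {
    "z1 stuck-at-0": lambda a, b: (0, a & b),
    "gate-1 input b stuck-at-0": lambda a, b: (0, a & b),
    "z2 stuck-at-0": lambda a, b: (a & b, 0),
}

VECTORS = list(product((0, 1), repeat=2))


def response(circuit):
    """Output values at every output for every input vector."""
    return tuple(circuit(a, b) for a, b in VECTORS)


def detecting_tests(circuit):
    """Vectors for which some output differs from the fault-free circuit."""
    return frozenset(v for v in VECTORS if circuit(*v) != good(*v))


def diagnostic_equivalent(f, g):
    """Identical output functions at each output (indistinguishable)."""
    return response(faulty[f]) == response(faulty[g])


def detection_equivalent(f, g):
    """Same set of detecting tests, not necessarily on the same output."""
    return detecting_tests(faulty[f]) == detecting_tests(faulty[g])
```

Here the two stuck-at-0 faults on z1 and z2 are detected by the same single test (a = b = 1), so they are detection equivalent, yet they fail on different outputs and can therefore be told apart.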
-
Framework for Fault Analysis and Test Generation in DRAMs [p. 1020]
-
Z. Al-Ars, S. Hamdioui, A. Van De Goor, and G. Mueller
With the increasing complexity of memory behavior,
attempts are being made to come up with a methodical
approach that employs electrical simulation to tackle
the memory test problem. This paper describes a framework
of algorithms and tools developed jointly by the Delft
University of Technology and Infineon Technologies to systematically
generate DRAM tests using Spice simulation.
The proposed Spice-based test approach enjoys the advantage
of being relatively inexpensive, yet highly accurate in
describing the desired memory faulty behavior.
Keywords: tool framework, DRAM testing, faulty behavior,
defect simulation, test generation.
-
Mutation Sampling Technique for the Generation of Structural Test Data [p. 1022]
-
M. Scholivé, V. Beroulle, C. Robach, M. Flottes, and B. Rouzeyre
Our goal is to produce validation data that can be used as an efficient (pre) test set for
structural stuck-at faults. In this paper, we detail an original test-oriented mutation
sampling technique used for generating such data and we present a first evaluation
on these validation data with regard to a structural test.
Moderators: J. Sztipanovits, ISIS Vanderbilt U, US; P. Kajfasz, Thales Communication, FR
-
Studying Storage-Recomputation Tradeoffs in Memory-Constrained Embedded Processing [p. 1026]
-
M. Kandemir, F. Li, G. Chen, G. Chen, and O. Ozturk
Fueled by an unprecedented desire for convenience and
self-service, consumers are embracing embedded technology
solutions that enhance their mobile lifestyles. Consequently,
we witness an unprecedented proliferation of embedded/
mobile applications. Most of the environments that
execute these applications have severe power, performance,
and memory space constraints that need to be accounted
for. In particular, memory limitations can present serious
challenges to embedded software designers. The current
solutions to this problem include sophisticated packaging
techniques and code optimizations for effective memory utilization.
While the first solution is not scalable, the second
one is restricted by intrinsic data dependences in the code
that prevent code restructuring. In this paper, we explore
an alternate approach for reducing memory space requirements
of embedded applications. The idea is to re-compute
the result of a code block (potentially multiple times) instead
of storing it in memory and performing a memory operation
whenever needed. The main benefit of this approach
is that it reduces memory space requirements, that is, no
memory space is reserved for storing the result of the code
block in question.
-
BB-GC: Basic-Block Level Garbage Collection [p. 1032]
-
O. Ozturk, M. Kandemir, and M. Irwin
Memory space limitation is a serious problem for
many embedded systems from diverse application domains.
While circuit/packaging techniques are definitely important
to squeeze large quantities of data/instructions
into small size memories typically employed by embedded
systems, software can also play a crucial role in reducing
memory space demands of embedded applications.
This paper focuses on a software-managed two-level memory
hierarchy and instruction accesses. Our goal is to
reduce on-chip memory requirements of a given application
as much as possible, so that the memory space saved
can be used by other simultaneously-executing applications.
The proposed approach achieves this by tracking
the lifetime of instructions. Specifically, when an instruction
is dead (i.e., it could not be visited again in the rest of
execution), we deallocate the on-chip memory space allocated
to it. Working on the control flow graph representation
of an embedded application, our approach performs
basic block-level garbage collection for on-chip memories.
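
The notion of a dead block can be sketched on a control flow graph as follows. This is a minimal illustration, assuming that a block is dead once it is no longer reachable from the block currently executing; the function names and the three-block CFG are hypothetical, and the paper's actual analysis may be more refined.

```python
def reachable(cfg, start):
    """Blocks reachable from `start` (including itself) in the CFG."""
    seen = set()
    stack = [start]
    while stack:
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        stack.extend(cfg.get(block, ()))
    return seen


def dead_blocks(cfg, current, resident):
    """Resident blocks that can never be visited again from `current`."""
    live = reachable(cfg, current)
    return {b for b in resident if b not in live}


# Hypothetical CFG: initialization, a loop, and an exit block.
cfg = {"init": ["loop"], "loop": ["loop", "exit"], "exit": []}
resident = {"init", "loop", "exit"}
```

With this CFG, `init` can be deallocated as soon as execution enters `loop`, and `loop` itself once execution reaches `exit`.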
-
Fine Grain QoS Control for Multimedia Application Software [p. 1038]
-
J. Combaz, J.-C. Fernandez, J. Sifakis, and T. Lepley
We propose a method for fine grain QoS control of dataflow
applications. We assume that the application software
is described as the composition of actions (C-functions)
with quality level parameters. The method computes
a QoS controller from this description and from the average
execution times, worst-case execution times and deadlines
of its actions. The controller dynamically computes feasible
schedules and quality assignments for these actions.
Furthermore, the control policy ensures optimal time budget
utilization. A prototype tool implementing the method is
shown as well as experimental results for a non-trivial example.
The results show the value of fine grain QoS control
for video encoders.
-
Correct-by-Construction Transformations across Design Environments for
Model-Based Embedded Software Development [p. 1044]
-
M. Baleani, A. Ferrari, L. Mangeruca, A. Sangiovanni-Vincentelli,
U. Freund, E. Schlenker, and H.-J. Wolff
Embedded software design for real-time reactive systems
has become the bottleneck in the market introduction of
complex products such as automobiles, airplanes, and industrial
control plants. In particular, functional correctness
and reactive performance are increasingly difficult to verify.
The advent of model-based design methodologies has
alleviated some of the verification-related problems by making
the code-generation process flow automatically from the
model description. Given the relative infancy of this approach,
several companies rely upon design flows based on
different tools connected together by file transfer. This way
of integrating tools defeats the very purpose of the methodology,
introducing a high potential for errors in the transformation
from one format to another and preventing formal
analysis of the properties of the design. In this paper, we
propose to adopt a formal transformation across different
tools and we give an example of this approach by linking
two tools that are widely used in the automotive domain:
Simulink and ASCET. We believe that this approach can be
applied to any embedded software design flow to leverage
the power of all the tools in the flow.
-
galsC: A Language for Event-Driven Embedded Systems [p. 1050]
-
E. Cheong and J. Liu
We introduce galsC, a language designed for programming
event-driven embedded systems such as sensor networks. galsC implements
the TinyGALS programming model. At the local level, software
components are linked via synchronous method calls to form actors. At
the global level, actors communicate with each other asynchronously via
message passing, which separates the flow of control between actors. A
complementary model called TinyGUYS is a guarded yet synchronous
model designed to allow thread-safe sharing of global state between actors
via parameters without explicitly passing messages. The galsC compiler
extends the nesC compiler, which allows for better type checking and
code generation. Having a well-structured concurrency model at the
application level greatly reduces the risk of concurrency errors, such
as deadlock and race conditions. The galsC language is implemented on
the Berkeley motes and is compatible with the TinyOS/nesC component
library. We use a multi-hop wireless sensor network as an example to
illustrate the effectiveness of the language.
-
Compiler-Directed Instruction Duplication for Soft Error Detection [p. 1056]
-
J. Hu, F. Li, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M. Irwin
In this work, we experiment with compiler-directed instruction
duplication to detect soft errors in VLIW datapaths.
In the proposed approach, the compiler determines
the instruction schedule by balancing the permissible performance
degradation with the required degree of duplication.
Our experimental results show that our algorithms allow
the designer to perform tradeoff analysis between performance
and reliability.
-
OS Debugging Method Using a Lightweight Virtual Machine Monitor [p. 1058]
-
T. Takeuchi
Demands for implementing original OSs that can achieve high I/O performance on
PC/AT compatible hardware have recently been increasing, but conventional OS debugging
environments have not been able to simultaneously assure their stability, be easily
customized to new OSs and new I/O devices, and assure efficient execution of I/O operations.
We therefore developed a novel OS debugging method using a lightweight virtual machine. We
evaluated this debugging method experimentally and confirmed that it can transfer data about
5.4 times as fast as the conventional virtual machine monitor.
-
Hardware Support for Arbitrarily Complex Loop Structures in Embedded Applications [p. 1060]
-
N. Kavvadias and S. Nikolaidis
In this paper, the program control unit of an
embedded RISC processor is enhanced with a novel zero-overhead
loop controller (ZOLC) supporting arbitrary
loop structures with multiple-entry/exit nodes. The ZOLC
has been incorporated into an open RISC processor core
to evaluate the performance of the proposed unit for
alternative configurations of the selected processor. It is
proven that speed improvements of 8.4% to 48.2% are
feasible for the benchmarks used.
Moderators: L. Hedrich, Frankfurt U, DE; E. Martens, KU Leuven, BE
-
Mixing Global and Local Competition in Genetic Optimization Based Design
Space Exploration of Analog Circuits [p. 1064]
-
A. Somani, P. Chakrabarti, and A. Patra
The knowledge of optimal design space boundaries of
component circuits can be extremely useful in making
good subsystem-level design decisions which are aware of
the parasitics and other second-order circuit-level details.
However, direct application of popular multi-objective genetic
optimization algorithms was found to produce Pareto
fronts with poor diversity for analog circuit problems. This
work proposes a novel approach to control the diversity
of solutions by partitioning the solution space, using Local
Competition to promote diversity and Global competition
for convergence, and by controlling the proportion of these
two mechanisms by a Simulated Annealing based formulation.
The algorithm was applied to extract numerical results
on analog switched capacitor integrator circuits with a wide
range of tight specifications. The results were found to be
significantly better than traditional GA based uncontrolled
optimization methods.
-
Efficient Multiobjective Synthesis of Analog Circuits Using
Hierarchical Pareto-Optimal Performances Hypersurfaces [p. 1070]
-
T. Eeckelaert, T. McConaghy, and G. Gielen
An efficient methodology is presented to generate the
Pareto-optimal hypersurface of the performance space of
a complete mixed-signal electronic system. This Pareto-optimal
front offers the designer access to all optimal design
solutions: starting from the performance specifications, a satisfactory
point can a posteriori be selected on the hypersurface
which immediately determines the final design parameters.
Fast execution is guaranteed by using multi-objective
evolutionary optimization techniques and hierarchical decomposition.
The presented method takes advantage of the Pareto
hypersurfaces of the subblocks to generate the overall Pareto
front. The hierarchical approach combines behavioral simulation
using behavioral models at the higher levels with SPICE
simulation at transistor-level accuracy at the lowest level.
Storing the performance data of all subblocks enables reuse
for other systems later on.
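
The idea of building a system-level front from subblock fronts can be sketched as below. This is a simplified illustration, not the authors' method: all objectives are assumed to be minimized, and the system performance is assumed to be the sum of the subblock performances (for example power and area), whereas the paper relies on behavioral simulation to combine subblocks. The names `dominates`, `pareto_front` and `combine` are hypothetical.

```python
from itertools import product


def dominates(p, q):
    """p dominates q: no worse in every objective, strictly better in one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def pareto_front(points):
    """Keep only the points that no other point dominates (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def combine(front_a, front_b):
    """System-level front from two subblock fronts.

    Assumption: each system objective is the sum of the subblock objectives.
    Only Pareto-optimal subblock points are combined, which is what keeps
    the number of system-level candidates small.
    """
    candidates = [tuple(x + y for x, y in zip(a, b))
                  for a, b in product(front_a, front_b)]
    return pareto_front(candidates)


# Hypothetical (power, delay) points for one subblock.
block_points = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0), (5.0, 3.0)]
block_front = pareto_front(block_points)

system_front = combine([(1, 4), (2, 2)], [(1, 3), (3, 1)])
```

Storing `block_front` rather than the raw points is also what would allow a subblock to be reused in another system without repeating its optimization.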
-
Estimating Scalable Common-Denominator Laplace-Domain MIMO Models in an
Errors-in-Variables Framework [p. 1076]
-
G. Vandersteen, L. De Locht, S. Jenei, Y. Rolain, and R. Pintelon
Design of electrical systems demands simulations using
models evaluated for different choices of design parameters. To
enable the simulation of linear systems, one often requires
their modeling as ordinary differential equations given tabular
data obtained from device simulations or measurements.
Existing techniques need to do this for every choice
of design parameters since the model representations don't
scale smoothly with the external parameter.
The paper describes a frequency-domain identification
algorithm to extract the poles and zeros of linear MIMO systems.
Furthermore, it expresses the poles and zeros as trajectories
that are functions of the design parameter(s). The
paper describes the used framework, solves the starting-value
problem, presents a solution for high-order systems
and provides a model-order selection strategy. The properties
of the algorithm are illustrated on microwave measurements
of inductors, a variable gain amplifier and a high-order
SAW-filter. As shown by these examples, the proposed
identification algorithm is very well suited to derive scalable,
physically relevant models out of tabular frequency-response
data.
-
CAFFEINE: Template-Free Symbolic Model Generation of Analog Circuits via
Canonical Form Functions and Genetic Programming [p. 1082]
-
T. McConaghy, T. Eeckelaert, and G. Gielen
This paper presents a method to automatically
generate compact symbolic performance models of
analog circuits with no prior specification of an equation
template. The approach takes SPICE simulation data as
input, which enables modeling of any nonlinear circuits
and circuit characteristics. Genetic programming is
applied as a means of traversing the space of possible
symbolic expressions. A grammar is specially designed to
constrain the search to a canonical form for functions.
Novel evolutionary search operators are designed to
exploit the structure of the grammar. The approach
generates a set of symbolic models which collectively
provide a tradeoff between error and model complexity.
Experimental results show that the symbolic models
generated are compact and easy to understand, making
this an effective method for aiding understanding in
analog design. The models also demonstrate better
prediction quality than posynomials.
-
A Two-Level Modeling Approach to Analog Circuit Performance Macromodeling [p. 1088]
-
M. Ding and R. Vemuri
In this paper, we present a two-level modeling approach
to performance macromodeling based on radial basis function
Support Vector Machine (SVM). The two-level model
consists of a feasibility model and a set of performance
models. The feasibility model identifies the feasible designs
that satisfy the design constraints. The performance macromodel
is valid for feasible designs. We formulate the feasibility
macromodeling problem as a classification problem
and the performance macromodeling as a regression problem
and apply SVM algorithm to build the classifier and regressors
correspondingly. Our experiment shows that performance
macromodels for feasible designs are much more
accurate, faster to train and evaluate than those without
functional or performance constraints considered.
Organiser/Moderator: C. Paulus, Infineon Technologies, DE
-
New Perspectives and Opportunities from the Wild West of Microelectronic Biochips [p. 1092]
-
N. Manaresi, G. Medoro, M. Abonnenc, V. Auger,
P. Vulto, A. Romani, L. Altomare, M. Tartagni, and R. Guerrieri
Application of microelectronics to bioanalysis is an
emerging field which holds great promise. From the
standpoint of electronic and system design, biochips imply
a radical change of perspective, since new, completely
different constraints emerge while other usual constraints
can be relaxed. While electronic parts of the system can
rely on the usual established design flow, fluidic and
packaging design call for a new approach which relies
significantly on experiments. We hereby make some
general considerations based on our experience in the
development of biochips for cell analysis.
Moderators: R. Drechsler, Bremen U, DE; E. Giunchiglia, DIST - Genova U, IT
-
Verification of Embedded Memory Systems Using Efficient Memory Modeling [p. 1096]
-
M. Ganai, A. Gupta, and P. Ashar
We describe verification techniques for embedded memory
systems using efficient memory modeling (EMM), without
explicitly modeling each memory bit. We extend our previously
proposed approach of EMM in Bounded Model Checking
(BMC) for a single read/write port single memory system, to
more commonly occurring systems with multiple memories,
having multiple read and write ports. More importantly, we
augment such EMM to provide correctness proofs, in addition
to finding real bugs as before. The novelties of our verification
approach are in a) combining EMM with proof-based
abstraction that preserves the correctness of a property up to a
certain analysis depth of SAT-based BMC, and b) modeling
arbitrary initial memory state precisely and thereby, providing
inductive proofs using SAT-based BMC for embedded memory
systems. Similar to the previous approach, we construct a
verification model by eliminating memory arrays, but retaining
the memory interface signals with their control logic and
adding constraints on those signals at every analysis depth to
preserve the data forwarding semantics. The size of these EMM
constraints depends quadratically on the number of memory
accesses and the number of read and write ports; and linearly
on the address and data widths and the number of memories.
We show the effectiveness of our approach on several industry
designs and software programs.
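
The data forwarding semantics that the constraints preserve can be sketched on a concrete trace. This is only an illustration under stated assumptions: a single memory with one read and one write port, a write that becomes visible to reads from the next step onward, and an initial state represented by a symbolic tag. The `Access` record and the function names are hypothetical, and the paper expresses this as SAT constraints rather than as a simulation.

```python
from collections import namedtuple

# One step of a bounded trace: the memory interface signals only,
# with no memory array behind them.
Access = namedtuple("Access", "write_enable write_addr write_data read_addr")


def forwarded_read(trace, k, initial):
    """Data returned by the read at step k.

    The read is compared against every earlier write, most recent first;
    this pairwise comparison is why the constraint count grows
    quadratically with the number of accesses. If no earlier write hits
    the address, the (arbitrary) initial memory content is returned.
    """
    addr = trace[k].read_addr
    for i in range(k - 1, -1, -1):
        step = trace[i]
        if step.write_enable and step.write_addr == addr:
            return step.write_data
    return initial(addr)


def initial(addr):
    """Symbolic placeholder for an arbitrary initial memory state."""
    return f"mem0[{addr}]"


trace = [
    Access(True, 3, 10, 0),   # write 10 to address 3, read address 0
    Access(True, 3, 20, 3),   # write 20 to address 3, read address 3
    Access(False, 0, 0, 3),   # no write, read address 3
]
```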
-
An Efficient Sequential SAT Solver with Improved Search Strategies [p. 1102]
-
F. Lu, M. Iyer, G. Parthasarathy, L.-C. Wang, K.-T. Cheng, and K. Chen
A sequential SAT solver Satori [1] was recently proposed as an alternative
to combinational SAT in verification applications. This paper
describes the design of Seq-SAT - an efficient sequential SAT solver
with improved search strategies over Satori. The major improvements
include (1) a new and better heuristic for minimizing the set of assignments
to state variables, (2) a new priority-based search strategy
and a flexible sequential search framework which integrates different
search strategies, and (3) a decision variable selection heuristic more
suitable for solving the sequential problems. We present experimental
results to demonstrate that our sequential SAT solver can achieve
orders-of-magnitude speedup over Satori.
We plan to release the source code of Seq-SAT along with this paper.
-
Considering Circuit Observability Don't Cares in CNF Satisfiability [p. 1108]
-
Z. Fu, Y. Yu, and S. Malik
Boolean Satisfiability (SAT) has seen significant use in various tasks
in circuit verification in recent years. A key contributor to the efficiency
of contemporary SAT solvers is fast deduction using Boolean
Constraint Propagation (BCP). This can be efficiently implemented
with a Conjunctive Normal Form (CNF) representation of a circuit.
However, most circuit verification tasks start from a logic circuit
description of the problem instance. Fortunately, there is a simple
conversion from a logic circuit to a CNF [12] that enables the use
of the CNF representation even for circuit verification tasks. However,
this process loses some information regarding the structure of
the circuit. One example of such structural information is the Circuit
Observability Don't Cares. Several recent papers [6] [7] [8] [9]
[11] [13] have addressed the issue of handling circuit unobservability
in CNF-based SAT. However, as we will demonstrate, none of
these accurately captures the conditions for use of this information
in all stages of a CNF-based SAT solver. In this paper, we propose
a broader approach to take such Don't Care information into
consideration in a CNF-based SAT solver. It does so by adding certain
don't care literals to clauses in the CNF representation. These
don't care literals are treated differently at different times during
the solution process, much like don't cares in logic synthesis. The
major merit of this scheme, unlike other recently proposed techniques,
is that the solver can continue to use this don't care information
during the learning process, which is an important part of
contemporary SAT solvers. We have implemented this approach in
the zChaff SAT solver and experiments show that significant performance
gain can be obtained through their use.
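
A gate-by-gate conversion from a circuit to CNF can be sketched as follows. The sketch uses the standard encoding commonly attributed to Tseitin, which may or may not be the exact conversion cited as [12]; it does not include the don't care literals proposed in the paper. Literals are signed integers in the DIMACS style, and the small circuit, the variable numbering and the helper names are hypothetical.

```python
from itertools import product


def and_gate(z, a, b):
    """Clauses for z <-> (a AND b)."""
    return [[-z, a], [-z, b], [z, -a, -b]]


def or_gate(z, a, b):
    """Clauses for z <-> (a OR b)."""
    return [[z, -a], [z, -b], [-z, a, b]]


def not_gate(z, a):
    """Clauses for z <-> (NOT a)."""
    return [[z, a], [-z, -a]]


def satisfied(clauses, assignment):
    """True if every clause has at least one true literal."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)


# Hypothetical circuit: x6 = (x1 AND x2) OR (NOT x3),
# with internal signals x4 (AND output) and x5 (NOT output).
CNF = and_gate(4, 1, 2) + not_gate(5, 3) + or_gate(6, 4, 5)


def encoding_is_exact():
    """Check that the CNF holds exactly for consistent signal values."""
    for bits in product((False, True), repeat=6):
        v = dict(zip(range(1, 7), bits))
        consistent = (v[4] == (v[1] and v[2])
                      and v[5] == (not v[3])
                      and v[6] == (v[4] or v[5]))
        if satisfied(CNF, v) != consistent:
            return False
    return True
```

The clause list records only the gate relations; which signals are observable at the output under a given assignment is no longer explicit, which is the structural information the abstract refers to.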
Organiser/Moderator: Y.-L. Lin, National Tsing Hua U, Taiwan, ROC
Speakers: J.-Y. Lin, Global UniChip Corp, Taiwan, ROC; L.-G. Chen, ERSO/ITRI & National Taiwan U, ROC;
C.-W. Wu, National Tsing Hua U, Taiwan, ROC
-
Integration, Verification and Layout of a Complex Multimedia SOC [p. 1116]
-
C.-L. Chen, J.-Y. Lin, and Y.-L. Lin
We present our experience of designing a single-chip
controller for an advanced digital still camera, from
specification all the way to mass production. The process
involves collaboration with the camera system designer, IP
vendors, EDA vendors, the silicon wafer foundry, package and
testing houses, and the camera maker. We also work with
academic research groups to develop a JPEG codec IP and
memory BIST and SOC testing methodology. In this
presentation, we cover the problems encountered, our
solutions, and lessons learned.
-
JPEG, MPEG-4, and H.264 Codec IP Development [p. 1118]
-
C.-J. Lian, Y.-W. Huang, H.-C. Fang, Y.-C. Chang, and L.-G. Chen
This paper summarizes our design experiences of various
image and video codec IPs. The design issues and
methodology of custom video codecs are discussed. The design
methodology can be summarized as four stages: system
analysis, algorithm optimization, architecture exploration,
and code development. Based on these guidelines, several
design cases are presented, including the proposed JPEG,
MPEG-4, and H.264 architectures.
-
SOC Testing Methodology and Practice [p. 1120]
-
C.-W. Wu
We apply a novel SOC test integration platform to a
commercial digital still camera (DSC) controller chip,
solving real problems in test scheduling, test IO reduction,
timing of functional test, scan IO sharing, embedded
memory built-in self-test (BIST), etc. The chip has been fabricated
and tested successfully using our approach. Test results
confirm that low test integration cost, short test time, and
small area overhead can be achieved. To support SOC testing,
a memory BIST compiler and an SOC testing integration
system have been developed.
Moderators: F. Hapke, Philips Semiconductors, DE; M. Flottes, LIRMM, FR
-
Evolutionary Optimization in Code-Based Test Compression [p. 1124]
-
I. Polian, A. Czutro, and B. Becker
We provide a general formulation for the code-based test
compression problem with fixed-length input blocks and propose
a solution approach based on Evolutionary Algorithms.
In contrast to existing code-based methods, we allow unspecified
values in matching vectors, which allows encoding of
arbitrary test sets using a relatively small number of codewords.
Experimental results for both stuck-at and path delay
fault test sets for ISCAS circuits demonstrate an improvement
compared to existing techniques.
Keywords: Test compression, code-based compression,
evolutionary algorithms
-
Reconfigurable Linear Decompressors Using Symbolic Gaussian Elimination [p. 1130]
-
K. Balakrishnan and N. Touba
A methodology for designing a reconfigurable linear decompressor
is presented. A symbolic Gaussian elimination
method to solve a constrained Boolean matrix is proposed
and utilized for designing the reconfigurable network. The
proposed scheme can be implemented in conjunction with
any decompressor that has a combinational linear network.
Using the given linear decompressor as a starting point, the
proposed method improves the compression further. A nice
feature of the proposed method is that it can be implemented
with very little hardware overhead. Experimental results indicate
that significant improvements can be achieved.
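
For context, encoding a test cube with a combinational linear decompressor amounts to solving a system of linear equations over GF(2): each specified bit of the cube gives one equation in the free variables that drive the network. The sketch below shows plain (numeric) Gauss-Jordan elimination for that step only. It is not the symbolic, constrained variant proposed in the paper, and the small decompressor equations and the name `solve_gf2` are hypothetical.

```python
def solve_gf2(rows, rhs, nvars):
    """Solve rows * x = rhs over GF(2).

    rows is a list of 0/1 coefficient lists, rhs a list of 0/1 values.
    Returns one solution (free variables set to 0) or None if the
    system is inconsistent.
    """
    aug = [row[:] + [r] for row, r in zip(rows, rhs)]
    pivots = []
    r = 0
    for c in range(nvars):
        pivot = next((i for i in range(r, len(aug)) if aug[i][c]), None)
        if pivot is None:
            continue  # no pivot in this column: free variable
        aug[r], aug[pivot] = aug[pivot], aug[r]
        for i in range(len(aug)):
            if i != r and aug[i][c]:
                # Row addition over GF(2) is bitwise XOR.
                aug[i] = [x ^ y for x, y in zip(aug[i], aug[r])]
        pivots.append(c)
        r += 1
    # A row reading 0 = 1 means the cube cannot be encoded.
    if any(row[-1] and not any(row[:-1]) for row in aug):
        return None
    x = [0] * nvars
    for i, c in enumerate(pivots):
        x[c] = aug[i][-1]
    return x


# Hypothetical decompressor with free variables s0, s1, s2:
# cell0 = s0 ^ s1, cell1 = s1 ^ s2, cell3 = s0.
# The test cube specifies cell0 = 1, cell1 = 0, cell3 = 1.
rows = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
rhs = [1, 0, 1]
seed = solve_gf2(rows, rhs, 3)
```

When the system has no solution the cube cannot be encoded by the fixed network, which is the case a reconfigurable network is meant to reduce.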
-
A Novel Low-Overhead Delay Testing Technique for Arbitrary Two-Pattern Test Application [p. 1136]
-
S. Bhunia, H. Mahmoodi, A. Raychowdhury, and K. Roy
With increasing process fluctuations in nano-scale
technology, testing for delay faults is becoming essential in
manufacturing test to complement stuck-at-fault testing. Design-for-testability
techniques, such as enhanced scan, are typically
associated with considerable overhead in die-area, circuit performance,
and power during normal mode of operation. This
paper presents a novel test technique, which can be used as
an alternative to the enhanced scan based delay fault testing
method, with significantly less design overhead. Instead of using
an extra latch as in the enhanced scan method, we propose using
supply gating at the first level of logic gates to hold the state of a
combinational circuit. Experimental results on a set of ISCAS89
benchmarks show an average reduction of 33% in area overhead
with an average improvement of 71% in delay overhead and 90%
in power overhead during normal mode of operation, compared
to the enhanced scan implementation.
-
Hybrid BIST Based on Repeating Sequences and Cluster Analysis [p. 1142]
-
L. Li and K. Chakrabarty
We present a hybrid BIST approach that extracts the most
frequently occurring sequences from deterministic test patterns;
these extracted sequences are stored on-chip. We use cluster
analysis for sequence extraction, and encode deterministic patterns
on the basis of the stored sequences. Experimental results
for the ISCAS-89 benchmark circuits show that the proposed approach
often requires less on-chip storage and test data volume
than other recent BIST methods.
Moderators: H. van Sommeren, ACE Associated Compiler Experts, NL; P. Marwedel, Dortmund U, DE
-
C Compiler Retargeting Based on Instruction Semantics Models [p. 1150]
-
J. Ceng, M. Hohenauer, G. Braun, R. Leupers, G. Ascheid, and H. Meyr
Efficient architecture exploration and design of application
specific instruction-set processors (ASIPs) requires retargetable
software development tools, in particular C compilers
that can be quickly adapted to new architectures. A
widespread approach is to model the target architecture in
a dedicated architecture description language (ADL) and
to generate the tools automatically from the ADL specification.
For C compiler generation, however, most existing systems
are limited either by the manual retargeting effort or
by redundancies in the ADL models that lead to potential
inconsistencies. We present a new approach to retargetable
compilation, based on the LISA 2.0 ADL with instruction semantics,
that minimizes redundancies while simultaneously
achieving a high degree of automation. The key to our approach
is to generate the mapping rules needed in the compiler's
code selector from the instruction semantics information.
We describe the required analysis and generation
techniques, and present experimental results for several embedded
processors.
-
A Constraint Network Based Approach to Memory Layout Optimization [p. 1156]
-
G. Chen, M. Kandemir, and M. Karakoy
While loop restructuring based code optimization for
array intensive applications has been successful in the
past, it has several problems such as the requirement of
checking dependences (legality issues) and transformation
of all of the array references within the loop body
indiscriminately (while some of the references can benefit
from the transformation, others may not). As a result, data
transformations, i.e., transformations that modify memory
layout of array data instead of loop structure have been
proposed. One of the problems associated with data
transformations is the difficulty of selecting a memory
layout for an array that is acceptable to the entire program
(not just to a single loop). In this paper, we formulate the
problem of determining the memory layouts of arrays as a
constraint network, and explore several methods of
solution in a systematic way. Our experiments provide
strong support in favor of employing constraint
processing, and point out future research directions.
-
Compiler-Based Approach for Exploiting Scratch-Pad in Presence of Irregular Array Access [p. 1162]
-
M. Absar and F. Catthoor
Scratch-pad memory is becoming an important fixture in
embedded multimedia systems. It is significantly more efficient
than the cache, in performance and power, and has the added
advantage of better timing-predictability. Current techniques
for the management of the scratch-pad are quite mature in
the case of arrays accessed in a regular fashion, i.e. inside
nested-loop by index expressions which are affine functions
of the loop-iterators. Many multimedia codes, however, also
use arrays as subscripted variables in the index expression
of other arrays, thereby making the access pattern irregular.
Existing techniques fail in such cases, bringing down the
performance. In this paper, we extend the framework that
exists today, to the case of irregular access. We provide
a clear and precise compiler-based technique for analyzing
irregular array-access, and efficiently mapping such arrays
to the scratch-pad. On average, a 20% reduction in energy
consumption was achieved for a set of realistic applications
using our methods.
-
Structural Testing Based on Minimum Kernels [p. 1168]
-
E. Dubrova
Structural testing techniques, such as statement and
branch coverage, play an important role in improving dependability
of software systems. However, finding a set of
tests which guarantees high coverage is a time-consuming
task. In this paper we present a technique for structural
testing based on kernel computation. A kernel satisfies the
property that any set of tests which executes all vertices
(edges) of the kernel executes all vertices (edges) of the
program's flowgraph. We present a linear-time algorithm
for computing minimum kernels based on pre- and post-dominator
relations of a flowgraph.
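
The pre-dominator relation mentioned above can be illustrated with a short Python sketch. This is the textbook iterative data-flow computation, not the linear-time kernel algorithm of the paper; the function name `dominators` and the graph encoding (a dict mapping every vertex to its successor list) are assumptions made for illustration.

```python
def dominators(succ, entry):
    """Return {vertex: set of vertices that dominate it} for a flowgraph.

    succ maps every vertex to a list of its successors (hypothetical encoding).
    """
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, outs in succ.items():
        for m in outs:
            pred[m].add(n)

    # Start from the most optimistic guess and shrink until a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# Diamond-shaped flowgraph: a -> b -> d and a -> c -> d.
flow = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
dom = dominators(flow, "a")
```

In this example `a` dominates `d`, so any test that executes `d` also executes `a`; relations of this kind are what allow a small set of vertices to stand in for the whole flowgraph.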
Moderators: G. Nicolescu, Ecole Polytechnique de Montréal, CA; W. Cesário, TIMA Laboratory, FR
-
An Application-Specific Design Methodology for STbus Crossbar Generation [p. 1176]
-
S. Murali and G. De Micheli
As the communication requirements of current and future
Multiprocessor Systems on Chips (MPSoCs) continue to increase,
scalable communication architectures are needed to
support the heavy communication demands of the system.
This is reflected in the recent trend that many of the standard
bus products such as STbus, have now introduced the
capability of designing a crossbar with multiple buses operating
in parallel. The crossbar configuration should be designed
to closely match the application traffic characteristics
and performance requirements. In this work we address
this issue of application-specific design of optimal crossbar
(using STbus crossbar architecture), satisfying the performance
requirements of the application and optimal binding
of cores onto the crossbar resources. We present a simulation
based design approach that is based on analysis of actual
traffic trace of the application, considering local variations
in traffic rates, temporal overlap among traffic streams
and criticality of traffic streams. Our methodology is applied
to several MPSoC designs and the resulting crossbar
platforms are validated for performance by cycle-accurate
SystemC simulation of the designs. The experimental case
studies show large reduction in packet latencies (up to 7x)
and large crossbar component savings (up to 3.5x) compared
to traditional design approaches.
Keywords: Systems on Chips, Networks on Chips,
crossbar, bus, application-specific, SystemC.
-
A Design Flow for Application-Specific Networks on Chip with Guaranteed Performance to
Accelerate SOC Design and Verification [p. 1182]
-
K. Goossens, J. Dielissen, O. Gangwal, S. Gonzalez Pestana, A. Radulescu, and E. Rijpkema
Systems on chip (SOC) are composed of intellectual property
blocks (IP) and interconnect. While mature tooling exists to design
the former, tooling for interconnect design is still a research area.
In this paper we describe an operational design flow that generates
and configures application-specific network on chip (NOC) instances,
given application communication requirements. The NOC
can be simulated in SystemC and RTL VHDL. An independent performance
verification tool verifies analytically that the NOC instance
(hardware) and its configuration (software) together meet
the application performance requirements. The Æthereal NOC's
guaranteed performance is essential to replace time-consuming
simulation by fast analytical performance validation. As a result,
application-specific NOCs that are guaranteed to meet the application's
communication requirements are generated and verified
in minutes, reducing the number of design iterations. A realistic
MPEG SOC example substantiates our claims.
-
xpipes Lite: A Synthesis Oriented Design Library for Networks on Chips [p. 1188]
-
S. Stergiou, G. De Micheli, F. Angiolini, D. Bertozzi, S. Carta, and L. Raffo
The limited scalability of current bus topologies for Systems
on Chips (SoCs) dictates the adoption of Networks on
Chips (NoCs) as a scalable interconnection scheme. Current
SoCs are highly heterogeneous in nature, denoting homogeneous,
preconfigured NoCs as inefficient drop-in alternatives.
While highly parametric, fully synthesizeable (soft) NoC
building blocks appear as a good match for heterogeneous
MPSoC architectures, the impact of instantiation-time flexibility
on performance, power and silicon cost has not been
quantified yet. This work details xpipes Lite, a design flow
for automatic generation of heterogeneous NoCs. xpipes
Lite is based on highly customizable, high frequency and
low latency NoC modules that are fully synthesizeable. Synthesis
results yield modules that are directly comparable to,
if not better than, the currently published state-of-the-art
NoCs in terms of area, power, latency and target frequency
of operation.
Moderators: V. Narayanan, Penn State U, US; M. Poncino, Politecnico di Torino, IT
-
Yield Enhancement of Digital Microfluidics-Based Biochips Using
Space Redundancy and Local Reconfiguration [p. 1196]
-
F. Su, K. Chakrabarty, and V. Pamula
As microfluidics-based biochips become more complex, manufacturing yield will have significant
influence on production volume and product cost. We propose an interstitial redundancy
approach to enhance the yield of biochips that are based on droplet-based microfluidics.
In this design method, spare cells are placed in the interstitial sites within the
microfluidic array, and they replace neighboring faulty cells via local reconfiguration.
The proposed design method is evaluated using a set of concurrent real-life bioassays.
-
Design of Fault-Tolerant and Dynamically-Reconfigurable Microfluidic Biochips [p. 1202]
-
F. Su and K. Chakrabarty
Microfluidics-based biochips are soon expected to revolutionize clinical diagnosis, DNA sequencing, and other laboratory procedures involving molecular biology. Most microfluidic biochips are based on the principle of continuous fluid flow and they rely on permanently-etched microchannels, micropumps, and microvalves. We focus here on the automated design of "digital" droplet-based microfluidic biochips. In contrast to continuous-flow systems, digital microfluidics offers dynamic reconfigurability; groups of cells in a microfluidics array can be reconfigured to change their functionality during the concurrent execution of a set of bioassays. We present a simulated annealing-based technique for module placement in such biochips. The placement procedure not only addresses chip area, but it also considers fault tolerance, which allows a microfluidic module to be relocated elsewhere in the system when a single cell is detected to be faulty. Simulation results are presented for a case study involving the polymerase chain reaction.
-
Quantum Circuit Simplification Using Templates [p. 1208]
-
D. Maslov, C. Young, D. Miller, and G. Dueck
Optimal synthesis of quantum circuits is intractable and
heuristic methods must be employed. Templates are a general
approach to reversible and quantum circuit simplification.
In this paper, we consider the use of templates to
simplify a quantum circuit initially found by other means.
We present and analyze templates in the general case, and
then provide particular details for circuits composed of
NOT, CNOT and controlled-sqrt-of-NOT gates. We introduce
templates for this set of gates and apply them to simplify
both known quantum realizations of Toffoli gates and
circuits found by earlier heuristic Fredkin and Toffoli gate
synthesis algorithms. While the number of templates is quite
small, the reduction in quantum cost is often significant.
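
As a minimal illustration of the simplest kind of rewrite that templates generalize, the Python sketch below cancels adjacent identical NOT and CNOT gates, using the fact that each of these gates is its own inverse. It is not the authors' template-matching procedure; the gate encoding (tuples of a name followed by qubit indices) and the function name `cancel_adjacent_pairs` are assumptions made for illustration. Controlled-sqrt-of-NOT gates are deliberately left untouched because they are not self-inverse.

```python
SELF_INVERSE = {"NOT", "CNOT"}  # gates G with G * G = identity


def cancel_adjacent_pairs(circuit):
    """Remove adjacent identical self-inverse gates, repeating until none remain."""
    stack = []
    for gate in circuit:
        if stack and stack[-1] == gate and gate[0] in SELF_INVERSE:
            stack.pop()          # the pair multiplies to the identity
        else:
            stack.append(gate)
    return stack


circuit = [
    ("NOT", 0),
    ("CNOT", 0, 1),
    ("CNOT", 0, 1),
    ("NOT", 0),
    ("CNOT", 1, 2),
]
simplified = cancel_adjacent_pairs(circuit)
```

The stack makes cancellations cascade: once the two CNOT gates are removed, the two NOT gates become adjacent and are removed as well.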
-
Towards Designing Robust QCA Architectures in the Presence of Sneak Noise Paths [p. 1214]
-
K. Kim, K. Wu, and R. Karri
Quantum-dot Cellular Automata (QCA) are attracting a lot of
attention due to their extremely small feature sizes and
ultra-low power consumption. Several designs using QCA
technology have been proposed to date.
However, we found that not all of these designs function properly.
Further, no general design guidelines have been proposed
so far. A straightforward extension of a simple functional
design pattern may fail. This makes designing large-scale
circuits using QCA technology an extremely time-consuming
process. In this paper we show several critical
vulnerabilities in the structures of primitive QCA gates and
QCA interconnects, and propose a disciplinary guideline to
prevent any additional plausible but malfunctioning QCA
designs.
Moderator: C. Paulus, Infineon Technologies, DE
-
CMOS-Based Biosensor Arrays [p. 1222]
-
R. Thewes, C. Paulus, M. Schienle, F. Hofmann, A. Frey, R. Brederlow, M. Augustyniak, M. Jenkner,
B. Eversmann, P. Schindler-Bauer, M. Atzesberger, B. Holzapfl, G. Beer, T. Haneder, and H.-C. Hanke
CMOS-based sensor array chips provide new and
attractive features as compared to today's standard tools
for medical, diagnostic, and biotechnical applications.
Examples for molecule- and cell-based approaches and
related circuit design issues are discussed.
Moderators: G. De Micheli, EPFL Lausanne, CH; D. Bertozzi, DEIS - Bologna U, IT
-
A Router Architecture for Connection-Oriented Service Guarantees in the
MANGO Clockless Network-on-Chip [p. 1226]
-
T. Bjerregaard and J. Sparsø
On-chip networks for future system-on-chip designs need
simple, high performance implementations. In order to promote
system-level integrity, guaranteed services (GS) need
to be provided. We propose a network-on-chip (NoC) router
architecture to support this, and demonstrate with a CMOS
standard cell design. Our implementation is based on clockless
circuit techniques, and thus inherently supports a modular,
GALS-oriented design flow. Our router exploits virtual
channels to provide connection-oriented GS, as well as
connection-less best-effort (BE) routing. The architecture
is highly flexible, in that support for different types of BE
routing and GS arbitration can be easily plugged into the
router.
-
A Quality-of-Service Mechanism for Interconnection Networks in System-on-Chips [p. 1232]
-
W.-D. Weber, I. Swarbrick, J. Chou, and D. Wingard
As Moore's Law continues to fuel the ability to build ever increasingly
complex system-on-chips (SoCs), achieving performance
goals is rising as a critical challenge to completing designs. In
particular, the system interconnect must efficiently service a
diverse set of data flows with widely ranging quality-of-service
(QoS) requirements. However, the known solutions for off-chip
interconnects such as large-scale networks are not necessarily
applicable to the on-chip environment. Latency and memory constraints
for on-chip interconnects are quite different from larger-scale
interconnects. This paper introduces a novel on-chip interconnect
arbitration scheme. We show how this scheme can be distributed
across a chip for high-speed implementation. We compare
the performance of the arbitration scheme with other known interconnect
arbitration schemes. Existing schemes typically focus
heavily on either low latency of service for some initiators, or
alternatively on guaranteed bandwidth delivery for other initiators.
Our scheme allows service latency on some initiators to be
traded off smoothly against jitter bounds on other initiators, while
still delivering bandwidth guarantees. This scheme is a subset of
the QoS controls that are available in the SonicsMX (SMX)
product.
-
A Technology-Aware and Energy-Oriented Topology Exploration for On-Chip Networks [p. 1238]
-
H. Wang, L.-S. Peh, and S. Malik
As packet-switching interconnection networks replace buses and dedicated
wires to become the standard on-chip interconnection fabric,
reducing their power consumption has been identified to be a major
design challenge. Network topologies have high impact on network
power consumption. Technology scaling is another important factor
that affects network power since each new technology changes semiconductor
physical properties. As shown in this paper, these two aspects
need to be considered synergistically.
In this paper, we characterize the impact of process technologies on
network energy for a range of topologies, starting from 2-dimensional
meshes/tori, to variants of meshes/tori that incorporate higher dimensions,
multiple hierarchies and express channels. We present a method
which uses an analytical model to predict the most energy-efficient
topology based on network size and architecture parameters for future
technologies. Our model is validated against cycle-accurate network
power simulation and shown to arrive at the same predictions. We
also show how our method can be applied to actual parallel benchmarks
with a case study. We see this work as a starting point for
defining a roadmap of future on-chip networks.
Moderators: C. Silvano, Politecnico di Milano, IT; P. Pop, Linkoping U, SE
-
ISEGEN: Generation of High-Quality Instruction Set Extensions by Iterative Improvement [p. 1246]
-
P. Biswas, S. Banerjee, N. Dutt, L. Pozzi, and P. Ienne
Customization of processor architectures through Instruction
Set Extensions (ISEs) is an effective way to meet
the growing performance demands of embedded applications.
A high-quality ISE generation approach needs to obtain
results close to those achieved by experienced designers,
particularly for complex applications that exhibit regularity:
expert designers are able to exploit manually such
regularity in the data flow graphs to generate high-quality
ISEs. In this paper, we present ISEGEN, an approach that
identifies high-quality ISEs by iterative improvement following
the basic principles of the well-known Kernighan-Lin (K-L)
min-cut heuristic. Experimental results on a
number of MediaBench, EEMBC and cryptographic applications
show that our approach matches the quality of the
optimal solution obtained by exhaustive search. We also
show that our ISEGEN technique is on average 20x faster
than a genetic formulation that generates equivalent solutions.
Furthermore, the ISEs identified by our technique exhibit
35% more speedup than the genetic solution on a large
cryptographic application (AES) by effectively exploiting its
regular structure.
-
Behavioural Transformation to Improve Circuit Performance in High-Level Synthesis [p. 1252]
-
R. Ruiz-Sautua, M. Molina, J. Mendías, and R. Hermida
Early scheduling algorithms usually adjusted the clock
cycle duration to the execution time of the slowest
operation. This resulted in large slack times wasted in
those cycles executing faster operations. To reduce the
wasted time, multi-cycle and chaining techniques have
been employed. While these techniques have produced
successful designs, their effectiveness is often limited due to
the area increment that may derive from chaining, and the
extra latencies that may derive from multicycling. In this
paper we present an optimization method that solves the
time-constrained scheduling problem by transforming
behavioural specifications into new ones whose subsequent
synthesis substantially improves circuit performance. Our
proposal breaks up some of the specification operations,
allowing their execution during several possibly
non-consecutive cycles, and also the calculation of several
data-dependent operation fragments in the same cycle. To
do so, it takes into account the circuit latency and the
execution time of every specification operation. The
experimental results show that circuits obtained
from the optimized specification are on average 60% faster
than those synthesized from the original specification, with
only slight increments in the circuit area.
-
Reliability-Centric High-Level Synthesis [p. 1258]
-
S. Tosun, N. Mansouri, E. Arvas, M. Kandemir, and Y. Xie
The importance of addressing soft errors in both safety-critical
applications and commercial consumer products is increasing,
mainly due to ever shrinking geometries, higher-density circuits,
and employment of power-saving techniques such as voltage
scaling and component shut-down. As a result, it is becoming
necessary to treat reliability as a first-class citizen in system
design. In particular, reliability decisions taken early in system
design can have significant benefits in terms of design quality.
Motivated by this observation, this paper presents a
reliability-centric high-level synthesis approach that addresses the soft error
problem. The proposed approach tries to maximize reliability of
the design while observing the bounds on area and performance,
and makes use of our reliability characterization of hardware
components such as adders and multipliers. We implemented the
proposed approach, performed experiments with several designs,
and compared the results with those obtained by a prior proposal.
-
PBExplore: A Framework for Compiler-in-the-Loop Exploration of Partial
Bypassing in Embedded Processors [p. 1264]
-
A. Shrivastava, A. Nicolau, N. Dutt, and E. Earlie
Varying partial bypassing in pipelined processors is an
effective way to make performance, area and energy trade-offs
in embedded processors. However, performance evaluation
of partial bypassing in processors has been inaccurate,
largely due to the absence of bypass-sensitive retargetable
compilation techniques. Furthermore no existing
partial bypass exploration framework estimates the power
and cost overhead of partial bypassing. In this paper we
present PBExplore: A framework for Compiler-in-the-Loop
exploration of partial bypassing in processors. PBExplore
accurately evaluates the performance of a partially bypassed
processor using a generic bypass-sensitive compilation
technique. It synthesizes the bypass control logic and
estimates the area and energy overhead of each bypass configuration.
PBExplore is thus able to effectively perform
multi-dimensional exploration of the partial bypass design
space. We present experimental results on the Intel XScale
architecture on MiBench benchmarks and demonstrate the
need, utility and exploration capabilities of PBExplore.
Moderators: S. Piestrak, TU Wroclaw, PL; M. Nicolaidis, iRoC, FR
-
Concurrent Error Detection in Asynchronous Burst-Mode Controllers [p. 1272]
-
S. Almukhaizim and Y. Makris
We discuss the problem of Concurrent Error Detection
(CED) in a popular class of asynchronous controllers,
namely Burst-Mode machines. We first outline the particularities
of these clock-less circuits, including the use of redundancy
to ensure hazard-free operation, and we explain how
they limit the applicability and effectiveness of traditional
CED methods, such as duplication. We then demonstrate
how duplication can be enhanced to resolve these limitations
through additional hardware for comparison synchronization
and detection of error-induced hazards, which jeopardize
the interaction of the circuit with its environment. Finally,
we propose a Transition-Triggered CED method which
employs a transition prediction function to eliminate the need
for hazard detection circuitry and hazard-free implementation
of the duplicate. As indicated by experimental results,
the proposed method reduces significantly the cost of CED,
with an average of 22% in hardware savings.
-
Reliable System Specification for Self-Checking Data-Paths [p. 1278]
-
C. Bolchini, F. Salice, D. Sciuto, and L. Pomante
The design of reliable circuits has received a lot of attention
in the past, leading to the definition of several design
techniques introducing fault detection and fault tolerance
properties in systems for critical applications/environments.
Such design methodologies tackled the problem at different
abstraction levels, from switch-level to logic, RT level,
and more recently to system level. The aim of this paper is to
introduce a novel system-level technique based on the redefinition
of the operators' functionality in the system specification.
This technique provides reliability properties to
the system data path, transparently with respect to the designer.
Feasibility, fault coverage, performance degradation
and overheads are investigated on a FIR circuit.
-
Evaluation of Error-Resilience for Reliable Compression of Test Data [p. 1284]
-
H. Hashempour, L. Schiano, and F. Lombardi
This paper addresses error-resilience as the capability to
tolerate bit-flips in a compressed test data stream (which is
transferred from an Automatic Test Equipment (ATE) to the
Device-Under-Test (DUT)). In an ATE, bit-flips may occur
in either the electronics components of the loadboard, or the
high speed serial communication links (between the user interface
workstation and the head). It is shown that errors
caused by bit-flips can seriously degrade the test quality (as
measured by coverage) of the compressed data streams. The
effects of bit-flips on compression are analyzed and various
test data compression techniques are evaluated. It is shown
that for benchmark circuits, coverage of test sets can be reduced
by 10%-30%.
Index terms: error resilience, fault tolerance, yield, reliable
operation of ATE, compression.
-
On the Optimal Design of Triple Modular Redundancy Logic for SRAM-Based FPGAs [p. 1290]
-
F. Kastensmidt, L. Sterpone, M. Sonza Reorda, and L. Carro
Triple Modular Redundancy (TMR) is a suitable fault-tolerance
technique for SRAM-based FPGAs. However, one
of the main challenges in achieving 100% robustness in
designs protected by TMR running on programmable
platforms is to prevent upsets in the routing from
provoking undesirable connections between signals from
distinct redundant logic parts, which can generate an
error in the output. This paper investigates the optimal
design of the TMR logic (e.g., by cleverly inserting voters)
to ensure robustness. Four different versions of a TMR
digital filter were analyzed by fault injection. Faults were
randomly inserted straight into the bitstream of the
FPGA. The experimental results presented in this paper
demonstrate that the number and placement of voters in
the TMR design can directly affect the fault tolerance:
the percentage of routing upsets able to cause an error
in the TMR circuit ranges from 4.03% to 0.98%.
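
The voting step that TMR depends on can be sketched in a few lines of Python. This is only an illustration of bitwise majority voting over three redundant copies, not the FPGA voter placement studied in the paper; the function name `majority_vote` and the example values are assumptions.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant words: each output bit is set
    when at least two of the three inputs have that bit set."""
    return (a & b) | (a & c) | (b & c)


golden = 0b10110010
upset = golden ^ 0b00001000      # a single bit-flip in one redundant copy
voted = majority_vote(golden, upset, golden)
```

A single faulty copy is outvoted, so `voted` equals `golden`. The voter cannot help when an upset corrupts two copies at once, which is the routing-induced failure mode the abstract is concerned with.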
Moderators: D. Stoffel, Kaiserslautern U, DE; G. Cabodi, Politecnico di Torino, IT
-
Automatic Formal Verification of Fused-Multiply-Add FPUs [p. 1298]
-
C. Jacobi, K. Weber, V. Paruthi, and J. Baumgartner
In this paper we describe a fully-automated methodology for formal
verification of fused-multiply-add floating point units (FPUs).
Our methodology verifies an implementation FPU against a simple
reference model derived from the processor's architectural specification,
which may include all aspects of the IEEE specification
including denormal operands and exceptions. Our strategy uses
a combination of BDD- and SAT-based symbolic simulation. To
make this verification task tractable, we use a combination of
case-splitting, multiplier isolation, and automatic model reduction techniques.
The case-splitting is defined only in terms of the reference
model, which makes this approach easily portable to new designs.
The methodology is directly applicable to multi-GHz industrial implementation
models (e.g., HDL or gate-level circuit representations)
that contain all details of the high-performance transistor-level
model, such as aggressive pipelining, clocking, etc. Experimental
results are provided to demonstrate the computational efficiency of this approach.
-
Refinement Maps for Efficient Verification of Processor Models [p. 1304]
-
P. Manolios and S. Srinivasan
While most of the effort in improving verification times
for pipeline machine verification has focused on faster decision
procedures, we show that the refinement maps used
also have a drastic impact on verification times. We introduce
a new class of refinement maps for pipelined machine
verification, and using the state-of-the-art verification tools
UCLID and Siege we show that one can attain several orders
of magnitude improvements in verification times over
the standard flushing-based refinement maps, even enabling
the verification of machines that are too complex to otherwise
automatically verify.
-
Functional Equivalence Checking for Verification of Algebraic Transformations on
Array-Intensive Source Code [p. 1310]
-
K. Shashidhar, F. Catthoor, M. Bruynooghe, and G. Janssens
Development of energy and performance-efficient embedded
software is increasingly relying on application of complex
transformations on the critical parts of the source code.
Designers applying such nontrivial source code transformations
are often faced with the problem of ensuring functional
equivalence of the original and transformed programs. Currently
they have to rely on incomplete and time-consuming
simulation. Formal automatic verification of the transformed
program against the original is instead desirable. This calls
for equivalence checking tools similar to the ones available
for comparing digital circuits. We present such a tool to compare
array-intensive programs related through a combination
of important global transformations like expression propagations,
loop and algebraic transformations. When the transformed
program fails to pass the equivalence check, the tool
provides specific feedback on the possible locations of errors.
Moderators: D. Stroobandt, Ghent U, BE; M. Berkelaar, Magma Design Automation, NL
-
Encoding-Based Minimization of Inductive Cross-Talk for Off-Chip Data Transmission [p. 1318]
-
B. LaMeres and S. Khatri
Inductive cross-talk within IC packaging is becoming a significant
bottleneck in high-speed inter-chip communication.
The parasitic inductance within IC packaging causes bounce
on the power supply pins in addition to glitches and rise-time
degradation on the signal pins. Until recently, the parasitic inductance
problem was addressed by aggressive package design.
In this work we present a technique to encode the off-chip data
transmission to limit bounce on the supplies and reduce inductive
signal coupling due to transitions on neighboring signal
lines. Both these performance limiting factors are modeled in
a common mathematical framework. Our experimental results
show that the proposed encoding based techniques result in reduced
supply bounce and signal degradation due to inductive
cross-talk, closely matching the theoretical predictions. We
demonstrate that the overall bandwidth of a bus actually increases
by 85% using our technique, even after accounting for
the encoding overhead. The asymptotic bus size overhead is
between 30% and 50%, depending on how stringent the user-specified
inductive cross-talk parameters are.
-
An O(bn^2) Time Algorithm for Optimal Buffer Insertion with b Buffer Types [p. 1324]
-
Z. Li and W. Shi
Buffer insertion is a popular technique to reduce the interconnect
delay. The classic buffer insertion algorithm of
van Ginneken has time complexity O(n^2), where n is the
number of buffer positions. Lillis, Cheng and Lin extended
van Ginneken's algorithm to allow b buffer types in time
O(b^2 n^2). For modern design libraries that contain hundreds
of buffers, it is a serious challenge to balance the
speed and performance of the buffer insertion algorithm.
In this paper, we present a new algorithm that computes
the optimal buffer insertion in O(bn^2) time. The reduction
is achieved by the observation that the (Q,C) pairs of the
candidates that generate the new candidates must form a
convex hull. On industrial test cases, the new algorithm is
faster than the previous best buffer insertion algorithms by
orders of magnitude.
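
For readers unfamiliar with the (Q,C) candidates mentioned above, the following Python sketch shows the usual dominance pruning in van Ginneken-style dynamic programming, together with a brute-force choice of the best candidate for a single buffer type. It is not the paper's convex-hull algorithm. The function names, the tuple layout (Q = required arrival time, C = downstream capacitance) and the linear buffer delay model `k + r * C` are assumptions made for illustration.

```python
def prune_dominated(cands):
    """Keep only candidates (Q, C) that are not dominated, i.e. for which no
    other candidate has both a required time at least as large and a
    capacitance at least as small."""
    kept = []
    best_q = float("-inf")
    # Scan by increasing C (ties: larger Q first); keep a candidate only if
    # it improves on the best Q seen so far.
    for q, c in sorted(cands, key=lambda qc: (qc[1], -qc[0])):
        if q > best_q:
            kept.append((q, c))
            best_q = q
    return kept


def best_for_buffer(cands, r, k):
    """Candidate that maximizes the required time seen at the input of a
    buffer with assumed delay k + r * C (brute force over all candidates)."""
    return max(cands, key=lambda qc: qc[0] - r * qc[1] - k)


candidates = [(10, 5), (8, 5), (12, 7), (9, 6), (12, 9)]
frontier = prune_dominated(candidates)            # [(10, 5), (12, 7)]
weak = best_for_buffer(frontier, r=2.0, k=1.0)    # (10, 5)
strong = best_for_buffer(frontier, r=0.5, k=1.0)  # (12, 7)
```

Under this assumed delay model the score `Q - r * C` is linear in (Q, C), so for any `r` the best candidate is a vertex of the convex hull of the candidate set, which is consistent with the observation stated in the abstract.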
-
RIP: An Efficient Hybrid Repeater Insertion Scheme for Low Power [p. 1330]
-
X. Liu, Y. Peng, and M. Papaefthymiou
This paper presents a novel repeater insertion algorithm
for interconnect power minimization. The novelty of our approach
is in the judicious integration of an analytical solver
and a dynamic programming based method. Specifically, the
analytical solver chooses a concise repeater library and a
small set of repeater location candidates such that the dynamic
programming algorithm can be performed fast with
little degradation of the solution quality. In comparison with
previously reported repeater insertion schemes, within comparable
runtimes, our approach achieves up to 37% higher
power savings. Moreover, for the same design quality, our
scheme attains a speedup of two orders of magnitude.
Organiser/Moderator: C. Paulus, Infineon Technologies, DE
Speakers: B. Vigna, STMicroelectronics, IT; R. Campagnolo, CEA G/LETI, FR; K.-U. Kirstein, ETH Zurich, CH
-
eMICAM: A New Generation of Active DNA Chip with in Situ Electrochemical Detection [p. 1338]
-
R. Campagnolo
Most of the DNA chips available on the market are
based on external or internal optical detection
(fluorescence or chemiluminescence) and need a bulky
chip reader (optics, laser, camera or PMT). We will
present a new detection strategy using direct
electrochemical detection of DNA hybridisation using
conductive polymers grafted on an active silicon chip. We
will report results on the first step of the fabrication
process and place emphasis on full wafer electro-polymerisation
of DNA probes on modified CMOS technology.
-
Cantilever-Based Biosensors in CMOS Technology [p. 1340]
-
K.-U. Kirstein, Y. Li, M. Zimmermann, C. Vancura, T. Volden,
W. Song, J. Lichtenberg, and A. Hierlemann
Single-chip CMOS-based biosensors that feature
microcantilevers as transducer elements are presented.
The cantilevers are functionalized for the capturing of
specific analytes, e.g., proteins or DNA. The binding of
the analyte changes the mechanical properties of the
cantilevers such as surface stress and resonant frequency,
which can be detected by an integrated Wheatstone
bridge. The monolithic integrated readout allows for a
high signal-to-noise ratio, lowers the sensitivity to
external interference and enables autonomous device
operation.
|