DATE 2004 ABSTRACTS
Volume I
Moderator: J. Figueras, UP Catalunya, ES
-
Opportunities and Challenges in Building Silicon Products in 65nm and Beyond [p. 2]
-
G. Spirakis
Moderators: J. Henkel, NEC, US; A. Macii, Politecnico di Torino, IT
-
Fine-Grained Dynamic Voltage and Frequency Scaling for Precise Energy and
Performance Trade-Off Based on the Ratio of Off-Chip Access to On-Chip Computation Times [p. 4]
-
K. Choi, R. Soma, and M. Pedram
This paper presents an intra-process dynamic voltage and
frequency scaling (DVFS) technique targeted toward non real-time
applications running on an embedded system platform. The key idea
is to make use of runtime information about the external memory
access statistics in order to perform CPU voltage and frequency
scaling with the goal of minimizing the energy consumption while
translucently controlling the performance penalty. The proposed
DVFS technique relies on dynamically-constructed regression
models that allow the CPU to calculate the expected workload and
slack time for the next time slot, and thus, adjust its voltage and
frequency in order to save energy while meeting soft timing
constraints. This is in turn achieved by estimating and exploiting the
ratio of the total off-chip access time to the total on-chip
computation time. The proposed technique has been implemented
on an XScale-based embedded system platform and actual energy
savings have been calculated by current measurements in hardware.
For memory-bound programs, a CPU energy saving of more than
70% with a performance degradation of 12% was achieved. For
CPU-bound programs, a CPU energy saving of 15-60% was achieved
at the cost of a 5-20% performance penalty.
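The role of the off-chip to on-chip time ratio can be illustrated with a small sketch. It assumes a simplified timing model in which on-chip computation time scales inversely with CPU frequency while off-chip access time does not; the function name, the frequency list, and the slowdown budget are hypothetical, and the regression models described in the abstract are not reproduced here.

```python
def pick_frequency(t_on_at_fmax, t_off, freqs, max_slowdown):
    """Return the lowest frequency whose predicted run time fits the slowdown budget.

    Assumed model (not from the paper): T(f) = t_on_at_fmax * f_max / f + t_off.
    """
    f_max = max(freqs)
    baseline = t_on_at_fmax + t_off            # run time at full speed
    budget = baseline * (1.0 + max_slowdown)   # tolerated run time
    for f in sorted(freqs):
        predicted = t_on_at_fmax * f_max / f + t_off
        if predicted <= budget:
            return f
    return f_max


freqs_mhz = [200, 400, 600, 800]
# Memory-bound slot: most of the time is spent waiting on off-chip accesses.
memory_bound = pick_frequency(1.0, 9.0, freqs_mhz, 0.25)
# CPU-bound slot: most of the time is on-chip computation.
cpu_bound = pick_frequency(9.0, 1.0, freqs_mhz, 0.25)
```

Under this model the memory-bound slot tolerates a much lower frequency than the CPU-bound slot for the same slowdown budget, which is consistent with the larger savings reported for memory-bound programs.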
-
Hybrid Architectural Dynamic Thermal Management [p. 10]
-
K. Skadron
When an application or external environmental conditions
cause a chip's cooling capacity to be exceeded, dynamic
thermal management (DTM) dynamically reduces the
power density on the chip to maintain safe operating temperatures.
The challenge is that even though this reduction
in power density reduces heat dissipation and can be used
to regulate temperature and reduce the need for expensive
thermal packages, reducing power density may come at a
cost in execution speed. This paper shows the importance of
processor-architecture techniques for DTM, and proposes
a new, "hybrid," low-overhead implementation based on
combining fetch gating and dynamic voltage scaling (DVS).
When thermal stress is low, fetch gating is superior because
it exploits instruction-level parallelism (ILP). Once thermal
stress becomes severe enough that fetch gating degrades
ILP, DVS is engaged instead to take advantage of its greater
ability to reduce power density. We show that under a variety
of assumptions about DVS implementation, a hybrid
policy reduces DTM performance overhead by 25% on average
compared to DVS, and is easy to design.
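A hybrid policy of this kind can be sketched as a simple selector. The temperature thresholds below are hypothetical placeholders; in the abstract the switch to DVS is tied to the point where fetch gating starts to degrade ILP, which this sketch only approximates with a fixed crossover temperature.

```python
def hybrid_dtm_action(temp_c, trigger_c=80.0, crossover_c=82.0):
    """Pick a DTM response for a sensor reading (both thresholds are assumptions)."""
    if temp_c < trigger_c:
        return "none"            # no thermal stress: run at full speed
    if temp_c < crossover_c:
        return "fetch_gating"    # mild stress: throttle instruction fetch
    return "dvs"                 # severe stress: lower voltage and frequency


actions = [hybrid_dtm_action(t) for t in (75.0, 81.0, 85.0)]
```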
-
Value-Conscious Cache: Simple Technique for Reducing Cache Access Power [p. 16]
-
Y. Chang, C. Yang, and F. Lai
Most microprocessors employ on-chip caches to
bridge the performance gap between the processor and main
memory. However, the cache accesses usually contribute
significantly to the total power consumption of the chip. Based
on the observation that an overwhelming majority of the cache
access bits are '0', in this paper we propose a value-conscious
(VC) cache to reduce the average cache power consumption
during an access. Unlike the conventional cache with
differential-bitline implementation, the VC cache is a single-bitline
design. Depending on the access bit value, the VC
cache can dynamically prevent the bitline from being
discharged such that the power dissipated in accessing '0' is
much less than the power dissipated in accessing '1'. The
implementation of the VC cache is a circuit-level technique,
which is software independent and orthogonal to other low-power
techniques at the architecture level. The experimental
results based on the SPEC2000 and MediaBench traces show
that, without compromising either performance or stability, by
exploiting the prevalence of '0' bits in access data the VC
cache can reduce the average cache read and write power by
about 18-22% and 36-40%, respectively.
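The observation behind the VC cache can be made concrete with a toy energy model. The per-bit energy values are hypothetical and unitless; the sketch only shows how a cost that depends on the bit value rewards data dominated by '0' bits, and it does not model the circuit itself.

```python
def count_ones(words, width):
    """Count '1' bits across all words, considering only the low `width` bits."""
    mask = (1 << width) - 1
    return sum(bin(w & mask).count("1") for w in words)


def zero_bit_fraction(words, width):
    """Fraction of accessed bits that are '0'."""
    total = len(words) * width
    return (total - count_ones(words, width)) / total


def access_energy(words, width, e_zero, e_one):
    """Total access energy when a '0' costs e_zero and a '1' costs e_one (assumed values)."""
    ones = count_ones(words, width)
    zeros = len(words) * width - ones
    return zeros * e_zero + ones * e_one


trace = [0x00, 0x01, 0xFF]
fraction = zero_bit_fraction(trace, 8)
value_conscious = access_energy(trace, 8, e_zero=0.25, e_one=1.0)
value_blind = access_energy(trace, 8, e_zero=1.0, e_one=1.0)
```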
-
State-Preserving vs. Non-State-Preserving Leakage Control in Caches [p. 22]
-
D. Parikh, K. Sankaranarayanan, Y. Li, K. Skadron, Y. Zhang, and M. Stan
This paper compares the effectiveness of state-preserving
and non-state-preserving techniques for leakage
control in caches by comparing drowsy cache and
gated-Vss for data caches using 70nm technology parameters.
To perform the comparison, we introduce 'HotLeakage',
a new architectural model for subthreshold and
gate leakage that explicitly models the effects of temperature,
voltage, and parameter variations, and has the ability
to recalculate leakage currents dynamically as temperature
and voltage change at runtime due to operating
conditions, DVS techniques, etc.
By comparing drowsy-cache and gated-Vss at different
L2 latencies and different gate oxide thickness
values, we are able to identify a range of operating parameters
at which gated-Vss is more energy efficient
than drowsy-cache, even though gated-Vss does not preserve
data in cache lines that have been deactivated.
We are also able to show potential further benefits of
gated-Vss if an effective dynamic adaptation technique can
be found. These results debunk a fairly widespread belief
that state-preserving techniques are inherently superior
to non-state-preserving techniques.
Moderators: A. Veneris, Toronto U, CA; K. Winkelmann, Infineon Technologies, DE
-
Arithmetic Reasoning in DPLL-Based SAT Solving [p. 30]
-
M. Wedler, D. Stoffel, and W. Kunz
We propose a new arithmetic reasoning calculus to speed
up a SAT solver based on the Davis-Putnam-Logemann-Loveland
(DPLL) procedure. It is based on an arithmetic
bit-level description of the arithmetic circuit parts and the
property. This description can easily be provided by the
front-end of an RTL property checker. The calculus yields
significant speedup and more robustness on hard SAT instances
derived from the formal verification of arithmetic
circuits.
-
Enhanced Diameter Bounding via Structural Transformation [p. 36]
-
J. Baumgartner and A. Kuehlmann
Bounded model checking (BMC) has gained widespread industrial
use due to its relative scalability. Its exhaustiveness over all
valid input vectors allows it to expose arbitrarily complex design
flaws. However, BMC is limited to analyzing only a specific time
window, hence will only expose those flaws which manifest within
that window and thus cannot readily prove correctness. The diameter
of a design has thus become an important concept -- a bounded
check of depth equal to the diameter constitutes a complete proof.
While the diameter of a design may be exponential in the number
of its state elements, in practice it often ranges from tens to a
few hundred regardless of design size. Therefore, a powerful diameter
over-approximation technique may enable automatic proofs
that otherwise would be infeasible. Unfortunately, exact diameter
calculation requires exponential resources, and over-approximation
techniques may yield exponentially loose bounds. In this paper,
we provide a general approach for enabling the use of structural
transformations, such as redundancy removal, retiming, and target
enlargement, to tighten the bounds obtained by arbitrary diameter
approximation techniques. Numerous experiments demonstrate
that this approach may significantly increase the set of designs for
which practically useful diameter bounds may be obtained.
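For intuition, the quantity being bounded can be computed exactly on a toy design by breadth-first search over its explicit state graph. This is only an illustration: explicit enumeration is exactly the exponential cost the abstract refers to, the function name is hypothetical, and conventions for the diameter differ by one between authors. Here the returned value is the largest number of steps needed to first reach any reachable state.

```python
from collections import deque


def reachability_depth(initial_states, successors):
    """Largest BFS depth at which a new state is first reached from the initial states."""
    depth = {s: 0 for s in initial_states}
    queue = deque(initial_states)
    while queue:
        state = queue.popleft()
        for nxt in successors(state):
            if nxt not in depth:
                depth[nxt] = depth[state] + 1
                queue.append(nxt)
    return max(depth.values())


# Toy design: a free-running 3-bit counter starting at 0.
counter_depth = reachability_depth([0], lambda s: [(s + 1) % 8])
```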
-
Improved Symbolic Simulation by Dynamic Functional Space Partitioning [p. 42]
-
T. Feng, L. Wang, K. Cheng, and A. Lin
In this paper, we provide a flexible and automatic method to
partition the functional space for efficient symbolic simulation.
We utilize a 2-tuple list representation as the basis for
partitioning the functional space. The partitioning is carried
out dynamically during the symbolic simulation based
on the sizes of OBDDs. We develop heuristics for choosing
the optimal partitioning points. These heuristics intend to
balance the tradeoff between the time and space complexity.
We demonstrate the effectiveness of our new symbolic simulation
approach through experiments based on a floating
point adder and a memory management unit.
Moderators: S. Kundu, Intel, US; B. Straube, FhG IIS/EAS Dresden, DE
-
Using BDDs and ZBDDs for Efficient Identification of Testable Path Delay Faults [p. 50]
-
S. Padmanaban and S. Tragoudas
We present a novel framework to identify all the robustly
testable and untestable path delay faults in a circuit. The
method uses a combination of decision diagrams for manipulating
path delay faults and Boolean functions. The approach
benefits from processing partial paths or fanout free
segments in the circuit rather than the entire path. The
effectiveness of the proposed framework is demonstrated experimentally.
It is observed that the methodology identifies 350% more testable faults in the ISCAS'85 benchmark
C6288 than any existing technique while using only a fraction
of the time required by earlier work.
-
Level of Similarity: A Metric for Fault Collapsing [p. 56]
-
I. Pomeranz and S. Reddy
We describe a new approach to fault collapsing that
extends fault collapsing based on fault equivalence and
fault dominance. The new approach is based on a metric
called level of similarity between faults. Informally, a
fault $f_j$ is said to be similar to a fault $f_i$ with a level of
similarity $SL_{i,j} \le 1$ if a fraction $SL_{i,j}$ of the
tests for $f_i$ also detect $f_j$. If $SL_{i,j}$ is high enough, one may exclude
$f_j$ from the set of target faults and rely on the test for
$f_i$ (and tests for other faults) to detect $f_j$.
We describe a
procedure for fault collapsing based on the level of similarity,
and study its effectiveness experimentally.
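The metric itself is easy to state in code when the set of tests detecting each fault is known. The sketch below computes the level of similarity from such sets and applies a simple greedy filter; the greedy procedure, the threshold, and the fault names are assumptions for illustration and are not the authors' collapsing procedure.

```python
def level_of_similarity(tests_fi, tests_fj):
    """Fraction of the tests for fi that also detect fj."""
    if not tests_fi:
        raise ValueError("fi has no detecting tests")
    return len(tests_fi & tests_fj) / len(tests_fi)


def collapse(detecting_tests, threshold):
    """Greedy sketch: drop fj when some kept fault fi has similarity >= threshold."""
    kept = []
    for fj in sorted(detecting_tests):
        covered = any(
            detecting_tests[fi]
            and level_of_similarity(detecting_tests[fi], detecting_tests[fj]) >= threshold
            for fi in kept
        )
        if not covered:
            kept.append(fj)
    return kept


# Hypothetical faults and the test indices that detect them.
detecting_tests = {"f1": {1, 2, 3, 4}, "f2": {2, 3, 4, 5}, "f3": {9}}
similarity = level_of_similarity(detecting_tests["f1"], detecting_tests["f2"])
targets = collapse(detecting_tests, threshold=0.7)
```

Note that the metric is not symmetric in general, since the fraction is taken over the tests for the first fault.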
-
Design of Routing-Constrained Low Power Scan Chains [p. 62]
-
Y. Bonhomme, P. Girard, L. Guiller, C. Landrault, S. Pravossoudovitch, and A. Virazel
Scan-based architectures, though widely used in
modern designs, are expensive in power consumption.
Recently, we proposed a technique based on clustering
and reordering of scan cells that allows low-power scan
chains to be designed [1]. The main feature of this technique
is that power consumption during scan testing is
minimized while constraints on scan routing are satisfied.
In this paper, we propose a new version of this technique.
The clustering process has been modified to allow a better
distribution of scan cells in each cluster and hence lead to
larger power reductions. Results are provided at
the end of the paper to highlight this point and show that
scan design constraints (length of scan connections,
congestion problems) are still satisfied.
-
Z-Sets and Z-Detections: Circuit Characteristics that Simplify Fault Diagnosis [p. 68]
-
I. Pomeranz, S. Venkataraman, S. Reddy, and B. Seshadri
We define the concepts of z-sets and z-detections for
combinational circuits (or the combinational logic of scan
circuits). Based on these concepts we define structural
characteristics and characteristics based on fault simulation.
We show that these characteristics determine the
numbers of fault pairs that are guaranteed to be distinguished
by a given fault detection test set. These fault
pairs do not need to be considered during diagnostic fault
simulation or test generation. We demonstrate that benchmark
circuits as well as industrial circuits have these
characteristics to a larger extent than may be expected.
As a result, only small percentages of fault pairs need to
be considered during diagnostic fault simulation or test
generation once a fault detection test set is available. In
addition, these fault pairs can be identified efficiently.
Moderators: A. Rodriguez-Vazquez, IMSE-CNM, ES; P. Wambacq, IMEC, BE
-
A 2.7V 350µW 11-b Algorithmic Analog-to-Digital Converter with Single-Ended Multiplexed Inputs [p. 76]
-
A. Nagari and G. Nicollini
A low-power, low-area CMOS algorithmic A/D converter
that requires neither trimming nor digital calibration is
presented. The topology is based on a classical cyclic A/D
conversion using a capacitor ratio-independent
computation circuitry. All the non-idealities have been
carefully analyzed and reduced by proper choices of design
and layout solutions. As a result the errors coming from
opamp offset and finite open-loop dc gain, switch charge
injection and clock feedthrough, parasitic capacitors, and
intrinsic noise sources are reduced below the LSB level. To
process a multiplexed (8 channels) single-ended analog
input, an efficient single-ended to fully differential circuit
is presented. The converter achieves 11-bit accuracy
in the Nyquist band at a sampling rate of 8kSps. The total
power dissipation is only 350µW at 2.7V supply voltage.
The active area is 0.3 mm² in a 0.35µm five-metal-level
CMOS technology with double-poly linear capacitors.
-
Digital Background Gain Error Correction in Pipeline ADCs [p. 82]
-
A. Ginés, E. Peralías, and A. Rueda
This paper presents a new digital technique for background
calibration of gain errors in Pipeline ADCs. The
proposed algorithm estimates and corrects both the MDAC
gain error of the stage under calibration and the global gain
error associated with the uncalibrated stages without interruption
of the conversion and without reduction of the dynamic
rate. It is based on the use of a stage with two input-output
characteristics, depending on the value of a digital noise
signal.
Key Words: Analog-to-Digital Converter, Pipeline ADC,
Background Calibration, On-line Calibration.
-
Digital Ground Bounce Reduction by Phase Modulation of the Clock [p. 88]
-
M. Badaroglu, G. Gielen, H. De Man, P. Wambacq, G. Van Der Plas, and S. Donnay
The digital switching noise that propagates through the chip substrate to the analog circuitry on the same chip is a major limitation for mixed-signal SoC integration. In synchronous digital systems, digital circuits switch simultaneously on the clock edge, thereby generating a large ground bounce. To reduce the spectral peaks in the ground bounce spectrum, we combine two techniques: (1) phase modulation of the clock and (2) introduction of intentional clock skews to spread the switching activities. Experimental results show a reduction of around 16 dB in the spectral peaks of the noise spectrum when these two techniques are combined. These two techniques are believed to be good candidates for the development of methodologies for digital low-noise design in future CMOS technologies.
-
Pseudo-Random Sequence Based Tuning System for Continuous-Time Filters [p. 94]
-
A. Baschirotto, S. D'Amico, F. Corsi, C. Marzocca, and G. Matarrese
Continuous-time filters are widely used in signal processing but require a tuning system to align their frequency response. Several tuning techniques have been proposed in the literature, which can be grouped into two basic schemes: master-slave and self-calibration arrangements. Here we propose a novel tuning approach which can be applied to both tuning schemes. The tuning algorithm is based on the application of a pseudo-random input Test Pattern Signal and on the evaluation of a few samples of the input-output cross-correlation function. The key advantages of the proposed technique are the use of a pseudo-random pattern signal, which can be generated by a very simple circuit in a small die area, and the simple circuitry required to sample the filter output and to perform the cross-correlation operation.
Some experimental results of the application of the proposed tuning technique to a benchmark filter are given in order to assess its effectiveness.
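The measurement principle can be sketched in discrete time. The example assumes a 4-bit linear feedback shift register as the pattern generator and a short, made-up impulse response standing in for the filter; neither is taken from the paper, and the tuning loop that would act on the correlation samples is omitted. Because a maximal-length sequence has an almost impulse-like autocorrelation, a few cross-correlation samples track the filter's impulse response up to a known scale and offset.

```python
def prbs15():
    """One period of a 4-bit maximal-length LFSR sequence, mapped to +1/-1."""
    state = 1
    out = []
    for _ in range(15):
        out.append(1 if state & 1 else -1)
        new_bit = (state ^ (state >> 1)) & 1      # feedback from the two lowest bits
        state = (state >> 1) | (new_bit << 3)
    return out


def circular_filter(x, h):
    """Steady-state response of an FIR stand-in for the filter to a periodic input."""
    n = len(x)
    return [sum(h[m] * x[(i - m) % n] for m in range(len(h))) for i in range(n)]


def cross_correlation(x, y, lag):
    """Circular input-output cross-correlation at one lag."""
    n = len(x)
    return sum(x[i] * y[(i + lag) % n] for i in range(n))


x = prbs15()
h = [0.5, 0.3, 0.2]                               # hypothetical impulse response
y = circular_filter(x, h)
samples = [cross_correlation(x, y, lag) for lag in range(4)]
```

With a period of 15, each sample equals 15 times the impulse-response coefficient at that lag minus the sum of the other coefficients, so lags beyond the impulse response give a constant floor.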
Moderators: J. Teich, Erlangen-Nuremberg U, DE; P. Cheung, Imperial College London, UK
-
A Crosstalk Aware Interconnect with Variable Cycle Transmission [p. 102]
-
L. Li, N. Vijaykrishnan, M. Kandemir, and M. Irwin
Crosstalk between wires, caused by increased capacitive
coupling, is considered one of the major factors that affect
the performance of interconnects such as buses. The data-dependent
nature of crosstalk-induced delays forces the
bus cycle time to be designed for the worst-case crosstalk.
However, this pessimism incurs a significant performance
penalty. Consequently, we propose a crosstalk aware interconnect
that uses a faster clock and dynamically controls
the number of cycles required for transmission based on
the estimated delay of the data pattern to be transmitted.
In order to accomplish this, we designed a crosstalk analyzer
circuit that is incorporated into the sender side of
the bus and supports a variable cycle transmission mechanism.
We evaluate the effectiveness of the proposed scheme
focusing on the on-chip buses of a microprocessor and by
using the SPEC2000 benchmarks. The experimental results
show that the proposed approach improves performance by
31.5% as compared to the original pessimistic approach.
Furthermore, we employ a coding optimization to enhance
the effectiveness of the proposed approach. We also show
that the proposed scheme is an area-efficient approach to
improving performance as compared to other crosstalk reduction
schemes.
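A software sketch of what such an analyzer decides is given below. It uses a common first-order coupling model in which a switching wire sees a factor of 0, 1, or 2 from each neighbor depending on whether the neighbor switches in the same direction, stays quiet, or switches in the opposite direction. The two-level cycle mapping and its limit are assumptions made for illustration, not the paper's circuit.

```python
def worst_coupling_factor(prev, curr):
    """Largest per-wire coupling factor (0..4) for a transition between two bus words."""
    delta = [c - p for p, c in zip(prev, curr)]   # +1 rising, -1 falling, 0 quiet
    worst = 0
    for i, d in enumerate(delta):
        if d == 0:
            continue                              # a quiet wire has no transition delay
        factor = 0
        for j in (i - 1, i + 1):
            if 0 <= j < len(delta):
                factor += abs(d - delta[j])
        worst = max(worst, factor)
    return worst


def cycles_needed(prev, curr, fast_limit=2):
    """One fast cycle for benign patterns, two for the worst ones (assumed mapping)."""
    return 1 if worst_coupling_factor(prev, curr) <= fast_limit else 2


same_direction = cycles_needed([0, 0, 0], [1, 1, 1])
opposite_neighbors = cycles_needed([0, 1, 0], [1, 0, 1])
```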
-
Layout Conscious Bus Architecture Synthesis for Deep Submicron Systems on Chip [p. 108]
-
N. Thepayasuwan and A. Doboli
System-level design has a disadvantage in not knowing important
aspects about the final layout. This is critical for
SoC, where uncertainties in communication delay caused by very
deep submicron effects cannot be neglected. This paper presents
a layout-aware bus architecture (BA) synthesis algorithm for
designing the communication sub-system of an SoC. BA synthesis
includes finding bus topology and routing individual
buses, so that constraints such as area, bus speed, and length are
tackled at the physical level. The paper presents the BA automatically
synthesized for a network processor and a JPEG
SoC.
-
Loop Shifting and Compaction for the High-Level Synthesis of Designs with Complex Control Flow [p. 114]
-
S. Gupta, N. Dutt, A. Nicolau, and R. Gupta
Emerging embedded system applications in multimedia
and image processing are characterized by complex control
flow consisting of deeply nested conditionals and loops. We
present a technique called loop shifting that incrementally
exploits loop level parallelism across iterations by shifting
and compacting operations across loop iterations. Our experimental
results show that loop shifting is particularly
effective for the synthesis of designs with complex control
especially when resource utilization is already high and/or
under tight resource constraints. In situations when further
loop unrolling (or initiating another iteration of the loop
body) leads to a sharp increase in the longest combinational
path in the circuit and the circuit area, loop shifting is able
to achieve up to a 20% reduction in the input-to-output delay
in the synthesized circuit. We implemented loop shifting
within the SPARK parallelizing high-level synthesis framework
and present results for experiments on designs derived
from multimedia and image processing applications.
Organiser/Moderator: G. Martin, Cadence Berkeley Labs, US; D. Sciuto, Politecnico di Milano, IT
Panellists:
S. Swan, Cadence, US
F. Ghenassia, STMicroelectronics, FR
P. Flake, Synopsys, US
J. Srouji, Intel, Israel
W. Rosenstiel, Tübingen U, DE
-
SystemC and SystemVerilog: Where Do They Fit? Where Are They Going? [p. 122]
-
There is tremendous interest in design languages these days -
and more particularly, SystemC and SystemVerilog. Sometimes
the truth about design languages can be obscured by marketing
and the press. This panel is meant to deepen the technical
understanding of the DATE audience on the issue of design
languages. It contains five technical experts -- an academic
expert in design languages and SystemC and SystemVerilog in
particular; a language expert for each of SystemC and
SystemVerilog; and a user expert for these two languages.
The language experts have been heavily involved in the
specification and evolution of their respective languages. The
user experts have been heavily involved in developing use
methodologies for these languages within their own design
communities, and in applying them to real design problems.
The panelists will consider the questions:
- what are the key capabilities of these languages and what
do they offer to users?
- which design problems are they best used for? what is
their scope?
- how has application of these languages to real design
problems improved the productivity of designers and the
quality of the design results?
- where should the languages develop further capabilities?
Moderators: E. Schmidt, Chip Vision Design Systems, DE; C. Guardiani, PDF Solutions, IT
-
Re-Configurable Bus Encoding Scheme for Reducing Power Consumption of the
Cross Coupling Capacitance for Deep Sub-Micron Instruction Bus [p. 130]
-
S. Wong and C. Tsui
In very deep sub-micron designs, cross coupling
capacitances become the dominant factor of the total bus
loading and have a significant impact on the power
consumption. In this paper, we propose two reconfigurable
bus encoding schemes, which are based on
the correlation among the bit lines, to reduce the power
consumption at the cross coupling capacitances of the
instruction buses. The instruction is encoded by flipping
and reordering the bit lines during compilation time to
reduce the total switching capacitances. A crossbar is
used to map back the data to the original instruction code
before sending to the instruction decoder. The reordering
can be re-configured during run-time by using different
configurations in the crossbar. We propose two types of
re-configuration, static and dynamic. Static coding uses a
fixed flipping and re-configuring pattern after the
corresponding program is compiled. Dynamic coding
allows different re-configuring patterns during program
execution. Experimental results show that by using the
proposed schemes, significant energy reduction, 17-23%,
can be achieved. Comparisons with an existing bit-line
reordering encoding scheme have also been made, and on
average more than 15% reduction can be obtained using
our method.
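The effect of flipping and reordering bit lines can be illustrated with a toy coupling-cost model. The cost function counts, for every pair of adjacent lines, how differently they switch between consecutive words; this model, the exhaustive search over orderings, and the tiny traces are assumptions for illustration and are not the encoding algorithm of the paper.

```python
from itertools import permutations


def encode(word, order, flip):
    """Place source bit order[k] on line k and invert it when flip[k] is 1."""
    return [word[src] ^ f for src, f in zip(order, flip)]


def coupling_cost(trace):
    """Sum over adjacent line pairs of |delta_a - delta_b| for consecutive words."""
    cost = 0
    for prev, curr in zip(trace, trace[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        cost += sum(abs(a - b) for a, b in zip(delta, delta[1:]))
    return cost


def best_static_order(trace, width):
    """Exhaustive search for a line ordering (only feasible for toy bus widths)."""
    no_flip = [0] * width
    return min(
        permutations(range(width)),
        key=lambda order: coupling_cost([encode(w, order, no_flip) for w in trace]),
    )


# Flipping: the middle line always toggles against its neighbors.
toggling = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
flipped = [encode(w, [0, 1, 2], [0, 1, 0]) for w in toggling]

# Reordering: lines 0 and 2 switch together while lines 1 and 3 stay quiet.
interleaved = [[0, 0, 0, 0], [1, 0, 1, 0]]
order = best_static_order(interleaved, 4)
reordered = [encode(w, order, [0, 0, 0, 0]) for w in interleaved]
```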
-
Hierarchical Adaptive Dynamic Power Management [p. 136]
-
Z. Ren, B. Krogh, and R. Marculescu
The main contribution of this paper is a novel hierarchical
scheme for adaptive dynamic power management (DPM)
under nonstationary service requests. We model the non-stationary
arrival process of service requests as a Markov-modulated stochastic process
in which the stochastic process
for each modulation state models a particular stationary mode
of the arrival process. The bottom layer of our hierarchical architecture is a set of stationary optimal DPM policies, pre-calculated off-line
for selected modes from policy optimization
in Markov decision processes. The supervisory power manager
at the top layer adaptively and optimally switches among
these stationary policies on-line to accommodate the actual
mode-switching arrival dynamics. Simulation results show
that our approach, under highly nonstationary requests, can
lead to significant power savings compared to previously proposed
heuristic approaches.
Keywords: low-power design, hierarchical adaptive dynamic
power management, nonstationary service requests.
-
A Self-Tuning Cache Architecture for Embedded Systems [p. 142]
-
C. Zhang, F. Vahid, and R. Lysecky
Memory accesses can account for about half of a microprocessor
system's power consumption. Customizing a microprocessor
cache's total size, line size and associativity to a particular
program is well known to have tremendous benefits for
performance and power. Customizing caches has until recently
been restricted to core-based flows, in which a new chip will be
fabricated. However, several configurable cache architectures
have been proposed recently for use in pre-fabricated
microprocessor platforms. Tuning those caches to a program is,
however, still a cumbersome task left for designers, assisted in part
by recent computer-aided design (CAD) tuning aids. We propose
to move that CAD on-chip, which can greatly increase the
acceptance of configurable caches. We introduce on-chip
hardware implementing an efficient cache tuning heuristic that can
automatically, transparently, and dynamically tune the cache to an
executing program. We carefully designed the heuristic to avoid
any cache flushing, since flushing is costly in both power and
performance. By simulating numerous Powerstone and MediaBench
benchmarks, we show that such a dynamic self-tuning cache can
reduce memory-access energy by 45% to 55% on average, and as
much as 97%, compared with a four-way set-associative base
cache, completely transparently to the programmer.
Keywords:
Cache, configurable, architecture tuning, low power, low energy,
embedded systems, on-chip CAD, dynamic optimization.
-
Scheduling Reusable Instructions for Power Reduction [p. 148]
-
J. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M. Irwin
In this paper, we propose a new issue queue design that
is capable of scheduling reusable instructions. Once the issue
queue is reusing instructions, no instruction cache access
is needed since the instructions are supplied by the issue
queue itself. Furthermore, dynamic branch prediction
and instruction decoding can also be avoided permitting the
gating of the front-end stages of the pipeline (the stages before
register renaming). Results using array-intensive codes
show that the pipeline front-end can be gated for up to 82%
of the total execution cycles, providing a power reduction
of 72% in the instruction cache, 33% in the branch predictor,
and 21% in the issue queue, at a small
performance cost. Our analysis of compiler optimizations
indicates that the power savings can be further improved by
using optimized code.
Moderators: R. Drechsler, Bremen U, DE; H. Eveking, TU Darmstadt, DE
-
Using Counter Example Guided Abstraction Refinement to Find Complex Bugs [p. 156]
-
P. Bjesse and J. Kukula
In this paper, we present a method for finding failure traces
for safety properties that are out of reach for traditional
approaches to counter example generation. We do this by
guiding Bounded Model Checking (BMC) with information
gathered from counter example guided abstraction refinement.
Unlike previously described approaches based on reconstructing
abstract counter examples on the concrete machines,
we do not limit ourselves to search for failures of the
same length as the current abstract counterexample. We
also describe a combination of previously known methods
for choosing registers to include in the abstraction that we
have found works very well together with our technique for
finding failures. Our experimental results show that the resulting
method can find counter examples that are out of
range for both standard BMC and two previously published
approaches to abstraction-guided BMC.
-
Cost-Efficient Block Verification for a UMTS Up-Link Chip-Rate Coprocessor [p. 162]
-
G. Fey, D. Stoffel, H. Trylus, and K. Winkelmann
ASIC designs for future communication applications
cannot be simulated exhaustively. Formal Property
Checking is a powerful technology to overcome the
limitations of current functional verification approaches.
The paper reports on a large-scale experiment employing
the CVE property checker for verifying the block-level
functional correctness of a large ASIC.
This new verification methodology achieves
substantial quality and productivity gains. The two biggest
advantages are:
• Coding and Verification can be done in parallel.
• The whole state space of a test case will be verified
in a single run.
Formal Property Checking simplifies and shortens
the functional verification of large-scale ASICs by at least
the same order of magnitude as Static Timing Analysis did
for timing verification.
-
Automatic Verification of Safety and Liveness for XScale-Like Processor Models
Using WEB Refinements [p. 168]
-
P. Manolios and S. Srinivasan
We show how to automatically verify that complex
XScale-like pipelined machine models satisfy the
same safety and liveness properties as their corresponding
instruction set architecture models, by using the notion
of Well-founded Equivalence Bisimulation (WEB)
refinement. Automation is achieved by reducing the WEB-refinement
proof obligation to a formula in the logic of
Counter arithmetic with Lambda expressions and Uninterpreted
functions (CLU). We use the tool UCLID to
transform the resulting CLU formula into a Boolean formula,
which is then checked with a SAT solver. The models
we verify include features such as out of order completion,
precise exceptions, branch prediction, and interrupts.
We use two types of refinement maps. In one, flushing
is used to map pipelined machine states to instruction
set architecture states; in the other, we use the commitment
approach, which is the dual of flushing, since partially
completed instructions are invalidated. We present experimental
results for all the machines modeled, including verification
times. For our application, we found that the time
spent proving liveness accounts for about 5% of the overall
verification time.
Moderators: H. Obermeir, Infineon Technologies, DE; M. Hsiao, Virginia Tech., US
-
A Probabilistic Method for the Computation of Testability of RTL Constructs [p. 176]
-
J. Fernandes, M. Santos, A. Oliveira, and J. Teixeira
Validation of RTL descriptions remains one of the principal bottlenecks in the circuit design process. Random simulation-based methods for functional validation suffer from fundamental limitations and may be inappropriate or too expensive. In fact, for some circuits, a large number of vectors is required in order to make the circuit reach hard-to-test constructs and obtain accurate values for their testability. In this work, we present a static, non-simulation-based method for the determination of the controllability of RTL constructs that is efficient and gives accurate feedback to the designers regarding the presence of hard-to-control constructs in their RTL code. The method takes as input a Verilog RTL description, solves the Chapman-Kolmogorov equations that describe the steady state of the circuit, and outputs the computed values for the controllability of the RTL constructs. To avoid the exponential blow-up that results from writing one equation for each circuit state and solving the resulting system of equations, an approximation method is used. We present results showing that the approximation is effective and describe how the method can be used to bias a random test generator in order to achieve higher coverage using a smaller number of vectors.
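For a circuit small enough to enumerate, the steady-state equations reduce to finding the stationary distribution of a Markov chain. The sketch below does this by repeated multiplication with an explicit transition matrix; the two-state matrix is a made-up example, and explicit enumeration is precisely what the abstract's approximation method is designed to avoid.

```python
def steady_state(transition, iterations=200):
    """Approximate the stationary distribution pi with pi = pi * P by power iteration."""
    n = len(transition)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical two-state machine under random inputs; row i holds P(next = j | current = i).
P = [[0.5, 0.5],
     [0.25, 0.75]]
pi = steady_state(P)
```

In this toy case the probability of being in state 1 settles at 2/3, which would play the role of the controllability of a construct that is active only in that state.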
-
Graph-Based Functional Test Program Generation for Pipelined Processors [p. 182]
-
P. Mishra and N. Dutt
Functional verification is widely acknowledged as a major
bottleneck in microprocessor design. While early work
on specification driven functional test program generation
has proposed several promising ideas, many challenges remain
in applying them to realistic embedded processors. We
present a graph coverage based functional test program generation
approach for pipelined processors. The proposed
methodology makes three important contributions. First, it
automatically generates the graph model of the pipelined
processor from the specification using functional abstraction.
Second, it generates functional test programs based on the
coverage of the pipeline behavior. Finally, the test generation
time is drastically reduced due to the use of module level
property checking. We applied this methodology on the DLX
processor to demonstrate the usefulness of our approach.
-
Automatic Generation of Validation Stimuli for Application-Specific Processors [p. 188]
-
O. Goloubeva, M. Sonza Reorda, and M. Violante
Microprocessor soft cores today offer an effective solution to the problem of rapidly developing new systems-on-chip. However, all the features they offer are rarely used in embedded applications, and thus designers are often involved in the challenging task of soft-core customization to obtain application-specific processors. This paper proposes a novel approach to help designers in the simulation-based validation of application-specific processors. Suitable input stimuli are automatically generated while reasoning only on the software application the processor is intended to execute, while all the details concerning the processor hardware are neglected. Experimental results on an 8051 soft core show the effectiveness of the proposed approach.
-
Efficient Static Compaction of Test Sequence Sets through the Application of Set Covering Techniques [p. 194]
-
M. Dimopoulos and P. Linardis
The test sequence compaction problem is modeled
here, first, as a set covering problem. This formulation
enables the straightforward application of set covering
methods for compaction. Because of the complexity
inherent in the first model, a second, more efficient
formulation is proposed where the test sequences are
modeled as matrix columns with variable costs (number of
vectors). Further, matrix reduction rules appropriate to
the new formulation, which do not affect the optimality of
the solution, are introduced. Finally, the reduced problem
is minimized with a Branch & Bound algorithm.
Experiments on a large number of test sets show
significant reductions to the original problem by simply
using the presented reduction rules. Experimental results
comparing our method with others from the literature and
also with the absolute minima of the examples, computed
separately with the MINCOV algorithm, support the
potential of the proposed approach.
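A minimal sketch of the second formulation, assuming each test sequence is a column that covers a set of faults and costs its number of vectors. The branch-and-bound below is a generic textbook version written for this example; it omits the matrix reduction rules described in the abstract, and the function name and data are hypothetical.

```python
def min_cost_cover(columns, costs, universe):
    """Pick columns (test sequences) covering every fault at minimum total cost.

    columns: list of sets of fault ids; costs: vectors per sequence.
    Returns (best_cost, sorted list of chosen column indices).
    """
    best = {"cost": float("inf"), "pick": None}

    def search(uncovered, chosen, cost):
        if cost >= best["cost"]:
            return                      # bound: cannot beat the incumbent
        if not uncovered:
            best["cost"], best["pick"] = cost, sorted(chosen)
            return
        # Branch on the uncovered fault that the fewest columns can cover.
        fault = min(uncovered, key=lambda f: sum(f in c for c in columns))
        for j, col in enumerate(columns):
            if fault in col:
                search(uncovered - col, chosen | {j}, cost + costs[j])

    search(frozenset(universe), frozenset(), 0)
    return best["cost"], best["pick"]


# Hypothetical data: five faults, five candidate test sequences.
sequences = [{1, 2, 3}, {3, 4}, {4, 5}, {1, 2}, {1, 2, 3, 4, 5}]
vectors = [5, 2, 3, 2, 8]
cost, pick = min_cost_cover(sequences, vectors, {1, 2, 3, 4, 5})
```

Here three short sequences together cost fewer vectors than the single sequence that covers everything, which is the effect the variable column costs are meant to capture.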
Moderators: R. Bergamaschi, IBM TJ Watson Res. Center, US; R. Hermida, Madrid Complutense U, ES
-
Data Reuse Analysis Technique for Software-Controlled Memory Hierarchies [p. 202]
-
I. Issenin, N. Dutt, E. Brockmeyer, and M. Miranda
In multimedia and other streaming applications a
significant portion of energy is spent on data transfers.
Exploiting data reuse opportunities in the application, we
can reduce this energy by making copies of frequently used
data in a small local memory and replacing speed and
power inefficient transfers from main off-chip memory by
more efficient local data transfers. In this paper we present
an automated approach for analyzing these opportunities in
a program that allows modification of the program to use
custom scratch pad memory configurations comprising a
hierarchical set of buffers for local storage of frequently
reused data. Using our approach we are able to reduce
energy consumption of the memory subsystem when using a
scratch pad memory by a factor of two on average
compared to a cache of the same size.
-
Automatic Tuning of Two-Level Caches to Embedded Applications [p. 208]
-
A. Gordon-Ross, F. Vahid, and N. Dutt
The power consumed by the memory hierarchy of a
microprocessor can account for as much as 50% of the total
microprocessor system power, and is thus a good candidate
for optimizations. We present an automated method for
tuning two-level caches to embedded applications for reduced
energy consumption. The method is applicable to both a
simulation-based exploration environment and a hardware-based
system prototyping environment. We introduce the two-level
cache tuner, or TCaT -- a heuristic for searching the huge
solution space of possible configurations. The heuristic
interlaces the exploration of the two cache levels and
searches the various cache parameters in a specific order
based on their impact on energy. We show the integrity of our
heuristic across multiple memory configurations and even in
the presence of hardware/software partitioning -- a common
optimization capable of achieving significant speedups
and/or reduced energy consumption. We apply our
exploration heuristic to a large set of embedded applications.
Our experiments demonstrate the efficacy of our heuristic: on
average the heuristic examines only 7% of the possible cache
configurations, but results in cache sub-system energy
savings of 53%, only 1% more than the optimal cache
configuration. In addition, the configured cache achieves an
average speedup of 30% over the base cache configuration
due to tuning of cache line size to the application's needs.
-
Low Static-Power Frequent-Value Data Caches [p. 214]
-
C. Zhang, J. Yang, and F. Vahid
Static energy dissipation in cache
memories will constitute an increasingly larger portion
of total microprocessor energy dissipation due to
nanoscale technology characteristics and the large size
of on-chip caches. We propose to reduce the static
energy dissipation of an on-chip data cache by taking
advantage of the frequent values (FV) that widely exist in
a data cache memory. The original FV-based low-power
cache design aimed at only reducing dynamic power, at
the cost of a 5% slowdown. We propose a better design
that reduces both static and dynamic cache power, and
that uses a circuit design that eliminates performance
overhead. A designer can utilize our architecture by
simulating an application and then synthesizing the FVs
into an application-specific FV cache design when values
will not change, or by simulating and then writing to an
FV-cache with configuration registers when values could
change. Furthermore, we describe hardware that can
dynamically determine FVs and write to the
configuration registers completely transparently.
Experiments on 11 Spec 2000 benchmarks show that, in
addition to the dynamic power savings, 33% static
energy savings for data caches can be achieved.
-
Using a Victim Buffer in an Application-Specific Memory Hierarchy [p. 220]
-
C. Zhang and F. Vahid
Customizing a memory hierarchy to a
particular application or applications is becoming
increasingly common in embedded system design, with
one benefit being reduced energy. Adding a victim
buffer to the memory hierarchy is known to reduce
energy and improve performance on average, yet victim
buffers are not typically found in commercial embedded
processors. One problem with such buffers is that, while
they work well on average, they tend to hurt
performance for many applications. We show that a
victim buffer can be very effective if it is considered as
a parameter in designing a memory hierarchy, like the
traditional cache parameters of total size, associativity,
and line size. We describe experiments on PowerStone
and MediaBench benchmarks, showing that having the
option of adding a victim buffer to a direct-mapped
cache can reduce memory-access energy by a factor of
3 in some cases. Furthermore, even when other cache
parameters are configurable, we show that a victim
buffer can still reduce energy by 43%. By treating the
victim buffer as a parameter, meaning the buffer can be
included or excluded, we can avoid performance
overhead of up to 4% on some examples. We discuss the
victim buffer in the context of both core-based and
pre-fabricated platform based design approaches.
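To make the mechanism concrete, here is a small sketch of a direct-mapped cache with an optional fully associative victim buffer, counting accesses that must go to the next memory level. The replacement policy (oldest entry first), the function name, and the trace are assumptions for illustration; the sketch models miss counts only, not energy.

```python
from collections import OrderedDict


def count_misses(block_addresses, num_lines, victim_entries):
    """Count accesses served by neither the cache nor the victim buffer."""
    cache = [None] * num_lines
    victim = OrderedDict()              # evicted blocks, oldest first
    misses = 0
    for block in block_addresses:
        idx = block % num_lines         # direct-mapped placement
        if cache[idx] == block:
            continue                    # cache hit
        evicted = cache[idx]
        if block in victim:
            del victim[block]           # victim hit: swap back into the cache
        else:
            misses += 1                 # fetch from the next level
        cache[idx] = block
        if evicted is not None and victim_entries > 0:
            victim[evicted] = None
            if len(victim) > victim_entries:
                victim.popitem(last=False)   # drop the oldest victim
    return misses


# Two blocks that conflict in an 8-line direct-mapped cache.
trace = [0, 8] * 5
without_buffer = count_misses(trace, num_lines=8, victim_entries=0)
with_buffer = count_misses(trace, num_lines=8, victim_entries=1)
```

Passing `victim_entries=0` models the "buffer excluded" choice, so the same routine can compare both settings of the parameter for a given trace.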
Organiser/Moderator: M. Renaudin, TIMA Laboratory, FR; F. Bouesse, TIMA Laboratory, FR
Speakers:
P. Proust, Gemplus Corporate R&D Security Technologies, FR
J. Tual, Axalto - Schlumberger, FR
L. Sourgen, STMicroelectronics, FR
F. Germain, DCSSI - French Government Service on the Security of Information Systems, FR
-
High Security Smartcards [p. 228]
-
New consumer appliances such as PDAs, set-top boxes, and
GSM/UMTS terminals enable easy access to the
internet and strongly contribute to the development of e-commerce
and m-commerce services. Tens of billions of
payments are made using cards today, and this is expected
to grow in the near future. Smartcard platforms will enable
operators and service providers to design and deploy new
e- and m-commerce services. This development can only
be achieved if a high level of security is guaranteed for the
transactions and the customer's information.
In this context, smartcard design is very challenging: it must
provide the flexibility and the power
required by the applications and services, while at the
same time guaranteeing the security of the transactions
and the customer's privacy. The goal of the session is to
introduce this context and highlight the main challenges
that smartcard designers and manufacturers have to face.
Moderators: M. Miranda, IMEC, BE; W. Nebel, OFFIS, DE
-
Energy-Aware Communication and Task Scheduling for Network-on-Chip Architectures
Under Real-Time Constraints [p. 234]
-
J. Hu and R. Marculescu
In this paper, we present a novel Energy-Aware Scheduling
(EAS) algorithm which statically schedules both communication
transactions and computation tasks onto heterogeneous
Network-on-Chip (NoC) architectures under real-time
constraints. Our algorithm automatically assigns tasks
onto different processing elements and then schedules their
execution. At the same time, the algorithm also takes into
consideration the exact communication delay by scheduling
communication transactions in parallel. As the main contribution,
we first formulate the problem of concurrent communication
and task scheduling for heterogeneous NoC architectures
and then propose an efficient heuristic to solve
it. Experimental results show that significant energy savings
can be achieved by using our energy-aware scheduler while
meeting the specified performance constraints. For instance,
for a complex multimedia application, 44% energy savings
have been observed, on average, compared to the schedules
generated by a standard earliest-deadline-first scheduler.
-
A Low Cost Individual-Well Adaptive Body Bias (IWABB) Scheme for Leakage Power
Reduction and Performance Enhancement in the Presence of Intra-Die Variations [p. 240]
-
T. Chen and J. Gregg
This paper presents a new method of adapting
body biasing on a chip during post-fabrication testing in order
to mitigate the effects of process variations. Individual well
biasing voltages can be changed to be connected either to a
chip wide well bias or to a different bias voltage through a
self-regulating mechanism, allowing biasing voltage adjustments
on a per well basis. The scheme requires only one bias voltage
distribution network, but allows for back biasing adjustments
to more effectively mitigate die-to-die and within-die process
variations. The biasing setting for each well is determined using
a modified genetic algorithm. Our experimental results show that
binning yields as low as 17% can be improved to greater than
90% after using the proposed IWABB method.
-
A Logic Level Design Methodology for a Secure DPA Resistant ASIC or FPGA Implementation [p. 246]
-
K. Tiri and I. Verbauwhede
This paper describes a novel design methodology to
implement a secure DPA resistant crypto processor. The
methodology is suitable for integration in a common automated
standard cell ASIC or FPGA design flow. The technique
combines standard building blocks to make "new"
compound standard cells, which have a close to constant
power consumption. Experimental results indicate a 50
times reduction in the power consumption fluctuations.
-
Power Minimization in a Backlit TFT-LCD Display by Concurrent Brightness and Contrast Scaling [p. 252]
-
W. Cheng, Y. Hou, and M. Pedram
This paper presents a Concurrent Brightness and Contrast
Scaling (CBCS) technique for a cold cathode fluorescent lamp
(CCFL) backlit TFT-LCD display. The proposed technique aims
at conserving power by reducing the backlight illumination while
retaining the image fidelity through preservation of the image
contrast. First, we explain how CCFL works and show how to
model the non-linearity between its backlight illumination and
power consumption. Next, we propose the contrast distortion
metric to quantify the image quality loss after backlight scaling.
Finally, we formulate and optimally solve the CBCS optimization
problem with the objective of minimizing the fidelity and power
metrics. Experimental results show that an average of 3.7X
power saving can be achieved with only 10% of contrast
distortion.
Moderators: P. Bjesse, Synopsys, US; G. Cabodi, Politecnico di Torino, IT
-
Managing Don't Cares in Boolean Satisfiability [p. 260]
-
S. Safarpour, A. Veneris, R. Drechsler, and J. Lee
Advances in Boolean satisfiability solvers have popularized
their use in many of today's CAD VLSI challenges. Existing
satisfiability solvers operate on a circuit representation that
does not capture all of the structural circuit characteristics
and properties. This work proposes algorithms that take into
account the circuit don't care conditions thus enhancing the
performance of these tools. Don't care sets are addressed
in this work both statically and dynamically to reduce the
search space and guide the decision making process. Experiments
demonstrate performance gains.
-
Exploiting Signal Unobservability for Efficient Translation to CNF in Formal
Verification of Microprocessors [p. 266]
-
M. Velev
The paper presents a method for translating Boolean circuits to
CNF by identifying trees of ITE operators, where each ITE has
fanout count of 1, and representing every such tree with a single
set of equivalent CNF clauses without intermediate variables for
ITE outputs, except for the tree output. This not only eliminates
intermediate variables, but also reduces the number of clauses,
compared to conventional translation to CNF, where each ITE is
assigned an output variable and is represented with a separate
set of clauses. Other gates with fanout count of 1 are similarly
merged with their fanout gate to generate a single set of equivalent
clauses. This translation to CNF was implemented in a decision
procedure for the logic of Equality with Uninterpreted
Functions and Memories (EUFM), and was applied to formulas
from formal verification of microprocessors. To increase the
number of ITE-trees in the Boolean formulas, the decision procedure
was optimized to preserve the ITE-tree structure of arguments
to equality comparisons. In conventional translation to
CNF with the unoptimized decision procedure, the benchmark
formulas require up to hundreds of thousands of CNF variables
and millions of clauses. The best translation strategy reduced the
CNF variables by up to 8x; the clauses by up to 17x; the SAT-solver
decisions by up to 79x; the SAT-solver conflicts by up to
96x; and accelerated the SAT solving by up to 420x.
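One plausible way to encode an ITE tree without intermediate variables is to emit two clauses per root-to-leaf path, which the sketch below does and then checks exhaustively. This is an illustration under that assumption, not a reproduction of the paper's translation; the tree representation (integers for variables, `("ite", cond, then, else)` tuples for nodes) is hypothetical.

```python
from itertools import product


def merged_clauses(tree, out, path=()):
    """Clauses forcing `out` to equal the value of an ITE tree.

    `path` holds the condition literals that must be true to reach `tree`.
    """
    if isinstance(tree, int):                       # leaf variable
        blocked = [-lit for lit in path]            # path not taken => clause satisfied
        return [blocked + [-tree, out], blocked + [tree, -out]]
    _, cond, then_t, else_t = tree
    return (merged_clauses(then_t, out, path + (cond,)) +
            merged_clauses(else_t, out, path + (-cond,)))


def evaluate(tree, value):
    """Evaluate the ITE tree under an assignment {variable: bool}."""
    if isinstance(tree, int):
        return value[tree]
    _, cond, then_t, else_t = tree
    return evaluate(then_t if value[cond] else else_t, value)


def encodes_tree(tree, out, num_vars):
    """Brute-force check that the clauses hold exactly when out == tree value."""
    clauses = merged_clauses(tree, out)
    for bits in product([False, True], repeat=num_vars):
        value = dict(enumerate(bits, start=1))
        sat = all(any(value[abs(l)] == (l > 0) for l in c) for c in clauses)
        if sat != (value[out] == evaluate(tree, value)):
            return False
    return True


# out(6) = ITE(v1, v2, ITE(v3, v4, v5)): two ITEs, three leaves.
tree = ("ite", 1, 2, ("ite", 3, 4, 5))
clauses = merged_clauses(tree, out=6)
```

For this tree the path-based encoding uses 6 clauses and no extra variable, whereas assigning each of the two ITEs its own output variable and four defining clauses would use 8 clauses and one intermediate variable.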
-
A Novel SAT All-Solutions Solver for Efficient Preimage Computation [p. 272]
-
B. Li, M. Hsiao, and S. Sheng
In this paper, we present a novel all-solutions preimage
SAT solver, SOLALL, with the following features: (1) a new
success-driven learning algorithm employing smaller cut
sets; (2) a marked CNF database non-trivially combining
success/conflict-driven learning; (3) quantified-jump-back
dynamically quantifying primary input variables from the
preimage; (4) improved free BDD built on the fly, saving
memory and avoiding inclusion of PI variables; finally, (5)
a practical method of storing all solutions into a canonical
OBDD format. Experimental results demonstrated the efficiency
of the proposed approach for very large sequential
circuits.
Moderators: A. Richardson, Lancaster U, UK; F. Azais, LIRMM, FR
-
Efficient Test Strategy for TDMA Power Amplifiers Using Transient Current Measurements:
Uses and Benefits [p. 280]
-
G. Srinivasan, S. Bhattacharya, A. Chatterjee, and S. Cherubal
A novel algorithm for fast and accurate testing of TDMA
power amplifiers in a transmitter system is presented.
First, the steep cost of high-frequency testers can be
largely avoided with the proposed method due to its
ease of implementation on low-cost testers. Second,
TDMA power amplifiers usually have a control voltage
to operate the device in various modes of operation. At
each of the control voltage values, all the specifications
of the power amplifier are measured to ensure the
performance of each tested device. A new method is
proposed to test all the specifications of these devices
using the transient current response of their bias circuits
to a time-varying control voltage stimulus. This results in
shorter test times compared to conventional test methods.
The specification values are measured with an
error of less than 5% for all the specifications
measured. The proposed test approach can specifically
benefit production test of quad-band amplifiers
(GSM850, GSM900, PCS/DCS), as a single transient
current measurement can be used to compute all the
specifications of the device in different modes of
operation, over different operating frequencies.
-
Random Jitter Extraction Technique in a Multi-Gigahertz Signal [p. 286]
-
C. Ong, D. Hong, K. Cheng, and L. Wang
In this paper, we propose a simple technique for estimating
the standard deviation of a Gaussian random jitter component in
a multi-gigahertz signal. This method may utilize existing on-chip
single-shot period measurement techniques to measure the
multi-gigahertz signal periods for the estimation. This method
does not require an external sampling clock, nor any additional
measurement beyond existing techniques. Experimental results
show that this extraction method can accurately estimate the
random jitter variance in a multi-gigahertz signal even in the
presence of a few hundred-hertz sinusoidal jitter components.
-
Low Cost Analog Testing of RF Signal Paths [p. 292]
-
M. Negreiros, L. Carro, and A. Susin
A low cost method for testing analog RF signal paths
suitable for BIST implementation in a SoC environment is
described. The method is based on the use of a simple and
low-cost one-bit digitizer that enables the reuse of
processor and memory resources available in the SoC,
while incurring little analog area overhead. The proposed
method also allows a constant load to be observed by the
circuit, since no switches or muxes are needed for
digitizing specific test points. Mathematical background
and experimental results are presented in order to
validate the test approach.
-
A Method for Parameter Extraction of Analog Sine-Wave Signals for Mixed-Signal
Built-In-Self-Test Applications [p. 298]
-
D. Vázquez, G. Huertas, G. Leger, A. Rueda, and J. Huertas
This paper presents a method for extracting, in the
digital domain, the main characteristic parameters of
an analog sine-wave signal. The required circuitry for
on-chip implementation is very simple and robust,
which makes the present approach very suitable for
BIST applications. Solutions in this sense are addressed
together with simulation results that validate the
feasibility of the proposed approach.
Moderators: E. de Kock, Philips Research, NL; G. Constantinides, Imperial College, UK
-
A Novel Implementation of Tile-Based Address Mapping [p. 306]
-
S. Hettiaratchi and P. Cheung
Tile-based data layout has been applied to achieve various
objectives such as minimizing cache conflicts and memory
row switching activity. In some applications of tile-based
mapping, the size of the tile can be assumed to be
a power of two. In this paper, this 'power of two' assumption
has been used to drastically simplify the tile-based address
mapping functions. Once optimized, the implementation
of the non-linear tile-based mapping consumes 60%
less power than the implementation of the linear row-major
mapping. This result is very interesting because one would
normally expect a power penalty in the address generation
stage of the more sophisticated tile-based mapping. Moreover,
on average tile-based mapping implementation takes
10% less area and incurs virtually no additional delay over
row-major mapping implementation.
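The effect of the 'power of two' assumption can be sketched as follows: when the image width and tile dimensions are powers of two, the multiply, divide, and modulo operations of the tile-based mapping collapse into shifts, masks, and bit concatenation. The function names and the particular layout (tiles stored in row-major order, row-major within a tile) are assumptions for this example and are not taken from the paper.

```python
def tiled_address_arith(x, y, width, tile_w, tile_h):
    """Tile-based address using general arithmetic."""
    tiles_per_row = width // tile_w
    tile_index = (y // tile_h) * tiles_per_row + (x // tile_w)
    offset = (y % tile_h) * tile_w + (x % tile_w)
    return tile_index * (tile_w * tile_h) + offset


def tiled_address_bits(x, y, log_width, log_tw, log_th):
    """Same mapping when width, tile_w and tile_h are powers of two.

    The address is just the bit fields y_hi | x_hi | y_lo | x_lo concatenated.
    """
    x_lo = x & ((1 << log_tw) - 1)
    x_hi = x >> log_tw
    y_lo = y & ((1 << log_th) - 1)
    y_hi = y >> log_th
    addr = y_hi
    addr = (addr << (log_width - log_tw)) | x_hi
    addr = (addr << log_th) | y_lo
    addr = (addr << log_tw) | x_lo
    return addr


# 16x8 image, 4x2 tiles: both functions agree on every pixel.
agree = all(
    tiled_address_arith(x, y, 16, 4, 2) == tiled_address_bits(x, y, 4, 2, 1)
    for y in range(8) for x in range(16)
)
```

In hardware the bit-field version amounts to rewiring address lines, which is consistent with the abstract's observation that the tile-based mapping need not cost more than the row-major one.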
-
Power Aware Variable Partitioning and Instruction Scheduling for Multiple Memory Banks [p. 312]
-
Z. Wang and X. Hu
Many high-end DSP processors employ both multiple
memory banks and heterogeneous register files to improve
performance and power consumption. The complexity of
such architectures presents a great challenge to compiler
design. In this paper, we present an approach for variable
partitioning and instruction scheduling to maximally exploit
the benefits provided by such architectures. Our approach
is built on a novel graph model which strives to capture
both performance and power demands. We propose an
algorithm to iteratively find the variable partition such that
the maximum energy saving is achieved while satisfying the
given performance constraint. Experimental results demonstrate
the effectiveness of our approach.
-
Time-Energy Design Space Exploration for Multi-Layer Memory Architectures [p. 318]
-
R. Szymanek, K. Kuchcinski, and F. Catthoor
This paper presents an exploration algorithm which examines
execution time and energy consumption of a given
application, while considering a parameterized memory
architecture. The input to our algorithm is an application
given as an annotated task graph and a specification of
a multi-layer memory architecture. The algorithm
produces Pareto trade-off points representing different
multi-objective execution options for the whole
application. Different metrics are used to estimate parameters
for application-level Pareto points obtained
by merging all Pareto diagrams of the tasks composing
the application. We estimate application execution
time although the final scheduling is not yet known.
The algorithm makes it possible to trade off the quality
of the results against its runtime, depending on the
metrics used and the number of levels in the hierarchical composition
of the tasks' Pareto points. We have evaluated
our algorithm on a medical image processing application
and randomly generated task graphs. We have shown
that our algorithm can explore a huge design space and obtain
(near) optimal results in terms of Pareto diagram
quality.
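The merging of per-task Pareto diagrams can be sketched as below, under the simplifying assumption that two tasks run back to back, so their execution times and energies simply add. The paper estimates execution time without a final schedule using its own metrics; this example does not reproduce them, and the function names and data points are hypothetical.

```python
def pareto_front(points):
    """Keep the (time, energy) points not dominated by any other point."""
    front = []
    for t, e in sorted(set(points)):            # ascending time, then energy
        if not front or e < front[-1][1]:       # must strictly improve energy
            front.append((t, e))
    return front


def merge_sequential(front_a, front_b):
    """Combine two tasks' Pareto points, assuming they execute one after
    the other (times add, energies add), then prune dominated combinations."""
    combos = [(ta + tb, ea + eb) for ta, ea in front_a for tb, eb in front_b]
    return pareto_front(combos)


# Hypothetical (time, energy) options for two tasks.
task_a = [(2, 10), (4, 6)]
task_b = [(1, 8), (3, 3)]
application = merge_sequential(task_a, task_b)
```

Pruning after every merge keeps the number of application-level points small even when many tasks are combined hierarchically.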
-
Breaking Instance-Independent Symmetries in Exact Graph Coloring [p. 423]
-
A. Ramani, F. Aloul, I. Markov, and K. Sakallah
Code optimization and high level synthesis can be posed
as constraint satisfaction and optimization problems, such
as graph coloring used in register allocation. Naturally-occurring
instances of such problems are often small and
can be solved optimally. A recent wave of improvements
in algorithms for Boolean satisfiability (SAT) and 0-1 ILP
suggests generic problem-reduction methods, rather than
problem-specific heuristics, because (1) heuristics are easily
upset by new constraints, (2) heuristics tend to ignore
structure, and (3) many relevant problems are provably inapproximable.
The NP-spec project offers a language to
specify NP-problems and automatic reductions to SAT.
Problem reductions often lead to highly symmetric SAT
instances, and symmetries are known to slow down SAT
solvers. In this work, we compare several avenues for
symmetry-breaking, in particular when certain kinds of
symmetry are present in all generated instances. Our surprising
conclusion is that instance-independent symmetries
should often be processed together with instance-specific
symmetries rather than earlier, at the specification level.
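As a small illustration of an instance-independent symmetry, the sketch below encodes k-coloring as CNF and optionally adds clauses that break the color-permutation symmetry present in every instance. The particular predicate (vertex i may only use colors 0..i) is a simple choice made for this example and is not claimed to be one of the constructions compared in the paper; the brute-force model counter stands in for a real SAT solver.

```python
from itertools import product


def coloring_cnf(num_vertices, edges, k, break_symmetry=False):
    """CNF for proper k-coloring; variable (v, c) is numbered v*k + c + 1."""
    def var(v, c):
        return v * k + c + 1

    clauses = []
    for v in range(num_vertices):
        clauses.append([var(v, c) for c in range(k)])        # at least one color
        for c1 in range(k):
            for c2 in range(c1 + 1, k):
                clauses.append([-var(v, c1), -var(v, c2)])   # at most one color
    for u, v in edges:
        for c in range(k):
            clauses.append([-var(u, c), -var(v, c)])         # neighbors differ
    if break_symmetry:
        # Colors are interchangeable, so vertex v can be limited to colors 0..v.
        for v in range(num_vertices):
            for c in range(v + 1, k):
                clauses.append([-var(v, c)])
    return clauses


def count_models(clauses, num_vars):
    """Exhaustively count satisfying assignments (small instances only)."""
    return sum(
        all(any(bits[abs(l) - 1] == (l > 0) for l in cl) for cl in clauses)
        for bits in product([False, True], repeat=num_vars)
    )


triangle = [(0, 1), (1, 2), (0, 2)]
plain = count_models(coloring_cnf(3, triangle, 3), 9)
broken = count_models(coloring_cnf(3, triangle, 3, break_symmetry=True), 9)
```

For the triangle, the six colorings that differ only by renaming colors collapse to a single model once the symmetry-breaking clauses are added, while satisfiability is unchanged.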
Organiser: R. Lauwereins, IMEC, BE
Moderator: R. Wilson, CMP Media, US
Panellists:
K. Maex, IMEC, BE
P. Groeneveld, Magma Design Automation, US
G. Martin, Cadence, US
A. Cuomo, STMicroelectronics, IT
F. Catthoor, IMEC, BE
P. van de Steeg, Philips Semiconductors, NL
-
How Can System Level Design Solve the
Interconnect Technology Scaling Problem? [p. 332]
-
The scaling of interconnect technology hits a red brick wall: interconnect delay and power do not follow Moore's law anymore. The use of new materials like Cu and low-k alleviated the problem temporarily, but physical limits are being hit. What does this mean for system level design? The session starts with an embedded tutorial, given by an interconnect semiconductor technology expert, explaining the physics behind the interconnect problem and the degrees of freedom semiconductor technology offers system designers. Panelists will then express their thoughts and discuss with you how the interconnect problem can be solved by taking these degrees of freedom into account at the system design level. Views from industrial designers, CAD vendors, IC manufacturers and researchers will be presented.
Moderators: Y. Mathys, Motorola, FR; H. Hsieh, UC Riverside, US
-
System Design Using Kahn Process Networks: The Compaan/Laura Approach [p. 340]
-
E. Deprettere, B. Kienhuis, T. Stefanov, A. Turjan, and C. Zissulescu
New emerging embedded system platforms in the realm of high-throughput
multimedia, imaging, and signal processing will consist
of multiple microprocessors and reconfigurable components. One
of the major problems is how to program these platforms in a systematic
and automated way so as to satisfy the performance needs
of applications executed on these platforms.
In this paper, we present our system design approach as an efficient
solution to this programming problem. We show how for an
application written in Matlab, a Kahn Process Network specification
can automatically be derived and systematically mapped onto
a target platform composed of a microprocessor and an FPGA.
Furthermore, we illustrate how the mapping approach is applied
on a real-life example, namely an M-JPEG encoder.
-
Microarchitecture Development via Metropolis Successive Platform Refinement [p. 346]
-
D. Densmore, A. Sangiovanni-Vincentelli, and S. Rekhi
Productivity data for IC designs indicates an exponential
increase in design time and cost with the number of elements that
are to be included in a device. Present applications require the
development of complex systems to support novel functionality.
To cope with these difficulties, we need to change radically the
present design methodology to allow for extensive re-use, early
verification in the design cycle, pervasive use of software, and
architecture-level optimization. Platform-based design as
defined in [1], has these characteristics. We present the
application of this methodology to a complex industrial
application provided by Cypress Semiconductor. In this case
study, we focus on a particular aspect of this methodology that
eases considerably the verification process: successive
refinement. We compare this approach versus a parallel team of
designers who developed the IC using standard design
approaches.
-
Fast Exploration of Parameterized Bus Architecture for Communication-Centric SoC Design [p. 352]
-
C. Shin, Y. Kim, E. Chung, K. Choi, J. Kong, and S. Eo
For successful SoC design, efficient and scalable
communication architecture is crucial. Some bus interconnects
now provide configurable structures to meet this requirement of
an SoC design. Furthermore, bus IP vendors provide software
tools that automatically generate the RTL code of a bus once its
designer configures it. Configurability, however, imposes more
challenges upon designers because complexity involved in
optimization increases exponentially as the number of
parameters grows. In this paper, we present a novel approach
that dramatically reduces the required effort. An
automated optimization tool we developed
exploits a genetic algorithm for fast design exploration. This
paper shows that the time for the optimization task can be
reduced by more than 90% when the tool is used and, more
significantly, that the task can be done without an expert's hand
while ending up with a better solution.
Index Terms: Platform-based design, Bus Configuration,
Optimization, SoC design, genetic algorithm.
-
SoftContract: An Assertion-Based Software Development Process that Enables Design-by-Contract [p. 358]
-
J. Brunel, P. Giusto, M. di Natale, A. Ferrari, and L. Lavagno
This paper discusses a model-based design flow for requirements
in distributed embedded software development.
Such requirements are specified using a language similar to
Linear Temporal Logic which allows one to reason about
time and sequencing. They consist of assertions which must
hold for a design, given some assumptions on its environment.
They can be checked both during simulation and, at
least for a subset, even on the target. The key contribution of
the paper is the extension to the embedded software domain
of assertion-based verification, and the automated generation
of property-checking code in multiple target languages,
from simulation, to prototyping, to final production.
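To illustrate what a generated property checker might look like, here is a generic finite-trace monitor for a response property of the form "always (trigger implies eventually response)". The abstract's specification language is only described as similar to Linear Temporal Logic, so the property form, the trace representation (one dict of signal values per simulation step), and the function name are assumptions for this sketch.

```python
def always_eventually_response(trace, trigger, response):
    """Check G(trigger -> F response) on a finite simulation trace.

    A response in the same step as the trigger counts. On a finite trace the
    property fails only if a trigger is still unanswered when the trace ends.
    """
    pending = False
    for step in trace:
        if step.get(trigger):
            pending = True          # an obligation is opened
        if step.get(response):
            pending = False         # all open obligations are discharged
    return not pending


# Hypothetical traces: each dict holds the signals asserted in that step.
good_trace = [{"req": 1}, {}, {"ack": 1}]
bad_trace = [{"req": 1}, {"ack": 1}, {"req": 1}]
ok = always_eventually_response(good_trace, "req", "ack")
violated = not always_eventually_response(bad_trace, "req", "ack")
```

Because the monitor keeps a single flag of state, the same logic is cheap enough to run during simulation or, in principle, on a resource-constrained target.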
-
A System Level Exploration Platform and Methodology for Network Applications Based on
Configurable Processors [p. 364]
-
D. Quinn, B. Lavigueur, G. Bois, and M. Aboulhamid
A recent practice in the development of programmable SoCs is the integration of configurable processors, since they offer an interesting compromise between purely software and purely hardware solutions. This paper proposes an adjustment to the current codesign approach to integrate this opportunity at the partitioning level. Since configurable processors seem to be an interesting option for NPU designs, we integrated support for an Xtensa processor into a system-level exploration platform for further investigation. As case studies, this paper illustrates the methodology on two realistic network-processing applications, for which interesting performance results are obtained.
Moderators: S. Singh, Xilinx, US; A. Jantsch, Royal Inst. of Tech., SE
-
Refinement of Mixed-Signal Systems with Affine Arithmetic [p. 372]
-
C. Grimm, W. Heupke, and K. Waldschmidt
This paper describes a framework for the refinement of
control and signal processing functions. The design starts
with an executable specification, and allowed deviations
thereof. Refinement steps introduce models of analog or digital
implementations, and augment the 'ideal' behavior with
different sources of uncertainty. The framework verifies and
analyzes the influence of these uncertainties on system properties
using affine arithmetic.
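A minimal sketch of affine arithmetic, showing why it suits uncertainty tracking: each quantity is a center value plus a weighted sum of shared noise symbols, so correlated uncertainties cancel where plain interval arithmetic would widen. The class and method names are hypothetical, and only addition and subtraction are shown; this is not the framework's implementation.

```python
import itertools


class Affine:
    """Affine form: center + sum(coeff_i * eps_i), with each eps_i in [-1, 1]."""

    _symbols = itertools.count(1)       # source of fresh noise-symbol ids

    def __init__(self, center, terms=None):
        self.center = float(center)
        self.terms = dict(terms or {})  # noise symbol id -> coefficient

    @classmethod
    def from_interval(cls, lo, hi):
        """Represent [lo, hi] with one fresh noise symbol."""
        return cls((lo + hi) / 2, {next(cls._symbols): (hi - lo) / 2})

    def _combine(self, other, sign):
        terms = dict(self.terms)
        for sym, coeff in other.terms.items():
            terms[sym] = terms.get(sym, 0.0) + sign * coeff
        return Affine(self.center + sign * other.center, terms)

    def __add__(self, other):
        return self._combine(other, 1.0)

    def __sub__(self, other):
        return self._combine(other, -1.0)

    def interval(self):
        """Enclosing interval: center plus or minus the sum of |coefficients|."""
        radius = sum(abs(c) for c in self.terms.values())
        return (self.center - radius, self.center + radius)


x = Affine.from_interval(1, 3)      # an uncertain gain, say
y = Affine.from_interval(0, 2)      # an independent uncertain offset
same_source = (x - x).interval()    # shared noise symbol cancels exactly
independent = (x + y).interval()    # independent symbols accumulate
```

With plain intervals, [1, 3] - [1, 3] would give [-2, 2]; the affine form recognizes that both operands carry the same uncertainty and returns zero width.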
-
System-Level Performance Analysis in SystemC [p. 378]
-
H. Posadas, F. Herrera, P. Sánchez, E. Villar, and F. Blasco
As both the ITRS and the Medea+ DA Roadmaps have
highlighted, early performance estimation is an
essential step in any SoC design methodology [1-2].
This paper presents a C++ library for timing
estimation at system level. The library is based on a
general and systematic methodology that takes as
input the original SystemC source code without any
modification and provides the estimation parameters
by simply including the library within a usual
simulation. As a consequence, the same models of
computation used during system design are preserved
and all simulation conditions are maintained. The
method exploits the advantages of dynamic analysis,
that is, easy management of unpredictable data-dependent
conditions and computational efficiency
compared with other alternatives (ISS or RT
simulation), without the need for SW generation and
compilation or HW synthesis. Results obtained on
several examples show the accuracy of the method. In
addition to the fundamental parameters needed for
system-level design exploration, the proposed
methodology allows the designer to include capture
points at any place in the code. The user can process
the corresponding captured events for unrestricted
timing constraint verification.
-
Modeling and Validating Globally Asynchronous Design in Synchronous Frameworks [p. 384]
-
M. Mousavi, P. Le Guernic, J. Talpin, S. Shukla, and T. Basten
We lay a foundation for modeling and validation of asynchronous
designs in a multi-clock synchronous programming model. This
allows us to study properties of globally asynchronous systems
using synchronous simulation and model-checking toolkits. Our
approach can be summarized as automatic transformation of a
design consisting of two asynchronously composed synchronous
components into a fully synchronous multi-clock model preserving
behavioral equivalence. The ultimate goal of this research
is to provide the ability to model and build GALS systems in a
fully synchronous design framework and deploy it on an asynchronous
network preserving all properties of the system proven
in the synchronous framework.
-
Synchronous Protocol Automata: A Framework for Modelling and Verification of
SoC Communication Architectures [p. 390]
-
V. D'Silva, S. Ramesh, and A. Sowmya
Plug-n-Play style Intellectual Property (IP) reuse in System
on Chip (SoC) design is facilitated by the use of an
on-chip bus architecture. We present a synchronous, Finite
State Machine based framework for modelling communication
aspects of such architectures. This formalism has been
developed via interaction with designers and the industry
and is intuitive and lightweight. We have developed cycle
accurate methods to formally specify protocol compatibility
and component composition and show how our model
can be used for compatibility verification, interface synthesis
and model checking with automated specification. We
demonstrate the utility of our framework by modelling the
AMBA bus architecture including details such as pipelined
operation, burst and split transfers, the AHB-APB bridge
and arbitration features.
-
Aspects of Formal and Graphical Design of a Bus System [p. 396]
-
T. Seceleanu and T. Westerlund
This study shows the derivation of a local segmented bus
arbiter from an original single segment bus arbiter. The
operations are performed in the formal framework of action
systems and illustrated in a graphical manner using
the corresponding action systems -- UML profile notations.
The derivation is useful both to demonstrate the capability
of preserving correctness when considering an important
hardware design decision and also to identify means
through which this kind of decision can be made in a
graphical environment.
Moderators: R. Dorsch, IBM Deutschland Entwicklung, DE; E. Larsson, Linköping U, SE
-
Scan Power Minimization through Stimulus and Response Transformations [p. 404]
-
O. Sinanoglu and A. Orailoglu
Scan-based cores impose considerable test power challenges due to excessive
switching activity during shift cycles. The consequent test power
constraints force SOC designers to sacrifice parallelism among core tests,
as exceeding power thresholds may damage the chip being tested. Reduction
of test power for SOC cores can thus increase the number of
cores that can be tested in parallel, significantly improving SOC test application
time. In this paper, we propose a scan chain modification technique
that inserts logic gates on the scan path. The consequent beneficial
test data transformations are utilized to reduce the scan chain transitions
during shift cycles and hence test power. We introduce a matrix band
algebra that models the impact of logic gate insertion between scan cells
on the test stimulus and response transformations realized. As we have
successfully modeled the response transformations as well, the methodology
we propose is capable of truly minimizing the overall test power.
The test vectors and responses are analyzed in an intertwined manner,
identifying the best possible scan chain modification, which is realized
at minimal area cost. Experimental results confirm the efficacy of the proposed
methodology.
-
Synchro-Tokens: Eliminating Nondeterminism to Enable Chip-Level Test of Globally Asynchronous
Locally-Synchronous SoC's [p. 410]
-
M. Heath, W. Burleson, and I. Harris
Globally asynchronous locally synchronous (GALS)
clocking applied to a system-on-a-chip (SoC) results in a
design in which each core is a synchronous block (SB) of
logic with a locally generated clock. Inter-core
communication is asynchronous and controlled by
wrapper logic around the cores. The nondeterministic
synchronization used by most GALS architectures makes
chip-level silicon debug and functional test difficult and
costly. Deterministic GALS methodologies make dataflow
assumptions which are only valid for a very limited set of
applications. This paper describes a novel deterministic
GALS methodology called 'synchro-tokens' whose
parameterized wrappers are flexible enough to be useful
for a wide range of applications while supporting
synchronous debug and test methodologies such as
1149.1 and P1500. The validation of determinism,
estimation of area overhead, and analysis of performance
impact are detailed.
-
Wrapper Design for Testing IP Cores with Multiple Clock Domains [p. 416]
-
Q. Xu and N. Nicolici
This paper addresses the testability problems raised by embedded
cores with multiple clock domains. The proposed solution,
based on a novel core wrapper architecture, shows
how multi-frequency at-speed test response capture can be
achieved using low-speed testers synchronized with high-speed
on-chip generated clocks. Using experimental data,
the trade-offs between the number of tester channels, testing
time, area overhead and power dissipation are discussed.
-
Efficient Modular Testing of SOCs Using Dual-Speed TAM Architectures [p. 422]
-
A. Sehgal and K. Chakrabarty
The increasing complexity of system-on-chip (SOC) integrated
circuits has spurred the development of versatile automatic
test equipment (ATE) that can simultaneously drive
different channels at different data rates. Examples of such
ATEs include the Agilent 93000 series tester based on port
scalability and the test processor-per-pin architecture, and
the Tiger system from Teradyne. The number of tester channels
with high data rates may, however, be constrained in practice
due to ATE resource limitations, the power rating
of the SOC, and scan frequency limits for the embedded
cores. Therefore, we formulate the following optimization
problem: given two available data rates for the tester channels,
an SOC-level test access mechanism (TAM) width
W, and V (V < W) channels that can transport test data at the
higher data rate, determine an SOC TAM architecture that
minimizes the testing time. We present an efficient heuristic
algorithm for TAM optimization that exploits port scalability
of ATEs to reduce SOC testing time and test cost. We
present experimental results on dual-speed TAM optimization
for the ITC'2002 SOC test benchmarks.
-
An Arithmetic Structure for Test Data Horizontal Compression [p. 428]
-
M. Flottes, R. Poirier, and B. Rouzeyre
We propose a method for reducing test data volume of
integrated circuits or cores in a System-on-Chip. This
method is intended to reduce the required number of
Automatic Test Equipment (ATE) output channels
compared to the number of scan-in input pins in a
classical multi-chain implementation (horizontal
compression). Compression and decompression are based
on arithmetic operations and structures which present a
very low area overhead. The proposed compression
scheme does not impact the fault coverage achieved by the
original test sequence before compression.
Moderators: F. Fernandez, IMSE-CNM, ES; R. Schwencker, Infineon Technologies, DE
-
A Phase-Frequency Transfer Description of Analog and Mixed-Signal Front-End Architectures for
System-Level Design [p. 436]
-
E. Martens and G. Gielen
A novel approach for the modeling of front-end architectures
is presented. Architectures are described as a system
transforming polyphase harmonic signals through building
blocks modeled by polyphase harmonic transfer matrices
and distortion tensors. The major goal of the method is to
provide a model that is suited for systematic architectural
exploration during front-end system design. An example of
a downconversion architecture describes the system nonidealities
as the result of parasitic transfers between phases
and frequencies.
-
Hierarchical Automatic Behavioral Model Generation of Nonlinear Analog Circuits Based on
Nonlinear Symbolic Techniques [p. 442]
-
L. Näthke, V. Burkhay, L. Hedrich, and E. Barke
We present an extended method of automatic behavioral
model generation for nonlinear analog circuits. The focus
is on a decrease of simulation time. A procedural model
formulation approach is introduced, together with a new
simplification method based on the recognition of physical
transistor properties of the element models. The simplification
process is performed with respect to simulation
time, and a hierarchical modeling approach is proposed.
The results of these extensions are models with an obvious
speed-up in simulation time compared to the simulation of
the original netlists.
-
Performance Modeling of Analog Integrated Circuits Using Least-Squares Support Vector Machines [p. 448]
-
T. Kiely and G. Gielen
This paper describes the application of Least-Squares
Support Vector Machine (LS-SVM) training to analog circuit
performance modeling as needed for accelerated or hierarchical
analog circuit synthesis. The training is a type
of regression, where a function of a special form is fit to
experimental performance data derived from analog circuit
simulations. The method is contrasted with a feasibility
model approach based on the more traditional use
of SVMs, namely classification. A Design of Experiments
(DOE) strategy is reviewed which forms the basis of an efficient
simulation sampling scheme. The results of our functional
regression are then compared to two other DOE-based
fitting schemes: a simple linear least-squares regression
and a regression using posynomial models. The LS-SVM
fitting has advantages over these approaches in terms
of accuracy of fit to measured data, prediction of intermediate
data points and reduction of free model tuning parameters.
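
As an illustration of the regression described above, the following is a minimal sketch of textbook LS-SVM function fitting, not the authors' implementation. It assumes a single input variable and a Gaussian (RBF) kernel; the names `lssvm_fit`, `lssvm_predict`, `gamma` (regularization weight) and `sigma` (kernel width) are hypothetical, and the sample data stand in for circuit simulation results.

```python
import math


def rbf(a, b, sigma):
    """Gaussian kernel between two scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def solve(matrix, rhs):
    """Solve a dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def lssvm_fit(xs, ys, gamma, sigma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y] for b and alpha."""
    n = len(xs)
    rows = [[0.0] + [1.0] * n]
    for i in range(n):
        kernel_row = [rbf(xs[i], xs[j], sigma) + (1.0 / gamma if i == j else 0.0)
                      for j in range(n)]
        rows.append([1.0] + kernel_row)
    solution = solve(rows, [0.0] + list(ys))
    return solution[0], solution[1:]


def lssvm_predict(x, xs, b, alphas, sigma):
    """Evaluate the fitted model f(x) = b + sum_i alpha_i * k(x, x_i)."""
    return b + sum(a * rbf(x, xi, sigma) for a, xi in zip(alphas, xs))


# Made-up sample points standing in for simulated performance data.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [x ** 2 for x in xs]
b, alphas = lssvm_fit(xs, ys, gamma=1000.0, sigma=0.5)
midpoint_estimate = lssvm_predict(0.6, xs, b, alphas, sigma=0.5)
```

In this formulation training reduces to one linear solve, and the only free tuning parameters are `gamma` and `sigma`.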
-
Extended Subspace Identification of Improper Linear Systems [p. 454]
-
G. Vandersteen, D. Linten, R. Pintelon, and S. Donnay
The modeling of linear transfer functions is often required
prior to the simulation of electronic systems. An example
is the modeling of on-chip inductors starting from
2-port measurements. The modeling is often done using
state-space models that can only represent proper systems.
This leads to modeling problems in the case of improper
systems such as in the case of 2-port modeling of the admittance
matrix of an on-chip inductor. This paper first
describes an extended state-space model to represent improper
systems. Afterwards, the paper introduces an extension
to classical frequency-domain subspace identification
methods. The usefulness of both the extended state-space
model and the extended subspace modeling technique is illustrated
by comparing them with commercially available
solutions. This includes a comparison on measurements
of an on-chip inductor and on simulations of a coplanar
waveguide.
-
Identification and Modeling of Nonlinear Dynamic Behavior in Analog Circuits [p. 460]
-
X. Huang and H. Mantooth
This paper presents a new approach for identifying
nonlinear dynamic behavior in analog circuits. The
approach facilitates the creation of models that more
accurately reflect the dynamic behavior of a circuit. It has
been used in a fully automated, behavioral modeling tool,
Ascend, that starts from the netlist description of the
circuit and generates differential algebraic equation
(DAE) based behavioral models. The underlying
modeling approach is overviewed to provide a context for
this research. Some demonstrative test results illustrate
the effectiveness of the new method.
Moderators: G. Koch, Micronas GmbH, DE; C. Passerone, Politecnico di Torino, IT
-
Exploring Logic Block Granularity for Regular Fabrics [p. 468]
-
A. Koorapaty, V. Kheterpal, P. Gopalakrishnan, M. Fu, and L. Pileggi
Driven by the economics of design and manufacturing
nanoscale integrated circuits, an emphasis is being placed
on developing new, regular logic fabrics that leverage the
regularity and programmability of FPGAs, yet deliver a
level of performance and density close to ASICs. One example
of such a fabric is a Via-Patterned Gate Array (VPGA)
[9], which employs ASIC style global routing on top of an
array of patternable logic blocks (PLBs). Previous work
[8], [6], [10] showed that by employing even limited heterogeneity
for the VPGA logic blocks, namely combining a
3-LUT with two 3-input Nand gates, one can achieve performance
comparable to that provided by standard cells. Since
the area cost for such heterogeneity is far less for a VPGA
than for SRAM programmed fabrics such as FPGAs, we can
explore new configurations of via-configurable logic blocks
that offer greater heterogeneity and granularity to achieve
even higher performance. In this paper, we present a new,
more granular, via-patterned heterogeneous logic block architecture
and compare it to a less granular LUT-based
heterogeneous PLB. Our results show higher performance
and more effective packing of the logic functions due to increased
granularity.
-
Network Topology Exploration of Mesh-Based Coarse-Grain Reconfigurable Architectures [p. 474]
-
N. Bansal, S. Gupta, N. Dutt, A. Nicolau, and R. Gupta
Several coarse-grain reconfigurable architectures proposed
recently consist of a large number of processing elements
(PEs) connected in a mesh-like network topology.
We study the effects of three aspects of network topology
exploration on the performance of applications on these architectures:
(a) changing the interconnection between PEs,
(b) changing the way the network topology is traversed
while mapping operations to the PEs, and (c) changing the
communication delays on the interconnects between PEs.
We propose network topology traversal strategies that first
schedule PEs that are spatially close and that have more interconnections
among them. We use an interconnect aware
list scheduling heuristic as a vehicle to perform the network
topology exploration experiments on a set of designs
derived from DSP applications. Our experimental results
show that a spiral traversal strategy, coupled with a two
neighbor interconnect topology leads to good performance
for the DSP benchmarks considered. Our prototype framework
thus provides an exploration environment for system
architects to explore and tune coarse-grain reconfigurable
architectures for particular application domains.
-
A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning [p. 480]
-
R. Lysecky and F. Vahid
In previous work, we showed the benefits and feasibility of having
a processor dynamically partition its executing software such that
critical software kernels are transparently partitioned to execute
as a hardware coprocessor on configurable logic -- an approach
we call warp processing. The configurable logic place and route
step is the most computationally intensive part of such
hardware/software partitioning, normally running for many
minutes or hours on powerful desktop processors. In contrast,
dynamic partitioning requires place and route to execute in just
seconds and on a lean embedded processor. We have therefore
designed a configurable logic architecture specifically for
dynamic hardware/software partitioning. Through experiments
with popular benchmarks, we show that by specifically focusing
on the goal of software kernel speedup when designing the FPGA
architecture, rather than on the more general goal of ASIC
prototyping, we can perform place and route for our architecture
50 times faster, using 10,000 times less data memory, and 1,000
times less code memory, than popular commercial tools mapping
to commercial configurable logic. Yet, we show that we obtain
speedups (2x on average, and as much as 4x) and energy savings
(33% on average, and up to 74%) when partitioning even just one
loop, which are comparable to commercial tools and fabrics.
Thus, our configurable logic architecture represents a good
candidate for platforms that will support dynamic
hardware/software partitioning, and enables ultra-fast desktop
tools for hardware/software partitioning, and even for fast
configurable logic design in general.
Keywords:
Hardware/software partitioning, FPGA fabric, configurable logic,
synthesis, place and route, platforms, system-on-a-chip, dynamic
optimization, codesign, self-improving chips, just-in-time
compilation, warp processors, reconfigurable computing.
-
Configuration-Sensitive Process Scheduling for FPGA-Based Computing Platforms [p. 486]
-
G. Chen, M. Kandemir, and U. Sezer
Reconfigurable computing has become an important part
of research in software systems and computer
architecture. While prior research on reconfigurable
computing has addressed architectural and
compilation/programming aspects to some extent, there is
still not much consensus on what kind of operating system
(OS) support should be provided. In this paper, we focus
on the OS process scheduler and demonstrate how it can be
customized considering the needs of reconfigurable
hardware. Our process scheduler is configuration
sensitive, that is, it reuses the current FPGA configuration
as much as possible. Our extensive experimental results
show that the proposed scheduler is superior to classical
scheduling algorithms such as First-Come-First-Serve
(FCFS) and Shortest Job First (SJF).
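
The idea of reusing the current FPGA configuration can be sketched as follows. This is a simplified illustration, not the scheduler evaluated in the paper: the process names, configuration labels, and the functions `fcfs`, `config_sensitive`, and `count_reconfigurations` are hypothetical, and starvation and run times are ignored.

```python
from collections import deque


def count_reconfigurations(order, initial_config=None):
    """Count how often the FPGA configuration changes along a run order."""
    current, reconfigs = initial_config, 0
    for _name, config in order:
        if config != current:
            reconfigs += 1
            current = config
    return reconfigs


def fcfs(ready):
    """First-Come-First-Serve: run processes in arrival order."""
    return list(ready)


def config_sensitive(ready, initial_config=None):
    """Pick the oldest ready process that needs the loaded configuration.

    If no ready process matches, fall back to the oldest process overall.
    """
    pending, current, order = deque(ready), initial_config, []
    while pending:
        pick = next((p for p in pending if p[1] == current), pending[0])
        pending.remove(pick)
        order.append(pick)
        current = pick[1]
    return order


# Each entry is (process name, FPGA configuration it requires).
ready_queue = [("p1", "fft"), ("p2", "aes"), ("p3", "fft"),
               ("p4", "aes"), ("p5", "fft")]

fcfs_reconfigs = count_reconfigurations(fcfs(ready_queue))
sensitive_reconfigs = count_reconfigurations(config_sensitive(ready_queue))
```

For this queue the arrival order alternates between two configurations, so grouping processes by the loaded configuration cuts the number of reconfigurations from five to two.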
Moderators: R. Zafalon, STMicroelectronics, IT; K. Roy, Purdue U, US
-
Simultaneous State, Vt and Tox Assignment for Total Standby Power Minimization [p. 494]
-
D. Lee, H. Deogun, D. Blaauw, and D. Sylvester
Standby leakage current minimization is a pressing concern for
mobile applications that rely on standby modes to extend battery
life. Also, gate oxide leakage current (Igate) has become comparable
to subthreshold leakage (Isub) in 90nm technologies. In this paper,
we propose a new method that uses a combined approach of sleep-state,
threshold voltage (Vt) and gate oxide thickness (Tox) assignments
in a dual-Vt and dual-Tox process to minimize both Isub and
Igate. Using this method, total leakage current can be dramatically
reduced since in a known state in standby mode, only certain transistors
are responsible for leakage current and need to be considered for
high-Vt or thick-Tox assignment. We formulate the optimization
problem for simultaneous state, Vt and Tox assignments under delay
constraints and propose two practical heuristics. We implemented
and tested the proposed methods on a set of synthesized benchmark
circuits. Results show an average leakage current reduction of 5-6X
and 2-3X compared to previous approaches that only use state or
state+Vt assignment, respectively, with small delay penalties.
-
A Scalable ODC-Based Algorithm for RTL Insertion of Gated Clocks [p. 500]
-
P. Babighian, E. Macii, and L. Benini
This paper describes a new automatic clock-gating extraction
algorithm working at the RT-level. The key features of our approach
are: (i) Seamless merging with existing industrial design flows
and commercial tools; (ii) High scalability to deal with large
circuits; (iii) Improved quality of results with respect to
available commercial tools; (iv) Smaller and well-controlled
overhead in speed and area. Experimental results, on a set of
industrial RTL designs, demonstrate the viability and practical
impact of our approach.
-
Impact of Data Transformations on Memory Bank Locality [p. 506]
-
M. Kandemir
High energy consumption presents a problem for sustainable clock
frequency and deliverable performance. In particular, memory
energy consumption of array-intensive applications can be
overwhelming due to poor cache locality. One option for reducing
memory energy is to adopt a banked memory architecture, where
memory space is divided into banks and each bank can be powered
down if it is not in active use. An important issue here is the bank
access pattern, which determines opportunities for saving energy. In
this paper, we present a compiler-based data layout transformation
strategy for increasing the effectiveness of a banked memory
architecture. The idea is to transform the array layouts in memory
in such a way that two loop iterations executed one after another
access the data in the same bank as much as possible; the remaining
banks can be placed into a low-power mode. Our simulation-based
experiments with nine array-intensive applications show significant
savings in memory energy consumption.
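
The effect of array layout on bank locality can be illustrated with a small sketch. It is not the compiler transformation from the paper: it only counts how often consecutive iterations of a column-wise loop nest touch different banks under two layouts. The array size, the bank size, and the function names are assumptions made for illustration.

```python
def bank_switches(n, bank_size, address_of):
    """Count consecutive accesses that fall in different banks.

    The loop nest visits an n x n array column by column (j outer, i inner).
    """
    switches, previous = 0, None
    for j in range(n):
        for i in range(n):
            bank = address_of(i, j, n) // bank_size
            if previous is not None and bank != previous:
                switches += 1
            previous = bank
    return switches


def row_major(i, j, n):
    """Original layout: elements of a row are contiguous."""
    return i * n + j


def column_major(i, j, n):
    """Transformed layout: elements of a column are contiguous."""
    return j * n + i


# 8 x 8 array, 16 elements per bank (4 banks in total).
before = bank_switches(8, 16, row_major)
after = bank_switches(8, 16, column_major)
```

With the transformed layout the loop stays in one bank for long stretches (3 bank switches instead of 31 in this example), which leaves the other banks free to sit in a low-power mode.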
-
Why Transition Coding for Power Minimization of On-Chip Buses Does Not Work [p. 512]
-
C. Kretzschmar, D. Müller, and A. Nieuwland
Encoding techniques which minimize the self- or coupling
activity of buses are often proposed to reduce power
dissipation on system buses. In this paper, we investigate
the efficiency of several coding schemes for on-chip buses
with respect to overall power dissipation. The power of the
codec systems was estimated by power simulations with the
lay-outs and related to the savings on the bus. We derived
an expression for the energy efficiency of the codecs as a
function of bus length (capacitive load). Despite the fact
that adaptive schemes could obtain up to 40% savings, the
bus lengths required to reduce the overall power consumption
are not realistic for on-chip buses.
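
The break-even argument can be sketched with a simple linear model, which is an assumption of this example and not the expression derived in the paper: bus energy per transfer grows linearly with bus length, while the codec adds a fixed energy per transfer. All names and numeric values below are hypothetical.

```python
def net_saving_pj(length_mm, codec_energy_pj, bus_energy_pj_per_mm, saving_fraction):
    """Energy saved on the bus minus the energy spent in the codec, per transfer."""
    return saving_fraction * bus_energy_pj_per_mm * length_mm - codec_energy_pj


def break_even_length_mm(codec_energy_pj, bus_energy_pj_per_mm, saving_fraction):
    """Bus length at which the coding scheme starts to pay off."""
    if not 0.0 < saving_fraction <= 1.0:
        raise ValueError("saving_fraction must be in (0, 1]")
    return codec_energy_pj / (saving_fraction * bus_energy_pj_per_mm)


# Illustrative numbers only, not measurements from the paper:
# codec costs 2 pJ per transfer, the bus costs 0.1 pJ/mm, coding saves 40%.
length = break_even_length_mm(2.0, 0.1, 0.4)
short_bus_saving = net_saving_pj(5.0, 2.0, 0.1, 0.4)
```

Under these made-up numbers the break-even length is about 50 mm, and a 5 mm bus loses energy overall, which mirrors the qualitative conclusion of the abstract.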
-
Overhead-Conscious Voltage Selection for Dynamic and Leakage Energy
Reduction of Time-Constrained Systems [p. 518]
-
A. Andrei, M. Schmitz, P. Eles, Z. Peng, and B. Al-Hashimi
Dynamic voltage scaling and adaptive body biasing have been shown to
reduce dynamic and leakage power consumption effectively. In this paper,
we optimally solve the combined supply voltage and body bias selection
problem for multi-processor systems with imposed time constraints,
explicitly taking into account the transition overheads implied by changing
voltage levels. Both energy and time overheads are considered. We
investigate the continuous voltage scaling as well as its discrete counterpart,
and we prove NP-hardness in the discrete case. Furthermore,
the continuous voltage scaling problem is formulated and solved using
nonlinear programming with polynomial time complexity, while for the
discrete problem we use mixed integer linear programming. Extensive
experiments, conducted on several benchmarks and a real-life example,
are used to validate the approaches.
Moderators: B. Kienhuis, LIACS, Leiden U, NL; F. Petrot, Pierre et Marie Curie U, FR
-
Dynamic Power Management Using Data Buffers [p. 526]
-
Y. Lu and L. Cai
This paper presents a method to reduce energy consumption
by inserting data buffers. The method determines
whether power can be reduced by inserting a
buffer between two components and periodically turning
off one of them. This method calculates the length
of the period and the required buffer size to achieve
the optimal energy savings. Our approach can be applied
to any applications whose data arrival and departure
rates are different and known in advance.
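
A minimal sketch of the buffering idea, under assumptions that are not taken from the paper: the faster component delivers data at a constant `fast_rate` while it is on, the slower one drains at a constant `slow_rate`, the switched-off component draws no power, and the power of the buffer itself is ignored. The function and parameter names are hypothetical.

```python
def buffer_plan(fast_rate, slow_rate, period):
    """Return (on_time, off_time, buffer_size) for one on/off period.

    The fast component runs just long enough to supply the data that the
    slow component consumes over the whole period, then switches off.
    """
    if not 0.0 < slow_rate < fast_rate:
        raise ValueError("expected 0 < slow_rate < fast_rate")
    on_time = period * slow_rate / fast_rate
    off_time = period - on_time
    # The buffer must hold everything consumed while the fast side is off.
    buffer_size = slow_rate * off_time
    return on_time, off_time, buffer_size


def energy_saved(on_power, wake_energy, period, on_time):
    """Energy saved per period versus keeping the fast component always on."""
    return on_power * period - (on_power * on_time + wake_energy)


on_time, off_time, buffer_size = buffer_plan(fast_rate=4.0, slow_rate=1.0, period=8.0)
saved = energy_saved(on_power=2.0, wake_energy=3.0, period=8.0, on_time=on_time)
```

In this model a longer period amortizes the wake-up energy over more data but requires a proportionally larger buffer, which is the trade-off behind choosing the period length and buffer size.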
-
Dynamic Memory Management Design Methodology for Reduced Memory
Footprint in Multimedia and Wireless Network Applications [p. 532]
-
D. Atienza, J. Mendias, S. Mamagkakis, D. Soudris, and F. Catthoor
New portable consumer embedded devices must execute
multimedia and wireless network applications that demand
extensive memory footprint. Moreover, they must heavily
rely on Dynamic Memory (DM) due to the unpredictability
of the input data (e.g. 3D streams features) and system behaviour
(e.g. number of applications running concurrently
defined by the user). Within this context, consistent design
methodologies that can tackle efficiently the complex DM
behaviour of these multimedia and network applications are
in great need. In this paper, we present a new methodology
for designing custom DM management mechanisms
with a reduced memory footprint for such dynamic
applications. Experimental results from real case studies
show that our methodology improves memory footprint by 60%
on average over current state-of-the-art DM managers.
-
High-Level System Modeling and Architecture Exploration with SystemC on a
Network SoC: S3C2510 Case Study [p. 538]
-
H. Jang, M. Kang, K. Shim, M. Lee, K. Chae, and K. Lee
This paper presents a high-level design methodology
applied to a Network SoC using SystemC. The emphasis
is on a high-level design approach for intensive
architecture exploration and on verifying cycle-accurate
SystemC models against real Verilog RTL models.
Unlike many high-level designs, we started the project
with working Verilog RTL models in hand, against which we
later compared our SystemC models. Moreover, we were able
to use the on-chip test board performance simulation data
to verify our SystemC-based platform.
This paper illustrates that a high-level design can
have the same accuracy as RTL models while achieving over
one hundred times faster simulation speed than
RTL. The main topic of the paper is architecture
exploration in search of the sources of performance
degradation.
-
A SystemC-Based Verification Methodology for Complex Wireless Software IP [p. 544]
-
G. Post, P. Venkataraghavan, T. Ray, and D. Seetharaman
The implementation of a complex hardware Intellectual
Property (IP) together with complex lower-level software
and the integration into a system platform poses tough
challenges to the design and verification engineers.
Traditionally, embedded software is developed and tested
towards the end of the development cycle because of late
availability of lab prototype equipment and hardware IP.
In this paper, a 'software-centric' hardware/software
implementation and verification methodology for a 3G
WCDMA modem is presented, with emphasis on physical
layer software design and early verification. The subsystem
architecture of 3G hardware and software is
presented along with design and verification steps carried
out. A versatile SystemC-based test environment is
described, which links test case modules producing the
stimuli from protocol stack and hardware components to
the L1 SW code, executed on an instruction set simulator.
Moderators: M. Zwolinski, Southampton U, UK; M. Lajolo, NEC Laboratories, US
-
A New Optimized Implementation of the SystemC Engine Using Acyclic Scheduling [p. 552]
-
D. Pérez, O. Temam, and G. Mouchard
SystemC is rapidly gaining wide acceptance as a simulation
framework for SoC and embedded processors. While
its main assets are modularity and the very fact it is becoming
a de facto standard, the evolution of the SystemC
framework (from version 0.9 to version 2.0.1) suggests the
environment is particularly geared toward increasing the
framework functionalities rather than improving simulation
speed. For cycle-level simulation, speed is a critical factor
as simulation can be extremely slow, affecting the extent of
design space exploration.
In this article, we present a fast SystemC engine that,
in our experience, can speed up simulations by a factor
of 1.93 to 3.56 over SystemC 2.0.1. This SystemC engine
is designed for cycle-level simulators and for the moment,
it only supports the subset of the SystemC syntax (signals,
methods) that is most often used for such simulators. We
achieved greater speed (1) by completely rewriting the SystemC
engine and improving the implementation software
engineering, and (2) by proposing a new scheduling technique,
intermediate between SystemC dynamic scheduling
technique and existing static scheduling schemes. Unlike
SystemC dynamic scheduling, our technique removes many
if not all useless process wake-ups, while using a simpler
scheduling algorithm than in existing static scheduling techniques.
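
To illustrate why a precomputed evaluation order avoids useless wake-ups, here is a generic sketch of static ordering by signal dependencies. It is not the scheduling technique proposed in the paper; the process names, signal names, and the helper `static_order` are hypothetical, and it assumes the combinational dependency graph is acyclic (Python 3.9+ for `graphlib`).

```python
from graphlib import TopologicalSorter


def static_order(processes):
    """Order processes so that each runs after the writers of its inputs.

    `processes` maps a process name to (input signals, output signals).
    """
    writer = {signal: name
              for name, (_inputs, outputs) in processes.items()
              for signal in outputs}
    graph = {name: {writer[s] for s in inputs if s in writer}
             for name, (inputs, _outputs) in processes.items()}
    # TopologicalSorter raises CycleError if the dependencies form a loop.
    return list(TopologicalSorter(graph).static_order())


# A three-stage combinational chain, listed out of order on purpose.
processes = {
    "decode": ({"instr"}, {"op"}),
    "execute": ({"op"}, {"result"}),
    "fetch": ({"pc"}, {"instr"}),
}
order = static_order(processes)
```

Evaluating the processes once per cycle in this order means each one sees settled inputs, whereas a purely dynamic scheduler may wake a process several times as its inputs change.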
-
Stimuli Generation with Late Binding of Values [p. 558]
-
A. Ziv
Generating test-cases that reach corner cases in the design
is one of the main challenges in the functional verification
of complex designs. In this paper, we describe a new
technique that increases the ability of test generators by delaying
assignment of values in the generated stimuli, until
these values are used in the design. This late-binding allows
the generator to have a more accurate view of the state
of the design, and thus it can better choose the correct values.
Experimental results show that late-binding can significantly
improve coverage, with a reasonable penalty in simulation
time.
-
Native ISS-SystemC Integration for the Co-Simulation of Multi-Processor SoC [p. 564]
-
F. Fummi, S. Martini, G. Perbellini, and M. Poncino
In a system-level design flow, the transition from a
high-level description entry implies the refinement from
an untimed, unpartitioned description to a real architecture
where applications are executed on a programmable
device and interact with ad-hoc hardware components.
Simulation of such architectures requires the capability
of efficient co-simulation of a model of hardware with a
model of the processor.
This paper presents two co-simulation methodologies,
based on SystemC as hardware modeling language and on
an Instruction Set Simulator (ISS) as a model of the processor.
The first one works at the SystemC kernel level and
exploits potentialities of the GNU suite, whereas the second
uses features offered by the operating system running
on the ISS.
The two methodologies improve co-simulation performance
with respect to state-of-the-art methods, and provide different
trade-offs between the simplicity of the programming
model, the modeling power, and co-simulation performance.
-
Extraction of Schematic Array Models for Memory Circuits [p. 570]
-
S. Bose and A. Nandi
The modeling and simulation of memory circuits remains an outstanding
problem when accuracy with respect to the actual schematic
implementation is desired. Functionally equivalent RTL models often
cannot be used for designs with embedded memory blocks, because
schematic models for the surrounding logic may be required for
fault modeling accuracy. Existing methods derive a latch model
that essentially represents each memory location as a latch primitive,
and have a large number of gates. We present new algorithms that
model such circuits as decoded arrays that access entire rows of
cells for individual read and write operations. Decoded array models
allow fault modeling accuracy for the surrounding logic, including
the memory address decoder. Experimental data show improvements of
an order of magnitude for both logic and fault simulations, when
compared to the equivalent latch model.
Moderators: T. Mak, Intel Corp., US; Y. Tsiatouhas, Ioannina U, GR
-
Effective Software-Based Self-Test Strategies for On-Line Periodic Testing of Embedded Processors [p. 578]
-
A. Paschalis and D. Gizopoulos
Software-based self-test (SBST) strategies are particularly
useful for periodic testing of deeply embedded processors in low-cost
embedded systems that do not require immediate detection of
errors and cannot afford the well-known hardware, software, or
time redundancy mechanisms.
In this paper, first, we identify the stringent characteristics of
an SBST test program to be suitable for on-line periodic testing.
Then, we introduce a new SBST methodology with a new
classification scheme for processor components. After that, we
analyze the self-test routine code styles for the three most effective
test pattern generation (TPG) strategies in order to select the most
effective self-test routine for on-line periodic testing of a
component under test. Finally, we demonstrate the effectiveness of
the proposed SBST methodology for on-line periodic testing by
presenting experimental results for a RISC pipeline processor.
-
Evaluating the Effects of SEUs Affecting the Configuration Memory of an SRAM-Based FPGA [p. 584]
-
M. Bellato, P. Bernardi, A. Candelori, M. Rebaudengo, M. Sonza Reorda, M. Violante,
M. Ceschia, D. Bortolato, A. Paccagnella, and P. Zambolin
This paper analyses the effects of Single Event Upsets
in an SRAM-based FPGA, with special emphasis on the
transient faults affecting the configuration memory. Two
approaches are combined: on the one hand, by exploiting
the available information and tools dealing with the
device configuration memory, we were able to form
hypotheses on the meaning of every bit in the
configuration memory. On the other hand, radiation
testing was exploited to validate the hypotheses and to
gather experimental evidence about the correctness of
the obtained results. As a major result, we can provide
detailed information about the effects of SEUs affecting
the configuration memory of a commercial FPGA device.
As a second contribution, we describe a method for
obtaining the same result with similar devices. Finally,
the obtained results are crucial to allow the possible
usage of SRAM-based FPGAs in safety-critical
environments, e.g., by working on the place and route
strategies of the supporting tools.
-
Early SEU Fault Injection in Digital, Analog and Mixed Signal Circuits: A Global Flow [p. 590]
-
R. Leveugle and A. Ammari
Fault injection techniques have long been proposed
for early analysis of the dependability characteristics of
digital circuits. Very few attempts have, however, been
reported to perform the same task in analog parts.
Furthermore, these attempts are all based on parametric
variations. With the increasing number of mixed signal
circuits, a unified approach becomes mandatory to
globally validate the digital and analog parts, while
taking into account real faults occurring in the field, e.g.
SEUs. In this paper, a global analysis flow is proposed,
based on a high-level model of the circuit. The possibility
to inject transient faults in the different parts is discussed.
The results obtained on a case study are reported to show
the feasibility of the injection in analog blocks.
-
On Concurrent Error Detection with Bounded Latency in FSMs [p. 596]
-
S. Almukhaizim, Y. Makris, and P. Drineas
We discuss the problem of concurrent error detection
(CED) with bounded latency in finite state machines (FSMs).
The objective of this approach is to reduce the overhead of
CED, albeit at the cost of introducing a small latency in the
detection of errors. In order to ensure no loss of error detection
capabilities as compared to CED without latency, an upper
bound is imposed on the introduced latency. We examine
the necessary conditions for performing CED with bounded
latency, based on which we extend a parity-based method to
permit bounded latency. We formulate the problem of minimizing
the number of required parity bits as an Integer Program
and we propose an algorithm based on Linear Program
relaxation and Randomized Rounding to solve it. Experimental
results indicate that allowing a small bounded latency reduces
the hardware cost of the CED circuitry.
Moderators: H. Graeb, TU Munich, DE; G. Vandersteen, IMEC, BE
-
Fast, Layout-Inclusive Analog Circuit Synthesis Using Pre-Compiled Parasitic-Aware
Symbolic Performance Models [p. 604]
-
M. Ranjan, A. Agarwal, H. Sampath, R. Vemuri, W. Verhaegen, and G. Gielen
We present a new methodology for fast analog circuit synthesis,
based on the use of parameterized layout generators
and symbolic performance models (SPMs) in the synthesis
loop. Fast layout generation is achieved by using efficient parameterized
procedural layout generators. Fast performance
estimation is achieved by using pre-compiled SPMs, stored as
efficient DDD-like structures called Element Coefficient Diagrams.
Techniques have been developed to include layout geometry
effects in the SPMs. The accuracy and efficiency of the
parasitic inclusion technique as well as the proposed methodology
have been demonstrated by comparisons to traditional
synthesis methods. The proposed methodology is used for the
synthesis of opamps and filters and is demonstrated to achieve
effective performance closure.
-
Sensitivity-Based Modeling and Methodology for Full-Chip Substrate Noise Analysis [p. 610]
-
R. Murgai, S. Reddy, T. Miyoshi, T. Horie, and M. Tahoori
Substrate noise (SN) is an important problem in mixed-signal
designs. With increasing design complexity, it is not possible to
simulate for SN with a detailed SPICE model that uses an accurate
model for each transistor. In this paper, we propose a sensitivity
analysis- and static timing analysis-based methodology to derive
a reduced model that computes the worst case substrate noise in the
design. The reduced model contains only passive components, which are
very few, and is very quick to simulate. The main feature of
our methodology is that, unlike previous approaches, it is independent
of input patterns and does not need to simulate for millions of
clock cycles. This lets us apply it to a full-chip design in
reasonable CPU time. We validate our reduced model on several
benchmark circuits against a detailed and highly accurate reference
model. On average, the reduced model is within 16.4% of the
reference model and is up to 38 times faster. Finally, we apply our
methodology to a mixed-signal switch chip design consisting of 8
million gates and show that it finishes in 17 minutes.
-
SubCALM: A Program for Hierarchical Substrate Coupling Simulation on Floorplan Level [p. 616]
-
T. Brandtner and R. Weigel
The hierarchical substrate coupling simulation tool SubCALM
offers the opportunity to estimate substrate coupling
on floorplan level. A novel approach for modeling
well and SOI structures in a boundary element description
is introduced. Several acceleration techniques like precalculated
macromodels and sophisticated preconditioning
algorithms are presented which are applied to an
O(n)-conjugate-gradient Poisson solver in order to be able to
process large full-chip layouts during floorplanning.
-
Optimization of Integrated Spiral Inductors Using Sequential Quadratic Programming [p. 622]
-
Y. Zhan and S. Sapatnekar
The optimization of integrated spiral inductors has great
practical importance. Previous optimization methods used
in this field are either too slow or depend on very simplified
device-modeling assumptions, which make the
algorithm applicable only to low-frequency cases. In this
paper, we propose using the sequential quadratic programming
(SQP) approach to optimize the on-chip spiral inductors.
A physical model based on first principles is used in the
back-end device-parameter extraction engine, which makes
the algorithm suitable for optimization over any frequency
range. In addition, compared with enumeration, which is used
in many inductance optimization packages, our experiments
show that the SQP algorithm can achieve at least an order of
magnitude speedup while maintaining the same quality of the
optimized design.
Moderators: B. Juurlink, TU Delft, NL; R. Leupers, RWTH Aachen, DE
-
System Design for DSP Applications Using the MASIC Methodology [p. 630]
-
A. K. Deb, A. Jantsch, and J. Öberg
Expensive top-down iterations are often required in the
design cycle of complex DSP systems. In this paper, we
introduce two levels of abstraction in the design flow by
systematically categorizing the architectural decisions. As
a result, the top-down iteration loop is broken. We also
present a technique to capture and inject the architectural
decisions such that the system models can be created and
simulated efficiently. The concepts are illustrated by a
realistic speech processing example, which is implemented
using the AMBA on-chip architecture. Our methodology
offers a smooth path from the functional modeling phase to
the implementation level, facilitates the reuse of HW and
SW components, and enjoys existing tool support at the
backend.
-
Flexible Software Protection Using Hardware/Software Codesign Techniques [p. 636]
-
J. Zambreno, A. Choudhary, R. Simha, and B. Narahari
A strong level of trust in the software running on an
embedded processor is a prerequisite for its widespread
deployment in any high-risk system. The expanding field
of software protection attempts to address the key steps
used by hackers in attacking a software system. In this paper
we present an efficient and tunable approach to some
problems in embedded software protection that utilizes a
hardware/software codesign methodology. By coupling our
protective compiler techniques with reconfigurable hardware
support, we allow for a greater flexibility of placement
on the security-performance spectrum than previously
proposed mainly-hardware or software approaches. Results
show that for most of our benchmarks, the average performance
penalty of our approach is less than 20%, and that
this number can be greatly improved upon with the proper
utilization of compiler and architectural optimizations.
-
Interactive Cosimulation with Partial Evaluation [p. 642]
-
P. Schaumont and I. Verbauwhede
We present a technique to improve the efficiency of hardware-software
cosimulation, using design information
known at simulator compile-time. The generic term for such
optimization is partial evaluation. Our contribution is that
we apply the optimization transparently to the user, and at
multiple abstraction levels in the simulation.
We use the technique to create an interactive codesign
environment, and evaluate it on several designs including
an AES encryption coprocessor and a Viterbi decoder, and
for several instruction-set simulators. Compared to SystemC-based
cosimulation, we achieve comparable cosimulation
performance at only a fraction of the model-build
time.
-
Communication Analysis for System on Chip Design [p. 648]
-
A. Siebenborn, O. Bringmann, and W. Rosenstiel
In this paper we present an approach for analysis of systems of parallel,
communicating processes for SoC design. We present a method
to detect communications that synchronize the program flow of two or
more processes. These synchronization points set the processes into relation
and allow the determination of the global timing behavior of such
a system. Using the results of our method for communication analysis,
we present a new method to detect communications that might produce
conflicts on shared communication resources. This information can be
used for the assignment of communication resources.
Organizer/Moderator: C. Piguet, CSEM, CH
Presenters:
J. Gautier, CEA-LETI, FR
C. Heer, Infineon Technologies, DE
I. O'Connor, Ecole centrale de Lyon, FR
U. Schlichtmann, Technical University of Munich, DE
-
Extremely Low-Power Logic [p. 656]
-
For extremely low-power logic, three very new and
promising techniques will be described. The first are
methods on circuit and system level for reduced
supply voltages. In large logic blocks, interconnect
becomes a main issue that could be addressed by on-chip
optical interconnect. Nano-devices will also be
presented as a possibility to compute with nearly
zero power, and compared to future 10-nanometer
transistors.
-
Decomposition of Instruction Decoder for Low Power Design [p. 664]
-
W. Kuo, T. Hwang, and A. Wu
Microprocessors have been used in a wide range of
applications. During the execution of instructions, instruction
decoding is a major task for identifying instructions and
generating control signals for data-paths. By exploiting program
behaviors, we propose a novel instruction-decoding approach for
power minimization. Using the proposed instruction-decoding
structure, we present a partitioning method that decomposes the
instruction-decoding circuit into two sub-circuits according to
the execution frequencies of instructions. Using our proposed
decoding structure, only one sub-circuit will be activated when
executing an instruction. Experimental results demonstrate
that our proposed approach achieves average power reductions
of 26.71% and 15.69% for the instruction decoder and the
control unit, respectively.
-
Functional Level Power Analysis: An Efficient Approach for Modeling the
Power Consumption of Complex Processors [p. 666]
-
J. Laurent, N. Julien, E. Senn, and E. Martin
A high-level consumption estimation methodology and its
associated tool, SoftExplorer, are presented. The estimation
methodology uses a functional modeling of the processor
combined with a parametric model to allow the designer to
estimate the power consumption when the embedded software
is executed on the target. SoftExplorer uses as input the
assembly code generated by the compiler; its efficiency is
compared to SimplePower's approach. Results for different
processors (TI C62, C67, C55 and ARM7) and for several
DSP applications show an average error of less than 5%.
-
Formal Verification Coverage: Are the RTL-Properties Covering the Design's Architectural Intent? [p. 668]
-
P. Basu, S. Das, P. Dasgupta, P. Chakrabarti, C. Mohan, and L. Fix
It is essential to formally ascertain whether the RTL validation
effort effectively guarantees the correctness with respect
to the design's architectural intent. The design's architectural
intent can be expressed in formal properties. However,
due to the capacity limitation of formal verification,
these architectural-properties cannot be directly verified on
the RTL. As a result, a set of lower level RTL-properties
are developed and verified against the RTL. In this paper
we present: (1) a method for checking whether the RTL-properties
are covering the architectural-properties, that is,
whether verifying the RTL-properties guarantees the correctness
of the design's architectural intent, and (2) a method
to identify the coverage holes in terms of the architectural-properties
(or their sub-properties) that are not covered.
-
Functional Coverage Metric Generation from Temporal Event Relation Graph [p. 670]
-
Y. Kwon and C. Kyung
Functional coverage is a technique which can be used
for checking the completeness of test vectors. In this paper,
automatic generation of temporal events for functional
coverage is proposed. The TERG (Temporal Event Relation
Graph) is a graph whose nodes represent basic temporal
properties and whose edges represent the time-shift value between
two properties. Hierarchical temporal events are generated
by traversing the TERG such that invalid or irrelevant
properties are eliminated. Concurrent edge groups in TERG
make it possible to generate more comprehensive temporal
properties.
-
Automatic Scan Insertion and Pattern Generation for Asynchronous Circuits [p. 672]
-
A. Efthymiou, D. Edwards, and C. Sotiriou
This paper presents 3phisLSSD, a novel, easily automatable
approach for scan insertion and ATPG
of asynchronous circuits. 3phisLSSD inserts scan latches
only into global circuit feedback paths, leaving the local
feedback paths of asynchronous state-storing gates intact.
By employing a three-phase LSSD clocking scheme
and complemented by a novel ATPG method, our approach
achieves industrial quality testability with significantly
less area overhead than full-scan LSSD while testing the
same number of faults. The effectiveness of our approach
is demonstrated on an asynchronous SOC interconnection
fabric, where our phisLSSD ATPG tool achieved
over 99% test coverage.
-
Automatic Synthesis and Simulation of Continuous-Time ΣΔ Modulators [p. 674]
-
H. Aboushady, L. de Lamarre, N. Beilleau, and M. Louërat
This paper presents a mixed equation-based and
simulation-based design methodology for continuous-time
Sigma-Delta modulators from high level specifications
down to layout. The calculation and scaling of the Sigma-Delta
coefficients as well as circuit sizing and layout generation
are implemented in the same analog design environment
CAIRO+. The design of a complete third order
current-mode continuous-time Sigma-Delta modulator
is taken as an example to show the effectiveness of the proposed
design methodology.
-
A Methodology for System-Level Analog Design Space Exploration [p. 676]
-
F. De Bernardinis and A. Sangiovanni-Vincentelli
This paper describes a novel approach to system level analog design.
A new abstraction level -- the platform -- is introduced to separate
circuit design from design space exploration. An Analog Platform
encapsulates analog components,
concurrently modeling their behavior and their achievable
performance. Performance models are obtained through statistical
sampling of circuit configurations. The design configurations space
is specified with Analog Constraint Graphs so
that the sampling space is significantly reduced. System level
exploration can be achieved through optimization on behavioral
models constrained by performance models. Finally, an
example is provided showing the effectiveness of the approach
on a WCDMA amplifier.
-
Systematic Design for Optimization of High-Resolution Pipelined ADCs [p. 678]
-
R. Lotfi, M. Taherzadeh-Sani, and O. Shoaei
Pipelining is a promising approach to implement high-speed
medium-to-high resolution analog-to-digital converters with
minimum power consumption. In this paper, the most important
specifications of a pipelined ADC including the signal-to-noise-and-distortion
ratio and spurious-free dynamic range as well as
the total current consumption of the converter are presented in
closed-form equations and an optimization methodology for design
of pipelined ADCs is suggested. Simulation results confirming the
effectiveness of the methodology are presented.
-
A Direct Bootstrapped CMOS Large Capacitive-Load Driver Circuit [p. 680]
-
J. García, J. Montiel-Nelson, J. Sosa, and H. Navarro
A new 2.5V CMOS large capacitive-load driver circuit,
using a direct bootstrap technique, for low-voltage CMOS
VLSI digital design is presented. The proposed driver circuit
exhibits high speed and low power consumption when
driving large capacitive loads. We compare our driver structure
with the direct bootstrap circuit [1] in terms of the
product of three metrics: active area, propagation time delay,
and power consumption. Results demonstrate the superior
performance of the proposed driver circuit.
-
Co-Processor Synthesis: A New Methodology for Embedded Software Acceleration [p. 682]
-
B. Hounsell and R. Taylor
This paper introduces co-processor synthesis -- a
methodology that provides design benefits by implementing
hardware co-processors directly from embedded software.
The paper examines the design benefits in this new approach
vs behavioral synthesis and configurable processor
methodologies.
-
Behavioural Bitwise Scheduling Based on Computational Effort Balancing [p. 684]
-
M. Molina, R. Ruiz-Sautua, J. Mendías, and R. Hermida
Conventional synthesis algorithms schedule multiple
precision specifications by balancing the number of
operations of every different type and width executed per
cycle. However, totally balanced schedules are not always
possible and therefore some hardware waste appears. In
this paper a heuristic scheduling algorithm to minimize
this hardware waste is presented. It successively
transforms specification operations into sets of smaller
ones until the most uniform distribution of the
computational effort of operations among cycles is
reached. In the schedules proposed some operations are
executed during a set of non-consecutive cycles.
-
A Tool for Automatic Generation of RTL-Level VHDL Description of RNS FIR Filters [p. 686]
-
A. Nannarelli, A. Del Re, and M. Re
Although digital filters based on the Residue Number
System (RNS) show high performance and low power dissipation,
RNS filters are not widely used in DSP systems,
because of the complexity of the algorithms involved. We
present a tool to design RNS FIR filters which hides the
RNS algorithms from the designer, and generates a synthesizable
VHDL description of the filter taking into account several
design constraints such as delay, area, and energy.
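As a rough illustration of the RNS arithmetic that such a tool hides from the designer, the sketch below computes one FIR output sample independently per modulus and reconstructs the result with the Chinese Remainder Theorem. The moduli, tap values, and function names are assumptions chosen for illustration and are not taken from the paper; the sketch handles only non-negative values within the dynamic range.

```python
from math import prod

MODULI = (7, 11, 13)  # assumed pairwise-coprime moduli; dynamic range = 1001


def to_rns(x, moduli=MODULI):
    """Encode a non-negative integer as its residues."""
    return tuple(x % m for m in moduli)


def from_rns(residues, moduli=MODULI):
    """Decode residues with the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+)
        total += r * partial * pow(partial, -1, m)
    return total % big_m


def rns_fir_sample(taps, window, moduli=MODULI):
    """One FIR output, sum(taps[k] * window[k]), computed per modulus."""
    acc = [0] * len(moduli)
    for h, x in zip(taps, window):
        h_res, x_res = to_rns(h, moduli), to_rns(x, moduli)
        for i, m in enumerate(moduli):
            acc[i] = (acc[i] + h_res[i] * x_res[i]) % m
    return from_rns(acc, moduli)


print(rns_fir_sample([1, 2, 3], [4, 5, 6]))  # 32
```

Each residue channel works on small numbers with no carries between channels, which is the property usually credited for the speed and power advantages of RNS filters.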
-
On Transfer Function and Power Consumption Transient Response [p. 688]
-
L. Cao
This paper proposes to use time series analysis
techniques to model both average and cycle-by-cycle
moving average power consumption behavior of
electronic systems. The power model is in the form of first
and/or second order transfer functions that represent the
mapping from primary input/output activities to power
consumption profile over time. Such an approach has
power estimation applications in both software simulation
and hardware implementation of power monitor circuit.
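A first-order transfer function of the kind mentioned above can be read as a recursive filter from switching activity to power. The sketch below is illustrative only: the difference equation p[t] = a*p[t-1] + b*x[t], the coefficient values, and the function name are assumptions, not the paper's fitted model.

```python
def first_order_power_model(activity, a=0.5, b=2.0, p0=0.0):
    """Apply p[t] = a * p[t-1] + b * x[t] to an input/output activity trace.

    a and b are hypothetical coefficients; in practice they would be
    fitted to measured or simulated power data.
    """
    power = []
    prev = p0
    for x in activity:
        prev = a * prev + b * x
        power.append(prev)
    return power


# Step in activity: the estimate rises toward the steady state b / (1 - a) = 4.0.
print(first_order_power_model([1.0] * 4))  # [2.0, 3.0, 3.5, 3.75]
```

The step response shows the transient behavior the title refers to: the estimate does not jump to its final value but approaches it at a rate set by a.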
-
Polynomial Abstraction for Verification of Sequentially Implemented Combinational Circuits [p. 690]
-
T. Raudvere, A. Singh, I. Sander, and A. Jantsch
Today's integrated circuits with increasing complexity
cause the well-known state space explosion problem in verification
tools. In order to handle this problem, a much simpler
abstract model of the design has to be created for verification.
We introduce the polynomial abstraction technique,
which efficiently simplifies the verification task of sequential
design blocks whose functionality can be expressed as
a polynomial. Through our technique, the domains of possible
values of data input signals can be reduced. This is done
in such a way that the abstract model is still valid for model
checking of the design functionality in terms of the system's
control and data properties. We incorporate polynomial abstraction
into the ForSyDe methodology, for the verification
of clock domain design refinements.
-
Regression Simulation: Applying Path-Based Learning in Delay Test and Post-Silicon Validation [p. 692]
-
L. Wang
This paper presents a novel path-based learning methodology to
achieve timing Regression Simulation. The methodology can be
applied for two purposes: (1) In pre-silicon phase, regression simulation
can be used to produce a fast and approximate timing simulator
to avoid the high cost associated with statistical timing simulation.
(2) In post-silicon phase, regression simulation can be
used as a vehicle to deduce critical paths from the pass/fail behavior
observed on the test chips. Our path-based learning methodology
consists of four major components: a delay test pattern set,
a logic simulator, a set of selected paths as the basis for learning,
and a machine learner. We summarize the key concepts in our
regression simulation approach and present experimental results.
-
A Game Theoretic Approach to Low Energy Wireless Video Streaming [p. 696]
-
A. Iranli, K. Choi, and M. Pedram
This paper presents a dynamic energy management
policy for a wireless video streaming system, consisting of battery-powered
client and server. The paper starts from the observation
that the video quality in wireless streaming is a function of three
factors: encoding aptitude of the server, decoding aptitude of the
client, and the wireless channel. Based on this observation, the
energy consumption of a wireless video streaming system is
modeled and analyzed. Using the proposed model, the optimal
energy assignment to each video frame is done such that the
maximum system lifetime is achieved while satisfying a given
minimum video quality requirement. Experimental results show that
the proposed policy increases the system lifetime by 20%.
-
Block-Enabled Memory Macros: Design Space Exploration and Application-Specific Tuning [p. 698]
-
A. Ivaldi, A. Macii, E. Macii, and L. Benini
In this paper, we propose a combined solution that allows us
to customize the architecture of internally partitioned SRAM
macros according to the given application to be executed. Energy
savings with respect to monolithic memory configurations
are above 40%, without access time violation.
-
Synthesis of Partitioned Shared Memory Architectures for Energy-Efficient Multi-Processor SoC [p. 700]
-
E. Macii, K. Patel, and M. Poncino
Accesses to the shared memory in multi-processor
systems-on-chip represent a significant performance bottleneck.
Multi-port memories are a common solution to this
problem, because they allow accesses to be parallelized. However,
they are not an energy-efficient solution.
We propose an energy-efficient shared-memory architecture
that can be used as a substitute for multi-port memories,
which is based on an application-driven partitioning
of the shared address space into a multi-bank architecture.
Experiments on a set of parallel benchmarks show energy
savings of about 56% with respect to a dual-port memory
architecture, at a very limited performance penalty.
-
A Low Power Strategy for Future Mobile Terminals [p. 702]
-
M. Nikitovic and M. Brorsson
In this paper, we have investigated the efficiency of two
power-saving strategies that reduce both static and
dynamic power consumption when applied to a chip-multiprocessor
(CMP). They are evaluated under two
workload scenarios and compared against a conventional
uni-processor architecture and a CMP without any power-aware
scheduling. The results show that energy due to static
and dynamic power consumption can be reduced by up to
78% and that a further 8% of energy can be saved at the expense
of the response time of non-critical applications.
Furthermore, a small study on the potential impact of
system-level events showed that system calls can contribute
significantly to the total energy consumed.
-
A 0.18 µm CMOS Implementation of On-Chip Analogue Test Signal Generation from Digital Test Patterns [p. 704]
-
L. Rolíndez, S. Mir, G. Prenat, and A. Bounceur
The test of Analogue and Mixed-Signal (AMS) cores
requires the use of expensive AMS testers and accessibility to
internal analogue nodes. The test cost can be considerably
reduced by the use of Built-In-Self-Test (BIST) techniques.
One of these techniques consists of generating analogue test
signals from digital test patterns (obtained via ΣΔ
modulation) and converting the responses of the analogue
modules into digital signatures that are compared with the
expected ones. This paper presents an implementation of the
analogue test signal generation part that includes
programmability of the circuit blocks, leading to an
improvement of performance and a reduction of circuit size
with respect to previous approaches. A 0.18µm CMOS
circuit has been designed and fabricated, allowing the
generation of test signals ranging from 10 Hz to 1 MHz.
-
A Digital Test for First-Order ΣΔ Modulators [p. 706]
-
G. Leger and A. Rueda
This paper presents a digital structural test for first
order Sigma-Delta modulators. A periodic digital sequence
is used as a stimulus to obtain a signature of the integrator
leakage. This parameter is known to be related to the
modulator precision and its estimation is of great
importance to assess if the modulator works as expected. As
the proposed technique is fully digital, it is especially
suitable to test modulators embedded in complex Mixed-Signal
circuits.
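To see why integrator leakage matters for modulator precision, the following behavioral sketch simulates a first-order modulator with a 1-bit quantizer and an optional leak factor. It is not the test proposed in the paper, which uses a periodic digital stimulus; the leak parameterization, the initial conditions, and the names are assumptions made for illustration.

```python
def first_order_sigma_delta(samples, leak=1.0):
    """Simulate a first-order modulator with a 1-bit quantizer.

    leak = 1.0 models an ideal integrator; leak < 1.0 models integrator
    leakage (a hypothetical parameterization chosen for this sketch).
    """
    state = 0.0
    feedback = 1  # assumed initial quantizer output
    bits = []
    for x in samples:
        state = leak * state + x - feedback  # integrate input minus feedback
        feedback = 1 if state >= 0 else -1   # 1-bit quantizer
        bits.append(feedback)
    return bits


def mean(bits):
    return sum(bits) / len(bits)


dc_input = [0.01] * 2000
ideal = mean(first_order_sigma_delta(dc_input, leak=1.0))
leaky = mean(first_order_sigma_delta(dc_input, leak=0.9))
print(ideal, leaky)
```

With the ideal integrator the bitstream average tracks the small DC input. With the assumed leak of 0.9 the output settles into an alternating pattern and the input is lost, which is one way leakage degrades precision.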
-
Trim Bit Setting of Analog Filters Using Wavelet-Based Supply Current Analysis [p. 708]
-
S. Bhunia, A. Raychowdhury, and K. Roy
The wavelet transform, unlike the Fourier transform,
resolves a signal in both time and frequency. In this
work, we show that time-domain information obtained
from wavelet analysis of supply current can be used to
efficiently trim analog filters. The pole/zero locations in
the frequency response of analog filters shift due to
change in component values with process variations.
Wavelet analysis of supply current can be a promising
alternative to test frequency specification of analog
filters, since it needs only one test stimulus and is
virtually unaffected by transistor threshold variation.
Simulation results on two test circuits demonstrate that
we can estimate pole/zero shift with less than 3% error.
Index Terms: Wavelet Transform, Analog Filter, Trim
Bit, Dynamic Supply Current (IDD).
-
SoC Test Scheduling with Power-Time Tradeoff and Hot Spot Avoidance [p. 710]
-
J. Chin and M. Nourani
We present a test scheduling methodology for core-based
systems-on-chip that can avoid hot spots and allow a tradeoff
between physical power dissipation and overall test time. A
mixed integer linear programming formulation is presented to
globally perform the power-time tradeoff, satisfy constraints,
and produce the SoC test schedule.
-
STEPS: Experimenting a New Software-Based Strategy for Testing SoCs Containing
P1500-Compliant IP Cores [p. 712]
-
M. Benabdenbi, F. Pêcheux, A. Greiner, M. Tuna and E. Viaud
This paper presents STEPS, an innovative software-based
approach for testing P1500-compliant SoCs. STEPS
is based on the concept that the ATE is not considered as
an initiator applying vectors to the SoC test pins but rather
as a target, a huge repository of 32-bit test data and control
commands. The ATE is connected to the functional SoC
external RAM controller interface. The only additional test
component in the SoC is a P1500 test processor that converts
test data into serial P1500 streams. This paper applies
the STEPS methodology to SoCs containing a VCI-compliant
interconnect, a microprocessor, P1500-compliant
IP cores and an external RAM controller interface. Using
the ITC02 SoC benchmarks, a comparison is made between
the STEPS architecture and a classical bus-based strategy.
-
Are Our Designs for Testability Features Fault Secure? [p. 714]
-
C. Metra, M. Omaña, and T. Mak
We analyze the risks associated with faults affecting some
common Design For Testability (DFT) features employed
within digital products. We will show that some DFT
structures may become useless, with consequent dramatic
impact on test effectiveness and product quality. We
borrow the Fault Secure property and we will show that it
guarantees that no escapes or false acceptance of faulty
products may occur because of faults within the DFT
structures.
-
Test Compression and Hardware Decompression for Scan-Based SoCs [p. 716]
-
F. Wolff, C. Papachristou, and D. McIntyre
We present a new decompression architecture
suitable for embedded cores in SoCs which focuses on improving
the download time by avoiding higher internal-to-ATE
clock ratios and by exploiting hardware parallelism.
The Bounded Huffman compression facilitates
decompression hardware tradeoffs. Our technique is scalable
in that the downloadable RAM-based decode table
accommodates different SoC cores with different
characteristics such as the number of scan chains and
test set data distributions.
-
Concurrent Sizing, Vdd and Vth Assignment for Low-Power Design [p. 718]
-
A. Srivastava, D. Sylvester, and D. Blaauw
We present a sensitivity-based algorithm for minimizing total power, including dynamic and subthreshold leakage power, using simultaneous sizing, Vdd and Vth assignment. The proposed algorithm is implemented and tested on a set of combinational benchmark circuits. A comparison with traditional CVS-based algorithms demonstrates the advantage of the algorithm, including an average power reduction of 37% at primary input activities of 0.1. We also investigate the impact of various low Vdd values on total power savings.
-
Sizing and Characterization of Leakage-Control Cells for Layout-Aware Distributed Power-Gating [p. 720]
-
P. Babighian, E. Macii, and L. Benini
This paper proposes a methodology for sleep transistor sizing for
usage in a novel, single-threshold leakage cut-off approach, where
power gating cells are distributed row-by-row in a fully placed
circuit. Sizing equations are obtained by performing SPICE
simulations for a 130nm technology. Furthermore, the layout
of a test case is considered and power and delay values are
extracted in order to demonstrate the practical impact of our
solution.
-
An Asynchronous Synthesis Toolset Using Verilog [p. 724]
-
F. Burns, D. Shang, A. Koelmans, and A. Yakovlev
We present a new CAD tool set for generating asynchronous
circuits from high-level Verilog level-sensitive
specifications. Initially high-level Verilog descriptions are
compiled and converted into a novel intermediate Petri-net
format. The intermediate format is subsequently passed to
optimization tools and mapping tools where it is directly
mapped into asynchronous datapath and control circuits using
David Cells (DCs). Finally logic optimization tools are
applied to generate speed independent (SI) circuits. The
speed independent circuits generated perform well compared
to circuits generated by existing asynchronous tools.
-
Organizing Libraries of DFG Patterns [p. 726]
-
G. Dittmann
We propose to arrange a library of tree patterns into a
hierarchy by means of identity operations. Compared with
current unstructured approaches, our new method reduces
the computational complexity of searching a pattern from
O(n·p) to only O(d), d ≤ p. Furthermore, the organization
reveals synergies between patterns for ASIP instruction-set
synthesis, data-path sharing, and code generation.
-
Compositional Memory Systems for Data Intensive Applications [p. 728]
-
A. Molnos, M. Heijligers, J. Van Eijndhoven, and S. Cotofana
To alleviate the system performance unpredictability of
multitasking applications running on multiprocessor platforms
with shared memory hierarchies, we propose a task-level,
set-based cache partitioning. We evaluate our approach
on a CAKE platform with three Trimedias, one MIPS
and a shared level 2 cache using a picture-in-picture benchmark.
We compare the performance implications of two
types of cache partitioning, namely associativity-based and
set-based. Our experiments indicate that associativity-based
cache partitioning induces at least 30% performance degradation,
whereas set-based partitioning provides a 27% performance
improvement when compared to a non-partitioned cache scenario.
-
Scalar Metric for Temporal Locality and Estimation of Cache Performance [p. 730]
-
J. Alakarhu and J. Niittylahti
A scalar metric for temporal locality is proposed. The
metric is based on LRU stack distance. This paper shows
that the cache hit rate can be estimated based on the proposed
metric (an error of a few percent can be expected).
The metric alleviates high-level memory system outlining
and enables using stack processing in run-time locality
analysis.
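As an illustration of the LRU stack distance on which the metric is based, the following minimal Python sketch computes the stack distance of every reference in a trace and derives the hit rate of a fully associative LRU cache from it. It is not the scalar metric proposed in the paper; the function names and the toy trace are assumptions made for this example.

```python
def stack_distances(trace):
    """Return the LRU stack distance of each reference (None for a first touch)."""
    stack = []  # most recently used block is at the end
    distances = []
    for block in trace:
        if block in stack:
            idx = stack.index(block)
            # Number of distinct blocks referenced since the last use of `block`.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(block)
    return distances


def lru_hit_rate(trace, capacity):
    """Hit rate of a fully associative LRU cache holding `capacity` blocks."""
    dists = stack_distances(trace)
    # A reference hits exactly when its stack distance is below the capacity.
    hits = sum(1 for d in dists if d is not None and d < capacity)
    return hits / len(dists)


# Hypothetical trace of block identifiers.
trace = ["a", "b", "c", "a", "b", "d", "a"]
print(stack_distances(trace))
print(lru_hit_rate(trace, 3))
```

Because one pass over the trace yields the distances for every cache size at once, the hit rate for several capacities can be read off without re-simulating each cache.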
-
.NET Framework -- A Solution for the Next Generation Tools for System-Level Modeling and Simulation [p. 732]
-
J. Lapalme, E. Aboulhamid, G. Nicolescu, L. Charest, J. David, F. Boyer, and G. Bois
-
Modeling and Simulating Memory Hierarchies in a Platform-Based Design Methodology [p. 734]
-
P. Viana, E. Barros, S. Rigo, R. Azevedo, and G. Araújo
This paper presents an environment based on SystemC
for architecture specification of programmable systems.
Making use of the new architecture description language
ArchC, able to capture the processor description as
well as the memory subsystem configuration, this environment
offers support for system-level specification, intended
for platform-based design. As a case study, we present
the memory architecture exploration for a simple
image processing application; a more robust evaluation
of the environment is performed through the execution of
some real-world benchmarks.
-
Integrating the Synchronous Dataflow Model with UML [p. 736]
-
P. Green and S. Essa
UML has attracted significant interest as a system
description language. However, some aspects of
embedded system behavior are difficult to model in UML.
In particular, applications with significant dataflow
components are not well represented. This paper
considers how the synchronous dataflow model can be
integrated with UML to provide behavioral descriptions,
in an object oriented context, for system elements that
perform stream processing. The integration of the SDF
model with the UML state machine model is also
discussed.
-
Design and Behavioral Modeling Tools for Optical Network-On-Chip [p. 738]
-
M. Brière, L. Carrel, T. Michalke, F. Mieyeville, I. O'Connor, and F. Gaffiot
In this paper, we present a tool to analyse photonic devices
that can be used to realize basic building blocks of an
optical network-on-chip (ONoC). Co-design between electrical
tools and optical tools is possible. The VHDL-AMS
language has been used to implement behavioral models
of photonic devices. For low-level simulation, a gateway
between an optical simulator, based on the finite elements
method, and a typical EDA layout editor has been realized.
-
Hierarchical Modeling and Simulation of Large Analog Circuits [p. 740]
-
S. Tan, Z. Qi, and H. Li
This paper proposes a new hierarchical circuit modeling
and simulation technique in s-domain for linear analog
circuits. The new algorithm can perform circuit complexity
reduction by deriving the exact or approximate admittances
in rational form in the reduced circuit matrix and deriving
the circuit characteristics for very large linear analog and
interconnect circuits. We characterize some theoretical results
regarding the conditions on the generations of canceling
terms during the general hierarchical circuit analysis
and propose an explicit de-cancellation scheme to remove
canceling terms based on a new hierarchical symbolic analysis
framework. The resulting algorithm can be used for
modeling and simulation of linear analog and interconnect
circuits in both frequency and time domain.
-
Efficient Mixed-Domain Behavioural Modeling of Ferromagnetic Hysteresis Implemented in VHDL-AMS [p. 742]
-
P. Wilson, J. Ross, A. Brown, T. Kazmierski, and J. Baranowski
In this paper a modified model of ferromagnetic
hysteresis suitable for mixed-signal simulations in VHDL-AMS
is presented. The aim of this paper is to demonstrate
how a numerically stable and accurate implementation of
the Jiles-Atherton model can be achieved using a 4th
order Runge-Kutta integration of the derivative of
magnetization with respect to the field strength (H). While
most SPICE-like implementations require inconvenient
integration in time to obtain the magnetization derivative,
our approach is more general as it does not rely on the
underlying differential equation solver for this purpose.
The model addresses the non-physical situation of
negative BH slopes and proposes an alternative
implementation of the anhysteretic function using a
polynomial approximation of the Langevin function for
low signal levels and a new function with no
discontinuities. Model efficiency is improved by
monitoring the change in H and only activating the
integration function when H changes by a specified
amount.
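The need for a polynomial form of the Langevin function at low signal levels can be seen in a small Python sketch. The closed form L(x) = coth(x) - 1/x is undefined at x = 0 and loses precision nearby, so the sketch switches to the standard Taylor series for small |x|. The switching threshold and the choice of the Taylor series are assumptions for this example; the abstract does not state which polynomial the authors use.

```python
import math


def langevin(x, small=1e-2):
    """Langevin function L(x) = coth(x) - 1/x, with a series for small |x|."""
    if abs(x) < small:
        # Taylor expansion about 0: x/3 - x^3/45 + 2*x^5/945 (assumed polynomial).
        return x / 3.0 - x**3 / 45.0 + 2.0 * x**5 / 945.0
    return 1.0 / math.tanh(x) - 1.0 / x


for x in (0.0, 0.005, 1.0):
    print(x, langevin(x))
```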
-
A Fast Algorithm for Finding Maximal Empty Rectangles for Dynamic FPGA Placement [p. 744]
-
M. Handa and R. Vemuri
In this paper, we present a fast algorithm for finding
empty area on the FPGA surface with some rectangular
tasks placed on it. We use a staircase data structure to report
the empty area in the form of a list of maximal empty
rectangles. We model the FPGA surface using an innovative
encoding scheme that improves runtime and reduces memory
requirement of our algorithm. Worst-case time complexity
of our algorithm is O(xy) where x is number of columns,
y is number of rows and x.y is the total number of cells on
the FPGA.
-
Enhancing Reliability of Operational Interconnections in FPGAs [p. 746]
-
A. Fit-Florea, M. Halas, and F. Kocan
SRAM-based Field-Programmable Gate Arrays (FPGAs)
have fixed numbers of wires, switches and look-up
tables. An application does not fully utilize all available
components in an FPGA, e.g. wires. In this paper,
we propose methods to improve reliability of less reliable
operational interconnections by efficiently utilizing
unused wires to mask errors dynamically. With these methods,
we are able to improve the reliability of more than
two-thirds of all interconnections in the studied MCNC
benchmarks. As a result, the overall unreliability of operational
interconnections decreases more than 20%.
-
Operating System Support for Interface Virtualization of Reconfigurable Coprocessors [p. 748]
-
M. Vuletic, L. Righetti, L. Pozzi, and P. Ienne
Reconfigurable Systems-on-Chip (SoC) consist of large
Field-Programmable Gate-Arrays (FPGAs) and standard
processors. The reconfigurable logic can be used for
application-specific coprocessors to speed up execution of
applications. The widespread use is limited by the complexity
of interfacing software applications with coprocessors.
We present a virtualisation layer that lowers the interfacing
complexity and improves the portability. The layer shifts
the burden of moving data between processor and coprocessor
from the programmer to the Operating System (OS).
A reconfigurable SoC running Linux is used to prove the
concept.
Volume II
Moderators: R. Ernst, TU Braunschweig, DE; A. Jantsch, Royal Inst. of Tech., SE
-
Analyzing On-Chip Communication in a MPSoC Environment [p. 752]
-
F. Angiolini, D. Bertozzi, L. Benini, M. Loghi, and R. Zafalon
This work focuses on communication architecture analysis
for multi-processor Systems-on-Chips (MPSoCs), and it
leverages a SystemC-based platform to simulate a complete
multi-processor system at the cycle-accurate and signal-accurate
level. These features make it possible to stimulate the communication
sub-system with functional traffic generated by
real applications running on top of a configurable number
of ARM processors. This opens up the possibility for communication
infrastructure exploration and for the investigation
of its impact on system performance at the highest
level of accuracy. Our simulation environment proved
capable of a detailed comparative analysis between two
industry-standard communication architectures, under realistic
workloads and different system configurations, pointing
out the impact of fine grained architectural mismatches
on macroscopic performance differences.
-
A Mapping Strategy for Resource-Efficient Network Processing on Multiprocessor SoCs [p. 758]
-
M. Grünewald, J. Niemann, M. Porrmann, and U. Rückert
Hardware architectures based on a field of hardware-extended
processors can provide flexible computing power
for applications where parallelism can be exploited. For
multiprocessors, the assignment of functionality to execution
units can have a great impact on the performance.
Additionally, finding the optimal mapping can be a time-consuming
task. We present a multiprocessor architecture
along with a suitable design method that includes an automated
solution to the mapping problem. Our hardware architecture
employs a network-on-chip (NoC) to achieve a
high degree of scalability for the application and for the
system with respect to future integration technologies. We also
show how to reduce the packet buffer requirements with a
proper scheduling strategy and present first estimates for
the resource consumption of an application targeted for mobile
networking.
-
Cost-Performance Trade-Offs in Networks on Chip: A Simulation-Based Approach [p. 764]
-
S. Pestana, E. Rijpkema, A. Radulescu, K. Goossens, and O. Gangwal
A challenge facing designers of systems on chip (SoC) containing
networks on chip (NoC) is to find NoC instances that balance
the cost (e.g. area) and performance (e.g. latency and throughput).
In this paper we present a simulation-based approach to
address this problem. We use XML to instantiate network components
(routers, network interfaces) and their composition. NoCs
are evaluated in terms of cost and performance by sweeping over
different parameters (e.g. network topology, network interface
queue depth). We then show how we can obtain trade-off plots
by using the results obtained with our simulation environment. Finally,
by means of two examples we illustrate how trade-off plots
can help the NoC designers in selecting the right network based
on a set of different constraints.
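A trade-off plot of this kind is typically reduced to the set of non-dominated design points. The minimal Python sketch below filters a parameter sweep down to that set, assuming that lower is better for both cost and latency. The instance names and numbers are invented for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Return the points not dominated in (cost, latency); lower is better for both."""
    front = []
    for name, cost, latency in points:
        dominated = any(
            c <= cost and l <= latency and (c < cost or l < latency)
            for _, c, l in points
        )
        if not dominated:
            front.append((name, cost, latency))
    # Sort by cost so the front can be plotted as a curve.
    return sorted(front, key=lambda p: p[1])


# Hypothetical sweep results: (instance, area cost, latency).
candidates = [
    ("mesh_q2", 1.0, 40.0),
    ("mesh_q8", 1.6, 25.0),
    ("ring_q2", 0.8, 70.0),
    ("ring_q8", 1.7, 45.0),
]
for point in pareto_front(candidates):
    print(point)
```

A designer with an area budget or a latency bound would then pick from the remaining points only.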
-
A Case Study in Networks-on-Chip Design for Embedded Video [p. 770]
-
J. Xu, W. Wolf, T. Lv, J. Henkel, and S. Chakradhar
In this paper we study bus-based and switch-based on-chip
networks for an embedded video application, the Smart
Camera SoC (system on chip). We analyze network
performance and overall system performance in detail. We
explore system performance using crossbars with different
sizes, fixed size but different numbers of ports, and different
numbers of shared memories. We find that the network is a
performance bottleneck in our design, and the system using
an optimized NoC can outperform one using a bus by 132%.
Our simulations are based upon recorded real
communication traces, which give more accurate system
performance. Our study finds that for the Smart Camera
system, a 16-bit/port 3x3 crossbar with two shared
memories shows 85.7% performance improvement over the
bus-based model and also has less maximum network
throughput than the bus-based model. This design example
illustrates a methodology to quickly and accurately estimate
the performance of NoC's at architecture level.
Moderators: T. Villa, Udine U, IT; T. Shiple, Synopsys, FR
-
Exploiting Crosstalk to Speed up On-Chip Buses [p. 778]
-
C. Duan and S. Khatri
In modern VLSI processes, the cross-coupling capacitance
between adjacent wires on the same
metal layer is a very large fraction of the total wire capacitance.
This leads to problems of delay variation due to
crosstalk and reduced noise immunity, arguably one of the
biggest obstacles in the design of ICs in recent times.
This problem is particularly severe in long on-chip
buses, since bus signals are routed at minimum pitch for
long distances. In this work, we propose to solve this problem
by the use of crosstalk canceling CODECs. We only
utilize memoryless CODECs, to reduce the logical complexity
and enhance the robustness of our techniques.
Bus data patterns can be classified (as 4·C, 3·C, 2·C, 1·C or 0·C
patterns) based on the maximum
amount of crosstalk that they can exhibit. Crosstalk avoidance
CODECs which eliminate 4·C and 3·C patterns have
been reported. In this paper, we describe crosstalk avoidance
techniques which eliminate 2·C and 1·C patterns.
We describe an analytical methodology to accurately characterize
the bus area overhead of 2·C pattern CODECs.
Using these results, we characterize the area overhead versus
crosstalk immunity achieved. A similar exercise is
performed for 1·C patterns.
Our experimental results show that by using 2·C
crosstalk canceling techniques, buses can be sped up by up
to a factor of 6 with an area overhead of about 200%, and
that 1·C techniques are not very robust.
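The pattern classes can be made concrete with a short Python sketch that assigns a class to a bus transition. It assumes the commonly used model in which a switching wire sees a coupling load of |d_i - d_j| (in units of C) from each neighbour j, where d is +1 for a rising wire, -1 for a falling wire and 0 for a static wire, and in which the class of a transition is the worst load over all switching wires. The abstract does not spell out this model, so treat it and the function name as assumptions.

```python
def crosstalk_class(prev, curr):
    """Worst-case crosstalk class (0..4, in units of C) of a bus transition."""
    assert len(prev) == len(curr)
    # +1 for a rising wire, -1 for a falling wire, 0 for a static wire.
    delta = [c - p for p, c in zip(prev, curr)]
    worst = 0
    for i, d in enumerate(delta):
        if d == 0:
            continue  # only switching wires are considered as victims
        load = 0
        for j in (i - 1, i + 1):
            if 0 <= j < len(delta):
                load += abs(d - delta[j])
        worst = max(worst, load)
    return worst


print(crosstalk_class([0, 1, 0], [1, 0, 1]))  # both neighbours switch the opposite way
print(crosstalk_class([0, 0, 0], [1, 1, 1]))  # all wires switch together
```

Under this model, a CODEC that eliminates 4·C and 3·C patterns only admits codeword pairs for which the function returns at most 2.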
-
False-Noise Analysis for Domino Circuits [p. 784]
-
A. Glebov, S. Gavrilov, V. Zolotov, M. Becer, C. Oh, and R. Panda
High-performance digital circuits are facing increasingly
severe noise problems due to cross-coupled noise injection. Traditionally,
noise analysis tools use the conservative assumption that
all neighbors of a net can switch simultaneously, producing the
worst-case noise. However, due to logic correlations in the circuit,
this worst-case noise may not be realizable, resulting in a so-called
false noise failure. Some techniques for computing logic
correlations have been designed targeting static CMOS circuits.
However, high-performance microprocessors commonly use domino
logic for their ALU. The domino circuits have lower noise margins
than static CMOS circuits and are more sensitive to coupled
noise. Any unnecessary pessimism of the noise analysis tool results
in a large number of false noise violations and either requires additional
extensive SPICE simulations or circuit over-design. Unfortunately
false noise analysis developed for static CMOS circuits
[11] fails to compute many logic correlations in domino circuits.
In this paper we propose a novel technique of computing logic correlations
in domino circuits. It takes into account the fact that both
pull up and pull down networks of a domino gate can be in non
conducting state. The proposed technique generates additional
logic correlations for such states of domino gates. In order to
improve the capability of the logic correlation derivation technique, we
combine the resolution method with a recursive learning algorithm
[12]. The proposed technique is implemented in an industrial
noise analysis tool and tested on high performance ALU blocks.
-
Crosstalk Minimization in Logic Synthesis for PLA [p. 790]
-
Y. Liu, T. Hwang, and K. Wang
We propose a maximum crosstalk minimization algorithm
taking logic synthesis into consideration for PLA structures.
To minimize the crosstalk, a wire-permutation technique
is used, which includes the following steps. First, product
lines are partitioned into a long set and a short set, and then
the product lines in the two sets are interleaved. With the
interleaving algorithm, an upper bound on the maximum
coupling capacitance of the product lines can be derived.
Then, we take advantage of the crosstalk immunity of product
lines in the long set to further reduce the maximum crosstalk
effect of the PLA. Finally, synthesis techniques such as local
transformation and global transformation are taken into
consideration to search for a better result. The experiments
demonstrate that our algorithm can effectively minimize the
maximum crosstalk effect of a circuit by 48% as compared
with the original area-minimized PLA without crosstalk
minimization.
-
Synthesis for Manufacturability: A Sanity Check [p. 796]
-
A. Nardi and A. Sangiovanni-Vincentelli
As we move towards nanometer technology, manufacturing
problems become overwhelmingly difficult to solve.
Presently, optimization for manufacturability is performed
at a post-synthesis stage and has been shown capable of reducing
manufacturing cost up to 10%. As in other cases,
raising the abstraction layer where optimization is applied
is expected to yield substantial gains. This paper focuses on
a new approach to design for manufacturability: logic synthesis
for manufacturability. This methodology consists of
replacing the traditional area-driven technology mapping
with a new manufacturability-driven one. We leverage existing
logic synthesis tools to test our method. The results
obtained by using STMicroelectronics 0.13µm library confirm
that this approach is a promising solution for designing
circuits with lower manufacturing cost, while retaining
performance. Finally, we show that our synthesis for manufacturability
can achieve even larger cost reduction when
yield-optimized cells are added to the library, thus enabling
a wider area-yield tradeoff exploration.
Moderators: G. Carlsson, Ericsson Telecom, SE; K. Chakrabarty, Duke U, US
-
Design of Sub-10-Picoseconds On-Chip Time Measurement Circuit [p. 804]
-
M. Abas, D. Kinniment, and G. Russell
The rapid pace of change in IC technology, specifically
in speed of operation, demands sophisticated design
solutions for IC testing methodologies. Moreover, the
current technology of System-on-chip (SOC) makes
great demands for testing internal speed accurately as
the limitation on accessing internal nodes using I/O pins
becomes more difficult. This paper presents two high-resolution
time measurement schemes for digital BIST
applications, namely: Two-Delay Interpolation Method
(TDIM) and Time Amplifier. The two schemes are
combined to produce a completely new design for BIST
time measurement which offers two main advantages: a
low range of timing measurement which has never been
achieved before, and a small size of layout occupying
0.2 mm² or equivalent to 3020 transistors. These two
features are undoubtedly compatible with present high-speed
SOC design architectures.
-
Impact of Test Point Insertion on Silicon Area and Timing during Layout [p. 810]
-
H. Vranken, H. Wunderlich, and F. Sapei
This paper presents an experimental investigation on
the impact of test point insertion on circuit size and performance.
Often test points are inserted into a circuit in
order to improve the circuit's testability, which results in
smaller test data volume, shorter test time, and higher
fault coverage. Inserting test points however requires additional
silicon area and influences the timing of a circuit.
The paper shows how placement and routing are affected
by test point insertion during layout generation. Experimental
data for industrial circuits show that inserting 1%
test points in general increases the silicon area after layout
by less than 0.5% while the performance of the circuit
may be reduced by 5% or more.
-
Designing Self Test Programs for Embedded DSP Cores [p. 816]
-
H. Rizk, C. Papachristou, and F. Wolff
This paper describes a self test program design technique
for embedded DSP cores. The method requires minimal knowledge of
the core's internals and minimal insertion of external LFSR hardware,
without scan insertions. The test program consists of a small set of
instructions which operate iteratively on pseudorandom data generated
by the LFSRs to fully test the DSP core components. The method uses
instruction-based test metrics and a program template as a blueprint to
generate the test program. The self test scheme has been successfully
applied on an industrial-strength DSP core and the results compare
favorably to other methods using ATPG and pseudorandom BIST.
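For readers unfamiliar with LFSR-based pattern generation, the following Python sketch models a 16-bit Fibonacci LFSR as a source of pseudorandom test data. The register width, tap positions (16, 14, 13, 11, a well-known maximal-length configuration) and seed are assumptions for this example; the abstract does not specify the LFSRs used with the DSP core.

```python
def lfsr16(seed=0xACE1):
    """Yield successive states of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero")  # the all-zero state never changes
    while True:
        # XOR of the tapped bits forms the new most significant bit.
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state


def lfsr_period(seed=0xACE1):
    """Count the steps until the LFSR returns to its seed state."""
    count = 0
    for state in lfsr16(seed):
        count += 1
        if state == (seed & 0xFFFF):
            return count


gen = lfsr16()
print([hex(next(gen)) for _ in range(4)])
```

In a self-test loop, each generated state would serve as an operand for the instructions of the test program.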
Moderators: P. Feldmann, IBM T.J. Watson Res. Center, US; L. Silveira, INESC ID/IST - TU Lisbon, PT
-
Automated, Accurate Macromodelling of Digital Aggressors for Power/Ground/Substrate Noise Prediction [p. 824]
-
Z. Wang, J. Roychowdhury, and R. Murgai
Noise analysis and power distribution network reliability
are extremely important in deep sub-micron digital
and mixed-signal circuit design. Both relate closely to the
nonlinear loading impact of digital circuits. Consequently,
accurate estimation of the latter is critical. In this paper,
we present extraction techniques that automatically generate
a family of small, time-varying macromodels for digital cell
libraries, at the time of their library characterization. Our approach
is based on importing and adapting the Time-Varying
Padé (TVP) method, for linear time-varying (LTV) model reduction,
from the mixed-signal macromodelling domain. Our
approach features naturally higher accuracy than previous
ones, and in addition, offers the user a tradeoff between accuracy
and macromodel complexity. A key attraction of our
approach is that it can be merged into cell library extraction
methodologies to produce accurate-by-construction noise
models for digital blocks. Simulations and comparisons confirming
the efficacy of our approach are provided.
-
Thermal and Power Integrity Based Power/Ground Networks Optimization [p. 830]
-
T. Wang, J. Tsai, and C. Chen
With the increasing power density and heat-dissipation cost
of modern VLSI designs, thermal and power integrity have become
a serious concern. Although the impacts of thermal effects
on transistor and interconnect performance are well-studied, the
interactions between power-delivery and thermal effects are not
clear. As a result, power-delivery design without thermal consideration
may cause soft errors, reliability degradation, and even
premature chip failures. In this paper, we propose a thermal-aware
power-delivery optimization algorithm. By simultaneously
considering thermal and power integrity, we are able to
achieve high power supply quality and thermal reliability. For a
58 x 72 mesh as shown in the experimental results, our algorithm
shows that the lifetime of the optimized ground network
is 9.5 years, whereas the lifetime of the ground network generated
by a traditional method without thermal consideration is only 2 years.
-
Synthesized Compact Models (SCM) of Substrate Noise Coupling Analysis and
Synthesis in Mixed-Signal ICs [p. 836]
-
H. Lan and R. Dutton
An approach for synthesized compact models (SCM) of
substrate noise coupling is presented. The model is formulated
using a parameterized and scalable Z matrix. The
improvement in modeling near field effects results in better
substrate noise modeling for analog circuits. The geometrical
scalability of the model provides a bi-directional link
between noise analysis in the post-layout phase for verification
and the noise-aware layout synthesis using convex
optimization techniques. The model is validated by rigorous
EM and device simulations. Several application examples
are used to demonstrate the bi-directional usage of the
model.
Organizer/Moderator: P. Paulin, STMicroelectronics, FR
Panellists:
R. Bramley, STMicroelectronics
A. Silburt, Cisco, CAN
J. Balzano, Alcatel, FR
K. van Berkel, Philips Research, NL
N. Wehn, Kaiserslautern U, DE
-
Chips of the Future: Soft, Crunchy or Hard? [p. 844]
-
Today's electronic products are composed of an
increasingly diverse set of IC's, ranging from dedicated
ASIC's, domain-specific ASSP's, platform FPGA's, to
general-purpose FPGA's. With increasing integration, a
mix of different fabrics on a single SoC becomes possible,
combining ASIC-style standard cells, embedded FPGA's,
mask-programmable sea-of-gates, and programmable
processors. The panelists will present their vision of the
fabric which will dominate SoC's in 90nm technologies and
beyond, based on industrial trends and case studies. They
will also outline the key CAD tool challenges for the
chosen fabric.
Moderators: M. Pedram, Southern California U, US; A. Amara, ISEP, FR
-
Tuning In-Sensor Data Filtering to Reduce Energy Consumption in Wireless Sensor Networks [p. 852]
-
I. Kadayif and M. Kandemir
In recent years, research on wireless sensor networks has been
undergoing a revolution, promising to have significant impact on
a broad range of applications from military to health care to
food safety. An important problem in many sensor network
applications is to decide the amount of computation (or filtering)
that needs to be done in the sensor nodes before the data are
shifted to a central base station. The right amount of data filtering in
the sensor nodes can lead to large savings in network-wide
energy consumption. The main goal of this paper is to develop an
automated strategy for data filtering in wireless sensor nodes.
Assuming that one needs to reduce the overall energy
consumption (as opposed to reducing just computation energy or
communication energy), the proposed strategy attempts to strike
a balance between computation energy consumption and
communication energy consumption. Our experimental results
clearly indicate that the proposed data filtering strategy
generates substantial energy savings in practice.
-
Power-Aware Network Swapping for Wireless Palmtop PCs [p. 858]
-
A. Acquaviva, E. Lattanzi, and A. Bogliolo
Virtual memory is considered to be an unlimited resource
in desktop or notebook computers with high storage memory
capabilities. However, in wireless mobile devices like
palmtops and personal digital assistants (PDA), storage
memory is limited or absent due to weight, size and power
constraints. As a consequence, swapping over remote memory
devices can be considered as a viable alternative. Nevertheless,
power hungry wireless network interface cards
(WNIC) may limit the battery lifetime and application performance
if not efficiently exploited. In this work we explore
performance and energy of network swapping in comparison
with swapping on local micro-drives and flash memories.
Our study points out that remote swapping over power-manageable
WNICs can be more efficient than local swapping
and that both energy and performance can be optimized
through power-aware reshaping of data requests. Experimental
results show that our optimization technique can
save up to 60% of communication energy while improving
performance.
-
Power Aware Interface Synthesis for Bus-Based SoC Design [p. 864]
-
N. Liveris and P. Banerjee
In this paper we discuss the problem of interface synthesis
for a system on a chip (SoC) such that the power
consumption is minimized under some given latency constraints.
Since the AMBA protocol has become one of the
standard interfaces for SoC cores, we develop our interface
synthesis methods around the AMBA protocol. We first provide
an analysis of the parameters of the AMBA bus and
the communication protocols and a bus power model that
will be used by various transformations. Several latency
improving and power minimizing transformations are presented
at the bus level. Finally, a heuristic is presented
which applies the above transformations in a certain order
to provide minimum power under a given latency constraint.
Experimental results are reported on two example
benchmarks and show that the heuristic is able to reduce
power consumption on the wires by about 28% on
average from an initial design having a single layer bus architecture.
-
Asynchronous Design by Conversion: Converting Synchronous Circuits into Asynchronous Ones [p. 870]
-
A. Branover, R. Kol, and R. Ginosar
A novel methodology and algorithm for the design of
large low-power asynchronous systems are described.
The system is synthesized by a commercial tool as a
synchronous circuit, and subsequently converted into an
asynchronous one. The conversion algorithm consists of
extracting input and output sets, replacing the storage
elements, identifying fork and join sets, and constructing
request and acknowledge networks. A DLAP (Doubly
Latched Asynchronous Pipeline) architecture is
employed. The resulting asynchronous circuit can adapt
its effective operating frequency to the supply voltage,
facilitating flexible and efficient power management. The
algorithm has been validated on several circuits.
Moderators: G. Nicolescu, Ecole Polytechnique de Montreal, CA; M. Coppola, STMicroelectronics, FR
-
An Efficient On-Chip Network Interface Offering Guaranteed Services, Shared-Memory Abstraction, and
Flexible Network Configuration [p. 878]
-
A. Radulescu, J. Dielissen, K. Goossens, E. Rijpkema, and P. Wielage
In this paper we present a network interface for an on-chip
network. Our network interface decouples computation from communication
by offering a shared-memory abstraction, which is independent
of the network implementation. We use a transaction-based
protocol to achieve backward compatibility with existing
bus protocols such as AXI, OCP and DTL. Our network interface
has a modular architecture, which allows flexible instantiation. It
provides both guaranteed and best-effort services via connections.
These are configured via network interface ports using the network
itself, instead of a separate control interconnect. An example
instance of this network interface with 4 ports has an area of
0.143 mm² in a 0.13µm technology, and runs at 500 MHz.
-
×pipesCompiler: A Tool for Instantiating Application Specific Networks-on-Chip [p. 884]
-
S. Murali, G. De Micheli, A. Jalabert, and L. Benini
Future Systems on Chips (SoCs) will integrate a large
number of processor and storage cores onto a single chip
and require Networks on Chip (NoC) to support the heavy
communication demands of the system. The individual components
of the SoCs will be heterogeneous in nature with
widely varying functionality and communication requirements.
The communication infrastructure should optimally
match communication patterns among these components
accounting for the individual component needs. In this paper
we present ×pipesCompiler, a tool for automatically
instantiating an application-specific NoC for heterogeneous
Multi-Processor SoCs. The ×pipesCompiler
instantiates a network of building blocks from a library of
composable soft macros (switches, network interfaces and
links) described in SystemC at the cycle-accurate level. The
network components are optimized for that particular network
and support reliable, latency-insensitive operation.
Example systems with application-specific NoCs built using
the ×pipesCompiler show large savings in area (factor
of 6.5), power (factor of 2.4) and latency (factor of 1.42)
when compared to a general-purpose mesh-based NoC architecture.
Keywords: Systems on Chips, Networks on Chips,
latency-insensitive design, application-specific, SystemC.
-
Guaranteed Bandwidth Using Looped Containers in Temporally Disjoint Networks within the
Nostrum Network on Chip [p. 890]
-
M. Millberg, E. Nilsson, R. Thid, and A. Jantsch
In today's emerging Network-on-Chips, there is a need for
different traffic classes with different Quality-of-Service
guarantees. Within our NoC architecture Nostrum, we have
implemented a service of Guaranteed Bandwidth (GB),
and latency, in addition to the already existing service of
Best-Effort (BE) packet delivery. The guaranteed bandwidth
is accessed via Virtual Circuits (VC). The VCs are
implemented using a combination of two concepts that we
call "Looped Containers" and "Temporally Disjoint Networks".
The Looped Containers are used to guarantee
access to the network -- independently of the current network
load without dropping packets; and the TDNs are
used in order to achieve several VCs, plus ordinary BE traffic,
in the network. The TDNs are a consequence of the
deflective routing policy used, and give rise to an explicit
time-division-multiplexing within the network. To prove
our concept, an HDL implementation has been synthesised
and simulated. The cost in terms of additional hardware
needed, as well as additional bandwidth is very low -- less
than 2 percent in both cases! Simulations showed that
ordinary BE traffic is practically unaffected by the VCs.
-
Bandwidth-Constrained Mapping of Cores onto NoC Architectures [p. 896]
-
S. Murali and G. De Micheli
We address the design of complex monolithic systems,
where processing cores generate and consume a varying
and large amount of data, thus bringing the communication
links to the edge of congestion. Typical applications are
in the area of multi-media processing. We consider a mesh-based
Networks on Chip (NoC) architecture, and we explore
the assignment of cores to mesh cross-points so that the traffic
on links satisfies bandwidth constraints. A single-path
deterministic routing between the cores places high bandwidth
demands on the links. The bandwidth requirements
can be significantly reduced by splitting the traffic between
the cores across multiple paths. In this paper, we present
NMAP, a fast algorithm that maps the cores onto a mesh
NoC architecture under bandwidth constraints, minimizing
the average communication delay. The NMAP algorithm is
presented for both single minimum-path routing and split-traffic
routing. The algorithm is applied to a benchmark
DSP design and the resulting NoC is built and simulated
at cycle accurate level in SystemC using macros from the
×pipes library. Also, experiments with six video processing
applications show significant savings in bandwidth and
communication cost for the NMAP algorithm when compared
to existing algorithms.
Keywords: Systems on Chips, Networks on Chips,
cores, mapping, bandwidth, routing.
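The bandwidth check at the heart of such a mapping problem can be
sketched as follows. This is not the NMAP algorithm itself, only an
illustrative evaluation of one candidate placement under single-path
routing. The dimension-ordered (XY) routing, the hop-weighted cost, the
core names, and the numbers are all assumptions made for the example.

```python
from collections import defaultdict

def xy_route(src, dst):
    """Return the directed mesh links on the X-then-Y path from src to dst."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def evaluate_mapping(flows, placement, link_capacity):
    """Return (cost, feasible, link loads) for one placement.

    flows: {(src_core, dst_core): bandwidth}
    placement: {core: (x, y)} mesh cross-point of each core
    cost: sum of bandwidth * hop count (an assumed cost model)
    """
    load = defaultdict(float)
    cost = 0.0
    for (src, dst), bandwidth in flows.items():
        path = xy_route(placement[src], placement[dst])
        cost += bandwidth * len(path)
        for link in path:
            load[link] += bandwidth
    feasible = all(v <= link_capacity for v in load.values())
    return cost, feasible, dict(load)

# Hypothetical three-core example on a 2x2 mesh.
flows = {("cpu", "mem"): 400.0, ("dsp", "mem"): 300.0, ("cpu", "dsp"): 100.0}
placement = {"cpu": (0, 0), "mem": (1, 0), "dsp": (1, 1)}
cost, feasible, load = evaluate_mapping(flows, placement, link_capacity=500.0)
```

A mapping heuristic would call a function like evaluate_mapping on many
candidate placements; split-traffic routing would replace xy_route with
a step that divides each flow across several paths.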
Moderators: T. Kutzschebauch, Magma Design Automation, US; L. Stok, IBM, US
-
Synthesis and Optimization of Threshold Logic Networks with Application to Nanotechnologies [p. 904]
-
R. Zhang, P. Gupta, L. Zhong, and N. Jha
We propose an algorithm for efficient threshold network
synthesis of arbitrary multi-output Boolean functions. The main purpose
of this work is to bridge the wide gap that currently exists between
research on the development of nanoscale devices and research on the
development of synthesis methodologies to generate optimized networks
utilizing these devices. Many nanotechnologies, such as resonant tunneling
diodes (RTD) and quantum cellular automata (QCA), are capable
of implementing threshold logic. While functionally correct threshold
gates have been successfully demonstrated, there exists no methodology
or design automation tool for general multi-level threshold network
synthesis. We have built the first such tool, ThrEshold Logic Synthesizer
(TELS), on top of an existing Boolean logic synthesis tool. Experiments
with about 60 multi-output benchmarks were performed, though the
results of only 10 of them are reported in this paper because of space
restrictions. They indicate that up to 77% reduction in gate count
is possible when utilizing threshold logic, with an average reduction
being 52%, compared to traditional logic synthesis. Furthermore, the
synthesized networks are well-balanced, and hence delay-optimized.
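For readers unfamiliar with threshold logic, the sketch below shows why
gate counts can drop: a threshold gate outputs 1 when the weighted sum
of its inputs reaches a threshold, so a single gate can realize
functions that need several AND/OR gates. The weights and functions are
illustrative choices and are not taken from TELS.

```python
from itertools import product

def threshold_gate(weights, threshold, inputs):
    """Output 1 if the weighted sum of the inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def truth_table(weights, threshold):
    """Evaluate the gate on all input combinations, in binary counting order."""
    n = len(weights)
    return [threshold_gate(weights, threshold, bits)
            for bits in product((0, 1), repeat=n)]

# Three-input majority: one threshold gate with unit weights.
majority = truth_table([1, 1, 1], 2)

# a AND (b OR c): two Boolean gates, but one threshold gate.
and_or = truth_table([2, 1, 1], 3)
reference = [int(a and (b or c)) for a, b, c in product((0, 1), repeat=3)]
```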
-
Fast Comparisons of Circuit Implementations [p. 910]
-
S. Karandikar and S. Sapatnekar
Digital designs can be mapped to different implementations
using diverse approaches, with varying cost criteria.
Post-processing transforms, such as transistor sizing,
can drastically improve circuit performance by optimizing
critical paths to meet timing specifications. However, most
transistor sizing tools have high execution times, and the
attainable circuit delay can be determined only after running
the tool. In this paper, we present an approach for fast
transistor sizing that can enable a designer to choose one
among several functionally identical implementations. Our
algorithm computes the minimum achievable delay of a circuit
with a maximum average error of 5.5% in less than a
second for even the largest benchmarks.
-
Saving Power by Mapping Finite-State Machines into Embedded Memory Blocks in FPGAs [p. 916]
-
A. Tiwari and K. Tomko
Modern FPGAs contain on-chip synchronous embedded
memory blocks (SEMBs); these memory blocks can be
used to implement control units when not used as on-chip
memory. In this paper, we explore the mapping of Finite
State Machines (FSMs) into the SEMBs for power and
area minimization. We have shown the SEMB based
implementation of the FSMs and compared it with
conventional Flip-Flop (FF) based implementation. The
proposed implementation of the FSMs consumes less
power and has lower area and routing overhead than the
FF based approach and it can be clocked at the maximum
clock frequency supported by the SEMBs. Experimental
results show that the SEMB based FSM consumes 4% to
26% less power than the conventional implementation. In
addition it is observed that the power consumption can be
further reduced by stopping the clock to the SEMBs
during the idle states.
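The general idea of a memory-based FSM can be sketched as a lookup
table: the current state and the inputs form the address, and the
stored word holds the next state and the outputs. The two-state
sequence detector, the bit widths, and the word layout below are
assumptions made for illustration, not the authors' implementation.

```python
STATE_BITS, INPUT_BITS = 1, 1  # hypothetical sizes for the example

def build_rom(transitions):
    """Build memory contents from {(state, inp): (next_state, out)}."""
    rom = [0] * (1 << (STATE_BITS + INPUT_BITS))
    for (state, inp), (nxt, out) in transitions.items():
        addr = (state << INPUT_BITS) | inp
        rom[addr] = (nxt << 1) | out  # word = next-state bits, then 1 output bit
    return rom

def run(rom, inputs, state=0):
    """Simulate the FSM: one memory read per clock cycle."""
    outputs = []
    for inp in inputs:
        word = rom[(state << INPUT_BITS) | inp]
        state, out = word >> 1, word & 1
        outputs.append(out)
    return outputs

# Detector that outputs 1 when the current and previous inputs are both 1.
transitions = {
    (0, 0): (0, 0), (0, 1): (1, 0),
    (1, 0): (0, 0), (1, 1): (1, 1),
}
rom = build_rom(transitions)
outputs = run(rom, [1, 1, 0, 1, 1, 1])
```

In hardware the state register is the memory's own output register fed
back to its address lines, which is why no separate flip-flops or
next-state logic are needed.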
-
MemMap: Technology Mapping Algorithm for Area Reduction in FPGAs with
Embedded Memory Arrays Using Reconvergence Analysis [p. 922]
-
M. Kumar, J. Bobba, and V. Kamakoti
Modern Field Programmable Gate Arrays (FPGAs)
include, in addition to Look-Up Tables, reasonably large configurable
Embedded Memory Blocks (EMBs) to cater to
the on-chip memory requirements of systems/applications
mapped on them. While mapping applications on to such
FPGAs, some of the EMBs may be left unused. This paper
presents a methodology to utilize such unused EMBs
as large look-up tables to map multi-output combinational
sub-circuits of the application, which would otherwise be
mapped on to a number of small Look-Up Tables (LUT)
available on the FPGA. This in turn leads to a huge reduction
in the FPGA area utilized for mapping an
application. Experimental results show that our proposed
methodology, when employed on popular benchmark circuits,
can lead to an additional 50% reduction in area utilized
when compared with other methodologies reported in the
literature.
Organizer: K. Thapar, Mentor Graphics Europe, UK
Moderator: J. Rajski, Mentor Graphics, US
Panellists:
M. Vergniault, STMicroelectronics, FR
P. Muhmenthaler, Infineon Technologies, DE
E. Haioun, Motorola, FR
E. Marinissen, Philips Research, NL
R. Illman, Cadence Design Foundry, UK
B. Bennetts, Bennetts Associates, UK
S. Dowd, Jennic, UK
-
Nanometer Design: What are the Requirements for Manufacturing Test? [p. 930]
Moderators: I. Elfadel, IBM T.J. Watson Res. Center, US; U. Feldmann, Infineon Technologies, DE
-
Poor Man's TBR: A Simple Model Reduction Scheme [p. 938]
-
J. Phillips and L. Silveira
This paper presents a model reduction algorithm motivated by a
connection between frequency domain projection methods and approximation
of truncated balanced realizations. The method produces
guaranteed passive models, has near-optimal error properties,
is computationally simple to implement, contains error estimators,
and can incorporate frequency weighting information in a straightforward
manner. Examples are shown to prove that the method can
outperform the standard order reduction techniques by providing
similar accuracy with lower order models or superior accuracy for the
same size model.
-
Model Order Reduction Techniques for Linear Systems with Large Numbers of Terminals [p. 944]
-
P. Feldmann
This paper addresses the well known difficulty of applying
model order reduction (MOR) to linear circuits with a
large number of input-output terminals. Traditional MOR
techniques substitute the original large but sparse matrices
used in the mathematical modeling of linear circuits by
models that approximate the behavior of the circuit at its
terminals, and use significantly smaller matrices. Unfortunately
these small MOR matrices become dense as the number
of terminals increases, thus canceling the benefits of size
reduction. The paper introduces a model reduction technique
suitable for circuits with numerous terminals. The
technique exploits the correlation that almost always exists
between circuit responses at different terminals. The correlation
is rendered explicit through an SVD-based algorithm
and the result is a substantial sparsification of the MOR matrices.
The proposed sparsification technique is applicable
to a large class of problems encountered in the analysis and
design of interconnect in VLSI circuits. Relevant examples
are used to analyze and validate the method.
-
SCORE: Spice COmpatible Reluctance Extraction [p. 948]
-
R. Jiang and C. Chen
Presently, a necessary modification to mainstream analysis
tools prevents the direct application of reluctance k.
In this paper, we propose a reluctance realization algorithm
(RRA) by directly converting reluctances to circuit elements
compatible with general simulation engines, such
as SPICE. Reluctance realization is applicable to arbitrary
circuit topology and no accuracy penalty is involved in the
realization process. Since the stability of the converted circuit
largely depends on the stability of the reluctance matrix,
we present an efficient Improved Recursive Bisection
Cutting Algorithm (IRBCA) to obtain stability-guaranteed
reluctance matrices, and integrate IRBCA and RRA into a
SPICE compatible reluctance extraction tool, SCORE.
-
A Compact Propagation Delay Model for Deep-Submicron CMOS Technologies including Crosstalk [p. 954]
-
J. Rosselló and J. Segura
We present a compact, fully physical, analytical model for the
propagation delay and the output transition time of deep-submicron
CMOS gates. The model accounts for crosstalk effects, short-circuit
currents, the input-output coupling capacitance and carrier velocity
saturation effects. It is based on the nth-power law MOSFET model
and computes the propagation delay from the charge delivered to the
gate. Comparison with HSPICE simulations and other previously
published models for different submicron technologies shows
significant improvements in terms of accuracy.
Moderators: T. Basten, TU Eindhoven, NL; L. Claesen, National Chiao Tung U, Taiwan, ROC
-
A Framework for Battery-Aware Sensor Management [p. 962]
-
S. Dasika, S. Vrudhula, S. Chopra, and R. Srinivasan
A distributed sensor network (DSN) designed to cover a given region
R, is said to be alive if there is at least one subset of sensors
that can collectively cover (sense) the region R. When no such subset
exists, the network is said to be dead. A key challenge in the
design of a DSN is to maximize the operational life of the network.
Since sensors are typically powered by batteries, this requires maximizing
the battery lifetime. One way to achieve this is to determine
the optimal schedule for transitioning sets of sensors between active
and inactive states while satisfying user specified performance
constraints. This requires identification of feasible subsets (covers)
of sensors and a scheme for switching between such subsets. We
present an algorithmic solution to compute all the sensor covers in
an implicit manner by formulating the problem as a unate covering
problem (UCP). The representation of all possible sensor sets is extremely
efficient and can accommodate a very large number of sensor
covers. The representation and formulation make it possible to
consider the residual battery charge when switching between covers.
We develop algorithms for switching between sensor covers
aimed at maximizing the lifetime of the network. The algorithms
take into account the transmission/reception costs of sensors, a user
specified quality constraint and also utilize a novel battery model
that accounts for the rate-dependent capacity effect and charge recovery
during idle periods. Our simulation results show that lifetime
improvement can be achieved by exploiting the charge recovery
process. The work presented here constitutes a framework for
battery-aware sensor management in which various types of constraints
can be incorporated and a range of other communication
protocols can be examined.
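The notions of a sensor cover and of an alive network can be made
concrete with a small sketch. Unlike the implicit representation
described in the abstract, this example enumerates minimal covers
explicitly, which only scales to a handful of sensors. The region is
modeled as a set of points, and all names and data are hypothetical.

```python
from itertools import combinations

def minimal_covers(coverage, region):
    """Return all minimal sensor subsets whose sensed points include the region.

    coverage: {sensor: set of region points it senses}
    """
    sensors = sorted(coverage)
    covers = []
    for size in range(1, len(sensors) + 1):
        for subset in combinations(sensors, size):
            chosen = frozenset(subset)
            if any(c <= chosen for c in covers):
                continue  # contains a smaller cover, so it is not minimal
            if set().union(*(coverage[s] for s in subset)) >= region:
                covers.append(chosen)
    return covers

def is_alive(coverage, region, charge):
    """The network is alive if sensors with remaining charge can still cover the region."""
    live = {s: pts for s, pts in coverage.items() if charge[s] > 0}
    return bool(minimal_covers(live, region))

region = {1, 2, 3}
coverage = {"a": {1, 2}, "b": {2, 3}, "c": {1, 3}, "d": {1, 2, 3}}
covers = minimal_covers(coverage, region)
alive = is_alive(coverage, region, {"a": 1, "b": 0, "c": 1, "d": 0})
dead = is_alive(coverage, region, {"a": 1, "b": 0, "c": 0, "d": 0})
```

A scheduler in this spirit would switch between the covers over time,
preferring covers whose sensors have the most residual charge.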
-
Local Decisions and Triggering Mechanisms for Dynamic Fault Tolerance [p. 968]
-
P. Stanley-Marbell and D. Marculescu
Dynamic fault-tolerance management (DFTM) was previously
introduced as a means of providing environment and
workload-driven adaptation for failure-prone battery
powered systems.
This paper introduces and analyzes the role of local decision
policies in a DFTM environment, and presents a precise
formulation for when it is beneficial to activate a given
DFTM algorithm with respect to metrics that combine performance,
reliability, power consumption and battery life.
In particular, local decision algorithms are described in
the context of an imaging array application running on a
network of resource-constrained processing elements. It is
demonstrated that DFTM algorithms, in conjunction with
appropriately chosen activation times, increase the mean
computation before battery failure for a single battery, by a
factor of between 1.1 and 5.8, for the application investigated.
-
An Algorithm for Nano-Pipelining of Circuits and Architectures for a Nanotechnology [p. 974]
-
P. Gupta and N. Jha
In this paper, we describe an algorithm to post-process a
register-transfer level (RTL) architecture to enable gate-level pipelining
or nano-pipelining for the nanotechnology based on resonant tunneling
diodes (RTDs). Nano-pipelining offers the opportunity to obtain massive
throughput and, therefore, has applications in data-intensive algorithms
such as digital signal processing (DSP). Since RTDs are a self-latching
nanotechnology, nano-pipelining is an implicit property that should be
exploited for this technology. The novelty of this work lies in exploring
and demonstrating the benefits of nano-pipelining and presenting an
algorithm for architectural nano-pipelining.
-
Smaller Two-Qubit Circuits for Quantum Communication and Computation [p. 980]
-
V. Shende, I. Markov, and S. Bullock
We show how to implement an arbitrary two-qubit unitary
operation using any of several quantum gate libraries
with small a priori upper bounds on gate counts. In analogy
to library-less logic synthesis, we consider circuits and
gates in terms of the underlying model of quantum computation,
and do not assume any particular technology. As
increasing the number of qubits can be prohibitively expensive,
we assume throughout that no extra qubits are available
for temporary storage.
Using quantum circuit identities, we improve an earlier
lower bound of 17 elementary gates by Bullock and Markov
to 18, and their upper bound of 23 elementary gates to 18.
We also improve upon the generic circuit with six CNOT
gates by Zhang et al. (our circuit uses three), and that by
Vidal and Dawson with 11 basic gates (we use 10).
We study the performance of our synthesis procedures on
two-qubit operators that are useful in quantum algorithms
and communication protocols. With additional work, we
find small circuits and improve upon previously known circuits
in some cases.
Organizer: E. Macii, Politecnico di Torino, IT
Moderator: N. Chang, Seoul National U, KR
Speakers:
I. Verbauwhede, UCLA, US
C. Piguet, CSEM, CH
P. Schaumont, UCLA, US
B. Kienhuis, Leiden U, NL
-
Architectures and Design Techniques for Energy Efficient
Embedded DSP and Multimedia Processing [p. 988]
-
Energy efficient embedded systems consist of a heterogeneous
collection of very specific building blocks, connected
together by a complex network of many dedicated
busses and interconnect options. The trend to merge multiple
functions into one device makes the design and integration
of these 'systems-on-chip' (SOC's) even more problematic.
Yet, specifications and applications are never fixed
and require the embedded units to be programmable.
The topic of this paper is to give the designer architectures
and design techniques to find the right balance between
energy efficiency and flexibility. The key is to include
programmability (or reconfiguration) at the right level
of abstraction and tuned to the application domain. The
challenge is to provide an exploration and programming
environment for this heterogeneous architecture platform.
Moderators: R. Seepold, Carlos III de Madrid U, ES; T. Riesgo, UP Madrid, ES
-
Measurement of IP Qualification Costs and Benefits [p. 996]
-
A. Vörg, W. Rosenstiel, and M. Radetzki
IP core reuse is necessary to overcome the design gap.
Yet experience during IP integration has shown that risk is
still considerably high when dealing with IPs. IP qualification
provides IP providers and integrators with measurable
quality characteristics that allow for high quality IP
cores and to put buy decisions on a quantifiable basis. This
paper presents unprecedented results that facilitate the
comparison of the effectiveness of reusing qualified, digital
soft IP to previous, immature reuse methods. An impressive
reduction in IP integration effort, which is profitable
for the IP customer, is demonstrated. Moreover, we show
that the IP business can be profitable for the IP provider
despite the additional qualification effort.
-
Architecture-Level Performance Estimation for IP-Based Embedded Systems [p. 1002]
-
K. Ueda, K. Sakanushi, Y. Takeuchi, and M. Imai
In this paper, we propose an architecture-level performance
estimation method for IP-based embedded systems
using system-level profiling. Our method performs
estimation in three steps: 1) system-level
profiling; 2) automatic construction of the execution
dependency graph (EDG) from the profile information; 3)
estimation of the system performance based on the EDG
analysis. Our method enables fast performance estimation
because it can estimate the performance of various architectures
from the same system-level profile information. Experimental
results show that our estimation method is about
10,000 times faster than the architecture-level simulations.
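One way to picture the third step is as a longest-path computation over
the dependency graph, re-run with different per-node costs for each
candidate architecture. The abstract does not spell out the EDG
analysis, so the graph shape, the node names, and the cycle counts
below are assumptions for illustration only.

```python
from functools import lru_cache

def estimate_cycles(edg, cycles):
    """Estimate total cycles as the longest path through the dependency graph.

    edg: {node: [nodes it depends on]} (must be acyclic)
    cycles: {node: execution cycles on the candidate architecture}
    """
    @lru_cache(maxsize=None)
    def finish(node):
        start = max((finish(p) for p in edg[node]), default=0)
        return start + cycles[node]

    return max(finish(node) for node in edg)

# One hypothetical profile, reused for two candidate architectures.
edg = {"read": [], "fft": ["read"], "filter": ["read"], "write": ["fft", "filter"]}
software_only = {"read": 100, "fft": 500, "filter": 300, "write": 100}
with_fft_ip = {"read": 100, "fft": 50, "filter": 300, "write": 100}

baseline = estimate_cycles(edg, software_only)
accelerated = estimate_cycles(edg, with_fft_ip)
```

Because only the per-node costs change between architectures, the
profile and graph are built once, which is consistent with the speed
advantage over repeated architecture-level simulation.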
-
Generalized Latency-Insensitive Systems for Single-Clock and Multi-Clock Architectures [p. 1008]
-
M. Singh and M. Theobald
Latency-insensitive systems were recently proposed by
Carloni et al. as a correct-by-construction methodology
for single-clock system-on-a-chip (SoC) design using predesigned
IP blocks. Their approach overcomes the problem
of long latencies of global interconnects in deep-submicron
technologies, while still maintaining much of the inherent
simplicity of synchronous design. In particular, wires whose
latency is greater than a clock cycle are segmented using
"relay stations," and IP blocks are made robust to arbitrary
communication delays.
This paper shows, however, that significant extensions
are needed to make latency-insensitive systems useful for
the practical design of large-scale SoC's. In particular, this
paper proposes three extensions. The first extension allows
each synchronous module to treat its input and output channels
in a much more flexible manner, i.e., with greater decoupling.
The second extension generalizes inter-module
communication from point-to-point channels to more complex
networks of arbitrary topologies. Finally, the third extension
is to target multi-clock SoC's. The net impact of our
extensions is the potential for improved throughput, reduced
power consumption, and greater flexibility in design.
-
Platform Based on Open-Source Cores for Industrial Applications [p. 1014]
-
M. Bolado, J. Castillo, P. Huerta, H. Posadas, P. Sánchez, C. Sánchez, F. Blasco, and H. Fouren
The latest version of the International Technology Roadmap for Semiconductors predicts that design reuse will be essential in the near future to face the constantly increasing design complexity. The concept comes from software engineering in which reuse is a fundamental technology. In order to provide libraries and applications to reuse in software development, some open-source initiatives (e.g. Linux, gcc, X, mysql) have appeared during the last decades. The basic idea is to distribute the library or application source code (normally free-of-charge) and allow any developer to use, modify, debug and improve it. Several initiatives have tried to port this idea to hardware development. The main goal of this paper is to develop a synthesizable platform described in SystemC from an open architecture. The platform includes a CPU (OpenRISC) and some basic peripherals. A set of software development tools (compiler, assembler, debugger) and RTOS (eCos) has also been developed. This work enables the evaluation of the advantages and disadvantages of the open-source model in electronic system design.
-
MINCE: Matching INstructions with Combinational Equivalence for Extensible Processor [p. 1020]
-
N. Cheung, S. Parameswaran, J. Henkel, and J. Chan
Designing custom-extensible instructions for Extensible
Processors is a computationally complex task because of the
large design space. The task of automatically matching candidate
instructions in an application (e.g. written in a high-level
language) to a pre-designed library of extensible instructions
is especially challenging. Previous approaches have focused
on identifying extensible instructions (e.g. through profiling),
synthesizing extensible instructions, estimating expected performance
gains etc. In this paper we introduce our approach
of automatically matching extensible instructions as this key
step is missing in automating the entire design flow of an ASIP
with extensible instruction capabilities. Since matching using
simulation is practically infeasible (simulation time), and
traditional pattern matching approaches would not yield reliable
results (ambiguity related to a functionally equivalent code that
can be represented in many different ways), we adopt combinational
equivalence checking. Our MINCE tool as part of
our ASIP design flow consists of a translator, a filtering algorithm
and a combinational equivalence checking tool. We
report matching times of extensible instructions that are 7.3x
faster on average (using Mediabench applications) compared
to the best known approaches to the problem (partial simulations).
In all our experiments MINCE matched correctly and
the outcome of the matching step yielded an average speedup
of the application of 2.47x. As a summary, our work represents
a key step towards automating the whole design flow of
an ASIP with extensible instruction capabilities.
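The reason equivalence checking succeeds where pattern matching fails
can be shown with a toy example: two syntactically different
expressions compute the same function. The sketch below checks
equivalence by exhaustive enumeration on 4-bit operands, which stands
in for a real combinational equivalence checker. It is not the MINCE
tool, and the functions and bit width are hypothetical.

```python
from itertools import product

WIDTH = 4                 # assumed operand width, kept small for enumeration
MASK = (1 << WIDTH) - 1

def equivalent(f, g, arity):
    """Compare f and g on every input combination, modulo 2**WIDTH."""
    values = range(1 << WIDTH)
    return all((f(*args) & MASK) == (g(*args) & MASK)
               for args in product(values, repeat=arity))

def app_code(a, b):
    """Candidate code segment as it appears in the application."""
    return (a << 1) + a + b

def library_instr(a, b):
    """Hypothetical pre-designed instruction: multiply-by-3 and add."""
    return 3 * a + b

def other_instr(a, b):
    """Hypothetical library instruction that should not match."""
    return 2 * a + b

match = equivalent(app_code, library_instr, 2)
mismatch = equivalent(app_code, other_instr, 2)
```

A textual or graph pattern matcher would see a shift and two additions
on one side and a multiplication on the other, and would miss the match.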
Moderators: S. Hu, Notre Dame U, US; F. Wolf, Volkswagen, DE
-
Design Optimization of Multi-Cluster Embedded Systems for Real-Time Applications [p. 1028]
-
P. Pop, P. Eles, Z. Peng, V. Izosimov, M. Hellring, and O. Bridal
We present an approach to design optimization of multi-cluster embedded
systems consisting of time-triggered and event-triggered clusters, interconnected
via gateways. In this paper, we address design problems which are
characteristic of multi-clusters: partitioning of the system functionality
into time-triggered and event-triggered domains, process mapping, and the
optimization of parameters corresponding to the communication protocol.
We present several heuristics for solving these problems. Our heuristics
are able to find schedulable implementations under limited resources,
achieving an efficient utilization of the system. The developed algorithms
are evaluated using extensive experiments and a real-life example.
-
Timing Analysis for Preemptive Multi-Tasking Real-Time Systems with Caches [p. 1034]
-
Y. Tan and V. Mooney
In this paper, we propose an approach to estimate the Worst
Case Response Time (WCRT) of tasks in a preemptive multi-tasking
single-processor real-time system with a set associative cache. The
approach focuses on analyzing the cache reload overhead caused
by preemptions. We combine inter-task cache eviction behavior
analysis and path analysis of the preempted task to reduce, in our
analysis, the estimate of the number of cache lines that can possibly
be evicted by the preempting task (thus requiring a reload by the
preempted task). A mobile robot application which contains three
tasks is used to test our approach. The experimental results show
that our approach can tighten the WCRT estimate by up to 73%
over prior state-of-the-art.
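The effect of a tighter cache reload estimate on the response time can
be sketched with the usual fixed-point response time iteration,
extended with a per-preemption reload cost. The per-set bound below
(only cache sets used by both tasks conflict, limited by the
associativity) is a simplified stand-in for the authors' combined
inter-task and path analysis, and all task parameters are hypothetical.

```python
import math

def reload_bound(preempted_sets, preempting_sets, ways):
    """Bound the cache lines the preempted task may have to reload.

    Each dict maps a cache set index to the number of memory blocks the
    task places in that set; at most `ways` lines fit in one set.
    """
    return sum(
        min(count, preempting_sets.get(index, 0), ways)
        for index, count in preempted_sets.items()
    )

def response_time(wcet, preemptors, miss_penalty, deadline):
    """Iterate R = C + sum(ceil(R / T_j) * (C_j + lines_j * miss_penalty)).

    preemptors: list of (wcet_j, period_j, reload_lines_j) tuples.
    Returns the fixed point, or the first value exceeding the deadline.
    """
    r = wcet
    while r <= deadline:
        nxt = wcet + sum(
            math.ceil(r / period) * (c + lines * miss_penalty)
            for c, period, lines in preemptors
        )
        if nxt == r:
            return r
        r = nxt
    return r  # exceeds the deadline under this estimate

low_prio_sets = {0: 2, 1: 1, 2: 3}    # 6 blocks in total
high_prio_sets = {1: 4, 2: 1, 3: 2}
lines = reload_bound(low_prio_sets, high_prio_sets, ways=2)

tight = response_time(20, [(5, 50, lines)], miss_penalty=2, deadline=100)
pessimistic = response_time(20, [(5, 50, 6)], miss_penalty=2, deadline=100)
```

The pessimistic call assumes every block of the preempted task must be
reloaded after each preemption; the tighter bound shortens the estimate.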
-
Workload Characterization Model for Tasks with Variable Execution Demand [p. 1040]
-
A. Maxiaguine, S. Künzli, and L. Thiele
The analysis of real-time properties of an embedded
system usually relies on the worst-case execution times
(WCET) of the tasks to be executed. In contrast to that,
in real world applications the running time of tasks may
vary from execution to execution, e.g. in multimedia applications.
The traditional worst-case analysis of the system
then returns overly pessimistic estimates of the system performance.
In this paper we propose a new effective method
to characterize tasks with variable execution requirements,
which leads to tighter worst-case bounds on system performance
and better use of available resources. We show the
applicability of our approach by a detailed study of a multimedia
application.
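To see why a WCET-only analysis is pessimistic, consider bounding the
total demand of k consecutive executions. The sketch below computes
such a bound from a recorded trace and compares it with k times the
WCET. The abstract does not define the characterization, so this
windowed bound and the trace values are assumptions for illustration,
and a bound taken from a single trace holds only for that trace.

```python
def upper_workload_curve(demands, k_max):
    """curve[k] = largest total demand of any k consecutive executions."""
    curve = [0]
    for k in range(1, k_max + 1):
        curve.append(max(sum(demands[i:i + k])
                         for i in range(len(demands) - k + 1)))
    return curve

# Hypothetical per-execution demands, e.g. cycles per decoded video block.
trace = [9, 2, 3, 9, 2, 2]
curve = upper_workload_curve(trace, 3)
wcet_bound = [k * max(trace) for k in range(4)]
```

Expensive executions rarely occur back to back in this trace, so the
windowed bound grows far more slowly than the WCET-based one.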
-
Context-Aware Performance Analysis for Efficient Embedded System Design [p. 1046]
-
M. Jersak, R. Henia, and R. Ernst
Performance analysis has many advantages in theory
compared to simulation for the validation of complex embedded
systems, but is rarely used in practice. To make
analysis more attractive, it is critical to calculate tight analysis
bounds. This paper shows that advanced performance
analysis techniques taking correlations between successive
computation or communication requests as well as a correlated
load distribution into account can yield much tighter
analysis bounds. Cases where such correlations have a
large impact on system timing are especially difficult to simulate
and, hence, are an ideal target for formal performance
analysis.
-
Compact Binaries with Code Compression in a Software Dynamic Translator [p. 1052]
-
S. Shogan and B. Childers
Embedded software is becoming more flexible and adaptable,
which presents new challenges for management of
highly constrained system resources. Software dynamic
translation (SDT) has been used to enable software malleability
at the instruction level for dynamic code optimizers,
security checkers, and binary translators. This paper
studies the feasibility of using SDT to manage program
code storage in embedded systems. We explore to what
extent code compression can be incorporated in a software
infrastructure to reduce program storage requirements,
while minimally impacting run-time performance and
memory resources. We describe two approaches for code
compression, called full and partial image compression,
and evaluate their compression ratios and performance in
a software dynamic translation system. We demonstrate
that code decompression is indeed feasible in a SDT.
Moderators: R. Aitken, Artisan, US; H. Manhaeve, Q-star Test, BE
-
Pattern Selection for Testing of Deep Sub-Micron Timing Defects [p. 1060]
-
M. C. Chao, L. Wang, and K. Cheng
Due to process variations in deep sub-micron (DSM) technologies,
the effects of timing defects are difficult to capture.
This paper presents a novel coverage metric for estimating
the test quality with respect to timing defects under
process variations. Based on the proposed metric and a
dynamic timing analyzer, we develop a pattern-selection algorithm
for selecting the minimal number of patterns that
can achieve the maximal test quality. To shorten the run
time in dynamic timing analysis, we propose an algorithm
to speed up the Monte-Carlo-based simulation. Our experimental
results show that selecting a small percentage of
patterns from a multiple-detection transition fault pattern
set is sufficient to maintain the test quality given by the entire
pattern set. We present run-time and accuracy comparisons
to demonstrate the efficiency and effectiveness of our
pattern selection framework.
-
Balanced Excitation and its Effect on the Fortuitous Detection of Dynamic Defects [p. 1066]
-
J. Dworak, B. Cobb, J. Wingfield, and M. Mercer
Dynamic defects are less likely to be fortuitously
detected than static defects because they have more
stringent detection requirements. We show that (in
addition to more site observations) balanced excitation is
essential for detection of these defects, and we present a
metric for estimating this degree of balance. We also
show that excitation balance correlates with the
parameter in the MPG-D defective part level model.
-
Intermittent Scan Chain Fault Diagnosis Based on Signal Probability Analysis [p. 1072]
-
Y. Huang, W. Cheng, C. Hsieh, H. Tseng, A. Huang, and Y. Hung
A new algorithm to diagnose intermittent scan chain
faults in scan-based designs is proposed in this paper. An
intermittent scan chain fault sometimes is triggered and
sometimes is not triggered during scan chain shifting,
which makes it very difficult to locate the fault sites. In
this paper, we provide answers to three questions:
(1) Why do intermittent scan chain faults happen?
(2) Why is diagnosis of this type of fault necessary?
(3) How can this type of fault be diagnosed?
The experimental results presented demonstrate that
the proposed diagnosis algorithm is effective for large
industrial designs with multiple intermittent scan chain
faults.
-
A Modeling Approach for Addressing Power Supply Switching Noise Related Failures of Integrated Circuits [p. 1078]
-
C. Tirumurti, S. Kundu, S. Sur-Kolay, and Y. Chang
Power density of high-end microprocessors has been increasing
by approximately 80% per technology generation,
while the voltage is scaling by a factor of 0.8. As a result,
current per unit area grows to 225% of its previous value in each
successive generation of technologies. The cost of maintaining the same IR
drop becomes too high. This leads to compromises in power
delivery, and the power grid becomes a performance limiter.
Traditional performance related test techniques with transition
and path delay fault models focus on testing the logic
but not the power delivery. In this paper we view the power grid
as a performance limiter and develop a fault model to address
the problem of vector generation for delay faults arising
out of power delivery problems. A fault extraction methodology
applied to a microprocessor design block is explained.
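The 225% figure follows from the two scaling factors quoted in the
abstract. Assuming power density is the product of supply voltage and
current per unit area (the symbols P, V and J below are introduced
here for illustration), the ratio between successive generations n and
n+1 is:

```latex
\[
\frac{J_{n+1}}{J_n}
  = \frac{P_{n+1}/V_{n+1}}{P_n/V_n}
  = \frac{P_{n+1}/P_n}{V_{n+1}/V_n}
  = \frac{1.8}{0.8}
  = 2.25
\]
```

That is, the current per unit area in each generation is 2.25 times
that of the previous one, so holding the IR drop constant would require
the grid resistance to shrink by the same factor.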
-
Soft Faults and the Importance of Stresses in Memory Testing [p. 1084]
-
Z. Al-Ars and A. van De Goor
Memory testing in general, and DRAM testing
in particular, has become greatly dependent on the modification
of stresses (timing, temperature and voltages) in a
way that is difficult to justify using the current understanding
of memory faults. This paper introduces a new class
of fault models (soft faults) based on a special classification
of memory faults, that shows why it is fundamentally
necessary to apply stresses. The paper calculates the relative
probability of soft faults for a specific failure mechanism
and compares this probability in DRAMs with that in
SRAMs. In addition, the concept of soft faults is validated
using defect injection and electrical simulation of a Spice
DRAM model.
Keywords: Fault modeling, soft faults, memory testing,
stress application, defect simulation.
Moderators: J. Lienig, TU Dresden, DE; R. Otten, TU Eindhoven, NL
-
Wire Retiming for System-On-Chip by Fixpoint Computation [p. 1092]
-
C. Lin and H. Zhou
In the current and future System-On-Chips, a non-negligible
part of operation time is spent on multiple-clock period wires.
Retiming -- that is moving flip-flops in a circuit without changing
its functionality -- can be explored to pipeline long interconnect
wires in SOC designs. The problem of retiming over a
netlist of macro-blocks, where the internal structures may not
be changed and flip-flops may not be inserted on some wire
segments is called the wire retiming problem. In this paper,
we formulate the constraints of the wire retiming problem as
a fixpoint computation and use an iterative algorithm to solve
it. Experimental results show that this approach is multiple
orders of magnitude more efficient than the previous one.
-
Boosting: Min-Cut Placement with Improved Signal Delay [p. 1098]
-
A. Kahng, S. Reda, and I. Markov
In this work we improve top-down min-cut placers in the context
of timing closure. Using the concept of boosting factors,
we adjust net weights according to net spans, so as to reduce
the quadratic wirelength. Our method is generic and does not
involve any timing analysis during or prior to placement. In
essence, we skew the netlength distribution produced by a min-cut
placer so as to decrease the number of long nets, with minimal
impact on the overall wirelength. Empirically this approach does
not significantly affect runtime, but reduces the worst negative
slack and total negative slack of industrial benchmarks by up to
70% compared to Capo [5] and a leading industrial placer.
-
Optimal Algorithm for Minimizing the Number of Twists in an On-Chip Bus [p. 1104]
-
L. Deng and M. Wong
Complementary bus architecture is used to achieve
higher speed and lower power in VLSI chips. However, in deep
submicron circuit design, the effects of crosstalk become more
and more serious, especially in the bus structure where wires are
placed close to each other. Complementary bus architecture with
twisted wires can reduce the coupling noise. But in current chip
design flows, engineering change orders (ECOs) are commonly issued
to meet improvement requirements. Layout changes due to ECOs
introduce obstacles to the twists, which could reduce the number
of twists and increase the coupling noise. In this paper, an ECO
algorithm for generating twisted complementary architecture is
proposed based on the shortest path algorithm. Our algorithm
guarantees to give the minimum number of twists along the bus
wires under noise constraints. Experimental results show that the
twist patterns generated by our algorithm can effectively reduce the
capacitive coupling noises.
-
A Fast Word-Level Statistical Estimator of Intra-Bus Crosstalk [p. 1110]
-
S. Gupta and S. Katkoori
Given word-level statistics, namely mean, standard deviation,
and lag-one temporal correlation of input data,
we estimate the bit-level crosstalk probability on a system
bus using a non-enumerative statistical approach. We introduce
a sampling technique for fast evaluation of integrals
during the estimation process. We had proposed two techniques
previously -- (a) a stream-based estimator that counts
crosstalk events on a bus; and (b) a statistical enumeration
technique that enumerates crosstalk-producing values on a
bus and computes their occurrence probability. Both these
techniques suffer from exponential time complexity with respect
to the bus-width. In this work, we propose a statistical
non-enumerative technique that has linear time complexity
with respect to the bus-width. We achieve the linear
complexity by resorting to: (1) manipulating the data
stream to make the crosstalk-producing values contiguous
and (2) sampling the distribution function and storing it
as a lookup table. Experimental results for data streams
from different data environments are presented, compared
against the stream-based approach. Average errors of less
than 12% are obtained for bus-widths ranging from 8b to
32b.
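To make the quantities in this abstract concrete, the sketch below computes the three word-level statistics and implements a simple stream-based counter of the kind used as the comparison baseline. It assumes that a crosstalk event is a pair of adjacent bus lines switching in opposite directions in the same cycle; that definition and all function names are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def word_stats(words):
    """Return (mean, standard deviation, lag-one correlation) of a word stream."""
    mu = mean(words)
    sigma = pstdev(words)
    if sigma == 0 or len(words) < 2:
        return mu, sigma, 0.0
    pairs = list(zip(words, words[1:]))
    cov = sum((a - mu) * (b - mu) for a, b in pairs) / len(pairs)
    return mu, sigma, cov / (sigma * sigma)


def count_crosstalk_events(words, width):
    """Count adjacent line pairs that toggle in opposite directions per cycle."""
    events = 0
    for prev, curr in zip(words, words[1:]):
        for i in range(width - 1):
            p0, p1 = (prev >> i) & 1, (prev >> (i + 1)) & 1
            c0, c1 = (curr >> i) & 1, (curr >> (i + 1)) & 1
            # both lines switch, and they end at different values,
            # so one rises while the other falls
            if p0 != c0 and p1 != c1 and c0 != c1:
                events += 1
    return events


def crosstalk_probability(words, width):
    """Fraction of (cycle, adjacent pair) opportunities that are events."""
    opportunities = (len(words) - 1) * (width - 1)
    if opportunities == 0:
        return 0.0
    return count_crosstalk_events(words, width) / opportunities
```

This counter walks the whole stream, which is what a statistical estimator driven only by the word-level statistics is meant to avoid.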
-
Full-Chip Multilevel Routing for Power and Signal Integrity [p. 1116]
-
J. Xiong and L. He
Conventional physical design flow separates the design
of power network and signal network. Such a separated
approach results in slow design convergence
for wire-limited deep sub-micron designs. We present a
novel design methodology that simultaneously considers
global signal routing and power network design under
integrity constraints. The key part of this approach is a
simple yet accurate power net estimation formula that decides
the minimum number of power nets needed to satisfy
both power and signal integrity constraints prior to detailed
layout. The proposed design methodology is a
one-pass solution to the co-design of power and signal
networks in the sense that no iteration between them
is required in order to meet design closure. Experimental
results using large industrial benchmarks show that,
compared to the state-of-the-art alternative design approach,
the proposed method can reduce the power network
area by 19.4% on average under the same signal
and power integrity constraints, with better routing quality
and less runtime.
Organizer/Moderator: E. Pol, Philips Research, NL
Speakers:
H. Van Antwerpen, Philips Research, NL
R. von Vignau, Philips Research, NL
R. Gupta, UC San Diego, US
N. Dutt, UC Irvine, US
N. Venkatasubramanian, UC Irvine, US
S. Mohapatra, UC Irvine, US
C. Pereira, UC San Diego, US
-
Energy-Aware System Design for Wireless Multimedia [p. 1124]
-
In this paper, we present various challenges that arise in
the delivery and exchange of multimedia information to mobile
devices. Specifically, we focus on techniques for maintaining
QoS to end-user multimedia applications (e.g. video
streaming, multimedia conferencing) while maximizing device
lifetimes. In order to cope with the resource intensive
nature of multimedia applications (in terms of computation,
bandwidth and consequently power) and dynamic congestion
levels in wireless networks, an end-to-end approach to QoS-aware
power optimization is required. We discuss the trend
towards such an integrated approach that couples the architectural,
OS, middleware and application layers to achieve
both user experience and device energy gains. We conclude
with a discussion of tools for integrated system design and
testing that will aid in rapid deployment of wireless multimedia.
Moderators: K. Goossens, Philips Research, NL; L. Benini, Bologna U, IT
-
Unified Component Integration Flow for Multi-Processor SoC Design and Validation [p. 1132]
-
M. Dziri, W. Cesário, A. Jerraya, and F. Wagner
Most System-on-Chip (SoC) design methodologies
promote the reuse of pre-designed (hardware, software,
and functional) components. However, as these
components are heterogeneous, their integration requires
complex interface sub-systems. These sub-systems can
also be constructed by assembling pre-designed basic
interface components. Hence, SoC design and validation
involves component composition techniques to create
hardware, software, and functional interface sub-systems
by assembling basic interface components. We propose a
unified methodology for automatic component integration
that allows designers to reuse pre-designed components
effectively. We also present ROSES, a design flow that
uses this methodology to generate hardware, software,
and functional interface sub-systems automatically
starting from a system-level architectural model.
-
An Interconnect Channel Design Methodology for High Performance Integrated Circuits [p. 1138]
-
V. Chandra, A. Xu, H. Schmit, and L. Pileggi
On-chip communication is becoming a bottleneck for high performance
designs. Conventional interconnect design methodology does
not account for architectures and/or communication schemes that require
storage buffers (First-In-First-Out queues or FIFOs) in the interconnect
channel. For example, FIFOs and flow-control are needed
for Network-on-Chip, high performance ASICs and multiple clock domain
designs. These IC implementation architectures require an efficient
methodology to determine the size of the FIFOs in the channel
since the FIFO sizes affect system performance. In this work we devised
a methodology to size the FIFOs in an interconnect channel
containing one or more FIFOs connected in series. We show that the
sizing of the FIFOs in the channel is a function of system parameters
such as data production rate and consumption rate, data burstiness,
number of channel stages etc. and we also quantify their effect on
performance. For a single clock design, we have developed an efficient
algorithm which reduces the search space for the optimal sizing
of the FIFOs in the channel.
-
Modeling Shared Resource Contention Using a Hybrid Simulation/Analytical Approach [p. 1144]
-
A. Bobrek, J. Pieper, J. Nelson, J. Paul, and D. Thomas
Future Systems-on-Chips will include multiple heterogeneous
processing units, with complex data-dependent shared
resource access patterns dictating the performance of a design.
Currently, the most accurate methods of simulating the
interactions between these components operate at the cycle-accurate
level, which can be very slow to execute for large
systems. Analytical models sacrifice accuracy for speed, and
cannot cope with dynamic data-dependent behavior well.
We propose a hybrid approach combining simulation with
piecewise evaluation of analytical models that apply time
penalties to simulated regions. Our experimental results
show that for representative heterogeneous multiprocessor
applications, simulation time can be decreased by 100 times
over cycle-accurate models, while the error can be reduced
by 60% to 80% over traditional analytical models to within
18% of an ISS simulation.
-
Supporting Cache Coherence in Heterogeneous Multiprocessor Systems [p. 1150]
-
T. Suh, D. Blough, and H. Lee
In embedded system-on-a-chip (SoC) applications, the
demand for integrating heterogeneous processors onto a single
chip is increasing. An important issue in integrating
multiple heterogeneous processors on the same chip is to
maintain the coherence of their data caches. In this paper,
we propose a hardware/software methodology to make
caches coherent in heterogeneous multiprocessor platforms
with shared memory. Our approach works with any combination
of processors that support invalidation-based protocols.
As shown in our experiments, up to 58% performance
improvement can be achieved with low miss penalty at the
expense of adding simple hardware, compared to a pure
software solution. Speedup can be improved even further
as the miss penalty increases. In addition, our approach
provides embedded system programmers a transparent view
of shared data, removing the burden of software synchronization.
Moderators: R. Ernst, TU Braunschweig, DE; P. Kajfasz, Thales Communications, FR
-
Exploiting Processor Workload Heterogeneity for Reducing Energy Consumption in Chip Multiprocessors [p. 1158]
-
I. Kadayif, M. Kandemir, and I. Kolcu
Advances in semiconductor technology are enabling designs
with several hundred million transistors. Since building
sophisticated single processor based systems is a complex
process from design, verification, and software development
perspectives, the use of chip multiprocessing is inevitable
in future microprocessors. In fact, the abundance
of explicit loop-level parallelism in many embedded applications
helps us identify chip multiprocessing as one of the
most promising directions in designing systems for embedded
applications. Another architectural trend that we observe
in embedded systems, namely, multi-voltage processors,
is driven by the need of reducing energy consumption
during program execution. Practical implementations
such as Transmeta's Crusoe and Intel's XScale tune processor
voltage/frequency depending on current execution
load. Considering these two trends, chip multiprocessing
and voltage/frequency scaling, this paper presents an optimization
strategy for an architecture that makes use of both
chip parallelism and voltage scaling. In our proposal, the
compiler takes advantage of heterogeneity in parallel execution
between the loads of different processors and assigns
different voltages/frequencies to different processors if doing
so reduces energy consumption without increasing overall
execution cycles significantly. Our experiments with a set
of applications show that this optimization can bring large
energy benefits without much performance loss.
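The idea of exploiting load imbalance can be illustrated with a small sketch. It assumes a hypothetical table of discrete frequency/voltage operating points and that dynamic energy per cycle scales with the square of the supply voltage; the table values, function names, and selection rule are illustrative assumptions rather than the paper's compiler algorithm.

```python
# Hypothetical (frequency in MHz, voltage in V) operating points; not from the paper.
LEVELS = [(200, 0.8), (400, 1.0), (600, 1.2), (800, 1.4)]


def assign_levels(cycles_per_cpu, levels=LEVELS):
    """For each processor, pick the slowest level that still finishes no later
    than the most heavily loaded processor running at the top level."""
    levels = sorted(levels)
    f_max = levels[-1][0]
    deadline = max(cycles_per_cpu) / f_max
    chosen = []
    for cycles in cycles_per_cpu:
        for f, v in levels:          # slowest level first
            if cycles / f <= deadline:
                chosen.append((f, v))
                break
    return chosen


def relative_energy(cycles_per_cpu, chosen, levels=LEVELS):
    """Energy relative to running every processor at the top level,
    assuming energy per cycle is proportional to V^2."""
    v_max = max(v for _, v in levels)
    base = sum(c * v_max ** 2 for c in cycles_per_cpu)
    scaled = sum(c * v ** 2 for c, (_, v) in zip(cycles_per_cpu, chosen))
    return scaled / base


loads = [800, 400, 200]              # cycles of parallel work per processor
chosen = assign_levels(loads)
print(chosen, relative_energy(loads, chosen))
```

In this sketch the overall finish time is unchanged, because only processors with slack are slowed down.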
-
Fault-Tolerant Deployment of Embedded Software for Cost-Sensitive
Real-Time Feedback-Control Applications [p. 1164]
-
C. Pinello, L. Carloni, and A. Sangiovanni-Vincentelli
Designing cost-sensitive real-time control systems for safety-critical
applications requires a careful analysis of the cost/coverage
trade-offs of fault-tolerant solutions. This further complicates the difficult
task of deploying the embedded software that implements the
control algorithms on the execution platform that is often distributed
around the plant (as is typical, for instance, in automotive applications).
We propose a synthesis-based design methodology that relieves
the designers from the burden of specifying detailed mechanisms for
addressing platform faults, while involving them in the definition of the
overall fault-tolerance strategy. Thus, they can focus on addressing
plant faults within their control algorithms, selecting the best components
for the execution platform, and defining an accurate fault model.
Our approach is centered on a new model of computation, Fault Tolerant
Data Flows (FTDF), that enables the integration of formal validation
techniques.
-
Task Feasibility Analysis and Dynamic Voltage Scaling in Fault-Tolerant Real-Time Embedded Systems [p. 1170]
-
Y. Zhang and K. Chakrabarty
We investigate dynamic voltage scaling (DVS) in real-time
embedded systems that use checkpointing for fault
tolerance. We present feasibility-of-scheduling tests for
checkpointing schemes for a constant processor speed as well
as for variable processor speeds. DVS is then carried out on
the basis of the feasibility analysis. We incorporate practical
issues such as faults during checkpointing and state
restoration, rollback recovery time, memory access time and
energy, and DVS overhead. Simulation results are presented
for real-life checkpointing data and embedded processors.
-
Quasi-Static Scheduling for Real-Time Systems with Hard and Soft Tasks [p. 1176]
-
L. Cortés, P. Eles, and Z. Peng
This paper addresses the problem of scheduling for real-time
systems that include both hard and soft tasks. The relative
importance of soft tasks and how the quality of results
is affected when missing a soft deadline are captured by utility
functions associated with soft tasks. Thus the aim is to
find the execution order of tasks that makes the total utility
maximum and guarantees hard deadlines. We consider time
intervals rather than fixed execution times for tasks. Since
a purely off-line solution is too pessimistic and a purely on-line
approach incurs an unacceptable overhead due to the high
complexity of the problem, we propose a quasi-static approach
where a number of schedules are prepared at design-time and
the decision of which of them to follow is taken at run-time
based on the actual execution times. We propose an exact
algorithm as well as different heuristics for the problem addressed
in this paper.
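One possible shape for the run-time side of such a quasi-static approach is a lookup table prepared at design-time. The sketch below assumes that the choice depends only on the completion time of the task that just finished; the class name, the table layout, and the task names in the example are hypothetical and not taken from the paper.

```python
from bisect import bisect_right


class QuasiStaticTable:
    """Design-time table of precomputed schedules.

    bounds: sorted completion-time thresholds.
    schedules: one precomputed task order per interval, so there is
    exactly one more schedule than there are thresholds.
    """

    def __init__(self, bounds, schedules):
        if len(schedules) != len(bounds) + 1:
            raise ValueError("need one schedule per interval")
        self.bounds = list(bounds)
        self.schedules = list(schedules)

    def select(self, completion_time):
        """Run-time decision: a cheap lookup instead of on-line optimization."""
        return self.schedules[bisect_right(self.bounds, completion_time)]


table = QuasiStaticTable(
    bounds=[10, 20],
    schedules=[
        ["soft_a", "soft_b", "hard_c"],   # finished early: run both soft tasks
        ["soft_a", "hard_c"],             # less slack: drop one soft task
        ["hard_c"],                       # finished late: hard task only
    ],
)
print(table.select(7), table.select(15), table.select(25))
```

The expensive work of computing the schedules and thresholds happens off-line, which is the point of the quasi-static compromise described in the abstract.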
Organizer/Moderator: B. Bennetts, Bennetts Associates, UK
-
Status of IEEE Testability Standards 1149.4, 1532 and 1149.6 [p. 1184]
-
S. Sunter, A. Osseiran, A. Cron, N. Jacobson, D. Bonnett, B. Eklow, C. Barnhart, and B. Bennetts
Single board, and now multi-board testability
is highly conditioned by the availability of
various forms of boundary scan technology.
This paper surveys the three more recent IEEE
Standards relating to boundary scan. The
paper is based on three backgrounders
prepared by members of the individual
Working Groups for the IEEE Standards booth
at ITC 2003.
Moderators: I. Markov, Michigan U, US; J. Lienig, TU Dresden, DE
-
Eliminating False Positives in Crosstalk Noise Analysis [p. 1192]
-
Y. Ran, M. Marek-Sadowska, A. Kondratyev, and Y. Watanabe
Noise affects circuit operation by increasing gate delays
and causing latches to capture incorrect values. Noise analysis
techniques can detect some of such noise faults, but accurate
analysis requires a careful examination of timing and
functional properties of the circuit. This paper proposes a
method to check the 'true' noise impact on path delay.
It uses four-variable Boolean logic to characterize signal
transitions in a time interval, and formulates Boolean satisfiability
between aggressors and a victim under the min-max
delay model for gates. The proposed technique is scalable
as it keeps the size of Boolean formulation linear to the size
of the modeled circuit. Applied to a set of large circuits,
it eliminated up to 50% of the noise delay faults reported
by a conventional noise analysis method.
-
A New Approach to Timing Analysis Using Event Propagation and Temporal Logic [p. 1198]
-
A. Mondal, P. Chakrabarti, and C. Mandal
Present day designers require deep reasoning methods
to analyze circuit timing. This includes analysis of effects
of dynamic behavior (like glitches) on critical paths, simultaneous
switching and identification of specific patterns
and their timings. This paper proposes a novel approach
that uses a combination of symbolic event propagation and
temporal reasoning to extract timing properties of gate-level
circuits. The formulation captures complex situations
like triggering of traditional false paths and simultaneous
switching in a unified symbolic representation in addition
to identifying false paths, critical paths as well as conditions
for such situations. This information is then represented
as an event-time graph. A simple temporal logic on
events is proposed that can be used to formulate a wide
class of useful queries for various input scenarios. These include
maximum/minimum delays, transition times, duration
of patterns, etc. An algorithm is developed that retrieves answers
to such queries from the event-time graph. A complete
BDD based implementation of this system has been made.
Results on the ISCAS85 benchmarks indicate very interesting
properties of these circuits.
-
A New Effective Congestion Model in Floorplan Design [p. 1204]
-
Y. Hsieh and T. Hsieh
In this paper, we provide a new efficient and accurate
congestion model embedded into a floorplanner to estimate
the congestion of floorplans. It is based on probabilistic
analysis and a new concept of Irregular-Grid which uses
the routing information to determine the evaluating regions
instead of fixed-size grids. Three complete experiments are
performed and the experimental results show the
correctness, accuracy and efficiency of our new congestion
model.
-
ULSI Interconnect Length Distribution Model Considering Core Utilization [p. 1210]
-
H. Nakashima, J. Inoue, K. Okada, and K. Masu
Interconnect Length Distribution (ILD) represents a correlation
between the number of interconnects and length.
The ILD can predict power consumption, clock frequency,
chip size, etc. It has been said that high core utilization
and small circuit area improve chip performance. We
propose an ILD model to predict a correlation between core
utilization and chip performance. The proposed model
predicts influences of interconnect length and interconnect
density on circuit performances. As core utilization increases,
small and simple circuits improve the performances.
In large complex circuits, decrease of load capacitance
is more important than that of total interconnect
length for improvement of chip performance. The proposed
ILD model expresses actual ILD more accurately than conventional
models.
Moderators: Y. Tanurhan, Actel, US; W. Rosenstiel, Tuebingen U and FZI Karlsruhe, DE
-
Implementation of a UMTS Turbo-decoder on a Dynamically Reconfigurable Platform [p. 1218]
-
A. La Rosa, C. Passerone, F. Gregoretti, and L. Lavagno
Modern embedded systems must execute a variety of
high performance real-time tasks, such as audio and image
compression and decompression, channel coding and
encoding, etc. Reconfigurable platforms can effectively be
used in these cases, because they allow the architecture to be
reused for as many applications as possible.
This paper describes the implementation of a UMTS
turbo-decoder on one such platform, the XiRisc reconfigurable
processor. Our goal is to test the development
framework and design flow that we already developed on a
real industrial example. Our results show that, with some
manual effort from the designer, very good performance
improvements can be achieved, using a flow close to embedded
software development.
-
Design Methodology for a Tightly Coupled VLIW/Reconfigurable Matrix Architecture: A Case Study [p. 1224]
-
B. Mei, R. Lauwereins, S. Vernalde, and D. Verkest
Coarse-grained reconfigurable architectures have seen
growing importance recently. Design tools and methodology
are essential to their success. Based on our previous
work on modulo scheduling algorithms and a novel architecture
with tightly coupled VLIW/reconfigurable matrix,
we present a C-based design flow using an MPEG-2 decoder
as a design example. The application is mapped to
the architecture in less than one person-week starting from
a software implementation. The kernel and overall speedups
over the reference VLIW are 4.84 and 3.05, respectively. The
case study shows that our methodology and architecture can
deliver a competitive package in terms of design efforts and
performance over other programmable architectures.
-
Efficient Implementations of Mobile Video Computations on Domain-Specific Reconfigurable Arrays [p. 1230]
-
I. Ahmed, S. Baloch, A. Pai, T. Arslan, N. Aydin, S. Khawam, and F. Westall
Mobile video processing as defined in standards like
MPEG-4 and H.263 contains a number of time-consuming
computations that cannot be efficiently
executed on current hardware architectures. The authors
recently introduced a reconfigurable SoC platform that
permits a low-power, high-throughput and flexible
implementation of the motion estimation and DCT
algorithms. The computations are done using domain-specific
reconfigurable arrays that have demonstrated up
to 75% reduction in power consumption when compared
to generic FPGA architecture, which makes them suitable
for portable devices. This paper presents and compares
different configurations of the arrays for efficiently
implementing DCT and motion estimation algorithms. A
number of algorithms are mapped into the various
reconfigurable fabrics demonstrating the flexibility of the
new reconfigurable SoC architecture and its ability to
support a number of implementations having different
performance characteristics.
-
Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience [p. 1236]
-
H. Krupnova
Today, having a fast hardware platform for SoC software development prior to silicon is an important challenge in reducing time-to-market. FPGAs have offered an excellent prototyping basis for building hardware platforms for more than ten years ([1]). However, as circuit complexity increases and project timeframes shrink, building a multi-FPGA prototype represents a real challenge from the complexity viewpoint. The paper describes the state-of-the-art mapping methodology, prototyping tools and flows, and shows the most difficult mapping problems and the ways to overcome them. The paper draws on the experience of mapping onto an FPGA platform the four latest highly complex ST Microelectronics SoCs, ranging from 1.5 to 4 million real ASIC gates and mapped to up to 9 highest-capacity FPGAs.
Moderators: B. Candaele, Thales, FR; A. Jerraya, TIMA Laboratory, FR
-
Using a Communication Architecture Specification in an Application-Driven Retargetable
Prototyping Platform for Multiprocessing [p. 1244]
-
X. Zhu and S. Malik
In multiprocessor based SoCs, optimizing the communication
architecture is often as important as, if not more important than,
optimizing the computation architecture. While there are mature
platforms and techniques for the modeling and evaluation
of architectures of processing elements, the same is not true
for the communication architectures. This paper presents an
application-driven retargetable prototyping platform which fills
this gap. This environment aims to facilitate the design exploration
of the communication sub-system through application-level
execution-driven simulations and quantitative analysis.
First, we introduce an expressive communication architecture
specification which gives the designers the freedom to choose
and configure their custom interconnection schemes over a wide
range of communication architectures, covering the spectrum
from buses to packet switching networks. This, combined with
a distributed application model, drives a modular modeling and
simulation environment that permits detailed simulation of the
communication (and computation) architectures at the application
level. Through the case studies motivated by an embedded
system application, we show that through simulations, critical
system information such as timings and communication patterns
can be obtained and evaluated. Consequently, system-level design
choices regarding the communication architecture can be
made with high confidence in the early stages of design. In addition
to improving design quality, this methodology also results
in significantly shortening design-time.
-
A Power and Performance Model for Network-on-Chip Architectures [p. 1250]
-
N. Banerjee, P. Vellanki, and K. Chatha
Networks-on-Chip (NoC) has been proposed as a solution
for addressing the design challenges of future high-performance
nanoscale architectures. Innovative system-level
performance models are required for designing NoC
based architectures. This paper presents a VHDL based cycle
accurate register transfer level model for evaluating the
latency, throughput, dynamic, and leakage power consumption
of NoC based interconnection architectures. We implemented
a parameterized register transfer level design of the
NoC architecture elements. The design is parameterized on
(i) size of packets, (ii) length and width of physical links,
(iii) number, and depth of virtual channels, and (iv) switching
technique. The paper discusses in detail the architecture
and characterization of the various NoC components. The
paper presents results obtained by application of the model
towards design space exploration, and power versus performance
trade-off analysis of 4x4 mesh based NoC architecture.
-
A System Level Processor/Communication Co-Exploration Methodology for
Multi-Processor System-on-Chip Platforms [p. 1256]
-
A. Wieferink, T. Kogel, R. Leupers, G. Ascheid, H. Meyr, G. Braun, and A. Nohl
Current and future SoC designs will contain an increasing
number of heterogeneous programmable units combined
with a complex communication architecture to meet flexibility,
performance and cost constraints. Designing such a heterogeneous
MP-SoC architecture bears enormous potential for optimization,
but requires a system-level design environment and
methodology to evaluate architectural alternatives. This paper
proposes a methodology to jointly design and optimize the
processor architecture together with the on-chip communication
based on the LISA Processor Design Platform in combination
with SystemC Transaction Level Models. The proposed
methodology advocates a successive refinement flow of the
system-level models of both the processor cores and the communication
architecture. This allows design decisions based
on the best modeling efficiency, accuracy and simulation performance
possible on the respective abstraction level. The effectiveness
of our approach is demonstrated by the exemplary
design of a dual-processor JPEG decoding system.
Moderators: F. Rousseau, TIMA Laboratory, FR; J. Madsen, TU Denmark, DK
-
Cache-Aware Scratchpad Allocation Algorithm [p. 1264]
-
M. Verma, L. Wehmeyer, and P. Marwedel
In the context of portable embedded systems, reducing
energy is one of the prime objectives. Most high-end embedded
microprocessors include on-chip instruction and data
caches, along with a small energy efficient scratchpad. Previous
approaches for utilizing scratchpad did not consider
caches and hence fail for the au courant architecture. In the
presented work, we use the scratchpad for storing instructions
and propose a generic Cache Aware Scratchpad Allocation
(CASA) algorithm. We report an average reduction
of 8-29% in instruction memory energy consumption compared
to a previously published technique for benchmarks
from the Mediabench suite.
The scratchpad in the presented architecture is similar
to a preloaded loop cache. Comparing the energy consumption
of our approach against preloaded loop caches, we report
average energy savings of 20-44%.
-
Phase Coupled Code Generation for DSPs Using a Genetic Algorithm [p. 1270]
-
M. Lorenz and P. Marwedel
The growing use of digital signal processors (DSPs) in embedded
systems necessitates the use of optimizing compilers
supporting special hardware features. Due to the irregular architectures
present in today's DSPs there is a need for compilers
which are capable of performing a phase coupling of the
highly interdependent code generation subtasks and a graph
based code selection. In this paper we present a code generator
which performs a graph based code selection and a complete
phase coupling of code selection, instruction scheduling
(including compaction) and register allocation. In addition,
our code generator takes into account effects of the subsequent
address code generation phase. In order to solve the phase
coupling problem and to handle the problem complexity, our
code generator is based on a genetic algorithm. Experimental
results for several benchmarks and an MP3 application for
two DSPs show the effectiveness and the retargetability of our
approach. Using the presented techniques, the number of execution
cycles is reduced by 51% on average for the M3-DSP
and by 38% on average for the ADSP2100 compared to standard
techniques.
-
A Methodology and Tool Suite for C Compiler Generation from ADL Processor Models [p. 1276]
-
M. Hohenauer, H. Scharwaechter, K. Karuri, O. Wahlen, T. Kogel,
R. Leupers, G. Ascheid, H. Meyr, G. Braun, and H. van Someren
Retargetable C compilers are key tools for efficient architecture
exploration for embedded processors. In this paper
we describe a novel approach to retargetable compilation
based on LISA, an industrial processor modeling language
for efficient ASIP design. In order to circumvent the
well-known trade-off between flexibility and code quality in
retargetable compilation, we propose a user-guided, semi-automatic
methodology that in turn builds on a powerful
existing C compiler design platform. Our approach allows
generated C compilers to be included in the ASIP architecture
exploration loop at an early stage, thereby allowing
for a more efficient design process and avoiding
application/architecture mismatches. We present the corresponding
methodology and tool suite and provide experimental data
for two real-life embedded processors that prove the feasibility
of the approach.
Moderators: H. Vranken, Philips Research, NL; C. Papachristou, Case Western Reserve U, US
-
Nine-Coded Compression Technique with Application to Reduced Pin-Count Testing and
Flexible On-Chip Decompression [p. 1284]
-
M. Tehranipour, M. Nourani, and K. Chakrabarty
This paper presents a new test data compression
technique based on a compression code that uses exactly nine codewords.
In spite of its simplicity, it provides significant reduction in
test data volume and test application time. In addition, the decompression
logic is very small and independent of the precomputed
test data set. Our technique leaves many don't-care bits
unchanged in the compressed test set, and these bits can be filled
randomly to detect non-modeled faults. The proposed technique
can be efficiently adopted for single- or multiple-scan chain designs
to reduce test application time and pin requirement. Experimental
results for ISCAS'89 benchmarks illustrate the flexibility
and efficiency of the proposed technique.
-
CircularScan: A Scan Architecture for Test Cost Reduction [p. 1290]
-
B. Arslan and A. Orailoglu
Scan-based designs are widely used to decrease the complexity
of the test generation process; nonetheless, they increase test time
and volume. A new scan architecture is proposed to reduce test
time and volume while retaining the original scan input count. The
proposed architecture allows the use of the captured response as a
template for the next pattern with only the necessary bits of the captured
response being updated while observing the full captured response.
The theoretical and experimental analysis promises a substantial
reduction in test cost for large circuits.
-
Hybrid Delay Scan: A Low Hardware Overhead Scan-Based Delay Test Technique for
High Fault Coverage and Compact Test Sets [p. 1296]
-
S. Wang, S. Chakradhar, and X. Liu
A novel scan-based delay test approach, referred to as the
hybrid delay scan, is proposed in this paper. The proposed
scan-based delay testing method combines advantages of
the skewed-load and broad-side approaches. Unlike the
skewed-load approach whose design requirement is often
too costly to meet due to the fast switching scan enable signal,
the hybrid delay scan does not require a strong buffer
or buffer tree to drive the fast switching scan enable signal.
Hardware overhead added to standard scan designs to
implement the hybrid approach is negligible. Since the fast
scan enable signal is internally generated, no external pin
is required. Transition delay fault coverage achieved by the
hybrid approach is equal to or higher than that achieved
by the broad-side load for all ISCAS 89 benchmark circuits.
On average, about 4.5% improvement in fault coverage
is obtained by the hybrid approach over the broad-side approach.
-
Diagnosis of Scan-Chains by Use of a Configurable Signature Register and Error-Correcting Codes [p. 1302]
-
A. Leininger, P. Muhmenthaler, and M. Goessel
In this paper a new diagnosis method for scan designs
with many scan-paths based on error correcting linear
block codes with N information bits and K control bits
is proposed, where N is the number of scan-paths. The
new approach can be implemented on a modified STUMPS-architecture.
In diagnosis mode the test has to be
repeated K times. In the K repetitions of the test the outputs
of the scan-paths are connected to a configurable signature
register (with disconnected feedback logic) according
to the coefficients of the K syndrome equations of the code.
By monitoring the one-dimensional output sequence of the
configurable signature register the failing scan-cells in the
different scan-paths can be identified with the resolution of
the selected error correcting code. Since for the relevant
codes, e.g. (shortened) Hamming codes and T-error-correcting
BCH codes, the ratio K/N decreases very fast with an increasing
number N, the method is useful for a large number of
scan-paths.
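As a rough illustration of the syndrome idea (not the authors' exact construction), the sketch below assumes at most one failing scan-path per shift cycle and labels path i with the binary representation of i + 1, which plays the role of a Hamming parity-check column. Each of the K repetitions compacts only the paths whose label has one particular bit set, and the K parity results together spell out the failing path. The function names are hypothetical.

```python
def syndrome_bits_needed(n_paths):
    """Smallest K such that every path gets a distinct nonzero K-bit label."""
    k = 1
    while (1 << k) - 1 < n_paths:
        k += 1
    return k


def locate_failing_path(error_bits):
    """error_bits[i] is 1 if scan-path i (0-based) delivers a wrong bit.

    Assumes at most one failing path (single-error-correcting code).
    Returns the 0-based index of the failing path, or None if no error.
    """
    k = syndrome_bits_needed(len(error_bits))
    syndrome = 0
    for j in range(k):  # one test repetition per syndrome equation
        parity = 0
        for i, e in enumerate(error_bits):
            if ((i + 1) >> j) & 1:  # path i takes part in equation j
                parity ^= e
        syndrome |= parity << j
    return None if syndrome == 0 else syndrome - 1
```

With 20 scan-paths this labelling needs only 5 repetitions, which reflects the slow growth of K relative to N noted in the abstract.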
Moderators: T. Kazmierski, Southampton U, UK; S. Yoo, TIMA Laboratory, FR
-
Hierarchical Multi-Dimensional Table Lookup for Model Compiler Based Circuit Simulation [p. 1310]
-
B. Wan and C. Shi
In this paper, a systematic method for automatically generating hierarchical multi-dimensional table lookup models for compact device and behavioral models with any number of terminals is presented. The method is based on an Abstract Syntax Tree representation of analytic equations. Expensive parts of the computations represented by abstract syntax trees are identified and replaced by two-dimensional table lookup models. An error-control based optimization algorithm is developed to generate table lookup models with the minimal amount of table data for a given accuracy requirement. The proposed method has been implemented in the model compiler MCAST and the circuit simulator SPICE3. Experimental results show that, compared to non-optimized compilation based simulation, the simulation using the proposed table lookup optimization method is about 40 times faster and achieves sufficiently accurate results with error less than 1-2%.
Index Terms -- Model Compiler, Syntax-Tree, Hierarchical Multi-dimensional Table Lookup, Optimization, Circuit Simulation.
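To make the two-dimensional table lookup concrete, here is a minimal sketch that tabulates a function on a rectangular grid and evaluates it by bilinear interpolation. The interpolation scheme, the grid layout, and the names build_table and lookup are assumptions for illustration; the abstract does not say how MCAST interpolates or chooses grid points.

```python
from bisect import bisect_right


def build_table(f, xs, ys):
    """Precompute f on the grid xs x ys (both sorted ascending)."""
    return [[f(x, y) for y in ys] for x in xs]


def _cell(grid, v):
    """Index of the grid interval containing v, clamped to the table."""
    i = bisect_right(grid, v) - 1
    return min(max(i, 0), len(grid) - 2)


def lookup(table, xs, ys, x, y):
    """Bilinear interpolation inside the grid cell that contains (x, y)."""
    i, j = _cell(xs, x), _cell(ys, y)
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    return ((1 - tx) * (1 - ty) * table[i][j]
            + tx * (1 - ty) * table[i + 1][j]
            + (1 - tx) * ty * table[i][j + 1]
            + tx * ty * table[i + 1][j + 1])
```

An error-controlled generator in the spirit of the abstract would refine xs and ys until the interpolation error falls below the requested tolerance; that step is not shown here.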
-
Direct Nonlinear Order Reduction with Variational Analysis [p. 1316]
-
L. Feng, X. Zeng, C. Chiang, D. Zhou, and Q. Fang
The variational analysis [11] has been employed in [7]
for order reduction of weakly nonlinear systems. For a relatively
strongly nonlinear system, this method will mostly lose
efficiency because of the exponentially increased number
of inputs in higher order variational equations caused by
the individual reduction process of the variational systems.
Moreover, the inexact inputs into the higher order variational
equations inevitably introduce extra errors in the
order reduction process. Inspired by the variational analysis,
we propose a direct model order reduction method. The
order of the approximate polynomial system of the original
nonlinear system is directly reduced by one projection space.
The proposed direct reduction technique can easily avoid
the errors brought by inexact inputs and the exponentially
increased inputs. We show theoretically and experimentally
that the proposed method can achieve a much more accurate
reduced system with smaller order size than the conventional
variational equation order reduction method.
-
Steady-State Analysis of Nonlinear Circuits Using Discrete Singular Convolution Method [p. 1322]
-
X. Zhou, D. Zhou, J. Liu, R. Li, X. Zeng, and C. Chiang
In this paper, we propose a novel time-domain based
method, the Discrete Singular Convolution algorithm, for
computing the steady-state response of nonlinear circuits.
Properties and advantages of the Discrete Singular
Convolution method are discussed and compared with those of
other approaches. The accuracy and efficiency of this
method are tested by numerical experiments.
-
Hybrid Reduction Technique for Efficient Simulation of Linear/Nonlinear Mixed Circuits [p. 1327]
-
T. Mine, H. Kubota, A. Kamo, T. Watanabe, and H. Asai
In this paper, we propose a new method which makes
transient simulation faster for circuits including both
nonlinear and linear elements. First, the method for
generating the projection matrix with Krylov-subspace
technique is described. The order of the circuit equation is
reduced by congruence transformation with the projection
matrix. Next, we suggest a method which can calculate the
reduced Jacobian matrix directly in each
Newton-Raphson iteration. Since this technique does not
need to calculate the full-size Jacobian matrix, the
calculation cost is reduced drastically. Therefore, efficient
circuit simulation can be achieved. Finally, our method is
applied to some example circuits and the validity of the
nonlinear circuit reduction technique is verified.
Organizer/Moderator: H. Schlebusch, Synopsys, DE
Speaker: T. Fitzpatrick, Synopsys, US
-
System Verilog for VHDL Users [p. 1334]
-
SystemVerilog was developed to provide an evolutionary path
from existing hardware description languages (HDLs) to next-generation
design and verification methodologies necessary to support the
development of the increasingly complex SoC designs of today and
tomorrow. Although its roots are firmly planted in Verilog, many of
the features of SystemVerilog were targeted to address capabilities
that VHDL users have had for years.
This tutorial will provide an overview of SystemVerilog, focusing
on those language features that enable the adoption of SystemVerilog
by VHDL designers, such as complex and user-defined data types,
multi-dimensional arrays, and the concept of strong data type
checking. In addition, we will show how VHDL and Verilog users
can take advantage of distinct SystemVerilog features to improve
their productivity with advanced coding capability and built-in
verification.
Organizer/Moderator: P. Eles, Linkoping U, SE
Speakers:
R. Marculescu, Carnegie Mellon U, US
J. Henkel, NEC, US
M. Pedram, Southern California U, US
-
Distributed Multimedia System Design: A Holistic Perspective [p. 1342]
-
Multimedia systems play a central part in many human activities.
Due to the significant advances in the VLSI technology, there is an
increasing demand for portable multimedia appliances capable of
handling advanced algorithms required in all forms of communication.
Over the years, we have witnessed a steady move from standalone
(or desktop) multimedia to deeply distributed multimedia
systems. Whereas desktop-based systems are mainly optimized
based on the performance constraints, power consumption is the
key design constraint for multimedia devices that draw their
energy from batteries. The overall goal of successful design is then
to find the best mapping of the target multimedia application onto
the architectural resources, while satisfying an imposed set of
design constraints (e.g. minimum power dissipation, maximum
performance) and specified QoS metrics (e.g. end-to-end latency,
jitter, loss rate) which directly impact the media quality. This paper
addresses a few fundamental issues that make the design process
particularly challenging and offers a holistic perspective towards
a coherent design methodology.
-
Adaptive Prefetching for Multimedia Applications in Embedded Systems [p. 1350]
-
H. Sbeyti, S. Niar, and L. Eeckhout
This paper presents a new and simple prefetching
mechanism to improve the memory performance of
multimedia applications. This method adapts the memory
access mechanism to the access patterns as observed in
the application. By doing so, performance is increased,
the available resources are better utilized and energy
consumption is reduced. Using our prefetch method, we
are able to get up to 5.5% IPC improvement, more than
50% cache miss reduction, and up to 4.5% energy
reduction. Our mechanism results in better performance
for a 2KB data cache than is achievable with an 8KB data
cache (without prefetching) for StrongArm SA1110 and
Xscale-like processor configurations. This mechanism
requires limited hardware resources and generates few
additional external bus transfers. This makes this adaptive
prefetching well suited for embedded microprocessor
systems.
-
Data Windows: A Data-Centric Approach for Query Execution in Memory-Resident Databases [p. 1352]
-
J. Pisharath, A. Choudhary, and M. Kandemir
Structured embedded databases are currently becoming
an integrated part of embedded systems, thus enabling
higher standards in system automation. These embedded
databases are typically memory resident. In this paper, we
present a data-centric approach called data windowing that
optimizes multiple queries issued to an embedded database.
Traditional approaches improve the performance by optimizing
the control flow of operations, whereas we target
performance improvements based on the data that is
brought into the system.
-
High-Performance QuIDD-Based Simulation of Quantum Circuits [p. 1354]
-
G. Viamontes, I. Markov, and J. Hayes
Simulating quantum computation on a classical computer
is a difficult problem. The matrices representing
quantum gates, and vectors modeling qubit states grow exponentially
with the number of qubits. It has been shown experimentally
that the QuIDD (Quantum Information Decision
Diagram) datastructure greatly facilitates simulations
using memory and runtime that are polynomial in the number
of qubits. In this paper, we present a complexity analysis
which formally describes this class of matrices and vectors.
We also present an improved implementation of QuIDDs
which can simulate Grover's algorithm for quantum search
with the asymptotic runtime complexity of an ideal quantum
computer up to negligible overhead.
-
An Application of Parallel Discrete Event Simulation Algorithms to Mixed Domain System Simulation [p. 1356]
-
D. Reed, S. Levitan, J. Boles, J. Martinez, and D. Chiarulli
We present our system-level co-simulation
environment for mixed domain microsystems. The
environment provides synchronization and cosimulation
between the Chatoyant MOEMS (Micro-Opto-Electro-Mechanical Systems) simulator and
ModelTech ModelSim. By using shared memory
IPC (Inter-Process Communication) and PDES
(Parallel Discrete Event Simulation) techniques, we
achieve two orders of magnitude speedup over
standard pipe/socket communication.
-
Fault Tolerance of Programmable Switch Blocks [p. 1358]
-
J. Huang, M. Tahoori, and F. Lombardi
This paper presents a new approach for the evaluation of FPGA
routing resources in the presence of faulty switches. This is
considered under the worst case scenario of open faults. Signal
routing in the presence of faulty switches is analyzed at the switch
block level; probabilistic routing (routability) is used as the figure
of merit for evaluating the interconnect resources of FPGAs. The
presented approach utilizes a path-based technique to find the
probability of establishing a path between pairs of input and
output endpoints in a switch block. The results are reported
for various commercial and academic FPGAs.
-
A New Self-Checking Sum-Bit Duplicated Carry-Select Adder [p. 1360]
-
E. Sogomonyan, D. Marienfeld, V. Ocheretnij, and M. Gössel
In this paper the first code-disjoint totally self-checking
carry-select adder is proposed. The adder blocks
are fast ripple adders with a single NAND-gate delay for
carry-propagation per cell. In every adder block both the
sum-bits and the corresponding inverted sum-bits are simultaneously
implemented. The parity of the input operands
is checked against the XOR-sum of the propagate signals.
For 64 bits, area and maximal delay are determined by the
SYNOPSYS CAD tool of the EUROCHIP project. Compared
to a 64 bit carry-select adder without error detection
the delay of the most significant sum-bit does not increase.
The area is 170% of a 64 bit carry-select adder (without
error detection and not code-disjoint).
-
A Macromodelling Methodology for Efficient High-Level Simulation of Substrate Noise Generation [p. 1362]
-
L. Elvira, F. Martorell, X. Aragonés, and J. González
Efficient prediction of the substrate noise generated by
digital sections is currently a major challenge in System-on-a-Chip
design. In this paper a macromodel to accurately and
efficiently predict the substrate noise generated by digital
standard cells is presented. The macromodel accuracy is
demonstrated for some simple circuits.
-
Accurate Estimation of Parasitic Capacitances in Analog Circuits [p. 1364]
-
A. Agarwal, H. Sampath, V. Yelamanchili, and R. Vemuri
This paper presents efficient and accurate techniques for modeling
parasitic capacitances in analog CMOS circuits. A layout
aware synthesis flow using these parasitic models has
been proposed. The fast parasitic estimation process replaces
the time consuming steps of layout generation and extraction
during synthesis. Results indicate that these models are extremely
fast and accurate.
-
GRAAL -- A Development Framework for Embedded Graphics Accelerators [p. 1366]
-
D. Crisu, S. Cotofana, S. Vassiliadis, and P. Liuha
This paper presents a versatile hardware/software cosimulation
and co-design environment for embedded 3D
graphics accelerators. The GRAphics AcceLerator design
exploration framework (GRAAL) is an open system which
offers a coherent development methodology based on an extensive
library of SystemC RTL models of graphics pipeline
components. GRAAL incorporates tools to assist in the visual
debugging of the graphics algorithms implemented
in hardware, and to estimate the performance in terms of
throughput, power consumption, and area.
-
From Synchronous to Asynchronous: An Automatic Approach [p. 1368]
-
J. Cortadella, A. Kondratyev, L. Lavagno, K. Lwin, and C. Sotiriou
This paper presents a methodology to derive asynchronous
circuits from optimized synchronous circuits by
replacing the clock distribution tree by a handshaking network.
A case study shows the applicability of the method
and the potential benefits of de-synchronizing synchronous
circuits.
-
Enhancing Testability of System on Chips Using Network Management Protocols [p. 1370]
-
O. Laouamri and C. Aktouf
This paper shows how to adapt the P1500 Design-For-Test standard through network management
protocols to make the testing problem of System-On-Chips (SoCs) easier and more cost-effective. For this purpose,
a SoC is analyzed as a distributed system in which its own
basic components or IP (Intellectual Property) cores
are considered as network agents according to the SNMP
(Simple Network Management Protocol). An
experimental study was carried out to show the
effectiveness of such an approach.
-
Minimization of Crosstalk Noise, Delay and Power Using a Modified Bus Invert Technique [p. 1372]
-
M. Lampropoulos, B. Al-Hashimi, and P. Rosinger
Previously reported bus encoding approaches reduce
crosstalk delay but they ignore the effects of inductive coupling
between the bus lines, i.e. crosstalk noise. Aiming to
solve this issue, this paper presents a modified bus-invert
technique which minimizes crosstalk noise, as well as delay
and power, at the expense of a small area overhead.
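For readers unfamiliar with the baseline, the sketch below shows the classic bus-invert code that the paper modifies, not the modified technique itself, whose details are not given in the abstract. A word is sent inverted, with an extra invert line raised, whenever sending it as-is would toggle more than half of the bus lines. The function names are hypothetical.

```python
def bus_invert_encode(words, width):
    """Classic bus-invert: returns a list of (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    prev = 0  # assumed initial state of the bus lines
    encoded = []
    for w in words:
        w &= mask
        toggles = bin(prev ^ w).count("1")
        if 2 * toggles > width:      # more than half of the lines would switch
            sent, inv = (~w) & mask, 1
        else:
            sent, inv = w, 0
        encoded.append((sent, inv))
        prev = sent
    return encoded


def bus_invert_decode(encoded, width):
    """Undo the inversion wherever the invert bit is set."""
    mask = (1 << width) - 1
    return [(~s) & mask if inv else s for s, inv in encoded]
```

Note that this baseline only bounds the number of switching lines per transfer; it does not look at which neighbouring lines switch in opposite directions, which is what matters for crosstalk.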
-
Energy-Efficient Design for Highly Associative Instruction Caches in Next-Generation Embedded Processors [p. 1374]
-
J. Aragon, D. Nicolaescu, A. Veidenbaum, and A. Badulescu
This paper proposes a low-energy solution for CAM-based
highly associative I-caches using a segmented wordline
and a predictor-based instruction fetch mechanism.
Not all instructions in a given I-cache fetch are used due
to branches. The proposed predictor determines which instructions
in a cache access will be used and does not fetch
any other instructions. Results show an average I-cache energy
savings of 44% over the baseline case and 6% over the
segmented case with no negative impact on performance.
-
Dynamic Voltage and Cache Reconfiguration for Low Power [p. 1376]
-
A. Nacul and A. Givargis
In this work, we propose a combined Dynamic Voltage Scaling
(DVS) and Dynamic Cache Reconfiguration (DCR) online algorithm
that dynamically adapts the processor speed (i.e., voltage) and the
cache subsystem to the workload requirements for the purposes of
saving energy. The workload is considered to be a set of tasks with
real-time deadlines. Our online algorithm is invoked as part of the OS
scheduler, which performs standard earliest deadline first (EDF) task
scheduling first. Then, our online algorithm determines an ideal
voltage/cache configuration for the currently executing task.
-
Overhead-free Polymorphism in Network-on-Chip Implementation of Object-Oriented Models [p. 1380]
-
M. Goudarzi, S. Hessabi, and A. Mycroft
We unify virtual-method dispatch (polymorphism implementation)
and network packet-routing operations; virtual-method calls
correspond to network packets, and network
addresses are allocated such that routing the packet
corresponds to dispatching the call. As the run-time routing
structure is inherent in Network-on-Chip platforms,
this unification implements polymorphism for free.
-
Multi-Processor SoC Design Methodology Using a Concept of Two-Layer Hardware-Dependent Software [p. 1382]
-
S. Yoo, M. Youssef, A. Bouchhima, A. Jerraya, and M. Diaz-Nava
In conventional multiprocessor SoC (MPSoC) design methods,
we find two problems: lack of SW code portability and lack
of early SW validation. The problems cause a long design
cycle. To resolve them, we present a concept of two-layer
hardware-dependent software (HdS). The presented HdS
consists of a hardware abstraction layer to abstract the subsystem
architecture and an SoC abstraction layer to abstract the
global MPSoC architecture. During the exploration of global
and sub-system architectures, the application programming
interfaces of the presented two-layer HdS keep the SW
independent of architectural changes. The simulation
models of the two-layer HdS make it possible to validate the entire system
including the SW and HW design early in the design steps.
We show the effectiveness of the presented methodology in
the MPSoC architecture exploration of an OpenDiVX encoder
system design.
-
Synthesis of Reversible Logic [p. 1384]
-
A. Agrawal and N. Jha
A function is reversible if each input vector produces a
unique output vector. Reversible functions find applications in low power
design, quantum computing, and nanotechnology. Logic synthesis for
reversible circuits differs substantially from traditional logic synthesis.
In this paper, we present the first practical synthesis algorithm and
tool for reversible functions with a large number of inputs. It uses
positive-polarity Reed-Muller decomposition at each stage to synthesize
the function as a network of Toffoli gates. The heuristic uses a priority
queue based search tree and explores candidate factors at each stage
in order of attractiveness. The algorithm produces near-optimal results
for the examples discussed in the literature. The key contribution of
the work is that the heuristic finds very good solutions for reversible
functions with a large number of inputs.
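The definition of reversibility and the Toffoli-gate network model can be illustrated with a short sketch. It only checks and simulates networks; it is not the synthesis heuristic described in the abstract, and the function names are hypothetical.

```python
def is_reversible(truth_table):
    """A function on n-bit vectors is reversible iff its table is a permutation."""
    return sorted(truth_table) == list(range(len(truth_table)))


def toffoli(bits, controls, target):
    """Flip bits[target] when all control bits are 1 (generalized Toffoli gate)."""
    out = list(bits)
    if all(bits[c] for c in controls):
        out[target] ^= 1
    return out


def network_truth_table(n_bits, gates):
    """Simulate a cascade of (controls, target) gates on every input vector."""
    table = []
    for x in range(1 << n_bits):
        bits = [(x >> i) & 1 for i in range(n_bits)]
        for controls, target in gates:
            bits = toffoli(bits, controls, target)
        table.append(sum(b << i for i, b in enumerate(bits)))
    return table
```

Because each Toffoli gate is its own inverse, any cascade of them is reversible, so a synthesized network can be validated by comparing network_truth_table against the target permutation.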
-
A Unified Design Space for Regular Parallel Prefix Adders [p. 1386]
-
M. Ziegler and M. Stan
We consider sparsity, fanout, and radix as three dimensions
in the design space of regular parallel prefix adders
and present a unified formalism to describe such structures.
Keywords: parallel prefix adder, Kogge-Stone adder,
Han-Carlson adder, Brent-Kung adder.
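As one concrete point in this design space, the sketch below is a bit-level software model of a radix-2 Kogge-Stone adder, in which every bit position computes its own prefix. It is a behavioral illustration only, and the function name is hypothetical.

```python
def kogge_stone_add(a, b, width):
    """Add two width-bit integers with a Kogge-Stone prefix network.

    Returns (sum modulo 2**width, carry_out).
    """
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]  # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # bit propagate
    G, P = g[:], p[:]
    d = 1
    while d < width:                 # one prefix level per doubling of span
        G_next, P_next = G[:], P[:]
        for i in range(d, width):
            G_next[i] = G[i] | (P[i] & G[i - d])
            P_next[i] = P[i] & P[i - d]
        G, P = G_next, P_next
        d *= 2
    # G[i] is now the carry out of bit i; the carry into bit i is G[i - 1].
    total = 0
    for i in range(width):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[width - 1]
```

Brent-Kung and Han-Carlson adders compute the same prefixes with fewer cells by computing only a subset of positions at full depth and filling in the rest afterwards.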
-
MODD: A New Decision Diagram and Representation for Multiple Output Binary Functions [p. 1388]
-
A. Jabir and D. Pradhan
This paper presents a new decision diagram (DD), called
MODD, for multiple output binary and multiple-valued
functions. This DD is canonic and can be made minimal
with respect to a given variable order. Unlike other reported
DDs, our approach can represent arbitrary combination of
bits at the word-level. The preliminary results show that
our representation can result in considerable memory saving.
-
Issues in Implementing Latency Insensitive Protocols [p. 1390]
-
M. Casu and L. Macchiarulo
-
Model-Based Specification and Execution of Embedded Real-Time Systems [p. 1392]
-
T. Schattkowsky and W. Mueller
-
A Demonstration of Co-Design and Co-Verification in a Synchronous Language [p. 1394]
-
S. Singh
-
Profile Guided Management of Code Partitions for Embedded Systems [p. 1396]
-
S. Zhou, B. Childers, and N. Kumar
Researchers have proposed to divide embedded
applications into code partitions and to download
partitions on demand from a wireless code server to
enable a diverse set of applications for very tightly
constrained embedded systems. This paper describes a
new approach for managing the request and storage of
code partitions and we explore the benefits of our scheme.
-
Realizable Reduction for Electromagnetically Coupled RLMC Interconnects [p. 1400]
-
R. Jiang and C. Chen
This paper presents a realizable RLMC reduction algorithm
for extracted interconnect circuits based on two
effective approaches: RL branch reduction and RC/LC
node reduction. Our algorithm takes advantage of some
structures existing extensively in interconnect circuits and
hence has extremely fast execution time. It takes about 8
seconds to reduce a circuit of over 300,000 elements while
maintaining 3% error and 75% element reduction ratio.
-
Statistically Aware Buffer Planning [p. 1402]
-
G. Garcea, N. van der Meijs, and R. Otten
In this paper, we will develop an analytic approach to estimate
the statistical properties (mean and variance) of the performance
of a uniformly buffered global IC interconnect, based on the mean
and (co)variance of the appropriate design and technology parameters.
Compared to other approaches, such as Monte Carlo based
approaches, our analytic approach would allow a much tighter
design optimization loop and provide better insight into the factors
involved. The model that we use is generic, but in this paper
we assume a set of synthetic (not based on actual process data)
but realistically large values for the variability of the input parameters.
Under these assumptions, it follows that solutions for
the area/power/performance tradeoff that are optimal in a deterministic
setting, might suffer from excessive variability, potentially
leading to a yield problem.
-
A Tunneling Model for Gate Oxide Failure in Deep Sub-Micron Technology [p. 1404]
-
S. Bernadini, J. Portal, and P. Masson
Parametric failures in CMOS IC nanoelectronics
lead to a serious detection problem. In order to develop new
defect oriented test methods, it is of prime importance to
study the behavior of the transistor affected by this kind
of failure. In this paper, we present a new electrical
transistor model, which allows the impact of gate
oxide thickness drop to be studied. It is shown that the electrical behavior
of the proposed model matches in a satisfactory way the
defective transistor behavior.
-
Power Supply Noise Monitor for Signal Integrity Faults [p. 1406]
-
J. Vázquez and J. de Gyvez
We propose a monitor able to detect on-line
excessive Power Supply Noise (PSN) at the power/ground
lines. It has high resolution (100 ps), enough to collect the
important features of PSN and its output is isolated from
the local PSN. It is useful for any scheme that takes
corrective actions to prevent signal integrity faults after
detection of excessive PSN.
-
Testing of Quantum Dot Cellular Automata Based Designs [p. 1408]
-
M. Tahoori and F. Lombardi
There has been considerable research on quantum-dot cellular
automata as a new computing scheme in the nano-scale
regimes. The basic logic element of this technology is a majority
voter. In this paper, testing of these devices is investigated and
compared with conventional CMOS-based designs. A testing
technique is presented; it requires only a constant number of
test vectors to achieve 100% fault coverage with respect to the
fault list of the original design. A design-for-test scheme is also
presented which results in the generation of a reduced test set.
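The majority voter mentioned above is easy to model. The sketch shows the well-known property that fixing one input to 0 or 1 turns it into an AND or an OR gate; this is general background rather than a result of the paper, and the function names are hypothetical.

```python
def majority(a, b, c):
    """Three-input majority voter: output is 1 when at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def and_gate(a, b):
    """AND realized by tying the third majority input to 0."""
    return majority(a, b, 0)


def or_gate(a, b):
    """OR realized by tying the third majority input to 1."""
    return majority(a, b, 1)
```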
-
Net and Pin Distribution for 3D Package Global Routing [p. 1410]
-
J. Minz, M. Pathak, and S. Lim
In this paper, we study the net and pin distribution
problem for global routing targeting three dimensional
packaging layout via System-on-Package (SOP). The
routing environment for the new emerging mixed-signal
SOP technology is more advanced than that of the
conventional PCB or MCM technology -- pins are located
at all layers of SOP packaging substrate rather than the
top-most layer only. This is the first work to formulate and
solve the multi-layer net and pin distribution for layer,
wirelength, and crosstalk minimization.
-
Placement Using a Localization Probability Model (LPM) [p. 1412]
-
M. Olbrich and E. Barke
We propose a new placement model for global placement.
This model uses probabilities to localize the cells.
It enables arbitrary levels of placement abstraction. Wirelength
estimations at any level can be derived from the
model. We present a new placer that uses a special variant
of the proposed model. Examples show that the model
properties improve placement quality.
-
CMOS Structures Suitable for Secured Hardware [p. 1414]
-
S. Guilley, P. Hoogvorst, Y. Mathieu, R. Pacalet, and J. Provost
Unsecured electronic circuits leak physical syndromes
correlated to the data they handle. Side-channel attacks,
like SPA or DPA, exploit this information leakage. We provide
balanced and memoryless CMOS structures for a 2-input secured NAND gate.
-
Timing Correction and Optimization with Adaptive Delay Sequential Elements [p. 1416]
-
K. Rahimi, S. Bridges, and C. Diorio
This paper introduces Adaptive Delay Sequential
Elements (ADSEs). ADSEs are registers that use
nonvolatile, floating-gate transistors to tune their
internal clock delays. We propose ADSEs for
correcting timing violations and optimizing circuit
performance. We present an ADSE circuit example,
system architecture, and tuning methodology. We
present experimental results that demonstrate the
correct operation of our example circuit and discuss
the die-area impact of using ADSEs. Our experiments
also show that voltage and temperature sensitivity of
ADSEs are comparable to non-adaptive flip-flops.