| |
DATE 2003 ABSTRACTS
Sessions:
[Plenary]
[1A]
[1B]
[1C]
[1F]
[2A]
[2B]
[2C]
[2E]
[2F]
[3A]
[3B]
[3C]
[3F]
[4A]
[4B]
[4C]
[4E]
[4F]
[5A]
[5B]
[5C]
[5E]
[5F]
[6A]
[6B]
[6C]
[6E]
[6F]
[7A]
[7B]
[7C]
[7E]
[7F]
[8A]
[8B]
[8C]
[8D]
[8E]
[8F]
[9A]
[9B]
[9C]
[9D]
[9E]
[9F1]
[9F2]
[10A]
[10B]
[10C]
[10D]
[10E]
[10F]
[Poster]
Moderator: D. Verkest, IMEC, BE
-
IC Design Challenges for Ambient Intelligence [p. 2]
-
E. Aarts and R. Roovers
The vision of Ambient Intelligence opens a world of
unprecedented experiences: the interaction of people with
electronic devices is changed as contextual awareness, natural
interfaces and ubiquitous availability of information are
realized. We analyze the consequences of the ambient
intelligence vision for electronic devices by mapping the
involved technologies on a power-information graph. Based on
the differences in power consumption, three types of devices are
introduced: the autonomous or microWatt-node, the personal or
milliWatt-node and the static or Watt-node. Ambient intelligent
functions are realized by a network of these devices with the
computing, communication and interface electronics realized in
Silicon IC technologies. Three case studies highlight the IC
design challenges involved, and show the variety of problems
that have to be solved.
-
Semiconductor Challenges [p. 8]
-
A. Cuomo
Semiconductors will play a key role to drive the technological evolution in the next 20 years. We already possess
many of the technologies that will deeply change our scenario, among which we can mention nanotechnologies,
bioelectronics, photonics. The central role of Integrated Circuits in the economy will grow stronger and stronger in
future, starting from the convergence between storage, security, video, audio, mobility and connectivity. Systems are
converging and ICs are more and more converging with systems. The fundamental issue is how to translate
knowledge and competence coming from different fields into single architectures. The key factor to win this
challenge is to build the right culture. This means to be able to build an organisation for innovation, with the right
mix of creativity, personal initiative and execution skills.
Organizer/Moderator: M. Lindwer, Philips, NL
Speakers:
R. Zimmermann, European Union, Unit for Microelectronics, BE
D. Marculescu, CMU, US
R. Marculescu, CMU, US
S. Jung, Infineon, DE
E. Cantatore, Philips, NL
T. Basten, Eindhoven U, NL
-
Panel Statement [p. 10]
-
The Ambient Intelligence vision is abstract and as such
not useful for funding decisions, research project definition,
and business plan development. This is in particular the
case for the electronic design community. The European
Commission intends for the EU to achieve world leadership
in Information Societies technologies within ten years.
To that end, it has incorporated the Ambient Intelligence
vision in its Sixth Framework. Microelectronics and nano-and
optical devices are seen as key technologies. Interesting
chip-level challenges are found in, amongst others, explicit
modeling of mobility and self-management, and novel computing
substrates, based on electronic textiles or organic
electronics.
Moderators: V. Narayanan, Penn State U, US; P. Fugger, Infineon, DE
-
Improving the Efficiency of Memory Partitioning by Address Clustering [p. 18]
-
A. Macii, E. Macii, and M. Poncino
Memory partitioning is an effective approach to memory
energy optimization in embedded systems. Spatial locality
of the memory address profile is the key property that
partitioning exploits to determine an efficient multi-bank
memory architecture.
This paper presents an approach, called address clustering,
for increasing the locality of a given memory access profile,
and thus improving the efficiency of partitioning.
Results obtained on several embedded applications running
on an ARM7 core show average energy reductions of 25%
(maximum 57%) w.r.t. a partitioned memory architecture
synthesized without resorting to address clustering.
-
A New Algorithm for Energy-Driven Data Compression in VLIW Embedded Processors [p. 24]
-
A. Macii, E. Macii, F. Crudo, and R. Zafalon
This paper presents a new algorithm for on-the-fly
data compression in high performance VLIW
processors. The algorithm aggressively targets energy
minimization of some of the dominant factors in the
SoC energy budget (i.e., main memory access and high
throughput global bus). Based on a differential
technique, both the new algorithm and the HW
compression unit have been developed to efficiently
manage data compression and decompression into a
high performance industrial processor architecture,
under strict real time constraints (Lx-ST200: A 4issue,
6-stages pipelined VLIW processor with on-chip D and
I-Cache). The original DataCache
line is compressed
before write-back to main memory and, then,
decompressed whenever Cache refill takes place. An
extensive experimental strategy has been developed
for the specific validation of the target Lx processor. In
order to allow public comparison, we also report the
results obtained on a MIPS pipelined RISC processor
simulated with SimpleScalar. The two platforms have
been benchmarked over Ptolemy and MediaBench
programs. Energy savings provided by the application
of the proposed technique range from 10% to 22% on
the Lx-ST200 platform and from 11% to 14% on the
MIPS platform.
Keywords: Data compression algorithms, system-level
energy optimization, VLIW embedded processors.
-
Power Efficiency through Application-Specific Instruction Memory Transformations [p. 30]
-
P. Petrov and A. Orailoglu
The instruction memory communication path constitutes a significant
amount of power consumption in embedded processors. We propose
an encoding technique that exploits application information to reduce
the associated power consumption. The microarchitectural support
enables reprogrammability of the encoding transformations so
as to track code particularities effectively. The restriction to functional
transformations enables effective coding while delivering major
power savings, in the process obviating furthermore the necessity to
rely on dictionary lookup, one of the major shortcomings of prior approaches.
The frugal functional transformation, reliant on a single bit
logic gate, introduces no impact to the critical fetch stage of the processor
pipeline while delivering fully all the theoretically achievable
power savings. The reprogrammable hardware implementation enables
flexible and inexpensive switches between the transformations.
Extensive experimental results on numerical and DSP codes confirm
the theoretically expected magnitude of power savings, evincing reductions
that range up to half of the original transitions.
-
Low Energy Data Management for Different On-Chip Memory Levels in
Multi-Context Reconfigurable Architectures [p. 36]
-
M. Sánchez-Élez, M. Fernández, M. Anido, H. Du, R. Hermida, and N. Bagherzadeh
This paper presents a new technique to improve the
efficiency of data scheduling for multi-context
reconfigurable architectures targeting multimedia and DSP
applications. The main goal is to improve application
energy consumption. Two levels of on-chip data storage are
assumed in the reconfigurable architecture. The Data
Scheduler attempts to optimally exploit this storage, by
deciding in which on-chip memory the data have to be
stored in order to reduce energy consumption. We also
show that a suitable data scheduling could decrease the
energy required to implement the dynamic reconfiguration
of the system.
Organizer/Moderator: S. Kundu, Intel, US
Speakers:
B. Grundmann, Intel Corporation, US
R. Galivanche, Principal Engineer, Intel Corporation, US
-
Panel Statement [p. 44]
-
There are already a huge number of problems for silicon
designers and it is likely to just get worse. Many of these
problems are technical associated with shrinking
geometries and increasing architecture complexities, but
there are a significant number that seem to be caused by
procedurally related mistakes and issues. Many of the
technical problems are solved and re-solved on a piecemeal
basis, focusing on local optimizations of small
design-space problems. Unfortunately, many of these
local solutions really create a less apparent but larger
inefficiency in the whole design flow. The reason for this
is that a few ever look at the whole design methodology,
especially as it applies to large design teams. As a
consequence, this lack of oversight for the whole
methodology is causing project procedural problems and
inefficiencies.
Moderators: F. Johannes, TU Munich, DE; C. Sechen, Washington U, US
-
Global Wire Bus Configuration with Minimum Delay Uncertainty [p. 50]
-
L. Huang, H. Chen, and D. Wong
The gap between the advances and the utilization of the deep submicron
(DSM) technology is increasing as the new generation of technology is
introduced faster than ever. Signal integrity is one of the most important
issues in overcoming this gap. With the increasing coupling capacitance
between the high aspect ratio wires, the delay uncertainty is unpredictable
in the current design flow. We present an algorithm to generate the global
wire bus configuration with minumum delay uncertainty under timing constraints.
The timing window information from the timing budget (or specified in IPs)
is integrated with the modern accurate crosstalk noise models in the
proposed algorithm. HSPICE simulations show that the algorithm is very
effective and efficient when compared to the buffer insertion scheme with
minumom delay. The standard deviation of the delay obtained from the
Monte-Carlo simulation is improved up to 73%. This global wire bus
configuration can be adopted in early wire planning to improve the timing
closure problem and increase the accuracy of the timing budget.
-
Timing Verification with Crosstalk for Transparently Latched Circuits [p. 56]
-
H. Zhou
Delay variation due to crosstalk has made timing analysis
a hard problem. In sequential circuits with transparent
latches, crosstalk makes the timing verification (also known
as clock schedule verification) even harder. In this paper,
we point out a false negative problem in current timing verification
techniques and propose a new approach based on
switching windows. In this approach, coupling delay calculations
are combined naturally with latch timing iterations. A
novel algorithm is given for timing verification with crosstalk
in transparently latched circuits and primitive experiments
show promising results.
Categories & Subject Descriptors
B.7.2 Design Aids: Verification
General Terms
Algorithms, Design, Verification
Keywords
Timing, Clock, Verification, Coupling, Delay
-
Statistical Timing Analysis Using Bounds [p. 62]
-
A. Agarwal, D. Blaauw, V. Zolotov, and S. Vrudhula
The growing impact of within-die process variation has created
the need for statistical timing analysis, where gate delays are modeled
as random variables. Statistical timing analysis has traditionally
suffered from exponential run time complexity with circuit size, due
to the dependencies created by reconverging paths in the circuit. In
this paper, we propose a new approach to statistical timing analysis
which uses statistical bounds. First, we provide a formal definition of
the statistical delay of a circuit and derive a statistical timing analysis
method from this definition. Since this method for finding the
exact statistical delay has exponential run time complexity with circuit
size, we also propose a new method for computing statistical
bounds which has linear run time complexity. We prove the correctness
of the proposed bounds. Since we provide both a lower and
upper bound on the true statistical delay, we can determine the quality
of the bounds. The proposed methods were implemented and
tested on benchmark circuits. The results demonstrate that the proposed
bounds have only a small error.
-
Reduced Delay Uncertainty in High Performance Clock Distribution Networks [p. 68]
-
D. Velenis, E. Friedman, and M. Papaefthymiou
The design of clock distribution networks in synchronous
digital systems presents enormous challenges. Controlling
the clock signal delay in the presence of various noise
sources, process parameter variations, and environmental
effects represents a fundamental problem in the design of
high speed synchronous circuits. A polynomial time algorithm
that improves the tolerance of a clock distribution network
to process and environmental variations is presented
in this paper. The algorithm generates a clock tree topology
that minimizes the uncertainty of the clock signal delay
to the most critical data paths. Strategies for enhancing
the physical layout of the clock tree to decrease delay uncertainty
are also presented. Application of the methodology on
benchmark circuits demonstrates clock tree topologies with
decreased delay uncertainties of up to 90%. Techniques to
enhance a clock tree layout have been applied on a set of
benchmark circuits, yielding a reduction in delay uncertainty
of up to 48%.
Organizer/Moderator: T. Basten, TU Eindhoven, NL
Speakers:
A. Chandrakasan, MIT, US
M. Lindwer, Philips, NL
J. Liu, PARC, US
L. Benini, DEIS, Bologna U, IT
R. Min, MIT, US
F. Zhao, Palo Alto Research Center (PARC), US
-
Panel Statement [p. 76]
-
Envision the situation that high quality information and entertainment
is easily accessible to anyone, anywhere, at any time,
and on any device. How realistic is this vision? And what does it
require from the underlying technology?
Ambient Intelligence (AmI) integrates concepts ranging from
ubiquitous computing to autonomous and intelligent systems. An
AmI environment will be highly dynamic in many aspects. Underlying
technology must be very flexible to cope with this dynamism.
Scalability of technology is only one crucial aspect. This paper
explores scalability from the processing, the communication, and
the software perspectives.
keywords: multi-processor systems on chip, high-density wireless
networks, scalable software infrastructure, scalable algorithms,
low power and energy
Moderators: R. Zafalon, STMicroelectronics, IT; D. Marculescu, CMU, US
-
Masking the Energy Behavior of DES Encryption [p. 84]
-
H. Saputra, S. Vijaykrishnan, M. Kandemir, M. Irwin, R. Brooks, S. Kim, and W. Zhang
Smart cards are vulnerable to both invasive and non-invasive
attacks. Specifically, non-invasive attacks using power and timing
measurements to extract the cryptographic key has drawn a lot of
negative publicity for smart card usage. The power measurement
techniques rely on the data-dependent energy behavior of the
underlying system. Further, power analysis can be used to identify
the specific portions of the program being executed to induce timing
glitches that may in turn help to bypass key checking. Thus, it
is important to mask the energy consumption when executing the
encryption algorithms.
In this work, we augment the instruction set architecture of a
simple five-stage pipelined smart card processor with secure
instructions to mask the energy differences due to key-related data-dependent
computations in DES encryption. The secure versions
operate on the normal and complementary versions of the operands
simultaneously to mask the energy variations due to value
dependent operations. However, this incurs the penalty of increased
overall energy consumption in the data-path components.
Consequently, we employ secure versions of instructions only for
critical operations; that is we use secure instructions selectively, as
directed by an optimizing compiler. Using a cycle-accurate energy
simulator, we demonstrate the effectiveness of this enhancement.
Our approach achieves the energy masking of critical operations
consuming 83% less energy as compared to existing approaches
employing dual rail circuits.
-
Scheduling and Mapping of Conditional Task Graphs for the Synthesis of Low Power Embedded Systems [p. 90]
-
D. Wu, B. Al-Hashimi, and P. Eles
This paper describes a new Dynamic Voltage Scaling
(DVS) technique for embedded systems expressed as
Conditional Task Graphs (CTGs). The idea is to identify
and exploit the available worst case slack time, taking into
account the conditional behaviour of CTGs. Also we
examine the effect of combining a genetic algorithm based
mapping with the DVS technique for CTGs and show that
further energy reduction can be obtained. The techniques
have been tested on a number of CTGs including a real life
example. The results show that the DVS technique can
be applied to CTGs with energy saving up to 24%.
Furthermore it is shown that savings of up to 51% are
achieved by considering DVS during the mapping.
-
Synthesis of Application-Specific Highly Efficient Multi-Mode Systems for Low-Power Applications [p. 96]
-
L. Chiou, S. Bhunia, and K. Roy
We present a novel design methodology for synthesizing
multiple configurations (or modes) into a single programmable
system. Many DSP and multimedia applications
require reconfigurability of a system along with efficiency
in terms of power, performance and area. FPGAs
provide a reconfigurable platform, however, they are slower
in speed with significantly higher power consumption than
achievable by a customized ASIC. In this work, we have
developed techniques to realize an efficient reconfigurable
system for a set of user-specified configurations. A data
flow graph transformation method coupled with efficient
scheduling and allocation are used to automatically synthesize
the system from its behavioral level specifications. Experimental
results on several applications demonstrate that
we can achieve about 60X power reduction on average with
about 4X improvement in performance over corresponding
FPGA implementations.
Moderators: M. Flottes, LIRMM, FR; P. Prinetto, Politecnico di Torino, IT
-
Virtual Compression through Test Vector Stitching for Scan Based Design [p. 104]
-
W. Rao and A. Orailoglu
We propose a technique for compressing test vectors. The technique
reduces test application time and tester memory requirements
by utilizing part of the predecessor response in constructing the
subsequent test vector. An algorithm is provided for stitching test
vectors that retains full fault coverage while appreciably reducing
time and tester requirements. The analysis provided enables significant
compression ratios, while necessitating no hardware outlay
whatsoever, making the technique we propose particularly suitable
for SOC testing. The test time benefits necessitate no MISR
utilization, ensuring no consequent aliasing loss. We examine a
number of implementation considerations for the new compression
technique and we provide experimental data that can be used to
guide an eventual commercial implementation. Experimental data
confirms the significant test application time and tester memory reductions.
-
Test Pattern Compression Using Prelude Vectors in Fan-Out Scan Chain with Feedback Architecture [p. 110]
-
N. Oh, R. Kapur, T. Williams, and J. Sproch
This paper proposes a new test compression technique
that employs Fan-out SCAN chain with Feedback
(FSCANF) architecture. It allows us to use prelude
vectors to resolve dependencies created by fanning out
multiple scan chains from a single scan-in pin. This paper
describes the new proposed architecture as well as the
algorithm that generates compressed test vectors using
vertex coloring algorithm. The distribution of specified
bits in each test pattern determines the compression ratio
of the individual test pattern. Therefore, our technique
optimizes the overall compression ratio and shows higher
reduction in test data and application time than previous
techniques, which use the extreme case of serializing all
the scan chains in the presence of conflicts across the fanout
scan chains. The FSCANF architecture has small
hardware overhead and is independent of scan cell orders
in the scan chains. Experimental results show that our
technique significantly reduces both the test data volume
and test application time in six of the largest ISCAS 89
sequential benchmark circuits compared to the previous
techniques.
-
A Technique for High Ratio LZW Compression [p. 116]
-
M. Knieser, D. Weyer, R. McIntyre, F. Wolff, and C. Papachristou
Reduction of both the test suite size and the download time
of test vectors is important in today's System-On-a-Chip
designs. In this paper, a method for compressing the scan
test patterns using the LZW algorithm is presented. This
method leverages the large number of "Don't-Cares" in
test vectors in order to improve the compression ratio
significantly. The hardware decompression architecture
presented here uses existing on-chip embedded memories.
Tests using the ISCAS89 and the ITC99 benchmarks show
that this method achieves high compression ratios.
-
Fast Computation of Data Correlation Using BDDs [p. 122]
-
Z. Zeng, Q. Zhang, I. Harris, and M. Ciesielski
Data correlation is a well-known problem that causes
difficulty in VLSI testing. Based on a correlation metric,
an efficient heuristic to select BIST registers has been
proposed in the previous work. However, the computation
of data correlation itself was a computational intensive
process and became a bottleneck in the previous
work. This paper presents an efficient technique to compute
data correlation using Binary Decision Diagrams
(BDDs). Once a BDD is built, our algorithms take linear
time to compute the corresponding data correlation. The
experimental results show that this technique is much
faster than the previous technique based on simulation.
It enables testing approaches based on data correlation to
handle more practical designs. As one of the successful
applications, partial scan is demonstrated by integrating
our computation results.
Moderators: S. Yoo, TIMA Laboratory, FR; M. Coppola, STMicroelectronics, IT
-
RTOS Modeling for System Level Design [p. 130]
-
A. Gerstlauer, H. Yu, and D. Gajski
System level synthesis is widely seen as the solution for
closing the productivity gap in system design. High level
system models are used in system level design for early design
exploration. While real time operating systems (RTOS)
are an increasingly important component in system design,
specific RTOS implementations can not be used directly in
high level models. On the other hand, existing system level
design languages (SLDL) lack support for RTOS modeling.
In this paper we propose a RTOS model built on top of existing
SLDLs which, by providing the key features typically
available in any RTOS, allows the designer to model the dynamic
behavior of multi-tasking systems at higher abstraction
levels to be incorporated into existing design flows. Experimental
result shows that our RTOS model is easy to use
and efficient while being able to provide accurate results.
-
Modeling and Integration of Peripheral Devices in Embedded Systems [p. 136]
-
S. Wang, S. Malik, and R. Bergamaschi
This paper describes automation methods for device driver
development in IP-based embedded systems in order to achieve
high reliability, productivity, reusability and fast time to market.
We formally specify device behaviors using event driven finite
state machines, communication channels, declaratively described
rules, constraints and synthesis patterns. A driver is synthesized
from this specification for a virtual environment that is platform
(processor, operating system and other hardware) independent.
The virtual environment is mapped to a specific platform to
complete the driver implementation. The illustrative application of
our approach for a USB device driver in Linux demonstrates
improved productivity and reusability.
-
Systematic Embedded Software Generation from SystemC [p. 142]
-
F. Herrera, H. Posadas, P. Sánchez, and E. Villar
The embedded software design cost represents an
important percentage of the embedded-system
development costs [1]. This paper presents a method for
systematic embedded software generation that reduces the
software generation cost in a platform-based HW/SW
codesign methodology for embedded systems based on
SystemC. The goal is that the same SystemC code allows
system-level specification and verification, and, after
SW/HW partition, SW/HW co-simulation and embedded
software generation. The C++ code for the SW partition
(processes and process communication including HW/SW
interfaces) is systematically generated including the user-selected
embedded OS (e.g.: the eCos open source OS).
Moderators: G. Vandersteen, IMEC, BE; H. Graeb, TU Munich, DE
-
Noise Macromodel for Radio Frequency Integrated Circuits [p. 150]
-
Y. Xu, X. Li, P. Li, and L. Pileggi
Noise performance is a critical analog and RF
circuit design constraint, and can impact the selection of
the IC system-level architecture. It is therefore imperative
that some model of the noise is represented at the highest
levels of abstraction during the design process. In this
paper we propose a noise macromodel for analog circuits
and demonstrate it by way of implementation in a system
level simulator based on MATLAB. We also explain our
process of macromodel extraction via reformulation of
frequency-domain noise analysis results, and the
corresponding steps of model order reduction. The results
demonstrate the efficacy of this macromodel for frequency
domain system level simulation.
-
Approximation Approach for Timing Jitter Characterization in Circuit Simulators [p. 156]
-
M. Gourary, S. Rusakov, S. Ulyanov, M. Zharov, K. Gullapalli, and B. Mulvaney
A new computational concept of timing jitter is
proposed that is suitable for exploitation in circuit
simulators. It is based on the approximation of computed
noise characteristicin an arbitrary language. It is
explained how different models of the building blocks can
easily be included in the model. To show the validity of the
model, some simulation results with an implementation in
the MATLABTM programming language are presented.
-
Behavioural Modelling and Simulation of Delta-Sigma Modulators Using Hardware Description Languages [p. 168]
-
R. Castro-López, F. Fernández, F. Medeiro, and A. Rodríguez-Vázquez
Behavioural simulation is the common alternative to the
costly electrical simulation of Delta-Sigma modulators
(Delta-SigmaMs). This
paper explores the behavioural modelling and simulation
of Delta-SigmaMs by using hardware description languages (HDLs)
and commercial behavioural simulators,as an alternative
to the common special-purpose behavioural simulators. A
library of building blocks, where a HDL has been used to
model a complete set of circuit non-idealities influencing
the performance of Delta-SigmaMs,is introduced. Three alternatives
for introducing Delta-SigmaM topologies have been implemented.
Experimental results of the simulation of a fourth-order 2-1-1
cascade multi-bit Delta-SigmaM are given.
Organizer/Moderator: A. Raghunathan, NEC, US
Speakers:
S. Ravi, NEC, US
S. Hattangady, Texas Instruments, US
J.-J. Quisquater, UC de Louvain, BE
-
Panel Statement [p. 176]
-
As intelligent electronic systems pervade all aspects of our lives, capturing,
storing, and communicating a wide range of sensitive and personal data, security
is emerging as a critical concern that must be addressed in order to
enable several current and future applications. Mobile appliances, which
will play a critical role in enabling the visions of ubiquitous computing and
communications, and ambient intelligence, are perhaps the most challenging
to secure -- they often rely on a public medium for (wireless) communications,
are easily lost or stolen due to their small form factors and mobility,
and are highly constrained in cost and size, as well as computing and battery
resources.
This paper presents an introduction to security concerns in mobile appliances,
and translates them into challenges that confront system architects,
HW engineers, and SW developers, including how to bridge the processing
and battery gaps, efficient tamper-proofing of devices, content protection,
etc. Recent innovations and emerging commercial technologies that address
these issues are also highlighted. We envision that, for a large class of embedded
systems, security considerations will pervade all aspects of system
design, driving innovations in system architecture, software, circuits, and
design methodologies.
Moderators: V. Mooney, Georgia IT, US; A. Jantsch, Royal IT, SE
-
Schedulability Analysis and Optimization for the Synthesis of Multi-Cluster
Distributed Embedded Systems [p. 184]
-
P. Pop, P. Eles, and Z. Peng
We present an approach to schedulability analysis for the synthesis
of multi-cluster distributed embedded systems consisting of time-triggered
and event-triggered clusters, interconnected via gateways.
We have also proposed a buffer size and worst case queuing delay
analysis for the gateways, responsible for routing inter-cluster traffic.
Optimization heuristics for the priority assignment and synthesis
of bus access parameters aimed at producing a schedulable system
with minimal buffer needs have been proposed. Extensive experiments
and a real-life example show the efficiency of our approaches.
-
A General Framework for Analysing System Properties in Platform-Based Embedded System Designs [p. 190]
-
S. Chakraborty, S. Kuenzli, and L. Thiele
We present a framework (Real-Time Calculus) for
analysing various system properties pertaining to timing
analysis, loads on various components and on-chip buffer
memory requirements of heterogeneous platform-based architectures,
in a single coherent way. Many previous analysis
techniques from the real-time systems domain, which are
based on standard event models, turn out to be special cases
of our framework. We illustrate this using various realistic
examples.
-
Exact High Level WCET Analysis of Synchronous Programs by Symbolic State Space Exploration [p. 196]
-
G. Logothetis and K. Schneider
In this paper, a novel approach to high-level (i.e. architecture
independent) worst case execution time (WCET) analysis
is presented that automatically computes exact bounds
for all inputs. To this end, we make use of the distinction
between micro and macro steps as usually done by
synchronous languages. As macro steps must not contain
loops, a later low-level WCET analysis (architecture dependent)
is simplified to a large extent.
Checking exact execution times for all inputs is a complex
task that can nevertheless be efficiently done when implicit
state space representations are used. With our tools, it
is not only possible to compute path information by exploring
all computations, but also to verify given path information.
-
Rapid Prototyping of Flexible Embedded Systems on Multi-DSP Architectures [p. 204]
-
B. Rinner, M. Schmid, and R. Weiss
The functionality of a typical embedded system is specified
once at design time and cannot be altered later during
the whole mission period. There are, however, a number
of important application domains that ask for both flexibility
and availability. In such a flexible embedded system the
functionality can be modified while the application is running.
This paper presents a rapid prototyping environment for
flexible embedded systems on multi-DSP architectures. This
prototyping environment automatically maps and schedules
an application onto a multi-DSP architecture and introduces
a special, lightweight reconfiguration environment
onto the target platform. A running multi-DSP application
can, therefore, be modified by reconfiguring software tasks.
By using our prototyping environment the modified application
can be tested, simulated and emulated prior to the
implementation on the target.
keywords: embedded system; multi-DSP architectures;
task reconfiguration; testability; rapid prototyping
Moderators: A. Benso, Politecnico di Torino, IT; I.G. Harris, Massachusetts U, Amherst, US
-
DFT for Testing High-Performance Pipelined Circuits with Slow-Speed Testers [p. 212]
-
M. Nummer and M. Sachdev
This paper presents a methodology for testing high-performance
pipelined circuits with slow-speed testers. The
technique uses a clock timing circuit to control data transfer
in the pipeline in test mode. A clock timing circuit capable of
achieving a timing resolution of 50ps in 0.18um CMOS technology
is presented. The design provides the ability to test the clock
timing circuit itself.
Keywords: Delay-fault testing, high-performance testing,
design for testability, design for delay testability.
-
Extending JTAG for Testing Signal Integrity in SoCs [p. 218]
-
N. Ahmed, M. Tehranipour, and M. Nourani
As the technology is shrinking and the working frequency is going into
multi gigahertz range, the issues related to interconnect testing are
becoming more dominant. Specifically, signal integrity loss issues are
becoming more important and detection and diagnosis of these losses
are becoming a great challenge. In this paper, an enhanced boundary
scan architecture with slight modification in the boundary scan cells
is proposed to test signal integrity in SoC interconnects. Our extended
JTAG architecture: 1) minimizes scan-in operation by using modified
boundary scan cells in pattern generation; and 2) incorporates the
integrity loss information within the modified observation cells. To
fully comply with JTAG standard, we propose two new instructions,
one for pattern generation and the other for scanning out the captured
signal integrity information.
-
EBIST: A Novel Test Generator with Built-In Fault Detection Capability [p. 224]
-
D. Pradhan, C. Liu, and K. Chakraborty
A novel design methodology for test pattern
generation in BIST is presented. Here faults and errors
in the generator itself are detected. Two different design
methodologies are presented. The first one guarantees all
single fault/error detection and the second methodology is
capable of detecting multiple faults and errors. Furthermore
the proposed LFSRs do not have additional hardware overhead.
Also importantly the test patterns generated have the
potential to achieve superior fault coverage.
-
A Partition-Based Approach for Identifying Failing Scan Cells in Scan-BIST with
Applications to System-On-Chip Fault Diagnosis [p. 230]
-
C. Liu and K. Chakrabarty
We present a new partition-based fault diagnosis
technique for identifying failing scan cells in a scan-BIST
environment. This approach relies on a two-step scan
chain partitioning scheme. In the first step, an interval-based
partitioning scheme is used to generate a small
number of partitions, where each element of a partition
consists of a set of scan cells. In the second step,
additional partitions are created using an earlier-proposed
random-selection partitioning method. Two-step
partitioning leads to higher diagnostic resolution than a
scheme that relies only on random-selection partitioning,
with only a small amount of additional hardware. The
proposed scheme is especially suitable for a system-on-chip
(SOC) composed of multiple embedded cores, where
test access is provided by means of a TestRail that is
threaded through the internal scan chains of the
embedded cores. We present experimental results for the
six largest ISCAS-89 benchmark circuits and for two
SOCs crafted from some of the ISCAS-89 circuits.
Moderators: F.V. Fernandez, IMSE CNM, ES; R. Schwencker, Infineon, DE
-
Time-Varying, Frequency-Domain Modeling and Analysis of Phase-Locked Loops with
Sampling Phase-Frequency Detectors [p. 238]
-
P. Vanassche, G. Gielen, and W. Sansen
This paper presents a new, frequency-domain based
method for modeling and analysis of phase-locked loop
(PLL) small-signal behavior, including time-varying aspects.
Focus is given to PLLs with sampling phase-frequency
detectors (PFDs) which compute the phase error
only once per period of the reference signal. Using
the harmonic transfer matrix (HTM) formalism, the well-known
continuous-time, linear time-invariant (LTI) approximations
are extended to take the impact of time-varying
behavior, arising from the sampling nature of the PFD, into
account. Especially for PLLs with a fast feedback loop, this
time-varying behavior has severe impact on, for example,
loop stability and cannot be neglected. Contrary to LTI
analysis, our method is able to predict and quantify these
difficulties. The method is verified for a typical loop design.
-
A New Simulation Technique for Periodic Small-Signal Analysis [p. 244]
-
M. Gourary, S. Rusakov, S. Ulyanov, M. Zharov, and J. Mulvaney
A new numerical technique for periodic small signal
analysis based on harmonic balance method is proposed.
Special-purpose numerical procedures based on Krylov
subspace methods are developed that reduce the
computational efforts of solving linear problems under
frequency sweeping. Examples are given to show the
efficiency of the new algorithm for computing small signal
characteristics for typical RF circuits.
-
Generalized Posynomial Performance Modeling [p. 250]
-
T. Eeckelaert, W. Daems, G. Gielen, and W. Sansen
This paper presents a new method to automatically generate
posynomial symbolic expressions for the performance
characteristics of analog integrated circuits. The coefficient
set as well as the exponent set of the posynomial expression
are determined based on SPICE simulation data
with device-level accuracy. We will prove that this problem
corresponds to solving a non-convex optimization problem
without local minima. The presented method is capable
of generating posynomial performance expressions for
both linear and nonlinear circuits and circuit characteristics.
This approach allows to automatically generate an
accurate sizing model that composes a geometric program
that fully describes the analog circuit sizing problem. The
automatic generation avoids the time-consuming nature of
hand-crafted analytic model generation. Experimental results
illustrate the capabilities and effectiveness of the presented
modeling technique.
-
Holmes: Capturing the Yield-Optimized Design Space Boundaries of Analog and RF Integrated Circuits [p. 256]
-
B. De Smedt and G. Gielen
A novel methodology is presented to structured yield-aware
synthesis. The trade-off between yield and
the unspecified performances is explored along the
design space boundaries, while respecting specifications
on the other performances. Through the unique
combination of multi-objective evolutionary optimization
techniques, multi-variate regression modeling and
sensitivity-based yield estimation, the designer is given
access to this trade-off, all within transistor-level accuracy.
Even more, a large reduction in required computer
resources is obtained compared to alternative approaches.
Moderators: P.Y.K. Cheung, Imperial College London, UK; J. Teich, Erlangen-Nuremberg U, DE
-
High-Level Allocation to Minimize Internal Hardware Wastage [p. 264]
-
M. Molina, J. Mendías, and R. Hermida
Conventional synthesis algorithms perform the
allocation of heterogeneous specifications, those formed
by operations of different types and widths, by binding
operations to functional units of their same type and
width. Thus, in most of the implementations obtained some
hardware waste appears. This paper proposes an
allocation algorithm able to minimize this hardware waste
by fragmenting operations into their common operative
kernel, which then may be executed over the same
functional units. Hence, fragmented operations are
executed over sets of several linked hardware resources.
The implementations proposed by our algorithm need
considerably smaller area than the ones proposed by
conventional allocation algorithms. And due to operation
fragmentation, in the datapaths produced the type,
number, and width of the hardware resources are
independent of the type, number, and width of the
specification operations and variables.
-
Dynamic Conditional Branch Balancing during the High-Level Synthesis of Control-Intensive Designs [p. 270]
-
S. Gupta, N. Dutt, R. Gupta, and A. Nicolau
We present two novel strategies to increase the scope
for application of speculative code motions: (1) Adding
scheduling steps dynamically during scheduling to conditional
branches with fewer scheduling steps. This increases
the opportunities to apply code motions such as
conditional speculation that duplicate operations into
the branches of a conditional block. (2) Determining
if an operation can be conditionally speculated into
multiple basic blocks either by using existing idle resources
or by creating new scheduling steps. These
strategies lead to balancing of the number of steps in
the conditional branches without increasing the longest
path through the conditional block. Algorithms for
these strategies have been implemented within the Spark
high-level synthesis framework that accepts a behavioral
description in ANSI-C as input and produces synthesizable
register-transfer level VHDL. Experiments
on two moderately complex industrial-strength applications,
namely, MPEG-1 and the GIMP image processing tool, demonstrate
that conditional speculation
is ineffective without using these strategies.
-
Distributed Synchronous Control Units for Dataflow Graphs under Allocation of
Telescopic Arithmetic Units [p. 276]
-
E. Kim, H. Saito, H. Nakamura, T. Nanya, J. Lee, and D. Lee
In order to enjoy performance improvement effects of
variable computation time arithmetic units in a system level,
we propose a new synchronous control unit design methodology
for dataflow graphs under allocation of a telescopic
arithmetic unit which is one of well-known synchronous
variable computation time arithmetic units. The proposed
method generates an independent synchronous controller
for each component arithmetic unit, and builds a distributed
synchronous control unit through integrating derived controllers.
The distributed structure of a final synchronous
control unit maximizes performance improvement effect of
telescopic arithmetic units through a complete preservation
of original concurrency among operations.
-
Automated Bus Generation for Multiprocessor SoC Design [p. 282]
-
K. Ryu and V. Mooney III
The performance of a system, especially a
multiprocessor system, heavily depends upon the efficiency
of its bus architecture. This paper presents a methodology
to generate a custom bus system for a multiprocessor
System-on-a-Chip (SoC). Our bus synthesis tool (BusSyn)
uses this methodology to generate five different bus
systems as examples: Bi-FIFO Bus Architecture (BFBA),
Global Bus Architecture Version I (GBAVI), Global Bus
Architecture Version III (GBAVIII), Hybrid bus
architecture (Hybrid) and Split Bus Architecture (SplitBA).
We verify and evaluate the performance of each bus system
in the context of two applications: an Orthogonal
Frequency Division Multiplexing (OFDM) wireless
transmitter and an MPEG2 decoder. This methodology
gives the designer a great benefit in fast design space
exploration of bus architectures across a variety of
performance impacting factors such as bus types,
processor types and software programming style. In this
paper, we show that BusSyn can generate buses that
achieve superior performance when compared to a simple
General Global Bus Architecture (GGBA) (e.g., 16.44%
performance improvement in the case of OFDM
transmitter) or when compared to the CoreConnect Bus
Architecture (CCBA) (e.g., 15.54% performance
improvement in the case of MPEG2 decoder). In addition,
the bus architecture generated by BusSyn is designed in a
matter of seconds instead of weeks for the hand design of a
custom bus system.
Moderators: A. Koch, TU Braunschweig, DE; G. Koch, Bridges2Silicon, DE
-
Online Scheduling for Block-Partitioned Reconfigurable Devices [p. 290]
-
H. Walder and M. Platzner
This paper presents our work toward an operating system
that manages the resources of a reconfigurable device
in a multitasking manner. We propose an online scheduling
system that allocates tasks to a block-partitioned reconfigurable
device. The blocks are statically-fixed but can have
different widths, which allows to match the computational
resources with the task requirements. We implement several
non-preemptive and preemptive schedulers as well as
different placement strategies. Finally, we present a simulation
environment that allows to experimentally investigate
the effects of specific partitioning, placement, and scheduling
methods.
-
Exploiting Loop-Level Parallelism on Coarse-Grained Reconfigurable Architectures
Using Modulo Scheduling [p. 296]
-
B. Mei, H. De Man, R. Lauwereins, S. Vernalde, and D. Verkest
Coarse-grained reconfigurable architectures have become
increasingly important in recent years. Automatic
design or compilation tools are essential to their success.
In this paper, we present a modulo scheduling algorithm
to exploit loop-level parallelism for coarse-grained reconfigurable
architectures. This algorithm is a key part of
our Dynamically Reconfigurable Embedded Systems Compiler
(DRESC). It is capable of solving placement, scheduling
and routing of operations simultaneously in a modulo-constrained
3D space and uses an abstract architecture representation
to model a wide class of coarse-grained architectures.
The experimental results show high performance
and efficient resource utilization on tested kernels.
-
Virtual Hardware Byte Code as a Design Platform for Reconfigurable Embedded Systems [p. 302]
-
S. Lange and U. Kebschull
Reconfigurable hardware will be used in many future embedded
applications. Since most of these embedded systems
will be temporarily or permanently connected to a network,
the possibility to reload parts of the application at run time
arises. In the 90ies it was recognized, that the huge variety
of processors would lead to a tremendous amount of binaries
for the same piece of software. For the hardware parts
of an embedded system, the situation today is even worse.
The java approach based on a java virtual machine (JVM)
was invented to solve the problem for software. In this paper,
we show how the hardware parts of an embedded system
can be implemented in a hardware byte code, which can
be interpreted using a virtual hardware machine running
on an arbitrary FPGA. Our results show that this approach
is feasible and that it leads to fast, portable and reconfigurable
designs, which run on any programmable target architecture.
Moderators: H. Obermeir, Infineon, DE; N. Nicolici, McMaster U, CA
-
A Method of Test Generation for Path Delay Faults Using Stuck-At Fault Test Generation Algorithms [p. 310]
-
S. Ohtake, K. Ohtani, and H. Fujiwara
In this paper, we propose a test generation method for
non-robust path delay faults using stuck-at fault test generation
algorithms. In our method, we first transform an
original combinational circuit into a circuit called a partial
leaf-dag using path-leaf transformation. Then we generate
test patterns using a stuck-at fault test generation algorithm
for stuck-at faults in the partial leaf-dag. Finally we transform
the test patterns into two-pattern tests for path delay
faults in the original circuit. We prove the correctness of the
approach and experimental results on several benchmark
circuits show the effectiveness of it.
-
A Novel, Low-Cost Algorithm for Sequentially Untestable Fault Identification [p. 316]
-
M. Syal and M. Hsiao
This paper presents a new and low-cost approach for
identifying sequentially untestable faults. Unlike the single
fault theorem, where the stuck-at fault is injected only in the
right-most time frame of the k-frame unrolled circuit, our
approach can handle fault injection in any time frame
within the unrolled sequential circuit. To efficiently apply
our concept to untestable fault identification, powerful
sequential implications are used to efficiently extend the
unobservability propagation of gates in multiple time
frames. Application of the proposed theorem to ISCAS '89
sequential benchmark circuits showed that more untestable
faults could be identified using our approach, at practically
no overhead in both memory and execution time.
-
Non-Enumerative Path Delay Fault Diagnosis [p. 322]
-
S. Padmanaban and S. Tragoudas
The first non-enumerative framework for diagnosing path
delay faults using zero suppressed binary decision diagrams
is introduced. We show that fault free path delay faults with
a validated non-robust test may together with fault free robustly
tested faults be used to eliminate faults from the set
of suspected faults. All operations are implemented by an
implicit diagnosis tool based on the zero suppressed binary
decision diagram. The proposed method is space and time
non-enumerative as opposed to existing methods which are
space and time enumerative. Experimental results on the
ISCAS'85 benchmarks show that the proposed technique is
on an average least three times more efficient to improve
the diagnostic resolution than existing techniques.
-
Delay Defect Diagnosis Based Upon Statistical Timing Models -- The First Step [p. 328]
-
A. Krstic, L. Wang, K. Cheng, J. Liou, and M. Abadir
This paper defines a new diagnosis problem for diagnosing
delay defects based upon statistical timing models. We
illustrate the differences between the delay defect diagnosis and
traditional logic defect diagnosis. We propose different diagnosis
algorithms, and evaluate their performance via statistical defect
injection and statistical delay fault simulation. With a statistical
timing analysis framework developed in the past, we demonstrate
the new concepts in delay defect diagnosis, and discuss experimental
results based upon benchmark circuits.
Organizer: A. Jerraya, TIMA Laboratory, FR
Moderator: J. Madsen, TU Denmark, DK
-
Introduction to Hardware Abstraction Layers for SoC [p. 336]
-
S. Yoo and A. Jerraya
In this paper, hardware abstraction layer is explained in
the context of SoC design. First, HAL definition is given
and the difference between HAL and other similar
concepts are given. Existing HALs are examined. The role
of HAL is explained for SoC design. Finally, a proposal of
standard HAL is presented.
-
Hardware/Software Partitioning of Operating Systems [p. 338]
-
V. Mooney III
Traditionally, an Operating System (OS) implements in software basic
system functions such as task/process management and I/O. Furthermore, a
Real-Time Operating Systems (RTOS) has also been implemented in software
to manage tasks in a predictable, real-time manner. However, with
System-on-a-Chip (SoC) architecture similar to Figure becoming more and
more common, OS and RTOS functionality need not be implemented solely
in software. Thus, partitioning the interface between hardware and
software for an OS is a new idea that can have a significant impact.
-
Embedded Software in Digital AM-FM Chipset [p. 340]
-
M. Sarlotte, B. Candaele, J. Quevremont, and D. Merel
The new standard DRM for digital radio broadcast in AM
band requires integrated devices for radio receivers at
low cost and very low power consumption. A chipset is
currently designed based upon an ARM9 multi-cores
architecture. This paper introduces the application itself,
the HW architecture of the SoC and the SW architecture
which includes physical layer, receiver management, the
application layer and the global scheduler based on a
real-time OS. Then, the paper presents the HW/SW
partitioning and SW breakdown between the various
processing cores. The methodology used in the project to
develop, to validate and to integrate the SW covering
various methods such as simulation, emulation and covalidation
is described. Key points and critical issues are
also addressed.
One of the challenge is to integrate the whole
receiver in the mono-chip with respect to the real-time
constraints linked to the audio services.
Moderators: P. Ienne, EPF Lausanne, CH; E. Verhulst, Eonic Solutions GmbH, DE
-
Packetized On-Chip Interconnect Communication Analysis for MPSoC [p. 344]
-
T. Ye, G. De Micheli, and L. Benini
Interconnect networks play a critical role in shared memory multiprocessor
systems-on-chip (MPSoC) designs. MPSoC performance
and power consumption are greatly affected by the packet dataflows
that are transported on the network. In this paper, by introducing
a packetized on-chip communication power model, we discuss the
packetization impact on MPSoC performance and power consumption.
Particularly, we propose a quantitative analysis method to
evaluate the relationship between different design options (cache,
memory, packetization scheme, etc.) at the architectural level. From
the benchmark experiments, we show that optimal performance
and power tradeoff can be achieved by the selection of appropriate
packet sizes.
-
Trade Offs in the Design of a Router with both Guaranteed and Best-Effort Services for
Networks On Chip [p. 350]
-
E. Rijpkema, K. Goossens, A. Radulescu, J. Dielissen,
J. van Meerbergen, P. Wielage, and E. Waterlander
Managing the complexity of designing chips containing billions
of transistors requires decoupling computation from communication.
For the communication, scalable and compositional interconnects,
such as networks on chip (NoC), must be used. In this
paregate bandwidth of
80 Gbit/s. It occupies 0.26 mm2 in CMOS12. This shows that
our router provides high performance at reasonable cost, bringing
NoCs one step closer.
-
Communication Centric Architectures for Turbo-Decoding on Embedded Multiprocessors [p. 356]
-
F. Gilbert, M. Thul, and N. Wehn
Software implementations of channel decoding algorithms
are attractive for communication systems with their
large variety of existing and emerging standards due to their
flexibility and extensibility. For high throughput, however,
a single processor can not provide the necessary compute
power. Using several processors in parallel without exploiting
the internal parallelism of the algorithm leads to intolerable
overhead in area, power consumption, and latency.
We propose a multiprocessor based Turbo-Decoder implementation
where inherently parallel decoding tasks are
mapped onto individual processing nodes. The implied
challenging inter-processor communication is efficiently
handled by our framework such that throughput is not degraded.
In this paper we present communication centric
architectures from buses to heterogenous networks that allow
to interconnect numerous processors to perform high
throughput Turbo-decoding.
Moderators: C. Delgado Kloos, Carlos III de Madrid U, ES; E. Villar, Cantabria, U, ES
-
Development and Application of Design Transformations in ForSyDe [p. 364]
-
I. Sander, A. Jantsch, and Z. Lu
The ForSyDe methodology has been developed for system
level design. Starting with a formal specification model,
that captures the functionality of the system at a high abstraction
level, it provides formal design transformation
methods for a transparent refinement process of the system
model into an implementation model that is optimized for
synthesis. The main contribution of this paper is the formal
treatment of transformational design refinement. Using
the formal semantics of ForSyDe processes we introduce the
term characteristic function to be able to define and classify
transformations as either semantic preserving or design decision.
We also illustrate how we can incorporate classical
synthesis techniques that have traditionally been used
with control/data-flow graphs as ForSyDe transformations.
Thus, our approach avoids discontinuities since it moves design
refinement into the domain of the specification model.
-
System Level Specification in Lava [p. 370]
-
S. Singh
The Lava system provides novel techniques for representing
system level specifications which are supported by a design
flow that maps Lava descriptions onto System-on-Chip
platforms implemented on very large FPGAs. The key contribution
of this paper is a type class based approach for
specifying bus-based system configurations. This provides
a very flexible and parameterised flow for combining predesigned
IP blocks into a complete FPGA-based system.
-
Formal Semantics of Synchronous SystemC [p. 376]
-
A. Salem
In this article, a denotational definition of synchronous
subset of SystemC is proposed. The subset treated
includes modules, processes, threads, wait statement,
ports and signals. We propose formal model for System C
delta delay. Also, we give a complete semantic definition
for the language's two-phase scheduler. The proposed
semantic can constitute a base for validating the
equivalence of synchronous HDL subsets.
-
Introspection in System-Level Language Frameworks: Meta-Level vs. Integrated [p. 382]
-
F. Doucet, R. Gupta, and S. Shukla
Reflection and automated introspection of a design in
system level design frameworks are seen as necessities for
the CAD tools to manipulate the designs within the tools.
These features are also useful for debuggers, class and
object browsers, design analyzers, composition validation,
type checking, compatibility checking, etc. However, the
central question is whether such features should be integrated
into the language, or if we should build frameworks
which feature these capabilities in a meta-layer, leaving the
system-level language intact. In our recent interactions with
designers, we have found differing opinions. Especially in
the context of SystemC, the temptation to integrate reflective
APIs into the language is great, because C++ is expressive,
and already has type introspective packages available.
In this paper, we analyze this issue and show that
(i) it is a better EDA system architecture to implement reflection
/introspection at a meta-layer in a design framework
(ii) there are relatively unexplored territories of design
automation, such as behavioral typing of component
interfaces, corresponding type-theory, and their implication
in automating component composition, interface synthesis,
and validation, which can be better incorporated if the introspection
is implemented at a meta-layer.
-
SystemC-AMS Requirements, Design Objectives and Rationale [p. 388]
-
A. Vachoux, C. Grimm, and K. Einwich
This paper presents and discusses the foundations on
which the analog and mixed-signal extensions of SystemC,
named SystemC-AMS, will be developed. First, requirements
from targeted application domains are
identified. These are then used to derive design objectives
and related rationales. Finally, some preliminary seed
work is presented and the outline of the analog and
mixed-signal extensions development work is given.
Organizer/Moderator: I. Bolsens, Xilinx, US
-
Parallel Processing Architectures for Reconfigurable Systems [p. 396]
-
K. Vissers
Novel reconfigurable computing architectures exploit
the inherent parallelism available in many signal processing
problems. These architectures often consist of
networks of compute elements that have an ALU-like
structure with corresponding instructions. This opens
opportunities for rapid dynamic reconfiguration and
instruction multiplexing. The field of computer
architectures has significantly contributed to the
systematic and quantified exploration of architectures.
Novel reconfigurable architecture exploration should
learn from this approach.
Future System-on-a-Chip platforms will consist of a
combination of processor architectures, on-chip
memories, and reconfigurable architectures. The real
challenge is to design those architectures that can be
programmed efficiently. This requires that first a
programming environment and benchmarks be created
and then that the reconfigurable architectures be
systematically explored.
-
Different Approaches to Add Reconfigurability in a SoC Architecture [p. 398]
-
B. Gupta and M. Borgatti
Dynamically reprogrammable hardware has been
advocated in the academic research community as the
next hot area in system design for some time now. The
lack of integrated systems in the marketplace that
incorporate dynamic reprogramming stands at contrast to
the enthusiasm of the research community for the topic.
We would like to offer as a middle ground several
examples of dynamic reprogramming in working silicon
that might help to illuminate the path towards the future
of SoCs. In our research at STMicroelectronics, we have
built two independent SoCs that utilize embedded
FPGAs to provide the dynamic reprogramming
capability. The benefit of the embedded FPGA has been
demonstrated to range from application acceleration to
augmenting functionality and providing silicon area
reuse. The first system to be described is intended for
image processing and biometric recognition. The second
system is aimed at wireless LAN baseband processing.
-
A Lightweight Approach for Embedded Reconfiguration of FPGAs [p. 399]
-
B. Blodget, S. McMillan, and P. Lysaght
This paper presents a lightweight approach for embedded
reconfiguration of Xilinx Virtex IItm series FPGAs. A
hardware and software infrastructure is reported that enables
an FPGA to dynamically reconfigure itself under the
control of a soft microprocessor core that is instantiated on
the same array. The system provides a highly integrated,
lightweight approach to dynamic reconfiguration for embedded
systems. It combines the benefits of intelligent control,
fast reconfiguration and small overhead.
Organizer/Moderator: E.J. Marinissen, Philips, NL
Speakers:
B. Vermeulen, Philips, NL
R. Madge, LSI Logic, US
M. Kessler and M. Müller, IBM Boeblingen, DE
-
Panel Statement [p. 402]
-
Test is often seen as a necessary evil; it is a fact of life that ICs have manufacturing defects and those need to be filtered out by testing
before the ICs are shipped to the customer. In this paper, we show that techniques and tools used in the testing field can also be
(re-)used to create value to (1) designers, (2) manufacturers, and (3) customers alike. First, we show how the test infrastructure can
be used to detect, diagnose, and correct design errors in prototype silicon. Secondly, we discuss how test results are used to improve
the manufacturing process and hence production yield. Finally, we present test technologies that enable systems of high reliability
for safety-critical applications.
Moderators: R. Leupers, RWTH Aachen, DE; J.C. Lopez, Castilla-La Mancha, ES
-
Control Flow Driven Splitting of Loop Nests at the Source Code Level [p. 410]
-
H. Falk and P. Marwedel
This paper presents a novel source code transformation
for control flow optimization called loop nest splitting which
minimizes the number of executed if-statements in loop
nests of embedded multimedia applications. The goal of the
optimization is to reduce runtimes and energy consumption.
The analysis techniques are based on precise mathematical
models combined with genetic algorithms. Due to the inherent
portability of source code transformations, a very
detailed benchmarking using 10 different processors can be
performed. The application of our implemented algorithms
to three real-life multimedia benchmarks leads to average
speed-ups by 23.6% -- 62.1% and energy savings by 19.6%
-- 57.7%. Furthermore, our optimization also leads to advantageous
pipeline and cache performance.
-
Data Space Oriented Scheduling in Embedded Systems [p. 416]
-
M. Kandemir, G. Chen, W. Zhang, and I. Kolcu
With the widespread use of embedded devices such
as PDAs, printers, game machines, cellular telephones,
achieving high performance demands an optimized operating
system (OS) that can take full advantage of the underlying
hardware components. This paper presents a locality
conscious process scheduling strategy for embedded
environments. The objective of our scheduling strategy is
to maximize reuse in the data cache. It achieves this by
restructuring the process codes based on data sharing patterns
between processes.
-
Compiler-Directed ILP Extraction for Clustered VLIW/EPIC Machines:
Predication, Speculation and Modulo Scheduling [p. 422]
-
S. Pillai and M. Jacome
Compiler-directed ILP extraction techniques are critical
to effectively exploiting the significant processing capacity
of contemporaneous VLIW/EPIC machines. In this paper
we propose a novel algorithm for ILP extraction targeting
clustered EPIC machines that integrates three powerful
techniques: predication, speculation and modulo scheduling.
In addition, our framework schedules and binds operations,
generating actual VLIW code. To the best of our
knowledge, there is no other algorithm in the literature on
predicated code optimizations that jointly considers speculation
and modulo scheduling in the context of clustered
EPIC machines. Our experimental results show that by
jointly considering different extraction techniques in a resource
aware context, the proposed algorithm can take maximum
advantage of the resources available on the clustered
machine, aggressively improving performance.
-
An Efficient Hash Table Based Approach to Avoid State Space Explosion in
History Driven Quasi-Static Scheduling [p. 428]
-
A. Lomeña, M. López-Vallejo, Y. Watanabe, and A. Kondratyev
This paper presents an efficient hash table based method
to optimally overcome a new variant of the state space explosion
which appears during the quasi-static task scheduling
of embedded, reactive systems.
Our application domain is targeted to one-processor
software synthesis, and the scheduling process is based on
Petri net reachability analysis to ensure cyclic, bounded
and undeadlocked programs. To achieve greater flexibility,
we employ a dynamic, history based criterion to prune the
search space. This makes our synthesis approach different
from most existing code generation techniques.
Our experimental results reveal a significant reduction in
algorithmic complexity (both in memory storage and CPU
time) obtained for medium and large size problems.
Moderators: I. Markov, Michigan U, US; J. Lienig, TU Dresden, DE
-
Time Budgeting in a Wireplanning Context [p. 436]
-
J. Westra, R. Otten, D. Jongeneel, and C. Visweswariah
Wireplanning is an approach in which the timing of input-output
paths is planned before modules are specified, synthesized
or sized. If these global wires are optimally segmented
and buffered, their delay is linear in the path length and independent
of the position of the modules along these paths. From
timing requirements, the total budget left to modules after allocating
the appropriate delay to the wires can be determined.
This paper describes how this budget can be optimally divided
amongst the modules. A novel, static timing-like, mathematical
programming formulation is introduced such that the total
module area is minimized. Instead of only the worst delay,
all pin-to-pin delays are implicitly taken into account. If
area-delay tradeoffs are convex, a reasonable approximation
in practice, the program can be solved efficiently. Further, efficiency
of different formulations is discussed, and a low-cost
method of making the budget relatively immune to downstream
uncertainties and surprises is presented. The efficiency of the
formulation is clear from benchmarks with over 2000 nodes
and 5e19 paths.
-
Interconnect Planning with Local Area Constrained Retiming [p. 442]
-
R. Lu and C. Koh
We present a framework that considers global routing,
repeater insertion, and flip-flop relocation for early interconnect
planning. We formulate the interconnect retiming
and flip-flop placement problem as a local area constrained
retiming problem and solve it as a series of weighted minimum
area retiming problems. Our method for early interconnect
planning can reduce and even avoid design iterations
between physical planning and high level designs.
Experimental results show that our method can reduce the
number of area violations by an average of 84% in a single
interconnect planning step.
-
A Novel Metric for Interconnect Architecture Performance [p. 448]
-
P. Dasgupta, A. Kahng, and S. Muddu
We propose a new metric for evaluation of interconnect architectures.
This metric is computed by optimal assignment of wires
from a given wire length distribution (WLD) to a given interconnect
architecture (IA). This new metric, the rank of an IA, is a single
number that gives the number of connections in the WLD that
meet a specific target delay when embedded in the IA. A dynamic
programming algorithm is presented to exactly compute the rank
of an IA with respect to a given WLD within practical runtimes.
We use our new IA metric to quantitatively compare impacts of
geometric parameters as well as process and material technology
advances. For example, we observe that 42% reduction in Miller
coupling factor achieves the same rank improvement as a 38% reduction
in inter-layer dielectric permittivity for a 1M gate design in
the 130nm technology.
Moderators: R. Seepold, Carlos III de Madrid U, ES; T. Riesgo, UP Madrid, ES
-
Specification of Non-Functional Intellectual Property Components [p. 456]
-
J. Zhu and W. Mong
In its most general sense, intellectual property components
(IPs) refer to any design artifacts that are reusable.
While the specification of the functional IPs, such as behavioral
and RTL specifications have been widely investigated,
the specifications of others, such as timing, constraints,
layouts and architectures are largely ad hoc. This leads
to different standard or proprietary file/database formats
with interoperatability problems, which eventually hinder
the distribution and integration of IPs. In this paper, we
address the difficult problem of integrating semantically diverse
non-functional IPs by the use of a new, extensible language
called Babel. Despite its simple 1-page grammar,
Babel is front-end for a powerful IP-based design infrastructure.
We demonstrate the effectiveness of our approach
by two case studies, one for the creation of parameterized
memory IPs and one for the creation of processor IPs.
-
Profile-Driven Selective Code Compression [p. 462]
-
Y. Xie, W. Wolf, and H. Lekatsas
In the embedded system design, memory is one of the
most restricted resources. Code compression has been proposed
as a solution to reduce the code size of applications
for embedded systems. Data compression techniques are
used to compress programs to reduce memory size. Most
previous work compresses all instructions found in an executable,
without taking into account the program execution
profile. In this paper, a profile-driven code compression
design methodology is proposed. Program profiling information
can be used to help code compression to selectively
compress non-critical instructions, such that the system performance
degradation due to the decompression penalty is
reduced.
-
Design and Analysis of a Programmable Single-Chip Architecture for DVB-T Base-Band Receiver [p. 468]
-
C. Pan, N. Bagherzadeh, A. Kamalizad, and A. Koohi
This work treats the design and analysis of a
programmable (or reconfigurable) DSP-domain-specific
architecture called MorphoSys, upon which world's first
single-chip software solution for DVB-T base-band
receiver can be implemented. Based on the first version
of MorphoSys, many modifications have been made to
improve greatly both computation power and data
movement efficiency. Sequential codes and SIMD codes
can be parallelized; temporal granularity adjustment
boosts up performance up to 4 times; numerous different
types of data movement can be accelerated 8 to 64 times
faster than sequential movement. As a complicated
(21GOPS) and typical communication system, DVB-T
base-band receiver is designed with low performance
loss and mapped onto MorphoSys architecture
(>28GOPS). This solidly contributes to the software
defined radio development.
Organizer: W. Rosenstiel, Tuebingen U/FZI, DE
Moderator: R. Lauwereins, IMEC, BE
Panellists:
I. Bolsens, Xilinx, US
C. Rowen, Tensilica, US
Y. Tanurhan, Actel, US
K. Vissers, Chameleon Systems, US
S. Wang, Axis Systems, US
-
Panel Statement [p. 476]
-
With vanishing time to market windows and rapidly changing design specifications, flexibility
becomes the number one constraint in the development of the SoC's of the future. Adding efficient
power and resource management to the requirements list made the cry for reconfigurable computing
even louder. In this panel we will listen to different industry and research representatives debating
their approaches to reconfigurable computing.
Moderators: R. Galivanche, Intel, US; A. Richardson, Lancaster U, UK
-
RF-BIST: Loopback Spectral Signature Analysis [p. 478]
-
D. Lupea, U. Pursche, and H. Jentschel
Built-In Self-Test (BIST) becomes important also for more complex structures
like complete front-ends. In order to bring down the costs for the test
overhead, Spectral Signature Analysis at system level seems to be a promising
concept. Investigations that have been carried out are targeted on the most
challenging problems. Generation of the Test Signature, Evaluation
of the Signature Response, Implementation of the concept and Verification
by Simulation. From investigation it can be concluded that the concept
is suitable especially in the case of transceiver-type DUT.
-
Optimizing Stresses for Testing DRAM Cell Defects Using Electrical Simulation [p. 484]
-
Z. Al-Ars, A. van de Goor, J. Braun, and D. Richter
Stresses are considered an integral part of any
modern industrial DRAM test. This paper describes a
novel method to optimize stresses for memory testing, using
defect injection and electrical simulation. The new
method shows how each stress should be applied to achieve
a higher fault coverage of a given test, based on an understanding
of the internal behavior of the memory. In addition,
results of a fault analysis study, performed to verify
the new optimization method, show its effectiveness.
Key words: stresses, memory testing, test optimization,
defect simulation.
-
On Modeling Cross-Talk Faults [p. 490]
-
S. Zachariah, Y. Chang, S. Kundu, and C. Tirumurti
Circuit marginality failures in high performance
VLSI circuits are projected to increase due to shrinking
process geometries and high frequency design
techniques. Capacitive cross coupling between interconnects
is known to be a prime contributor to such
failures. In this paper, we present novel techniques to
model and prioritize capacitive cross-talk faults.
Experimental results are provided to show effectiveness
of the proposed modeling technique on industrial
circuits.
-
Techniques for Automatic on Chip Closed Loop Transfer Function Monitoring for
Embedded Charge Pump Phase Locked Loops [p. 496]
-
M. Burbidge, A. Richardson, and J. Tijou
Charge Pump Phase locked loops are used in a variety of applications, including on chip clock synthesis, symbol timing recovery for serial data streams, and generation of frequency agile high frequency carrier signals. In many applications PLL's are embedded into larger digital systems, in consequence, analogue test access is often limited. Test motivation is thus towards methods that can either aid digital only test of the PLL, or alternatively facilitate complete self testing of the PLL. One useful characterisation technique used by PLL designers is that of closed loop phase transfer function measurement. This test allows, an estimation of the PLL's natural frequency, damping, and 3dB bandwidth to be made from the magnitude and phase response plots. These parameters relate directly to the time domain response of the PLL and will indicate errors in the PLL circuitry. This paper provides suggestions towards test methods that use a novel maximum frequency detection technique to aid automatic measurement of the closed loop phase transfer function. In addition, techniques presented have potential for full BIST applications.
Keywords: PLL, CP-PLL, BIST, TEST, DfT.
Moderators: J. Henkel, NEC, US; P. Marwedel, Dortmund U, DE
-
Pre-Characterization Free, Efficient Power/Performance Analysis of Embedded and
General Purpose Software Applications [p. 504]
-
V. Rapaka and D. Marculescu
This paper presents a novel approach for an efficient, yet
accurate estimation technique for power consumption and performance
of embedded and general purpose applications. Our
approach is adaptive in nature and is based on detecting sections
of code characterized by high temporal locality (also called
hotspots) in the execution profile of the benchmark being executed
on a target processor. The technique itself is architecture and input
independent and can be used for both embedded, as well as
for general purpose processors. We have implemented a hybrid
simulation engine which can significantly shorten the simulation
time by using on-the-fly profiling for critical sections of the code
and by reusing this information during power/performance estimation
for the rest of the code. By using this strategy, we were
able to achieve up to 20X better accuracy compared to a flat,
non-adaptive sampling scheme and a simulation speed-up of up
to 11.84X with a maximum error of 1.03% for performance and
1.92% for total energy on a wide variety of media and general
purpose applications.
-
Runtime Code Parallelization for On-Chip Multiprocessors [p. 510]
-
M. Kandemir, W. Zhang, and M. Karakoy
Chip multiprocessing (or multiprocessor system-on-a-chip)
is a technique that combines two or more processor
cores on a single piece of silicon to enhance computing performance.
An important problem to be addressed in executing
applications on an on-chip multiprocessor environment
is to select the most suitable number of processors to use
for a given objective function (e.g., minimizing execution
time or energy-delay product) under multiple constraints.
Previous research proposed an ILP-based solution to this
problem that is based on exhaustive evaluation of each nest
under all possible processor sizes. In this paper, we take a
different approach and propose a pure runtime strategy for
determining the best number of processors to use at runtime.
This approach is more general than static techniques
and can be applicable in situations where the latter cannot
be.
-
SDRAM-Energy-Aware Memory Allocation for Dynamic Multi-Media Applications on
Multi-Processor Platforms [p. 516]
-
P. Marchal, F. Catthoor, D. Bruni, L. Benini, J. Gomez, L. Piñuel, and H. Corporaal
Heterogeneous multi-processors platforms are an interesting
option to satisfy the computational performance of dynamic
multi-media applications at a reasonable energy cost. Today,
almost no support exists to energy-efficiently manage the data
of a multi-threaded application on these platforms. In this paper
we show that the assignment of data of dynamically created/
deleted tasks to the shared memory has a large impact on
the energy consumption. We present two dynamic memory allocators
which solve the bank assignment problem for shared
multi-banked SDRAM memories. Both allocators assign the
tasks' data to the available SDRAM banks such that the number
of page-misses is reduced. We have measured large energy savings
with these allocators compared to existing dynamic memory
allocators for several task-sets based on MediaBench[5].
Moderators: L.M. Silveira, INESC ID/IST, PT; J.R. Phillips, Cadence, US
-
Modeling and Evaluation of Substrate Noise Induced by Interconnects [p. 524]
-
F. Martorell, D. Mateo, and X. Aragonès
Interconnects have deserved attention as a source of
crosstalk to other interconnects, but have been ignored as
a source of substrate noise. In this paper, we evaluate the
importance of interconnect-induced substrate noise. A
known interconnect and substrate model is validated by
comparing simulation results to experimental measurements.
Based on the validated modeling approach, a complete
study considering frequency, geometrical, load and
shielding effects is presented. The importance of interconnect-induced
substrate noise is demonstrated after observing
that, for typically sized interconnects and state-of-the-art speeds,
the amount of coupled noise is already
comparable to that injected by hundreds of transistors.
-
Model-Order Reduction Based on PRONY's Method [p. 530]
-
M. Mansour and A. Mehrotra
A new model-order reduction technique for linear dynamic
systems is presented. The idea behind this technique is
to transform the dynamic system function from the s-domain
into the z-domain via the bilinear transformation, then use
Prony's [1] least-squares approximation method instead of
the commonly employed Padé approximation method, and finally
transform the reduced system back into the s-domain
using the inverse bilinear transformation.tegy
allows a very accurate and efficient full-wave solution of
interconnection structures with possibly complex geometry
including the nonlinear and dynamic effects of real-world
digital devices, without the need of detailed transistor-level
models. Examples of signal integrity and field coupling
analysis are shown.
-
Enhancing Signal Integrity through a Low-Overhead Encoding Scheme on Address Buses [p. 542]
-
T. Lv, W. Wolf, J. Henkel, and H. Lekatsas
Signal integrity is and will continue to be a major
concern in deep sub-micron VLSI designs where the
proximity of signal carrying lines leads to crosstalk,
unpredictable signal delays and other parasitic side
effects. Our scheme uses bus encoding that guarantees
that at any time any two signal carrying lines will be
separated by at least one grounded line and thus
providing a high degree of signal integrity. This comes
at a small overhead of only one additional bus line (the
closest related work needs 14 additional lines for a 32-bit bus)
and a small average performance decrease of
0.36%. By means of a large set of real-world
applications, we compare our scheme to other state-of-the-art
approaches and present comparisons in terms of
degree of integrity, overhead (e.g. additional lines
required) and a possible performance decrease.
Moderators: B.M. Al-Hashimi, Southampton U, UK; M. Rencz, TU Budapest, HU
-
Building Fast and Accurate SW Simulation Models Based on Hardware Abstraction Layer and
Simulation Environment Abstraction Layer [p. 550]
-
S. Yoo, I. Bacivarov, A. Bouchhima, Y. Paviot, and A. Jerraya
As a fast and accurate SW simulation model, we present a
model called fast timed SW model. The model enables
fast simulation by native execution of application SW and
OS. It gives simulation accuracy by timed SW and HW
simulation. When building fast timed SW models, we
need to solve two problems: (1) how to enable timing
synchronization between the native execution and HW
simulation and (2) how to obtain the portability of native
execution (that needs multi-tasking from simulation
environments to emulate its multi-tasking operation) on
different simulation environments (that give different
types of multi-tasking). In this paper, to enable the
synchronization, we present a synchronization function.
To enable the portability, we present an adaptation layer
called simulation environment abstraction layer. We
present our case studies in building fast timed SW models.
-
Flexible and Formal Modeling of Microprocessors with Application to Retargetable Simulation [p. 556]
-
W. Qin and S. Malik
Given the growth in application-specific processors, there is a
strong need for a retargetable modeling framework that is capable
of accurately capturing complex processor behaviors and generating
efficient simulators. We propose the operation state machine
(OSM) computation model to serve as the foundation of such a
modeling framework. The OSM model separates the processor
into two interacting layers: the operation layer where operation
semantics and timing are modeled, and the hardware layer where
disciplined hardware units interact. This declarative model allows
for direct synthesis of micro-architecture simulators as it encapsulates
precise concurrency semantics of microprocessors. We illustrate
the practical benefits of this model through two case studies
- the StrongARM core and the PowerPC-750 superscalar processor.
The experimental results demonstrate that the OSM model has
excellent modeling productivity and model efficiency. Additional
applications of this modeling framework include derivation of information
required by compilers and formal analysis for processor
validation.
-
Instruction Set Emulation for Rapid Prototyping of SoCs [p. 562]
-
J. Schnerr, G. Haug, and W. Rosenstiel
In this paper the application of Instruction Set Emulation
for rapid prototyping of SoCs will be presented. The emulation
works in a way that both the software and the hardware
behaviour of the emulated processor core is reproduced cycle
accurately. This requires the use of hardware and software
components. The hardware component consists of a
board containing a VLIW processor and FPGAs. The software
component is an instruction set simulator of the core
running on the VLIW processor. The FPGAs are used for
emulating the SoC bus of this processor core. This way the
simulation of an instruction set of a processor core has been
extended to a real emulation of this core that can be used for
rapid prototyping.
Moderators: P. Lysaght, Xilinx, US; S. Vernalde, IMEC, BE
-
Hardware/Software Design Space Exploration for a Reconfigurable Processor [p. 570]
-
A. La Rosa, L. Lavagno, and C. Passerone
This paper describes an approach to hardware/
software design space exploration for reconfigurable
processors. The existing compiler tool-chain,
because of the user-definable instructions, needs to be
extended in order to offer developers an easy way to
explore design space. Such extension often is not easy to
use for developer that have only a software background,
thus ignoring reconfigurable architecture details or
hardware logic synthesis on FPGA. Our approach differs
from others because it is based on a simple extension on
the standard programming model well known to software
developers.
-
From C Programs to the Configure-Execute Model [p. 576]
-
J. Cardoso and M. Weinhardt
The emergence of run-time reconfigurable architectures
makes feasible the configure-execute paradigm. Compilation
of behavioral descriptions (in, e.g., C, Java, etc.), apart
from mapping the computational structures onto the available
resources on the device, must split the program in temporal
sections if it needs more resources than physically
available. In addition, since the execution of the computational
structures in a configuration needs at least two stages
(i.e., configuring and computing), it is important to split the
program such that the reconfiguration overheads are minimized,
taking advantage of the overlapping of the execution
stages on different configurations. This paper presents
mapping techniques to cope with those features. The techniques
are being researched in the context of a C compiler
for the eXtreme Processing Platform (XPP). Temporal partitioning
is applied to furnish a set of configurations that
reduces the reconfiguration overhead and thus may lead to
performance gains. We also show that when applications
include a sequence of loops, the use of several configurations
may be more beneficial than the mapping of the entire
application onto a single configuration. Preliminary results
for a number of benchmarks strongly confirm the approach.
-
FPGA-Based Implementation of a Serial RSA Processor [p. 582]
-
A. Mazzeo, L. Romano, G. Saggese, and N. Mazzocca
In this paper we present an hardware implementation of
the RSA algorithm for public-key cryptography. The RSA algorithm
consists in the computation of modular exponentials
on large integers, that can be reduced to repeated modular
multiplications. We present a serial implementation of RSA,
which is based upon an optimized version of the RSA algorithm
originally proposed by P.L. Montgomery. The proposed
architecture is innovative, and it widely exploits specific capabilities
of Xilinx programmable devices. As compared to
other solutions in the literature, the proposed implementation
of the RSA processor has smaller area occupation and
comparable performance. The final performance level is a
function of the serialization factor. We provide a thorough
discussion of design tradeoffs, in terms of area requirements
vs performance, for different values of the key length and of
the serialization factor.
Moderators: L. Bouzaida, STMicroelectronics, FR; A. Paschalis, Athens U, GR
-
Optimal Reconfiguration Functions for Column or Data-Bit Built-In Self-Repair [p. 590]
-
M. Nicolaidis, N. Achouri, and S. Boutobza
In modern SoCs, embedded memories occupy the largest
part of the chip area and include an even larger amount of
active devices. As memories are designed very tightly to the
limits of the technology they are more prone to failures
than logic. Thus, memories concentrate the large majority
of defects and affect circuit yield dramatically. As a matter
Built-In Self-Repair is gaining importance. This work
presents optimal reconfigurations functions for memory
built-in self-repair on the data-bit level. We also present a
dynamic repair scheme that allows reducing the size of the
repairable units. The combination of these schemes allows
repairing multiple faults affecting both regular and spare
units, by means of low hardware cost. The scheme uses a
single test pass, resulting on low test and repair time.
-
Versatile High-Level Synthesis of Self-Checking Datapaths Using an On-Line Testability Metric [p. 596]
-
P. Oikonomakos, M. Zwolinski, and B. Al-Hashimi
There have been several recent attempts to include
duplication-based on-line testability in behaviourally
synthesized designs. In this paper, on-line testability is
considered within the optimisation process of iterative,
cost function-driven high-level synthesis, such that on-line
testing resources are inserted automatically without any
modification of the source HDL code. This involves the
introduction of a metric for on-line testability. A variation
of duplication testing (namely inversion testing) is also
used, providing the system with an additional degree of
freedom towards minimising hardware overheads
associated with test resource insertion. Considering online
testability within the synthesis process facilitates fast
and efficient design space exploration, resulting in a
versatile high-level synthesis process, capable of
producing alternative realisations according to the
designer's directions.
-
An Accurate Analysis of the Effects of Soft Errors in the Instruction and
Data Caches of a Pipelined Microprocessor [p. 602]
-
M. Rebaudengo, M. Sonza Reorda, and M. Violante
Instruction and data caches are well known
architectural solutions that allow significantly improving
the performance of high-end processors. Due to their
sensitivity to soft errors they are often disabled in safety
critical applications, thus sacrificing performance for
improved dependability. In this paper we report an
accurate analysis of the effects of soft errors in the
instruction and data caches of a soft core implementing
the SPARC architecture. Thanks to an efficient simulation-based
fault injection environment we developed, we are
able to present in this paper an extensive analysis of the
effects of soft errors on a processor running several
applications under different memory configurations. The
procedure we followed allows the precise computation of
the processor failure rate when the cache is enabled even
without resorting to expensive radiation experiments.
-
High Speed and Highly Testable Parallel Two -Rail Code Checker [p. 608]
-
M. Omaña, D. Rossi, and C. Metra
In this article we propose a high speed and highly
testable parallel two-rail code checker, which features a
compact structure and is Totally-Self-Checking or Strongly
Code-Disjoint with respect to a wide set of realistic faults.
The proposed checker is also particularly suitable to implement
embedded two-rail code checkers, as it requires only
two input codewords for fault detection. Our checker can
be employed to check the correct operation of a connected
functional block using the two-rail code, to implement the
output two-rail code checker of "normal" checkers for unordered
codes, or to join together the error messages produced
by various checkers (possibly using different codes)
present within the same self-checking system. The behavior
of our checker has been verified by means of electrical
level simulations (performed using HSPICE), considering
both nominal values and statistical variations of electrical
parameters.
Organizer/Moderator: R. Ernst, TU Braunschweig, DE
-
Safe Automotive Software Development [p. 616]
-
K. Tindell, H. Kopetz, F. Wolf, and R. Ernst
Automotive systems engineering has made significant
progress in using formal methods to design safe
hardware-software systems. The architectures and
design methods could become a model for safe and
cost-efficient embedded software development as a
whole. This paper gives several examples from the
leading edge of industrial automotive applications.
Moderators: A. Kaiser, IEMN-ISEN, FR; T. Ifstroem, Robert Bosch GmbH, DE
-
Analysis and White-Box Modeling of Weakly Nonlinear Time-Varying Circuits [p. 624]
-
P. Dobrovolný, G. Vandersteen, P. Wambacq, and S. Donnay
The architectural study of wireless communication
systems typically requires simulations with high-level
models for different analog and RF blocks. Among these
blocks, frequency-translating devices such as mixers pose
problems in RF circuit simulation since their response
typically covers a mix of long- and short-time scales. This
paper proposes a technique to analyze and model
nonlinear frequency-translating RF circuits such as up-and
down conversion mixers. The proposed method is
based on a generalized Volterra series approach for
periodically time-varying systems. It enables a multi-tone
distortion analysis starting from a circuit description and
derives simplified high-level models based on the most
important nonlinear contributions. These models give
both insight in the nonlinear behavior and enable an
efficient high-level simulation during architectural design
of front-ends of RF transceivers.
-
Linear Model-Based Error Identification and Calibration for Data Converters [p. 630]
-
C. Wegener and M. Kennedy
For the example of a 12-bit Nyquist-rate ADC, a model for
nonlinearity-causing mechanisms is developed based on circuit
simulations. The model is used to estimate circuit element
values from measured device characteristics. Post-manufacture
reconfiguration of the digital control part of the
device-type that is used as a test vehicle in this work can improve
the linearity performance of a device. An algorithm is
proposed that searches for a locally-optimal reconfiguration
based on the determined circuit element values. Applying
calibration to the circuit simulation model allows one to estimate
the performance improvement obtainable with the proposed
calibration scheme for a given manufacturing process
prior to a physical implementation.
-
Improved Design Methodology for High-Speed High-Accuracy Current Steering D/A Converters [p. 636]
-
M. Albiol, J. González, and E. Alarcón
This paper describes a sizing and design methodology for
high-speed high-accuracy current steering D/A converters
taking into account mismatching in all the transistors of the
current source cell. The presented method allows a more
accurate selection of the optimal design point without
introducing arbitrary safety margins, as was done in the
previous literature. This methodology has been applied to the
design of a CMOS 12-bit 400 MHz current-steering
segmented D/A converter. Commercial CAD tools are used
to automatically lay out regular structures of the DAC,
specially the current source array, following an optimal two-dimensional
switching scheme to compensate for systematic
mismatch errors.
-
Behavioral Modeling and Simulation of a Mixed Analog/Digital Automatic Gain Control
Loop in a 5 GHz WLAN Receiver [p. 642]
-
W. Eberle, H. De Man, G. Vandersteen, P. Wambacq, G. Gielen, and S. Donnay
Wireless LAN (WLAN) operating in the 5-6 GHz range,
become commercially viable only, if they can be
produced at low cost. Consequently, tight integration of
the physical layer, consisting of the radio front-end and
the digital signal processing part, is a must. Especially
with respect to mixed-signal feedback loops, with
automatic gain control as a recurring example, existing
tools have major difficulties in offering efficient ways of
modeling and simulation. We present a modeling
approach where the complexity of the analog behavioral
model has been reduced to the minimum required by the
digital receiver, namely its steady-state responses and a
`worst-case' time delay. Moreover, we show how this
mixed-signal receiver model can be used in an end-to-end
communication link simulation to provide the
designer insight into statistical information such as
transient delays and gain tolerances. For this model, we
set up a co-simulation of two existing in-house tools, one
for the analog part, the other for the digital system part.
Moderators: H. Hsieh, UC Riverside, US; W. Kruijtzer, Philips, NL
-
Analytical Design Space Exploration of Caches for Embedded Systems [p. 650]
-
A. Ghosh and T. Givargis
The increasing use of microprocessor cores in embedded
systems, as well as mobile and portable devices, creates
an opportunity for customizing the cache subsystem for
improved performance. Traditionally, a
design-simulate-analyze methodology is used to achieve
desired cache performance. Here, to bootstrap the
process, arbitrary cache parameters are selected, the
cache sub-system is simulated using a cache simulator,
based on performance results, cache parameters are
tuned, and the process is repeated until an acceptable
design is obtained. Since the cache design space is
typically very large, the traditional approach often
requires a very long time to converge. In the proposed
approach, we outline an efficient algorithm that directly
computes cache parameters satisfying the desired
performance. We demonstrate the feasibility of our
algorithm by applying it to a large number of embedded
system benchmarks.
Keywords
Cache Optimization, Core-Based Design, Design Space
Exploration, System-on-a-Chip
-
Fast and Accurate Multiprocessor Architecture Exploration with Symbolic Programs [p. 656]
-
V. Zivkovic, E. Deprettere, E. de Kock, and P. van der Wolf
In system-level platform-based embedded systems design,
the mapping model is a crucial link between the application
model and the architecture model. All three models
must match when design-space exploration has to be fast
and accurate, and when exploration methods and design
methods have to be closely related. For the media processing
application domain we present an architecture model
and corresponding mapping model that meet these requirements
better than previously proposed models. A case study
illustrates this improvement.
-
Design Space Exploration for a Wireless Protocol on a Reconfigurable Platform [p. 662]
-
L. Vanzago, J. Cambonie, B. Bhattacharya, and L. Lavagno
This paper describes a design space exploration
experiment for a real application from the embedded
networking domain -- the physical layer of a wireless
protocol. The application models both control oriented as
well as data processing functions, and hence requires
composing tasks from different models of computation. We
show how the cost and performance of communication and
computation can be quickly evaluated, with a reasonable
modeling cost. While the example uses a specific tool, the
methodology and results can be used in a more general
context.
-
A First Step Towards HW/SW Partitioning of UML Specifications [p. 668]
-
W. Fornaciari, F. Salice, P. Micheli, and L. Zampella
This paper proposes a novel methodology tailored to design
embedded systems, taking into account the emerging market
needs, such as hw/sw partitioning, object-oriented specifications,
overall design costs and early analysis of design alternatives.
The proposal tackles the problem by considering UML as the
starting point for system-level description and uses a
customization of Function Point analysis and COCOMO to
provide cost metrics both for hardware and software. Finally, a
genetic algorithm is used to select the best candidate
architecture. The paper also reports some results, obtained from
a case studies, showing the viability of the proposed approach.
-
Multi-Granularity Metrics for the Era of Strongly Personalized SOCs [p. 674]
-
Y. Le Moullec, J. Diguet, J. Philippe, N. Ben Amor, and M. Abid
This paper details the first step of the Design Trotter
framework for design space exploration applied to
dedicated SOCs. The aim of this step is to provide metrics
in order to guide the designer and the synthesis tool
towards an efficient application architecture matching.
This work presents a computation of metrics at all levels
of the application graph-based hierarchy. These metrics
are computed through data and control dependency
analysis. They quantify the memory, control and
processing orientations as well as the average of
parallelism for different granularities.
Moderators: W. Nebel, OFFIS/Oldenburg U, DE; J. Henkel, NEC, US
-
Energy Estimation for Extensible Processors [p. 682]
-
Y. Fei, N. Jha, S. Ravi, and A. Raghunathan
This paper presents an efficient methodology for estimating the
energy consumption of application programs running on extensible
processors. Extensible processors, which are increasingly popular
in embedded system design, allow a designer to customize a base
processor core through instruction set extensions. Existing processor
energy macro-modeling techniques are not applicable to extensible
processors, since they assume that the instruction set architecture
as well as the underlying structural description of the microarchitecture
remain fixed.
Our solution to this problem is an energy macro-model suitably
parameterized to estimate the energy consumption of a processor
instance that incorporates any custom instruction extensions. Such
a characterization is facilitated by careful selection of macro-model
parameters/variables that can capture both the functional and structural
aspects of the execution of a program on an extensible processor.
Another feature of the proposed characterization flow is the use
of regression analysis to build the macro-model. Regression analysis
allows for in-situ characterization, thus allowing arbitrary test
programs to be used during macro-model construction.
We validate the proposed methodology by characterizing the energy
consumption of a state-of-the-art extensible processor (Tensilica's
Xtensa). We use the macro-model to analyze the energy consumption
of several benchmark applications with custom instructions.
The mean absolute error in the macro-model estimates is
only 3.3%, when compared to the energy values obtained by a commercial
tool operating on the synthesized RTL description of the
custom processor. Our approach achieves an average speedup of
three orders of magnitude over the commercial RTL energy estimator.
Our experiments show that the proposed methodology also
achieves good relative accuracy, which is essential in energy optimization
studies.
-
Exploiting the Routing Flexibility for Energy/Performance Aware Mapping of
Regular NoC Architectures [p. 688]
-
J. Hu and R. Marculescu
In this paper, we present an algorithm which automatically
maps the IPs onto a generic regular Network on Chip
(NoC) architecture and constructs a deadlock-free deterministic
routing function such that the total communication energy
is minimized. At the same time, the performance of
the resulting communication system is guaranteed to satisfy
the specified constraints through bandwidth reservation.
As the main contribution, we first formulate the problem of
energy/performance aware mapping, in a topological sense,
and show how the routing flexibility can be exploited to expand
the solution space and improve the solution quality.
An efficient branch-and-bound algorithm is then described to
solve this problem. Experimental results show that the proposed
algorithm is very fast, and significant energy savings
can be achieved. For instance, for a complex video/audio
application, 51.7% energy savings have been observed, on
average, compared to an ad-hoc implementation.
-
Chromatic Encoding: A Low Power Encoding Technique for Digital Visual Interface [p. 694]
-
W. Cheng and M. Pedram
This paper presents a low-power encoding technique, called
chromatic encoding, for the Digital Visual Interface standard
(DVI), a digital serial video interface. Chromatic encoding
reduces power consumption by minimizing the transition
counts on the DVI. This technique relies on the notion of tonal
locality, i.e., the observation that the signal differences
between adjacent pixels in images follow a Gaussian
distribution. Based on this observation, an optimal code
assignment is performed to minimize the transition counts.
Furthermore, the three color channels of the DVI may be
reciprocally encoded to achieve even more power saving. The
idea is that given the signal values from the three color
channels, one or two of these channels are encoded by
reciprocal differences with a number of redundant bits used to
indicate the selection. The proposed technique requires only
three redundant bits for each 24-bit pixel. Experimental
results show up to a 75% transition reduction.
-
MRPF: An Architectural Transformation for Synthesis of High-Performance and
Low-Power Digital Filters [p. 700]
-
H. Choo, K. Roy, and K. Muhammad
We present a graph theoretical methodology that reduces
the implementation complexity of a vector multiplied by a
scalar. The proposed approach is called MRP (minimally
redundant parallel) optimization and is presented in FIR
filtering framework to obtain a low-complexity multiplier-less
implementation. The key idea is to expand the design
space using shift inclusive differential coefficients together
with computation reordering using a graph theoretic approach
to obtain maximal computation sharing. The transformed
architecture of a filter is obtained by solving a set
cover problem of the graph. A simple algorithm based on
a greedy approach is presented. The proposed approach
is merged with common sub-expression elimination. The
simulation results show that 70% and 16% improvement in
terms of computational complexity over simple implementation
(transposed direct form) and common sub-expression,
respectively, when using carry lookahead adder synthesized
from synopsys designware library in .25 u technology.
-
Transport Protocol Optimization for Energy Efficient Wireless Embedded Systems [p. 706]
-
D. Bertozzi, L. Benini, A. Raghunathan, and S. Ravi
For wireless embedded systems, the power consumption
in the network interface (radio) plays a dominant role in determining
battery life. In this paper, we explore transport protocol optimizations
for reducing the energy consumption of wireless LAN
interfaces. Our work is based on the observation that, the transport
protocol, which implements flow control to regulate the network
traffic, plays a significant role in determining the workload of
the network interface. Hence, by monitoring run-time parameters
in the transport protocol, coarse-granularity idle periods, which
present the best opportunities for network interface power reduction,
can be accurately identified. We further show that, by tuning
parameters in the protocol software implementation, we can shape
the activity profile of the network interface, making it more energy
efficient while remaining compliant to the TCP standard. We have
performed extensive current measurements using an experimental
testbed that consists of a Compaq iPAQ PDA with a Cisco Aironet
wireless network adapter, to validate the proposed techniques. Our
measurements indicate energy savings ranging from 28% to 69%
compared to the use of state-of-the-art MAC layer power reduction
techniques, with little or no impact on performance.
Moderators: P. Harrod, ARM, UK; S. Shoukourian, VirageLogic International, AR
-
Low-Cost Software-Based Self-Testing of RISC Processor Cores [p. 714]
-
N. Kranitis, A. Paschalis, G. Xenoulis, D. Gizopoulos, and Y. Zorian
Software self-testing of embedded processor cores
which effectively partitions the testing effort between low-speed
external equipment and internal processor
resources, has been recently proposed as an alternative to
classical hardware built-in self-test techniques over which
it provides significant advantages.
In this
A P1500-Compatible Programmable BIST Approach for the Test of Embedded Flash Memories [p. 720]
-
P. Bernardi, M. Rebaudengo, M. Sonza Reorda, and M. Violante
In this paper we present a microprocessor-based approach
suitable for embedded flash memory testing in a System-on-achip
(SOC) environment. The main novelty of the approach is
the high flexibility, which guarantees easy exploitation of the
same architecture to different memory cores. The proposed
approach is compatible with the P1500 standard. A case study
has been developed and demonstrates the advantages of the
proposed core test strategy in terms of area overhead and test
application time.
-
Test Data Compression: The System Integrator's Perspective [p. 726]
-
P. Gonciari, B. Al-Hashimi, and N. Nicolici
Test data compression (TDC) is a promising low-cost
methodology for System-on-a-Chip (SOC) test. This is due
to the fact that it can reduce not only the volume of test data
but also the bandwidth requirements. In this paper we provide
a quantitative analysis of two distinctive TDC methods
from the system integrator's standpoint considering a core
based SOC environment. The proposed analysis addresses
four parameters: compression ratio, test application time,
area overhead and power dissipation. Based on our analysis,
some future research directions are given which can
lead to an easier integration of TDC in the SOC design flow
and to further improve the four parameters.
-
Time Domain Multiplexed TAM: Implementation and Comparison [p. 732]
-
Z. Ebadi and A. Ivanov
One of the difficult problems which core-based system-on-chip
(SoC) designs face is test access. For testing the
cores in a SoC, a special mechanism is required, since they
are not directly accessible via chip inputs and outputs. In
this paper we introduce a novel Test Access Mechanism
(TAM) based on time domain multiplexing (TDM-TAM).
This TAM is P1500 compatible and uses a P1500 wrapper.
The TAM characteristics are its flexibility, scalability, and
reconfigurability. The proposed TAM is compared with two
other approaches: a serial threading approach analogous
to the IEEE1149.1 standard (Serial TAM)[7]and a packet switching
test network (NIMA)[9]. A network-processing
engine SoC is used as a platform to compare the different
TAMs [6]. Results show that in most cases, TDM is the most
effective TAM in both test time and overhead area.
Keywords: SoC testing, Embedded core testing, Test Access
Mechanism (TAM), Time domain multiplexed TAM,
Optimal test time.
-
Layout-Driven SOC Test Architecture Design for Test Time and Wire Length Minimization [p. 738]
-
K. Goel and E. Marinissen
This paper extends existing SOC test architecture design approaches
that minimize required tester vector memory depth and test application
time, with the capability to minimize the wire length required by the
test architecture. We present a simple, yet effective wire length cost
model for test architectures together with a new test architecture design
algorithm that minimizes both test time and wire length. The user
specifies the relative weight of the costs of test time versus wire length.
In an integrated fashion, the algorithm partitions the total available
TAM width over individual TAMs, assigns the modules to these TAMs,
and orders the modules within one TAM such that the total cost is minimized.
Experimental results on five benchmark SOCs show that we can
obtain savings of up to 86% in wiring costs at the expense of <4% in
test time.
-
Delay Fault Testing of Core-Based Systems-On-A-Chip [p. 744]
-
Q. Xu and N. Nicolici
Existing approaches for modular manufacturing testing of
core-based systems-on-a-chip (SOCs) do not provide any
explicit mechanism for high quality two-pattern tests required
for performance validation through delay fault testing.
This paper proposes a new approach for broadside delay
fault testing of core-based SOCs, by adapting the existing
solutions for automatic test pattern generation and
design for test support, test access mechanism division and
test scheduling.
Moderators: W. Kunz, Kaiserslautern U, DE; T. Villa, Udine U, IT
-
Reducing Multi-Valued Algebraic Operations to Binary [p. 752]
-
J. Jiang, R. Brayton, and A. Mishchenko
Algebraic operations were developed for binary logic synthesis and
extended later to apply to multi-valued (MV) logic. Operations in
the MV domain were considered more complex and slower. This paper shows
that MV algebraic operations are essentially as easy as binary ones, with
only a slight overhead (linear in the size of the expression) in transformation
into and out of the binary domain. By introducing co-singleton sets as a
new basis, any MV sum-of-products expression can be rewritten and passed to
a binary logic synthesizer for fast execution; the optimized results can be
directly interpreted in the MV domain. This process, called EBD, reduces
MV algebraic operations to binary. A pure MV operation differs mainly
from its corresponding EBD one in that the former possesses "semi-algebraic"
generality, which has not been implemented for the binary logic. Experiments
show that the proposed methods are significantly faster, with little or no
loss in quality when run in comparable scripts of sequences of logic
synthesis operations.
-
Combination of Lower Bounds in Exact BDD Minimization [p. 758]
-
R. Ebendt, W. Günther, and R. Drechsler
Ordered Binary Decision Diagrams (BDDs) are a data
structure for efficient representation and manipulation of
Boolean functions. They are frequently used in logic synthesis
and formal verification. The size of BDDs depends on a
chosen variable ordering, i.e. the size may vary from linear
to exponential, and the problem of improving the variable
ordering is known to be NP-complete.
In this paper we present a new exact branch & bound
technique for determining an optimal variable order. In
contrast to all previous approaches, that only considered
one lower bound, our method makes use of a combination
of three bounds and by this avoids unnecessary computations.
The lower bounds are derived by generalization
of a lower bound known from VLSI design. They allow to
build the BDD either top down or bottom up. Experimental
results are given to show the efficiency of our approach.
-
Implicit Resolution of the Chapman-Kolmogorov Equations for Sequential Circuits:
An Application in Power Estimation [p. 764]
-
A. Freitas and A. Oliveira
In this work we describe an approach that implicitly
formulates and solves the Chapman-Kolmogorov equations
that describe the state probabilities associated with the stationary
behavior of sequential circuits. Unlike previous
approaches that assumed uncorrelated input signals, we
model the more general case where the sequential circuit
is driven by a sequence of inputs described by a discrete
time Markov chain. This Markov chain is described implicitly
using a formalism that allows for a compact description
of chains with an exponentially high number of
states. Using this approach, we present an application in
power estimation of sequential circuits that takes into account
all the temporal and spatial correlations between the
primary inputs and the internal signals. We present results
showing that, in some cases, it is possible to solve exactly
the Chapman-Kolmogorov equations for systems with more
than 107equations.
-
Performance-Directed Retiming for FPGAs Using Post-Placement Delay Information [p. 770]
-
U. Seidl, F. Johannes, and K. Eckl
In today's deep-submicron designs, the interconnect delays
contribute an increasing part to the overall performance
of an implementation. Particularly when targeting
field programmable gate arrays (FPGAs), interconnect delays
are crucial, since they can easily vary by orders of magnitude.
Many existing performance-directed retiming methods
use simple delay models which either neglect routing
delays or use inaccurate delay estimations. In this paper, we
propose a retiming approach which overcomes the problem
of inaccurate delay models. Our retiming technique uses
delay information extracted from a fully placed and routed
design and takes account of register timing requirements.
By applying physical constraints, we ensure that the delay
information remains valid during retiming. In our experiments,
we achieved up to 27% performance improvement.
Moderators: W. Fornaciari, Politecnico di Milano, IT; T. Thurner, DaimlerChrysler Research, DE
-
Exploring High Bandwidth Pipelined Cache Architecture for Scaled Technology [p. 778]
-
A. Agarwal, T. Vijaykumar, and K. Roy
In this paper we propose a design technique to pipeline
cache memories for high bandwidth applications. With the
scaling of technology cache access latencies are multiple
clock cycles. The proposed pipelined cache architecture can
be accessed every clock cycle and thereby, enhances
bandwidth and overall processor performance. The
proposed architecture utilizes the idea of banking to reduce
bit-line and word-line delay, making word-line to sense
amplifier delay to fit into a single clock cycle. Experimental
results show that optimal banking allows the cache to be
split into multiple stages whose delays are equal to clock
cycle time. The proposed design is fully scalable and can be
applied to future technology generations. Power, delay and
area estimates show that on average, the proposed
pipelined cache improves MOPS (millions of operations per
unit time per unit area per unit energy) by 40-50%
compared to current cache architectures.
-
Enhancing Speedup in Network Processing Applications by Exploiting Instruction
Reuse with Flow Aggregation [p. 784]
-
G. Surendra, S. Banerjee, and S. Nandy
Instruction reuse is a microarchitectural technique that
improves the execution time of a program by removing redundant
computations at run-time. Although this is the job
of an optimizing compiler, they do not succeed many a time
due to limited knowledge of run-time data. In this paper we
examine instruction reuse of integer ALU and load instructions
in network processing applications. Specifically, this
paper attempts to answer the following questions: (1) How
much of instruction reuse is inherent in network processing
applications?, (2) Can reuse be improved by reducing interference
in the reuse buffer?, (3) What characteristics of
network applications can be exploited to improve reuse?,
and (4) What is the effect of reuse on resource contention
and memory accesses? We propose an aggregation scheme
that combines the high-level concept of network traffic i.e.
"flows" with a low level microarchitectural feature of programs
i.e. repetition of instructions and data along with
an architecture that exploits temporal locality in incoming
packet data to improve reuse. We find that for the benchmarks
considered, 1% to 50% of instructions are reused
while the speedup achieved varies between 1% and 24%.
As a side effect, instruction reuse reduces memory traffic
and can therefore be considered as a scheme for low power.
-
On-Chip Stochastic Communication [p. 790]
-
T. Dumitras and R. Marculescu
As CMOS technology scales down into the deep-submicron
(DSM)
domain, the costs of design and verification for Systems-On-Chip
(SoCs) are rapidly increasing due to the inefficiency of traditional
CAD tools. Relaxing the requirement of 100% correctness for devices
and interconnects drastically reduces the costs of design but,
at the same time, requires that SoCs be designed with some system-level
fault-tolerance.
In this paper, we introduce a new communication
paradigm for SoCs, namely stochastic communication.
The newly proposed scheme not only separates communication
from computation, but also provides the required built-in
fault-tolerance
to DSM failures, is scalable and cheap to implement.
For a generic tile-based
architecture, we show how a ubiquitous
multimedia application (an MP3 encoder) can be implemented
using stochastic communication in an efficient and robust manner.
More precisely, up to 70% data upsets, 80% packet drops because
of buffer overflow, and severe levels of synchronization failures
can be tolerated while maintaining a low latency.
-
An Integrated Approach for Improving Cache Behavior [p. 796]
-
G. Memik, A. Choudhary, M. Kandemir, and I. Kadayif
The widening gap between processor and memory speeds
renders data locality optimization a very important issue in
data-intensive embedded applications. Throughout the years
hardware designers and compiler writers focused on optimizing
data cache locality using intelligent cache management mechanisms
and program-level transformations, respectively. Until
now, there has not been significant research investigating the interaction
between these optimizations. In this work, we investigate
this interaction and propose a selective hardware/compiler
strategy to optimize cache locality for integer, numerical (array-intensive),
and mixed codes. In our framework, the role of
the compiler is to identify program regions that can be optimized
at compile time using loop and data transformations and
to mark (at compile-time) the unoptimizable regions with special
instructions that activate/deactivate a hardware optimization
mechanism selectively at run-time. Our results show that
our technique can improve program performance by as much
as 60% with respect to the base configuration and 17% with
respect to a non-selective hardware/compiler approach.
-
Rapid Configuration and Instruction Selection for an ASIP: A Case Study [p. 802]
-
N. Cheung, S. Parameswaran, and J. Henkel
We present a methodology that maximizes the performance
of Tensilica based Application Specific Instruction-set
Processor (ASIP) through instruction selection when an area
constraint is given. Our approach rapidly selects from a set
of pre-fabricated coprocessors/functional units from our library
of pre-designed specific instructions (to evaluate our
technology we use the Tensilica platform). As a result, we
significantly increase application performance while area
constraints are satisfied. Our methodology uses a combination
of simulation, estimation and a pre-characterised library
of instructions, to select the appropriate co-processors
and instructions. We report that by selecting the appropriate
coprocessors/functional units and specific TIE instructions,
the total execution time of complex applications (we study
a voice encoder/decoder), an application's performance can
be reduced by up to 85% compared to the base implementation.
Our estimator used in the system takes typically less
than a second to estimate, with an average error rate of 4%
(as compared to full simulation, which takes 45 minutes).
The total selection process using our methodology takes 3-4
hours, while a full design space exploration using simulation
would take several days.
Moderators: W. Ecker, Infineon, DE; T. Kropf, Robert Bosch GmbH, DE
-
Local Search for Boolean Relations on the Basis of Unit Propagation [p. 810]
-
Y. Novikov
We propose a method for local search of Boolean
relations relating variables of a CNF formula. The method
is to branch on small subsets of the set of CNF variables
and to analyze results of unit propagation. By taking into
account variable value assignments deduced during the
unit propagation procedure the method is able to justify
any relation represented by a Boolean expression. The
proposed technique is based on bitwise logical operations
over ternary vectors. We implement a restricted version of
the method used for unit clause derivation and equivalent-literal
identification in a preprocessor engine for a SAT-solver.
The experiments show that the proposed technique
is useful for solving real-world instances of the formal
verification domain.
-
Set Manipulation with Boolean Functional Vectors for Symbolic Reachability Analysis [p. 816]
-
A. Goel and R. Bryant
Symbolic techniques usually use characteristic functions
for representing sets of states. Boolean functional vectors
provide an alternate set representation which is suitable
for symbolic simulation. Their use in symbolic reachability
analysis and model checking is limited, however, by
the lack of algorithms for performing set operations. We
present algorithms for set union, intersection and quantification
that work with a canonical Boolean functional vector
representation and show how this enables efficient symbolic
simulation based reachability analysis. Our experimental
results for reachability analysis indicate that the Boolean
functional vector representation is often more compact than
the corresponding characteristic function, thus giving significant
performance improvements on some benchmarks.
-
Efficient Preimage Computation Using a Novel Success-Driven ATPG [p. 822]
-
S. Sheng and M. Hsiao
Preimage computation is a key step in formal verification.
Pure OBDD-based symbolic method is vulnerable to
the space-explosion problem. On the other hand, conventional
ATPG/SAT-based method can handle large designs but
can suffer from time explosion. Unlike methods that combine
ATPG/SAT and OBDD, we present a novel success-driven
learning algorithm which significantly accelerates a ATPG engine
for enumerating all solutions (preimages). The algorithm
effectively prunes redundant search space due to overlapped
solutions and constructs a free BDD on the fly so that it becomes
the representation of the preimage set at the end. Experimental
results have demonstrated the effectiveness of the
approach, in which we are able to compute preimages for large
sequential circuits, where OBDD-based method fail.
-
Using Formal Techniques to Debug the AMBA System-On-Chip Bus Protocol [p. 828]
-
A. Roychoudhury, T. Mitra, and S. Karri
System-on-chip (SoC) designs use bus protocols for high
performance data transfer among the Intellectual Property
(IP) cores. These protocols incorporate advanced features
such as pipelining, burst and split transfers. In this paper,
we describe a case study in formally verifying a widely used
SoC bus protocol: the Advanced Micro-controller Bus Architecture
(AMBA) protocol from ARM.
In particular, we develop a formal specification of the
AMBA protocol. We then employ model checking, a state
space exploration based formal verification technique, to
verify crucial design invariants. The presence of pipelining
and split transfer in the AMBA protocol gives rise to interesting
corner cases, which are hard to detect via informal
reasoning. Using the SMV model checker, we have detected
a potential bus starvation scenario in the AMBA protocol.
Such scenarios demonstrate the inherent intricacies in designing
pipelined bus protocols.
-
Cross-Product Functional Coverage Measurement with Temporal Properties-Based Assertions [p. 834]
-
A. Ziv
Temporal specification languages provide an efficient
way to express events comprised of complex temporal scenarios.
Assertions based on these languages are used to
detect violations of the specification and monitor coverage
events. In this paper, we propose to extend temporal
specification languages, and assertions based on these languages
with auxiliary variables. We attach these variables
to sub-expressions and assign them values when the subexpressions
are evaluated. The use of auxiliary variables
enables the implementation of large cross-product coverage
models, using small number of assertions. This simplifies
the definition and implementation of coverage models and
helps reduce the simulation overhead caused by assertions,
thus increasing the efficiency of simulation resources.
Organizer/Moderator: R. Wittmann, Nokia, DE
Speakers:
J. Hartung, Cadence, DE
H.-J. Wassener, Atmel, DE
G. Tränkle, Infineon, DE
M. Schröter, CEDIC, DE
-
Panel Statement [p. 842]
-
The characteristics of highly integrated High-Speed Data-
Transmission-Systems for near future communication
standards are high mobility, low power, reliability, high
performance, low cost and short development time. Recent
available RF development tools do not allow the efficient
design of tailored, well partitioned, highly integrated, high
data-rate communication systems targeting up to 6 GHz
carrier frequency. Reasons for this are the missing
flexibility and accuracy in the hierarchical design process
from system down to physical level. In this session
representatives of system houses, a leading EDA provider
and a silicon foundry will introduce and discuss the
challenges related to the specific areas. Considering the
market relevance of high data rate wireless
communication systems the missing overall design
technology today is a real hot topic.
Suggestions on how to improve the existing situation
addressing the design technology as a whole will be
discussed. Depending on the representatives perspective
converging requirements will be worked out and non-converging
requirements will be debated in a
controversial manner.
Moderators: C. Sechen, Washington U, US; I. Markov, Michigan U, US
-
Power/Ground Mesh Area Optimization Using Multigrid-Based Technique [p. 850]
-
K. Wang and M. Marek-Sadowska
In this paper, we present a novel multigrid-based technique
for power/ground mesh area optimization subject to
reliability constraints. The multigrid-based technique is
applied to reduce a large-scale mesh to a much coarser
one. The reduced mesh can be efficiently optimized. The
solution for the original mesh is then computed using a
back-mapping process. Experimental results are very
encouraging. Large-scale power/ground meshes with millions
of nodes can be solved in a few minutes. The proposed
technique not only speeds up the optimization
process significantly without compromising the quality of
solutions, but also brings up a possibility of incorporating
the power/ground mesh optimization into other physical
design stages such as signal routing.
-
A New and Efficient Congestion Evaluation Model in Floorplanning:
Wire Density Control with Twin Binary Trees [p. 856]
-
S. Wai, E. Young, and C. Chu
As technology moves into the deep-submicron era, the
complexity of VLSI circuit design grows rapidly, especially
in the interconnections between modules. Therefore, interconnect
optimization has become an important concern in
floorplanning today. Most routability-driven floorplanners
[2][6][8] use grid-based approach that divides a floorplan
into grids as in global routing. Congestion is estimated as the
expected number of nets passing through each grid. Although
this approach is direct and accurate, it is not efficient when
dealing with complex circuit containing thousands of nets.
In this paper, an efficient and innovative routability-driven
floorplanner using twin binary trees (TBT)[9][10] representation
is proposed. The congestion model we used is the wire
density on the half-perimeter boundary of different regions
in a floorplan. These regions are defined naturally by the
TBT representation. In order to increase the efficiency of our
floorplanner, a fast algorithm for the least common ancestor
(LCA) problem in [1] is used to compute the wire density.
From the experimental results, the number of unroutable
wires can be reduced in a short time.
-
Crosstalk Reduction in Area Routing [p. 862]
-
R. Smey, P. Madden, and B. Swartz
Interconnect delay dominates system delay in modern
circuits, and with reduced feature sizes, coupling capacitance
and signal crosstalk have become significant issues.
By spacing interconnect wires effectively, and avoiding long
parallel runs, coupling can be reduced; with current routing
methods, however, this is difficult to achieve.
In this paper, we present a new approach to area routing.
Rather than inserting routes sequentially (using a performance
driven maze router), multiple candidate routes for
each connection are generated; excess routes are then eliminated
iteratively. Experiments show that we obtain substantial
reductions in coupling capacitance, without sacrificing
routing completion rates.
-
Area Fill Generation with Inherent Data Volume Reduction [p. 868]
-
Y. Chen, A. Kahng, Y. Zheng, G. Robins, and A. Zelikovsky
Control of variability and performance in the back end of the VLSI
manufacturing line has become extremely difficult with the introduction
of new materials such as copper and low-k dielectrics. Uniformity
of chemical-mechanical planarization (CMP) requires the
insertion of area fill features into the layout, in order to smoothen
the variation of feature densities across the die and thus improve
manufacturability. Because the size of area fill features is very
small compared with the large empty layout areas that must be
filled, the filling process can increase the size of a GDSII file by
an order of magnitude or more. Data compression is therefore a
significant issue in successful fill synthesis. In this paper, we introduce
compressed fill strategies which exploit the GDSII array
reference record (AREF) construct. We apply greedy and linear
programming based optimization techniques, and obtain practical
compressed filling solutions.
Organizer: H.-J. Schlebusch, Synopsys, DE
Moderator: G. Smith, Gartner Dataquest, US
Panellists:
D. Sciuto, Politecnico di Milano, IT
D. Gajski, UC Irvine, US
C. Mielenz, Infineon Technologies, DE
C.K. Lennard, ARM, UK
F. Ghenassia, STMicroelectronics, FR
S. Swan, Cadence, US
J. Kunkel, Synopsys, US
-
Panel Statement [p. 876]
-
Complex systems on chip (SoCs) present challenges in the design and
verification process that cannot be adequately addressed by traditional
methodologies based on register transfer descriptions. Some of the
aspects are efficient design exploration based on component reuse,
getting closure on the architecture, as well as early development, integration
and verification of embedded software. In search for responses to these
challenges, Transaction level modeling (TLM) has got quite some attention
in the area of SoC design. This panel attempts to do a reality check on
TLM from an engineering point of view.
Questions to discuss are: Is the Transaction Level (TL) really useful for
the design and/or for the verification of SoCs? How can TL speed up the design
process and lowering the risk of design failures ? What are the implications
on tools, languages, and Intellectual Property (IP) used in the
design/verification process ? The panelists will share their thoughts on
transaction based design and verification, and will discuss benefits and
issues based on their experiences of applying transaction level
methodologies.
Moderators: R. Drechsler, Bremen U, DE; C. Pixley, Synopsys, US
-
Validating SAT Solvers Using an Independent Resolution-Based Checker:
Practical Implementations and other Applications [p. 880]
-
L. Zhang and S. Malik
As the use of SAT solvers as core engines in EDA applications
grows, it becomes increasingly important to validate their
correctness. In this paper, we describe the implementation of an
independent resolution-based checking procedure that can check
the validity of unsatisfiable claims produced by the SAT solver
zchaff. We examine the practical implementation issues of such
a checker and describe two implementations with different pros
and cons. Experimental results show low overhead for the
checking process. Our checker can work with many other
modern SAT solvers with minor modifications, and it can
provide information for debugging when checking fails. Finally
we describe additional results that can be obtained by the
validation process and briefly discuss their applications.
-
Verification of Proofs of Unsatisfiability for CNF Formulas [p. 886]
-
E. Goldberg and Y. Novikov
As SAT-algorithms become more and more complex, there
is little chance of writing a SAT-solver that is free of bugs.
So it is of great importance to be able to verify the
information returned by a SAT-solver. If the CNF formula
to be tested is satisfiable, solution verification is trivial and
can be easily done by the user. However, in the case of
unsatisfiability, the user has to rely on the reputation of the
SAT-solver. We describe an efficient procedure for
checking the correctness of evelop better heuristics to
further improve its efficiency, by either speeding up the Boolean
Constraint Propagation (BCP) procedure or finding a better decision
ordering (or both). In this paper, we propose an entirely different
SAT solver design concept that is circuit-based. Our solver
is able to utilize circuit topological information and signal correlations
to enforce a decision ordering that is more efficient for solving
circuit-based SAT problem instances. In particular, for unsatisfiable
circuit examples, our solver is able to achieve from 2x up
to more than 75x speedup over a state-of-the-art SAT solver.
-
Improving SAT-Based Bounded Model Checking by Means of BDD-Based Approximate Traversals [p. 898]
-
G. Cabodi, S. Nocco, and S. Quer
Binary Decision Diagrams (BDDs) have been widely used for hardware
verification since the beginning of the '90s, whereas Boolean Satisfiability
(SAT) has been gaining ground more recently, with the introduction of
Bounded Model Cheking (BMC).
In this paper we dovetail BDD and SAT based methods to improve
the efficiency of BMC. More specifically, we first exploit inexpensive symbolic
approximate reachability analysis to gather information on the state
space. We then use the above information to restrict and focus the overall
search space of SAT based BMC.
In the experimental results section we show how the information coming
from a BDD tool can improve the efficiency of a SAT engine by drastically
reducing the number of "variable assignments" and "variable conflicts".
This results in a significant overall performance gain associated
with a general, robust, and easy-to-apply methodology.
Moderators: L. Thiele, ETH Zurich, CH; Z. Peng, Linköping U, SE
-
Generalized Data Transformations for Enhancing Cache Behavior [p. 906]
-
V. De La Luz, M. Kandemir, I. Kadayif, and U. Sezer
The performance gap between processors and off-chip
memories has widened in the last years and is expected to
widen even more. Today, it is widely accepted that caches improve
significantly the performance of programs, since most
of the programs exhibit temporal and/or spatial locality in
their memory reference patterns. However, conflict misses
can be a major obstacle preventing an application from utilizing
the data cache effectively. While array padding can
reduce conflict misses it can also increase the data space requirements
significantly. In this paper, we present a compiler-based
data transformation strategy, called the "generalized
data transformations," for reducing inter-array conflict
misses in embedded applications. We present the theory behind
the generalized data transformations and discuss how
they can be integrated with compiler-based loop transformations.
Our experimental results demonstrate that the generalized
data transformations are very effective in improving data
cache behavior of embedded applications.
-
Software Streaming via Block Streaming [p. 912]
-
P. Kuacharoen, V. Mooney, and V. Madisetti
Software streaming allows the execution of stream-enabled
software on a device even while the transmission/
streaming may still be in progress. Thus, the software
can be executed while it is being streamed instead of
causing the user to wait for the completion of download,
decompression, installation and reconfiguration. Our
streaming method can reduce application load time seen
by the user since the application can start running as
soon as the first executable unit is loaded into the memory.
Furthermore, unneeded parts of the application
might not be downloaded to the device. As a result, resource
utilization such as memory and bandwidth usage
may also be more efficient. Using our streaming method,
an embedded device can support a wide range of real-time
applications. The applications can be run on demand.
In this paper, a streaming method we call block
streaming is proposed. Block streaming is determined at
the assembly code level. We implemented a tool to partition
real-time software into parts which can be transmitted
(streamed) to the embedded de-vice. Our streaming
method was implemented and simulated on a hardware/
software co-simulation platform in which we used
the PowerPC architecture. We show a robotics application
that without our streaming method is unable to meet
its real-time deadline. However, with our software
streaming method, the application is able to meet its
deadline. The application load time for this application
also improves by a factor of more than 10X when compared
to downloading the entire application before running
it.
-
Energy-Aware Adaptive Checkpointing in Embedded Real-Time Systems [p. 918]
-
Y. Zhang and K. Chakrabarty
We present an integrated approach that provides fault
tolerance and dynamic power management for a real-time
task executing in an embedded system. Fault tolerance is
achieved through an adaptive checkpointing scheme that
dynamically adjusts the checkpointing interval during task
execution. Adaptive checkpointing is then combined with a
dynamic voltage scaling scheme to achieve power reduction.
The resulting energy-aware adaptive checkpointing scheme
uses a dynamic voltage scaling criterion that is based not
only on the slack in task execution but also on the
occurrences of faults during task execution. Simulation
results show that compared to previous methods, the
proposed approach significantly reduces power consumption
and increases the likelihood of timely task completion in the
presence of faults.
Moderators: M. Renaudin, TIMA Laboratory, FR; A. Koelmans, Newcastle U, UK
-
Visualization and Resolution of Coding Conflicts in Asynchronous Circuit Design [p. 926]
-
A. Madalinski, A. Bystrov, V. Khomenko, and A. Yakovlev
Synthesis of asynchronous circuits from Signal Transition
Graphs (STGs) involves resolving state coding conflicts.
The refinement process is generally done automatically
using heuristics and often produces sub-optimal solutions,
which have to be corrected manually. This paper
presents a framework for an interactive refinement process
aimed to help the designer. It is based on the visualization
of conflict cores, i.e., sets of transitions causing coding
conflicts, which are represented at the level of finite and
complete prefixes of STG unfoldings.
-
STG Optimisation in the Direct Mapping of Asynchronous Circuits [p. 932]
-
D. Sokolov, A. Bystrov, and A. Yakovlev
Direct mapping from Petri nets (PN) and Signal Transition
Graphs (STG) avoids algorithmic complexity inherent
in logic synthesis methods based on optimal state encoding.
However, it may lead to inefficient implementation, both in
size and performance, due to excessive use of state-holding
elements.This paper presents a set of tools that optimise
logic produced by the direct mapping technique by means
of: exposure of outputs, detection and elimination of redundant
places. Output exposure is an approach to explicitly
model output signals as STG places, which can be directly
mapped into output flip-flops. The STG can be simplified after
output exposure. The detection of redundant places is a
computationally hard problem with multiple solutions. The
tool solves this problem by using several heuristics aimed at
speed and size. All operations preserve behavioural equivalence.
The efficiency of the overall algorithm and individual
heuristics is analysed using a number of benchmarks.
Moderators: T.J. Kazmierski, Southampton U, UK; W. Mueller, C-LAB Paderborn, DE
-
Ubiquitous Access to Reconfigurable Hardware: Application Scenarios and Implementation Issues [p. 940]
-
L. Indrusiak, F. Lubitz, R. Reis, and M. Glesner
This paper presents an approach for the integration of
reconfigurable hardware and computer applications
based on the concept of ubiquitous computing. The goal is
to allow a network of reconfigurable hardware modules
to be transparently accessible by client applications. The
communication between them is done at the API level,
and a Jini-based infrastructure is used to provide an
interface for the client applications to find available
reconfigurable hardware modules over the network. A
DES-based cryptography system was implemented as a
case study.
-
Dynamic Tool Integration in Heterogeneous Computer Networks [p. 946]
-
H. Eikerling, W. Mueller, T. Schattkowsky, and J. Wegner
Tool installation and automation of administrative tasks
in heterogeneous computer networks becomes of increasing
importance with the availability of complex
heterogeneous computer networks. This article introduces
a new approach for dynamic network tool management,
i.e., TRMS. A variant of TRMS using SNMP -- a well
established standard for network administration -- is
outlined and illustrated by the application of the
integration and management of design tools for Printed
Circuit Boards (PCBs).
Moderators: J. Madsen, TU Denmark, DK; F. Rousseau, TIMA Laboratory, FR
-
Layered, Multi-Threaded, High-Level Performance Design [p. 954]
-
A. Cassidy, J. Paul, and D. Thomas
A primary goal of high-level modeling is to efficiently
explore a broad design space, converging on an optimal or
near-optimal system architecture before moving to a more
detailed design. This paper evaluates a high-level, layered
software-on-hardware performance modeling environment
called MESH that captures coarse-grained, interacting
system elements. The validity of the high-level model is
established by comparing the outcome of the high-level
model with a corresponding low-level, cycle-accurate
instruction set simulator. We model a network processor and
show that both high and low level models converge on the
same architecture when design modifications are classified
as good or bad performance impacts.
-
A Co-Design Methodology for Energy-Efficient Multi-Mode Embedded Systems with
Consideration of Mode Execution Probabilities [p. 960]
-
M. Schmitz, B. Al-Hashimi, and P. Eles
Multi-mode systems are characterised by a set of interacting
operational modes to support different functionalities and standards.
In this paper, we present a co-design methodology for
multi-mode embedded systems that produces energy-efficient implementations.
Based on the key observation that operational
modes are executed with different probabilities, i.e., the system
spends uneven amounts of time in the different modes, we develop
a novel co-design technique that exploits this property to significantly
reduce energy dissipation. We conduct several experiments,
including a smart phone real-life example, that demonstrate
the effectiveness of our approach. Reductions in power
consumption of up to 64% are reported.
-
Processor/Memory Co-Exploration on Multiple Abstraction Levels [p. 966]
-
G. Braun, A. Wieferink, O. Schliebusch, R. Leupers, and H. Meyr, and A. Nohl
Recently, the evolution of embedded systems has shown
a strong trend towards application-specific, single-chip solutions.
As a result, application-specific instruction set processors
(ASIP) are more and more replacing off-the-shelf
processors in such systems-on-chip (SoC). Along with the
processor cores, heterogeneous memory architectures play
an important role as part of the system. According to last
year's ITRS [5], in 2004 about 70 percent of the chip area
will be made up of memories. As such architectures are
highly optimized for a particular application domain, processor
core and memory subsystem design cannot be apart,
but have to merge into an efficient design process. In this
paper, we present a unified approach for processor/memory
co-exploration using an architecture description language.
We show an efficient way of considering instruction set and
memory architecture during the entire exploration process.
Finally, we illustrate the feasibility of our approach with a
real-world case study.
Moderators: Y. Tanurhan, Actel, US; C. Passerone, Politecnico di Torino, IT
-
Run-Time Management of Logic Resources on Reconfigurable Systems [p. 974]
-
M. Gericota, G. Alves, M. Silva, and J. Ferreira
Dynamically reconfigurable systems based on partial
and dynamically reconfigurable FPGAs may have their
functionality partially modified at run-time without
stopping the operation of the whole system.
The efficient management of the logic space available
is one of the biggest problems faced by these systems.
When the sequence of reconfigurations to be performed is
not predictable, resource allocation decisions have to be
made on-line. A rearrangement may be necessary to get
enough contiguous space to implement incoming
functions, avoiding the spreading of their components and
the resulting degradation of system performance.
A new software tool that helps to handle the problems
posed by the consecutive reconfiguration of the same logic
space is presented in this paper. This tool uses a novel on-line
rearrangement procedure to solve fragmentation
problems and to rearrange the logic space in a way
completely transparent to the applications currently
running.
-
Managing a Reconfigurable Processor in a General Purpose Workstation Environment [p. 980]
-
M. Dales
Reconfigurable processor hybrids are becoming an accepted
solution in the embedded systems domain, but have
yet to gain acceptance in the general purpose workstation
domain. One problem with current solutions is their lack of
support for the dynamic workloads and resource demands
of a general purpose workstation. In this paper we describe
and demonstrate a reconfigurable processor architecture
that lets the operating system dynamically share the FPL
resource between a set of applications without the management
overheads negating the benefit of having the extra resource.
-
Infrastructure for Design and Management of Relocatable Tasks in a
Heterogeneous Reconfigurable System-On-Chip [p. 986]
-
J. Mignolet, V. Nollet, P. Coene, S. Vernalde, D. Verkest, and R. Lauwereins
The ability to (re)schedule a task either in hardware or
software will be an important asset in a reconfigurable
systems-on-chip. To support this feature we have developed
an infrastructure that, combined with a suitable
design environment permits the implementation and management
of hardware/software relocatable tasks. This
paper presents the general scope of our research, and
details the communication scheme, the design environment
and the hardware/software context switching issues.
The infrastructure proved its feasibility by allowing us to
design a relocatable video decoder. When implemented
on an embedded platform, the decoder performs at 23
frames (320x240 pixels, 16 bits per pixel) in reconfigurable
hardware and 6 FRAMES/s in software.
Moderators: S. Kundu, Intel, US; B. Straube, FhG IIS/EAS Dresden, DE
-
RTL Test Pattern Generation for High Quality Loosely Deterministic BIST [p. 994]
-
M. Santos, J. Fernandes, I. Teixeira, and J. Teixeira
High quality Built-In Self Test (BIST) needs to efficiently tackle the
coverage of random-pattern-resistant (r.p.r) defects. Several techniques
have been proposed to cover r.p.r faults at logic level, namely, weighted
pseudo-random and mixed-mode. In mixed-mode test pattern generation (TPG)
techniques, deterministic tests are added to pseudo-random vectors to detect
r.p.r faults. Recently, a RTL mixed-mode TPG technique has been proposed
to cover r.p.r defects, the mask-based BIST technique. The purpose of this
paper is to present mask-based BIST TPG improvements, namely in two areas:
RTL estimation of the test length to be used for each mask, in order to reach
high Defects Coverage (DC), and the identification of an optimum mask for
each set of nested RTL conditions. Results are used to predict the
number of customized vectors for each masks of one ITC'99 benchmark module.
-
A New Approach to Test Generation and Test Compaction for Scan Circuits [p. 1000]
-
I. Pomeranz and S. Reddy
We propose a new approach to test generation and test
compaction for scan circuits that eliminates the distinction
between scan operations and application of primary input
vectors. Under this approach, the scan-in, scan-select and
scan-out lines are treated as conventional primary inputs
or primary outputs of the circuit. As a result, limited scan
operations, where scan chains are shifted a number of
times smaller than their lengths, are incorporated naturally
into the test sequences generated by this approach. This
leads to very aggressive compaction, resulting in test
sequences with the lowest known test application times
for benchmark circuits.
-
Fully Automatic Test Program Generation for Microprocessor Cores [p. 1006]
-
F. Corno, G. Cumani, M. Sonza Reorda, and G. Squillero
Microprocessor cores are a major challenge in the test arena: not only is their complexity always increasing, but also their specific characteristics intensify all difficulties. A microprocessor embedded inside a SOC is even harder to test since its input might be harder to control and its behavior may be harder to observe. Functional testing is an effective solution which consists in forcing the microprocessor to execute a suitable test program. This paper presents a new approach to automatic test program generation exploiting an evolutionary paradigm. It overcomes the main limitations of previous methodologies and provides significantly better results. Human intervention is limited to the enumeration of all assembly instructions. Also internal parameters of the optimizer are auto-adapted. Experimental results show the effectiveness of the approach.
-
On the Characterization of Hard-To-Detect Bridging Faults [p. 1012]
-
I. Pomeranz, S. Reddy, and S. Kundu
We investigate a characterization of hard-to-detect bridging
faults. For circuits with large numbers of lines (or
nodes), this characterization can be used to select target
faults for test generation when it is impractical to target all
the bridging faults (or all the realistic bridging faults). We
demonstrate that the faults selected based on the proposed
characterization are indeed hard-to-detect by showing that
the fault coverage of a given test set with respect to this
subset is lower and more sensitive to the test set than the
fault coverage obtained with respect to a random subset of
the same size, with respect to the complete set of faults,
and when possible, with respect to a subset of realistic
bridging faults of the same size. We also demonstrate that
a test set for the selected subset of faults detects other
faults more effectively than when a test set is derived for a
randomly selected subset of faults of the same size.
Moderators: M. Zwolinski, Southampton U, UK; P. Schwarz, FhG IIS/EAS Dresden, DE
-
The Power Grid Transient Simulation in Linear Time Based on 3D
Alternating-Direction-Implicit Method [p. 1020]
-
Y. Lee and C. Chen
The rising power consumption and clock frequency of VLSI
technology demand robust and stable power delivery. Extensive
transient simulations on large scale power delivery
structures are required to analyze power delivery fluctuation
caused by dynamic IR-, and Ldi/dt drop as well as package
and on-chip resonance. This paper develops a novel
and efficient transient simulation algorithm for the power distribution
networks. The 3D TLM-ADI (Transmission-Line-Modeling
Alternating-Direction-Implicit) method, first models
the power delivery structure as three dimensional transmission
line shunt node structure and transfers those equations to
the Telegraph equation. Finally, we solve it by the alternating
direction implicit method. The 3D TLM-ADI method, with
linear runtime and memory requirement, is also unconditionally
stable which ensures that the time-steps are not limited
by any stability requirement. Numerical experimental results
show that the 3D TLM-ADI method is not only over 300,000
times faster than SPICE but also extremely memory saving and
accurate.
-
Transistor-Level Static Timing Analysis by Piecewise Quadratic Waveform Matching [p. 1026]
-
Z. Wang and J. Zhu
While fast timing analysis methods, such as asymptotic
waveform evaluation (AWE), have been well established
for linear circuits, the timing analysis for non-linear circuits,
which are dominant in digital CMOS circuits, is
usually performed by a SPICE like, time domain integration
based approach, involving expensive Newton Raphson
iterations at numerous time steps. In this paper, we
propose a new technique that leads to the transient solution
of charge/discharge paths with a complexity equivalent
to only K DC operating point calculations, where K
is the number of transistors along the path. This is accomplished
by approximating each nodal voltage as a piecewise
quadratic waveform, whose characteristics can be determined
by matching the charge/discharge currents. Experiments
on a wide range of circuits show that a 31.6 times
speed-up over SPICE transient simulation with 10ps step
size can be achieved, while maintaining an average accuracy
of 99%.
-
A Fast Algorithm for the Layout Based Electro-Thermal Simulation [p. 1032]
-
M. Rencz, V. Székely, and A. Poppe
A new algorithm has been developed for the layout based
direct electro-thermal simulation of integrated circuits. The
advantage of the direct electro-thermal simulation over
simulator coupling is, that very fast changes can also be
considered, the drawback is that the thermal nodes are
added to the number of nodes of the network to be
simulated. The novelties of our method are the modeling
and the solution of the thermal structure. This paper
presents the algorithm of the time constant spectrum
based FOSTER chain matrix thermal modeling, and the new
algorithm of the coupled electro-thermal solution, where
parts of the network, which represent the thermal
behavior, are not computed in all steps of the iteration.
This speeded up algorithm works both in the time-, and in
the frequency domain. A simulation example demonstrates
a typical application: the prediction of how the layout
arrangement and the packaging of an analogue integrated
circuit influence the electrical parameters.
-
Platform Based Testbench Generation [p. 1038]
-
R. Henftling, M. Bauer, M. Zambaldi, A. Zinn, and W. Ecker
This paper presents a new technology that
accelerates system verification. In a real life
example, we achieved a speed-up of a factor of
about 5000. The key for this speed-up is a
configurable, synthesizable testbench
architecture, which can be completely mapped to
emulators or FPGAs. Exploiting generic
controllers and re-using protocol-specific stimuli
generators combined with topology and microprogram
generation is responsible for almost
zero overhead compared to behavioral
testbenches.
Moderators: M. Miranda, IMEC, BE; F. Fallah, Fujitsu, US
-
Software Architectural Transformations: A New Approach to Low Energy Embedded Software [p. 1046]
-
T. Tan, A. Raghunathan, and N. Jha
Previous work on software optimization for low energy has focussed on
instruction-level optimizations and compiler techniques. We argue, and
demonstrate, that significant energy savings could be "left on the table" if
energy is not considered during the design of the software architecture. As
a first step towards addressing this gap, we propose a systematic framework for
software architectural transformations to reduce energy consumption.
We consider software architectural transformations in the context of the
multi-process software style driven by an operating system (OS), which is
very commonly employed in energy-sensitive embedded systems. Our methodology
for applying software architectural transformations consists of (i) constructing
a software architecture graph representation, (ii) deriving initial energy
and performance statistics using a detailed energy simulation framework,
(iii) constructing sequences of atomic software architectural transformations,
guided by energy change estimates derived from high-level energy macro-models,
that result in maximal energy reduction, and (iv) generation of program source
code to reflect the optimized software architecture. We employ a wide suits of
software architectural transformations whose effects span the application-OS
boundary, including how the program functionality is structured into
architectural components (e.g., application process, signal handlers, and
device drivers), and connectors between them (inter-component synchronization
and communication mechanisms).
We present experimental results on several multi-process embedded software
programs, in the context of an embedded system that features the Intel
StrongARM processor and the embedded Linux OS. The presented results clearly
underscore the potential of the proposed methodology (up to 66.1% reduction
in energy is obtained). In a broader sense, our work demonstrates the impact
of considering energy during the earlier stages of the software design
process.
-
Dynamic Functional Unit Assignment for Low Power [p. 1052]
-
S. Haga, N. Reeves, R. Barua, and D. Marculescu
A hardware method for functional unit assignment is presented,
based on the principle that a functional unit's power
consumption is approximated by the switching activity of its
inputs. Since computing the Hamming distance of the inputs
in hardware is expensive, only a portion of the inputs
are examined. Integers often have many identical top bits,
due to sign extension, and floating points often have many
zeros in the least significant digits, due to the casting of
integer values into floating point. The accuracy of these approximations
is studied and the results are used to develop
a simple, but effective, hardware scheme.
-
Implementation and Evaluation of an On-Demand Parameter-Passing Strategy for Reducing Energy [p. 1058]
-
M. Kandemir, W. Zhang, and I. Kolcu
In this paper, we present an energy-aware parameter-passing
strategy called on-demand parameter-passing. The
objective of this strategy is to eliminate redundant actual
parameter evaluations if the corresponding formal parameter
in a subroutine is not used during execution. This on-demand
parameter-passing is expected to be very successful
in reducing energy consumption of large, multi-routine
embedded applications at the expense of a slight implementation
complexity.
-
Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors [p. 1064]
-
D. Nicolaescu, A. Veidenbaum, and A. Nicolau
Modern embedded processors use data caches with
higher and higher degrees of associativity in order to increase
performance. A set-associative data cache consumes
a significant fraction of the total power budget in
such embedded processors. This paper describes a technique
for reducing the D-cache power consumption and
shows its impact on power and performance of an embedded
processor. The technique utilizes cache line address
locality to determine (rather than predict) the cache way
prior to the cache access. It thus allows only the desired
way to be accessed for both tags and data. The proposed
mechanism is shown to reduce the average L1 data cache
power consumption when running the MiBench embedded
benchmark suite for 8, 16 and 32-way set-associate caches
by, respectively, an average of 66%, 72% and 76%. The
absolute power savings from this technique increase significantly
with associativity. The design has no impact on
performance and, given that it does not have mis-prediction
penalties, it does not introduce any new non-deterministic
behavior in program execution.
Moderators: R. Hermida, Madrid Complutense U, ES; P. Eles, Linköping U, SE
-
Layer Assignment Techniques for Low Energy in Multi-Layered Memory Organisations [p. 1070]
-
E. Brockmeyer, M. Miranda, F. Catthoor, and H. Corporaal
Nearly all platforms use a multi-layer memory hierarchy
to bridge the enormous latency gap between the large
off-chip memories and local register files. However, most of
previous works on HW or SW controlled techniques for
layer assignment have been mainly focussed on performance.
As a result, the intermediate layers have been assigned too
large sizes leading to energy inefficiency. In this paper
we present a technique that takes advantage of both the
temporal locality and limited lifetime of the arrays of the
application for minimum energy consumption under layer size
constraints. A prototype tool has been developed and tested using
two real-life applications of industrial relevance. Following
this approach we have been able to half the energy consumed by
the memory hierarchy for each of our drivers.
-
Mesh Partitioning Approach to Energy Efficient Data Layout [p. 1076]
-
S. Hettiaratchi and P. Cheung
Memory access consumes a significant amount of energy
in data transfer intensive applications. The selection of a
memory location from a CMOS memory cell array involves
driving row and column select lines. A switching event on a
row select line often consumes significantly more energy in
comparison to a switching event on a column select line.
In order to exploit this difference in energy consumption
of row and column select lines, we propose a novel data
layout method that aims to minimize row switching activity
by assigning spatially and temporally local data items
to the same row. The problem of minimum row switching
data layout has been formulated as a multi-way mesh partitioning
problem. The constraints imposed on the problem
formulation ensure that the complexity of the address generator
required to implement the optimized data layout is
bounded and that the data layout optimization can be applied
to all address generator synthesis methods. Our experiments
demonstrate that our method can significantly reduce
row transition counts over row major data layout.
-
On-Chip Stack Based Memory Organization for Low Power Embedded Architectures [p. 1082]
-
M. Mamidipaka and N. Dutt
This paper presents a on-chip stack based memory organization
that effectively reduces the energy dissipation in
programmable embedded system architectures. Most embedded
systems use the notion of stack for implementation
of function calls. However, such stack data is stored in processor
address space, typically in the main memory and accessed
through caches. Our analysis of several benchmarks
show that the callee saved registers and return addresses
for function calls call constitute a significant portion of the
total memory accesses. We propose a separate stack-based
memory organization to store these registers and return addresses.
Our experimental results show that effective use of
such stack-based memories yield significant reductions in
system power/energy, while simultaneously improving the
system performance. Application of our approach to the
SPECint95 and MediaBench benchmark suites show a reduction
of up to 32.5% reduction in energy in L1 data cache,
with marginal improvements in system performance.
-
Figure of Merit Based Selection of A/D Converters [p. 1190]
-
M. Vogels and G. Gielen
A new method for selecting analog to digital (A/D) converters
based on a generic figure of merit is described. First
a figure of merit is introduced that includes both specifications
and technology data and that has five generic parameters.
The values of these generic parameters can be found
by means of a fitting procedure using data from published
designs. It is shown that the generic parameters have different
values for different types of converters. Therefore the
trade-off between speed, resolution, power dissipation and
technology parameters depends on the type of converter.
This trade-off can than be used to select a particular type of
converter for a given application area.
-
XBM2PLA: A Flexible Synthesis Tool for Extended Burst Mode Machines [p. 1092]
-
O. Kraus and M. Padeffke
This paper describes the results of a new synthesis
tool (XBM2PLA) for asynchronous state machines [2].
XBM2PLA generates the boolean functions for an asynchronous
circuit. XBM2PLA is compatible to several
other tools but often generates smaller results. Moreover,
XBM2PLA allows a more flexible description of asynchronous
circuits than other existing tools.
-
Multithreaded Synchronous Data Flow Simulation [p. 1094]
-
J. Kin and J. Pino
This paper introduces an efficient multithreaded
synchronous dataflow (SDF) scheduler that can
significantly reduce the running time of multi-rate SDF
simulations on multiprocessor machines with only a slight
increase of memory usage over standard cluster loop
scheduler[1]. Experiments run on a dual processors
machine achieves on average approximately 146% increase
in performance with less than 2.4% increase in memory
usage. There is an average of 2x speedup with a quad
processors machine.
-
PLFire: A Visualization Tool for Asynchronous Phased Logic Designs [p. 1096]
-
K. Fazel, M. Thornton, and R. Reese
We present a visualization tool called PLFire, which
allows a user to observe the behavior of a Phased Logic
(PL) circuit. Phased logic is a technique for realizing
self-timed circuitry that is delay-insensitive and requires
no global clock. One advantage of self-timed circuits is
that throughput is based on average propagation delays
and not worst-case delay. By being able to visualize the
operation of a PL circuit, including the token flow, a
designer gets a better understanding of what features of a
design have the greatest impact on performance.
-
Extraction of Piecewise-Linear Analog Circuit Models from Trained Neural
Networks Using Hidden Neuron Clustering [p. 1098]
-
S. Doboli, G. Gothoskar, and A. Doboli
This paper presents a new technique for automatically
creating analog circuit models. The method extracts -- from
trained neural networks -- piecewise linear models expressing
the linear dependencies between circuit performances
and design parameters. The paper illustrates the technique
for an OTA circuit for which models for gain and bandwidth
were automatically generated. The extracted models have a
simple form that accurately fits the sampled points and the
behavior of the trained neural networks. These models are
useful for fast simulation of systems with non-linear behavior
and performances.
-
Simulation and Analysis of Embedded DSP Systems Using MASIC Methodology [p. 1100]
-
A. Deb, J. Öberg, and A. Jantsch
MASIC, short for Maths to ASIC, models embedded DSP
systems using grammar based technique. This paper
presents a simulation and analysis technique for the
MASIC model of a system. We build a Petri Net (PN)
representation of the MASIC model to simulate the
communication architecture. It helps to evaluate the
effects of architectural decisions early in the design cycle.
The correctness of the protocol description is analyzed
using PN based boundedness and conservation analysis.
-
A Custom-Cell Identification Method for High-Performance Mixed Standard/Custom-Cell Designs [p. 1102]
-
J. Lo, W. Kuo, T. Hwang, and A. Wu
Over the years, many design methodologies/tools and layout
architectures have been developed for datapath-oriented
designs. One commonly used approach for high-speed
datapath designs is the full-custom design method targeted to
bit-sliced or bit-alignment layout architectures. Using this
approach, designers can fully exploit design properties, such
as various circuit designs, structural regularities and layout
structures, to develop high-speed, low-power and high-density
datapath designs. However, the main drawbacks of
this approach are threefold: (1) extremely high development
cost, (2) very long development time, and (3) difficult to
migrate the custom design to a new technology.
The other approach is the HDL-based synthesis
method targeted to the standard-cell layout architecture. This
approach is highly productive and easy to implement.
However, due to lack of the ability to exploit the regularity
properties of datapath designs, it's mostly suitable to produce
low/medium-speed datapath designs. Another design method
targeted to mixed standard- and custom-cells is carried out as
follows. First, it generates a standard-cell design to obtain a
maximal achievable speed. Second, it performs a cell
customization process to determine which standard cells need
to be customized in order to satisfy the given timing
constraint. This cell customization process is usually
performed incrementally based on designers' intuitions and
experiences. Once designers select a set of standard cells to
be customized, they will define the specification of the
custom cells and develop the cells accordingly. Finally, the
designers will replace the standard cells with the custom ones
and re-evaluate the speed of the design. The customization
process will continue until the timing constraint is satisfied.
In this paper, we present a custom-cell identification
method for high-speed datapath-oriented designs. The
proposed method uses an HDL-based standard-cell design
flow by integrating a custom-cell identification algorithm to
determine a minimal-cost cell set and the design budgets for
cell customization. Experimental results on a set of
benchmarking designs are reported to demonstrate the
effectiveness of the proposed method. Note that our problem
is different from the cell-sizing problem. While the cell-sizing
problem is to select and to replace (re-size) cells
(instances of cells in cell library) one by one, our problem is
to select and re-design cells in the cell library and to replace
all instances of the cell in the design at one time.
-
Hierarchical Global Floorplacement Using Simulated Annealing and Network Flow Area Migration [p. 1104]
-
W. Choi and K. Bazargan
Floorplanning large designs with many hard
macros and IP blocks of various sizes is becoming an
increasingly important and challenging problem. This
paper presents a global floorplacement method that
combines a hierarchical simulated annealing
floorplanning method with a partitioning-based global
placement technique. A novel area migration method
formulated as a min-cost, max-flow network flow problem
is used to improve area utilization, and provide a
communication mechanism between the partitioning
engine and the placement method for better design
quality. The network flow area migration method can be
used in managing incremental changes in the design as
well. Our global placement wire length is 12% better than
the detailed placement wire length of a previous work,
while our global placement is almost 8 times faster than
their global placement.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Placement and routing.
General Terms
Algorithms, Management, Design.
Keywords
Floorplanning, floorplacement, global placement,
hierarchical, network flow, area migration, simulated
annealing.
-
LIT -- An Automatic Layout Generation Tool for Trapezoidal Association of Transistors for
Basic Analog Building Blocks [p. 1106]
-
A. Girardi and S. Bampi
This paper describes a methodology for analog layout
synthesis based on the automatic generation of an
equivalent composite transistor with the same DC current
characteristics of the transistors in the electrical
schematic. The tool serves a dual purpose: i) the layout
synthesis of analog blocks over a digital sea-of-gates pre-diffused
array, and ii) the generation of custom
associations of transistors for matched common-source
input pairs and current mirrors. The LIT tool generates
the layout in a line-matrix, sea-of-gates with gate isolation
style of several blocks: the trapezoidal-like composite
transistors and of matched transistor pairs. In addition,
the tools provides an environment for manual analog cells
placement and automatic routing. These features
drastically reduce the design time, reduce costs and
include matching properties.
-
Symbolic Analysis of Nonlinear Analog Circuits [p. 1108]
-
A. Manthe, Z. Li, R. Shi, and K. Mayaram
A new method is presented to model symbolically strongly nonlinear circuits, characterized by Piece-Wise Linear (PWL) functions. The method follows the idea of Bokhoven and Leenaerts, and formulates the problem as a linear complementarity problem (LCP). A new graph representation, called complementarity decision diagram, is introduced to represent the explicit LCP solutions. Orders of magnitude improvements in computational efficiency and memory usage have been obtained.
-
Improved Time Domain Simulation of Optical Multimode Intrasystem Interconnects [p. 1110]
-
J. Gerling, O. Stübbe, J. Schrage, G. Mrozynski, and J. Teich
To increase the bandwidth of high-performance intrasystem
interconnections optical multimode waveguides can be
used. Since the design procedure of optical interconnections
has to be widely compatible with conventional design
processes, adequate simulation methods are required. This
paper presents an improved time domain method for simulating
the signal transmission along optical multimode interconnections.
The improvements mainly result from the
more efficient method for the piecewise approximation of
the waveguides step responses by a few exponential functions.
The adapted semi-analytical recursive convolution
method decreases the computation times.
-
A New Crosstalk Noise Model for DOMINO Logic Circuits [p. 1112]
-
S. Choi and K. Roy
A new crosstalk noise model is proposed for DOMINO
logic gates. Our noise model takes the effect of keeper into
account and provides more accurate noise measure.
-
Modeling Noise Transfer Characteristic of Dynamic Logic Gates [p. 1114]
-
L. Ding and P. Mazumder
Dynamic noise analysis is recently gaining more attention
as a definitive method to overcome glaring deficiencies of
static noise analysis. Exact dynamic noise analysis requires
modeling of both injected noise and propagated noise. In this
paper, we have developed a strategy to study the noise propagation
problem. An efficient analytical formula has been derived
to accurately model the noise waveform transfer characteristic
of dynamic CMOS logic gates. Experiments have
shown that the maximum error in peak propagated noises of
the proposed model is less than 10%.
-
Heterogeneous Programmable Logic Block Architectures [p. 1118]
-
A. Koorapaty, V. Chandra, K. Tong, C. Patel, L. Pileggi, and H. Schmit
In this poster, we propose four new heterogeneous programmable
logic blocks (PLBs) consisting of a combination
of various sizes of look up tables (LUTs), multiplexers
(MUXes), and logic gates. We demonstrate that these PLBs
offer significant performance and density benefits over more
homogeneous PLBs.
-
An Industrial/Academic Configurable System-On-Chip Project (CSoC):
Coarse-Grain XPP-/Leon-Based Architecture Integration [p. 1120]
-
J. Becker, A. Thomas, M. Vorbach, and V. Baumgarte
This paper describes the actual status and results of a dynamically
Configurable System-on-Chip (CSoc) integration, consisting of a
SPARC-compatible LEON processor-core, a commercial coarse-grain XPP-array
of suitable size from PACT Informationstechnologie AG, and application-tailored
global/local memory topology with efficient Amba-based communication
interfaces. The given adaptive architecture is synthesized within an
industrial/academic SoC project onto 0.18 and 0.13 um UMC CMOS
technologies at Universitaet Karlsruhe (TH). Due to exponential increasing
CMOS mask costs, essential aspects for the industry are now adaptivity of
SoCs, which can be realized by integrating reconfigurable re-usable hardware
parts on different granularities into Configurable Systems-on-Chip (CSoCs)[1].
-
Development of a Tool-Set for Remote and Partial Reconfiguration of FPGAs [p. 1122]
-
F. Moraes, D. Mesquita, J. Palma, L. Möller, and N. Calazans
This work describes the implementation of digital
reconfigurable systems (DRS) using commercial FPGA
devices. The main goal is to present a set of tools for
remote and partial reconfiguration developed for the
Virtex FPGA family. Even though the tools are targeted to
a specific device, their building principles may easily be
adapted to other FPGA families, if they have an internal
architecture enabling partial reconfiguration. The main
contribution of the paper is the tool-set proposed to
manipulate cores using partial reconfiguration in existing
FPGAs.
-
Mapping Applications to an FPFA Tile [p. 1124]
-
M. Rosien, Y. Guo, G. Smit, and T. Krol
This paper introduces a transformational design
method which can be used to map code written in a
high level source language, like C, to a coarse grain reconfigurable
architecture. The source code is first translated into a
Control Dataflow graph (CDFG), which is minimized using
a set of behaviour preserving transformations such as dependency
analysis, common subexpression elimination, etc.
After applying graph clustering, scheduling and allocation
transformations on this minimized graph, it can be mapped
onto the target architecture.
-
Load Distribution with the Proximity Congestion Awareness in a Network On Chip [p. 1126]
-
E. Nilsson, M. Millberg, J. Öberg, and A. Jantsch
In Networks on Chip, NoC, very low cost and high performance
switches will be of critical importance. For a
regular two-dimensional NoC we propose a very simple,
memoryless switch. In case of congestion, packets are emitted
in a non-ideal direction, also called deflective routing.
To increase the maximum tolerable load of the network,
we propose a Proximity Congestion Awareness, PCA, technique,
where switches use load information of neighbouring
switches, called stress values, for their own switching decisions,
thus avoiding congested areas. We present simulation
results with random traffic which show that the PCA technique
can increase the maximum traffic load by a factor of
over 20.
-
Micro-Network for SoC: Implementation of a 32-Port SPIN Network [p. 1128]
-
A. Andriahantenaina and A. Greiner
We present a physical implementation of a 32-ports
SPIN micro-network. For a 0.13 micron CMOS process, the
total area is 4.6 mm2, for a cumulated bandwidth of about
100 Gbits/s. In a 6 metal process, all the routing wires can
be routed on top of the switching components. The SPIN32
macro-cell will be fabricated by ST Microelectronics, but
this macrocell uses symbolic layout, and can be manufactured
with any CMOS process including 6 metal layers.
-
A Fully Self-Timed Bit-Serial Pipeline-Architecture for Embedded Systems [p. 1130]
-
A. Rettberg, M. Zanella, C. Bobda, and T. Lehmann
Area minimization, low power and high performance
are objectives to be reached in chip design. Bit-serial
architecture offers a great advantage in comparison with
bit-parallel architectures as regards area minimization.
One field of application of such an architecture is e.g.
signal processing in terms of digital filters or digital
controllers. These algorithms may be realized in hardware
by means of the proposed architecture; this requires only
small chip area and an equally small number of input and
outputs pins, thus reducing the size and the complexity of
the printed circuit. The speed of the bit-serial processing
is high enough for the application domain in question.
E.g., in an electrical motor-current control there are the
delays of input/output converters (e.g., A/Ds and D/As)
and those resulting from the inertia of the motor.
In synchronous design, the performance of these
architectures is affected by the long lines which are used
to control the operators and the gated clocks. The
architecture described in this contribution avoid a long
control lines by a local distribution of the control circuitry
on the operator level. To our knowledge, this is the second
paper detailing the implementation of a fully interlocked
synchronous architecture after that described in [1].
-
Library Functions Timing Characterization for Source-Level Analysis [p. 1132]
-
C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto
Execution time estimation of software at source-level
is nowadays a crucial phase of the system design flow,
especially for portable devices and real-time systems.
From a source-level perspective, a call to an external
library function is a black-box: only the binary code
of such functions is, in fact, available. This paper
proposes a methodology for library functions analysis
within a source-level estimation framework.
-
G-MAC: An Application-Specific MAC/Co-Processor Synthesizer [p. 1134]
-
A. Chang, W. Kuo, T. Hwang, and A. Wu
A modern special-purpose processor (e.g., for image and
graphical applications) usually contains a set of instructions
supporting complex multiply-operations. These instructions
perform a variety of multiply-operations with various data bitwidths
and concurrent-execution requirements. For instance,
such an instruction set may include instructions to perform
signed/unsigned 32X32, signed/unsigned dual 16X16,
signed/unsigned 8X8 MAC, and etc. Typically, a co-processor
or a complex MAC (Multiplier-ACcumulator) unit is required
to execute those instructions.
Developing such a complex MAC/co-processor
involves a series of design tasks including micro-architecture
design, component allocation/binding, interconnect binding,
pipeline insertion and control generation. This design process is
non-trivial, time-consuming and error-prone, which is usually
performed by experienced design engineers. In this paper, we
present a synthesis method for application-specific MAC/coprocessor
generation.
The MAC/co-processor synthesis problem is defined
as: Given a set of instructions and the number of execution
cycles for each instruction, generate a MAC/co-processor
design (including a data-path and a control unit) such that the
total area-cost is minimized subject to the given execution cycle
constraints.
-
Power Constrained High-Level Synthesis of Battery Powered Systems [p. 1136]
-
S. Nielsen and J. Madsen
We present a high-level synthesis algorithm solving the
combined scheduling, allocation and binding problem minimizing
area under both latency and maximum power per
clock-cycle constraints. Our approach eliminates the large
power spikes, resulting in an increased battery lifetime, a
property of outmost importance for battery powered embedded
systems. Our approach extends the partial-clique
partitioning algorithm of [3] by introducing power awareness
through a heuristic algorithm which bounds the design
space to those of power feasible schedules. We have applied
our algorithm on a set of dataflow graphs and investigated
the impact on circuit area when applying different power
constraints.
-
PARLAK: Parametrized Lock Cache Generator [p. 1138]
-
B. Akgul and V. Mooney
A system-on-chip lock cache (SoCLC) is an intellectual property
(IP) core that provides effective lock synchronization in
a heterogeneous multiprocessor shared-memory system-on-a-chip
(SoC). We present PARLAK, a parameterized lock cache
generator tool. PARLAK generates a synthesizable SoCLC architecture
with a user specified number of lock variables and
user specified number and type(s) of processor(s). PARLAK
can generate a full range of customized SoCLCs, from a version
for two processors with 32 lock variables occupying 1,790
gates of area to a version for 14 processors with 256 lock variables
occupying 37,380 gates of area (in TSMC 0.25u technology).
PARLAK is an important contribution to IP-generator
tools for both custom and reconfigurable SoC designs.
-
A Secure Web-Based Framework for Electronic System Level Design [p. 1140]
-
T. Kazmierski and X. Yang
This contribution presents the concept of a secure implementation
of a distributed, web-based electronic design
framework. Our two-tier client-webserver-toolserver architecture
has been extended to support permanent databases
for collaborative, distributed design development. In the
sample application of the framework, developed in Java,
any of the servers can be based on Linux, MS Windows or
Sun-SPARC Java-servlet enabled server. The feasibility of
a secure, web-based design framework has been sufficiently
proved. The technological approach used to do this involves
the use of novel, yet tried and tested methods, taking where
appropriate from the rapidly advancing field of e-commerce
solutions.
-
Background Data Organisation for the Low-Power Implementation in
Real-Time of a Digital Audio Broadcast Receiver on a SIMD Processor [p. 1144]
-
P. Op de Beeck, C. Ghez, E. Brockmeyer, M. Miranda, F. Catthoor, and G. Deconinck
In this work we illustrates the strong interaction between
the data organisation in background memory and the data
format required for sub-word level acceleration. The impact
of such interaction is demonstrated on the implementation
of a Digital Audio Broadcast Channel Decoder on a TriMedia
processor, where data format transformations applied
on the background memory data enable a substantially better
exploitation of the available Single Instruction Multiple
Data instructions. As a result, a factor two reduction for
both execution time and data memory energy is achieved.
-
Compiler Support for Reducing Leakage Energy Consumption [p. 1146]
-
W. Zhang, M. Kandemir, N. Vijaykrishnan, M. Irwin, and V. De
Current trends indicate that leakage energy consumption will be an important
concern in upcoming process technologies. In this paper, we propose
a compiler-based leakage energy optimization strategy. Our strategy
is built upon a data-flow analysis that identifies basic blocks that
do not use a given functional unit. Based on this information, the compiler
then inserts activate/deactivate instructions in the code to set/reset
a sleep signal which controls leakage current for functional units. Our
experimental results show that the proposed compiler-based strategy is
very effective in reducing leakage energy of functional units.
-
An Analytical Model for Predicting the Remaining Battery Capacity of Lithium-Ion Batteries [p. 1148]
-
P. Rong and M. Pedram
Predicting the residual energy of the battery source that powers a portable
electronic device is quite important in designing and employing an effective
dynamic power management policy in the device. This paper presents a
closed-form analytical model for predicting the remaining capacity of
a lithium-ion battery. The proposed high-level model relies on online
current and voltage measurements and, at the same time, correctly accounts
for the temperature and cycle aging effects. The accuracy of the high-level
model is validated by comparing our analytical model with the Dualfoil
simulation results (under some simplifying assumptions), demonstrating
5% error between simulated and predicted data.
-
Simultaneous Dynamic Voltage Scaling of Processors and Communication Links in
Real-Time Distributed Embedded Systems [p. 1150]
-
J. Luo, N. Jha, and L. Peh
Dynamic voltage scaling has been widely acknowledged as a powerful
technique for trading off power consumption and delay for processors.
Recently, variable-frequency (and variable-voltage) parallel and serial
links have also been proposed, which can save link power consumption
by exploiting variations in bandwidth requirement. In this paper, we
address joint dynamic voltage scaling for variable-voltage processors and
communication links in such systems. We propose a scheduling algorithm
for real-time applications, with both data flow and control flow
information captured. It performs efficient routing of communication
events through multi-hops, as well as efficient slack allocation among
heterogeneous processors and communication links to maximize energy
savings, while meeting all real-time constraints.
-
Decomposition of Extended Finite State Machine for Low Power Design [p. 1152]
-
M. Lee, T. Hwang, and S. Huang
Power reduction can be achieved by turning off portions of circuits that
are idle. Unlike previous work which focused only on either controller or
data-path, we propose a decomposition technique that takes both controller and
data-path into consideration where, in the former, the state probability
of an FSM provides the execution frequency of operations in that state.
And, in the latter, operations performed in states provide the information
of resource requirements which can be used to determine the resource sharing
among states. The goal is to synthesize circuits such that the sub-machine
with small area (data-path and controller) will be turned on most of the
time (high state probability) and all other parts are turned off. Our
experimental results show that on the average, 10% area reduction and
24% power reduction can be achieved as compared with designs without
decomposition.
-
Equisolvability of Series vs. Controller's Topology in Synchronous Language Equations [p. 1154]
-
N. Yevtushenko, T. Villa, R. Brayton, A. Sangiovanni-Vincentelli, and A. Petrenko
Given a plant MA and a specification MC,
the largest solution of the FSM equation MX. MA
<=MC contains all possible discrete controllers
MX. Often
we are interested in computing the complete solutions
whose composition with the plant is exactly equivalent
to the specification. Not every solution contained in the
largest one satisfies such property, that holds instead
for the complete solutions of the series topology. We
study the relation between the solvability of an equation
for the series topology and of the corresponding
equation for the controller's topology. We establish that,
if MA is a deterministic FSM, then the FSM equation
MX.MA <=MC
is solvable for the series topology
with an unknown head component iff it is solvable for
the controller's topology. Our proof is constructive, i.e.,
for a given solution MB
of the series topology it shows
how to build a solution MD
of the controller's topology and viceversa.
-
Using RTL Statespace Information and State Encoding for Induction Based Property Checking [p. 1156]
-
M. Wedler, D. Stoffel, and W. Kunz
This paper focuses on checking safety properties for sequential
circuits specified on the RT-level. We study how
different state encodings can be used to create a gate-level
representation of the circuit that facilitates the computation
of effective invariants for induction-based property checking.
Our experiments show the strong impact of state encoding
on the efficiency of the induction process.
-
Combining Simulation and Guided Traversal for the Verification of Concurrent Systems [p. 1158]
-
E. Pastor and M. Peña
We present a hybrid methodology that combines simulation
and symbolic traversal in order to improve invariant
checking. The methodology concentrates on concurrent
systems, whose peculiarities are not fully exploited by other
existing techniques for hybrid verification. Our approach
exploits the information obtained from simulations to improve
the knowledge of the state space, effectively guiding
symbolic traversal.
-
Selectively Clocked CMOS Logic Style for Low-Power Noise-Immune Operations in
Scaled Technologies [p. 1160]
-
N. Sirisantana and K. Roy
This paper proposes Selectively Clocked Logic (SCL) style based on skewed logic for noise-tolerant low-power high-performance applications. Variations of the logic style with multiple threshold voltage (MVth-SCL) and multiple oxide thickness
(Mtox-SCL) techniques are also studied. Simulation
results indicate that SCL, MVth-SCL, and
Mtox-SCL circuits reduce the total power consumption
(leakage plus switching power) of the ISCAS benchmark circuits by
51.5%, 53.1%, and 69.6%, respectively, with over 25% improvement
in noise immunity compared to Domino circuits with comparable
performance.
-
Self-Testing Embedded Checkers for Bose-Lin, Bose, and a Class of Borden Codes [p. 1162]
-
S. Tarnick
A new approach for designing t-UED and
BUED code checkers is presented. In particular we consider
Borden codes for t = 2k - 1, Bose and Bose-Lin
codes. The design technique for all three checker types follows
the same principle, which is mainly based on averaging
weights and check symbol values of the code words.
The checkers are very well suited for use as embedded
checkers since they are self-testing with respect to single
stuck-at faults under very weak assumptions. All three
checker types can be tested by 2 or 3 code words.
-
Non-Intrusive Concurrent Error Detection in FSMs through State/Output Compaction and
Monitoring via Parity Trees [p. 1164]
-
P. Drineas and Y. Makris
We discuss a non-intrusive methodology for
concurrent error detection in FSMs. The proposed method is
based on compaction and monitoring of the state/output bits
of an FSM via parity trees. While errors may affect more than
one state/output bit, not all combinations of state/output bits
constitute potential erroneous cases for a given fault model.
Therefore, it is possible to compact them without loss of error
information. Thus, concurrent error detection is performed
through hardware that predicts the values of the compacted
state/output bits and compares them to the actual values of
the FSM. In order to minimize the incurred hardware overhead,
a randomized algorithm is proposed for selecting the
minimum number of required parity functions.
-
SAT-Based Techniques in System Synthesis [p. 1168]
-
C. Haubelt, J. Teich, R. Feldmann, and B. Monien
In this paper, we show how to integrate SAT-based techniques
into the task of system synthesis by regarding the the
problems: (i) feasibility check and (ii) evaluation of quality.
-
Refinement of Mixed-Signal Systems with SystemC [p. 1170]
-
C. Grimm, C. Meise, W. Heupke, and K. Waldschmidt
This paper gives an overview of a design methodology
specific extension library for SystemC. The supported design
methodology successively refines an executable specification
to a concrete mixed-signal architecture. A first prototype,
the ASC library, has been implemented and evaluated.
-
Polychrony for Refinement-Based Design [p. 1172]
-
J. Talpin, P. Le Guernic, S. Shukla, R. Gupta, and F. Doucet
Rising complexities and performances of integrated circuits and systems,
shortening time-to-market demands for electronic equipments, growing installed
bases of intellectual property, requirements for adapting existing IPs with
new services, all stress high-level design as a prominents research topics
and call for the development of appropriate methodological solutions. In this
aim, system design based on the so-called "synchronous hypothesis" consists of
abstracting the non-functional implementation details of a system away and
let one benefit from a focused reasoning on the logics behind the instants
at which the system functionalities should be secured. From this point of
view, synchronous design models and languages provide intuitive models for
integrated circuits. this affinity explains the ease of generating synchronous
circuits and verify their functionalities using compilers and related tools
that implement this approach. In the relational model of the SIGNAL/POLYCHRONY
design language/plateform [3.5] this affinity goes beyond the domain of
purely synchronous circuits to embrace the context of architectures consisting
of synchronous circuits and desynchronization protocols: GALS architectures.
The unique features of this model are to provide the notion of polychrony:
the capability to describe multi-clocked (or partially clocked) circuits and
systems; and to support formal design refinements, from the early stages of
requirements specification, to the later stages of synthesis and deployment,
and by using formal verification techniques.
-
Automatic Generation of Simulation Monitors from Quantitative Constraint Formula [p. 1174]
-
X. Chen, H. Hsieh, F. Balarin, and Y. Watanabe
System design methodology is poised to become the next
big enabler for highly sophisticated electronic products.
Design verification continues to be a major challenge and
simulation will remain an important tool for making sure
that implementations perform as they should. In this paper
we present algorithms to automatically generate C++
checkers from any formula written in the formal quantitative
constraint language, Logic Of Constraints (LOC).
The executable can then be used to analyze the simulation
traces for constraint violation and output debugging information.
Different checkers can be generated for fast analysis
under different memory limitations. LOC is particularly
suitable for specification of system level quantitative constraints
where relative coordination of instances of events,
not lower level interaction, is of paramount concern. We illustrate
the usefulness and efficiency of our automatic trace
analysis methodology with case studies on large simulation
traces from various system level designs.
-
Consequences of RAM Bitline Twisting for Test Coverage [p. 1176]
-
I. Schanstra and J. van de Goor
In order to reduce coupling effects between bitlines
in static or dynamic RAMs, bitline twisting can be
used in the design. For testing, however, this has consequences
for the to-be-used data backgrounds. A generic
twisting scheme is introduced and the involved fault models
are identified.
-
An Approach to the Classification of Mixed-Signal Circuits in a Pseudorandom Testing Scheme [p. 1178]
-
F. Corsi, C. Marzocca, and G. Matarrese
BIST methods based on the application of a
pseudorandom pulse sequence as input stimulus have
recently been suggested for testing linear time-invariant
(LTI) analog parts [1-3]. In such methods, the input-output
cross-correlation function Rxy(t) gives a good
estimation of
the impulse response of the DUT, h(t), provided the autocorrelation
of the pseudorandom input sequence
approaches a single Dirac'fs pulse Delta(t) [4]. Much work has
been devoted to the investigation of several aspects of the
pseudorandom testing techniques, such as the impact of
the finite length of the input pulse sequence, the duration
of the single pulse, the ADC resolution, etc [1-3]. In
particular, ADC's with low resolution can be employed to
sample the DUT response as the averaging effect
intrinsically performed by the input-output cross-correlation
operation greatly reduces the effects of the
quantization errors [2]. As a consequence, from a BIST
point of view, the test of very fast DUT is possible, since
ADC's with low accuracy and fast sampling rate are easily
available on chip.
In this paper we focus on the issues related to the
choice of a suitable set of samples of the cross-correlation
function Rxy(mi), i=1,..,n, as DUT signature, and, at the
same time, to the definition of an effective classification
procedure [5-6].
In particular, we make the choice of the DUT signature
on the basis of the sensitivities of the cross-correlation
samples to the circuit specifications sj, expressed in terms
of the partial derivatives partij=partRxy
(mi)/partsj. This sensitivity
study is carried out by considering a large set RTS of good
instances of the circuit generated in simulation by
independently varying the performance parameters of an
high level description of the DUT within their acceptance
ranges (±5%). The same set of good circuits is used to
extract the initial acceptance ranges [Rxymin(mi),
Rxymax(mi)] of the samples
Rxy(mi) selected as signature
components, and thus it constitutes the starting point of the
subsequent classification procedure.
-
Test Generation for Acyclic Sequential Circuits with Single Stuck-At Fault Combinational ATPG [p. 1180]
-
H. Ichihara and T. Inoue
A test generation method with time-expansion model can
achieve high fault efficiency for acyclic sequential circuits[1,
2]. While the model is combinational circuit, a single stuck at
fault in an original circuit is represented by a multiple
one in this model. This paper proposes a test generation
method for acyclic sequential circuits with a circuit model,
called MS-model, which can express multiple stuck-at faults
in time-expansion model as single stuck-at faults. Our procedure
can generate test sequences for acyclic sequential circuits
with just combinational test pattern generation algorithm
for single stuck-at faults.
-
Comparison of Test Pattern Decompression Techniques [p. 1182]
-
O. Novák
Test pattern decompression techniques are bounded
with the algorithm of test pattern ordering and test data
flow controlling. Some of the methods could have more
sophisticated sorting algorithm, some of the methods
may be supported by fault simulation and some can work
without any information about the quality of previously
generated test patterns. The differences between the
decompression techniques cause that the efficiency of
the automata is not comparable and the published
results do not give us any idea of the best choice of
decompressing automaton. We have performed
experiments, in which we have studied the degradation
of random input stimuli by the decompression automata
(DA) used by different authors. We have found that a
medium percentage of the (n,r) exhaustive test set
created on the DA outputs after stimulating the DA
inputs with a given number of random pattern is
correlated with the DA efficiency. This statement has
been verified by experiments with ISCAS circuits.
-
Evolutionary Optimization of Markov Sources for Pseudo Random Scan BIST [p. 1184]
-
I. Polian, B. Becker, and S. Reddy
Recent work [1] showed that Markov sources lead to scan
BIST designs of lower cost compared to earlier proposed
methods in scan BIST. However the method presented in [1]
utilizes tests generated using a deterministic test generator
for target faults in synthesizing the Markov source to generate
the tests. The requirement of a deterministic test generator
may hinder the use of this procedure in industrial settings
since the BIST tool must also include a deterministic ATPG
tool that may add to the cost of the BIST tool.
In this paper we investigate a procedure to synthesize
BIST controllers with Markov sources for test generation using
Evolutionary Algorithms (EAs). This allows us to avoid
using the deterministic ATPG needed in [1]. Additionally we
do not employ inversion logic used in [1] thereby potentially
reducing the hardware in the BIST controller. Nevertheless,
the proposed method achieves close to 100% fault efficiency
using far fewer tests than required by pseudo random tests.
In this work, similar to [1], we employ Markov sources
based on finite state machines (FSMs) with transitions controlled
by additional inputs. Hence, transition probabilities
(TPs) of the FSM are determined by the 1-probability on a
corresponding input.
-
Test Data Compression Based on Output Dependence [p. 1186]
-
I. Pomeranz and S. Reddy
In a large circuit it is common to find that an output of the circuit
depends structurally on a proper subset of the circuit inputs. We
use this observation to provide test data compression. The proposed
approach can be used in addition to test data compression
techniques based on encoding.
Structural output dependence is utilized in this work to
provide data compression as follows. Consider a test set T that
needs to be applied to the circuit. Suppose that the circuit inputs
can be divided into subsets A0,A1, . . . ,
Am-1 such that for every
circuit output bj there exists an input subset Ak
satisfying the condition that bj depends only on inputs
in Ak . For simplicity,
we also assume that all the subsets Ak contain the same number
of inputs, denoted by NA (later we will ensure that this condition
is satisfied). The value of NA is determined by the output that
depends on the maximum number of inputs. Let the number of
circuit inputs be NC. If NA << NC, it is
possible to use the input subsets in order to reduce the amount of input
test data by loading into the circuit patterns of length NA instead of loading
patterns of length NC.
-
A Unified Approach for SOC Testing Using Test Data Compression and TAM Optimization [p. [p. 1188]
-
V. Iyengar, A. Chandra, S. Schweizer, and K. Chakrabarty
We integrate for the first time test access mechanism (TAM) optimization
and test data compression into a single test methodology.
We show how an integrated test architecture based on TAMs and test
data decoders can be designed. The proposed approach offers considerable
savings in test data volume and testing time. Two case studies
using the integrated test architecture are presented.
|